nf-core/circdna   
 Pipeline for the identification of extrachromosomal circular DNA (ecDNA) from Circle-seq, WGS, and ATAC-seq data that were generated from cancer and other eukaryotic cells.
This documentation is for version 1.0.3dev-alpha. The latest stable release is 1.1.0.
  Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
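As a quick orientation, the following minimal sketch checks which of the top-level directories named in this document exist under a results directory. The function name `find_output_dirs` and the directory list are assumptions collected from the sections of this document; which directories are actually present depends on the branches that were run.

```python
from pathlib import Path

# Top-level output directories named in this document (assumed list;
# only those belonging to the branches you ran will exist).
DOCUMENTED_DIRS = [
    "fastqc", "trimgalore", "markduplicates", "multiqc",
    "circlefinder", "circexplorer2", "circlemap",
    "unicycler", "ampliconarchitect", "pipeline_info",
]


def find_output_dirs(results_dir):
    """Map each documented directory name to True if it exists."""
    root = Path(results_dir)
    return {name: (root / name).is_dir() for name in DOCUMENTED_DIRS}


if __name__ == "__main__":
    # "results" is a placeholder for your own results directory.
    for name, present in find_output_dirs("results").items():
        print(f"{name}: {'found' if present else 'missing'}")
```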
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - Raw read QC
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
- TrimGalore - Read Trimming
- BWA - Read mapping to reference genome
- Samtools - Sorting, indexing, filtering & stats generation of BAM file
- Circle-Map Realign - Identifies putative circular DNA junctions
- Circle-Map Repeats - Identifies putative repetitive circular DNA
- CIRCexplorer2 - Identifies putative circular DNA junctions
- Circle_finder - Identifies putative circular DNA junctions
- AmpliconArchitect - Reconstruct the structure of focally amplified regions
- Unicycler - De novo assembly of circular DNAs
General Tools
FastQC
Output files
- fastqc/
  - *_fastqc.html: FastQC report containing quality metrics.
  - *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
 
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequences and regions of low quality.
TrimGalore
Output files
- trimgalore/
  - *_trimming_report.txt: TrimGalore trimming report.
  - fastqc/*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
  - fastqc/*_fastqc.html: FastQC report containing quality metrics.
 
TrimGalore combines the trimming tool Cutadapt, which removes adapter sequences, primers and other unwanted sequences, with the quality control tool FastQC.
BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome.
The resulting alignment files are intermediate and are not kept in the final files delivered to users.
Output files
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam: Alignment file containing information about the read alignment to the reference genome
 
Samtools
samtools stats
samtools stats collects statistics from BAM files and outputs them in text format.
Output files
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam.samtools.stats.out: Raw statistics used by MultiQC
 
Plots will show:
- Alignment metrics.
For further reading and documentation see the samtools manual.
Mark Duplicates
GATK MarkDuplicates
By default, circdna will use GATK MarkDuplicates, which locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
Output files
Output directory: results/markduplicates/bam
- [SAMPLE].md.bam and [SAMPLE].md.bai: BAM file and index
 
For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.
Samtools view - Duplicates Filtering
By default, circdna removes all duplicates marked by GATK MarkDuplicates using samtools view.
Output files
Output directory: results/markduplicates/duplicates_removed
- [SAMPLE].md.filtered.sorted.bam and [SAMPLE].md.filtered.sorted.bai: BAM file and index
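To illustrate what duplicate filtering means at the record level, here is a minimal sketch that drops records whose SAM FLAG has the duplicate bit (0x400) set, which is the bit that duplicate-marking tools set. It works on SAM text lines in pure Python; the function names are hypothetical, and the exact samtools view options used by the pipeline are not stated in this document (on the command line, such a filter is commonly written as `samtools view -F 1024`).

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit meaning "PCR or optical duplicate"


def is_duplicate(sam_line):
    """Return True if the SAM record has the duplicate bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & DUPLICATE_FLAG)  # column 2 is FLAG


def drop_duplicates(sam_lines):
    """Yield header lines and all records not marked as duplicates."""
    for line in sam_lines:
        if line.startswith("@") or not is_duplicate(line):
            yield line
```

Header lines (starting with `@`) are passed through unchanged so that the filtered output remains a valid SAM stream.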
 
MultiQC
Output files
- multiqc/
  - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
  - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
  - multiqc_plots/: directory containing static images from the report in various formats.
 
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
circdna branches
Branch: circle_finder
Circle_finder
Output files
Output directory: results/circlefinder/
- [SAMPLE].microDNA-JT.txt: BED file containing information about putative circular DNA regions
 
Circle_finder identifies putative circular DNA junctions from paired-end sequencing data. Circle_finder uses split and discordant read information to identify junctions that could be generated through the formation of ecDNAs. For more information please see Circle_finder.
Branch: circexplorer2
CIRCexplorer2
CIRCexplorer2 identifies putative circular DNA junctions from paired-end sequencing data. CIRCexplorer2 was developed to identify circular RNAs from RNA-seq data. However, it can also be used to call putative circular DNAs from genomic data. For more information see the CIRCexplorer2 docs.
Output files
Output directory: results/circexplorer2/
- [SAMPLE].circexplorer_circdna.bed: BED file containing information about putative circular DNA regions
 
- [SAMPLE].CIRCexplorer2_parse.log: log file
 
Branch: circle_map_realign
circle_map_realign uses the functionality of Circle-Map to call putative circular DNAs from mappable regions. To identify circular DNAs, it uses information about split and discordant reads, followed by realignment steps to identify the exact breakpoint of the circular DNA. For more information, please see the original paper or the GitHub page.
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
 
Circle-Map Realign
Circle-Map Realign detects putative circular DNA junctions from read candidates extracted by Circle-Map Readextractor.
Output files
Output directory: results/circlemap/realign
- [SAMPLE]_circularDNA_coordinates.bed: BED file containing information about putative circular DNA regions
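As a minimal sketch of how such a BED file could be inspected, the code below reads only the first three columns (chromosome, start, end), which are the standard BED coordinate fields, and derives the length of each putative circle. Any further columns written by the caller are ignored because their meaning is not described in this document; the function names are hypothetical.

```python
def read_bed_intervals(lines):
    """Parse (chrom, start, end) from tab-separated BED lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def interval_lengths(intervals):
    """BED intervals are 0-based and half-open, so length is end - start."""
    return [end - start for _, start, end in intervals]
```

In practice the lines would come from an open file handle, for example `read_bed_intervals(open(path))` with `path` pointing to one of the BED outputs.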
 
Branch: circle_map_repeats
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
 
Circle-Map Repeats
Circle-Map Repeats identifies the chromosomal coordinates of repetitive circular DNAs.
Output files
Output directory: results/circlemap/repeats
- [SAMPLE]_circularDNA_repeats_coordinates.bed: BED file containing information about repetitive circular DNAs
 
Branch: unicycler
This branch uses Unicycler to de novo assemble circular DNAs and the long-read mapping capabilities of Minimap2 to identify the origin of the circular DNAs.
Unicycler
Unicycler was originally built as an assembly pipeline for bacterial genomes. In nf-core/circdna it is used to de novo assemble circular DNAs.
Output files
Output directory: results/unicycler/
- [SAMPLE].assembly.gfa.gz: GFA file containing the de novo assembled sequences
- [SAMPLE].assembly.scaffolds.fa.gz: FASTA file containing the de novo assembled sequences, with information on whether each sequence forms a circular contig, i.e. originated from a circular DNA.
 
Minimap2
Minimap2 takes the circular DNA sequences identified by Unicycler and maps them to the given reference genome.
Output files
Output directory: results/unicycler/minimap2
- [SAMPLE].paf: PAF file containing mapping information of circular DNA sequences
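The PAF format has twelve mandatory tab-separated columns, so the mapping information can be read with a few lines of code. The sketch below parses one PAF line into a named record; the class and function names are hypothetical, and optional tag columns after the twelfth field are ignored.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int        # number of matching residues
    block_length: int   # alignment block length
    mapq: int           # mapping quality


def parse_paf_line(line):
    """Parse the mandatory columns of a single PAF line."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )
```

With a parsed record, the reference region that an assembled sequence maps to is given by `target_name`, `target_start` and `target_end`.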
 
Branch: ampliconarchitect
The ampliconarchitect branch can only be used with WGS data. It uses PrepareAA to collect amplified seeds from copy number calls, which are then fed to AmpliconArchitect to characterise the amplicons in each sample.
CNVkit
CNVkit uses alignment information to make copy number calls. These copy number calls are used by AmpliconArchitect to identify circular and other types of amplicons. The copy number calls are then connected to seeds and filtered by a copy number threshold using scripts provided by PrepareAA.
Output files
Output directory: results/ampliconarchitect/cnvkit
- [SAMPLE]_CNV_GAIN.bed: BED file containing filtered copy number calls
- [SAMPLE]_AA_CNV_SEEDS.bed: BED file containing filtered and connected amplified regions (seeds). This is used as input for AmpliconArchitect
- [SAMPLE].cnvkit.segment.cns: CNS file containing the copy number calls of CNVkit segment.
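The idea of filtering copy number calls by a threshold can be sketched as follows. This is not the PrepareAA script itself: it assumes the CNS file is tab-separated with a header containing `chromosome`, `start`, `end` and `log2` columns, converts the log2 ratio to an absolute copy number under an assumed diploid baseline, and uses a caller-supplied, hypothetical threshold `min_copy_number`.

```python
import csv


def high_copy_segments(cns_lines, min_copy_number):
    """Return segments whose estimated copy number meets the threshold.

    Assumes a header line with 'chromosome', 'start', 'end' and 'log2'
    columns and a diploid baseline (copy number = 2 * 2**log2).
    """
    reader = csv.DictReader(cns_lines, delimiter="\t")
    kept = []
    for row in reader:
        copy_number = 2 * 2 ** float(row["log2"])
        if copy_number >= min_copy_number:
            kept.append(
                (row["chromosome"], int(row["start"]), int(row["end"]), copy_number)
            )
    return kept
```

The threshold is left as a parameter because the value applied by the pipeline is not given in this document.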
 
AmpliconArchitect
AmpliconArchitect uses amplicon seeds provided by CNVkit and PrepareAA to identify different types of amplicons in each sample.
Output files
Output directory: results/ampliconarchitect/ampliconarchitect
- amplicons/[SAMPLE]_[AMPLICONID]_cycles.txt: text file describing the amplicon segments
- amplicons/[SAMPLE]_[AMPLICONID]_graph.txt: text file describing the amplicon graph
- cnseg/[SAMPLE]_[SEGMENT]_graph.txt: text file describing the copy number segmentation file
- summary/[SAMPLE]_summary.txt: text file describing each amplicon with regard to breakpoints, composition, oncogene content and copy number
- sv_view/[SAMPLE]_[AMPLICONID].{png,pdf}: PNG or PDF file displaying the amplicon rearrangement signature
 
AmpliconClassifier
AmpliconClassifier classifies each amplicon by using the cycles and the graph files generated by AmpliconArchitect.
Output files
Output directory: results/ampliconarchitect/ampliconclassifier
- input/[SAMPLE].AmpliconClassifier.input: text file containing the input used for AmpliconClassifier and AmpliconSimilarity.
- classification/[SAMPLE]_amplicon_classification_profiles.tsv: TSV file describing the amplicon class of each amplicon in the sample.
- ecDNA_counts/[SAMPLE]_ecDNA_counts.tsv: TSV file describing whether an amplicon is circular [1 = circular, 0 = non-circular].
- gene_list/[SAMPLE]_gene_list.tsv: TSV file detailing the genes on each amplicon.
- log/[SAMPLE].classifier_stdout.log: log file
- similarity/[SAMPLE]_similarity_scores.tsv: TSV file containing amplicon similarity scores calculated by AmpliconSimilarity.
- bed/[SAMPLE]_amplicon[AMPLICONID]_[CLASSIFICATION]_[ID]_intervals.bed: BED files containing information about the intervals on each amplicon. Intervals labelled unknown were not identified as being located on the respective amplicon.
 
AmpliconArchitect Summary
The Summary script merges the output of AmpliconArchitect and AmpliconClassifier to give full information about each amplicon in a sample. Please refer to the AmpliconClassifier output for more accurate ecDNA interval calling. Some intervals classified in the AmpliconArchitect and Summary output are not located on ecDNAs.
Output files
Output directory: results/ampliconarchitect/summary
- [SAMPLE].aa_results_summary.tsv: TSV file containing the merged results.
 
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
 
Nextflow provides functionality for generating various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors in a pipeline run and also provide other information such as launch commands, run times and resource usage.
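For troubleshooting, the execution trace can also be summarised programmatically. The minimal sketch below assumes execution_trace.txt is tab-separated with a header line that includes `name` and `status` columns, as in Nextflow's default trace output; the function names are hypothetical.

```python
import csv
from collections import Counter


def count_task_status(trace_lines):
    """Count tasks per status value (e.g. COMPLETED, FAILED, CACHED)."""
    reader = csv.DictReader(trace_lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


def failed_tasks(trace_lines):
    """Return the names of tasks whose status is FAILED."""
    reader = csv.DictReader(trace_lines, delimiter="\t")
    return [row["name"] for row in reader if row["status"] == "FAILED"]
```

Both functions accept any iterable of lines, so an open file handle on pipeline_info/execution_trace.txt can be passed directly.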