Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the report, which summarizes results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Query identification - obtaining basic information on the query
- Ortholog fetching - obtaining ortholog predictions from public databases
- Sequence fetching - obtaining ortholog sequences from public databases
- ID merging - merging identifiers based on their sequence
- Ortholog scoring - creation of a score table
- Ortholog filtering - selection of final ortholog list
- Ortholog plotting - creation of plots describing the predictions
- Ortholog statistics - calculation of several statistics about the predictions
- Report generation - creation of a human-readable report
- Pipeline information - basic information about the pipeline run
Query identification
Output files
- seqinfo/
  - *_id.txt: File containing Uniprot identifier of the query or the closest BLAST hit.
  - *_taxid.txt: File containing NCBI taxon ID of the query/closest hit.
  - *_exact.txt: File containing information on whether the query was found in the database (true), or the output is the top BLAST hit (false).
 
Query information necessary for further steps is obtained here. If a sequence was passed, it is identified using OMA. A Uniprot identifier is obtained, along with an indication of whether it was an exact or closest match. For either query type, an NCBI taxon ID is obtained using the OMA API.
Ortholog fetching
Output files
- orthologs/
  - [dbname]/
    - *_[dbname]_group.csv: A CSV file with the hits from the database. It has an additional column necessary for later merging.
 
 
Ortholog predictions are fetched from the databases. Each database can be used locally or online, depending on which of these access modes it supports. The databases currently supported are:
- OMA (online and local)
- PANTHER (online and local)
- OrthoInspector (online)
- EggNOG (local).
Sequence fetching
Output files
- sequences/
  - *_orthologs.fa: A FASTA file containing all ortholog sequences that could be found.
  - *_seq_hits.txt: The list of all orthologs whose sequence was found.
  - *_seq_misses.txt: The list of all orthologs whose sequence was not found.
 
If identifier merging is performed, protein sequences of all orthologs in FASTA format are fetched. The identifiers are split by their format (based on regex) and sequences are fetched from the corresponding databases. Identifiers of unsupported format are automatically registered as misses. The currently supported sequence databases are:
- Uniprot
- RefSeq
- Ensembl
- OMA
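The regex-based splitting described above can be sketched as follows. The patterns are illustrative assumptions, not the ones the pipeline uses: the Uniprot pattern follows the published accession format, while the RefSeq, Ensembl and OMA patterns are simplified guesses at the usual protein identifier shapes. The function name classify_id is hypothetical.

```python
import re

# Illustrative patterns only (assumptions, not the pipeline's own regexes).
ID_PATTERNS = [
    ("uniprot", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("refseq", re.compile(r"[NXYWA]P_\d+(?:\.\d+)?")),
    ("ensembl", re.compile(r"ENS[A-Z]*P\d{11}")),
    ("oma", re.compile(r"[A-Z0-9]{5}\d{5}")),
]

def classify_id(identifier):
    """Return the name of the first database whose pattern matches, else None."""
    for name, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return None  # unsupported format: would be registered as a miss

def split_ids(identifiers):
    """Group identifiers by database; unsupported formats go under 'misses'."""
    groups = {name: [] for name, _ in ID_PATTERNS}
    groups["misses"] = []
    for identifier in identifiers:
        groups[classify_id(identifier) or "misses"].append(identifier)
    return groups

groups = split_ids(["P12345", "NP_000537.3", "ENSP00000269305", "HUMAN04027", "foo"])
```

Each group would then be sent to its own database, and everything under "misses" would end up in the list of orthologs without a sequence.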
Identifier merging
Output files
- merge/
  - *_clusters.tsv: A TSV file containing the clusters found by Diamond (including singletons)
  - *_idmap.tsv: A TSV file containing only non-singleton clusters, used for scoring.
 
In some cases, multiple identifiers might refer to the same sequence. This step uses Diamond to identify such cases and construct a table of possible synonymous IDs. If this table seems incorrect, tuning the merge parameters might be necessary.
Ortholog scoring
Output files
- orthologs/
  - merge_csv/
    - *.csv: A merged CSV file with predictions from all the databases.
  - score_table/
    - *_score_table.csv: A merged CSV with a score column added. The score is the number of databases supporting the prediction.
 
 
At this step, the predictions are combined into a single table. They are also assigned a score which is used for later filtering. The score is the number of supporting sources.
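As a minimal sketch of this scoring rule, the snippet below counts how many sources list each ortholog. The source names, identifiers and the function name score_predictions are made up for illustration; the pipeline itself works on CSV tables rather than Python dictionaries.

```python
from collections import defaultdict

def score_predictions(predictions_by_source):
    """Return {ortholog_id: score}, where score = number of sources listing it."""
    scores = defaultdict(int)
    for ids in predictions_by_source.values():
        for ortholog_id in set(ids):  # a source counts at most once per ortholog
            scores[ortholog_id] += 1
    return dict(scores)

# Hypothetical predictions from three sources.
predictions = {
    "oma": ["P1", "P2", "P3"],
    "panther": ["P2", "P3"],
    "inspector": ["P3", "P4"],
}
scores = score_predictions(predictions)
```

Here P3 is supported by all three sources and gets a score of 3, while P1 and P4 each appear in a single source and get a score of 1.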
Ortholog filtering
Output files
- orthologs/
  - filter_hits/
    - *_minscore_*.txt: Lists of predictions passing different score thresholds, from 1 to the number of sources. For example, BicD2_minscore_2.txt would include orthologs of BicD2 supported by at least 2 sources.
    - *_centroid.txt: A list of predictions from the source with the highest agreement with other sources.
    - *_filtered_hits.txt: The final list of orthologs, chosen based on user-defined criteria.
 
 
In this step, the predictions are split into lists with different minimal scores, indicating each level of support. Additionally, the source with the highest total agreement is found.
The final list of orthologs is determined in one of two ways. If --use_centroid is set, the highest-agreement source will be used. Otherwise, orthologs with a score higher than --min_score are used.
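The two intermediate results of this step can be sketched as below. The threshold lists follow the "at least N sources" reading used for the *_minscore_*.txt files. The agreement measure is an assumption: this document does not define it, so the sketch simply sums the number of predictions a source shares with every other source. All names and data are hypothetical.

```python
def minscore_lists(scores, n_sources):
    """Map each threshold t in 1..n_sources to the sorted IDs with score >= t."""
    return {
        t: sorted(i for i, s in scores.items() if s >= t)
        for t in range(1, n_sources + 1)
    }

def centroid_source(predictions_by_source):
    """Return the source with the highest total agreement with the others.

    Assumption: agreement = summed size of the overlaps with every other
    source. Ties go to the alphabetically first source name.
    """
    sets = {name: set(ids) for name, ids in predictions_by_source.items()}

    def agreement(name):
        return sum(len(sets[name] & other)
                   for other_name, other in sets.items() if other_name != name)

    return max(sorted(sets), key=agreement)

# Hypothetical predictions from three sources.
predictions = {
    "oma": ["P1", "P2", "P3", "P4"],
    "panther": ["P2", "P3"],
    "inspector": ["P3", "P4", "P5"],
}
# Score = number of sources listing each ortholog.
scores = {}
for ids in predictions.values():
    for ortholog_id in set(ids):
        scores[ortholog_id] = scores.get(ortholog_id, 0) + 1

lists = minscore_lists(scores, n_sources=len(predictions))
centroid = centroid_source(predictions)
```

The final list would then be either the predictions of the centroid source or one of the threshold lists, depending on the parameters described above.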
Ortholog plotting
Output files
- orthologs/
  - plots/
    - *_supports.png: A bar plot representing the number of predictions from each source and the support of the predictions.
    - *_venn.png: A Venn diagram representing the intersections between databases.
    - *_jaccard.png: A tile plot representing the Jaccard index (pairwise agreement) between databases.
 
 
Plots representing certain aspects of the predictions are generated using ggplot.
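For reference, the Jaccard index shown in the tile plot is the size of the intersection of two prediction sets divided by the size of their union. The sketch below computes it for every pair of sources. The data is made up, and returning 0.0 for two empty sets is a convention chosen here, not something this document specifies.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0  # assumed convention for empty sets

# Hypothetical predictions from three sources.
predictions = {
    "oma": ["P1", "P2", "P3"],
    "panther": ["P2", "P3"],
    "inspector": ["P3", "P4"],
}
pairwise = {
    (x, y): jaccard(predictions[x], predictions[y])
    for x, y in combinations(predictions, 2)
}
```

A value of 1 means two databases returned identical sets, and 0 means they share no predictions.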
Ortholog statistics
Output files
- orthologs/
  - stats/
    - *_stats.yml: A YAML file containing ortholog statistics.
  - hits/
    - *_hits.yml: A YAML file containing hit counts per database.
 
 
The following statistics of the predictions are calculated:
- percentage of consensus - the fraction of predictions which are supported by all the sources
- percentage of privates - the fraction of predictions which are supported by only 1 source
- goodness - the ratio of the real sum of scores to the theoretical maximum (i.e. the number of databases times the number of predictions).
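The three statistics above can be computed from a score table as in the following sketch. The function name and the example scores are hypothetical, and the values are returned as fractions between 0 and 1; whether the pipeline scales them to percentages is not stated here.

```python
def ortholog_stats(scores, n_sources):
    """Compute consensus, privates and goodness from {ortholog_id: score}."""
    n = len(scores)
    if n == 0 or n_sources == 0:
        # Assumed convention for an empty prediction set.
        return {"consensus": 0.0, "privates": 0.0, "goodness": 0.0}
    values = list(scores.values())
    return {
        # fraction of predictions supported by all sources
        "consensus": sum(1 for s in values if s == n_sources) / n,
        # fraction of predictions supported by exactly one source
        "privates": sum(1 for s in values if s == 1) / n,
        # real sum of scores over the theoretical maximum
        "goodness": sum(values) / (n_sources * n),
    }

# Hypothetical scores for four predictions from three sources.
stats = ortholog_stats({"P1": 1, "P2": 2, "P3": 3, "P4": 1}, n_sources=3)
```

With these scores, one of four predictions is supported by all sources (consensus 0.25), two are private (0.5), and the goodness is 7 / 12.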
Report generation
Output files
- *_dist/
  - *.html: The report in HTML format.
  - run.sh: A script to correctly open the report.
  - Other files necessary for the report.
- multiqc/
  - multiqc_report.html: A MultiQC report containing summary of all samples.
 
The report is generated per sample in the form of a React application. It must be hosted on localhost to work correctly. This can be done manually or with the run script provided.
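One possible way to host a report manually is Python's built-in HTTP server, sketched below. This is an assumption about what manual hosting could look like, not a description of what run.sh does. The directory name BicD2_dist and the port are hypothetical, and the report may have requirements that a plain static file server does not meet.

```python
import functools
import http.server

def make_report_server(report_dir, port=8000):
    """Create a static file server for report_dir, bound to localhost only."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=report_dir
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)

# Usage (blocks until interrupted with Ctrl+C):
#   server = make_report_server("BicD2_dist")
#   server.serve_forever()
# The report would then be reachable at http://127.0.0.1:8000/
```

Binding to 127.0.0.1 keeps the report reachable from the local machine only.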
A single MultiQC report is also generated. It contains a comparison of hit count and statistics for each sample, as well as a list of software versions used in the run.
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
 
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.