Welcome to EXPath Tool

Data pre-processing

Preparation

Introduction:

Next-generation sequencing:
Before you upload RNA-seq data to EXPath Tool, you need Gene ontology (GO) annotation, KEGG ortholog (KO) annotation, and an expression profile. If the annotation data and expression profile are not ready, follow the procedure step by step to prepare the necessary files.
If the required files are ready, you can jump to the "Ready to upload data" step to check their format.

Microarray:
EXPath Tool can also be applied to microarray data. Gene ontology (GO) annotation, KEGG ortholog (KO) annotation, and an expression profile are still required. Users can jump to the "Ready to upload data" step to check the format.

Requirements:

1. Sequencing data:
        The format of the sequencing data (raw data) should be checked, including:
              Form of data: paired-end, single-end, or mate-pair
              Data format: FASTQ or SRA
              Sequencing platform: Illumina, Ion Torrent or Roche 454
              Read structure: adapter, barcode, and sequence
                    Note: In most cases, the adapters and barcodes have already been removed. If adapters and barcodes remain, users can remove them in the next step.

              Phred score: Phred33 or Phred64. If you have any questions, please refer to this document.

2. Computer:
        OS: Linux, or Windows 10 with the Linux Bash shell
            Note: Before using the Linux Bash shell on Windows 10, users should refer to this document to change the computer settings and install the Linux Bash shell.
        Hardware: Please refer to the requirements of each software package.
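
If the Phred encoding of a FASTQ file is unknown, it can often be guessed from the range of quality characters. The following is a minimal sketch and not part of EXPath Tool. The function name guess_phred_offset and the ASCII thresholds are assumptions based on the commonly used ranges: characters below ASCII 59 occur only in Phred33 data, and characters above 'J' (ASCII 74) are typical of Phred64 data.

```python
def guess_phred_offset(quality_lines):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Heuristic only (an assumption, not a guarantee):
      - any character below ASCII 59 implies Phred33;
      - all characters >= ASCII 64 with at least one above 'J' (74)
        suggests Phred64;
      - anything else is reported as ambiguous (None).
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if lowest >= 64 and highest > 74:
        return 64
    return None
```

The quality strings are every fourth line of a FASTQ file. A result of None means the reads inspected do not settle the question, and more reads should be checked.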

Assembly

In this step, users obtain the contig (candidate transcript) sequences by the following procedure.

Quality control (QC):

        To improve the accuracy of the assembly, low-quality reads should be removed or trimmed.
1. Transformation:
        If the raw data are not in FASTQ format, please use the SRA Toolkit to convert them to FASTQ format.
            Download sra toolkit:
            View the manual:

2. Quality control:
        We suggest using one of two tools: Fastx_toolkit or Trimmomatic.

Fastx_toolkit:

Execution environment: Linux (64bit), Linux (32bit), OpenSolaris 2009.6, FreeBSD (64bit), MacOS X 10.5.8 (32bit)
Reference:

Gordon, A., and G. J. Hannon. "Fastx-toolkit." FASTQ/A short-reads preprocessing tools (unpublished) http://hannonlab.cshl.edu/fastx_toolkit (2010).

Download:
Manual:
Usage:

Remove low quality reads: fastq_quality_filter -q (threshold of score¹) -p (percentage²) -i (input file) -o (output file) [-Q33/-Q64³]
¹: Minimum quality score to keep.
²: Minimum percent of bases that must have [-q] quality.
³: Phred format.
Trim reads: fastq_quality_trimmer -i (input file) -o (output file) -t (threshold of score) -l (minimum length) [-Q33/-Q64³]
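
The rule applied by the -q and -p options can be sketched in Python as follows. This is an illustrative re-implementation of the rule as described in the footnotes above, not the Fastx_toolkit source code. The function name passes_quality_filter and the default offset of 33 are assumptions.

```python
def passes_quality_filter(quality, min_score, min_percent, offset=33):
    """Return True if at least `min_percent` percent of the bases
    have a quality score of `min_score` or higher.

    quality     -- quality string of one read (4th line of a FASTQ record)
    min_score   -- corresponds to -q
    min_percent -- corresponds to -p
    offset      -- 33 or 64, corresponds to -Q33 / -Q64
    """
    if not quality:
        return False  # an empty read cannot satisfy the rule
    scores = [ord(ch) - offset for ch in quality]
    good = sum(1 for score in scores if score >= min_score)
    return 100.0 * good / len(scores) >= min_percent
```

For example, with -q 20 -p 80 a read whose quality string is "II##" (two bases at score 40, two at score 2) has only 50% of its bases at or above the threshold and would be discarded.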

Trimmomatic:

Execution environment: linux, windows
Reference:

Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. "Trimmomatic: a flexible trimmer for Illumina sequence data." Bioinformatics (2014): btu170.

Download:
Manual:
Usage: java -jar trimmomatic.jar PE [-phred33 / -phred64] input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz LEADING:3¹ TRAILING:3² SLIDINGWINDOW:4:15³ MINLEN:36⁴
¹: Remove bases from the 5' end of the read while their quality is below 3.
²: Remove bases from the 3' end of the read while their quality is below 3.
³: Scan the read with a 4-base-wide sliding window, cutting when the average quality per base drops below 15.
⁴: Drop reads shorter than 36 bases after trimming.
Note: Java™ is required.
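
To make the SLIDINGWINDOW step concrete, the following Python sketch trims one read with a sliding window. It is a simplified illustration of the idea in footnote ³, not Trimmomatic's actual implementation, which may place the cut differently within the failing window. The function name sliding_window_trim and the choice to cut at the start of the first failing window are assumptions.

```python
def sliding_window_trim(seq, quality, window=4, min_avg=15, offset=33):
    """Cut a read at the first window whose mean quality is below `min_avg`.

    Scans from the 5' end. The read is cut at the start of the first
    failing window (a simplification). Returns the trimmed sequence
    and the matching quality string.
    """
    scores = [ord(ch) - offset for ch in quality]
    cut = len(seq)  # keep the whole read unless a window fails
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_avg:
            cut = start
            break
    return seq[:cut], quality[:cut]
```

A read that is shorter than MINLEN after this step would then be dropped, as described in footnote ⁴.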

Assembly:

To reconstruct transcripts from a large number of reads, assembly software such as Trinity, Trans-ABySS, or Oases is used.

Trinity:

Execution environment: linux
Reference:

Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature protocols 8.8 (2013): 1494-1512.

Download:
Manual:
Usage:
    Trinity --seqType (fa/fq¹) --left (left reads) --right (right reads) --max_memory (memory for Trinity, e.g. 10G) --CPU (number of threads) --output (output directory)
        ¹: fasta or fastq.
Note:
        Please install SAMtools and Bowtie into your (home directory)/bin.

Oases:

Execution environment: linux (64bit)
Reference:

Schulz, Marcel H., et al. "Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels." Bioinformatics 28.8 (2012): 1086-1092.

Download:
Manual:
Usage:
python scripts/oases_pipeline.py -m (minimum of K-mer) -M (maximum of K-mer) -o (output directory) -d ' -shortPaired (input file) ' -p ' -ins_length (insert length) -min_trans_lgth (minimum length of contig) '

Trans-ABySS:

Execution environment: linux
Reference:

Robertson, Gordon, et al. "De novo assembly and analysis of RNA-seq data." Nature methods 7.11 (2010): 909-912.

Download:
Manual:
Usage: transabyss --pe (paired end reads) --name (assembly name) --outdir (output directory) --kmer (K-mer)
Note:
ABySS 1.5.2, Python 2.7.6+, python-igraph 0.7.0+, BLAT are required.

Annotation

In this step, users obtain the files containing the GO and KEGG annotations.

KEGG:

Format of upload file:
Format:
Please refer to

Software:

NCBI BLAST:

Execution environment: linux, windows
Reference:

Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology 215.3 (1990): 403-410.

Download:
Manual:
Usage:
    Windows: Please refer to this document.
    Linux:
        Get best hit result:

(blastn/blastp/blastx) -query (input file) -db (blast database) -evalue (e-value) -out (output file) -outfmt 6 -max_target_seqs (number of target sequences to keep)
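
The tabular output produced by -outfmt 6 can be reduced to one best hit per query with a short script. The sketch below assumes the default 12-column layout of -outfmt 6 (query ID in column 1, subject ID in column 2, bit score in column 12) and treats the hit with the highest bit score as the best hit. The function name best_hits is hypothetical.

```python
import csv


def best_hits(lines):
    """Keep the highest-bit-score hit for each query.

    `lines` is any iterable of tab-delimited BLAST -outfmt 6 rows,
    for example an open file. Returns {query ID: (subject ID, bit score)}.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or incomplete rows
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

The resulting mapping from contig ID to subject ID can then be written out as a two-column, tab-delimited file.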

Database download:

KAAS (KEGG Automatic Annotation Server):

Execution environment: web browser (e.g. Firefox, Chrome)
Reference:

Moriya, Yuki, et al. "KAAS: an automatic genome annotation and pathway reconstruction server." Nucleic acids research 35.suppl 2 (2007): W182-W185.

Link:

Gene ontology (GO):

Format of upload file:
Format:
Please refer to

Software:

Blast2GO:

Execution environment: linux, windows
Reference:

Conesa, Ana, et al. "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21.18 (2005): 3674-3676.

Download:
Manual:

Expression profile

In this step, users obtain the expression profile by the following procedure.

Expression profile:

Format of upload file:
Format:
Please refer to

Software:

Several tools, such as RSEM, eXpress, kallisto, and Salmon, have been developed to estimate expression levels. However, running them directly is somewhat complex. To streamline the process, the Trinity team wrote a Perl script that calls these tools and estimates expression levels. Users can download Trinity and these tools and then run the Perl script.

RSEM:

Type: alignment based
Execution environment: linux
Reference:

Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC bioinformatics 12.1 (2011): 1.

Download Trinity:
Download RSEM:
Manual of Trinity plugin:
Manual of RSEM:
Usage:
      Trinity script¹: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method RSEM --aln_method (bowtie/bowtie2) --trinity_mode² --prep_reference --output_dir (output directory)
¹: Please install RSEM under your "home directory/bin".
²: If the assembly software is Trinity, please add this option.

eXpress:

Type: alignment based
Execution environment: linux 64bit, windows 64bit, Mac 64bit
Reference:

Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73.

Download Trinity:
Download eXpress:
Manual of Trinity plugin:
Manual of eXpress:
Usage:
        Trinity script¹: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method eXpress --aln_method (bowtie/bowtie2) --trinity_mode² --prep_reference --output_dir (output directory)
¹: Please install eXpress under your "home directory/bin".
²: If the assembly software is Trinity, please add this option.

kallisto:

Type: alignment free
Execution environment: linux, Mac
Reference:

Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527.

Download Trinity:
Download kallisto:
Manual of Trinity plugin:
Manual of kallisto:
Usage:
        Trinity script¹: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method kallisto --trinity_mode² --prep_reference --output_dir (output directory)
¹: Please install kallisto under your "home directory/bin".
²: If the assembly software is Trinity, please add this option.

Salmon:

Type: alignment free
Execution environment: linux 64bit, windows 64bit, Mac 64bit
Reference:

Patro, Rob, Geet Duggal, and Carl Kingsford. "Salmon: Accurate, versatile and ultrafast quantification from rna-seq data using lightweight-alignment." bioRxiv (2015): 021592.

Download Trinity:
Download Salmon:
Manual of Trinity plugin:
Manual of Salmon:
Usage:
        Trinity script¹: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method salmon --trinity_mode² --prep_reference --output_dir (output directory)
¹: Please install salmon under your "home directory/bin".
²: If the assembly software is Trinity, please add this option.

Upload data rechecking

        Before you upload the annotation and expression profile data to EXPath Tool, please recheck the format of the files.

Notice:
        1. Column names are required.
        2. Columns are separated by tabs.
        3. Check that the IDs are consistent across all files.
        4. File extension: .txt, .tsv, or .tab
        5. Upload size limit: 100 Mb per file
Uploading files:
        Essential files:
                Annotation file - pathway (KEGG)
                Annotation file - GO
                Expression profile
        Additional file:
                Annotation file - additions
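
The ID consistency check in point 3 can be done before uploading with a few lines of Python. This is a minimal sketch that assumes "consistency" means the same sequence IDs appear in the first column of each file. The function names first_column_ids and unmatched_ids are hypothetical, and EXPath Tool itself may apply a different rule.

```python
def first_column_ids(lines):
    """Collect the IDs in the first column of a tab-delimited table."""
    rows = iter(lines)
    next(rows, None)  # skip the required header row
    return {line.split("\t")[0].strip() for line in rows if line.strip()}


def unmatched_ids(annotation_lines, expression_lines):
    """Return (IDs only in the annotation file, IDs only in the
    expression profile), each as a sorted list."""
    annotation = first_column_ids(annotation_lines)
    expression = first_column_ids(expression_lines)
    return sorted(annotation - expression), sorted(expression - annotation)
```

Both arguments can be open file handles or lists of lines. Two empty lists mean that every ID occurs in both files.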

Gene ontology (GO):

Format of upload file:
    Format:
        Sequence ID and GO ID should be included in the tab-delimited file. Column names are required.
    For example:
        Format 1:

Contig ID	GO ID
Contig0001	GO:0070122;GO:0051033
Contig0002
Contig0003	GO:1904798
...	...

Format 2:

Contig ID	GO ID
Contig0001	GO:0070122
Contig0001	GO:0051033
Contig0002
Contig0003	GO:1904798
...	...
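
Format 1 and Format 2 carry the same information, so a single reader can handle both. The sketch below is one possible way to parse either layout into a mapping from sequence ID to GO IDs. The function name read_go_annotation is hypothetical and the code is not part of EXPath Tool.

```python
def read_go_annotation(lines):
    """Parse a GO annotation table in Format 1 or Format 2.

    Returns {sequence ID: [GO IDs]}. Sequences without a GO ID
    are kept with an empty list.
    """
    annotation = {}
    rows = iter(lines)
    next(rows, None)  # skip the required header row
    for line in rows:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        seq_id = fields[0].strip()
        go_field = fields[1] if len(fields) > 1 else ""
        go_ids = annotation.setdefault(seq_id, [])
        # Format 1 joins several IDs with ';', Format 2 repeats the row.
        for go_id in go_field.split(";"):
            go_id = go_id.strip()
            if go_id and go_id not in go_ids:
                go_ids.append(go_id)
    return annotation
```

Applied to the two examples above, both formats yield the same mapping: Contig0001 has two GO IDs, Contig0002 has none, and Contig0003 has one.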

KEGG:

Format of upload file:
    Format:
        Sequence ID and annotation ID should be included in the tab-delimited file. Column names are required.
    Format of annotation ID:

System	Example
KEGG ortholog (KO)	K02894
KEGG gene	ath:AT5G16050
NCBI RefSeq	XP_717575.1
NCBI GI number	758990211
UniprotKB accession number	O45687
UniProtKB Entry name (formerly ID)	ADH2_GEOSE

For example (ID system: KEGG ortholog):

Contig ID	Annotation ID
Contig0001	K02894;K09334;K10458
Contig0002
Contig0003	K00362
...	...

Expression profile:

Format of upload file:
    Format:
        The first column is the sequence ID, and the other columns are the expression levels of each sequence in each sample.
        The column names should be separated by tabs.
        If a condition has replicates, its column names should follow the format "Condition name_Replicate number" (for example, Control_1).
        Column names can only contain digits (0-9), letters (a-zA-Z) and underscores (_).
    For example:
        Format of non-replicate dataset:

Contig ID	Control_1	Salt_stress_1	Sorbitol_stress_1	...
Contig0001	1	10	8	...
Contig0002	33.3	43.2	20.1	...
Contig0003	1487	987	487	...
Contig0004	0	8	7	...
Contig0005	5.55	32.1	33.2	...
...	...	...	...	...

Format of replicate dataset:

Contig ID	Control_1	Control_2	Control_3	Salt_stress_1	Salt_stress_2	Salt_stress_3	...
Contig0001	11	44	12	33	35.2	31	...
Contig0002	222	201.2	242	888	832.2	800	...
Contig0003	0	0.32	1	48	35.2	50.2	...
Contig0004	1344	1313	1288	5200	5135	5566	...
Contig0005	5	6	7	55	66	60.5	...
...	...	...	...	...	...	...	...
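
The column naming rules can be checked on the header line before uploading. The sketch below validates the sample column names against the allowed characters and groups replicates by condition. The function name group_replicates is hypothetical. Skipping the first (ID) column is an assumption made because the example header "Contig ID" contains a space.

```python
import re

# Allowed characters for sample column names: 0-9, a-z, A-Z and "_".
COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def group_replicates(header_line):
    """Validate sample column names and group them by condition.

    Returns {condition name: [column names]}. Raises ValueError if a
    sample column name contains a character that is not allowed.
    """
    columns = header_line.rstrip("\r\n").split("\t")[1:]  # skip the ID column
    groups = {}
    for name in columns:
        if not COLUMN_NAME.fullmatch(name):
            raise ValueError("invalid column name: %r" % name)
        condition, sep, replicate = name.rpartition("_")
        if not sep or not replicate.isdigit():
            condition = name  # no "_<number>" suffix: a single sample
        groups.setdefault(condition, []).append(name)
    return groups
```

For the replicate example above, Control_1, Control_2 and Control_3 are grouped under Control, and the three Salt_stress columns under Salt_stress.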

Additional annotation:

If users want to view other annotations, such as Pfam, in EXPath Tool, they can put these annotations in this file and upload it.

Format of upload file:
    Format:
        The first column is the sequence ID, and the other columns contain the annotations.
        The column names should be separated by tabs.
        Column names can only contain digits (0-9), letters (a-zA-Z) and underscores (_).
    For example:

Contig ID	Pfam_ID	Pfam_name	Gene_symbol	...
Contig0001	PF00931	NB-ARC		...
Contig0002			ADH1	...
Contig0003				...
...	...	...	...	...