Data pre-processing




Preparation



Introduction:


Next-generation sequencing:
        Before uploading RNA-seq data to EXPath Tool, you need three things: a Gene ontology (GO) annotation, a KEGG ortholog (KO) annotation, and an expression profile. If the annotation data and expression profile are not ready, follow the procedure step by step to prepare the necessary files.
        If the required files are ready, you can jump to the "Ready to upload data" step to check their format.


Microarray:
        EXPath Tool can also be applied to microarray data. A Gene ontology (GO) annotation, a KEGG ortholog (KO) annotation, and an expression profile are still required. You can jump to the "Ready to upload data" step to check their format.

Requirements:


1. Sequencing data:

        The format of the sequencing data (raw data) should be examined, including:
              Form of data: paired-end, single-end, or mate-pair
              Data format: FASTQ or SRA
              Sequencing platform: Illumina, Ion Torrent, or Roche 454
              Read structure: adapter, barcode, and sequence
                    Note: In most cases, the adapters and barcodes have already been removed. If adapters and barcodes remain, you can remove them at the next step.

              Phred score: Phred33 or Phred64. If you have any questions, please refer to this document.
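
If you are unsure which Phred encoding a FASTQ file uses, the quality characters themselves often give it away. The following is a minimal sketch of that check, not part of EXPath Tool; the function names and the two cut-off values (59 and 74) are assumptions based on the usual character ranges of the two encodings, and the result is only a heuristic.

```python
from itertools import islice


def quality_lines(fastq_lines, max_reads=10000):
    """Yield the quality line (every 4th line) of the first max_reads records."""
    for index, line in enumerate(islice(fastq_lines, max_reads * 4)):
        if index % 4 == 3:
            yield line.rstrip("\r\n")


def guess_phred_offset(qualities):
    """Return 33, 64, or None when the characters fit both encodings."""
    codes = [ord(ch) for line in qualities for ch in line]
    if not codes:
        return None
    if min(codes) < 59:   # characters below ';' do not occur in +64 encodings
        return 33
    if max(codes) > 74:   # characters above 'J' are unusual for Phred33 data
        return 64
    return None           # ambiguous: inspect more reads


# In-memory example record; replace with an open file handle for real data.
example = ["@read1", "ACGT", "+", "II5!"]
print(guess_phred_offset(quality_lines(example)))
```

A return value of None means the sampled reads do not settle the question, in which case the sequencing provider's documentation is the safer source.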


2. Computer:
        OS: Linux, or Windows 10 with the Linux Bash shell
            Note: Before using the Linux Bash shell on Windows 10, please refer to this document to change the computer settings and install the shell.
        Hardware: Please refer to the requirements of each software package.











Assembly



At this step, you can obtain the contig (candidate transcript) sequences by the following procedure.


Quality control (QC):




        To improve the accuracy of the assembly, low-quality reads should be removed or trimmed.
1. Transformation:
        If the raw data are not in FASTQ format, please use the SRA Toolkit to convert them to FASTQ.
            Download SRA Toolkit: 
            View the manual:        

2. Quality control:
        We suggest using one of these two tools: Fastx_toolkit or Trimmomatic.

Fastx_toolkit:
Execution environment: Linux (64bit), Linux (32bit), OpenSolaris 2009.6, FreeBSD (64bit), MacOS X 10.5.8 (32bit)
Reference:
Gordon, A., and G. J. Hannon. "Fastx-toolkit." FASTQ/A short-reads preprocessing tools (unpublished) http://hannonlab.cshl.edu/fastx_toolkit (2010).

Download:
Manual:    
Usage:
Remove low-quality reads: fastq_quality_filter -q (score threshold) -p (percentage) -i (input file) -o (output file) [-Q33/-Q64]
    -q: Minimum quality score to keep.
    -p: Minimum percent of bases that must have [-q] quality.
    -Q33/-Q64: Phred format.
Trim reads: fastq_quality_trimmer -i (input file) -o (output file) -t (score threshold) -l (minimum length) [-Q33/-Q64]
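
To make the meaning of -q and -p concrete, the sketch below applies the same rule to a single quality string: a read is kept when at least the given percentage of its bases reach the given score. This is an illustration only; the function name is hypothetical and the exact rounding used by fastq_quality_filter may differ.

```python
def passes_quality_filter(quality, min_score, min_percent, offset=33):
    """Keep a read if at least min_percent of its bases have score >= min_score.

    quality: the FASTQ quality string of one read.
    offset:  33 or 64, matching -Q33 / -Q64.
    """
    scores = [ord(ch) - offset for ch in quality]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_score)
    # Integer comparison avoids floating-point rounding at the boundary.
    return good * 100 >= min_percent * len(scores)


# Equivalent in spirit to: -q 20 -p 80 -Q33
print(passes_quality_filter("IIII", 20, 80))
print(passes_quality_filter("II!!", 20, 80))
```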


Trimmomatic:
Execution environment: Linux, Windows
Reference:
Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. "Trimmomatic: a flexible trimmer for Illumina sequence data." Bioinformatics (2014): btu170.

Download:
Manual:         
Usage: java -jar trimmomatic.jar PE [-phred33 / -phred64] input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    LEADING:3: Remove bases with quality below 3 from the 5' end of the read.
    TRAILING:3: Remove bases with quality below 3 from the 3' end of the read.
    SLIDINGWINDOW:4:15: Scan the read with a 4-base-wide sliding window, cutting when the average quality per base drops below 15.
    MINLEN:36: Drop reads shorter than 36 bases after trimming.
Note: Java is required.
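
The sliding-window step is easier to follow with numbers. The sketch below is a simplified illustration of the idea behind SLIDINGWINDOW:4:15 followed by MINLEN:36; it is not Trimmomatic's actual implementation, which handles the cut position in more detail, and both function names are hypothetical.

```python
def sliding_window_keep_length(scores, window=4, min_mean=15):
    """Return how many bases to keep from the 5' end of a read.

    The read is cut at the start of the first window whose mean quality
    falls below min_mean. Reads shorter than the window are kept whole.
    """
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) < min_mean * window:
            return start
    return len(scores)


def survives_minlen(kept_length, min_length=36):
    """Apply the MINLEN rule to the trimmed length."""
    return kept_length >= min_length


# 40 good bases followed by 10 poor ones.
scores = [30] * 40 + [5] * 10
kept = sliding_window_keep_length(scores)
print(kept, survives_minlen(kept))
```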


Assembly:



        To recover transcripts from a large number of reads, assembly software such as Trinity, Trans-ABySS, or Oases is used.

Trinity:
Execution environment: Linux
Reference:
Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature protocols 8.8 (2013): 1494-1512.

Download:
Manual:         
Usage:
    Trinity --seqType (fa/fq) --left (left reads) --right (right reads) --max_memory (Int.; memory for Trinity) --CPU (number) --output (output directory)
        --seqType: fa for FASTA or fq for FASTQ.
Note:
        Please install samtools and bowtie into your (home directory)/bin.
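
When the same assembly has to be launched for several datasets, it can help to build the command from a small function rather than retyping it. The sketch below only assembles the argument list from the options shown in the usage line above; the function name, the file names, and the default values for memory and CPU count are placeholder assumptions.

```python
import shlex


def trinity_command(left, right, output_dir, seq_type="fq",
                    max_memory="10G", cpu=4):
    """Build the Trinity argument list from the options listed in the usage."""
    return [
        "Trinity",
        "--seqType", seq_type,
        "--left", left,
        "--right", right,
        "--max_memory", max_memory,   # placeholder value, adjust to your machine
        "--CPU", str(cpu),            # placeholder value
        "--output", output_dir,
    ]


cmd = trinity_command("sample_1.fq", "sample_2.fq", "trinity_out")
print(shlex.join(cmd))
# To launch the assembly: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with file names that contain spaces.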

Oases:
Execution environment: Linux (64bit)
Reference:
Schulz, Marcel H., et al. "Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels." Bioinformatics 28.8 (2012): 1086-1092.

Download:
Manual:         
Usage:
        python scripts/oases_pipeline.py -m (minimum of K-mer) -M (maximum of K-mer) -o (output directory) -d ' -shortPaired (input file) ' -p ' -ins_length (insert length) -min_trans_lgth (minimum length of contig) '

Trans-ABySS:
Execution environment: Linux
Reference:
Robertson, Gordon, et al. "De novo assembly and analysis of RNA-seq data." Nature methods 7.11 (2010): 909-912.

Download:
Manual:         
Usage:  transabyss --pe (paired end reads) --name (assembly name) --outdir (output directory) --kmer (K-mer)
Note:
        ABySS 1.5.2, Python 2.7.6+, python-igraph 0.7.0+, and BLAT are required.








Annotation

 

At this step, you can obtain the files containing the GO and KEGG annotations.

KEGG:



Format of upload file:
    Format:
        Please refer to the KEGG format in the "Upload data rechecking" section.

Software:

  
NCBI BLAST:
Execution environment: Linux, Windows
Reference:
Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology 215.3 (1990): 403-410.

Download:
Manual:         
Usage:
    Windows: Please refer to this document.
    Linux:
        Get best hit result: 
(blastn/blastp/blastx) -query (input file) -db (blast database) -evalue (e-value) -out (output file) -outfmt 6 -max_target_seqs 1
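
The tabular output produced by -outfmt 6 can be reduced to one hit per contig with a few lines of code. The sketch below keeps, for each query, the subject with the highest bit score; it assumes the default twelve-column layout of -outfmt 6 (query ID first, subject ID second, bit score last), and the function name and example rows are hypothetical.

```python
def best_hits(tabular_lines):
    """Map each query ID to (subject ID, bit score) of its highest-scoring hit."""
    best = {}
    for line in tabular_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up rows in the default -outfmt 6 column order.
rows = [
    "Contig0001\tsubjectA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120",
    "Contig0001\tsubjectB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150",
]
for query, (subject, score) in best_hits(rows).items():
    print(query, subject, score, sep="\t")
```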
Database download:


KAAS (KEGG Automatic Annotation Server):
Execution environment: web browser (e.g. Firefox, Chrome)
Reference:
Moriya, Yuki, et al. "KAAS: an automatic genome annotation and pathway reconstruction server." Nucleic acids research 35.suppl 2 (2007): W182-W185.

Link:





Gene ontology (GO):

Format of upload file:
    Format:
        Please refer to the Gene ontology (GO) format in the "Upload data rechecking" section.


Software:

Blast2GO:

Execution environment: Linux, Windows
Reference:
Conesa, Ana, et al. "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21.18 (2005): 3674-3676.

Download:
Manual:         











Expression profile



        At this step, you can obtain the expression profile by the following procedure.

Expression profile:



Format of upload file:
    Format:
        Please refer to the expression profile format in the "Upload data rechecking" section.


Software:
                 Several tools, such as RSEM, eXpress, kallisto, and Salmon, have been developed to estimate expression levels. However, running them directly is somewhat complex. To streamline the process, the Trinity team wrote a Perl script that calls these tools and estimates expression levels. Users can download Trinity and these tools, then run the Perl script.

RSEM:
Type: alignment based
Execution environment: Linux
Reference:
Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC bioinformatics 12.1 (2011): 1.

Download Trinity:     
Download RSEM:    
Manual of Trinity plugin:
         
Manual of RSEM:                     
Usage:
      Trinity script: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method RSEM --aln_method (bowtie/bowtie2) --trinity_mode --prep_reference --output_dir (output directory)

      Note 1: Install RSEM under "(home directory)/bin" before running the Trinity script.
      Note 2: Add --trinity_mode only if the assembly software is Trinity.
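
The script is run once per sample, so the per-sample results still have to be combined into the single expression profile that EXPath Tool expects. The sketch below shows one way to do that under stated assumptions: each result table is tab-delimited with a header, and the column names `transcript_id` and `TPM` are assumed from RSEM's isoform-level output; other estimators name their columns differently. All function names are hypothetical.

```python
import csv
import io


def read_values(lines, id_column="transcript_id", value_column="TPM"):
    """Read one per-sample result table into {sequence ID: value}."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[id_column]: row[value_column] for row in reader}


def build_expression_profile(samples):
    """samples maps a column name (e.g. 'Control_1') to {sequence ID: value}."""
    names = list(samples)
    ids = sorted(set().union(*(samples[name] for name in names)))
    rows = [["Contig ID"] + names]
    for seq_id in ids:
        # IDs missing from a sample are written as 0.
        rows.append([seq_id] + [samples[name].get(seq_id, "0") for name in names])
    return rows


def write_profile(rows, handle):
    """Write the profile as a tab-delimited table with a header row."""
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


# In-memory demonstration with made-up values.
control = io.StringIO("transcript_id\tTPM\nContig0001\t1.5\n")
salt = io.StringIO("transcript_id\tTPM\nContig0001\t3.2\nContig0002\t7.0\n")
profile = build_expression_profile(
    {"Control_1": read_values(control), "Salt_stress_1": read_values(salt)}
)
out = io.StringIO()
write_profile(profile, out)
print(out.getvalue())
```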


eXpress:
Type: alignment based
Execution environment:
Linux 64bit, Windows 64bit, Mac 64bit
Reference:
Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73.

Download Trinity:     
Download eXpress:
Manual of Trinity plugin:
         
Manual of eXpress:                 
Usage:
       
Trinity script: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method eXpress --aln_method (bowtie/bowtie2) --trinity_mode --prep_reference --output_dir (output directory)
Note 1: Install eXpress under "(home directory)/bin" before running the Trinity script.
Note 2: Add --trinity_mode only if the assembly software is Trinity.



kallisto:
Type: alignment free
Execution environment:
Linux, Mac
Reference:
Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527.

Download Trinity:     
Download kallisto:  
Manual of Trinity plugin:
         
Manual of kallisto:                   
Usage:
       
Trinity script: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method kallisto --trinity_mode --prep_reference --output_dir (output directory)
Note 1: Install kallisto under "(home directory)/bin" before running the Trinity script.
Note 2: Add --trinity_mode only if the assembly software is Trinity.



Salmon:
Type: alignment free
Execution environment:
Linux 64bit, Windows 64bit, Mac 64bit
Reference:
Patro, Rob, Geet Duggal, and Carl Kingsford. "Salmon: Accurate, versatile and ultrafast quantification from rna-seq data using lightweight-alignment." bioRxiv (2015): 021592.

Download Trinity:     
Download Salmon: 
Manual of Trinity plugin:
         
Manual of Salmon:                  
Usage:
       
Trinity script: align_and_estimate_abundance.pl --transcripts (assembly file) --seqType (fa/fq) --left (left reads of sample) --right (right reads of sample) --est_method salmon --trinity_mode --prep_reference --output_dir (output directory)
Note 1: Install Salmon under "(home directory)/bin" before running the Trinity script.
Note 2: Add --trinity_mode only if the assembly software is Trinity.








Upload data rechecking


        Before you upload the annotation and expression profile data to EXPath Tool, please recheck the format of the files.

Notice:
        1. Column names are required.
        2. The columns are separated by tab.
        3. Check that the sequence IDs are consistent across all files.
        4. File extension: .txt, .tsv, or .tab
        5. File size limit: 100 Mb per file
Uploading files:
        Essential files:
                Annotation file - pathway (KEGG)
                Annotation file - GO
                Expression profile
        Additional file:
                Annotation file - additions
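
The points in the notice can be checked locally before uploading. The sketch below is a minimal pre-upload check, not an official validator: the function names are hypothetical, and the size limit is interpreted as 100 megabytes, which is an assumption about what "100 Mb" means here.

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".tsv", ".tab"}
MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 Mb" read as 100 megabytes


def check_upload_file(path):
    """Return a list of problems found in one upload file (empty if none)."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("unexpected file extension: " + repr(extension))
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than the size limit")
    with open(path) as handle:
        header = handle.readline().rstrip("\r\n")
    if "\t" not in header:
        problems.append("header row is missing or not tab-separated")
    return problems


def read_ids(path):
    """Collect the first-column IDs of a tab-delimited file, skipping the header."""
    with open(path) as handle:
        next(handle, None)
        return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


def ids_missing_from(reference_path, other_path):
    """IDs that appear in other_path but not in reference_path."""
    return read_ids(other_path) - read_ids(reference_path)
```

For example, `ids_missing_from("go.txt", "expression.txt")` would list the sequence IDs in the expression profile that have no row in the GO annotation file; both file names are placeholders.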

Gene ontology (GO):



Format of upload file:
    Format:
       
Sequence ID and GO ID should be included in the tab-delimited file. Column names are required.
    For example:
        Format 1:
Contig ID       GO ID
Contig0001      GO:0070122;GO:0051033
Contig0002
Contig0003      GO:1904798
...             ...

        Format 2:
Contig ID       GO ID
Contig0001      GO:0070122
Contig0001      GO:0051033
Contig0002
Contig0003      GO:1904798
...             ...
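
The two formats carry the same information, so one can be converted to the other mechanically. The sketch below expands Format 1 rows (several GO IDs joined by semicolons) into Format 2 rows (one GO ID per row); the function name is hypothetical.

```python
def expand_go_rows(rows):
    """Turn (contig ID, 'GO:a;GO:b') pairs into one (contig ID, GO ID) pair per term.

    rows: iterable of (contig_id, go_field) pairs, header row excluded.
    Contigs without any GO ID are kept with an empty second column.
    """
    for contig_id, go_field in rows:
        terms = [term.strip() for term in go_field.split(";") if term.strip()]
        if not terms:
            yield contig_id, ""
        for term in terms:
            yield contig_id, term


format_1 = [
    ("Contig0001", "GO:0070122;GO:0051033"),
    ("Contig0002", ""),
    ("Contig0003", "GO:1904798"),
]
print("Contig ID\tGO ID")
for contig_id, go_id in expand_go_rows(format_1):
    print(contig_id, go_id, sep="\t")
```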




KEGG:



Format of upload file:
    Format:
       
Sequence ID and annotation ID should be included in the tab-delimited file. Column names are required.
    Format of annotation ID:
System                                  Example
KEGG ortholog (KO)                      K02894
KEGG gene                               ath:AT5G16050
NCBI RefSeq                             XP_717575.1
NCBI GI number                          758990211
UniProtKB accession number              O45687
UniProtKB entry name (formerly ID)      ADH2_GEOSE

    For example (ID system: KEGG ortholog):
Contig ID       Annotation ID
Contig0001      K02894;K09334;K10458
Contig0002
Contig0003      K00362
...             ...
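
When the KEGG ortholog ID system is used, malformed IDs are easy to catch before uploading, since a KO identifier is the letter K followed by five digits. The sketch below checks one annotation field at a time; the function name is hypothetical, and the other ID systems in the table would need their own patterns.

```python
import re

KO_PATTERN = re.compile(r"K\d{5}")


def invalid_ko_ids(annotation_field):
    """Return the IDs in a semicolon-separated field that are not valid KO IDs."""
    ids = [item.strip() for item in annotation_field.split(";") if item.strip()]
    return [item for item in ids if not KO_PATTERN.fullmatch(item)]


print(invalid_ko_ids("K02894;K09334;K10458"))
print(invalid_ko_ids("K0289;K02894"))
```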


Expression profile:



Format of upload file:
    Format:
        The first column is the sequence ID, and the other columns are the expression levels of each sequence in each sample.
        The column names should be separated by tabs.
        If the conditions have replicates, name each column in the form "Condition name_Replicate number".
        Column names may contain only digits (0-9), letters (a-zA-Z), and underscores (_).
    For example:
        Format of non-replicate dataset:
Contig ID     Control_1    Salt_stress_1    Sorbitol_stress_1    ...
Contig0001    1            10               8                    ...
Contig0002    33.3         43.2             20.1                 ...
Contig0003    1487         987              487                  ...
Contig0004    0            8                7                    ...
Contig0005    5.55         32.1             33.2                 ...
...           ...          ...              ...                  ...


        Format of replicate dataset:
Contig ID     Control_1    Control_2    Control_3    Salt_stress_1    Salt_stress_2    Salt_stress_3    ...
Contig0001    11           44           12           33               35.2             31               ...
Contig0002    222          201.2        242          888              832.2            800              ...
Contig0003    0            0.32         1            48               35.2             50.2             ...
Contig0004    1344         1313         1288         5200             5135             5566             ...
Contig0005    5            6            7            55               66               60.5             ...
...           ...          ...          ...          ...              ...              ...              ...
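
The column naming rules can be verified from the header row alone. The sketch below checks the allowed characters and groups columns into conditions using the "Condition name_Replicate number" convention. It is an illustration with a hypothetical function name, and it assumes the character rule applies to the sample columns only, since the first column in the examples is named "Contig ID".

```python
import re
from collections import defaultdict

NAME_PATTERN = re.compile(r"[0-9A-Za-z_]+")


def group_replicates(header):
    """Map each condition name to its sample columns, read from the header row."""
    columns = header.rstrip("\r\n").split("\t")[1:]  # skip the sequence ID column
    groups = defaultdict(list)
    for name in columns:
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError("invalid column name: " + repr(name))
        condition, separator, replicate = name.rpartition("_")
        if separator and replicate.isdigit():
            groups[condition].append(name)
        else:
            groups[name].append(name)  # no replicate suffix
    return dict(groups)


header = "Contig ID\tControl_1\tControl_2\tSalt_stress_1\tSalt_stress_2"
for condition, columns in group_replicates(header).items():
    print(condition, columns)
```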


Additional annotation:



        If you want to view other annotations, such as Pfam, in EXPath Tool, put them in this file and upload it.

Format of upload file:
    Format:
        The first column is the sequence ID, and the other columns hold the annotations.
        The column names should be separated by tabs.
        Column names may contain only digits (0-9), letters (a-zA-Z), and underscores (_).
    For example:
Contig ID     Pfam_ID    Pfam_name    Gene_symbol    ...
Contig0001    PF00931    NB-ARC                      ...
Contig0002                            ADH1           ...
Contig0003                                           ...
...           ...        ...          ...            ...




Contact us: Wen-Chi Chang          E-mail: sarah321@mail.ncku.edu.tw