Build Kallisto Transcriptome Index
Description:
As described in the Kallisto paper, RNA-Seq reads are efficiently mapped through a pseudoalignment process against a reference transcriptome index. We will build the index in this step.
Input Data:
Input | Description | Example |
---|---|---|
Reference transcriptome | fasta | Example transcriptome |
RNA-Seq Reads | Cleaned fastq files | Example fastq files |
Build Kallisto Index and Quantify Reads
We will now index the Arabidopsis transcriptome imported from Ensembl. The resulting index only needs to be built once and can be reused for future Kallisto analyses against the same transcriptome. In this tutorial, we have 36 fastq files (18 pairs), all of which you will add to the Kallisto analysis. Kallisto uses a ‘hash-based’ pseudoalignment to deliver extremely fast matching of RNA-Seq reads against the transcriptome index.
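To give a feel for what ‘hash-based’ matching means, here is a deliberately simplified sketch. It is not Kallisto's actual data structure or algorithm; it only illustrates the general idea of looking up a read's k-mers in a hash table and keeping the transcripts that are compatible with all of them. The transcript names and sequences are made up for illustration, and skipping k-mers that are missing from the index is a simplifying choice of this sketch.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return dict(index)


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # toy choice: ignore k-mers that are absent from the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Hypothetical toy transcripts, not real Arabidopsis sequences.
toy_transcripts = {
    "tx1": "ACGTACGTGA",
    "tx2": "ACGTACCTGA",
}
toy_index = build_kmer_index(toy_transcripts, k=5)

# A read drawn from the region where the two toy transcripts differ.
print(pseudoalign("GTACGTG", toy_index, k=5))
```

Note that no base-by-base alignment is computed; the read is only assigned to a set of compatible transcripts, which is what makes this style of matching fast.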
If necessary, login to the CyVerse Discovery Environment.
In the App panel, open the Kallisto v.0.43.1 app or click this link:
Name your analysis, enter comments if desired, and click ‘Next.’ In the app’s ‘Input’ section, under ‘The transcript fasta file supplied (fasta or gzipped)’, browse to and select the transcriptome imported in the previous section.
Under Paired or single end, choose the format used in your sequencing.
Sample data
For the sample data, choose Paired
Note
For single-end data, you will also need to choose fragment length and fragment standard deviation values in the app’s “Options” section. You may also adjust settings for strand-specific reads.
Under ‘FASTQ Files (Read 1)’ navigate to your data and select all the left-read files (usually R1). For paired-end data, also enter the right-read files (usually R2).
Sample data
For the sample data, navigate to /iplant/home/shared/cyverse_training/tutorials/kallisto/00_input_fastq_trimmed
- For FASTQ Files (Read 1) choose all 18 files labeled R1 (e.g. SRR1761506_R1_001.fastq.gz_fp.trimmed.fastq.gz)
- For FASTQ Files (Read 2) choose all 18 files labeled R2 (e.g. SRR1761506_R2_001.fastq.gz_fp.trimmed.fastq.gz)
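Before selecting 18 files on each side, it can help to confirm that every R1 file has a matching R2 file. The following is a minimal sketch, assuming file names follow the `<sample>_R1_...` / `<sample>_R2_...` pattern seen in the sample data; the helper name `pair_fastq_files` is ours and is not part of Kallisto or the Discovery Environment.

```python
import re

# Assumed naming pattern, inferred from the sample file names in this tutorial.
READ_PATTERN = re.compile(r"^(?P<sample>.+?)_R(?P<mate>[12])_")


def pair_fastq_files(filenames):
    """Group FASTQ file names into (R1, R2) pairs keyed by sample prefix."""
    mates = {}
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unrecognized FASTQ name: {name}")
        mates.setdefault(match["sample"], {})[match["mate"]] = name

    pairs = {}
    for sample, found in sorted(mates.items()):
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return pairs


example_files = [
    "SRR1761506_R1_001.fastq.gz_fp.trimmed.fastq.gz",
    "SRR1761506_R2_001.fastq.gz_fp.trimmed.fastq.gz",
]
for sample, (read1, read2) in pair_fastq_files(example_files).items():
    print(sample, read1, read2)
```

If a sample is missing one of its mates, the function raises an error rather than silently producing an unpaired entry.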
If desired, adjust the bootstrap value (see the Kallisto paper for recommendations), then click ‘Next’ to continue.
Sample data
We will use 25.
If desired, adjust the resources required, then click ‘Next.’
Sample data
For the sample data, we will not specify resources.
Finally click ‘Launch Analyses’ to start the job. Click on the Analyses menu to monitor the job and results.
Output/Results
Kallisto jobs will generate an index file and 3 output files per read / read-pair:
Output | Description | Example |
---|---|---|
Kallisto Index | This is the index file Kallisto will map RNA-Seq reads to. | Example Kallisto index |
abundance.h5 | HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information. This file can be read in by Sleuth. | example abundance.h5 |
abundance.tsv | Plaintext file of the abundance estimates; it does not contain bootstrap estimates. In plaintext mode, Kallisto outputs plaintext abundance estimates; alternatively, kallisto h5dump converts an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, and effective length. | example abundance.tsv |
run_info.json | JSON file containing information about the run | example json |
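The plaintext abundance file can be inspected without Sleuth. Below is a minimal sketch using only the Python standard library, assuming the standard Kallisto column header (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`). The transcript names and numbers in the example text are made up so the snippet runs on its own; in practice you would open a downloaded abundance.tsv instead.

```python
import csv
import io


def read_abundance(handle):
    """Parse a Kallisto abundance table into a list of typed records."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],
            "length": int(row["length"]),
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


# Made-up example content with hypothetical transcript names and values.
example_lines = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["tx_A", "1688", "1509.0", "12", "3.5"],
    ["tx_B", "1329", "1150.0", "0", "0"],
]
example_text = "\n".join("\t".join(fields) for fields in example_lines) + "\n"

records = read_abundance(io.StringIO(example_text))
expressed = [r["target_id"] for r in records if r["tpm"] > 0]
print(expressed)
```

For a real file, replace the `io.StringIO(...)` object with `open("abundance.tsv", newline="")`.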
Description of results and next steps
First, this application runs the ‘kallisto index’ command to build the index of the transcriptome. Then the ‘kallisto quant’ command is run to pseudoalign the RNA-Seq reads. Kallisto quantifies RNA-Seq reads against an indexed transcriptome and generates a folder of results for each set of RNA-Seq reads. Sleuth will be used to examine the Kallisto results in RStudio.
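For readers who want to see roughly what the two commands look like outside the Discovery Environment, here is a sketch that assembles the command lines without running them. The options used (`-i`, `-o`, `-b`, `--single`, `-l`, `-s`) are standard Kallisto options, but the exact arguments the CyVerse app passes are an assumption on our part, and all file and folder names are hypothetical.

```python
import shlex


def kallisto_index_cmd(transcriptome_fasta, index_path):
    """Command to build an index from a transcriptome FASTA file."""
    return ["kallisto", "index", "-i", index_path, transcriptome_fasta]


def kallisto_quant_cmd(index_path, out_dir, reads, bootstraps=25,
                       single_end=False, fragment_length=None,
                       fragment_sd=None):
    """Command to quantify one sample's reads against an existing index."""
    cmd = ["kallisto", "quant", "-i", index_path, "-o", out_dir,
           "-b", str(bootstraps)]
    if single_end:
        # Single-end runs need the fragment length and standard deviation.
        if fragment_length is None or fragment_sd is None:
            raise ValueError("single-end runs need fragment_length and fragment_sd")
        cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    elif len(reads) % 2 != 0:
        raise ValueError("paired-end runs need an even number of read files")
    return cmd + list(reads)


# Hypothetical file names, for illustration only.
index_cmd = kallisto_index_cmd("transcriptome.fa.gz", "transcriptome.idx")
quant_cmd = kallisto_quant_cmd(
    "transcriptome.idx", "sample1_output",
    ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"],
)
print(shlex.join(index_cmd))
print(shlex.join(quant_cmd))
```

The index command appears once, while the quant command would be repeated for each read pair, each with its own output folder.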