This guide outlines how to download, install, and configure the SRA Toolkit. Detailed usage information for the individual tools in the SRA Toolkit can be found on the tool-specific documentation pages.
The NCBI SRA Toolkit enables reading ("dumping") of sequencing files from the SRA database and writing ("loading") files into the .sra format (note that this is not required for submission). The Toolkit source code is provided in the form of the SRA SDK and may be compiled with GCC. However, pre-built executables are available for Linux, Windows, and Mac OS X, and we highly recommend using them whenever possible.
Note: For most users, the Toolkit programs (fastq-dump, sam-dump, etc.) will not be on the PATH environment variable, so you may need to specify the directory where the Toolkit is located when calling them. See the examples below for how 'fastq-dump' would be called in different circumstances:
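The calling conventions can be sketched roughly as follows. The install location `$HOME/sratoolkit/bin` and the helper `toolkit_cmd` are hypothetical, chosen only for illustration; substitute the directory where you unpacked the Toolkit.

```shell
# Hypothetical install location: adjust to where you unpacked the Toolkit.
TOOLKIT_BIN="${TOOLKIT_BIN:-$HOME/sratoolkit/bin}"

# Print the command to use for a given tool: the bare name if it is
# already on PATH, otherwise the full path inside TOOLKIT_BIN.
toolkit_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$TOOLKIT_BIN/$1"
    fi
}

# Case 1: the current directory is the Toolkit's bin directory
#   ./fastq-dump --help
# Case 2: called from anywhere, using the full path
#   "$TOOLKIT_BIN/fastq-dump" --help
# Case 3: after adding bin to PATH for this shell session
#   export PATH="$TOOLKIT_BIN:$PATH"
#   fastq-dump --help
```

For example, `"$(toolkit_cmd fastq-dump)" --help` would run the tool whether or not it is on PATH, provided `TOOLKIT_BIN` points at the right directory.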
The Toolkit comes with a default configuration that will work for most users. You may elect to perform the following tests to confirm that your configuration is working correctly. The default location for the "download repository" is:
Note that if the tests fail, or if you wish to specify the download location for files sourced from NCBI, you should configure your Toolkit installation. During normal operation, the Toolkit may be required to download the following types of data to the default location:
For the test, we are using an arbitrary dataset, SRR390728 (RNA-Seq (polyA+) analysis of DLBCL cell line HS0798), from the National Cancer Institute’s Cancer Genome Characterization Initiative (CGCI) Project. It is a reasonably small SRA dataset that contains aligned (reference-compressed) data, allowing us to test multiple aspects of the toolkit simultaneously.
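One way such a test might be run is sketched below. The accession comes from this guide; the choice of `fastq-dump -X 5 -Z` (print the first five spots to standard output) and the wrapper name `toolkit_smoke_test` are illustrative assumptions, not steps prescribed here. The call needs network access to NCBI.

```shell
# Accession taken from this guide; the wrapper name is hypothetical.
ACCESSION="SRR390728"

# Print the first five spots of a run to standard output.
#   -X 5  stop after spot 5
#   -Z    write to stdout instead of a file
toolkit_smoke_test() {
    if ! command -v fastq-dump >/dev/null 2>&1; then
        echo "fastq-dump not found on PATH; call it by its full path instead" >&2
        return 127
    fi
    fastq-dump -X 5 -Z "$1"
}

# Usage (requires network access to NCBI):
#   toolkit_smoke_test "$ACCESSION"
```

If the Toolkit is configured correctly, the call should print a handful of FASTQ records; an error instead suggests that the configuration step is needed.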
If you are using SRA Toolkit version 2.4 or higher, you should run the configuration tool, located within the bin subdirectory of the Toolkit package.
Go to the "bin" subdirectory for the Toolkit and run the following command:
This tool will set up the download/cache area for downloaded files and references.
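A minimal sketch of this step, assuming the configuration tool is `vdb-config` and that its interactive mode is started with `-i`, as in recent Toolkit releases. The install path and the wrapper `run_config_tool` are hypothetical.

```shell
# Hypothetical install location: adjust to where you unpacked the Toolkit.
TOOLKIT_BIN="${TOOLKIT_BIN:-$HOME/sratoolkit/bin}"

# Start the interactive configuration tool from the bin subdirectory.
# Assumes the tool is named vdb-config and accepts -i (interactive mode).
run_config_tool() {
    if [ ! -x "$TOOLKIT_BIN/vdb-config" ]; then
        echo "vdb-config not found in $TOOLKIT_BIN" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$TOOLKIT_BIN" && ./vdb-config -i )
}

# Usage:
#   run_config_tool
```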
A window will open and present the screen below.
The settings displayed here have the default values. Of primary importance is the Workspace Location, which by default is in the ncbi directory within your home directory.
If remote access is enabled (the default), the Toolkit will contact NCBI on demand over HTTP to retrieve the files it needs to complete your commands.
For your commands to complete successfully, the Toolkit needs sufficient free space. Genomics datasets are quite large; you may need hundreds of GB. This is the primary concern when choosing the Workspace Location: make sure it has enough free space for what you intend to do.
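To check the available space before settling on a location, something like the following sketch may help. It assumes a POSIX `df` and the default workspace described above (the ncbi directory in your home directory); `WORKSPACE` and `free_gb` are hypothetical names.

```shell
# Default workspace per this guide: the ncbi directory in your home directory.
WORKSPACE="${WORKSPACE:-$HOME/ncbi}"

# Print the free space, in whole GB, on the filesystem holding a directory.
# If the directory does not exist yet, fall back to its nearest existing parent.
free_gb() {
    dir="$1"
    while [ ! -d "$dir" ] && [ "$dir" != "/" ]; do
        dir=$(dirname "$dir")
    done
    # df -Pk: POSIX output format, 1024-byte blocks; column 4 is "Available".
    df -Pk "$dir" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

echo "Free space for $WORKSPACE: $(free_gb "$WORKSPACE") GB"
```

Running the same function against a candidate alternative, for instance `free_gb /path/to/larger/disk`, gives a quick comparison before changing the Workspace Location.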
If you need to change the Workspace Location, use the tab key to move the cursor (shown red here) to the change button and press space or enter.
This will bring up the file navigation dialog (see below).
If you already know the path to the directory, you may use the Goto button to directly enter that path. Once you have entered or navigated to the correct directory, press tab to get to the OK button to return to the previous screen.
Once you are happy with the settings, use the tab key to get to the Save button and press enter or space.
Press enter or space one more time, then tab to the Exit button and press enter or space. You will then be returned to your shell command prompt.