How can I BLAST to a local copy of the NCBI nr or nt databases?

You can BLAST to a local copy of nr using custom BLAST in Geneious.  

To set up the BLAST executables, go to Tools→Add/Remove Databases→Set Up Search Services. Check "Let Geneious do the setup" and click OK.  Your BLAST folder should be created inside your Geneious Data folder, which is normally located in your User directory.  

Then download the pre-formatted nr database files from here.  For nr make sure you have downloaded all of the tar.gz files.  Uncompress the files and put them in the BLAST/data folder that was created in your Geneious Data folder when you set up custom BLAST.  Restart Geneious, and the nr database should now show up in the list of custom BLAST databases when you click BLAST.  


Comments

  • Jsmith

    Hi Hilary,

    This worked great and seemed to eliminate the timeouts associated with trying to blast large datasets over the web interface.  

    I do have a question about improving performance.  At the moment, the custom blastx on the local nr database is only using 25% of the CPU and very little memory.  Is there a way to tell blastx to use more CPU and memory?  Geneious is allowed to use 26GB and is nowhere near that value.  I'm on the latest Geneious and just upgraded the BLAST suite to 2.29+ per the instructions in a previous KB thread.  My machine is Win 8.1 Pro with 32GB RAM and an AMD quad-core A10-5700.

    Thanks!

     

  • Hilary Miller

    BLAST doesn't require much RAM, so increasing the RAM won't speed it up.  However, you can tell it to use more CPUs under the "More Options" button. The default is 1 CPU, but for a quad core machine you could set it to use 3.  
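
    For anyone running blastx directly from the command line rather than through Geneious, the same idea might look roughly like the sketch below. It assumes the standalone BLAST+ `blastx` program and its `-num_threads` option; the file names and the helper functions are hypothetical, and whether the "More Options" setting maps onto this flag internally is an assumption.

```python
import os
import shutil
import subprocess


def pick_thread_count(total_cores=None, reserve=1):
    """Leave `reserve` cores free for the rest of the system, using at least one."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserve)


def build_blastx_command(query_fasta, db_path, out_path, threads):
    """Return the argument list for a blastx run against a local protein database."""
    return [
        "blastx",
        "-query", query_fasta,
        "-db", db_path,
        "-out", out_path,
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical paths: adjust to your own query file and BLAST/data folder.
    cmd = build_blastx_command(
        "queries.fasta", "BLAST/data/nr", "hits.txt", pick_thread_count()
    )
    print(" ".join(cmd))
    # Only launch the search if blastx is installed and the query file exists.
    if shutil.which("blastx") and os.path.exists("queries.fasta"):
        subprocess.run(cmd, check=True)
```

    On a quad-core machine `pick_thread_count(4)` gives 3, which matches the suggestion above of leaving one core free.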

     

  • Oswaldo Palenzuela

    Hi, I am reviving this thread. I am interested in building a local copy, but restricted to selected organisms. The pre-formatted nr databases pointed to above are just too large and general, so ideally I would like to use selected genomes, like the files listed per organism here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/. There, each organism has a series of files like this: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000336135.1_ASM33613v1/

    I know that I can make local BLAST databases from FASTA files, but how can I combine a few genomes like this into a single database? Loading them all as FASTA and building the custom database from the menu seems a no-go due to the size of the files, doesn't it? In addition, I guess I'd lose the annotations.

     

  • Hilary Miller

    The easiest way to do what you want to do is to search the NCBI Genomes database (through NCBI BLAST rather than local BLAST) and to use an Entrez Query to limit it to the organisms you want.  See http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#entrez_query for more information on how to format this.
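
    As a rough illustration, an Entrez Query that restricts a search to several organisms can be assembled by joining `[Organism]` terms with `OR`. The helper below is a hypothetical sketch and the organism names are only examples; the resulting string is what would go into the Entrez Query field.

```python
def organism_entrez_query(organisms):
    """Join organism names into an Entrez query such as '"A"[Organism] OR "B"[Organism]'."""
    names = [name.strip() for name in organisms if name.strip()]
    if not names:
        raise ValueError("at least one organism name is required")
    # Quote each name so that multi-word names are treated as a single phrase.
    return " OR ".join('"{}"[Organism]'.format(name) for name in names)


if __name__ == "__main__":
    print(organism_entrez_query(["Hydra vulgaris", "Caenorhabditis elegans"]))
```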

    If you want to do it through custom BLAST and make your own database, then you'll probably want to use the GenBank flat file format files, as the FASTA files do not contain any annotations.  It looks like you're working on bacterial genomes (from the link you gave in your post), so you should be able to import files that size into Geneious without any problems and make the database from within Geneious.  It would actually be best to download the genomes directly from the Genome database into Geneious using our NCBI search function (make sure you click "Download full sequence", and then move the files you want into a folder in your Local documents).  Once you have all the genomes you want to search, select them all and go to Add/Remove Databases to make your database.

    Please let us know if either of those options will work for you. 

     

     

     

  • Oswaldo Palenzuela

    Thank you Hilary. What I want to do is BLAST against a custom-made database that includes selected genomes and transcriptome assemblies. The NCBI option won't work because I need to run batch jobs and the online lag is too much. In addition, I need to include additional files from unreleased projects if possible. So what I want to do is the second option. However, I do not work with bacteria, and I want to download about 10 datasets (genomes and transcriptome assemblies) from here:  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.

    I'd like to try a first practical example. Let's say that I want to start with one organism like this one: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000004095.1_Hydra_RP_1.0/ The questions are:

    - Which file(s) do I need to download so that Geneious can format a BLAST database from them?

    - Is there a way to combine such datasets in a single local database, or do I need to make one for each organism/dataset imported in this fashion?

     

     

     

     

  • Hilary Miller

    Normally, the best way to do what you want to do would be to  download the .gbff file from that site, import it into Geneious and click "Download full sequence" to get the full sequences.  Do this for each genome.  You should end up with one sequence list for each genome, which has the sequence and all the annotations, and you can then select them all and make one database for all the genomes from within Geneious.  

    However, there appears to be a bug where Geneious is not importing the gbff files properly - it is only importing the last sequence in the file.  Our developers are looking into this at the moment.  

    So until that is fixed, you could try this:

    1.  Import the .fna files into Geneious - these are FASTA files, so they only contain sequence and not annotations.

    2.  Import the .gff files into the same folder where you imported the FASTA files, and when given the option, choose the FASTA sequences as the reference.  This should import the annotations onto the sequences (I'm assuming here that those GFF files contain the same annotations as the GenBank document).

    3.  Switch to the Annotations table, and ensure it is displaying all the annotations for all sequences.  Then select them all (using Ctrl/Command-A) and go to Edit Annotations.  Under "Track" select "No track".  This will move all the annotations onto the sequence itself (you can't make a BLAST database with annotations if the annotations are on a track).  Then click Save.

    4.  Do that for each genome, and then select them all and make your database.  

    Or alternatively, download what is in the .gbff file directly within Geneious as follows:

    1.  Open the .gbff file in a text editor and take a note of the first and last accession numbers in the file.  E.g. for the link you sent there are 20,000+ sequences and the accessions go from GL020027 to GL040940.

    2.  Then in Geneious select the NCBI-Nucleotide folder at the bottom of the Sources panel.  Click "More Options" next to the NCBI Search panel (top right of the window) and enter "Accession" "is" "GL020027:GL040490".  This will download all the documents for the genome.  You'll then need to select them all and click "Download Full Sequence".  It may take a while depending on the speed of your internet connection. 

    3.  Then move all the sequences to a folder in your local documents, and combine them all into one sequence list. 

    4. Do this for each genome, then select them all and make your database.  
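
    For a file with 20,000+ records, picking out the first and last accessions by eye in a text editor can be tedious. The sketch below shows one way it might be scripted, assuming the .gbff file is a standard GenBank flat file in which each record has an `ACCESSION` line, and that the records in the file cover one contiguous accession range. The function name and the file name are hypothetical.

```python
def accession_range(gbff_lines):
    """Return 'FIRST:LAST' built from the first and last ACCESSION lines seen."""
    first = None
    last = None
    for line in gbff_lines:
        if not line.startswith("ACCESSION"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # ACCESSION line without a value; skip it
        accession = fields[1]
        if first is None:
            first = accession
        last = accession
    if first is None:
        raise ValueError("no ACCESSION lines found")
    return "{}:{}".format(first, last)


if __name__ == "__main__":
    import os

    path = "genome.gbff"  # hypothetical file name
    if os.path.exists(path):
        with open(path) as handle:
            print(accession_range(handle))
```

    The printed string has the same first:last form as the value entered in the "Accession" "is" search field in step 2.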

    You should be able to do both options on an average machine - I've tested it with the Hydra genome and C. elegans on my machine, where 4GB of RAM is allocated to Geneious, and working with files of that size was feasible.  Loading annotation tracks on a genome with lots of contigs (Hydra) was slow, but once you've transferred the annotations to the sequence it works fine.

  • Oswaldo Palenzuela

    Hi Hilary. Thank you, it seems that I can proceed with the second option, downloading the accession numbers listed in the .gbff file directly. Combining all the sequences into a list takes quite some time, and doing so in an external MySQL database has been painful due to problems with the package size, but it is doable (I am still downloading and building the data, so I don't know when I will reach the point of actually building the database from these lists). A couple of quick questions related to this:

     

    - Once the lists combining all the documents of a genome/transcriptome are created, can the actual documents be deleted, or must they be kept together with the list?  (My guess is that the list is only a collection of links and does not contain the data.)

    - For building the database, should I select just the lists and any other FASTA files that I am interested in including?

    - The following is related to the topic but not to this particular matter. When downloading the NCBI preformatted nr database in order to run local BLAST against it, how should the database volumes be arranged in the BLAST/data directory?  There are 26 volumes (nr.XX.tar.gz) that, once decompressed, create 26 folders, each containing files (numbered like the folder/volume) with 12 different extensions (nr.00.phd, nr.00.phi, nr.00.phr and so on) plus an additional nr.pal that seems to be repeated in each of the folders. Should I move everything out of the folders into the BLAST/data directory or keep the folder structure? And in either case, what about nr.pal? If it is moved out of the folders there will be 26 copies of the same file. Thank you very much, your support is being very helpful.

  • Hilary Miller

    The sequence lists contain the actual documents.  When you run Group sequences into a list, Geneious also keeps the original unlisted sequences, so you can delete these; otherwise you'll end up with two copies of everything.  And yes, for your database just select the sequence lists and any other files you want.  If you have other FASTA files, it would be best to import them into Geneious first, as you can't make a database from a combination of an uploaded file and a file in Geneious: it has to be one or the other.

    As far as the preformatted databases go, you should put all the individual files (e.g. nr.00.phd, nr.00.phi etc.) directly in the BLAST/data folder - make sure you take them out of the folders you get when you uncompress them.  You should only need one copy of the nr.pal file, and having all the ".00" volumes is critical for getting Geneious to recognise the database.  You'll need to restart Geneious once all the files are in your data folder, and it should automatically pick them up as one database.  We have made some improvements to the handling of preformatted, multivolume databases for the next releases (8.0.3 and 7.1.8), so if you find Geneious doesn't recognise it (which appears to happen in some cases) you might be best to wait until the next release.
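
    If moving the files out of 26 folders by hand is a chore, the unpacking could be scripted along the lines of the sketch below. It assumes the volumes are ordinary .tar.gz archives, uses only the Python standard library, and writes every regular file straight into the data folder while keeping a single copy of any repeated file such as nr.pal. The function name and the paths are hypothetical.

```python
import glob
import os
import shutil
import tarfile


def extract_flat(archive_path, data_dir):
    """Extract regular files from a .tar.gz into data_dir, dropping any subfolders."""
    os.makedirs(data_dir, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            name = os.path.basename(member.name)
            target = os.path.join(data_dir, name)
            # Files that repeat across volumes (e.g. nr.pal) are written only once.
            # Note this also means an existing file is never overwritten.
            if os.path.exists(target):
                continue
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written


if __name__ == "__main__":
    # Hypothetical locations: adjust to your download folder and Geneious Data folder.
    for volume in sorted(glob.glob("downloads/nr.*.tar.gz")):
        print(volume, extract_flat(volume, "Geneious Data/BLAST/data"))
```

    Restart Geneious afterwards, as described above, so that it picks up the files in the data folder.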

  • Oswaldo Palenzuela

    I keep getting errors with the local nr database. I get the same error with 3 different downloads of the preformatted nr. The checksum is OK, the database is picked up and BLAST is launched, but after quite some time this comes up:

    Geneious 7.1.7 on Mac OS X (x86_64) with Java 1.6.0_65

    Message:
    Search failed.

    Stacktrace:
    com.biomatters.geneious.publicapi.databaseservice.DatabaseServiceException: Could not find the database file:
    /Users/MYSELF/Documents/XXXXXX/XXXXX/Geneious 7.1 Data/BLAST/data/nr.aa

    There is no nr.aa file in the preformatted nr database but, interestingly, the database is listed by Geneious as nr(AA) in the list of custom databases.

    I guess I'll have to wait for the next release then.

     

     

     

  • Oswaldo Palenzuela

    Hi again. I have updated to 8.0.3 and the good news is that it does recognise the multivolume local nr database, completing a search without error. The bad news is performance. I didn't expect it to be exactly speedy, but 25 minutes for a single sequence (1200 bp) seems too much. I am on a 3.2GHz Intel i5 with plenty of RAM for Geneious, and I chose 3 processors for the job. During the run, %CPU use for the process was low (it started strong, around 200% as expected, but soon came down to around 20-30%). I observed the process in the OS X Activity Monitor and saw that the database files were accessed quite slowly, seemingly with long lags.  The statistics for the process reported lots of faults (4540000). Is this what I should expect with such a large database, or can I improve these figures somehow?

  • Hilary Miller

    Hi Oswaldo,

    BLAST has separate memory allocation to Geneious because it is a separate process, so the memory you have allocated to Geneious won't have any bearing on how fast BLAST runs.  The memory pressure graph (on Mavericks) is the best way to see if it doesn't have enough RAM - I'm not sure whether you can read much into the "faults".  BLAST isn't particularly memory intensive anyway and I suspect that it is just slow because it is a huge database and you've only given it a couple of threads.  You could try returning fewer hits and return only the "matching region" rather than the full sequence.  
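
    For a search run directly with the BLAST+ command line tools, the closest equivalents are probably capping the number of hits with `-max_target_seqs` and asking for tabular output (`-outfmt 6`), which reports the coordinates of the aligned regions rather than full sequences. The sketch below only builds such a command. The helper name, paths and default values are hypothetical, and how these flags correspond to the Geneious options is an assumption.

```python
def build_limited_search(program, query_fasta, db_path, out_path,
                         threads=3, max_hits=10):
    """Return a BLAST+ argument list that caps hits and writes tabular output."""
    if max_hits < 1:
        raise ValueError("max_hits must be at least 1")
    return [
        program,                            # e.g. "blastx" or "blastn"
        "-query", query_fasta,
        "-db", db_path,
        "-out", out_path,
        "-num_threads", str(threads),
        "-max_target_seqs", str(max_hits),  # keep only the best few subjects
        "-outfmt", "6",                     # tabular: one line per aligned region
    ]


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_limited_search("blastx", "query.fasta",
                                        "BLAST/data/nr", "hits.tsv")))
```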

  • Pinar Demetci

    Hello,

    I've also tried setting up a local database in Geneious for BLAST using Hilary Miller's method with .fna and .gff files. I am able to import the .fna files with no problems. However, when I try to import the .gff file for annotations, I get this error:

    " Failed to import 69704.assembled.gff (Sanger GFF files importer):  Error parsing the strand. Found "1". Expected '+', '-', or '.'. "

    Does anyone know how I can work around this?

  • Hilary Miller

    Hi Pinar

    This indicates that your GFF file is incorrectly formatted.  It might be missing a column as it sounds like you have numbers where the "strand" column should be.  I suggest you open it in a text editor to check the formatting (see http://www.sequenceontology.org/gff3.shtml for format specification).  If you can't identify the issue please contact us directly via the support button in Geneious, as we will probably need to see a copy of your file. 
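
    If the file is too large to check by eye, a small script can point at the offending lines. The sketch below assumes the usual GFF layout of nine tab-separated columns with the strand in column 7, and flags any line that has a different column count or a strand value other than '+', '-' or '.' (the values named in the error message). The function name is hypothetical.

```python
def find_bad_gff_lines(lines, valid_strands=("+", "-", ".")):
    """Return (line_number, problem) pairs for malformed GFF feature lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(
                (number, "expected 9 tab-separated columns, found {}".format(len(columns)))
            )
        elif columns[6] not in valid_strands:
            problems.append((number, "strand column holds {!r}".format(columns[6])))
    return problems


if __name__ == "__main__":
    import os

    path = "69704.assembled.gff"  # file name taken from the error message above
    if os.path.exists(path):
        with open(path) as handle:
            for number, problem in find_bad_gff_lines(handle):
                print("line {}: {}".format(number, problem))
```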

    Hilary