Batch BLASTp

Jun 22, 2010 at 6:47 PM

Hi,

I plan to use the BLAST service offered in MBF to submit roughly 4000 protein sequences for alignment to the NCBI nr database. I just wanted to get a feeling for what the MBF community thinks is the best approach for submitting a batch this large. Some questions I have are:

1) Does Azure BLAST service contain the NCBI nr database and is it ready for use, or should I just use the NCBI QBLAST service?

2) How should I queue the individual jobs and is there already a method written to accomplish this?

3) Related to 2), the NCBI QBLAST documentation (http://www.ncbi.nlm.nih.gov/BLAST/Doc/node60.html) says not to exceed 50 threads. I'm guessing that a logical approach would be to submit 40 jobs in parallel with 100 proteins per job. This would result in 40 threads. Is my thought process correct?

I look forward to hearing your suggestions and comments.

Vince

Jun 28, 2010 at 6:06 AM

Vince,

Thanks for your queries. Please see our responses below:

#1 Azure BLAST has the “nr” database disabled for now, and we are unsure when it will be back. The “NCBI QBLAST”, “EBI WU-BLAST”, or “BioHPC P-BLAST” service is recommended at this point.

#2 Use “MBF.Web.Blast.IBlastServiceHandler” to query any BLAST service. That way, it will be faster and easier to change the underlying service in the future (let's say Azure BLAST instead of NCBI QBLAST). Currently NCBI QBLAST, EBI WU-BLAST, Azure BLAST and BioHPC P-BLAST are supported in MBF.

Here is skeleton code to submit individual BLAST jobs using MBF:

<<
IBlastServiceHandler service = new NCBIBlastHandler(); // Or any other supported BlastHandler

// Define the connection configuration
ConfigParameters configParams = new ConfigParameters();
service.Configuration = configParams;

// Define the BLAST settings (Program, Database and other parameters)
BlastParameters searchParams = new BlastParameters();

Sequence sequence = // Load a sequence from string or file.

// Create the request
string jobID = service.SubmitRequest(sequence, searchParams);

// Queue the job
ServiceRequestInformation info = service.GetRequestStatus(jobID);

// There are two ways of fetching and parsing the results.

// 1. ****************************************
// Keep polling “info = service.GetRequestStatus(jobID);” until Error or until the retry limit is exceeded.
// Parse the result on Success.
IBlastParser blastXmlParser = new BlastXmlParser();
resultsObject = blastXmlParser.Parse(
        new StringReader(service.GetResult(jobID, searchParams)));
// ****************************************

// OR

// 2. ****************************************
// Use an event handler that is invoked when the results are available.
// Register “IBlastServiceHandler.RequestCompleted” before submitting the request (SubmitRequest).
// The registered event handler will get the parsed result.
// ****************************************
>>
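The polling in option 1 could look roughly like the sketch below. It is deliberately independent of MBF: the `JobStatus` enum, the `BlastPolling` class and the `PollUntilDone` method are hypothetical names made up for illustration, and the status check is passed in as a delegate, so nothing here assumes what MBF's status type looks like.

```csharp
using System;
using System.Threading;

// Hypothetical status values for illustration only; MBF's own status type may differ.
public enum JobStatus { Waiting, Ready, Error }

public static class BlastPolling
{
    // Calls getStatus until it returns something other than Waiting,
    // or until maxAttempts calls have been made in total.
    // Sleeps for `delay` between calls and returns the last status seen.
    public static JobStatus PollUntilDone(
        Func<JobStatus> getStatus, int maxAttempts, TimeSpan delay)
    {
        if (getStatus == null) throw new ArgumentNullException("getStatus");
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        JobStatus status = getStatus();          // first attempt
        for (int attempt = 1;
             attempt < maxAttempts && status == JobStatus.Waiting;
             attempt++)
        {
            Thread.Sleep(delay);                 // be polite to the remote service
            status = getStatus();
        }
        return status;
    }
}
```

In real code the delegate would wrap the `service.GetRequestStatus(jobID)` call from the skeleton and map its result onto these three values; that mapping is left out because it depends on the handler in use. If the returned status is still `Waiting`, the retry limit was hit and the job should be treated as timed out.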

#3 Sounds good. A couple of points to note:

> As of now only BioHPC P-Blast implements support for querying multiple sequences (jobID = service.SubmitRequest(/*List of sequences*/, searchParams);).

> The maximum file size the BioHPC P-BLAST service accepts is 40MB, so the number of threads will have to be adjusted based on the size of the input sequences.
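To make the two points above concrete, here is a minimal sketch of splitting the input into batches that respect both a per-job sequence count and a per-job size cap. It does not call MBF at all; `BlastBatching` and `MakeBatches` are hypothetical names, the records are assumed to be FASTA text held in plain strings, and the size is assumed to be counted as UTF-8 bytes plus one newline per record.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BlastBatching
{
    // Splits FASTA records into batches of at most maxPerBatch records and
    // at most maxBytesPerBatch bytes (UTF-8, plus one newline per record).
    // A single record larger than the byte cap still gets a batch of its own.
    public static List<List<string>> MakeBatches(
        IEnumerable<string> fastaRecords, int maxPerBatch, long maxBytesPerBatch)
    {
        if (fastaRecords == null) throw new ArgumentNullException("fastaRecords");
        if (maxPerBatch < 1) throw new ArgumentOutOfRangeException("maxPerBatch");
        if (maxBytesPerBatch < 1) throw new ArgumentOutOfRangeException("maxBytesPerBatch");

        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (string record in fastaRecords)
        {
            long size = Encoding.UTF8.GetByteCount(record) + 1;
            bool full = current.Count >= maxPerBatch
                        || currentBytes + size > maxBytesPerBatch;
            if (current.Count > 0 && full)
            {
                // Close the current batch and start a new one.
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }
            current.Add(record);
            currentBytes += size;
        }
        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

With 4000 small protein records, `MakeBatches(records, 100, 40L * 1024 * 1024)` yields 40 batches of 100, which matches the 40-job plan from the question. Whether the service counts 40MB in units of 1024 or 1000 is not stated in this thread, so leaving some headroom below the cap is the safer choice.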

Let us know if you need further information.

-Vivek