Using HPC Pack 2008 to launch Microsoft Assembler as a cluster application

Sep 28, 2011 at 5:06 PM
Hey everybody! I'm using a three-node cluster built on Microsoft technologies, running Windows Server HPC 2008 with HPC Pack 2008. I need to run Microsoft Assembler, but I can't, because I get the following error:

Unable to open standard input file on node NODO1:Exception 'Failed to open standard input file 'D:\aplicaciones\Microsoft Biology Initiative\2.0\MBT\Sequence Assembler', Access is denied' reported creating the task.

After that I got this:

Aborting: failed to launch 'Sequence Assembler.exe' on nodo1 Error (2) The system cannot find the file specified.

I think I'm launching the job the right way. Thanks for the help.
Coordinator
Sep 28, 2011 at 6:05 PM

Leonard, thanks for your post. Just to clarify: you are trying to run the Sequence Assembler GUI-based sample app on a cluster? I guess the first thing to check is the spelling of your file location. Presumably you have done that already, but I want to make sure.

If you want to flesh out more of what you are trying to accomplish, perhaps we can assist you further. I was curious, for instance, why you wouldn't be using the Padena assembler in this case.

As a side note, we are planning our V2 release shortly (although that should not make a difference for this report).

If you care to share your location and MBF plans, please either post here or reply to me privately at a-rickbe@microsoft.com.

Thanks

Coordinator
Oct 5, 2011 at 9:18 PM

Leonard and I have exchanged some emails and I wanted to recap for the community here.

The Sequence Assembler application is provided as a demonstration application only. It shows what can be done with MBF when the library is integrated with a GUI application, in this case one built using Windows Presentation Foundation and Silverlight. The Sequence Assembler application has not been tested for lab use and would not perform well with any significant dataset.

None of our work has been designed to take advantage of a cluster at this point. In other words, it will work in that environment, but it is not architected specifically to take advantage of clustered hardware.

In our exchanges, using his dataset, he had passed a k-mer value greater than 32 as a parameter to Padena and got the message ‘values of K over 32 are not supported’. Further details are available in the Padena paper, but note that the k-mer length is a parameter of the Padena assembly; it is not the length of the input sequences in the FASTA file.
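To illustrate the difference between k and the read length, here is a minimal Python sketch. It is not MBF or Padena code; the function name kmers and the constant MAX_K are made up for illustration, and the limit of 32 is simply taken from the message quoted above.

```python
MAX_K = 32  # assumption: mirrors the limit in the quoted error message


def kmers(read, k):
    """Return every overlapping substring of length k found in a read."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if k > MAX_K:
        raise ValueError(f"values of K over {MAX_K} are not supported")
    # A read of length n contains n - k + 1 k-mers (none if the read is shorter than k).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGTACGTAC"      # one read of length 10
print(kmers(read, 4))    # the k-mers of length 4 contained in that read
```

So a long read is fine as input; it is only the k parameter itself that has to stay at 32 or below.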

Any dataset that contains many ‘N’ ambiguity characters, or any characters other than ACGT, is not compatible with the current version of the Padena algorithm. There are two approaches to dealing with this:

1. Filter the reads, excluding those that contain N characters, and use the smaller remaining dataset in the assembly. This will work, but it will lose a lot of data that could be valuable.

2. Split reads containing N characters, so that the string AAAAAAAAAANGGGGGGGGGG would become two reads, AAAAAAAAAA and GGGGGGGGGG, and so on. This will preserve more of the data and would be the better solution.
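Both approaches can be sketched in a few lines of Python. This is only an illustration under some assumptions, not part of MBF: the reads are plain upper-case strings already pulled out of the FASTA file, and the names filter_reads, split_reads and min_length are hypothetical.

```python
import re

# Any run of characters other than A, C, G or T counts as ambiguous
# (assumes the reads are upper-case).
AMBIGUOUS = re.compile(r"[^ACGT]+")


def filter_reads(reads):
    """Approach 1: keep only reads made up entirely of A, C, G and T."""
    return [r for r in reads if r and not AMBIGUOUS.search(r)]


def split_reads(reads, min_length=1):
    """Approach 2: cut each read at ambiguous characters and keep the pieces."""
    pieces = []
    for r in reads:
        for piece in AMBIGUOUS.split(r):
            # Drop empty pieces and anything shorter than min_length.
            if len(piece) >= min_length:
                pieces.append(piece)
    return pieces


reads = ["AAAAAAAAAANGGGGGGGGGG", "ACGTACGT", "NNNN"]
print(filter_reads(reads))  # only the reads with no ambiguous characters
print(split_reads(reads))   # clean reads plus the fragments around each N
```

When splitting, it probably makes sense to set min_length to the k you plan to use, since a fragment shorter than k cannot contribute a k-mer to the assembly.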

The filtered reads can be assembled using the PadenaUtil command-line utility, which should take only a minute or so to complete with the dataset he provided. I would recommend setting the parameter k to 32 initially, but for the best results you should try varying all the parameters; read the Padena paper to learn more.

I hope this information is helpful to other members of the community as well.