Leonard and I have exchanged some emails and I wanted to recap for the community here.
The Sequence Assembler application is provided as a demonstration only: it shows what can be done with MBF when the library is integrated with a GUI application, in this case one built using Windows Presentation Foundation and Silverlight. It has not been tested for lab use and would perform poorly with any significant dataset.
None of our work has been designed to take advantage of a cluster at this point. In other words, it will run in that environment, but it is not architected specifically to exploit clustered hardware.
In our exchanges using his dataset, he had passed a k-mer value greater than 32 as a parameter to Padena and got the message ‘values of K over 32 are not supported’. Further details are available in the Padena paper, but note that k-mer length is a parameter of the Padena assembly, not the length of the input sequences in the FASTA file.
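To illustrate the difference between k-mer length and read length, here is a minimal Python sketch. The function name `kmers` is hypothetical and is not part of MBF or Padena; it only shows that k is the size of the overlapping substrings taken from each read, so it can be much smaller than the reads themselves.

```python
def kmers(read, k):
    """Return every overlapping substring of length k found in the read."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length n yields n - k + 1 k-mers (none if the read is shorter than k).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGTACGTAC"      # one 10-base read
print(kmers(read, 4))    # the overlapping 4-mers taken from that read
print(kmers("ACG", 4))   # a read shorter than k contributes no k-mers
```

So a FASTA file of reads that are each longer than 32 bases can still be assembled with k set to 32 or less.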
Any dataset that contains many ‘N’ ambiguity characters, or any characters other than ACGT, is not compatible with the current version of the Padena algorithm. There are two approaches to dealing with this:
1. Filter the reads, excluding those that contain N characters, and use the smaller remaining dataset in the assembly. This will work, but it will lose a lot of data that could be valuable.
2. Split reads at the N characters, so that the string AAAAAAAAAANGGGGGGGGGG becomes two reads, AAAAAAAAAA and GGGGGGGGGG, and so on. This preserves more of the data and would be the better solution.
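The two approaches above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `filter_reads` and `split_reads` and the `min_length` parameter are hypothetical, reads are assumed to be plain strings already loaded from the FASTA file, and none of this is part of MBF or PadenaUtil.

```python
import re

VALID_BASES = set("ACGT")


def filter_reads(reads):
    """Approach 1: keep only non-empty reads made up entirely of A, C, G, T."""
    return [r for r in reads if r and set(r.upper()) <= VALID_BASES]


def split_reads(reads, min_length=1):
    """Approach 2: cut each read at every run of non-ACGT characters.

    min_length is a hypothetical knob for discarding very short fragments;
    a fragment shorter than k cannot contribute any k-mers to the assembly.
    """
    fragments = []
    for r in reads:
        for piece in re.split(r"[^ACGT]+", r.upper()):
            if len(piece) >= min_length:
                fragments.append(piece)
    return fragments


reads = ["AAAAAAAAAANGGGGGGGGGG", "ACGTACGT", "ACNNT"]
print(filter_reads(reads))               # only the read with no ambiguity characters
print(split_reads(reads))                # every fragment on either side of an N
print(split_reads(reads, min_length=8))  # the same, dropping fragments under 8 bases
```

Filtering discards the whole first read, while splitting keeps both of its 10-base halves, which is why the second approach preserves more data.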
The filtered reads can be assembled using the PadenaUtil command-line utility, and the assembly should take only a minute or so with the dataset he provided. I would recommend setting the parameter k to 32 initially, but for the best results you should try varying all the parameters; read the Padena paper to learn more.
I hope this information is helpful to other members of the community as well.