Sequence creation without exceptions?

May 31, 2010 at 3:48 PM


I am excited to have found MBF and want to use it for a small project, but I am wondering how to create a sequence that can contain only the four basic nucleotides (no ambiguities), and how to do this without throwing exceptions. (*)

At the moment I have:

string str;
... // fills str from user entry
Sequence seq = new Sequence(DnaAlphabet.Instance, str);

but I need an alphabet containing only A, T, C and G. It makes no sense for my application to allow ambiguities. Is there another way to do this?

e.g. I could make my own BasicDnaAlphabet class, but I see that IAlphabet contains methods specific to ambiguities so that would smell bad.
Also, the complement/reverse complement methods would not then work, because they are hard-coded to check for DnaAlphabet or RnaAlphabet. (**)

I guess I could loop through the Sequence checking for IsAmbiguous, IsGap, IsTermination, but it would be better if I could test the string for these first. I need to use this for UI validation so it has to be fast.



(*) I guess I imagined something like:
DnaSequence seq = DnaSequence.Parse(BasicDnaAlphabet.Instance, str);
// or, without exceptions:
DnaSequence seq;
bool success = DnaSequence.TryParse(BasicDnaAlphabet.Instance, str, out seq);

This would fit the .NET Parse/TryParse idiom well, allow checking for errors without throwing, and also follow the .NET guideline not to do much work in a constructor.
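
A minimal sketch of the Parse/TryParse idea, independent of MBF. `StrictDnaSequence` is a hypothetical type invented for illustration, not an MBF class, and treating input as case-insensitive while rejecting empty strings are assumptions rather than anything MBF specifies.

```csharp
using System;

// Hypothetical type for illustration only; it is not part of MBF.
public sealed class StrictDnaSequence
{
    private readonly string bases;

    // Private constructor: all validation happens in TryParse/Parse,
    // so the constructor itself does no real work.
    private StrictDnaSequence(string bases)
    {
        this.bases = bases;
    }

    public int Count { get { return bases.Length; } }

    public override string ToString() { return bases; }

    // Returns false instead of throwing when the input is null, empty,
    // or contains anything other than A, C, G or T (case-insensitive).
    public static bool TryParse(string text, out StrictDnaSequence result)
    {
        result = null;
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        foreach (char c in text)
        {
            switch (char.ToUpperInvariant(c))
            {
                case 'A':
                case 'C':
                case 'G':
                case 'T':
                    break; // leaves the switch only; the loop continues
                default:
                    return false;
            }
        }

        result = new StrictDnaSequence(text.ToUpperInvariant());
        return true;
    }

    // Throwing counterpart, built on TryParse.
    public static StrictDnaSequence Parse(string text)
    {
        StrictDnaSequence result;
        if (!TryParse(text, out result))
        {
            throw new FormatException("Input must contain only A, C, G or T.");
        }
        return result;
    }
}
```

For UI validation, only `TryParse` would be called on each change, so no exception is ever raised for ordinary bad input.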

(**) This also smells bad: it seems to me that it should not be possible to create a protein sequence object that has nucleotide-specific methods on it.
Could you have instead:

INucleotideSequence : ISequence
IAminoAcidSequence : ISequence

with a similar arrangement for the standard implementations?

This would make it easier to add protein- or nucleotide-specific methods without always having to put in checks as to the molecule type.
In general, shouldn't any method that consumes an interface be able to consume any implementation of that interface? Otherwise the framework cannot be extended with alternative implementations.
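
A rough sketch of how such a split could look. Every name here is hypothetical: `ISequenceSketch` is a simplified stand-in for MBF's `ISequence`, and `SimpleDna` is a toy implementation written only to show that nucleotide-specific members need no molecule-type checks once they live on their own interface.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Simplified stand-in for MBF's ISequence (hypothetical).
public interface ISequenceSketch : IEnumerable<char>
{
    int Count { get; }
}

// Nucleotide-only operations live here, so a protein sequence
// can never expose them.
public interface INucleotideSequence : ISequenceSketch
{
    INucleotideSequence ReverseComplement();
}

public interface IAminoAcidSequence : ISequenceSketch
{
}

// Toy implementation restricted to upper-case A, C, G and T.
public sealed class SimpleDna : INucleotideSequence
{
    private readonly string bases;

    public SimpleDna(string bases)
    {
        if (bases == null) throw new ArgumentNullException("bases");
        this.bases = bases;
    }

    public int Count { get { return bases.Length; } }

    public INucleotideSequence ReverseComplement()
    {
        char[] result = new char[bases.Length];
        for (int i = 0; i < bases.Length; i++)
        {
            // Write each complemented base at the mirrored position.
            result[bases.Length - 1 - i] = ComplementOf(bases[i]);
        }
        return new SimpleDna(new string(result));
    }

    private static char ComplementOf(char c)
    {
        switch (c)
        {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: throw new ArgumentException("Not an unambiguous DNA base: " + c);
        }
    }

    public IEnumerator<char> GetEnumerator()
    {
        return ((IEnumerable<char>)bases).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    public override string ToString() { return bases; }
}
```

A method that accepts `INucleotideSequence` then works with any implementation of it, and passing an `IAminoAcidSequence` to that method is a compile-time error rather than a runtime check.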

May 31, 2010 at 8:00 PM

Hi Jon,

While I understand the desire to restrict the alphabet to disallow ambiguity characters, I am not sure I understand your context well enough to know where this class is a performance bottleneck. If you make it a pre-condition and restrict the acceptable input strings to 'A', 'T', 'C', or 'G', then doesn't everything else fall out? None of your transformations should introduce ambiguity where none exists; if one did, I should think that would be a flaw in the program.

string str;
... // fills str from user entry
... // Validate str contains only ATCG or reject the input
Sequence seq = new Sequence(DnaAlphabet.Instance, str);
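
The "validate str" step above could be sketched as a single pass over the string. `DnaInput` and `IndexOfFirstInvalidBase` are hypothetical names, and accepting lower-case letters is an assumption; returning the offending position is a choice made here so a UI can highlight it.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of MBF.
public static class DnaInput
{
    // Returns -1 when every character is A, C, G or T (case-insensitive),
    // otherwise the zero-based index of the first character that is not.
    public static int IndexOfFirstInvalidBase(string text)
    {
        if (text == null) throw new ArgumentNullException("text");

        for (int i = 0; i < text.Length; i++)
        {
            char c = char.ToUpperInvariant(text[i]);
            if (c != 'A' && c != 'C' && c != 'G' && c != 'T')
            {
                return i;
            }
        }
        return -1;
    }
}
```

When the helper returns -1, the string can be handed to `new Sequence(DnaAlphabet.Instance, str)` as shown above; otherwise the input is rejected before any sequence object is built.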

With that said, if we can find and understand a context where strict conformance to ATCG is a significant win with 'broad' user support, I know we'd consider adding a 'BasicDnaAlphabet' or something similar. If you can be more specific about the impact of using a more general class than you actually need, it would help our discussion and evaluation.

Regarding the over-generality of ISequence, point taken. I don’t think it will be changed prior to the approaching June release, but a bit of refactoring in the lower levels of the library is under discussion.