All well and good. Unfortunately, the base function returns only the accession number as the name of each sequence. My friend was downloading sequences of many different genes from several different species. Understandably, bare accession numbers are not much help in that situation; processing a dataset like this calls for more information. Thankfully, a quick hack of the function ensured that species and gene region information is downloaded along with the sequences, which solved the problem. In my opinion, it also makes the function much more useful for phylogenetics-type work.
The resulting function is read.GB(). It currently reads the "ORGANISM", "DEFINITION", and "ACCESSION" fields of GenBank files, which record the species identity, gene region, and accession number respectively. These are stored as attributes of the resulting DNAbin object and can be retrieved as follows:
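a <- read.GB("AY883003")    # download the sequence by accession number
attr(a, "species")          # species identity, from the ORGANISM field
attr(a, "gene")             # gene region, from the DEFINITION field
attr(a, "accession_num")    # accession number, from the ACCESSION field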
By default, the sequences are named in a standard format: accession number|scientific name.
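Because the names follow a fixed "accession number|scientific name" pattern, they are easy to pull apart again for downstream processing. Here is a minimal base R sketch of one way to do that. The labels are made-up placeholders, not real GenBank records; with real data you would presumably start from the names of the object returned by read.GB() instead.

```r
# Hypothetical labels in the "accession number|scientific name" format
# (placeholders for illustration only, not real GenBank records)
labels <- c("ACC00001|Genus alpha", "ACC00002|Genus beta")

# Split each label on the literal "|" character;
# fixed = TRUE stops "|" being treated as a regular-expression operator
parts <- strsplit(labels, "|", fixed = TRUE)

# First piece is the accession number, second is the scientific name
accession <- vapply(parts, `[`, character(1), 1)
species   <- vapply(parts, `[`, character(1), 2)

# Collect the pieces into a table, one row per sequence
info <- data.frame(accession, species)
info
```

From here the table can be used to sort or subset sequences by species rather than by accession number.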
Full credit goes to Emmanuel Paradis, who wrote the original function in such a way that it was fairly painless to extend it in the manner above.