Wednesday, 15 August 2012

A method for subsetting FASTA files

I got back my first sequences for various Irenimus specimens this past week, and have created nice, clean contigs from the forward and reverse sequences. I've done this using FinchTV and Seaview, saving the results as a FASTA file with all of forward, reverse and consensus sequences for each specimen. Saving the data in this format has the benefit of being suitable for tracking through version control software, which means that every change I make to the file can be recalled. I'm only using one file for creating the contigs, but I'm using three gene regions, which will then need to be aligned with each other in the future. Thus, I need to have a method for subsetting my master document into smaller files with only those sequences from the same gene regions.

To do this, I have come up with a convention for naming the sequences I wish to use down the line:

>geneRegion|specimenNumber|speciesCode|otherInformation
From here, all sequences from a certain gene region can be retrieved using a little piece of awk magic. For example, all sequences from the 28S ribosomal RNA region (i.e. those starting with the line >28S|....) can be obtained by running the following code in the terminal:
awk '/>/{p=0};/>28S/{p=1} p' raw_sequences > 28S.fasta
A big thanks to backreference.org for pointing out how this might be achieved.

No comments: