require(ape)This function utilises the bit-level coding scheme that Emmanuel Paradis developed for encoding sequences in R. The unambiguous bases A, G, C and T have the numeric values 136, 72, 40 and 24 respectively. This function figures out which sites don't have these values and returns a vector of TRUEs and FALSEs, TRUEs being ambiguous bases.
data(woodmouse)
is.ambig <- function(x){
x <- as.matrix(x)
bases <- c(136, 72, 40, 24)
ambig <- apply(x, 2, FUN=function(x) sum(as.numeric(!as.numeric(x) %in% bases)))
ambig > 0
}
is.ambig(woodmouse)
The second function is an implementation of Tajima's K, published as equation A3 in Tajima 1983
tajima.K <- function(x, prop = TRUE){This function calculates the mean number of sites that are different between any two sequences. As a default, it returns the result as a proportion of the length of the alignment. Setting prop = FALSE will return the result as the actual number of sites.
res <- mean(dist.dna(x, model="N"))
if(prop) res <- res/dim(x)[2]
res
}
tajima.K(woodmouse)
References:
Tajmia F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.