Tuesday, July 8, 2014

Empirical CDF of MAF in RS123 1000G imputed data

First, compute the minor allele frequency (MAF) of every SNP with plink. The --freq option writes the results to RS123_1kg.frq:

# shell code
plink1.9 --bfile RS123_1kg --freq --out RS123_1kg

Then look at the distribution of the MAFs in R:

# R code
# Read only the MAF column (5th of 6) from the plink .frq output
mafmat = read.table("RS123_1kg.frq", header=TRUE, colClasses=c("NULL", "NULL", "NULL", "NULL", "numeric", "NULL"))
maf_ecdf = ecdf(mafmat$MAF)
plot(maf_ecdf, main="Empirical CDF of MAF", xlab="MAF", ylab="Cumulative probability")

The ECDF rises most steeply at the low end of the MAF range, so SNP density is apparently highest there. That is the whole point of imputing with 1000G: getting more rare variants.
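
To put a number on that impression, the share of SNPs at or below a given MAF cutoff can be read straight from the .frq file. The following is only a sketch: the function name maf_fraction_below and the cutoffs 0.01 and 0.05 are my own illustrative choices, and it assumes the usual plink 1.9 .frq layout (one header line, MAF in the 5th column, NA for SNPs without a frequency).

```shell
# Sketch: share of SNPs with MAF <= cutoff, read from a plink .frq file.
# Assumes one header line and MAF in the 5th column (hypothetical helper).
maf_fraction_below() {
    # $1 = path to the .frq file, $2 = MAF cutoff
    awk -v cutoff="$2" '
        NR == 1 || $5 == "NA" { next }          # skip header and SNPs without a MAF
        { n++; if ($5 + 0 <= cutoff) below++ }  # count SNPs at or below the cutoff
        END { if (n > 0) printf "%.4f\n", below / n; else print "NA" }
    ' "$1"
}

# Example cutoffs (illustrative); runs only if the plink output is present
if [ -f RS123_1kg.frq ]; then
    maf_fraction_below RS123_1kg.frq 0.01
    maf_fraction_below RS123_1kg.frq 0.05
fi
```

Since the ECDF is defined as the proportion of values at or below x, each printed number should agree with the corresponding maf_ecdf(x) value in R.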