Monday, July 14, 2014

Incomplete phenotypes and plink performance

Make sure all individuals with phenotype data are included in the genotype data set

# bash code

cd /media/data1/RS123_1kg/
cp RS123.1kg.bim RS123.1kg.fam /tmp
cd sskn
cp sskn.phe /tmp
cd /tmp
R
In R:
# R code

> x = read.table("sskn.phe", head=T)
> y = read.table("RS123_1kg.fam", head=F)
> head(y)
   V1      V2 V3 V4 V5 V6
1 RS1 1000001  0  0 -9 -9
2 RS1 1000002  0  0 -9 -9
3 RS1  100001  0  0 -9 -9
4 RS1  100002  0  0 -9 -9
5 RS1   10001  0  0 -9 -9
6 RS1 1001001  0  0 -9 -9
> head(x)
  FID     IID    image sex      age ec fan  reye geye beye     heye      seye
1 RS1 1000001 IM008163   1 84.43014  1   1  95.0   91   79 45.00000 0.1684211
2 RS1 1000002 IM008161   2 84.01370  1   1  73.0   70   61 45.00000 0.1643836
3 RS1  100002 IM003302   2 72.31233  3   4  76.5   39   24 17.14286 0.6862745
4 RS1   10001 IM007247   1 86.08767  1   2 114.0   99   78 35.00000 0.3157895
5 RS1 1001001 IM005022   2 66.46027  1   2 110.0   98   75 39.42857 0.3181818
6 RS1 1008001 IM008346   2 90.03014  1   2 156.0  147  127 41.37931 0.1858974
> summary(x$IID %in% y$V2)
   Mode    TRUE    NA's
logical    5857       0

So those individuals with phenotyped data collected is a subset of all subjects in the Rotterdam Study, nothing went wrong here.

PLINK dealing with incomplete phenotypes and performance

# shell code

///BASH/// pwd
/media/data2/kaiyin/rs123_plink/noshift_maf_0.1_0.102

///BASH/// time plinkcoll.py --allow-no-sex --bfile RS123_1kg     --max-maf 0.102     --maf 0.1     --pheno RS123_1kg_phe/RS123_1kg_height.csv     --pheno-name HEIGHT     --covar RS123_1kg_phe/RS123_1kg_height.csv     --covar-name SEX,AGE     --linear     --plinkcoll-range 0:0

Exec:  /home/kaiyin/bin/plink1.9 --allow-no-sex --bfile RS123_1kg --max-maf 0.102 --maf 0.1 --pheno RS123_1kg_phe/RS123_1kg_height.csv --pheno-name HEIGHT --covar RS123_1kg_phe/RS123_1kg_height.csv --covar-name SEX,AGE --linear --out RS123_1kg
PLINK v1.90b1e 64-bit (3 Jun 2014)         https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to RS123_1kg.log.
96682 MB RAM detected; reserving 48341 MB for main workspace.
15880747 variants loaded from .bim file.
11496 people (0 males, 0 females, 11496 ambiguous) loaded from .fam.
Ambiguous sex IDs written to RS123_1kg.nosex .
9885 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
--covar: 2 out of 3 covariates loaded.
309 people had missing value(s), and 1301 people were not seen in the covariate file.
Calculating allele frequencies... done.
Total genotyping rate is 0.884282.
15840248 variants removed due to MAF threshold(s) (--maf/--max-maf).
40499 variants and 11496 people pass filters and QC.
Phenotype data is quantitative.
Writing linear model association results to RS123_1kg.assoc.linear ... done.

real    1m12.695s
user    1m4.012s
sys     0m8.417s

PLINK is smart enough to recognize that some individuals in the .fam file are not in the phenotype file, and automatically excluded them. Total number of individuals is 9885 + 1301 + 309 = 11495, among which 1301 are not in the phenotype file, and 309 are there, but have NA values.

In terms of performance:

calctime = 64
nsnp = 40499
calctime / nsnp * 1000
[1] 1.580286

For every 1000 SNPs, PLINK uses 1.58 seconds on our Linux server. Technical specifications:

  • OS: Ubuntu Linux 12.04 64 bits
  • CPU: Intel(R) Xeon(R) X5647 @ 2.93GHz, 8 cores
  • RAM: 99 GB

At this rate, for our RS123 dataset with 15 million SNPs, it will take 1.6 * 15000 / 3600 = 6.7 hours. For a QCDH test with the shift size of 200, it will take more than 1000 hours for just one GWAS analysis, which means filtering is necessary.

A full run

# bash code

time plinkcoll.py --allow-no-sex --bfile RS123_1kg     --pheno RS123_1kg_phe/RS123_1kg_height.csv     --pheno-name HEIGHT     --covar RS123_1kg_phe/RS123_1kg_height.csv     --covar-name SEX,AGE     --linear     --plinkcoll-range 0:0
RS123_1kg.bed
['/usr/local/bin/plinkcoll.py', '--allow-no-sex', '--bfile', 'RS123_1kg', '--pheno', 'RS123_1kg_phe/RS123_1kg_height.csv', '--pheno-name', 'HEIGHT', '--covar', 'RS123_1kg_phe/RS123_1kg_height.csv', '--covar-name', 'SEX,AGE', '--linear']

Exec:  /home/kaiyin/bin/plink1.9 --allow-no-sex --bfile RS123_1kg --pheno RS123_1kg_phe/RS123_1kg_height.csv --pheno-name HEIGHT --covar RS123_1kg_phe/RS123_1kg_height.csv --covar-name SEX,AGE --linear --out RS123_1kg
PLINK v1.90b1e 64-bit (3 Jun 2014)         https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to RS123_1kg.log.
96682 MB RAM detected; reserving 48341 MB for main workspace.
15880747 variants loaded from .bim file.
11496 people (0 males, 0 females, 11496 ambiguous) loaded from .fam.
Ambiguous sex IDs written to RS123_1kg.nosex .
9885 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
--covar: 2 out of 3 covariates loaded.
309 people had missing value(s), and 1301 people were not seen in the covariate file.
Calculating allele frequencies... done.
Total genotyping rate is 0.884282.
15880747 variants and 11496 people pass filters and QC.
Phenotype data is quantitative.
Writing linear model association results to RS123_1kg.assoc.linear ... done.

real    228m22.345s
user    227m15.240s
sys     0m26.242s

It took only 3.8 hours, about half of what was expected. This is because in the test run with a small subset, about half of the time was spent on calculating allele frequences. So, here is the final setup:

Number of covariates Number of individuals Number of individuals with vadlid phenotype data Computation time (sec) / 1000 SNPs Number of SNPs in total
1 2.00 11496.00 9885.00 0.86 15880747.00

Here is a table for estimation of calculation time and disk space requirements:

Filter by P value Number of shifts Computation time for all shifts Disk space (TB) Disk space + backup space (TB)
1 1.00 20.00 75.87 0.83 1.66
2 0.05 20.00 3.79 0.83 1.66
3 0.01 20.00 0.76 0.83 1.66
4 0.00 20.00 0.08 0.83 1.66
5 1.00 100.00 379.37 4.15 8.30
6 0.05 100.00 18.97 4.15 8.30
7 0.01 100.00 3.79 4.15 8.30
8 0.00 100.00 0.38 4.15 8.30
9 1.00 200.00 758.75 8.30 16.60
10 0.05 200.00 37.94 8.30 16.60
11 0.01 200.00 7.59 8.30 16.60
12 0.00 200.00 0.76 8.30 16.60
13 1.00 500.00 1896.87 20.76 41.51
14 0.05 500.00 94.84 20.76 41.51
15 0.01 500.00 18.97 20.76 41.51
16 0.00 500.00 1.90 20.76 41.51

And R code for generating the above tables:

## install.packages("xtable")
require(xtable)
getwd()
options(scipen = 9)
ncov = 2
nindiv = 11496
nindiv_pheno = 9885
sec_1000snp = 0.86
nsnp_total = 15880747
pval_filter = rep(c(1, .05, .01, .001), 4)
nsnp_after_pval_filter = nsnp_total * pval_filter
comp_time_per_shift = nsnp_after_pval_filter * sec_1000snp / 1000 / 3600
nshift = rep(c(20, 100, 200, 500), each=4)
comp_time_total = comp_time_per_shift * nshift
disk_space = nsnp_total * nindiv / 4 / 1024 / 1024 / 1024 / 1024 * nshift
disk_space_plus_backup = disk_space * 2

dat1 = data.frame(ncov, nindiv, nindiv_pheno, sec_1000snp, nsnp_total)
colnames(dat1) = c(
        "Number of covariates", "Number of individuals",
        "Number of individuals with vadlid phenotype data",
        "Computation time (sec) / 1000 SNPs",
        "Number of SNPs in total"
)
dat1
print(xtable(dat1), type="html", file="/tmp/x1.html")
dat2 = data.frame(pval_filter, nshift, comp_time_total, disk_space, disk_space_plus_backup)
colnames(dat2) = c(
        "Filter by P value", "Number of shifts",
        "Computation time for all shifts",
        "Disk space (TB)", "Disk space + backup space (TB)"
)
dat2
print(xtable(dat2), type="html", file="/tmp/x2.html")

0 comments: