Monday, July 7, 2014

bedcoll performance

bedcoll is a tool for calculating collapsed genotypes from PLINK .bed files. Here is a test using Rotterdam Study data imputed with 1000G:

///BASH/// time bedcoll -b RS123_1kg. -m 4 -n 15
..............

real    166m34.999s
user    12m34.911s
sys     11m10.526s

Let's check the size of the output files:

///BASH/// ll
total 713593672
-rwxrwx--- 1 kaiyin kaiyin 45641266881 Oct 16  2013 RS123_1kg.bed
-rwxrwx--- 1 kaiyin kaiyin   459238545 Dec 12  2013 RS123_1kg.bim
-rwxrwx--- 1 kaiyin kaiyin      238301 Oct 16  2013 RS123_1kg.fam
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  6 23:23 RS123_1kg_shift_0001.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  6 23:23 RS123_1kg_shift_0002.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  6 23:23 RS123_1kg_shift_0003.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0004.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0005.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0006.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0007.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0008.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0009.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0010.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0011.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0012.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0013.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0014.bed
-rw-rw-r-- 1 kaiyin kaiyin 45641266881 Jul  7 02:56 RS123_1kg_shift_0015.bed

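So this run generated 12 bed files (RS123_1kg_shift_0004.bed through RS123_1kg_shift_0015.bed, the ones timestamped Jul 7 02:56), each 45641266881 bytes in size. The output speed in bytes per second is therefore: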

# R code
> calctime = 166*60 + 35
> calctime
[1] 9995
> outsize = 45641266881 * 12
> outspeed = outsize / calctime
> outspeed
[1] 54796919

That's about 55 MB/s, pretty close to the disk I/O speed limit.
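
One way to sanity-check that the disk, not the CPU, is the bottleneck is to compare CPU time (user + sys) with wall-clock time from the `time` output above. The following is a minimal Python sketch of that arithmetic. The helper name parse_time is made up for illustration and is not part of bedcoll.

```python
import re


def parse_time(field: str) -> float:
    """Convert a bash `time` field such as '166m34.999s' to seconds.

    Hypothetical helper for this post, not part of bedcoll.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", field)
    if match is None:
        raise ValueError(f"unrecognized time field: {field!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the `time bedcoll ...` output above.
real = parse_time("166m34.999s")
user = parse_time("12m34.911s")
sys_time = parse_time("11m10.526s")

# Fraction of the wall-clock time during which the CPU was actually busy.
cpu_fraction = (user + sys_time) / real
print(f"wall clock: {real:.0f} s, CPU: {user + sys_time:.0f} s")
print(f"CPU utilization: {cpu_fraction:.1%}")
```

The CPU was busy for roughly 14% of the run, which is consistent with the process spending most of its time waiting on disk reads and writes.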

Let's also check if the results are correct:

///BASH/// checkcoll.py RS123_1kg 5

        bed file:           RS123_1kg.bed
        bim file:           RS123_1kg.bim
        fam file:           RS123_1kg.fam
        Number of SNPs:        15880747
        Number of obs:         11496
        Bytes / SNP:           2874


        Collapsed bed file:     RS123_1kg_shift_0015.bed
        Shift width:            15
        SNPs skipped:           6040067
        Right results count:    11496
        Wrong results count:    0

Check finished for RS123_1kg_shift_0015.bed.
I didn't see anything abnormal.
Check finished.

Everything looks ok.
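
As an extra cross-check, the "Bytes / SNP" value reported by checkcoll.py can be tied back to the file sizes in the directory listing. A SNP-major PLINK .bed file stores 2 bits per genotype, so each SNP takes the number of observations divided by 4, rounded up to a whole byte, and the file starts with a 3-byte header. The Python sketch below assumes that SNP-major layout. The function name expected_bed_size is made up for illustration.

```python
def expected_bed_size(n_snps: int, n_samples: int) -> int:
    """Size in bytes of a SNP-major PLINK .bed file.

    Hypothetical helper for this post; assumes the SNP-major layout.
    """
    # 2 bits per genotype, 4 genotypes per byte, padded up to a whole byte.
    bytes_per_snp = (n_samples + 3) // 4
    # 3-byte header: 2 magic bytes plus 1 mode byte.
    return 3 + n_snps * bytes_per_snp


# Values copied from the checkcoll.py output above.
n_snps = 15880747
n_samples = 11496

bytes_per_snp = (n_samples + 3) // 4
size = expected_bed_size(n_snps, n_samples)

print(bytes_per_snp)        # compare with "Bytes / SNP" above
print(size)                 # compare with the .bed sizes in the listing
print(size == 45641266881)  # size of each .bed file in the listing
```

With these inputs the sketch gives 2874 bytes per SNP and 45641266881 bytes in total, matching both the checkcoll.py report and the size of every .bed file in the listing.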
