Saturday, July 12, 2014

Using python to unpack PLINK bed bytes into numbers

PLINK bed files store genotype data in a compact manner, in order to see the "real" numbers, you need to unpack them, here is how to do it in python:

#!/usr/bin/python3

def bitsToNum(x):
    if x == "00":
        return 2
    elif x == "01":
        return -9
    elif x == "10":
        return 1
    elif x == "11":
        return 0
    else:
        raise Exception("Unknown bits code!")

def byteToNum(xbyte):
    x = "{:08b}".format(xbyte)
    xstring = [x[i:i+2] for i in range(0, len(x), 2)]
    xgen = [str(bitsToNum(i)) for i in xstring]
    xgenstring = """const std::vector<int> """ + "{" + ", ".join(xgen) + "}" + ","
    return xgenstring


# print out c++ code for inclusion in a header file
print("""const std::vector< const std::vector<int> > gencodes {""")
for i in range(256):
    # print("{0:08b}".format(i))
    print("    ", end="")
    print(byteToNum(i))

print("}")

The above code is used to generate c++ code for memoization in one of the R packages.

0 comments: