## Thursday, July 24, 2014

### BED file format

This page describes the format of binary PED (BED) files. Consider the following example PED file, test.ped:
     1 1 0 0 1  0    G G    2 2    C C
     1 2 0 0 1  0    A A    0 0    A C
     1 3 1 2 1  2    0 0    1 2    A C
     2 1 0 0 1  0    A A    2 2    0 0
     2 2 0 0 1  2    A A    2 2    0 0
     2 3 1 2 1  2    A A    2 2    A A

and the corresponding MAP file, test.map:
     1 snp1 0 1
     1 snp2 0 2
     1 snp3 0 3

We create a binary fileset with the following command:
#####  plink --file test --make-bed --out test 
which produces output:
     @----------------------------------------------------------@
     |         PLINK!       |    v0.99l     |   27/Jul/2006     |
     |----------------------------------------------------------|
     |  (C) 2006 Shaun Purcell, GNU General Public License, v2  |
     |----------------------------------------------------------|
     @----------------------------------------------------------@

     Web-based version check ( --noweb to skip )
     Connecting to web...  OK, v0.99l is current

     *** Pre-Release Testing Version ***

     Writing this text to log file [ test.log ]
     Analysis started: Sat Jul 29 17:22:59 2006

     Options in effect:
     --file test
     --make-bed
     --out test

     3 (of 3) markers to be included from [ test.map ]
     6 individuals read from [ test.ped ]
     3 individuals with nonmissing phenotypes
     Assuming a binary trait (1=unaff, 2=aff, 0=miss)
     Missing phenotype value is also -9
     Before frequency and genotyping pruning, there are 3 SNPs
     Applying filters (SNP-major mode)
     4 founders and 2 non-founders found
     0 SNPs failed missingness test ( GENO > 1 )
     0 SNPs failed frequency test ( MAF < 0 )
     After frequency and genotyping pruning, there are 3 SNPs
     Writing pedigree information to [ test.fam ]
     Writing map (extended format) information to [ test.bim ]
     Writing genotype bitfile to [ test.bed ]
     Using (default) SNP-major mode
     Analysis finished: Sat Jul 29 17:37:57 2006

and generates the files:
     test.bed
     test.bim
     test.fam

The file test.bim is the extended map file, which also includes the names of the alleles. Its columns are chromosome, SNP, cM, base-position, allele 1 and allele 2:
     1       snp1    0       1       G       A
     1       snp2    0       2       1       2
     1       snp3    0       3       A       C

The file test.fam is simply the first six columns of test.ped:
     1 1 0 0 1 0
     1 2 0 0 1 0
     1 3 1 2 1 2
     2 1 0 0 1 0
     2 2 0 0 1 2
     2 3 1 2 1 2

We can inspect the BED file with the Unix xxd command, to view a binary file:
#####  xxd -b test.bed 
which generates:
     0000000: 01101100 00011011 00000001 11011100 00001111 11100111  l.....
     0000006: 00001111 01101011 00000001                             .k.

The actual binary data are the nine blocks of 8 bits (one byte each) in the center. The first three bytes have a special meaning. The first two are fixed: a 'magic number' that lets PLINK confirm that a BED file really is a BED file. That is, BED files should always start 01101100 00011011. The third byte indicates whether the BED file is in SNP-major or individual-major mode: a value of 00000001 indicates SNP-major (i.e. list all individuals for the first SNP, then all individuals for the second SNP, etc.), whereas a value of 00000000 indicates individual-major (i.e. list all SNPs for the first individual, then all SNPs for the second individual, etc.). By default, all BED files are in SNP-major mode (as is the example below). Here we have extracted and annotated the relevant part of the xxd output:
     |-magic number--| |-mode-| |--genotype data---------|
     01101100 00011011 00000001 11011100 00001111 11100111

     |--genotype data-cont'd--|
     00001111 01101011 00000001
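
The header check described above can be sketched in a few lines of Python. This is a minimal illustration, not PLINK's own code: the function name `read_bed_header` and the constant names are hypothetical, and the nine bytes are typed in from the xxd output rather than read from disk.

```python
# Hypothetical helper: validate the three header bytes of a BED file.
MAGIC = bytes([0b01101100, 0b00011011])  # the two fixed "magic number" bytes


def read_bed_header(data):
    """Return the storage mode named by the third byte of a BED file."""
    if len(data) < 3:
        raise ValueError("too short to be a BED file")
    if data[:2] != MAGIC:
        raise ValueError("magic number mismatch: not a BED file")
    if data[2] == 0b00000001:
        return "SNP-major"
    if data[2] == 0b00000000:
        return "individual-major"
    raise ValueError("unrecognised mode byte")


# The nine bytes of test.bed, copied from the xxd output.
TEST_BED = bytes([0b01101100, 0b00011011, 0b00000001,
                  0b11011100, 0b00001111, 0b11100111,
                  0b00001111, 0b01101011, 0b00000001])

print(read_bed_header(TEST_BED))  # SNP-major
```

For a file on disk, the same function could be applied to the result of `open("test.bed", "rb").read()`.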


For the genotype data, each byte encodes up to four genotypes (2 bits per genotype). The coding is:
     00  Homozygote "1"/"1"
     01  Heterozygote
     11  Homozygote "2"/"2"
     10  Missing genotype

The only slightly confusing wrinkle is that each byte is effectively read backwards. That is, if we label each of the 8 positions A to H, we label them from right to left:
     01101100
     HGFEDCBA

and so the first four genotypes are read as follows:
     01101100
     HGFEDCBA

           AB   00  -- homozygote (first)
         CD     11  -- other homozygote (second)
       EF       01  -- heterozygote (third)
     GH         10  -- missing genotype (fourth)
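
The backwards reading can be expressed with shifts and masks. The sketch below is only an illustration of the rule just described (the names `CODES` and `decode_byte` are made up for this example): it walks a byte from its lowest bit upwards and writes each pair in the order the bits are read, so the results line up with the coding table.

```python
# 2-bit codes, written in the order the bits are read (A then B, C then D, ...).
CODES = {
    "00": "homozygote 1/1",
    "01": "heterozygote",
    "11": "homozygote 2/2",
    "10": "missing genotype",
}


def decode_byte(byte):
    """Split one byte into four 2-bit codes, lowest bits first."""
    pairs = []
    for k in range(4):
        first = (byte >> (2 * k)) & 1       # positions A, C, E, G
        second = (byte >> (2 * k + 1)) & 1  # positions B, D, F, H
        pairs.append(f"{first}{second}")
    return pairs


for pair in decode_byte(0b01101100):
    print(pair, CODES[pair])
# 00 homozygote 1/1
# 11 homozygote 2/2
# 01 heterozygote
# 10 missing genotype
```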

Finally, when we reach the end of a SNP (or, in individual-major mode, the end of an individual) we skip to the start of a new byte (i.e. skip any remaining bits in that byte). It is important to remember that the files test.bim and test.fam will already have been read in, so PLINK knows how many SNPs and individuals to expect. With six individuals, each SNP needs two bytes (four genotypes in the first, two in the second), so the three SNPs occupy the six bytes that follow the three header bytes of test.bed. We consider these six bytes one at a time, showing how up to four genotypes are extracted from each byte to make up the entire dataset. Positions marked (null) are padding: all the genotypes for that SNP have already been read in, so we advance to the start of a new byte for the next SNP (when in SNP-major mode):
                Genotype    Person    SNP
     11011100

           00   G/G         1 1       snp1
         11     A/A         1 2       snp1
       10       0/0         1 3       snp1
     11         A/A         2 1       snp1

     00001111

           11   A/A         2 2       snp1
         11     A/A         2 3       snp1
       00       (null)
     00         (null)

     11100111

           11   2/2         1 1       snp2
         10     0/0         1 2       snp2
       01       1/2         1 3       snp2
     11         2/2         2 1       snp2

     00001111

           11   2/2         2 2       snp2
         11     2/2         2 3       snp2
       00       (null)
     00         (null)

     01101011

           11   C/C         1 1       snp3
         01     A/C         1 2       snp3
       01       A/C         1 3       snp3
     10         0/0         2 1       snp3

     00000001

           10   0/0         2 2       snp3
         00     A/A         2 3       snp3
       00       (null)
     00         (null)
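
The walk-through above can be reproduced programmatically. The following is a rough sketch for SNP-major files only, not a replacement for PLINK: `decode_snp_major` is a hypothetical name, and the BED, BIM and FAM contents are typed in from the listings on this page instead of being parsed from the files.

```python
# The nine bytes of test.bed, copied from the xxd output.
BED = bytes([0b01101100, 0b00011011, 0b00000001,
             0b11011100, 0b00001111,
             0b11100111, 0b00001111,
             0b01101011, 0b00000001])
# (SNP name, allele 1, allele 2) from test.bim; (family, individual) from test.fam.
BIM = [("snp1", "G", "A"), ("snp2", "1", "2"), ("snp3", "A", "C")]
FAM = [("1", "1"), ("1", "2"), ("1", "3"), ("2", "1"), ("2", "2"), ("2", "3")]


def decode_snp_major(bed, bim, fam):
    """Return (genotype, person, snp) tuples in the order they are stored."""
    if bed[:3] != bytes([0b01101100, 0b00011011, 0b00000001]):
        raise ValueError("expected a SNP-major BED file")
    bytes_per_snp = (len(fam) + 3) // 4  # every SNP starts on a fresh byte
    body = bed[3:]
    if len(body) != bytes_per_snp * len(bim):
        raise ValueError("genotype data does not match the .bim/.fam sizes")
    rows = []
    for s, (snp, a1, a2) in enumerate(bim):
        for i, (fid, iid) in enumerate(fam):
            byte = body[s * bytes_per_snp + i // 4]
            shift = 2 * (i % 4)  # the lowest two bits hold the first person
            first = (byte >> shift) & 1
            second = (byte >> (shift + 1)) & 1
            if (first, second) == (1, 0):
                genotype = "0/0"  # code 10: missing
            else:
                # bit 0 points to allele 1, bit 1 to allele 2
                genotype = f"{(a1, a2)[first]}/{(a1, a2)[second]}"
            rows.append((genotype, f"{fid} {iid}", snp))
    return rows


for genotype, person, snp in decode_snp_major(BED, BIM, FAM):
    print(f"{genotype:<8}{person:<8}{snp}")
```

The padding bits never need to be inspected: the loop only visits as many 2-bit slots per SNP as there are individuals in the FAM list.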


In summary, the following points define the BED file format:
• First two bytes 01101100 00011011 for PLINK v1.00 BED file
• Third byte is 00000001 (SNP-major) or 00000000 (individual-major)
• Genotype data, either in SNP-major or individual-major order
• New "row" always starts a new byte
• Each byte encodes up to 4 genotypes
• 10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively
• Bits in each byte read in reverse order
Any changes to this format will be accompanied by a different, unique magic number and will be backwards compatible in PLINK.
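
Read in the other direction, the rules in this summary are enough to pack one row of genotypes back into bytes. The sketch below is an assumption-laden illustration rather than PLINK's writer: `pack_row` is a hypothetical name, it takes the 2-bit codes as strings written the same way as in the coding table, and it leaves the unused trailing bits of the last byte at 0, as in the example file.

```python
def pack_row(pairs):
    """Pack 2-bit codes such as "00" or "10" into bytes, four codes per byte."""
    out = bytearray((len(pairs) + 3) // 4)  # a new row always starts a new byte
    for i, pair in enumerate(pairs):
        if pair not in ("00", "01", "11", "10"):
            raise ValueError(f"not a 2-bit genotype code: {pair!r}")
        first, second = int(pair[0]), int(pair[1])
        shift = 2 * (i % 4)  # the first genotype goes into the lowest two bits
        out[i // 4] |= (first << shift) | (second << (shift + 1))
    return bytes(out)


# snp1 for the six individuals: G/G, A/A, missing, A/A, A/A, A/A
snp1 = ["00", "11", "10", "11", "11", "11"]
print([format(b, "08b") for b in pack_row(snp1)])  # ['11011100', '00001111']
```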