Friday, July 4, 2014

clean_tree: a software for high-resolution male lineage classification

As a part of a paper we submitted to Human Mutation, this project is hosted on bitbucket and its main page is here:
This post will only make sense if you have the software already installed.

Quick example

For the impatient, there is a data folder in the software package, which provides a toy example bam file for you to play with. I recommend to copy it to somewhere else to avoid mess up the software itself:
# change into the directory
cd /home/your/path/to/clean_tree
cp -R data /tmp
cd /tmp/data/
Now you can use clean_tree to infer haplogroups on this bam file: test.bam positions.txt out -r 22 -q 30 -b 90
You should see something like this on your screen:
## output
samtools mpileup -f /home/kaiyin/clean_tree/hg19.fa test.bam > /home/kaiyin/clean_tree/tmp/out.pu
[mpileup] 1 samples in 1 input files
 Set max per-file depth to 8000
Loading required package: mice
Loading required package: methods
Loading required package: Rcpp
Loading required package: lattice
mice 2.22 2014-06-10
Loading required package: plyr
539 valid markers provided.
284676 loci after pileup.
25731 loci passed reads number filter.
179 loci are in your markers list.
72 markers gives you haplogroup information.
And you can check that it has generated two output files, out and out.chr
ls -lh

## output:
total 6.5M
-rw-rw-r-- 1 kaiyin arwin  1.5K Jul  4 23:28 out
-rw-rw-r-- 1 kaiyin arwin   710 Jul  4 23:28 out.chr
-rw-rw-r-- 1 kaiyin kaiyin  20K Jul  4 21:04 positions.txt
-rw-rw-r-- 1 kaiyin arwin  6.4M May  3 11:56 test.bam

Check output

There is also a shell script for having a quick look at results: out

## output
         chr            pos            name  branch
        chrY       14159864            L657       A
        chrY       14288981          L127.1       D
        chrY       14288983            L703       A
        chrY       14497168             P47       A
        chrY       14851526              N1       A
        chrY       14851538              N5       A
        chrY       14851554            M181       A
This will print out some interesting columns of the results in a neat format, there should be 72 lines, but only 7 are shown above. Note for this person, L127.1 is typed different from Ancestry, so recorded as D here. You can also open this file with WPS office or Libreoffice to have a more careful look.

Commandline options explained

First you need to provide three filenames:
  Bamfile               Path of bam file which you need to pile up
  Markerfile            Path of file which contains the target markers
  Outputfile            Path of the output file
Then there are the following options:
  -h show this help message and exit

  -q QUALITY_THRESH            Minimum quality for each read, integer between
                               10 and 39, inclusive. If you give it 0, quality
                               of reads will not be checked.

  -r READS_THRESH              The minimum number of "good" reads for each base.
                               "Good" means above QUALITY_THRESH.

  -b BASE_MAJORITY_PERCENTAGE  The minimum percentage of a base result for acceptance.
                               For example, if you give it 80, then 80% of the reads
                               for each marker should be the same, otherwise that marker\
                               will be filtered out.

Now the commandline arguments in the previous example are interpreted this way:
  • test.bam Your bam file, source of genotype info
  • positions.txt A tab delimited file describing your target markers, used for haplogroup inference. You should open this example file with a text editor and have a good look, or use it as a basis for rolling your own.
  • -q 22 Filter out all read with quality number lower than 22
  • -r 30 After filtering by quality, there should be at least 30 reads left, otherwise the locus is discarded
  • -b 90 90% of the reads should agree with each other, or the locus is discarded