Data download


In [1]:
!wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2
!wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2

!wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/relationships_w_pops_121708.txt


--2015-06-26 14:48:20--  http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2
Resolving hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10590722 (10M) [application/x-bzip2]
Saving to: ‘hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2’

hapmap3_r2_b36_fwd. 100%[=====================>]  10.10M  3.12MB/s   in 3.2s   

2015-06-26 14:48:24 (3.12 MB/s) - ‘hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2’ saved [10590722/10590722]

--2015-06-26 14:48:24--  http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2
Resolving hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 757962593 (723M) [application/x-bzip2]
Saving to: ‘hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2’

hapmap3_r2_b36_fwd. 100%[=====================>] 722.85M  3.64MB/s   in 2m 57s 

2015-06-26 14:51:21 (4.08 MB/s) - ‘hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2’ saved [757962593/757962593]

--2015-06-26 14:51:21--  http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/hapmap3/plink_format/draft_2/relationships_w_pops_121708.txt
Resolving hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to hapmap.ncbi.nlm.nih.gov (hapmap.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36765 (36K) [text/plain]
Saving to: ‘relationships_w_pops_121708.txt.1’

relationships_w_pop 100%[=====================>]  35.90K   197KB/s   in 0.2s   

2015-06-26 14:51:22 (197 KB/s) - ‘relationships_w_pops_121708.txt.1’ saved [36765/36765]


In [2]:
!bunzip2 hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2
!bunzip2 hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2

Preparation


In [2]:
import os
from collections import defaultdict

Loading HapMap meta-data


In [3]:
f = open('relationships_w_pops_121708.txt')
pop_ind = defaultdict(list)
f.readline()  # header
offspring = []
for l in f:
    toks = l.rstrip().split('\t')
    fam_id = toks[0]
    ind_id = toks[1]
    mom = toks[2]
    dad = toks[3]
    if mom != '0' or dad != '0':
        offspring.append((fam_id, ind_id))
    pop = toks[-1]
    pop_ind[pop].append((fam_id, ind_id))
f.close()

Sub-sampling


In [4]:
os.system('plink --recode --file hapmap3_r2_b36_fwd.consensus.qc.poly --noweb --out hapmap10 --thin 0.1 --geno 0.1')
os.system('plink --recode --file hapmap3_r2_b36_fwd.consensus.qc.poly --noweb --out hapmap1 --thin 0.01 --geno 0.1')


Out[4]:
0

Getting only autosomal data


In [5]:
def get_non_auto_SNPs(map_file, exclude_file):
    f = open(map_file)
    w = open(exclude_file, 'w')
    for l in f:
        toks = l.rstrip().split('\t')
        chrom = int(toks[0])
        rs = toks[1]
        if chrom > 22:
            w.write('%s\n' % rs)
    w.close()

In [6]:
get_non_auto_SNPs('hapmap10.map', 'exclude10.txt')
get_non_auto_SNPs('hapmap1.map', 'exclude1.txt')
#get_non_auto_SNPs('hapmap3_r2_b36_fwd.consensus.qc.poly.map', 'exclude.txt')

In [7]:
!plink --recode --file hapmap10 --noweb --out hapmap10_auto --exclude exclude10.txt
!plink --recode --file hapmap1 --noweb --out hapmap1_auto --exclude exclude1.txt
#geno!!!
#!plink --recode --file hapmap3_r2_b36_fwd.consensus.qc.poly --noweb --out hapmap_auto --exclude exclude.txt


PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap10_auto.log.
Note: --noweb has no effect since no web check is implemented yet.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (143633 variants, 1184 people).
--file: hapmap10_auto-temporary.bed + hapmap10_auto-temporary.bim +
hapmap10_auto-temporary.fam written.
143633 variants loaded from .bim file.
1184 people (589 males, 595 females) loaded from .fam.
--exclude: 138477 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 196 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997746.
138477 variants and 1184 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap10_auto.ped + hapmap10_auto.map ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap1_auto.log.
Note: --noweb has no effect since no web check is implemented yet.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (14460 variants, 1184 people).
--file: hapmap1_auto-temporary.bed + hapmap1_auto-temporary.bim +
hapmap1_auto-temporary.fam written.
14460 variants loaded from .bim file.
1184 people (589 males, 595 females) loaded from .fam.
--exclude: 13968 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 196 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.99775.
13968 variants and 1184 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap1_auto.ped + hapmap1_auto.map ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

Removing offspring


In [8]:
!plink --file hapmap10_auto --filter-founders --recode --out hapmap10_auto_noofs


PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap10_auto_noofs.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (138477 variants, 1184 people).
--file: hapmap10_auto_noofs-temporary.bed + hapmap10_auto_noofs-temporary.bim +
hapmap10_auto_noofs-temporary.fam written.
138477 variants loaded from .bim file.
1184 people (589 males, 595 females) loaded from .fam.
196 people removed due to founder status (--filter-founders).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate in remaining samples is 0.997734.
138477 variants and 988 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap10_auto_noofs.ped + hapmap10_auto_noofs.map ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

LD-prunning


In [9]:
!plink --file hapmap10_auto_noofs --indep-pairwise 50 10 0.1 --out keep
!plink --file hapmap10_auto_noofs --extract keep.prune.in --recode --out hapmap10_auto_noofs_ld


PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to keep.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (138477 variants, 988 people).
--file: keep-temporary.bed + keep-temporary.bim + keep-temporary.fam written.
138477 variants loaded from .bim file.
988 people (488 males, 500 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997734.
138477 variants and 988 people pass filters and QC.
Note: No phenotypes present.
Pruned 7074 variants from chromosome 1, leaving 4578.
Pruned 7372 variants from chromosome 2, leaving 4283.
Pruned 6083 variants from chromosome 3, leaving 3696.
Pruned 5032 variants from chromosome 4, leaving 3333.
Pruned 5454 variants from chromosome 5, leaving 3342.
Pruned 5782 variants from chromosome 6, leaving 3391.
Pruned 4602 variants from chromosome 7, leaving 3020.
Pruned 4629 variants from chromosome 8, leaving 2799.
Pruned 3788 variants from chromosome 9, leaving 2520.
Pruned 4444 variants from chromosome 10, leaving 2943.
Pruned 4308 variants from chromosome 11, leaving 2651.
Pruned 3992 variants from chromosome 12, leaving 2836.
Pruned 2948 variants from chromosome 13, leaving 2169.
Pruned 2670 variants from chromosome 14, leaving 1935.
Pruned 2363 variants from chromosome 15, leaving 1764.
Pruned 2635 variants from chromosome 16, leaving 1983.
Pruned 2094 variants from chromosome 17, leaving 1718.
Pruned 2282 variants from chromosome 18, leaving 1765.
Pruned 1317 variants from chromosome 19, leaving 1285.
Pruned 2054 variants from chromosome 20, leaving 1570.
Pruned 1075 variants from chromosome 21, leaving 850.
Pruned 1126 variants from chromosome 22, leaving 922.
Pruning complete.  83124 of 138477 variants removed.
Marker lists written to keep.prune.in and keep.prune.out .
PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap10_auto_noofs_ld.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (138477 variants, 988 people).
--file: hapmap10_auto_noofs_ld-temporary.bed +
hapmap10_auto_noofs_ld-temporary.bim + hapmap10_auto_noofs_ld-temporary.fam
written.
138477 variants loaded from .bim file.
988 people (488 males, 500 females) loaded from .fam.
--extract: 55353 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997677.
55353 variants and 988 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap10_auto_noofs_ld.ped + hapmap10_auto_noofs_ld.map ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

Different coding


In [10]:
!plink --file hapmap10_auto_noofs_ld --recode12 tab --out hapmap10_auto_noofs_ld_12
!plink --make-bed --file hapmap10_auto_noofs_ld --out hapmap10_auto_noofs_ld


PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Note: --recode12 flag deprecated.  Use 'recode 12 ...'.
Logging to hapmap10_auto_noofs_ld_12.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (55353 variants, 988 people).
--file: hapmap10_auto_noofs_ld_12-temporary.bed +
hapmap10_auto_noofs_ld_12-temporary.bim +
hapmap10_auto_noofs_ld_12-temporary.fam written.
55353 variants loaded from .bim file.
988 people (488 males, 500 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997677.
55353 variants and 988 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap10_auto_noofs_ld_12.ped + hapmap10_auto_noofs_ld_12.map ...
0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap10_auto_noofs_ld.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (55353 variants, 988 people).
--file: hapmap10_auto_noofs_ld-temporary.bed +
hapmap10_auto_noofs_ld-temporary.bim + hapmap10_auto_noofs_ld-temporary.fam
written.
55353 variants loaded from .bim file.
988 people (488 males, 500 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997677.
55353 variants and 988 people pass filters and QC.
Note: No phenotypes present.
--make-bed to hapmap10_auto_noofs_ld.bed + hapmap10_auto_noofs_ld.bim +
hapmap10_auto_noofs_ld.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

Single chromosome


In [11]:
!plink --recode --file hapmap10_auto_noofs --chr 2 --out hapmap10_auto_noofs_2


PLINK v1.90b3l 64-bit (18 Apr 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hapmap10_auto_noofs_2.log.
32095 MB RAM detected; reserving 16047 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (11655 variants, 988 people).
--file: hapmap10_auto_noofs_2-temporary.bed +
hapmap10_auto_noofs_2-temporary.bim + hapmap10_auto_noofs_2-temporary.fam
written.
11655 variants loaded from .bim file.
988 people (488 males, 500 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 988 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.997719.
11655 variants and 988 people pass filters and QC.
Note: No phenotypes present.
--recode to hapmap10_auto_noofs_2.ped + hapmap10_auto_noofs_2.map ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

In [ ]: