This notebook will allow you to practice some basic skills for using python: working with different data types, using various data structures, reading and writing text files, using conditionals, control flow structures, creating functions, and of course working with ipython notebooks.
A biologist is interested in the genetic basis of height. She measures the heights of many subjects and sends off their DNA samples to a core for genotyping arrays. These arrays determine the DNA bases at the variable sites of the genome (known as single nucleotide polymorphisms, or SNPs). Since humans are diploid, i.e. have two of each chromosome, each data point will be two DNA bases corresponding to the two chromosomes in each individual. At each SNP, there will be only three possible genotypes, e.g. AA, AG, GG for an A/G SNP. In order to test the correlation between a SNP genotype and height, she wants to perform a regression with an additive genetic model. However, she cannot do this with the data in the current form. She needs to convert the genotypes, e.g. AA, AG, and GG, to the numbers 0, 1, and 2, respectively (in the example the number corresponds the number of G bases the person has at that SNP). Since she has too much data to do this manually, e.g. in Excel, she comes to you for ideas of how to efficiently transform the data.
Create a new list which has the converted genotype for each subject ('AA' -> 0, 'AG' -> 1, 'GG' -> 2).
In [1]:
genos = ['AA', 'GG', 'AG', 'AG', 'GG']
genos_new = []
# Use your knowledge of if/else statements and loop structures below.
Check your work
In [2]:
genos_new == [0, 2, 1, 1, 2]
Out[2]:
In [3]:
genos_w_missing = ['AA', 'NA', 'GG', 'AG', 'AG', 'GG', 'NA']
genos_w_missing_new = []
# The missing data should not be converted to a number, but remain 'NA' in the new list
Check your work
In [ ]:
genos_w_missing_new == [0, 'NA', 2, 1, 1, 2, 'NA']
In [ ]:
Setup: Open a terminal and run the following commands:
# create a new directory
mkdir python-intro
cd python-intro
# download data file, and ipython notebook
curl -O https://raw.githubusercontent.com/gastonstat/stat259/gh-pages/tutorials/genos.txt
curl -O https://raw.githubusercontent.com/gastonstat/stat259/gh-pages/tutorials/genotypes.ipynb
Data File: The raw data for this practice is in the file genos.txt
which contains one column of genotypes (one genotype per row). Each genotype consists of two characters: e.g. 'AA'
or 'GG'
. In addition, there are some rows that contain missing values denoted as 'NA'
.
I. Read in the data and store the contents in a list called genos.
II. Find out how what are the different (i.e. unique) values are in genos.
III. Calculate the number of occurrences of each genotype, and store the results in a dictionary called geno_counts. Use the following 3 approaches:
count()
methodCounter
from CollectionsIV. Once you've counted the genotypes, make a function get_proportions() that takes geno_counts
and returns a dictionary with relative frequencies (i.e. proportions) of genotypes. Also, test your function with the provided assertion.
V. Convert the string values in genos into integers ('NA' remains as 'NA') and put them in a new list called numeric_genos:
'AA'
= 0'AG'
= 1'GG'
= 2'NA'
= 'NA'
VI. Write the data in numeric_genos to a text file called genos_int.txt
VII. Finally, convert your notebook to html (and open it) by running these commands from the shell:
ipython nbconvert genotypes.ipynb
open genotypes.html
In [1]:
# things to be imported
from __future__ import division # if you use python 2.?
from collections import Counter
Some refs about Reading Files:
In [2]:
# open 'genos.txt' and store values in "genos"
# YOUR CODE
In [3]:
# Find the unique values in genos
# YOUR CODE
In [4]:
# For loop to count occurrences of AA, AG, GG, NA
# (store results in dictionary "geno_counts")
# YOUR CODE
In [5]:
# YOUR CODE
In [6]:
# YOUR CODE
In [7]:
# Write a function "get_proportions()"
# Parameters: geno_counts (dictionary)
# Returns: dictionary of proportions
# YOUR CODE
In [8]:
# apply your function:
# get_proportions(geno_counts)
In [9]:
# test for function get_proportions()
def test_get_proportions():
# We make a fake dictionary
input_val = {'AA': 2, 'AB': 4, 'BB': 14}
expected_result = {'AA': 0.1, 'AB': 0.2, 'BB': 0.7}
# run function
res = get_proportions(input_val)
assert res == expected_result
In [10]:
# run the test and see what happens:
# test_get_proportions()
In [11]:
# convert genotypes: AA = 0, AG = 1, GG = 2, NA = NA
# (create a list called "numeric_genos")
# YOUR CODE
In [12]:
# write values in "numeric_genos" to a file "genos_int.txt"
# YOUR CODE