Python Basics

This notebook will allow you to practice some basic skills for using python: working with different data types, using various data structures, reading and writing text files, using conditionals, control flow structures, creating functions, and of course working with ipython notebooks.

Motivation:

A biologist is interested in the genetic basis of height. She measures the heights of many subjects and sends off their DNA samples to a core for genotyping arrays. These arrays determine the DNA bases at the variable sites of the genome (known as single nucleotide polymorphisms, or SNPs). Since humans are diploid, i.e. have two of each chromosome, each data point will be two DNA bases corresponding to the two chromosomes in each individual. At each SNP, there will be only three possible genotypes, e.g. AA, AG, GG for an A/G SNP. In order to test the correlation between a SNP genotype and height, she wants to perform a regression with an additive genetic model. However, she cannot do this with the data in the current form. She needs to convert the genotypes, e.g. AA, AG, and GG, to the numbers 0, 1, and 2, respectively (in the example the number corresponds the number of G bases the person has at that SNP). Since she has too much data to do this manually, e.g. in Excel, she comes to you for ideas of how to efficiently transform the data.

Part 1:

Create a new list which has the converted genotype for each subject ('AA' -> 0, 'AG' -> 1, 'GG' -> 2).



In [1]:

    
genos = ['AA', 'GG', 'AG', 'AG', 'GG']
genos_new = []
# Use your knowledge of if/else statements and loop structures below.

Check your work



In [2]:

    
genos_new == [0, 2, 1, 1, 2]









    Out[2]:





False

Part 2:

Sometimes there are errors and the genotype cannot be determined. Adapt your code from above to deal with this problem (in this example missing data is assigned NA for "Not Available").



In [3]:

    
genos_w_missing = ['AA', 'NA', 'GG', 'AG', 'AG', 'GG', 'NA']
genos_w_missing_new = []
# The missing data should not be converted to a number, but remain 'NA' in the new list

Check your work



In [ ]:

    
genos_w_missing_new == [0, 'NA', 2, 1, 1, 2, 'NA']



In [ ]:

Main Practice

Setup: Open a terminal and run the following commands:

# create a new directory
mkdir python-intro
cd python-intro

# download data file, and ipython notebook
curl -O https://raw.githubusercontent.com/gastonstat/stat259/gh-pages/tutorials/genos.txt
curl -O https://raw.githubusercontent.com/gastonstat/stat259/gh-pages/tutorials/genotypes.ipynb

Data File: The raw data for this practice is in the file genos.txt which contains one column of genotypes (one genotype per row). Each genotype consists of two characters: e.g. 'AA' or 'GG'. In addition, there are some rows that contain missing values denoted as 'NA'.

I. Read in the data and store the contents in a list called genos.

II. Find out how what are the different (i.e. unique) values are in genos.

III. Calculate the number of occurrences of each genotype, and store the results in a dictionary called geno_counts. Use the following 3 approaches:

Use a for loop to count the genotypes (store the result in a dictionary)
Get the same counts but this time using the count() method
Another alternative is to use Counter from Collections

IV. Once you've counted the genotypes, make a function get_proportions() that takes geno_counts and returns a dictionary with relative frequencies (i.e. proportions) of genotypes. Also, test your function with the provided assertion.

V. Convert the string values in genos into integers ('NA' remains as 'NA') and put them in a new list called numeric_genos:

'AA' = 0
'AG' = 1
'GG' = 2
'NA' = 'NA'

VI. Write the data in numeric_genos to a text file called genos_int.txt

VII. Finally, convert your notebook to html (and open it) by running these commands from the shell:

ipython nbconvert genotypes.ipynb
open genotypes.html



In [1]:

    
# things to be imported
from __future__ import division  # if you use python 2.?
from collections import Counter

I. Reading a text file

Some refs about Reading Files:

File Operations: https://github.com/dlab-berkeley/python-fundamentals/blob/master/cheat-sheets/12-Files.ipynb
Reading Text Files: http://www.jarrodmillman.com/rcsds/lectures/reading_text_files.html



In [2]:

    
# open 'genos.txt' and store values in "genos"
# YOUR CODE

II. Unique Genotypes



In [3]:

    
# Find the unique values in genos
# YOUR CODE

III. Counting Genotypes

a) Using a for loop



In [4]:

    
# For loop to count occurrences of AA, AG, GG, NA
# (store results in dictionary "geno_counts")
# YOUR CODE

b) Using count method



In [5]:

    
# YOUR CODE

c) Using Counter from collections



In [6]:

    
# YOUR CODE

IV. Function to Calculate Proportions



In [7]:

    
# Write a function "get_proportions()"
# Parameters: geno_counts (dictionary)
# Returns: dictionary of proportions
# YOUR CODE



In [8]:

    
# apply your function:
# get_proportions(geno_counts)



In [9]:

    
# test for function get_proportions()
def test_get_proportions():
    # We make a fake dictionary
    input_val = {'AA': 2, 'AB': 4, 'BB': 14}
    expected_result = {'AA': 0.1, 'AB': 0.2, 'BB': 0.7}
    # run function
    res = get_proportions(input_val)
    assert res == expected_result



In [10]:

    
# run the test and see what happens:
# test_get_proportions()

V. Converting to numeric genotypes



In [11]:

    
# convert genotypes: AA = 0, AG = 1, GG = 2, NA = NA
# (create a list called "numeric_genos")
# YOUR CODE

VI. Write Numeric Genotypes to a text file



In [12]:

    
# write values in "numeric_genos" to a file "genos_int.txt"
# YOUR CODE