Introduction to Pandas and DataFrames


Venkat Malladi (Computational Biologist BICF)

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

1.1 What is Pandas?


Description

  • Python library providing high-performance, easy-to-use data structures and data analysis tools.
  • Suitable for tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Provides data analysis features similar to: R, MATLAB, SAS
  • Based on NumPy

1.2 Key Features


The library is oriented towards table-like data structures that can be manipulated by a collection of methods:

  • Easy handling of missing data
  • Label-based slicing, indexing and subsetting of large data sets
  • Powerful and flexible group by functionality to perform split-apply-combine operations on data sets
  • Read/Write data from/to Excel, CSV, SQL databases, JSON

1.3 Import Pandas


Before we explore the pandas package, let's import pandas.

The convention is to use pd to refer to pandas when importing the package.


In [ ]:
# Import Pandas and Numpy

import pandas as pd
import numpy as np

1.4 Data Structures

  • Series
  • Dataframe

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

2.1 Series


  • One dimensional array-like object
  • Contains an array of data (of any NumPy data type) with an index that labels each element in the vector.

    • Indexes can be:
      • integers
      • strings
      • other data types
NOTE: A Series is not a NumPy array, but it is built on top of a NumPy array and adds functionality to it

2.2 Series - Integer indexes


  • If an index is not specified, a default sequence of integers is assigned as index.

Index  Value
0      35
1      50
2      25
  • Make a Series of count data

In [ ]:
# Make Series of count data and visualize the series

counts = pd.Series([223, 43, 53, 24, 43])
counts
  • What datatype is the counts object?

In [ ]:
# What datatype is the counts object?

type(counts)

2.3 Series - String indexes


  • We can assign meaningful labels to the indexes when making the Series object by specifying an array of index labels

Index  Value
CA     35
TX     50
OK     25
  • Make Series of count data with Gene Symbols

In [ ]:
# Make Series of count data with Gene Symbols

rna_counts = pd.Series([50, 10, 12, 29, 4], index=['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'])
rna_counts

2.4 Series - Dictionary


  • Can be thought of as a dict
  • Can be constructed from a dict directly.
  • Construct second sample RNA-counts dict

In [ ]:
# Construct second sample RNA-counts dict

cell2_counts = {'BRCA2':5, 'GATA2':20, 'Myc':45, 'FOXA1':10, 'ERCC2':0, 'BRCA1': 20}
cell2_counts
  • Make pandas Series from RNA-counts dict

In [ ]:
# Make pandas Series from RNA-counts dict

rna_counts_cell2 = pd.Series(cell2_counts)
rna_counts_cell2

2.5.1 Series - Referencing Elements - Integer


NOTE: We can access the values like an array
  • Integer Index
  • Access the 1st element of counts data

In [ ]:
# Access the 1st element of counts data

counts[0]
  • Get the 2nd through 4th elements

In [ ]:
# Get the 2nd through 4th elements

counts[1:4]

2.5.2 Series - Referencing Elements - String


  • String Index
  • Get the counts for the Myc Gene

In [ ]:
# Get the counts for Myc Gene

rna_counts['Myc']
  • Get the counts for FOXA1, GATA2 and BRCA2

In [ ]:
# Get the Counts for FOXA1, GATA2 and BRCA2

rna_counts[['FOXA1', 'GATA2', 'BRCA2']]

2.5.3 Series - Referencing Elements - array/index values


  • Can get the array representation and index object of the Series via its values and index attributes, respectively.
  • Get the values of the counts Series

In [ ]:
# Get the values of the counts Series

counts.values
  • Get the index of the rna_counts Series

In [ ]:
# Get the index of the rna_counts Series

rna_counts.index

2.5.4 Series - Referencing Elements - labels


  • Can give both the array of values and the index meaningful labels themselves

In [ ]:
rna_counts.name = 'RNA Counts'
rna_counts.index.name = 'Symbol'
rna_counts

2.6 Series - array operations


  • NumPy array operations can be applied to Series without losing the data structure
  • Use boolean array to filter Series
  • Create a boolean mask of genes that have greater than 20 counts

In [ ]:
# Create a boolean mask of genes that have greater than 20 counts

rna_counts > 20
  • Select genes that have greater than 20 counts

In [ ]:
# Select genes that have greater than 20 counts

rna_counts[rna_counts > 20]

2.7 Series - null values


  • Many times Series have missing values that you need to identify and clean

Marvel Cinematic Universe

  • Make a movie Series with missing values

In [ ]:
# Make a movie Series with missing values

mcu_opening = {'Black Panther': 202003951, 'Thor: Ragnarok': 122744989, 'Spider-Man: Homecoming': 117027503,
              'Guardians of the Galaxy Vol. 2': 146510104, 'Doctor Strange': 85058311, 
               'Captain America: Civil War': 179139142}
mcu_movies = ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2', 'Doctor Strange', 'Captain America: Civil War']
mcu_series = pd.Series(mcu_opening, index=mcu_movies)

mcu_series
  • Find movies with no opening revenue

In [ ]:
# Find movies with no opening revenue

pd.isnull(mcu_series)

In [ ]:
# A good opportunity to use a Boolean filter to get the index and show only the movie names

mcu_series[pd.isnull(mcu_series)].index.values
  • Find movies with opening revenue

In [ ]:
# Find movies with opening revenue

pd.notnull(mcu_series)
  • Display only movies with no opening revenue

In [ ]:
# Display only movies with no opening revenue

mcu_series[pd.isnull(mcu_series)].index.values

2.8 Series - auto alignment


  • Index labels are used to align (merge) data when used in operations with other Series objects

In [ ]:
rna_counts

In [ ]:
rna_counts_cell2
  • Combine counts for 2 cells (rna_counts and rna_counts_cell2)

In [ ]:
# Combine counts for 2 cells 

rna_counts + rna_counts_cell2
  • Adding Series combines values with the same label in the resulting Series
  • Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position
  • Notice the missing values: if a label is present in only one Series, it can't be added to a value in the other, so the result is a missing value

Exercise 1


Create a Series from the list data below, using the student names as the index labels and the test scores as the values.

Which Students have scores greater than 15?

Bonus: How would you get the students with scores greater than 15 and less than 20?

Hint: Use the bitwise operator &

Names      Value
Anastasia  12.5
Dima       9
Katherine  16.5
James      NaN
Emily      9
Michael    20
Matthew    14.5
Laura      NaN
Kevin      8
Jonas      19

In [ ]:
# Sample Python data and labels:


students = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']

test_scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]

s_scores = pd.Series(test_scores, index=students)

# Which Students have scores greater than 15?
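# One possible solution (a sketch, not the only approach): boolean filtering on the Series
s_scores[s_scores > 15]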

In [ ]:
# Bonus: How would you get the students with scores greater than 15 and less than 20?
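# A sketch combining two conditions with the bitwise operator & (parentheses are required)
s_scores[(s_scores > 15) & (s_scores < 20)]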

Exercise 2

What are the mean, median, and max test scores?


In [ ]:
# What are the mean, median, and max test scores?
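# A sketch: these Series reductions skip NaN values by default
s_scores.mean(), s_scores.median(), s_scores.max()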

In [ ]:


In [ ]:

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

3.1 Dataframe


  • A DataFrame is a tabular data structure, comprised of rows and columns like in a spreadsheet
  • Each column can be a different value type (numeric, string, boolean, etc.)

Title                           Year  Studio       Rating
Ant-Man and the Wasp            2018  Buena Vista  NaN
Avengers: Infinity War          2018  Buena Vista  NaN
Black Panther                   2018  Buena Vista  0.96
Thor: Ragnarok                  2017  Buena Vista  0.92
Spider-Man: Homecoming          2017  Sony         0.92
Guardians of the Galaxy Vol. 2  2017  Buena Vista  0.83

3.2 Dataframe - from dict of lists


  • dict keys: columns
  • dict values (arrays): rows
  • Make Dataframe of Marvel data

In [ ]:
# Make Dataframe of Marvel data

mcu_data = {'Title': ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2'],
            'Year':[2018, 2018, 2018, 2017, 2017, 2017],
            'Studio':['Buena Vista', 'Buena Vista', 'Buena Vista', 'Buena Vista', 'Sony', 'Buena Vista'],
            'Rating': [np.nan, np.nan, 0.96, 0.92, 0.92, 0.83]}
df_mcu = pd.DataFrame(mcu_data)
df_mcu
  • What datatype is df_mcu?

In [ ]:
# What datatype is df_mcu?

type(df_mcu)

3.3 Dataframe - specifying indices and columns


  • Order of columns/rows can be specified using:
    • columns array
    • index array

In [ ]:
# Assign column order and index based on Marvel Cinematic Universe movie number

mcu_index = ['mcu_20','mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15']
mcu_columns = ['Title', 'Year', 'Studio', 'Rating']
df_mcu = pd.DataFrame(mcu_data, columns = mcu_columns, index = mcu_index)
df_mcu

3.4 Dataframe - from nested dict of dicts


  • Outer dict keys as columns and inner dict keys as row indices
  • Make Dataframe of population of states

In [ ]:
# Make Dataframe of population

pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
df_pop = pd.DataFrame(pop)
df_pop

3.5 Dataframe - number of rows and columns


  • Get the number of rows in a Dataframe

In [ ]:
# Get the number of rows in a Dataframe

len(df_mcu)
  • Get the (rows, cols) of the Dataframe

In [ ]:
# Get the (rows, cols) of the Dataframe

df_mcu.shape

3.6 Dataframe - index, columns and values


  • Get the column headers

In [ ]:
# Get the column headers

df_mcu.columns
  • Get the row index values

In [ ]:
# Get the row index values

df_mcu.index
  • Get values of the Dataframe only

In [ ]:
# Get values of the Dataframe only

df_mcu.values

3.7 Dataframe - Selecting Columns and Rows


  • There are three basic ways to access the data in the Dataframe:
    1. Quick Access: DataFrame[]
    2. Integer position based selection method: DataFrame.iloc[row, col]
    3. Label based selection method: DataFrame.loc[row, col]

3.7.1 Dataframe - Selecting Columns and Rows - Quick Access


  • Select values in a single column

In [ ]:
# Select values in a single column

df_mcu['Title']
  • Select values in a list of columns

In [ ]:
# Select values in a list of columns

df_mcu[['Title', 'Rating']]
  • Use slice to get the first n rows (NumPy style indexing)

In [ ]:
# Use slice to get the first n rows (NumPy style indexing)

df_mcu[:2]
  • Can combine slice and column selection to select the first n rows

In [ ]:
# Can combine slice and column selection to select the first n rows

df_mcu['Title'][:2]
  • The order of the column selection and the slice doesn't matter

In [ ]:
df_mcu[:4]['Year']

3.7.2 Dataframe - Selecting Columns and Rows - Integer based selection


  • iloc is primarily integer-position based (from 0 to length-1 of the axis), including negative indices that count from the end, but it may also be used with a boolean array.

  • Allowed inputs are:

  • Integer

In [ ]:
# Return values in the first row

df_mcu.iloc[0]

In [ ]:
# Return values in the first row and second column

df_mcu.iloc[0,1]
  • A list or array of integers

In [ ]:
# Return values in the 3rd, 5th and 6th rows

df_mcu.iloc[[2,4,5]]
  • A slice object with ints

In [ ]:
# Return values in the first two rows and columns 2 and 3

df_mcu.iloc[:2, 1:3]

3.7.3 Dataframe - Selecting Columns and Rows - Label based selection


  • loc is primarily label based, but may also be used with a boolean array.

  • Allowed inputs are:

  • A single label

In [ ]:
# Select all values of the 20th Movie

df_mcu.loc['mcu_20']
  • A list or array of labels

In [ ]:
# Select all values of the 20th, 17th and 15th movie, which uses row index values, 
# Not to be confused with df_mcu[['Title', 'Rating']] which uses column headers

df_mcu.loc[['mcu_20', 'mcu_17', 'mcu_15']]
  • A slice object with labels
NOTE: Unlike regular Python slices with integer positions, both the start and the stop are included!

In [ ]:
# Select the Year and Rating


df_mcu.loc[:, ['Year', 'Rating']]

3.7.4 Dataframe - Selecting Columns and Rows - Data Filtering


  • Data filtering using boolean masks
  • Filter and select on a single condition

In [ ]:
# Filter for Rating < .95

df_mcu.loc[df_mcu['Rating'] < .95, :]
  • Filter and select on multiple conditions

In [ ]:
# Filter for Rating < .95 or Studio is Sony
# Reuse the bitwise operators seen earlier, but with OR (|) instead of AND (&).

df_mcu.loc[(df_mcu['Rating'] < .95) | (df_mcu['Studio'] == 'Sony'), :]

3.8 Dataframe - Adding and Deleting data


  • Add a new column

In [ ]:
# Add new predicted rating to Dataframe

df_mcu['Predicted Rating'] = np.random.random(len(df_mcu))
df_mcu
  • Add a new row

In [ ]:
# Add a new row for a new movie 

new_row = pd.Series(['Captain Marvel', 2019, 'Buena Vista', np.nan, np.random.random(1)[0]], index=df_mcu.columns, name='mcu_21')
df_mcu.append(new_row)
  • Drop an existing column

In [ ]:
# Drop the Rating Column

df_mcu.drop('Rating', axis=1)
  • Drop an existing row

In [ ]:
# Drop the 15th and 17th movies

df_mcu.drop(['mcu_15', 'mcu_17'])
  • Columns, Rows or individual elements can be modified similarly using loc or iloc.
NOTE: These operations (append, drop) return a new DataFrame object without changing the original Dataframe.
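
To keep the result of append or drop, assign the returned object to a (new) variable, or use the inplace argument where it is available. A minimal sketch (the variable name is just illustrative):

In [ ]:
# drop returns a new DataFrame; assign it to keep the result, while df_mcu itself is unchanged
df_mcu_no_rating = df_mcu.drop('Rating', axis=1)
df_mcu_no_rating.columns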

Exercise 3


Create a Dataframe from the dictionary data specified below.

Which students have a score between 15 and 20?

Names      Scores  Attempts  Qualify
Anastasia  12.5    1         'yes'
Dima       9       3         'no'
Katherine  16.5    2         'yes'
James      NaN     3         'no'
Emily      9       2         'no'
Michael    20      3         'yes'
Matthew    14.5    1         'yes'
Laura      NaN     1         'no'
Kevin      8       2         'no'
Jonas      19      1         'yes'

In [ ]:
# Sample Python data:


exam_data = {'Names': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 
             'Scores': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 
             'Attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 
             'Qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

In [ ]:
# Select the students that qualify based on the Qualify column ('yes')
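# One possible sketch (df_exam is just an illustrative name for the new DataFrame)
df_exam = pd.DataFrame(exam_data)
df_exam[df_exam['Qualify'] == 'yes']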

Exercise 4


Add a new column, Grade Level, to indicate which grade in high school each student is in.

Add a new student named Jack. (Hint: you will need to use ignore_index=True)


In [ ]:
# Add a new column, Grade Level, to indicate which grade in high school each student is in

In [ ]:
# Add a new student named Jack (Hint: you will need to use ignore_index=True)
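# One possible sketch (uses the df_exam DataFrame from Exercise 3; Jack's values are illustrative)
new_student = {'Names': 'Jack', 'Scores': 15.0, 'Attempts': 1, 'Qualify': 'yes'}
df_exam = df_exam.append(new_student, ignore_index=True)
df_exam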

Exercise 5


Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

  • 'yes': 1
  • 'no': 0

In [ ]:
# Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)
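# One possible sketch using numpy.where on the Qualify column of df_exam (from Exercise 3)
df_exam['Pass'] = np.where(df_exam['Qualify'] == 'yes', 1, 0)
df_exam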

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

4.1 Import and Store Data


The first step in any problem is identifying what format your data is in, and then loading it into whatever framework you're using. Common formats are:

  • Text files: files with a common delimiter to separate values (e.g. CSV uses ,)
  • JSON (JavaScript Object Notation): a standard format for sending data over HTTP requests
  • Web pages: XML and HTML
  • Binary: "pickle" format and HDF5
  • Databases: MySQL, PostgreSQL
NOTE: The most common formats are CSV or text files with other delimiters.

4.2 Import and Store Data - Text Files


  • Reading
    • read_csv: read a file using a comma-separated (,) delimiter
    • read_table: read a file using a tab (\t) delimiter
  • Writing
    • to_csv: write a file using a comma-separated (,) delimiter by default; pass sep='\t' to write a tab-delimited file

Read in Winter Olympic Medal Winners from Kaggle


In [ ]:
# Read in Winter Olympic Medal Winners 

winter_olympics = pd.read_csv('data/winter_olympics.csv')
winter_olympics.head()
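  • Writing works in reverse with to_csv; the sketch below writes a tab-delimited file (the output file name is just an example)

In [ ]:
# Write the DataFrame back out as a tab-delimited file (output file name is illustrative)
winter_olympics.to_csv('data/winter_olympics_copy.txt', sep='\t', index=False)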

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

5.1 Summarizing and Computing Descriptive Statistics


  • Pandas has a lot of built-in essential functionality, common to the pandas data structures, to help explore the data.

5.2 Summarizing and Computing Descriptive Statistics - Head and Tail


  • To view a small sample of a Series or DataFrame object, use:
    • head()
    • tail()
NOTE: The default number of elements to display is five, but you may pass a custom number.
  • Get the First 3 lines of Dataframe

In [ ]:
# Get the First 3 lines of a Dataframe

winter_olympics.head(3)

5.3 Summarizing and Computing Descriptive Statistics - Sorting


  • To sort data for exploring the data use:
    • sort_index(): object by labels (along an axis)
    • sort_values(): by the values along either axis
NOTE: axis: 0 or ‘index’, 1 or ‘columns’ and default is 0
  • Sort Dataframe by rows in ascending order

In [ ]:
# Sort Dataframe by rows in ascending order

df_mcu.sort_index(axis=0, ascending=True)
  • Sort the Dataframe by column values in descending order

In [ ]:
# Sort Dataframe by column in descending order

df_mcu.sort_values(by=['Rating', 'Predicted Rating'], ascending=False)

5.4 Summarizing and Computing Descriptive Statistics - Descriptive statistics


  • Built-in functions calculate summary values over rows or columns
  • describe(): return summary statistics of each column
      - for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
      - for non-numeric data: count, unique, most-frequent item, etc.

In [ ]:
# Summary Statistics for the Dataframe

df_mcu.describe()
  • mean()

In [ ]:
# Mean of the Rating and Predicted Rating Columns

df_mcu.loc[:,['Rating', 'Predicted Rating']].mean()
  • var()

In [ ]:
# Get the variance of the Rating column

df_mcu.loc[:,['Rating']].var()

5.5 Summarizing and Computing Descriptive Statistics - Missing Data


  • Data comes in many shapes and forms, and pandas is very flexible in handling missing data:
    • NaN is the default missing value marker
    • Python None also arises and is also treated as “missing” / “not available” / “NA”
  • Drop rows with NaN values

In [ ]:
# Drop rows with NaN values

df_mcu.dropna()
  • Fill NaN with a default value

In [ ]:
# Fill NaN values in the Dataframe with a default value

df_mcu.fillna(0)
  • Use inplace=True to modify the Dataframe instead of returning a new object

In [ ]:
# Fill NaN values in the Dataframe with a default value, in place

df_mcu.fillna(0, inplace=True)
df_mcu

Exercise 6


What is the median score of the students on the exam?


In [ ]:
# What is the median score of the students on the exam?
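# A possible sketch (assuming the df_exam DataFrame from Exercise 3)
df_exam['Scores'].median()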

Exercise 7


Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all NaN scores with 0. (Passing is 12 points)

Compute the mean. Would the class as a whole pass the test?

Are there any students that will fail now?


In [ ]:
# Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all NaN scores with 0. (Passing is 12 points)
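# One possible sketch (using the df_exam DataFrame from Exercise 3)
# Deduct 4 points where Attempts >= 2, then replace the remaining NaN scores with 0
df_exam.loc[df_exam['Attempts'] >= 2, 'Scores'] -= 4
df_exam['Scores'] = df_exam['Scores'].fillna(0)
df_exam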

In [ ]:
# Compute the mean. Would the class as a whole pass the test?

In [ ]:
# Are there any students that will fail now?

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

6.1 Grouped and apply


The groupby functionality refers to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently

6.2 Grouped and apply - Splitting


Pandas objects can be split on any of their axes.

  • A 'grouping' is provided as a mapping of labels to group names in a GroupBy object
  • groupby()

In [ ]:
# Groupby year of release and get mean Rating

df_mcu.groupby('Year').mean()

6.3 Grouped and apply - Apply


Apply the same function to every column or row:

  • applymap: Apply same function across every cell
  • apply: Apply same function to every column (default) or row
    - Apply square to every value in a dataframe

In [ ]:
# Apply square to every value in a dataframe

test_data = np.arange(9).reshape(3,-1)

df_test = pd.DataFrame(test_data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
df_test

In [ ]:
df_test.applymap(np.square)
    - Apply a new function that subtracts 2 times the min from the max in every column

In [ ]:
# Define max minus min function
def max_minus_min(x): 
    return max(x)-(2*min(x))

# Apply the max_minus_min function (max minus 2 times min) to every column


df_test.apply(max_minus_min)

Exercise 8


Group students by attempts and find the average score?


In [ ]:
# Group students by attempts and find the average score?
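# One possible sketch (using the df_exam DataFrame from Exercise 3)
df_exam.groupby('Attempts')['Scores'].mean()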

Exercise 9


Group students by their pass result and report the variance in scores?


In [ ]:
# Group students by their pass result and report the variance in scores?
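# One possible sketch (using the Pass column added in Exercise 5)
df_exam.groupby('Pass')['Scores'].var()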

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

7.1 Data Transformation and Normalization


For Machine Learning (ML) and other data analysis it is important to:

  • Explore your data
  • Standardize (transform/normalize) data so that different columns become comparable / compatible

7.2 Data Transformation and Normalization - Data Exploration


Pandas Dataframes have a built in plot functionality:

  • Dataframe.plot()
NOTE: We will go into more plotting libraries later in the course.

In [ ]:
# Import matplotlib and set it up to display plots in the notebook

import matplotlib.pyplot as plt
%matplotlib inline
  • Plot the number of medals in a bar chart by country using: plot.bar()

In [ ]:
# Plot the Number of Medals in barchart by country using: plot.bar()

winter_olympics.groupby(['Country'])['Medal'].count().plot.bar()
  • Other useful plots:
    • hist: plot.hist()
    • scatter: plot.scatter()
    • boxplot: plot.box()

7.3 Data Transformation and Normalization - Normalization


Why Normalize (re-scale)?

  • Transform data to obtain a certain distribution e.g. from lognormal to normal
  • Normalize data so different columns become comparable / compatible

Typical normalization approaches (the first two are sketched below):

  • Z-score transformation
  • Scale to between 0 and 1
  • Trimmed mean normalization
  • Vector length transformation
  • Quantile normalization
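
For example, the first two approaches are one-liners on a numeric DataFrame. A minimal sketch using the small df_test DataFrame from section 6.3 (any numeric DataFrame would work the same way):

In [ ]:
# Z-score transformation: subtract each column's mean and divide by its standard deviation
df_z = (df_test - df_test.mean()) / df_test.std()
df_z

In [ ]:
# Scale each column to between 0 and 1 (min-max scaling)
df_minmax = (df_test - df_test.min()) / (df_test.max() - df_test.min())
df_minmax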


Exercise 10


In the Winter Olympics, which country has the most Biathlon medals?

In the Winter Olympics, which country has the most Skiing medals? And in which event do they have the most Gold medals?


In [ ]:
# In the Winter olympics which country has the most Biathlon medals?
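# One possible sketch; this assumes the winter_olympics file has a 'Sport' column
# (only 'Country' and 'Medal' were used earlier, so check winter_olympics.columns first)
winter_olympics[winter_olympics['Sport'] == 'Biathlon'].groupby('Country')['Medal'].count().sort_values(ascending=False).head()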

In [ ]:


In [ ]:
# In the Winter olympics which country has the most Skiing medals?

In [ ]:
# And in which event do they have the most Gold medals?

In [ ]:

Exercise 11


Import the Summer Olympic dataset located in ('data/summer_olympics.csv')

Which Olympian has the most medals?

Which Olympian has the most Gold medals and for which Country and Sport?

Which rows have no values and why?


In [ ]:
# Import the Summer Olympic dataset located in ('data/summer_olympics.csv')
summer_olympics = pd.read_csv('data/summer_olympics.csv')


#Which Olympian has the most medals?
summer_olympics.groupby(['Athlete'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which Olympian has the most Gold medals and for which Country?
summer_olympics.groupby(['Athlete','Medal','Country'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which Olympian has the most Gold medals and for which Sport?
summer_olympics.groupby(['Athlete','Medal','Discipline'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which rows have no values and why?
summer_olympics[pd.isnull(summer_olympics).any(axis=1)]

Exercise 12


Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

Calculate CPM for each Sample. (CPM, Counts Per Million)

Formula: CPM = readsMappedToGene / totalNumReads × 10^6

  • totalNumReads: total number of mapped reads in a sample
  • readsMappedToGene: number of reads mapped to a selected gene

Which Gene has the highest average CPM?

What is the correlation between SRR1550986 and SRR1550987?


In [ ]:
# Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

rna_counts = pd.read_csv('data/RNAseq_count_table.txt', sep = '\t', index_col = 0)
rna_counts.head()

In [ ]:
# Calculate CPM for each Sample.


rna_cpm = rna_counts.divide(rna_counts.sum(axis=0)).multiply(1000000)
rna_cpm.head()

In [ ]:
# Which Gene has the highest average CPM?


rna_cpm.mean(axis=1).sort_values(ascending=False).head()

In [ ]:
# What is the correlation between SRR1550986 and SRR1550987?

rna_cpm.corr()