Intoduction to Pandas and Dataframes


Venkat Malladi (Computational Biologist BICF)

Agenda


  1. Introduction to Pandas
  2. DataSeries
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

1.1 What is Pandas?


Description

  • Python library providing high-performance, easy-to-use structures and data analysis tools.
  • Suitable for tabular data with heterogeneously-typed colums, as in an SQL table or Excel spreadsheet
  • Provides data analysis feautures similar to: R, MATLAB, SAS
  • Based on NumPy

1.2 Key Features


The library is oriented towards table-like data structures that can be manipulated by a collection of methods:

  • Easy handling of missing data
  • Label-based slicing, indexing and subsetting of large data sets
  • Powerful and flexible group by functionality to perform split-apply-combine operations on data sets
  • Read/Write data from/to Excel, CSV, SQL databases, JSON

1.3 Import Pandas


Before we explore the pandas package, let's import pandas.

The convention is to use pd to refere to pandas when importing the package.


In [1]:
# Import Pandas and Numpy

import pandas as pd
import numpy as np

1.4 Data Structures

  • Series
  • Dataframe

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

2.1 Series


  • One dimensional array-like object
  • Contains array of data (of any NumPy data type) with an index that labels each element in the vector.

    • Indexes can be:
      • integers
      • strings
      • other data types
NOTE: Not Numpy arrays but adds functionality to a Numpy array

2.2 Series - Integer indexes


  • If an index is not specified, a default sequence of integers is assigned as index.
Index Value
0 35
1 50
2 25
  • Make a Series of count data

In [2]:
# Make Series of count data and visaulize series

counts = pd.Series([223, 43, 53, 24, 43])
counts


Out[2]:
0    223
1     43
2     53
3     24
4     43
dtype: int64
  • What datatype is the counts object?

In [3]:
# What datatype is the counts object?

type(counts)


Out[3]:
pandas.core.series.Series

2.3 Series - String indexes


  • We can assign meaningful lables to the indexes when making the Series object by specfing an array of index
Index Value
CA 35
TX 50
OK 25
  • Make Series of count data with Gene Symbols

In [4]:
# Make Series of count data with Gene Symbols

rna_counts = pd.Series([50, 10, 12, 29, 4], index=['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'])
rna_counts


Out[4]:
BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
dtype: int64

2.4 Series - Dictionary


  • Can be thought of as a dict
  • Can be constructed from a dict directly.
  • Construct second sample RNA-counts dict

In [5]:
# Construct second sample RNA-counts dict

cell2_counts = {'BRCA2':5, 'GATA2':20, 'Myc':45, 'FOXA1':10, 'ERCC2':0, 'BRCA1': 20}
cell2_counts


Out[5]:
{'BRCA1': 20, 'BRCA2': 5, 'ERCC2': 0, 'FOXA1': 10, 'GATA2': 20, 'Myc': 45}
  • Make pandas Series from RNA-counts dict

In [6]:
# Make pandas Series from RNA-counts dict

rna_counts_cell2 = pd.Series(cell2_counts)
rna_counts_cell2


Out[6]:
BRCA1    20
BRCA2     5
ERCC2     0
FOXA1    10
GATA2    20
Myc      45
dtype: int64

2.5.1 Series - Referencing Elements - Integer


NOTE: We can access the values like an array text
  • Integer Index
  • Access the 1st element of counts data

In [7]:
# Access the 1st element of counts data

counts[0]


Out[7]:
223
  • Get the 2nd through 4th elements

In [8]:
# Get the 2nd through 4th elements

counts[1:4]


Out[8]:
1    43
2    53
3    24
dtype: int64

2.5.2 Series - Referencing Elements - String


  • String Index
  • Get the counts for the Myc Gene

In [9]:
# Get the counts for Myc Gene

rna_counts['Myc']


Out[9]:
12
  • Get the counts for FOXA1, GATA2 and BRCA2

In [10]:
# Get the Counts for FOXA1, GATA2 and BRCA2

rna_counts[['FOXA1', 'GATA2', 'BRCA2']]


Out[10]:
FOXA1    29
GATA2    10
BRCA2    50
dtype: int64

2.5.3 Series - Referencing Elements - array/index values


  • Can get the array representation and index object of the Series via its values and index atrributes, respectively.
  • Get the values in the counts matrix

In [11]:
# Get the values in the counts matrix

counts.values


Out[11]:
array([223,  43,  53,  24,  43])
  • Get the index of the rna_counts matrix

In [12]:
# Get the index of the rna_counts matrix

rna_counts.index


Out[12]:
Index(['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'], dtype='object')

2.5.4 Series - Referencing Elements - labels


  • Can give both the array of values and the index meaningful labels themselves

In [13]:
rna_counts.name = 'RNA Counts'
rna_counts.index.name = 'Symbol'
rna_counts


Out[13]:
Symbol
BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
Name: RNA Counts, dtype: int64

2.6 Series - array operations


  • NumPy array operations can be applied to Series without losing the data structure
  • Use boolean array to filter Series
  • Select Genes that have greater than 20 counts

In [14]:
# Select Genes that have greater than 20 counts

rna_counts > 20


Out[14]:
Symbol
BRCA2     True
GATA2    False
Myc      False
FOXA1     True
ERCC2    False
Name: RNA Counts, dtype: bool
  • Select genes that have greater than 20 counts

In [15]:
# Select genes that have greater than 20 counts

rna_counts[rna_counts > 20]


Out[15]:
Symbol
BRCA2    50
FOXA1    29
Name: RNA Counts, dtype: int64

2.7 Series - null values


  • Many times Series have missing values that you need to identify and clean

Marvel Cinematic Universe

  • Make Movie DataFrame with missing values

In [16]:
# Make Movie Database with missing values

mcu_opening = {'Black Panther': 202003951, 'Thor: Ragnarok': 122744989, 'Spider-Man: Homecoming': 117027503,
              'Guardians of the Galaxy Vol. 2': 146510104, 'Doctor Strange': 85058311, 
               'Captain America: Civil War': 179139142}
mcu_movies = ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2', 'Doctor Strange', 'Captain America: Civil War']
mcu_series = pd.Series(mcu_opening, index=mcu_movies)

mcu_series


Out[16]:
Ant-Man and the Wasp                      NaN
Avengers: Infinity War                    NaN
Black Panther                     202003951.0
Thor: Ragnarok                    122744989.0
Spider-Man: Homecoming            117027503.0
Guardians of the Galaxy Vol. 2    146510104.0
Doctor Strange                     85058311.0
Captain America: Civil War        179139142.0
dtype: float64
  • Find movies with no opening revenue

In [17]:
# Find movies with no opening revenue

pd.isnull(mcu_series)


Out[17]:
Ant-Man and the Wasp               True
Avengers: Infinity War             True
Black Panther                     False
Thor: Ragnarok                    False
Spider-Man: Homecoming            False
Guardians of the Galaxy Vol. 2    False
Doctor Strange                    False
Captain America: Civil War        False
dtype: bool

In [18]:
# Good opportunity to use Boolean filter get index and only movie names

mcu_series[pd.isnull(mcu_series)].index.values


Out[18]:
array(['Ant-Man and the Wasp', 'Avengers: Infinity War'], dtype=object)
  • Find movies with opening revenue

In [19]:
# Find movies with opening revenue

pd.notnull(mcu_series)


Out[19]:
Ant-Man and the Wasp              False
Avengers: Infinity War            False
Black Panther                      True
Thor: Ragnarok                     True
Spider-Man: Homecoming             True
Guardians of the Galaxy Vol. 2     True
Doctor Strange                     True
Captain America: Civil War         True
dtype: bool
  • Display only movie with no opening revenue

In [20]:
# Display only movies with no opening revenue

mcu_series[pd.isnull(mcu_series)].index.values


Out[20]:
array(['Ant-Man and the Wasp', 'Avengers: Infinity War'], dtype=object)

2.8 Series - auto alignment


  • Index labels are used to align (merge) data when used in operations with other Series objects

In [21]:
rna_counts


Out[21]:
Symbol
BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
Name: RNA Counts, dtype: int64

In [22]:
rna_counts_cell2


Out[22]:
BRCA1    20
BRCA2     5
ERCC2     0
FOXA1    10
GATA2    20
Myc      45
dtype: int64
  • Combine counts for 2 cells (rna_counts and rna_counts_cell2)

In [23]:
# Combine counts for 2 cells 

rna_counts + rna_counts_cell2


Out[23]:
BRCA1     NaN
BRCA2    55.0
ERCC2     4.0
FOXA1    39.0
GATA2    30.0
Myc      57.0
dtype: float64
  • Adding Series combined values with the same label in the resulting series
  • Contrast this with arrays, where arrays of the same length will combine values element-wise
  • Notice that the missing values. If one Series has a missing value you can't add it to the other value and a missing value results.

Exercise 1


Create a Series from a specified list data which has the index labels as student names and test values.

Which Students have scores greater than 15?

Bonus: How would you use get the Students scores greater than 15 and less than 20?

Hint: Will use the bitwise opperator &
Names Value
Anastasia 12.5
Dima 9
Katherine 16.5
James NaN
Emily 9
Michael 20
Matthew 14.5
Laura NaN
Kevin 8
Jonas 19

In [24]:
# Sample Python data and labels:


students = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']

test_scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]

s_scores = pd.Series(test_scores, index=students)

# Which Students have scores greater than 15?


s_scores[s_scores > 15].index.values


Out[24]:
array(['Katherine', 'Michael', 'Jonas'], dtype=object)

In [25]:
# Bonus: How would you use get the Students scores greater than 15 and less than 20?

s_scores[(s_scores > 15) & (s_scores < 20)]


Out[25]:
Katherine    16.5
Jonas        19.0
dtype: float64

Exercise 2

What is the mean, median and max test scores?


In [26]:
# What is the mean, median and max test scores?

s_scores.mean()


Out[26]:
13.5625

In [27]:
s_scores.median()


Out[27]:
13.5

In [28]:
s_scores.max()


Out[28]:
20.0

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

3.1 Dataframe


  • A DataFrame is a tabular data structure, comprised of rows and columns like in a spreadsheet
  • Each column can be a different value type (numeric, string, boolean etc)
Title Year Studio Rating
Ant-Man and the Wasp 2018 Beuna Vista NaN
Avengers: Infinity War 2018 Beuna Vista NaN
Black Panther 2018 Beuna Vista 0.96
Thor: Ragnarok 2017 Beuna Vista 0.92
Spider-Man: Homecoming 2017 Sony 0.92
Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83

3.2 Dataframe - from dict of lists


  • dict keys: columns
  • dict values (arrays): rows
  • Make Dataframe of Marvel data

In [29]:
# Make Dataframe of Marvel data

mcu_data = {'Title': ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2'],
            'Year':[2018, 2018, 2018, 2017, 2017, 2017],
            'Studio':['Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Sony', 'Beuna Vista'],
            'Rating': [np.nan, np.nan, 0.96, 0.92, 0.92, 0.83]}
df_mcu = pd.DataFrame(mcu_data)
df_mcu


Out[29]:
Rating Studio Title Year
0 NaN Beuna Vista Ant-Man and the Wasp 2018
1 NaN Beuna Vista Avengers: Infinity War 2018
2 0.96 Beuna Vista Black Panther 2018
3 0.92 Beuna Vista Thor: Ragnarok 2017
4 0.92 Sony Spider-Man: Homecoming 2017
5 0.83 Beuna Vista Guardians of the Galaxy Vol. 2 2017
  • What datatype is df_mcu?

In [30]:
# What datatype is df_mcu?

type(df_mcu)


Out[30]:
pandas.core.frame.DataFrame

3.3 Dataframe - specifying indices and columns


  • Order of columns/rows can be specified using:
    • columns array
    • index array

In [31]:
# Assign column order and index based on Marvel Cinemetic Universe Movie Number 

mcu_index = ['mcu_20','mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15']
mcu_columns = ['Title', 'Year', 'Studio', 'Rating']
df_mcu = pd.DataFrame(mcu_data, columns = mcu_columns, index = mcu_index)
df_mcu


Out[31]:
Title Year Studio Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN
mcu_18 Black Panther 2018 Beuna Vista 0.96
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83

3.4 Dataframe - from nested dict of dicts


  • Outer dict keys as columns and inner dict keys as row indices
  • Make Dataframe of population of states

In [32]:
# Make Dataframe of population

pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
df_pop = pd.DataFrame(pop)
df_pop


Out[32]:
Nevada Ohio
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6

3.5 Dataframe - number of rows and columns


  • Get the number of rows in a Dataframe

In [33]:
# Get the number of rows in a Dataframe

len(df_mcu)


Out[33]:
6
  • Get the (rows, cols) of the Dataframe

In [34]:
# Get the (rows, cols) of the Dataframe

df_mcu.shape


Out[34]:
(6, 4)

3.6 Dataframe - index, columns and values


  • Get the column headers

In [35]:
# Get the column headers

df_mcu.columns


Out[35]:
Index(['Title', 'Year', 'Studio', 'Rating'], dtype='object')
  • Get the row index values

In [36]:
# Get the row index values

df_mcu.index


Out[36]:
Index(['mcu_20', 'mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15'], dtype='object')
  • Get values of the Dataframe only

In [37]:
# Get values of the Dataframe only

df_mcu.values


Out[37]:
array([['Ant-Man and the Wasp', 2018, 'Beuna Vista', nan],
       ['Avengers: Infinity War', 2018, 'Beuna Vista', nan],
       ['Black Panther', 2018, 'Beuna Vista', 0.96],
       ['Thor: Ragnarok', 2017, 'Beuna Vista', 0.92],
       ['Spider-Man: Homecoming', 2017, 'Sony', 0.92],
       ['Guardians of the Galaxy Vol. 2', 2017, 'Beuna Vista', 0.83]], dtype=object)

3.7 Dataframe - Selecting Columns and Rows


  • There are three basic ways to access the data in the Dataframe:
    1. Quick Access: DataFrame[]
    2. Integer position based selection method: DataFrame.iloc[row, col]
    3. Label based selection method: DataFrame.loc[row, col]

3.7.1 Dataframe - Selecting Columns and Rows - Quick Access


  • Select values in a single column

In [38]:
# Select values in a single column

df_mcu['Title']


Out[38]:
mcu_20              Ant-Man and the Wasp
mcu_19            Avengers: Infinity War
mcu_18                     Black Panther
mcu_17                    Thor: Ragnarok
mcu_16            Spider-Man: Homecoming
mcu_15    Guardians of the Galaxy Vol. 2
Name: Title, dtype: object
  • Select values in a list of columns

In [39]:
# Select values in a list of columns

df_mcu[['Title', 'Rating']]


Out[39]:
Title Rating
mcu_20 Ant-Man and the Wasp NaN
mcu_19 Avengers: Infinity War NaN
mcu_18 Black Panther 0.96
mcu_17 Thor: Ragnarok 0.92
mcu_16 Spider-Man: Homecoming 0.92
mcu_15 Guardians of the Galaxy Vol. 2 0.83
  • Use slice to get the first n rows (NumPy style indexing)

In [40]:
# Use slice to get the first n rows (NumPy style indexing)

df_mcu[:2]


Out[40]:
Title Year Studio Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN
  • Can combine slice and column selection to select the first n rows

In [41]:
# Can combine slice and column selection to select the first n rows

df_mcu['Title'][:2]


Out[41]:
mcu_20      Ant-Man and the Wasp
mcu_19    Avengers: Infinity War
Name: Title, dtype: object
  • Order of column and slice doesn't matter

In [42]:
df_mcu[:4]['Year']


Out[42]:
mcu_20    2018
mcu_19    2018
mcu_18    2018
mcu_17    2017
Name: Year, dtype: int64

3.7.2 Dataframe - Selecting Columns and Rows - Integer based selection


  • iloc is primarily an integer position based (from 0 to length-1 of the axis), when we include the starting from the end indexing, but may also be used with a boolean array.

  • Allowed inputs are:

  • Integer

In [43]:
# Return values in the first row

df_mcu.iloc[0]


Out[43]:
Title     Ant-Man and the Wasp
Year                      2018
Studio             Beuna Vista
Rating                     NaN
Name: mcu_20, dtype: object

In [44]:
# Return values in the first row and second column

df_mcu.iloc[0,1]


Out[44]:
2018
  • A list or array of integers

In [45]:
# Return values in the 3,5 and 6th rows 

df_mcu.iloc[[2,4,5]]


Out[45]:
Title Year Studio Rating
mcu_18 Black Panther 2018 Beuna Vista 0.96
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83
  • A slice object with ints

In [46]:
# Return values in the first row and columns 2 and 3

df_mcu.iloc[:2, 1:3]


Out[46]:
Year Studio
mcu_20 2018 Beuna Vista
mcu_19 2018 Beuna Vista

3.7.3 Dataframe - Selecting Columns and Rows - Integer based selection


  • loc is primarily label based, but may also be used with a boolean array.

  • Allowed inputs are:

  • A single label

In [47]:
# Select all values of the 20th Movie

df_mcu.loc['mcu_20']


Out[47]:
Title     Ant-Man and the Wasp
Year                      2018
Studio             Beuna Vista
Rating                     NaN
Name: mcu_20, dtype: object
  • A list or array of labels

In [48]:
# Select all values of the 20th, 17th and 15th movie, which uses row index values, 
# Not to be confused with df_mcu[['Title', 'Rating']] which uses column headers

df_mcu.loc[['mcu_20', 'mcu_17', 'mcu_15']]


Out[48]:
Title Year Studio Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83
  • A slice object with labels
NOTE: Unlike numeric index python slices, both the start and the stop are included!

In [49]:
# Select the Year and Rating


df_mcu.loc[:, ['Year', 'Rating']]


Out[49]:
Year Rating
mcu_20 2018 NaN
mcu_19 2018 NaN
mcu_18 2018 0.96
mcu_17 2017 0.92
mcu_16 2017 0.92
mcu_15 2017 0.83

3.7.4 Dataframe - Selecting Columns and Rows - Data Filtering


  • Data filtering using boolean
  • Filter and select on single condition

In [50]:
# Filter for Rating < .95

df_mcu.loc[df_mcu['Rating'] < .95, :]


Out[50]:
Title Year Studio Rating
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83
  • Filter and select on multiple conidtions

In [51]:
# Filter for Rating < .95 and Sudio is Sony
# Reuse the bitwise comparator seen earlier but with OR instead of AND.

df_mcu.loc[(df_mcu['Rating'] < .95) | (df_mcu['Studio'] == 'Sony'), :]


Out[51]:
Title Year Studio Rating
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83

3.8 Dataframe - Adding and Deleting data


  • Add a new column

In [52]:
# Add new predicted rating to Dataframe

df_mcu['Predicted Rating'] = np.random.random(len(df_mcu))
df_mcu


Out[52]:
Title Year Studio Rating Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
  • Add a new row

In [53]:
# Add a new row for a new movie 

new_row = pd.Series(['Captain Marvel', 2019, 'BeunaVista',  np.nan, np.random.random(1)[0]], index=df_mcu.columns, name= 'mcu_21' )
df_mcu.append(new_row)


Out[53]:
Title Year Studio Rating Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
mcu_21 Captain Marvel 2019 BeunaVista NaN 0.997816
  • Drop an existing column

In [54]:
# Drop the Rating Column

df_mcu.drop('Rating', axis=1)


Out[54]:
Title Year Studio Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.922450
  • Drop an existing row

In [55]:
# Drop the 17 and 19th movies

df_mcu.drop(['mcu_15', 'mcu_17'])


Out[55]:
Title Year Studio Rating Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
  • Columns, Rows or individual elements can be modified similarly using loc or iloc.
NOTE: Return new object Dataframe, withouth changing the orignal dataframe (append, drop).

Exericse 3


Create a Dataframe from a specified dictionary data of

Which Students have scorescore is between 15 and 20?

Names Scores Attempts Qualify
Anastasia 12.5 1 'yes'
Dima 9 3 'no'
Katherine 16.5 2 'yes'
James NaN 3 'no'
Emily 9 2 'no'
Michael 20 3 'yes'
Matthew 14.5 1 'yes'
Laura NaN 1 'no'
Kevin 8 2 'no'
Jonas 19 1 'yes'

In [56]:
# Sample Python data:


exam_data = {'Names': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 
             'Scores': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 
             'Attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 
             'Qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} 

student_data = pd.DataFrame(exam_data)
student_data


Out[56]:
Attempts Names Qualify Scores
0 1 Anastasia yes 12.5
1 3 Dima no 9.0
2 2 Katherine yes 16.5
3 3 James no NaN
4 2 Emily no 9.0
5 3 Michael yes 20.0
6 1 Matthew yes 14.5
7 1 Laura no NaN
8 2 Kevin no 8.0
9 1 Jonas yes 19.0

In [57]:
# Select the students the qualify based on scores using the Qualify column 'yes'

student_data[student_data['Qualify'] == 'yes']


Out[57]:
Attempts Names Qualify Scores
0 1 Anastasia yes 12.5
2 2 Katherine yes 16.5
5 3 Michael yes 20.0
6 1 Matthew yes 14.5
9 1 Jonas yes 19.0

Exercise 4


Add a new column of, Grade Level, to indicate which Grade in High school the students are in?

Add a new student named, Jack? (Hint: Need to use ignore_index=True)


In [58]:
# Add a new column of, Grade Level, to indicate which Grade in High school the students are in?

grade_level = ['9th', '9th', '10th', '11th', '12th', '9th', '10th', '11th', '12th', '11th']
student_data['Grade'] = grade_level
student_data


Out[58]:
Attempts Names Qualify Scores Grade
0 1 Anastasia yes 12.5 9th
1 3 Dima no 9.0 9th
2 2 Katherine yes 16.5 10th
3 3 James no NaN 11th
4 2 Emily no 9.0 12th
5 3 Michael yes 20.0 9th
6 1 Matthew yes 14.5 10th
7 1 Laura no NaN 11th
8 2 Kevin no 8.0 12th
9 1 Jonas yes 19.0 11th

In [59]:
student_data.columns


Out[59]:
Index(['Attempts', 'Names', 'Qualify', 'Scores', 'Grade'], dtype='object')

In [60]:
# Add a new student named, Jack? (Hint: Need to use ignore_index=True)

jack = pd.Series([2, 'Jack', 'yes' , 17, '9th'],index=student_data.columns )
student_data.append(jack, ignore_index=True)


Out[60]:
Attempts Names Qualify Scores Grade
0 1 Anastasia yes 12.5 9th
1 3 Dima no 9.0 9th
2 2 Katherine yes 16.5 10th
3 3 James no NaN 11th
4 2 Emily no 9.0 12th
5 3 Michael yes 20.0 9th
6 1 Matthew yes 14.5 10th
7 1 Laura no NaN 11th
8 2 Kevin no 8.0 12th
9 1 Jonas yes 19.0 11th
10 2 Jack yes 17.0 9th

Exercise 5


Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

  • 'yes': 1
  • 'no': 0

In [61]:
# Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

student_data['Pass'] = np.where(student_data['Qualify'] == 'yes', 1, 0)
student_data


Out[61]:
Attempts Names Qualify Scores Grade Pass
0 1 Anastasia yes 12.5 9th 1
1 3 Dima no 9.0 9th 0
2 2 Katherine yes 16.5 10th 1
3 3 James no NaN 11th 0
4 2 Emily no 9.0 12th 0
5 3 Michael yes 20.0 9th 1
6 1 Matthew yes 14.5 10th 1
7 1 Laura no NaN 11th 0
8 2 Kevin no 8.0 12th 0
9 1 Jonas yes 19.0 11th 1

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

4.1 Import and Store Data


The first step in any problem is identifying what format your data is in, and then loading it into whatever framework you're using. Common formats are:

  • Text Files: Text files with a common delimeter to seperate values (e.g. CSV uses ,)
  • JSON(JavaScript Object Notation): standard formats for sending data by HTTP requests
  • Web Page: XML and HTML
  • Binary: "pickle” format and HDF5
  • Database: MySQL, PostgreSQL
NOTE: Most common is CSV or Text files with different delimeters.

4.2 Import and Store Data - Text Files


  • Reading
    • read_csv: Use comma seperated (,) deliminator to read file
    • read_table: Use tab (\t) deliminator to read file
  • Writting
    • write_csv: Use comma seperated (,) deliminator to write file
    • write_table: Use tab (\t) deliminator to write file

Read in Winter Olympic Medal Winners form Kaggle


In [62]:
# Read in Winter Olympic Medal Winners 

winter_olympics = pd.read_csv('data/winter_olympics.csv')
winter_olympics.head()


Out[62]:
Year City Sport Discipline Athlete Country Gender Event Medal
0 1924 Chamonix Biathlon Biathlon BERTHET, G. FRA Men Military Patrol Bronze
1 1924 Chamonix Biathlon Biathlon MANDRILLON, C. FRA Men Military Patrol Bronze
2 1924 Chamonix Biathlon Biathlon MANDRILLON, Maurice FRA Men Military Patrol Bronze
3 1924 Chamonix Biathlon Biathlon VANDELLE, André FRA Men Military Patrol Bronze
4 1924 Chamonix Biathlon Biathlon AUFDENBLATTEN, Adolf SUI Men Military Patrol Gold

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

5.1 Summarizing and Computing Descriptive Statistics


  • Pandas as a lot of built-in essential functionality common to the pandas data structures to help explore the data.

5.2 Summarizing and Computing Descriptive Statistics - Head and Tail


  • To view a small sample of a Series or DataFrame object, use:
    • head()
    • tail()
NOTE: The default number of elements to display is five, but you may pass a custom number.
  • Get the First 3 lines of Dataframe

In [63]:
# Get the First 3 lines of a Dataframe

winter_olympics.head(3)


Out[63]:
Year City Sport Discipline Athlete Country Gender Event Medal
0 1924 Chamonix Biathlon Biathlon BERTHET, G. FRA Men Military Patrol Bronze
1 1924 Chamonix Biathlon Biathlon MANDRILLON, C. FRA Men Military Patrol Bronze
2 1924 Chamonix Biathlon Biathlon MANDRILLON, Maurice FRA Men Military Patrol Bronze

5.2 Summarizing and Computing Descriptive Statistics - Sorting


  • To sort data for exploring the data use:
    • sort_index(): object by labels (along an axis)
    • sort_values(): by the values along either axis
NOTE: axis: 0 or ‘index’, 1 or ‘columns’ and default is 0
  • Sort Dataframe by rows in ascending order

In [64]:
# Sort Dataframe by rows in ascending order

df_mcu.sort_index(axis=0, ascending=True)


Out[64]:
Title Year Studio Rating Predicted Rating
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN 0.962886
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN 0.016992
  • Sort Dataframe by columns in descending order

In [65]:
# Sort Dataframe by column in descending order

df_mcu.sort_values(by=['Rating', 'Predicted Rating'], ascending=False)


Out[65]:
Title Year Studio Rating Predicted Rating
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
mcu_19 Avengers: Infinity War 2018 Beuna Vista NaN 0.962886
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista NaN 0.016992

5.3 Summarizing and Computing Descriptive Statistics - Descriptive statistics


  • Built in functions to calculate the values over row or columns
  • describe(): return summary statistics of each column
      - for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
      - For non-numeric data: count, uniq, most-frequent item, etc

In [66]:
# Summary Statistics for the Dataframe

df_mcu.describe()


Out[66]:
Year Rating Predicted Rating
count 6.000000 4.0000 6.000000
mean 2017.500000 0.9075 0.747185
std 0.547723 0.0550 0.371444
min 2017.000000 0.8300 0.016992
25% 2017.000000 0.8975 0.751550
50% 2017.500000 0.9200 0.897470
75% 2018.000000 0.9300 0.952777
max 2018.000000 0.9600 0.997052
  • mean()

In [67]:
# Mean of the Rating and Predicted Rating Columns

df_mcu.loc[:,['Rating', 'Predicted Rating']].mean()


Out[67]:
Rating              0.907500
Predicted Rating    0.747185
dtype: float64
  • var()

In [68]:
# Get the variance of the Rating column

df_mcu.loc[:,['Rating']].var()


Out[68]:
Rating    0.003025
dtype: float64

5.4 Summarizing and Computing Descriptive Statistics - Missing Data


  • Data comes in many shapes and forms and Pandas is very flexible in handling missing data:
    • While Nan is the default missing value marker
    • However, Python None will arise and we wish to also consider that “missing” or “not available” or “NA”
- Drop rows with Nan values

In [69]:
# Drop rows with NaN values

df_mcu.dropna()


Out[69]:
Title Year Studio Rating Predicted Rating
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
  - Fill Nan with default value

In [70]:
# File Nan in Dataframe with default value

df_mcu.fillna(0)


Out[70]:
Title Year Studio Rating Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista 0.00 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista 0.00 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450
- Use inplace to modify the dataframe instead of retunring a new object

In [71]:
# File Nan in Dataframe with default value in place

df_mcu.fillna(0, inplace=True)
df_mcu


Out[71]:
Title Year Studio Rating Predicted Rating
mcu_20 Ant-Man and the Wasp 2018 Beuna Vista 0.00 0.016992
mcu_19 Avengers: Infinity War 2018 Beuna Vista 0.00 0.962886
mcu_18 Black Panther 2018 Beuna Vista 0.96 0.872491
mcu_17 Thor: Ragnarok 2017 Beuna Vista 0.92 0.711236
mcu_16 Spider-Man: Homecoming 2017 Sony 0.92 0.997052
mcu_15 Guardians of the Galaxy Vol. 2 2017 Beuna Vista 0.83 0.922450

Exercise 6


What is the median score of the students on the exam?


In [72]:
# What is the median score of the students on the exam?

student_data['Scores'].median()


Out[72]:
13.5

Exercise 7


Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)

Compute the mean. Would the class as a whole pass the test?

Are there any students that will fail now?


In [73]:
# Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)

student_data.loc[student_data['Attempts'] >= 2, 'Scores'] -= 4
student_data


Out[73]:
Attempts Names Qualify Scores Grade Pass
0 1 Anastasia yes 12.5 9th 1
1 3 Dima no 5.0 9th 0
2 2 Katherine yes 12.5 10th 1
3 3 James no NaN 11th 0
4 2 Emily no 5.0 12th 0
5 3 Michael yes 16.0 9th 1
6 1 Matthew yes 14.5 10th 1
7 1 Laura no NaN 11th 0
8 2 Kevin no 4.0 12th 0
9 1 Jonas yes 19.0 11th 1

In [74]:
# Compute the mean. Would the class as a whole pass the test?

student_data['Scores'].mean()


Out[74]:
11.0625

In [75]:
# Are there any students that will fail now?

student_data[(student_data['Qualify'] == 'yes')]


Out[75]:
Attempts Names Qualify Scores Grade Pass
0 1 Anastasia yes 12.5 9th 1
2 2 Katherine yes 12.5 10th 1
5 3 Michael yes 16.0 9th 1
6 1 Matthew yes 14.5 10th 1
9 1 Jonas yes 19.0 11th 1

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

6.1 Grouped and apply


The Grouped functionality is referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently

6.2 Grouped and apply - Splitting


Pandas objects can be split on any of their axes.

  • 'grouping' is to provided as a mapping of labels to group names in a Grouped Objec
  • groupby()

In [76]:
# Groupby year of release and get mean Rating

df_mcu.groupby('Year').mean()


Out[76]:
Rating Predicted Rating
Year
2017 0.89 0.876913
2018 0.32 0.617456

6.3 Grouped and apply - Apply


Apply the same function to every column or row:

  • applymap: Apply same function across every cell
  • apply: Apply same function to every column (default) or row
    - Apply square to every value in a dataframe

In [77]:
# Apply square to every value in a dataframe

test_data = np.arange(9).reshape(3,-1)

df_test = pd.DataFrame(test_data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
df_test


Out[77]:
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8

In [78]:
df_test.applymap(np.square)


Out[78]:
c1 c2 c3
r1 0 1 4
r2 9 16 25
r3 36 49 64
        - Apply a new function that subtract max from 2 times min in every column

In [79]:
# Define max minus min function
def max_minus_min(x): 
    return max(x)-(2*min(x))

# Apply a new function that subtract max from min in every column


df_test.apply(max_minus_min)


Out[79]:
c1    6
c2    5
c3    4
dtype: int64

Exercise 8


Group students by attempts and find the average score?


In [80]:
# Group students by attempts and find the average score?

student_data.groupby('Attempts')['Scores'].mean()


Out[80]:
Attempts
1    15.333333
2     7.166667
3    10.500000
Name: Scores, dtype: float64

Exercise 9


Group students by their pass result and report the variance in scores?


In [81]:
# Group students by their pass result and report the variance in scores?

student_data.groupby('Pass')['Scores'].var()


Out[81]:
Pass
0    0.333333
1    7.425000
Name: Scores, dtype: float64
  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

7.1 Data Transformation and Normalization


For Machine Learning (ML) and other data analysis it is important to:

  • Explore your data
  • Standardize data (Transform/Normalize) to obtain so different columns became comparable / compatible

7.1 Data Transformation and Normalization - Data Exploration


Pandas Dataframes have a built in plot functionality:

  • Dataframe.plot()
NOTE: We will go into more plotting libraries later in the course.

In [82]:
# Import  maplotlib and setup to display plots notebook

import matplotlib.pyplot as plt
%matplotlib inline
  • Plot the Number of Medals in barchart by country using: plot.bar()

In [83]:
# Plot the Number of Medals in barchart by country using: plot.bar()

winter_olympics.groupby(['Country'])['Medal'].count().plot.bar()


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fbaa8d0>
  • Other useful plots:
    • hist: plot.hist()
    • scatter: plot.scatter()
    • boxplot: plot.box()

7.2 Data Transformation and Normalization - Normalization


Why Normalize (re-scale)?

  • Transform data to obtain a certain distribution e.g. from lognormal to normal
  • Normalize data so different columns became comparable / compatible

Typical normalization approach:

  • Z-score transformation
  • Scale to between 0 and 1
  • Trimmed mean normalization
  • Vector length transformation
  • Quantilenorm

Further Resources:

Further Resources


Resources:

Exercise 10


In the Winter olympics which country has the most Biathlon medals?

In the Winter olympics which country has the most Skiing medals? And in which event do they have the most Gold medals?


In [84]:
# In the Winter olympics which country has the most Biathlon medals?


winter_sport_medal = winter_olympics.groupby(['Country','Sport'])['Medal'].count()
winter_sport_medal.head(10)


Out[84]:
Country  Sport    
AUS      Skating        5
         Skiing        10
AUT      Biathlon      11
         Bobsleigh     13
         Luge          27
         Skating       31
         Skiing       198
BEL      Bobsleigh      9
         Skating        4
BLR      Biathlon       8
Name: Medal, dtype: int64

In [85]:
winter_sport_medal.loc[:,'Biathlon',:].sort_values(ascending=False).head()


Out[85]:
Country
GER    80
NOR    62
RUS    57
FRA    44
URS    37
Name: Medal, dtype: int64

In [86]:
# In the Winter olympics which country has the most Skiing medals? 

winter_sport_medal.loc[:,'Skiing',:].sort_values(ascending=False).head()


Out[86]:
Country
NOR    297
FIN    206
AUT    198
SWE    146
URS    116
Name: Medal, dtype: int64

In [87]:
# And in which event do they have the most Gold medals?
winter_sport_disc_medal = winter_olympics.groupby(['Country','Sport', 'Discipline', 'Medal'])['Medal'].count()
winter_sport_disc_medal.head()


Out[87]:
Country  Sport    Discipline                 Medal 
AUS      Skating  Short Track Speed Skating  Bronze    4
                                             Gold      1
         Skiing   Alpine Skiing              Bronze    1
                  Freestyle Skiing           Bronze    2
                                             Gold      3
Name: Medal, dtype: int64

In [88]:
winter_sport_disc_medal['NOR','Skiing'].sort_values(ascending=False)


Out[88]:
Discipline            Medal 
Cross Country Skiing  Silver    78
                      Gold      62
                      Bronze    37
Ski Jumping           Bronze    21
Nordic Combined       Gold      19
                      Silver    13
Alpine Skiing         Bronze    10
                      Gold      10
Ski Jumping           Silver     9
                      Gold       9
Alpine Skiing         Silver     9
Nordic Combined       Bronze     8
Freestyle Skiing      Bronze     4
Snowboard             Silver     3
Freestyle Skiing      Gold       2
                      Silver     2
Snowboard             Bronze     1
Name: Medal, dtype: int64

Exercise 11


Import the Summer Olympic dataset located in ('data/summer_olypmics.csv')

Which Olympian has the most medals?

Which Olympian has the most Gold medals and for which Country and Sport?

Which rows have no values and why?


In [89]:
# Import the Summer Olympic dataset located in ('data/summer_olypmics.csv')
summer_olympics = pd.read_csv('data/summer_olympics.csv')


#Which Olympian has the most medals?
summer_olympics.groupby(['Athlete'])['Medal'].count().sort_values(ascending=False).head()


Out[89]:
Athlete
PHELPS, Michael         22
LATYNINA, Larisa        18
ANDRIANOV, Nikolay      15
MANGIAROTTI, Edoardo    13
SHAKHLIN, Boris         13
Name: Medal, dtype: int64

In [90]:
# Which Olympian has the most Gold medals and for which Country?
summer_olympics.groupby(['Athlete','Medal','Country'])['Medal'].count().sort_values(ascending=False).head()


Out[90]:
Athlete           Medal  Country
PHELPS, Michael   Gold   USA        18
NURMI, Paavo      Gold   FIN         9
SPITZ, Mark       Gold   USA         9
LATYNINA, Larisa  Gold   URS         9
LEWIS, Carl       Gold   USA         9
Name: Medal, dtype: int64

In [91]:
# Which Olympian has the most Gold medals and for which Sport?
summer_olympics.groupby(['Athlete','Medal','Discipline'])['Medal'].count().sort_values(ascending=False).head()


Out[91]:
Athlete           Medal  Discipline 
PHELPS, Michael   Gold   Swimming       18
NURMI, Paavo      Gold   Athletics       9
SPITZ, Mark       Gold   Swimming        9
LATYNINA, Larisa  Gold   Artistic G.     9
LEWIS, Carl       Gold   Athletics       9
Name: Medal, dtype: int64

In [92]:
# Which rows have no values and why?
summer_olympics[pd.isnull(summer_olympics).any(axis=1)]


Out[92]:
Year City Sport Discipline Athlete Country Gender Event Medal
29603 2012 London Athletics Athletics Pending NaN Women 1500M Gold
31072 2012 London Weightlifting Weightlifting Pending NaN Women 63KG Gold
31091 2012 London Weightlifting Weightlifting Pending NaN Men 94KG Silver
31110 2012 London Wrestling Wrestling Freestyle KUDUKHOV, Besik NaN Men Wf 60 KG Silver

Exercise 12


Import the example RNA-seq Count Data in ('../data/RNAseq_count_table.txt')

Calculate CPM for each Sample. (CPM, Counts Per Million)

Formula for CPM = readsMappedToGene x 1/totalNumReads x 10^6

totalNumReads - total number of mapped reads of a sample readsMappedToGene - number of reads mapped to a selected gene

Which Gene has the highest average CPM?

What is the Correlation between SRR1550986 SRR1550987?


In [93]:
# Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

rna_counts = pd.read_csv('data/RNAseq_count_table.txt', sep = '\t', index_col = 0)
rna_counts.head()


Out[93]:
SRR1550986 SRR1550987 SRR1551047 SRR1551048 SRR1551054 SRR1551055 SRR1551068 SRR1551069
SYMBOL
RPS11 2950 23072 4604 27297 4794 34420 6313 27590
PNMA1 51 190 79 212 39 162 78 153
TMEM216 365 162 317 224 192 183 365 123
TRAF3IP2-AS1 26 40 30 49 52 53 31 29
ERCC5 251 533 309 522 418 540 409 436

In [94]:
# Calculate CPM for each Sample.


rna_cpm = rna_counts.divide(rna_counts.sum(axis=0)).multiply(1000000)
rna_cpm.head()


Out[94]:
SRR1550986 SRR1550987 SRR1551047 SRR1551048 SRR1551054 SRR1551055 SRR1551068 SRR1551069
SYMBOL
RPS11 127.914693 880.879456 156.866107 1033.023252 159.333270 1349.522268 205.208699 1199.593017
PNMA1 2.211407 7.254122 2.691664 8.022894 1.296203 6.351616 2.535447 6.652328
TMEM216 15.826733 6.185093 10.800729 8.477020 6.381307 7.174973 11.864593 5.347950
TRAF3IP2-AS1 1.127384 1.527184 1.022151 1.854348 1.728271 2.077998 1.007678 1.260899
ERCC5 10.883589 20.349720 10.528155 19.754484 13.892638 21.172052 13.294845 18.956961

In [95]:
# Which Gene has the highest average CPM?


rna_cpm.mean(axis=1).sort_values(ascending=False).head()


Out[95]:
SYMBOL
B2M       10402.659199
DUSP1      8132.917464
MT-CO1     8089.428266
S100A9     7722.999601
FOS        7507.485039
dtype: float64

In [96]:
# What is the Correlation between SRR1550986 SRR1550987?

rna_cpm.corr()


Out[96]:
SRR1550986 SRR1550987 SRR1551047 SRR1551048 SRR1551054 SRR1551055 SRR1551068 SRR1551069
SRR1550986 1.000000 0.469091 0.854288 0.498308 0.856927 0.491040 0.978841 0.498509
SRR1550987 0.469091 1.000000 0.437926 0.963271 0.446212 0.958366 0.491356 0.968587
SRR1551047 0.854288 0.437926 1.000000 0.469439 0.981212 0.472974 0.847428 0.481002
SRR1551048 0.498308 0.963271 0.469439 1.000000 0.472654 0.970821 0.531707 0.985303
SRR1551054 0.856927 0.446212 0.981212 0.472654 1.000000 0.485487 0.854372 0.490735
SRR1551055 0.491040 0.958366 0.472974 0.970821 0.485487 1.000000 0.526867 0.975907
SRR1551068 0.978841 0.491356 0.847428 0.531707 0.854372 0.526867 1.000000 0.537949
SRR1551069 0.498509 0.968587 0.481002 0.985303 0.490735 0.975907 0.537949 1.000000