Intoduction to Pandas and Dataframes

Venkat Malladi (Computational Biologist BICF)

Agenda

Introduction to Pandas
DataSeries
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
- Exercise 5
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

1.1 What is Pandas?

Description

Python library providing high-performance, easy-to-use structures and data analysis tools.
Suitable for tabular data with heterogeneously-typed colums, as in an SQL table or Excel spreadsheet
Provides data analysis feautures similar to: R, MATLAB, SAS
Based on NumPy

1.2 Key Features

The library is oriented towards table-like data structures that can be manipulated by a collection of methods:

Easy handling of missing data
Label-based slicing, indexing and subsetting of large data sets
Powerful and flexible group by functionality to perform split-apply-combine operations on data sets
Read/Write data from/to Excel, CSV, SQL databases, JSON

1.3 Import Pandas

Before we explore the pandas package, let's import pandas.

The convention is to use pd to refere to pandas when importing the package.



In [1]:

    
# Import Pandas and Numpy

import pandas as pd
import numpy as np

1.4 Data Structures

Series
Dataframe

Agenda

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
- Exercise 5
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

2.1 Series

One dimensional array-like object
Contains array of data (of any NumPy data type) with an index that labels each element in the vector.
- Indexes can be:
  - integers
  - strings
  - other data types

NOTE: Not Numpy arrays but adds functionality to a Numpy array

2.2 Series - Integer indexes

If an index is not specified, a default sequence of integers is assigned as index.

Index	Value
0	35
1	50
2	25

Make a Series of count data



In [2]:

    
# Make Series of count data and visaulize series

counts = pd.Series([223, 43, 53, 24, 43])
counts









    Out[2]:





0    223
1     43
2     53
3     24
4     43
dtype: int64

What datatype is the counts object?



In [3]:

    
# What datatype is the counts object?

type(counts)









    Out[3]:





pandas.core.series.Series

2.3 Series - String indexes

We can assign meaningful lables to the indexes when making the Series object by specfing an array of index

Index	Value
CA	35
TX	50
OK	25

Make Series of count data with Gene Symbols



In [4]:

    
# Make Series of count data with Gene Symbols

rna_counts = pd.Series([50, 10, 12, 29, 4], index=['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'])
rna_counts









    Out[4]:





BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
dtype: int64

2.4 Series - Dictionary

Can be thought of as a dict
Can be constructed from a dict directly.

Construct second sample RNA-counts dict



In [5]:

    
# Construct second sample RNA-counts dict

cell2_counts = {'BRCA2':5, 'GATA2':20, 'Myc':45, 'FOXA1':10, 'ERCC2':0, 'BRCA1': 20}
cell2_counts









    Out[5]:





{'BRCA1': 20, 'BRCA2': 5, 'ERCC2': 0, 'FOXA1': 10, 'GATA2': 20, 'Myc': 45}

Make pandas Series from RNA-counts dict



In [6]:

    
# Make pandas Series from RNA-counts dict

rna_counts_cell2 = pd.Series(cell2_counts)
rna_counts_cell2









    Out[6]:





BRCA1    20
BRCA2     5
ERCC2     0
FOXA1    10
GATA2    20
Myc      45
dtype: int64

2.5.1 Series - Referencing Elements - Integer

NOTE: We can access the values like an array text

Integer Index

Access the 1st element of counts data



In [7]:

    
# Access the 1st element of counts data

counts[0]









    Out[7]:





223

Get the 2nd through 4th elements



In [8]:

    
# Get the 2nd through 4th elements

counts[1:4]









    Out[8]:





1    43
2    53
3    24
dtype: int64

2.5.2 Series - Referencing Elements - String

String Index

Get the counts for the Myc Gene



In [9]:

    
# Get the counts for Myc Gene

rna_counts['Myc']









    Out[9]:





12

Get the counts for FOXA1, GATA2 and BRCA2



In [10]:

    
# Get the Counts for FOXA1, GATA2 and BRCA2

rna_counts[['FOXA1', 'GATA2', 'BRCA2']]









    Out[10]:





FOXA1    29
GATA2    10
BRCA2    50
dtype: int64

2.5.3 Series - Referencing Elements - array/index values

Can get the array representation and index object of the Series via its values and index atrributes, respectively.

Get the values in the counts matrix



In [11]:

    
# Get the values in the counts matrix

counts.values









    Out[11]:





array([223,  43,  53,  24,  43])

Get the index of the rna_counts matrix



In [12]:

    
# Get the index of the rna_counts matrix

rna_counts.index









    Out[12]:





Index(['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'], dtype='object')

2.5.4 Series - Referencing Elements - labels

Can give both the array of values and the index meaningful labels themselves



In [13]:

    
rna_counts.name = 'RNA Counts'
rna_counts.index.name = 'Symbol'
rna_counts









    Out[13]:





Symbol
BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
Name: RNA Counts, dtype: int64

2.6 Series - array operations

NumPy array operations can be applied to Series without losing the data structure

Use boolean array to filter Series

Select Genes that have greater than 20 counts



In [14]:

    
# Select Genes that have greater than 20 counts

rna_counts > 20









    Out[14]:





Symbol
BRCA2     True
GATA2    False
Myc      False
FOXA1     True
ERCC2    False
Name: RNA Counts, dtype: bool

Select genes that have greater than 20 counts



In [15]:

    
# Select genes that have greater than 20 counts

rna_counts[rna_counts > 20]









    Out[15]:





Symbol
BRCA2    50
FOXA1    29
Name: RNA Counts, dtype: int64

2.7 Series - null values

Many times Series have missing values that you need to identify and clean

Marvel Cinematic Universe

Make Movie DataFrame with missing values



In [16]:

    
# Make Movie Database with missing values

mcu_opening = {'Black Panther': 202003951, 'Thor: Ragnarok': 122744989, 'Spider-Man: Homecoming': 117027503,
              'Guardians of the Galaxy Vol. 2': 146510104, 'Doctor Strange': 85058311, 
               'Captain America: Civil War': 179139142}
mcu_movies = ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2', 'Doctor Strange', 'Captain America: Civil War']
mcu_series = pd.Series(mcu_opening, index=mcu_movies)

mcu_series









    Out[16]:





Ant-Man and the Wasp                      NaN
Avengers: Infinity War                    NaN
Black Panther                     202003951.0
Thor: Ragnarok                    122744989.0
Spider-Man: Homecoming            117027503.0
Guardians of the Galaxy Vol. 2    146510104.0
Doctor Strange                     85058311.0
Captain America: Civil War        179139142.0
dtype: float64

Find movies with no opening revenue



In [17]:

    
# Find movies with no opening revenue

pd.isnull(mcu_series)









    Out[17]:





Ant-Man and the Wasp               True
Avengers: Infinity War             True
Black Panther                     False
Thor: Ragnarok                    False
Spider-Man: Homecoming            False
Guardians of the Galaxy Vol. 2    False
Doctor Strange                    False
Captain America: Civil War        False
dtype: bool



In [18]:

    
# Good opportunity to use Boolean filter get index and only movie names

mcu_series[pd.isnull(mcu_series)].index.values









    Out[18]:





array(['Ant-Man and the Wasp', 'Avengers: Infinity War'], dtype=object)

Find movies with opening revenue



In [19]:

    
# Find movies with opening revenue

pd.notnull(mcu_series)









    Out[19]:





Ant-Man and the Wasp              False
Avengers: Infinity War            False
Black Panther                      True
Thor: Ragnarok                     True
Spider-Man: Homecoming             True
Guardians of the Galaxy Vol. 2     True
Doctor Strange                     True
Captain America: Civil War         True
dtype: bool

Display only movie with no opening revenue



In [20]:

    
# Display only movies with no opening revenue

mcu_series[pd.isnull(mcu_series)].index.values









    Out[20]:





array(['Ant-Man and the Wasp', 'Avengers: Infinity War'], dtype=object)

2.8 Series - auto alignment

Index labels are used to align (merge) data when used in operations with other Series objects



In [21]:

    
rna_counts









    Out[21]:





Symbol
BRCA2    50
GATA2    10
Myc      12
FOXA1    29
ERCC2     4
Name: RNA Counts, dtype: int64



In [22]:

    
rna_counts_cell2









    Out[22]:





BRCA1    20
BRCA2     5
ERCC2     0
FOXA1    10
GATA2    20
Myc      45
dtype: int64

Combine counts for 2 cells (rna_counts and rna_counts_cell2)



In [23]:

    
# Combine counts for 2 cells 

rna_counts + rna_counts_cell2









    Out[23]:





BRCA1     NaN
BRCA2    55.0
ERCC2     4.0
FOXA1    39.0
GATA2    30.0
Myc      57.0
dtype: float64

Adding Series combined values with the same label in the resulting series
Contrast this with arrays, where arrays of the same length will combine values element-wise
Notice that the missing values. If one Series has a missing value you can't add it to the other value and a missing value results.

Exercise 1

Create a Series from a specified list data which has the index labels as student names and test values.

Which Students have scores greater than 15?

Bonus: How would you use get the Students scores greater than 15 and less than 20?

Hint: Will use the bitwise opperator &

Names	Value
Anastasia	12.5
Dima	9
Katherine	16.5
James	NaN
Emily	9
Michael	20
Matthew	14.5
Laura	NaN
Kevin	8
Jonas	19



In [24]:

    
# Sample Python data and labels:


students = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']

test_scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]

s_scores = pd.Series(test_scores, index=students)

# Which Students have scores greater than 15?


s_scores[s_scores > 15].index.values









    Out[24]:





array(['Katherine', 'Michael', 'Jonas'], dtype=object)



In [25]:

    
# Bonus: How would you use get the Students scores greater than 15 and less than 20?

s_scores[(s_scores > 15) & (s_scores < 20)]









    Out[25]:





Katherine    16.5
Jonas        19.0
dtype: float64

Exercise 2

What is the mean, median and max test scores?



In [26]:

    
# What is the mean, median and max test scores?

s_scores.mean()









    Out[26]:





13.5625



In [27]:

    
s_scores.median()









    Out[27]:





13.5



In [28]:

    
s_scores.max()









    Out[28]:





20.0

Agenda

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
- Exercise 5
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

3.1 Dataframe

A DataFrame is a tabular data structure, comprised of rows and columns like in a spreadsheet
Each column can be a different value type (numeric, string, boolean etc)

Title	Year	Studio	Rating
Ant-Man and the Wasp	2018	Beuna Vista	NaN
Avengers: Infinity War	2018	Beuna Vista	NaN
Black Panther	2018	Beuna Vista	0.96
Thor: Ragnarok	2017	Beuna Vista	0.92
Spider-Man: Homecoming	2017	Sony	0.92
Guardians of the Galaxy Vol. 2	2017	Beuna Vista	0.83

3.2 Dataframe - from dict of lists

dict keys: columns
dict values (arrays): rows

Make Dataframe of Marvel data



In [29]:

    
# Make Dataframe of Marvel data

mcu_data = {'Title': ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2'],
            'Year':[2018, 2018, 2018, 2017, 2017, 2017],
            'Studio':['Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Sony', 'Beuna Vista'],
            'Rating': [np.nan, np.nan, 0.96, 0.92, 0.92, 0.83]}
df_mcu = pd.DataFrame(mcu_data)
df_mcu









    Out[29]:







  
    
      
      Rating
      Studio
      Title
      Year
    
  
  
    
      0
      NaN
      Beuna Vista
      Ant-Man and the Wasp
      2018
    
    
      1
      NaN
      Beuna Vista
      Avengers: Infinity War
      2018
    
    
      2
      0.96
      Beuna Vista
      Black Panther
      2018
    
    
      3
      0.92
      Beuna Vista
      Thor: Ragnarok
      2017
    
    
      4
      0.92
      Sony
      Spider-Man: Homecoming
      2017
    
    
      5
      0.83
      Beuna Vista
      Guardians of the Galaxy Vol. 2
      2017

What datatype is df_mcu?



In [30]:

    
# What datatype is df_mcu?

type(df_mcu)









    Out[30]:





pandas.core.frame.DataFrame

3.3 Dataframe - specifying indices and columns

Order of columns/rows can be specified using:
- columns array
- index array



In [31]:

    
# Assign column order and index based on Marvel Cinemetic Universe Movie Number 

mcu_index = ['mcu_20','mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15']
mcu_columns = ['Title', 'Year', 'Studio', 'Rating']
df_mcu = pd.DataFrame(mcu_data, columns = mcu_columns, index = mcu_index)
df_mcu









    Out[31]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83

3.4 Dataframe - from nested dict of dicts

Outer dict keys as columns and inner dict keys as row indices

Make Dataframe of population of states



In [32]:

    
# Make Dataframe of population

pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
df_pop = pd.DataFrame(pop)
df_pop

3.5 Dataframe - number of rows and columns

Get the number of rows in a Dataframe



In [33]:

    
# Get the number of rows in a Dataframe

len(df_mcu)









    Out[33]:





6

Get the (rows, cols) of the Dataframe



In [34]:

    
# Get the (rows, cols) of the Dataframe

df_mcu.shape









    Out[34]:





(6, 4)

3.6 Dataframe - index, columns and values

Get the column headers



In [35]:

    
# Get the column headers

df_mcu.columns









    Out[35]:





Index(['Title', 'Year', 'Studio', 'Rating'], dtype='object')

Get the row index values



In [36]:

    
# Get the row index values

df_mcu.index









    Out[36]:





Index(['mcu_20', 'mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15'], dtype='object')

Get values of the Dataframe only



In [37]:

    
# Get values of the Dataframe only

df_mcu.values









    Out[37]:





array([['Ant-Man and the Wasp', 2018, 'Beuna Vista', nan],
       ['Avengers: Infinity War', 2018, 'Beuna Vista', nan],
       ['Black Panther', 2018, 'Beuna Vista', 0.96],
       ['Thor: Ragnarok', 2017, 'Beuna Vista', 0.92],
       ['Spider-Man: Homecoming', 2017, 'Sony', 0.92],
       ['Guardians of the Galaxy Vol. 2', 2017, 'Beuna Vista', 0.83]], dtype=object)

3.7 Dataframe - Selecting Columns and Rows

There are three basic ways to access the data in the Dataframe:
1. Quick Access: DataFrame[]
2. Integer position based selection method: DataFrame.iloc[row, col]
3. Label based selection method: DataFrame.loc[row, col]

3.7.1 Dataframe - Selecting Columns and Rows - Quick Access

Select values in a single column



In [38]:

    
# Select values in a single column

df_mcu['Title']









    Out[38]:





mcu_20              Ant-Man and the Wasp
mcu_19            Avengers: Infinity War
mcu_18                     Black Panther
mcu_17                    Thor: Ragnarok
mcu_16            Spider-Man: Homecoming
mcu_15    Guardians of the Galaxy Vol. 2
Name: Title, dtype: object

Select values in a list of columns



In [39]:

    
# Select values in a list of columns

df_mcu[['Title', 'Rating']]









    Out[39]:







  
    
      
      Title
      Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      NaN
    
    
      mcu_19
      Avengers: Infinity War
      NaN
    
    
      mcu_18
      Black Panther
      0.96
    
    
      mcu_17
      Thor: Ragnarok
      0.92
    
    
      mcu_16
      Spider-Man: Homecoming
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      0.83

Use slice to get the first n rows (NumPy style indexing)



In [40]:

    
# Use slice to get the first n rows (NumPy style indexing)

df_mcu[:2]









    Out[40]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN

Can combine slice and column selection to select the first n rows



In [41]:

    
# Can combine slice and column selection to select the first n rows

df_mcu['Title'][:2]









    Out[41]:





mcu_20      Ant-Man and the Wasp
mcu_19    Avengers: Infinity War
Name: Title, dtype: object

Order of column and slice doesn't matter



In [42]:

    
df_mcu[:4]['Year']









    Out[42]:





mcu_20    2018
mcu_19    2018
mcu_18    2018
mcu_17    2017
Name: Year, dtype: int64

3.7.2 Dataframe - Selecting Columns and Rows - Integer based selection

iloc is primarily an integer position based (from 0 to length-1 of the axis), when we include the starting from the end indexing, but may also be used with a boolean array.
Allowed inputs are:

Integer



In [43]:

    
# Return values in the first row

df_mcu.iloc[0]









    Out[43]:





Title     Ant-Man and the Wasp
Year                      2018
Studio             Beuna Vista
Rating                     NaN
Name: mcu_20, dtype: object



In [44]:

    
# Return values in the first row and second column

df_mcu.iloc[0,1]









    Out[44]:





2018

A list or array of integers



In [45]:

    
# Return values in the 3,5 and 6th rows 

df_mcu.iloc[[2,4,5]]









    Out[45]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83

A slice object with ints



In [46]:

    
# Return values in the first row and columns 2 and 3

df_mcu.iloc[:2, 1:3]









    Out[46]:







  
    
      
      Year
      Studio
    
  
  
    
      mcu_20
      2018
      Beuna Vista
    
    
      mcu_19
      2018
      Beuna Vista

3.7.3 Dataframe - Selecting Columns and Rows - Integer based selection

loc is primarily label based, but may also be used with a boolean array.
Allowed inputs are:

A single label



In [47]:

    
# Select all values of the 20th Movie

df_mcu.loc['mcu_20']









    Out[47]:





Title     Ant-Man and the Wasp
Year                      2018
Studio             Beuna Vista
Rating                     NaN
Name: mcu_20, dtype: object

A list or array of labels



In [48]:

    
# Select all values of the 20th, 17th and 15th movie, which uses row index values, 
# Not to be confused with df_mcu[['Title', 'Rating']] which uses column headers

df_mcu.loc[['mcu_20', 'mcu_17', 'mcu_15']]









    Out[48]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83

A slice object with labels

NOTE: Unlike numeric index python slices, both the start and the stop are included!



In [49]:

    
# Select the Year and Rating


df_mcu.loc[:, ['Year', 'Rating']]

3.7.4 Dataframe - Selecting Columns and Rows - Data Filtering

Data filtering using boolean

Filter and select on single condition



In [50]:

    
# Filter for Rating < .95

df_mcu.loc[df_mcu['Rating'] < .95, :]









    Out[50]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83

Filter and select on multiple conidtions



In [51]:

    
# Filter for Rating < .95 and Sudio is Sony
# Reuse the bitwise comparator seen earlier but with OR instead of AND.

df_mcu.loc[(df_mcu['Rating'] < .95) | (df_mcu['Studio'] == 'Sony'), :]









    Out[51]:







  
    
      
      Title
      Year
      Studio
      Rating
    
  
  
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83

3.8 Dataframe - Adding and Deleting data

Add a new column



In [52]:

    
# Add new predicted rating to Dataframe

df_mcu['Predicted Rating'] = np.random.random(len(df_mcu))
df_mcu









    Out[52]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450

Add a new row



In [53]:

    
# Add a new row for a new movie 

new_row = pd.Series(['Captain Marvel', 2019, 'BeunaVista',  np.nan, np.random.random(1)[0]], index=df_mcu.columns, name= 'mcu_21' )
df_mcu.append(new_row)









    Out[53]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450
    
    
      mcu_21
      Captain Marvel
      2019
      BeunaVista
      NaN
      0.997816

Drop an existing column



In [54]:

    
# Drop the Rating Column

df_mcu.drop('Rating', axis=1)









    Out[54]:







  
    
      
      Title
      Year
      Studio
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.922450

Drop an existing row



In [55]:

    
# Drop the 17 and 19th movies

df_mcu.drop(['mcu_15', 'mcu_17'])









    Out[55]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052

Columns, Rows or individual elements can be modified similarly using loc or iloc.

NOTE: Return new object Dataframe, withouth changing the orignal dataframe (append, drop).

Exericse 3

Create a Dataframe from a specified dictionary data of

Which Students have scorescore is between 15 and 20?

Names	Scores	Attempts	Qualify
Anastasia	12.5	1	'yes'
Dima	9	3	'no'
Katherine	16.5	2	'yes'
James	NaN	3	'no'
Emily	9	2	'no'
Michael	20	3	'yes'
Matthew	14.5	1	'yes'
Laura	NaN	1	'no'
Kevin	8	2	'no'
Jonas	19	1	'yes'



In [56]:

    
# Sample Python data:


exam_data = {'Names': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 
             'Scores': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 
             'Attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 
             'Qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} 

student_data = pd.DataFrame(exam_data)
student_data



In [57]:

    
# Select the students the qualify based on scores using the Qualify column 'yes'

student_data[student_data['Qualify'] == 'yes']









    Out[57]:







  
    
      
      Attempts
      Names
      Qualify
      Scores
    
  
  
    
      0
      1
      Anastasia
      yes
      12.5
    
    
      2
      2
      Katherine
      yes
      16.5
    
    
      5
      3
      Michael
      yes
      20.0
    
    
      6
      1
      Matthew
      yes
      14.5
    
    
      9
      1
      Jonas
      yes
      19.0

Exercise 4

Add a new column of, Grade Level, to indicate which Grade in High school the students are in?

Add a new student named, Jack? (Hint: Need to use ignore_index=True)



In [58]:

    
# Add a new column of, Grade Level, to indicate which Grade in High school the students are in?

grade_level = ['9th', '9th', '10th', '11th', '12th', '9th', '10th', '11th', '12th', '11th']
student_data['Grade'] = grade_level
student_data



In [59]:

    
student_data.columns









    Out[59]:





Index(['Attempts', 'Names', 'Qualify', 'Scores', 'Grade'], dtype='object')



In [60]:

    
# Add a new student named, Jack? (Hint: Need to use ignore_index=True)

jack = pd.Series([2, 'Jack', 'yes' , 17, '9th'],index=student_data.columns )
student_data.append(jack, ignore_index=True)

Exercise 5

Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

'yes': 1
'no': 0



In [61]:

    
# Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

student_data['Pass'] = np.where(student_data['Qualify'] == 'yes', 1, 0)
student_data

Agenda

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

4.1 Import and Store Data

The first step in any problem is identifying what format your data is in, and then loading it into whatever framework you're using. Common formats are:

Text Files: Text files with a common delimeter to seperate values (e.g. CSV uses ,)
JSON(JavaScript Object Notation): standard formats for sending data by HTTP requests
Web Page: XML and HTML
Binary: "pickle” format and HDF5
Database: MySQL, PostgreSQL

NOTE: Most common is CSV or Text files with different delimeters.

4.2 Import and Store Data - Text Files

Reading
- read_csv: Use comma seperated (,) deliminator to read file
- read_table: Use tab (\t) deliminator to read file
Writting
- write_csv: Use comma seperated (,) deliminator to write file
- write_table: Use tab (\t) deliminator to write file

Read in Winter Olympic Medal Winners form Kaggle



In [62]:

    
# Read in Winter Olympic Medal Winners 

winter_olympics = pd.read_csv('data/winter_olympics.csv')
winter_olympics.head()









    Out[62]:







  
    
      
      Year
      City
      Sport
      Discipline
      Athlete
      Country
      Gender
      Event
      Medal
    
  
  
    
      0
      1924
      Chamonix
      Biathlon
      Biathlon
      BERTHET, G.
      FRA
      Men
      Military Patrol
      Bronze
    
    
      1
      1924
      Chamonix
      Biathlon
      Biathlon
      MANDRILLON, C.
      FRA
      Men
      Military Patrol
      Bronze
    
    
      2
      1924
      Chamonix
      Biathlon
      Biathlon
      MANDRILLON, Maurice
      FRA
      Men
      Military Patrol
      Bronze
    
    
      3
      1924
      Chamonix
      Biathlon
      Biathlon
      VANDELLE, André
      FRA
      Men
      Military Patrol
      Bronze
    
    
      4
      1924
      Chamonix
      Biathlon
      Biathlon
      AUFDENBLATTEN, Adolf
      SUI
      Men
      Military Patrol
      Gold

Agenda

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

5.1 Summarizing and Computing Descriptive Statistics

Pandas as a lot of built-in essential functionality common to the pandas data structures to help explore the data.

5.2 Summarizing and Computing Descriptive Statistics - Head and Tail

To view a small sample of a Series or DataFrame object, use:
- head()
- tail()

NOTE: The default number of elements to display is five, but you may pass a custom number.

Get the First 3 lines of Dataframe



In [63]:

    
# Get the First 3 lines of a Dataframe

winter_olympics.head(3)









    Out[63]:







  
    
      
      Year
      City
      Sport
      Discipline
      Athlete
      Country
      Gender
      Event
      Medal
    
  
  
    
      0
      1924
      Chamonix
      Biathlon
      Biathlon
      BERTHET, G.
      FRA
      Men
      Military Patrol
      Bronze
    
    
      1
      1924
      Chamonix
      Biathlon
      Biathlon
      MANDRILLON, C.
      FRA
      Men
      Military Patrol
      Bronze
    
    
      2
      1924
      Chamonix
      Biathlon
      Biathlon
      MANDRILLON, Maurice
      FRA
      Men
      Military Patrol
      Bronze

5.2 Summarizing and Computing Descriptive Statistics - Sorting

To sort data for exploring the data use:
- sort_index(): object by labels (along an axis)
- sort_values(): by the values along either axis

NOTE: axis: 0 or ‘index’, 1 or ‘columns’ and default is 0

Sort Dataframe by rows in ascending order



In [64]:

    
# Sort Dataframe by rows in ascending order

df_mcu.sort_index(axis=0, ascending=True)









    Out[64]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
      0.962886
    
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
      0.016992

Sort Dataframe by columns in descending order



In [65]:

    
# Sort Dataframe by column in descending order

df_mcu.sort_values(by=['Rating', 'Predicted Rating'], ascending=False)









    Out[65]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      NaN
      0.962886
    
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      NaN
      0.016992

5.3 Summarizing and Computing Descriptive Statistics - Descriptive statistics

Built in functions to calculate the values over row or columns

describe(): return summary statistics of each column

  - for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
  - For non-numeric data: count, uniq, most-frequent item, etc



In [66]:

    
# Summary Statistics for the Dataframe

df_mcu.describe()









    Out[66]:







  
    
      
      Year
      Rating
      Predicted Rating
    
  
  
    
      count
      6.000000
      4.0000
      6.000000
    
    
      mean
      2017.500000
      0.9075
      0.747185
    
    
      std
      0.547723
      0.0550
      0.371444
    
    
      min
      2017.000000
      0.8300
      0.016992
    
    
      25%
      2017.000000
      0.8975
      0.751550
    
    
      50%
      2017.500000
      0.9200
      0.897470
    
    
      75%
      2018.000000
      0.9300
      0.952777
    
    
      max
      2018.000000
      0.9600
      0.997052

mean()



In [67]:

    
# Mean of the Rating and Predicted Rating Columns

df_mcu.loc[:,['Rating', 'Predicted Rating']].mean()









    Out[67]:





Rating              0.907500
Predicted Rating    0.747185
dtype: float64

var()



In [68]:

    
# Get the variance of the Rating column

df_mcu.loc[:,['Rating']].var()









    Out[68]:





Rating    0.003025
dtype: float64

5.4 Summarizing and Computing Descriptive Statistics - Missing Data

Data comes in many shapes and forms and Pandas is very flexible in handling missing data:
- While Nan is the default missing value marker
- However, Python None will arise and we wish to also consider that “missing” or “not available” or “NA”

- Drop rows with Nan values



In [69]:

    
# Drop rows with NaN values

df_mcu.dropna()









    Out[69]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450

  - Fill Nan with default value



In [70]:

    
# File Nan in Dataframe with default value

df_mcu.fillna(0)









    Out[70]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      0.00
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      0.00
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450

- Use inplace to modify the dataframe instead of retunring a new object



In [71]:

    
# File Nan in Dataframe with default value in place

df_mcu.fillna(0, inplace=True)
df_mcu









    Out[71]:







  
    
      
      Title
      Year
      Studio
      Rating
      Predicted Rating
    
  
  
    
      mcu_20
      Ant-Man and the Wasp
      2018
      Beuna Vista
      0.00
      0.016992
    
    
      mcu_19
      Avengers: Infinity War
      2018
      Beuna Vista
      0.00
      0.962886
    
    
      mcu_18
      Black Panther
      2018
      Beuna Vista
      0.96
      0.872491
    
    
      mcu_17
      Thor: Ragnarok
      2017
      Beuna Vista
      0.92
      0.711236
    
    
      mcu_16
      Spider-Man: Homecoming
      2017
      Sony
      0.92
      0.997052
    
    
      mcu_15
      Guardians of the Galaxy Vol. 2
      2017
      Beuna Vista
      0.83
      0.922450

Exercise 6

What is the median score of the students on the exam?



In [72]:

    
# What is the median score of the students on the exam?

student_data['Scores'].median()









    Out[72]:





13.5

Exercise 7

Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)

Compute the mean. Would the class as a whole pass the test?

Are there any students that will fail now?



In [73]:

    
# Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)

student_data.loc[student_data['Attempts'] >= 2, 'Scores'] -= 4
student_data



In [74]:

    
# Compute the mean. Would the class as a whole pass the test?

student_data['Scores'].mean()









    Out[74]:





11.0625



In [75]:

    
# Are there any students that will fail now?

student_data[(student_data['Qualify'] == 'yes')]









    Out[75]:







  
    
      
      Attempts
      Names
      Qualify
      Scores
      Grade
      Pass
    
  
  
    
      0
      1
      Anastasia
      yes
      12.5
      9th
      1
    
    
      2
      2
      Katherine
      yes
      12.5
      10th
      1
    
    
      5
      3
      Michael
      yes
      16.0
      9th
      1
    
    
      6
      1
      Matthew
      yes
      14.5
      10th
      1
    
    
      9
      1
      Jonas
      yes
      19.0
      11th
      1

Agenda

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

6.1 Grouped and apply

The Grouped functionality is referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently

6.2 Grouped and apply - Splitting

Pandas objects can be split on any of their axes.

'grouping' is to provided as a mapping of labels to group names in a Grouped Objec
groupby()



In [76]:

    
# Groupby year of release and get mean Rating

df_mcu.groupby('Year').mean()









    Out[76]:







  
    
      
      Rating
      Predicted Rating
    
    
      Year
      
      
    
  
  
    
      2017
      0.89
      0.876913
    
    
      2018
      0.32
      0.617456

6.3 Grouped and apply - Apply

Apply the same function to every column or row:

applymap: Apply same function across every cell
apply: Apply same function to every column (default) or row

    - Apply square to every value in a dataframe



In [77]:

    
# Apply square to every value in a dataframe

test_data = np.arange(9).reshape(3,-1)

df_test = pd.DataFrame(test_data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
df_test



In [78]:

    
df_test.applymap(np.square)

        - Apply a new function that subtract max from 2 times min in every column



In [79]:

    
# Define max minus min function
def max_minus_min(x): 
    return max(x)-(2*min(x))

# Apply a new function that subtract max from min in every column


df_test.apply(max_minus_min)









    Out[79]:





c1    6
c2    5
c3    4
dtype: int64

Exercise 8

Group students by attempts and find the average score?



In [80]:

    
# Group students by attempts and find the average score?

student_data.groupby('Attempts')['Scores'].mean()









    Out[80]:





Attempts
1    15.333333
2     7.166667
3    10.500000
Name: Scores, dtype: float64

Exercise 9

Group students by their pass result and report the variance in scores?



In [81]:

    
# Group students by their pass result and report the variance in scores?

student_data.groupby('Pass')['Scores'].var()









    Out[81]:





Pass
0    0.333333
1    7.425000
Name: Scores, dtype: float64

Introduction to Pandas
Series
- Exercise 1
- Exercise 2
Dataframe
- Exercise 3
- Exercise 4
Import and Store Data
Summarizing and Computing Descriptive Statistics
- Exercise 6
- Exercise 7
Grouped and apply
- Exercise 8
- Exercise 9
Data Transformation and Normalization
- Exercise 10
- Exercise 11
- Exercise 12

7.1 Data Transformation and Normalization

For Machine Learning (ML) and other data analysis it is important to:

Explore your data
Standardize data (Transform/Normalize) to obtain so different columns became comparable / compatible

7.1 Data Transformation and Normalization - Data Exploration

Pandas Dataframes have a built in plot functionality:

Dataframe.plot()

NOTE: We will go into more plotting libraries later in the course.



In [82]:

    
# Import  maplotlib and setup to display plots notebook

import matplotlib.pyplot as plt
%matplotlib inline

Plot the Number of Medals in barchart by country using: plot.bar()



In [83]:

    
# Plot the Number of Medals in barchart by country using: plot.bar()

winter_olympics.groupby(['Country'])['Medal'].count().plot.bar()









    Out[83]:





<matplotlib.axes._subplots.AxesSubplot at 0x10fbaa8d0>

Other useful plots:
- hist: plot.hist()
- scatter: plot.scatter()
- boxplot: plot.box()

7.2 Data Transformation and Normalization - Normalization

Why Normalize (re-scale)?

Transform data to obtain a certain distribution e.g. from lognormal to normal
Normalize data so different columns became comparable / compatible

Typical normalization approach:

Z-score transformation
Scale to between 0 and 1
Trimmed mean normalization
Vector length transformation
Quantilenorm

Further Resources:

scikit-learn

Further Resources

Resources:

Exercise 10

In the Winter olympics which country has the most Biathlon medals?

In the Winter olympics which country has the most Skiing medals? And in which event do they have the most Gold medals?



In [84]:

    
# In the Winter olympics which country has the most Biathlon medals?


winter_sport_medal = winter_olympics.groupby(['Country','Sport'])['Medal'].count()
winter_sport_medal.head(10)









    Out[84]:





Country  Sport    
AUS      Skating        5
         Skiing        10
AUT      Biathlon      11
         Bobsleigh     13
         Luge          27
         Skating       31
         Skiing       198
BEL      Bobsleigh      9
         Skating        4
BLR      Biathlon       8
Name: Medal, dtype: int64



In [85]:

    
winter_sport_medal.loc[:,'Biathlon',:].sort_values(ascending=False).head()









    Out[85]:





Country
GER    80
NOR    62
RUS    57
FRA    44
URS    37
Name: Medal, dtype: int64



In [86]:

    
# In the Winter olympics which country has the most Skiing medals? 

winter_sport_medal.loc[:,'Skiing',:].sort_values(ascending=False).head()









    Out[86]:





Country
NOR    297
FIN    206
AUT    198
SWE    146
URS    116
Name: Medal, dtype: int64



In [87]:

    
# And in which event do they have the most Gold medals?
winter_sport_disc_medal = winter_olympics.groupby(['Country','Sport', 'Discipline', 'Medal'])['Medal'].count()
winter_sport_disc_medal.head()









    Out[87]:





Country  Sport    Discipline                 Medal 
AUS      Skating  Short Track Speed Skating  Bronze    4
                                             Gold      1
         Skiing   Alpine Skiing              Bronze    1
                  Freestyle Skiing           Bronze    2
                                             Gold      3
Name: Medal, dtype: int64



In [88]:

    
winter_sport_disc_medal['NOR','Skiing'].sort_values(ascending=False)









    Out[88]:





Discipline            Medal 
Cross Country Skiing  Silver    78
                      Gold      62
                      Bronze    37
Ski Jumping           Bronze    21
Nordic Combined       Gold      19
                      Silver    13
Alpine Skiing         Bronze    10
                      Gold      10
Ski Jumping           Silver     9
                      Gold       9
Alpine Skiing         Silver     9
Nordic Combined       Bronze     8
Freestyle Skiing      Bronze     4
Snowboard             Silver     3
Freestyle Skiing      Gold       2
                      Silver     2
Snowboard             Bronze     1
Name: Medal, dtype: int64

Exercise 11

Import the Summer Olympic dataset located in ('data/summer_olypmics.csv')

Which Olympian has the most medals?

Which Olympian has the most Gold medals and for which Country and Sport?

Which rows have no values and why?



In [89]:

    
# Import the Summer Olympic dataset located in ('data/summer_olypmics.csv')
summer_olympics = pd.read_csv('data/summer_olympics.csv')


#Which Olympian has the most medals?
summer_olympics.groupby(['Athlete'])['Medal'].count().sort_values(ascending=False).head()









    Out[89]:





Athlete
PHELPS, Michael         22
LATYNINA, Larisa        18
ANDRIANOV, Nikolay      15
MANGIAROTTI, Edoardo    13
SHAKHLIN, Boris         13
Name: Medal, dtype: int64



In [90]:

    
# Which Olympian has the most Gold medals and for which Country?
summer_olympics.groupby(['Athlete','Medal','Country'])['Medal'].count().sort_values(ascending=False).head()









    Out[90]:





Athlete           Medal  Country
PHELPS, Michael   Gold   USA        18
NURMI, Paavo      Gold   FIN         9
SPITZ, Mark       Gold   USA         9
LATYNINA, Larisa  Gold   URS         9
LEWIS, Carl       Gold   USA         9
Name: Medal, dtype: int64



In [91]:

    
# Which Olympian has the most Gold medals and for which Sport?
summer_olympics.groupby(['Athlete','Medal','Discipline'])['Medal'].count().sort_values(ascending=False).head()









    Out[91]:





Athlete           Medal  Discipline 
PHELPS, Michael   Gold   Swimming       18
NURMI, Paavo      Gold   Athletics       9
SPITZ, Mark       Gold   Swimming        9
LATYNINA, Larisa  Gold   Artistic G.     9
LEWIS, Carl       Gold   Athletics       9
Name: Medal, dtype: int64



In [92]:

    
# Which rows have no values and why?
summer_olympics[pd.isnull(summer_olympics).any(axis=1)]









    Out[92]:







  
    
      
      Year
      City
      Sport
      Discipline
      Athlete
      Country
      Gender
      Event
      Medal
    
  
  
    
      29603
      2012
      London
      Athletics
      Athletics
      Pending
      NaN
      Women
      1500M
      Gold
    
    
      31072
      2012
      London
      Weightlifting
      Weightlifting
      Pending
      NaN
      Women
      63KG
      Gold
    
    
      31091
      2012
      London
      Weightlifting
      Weightlifting
      Pending
      NaN
      Men
      94KG
      Silver
    
    
      31110
      2012
      London
      Wrestling
      Wrestling Freestyle
      KUDUKHOV, Besik
      NaN
      Men
      Wf 60 KG
      Silver

Exercise 12

Import the example RNA-seq Count Data in ('../data/RNAseq_count_table.txt')

Calculate CPM for each Sample. (CPM, Counts Per Million)

Formula for CPM = readsMappedToGene x 1/totalNumReads x 10^6

totalNumReads - total number of mapped reads of a sample readsMappedToGene - number of reads mapped to a selected gene

Which Gene has the highest average CPM?

What is the Correlation between SRR1550986 SRR1550987?



In [93]:

    
# Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

rna_counts = pd.read_csv('data/RNAseq_count_table.txt', sep = '\t', index_col = 0)
rna_counts.head()









    Out[93]:







  
    
      
      SRR1550986
      SRR1550987
      SRR1551047
      SRR1551048
      SRR1551054
      SRR1551055
      SRR1551068
      SRR1551069
    
    
      SYMBOL
      
      
      
      
      
      
      
      
    
  
  
    
      RPS11
      2950
      23072
      4604
      27297
      4794
      34420
      6313
      27590
    
    
      PNMA1
      51
      190
      79
      212
      39
      162
      78
      153
    
    
      TMEM216
      365
      162
      317
      224
      192
      183
      365
      123
    
    
      TRAF3IP2-AS1
      26
      40
      30
      49
      52
      53
      31
      29
    
    
      ERCC5
      251
      533
      309
      522
      418
      540
      409
      436



In [94]:

    
# Calculate CPM for each Sample.


rna_cpm = rna_counts.divide(rna_counts.sum(axis=0)).multiply(1000000)
rna_cpm.head()









    Out[94]:







  
    
      
      SRR1550986
      SRR1550987
      SRR1551047
      SRR1551048
      SRR1551054
      SRR1551055
      SRR1551068
      SRR1551069
    
    
      SYMBOL
      
      
      
      
      
      
      
      
    
  
  
    
      RPS11
      127.914693
      880.879456
      156.866107
      1033.023252
      159.333270
      1349.522268
      205.208699
      1199.593017
    
    
      PNMA1
      2.211407
      7.254122
      2.691664
      8.022894
      1.296203
      6.351616
      2.535447
      6.652328
    
    
      TMEM216
      15.826733
      6.185093
      10.800729
      8.477020
      6.381307
      7.174973
      11.864593
      5.347950
    
    
      TRAF3IP2-AS1
      1.127384
      1.527184
      1.022151
      1.854348
      1.728271
      2.077998
      1.007678
      1.260899
    
    
      ERCC5
      10.883589
      20.349720
      10.528155
      19.754484
      13.892638
      21.172052
      13.294845
      18.956961



In [95]:

    
# Which Gene has the highest average CPM?


rna_cpm.mean(axis=1).sort_values(ascending=False).head()









    Out[95]:





SYMBOL
B2M       10402.659199
DUSP1      8132.917464
MT-CO1     8089.428266
S100A9     7722.999601
FOS        7507.485039
dtype: float64



In [96]:

    
# What is the Correlation between SRR1550986 SRR1550987?

rna_cpm.corr()

	SRR1550986	SRR1550987	SRR1551047	SRR1551048	SRR1551054	SRR1551055	SRR1551068	SRR1551069
SRR1550986	1.000000	0.469091	0.854288	0.498308	0.856927	0.491040	0.978841	0.498509
SRR1550987	0.469091	1.000000	0.437926	0.963271	0.446212	0.958366	0.491356	0.968587
SRR1551047	0.854288	0.437926	1.000000	0.469439	0.981212	0.472974	0.847428	0.481002
SRR1551048	0.498308	0.963271	0.469439	1.000000	0.472654	0.970821	0.531707	0.985303
SRR1551054	0.856927	0.446212	0.981212	0.472654	1.000000	0.485487	0.854372	0.490735
SRR1551055	0.491040	0.958366	0.472974	0.970821	0.485487	1.000000	0.526867	0.975907
SRR1551068	0.978841	0.491356	0.847428	0.531707	0.854372	0.526867	1.000000	0.537949
SRR1551069	0.498509	0.968587	0.481002	0.985303	0.490735	0.975907	0.537949	1.000000

	Title	Year	Studio	Rating
mcu_20	Ant-Man and the Wasp	2018	Beuna Vista	NaN
mcu_19	Avengers: Infinity War	2018	Beuna Vista	NaN
mcu_18	Black Panther	2018	Beuna Vista	0.96
mcu_17	Thor: Ragnarok	2017	Beuna Vista	0.92
mcu_16	Spider-Man: Homecoming	2017	Sony	0.92
mcu_15	Guardians of the Galaxy Vol. 2	2017	Beuna Vista	0.83

	Year	City	Sport	Discipline	Athlete	Country	Gender	Event	Medal
0	1924	Chamonix	Biathlon	Biathlon	BERTHET, G.	FRA	Men	Military Patrol	Bronze
1	1924	Chamonix	Biathlon	Biathlon	MANDRILLON, C.	FRA	Men	Military Patrol	Bronze
2	1924	Chamonix	Biathlon	Biathlon	MANDRILLON, Maurice	FRA	Men	Military Patrol	Bronze
3	1924	Chamonix	Biathlon	Biathlon	VANDELLE, André	FRA	Men	Military Patrol	Bronze
4	1924	Chamonix	Biathlon	Biathlon	AUFDENBLATTEN, Adolf	SUI	Men	Military Patrol	Gold

	Year	Rating	Predicted Rating
count	6.000000	4.0000	6.000000
mean	2017.500000	0.9075	0.747185
std	0.547723	0.0550	0.371444
min	2017.000000	0.8300	0.016992
25%	2017.000000	0.8975	0.751550
50%	2017.500000	0.9200	0.897470
75%	2018.000000	0.9300	0.952777
max	2018.000000	0.9600	0.997052

	Year	City	Sport	Discipline	Athlete	Country	Gender	Event	Medal
29603	2012	London	Athletics	Athletics	Pending	NaN	Women	1500M	Gold
31072	2012	London	Weightlifting	Weightlifting	Pending	NaN	Women	63KG	Gold
31091	2012	London	Weightlifting	Weightlifting	Pending	NaN	Men	94KG	Silver
31110	2012	London	Wrestling	Wrestling Freestyle	KUDUKHOV, Besik	NaN	Men	Wf 60 KG	Silver

	SRR1550986	SRR1550987	SRR1551047	SRR1551048	SRR1551054	SRR1551055	SRR1551068	SRR1551069
SYMBOL
RPS11	2950	23072	4604	27297	4794	34420	6313	27590
PNMA1	51	190	79	212	39	162	78	153
TMEM216	365	162	317	224	192	183	365	123
TRAF3IP2-AS1	26	40	30	49	52	53	31	29
ERCC5	251	533	309	522	418	540	409	436

	SRR1550986	SRR1550987	SRR1551047	SRR1551048	SRR1551054	SRR1551055	SRR1551068	SRR1551069
SYMBOL
RPS11	127.914693	880.879456	156.866107	1033.023252	159.333270	1349.522268	205.208699	1199.593017
PNMA1	2.211407	7.254122	2.691664	8.022894	1.296203	6.351616	2.535447	6.652328
TMEM216	15.826733	6.185093	10.800729	8.477020	6.381307	7.174973	11.864593	5.347950
TRAF3IP2-AS1	1.127384	1.527184	1.022151	1.854348	1.728271	2.077998	1.007678	1.260899
ERCC5	10.883589	20.349720	10.528155	19.754484	13.892638	21.172052	13.294845	18.956961