Introduction to Pandas and DataFrames


Venkat Malladi (Computational Biologist BICF)

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

1.1 What is Pandas?


Description

  • Python library providing high-performance, easy-to-use data structures and data analysis tools.
  • Suitable for tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Provides data analysis features similar to: R, MATLAB, SAS
  • Based on NumPy

1.2 Key Features


The library is oriented towards table-like data structures that can be manipulated by a collection of methods:

  • Easy handling of missing data
  • Label-based slicing, indexing and subsetting of large data sets
  • Powerful and flexible group by functionality to perform split-apply-combine operations on data sets
  • Read/Write data from/to Excel, CSV, SQL databases, JSON

1.3 Import Pandas


Before we explore the pandas package, let's import pandas.

The convention is to use pd to refer to pandas when importing the package.


In [ ]:
# Import Pandas and Numpy

import pandas as pd
import numpy as np

1.4 Data Structures

  • Series
  • Dataframe

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

2.1 Series


  • One dimensional array-like object
  • Contains an array of data (of any NumPy data type) with an index that labels each element in the vector.

    • Indexes can be:
      • integers
      • strings
      • other data types
NOTE: A Series is not a NumPy array, but it is built on top of a NumPy array and adds functionality to it

2.2 Series - Integer indexes


  • If an index is not specified, a default sequence of integers is assigned as index.

Index  Value
0      35
1      50
2      25
  • Make a Series of count data

In [ ]:
# Make Series of count data and visualize the series

counts = pd.Series([223, 43, 53, 24, 43])
counts
  • What datatype is the counts object?

In [ ]:
# What datatype is the counts object?

type(counts)

2.3 Series - String indexes


  • We can assign meaningful labels to the indexes when making the Series object by specifying an array of index labels

Index  Value
CA     35
TX     50
OK     25
  • Make Series of count data with Gene Symbols

In [ ]:
# Make Series of count data with Gene Symbols

rna_counts = pd.Series([50, 10, 12, 29, 4], index=['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'])
rna_counts

2.4 Series - Dictionary


  • Can be thought of as a dict
  • Can be constructed from a dict directly.
  • Construct second sample RNA-counts dict

In [ ]:
# Construct second sample RNA-counts dict

cell2_counts = {'BRCA2':5, 'GATA2':20, 'Myc':45, 'FOXA1':10, 'ERCC2':0, 'BRCA1': 20}
cell2_counts
  • Make pandas Series from RNA-counts dict

In [ ]:
# Make pandas Series from RNA-counts dict

rna_counts_cell2 = pd.Series(cell2_counts)
rna_counts_cell2

2.5.1 Series - Referencing Elements - Integer


NOTE: We can access the values like an array
  • Integer Index
  • Access the 1st element of counts data

In [ ]:
# Access the 1st element of counts data

counts[0]
  • Get the 2nd through 4th elements

In [ ]:
# Get the 2nd through 4th elements

counts[1:4]

2.5.2 Series - Referencing Elements - String


  • String Index
  • Get the counts for the Myc Gene

In [ ]:
# Get the counts for Myc Gene

rna_counts['Myc']
  • Get the counts for FOXA1, GATA2 and BRCA2

In [ ]:
# Get the Counts for FOXA1, GATA2 and BRCA2

rna_counts[['FOXA1', 'GATA2', 'BRCA2']]

2.5.3 Series - Referencing Elements - array/index values


  • Can get the array representation and index object of the Series via its values and index attributes, respectively.
  • Get the values of the counts Series

In [ ]:
# Get the values of the counts Series

counts.values
  • Get the index of the rna_counts Series

In [ ]:
# Get the index of the rna_counts Series

rna_counts.index

2.5.4 Series - Referencing Elements - labels


  • Can give both the array of values and the index meaningful labels themselves

In [ ]:
rna_counts.name = 'RNA Counts'
rna_counts.index.name = 'Symbol'
rna_counts

2.6 Series - array operations


  • NumPy array operations can be applied to Series without losing the data structure
  • Use boolean array to filter Series
  • Create a boolean mask of genes that have greater than 20 counts

In [ ]:
# Create a boolean mask of genes that have greater than 20 counts

rna_counts > 20
  • Select genes that have greater than 20 counts

In [ ]:
# Select genes that have greater than 20 counts

rna_counts[rna_counts > 20]

2.7 Series - null values


  • Many times Series have missing values that you need to identify and clean

Marvel Cinematic Universe

  • Make a movie Series with missing values

In [ ]:
# Make a movie Series with missing values

mcu_opening = {'Black Panther': 202003951, 'Thor: Ragnarok': 122744989, 'Spider-Man: Homecoming': 117027503,
              'Guardians of the Galaxy Vol. 2': 146510104, 'Doctor Strange': 85058311, 
               'Captain America: Civil War': 179139142}
mcu_movies = ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2', 'Doctor Strange', 'Captain America: Civil War']
mcu_series = pd.Series(mcu_opening, index=mcu_movies)

mcu_series
  • Find movies with no opening revenue

In [ ]:
# Find movies with no opening revenue

pd.isnull(mcu_series)

In [ ]:
# A good opportunity to use a Boolean filter to get the index and show only the movie names

mcu_series[pd.isnull(mcu_series)].index.values
  • Find movies with opening revenue

In [ ]:
# Find movies with opening revenue

pd.notnull(mcu_series)
  • Display only movies with no opening revenue

In [ ]:
# Display only movies with no opening revenue

mcu_series[pd.isnull(mcu_series)].index.values

2.8 Series - auto alignment


  • Index labels are used to align (merge) data when used in operations with other Series objects

In [ ]:
rna_counts

In [ ]:
rna_counts_cell2
  • Combine counts for 2 cells (rna_counts and rna_counts_cell2)

In [ ]:
# Combine counts for 2 cells 

rna_counts + rna_counts_cell2
  • Adding Series combines values with the same label in the resulting Series
  • Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position
  • Notice the missing values: if a label is present in only one Series, it can't be added to a value in the other, so the result is a missing value

Exercise 1


Create a Series from the list data below, using the student names as the index labels and the test scores as the values.

Which Students have scores greater than 15?

Bonus: How would you get the students with scores greater than 15 and less than 20?

Hint: Use the bitwise operator &

Names      Value
Anastasia  12.5
Dima       9
Katherine  16.5
James      NaN
Emily      9
Michael    20
Matthew    14.5
Laura      NaN
Kevin      8
Jonas      19

In [ ]:
# Sample Python data and labels:


students = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']

test_scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]

s_scores = pd.Series(test_scores, index=students)

# Which Students have scores greater than 15?
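# One possible solution (a sketch, not the only approach): boolean filtering on the Series
s_scores[s_scores > 15]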

In [ ]:
# Bonus: How would you get the students with scores greater than 15 and less than 20?
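# A sketch combining two conditions with the bitwise operator & (parentheses are required)
s_scores[(s_scores > 15) & (s_scores < 20)]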

Exercise 2

What are the mean, median, and max test scores?


In [ ]:
# What are the mean, median, and max test scores?
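# A sketch: these Series reductions skip NaN values by default
s_scores.mean(), s_scores.median(), s_scores.max()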

In [ ]:


In [ ]:

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
    • Exercise 5
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

3.1 Dataframe


  • A DataFrame is a tabular data structure, comprised of rows and columns like in a spreadsheet
  • Each column can be a different value type (numeric, string, boolean, etc.)

Title                           Year  Studio       Rating
Ant-Man and the Wasp            2018  Buena Vista  NaN
Avengers: Infinity War          2018  Buena Vista  NaN
Black Panther                   2018  Buena Vista  0.96
Thor: Ragnarok                  2017  Buena Vista  0.92
Spider-Man: Homecoming          2017  Sony         0.92
Guardians of the Galaxy Vol. 2  2017  Buena Vista  0.83

3.2 Dataframe - from dict of lists


  • dict keys: columns
  • dict values (arrays): rows
  • Make Dataframe of Marvel data

In [ ]:
# Make Dataframe of Marvel data

mcu_data = {'Title': ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok', 
              'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2'],
            'Year':[2018, 2018, 2018, 2017, 2017, 2017],
            'Studio':['Buena Vista', 'Buena Vista', 'Buena Vista', 'Buena Vista', 'Sony', 'Buena Vista'],
            'Rating': [np.nan, np.nan, 0.96, 0.92, 0.92, 0.83]}
df_mcu = pd.DataFrame(mcu_data)
df_mcu
  • What datatype is df_mcu?

In [ ]:
# What datatype is df_mcu?

type(df_mcu)

3.3 Dataframe - specifying indices and columns


  • Order of columns/rows can be specified using:
    • columns array
    • index array

In [ ]:
# Assign column order and index based on Marvel Cinematic Universe movie number

mcu_index = ['mcu_20','mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15']
mcu_columns = ['Title', 'Year', 'Studio', 'Rating']
df_mcu = pd.DataFrame(mcu_data, columns = mcu_columns, index = mcu_index)
df_mcu

3.4 Dataframe - from nested dict of dicts


  • Outer dict keys as columns and inner dict keys as row indices
  • Make Dataframe of population of states

In [ ]:
# Make Dataframe of population

pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
df_pop = pd.DataFrame(pop)
df_pop

3.5 Dataframe - number of rows and columns


  • Get the number of rows in a Dataframe

In [ ]:
# Get the number of rows in a Dataframe

len(df_mcu)
  • Get the (rows, cols) of the Dataframe

In [ ]:
# Get the (rows, cols) of the Dataframe

df_mcu.shape

3.6 Dataframe - index, columns and values


  • Get the column headers

In [ ]:
# Get the column headers

df_mcu.columns
  • Get the row index values

In [ ]:
# Get the row index values

df_mcu.index
  • Get values of the Dataframe only

In [ ]:
# Get values of the Dataframe only

df_mcu.values

3.7 Dataframe - Selecting Columns and Rows


  • There are three basic ways to access the data in the Dataframe:
    1. Quick Access: DataFrame[]
    2. Integer position based selection method: DataFrame.iloc[row, col]
    3. Label based selection method: DataFrame.loc[row, col]

3.7.1 Dataframe - Selecting Columns and Rows - Quick Access


  • Select values in a single column

In [ ]:
# Select values in a single column

df_mcu['Title']
  • Select values in a list of columns

In [ ]:
# Select values in a list of columns

df_mcu[['Title', 'Rating']]
  • Use slice to get the first n rows (NumPy style indexing)

In [ ]:
# Use slice to get the first n rows (NumPy style indexing)

df_mcu[:2]
  • Can combine slice and column selection to select the first n rows

In [ ]:
# Can combine slice and column selection to select the first n rows

df_mcu['Title'][:2]
  • The order of the column selection and the slice doesn't matter

In [ ]:
df_mcu[:4]['Year']

3.7.2 Dataframe - Selecting Columns and Rows - Integer based selection


  • iloc is primarily integer-position based (from 0 to length-1 of the axis), including negative indices that count from the end, but it may also be used with a boolean array.

  • Allowed inputs are:

  • Integer

In [ ]:
# Return values in the first row

df_mcu.iloc[0]

In [ ]:
# Return values in the first row and second column

df_mcu.iloc[0,1]
  • A list or array of integers

In [ ]:
# Return values in the 3rd, 5th and 6th rows

df_mcu.iloc[[2,4,5]]
  • A slice object with ints

In [ ]:
# Return values in the first two rows and columns 2 and 3

df_mcu.iloc[:2, 1:3]

3.7.3 Dataframe - Selecting Columns and Rows - Label based selection


  • loc is primarily label based, but may also be used with a boolean array.

  • Allowed inputs are:

  • A single label

In [ ]:
# Select all values of the 20th Movie

df_mcu.loc['mcu_20']
  • A list or array of labels

In [ ]:
# Select all values of the 20th, 17th and 15th movie, which uses row index values, 
# Not to be confused with df_mcu[['Title', 'Rating']] which uses column headers

df_mcu.loc[['mcu_20', 'mcu_17', 'mcu_15']]
  • A slice object with labels
NOTE: Unlike regular Python slices with integer positions, both the start and the stop are included!

In [ ]:
# Select the Year and Rating


df_mcu.loc[:, ['Year', 'Rating']]

3.7.4 Dataframe - Selecting Columns and Rows - Data Filtering


  • Data filtering using boolean masks
  • Filter and select on a single condition

In [ ]:
# Filter for Rating < .95

df_mcu.loc[df_mcu['Rating'] < .95, :]
  • Filter and select on multiple conditions

In [ ]:
# Filter for Rating < .95 or Studio is Sony
# Reuse the bitwise operators seen earlier, but with OR (|) instead of AND (&).

df_mcu.loc[(df_mcu['Rating'] < .95) | (df_mcu['Studio'] == 'Sony'), :]

3.8 Dataframe - Adding and Deleting data


  • Add a new column

In [ ]:
# Add new predicted rating to Dataframe

df_mcu['Predicted Rating'] = np.random.random(len(df_mcu))
df_mcu
  • Add a new row

In [ ]:
# Add a new row for a new movie 

new_row = pd.Series(['Captain Marvel', 2019, 'Buena Vista', np.nan, np.random.random(1)[0]], index=df_mcu.columns, name='mcu_21')
df_mcu.append(new_row)
  • Drop an existing column

In [ ]:
# Drop the Rating Column

df_mcu.drop('Rating', axis=1)
  • Drop an existing row

In [ ]:
# Drop the 15th and 17th movies

df_mcu.drop(['mcu_15', 'mcu_17'])
  • Columns, Rows or individual elements can be modified similarly using loc or iloc.
NOTE: These operations (append, drop) return a new DataFrame object without changing the original Dataframe.
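
To keep the result of append or drop, assign the returned object to a (new) variable, or use the inplace argument where it is available. A minimal sketch (the variable name is just illustrative):

In [ ]:
# drop returns a new DataFrame; assign it to keep the result, while df_mcu itself is unchanged
df_mcu_no_rating = df_mcu.drop('Rating', axis=1)
df_mcu_no_rating.columns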

Exercise 3


Create a Dataframe from the dictionary data specified below.

Which students have a score between 15 and 20?

Names      Scores  Attempts  Qualify
Anastasia  12.5    1         'yes'
Dima       9       3         'no'
Katherine  16.5    2         'yes'
James      NaN     3         'no'
Emily      9       2         'no'
Michael    20      3         'yes'
Matthew    14.5    1         'yes'
Laura      NaN     1         'no'
Kevin      8       2         'no'
Jonas      19      1         'yes'

In [ ]:
# Sample Python data:


exam_data = {'Names': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 
             'Scores': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 
             'Attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 
             'Qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

In [ ]:
# Select the students that qualify based on the Qualify column ('yes')
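# One possible sketch (df_exam is just an illustrative name for the new DataFrame)
df_exam = pd.DataFrame(exam_data)
df_exam[df_exam['Qualify'] == 'yes']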

Exercise 4


Add a new column, Grade Level, to indicate which grade in high school each student is in.

Add a new student named Jack. (Hint: you will need to use ignore_index=True)


In [ ]:
# Add a new column, Grade Level, to indicate which grade in high school each student is in

In [ ]:
# Add a new student named Jack (Hint: you will need to use ignore_index=True)
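# One possible sketch (uses the df_exam DataFrame from Exercise 3; Jack's values are illustrative)
new_student = {'Names': 'Jack', 'Scores': 15.0, 'Attempts': 1, 'Qualify': 'yes'}
df_exam = df_exam.append(new_student, ignore_index=True)
df_exam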

Exercise 5


Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)

  • 'yes': 1
  • 'no': 0

In [ ]:
# Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)
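# One possible sketch using numpy.where on the Qualify column of df_exam (from Exercise 3)
df_exam['Pass'] = np.where(df_exam['Qualify'] == 'yes', 1, 0)
df_exam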

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

4.1 Import and Store Data


The first step in any problem is identifying what format your data is in, and then loading it into whatever framework you're using. Common formats are:

  • Text files: files with a common delimiter to separate values (e.g. CSV uses ,)
  • JSON (JavaScript Object Notation): a standard format for sending data over HTTP requests
  • Web pages: XML and HTML
  • Binary: "pickle" format and HDF5
  • Databases: MySQL, PostgreSQL
NOTE: The most common formats are CSV or text files with other delimiters.

4.2 Import and Store Data - Text Files


  • Reading
    • read_csv: read a file using a comma-separated (,) delimiter
    • read_table: read a file using a tab (\t) delimiter
  • Writing
    • to_csv: write a file using a comma-separated (,) delimiter by default; pass sep='\t' to write a tab-delimited file

Read in Winter Olympic Medal Winners from Kaggle


In [ ]:
# Read in Winter Olympic Medal Winners 

winter_olympics = pd.read_csv('data/winter_olympics.csv')
winter_olympics.head()
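  • Writing works in reverse with to_csv; the sketch below writes a tab-delimited file (the output file name is just an example)

In [ ]:
# Write the DataFrame back out as a tab-delimited file (output file name is illustrative)
winter_olympics.to_csv('data/winter_olympics_copy.txt', sep='\t', index=False)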

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

5.1 Summarizing and Computing Descriptive Statistics


  • Pandas has a lot of built-in essential functionality, common to the pandas data structures, to help explore the data.

5.2 Summarizing and Computing Descriptive Statistics - Head and Tail


  • To view a small sample of a Series or DataFrame object, use:
    • head()
    • tail()
NOTE: The default number of elements to display is five, but you may pass a custom number.
  • Get the First 3 lines of Dataframe

In [ ]:
# Get the First 3 lines of a Dataframe

winter_olympics.head(3)

5.3 Summarizing and Computing Descriptive Statistics - Sorting


  • To sort data for exploring the data use:
    • sort_index(): object by labels (along an axis)
    • sort_values(): by the values along either axis
NOTE: axis: 0 or ‘index’, 1 or ‘columns’ and default is 0
  • Sort Dataframe by rows in ascending order

In [ ]:
# Sort Dataframe by rows in ascending order

df_mcu.sort_index(axis=0, ascending=True)
  • Sort the Dataframe by column values in descending order

In [ ]:
# Sort Dataframe by column in descending order

df_mcu.sort_values(by=['Rating', 'Predicted Rating'], ascending=False)

5.4 Summarizing and Computing Descriptive Statistics - Descriptive statistics


  • Built-in functions calculate summary values over rows or columns
  • describe(): return summary statistics of each column
      - for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
      - for non-numeric data: count, unique, most-frequent item, etc.

In [ ]:
# Summary Statistics for the Dataframe

df_mcu.describe()
  • mean()

In [ ]:
# Mean of the Rating and Predicted Rating Columns

df_mcu.loc[:,['Rating', 'Predicted Rating']].mean()
  • var()

In [ ]:
# Get the variance of the Rating column

df_mcu.loc[:,['Rating']].var()

5.5 Summarizing and Computing Descriptive Statistics - Missing Data


  • Data comes in many shapes and forms, and pandas is very flexible in handling missing data:
    • NaN is the default missing value marker
    • Python None also arises and is also treated as “missing” / “not available” / “NA”
  • Drop rows with NaN values

In [ ]:
# Drop rows with NaN values

df_mcu.dropna()
  • Fill NaN with a default value

In [ ]:
# Fill NaN values in the Dataframe with a default value

df_mcu.fillna(0)
  • Use inplace=True to modify the Dataframe instead of returning a new object

In [ ]:
# Fill NaN values in the Dataframe with a default value, in place

df_mcu.fillna(0, inplace=True)
df_mcu

Exercise 6


What is the median score of the students on the exam?


In [ ]:
# What is the median score of the students on the exam?
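# A possible sketch (assuming the df_exam DataFrame from Exercise 3)
df_exam['Scores'].median()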

Exercise 7


Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all NaN scores with 0. (Passing is 12 points)

Compute the mean. Would the class as a whole pass the test?

Are there any students that will fail now?


In [ ]:
# Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all NaN scores with 0. (Passing is 12 points)
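# One possible sketch (using the df_exam DataFrame from Exercise 3)
# Deduct 4 points where Attempts >= 2, then replace the remaining NaN scores with 0
df_exam.loc[df_exam['Attempts'] >= 2, 'Scores'] -= 4
df_exam['Scores'] = df_exam['Scores'].fillna(0)
df_exam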

In [ ]:
# Compute the mean. Would the class as a whole pass the test?

In [ ]:
# Are there any students that will fail now?

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

6.1 Grouped and apply


The groupby functionality refers to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently

6.2 Grouped and apply - Splitting


Pandas objects can be split on any of their axes.

  • A 'grouping' is provided as a mapping of labels to group names in a GroupBy object
  • groupby()

In [ ]:
# Groupby year of release and get mean Rating

df_mcu.groupby('Year').mean()

6.3 Grouped and apply - Apply


Apply the same function to every column or row:

  • applymap: Apply same function across every cell
  • apply: Apply same function to every column (default) or row
    - Apply square to every value in a dataframe

In [ ]:
# Apply square to every value in a dataframe

test_data = np.arange(9).reshape(3,-1)

df_test = pd.DataFrame(test_data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
df_test

In [ ]:
df_test.applymap(np.square)
    - Apply a new function that subtracts 2 times the min from the max in every column

In [ ]:
# Define max minus min function
def max_minus_min(x): 
    return max(x)-(2*min(x))

# Apply the max_minus_min function (max minus 2 times min) to every column


df_test.apply(max_minus_min)

Exercise 8


Group students by attempts and find the average score?


In [ ]:
# Group students by attempts and find the average score?
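# One possible sketch (using the df_exam DataFrame from Exercise 3)
df_exam.groupby('Attempts')['Scores'].mean()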

Exercise 9


Group students by their pass result and report the variance in scores?


In [ ]:
# Group students by their pass result and report the variance in scores?
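# One possible sketch (using the Pass column added in Exercise 5)
df_exam.groupby('Pass')['Scores'].var()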

Agenda


  1. Introduction to Pandas
  2. Series
    • Exercise 1
    • Exercise 2
  3. Dataframe
    • Exercise 3
    • Exercise 4
  4. Import and Store Data
  5. Summarizing and Computing Descriptive Statistics
    • Exercise 6
    • Exercise 7
  6. Grouped and apply
    • Exercise 8
    • Exercise 9
  7. Data Transformation and Normalization
    • Exercise 10
    • Exercise 11
    • Exercise 12

7.1 Data Transformation and Normalization


For Machine Learning (ML) and other data analysis it is important to:

  • Explore your data
  • Standardize (transform/normalize) data so that different columns become comparable / compatible

7.2 Data Transformation and Normalization - Data Exploration


Pandas Dataframes have a built in plot functionality:

  • Dataframe.plot()
NOTE: We will go into more plotting libraries later in the course.

In [ ]:
# Import matplotlib and set it up to display plots in the notebook

import matplotlib.pyplot as plt
%matplotlib inline
  • Plot the number of medals in a bar chart by country using: plot.bar()

In [ ]:
# Plot the Number of Medals in barchart by country using: plot.bar()

winter_olympics.groupby(['Country'])['Medal'].count().plot.bar()
  • Other useful plots:
    • hist: plot.hist()
    • scatter: plot.scatter()
    • boxplot: plot.box()

7.3 Data Transformation and Normalization - Normalization


Why Normalize (re-scale)?

  • Transform data to obtain a certain distribution e.g. from lognormal to normal
  • Normalize data so different columns become comparable / compatible

Typical normalization approaches (the first two are sketched below):

  • Z-score transformation
  • Scale to between 0 and 1
  • Trimmed mean normalization
  • Vector length transformation
  • Quantile normalization
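
For example, the first two approaches are one-liners on a numeric DataFrame. A minimal sketch using the small df_test DataFrame from section 6.3 (any numeric DataFrame would work the same way):

In [ ]:
# Z-score transformation: subtract each column's mean and divide by its standard deviation
df_z = (df_test - df_test.mean()) / df_test.std()
df_z

In [ ]:
# Scale each column to between 0 and 1 (min-max scaling)
df_minmax = (df_test - df_test.min()) / (df_test.max() - df_test.min())
df_minmax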


Exercise 10


In the Winter Olympics, which country has the most Biathlon medals?

In the Winter Olympics, which country has the most Skiing medals? And in which event do they have the most Gold medals?


In [ ]:
# In the Winter olympics which country has the most Biathlon medals?
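# One possible sketch; this assumes the winter_olympics file has a 'Sport' column
# (only 'Country' and 'Medal' were used earlier, so check winter_olympics.columns first)
winter_olympics[winter_olympics['Sport'] == 'Biathlon'].groupby('Country')['Medal'].count().sort_values(ascending=False).head()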

In [ ]:


In [ ]:
# In the Winter olympics which country has the most Skiing medals?

In [ ]:
# And in which event do they have the most Gold medals?

In [ ]:

Exercise 11


Import the Summer Olympic dataset located in ('data/summer_olympics.csv')

Which Olympian has the most medals?

Which Olympian has the most Gold medals and for which Country and Sport?

Which rows have no values and why?


In [ ]:
# Import the Summer Olympic dataset located in ('data/summer_olympics.csv')
summer_olympics = pd.read_csv('data/summer_olympics.csv')


#Which Olympian has the most medals?
summer_olympics.groupby(['Athlete'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which Olympian has the most Gold medals and for which Country?
summer_olympics.groupby(['Athlete','Medal','Country'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which Olympian has the most Gold medals and for which Sport?
summer_olympics.groupby(['Athlete','Medal','Discipline'])['Medal'].count().sort_values(ascending=False).head()

In [ ]:
# Which rows have no values and why?
summer_olympics[pd.isnull(summer_olympics).any(axis=1)]

Exercise 12


Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

Calculate CPM for each Sample. (CPM, Counts Per Million)

Formula: CPM = readsMappedToGene / totalNumReads × 10^6

  • totalNumReads: total number of mapped reads in a sample
  • readsMappedToGene: number of reads mapped to a selected gene

Which Gene has the highest average CPM?

What is the correlation between SRR1550986 and SRR1550987?


In [ ]:
# Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')

rna_counts = pd.read_csv('data/RNAseq_count_table.txt', sep = '\t', index_col = 0)
rna_counts.head()

In [ ]:
# Calculate CPM for each Sample.


rna_cpm = rna_counts.divide(rna_counts.sum(axis=0)).multiply(1000000)
rna_cpm.head()

In [ ]:
# Which Gene has the highest average CPM?


rna_cpm.mean(axis=1).sort_values(ascending=False).head()

In [ ]:
# What is the correlation between SRR1550986 and SRR1550987?

rna_cpm.corr()