The library is oriented towards table-like data structures that can be manipulated by a collection of methods:
In [ ]:
# Import Pandas and Numpy
import pandas as pd
import numpy as np
In [ ]:
# Make Series of count data and visaulize series
counts = pd.Series([223, 43, 53, 24, 43])
counts
In [ ]:
# What datatype is the counts object?
type(counts)
In [ ]:
# Make Series of count data with Gene Symbols
rna_counts = pd.Series([50, 10, 12, 29, 4], index=['BRCA2', 'GATA2', 'Myc', 'FOXA1', 'ERCC2'])
rna_counts
In [ ]:
# Construct second sample RNA-counts dict
cell2_counts = {'BRCA2':5, 'GATA2':20, 'Myc':45, 'FOXA1':10, 'ERCC2':0, 'BRCA1': 20}
cell2_counts
In [ ]:
# Make pandas Series from RNA-counts dict
rna_counts_cell2 = pd.Series(cell2_counts)
rna_counts_cell2
In [ ]:
# Access the 1st element of counts data
counts[0]
In [ ]:
# Get the 2nd through 4th elements
counts[1:4]
In [ ]:
# Get the counts for Myc Gene
rna_counts['Myc']
In [ ]:
# Get the Counts for FOXA1, GATA2 and BRCA2
rna_counts[['FOXA1', 'GATA2', 'BRCA2']]
In [ ]:
# Get the values in the counts matrix
counts.values
In [ ]:
# Get the index of the rna_counts matrix
rna_counts.index
In [ ]:
rna_counts.name = 'RNA Counts'
rna_counts.index.name = 'Symbol'
rna_counts
In [ ]:
# Select Genes that have greater than 20 counts
rna_counts > 20
In [ ]:
# Select genes that have greater than 20 counts
rna_counts[rna_counts > 20]
In [ ]:
# Make Movie Database with missing values
mcu_opening = {'Black Panther': 202003951, 'Thor: Ragnarok': 122744989, 'Spider-Man: Homecoming': 117027503,
'Guardians of the Galaxy Vol. 2': 146510104, 'Doctor Strange': 85058311,
'Captain America: Civil War': 179139142}
mcu_movies = ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok',
'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2', 'Doctor Strange', 'Captain America: Civil War']
mcu_series = pd.Series(mcu_opening, index=mcu_movies)
mcu_series
In [ ]:
# Find movies with no opening revenue
pd.isnull(mcu_series)
In [ ]:
# Good opportunity to use Boolean filter get index and only movie names
mcu_series[pd.isnull(mcu_series)].index.values
In [ ]:
# Find movies with opening revenue
pd.notnull(mcu_series)
In [ ]:
# Display only movies with no opening revenue
mcu_series[pd.isnull(mcu_series)].index.values
In [ ]:
rna_counts
In [ ]:
rna_counts_cell2
In [ ]:
# Combine counts for 2 cells
rna_counts + rna_counts_cell2
Create a Series from a specified list data which has the index labels as student names and test values.
Which Students have scores greater than 15?
Bonus: How would you use get the Students scores greater than 15 and less than 20?
Names | Value |
---|---|
Anastasia | 12.5 |
Dima | 9 |
Katherine | 16.5 |
James | NaN |
Emily | 9 |
Michael | 20 |
Matthew | 14.5 |
Laura | NaN |
Kevin | 8 |
Jonas | 19 |
In [ ]:
# Sample Python data and labels:
students = ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas']
test_scores = [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19]
s_scores = pd.Series(test_scores, index=students)
# Which Students have scores greater than 15?
In [ ]:
# Bonus: How would you use get the Students scores greater than 15 and less than 20?
What is the mean, median and max test scores?
In [ ]:
# What is the mean, median and max test scores?
In [ ]:
In [ ]:
Title | Year | Studio | Rating |
---|---|---|---|
Ant-Man and the Wasp | 2018 | Beuna Vista | NaN |
Avengers: Infinity War | 2018 | Beuna Vista | NaN |
Black Panther | 2018 | Beuna Vista | 0.96 |
Thor: Ragnarok | 2017 | Beuna Vista | 0.92 |
Spider-Man: Homecoming | 2017 | Sony | 0.92 |
Guardians of the Galaxy Vol. 2 | 2017 | Beuna Vista | 0.83 |
In [ ]:
# Make Dataframe of Marvel data
mcu_data = {'Title': ['Ant-Man and the Wasp', 'Avengers: Infinity War', 'Black Panther', 'Thor: Ragnarok',
'Spider-Man: Homecoming', 'Guardians of the Galaxy Vol. 2'],
'Year':[2018, 2018, 2018, 2017, 2017, 2017],
'Studio':['Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Beuna Vista', 'Sony', 'Beuna Vista'],
'Rating': [np.nan, np.nan, 0.96, 0.92, 0.92, 0.83]}
df_mcu = pd.DataFrame(mcu_data)
df_mcu
In [ ]:
# What datatype is df_mcu?
type(df_mcu)
In [ ]:
# Assign column order and index based on Marvel Cinemetic Universe Movie Number
mcu_index = ['mcu_20','mcu_19', 'mcu_18', 'mcu_17', 'mcu_16', 'mcu_15']
mcu_columns = ['Title', 'Year', 'Studio', 'Rating']
df_mcu = pd.DataFrame(mcu_data, columns = mcu_columns, index = mcu_index)
df_mcu
In [ ]:
# Make Dataframe of population
pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
df_pop = pd.DataFrame(pop)
df_pop
In [ ]:
# Get the number of rows in a Dataframe
len(df_mcu)
In [ ]:
# Get the (rows, cols) of the Dataframe
df_mcu.shape
In [ ]:
# Get the column headers
df_mcu.columns
In [ ]:
# Get the row index values
df_mcu.index
In [ ]:
# Get values of the Dataframe only
df_mcu.values
In [ ]:
# Select values in a single column
df_mcu['Title']
In [ ]:
# Select values in a list of columns
df_mcu[['Title', 'Rating']]
In [ ]:
# Use slice to get the first n rows (NumPy style indexing)
df_mcu[:2]
In [ ]:
# Can combine slice and column selection to select the first n rows
df_mcu['Title'][:2]
In [ ]:
df_mcu[:4]['Year']
In [ ]:
# Return values in the first row
df_mcu.iloc[0]
In [ ]:
# Return values in the first row and second column
df_mcu.iloc[0,1]
In [ ]:
# Return values in the 3,5 and 6th rows
df_mcu.iloc[[2,4,5]]
In [ ]:
# Return values in the first row and columns 2 and 3
df_mcu.iloc[:2, 1:3]
In [ ]:
# Select all values of the 20th Movie
df_mcu.loc['mcu_20']
In [ ]:
# Select all values of the 20th, 17th and 15th movie, which uses row index values,
# Not to be confused with df_mcu[['Title', 'Rating']] which uses column headers
df_mcu.loc[['mcu_20', 'mcu_17', 'mcu_15']]
In [ ]:
# Select the Year and Rating
df_mcu.loc[:, ['Year', 'Rating']]
In [ ]:
# Filter for Rating < .95
df_mcu.loc[df_mcu['Rating'] < .95, :]
In [ ]:
# Filter for Rating < .95 and Sudio is Sony
# Reuse the bitwise comparator seen earlier but with OR instead of AND.
df_mcu.loc[(df_mcu['Rating'] < .95) | (df_mcu['Studio'] == 'Sony'), :]
In [ ]:
# Add new predicted rating to Dataframe
df_mcu['Predicted Rating'] = np.random.random(len(df_mcu))
df_mcu
In [ ]:
# Add a new row for a new movie
new_row = pd.Series(['Captain Marvel', 2019, 'BeunaVista', np.nan, np.random.random(1)[0]], index=df_mcu.columns, name= 'mcu_21' )
df_mcu.append(new_row)
In [ ]:
# Drop the Rating Column
df_mcu.drop('Rating', axis=1)
In [ ]:
# Drop the 17 and 19th movies
df_mcu.drop(['mcu_15', 'mcu_17'])
Create a Dataframe from a specified dictionary data of
Which Students have scorescore is between 15 and 20?
Names | Scores | Attempts | Qualify |
---|---|---|---|
Anastasia | 12.5 | 1 | 'yes' |
Dima | 9 | 3 | 'no' |
Katherine | 16.5 | 2 | 'yes' |
James | NaN | 3 | 'no' |
Emily | 9 | 2 | 'no' |
Michael | 20 | 3 | 'yes' |
Matthew | 14.5 | 1 | 'yes' |
Laura | NaN | 1 | 'no' |
Kevin | 8 | 2 | 'no' |
Jonas | 19 | 1 | 'yes' |
In [ ]:
# Sample Python data:
exam_data = {'Names': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'Scores': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'Attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'Qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
In [ ]:
# Select the students the qualify based on scores using the Qualify column 'yes'
Add a new column of, Grade Level, to indicate which Grade in High school the students are in?
Add a new student named, Jack? (Hint: Need to use ignore_index=True)
In [ ]:
# Add a new column of, Grade Level, to indicate which Grade in High school the students are in?
In [ ]:
# Add a new student named, Jack? (Hint: Need to use ignore_index=True)
Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)
In [ ]:
# Add a new column of Pass that is either 0 or 1 based on the column Qualify. (Hint: use numpy.where)
The first step in any problem is identifying what format your data is in, and then loading it into whatever framework you're using. Common formats are:
Read in Winter Olympic Medal Winners form Kaggle
In [ ]:
# Read in Winter Olympic Medal Winners
winter_olympics = pd.read_csv('data/winter_olympics.csv')
winter_olympics.head()
In [ ]:
# Get the First 3 lines of a Dataframe
winter_olympics.head(3)
In [ ]:
# Sort Dataframe by rows in ascending order
df_mcu.sort_index(axis=0, ascending=True)
In [ ]:
# Sort Dataframe by column in descending order
df_mcu.sort_values(by=['Rating', 'Predicted Rating'], ascending=False)
- for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
- For non-numeric data: count, uniq, most-frequent item, etc
In [ ]:
# Summary Statistics for the Dataframe
df_mcu.describe()
In [ ]:
# Mean of the Rating and Predicted Rating Columns
df_mcu.loc[:,['Rating', 'Predicted Rating']].mean()
In [ ]:
# Get the variance of the Rating column
df_mcu.loc[:,['Rating']].var()
- Drop rows with Nan values
In [ ]:
# Drop rows with NaN values
df_mcu.dropna()
- Fill Nan with default value
In [ ]:
# File Nan in Dataframe with default value
df_mcu.fillna(0)
- Use inplace to modify the dataframe instead of retunring a new object
In [ ]:
# File Nan in Dataframe with default value in place
df_mcu.fillna(0, inplace=True)
df_mcu
What is the median score of the students on the exam?
In [ ]:
# What is the median score of the students on the exam?
Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)
Compute the mean. Would the class as a whole pass the test?
Are there any students that will fail now?
In [ ]:
# Deduct 4 points from everyone that attempted the exam 2 or more times. Replace all Nan scores with 0. (Passing is 12 points)
In [ ]:
# Compute the mean. Would the class as a whole pass the test?
In [ ]:
# Are there any students that will fail now?
In [ ]:
# Groupby year of release and get mean Rating
df_mcu.groupby('Year').mean()
- Apply square to every value in a dataframe
In [ ]:
# Apply square to every value in a dataframe
test_data = np.arange(9).reshape(3,-1)
df_test = pd.DataFrame(test_data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
df_test
In [ ]:
df_test.applymap(np.square)
- Apply a new function that subtract max from 2 times min in every column
In [ ]:
# Define max minus min function
def max_minus_min(x):
return max(x)-(2*min(x))
# Apply a new function that subtract max from min in every column
df_test.apply(max_minus_min)
Group students by attempts and find the average score?
In [ ]:
# Group students by attempts and find the average score?
Group students by their pass result and report the variance in scores?
In [ ]:
# Group students by their pass result and report the variance in scores?
In [ ]:
# Import maplotlib and setup to display plots notebook
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# Plot the Number of Medals in barchart by country using: plot.bar()
winter_olympics.groupby(['Country'])['Medal'].count().plot.bar()
Typical normalization approach:
Further Resources:
In the Winter olympics which country has the most Biathlon medals?
In the Winter olympics which country has the most Skiing medals? And in which event do they have the most Gold medals?
In [ ]:
# In the Winter olympics which country has the most Biathlon medals?
In [ ]:
In [ ]:
# In the Winter olympics which country has the most Skiing medals?
In [ ]:
# And in which event do they have the most Gold medals?
In [ ]:
Import the Summer Olympic dataset located in ('data/summer_olypmics.csv')
Which Olympian has the most medals?
Which Olympian has the most Gold medals and for which Country and Sport?
Which rows have no values and why?
In [ ]:
# Import the Summer Olympic dataset located in ('../data/summer_olypmics.csv')
summer_olympics = pd.read_csv('data/summer_olympics.csv')
#Which Olympian has the most medals?
summer_olympics.groupby(['Athlete'])['Medal'].count().sort_values(ascending=False).head()
In [ ]:
# Which Olympian has the most Gold medals and for which Country?
summer_olympics.groupby(['Athlete','Medal','Country'])['Medal'].count().sort_values(ascending=False).head()
In [ ]:
# Which Olympian has the most Gold medals and for which Sport?
summer_olympics.groupby(['Athlete','Medal','Discipline'])['Medal'].count().sort_values(ascending=False).head()
In [ ]:
# Which rows have no values and why?
summer_olympics[pd.isnull(summer_olympics).any(axis=1)]
Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')
Calculate CPM for each Sample. (CPM, Counts Per Million)
Formula for CPM = readsMappedToGene x 1/totalNumReads x 10^6
totalNumReads - total number of mapped reads of a sample readsMappedToGene - number of reads mapped to a selected gene
Which Gene has the highest average CPM?
What is the Correlation between SRR1550986 SRR1550987?
In [ ]:
# Import the example RNA-seq Count Data in ('data/RNAseq_count_table.txt')
rna_counts = pd.read_csv('data/RNAseq_count_table.txt', sep = '\t', index_col = 0)
rna_counts.head()
In [ ]:
# Calculate CPM for each Sample.
rna_cpm = rna_counts.divide(rna_counts.sum(axis=0)).multiply(1000000)
rna_cpm.head()
In [ ]:
# Which Gene has the highest average CPM?
rna_cpm.mean(axis=1).sort_values(ascending=False).head()
In [ ]:
# What is the Correlation between SRR1550986 SRR1550987?
rna_cpm.corr()