Content:
A DataFrame is a two dimensional data structure with columns of potentially different data types. It can be considered like a table or a spreadsheet, similar to R's data.frame.
A DataFrame has both row index and column index.
In [1]:
import numpy as np
import pandas as pd
In [2]:
# Create a DataFrame from dictionary
data = {'Region':['Central','East','North','North-East','West'],
'Area':[132.7,93.1,134.5,103.9,201.3],
'Population':[939890,693500,531860,834450,903010]}
df = pd.DataFrame(data)
# Display df
df
Out[2]:
In [3]:
# Note that the row index are assigned automatically and column index are arranged in alphabetical order.
# Rearrange columns
df = pd.DataFrame(data,columns=['Region','Population','Area'])
df
Out[3]:
In [4]:
# Display columns names
df.columns
Out[4]:
In [5]:
# Display all values
df.values
Out[5]:
In [6]:
# Get the number of rows and columns of the dataframe i.e. its shape
df.shape
Out[6]:
In [7]:
# Size of DataFrame = row x column
df.size
Out[7]:
In [8]:
# Number of rows
len(df)
Out[8]:
In [11]:
# Programming specific information of the dataframe
df.info()
In [9]:
# Statistical description of numerical columns
df.describe()
Out[9]:
In [12]:
# Rename a column label
df = df.rename(columns={'Population':'Pop'})
df
Out[12]:
In [13]:
# Select a single column to series
A = df['Area'] # same answer as df.Area
A
Out[13]:
In [14]:
# Select a single column to dataframe
B = df[['Area']]
B
Out[14]:
In [15]:
# Select multiple columns to dataframe
C = df[['Area','Pop']]
C
Out[15]:
In [16]:
# Change order of columns
D = df[['Region','Area','Pop']]
D
Out[16]:
In [17]:
# Drop a column by label
E = df.drop('Area',axis=1)
E
Out[17]:
In [21]:
# Create a new column 'Density' = 'Population'/'Area'
df['Density'] = df['Pop']/df['Area']
df
Out[21]:
In [22]:
# Sort values
df.sort_values(by=['Pop'], ascending=False)
Out[22]:
In [23]:
# Find index label for max/min values
df['Density'].idxmax()
Out[23]:
Exercise: Which region has the highest density? Can you get the answer without sorting?
In [23]:
# method 1:
df['Region'][df['Density'].idxmax()]
Out[23]:
In [24]:
# method 2: Change the index
df1 = df.set_index('Region')
df1['Density'].idxmax()
Out[24]:
There are many commonly used column-wide methods/attributes:
In [25]:
# Get all numerical summaries of a column
df['Pop'].describe()
Out[25]:
In [26]:
# Select multiple rows
df[2:4]
Out[26]:
In [27]:
# Select the last row
df[-1:]
Out[27]:
In [28]:
# Select all but last row
df[:-1]
Out[28]:
In [29]:
# Select all even rows
df[::2]
Out[29]:
In [30]:
# Select by .iloc
df.iloc[0:2,1:3]
Out[30]:
In [31]:
# Select by .loc
df.loc[0:1,['Pop','Area']]
Out[31]:
In [24]:
# Boolean masking
df['Pop']>800000
Out[24]:
In [25]:
# Select rows by boolean masking
df[df['Pop']>800000]
Out[25]:
In [26]:
# Boolean masking with ==
df['Region']=='Central'
Out[26]:
In [27]:
# Select rows by boolean masking
df[df['Region']=='Central']
Out[27]:
In [28]:
# Using .loc to find the Area of the Central region.
df.loc[df['Region']=='Central', 'Area']
Out[28]:
In [29]:
# Multiple conditions (and: &) (or: |)
(df['Pop'] < 800000) & (df['Density']<8000)
Out[29]:
In [30]:
# Select rows by multiple conditions
df[(df['Pop'] < 800000) & (df['Density']<8000)]
Out[30]:
In [31]:
# Using .query method
df.query("Pop < 800000 & Density < 8000")
Out[31]:
We can import data from a csv file by using pd.read_csv. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
We can import data from an Excel file by using pd.read_excel. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Source of data: https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table
In [32]:
# Import data from Excel file
og = pd.read_excel('OlympicGames.xlsx')
# Display first 5 rows of dataframe
og.head()
Out[32]:
Q1: How many rows and columns are there in the dataframe?
In [33]:
og.shape
Out[33]:
Q2: Are there any missing values in the dataframe?
In [34]:
#og.isnull().sum()
og.info()
Q3: Create a new column for total number of Olympic medals.
In [35]:
og['Total'] = og['Gold']+og['Silver']+og['Bronze']
og.head()
Out[35]:
Q4: Select the row where Country = Singapore.
In [36]:
og[og['Country']=='Singapore']
Out[36]:
Q5: Which country has won the highest number of Gold medals?
In [38]:
#og[og['Gold'] == og['Gold'].max()]['Country']
og['Country'][og['Gold'].idxmax()]
Out[38]:
Q6: How many countries participated in at least 25 Olympic games (100 years)? Return an integer.
In [40]:
len(og.query("Games >= 25").loc[:, "Country"])
Out[40]:
Q7: Which are the top 3 countries with highest total Olympic medals?
Challenge! Can you return a list of countries in one line of code?
In [48]:
og.sort_values(by='Total', ascending=False)[0:3]['Country'].values
Out[48]:
Q8: Out of the countries which have not won any Gold medals, which country has won the highest number of medals?
Challenge! Can you return the name of the country in one line of code?
In [42]:
og['Country'][og[og['Gold']==0]['Total'].idxmax()]
og.Country[og.query("Gold == 0").Total.idxmax()]
Out[42]:
In [ ]: