Overview:
As data scientists, we have to work with "big" data of different types.
We've seen how 2D NumPy arrays let us compute on data much more efficiently; their one downside is that all elements must be of the same type.
This is where the Pandas package comes in. So what does Pandas offer?
The concept of "DataFrame" objects.
More specifically, they are tables,
with rows representing "observations" and
columns representing "variables".
Each row has a unique label, and the same goes for columns.
Columns can have different types.
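As a rough illustration of the points above, here is a minimal sketch of a DataFrame built by hand. The column names, row labels, and values are all made up for the example; they do not come from any dataset in these notes.

```python
import pandas as pd

# Hypothetical toy table: names and values are illustrative only.
df = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"],  # text column
        "score": [1.5, 2.0, 3.25],           # floating-point column
        "passed": [True, False, True],       # boolean column
    },
    index=["r1", "r2", "r3"],                # unique row labels
)

print(df.dtypes)         # each column keeps its own type
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
```

Unlike a 2D NumPy array, the three columns here hold three different types side by side.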
We typically don't make data frames manually.
We convert .csv (comma-separated values) files to data frames.
We do this by importing the pandas package: import pandas as pd. Again, pd is an "alias".
Now we can use a built-in function that comes packaged with pandas:
read_csv(<path to .csv file>)
Example:
We will use the pandas package to read the "brics dataset" into Python. Let's see what the DataFrame looks like:
In [7]:
# import the pandas package
import pandas as pd
# load in the dataset and save it to brics var.
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv")
brics
# we can make the table look better by adding the parameter index_col=0
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv", index_col=0)
brics  # notice that the default integer row index is replaced by the first column of the file.
Out[7]:
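To see the effect of index_col without depending on a file on disk, here is a small sketch that reads CSV text from memory. The CSV content below is a hypothetical stand-in and is not the actual contents of the BRICS file; io.StringIO is used only so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

# Without index_col, pandas adds a default integer index (0, 1, ...)
# and keeps the first column as ordinary data.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels.
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(labelled)
```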
One of the most effective uses of pandas is the ease with which we can select rows and columns in different ways. Here's how we do it:
To access columns, there are different ways to do it, including:
<data_set_var>["column-name"]
<data_set_var>.<column-name>
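The two access styles listed above can be compared on a small table. This is only a sketch: the variable name data and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a loaded dataset.
data = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

# Bracket access works for any column name.
by_bracket = data["country"]

# Attribute access works only when the name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method.
by_attribute = data.country

print(by_bracket.equals(by_attribute))
```

Bracket access is the safer default, since a column named, say, "count" or "my column" cannot be reached reliably with the dot form.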
We can add columns too, for example a ranking:
<data_set_var>["new-column-name"] = <list of values>
In [9]:
# Add a new column
brics["on_earth"] = [ True, True, True, True, True ]
# Print them
brics
Out[9]:
In [1]:
# Manipulating columns
"""Columns can be manipulated using arithmetic operations
on other columns"""
Out[1]:
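The cell above only describes the idea, so here is a minimal sketch of column arithmetic. The column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical numbers for illustration only.
df = pd.DataFrame(
    {"area": [10.0, 20.0, 40.0], "population": [50.0, 40.0, 20.0]},
    index=["a", "b", "c"],
)

# Element-wise arithmetic on two columns produces a new column.
df["density"] = df["population"] / df["area"]

print(df)
```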
Rows are accessed by label with loc. Syntax: dataframe.loc["row-name"]
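A short sketch of row access with loc, assuming a small hypothetical table (the labels and values are illustrative):

```python
import pandas as pd

# Hypothetical stand-in table; values are illustrative.
df = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India"],
        "capital": ["Brasilia", "Moscow", "New Delhi"],
    },
    index=["BR", "RU", "IN"],
)

row = df.loc["RU"]            # one label gives a pandas Series
rows = df.loc[["BR", "IN"]]   # a list of labels gives a pandas DataFrame

print(type(row).__name__)
print(type(rows).__name__)
```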
To get just one element of the table, we can specify both the row label and the column label in loc.
Syntax:
dataframe.loc["row-name", "column-name"]
dataframe["column-name"].loc["row-name"]
dataframe.loc["row-name"]["column-name"]
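The three forms can be checked against each other on a toy table. This is a sketch with hypothetical data, not the dataset used elsewhere in these notes.

```python
import pandas as pd

# Hypothetical stand-in table; values are illustrative.
df = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

a = df.loc["RU", "capital"]    # row and column label in one call
b = df["capital"].loc["RU"]    # pick the column first, then the row
c = df.loc["RU"]["capital"]    # pick the row first, then the column

print(a, b, c)
```

All three read the same cell. For assigning a value, the single-call form df.loc[row, column] is generally the one to prefer, because the chained forms may operate on a temporary copy.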
Objective:
Practice importing data into Python as a Pandas DataFrame.
Practice accessing rows and columns.
Preface:
The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data, where you can label the rows and the columns.
In the exercises that follow, you will be working with vehicle data in different countries. Each observation corresponds to a country, and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on. This data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.
To import CSV data into Python as a Pandas DataFrame, you can use read_csv().
Instructions:
To import CSV files, you still need the pandas package: import it as pd.
Use pd.read_csv() to import cars.csv data as a DataFrame. Store this dataframe as cars.
Print out cars. Does everything look OK?
In [ ]:
"""
# Import pandas as pd
import pandas as pd
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# Print out cars
print(cars)
"""
Preface:
We have a slight problem: the row labels were imported as another column that has no name.
To fix this, we are going to pass the argument index_col = 0 to read_csv(). It specifies which column in the CSV file should be used as the row labels.
Instructions:
Run the code with Submit Answer and confirm that the first column should actually be used as row labels.
Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels.
Has the printout of cars improved now?
In [3]:
"""
# Import pandas as pd
import pandas as pd
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv", index_col=0)
# Print out cars
print(cars)
"""
Out[3]:
Preface:
Selecting columns can be done in two ways:
variable_containing_CSV_file['column-name']
variable_containing_CSV_file[['column-name']]
The former gives a pandas Series, whereas the latter gives a pandas DataFrame.
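The difference is easy to verify by checking the types. The sketch below uses a hypothetical two-row miniature of the cars table (the name cars_demo and its values are assumptions, chosen so the example runs without cars.csv).

```python
import pandas as pd

# Hypothetical miniature of the cars table; values are illustrative.
cars_demo = pd.DataFrame(
    {"country": ["Japan", "Morocco"], "drives_right": [False, True]},
    index=["JAP", "MOR"],
)

as_series = cars_demo["country"]    # single brackets
as_frame = cars_demo[["country"]]   # double brackets: a list of column names

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The inner brackets in the second form are an ordinary Python list, which is why the same syntax extends to several columns at once.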
Instructions:
Use single square brackets to print out the country column of cars as a Pandas Series.
Use double square brackets to print out the country column of cars as a Pandas DataFrame. Do this by putting country in two square brackets this time.
In [ ]:
"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out country column as Pandas Series
print( cars['country'])
# Print out country column as Pandas DataFrame
print( cars[['country']])
"""
With loc we can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels.
Instructions:
Use loc to select the observation corresponding to Japan as a Series. The label of this row is JAP. Make sure to print the resulting Series.
Use loc to select the observations for Australia and Egypt as a DataFrame.
In [ ]:
"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out observation for Japan
print( cars.loc['JAP'] )
# Print out observations for Australia and Egypt
print( cars.loc[ ['AUS', 'EG'] ])
"""
loc also allows us to select both rows and columns from a DataFrame.
Instructions:
Print out the drives_right value of the row corresponding to Morocco (its row label is MOR).
Select the columns country and drives_right.
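A possible sketch of selecting rows and columns together with loc. Since cars.csv is not reproduced here, the table below is a hypothetical miniature (cars_demo and its values are assumptions), and the choice of rows in the second selection is only an example.

```python
import pandas as pd

# Hypothetical miniature of the cars table; values are illustrative.
cars_demo = pd.DataFrame(
    {
        "country": ["Japan", "Morocco", "Egypt"],
        "drives_right": [False, True, True],
    },
    index=["JAP", "MOR", "EG"],
)

# One row label and one column label give a single value.
value = cars_demo.loc["MOR", "drives_right"]

# Lists of row labels and column labels give a sub-DataFrame.
sub = cars_demo.loc[["JAP", "MOR"], ["country", "drives_right"]]

print(value)
print(sub)
```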