Let's do a tidying exercise first. This is one of the non-tidy datasets assembled by Hadley Wickham (check here for more datasets, explanations, and R code).
Let's take a look at this small dataset: https://raw.githubusercontent.com/tidyverse/tidyr/master/vignettes/pew.csv
In [2]:
import pandas as pd
In [3]:
pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/4c0a8d0fdb9372302fcc57ad995d57a43d9e4337/vignettes/pew.csv')
pew_df
Out[3]:
This dataset is about the relationship between income and religion, assembled from research by the Pew Research Center. You can read more details here. Is this dataset tidy or not? Why?
It is not tidy: many of the columns are values, not variable names. How should we fix it?
Pandas provides a convenient function called melt. You specify id_vars, the identifier columns to keep as-is, and value_vars, the value columns to unpivot; you can also provide a name for the new variable column as well as a name for the values column.
Q: so please go ahead and tidy it up! I'd suggest using the variable name "income" and value name "frequency".
In [4]:
# TODO: put your code here
If you were successful, you'll have something like this:
In [5]:
pew_tidy_df.sample(10)
Out[5]:
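If you get stuck, here is a minimal sketch of how melt works on a toy table. The values below are made up; only the column names mirror the Pew data.

```python
import pandas as pd

# Toy table shaped like the Pew data: one row per religion,
# one column per income bracket (values here are invented)
wide = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# id_vars stay as identifier columns; every remaining column is unpivoted
# into (variable, value) pairs named by var_name and value_name
tidy = wide.melt(id_vars=['religion'], var_name='income', value_name='frequency')
```

Because value_vars is omitted, melt unpivots all columns not listed in id_vars, which is exactly what we want here.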
Let's talk about data types briefly. Understanding data types is important not only for choosing the right visualizations, but also for efficient computation and storage of data. You may not have thought about how pandas represents data in memory. A pandas DataFrame is essentially a bunch of Series, and those Series are essentially numpy arrays. An array may contain fixed-length items such as integers or variable-length items such as strings. Putting some effort into thinking about the correct data type can potentially save a lot of memory as well as time.

A nice example is the categorical data type. If a variable can take only a few possible values, it is essentially categorical data. Take a look at the income variable.
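To see the numpy array sitting behind a Series, you can peek at it directly (a quick illustration, not part of the original exercise):

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3])
arr = s.to_numpy()   # the underlying fixed-size integer array
print(type(arr), arr.dtype)
```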
In [6]:
pew_tidy_df.income.value_counts()
Out[6]:
These were the column names in the original non-tidy data. The value can take only one of these income ranges, and thus it is categorical data. What data type does pandas use to store this column?
In [7]:
pew_tidy_df.income.dtype
Out[7]:
The O means that it is an object data type, which does not have a fixed size like an integer or a float. The series contains pointers to the actual text objects. You can actually inspect the amount of memory used by the dataset.
In [8]:
pew_tidy_df.memory_usage()
Out[8]:
In [9]:
pew_tidy_df.memory_usage(deep=True)
Out[9]:
What's going on with the deep=True option? Without deep=True, the memory_usage method reports only the memory used by the numpy arrays in the dataframe. With deep=True, it reports the total, including the memory used by all the text objects. So the religion and income columns occupy almost ten times as much memory as the frequency column, which is simply an array of integers.
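The difference is easy to reproduce on a toy dataframe (made-up data, just to show the shallow vs. deep accounting):

```python
import pandas as pd

df = pd.DataFrame({
    'label': ['a fairly long string'] * 1000,  # object column: array of pointers
    'value': range(1000),                      # int64 column: fixed-size numbers
})

shallow = df.memory_usage()        # counts only the numpy arrays (the pointers)
deep = df.memory_usage(deep=True)  # also counts the Python string objects
print(shallow['label'], deep['label'])
```

For the integer column the two numbers agree, since there are no objects beyond the array itself.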
In [10]:
pew_tidy_df.frequency.dtype
Out[10]:
Is there any way to save memory? Note that there are only 10 categories in the income variable. That means we need just 10 numbers to represent the categories! Of course we also need to store the name of each category, but that is a one-time cost. The simplest way to convert a column is the astype method.
In [12]:
income_categorical_series = pew_tidy_df.income.astype('category')
# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')
Now this series has the CategoricalDtype dtype.
In [13]:
income_categorical_series.dtype
Out[13]:
How much memory do we use?
In [14]:
income_categorical_series.memory_usage(deep=True)
Out[14]:
In [15]:
pew_tidy_df.income.memory_usage(deep=True)
Out[15]:
We have reduced the memory usage almost tenfold! Not only that: because the values are now just numbers, they are much faster to match, filter, and manipulate. If your dataset is huge, this can save a lot of space and time.
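The same saving shows up on any repetitive string column; here is a small self-contained demonstration with made-up labels:

```python
import pandas as pd

# 30,000 strings drawn from only 3 distinct values
s_obj = pd.Series(['low', 'medium', 'high'] * 10_000)  # object dtype
s_cat = s_obj.astype('category')                       # 3 category names + small integer codes

print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```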
If the categories have ordering, you can specify the ordering too.
In [16]:
from pandas.api.types import CategoricalDtype
income_type = CategoricalDtype(categories=["Don't know/refused", '<$10k', '$10-20k', '$20-30k', '$30-40k',
'$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)
income_type
Out[16]:
In [17]:
pew_tidy_df.income.astype(income_type).dtype
Out[17]:
This data type now allows you to compare and sort based on the ordering.
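For instance, with a hypothetical ordered size variable, sorting follows the declared category order rather than alphabetical order, and element-wise comparisons work:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
s = pd.Series(['large', 'small', 'medium'], dtype=size_type)

print(s.sort_values().tolist())  # sorted by category order, not alphabetically
print((s > 'small').tolist())    # comparisons use the ordering
```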
Q: ok, now convert both the religion and income columns of pew_tidy_df to the categorical dtype (in place) and show that pew_tidy_df now uses much less memory.
In [18]:
# TODO: put your code here
Out[18]:
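If you want to check your approach, here is a sketch of the conversion on a toy stand-in dataframe (the values are invented; only the column names match the exercise):

```python
import pandas as pd

# Toy stand-in for pew_tidy_df with the same column names
pew_tidy_df = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'] * 500,
    'income': ['<$10k', '$10-20k'] * 500,
    'frequency': range(1000),
})

before = pew_tidy_df.memory_usage(deep=True).sum()
for col in ['religion', 'income']:
    pew_tidy_df[col] = pew_tidy_df[col].astype('category')
after = pew_tidy_df.memory_usage(deep=True).sum()

print(before, after)
```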