Module 6: Data types and tidy data

Tidy data

Let's do some tidy exercise first. This is one of the non-tidy dataset assembled by Hadley Wickham (check out here for more datasets, explanation, and R code).

Let's take a look at this small dataset: https://raw.githubusercontent.com/tidyverse/tidyr/master/vignettes/pew.csv


In [2]:
import pandas as pd

In [3]:
pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/4c0a8d0fdb9372302fcc57ad995d57a43d9e4337/vignettes/pew.csv')
pew_df


Out[3]:
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused
0 Agnostic 27 34 60 81 76 137 122 109 84 96
1 Atheist 12 27 37 52 35 70 73 59 74 76
2 Buddhist 27 21 30 34 33 58 62 39 53 54
3 Catholic 418 617 732 670 638 1116 949 792 633 1489
4 Don’t know/refused 15 14 15 11 10 35 21 17 18 116
5 Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529
6 Hindu 1 9 7 9 11 34 47 48 54 37
7 Historically Black Prot 228 244 236 238 197 223 131 81 78 339
8 Jehovah's Witness 20 27 24 24 21 30 15 11 6 37
9 Jewish 19 19 25 25 30 95 69 87 151 162
10 Mainline Prot 289 495 619 655 651 1107 939 753 634 1328
11 Mormon 29 40 48 51 56 112 85 49 42 69
12 Muslim 6 7 9 10 9 23 16 8 6 22
13 Orthodox 13 17 23 32 32 47 38 42 46 73
14 Other Christian 9 7 11 13 13 14 18 14 12 18
15 Other Faiths 20 33 40 46 49 63 46 40 41 71
16 Other World Religions 5 2 3 4 2 7 3 4 4 8
17 Unaffiliated 217 299 374 365 341 528 407 321 258 597

This dataset is about the relationships between income and religion, assembled from a research by the Pew Research Center. You can read more details here. Is this dataset tidy or not? Why?

Yes, many of the columns are values, not variable names. How should we fix it?

Pandas provides a convenient function called melt. You specify the id_vars that are variable columns, and value_vars that are value columns, and provide the name for the variable as well as the name for the values.

Q: so please go ahead and tidy it up! I'd suggest to use the variable name "income" and value name "frequency"


In [4]:
# TODO: put your code here

If you were successful, you'll have something like this:


In [5]:
pew_tidy_df.sample(10)


Out[5]:
religion income frequency
58 Don’t know/refused $30-40k 11
9 Jewish <$10k 19
111 Catholic $75-100k 949
165 Catholic Don't know/refused 1489
66 Muslim $30-40k 10
84 Muslim $40-50k 9
99 Jewish $50-75k 95
139 Orthodox $100-150k 42
7 Historically Black Prot <$10k 228
163 Atheist Don't know/refused 76

Data types

Let's talk about data types briefly. Understanding data types is not only important for choosing the right visualizations, but also important for efficient computing and storage of data. You may not have thought about how pandas represent data in memory. A Pandas Dataframe is essentially a bunch of Series, and those Series are essentially numpy arrays. An array may contain a fixed-length items such as integers or variable length items such as strings. Putting some efforts to think about the correct data type can potentially save a lot of memory as well as time.

A nice example would be the categorical data type. If you have a variable that only has several possible values, it's essentially a categorical data. Take a look at the income variable.


In [6]:
pew_tidy_df.income.value_counts()


Out[6]:
$30-40k               18
Don't know/refused    18
$100-150k             18
$20-30k               18
$50-75k               18
<$10k                 18
$10-20k               18
$75-100k              18
>150k                 18
$40-50k               18
Name: income, dtype: int64

These were the column names in the original non-tidy data. The value can take only one of these income ranges and thus it is a categorical data. What is the data type that pandas use to store this column?


In [7]:
pew_tidy_df.income.dtype


Out[7]:
dtype('O')

The O means that it is an object data type, which does not have a fixed size like integer or float. The series contains a sort of pointer to the actual text objects. You can actually inspect the amount of memory used by the dataset.


In [8]:
pew_tidy_df.memory_usage()


Out[8]:
Index         128
religion     1440
income       1440
frequency    1440
dtype: int64

In [9]:
pew_tidy_df.memory_usage(deep=True)


Out[9]:
Index          128
religion     12780
income       11700
frequency     1440
dtype: int64

What's going on with the deep=True option? When you don't specify deep=True, the memory usage method just tells you the amount of memory used by the numpy arrays in the pandas dataframe. When you pass deep=True, it tells you the total amount of memory by including the memory used by all the text objects. So, the religion and income columns occupies almost ten times of memory than the frequency column, which is simply an array of integers.


In [10]:
pew_tidy_df.frequency.dtype


Out[10]:
dtype('int64')

Is there any way to save up the memory? Note that there are only 10 categories in the income variable. That means we just need 10 numbers to represent the categories! Of course we need to store the names of each category, but that's just one-time cost. The simplest way to convert a column is using astype method.


In [12]:
income_categorical_series = pew_tidy_df.income.astype('category')
# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')

Now, this series has the CategoricalDtype dtype.


In [13]:
income_categorical_series.dtype


Out[13]:
CategoricalDtype(categories=['$10-20k', '$100-150k', '$20-30k', '$30-40k', '$40-50k',
                  '$50-75k', '$75-100k', '<$10k', '>150k',
                  'Don't know/refused'],
                 ordered=False)

How much memory do we use?


In [14]:
income_categorical_series.memory_usage(deep=True)


Out[14]:
1278

In [15]:
pew_tidy_df.income.memory_usage(deep=True)


Out[15]:
11828

We have reduced the memory usage by almost 10 fold! Not only that, because now the values are just numbers, it will be much faster to match, filter, manipulate. If your dataset is huge, this can save up a lot of space and time.

If the categories have ordering, you can specify the ordering too.


In [16]:
from pandas.api.types import CategoricalDtype
income_type = CategoricalDtype(categories=["Don't know/refused", '<$10k', '$10-20k', '$20-30k', '$30-40k', 
                                           '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)
income_type


Out[16]:
CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',
                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',
                  '>150k'],
                 ordered=True)

In [17]:
pew_tidy_df.income.astype(income_type).dtype


Out[17]:
CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',
                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',
                  '>150k'],
                 ordered=True)

This data type now allows you to compare and sort based on the ordering.

Q: ok, now convert both religion and income columns of pew_tidy_df as categorical dtype (in place) and show that pew_tidy_df now uses much less memory


In [18]:
# TODO: put your code here


Out[18]:
Index         128
religion     2119
income       1150
frequency    1440
dtype: int64