Data Wrangling with Pandas

Processing the time series census data extracted from the Australian Bureau of Statistics Datapack 2011

Data source: 2011 Datapacks BCP_IP_TSP_PEP_ECP_WPP_Release 3/2011 Time Series Profile Release 3/Long Descriptor/SA3

Time series profile

The Time Series Profile (TSP) consists of 34 tables containing key Census characteristics of persons, families and dwellings. Second-release tables will contain Labour Force classifications, which include occupation, industry and qualifications.

The Time Series Profile presents data from three Censuses, based on the geographical boundaries of the most recent of the three. Within a Time Series Profile, data for the two earlier Censuses are concorded to the geographical boundaries of the most recent one. Comparing data across different Time Series Profiles is not valid, because geographical boundaries are subject to change between Censuses.

The 2011 Time Series Profile compares data from the 2001, 2006 and 2011 Censuses where the data classifications are comparable. If a data classification was revised between Censuses, data are output using the classification from the 2011 Census, with footnotes explaining the correspondence between the earlier classifications and the 2011 one. The data are based on place of enumeration (where people were counted on Census night).

The change in the geographical classification standards for the 2011 Census has a slight effect on Time Series Profiles. For example, the Statistical Local Area (SLA) used in previous Censuses is no longer available as a defined region in the new standard. As a transitional arrangement, and only for the 2011 Census, Time Series Profiles have been released at the SLA level.

SA3

Statistical Areas Level 3 (SA3s) provide a standardised regional breakup of Australia. The aim of SA3s is to create a standard framework for the analysis of ABS data at the regional level through clustering groups of SA2s that have similar regional characteristics. SA3s are built from whole SA2s and in general have populations between 30,000 and 130,000 persons. They are often the functional areas of regional cities and large urban transport and service hubs.

Setting up the environment (avoiding pylab in favour of explicit imports)


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as sps
import os

pd.set_option('display.width', 4000)
%matplotlib inline

Load the raw data


In [ ]:
data_dir = '../census-australia-abs2011-sa3/'
readme_file = 'README.md'
to_ignore = '2011Census_geog_desc_1st_and_2nd_release.csv' # geographic descriptors (the region locations)
file_list = os.listdir(data_dir)
file_list.remove(readme_file)
file_list.remove(to_ignore)

# Read the first file to get region_id
temp = pd.read_csv(data_dir + file_list[0])
region_id = temp['region_id']

# Read all files and join using region_id
data = pd.DataFrame(region_id, columns=['region_id'])
for sheet in file_list:
    cur_data = pd.read_csv(data_dir + sheet)
    data = data.merge(cur_data, how='outer', on='region_id')
    
# Remove any duplicate columns, keeping the first occurrence
data = data.loc[:, ~data.columns.duplicated()]

print('Number of regions: %d' % len(data.index))
print('Number of features: %d' % len(data.columns))
print('Some column headers...')
print(data.columns.values)
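
Before going further, it is worth checking that the outer merge behaved as expected. A quick sanity check (assuming every file covers the same set of SA3 regions, so the outer join should not introduce missing values):


In [ ]:
# region_id should uniquely identify each SA3 region
assert data['region_id'].is_unique
# An outer merge over mismatched region sets would introduce NaNs
print('Missing cells after merging: %d' % data.isnull().sum().sum())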

Investigate data from 2011

Find the features corresponding to the median or average values (summary statistics). Export this to a CSV file.


In [ ]:
year = '2011'
data_2011 = data.filter(like=year)
print('Shape of 2011 data')
print(data_2011.shape)

dataset = pd.concat([data_2011.filter(like='Median'), data_2011.filter(like='Average')], axis=1)
print(dataset.columns)
print(dataset.shape)
dataset.to_csv('census_abs2011_summary.csv', index=False)
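
As a sanity check, the exported file can be read back to confirm the round trip preserved the shape (this is purely a verification step):


In [ ]:
check = pd.read_csv('census_abs2011_summary.csv')
print(check.shape)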

Most of the detailed data comes in the form of counts in various bins. We take a peek at the columns whose names contain the string Rent_Total, and construct the list of column names needed to extract the counts in bin order.


In [ ]:
data_rent = data_2011.filter(like='Rent_Total')
print(data_rent.columns)
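
The rent bin edges used below are hardcoded from the column names printed above. They could equally be recovered programmatically; a sketch, assuming every bounded bin follows the 2011_Census_<low>_<high>_Rent_Total naming pattern (the open-ended 'or_more' bin would still need its lower edge appended):


In [ ]:
import re

# Pull the lower edge of every bounded rent bin out of the column names
pattern = re.compile(r'2011_Census_(\d+)_(\d+)_Rent_Total')
lower_edges = sorted(int(m.group(1))
                     for m in (pattern.match(c) for c in data_rent.columns) if m)
print(lower_edges)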

In [ ]:
rent_steps = [1, 200, 300, 400, 600, 800, 1000, 1250, 1500, 2000, 2500, 3000]
cols = []
for ix in range(len(rent_steps)-1):
    cols.append('2011_Census_%d_%d_Rent_Total' % (rent_steps[ix], rent_steps[ix+1]-1))
cols.append('2011_Census_%d_or_more_Rent_Total' % rent_steps[-1])
print(cols)
data_2011[cols].iloc[42].plot(kind='bar')  # rent distribution for one example region

Finding a single scalar to represent the buckets of counts

For many (if not most) machine learning methods it is convenient to have a single scalar per feature. The census data provides finer-grained information, and we would like to summarise it. Note that the bucket sizes are not uniform, so below we estimate the median by interpolating linearly within the bucket that contains the midpoint of the counts.


In [ ]:
def _bins2median(steps, values):
    """Estimate the median from a histogram of counts.

    steps gives the lower edge of each bucket; the last bucket is open-ended.
    """
    total = np.sum(values)
    if total == 0:
        return 0.0
    middle_idx = total/2.
    seen = 0
    for ix, val in enumerate(values):
        seen += val
        if seen > middle_idx:
            break
    if ix == len(steps) - 1:
        # The median falls in the open-ended top bucket; its lower edge
        # is the best estimate available.
        return float(steps[ix])
    # Interpolate linearly within bucket ix, which spans [steps[ix], steps[ix+1])
    interp = (middle_idx - (seen - val))/float(val)
    return steps[ix] + interp*(steps[ix+1] - steps[ix])

def bins2median(steps, columns):
    """Estimate the median for each row of bucketed counts"""
    median = []
    for ix in range(columns.shape[0]):
        cur_row = np.asarray(columns.iloc[ix])
        median.append(_bins2median(steps, cur_row))
    return np.array(median)

df = pd.DataFrame(region_id, columns=['region_id'])
median_rent = bins2median(rent_steps, data[cols])
df['median_rent'] = median_rent
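
As a quick check of the interpolation on synthetic data: if all ten observations sit in the bucket spanning [10, 20), the estimated median should fall at the bucket midpoint:


In [ ]:
# Toy example: a single occupied bucket [10, 20) should give a median of 15
print(_bins2median([0, 10, 20], [0, 10, 0]))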

We compare our median estimate with the reported median weekly rent.


In [ ]:
y_week = np.array(data['Median_rent_weekly_Census_year_2011'])
width = 0.35 # width of the bars
plot_idx = np.arange(81, 129)  # a window of regions, to keep the chart readable
fig = plt.figure(figsize=(16,5))
ax = fig.add_subplot(111)
ax.bar(plot_idx-width, y_week[plot_idx], width, color='r', label='weekly')
ax.bar(plot_idx, median_rent[plot_idx], width, color='b', label='est. median')
ax.legend()
ax.set_xticks(plot_idx)
ax.set_xticklabels(np.array(data['region_id'], dtype=int)[plot_idx], rotation=90)
ax.set_xlabel('Region ID')
ax.set_ylabel('AUD')

print("Pearson's linear correlation = %1.4f" % sps.pearsonr(y_week, median_rent)[0])
print("Spearman's rank correlation  = %1.4f" % sps.spearmanr(y_week, median_rent)[0])

Other examples of feature construction


In [ ]:
cols = ['Total_persons_2011_Census_Males','Total_persons_2011_Census_Females']
df[cols] = data_2011[cols]

In [ ]:
df.head()

In [ ]:
age_steps = [0, 5, 15, 20, 25, 35, 45, 55, 65, 75, 85]
cols = []
for ix in range(len(age_steps)-1):
    cols.append('Age_group_%d_%d_years_2011_Census_Females' % (age_steps[ix], age_steps[ix+1]-1))
cols.append('Age_group_%d_years_and_over_2011_Census_Females' % age_steps[-1])
print(cols)
median_age_female = bins2median(age_steps, data[cols])
df['median_age_female'] = median_age_female

In [ ]:
cols = []
for ix in range(len(age_steps)-1):
    cols.append('Age_group_%d_%d_years_2011_Census_Males' % (age_steps[ix], age_steps[ix+1]-1))
cols.append('Age_group_%d_years_and_over_2011_Census_Males' % age_steps[-1])
print(cols)
median_age_male = bins2median(age_steps, data[cols])
df['median_age_male'] = median_age_male
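
One more simple derived feature: the proportion of females in each region, computed from the total-persons counts already copied into df (any ratio of counts can be constructed the same way):


In [ ]:
totals = df['Total_persons_2011_Census_Males'] + df['Total_persons_2011_Census_Females']
df['prop_female'] = df['Total_persons_2011_Census_Females'] / totals
df.head()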
