Processing the time series census data extracted from the Australian Bureau of Statistics Datapack 2011
2011 Datapacks BCP_IP_TSP_PEP_ECP_WPP_Release 3/2011 Time Series Profile Release 3/Long Descriptor/SA3
The Time Series Profile (TSP) consists of 34 tables containing key Census characteristics of persons, families and dwellings. Second release tables will contain Labour Force classifications which include occupation, industry and qualifications.
The Time Series Profile presents data from three Censuses, based on the geographical boundaries for the most recent of the three. Comparing data from different Time Series Profiles is not valid, because geographical boundaries are subject to change between Censuses. Within a Time Series Profile, data for the two previous Censuses is concorded to the geographical boundaries for the most recent one.
The 2011 Time Series Profile compares data from the 2001, 2006 and 2011 Censuses where the data classifications are comparable. If a data classification has been revised between Censuses, data will be output on the classification that has been used in the 2011 Census. Footnotes explain the correspondence between the data classifications of previous Censuses and the 2011 classification. The data are based on place of enumeration (where people were counted on Census night).
The change in the geographical classification standards for the 2011 Census has a slight effect on Time Series Profiles. For example, the Statistical Local Area (SLA) used in previous Censuses is no longer available as a defined region in the new standard. As a transitional arrangement, and only for the 2011 Census, Time Series Profiles have been released at the SLA level.
Statistical Areas Level 3 (SA3s) provide a standardised regional breakup of Australia. The aim of SA3s is to create a standard framework for the analysis of ABS data at the regional level through clustering groups of SA2s that have similar regional characteristics. SA3s are built from whole SA2s and in general have populations between 30,000 and 130,000 persons. They are often the functional areas of regional cities and large urban transport and service hubs.
Setting up the environment (do not use pylab)
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as sps
import os
pd.set_option('display.width', 4000)
%matplotlib inline
In [ ]:
data_dir = '../census-australia-abs2011-sa3/'
readme_file = 'README.md'
to_ignore = '2011Census_geog_desc_1st_and_2nd_release.csv' # This contains the locations
file_list = os.listdir(data_dir)
file_list.remove(readme_file)
file_list.remove(to_ignore)
# Read the first file to get region_id
temp = pd.read_csv(data_dir + file_list[0])
region_id = temp['region_id']
# Read all files and join using region_id
data = pd.DataFrame(region_id, columns=['region_id'])
for sheet in file_list:
    cur_data = pd.read_csv(data_dir + sheet)
    data = data.merge(cur_data, how='outer', on='region_id')
# Remove duplicate columns
data = data.T.groupby(level=0).first().T
print('Number of regions: %d' % len(data.index))
print('Number of features: %d' % len(data.columns))
print('Some column headers...')
print(data.columns.values)
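The one-liner used above to drop duplicate columns is terse, so here is a minimal sketch of what it does on a toy frame (the column names are illustrative, not from the Datapack). Transposing turns columns into rows, grouping by the former column name keeps the first occurrence, and transposing back restores the shape. Note that the groupby sorts the surviving columns alphabetically.

```python
import pandas as pd

# Toy frame with a duplicated column name, as can happen when merging
# files that share columns besides region_id (names are illustrative).
toy = pd.DataFrame([[1, 10, 10], [2, 20, 20]],
                   columns=['region_id', 'Total', 'Total'])

# Transpose, group the rows (former columns) by name, keep the first
# occurrence of each, and transpose back: one 'Total' column survives.
deduped = toy.T.groupby(level=0).first().T
print(list(deduped.columns))
```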
In [ ]:
year = '2011'
data_2011 = data.filter(like=year)
print('Shape of 2011 data')
print(data_2011.shape)
dataset = pd.concat([data_2011.filter(like='Median'), data_2011.filter(like='Average')], axis=1)
print(dataset.columns)
print(dataset.shape)
dataset.to_csv('census_abs2011_summary.csv', index=False)
Most of the detailed data comes in the form of counts in various bins. We take a peek at the columns whose names contain Rent_Total and construct a list of column names from which to extract the counts.
In [ ]:
data_rent = data_2011.filter(like='Rent_Total')
print(data_rent.columns)
In [ ]:
rent_steps = [1, 200, 300, 400, 600, 800, 1000, 1250, 1500, 2000, 2500, 3000]
cols = []
for ix in range(len(rent_steps)-1):
    cols.append('2011_Census_%d_%d_Rent_Total' % (rent_steps[ix], rent_steps[ix+1]-1))
cols.append('2011_Census_%d_or_more_Rent_Total' % rent_steps[-1])
print(cols)
data_2011[cols].iloc[42].plot(kind='bar')
In [ ]:
def _bins2median(steps, values):
    """Interpolate the median from counts binned by the lower edges in
    `steps`; the last bin is open-ended."""
    total = np.sum(values)
    if total == 0:
        return 0.0
    middle_idx = total/2.
    seen = 0
    for ix, val in enumerate(values):
        seen += val
        if seen > middle_idx:
            break
    if ix == len(steps) - 1:
        # The median falls in the open-ended top bin, which has no
        # upper edge to interpolate towards; return its lower edge.
        return float(steps[-1])
    # Bin ix spans [steps[ix], steps[ix+1]); interpolate within it.
    interp = (middle_idx - (seen - val))/float(val)
    median = steps[ix] + interp*(steps[ix+1] - steps[ix])
    return median
def bins2median(steps, columns):
    """Find the median of each row"""
    median = []
    for ix in range(columns.shape[0]):
        cur_row = np.array(columns.iloc[ix, :])
        median.append(_bins2median(steps, cur_row))
    return np.array(median)
df = pd.DataFrame(region_id, columns=['region_id'])
median_rent = bins2median(rent_steps, data[cols])
df['median_rent'] = median_rent
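To make the interpolation concrete, here is a self-contained sketch of the same idea on a toy histogram (the function below is a standalone re-statement, not the one defined above): with 10 observations in [0, 100) and 10 in [100, 200), the interpolated median lands exactly on the shared edge at 100.

```python
import numpy as np

def binned_median(steps, counts):
    """Linear-interpolation median for counts binned by the lower
    edges in `steps`; the last bin is open-ended."""
    total = np.sum(counts)
    if total == 0:
        return 0.0
    half = total / 2.0
    seen = 0
    for ix, c in enumerate(counts):
        seen += c
        if seen > half:
            break
    if ix == len(steps) - 1:
        # Median sits in the open-ended top bin: no upper edge.
        return float(steps[-1])
    frac = (half - (seen - c)) / float(c)
    return steps[ix] + frac * (steps[ix + 1] - steps[ix])

# 10 observations in [0, 100), 10 in [100, 200): median is 100.
print(binned_median([0, 100, 200], [10, 10, 0]))
```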
We compare our median estimate with the reported median weekly rent.
In [ ]:
y_week = np.array(data['Median_rent_weekly_Census_year_2011'])
width = 0.35 # width of the bars
plot_idx = np.arange(81,129)
fig = plt.figure(figsize=(16,5))
ax = fig.add_subplot(111)
ax.bar(plot_idx-width, y_week[plot_idx], width, color='r', label='weekly')
ax.bar(plot_idx, median_rent[plot_idx], width, color='b', label='est. median')
ax.legend()
ax.set_xticks(plot_idx)
ax.set_xticklabels(np.array(data['region_id'], dtype=int)[plot_idx], rotation=90)
ax.set_xlabel('Region ID')
ax.set_ylabel('AUD')
print("Pearson's linear correlation = %1.4f" % sps.pearsonr(y_week, median_rent)[0])
print("Spearman's rank correlation = %1.4f" % sps.spearmanr(y_week, median_rent)[0])
In [ ]:
cols = ['Total_persons_2011_Census_Males','Total_persons_2011_Census_Females']
df[cols] = data_2011[cols]
In [ ]:
df.head()
In [ ]:
age_steps = [0, 5, 15, 20, 25, 35, 45, 55, 65, 75, 85]
cols = []
for ix in range(len(age_steps)-1):
    cols.append('Age_group_%d_%d_years_2011_Census_Females' % (age_steps[ix], age_steps[ix+1]-1))
cols.append('Age_group_%d_years_and_over_2011_Census_Females' % age_steps[-1])
print(cols)
median_age_female = bins2median(age_steps, data[cols])
df['median_age_female'] = median_age_female
In [ ]:
cols = []
for ix in range(len(age_steps)-1):
    cols.append('Age_group_%d_%d_years_2011_Census_Males' % (age_steps[ix], age_steps[ix+1]-1))
cols.append('Age_group_%d_years_and_over_2011_Census_Males' % age_steps[-1])
print(cols)
median_age_male = bins2median(age_steps, data[cols])
df['median_age_male'] = median_age_male