Wrangling data in Tables

David Culler

This notebook illustrates several of the datascience tables methods for wrangling typical data. Here we use some simple census data. Even that is pretty obscure. We illustrate the process of going from raw data to a distilled form and then answer a simple question: "How does the relative difference of males and females vary with age?" The answer, there are bit more boys, but a LOT less old men.


In [1]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Construct a raw table from a csv on the web


In [2]:
census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
raw_census = Table.read_table(census_url)
raw_census


Out[2]:
SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014
0 0 3944153 3944160 3951330 3963071 3926665 3945610 3948350
0 1 3978070 3978090 3957888 3966510 3978006 3943077 3962123
0 2 4096929 4096939 4090862 3971573 3979952 3992690 3957772
0 3 4119040 4119051 4111920 4102501 3983049 3992425 4005190
0 4 4063170 4063186 4077552 4122303 4112638 3994047 4003448
0 5 4056858 4056872 4064653 4087713 4132210 4123408 4004858
0 6 4066381 4066412 4073013 4074979 4097780 4143094 4134352
0 7 4030579 4030594 4043047 4083240 4084964 4108615 4154000
0 8 4046486 4046497 4025604 4053206 4093213 4095827 4119524
0 9 4148353 4148369 4125415 4035769 4063193 4104133 4106832

... (296 rows omitted)

Stage 1: cleaning up columns and encodings

As is often the case, the data you find is pretty ugly.


In [3]:
# A simple tool to decode an encoding
def categorize_sex(s):
   return ['all','male','female'][s]

In [4]:
# Keep only what we need
pre_census = raw_census.select(['SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2014'])
# Clean up the column names
pre_census.relabel('CENSUS2010POP','2010pop')
pre_census.relabel('POPESTIMATE2014','2014est')
# Decode the categories
pre_census['CAT'] = pre_census.apply(categorize_sex, 'SEX')
# Create a new clean table getting rid of what we can
p2_census = pre_census.drop('SEX')
p2_census.move_to_start('CAT')
p2_census


Out[4]:
CAT AGE 2010pop 2014est
all 0 3944153 3948350
all 1 3978070 3962123
all 2 4096929 3957772
all 3 4119040 4005190
all 4 4063170 4003448
all 5 4056858 4004858
all 6 4066381 4134352
all 7 4030579 4154000
all 8 4046486 4119524
all 9 4148353 4106832

... (296 rows omitted)


In [5]:
p2_census.show()


CAT AGE 2010pop 2014est
all 0 3944153 3948350
all 1 3978070 3962123
all 2 4096929 3957772
all 3 4119040 4005190
all 4 4063170 4003448
all 5 4056858 4004858
all 6 4066381 4134352
all 7 4030579 4154000
all 8 4046486 4119524
all 9 4148353 4106832
all 10 4172541 4115254
all 11 4114415 4085636
all 12 4106243 4068790
all 13 4118013 4169346
all 14 4165982 4232480
all 15 4242820 4163723
all 16 4316139 4167234
all 17 4395295 4184706
all 18 4500855 4225590
all 19 4585234 4326394
all 20 4519129 4418887
all 21 4354294 4489746
all 22 4264642 4612322
all 23 4198571 4698584
all 24 4249363 4692635
all 25 4262350 4509594
all 26 4152305 4406472
all 27 4248869 4332545
all 28 4215249 4351851
all 29 4223076 4387476
all 30 4285668 4250780
all 31 3970218 4318706
all 32 3986847 4320072
all 33 3880150 4276113
all 34 3839216 4362895
all 35 3956434 4094039
all 36 3802087 4014413
all 37 3934445 3974190
all 38 4121880 3859181
all 39 4364796 3979827
all 40 4383274 3856888
all 41 4114985 3917413
all 42 4076104 4095929
all 43 4105105 4332491
all 44 4211496 4388762
all 45 4508868 4161450
all 46 4519761 4076237
all 47 4535265 4082150
all 48 4538796 4158996
all 49 4605901 4409209
all 50 4660295 4491431
all 51 4464631 4488666
all 52 4500846 4480175
all 53 4380354 4535748
all 54 4291999 4574789
all 55 4254709 4421625
all 56 4037513 4395471
all 57 3936386 4347179
all 58 3794928 4191757
all 59 3641269 4155417
all 60 3621131 3985367
all 61 3492596 3834428
all 62 3563182 3685822
all 63 3483884 3572379
all 64 2657131 3488136
all 65 2680761 3384449
all 66 2639141 3347776
all 67 2649365 3485502
all 68 2323672 2572527
all 69 2142324 2535012
all 70 2043121 2466485
all 71 1949323 2519748
all 72 1864275 2194234
all 73 1736960 2002191
all 74 1684487 1890366
all 75 1620077 1774697
all 76 1471070 1694648
all 77 1455330 1557239
all 78 1400123 1481505
all 79 1371195 1414235
all 80 1308511 1264072
all 81 1212865 1216009
all 82 1161421 1153634
all 83 1074809 1090465
all 84 985721 1036186
all 85 914723 924969
all 86 814211 855696
all 87 712908 770685
all 88 640619 675487
all 89 537998 599834
all 90 435563 512845
all 91 344987 426946
all 92 281389 354532
all 93 216978 286333
all 94 169449 218655
all 95 129717 157124
all 96 95223 121184
all 97 68138 83731
all 98 45900 60185
all 99 32266 41828
all 100 53364 72197
all 999 308745538 318857056
male 0 2014276 2017857
male 1 2030853 2023253
male 2 2092198 2022502
male 3 2104550 2048618
male 4 2077550 2043498
male 5 2072094 2043467
male 6 2075319 2110328
male 7 2057076 2122240
male 8 2065453 2105122
male 9 2119696 2097272
male 10 2135996 2099874
male 11 2103264 2083687
male 12 2100145 2075243
male 13 2104914 2128187
male 14 2135543 2164228
male 15 2177022 2128424
male 16 2216034 2130938
male 17 2263153 2138809
male 18 2305473 2165062
male 19 2341984 2220790
male 20 2308319 2269570
male 21 2223198 2310734
male 22 2177797 2367842
male 23 2140799 2399883
male 24 2164063 2391398
male 25 2161308 2295836
male 26 2097088 2240026
male 27 2140651 2200429
male 28 2118605 2207185
male 29 2117939 2217913
male 30 2160802 2140108
male 31 1988155 2169725
male 32 1994476 2165840
male 33 1936863 2140697
male 34 1916204 2192455
male 35 1980916 2047326
male 36 1890595 2004990
male 37 1953386 1979035
male 38 2049720 1922571
male 39 2167405 1985914
male 40 2191249 1916261
male 41 2047818 1939916
male 42 2028653 2031325
male 43 2035990 2147656
male 44 2090267 2184235
male 45 2237450 2067223
male 46 2230982 2022698
male 47 2238248 2019306
male 48 2237734 2058243
male 49 2264671 2179993
male 50 2300354 2211607
male 51 2190766 2205405
male 52 2207246 2198124
male 53 2141354 2219617
male 54 2093554 2242828
male 55 2073473 2158258
male 56 1956141 2140722
male 57 1905355 2110149
male 58 1834808 2027959
male 59 1753871 2006900
male 60 1745507 1914009
male 61 1679077 1837080
male 62 1712692 1763504
male 63 1672329 1701827
male 64 1267895 1661474
male 65 1273310 1607688
male 66 1248276 1589127
male 67 1248906 1653257
male 68 1087296 1211480
male 69 994759 1187554
male 70 945611 1149247
male 71 900148 1169356
male 72 853726 1010896
male 73 787863 913008
male 74 756624 857432
male 75 721008 803434
male 76 647804 758378
male 77 631884 689822
male 78 602458 649209
male 79 579234 610723
male 80 543559 540032
male 81 494870 511090
male 82 462983 476976
male 83 419831 442449
male 84 373131 411265
male 85 336819 359264
male 86 293120 322934
male 87 249803 283305
male 88 217436 240316
male 89 176689 205684
male 90 136948 170350
male 91 103799 138080
male 92 81072 109873
male 93 59037 85980
male 94 43531 62130
male 95 30951 42824
male 96 21424 31235
male 97 14556 20479
male 98 9259 13649
male 99 6073 9037
male 100 9162 13729
male 999 151781326 156936487
female 0 1929877 1930493
female 1 1947217 1938870
female 2 2004731 1935270
female 3 2014490 1956572
female 4 1985620 1959950
female 5 1984764 1961391
female 6 1991062 2024024
female 7 1973503 2031760
female 8 1981033 2014402
female 9 2028657 2009560
female 10 2036545 2015380
female 11 2011151 2001949
female 12 2006098 1993547
female 13 2013099 2041159
female 14 2030439 2068252
female 15 2065798 2035299
female 16 2100105 2036296
female 17 2132142 2045897
female 18 2195382 2060528
female 19 2243250 2105604
female 20 2210810 2149317
female 21 2131096 2179012
female 22 2086845 2244480
female 23 2057772 2298701
female 24 2085300 2301237
female 25 2101042 2213758
female 26 2055217 2166446
female 27 2108218 2132116
female 28 2096644 2144666
female 29 2105137 2169563
female 30 2124866 2110672
female 31 1982063 2148981
female 32 1992371 2154232
female 33 1943287 2135416
female 34 1923012 2170440
female 35 1975518 2046713
female 36 1911492 2009423
female 37 1981059 1995155
female 38 2072160 1936610
female 39 2197391 1993913
female 40 2192025 1940627
female 41 2067167 1977497
female 42 2047451 2064604
female 43 2069115 2184835
female 44 2121229 2204527
female 45 2271418 2094227
female 46 2288779 2053539
female 47 2297017 2062844
female 48 2301062 2100753
female 49 2341230 2229216
female 50 2359941 2279824
female 51 2273865 2283261
female 52 2293600 2282051
female 53 2239000 2316131
female 54 2198445 2331961
female 55 2181236 2263367
female 56 2081372 2254749
female 57 2031031 2237030
female 58 1960120 2163798
female 59 1887398 2148517
female 60 1875624 2071358
female 61 1813519 1997348
female 62 1850490 1922318
female 63 1811555 1870552
female 64 1389236 1826662
female 65 1407451 1776761
female 66 1390865 1758649
female 67 1400459 1832245
female 68 1236376 1361047
female 69 1147565 1347458
female 70 1097510 1317238
female 71 1049175 1350392
female 72 1010549 1183338
female 73 949097 1089183
female 74 927863 1032934
female 75 899069 971263
female 76 823266 936270
female 77 823446 867417
female 78 797665 832296
female 79 791961 803512
female 80 764952 724040
female 81 717995 704919
female 82 698438 676658
female 83 654978 648016
female 84 612590 624921
female 85 577904 565705
female 86 521091 532762
female 87 463105 487380
female 88 423183 435171
female 89 361309 394150
female 90 298615 342495
female 91 241188 288866
female 92 200317 244659
female 93 157941 200353
female 94 125918 156525
female 95 98766 114300
female 96 73799 89949
female 97 53582 63252
female 98 36641 46536
female 99 26193 32791
female 100 44202 58468
female 999 156964212 161920569

Stage 2 Cleaning up rows

The Census includes special rows that are the total of other rows by category.


In [6]:
# How many people?
total = p2_census.where('AGE',999)
total


Out[6]:
CAT AGE 2010pop 2014est
all 999 308745538 318857056
male 999 151781326 156936487
female 999 156964212 161920569

In [7]:
# Remove the rows that are totals of the other rows
# Now we have a clean Table
census = p2_census.where(p2_census['AGE'] < 999)

Stage 3: Transform 1 dimension to 2

The natural form of the data is age X gender. We could split the table and reassmeble it, but a better approach is to pivot.


In [8]:
# Split it by gender into two tables
male = census.where('CAT','male')
female = census.where('CAT','female')

In [9]:
male


Out[9]:
CAT AGE 2010pop 2014est
male 0 2014276 2017857
male 1 2030853 2023253
male 2 2092198 2022502
male 3 2104550 2048618
male 4 2077550 2043498
male 5 2072094 2043467
male 6 2075319 2110328
male 7 2057076 2122240
male 8 2065453 2105122
male 9 2119696 2097272

... (91 rows omitted)


In [10]:
pop2010 = census.pivot('CAT','AGE','2010pop',sum)
pop2010


Out[10]:
AGE all female male
0 3944153 1929877 2014276
1 3978070 1947217 2030853
2 4096929 2004731 2092198
3 4119040 2014490 2104550
4 4063170 1985620 2077550
5 4056858 1984764 2072094
6 4066381 1991062 2075319
7 4030579 1973503 2057076
8 4046486 1981033 2065453
9 4148353 2028657 2119696

... (91 rows omitted)

Visualization

Now we can easily visualize what's in this data.


In [11]:
# The number of individuals by ages
pop2010.plot('AGE',overlay=True)



In [12]:
pop2010['diff'] = pop2010['male'] - pop2010['female']
pop2010.show()


AGE all female male diff
0 3944153 1929877 2014276 84399
1 3978070 1947217 2030853 83636
2 4096929 2004731 2092198 87467
3 4119040 2014490 2104550 90060
4 4063170 1985620 2077550 91930
5 4056858 1984764 2072094 87330
6 4066381 1991062 2075319 84257
7 4030579 1973503 2057076 83573
8 4046486 1981033 2065453 84420
9 4148353 2028657 2119696 91039
10 4172541 2036545 2135996 99451
11 4114415 2011151 2103264 92113
12 4106243 2006098 2100145 94047
13 4118013 2013099 2104914 91815
14 4165982 2030439 2135543 105104
15 4242820 2065798 2177022 111224
16 4316139 2100105 2216034 115929
17 4395295 2132142 2263153 131011
18 4500855 2195382 2305473 110091
19 4585234 2243250 2341984 98734
20 4519129 2210810 2308319 97509
21 4354294 2131096 2223198 92102
22 4264642 2086845 2177797 90952
23 4198571 2057772 2140799 83027
24 4249363 2085300 2164063 78763
25 4262350 2101042 2161308 60266
26 4152305 2055217 2097088 41871
27 4248869 2108218 2140651 32433
28 4215249 2096644 2118605 21961
29 4223076 2105137 2117939 12802
30 4285668 2124866 2160802 35936
31 3970218 1982063 1988155 6092
32 3986847 1992371 1994476 2105
33 3880150 1943287 1936863 -6424
34 3839216 1923012 1916204 -6808
35 3956434 1975518 1980916 5398
36 3802087 1911492 1890595 -20897
37 3934445 1981059 1953386 -27673
38 4121880 2072160 2049720 -22440
39 4364796 2197391 2167405 -29986
40 4383274 2192025 2191249 -776
41 4114985 2067167 2047818 -19349
42 4076104 2047451 2028653 -18798
43 4105105 2069115 2035990 -33125
44 4211496 2121229 2090267 -30962
45 4508868 2271418 2237450 -33968
46 4519761 2288779 2230982 -57797
47 4535265 2297017 2238248 -58769
48 4538796 2301062 2237734 -63328
49 4605901 2341230 2264671 -76559
50 4660295 2359941 2300354 -59587
51 4464631 2273865 2190766 -83099
52 4500846 2293600 2207246 -86354
53 4380354 2239000 2141354 -97646
54 4291999 2198445 2093554 -104891
55 4254709 2181236 2073473 -107763
56 4037513 2081372 1956141 -125231
57 3936386 2031031 1905355 -125676
58 3794928 1960120 1834808 -125312
59 3641269 1887398 1753871 -133527
60 3621131 1875624 1745507 -130117
61 3492596 1813519 1679077 -134442
62 3563182 1850490 1712692 -137798
63 3483884 1811555 1672329 -139226
64 2657131 1389236 1267895 -121341
65 2680761 1407451 1273310 -134141
66 2639141 1390865 1248276 -142589
67 2649365 1400459 1248906 -151553
68 2323672 1236376 1087296 -149080
69 2142324 1147565 994759 -152806
70 2043121 1097510 945611 -151899
71 1949323 1049175 900148 -149027
72 1864275 1010549 853726 -156823
73 1736960 949097 787863 -161234
74 1684487 927863 756624 -171239
75 1620077 899069 721008 -178061
76 1471070 823266 647804 -175462
77 1455330 823446 631884 -191562
78 1400123 797665 602458 -195207
79 1371195 791961 579234 -212727
80 1308511 764952 543559 -221393
81 1212865 717995 494870 -223125
82 1161421 698438 462983 -235455
83 1074809 654978 419831 -235147
84 985721 612590 373131 -239459
85 914723 577904 336819 -241085
86 814211 521091 293120 -227971
87 712908 463105 249803 -213302
88 640619 423183 217436 -205747
89 537998 361309 176689 -184620
90 435563 298615 136948 -161667
91 344987 241188 103799 -137389
92 281389 200317 81072 -119245
93 216978 157941 59037 -98904
94 169449 125918 43531 -82387
95 129717 98766 30951 -67815
96 95223 73799 21424 -52375
97 68138 53582 14556 -39026
98 45900 36641 9259 -27382
99 32266 26193 6073 -20120
100 53364 44202 9162 -35040

In [13]:
# The difference in the genders by age
pop2010.select(['AGE', 'diff']).bar('AGE')



In [17]:
pop2010['Rel Diff'] = pop2010['diff'] / pop2010['all']
pop2010.set_format('Rel Diff', PercentFormatter)
pop2010.show()


AGE all female male diff Rel Diff
0 3944153 1929877 2014276 84399 2.14%
1 3978070 1947217 2030853 83636 2.10%
2 4096929 2004731 2092198 87467 2.13%
3 4119040 2014490 2104550 90060 2.19%
4 4063170 1985620 2077550 91930 2.26%
5 4056858 1984764 2072094 87330 2.15%
6 4066381 1991062 2075319 84257 2.07%
7 4030579 1973503 2057076 83573 2.07%
8 4046486 1981033 2065453 84420 2.09%
9 4148353 2028657 2119696 91039 2.19%
10 4172541 2036545 2135996 99451 2.38%
11 4114415 2011151 2103264 92113 2.24%
12 4106243 2006098 2100145 94047 2.29%
13 4118013 2013099 2104914 91815 2.23%
14 4165982 2030439 2135543 105104 2.52%
15 4242820 2065798 2177022 111224 2.62%
16 4316139 2100105 2216034 115929 2.69%
17 4395295 2132142 2263153 131011 2.98%
18 4500855 2195382 2305473 110091 2.45%
19 4585234 2243250 2341984 98734 2.15%
20 4519129 2210810 2308319 97509 2.16%
21 4354294 2131096 2223198 92102 2.12%
22 4264642 2086845 2177797 90952 2.13%
23 4198571 2057772 2140799 83027 1.98%
24 4249363 2085300 2164063 78763 1.85%
25 4262350 2101042 2161308 60266 1.41%
26 4152305 2055217 2097088 41871 1.01%
27 4248869 2108218 2140651 32433 0.76%
28 4215249 2096644 2118605 21961 0.52%
29 4223076 2105137 2117939 12802 0.30%
30 4285668 2124866 2160802 35936 0.84%
31 3970218 1982063 1988155 6092 0.15%
32 3986847 1992371 1994476 2105 0.05%
33 3880150 1943287 1936863 -6424 -0.17%
34 3839216 1923012 1916204 -6808 -0.18%
35 3956434 1975518 1980916 5398 0.14%
36 3802087 1911492 1890595 -20897 -0.55%
37 3934445 1981059 1953386 -27673 -0.70%
38 4121880 2072160 2049720 -22440 -0.54%
39 4364796 2197391 2167405 -29986 -0.69%
40 4383274 2192025 2191249 -776 -0.02%
41 4114985 2067167 2047818 -19349 -0.47%
42 4076104 2047451 2028653 -18798 -0.46%
43 4105105 2069115 2035990 -33125 -0.81%
44 4211496 2121229 2090267 -30962 -0.74%
45 4508868 2271418 2237450 -33968 -0.75%
46 4519761 2288779 2230982 -57797 -1.28%
47 4535265 2297017 2238248 -58769 -1.30%
48 4538796 2301062 2237734 -63328 -1.40%
49 4605901 2341230 2264671 -76559 -1.66%
50 4660295 2359941 2300354 -59587 -1.28%
51 4464631 2273865 2190766 -83099 -1.86%
52 4500846 2293600 2207246 -86354 -1.92%
53 4380354 2239000 2141354 -97646 -2.23%
54 4291999 2198445 2093554 -104891 -2.44%
55 4254709 2181236 2073473 -107763 -2.53%
56 4037513 2081372 1956141 -125231 -3.10%
57 3936386 2031031 1905355 -125676 -3.19%
58 3794928 1960120 1834808 -125312 -3.30%
59 3641269 1887398 1753871 -133527 -3.67%
60 3621131 1875624 1745507 -130117 -3.59%
61 3492596 1813519 1679077 -134442 -3.85%
62 3563182 1850490 1712692 -137798 -3.87%
63 3483884 1811555 1672329 -139226 -4.00%
64 2657131 1389236 1267895 -121341 -4.57%
65 2680761 1407451 1273310 -134141 -5.00%
66 2639141 1390865 1248276 -142589 -5.40%
67 2649365 1400459 1248906 -151553 -5.72%
68 2323672 1236376 1087296 -149080 -6.42%
69 2142324 1147565 994759 -152806 -7.13%
70 2043121 1097510 945611 -151899 -7.43%
71 1949323 1049175 900148 -149027 -7.65%
72 1864275 1010549 853726 -156823 -8.41%
73 1736960 949097 787863 -161234 -9.28%
74 1684487 927863 756624 -171239 -10.17%
75 1620077 899069 721008 -178061 -10.99%
76 1471070 823266 647804 -175462 -11.93%
77 1455330 823446 631884 -191562 -13.16%
78 1400123 797665 602458 -195207 -13.94%
79 1371195 791961 579234 -212727 -15.51%
80 1308511 764952 543559 -221393 -16.92%
81 1212865 717995 494870 -223125 -18.40%
82 1161421 698438 462983 -235455 -20.27%
83 1074809 654978 419831 -235147 -21.88%
84 985721 612590 373131 -239459 -24.29%
85 914723 577904 336819 -241085 -26.36%
86 814211 521091 293120 -227971 -28.00%
87 712908 463105 249803 -213302 -29.92%
88 640619 423183 217436 -205747 -32.12%
89 537998 361309 176689 -184620 -34.32%
90 435563 298615 136948 -161667 -37.12%
91 344987 241188 103799 -137389 -39.82%
92 281389 200317 81072 -119245 -42.38%
93 216978 157941 59037 -98904 -45.58%
94 169449 125918 43531 -82387 -48.62%
95 129717 98766 30951 -67815 -52.28%
96 95223 73799 21424 -52375 -55.00%
97 68138 53582 14556 -39026 -57.27%
98 45900 36641 9259 -27382 -59.66%
99 32266 26193 6073 -20120 -62.36%
100 53364 44202 9162 -35040 -65.66%

In [18]:
pop2010.select(['AGE', 'Rel Diff']).bar('AGE')



In [19]:
pop2010['Ratio F/M'] = pop2010['female'] / pop2010['male']
pop2010.select(['AGE', 'Ratio F/M']).bar('AGE')