How to get a simple dataset from a big CSV file

This notebook goes through some steps to convert a large, unwieldy file from the Council's Litter Bin Sensor Project into a simpler dataset for visualisation. We use a number of functions provided by the Python pandas library.

We assume that we've already downloaded the file to our local file system. So our first step is to import the CSV file as a pandas DataFrame:


In [2]:
import pandas as pd
table = pd.read_csv("../data/binsensors.csv")
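
Since the file is fairly large, a quick check of the table's size tells us roughly what we're dealing with (a small aside; the exact numbers depend on the file you downloaded):

table.shape   # (number of rows, number of columns)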

Let's have a look at the column labels:


In [3]:
list(table.columns.values)


Out[3]:
['ID',
 'Site',
 'Site name',
 'Address',
 'City',
 'Site content type',
 'Content type',
 'Content type name',
 'Date & Time of bin collection (Europe/London)',
 'Frozen',
 "Fill level before collection (%) [Nb 100% is actually a bin that's 80% full]",
 'Fill level after',
 ' Volume (litres3) ',
 ' Weight (kg) ',
 'Partial',
 'Container slot:ID',
 'Container slot:Name',
 'Container slot:Time (Europe/London)',
 'Container slot:Fill level before',
 'Container slot:Fill level after',
 'Container slot:Volume',
 'Container slot:Weight',
 'Container slot:Confidence',
 'Confidence']

If we just want to select a couple of columns, we can use the column labels like this:


In [4]:
table[['ID', 'Address']]


Out[4]:
ID Address
0 12371082.0 Princes Street
1 12370969.0 Princes Street
2 12370940.0 Princes Street
3 12370931.0 Princes Street
4 12370895.0 Shandwick Place
5 12370835.0 Shandwick Place
6 12370992.0 Coates Crescent
7 12371305.0 West Maitland Street
8 12371189.0 Clifton Terrace
9 12370352.0 80 Haymarket Terrace
10 12371102.0 83 Haymarket Terrace
11 12370927.0 20 Atholl Cresent
12 12370899.0 Atholl Crescent
13 12370879.0 Shandwick Place
14 12370388.0 Princes Street
15 12369997.0 Princes Street
16 12370884.0 Princes Street
17 12370088.0 3 Chambers Street
18 12370302.0 Newkirkgate
19 12370476.0 22 Chambers Street
20 12370263.0 Chambers Street
21 12370983.0 Cathedral Lane
22 12370275.0 Cathedral Lane
23 12370041.0 The Mound
24 12370063.0 The Mound
25 12370161.0 Mound Place
26 12370147.0 54 Leith Walk
27 12370172.0 Ramsay Lane
28 12369991.0 96 Leith Walk
29 12369917.0 210 Leith Walk
... ... ...
9971 10854972.0 Nicolson Street
9972 10856192.0 Shandwick Place
9973 10845164.0 Princes Street Gardens East
9974 10855307.0 South Bridge
9975 10854871.0 7 Canongate
9976 10851260.0 271 Canongate
9977 10852254.0 Canongate
9978 10849062.0 162 Canongate
9979 10848466.0 Regent Road
9980 12359624.0 Waterloo Place
9981 10848617.0 Regent Road
9982 12358547.0 South Bridge
9983 10851500.0 86 Canongate
9984 12360158.0 Waterloo Place
9985 10848023.0 Shandwick Place
9986 10847071.0 Shandwick Place
9987 10850092.0 Shandwick Place
9988 12360654.0 Coates Crescent
9989 12358088.0 Coates Crescent
9990 12363496.0 Coates Crescent
9991 12367091.0 Clifton Terrace
9992 10849669.0 80 Haymarket Terrace
9993 10847247.0 90 Haymarket Terrace
9994 12364372.0 20 Atholl Cresent
9995 10850459.0 22 Haymarket Terrace
9996 12367706.0 Clifton Terrace
9997 10839663.0 West Maitland Street
9998 12363241.0 Festival Square
9999 12362887.0 Festival Square
10000 NaN NaN

10001 rows × 2 columns
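
The same double-bracket selection works with any list of labels, so we could also keep the selection in a variable and reuse it (a small sketch, using a hypothetical cols variable):

cols = ['ID', 'Site name', 'Address']
table[cols]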

But a couple of interesting columns (for the collection date and the fill level measured by the sensor) have very complicated labels, so let's simplify them.

First, we'll make a list of all the labels, and then bind the relevant string values to a couple of variables. This means that we don't have to worry about mis-typing things like 'Date & Time of bin collection (Europe/London)'!


In [5]:
l = list(table.columns.values)   # all the column labels, in order
date = l[8]                      # the long date-and-time label
fill = l[10]                     # the long fill-level label
date, fill


Out[5]:
('Date & Time of bin collection (Europe/London)',
 "Fill level before collection (%) [Nb 100% is actually a bin that's 80% full]")

Now that we've got the short variables date and fill in place of the long strings, let's go ahead and replace those labels with something simpler:


In [6]:
table = table.rename(columns={date: 'Date', fill: 'Fill_level'})
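
Note that rename returns a new DataFrame rather than changing the original in place, which is why we assign the result back to table. A quick way to confirm the new labels are there (output not shown):

'Date' in table.columns and 'Fill_level' in table.columns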

Now we'll make a new table with just four columns:


In [7]:
table1 = table[['ID', 'Address', 'Date', 'Fill_level']]
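
This gives us a new DataFrame built from a selection of the original. If we planned to modify table1 later, it can be safer to take an explicit copy so that pandas doesn't warn about writing to a slice of the original table (a cautious variant of the same line):

table1 = table[['ID', 'Address', 'Date', 'Fill_level']].copy()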

And we'll just take the first 30 rows:


In [8]:
tabletop = table1.head(30)
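
Incidentally, the full table ends with an empty row (the all-NaN row numbered 10000 earlier), so if we ever wanted more than the first 30 rows we might drop incomplete rows first (a sketch using dropna):

table1_clean = table1.dropna(subset=['ID'])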

In [9]:
tabletop


Out[9]:
ID Address Date Fill_level
0 12371082.0 Princes Street 2016-08-09 15:46:34 78.0
1 12370969.0 Princes Street 2016-08-09 15:37:28 54.0
2 12370940.0 Princes Street 2016-08-09 15:32:35 86.0
3 12370931.0 Princes Street 2016-08-09 15:32:19 61.0
4 12370895.0 Shandwick Place 2016-08-09 15:27:16 31.0
5 12370835.0 Shandwick Place 2016-08-09 15:25:38 23.0
6 12370992.0 Coates Crescent 2016-08-09 15:21:17 99.0
7 12371305.0 West Maitland Street 2016-08-09 15:16:15 88.0
8 12371189.0 Clifton Terrace 2016-08-09 15:01:12 55.0
9 12370352.0 80 Haymarket Terrace 2016-08-09 14:59:18 91.0
10 12371102.0 83 Haymarket Terrace 2016-08-09 14:56:28 80.0
11 12370927.0 20 Atholl Cresent 2016-08-09 14:45:10 67.0
12 12370899.0 Atholl Crescent 2016-08-09 14:43:32 92.0
13 12370879.0 Shandwick Place 2016-08-09 14:38:21 94.0
14 12370388.0 Princes Street 2016-08-09 14:30:27 87.0
15 12369997.0 Princes Street 2016-08-09 14:25:53 100.0
16 12370884.0 Princes Street 2016-08-09 14:25:44 95.0
17 12370088.0 3 Chambers Street 2016-08-09 14:18:21 42.0
18 12370302.0 Newkirkgate 2016-08-09 14:18:15 93.0
19 12370476.0 22 Chambers Street 2016-08-09 14:16:19 84.0
20 12370263.0 Chambers Street 2016-08-09 14:13:56 68.0
21 12370983.0 Cathedral Lane 2016-08-09 14:07:19 87.0
22 12370275.0 Cathedral Lane 2016-08-09 14:06:33 58.0
23 12370041.0 The Mound 2016-08-09 14:02:54 37.0
24 12370063.0 The Mound 2016-08-09 13:58:57 39.0
25 12370161.0 Mound Place 2016-08-09 13:54:31 88.0
26 12370147.0 54 Leith Walk 2016-08-09 13:54:11 100.0
27 12370172.0 Ramsay Lane 2016-08-09 13:53:07 100.0
28 12369991.0 96 Leith Walk 2016-08-09 13:51:42 95.0
29 12369917.0 210 Leith Walk 2016-08-09 13:49:04 99.0

Finally, we'll write the result to a JSON-formatted file.


In [10]:
tabletop.to_json('../data/binsensorsimple.json', orient="records")
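
If we want to check the exported file, we can read it straight back into pandas; read_json with the same orient setting should give us our 30 rows again (a quick sanity check):

check = pd.read_json('../data/binsensorsimple.json', orient="records")
check.head()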