In [49]:
import pandas as pd
Using IPython notebook's shell magic, we can see what excel files exist in the directory:
In [50]:
!ls *.xlsx
Nassau_police_union_contribs_dirty.xlsx
~$Nassau_police_union_contribs_dirty.xlsx
We can read the excel file (warts and all) using the read_excel command:
In [51]:
df = pd.read_excel('Nassau_police_union_contribs_dirty.xlsx')
Display an (abbreviated) contents of the dataFrame using IPython notebooks display magic:
In [52]:
df
Out[52]:
DATE
NAME OF INITIATIVE/ADDRESS
City
State
Zip Code
CHECK NO.
AMOUNT ($ )
RECORD DATE
Party
Source
0
02/21/02
BRONX CONSERVATIVE COMM;415 MINNIEFORD AVE
BRONX
N.Y.
10458
1562
$ 140
JUL-03-02 10:16 AM
C
SOA
1
02/23/09
BRONX CONSERVATIVE COMM.;188D EDGEWATER PARK
BRONX
NY
10465
2628
100
JUL-13-09 04:13 PM
C
SOA
2
03/01/00
BRONX CONSERVATIVE COMMITTEE;475 MINNEFORD AVE.
BRONX
NY
10464
1346
405
JUL-10-00 02:50 PM
C
SOA
3
03/04/03
BRONX CONSERVATIVE COMMITTEE;188 EDGEWATER PARK
BRONX
NY
10465
1727
100
JUL-14-03 02:41 PM
C
SOA
4
03/18/04
BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK
BRONX
NY
10465
1869
100
JUL-14-04 01:09 PM
C
SOA
5
03/01/05
BRONX CONSERVATIVE COMMITTEE;1880 EDGEWATER PARK
BRONX
NY
10465
1975
100
JUL-13-05 03:48 PM
C
SOA
6
02/26/07
BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK
BRONX
NY
10465
2371
100
JUL-16-07 05:32 PM
C
SOA
7
03/20/08
BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK
BRONX
NY
10465
2504
100
JUL-15-08 02:53 PM
C
SOA
8
05/25/04
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBO...
LEVITTOWN
NY
11756
1902
200
JUL-14-04 02:08 PM
C
DETECTIVES
9
01/20/09
NASSAU CO. CONSERVATIVE COMMITTEE;105 BOBOLINK LA
LEVITTOWN
NY
11756
2603
200
JUL-13-09 03:51 PM
C
SOA
10
04/30/09
NASSAU CO. CONSERVATIVE COMMITTEE;105 BOBLINK LA
LEVITTOWN
NY
11756
2646
250
JUL-13-09 04:42 PM
C
SOA
11
02/15/05
NASSAU CO. CONSERVATIVE PARTY;105 BOBOLINK LANE
LEVITTOWN
NY
11756
1968
120
JUL-13-05 03:34 PM
C
SOA
12
04/14/11
NASSAU CO. CONSERVATIVE PARTY;1 SYDNEY ST
PLAINVIEW
NY
11803
2854
$ 200
JUL-11-11 03:19 PM
C
SOA
13
05/28/02
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
N.Y.
11756
1858
$ 300
JUL-15-02 10:36 AM
C
PBA
14
05/23/03
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1953
2000
JUL-11-03 02:42 PM
C
PBA
15
05/23/03
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1954
300
JUL-11-03 02:43 PM
C
PBA
16
06/01/04
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBLI...
LEVITTOWN
NY
11756
2032
$ 3500
JUL-15-04 12:08 PM
C
PBA
17
05/05/06
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
2238
400
JUN-16-06 09:09 AM
C
PBA
18
02/13/09
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1236
400
JUL-08-09 01:26 PM
C
PBA
19
04/10/09
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1263
1500
JUL-08-09 02:27 PM
C
PBA
20
05/05/11
NASSAU COUNTY CONSERVATIVE COMMITTEE;1 SYDNEY ...
PLAINVIEW
NY
11803
1548
100
JUL-08-11 08:53 AM
C
PBA
21
06/10/11
NASSAU COUNTY CONSERVATIVE COMMITTEE;1 SYDNEY ST
PLAINVIEW
NY
11803
1567
200
JUL-08-11 09:57 AM
C
PBA
22
02/16/99
NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS...
PLAINVIEW
NY
11803
NaN
200
JUN-10-99 06:06 PM
C
DETECTIVES
23
02/12/03
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOB O...
LEVITTOWN
NY
11756
1802
$ 200
JUN-19-03 03:38 PM
C
DETECTIVES
24
04/26/05
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOB O...
LEVITTOWN
NY
11756
1976
200
JUL-11-05 02:19 PM
C
DETECTIVES
25
02/09/06
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
2034
200
JUN-27-06 12:47 PM
C
DETECTIVES
26
05/10/07
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1602
200
JUL-05-07 05:26 PM
C
DETECTIVES
27
02/09/99
NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS...
PLAINVIEW
NY
11803
1253
250
JUL-07-99 10:35 AM
C
SOA
28
04/27/00
NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS...
PLAINVIEW
NY
11803
1364
300
JUL-10-00 03:22 PM
C
SOA
29
01/24/02
NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL...
LEVITTOWN
NY
11756
1551
400
JUL-03-02 10:00 AM
C
SOA
...
...
...
...
...
...
...
...
...
...
...
3251
09/06/06
TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS...
WESTBURY
NY
11590
2333
300
SEP-21-06 04:36 PM
R
SOA
3252
03/19/07
TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS...
WESTBURY
NY
11590
2389
250
JUL-16-07 05:32 PM
R
SOA
3253
03/20/08
TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS...
WESTBURY
NY
11590
2503
250
JUL-15-08 02:52 PM
R
SOA
3254
07/08/09
TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS...
WESTBURY
NY
11590
2676
150
JUL-13-09 05:15 PM
R
SOA
3255
07/06/11
TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS...
WESTBURY
NY
11590
2883
300
JUL-11-11 03:52 PM
R
SOA
3256
04/06/10
TOWN OF N. HEMPSTEAD REPUBLICAN COMM;164 POST ...
WESTBURY
NY
11590
1424
625
JUL-15-10 10:51 AM
R
PBA
3257
10/19/01
TOWN OF N. HEMPSTEAD REPUBLICAN COMMITTEE;164 ...
WESTBURY
NY
11590
1812
400
NOV-28-01 02:20 PM
R
PBA
3258
06/01/00
TOWN OF N. HEMPSTEAD REPUBLICAN COMMITTEE;164 ...
WESTBURY
NY
11590
1375
1550
JUL-10-00 03:44 PM
R
SOA
3259
07/16/10
TOWN OF NORTH HEMPSTEAD REPUBLICAN COMMITTEE;1...
WESTBURY
NY
11590
1205
550
AUG-11-10 02:13 PM
R
DETECTIVES
3260
10/01/10
TOWN OF OYSTER BAY REPUBLICAN COMMITTEE;164 PO...
WESTBURY
NY
11590
1486
$ 1200
OCT-22-10 11:33 AM
R
PBA
3261
04/05/02
TRUNZO CAMPAIGN FUND;105 WASHINGTON AVE.
BRENTWOOD
NY
11717
1847
350
JUL-15-02 10:28 AM
R
PBA
3262
09/26/05
VALLEY STREAM REPUBLICAN CLUB;362 PICCADILLY POND
LYNBROOK
NY
11563
2184
250
OCT-07-05 03:13 PM
R
PBA
3263
09/14/09
VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D...
LYNBROOK
NY
11563
1320
400
SEP-30-09 12:54 PM
R
PBA
3264
10/15/09
VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D...
LYNBROOK
NY
11563
1338
2000
OCT-21-09 09:46 AM
R
PBA
3265
07/19/11
VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D...
LYNBROOK
NY
11563
1582
1000
OCT-03-11 02:26 PM
R
PBA
3266
09/06/07
VALLEY STREAM REPUBLICAN COMMITTEE;362 PICCADI...
LYNBROOK
NY
11563
1056
225
OCT-05-07 12:55 PM
R
PBA
3267
09/01/10
VALLEY STREAM REPUBLICAN COMMITTEE;362 PICCADI...
LYNBROOK
NY
11563
1472
400
SEP-30-10 11:01 AM
R
PBA
3268
03/02/01
VOLKER CAMPAIGN COMMITTEE;PO BOX 494
LANCASTER
NY
14086
1747
$ 300
JUL-13-01 03:09 PM
R
PBA
3269
08/27/99
WANTAGH GOP;775 WANTAGH AVENUE
WANTAGH
NY
11793
1309
$ 100
AUG-31-99 05:04 PM
R
SOA
3270
07/14/01
WANTAGH GOP;75 WANTAGH AVE
WANTAGH
NY
11793
1497
700
AUG-09-01 10:01 PM
R
SOA
3271
08/03/99
WANTAGH REPUBLICAN COMMITTEE;2196 BROOKSIDE AV...
WANTAGH
NY
11793
1305
550
AUG-11-99 07:35 PM
R
SOA
3272
08/25/10
WOODMERE REPUBLICAN CLUB;36 CENTER STREET
WOODMERE
NY
11598
1210
200
AUG-31-10 06:09 PM
R
DETECTIVES
3273
03/14/11
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
1526
500
JUL-07-11 04:12 PM
R
PBA
3274
06/06/11
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
1560
250
JUL-08-11 09:01 AM
R
PBA
3275
01/09/12
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
1618
500
JAN-16-12 12:00 AM
R
PBA
3276
01/11/12
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
1621
500
JAN-16-12 12:00 AM
R
PBA
3277
01/09/12
ZELDIN FOR SENATE;47 FLINTLOCK DRIVE
SHIRLEY
NY
11967
1326
500
JAN-16-12 12:00 AM
R
DETECTIVES
3278
06/06/11
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
2872
250
JUL-11-11 03:37 PM
R
SOA
3279
01/09/12
ZELDIN FOR SENATE;PO BOX 628
SHIRLEY
NY
11967
2916
500
JAN-12-12 12:00 AM
R
SOA
3280
06/30/09
;235 CARMEN AVE
EAST ROCKAWAY
NY
11518
1290
$ 600
JUL-08-09 02:59 PM
R
PBA
3281 rows × 10 columns
Looking at the data, we can tell it has all kinds of problems.
For the remainder of this breakout, we focused on trying to get the City column into better shape.
To look at only the City column using IPython magic, we can type:
In [54]:
df['City']
Out[54]:
0 BRONX
1 BRONX
2 BRONX
3 BRONX
4 BRONX
5 BRONX
6 BRONX
7 BRONX
8 LEVITTOWN
9 LEVITTOWN
10 LEVITTOWN
11 LEVITTOWN
12 PLAINVIEW
13 LEVITTOWN
14 LEVITTOWN
...
3266 LYNBROOK
3267 LYNBROOK
3268 LANCASTER
3269 WANTAGH
3270 WANTAGH
3271 WANTAGH
3272 WOODMERE
3273 SHIRLEY
3274 SHIRLEY
3275 SHIRLEY
3276 SHIRLEY
3277 SHIRLEY
3278 SHIRLEY
3279 SHIRLEY
3280 EAST ROCKAWAY
Name: City, Length: 3281, dtype: object
This still doesn't give us a great idea of how big a mess we have, so let's see how many unique names exist and sort them by name:
In [58]:
cities = df['City'].unique()
cities.sort()
cities
Out[58]:
array([nan, u' MASSAPEQUA', u' WANTAGH', u' ALBANY', u' BALDWIN',
u' BRONX', u' BROOKLYN', u' COHOES', u' GREAT NECK',
u' HEMPSTEAD', u' HUNTINGTON', u' JEFFERSON VALLEY',
u' LEVITTOWN', u' LEVITTOWN ', u' MERRICK', u' MIDDLE VILLAGE',
u' STATEN ISLAND', u' WANTAGH', u' WESTBURY',
u' WILLISTON PARK', u' WOODMERE', u' ALBANY', u' BALDWIN',
u' BROOKLYN', u' CARMEL', u' EAST NASSAU', u' GARDEN CITY',
u' GARDEN CITY', u' GREAT NECK', u' GREAT NECK', u' HUNTINGTON',
u' JERICHO', u' LIDO BEACH', u' LYNBROOK', u' MASSAPEQUA',
u' MASSAPEQUA PARK', u' MERRICK', u' MINEOLA', u' PLAINVIEW',
u' ROSLYN', u' SHIRLEY', u' SOUTHOLD', u' STATEN ISLAND',
u' STATEN ISLAND ', u' WEST ISLIP', u' WESTBURY', u' WOODBURY',
u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBANY ', u'ALBANY ',
u'ALBANY ', u'ALBERTSON', u'ALBION', u'AMITYVILLE', u'AUBURN',
u'BABYLON', u'BALDWIN', u'BALWIN', u'BARDONIA', u'BAYPORT',
u'BAYVILLE', u'BELLEMORE', u'BELLEROSE', u'BELLMORE', u'BETHPAGE',
u'BOHEMIA', u'BRENTWOOD', u'BRENTWOOD ', u'BREWSTER',
u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKLYN ',
u'BROOKLYN ', u'BROOKYN', u'CANADAIGUA', u'CANANDAIGUA',
u'CANANDALGUA', u'CAPITAL STATION', u'CARL PLACE', u'CARLE PLACE',
u'CARMEL', u'CIRCLEVILLE', u'COHOES', u'COHOES ', u'COMMACK',
u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO', u'E ISLIP',
u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW', u'E. NORTHPORT',
u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY', u'E. SETAUKET',
u'E. WILLISTON', u'EAST', u'EAST GREENBUSH', u'EAST ISLIP',
u'EAST MEADOW', u'EAST NORTHPOND', u'EAST NORTHPORT',
u'EAST NORWICH', u'EAST ROCKAWAY', u'EAST SETAUKET', u'ELMIRA',
u'ELMONT', u'FARMINGDALE', u'FARMINGVILLE', u'FRANKLIN',
u'FRANKLIN SQUARE', u'FREEPORT', u'FREEPORT ', u'FRESH MEADOW',
u'FT HAMILTON STATE', u'GARDEN CITY', u'GARDEN CITY ',
u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN', u'GREAT NECK',
u'GREAT NECK', u'GREAT NECK', u'GREAT NECK ', u'HAMPTON',
u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
u'HUNTINGTON ', u'HUNTINGTON STATION', u'HUNTINGTON STAT',
u'HUNTINGTON STATION', u'HUNTINGTONST', u'IRVINGTON-ON-HU',
u'ISLAND PARK', u'ISLIP', u'JAMAICA ESTATES', u'JEFFERSON',
u'JEFFERSON VALLEY', u'JEFFERSON VALLEY', u'JEFFERSON VALLE',
u'JEFFERSON VALLEY', u'JERICHO', u'KINDERHOOK', u'LANCASTER',
u'LANCASTER ', u'LAWRENCE', u'LEVITOWN', u'LEVITTOWN',
u'LEVITTOWN ', u'LEVITTSTOWN', u'LINDENHURST', u'LINDENHURST ',
u'LOCUST VALLEY', u'LOCUST VALLEY', u'LONG BEACH', u'LONG BEACH',
u'LONGBEACH', u'LYNBROOK', u'LYNBROOK ', u'LYNBROOK ', u'MALVERNE',
u'MANORVILLE', u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA ',
u'MASSAPEQUA ', u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD',
u'MELVILLE', u'MERCERVILLE', u'MERICK', u'MERRICK',
u'MIDDLE VILLAGE', u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA',
u'MINEOLA ', u'MT. SINAI', u'N MERRICK', u'N VALLEY',
u'N VALLEY STREAM', u'N. BABYLON', u'N. BALDWIN', u'N. MASSAPEQUA',
u'N. MERRICK', u'N. VALLEY', u'N. VALLEY STREAM', u'N.MASSAPEQUA',
u'N.MASSAPEQUA ', u'N.VALLEY STREAM', u'NANUET', u'NESCONSET',
u'NEW YORK', u'NEW YORK', u'NEW BALTIMORE', u'NEW HAMPTON',
u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE', u'NEW YORK',
u'NEW YORK ', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT WASHINGTON',
u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
u'SANDS POINT', u'SARATOGA SPRING', u'SARATOGA SPRING ',
u'SCHENECTADY', u'SEA CLIFF', u'SEACLIFF', u'SEAFORD', u'SEATAUKET',
u'SETAUKET', u'SHIRLEY', u'SHIRLEY ', u'SHOREHAM', u'SHOREHARN',
u'SMITHTOWN', u'ST JAMES', u'ST. JAMES', u'STATEN',
u'STATEN ISLAND', u'STATEN ISLAND', u'STONY POINT', u'SUFFERN',
u'SYOSSET', u'SYRACUSE', u'UNIONDALE', u'UPPER NYACK',
u'VALLEY STREAM', u'W BABYLON', u'W ISLIP', u'W. BABYLON',
u'W. HEMPSTEAD', u'W. ISLIP', u'W.BABYLON', u'WADING RIVER',
u'WANTAGH', u'WASHINGTONVILLE', u'WEST BABLYON', u'WEST BABYLON',
u'WEST BURY', u'WEST ISLIP', u'WEST MERRICK', u'WESTBURY',
u'WESTBURY ', u'WESTBUTY', u'WESYBURY', u'WHITESTONE',
u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK', u'WILLISTON PK',
u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)
Wow, that's... a mess. The most obvious mess is trailing and leading white space. Let's get rid of it with .strip()
In [70]:
cities = df['City'].str.strip().unique()
cities.sort()
cities
Out[70]:
array([nan, u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBERTSON', u'ALBION',
u'AMITYVILLE', u'AUBURN', u'BABYLON', u'BALDWIN', u'BALWIN',
u'BARDONIA', u'BAYPORT', u'BAYVILLE', u'BELLEMORE', u'BELLEROSE',
u'BELLMORE', u'BETHPAGE', u'BOHEMIA', u'BRENTWOOD', u'BREWSTER',
u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKYN',
u'CANADAIGUA', u'CANANDAIGUA', u'CANANDALGUA', u'CAPITAL STATION',
u'CARL PLACE', u'CARLE PLACE', u'CARMEL', u'CIRCLEVILLE', u'COHOES',
u'COMMACK', u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO',
u'E ISLIP', u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW',
u'E. NORTHPORT', u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY',
u'E. SETAUKET', u'E. WILLISTON', u'EAST', u'EAST GREENBUSH',
u'EAST ISLIP', u'EAST MEADOW', u'EAST NASSAU', u'EAST NORTHPOND',
u'EAST NORTHPORT', u'EAST NORWICH', u'EAST ROCKAWAY',
u'EAST SETAUKET', u'ELMIRA', u'ELMONT', u'FARMINGDALE',
u'FARMINGVILLE', u'FRANKLIN', u'FRANKLIN SQUARE', u'FREEPORT',
u'FRESH MEADOW', u'FT HAMILTON STATE', u'GARDEN CITY',
u'GARDEN CITY', u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN',
u'GREAT NECK', u'GREAT NECK', u'GREAT NECK', u'HAMPTON',
u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
u'HUNTINGTON STATION', u'HUNTINGTON STAT', u'HUNTINGTON STATION',
u'HUNTINGTONST', u'IRVINGTON-ON-HU', u'ISLAND PARK', u'ISLIP',
u'JAMAICA ESTATES', u'JEFFERSON', u'JEFFERSON VALLEY',
u'JEFFERSON VALLEY', u'JEFFERSON VALLE', u'JEFFERSON VALLEY',
u'JERICHO', u'KINDERHOOK', u'LANCASTER', u'LAWRENCE', u'LEVITOWN',
u'LEVITTOWN', u'LEVITTSTOWN', u'LIDO BEACH', u'LINDENHURST',
u'LOCUST VALLEY', u'LOCUST VALLEY', u'LONG BEACH', u'LONG BEACH',
u'LONGBEACH', u'LYNBROOK', u'MALVERNE', u'MANORVILLE',
u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA PARK',
u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD', u'MELVILLE',
u'MERCERVILLE', u'MERICK', u'MERRICK', u'MIDDLE VILLAGE',
u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA', u'MT. SINAI',
u'N MERRICK', u'N VALLEY', u'N VALLEY STREAM', u'N. BABYLON',
u'N. BALDWIN', u'N. MASSAPEQUA', u'N. MERRICK', u'N. VALLEY',
u'N. VALLEY STREAM', u'N.MASSAPEQUA', u'N.VALLEY STREAM', u'NANUET',
u'NESCONSET', u'NEW YORK', u'NEW YORK', u'NEW BALTIMORE',
u'NEW HAMPTON', u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE',
u'NEW YORK', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT WASHINGTON',
u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
u'SANDS POINT', u'SARATOGA SPRING', u'SCHENECTADY', u'SEA CLIFF',
u'SEACLIFF', u'SEAFORD', u'SEATAUKET', u'SETAUKET', u'SHIRLEY',
u'SHOREHAM', u'SHOREHARN', u'SMITHTOWN', u'SOUTHOLD', u'ST JAMES',
u'ST. JAMES', u'STATEN', u'STATEN ISLAND', u'STATEN ISLAND',
u'STONY POINT', u'SUFFERN', u'SYOSSET', u'SYRACUSE', u'UNIONDALE',
u'UPPER NYACK', u'VALLEY STREAM', u'W BABYLON', u'W ISLIP',
u'W. BABYLON', u'W. HEMPSTEAD', u'W. ISLIP', u'W.BABYLON',
u'WADING RIVER', u'WANTAGH', u'WASHINGTONVILLE', u'WEST BABLYON',
u'WEST BABYLON', u'WEST BURY', u'WEST ISLIP', u'WEST MERRICK',
u'WESTBURY', u'WESTBUTY', u'WESYBURY', u'WHITESTONE',
u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK', u'WILLISTON PK',
u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)
We can work with this, let's reassign all the stripped values in place:
In [71]:
df['City'] = df['City'].str.strip()
First thing the group decided to tackle was standardizing the name "EAST." It currently is "E ", "E." and "EAST"
To make the list more manageable, we can find all cells that start with "E":
In [73]:
df[df['City'].str.startswith("E")]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-73-3f556c32aa54> in <module>()
----> 1 df[df['City'].str.startswith("E")]
/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1735 if isinstance(key, (Series, np.ndarray, Index, list)):
1736 # either boolean or fancy integer index
-> 1737 return self._getitem_array(key)
1738 elif isinstance(key, DataFrame):
1739 return self._getitem_frame(key)
/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_array(self, key)
1762 def _getitem_array(self, key):
1763 # also raises Exception if object array with NA values
-> 1764 if com._is_bool_indexer(key):
1765 # warning here just in case -- previously __setitem__ was
1766 # reindexing but __getitem__ was not; it seems more reasonable to
/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/common.pyc in _is_bool_indexer(key)
2039 if not lib.is_bool_array(key):
2040 if isnull(key).any():
-> 2041 raise ValueError('cannot index with vector containing '
2042 'NA / NaN values')
2043 return False
ValueError: cannot index with vector containing NA / NaN values
Or... we could, if all of the data was string data. Turns out some of it is missing (of course!)
The group discussed what to do with the data, and decided they wanted to convert it to the string "UNKNOWN"
Using the fillna command, we can do this in place(ish):
In [80]:
df['City'] = df['City'].fillna("UNKNOWN")
In [81]:
cities = df['City'].unique()
cities.sort()
In [82]:
cities
Out[82]:
array([u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBERTSON', u'ALBION',
u'AMITYVILLE', u'AUBURN', u'BABYLON', u'BALDWIN', u'BALWIN',
u'BARDONIA', u'BAYPORT', u'BAYVILLE', u'BELLEMORE', u'BELLEROSE',
u'BELLMORE', u'BETHPAGE', u'BOHEMIA', u'BRENTWOOD', u'BREWSTER',
u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKYN',
u'CANADAIGUA', u'CANANDAIGUA', u'CANANDALGUA', u'CAPITAL STATION',
u'CARL PLACE', u'CARLE PLACE', u'CARMEL', u'CIRCLEVILLE', u'COHOES',
u'COMMACK', u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO',
u'E ISLIP', u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW',
u'E. NORTHPORT', u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY',
u'E. SETAUKET', u'E. WILLISTON', u'EAST', u'EAST GREENBUSH',
u'EAST ISLIP', u'EAST MEADOW', u'EAST NASSAU', u'EAST NORTHPOND',
u'EAST NORTHPORT', u'EAST NORWICH', u'EAST ROCKAWAY',
u'EAST SETAUKET', u'ELMIRA', u'ELMONT', u'FARMINGDALE',
u'FARMINGVILLE', u'FRANKLIN', u'FRANKLIN SQUARE', u'FREEPORT',
u'FRESH MEADOW', u'FT HAMILTON STATE', u'GARDEN CITY',
u'GARDEN CITY', u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN',
u'GREAT NECK', u'GREAT NECK', u'GREAT NECK', u'HAMPTON',
u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
u'HUNTINGTON STATION', u'HUNTINGTON STAT', u'HUNTINGTON STATION',
u'HUNTINGTONST', u'IRVINGTON-ON-HU', u'ISLAND PARK', u'ISLIP',
u'JAMAICA ESTATES', u'JEFFERSON', u'JEFFERSON VALLEY',
u'JEFFERSON VALLEY', u'JEFFERSON VALLE', u'JEFFERSON VALLEY',
u'JERICHO', u'KINDERHOOK', u'LANCASTER', u'LAWRENCE', u'LEVITOWN',
u'LEVITTOWN', u'LEVITTSTOWN', u'LIDO BEACH', u'LINDENHURST',
u'LOCUST VALLEY', u'LOCUST VALLEY', u'LONG BEACH', u'LONG BEACH',
u'LONGBEACH', u'LYNBROOK', u'MALVERNE', u'MANORVILLE',
u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA PARK',
u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD', u'MELVILLE',
u'MERCERVILLE', u'MERICK', u'MERRICK', u'MIDDLE VILLAGE',
u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA', u'MT. SINAI',
u'N MERRICK', u'N VALLEY', u'N VALLEY STREAM', u'N. BABYLON',
u'N. BALDWIN', u'N. MASSAPEQUA', u'N. MERRICK', u'N. VALLEY',
u'N. VALLEY STREAM', u'N.MASSAPEQUA', u'N.VALLEY STREAM', u'NANUET',
u'NESCONSET', u'NEW YORK', u'NEW YORK', u'NEW BALTIMORE',
u'NEW HAMPTON', u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE',
u'NEW YORK', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT WASHINGTON',
u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
u'SANDS POINT', u'SARATOGA SPRING', u'SCHENECTADY', u'SEA CLIFF',
u'SEACLIFF', u'SEAFORD', u'SEATAUKET', u'SETAUKET', u'SHIRLEY',
u'SHOREHAM', u'SHOREHARN', u'SMITHTOWN', u'SOUTHOLD', u'ST JAMES',
u'ST. JAMES', u'STATEN', u'STATEN ISLAND', u'STATEN ISLAND',
u'STONY POINT', u'SUFFERN', u'SYOSSET', u'SYRACUSE', u'UNIONDALE',
'UNKNOWN', u'UPPER NYACK', u'VALLEY STREAM', u'W BABYLON',
u'W ISLIP', u'W. BABYLON', u'W. HEMPSTEAD', u'W. ISLIP',
u'W.BABYLON', u'WADING RIVER', u'WANTAGH', u'WASHINGTONVILLE',
u'WEST BABLYON', u'WEST BABYLON', u'WEST BURY', u'WEST ISLIP',
u'WEST MERRICK', u'WESTBURY', u'WESTBUTY', u'WESYBURY',
u'WHITESTONE', u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK',
u'WILLISTON PK', u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)
Ok, no more nan, yay!
Let's try that again:
In [85]:
df['City'].str.startswith('E')
Out[85]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
...
3266 False
3267 False
3268 False
3269 False
3270 False
3271 False
3272 False
3273 False
3274 False
3275 False
3276 False
3277 False
3278 False
3279 False
3280 True
Name: City, Length: 3281, dtype: bool
We can now select all the rows in our dataframe by feeding the True/False back into the dataframe:
In [86]:
df[df['City'].str.startswith("E")]
Out[86]:
DATE
NAME OF INITIATIVE/ADDRESS
City
State
Zip Code
CHECK NO.
AMOUNT ($ )
RECORD DATE
Party
Source
123
04/14/11
BROOKHAVEN DEMOCRATIC COMM;PO BOX 561
EAST
NY
11733
1538
500
JUL-07-11 04:21 PM
D
PBA
362
10/03/01
ENGLEBRIGHT FOR STATE ASSEMBLY;PO BOX 2703
EAST SETAUKET
NY
11733
1523
250
OCT-23-01 12:08 PM
D
SOA
363
02/09/99
ENGLEBRITE FOR STATE ASSEMBLY;P.O. BOX 2703
EAST SETAUKET
NY
11733
1250
200
JUL-07-99 09:05 AM
D
SOA
390
11/07/11
FRIENDS OF CARRIE SOLAGES;1630 DUTCH BROADWAY
ELMONT
NY
11003
2907
2000
NOV-29-11 12:00 AM
D
SOA
484
02/12/99
FRIENDS OF ENGELBRIGHT;POB 703
E. SETAUKET
NY
11733
NaN
100
JUN-10-99 06:04 PM
D
DETECTIVES
727
04/04/00
FRIENDS OF STEVE ENGLEBRITE;P.O. BOX 2703
EAST SETAUKET
NY
11733
1356
250
JUL-10-00 03:03 PM
D
SOA
871
11/03/11
NASSAU DEMOCRATIC COMMITTEE;6 NEIGHBOUR WAY
EAST NASSAU
NY
12062
1000
31659.77
DEC-05-11 12:00 AM
D
SAFE NASSAU
1128
03/08/10
CITIZENS FOMM TO RE-ELECT KEN LA VALLE;9 BERKS...
E. GREENBUSH
NY
12061
1405
500
JUL-15-10 10:31 AM
R
PBA
1150
08/03/99
CITIZENS FOR CIOTTI;P.O. BOX 03068
ELMONT
NY
11003
1307
450
AUG-11-99 07:39 PM
R
SOA
1166
09/21/05
CITIZENS FOR GREG PETERSON;PO BOX 455
EAST MEADOW
NY
11554
2171
10000
OCT-07-05 03:02 PM
R
PBA
1167
10/05/05
CITIZENS FOR GREG PETERSON;PO BOX 455
EAST MEADOW
NY
11554
2186
10000
OCT-25-05 11:29 AM
R
PBA
1250
09/13/10
CITIZENS FOR MONTESANO;25 MEADOWLARK DRIVE
E. NORTHPORT
NY
11731
1475
250
SEP-30-10 11:02 AM
R
PBA
1288
06/10/05
CITIZENS FOR PETERSON;PO BOX 455
EAST MEADOW
NY
11554
1988
350
JUL-11-05 03:06 PM
R
DETECTIVES
1289
09/21/05
CITIZENS FOR PETERSON;PO BOX 455
EAST MEADOW
NY
11554
2012
1250
OCT-07-05 05:17 PM
R
DETECTIVES
1290
09/07/05
CITIZENS FOR PETERSON;PO BOX 455
EAST MEADOW
NY
11554
2078
1500
SEP-27-05 11:53 AM
R
SOA
1332
09/15/05
CITIZENS FOR SANTINO;PO BOX 22
EAST ROCKAWAY
NY
11515
2160
500
OCT-07-05 02:53 PM
R
PBA
1333
10/24/01
CITIZENS FOR SANTINO;POB 22
E. ROCKAWAY
NY
11518
NaN
100
NOV-26-01 02:28 PM
R
DETECTIVES
1334
06/01/07
CITIZENS FOR SANTINO;POB 22
E. ROCKAWAY
NY
11518
1605
100
JUL-05-07 05:27 PM
R
DETECTIVES
1335
06/12/09
CITIZENS FOR SANTINO;P.O. BOX 22
E. ROCKAWAY
NY
11518
1089
100
JUL-13-09 02:19 PM
R
DETECTIVES
1336
05/07/00
CITIZENS FOR SANTINO;P.O. BOX 22
EAST ROCKAWAY
NY
11518
1371
500
JUL-10-00 03:40 PM
R
SOA
1338
08/04/05
CITIZENS FOR SANTINO;PO BOX 22
EAST ROCKAWAY
NY
11518
2053
200
AUG-08-05 12:34 PM
R
SOA
1358
09/05/02
COMM. TO ELECT JIM ALESSI;PO BOX 200
E. ROCHESTER
NY
14445
1881
200
OCT-03-02 10:22 AM
R
PBA
1406
05/06/99
COMMITTEE TO RE-ELECT R. GAFFNEY;90 MERRICK AV...
EAST MEADOW
NY
11554
1613
1000
JUL-22-99 11:26 AM
R
PBA
1409
09/27/07
COMMITTEE TO REELECT ROBERT SCHMIDT;POB 320
E. NORWICH
NY
11732
1635
500
OCT-05-07 05:23 PM
R
DETECTIVES
1410
09/04/07
COMMITTEE TO RE-ELECT ROBERT SCHMIDT SUPREME C...
EAST MEADOW
NY
11554
2434
500
OCT-01-07 03:31 PM
R
SOA
1424
01/14/08
COMMITTEE TO RE-ELECT SENATOR BRUNO;4 SPRUCE RUN
EAST GREENBUSH
NY
12061
1096
1000
OCT-23-09 01:29 PM
R
PBA
1438
10/20/00
COUNCILMAN JOE KEARNEY ELECTION COMMITTEE;1199...
ELMONT
NY
11580
1418
250
OCT-24-00 11:40 AM
R
SOA
1443
07/20/99
E. ROCKAWAY REPUBLICAN COMM;94 FRANKLIN AVE
E. ROCKAWAY
NY
11518
1622
100
OCT-08-99 10:14 AM
R
PBA
1472
09/11/00
FRIENDS FOR ANGIE M. CULLEN;90 MERRICK AVENUE
EAST MEADOW
NY
11554
1402
125
SEP-19-00 09:18 AM
R
SOA
1561
04/04/07
FRIENDS FOR NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
2391
500
JUL-16-07 05:32 PM
R
SOA
...
...
...
...
...
...
...
...
...
...
...
2265
09/23/03
FRIENDS OF NORMA GONSALVES;1901 MERCUM ST.
EAST MEADOW
NY
11554
1980
1500
OCT-03-03 12:50 PM
R
PBA
2266
09/02/05
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
2155
500
OCT-07-05 03:09 PM
R
PBA
2267
09/26/05
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
2177
1988
OCT-07-05 03:09 PM
R
PBA
2268
04/23/09
FRIENDS OF NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
1265
700
JUL-08-09 02:28 PM
R
PBA
2269
09/13/10
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
1477
1000
SEP-30-10 11:03 AM
R
PBA
2270
03/14/05
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
1957
250
JUL-11-05 01:56 PM
R
DETECTIVES
2271
06/21/05
FRIENDS OF NORMA GONSALVES;1901 MARION STREET
EAST MEADOW
NY
11554
1994
100
JUL-11-05 03:11 PM
R
DETECTIVES
2272
04/09/09
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
1065
350
JUL-13-09 02:08 PM
R
DETECTIVES
2273
08/19/09
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
1109
495
SEP-04-09 12:51 PM
R
DETECTIVES
2274
09/21/10
FRIENDS OF NORMA GONSALVES;1901 MERION STREET
EAST MEADOW
NY
11554
1216
250
SEP-22-10 03:15 PM
R
DETECTIVES
2275
08/11/03
FRIENDS OF NORMA GONSALVES;1901 MERION ST.
EAST MEADOW
NY
11554
1780
250
SEP-03-03 01:46 PM
R
SOA
2276
10/09/03
FRIENDS OF NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
1808
1000
OCT-24-03 02:03 PM
R
SOA
2277
03/21/05
FRIENDS OF NORMA GONSALVES;1901 MERION ST.
EAST MEADOW
NY
11554
1988
250
JUL-13-05 04:00 PM
R
SOA
2278
10/13/05
FRIENDS OF NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
2090
2000
OCT-27-05 03:40 PM
R
SOA
2279
04/30/09
FRIENDS OF NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
2650
500
JUL-13-09 04:45 PM
R
SOA
2280
09/21/09
FRIENDS OF NORMA GONSALVES;1901 MERION ST
EAST MEADOW
NY
11554
2700
1000
SEP-24-09 03:30 PM
R
SOA
2281
08/03/99
FRIENDS OF NORMA GONZALEZ;1901 MERION STREET
EAST MEADOW
NY
11554
1308
300
AUG-11-99 07:40 PM
R
SOA
2322
11/02/11
FRIENDS OF PAT MAHER;335 SPRING DRIVE
EAST MEADOW
NY
11554
1320
2000
DEC-01-11 12:00 AM
R
DETECTIVES
2323
10/27/11
FRIENDS OF PAT MAHER;335 SPRING DR
EAST MEADOW
NY
11554
2905
1000
NOV-29-11 12:00 AM
R
SOA
2371
06/19/07
FRIENDS OF PHIL BOYLE;136 E MAIN STREET
E. ISLIP
NY
11730
1614
200
AUG-16-07 04:12 PM
R
DETECTIVES
2372
06/12/00
FRIENDS OF PHIL BOYLE;15 STEWART STREET
EAST ISLIP
NY
11730
1377
125
JUL-10-00 03:47 PM
R
SOA
2373
07/13/01
FRIENDS OF PHIL BOYLE;15 STEWART ST
E ISLIP
NY
11730
1499
400
AUG-09-01 10:07 PM
R
SOA
2374
03/01/02
FRIENDS OF PHIL BOYLE;15 STEWART ST
E ISLIP
NY
11730
1564
200
JUL-03-02 10:20 AM
R
SOA
2375
06/19/07
FRIENDS OF PHIL BOYLE;136 E. MAIN ST.
EAST ISLIP
NY
11730
2417
200
JUL-16-07 05:35 PM
R
SOA
2431
09/27/07
FRIENDS OF ROBERT SCHMIDT;PO BOX 320
EAST NORWICH
NY
11732
1635
500
SEP-28-07 04:24 PM
R
DETECTIVES
2464
01/09/12
FRIENDS OF SENATOR JACK MARTIN;192 CLEARMEADOW...
EAST MEADOW
NY
11554
1325
500
JAN-16-12 12:00 AM
R
DETECTIVES
2505
02/23/06
FRIENDS OF TOM MCKEVITT;P.O. BOX 455
E. MEADOW
NY
11554
2039
500
JUN-27-06 12:52 PM
R
DETECTIVES
2740
10/03/01
N. HEPMSTEAD REPUBLICAN COMMITTEE;C/O AXELROD ...
EAST MEADOW
NY
11554
1522
400
OCT-23-01 01:04 PM
R
SOA
3195
10/20/10
TAXPAYERS FOR LENAHAN;17 EVERDELL ROAD
EAST ROCKAWAY
NY
11518
1221
100
NOV-29-10 04:58 PM
R
DETECTIVES
3280
06/30/09
;235 CARMEN AVE
EAST ROCKAWAY
NY
11518
1290
$ 600
JUL-08-09 02:59 PM
R
PBA
86 rows × 10 columns
Let's create a new dataframe that only contains the cities that start with "E" that we can play with:
In [87]:
ecities = df[df['City'].str.startswith("E")]
In [89]:
ecities['City'].unique()
Out[89]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'E. SETAUKET',
u'EAST NASSAU', u'E. GREENBUSH', u'EAST MEADOW', u'E. NORTHPORT',
u'EAST ROCKAWAY', u'E. ROCKAWAY', u'E. ROCHESTER', u'E. NORWICH',
u'EAST GREENBUSH', u'E. MEADOW', u'ELMIRA', u'EAST NORTHPOND',
u'EAST NORTHPORT', u'E. WILLISTON', u'E. ISLIP', u'EAST ISLIP',
u'E ISLIP', u'EAST NORWICH'], dtype=object)
It's stil petty much a mess. However, using regular expressions, we can replace every "E " or "E. " with the string "EAST "
breaking replace(r"\s*E\.?\s", "EAST ") down into chunks:
r = use regular expression language
\s = space
* = wildcard (anything)
\.? = one or no "." characters
In [91]:
ecities['City'].str.replace(r"\s*E\.?\s", "EAST ").unique()
Out[91]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'EAST NASSAU',
u'EAST GREENBUSH', u'EAST MEADOW', u'EAST NORTHPORT',
u'EAST ROCKAWAY', u'EAST ROCHESTER', u'EAST NORWICH', u'ELMIRA',
u'EAST NORTHPOND', u'EAST WILLISTON', u'EAST ISLIP'], dtype=object)
In [92]:
df['City'] = df['City'].str.replace(r"\s*E\.?\s", "EAST ")
In [96]:
df[df['City'].str.startswith("E")]['City'].unique()
Out[96]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'EAST NASSAU',
u'EAST GREENBUSH', u'EAST MEADOW', u'EAST NORTHPORT',
u'EAST ROCKAWAY', u'EAST ROCHESTER', u'EAST NORWICH', u'ELMIRA',
u'EAST NORTHPOND', u'EAST WILLISTON', u'EAST ISLIP'], dtype=object)
In [ ]:
Content source: ipython/mozfest2014
Similar notebooks: