Intro

For one of the breakout sessions in the "Dealing with Messy Data" session run by Milena and Yuandra dealt with getting data into pandas and trying to fix as much as feasible using pandas. Intro by Jeramia, panda-fu by Kyle.

First we need to import pandas


In [49]:
import pandas as pd

Using IPython notebook's shell magic, we can see what excel files exist in the directory:


In [50]:
!ls *.xlsx


Nassau_police_union_contribs_dirty.xlsx
~$Nassau_police_union_contribs_dirty.xlsx

We can read the excel file (warts and all) using the read_excel command:


In [51]:
df = pd.read_excel('Nassau_police_union_contribs_dirty.xlsx')

Display an (abbreviated) contents of the dataFrame using IPython notebooks display magic:


In [52]:
df


Out[52]:
DATE NAME OF INITIATIVE/ADDRESS City State Zip Code CHECK NO. AMOUNT ($ ) RECORD DATE Party Source
0 02/21/02 BRONX CONSERVATIVE COMM;415 MINNIEFORD AVE BRONX N.Y. 10458 1562 $ 140 JUL-03-02 10:16 AM C SOA
1 02/23/09 BRONX CONSERVATIVE COMM.;188D EDGEWATER PARK BRONX NY 10465 2628 100 JUL-13-09 04:13 PM C SOA
2 03/01/00 BRONX CONSERVATIVE COMMITTEE;475 MINNEFORD AVE. BRONX NY 10464 1346 405 JUL-10-00 02:50 PM C SOA
3 03/04/03 BRONX CONSERVATIVE COMMITTEE;188 EDGEWATER PARK BRONX NY 10465 1727 100 JUL-14-03 02:41 PM C SOA
4 03/18/04 BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK BRONX NY 10465 1869 100 JUL-14-04 01:09 PM C SOA
5 03/01/05 BRONX CONSERVATIVE COMMITTEE;1880 EDGEWATER PARK BRONX NY 10465 1975 100 JUL-13-05 03:48 PM C SOA
6 02/26/07 BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK BRONX NY 10465 2371 100 JUL-16-07 05:32 PM C SOA
7 03/20/08 BRONX CONSERVATIVE COMMITTEE;188D EDGEMERE PARK BRONX NY 10465 2504 100 JUL-15-08 02:53 PM C SOA
8 05/25/04 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBO... LEVITTOWN NY 11756 1902 200 JUL-14-04 02:08 PM C DETECTIVES
9 01/20/09 NASSAU CO. CONSERVATIVE COMMITTEE;105 BOBOLINK LA LEVITTOWN NY 11756 2603 200 JUL-13-09 03:51 PM C SOA
10 04/30/09 NASSAU CO. CONSERVATIVE COMMITTEE;105 BOBLINK LA LEVITTOWN NY 11756 2646 250 JUL-13-09 04:42 PM C SOA
11 02/15/05 NASSAU CO. CONSERVATIVE PARTY;105 BOBOLINK LANE LEVITTOWN NY 11756 1968 120 JUL-13-05 03:34 PM C SOA
12 04/14/11 NASSAU CO. CONSERVATIVE PARTY;1 SYDNEY ST PLAINVIEW NY 11803 2854 $ 200 JUL-11-11 03:19 PM C SOA
13 05/28/02 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN N.Y. 11756 1858 $ 300 JUL-15-02 10:36 AM C PBA
14 05/23/03 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1953 2000 JUL-11-03 02:42 PM C PBA
15 05/23/03 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1954 300 JUL-11-03 02:43 PM C PBA
16 06/01/04 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBLI... LEVITTOWN NY 11756 2032 $ 3500 JUL-15-04 12:08 PM C PBA
17 05/05/06 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 2238 400 JUN-16-06 09:09 AM C PBA
18 02/13/09 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1236 400 JUL-08-09 01:26 PM C PBA
19 04/10/09 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1263 1500 JUL-08-09 02:27 PM C PBA
20 05/05/11 NASSAU COUNTY CONSERVATIVE COMMITTEE;1 SYDNEY ... PLAINVIEW NY 11803 1548 100 JUL-08-11 08:53 AM C PBA
21 06/10/11 NASSAU COUNTY CONSERVATIVE COMMITTEE;1 SYDNEY ST PLAINVIEW NY 11803 1567 200 JUL-08-11 09:57 AM C PBA
22 02/16/99 NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS... PLAINVIEW NY 11803 NaN 200 JUN-10-99 06:06 PM C DETECTIVES
23 02/12/03 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOB O... LEVITTOWN NY 11756 1802 $ 200 JUN-19-03 03:38 PM C DETECTIVES
24 04/26/05 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOB O... LEVITTOWN NY 11756 1976 200 JUL-11-05 02:19 PM C DETECTIVES
25 02/09/06 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 2034 200 JUN-27-06 12:47 PM C DETECTIVES
26 05/10/07 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1602 200 JUL-05-07 05:26 PM C DETECTIVES
27 02/09/99 NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS... PLAINVIEW NY 11803 1253 250 JUL-07-99 10:35 AM C SOA
28 04/27/00 NASSAU COUNTY CONSERVATIVE COMMITTEE;36 SUNRIS... PLAINVIEW NY 11803 1364 300 JUL-10-00 03:22 PM C SOA
29 01/24/02 NASSAU COUNTY CONSERVATIVE COMMITTEE;105 BOBOL... LEVITTOWN NY 11756 1551 400 JUL-03-02 10:00 AM C SOA
... ... ... ... ... ... ... ... ... ... ...
3251 09/06/06 TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS... WESTBURY NY 11590 2333 300 SEP-21-06 04:36 PM R SOA
3252 03/19/07 TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS... WESTBURY NY 11590 2389 250 JUL-16-07 05:32 PM R SOA
3253 03/20/08 TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS... WESTBURY NY 11590 2503 250 JUL-15-08 02:52 PM R SOA
3254 07/08/09 TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS... WESTBURY NY 11590 2676 150 JUL-13-09 05:15 PM R SOA
3255 07/06/11 TOWN OF HEMPSTEAD REPUBLICAN COMMITTEE;164 POS... WESTBURY NY 11590 2883 300 JUL-11-11 03:52 PM R SOA
3256 04/06/10 TOWN OF N. HEMPSTEAD REPUBLICAN COMM;164 POST ... WESTBURY NY 11590 1424 625 JUL-15-10 10:51 AM R PBA
3257 10/19/01 TOWN OF N. HEMPSTEAD REPUBLICAN COMMITTEE;164 ... WESTBURY NY 11590 1812 400 NOV-28-01 02:20 PM R PBA
3258 06/01/00 TOWN OF N. HEMPSTEAD REPUBLICAN COMMITTEE;164 ... WESTBURY NY 11590 1375 1550 JUL-10-00 03:44 PM R SOA
3259 07/16/10 TOWN OF NORTH HEMPSTEAD REPUBLICAN COMMITTEE;1... WESTBURY NY 11590 1205 550 AUG-11-10 02:13 PM R DETECTIVES
3260 10/01/10 TOWN OF OYSTER BAY REPUBLICAN COMMITTEE;164 PO... WESTBURY NY 11590 1486 $ 1200 OCT-22-10 11:33 AM R PBA
3261 04/05/02 TRUNZO CAMPAIGN FUND;105 WASHINGTON AVE. BRENTWOOD NY 11717 1847 350 JUL-15-02 10:28 AM R PBA
3262 09/26/05 VALLEY STREAM REPUBLICAN CLUB;362 PICCADILLY POND LYNBROOK NY 11563 2184 250 OCT-07-05 03:13 PM R PBA
3263 09/14/09 VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D... LYNBROOK NY 11563 1320 400 SEP-30-09 12:54 PM R PBA
3264 10/15/09 VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D... LYNBROOK NY 11563 1338 2000 OCT-21-09 09:46 AM R PBA
3265 07/19/11 VALLEY STREAM REPUBLICAN COMM;362 PICCADILLY D... LYNBROOK NY 11563 1582 1000 OCT-03-11 02:26 PM R PBA
3266 09/06/07 VALLEY STREAM REPUBLICAN COMMITTEE;362 PICCADI... LYNBROOK NY 11563 1056 225 OCT-05-07 12:55 PM R PBA
3267 09/01/10 VALLEY STREAM REPUBLICAN COMMITTEE;362 PICCADI... LYNBROOK NY 11563 1472 400 SEP-30-10 11:01 AM R PBA
3268 03/02/01 VOLKER CAMPAIGN COMMITTEE;PO BOX 494 LANCASTER NY 14086 1747 $ 300 JUL-13-01 03:09 PM R PBA
3269 08/27/99 WANTAGH GOP;775 WANTAGH AVENUE WANTAGH NY 11793 1309 $ 100 AUG-31-99 05:04 PM R SOA
3270 07/14/01 WANTAGH GOP;75 WANTAGH AVE WANTAGH NY 11793 1497 700 AUG-09-01 10:01 PM R SOA
3271 08/03/99 WANTAGH REPUBLICAN COMMITTEE;2196 BROOKSIDE AV... WANTAGH NY 11793 1305 550 AUG-11-99 07:35 PM R SOA
3272 08/25/10 WOODMERE REPUBLICAN CLUB;36 CENTER STREET WOODMERE NY 11598 1210 200 AUG-31-10 06:09 PM R DETECTIVES
3273 03/14/11 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 1526 500 JUL-07-11 04:12 PM R PBA
3274 06/06/11 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 1560 250 JUL-08-11 09:01 AM R PBA
3275 01/09/12 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 1618 500 JAN-16-12 12:00 AM R PBA
3276 01/11/12 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 1621 500 JAN-16-12 12:00 AM R PBA
3277 01/09/12 ZELDIN FOR SENATE;47 FLINTLOCK DRIVE SHIRLEY NY 11967 1326 500 JAN-16-12 12:00 AM R DETECTIVES
3278 06/06/11 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 2872 250 JUL-11-11 03:37 PM R SOA
3279 01/09/12 ZELDIN FOR SENATE;PO BOX 628 SHIRLEY NY 11967 2916 500 JAN-12-12 12:00 AM R SOA
3280 06/30/09 ;235 CARMEN AVE EAST ROCKAWAY NY 11518 1290 $ 600 JUL-08-09 02:59 PM R PBA

3281 rows × 10 columns

Looking at the data, we can tell it has all kinds of problems.

For the remainder of this breakout, we focused on trying to get the City column into better shape.

To look at only the City column using IPython magic, we can type:


In [54]:
df['City']


Out[54]:
0           BRONX
1           BRONX
2           BRONX
3           BRONX
4           BRONX
5           BRONX
6           BRONX
7           BRONX
8       LEVITTOWN
9       LEVITTOWN
10      LEVITTOWN
11      LEVITTOWN
12      PLAINVIEW
13      LEVITTOWN
14      LEVITTOWN
...
3266         LYNBROOK
3267       LYNBROOK  
3268      LANCASTER  
3269          WANTAGH
3270          WANTAGH
3271          WANTAGH
3272         WOODMERE
3273          SHIRLEY
3274          SHIRLEY
3275          SHIRLEY
3276        SHIRLEY  
3277          SHIRLEY
3278          SHIRLEY
3279          SHIRLEY
3280    EAST ROCKAWAY
Name: City, Length: 3281, dtype: object

This still doesn't give us a great idea of how big a mess we have, so let's see how many unique names exist and sort them by name:


In [58]:
cities = df['City'].unique()
cities.sort()
cities


Out[58]:
array([nan, u'   MASSAPEQUA', u'   WANTAGH', u'  ALBANY', u'  BALDWIN',
       u'  BRONX', u'  BROOKLYN', u'  COHOES', u'  GREAT NECK',
       u'  HEMPSTEAD', u'  HUNTINGTON', u'  JEFFERSON VALLEY',
       u'  LEVITTOWN', u'  LEVITTOWN ', u'  MERRICK', u'  MIDDLE VILLAGE',
       u'  STATEN ISLAND', u'  WANTAGH', u'  WESTBURY',
       u'  WILLISTON PARK', u'  WOODMERE', u' ALBANY', u' BALDWIN',
       u' BROOKLYN', u' CARMEL', u' EAST NASSAU', u' GARDEN   CITY',
       u' GARDEN CITY', u' GREAT   NECK', u' GREAT NECK', u' HUNTINGTON',
       u' JERICHO', u' LIDO BEACH', u' LYNBROOK', u' MASSAPEQUA',
       u' MASSAPEQUA  PARK', u' MERRICK', u' MINEOLA', u' PLAINVIEW',
       u' ROSLYN', u' SHIRLEY', u' SOUTHOLD', u' STATEN ISLAND',
       u' STATEN ISLAND  ', u' WEST ISLIP', u' WESTBURY', u' WOODBURY',
       u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBANY ', u'ALBANY  ',
       u'ALBANY    ', u'ALBERTSON', u'ALBION', u'AMITYVILLE', u'AUBURN',
       u'BABYLON', u'BALDWIN', u'BALWIN', u'BARDONIA', u'BAYPORT',
       u'BAYVILLE', u'BELLEMORE', u'BELLEROSE', u'BELLMORE', u'BETHPAGE',
       u'BOHEMIA', u'BRENTWOOD', u'BRENTWOOD  ', u'BREWSTER',
       u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKLYN ',
       u'BROOKLYN  ', u'BROOKYN', u'CANADAIGUA', u'CANANDAIGUA',
       u'CANANDALGUA', u'CAPITAL STATION', u'CARL PLACE', u'CARLE PLACE',
       u'CARMEL', u'CIRCLEVILLE', u'COHOES', u'COHOES  ', u'COMMACK',
       u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO', u'E ISLIP',
       u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW', u'E. NORTHPORT',
       u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY', u'E. SETAUKET',
       u'E. WILLISTON', u'EAST', u'EAST GREENBUSH', u'EAST ISLIP',
       u'EAST MEADOW', u'EAST NORTHPOND', u'EAST NORTHPORT',
       u'EAST NORWICH', u'EAST ROCKAWAY', u'EAST SETAUKET', u'ELMIRA',
       u'ELMONT', u'FARMINGDALE', u'FARMINGVILLE', u'FRANKLIN',
       u'FRANKLIN SQUARE', u'FREEPORT', u'FREEPORT ', u'FRESH MEADOW',
       u'FT HAMILTON STATE', u'GARDEN CITY', u'GARDEN CITY ',
       u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN', u'GREAT   NECK',
       u'GREAT  NECK', u'GREAT NECK', u'GREAT NECK  ', u'HAMPTON',
       u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
       u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
       u'HUNTINGTON ', u'HUNTINGTON   STATION', u'HUNTINGTON STAT',
       u'HUNTINGTON STATION', u'HUNTINGTONST', u'IRVINGTON-ON-HU',
       u'ISLAND PARK', u'ISLIP', u'JAMAICA ESTATES', u'JEFFERSON',
       u'JEFFERSON   VALLEY', u'JEFFERSON  VALLEY', u'JEFFERSON VALLE',
       u'JEFFERSON VALLEY', u'JERICHO', u'KINDERHOOK', u'LANCASTER',
       u'LANCASTER  ', u'LAWRENCE', u'LEVITOWN', u'LEVITTOWN',
       u'LEVITTOWN ', u'LEVITTSTOWN', u'LINDENHURST', u'LINDENHURST  ',
       u'LOCUST  VALLEY', u'LOCUST VALLEY', u'LONG  BEACH', u'LONG BEACH',
       u'LONGBEACH', u'LYNBROOK', u'LYNBROOK ', u'LYNBROOK  ', u'MALVERNE',
       u'MANORVILLE', u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA ',
       u'MASSAPEQUA  ', u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD',
       u'MELVILLE', u'MERCERVILLE', u'MERICK', u'MERRICK',
       u'MIDDLE   VILLAGE', u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA',
       u'MINEOLA  ', u'MT. SINAI', u'N MERRICK', u'N VALLEY',
       u'N VALLEY STREAM', u'N. BABYLON', u'N. BALDWIN', u'N. MASSAPEQUA',
       u'N. MERRICK', u'N. VALLEY', u'N. VALLEY STREAM', u'N.MASSAPEQUA',
       u'N.MASSAPEQUA ', u'N.VALLEY STREAM', u'NANUET', u'NESCONSET',
       u'NEW   YORK', u'NEW  YORK', u'NEW BALTIMORE', u'NEW HAMPTON',
       u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE', u'NEW YORK',
       u'NEW YORK ', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
       u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
       u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
       u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
       u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
       u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
       u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT   WASHINGTON',
       u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
       u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
       u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
       u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
       u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
       u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
       u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
       u'SANDS POINT', u'SARATOGA SPRING', u'SARATOGA SPRING  ',
       u'SCHENECTADY', u'SEA CLIFF', u'SEACLIFF', u'SEAFORD', u'SEATAUKET',
       u'SETAUKET', u'SHIRLEY', u'SHIRLEY  ', u'SHOREHAM', u'SHOREHARN',
       u'SMITHTOWN', u'ST JAMES', u'ST. JAMES', u'STATEN',
       u'STATEN  ISLAND', u'STATEN ISLAND', u'STONY POINT', u'SUFFERN',
       u'SYOSSET', u'SYRACUSE', u'UNIONDALE', u'UPPER NYACK',
       u'VALLEY STREAM', u'W BABYLON', u'W ISLIP', u'W. BABYLON',
       u'W. HEMPSTEAD', u'W. ISLIP', u'W.BABYLON', u'WADING RIVER',
       u'WANTAGH', u'WASHINGTONVILLE', u'WEST BABLYON', u'WEST BABYLON',
       u'WEST BURY', u'WEST ISLIP', u'WEST MERRICK', u'WESTBURY',
       u'WESTBURY  ', u'WESTBUTY', u'WESYBURY', u'WHITESTONE',
       u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK', u'WILLISTON PK',
       u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)

Wow, that's... a mess. The most obvious mess is trailing and leading white space. Let's get rid of it with .strip()


In [70]:
cities = df['City'].str.strip().unique()
cities.sort()
cities


Out[70]:
array([nan, u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBERTSON', u'ALBION',
       u'AMITYVILLE', u'AUBURN', u'BABYLON', u'BALDWIN', u'BALWIN',
       u'BARDONIA', u'BAYPORT', u'BAYVILLE', u'BELLEMORE', u'BELLEROSE',
       u'BELLMORE', u'BETHPAGE', u'BOHEMIA', u'BRENTWOOD', u'BREWSTER',
       u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKYN',
       u'CANADAIGUA', u'CANANDAIGUA', u'CANANDALGUA', u'CAPITAL STATION',
       u'CARL PLACE', u'CARLE PLACE', u'CARMEL', u'CIRCLEVILLE', u'COHOES',
       u'COMMACK', u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO',
       u'E ISLIP', u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW',
       u'E. NORTHPORT', u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY',
       u'E. SETAUKET', u'E. WILLISTON', u'EAST', u'EAST GREENBUSH',
       u'EAST ISLIP', u'EAST MEADOW', u'EAST NASSAU', u'EAST NORTHPOND',
       u'EAST NORTHPORT', u'EAST NORWICH', u'EAST ROCKAWAY',
       u'EAST SETAUKET', u'ELMIRA', u'ELMONT', u'FARMINGDALE',
       u'FARMINGVILLE', u'FRANKLIN', u'FRANKLIN SQUARE', u'FREEPORT',
       u'FRESH MEADOW', u'FT HAMILTON STATE', u'GARDEN   CITY',
       u'GARDEN CITY', u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN',
       u'GREAT   NECK', u'GREAT  NECK', u'GREAT NECK', u'HAMPTON',
       u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
       u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
       u'HUNTINGTON   STATION', u'HUNTINGTON STAT', u'HUNTINGTON STATION',
       u'HUNTINGTONST', u'IRVINGTON-ON-HU', u'ISLAND PARK', u'ISLIP',
       u'JAMAICA ESTATES', u'JEFFERSON', u'JEFFERSON   VALLEY',
       u'JEFFERSON  VALLEY', u'JEFFERSON VALLE', u'JEFFERSON VALLEY',
       u'JERICHO', u'KINDERHOOK', u'LANCASTER', u'LAWRENCE', u'LEVITOWN',
       u'LEVITTOWN', u'LEVITTSTOWN', u'LIDO BEACH', u'LINDENHURST',
       u'LOCUST  VALLEY', u'LOCUST VALLEY', u'LONG  BEACH', u'LONG BEACH',
       u'LONGBEACH', u'LYNBROOK', u'MALVERNE', u'MANORVILLE',
       u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA  PARK',
       u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD', u'MELVILLE',
       u'MERCERVILLE', u'MERICK', u'MERRICK', u'MIDDLE   VILLAGE',
       u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA', u'MT. SINAI',
       u'N MERRICK', u'N VALLEY', u'N VALLEY STREAM', u'N. BABYLON',
       u'N. BALDWIN', u'N. MASSAPEQUA', u'N. MERRICK', u'N. VALLEY',
       u'N. VALLEY STREAM', u'N.MASSAPEQUA', u'N.VALLEY STREAM', u'NANUET',
       u'NESCONSET', u'NEW   YORK', u'NEW  YORK', u'NEW BALTIMORE',
       u'NEW HAMPTON', u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE',
       u'NEW YORK', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
       u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
       u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
       u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
       u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
       u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
       u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT   WASHINGTON',
       u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
       u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
       u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
       u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
       u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
       u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
       u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
       u'SANDS POINT', u'SARATOGA SPRING', u'SCHENECTADY', u'SEA CLIFF',
       u'SEACLIFF', u'SEAFORD', u'SEATAUKET', u'SETAUKET', u'SHIRLEY',
       u'SHOREHAM', u'SHOREHARN', u'SMITHTOWN', u'SOUTHOLD', u'ST JAMES',
       u'ST. JAMES', u'STATEN', u'STATEN  ISLAND', u'STATEN ISLAND',
       u'STONY POINT', u'SUFFERN', u'SYOSSET', u'SYRACUSE', u'UNIONDALE',
       u'UPPER NYACK', u'VALLEY STREAM', u'W BABYLON', u'W ISLIP',
       u'W. BABYLON', u'W. HEMPSTEAD', u'W. ISLIP', u'W.BABYLON',
       u'WADING RIVER', u'WANTAGH', u'WASHINGTONVILLE', u'WEST BABLYON',
       u'WEST BABYLON', u'WEST BURY', u'WEST ISLIP', u'WEST MERRICK',
       u'WESTBURY', u'WESTBUTY', u'WESYBURY', u'WHITESTONE',
       u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK', u'WILLISTON PK',
       u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)

We can work with this, let's reassign all the stripped values in place:


In [71]:
df['City'] = df['City'].str.strip()

First thing the group decided to tackle was standardizing the name "EAST." It currently is "E ", "E." and "EAST"

To make the list more manageable, we can find all cells that start with "E":


In [73]:
df[df['City'].str.startswith("E")]


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-73-3f556c32aa54> in <module>()
----> 1 df[df['City'].str.startswith("E")]

/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1735         if isinstance(key, (Series, np.ndarray, Index, list)):
   1736             # either boolean or fancy integer index
-> 1737             return self._getitem_array(key)
   1738         elif isinstance(key, DataFrame):
   1739             return self._getitem_frame(key)

/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   1762     def _getitem_array(self, key):
   1763         # also raises Exception if object array with NA values
-> 1764         if com._is_bool_indexer(key):
   1765             # warning here just in case -- previously __setitem__ was
   1766             # reindexing but __getitem__ was not; it seems more reasonable to

/Users/jeramiaory/miniconda/lib/python2.7/site-packages/pandas/core/common.pyc in _is_bool_indexer(key)
   2039             if not lib.is_bool_array(key):
   2040                 if isnull(key).any():
-> 2041                     raise ValueError('cannot index with vector containing '
   2042                                      'NA / NaN values')
   2043                 return False

ValueError: cannot index with vector containing NA / NaN values

Or... we could, if all of the data was string data. Turns out some of it is missing (of course!)

The group discussed what to do with the data, and decided they wanted to convert it to the string "UNKNOWN"

Using the fillna command, we can do this in place(ish):


In [80]:
df['City'] = df['City'].fillna("UNKNOWN")

In [81]:
cities = df['City'].unique()
cities.sort()

In [82]:
cities


Out[82]:
array([u'ALBANI', u'ALBANNY', u'ALBANY', u'ALBERTSON', u'ALBION',
       u'AMITYVILLE', u'AUBURN', u'BABYLON', u'BALDWIN', u'BALWIN',
       u'BARDONIA', u'BAYPORT', u'BAYVILLE', u'BELLEMORE', u'BELLEROSE',
       u'BELLMORE', u'BETHPAGE', u'BOHEMIA', u'BRENTWOOD', u'BREWSTER',
       u'BRIDGEPORT', u'BROCKPORT', u'BRONX', u'BROOKLYN', u'BROOKYN',
       u'CANADAIGUA', u'CANANDAIGUA', u'CANANDALGUA', u'CAPITAL STATION',
       u'CARL PLACE', u'CARLE PLACE', u'CARMEL', u'CIRCLEVILLE', u'COHOES',
       u'COMMACK', u'COOPERSTOWN', u'CORONA', u'CROTON-ON-HUDO',
       u'E ISLIP', u'E. GREENBUSH', u'E. ISLIP', u'E. MEADOW',
       u'E. NORTHPORT', u'E. NORWICH', u'E. ROCHESTER', u'E. ROCKAWAY',
       u'E. SETAUKET', u'E. WILLISTON', u'EAST', u'EAST GREENBUSH',
       u'EAST ISLIP', u'EAST MEADOW', u'EAST NASSAU', u'EAST NORTHPOND',
       u'EAST NORTHPORT', u'EAST NORWICH', u'EAST ROCKAWAY',
       u'EAST SETAUKET', u'ELMIRA', u'ELMONT', u'FARMINGDALE',
       u'FARMINGVILLE', u'FRANKLIN', u'FRANKLIN SQUARE', u'FREEPORT',
       u'FRESH MEADOW', u'FT HAMILTON STATE', u'GARDEN   CITY',
       u'GARDEN CITY', u'GARDEN CITY PARK', u'GLEN COVE', u'GOSHEN',
       u'GREAT   NECK', u'GREAT  NECK', u'GREAT NECK', u'HAMPTON',
       u'HAPPAGUE', u'HAPPAUGE', u'HAUPPAUGE', u'HAUPPAUSE', u'HEMPSTEAD',
       u'HEWLETT', u'HICKSVILLE', u'HOLBROOK', u'HUNTINGTON',
       u'HUNTINGTON   STATION', u'HUNTINGTON STAT', u'HUNTINGTON STATION',
       u'HUNTINGTONST', u'IRVINGTON-ON-HU', u'ISLAND PARK', u'ISLIP',
       u'JAMAICA ESTATES', u'JEFFERSON', u'JEFFERSON   VALLEY',
       u'JEFFERSON  VALLEY', u'JEFFERSON VALLE', u'JEFFERSON VALLEY',
       u'JERICHO', u'KINDERHOOK', u'LANCASTER', u'LAWRENCE', u'LEVITOWN',
       u'LEVITTOWN', u'LEVITTSTOWN', u'LIDO BEACH', u'LINDENHURST',
       u'LOCUST  VALLEY', u'LOCUST VALLEY', u'LONG  BEACH', u'LONG BEACH',
       u'LONGBEACH', u'LYNBROOK', u'MALVERNE', u'MANORVILLE',
       u'MASSAPEAQUA', u'MASSAPEQUA', u'MASSAPEQUA  PARK',
       u'MASSAPEQUA PARK', u'MASTIC', u'MEDFORD', u'MELVILLE',
       u'MERCERVILLE', u'MERICK', u'MERRICK', u'MIDDLE   VILLAGE',
       u'MIDDLE VILLAGE', u'MILL NECK', u'MINEOLA', u'MT. SINAI',
       u'N MERRICK', u'N VALLEY', u'N VALLEY STREAM', u'N. BABYLON',
       u'N. BALDWIN', u'N. MASSAPEQUA', u'N. MERRICK', u'N. VALLEY',
       u'N. VALLEY STREAM', u'N.MASSAPEQUA', u'N.VALLEY STREAM', u'NANUET',
       u'NESCONSET', u'NEW   YORK', u'NEW  YORK', u'NEW BALTIMORE',
       u'NEW HAMPTON', u'NEW HYDE PARK', u'NEW HYDE PK', u'NEW ROCHELLE',
       u'NEW YORK', u'NIAGARA FALLS', u'NO. BALDWIN', u'NO. MASSAPEQUA',
       u'NO. MERRICK', u'NO. VALLEY STREAM', u'NO.VALLEY STREAM',
       u'NORHTPORT', u'NORTH MASSAPEQUA', u'NORTH MERRICK',
       u'NORTH VALLEY', u'NORTH VALLEY STREAM', u'NORTHPORT', u'NY',
       u'NYC', u'OCEANSIDE', u'OLD BETHPAGE', u'OLD BETHPARE',
       u'OLD WESTBURY', u'ORISKANY', u'OYSTER BAY', u'PATTERSONVILLE',
       u'PENFIELD', u'PLAINVIEW', u'PORT', u'PORT   WASHINGTON',
       u'PORT JEFFERSON', u'PORT WASHINGTON', u'POUGHKEEPSIE',
       u'PT. LOOKOUT', u'PT. WASHINGTON', u'PT.WASHINGTON', u'RENSSALAER',
       u'RENSSELAER', u'RENSSELAU', u'RENSSELEAR', u'RICHMOND HILL',
       u'RICHMOND HILLS', u'RIVERHEAD', u'ROCHESTER', u'ROCKEVILLE CENTRE',
       u'ROCKILLE CTR', u'ROCKVILLE', u'ROCKVILLE CENTER',
       u'ROCKVILLE CENTR', u'ROCKVILLE CENTRE', u'ROCKVILLE CTR',
       u'RONKONKOMA', u'ROOSEVELT', u'ROSLYN', u'S.MERRICK',
       u'SANDS POINT', u'SARATOGA SPRING', u'SCHENECTADY', u'SEA CLIFF',
       u'SEACLIFF', u'SEAFORD', u'SEATAUKET', u'SETAUKET', u'SHIRLEY',
       u'SHOREHAM', u'SHOREHARN', u'SMITHTOWN', u'SOUTHOLD', u'ST JAMES',
       u'ST. JAMES', u'STATEN', u'STATEN  ISLAND', u'STATEN ISLAND',
       u'STONY POINT', u'SUFFERN', u'SYOSSET', u'SYRACUSE', u'UNIONDALE',
       'UNKNOWN', u'UPPER NYACK', u'VALLEY STREAM', u'W BABYLON',
       u'W ISLIP', u'W. BABYLON', u'W. HEMPSTEAD', u'W. ISLIP',
       u'W.BABYLON', u'WADING RIVER', u'WANTAGH', u'WASHINGTONVILLE',
       u'WEST BABLYON', u'WEST BABYLON', u'WEST BURY', u'WEST ISLIP',
       u'WEST MERRICK', u'WESTBURY', u'WESTBUTY', u'WESYBURY',
       u'WHITESTONE', u'WILLISTIN PK', u'WILLISTON', u'WILLISTON PARK',
       u'WILLISTON PK', u'WOODBURY', u'WOODMERE', u'YAPHANK', u'YONKERS'], dtype=object)

Ok, no more nan, yay!

Let's try that again:


In [85]:
df['City'].str.startswith('E')


Out[85]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
3266    False
3267    False
3268    False
3269    False
3270    False
3271    False
3272    False
3273    False
3274    False
3275    False
3276    False
3277    False
3278    False
3279    False
3280     True
Name: City, Length: 3281, dtype: bool

We can now select all the rows in our dataframe by feeding the True/False back into the dataframe:


In [86]:
df[df['City'].str.startswith("E")]


Out[86]:
DATE NAME OF INITIATIVE/ADDRESS City State Zip Code CHECK NO. AMOUNT ($ ) RECORD DATE Party Source
123 04/14/11 BROOKHAVEN DEMOCRATIC COMM;PO BOX 561 EAST NY 11733 1538 500 JUL-07-11 04:21 PM D PBA
362 10/03/01 ENGLEBRIGHT FOR STATE ASSEMBLY;PO BOX 2703 EAST SETAUKET NY 11733 1523 250 OCT-23-01 12:08 PM D SOA
363 02/09/99 ENGLEBRITE FOR STATE ASSEMBLY;P.O. BOX 2703 EAST SETAUKET NY 11733 1250 200 JUL-07-99 09:05 AM D SOA
390 11/07/11 FRIENDS OF CARRIE SOLAGES;1630 DUTCH BROADWAY ELMONT NY 11003 2907 2000 NOV-29-11 12:00 AM D SOA
484 02/12/99 FRIENDS OF ENGELBRIGHT;POB 703 E. SETAUKET NY 11733 NaN 100 JUN-10-99 06:04 PM D DETECTIVES
727 04/04/00 FRIENDS OF STEVE ENGLEBRITE;P.O. BOX 2703 EAST SETAUKET NY 11733 1356 250 JUL-10-00 03:03 PM D SOA
871 11/03/11 NASSAU DEMOCRATIC COMMITTEE;6 NEIGHBOUR WAY EAST NASSAU NY 12062 1000 31659.77 DEC-05-11 12:00 AM D SAFE NASSAU
1128 03/08/10 CITIZENS FOMM TO RE-ELECT KEN LA VALLE;9 BERKS... E. GREENBUSH NY 12061 1405 500 JUL-15-10 10:31 AM R PBA
1150 08/03/99 CITIZENS FOR CIOTTI;P.O. BOX 03068 ELMONT NY 11003 1307 450 AUG-11-99 07:39 PM R SOA
1166 09/21/05 CITIZENS FOR GREG PETERSON;PO BOX 455 EAST MEADOW NY 11554 2171 10000 OCT-07-05 03:02 PM R PBA
1167 10/05/05 CITIZENS FOR GREG PETERSON;PO BOX 455 EAST MEADOW NY 11554 2186 10000 OCT-25-05 11:29 AM R PBA
1250 09/13/10 CITIZENS FOR MONTESANO;25 MEADOWLARK DRIVE E. NORTHPORT NY 11731 1475 250 SEP-30-10 11:02 AM R PBA
1288 06/10/05 CITIZENS FOR PETERSON;PO BOX 455 EAST MEADOW NY 11554 1988 350 JUL-11-05 03:06 PM R DETECTIVES
1289 09/21/05 CITIZENS FOR PETERSON;PO BOX 455 EAST MEADOW NY 11554 2012 1250 OCT-07-05 05:17 PM R DETECTIVES
1290 09/07/05 CITIZENS FOR PETERSON;PO BOX 455 EAST MEADOW NY 11554 2078 1500 SEP-27-05 11:53 AM R SOA
1332 09/15/05 CITIZENS FOR SANTINO;PO BOX 22 EAST ROCKAWAY NY 11515 2160 500 OCT-07-05 02:53 PM R PBA
1333 10/24/01 CITIZENS FOR SANTINO;POB 22 E. ROCKAWAY NY 11518 NaN 100 NOV-26-01 02:28 PM R DETECTIVES
1334 06/01/07 CITIZENS FOR SANTINO;POB 22 E. ROCKAWAY NY 11518 1605 100 JUL-05-07 05:27 PM R DETECTIVES
1335 06/12/09 CITIZENS FOR SANTINO;P.O. BOX 22 E. ROCKAWAY NY 11518 1089 100 JUL-13-09 02:19 PM R DETECTIVES
1336 05/07/00 CITIZENS FOR SANTINO;P.O. BOX 22 EAST ROCKAWAY NY 11518 1371 500 JUL-10-00 03:40 PM R SOA
1338 08/04/05 CITIZENS FOR SANTINO;PO BOX 22 EAST ROCKAWAY NY 11518 2053 200 AUG-08-05 12:34 PM R SOA
1358 09/05/02 COMM. TO ELECT JIM ALESSI;PO BOX 200 E. ROCHESTER NY 14445 1881 200 OCT-03-02 10:22 AM R PBA
1406 05/06/99 COMMITTEE TO RE-ELECT R. GAFFNEY;90 MERRICK AV... EAST MEADOW NY 11554 1613 1000 JUL-22-99 11:26 AM R PBA
1409 09/27/07 COMMITTEE TO REELECT ROBERT SCHMIDT;POB 320 E. NORWICH NY 11732 1635 500 OCT-05-07 05:23 PM R DETECTIVES
1410 09/04/07 COMMITTEE TO RE-ELECT ROBERT SCHMIDT SUPREME C... EAST MEADOW NY 11554 2434 500 OCT-01-07 03:31 PM R SOA
1424 01/14/08 COMMITTEE TO RE-ELECT SENATOR BRUNO;4 SPRUCE RUN EAST GREENBUSH NY 12061 1096 1000 OCT-23-09 01:29 PM R PBA
1438 10/20/00 COUNCILMAN JOE KEARNEY ELECTION COMMITTEE;1199... ELMONT NY 11580 1418 250 OCT-24-00 11:40 AM R SOA
1443 07/20/99 E. ROCKAWAY REPUBLICAN COMM;94 FRANKLIN AVE E. ROCKAWAY NY 11518 1622 100 OCT-08-99 10:14 AM R PBA
1472 09/11/00 FRIENDS FOR ANGIE M. CULLEN;90 MERRICK AVENUE EAST MEADOW NY 11554 1402 125 SEP-19-00 09:18 AM R SOA
1561 04/04/07 FRIENDS FOR NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 2391 500 JUL-16-07 05:32 PM R SOA
... ... ... ... ... ... ... ... ... ... ...
2265 09/23/03 FRIENDS OF NORMA GONSALVES;1901 MERCUM ST. EAST MEADOW NY 11554 1980 1500 OCT-03-03 12:50 PM R PBA
2266 09/02/05 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 2155 500 OCT-07-05 03:09 PM R PBA
2267 09/26/05 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 2177 1988 OCT-07-05 03:09 PM R PBA
2268 04/23/09 FRIENDS OF NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 1265 700 JUL-08-09 02:28 PM R PBA
2269 09/13/10 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 1477 1000 SEP-30-10 11:03 AM R PBA
2270 03/14/05 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 1957 250 JUL-11-05 01:56 PM R DETECTIVES
2271 06/21/05 FRIENDS OF NORMA GONSALVES;1901 MARION STREET EAST MEADOW NY 11554 1994 100 JUL-11-05 03:11 PM R DETECTIVES
2272 04/09/09 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 1065 350 JUL-13-09 02:08 PM R DETECTIVES
2273 08/19/09 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 1109 495 SEP-04-09 12:51 PM R DETECTIVES
2274 09/21/10 FRIENDS OF NORMA GONSALVES;1901 MERION STREET EAST MEADOW NY 11554 1216 250 SEP-22-10 03:15 PM R DETECTIVES
2275 08/11/03 FRIENDS OF NORMA GONSALVES;1901 MERION ST. EAST MEADOW NY 11554 1780 250 SEP-03-03 01:46 PM R SOA
2276 10/09/03 FRIENDS OF NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 1808 1000 OCT-24-03 02:03 PM R SOA
2277 03/21/05 FRIENDS OF NORMA GONSALVES;1901 MERION ST. EAST MEADOW NY 11554 1988 250 JUL-13-05 04:00 PM R SOA
2278 10/13/05 FRIENDS OF NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 2090 2000 OCT-27-05 03:40 PM R SOA
2279 04/30/09 FRIENDS OF NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 2650 500 JUL-13-09 04:45 PM R SOA
2280 09/21/09 FRIENDS OF NORMA GONSALVES;1901 MERION ST EAST MEADOW NY 11554 2700 1000 SEP-24-09 03:30 PM R SOA
2281 08/03/99 FRIENDS OF NORMA GONZALEZ;1901 MERION STREET EAST MEADOW NY 11554 1308 300 AUG-11-99 07:40 PM R SOA
2322 11/02/11 FRIENDS OF PAT MAHER;335 SPRING DRIVE EAST MEADOW NY 11554 1320 2000 DEC-01-11 12:00 AM R DETECTIVES
2323 10/27/11 FRIENDS OF PAT MAHER;335 SPRING DR EAST MEADOW NY 11554 2905 1000 NOV-29-11 12:00 AM R SOA
2371 06/19/07 FRIENDS OF PHIL BOYLE;136 E MAIN STREET E. ISLIP NY 11730 1614 200 AUG-16-07 04:12 PM R DETECTIVES
2372 06/12/00 FRIENDS OF PHIL BOYLE;15 STEWART STREET EAST ISLIP NY 11730 1377 125 JUL-10-00 03:47 PM R SOA
2373 07/13/01 FRIENDS OF PHIL BOYLE;15 STEWART ST E ISLIP NY 11730 1499 400 AUG-09-01 10:07 PM R SOA
2374 03/01/02 FRIENDS OF PHIL BOYLE;15 STEWART ST E ISLIP NY 11730 1564 200 JUL-03-02 10:20 AM R SOA
2375 06/19/07 FRIENDS OF PHIL BOYLE;136 E. MAIN ST. EAST ISLIP NY 11730 2417 200 JUL-16-07 05:35 PM R SOA
2431 09/27/07 FRIENDS OF ROBERT SCHMIDT;PO BOX 320 EAST NORWICH NY 11732 1635 500 SEP-28-07 04:24 PM R DETECTIVES
2464 01/09/12 FRIENDS OF SENATOR JACK MARTIN;192 CLEARMEADOW... EAST MEADOW NY 11554 1325 500 JAN-16-12 12:00 AM R DETECTIVES
2505 02/23/06 FRIENDS OF TOM MCKEVITT;P.O. BOX 455 E. MEADOW NY 11554 2039 500 JUN-27-06 12:52 PM R DETECTIVES
2740 10/03/01 N. HEPMSTEAD REPUBLICAN COMMITTEE;C/O AXELROD ... EAST MEADOW NY 11554 1522 400 OCT-23-01 01:04 PM R SOA
3195 10/20/10 TAXPAYERS FOR LENAHAN;17 EVERDELL ROAD EAST ROCKAWAY NY 11518 1221 100 NOV-29-10 04:58 PM R DETECTIVES
3280 06/30/09 ;235 CARMEN AVE EAST ROCKAWAY NY 11518 1290 $ 600 JUL-08-09 02:59 PM R PBA

86 rows × 10 columns

Let's create a new dataframe that only contains the cities that start with "E" that we can play with:


In [87]:
ecities = df[df['City'].str.startswith("E")]

In [89]:
ecities['City'].unique()


Out[89]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'E. SETAUKET',
       u'EAST NASSAU', u'E. GREENBUSH', u'EAST MEADOW', u'E. NORTHPORT',
       u'EAST ROCKAWAY', u'E. ROCKAWAY', u'E. ROCHESTER', u'E. NORWICH',
       u'EAST GREENBUSH', u'E. MEADOW', u'ELMIRA', u'EAST NORTHPOND',
       u'EAST NORTHPORT', u'E. WILLISTON', u'E. ISLIP', u'EAST ISLIP',
       u'E ISLIP', u'EAST NORWICH'], dtype=object)

It's stil petty much a mess. However, using regular expressions, we can replace every "E " or "E. " with the string "EAST "

breaking replace(r"\s*E\.?\s", "EAST ") down into chunks:

r = use regular expression language
\s = space
* = wildcard (anything)
\.? = one or no "." characters

In [91]:
ecities['City'].str.replace(r"\s*E\.?\s", "EAST ").unique()


Out[91]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'EAST NASSAU',
       u'EAST GREENBUSH', u'EAST MEADOW', u'EAST NORTHPORT',
       u'EAST ROCKAWAY', u'EAST ROCHESTER', u'EAST NORWICH', u'ELMIRA',
       u'EAST NORTHPOND', u'EAST WILLISTON', u'EAST ISLIP'], dtype=object)

In [92]:
df['City'] = df['City'].str.replace(r"\s*E\.?\s", "EAST ")

In [96]:
df[df['City'].str.startswith("E")]['City'].unique()


Out[96]:
array([u'EAST', u'EAST SETAUKET', u'ELMONT', u'EAST NASSAU',
       u'EAST GREENBUSH', u'EAST MEADOW', u'EAST NORTHPORT',
       u'EAST ROCKAWAY', u'EAST ROCHESTER', u'EAST NORWICH', u'ELMIRA',
       u'EAST NORTHPOND', u'EAST WILLISTON', u'EAST ISLIP'], dtype=object)

In [ ]: