KK's Rscript to Pyscript

Somayaji made a script in R which I am dying to reproduce in python using pandas and other great frameworks. His source file is:


In [1]:
ls -l *.R


-rwxrwx--- 1 root plugdev 1932 Apr 15 21:30 ProjectProposal.R*

And the Resource for this learning is

Data Analysis with Python Pandas Source

And, my Notebook which is a copy of the same tutorial in a notebook, whose file name is pandas_tutorial_3mar15.ipynb and can be found in SageMath Cloud.


In [2]:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
%matplotlib inline

In [3]:
print 'Python Version ' + sys.version
print 'Pandas Version ' + pd.__version__


Python Version 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Mar  9 2015, 16:20:48) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Pandas Version 0.16.0

!

Now, we will use the Crime Data csv file for this exercise and it is here in this location:


In [4]:
ls -l *.csv


-rwxrwx--- 1 root plugdev 1598015 Apr 15 21:30 Crimes_-_2001_to_present.csv*
-rwxrwx--- 1 root plugdev 1624215 Apr 15 21:30 Crimes_-_2001_to_present_v2.csv*

In [5]:
df = pd.read_csv('Crimes_-_2001_to_present.csv')

In [6]:
df.dtypes # Tells us what data type each column is!


Out[6]:
ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
Ward                      int64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Updated On               object
Year                      int64
Latitude                float64
Longitude               float64
Location                 object
dtype: object

In [7]:
df.ID # Shows us the elements inside the column named, "ID"


Out[7]:
0       8719624
1       8716231
2       8715757
3       8716098
4       8716149
5       8715324
6       8714875
7       8714873
8       8714982
9       8714865
10      8716418
11      8717296
12      8716047
13      8714884
14      8715081
15      8716481
16      8714830
17      8714840
18      8714845
19      8720289
20      8718363
21      8714866
22      8716495
23      8717669
24      8715667
25      8716303
26      8715233
27      8715378
28      8714826
29      8715076
         ...   
7469    8710901
7470    8704539
7471    8704984
7472    8703401
7473    8703331
7474    8703496
7475    8703270
7476    8703505
7477    8703372
7478    8703385
7479    8703420
7480    8703252
7481    8703642
7482    8707619
7483    8704575
7484    8703271
7485    8703572
7486    8723374
7487    8703559
7488    8724551
7489    8714013
7490    8703195
7491    8703319
7492    8703469
7493    8703405
7494    8703281
7495    8719492
7496    8703304
7497    8703714
7498    8703205
Name: ID, dtype: int64

In [8]:
# You can also check it's data type separately
df.ID.dtype


Out[8]:
dtype('int64')

In [9]:
df.Date.dtype


Out[9]:
dtype('O')

I would like to change the datatype of the Date column as KK did in his Rscript. For that, I found this link useful.


In [10]:
df.convert_objects(convert_dates='coerce',convert_numeric=True)


Out[10]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic Beat Ward FBI Code X Coordinate Y Coordinate Updated On Year Latitude Longitude Location
0 8719624 HV396282 2012-07-20 0000X W 112TH PL NaT CRIMINAL DAMAGE TO PROPERTY RESIDENCE False False 522 34 NaT 1177947 1830412 2012-07-27 2012 41.689964 -87.624156 (41.68996389491519, -87.62415556111233)
1 8716231 HV392628 2012-07-20 033XX W 59TH ST NaT BURGLARY ATTEMPT FORCIBLE ENTRY APARTMENT False False 822 16 2015-04-05 00:00:00 1154824 1865254 2012-07-26 2012 41.786067 -87.707883 (41.78606748679821, -87.7078829857335)
2 8715757 HV391909 2012-07-20 014XX W 62ND ST NaT BURGLARY FORCIBLE ENTRY RESIDENCE False False 713 16 2015-04-05 00:00:00 1167870 1863673 2012-07-23 2012 41.781459 -87.660095 (41.781458651730695, -87.66009516581626)
3 8716098 HV392415 2012-07-20 0000X N GREEN ST NaT CRIMINAL DAMAGE TO VEHICLE STREET False False 1212 27 NaT 1170766 1900477 2012-07-23 2012 41.882390 -87.648402 (41.88238953104529, -87.6484019311562)
4 8716149 HV392499 2012-07-20 020XX S LEAVITT ST NaT MOTOR VEHICLE THEFT AUTOMOBILE STREET False False 1223 25 2015-04-07 00:00:00 1161974 1890328 2012-07-23 2012 41.854728 -87.680969 (41.85472774515502, -87.68096940387777)
5 8715324 HV391662 2012-07-20 068XX S ADA ST NaT THEFT OVER $500 STREET False False 724 17 2015-04-06 00:00:00 1168554 1859165 2012-07-23 2012 41.769073 -87.657717 (41.769073413721, -87.65771730583046)
6 8714875 HV391420 2012-07-20 043XX S MAPLEWOOD AVE NaT PUBLIC PEACE VIOLATION RECKLESS CONDUCT RESIDENTIAL YARD (FRONT/BACK) False False 922 12 NaT 1160051 1875637 2012-07-22 2012 41.814454 -87.688432 (41.81445385046495, -87.68843247351239)
7 8714873 HV391421 2012-07-20 003XX E 75TH ST NaT BATTERY AGGRAVATED: HANDGUN SIDEWALK False False 323 6 NaT 1179462 1855389 2012-07-24 2012 41.758470 -87.617849 (41.75846957933904, -87.61784913812797)
8 8714982 HV391410 2012-07-20 058XX S INDIANA AVE NaT ROBBERY ATTEMPT: ARMED-HANDGUN SIDEWALK False False 232 20 2015-04-03 00:00:00 1178557 1866336 2012-07-22 2012 41.788530 -87.620833 (41.78852989714045, -87.62083348277277)
9 8714865 HV391426 2012-07-20 062XX S VERNON AVE NaT BATTERY DOMESTIC BATTERY SIMPLE STREET False True 313 20 NaT 1180287 1863567 2012-07-24 2012 41.780892 -87.614575 (41.78089198069858, -87.61457513620923)
10 8716418 HV392744 2012-07-20 067XX S KEDVALE AVE NaT BURGLARY FORCIBLE ENTRY RESIDENCE False False 833 13 2015-04-05 00:00:00 1149902 1859654 2012-07-26 2012 41.770797 -87.726075 (41.770797024732644, -87.72607467761279)
11 8717296 HV393972 2012-07-20 067XX S ARTESIAN AVE NaT THEFT OVER $500 APARTMENT False False 832 15 2015-04-06 00:00:00 1161251 1859563 2012-07-24 2012 41.770320 -87.684476 (41.77031988491179, -87.68447572746534)
12 8716047 HV392401 2012-07-20 055XX S ADA ST NaT BURGLARY FORCIBLE ENTRY RESIDENCE False False 713 16 2015-04-05 00:00:00 1168235 1867888 2012-07-23 2012 41.793017 -87.658636 (41.79301726659824, -87.65863572627497)
13 8714884 HV391428 2012-07-20 101XX S MALTA ST NaT BATTERY DOMESTIC BATTERY SIMPLE APARTMENT True True 2213 21 NaT 1168561 1837531 2012-07-23 2012 41.709706 -87.658314 (41.70970648902796, -87.65831365218979)
14 8715081 HV391540 2012-07-20 071XX S JEFFERY BLVD NaT BATTERY SIMPLE STREET False False 333 5 NaT 1190735 1858225 2012-07-22 2012 41.765987 -87.576444 (41.76598708784457, -87.57644365891016)
15 8716481 HV391411 2012-07-20 008XX W WAVELAND AVE NaT BATTERY DOMESTIC BATTERY SIMPLE APARTMENT False True 1923 44 NaT 1170033 1924744 2012-07-27 2012 41.948996 -87.650384 (41.94899550183653, -87.65038391349279)
16 8714830 HV391404 2012-07-20 021XX N SPRINGFIELD AVE NaT PUBLIC PEACE VIOLATION ARSON THREAT STREET False False 2525 30 NaT 1149996 1913760 2012-07-22 2012 41.919269 -87.724324 (41.91926897431185, -87.72432367020212)
17 8714840 HV391389 2012-07-20 057XX W MADISON ST NaT WEAPONS VIOLATION UNLAWFUL USE OTHER DANG WEAPON POLICE FACILITY/VEH PARKING LOT True False 1513 29 NaT 1138126 1899398 2012-07-22 2012 41.880081 -87.768284 (41.88008091586382, -87.76828368897783)
18 8714845 HV391399 2012-07-20 003XX N JEFFERSON ST NaT CRIMINAL DAMAGE TO VEHICLE STREET True False 1212 42 NaT 1172308 1902703 2012-07-22 2012 41.888464 -87.642674 (41.88846390591864, -87.64267394758402)
19 8720289 HV397023 2012-07-20 065XX S GREEN ST NaT BURGLARY FORCIBLE ENTRY APARTMENT False False 723 16 2015-04-05 00:00:00 1171727 1861297 2012-07-27 2012 41.774855 -87.646024 (41.7748548925639, -87.64602416479119)
20 8718363 HV395217 2012-07-20 038XX W 58TH ST NaT BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE False True 822 13 NaT 1151873 1865833 2012-07-27 2012 41.787715 -87.718688 (41.7877147485484, -87.71868775094183)
21 8714866 HV391403 2012-07-20 020XX W 71ST ST NaT ASSAULT SIMPLE STREET False True 735 17 2015-04-15 08:00:00 1163848 1857609 2012-07-27 2012 41.764904 -87.675011 (41.76490368019277, -87.67501096022276)
22 8716495 HV391398 2012-07-20 009XX W ADDISON ST NaT THEFT OVER $500 CTA PLATFORM False False 1923 44 2015-04-06 00:00:00 1169177 1924140 2012-07-25 2012 41.947357 -87.653548 (41.94735677558397, -87.65354801522845)
23 8717669 HV394364 2012-07-20 019XX W MADISON ST NaT OTHER OFFENSE OTHER VEHICLE OFFENSE OTHER False True 1211 27 NaT 1163705 1900002 2012-07-24 2012 41.881238 -87.674343 (41.88123780615503, -87.67434331883847)
24 8715667 HV391679 2012-07-20 031XX S PRINCETON AVE NaT CRIMINAL DAMAGE TO VEHICLE STREET False False 915 11 NaT 1174842 1883965 2012-07-22 2012 41.836989 -87.633929 (41.83698919433196, -87.63392886525048)
25 8716303 HV392639 2012-07-20 010XX N LARAMIE AVE NaT MOTOR VEHICLE THEFT AUTOMOBILE RESIDENTIAL YARD (FRONT/BACK) False False 1531 37 2015-04-07 00:00:00 1141542 1906488 2012-07-22 2012 41.899474 -87.755565 (41.89947433556218, -87.75556507471337)
26 8715233 HV391635 2012-07-20 030XX S HOMAN AVE NaT CRIMINAL DAMAGE TO VEHICLE RESIDENTIAL YARD (FRONT/BACK) False False 1032 22 NaT 1154227 1884003 2012-07-22 2012 41.837529 -87.709573 (41.837529184273876, -87.70957294222553)
27 8715378 HV391394 2012-07-20 030XX W LEXINGTON ST NaT BURGLARY FORCIBLE ENTRY RESIDENCE False False 1134 28 2015-04-05 00:00:00 1156067 1896502 2012-07-25 2012 41.871791 -87.702484 (41.87179096801878, -87.70248416929228)
28 8714826 HV391401 2012-07-20 079XX S DREXEL AVE NaT THEFT $500 AND UNDER RESIDENTIAL YARD (FRONT/BACK) False False 624 8 2015-04-06 00:00:00 1183651 1852475 2012-07-27 2012 41.750377 -87.602588 (41.750376634248745, -87.60258766182127)
29 8715076 HV391521 2012-07-20 033XX S WELLS ST NaT CRIMINAL DAMAGE TO VEHICLE STREET False False 915 11 NaT 1175128 1882685 2012-07-22 2012 41.833470 -87.632918 (41.83347037439109, -87.63291772123416)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7469 8710901 HV387421 2012-07-12 005XX W ARMITAGE AVE NaT THEFT FROM BUILDING APARTMENT False False 1813 43 2015-04-06 00:00:00 1172496 1913530 2012-07-20 2012 41.918170 -87.641663 (41.91816962883295, -87.64166296483712)
7470 8704539 HV380455 2012-07-12 013XX W PRATT BLVD NaT OTHER OFFENSE HARASSMENT BY TELEPHONE RESIDENCE False True 2431 49 NaT 1165821 1945333 2012-07-20 2012 42.005583 -87.665276 (42.00558343917497, -87.66527622152783)
7471 8704984 HV380525 2012-07-12 057XX N WINTHROP AVE NaT THEFT $500 AND UNDER STREET False False 2022 48 2015-04-06 00:00:00 1167859 1938568 2012-07-20 2012 41.986976 -87.657975 (41.98697626801882, -87.65797457471722)
7472 8703401 HV379568 2012-07-12 105XX S PERRY AVE 2017-01-01 NARCOTICS MANU/DELIVER:CRACK STREET True False 512 34 NaT 1177422 1835195 2012-07-15 2012 41.703101 -87.625934 (41.7031009815263, -87.62593377251329)
7473 8703331 HV379497 2012-07-12 056XX W BELMONT AVE NaT THEFT PURSE-SNATCHING GROCERY FOOD STORE False False 1633 38 2015-04-06 00:00:00 1138107 1920748 2012-07-26 2012 41.938668 -87.767837 (41.93866817803988, -87.76783660929412)
7474 8703496 HV379470 2012-07-12 079XX S ASHLAND AVE NaT BURGLARY FORCIBLE ENTRY DEPARTMENT STORE False False 612 21 2015-04-05 00:00:00 1167090 1852193 2012-07-15 2012 41.749973 -87.663283 (41.74997273294529, -87.66328269424058)
7475 8703270 HV379427 2012-07-12 108XX S WABASH AVE NaT BATTERY DOMESTIC BATTERY SIMPLE APARTMENT True True 513 9 NaT 1178543 1832956 2012-07-20 2012 41.696932 -87.621897 (41.6969315102019, -87.62189667825021)
7476 8703505 HV379732 2012-07-12 054XX S LAFLIN ST NaT BURGLARY FORCIBLE ENTRY RESIDENCE False False 934 16 2015-04-05 00:00:00 1167221 1868639 2012-07-27 2012 41.795100 -87.662332 (41.79509987725828, -87.66233245878783)
7477 8703372 HV379504 2012-07-12 026XX S KEDZIE AVE 2230-01-01 LIQUOR LAW VIOLATION ILLEGAL CONSUMPTION BY MINOR SIDEWALK True False 1033 12 NaT 1155484 1886493 2012-07-15 2012 41.844337 -87.704894 (41.844336899528564, -87.70489357294214)
7478 8703385 HV379475 2012-07-12 064XX S FRANCISCO AVE NaT BATTERY SIMPLE SIDEWALK False False 823 15 NaT 1158108 1861999 2012-07-15 2012 41.777069 -87.695931 (41.777069116408185, -87.69593059291209)
7479 8703420 HV379488 2012-07-12 024XX W TOUHY AVE NaT THEFT $500 AND UNDER OTHER False False 2411 50 2015-04-06 00:00:00 1158731 1947751 2012-07-15 2012 42.012367 -87.691294 (42.0123673541295, -87.69129424019452)
7480 8703252 HV379463 2012-07-12 109XX S LOOMIS ST NaT PUBLIC PEACE VIOLATION RECKLESS CONDUCT STREET False False 2234 34 NaT 1168898 1832441 2012-07-15 2012 41.695731 -87.657226 (41.69573146177049, -87.65722584757034)
7481 8703642 HV379462 2012-07-12 008XX W MONTROSE AVE NaT THEFT $500 AND UNDER SIDEWALK False False 1914 46 2015-04-06 00:00:00 1169497 1929449 2012-07-27 2012 41.961918 -87.652217 (41.96191789149507, -87.65221668964556)
7482 8707619 HV384418 2012-07-12 040XX W WELLINGTON AVE NaT DECEPTIVE PRACTICE FRAUD OR CONFIDENCE GAME RESIDENCE False False 2523 31 NaT 1148676 1919594 2012-07-18 2012 41.935304 -87.729022 (41.93530361758211, -87.72902248815977)
7483 8704575 HV380446 2012-07-12 023XX W LOGAN BLVD NaT THEFT FROM BUILDING ATHLETIC CLUB False False 1432 1 2015-04-06 00:00:00 1160430 1917838 2012-07-20 2012 41.930250 -87.685875 (41.930249560228596, -87.68587461016092)
7484 8703271 HV379478 2012-07-12 034XX W LEXINGTON ST NaT OTHER OFFENSE VIOLATE ORDER OF PROTECTION RESIDENCE False True 1133 24 NaT 1153665 1896531 2012-07-20 2012 41.871919 -87.711302 (41.87191865070911, -87.71130215782516)
7485 8703572 HV379871 2012-07-12 062XX S DREXEL AVE NaT MOTOR VEHICLE THEFT AUTOMOBILE STREET False False 313 20 2015-04-07 00:00:00 1183034 1864022 2012-07-19 2012 41.782077 -87.604490 (41.78207709188042, -87.60449006236749)
7486 8723374 HV379444 2012-07-12 042XX W THOMAS ST 2024-01-01 NARCOTICS POSS: HEROIN(WHITE) APARTMENT False False 1111 37 NaT 1147816 1907033 2012-07-27 2012 41.900852 -87.732506 (41.90085161465535, -87.73250646718618)
7487 8703559 HV379464 2012-07-12 061XX S WESTERN AVE NaT THEFT RETAIL THEFT DEPARTMENT STORE True False 825 15 2015-04-06 00:00:00 1161471 1863600 2012-07-15 2012 41.781393 -87.683557 (41.78139341670748, -87.6835574820193)
7488 8724551 HV379451 2012-07-12 027XX W FLOURNOY ST 2027-01-01 NARCOTICS POSS: CRACK RESIDENCE False False 1135 2 NaT NaN NaN 2012-07-27 2012 NaN NaN NaN
7489 8714013 HV379522 2012-07-12 027XX W FLOURNOY ST 2024-01-01 NARCOTICS POSS: HEROIN(WHITE) APARTMENT False False 1135 2 NaT 1158334 1896963 2012-07-26 2012 41.873010 -87.694148 (41.87300997630423, -87.6941484677931)
7490 8703195 HV379393 2012-07-12 079XX S EXCHANGE AVE NaT CRIMINAL DAMAGE TO VEHICLE STREET False False 422 7 NaT 1197166 1853085 2012-07-15 2012 41.751725 -87.553043 (41.75172486586887, -87.55304319881108)
7491 8703319 HV379466 2012-07-12 001XX E LAKE ST NaT THEFT PURSE-SNATCHING RESTAURANT False False 114 42 2015-04-06 00:00:00 1177576 1901805 2012-07-15 2012 41.885882 -87.623355 (41.885881766915276, -87.62335545769336)
7492 8703469 HV379704 2012-07-12 027XX N HOYNE AVE NaT ASSAULT SIMPLE STREET False False 1931 1 2015-04-15 08:00:00 1162384 1918225 2012-07-15 2012 41.931271 -87.678683 (41.93127079752159, -87.67868328258375)
7493 8703405 HV379639 2012-07-12 016XX N HAMLIN AVE NaT THEFT FROM BUILDING DAY CARE CENTER False False 2535 26 2015-04-06 00:00:00 1150833 1910726 2012-07-15 2012 41.910927 -87.721328 (41.91092704682102, -87.72132794973652)
7494 8703281 HV379436 2012-07-12 033XX N MILWAUKEE AVE NaT OTHER OFFENSE TELEPHONE THREAT COMMERCIAL / BUSINESS OFFICE False False 1731 30 NaT 1148730 1922015 2012-07-17 2012 41.941946 -87.728761 (41.941946001962364, -87.72876128642051)
7495 8719492 HV364735 2012-07-12 0000X S HOMAN AVE NaT CRIMINAL DAMAGE TO PROPERTY GAS STATION False False 1124 28 NaT 1153768 1899597 2012-07-25 2012 41.880330 -87.710842 (41.880330029236724, -87.71084236061526)
7496 8703304 HV379409 2012-07-12 110XX S MICHIGAN AVE NaT THEFT $500 AND UNDER CURRENCY EXCHANGE False False 513 9 2015-04-06 00:00:00 1178826 1831469 2012-07-15 2012 41.692845 -87.620906 (41.69284455167364, -87.62090553561691)
7497 8703714 HV379406 2012-07-12 009XX N SACRAMENTO BLVD NaT BATTERY DOMESTIC BATTERY SIMPLE APARTMENT False True 1311 26 NaT 1156043 1906324 2012-07-16 2012 41.898744 -87.702307 (41.89874396930619, -87.7023071163759)
7498 8703205 HV379391 2012-07-12 012XX W WILSON AVE 1811-01-01 NARCOTICS POSS: CANNABIS 30GMS OR LESS SIDEWALK True False 1913 46 NaT 1167293 1930644 2012-07-15 2012 41.965245 -87.660285 (41.96524480268117, -87.66028530814594)

7499 rows × 20 columns


In [11]:
df.Date.dtype


Out[11]:
dtype('O')

I do not know what happened by the above converting command, but I can see 2 changes, while one of them seems positive.

  1. The dates got a - symbol in place /. (Positive)
  2. The IUCR column seems to be totally different from the source csv file (?) Why is it showing NaT instead of an alphanumeric as in the csv file.

The second point may not be due to this command, perhaps it has been this way from when python started reading the file. But I am not sure how to correct this now.

!

Now, I would like to save this newly created dataframe as a new csv file using pandas. This is how I should do it as per the tutorial.


In [12]:
# Here is the commnd to do it. 
# But the parameter `index=False` will prevent 
# the index column from being exported to the csv file.
df.to_csv('Crimes_-_2001_to_present_v2.csv', index=False)

In [13]:
ls -l *.csv


-rwxrwx--- 1 root plugdev 1598015 Apr 15 21:30 Crimes_-_2001_to_present.csv*
-rwxrwx--- 1 root plugdev 1628092 Apr 15 22:58 Crimes_-_2001_to_present_v2.csv*

!

Next is to use PyMySQLdb module and do some data manipulation as Somayaji did in the Rscript.


In [14]:
from pandasql import sqldf
from pandasql import load_meat, load_births

In [15]:
meat = load_meat
births = load_births

In [16]:
print meat.head()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-36c5d6d7b2dc> in <module>()
----> 1 print meat.head()

AttributeError: 'function' object has no attribute 'head'

In [18]:
q = "SELECT * FROM meat LIMIT 10;"

In [19]:
print sqldf(q, locals())


---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-19-df013f6af3d4> in <module>()
----> 1 print sqldf(q, locals())

/home/anand/anaconda/lib/python2.7/site-packages/pandasql/sqldf.pyc in sqldf(q, env, inmemory)
    108             raise Exception("%s not found" % table)
    109         df = env[table]
--> 110         df = _ensure_data_frame(df, table)
    111         _write_table(table, df, conn)
    112 

/home/anand/anaconda/lib/python2.7/site-packages/pandasql/sqldf.pyc in _ensure_data_frame(obj, name)
     31             df = pd.DataFrame(obj, columns=["c0"])
     32 
---> 33     if not isinstance(df, pd.DataFrame) :
     34         raise Exception("%s is not a Dataframe, tuple, list, nor dictionary" % name)
     35 

UnboundLocalError: local variable 'df' referenced before assignment

In [ ]: