Intro to Data Wrangling

Data Wrangling is the concept of arranging your dataset into a workable format for analysis. When retreiving data from various sources, it is not always in a format that is ready to be analyzed. There are times when there are missing or incorrect values in the dataset which will reduce the integrity of the analysis performed on the dataset. The process of Data Wrangling has been known to take up as much as 70% of a data scientist's time. In this notebook, we will review a few different formats in which you may retrieve data and what are some of the initial techniques to ensure that the data is ready for analysis.

Data Retrieval from CSV files

One of the common formats for datasets is the comma seperated values or csv format. This format works well with both Microsoft Excel and Google Sheets as both platforms allow you to view them in a spreadsheet format. This format is also supported by the pandas DataFrame. In the code below, we read in a csv into a DataFrame, df, and display the first 5 entries.


In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('Master.csv')
df.head(5)


Out[2]:
playerID birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay ... nameLast nameGiven weight height bats throws debut finalGame retroID bbrefID
0 aardsda01 1981 12 27 USA CO Denver NaN NaN NaN ... Aardsma David Allan 205 75 R R 4/6/2004 9/28/2013 aardd001 aardsda01
1 aaronha01 1934 2 5 USA AL Mobile NaN NaN NaN ... Aaron Henry Louis 180 72 R R 4/13/1954 10/3/1976 aaroh101 aaronha01
2 aaronto01 1939 8 5 USA AL Mobile 1984 8 16 ... Aaron Tommie Lee 190 75 R R 4/10/1962 9/26/1971 aarot101 aaronto01
3 aasedo01 1954 9 8 USA CA Orange NaN NaN NaN ... Aase Donald William 190 75 R R 7/26/1977 10/3/1990 aased001 aasedo01
4 abadan01 1972 8 25 USA FL Palm Beach NaN NaN NaN ... Abad Fausto Andres 184 73 L L 9/10/2001 4/13/2006 abada001 abadan01

5 rows × 24 columns

From the output, you see that this dataset gives information about baseball players. Here is the last 10 entries of our Data Frame.


In [3]:
df.tail(10)


Out[3]:
playerID birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay ... nameLast nameGiven weight height bats throws debut finalGame retroID bbrefID
18579 zuberbi01 1913 3 26 USA IA Middle Amana 1982 11 2 ... Zuber William Henry 195 74 R R 9/16/1936 9/23/1947 zubeb101 zuberbi01
18580 zuberjo01 1969 12 10 USA CA Encino NaN NaN NaN ... Zuber Jon Edward 190 73 L L 4/19/1996 9/27/1998 zubej001 zuberjo01
18581 zuletju01 1975 3 28 Panama Panama Panama NaN NaN NaN ... Zuleta Julio Ernesto 230 78 R R 4/6/2000 6/25/2001 zulej001 zuletju01
18582 zumayjo01 1984 11 9 USA CA Chula Vista NaN NaN NaN ... Zumaya Joel Martin 215 75 R R 4/3/2006 6/28/2010 zumaj001 zumayjo01
18583 zuninmi01 1991 3 25 USA FL Cape Coral NaN NaN NaN ... Zunino Michael Accorsi 220 74 R R 6/12/2013 9/28/2014 zunim001 zuninmi01
18584 zupcibo01 1966 8 18 USA PA Pittsburgh NaN NaN NaN ... Zupcic Robert 220 76 R R 9/7/1991 8/4/1994 zupcb001 zupcibo01
18585 zupofr01 1939 8 29 USA CA San Francisco 2005 3 25 ... Zupo Frank Joseph 182 71 L R 7/1/1957 5/9/1961 zupof101 zupofr01
18586 zuvelpa01 1958 10 31 USA CA San Mateo NaN NaN NaN ... Zuvella Paul 173 72 R R 9/4/1982 5/2/1991 zuvep001 zuvelpa01
18587 zuverge01 1924 8 20 USA MI Holland 2014 9 8 ... Zuverink George 195 76 R R 4/21/1951 6/15/1959 zuveg101 zuverge01
18588 zwilldu01 1888 11 2 USA MO St. Louis 1978 3 27 ... Zwilling Edward Harrison 160 66 L L 8/14/1910 7/12/1916 zwild101 zwilldu01

10 rows × 24 columns

One of the first things that you can do is to analyze the DataFrame to see if there is anything that doesn't look correct, such as variation in the number of entries in each column, min, max value etc. This can be easily assess through the describe()


In [4]:
df.describe()


Out[4]:
birthYear birthMonth birthDay deathYear deathMonth deathDay weight height
count 18444.000000 18274.000000 18117.000000 9225.000000 9224.000000 9223.000000 17717.000000 17780.000000
mean 1929.851496 6.625534 15.618977 1963.294309 6.485581 15.546677 185.562680 72.235264
std 40.907143 3.465354 8.748973 31.193396 3.522301 8.781102 20.998279 2.598992
min 1820.000000 1.000000 1.000000 1872.000000 1.000000 1.000000 65.000000 43.000000
25% 1894.000000 4.000000 8.000000 1941.000000 3.000000 8.000000 170.000000 71.000000
50% 1935.000000 7.000000 16.000000 1966.000000 6.000000 15.000000 185.000000 72.000000
75% 1967.000000 10.000000 23.000000 1988.000000 10.000000 23.000000 197.000000 74.000000
max 1994.000000 12.000000 31.000000 2014.000000 12.000000 31.000000 320.000000 83.000000

As we can see from this dataset, there is a lot of variation in the count across each column. Some of this is due to the fact that there are baseball players in this dataset who are still alive today, thus there is no information on their death date. Some of this is can just be due to incomplete information being provided. This is where Data Wrangling techniques must be done so that we can have a good dataset to work with.

Retrieving Data using SQL queries

SQL queries can be used on DataFrames to extract the necessary data through the SQLite syntax. The pandasql provides you with the necessary API for SQL queries. Below are a few examples. The first query retrieves the first 10 entries of the birthMonth and birthYear columns.


In [5]:
import pandasql
q = 'SELECT birthMonth, birthYear FROM df LIMIT 10'
sql_sol = pandasql.sqldf(q.lower(),globals())
sql_sol


Out[5]:
birthMonth birthYear
0 12 1981
1 2 1934
2 8 1939
3 9 1954
4 8 1972
5 12 1985
6 11 1854
7 4 1877
8 11 1869
9 10 1866

The query below retrieves data on players that are less than 200 lbs and organizes the entries by birthCountry


In [6]:
q2 = 'SELECT playerID,birthCountry,bats,weight FROM df WHERE weight < 200 GROUP BY birthCountry LIMIT 10'
sql_sol2 = pandasql.sqldf(q2.lower(),globals())
sql_sol2


Out[6]:
playerID birthCountry bats weight
0 quinnpa01 None None 162
1 madurca01 Aruba R 175
2 porraed01 At Sea R 170
3 willije01 Australia R 185
4 ulricdu01 Austria R 195
5 richaan01 Bahamas B 165
6 martich01 Belize L 180
7 rienzan01 Brazil R 190
8 woodpe01 CAN R 185
9 kingmha01 China L 165

By using SQL quieries, it makes it simple to retrieve the data that you would like to analyze and then wrangle that data specifically if necessary.

Data Retrieval from JSON formats

Another common format is the JSON format. Data is commonly in this format when you are working with Relational Databases. The example below shows how one can retrieve data from a database using that website's REST API. Here we show to load JSON data into a Python dictionary. Here we use the requests library and then load the data using the json library. Pprint is used to print the JSON in a more readable format.


In [7]:
import json
import requests
import pprint
url = 'http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=4beab33cc6d65b05800d51f5e83bde1b&artist=Cher&album=Believe&format=json'
data = requests.get(url).text
df3 = json.loads(data)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(df)


        playerID  birthYear  birthMonth  birthDay birthCountry birthState  \
0      aardsda01       1981          12        27          USA         CO   
1      aaronha01       1934           2         5          USA         AL   
2      aaronto01       1939           8         5          USA         AL   
3       aasedo01       1954           9         8          USA         CA   
4       abadan01       1972           8        25          USA         FL   
5       abadfe01       1985          12        17         D.R.  La Romana   
6      abadijo01       1854          11         4          USA         PA   
7      abbated01       1877           4        15          USA         PA   
8      abbeybe01       1869          11        11          USA         VT   
9      abbeych01       1866          10        14          USA         NE   
10     abbotda01       1862           3        16          USA         OH   
11     abbotfr01       1874          10        22          USA         OH   
12     abbotgl01       1951           2        16          USA         AR   
13     abbotje01       1972           8        17          USA         GA   
14     abbotji01       1967           9        19          USA         MI   
15     abbotku01       1969           6         2          USA         OH   
16     abbotky01       1968           2        18          USA         MA   
17     abbotod01       1888           9         5          USA         PA   
18     abbotpa01       1967           9        15          USA         CA   
19      aberal01       1927           7        31          USA         OH   
20     abercda01       1850           1         2          USA         OK   
21     abercre01       1980           7        15          USA         GA   
22     abernbi01       1929           1        30          USA         CA   
23     abernbr01       1977           9        23          USA         GA   
24     abernte01       1921          10        30          USA         NC   
25     abernte02       1933           3         6          USA         NC   
26     abernwo01       1915           2         1          USA         NC   
27     aberscl01       1921           8        28          USA         IL   
28     ablesha01       1883          10         4          USA         TX   
29     abnersh01       1966           6        17          USA         OH   
...          ...        ...         ...       ...          ...        ...   
18559  zimmejo01       1975           4        28          CAN         BC   
18560  zimmejo02       1986           5        23          USA         WI   
18561  zimmero01       1916           9        13          USA         PA   
18562  zimmery01       1984           9        28          USA         NC   
18563   zinkch01       1979           8        26          USA         CA   
18564   zinkwa01       1898          11        21          USA         MA   
18565   zinnfr01       1865          12        21          USA         PA   
18566   zinngu01       1887           2        13          USA         WV   
18567   zinnji01       1895           1        21          USA         AR   
18568  zinsebi01       1918           1         6          USA         NY   
18569  zinteal01       1968           5        19          USA         TX   
18570  zipfebu01       1938          11        18          USA         IL   
18571   ziskri01       1949           2         6          USA         NY   
18572   zitoba01       1978           5        13          USA         NV   
18573  zitzmbi01       1895          11        19          USA         NY   
18574  zmiched01       1884          10         1          USA         OH   
18575  zobribe01       1981           5        26          USA         IL   
18576  zoccope01       1977           2         6          USA         NY   
18577  zoldasa01       1918          12         8          USA         NY   
18578  zoskyed01       1968           2        10          USA         CA   
18579  zuberbi01       1913           3        26          USA         IA   
18580  zuberjo01       1969          12        10          USA         CA   
18581  zuletju01       1975           3        28       Panama     Panama   
18582  zumayjo01       1984          11         9          USA         CA   
18583  zuninmi01       1991           3        25          USA         FL   
18584  zupcibo01       1966           8        18          USA         PA   
18585   zupofr01       1939           8        29          USA         CA   
18586  zuvelpa01       1958          10        31          USA         CA   
18587  zuverge01       1924           8        20          USA         MI   
18588  zwilldu01       1888          11         2          USA         MO   

              birthCity  deathYear  deathMonth  deathDay    ...      \
0                Denver        NaN         NaN       NaN    ...       
1                Mobile        NaN         NaN       NaN    ...       
2                Mobile       1984           8        16    ...       
3                Orange        NaN         NaN       NaN    ...       
4            Palm Beach        NaN         NaN       NaN    ...       
5             La Romana        NaN         NaN       NaN    ...       
6          Philadelphia       1905           5        17    ...       
7               Latrobe       1957           1         6    ...       
8                 Essex       1962           6        11    ...       
9            Falls City       1926           4        27    ...       
10              Portage       1930           2        13    ...       
11           Versailles       1935           6        11    ...       
12          Little Rock        NaN         NaN       NaN    ...       
13              Atlanta        NaN         NaN       NaN    ...       
14                Flint        NaN         NaN       NaN    ...       
15           Zanesville        NaN         NaN       NaN    ...       
16          Newburyport        NaN         NaN       NaN    ...       
17            New Eagle       1933           4        13    ...       
18             Van Nuys        NaN         NaN       NaN    ...       
19            Cleveland       1993           5        20    ...       
20          Fort Towson       1939          11        11    ...       
21             Columbus        NaN         NaN       NaN    ...       
22             Torrance       2006           2        19    ...       
23              Atlanta        NaN         NaN       NaN    ...       
24                Bynum       2001          11        16    ...       
25              Stanley       2004          12        16    ...       
26          Forest City       1994          12         5    ...       
27              Chicago       1973           6        23    ...       
28              Terrell       1951           2         8    ...       
29             Hamilton        NaN         NaN       NaN    ...       
...                 ...        ...         ...       ...    ...       
18559           Kelowna        NaN         NaN       NaN    ...       
18560        Auburndale        NaN         NaN       NaN    ...       
18561        Pine Grove       1991          11        22    ...       
18562        Washington        NaN         NaN       NaN    ...       
18563        Carmichael        NaN         NaN       NaN    ...       
18564        Pittsfield       1964           6        12    ...       
18565      Phoenixville       1936           5        12    ...       
18566          Holbrook       1949          10         6    ...       
18567            Benton       1991           2        26    ...       
18568           Astoria       1993           2        16    ...       
18569           El Paso        NaN         NaN       NaN    ...       
18570        Belleville        NaN         NaN       NaN    ...       
18571          Brooklyn        NaN         NaN       NaN    ...       
18572         Las Vegas        NaN         NaN       NaN    ...       
18573  Long Island City       1985           5        29    ...       
18574         Cleveland       1950           8        20    ...       
18575            Eureka        NaN         NaN       NaN    ...       
18576             Bronx        NaN         NaN       NaN    ...       
18577          Brooklyn       1966           8        25    ...       
18578          Whittier        NaN         NaN       NaN    ...       
18579      Middle Amana       1982          11         2    ...       
18580            Encino        NaN         NaN       NaN    ...       
18581            Panama        NaN         NaN       NaN    ...       
18582       Chula Vista        NaN         NaN       NaN    ...       
18583        Cape Coral        NaN         NaN       NaN    ...       
18584        Pittsburgh        NaN         NaN       NaN    ...       
18585     San Francisco       2005           3        25    ...       
18586         San Mateo        NaN         NaN       NaN    ...       
18587           Holland       2014           9         8    ...       
18588         St. Louis       1978           3        27    ...       

          nameLast           nameGiven weight height bats throws       debut  \
0          Aardsma         David Allan    205     75    R      R    4/6/2004   
1            Aaron         Henry Louis    180     72    R      R   4/13/1954   
2            Aaron          Tommie Lee    190     75    R      R   4/10/1962   
3             Aase      Donald William    190     75    R      R   7/26/1977   
4             Abad       Fausto Andres    184     73    L      L   9/10/2001   
5             Abad    Fernando Antonio    220     73    L      L   7/28/2010   
6           Abadie             John W.    192     72    R      R  1875-04-26   
7      Abbaticchio        Edward James    170     71    R      R  1897-09-04   
8            Abbey           Bert Wood    175     71    R      R  1892-06-14   
9            Abbey          Charles S.    169     68    L      L  1893-08-16   
10          Abbott    Leander Franklin    190     71    R      R  1890-04-19   
11          Abbott     Harry Frederick    180     70    R      R   4/25/1903   
12          Abbott       William Glenn    200     78    R      R   7/29/1973   
13          Abbott     Jeffrey William    190     74    R      L   6/10/1997   
14          Abbott       James Anthony    200     75    L      L    4/8/1989   
15          Abbott         Kurt Thomas    180     71    R      R    9/7/1993   
16          Abbott       Lawrence Kyle    200     76    L      L   9/10/1991   
17          Abbott           Ody Cleon    180     74    R      R   9/10/1910   
18          Abbott          Paul David    185     75    R      R   8/21/1990   
19            Aber       Albert Julius    195     74    L      L   9/15/1950   
20     Abercrombie   Francis Patterson    NaN    NaN  NaN    NaN  1871-10-21   
21     Abercrombie   Reginald Damascus    215     75    R      R    4/4/2006   
22      Abernathie      William Edward    190     70    R      R   9/27/1952   
23       Abernathy       Michael Brent    185     73    R      R   6/25/2001   
24       Abernathy  Talmadge Lafayette    210     74    R      L   9/19/1942   
25       Abernathy       Theodore Wade    215     76    R      R   4/13/1955   
26       Abernathy      Virgil Woodrow    170     72    L      L   7/28/1946   
27         Aberson  Clifford Alexander    200     72    R      R   7/18/1947   
28           Ables       Harry Terrell    200     74    R      L    9/4/1905   
29           Abner        Shawn Wesley    190     73    R      R    9/8/1987   
...            ...                 ...    ...    ...  ...    ...         ...   
18559    Zimmerman      Jordan William    200     72    R      L   5/17/1999   
18560   Zimmermann           Jordan M.    220     74    R      R   4/20/2009   
18561    Zimmerman        Roy Franklin    187     74    L      L   8/27/1945   
18562    Zimmerman        Ryan Wallace    220     75    R      R    9/1/2005   
18563         Zink       Charles Tadao    190     73    R      R   8/12/2008   
18564         Zink        Walter Noble    165     72    R      R    7/6/1921   
18565         Zinn       Frank Patrick    150     68  NaN    NaN  1888-04-18   
18566         Zinn                 Guy    170     70    L      R   9/11/1911   
18567         Zinn        James Edward    195     72    L      R    9/4/1919   
18568       Zinser     William Francis    185     73    R      R   8/19/1944   
18569       Zinter        Alan Michael    200     74    B      R   6/18/2002   
18570       Zipfel    Marion Sylvester    200     75    L      R   7/26/1961   
18571         Zisk      Richard Walter    200     73    R      R    9/8/1971   
18572         Zito       Barry William    205     74    L      L   7/22/2000   
18573     Zitzmann      William Arthur    175     70    R      R   4/27/1919   
18574        Zmich       Edward Albert    180     72    L      L   7/23/1910   
18575      Zobrist     Benjamin Thomas    210     75    B      R    8/1/2006   
18576   Zoccolillo          Peter Jude    200     74    L      R    9/5/2003   
18577       Zoldak       Samuel Walter    185     71    L      L   5/13/1944   
18578        Zosky        Edward James    175     72    R      R    9/2/1991   
18579        Zuber       William Henry    195     74    R      R   9/16/1936   
18580        Zuber          Jon Edward    190     73    L      L   4/19/1996   
18581       Zuleta       Julio Ernesto    230     78    R      R    4/6/2000   
18582       Zumaya         Joel Martin    215     75    R      R    4/3/2006   
18583       Zunino     Michael Accorsi    220     74    R      R   6/12/2013   
18584       Zupcic              Robert    220     76    R      R    9/7/1991   
18585         Zupo        Frank Joseph    182     71    L      R    7/1/1957   
18586      Zuvella                Paul    173     72    R      R    9/4/1982   
18587     Zuverink              George    195     76    R      R   4/21/1951   
18588     Zwilling     Edward Harrison    160     66    L      L   8/14/1910   

        finalGame   retroID    bbrefID  
0       9/28/2013  aardd001  aardsda01  
1       10/3/1976  aaroh101  aaronha01  
2       9/26/1971  aarot101  aaronto01  
3       10/3/1990  aased001   aasedo01  
4       4/13/2006  abada001   abadan01  
5       9/27/2014  abadf001   abadfe01  
6      1875-06-10  abadj101  abadijo01  
7       9/15/1910  abbae101  abbated01  
8      1896-09-23  abbeb101  abbeybe01  
9      1897-08-19  abbec101  abbeych01  
10     1890-05-23  abbod101  abbotda01  
11      9/20/1905  abbof101  abbotfr01  
12       8/8/1984  abbog001  abbotgl01  
13      9/29/2001  abboj002  abbotje01  
14      7/21/1999  abboj001  abbotji01  
15      4/13/2001  abbok002  abbotku01  
16      8/24/1996  abbok001  abbotky01  
17     10/15/1910  abboo101  abbotod01  
18       8/7/2004  abbop001  abbotpa01  
19      9/11/1957  abera101   aberal01  
20     1871-10-21  aberd101  abercda01  
21      9/28/2008  aberr001  abercre01  
22      9/27/1952  aberb101  abernbi01  
23      9/29/2005  aberb001  abernbr01  
24      4/29/1944  abert102  abernte01  
25      9/30/1972  abert101  abernte02  
26      4/17/1947  aberw101  abernwo01  
27       5/9/1949  aberc101  aberscl01  
28       5/5/1911  ableh101  ablesha01  
29      10/3/1992  abnes001  abnersh01  
...           ...       ...        ...  
18559    7/2/1999  zimmj002  zimmejo01  
18560   9/28/2014  zimmj003  zimmejo02  
18561   9/30/1945  zimmr101  zimmero01  
18562   9/28/2014  zimmr001  zimmery01  
18563   8/12/2008  zinkc001   zinkch01  
18564   7/19/1921  zinkw101   zinkwa01  
18565  1888-05-03  zinnf101   zinnfr01  
18566   9/16/1915  zinng101   zinngu01  
18567   8/25/1929  zinnj101   zinnji01  
18568   8/26/1944  zinsb101  zinsebi01  
18569   10/3/2004  zinta001  zinteal01  
18570   9/30/1962  zipfb101  zipfebu01  
18571   9/21/1983  ziskr101   ziskri01  
18572   9/29/2013  zitob001   zitoba01  
18573   9/25/1929  zitzb101  zitzmbi01  
18574    6/1/1911  zmice101  zmiched01  
18575   9/27/2014  zobrb001  zobribe01  
18576   9/28/2003  zoccp001  zoccope01  
18577   8/26/1952  zolds101  zoldasa01  
18578   10/1/2000  zoske001  zoskyed01  
18579   9/23/1947  zubeb101  zuberbi01  
18580   9/27/1998  zubej001  zuberjo01  
18581   6/25/2001  zulej001  zuletju01  
18582   6/28/2010  zumaj001  zumayjo01  
18583   9/28/2014  zunim001  zuninmi01  
18584    8/4/1994  zupcb001  zupcibo01  
18585    5/9/1961  zupof101   zupofr01  
18586    5/2/1991  zuvep001  zuvelpa01  
18587   6/15/1959  zuveg101  zuverge01  
18588   7/12/1916  zwild101  zwilldu01  

[18589 rows x 24 columns]

Once the data is in the dictonary, we can access the information as such:


In [8]:
df3['album']['playcount']


Out[8]:
u'1547296'

Here we can see that there are some entries that are not there. There is functiona called fillna that can be used to insert some value for the NaN entries. Here we insert a -1 for all NaN entries in the deathMonth column


In [9]:
print df['deathMonth'].head()


0   NaN
1   NaN
2     8
3   NaN
4   NaN
Name: deathMonth, dtype: float64

In [10]:
df['deathMonth'].fillna(-1).head()


Out[10]:
0   -1
1   -1
2    8
3   -1
4   -1
Name: deathMonth, dtype: float64