Class 05

Big Data Enrichment: Joining and Grouping data

One of the skills in working with data is to pull multiple data sets together. There are often features we would like to use in training our machine learning model that aren't necessarily in the dataset we have in front of us. For example, we may be looking at attendance figures at baseball games, trying to predict what they will be for future games. However, we only have team statistics. The attendance certainly depends on the weather at the stadium (or in the city) on the day of a game. So we would like to be able to gather the weather data and add it to our model. We'll go through how to do something like that.

First Dataset

We start with data from the Iowa open database of class "E" liquor licencee purchases. I have filtered the data to include data from 10/1/2016 to 12/31/2016. I have also added up the sales for each day to each store. Note that I've added an argument na.strings="". This will turn any blank entries into NA values so we know we have a missing entry.



In [1]:

    
iowadf <- read.csv("Class05_iowa_data.csv", stringsAsFactors=FALSE, na.strings="")
head(iowadf)









    





Date Store.Number City Zip.Code County Total.Sales.to.Store..Dollars.

	10/21/2016  5110        Lenox       50851       TAYLOR      $1174.81    
	10/26/2016  4201        Sioux City  51109       WOODBURY    $1218.25    
	12/21/2016  4191        Pella       50219       MARION      $296.56     
	12/29/2016  3877        Des Moines  50322       Polk        $7095.90    
	10/13/2016  9928        Cedar Rapids 52404       LINN        $19.50      
	10/25/2016  4641        Des Moines  50315       POLK        $513.65



In [2]:

    
# The sales data looks like it isn't a float like we want it to be (the presence of a $ in front is my clue that there may be something wrong.) Let's look at the data types to be sure.
str(iowadf)









    



'data.frame':	13779 obs. of  6 variables:
 $ Date                          : chr  "10/21/2016" "10/26/2016" "12/21/2016" "12/29/2016" ...
 $ Store.Number                  : int  5110 4201 4191 3877 9928 4641 4345 4599 2652 4236 ...
 $ City                          : chr  "Lenox" "Sioux City" "Pella" "Des Moines" ...
 $ Zip.Code                      : chr  "50851" "51109" "50219" "50322" ...
 $ County                        : chr  "TAYLOR" "WOODBURY" "MARION" "Polk" ...
 $ Total.Sales.to.Store..Dollars.: chr  "$1174.81" "$1218.25" "$296.56" "$7095.90" ...



In [3]:

    
# Sure enough. We need to get the real values from that column. We'll create a new column for that data and use the regex parser to get the number.
iowadf$SalesToStore<-gsub("\\$","",iowadf$Total.Sales.to.Store..Dollars.)

# We also need to convert the output to a float.
iowadf$SalesToStore <- as.numeric(iowadf$SalesToStore)
print(str(iowadf))
head(iowadf)









    



'data.frame':	13779 obs. of  7 variables:
 $ Date                          : chr  "10/21/2016" "10/26/2016" "12/21/2016" "12/29/2016" ...
 $ Store.Number                  : int  5110 4201 4191 3877 9928 4641 4345 4599 2652 4236 ...
 $ City                          : chr  "Lenox" "Sioux City" "Pella" "Des Moines" ...
 $ Zip.Code                      : chr  "50851" "51109" "50219" "50322" ...
 $ County                        : chr  "TAYLOR" "WOODBURY" "MARION" "Polk" ...
 $ Total.Sales.to.Store..Dollars.: chr  "$1174.81" "$1218.25" "$296.56" "$7095.90" ...
 $ SalesToStore                  : num  1174.8 1218.2 296.6 7095.9 19.5 ...
NULL






    





Date Store.Number City Zip.Code County Total.Sales.to.Store..Dollars. SalesToStore

	10/21/2016  5110        Lenox       50851       TAYLOR      $1174.81    1174.81     
	10/26/2016  4201        Sioux City  51109       WOODBURY    $1218.25    1218.25     
	12/21/2016  4191        Pella       50219       MARION      $296.56      296.56     
	12/29/2016  3877        Des Moines  50322       Polk        $7095.90    7095.90     
	10/13/2016  9928        Cedar Rapids 52404       LINN        $19.50        19.50     
	10/25/2016  4641        Des Moines  50315       POLK        $513.65      513.65



In [4]:

    
# We also need to convert the date. We'll try this and see if it parses the date correctly.
iowadf$SalesDate <- as.POSIXct(iowadf$Date, format="%m/%d/%Y")
print(str(iowadf))
head(iowadf)









    



'data.frame':	13779 obs. of  8 variables:
 $ Date                          : chr  "10/21/2016" "10/26/2016" "12/21/2016" "12/29/2016" ...
 $ Store.Number                  : int  5110 4201 4191 3877 9928 4641 4345 4599 2652 4236 ...
 $ City                          : chr  "Lenox" "Sioux City" "Pella" "Des Moines" ...
 $ Zip.Code                      : chr  "50851" "51109" "50219" "50322" ...
 $ County                        : chr  "TAYLOR" "WOODBURY" "MARION" "Polk" ...
 $ Total.Sales.to.Store..Dollars.: chr  "$1174.81" "$1218.25" "$296.56" "$7095.90" ...
 $ SalesToStore                  : num  1174.8 1218.2 296.6 7095.9 19.5 ...
 $ SalesDate                     : POSIXct, format: "2016-10-21" "2016-10-26" ...
NULL






    





Date Store.Number City Zip.Code County Total.Sales.to.Store..Dollars. SalesToStore SalesDate

	10/21/2016  5110        Lenox       50851       TAYLOR      $1174.81    1174.81     2016-10-21  
	10/26/2016  4201        Sioux City  51109       WOODBURY    $1218.25    1218.25     2016-10-26  
	12/21/2016  4191        Pella       50219       MARION      $296.56      296.56     2016-12-21  
	12/29/2016  3877        Des Moines  50322       Polk        $7095.90    7095.90     2016-12-29  
	10/13/2016  9928        Cedar Rapids 52404       LINN        $19.50        19.50     2016-10-13  
	10/25/2016  4641        Des Moines  50315       POLK        $513.65      513.65     2016-10-25



In [5]:

    
# Let's do a bit more data exploration here to see if there are any other issues. For example, let's see if there are any NA values in the dataframe. That may indicate a problem. We'll first check the entire dataframe.
# This looks for any null value then combines them together. It will only be true if there are any null values.
any(is.na(iowadf))









    




TRUE



In [6]:

    
#Let's see how many data points have null values. We'll look at the total rows and 
print(paste("Initial rows:",nrow(iowadf)))
iowadfcl <- iowadf[complete.cases(iowadf),]
print(paste("Initial rows:",nrow(iowadfcl)))
any(is.na(iowadfcl))









    



[1] "Initial rows: 13779"
[1] "Initial rows: 13754"






    




FALSE

We only lose 25 rows out of 13,000. I'm going to go with that- it simplifies further computations.

Grouping and subsetting data

We now want to narrow down the data in order to make it easier to work with joining datasets. Let's get the top-purchasing store and use just that store for our further work. We first need to do an analysis step: we want to add up all the sales for each store. In other words, we want to group by store, then sum up the StoreSales.

This is a good tutorial on how to use the dplyr package in R, which will will be using here.

The first step is to use the group_by() function to group the data into sections that all have something in common. For our case, we will group by the "Store Number". Dplyr will go through that column, find all the unique values, then group them together. We then want to summarize() the results from the grouping. The function n() will count the number of entries in each group. We assign the output to a new column called count. The pipe operator %>% tells dplr to take the item on the left and send it to the item on the right.



In [8]:

    
library(dplyr)
iowadf %>% group_by(Store.Number) %>% summarize(count=n())









    





Store.Number count

	2106 12  
	2113 13  
	2130 12  
	2178 12  
	2190 54  
	2191 13  
	2200 13  
	2228 13  
	2233 13  
	2238  8  
	2248 15  
	2285 13  
	2290  1  
	2327 12  
	2413 13  
	2445 11  
	2448 13  
	2459 10  
	2460 15  
	2465 13  
	2475 12  
	2478 13  
	2498  4  
	2500 14  
	2501 13  
	2502 16  
	2505 18  
	2506 13  
	2507 13  
	2508 14  
	... ...
	5350  1  
	5351  5  
	5352  2  
	5353  2  
	5354  1  
	5355  1  
	5356  1  
	5358  2  
	5359  1  
	5360  2  
	9001 10  
	9002  9  
	9010  2  
	9013  9  
	9018  1  
	9022  1  
	9023  2  
	9901  9  
	9902  7  
	9906 21  
	9910  1  
	9911 30  
	9912  1  
	9913  3  
	9919  3  
	9922  1  
	9927  1  
	9928  1  
	9929  1  
	9930  1

Instead of the count, what if we want the sum of total sales to the stores, sorted by the maximum first? We create a new pipeline that groups, summs, then sorts. Finally, we assign the output of this to a new dataframe.



In [9]:

    
storesum <- iowadf %>% group_by(Store.Number) %>% summarize(TotalSales=sum(SalesToStore)) %>% arrange(desc(TotalSales))
head(storesum,10)









    





Store.Number TotalSales

	2633    805018.2
	4829    788918.6
	2670    441203.2
	3773    391374.5
	2512    381749.6
	3952    329222.1
	5102    322702.9
	2663    301419.4
	2619    284403.0
	3385    283767.4

Subsetting Data

It looks like the store number 2633 purchased more liquor than any other store, so we want only that subset of the data. We subset the data by looking only for rows where the store number matches 2633. This returns a column of True/False values where the Store Number matches our store.



In [10]:

    
subsetrows <- iowadfcl$Store.Number==2633
head(subsetrows)









    





	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE

We then index the dataframe based on those rows. We get a dataframe that only has data from the store we want. Again, notice the , after subsetrows. R indexes dataframes using a [rows,columns] format. By leaving the columns entry blank, we get all of the columns in the dataframe.



In [11]:

    
topstoredf <- iowadfcl[subsetrows,]
head(topstoredf)









    





Date Store.Number City Zip.Code County Total.Sales.to.Store..Dollars. SalesToStore SalesDate

	322 11/03/2016 2633      Des Moines 50320     POLK      $31294.03 31294.03  2016-11-03
	1106 10/06/2016 2633      Des Moines 50320     POLK      $34386.10 34386.10  2016-10-06
	1484 10/13/2016 2633      Des Moines 50320     POLK      $37319.85 37319.85  2016-10-13
	1542 12/16/2016 2633      Des Moines 50320     POLK      $79.56       79.56  2016-12-16
	2404 11/17/2016 2633      Des Moines 50320     POLK      $41055.27 41055.27  2016-11-17
	2641 10/03/2016 2633      Des Moines 50320     POLK      $27592.98 27592.98  2016-10-03

There is a lot of duplicated information here. We only really care about the SalesDate and the SalesToStore column. So we'll drop the other columns in this dataframe and assign it the same name (effectively dropping the columns).



In [12]:

    
topstoredf <-  subset(topstoredf, select=-c(Date, Store.Number, City, Zip.Code, County, Total.Sales.to.Store..Dollars.))

We now want to sort the data by date so that it is listed in chronological order. We will use the sort_values() function (in place). We tell dplyr which column to sort by.



In [13]:

    
topstoredf <- topstoredf %>% arrange(SalesDate)
head(topstoredf)









    





SalesToStore SalesDate

	27592.98  2016-10-03
	 1621.20  2016-10-04
	34386.10  2016-10-06
	30216.88  2016-10-10
	37319.85  2016-10-13
	21817.61  2016-10-17

Before we move on, let's look at this data. It is always a good idea to have some idea of what the data look like!



In [14]:

    
library(ggplot2)
options(repr.plot.width=5, repr.plot.height=5)
ggplot(topstoredf) + geom_point(aes(SalesDate,SalesToStore))

A Second dataset

Now we need another dataset to work with. Let's try pulling the Dow Jones Industrial average index for each day. We'll download the data from here. We will then turn the date column into the appropriate datetime format.



In [16]:

    
djiadf <- read.csv('Class05_DJIA_data.csv',stringsAsFactors=FALSE)
djiadf$DATE <- as.POSIXct(djiadf$DATE)
djiadf$VALUE <- as.numeric(djiadf$VALUE)
str(djiadf)









    



Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"





    



'data.frame':	65 obs. of  2 variables:
 $ DATE : POSIXct, format: "2016-10-03" "2016-10-04" ...
 $ VALUE: num  18254 18168 18281 18269 18240 ...



In [17]:

    
# We'll rename the second column to something a little more descriptive
djiadf <- rename(djiadf, DowJonesAvg=VALUE)
#Let's check for problems:
all(complete.cases(djiadf))









    




FALSE



In [18]:

    
# Looks like there are two problem rows. Let's identify which rows by indexing the dataframe any row that isn't complete
djiadf[!complete.cases(djiadf),]









    





DATE DowJonesAvg

	39 2016-11-24 NA        
	61 2016-12-26 NA



In [19]:

    
# These were both holidays and shouldn't be included in the data set anyway! We can drop them in place, modifying our dataframe.
djiadf<- djiadf[complete.cases(djiadf),]
all(complete.cases(djiadf))









    




TRUE

And, of course, let's plot it to see what it looks like.



In [20]:

    
ggplot(djiadf) + geom_point(aes(DATE,DowJonesAvg))

Joining Data

We now want to join the data from the store and the Dow Jones Industrial average. This will let us plot the two together to see if there is any correlation between the DJIA and the purchases that the store is making.

In order to join two dataframes we need to know how to align the rows or how to match up the data. The easiest way to do this is to try it out with an example to see how it works. See this tutorial for more details on merging data in R. We start by creating two small test dataframes.



In [21]:

    
df_a <- data.frame(
        patient_id=c('A', 'B', 'C', 'D', 'E'),
        first_name= c('Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'), 
        last_name= c('Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'),
        visit_number= c(1,2,3,4,5),stringsAsFactors=FALSE)

df_a









    





patient_id first_name last_name visit_number

	A       Alex    Anderson 1       
	B       Amy     Ackerman 2       
	C       Allen   Ali     3       
	D       Alice   Aoni    4       
	E       Ayoung  Atiches 5



In [22]:

    
df_b <- data.frame(
        doctor_id= c('G', 'H', 'I', 'A', 'B'),
        first_name= c('Billy', 'Brian', 'Bran', 'Bryce', 'Betty'), 
        last_name= c('Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'),
        visit_number= c(4,5,6,7,8),stringsAsFactors=FALSE)
df_b









    





doctor_id first_name last_name visit_number

	G      Billy  Bonder 4      
	H      Brian  Black  5      
	I      Bran   Balwner 6      
	A      Bryce  Brice  7      
	B      Betty  Btisan 8



In [23]:

    
str(df_a)









    



'data.frame':	5 obs. of  4 variables:
 $ patient_id  : chr  "A" "B" "C" "D" ...
 $ first_name  : chr  "Alex" "Amy" "Allen" "Alice" ...
 $ last_name   : chr  "Anderson" "Ackerman" "Ali" "Aoni" ...
 $ visit_number: num  1 2 3 4 5

How do we tack df_b on the end of df_a? This is useful if you have pieces of a dataset that you want to combine together. Notice that there are now NA values in the table! The first dataframe didn't have the doctor_id column, so R filled in those values with NA. Same for the second dataframe and the patient_id values. We could drop those columns if we wanted to.



In [24]:

    
bind_rows(df_a, df_b)









    





patient_id first_name last_name visit_number doctor_id

	A       Alex    Anderson 1       NA      
	B       Amy     Ackerman 2       NA      
	C       Allen   Ali     3       NA      
	D       Alice   Aoni    4       NA      
	E       Ayoung  Atiches 5       NA      
	NA      Billy   Bonder  4       G       
	NA      Brian   Black   5       H       
	NA      Bran    Balwner 6       I       
	NA      Bryce   Brice   7       A       
	NA      Betty   Btisan  8       B

What if we want to join together the rows that have the same visit_number? Looking at the table, we see that Billy Bonder and Alice Aoni have the same visit_number. Let's merge the two dataframes together to see how that works.



In [25]:

    
merge(df_a,df_b,by="visit_number")









    





visit_number patient_id first_name.x last_name.x doctor_id first_name.y last_name.y

	4      D      Alice  Aoni   G      Billy  Bonder 
	5      E      Ayoung Atiches H      Brian  Black

So we now have a much smaller dataframe that only has the rows where visit_number was the same in both. Note that R also has changed the column names. This is because there were duplicate names and it can't have that to maintain unique addressing. Look at the documentation for R to see how to change the suffixes that R appends.

What if, instead, we wanted to merge rows where the patient_id from one matches the doctor_id from the other? We can do that, too. The two dataframes are given the designations x and y for the first and second dataframes, respectively. So we need to identify the columns in each to match.



In [26]:

    
merge(df_a,df_b,by.x='patient_id',by.y='doctor_id')









    





patient_id first_name.x last_name.x visit_number.x first_name.y last_name.y visit_number.y

	A       Alex    Anderson 1       Bryce   Brice   7       
	B       Amy     Ackerman 2       Betty   Btisan  8

What if we want to keep all of the entries in the left dataframe and only add in values on the right where they match? We can do that, too!



In [27]:

    
merge(df_a,df_b,by='visit_number',all.x=TRUE)









    





visit_number patient_id first_name.x last_name.x doctor_id first_name.y last_name.y

	1       A       Alex    Anderson NA      NA      NA      
	2       B       Amy     Ackerman NA      NA      NA      
	3       C       Allen   Ali     NA      NA      NA      
	4       D       Alice   Aoni    G       Billy   Bonder  
	5       E       Ayoung  Atiches H       Brian   Black

Not surprisingly, we can do the right-side, too.



In [28]:

    
merge(df_a,df_b,by='visit_number',all.y=TRUE)









    





visit_number patient_id first_name.x last_name.x doctor_id first_name.y last_name.y

	4      D      Alice  Aoni   G      Billy  Bonder 
	5      E      Ayoung Atiches H      Brian  Black  
	6      NA     NA     NA     I      Bran   Balwner
	7      NA     NA     NA     A      Bryce  Brice  
	8      NA     NA     NA     B      Betty  Btisan

Finally, we can do both and include all the data from both dataframes, matching where possible. That is called an "outer" join.



In [29]:

    
merge(df_a,df_b,by='visit_number',all=TRUE)









    





visit_number patient_id first_name.x last_name.x doctor_id first_name.y last_name.y

	1       A       Alex    Anderson NA      NA      NA      
	2       B       Amy     Ackerman NA      NA      NA      
	3       C       Allen   Ali     NA      NA      NA      
	4       D       Alice   Aoni    G       Billy   Bonder  
	5       E       Ayoung  Atiches H       Brian   Black   
	6       NA      NA      NA      I       Bran    Balwner 
	7       NA      NA      NA      A       Bryce   Brice   
	8       NA      NA      NA      B       Betty   Btisan

Choosing the Merge Index

How do you choose the merge index? We used both the visit_number and the patient/doctor_id values. The key is to be able to merge on some common feature. This may be date (like we show below), but it could also be on zip code, a city name, the month number, financial quarter, or any other piece of information that you have common between your two datasets. You need to make sure that the merge indices have the same data type and are in the same format. For example, the computer doesn't know that 'DALLAS' and 'Dallas' are the same city. We can use the tools we've previously covered to clean up the data prior to attempting the merge.

Merging the Store and DJIA data

Since we want to keep all of the store data and merge in the DJIA data where we can, we are going to try a left join. We are also going to try joining on the date - they should match up reasonably well.



In [30]:

    
storeDJ <- merge(topstoredf, djiadf, by.x='SalesDate',by.y='DATE')
head(storeDJ)









    





SalesDate SalesToStore DowJonesAvg

	2016-10-03 27592.98  18253.85  
	2016-10-04  1621.20  18168.45  
	2016-10-06 34386.10  18268.50  
	2016-10-10 30216.88  18329.04  
	2016-10-13 37319.85  18098.94  
	2016-10-17 21817.61  18086.40

Now we can do what we wanted to do: plot the data to see if there is a correlation here!



In [31]:

    
ggplot(storeDJ) + geom_point(aes(DowJonesAvg,SalesToStore))

Well, after all that work, it doesn't look like there is anything there... Perhaps that isn't too surprising as the purchases by a single store may not depend on a macro-scale indicator like the Dow Jones Industrial average. But we didn't know that until we checked.

In-class Activity

Instead of looking at the sales to one store, what if we look at the total sales to all of the stores in the state of Iowa? Go through the process of grouping that data and joining it with the DJIA data. Plot the results and show me before you leave.

Assignment

Your assignment this week is to do data enriching on your dataset. You can either add/join data or do some kind of grouping to subset the data. I want documentation of what you did and how it turned out. Like our example today, you may find that the data enriching doesn't do much.



In [ ]:

Date	Store.Number	City	Zip.Code	County	Total.Sales.to.Store..Dollars.
10/21/2016	5110	Lenox	50851	TAYLOR	$1174.81
10/26/2016	4201	Sioux City	51109	WOODBURY	$1218.25
12/21/2016	4191	Pella	50219	MARION	$296.56
12/29/2016	3877	Des Moines	50322	Polk	$7095.90
10/13/2016	9928	Cedar Rapids	52404	LINN	$19.50
10/25/2016	4641	Des Moines	50315	POLK	$513.65

Store.Number	count
2106	12
2113	13
2130	12
2178	12
2190	54
2191	13
2200	13
2228	13
2233	13
2238	8
2248	15
2285	13
2290	1
2327	12
2413	13
2445	11
2448	13
2459	10
2460	15
2465	13
2475	12
2478	13
2498	4
2500	14
2501	13
2502	16
2505	18
2506	13
2507	13
2508	14
...	...
5350	1
5351	5
5352	2
5353	2
5354	1
5355	1
5356	1
5358	2
5359	1
5360	2
9001	10
9002	9
9010	2
9013	9
9018	1
9022	1
9023	2
9901	9
9902	7
9906	21
9910	1
9911	30
9912	1
9913	3
9919	3
9922	1
9927	1
9928	1
9929	1
9930	1

Store.Number	TotalSales
2633	805018.2
4829	788918.6
2670	441203.2
3773	391374.5
2512	381749.6
3952	329222.1
5102	322702.9
2663	301419.4
2619	284403.0
3385	283767.4

	Date	Store.Number	City	Zip.Code	County	Total.Sales.to.Store..Dollars.	SalesToStore	SalesDate
322	11/03/2016	2633	Des Moines	50320	POLK	$31294.03	31294.03	2016-11-03
1106	10/06/2016	2633	Des Moines	50320	POLK	$34386.10	34386.10	2016-10-06
1484	10/13/2016	2633	Des Moines	50320	POLK	$37319.85	37319.85	2016-10-13
1542	12/16/2016	2633	Des Moines	50320	POLK	$79.56	79.56	2016-12-16
2404	11/17/2016	2633	Des Moines	50320	POLK	$41055.27	41055.27	2016-11-17
2641	10/03/2016	2633	Des Moines	50320	POLK	$27592.98	27592.98	2016-10-03

SalesToStore	SalesDate
27592.98	2016-10-03
1621.20	2016-10-04
34386.10	2016-10-06
30216.88	2016-10-10
37319.85	2016-10-13
21817.61	2016-10-17

patient_id	first_name	last_name	visit_number
A	Alex	Anderson	1
B	Amy	Ackerman	2
C	Allen	Ali	3
D	Alice	Aoni	4
E	Ayoung	Atiches	5

doctor_id	first_name	last_name	visit_number
G	Billy	Bonder	4
H	Brian	Black	5
I	Bran	Balwner	6
A	Bryce	Brice	7
B	Betty	Btisan	8

SalesDate	SalesToStore	DowJonesAvg
2016-10-03	27592.98	18253.85
2016-10-04	1621.20	18168.45
2016-10-06	34386.10	18268.50
2016-10-10	30216.88	18329.04
2016-10-13	37319.85	18098.94
2016-10-17	21817.61	18086.40