First look at Data

We will look at the location dataset from the VAST Challenge 2015.

This initial exploration uses pandas, NumPy, matplotlib and seaborn.

0. Setup environment



In [1]:

    
import pandas as pd
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set_style("darkgrid")



In [2]:

    
%cd C:\Users\Profesor\Documents\curso_va_2015\va_course_2015









    



C:\Users\Profesor\Documents\curso_va_2015\va_course_2015

1. Read the data



In [3]:

    
df = pd.read_csv("../MC1 2015 Data/park-movement-Fri.csv")

Let's look at the first five rows



In [4]:

    
df.head()









    Out[4]:






  
    
      
            Timestamp       id      type   X   Y
0  2014-6-06 08:00:16  1591741  check-in  63  99
1  2014-6-06 08:00:16   825652  check-in  63  99
2  2014-6-06 08:00:19   179386  check-in  63  99
3  2014-6-06 08:00:19   531348  check-in  63  99
4  2014-6-06 08:00:31  1483004  check-in   0  67

What is the size of the table?



In [5]:

    
df.shape









    Out[5]:





(5350348, 5)

What are the types of the data?



In [6]:

    
df.dtypes









    Out[6]:





Timestamp    object
id            int64
type         object
X             int64
Y             int64
dtype: object
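
The default dtypes above are wider than this table needs. The following is a minimal sketch, not part of the original notebook, of how narrower dtypes could be requested at load time. It assumes the CSV header matches the column names shown by `head()`, that X and Y stay within 0..99, and that ids fit in 32 bits; the five rows are copied from the first rows shown earlier and stand in for the real file.

```python
import io
import pandas as pd

# Five rows copied from the head() output; they stand in for the real CSV.
sample_csv = io.StringIO(
    "Timestamp,id,type,X,Y\n"
    "2014-6-06 08:00:16,1591741,check-in,63,99\n"
    "2014-6-06 08:00:16,825652,check-in,63,99\n"
    "2014-6-06 08:00:19,179386,check-in,63,99\n"
    "2014-6-06 08:00:19,531348,check-in,63,99\n"
    "2014-6-06 08:00:31,1483004,check-in,0,67\n"
)

# Assumptions: X and Y fit in int8, ids fit in int32, and "type" has
# only a few distinct values, so a categorical column is enough.
compact_dtypes = {"id": "int32", "type": "category", "X": "int8", "Y": "int8"}
sample_df = pd.read_csv(sample_csv, dtype=compact_dtypes)

print(sample_df.dtypes)
print(sample_df.memory_usage(deep=True))
```

On a table with several million rows, the narrower columns should reduce memory use noticeably, although the exact saving depends on the data.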

What are the values of type?



In [7]:

    
df["type"].unique()









    Out[7]:





array(['check-in', 'movement'], dtype=object)



In [8]:

    
df.groupby("type")["type"].count()









    Out[8]:





type
check-in      77771
movement    5272577
Name: type, dtype: int64
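
The same per-type counts can also be obtained with `value_counts`, which can return fractions as well. This is a small sketch on a toy Series with hypothetical values, not the real data.

```python
import pandas as pd

# Toy stand-in for the "type" column (hypothetical values).
types = pd.Series(["check-in", "movement", "movement", "movement"], name="type")

counts = types.value_counts()                # absolute counts per category
shares = types.value_counts(normalize=True)  # fraction of rows per category

print(counts)
print(shares)
```

Applied to the counts shown above, check-ins make up roughly 1.5% of all rows (77771 of 5350348).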

How many different ids are there?



In [9]:

    
df["id"].unique().shape









    Out[9]:





(3557,)



In [10]:

    
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).head()



In [11]:

    
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).mean()









    Out[11]:





type
check-in      21.864211
movement    1482.310093
dtype: float64
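
An alternative way to build the same per-guest table is `pd.crosstab`. The sketch below uses a toy frame with hypothetical ids. One difference to keep in mind: `crosstab` fills missing combinations with 0, while `pivot_table` with `aggfunc=len` leaves them as NaN, which `mean()` then skips.

```python
import pandas as pd

# Toy stand-in for the movement table (hypothetical ids and rows).
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "type": ["check-in", "movement", "movement",
             "movement", "movement", "movement"],
})

# One row per guest, one column per sample type, cells hold row counts.
per_guest = pd.crosstab(toy["id"], toy["type"])
print(per_guest)

# Mean number of samples of each type per guest (zeros are included).
print(per_guest.mean())
```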

What is the type of the timestamps?



In [12]:

    
type(df.Timestamp[0])









    Out[12]:





str

They are strings. It would be better if they were dates, so let's fix that with the to_datetime function.



In [13]:

    
df["time"] = pd.to_datetime(df.Timestamp, format="%Y-%m-%d %H:%M:%S")



In [14]:

    
df.tail()









    Out[14]:






  
    
      
                  Timestamp       id      type   X   Y                time
5350343  2014-6-06 20:12:07  1168815  movement  41  76 2014-06-06 20:12:07
5350344  2014-6-06 20:12:07   321318  movement  68  64 2014-06-06 20:12:07
5350345  2014-6-06 20:12:07  1687201  movement  15  42 2014-06-06 20:12:07
5350346  2014-6-06 20:12:07   580635  movement  16  40 2014-06-06 20:12:07
5350347  2014-6-06 20:12:07   973520  movement  25  67 2014-06-06 20:12:07



In [16]:

    
df.dtypes









    Out[16]:





Timestamp            object
id                    int64
type                 object
X                     int64
Y                     int64
time         datetime64[ns]
dtype: object

Now the time column contains datetime objects
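
The format string passed to `to_datetime` uses the usual strptime directives. As a small illustration with the standard library only, the sketch below parses one raw timestamp copied from the table; note that the month in the raw data has no leading zero, and `%m` still matches it.

```python
from datetime import datetime

# Timestamp string copied from the first row of the table.
raw = "2014-6-06 08:00:16"

# %m matches a month with or without a leading zero, so "6" parses as June.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
print(parsed.isoformat(sep=" "))
```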

2. Looking at location data

First, take a random subsample to speed up exploration



In [17]:

    
df_small = df.sample(10000)



In [18]:

    
df_small.shape









    Out[18]:





(10000, 6)
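
Because `sample` draws different rows on every run, the plots that follow will change slightly between executions. If repeatable results are wanted, a seed can be fixed with `random_state`. A minimal sketch on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values).
toy = pd.DataFrame({"X": np.arange(100), "Y": np.arange(100)[::-1]})

# Fixing random_state makes the subsample repeatable across runs.
first = toy.sample(10, random_state=0)
second = toy.sample(10, random_state=0)

print(first.index.equals(second.index))
```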

We will now create a simple scatter plot with all the X and Y values in our subsample



In [19]:

    
df_small.plot("X","Y","scatter")









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x9e6deb8>

It looks very similar to the paths on the map.

Now let's look at just the check-in samples



In [20]:

    
df_small.loc[df_small["type"]=="check-in"].plot("X","Y","scatter")









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0xb0bf470>

Let's look at the range of the location data



In [21]:

    
df["X"].min()









    Out[21]:





0



In [22]:

    
df["X"].max()









    Out[22]:





99



In [23]:

    
df["Y"].min()









    Out[23]:





11



In [24]:

    
df["Y"].max()









    Out[24]:





99
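
The four separate calls above can be collapsed into a single aggregation. A short sketch on a toy frame with hypothetical coordinates:

```python
import pandas as pd

# Toy coordinates (hypothetical values on the same 0..99 grid).
toy = pd.DataFrame({"X": [0, 63, 99, 41], "Y": [67, 99, 11, 76]})

# One call returns a small table with min and max for both columns.
bounds = toy[["X", "Y"]].agg(["min", "max"])
print(bounds)
```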

Now let's create a 2D histogram to see which locations are more popular. We will use the hist2d function.



In [25]:

    
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0,100),(0,100)), density=True)

We can increase the number of bins



In [26]:

    
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0,100),(0,100)), density=True, bins=20)
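
When only the numbers are needed, the same binning can be computed without drawing anything by using `np.histogram2d`. The sketch below runs on random, hypothetical coordinates and checks what the normalization means: with `density=True` the cell values times the cell area sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical integer coordinates on the same 0..99 grid as the park data.
x = rng.integers(0, 100, size=1000)
y = rng.integers(0, 100, size=1000)

counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=20, range=((0, 100), (0, 100)), density=True
)

# Each cell holds a probability density, so densities times the
# cell area (5 x 5 units here) add up to 1.
cell_area = np.diff(x_edges)[0] * np.diff(y_edges)[0]
print(counts.shape, counts.sum() * cell_area)
```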



In [27]:

    
df_small.plot("X","Y","hexbin")









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0xb125b00>

3. Single guest

Now let's plot the locations for a single random person



In [28]:

    
guest_id = np.random.choice(df["id"])



In [29]:

    
guest_df = df.loc[df["id"]==guest_id]



In [30]:

    
guest_df.shape









    Out[30]:





(1155, 6)



In [31]:

    
guest_df.plot("X","Y","scatter")









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x9f00d30>

Now let's try to use the time information



In [32]:

    
plt.scatter(guest_df["X"],guest_df["Y"],c=guest_df["time"])









    Out[32]:





<matplotlib.collections.PathCollection at 0xb524470>
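
Coloring by a datetime column relies on matplotlib converting the values to numbers. A more explicit option is to derive a numeric column first, for example hours since midnight. This is a sketch on a toy track with hypothetical positions and times; the column name `hour_of_day` is an assumption made here for illustration.

```python
import pandas as pd

# Toy guest track (hypothetical positions and times).
track = pd.DataFrame({
    "X": [63, 60, 55],
    "Y": [99, 95, 90],
    "time": pd.to_datetime([
        "2014-06-06 08:30:00",
        "2014-06-06 12:00:00",
        "2014-06-06 20:15:00",
    ]),
})

# Hours since midnight as a plain float column, usable as a color value.
t = track["time"].dt
track["hour_of_day"] = t.hour + t.minute / 60 + t.second / 3600
print(track)
```

The new column can then be passed as `c=track["hour_of_day"]` to `plt.scatter`, and a following `plt.colorbar()` gives a legend expressed in hours.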

At what time did he arrive?



In [33]:

    
guest_df["time"].min()









    Out[33]:





Timestamp('2014-06-06 08:25:55')

At what time did he leave?



In [34]:

    
guest_df["time"].max()









    Out[34]:





Timestamp('2014-06-06 20:11:59')

So how long did he stay?



In [35]:

    
guest_df["time"].max() - guest_df["time"].min()









    Out[35]:





Timedelta('0 days 11:46:04')
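
To express a stay like this as a number of hours, `total_seconds()` is the safe accessor. The sketch below uses the value from Out[35] and also shows, on a hypothetical longer stay, how the `seconds` attribute differs: it only holds the part below one day.

```python
import pandas as pd

stay = pd.Timedelta("0 days 11:46:04")  # value shown in Out[35]
hours = stay.total_seconds() / 3600
print(round(hours, 2))

# .seconds only holds the sub-day part; total_seconds() includes whole days.
long_stay = pd.Timedelta(days=1, hours=2)  # hypothetical longer stay
print(long_stay.seconds, long_stay.total_seconds())
```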

4. Single time frame

Where were the guests between 12:00 and 12:05?



In [36]:

    
noon_dates = (df["time"] < '2014-06-06 12:05:00') & (df["time"] >= '2014-06-06 12:00:00')



In [37]:

    
noon_df = df.loc[noon_dates]



In [38]:

    
noon_df.shape









    Out[38]:





(43356, 6)



In [39]:

    
plt.scatter(noon_df["X"], noon_df["Y"], alpha=0.01, marker="o", s=30)









    Out[39]:





<matplotlib.collections.PathCollection at 0xb5bf438>

Let's add some jitter



In [40]:

    
plt.scatter(noon_df["X"] + 5*np.random.random(len(noon_df)),
            noon_df["Y"] + 5*np.random.random(len(noon_df)),
            alpha=0.01, marker="o", s=30)









    Out[40]:





<matplotlib.collections.PathCollection at 0xb6277f0>
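
The jitter used above is drawn from [0, 5), so every point is shifted up and to the right. A possible variant, sketched here on hypothetical coordinates, uses a symmetric offset so that each point stays within half a unit of its recorded position. The seed and the jitter width are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical grid positions; several points share the same location.
x = np.array([63, 63, 63, 0, 0])
y = np.array([99, 99, 99, 67, 67])

# Uniform jitter in [-0.5, 0.5) spreads overlapping points without
# moving any of them more than half a unit.
x_jit = x + rng.uniform(-0.5, 0.5, size=x.size)
y_jit = y + rng.uniform(-0.5, 0.5, size=y.size)
print(np.column_stack([x_jit, y_jit]))
```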

5. Time analysis

Now let's ask some simple questions about the time data

At what time do guests arrive?
At what time do they leave?
How long do they stay?
How does park occupancy vary during the day?

To answer the first questions we need to transform the data



In [41]:

    
grouped_times = df.groupby("id")["time"]



In [42]:

    
arrivals = grouped_times.min()



In [43]:

    
departures = grouped_times.max()



In [44]:

    
duration = departures - arrivals
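
The three steps above (first record, last record, difference) can also be collected in one table per guest. A sketch on a toy frame with hypothetical ids and times; the column names `arrival`, `departure` and `duration` are chosen here for illustration.

```python
import pandas as pd

# Toy records (hypothetical ids and times).
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "time": pd.to_datetime([
        "2014-06-06 08:00:16", "2014-06-06 12:30:00",
        "2014-06-06 09:15:00", "2014-06-06 10:00:00",
        "2014-06-06 20:12:07",
    ]),
})

# First and last record per guest in a single pass over the groups.
visits = (
    toy.groupby("id")["time"]
    .agg(["min", "max"])
    .rename(columns={"min": "arrival", "max": "departure"})
)
visits["duration"] = visits["departure"] - visits["arrival"]
print(visits)
```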



In [45]:

    
sns.distplot(arrivals.dt.hour+arrivals.dt.minute/60)









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0xb5fcc18>



In [46]:

    
sns.distplot(departures.dt.hour+departures.dt.minute/60)









    Out[46]:





<matplotlib.axes._subplots.AxesSubplot at 0xb8b8b38>



In [47]:

    
h_duration = duration.dt.total_seconds()/60/60



In [48]:

    
sns.distplot(h_duration)









    Out[48]:





<matplotlib.axes._subplots.AxesSubplot at 0xd180080>

Now for the question of park occupancy, we need to group the dataframe by time



In [49]:

    
time_groups = df.groupby(df.time.dt.hour)



In [50]:

    
occupancy = time_groups["id"].nunique()



In [51]:

    
occupancy.plot()









    Out[51]:





<matplotlib.axes._subplots.AxesSubplot at 0xd249d68>
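
Grouping by hour gives a coarse curve. The same idea works with finer time slots by flooring the timestamps, for example to 15 minutes. The sketch below uses a toy frame with hypothetical records, and it treats "occupancy" as the number of distinct ids with at least one record in the slot, which is an assumption carried over from the hourly version.

```python
import pandas as pd

# Toy records (hypothetical ids and times).
toy = pd.DataFrame({
    "id": [1, 2, 1, 3, 3],
    "time": pd.to_datetime([
        "2014-06-06 12:01:00", "2014-06-06 12:07:00",
        "2014-06-06 12:14:00", "2014-06-06 12:16:00",
        "2014-06-06 12:29:00",
    ]),
})

# Round every timestamp down to the start of its 15-minute slot,
# then count the distinct guests seen in each slot.
slots = toy["time"].dt.floor("15min")
active = toy.groupby(slots)["id"].nunique()
print(active)
```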

Questions

What places did the people who stayed for less than 4 hours visit?



In [ ]:

What is the distribution of total traveled distance of park visitors?



In [63]:

What is the mean speed of the park visitors?



In [ ]:

Which visitors walked the most?



In [ ]:

At what times are check-in samples recorded?



In [ ]:

Out[10] (per-guest sample counts by type, first five rows):

type  check-in  movement
id
941         38      1999
2672        35      1934
4343        12       696
4828        27      1564
4908        24      2167
