We will look at the location dataset from the VAST Challenge 2015.
This initial exploration will be accomplished using the following tools:
In [1]:
import pandas as pd
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set_style("darkgrid")
In [2]:
%cd C:\Users\Profesor\Documents\curso_va_2015\va_course_2015
In [3]:
df = pd.read_csv("../MC1 2015 Data/park-movement-Fri.csv")
Let's look at the first five rows
In [4]:
df.head()
Out[4]:
What is the size of the table?
In [5]:
df.shape
Out[5]:
What are the types of the data?
In [6]:
df.dtypes
Out[6]:
What are the values of the type column?
In [7]:
df["type"].unique()
Out[7]:
In [8]:
df.groupby("type")["type"].count()
Out[8]:
How many different ids are there?
In [9]:
df["id"].unique().shape
Out[9]:
In [10]:
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).head()
Out[10]:
In [11]:
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).mean()
Out[11]:
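To make the pivot table above easier to read, here is a minimal sketch on a toy table. The ids, coordinates, and the "movement" label are made up for illustration; only the column names follow the cells above. It shows that an id with no samples of a given type gets NaN, and that the column mean skips those NaN cells unless we ask for zeros with fill_value.

```python
import pandas as pd

# Toy stand-in for the movement table; all values are invented.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "type": ["movement", "movement", "check-in", "movement", "check-in", "movement"],
    "X":    [10, 11, 12, 40, 41, 70],
})

# One row per id, one column per type, each cell counting the samples.
counts = pd.pivot_table(toy, columns="type", values="X", index="id", aggfunc=len)
print(counts)

# The mean skips NaN cells, so it averages only over ids that have that type.
print(counts.mean())

# With fill_value=0, ids without a given type count as zero in the mean.
counts_zero = pd.pivot_table(toy, columns="type", values="X", index="id",
                             aggfunc=len, fill_value=0)
print(counts_zero.mean())
```

Which of the two means is the right one depends on the question: "samples per guest who checked in at all" versus "samples per guest overall".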
What is the type of the timestamps?
In [12]:
type(df.Timestamp[0])
Out[12]:
They are strings. It would be better if they were dates, so let's fix that with the to_datetime function.
In [13]:
df["time"] = pd.to_datetime(df.Timestamp, format="%Y-%m-%d %H:%M:%S")
In [14]:
df.tail()
Out[14]:
In [16]:
df.dtypes
Out[16]:
Now the time column contains datetime objects
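As a small illustration of why the conversion matters, the sketch below parses a few made-up timestamps with the same format string. Once the values are datetimes, the .dt accessor and time arithmetic become available.

```python
import pandas as pd

# Invented timestamps laid out like the format string used above.
raw = pd.Series(["2014-06-06 08:00:16", "2014-06-06 08:00:21", "2014-06-06 23:20:01"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")
print(raw.dtype, "->", parsed.dtype)

# Components such as the hour are only available on datetime columns.
print(parsed.dt.hour.tolist())

# Subtracting two timestamps yields a Timedelta.
span = parsed.max() - parsed.min()
print(span)
```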
First, take a random subsample to speed up exploration
In [17]:
df_small = df.sample(10000)
In [18]:
df_small.shape
Out[18]:
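Note that sample draws different rows on every run, so plots made from the subsample will change slightly each time. If reproducibility matters, one option is to pass random_state. A minimal sketch on a toy frame (the data and the seed value are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; only the row count matters here.
toy = pd.DataFrame({"X": np.arange(1000), "Y": np.arange(1000)})

# The same random_state returns the same rows on every call.
a = toy.sample(100, random_state=0)
b = toy.sample(100, random_state=0)
print(a.index.equals(b.index))

# Sampling is without replacement by default, so no row appears twice.
print(a.index.is_unique)
```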
We will now create a simple scatter plot with all the X and Y values in our subsample
In [19]:
df_small.plot("X","Y","scatter")
Out[19]:
It looks very similar to the paths in the map
Now let's look at just the check-in samples
In [20]:
df_small.loc[df_small["type"]=="check-in"].plot("X","Y","scatter")
Out[20]:
Let's look at the range of the location data
In [21]:
df["X"].min()
Out[21]:
In [22]:
df["X"].max()
Out[22]:
In [23]:
df["Y"].min()
Out[23]:
In [24]:
df["Y"].max()
Out[24]:
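The four cells above can also be condensed into a single call. A minimal sketch with invented coordinates:

```python
import pandas as pd

# Invented coordinates; the real table has the same X and Y column names.
toy = pd.DataFrame({"X": [0, 35, 99, 62], "Y": [12, 99, 47, 0]})

# One row per statistic, one column per coordinate.
extent = toy[["X", "Y"]].agg(["min", "max"])
print(extent)
```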
Now let's create a 2D histogram to see which locations are more popular. We will use the hist2d function.
In [25]:
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0,100),(0,100)), density=True)
We can increase the number of bins
In [26]:
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0,100),(0,100)), density=True, bins=20)
In [27]:
df_small.plot("X","Y","hexbin")
Out[27]:
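To see the numbers behind such a plot, we can compute the same kind of binning without drawing anything. The sketch below uses np.histogram2d on synthetic points (a dense cluster plus uniform background, both invented) and looks up the busiest cell.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic coordinates: a cluster around (27.5, 72.5) plus uniform noise.
x = np.concatenate([rng.normal(27.5, 2, 500), rng.uniform(0, 100, 500)])
y = np.concatenate([rng.normal(72.5, 2, 500), rng.uniform(0, 100, 500)])

# 20 bins per axis over the 0-100 range, i.e. cells of 5 x 5 units.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20, range=((0, 100), (0, 100)))

# Find the cell with the most points and report its extent.
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
print("busiest cell: X in [%g, %g), Y in [%g, %g)"
      % (x_edges[ix], x_edges[ix + 1], y_edges[iy], y_edges[iy + 1]))
```

The first axis of counts follows X and the second follows Y, which is worth remembering when passing the array to an image plot.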
In [28]:
guest_id = np.random.choice(df["id"])
In [29]:
guest_df = df.loc[df["id"]==guest_id]
In [30]:
guest_df.shape
Out[30]:
In [31]:
guest_df.plot("X","Y","scatter")
Out[31]:
Now let's try to use the time information
In [32]:
plt.scatter(guest_df["X"], guest_df["Y"], c=guest_df["time"].astype("int64"))  # numeric timestamps for the colormap
Out[32]:
At what time did he arrive?
In [33]:
guest_df["time"].min()
Out[33]:
At what time did he leave?
In [34]:
guest_df["time"].max()
Out[34]:
So how long did he stay?
In [35]:
guest_df["time"].max() - guest_df["time"].min()
Out[35]:
In [36]:
noon_dates = (df["time"] < '2014-06-06 12:05:00') & (df["time"] >= '2014-06-06 12:00:00')
In [37]:
noon_df = df.loc[noon_dates]
In [38]:
noon_df.shape
Out[38]:
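The mask above selects a half-open five-minute window: samples at exactly 12:00:00 are kept and samples at exactly 12:05:00 are not. A minimal sketch with invented rows on the window boundaries makes this explicit:

```python
import pandas as pd

# Four invented samples placed just outside and just inside the window.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "time": pd.to_datetime([
        "2014-06-06 11:59:59",
        "2014-06-06 12:00:00",
        "2014-06-06 12:04:59",
        "2014-06-06 12:05:00",
    ]),
})

start = pd.Timestamp("2014-06-06 12:00:00")
end = pd.Timestamp("2014-06-06 12:05:00")

# Half-open window [start, end): the start is included, the end is not.
window = (toy["time"] >= start) & (toy["time"] < end)
selected_ids = toy.loc[window, "id"].tolist()
print(selected_ids)
```

Half-open windows are convenient because consecutive windows never count the same sample twice.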
In [39]:
plt.scatter(noon_df["X"], noon_df["Y"], alpha=0.01, marker="o", s=30)
Out[39]:
Let's add some jitter
In [40]:
plt.scatter(noon_df["X"] + 5*np.random.random(len(noon_df)),
            noon_df["Y"] + 5*np.random.random(len(noon_df)),
            alpha=0.01, marker="o", s=30)
Out[40]:
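Since np.random.random draws from [0, 1), the jitter above only pushes points in the positive direction. A possible variant, sketched here with invented coordinates and an arbitrary width, centers the noise on the original position:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented coordinates with many overlapping points.
x = np.array([10, 10, 10, 50, 50], dtype=float)

# Symmetric noise in [-width, width) keeps each point centered on its cell.
width = 2.5
x_jittered = x + rng.uniform(-width, width, size=len(x))
print(x_jittered)
```

The same call can be repeated for the Y coordinate with an independent draw.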
Now let's try to ask some simple questions about time data
To answer the first questions, we need to transform the data
In [41]:
grouped_times = df.groupby("id")["time"]
In [42]:
arrivals = grouped_times.min()
In [43]:
departures = grouped_times.max()
In [44]:
duration = departures - arrivals
In [45]:
sns.histplot(arrivals.dt.hour + arrivals.dt.minute/60, kde=True, stat="density")
Out[45]:
In [46]:
sns.histplot(departures.dt.hour + departures.dt.minute/60, kde=True, stat="density")
Out[46]:
In [47]:
h_duration = duration.dt.total_seconds()/60/60
In [48]:
sns.histplot(h_duration, kde=True, stat="density")
Out[48]:
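To check the transformation on something small enough to verify by hand, here is a sketch with two invented guests. It uses total_seconds to get the stay length in hours, which counts the full length of the timedelta rather than only its seconds component.

```python
import pandas as pd

# Two invented guests with a handful of timestamps each.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2014-06-06 08:30:00", "2014-06-06 12:00:00", "2014-06-06 17:00:00",
        "2014-06-06 10:15:00", "2014-06-06 11:45:00",
    ]),
})

grouped = toy.groupby("id")["time"]
arrivals = grouped.min()      # first sample of each guest
departures = grouped.max()    # last sample of each guest
duration = departures - arrivals

# Arrival time as a decimal hour of the day, and stay length in hours.
arrival_hour = arrivals.dt.hour + arrivals.dt.minute / 60
hours_stayed = duration.dt.total_seconds() / 3600
print(arrival_hour)
print(hours_stayed)
```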
Now, for the question of park occupancy, we need to group the dataframe by time
In [49]:
time_groups = df.groupby(df.time.dt.hour)
In [50]:
occupancy = time_groups["id"].nunique()
In [51]:
occupancy.plot()
Out[51]:
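What this measure counts is the number of distinct ids with at least one record in each hour. A minimal sketch with invented samples shows that a guest with several records in the same hour is counted once:

```python
import pandas as pd

# Invented samples: guest 1 appears twice in the 9 o'clock hour.
toy = pd.DataFrame({
    "id": [1, 1, 2, 1, 2, 3],
    "time": pd.to_datetime([
        "2014-06-06 09:05:00", "2014-06-06 09:40:00", "2014-06-06 09:50:00",
        "2014-06-06 10:10:00", "2014-06-06 10:20:00", "2014-06-06 10:30:00",
    ]),
})

# Group by hour of day and count distinct guests in each group.
occupancy = toy.groupby(toy["time"].dt.hour)["id"].nunique()
print(occupancy)
```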
What places did the people who stayed for less than 4 hours visit?
In [ ]:
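One possible starting point, sketched on invented data: compute the stay length per id, keep the ids below four hours, and count their check-in locations. Treating check-in samples as "visited places" is an assumption of this sketch, not something the dataset states here.

```python
import pandas as pd

# Two invented guests: guest 1 stays 2.5 hours, guest 2 stays 9 hours.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "type": ["movement", "check-in", "movement", "movement", "check-in", "movement"],
    "X":    [0, 20, 0, 0, 60, 0],
    "Y":    [50, 30, 50, 50, 80, 50],
    "time": pd.to_datetime([
        "2014-06-06 09:00:00", "2014-06-06 10:00:00", "2014-06-06 11:30:00",
        "2014-06-06 09:00:00", "2014-06-06 13:00:00", "2014-06-06 18:00:00",
    ]),
})

# Stay length per guest, then the ids that stayed under four hours.
grouped = toy.groupby("id")["time"]
stay = grouped.max() - grouped.min()
short_ids = stay[stay < pd.Timedelta(hours=4)].index

# Check-in samples of those guests, counted per location.
short_checkins = toy[toy["id"].isin(short_ids) & (toy["type"] == "check-in")]
places = short_checkins.groupby(["X", "Y"]).size()
print(places)
```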
What is the distribution of total traveled distance of park visitors?
In [63]:
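A possible approach, sketched on invented data: sort each guest's samples by time, take the difference between consecutive positions, and sum the straight-line step lengths. This assumes guests move in straight lines between samples and reports distances in grid units.

```python
import numpy as np
import pandas as pd

# Two invented guests with a few positions each.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2014-06-06 09:00:00", "2014-06-06 09:00:10", "2014-06-06 09:00:20",
        "2014-06-06 10:00:00", "2014-06-06 10:00:10",
    ]),
    "X": [0, 3, 3, 10, 10],
    "Y": [0, 4, 10, 10, 15],
})

# Steps must be computed in time order within each guest.
toy = toy.sort_values(["id", "time"])

# Per-guest difference between consecutive positions; the first row is NaN.
step = toy.groupby("id")[["X", "Y"]].diff()
toy["step_dist"] = np.hypot(step["X"], step["Y"])

# Summing skips the NaN of each guest's first sample.
total_distance = toy.groupby("id")["step_dist"].sum()
print(total_distance)
```

The resulting series can be passed to a histogram to look at its distribution.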
What is the mean speed of the park visitors?
In [ ]:
Who are the visitors who walked more?
In [ ]:
At what times are check-in samples recorded?
In [ ]:
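As a first step, we could filter the check-in samples and count them per hour of the day. A minimal sketch on invented rows (the "movement" label is an assumption; "check-in" is the value used earlier in this notebook):

```python
import pandas as pd

# Invented samples of both types.
toy = pd.DataFrame({
    "type": ["movement", "check-in", "check-in", "movement", "check-in"],
    "time": pd.to_datetime([
        "2014-06-06 09:10:00", "2014-06-06 09:30:00", "2014-06-06 09:45:00",
        "2014-06-06 13:00:00", "2014-06-06 13:20:00",
    ]),
})

# Hour of day of every check-in, then the number of check-ins per hour.
checkin_hours = toy.loc[toy["type"] == "check-in", "time"].dt.hour
per_hour = checkin_hours.value_counts().sort_index()
print(per_hour)
```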