In [1]:
import pandas as pd
import os
import numpy as np
import datetime as dt

Leada Problem One - Solution using Pandas

I've registered earlier to Leada's Data Literacy, as it looked pretty fun (and it was :-) )

Email received from Leada :


You just made your 2015 New Year's resolution to improve your data analysis skills and become data literate!

Every two weeks, we will e-mail you a dataset and some problems to work on. You can use any tool you'd like, although we suggest using R or Python. R is easy to install and if you are new to programming, the Introduction to R lesson is free in our courses!

Here's the first dataset and problem! Answers will be released at a later date. Tweet your solution @LeadaHQ!

====================

Dataset Description: A CSV of bicycle rental transactions for a rental company in California.

Dataset Download



In [2]:
# Load & Parse dates
df = pd.read_csv('./data/bike_trip_data.csv', parse_dates=[2,5] )

Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

Extremely straighforward


In [3]:
avg_duration_mins, avg_duration_secs = divmod(df['Duration'].mean(), 60)
print 'The average total time is {:g} minutes and {:.0f} seconds'.format(avg_duration_mins, avg_duration_secs)


The average total time is 20 minutes and 31 seconds

Not as clear. I guess it means 'What was the day with the highest number of trips.

This can be checked pretty easily with a groupby operation


In [4]:
# Create a new column containing the day
df['Day'] = df['Start Date'].apply(lambda x: x.date())

In [5]:
# Grouping by Day, counting any other column
dfq2 = df.groupby('Day')[['Duration']].count()
dfq2.columns = ['Nb Trips']
dfq2 = dfq2.sort(columns='Nb Trips', ascending = False)
dfq2.head(1)


Out[5]:
Nb Trips
Day
2013-09-25 1264

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.


In [6]:
# Curiosity check : how many unique terminals are there
all_terminals = set(df['Start Terminal'])
print 'There is {} distinct terminals'.format(len(all_terminals))


There is 69 distinct terminals

In [7]:
# With pivot tables, create two dataframes that lists the changesin inventory     
# Taken are the bikes leaving terminals, back are taken back to the terminal
taken = df.pivot_table(index=df.index, columns=['Start Terminal'], values=['Duration'], aggfunc='count').fillna(0) * -1
back = df.pivot_table(index=df.index, columns=['End Terminal'], values=['Duration'], aggfunc='count').fillna(0)

In [8]:
# We combine bikes inflows & outflows, and create a cumulative sum
total = taken + back
total = total.cumsum()['Duration']

In [9]:
# 5 last lines of the dataframe
total.tail()


Out[9]:
Start Terminal 2 3 4 5 6 7 8 9 10 11 ... 71 72 73 74 75 76 77 80 82 83
144010 -152 28 -77 -4 92 -25 10 94 -230 220 ... -557 114 -1356 167 -155 78 316 0 4 2
144011 -152 28 -77 -4 92 -25 10 94 -230 220 ... -557 114 -1356 167 -155 78 316 0 4 2
144012 -152 28 -77 -4 92 -25 10 94 -230 220 ... -557 114 -1356 167 -155 79 316 0 4 2
144013 -152 28 -77 -4 92 -25 10 94 -230 220 ... -557 113 -1356 167 -155 79 316 0 4 2
144014 -152 28 -77 -4 92 -25 10 94 -230 220 ... -557 113 -1356 167 -155 79 316 0 4 2

5 rows × 69 columns


In [10]:
# Now we create a dictionary, that will hold the final result :
#      - Key :  terminal number
#      - Value : index of the time that requires rebalancing
#
# A terminal requires rebalacing the first time it reaches -30 bikes
index_stamps = {}

for col in total.columns:
    stamp = None
    stamp = total[col].loc[total[col] == -30].head(1).index.values
    if stamp.size == 0:
        index_stamps[col] = None
    else:
        index_stamps[col] = [stamp[0]]

# Now we fetch the actual datetime in the original dataframe        
for terminal, rebal_list in index_stamps.items():
    if rebal_list:
        index =  rebal_list[0]
        rebal_list.append(df.loc[index]['Start Date'])

In [11]:
# Example for terminal 2
index_stamps[2]


Out[11]:
[27475, Timestamp('2013-10-01 08:04:00')]

In [12]:
# Displaying the result
for terminal, rebal_list in index_stamps.items():
    if rebal_list:
        print 'Terminal {term} needs rebalancing on {date}'.format(term=terminal, date=rebal_list[1])


Terminal 2 needs rebalancing on 2013-10-01 08:04:00
Terminal 4 needs rebalancing on 2013-10-22 14:16:00
Terminal 7 needs rebalancing on 2013-11-17 14:56:00
Terminal 10 needs rebalancing on 2013-09-25 17:10:00
Terminal 22 needs rebalancing on 2013-09-24 08:29:00
Terminal 28 needs rebalancing on 2013-09-12 09:46:00
Terminal 34 needs rebalancing on 2013-12-04 16:29:00
Terminal 37 needs rebalancing on 2013-11-16 11:23:00
Terminal 41 needs rebalancing on 2013-09-10 17:49:00
Terminal 45 needs rebalancing on 2013-09-09 17:03:00
Terminal 47 needs rebalancing on 2013-09-11 12:49:00
Terminal 50 needs rebalancing on 2013-09-23 11:38:00
Terminal 51 needs rebalancing on 2013-09-14 11:13:00
Terminal 54 needs rebalancing on 2013-09-02 11:55:00
Terminal 55 needs rebalancing on 2013-09-05 17:57:00
Terminal 56 needs rebalancing on 2013-09-13 07:55:00
Terminal 58 needs rebalancing on 2013-09-05 12:27:00
Terminal 59 needs rebalancing on 2013-10-01 13:06:00
Terminal 62 needs rebalancing on 2013-09-04 18:26:00
Terminal 63 needs rebalancing on 2013-10-01 09:31:00
Terminal 66 needs rebalancing on 2013-09-16 12:13:00
Terminal 67 needs rebalancing on 2013-09-20 20:00:00
Terminal 71 needs rebalancing on 2013-09-19 09:53:00
Terminal 73 needs rebalancing on 2013-09-11 14:18:00
Terminal 75 needs rebalancing on 2013-09-11 11:23:00
Terminal 77 needs rebalancing on 2013-09-04 01:18:00