In [1]:

    
import pandas as pd
import os
import numpy as np
import datetime as dt

Leada Problem One - Solution using Pandas

I've registered earlier to Leada's Data Literacy, as it looked pretty fun (and it was :-) )

Email received from Leada :

You just made your 2015 New Year's resolution to improve your data analysis skills and become data literate!

Every two weeks, we will e-mail you a dataset and some problems to work on. You can use any tool you'd like, although we suggest using R or Python. R is easy to install and if you are new to programming, the Introduction to R lesson is free in our courses!

Here's the first dataset and problem! Answers will be released at a later date. Tweet your solution @LeadaHQ!

====================

Dataset Description: A CSV of bicycle rental transactions for a rental company in California.

Dataset Download



In [2]:

    
# Load & Parse dates
df = pd.read_csv('./data/bike_trip_data.csv', parse_dates=[2,5] )

Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

Extremely straighforward



In [3]:

    
avg_duration_mins, avg_duration_secs = divmod(df['Duration'].mean(), 60)
print 'The average total time is {:g} minutes and {:.0f} seconds'.format(avg_duration_mins, avg_duration_secs)









    



The average total time is 20 minutes and 31 seconds

Problem 2: What was the most popular day by trip frequency in this dataset?

Not as clear. I guess it means 'What was the day with the highest number of trips.

This can be checked pretty easily with a groupby operation



In [4]:

    
# Create a new column containing the day
df['Day'] = df['Start Date'].apply(lambda x: x.date())



In [5]:

    
# Grouping by Day, counting any other column
dfq2 = df.groupby('Day')[['Duration']].count()
dfq2.columns = ['Nb Trips']
dfq2 = dfq2.sort(columns='Nb Trips', ascending = False)
dfq2.head(1)

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.



In [6]:

    
# Curiosity check : how many unique terminals are there
all_terminals = set(df['Start Terminal'])
print 'There is {} distinct terminals'.format(len(all_terminals))









    



There is 69 distinct terminals



In [7]:

    
# With pivot tables, create two dataframes that lists the changesin inventory     
# Taken are the bikes leaving terminals, back are taken back to the terminal
taken = df.pivot_table(index=df.index, columns=['Start Terminal'], values=['Duration'], aggfunc='count').fillna(0) * -1
back = df.pivot_table(index=df.index, columns=['End Terminal'], values=['Duration'], aggfunc='count').fillna(0)



In [8]:

    
# We combine bikes inflows & outflows, and create a cumulative sum
total = taken + back
total = total.cumsum()['Duration']



In [9]:

    
# 5 last lines of the dataframe
total.tail()









    Out[9]:






  
    
      Start Terminal
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      ...
      71
      72
      73
      74
      75
      76
      77
      80
      82
      83
    
  
  
    
      144010
      -152
       28
      -77
      -4
       92
      -25
       10
       94
      -230
       220
      ...
      -557
       114
      -1356
       167
      -155
       78
       316
       0
       4
       2
    
    
      144011
      -152
       28
      -77
      -4
       92
      -25
       10
       94
      -230
       220
      ...
      -557
       114
      -1356
       167
      -155
       78
       316
       0
       4
       2
    
    
      144012
      -152
       28
      -77
      -4
       92
      -25
       10
       94
      -230
       220
      ...
      -557
       114
      -1356
       167
      -155
       79
       316
       0
       4
       2
    
    
      144013
      -152
       28
      -77
      -4
       92
      -25
       10
       94
      -230
       220
      ...
      -557
       113
      -1356
       167
      -155
       79
       316
       0
       4
       2
    
    
      144014
      -152
       28
      -77
      -4
       92
      -25
       10
       94
      -230
       220
      ...
      -557
       113
      -1356
       167
      -155
       79
       316
       0
       4
       2
    
  

5 rows × 69 columns



In [10]:

    
# Now we create a dictionary, that will hold the final result :
#      - Key :  terminal number
#      - Value : index of the time that requires rebalancing
#
# A terminal requires rebalacing the first time it reaches -30 bikes
index_stamps = {}

for col in total.columns:
    stamp = None
    stamp = total[col].loc[total[col] == -30].head(1).index.values
    if stamp.size == 0:
        index_stamps[col] = None
    else:
        index_stamps[col] = [stamp[0]]

# Now we fetch the actual datetime in the original dataframe        
for terminal, rebal_list in index_stamps.items():
    if rebal_list:
        index =  rebal_list[0]
        rebal_list.append(df.loc[index]['Start Date'])



In [11]:

    
# Example for terminal 2
index_stamps[2]









    Out[11]:





[27475, Timestamp('2013-10-01 08:04:00')]



In [12]:

    
# Displaying the result
for terminal, rebal_list in index_stamps.items():
    if rebal_list:
        print 'Terminal {term} needs rebalancing on {date}'.format(term=terminal, date=rebal_list[1])









    



Terminal 2 needs rebalancing on 2013-10-01 08:04:00
Terminal 4 needs rebalancing on 2013-10-22 14:16:00
Terminal 7 needs rebalancing on 2013-11-17 14:56:00
Terminal 10 needs rebalancing on 2013-09-25 17:10:00
Terminal 22 needs rebalancing on 2013-09-24 08:29:00
Terminal 28 needs rebalancing on 2013-09-12 09:46:00
Terminal 34 needs rebalancing on 2013-12-04 16:29:00
Terminal 37 needs rebalancing on 2013-11-16 11:23:00
Terminal 41 needs rebalancing on 2013-09-10 17:49:00
Terminal 45 needs rebalancing on 2013-09-09 17:03:00
Terminal 47 needs rebalancing on 2013-09-11 12:49:00
Terminal 50 needs rebalancing on 2013-09-23 11:38:00
Terminal 51 needs rebalancing on 2013-09-14 11:13:00
Terminal 54 needs rebalancing on 2013-09-02 11:55:00
Terminal 55 needs rebalancing on 2013-09-05 17:57:00
Terminal 56 needs rebalancing on 2013-09-13 07:55:00
Terminal 58 needs rebalancing on 2013-09-05 12:27:00
Terminal 59 needs rebalancing on 2013-10-01 13:06:00
Terminal 62 needs rebalancing on 2013-09-04 18:26:00
Terminal 63 needs rebalancing on 2013-10-01 09:31:00
Terminal 66 needs rebalancing on 2013-09-16 12:13:00
Terminal 67 needs rebalancing on 2013-09-20 20:00:00
Terminal 71 needs rebalancing on 2013-09-19 09:53:00
Terminal 73 needs rebalancing on 2013-09-11 14:18:00
Terminal 75 needs rebalancing on 2013-09-11 11:23:00
Terminal 77 needs rebalancing on 2013-09-04 01:18:00

Start Terminal	2	3	4	5	6	7	8	9	10	11	...	71	72	73	74	75	76	77	82	83
144010	-152	28	-77	-4	92	-25	10	94	-230	220	...	-557	114	-1356	167	-155	78	316	4	2
144011	-152	28	-77	-4	92	-25	10	94	-230	220	...	-557	114	-1356	167	-155	78	316	4	2
144012	-152	28	-77	-4	92	-25	10	94	-230	220	...	-557	114	-1356	167	-155	79	316	4	2
144013	-152	28	-77	-4	92	-25	10	94	-230	220	...	-557	113	-1356	167	-155	79	316	4	2
144014	-152	28	-77	-4	92	-25	10	94	-230	220	...	-557	113	-1356	167	-155	79	316	4	2