This module focuses on using human knowledge to find the best route. At certain times (rush hour, for example), a route other than the one shown on the map may be faster. People who travel at those times, or who live in those areas, know better which route to take and when. If every driver recorded the path they took based on their experience, we could build a database that helps predict which path should be taken.
The format of the dataset is as follows:
Importing all the required packages
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd
from sklearn import datasets, svm, cross_validation, tree, preprocessing, metrics
import sklearn.ensemble as ske
import tensorflow as tf
from tensorflow.contrib import learn as skflow
In [2]:
route_df = pd.read_excel('route.xls', index_col=None, na_values=['NA'])
Let's look at the data
In [3]:
route_df.head()
Out[3]:
Let's look at what percentage of the drivers are using the map.
In [4]:
route_df['mapUsed'].mean()
Out[4]:
47% of the drivers are following the map.
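The percentage above falls out of the mean because `mapUsed` appears to be stored as 0/1, so its mean is the share of rows equal to 1. A minimal sketch on made-up data (the values in `toy_df` are illustrative and are not taken from route.xls):

```python
import pandas as pd

# Hypothetical toy data: 1 = driver followed the map, 0 = driver chose another route.
toy_df = pd.DataFrame({"mapUsed": [1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of rows that are 1.
share_using_map = toy_df["mapUsed"].mean()
print("Share of drivers using the map: {:.0%}".format(share_using_map))
```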
Let's group the data by country
In [5]:
route_df.groupby('country').mean(numeric_only=True)  # average only the numeric columns; string columns cannot be averaged
Out[5]:
Approximately 45-48% of the drivers are using the map in each country, and approximately half of the entries were made in metro cities. Let's plot these values to get a better understanding of the data
In [6]:
country_metro_grouping = route_df.groupby(['country','metro']).mean(numeric_only=True)  # average only the numeric columns
country_metro_grouping
Out[6]:
In [7]:
country_metro_grouping['mapUsed'].plot.bar()
Out[7]:
1 signifies that it is a metro
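To make the two-level grouping concrete, here is a small sketch on invented data. The country names and values are placeholders, and selecting the `mapUsed` column before calling `mean()` is a choice made here to keep the example independent of the other columns:

```python
import pandas as pd

# Hypothetical toy data; metro = 1 means the entry was made in a metro city.
toy_df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "metro":   [0, 1, 1, 0, 0, 1],
    "mapUsed": [1, 0, 1, 0, 1, 1],
})

# Group on both keys, then average the 0/1 flag within each (country, metro) pair.
grouped = toy_df.groupby(["country", "metro"])["mapUsed"].mean()
print(grouped)
```

The result is indexed by (country, metro) pairs, which is why each bar in the plot is labelled with both values.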
Let's visualize the data based on the rating of the drivers
In [8]:
group_by_rating = pd.cut(route_df["rating"], np.arange(0, 6, 1))  # bin ratings into (0,1], (1,2], ..., (4,5]
rating_grouping = route_df.groupby(group_by_rating).mean(numeric_only=True)
rating_grouping['mapUsed'].plot.bar()
Out[8]:
Most of the drivers have a rating between 1 and 2
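The binning step can be checked on a handful of values. The sketch below uses invented ratings. By default `pd.cut` builds right-closed intervals, so a rating of exactly 2.0 lands in (1, 2], and a rating of exactly 0 would fall outside every bin and become NaN unless `include_lowest=True` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, chosen to show how the bin edges behave.
ratings = pd.Series([0.5, 1.2, 1.8, 2.0, 4.7])

bins = np.arange(0, 6, 1)       # edges 0, 1, 2, 3, 4, 5
binned = pd.cut(ratings, bins)  # right-closed intervals: (0, 1], (1, 2], ...

# Count how many ratings fall in each bin, keeping the bins in order.
counts = binned.value_counts(sort=False)
print(counts)
```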
Let's check for missing values by doing a count on each of the columns
In [9]:
route_df.count()
Out[9]:
There are no missing values. If there were, we could deal with them in one of the following ways:
If the column (col1) with missing values is an important factor, drop the rows in which it is missing.
route_df = route_df.dropna(subset=["col1"]) #drop the rows where col1 is missing
If the columns (col2, col3) are not important factors, drop the columns instead.
route_df = route_df.drop(['col2','col3'], axis=1) #drop the columns
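A minimal sketch of both options on a made-up frame (the column names `col1`, `col2`, `col3` are the placeholders used above, and the values are invented). Note that filling a column with the string "NA" before calling `dropna()` would hide the gaps from `dropna()`, so the rows are dropped directly here:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in every column.
toy_df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0],
    "col2": ["a", None, "c"],
    "col3": [np.nan, np.nan, 0.5],
})

# count() reports non-missing entries per column, so gaps show up as lower counts.
print(toy_df.count())

# Option 1: col1 matters, so drop only the rows where col1 is missing.
rows_kept = toy_df.dropna(subset=["col1"])

# Option 2: col2 and col3 do not matter, so drop those columns entirely.
cols_kept = toy_df.drop(["col2", "col3"], axis=1)

print(rows_kept)
print(cols_kept)
```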
Now for the preprocessing
In [10]:
def preprocess_route_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    processed_df.country = le.fit_transform(processed_df.country)
    processed_df.oldRoute = le.fit_transform(processed_df.oldRoute)
    processed_df.newRoute = le.fit_transform(processed_df.newRoute)
    processed_df = processed_df.drop(['name','uid'],axis=1)
    return processed_df
Here we convert the string columns (country, oldRoute, newRoute) to numeric labels and drop the name and uid columns.
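To show what the label encoding does to a single column, here is a short sketch with invented country names. `LabelEncoder` sorts the distinct values and assigns each one an integer starting from 0:

```python
from sklearn import preprocessing

# Hypothetical values standing in for the country column.
countries = ["US", "India", "US", "Canada"]

le = preprocessing.LabelEncoder()
codes = le.fit_transform(countries)

print(list(le.classes_))  # distinct labels in sorted order
print(list(codes))        # integer code for each original entry
```

Because the same encoder object is refitted for each column in `preprocess_route_df`, it only remembers the mapping for the last column it saw. If the integer codes need to be turned back into strings later with `inverse_transform`, one encoder per column would be needed.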
Let's look at the data again
In [11]:
processed_df = preprocess_route_df(route_df)
processed_df
Out[11]:
X contains every column except mapUsed, and y contains mapUsed, the value we want to predict
In [12]:
X = processed_df.drop(['mapUsed'], axis=1).values
y = processed_df['mapUsed'].values
In [13]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
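The `sklearn.cross_validation` module used in this notebook was deprecated in scikit-learn 0.18 and removed in 0.20. On recent versions the same helper lives in `sklearn.model_selection`. A minimal sketch of the equivalent split, using placeholder arrays (`X_demo`, `y_demo`) rather than the route data, with `random_state` added here only to make the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, and a 0/1 target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```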
Decision tree
In [14]:
clf_dt = tree.DecisionTreeClassifier(max_depth=10)
In [15]:
clf_dt.fit(X_train, y_train)
clf_dt.score(X_test, y_test)
Out[15]:
In [16]:
shuffle_validator = cross_validation.ShuffleSplit(len(X), n_iter=20, test_size=0.2, random_state=0)
def test_classifier(clf):
    scores = cross_validation.cross_val_score(clf, X, y, cv=shuffle_validator)
    print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
In [17]:
test_classifier(clf_dt)
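The shuffle-split validation above repeats a random 80/20 split 20 times and reports the mean and standard deviation of the test accuracy. On recent scikit-learn versions, `ShuffleSplit` is imported from `sklearn.model_selection`, takes `n_splits` in place of `n_iter`, and no longer takes the number of samples as its first argument. A sketch of the same procedure on synthetic data (`X_demo` and `y_demo` are invented stand-ins, not the route data):

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data: the target depends only on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# 20 independent random splits, each holding out 20% of the rows.
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

clf = tree.DecisionTreeClassifier(max_depth=10)
scores = cross_val_score(clf, X_demo, y_demo, cv=splitter)  # one accuracy per split
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
```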
In [18]:
clf_rf = ske.RandomForestClassifier(n_estimators=50)
test_classifier(clf_rf)
In [19]:
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
test_classifier(clf_gb)
In [20]:
eclf = ske.VotingClassifier([('dt', clf_dt), ('rf', clf_rf), ('gb', clf_gb)])
test_classifier(eclf)
Neural network
In [21]:
#tf_clf_dnn = skflow.TensorFlowDNNClassifier(hidden_units=[20, 40, 20], n_classes=2, batch_size=256, steps=1000, learning_rate=0.05)
feature_columns = [tf.contrib.layers.real_valued_column("")]
tf_clf_dnn = skflow.DNNClassifier(feature_columns=feature_columns, hidden_units=[20, 40, 20], n_classes=2, model_dir="/tmp")
#tf_clf_dnn.evaluate(batch_size=256, steps=1000)
tf_clf_dnn.fit(X_train, y_train, steps=1000)
accuracy_score = tf_clf_dnn.evaluate(X_test, y_test,steps=1000)["accuracy"]
print("\nTest Accuracy: {0:f}\n".format(accuracy_score))