Class 6: Preprocessing.
The feature vector, the input to a model (such as a neural network), must be completely numeric. Converting non-numeric data into numeric form is one major component of preprocessing. It is also often important to preprocess values that are already numeric. Scikit-learn provides a large number of preprocessing utilities, such as scalers, encoders, and imputers.
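For example, the following is a minimal sketch of two of these utilities, StandardScaler and MinMaxScaler, applied to a small made-up numeric column (the data here are hypothetical, purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A small, hypothetical numeric column (one feature, four rows).
weights = np.array([[2100.], [3500.], [4200.], [2900.]])

# Z-score standardization: mean 0, standard deviation 1.
print(StandardScaler().fit_transform(weights).ravel())

# Min-max scaling to the range [0, 1].
print(MinMaxScaler().fit_transform(weights).ravel())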
However, these built-in functions are just the beginning. The success of your neural network's predictions is often directly tied to how the data are represented.
The following functions will be used in conjunction with TensorFlow to help preprocess the data. Some of these were covered previously; some are new.
It is fine to simply use them, but for a better understanding, try to work through how they operate.
These functions allow you to build the feature vector for a neural network. Consider the following:
In [3]:
import pandas as pd
import numpy as np
import sklearn.preprocessing
from sklearn.feature_extraction.text import TfidfTransformer

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red, green, blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to integer indexes (e.g. 0, 1, 2 for blue, green, red)
def encode_text_index(df, name):
    le = sklearn.preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as z-scores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Remove all rows where the specified column is +/- sd standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)

# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])
    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
        * (normalized_high - normalized_low) + normalized_low

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column. Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.int32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)
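As a quick illustration of the two text encoders above, the following sketch applies them to a small, hypothetical dataframe (not the MPG data):

import pandas as pd

# Hypothetical example dataframe for illustration only.
sample = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Dummy (one-hot) encoding: replaces 'color' with color-blue, color-green, color-red columns.
df1 = sample.copy()
encode_text_dummy(df1, 'color')
print(df1)

# Index encoding: replaces each color with an integer index and returns the class list.
df2 = sample.copy()
classes = encode_text_index(df2, 'color')
print(df2)
print(classes)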
In [4]:
ENCODING = 'utf-8'

# Summarize a categorical column as "value:percent" pairs
def expand_categories(values):
    result = []
    s = values.value_counts()
    t = float(len(values))
    for v in s.index:
        result.append("{}:{}%".format(v, round(100 * (s[v] / t), 2)))
    return "[{}]".format(",".join(result))

# Print a quick profile of every column in a CSV file
def analyze(filename):
    print()
    print("Analyzing: {}".format(filename))
    df = pd.read_csv(filename, encoding=ENCODING)
    cols = df.columns.values
    total = float(len(df))
    print("{} rows".format(int(total)))
    for col in cols:
        uniques = df[col].unique()
        unique_count = len(uniques)
        if unique_count > 100:
            print("** {}:{} ({}%)".format(col, unique_count, int((unique_count / total) * 100)))
        else:
            print("** {}:{}".format(col, expand_categories(df[col])))
The analyze function can be run on the MPG dataset.
In [5]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
analyze(filename_read)
In [6]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# create feature vector
missing_median(df, 'horsepower')
df.drop('name', axis=1, inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_range(df, 'cylinders',0,1)
encode_numeric_range(df, 'displacement',0,1)
encode_numeric_zscore(df, 'acceleration')
#encode_numeric_binary(df,'mpg',20)
#df['origin'] = df['origin'].astype(str)
#encode_text_tfidf(df, 'origin')
# Drop outliers in MPG
print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df,'mpg',2)
print("Length after MPG outliers dropped: {}".format(len(df)))
print(df)
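At this point the dataframe is entirely numeric, so it could be handed to the to_xy function defined above to produce the arrays a TensorFlow model expects. A minimal sketch (note that to_xy relies on the older pandas as_matrix API, consistent with the rest of this notebook):

# Convert the preprocessed dataframe into x/y arrays for TensorFlow.
# 'mpg' is the regression target; every other column becomes a feature.
x, y = to_xy(df, 'mpg')
print(x.shape, x.dtype)   # features, float32
print(y.shape, y.dtype)   # target, float32 (regression)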
Addresses can be difficult to encode into a neural network. There are many different approaches, and you must consider how to transform the address into something more meaningful. Map coordinates, that is, latitude and longitude, can be a useful encoding. Thanks to the power of the Internet, it is relatively easy to transform an address into its latitude and longitude values. The following code determines the coordinates of Washington University:
In [1]:
import requests
address = "1 Brookings Dr, St. Louis, MO 63130"
response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address='+address)
resp_json_payload = response.json()
print(resp_json_payload['results'][0]['geometry']['location'])
If latitude and longitude are simply fed into the neural network as two features, they might not be especially helpful. These two values allow your neural network to cluster locations on a map, which is sometimes useful in itself. Consider, for example, the percentage of the population that smokes in each US state: a map of those rates shows that certain behaviors, like smoking, cluster by geographic region.
However, often you will want to transform the coordinates into distances. It is reasonably easy to estimate the distance between any two points on Earth using the great circle distance between two points on a sphere:
$\Delta\sigma=\arccos\bigl(\sin\phi_1\cdot\sin\phi_2+\cos\phi_1\cdot\cos\phi_2\cdot\cos(\Delta\lambda)\bigr)$
$d = r \, \Delta\sigma$
The following code computes this great-circle distance. It uses the equivalent haversine form of the formula, which is more numerically stable when the two points are close together:
In [7]:
import requests
from math import sin, cos, sqrt, atan2, radians

# Distance function (haversine form of the great-circle distance)
def distance_lat_lng(lat1, lng1, lat2, lng2):
    # approximate radius of earth in km
    R = 6373.0
    # degrees to radians (lat/lng are in degrees)
    lat1 = radians(lat1)
    lng1 = radians(lng1)
    lat2 = radians(lat2)
    lng2 = radians(lng2)
    dlng = lng2 - lng1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlng / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# Find lat/lng for an address
def lookup_lat_lng(address):
    response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address=' + address)
    json = response.json()
    if len(json['results']) == 0:
        print("Can't find: {}".format(address))
        return 0, 0
    location = json['results'][0]['geometry']['location']
    return location['lat'], location['lng']

# Distance between two locations
address1 = "1 Brookings Dr, St. Louis, MO 63130"
address2 = "3301 College Ave, Fort Lauderdale, FL 33314"
lat1, lng1 = lookup_lat_lng(address1)
lat2, lng2 = lookup_lat_lng(address2)
print("Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km".format(
    distance_lat_lng(lat1, lng1, lat2, lng2)))
Encoding addresses as distances can be useful, but you must consider which distances are meaningful for your dataset.
The following code calculates the distance between 10 universities and Washington University in St. Louis:
In [26]:
# Encoding other universities by their distance to Washington University
schools = [
    ["Princeton University, Princeton, NJ 08544", 'Princeton'],
    ["Massachusetts Hall, Cambridge, MA 02138", 'Harvard'],
    ["5801 S Ellis Ave, Chicago, IL 60637", 'University of Chicago'],
    ["Yale, New Haven, CT 06520", 'Yale'],
    ["116th St & Broadway, New York, NY 10027", 'Columbia University'],
    ["450 Serra Mall, Stanford, CA 94305", 'Stanford'],
    ["77 Massachusetts Ave, Cambridge, MA 02139", 'MIT'],
    ["Duke University, Durham, NC 27708", 'Duke University'],
    ["University of Pennsylvania, Philadelphia, PA 19104", 'University of Pennsylvania'],
    ["Johns Hopkins University, Baltimore, MD 21218", 'Johns Hopkins']
]

lat1, lng1 = lookup_lat_lng("1 Brookings Dr, St. Louis, MO 63130")

for address, name in schools:
    lat2, lng2 = lookup_lat_lng(address)
    dist = distance_lat_lng(lat1, lng1, lat2, lng2)
    print("School '{}', distance to wustl is: {}".format(name, dist))
The Bag of Words algorithm (Harris, 1954) is a common means of encoding strings. Each input represents the count of one particular word, and the entire input vector contains one value for each unique word. Consider the following strings.
Of Mice and Men
Three Blind Mice
Blind Man’s Bluff
Mice and More Mice
We have the following unique words. This is our “dictionary.”
Input 0 : and
Input 1 : blind
Input 2 : bluff
Input 3 : man’s
Input 4 : men
Input 5 : mice
Input 6 : more
Input 7 : of
Input 8 : three
The four lines above would be encoded as follows.
Of Mice and Men [ 0 4 5 7 ]
Three Blind Mice [ 1 5 8 ]
Blind Man's Bluff [ 1 2 3 ]
Mice and More Mice [ 0 5 6 ]
Of course, we have to fill in the missing words with zero, so we end up with the following count vectors:
Of Mice and Men [ 1 0 0 0 1 1 0 1 0 ]
Three Blind Mice [ 0 1 0 0 0 1 0 0 1 ]
Blind Man's Bluff [ 0 1 1 1 0 0 0 0 0 ]
Mice and More Mice [ 1 0 0 0 0 2 1 0 0 ]
Notice that we now have a consistent vector length of nine, the total number of words in our "dictionary." Each position in the vector corresponds to one dictionary entry and stores the count of that word in the string. Each string will usually contain only a small subset of the dictionary, so most of the vector values will be zero.
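As a quick check, here is a sketch that reproduces this hand-built encoding with scikit-learn's CountVectorizer (introduced more formally below). Note that its default tokenizer splits "Man's" into "man" and drops the single-character "s", so the vocabulary differs very slightly from the hand-built dictionary:

from sklearn.feature_extraction.text import CountVectorizer

titles = [
    'Of Mice and Men',
    'Three Blind Mice',
    "Blind Man's Bluff",
    'Mice and More Mice']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(titles)
print(vectorizer.vocabulary_)   # word -> column index
print(counts.toarray())         # one count vector per title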
As you can see, one of the most difficult aspects of machine learning programming is translating your problem into a fixed-length array of floating point numbers. The following section shows how to translate several examples.
In [40]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?']
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(corpus)
print("Mapping")
print(vectorizer.vocabulary_)
print()
print("Encoded")
x = vectorizer.transform(corpus)
print(x.toarray())
In [27]:
from sklearn.feature_extraction.text import CountVectorizer

path = "./data/"
filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
corpus = df['name']

vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(corpus)
print("Mapping")
print(vectorizer.vocabulary_)
print()
print("Encoded")
x = vectorizer.transform(corpus)
print(x.toarray())
print(len(vectorizer.vocabulary_))

# reverse lookup for columns (vocabulary_ maps word -> column index)
bag_cols = [0] * len(vectorizer.vocabulary_)
for word, index in vectorizer.vocabulary_.items():
    bag_cols[index] = word
In [32]:
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor

#x = x.toarray() #.as_matrix()
y = df['mpg'].as_matrix()

# Build a forest and compute the feature importances
forest = RandomForestRegressor(n_estimators=50,
                               random_state=0, verbose=True)
forest.fit(x, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking (most important words first)
print("Feature ranking:")
for f in range(x.shape[1]):
    print("{}. {} ({})".format(f + 1, bag_cols[indices[f]], importances[indices[f]]))
Time series data must be encoded before it can be fed to a regular feedforward neural network. In a few classes, we will see how to use a recurrent neural network to find patterns over time; for now, we will encode the series directly into input neurons.
Financial forecasting is a popular application of temporal algorithms, that is, algorithms that accept input values that range over time. If the algorithm supports short-term memory (internal state), then ranges over time are handled automatically. If your algorithm does not have an internal state, then you should use an input window and a prediction window; most algorithms do not have an internal state. To see how these windows work, consider how you might have an algorithm predict the stock market. You begin with the closing price of a stock over several days:
Day 1 : $45
Day 2 : $47
Day 3 : $48
Day 4 : $40
Day 5 : $41
Day 6 : $43
Day 7 : $45
Day 8 : $57
Day 9 : $50
Day 10 : $41
The first step is to normalize the data. This is necessary whether or not your algorithm has an internal state. To normalize, we change each number into the percent movement from the previous day. For example, Day 2 becomes 0.04, because there is roughly a 4% increase from $45 to $47 (a short pandas-based sketch of this transformation appears after the list below). Once you perform this calculation for every day, the data set will look like the following:
Day 2: 0.04
Day 3: 0.02
Day 4: -0.16
Day 5: 0.02
Day 6: 0.04
Day 7: 0.04
Day 8: 0.26
Day 9: -0.12
Day 10: -0.18
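The same percent-change normalization can be done in one line with pandas. This is a sketch, not part of the notebook's helper functions, and its rounding differs very slightly from the hand-worked values above:

import pandas as pd

prices = pd.Series([45, 47, 48, 40, 41, 43, 45, 57, 50, 41])

# Percent change from the previous day; the first value is NaN and is dropped.
changes = prices.pct_change().dropna().round(2)
print(changes.tolist())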
To create an algorithm that predicts the next day's value, we need to think about how to encode this data for presentation to the algorithm. The encoding depends on whether the algorithm has an internal state; an internal state lets the algorithm use the last few input values to help establish trends.
Many machine learning algorithms have no internal state. In that case, you will typically use a sliding window to encode the data. Here we use the last three days' changes to predict the next one: the inputs are the changes from the previous three days, and the output is the change on the fourth day. The above data can be organized as follows to provide training data.
Each case specifies the ideal output for the given inputs:
[ 0.04, 0.02, -0.16 ] -> 0.02
[ 0.02, -0.16, 0.02 ] -> 0.04
[ -0.16, 0.02, 0.04 ] -> 0.04
[ 0.02, 0.04, 0.04 ] -> 0.26
[ 0.04, 0.04, 0.26 ] -> -0.12
[ 0.04, 0.26, -0.12 ] -> -0.18
The above encoding would require that the algorithm have three inputs and one output.
In [22]:
import numpy as np

# Convert a list of prices to percent changes from the previous value
def normalize_price_change(history):
    last = None
    result = []
    for price in history:
        if last is not None:
            result.append(float(price - last) / last)
        last = price
    return result

def encode_timeseries_window(source, lag_size, lead_size):
    """
    Encode raw data to a time-series window.
    :param source: An array that specifies the source to be encoded.
    :param lag_size: The number of rows used to predict.
    :param lead_size: The number of rows to be predicted.
    :return: A tuple that contains the x (input) & y (expected output) for training.
    """
    result_x = []
    result_y = []
    output_row_count = len(source) - (lag_size + lead_size) + 1
    for raw_index in range(output_row_count):
        # Encode x (predictors)
        encoded_x = []
        for j in range(lag_size):
            encoded_x.append(source[raw_index + j])
        result_x.append(encoded_x)
        # Encode y (prediction)
        encoded_y = []
        for j in range(lead_size):
            encoded_y.append(source[lag_size + raw_index + j])
        result_y.append(encoded_y)
    return result_x, result_y

price_history = [45, 47, 48, 40, 41, 43, 45, 57, 50, 41]
norm_price_history = normalize_price_change(price_history)

print("Normalized price history:")
print(norm_price_history)
print()
print("Rounded normalized price history:")
norm_price_history = np.round(norm_price_history, 2)
print(norm_price_history)
print()
print("Time boxed (time series encoded):")
x, y = encode_timeseries_window(norm_price_history, 3, 1)
for x_row, y_row in zip(x, y):
    print("{} -> {}".format(np.round(x_row, 2), np.round(y_row, 2)))