Although machine learning is the exciting part of this course, most data scientists spend the vast majority of their time doing data cleaning and data wrangling. Some put the figure as high as 90% of their time! There is a good reason for this: most data out there is not in the format the machine learning algorithms need. So, in order to do machine learning, the data must be reorganized, cleaned, rearranged, normalized, enriched, and filtered. We'll begin this process today and continue working on it throughout the course.
We start with an overview of some of the types of features we could potentially use. In the end, all of the data are represented as bits in the computer (ones and zeros), but we can organize those bits in a number of different ways in pandas dataframes. We'll build a "fake" dataframe containing the different types.
Integers are counting numbers and other whole numbers (including negatives): ..., -4, -3, -2, -1, 0, 1, 2, 3, 4, ... They are somewhat special because they can be stored very efficiently and the computer can operate on them very efficiently. Pandas stores these using a data type called int64, where the 64 means they are 64-bit integers (capable of storing any number between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807).
We'll use a sample dataset to look at the different types of data as we go.
In [1]:
import pandas as pd
sampledata = pd.read_csv('Class03_sample_dataframe.csv')
# This will let us look at the data type of each column. Note that the first column is an "int64".
print(sampledata.dtypes)
# These are the values stored in this column.
print("\nInteger Values")
print(sampledata['IntCol'].values)
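As a quick sketch (using numpy, which pandas is built on, rather than the sample file), we can confirm the limits of the int64 type:

```python
import numpy as np

# numpy's iinfo reports the limits of an integer data type
info = np.iinfo(np.int64)
print(info.min)  # -9223372036854775808
print(info.max)  # 9223372036854775807
```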
Floating point numbers, or decimal numbers, are just that: any number with a decimal place in it, such as 4.566642 and -156.986714. Pandas stores these as a float64. They could also be stored in scientific notation like this: 4.509013e+14. This means "4.509013 times 10 raised to the power 14". These are still floating point numbers and are treated like any other decimal number.
In [2]:
print("Float Values")
print(sampledata['FloatCol'].values)
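A tiny sketch (with made-up values, not the sample file) showing that pandas stores both plain decimals and scientific-notation values under the same float64 data type:

```python
import pandas as pd

# Plain decimals and scientific notation land in the same dtype
s = pd.Series([4.566642, -156.986714, 4.509013e+14])
print(s.dtype)  # float64
```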
Before we move on, I'd like to take a quick look at the data graphically.
In [3]:
sampledata.plot(kind='scatter', x='IntCol',y='FloatCol')
Out[3]:
Because this is "fake" data, I put in a functional dependence here. The float column looks like it is some function of the integer column. It is almost always a good idea to visualize your data early on to see what it looks like graphically!
Pandas can store text in its columns. Because there are a number of different types of text objects, by default pandas will store text as an object, which just means it doesn't know which of the types it really is. Text can, in principle, be anything you want it to be, so it is both the most flexible and the most challenging data type.
In [4]:
print("Text Values")
print(sampledata['TextCol'].values)
A categorical data type is a finite set of different objects. These objects are represented internally as integers but may be displayed as text or other generic objects. To make things simple, we'll start with a categorical object that has three possible values: "yes", "no", and "maybe". Internally, pandas will represent these as integers 0,1, and 2. But it knows that this is a categorical data type, so it keeps track of the text value associated with the integer and displays that for the user.
In [5]:
print("Categorical Values")
print(sampledata['CatCol'].values)
When we loaded the data, pandas actually loaded this column as an object, which means it doesn't know that the column is supposed to be categorical. We need to tell pandas that. We will use the astype() command, which tells pandas to change the data type of the column. We check to make sure it worked, too. Note that the "CatCol2" column is now a 'category' type.
Data Processing Tip
A quick aside here: there are a couple of ways of doing this kind of transformation on the data. We'll see this a little later when we do more column-wise processing. We could either change the original column or we could create a new column. The second method doesn't overwrite the original data and will be what we typically do. That way if something goes wrong or we want to change how we are processing the data, we still have the original data column to work with.
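A minimal sketch of the two options, using a made-up dataframe (not our sample file):

```python
import pandas as pd

df = pd.DataFrame({'answer': ['yes', 'no', 'maybe']})

# Option 1: overwrite the original column (the original object dtype is lost)
# df['answer'] = df['answer'].astype('category')

# Option 2 (what we'll typically do): create a new column, keeping the original intact
df['answer2'] = df['answer'].astype('category')
print(df.dtypes)
```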
In [6]:
sampledata["CatCol2"] = sampledata["CatCol"].astype('category')
sampledata.dtypes
Out[6]:
We can now look at how the data are stored as categorical data. We can get the internal codes for each of the entries like this:
In [7]:
sampledata["CatCol2"].cat.codes
Out[7]:
We can also get a list of the categories that pandas found when converting the column. These are in order: the first entry corresponds to code 0, the second to 1, etc.
In [8]:
sampledata["CatCol2"].cat.categories
Out[8]:
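Here is a self-contained sketch (a made-up three-value column, not the sample file) of how the codes map onto the categories:

```python
import pandas as pd

s = pd.Series(['yes', 'no', 'maybe', 'yes']).astype('category')
# Categories are stored in sorted order by default
print(s.cat.categories.tolist())  # ['maybe', 'no', 'yes']
# Each entry is stored internally as the integer index of its category
print(s.cat.codes.tolist())       # [2, 1, 0, 2]
```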
We may encounter situations where we want to plot the data and visualize each category as its own color. We saw how to do this back in Class01.
In [9]:
import seaborn as sns
sns.set_style('white')
sns.lmplot(x='IntCol', y='FloatCol', data=sampledata, hue='CatCol2', fit_reg=False)
Out[9]:
We will frequently encounter date/time values when working with data. There are many different ways these values get stored, but mostly we'll find that they start as a text object. We need to know how they are stored (in what order the year-month-day-hour-minute-second values appear). There are utilities to convert almost any type of date/time string to a datetime object in pandas. We will start with the ISO 8601 datetime standard, since it is both the most logical and the easiest to work with. Dates are stored like this: 2017-01-23, where we use a four-digit year, then a two-digit month and a two-digit day, all separated by dashes. If we want to add a time, it is appended to the date like this: 2017-01-23T03:13:42. The "T" tells the computer that we've added a time. It is followed by a two-digit hour (using 00 as midnight and 23 as 11pm), a colon, a two-digit minute, a colon, and a two-digit second. There are other variations that can include a time zone, but we will leave those for later.
In [10]:
print("Date/Time Values")
print(sampledata['DateCol'].values)
They are currently stored as objects, not as datetimes. We need to convert this column as well, but we'll use a special pandas function to do that. Take a quick look at the reference page for this function to see what else it can do. Note that the new column has type datetime64[ns]. That means that the date format is capable of counting nanoseconds. We won't use all of that capability, but pandas used that format because our dates are accurate to the second.
In [11]:
sampledata["DateCol2"] = pd.to_datetime(sampledata["DateCol"])
sampledata.dtypes
Out[11]:
In [12]:
#We print out the column to see what it looks like
sampledata["DateCol2"]
Out[12]:
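As a self-contained sketch (a made-up timestamp, not one from the sample file), an ISO 8601 string parses directly and the pieces come out where we expect:

```python
import pandas as pd

ts = pd.to_datetime('2017-01-23T03:13:42')
print(ts.year, ts.month, ts.day)      # 2017 1 23
print(ts.hour, ts.minute, ts.second)  # 3 13 42
```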
Now that we have the datetime column, I'd like to plot the data as a function of date. This is often a useful thing to do with time series data. We'll need to import the matplotlib library and use a trick to format the data by date. Here's the code that makes it work.
In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
# We will plot the data values and set the linestyle to 'None' which will not plot the line. We also want to show the individual data points, so we set the marker.
plt.plot(sampledata['DateCol2'].values, sampledata['FloatCol'].values, linestyle='None', marker='o')
# autofmt_xdate() tells the computer that it should treat the x-values as dates and format them appropriately. This is a figure function, so we use gcf() to "get current figure"
plt.gcf().autofmt_xdate()
Although this is not typically a single data type, you may encounter geographical data. These are typically in a Latitude-Longitude format where both Latitude and Longitude are floating point numbers like this: (32.1545, -138.5532). There are a number of tools we can use to work with and plot this type of data, so I wanted to cover it now. For now, we will treat these as separate entities and work with geographical data as we encounter it.
In [14]:
print("Latitude Values")
print(sampledata['LatCol'].values)
print("Longitude Values")
print(sampledata['LonCol'].values)
It is also useful to plot the geographical data. There are python libraries that make this easy to do.
In [15]:
from mpl_toolkits.basemap import Basemap
import numpy as np
# Draw the base map of the world
m = Basemap(projection='robin',lon_0=0,resolution='c')
# Draw the continent coast lines
m.drawcoastlines()
# Color in the land masses (the lake_color argument sets any lakes to aqua)
m.fillcontinents(color='red',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
#m.drawmapboundary(fill_color='aqua')
# Prep the data for plotting on the map
x,y = m(sampledata['LonCol'].values, sampledata['LatCol'].values)
# Plot the data points on the map
m.plot(x,y, 'bo', markersize=10)
Out[15]:
Now that we have data columns, we've already seen a couple of examples of column-wise processing. When we created the categorical column and the datetime column, we took the data from one column and operated on it all at once, creating new columns with different data types. There are other ways to manipulate the columns.
The apply() function takes each entry in a column and applies whatever function you want to that entry. For example, suppose we are interested in whether each entry is greater than 4. We will simplify the code by using what is called a lambda function. So, inside the apply() function we have: lambda x: x > 4. This is shorthand notation for the following: "Treat x as if it were each entry in the column. Apply whatever follows the colon (:) to each entry and create a new column based on the output." The use of x was arbitrary: we could choose any variable name. For example, if we chose w, the code would read: lambda w: w > 4. This would do exactly the same thing.
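A minimal sketch (toy values, not the sample data) showing that the lambda is just a compact function definition:

```python
import pandas as pd

s = pd.Series([2.0, 5.0, 3.5, 7.1])

# The lambda version...
out1 = s.apply(lambda x: x > 4)

# ...is equivalent to defining and applying a named function
def greater_than_four(x):
    return x > 4
out2 = s.apply(greater_than_four)

print(out1.tolist())      # [False, True, False, True]
print(out1.equals(out2))  # True
```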
In [16]:
sampledata['GTfour'] = sampledata['FloatCol'].apply(lambda x: x > 4.0)
print(sampledata[['FloatCol','GTfour']])
There are a number of common functions that we could use inside apply(). For example, if we wanted to get the square root of each entry, this is what it would look like. We are using the function np.sqrt from the numpy library. We already imported this library, but if we hadn't, we'd need to run import numpy as np before using this function.
In [17]:
sampledata['FloatSQRT'] = sampledata['FloatCol'].apply(np.sqrt)
print(sampledata[['FloatCol','FloatSQRT']])
Another useful function is adding up columns. Note that we need to tell pandas to run through each row by adding the argument axis=1 to the apply() function. Otherwise it tries to sum each column. Summing a single column may also be something you want to do, though the easiest way to do that is to use the pandas sum() function on the column.
In [18]:
sampledata['IntSUM'] = sampledata[['IntCol','FloatCol']].apply(np.sum,axis=1)
print(sampledata[['IntCol','FloatCol','IntSUM']])
In [19]:
sampledata['IntCol'].sum()
Out[19]:
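A self-contained sketch (made-up columns, not the sample data) contrasting the row-wise apply with the single-column sum:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10.0, 20.0, 30.0]})

# axis=1: sum across the columns of each row
df['rowsum'] = df[['a', 'b']].apply(np.sum, axis=1)
print(df['rowsum'].tolist())  # [11.0, 22.0, 33.0]

# Summing down a single column is easiest with .sum()
print(df['a'].sum())  # 6
```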
In [20]:
# We first tell the computer that we are writing a function by starting with "def"
# The next text is the name of the function. We name this one "isMammal", meaning it will tell us whether an animal is in our list of mammals.
# The final text in the parentheses is an input to the function. This is another "dummy" variable - we could give it any name we want.
# In this case we call it "animal" to remind ourselves that we expect an animal type in text form.
def isMammal(animal):
    # We create a list of text objects that will be our "inclusive" list. If the item is on this list, the function returns 'mammal'. Otherwise it returns 'notmammal'.
    mammallist = ['cat','dog','horse','cow','elephant','giraffe','wolf','prairie dog', 'whale', 'dolphin']
    # This is our first "if" statement. What this particular version does is look at the list "mammallist".
    # If the text passed into the variable "animal" matches any item in the list, it jumps into the next block of code.
    # Otherwise it jumps into the block of code following the "else" statement.
    if animal in mammallist:
        # The "return" keyword tells the computer we are done and to send back to the apply function the value following "return". In this case, send back 'mammal'.
        return 'mammal'
    else:
        # The other case will send back 'notmammal'.
        return 'notmammal'
sampledata['IsMammal'] = sampledata['TextCol'].apply(isMammal)
print(sampledata[['TextCol', 'IsMammal']])
In [21]:
# We'll now operate on an entire row of data at once and do a more complicated operation. We'll return only mammals where the 'FloatCol' is smaller than 2.
def isMammalFloat(row):
    # We create a list of text objects that will be our "inclusive" list. If the item is on this list, the function will return True. Otherwise it returns False.
    mammallist = ['cat','dog','horse','cow','elephant','giraffe','wolf','prairie dog', 'whale', 'dolphin']
    # We need to identify the animal from the row - it can be addressed using the column name
    animal = row['TextCol']
    if animal in mammallist:
        # The "return" keyword tells the computer we are done and to send back to the apply function the value following "return".
        # In this case it returns True if the float value is less than 2 and False otherwise.
        return row['FloatCol'] < 2
    else:
        # If it isn't a mammal, return False
        return False
# Note that we need to tell `apply` to send one row at a time by adding the `axis=1` argument
sampledata['IsSmallMammal'] = sampledata.apply(isMammalFloat, axis=1)
print(sampledata[['TextCol', 'FloatCol','IsSmallMammal']])
A simpler way to do this kind of filtering is boolean indexing: select the values of one column at the rows where a condition on another column is true.
In [22]:
sampledata['TextCol'][ sampledata['FloatCol']<2 ]
Out[22]:
We can often pull additional features from what we currently have. This involves doing a column-wise processing step, but with the additional component of doing a transformation or extraction from the data. We'll look at a couple of techniques to do this.
We already saw how to take a text column that is a date and turn it into a datetime data type. The to_datetime() function has the capability of parsing many different string formats. I recommend looking at the documentation for the function to learn how to parse more specific date/time formats.
Once we have a datetime data type, we can use other functions to get, for example, the day of the week or the week of the year for any given date. This may be useful for looking at weekly or yearly patterns. The full list of features we can easily extract is found in the documentation. We use the apply() function with a simple in-line lambda function to get the date or time features. Another use for this might be to identify holidays: for example, Memorial Day is always on the same relative day of the year (the last Monday in May). We could use these functions to identify which days are national or bank holidays.
In [23]:
# Get the day of the week for each of the date entries. We can get either a numerical value (0-6, via weekday()) or the name.
sampledata['DayofWeek'] = sampledata['DateCol2'].apply(lambda x: x.day_name())
# Or the ISO week number in the year
sampledata['WeekofYear'] = sampledata['DateCol2'].apply(lambda x: x.isocalendar()[1])
print(sampledata[['DayofWeek', 'WeekofYear']])
Sometimes it is helpful to know what unique values are in a column. Especially when there are many rows (millions), it is impractical to manually scan through a column looking for unique values. However, we can use the pandas function unique() to do just that. We will see this is particularly helpful in data cleaning, to identify rows with problems in the data.
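A tiny sketch (a made-up column, not the sample file) of unique() in action:

```python
import pandas as pd

s = pd.Series(['yes', 'no', 'yes', 'maybe', 'no'])
# Unique values come back in order of first appearance
print(s.unique())  # ['yes' 'no' 'maybe']
```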
In [24]:
sampledata['CatCol'].unique()
Out[24]:
Another type of text feature extraction uses a regex, or regular expression: pattern-recognition code. The date/time conversion uses one form of this, but we can be more general in identifying patterns. There are some very useful tools for testing your patterns. I like the tester at https://regex101.com/. I use it whenever I build a pattern-recognition string.
In [25]:
# This simple text pattern gathers all the letters up to (but not including) the last 'e' in the text entry. There are lots of other pattern recognition tools to extract features from text.
# Note that it returns "NaN" if there are no 'e's in the text string. We could use that to find all the strings without an 'e' in them.
sampledata['TextCol'].str.extract("(.*)e", expand=True)
Out[25]:
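The same pattern applied to a made-up column (a sketch, not the sample file), so you can see the NaN behavior directly:

```python
import pandas as pd

s = pd.Series(['horse', 'dog', 'elephant'])
# Capture everything up to (but not including) the last 'e'; strings with no 'e' give NaN
extracted = s.str.extract("(.*)e", expand=True)
print(extracted)
# 'horse' -> 'hors', 'dog' -> NaN, 'elephant' -> 'el'
```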
We already saw how to convert text columns to categorical columns. We can also convert other data types to categorical columns. For example, we could bin a float column into regularly sized bins, then create a categorical column from those bins.
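As a minimal sketch of the binning idea (made-up values), pandas provides pd.cut(), which bins floats and produces a categorical column in one step:

```python
import pandas as pd

s = pd.Series([0.5, 1.5, 2.5, 1.2])
# Cut the floats into three regularly sized bins; the result is categorical
binned = pd.cut(s, bins=[0, 1, 2, 3])
print(binned.cat.codes.tolist())  # [0, 1, 2, 1]
print(binned.dtype)               # category
```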
Finally, it is often useful to clean up text entries before trying to turn them into features. For example, we may want to remove all punctuation, capital letters, or other special characters. We may also want to consider all of the forms of a word as the same word. For example, we may want to have both "dog" and "dogs" as the same feature. Or we may want "wonder" and "wonderful" as the same feature. There are a couple of text processing tools in python that simplify this work considerably.
I created a small dataset to work with. We'll use one of the rows to test our text cleaning process.
In [26]:
textDF = pd.read_csv('Class03_text.tsv',sep='\t')
testcase = textDF['review'][3]
testcase
Out[26]:
The first thing we notice is that there are hypertext bits in the text (the <br />
items). We want to clean all of those out. The BeautifulSoup library does this for us.
In [27]:
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(testcase,"html5lib").text
cleantext
Out[27]:
We now want to get rid of everything that isn't an alphabetical letter. That will clean up all punctuation and get rid of all numbers. We'll use a regex substitution function to do this. It looks for everything that is not an alphabetical character and replaces it with a blank space.
In [28]:
import re
onlyletters = re.sub("[^a-zA-Z]"," ",cleantext)
onlyletters
Out[28]:
We'll get rid of upper-case letters to only look at the words themselves.
In [29]:
lowercase = onlyletters.lower()
lowercase
Out[29]:
The next two steps we'll do at once, because we need to split the text into individual words to do them. The split() function breaks up the string into a list of words. We will then eliminate any words that are stopwords in English. These are words like "and", "or", and "the" that don't communicate much information but are necessary for the language.
The other thing we'll do is cut the words down to their root stems. This will get rid of plurals and other modifications of words.
In [30]:
import nltk
from nltk.corpus import stopwords # Import the stop word list
words = lowercase.split()
meaningfulwords = [w for w in words if w not in stopwords.words("english")]
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
stemmedwords = [snowball_stemmer.stem(w) for w in meaningfulwords ]
print(" ".join(meaningfulwords))
print("\n")
print(" ".join(stemmedwords))
In [31]:
# Now we make a function that we can apply to every entry in the dataframe
def cleantext(textinput):
    # First pass: remove any html tags
    from bs4 import BeautifulSoup
    cleaned = BeautifulSoup(textinput,"html5lib").text
    # Second pass: remove non-letters and make everything lower case
    import re
    onlyletters = re.sub("[^a-zA-Z]"," ",cleaned)
    lowercase = onlyletters.lower()
    # Third pass: remove all stop words (non-essential words)
    from nltk.corpus import stopwords # Import the stop word list
    words = lowercase.split()
    meaningfulwords = [w for w in words if w not in stopwords.words("english")]
    # Fourth pass: get the word stems so that plurals, etc. are reduced
    from nltk.stem import SnowballStemmer
    snowball_stemmer = SnowballStemmer("english")
    stemmedwords = [snowball_stemmer.stem(w) for w in meaningfulwords]
    # Put the words back together again with a single space between them
    return " ".join(stemmedwords)
textDF['cleaned'] = textDF['review'].apply(cleantext)
textDF
Out[31]:
The tutorial on cleaning messy data is located here: http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb
Follow the tutorial, looking at the data and how to do a preliminary clean to eliminate entries that aren't correct or don't help. The data file can be loaded from the SageMath folder. I've reduced the number of column features in the data set to make it a bit easier to work with.
In [32]:
requests = pd.read_csv("Class03_311_data.csv")
Your assignment is to do data processing and cleaning on your own dataset. I want documentation of what you've done and why you chose to do those things to your data.
I would also like you to try redoing your regression from last week, using the new features that you create through the data processing steps. See if you can improve the quality of your regression.
In [ ]: