Class 01

Big Data Ingesting: CSVs, Data frames, and Plots

Welcome to PHY178/CSC171. We will be using the Python language to import data, run machine learning, visualize the results, and communicate those results.

Much of the data that we will use this semester is stored in CSV files. CSV stands for Comma-Separated Values. The data are stored in rows, one row per line, with the column values separated by commas. Take a quick look at the data in Class01_diabetes_data.csv by clicking on it in the "Files" tab. You can see that the entries all bunch up together, since they are separated by the comma delimiter and not by spaces.
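To see what "one row per line, values separated by commas" means in practice, here is a minimal sketch that parses a few lines of made-up CSV text with pandas. The column names and values are invented for illustration and are not from the course files.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# each following line is one row, and commas separate the values.
csv_text = "name,height_cm,weight_kg\nAna,162,58.5\nBen,180,77.0\n"

# pd.read_csv accepts a file name or any file-like object;
# io.StringIO lets us treat the string as if it were a file.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

Each comma-separated value ends up in its own column of the resulting data frame, which is much easier to read than the raw text.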

Where to get data

We will spend quite a bit of time looking for public data as we get going in this class. Here are a couple of places to look for data sets to work with:

Explore a few of these and try downloading one of the files. For example, the data in the UCI repository can be downloaded from the "Data Folder" links. You have to right-click the file, then save it to the local computer. Their files aren't labeled as "CSV" files (the file extension is .data), but they are CSV files.
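Files like these may have no header row, in which case pandas would treat the first row of data as the column names. A minimal sketch of one way to handle that, assuming a headerless comma-separated file; the text and the column names below are made up for illustration:

```python
import io
import pandas as pd

# Made-up contents standing in for a downloaded ".data" file with no header row.
raw_text = "5.1,3.5,a\n4.9,3.0,b\n"

# header=None tells pandas that the first line is data, not column names;
# names= supplies the column names ourselves (hypothetical names here).
df = pd.read_csv(io.StringIO(raw_text), header=None, names=['x1', 'x2', 'label'])
print(df)
```

With a real download you would pass the file name in place of the io.StringIO object and take the column names from the data set's description page.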

How to put it on the cloud

Once you have a data file, you need to upload it to the cloud so that we can import it and plot it. The easiest way to do this is to click on the "Files" link in the toolbar. Click on the "Create" button and then drag the file into the upload box. Put the file in the same folder as the Class01 notebook and you'll be able to load it later on.

Import Regression Data

The first thing we want to do is to import data into our notebook so that we can examine it, evaluate it, and use machine learning to learn from it. We will be using a Python library that makes all of that much easier.

Jupyter Hint: Run the command in the next cell to import the pandas library. You evaluate cells in the notebook by highlighting them (by clicking on them), then pressing Shift-Enter to execute the cell.


In [1]:
import pandas as pd

The next step will be to copy the data file that we will be using for this tutorial into the same folder as these notes. We will be looking at a couple of different types of data sets. We'll start with a simple data set that appears to be functional: one output column depends on the input columns of the data. In this case, we're looking at a set of patient data where there are a handful of input variables that may feed into the likelihood that the patient will develop type 2 diabetes. The output column is a quantitative measure of disease progression one year after baseline measurements. (http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)


In [2]:
diabetes = pd.read_csv('Class01_diabetes_data.csv')

Now that we've loaded the data in, the first thing to do is to take a look at the raw data. We can look at the first 5 rows (the head of the data set) by doing the following.


In [3]:
diabetes.head()


Out[3]:
Age Sex BMI BP TC LDL HDL TCH LTG GLU Target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 NaN 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0

Before we move forward, note that there is a strange value in the first row under 'GLU': NaN. This means 'not a number' and indicates that there was a missing value or some other problem with the data. We want to drop any row that has missing values in it. There is a simple pandas command that will do that: dropna(inplace=True). The argument inplace=True tells pandas to drop the rows from our current data set rather than return a new copy.
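The difference between the in-place form and the copy form can be sketched on a toy data frame. The data here are made up and the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in column 'b'.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [np.nan, 5.0, 6.0]})

# Count the missing values in each column before dropping anything.
print(toy.isna().sum())

# Without inplace=True, dropna returns a new data frame and leaves toy unchanged.
cleaned = toy.dropna()
print(len(toy), len(cleaned))

# With inplace=True, toy itself is modified and nothing is returned.
toy.dropna(inplace=True)
print(len(toy))
```

Note that the remaining rows keep their original row labels, so the first row of the cleaned data is labeled 1, not 0.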


In [4]:
diabetes.dropna(inplace=True)
diabetes.head()


Out[4]:
Age Sex BMI BP TC LDL HDL TCH LTG GLU Target
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0
5 -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288 0.041277 -0.076395 -0.041180 -0.096346 97.0

So we see the first row is gone. That's what we wanted. However, this doesn't really tell us much by itself. It is better to start investigating how the output variable ('Target' in this case) depends on the inputs. We'll visualize the inputs one at a time. First we'll make a scatter plot of the Target as a function of the Age column. The first argument provides the 'x' values and the second provides the 'y' values. The final argument, kind='scatter', tells the plotting software to plot the data points as dots, not connected lines. We'll almost always use this option.


In [5]:
diabetes.plot(x='Age',y='Target',kind='scatter')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1f929601d0>

This doesn't tell us much. It looks like there isn't a large dependence on age; otherwise we would have seen something more specific than a large blob of data. Let's try other inputs. We'll plot a bunch of them in a row.

Jupyter Hint: Clicking in the white space next to the output cell will expand and contract the output contents. This is helpful when you have lots of output.


In [6]:
diabetes.plot(x='Sex',y='Target',kind='scatter')
diabetes.plot(x='BMI',y='Target',kind='scatter')
diabetes.plot(x='BP',y='Target',kind='scatter')
diabetes.plot(x='TC',y='Target',kind='scatter')
diabetes.plot(x='LDL',y='Target',kind='scatter')
diabetes.plot(x='HDL',y='Target',kind='scatter')
diabetes.plot(x='TCH',y='Target',kind='scatter')
diabetes.plot(x='LTG',y='Target',kind='scatter')
diabetes.plot(x='GLU',y='Target',kind='scatter')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1f90284e90>

It looks like for some of these inputs, such as BMI, the Target tends to go up as the input goes up.
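One way to put a number on what these plots suggest is the correlation between each input column and the Target. This is a minimal sketch on made-up data, not the diabetes file; the column names mirror the ones above only for readability. The corr() method computes the Pearson correlation, which ranges from -1 to 1.

```python
import pandas as pd

# Made-up data: Target rises steadily with BMI and is only weakly related to Age.
toy = pd.DataFrame({
    'Age':    [0.02, -0.01, 0.03, -0.02, 0.00],
    'BMI':    [-0.04, -0.02, 0.00, 0.02, 0.04],
    'Target': [80.0, 110.0, 150.0, 170.0, 220.0],
})

# Correlation of every input column with Target.
corr_with_target = toy.corr()['Target'].drop('Target')
print(corr_with_target.sort_values(ascending=False))
```

A value close to 1 or -1 points to a strong straight-line trend, while a value near 0 corresponds to the kind of blob we saw in the Age plot.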

Import Classification Data

There is another type of data set where we have any number of input variables, but the output is no longer a continuous number. Instead, it is a class: one of a finite number of possibilities. For example, in this next data set, we are looking at the characteristics of three different varieties of iris flowers. Each set of measurements belongs to one of the three types:

  • Setosa
  • Versicolour
  • Virginica

Let's take a look at this data set and see what it takes to visualize it. First load the data in and inspect the first few rows.


In [7]:
irisDF = pd.read_csv('Class01_iris_data.csv')

irisDF.head()


Out[7]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.0 2.3 3.3 1.0 Versicolour
1 5.7 2.9 4.2 1.3 Versicolour
2 4.7 3.2 1.6 0.2 Setosa
3 7.7 3.0 6.1 2.3 Virginica
4 5.5 2.5 4.0 1.3 Versicolour

As you can see, the 'target' column is no longer numerical, but a text entry that is one of the three possible iris varieties. We also see that the default column headings are a bit long and will get tiring to type out when we want to reference them. Let's rename the columns first.


In [8]:
irisDF.columns=['sepalLen','sepalWid','petalLen','petalWid','target']

irisDF.head()


Out[8]:
sepalLen sepalWid petalLen petalWid target
0 5.0 2.3 3.3 1.0 Versicolour
1 5.7 2.9 4.2 1.3 Versicolour
2 4.7 3.2 1.6 0.2 Setosa
3 7.7 3.0 6.1 2.3 Virginica
4 5.5 2.5 4.0 1.3 Versicolour
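To check which classes appear in a column and how many rows belong to each, pandas provides unique() and value_counts(). A minimal sketch, using only the five rows displayed above (typed in by hand so the example is self-contained):

```python
import pandas as pd

# The five rows shown above; only one measurement column is needed here.
toy = pd.DataFrame({
    'petalLen': [3.3, 4.2, 1.6, 6.1, 4.0],
    'target': ['Versicolour', 'Versicolour', 'Setosa', 'Virginica', 'Versicolour'],
})

# unique() lists the distinct class labels; value_counts() counts rows per class.
print(toy['target'].unique())
print(toy['target'].value_counts())
```

Running the same two calls on irisDF['target'] would give the counts for the full data set.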

Now we want to visualize the data. We don't know what to expect, so let's just pick a couple of variables and see what the data look like.


In [9]:
irisDF.plot(x='sepalLen',y='sepalWid',kind='scatter')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1f9004db10>

So we see that there are entries at a number of different points, but it would be really nice to be able to identify which point corresponds to which variety. We will use another Python library to do this. We'll also set the default style to 'white', which looks better.


In [10]:
import seaborn as sns
sns.set_style('white')

The seaborn library provides a number of different plotting options. One of them is lmplot. It is designed to provide a linear model fit (which we don't want right now), so we'll set the fit_reg option to False so that it doesn't try to fit the data.

Note that we need two additional parameters here: the first is to tell seaborn to use the irisDF data. That means it will look in that data set for the x and y columns we provide. The second is the hue option. This tells seaborn what column to use to determine the color (or hue) of the points. In this case, it will notice that there are three different options in that column and color them appropriately.


In [11]:
sns.lmplot(x='sepalLen', y='sepalWid', data=irisDF, hue='target', fit_reg=False)


Out[11]:
<seaborn.axisgrid.FacetGrid at 0x7f1f88618510>

Now we can see that the cluster off to the left all belongs to the Setosa variety. It would be really nice to try plotting the other variables as well. We could do that manually or use a nice shortcut in seaborn called pairplot. This plots every pair of the numeric data columns against each other and colors the points by the hue column.


In [12]:
sns.pairplot(irisDF, hue="target")


Out[12]:
<seaborn.axisgrid.PairGrid at 0x7f1f903b1690>

We see that some of these plots show there might be a way to distinguish the three different varieties. We'll look at how to do that later on, but this gives us a start.
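A numerical counterpart to these plots is to average each measurement within each variety. This is a minimal sketch that uses only the five rows displayed earlier (typed in by hand), so the numbers are illustrative and not a summary of the full data set:

```python
import pandas as pd

# The five rows from the irisDF.head() output, with the shortened column names.
toy = pd.DataFrame({
    'sepalLen': [5.0, 5.7, 4.7, 7.7, 5.5],
    'sepalWid': [2.3, 2.9, 3.2, 3.0, 2.5],
    'petalLen': [3.3, 4.2, 1.6, 6.1, 4.0],
    'petalWid': [1.0, 1.3, 0.2, 2.3, 1.3],
    'target': ['Versicolour', 'Versicolour', 'Setosa', 'Virginica', 'Versicolour'],
})

# Group the rows by variety and average each measurement within the group.
class_means = toy.groupby('target').mean()
print(class_means)
```

Columns whose averages differ a lot from one variety to the next are candidates for telling the varieties apart, which is the same information the pair plots show visually.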

Import Image Data

The last type of data we are going to look at is image data. This type of data provides information about each pixel (or element) in an image. We'll start by working with gray-scale images where each pixel could be a value anywhere between 0 (black) and 255 (white). We'll read in the data, then look at how to create the image. This data set consists of handwritten digits from 0 to 9 that have been digitized. We will eventually try to teach the computer to read the handwritten digits.


In [13]:
digitDF = pd.read_csv('Class01_digits_data.csv')

digitDF.head()


Out[13]:
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 target
0 0.0 0.0 7.0 12.0 12.0 2.0 0.0 0.0 0.0 5.0 ... 0.0 0.0 0.0 11.0 12.0 13.0 14.0 11.0 0.0 2
1 0.0 0.0 0.0 10.0 15.0 3.0 0.0 0.0 0.0 0.0 ... 2.0 0.0 0.0 1.0 9.0 15.0 16.0 11.0 0.0 6
2 0.0 0.0 7.0 13.0 3.0 0.0 0.0 0.0 0.0 0.0 ... 3.0 0.0 0.0 6.0 15.0 6.0 9.0 9.0 1.0 2
3 0.0 0.0 11.0 16.0 9.0 8.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 13.0 14.0 8.0 0.0 0.0 0.0 8
4 0.0 0.0 0.0 3.0 16.0 3.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 4.0 14.0 0.0 0.0 0.0 4

5 rows × 65 columns

This data set has 65 columns. The first 64 correspond to the grayscale value for each of the pixels in an 8 by 8 image. The last column (the 'target') indicates what digit the image is supposed to be. We'll pick one row to start with (row 61 in this case). We'll use some in-line commenting to explain each step here.


In [14]:
testnum = 61
#
# First, get the first 64 columns which correspond to the image data
#
testimage = digitDF.loc[testnum][0:64]

#
# Then reshape this from a 1 by 64 array into a matrix that is 8 by 8.
# We use .values to get the underlying NumPy array before reshaping,
# because calling reshape directly on the pandas row is deprecated.
#
testimage = testimage.values.reshape((8,8))

#
# We'll print out what the image is supposed to be. Note the format of the print statement.
# The '{}' means 'insert the argument from the format here'.
# The .format means 'pass these values into the string'.
# We look up the expected digit by its column name, 'target'.
#
print('Expected Digit: {}'.format(digitDF.loc[testnum]['target']))

#
# Finally, we need one more library to plot the images.
#
import matplotlib.pyplot as plt

#
# We tell Python to plot a gray scale image, then to show our reshaped data as an image.
#
plt.gray()
plt.matshow(testimage)


Expected Digit: 3.0
Out[14]:
<matplotlib.image.AxesImage at 0x7f1f87739d90>
<matplotlib.figure.Figure at 0x7f1f87afef50>
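The reshape step decides where each of the 64 pixel values lands in the 8 by 8 grid. A minimal sketch, using the numbers 0 through 63 as stand-in pixel values so that each entry shows where it came from:

```python
import numpy as np

# Stand-in for one row of pixel data: 64 values numbered 0..63.
flat = np.arange(64)

# reshape fills the 8 by 8 grid row by row: the first 8 values become
# row 0, the next 8 become row 1, and so on.
grid = flat.reshape((8, 8))

print(grid[0])     # first row of the image: values 0 through 7
print(grid[2, 3])  # row 2, column 3 holds value 2*8 + 3 = 19
```

This row-by-row ordering is NumPy's default, and it is the ordering the cell above relies on when it turns a row of the data file into an image.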

Practice

There is one more data set for you to practice on. It has the filename Class01_breastcancer_data.csv. This data set comes from a study of breast cancer in Wisconsin. You can read more about the data set on page 6 (search BreastCancer) of this file: https://cran.r-project.org/web/packages/mlbench/mlbench.pdf Go ahead and load the data, investigate what is there and plot the data to see what we have.
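A few pandas calls are handy for a first look at any new data frame. This is a minimal sketch on made-up CSV text; the column names and values are invented and say nothing about what is in the practice file:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a data file; the second row is missing 'size'.
csv_text = "id,size,label\n1,0.5,a\n2,,b\n3,0.9,a\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)         # (number of rows, number of columns)
print(toy.dtypes)        # which columns are numeric and which are text
print(toy.isna().sum())  # number of missing values in each column
print(toy.describe())    # summary statistics for the numeric columns
```

These checks tell you whether the output column is numeric (regression) or a set of labels (classification), and whether any rows need to be dropped before plotting.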

Assignment

Your assignment is to get your own data set loaded and plotted in your own notebook. This data exploration is the first step to doing machine learning. You will need to get at least 2 different data sets: one that will use a regression and one that is a classification set. We'll use them in future classes to explore different machine learning algorithms.

Working with SageMath Assignments

You have a copy of this file in your Assignments Folder. You should create a copy of that file and do your work in the copy, leaving the original unmodified. That way you won't accidentally erase any of the notes. Rename your copy:

Class01_to_grade.ipynb

That way I will know what file to look at when I grade the assignments.