This notebook explores a sample of the SEER breast cancer data

First we import the necessary packages-- Pandas, Numpy and Matplotlib



In [14]:

    
%matplotlib inline

import os
import json
import time
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix, radviz, parallel_coordinates

In order to make the data easy to read, we define the features and labels for the classification variable. We then import the dataset-- a sample of 10000 cases from the larger breast cancer data we previously wrangled. Once the data is read-in we can use Pandas describe function to get a general understanding of the data.



In [66]:

    
FEATURES  = [
    "Birth Year",
    "Age at Diagnosis",
    "Race",
    "Origin",
    "laterality",
    "Radiation",
    "Histrec",
    "ER Status",
    "PR Status",
    "Behanal",
    "Stage",
    "Numprimes",
    "Survival Time",
    "Bucket"
]

LABEL_MAP = {
    0: "< 60 Months",
    1: "60 < months > 120",
    2: "> 120 months",
}

# Read the data into a DataFrame
df = pd.read_csv("clean1.csv", sep=',' , header=0, names=FEATURES)


# Convert class labels into text
for k,v in LABEL_MAP.items():
    df.ix[df.Bucket == k, 'Bucket'] = v

print(df.head(n=5))

df.describe()









    



    Birth Year  Age at Diagnosis  Race  Origin  laterality  Radiation  \
0         1919                82     1       0           1          0   
5         1959                40     1       0           1          0   
8         1931                72     1       0           1          1   
10        1921                84     1       0           1          0   
11        1933                74     1       0           1          1   

    Histrec  ER Status  PR Status  Behanal  Stage  Numprimes  Survival Time  \
0         9          2          0        3      4          1             23   
5         9          2          2        3      1          1            154   
8         9          2          2        3      1          1            104   
10        9          2          2        3      1          1             89   
11        9          2          2        2      0          1             69   

               Bucket  
0         < 60 Months  
5        > 120 months  
8   60 < months > 120  
10  60 < months > 120  
11  60 < months > 120  






    Out[66]:






  
    
      
      Birth Year
      Age at Diagnosis
      Race
      Origin
      laterality
      Radiation
      Histrec
      ER Status
      PR Status
      Behanal
      Stage
      Numprimes
      Survival Time
    
  
  
    
      count
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
      10497.000000
    
    
      mean
      1942.575212
      57.031342
      2.612651
      0.264838
      1.002001
      0.593122
      8.743832
      1.519101
      1.338287
      2.928360
      1.398304
      1.160713
      120.349624
    
    
      std
      13.475209
      12.930236
      10.292738
      1.253416
      0.044685
      0.631462
      1.212398
      0.851179
      0.936114
      0.257902
      0.840551
      0.367283
      65.418822
    
    
      min
      1894.000000
      20.000000
      1.000000
      0.000000
      1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      2.000000
      0.000000
      1.000000
      0.000000
    
    
      25%
      1933.000000
      47.000000
      1.000000
      0.000000
      1.000000
      0.000000
      9.000000
      2.000000
      0.000000
      3.000000
      1.000000
      1.000000
      72.000000
    
    
      50%
      1943.000000
      56.000000
      1.000000
      0.000000
      1.000000
      1.000000
      9.000000
      2.000000
      2.000000
      3.000000
      1.000000
      1.000000
      111.000000
    
    
      75%
      1952.000000
      66.000000
      1.000000
      0.000000
      1.000000
      1.000000
      9.000000
      2.000000
      2.000000
      3.000000
      2.000000
      1.000000
      166.000000
    
    
      max
      1985.000000
      100.000000
      99.000000
      9.000000
      2.000000
      5.000000
      22.000000
      2.000000
      2.000000
      3.000000
      4.000000
      2.000000
      275.000000

This next line is meant to help determine the frequency of each class in the data



In [60]:

    
print df.groupby('Bucket')['Bucket'].count()









    



  File "<ipython-input-60-06245244e21f>", line 2
    print df.groupby('Bucket')['Bucket'].count()
           ^
SyntaxError: invalid syntax

Now we can start using visualizations to explore the data!

First up, a histogram of survival time in months:



In [67]:

    
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(df['Survival Time'], bins = 10, range = (df['Survival Time'].min(),df['Survival Time'].max()))
plt.title('Survival Time Distribution')
plt.xlabel('Survival Time in months')
plt.ylabel('count')
plt.show()

Next, a scatter matrix of all our variables



In [61]:

    
scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()



In [63]:

    
plt.figure(figsize=(12,12))
parallel_coordinates(df, 'Bucket')
plt.show()

This next plot shows our 3 classess are not actually distinct and there is quite a bit of overlap... maybe thats why our classification machine learning models don't do so well?



In [64]:

    
plt.figure(figsize=(12,12))
radviz(df, 'Bucket')
plt.show()



In [ ]:

	Birth Year	Age at Diagnosis	Race	Origin	laterality	Radiation	Histrec	ER Status	PR Status	Behanal	Stage	Numprimes	Survival Time
count	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000	10497.000000
mean	1942.575212	57.031342	2.612651	0.264838	1.002001	0.593122	8.743832	1.519101	1.338287	2.928360	1.398304	1.160713	120.349624
std	13.475209	12.930236	10.292738	1.253416	0.044685	0.631462	1.212398	0.851179	0.936114	0.257902	0.840551	0.367283	65.418822
min	1894.000000	20.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	2.000000	0.000000	1.000000	0.000000
25%	1933.000000	47.000000	1.000000	0.000000	1.000000	0.000000	9.000000	2.000000	0.000000	3.000000	1.000000	1.000000	72.000000
50%	1943.000000	56.000000	1.000000	0.000000	1.000000	1.000000	9.000000	2.000000	2.000000	3.000000	1.000000	1.000000	111.000000
75%	1952.000000	66.000000	1.000000	0.000000	1.000000	1.000000	9.000000	2.000000	2.000000	3.000000	2.000000	1.000000	166.000000
max	1985.000000	100.000000	99.000000	9.000000	2.000000	5.000000	22.000000	2.000000	2.000000	3.000000	4.000000	2.000000	275.000000