Breast Cancer Wisconsin (Diagnostic) Data Set

This popular dataset contains measurements that may help determine whether a breast tumor is benign or malignant. It has 32 columns and 569 rows, and it is used in class to introduce binary (two-class) classification. The following fields are present:

  • id - Identifier column, not useful as an input to a neural network.
  • diagnosis - Diagnosis (the classification target), B=Benign, M=Malignant.
  • mean_radius - Potentially predictive field.
  • mean_texture - Potentially predictive field.
  • mean_perimeter - Potentially predictive field.
  • mean_area - Potentially predictive field.
  • mean_smoothness - Potentially predictive field.
  • mean_compactness - Potentially predictive field.
  • mean_concavity - Potentially predictive field.
  • mean_concave_points - Potentially predictive field.
  • mean_symmetry - Potentially predictive field.
  • mean_fractal_dimension - Potentially predictive field.
  • se_radius - Potentially predictive field.
  • se_texture - Potentially predictive field.
  • se_perimeter - Potentially predictive field.
  • se_area - Potentially predictive field.
  • se_smoothness - Potentially predictive field.
  • se_compactness - Potentially predictive field.
  • se_concavity - Potentially predictive field.
  • se_concave_points - Potentially predictive field.
  • se_symmetry - Potentially predictive field.
  • se_fractal_dimension - Potentially predictive field.
  • worst_radius - Potentially predictive field.
  • worst_texture - Potentially predictive field.
  • worst_perimeter - Potentially predictive field.
  • worst_area - Potentially predictive field.
  • worst_smoothness - Potentially predictive field.
  • worst_compactness - Potentially predictive field.
  • worst_concavity - Potentially predictive field.
  • worst_concave_points - Potentially predictive field.
  • worst_symmetry - Potentially predictive field.
  • worst_fractal_dimension - Potentially predictive field.
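
The 30 potentially predictive fields listed above follow a regular pattern: three prefixes (mean, se, worst) applied to the same ten measurements. As a minimal sketch of that arithmetic, the names can be generated programmatically and counted; the variable names here are illustrative and not part of the course code.

```python
from itertools import product

# The three prefixes and ten measurements that appear in the field list.
prefixes = ["mean", "se", "worst"]
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Prefix varies slowest, so the order matches the list above.
feature_names = [f"{p}_{m}" for p, m in product(prefixes, measurements)]

# Two non-feature columns plus the 30 features give the 32 columns.
all_columns = ["id", "diagnosis"] + feature_names

print(len(feature_names), len(all_columns))
```

A list like feature_names can be handy later for selecting only the input columns from the data frame.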

The following code loads the data, shuffles the rows, and shows the first 10 as a sample.


In [6]:
import os

import numpy as np
import pandas as pd

path = "./data/"

filename = os.path.join(path, "wcbreast_wdbc.csv")
# Treat 'NA' and '?' as missing values
df = pd.read_csv(filename, na_values=['NA', '?'])

# Shuffle the rows; the fixed seed makes the order reproducible
np.random.seed(42)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)

# Display the first 10 rows
df[0:10]


Out[6]:
id diagnosis mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points ... worst_radius worst_texture worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension
0 87930 B 12.47 18.60 81.09 481.9 0.09965 0.10580 0.08005 0.03821 ... 14.97 24.64 96.05 677.9 0.14260 0.2378 0.2671 0.10150 0.3014 0.08750
1 859575 M 18.94 21.31 123.60 1130.0 0.09009 0.10290 0.10800 0.07951 ... 24.86 26.58 165.90 1866.0 0.11930 0.2336 0.2687 0.17890 0.2551 0.06589
2 8670 M 15.46 19.48 101.70 748.9 0.10920 0.12230 0.14660 0.08087 ... 19.26 26.00 124.90 1156.0 0.15460 0.2394 0.3791 0.15140 0.2837 0.08019
3 907915 B 12.40 17.68 81.47 467.8 0.10540 0.13160 0.07741 0.02799 ... 12.88 22.91 89.61 515.8 0.14500 0.2629 0.2403 0.07370 0.2556 0.09359
4 921385 B 11.54 14.44 74.65 402.9 0.09984 0.11200 0.06737 0.02594 ... 12.26 19.68 78.78 457.8 0.13450 0.2118 0.1797 0.06918 0.2329 0.08134
5 927241 M 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 ... 25.74 39.42 184.60 1821.0 0.16500 0.8681 0.9387 0.26500 0.4087 0.12400
6 9012000 M 22.01 21.90 147.20 1482.0 0.10630 0.19540 0.24480 0.15010 ... 27.66 25.80 195.00 2227.0 0.12940 0.3885 0.4756 0.24320 0.2741 0.08574
7 853201 M 17.57 15.05 115.00 955.1 0.09847 0.11570 0.09875 0.07953 ... 20.01 19.52 134.90 1227.0 0.12550 0.2812 0.2489 0.14560 0.2756 0.07919
8 8611161 B 13.34 15.86 86.49 520.0 0.10780 0.15350 0.11690 0.06987 ... 15.53 23.19 96.66 614.9 0.15360 0.4791 0.4858 0.17080 0.3527 0.10160
9 911673 B 13.90 16.62 88.97 599.4 0.06828 0.05319 0.02224 0.01339 ... 15.14 21.80 101.20 718.9 0.09384 0.2006 0.1384 0.06222 0.2679 0.07698

10 rows × 32 columns
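
Because id carries no predictive information and diagnosis takes only two values, a common next step for binary classification is to drop the former and encode the latter as 0/1. The following is a minimal sketch under stated assumptions, not necessarily how the class proceeds: it uses a small stand-in frame built from the first three sample rows above (three feature columns only) so it runs without the CSV file, and the helper name prepare_xy is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for df: first three sample rows shown above, three features only.
sample = pd.DataFrame({
    "id": [87930, 859575, 8670],
    "diagnosis": ["B", "M", "M"],
    "mean_radius": [12.47, 18.94, 15.46],
    "mean_texture": [18.60, 21.31, 19.48],
    "mean_area": [481.9, 1130.0, 748.9],
})

def prepare_xy(frame):
    """Hypothetical helper: split a frame into features x and binary target y."""
    # Every column except the identifier and the label is treated as a feature.
    feature_cols = frame.columns.drop(["id", "diagnosis"])
    x = frame[feature_cols].to_numpy(dtype=np.float32)
    # Encode the label: 1 = malignant (M), 0 = benign (B).
    y = (frame["diagnosis"] == "M").to_numpy().astype(np.int64)
    return x, y

x, y = prepare_xy(sample)
print(x.shape, y)
```

The same helper could be applied to the full df, assuming its column names match those listed earlier. Choosing 1 for malignant is a convention, not a requirement; what matters is using it consistently when interpreting predictions.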

