Breast Cancer Wisconsin (Diagnostic) Data Set

T81-558: Applications of Deep Learning
Dataset provided by UCI Machine Learning Repository
Download Here

This popular dataset contains measurements that may help determine whether a breast tumor is benign or malignant. It has 32 columns and 569 rows. This dataset is used in class to introduce binary (two-class) classification. The following fields are present:

id - Identifier column, not useful as an input to a neural network.
diagnosis - Diagnosis, B=Benign, M=Malignant.
mean_radius - Potentially predictive field.
mean_texture - Potentially predictive field.
mean_perimeter - Potentially predictive field.
mean_area - Potentially predictive field.
mean_smoothness - Potentially predictive field.
mean_compactness - Potentially predictive field.
mean_concavity - Potentially predictive field.
mean_concave_points - Potentially predictive field.
mean_symmetry - Potentially predictive field.
mean_fractal_dimension - Potentially predictive field.
se_radius - Potentially predictive field.
se_texture - Potentially predictive field.
se_perimeter - Potentially predictive field.
se_area - Potentially predictive field.
se_smoothness - Potentially predictive field.
se_compactness - Potentially predictive field.
se_concavity - Potentially predictive field.
se_concave_points - Potentially predictive field.
se_symmetry - Potentially predictive field.
se_fractal_dimension - Potentially predictive field.
worst_radius - Potentially predictive field.
worst_texture - Potentially predictive field.
worst_perimeter - Potentially predictive field.
worst_area - Potentially predictive field.
worst_smoothness - Potentially predictive field.
worst_compactness - Potentially predictive field.
worst_concavity - Potentially predictive field.
worst_concave_points - Potentially predictive field.
worst_symmetry - Potentially predictive field.
worst_fractal_dimension - Potentially predictive field.
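The field names above follow a regular pattern: ten base measurements, each repeated under three prefixes (`mean`, `se`, `worst`), plus `id` and `diagnosis`. As a minimal sketch, the list can be rebuilt from that pattern to confirm the count of 32 columns. The names `MEASUREMENTS`, `PREFIXES`, and `expected_columns` are hypothetical helpers introduced here for illustration, not part of the course code.

```python
# The ten base measurements named in the field list above.
MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Each measurement appears once under each of these prefixes.
PREFIXES = ["mean", "se", "worst"]


def expected_columns():
    """Return the column names in the order they are listed above."""
    features = [f"{prefix}_{name}" for prefix in PREFIXES for name in MEASUREMENTS]
    return ["id", "diagnosis"] + features


columns = expected_columns()
print(len(columns))   # 2 non-feature columns + 3 * 10 feature columns
print(columns[:4])
```

Once the CSV is loaded, a list like this could be compared against `df.columns` as a quick sanity check.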

The following code shuffles the rows and displays the first 10 as a sample.



In [6]:

import os

import numpy as np
import pandas as pd

path = "./data/"

filename = os.path.join(path, "wcbreast_wdbc.csv")
# Treat 'NA' and '?' as missing values
df = pd.read_csv(filename, na_values=['NA', '?'])

# Shuffle the rows with a fixed seed so the result is reproducible
np.random.seed(42)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)

# Display the first 10 rows
df[0:10]









Out[6]:

	id	diagnosis	mean_radius	mean_texture	mean_perimeter	mean_area	mean_smoothness	mean_compactness	mean_concavity	mean_concave_points	...	worst_radius	worst_texture	worst_perimeter	worst_area	worst_smoothness	worst_compactness	worst_concavity	worst_concave_points	worst_symmetry	worst_fractal_dimension
0	87930	B	12.47	18.60	81.09	481.9	0.09965	0.10580	0.08005	0.03821	...	14.97	24.64	96.05	677.9	0.14260	0.2378	0.2671	0.10150	0.3014	0.08750
1	859575	M	18.94	21.31	123.60	1130.0	0.09009	0.10290	0.10800	0.07951	...	24.86	26.58	165.90	1866.0	0.11930	0.2336	0.2687	0.17890	0.2551	0.06589
2	8670	M	15.46	19.48	101.70	748.9	0.10920	0.12230	0.14660	0.08087	...	19.26	26.00	124.90	1156.0	0.15460	0.2394	0.3791	0.15140	0.2837	0.08019
3	907915	B	12.40	17.68	81.47	467.8	0.10540	0.13160	0.07741	0.02799	...	12.88	22.91	89.61	515.8	0.14500	0.2629	0.2403	0.07370	0.2556	0.09359
4	921385	B	11.54	14.44	74.65	402.9	0.09984	0.11200	0.06737	0.02594	...	12.26	19.68	78.78	457.8	0.13450	0.2118	0.1797	0.06918	0.2329	0.08134
5	927241	M	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	...	25.74	39.42	184.60	1821.0	0.16500	0.8681	0.9387	0.26500	0.4087	0.12400
6	9012000	M	22.01	21.90	147.20	1482.0	0.10630	0.19540	0.24480	0.15010	...	27.66	25.80	195.00	2227.0	0.12940	0.3885	0.4756	0.24320	0.2741	0.08574
7	853201	M	17.57	15.05	115.00	955.1	0.09847	0.11570	0.09875	0.07953	...	20.01	19.52	134.90	1227.0	0.12550	0.2812	0.2489	0.14560	0.2756	0.07919
8	8611161	B	13.34	15.86	86.49	520.0	0.10780	0.15350	0.11690	0.06987	...	15.53	23.19	96.66	614.9	0.15360	0.4791	0.4858	0.17080	0.3527	0.10160
9	911673	B	13.90	16.62	88.97	599.4	0.06828	0.05319	0.02224	0.01339	...	15.14	21.80	101.20	718.9	0.09384	0.2006	0.1384	0.06222	0.2679	0.07698

10 rows × 32 columns
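Because the dataset is used for binary classification, a natural next step is to separate the inputs from the label. The sketch below is one possible way to do that with pandas, not the course's own code. It assumes the `id` column is dropped and that `diagnosis` is encoded as B=0 and M=1 (the 0/1 assignment is an arbitrary choice made here). The function name `split_features_target` is hypothetical, and the small `sample` frame reuses a few values from the rows shown above so the example runs without the CSV file.

```python
import pandas as pd


def split_features_target(df):
    """Return (X, y): candidate input columns and a 0/1 target.

    Assumes `df` has the `id` and `diagnosis` columns described above.
    """
    # Encode the class label as an integer: B (benign) -> 0, M (malignant) -> 1.
    y = df["diagnosis"].map({"B": 0, "M": 1})
    # Drop the identifier and the label; the remaining columns are the inputs.
    X = df.drop(columns=["id", "diagnosis"])
    return X, y


# Stand-in frame with three of the sample rows and only two feature columns.
sample = pd.DataFrame({
    "id": [87930, 859575, 8670],
    "diagnosis": ["B", "M", "M"],
    "mean_radius": [12.47, 18.94, 15.46],
    "mean_texture": [18.60, 21.31, 19.48],
})

X, y = split_features_target(sample)
print(list(X.columns))
print(y.tolist())
```

Applied to the full `df` loaded above, the same function would return the 30 potentially predictive fields as `X` and the encoded diagnosis as `y`.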


