UCI Datasets


In [1]:
%load_ext autoreload
%autoreload 2
import os
import pandas as pd
import numpy as np
from mclearn.tools import fetch_data, download_data

In [2]:
uci_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/'

The standard datasets are taken from the UCI Machine Learning Repository. Since the raw files ship without column names, a header row is added manually for each dataset.
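`fetch_data` comes from `mclearn.tools`; judging from the calls below, it downloads the raw file, attaches the header, optionally maps string labels to integers and applies a processing function, then saves the result with the target column first. A minimal sketch of that post-download transform, with a hypothetical `prepare` helper operating on in-memory text rather than a URL:

```python
import io
import pandas as pd

# Hypothetical sketch of the transform fetch_data appears to apply after
# downloading: attach column names, map string labels to integers, and
# move the target column to the front (matching the saved CSVs below).
def prepare(raw_csv, header, label=None):
    df = pd.read_csv(io.StringIO(raw_csv), header=None,
                     names=header.split(','))
    if label is not None:
        df['target'] = df['target'].map(label)   # e.g. {'b': 0, 'g': 1}
    return df[['target']].join(df.drop('target', axis=1))

raw = "1,0,0.5,g\n0,1,-0.3,b\n"
df = prepare(raw, 'x1,x2,x3,target', label={'b': 0, 'g': 1})
print(df.columns.tolist())   # ['target', 'x1', 'x2', 'x3']
```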

Binary

Ionosphere radar data, where 'good' is the positive label.


In [3]:
url = uci_url + 'ionosphere/ionosphere.data'
dest = 'data/ionosphere.csv'
header = ','.join("x{0}".format(i) for i in np.arange(1, 35)) + ',target'
data = fetch_data(url, dest, header, label={'b': 0, 'g': 1})
data.head()


Out[3]:
target x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x25 x26 x27 x28 x29 x30 x31 x32 x33 x34
0 1 1 0 0.99539 -0.05889 0.85243 0.02306 0.83398 -0.37708 1.00000 ... 0.56811 -0.51171 0.41078 -0.46168 0.21266 -0.34090 0.42267 -0.54487 0.18641 -0.45300
1 0 1 0 1.00000 -0.18829 0.93035 -0.36156 -0.10868 -0.93597 1.00000 ... -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447
2 1 1 0 1.00000 -0.03365 1.00000 0.00485 1.00000 -0.12062 0.88965 ... 0.57528 -0.40220 0.58984 -0.22145 0.43100 -0.17365 0.60436 -0.24180 0.56045 -0.38238
3 0 1 0 1.00000 -0.45161 1.00000 1.00000 0.71216 -1.00000 0.00000 ... 1.00000 0.90695 0.51613 1.00000 1.00000 -0.20099 0.25682 1.00000 -0.32382 1.00000
4 1 1 0 1.00000 -0.02401 0.94140 0.06531 0.92106 -0.23255 0.77152 ... 0.03286 -0.65158 0.13290 -0.53206 0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697

5 rows × 35 columns

Pima Indians diabetes.


In [4]:
url = uci_url + 'pima-indians-diabetes/pima-indians-diabetes.data'
dest = 'data/pima.csv'
header = 'preg,glucose,diastolic,skin,insulin,bmi,pedi,age,target'
remove_missing = lambda df: df[df['diastolic'] > 0]
data = fetch_data(url, dest, header, process_fn=remove_missing)
data.head()


Out[4]:
target preg glucose diastolic skin insulin bmi pedi age
0 1 6 148 72 35 0 33.6 0.627 50
1 0 1 85 66 29 0 26.6 0.351 31
2 1 8 183 64 0 0 23.3 0.672 32
3 0 1 89 66 23 94 28.1 0.167 21
4 1 0 137 40 35 168 43.1 2.288 33
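In the Pima data a diastolic blood pressure of zero is a missing-value code rather than a real reading, which is what the `process_fn` hook above filters out before the CSV is saved. The same pattern on a toy frame:

```python
import pandas as pd

# Rows with a zero diastolic reading are missing-value codes, so the
# cleaning function above drops them before the CSV is written.
remove_missing = lambda df: df[df['diastolic'] > 0]

toy = pd.DataFrame({'diastolic': [72, 0, 66], 'target': [1, 1, 0]})
cleaned = remove_missing(toy)
print(len(cleaned))   # 2 rows survive; the zero reading is gone
```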

Sonar data, where the task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The metal cylinder is the positive label.


In [5]:
url = uci_url + 'undocumented/connectionist-bench/sonar/sonar.all-data'
dest = 'data/sonar.csv'
header = ','.join('e{0}'.format(i) for i in np.arange(1, 61)) + ',target'
data = fetch_data(url, dest, header, label={'R': 0, 'M': 1})
data.head()


Out[5]:
target e1 e2 e3 e4 e5 e6 e7 e8 e9 ... e51 e52 e53 e54 e55 e56 e57 e58 e59 e60
0 0 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 ... 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
1 0 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 ... 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
2 0 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 ... 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
3 0 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 ... 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
4 0 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 ... 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094

5 rows × 61 columns

Prognostic Wisconsin breast cancer. Recurrence is the positive label.


In [6]:
url = uci_url + 'breast-cancer-wisconsin/wpbc.data'
dest = 'data/wpbc.csv'
header = 'id,target,time,rad1,text1,peri1,area1,smooth1,compact1,concave1,' \
         'conpt1,sym1,fract1,rad2,text2,peri2,area2,smooth2,compact2,concave2,' \
         'conpt2,sym2,fract2,rad3,text3,peri3,area3,smooth3,compact3,concave3,' \
         'conpt3,sym3,fract3,tumor,lymph'
data = fetch_data(url, dest, header, label={'N': 0, 'R': 1})
data.head()


Out[6]:
target rad1 text1 peri1 area1 smooth1 compact1 concave1 conpt1 sym1 ... peri3 area3 smooth3 compact3 concave3 conpt3 sym3 fract3 tumor lymph
0 0 18.02 27.60 117.50 1013.0 0.09489 0.1036 0.1086 0.07055 0.1865 ... 139.70 1436.0 0.1195 0.1926 0.3140 0.1170 0.2677 0.08113 5.0 5
1 0 17.99 10.38 122.80 1001.0 0.11840 0.2776 0.3001 0.14710 0.2419 ... 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 3.0 2
2 0 21.37 17.44 137.50 1373.0 0.08836 0.1189 0.1255 0.08180 0.2333 ... 159.10 1949.0 0.1188 0.3449 0.3414 0.2032 0.4334 0.09067 2.5 0
3 0 11.42 20.38 77.58 386.1 0.14250 0.2839 0.2414 0.10520 0.2597 ... 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 2.0 0
4 1 20.29 14.34 135.10 1297.0 0.10030 0.1328 0.1980 0.10430 0.1809 ... 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 3.5 0

5 rows × 33 columns

MAGIC gamma telescope.


In [3]:
url = uci_url + 'magic/magic04.data'
dest = 'data/magic.csv'
header = 'length,width,size,conc,conc1,asym,m3long,m3trans,alpha,dist,target'
data = fetch_data(url, dest, header, label={'g': 1, 'h': 0})
data.head()


Out[3]:
target length width size conc conc1 asym m3long m3trans alpha dist
0 1 28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.0110 -8.2027 40.0920 81.8828
1 1 31.6036 11.7235 2.5185 0.5303 0.3773 26.2722 23.8238 -9.9574 6.3609 205.2610
2 1 162.0520 136.0310 4.0612 0.0374 0.0187 116.7410 -64.8580 -45.2160 76.9600 256.7880
3 1 23.8172 9.5728 2.3385 0.6147 0.3922 27.2107 -6.4633 -7.1513 10.4490 116.7370
4 1 75.1362 30.9205 3.1611 0.3168 0.1832 -5.5277 28.5525 21.8393 4.6480 356.4620

MiniBooNE particle identification.


In [7]:
url = uci_url + '00199/MiniBooNE_PID.txt'
dest = 'data/miniboone.csv'
download_data(url, dest)

header = ['e{0}'.format(i) for i in np.arange(1, 51)]
data = pd.read_csv(dest, sep=r'\s+', skiprows=1,
                   header=None, names=header, na_values=[-999])
data['target'] = 1
data.loc[36499:, 'target'] = 0
data = data[['target']].join(data.drop('target', axis=1))
data.dropna(axis=0, how='any', inplace=True)
data.to_csv(dest, index=False, float_format='%.12g')

data.head()


Out[7]:
target e1 e2 e3 e4 e5 e6 e7 e8 e9 ... e41 e42 e43 e44 e45 e46 e47 e48 e49 e50
0 1 2.59413 0.468803 20.6916 0.322648 0.009682 0.374393 0.803479 0.896592 3.59665 ... 101.174 -31.3730 0.442259 5.86453 0.000000 0.090519 0.176909 0.457585 0.071769 0.245996
1 1 3.86388 0.645781 18.1375 0.233529 0.030733 0.361239 1.069740 0.878714 3.59243 ... 186.516 45.9597 -0.478507 6.11126 0.001182 0.091800 -0.465572 0.935523 0.333613 0.230621
2 1 3.38584 1.197140 36.0807 0.200866 0.017341 0.260841 1.108950 0.884405 3.43159 ... 129.931 -11.5608 -0.297008 8.27204 0.003854 0.141721 -0.210559 1.013450 0.255512 0.180901
3 1 4.28524 0.510155 674.2010 0.281923 0.009174 0.000000 0.998822 0.823390 3.16382 ... 163.978 -18.4586 0.453886 2.48112 0.000000 0.180938 0.407968 4.341270 0.473081 0.258990
4 1 5.93662 0.832993 59.8796 0.232853 0.025066 0.233556 1.370040 0.787424 3.66546 ... 229.555 42.9600 -0.975752 2.66109 0.000000 0.170836 -0.814403 4.679490 1.924990 0.253893

5 rows × 51 columns
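The cell above labels the first 36499 rows as signal and the rest as background. Note that `.loc` slicing is label-based and inclusive, so `data.loc[36499:, 'target'] = 0` relabels row 36499 and everything after it. A toy version of the same step with a boundary of 3:

```python
import pandas as pd

# .loc slicing is label-based and inclusive: loc[3:, 'target'] = 0
# relabels row 3 and everything after it, mirroring the MiniBooNE cell.
df = pd.DataFrame({'x': range(6)})
df['target'] = 1
df.loc[3:, 'target'] = 0
print(df['target'].tolist())   # [1, 1, 1, 0, 0, 0]
```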

Multiclass

Classic iris dataset from Fisher with three classes.


In [7]:
url = uci_url + 'iris/iris.data'
dest = 'data/iris.csv'
header = 'sepal_l,sepal_w,petal_l,petal_w,target'
data = fetch_data(url, dest, header)
data.head()


Out[7]:
target sepal_l sepal_w petal_l petal_w
0 Iris-setosa 5.1 3.5 1.4 0.2
1 Iris-setosa 4.9 3.0 1.4 0.2
2 Iris-setosa 4.7 3.2 1.3 0.2
3 Iris-setosa 4.6 3.1 1.5 0.2
4 Iris-setosa 5.0 3.6 1.4 0.2

Glass identification, seven classes.


In [8]:
url = uci_url + 'glass/glass.data'
dest = 'data/glass.csv'
header = 'ri,na,mg,al,si,k,ca,ba,fe,target'
data = fetch_data(url, dest, header)
data.head()


Out[8]:
target ri na mg al si k ca ba fe
1 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0
2 1 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0
3 1 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0
4 1 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0
5 1 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0

Classifying a given silhouette as one of four types of vehicle.


In [9]:
names = ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf', 'xag', 'xah', 'xai']
urls = ['{0}statlog/vehicle/{1}.dat'.format(uci_url, x) for x in names]
dest = 'data/vehicle.csv'
header = 'compact circ dcirc rrat prar mlen scat elon prr mlenr svarmaj ' \
         'svarmin gy skewmaj skewmin kurtmin kurtmaj hol target placeholder'
data = fetch_data(urls, dest, header, sep=' ')
data.head()


Out[9]:
target compact circ dcirc rrat prar mlen scat elon prr mlenr svarmaj svarmin gy skewmaj skewmin kurtmin kurtmaj hol
0 van 95 48 83 178 72 10 162 42 20 159 176 379 184 70 6 16 187 197
1 van 91 41 84 141 57 9 149 45 19 143 170 330 158 72 9 14 189 199
2 saab 104 50 106 209 66 10 207 32 23 158 223 635 220 73 14 9 188 196
3 van 93 41 82 159 63 9 144 46 19 143 160 309 127 63 6 10 199 207
4 bus 85 44 70 205 103 52 149 45 19 144 241 325 188 127 9 11 180 183
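The vehicle data ships as nine chunks (xaa through xai), which `fetch_data` is presumably given as a list of URLs to download and concatenate. Each data line also ends with a trailing space, which pandas reads as an extra empty field, hence the `placeholder` column at the end of the header. A sketch of the concatenation, with short inline text standing in for the downloaded chunks:

```python
import io
import pandas as pd

# Concatenating space-separated chunks into one frame, as fetch_data
# presumably does for the nine vehicle files. The trailing space on each
# line yields an empty final field (the 'placeholder' column above).
chunks = ["95 48 van \n91 41 van \n", "104 50 saab \n"]
names = ['compact', 'circ', 'target', 'placeholder']
frames = [pd.read_csv(io.StringIO(c), sep=' ', header=None, names=names)
          for c in chunks]
data = pd.concat(frames, ignore_index=True)
print(len(data))   # 3
```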

Using chemical analysis to determine the origin of wines.


In [10]:
url = uci_url + 'wine/wine.data'
dest = 'data/wine.csv'
header = 'target,alcohol,malic,ash,alcash,mg,phenols,' \
         'flav,nonflav,proan,color,hue,od280,proline'
data = fetch_data(url, dest, header)
data.head()


Out[10]:
target alcohol malic ash alcash mg phenols flav nonflav proan color hue od280 proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

Page blocks.


In [38]:
url = uci_url + 'page-blocks/page-blocks.data.Z'
zip_dest = 'data/pageblocks.csv.Z'
dest = 'data/pageblocks.csv'
download_data(url, zip_dest)
os.system('uncompress {filename}'.format(filename=dest))  # finds dest + '.Z' and decompresses it in place
header = ['height', 'length', 'area', 'eccen', 'pblack', 'pand',
          'meantr', 'blackpix', 'blackand', 'wbtrans', 'target']
data = pd.read_csv(dest, sep=r'\s+', header=None, names=header)
data = data[['target']].join(data.drop('target', axis=1))
data.dropna(axis=0, how='any', inplace=True)
data.to_csv(dest, index=False, float_format='%.12g')
data.head()


Out[38]:
target height length area eccen pblack pand meantr blackpix blackand wbtrans
0 1 5 7 35 1.400 0.400 0.657 2.33 14 23 6
1 1 6 7 42 1.167 0.429 0.881 3.60 18 37 5
2 1 6 18 108 3.000 0.287 0.741 4.43 31 80 7
3 1 5 7 35 1.400 0.371 0.743 4.33 13 26 3
4 1 6 3 18 0.500 0.500 0.944 2.25 9 17 4

Semeion handwritten digit dataset.


In [42]:
url = uci_url + 'semeion/semeion.data'
dest = 'data/semeion.data'
download_data(url, dest, overwrite=False)

matrix = []
with open(dest) as f:
    for i, line in enumerate(f):
        row = line.strip().split(' ')
        target = row[-10:].index('1')
        matrix.append(row[:-10] + [target])

columns = list(np.arange(256)) + ['target']
data = pd.DataFrame(matrix, columns=columns)
data = data[['target']].join(data.drop('target', axis=1))
data = data.astype(float).astype(int)
data.to_csv('data/semeion.csv', index=False)
data.head()


Out[42]:
target 0 1 2 3 4 5 6 7 8 ... 246 247 248 249 250 251 252 253 254 255
0 0 0 0 0 0 0 0 1 1 1 ... 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 1 1 1 ... 1 1 1 1 1 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 257 columns
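Each line of semeion.data ends with ten 0/1 flags, one per digit class, so `row[-10:].index('1')` in the loop above recovers the digit as the position of the single `'1'`:

```python
# The last ten fields of a Semeion row one-hot encode the digit; the
# position of the '1' is the class label, exactly as in the loop above.
flags = '0 0 1 0 0 0 0 0 0 0'.split(' ')
target = flags.index('1')
print(target)   # 2
```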

Yeast dataset.


In [16]:
url = uci_url + 'yeast/yeast.data'
dest = 'data/yeast.csv'
header = 'id mcg gvh alm mit erl pox vac nuc target'
data = fetch_data(url, dest, header, sep=r'\s+')
data.head()


Out[16]:
target mcg gvh alm mit erl pox vac nuc
0 MIT 0.58 0.61 0.47 0.13 0.5 0.0 0.48 0.22
1 MIT 0.43 0.67 0.48 0.27 0.5 0.0 0.53 0.22
2 MIT 0.64 0.62 0.49 0.15 0.5 0.0 0.53 0.22
3 NUC 0.58 0.44 0.57 0.13 0.5 0.0 0.54 0.22
4 MIT 0.42 0.44 0.48 0.54 0.5 0.0 0.48 0.22