Sample datasets in the Scikit-Learn package - for classification

Iris Dataset

load_iris()


In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.DESCR)


Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
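
Before converting the Bunch object into a DataFrame in the next cell, it helps to confirm the array shapes and label names that the description above promises. A minimal sketch using only the documented load_iris attributes:

# Shapes and label names of the iris Bunch object
print(iris.data.shape)       # (150, 4): 150 samples, 4 numeric features
print(iris.target.shape)     # (150,): integer class labels 0, 1, 2
print(iris.feature_names)    # the four measurement names listed above
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']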


In [3]:
import pandas as pd

df = pd.DataFrame(iris.data, columns=iris.feature_names)
sy = pd.Series(iris.target, dtype="category")
sy = sy.cat.rename_categories(iris.target_names)
df['species'] = sy
df


Out[3]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
11 4.8 3.4 1.6 0.2 setosa
12 4.8 3.0 1.4 0.1 setosa
13 4.3 3.0 1.1 0.1 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
16 5.4 3.9 1.3 0.4 setosa
17 5.1 3.5 1.4 0.3 setosa
18 5.7 3.8 1.7 0.3 setosa
19 5.1 3.8 1.5 0.3 setosa
20 5.4 3.4 1.7 0.2 setosa
21 5.1 3.7 1.5 0.4 setosa
22 4.6 3.6 1.0 0.2 setosa
23 5.1 3.3 1.7 0.5 setosa
24 4.8 3.4 1.9 0.2 setosa
25 5.0 3.0 1.6 0.2 setosa
26 5.0 3.4 1.6 0.4 setosa
27 5.2 3.5 1.5 0.2 setosa
28 5.2 3.4 1.4 0.2 setosa
29 4.7 3.2 1.6 0.2 setosa
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 virginica
121 5.6 2.8 4.9 2.0 virginica
122 7.7 2.8 6.7 2.0 virginica
123 6.3 2.7 4.9 1.8 virginica
124 6.7 3.3 5.7 2.1 virginica
125 7.2 3.2 6.0 1.8 virginica
126 6.2 2.8 4.8 1.8 virginica
127 6.1 3.0 4.9 1.8 virginica
128 6.4 2.8 5.6 2.1 virginica
129 7.2 3.0 5.8 1.6 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
132 6.4 2.8 5.6 2.2 virginica
133 6.3 2.8 5.1 1.5 virginica
134 6.1 2.6 5.6 1.4 virginica
135 7.7 3.0 6.1 2.3 virginica
136 6.3 3.4 5.6 2.4 virginica
137 6.4 3.1 5.5 1.8 virginica
138 6.0 3.0 4.8 1.8 virginica
139 6.9 3.1 5.4 2.1 virginica
140 6.7 3.1 5.6 2.4 virginica
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns
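
With the species column attached, the per-class figures from the summary statistics above can be reproduced straight from the DataFrame; a small sketch:

# Mean of each measurement per species, plus the 50/50/50 class counts
print(df.groupby('species').mean())
print(df['species'].value_counts())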


In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue='species')
plt.show()


Newsgroup Text

fetch_20newsgroups(): 20 News Groups text


In [5]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset="all")
print(newsgroups.description)
print(newsgroups.keys())


the 20 newsgroups by date dataset
dict_keys(['description', 'data', 'target', 'target_names', 'filenames', 'DESCR'])

In [6]:
from pprint import pprint
pprint(list(newsgroups.target_names))


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
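
Because this is a classification corpus, it is worth checking how the documents are spread over the 20 classes; a minimal sketch:

import numpy as np

# Document count per newsgroup in the subset="all" corpus
counts = np.bincount(newsgroups.target)
for name, count in zip(newsgroups.target_names, counts):
    print(name, count)
print("total:", len(newsgroups.data))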

In [7]:
print(newsgroups.data[1])
print("="*80)
print(newsgroups.target_names[newsgroups.target[1]])


From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphics Ultra Pro

  - Any other high-performance VLB card


Please post or email.  Thank you!

  - Matt

-- 
    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   
  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- 
    |   of heaven, because everything he does is right and all his ways  |   
    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   

================================================================================
comp.sys.ibm.pc.hardware
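
The entries in data are raw posts like the one above, so they have to be converted to numeric features before a classifier can use them. A minimal sketch with scikit-learn's TfidfVectorizer (one common choice, not the only possible preprocessing):

from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the raw posts into a sparse TF-IDF document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
print(X.shape)   # (number of documents, vocabulary size)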

Olivetti faces

fetch_olivetti_faces()

  • Face recognition images

In [8]:
from sklearn.datasets import fetch_olivetti_faces
olivetti = fetch_olivetti_faces()
print(olivetti.DESCR)
print(olivetti.keys())


Modified Olivetti faces dataset.

The original database was available from (now defunct)

    http://www.uk.research.att.com/facedatabase.html

The version retrieved here comes in MATLAB format from the personal
web page of Sam Roweis:

    http://www.cs.nyu.edu/~roweis/

There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial
details (glasses / no glasses). All the images were taken against a dark
homogeneous background with the subjects in an upright, frontal position (with
tolerance for some side movement).

The original dataset consisted of 92 x 112, while the Roweis version
consists of 64x64 images.

dict_keys(['target', 'images', 'DESCR', 'data'])
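
According to the description there are 40 subjects with 10 images each; the Bunch exposes them both as flattened rows (data) and as 64x64 arrays (images). A quick sketch to confirm:

# 400 images in total: data holds the flattened pixels, images the 64x64 versions
print(olivetti.data.shape)     # (400, 4096) = (400, 64 * 64)
print(olivetti.images.shape)   # (400, 64, 64)
print(olivetti.target.shape)   # (400,): subject id 0..39 for every image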

In [10]:
import numpy as np

N=2; M=5;
fig = plt.figure(figsize=(8,5))
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0.05)
klist = np.random.choice(range(len(olivetti.data)), N * M)
for i in range(N):
    for j in range(M):
        k = klist[i*M+j]
        ax = fig.add_subplot(N, M, i*M+j+1)
        ax.imshow(olivetti.images[k], cmap=plt.cm.bone);
        ax.grid(False)
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.title(olivetti.target[k])
plt.tight_layout()
plt.show()


Labeled Faces in the Wild (LFW)

fetch_lfw_people()

  • Face images of famous people

  • Parameters

    • funneled : boolean, optional, default: True
      • Download and use the funneled variant of the dataset.
    • resize : float, optional, default 0.5
      • Ratio used to resize each face picture.
    • min_faces_per_person : int, optional, default None
      • The extracted dataset will only retain pictures of people that have at least min_faces_per_person different pictures.
    • color : boolean, optional, default False
      • Keep the 3 RGB channels instead of averaging them to a single gray level channel. If color is True the shape of the data has one more dimension than the shape with color = False. A minimal sketch illustrating this follows the list.
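
A minimal sketch of how these parameters affect the returned arrays; the exact shapes depend on the chosen resize and min_faces_per_person values, so they are not listed here:

from sklearn.datasets import fetch_lfw_people

# Grayscale (default): images has shape (n_samples, height, width)
gray = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(gray.images.shape)

# color=True keeps the RGB channels, adding one trailing dimension of size 3
rgb = fetch_lfw_people(min_faces_per_person=70, resize=0.4, color=True)
print(rgb.images.shape)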

In [11]:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(lfw_people.DESCR)
print(lfw_people.keys())


LFW faces dataset
dict_keys(['target', 'target_names', 'images', 'DESCR', 'data'])

In [12]:
N=2; M=5;
fig = plt.figure(figsize=(8,5))
plt.subplots_adjust(top=1, bottom=0, hspace=0.1, wspace=0.05)
klist = np.random.choice(range(len(lfw_people.data)), N * M)
for i in range(N):
    for j in range(M):
        k = klist[i*M+j]
        ax = fig.add_subplot(N, M, i*M+j+1)
        ax.imshow(lfw_people.images[k], cmap=plt.cm.bone);
        ax.grid(False)
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.title(lfw_people.target_names[lfw_people.target[k]])
plt.tight_layout()
plt.show()


fetch_lfw_pairs()

  • Pairs of face images
  • Each pair may or may not show the same person

In [13]:
from sklearn.datasets import fetch_lfw_pairs
lfw_pairs = fetch_lfw_pairs(resize=0.4)
print(lfw_pairs.DESCR)
print(lfw_pairs.keys())


'train' segment of the LFW pairs dataset
dict_keys(['target', 'target_names', 'data', 'pairs', 'DESCR'])
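
Here each sample is a pair of pictures rather than a single face, and target says whether the two pictures show the same person. A minimal sketch of the layout:

# pairs holds both images of every pair: (n_pairs, 2, height, width)
print(lfw_pairs.pairs.shape)
# target is 0/1 per pair; target_names spells out what the two labels mean
print(lfw_pairs.target.shape)
print(lfw_pairs.target_names)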

In [14]:
N=2; M=5;
fig = plt.figure(figsize=(8,5))
plt.subplots_adjust(top=1, bottom=0, hspace=0.01, wspace=0.05)
klist = np.random.choice(range(len(lfw_pairs.data)), M)
for j in range(M):
    k = klist[j]
    ax1 = fig.add_subplot(N, M, j+1)
    ax1.imshow(lfw_pairs.pairs[k][0], cmap=plt.cm.bone);
    ax1.grid(False)
    ax1.xaxis.set_ticks([])
    ax1.yaxis.set_ticks([])
    plt.title(lfw_pairs.target_names[lfw_pairs.target[k]])
    ax2 = fig.add_subplot(N, M, j+1 + M)
    ax2.imshow(lfw_pairs.pairs[k][1], cmap=plt.cm.bone);
    ax2.grid(False)
    ax2.xaxis.set_ticks([])
    ax2.yaxis.set_ticks([])
plt.tight_layout()
plt.show()


Handwritten Digits Images

load_digits()

  • Handwritten digit images

In [15]:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.DESCR)
print(digits.keys())


Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

dict_keys(['target', 'target_names', 'images', 'DESCR', 'data'])
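
The copy bundled with scikit-learn is the UCI test portion mentioned above, so it holds 1,797 samples rather than the 5,620 quoted in the characteristics; data is simply images flattened to 64 values per row. A quick sketch:

import numpy as np

# 8x8 images and their flattened 64-column counterpart
print(digits.images.shape)   # (1797, 8, 8)
print(digits.data.shape)     # (1797, 64)
print(np.array_equal(digits.data,
                     digits.images.reshape(len(digits.images), -1)))   # True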

In [16]:
N=2; M=5;
fig = plt.figure(figsize=(10,5))
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0.05)
for i in range(N):
    for j in range(M):
        k = i*M+j
        ax = fig.add_subplot(N, M, k+1)
        ax.imshow(digits.images[k], cmap=plt.cm.bone, interpolation="none");
        ax.grid(False)
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.title(digits.target_names[k])
plt.tight_layout()
plt.show()


mldata.org repository

fetch_mldata()

  • http://mldata.org
  • A public repository for machine learning data, supported by the PASCAL network
  • Search for the data name on the site and use it as the key

MNIST handwritten digit recognition data

  • https://en.wikipedia.org/wiki/MNIST_database
  • Mixed National Institute of Standards and Technology (MNIST) database
  • Images of handwritten digits 0-9
  • 28x28 pixel bounding box
  • anti-aliased, grayscale levels
  • 60,000 training images and 10,000 testing images

In [18]:
from sklearn.datasets.mldata import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist.keys()


Out[18]:
dict_keys(['target', 'data', 'DESCR', 'COL_NAMES'])
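
The MNIST bunch follows the same pattern: data holds the flattened 28x28 pixel values and target the digit labels, with all 70,000 images in one array. A minimal sketch:

# 70,000 images of 28x28 = 784 pixels each; target holds the digit labels 0..9
print(mnist.data.shape)     # (70000, 784)
print(mnist.target.shape)   # (70000,)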

In [19]:
N=2; M=5;
fig = plt.figure(figsize=(8,5))
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0.05)
klist = np.random.choice(range(len(mnist.data)), N * M)
for i in range(N):
    for j in range(M):
        k = klist[i*M+j]
        ax = fig.add_subplot(N, M, i*M+j+1)
        ax.imshow(mnist.data[k].reshape(28, 28), cmap=plt.cm.bone, interpolation="nearest");
        ax.grid(False)
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.title(mnist.target[k])
plt.tight_layout()
plt.show()