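Step 3. From loading real data to getting an overview

Loading the "wine quality" data

Histograms

Scatter plots

Scatter plot matrix

Correlation matrix

Principal component analysis

Exercises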

Goal of Step 3

Visualize a real multivariate dataset with principal component analysis and other methods to get an overview of it.



In [ ]:

    
# Import libraries for numerical computation and data frame manipulation
import numpy as np
import pandas as pd



In [ ]:

    
# Import the library that provides access to resources by URL.
# import urllib # for Python 2
import urllib.request # for Python 3



In [ ]:

    
# Import libraries for drawing figures and graphs.
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.colors import LinearSegmentedColormap



In [ ]:

    
from sklearn.decomposition import PCA  # principal component analysis

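1. Loading the "wine quality" data

The data were obtained from the UC Irvine Machine Learning Repository and slightly modified.

Details

fixed acidity : concentration of non-volatile acids (mostly tartaric acid)
volatile acidity : concentration of volatile acids (mostly acetic acid)
citric acid : citric acid concentration
residual sugar : residual sugar concentration
chlorides : chloride concentration
free sulfur dioxide : free sulfur dioxide concentration
total sulfur dioxide : total sulfur dioxide concentration
density : density
pH : pH
sulphates : sulphate concentration
alcohol : alcohol content
quality (score between 0 and 10) : quality score given as a value from 0 to 10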



In [ ]:

    
# Specify the resource on the web
url = 'https://raw.githubusercontent.com/chemo-wakate/tutorial-6th/master/beginner/data/winequality-red.txt'
# Download the resource from the specified URL and save it under a local file name.
# urllib.urlretrieve(url, 'winequality-red.txt') # for Python 2
urllib.request.urlretrieve(url, 'winequality-red.txt') # for Python 3
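
Re-running the cell above downloads the file again every time. A minimal sketch of one way to skip the download when a local copy already exists; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download url to path unless a file already exists at path.

    Hypothetical helper for illustration; returns True if a download happened.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

It could be called as `download_if_missing(url, 'winequality-red.txt')` in place of the direct `urlretrieve` call.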



In [ ]:

    
# Load the data (tab-separated; the first column is used as the row index)
df1 = pd.read_csv('winequality-red.txt', sep='\t', index_col=0)
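
To see what `sep='\t'` and `index_col=0` do, here is a small sketch that reads a made-up tab-separated table from memory instead of the downloaded file. The column names and values are invented for illustration and are not taken from the wine data.

```python
import io
import pandas as pd

# A tiny tab-separated table; the first (unnamed) column holds row labels.
text = "\tpH\talcohol\n0\t3.5\t9.4\n1\t3.2\t9.8\n2\t3.3\t10.0\n"

# sep='\t' splits on tabs; index_col=0 turns the first column into the row index.
demo = pd.read_csv(io.StringIO(text), sep='\t', index_col=0)

print(demo.shape)          # (rows, columns); the index is not counted as a column
print(list(demo.columns))
```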



In [ ]:

    
df1 # check the contents



In [ ]:

    
df1.T # .T transposes the table (swaps rows and columns)

2. Histograms



In [ ]:

    
# Import the library for drawing figures and graphs.
import matplotlib.pyplot as plt
%matplotlib inline



In [ ]:

    
df1['fixed acidity'].hist()



In [ ]:

    
df1['fixed acidity'].hist(figsize=(5, 5), bins=20) # increase the number of bins



In [ ]:

    
# All columns can also be shown at once
df1.hist(figsize=(20, 20), bins=20)
plt.show()
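
The `bins` argument only changes how the value range is cut up, not the data being counted. A minimal sketch with synthetic, normally distributed values (not the wine data) using `np.histogram`, which computes the kind of counts a histogram plot draws:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=8.0, scale=1.5, size=1000)  # synthetic data

# Same data, two bin counts: more bins means narrower intervals.
counts10, edges10 = np.histogram(values, bins=10)
counts20, edges20 = np.histogram(values, bins=20)

print(len(counts10), len(edges10))      # number of bins, number of bin edges
print(counts10.sum(), counts20.sum())   # every value is counted once either way
```

Too few bins can hide structure such as a second peak, while too many make the plot noisy, so it is worth trying a few values.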

3. Scatter plots

Pick any two columns to draw a scatter plot.



In [ ]:

    
df1.plot(kind='scatter', x=u'pH', y=u'alcohol', grid=True)

Points can be colored with matplotlib's predefined colormaps. The next example colors the points by quality using the coolwarm colormap. For other colormaps, see e.g. http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html.



In [ ]:

    
df1.plot(kind='scatter', x=u'pH', y=u'alcohol', \
        c=df1['quality'], cmap='coolwarm', grid=True)

There are several ways to draw a similar picture; the version below, for example, gives a slightly different result.



In [ ]:

    
plt.scatter(df1['pH'], df1['alcohol'], alpha=0.5, \
            c=df1['quality'], cmap='coolwarm')
plt.colorbar(label='quality')
plt.xlabel('pH')
plt.ylabel('alcohol')
plt.grid()

Here quality takes discrete values rather than continuous ones, so drawing each quality level as a separate group, as below, may work better.



In [ ]:

    
cmap = plt.get_cmap('coolwarm')
fig, ax = plt.subplots(1, 1)
# Draw each quality level as its own group; i / 5 selects a color from the colormap
# (it spans the full range of the colormap if there are six quality levels)
for i, (key, group) in enumerate(df1.groupby('quality')):
    group.plot(kind='scatter', x='pH', y='alcohol', color=cmap(i / 5), ax=ax, label=key, alpha=0.5, grid=True)
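
Calling a colormap with a number between 0 and 1 returns one RGBA color, which is what `cmap(i / 5)` relies on above. A small sketch of picking one evenly spaced color per group; the group count of 6 is an assumption made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

cmap = plt.get_cmap('coolwarm')
n_groups = 6  # assumed number of distinct quality values

# Evenly spaced positions from 0 (one end of the colormap) to 1 (the other end)
positions = np.linspace(0, 1, n_groups)
group_colors = [cmap(p) for p in positions]

# Each color is an (R, G, B, A) tuple with components between 0 and 1
for p, rgba in zip(positions, group_colors):
    print(round(float(p), 1), tuple(round(float(v), 2) for v in rgba))
```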

If none of the predefined colormaps suits you, you can also define your own, as follows.



In [ ]:

    
dic = {'red':   ((0, 0, 0), (0.5, 1, 1), (1, 1, 1)), 
       'green': ((0, 0, 0), (0.5, 1, 1), (1, 0, 0)), 
       'blue':  ((0, 1, 1), (0.5, 0, 0), (1, 0, 0))}

tricolor_cmap = LinearSegmentedColormap('tricolor', dic)
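
Each entry in the dictionary is a sequence of `(x, y0, y1)` anchor points for one color channel: at position `x` the channel takes the given value, and values in between are interpolated linearly. As a sketch of how to check what the dictionary above encodes, the colormap can be evaluated at a few positions:

```python
from matplotlib.colors import LinearSegmentedColormap

# Same segment data as in the cell above
dic = {'red':   ((0, 0, 0), (0.5, 1, 1), (1, 1, 1)),
       'green': ((0, 0, 0), (0.5, 1, 1), (1, 0, 0)),
       'blue':  ((0, 1, 1), (0.5, 0, 0), (1, 0, 0))}
tricolor_cmap = LinearSegmentedColormap('tricolor', dic)

# Print the RGB components at the low end, the middle, and the high end
for x in (0.0, 0.5, 1.0):
    r, g, b, a = tricolor_cmap(x)
    print(x, round(float(r), 2), round(float(g), 2), round(float(b), 2))
```

The low end comes out blue, the high end red, and the middle close to yellow, so low and high quality scores end up at visually distinct ends.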



In [ ]:

    
plt.scatter(df1['pH'], df1['alcohol'], alpha=0.5, \
            c=df1['quality'], cmap=tricolor_cmap)
plt.colorbar(label='quality')
plt.xlabel('pH')
plt.ylabel('alcohol')
plt.grid()



In [ ]:

    
cmap = tricolor_cmap
fig, ax = plt.subplots(1, 1)
# Same grouped scatter plot as before, now with the custom colormap
for i, (key, group) in enumerate(df1.groupby('quality')):
    group.plot(kind='scatter', x='pH', y='alcohol', color=cmap(i / 5), ax=ax, label=key, alpha=0.5, grid=True)

4. Scatter plot matrix

A scatter plot matrix is very handy for getting an overview of the relationships among many variables.



In [ ]:

    
pd.plotting.scatter_matrix(df1.dropna(axis=1), figsize=(20, 20))
plt.show()



In [ ]:

    
cmap = plt.get_cmap('coolwarm')
# Map each sample's quality score to a color; (c - 3) / 5 lies in [0, 1] when scores run from 3 to 8
colors = [cmap((c - 3) / 5) for c in df1['quality'].tolist()]
pd.plotting.scatter_matrix(df1.dropna(axis=1), figsize=(20, 20), color=colors)
plt.show()

As before, a custom colormap can be used as well.



In [ ]:

    
cmap = tricolor_cmap
# Map each sample's quality score to a color; (c - 3) / 5 lies in [0, 1] when scores run from 3 to 8
colors = [cmap((c - 3) / 5) for c in df1['quality'].tolist()]
pd.plotting.scatter_matrix(df1.dropna(axis=1), figsize=(20, 20), color=colors)
plt.show()

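5. Correlation matrix

For an overview of the relationships between variables, a correlation matrix, which shows the correlation coefficient for every pair of variables, is also useful.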



In [ ]:

    
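# as_matrix() was removed in pandas 1.0; to_numpy() is used instead.
# np.corrcoef expects one variable per row, hence the transpose.
pd.DataFrame(np.corrcoef(df1.dropna().T.to_numpy()),
             columns=df1.columns, index=df1.columns)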
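
pandas can also compute this kind of table directly: `DataFrame.corr()` returns pairwise Pearson correlation coefficients between columns by default. A minimal sketch on synthetic data (not the wine data) comparing it with `np.corrcoef`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(scale=0.5, size=200),  # positively related to a
    'c': -x + rng.normal(scale=0.5, size=200),     # negatively related to a
})

corr_pandas = demo.corr()                    # columns are the variables
corr_numpy = np.corrcoef(demo.T.to_numpy())  # np.corrcoef expects variables in rows

print(np.allclose(corr_pandas.to_numpy(), corr_numpy))
```

Note that `corr()` skips missing values pair by pair, so with missing data its result can differ from dropping incomplete rows first.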

A table full of numbers like the one above makes it hard to grasp the overall picture, so let's turn it into a color map.



In [ ]:

    
# as_matrix() was removed in pandas 1.0; to_numpy() is used instead
corrcoef = np.corrcoef(df1.dropna().T.to_numpy())
#plt.figure(figsize=(8, 8))
plt.imshow(corrcoef, interpolation='nearest', cmap=plt.cm.coolwarm)
plt.colorbar(label='correlation coefficient')
tick_marks = np.arange(len(corrcoef))
plt.xticks(tick_marks, df1.columns, rotation=90)
plt.yticks(tick_marks, df1.columns)
plt.tight_layout()

Among other things, the plot shows that quality is positively correlated with alcohol and negatively correlated with volatile acidity.

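6. Principal component analysis

Before principal component analysis, the data are usually normalized. A common choice, used below, is to standardize each column so that it has mean 0 and variance 1.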



In [ ]:

    
dfs = df1.apply(lambda x: (x-x.mean())/x.std(), axis=0).fillna(0)
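
A quick way to confirm the transformation does what is described is to check the column means and standard deviations afterwards. A sketch on synthetic data (not the wine data); note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'small': rng.normal(loc=3.3, scale=0.15, size=500),   # narrow value range
    'large': rng.normal(loc=46.0, scale=30.0, size=500),  # wide value range
})

# Same transformation as above: subtract the column mean, divide by the column std
standardized = demo.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

print(standardized.mean().round(6).tolist())  # column means after standardization
print(standardized.std().round(6).tolist())   # column standard deviations after standardization
```

Without this step, columns with large numeric ranges would dominate the principal components simply because of their units.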



In [ ]:

    
dfs.head() # show only the first five rows

We run the principal component analysis with PCA from the machine learning library sklearn.



In [ ]:

    
pca = PCA()
pca.fit(dfs.iloc[:, :10])
# Project the data onto the principal component space (dimensionality reduction)
feature = pca.transform(dfs.iloc[:, :10])
#plt.figure(figsize=(6, 6))
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.5)
plt.title('Principal Component Analysis')
plt.xlabel('The first principal component')
plt.ylabel('The second principal component')
plt.grid()
plt.show()

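In principal component analysis, each principal component is a linear combination of the original variables, so we need a measure of how much of the original data each component explains. This measure is called the contribution ratio (explained variance ratio). Accumulating the contribution ratios starting from the first principal component gives the cumulative contribution ratio.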



In [ ]:

    
# Plot the cumulative contribution ratio
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list(np.cumsum(pca.explained_variance_ratio_)), '-o')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative contribution ratio')
plt.grid()
plt.show()
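
For intuition, contribution ratios can be reproduced with plain NumPy: they are the eigenvalues of the covariance matrix divided by their sum. The sketch below uses synthetic data and does not call scikit-learn; the variable names are chosen for illustration, and the result should correspond to what `pca.explained_variance_ratio_` reports for the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 300 samples, 4 variables with different spreads
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Eigenvalues of the covariance matrix are the variances along the principal components
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order; reverse it

contribution_ratio = eigenvalues / eigenvalues.sum()
cumulative_ratio = np.cumsum(contribution_ratio)

print(contribution_ratio.round(3))
print(cumulative_ratio.round(3))
```

The cumulative ratio always reaches 1 once all components are included; the useful question is how few components are needed to get close to it.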

This plot, too, can be colored however you like.



In [ ]:

    
pca = PCA()
pca.fit(dfs.iloc[:, :10])
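# Project the data onto the principal component space (dimensionality reduction)
feature = pca.transform(dfs.iloc[:, :10])
#plt.figure(figsize=(6, 6))
# colors is the per-sample color list built in the scatter plot matrix section
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.5, color=colors)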
plt.title('Principal Component Analysis')
plt.xlabel('The first principal component')
plt.ylabel('The second principal component')
plt.grid()
plt.show()

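Transposing the matrix with .T swaps rows and columns, so the principal component analysis can be run on the variables instead of the samples.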



In [ ]:

    
pca = PCA()
pca.fit(dfs.iloc[:, :10].T)
# Project the data onto the principal component space (dimensionality reduction)
feature = pca.transform(dfs.iloc[:, :10].T)
#plt.figure(figsize=(6, 6))
for x, y, name in zip(feature[:, 0], feature[:, 1], dfs.columns[:10]):
    plt.text(x, y, name, alpha=0.8, size=8)
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.5)
plt.title('Principal Component Analysis')
plt.xlabel('The first principal component')
plt.ylabel('The second principal component')
plt.grid()
plt.show()



In [ ]:

    
# Plot the cumulative contribution ratio
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list(np.cumsum(pca.explained_variance_ratio_)), '-o')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative contribution ratio')
plt.grid()
plt.show()

Exercise 3.1

Load the white wine data (https://raw.githubusercontent.com/chemo-wakate/tutorial-6th/master/beginner/data/winequality-white.txt) and draw histograms, a scatter plot matrix, and a correlation matrix.



In [ ]:

    
# Exercise 3.1