The dataset of edible and poisonous mushroom species of the Agaricus and Lepiota families was downloaded from the UCI Machine Learning Repository ("Mushroom Dataset"). The records were originally drawn from The Audubon Society Field Guide to North American Mushrooms, published in 1981.
"Exploratory Analysis" notebook:
In [1]:
import re, csv, os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline
In [2]:
file = open('/Users/dariaulybina/Desktop/georgetown/ml_practice/agaricus-lepiota.txt', 'r')
#file = open('U:\\agaricus-lepiota.txt', 'r')
list1 = []
for f in file:
    # split each comma-separated record and strip whitespace and newlines
    l = f.split(',')
    li = [x.strip() for x in l]
    list1.append(li)
file.close()
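The manual loop above could probably be replaced by a single pandas call. The following is only a sketch under assumptions: the three-row sample, the shortened column list and the name df_sample are hypothetical and stand in for the real file and the full header list.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as agaricus-lepiota.txt
sample = io.StringIO(
    "p,x,s,n\n"
    "e,x,s,y\n"
    "e,b,s,w\n"
)

# Shortened, illustrative column list; the notebook defines the full list in `headers`
cols = ['classif', 'cap_shape', 'cap_surface', 'cap_colour']

# header=None: the raw file has no header row, so names= supplies the labels.
# dtype=str and keep_default_na=False keep every code as a plain string.
# skipinitialspace=True drops blanks after commas, similar to strip() in the loop.
df_sample = pd.read_csv(sample, header=None, names=cols, dtype=str,
                        keep_default_na=False, skipinitialspace=True)
print(df_sample.shape)
```

With the real file, the path would be passed in place of the StringIO object and the full header list in place of cols.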
In [3]:
print(len(list1))
This confirms that my data file contains 8124 instances.
Below is the list of features and their coding copied from the database description file.
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
In [6]:
headers = ['classif','cap_shape','cap_surface','cap_colour','bruises','odor','gill_attach','gill_space','gill_size',
'gill_color','stalk_shape','stalk_root','stalk_surf_above_ring','stalk_surf_below_ring',
'stalk_color_above_ring','stalk_color_below_ring','veil_type','veil_color','ring_number',
'ring_type','spore_print_color','population','habitat']
print(len(headers))
In [7]:
#Put the list of lists into a dataframe and make sure everything looks OK
df = pd.DataFrame(list1, columns=headers)
df.head()
Out[7]:
In [8]:
df.describe()
Out[8]:
As expected, the values of all features are categorical. Later, I will have to encode them as numerical values.
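The encoding step itself is not part of this notebook. As a rough illustration only, one common option is one-hot encoding for the features and a 0/1 mapping for the target. The toy frame and the names X and y below are assumptions, not the encoding used later.

```python
import pandas as pd

# Toy frame with two of the notebook's columns (values are illustrative)
toy = pd.DataFrame({
    'classif': ['e', 'p', 'e', 'p'],
    'odor': ['a', 'f', 'n', 'f'],
})

# One-hot encode the feature: each category becomes its own 0/1 column
X = pd.get_dummies(toy[['odor']], dtype=int)

# Encode the binary target as 0 (edible) and 1 (poisonous)
y = toy['classif'].map({'e': 0, 'p': 1})

print(X.columns.tolist())
print(y.tolist())
```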
In [8]:
table = pd.crosstab(index=df['classif'], columns="count")
table
Out[8]:
In [9]:
fig = plt.figure(figsize=(2,2))
ax1 = fig.add_subplot(111)
ax1.set_xlabel('classification')
ax1.set_ylabel('count')
ax1.set_title("By classification")
df['classif'].value_counts().plot(kind='bar',color = '#4C72B0')
Out[9]:
Comments:
There are 4208 mushroom instances identified as edible and 3916 as poisonous. The two target classes are roughly balanced, which is good: a model won't be biased towards one category merely because it is overrepresented in the dataset.
I will explore other properties of the mushrooms and create grouped or stacked bar charts to visualize the data.
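To put a number on the class balance, the two counts reported above can be turned into shares. This is a small sketch; the Series is built by hand from those counts rather than from the dataframe.

```python
import pandas as pd

# Class counts as reported in the comment above
counts = pd.Series({'e': 4208, 'p': 3916})

# Share of each class in the whole dataset
shares = counts / counts.sum()
print(shares.round(3))
```

In the notebook itself, df['classif'].value_counts(normalize=True) should give the same shares directly.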
In [10]:
tbl = pd.crosstab(index=df['classif'], columns=df['cap_colour'])
print(tbl)
In [42]:
#Create a grouped bar chart and a stacked bar chart to display the distribution of cap colors by classification
fig, axes = plt.subplots(nrows=1, ncols=2)
tbl.plot(kind="bar",stacked=False, ax=axes[0], color=['#f0dc82', '#D2691E', '#990000',
'#696969','#49311c','#ff69b4',
'#007f00','#800080','#ffffff',
'#ffff00']);
axes[0].legend_.remove()
tbl.plot(kind="bar",stacked=True, ax=axes[1], color=['#f0dc82', '#D2691E', '#990000',
'#696969','#49311c','#ff69b4',
'#007f00','#800080','#ffffff',
'#ffff00']);
axes[1].legend(['Buff','Cinnamon','Red','Gray','Brown',
'Pink','Green','Purple','White','Yellow'],loc='center left', bbox_to_anchor=(1, 0.5))
fig.suptitle('Cap Color')
axes[0].set_ylabel('Species count')
plt.show()
fig.savefig('cap_color.jpg')
Comment:
Surprisingly, cap color is not a great predictor of the edibility of a mushroom. Different colors are distributed differently across the two classes. Red and yellow caps tend to be poisonous more often, while brown, gray and white caps are more prevalent among edible mushrooms. However, one cannot draw any definitive conclusion from cap color alone, so we need to investigate more characteristics.
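Raw counts make it hard to compare colors of different frequency. One way to see how each color splits between the classes is a column-normalized crosstab. The toy rows below are made up for illustration and do not reflect the real distribution.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook
toy = pd.DataFrame({
    'classif':    ['e', 'e', 'e', 'p', 'p', 'p'],
    'cap_colour': ['n', 'n', 'w', 'n', 'e', 'y'],
})

# normalize='columns' rescales every colour column so that it sums to 1,
# giving the edible/poisonous share within each colour
share = pd.crosstab(index=toy['classif'], columns=toy['cap_colour'],
                    normalize='columns')
print(share)
```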
In [44]:
gills = pd.crosstab(index=df['classif'], columns=df["gill_color"])
print(gills)
#gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
s1 = pd.crosstab(index=df['classif'], columns = df['gill_size'])
print(s1)
Comments:
In [60]:
fig, axes = plt.subplots(nrows=1, ncols=2)
plt.tight_layout()
gills.plot(kind="bar",stacked=True, ax=axes[1], color=['#f0dc82','#990000','#696969','#4E2E28','#000000',
'#49311c','#FFA500','#ff69b4','#007f00','#800080',
'#ffffff','#ffff00'])
axes[1].legend(['Buff','Red','Gray','Chocolate','Black','Brown','Orange','Pink',
'Green','Purple','White','Yellow'],loc='center left', bbox_to_anchor=(1, 0.5))
axes[1].set_title('Gills Color')
axes[0].set_ylabel('Species count')
s1.plot(kind="bar",stacked=True, ax=axes[0], color=['#ffd1dc','#d1fff4'])
axes[0].set_title('Gills Size')
axes[0].legend(['Broad','Narrow'],loc='center')
plt.show()
fig.savefig('gills_color_size.jpg')
In [62]:
odors = pd.crosstab(index=df['classif'], columns=df["odor"])
print(odors)
In [72]:
#almond=a,creosote=c,anise=l,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
od = odors.plot(kind="bar",figsize=(4,4),stacked=True, cmap=plt.cm.RdYlGn)
od.set_title('Odor')
od.set_ylabel('Species count')
od.set_xlabel('Classification')
od.legend(['Almond','Creosote','Foul','Anise',
'Musty','None','Pungent','Spicy','Fishy'], loc='center left', bbox_to_anchor=(1, 0.5))
od.get_figure().savefig('odor.jpg')  # save the odor chart itself; `fig` still points to the gills figure
Comments:
Odor seems to be very important in differentiating between poisonous and non-poisonous mushrooms in our dataset. If a mushroom smells like almond or anise, it is indeed edible. If there is no smell at all, the mushroom is most likely edible, but you cannot be sure: poisonous mushrooms make up about 3.5% of all mushrooms without a smell. Additionally, only poisonous mushrooms have a fishy, spicy, pungent, foul, creosote or musty odor. Although sometimes difficult to identify, smell is indeed an important feature.
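The share of poisonous mushrooms per odor can also be computed directly instead of being read off the chart. The helper poisonous_share and the toy rows are hypothetical; this is a sketch of the calculation, not the notebook's own code.

```python
import pandas as pd

def poisonous_share(frame, feature, target='classif', poisonous='p'):
    """Return, for each value of `feature`, the fraction of rows labelled poisonous."""
    is_poisonous = frame[target] == poisonous
    # The mean of a boolean Series is the fraction of True values in each group
    return is_poisonous.groupby(frame[feature]).mean()

# Made-up rows; the real figures come from the full dataframe
toy = pd.DataFrame({
    'classif': ['e', 'e', 'e', 'p', 'p', 'e', 'p'],
    'odor':    ['a', 'n', 'n', 'n', 'f', 'l', 'f'],
})
result = poisonous_share(toy, 'odor')
print(result)
```

A share of 0 or 1 marks an odor that occurs in one class only, which is the pattern described above for almond, anise and the unpleasant odors.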
In [74]:
veils = pd.crosstab(index=df['classif'], columns=df["veil_color"])
print(veils)
spores = pd.crosstab(index=df['classif'], columns=df["spore_print_color"])
print(spores)
In [77]:
fig, axes = plt.subplots(nrows=1, ncols=2)
veils.plot(kind="bar",stacked=True, ax=axes[0], color=['#49311c','#ffa500','#ffffff','#ffff00']);
axes[0].legend(['Brown','Orange','White','Yellow'],loc='center left')
spores.plot(kind="bar",stacked=True, ax=axes[1], color=['#f0dc82','#D2691E','#000000','#49311c',
'#ffa500','#007f00','#800080','#ffffff','#ffff00']);
axes[1].legend(['Buff','Chocolate','Black','Brown','Orange',
'Green','Purple','White','Yellow'],loc='center left', bbox_to_anchor=(1, 0.5))
axes[0].set_ylabel('Species count')
axes[0].set_title('Veils Color')
axes[1].set_title('Spores Color')
plt.show()
fig.savefig('veils_spores_colors.jpg')
Comments:
In [80]:
stalk_above = pd.crosstab(index=df['classif'], columns=df["stalk_color_above_ring"])
print(stalk_above)
stalk_below = pd.crosstab(index=df['classif'], columns=df["stalk_color_below_ring"])
print(stalk_below)
Comments:
Stalk colors above and below the ring vary between edible and poisonous mushrooms, and in this dataset some colors separate the two classes completely. If the stalk color (both above and below the ring) is buff, yellow or cinnamon, the mushroom has to be poisonous. All mushrooms with red, gray or orange stalks are edible.
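Rules of this kind can be looked up programmatically by finding the feature values that occur in exactly one class. The function name single_class_values and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def single_class_values(frame, feature, target='classif'):
    """Map each feature value that occurs in exactly one class to that class."""
    table = pd.crosstab(index=frame[feature], columns=frame[target])
    # Keep the rows where only one class has a non-zero count
    pure = table[(table > 0).sum(axis=1) == 1]
    # idxmax picks the single class with a non-zero count in each kept row
    return pure.idxmax(axis=1).to_dict()

# Made-up rows: 'w' appears in both classes, the other colours in one only
toy = pd.DataFrame({
    'classif': ['p', 'p', 'e', 'e', 'e', 'p'],
    'stalk_color_above_ring': ['b', 'y', 'g', 'w', 'o', 'w'],
})
rules = single_class_values(toy, 'stalk_color_above_ring')
print(rules)
```

Such rules only describe the records at hand; they say nothing about colors or species that the dataset does not cover.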
In [82]:
fig, axes = plt.subplots(nrows=1, ncols=2)
stalk_above.plot(kind="bar",stacked=True, ax=axes[0], color=['#f0dc82','#D2691E', '#990000','#696969',
'#49311c','#ffa500','#ff69b4','#ffffff','#ffff00']);
axes[0].legend_.remove()
stalk_below.plot(kind="bar",stacked=True, ax=axes[1], color=['#f0dc82','#D2691E', '#990000','#696969',
'#49311c','#ffa500','#ff69b4','#ffffff','#ffff00']);
axes[1].legend(['Buff','Cinnamon','Red','Gray','Brown','Orange','Pink','White','Yellow'],loc='center left', bbox_to_anchor=(1, 0.5))
axes[0].set_ylabel('Species count')
axes[0].set_title('Stalk color above ring')
axes[1].set_title('Stalk color below ring')
plt.show()
#fig.savefig('stalk_colors.jpg')
In [17]:
hab = pd.crosstab(index=df['classif'], columns=df["habitat"])
print(hab)
pop = pd.crosstab(index=df['classif'], columns=df["population"])
print(pop)
In [19]:
a = hab.plot(kind="bar",stacked=True,figsize=(3,3), cmap=plt.cm.RdYlGn);
a.legend(['Woods','Grasses','Leaves','Meadows',
'Paths','Urban','Waste'],loc='center left',bbox_to_anchor=(1, 0.5))
a.set_ylabel('Species count')
a.set_xlabel('Classification')
a.set_title('Type of habitat')
a.get_figure().savefig('habitat.jpg')  # save this chart before showing it; `fig` refers to an earlier figure
plt.show()
p = pop.plot(kind="bar",stacked=True, figsize=(3,3), cmap=plt.cm.RdYlGn);
p.legend(['Abundant','Clustered','Numerous','Scattered',
'Several','Solitary'],loc='center left', bbox_to_anchor=(1, 0.5))
p.set_ylabel('Species count')
p.set_xlabel('Classification')
p.set_title('Type of population')
p.get_figure().savefig('population.jpg')  # save this chart before showing it; `fig` refers to an earlier figure
plt.show()
Comments:
In [20]:
ring = pd.crosstab(index=df['classif'], columns=df["ring_type"])
print(ring)
r = ring.plot(kind='bar', cmap=plt.cm.RdYlGn)
r.set_xlabel('classification')
r.set_ylabel('count')
r.set_title("By Ring Type")
r.legend(['Evanescent','Flaring','Large','None','Pendant'])
plt.show()
Comments:
In [28]:
#gill-spacing: close=c,crowded=w,distant=d
gsp = pd.crosstab(index=df['classif'], columns=df["gill_space"])
print(gsp)
r = gsp.plot(kind='bar', cmap=plt.cm.RdYlGn)
r.set_xlabel('classification')
r.set_ylabel('count')
r.set_title("By Gill Spacing")
r.legend(['Close','Crowded'])
plt.show()
Comments:
Below, the "1-letter" values are replaced with "1-word" values to make them easier to use in the Model Operation application later.
The choice of features to recode is not arbitrary: it is based on the features identified as most important in the 'feature_selection' notebook, together with my own choice of the most 'practical' features.
In [9]:
df['population'].replace(['a','c','n','s','v','y'],['Abundant','Clustered','Numerous',
'Scattered','Several','Solitary'],inplace=True)
df['habitat'].replace(['d','g','l','m','p','u','w'],['Woods','Grasses','Leaves',
'Meadows','Paths','Urban','Waste'],inplace=True)
df['cap_colour'].replace(['b','c','e','g','n','p','r','u','w','y'],['Buff','Cinnamon','Red','Gray',
'Brown','Pink','Green','Purple',
'White','Yellow'],inplace=True)
df['spore_print_color'].replace(['b','h','k','n','o','r','u','w','y'],['Buff','Chocolate','Black','Brown',
'Orange','Green','Purple','White','Yellow'],inplace=True)
df['odor'].replace(['a','c','f','l','m','n','p','s','y'],['Almond','Creosote','Foul','Anise','Musty',
'None','Pungent','Spicy','Fishy'],inplace=True)
df['gill_color'].replace(['b','e','g','h','k','n','o','p','r','u','w','y'],['Buff','Red','Gray','Chocolate','Black',
'Brown','Orange','Pink','Green','Purple',
'White','Yellow'],inplace=True)
df['stalk_surf_above_ring'].replace(['f','k','s','y'],['Fibrous','Silky','Smooth','Scaly'],inplace=True)
df['gill_size'].replace(['b','n'],['Broad','Narrow'],inplace=True)
df['bruises'].replace(['f','t'],['No','Bruises'],inplace=True)
df['stalk_color_above_ring'].replace(['b','c','e','g','n','o','p','w','y'],['Buff','Cinnamon','Red','Gray','Brown',
'Orange','Pink','White','Yellow'],inplace=True)
df['stalk_color_below_ring'].replace(['b','c','e','g','n','o','p','w','y'],['Buff','Cinnamon','Red','Gray','Brown',
'Orange','Pink','White','Yellow'],inplace=True)
df['gill_space'].replace(['c','w'],['Close', 'Crowded'],inplace=True)
df['ring_type'].replace(['e','f','l','n','p'],['Evanescent','Flaring','Large','None','Pendant'],inplace=True)
df['classif'].replace(['e','p'],['Etable', 'Poisonous'],inplace=True)
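A possible alternative to calling replace with inplace=True on a selected column is to keep one code-to-label dictionary per column and assign the result back. As far as I know, recent pandas 2.x releases warn about the chained in-place form, and under Copy-on-Write it no longer updates the dataframe, so the assignment form may be the safer choice. The toy frame and the name recode are hypothetical.

```python
import pandas as pd

# Toy frame with two of the recoded columns
toy = pd.DataFrame({
    'gill_size': ['b', 'n', 'b'],
    'bruises':   ['t', 'f', 'f'],
})

# One dictionary per column, mapping each code to its label
recode = {
    'gill_size': {'b': 'Broad', 'n': 'Narrow'},
    'bruises':   {'f': 'No', 't': 'Bruises'},
}

for column, mapping in recode.items():
    # Assigning the result back avoids modifying a temporary column object
    toy[column] = toy[column].replace(mapping)

print(toy)
```

Keeping code and label side by side in a dictionary also makes it harder to misalign the two lists.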
In [10]:
df.head()
Out[10]:
There are 2480 missing values in the feature stalk_root. I will drop the feature from the start to avoid dropping the samples associated with those missing values.
Alternatively, it is possible to keep the feature and drop all observations where the value is missing: df = df.drop(df[df['stalk_root']=='?'].index)
In [ ]:
df['stalk_root'].value_counts()
The crosstab does not reveal a stalk_root type that is prevalent in poisonous mushrooms. If there were one, I would assign that stalk root type to all the missing values, to avoid the risk of classifying a poisonous mushroom as edible based on that feature. Another option would be to assign the mode, 'b', but this most common value occurs often in both 'e' and 'p' classified mushrooms.
Final decision:
Drop this feature. After some feature analysis, I found that stalk root qualities were not decisive in distinguishing between edible and poisonous mushrooms.
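The two options discussed above, dropping the feature or dropping the affected rows, can be compared on a small example. The toy frame and the variable names are illustrative only.

```python
import pandas as pd

# Toy frame in which '?' marks a missing stalk_root, as in the raw data
toy = pd.DataFrame({
    'classif':    ['e', 'p', 'e', 'p'],
    'stalk_root': ['b', '?', 'e', '?'],
    'odor':       ['a', 'f', 'n', 'f'],
})

# Count the rows where the value is missing
n_missing = (toy['stalk_root'] == '?').sum()

# Option 1: keep all rows and drop the feature
without_feature = toy.drop(columns='stalk_root')

# Option 2: keep the feature and drop the rows where it is missing
without_rows = toy[toy['stalk_root'] != '?']

print(n_missing, without_feature.shape, without_rows.shape)
```

On the full dataset, the second option would remove 2480 of the 8124 rows, roughly 30% of the data, which is the main argument for dropping the column instead.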
In [25]:
sr = pd.crosstab(index=df['classif'], columns=df["stalk_root"])
print(sr)
In [26]:
df.drop('stalk_root', axis=1, inplace=True)
Save data to csv
In [32]:
df.to_csv('/Users/dariaulybina/Desktop/georgetown/ml_practice/data/data.csv')