Summary

This dataset about information of baseball players. It includes player's name, handedness (right, left handed, or both), height, weight, batting average, and the number of home runs.

Design

Exploratory Data Analysis using Python

I downloaded baseball_data.csv the dataset from data set options page given by Udacity. The dataset had player's physical information(name, height, weight) and performance information(avg, HR).


In [30]:
from pandas import DataFrame, read_csv, cut
df = read_csv('data/baseball_data.csv')
df.describe()


Out[30]:
height weight avg HR
count 1157.000000 1157.000000 1157.000000 1157.000000
mean 72.756266 184.513397 0.186793 45.359551
std 2.142272 15.445995 0.106175 74.065110
min 65.000000 140.000000 0.000000 0.000000
25% 71.000000 175.000000 0.138000 1.000000
50% 73.000000 185.000000 0.238000 15.000000
75% 74.000000 195.000000 0.258000 55.000000
max 80.000000 245.000000 0.328000 563.000000

There are 1157 players' records. Interestingly, some players have zero batting average or home runs, so I removed those rows.


In [31]:
df = df[(df.avg > 0.0) & (df.HR > 0)]
import numpy as np
df['avg_category'] = cut(df.avg, bins = np.linspace(0.1, 0.35, 6), 
                         right=False)
df.describe()


Out[31]:
height weight avg HR
count 871.000000 871.000000 871.000000 871.000000
mean 72.338691 182.846154 0.244201 60.253731
std 1.935478 14.971904 0.030703 79.940631
min 65.000000 140.000000 0.104000 1.000000
25% 71.000000 170.000000 0.232000 11.000000
50% 72.000000 180.000000 0.248000 27.000000
75% 74.000000 193.000000 0.263000 78.000000
max 80.000000 230.000000 0.328000 563.000000

After cleaning meaningless rows, I had 871 records. Now I wanted to figure out player's performance with different handedness and avg_category. I showed histogram, HR vs. avg_category with 3 types of handedness.


In [32]:
a = df.drop(['height', 'weight', 'avg'], axis=1).groupby(by=['avg_category', 'handedness']).sum()
a


Out[32]:
HR
avg_category handedness
[0.1, 0.15) B 9
L 11
R 41
[0.15, 0.2) B 2
L 32
R 279
[0.2, 0.25) B 904
L 4449
R 9020
[0.25, 0.3) B 2268
L 12591
R 21861
[0.3, 0.35) B 160
L 660
R 194

In [33]:
a.to_csv('data/cleaned_baseball.csv')

In [34]:
df = read_csv('data/cleaned_baseball.csv')
subtotal = df.groupby('avg_category', as_index=False).sum()
subtotal


Out[34]:
avg_category HR
0 [0.1, 0.15) 61
1 [0.15, 0.2) 313
2 [0.2, 0.25) 14373
3 [0.25, 0.3) 36720
4 [0.3, 0.35) 1014

In [35]:
final_df = df.merge(subtotal, on='avg_category')

In [36]:
final_df['HR_ratio'] = final_df.HR_x / final_df.HR_y
final_df


Out[36]:
avg_category handedness HR_x HR_y HR_ratio
0 [0.1, 0.15) B 9 61 0.147541
1 [0.1, 0.15) L 11 61 0.180328
2 [0.1, 0.15) R 41 61 0.672131
3 [0.15, 0.2) B 2 313 0.006390
4 [0.15, 0.2) L 32 313 0.102236
5 [0.15, 0.2) R 279 313 0.891374
6 [0.2, 0.25) B 904 14373 0.062896
7 [0.2, 0.25) L 4449 14373 0.309539
8 [0.2, 0.25) R 9020 14373 0.627566
9 [0.25, 0.3) B 2268 36720 0.061765
10 [0.25, 0.3) L 12591 36720 0.342892
11 [0.25, 0.3) R 21861 36720 0.595343
12 [0.3, 0.35) B 160 1014 0.157791
13 [0.3, 0.35) L 660 1014 0.650888
14 [0.3, 0.35) R 194 1014 0.191321

In [37]:
final_df.to_csv('data/cleaned_baseball.csv')

In [39]:
df = read_csv('data/cleaned_baseball.csv')
print df
submean = df.drop(['Unnamed: 0']).groupby('avg_category', as_index=False).mean()


    Unnamed: 0 avg_category handedness   HR_x   HR_y  HR_ratio
0            0  [0.1, 0.15)          B      9     61  0.147541
1            1  [0.1, 0.15)          L     11     61  0.180328
2            2  [0.1, 0.15)          R     41     61  0.672131
3            3  [0.15, 0.2)          B      2    313  0.006390
4            4  [0.15, 0.2)          L     32    313  0.102236
5            5  [0.15, 0.2)          R    279    313  0.891374
6            6  [0.2, 0.25)          B    904  14373  0.062896
7            7  [0.2, 0.25)          L   4449  14373  0.309539
8            8  [0.2, 0.25)          R   9020  14373  0.627566
9            9  [0.25, 0.3)          B   2268  36720  0.061765
10          10  [0.25, 0.3)          L  12591  36720  0.342892
11          11  [0.25, 0.3)          R  21861  36720  0.595343
12          12  [0.3, 0.35)          B    160   1014  0.157791
13          13  [0.3, 0.35)          L    660   1014  0.650888
14          14  [0.3, 0.35)          R    194   1014  0.191321
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-ea41736448eb> in <module>()
      1 df = read_csv('data/cleaned_baseball.csv')
      2 print df
----> 3 submean = df.drop(['Unnamed: 0']).groupby('avg_category', as_index=False).mean()

C:\WinPython-64bit-2.7.10.2\python-2.7.10.amd64\lib\site-packages\pandas\core\generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1595                 new_axis = axis.drop(labels, level=level, errors=errors)
   1596             else:
-> 1597                 new_axis = axis.drop(labels, errors=errors)
   1598             dropped = self.reindex(**{axis_name: new_axis})
   1599             try:

C:\WinPython-64bit-2.7.10.2\python-2.7.10.amd64\lib\site-packages\pandas\core\index.pyc in drop(self, labels, errors)
   2568         if mask.any():
   2569             if errors != 'ignore':
-> 2570                 raise ValueError('labels %s not contained in axis' % labels[mask])
   2571             indexer = indexer[~mask]
   2572         return self.delete(indexer)

ValueError: labels ['Unnamed: 0'] not contained in axis

In [20]:
df.merge(submean, on='avg_category')


Out[20]:
Unnamed: 0_x avg_category handedness HR_x_x HR_y_x HR_ratio_x Unnamed: 0_y HR_x_y HR_y_y HR_ratio_y
0 0 [0.1, 0.15) B 9 61 0.147541 1 20.333333 61 0.333333
1 1 [0.1, 0.15) L 11 61 0.180328 1 20.333333 61 0.333333
2 2 [0.1, 0.15) R 41 61 0.672131 1 20.333333 61 0.333333
3 3 [0.15, 0.2) B 2 313 0.006390 4 104.333333 313 0.333333
4 4 [0.15, 0.2) L 32 313 0.102236 4 104.333333 313 0.333333
5 5 [0.15, 0.2) R 279 313 0.891374 4 104.333333 313 0.333333
6 6 [0.2, 0.25) B 904 14373 0.062896 7 4791.000000 14373 0.333333
7 7 [0.2, 0.25) L 4449 14373 0.309539 7 4791.000000 14373 0.333333
8 8 [0.2, 0.25) R 9020 14373 0.627566 7 4791.000000 14373 0.333333
9 9 [0.25, 0.3) B 2268 36720 0.061765 10 12240.000000 36720 0.333333
10 10 [0.25, 0.3) L 12591 36720 0.342892 10 12240.000000 36720 0.333333
11 11 [0.25, 0.3) R 21861 36720 0.595343 10 12240.000000 36720 0.333333
12 12 [0.3, 0.35) B 160 1014 0.157791 13 338.000000 1014 0.333333
13 13 [0.3, 0.35) L 660 1014 0.650888 13 338.000000 1014 0.333333
14 14 [0.3, 0.35) R 194 1014 0.191321 13 338.000000 1014 0.333333

In [ ]: