Summary

This dataset about information of baseball players. It includes player's name, handedness (right, left handed, or both), height, weight, batting average, and the number of home runs.

Design

Exploratory Data Analysis using Python

I downloaded baseball_data.csv the dataset from data set options page given by Udacity. The dataset had player's physical information(name, height, weight) and performance information(avg, HR).



In [30]:

    
from pandas import DataFrame, read_csv, cut
df = read_csv('data/baseball_data.csv')
df.describe()









    Out[30]:






  
    
      
      height
      weight
      avg
      HR
    
  
  
    
      count
      1157.000000
      1157.000000
      1157.000000
      1157.000000
    
    
      mean
      72.756266
      184.513397
      0.186793
      45.359551
    
    
      std
      2.142272
      15.445995
      0.106175
      74.065110
    
    
      min
      65.000000
      140.000000
      0.000000
      0.000000
    
    
      25%
      71.000000
      175.000000
      0.138000
      1.000000
    
    
      50%
      73.000000
      185.000000
      0.238000
      15.000000
    
    
      75%
      74.000000
      195.000000
      0.258000
      55.000000
    
    
      max
      80.000000
      245.000000
      0.328000
      563.000000

There are 1157 players' records. Interestingly, some players have zero batting average or home runs, so I removed those rows.



In [31]:

    
df = df[(df.avg > 0.0) & (df.HR > 0)]
import numpy as np
df['avg_category'] = cut(df.avg, bins = np.linspace(0.1, 0.35, 6), 
                         right=False)
df.describe()

After cleaning meaningless rows, I had 871 records. Now I wanted to figure out player's performance with different handedness and avg_category. I showed histogram, HR vs. avg_category with 3 types of handedness.



In [32]:

    
a = df.drop(['height', 'weight', 'avg'], axis=1).groupby(by=['avg_category', 'handedness']).sum()
a









    Out[32]:






  
    
      
      
      HR
    
    
      avg_category
      handedness
      
    
  
  
    
      [0.1, 0.15)
      B
      9
    
    
      L
      11
    
    
      R
      41
    
    
      [0.15, 0.2)
      B
      2
    
    
      L
      32
    
    
      R
      279
    
    
      [0.2, 0.25)
      B
      904
    
    
      L
      4449
    
    
      R
      9020
    
    
      [0.25, 0.3)
      B
      2268
    
    
      L
      12591
    
    
      R
      21861
    
    
      [0.3, 0.35)
      B
      160
    
    
      L
      660
    
    
      R
      194



In [33]:

    
a.to_csv('data/cleaned_baseball.csv')



In [34]:

    
df = read_csv('data/cleaned_baseball.csv')
subtotal = df.groupby('avg_category', as_index=False).sum()
subtotal









    Out[34]:






  
    
      
      avg_category
      HR
    
  
  
    
      0
      [0.1, 0.15)
      61
    
    
      1
      [0.15, 0.2)
      313
    
    
      2
      [0.2, 0.25)
      14373
    
    
      3
      [0.25, 0.3)
      36720
    
    
      4
      [0.3, 0.35)
      1014



In [35]:

    
final_df = df.merge(subtotal, on='avg_category')



In [36]:

    
final_df['HR_ratio'] = final_df.HR_x / final_df.HR_y
final_df









    Out[36]:






  
    
      
      avg_category
      handedness
      HR_x
      HR_y
      HR_ratio
    
  
  
    
      0
      [0.1, 0.15)
      B
      9
      61
      0.147541
    
    
      1
      [0.1, 0.15)
      L
      11
      61
      0.180328
    
    
      2
      [0.1, 0.15)
      R
      41
      61
      0.672131
    
    
      3
      [0.15, 0.2)
      B
      2
      313
      0.006390
    
    
      4
      [0.15, 0.2)
      L
      32
      313
      0.102236
    
    
      5
      [0.15, 0.2)
      R
      279
      313
      0.891374
    
    
      6
      [0.2, 0.25)
      B
      904
      14373
      0.062896
    
    
      7
      [0.2, 0.25)
      L
      4449
      14373
      0.309539
    
    
      8
      [0.2, 0.25)
      R
      9020
      14373
      0.627566
    
    
      9
      [0.25, 0.3)
      B
      2268
      36720
      0.061765
    
    
      10
      [0.25, 0.3)
      L
      12591
      36720
      0.342892
    
    
      11
      [0.25, 0.3)
      R
      21861
      36720
      0.595343
    
    
      12
      [0.3, 0.35)
      B
      160
      1014
      0.157791
    
    
      13
      [0.3, 0.35)
      L
      660
      1014
      0.650888
    
    
      14
      [0.3, 0.35)
      R
      194
      1014
      0.191321



In [37]:

    
final_df.to_csv('data/cleaned_baseball.csv')



In [39]:

    
df = read_csv('data/cleaned_baseball.csv')
print df
submean = df.drop(['Unnamed: 0']).groupby('avg_category', as_index=False).mean()









    



    Unnamed: 0 avg_category handedness   HR_x   HR_y  HR_ratio
0            0  [0.1, 0.15)          B      9     61  0.147541
1            1  [0.1, 0.15)          L     11     61  0.180328
2            2  [0.1, 0.15)          R     41     61  0.672131
3            3  [0.15, 0.2)          B      2    313  0.006390
4            4  [0.15, 0.2)          L     32    313  0.102236
5            5  [0.15, 0.2)          R    279    313  0.891374
6            6  [0.2, 0.25)          B    904  14373  0.062896
7            7  [0.2, 0.25)          L   4449  14373  0.309539
8            8  [0.2, 0.25)          R   9020  14373  0.627566
9            9  [0.25, 0.3)          B   2268  36720  0.061765
10          10  [0.25, 0.3)          L  12591  36720  0.342892
11          11  [0.25, 0.3)          R  21861  36720  0.595343
12          12  [0.3, 0.35)          B    160   1014  0.157791
13          13  [0.3, 0.35)          L    660   1014  0.650888
14          14  [0.3, 0.35)          R    194   1014  0.191321






    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-ea41736448eb> in <module>()
      1 df = read_csv('data/cleaned_baseball.csv')
      2 print df
----> 3 submean = df.drop(['Unnamed: 0']).groupby('avg_category', as_index=False).mean()

C:\WinPython-64bit-2.7.10.2\python-2.7.10.amd64\lib\site-packages\pandas\core\generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1595                 new_axis = axis.drop(labels, level=level, errors=errors)
   1596             else:
-> 1597                 new_axis = axis.drop(labels, errors=errors)
   1598             dropped = self.reindex(**{axis_name: new_axis})
   1599             try:

C:\WinPython-64bit-2.7.10.2\python-2.7.10.amd64\lib\site-packages\pandas\core\index.pyc in drop(self, labels, errors)
   2568         if mask.any():
   2569             if errors != 'ignore':
-> 2570                 raise ValueError('labels %s not contained in axis' % labels[mask])
   2571             indexer = indexer[~mask]
   2572         return self.delete(indexer)

ValueError: labels ['Unnamed: 0'] not contained in axis



In [20]:

    
df.merge(submean, on='avg_category')









    Out[20]:






  
    
      
      Unnamed: 0_x
      avg_category
      handedness
      HR_x_x
      HR_y_x
      HR_ratio_x
      Unnamed: 0_y
      HR_x_y
      HR_y_y
      HR_ratio_y
    
  
  
    
      0
      0
      [0.1, 0.15)
      B
      9
      61
      0.147541
      1
      20.333333
      61
      0.333333
    
    
      1
      1
      [0.1, 0.15)
      L
      11
      61
      0.180328
      1
      20.333333
      61
      0.333333
    
    
      2
      2
      [0.1, 0.15)
      R
      41
      61
      0.672131
      1
      20.333333
      61
      0.333333
    
    
      3
      3
      [0.15, 0.2)
      B
      2
      313
      0.006390
      4
      104.333333
      313
      0.333333
    
    
      4
      4
      [0.15, 0.2)
      L
      32
      313
      0.102236
      4
      104.333333
      313
      0.333333
    
    
      5
      5
      [0.15, 0.2)
      R
      279
      313
      0.891374
      4
      104.333333
      313
      0.333333
    
    
      6
      6
      [0.2, 0.25)
      B
      904
      14373
      0.062896
      7
      4791.000000
      14373
      0.333333
    
    
      7
      7
      [0.2, 0.25)
      L
      4449
      14373
      0.309539
      7
      4791.000000
      14373
      0.333333
    
    
      8
      8
      [0.2, 0.25)
      R
      9020
      14373
      0.627566
      7
      4791.000000
      14373
      0.333333
    
    
      9
      9
      [0.25, 0.3)
      B
      2268
      36720
      0.061765
      10
      12240.000000
      36720
      0.333333
    
    
      10
      10
      [0.25, 0.3)
      L
      12591
      36720
      0.342892
      10
      12240.000000
      36720
      0.333333
    
    
      11
      11
      [0.25, 0.3)
      R
      21861
      36720
      0.595343
      10
      12240.000000
      36720
      0.333333
    
    
      12
      12
      [0.3, 0.35)
      B
      160
      1014
      0.157791
      13
      338.000000
      1014
      0.333333
    
    
      13
      13
      [0.3, 0.35)
      L
      660
      1014
      0.650888
      13
      338.000000
      1014
      0.333333
    
    
      14
      14
      [0.3, 0.35)
      R
      194
      1014
      0.191321
      13
      338.000000
      1014
      0.333333



In [ ]:

	height	weight	avg	HR
count	871.000000	871.000000	871.000000	871.000000
mean	72.338691	182.846154	0.244201	60.253731
std	1.935478	14.971904	0.030703	79.940631
min	65.000000	140.000000	0.104000	1.000000
25%	71.000000	170.000000	0.232000	11.000000
50%	72.000000	180.000000	0.248000	27.000000
75%	74.000000	193.000000	0.263000	78.000000
max	80.000000	230.000000	0.328000	563.000000

	height	weight	avg	HR
count	1157.000000	1157.000000	1157.000000	1157.000000
mean	72.756266	184.513397	0.186793	45.359551
std	2.142272	15.445995	0.106175	74.065110
min	65.000000	140.000000	0.000000	0.000000
25%	71.000000	175.000000	0.138000	1.000000
50%	73.000000	185.000000	0.238000	15.000000
75%	74.000000	195.000000	0.258000	55.000000
max	80.000000	245.000000	0.328000	563.000000

		HR
avg_category	handedness
[0.1, 0.15)	B	9
	L	11
	R	41
[0.15, 0.2)	B	2
	L	32
	R	279
[0.2, 0.25)	B	904
	L	4449
	R	9020
[0.25, 0.3)	B	2268
	L	12591
	R	21861
[0.3, 0.35)	B	160
	L	660
	R	194

	avg_category	handedness	HR_x	HR_y	HR_ratio
0	[0.1, 0.15)	B	9	61	0.147541
1	[0.1, 0.15)	L	11	61	0.180328
2	[0.1, 0.15)	R	41	61	0.672131
3	[0.15, 0.2)	B	2	313	0.006390
4	[0.15, 0.2)	L	32	313	0.102236
5	[0.15, 0.2)	R	279	313	0.891374
6	[0.2, 0.25)	B	904	14373	0.062896
7	[0.2, 0.25)	L	4449	14373	0.309539
8	[0.2, 0.25)	R	9020	14373	0.627566
9	[0.25, 0.3)	B	2268	36720	0.061765
10	[0.25, 0.3)	L	12591	36720	0.342892
11	[0.25, 0.3)	R	21861	36720	0.595343
12	[0.3, 0.35)	B	160	1014	0.157791
13	[0.3, 0.35)	L	660	1014	0.650888
14	[0.3, 0.35)	R	194	1014	0.191321

	Unnamed: 0_x	avg_category	handedness	HR_x_x	HR_y_x	HR_ratio_x	Unnamed: 0_y	HR_x_y	HR_y_y	HR_ratio_y
0	0	[0.1, 0.15)	B	9	61	0.147541	1	20.333333	61	0.333333
1	1	[0.1, 0.15)	L	11	61	0.180328	1	20.333333	61	0.333333
2	2	[0.1, 0.15)	R	41	61	0.672131	1	20.333333	61	0.333333
3	3	[0.15, 0.2)	B	2	313	0.006390	4	104.333333	313	0.333333
4	4	[0.15, 0.2)	L	32	313	0.102236	4	104.333333	313	0.333333
5	5	[0.15, 0.2)	R	279	313	0.891374	4	104.333333	313	0.333333
6	6	[0.2, 0.25)	B	904	14373	0.062896	7	4791.000000	14373	0.333333
7	7	[0.2, 0.25)	L	4449	14373	0.309539	7	4791.000000	14373	0.333333
8	8	[0.2, 0.25)	R	9020	14373	0.627566	7	4791.000000	14373	0.333333
9	9	[0.25, 0.3)	B	2268	36720	0.061765	10	12240.000000	36720	0.333333
10	10	[0.25, 0.3)	L	12591	36720	0.342892	10	12240.000000	36720	0.333333
11	11	[0.25, 0.3)	R	21861	36720	0.595343	10	12240.000000	36720	0.333333
12	12	[0.3, 0.35)	B	160	1014	0.157791	13	338.000000	1014	0.333333
13	13	[0.3, 0.35)	L	660	1014	0.650888	13	338.000000	1014	0.333333
14	14	[0.3, 0.35)	R	194	1014	0.191321	13	338.000000	1014	0.333333