Investigating a Baseball Dataset

Problem Definition

Where do the best baseball players come from?

Will a baseball player's birth location or college location relate to salary or awards?

Approach to answer question

Attributes describing the location each player comes from are needed; these form the independent variables.

These attributes give a variety of values describing where the player came from, but they boil down to two independent variables: college location and birth location. Each can be expressed at different scales (country, state or city), so the right granularity needs to be chosen.

Height and weight can also be investigated as additional independent variables.

Salaries, AwardsPlayers, AllStarFull and/or HallofFame can be used to give an indication of the quality of the player. Any of these can be used as a dependent variable, or a dependent variable could be created from a combination of them.
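
One possible way to combine several quality measures into a single dependent variable is to standardise each measure and average the results. The following is a minimal sketch under that assumption, not the method used in the preprocessing module. The data are made up, and the column names `allstar_count`, `award_count` and `quality_score` are hypothetical; only `mean_salary` appears elsewhere in this report.

```python
import numpy as np
import pandas as pd

# Toy data; the values and the column names other than mean_salary are made up.
players = pd.DataFrame({
    "mean_salary": [500000.0, 1200000.0, 8000000.0, np.nan],
    "allstar_count": [0, 1, 6, 2],
    "award_count": [0, 0, 4, 1],
})


def zscore(series):
    """Standardise a column to mean 0 and standard deviation 1, ignoring NaN."""
    return (series - series.mean()) / series.std()


# Log-transform salary first because it is heavily right-skewed.
components = pd.DataFrame({
    "salary": zscore(np.log10(players["mean_salary"])),
    "allstar": zscore(players["allstar_count"]),
    "awards": zscore(players["award_count"]),
})

# The row-wise mean skips NaN, so a player without salary data still gets a score.
players["quality_score"] = components.mean(axis=1)
print(players)
```

Equal weighting of the three components is an arbitrary choice here; a different weighting would change the ranking of players.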

The dataset has been processed using numpy and pandas to clean the data, create new variables and merge tables together. See further into the report for a section on data processing. See the preprocessing module for the code used.

An examination of the data is described first, followed by data analysis and conclusions. The data analysis is not exhaustive, so the observations do not lead to robust conclusions in this report. Any inference is tentative and would require further work to become robust.

In [10]:
from __future__ import print_function
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import os
# Use the top level of the repository
# Helper functions made to create polished plots
from ballbase import figures

# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import Baseball_data_investigation
df = Baseball_data_investigation.main()

Processed Hall of Fame data

Processed All Star data

Processed Player Awards data

Processed Salary data

Processed College Locations

Processed master file

Master_Merge is ready

Data Audit complete

Data Examination

Overall the dataset is well organised and good to use.

Some variables show issues in how their values are populated.

In [11]:
figsize = (15, 15)
# Needed to set up figure style

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=figsize)

fig.suptitle("Distributions of Weight", fontsize=16)
fig.subplots_adjust(hspace=0.18, top=0.95)

figures.univariate(df['weight'].dropna(), 'Weight', rug=False, bin_n=None, ax=ax1)
figures.univariate(df['weight'].dropna(), 'Weight', rug=False, ax=ax2)
figures.univariate(df['weight'].dropna(), 'Weight', rug=False, 
                   x_truncation_upper=200, x_truncation_lower=150, 
                   formatting_right=False, ax=ax3)

sns.despine(offset=2, trim=True, left=True, bottom=True)

Weight highlights this well. The three figures above show the same data. When the number of bins is set to the number of unique values, it can be seen that weights are most commonly recorded in 5-pound steps. The lowermost figure limits the x-axis to the range 150 to 200 to highlight this further. Some values are recorded at a finer granularity than this. One solution, for weight and for similar variables, is to bin the values.
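
The binning suggested above could look like the following sketch. It uses made-up weights rather than the report's `df`, and the variable names `weight_rounded`, `edges` and `weight_binned` are illustrative only. Two options are shown: snapping to the nearest 5 pounds, and assigning explicit 5-pound intervals with `pd.cut`.

```python
import pandas as pd

# Toy weights in pounds; made up for illustration.
weight = pd.Series([150, 152, 155, 158, 160, 163, 165, 165, 170, 171], name="weight")

# Option 1: snap each value to the nearest multiple of 5 pounds.
weight_rounded = (weight / 5).round() * 5

# Option 2: explicit 5-pound intervals, closed on the left: [150, 155), [155, 160), ...
edges = list(range(150, 180, 5))  # 150, 155, ..., 175
weight_binned = pd.cut(weight, bins=edges, right=False)

print(weight_rounded.tolist())
print(weight_binned.value_counts(sort=False))
```

Rounding keeps the variable numeric, while `pd.cut` produces a categorical variable, so the choice depends on whether later steps need numbers or groups.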

Another common theme is highly skewed datasets.

In [12]:
fig_a = figures.dist_transform_plot(df['mean_salary'].dropna(), 'Mean Salary', bin_n=None)

The salary example shows a lognormal distribution with a long tail towards the maximum value. The number of appearances in All Star matches, player awards and all forms of salary information are distributed in this way.

The majority of baseball players were born in the USA. This can be seen in a binary plot of the proportion of players born in the USA across the whole data set.
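
The effect of a log transform on this kind of data can be illustrated with synthetic values. The sketch below does not use the report's salary data: it draws lognormal samples with arbitrary parameters and compares the skewness before and after taking `log10`. It is meant only to show why a transformed view of a skewed variable is easier to read.

```python
import numpy as np
import pandas as pd

# Synthetic lognormal "salaries"; the parameters are arbitrary, not fitted to the data.
rng = np.random.default_rng(0)
salary = pd.Series(rng.lognormal(mean=13.5, sigma=1.0, size=5000), name="mean_salary")

log_salary = np.log10(salary)

# Skewness near 0 indicates a roughly symmetric distribution.
print("skew of raw salary:   ", round(salary.skew(), 2))
print("skew of log10(salary):", round(log_salary.skew(), 2))
```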

In [9]:
fig_c2 = figures.boolean_bar(df['birthCountry'].dropna()=='USA', 'USA as birth country', annotate=False)

In [14]:
fig_c3 = figures.boolean_bar(df['college_country'].dropna()=='USA', 'College in USA', annotate=False)

count     6575
unique       2
top       True
freq      6570
Name: college_country, dtype: object

This, combined with nearly all college locations being in the USA (6570 of the 6575 known values in the output above), steers this investigation to focus primarily on the USA.
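
The proportion behind a boolean bar plot like the ones above can be computed directly. The sketch below uses a made-up stand-in for `df['college_country']`; applied to the counts in the output above, the same calculation gives 6570 / 6575, or roughly 99.9%.

```python
import pandas as pd

# Toy stand-in for df['college_country']; values are made up.
college_country = pd.Series(["USA", "USA", None, "CAN", "USA"])

# Drop missing values first so they do not count as "not USA".
is_usa = college_country.dropna() == "USA"

# The mean of a boolean Series is the fraction of True values.
print(is_usa.mean())  # 3 of the 4 non-missing entries -> 0.75
```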

It is beyond the scope of this investigation to audit all data in this database. Outliers are assumed to be real, NaN values are not interpolated, and queries ignore missing values.

The reason for this is to look for trends among players that have the corresponding data, rather than to interpolate salary or other information for this analysis.
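
In pandas, ignoring missing values rather than interpolating them might look like the sketch below. The data are made up; `weight` and `mean_salary` are the column names used in this report, and the variable names `toy` and `pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are made up.
toy = pd.DataFrame({
    "weight": [180.0, 195.0, np.nan, 210.0],
    "mean_salary": [1.0e6, np.nan, 2.5e6, 4.0e6],
})

# pandas reductions skip NaN by default (skipna=True).
print(toy["mean_salary"].mean())  # mean of the three known salaries

# For a two-variable query, keep only rows where both values are present.
pairs = toy[["weight", "mean_salary"]].dropna()
print(len(pairs))  # 2 complete rows
```

A consequence of this approach is that each query may be based on a different subset of players, depending on which columns it uses.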

Two key independent variables for this assessment are the player's birth state and college state. Both of these are categorical.

California is highlighted in both bar graphs below as the most common value. The counts vary widely across the other states. The two count bar graphs give no information about how closely a player's birth state and college state are related.
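
One way to look at how birth state and college state relate, which the count bar graphs cannot show, is a cross-tabulation together with the fraction of players whose two states match. This is a sketch on made-up data, not a result from this investigation; the column names follow the ones used in this report.

```python
import pandas as pd

# Toy data; values are made up.
toy = pd.DataFrame({
    "birthState": ["CA", "CA", "TX", "FL", "NY", None],
    "college_state": ["CA", "AZ", "TX", "FL", None, "CA"],
})

# Only players with both values known can be compared.
both = toy.dropna(subset=["birthState", "college_state"])

# Fraction of players who went to college in the state they were born in.
same_state = (both["birthState"] == both["college_state"]).mean()
print(same_state)  # 3 of 4 -> 0.75

# Contingency table: rows are birth states, columns are college states.
table = pd.crosstab(both["birthState"], both["college_state"])
print(table)
```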

In [15]:
# Birth states of players born in the USA, sorted by state name
usa_birth_states = df.loc[df['birthCountry'] == 'USA', 'birthState'].sort_values()
fig_c4 = figures.count_bar(usa_birth_states,
                           'Birth State of Players')

In [16]:
# College states of players, sorted by state name
fig_c5 = figures.count_bar(df['college_state'].sort_values(),
                           "Player's College State")

Birth city has 2208 unique values in the investigation data set and college city has 721, which is too granular to be useful at this stage of the investigation. State is a more usable aggregated category for analysis.
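
Counts of unique values like these can be obtained with `nunique()`. The sketch below uses made-up data; `birthCountry` and `birthState` appear in this report, while `birthCity` is assumed to be the name of the city column.

```python
import pandas as pd

# Toy data; values are made up. birthCity is an assumed column name.
toy = pd.DataFrame({
    "birthCountry": ["USA", "USA", "USA", "USA", "D.R."],
    "birthState": ["CA", "CA", "TX", "TX", None],
    "birthCity": ["Los Angeles", "San Diego", "Houston", "Dallas", "Santo Domingo"],
})

# nunique() counts distinct non-missing values in each column.
levels = toy[["birthCountry", "birthState", "birthCity"]].nunique()
print(levels)

# Average number of players per city: small values mean sparse groups.
print(len(toy) / toy["birthCity"].nunique())
```

The fewer players per category, the less can be said about any single category, which is why the coarser state level is preferred here.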

Data Analysis

The following section is a brief, preliminary data analysis. It is neither a thorough exploratory data analysis nor a more sophisticated analysis that tests hypotheses about the data.

The question is related to the effect of geographic location on the quality of baseball players.

To begin, independent variables other than location can be compared with the dependent variables as a baseline check. Height and weight would not be expected to correlate strongly with the dependent variables.

Height and weight, two independent variables, show a strong correlation with each other.

In [17]:
sns.jointplot(x='weight', y='height', data=df[['weight', 'height']].dropna(),
              s=40, alpha=0.1, color="grey", edgecolor="w", linewidth=1,
              height=8)  # `size` was renamed to `height` in seaborn 0.9
sns.despine(offset=2, trim=True, left=True, bottom=True)
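
The strength of a relationship like the one in this plot can be summarised with Pearson's correlation coefficient. The sketch below uses synthetic heights and weights with an invented linear relationship, so the printed value says nothing about the real data; it only shows the calculation.

```python
import numpy as np
import pandas as pd

# Synthetic heights (inches) and weights (pounds); the relationship is invented.
rng = np.random.default_rng(1)
height = rng.normal(73, 2.5, size=1000)
weight = 5.0 * height - 180 + rng.normal(0, 12, size=1000)
toy = pd.DataFrame({"height": height, "weight": weight})

# Pearson's r measures linear association and ranges from -1 to 1.
r = toy["height"].corr(toy["weight"])
print(round(r, 2))
```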

When one of these independent variables is compared with a dependent variable such as mean career salary, there is little correlation. The spread of points along the weight axis simply reflects the approximately normal distribution of the weight variable.

In [18]:
sns.jointplot(x='weight', y='mean_salary', data=df[['weight', 'mean_salary']].dropna(),
              s=40, alpha=0.1, color="grey", edgecolor="w", linewidth=1,
              height=8)  # `size` was renamed to `height` in seaborn 0.9
sns.despine(offset=2, trim=True, left=True, bottom=True)
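
A lack of correlation with a skewed variable such as salary can be checked numerically as well as visually. Because a few very large salaries can dominate Pearson's r, a rank-based coefficient (Spearman's rho, computed here as Pearson's r on the ranks) is a reasonable companion. The sketch uses synthetic data in which salary is independent of weight by construction, so both coefficients should be close to zero; it does not report values for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data in which salary is independent of weight by construction.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "weight": rng.normal(190, 20, size=2000),
    "mean_salary": rng.lognormal(mean=13.5, sigma=1.0, size=2000),
})

# Pearson's r on the raw values can be dominated by a few very large salaries.
pearson = toy["weight"].corr(toy["mean_salary"])

# Spearman's rho is Pearson's r computed on the ranks of the values.
spearman = toy["weight"].rank().corr(toy["mean_salary"].rank())

print(round(pearson, 3), round(spearman, 3))
```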