1. Average Vs Median
Preface: The baseball data is available as a 2D Numpy array with 3 columns (height, weight, age) and 1015 rows. The name of this Numpy array is np_baseball.
After restructuring the data, however, you notice that some height values are abnormally high.
Create np_height, a NumPy array equal to the first column of np_baseball.
In :
# Since the data is unavailable from DataCamp, let's create some of our own.
# Import numpy
import numpy as np

# np_baseball is unavailable, so let's generate some random distributions.
height = np.round(np.random.normal(5.50, 5.0, 1015), 2)
weight = np.round(np.random.normal(70.50, 5.0, 1015), 2)
age = np.round(np.random.normal(31, 2, 1015), 2)

# Stack the three columns (height, weight, age) into np_baseball
np_baseball = np.column_stack((height, weight, age))

# Create np_height from np_baseball.
# The height column is at index 0 and all rows are included, hence ":".
np_height = np.array(np_baseball[:, 0])

# Print out the mean of np_height
print(np.mean(np_height))

# Print out the median of np_height
print(np.median(np_height))
In [ ]:
"""
Inference:
+ A mean height of about 5.37 sounds right for data generated around 5.50.
+ From descriptive statistics, the median is in general much less affected by outliers than the mean. Here, however, there appear to be no outliers.
+ Hence a median of 5.54 makes sense.

As a side note: always check both the median and the mean. Why? Comparing the two gives insight into the overall distribution of the dataset.
"""
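To make the mean-versus-median point concrete, here is a minimal sketch using a handful of made-up height values (these are illustrative assumptions, not the baseball data). A single mistyped entry drags the mean far upward while the median stays put.

```python
import numpy as np

# Hypothetical heights in inches; illustrative values only.
clean = np.array([70.0, 72.0, 73.0, 74.0, 76.0])

# Same data with the last entry mistyped by a factor of 10.
corrupted = clean.copy()
corrupted[-1] = 760.0

print(np.mean(clean), np.median(clean))          # 73.0 73.0
print(np.mean(corrupted), np.median(corrupted))  # 209.8 73.0
```

A large gap between the two statistics, as in the second line of output, is a quick hint that the data may contain outliers or entry errors.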
2. Explore the Baseball data:
Because the mean and median are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D Numpy array np_baseball, with three columns.
Complete the code for the median height. Replace None with the correct code.
Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code.
Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. Replace None with the correct code.
In :
# np_baseball is available
# Import numpy
import numpy as np

# Print mean height (first column)
avg = np.mean(np_baseball[:, 0])
print("Average: " + str(avg))

# Print median height. Replace 'None'
med = np.median(np_baseball[:, 0])
print("\nMedian: " + str(med))

# Print out the standard deviation on height. Replace 'None'
stddev = np.std(np_baseball[:, 0])
print("\nStandard Deviation: " + str(stddev))

# Print out correlation between first and second column. Replace 'None'
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
print("\nCorrelation: " + str(corr))
Average: 5.36614778325

Median: 5.54

Standard Deviation: 4.95899575065

Correlation: [[ 1.         -0.04160593]
 [-0.04160593  1.        ]]
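np.corrcoef() returns the full 2x2 correlation matrix rather than a single number: the diagonal is always 1 and the two off-diagonal entries both hold the Pearson coefficient between the inputs. The sketch below pulls out that single value. It uses synthetic stand-in data, and the names rng, corr_matrix and r are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in data (an assumption; this is not the MLB data).
rng = np.random.default_rng(0)
height = rng.normal(74, 2, 500)                  # hypothetical heights
weight = 2.5 * height + rng.normal(0, 10, 500)   # built to correlate with height

# 2x2 symmetric matrix with ones on the diagonal
corr_matrix = np.corrcoef(height, weight)

# The off-diagonal entry is the Pearson correlation between the two inputs
r = corr_matrix[0, 1]

print(corr_matrix.shape)  # (2, 2)
print(round(r, 2))
```

In the output above, the off-diagonal value of about -0.04 is close to zero, which is what one would expect when the height and weight columns are drawn independently, as they were in section 1.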
3. Blend it all together:
You've contacted FIFA for some data and they handed you two lists. The lists are the following:
positions = [ 'GK', 'M', 'A', 'D', ... ]
heights = [ 191, 184, 185, 180, ... ]
Each element in the lists corresponds to a player.
The first list, positions, contains strings representing each player's position.
The possible positions are: 'GK' (goalkeeper), 'M' (midfield), 'A' (attack) and 'D' (defense).
The second list, heights, contains integers representing the height of the player in cm.
The first player in the lists is a goalkeeper and is pretty tall (191 cm).
You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field.
Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.
Convert heights and positions, which are regular lists, to numpy arrays. Call them np_heights and np_positions.
Extract all the heights of the goalkeepers. You can use a little trick here: use np_positions == 'GK' as an index for np_heights. Assign the result to gk_heights.
Extract the heights of all the other players.
This time use np_positions != 'GK' as an index for np_heights.
Assign the result to other_heights.
Print out the median height of the goalkeepers using np.median().
Do the same for the other players. Print out their median height. Replace None with the correct code.
In :
# heights and positions are unavailable from DataCamp,
# so let's create some of our own.
import numpy as np

# Let's represent "positions" with numbers
# 1:'GK', 2:'D', 3:'M', 4:'A'
# Note: a normal draw produces non-integer values, so the "== 1" test below
# only matches the few samples that round to exactly 1.0.
positions = np.round(np.random.normal(3, 10, 1000), 2)
heights = np.round(np.random.normal(5.8, 4.0, 1000), 2)  # in feet

# Let's stack these into two columns
np_football = np.column_stack((positions, heights))

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array(np_football[:, 0])
np_heights = np.array(np_football[:, 1])

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == 1]

# Heights of the other players: other_heights
other_heights = np_heights[np_positions != 1]

# Print out the mean position of np_football
print("\nMean positions at which players play: " + str(np.mean(np_positions)))

# Print out the median position of np_football
print("\nMedian positions at which player play: " + str(np.median(np_positions)))

# Print out the median height of goalkeepers. Replace 'None'
print("\nMedian height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players. Replace 'None'
print("\nMedian height of other players: " + str(np.median(other_heights)))
Mean positions at which players play: 2.66724

Median positions at which player play: 3.005

Median height of goalkeepers: 6.18

Median height of other players: 5.62
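The exercise text describes positions as strings such as 'GK', whereas the solution above substitutes numeric codes. As a minimal sketch of the string-based version, the snippet below applies the same boolean-mask indexing to two short lists. The first four entries follow the example lists in the text; the remaining entries and the name is_gk are made up for illustration.

```python
import numpy as np

# First four entries follow the example lists in the text; the rest are made up.
positions = ['GK', 'M', 'A', 'D', 'GK', 'D', 'M', 'GK']
heights = [191, 184, 185, 180, 188, 183, 178, 193]

# Convert the regular lists to NumPy arrays
np_positions = np.array(positions)
np_heights = np.array(heights)

# Comparing an array with a string yields a boolean mask of the same length
is_gk = np_positions == 'GK'

gk_heights = np_heights[is_gk]
other_heights = np_heights[~is_gk]  # equivalent to np_positions != 'GK'

print("Median height of goalkeepers: " + str(np.median(gk_heights)))
print("Median height of other players: " + str(np.median(other_heights)))
```

With these made-up values the goalkeepers' median is 191.0 cm and the other players' median is 183.0 cm. The result on the real FIFA lists may of course differ.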