(fill in your two names here)
Facilitator: (fill in name)
Spokesperson: (fill in name)
Process Analyst: (fill in name)
Quality Control: (fill in name)
If there are only three people in your group, have one person serve as both spokesperson and process analyst for the rest of this activity.
At the end of this Lesson, you will be asked to record how long each Model required for your team. The Facilitator should keep track of time for your team.
0. Explain one application of computer-generated random numbers to your field of science.
1. What is the smallest value printed? The largest value printed?
2. If you run the for-loop a second time, is the same series of numbers generated? If not, what is the new range (min and max) of values?
3. After repeating the for-loop a number of times, describe (to the best of your ability) the range of numbers returned by the random function.
4. Based on the output returned by the random function, describe (to the best of your ability) the nature of the distribution of numbers generated. (Do they appear clustered around a particular value, or are they spread out uniformly over the range?)
In the previous model, we evaluated the output of our function by simply looking at a list of numbers. To assess important numerical relationships visually, it is often helpful to create graphical displays of our data. matplotlib is a library for making 2D plots in Python. It is a powerful and flexible object-oriented library, which makes it both useful and complex. To simplify some basic plots, we will use the pyplot interface, which works on top of matplotlib and makes it quicker to produce plots.
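As a rough illustration of the difference between the two interfaces, the sketch below draws the same line plot twice: once with pyplot functions and once with explicit Figure and Axes objects. The data values and titles are made up for this example only.

```python
import matplotlib.pyplot as plt

# made-up example data (not part of the activity)
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# pyplot style: each call acts on an implicit "current" figure
plt.figure()
plt.plot(x, y)
plt.title("pyplot interface")

# object-oriented style: keep the Figure and Axes objects and call their methods
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("object-oriented interface")

# In a notebook the figures display when the cell finishes;
# in a plain script, call plt.show() to open them.
```

Both approaches produce the same kind of plot. The pyplot version is shorter, while the object-oriented version gives more control once a figure has several parts.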
Here are some links to the documentation:
In [ ]:
def rand_mod2():
    """
    produces a histogram of 10 random numbers
    """
    import random
    import matplotlib.pyplot as plt
    numbers = []
    for i in range(10):
        numbers.append(random.random())
    plt.hist(numbers, 10)
    plt.show()
In [ ]:
rand_mod2()
5. What would be appropriate labels for the x and y axes of the plot that displays?
6a. In the cell below, modify the code in rand_mod2() so that it increases the number of random numbers generated and plotted. Comment all changes and run an example.
In [ ]:
## new code with comments
In [ ]:
## example
6b. Describe how the output plot generated changes when you increase the number of random numbers plotted.
7. Describe how the output plot generated changes when you increase the value of the second parameter of the pyplot hist() method.
In [ ]:
## code, change second parameter plt.hist()
8a. Consider again your answer to question 4 (the last question in the previous model). Based on the plot of the output returned by the random function, how does visualization of this data impact your original assessment of the nature of the distribution?
8b. Use the rand_mod2() code to demonstrate.
In [ ]:
9. In general, describe what the pyplot.hist() method does with the series of random numbers to create this type of display.
10. Explain why it is appropriate that we used histograms to plot the results of our random number simulations in Model 2.
11. Run the code below to bring in the genotype_height_weight.csv data set (make sure the data is in the same directory as this notebook, or change the path below):
In [ ]:
import pandas as pd
data = pd.read_csv('genotype_height_weight.csv')
data.head()
12a. What are the variable types in the data set? Justify your answers.
12b. What are the Python data types of the variables in the data set? Use code to justify your answer.
13. If you wanted to see how many of each genotype there are, what type of graph would you make? Justify your answer.
14. If you wanted to see the distributions of both height and weight, what type of graph would you make? Justify your answer.
We will use various methods of the matplotlib library to visualize data in a variety of different ways. First we will consider the correct chart/graph types for different kinds of data. Then we will work through a few more examples of useful plots that can be created using matplotlib, pyplot, and, if you have a DataFrame, pandas.
Plotting in Python can get pretty complicated, since the object-oriented matplotlib is the basis for most plots, and most of the tools that we use (e.g. pyplot and pandas) actually sit on top of matplotlib and make your life easier (mostly). The advantages are that there are lots of ways to customize plots and that simple ones are pretty easy to make, but it can get complicated fast.
For example, run the code in the cells below.
In [ ]:
## displays a simple bar graph
import matplotlib.pyplot as plt
mylista = [1,2,3,4,5]
mylistb = [6,2,3,4,5]
plt.bar(mylista, mylistb)
plt.show()
15. What does each list in the code above do?
16. What happens when you switch the position of the lists in the plt.bar call?
In [ ]:
17. Explain what each argument inside plt.bar() does.
18a. Experiment making bar charts with the lists below.
In [ ]:
## put lists in memory
mylistc = [1,2,3,4,5]
mylistd = [6,2,3,4,5]
In [ ]:
## first, just make a simple bar graph
In [ ]:
## expt 1
In [ ]:
## expt 2
18b. What happens when the lists are different lengths? (run code, leave results, write interpretation)
18c. What happens when items in a list repeat? (You need to consider, test, and interpret 2 different scenarios: 1. when the list that is the first parameter repeats, and 2. when the list that is the second parameter repeats. In both cases the lists need to be the same length.)
In [ ]:
## displays a slightly more complex
## and better labeled bar graph
cat = ['A', 'B', 'C', 'D', 'E'] # data categories
xcat = range(5) # x-axis "markers" - category order
count = [6,2,3,4,5] # heights of bars in order
plt.bar(xcat,count, align='center') # make plot with x-axis, markers centered
plt.xticks(xcat, cat) # relabel with real categories
plt.title('title')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show() # draw the plot
19. What is the difference between cat and xcat? Why do we need both?
20. Explain what happens if you change the order of the values in cat.
21. Copy, paste, and modify the code above to plot a bar chart of the count of the three different genotypes in the genotype_height_weight.csv dataset (remember we brought it into memory above as the data DataFrame). Hint: if you remember back to the pandas lesson, df.col_name.value_counts() will count the values in a column (it returns a Series).
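As a reminder of what value_counts() returns, here is a small sketch on a made-up Series. The colors below are hypothetical toy data, not the genotype column.

```python
import pandas as pd

# hypothetical toy data (not the course data set)
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()   # a Series: unique values as the index, counts as the values
print(counts)

print(counts.index.tolist())     # the categories, most frequent first
print(counts.tolist())           # the matching counts
```

The index and the values line up one-to-one, which is the same pairing that cat and count had in the bar chart code above.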
In [ ]:
22. Explain why a bar chart is an acceptable type of graph for this data visualization.
pandas
The pandas library has some great, quick plotting tools for Series and DataFrames. These functions sit on top of matplotlib and make it easier to use DataFrames, and columns from DataFrames, as input for some useful chart types. The nice thing about this is that pandas feeds the data to matplotlib in a form that it can use, and we can still use the formatting syntax that we've learned for matplotlib to make labels and titles (rather than having to learn a whole new, pandas-specific one).
This part of the pandas docs has a lot of good info on making graphs:
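One way to see that pandas plots really are matplotlib plots is to look at what a plotting call returns. The sketch below uses a made-up Series of letters (an assumption for illustration only) and then formats the result with ordinary matplotlib methods.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical toy data (not the course data set)
letters = pd.Series(["x", "y", "x", "z", "x", "y"])
counts = letters.value_counts()      # Series of counts per letter

ax = counts.plot.bar()               # pandas draws the bars and returns a matplotlib Axes
print(type(ax))

# ordinary matplotlib formatting works on that Axes
ax.set_title("letter count")
ax.set_xlabel("letter")
ax.set_ylabel("count")
ax.tick_params(axis="x", labelrotation=0)   # keep the tick labels horizontal
```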
http://pandas.pydata.org/pandas-docs/stable/visualization.html
Run the code below:
In [ ]:
## makes the genotype bar chart
%matplotlib inline
data.genotype.value_counts().plot.bar()
plt.title('genotype count')
plt.xlabel('genotype')
plt.ylabel('count')
plt.xticks(rotation='horizontal')
22. Does this graph match the one you made above? Look carefully... Explain any differences and/or why we would ask this question (what is the advantage of the pandas version of this plot?).
23a. pandas tries to use only the appropriate data types for the plot type that you pick. In the example below, we pass the entire DataFrame to plot.box(). Which variable does it leave out, and why?
In [ ]:
## run this code
data.plot.box()
23b. This plot still looks like crap. Explain why it is not an appropriate graph.
23c. Make and explain a much more appropriate graph to visualize the distribution of these 2 columns.
24a. What type of graph would show the possible relationship of weight and height?
24b. Make a graph that uses the data DataFrame to visualize the potential relationship of weight to height (height is the explanatory variable and weight is the response variable in this case, and the response variable is on the y-axis by convention). Hint:
df.plot.scatter(x='x_col', y='y_col')
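To show the shape of that call without using the course data, here is a minimal sketch on a made-up DataFrame. The column names hours and score are hypothetical and stand in for x_col and y_col.

```python
import pandas as pd

# hypothetical toy data (not the course data set)
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],        # explanatory variable, goes on the x-axis
    "score": [52, 58, 61, 70, 74],   # response variable, goes on the y-axis
})

# x and y are given as column names; the call returns a matplotlib Axes
ax = toy.plot.scatter(x="hours", y="score")
```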
24c. Label the scatter plot with better labels on both axes and a title.
How much time did it require for your team to complete each Model?
Model 1:
Model 2:
Model 3:
Model 4: