Unit 3: Simulation

Lesson 17: Random processes, modeling, and plotting

Notebook Authors

(fill in your two names here)

Facilitator: (fill in name)
Spokesperson: (fill in name)
Process Analyst: (fill in name)
Quality Control: (fill in name)

If there are only three people in your group, have one person serve as both spokesperson and process analyst for the rest of this activity.

At the end of this Lesson, you will be asked to record how long each Model required for your team. The Facilitator should keep track of time for your team.

Computational Focus: Random Numbers

Model 0: Context

0. Explain one application of computer-generated random numbers in your field of science.

Model 1: Python Random Number Generator

Type the following into a Jupyter code cell:

import random
for i in range(10):
    print(random.random())

Critical Thinking Questions

1. What is the smallest value printed? The largest value printed?

2. If you run the for-loop a second time, is the same series of numbers generated? If not, what is the new range (min and max) of values?

3. After repeating the for-loop a number of times, describe (to the best of your ability) the range of numbers returned by the random function.

4. Based on the output returned by the random function, describe (to the best of your ability) the nature of the distribution of numbers generated. (Do they appear clustered around a particular value, or are they spread out uniformly over the range?)

Model 2: Plotting

In the previous model, we evaluated the output of our function by simply looking at a list of numbers. To assess important numerical relationships visually, it is often helpful to create graphical displays of our data. matplotlib is a library for making 2D plots in Python. It is a powerful and flexible object-oriented library, which makes it both useful and complex. To simplify some basic plots, we will use the pyplot interface, which works on top of matplotlib and makes common plots quicker to create.
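As a rough illustration of the difference between the two styles described above, the sketch below draws the same histogram twice: once through pyplot functions and once by holding the matplotlib objects directly. The values list is made up for this example, and this is only one of several ways to write either version.

```python
import matplotlib.pyplot as plt

values = [0.1, 0.4, 0.4, 0.7, 0.9]  # made-up numbers, for illustration only

# pyplot interface: functions act on the "current" figure behind the scenes
plt.figure()
plt.hist(values, bins=5)
plt.title('pyplot interface')

# object-oriented interface: hold the Figure and Axes objects explicitly
fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(values, bins=5)
ax.set_title('object-oriented interface')
```

In a notebook both figures should appear when the cell finishes; calling plt.show(), as the cells in this lesson do, draws them explicitly.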

Here are some links to the documentation:


In [ ]:
def rand_mod2():
    """
    produces a histogram of 10 random numbers
    """
    import random
    import matplotlib.pyplot as plt
    numbers = []
    for i in range(10):
        numbers.append(random.random())
    plt.hist(numbers, 10)
    plt.show()

In [ ]:
rand_mod2()

Critical Thinking Questions

5. What would be appropriate labels for the x and y axes of the plot that is displayed?

6a. In the cell below, modify the code in rand_mod2() so that it generates and plots more random numbers. Comment all changes and run an example.


In [ ]:
## new code with comments

In [ ]:
## example

6b. Describe how the output plot generated changes when you increase the number of random numbers plotted.

7. Describe how the output plot generated changes when you increase the value of the second parameter of the pyplot hist() method.


In [ ]:
## code, change second parameter plt.hist()

8a. Consider again your answer to question 4 (the last question in the previous model). Based on the plot of the output returned by the random function, how does visualization of this data impact your original assessment of the nature of the distribution?

8b. Use the rand_mod2() code to demonstrate.


In [ ]:

9. In general, describe what the pyplot.hist() method does with the series of random numbers to create this type of display.

Model 3: Data and graph types

Before we get too far in making different types of graphs, we should consider what the correct type of graph might be for the data that we have. Table 1 below is a handy reference for appropriate graph types based on data type.

Table 1

Critical Thinking Questions

10. Explain why it is appropriate that we used histograms to plot the results of our random number simulations in Model 2.

11. Run the code below to bring in the genotype_height_weight.csv data set (make sure the data is in the same directory as this notebook, or change the path below):


In [ ]:
import pandas as pd
data = pd.read_csv('genotype_height_weight.csv')
data.head()

12a. What are the variable types in the data set? Justify your answers.

12b. What are the Python data types of the variables in the data set? Use code to justify your answer.

13. If you wanted to see how many of each genotype there are, what type of graph would you make? Justify your answer.

14. If you wanted to see the distributions of both height and weight, what type of graph would you make? Justify your answer.

Model 4: More plotting

We will use various methods of the matplotlib library to visualize data in a variety of different ways. First we will consider the correct chart/graph types for different kinds of data. Then we will work through a few more examples of useful plots that can be created using matplotlib, pyplot, and if you have a DataFrame, pandas.

Plotting in Python can get pretty complicated. The object-oriented matplotlib library is the basis for most plots, and most of the tools that we use (e.g. pyplot and pandas) sit on top of matplotlib and make your life easier (mostly). The advantage is that simple plots are pretty easy to make and there are lots of ways to customize them; the disadvantage is that things can get complicated fast.

For example, run the code in the cells below:


In [ ]:
## displays a simple bar graph
import matplotlib.pyplot as plt
mylista = [1,2,3,4,5]
mylistb = [6,2,3,4,5]
plt.bar(mylista, mylistb)
plt.show()

Critical Thinking Questions

15. What does each list in the code above do?

16. What happens when you switch the position of the lists in the plt.bar call?


In [ ]:

17. Explain what each argument inside plt.bar() does.

18a. Experiment making bar charts with the lists below.


In [ ]:
## put lists in memory
mylistc = [1,2,3,4,5]
mylistd = [6,2,3,4,5]

In [ ]:
## first, just make a simple bar graph

In [ ]:
## expt 1

In [ ]:
## expt 2

18b. What happens when the lists are different lengths? (Run code, leave results, write interpretation.)

18c. What happens when items in a list repeat? (You need to consider, test, and interpret two different scenarios: 1. when the list that is the first parameter contains repeats, and 2. when the list that is the second parameter contains repeats. In both cases the lists need to be the same length.)

Shaping up our plots

Below is a more realistic and sensibly labeled bar chart:


In [ ]:
## displays a slightly more complex
## and better labeled bar graph
cat = ['A', 'B', 'C', 'D', 'E'] # data categories
xcat = range(5) # x-axis "markers" - category order
count = [6,2,3,4,5] # heights of bars in order
plt.bar(xcat,count, align='center') # make plot with x-axis, markers centered
plt.xticks(xcat, cat) # relabel with real categories
plt.title('title')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show() # draw the plot

19. What is the difference between cat and xcat? Why do we need both?

20. Explain what happens if you change the order of the values in cat.

21. Copy, paste, and modify the code above to plot a bar chart of the count of the three different genotypes in the genotype_height_weight.csv data set (remember that we brought it into memory above as the data DataFrame). Hint: if you remember back to the pandas lesson, df.col_name.value_counts() will count the values in a column (it returns a Series).
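If the value_counts() call from the hint is unfamiliar, it may help to try it on a small made-up DataFrame first. The sketch below uses a hypothetical column named color (not part of the lesson's data set) and shows that the result is a Series whose index holds the categories and whose values hold the counts.

```python
import pandas as pd

# hypothetical DataFrame; the column name 'color' is made up for this sketch
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'red', 'blue']})

counts = df.color.value_counts()  # a Series: index = categories, values = counts
print(type(counts).__name__)      # Series
print(list(counts.index))         # category labels, most frequent first
print(list(counts.values))        # the count for each label
```

The index and the values are the two pieces of information a bar chart needs: what to call each bar and how tall to make it.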


In [ ]:

22. Explain why a bar chart is an acceptable type of graph for this data visualization.

Plotting with pandas

The pandas library has some great, quick plotting tools for Series and DataFrames. These functions sit on top of matplotlib and make it easier to use DataFrames, and columns from DataFrames, as input for some useful chart types. The nice thing about this is that pandas feeds the data to matplotlib in a form that it can use, and we can still use the formatting syntax that we've learned for matplotlib to make labels and titles (rather than having to learn a whole new, pandas-specific one).
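One way to see that pandas plotting really does hand off to matplotlib is to look at what a plot call returns. The sketch below uses a made-up Series of colors (not the lesson's data) and labels the plot through the returned object rather than through plt; treat it as an illustration of the layering, not the only way to label a pandas plot.

```python
import matplotlib.axes
import pandas as pd

# made-up data for illustration; not the lesson's data set
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# pandas draws the bars and hands back the matplotlib Axes it drew on
ax = colors.value_counts().plot.bar()

# ordinary matplotlib Axes methods work on the returned object
ax.set_title('color count')
ax.set_xlabel('color')
ax.set_ylabel('count')

print(isinstance(ax, matplotlib.axes.Axes))  # True
```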

This part of the pandas docs has a lot of good info on making graphs:
http://pandas.pydata.org/pandas-docs/stable/visualization.html

Run the code below:


In [ ]:
## makes the genotype bar chart
%matplotlib inline
data.genotype.value_counts().plot.bar()
plt.title('genotype count')
plt.xlabel('genotype')
plt.ylabel('count')
plt.xticks(rotation='horizontal')

22. Does this graph match the one you made above? Look carefully... Explain any differences and/or why we would ask this question. (What is the advantage of the pandas version of this plot?)

23a. pandas tries to use only the appropriate data types for the plot type that you pick. In the example below, we pass the entire DataFrame to plot.box(). Which variable does it leave out, and why?


In [ ]:
## run this code
data.plot.box()

23b. This plot still looks bad. Explain why this is not an appropriate graph.

23c. Make and explain a much more appropriate graph to visualize the distribution of these 2 columns.

24a. What type of graph would show the possible relationship of weight and height?

24b. Make a graph that uses the data DataFrame to visualize the potential relationship of weight to height. (Height is the explanatory variable and weight is the response variable in this case; by convention, the response variable goes on the y-axis.) Hint:

df.plot.scatter(x='x_col', y='y_col')
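To show the shape of the call in the hint, here is a minimal sketch on a made-up DataFrame. The column names hours_studied and quiz_score and their values are hypothetical and have nothing to do with the lesson's data set; substitute your own DataFrame and column names.

```python
import pandas as pd

# made-up measurements; both column names are hypothetical
df = pd.DataFrame({'hours_studied': [1, 2, 3, 4, 5],
                   'quiz_score': [52, 60, 71, 75, 88]})

# explanatory variable on the x-axis, response variable on the y-axis
ax = df.plot.scatter(x='hours_studied', y='quiz_score')
```

The x and y arguments are column names given as strings, so they must match the DataFrame's column labels exactly.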

24c. Label the scatter plot with better labels on both axes and a title.

Temporal Analysis Report

How much time did it require for your team to complete each Model?

Model 1:

Model 2:

Model 3:

Model 4: