Lesson18 Individual Assignment

Individual means that you do it yourself. You won't learn to code if you don't struggle for yourself and write your own code. Remember that while you can discuss the general (algorithmic) way to solve a problem, you should not even be looking at anyone else's code or showing anyone else your code for an individual assignment.
Review the Group Work guidelines on Canvas and/or ask an instructor if you have any questions.

Programming Practice

Be sure to spell all function names correctly - misspelled functions will lose points (and often break anyway since no one is sure what to type to call it). If you prefer showing your earlier, scratch work as you figure out what you are doing, please be sure that you make a final, complete, correct last function in its own cell that you then call several times to test. In other words, separate your thought process/working versions from the final one (a comment that tells us which is the final version would be lovely).

Every function should have at least a docstring at the start that states what it does (see Lesson3 Team Notebook if you need a reminder). Make other comments as necessary.
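
As a reminder of the shape, here is a minimal sketch of a function with a docstring. The function name `rectangle_area` and its parameters are made up for illustration and are not part of this assignment.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle with the given width and height."""
    return width * height


# the docstring is stored on the function and is what help() displays
print(rectangle_area.__doc__)
print(rectangle_area(3, 4))
```

The docstring is the first statement inside the function body, so it should come before any other code in the function.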

Make sure that you are running test cases (plural) for everything and commenting on the results in markdown. Your comments should discuss how you know that the test case results are correct.

part 1: Sampling

A. Before you code, we need to set up the model:

In this activity, we're going to look at the effect of sampling on summary statistics.
Sampling is when we select a number of individuals out of a larger population.
There are many sampling strategies, and sampling is used daily in most science experiments and data analysis. We will be looking at a "simple random sample" where we select our sample, well, randomly. This fits in nicely with what we've been doing with random numbers.

Let's go!
Run the code in the cells below and read the comments to see what each cell does.



In [ ]:

    
## import libraries we need
import numpy as np
import matplotlib.pyplot as plt



In [ ]:

    
## makes our population of 10,000 integers from 1 to 100 (inclusive)
## random_integers is deprecated; randint is its replacement
## randint excludes the upper bound, so high=101 gives values up to 100
population = np.random.randint(1, high=101, size=10000)



In [ ]:

    
## look at the first 25 (don't print out all 10,000!)
population[:25]

Notice that the population is an array, a NumPy data type that behaves in some ways like a list and in some ways like a tuple. If you want more info, here is some more info from the NumPy docs. We will do more with arrays later. For now, the best part is that we can easily do mathematical operations on an array without looping through the elements (thank you NumPy!). Also, a hint for later, if you want
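
To illustrate the point about skipping the loop, here is a small sketch on a made-up five-element array (`small` is a hypothetical name, not the `population` above). It does the same doubling once with array arithmetic and once with an explicit loop.

```python
import numpy as np

# a tiny, made-up array so the results are easy to check by hand
small = np.array([2, 4, 6, 8, 10])

# arithmetic applies to every element at once, no loop needed
doubled = small * 2
shifted = small + 1

# the same doubling written with an explicit loop over a plain list
doubled_loop = [value * 2 for value in small.tolist()]

print(doubled)
print(shifted)
print(doubled_loop)
print(small.mean())   # 6.0
```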



In [ ]:

    
## confirm the data type
type(population)



In [ ]:

    
## for example, if we want the population mean
population.mean()



In [ ]:

    
## or, if we want the population standard deviation
population.std()



In [ ]:

    
## you try with .min() and .max()



In [ ]:

    
## make a histogram of the population
%matplotlib inline

1. What type of distribution do we have in the population?

B. Define a sample_means function that:

takes 2 parameters: n = the number of samples, and p = the number of individuals in each sample
- for example, n=100, p=10 would be 100 samples with 10 numbers in each sample
- (hint: look at numpy.random.choice)
returns nothing
Your sample_means function should:
- use the population that we have defined above
- print the mean, standard deviation, minimum, and maximum of the collection of n sample means in a user-friendly way
  - put another way, the mean of the means, the std of the means, etc.
  - (hint: using a np.array will make your life easier for these calcs)
- plot a histogram of the means of the samples
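
The numpy.random.choice hint can be tried out on a toy example before you write the function. The sketch below draws a single sample from a small made-up array (`toy_population` and `one_sample` are hypothetical names). It is not the sample_means function, only a look at what one call returns.

```python
import numpy as np

# a small, made-up population: the integers 1 through 20
toy_population = np.arange(1, 21)

# draw one simple random sample of 5 individuals
# (by default, choice samples with replacement, so repeats are possible)
one_sample = np.random.choice(toy_population, size=5)

print(one_sample)
print(one_sample.mean())
```

Passing `replace=False` draws without replacement instead. Which of the two fits your model is for you to decide and explain.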

then test your sample_means function with:

n=100, p=10
n=10, p=100
n=100, p=100
at least 3 more test cases of your choice



In [ ]:

2. Once you are done testing, look at your results and comment on what happens to the shape of the distribution and the 4 summary statistics that you print for each simulation.

3. Compare the values of the summary statistics and the shape of the distributions that you got from your tests to the values from the whole population. Explain what you observe.

4. Provide some general conclusions about the simple random sampling that you simulated.

part 2: Histogram Challenge!

Your mission, should you choose to accept it, is to get as far as you can in replicating the histogram below:

A note to my fellow graphing obsessives and the overwhelmed:
Do the best you can, but this question is worth 15 pts and if you get the data graphed and the thing labeled reasonably, you'll get a lot of partial credit. We just wanted to throw out a challenge that pushed your graphing and formatting skills.

First run the code below to get the data, then you're on your own! (Make sure the data file fly_wings_v2.csv from Canvas is in the same directory as this notebook.)



In [ ]:

    
## import the data from csv into a pandas DataFrame
import pandas as pd
wings = pd.read_csv('fly_wings_v2.csv')
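
Before plotting, it usually helps to check what the DataFrame contains. The sketch below uses a tiny made-up table in place of fly_wings_v2.csv, so the column names `length` and `group` are assumptions and will not match the real file. The calls shown (`head`, `columns`, `describe`) work the same way on `wings`.

```python
import io
import pandas as pd

# made-up CSV text standing in for the real data file
fake_csv = io.StringIO("length,group\n36.5,a\n41.2,b\n39.8,a\n")
demo = pd.read_csv(fake_csv)

print(demo.head())       # first rows of the table
print(demo.columns)      # column names
print(demo.describe())   # summary statistics for the numeric columns
```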