Functions and Visualizations

In the past week, you've learned a lot about using tables to work with datasets. With your tools so far, you can:

  1. Load a dataset from the web;
  2. Work with (extract, add, drop, relabel) columns from the dataset;
  3. Filter and sort it according to certain criteria;
  4. Perform arithmetic on columns of numbers;
  5. Group rows by columns of categories, counting the number of rows in each category;
  6. Make a bar chart of the categories.

These tools are fairly powerful, but they're not quite enough for all the analysis we'll eventually be doing in this course. Today we'll learn a tool that dramatically expands this toolbox: the table method apply. We'll also see how to make histograms, which are like bar charts for numerical data.


In [ ]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines load the tests.
from client.api.assignment import load_assignment 
tests = load_assignment('lab04.ok')

1. Functions and CEO Incomes

In Which We Write Down a Recipe for Cake

Let's start with a real data analysis task. We'll look at the 2015 compensation of CEOs at the 100 largest companies in California. The data were compiled for a Los Angeles Times analysis, and ultimately came from filings mandated by the SEC from all publicly-traded companies. Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the data in raw form from the LA Times page into a file called raw_compensation.csv. (The page notes that all dollar amounts are in millions of dollars.)


In [ ]:
raw_compensation = Table.read_table('raw_compensation.csv')
raw_compensation

Question 1. When we first loaded this dataset, we tried to compute the average of the CEOs' pay like this:

np.average(raw_compensation.column("Total Pay"))

Explain why that didn't work. Hint: Try looking at some of the values in the "Total Pay" column.

Write your answer here, replacing this text.


In [ ]:
...

Question 2. Extract the first value in the "Total Pay" column. It's Mark Hurd's pay in 2015, in millions of dollars. Call it mark_hurd_pay_string.


In [ ]:
mark_hurd_pay_string = ...
mark_hurd_pay_string

In [ ]:
_ = tests.grade('q1_2')

Question 3. Convert mark_hurd_pay_string to a number of dollars. The string method strip will be useful for removing the dollar sign; it removes a specified character from the start or end of a string. For example, the value of "100%".strip("%") is the string "100". You'll also need the function float, which converts a string that looks like a number to an actual number. Last, remember that the answer should be in dollars, not millions of dollars.
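As a small illustration of the two tools just described, the sketch below reuses the "100%" example from the text. The names percent_string, stripped, and number are made up for this example and are not part of the lab.

```python
# Hypothetical example string; not taken from the dataset.
percent_string = "100%"

# strip removes the given character from the start and end of the string.
stripped = percent_string.strip("%")

# float converts a string that looks like a number into an actual number.
number = float(stripped)

print(stripped)  # still a string
print(number)    # now a number
```

Note that strip gives back a new string and leaves percent_string itself unchanged.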


In [ ]:
mark_hurd_pay = ...
mark_hurd_pay

In [ ]:
_ = tests.grade('q1_3')

To compute the average pay, we need to do this for every CEO. But that looks like it would involve copying this code 102 times.

This is where functions come in. First, we'll define our own function that packages together the code we wrote to convert a pay string to a pay number. This has its own benefits. Later in this lab we'll see a bigger payoff: we can call that function on every pay string in the dataset at once.

Question 4. Below we've written code that defines a function that converts pay strings to pay numbers, just like your code above. But it has a small error, which you can correct without knowing what all the other stuff in the cell means. Correct the problem.


In [ ]:
def convert_pay_string_to_number(pay_string):
    """Converts a pay string like '$100 ' (in millions) to a number of dollars."""
    return float(pay_string.strip("$"))

In [ ]:
_ = tests.grade('q1_4')

Running that cell doesn't convert any particular pay string.

Rather, think of it as defining a recipe for converting a pay string to a number. Writing down a recipe for cake doesn't give you a cake. You have to gather the ingredients and get a chef to execute the instructions in the recipe to get a cake. Similarly, no pay string is converted to a number until we call our function on a particular pay string (which tells Python, our lightning-fast chef, to execute it).
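To make the recipe idea concrete, here is a minimal sketch that is separate from the lab's own code. The function bake_cake and the list cakes_baked are hypothetical names invented for this illustration; the list simply records each time the function's body actually runs.

```python
# A hypothetical list that records every cake that actually gets baked.
cakes_baked = []

def bake_cake(flavor):
    """Records that a cake of the given flavor was baked, and returns its name."""
    cake = flavor + " cake"
    cakes_baked.append(cake)
    return cake

# Defining the function wrote down the recipe, but nothing has been baked yet.
cakes_before_call = len(cakes_baked)

# Calling the function executes the recipe.
first_cake = bake_cake("chocolate")
cakes_after_call = len(cakes_baked)

print(cakes_before_call, cakes_after_call, first_cake)
```

The count of baked cakes changes only when bake_cake is called, not when it is defined.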

We can call our function just like we call the built-in functions we've seen. (Almost all of those functions are defined in this way, in fact!) It takes one argument, a string, and it returns a number.


In [ ]:
convert_pay_string_to_number(mark_hurd_pay_string)

In [ ]:
# We can also compute Safra Catz's pay in the same way:
convert_pay_string_to_number(raw_compensation.where("Name", are.equal_to("Safra A. Catz*")).column("Total Pay").item(0))

What have we gained? Well, without the function, we'd have to copy that 10**6 * float(pay_string.strip("$")) stuff each time we wanted to convert a pay string. Now we just call a function whose name says exactly what it's doing.

We'd still have to call the function 102 times to convert all the salaries, which we'll fix next.

But for now, let's write some more functions.

2. Defining functions

In Which We Write a Lot of Recipes

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100. For example, the value of to_percentage(.5) should be the number 50. (No percent sign.)

A function definition has a few parts.

def

It always starts with def (short for define):

def

Name

Next comes the name of the function. Let's call our function to_percentage.

def to_percentage

Signature

Next comes something called the signature of the function. This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code. to_percentage should take one argument, and we'll call that argument proportion since it should be a proportion.

def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

def to_percentage(proportion):

Documentation

Functions can do complicated things, so you should write an explanation of what your function does. For small functions, this is less important, but it's a good habit to learn from the start. Conventionally, Python functions are documented by writing a triple-quoted string:

def to_percentage(proportion):
    """Converts a proportion to a percentage."""


Body

Now we start writing code that runs when the function is called. This is called the body of the function. We can write anything we could write anywhere else. First let's give a name to the number we multiply a proportion by to get a percentage.

def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100

return

The special instruction return in a function's body tells Python to make the value of the function call equal to whatever comes right after return. We want the value of to_percentage(.5) to be the proportion .5 times the factor 100, so we write:

def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor

Question 1. Define to_percentage in the cell below. Call your function to convert the proportion .2 to a percentage. Name that percentage twenty_percent.


In [ ]:
...
    ...
    ...
    ...

twenty_percent = ...
twenty_percent

In [ ]:
_ = tests.grade('q2_1')

Like the built-in functions, you can use named values as arguments to your function.

Question 2. Use to_percentage again to convert the proportion named a_proportion (defined below) to a percentage called a_percentage.

Note: You don't need to define to_percentage again! Just like other named things, functions stick around after you define them.


In [ ]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

In [ ]:
_ = tests.grade('q2_2')

Here's something important about functions: Each time a function is called, it creates its own "space" for names that's separate from the main space where you normally define names. (Exception: all the names from the main space get copied into it.) So even though you defined factor = 100 inside to_percentage above and then called to_percentage, you can't refer to factor anywhere except inside the body of to_percentage:


In [ ]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor
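The same behavior can be seen in a self-contained sketch. The function double and the name multiplier are hypothetical, and the try/except statement (not covered in this lab) is used only so that the example runs to completion instead of stopping at the error.

```python
def double(number):
    """Returns twice the given number."""
    multiplier = 2  # this name exists only while double is running
    return number * multiplier

doubled = double(21)

# Referring to multiplier outside the function raises a NameError,
# which we catch here so the example keeps running.
try:
    multiplier
    multiplier_visible_outside = True
except NameError:
    multiplier_visible_outside = False

print(doubled, multiplier_visible_outside)
```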

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

Question 3. Define a function called disemvowel. It should take a single string as its argument. (You can call that argument whatever you want.) It should return a copy of that string, but with all the characters that are vowels removed. (In English, the vowels are the characters "a", "e", "i", "o", and "u".)

Hint: To remove all the "a"s from a string, you can use that_string.replace("a", ""). And you can call replace multiple times.
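As a sketch of how repeated calls to replace can be chained, the example below removes dashes and spaces from a made-up string. The names phone_number and digits_only are assumptions for illustration only.

```python
# A hypothetical string with characters we want to remove.
phone_number = "555-123 4567"

# Each call to replace returns a new string, so the calls can be chained.
digits_only = phone_number.replace("-", "").replace(" ", "")

print(digits_only)
```

Because replace returns a new string each time, phone_number itself is left unchanged.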


In [ ]:
def disemvowel(a_string):
    ...
    ...

# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
disemvowel("Can you read this without vowels?")

In [ ]:
_ = tests.grade('q2_3')

Calls on calls on calls

Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other. Since you can write any code inside a function's body, you can call other functions you've written.

This is like a recipe for cake telling you to follow another recipe to make the frosting, and another to make the sprinkles. This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes. It's a foundation of productive programming.

For example, suppose you want to count the number of characters that aren't vowels in a piece of text. One way to do that is to remove all the vowels and count the size of the remaining string.

Question 4. Write a function called num_non_vowels. It should take a string as its argument and return a number. The number should be the number of characters in the argument string that aren't vowels.

Hint: Recall that the function len takes a string as its argument and returns the number of characters in it.


In [ ]:
def num_non_vowels(a_string):
    """The number of characters in a string, minus the vowels."""
    ...

In [ ]:
_ = tests.grade('q2_4')

Functions can also encapsulate code that does things rather than just computing values. For example, if you call print inside a function, and then call that function, something will get printed.

The movies_by_year dataset in the textbook has information about movie sales in recent years. Suppose you'd like to display the year with the 5th-highest total gross movie sales, printed in a human-readable way. You might do this:


In [ ]:
movies_by_year = Table.read_table("movies_by_year.csv")
rank = 5
fifth_from_top_movie_year = movies_by_year.sort("Total Gross", descending=True).column("Year").item(rank-1)
print("Year number", rank, "for total gross movie sales was:", fifth_from_top_movie_year)

After writing this, you realize you also wanted to print out the 2nd and 3rd-highest years. Instead of copying your code, you decide to put it in a function. Since the rank varies, you make that an argument to your function.

Question 5. Write a function called print_kth_top_movie_year. It should take a single argument, the rank of the year (like 2, 3, or 5 in the above examples). It should print out a message like the one above. It shouldn't have a return statement.


In [ ]:
def print_kth_top_movie_year(k):
    # Our solution used 2 lines.
    ...
    ...

# Example calls to your function:
print_kth_top_movie_year(2)
print_kth_top_movie_year(3)

In [ ]:
_ = tests.grade('q2_5')

3. Applying functions

In Which Python Bakes 102 Cakes

You'll get more practice writing functions, but let's move on.

Defining a function is a lot like giving a name to a value with =. In fact, a function is a value just like the number 1 or the text "the"!

For example, we can make a new name for the built-in function max if we want:


In [ ]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for max is still around:


In [ ]:
max(2, 6)

Try just writing max or our_name_for_max (or the name of any other function) in a cell, and run that cell. Python will print out a (very brief) description of the function.


In [ ]:
max

Why is this useful? Since functions are just values, it's possible to pass them as arguments to other functions. Here's a simple but not-so-practical example: we can make an array of functions.


In [ ]:
make_array(max, np.average, are.equal_to)

Question 1. Make an array containing any 3 other functions you've seen. Call it some_functions.


In [ ]:
some_functions = ...
some_functions

In [ ]:
_ = tests.grade('q3_1')

Working with functions as values can lead to some funny-looking code. For example, see if you can figure out why this works:


In [ ]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)

Here's a simpler example that's actually useful: the table method apply.

apply calls a function many times, once on each element in a column of a table. It produces an array of the results. Here we use apply to convert every CEO's pay to a number, using the function you defined:


In [ ]:
raw_compensation.apply(convert_pay_string_to_number, "Total Pay")

Here's an illustration of what that did:

Note that we didn't write something like convert_pay_string_to_number() or convert_pay_string_to_number("Total Pay"). The job of apply is to call the function we give it, so instead of calling convert_pay_string_to_number ourselves, we just write its name as an argument to apply.
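The idea behind apply can be sketched in plain Python. This is not how the datascience library implements it; apply_to_each and add_exclamation are hypothetical names, the sketch works on an ordinary list rather than a table column, and it uses a for statement to repeat the call once per item.

```python
def apply_to_each(function, values):
    """Calls function once on each item in values and returns a list of the results.

    A hypothetical stand-in for the idea behind apply; it works on a plain
    Python list instead of a table column.
    """
    results = []
    for value in values:
        results.append(function(value))
    return results

def add_exclamation(text):
    """Returns the text with an exclamation mark at the end."""
    return text + "!"

# We pass the function's name, without parentheses, so apply_to_each can call it.
excited = apply_to_each(add_exclamation, ["cake", "frosting", "sprinkles"])
print(excited)
```

Writing add_exclamation() with parentheses in the last call would instead try to call the function right away, with no argument, before apply_to_each ever ran.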

Question 2. Using apply, make a table that's a copy of raw_compensation with one more column called "Total Pay (\$)". It should be the result of applying convert_pay_string_to_number to the "Total Pay" column, as we did above. Call the new table compensation.


In [ ]:
compensation = raw_compensation.with_column(
    "Total Pay ($)",
    ...)
compensation

In [ ]:
_ = tests.grade('q3_2')

Now that we have the pay in numbers, we can compute things about them.

Question 3. Compute the average total pay of the CEOs in the dataset.


In [ ]:
average_total_pay = ...
average_total_pay

In [ ]:
_ = tests.grade('q3_3')

Question 4. Companies pay executives in a variety of ways: directly in cash; by granting stock or other "equity" in the company; or with ancillary benefits (like private jets). Compute the proportion of each CEO's pay that was cash. (Your answer should be an array of numbers, one for each CEO in the dataset.)


In [ ]:
cash_proportion = ...
cash_proportion

In [ ]:
_ = tests.grade('q3_4')

Check out the "% Change" column in compensation. It shows the percentage increase in the CEO's pay from the previous year. For CEOs with no previous year on record, it instead says "(No previous year)". The values in this column are strings, not numbers, so like the "Total Pay" column, it's not usable without a bit of extra work.

Given your current pay and the percentage increase from the previous year, you can compute your previous year's pay. For example, if your pay is \$100 this year, and that's an increase of 50% from the previous year, then your previous year's pay was $\frac{\$100}{1 + \frac{50}{100}}$, or around \$66.67.
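The same arithmetic can be checked in Python. The numbers below are hypothetical, chosen so the result is easy to verify by hand: pay of \$120 after a 20% increase means the previous pay was \$100.

```python
# Hypothetical numbers, chosen so the arithmetic is easy to check by hand.
current_pay = 120       # this year's pay, in dollars
percent_increase = 20   # increase from the previous year, in percent

# Divide by (1 + the increase as a proportion) to undo the increase.
previous_pay = current_pay / (1 + percent_increase / 100)

print(round(previous_pay, 2))
```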

Question 5. Create a new table called with_previous_compensation. It should be a copy of compensation, but with the "(No previous year)" CEOs filtered out, and with an extra column called "2014 Total Pay ($)". That column should have each CEO's pay in 2014.

Hint: This question takes several steps, but each one is still something you've seen before. Take it one step at a time, using as many lines as you need. You can print out your results after each step to make sure you're on the right track.

Hint 2: You'll need to define a function. You can do that just above your other code.


In [ ]:
# For reference, our solution involved more than just this one line of code
...

with_previous_compensation = ...
with_previous_compensation

In [ ]:
_ = tests.grade('q3_5')

Question 6. What was the average pay of these CEOs in 2014? Does it make sense to compare this number to the number you computed in question 3?


In [ ]:
average_pay_2014 = ...
average_pay_2014

In [ ]:
_ = tests.grade('q3_6')

Question 7. A skeptical student asks:

"I already knew lots of ways to operate on each element of an array at once. For example, I can multiply each element of some_array by 100 by writing 100*some_array. What good is apply?"

How would you answer? Discuss with a neighbor.

4. Histograms

Earlier, we computed the average pay among the CEOs in our 102-CEO dataset. The average doesn't tell us everything about the amounts CEOs are paid, though. Maybe just a few CEOs make the bulk of the money, even among these 102.

We can use a histogram to display more information about a set of numbers. The table method hist takes a single argument, the name of a column of numbers. It produces a histogram of the numbers in that column.

Question 1. Make a histogram of the pay of the CEOs in compensation.


In [ ]:
...

Question 2. Looking at the histogram, how many CEOs made more than \$30 million? (Answer the question by filling in your answer manually. You'll have to do a bit of arithmetic; feel free to use Python as a calculator.)


In [ ]:
num_ceos_more_than_30_million = ...

Question 3. Answer the same question with code. Hint: Use the table method where and the property num_rows.


In [ ]:
num_ceos_more_than_30_million_2 = ...
num_ceos_more_than_30_million_2

In [ ]:
_ = tests.grade('q4_3')

Question 4. Do most CEOs make around the same amount, or are there some who make a lot more than the rest? Discuss with someone near you.

5. Randomness

Data scientists also have to be able to understand randomness. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.

To start off, we will use Python to make choices at random. In numpy there is a sub-module called random that contains many functions that involve random selection. One of these functions is called choice. It picks one item at random from an array, and it is equally likely to pick any of the items. The function call is np.random.choice(array_name), where array_name is the name of the array from which to make the choice. Thus the following code evaluates to treatment with chance 50%, and control with chance 50%. Run the next code block several times and see what happens.


In [ ]:
two_groups = make_array('treatment', 'control')
np.random.choice(two_groups)

The big difference between the code above and all the other code we have run thus far is that the code above doesn't always return the same value. It can return either treatment or control, and we don't know ahead of time which one it will pick. We can repeat the process by giving choice an optional second argument, the number of random selections to make. Try it below:


In [ ]:
np.random.choice(two_groups, 10)
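The sketch below shows what stays the same from run to run and what does not. It uses np.array in place of make_array so that it needs only NumPy; the name assignments is made up for this example.

```python
import numpy as np

# np.array is used here in place of make_array so the example
# needs only NumPy.
two_groups = np.array(['treatment', 'control'])

# The second argument asks for 10 independent random selections.
assignments = np.random.choice(two_groups, 10)

print(len(assignments))  # the length is always 10
print(assignments)       # the contents vary from run to run
```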

If we wanted to determine whether the random choice made by the function choice is really fair, we could make a random selection a bunch of times and then count how often each selection shows up. In the next few code blocks, write some code that calls the choice function on the two_groups array one thousand times. Then, print out the percentage of occurrences for each of treatment and control. A useful function called Counter will be helpful; look at the code comments to see how it works!


In [ ]:
# replace ... with code that will run the 'choice' function 1000 times;
# the resulting array of choices will then have the name 'exp_results'
exp_results = ...

In [ ]:
from collections import Counter

Counter(exp_results) 
# the output from Counter tells you how many times 'treatment' and 'control' appear in the array
# produced by 'choice'; run this cell to see the output
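To see how the output of Counter can be used, here is a sketch on a small hand-written list rather than on random results. The names colors, color_counts, red_percentage, and blue_percentage are hypothetical.

```python
from collections import Counter

# A small hypothetical list, written out by hand so the counts
# are easy to verify.
colors = ['red', 'blue', 'red', 'red']

color_counts = Counter(colors)

# Look up a count by item, then divide by the total number of items.
red_percentage = 100 * color_counts['red'] / len(colors)
blue_percentage = 100 * color_counts['blue'] / len(colors)

print(color_counts)
print(red_percentage, blue_percentage)
```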

In [ ]:
# use the info provided by 'Counter' to print the percentage of times 'treatment' and 'control'
# were selected
print(...) # print percentage for 'treatment' here
print(...) # print percentage for 'control' here

A fundamental question about random events is whether or not they occur. For example:

  • Did an individual get assigned to the treatment group, or not?
  • Is a gambler going to win money, or not?
  • Has a poll made an accurate prediction, or not?

Once the event has occurred, you can answer "yes" or "no" to all these questions. In programming, it is conventional to do this by labeling statements as True or False. For example, if an individual did get assigned to the treatment group, then the statement, "The individual was assigned to the treatment group" would be True. If not, it would be False.

6. Booleans and Comparison

In Python, Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False. Whether problems involve randomness or not, Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1. Run the following cell.


In [ ]:
3 > 1 + 1

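The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The common comparison operators are < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to).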
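For a quick look at comparison operators in action, the sketch below evaluates one comparison with each of <, >, <=, >=, ==, and !=, plus one comparison that is False. The names are made up for this example.

```python
# Each name below holds the result of one comparison operator.
less_than = 2 < 3
greater_than = 3 > 2
less_than_or_equal = 2 <= 2
greater_than_or_equal = 3 >= 3
equal = 3 == 3
not_equal = 3 != 2

# A comparison that does not hold evaluates to False.
two_less_than_two = 2 < 2

print(less_than, greater_than, less_than_or_equal,
      greater_than_or_equal, equal, not_equal)
print(two_less_than_two)
```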

Notice the two equal signs == in the comparison to determine equality. This is necessary because Python already uses = to mean assignment to a name, as we have seen. It can't use the same symbol for a different purpose. Thus if you want to check whether 5 is equal to 10/2, then you have to be careful: 5 = 10/2 returns an error message because Python assumes you are trying to assign the value of the expression 10/2 to a name that is the numeral 5. Instead, you must use 5 == 10/2, which evaluates to True. Run these blocks of code to see for yourself.


In [ ]:
5 = 10/2

In [ ]:
5 == 10/2

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1+1 is between 1 and 3 using the following expression.


In [ ]:
1 < 1 + 1 < 3
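One way to read a chained comparison is as two separate comparisons that must both hold. The sketch below spells this out with Python's and operator, which this lab has not introduced; the names are hypothetical.

```python
middle = 1 + 1

# The chained form and the spelled-out form give the same answer.
chained = 1 < middle < 3
spelled_out = (1 < middle) and (middle < 3)

# A chain is False as soon as any one of its comparisons is False.
broken_chain = 1 < middle < 2

print(chained, spelled_out, broken_chain)
```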

The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. Try different values of x and y to confirm this relationship.


In [ ]:
x = 12
y = 5
min(x, y) <= (x+y)/2 <= max(x, y)

7. Comparing Strings

Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.


In [ ]:
'Dog' > 'Catastrophe' > 'Cat'
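One caveat worth a quick sketch: the comparison follows the character codes Python uses for text, in which uppercase English letters come before lowercase ones. So "alphabetical" holds most cleanly when the strings use the same capitalization. The names below are made up for this example.

```python
# Within the same case, the order is alphabetical.
same_case = 'apple' < 'banana'

# Uppercase English letters come before lowercase ones,
# so 'Zebra' is less than 'apple'.
mixed_case = 'Zebra' < 'apple'

# A shorter string is less than a longer string that starts with it.
prefix = 'Cat' < 'Catastrophe'

print(same_case, mixed_case, prefix)
```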

Let's return to random selection. Recall the array two_groups which consists of just two elements, treatment and control. To see whether a randomly assigned individual went to the treatment group, you can use a comparison:


In [ ]:
np.random.choice(two_groups) == 'treatment'

As before, the random choice will not always be the same, so the result of the comparison won't always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result.

8. Conditional Statements

In many situations, actions and results depend on a specific set of conditions being satisfied. For example, individuals in randomized controlled trials receive the treatment if they have been assigned to the treatment group. A gambler makes money if she wins her bet. In this section we will learn how to describe such situations using code. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.

A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a True value. If the if expression evaluates to a False value, then the body of the if is skipped. Let us start defining a function that returns the sign of a number.


In [ ]:
def sign(x):
    if x > 0:
        return 'Positive'

sign(3)

This function returns the correct sign if the input is a positive number. But if the input is not a positive number, then the if expression evaluates to a False value, and so the return statement is skipped and the function call has no value. See what happens when you run the next block.


In [ ]:
sign(-3)
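What "no value" means can be checked directly: when a function finishes without reaching a return statement, the call evaluates to the special value None, which a notebook cell does not display. The sketch below uses a hypothetical function, positive_label, that behaves like the first version of sign.

```python
def positive_label(x):
    """Returns 'Positive' for positive x; reaches no return for anything else."""
    if x > 0:
        return 'Positive'

result = positive_label(-3)

# When no return statement runs, the call evaluates to None.
print(result)
print(result is None)
```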

So let us refine our function to return Negative if the input is a negative number. We can do this by adding an elif clause, where elif is Python's shorthand for the phrase "else, if".


In [ ]:
def sign(x):
    if x > 0:
        return 'Positive'

    elif x < 0:
        return 'Negative'

Now sign returns the correct answer when the input is -3:


In [ ]:
sign(-3)

What if the input is 0? To deal with this case, we can add another elif clause:


In [ ]:
def sign(x):
    if x > 0:
        return 'Positive'

    elif x < 0:
        return 'Negative'

    elif x == 0:
        return 'Neither positive nor negative'

sign(0)

Run the previous code block for different inputs to our sign() function to make sure it does what we want it to.

Equivalently, we can replace the final elif clause by an else clause, whose body will be executed only if all the previous comparisons are False; that is, if the input value is equal to 0.


In [ ]:
def sign(x):
    if x > 0:
        return 'Positive'

    elif x < 0:
        return 'Negative'

    else:
        return 'Neither positive nor negative'

sign(0)

9. The General Form

A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.

if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a True value, then execute the corresponding body. The else clause is optional. When an else header is provided, its else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).
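As a sketch of these rules, the hypothetical function size_label below has one if clause, two elif clauses, and an else clause. The thresholds are arbitrary and chosen only for illustration.

```python
def size_label(n):
    """Returns a rough description of how large the number n is."""
    if n < 10:
        return 'small'
    elif n < 100:
        return 'medium'
    elif n < 1000:
        return 'large'
    else:
        return 'huge'

# 5 satisfies all three header expressions, but only the first body runs.
print(size_label(5), size_label(50), size_label(500), size_label(5000))
```

Because the headers are checked in order, the order of the clauses matters: putting n < 1000 first would label every number below 1000 as 'large'.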

10. Example: Pick a Card

We will now use conditional statements to define a function that we could use as part of a card game analysis application. Every time we run the function, we want it to print out a random card from a standard 52-card deck. Specifically, we should randomly choose a suit and a numeric value (1-13 for Ace-King) and print these values to the screen. Finish writing the function in the code block below:


In [ ]:
def draw_card():

    """
    Print out a random suit and numeric value representing a card from a standard 52-card deck.
    """
    
    # pick a random number to determine the suit
    suit_num = np.random.uniform(0,1) # this function returns a random decimal number
                                     # between 0 and 1
    
    ### TODO: write an 'if' statement that prints out 'heart' if 0 < suit_num < 0.25,
    ###        'spade' if 0.25 < suit_num < 0.5,
    ###        'club' if 0.5 < suit_num < 0.75,
    ###        'diamond' if 0.75 < suit_num < 1
    
    # pick a random number to determine the value
    val_num = np.random.uniform(0,13)
    
    ### TODO: write an if statement so that if  2 < val_num <= 12, 
    ###       you print out the floor of val_num
    ###       (you can use the floor() function)
    
    ### TODO: write an 'if' statement that prints out the value of the card for the
    ###       non-numeric possibilities: 'A' for ace, 'J' for jack, 'Q' for queen,
    ###       'K' for king
    
    return

In [ ]:
# test your function by running this block; do it multiple times and see what happens!
draw_card()

11. Iteration

It is often the case in programming – especially when dealing with randomness – that we want to repeat a process multiple times. For example, to check whether np.random.choice does in fact pick at random, we might want to run the following cell many times to see if Heads occurs about 50% of the time.


In [ ]:
np.random.choice(make_array('Heads', 'Tails'))

We might want to re-run code with slightly different input or other slightly different behavior. We could copy-paste the code multiple times, but that's tedious and prone to typos, and if we wanted to do it a thousand times or a million times, forget it.

A more automated solution is to use a for statement to loop over the contents of a sequence. This is called iteration. A for statement begins with the word for, followed by a name we want to give each item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.


In [ ]:
for i in np.arange(3):
    print(i)

It is instructive to imagine code that exactly replicates a for statement without the for statement. (This is called unrolling the loop.) A for statement simply replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose. For example, here is an unrolled version of the loop above:


In [ ]:
i = np.arange(3).item(0)
print(i)
i = np.arange(3).item(1)
print(i)
i = np.arange(3).item(2)
print(i)

Notice that the name i is arbitrary, just like any name we assign with =.

Here we use a for statement in a more realistic way: we print 5 random choices from an array.


In [ ]:
coin = make_array('Heads', 'Tails')

for i in np.arange(5):
    print(np.random.choice(coin))

In this case, we simply perform exactly the same (random) action several times, so the code inside our for statement does not actually refer to i.

12. Augmenting Arrays

While the for statement above does simulate the results of five tosses of a coin, the results are simply printed and aren't in a form that we can use for computation. Thus a typical use of a for statement is to create an array of results, by augmenting it each time.

The append function in numpy helps us do this. The call np.append(array_name, value) evaluates to a new array that is array_name augmented by value. When you use append, keep in mind that all the entries of an array must have the same type.


In [ ]:
pets = make_array('Cat', 'Dog')
np.append(pets, 'Another Pet')

This keeps the array pets unchanged:


In [ ]:
pets

But often while using for loops it will be convenient to update an array when augmenting it. Since np.append leaves the original array unchanged, this is done by assigning the augmented array to the same name as the original.


In [ ]:
pets = np.append(pets, 'Another Pet')
pets

Example: Counting the Number of Heads

We can now simulate five tosses of a coin and place the results into an array. We will start by creating an empty array and then appending the result of each toss.


In [ ]:
coin = make_array('Heads', 'Tails')

tosses = make_array()

for i in np.arange(5):
    tosses = np.append(tosses, np.random.choice(coin))

tosses

Let us rewrite the cell with the for statement unrolled:


In [ ]:
coin = make_array('Heads', 'Tails')

tosses = make_array()

i = np.arange(5).item(0)
tosses = np.append(tosses, np.random.choice(coin))
i = np.arange(5).item(1)
tosses = np.append(tosses, np.random.choice(coin))
i = np.arange(5).item(2)
tosses = np.append(tosses, np.random.choice(coin))
i = np.arange(5).item(3)
tosses = np.append(tosses, np.random.choice(coin))
i = np.arange(5).item(4)
tosses = np.append(tosses, np.random.choice(coin))

tosses

By capturing the results in an array we have given ourselves the ability to use array functions to do computations. For example, we can use np.count_nonzero to count the number of heads in the five tosses.


In [ ]:
np.count_nonzero(tosses == 'Heads')
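To see why this counts heads, it may help to look at the intermediate step on a fixed array. In the sketch below, example_tosses is a hand-written (not random) array, and np.array is assumed as a stand-in for make_array. The comparison produces an array of True and False values, and np.count_nonzero counts the True entries.

```python
import numpy as np

# A fixed, hand-written array so the result is the same on every run.
example_tosses = np.array(['Heads', 'Tails', 'Heads', 'Heads', 'Tails'])

# Comparing an array to a value compares every entry, giving an array of booleans.
is_heads = example_tosses == 'Heads'
print(is_heads)

# True counts as nonzero and False as zero, so this counts the heads.
num_heads = np.count_nonzero(is_heads)
print(num_heads)  # 3
```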

Iteration is a powerful technique. For example, by running exactly the same code for 1000 tosses instead of 5, we can count the number of heads in 1000 tosses.


In [ ]:
tosses = make_array()

for i in np.arange(1000):
    tosses = np.append(tosses, np.random.choice(coin))

np.count_nonzero(tosses == 'Heads')

Example: Number of Heads in 100 Tosses

It is natural to expect that in 100 tosses of a coin, there will be 50 heads, give or take a few.

But how many is "a few"? What's the chance of getting exactly 50 heads? Questions like these matter in data science not only because they are about interesting aspects of randomness, but also because they can be used in analyzing experiments where assignments to treatment and control groups are decided by the toss of a coin.

In this example we will simulate 10,000 repetitions of the following experiment:

Toss a coin 100 times and record the number of heads.

The histogram of our results will give us some insight into how many heads are likely.

As a preliminary, note that np.random.choice takes an optional second argument that specifies the number of choices to make. By default, the choices are made with replacement. Here is a simulation of 10 tosses of a coin:


In [ ]:
np.random.choice(coin, 10)
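Sampling with replacement is what makes 10 tosses from a two-element array possible. As a rough sketch (again assuming np.array in place of make_array), asking np.random.choice for 10 choices without replacement from only two options raises a ValueError, because the options would run out.

```python
import numpy as np

coin = np.array(['Heads', 'Tails'])

# With replacement (the default), the same face can be chosen again and again.
with_replacement = np.random.choice(coin, 10)
print(with_replacement)

# Without replacement, 10 choices from 2 options is impossible.
try:
    np.random.choice(coin, 10, replace=False)
    error_raised = False
except ValueError:
    error_raised = True

print(error_raised)
```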

Now let's study 100 tosses. We will start by creating an empty array called heads. Then, in each of the 10,000 repetitions, we will toss a coin 100 times, count the number of heads, and append it to heads.


In [ ]:
N = 10000

heads = make_array()

for i in np.arange(N):
    tosses = np.random.choice(coin, 100)
    heads = np.append(heads, np.count_nonzero(tosses == 'Heads'))

heads
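Since this lab is about functions, one possible variation is to put a single repetition into a function and call it inside the loop. The sketch below is an illustration rather than the lab's own code: the function name heads_in_tosses is hypothetical, np.array stands in for make_array, the seed is set only to make the run repeatable, and 1,000 repetitions are used to keep it quick.

```python
import numpy as np

coin = np.array(['Heads', 'Tails'])

def heads_in_tosses(num_tosses):
    """Toss the coin num_tosses times and return the number of heads."""
    tosses = np.random.choice(coin, num_tosses)
    return np.count_nonzero(tosses == 'Heads')

np.random.seed(0)    # assumption: seeding is only for repeatability
repetitions = 1000   # fewer than the 10,000 in the text, to run quickly

heads = np.array([])  # an empty array; its entries will be stored as floats
for i in np.arange(repetitions):
    heads = np.append(heads, heads_in_tosses(100))

print(len(heads), heads.min(), heads.max(), heads.mean())
```

Because the array starts out empty, NumPy stores the appended counts as floats, which is why they can display as values like 52.0 rather than 52.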

Let us collect the results in a table and draw a histogram.


In [ ]:
results = Table().with_columns(
    'Repetition', np.arange(1, N+1),
    'Number of Heads', heads
)

results

Here is a histogram of the data, with bins of width 1 centered at each value of the number of heads.


In [ ]:
results.select('Number of Heads').hist(bins=np.arange(30.5, 69.6, 1))

Not surprisingly, the histogram looks roughly symmetric around 50 heads. The height of the bar at 50 is about 8% per unit. Since each bin is 1 unit wide, this is the same as saying that about 8% of the repetitions produced exactly 50 heads. That's not a huge percent, but it is typically larger than the percent at any other number of heads.

The histogram also shows that in almost all of the repetitions, the number of heads in 100 tosses was somewhere between 35 and 65. Indeed, the bulk of the repetitions produced numbers of heads in the range 45 to 55.

While in theory it is possible that the number of heads can be anywhere between 0 and 100, the simulation shows that the range of probable values is much smaller.

This is an instance of a more general phenomenon about the variability in coin tossing, as we will see later in the course.
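The simulated percents can be compared with exact chances. The sketch below is not part of the lab's method: it assumes a fair coin and uses the standard counting argument that each of the 2^100 sequences of tosses is equally likely, and that math.comb(100, k) of them contain exactly k heads. The function name chance_of_heads is hypothetical.

```python
from math import comb

def chance_of_heads(k, n=100):
    """Exact chance of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n

# Chance of exactly 50 heads; compare with the bar at 50 in the histogram.
p_exactly_50 = chance_of_heads(50)

# Chances of landing in the two ranges discussed above (endpoints included).
p_45_to_55 = sum(chance_of_heads(k) for k in range(45, 56))
p_35_to_65 = sum(chance_of_heads(k) for k in range(35, 66))

print(round(p_exactly_50, 4))
print(round(p_45_to_55, 4))
print(round(p_35_to_65, 4))
```

The first value comes out close to 0.08, in line with the roughly 8% seen in the simulation.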

Exercise: Challenge!

Your task is to write Python code that finds the numbers between 1500 and 2700, inclusive, that are divisible by both 5 and 7. Have your code store each such number in an array (call it whatever you want) and then print out the array at the end.

This will require you to use for loops, if statements, and the array manipulation discussed in this notebook. Good luck!


In [ ]: