[Data, the Humanist's New Best Friend](index.ipynb)
*Class 08*

In this class you are expected to learn:

  1. Statistical modeling
  2. statsmodels
  3. Correlation
  4. Regression
  5. Distributions
*True that*

In [9]:
%matplotlib inline
import io
import matplotlib.pyplot as plt
import pandas as pd

Inferential Statistics

So far, we have used (and taken for granted) concepts such as means, medians and modes, which fall under what we call descriptive statistics. This class is about another type of statistics, inferential statistics. Used in everyday language, inference means "the action or process of inferring; the drawing of a conclusion from known or assumed facts or statements." Statistical inference is where we draw conclusions from quantitative facts or statements.
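
As a quick refresher on the descriptive side, here is a minimal sketch of those three measures in pandas, using a small made-up series of values:

In [ ]:
# A small made-up sample, for illustration only
values = pd.Series([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(values.mean())    # arithmetic mean: 6.0
print(values.median())  # middle value when sorted: 7.0
print(values.mode())    # most frequent value(s): 8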

To get an idea about inference we will start with examples of non-statistical inference.

  1. All men are mortal (an observed and accepted fact)
  2. Socrates was a man (an observed and accepted fact)
  3. Therefore, Socrates was mortal (a conclusion based on the fact that Socrates was a man and all men are mortal).

So from two accepted facts about men (that they are mortal) and Socrates (that he was a man), we can come to the conclusion that Socrates was mortal. However, sometimes correct facts or observations can lead to wrong conclusions when poor inference is made.

  1. Margaret Thatcher was Prime Minister of the UK (an observed and accepted fact)
  2. Margaret Thatcher was a woman (an observed and accepted fact)
  3. Therefore all Prime Ministers of the UK are women (a conclusion based on the facts that Margaret Thatcher was Prime Minister of the UK and that Margaret Thatcher was a woman).

False premises can lead to false conclusions

  1. Hillary Clinton is a woman (an observed and accepted fact)
  2. Women are not allowed to be President of the USA (a false statement)
  3. Therefore Hillary Clinton is not allowed to become President of the USA (a false conclusion from one correct and one incorrect premise).

In order to make accurate conclusions we must understand what has gone into creating our statistics and think logically. First, we will examine the concept of probability, a key concept in drawing conclusions from statistics.

Why probability theory matters

You might be wondering what applying probability to random events such as dice throws and coin tosses has to do with the sorts of statistics we might use in humanities research. We make inferences from quantitative data on the basis of probability. So if we see that wheat prices go up when oat prices go up, and wheat prices go down when oat prices go down, it would suggest that there probably is some sort of relationship between the prices of these two commodities that isn't just a coincidence or down to chance. Inferential statistics enable us to explore whether there is a relationship and what the nature of this relationship is (this will be explored in later classes).

When we perform a statistical test we are looking to see whether there are differences between two or more groups of data, or whether there is no difference. For example, we may wish to investigate whether there are differences in how men and women perform on an exam, or whether soldiers from Town A are taller than soldiers from Town B. Scientists may wish to find out whether treatment A is better than treatment B.
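
A minimal sketch of such a test, assuming two small made-up samples of soldier heights in centimetres and using scipy's independent-samples t-test, might look like this:

In [ ]:
from scipy.stats import ttest_ind

# Made-up heights (cm), for illustration only
town_a = [170, 172, 168, 175, 171, 169]
town_b = [166, 168, 165, 170, 167, 166]

# Tests the null hypothesis that both towns have the same mean height
t_stat, p_value = ttest_ind(town_a, town_b)
print(t_stat, p_value)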

Activity

The probability of throwing a six on a 6-sided die is calculated as follows: the die has six possible outcomes (1, 2, 3, 4, 5, 6), so we take 1 and divide it by 6. 1/6 ≈ 0.167, or 16.7%, or 1 in 6.

  1. If we wanted to calculate the probability of two coins both being heads, we multiply the two probabilities together (see the sketch after this list). Similarly, what if we wanted to know the probability of throwing 2 sixes on two dice?
  2. What about getting all sixes on six dice?
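
Here is a minimal sketch of that multiplication rule, worked through for the coin example (the dice questions are left as the activity):

In [ ]:
# Two independent fair coins: multiply the individual probabilities
p_heads = 1 / 2
p_two_heads = p_heads * p_heads
print(p_two_heads)  # 0.25, or 1 in 4

# The same rule applies to independent dice: multiply 1/6 once per die.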

Making connections

When we use quantitative data we are often seeking to demonstrate that there is a link between one set of data and another. We might want to investigate what effect a major historical event had on the price of food, or whether married men use more words on a daily basis than their wives.


In [12]:
raw = """year wheat oats
1830 63.7 23.1
1831 65.7 25.3
1832 58.6 20.5
1833 53.9 18.3
1834 45.9 20.7
1835 29.2 22
1836 48.1 23
1837 55.1 23.1
1838 64.7 22.4
1839 70.3 25.7"""
data = pd.read_csv(io.StringIO(raw), sep=" ", index_col="year")
data


Out[12]:
wheat oats
year
1830 63.7 23.1
1831 65.7 25.3
1832 58.6 20.5
1833 53.9 18.3
1834 45.9 20.7
1835 29.2 22.0
1836 48.1 23.0
1837 55.1 23.1
1838 64.7 22.4
1839 70.3 25.7

In the DataFrame above we have data about the price of wheat and the price of oats between 1830 and 1839. What is the relationship between wheat prices and oat prices? When wheat prices are high, are oat prices high too? From the data alone it is difficult to tell for sure.

Let's plot the data and see what it looks like.


In [26]:
fig = plt.figure(figsize=(12, 6))

# Scatter plot of wheat price against oat price, one point per year
ax = fig.add_subplot(1, 1, 1)
ax.scatter(data.wheat, data.oats)
ax.set_xlabel("wheat")
ax.set_ylabel("oats")


Out[26]:
[Scatter plot of wheat prices (x-axis) against oat prices (y-axis)]

In the figure we have plotted the wheat price against the oat price for each year. We see that there is a pattern, as the points sort of line up. But how do we describe this pattern in more detail? One way is to calculate the Pearson product-moment correlation coefficient of the data. The Pearson product-moment correlation coefficient is a number between -1 and 1, referred to as r or Pearson's r.
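
For reference, for paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, Pearson's r is defined as:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

A value of r close to 1 means the points line up along a rising line, close to -1 along a falling line, and close to 0 means there is little or no linear relationship.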


In [36]:
from scipy.stats import pearsonr

pearsonr(data.wheat, data.oats)


Out[36]:
(0.44223624234191028, 0.20063669545215068)
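
pearsonr returns two numbers: the coefficient r itself (about 0.44, a moderate positive correlation) and a p-value (about 0.20, large enough that with only ten observations the pattern could still be down to chance). As a sanity check, pandas can compute the same r directly:

In [ ]:
# Pearson's r straight from pandas; should match the first value above
data.wheat.corr(data.oats)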

For the next class

Next class will be class 11.