In this class you are expected to learn:
In [9]:
%matplotlib inline
import io
import matplotlib.pyplot as plt
import pandas as pd
So far, we have used (and taken for granted) concepts such as means, medians and modes, which fall under what we call descriptive statistics. This class is about another type of statistics: inferential statistics. In everyday language, inference means "the action or process of inferring; the drawing of a conclusion from known or assumed facts or statements." Statistical inference is where we draw conclusions from quantitative facts or statements.
To get an idea about inference we will start with examples of non-statistical inference.
So from two accepted facts, one about men (that they are mortal) and one about Socrates (that he was a man), we can come to the conclusion that Socrates was mortal. However, correct facts or observations can sometimes lead to wrong conclusions when poor inference is made.
False premises can lead to false conclusions
In order to draw accurate conclusions we must understand what has gone into creating our statistics and think logically. First, we will examine probability, a key concept in drawing conclusions from statistics.
You might be wondering what applying probability to random events such as dice throws and coin tosses has got to do with the sorts of statistics we might use in humanities research. We make inferences from quantitative data on the basis of probability. So if we see that wheat prices go up when oat prices go up, and wheat prices go down when oat prices go down, it would suggest that there probably is some sort of relationship between the prices of these two commodities that isn't just a coincidence or down to chance. Inferential statistics enable us to explore whether there is a relationship and what the nature of this relationship is (this will be explored in later chapters).
When we perform a statistical test we are looking to see whether there are differences between two or more groups of data, or whether there is no difference. For example, we may wish to investigate whether there are differences in how men and women perform on an exam, or whether soldiers from Town A are taller than soldiers from Town B. Scientists may wish to find out whether treatment A is a better treatment than treatment B.
Activity
The probability of throwing a six on a 6-sided die is calculated as follows: the die has six possible outcomes (1, 2, 3, 4, 5, 6), only one of which is a six, so we take 1 and divide it by 6. 1/6 ≈ 0.167, or 16.7%, or 1 in 6.
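The calculation above can be checked with a short sketch. It first counts favourable outcomes over possible outcomes, then simulates a large number of throws to see whether the observed share of sixes comes close to the calculated value. The names `outcomes`, `p_six`, `n_throws` and `observed`, the fixed seed and the number of throws are illustrative assumptions, not part of the original activity.

```python
from fractions import Fraction
import random

# All equally likely outcomes of a fair 6-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

# Probability = favourable outcomes / possible outcomes.
favourable = [o for o in outcomes if o == 6]
p_six = Fraction(len(favourable), len(outcomes))
print(p_six, round(float(p_six), 3))

# Simulate many throws (seed and sample size are arbitrary choices)
# and compare the observed share of sixes with the calculated value.
random.seed(0)
n_throws = 60_000
throws = [random.choice(outcomes) for _ in range(n_throws)]
observed = throws.count(6) / n_throws
print(round(observed, 3))
```

The simulated share will not match 1/6 exactly, but it should get closer as the number of throws grows.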
When we use quantitative data we are often seeking to demonstrate that there is a link between one set of data and another. We might want to investigate what effect a major historical event had on the price of food, or whether married men use more words on a daily basis than their wives.
In [12]:
raw = """year wheat oats
1830 63.7 23.1
1831 65.7 25.3
1832 58.6 20.5
1833 53.9 18.3
1834 45.9 20.7
1835 29.2 22
1836 48.1 23
1837 55.1 23.1
1838 64.7 22.4
1839 70.3 25.7"""
data = pd.read_csv(io.StringIO(raw), sep=" ", index_col="year")
data
Out[12]:
In the DataFrame above we have data about the price of wheat and the price of oats between 1830 and 1839. What is the relationship between wheat prices and oat prices? When wheat prices are high, are oat prices high too? From the data alone it is difficult to see for sure.
Let's plot the data and see what it looks like.
In [26]:
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(data.wheat, data.oats)
ax.set_xlabel("wheat")
ax.set_ylabel("oats")
Out[26]:
In the figure we have plotted the wheat price against the oat price for each year. We see that there is a pattern, as the points roughly line up. But how do we describe this pattern in more detail? One way is to calculate the Pearson product-moment correlation coefficient of the data. The Pearson product-moment correlation coefficient is a number between -1 and 1, referred to as r or Pearson's r. Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.
In [36]:
from scipy.stats import pearsonr
pearsonr(data.wheat, data.oats)
Out[36]:
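To see what goes into this number, Pearson's r can also be worked out by hand from its usual definition: the sum of the products of each pair's deviations from their means, divided by the square root of the product of the two sums of squared deviations. The sketch below does this with the wheat and oats prices from the table. The function name `pearson_r` and the plain lists `wheat` and `oats` are illustrative assumptions; this is a minimal version and does not return the p-value that `pearsonr` also reports.

```python
import math

# Prices for 1830-1839, copied from the table above.
wheat = [63.7, 65.7, 58.6, 53.9, 45.9, 29.2, 48.1, 55.1, 64.7, 70.3]
oats = [23.1, 25.3, 20.5, 18.3, 20.7, 22, 23, 23.1, 22.4, 25.7]


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations (numerator).
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable (denominator).
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)


r = pearson_r(wheat, oats)
print(round(r, 3))
```

The printed value should agree with the first element of the result returned by `pearsonr`, up to rounding.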