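The independent and dependent variables of the experiment are:
Independent: the word/color condition (Congruent or Incongruent)
Dependent: the time taken to say the name of the color each word is printed in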
Our starting data consists of two samples gathered from the same test (the time taken to say the name of the color a given word is printed in) applied under two conditions: one for Congruent word/colors (the word names the color it is printed in, e.g. the word "blue" printed in blue) and one for Incongruent word/colors (the word names a different color than the one it is printed in, e.g. the word "blue" printed in red).
From the sampled data, we want to infer whether the mean time taken to say a Congruent word/color differs from the mean time taken to say an Incongruent word/color (we expect the Congruent time to be lower).
Let Con denote the Congruent words, Incon the Incongruent words, and Diff the difference between Con and Incon (Con - Incon). Then we have:
H0 (HNULL): muCon = muIncon <=> muDiff = 0
Ha (HALTERNATIVE): muCon != muIncon <=> muDiff != 0
HNULL hypothesis: The population mean time it takes to say the correct ink color in the Congruent condition is equal to the population mean time it takes to say the correct ink color in the Incongruent condition.
HALTERNATIVE hypothesis: The population mean time it takes to say the correct ink color in the Congruent condition is different from the population mean time it takes to say the correct ink color in the Incongruent condition.
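I will be performing a two-tailed Dependent T-Test because: the two samples are paired (the same test measured under both conditions, 24 pairs giving 23 degrees of freedom), the population standard deviation is unknown and has to be estimated from the sample, and the alternative hypothesis is non-directional (muDiff != 0).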
I will evaluate the results at a confidence level of 99% (alpha = 0.01, two-tailed T-Critical value of 2.807 for 23 degrees of freedom).
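The T-Critical value above is normally read from a t-table. As a rough cross-check, it can also be approximated numerically. The sketch below uses only the standard library; the helper names (`t_pdf`, `t_cdf`, `t_critical`) and the choice of Simpson's rule plus bisection are my own assumptions for illustration, not part of the original analysis.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density on [0, |x|] with Simpson's rule.

    Uses the symmetry of the distribution around 0; steps must be even.
    """
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """Two-tailed critical value, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2  # e.g. 0.995 for 99% confidence
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

critical = t_critical(0.99, 23)
print(round(critical, 3))  # 2.807, matching the table value
```

With a statistics library available, a single quantile-function call would replace all of this; the hand-rolled version is only meant to show where the number comes from.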
I expect to reject the HNULL hypothesis, which states that the mean time it takes to say the name of the ink colors in the Congruent group is equal to the mean time it takes to say the name of the ink colors in the Incongruent group.
In [132]:
import pandas as pd
import math
%pylab inline
import matplotlib.pyplot as plt
CONGRUENT = 'Congruent'
INCONGRUENT = 'Incongruent'
TCRITICAL = 2.807 # two-tailed difference with 99% Confidence and Degree of Freedom of 23
In [155]:
# Helper functions are defined first so they exist before they are called.
def mean(data):
    # Arithmetic mean of a sequence of numbers.
    return sum(data) / len(data)

def valuesMinusMean(data):
    # Deviation of each value from the mean of the data.
    meanOfData = mean(data)
    return [value - meanOfData for value in data]

def valuesToPower(data, power):
    return [value ** power for value in data]

def variance(data):
    # Expects the squared deviations; divides by n - 1 (sample variance).
    return sum(data) / (len(data) - 1)

def standardDeviation(variance):
    return math.sqrt(variance)

path = r'~/udacity-data-analyst-nanodegree/P1/stroopdata.csv'
initialData = pd.read_csv(path)
# Paired differences (Congruent - Incongruent), one per row.
dataDifference = [con - incon for con, incon in zip(initialData[CONGRUENT], initialData[INCONGRUENT])]
congruentMean = mean(initialData[CONGRUENT])
incongruentMean = mean(initialData[INCONGRUENT])
differenceMean = mean(dataDifference)
In [156]:
print('Mean of Congruent values:', congruentMean)
print('Mean of Incongruent values:', incongruentMean)
print('Mean of Difference values:', differenceMean)
print()
print('Range of Congruent values:', max(initialData[CONGRUENT]) - min(initialData[CONGRUENT]))
print('Range of Incongruent values:', max(initialData[INCONGRUENT]) - min(initialData[INCONGRUENT]))
print('Range of Difference values:', max(dataDifference) - min(dataDifference))
print()
print('Standard Deviation of Congruent values:', standardDeviation(variance(valuesToPower(valuesMinusMean(initialData[CONGRUENT]), 2))))
print('Standard Deviation of Incongruent values:', standardDeviation(variance(valuesToPower(valuesMinusMean(initialData[INCONGRUENT]), 2))))
print('Standard Deviation of Difference values:', standardDeviation(variance(valuesToPower(valuesMinusMean(dataDifference), 2))))
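The standard deviation above is assembled by hand from several helpers (deviations, squares, division by n - 1, square root). A quick sanity check is to compare that pipeline against Python's `statistics` module. The sketch below does this on a small made-up sample, not on the Stroop data; `sample_std` is a hypothetical condensed version of the helpers in this notebook.

```python
import math
import statistics

def sample_std(data):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    mean = sum(data) / len(data)
    squared_deviations = [(value - mean) ** 2 for value in data]
    return math.sqrt(sum(squared_deviations) / (len(data) - 1))

# Made-up values, used only to compare the two implementations.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

hand_rolled = sample_std(sample)
library = statistics.stdev(sample)  # also divides by n - 1
value_range = max(sample) - min(sample)

print('Hand-rolled:', hand_rolled)
print('statistics.stdev:', library)
print('Range:', value_range)
```

If the two standard deviations agree on a few such samples, the hand-written helpers can be trusted on the real data as well.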
In [144]:
plt.hist(
    x=[initialData[CONGRUENT], initialData[INCONGRUENT]],
    density=False,  # replaces 'normed', which was removed in Matplotlib 3.1
    range=(min(initialData[CONGRUENT]), max(initialData[INCONGRUENT])),
    bins=10,
    label=[CONGRUENT, INCONGRUENT]  # one label per dataset
)
Out[144]:
In [145]:
plt.hist(
    x=initialData[CONGRUENT],
    density=False,  # replaces 'normed', which was removed in Matplotlib 3.1
    range=(min(initialData[CONGRUENT]), max(initialData[CONGRUENT])),
    bins=10,
    label='Time to name'
)
Out[145]:
In [146]:
plt.hist(
    x=initialData[INCONGRUENT],
    density=False,  # replaces 'normed', which was removed in Matplotlib 3.1
    range=(min(initialData[INCONGRUENT]), max(initialData[INCONGRUENT])),
    bins=10,
    label='Time to name',
    color='Green'
)
Out[146]:
In [147]:
plt.hist(
    x=dataDifference,
    density=False,  # replaces 'normed', which was removed in Matplotlib 3.1
    range=(min(dataDifference), max(dataDifference)),
    bins=10,
    label='Time to name',
    color='Red'
)
Out[147]:
From the histograms of the Congruent and Incongruent datasets we can see that the Incongruent dataset contains more high time-to-name values than the Congruent dataset.
This is consistent with the previously calculated means of the two datasets (14.051125 and 22.0159166667 for Congruent and Incongruent, respectively).
In [159]:
degreesOfFreedom = len(initialData[CONGRUENT]) - 1

def standardError(standardDeviation, sampleSize):
    return standardDeviation / math.sqrt(sampleSize)

def getTValue(mean, se):
    return mean / se

def marginOfError(t, standardError):
    return t * standardError

def getConfidenceInterval(mean, t, standardError):
    margin = marginOfError(t, standardError)
    return (mean - margin, mean + margin)

se = standardError(standardDeviation(variance(valuesToPower(valuesMinusMean(dataDifference), 2))), len(dataDifference))
tValue = getTValue(differenceMean, se)
# Two-tailed test: the critical region covers both tails, so compare |t| with the critical value.
inCriticalRegion = abs(tValue) >= TCRITICAL

print('Degrees of Freedom:', degreesOfFreedom)
print('Standard Error:', se)
print('T Value:', tValue)
print('T Critical Regions: Less than', -TCRITICAL, 'and Greater than', TCRITICAL)
print('Is the T Value inside of the critical region?', inCriticalRegion)
print('Is p < 0.01?', inCriticalRegion)  # two-tailed alpha for 99% confidence
print('Confidence Interval:', getConfidenceInterval(differenceMean, TCRITICAL, se))
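For reference, the quantities computed in the cell above can be written out as formulas. The notation ($d_i$, $\bar{d}$, $s_d$, $n$) is introduced here for illustration and is not used elsewhere in this notebook; $t_{\text{critical}}$ corresponds to `TCRITICAL`.

```latex
\begin{align*}
d_i &= x_{\text{Con},i} - x_{\text{Incon},i}, \qquad i = 1, \dots, n \\
\bar{d} &= \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2} \\
SE &= \frac{s_d}{\sqrt{n}}, \qquad
t = \frac{\bar{d}}{SE}, \qquad
df = n - 1 \\
CI &= \bar{d} \pm t_{\text{critical}} \cdot SE
\end{align*}
```

H0 is rejected in a two-tailed test when $|t| \ge t_{\text{critical}}$.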
Based on the results calculated above, the T Value of the difference between the two conditions (Congruent and Incongruent) falls inside the critical region for a 99% confidence level.
I therefore reject the HNULL hypothesis (H0). Since the T Value falls inside the critical region, the difference is statistically significant and we conclude that muCon != muIncon.
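To make the decision rule concrete, here is a minimal end-to-end sketch of a two-tailed dependent t-test on a tiny made-up sample. The timings are hypothetical and are not the Stroop data; `paired_t_test` is a helper name I introduce for illustration, and the critical value 4.604 is the t-table entry for 99% confidence with 4 degrees of freedom.

```python
import math
import statistics

def paired_t_test(first, second, t_critical):
    """Return (t value, confidence interval, reject_null) for paired samples."""
    if len(first) != len(second):
        raise ValueError('paired samples must have the same length')
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_value = mean_diff / se
    margin = t_critical * se
    interval = (mean_diff - margin, mean_diff + margin)
    # Two-tailed rule: reject H0 when |t| reaches the critical value.
    return t_value, interval, abs(t_value) >= t_critical

# Hypothetical timings in seconds, NOT the Stroop data set.
congruent = [12.0, 15.0, 11.0, 14.0, 13.0]
incongruent = [20.0, 21.0, 18.0, 25.0, 19.0]

t_value, interval, reject = paired_t_test(congruent, incongruent, 4.604)
print('T Value:', t_value)
print('Confidence Interval:', interval)
print('Reject H0?', reject)
```

Note that the whole confidence interval for the mean difference lies below zero in this toy example, which tells the same story as the rejected null hypothesis: the first condition is reliably faster.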
I think the reason behind this effect is that the brain has already associated the name of a color with its visual representation (the actual color). When we are shown the name of a color printed in a different color, our brain can't process the two at the same time (the logical and the creative sides of our brain each give a different response as to what we are seeing).
Similar tasks that should produce similar results include the Spatial Stroop Effect (as described in the Wikipedia article referenced at the bottom), where showing words like Big, Small, Up, Down in different sizes and positions can also trigger this effect.