There are many factors that influence the amplitude of the N400 component. In our study, we are interested in capturing effects that are due to the relationship between the cue and association word. Therefore, we wish to ensure that effects that cannot be attributed to this relationship do not play a large role in our results.
In this notebook we look at the following variables that can have an effect on the amplitude of the N400 component:
| Variable | Description |
|---|---|
| length | The number of characters in a word |
| log_freq | The logarithm of a word's frequency of occurrence in a movie subtitle corpus [1] [5] |
| AoA | Estimated age of acquisition of a word [3] [2] |
| rt | The mean reaction time of participants performing a lexical decision task on the word [4] [5] |
These variables were obtained from the French and Dutch Lexicon Projects and related lexical databases:
[1] Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650. http://doi.org/10.3758/BRM.42.3.643
[2] Ferrand, L., Bonin, P., Méot, A., Augustinova, M., New, B., Pallier, C., & Brysbaert, M. (2008). Age-of-acquisition and subjective frequency estimates for all generally known monosyllabic French words and their relation with other psycholinguistic variables. Behavior Research Methods, 40(4), 1049–1054. http://doi.org/10.3758/BRM.40.4.1049
[3] Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84. http://doi.org/10.1016/j.actpsy.2014.04.010
[4] Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1(174). http://doi.org/10.3389/fpsyg.2010.00174
[5] Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., … Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. http://doi.org/10.3758/BRM.42.2.488
In [1]:
# Module for loading and manipulating tabular data
import pandas as pd
# Bring in a bridge to R for statistics
import rpy2
%load_ext rpy2.ipython
# The R code at the bottom produces some harmless warnings that clutter up the page.
# This disables printing of the warnings. When modifying this notebook, you may want to turn
# this back on.
import warnings
warnings.filterwarnings('ignore')
# For pretty display of tables
from IPython.display import display
Our stimulus set consisted of 14 words in Dutch and 14 words in French. Throughout the experiment, each word occurred multiple times, both as cue and as association. Words were presented to the subjects in their native language, so a subject saw either the Dutch or the French version of each word.
In [2]:
# Load the psycholinguistic variables for our vocabulary
relevant_columns = ['word', 'language', 'length', 'log_freq', 'AoA', 'rt']
psych_ling = pd.read_csv('psycho_linguistic_variables.csv', index_col=['word', 'language'], usecols=relevant_columns)
# Show the table
display(psych_ling)
In the above table, you can see that not all variables are available for all words. There are some missing values, marked as NaN.
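A quick way to count the missing values per variable is shown below (a small optional check, not part of the original analysis):
# Count the number of missing (NaN) values for each psycholinguistic variable
display(psych_ling.isna().sum())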
Next, we load the data recorded during our experiment and annotate the cue and association words with the psycholinguistic variables. Each row in the table corresponds to one trial of the experiment: first the cue word was shown, then the association word, then a response cue that prompted the participant to press button 0 if the words were unrelated or button 1 if they were related. The N400 column contains the estimated N400 amplitude evoked by the presentation of the association word, z-scored within each subject.
In [3]:
# Load the N400 amplitudes recorded during our experiment
relevant_columns = ['subject', 'cue', 'association', 'association-english', 'language', 'button', 'N400']
n400 = pd.read_csv('data.csv', usecols=relevant_columns)
# Annotate the data with the psycholinguistic variables, for both the cue and association words
n400 = n400.join(psych_ling, on=['cue', 'language'])
n400 = n400.join(psych_ling, on=['association', 'language'], lsuffix='-cue', rsuffix='-association')
# Show the top 10 rows
display(n400.head(10))
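For reference, the within-subject z-scoring mentioned above could be performed along the following lines. This is a minimal sketch, not part of the analysis: data.csv already contains z-scored amplitudes, so the raw-amplitude column ('N400_raw') used here is hypothetical.
# Within-subject z-scoring sketch (assumes a hypothetical 'N400_raw' column of raw amplitudes)
raw = n400.groupby('subject')['N400_raw']
n400['N400'] = (n400['N400_raw'] - raw.transform('mean')) / raw.transform('std')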
We do an initial statistical test for the effect of each psycholinguistic variable on the amplitude of the N400 component.
In [4]:
%%R -i n400
# Load the linear mixed effects library
library('lme4')
library('lmerTest')
# This function will test whether variable "var" has a significant effect on the amplitude of the N400 component
test <- function(var) {
    # Assemble the regression formula
    formula <- paste("N400 ~ ", var, " + (", var, " | subject) + (", var, " | language)", sep = "")
    # Fit the LME model
    m <- lmer(formula, data = n400)
    # Extract the stats related to the slope (second row; the first row is the intercept)
    coeff <- summary(m)$coefficients[2, ]
    # Sometimes the estimation of the degrees of freedom fails, in which case we fall back on the number of subjects (16).
    if (!"df" %in% names(coeff)) {
        print(paste0("Failed to estimate df for ", var, ", defaulting to 16"))
        coeff["df"] <- 16
        # Two-tailed p-value from the t-statistic with the fallback degrees of freedom
        coeff["Pr(>|t|)"] <- 2 * pt(abs(coeff["t value"]), 16, lower.tail = FALSE)
        # Reorder the entries to match the columns of the results data frame assembled below
        coeff <- coeff[c(1, 2, 4, 3, 5)]
    }
    return(coeff)
}
# This function will iteratively test a list of variables
test.multiple <- function(vars) {
    # Data frame in which the results will be collected
    stats <- data.frame(effect.size = numeric(),
                        std.error = numeric(),
                        estimated.df = numeric(),
                        t.value = numeric(),
                        p.value = numeric())
    # Test all variables one by one
    for (var in vars) {
        stats[var, ] <- test(var)
    }
    return(stats)
}
# A list of the psycho-linguistic variables
vars = c('length.cue', 'length.association', 'log_freq.cue', 'log_freq.association',
'AoA.cue', 'AoA.association', 'rt.cue', 'rt.association')
# Test all the variables
print(test.multiple(vars), digits=3)
In the experimental paradigm used by our proposed method, the psycholinguistic variables can have a small effect on the amplitude of the N400 component, although in the current study none of these effects passed the significance threshold.
However, in our proposed method, we are not interested in the amplitude of the N400 potential evoked by a single word pair. Instead, we are interested in the relative change in the amplitude of this component as a target word is paired with different cue words. Given the set $S$ of all words used in the study (here, we regard the Dutch and French translations as the same word), the amplitude of the N400 component evoked by the word pair $a \in S$ and $b \in S$ is denoted $N_{400}(a, b)$, and the distance between the words, denoted $d(a, b)$, is quantified as:
$$ d(a, b) = N_{400}(a, b) - \frac{1}{n} \sum_{w \in S} N_{400}(w, b) \, , $$

where $n$ is the total number of words used in the study. Since a word was never paired with itself during our study, an actual measurement of the amplitude of the N400 component is missing for this case. We therefore assume $d(b, b) = 0$.
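As a purely illustrative example with made-up numbers: suppose a word $b$ evoked amplitudes of $-1.2$, $-0.4$ and $0.1$ when paired with three different cues. The mean over these cues is $-0.5$, so the distance for the first pair is $d(a, b) = -1.2 - (-0.5) = -0.7$, meaning this pair evoked a more negative N400 response than average for word $b$.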
In [5]:
# Transform the "raw" N400 amplitudes into distance measurements according to the equation above
n400['N400'] -= n400.groupby(['subject', 'association-english'])['N400'].transform('mean')
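As a quick sanity check (not part of the original analysis), the mean N400 within each subject/association group should now be numerically zero:
# The per-group means should all be (numerically) zero after the transformation
print(n400.groupby(['subject', 'association-english'])['N400'].mean().abs().max())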
The relative change in the amplitude of the N400 component is a measure that is robust against effects that are unrelated to the relationship between the cue and association word, such as the psycholinguistic variables we tested before. In particular, any additive contribution that depends only on the association word itself is identical across all cues it is paired with, and therefore cancels out when the per-association mean is subtracted.
In [6]:
%%R -i n400
# Re-test all the variables against the distance metric, rather than the "raw" N400 amplitude
print(test.multiple(vars), digits=3)
The estimated effect sizes of all psycholinguistic variables on the distance metric are now very small.