Analysis of the effect of psycho linguistic variables

There are many factors that influence the amplitude of the N400 component. In our study, we are interested in capturing effects that are due to the relationship between the cue and association word. Therefore, we wish to ensure that effects that cannot be attributed to this relationship do not play a large role in our results.

In this notebook we look at the following variables that can have an effect on the amplitude of the N400 component:

Variable Description
length The number of characters of a word
log_freq The logarithm of the frequency of occurrence of a word in a movie subtitle corpus [1] [5]
AoA Estimated age of acquisition of a word [3] [2]
rt The mean reaction time of participants performing a lexical descision task on a word [4] [5]

These variables were obtained through the French and Dutch Lexicon projects:

[1] Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650. http://doi.org/10.3758/BRM.42.3.643

[2] Rijn, V., Moor, D., French, I., Ferrand, L., Bonin, P., Méot, A., … Brysbaert, M. (2008). Age-of-acquisition and subjective frequency estimates for all generally known monosyllabic French words and their relation with other psycholinguistic variables. Behavior Research Methods, 40(4), 1049–1054. http://doi.org/10.3758/BRM.40.4.1049

[3] Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84. http://doi.org/10.1016/j.actpsy.2014.04.010

[4] Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 dutch mono-and disyllabic words and nonwords. Frontiers in Psychology, 1(174). http://doi.org/10.3389/fpsyg.2010.00174

[5] Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., … Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. http://doi.org/10.3758/BRM.42.2.488


In [1]:
# Module for loading and manipulating tabular data
import pandas as pd

# Bring in a bridge to R for statistics
import rpy2
%load_ext rpy2.ipython.rmagic

# The R code at the bottom produces some harmless warnings that clutter up the page.
# This disables printing of the warnings. When modifying this notebook, you may want to turn
# this back on.
import warnings
warnings.filterwarnings('ignore')

# For pretty display of tables
from IPython.display import display

Our stimulus set consisted of 14 words in Dutch and 14 words in French. Throughout the experiment, each word occured multiple times as cue and association. Words were presented to the subjects in their native language, so a subject would either be presented with the Dutch or the French version.


In [2]:
# Load the psycholinguistic variables for our vocabulary
relevant_columns = ['word', 'language', 'length', 'log_freq', 'AoA', 'rt']
psych_ling = pd.read_csv('psycho_linguistic_variables.csv', index_col=['word', 'language'], usecols=relevant_columns)

# Show the table
display(psych_ling)


length log_freq AoA rt
word language
bed NL 3.0 4.020900 3.762500 572.720000
bureau NL 6.0 3.466600 6.555556 550.970000
deur NL 4.0 4.034300 4.444907 507.490000
giraf NL 5.0 1.505100 5.911420 605.220000
kast NL 4.0 3.118900 4.770833 515.550000
leeuw NL 5.0 2.808900 5.160544 506.710000
neushoorn NL 9.0 2.041400 6.811111 618.590000
nijlpaard NL 9.0 1.869200 6.547059 641.460000
olifant NL 7.0 2.721000 5.075000 NaN
stoel NL 5.0 3.350200 3.947024 557.670000
tafel NL 5.0 3.562100 4.034167 509.550000
tijger NL 6.0 2.709300 6.206250 594.460000
zebra NL 5.0 2.130300 6.148148 569.440000
zetel NL 5.0 2.130300 4.872500 552.560000
lit FR 3.0 2.279484 3.778643 606.160000
bureau FR 6.0 2.195014 NaN 606.727273
porte FR 5.0 2.581426 4.246941 747.628440
girafe FR 6.0 0.432969 NaN 612.250000
placard FR 7.0 1.284656 NaN 610.040000
lion FR 4.0 1.163758 4.387097 616.363636
rhinocéros FR 10.0 0.399674 NaN 753.250000
hippopotame FR 11.0 0.410000 NaN NaN
éléphant FR 8.0 1.007321 NaN 630.040000
chaise FR 6.0 1.514548 4.031146 590.750000
table FR 5.0 2.049140 4.069048 571.250000
tigre FR 5.0 1.046885 4.994048 631.120000
zèbre FR 5.0 0.428135 7.414905 737.360000
canapé FR 6.0 1.246991 NaN 715.787234

In the above table, you can see that not all variables are available for all words. There are some missing values, marked as NaN. Next, we load in the data recorded during our experiment and annotate the cue and association words with the linguistic variables. Each row in the table corresponds to one trial in the experiment, where first the cue word was shown, then the association word, then a response cue which prompted the participant to press button 0 if the words were unrelated or button 1 if they were. The N400 column contains the estimated N400 amplitude evoked by the presentation of the association word (z-scored within-subject).


In [3]:
# Load the N400 amplitudes recorded during our experiment
relevant_columns = ['subject', 'cue', 'association', 'association-english', 'language', 'button', 'N400']
n400 = pd.read_csv('data.csv')

# Annotate the data with the psycholinguistic variables, for both the cue and association words
n400 = n400.join(psych_ling, on=['cue', 'language'])
n400 = n400.join(psych_ling, on=['association', 'language'], lsuffix='-cue', rsuffix='-association')

# Show the top 19 rows
display(n400.head(10))


cue-english association-english subject cue association language button N400 length-cue log_freq-cue AoA-cue rt-cue length-association log_freq-association AoA-association rt-association
0 zebra couch subject01 zebra zetel NL 1 0.043147 5.0 2.1303 6.148148 569.44 5.0 2.1303 4.872500 552.56
1 couch hippopotamus subject01 zetel nijlpaard NL 1 -0.725864 5.0 2.1303 4.872500 552.56 9.0 1.8692 6.547059 641.46
2 giraffe closet subject01 giraf kast NL 1 0.252211 5.0 1.5051 5.911420 605.22 4.0 3.1189 4.770833 515.55
3 desk tiger subject01 bureau tijger NL 1 0.563608 6.0 3.4666 6.555556 550.97 6.0 2.7093 6.206250 594.46
4 table rhinoceros subject01 tafel neushoorn NL 1 -0.765238 5.0 3.5621 4.034167 509.55 9.0 2.0414 6.811111 618.59
5 elephant couch subject01 olifant zetel NL 1 0.041667 7.0 2.7210 5.075000 NaN 5.0 2.1303 4.872500 552.56
6 zebra lion subject01 zebra leeuw NL 0 -0.530273 5.0 2.1303 6.148148 569.44 5.0 2.8089 5.160544 506.71
7 chair tiger subject01 stoel tijger NL 1 0.189504 5.0 3.3502 3.947024 557.67 6.0 2.7093 6.206250 594.46
8 lion door subject01 leeuw deur NL 1 1.515943 5.0 2.8089 5.160544 506.71 4.0 4.0343 4.444907 507.49
9 desk elephant subject01 bureau olifant NL 1 1.090760 6.0 3.4666 6.555556 550.97 7.0 2.7210 5.075000 NaN

We do an initial statistical test for the effect of each psycholinguistic variable on the amplitude of the N400 component.


In [4]:
%%R -i n400

# Load the linear mixed effects library
library('lme4')
library('lmerTest')

# This function will test whether variable "var" has a significant effect on the amplitude of the N400 component
test <- function(var) {
    # Assemble the regression formula
    formula <- paste("N400 ~ ", var, " + (", var, " | subject) + (", var, " | language)", sep = "")
    
    # Fit the LME model
    m <- lmer(formula, data=n400)

    # Extract the stats related to the slope
    coeff = summary(m)$coefficients[1,]
    
    # Sometimes the estimation of the degrees of freedom fails. In which case use the number of subjects.
    if(! "df" %in% names(coeff)) {
        print(paste("Failed to estimate ddof for", var, ", defaulting to 16", sep=" "))
        coeff["df"] <- 16
        coeff["Pr(>|t|)"] <- pt(coeff["t value"], 16)
        coeff <- coeff[c(1, 2, 4, 3, 5)]
    }
    return(coeff)
}

# This function will iteratively test a list of variables
test.multiple <- function(vars) {
    # Data frame in which the results will be collected
    stats <- data.frame(effect.size = numeric(),
                        std.error = numeric(),
                        estimated.df = numeric(),
                        t.value = numeric(),
                        p.value = numeric())

    # Test all variables one by one
    for(var in vars) {
        stats[var,] <- test(var)
    }
    
    return(stats)
}

# A list of the psycho-linguistic variables
vars = c('length.cue', 'length.association', 'log_freq.cue', 'log_freq.association',
         'AoA.cue', 'AoA.association', 'rt.cue', 'rt.association')

# Test all the variables
print(test.multiple(vars), digits=3)


                     effect.size std.error estimated.df t.value p.value
length.cue                0.0169    0.0603       2910.0   0.280   0.779
length.association        0.0970    0.0615         19.1   1.577   0.131
log_freq.cue             -0.0190    0.0434       2910.0  -0.437   0.662
log_freq.association     -0.0122    0.0434       2910.0  -0.281   0.779
AoA.cue                  -0.0155    0.1007       2364.0  -0.154   0.877
AoA.association           0.0717    0.1005       2364.0   0.714   0.475
rt.cue                   -0.1016    0.2603         67.9  -0.391   0.697
rt.association            0.2992    0.3123        233.6   0.958   0.339

In the experimental paradigm used by our proposed method, the psycho-linguistic variables can have a small effect on the amplitude of the N400 component, even though in the case of the current study, no effect passed the significance threshold.

However, in our proposed method, we are not interested in the amplitude of the N400 potential evoked by a single word-pair. Instead, we are interested in the relative change in the amplitude of this component, as a target word is paired with different cue words. Given set $S$ of all words used in the study (here, we regard the Dutch and French translations as the same word), the amplitude of the N400 component evoked by word-pair $a \in S$ and $b \in S$ is denoted $N_{400}(a, b)$ and the distance between the words, denoted $d(a, b)$, is quantified as:

$$ d(a, b) = N_{400}(a, b) - \frac{1}{n} \sum_{w \in S} N_{400}(w, b) \, , $$

where $n$ is the total number of words used in the study. Since a word was never paired with itself during our study, an actual measurement of the amplitude of the N400 component is missing for this case. We therefore assume $d(b, b) = 0$.


In [5]:
# Transform the "raw" N400 amplitudes into distance measurements according to the equation above
n400['N400'] -= n400.groupby(['subject', 'association-english'])['N400'].transform('mean')

The relative change in amplitude of the N400 component is a measure that is robust against effects that are unrelated to the relationship between the cue and association word, such as the psycho-linguistic variables we tested before.


In [6]:
%%R -i n400

# Re-test all the variables against the distance metric, rather than the "raw" N400 amplitude
print(test.multiple(vars), digits=3)


[1] "Failed to estimate ddof for AoA.cue , defaulting to 16"
[1] "Failed to estimate ddof for rt.cue , defaulting to 16"
                     effect.size std.error estimated.df   t.value p.value
length.cue              2.44e-02    0.0581         2910  4.19e-01   0.675
length.association     -3.67e-17    0.0581         2910 -6.32e-16   1.000
log_freq.cue           -1.99e-02    0.0418         2910 -4.76e-01   0.634
log_freq.association   -1.34e-17    0.0418         2910 -3.21e-16   1.000
AoA.cue                -1.00e-02    0.0970           16 -1.03e-01   0.460
AoA.association        -1.17e-16    0.0972         2364 -1.21e-15   1.000
rt.cue                 -7.33e-02    0.2665           16 -2.75e-01   0.393
rt.association          5.71e-16    0.2492          121  2.29e-15   1.000

The estimated effect sizes of all phycho-linguistic variables on the distance metric are now all very small.