Analysis of the effect of psycho linguistic variables

There are many factors that influence the amplitude of the N400 component. In our study, we are interested in capturing effects that are due to the relationship between the cue and association word. Therefore, we wish to ensure that effects that cannot be attributed to this relationship do not play a large role in our results.

In this notebook we look at the following variables that can have an effect on the amplitude of the N400 component:

Variable	Description
`length`	The number of characters of a word
`log_freq`	The logarithm of the frequency of occurrence of a word in a movie subtitle corpus [1] [5]
`AoA`	Estimated age of acquisition of a word [3] [2]
`rt`	The mean reaction time of participants performing a lexical descision task on a word [4] [5]

These variables were obtained through the French and Dutch Lexicon projects:

[1] Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650. http://doi.org/10.3758/BRM.42.3.643

[2] Rijn, V., Moor, D., French, I., Ferrand, L., Bonin, P., Méot, A., … Brysbaert, M. (2008). Age-of-acquisition and subjective frequency estimates for all generally known monosyllabic French words and their relation with other psycholinguistic variables. Behavior Research Methods, 40(4), 1049–1054. http://doi.org/10.3758/BRM.40.4.1049

[3] Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84. http://doi.org/10.1016/j.actpsy.2014.04.010

[4] Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 dutch mono-and disyllabic words and nonwords. Frontiers in Psychology, 1(174). http://doi.org/10.3389/fpsyg.2010.00174

[5] Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., … Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. http://doi.org/10.3758/BRM.42.2.488



In [1]:

    
# Module for loading and manipulating tabular data
import pandas as pd

# Bring in a bridge to R for statistics
import rpy2
%load_ext rpy2.ipython.rmagic

# The R code at the bottom produces some harmless warnings that clutter up the page.
# This disables printing of the warnings. When modifying this notebook, you may want to turn
# this back on.
import warnings
warnings.filterwarnings('ignore')

# For pretty display of tables
from IPython.display import display

Our stimulus set consisted of 14 words in Dutch and 14 words in French. Throughout the experiment, each word occured multiple times as cue and association. Words were presented to the subjects in their native language, so a subject would either be presented with the Dutch or the French version.



In [2]:

    
# Load the psycholinguistic variables for our vocabulary
relevant_columns = ['word', 'language', 'length', 'log_freq', 'AoA', 'rt']
psych_ling = pd.read_csv('psycho_linguistic_variables.csv', index_col=['word', 'language'], usecols=relevant_columns)

# Show the table
display(psych_ling)









    







  
    
      
      
      length
      log_freq
      AoA
      rt
    
    
      word
      language
      
      
      
      
    
  
  
    
      bed
      NL
      3.0
      4.020900
      3.762500
      572.720000
    
    
      bureau
      NL
      6.0
      3.466600
      6.555556
      550.970000
    
    
      deur
      NL
      4.0
      4.034300
      4.444907
      507.490000
    
    
      giraf
      NL
      5.0
      1.505100
      5.911420
      605.220000
    
    
      kast
      NL
      4.0
      3.118900
      4.770833
      515.550000
    
    
      leeuw
      NL
      5.0
      2.808900
      5.160544
      506.710000
    
    
      neushoorn
      NL
      9.0
      2.041400
      6.811111
      618.590000
    
    
      nijlpaard
      NL
      9.0
      1.869200
      6.547059
      641.460000
    
    
      olifant
      NL
      7.0
      2.721000
      5.075000
      NaN
    
    
      stoel
      NL
      5.0
      3.350200
      3.947024
      557.670000
    
    
      tafel
      NL
      5.0
      3.562100
      4.034167
      509.550000
    
    
      tijger
      NL
      6.0
      2.709300
      6.206250
      594.460000
    
    
      zebra
      NL
      5.0
      2.130300
      6.148148
      569.440000
    
    
      zetel
      NL
      5.0
      2.130300
      4.872500
      552.560000
    
    
      lit
      FR
      3.0
      2.279484
      3.778643
      606.160000
    
    
      bureau
      FR
      6.0
      2.195014
      NaN
      606.727273
    
    
      porte
      FR
      5.0
      2.581426
      4.246941
      747.628440
    
    
      girafe
      FR
      6.0
      0.432969
      NaN
      612.250000
    
    
      placard
      FR
      7.0
      1.284656
      NaN
      610.040000
    
    
      lion
      FR
      4.0
      1.163758
      4.387097
      616.363636
    
    
      rhinocéros
      FR
      10.0
      0.399674
      NaN
      753.250000
    
    
      hippopotame
      FR
      11.0
      0.410000
      NaN
      NaN
    
    
      éléphant
      FR
      8.0
      1.007321
      NaN
      630.040000
    
    
      chaise
      FR
      6.0
      1.514548
      4.031146
      590.750000
    
    
      table
      FR
      5.0
      2.049140
      4.069048
      571.250000
    
    
      tigre
      FR
      5.0
      1.046885
      4.994048
      631.120000
    
    
      zèbre
      FR
      5.0
      0.428135
      7.414905
      737.360000
    
    
      canapé
      FR
      6.0
      1.246991
      NaN
      715.787234

In the above table, you can see that not all variables are available for all words. There are some missing values, marked as NaN. Next, we load in the data recorded during our experiment and annotate the cue and association words with the linguistic variables. Each row in the table corresponds to one trial in the experiment, where first the cue word was shown, then the association word, then a response cue which prompted the participant to press button 0 if the words were unrelated or button 1 if they were. The N400 column contains the estimated N400 amplitude evoked by the presentation of the association word (z-scored within-subject).



In [3]:

    
# Load the N400 amplitudes recorded during our experiment
relevant_columns = ['subject', 'cue', 'association', 'association-english', 'language', 'button', 'N400']
n400 = pd.read_csv('data.csv')

# Annotate the data with the psycholinguistic variables, for both the cue and association words
n400 = n400.join(psych_ling, on=['cue', 'language'])
n400 = n400.join(psych_ling, on=['association', 'language'], lsuffix='-cue', rsuffix='-association')

# Show the top 19 rows
display(n400.head(10))









    







  
    
      
      cue-english
      association-english
      subject
      cue
      association
      language
      button
      N400
      length-cue
      log_freq-cue
      AoA-cue
      rt-cue
      length-association
      log_freq-association
      AoA-association
      rt-association
    
  
  
    
      0
      zebra
      couch
      subject01
      zebra
      zetel
      NL
      1
      0.043147
      5.0
      2.1303
      6.148148
      569.44
      5.0
      2.1303
      4.872500
      552.56
    
    
      1
      couch
      hippopotamus
      subject01
      zetel
      nijlpaard
      NL
      1
      -0.725864
      5.0
      2.1303
      4.872500
      552.56
      9.0
      1.8692
      6.547059
      641.46
    
    
      2
      giraffe
      closet
      subject01
      giraf
      kast
      NL
      1
      0.252211
      5.0
      1.5051
      5.911420
      605.22
      4.0
      3.1189
      4.770833
      515.55
    
    
      3
      desk
      tiger
      subject01
      bureau
      tijger
      NL
      1
      0.563608
      6.0
      3.4666
      6.555556
      550.97
      6.0
      2.7093
      6.206250
      594.46
    
    
      4
      table
      rhinoceros
      subject01
      tafel
      neushoorn
      NL
      1
      -0.765238
      5.0
      3.5621
      4.034167
      509.55
      9.0
      2.0414
      6.811111
      618.59
    
    
      5
      elephant
      couch
      subject01
      olifant
      zetel
      NL
      1
      0.041667
      7.0
      2.7210
      5.075000
      NaN
      5.0
      2.1303
      4.872500
      552.56
    
    
      6
      zebra
      lion
      subject01
      zebra
      leeuw
      NL
      0
      -0.530273
      5.0
      2.1303
      6.148148
      569.44
      5.0
      2.8089
      5.160544
      506.71
    
    
      7
      chair
      tiger
      subject01
      stoel
      tijger
      NL
      1
      0.189504
      5.0
      3.3502
      3.947024
      557.67
      6.0
      2.7093
      6.206250
      594.46
    
    
      8
      lion
      door
      subject01
      leeuw
      deur
      NL
      1
      1.515943
      5.0
      2.8089
      5.160544
      506.71
      4.0
      4.0343
      4.444907
      507.49
    
    
      9
      desk
      elephant
      subject01
      bureau
      olifant
      NL
      1
      1.090760
      6.0
      3.4666
      6.555556
      550.97
      7.0
      2.7210
      5.075000
      NaN

We do an initial statistical test for the effect of each psycholinguistic variable on the amplitude of the N400 component.



In [4]:

    
%%R -i n400

# Load the linear mixed effects library
library('lme4')
library('lmerTest')

# This function will test whether variable "var" has a significant effect on the amplitude of the N400 component
test <- function(var) {
    # Assemble the regression formula
    formula <- paste("N400 ~ ", var, " + (", var, " | subject) + (", var, " | language)", sep = "")
    
    # Fit the LME model
    m <- lmer(formula, data=n400)

    # Extract the stats related to the slope
    coeff = summary(m)$coefficients[1,]
    
    # Sometimes the estimation of the degrees of freedom fails. In which case use the number of subjects.
    if(! "df" %in% names(coeff)) {
        print(paste("Failed to estimate ddof for", var, ", defaulting to 16", sep=" "))
        coeff["df"] <- 16
        coeff["Pr(>|t|)"] <- pt(coeff["t value"], 16)
        coeff <- coeff[c(1, 2, 4, 3, 5)]
    }
    return(coeff)
}

# This function will iteratively test a list of variables
test.multiple <- function(vars) {
    # Data frame in which the results will be collected
    stats <- data.frame(effect.size = numeric(),
                        std.error = numeric(),
                        estimated.df = numeric(),
                        t.value = numeric(),
                        p.value = numeric())

    # Test all variables one by one
    for(var in vars) {
        stats[var,] <- test(var)
    }
    
    return(stats)
}

# A list of the psycho-linguistic variables
vars = c('length.cue', 'length.association', 'log_freq.cue', 'log_freq.association',
         'AoA.cue', 'AoA.association', 'rt.cue', 'rt.association')

# Test all the variables
print(test.multiple(vars), digits=3)









    





                     effect.size std.error estimated.df t.value p.value
length.cue                0.0169    0.0603       2910.0   0.280   0.779
length.association        0.0970    0.0615         19.1   1.577   0.131
log_freq.cue             -0.0190    0.0434       2910.0  -0.437   0.662
log_freq.association     -0.0122    0.0434       2910.0  -0.281   0.779
AoA.cue                  -0.0155    0.1007       2364.0  -0.154   0.877
AoA.association           0.0717    0.1005       2364.0   0.714   0.475
rt.cue                   -0.1016    0.2603         67.9  -0.391   0.697
rt.association            0.2992    0.3123        233.6   0.958   0.339

In the experimental paradigm used by our proposed method, the psycho-linguistic variables can have a small effect on the amplitude of the N400 component, even though in the case of the current study, no effect passed the significance threshold.

However, in our proposed method, we are not interested in the amplitude of the N400 potential evoked by a single word-pair. Instead, we are interested in the relative change in the amplitude of this component, as a target word is paired with different cue words. Given set $S$ of all words used in the study (here, we regard the Dutch and French translations as the same word), the amplitude of the N400 component evoked by word-pair $a \in S$ and $b \in S$ is denoted $N_{400}(a, b)$ and the distance between the words, denoted $d(a, b)$, is quantified as:

$$ d(a, b) = N_{400}(a, b) - \frac{1}{n} \sum_{w \in S} N_{400}(w, b) \, , $$

where $n$ is the total number of words used in the study. Since a word was never paired with itself during our study, an actual measurement of the amplitude of the N400 component is missing for this case. We therefore assume $d(b, b) = 0$.



In [5]:

    
# Transform the "raw" N400 amplitudes into distance measurements according to the equation above
n400['N400'] -= n400.groupby(['subject', 'association-english'])['N400'].transform('mean')

The relative change in amplitude of the N400 component is a measure that is robust against effects that are unrelated to the relationship between the cue and association word, such as the psycho-linguistic variables we tested before.



In [6]:

    
%%R -i n400

# Re-test all the variables against the distance metric, rather than the "raw" N400 amplitude
print(test.multiple(vars), digits=3)









    





[1] "Failed to estimate ddof for AoA.cue , defaulting to 16"
[1] "Failed to estimate ddof for rt.cue , defaulting to 16"
                     effect.size std.error estimated.df   t.value p.value
length.cue              2.44e-02    0.0581         2910  4.19e-01   0.675
length.association     -3.67e-17    0.0581         2910 -6.32e-16   1.000
log_freq.cue           -1.99e-02    0.0418         2910 -4.76e-01   0.634
log_freq.association   -1.34e-17    0.0418         2910 -3.21e-16   1.000
AoA.cue                -1.00e-02    0.0970           16 -1.03e-01   0.460
AoA.association        -1.17e-16    0.0972         2364 -1.21e-15   1.000
rt.cue                 -7.33e-02    0.2665           16 -2.75e-01   0.393
rt.association          5.71e-16    0.2492          121  2.29e-15   1.000

The estimated effect sizes of all phycho-linguistic variables on the distance metric are now all very small.

		length	log_freq	AoA	rt
word	language
bed	NL	3.0	4.020900	3.762500	572.720000
bureau	NL	6.0	3.466600	6.555556	550.970000
deur	NL	4.0	4.034300	4.444907	507.490000
giraf	NL	5.0	1.505100	5.911420	605.220000
kast	NL	4.0	3.118900	4.770833	515.550000
leeuw	NL	5.0	2.808900	5.160544	506.710000
neushoorn	NL	9.0	2.041400	6.811111	618.590000
nijlpaard	NL	9.0	1.869200	6.547059	641.460000
olifant	NL	7.0	2.721000	5.075000	NaN
stoel	NL	5.0	3.350200	3.947024	557.670000
tafel	NL	5.0	3.562100	4.034167	509.550000
tijger	NL	6.0	2.709300	6.206250	594.460000
zebra	NL	5.0	2.130300	6.148148	569.440000
zetel	NL	5.0	2.130300	4.872500	552.560000
lit	FR	3.0	2.279484	3.778643	606.160000
bureau	FR	6.0	2.195014	NaN	606.727273
porte	FR	5.0	2.581426	4.246941	747.628440
girafe	FR	6.0	0.432969	NaN	612.250000
placard	FR	7.0	1.284656	NaN	610.040000
lion	FR	4.0	1.163758	4.387097	616.363636
rhinocéros	FR	10.0	0.399674	NaN	753.250000
hippopotame	FR	11.0	0.410000	NaN	NaN
éléphant	FR	8.0	1.007321	NaN	630.040000
chaise	FR	6.0	1.514548	4.031146	590.750000
table	FR	5.0	2.049140	4.069048	571.250000
tigre	FR	5.0	1.046885	4.994048	631.120000
zèbre	FR	5.0	0.428135	7.414905	737.360000
canapé	FR	6.0	1.246991	NaN	715.787234

	cue-english	association-english	subject	cue	association	language	button	N400	length-cue	log_freq-cue	AoA-cue	rt-cue	length-association	log_freq-association	AoA-association	rt-association
0	zebra	couch	subject01	zebra	zetel	NL	1	0.043147	5.0	2.1303	6.148148	569.44	5.0	2.1303	4.872500	552.56
1	couch	hippopotamus	subject01	zetel	nijlpaard	NL	1	-0.725864	5.0	2.1303	4.872500	552.56	9.0	1.8692	6.547059	641.46
2	giraffe	closet	subject01	giraf	kast	NL	1	0.252211	5.0	1.5051	5.911420	605.22	4.0	3.1189	4.770833	515.55
3	desk	tiger	subject01	bureau	tijger	NL	1	0.563608	6.0	3.4666	6.555556	550.97	6.0	2.7093	6.206250	594.46
4	table	rhinoceros	subject01	tafel	neushoorn	NL	1	-0.765238	5.0	3.5621	4.034167	509.55	9.0	2.0414	6.811111	618.59
5	elephant	couch	subject01	olifant	zetel	NL	1	0.041667	7.0	2.7210	5.075000	NaN	5.0	2.1303	4.872500	552.56
6	zebra	lion	subject01	zebra	leeuw	NL	0	-0.530273	5.0	2.1303	6.148148	569.44	5.0	2.8089	5.160544	506.71
7	chair	tiger	subject01	stoel	tijger	NL	1	0.189504	5.0	3.3502	3.947024	557.67	6.0	2.7093	6.206250	594.46
8	lion	door	subject01	leeuw	deur	NL	1	1.515943	5.0	2.8089	5.160544	506.71	4.0	4.0343	4.444907	507.49
9	desk	elephant	subject01	bureau	olifant	NL	1	1.090760	6.0	3.4666	6.555556	550.97	7.0	2.7210	5.075000	NaN