This document

https://goo.gl/mNr03x

In search of national stereotypes

Nations and countries

Get a list of countries and the corresponding nationality words (in Finnish). This was semi-manual work, starting from an online list. Coverage is not perfect, but quite good (164 entries).

AF      Afganistan,afganistanlainen
NL      Alankomaat,hollantilainen,alankomaalainen,Hollanti
AL      Albania,albanialainen
DZ      Algeria,algerialainen
AD      Andorra,andorralainen
AO      Angola,angolalainen
AR      Argentiina,argentiinalainen
AM      Armenia,armenialainen
AW      Aruba,arubalainen
AU      Australia,australialainen
AZ      Azerbaidžan,azerbaidžanlainen,Azerbaidzan,azerbaidzanlainen
BS      Bahama,bahamalainen
BH      Bahrain,bahrainlainen
BD      Bangladesh,bangladeshlainen
BB      Barbados,barbadoslainen
BE      Belgia,belgialainen
BZ      Belize,belizeläinen
BM      Bermuda,bermudalainen
BT      Bhutan,bhutanlainen
BO      Bolivia,bolivialainen
BA      Bosnia,Hertzegovina,bosnialainen,hertsegovinalainen
...

The list is currently limited to one-word country and nationality names.
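
The list format shown above (a two-letter code, whitespace, then comma-separated names) can be read with a few lines of Python. This is a minimal sketch, not the project's actual loader: the function name `parse_nation_list` is hypothetical, and the rule that capitalized entries are country names while lowercase entries are nationality words is an assumption based on the sample rows.

```python
def parse_nation_list(lines):
    """Map each lowercased name to (country_code, kind).

    Assumption: capitalized names are countries and lowercase names are
    nationality words, as the sample rows suggest.
    """
    lookup = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skips blank lines and the "..." elision marker
        code, names = parts
        for name in names.split(","):
            name = name.strip()
            if not name:
                continue
            kind = "country" if name[0].isupper() else "nation"
            lookup[name.lower()] = (code, kind)
    return lookup


sample = [
    "AF      Afganistan,afganistanlainen",
    "NL      Alankomaat,hollantilainen,alankomaalainen,Hollanti",
]
lookup = parse_nation_list(sample)
```

Keys are lowercased here, so lemmas from the corpus would need to be lowercased before lookup.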

Hits in data

Gather hits from S24 and, for contrast, the Turku Internet Parsebank.

  • S24: 8.6M hits in 7M sentences
  • Parsebank: 16.9M hits in 13.6M sentences
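
The hit and sentence counts differ because one sentence can contain several hits. A minimal sketch of that count, assuming sentences arrive as lists of lemmas; the helper `count_hits` and the toy lookup table are hypothetical and not taken from the project's code.

```python
from collections import Counter


def count_hits(sentences, lookup):
    """Return (total_hits, sentences_with_hits, hits_per_country_code)."""
    total_hits = 0
    sentences_with_hits = 0
    per_code = Counter()
    for lemmas in sentences:
        codes = [lookup[l.lower()] for l in lemmas if l.lower() in lookup]
        if codes:
            sentences_with_hits += 1
            total_hits += len(codes)
            per_code.update(codes)
    return total_hits, sentences_with_hits, per_code


# Toy data, for illustration only.
toy_lookup = {"ruotsi": "SE", "ruotsalainen": "SE", "norja": "NO"}
toy_sentences = [
    ["ruotsalainen", "ja", "norja", "pelata"],
    ["olla", "kaunis", "päivä"],
    ["ruotsi", "voittaa"],
]
hits, n_sentences, per_code = count_hits(toy_sentences, toy_lookup)
```

Dividing each per-country count by the total gives a relative frequency that can be compared between two corpora of different sizes.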

Two obvious questions to ask: how are these hits distributed across nations, and does S24 differ from your "average Internet page"? The maps are here: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html

Most distinctive features

So what characterizes these nations / countries? There are 1000+1 ways to extract distinctive features from a dataset. Our approach here is to train a classifier (SVM) for each nation and regularize it heavily with L1 until only about 100 features are left. These features should strongly distinguish discussions mentioning one country from discussions mentioning other countries.
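
To illustrate why L1 regularization leaves only a handful of features, here is a dependency-free sketch of a one-vs-rest linear classifier with a squared-hinge loss and an L1 penalty, fitted by proximal gradient descent. It is not the project's implementation: in practice a library such as scikit-learn's `LinearSVC(penalty="l1", dual=False)` would be the usual choice. The function names, the toy sentences, and the hyperparameter values are all assumptions.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def train_l1_linear(docs, labels, lam=0.1, lr=0.1, epochs=1000):
    """Fit an L1-regularized squared-hinge linear classifier.

    docs   -- list of sets of features (binary bag of lemmas)
    labels -- list of +1 (mentions the target nation) / -1 (other nation)
    Returns (dict of non-zero weights, bias).
    """
    vocab = sorted({f for doc in docs for f in doc})
    w = {f: 0.0 for f in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad = {f: 0.0 for f in vocab}
        grad_bias = 0.0
        for doc, y in zip(docs, labels):
            slack = max(0.0, 1.0 - y * (bias + sum(w[f] for f in doc)))
            if slack > 0.0:
                g = -2.0 * slack * y / n  # gradient of the mean slack**2
                grad_bias += g
                for f in doc:
                    grad[f] += g
        # Gradient step, then shrinkage; the bias is not regularized.
        bias -= lr * grad_bias
        for f in vocab:
            w[f] = soft_threshold(w[f] - lr * grad[f], lr * lam)
    return {f: v for f, v in w.items() if v != 0.0}, bias


# Toy data: "sauna" is the only lemma that separates the two classes.
docs = [
    {"sauna", "olla", "hyvä"}, {"sauna", "olla"}, {"sauna", "hyvä"},
    {"olla", "hyvä"}, {"olla"}, {"hyvä"},
]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_l1_linear(docs, labels)
```

On this toy data only `sauna` should keep a non-zero weight. Raising `lam` removes more features, which is the knob one would turn until roughly 100 features remain.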

We try this on two feature sets:

  1. Use all words of each sentence (lemmas, to be precise)
  2. Use only adjectives and restrict the hits to nation words (no country names), hoping to get more on-topic data
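
The two feature sets can be sketched as follows. This assumes each token is a `(lemma, pos)` pair with Universal Dependencies style tags such as `ADJ`, and that the lookup table marks each name as `"country"` or `"nation"`. Leaving the hit word itself out of the features is also an assumption, as are all function names.

```python
def all_lemma_features(sentence, hit_index):
    """Feature set 1: every lemma in the sentence except the hit itself."""
    return {lemma for i, (lemma, _pos) in enumerate(sentence) if i != hit_index}


def adjective_features(sentence, hit_index):
    """Feature set 2: adjective lemmas only (assumes the tag 'ADJ')."""
    return {lemma for i, (lemma, pos) in enumerate(sentence)
            if i != hit_index and pos == "ADJ"}


def training_examples(sentences, lookup, adjectives_only=False):
    """Yield (country_code, features) for each hit.

    lookup maps a lowercased lemma to (code, kind). With adjectives_only,
    hits on country names are skipped and only nation words count.
    """
    for sentence in sentences:
        for i, (lemma, _pos) in enumerate(sentence):
            entry = lookup.get(lemma.lower())
            if entry is None:
                continue
            code, kind = entry
            if adjectives_only:
                if kind != "nation":
                    continue
                yield code, adjective_features(sentence, i)
            else:
                yield code, all_lemma_features(sentence, i)


# Toy data, for illustration only.
lookup = {"ruotsi": ("SE", "country"), "ruotsalainen": ("SE", "nation")}
sentences = [
    [("ruotsalainen", "ADJ"), ("mies", "NOUN"), ("olla", "AUX"), ("pitkä", "ADJ")],
    [("ruotsi", "PROPN"), ("olla", "AUX"), ("kallis", "ADJ")],
]
all_examples = list(training_examples(sentences, lookup))
adj_examples = list(training_examples(sentences, lookup, adjectives_only=True))
```

The second setting yields far fewer examples, which is the trade-off between on-topic data and sparsity.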

The maps are here: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html

Sentiment

  • There are sentiment dictionaries available, like this one: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
  • It is an original list of English sentiment words, machine-translated into several dozen languages
  • Bummer: the list is not that great and the translations are poor, so this was a somewhat unlucky choice
  • For each nation, we can restrict the data to hits of these sentiment words and once again pick the most distinctive ones
  • This gets pretty sparse, so many countries don't get a sentiment; we need a better sentiment dictionary!
  • Having selected the most distinctive sentiment terms for each nation, we can aggregate a total sentiment score for each nation, weighting each term by the (log of the) number of hits of that sentiment word
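
The aggregation in the last bullet could look roughly like this. It is a sketch under stated assumptions: positive terms count as +1 and negative terms as -1, the weight is `log(1 + hits)` so that a single hit still contributes, and the result is a plain sum without normalization. None of these details, nor the name `aggregate_sentiment`, come from the project's code.

```python
import math


def aggregate_sentiment(distinctive_terms, polarity, hit_counts):
    """Sum signed, log-weighted hit counts over a nation's distinctive terms."""
    score = 0.0
    for term in distinctive_terms:
        sign = {"pos": 1.0, "neg": -1.0}.get(polarity.get(term))
        if sign is None:
            continue  # term is not in the sentiment dictionary
        score += sign * math.log1p(hit_counts.get(term, 0))
    return score


# Toy data, for illustration only.
polarity = {"kyky": "pos", "järjetön": "neg"}
hit_counts = {"kyky": 9, "järjetön": 99}
score = aggregate_sentiment(["kyky", "järjetön", "talo"], polarity, hit_counts)
```

Here the negative term has more hits than the positive one, so the toy score comes out negative.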

The maps are here: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html

A sample of the translated sentiment list (Finnish translation, polarity, English original):

luopua  neg     abandon
hylätty neg     abandoned
hylkääminen     neg     abandonment
abba    pos     abba
sieppaus        neg     abduction
poikkeuksellinen        neg     aberrant
poikkeama       neg     aberration
inhota  neg     abhor
vastenmielinen  neg     abhorrent
kyky    pos     ability
viheliäinen     neg     abject
poikkeava       neg     abnormal
lakkauttaa      neg     abolish
poistaminen     neg     abolition
iljettävä       neg     abominable
inhottavuus     neg     abomination
keskeyttää      neg     abort
abortti neg     abortion
epäonnistunut   neg     abortive
Edellä mainittujen      pos     abovementioned
hiertymä        neg     abrasion
kumota  neg     abrogate
paise   neg     abscess
poissaolo       neg     absence
poissa  neg     absent
poissaolija     neg     absentee
poissaolot      neg     absenteeism
absoluuttinen   pos     absolute
synninpäästö    pos     absolution
imeytyy pos     absorbed
järjetön        neg     absurd
järjettömyys    neg     absurdity
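
A list in this three-column layout can be loaded as sketched below. Splitting from the right keeps multi-word Finnish translations such as "Edellä mainittujen" intact; this assumes the English original is always a single token. The function name `parse_lexicon` is hypothetical.

```python
def parse_lexicon(lines):
    """Map each lowercased Finnish term to 'pos' or 'neg'."""
    polarity = {}
    for line in lines:
        # Split from the right: <Finnish term> <pos|neg> <English word>
        parts = line.strip().rsplit(None, 2)
        if len(parts) != 3:
            continue
        finnish, label, _english = parts
        if label not in ("pos", "neg"):
            continue
        polarity[finnish.strip().lower()] = label
    return polarity


sample = [
    "luopua  neg     abandon",
    "kyky    pos     ability",
    "Edellä mainittujen      pos     abovementioned",
]
lexicon = parse_lexicon(sample)
```

Multi-word entries will never match a single lemma in the data, which is one of the ways the machine translation reduces coverage.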

Where next?

  • The distinctive features we get are quite nice (we think) but:
    • They need to be more "stereotypical" - any ideas?
  • We do not take syntax into account, and simply default to the sentence as the context
    • Try with adjective modifiers and specific syntactic structures
    • Data sparsity will be a problem for rarely mentioned nations
  • The sentiment detection is not that great at the moment
    • Need a better sentiment list / classifier - any ideas?
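
As one possible starting point for the adjective-modifier idea, the sketch below collects adjectives attached directly to a nation word in a dependency parse. It assumes tokens are dicts with CoNLL-U style fields (`id`, `lemma`, `head`, `deprel`) and that adjectival modifiers carry the relation `amod`; the function name and the toy parse are hypothetical.

```python
def adjective_modifiers(tokens, lookup):
    """Yield (country_code, adjective_lemma) for amod dependents of a hit."""
    by_id = {tok["id"]: tok for tok in tokens}
    for tok in tokens:
        if tok["deprel"] != "amod":
            continue
        head = by_id.get(tok["head"])
        if head is None:
            continue
        code = lookup.get(head["lemma"].lower())
        if code is not None:
            yield code, tok["lemma"]


# Toy parse of "pitkä ruotsalainen nukkua", for illustration only.
lookup = {"ruotsalainen": "SE"}
tokens = [
    {"id": 1, "lemma": "pitkä", "head": 2, "deprel": "amod"},
    {"id": 2, "lemma": "ruotsalainen", "head": 3, "deprel": "nsubj"},
    {"id": 3, "lemma": "nukkua", "head": 0, "deprel": "root"},
]
pairs = list(adjective_modifiers(tokens, lookup))
```

This is much stricter than using the whole sentence as context, so the sparsity problem for rarely mentioned nations would get worse.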

Data and code

https://github.com/jmnybl/maastereotypiat