11. Models of Semantic Memory

  • Psygrammer / Cognitive Modeling: Part 2 [1]
  • 김무성

Contents

  • Introduction
  • Classic Models and Themes in Semantic Memory Research
  • Connectionist Models of Semantic Memory

    • Rumelhart Networks
    • Dynamic Attractor Networks
  • Distributional Models of Semantic Memory
    • Latent Semantic Analysis
    • Moving Window Models
    • Random Vector Models
    • Probabilistic Topic Models
    • Retrieval-Based Semantics
  • Grounding Semantic Models
  • Compositional Semantics
  • Common Lessons and Future Directions

Key words

  • semantic memory
  • semantic space model
  • distributional semantics
  • connectionist network
  • concepts
  • cognitive model
  • latent semantic analysis

Introduction

  • Meaning is simultaneously the most obvious feature of memory—we can all compute it rapidly and automatically—and the most mysterious aspect to study.
  • Semantic memory is necessary for us to construct meaning from otherwise meaningless words and utterances, to recognize objects, and to interact with the world in a knowledge-based manner.
  • Semantic memory typically refers to memory for word meanings, facts, concepts, and general world knowledge.
    • concept
    • proposition
  • The goal of this chapter is to provide an overview of recent advances in models of semantic memory.
  • Although there are several exciting new developments in verbal conceptual theory (e.g., Louwerse’s (2011) Symbol Interdependency Hypothesis), we focus exclusively on models that are
    • explicitly expressed by computer code or
    • mathematical expressions.
  • We opt here to follow two major clusters of cognitive models that have been prominent:
    • distributional models
      • models that specify how concepts are learned from statistical experience (distributional models)
    • connectionist models
      • models that specify how propositions are learned or that use conceptual representations in cognitive processes (connectionist models)

Classic Models and Themes in Semantic Memory Research

  • The three classic models of semantic memory most commonly discussed are
    • semantic networks
    • feature-list models
    • and spatial models

semantic networks

  • The semantic network has traditionally been one of the most common theoretical frameworks used to understand the structure of semantic memory.
    • Collins and Quillian (1969) originally proposed a hierarchical model of semantic memory in which concepts were nodes and propositions were labeled links
      • e.g., the nodes for dog and animal were connected via an “isa” link
    • The superordinate and subordinate structure of the links produced a hierarchical tree structure
      • animals were divided into birds, fish, etc., and birds were further divided into robin, sparrow, etc.
    • A later version of the semantic network model proposed by Collins and Loftus (1975) deemphasized the hierarchical nature of the network in favor of the process of spreading activation through all network links simultaneously to account for semantic priming phenomena—in particular, the ability to produce fast negative responses.
  • Early semantic networks can be seen as clear predecessors to several modern connectionist models, and features of them can also be seen in modern probabilistic and graphical models as well.

feature-list models

  • A competing model was the feature-comparison model of Rips, Shoben, and Smith (1973).
    • In this model, a word's meaning is encoded as a list of binary descriptive features, which are heavily tied to the word's perceptual referent.
    • For example, a feature such as "has wings" would be turned on for a robin, but off for a beagle.
    • Smith, Shoben, and Rips (1974) proposed two types of semantic features:
      • defining features that all exemplars of a concept must have, and
      • characteristic features that are typical of the concept, but are not present in all cases.
      • For example, all birds have wings, but not all birds fly.
  • Modern versions of feature-list models use aggregate data collected from human raters in property generation tasks (e.g., McRae, de Sa, & Seidenberg, 1997).

spatial models

  • A third type was the spatial model, which emerged from Osgood’s (1952, 1971) early attempts to empirically derive semantic features using semantic differential ratings.
    • Osgood had humans rate words on a Likert scale against a set of polar opposites (e.g., rough-smooth, heavy-light), and a word’s meaning was then computed as a coordinate in a multidimensional semantic space.
  • Early spatial models can be seen as predecessors of modern semantic space models of distributional semantics (although the modern models construct the space from co-occurrences in text corpora rather than from human ratings).
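
A minimal sketch of the semantic-differential idea, using invented Likert ratings on three hypothetical polar scales: each word becomes a coordinate in the rating space, and relatedness falls out of geometric distance.

```python
import numpy as np

# hypothetical 1-7 ratings against polar opposite scales (rough-smooth, heavy-light, good-bad)
ratings = {
    "feather": np.array([6.5, 6.8, 5.0]),
    "silk":    np.array([6.8, 6.0, 6.1]),
    "anvil":   np.array([3.0, 1.2, 4.0]),
}

def distance(w1, w2):
    # each word is a coordinate in the multidimensional semantic space
    return np.linalg.norm(ratings[w1] - ratings[w2])

print(distance("feather", "silk"), distance("feather", "anvil"))
```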

Connectionist Models of Semantic Memory

  • Rumelhart Networks
  • Dynamic Attractor Networks

Rumelhart Networks

  • This network has two sets of input units:
    • (1) a set of units meant to represent words or concepts
      • (e.g., robin, canary, sunfish, etc.), and
    • (2) a set of units meant to represent different types of relations
      • (e.g., is-a, can, has, etc.).
  • The network learns to associate conjunctions of
    • those inputs
      • (e.g., robin+can)
    • with outputs representing semantic features
      • (e.g. fly, move, sing, grow, for robin+can).
  • The model accomplishes this using supervised learning,
    • having robin+can activated as inputs,
    • observing what a randomly initialized version of the model produces as an output,
    • and then adjusting the weights so as to make the activation of the correct outputs more likely.
  • In the Rumelhart network, the inputs and outputs are mediated by two sets of hidden units,
    • which allow the network to learn complex internal representations for each input.
  • A critical property of connectionist architectures using hidden layers is that the same hidden units are being used to create internal representations for all possible inputs.
    • In the Rogers et al. example, robin, oak, salmon, and daisy all use the same hidden units; what differentiates their internal representations is that they instantiate different distributed patterns of activation.
  • When the network
    • learns an internal representation (i.e., hidden unit activation state) for the input robin+can, and
    • learns to associate the outputs sing and fly with that internal representation,
    • this will mean that
      • other inputs whose internal representations are similar to robin
        • (i.e., have similar hidden unit activation states, such as canary)
      • will also become more associated with sing and fly.
    • This provides these networks with a natural mechanism for categorization, generalization, and property induction.
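
The sketch below illustrates the kind of supervised learning described above, assuming a toy inventory of concepts, relations, and features and a single hidden layer (the actual Rumelhart network studied by Rogers and McClelland uses two mediating layers and many training patterns). It trains one pattern, robin+can → {fly, move, sing, grow}, by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

concepts = ["robin", "canary", "sunfish", "oak"]              # toy inventories (assumed)
relations = ["isa", "can", "has"]
features = ["fly", "move", "sing", "grow", "swim", "bark"]

# one training pattern from the text: robin+can -> fly, move, sing, grow
x = np.zeros(len(concepts) + len(relations))
x[concepts.index("robin")] = 1.0
x[len(concepts) + relations.index("can")] = 1.0
t = np.array([1.0 if f in {"fly", "move", "sing", "grow"} else 0.0 for f in features])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden = 8
W1 = rng.normal(0.0, 0.5, (n_hidden, x.size))          # input -> hidden weights
W2 = rng.normal(0.0, 0.5, (len(features), n_hidden))   # hidden -> output weights
lr = 0.5

for _ in range(2000):
    h = sigmoid(W1 @ x)                 # distributed internal representation of robin+can
    y = sigmoid(W2 @ h)                 # predicted feature activations
    delta_out = (y - t) * y * (1 - y)   # backpropagate squared error
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)

print(dict(zip(features, np.round(sigmoid(W2 @ sigmoid(W1 @ x)), 2))))
```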

  • Rogers and McClelland (2006) extensively studied the behavior of the Rumelhart networks, and found that the model provides an elegant account of a number of aspects of human concept acquisition and representation.
    • For example, they found that as the model acquires concepts through increasing amounts of experience, the internal representations for the concepts show progressive differentiation, learning broader distinctions first and more fine-grained distinctions later, similar to the distinctions children show (Mandler, Bauer, & McDonough, 1991).
  • Feed-forward connectionist models have only been used in a limited fashion to study the actual structure of semantic memory.
  • However, these models have been used extensively to study how semantic structure interacts with various other cognitive processes.
    • For example, feed-forward models have been used to simulate and understand the word learning process (Gasser & Smith, 1998; Regier, 2005).
  • Feed-forward models have also been used to model consequences of
    • brain damage (Farah & McClelland, 1991; Rogers et al., 2004; Tyler, Durrant-Peatfield, Levy, Voice, & Moss, 2000),
    • Alzheimer’s disease (Chan, Salmon, & Butters, 1998), schizophrenia (Braver, Barch, & Cohen, 1999; Cohen & Servan-Schreiber, 1992; Nestor et al., 1998),
    • and a number of other disorders that involve impairments to semantic memory (see Aakerlund & Hemmingsen, 1998, for a review).

Dynamic Attractor Networks

  • A connectionist model becomes a dynamical model when its architecture involves some sort of bi-directionality, feedback, or recurrent connectivity.
  • Dynamical networks allow investigations into how the activation of representations may change over time, as well as how semantic representations interact with other cognitive processes in an online fashion.
  • For example, Figure 11.2a shows McLeod, Shallice, and Plaut’s (2000) dynamical network for pronouncing printed words.
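
As an illustration of settling dynamics, the sketch below is a generic Hopfield-style attractor network rather than the McLeod, Shallice, and Plaut architecture: two toy binary patterns are stored in symmetric recurrent weights, and a degraded input is iterated until it falls into the nearest attractor.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
# two stored +1/-1 patterns standing in for distributed semantic representations (toy data)
patterns = np.where(rng.normal(size=(2, D)) >= 0, 1.0, -1.0)

# symmetric recurrent weights built by Hebbian outer products (Hopfield-style)
W = sum(np.outer(p, p) for p in patterns) / D
np.fill_diagonal(W, 0)

# start from a degraded input and let the recurrent dynamics settle into the nearest attractor
state = patterns[0].copy()
state[: D // 4] *= -1                          # corrupt a quarter of the units
for _ in range(10):
    state = np.where(W @ state >= 0, 1.0, -1.0)

print(np.mean(state == patterns[0]))           # typically recovers pattern 0 exactly
```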

  • Attractor networks have been used to study a very wide range of semantic-memory related phenomena.
    • Rumelhart et al. (1986) used an attractor network to show how schemas (e.g., one’s representations for different rooms) can emerge naturally out of the dynamics of co-occurrence of lower-level objects (e.g., items in the rooms), without needing to build explicit schema representations into the model (see also Botvinick & Plaut, 2004).
    • Like the McLeod example already described, attractor networks have been extensively used to study how semantic memory affects lexical access (Harm & Seidenberg, 2004; McLeod et al., 2000) as well as to model semantic priming (Cree, McRae, & McNorgan, 1999; McRae, et al., 1997; Plaut & Booth, 2000).
  • Dynamical models have also been used to study
    • the organization and development of the child lexicon (Horst, McMurray, & Samuelson, 2006; Li, Zhao, & MacWhinney, 2007),
    • the bilingual lexicon (Li, 2009), and children’s causal reasoning using semantic knowledge (McClelland & Thompson, 2007),
    • and how lexical development differs in typical and atypical developmental circumstances (Thomas & Karmiloff-Smith, 2003).

Distributional Models of Semantic Memory

  • Latent Semantic Analysis
  • Moving Window Models
  • Random Vector Models
  • Probabilistic Topic Models
  • Retrieval-Based Semantics

There are now a large number of computational models in the literature that may be classified as distributional.

  • Other terms commonly used to refer to these models are corpus-based, semantic-space, or co-occurrence models, but distributional is the most appropriate term common to all of them, because it describes the environmental structure that every one of these learning mechanisms capitalizes on (not all of the models are truly spatial, and most do not rely merely on direct co-occurrences).
    • The various models differ greatly in the cognitive mechanisms they posit that humans use to construct semantic representations, ranging from Hebbian learning to probabilistic inference.
    • But the unifying theme common to all these models is that they hypothesize a formal cognitive mechanism to learn semantics from repeated episodic experience in the linguistic environment (typically a text corpus).
  • The most famous and commonly used phrase to summarize the approach is Firth’s (1957) “you shall know a word by the company it keeps,”
  • and this idea was further developed by Harris (1970) into the distributional hypothesis of contextual overlap.
    • For example, robin and egg may become related because they tend to co-occur frequently with each other.
    • In contrast, robin and sparrow become related because they are frequently used in similar contexts (with the same set of words), even if they rarely co-occur directly.
    • Ostrich may be less related to robin due to a lower overlap of their contexts compared to sparrow, and stapler is likely to have very little contextual overlap with robin.
  • Formal models of distributional semantics differ in their learning mechanisms, but they all have the same overall goal of formalizing the construction of semantic representations from statistical redundancies in language.

Latent Semantic Analysis

  • Perhaps the best-known distributional model is Latent Semantic Analysis (LSA; Landauer & Dumais, 1997).
    • LSA begins with a term-by-document frequency matrix of a text corpus, in which each row vector is a word’s frequency distribution over documents.
    • A document is simply a “bag-of-words” in which transitional information is not represented.
    • Next, each entry of a word’s row vector is transformed by taking the log of the word’s frequency in that document and dividing by the word’s information entropy over documents.
    • Finally, the matrix is factorized using singular-value decomposition (SVD) into three component matrices.
  • The original transformed term-by-document matrix, M, may be reconstructed from these components as M = UΣV^T, where U and V contain the left and right singular vectors and Σ is the diagonal matrix of singular values; retaining only the k largest singular values gives the reduced approximation M_k = U_k Σ_k V_k^T, and word vectors are typically taken from the rows of U_k Σ_k.
  • dimension reduction
    • LSA uses SVD to infer a small number of latent semantic components in language that explain the pattern of observable word co-occurrences across contexts.
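
A minimal sketch of the LSA pipeline just described, assuming an invented toy term-by-document count matrix and one common form of the log-entropy weighting; real applications use corpora with thousands of documents and typically retain around 300 dimensions.

```python
import numpy as np

# toy term-by-document count matrix (rows = words, columns = documents); invented data
counts = np.array([
    [2., 0., 1., 0.],   # robin
    [1., 0., 2., 0.],   # sparrow
    [0., 3., 0., 1.],   # stapler
    [0., 1., 0., 2.],   # paper
])

# log-entropy weighting: log of local frequency, scaled by a global weight based on
# the word's entropy over documents (one common LSA formulation)
local = np.log(counts + 1.0)
p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
plogp = np.zeros_like(p)
mask = p > 0
plogp[mask] = p[mask] * np.log(p[mask])
global_w = 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])
M = local * global_w[:, None]

# SVD factorization M = U S V^T; keeping the k largest singular values gives the reduced space
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vectors[0], word_vectors[1]))   # robin vs. sparrow (similar contexts)
print(cosine(word_vectors[0], word_vectors[2]))   # robin vs. stapler (dissimilar contexts)
```
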
  • The semantic representations constructed by LSA have demonstrated remarkable success at simulating a wide range of human behavioral data, including
    • judgments of semantic similarity (Landauer & Dumais, 1997),
    • word categorization (Laham, 2000), and
    • discourse comprehension (Kintsch, 1998), and
    • the model has also been applied to the automated scoring of essay quality (Landauer, Laham, Rehder, & Schreiner, 1997).
  • One of the most publicized feats of LSA was its ability to achieve a score on the Test of English as a Foreign Language (TOEFL) that would allow it entrance into most U.S. colleges (Landauer & Dumais, 1997).
  • This finding supports the notion that semantic memory may simply be supported by a mental dimension reduction mechanism applied to episodic contexts.
  • The influence of LSA on the field of semantic modeling cannot be overstated. Nevertheless, several criticisms of the model have emerged over the years (see Perfetti, 1998), including
    • the lack of incremental learning,
    • neglect of word-order information,
    • issues about what exact cognitive mechanisms would perform SVD,
    • and concerns over its core assumption that meaning can be represented as a point in space.

Moving Window Models

  • An alternative approach to learning distributional semantics is to slide an N-word window across a text corpus, and to apply some lexical association function to the co-occurrence counts within the window at each step.
    • Whereas LSA represents a word’s episodic context as a document, moving-window models operationalize a word’s context in terms of the other words that it is commonly seen with in temporal proximity.
    • Compared to LSA’s batch-learning mechanism, this allows moving-window models to gradually develop semantic structure from simple co-occurrence counting (cf. Hebbian learning),
      • because a text corpus is experienced in a continuous fashion.
    • In addition, several of these models inversely weight co-occurrence by how many words intervene between a target word and its associate, allowing them to capitalize on word-order information.
  • HAL (Hyperspace Analogue to Language model)
    • The prototypical exemplar of a moving-window model is the Hyperspace Analogue to Language model (HAL; Lund & Burgess, 1996).
    • In HAL, a co-occurrence window (typically, the 10 words preceding and succeeding the target word) is slid across a text corpus, and a global word-by-word co-occurrence matrix is updated at each one-word increment of the window.
    • A word’s semantic representation in the model is simply a concatenation of its row and column vectors from the global co-occurrence matrix.
    • Obviously, the word vectors in HAL are both high dimensional and very sparse.
    • Similarity
      • HAL tends to capture paradigmatic similarity (e.g., bee-wasp).
      • LSA tends to capture syntagmatic relations (e.g., bee-honey).
    • Considering its simplicity, HAL has been very successful at accounting for human behavior in semantic tasks, including
      • semantic priming (Lund & Burgess, 1996), and
      • asymmetric semantic similarity as well as higher-order tasks such as problem solving (Burgess & Lund, 2000).
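
A sketch of a HAL-style moving window over a toy corpus: the ramped weighting (closer neighbors count more) and the row/column concatenation follow the description above, while the window size and corpus are invented for illustration.

```python
import numpy as np

corpus = "the robin ate the worm and the sparrow watched the robin".split()   # toy corpus (invented)
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
window = 4          # HAL typically uses 10; smaller here to suit the toy corpus

# counts[i, j]: weighted count of word j appearing BEFORE word i within the window
counts = np.zeros((len(vocab), len(vocab)))
for pos, target in enumerate(corpus):
    for lag in range(1, window + 1):
        if pos - lag < 0:
            break
        context = corpus[pos - lag]
        counts[index[target], index[context]] += window - lag + 1   # closer words weigh more

def hal_vector(word):
    # a word's representation: its row (preceding context) concatenated with its column (following context)
    i = index[word]
    return np.concatenate([counts[i, :], counts[:, i]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(hal_vector("robin"), hal_vector("sparrow")))
```
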
  • HiDEx (High Dimensional Explorer)
    • Recent versions of HAL, such as HiDEx (Shaoul & Westbury, 2006), factor out chance co-occurrence by weighting co-occurrence by the inverse frequency of the target word, which is similar to LSA’s log-entropy weighting but is applied after the matrix has been learned.
  • COALS (Correlated Occurrence Analogue to Lexical Semantics)
    • In COALS, there is no preceding/succeeding distinction within the moving window, and the model uses a co-occurrence association function based on Pearson’s correlation to factor out the confounding of chance co-occurrence due to frequency.
    • Hence, the similarity between two words is their normalized covariational pattern over all context words.
    • In addition, COALS performs SVD on this matrix.
    • Although these are quite straightforward modifications to HAL, COALS heavily outperforms its predecessor on human tasks such as semantic categorization (Riordan & Jones, 2011).
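
A sketch of the correlation-style normalization plus SVD described above, using an invented word-by-word count matrix; the full COALS implementation differs in details (vocabulary selection, number of dimensions retained), but the core transformation is observed minus expected co-occurrence, scaled by the marginal frequencies.

```python
import numpy as np

# toy symmetric word-by-word co-occurrence counts from an undirected moving window (invented data)
counts = np.array([[0., 4., 1.],
                   [4., 0., 2.],
                   [1., 2., 0.]])

T = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)

# correlation-style normalization: observed minus expected co-occurrence, scaled by marginals,
# so that chance co-occurrence due to raw word frequency is factored out
corr = (T * counts - row @ col) / np.sqrt((row * (T - row)) @ (col * (T - col)))

# COALS-style cleanup: discard negative correlations, compress positive ones, then reduce with SVD
corr = np.sqrt(np.clip(corr, 0.0, None))
U, S, Vt = np.linalg.svd(corr)
word_vectors = U[:, :2] * S[:2]
```
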
  • A similar moving window model was used by McDonald and Lowe (1998) to simulate semantic priming.
    • This context word approach, in which as few as 100 context words are used as the columns, has also been successfully used by Mitchell et al. (2008) to predict fMRI brain activity associated with humans making semantic judgments about nouns.

Random Vector Models

  • An entirely different take on contextual representation is seen in models that assign words random initial representations, which gradually develop semantic structure through repeated episodes with the word in a text corpus.
  • The mechanisms used by these models are theoretically tied to mathematical models of associative memory.
  • For this reason, random vector models tend to capitalize both on contextual co-occurrence, as LSA does, and on associative position relative to other words, as models like HAL and COALS do, representing both in a composite vector space.
  • BEAGLE (Bound Encoding of the Aggregate Language Environment model)
    • In the Bound Encoding of the Aggregate Language Environment model (BEAGLE; Jones & Mewhort, 2007), semantic representations are gradually acquired as text is experienced in sentence chunks.
    • The model is based heavily on mechanisms from Murdock’s (1982) theory of item and associative memory.
    • The first time a word is encountered, it is assigned a random initial vector known as its environmental vector, e_i.
    • This vector is the same each time the word is experienced in the text corpus, and is assumed to represent the relatively stable physical characteristics of perceiving the word (e.g., its visual form or sound).
    • In BEAGLE, each time a word is experienced in the corpus, its memory vector, m_i, is updated as the sum of the random environmental vectors for the other words that occurred in context with it, ignoring high-frequency function words.
      • Hence, in the short phrase “A dog bit the mailman,”
        • the memory representation for dog is updated as
          • m_dog = e_bit + e_mailman.
      • In the same sentence,
        • m_bit = e_dog + e_mailman and
        • m_mailman = e_dog + e_bit are encoded.
      • Even though the environmental vectors are random, the memory vectors for each word in the phrase have some of the same random environmental structure summed into their memory representations. Hence, m_dog, m_bit, and m_mailman all move closer to one another in memory space each time they directly co-occur in contexts.
      • In addition, latent similarity naturally emerges in the memory matrix; even if dog and pitbull never directly co-occur with each other, they will become similar in memory space if they tend to occur with the same words (i.e., similar contexts).
        • This allows higher-order abstraction, achieved in LSA by SVD, to emerge in BEAGLE naturally from simple Hebbian summation.
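
A minimal sketch of BEAGLE's context encoding, assuming toy sentences and a small stoplist: each word's memory vector accumulates the random environmental vectors of its sentence mates, so words that share contexts drift together even without direct co-occurrence.

```python
import numpy as np

rng = np.random.default_rng(42)
D = 1024                       # vector dimensionality (the published model uses a similar order of magnitude)
env, mem = {}, {}              # environmental and memory vectors

def e(word):
    # fixed random environmental vector, created on first encounter
    if word not in env:
        env[word] = rng.normal(0, 1 / np.sqrt(D), D)
        mem[word] = np.zeros(D)
    return env[word]

def encode_context(sentence, stopwords={"a", "the"}):
    words = [w for w in sentence.lower().split() if w not in stopwords]
    for w in words:
        e(w)
    for w in words:
        # sum the environmental vectors of the other words in the sentence into w's memory vector
        mem[w] += sum(e(o) for o in words if o != w)

encode_context("A dog bit the mailman")
encode_context("A pitbull bit the mailman")

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(mem["dog"], mem["pitbull"]))   # similar contexts -> similar memory vectors
```
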
  • Convolution-based memory models
    • The use of random environmental representations allows BEAGLE to learn information as would LSA, but in a continuous fashion and without the need for SVD.
    • But the most interesting aspect of the model is that the random representations allow the model to encode word order information in parallel by applying an operation from signal processing known as convolution to bind together vectors for words in sequence.
    • Convolution-based memory models have been very successful as models of both vision and paired-associate memory, and BEAGLE extends this mechanism to encode n-gram chunk information in the word’s representation.
    • The model uses circular convolution, which binds two vectors x and y of dimensionality n into a third vector z of the same dimensionality: z_i = Σ_{j=0}^{n−1} x_j · y_{(i−j) mod n}.
  • BEAGLE applies this operation recursively to create an order vector representing all the environmental vectors that occur in sequences around the target word, and this order vector is also summed into the word’s memory vector.
  • Hence, the memory vector becomes a pattern of elements that reflects the word’s history of co-occurrence with, and position relative to, other words in sentences.
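
A sketch of the binding operation: circular convolution implemented via the FFT, with a random placeholder vector standing for the target word's position. BEAGLE additionally permutes its operands to make the binding directional, which is omitted here.

```python
import numpy as np

def cconv(x, y):
    # circular convolution via FFT: z_i = sum_j x_j * y_{(i - j) mod n}
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

rng = np.random.default_rng(0)
D = 1024
phi = rng.normal(0, 1 / np.sqrt(D), D)          # placeholder marking the target word's position
e_bit = rng.normal(0, 1 / np.sqrt(D), D)        # environmental vector for "bit"

# order information for "dog" contributed by the bigram "dog bit":
# bind the placeholder (dog's slot) to the environmental vector of the word that follows it
o_dog = cconv(phi, e_bit)
# in BEAGLE this order vector is summed into m_dog alongside the context information
```
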
  • Random indexing model
    • A similar approach to BEAGLE, known as random indexing, has been taken by Kanerva and colleagues (Kanerva, 2009; Kanerva, Kristoferson, & Holst, 2000).
    • Random indexing uses similar principles to BEAGLE’s summation of random environmental vectors, but is based on Kanerva’s (1988) theory of sparse distributed memory.
    • The initial vector for a word in random indexing is a sparse binary representation, a very high dimensional vector in which most elements are zeros with a small number of random elements switched to ones (a.k.a., a “spatter code”).
  • RPM (Random Permutation Model)
    • Sahlgren, Holst, & Kanerva (2008) have extended random indexing to encode order information, as BEAGLE does, in their Random Permutation Model (RPM).
    • The RPM encodes contextual information the same way as standard random indexing.
    • Rather than convolution, it uses a permutation function to encode the order of words around a target word.
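
A sketch of random indexing with a permutation-based order code, assuming sparse ternary index vectors and a single fixed permutation applied once per position of separation; a fuller implementation would also distinguish preceding from succeeding words (e.g., via the inverse permutation), which is simplified away here.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 2000

def index_vector(nonzero=20):
    # sparse random index vector: mostly zeros with a few +1/-1 elements ("spatter code")
    v = np.zeros(D)
    slots = rng.choice(D, size=nonzero, replace=False)
    v[slots] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

perm = rng.permutation(D)                # one fixed permutation stands for "one position forward"

def permute(v, steps):
    out = v
    for _ in range(steps):
        out = out[perm]                  # applying the permutation repeatedly encodes distance
    return out

e_bit, e_mailman = index_vector(), index_vector()
# order contribution to the memory vector of "dog" from "dog bit the mailman"
# (function word "the" ignored): each neighbor is permuted by its relative position
m_dog_order = permute(e_bit, 1) + permute(e_mailman, 2)
```
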
  • TCM (Temporal Context Model)
    • Howard and colleagues (e.g., Howard, Shankar, & Jagadisan, 2011) have taken a different approach to learning semantic representations, binding local item representations to a gradually changing representation of context by modifying the Temporal Context Model (TCM; Howard & Kahana, 2002) to learn semantic information from a text corpus.
    • The TCM uses static vectors representing word form, similar to RPM’s initial vectors or BEAGLE’s environmental vectors.
    • However, the model binds words to temporal context, a representation that changes gradually with time, similar to oscillator-based systems.
    • In this sense, the model is heavily inspired by hippocampal function.
    • Encountering a word reinstates its previous temporal contexts when encoding its current state in the corpus.
    • Hence, whereas LSA, HAL, and BEAGLE all treat context as a categorical measure (documents, windows, and sentences, respectively, are completely different contexts), TCM treats context as a continuous measure that is gradually changing over time.
    • Howard et al. (2011) trained a predictive version of TCM (pTCM) on a text corpus to compare to established semantic models.
      • Howard et al. demonstrate impressive performance from pTCM on linguistic association tasks.
      • In addition, the application of TCM in general to semantic representation makes a formal link to mechanisms of episodic memory (which at its core, TCM is), as well as findings in cognitive neuroscience (see Polyn & Kahana, 2008).
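
A caricature of the drifting-context idea, with an assumed drift rate and random word-form vectors; the real TCM/pTCM also reinstates a word's previously stored contexts when the word recurs and adds a prediction component, both of which are omitted in this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
D = 512
rho = 0.9                      # drift rate: how much of the previous context carries forward (assumed value)
item, memory = {}, {}          # static word-form vectors and accumulated context bindings
context = np.zeros(D)

def form(word):
    if word not in item:
        item[word] = rng.normal(0, 1 / np.sqrt(D), D)
        memory[word] = np.zeros(D)
    return item[word]

for word in "the robin built a nest in the old oak".split():
    x = form(word)
    memory[word] += context                     # bind the word to the current temporal context
    context = rho * context + (1 - rho) * x     # context drifts gradually toward the current word
    context /= np.linalg.norm(context) + 1e-12
```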

Probabilistic Topic Models

  • Considerable attention in the cognitive modeling literature has recently been placed on Bayesian models of cognition (see Austerweil, et al., this volume), and mechanisms of Bayesian inference have been successfully extended to semantic memory as well.
  • Probabilistic topic models (Blei, Ng, & Jordan, 2003; Griffiths, Steyvers, & Tenenbaum, 2007) operate in a similar fashion to LSA, performing statistical inference to reduce the dimensionality of a term-by-document matrix.
  • However, the theoretical mechanisms behind the inference and representation in topic models differ markedly from LSA and other spatial models.
  • An assumption of a topic model is that documents are generated by mixtures of latent “topics,” in which a topic is a probability distribution over words.
    • Although LSA makes a similar assumption that latent semantic components can be inferred from observable co-occurrences across documents, topic models go a step further, specifying a fully generative model for documents (a procedure by which documents may be generated).
    • The assumption is that when constructing documents, humans are sampling a distribution over universal latent topics.
  • To train the model, Bayesian inference (e.g., Gibbs sampling) is used to reverse the generative process: given the observed documents, the model infers the word distribution for each topic, P(w|z), and the topic mixture for each document, P(z|d), so that the probability of a word in a document is P(w|d) = Σ_j P(w|z_j) P(z_j|d).

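A sketch of the generative story behind a topic model, with two hand-built topics over a toy vocabulary: each document draws a mixture over topics, then each word is produced by sampling a topic and then a word from that topic. Training inverts this process to recover the topics and mixtures from observed documents.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["bank", "money", "loan", "river", "stream", "water"]
# two hand-built topics: each is a probability distribution over words (invented values)
topics = np.array([
    [0.30, 0.35, 0.30, 0.02, 0.01, 0.02],   # a "finance"-like topic
    [0.05, 0.02, 0.03, 0.30, 0.30, 0.30],   # a "rivers"-like topic
])

def generate_document(n_words, alpha=0.5):
    theta = rng.dirichlet([alpha] * len(topics))       # this document's mixture over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)           # sample a topic
        w = rng.choice(len(vocab), p=topics[z])        # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(10))
# Training (e.g., LDA via Gibbs sampling) inverts this process: given documents,
# infer the topics and each document's mixture weights.
```
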
  • The probabilistic inference machinery behind topic models results in at least three major differences between topic models and other distributional models.
    • First, as mentioned earlier, topic models are generative.
    • Second, it is often suggested that the topics themselves have a meaningful interpretation, such as finance, medicine, theft, and so on,
      • whereas the components of LSA are difficult to interpret, and the components of models like BEAGLE are purposely not interpretable in isolation from the others.
    • Third, words in a topic model are represented as probability distributions rather than as points in semantic space
  • For these reasons, topic models have been shown to produce better fits to free association data than LSA, and they are able to account for disambiguation, word-prediction, and discourse effects that are problematic for LSA (Griffiths et al., 2007).

Retrieval-Based Semantics

  • CSM (constructed semantics model)
    • Kwantes (2005) proposed an alternative approach to modeling semantic memory from distributional structure.
    • Although not named in his publication, Kwantes’s model is commonly referred to as the constructed semantics model (CSM), a name that is paradoxical given that the model posits that there is no such thing as semantic memory.
    • Rather, semantic behavior exhibited by the model is an emergent artifact of retrieval from episodic memory.
    • Whereas all other models put the semantic abstraction mechanism at encoding (e.g., SVD, Bayesian inference, vector summation), CSM simply encodes the episodic matrix and performs abstraction as needed when a word is encountered.
  • CSM is based heavily on Hintzman’s (1986) Minerva 2 model, which was used as an existence proof that a variety of behavioral effects previously taken to argue for two distinct memory stores (episodic and semantic) could be produced naturally by a model that has only memory for episodes.
    • In CSM, the memory matrix is the term-by-document matrix (i.e., it assumes perfect memory of episodes).
    • When a word is encountered in the environment, its semantic representation is constructed as an average of the episodic memories of all other words in memory, weighted by their contextual similarity to the target.
    • This semantic vector is similar in structure to the memory vector learned in BEAGLE by context averaging, but the averaging is done on the fly; it is not encoded or stored.
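
A minimal sketch of retrieval-based construction over an invented term-by-document episodic matrix: the target's semantic vector is assembled at retrieval time as a similarity-weighted average of the other words' episodic rows (Minerva-style models typically sharpen the weighting, e.g., by cubing the similarities, which is omitted here).

```python
import numpy as np

# toy term-by-document episodic matrix (rows are words' episodic traces); invented data
words = ["robin", "sparrow", "stapler"]
episodes = np.array([
    [2., 0., 1., 0.],   # robin
    [1., 0., 2., 0.],   # sparrow
    [0., 3., 0., 1.],   # stapler
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def construct_semantic_vector(target):
    # built at retrieval time: average the other words' episodic rows,
    # weighted by their contextual similarity to the target (nothing is stored)
    t = episodes[words.index(target)]
    sem = np.zeros_like(t)
    for i, w in enumerate(words):
        if w != target:
            sem += cosine(t, episodes[i]) * episodes[i]
    return sem

print(construct_semantic_vector("robin"))
```
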
  • Although retrieval-based models have received less attention in the literature than models like LSA, they represent a very important link to other instance-based models, especially exemplar models of recognition memory and categorization (e.g., Nosofsky, 1986).
  • The primary factor limiting their uptake in model applications is likely the heavy computational expense required to actually simulate the retrieval process (Stone, Dennis, & Kwantes, 2011).

Grounding Semantic Models

  • Semantic models, particularly distributional models, have been criticized as psychologically implausible because they learn only from linguistic information and contain no information about sensorimotor perception, contrary to the grounded cognition movement (for a review, see de Vega, Glenberg, & Graesser, 2008).
  • Feature-based representations contain a great deal of sensorimotor properties of words that cannot be learned from purely linguistic input, and both types of information are core to human semantic representation (Louwerse, 2008).
  • Riordan and Jones (2011) recently compared a variety of feature-based and distributional models on semantic clustering tasks.
    • Their results demonstrated that whereas there is information about word meaning redundantly coded in both feature norms and linguistic data, each has its own unique variance, and the two information sources serve as complementary cues to meaning.
  • Research using recurrent networks trained on child-directed speech corpora has found that pretraining a network with features related to children’s sensorimotor experience produced significantly better word learning when subsequently trained on linguistic data (Howell, Jankowicz, & Becker, 2005).
  • Several recent probabilistic topic models have also explored parallel learning of linguistic and featural information (Andrews, Vigliocco, & Vinson, 2009; Baroni, Murphy, Barba, & Poesio, 2010; Steyvers, 2009).
  • Integration of linguistic and sensorimotor information allows the models to better fit human semantic data than a model trained with only one source (Andrews et al., 2009).

Compositional Semantics

Common Lessons and Future Directions

References