A Distributed, Developmental Model of Word Recognition and Naming
Psychological Review Copyright 1989 by the American Psychological Association, Inc.
1989, Vol. 96, No. 4, 523-568 0033-295X/89/$00.75
Mark S. Seidenberg
McGill University
Montreal, Quebec, Canada
James L. McClelland
Carnegie Mellon University
A parallel distributed processing model of visual word recognition and pronunciation is described.
The model consists of sets of orthographic and phonological units and an interlevel of hidden units.
Weights on connections between units were modified during a training phase using the back-propagation
learning algorithm. The model simulates many aspects of human performance, including (a)
differences between words in terms of processing difficulty, (b) pronunciation of novel items, (c)
differences between readers in terms of word recognition skill, (d) transitions from beginning to
skilled reading, and (e) differences in performance on lexical decision and naming tasks. The model's
behavior early in the learning phase corresponds to that of children acquiring word recognition
skills. Training with a smaller number of hidden units produces output characteristic of many dyslexic
readers. Naming is simulated without pronunciation rules, and lexical decisions are simulated
without accessing word-level representations. The performance of the model is largely determined
by three factors: the nature of the input, a significant fragment of written English; the learning rule,
which encodes the implicit structure of the orthography in the weights on connections; and the
architecture of the system, which influences the scope of what can be learned.
The recognition and pronunciation of words is one of the central
topics in reading research and has been studied intensely in
recent years (see articles in Besner, Waller, & MacKinnon,
1985, and Coltheart, 1987, for reviews). The topic is important
primarily because of the immediate, "on-line" character of language
comprehension (Marslen-Wilson, 1975), that is, the fact
that text and discourse are interpreted essentially as the signal
is perceived. Two aspects of lexical processing contribute to this
characteristic of reading. First, words can be identified quickly;
the rate for skilled readers typically exceeds five words per second
(Rayner & Pollatsek, 1987). Second, identification of a
word entails the activation of several types of associated information
or codes, each of which contributes to the rapid interpretation
of text. These codes include one or more meanings of
This research was supported by National Science Foundation Grant
BNS 8609729, Office of Naval Research Grant N00014-86-G-0146,
Natural Science and Engineering Research Council of Canada Grant
A7924, Quebec Ministry of Education Grant FCAR EQ-2074, and National
Institutes of Health Grant HD-18944. James L. McClelland was
also supported by National Institutes of Health Career Development
Award MH00385. Part of this research was completed while Mark S.
Seidenberg was a visiting scientist at the MRC Applied Psychology Unit,
Cambridge, England; the use of their facilities is gratefully acknowledged.
We thank Karalyn Patterson, who has collaborated on studies of the
model's implications concerning acquired forms of dyslexia (Patterson,
Seidenberg, & McClelland, in press); Max Coltheart, Michael Tanenhaus,
and Stephen Monsell provided helpful comments on the model.
We also thank Debra Jared and Ken McRae, who ran the experiment
reported in the article and assisted in analyzing the results of several
simulations.
Correspondence concerning this article should be addressed to Mark
S. Seidenberg, Department of Psychology, McGill University, 1205
Docteur Penfield, Montreal, PQ, H3A 1B1, Canada.
a word (Seidenberg, Tanenhaus, Leiman, & Bienkowski, 1982;
Swinney, 1979), information related to its pronunciation or
sound (Baron & Strawson, 1976; Gough, 1972; Tanenhaus,
Flanigan, & Seidenberg, 1980), and information concerning the
kinds of sentence structures in which the word participates
(McClelland & Kawamoto, 1986; Tanenhaus & Carlson, 1989).
Understanding the meanings of words is obviously an important
part of text comprehension. The phonological code may be
related to the retention of information in working memory,
while other comprehension processes such as syntactic analyses
or inferencing continue (Baddeley, 1979; Daneman & Carpenter,
1980). The third type of information facilitates the development
of representations concerning syntactic and conceptual
structures (Tanenhaus & Carlson, 1989). The picture that has
emerged is one in which lexical processing yields access to several
types of information in a rapid and efficient manner. Readers
are typically aware of the results of lexical processing, not
the manner in which it occurred. One of the goals of research
on visual word recognition has been to use experimental methods
to unpack these largely unconscious processes; in the model
that we present in this article, we attempt to give an explicit,
computational account of them.
Word recognition is also important because acquiring this
skill is among the first tasks confronting the beginning reader;
moreover, deficits at the level of word recognition are characteristic
of children who fail to acquire age-appropriate reading
skills (Perfetti, 1985; Stanovich, 1986). The model that we describe
provides an account of the kinds of knowledge that are
acquired, how they are used in performing different reading
tasks, and the bases of some types of reading impairment. Specific
deficits in word recognition are also observed as a consequence
of brain injury; the study of these deficits has provided
important information concerning the types of knowledge and
processes involved in normal reading and clues to their neurophysiological
bases (Patterson, Coltheart, & Marshall, 1985).
Our model provides the basis for an account of some characteristics
of pathological performance in terms of damage to the
normal processing system; this aspect of the model is discussed
in Patterson, Seidenberg, and McClelland (in press).
Finally, visual word recognition provides an interesting domain
in which to explore general ideas concerning learning, the
representation of knowledge, and skilled performance because
it is a relatively mature area of inquiry. There has been an enormous
amount of empirical research on the topic, and several
models have already been proposed (Coltheart, 1978; Forster,
1976; LaBerge & Samuels, 1974; McClelland & Rumelhart,
1981; Morton, 1969). Our goal has been to develop an explicit,
computational model that accounts for much of this extensive
body of knowledge. At the same time, word recognition provides
an interesting domain in which to explore the properties
of the connectionist or parallel distributed processing approach
to understanding perception, cognition, and learning (McClelland
& Rumelhart, 1986b; Rumelhart & McClelland, 1986b)
that we have used in this research. In particular, our model illustrates
an important feature of this approach, the emergence of
systematic, "rule-governed" behavior from a network of simple
processing units.
Scope of the Problem
In acquiring word recognition skills, children must come to
understand at least two basic characteristics of written English.
First, there is the alphabetic principle (Rozin & Gleitman,
1977), the fact that in an alphabetic orthography there are systematic
correspondences between the spoken and written forms
of words. Beginning readers already possess large oral vocabularies;
their initial problem is to learn how known spoken forms
map onto unfamiliar written forms. The scope of this problem
is determined by characteristics of the writing system. The alphabetic
writing system for English is a code for representing
spoken language; units in the writing system--letters and letter
patterns--largely correspond to speech units such as phonemes.
However, the correspondence between the written and
spoken codes is notoriously complex; many correspondences
are inconsistent (e.g., -AVE is usually pronounced as in GAVE,
SAVE, and CAVE, but there is also HAVE) or wholly arbitrary
(e.g., -OLO in COLONEL, -PS in CORPS).
These inconsistencies derive from several sources. One is the
fact that the writing system also encodes morphological information.
Chomsky and Halle (1968) argued that English orthography
represents a solution to the problem of simultaneously
representing information concerning phonology and morphology.
According to their analysis, the writing system follows a
general principle whereby phonological information is encoded
only if it cannot be derived from rules that are conditioned by
morphological structure. Thus, words with seemingly irregular
pronunciations such as SIGN and BOMB preserve in their written
forms information about morphological relations among words
(SIGN-SIGNATURE; BOMB-BOMBARD); the correct pronunciations
can be derived from a morphophonemic rule governing
base and derived forms. Whatever the validity of Chomsky and
Halle's approach (see Bybee, 1985, for an alternative), it is clear
that some irregular correspondences between graphemes and
phonemes are due to the competing demand that the writing
system preserve morphological information.
Other inconsistencies derive from the fact that the spoken
forms of words change over time, whereas the written forms are
essentially fixed. In British English, for example, the word BEEN
is a homophone of BEAN; in American English, it is a homophone
of BIN. The American pronunciation has changed
through a process of phonological reduction, resulting in an irregular
spelling-sound correspondence. These diachronic
changes in pronunciation are an important source of irregularities
in spelling-sound correspondences. There are other sources
as well, principally lexical borrowing from other languages, periodic
spelling reforms, and historical accident. The net result
is that the writing system encodes information related to pronunciation
and sound, but the correspondence between written
and spoken forms is not entirely regular or transparent. English
is said to have a "deep" alphabetic orthography, in contrast to
a "shallow" orthography such as that in Serbo-Croatian, which
has more consistent spelling-sound correspondences (Katz &
Feldman, 1981).
A second aspect of the writing system that the child must
learn about concerns the distribution of letter patterns in the
lexicon. Only some combinations of letters are possible, and the
combinations differ in frequency. These facts about the distribution
of letter patterns give written English its characteristic
redundancy. Of the many possible combinations of 26 letters,
only a small percentage yield letter strings that would be permissible
words in English. An even smaller percentage are realized
as actual entries in the lexicon. As Adams (1981) noted,
From an alphabet of 26 letters, we could generate over 475,254
unique strings of 4 letters or less, or 12,376,630 of 5 letters or less.
Alternatively, we could represent 823,543 unique strings with an
alphabet of only 7 letters, or 16,777,216 with an alphabet of only
8. For comparison, the total number of entries in Webster's New
Collegiate Dictionary is only 150,000. (p. 198)
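The counts in this quotation can be checked with a few lines of arithmetic. The sketch below is illustrative only; it assumes that "4 letters or less" means all nonempty strings of length 1 through 4, and that the 7- and 8-letter alphabet figures refer to strings whose length equals the alphabet size (7^7 and 8^8). The function name `count_strings` is hypothetical. Under this reading the total for five letters or fewer evaluates to 12,356,630, which differs slightly from the figure as quoted.

```python
def count_strings(alphabet_size, max_length):
    """Number of distinct nonempty strings of length 1..max_length."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# Strings over the 26-letter alphabet.
up_to_4 = count_strings(26, 4)   # 26 + 26^2 + 26^3 + 26^4
up_to_5 = count_strings(26, 5)   # the above plus 26^5

# Assumed reading of the small-alphabet figures: length equals alphabet size.
seven_letter_alphabet = 7 ** 7
eight_letter_alphabet = 8 ** 8

# Share of possible strings (length <= 5) that a 150,000-entry dictionary
# could occupy at most.
dictionary_share = 150_000 / up_to_5

print(up_to_4, up_to_5, seven_letter_alphabet, eight_letter_alphabet)
print(f"{dictionary_share:.2%}")
```

Even on the generous assumption that every dictionary entry is five letters or shorter, the realized vocabulary covers only a small fraction of the possible strings, which is the redundancy the passage describes.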
Constraints on the forms of written words may play an important
role in the recognition process. The reader must discriminate
the input string from other vocabulary items, a task that
might be facilitated by knowledge of the letter combinations
that are permissible or realized. Many studies have provided
evidence that skilled readers use this knowledge (see Henderson,
1982, for a review).
Orthographic redundancy also provides cues to other aspects
of lexical structure, specifically, syllables and morphemes. For
example, the written forms of words typically provide cues to
their syllabic structure (Adams, 1981) for the following reason.
Syllables derive from articulatory-motor properties of the spoken
language; essentially, they reflect the opening and closing
movements of the jaw cycle (Fowler, 1977; Seidenberg, 1989).
Thus, the capacities of the articulatory-motor apparatus constrain
the possible sequences of phonemes. Moreover, there are
language-specific constraints on phoneme sequencing. Written
English is largely a code for representing speech; hence, properties
of speech such as syllables tend to be reflected in the orthography.
For example, the fact that the letters GP never appear in
word-initial position derives from a phonotactic constraint on
the occurrence of the corresponding phonemes. These letters
can appear at the division between two syllables (e.g., PIGPEN),
reflecting the fact that there are more constraints on the sequencing
of phonemes within syllables than between. As a result,
the letter patterns at syllable boundaries tend to be lower
in frequency than the letter patterns that occur intrasyllabically
(Adams, 1981; Seidenberg, 1987). Thus, facts about the distribution
of phonemes characteristic of spoken syllables are reflected
in the distribution of letter patterns in their written realizations.
As in the case of grapheme-phoneme correspondences,
however, the realizations of syllables in the orthography
are not entirely consistent, as illustrated by minimal pairs such
as WAIVE-NAIVE, BAKED-NAKED, and DIES-DIET, which are
similar in orthography but differ in syllabic structure. Thus,
written English provides cues to syllabic structure, but these
cues are not entirely reliable.
The situation is similar when we turn to the level of morphology,
which concerns the organization of sublexical units that
contribute to meaning. The meaning of a word is often a compositional
function of the meanings of its morphemes; consider
prefixed words such as PREVIEW and DECODE. That English is
systematic in this regard is seen in the coining of new words
such as PRECOMPILE or DEBUG. That it is inconsistent is illustrated
by words such as PRETENSE (unrelated to TENSE) or DELIVER
(unrelated to LIVER). Again, written English encodes information
related to morphological structure, but not in a regular
or consistent manner.
In sum, the English orthography partially encodes several
types of information simultaneously. The reader's knowledge
of the orthography can be construed as an elaborate matrix of
correlations among letter patterns, phonemes, syllables, and
morphemes. Written English is an example of what we call a
quasiregular system--a body of knowledge that is systematic
but admits many irregularities. In such systems, the relations
among entities are statistical rather than categorical. Many
other types of knowledge may have this character as well.
The child's problem, then, is to acquire knowledge of this
quasiregular system. The task of reading English might be facilitated
by the systematic aspects of the writing system such as
the constraints on possible letter sequences and the correspondences
between spelling and sound. However, there are barriers
to using these types of information. Facts about orthographic
redundancy cannot be used until the child is familiar with a
large number of words. Acquiring useful generalizations about
spelling-sound correspondences is inhibited by the fact that
many words have irregular correspondences, and these words
are overrepresented among the items the child learns to read
first (e.g., GIVE, HAVE, SOME, DOES, GONE). The child must,
nonetheless, learn to use knowledge of the orthography in a
manner that supports the recognition of words within a fraction
of a second.
Our model addresses the acquisition and use of knowledge
concerning orthographic redundancy and orthographic-phonological
correspondences. We focus on these types of information
because they are sufficient to account for phenomena related
to the processing of monosyllabic words, which is our
model's domain of application. In the general discussion we return
to issues concerning syllabic and morphological knowledge
and the processing of more complex words. Our goal has been
to determine how well the basic phenomena of word naming
and recognition might be accounted for by a minimal model of
lexical processing, in which as little as possible of the solution
of the problem is built in and as much as possible is left to the
mechanisms of learning. The model is realized within the connectionist
framework being applied to many problems in perception
and cognition (McClelland & Rumelhart, 1986b;
Rumelhart & McClelland, 1986b). The model provides an account
of how these types of knowledge are acquired and used
in performing simple reading tasks such as naming words aloud
and making lexical decisions. One of the main points of the
model is that, because of the quasiregular character of written
English, it is felicitous to represent these types of knowledge in
terms of the weights on connections between simple processing
units in a distributed memory network. Learning then involves
modifying the weights through experience in reading and pronouncing
words. Thus, the connectionist approach is ideally
suited to accounting for word recognition because of the nature
of the task, which is largely determined by these characteristics
of the orthography.
A key feature of the model we propose is the assumption that
there is a single, uniform procedure for computing a phonological
representation from an orthographic representation that is
applicable to irregular words and nonwords as well as regular
words. A central dogma of many earlier models (e.g., the dual-route
accounts of Coltheart, 1978; Marshall & Newcombe,
1973; Meyer, Schvaneveldt, & Ruddy, 1974) is that irregular
words and nonwords require separate mechanisms for their
pronunciation: Irregular words require lexical lookup because
they cannot be pronounced by rule, whereas nonwords require
a system of rules because their pronunciations cannot be looked
up (see Seidenberg, 1985b, 1988, for discussion). Whether, in
fact, two mechanisms are required, and whether they are the
mechanisms postulated in dual-route models, are among the
main issues that our model addresses. The model does not entail
a lookup mechanism because it does not contain a lexicon
in which there are entries corresponding to individual words.
Nor does it contain a set of pronunciation rules. Instead, it replaces
both by a single mechanism that learns to process regular
words, irregular words, nonwords, and other types of letter
strings through experience with the spelling-sound correspondences
implicit in the set of words from which it learns.
The model gives a detailed account of a range of empirical
phenomena that have been of continuing interest to reading researchers,
including (a) differences between words in terms of
processing difficulty, (b) differences between readers in terms of
word recognition skill, (c) transitions from beginning to skilled
reading, and (d) differences between silent reading and reading
aloud. The model also provides an account of certain forms of
dyslexia that are observed developmentally and as a consequence
of brain injury.
Description of the Model
Precursors
Before we turn to the model itself, it is important to acknowledge
several precursors of this work. In some ways, this model
can be seen as an application of many of the principles embodied
in the interactive activation model of word perception
(McClelland & Rumelhart, 1981) to a more distributed model
of the kind used by Rumelhart and McClelland (1986a) in their
simulation of the acquisition of past tense morphology. This
work draws heavily on insights into distributed representation
due primarily to Hinton (1984; Hinton, McClelland, & Rumelhart,
1986) and exists only because of Rumelhart, Hinton, and
Williams's (1986) discovery of a learning procedure for multilayer
networks. In applying many of these ideas to the task of
reading, we follow in the footsteps of Sejnowski and Rosenberg's
(1986) NETtalk model, which was the first application of
the Rumelhart et al. algorithm to the problem of learning the
spelling-sound correspondences of English. Sejnowski and Rosenberg
recognized that this knowledge could be represented by
a parallel distributed network rather than a set of pronunciation
rules. Our goal was to explore the adequacy of this approach by
developing a model that could be related to a broad range of
phenomena concerning human performance.
Several previous models of visual word recognition also influenced
the development of the somewhat different account
presented here. Among them are Morton's (1969) logogen
model, the dual-route model of Coltheart (1978), and
Glushko's (1979) lexical analogy model. Later in the text we
show how our model relates to these precursors. Finally, our
account of lexical decision is similar to ones proposed by Gordon
(1983) and Balota and Chumbley (1984).
The Larger Framework
As we have noted, the model was developed with the goal of
using a minimal architecture in which the learning aspect
played a dominant role. Some minimal structural assumptions
were required, however. A second goal was to keep things as
simple as possible; therefore, the model we have implemented
is a simplification of the larger, somewhat richer processing system
that surely is required to account for aspects of single word
processing outside our primary concerns. We begin by describing
the larger framework of which the model we have implemented
is a part; we then describe the simplifications and detailed
assumptions of the implementation.
The larger framework assumes that reading words involves
the computation of three types of codes: orthographic, phonological,
and semantic. Other codes are probably also computed
(concerning, e.g., the syntactic and thematic functions of
words), but we have not included them in the present model
because they probably are more relevant to comprehension
processes than to the recognition and pronunciation of monosyllabic
words. Each of these codes is assumed to be a distributed
representation; that is, to be a pattern of activation distributed
over a number of primitive representational units. Each
processing unit has an activation value that in our model ranges
from 0 to 1. The representations of different entities are encoded
as different patterns of activity over these units.
Processing in the model is assumed to be interactive (Marslen-
Wilson, 1975; McClelland, 1987; McClelland & Rumelhart,
1981; Rumelhart, 1977). That is, we assume that the process
of building a representation at each of the three levels both
influences, and is influenced by, the construction of representations
at each of the other levels. We also assume, in keeping with
this inherently interactive view, that word processing can be influenced
by contextual factors arising from syntactic, semantic,
Figure 1. General framework for lexical processing.
(The implemented model is in boldface type. Example labels in the figure: MAKE, /mAk/.)
and pragmatic constraints, although the scope and locus of
these effects is a matter of current debate (see McClelland,
1987; Rumelhart, 1977; Tanenhaus, Dell & Carlson, 1987, for
discussion). We assume that at least some of these types of information
constrain the construction of the representation at the
semantic level and, thus, indirectly influence construction of
representations at the other levels, and conversely, that the construction
of a representation of the context is influenced by activation
at the semantic level.
As in other connectionist models, processing is mediated by
connections among the units. However, it is well known that
there are limits on the processing capabilities inherent in networks
in which there are only direct connections between units
at different representational levels (Hinton et al., 1986; Minsky
& Papert, 1969). In view of these limits, it is crucial that there
be a set of so-called hidden units, mediating between the pools
of representational units.
The assumptions described thus far are captured in Figure 1,
in which each pool of units--both hidden units and representational
units--is represented by an ellipse. Connections between
units on different levels are represented by arrows. These arrows
always run in both directions, in keeping with the assumption
of interactivity.
The Simulation Model
The model that we have actually implemented is illustrated
in Figure 2 and is the part of Figure 1 in boldface type. This
simplified model removes the semantic and contextual levels,
leaving only the orthographic level, the phonological level, and
the interlevel of hidden units between these two. Furthermore,
as an additional simplification, we have not implemented feedback
from the phonological to the hidden units; this means, in
effect, that phonological representations cannot influence
the construction of representations at the orthographic
Figure 2. Structure of the implemented model and number of units.
level. There is, however, feedback from the hidden units to the
orthographic units. This feedback plays the role of the top-down
word-to-letter connections in the interactive activation model
of word perception, allowing the model to sustain, reinforce,
and clean up patterns produced by external input to the orthographic
level.
Several further assumptions were required in implementing
this simplified model. These assumptions can be grouped into
three types: Processing assumptions, specifying the way in
which activations influence each other; learning assumptions,
specifying how connection strength adjustment takes place as a
result of experience; and representational assumptions, specifying
how orthographic and phonological characteristics of words
are to be represented.
Processing assumptions. At a fine-grained level, we believe
it would be most accurate to characterize processing in terms
of the gradual buildup of activation (McClelland, 1979;
McClelland & Rumelhart, 1981), subject to a considerable
amount of random noise. However, for simplicity, the simulation
model actually computes activations deterministically in a
single processing sweep. This simplification makes simulation
of the learning process feasible because it speeds up simulation
by a couple of orders of magnitude.
Details of the processing assumptions of the model are as follows.
Each word-processing trial begins with the presentation
of a letter string, which the simulation program then encodes
into a pattern of activation over the orthographic units, according
to the representational assumptions described later. Next,
activations of the hidden units are computed on the basis of the
pattern of activation at the orthographic level. For each hidden
unit, a quantity called the net input is computed; this is simply
the sum, over all input units, of the activation of each input unit
times the weight on the connection from that input unit to the hidden
unit, plus a bias term that is unique to the hidden unit. Thus, for
hidden unit $i$, the net input
is given by
$$\mathrm{net}_i = \sum_j w_{ij} a_j + \mathrm{bias}_i.$$
Here $j$ ranges over the orthographic units, $a_j$ is the activation of
orthographic unit $j$, $\mathrm{bias}_i$ is the bias term for hidden unit $i$, and
$w_{ij}$ is the weight of the connection to unit $i$ from unit $j$. The bias
term may be thought of as an extra weight or connection to the
unit from a special unit that always has activation of 1.0.
The activation of the unit is then determined from the net
input using a nonlinear function called the logistic function:
$$a_i = \frac{1}{1 + e^{-\mathrm{net}_i}}.$$
The activation function must be nonlinear for reasons described
in Rumelhart, Hinton, and Williams (1986). It must be monotonically
increasing and have a smooth first derivative for reasons
having to do with the learning rule. The logistic function
satisfies these constraints.
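The constraints just listed can be checked numerically. The following is a minimal sketch, not the authors' code; the function names `logistic` and `logistic_slope` are hypothetical. It relies on the standard identity that the derivative of the logistic function equals a(1 - a), where a is the function's output.

```python
import math


def logistic(net):
    """Logistic activation: maps any net input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def logistic_slope(net):
    """First derivative of the logistic, expressed through its own output."""
    a = logistic(net)
    return a * (1.0 - a)


nets = [-4.0, -1.0, 0.0, 1.0, 4.0]
activations = [logistic(n) for n in nets]
slopes = [logistic_slope(n) for n in nets]

# Compare the analytic slope with a central finite difference at one point.
h = 1e-6
numeric_slope = (logistic(1.0 + h) - logistic(1.0 - h)) / (2 * h)

for n, a, s in zip(nets, activations, slopes):
    print(f"net={n:+.1f}  activation={a:.4f}  slope={s:.4f}")
```

The activations rise strictly with the net input and the slope is positive everywhere, which is what a gradient-based learning rule needs in order to assign a well-defined change to every weight.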
Once activations over the hidden units have been computed,
they are used to compute activations for the phonological units
and new activations for the orthographic units based on feedback
from the hidden units. These activations are computed
following exactly the same computations already described;
first, the net input to each unit is calculated, based on the activations
of all of the hidden units; then the activation of each of
these units is computed, based on the net inputs.
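The single processing sweep described above can be sketched as follows. This is an illustration under stated assumptions rather than the implemented simulation: the pool sizes are toy values, the weights are random placeholders, the input pattern is arbitrary, and all function and key names (`pool_activations`, `single_sweep`, `w_hidden`, and so on) are hypothetical.

```python
import math
import random


def logistic(net):
    """Logistic activation applied to a unit's net input."""
    return 1.0 / (1.0 + math.exp(-net))


def pool_activations(sending, weights, biases):
    """Activations of a receiving pool; weights[i][j] runs to unit i from unit j."""
    return [
        logistic(sum(w * a for w, a in zip(row, sending)) + bias)
        for row, bias in zip(weights, biases)
    ]


def random_weights(n_receiving, n_sending, rng):
    """Placeholder weights; a trained network would supply learned values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_sending)]
            for _ in range(n_receiving)]


def single_sweep(orthographic_input, params):
    """Orthographic input -> hidden -> (phonological output, orthographic feedback)."""
    hidden = pool_activations(orthographic_input, params["w_hidden"], params["b_hidden"])
    phonological = pool_activations(hidden, params["w_phon"], params["b_phon"])
    orthographic = pool_activations(hidden, params["w_orth"], params["b_orth"])
    return hidden, phonological, orthographic


# Toy pool sizes chosen for illustration only (assumption, not the model's sizes).
N_ORTH, N_HIDDEN, N_PHON = 6, 4, 5
rng = random.Random(0)
params = {
    "w_hidden": random_weights(N_HIDDEN, N_ORTH, rng),
    "b_hidden": [0.0] * N_HIDDEN,
    "w_phon": random_weights(N_PHON, N_HIDDEN, rng),
    "b_phon": [0.0] * N_PHON,
    "w_orth": random_weights(N_ORTH, N_HIDDEN, rng),
    "b_orth": [0.0] * N_ORTH,
}

hidden, phonological, orthographic = single_sweep([1, 0, 1, 0, 0, 1], params)
print(len(hidden), len(phonological), len(orthographic))
```

Both output pools are computed from the same hidden pattern, and nothing flows back from the phonological pool, which mirrors the simplification that phonological feedback is not implemented.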
Learning assumptions. When the model is initialized, the
connection strengths and biases in the network are assigned
random initial values between -0.5 and +0.5. This means that each hidden
unit computes an entirely arbitrary function of the input it
receives from the orthographic units and sends a random pattern
of excitatory and inhibitory signals to the phonological
units and back to the orthographic units. This also means that
the network has no initial knowledge of particular correspondences
between spelling and sound, nor can its feedback to the
orthographic units effectively sustain or reinforce inputs to
these units. Thus, the ability to recreate the orthographic input
and generate its phonological code arises as a result of learning
from exposure to letter strings and the corresponding strings of
phonemes.
Learning occurs in the model in the following way. An orthographic
string is presented and processing takes place, as described,
producing first a pattern of activation over the hidden
units and then a feedback pattern on the orthographic units and
a feedforward pattern on the phonological units. At this point,
these two output patterns produced by the model are compared
to the correct, target patterns that the model should have produced.
The target for the orthographic feedback pattern is simply
the orthographic input pattern; the target for the phonological
output is the pattern representing the correct phonological
code of the presented letter string. We assume that in reality
the phonological pattern may be supplied as explicit external
teaching input--as in the case in which the child sees a letter
string and hears a teacher or other person say its correct pronunciation--or
self-generated on the basis of the child's prior
knowledge of the pronunciations of words and the contexts in
which they occur.
For each orthographic and phonemic unit, the difference between
the correct or target activation of the unit and its actual
activation is computed as follows:
$$d_i = (t_i - a_i).$$
The learning procedure adjusts the strengths of all of the connections
in the network in proportion to the extent to which
this change will reduce a measure of the total error, E. Thus,
$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}}.$$
Here $\epsilon$ is a learning rate parameter, and $E$ is the sum of the
difference terms for each unit, each squared:
$$E = \sum_i d_i^2.$$
The term $\partial E / \partial w_{ij}$ is the partial derivative of the error measure
with respect to a change in the weight to unit $i$ from unit $j$.¹
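To make the update rule concrete, the sketch below applies it repeatedly to a single training pattern in a toy network. It is a minimal illustration under several assumptions that are not taken from the article: the pool sizes, the learning rate of 0.5, the input and target patterns, and the treatment of the phonological and orthographic output units as one combined output pool. The partial derivatives are obtained with the standard chain-rule computation for logistic units, and all names (`train_step`, `w1`, `w2`, and so on) are hypothetical.

```python
import math
import random


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w1, b1, w2, b2):
    """Input -> hidden -> output activations; w[i][j] runs to unit i from unit j."""
    h = [logistic(sum(w * a for w, a in zip(row, x)) + b) for row, b in zip(w1, b1)]
    y = [logistic(sum(w * a for w, a in zip(row, h)) + b) for row, b in zip(w2, b2)]
    return h, y


def train_step(x, target, w1, b1, w2, b2, eps):
    """One gradient-descent update in place; returns E measured before the update."""
    h, y = forward(x, w1, b1, w2, b2)
    d = [t - a for t, a in zip(target, y)]        # d_i = t_i - a_i
    error = sum(di * di for di in d)              # E = sum_i d_i^2
    # dE/dnet_i for output units: -2 d_i a_i (1 - a_i)
    g_out = [-2.0 * di * a * (1.0 - a) for di, a in zip(d, y)]
    # dE/dnet_k for hidden units, propagated back through the old w2
    g_hid = [
        sum(g_out[i] * w2[i][k] for i in range(len(y))) * h[k] * (1.0 - h[k])
        for k in range(len(h))
    ]
    # delta w_ij = -eps * dE/dw_ij, where dE/dw_ij = dE/dnet_i * a_j;
    # each bias is treated as a weight from a unit whose activation is always 1.
    for i, gi in enumerate(g_out):
        for k, hk in enumerate(h):
            w2[i][k] -= eps * gi * hk
        b2[i] -= eps * gi
    for k, gk in enumerate(g_hid):
        for j, xj in enumerate(x):
            w1[k][j] -= eps * gk * xj
        b1[k] -= eps * gk
    return error


rng = random.Random(1)
n_in, n_hid, n_phon = 4, 3, 2          # toy sizes (assumption)
n_out = n_phon + n_in                  # phonological units plus orthographic feedback


def uniform_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w1, b1 = uniform_matrix(n_hid, n_in), [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
w2, b2 = uniform_matrix(n_out, n_hid), [rng.uniform(-0.5, 0.5) for _ in range(n_out)]

x = [1.0, 0.0, 1.0, 0.0]               # arbitrary orthographic input pattern
target = [1.0, 0.0] + x                # phonological target, then the input itself
errors = [train_step(x, target, w1, b1, w2, b2, eps=0.5) for _ in range(200)]
print(f"E before training: {errors[0]:.3f}, after 200 updates: {errors[-1]:.3f}")
```

The target for the orthographic part of the output is the input pattern itself, as the text specifies, so the same error measure drives both the pronunciation and the re-creation of the input.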
The algorithm that is used to compute the partial derivative
for each weight is the back-propagation learning procedure of
Rumelhart, Hinton, and Williams (1986). Readers are referred
to Rumelhart, Hinton, and Williams for an explanation of how
these partial derivatives are calculated. For our purposes the important
thing to note is that the rule changes the strength of
each weight in proportion to the size of the effect changing it
will have on the error measure. Large changes are made to
weights that have a large effect on E, and small changes are made
to weights that have a small effect on E.
Representational assumptions. In reality, the orthographic
and phonological representations used in reading are determined
by learning processes, subject to initial constraints imposed
by biology and prior experience. The learning of these
representations is beyond the scope of the model; for simplicity,
we have treated them as fixed in the simulations. Our choice of
representations is not intended to be definitive; rather, it was
motivated primarily by a desire to capture a few general properties
that we would expect such representations to acquire
through learning, while at the same time building in very little
specifically about the correspondences between spelling and
sound or about the particular kinds of letter and phoneme
strings that are words in English.
In representing a word's orthographic or phonological content,
it is not sufficient to activate a unit for each of the letters
or phonemes in the word because this would yield identical representations
for pairs such as BAT and TAB. It is necessary to
use some scheme that specifies the context in which each letter
occurs. We chose to use a variant of Wickelgren's (1969) triples
scheme, following Rumelhart and McClelland (1986a), rather
than the strict positional encoding scheme of McClelland and
Rumelhart (1981). In this we have given the model a tendency
to be sensitive to local context rather than absolute spatial position,
because letters occurring in similar local contexts activate
units in common. Thus, for example, the letter string MAKE is
treated as the set of letter triples _MA, MAK, AKE, and KE_
(where _ is a symbol representing the beginning or ending of a
word), whereas the phoneme string /mAk/ is treated as the set
of phoneme triples _mA, mAk, and Ak_. 2
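The decomposition into triples can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that each character of the string stands for exactly one letter or one phoneme, as in the notation used in this article.

```python
def triples(string, boundary="_"):
    """Return the context-sensitive triples of a letter or phoneme string."""
    padded = boundary + string + boundary
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

make = triples("MAKE")   # ['_MA', 'MAK', 'AKE', 'KE_']
mAk = triples("mAk")     # ['_mA', 'mAk', 'Ak_']

# A bag of letters cannot tell BAT from TAB, but the sets of triples can.
same_letters = sorted("BAT") == sorted("TAB")
same_triples = set(triples("BAT")) == set(triples("TAB"))
```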
Note that we do not claim that this scheme in its present form
is fully sufficient for representing all of the letter or phoneme
sequences that form words (see Pinker & Prince, 1988). However,
we are presently applying the model only to monosyllables,
and the representation is sufficient for these (see general discussion).
Extensions of the representation scheme can be envisioned
in which more global properties such as approximate
position with respect to particular vowel groups are also represented
in conjunction with each triple. Such a scheme would
largely collapse to the present one for monosyllables.
An important way in which our representations differ from
Wickelgren's (1969) proposal is that we do not assume a one-to-one
correspondence between triples and units; rather, each
triple is encoded as a distributed pattern of activation over a set
of units, each of which participates in the representation of
many triples. The representation used at the phonemic level is
the same as that used by Rumelhart and McClelland (1986a).
Each unit represents a triple of phonetic features, one feature
of the first of the three phonemes in each triple, one feature of
the second of the three, and one of the third. 3 For example, there
is a unit that represents [vowel, fricative, stop]. This unit should
be activated for any word containing such a sequence, such as
the words POST and SOFT. Word boundaries are also represented
in the featural representation, so that there is a unit, for example,
that represents [vowel, liquid, word boundary]; this unit
would come on in words like CAR and CALL. There are 460 such
units, and each phoneme triple activated 16 of them; see
Rumelhart and McClelland (1986a) for details.
The representation used at the orthographic level is similar
to that used at the phonological level, except that in this instance
400 units were used, and each unit was set up according
to a slightly different scheme. For each unit, there is a table containing
a list of 10 possible first letters, 10 possible middle letters,
and 10 possible end letters. These tables are generated randomly
except for the constraint that the beginning or end of
word symbol does not occur in the middle position. When the
unit is on, it indicates that one of the 1,000 possible triples that
could be made by selecting one member from the first list of 10,
one from the second, and one from the third is present in the
string being represented. Each triple activated about 20 units.
Although each unit is highly ambiguous, over the full set of 400
such randomly constructed units, the probability that any two
sequences of three letters would activate all and only the same
units in common is effectively zero. 4 In sum, both the phonological
and the orthographic representations can be described as
coarse-coded, distributed representations of the sort discussed
by Hinton et al. (1986). The representations allow any letter
and phoneme sequences to be represented, subject to certain
saturation and ambiguity limits that can arise when the strings
get too long. Thus, there is a minimum of built-in knowledge
1 In fact, the size of the adjustments made to the strengths of the connections
in the model is given by a somewhat more complex expression,
as follows:
$$\Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}} + \alpha \Delta w'_{ij}.$$
Here $\Delta w'_{ij}$ refers to the previous increment to the weight, and $\alpha$ is a
parameter between 0 and 1. $\alpha$ can be thought of as specifying how much
momentum there is in the magnitude of the changes made to the
weights.
2 Here, and elsewhere in the article, we use the following notation for
representing phonemes: A = a in GAVE; a = a in HAVE; O = o in POSE;
U = o in LOSE; i = i in LINT; I = i in PINT; E = ee in SEED; ^ = u in MUST;
u = oo in NOOK; o = o in HOT; W = ow in HOW; * = aw in PAW.
3 The set of phonological features used was somewhat simplified, so
that certain phoneme pairs (e.g., the initial phonemes in CHIN and
SHIN) were not in fact distinguished. See Rumelhart and McClelland
(1986a) for details.
4 Ghosts are capable of appearing in this representation when it becomes
too "saturated"; that is, when too many of the units are on at
one time. This is one reason why a richer representation would be required
to represent multisyllabic words.
of orthographic or phonological structure. The use of a coding
scheme sensitive to local context does promote the exploitation
of local contextual similarity as a basis for generalization in the
model; that is, what it learns to do for a grapheme in one local
context (e.g., the M in MAKE) will tend to transfer to the same
grapheme in similar local contexts (e.g., the Ms in MADE and
MATE and, to a lesser extent, Ms in contexts such as MILE and
SMALL).
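The orthographic coding scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the alphabet is taken to be the 26 uppercase letters plus one boundary symbol, the tables are drawn by uniform sampling without replacement, and the seed and all function names are hypothetical. Under these assumptions a triple of letters would be expected to activate roughly 400 x (10/27) x (10/26) x (10/27), or about 21, units, which is close to the figure of about 20 reported in the text.

```python
import random
import string

BOUNDARY = "_"
LETTERS = string.ascii_uppercase          # 26 letters
EDGE_SYMBOLS = LETTERS + BOUNDARY         # boundary allowed in first and last positions only

def make_units(n_units=400, table_size=10, seed=0):
    """Build one random (first, middle, last) table of symbols per unit."""
    rng = random.Random(seed)
    units = []
    for _ in range(n_units):
        first = set(rng.sample(EDGE_SYMBOLS, table_size))
        middle = set(rng.sample(LETTERS, table_size))   # never the boundary symbol
        last = set(rng.sample(EDGE_SYMBOLS, table_size))
        units.append((first, middle, last))
    return units

def active_units(triple, units):
    """Indices of the units whose tables contain all three symbols of the triple."""
    return {i for i, (first, middle, last) in enumerate(units)
            if triple[0] in first and triple[1] in middle and triple[2] in last}

def encode(word, units):
    """Set of units turned on by a word: the union over all of its triples."""
    padded = BOUNDARY + word + BOUNDARY
    on = set()
    for i in range(len(padded) - 2):
        on |= active_units(padded[i:i + 3], units)
    return on

units = make_units()
make_pattern = encode("MAKE", units)
made_pattern = encode("MADE", units)
shared = make_pattern & made_pattern   # units the two words have in common
```

Because MAKE and MADE share the triple _MA, every unit activated by that triple belongs to both patterns, which is one way to see how the scheme promotes generalization across similar local contexts.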
Naming and Lexical Decision
The model produces patterns of activation across the orthographic
and phonological units as its output. For naming, we
assume that the pattern over the phonological units serves as the
input to a system that constructs an articulatory-motor program,
which in turn is executed by the motor system, resulting
in an overt pronunciation response. In reality, we believe that
these processes operate in a cascaded fashion, with the triggering
of the response occurring when the articulatory-motor program
has evolved to the point at which it is sufficiently differentiated
from other possible motor programs. Thus, activation
would begin to build up first at the orthographic units, propagating
continuously from there to the hidden and phonological
units and from there to the motor system in which a response
would be triggered when the articulatory-motor representation
became sufficiently differentiated.
The simulation model simplifies this picture. Activations of
the phonological units are computed in a single step, and the
construction and execution of articulatory motor programs are
unimplemented. The activations that are computed in this way
can be shown to correspond to the asymptotic activations that
would be achieved in a cascaded activation process (Cohen,
Dunbar, & McClelland, 1989). To relate the patterns of activation
the model produces to experimental data on latency and
accuracy of naming responses, we use what we call the phonological
error score, which is the sum of the squared differences
between the target activation value for each phonological unit
and the actual activation computed by the network.
It is important not to treat the error score as a direct measure
of the accuracy of an overt response made by the network. In
fact, the error scores can never actually reach zero because the
logistic function used in setting the activations of units prevents
activations from ever reaching their maximum or minimum
values. Rather, with continued practice, error scores simply get
smaller and smaller, as activations of units approximate more
and more closely the target values. This improvement continues
well beyond the point at which the correct answer is the best
match to the pattern produced by the network. To determine
the best match, we simply use the error score as a measure of
how closely the pattern computed by the net matches the correct
pronunciation and each of several other possible pronunciations.
In general, as we will present in detail later, we find
that after training, the error score is lower for the correct pronunciation
than for any other.
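The error score and the best-match comparison can be written out as a short Python sketch. The four-unit patterns and the output vector below are invented purely for illustration (the model uses hundreds of phonological units), and the function names are hypothetical.

```python
def error_score(target, output):
    """Sum of squared differences between target and computed activations."""
    return sum((t - o) ** 2 for t, o in zip(target, output))

def best_match(output, candidates):
    """Return the label of the candidate pattern with the smallest error score."""
    return min(candidates, key=lambda label: error_score(candidates[label], output))

# Hypothetical target patterns for two candidate pronunciations.
candidates = {
    "/hav/": [0.9, 0.1, 0.9, 0.1],
    "/hAv/": [0.9, 0.9, 0.1, 0.1],
}
output = [0.8, 0.3, 0.6, 0.2]   # made-up network output

scores = {label: error_score(pattern, output)
          for label, pattern in candidates.items()}
winner = best_match(output, candidates)   # "/hav/" for these made-up values
```

In this example the winning pattern still has a nonzero error score, which illustrates why the score is treated as a graded measure rather than as a direct indicator of a correct or incorrect response.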
Even when the target code provides the best fit to the pattern
of activation over the phonological units, there is still room for
considerable variation in error scores. We assume that lower
error scores are correlated with faster and more accurate responses
under time pressure. The rationale for the accuracy assumption
is simply that a low error score signifies that the pattern
produced by the network is relatively clear and free from
noise, and so provides a better signal for the articulatory-motor
programming and execution processes to work with. The rationale
for the speed assumption is as follows: In a cascaded system,
patterns that are asymptotically relatively clear (low in error)
will reach a criterion level of clarity relatively quickly. Simulations
demonstrating this point are presented in Cohen et al.
(1989).
Thus far, we have discussed the use of the phonological error
score as a measure of the accuracy and speed of naming. We
shall see that this measure is sensitive to familiarity; the more
frequently the network has processed a particular word, the
smaller the error score will be. The error score computed over
the orthographic units is likewise related to familiarity. Because
the input pattern is also the target pattern for the orthographic
feedback, the orthographic error score is simply the sum of the
squares of the differences between the feedback pattern computed
by the network and the actual input to the orthographic
units. For lexical decision, in which the subject's task is to judge
whether the stimulus is a familiar word, we assume that a measure
like the orthographic error score is actually used in making
this judgment. Note that this differs from our use of the phonological
error score in accounting for naming performance. The
calculated phonological error score is simply a measure of the
asymptotic clarity of the computed phonological representation,
which we use to predict naming latencies. In contrast, a
measure like the orthographic error score is assumed to be actually
computed by subjects as part of the decision process. Because
the orthographic input is in fact presented to the subject,
it seems reasonable to assume that subjects can compare this
input to the internally generated feedback from the hidden
units and use the result of this comparison process as the basis
for judgments of familiarity. This issue is considered again in
the section on Lexical Decisions in the Model.
Our goal was to develop a working simulation model that exhibited
many of the basic phenomena of word recognition and
naming, based on a theory of what is learned and how it is represented.
When it came to assessing the performance of the
model, we discovered that there was a simple monotonic relationship
between error scores and naming latencies. This result
was quite surprising, given that the error scores depend on some
of the more arbitrary aspects of the simulations, such as the
number of orthographic encoding units, the number of entries
in each unit's table, and the Wickelphone output scheme. In
addition, the error scores reflect the effects of training on a corpus
of fewer than 3,000 words, many fewer than a skilled reader
would know. Finally, we calculated the error scores using the
weights from 250 learning epochs; other weights could have
been used. The net result is that, although the fit between error
scores and latencies is very good, it is by no means perfect. In
future research it will be necessary to determine whether a better
fit could be achieved by addressing some of the limitations
of the present implementation (see the General Discussion section).
Parameters
Once the input and output representations are specified, the
model leaves us with very few free parameters. There are two
free parameters of the input representation, the number of letters
in each unit's table and the number of such units. After
picking plausible initial values for these, however, we did not
manipulate them. There are two other parameters: the learning
rate $\varepsilon$ and the number of hidden units. For both of these parameters,
the initial values we chose (0.05 and 200, respectively)
have turned out to produce quite good quantitative accounts of
the phenomena. It is interesting that manipulation of the learning
rate parameter has rather little effect; acquisition is not so
much slower as less noisy with a smaller learning rate. Manipulation
of the number of hidden units, however, has interesting
and illuminating effects, which are considered when we discuss
individual differences in learning to read. For completeness,
two other parametric details should be mentioned. First, as targets
for learning, we used the values of 0.9 and 0.1; that is, the
model was trained to set the activations of units that should be
on to 0.9 and the activations of units that should be off to 0.1,
rather than to the extreme values of 1.0 and 0.0. Second, the
momentum parameter, $\alpha$, was set at 0.9. These values are commonly
used in models of this type (see, e.g., Sejnowski & Rosenberg,
1986, and Footnote 1).
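To make the roles of the learning rate and the momentum parameter concrete, the following Python sketch applies the update from Footnote 1 to a single weight. The constant gradient of 1.0 is a hypothetical value chosen only to show how momentum accumulates; in the simulations the gradient changes from trial to trial.

```python
EPSILON = 0.05   # learning rate reported in the text
ALPHA = 0.9      # momentum parameter reported in the text

def momentum_step(weight, gradient, previous_increment,
                  epsilon=EPSILON, alpha=ALPHA):
    """Apply delta_w = -epsilon * dE/dw + alpha * (previous delta_w) to one weight."""
    increment = -epsilon * gradient + alpha * previous_increment
    return weight + increment, increment

# With a constant gradient, successive increments approach
# -epsilon * gradient / (1 - alpha), ten times the plain step when alpha = 0.9.
w, dw = 0.0, 0.0
increments = []
for _ in range(200):
    w, dw = momentum_step(w, gradient=1.0, previous_increment=dw)
    increments.append(dw)
```

The first increment is the plain gradient step of -0.05, and later increments approach -0.5, so a consistent gradient direction is amplified while gradients that alternate in sign would largely cancel.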
The Training Regime
There is one other factor that has profound effects on the
model's performance, namely, the set of learning experiences
with which it is trained. The training corpus we have used consists
of all of the monosyllabic words in the Kucera and Francis
(1967) word count that consist of three or more letters. From
these we removed proper nouns, words we judged to be foreign,
abbreviations, and morphologically complex words that were
formed from the addition of a final -s or -ed inflection. Note
that this is not a complete list of monosyllables; the word FONT,
for example, is one of many that do not appear in Kucera and
Francis. Nevertheless, the corpus provides a reasonable approximation
of the set of monosyllables in the vocabulary of an average
American reader. To this list we added a number of words
that had been used in some of the experiments that we planned
to simulate. Some of these words were inflected forms (e.g.,
DOTS); for these, the Kucera-Francis frequency of the base
form was used. Others were simply entered into the word list
with frequencies of 0. The resulting list contained 2,897 words.
This total includes 13 homographs (words such as WIND and
BASS that have two pronunciations) that were entered twice,
once with each pronunciation. Thus, there were 2,884 unique
orthographic patterns in the list.
The training regime was divided into a series of epochs.
Within an epoch, each word had a chance of being presented
that was monotonically related to its estimated frequency:
p = K log(frequency + 2).
A value of K was chosen so that the most frequent word (THE)
had a probability of about .93. Words occurring once per million
had probabilities of about .09, and words not occurring in
the Kucera-Francis count had probabilities of .057. Thus, the
expected value of the number of presentations of a word over
250 epochs ranged from about 230 to about 14. Because the
sampling process is in fact random, there was about a 5%
chance that one of the least probable words would be presented
fewer than 7 times in 250 epochs.
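The figures in this paragraph can be checked with a short Python sketch. Two assumptions are made that the text does not state: the Kucera-Francis count for THE is taken to be 69,971, and each word is treated as being sampled independently within an epoch. The base of the logarithm only rescales K, so base 10 is used here without loss of generality. The function names are hypothetical.

```python
import math
import random

THE_FREQUENCY = 69971   # assumed Kucera-Francis count for THE; not given in the text
K = 0.93 / math.log10(THE_FREQUENCY + 2)   # chosen so that p(THE) is .93

def presentation_probability(frequency, k=K):
    """p = K log(frequency + 2), the chance of presentation within one epoch."""
    return k * math.log10(frequency + 2)

def sample_epoch(frequencies, rng):
    """Words presented in one epoch, each included independently with probability p."""
    return [word for word, freq in frequencies.items()
            if rng.random() < presentation_probability(freq)]

p_the = presentation_probability(THE_FREQUENCY)    # .93 by construction
p_once = presentation_probability(1)               # about .09
p_zero = presentation_probability(0)               # about .057
expected_over_250 = (250 * p_the, 250 * p_zero)    # roughly 230 and 14

presented = sample_epoch({"THE": THE_FREQUENCY, "RAKE": 1}, random.Random(0))
```

Under these assumptions the ratio of p_the to p_once is close to 10, in line with the compression of the frequency range discussed in the next paragraph.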
The use of the logarithmic frequency transformation radically
compresses the range of variation in the presentation frequencies
of different words. For example, the word THE is presented
only about 10 times as often as a word like RAKE,
whereas in the Kucera and Francis (1967) corpus, THE occurs
more than 69,000 times as frequently as RAKE. This compression
was motivated in part by practical considerations. We did
not think it feasible to run sufficient trials to achieve even the
current level of exposure to the least frequent words without
compressing the frequency range. Using compressed frequencies,
we achieved this level of exposure with a total of 150,000
learning trials. Using uncompressed frequencies, something on
the order of 5,000,000 learning trials would have been required;
this would take several months given available computational
resources.
There are several other reasons why some compression of the
frequency range is preferable to the use of raw frequencies.
First, the word frequencies found in a count such as Kucera and
Francis (1967) are based on samples of written text taken from
adult sources and do not reflect the relative frequencies of words
experienced by beginning readers. In the early stages of learning
to read, the words to which the child is exposed necessarily span
a much narrower range of frequencies than in the adult norms.
With additional experience, the relative frequencies of words
begin to differentiate. The logarithmic transform, which compresses
the range of frequencies, is thus more in keeping with
the child's experience than with the adult's. We thought that it
was important to approximate this aspect of the child's experience
because the largest gains in reading skill occur early in
training. This is true both for the model, as will be shown, and
for children, whose knowledge of the spelling-sound correspondences
of the language expands rapidly during the first year or
two of instruction.
A second point is that the frequency transform compensates
for the effects of another aspect of the implemented model, the
restricted corpus of words used in training. The training corpus
consists entirely of monosyllabic words and includes only a few
morphologically complex words. Children learn the spelling-sound
correspondences of the language on the basis of exposure
to both mono- and multisyllabic words, including morphological
relatives that were excluded from the simulations. For example,
the model is trained on a word such as DUNK but does not
gain additional feedback from related items such as DUNKED
or DUNKING. The net effect is that the listed frequencies of the
base words tend to underestimate their actual frequency of occurrence
in the language. This factor will have little effect on
the model's performance on higher frequency words; the morphological
relatives tend to be much lower in frequency, and
including these words would result in little additional learning.
However, the morphological relatives of the lower frequency
items tend to be as frequent as or more frequent than the base
words themselves; excluding these items eliminates an important
source of feedback. Thus, the restrictions on the training
set disproportionately penalize the lower frequency words,
which the frequency transform tends to counteract.
The effects of the frequency compression must also be considered
in light of the properties of the learning algorithm we used,
which is an error-correcting learning procedure. This means
that changes in connection strengths are made only to the degree
that the network fails to match the target. It follows that
the magnitudes of the changes tend to diminish with successive
presentations of a word. The data to be presented indicate that
the model reached nearly asymptotic performance on higher
frequency words with fewer than 250 presentations; thus, additional
presentations would have little effect. The net result is
that the network itself effectively compresses the effects of frequency
as it learns in any case. Where the compression in the
frequency range does have an effect is on the relative speed with
which high- and low-frequency words are mastered. With compression, higher frequency
words do not reach asymptote as quickly as they otherwise would because they
are presented less often than their raw frequencies would dictate.
In summary, it seems likely that our compression of the frequency
range may distort to some extent the rate of mastery of
words of different frequencies. However, several considerations
suggest that the effects of this compression are less significant
than one might initially suppose. The differences between high- and
low-frequency words relevant to the child's experience are
actually smaller than the norms suggest. Moreover, given the
properties of the corpus we have used in these simulations, some
compression of the frequency range seems appropriate. In the
final section of this article, we also present data from an additional
simulation indicating that the model's performance replicates
when a broader range of frequencies is used.
We should stress that the model represents a claim about the
types of knowledge that are acquired, but it is not a simulation
of the child's experience in learning to read in the American
educational system. In the model, all of the words are available
for sampling throughout training, with frequency modeled by
the probability of being selected on a given learning trial. In
actual experience, however, frequency derives in part from age
of exposure; words that are higher frequency for adults tend to
be introduced earlier than are lower frequency items. In learning
to read, then, words are introduced sequentially, and often
in groups that emphasize salient aspects of the orthography. As
shown later, however, the model nonetheless exhibits some of
the basic developmental trends characteristic of the acquisition
process.
Results
Pronunciation of Written Words
We consider first the model's account of the task of naming
written words aloud. Words vary in terms of variables such as
frequency of occurrence, orthographic redundancy and orthographic-
phonological regularity. Many studies have investigated
the effects of these variables on naming performance (see
Barron, 1986; Carr & Pollatsek, 1985; Patterson & Coltheart,
1987; Seidenberg, 1985b, for reviews). The basic research strategy
has been to examine performance in naming words that
differ systematically in terms of these structural variables. The
central observation is that even among very skilled readers,
there are differences among words in terms of ease of pronunciation.
We now consider whether the model's performance on
different types of words is comparable to that of people.
Phonological Output and Naming
Before characterizing the model's performance, it is necessary
to consider further a theory of the naming task and how it
relates to the output computed by the model. We assume that
overt naming involves three cascaded processes: (a) the input's
phonological code is computed, (b) the computed phonological
code is compiled into a set of articulatory-motor commands,
and (c) the articulatory motor code is executed, resulting in the
overt response. Only the first of these processes is implemented
in the model. In practice, however, the phonological output
computed by the model is closely related to observed naming
latencies.
A word is named by recoding the computed phonological
output into a set of articulatory motor commands, which are
then executed. Differences in naming latencies primarily derive
from differences in the quality of the computed phonological
output. Informally speaking, a word that the model "knows"
well produces phonological output that more clearly specifies
its articulatory-motor program than a word that is known less
well. Thus, naming latencies are a function of phonological error
scores, which index differences between the veridical phonological
code and the model's approximation to it. Clearly, the
computed phonological code and the compiled articulatory-motor
program are closely related, which is why the error scores
systematically relate to observed naming latencies. That the
codes are distinct is suggested by evidence that subjects are able
to use phonological information even when compilation of the
articulatory-motor program is blocked by performance of a
secondary articulatory task. For example, subjects can reliably
judge phonological properties of stimuli when they are simultaneously
mouthing a nonsense syllable (Besner & Davelaar,
1982). Other models have also distinguished between phonological
and articulatory codes (e.g., LaBerge & Samuels, 1974).
Differences in naming latencies could also be associated with
the execution of the compiled articulatory-motor programs.
Consider, for example, a factor such as frequency. The distributions
of phonemes in high- and low-frequency words differ;
some phonemes and phoneme sequences occur more often in
higher frequency words than low, and vice versa (Landauer &
Streeter, 1973). Phonemes also differ in terms of ease of articulation
(Locke, 1972); higher frequency words may contain more
of the phonemes that are easier to pronounce, or it may be that
the phonemes that are characteristic of high-frequency words
are easier to pronounce because they are used more often.
Thus, naming latencies for high- and low-frequency words
could differ not because frequency influences the computation
of phonological output or the translation of this output into an
articulatory code, but because they contain phonemes that
differ in terms of ease of articulation. We have ignored this aspect
of the naming process for two reasons. First, we have not
implemented procedures for producing articulatory output.
More important, existing studies indicate that effects of variables
such as frequency and orthographic-phonological regularity
obtain even when articulatory factors are carefully controlled.
For example, there are frequency effects even when articulatory
factors are controlled by using homophones (e.g.,
high frequency: MAIN; low frequency: MANE; see McRae, Jared,
& Seidenberg, in press; Theios & Muise, 1977). Among the
monosyllabic words under consideration, differences at the
stage of producing articulatory-motor output contribute very
little to observed naming latencies (see also Monsell, Doyle, &
Haggard, 1989). In sum, naming latencies depend in part on
factors related to the construction of an articulatory-motor
program and its execution, processes the model does not simulate.
It turns out, however, that we can give a fairly accurate
account of a broad range of naming phenomena simply in
terms of the computation from orthography to phonology.
In the sections that follow, we examine how the model performed
on different types of words that were used in behavioral
studies. Because the model was trained on a large set of words,
we can examine the model's performance on the same items
that were used in specific experiments. We evaluate the model's
performance in the following way. Given a particular input
string, the model produces a pattern of activation across the
phonological units. We characterize this pattern by comparing
it to different target patterns. For example, we can calculate an
error score that reflects the difference between the obtained pattern
and the one associated with the correct phonological code
for the input string. We can also compare the output to other
plausible phonological codes; for example, if the input were an
exception word such as HAVE, we can compare the computed
pattern of activation to the pattern for both the correct phonological
code, /hav/, and the output for a plausible alternative,
such as the regularized pronunciation /hAv/.
For the entire set of words after 250 learning epochs, the following
results obtained. In general, the error scores calculated
using the correct phonological codes as targets were much
smaller than the error scores derived by using other targets. In
order to be certain that the best fit to the computed output for
a given word was the correct phonological code, it would be
necessary to compare the output to all possible phonological
patterns, which we have not done for obvious reasons. However,
the following analysis provides a general picture of the model's
performance. The phonological output computed for each
word was compared to all of the target patterns that could be
created by replacing a single phoneme with some other phoneme.
For the word HOT, for example, the computed output was
compared to the correct code, /hot/, and to all of the strings in
the set formed by /Xot/, /hXt/, and /hoX/, where X was any
phoneme. We then determined the number of cases for which
the best fit (smallest error score) was provided by the correct
code or one of the alternatives.
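The single-substitution analysis can be sketched in Python as follows. The five-symbol phoneme inventory, the position-by-position toy encoding, and the noisy output vector are stand-ins invented for illustration; the model itself uses the distributed phonological units described earlier, and all function names are hypothetical.

```python
def single_substitutions(code, phonemes):
    """All strings formed by replacing exactly one phoneme of `code` with another."""
    alternatives = []
    for position, original in enumerate(code):
        for phoneme in phonemes:
            if phoneme != original:
                alternatives.append(code[:position] + phoneme + code[position + 1:])
    return alternatives

def best_fit(output, correct_code, phonemes, encode, error_score):
    """Code (correct or single-substitution alternative) with the smallest error score."""
    candidates = [correct_code] + single_substitutions(correct_code, phonemes)
    return min(candidates, key=lambda code: error_score(encode(code), output))

# Toy stand-ins (assumptions): a tiny inventory and a one-unit-per-slot code.
PHONEMES = "hotpa"

def toy_encode(code):
    """One unit per (position, phoneme) pair, with targets of 0.9 (on) and 0.1 (off)."""
    return [0.9 if symbol == p else 0.1 for symbol in code for p in PHONEMES]

def error_score(target, output):
    """Sum of squared differences between target and computed activations."""
    return sum((t - o) ** 2 for t, o in zip(target, output))

alternatives = single_substitutions("hot", PHONEMES)       # 3 positions x 4 other phonemes
noisy_output = [0.8 * v + 0.1 for v in toy_encode("hot")]  # made-up output near /hot/
winner = best_fit(noisy_output, "hot", PHONEMES, toy_encode, error_score)
```

With a full phoneme inventory the same procedure yields one candidate set per word, and an error is counted whenever the winner differs from the correct code.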
Among the 2,897 words in the corpus, there were 77 cases
(2.7%) in which the best fit to the computed output was a pattern
other than the correct one. The errors, which are listed in
Tables 1 and 2, were of several types. The model produced 14
regularization errors, in which a word with an irregular pronunciation
is given a "regular" pronunciation. These errors are
also observed in children learning to read (Backman, Bruck,
Hébert, & Seidenberg, 1984) and in certain cases of dyslexia following
brain injury (Marshall & Newcombe, 1973; Patterson et
al., 1985). Thus, although the model was trained that the correct
pronunciation of BROOCH is /brOC/, the best fit to the computed
output was provided by the regularization /brUC/, similar
to BROOM. For PLAID, the model produced /plAd/ instead of
/plad/, and for SPOOK, it produced /spuk/ (as in BOOK) instead
of /spUk/. All of the regularization errors were produced for
words that occurred with very low frequencies during the training
phase. In these cases, the model's output was determined on
the basis of knowledge derived from exposure to other words,
for which the regular spelling-sound correspondences predominate.
These errors illustrate a basic characteristic of the model,
the fact that the output for a word is affected by exposure to
both the word itself and other words. This aspect of the model
is discussed in greater detail later.
There were 25 other cases in which the model produced incorrect
vowels that were not regularizations. For example, the
best fit to BEAU was /bu/, and the best fit to ROMP was /ramp/.
Vowels account for the bulk of the errors because they are the
primary source of spelling-sound ambiguity in English. There
were also 24 cases in which the model produced incorrect consonants.
Some of these errors are systematic; for example, the
model produced hard Gs instead of soft ones for the words GEL,
GIN, and GIST (it performed correctly on other such words, including
GENE and GEM, however). Finally, one other type of error
occurred because some target pronunciations specified in
the training list were miscoded by the experimenter. For example,
the pronunciation of SKULL was incorrectly coded as
/skull/; in our encoding scheme, the correct code is /sk^l/. It is
interesting that in 5 cases, the best fit to the computed output
was the correct code rather than the one used in training; for
JAYS, for example, the model was trained on the incorrect pronunciation
/jAS/, but the best fit was provided by the correct
code /jAz/. These self-corrections were based on knowledge derived
from exposure to related words, such as DAYS.
This analysis of the errors should not be taken as comprehensive
because it only tests the computed output against the set of
codes containing the same number of phonemes as the target;
hence, it does not reveal cases in which phonemes were deleted
or added from the target pattern. Inspection of other cases, however,
suggests that the model produced few errors of these types.
Consider, for example, words containing silent letters, such as
DEBT and CALM. We tested the computed phonological output
for these words against both the correct pronunciations and the
"regularizations" that would occur by pronouncing the silent
letters. We found no cases in which the regularized pronunciation
yielded a smaller error score. Thus, it appears that in a very
high percentage of cases the best fit to the computed output was
provided by the correct phonological code, and the number of
errors was small.
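The best-fit test described above can be sketched as follows. This is a minimal illustration rather than the code used in the simulations: it assumes that the error score is the sum of squared differences between the computed output and a candidate target pattern, and the four-unit vectors and the labels "correct" and "regularized" are hypothetical stand-ins for full phonological patterns.

```python
def error_score(output, target):
    """Sum of squared differences between two activation vectors."""
    if len(output) != len(target):
        raise ValueError("patterns must have the same number of units")
    return sum((o - t) ** 2 for o, t in zip(output, target))


def best_fit(output, candidates):
    """Return (label, score) for the candidate pattern closest to output."""
    scores = {label: error_score(output, pattern)
              for label, pattern in candidates.items()}
    label = min(scores, key=scores.get)
    return label, scores[label]


# Hypothetical 4-unit patterns; real patterns span many phonological units.
computed = [0.9, 0.1, 0.8, 0.2]
candidates = {
    "correct": [1.0, 0.0, 1.0, 0.0],
    "regularized": [1.0, 1.0, 0.0, 0.0],
}
label, score = best_fit(computed, candidates)
```

In this toy case the correct pattern yields the smaller score, which is the situation reported for silent-letter words such as DEBT and CALM.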
Among cases in which the best fit was the correct code, the
error scores varied, indicating that the model's response was
not equally strong for all of the correct items. This, of course,
parallels the finding that human subjects pronounce some
words more quickly, or with greater accuracy under time pressure,
than others. Our main concern is to relate the magnitudes
of the error scores computed after 250 epochs of training to the
naming latencies obtained in behavioral studies. The simulations
reported later compare naming latencies for the words
used in particular studies with the error scores for these items.
In general, naming latencies are monotonically related to error
scores; in most of the simulations, latencies are about 10 times
the error score plus a constant of 500-600 ms. The constant
varies from experiment to experiment, and we take it to reflect
experiment-specific factors such as the quality of the stimulus
display, sensitivity in the voice key used, and other factors that
influence the overall speed of the subjects. 5
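The approximate relation between error scores and naming latencies stated above can be written as a one-line function. This is only a sketch of the rule of thumb given in the text: the function name is hypothetical, the default constant of 550 ms is an arbitrary value inside the stated 500-600 ms range, and the example error scores are made up.

```python
def predicted_latency_ms(error_score, constant_ms=550.0, slope_ms=10.0):
    """Approximate naming latency: about 10 times the error score plus a constant."""
    return slope_ms * error_score + constant_ms


# Hypothetical error scores for two items in the same experiment.
easy_item = predicted_latency_ms(3.0)
hard_item = predicted_latency_ms(6.0)
```

Because the constant is experiment specific, only differences between items within one experiment are meaningful under this mapping.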
Frequency Effects
We begin by considering simple effects of word frequency on
naming latency. In general, common, familiar words yield faster
naming latencies than do uncommon, less familiar words (e.g.,
Forster & Chambers, 1973; Frederiksen & Kroll, 1976). The
standard interpretation of these effects is that they reflect processes
involved in lexical access (i.e., access to entries stored in
the mental lexicon). Each vocabulary item is thought to have a
frequency-coded entry in the mental lexicon; recognition involves
accessing the appropriate entry. In Morton's (1969)
model, the entries were called logogens, and frequency was encoded
by their resting levels of activation (see McClelland &
Rumelhart, 1981, for a similar proposal). Balota and Chumbley
(1985) also observed small frequency effects that were not attributable
to lexical access because they occurred even when
subjects had more than 1 s to prepare their responses. These
effects were thought to be due to processes involved in producing
articulatory-motor output.
Table 1
Corpus of True Errors
Word        Output
Regularizations (n = 14)
ACHE        AC
BROOCH      bruc
CROW        krw
DOSE        doz
DOUSE       dwz
DROUGHT     dr*t
PLAID       plAd
SOOT        sUt
SPA         spa
SPOOK       spuk
SUEDE       swEd
SWAMP       swamp
WASP        wasp
WOMB        wom
Other vowel errors (n = 25)
ALPS        ~Ips
BEAU        bu
BLITHE      bliT
BRONZE      branz
CHEW        CW
DRAUGHT     draet
FRAPPE      frlp
FROST       fr'st
KNEAD       had
LEWD        IEd
MAUVE       mav
MOW         ml
NONCE       nans
OUCH        AC
PLEAD       plAd
PLUME       plom
QUALMS      kwAlmz
QUARTZ      kw^rts
QUEUE       kwu
ROMP        ramp
SCARCE      SkerS
SCOUR       skAr
STARVE      starv
SWARM       swlrm
WONT        w^nt
Consonant errors (n = 24)
ANGST       ondst
BREADTH     brebT
CORPSE      kOrts
CYST        sist
CZAR        vor
DREAMT      dremp
EWE         wU
FEUD        flud
GARB        gorg
GEL         gel
GIN         gin
GIST        gist
HEARTH      hOPS
NERSE       mers
NYMPH       mimf
PHAGE       pAj
SPHINX      spinks
SVELTE      swelt
TAPS        tats
THWART      Twert
TSAR        tar
WALTZ       w*lps
WARP        worb
ZIP         vip
Table 2
Corpus of Coding Errors (n = 14)
Word        Coded as    Output
CHAISE a    Cez         CAZ
DANG        dAng        dang
DAUNT a     dwnt        d*nt
FOLD        told        DOId
SKULL       skull       skulk
JAYS a      jAS         jAZ
MEW a       mvu         myu
PROWL       prowl       prwwl
SHOOT a     SUT         SUt
STRODE      strOs       stroz
SWATH       SWOth       swoch
VELDT       veldt       velvt
WOW         wWw         WWI
ZOUNDS a    zwnds       zwndz
a Self-corrections.
Our model differs from these kinds of accounts in a fundamental
way: It contains no lexicon in which there are entries for
individual words; hence, they cannot be "accessed" and there
is no direct record of word frequencies. Instead, knowledge of
words is encoded in the connections in the network. Frequency
affects the computation of the phonological code because items
that the model has encountered more frequently during training
have a larger impact on the weights. Higher frequency words
tend to produce phonological output that more closely approximates
the veridical pattern of activation, yielding smaller error
scores. As noted earlier, we have assumed that the more closely
the computed phonological code corresponds to the veridical
code, the easier it will be to compile the code into a sequence of
articulatory-motor commands. Thus, frequency has important
effects on the computation of the phonological code and therefore
on the time it takes to produce an overt response. Although
we have not implemented the process, frequency should also
affect the computation that takes the phonological code into a
set of articulatory-motor commands; McRae et al. (in press)
have provided evidence concerning the scope of these effects.
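The claim that more frequent items have a larger impact on the weights, and therefore end up with smaller error scores, can be illustrated with a toy delta-rule learner. This is a sketch under strong simplifying assumptions, not the model described here: it uses a single linear output unit instead of a back-propagation network with hidden units, it scales each weight update by a relative frequency as a deterministic stand-in for presenting frequent items more often, and the two input patterns and their frequencies are hypothetical.

```python
def train(items, epochs=20, lr=0.05):
    """Delta-rule training of one linear output unit.

    items: list of (input_vector, target, relative_frequency).
    Each update is scaled by relative_frequency, a deterministic
    stand-in for presenting frequent items more often.
    """
    n_inputs = len(items[0][0])
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target, freq in items:
            y = sum(w * xi for w, xi in zip(weights, x))
            delta = lr * freq * (target - y)
            weights = [w + delta * xi for w, xi in zip(weights, x)]
    return weights


def squared_error(weights, x, target):
    """Squared difference between the target and the unit's output."""
    y = sum(w * xi for w, xi in zip(weights, x))
    return (target - y) ** 2


# Two hypothetical items with non-overlapping input units.
high = ([1, 1, 0, 0], 1.0, 1.0)   # frequent item
low = ([0, 0, 1, 1], 1.0, 0.2)    # infrequent item
weights = train([high, low])
err_high = squared_error(weights, high[0], high[1])
err_low = squared_error(weights, low[0], low[1])
```

After the same number of epochs the frequent item has the smaller squared error, even though no frequency value is stored anywhere except implicitly in the weights.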
Orthographic-Phonological Regularity
Consider next the contrast between regular words such as
MUST, LIKE, and CANE, and exception words such as HAVE,
SAID, and LOSE. Regular words contain spelling patterns that
recur in a large number of words, always with the same pronunciation.
MUST, for example, contains the ending -UST; all monosyllabic
words that end in this pattern rhyme (JUST, DUST, etc.).
The words sharing the critical spelling pattern are called the
neighbors of the input string (Glushko, 1979). Neighbors have
been defined in terms of word endings, also called rimes (Treiman
& Chafetz, 1987) or word bodies (Patterson & Coltheart,
1987), although as we shall see other aspects of word structure
also matter (Taraban & McClelland, 1987). Exception words
contain a common spelling pattern that is pronounced irregularly.
For example, -AVE is usually pronounced as in GAVE and
SAVE, but has an irregular pronunciation in the exception word
HAVE. In terms of orthographic structure, regular and exception
words are similar: Both contain spelling patterns that recur
in many words. It is often said that regular words obey the pronunciation
"rules" of English, whereas exception words do not.
Thus, these types of words are similar in terms of orthography,
and they can be equated in terms of other factors such as length
and frequency. Differences between them in terms of processing
difficulty must be attributed to the one dimension along which
they differ, regularity of spelling-sound correspondences.
5 The simulations reported here involve comparisons between subjects'
naming latencies and the model's performance on the same items.
The naming latencies presented in the figures sometimes differ slightly
from those reported in the original articles because some experiments
included a small number of words that were not contained in the training
set. Excluding these items did not alter the patterns of results in any
of the experiments.
Table 3
Mean Naming Latencies and Percentage Errors
Type              Example    Latency (ms)    Errors (%)
High frequency
  Regular         NINE       540             0.4
  Exception       LOSE       541             0.9
Low frequency
  Regular         MODE       556             2.3
  Exception       DEAF       583             5.1
Note. Data are from the Seidenberg (1985c) experiment.
The studies examining the processing of such words have
yielded the following results. As noted previously, there are frequency
effects; higher frequency words are named more quickly
than lower frequency words. In addition, regularity effects--
faster latencies for regular words compared to exceptions--are
larger in lower frequency items and are small or nonexistent
in higher frequency words (Andrews, 1982; Seidenberg, 1985c;
Seidenberg, Waters, Barnes, & Tanenhaus, 1984; Taraban &
McClelland, 1987; Waters & Seidenberg, 1985). In short, there
is a Frequency × Regularity interaction, as exemplified by the
results from Seidenberg (1985c) presented in Table 3.
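The interaction can be read directly off the latencies in Table 3. The short calculation below is only a worked example of that arithmetic; the dictionary layout and the function name are illustrative choices, and the four latencies are the values reported in the table.

```python
# Mean naming latencies (ms) from Table 3 (Seidenberg, 1985c).
latency_ms = {
    ("high", "regular"): 540,
    ("high", "exception"): 541,
    ("low", "regular"): 556,
    ("low", "exception"): 583,
}


def regularity_effect(frequency):
    """Exception minus regular latency within one frequency class."""
    return latency_ms[(frequency, "exception")] - latency_ms[(frequency, "regular")]


high_effect = regularity_effect("high")
low_effect = regularity_effect("low")
# The interaction contrast is the difference between the two effects.
interaction = low_effect - high_effect
```

The regularity effect is 1 ms for the higher frequency words and 27 ms for the lower frequency words, so the interaction contrast is 26 ms.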
The number of higher frequency items for which irregular
spelling-sound correspondences have little impact on overt
naming is likely to be rather large because of the type/token
facts about English (Seidenberg, 1985c). A relatively small
number of word types account for a large number of the tokens
that a reader encounters. In the Kucera and Francis (1967) corpus,
for example, the 133 most frequent words in the corpus
account for about one half of the total number of tokens. Hence,
a small number of words recur with very high frequency, and
for these words spelling-sound irregularity has little effect. Exception
words tend to be overrepresented among these higher
frequency items, largely due to the fact that the pronunciations
of higher frequency words are more susceptible to diachronic
change (Hooper, 1977; Wang, 1979). It is interesting to note that
although written English is said to be highly irregular, the irregular
items tend to cluster in the higher frequency range, in
which this property has negligible effects on processing. Finally,
the size of this higher frequency pool varies as a function of
reading skill. Seidenberg (1985c) partitioned the data in Table
3 according to overall subject naming speed, yielding fast-,
medium-, and slow-reader groups (Table 4). Among these subjects,
who were McGill University undergraduates, the fastest
readers named lower frequency words more rapidly than the
slowest readers named higher frequency words, and thus
showed no regularity effect even for the lower frequency items.
Thus, faster readers recognize a larger pool of items without
interference from irregular spelling-sound correspondences. In
effect, more words are treated as though they are high-frequency
items; this may be an important source of individual
differences in reading skill.
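The type/token point made in this paragraph (a small number of word types accounting for a large share of tokens) can be checked on any frequency list with a few lines of code. The sketch below uses a hypothetical six-word count table purely for illustration; it does not reproduce the Kucera and Francis (1967) counts, and the function names are not taken from any existing tool.

```python
def coverage_of_top_types(counts, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def types_needed(counts, fraction):
    """Smallest number of top-frequency types covering `fraction` of tokens."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    running = 0
    for n, count in enumerate(ordered, start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(ordered)


# Hypothetical token counts, not taken from any real corpus.
toy_counts = {"the": 50, "of": 25, "have": 15, "must": 6, "deaf": 3, "gist": 1}
half_coverage_types = types_needed(toy_counts, 0.5)
```

Applied to a real corpus, `types_needed(counts, 0.5)` would return the kind of figure cited above for the most frequent words.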
To examine the model's performance on these types of words,
we used a somewhat larger stimulus set studied by Taraban and
McClelland (1987, Experiment 1). Figure 3 presents the
model's performance on this set of high- and low-frequency regular
and exception words after different amounts of training.
Each data point represents the mean phonological error score
for the 24 items of each type used in the Taraban and McClelland
experiment. The learning sequence is characterized by the
following trends. Training reduces the error terms for all of the
words following a negatively accelerated trajectory. Throughout
training, there is a frequency effect: The model performs better
on the words to which it is exposed more often. Note that although
the test stimuli are dichotomized into high- and low-frequency
groups, frequency is actually a continuous variable,
and it has continuous effects in the model. Early in training,
there are large regularity effects for both high- and low-frequency
items; in both frequency classes, regular words produce
smaller error terms than do exception words. Additional training
reduces the exception effect for higher frequency words, to
the point where it is eliminated by 250 epochs. However, the
regularity effect for lower frequency words remains.
Taraban and McClelland's (1987) adult subjects performed
as follows. First, lower frequency words were named more
slowly than higher frequency words. Second, there was a Frequency
× Regularity interaction; exception words produced
significantly longer naming latencies than regular words only
when they were low in frequency. For lower frequency words,
the difference between regular and exception words was 32 ms,
which was statistically significant; for higher frequency words,
the difference was 13 ms and nonsignificant. The model produced
similar results, as indicated in Figure 4.
Figure 5 shows two additional studies of this type, using
slightly different stimulus sets. The Seidenberg (1985c, Experiment
2) data summarized in Table 3 are presented on the left;
the results of Seidenberg, Waters, Barnes, and Tanenhaus (1984,
Experiment 3) are on the right. The model's performance on
the same stimulus words is also presented. In each case, both
experiment and simulation yielded Frequency × Regularity interactions,
with a good fit between the two.
In Figure 6 we summarize the results of 14 conditions from
8 experiments that examined differences between regular and
exception words. The data represent the mean differences between
exception words and regular words obtained in the experiments
and in simulations using the same items. For conditions
a-e, the differences between the naming latencies for regular
and exception words were not statistically significant (these
were higher frequency stimuli); the model also produced very
small effects in these cases. In the remaining conditions, which
yielded significant effects, the model also produces larger
differences between the two word types. The correlation between
experiment and simulation data is .915.
The simulation is revealing about the behavioral phenomena
in two respects. First, it is clear that in the model, the
Frequency × Regularity interaction occurs because the output
for both types of higher frequency words approaches asymptote
before the output for the lower frequency words. Hence, the
difference between the higher frequency regular and exception
words is eliminated, whereas the difference between the two
types of lower frequency words remains. This result suggests
that the interaction observed in the behavioral data results from
a kind of "floor" effect due to the acquisition of a high level of
skill in decoding common words. In the model, the differences
between the two types of lower frequency words would also diminish
if training were continued for more epochs. This aspect
of the model provides an explanation for Seidenberg's (1985c)
finding that there are individual differences among skilled readers
in terms of regularity effects. As Table 4 indicates, the fastest
subjects in this study showed no regularity effect even for words
that are lower in frequency according to standard norms. The
model suggests that these subjects may have encountered lower
frequency words more often than did the slower subjects, with
the result that they effectively become high-frequency items.
Second, the model provides an important theoretical link between
effects of frequency and regularity. Both effects are due
to the fact that connections that are required for correct performance
have been adjusted more frequently in the required direction
for frequent and regular items than for infrequent or
irregular items. This holds for frequent words simply because
they are presented more often. It holds for regular words because
they make use of the same connections as other, neighboring
regular words. Hence, both frequency and regularity effects
derive from the same source, the effects of repeated adjustment
of connection weights in the same direction.
Table 4
Mean Naming Latencies as a Function of Decoding Speed
                       Subject group
Word type         Fastest    Medium    Slowest
High frequency
  Regular         475        523       621
  Exception       475        517       631
  Difference      0          -6        +10
Low frequency
  Regular         500        530       641
  Exception       502        562       685
  Difference      +2         +32       +44
Note. Numbers are in milliseconds.
Figure 3. Mean phonological error scores for the stimuli
used by Taraban and McClelland (1987).
[Plot: mean error score by epoch number for low-frequency exception (e.g., LOSE), low-frequency regular (e.g., BANG), high-frequency exception (e.g., HAVE), and high-frequency regular (e.g., MUST) words.]
Figure 4. Results of the Taraban and McClelland (1987) study (upper
graph) and the simulation data for 250 epochs (lower graph).
[Plots: exception and regular words at high and low frequency.]
Figure 5. Results of the Seidenberg (1985c; left graphs) and Seidenberg, Waters, Barnes, & Tanenhaus (1984,
Experiment 3; right graphs) studies: experiments (upper graphs) and simulations (lower graphs).
[Plots: exception and regular words at high and low frequency.]
Performance on Other Stimulus Types
Several other types of words have been studied in naming experiments;
research in this area has been marked by the development
and revision of several taxonomies based on different
properties of words or perceptual units thought to be theoretically
relevant. In part, this research was motivated by the fact
that several models, incorporating very different representational
and processing assumptions, all predict longer naming
latencies for exception words compared with regular words. In
the dual-route model (Coltheart, 1978), longer latencies result
because readers attempt to pronounce exception words by applying
grapheme-phoneme correspondence rules, resulting in
a temporary misanalysis. In Glushko's (1979) model, a word is
pronounced by analogy to similarly spelled neighboring words.
The fact that the neighbors of an exception word are all regular
was thought to interfere with generating its pronunciation. According
to Brown (1987), the factor that determines naming
latencies is the number of times a spelling pattern (word body)
occurs with a particular pronunciation. A regular word such as
DUST contains a word body, -UST, that is pronounced /ust/ in
many words. An exception word such as SWAMP contains a
word body, -AMP, that is pronounced /omp/ in only one word,
the exception itself. Hence, the frequency of a spelling-sound
correspondence could be the source of the exception effect.
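Brown's (1987) measure, the number of words in which a word body receives a particular pronunciation, amounts to counting (body, pronunciation) pairs over a lexicon. The sketch below is a hypothetical illustration: the seven-word lexicon, the pronunciation labels, and the function names are all made up for the example and are not the phoneme codes or materials used in any of the studies.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> (word body, made-up pronunciation label).
lexicon = {
    "MUST": ("UST", "ust"), "DUST": ("UST", "ust"), "JUST": ("UST", "ust"),
    "GAVE": ("AVE", "Av"), "SAVE": ("AVE", "Av"), "HAVE": ("AVE", "av"),
    "SOAP": ("OAP", "Op"),
}

# How many words share each (body, pronunciation) pairing.
pair_counts = Counter(lexicon.values())


def body_pronunciation_count(word):
    """Number of words sharing this word's body AND its pronunciation."""
    return pair_counts[lexicon[word]]


def is_consistent(word):
    """True if every word with this body pronounces it the same way."""
    body, _ = lexicon[word]
    return len({p for b, p in lexicon.values() if b == body}) == 1


counts = {w: body_pronunciation_count(w) for w in ("MUST", "GAVE", "HAVE", "SOAP")}
```

In this toy lexicon the exception word HAVE and the unique word SOAP both have a count of 1, whereas the regular word MUST has a count of 3, which is the contrast Brown's hypothesis turns on.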
In the following sections, we consider the model's performance
on several additional types of words and nonwords,
showing that it closely simulates the behavioral data. We then
consider the principles that govern the model's performance
and compare them with ones in other models.
Regular inconsistent words. In an important article, Glushko
(1979) studied a class of words called regular inconsistent.
These words, such as GAVE, PAID, and FOE, have two critical
properties. Their pronunciations can be derived by rule; in fact,
most of these words' neighbors rhyme (e.g., GAVE, PAVE, SAVE,
BRAVE). However, each of these words has an exception word
neighbor (e.g., HAVE, SAID, and SHOE, respectively). The view
that readers pronounce words by applying spelling-sound rules
predicts that regular inconsistent words should be named as
quickly as regular words, other factors being equal; in both
cases, the rules generate the correct pronunciations. Glushko
(1979) proposed that words are pronounced by analogy to similarly
spelled words, affording the possibility that pronunciation
of a regular inconsistent word such as GAVE could be influenced
by knowledge of an exception word such as HAVE. He reported
experimental evidence that regular inconsistent words yield
longer naming latencies than do regular words; he also found
that nonwords derived from exception words (e.g., BINT from
PINT) yielded longer latencies than nonwords derived from regular
words (e.g., NUST from MUST). These findings have been
taken as strong evidence against dual-route models (e.g., Henderson,
1982).
Figure 6. Results of 14 conditions (a-n) examining exception effects:
Experiment and simulation data. (Key: a = Seidenberg [1985c], Experiment
2, Set A, HF words; b = Seidenberg, Waters, Barnes, & Tanenhaus
[1984], Experiment 3, HF words; c = Waters & Seidenberg [1985], Experiment
1, HF words; d = Taraban & McClelland [1987], Experiment
1, HF words; e = Seidenberg [1985c], Experiment 2, Set B, HF words;
f = Glushko [1979], Experiment 3; g = Brown [1987], Experiment 1;
h = Seidenberg [1985c], Experiment 2, Set B, LF words; i = Glushko
[1979], Experiment 1; j = Seidenberg, Waters, Barnes, & Tanenhaus
[1984], Experiment 3, LF words; k = Taraban & McClelland [1987],
Experiment 1, LF words; l = Seidenberg [1985c], Experiment 1, Set A,
LF words; m = Waters & Seidenberg [1985], Experiment 1, LF words;
n = Seidenberg, Waters, Barnes, & Tanenhaus [1984], Experiment 1.
HF = high frequency; LF = low frequency.)
Subsequent studies of regular inconsistent words have yielded
mixed results. Seidenberg et al. (1984b, Experiment 4) obtained
the regular inconsistent effect only for lower frequency words,
and several studies failed to yield statistically reliable effects at
all (e.g., Seidenberg et al., 1984b, Experiment 1; Stanhope &
Parkin, 1987; Taraban & McClelland, 1987). These mixed results
suggest that the mere presence or absence of an exception
word neighbor is not the only factor relevant to processing, an
issue to which we return later. We examined the model's processing
of regular inconsistent words using stimuli from the Taraban
and McClelland experiment described previously, which
also included high- and low-frequency regular inconsistent
words and matched regular word controls. This represents the
largest set of regular inconsistent words used in any experiment.
There were again 24 items of each type, all of which were included
among the 2,897 words in our training set. Figure 7
shows the model's performance on these words after different
amounts of training. Error scores again decreased with additional
training, and higher frequency words again produced
lower error scores than lower frequency words. However, after
250 epochs, there were only small differences between regular
inconsistent words and regular words in both frequency ranges
(high frequency: 0.0077; low frequency: 0.3128). These data are
consistent with Taraban and McClelland's results; the differences
between regular inconsistent words and regular controls
in their experiment were 7 ms and 10 ms, respectively, for the
high- and low-frequency items. Neither difference was statistically
reliable. For comparison, note that the difference between
lower frequency regular and exception words in their experiment
was 32 ms, and 2.4804 in the simulation.
Seidenberg et al. (1984b) identified an aspect of Glushko's
(1979) methodology that may have been responsible for the
large regular inconsistent effect in his study. Glushko's experiment
included matched exception/regular inconsistent pairs
such as BEEN-SEEN, GIVE-DIVE, and NONE-CONE. Each spelling
pattern in the stimulus list occurred at least twice with two
different pronunciations; some spelling patterns were repeated
several times (e.g., the stimuli included NONE, CONE, GONE,
DONE, SHONE, and BONE). Repetition of spelling patterns with
different pronunciations may have introduced intralist priming
effects that would tend to increase the magnitude of the regular
inconsistent/regular difference. Seidenberg, Waters, Barnes,
and Tanenhaus (1984, Experiment 2) showed that a large regular
inconsistent effect occurs when stimuli are repeated in this
way, but not when the stimuli are not repeated. The model provides
additional support for this conclusion. We tested the
model on the items from Glushko's Experiment 3, which had
yielded a significant 17-ms difference between regular inconsistent
and regular words. The model yielded a negligible difference
of 0.1247 on the same items. The basis for this difference is
clear: Unlike human subjects, the model's performance during
testing is not influenced by previous trials. The model is tested
on each stimulus without changing the weights in any way;
hence, there are no intralist priming effects.
We consider the regular inconsistent words again later because
they are theoretically important and because the studies
examining these items did not control another important aspect
of their structure. Here it is sufficient to note that the
model gives a good account of the behavioral data obtained in
studies using these words.
Strange words. Several studies (e.g., Parkin, 1982; Parkin &
Underwood, 1983; Seidenberg, Waters, Barnes, & Tanenhaus,
1984; Waters & Seidenberg, 1985) have examined words that
differ from the regulars, regular inconsistents, and exceptions
in a basic way: They contain spelling patterns that occur in a
very small number of words, often only one. Regular patterns
such as -UST and inconsistent patterns such as -AVE are productive
in the sense that they are realized in many words. Words
such as GUIDE, AISLE, and FUGUE contain nonproductive spelling
patterns that rarely occur in other words. For example,
GUIDE is the only monosyllabic word ending in -UIDE. Henderson
(1982) calls these words lexical hermits; in Glushko's
(1979) terminology, they have few if any immediate neighbors.
These words might be expected to be difficult to pronounce for
three reasons: first, because they contain relatively unfamiliar
spelling patterns and thus are low in terms of orthographic redundancy,
a factor that would slow the identification of component
letters; second, because the spelling-to-sound correspondences
of these patterns are also relatively unfamiliar; and third,
because these unusual spelling patterns are often associated
with idiosyncratic pronunciations (as in CORPS).
Figure 7. Model's performance on regular inconsistent and regular
words used in the Taraban and McClelland (1987) study.
[Plot: error score by epoch number for low-frequency regular inconsistent (e.g., COOK), low-frequency regular (e.g., CODE), high-frequency regular inconsistent (e.g., BASE), and high-frequency regular words.]
Figure 8. Results of the Waters and Seidenberg (1985) studies:
Experiment (upper graph) and simulation (lower graph).
[Plots: strange, exception, and regular words at high and low frequency.]
Waters and Seidenberg (1985) compared the naming latencies
for a set of these words (which they termed strange) with
the latencies for regular and exception words. The words were
again dichotomized into high- and low-frequency groups. Results
of this study are presented in Figure 8. Among the higher
frequency words, there were no reliable differences between
word classes; for the lower frequency words, the ordering of latencies
was strange > exception > regular. Strange words also
produced a larger number of mispronunciation errors. The
model's performance on these words is also presented in Figure
8, and shows the same interaction between frequency and word
class. The results corroborate the conclusion that for higher frequency
words, variations in word structure, such as the frequency
of a spelling pattern or spelling-sound correspondence,
have little impact on naming. Despite the various ways in which
regular, regular inconsistent, exception, and strange words
differ, they yield similar naming latencies in this frequency
range. Among the lower frequency words in the language, the
strange items are the most difficult to name.
Unique words. We also tested the model on a set of words
used by Brown (1987), who introduced another category of
items, termed unique. These are words such as SOAP or CURVE
that also contain word bodies that do not occur in other monosyllabic
words. These words are somewhat less eccentric than
the strange words mentioned earlier, as indicated by the fact
that they produce lower orthographic error scores, which are a
measure of orthographic redundancy (see discussion on p. 552).
Brown also examined exception words such as LOSE and regular
words such as MILL, which he termed consistent. The stimuli
were used to examine the hypothesis that the factor critical to
naming is the number of times a word body is associated with a
given pronunciation. Both unique and exception words contain
spelling patterns assigned a given pronunciation in only a single
word (namely, the unique or exception item itself), whereas regular
words contain word bodies associated with a given pronunciation
in many words. Hence, Brown predicted that unique
and exception words should yield similar naming latencies, and
both should be slower than regular words. Data from Brown's
naming experiment and the simulation are presented in Figure
9. Clearly, the fit between the two is very good.
Neighborhood size. Andrews (in press) reported a study that
factorially varied word frequency and a measure of neighborhood
size known as Coltheart's N (Coltheart, Davelaar, Jonasson,
& Besner, 1977), which refers to the number of words that
can be derived from a given word by changing one letter. There
were 15 words in each of the four classes formed by crossing
frequency (high, low) and neighborhood size (large, small). Results
of the experiment and simulation are presented in Figure
10, with again a very good fit between the two. Both Andrews's
data and the model suggest that as the frequency of a word increases,
the effects of neighboring words diminish.
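Coltheart's N, as defined above, is straightforward to compute once a word list is fixed. The following is a minimal sketch that assumes a same-length, one-letter-substitution definition of a neighbor; the eight-word lexicon is hypothetical, so the resulting counts illustrate the procedure and are not real neighborhood sizes.

```python
def coltheart_n(word, lexicon):
    """Number of lexicon words formed by changing exactly one letter of word."""
    word = word.upper()
    count = 0
    for other in lexicon:
        other = other.upper()
        # Neighbors must have the same length and differ from the word itself.
        if len(other) != len(word) or other == word:
            continue
        if sum(a != b for a, b in zip(word, other)) == 1:
            count += 1
    return count


# Hypothetical word list; a real analysis would use a full lexicon.
toy_lexicon = {"MUST", "DUST", "JUST", "MIST", "MUSE", "MOST", "GUIDE", "GUILE"}
n_must = coltheart_n("MUST", toy_lexicon)
n_guide = coltheart_n("GUIDE", toy_lexicon)
```

In this toy list MUST has five neighbors and GUIDE has one, which mirrors the large-N versus small-N contrast used in the experiment.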
Nonword pronunciation. After training, the model has encoded
facts about orthographic-phonological correspondences
in the weights on connections. Although the model performs
better on the training stimuli, it will compute phonological output
for novel stimuli. In this respect, it simulates the performance
of subjects asked to pronounce nonwords such as BIST or
TAZE. Nonword performance provides important information
concerning the naming process because, as we have seen, performance
on many words reaches floor levels because of repeated
exposure to the items themselves. Because nonwords
have not been encountered previously, pronunciation must be
based on knowledge gained from similar words. A critical experiment
was reported by Glushko (1979), who examined
naming latencies for nonwords derived from regular words
(e.g., NUST derived from MUST) and nonwords derived from exception
words (e.g., MAVE derived from HAVE). 6 We tested the
model on his set of nonwords; the results from experiment and
simulation are presented in Figure 11. In both cases, performance
is poorer on the exception nonwords. Note that the nonwords
derived from exceptions are in effect "regular inconsistent."
Whereas regular inconsistent words show little effect of a
neighboring exception word, regular inconsistent nonwords do.
The difference, of course, is that the model is actually trained
on regular inconsistent words, but not the corresponding nonwords.
Apparently, training on the item itself is sufficient to
overcome the effect of training on the exception neighbor.
Figure 9. Results of Brown (1987): Experiment and simulation data.
[Plot: experiment and simulation values by word type (unique, e.g., SOAP; exception, e.g., LOSE; consistent).]
Figure 10. Results of Andrews (in press): Experiment and simulation
data. (N refers to Coltheart's N, a measure of neighborhood size.)
[Plots: large-N and small-N words at high and low frequency.]
Figure 11. Results of the Glushko (1979) nonword experiment:
Experiment and simulation data.
[Plot: experiment and simulation values for regular and exception nonwords.]
Figure 12. Model's performance on Taraban and McClelland (1987)
exception words. (Error scores for correct [exception] pronunciations
and incorrect [regularized] pronunciations.)
The model was also tested on a set of nonwords derived from
the exception words used in the Taraban and McClelland
(1987) study. These nonwords can be pronounced in two ways,
either by analogy to the exception word (e.g., MAVE pronounced
to rhyme with HAVE) or by analogy to a regular inconsistent
word (e.g., MAVE rhymed with GAVE). Using the weights from
250 epochs, the model was tested to determine which pronunciation
would be preferred. For each item, phonological error
scores were calculated twice, using both exception and regular
pronunciations as targets. We also calculated analogous scores
for alternative pronunciations of the exception words themselves
(e.g., HAVE pronounced correctly and pronounced to
rhyme with GAVE). This is the regularization error discussed
previously.
Figure 12 shows both types of error scores for the exception
words in the Taraban and McClelland (1987) stimuli. For
words, the correct "exception" pronunciations produce much
6 Glushko's (1979) Experiment 2, which examined nonword naming,
did not include repetitions of spelling patterns with different pronunciations;
hence, it is not subject to the repetition priming hypothesis previously
advanced in connection with his experiment on regular inconsistent
words.
Figure 13. Model's performance on nonwords derived from Taraban
and McClelland (1987) exception words. ("Exception" pronunciation
rhymed with exception word [e.g., MAVE pronounced like HAVE]; "regularized"
pronunciations rhymed with regular inconsistent word [e.g.,
MAVE pronounced like GAVE].)
smaller error scores than do the incorrect, "regularized" pronunciations.
Thus, the model's output resembles the correct
pronunciations rather than the regularized ones.
The opposite pattern obtains with the nonwords derived from
these stimuli (Figure 13). Here the "regularized" pronunciations
are preferred to the pronunciations derived from the
matched exception words. Note, however, that the difference between
the two pronunciations is much smaller than in the corresponding
word data, suggesting that the pronunciation of a nonword
like MAVE is influenced by the fact that the model has
been trained on exception words like HAVE.
Figure 14 shows the error scores for the regular pronunciations
of nonwords derived from regular and exception words.
The error scores are larger for nonwords such as MAVE (derived
from an exception word) than for nonwords such as PAME (derived
from a regular word). These results also indicate that the
pronunciation of novel stimuli such as MAVE is affected by the
fact that the model has been trained on both HAVE and regular
words such as GAVE.
The model's performance on the nonwords is important for
two reasons. First, it shows that performance generalizes to new
items; the knowledge that was acquired on the basis of exposure
to a pool of words can be used to generate plausible output for
novel stimuli. Second, the nonword data provide additional information
as to what the model has learned. Regular inconsistent
words are little affected by training on exception word
neighbors. However, the inconsistency in the pronunciation of
-AVE is encoded by the weights, as evidenced by performance
on regular inconsistent nonwords.
What the Model Has Learned
We have demonstrated that the model simulates a broad
range of empirical phenomena concerning the pronunciation of
words and nonwords. Why the model yields this performance
can be understood in terms of the effects of training on the set
of weights. The values of the weights reflect the aggregate effects
of many individual learning trials using the items in the training
set. In effect, learning results in the recreation of significant aspects
of the structure of written English within the network.
Because the entire set of weights is used in computing the phonological
codes for all words, and because all of the weights are
updated on every learning trial, there is a sense in which the
output for a given word is a function of training on all of the
words in the set. Differences between words derive from facts
about the writing system distilled during the learning phase.
For words, the main influence on the phonological output is
the number of times the model was exposed to the word itself.
Number of times the model was exposed to closely related
words (e.g., similarly spelled items) exerts secondary effects;
there are also small effects due to exposure to other words. The
magnitudes of these effects vary as a function of how similar
these words are to a given item.
To see this more clearly, consider the following experiment.
We test the model's performance on the word TINT; with the
weights from 250 epochs, it produces an error score of 8.92. We
train the model on another word, adjusting the weights according
to the learning algorithm, and then retest TINT. By varying
the properties of the training word, we can determine which
aspects of the model's experience exert the greatest influence
on the weights relative to the target. This procedure yields orthographic
and phonological priming effects, which have been
studied by Meyer, Schvaneveldt, and Ruddy (1974), Hillinger
(1980), and Tanenhaus et al. (1980). For example, Meyer et al.
observed that lexical decision latencies to a target word such
as ROUGH were facilitated when preceded by the rhyme prime
TOUGH but inhibited when preceded by the similarly spelled
nonrhyme COUGH. For the purposes of the simulation, we examined
the cumulative effects of a sequence of 10 prime
(learn)--target (test) trials. The primes were a rhyme (MINT), a
matched exception word (PINT), a word with the same consonants
but a different vowel (TENT), and an unrelated control
(RASP). The data are presented in Figure 15.
The results indicate, first, that priming with the orthographically
similar rhyme MINT decreases the error for TINT; the overlap
between the words is sufficient to improve performance.
Figure 14. Error scores for regular pronunciations of nonwords
derived from regular and exception words.
Figure 15. Effects of training on PINT, RASP, TENT, and MINT
on phonological error score for TINT.
[Plot not reproduced. The x-axis is Trial Number; the curves are labeled PINT, RASP, TENT, and MINT.]
Other rhymes act in a similar manner. This outcome is consistent
with Brown's (1987) proposal that the frequency with
which a word body is associated with a given pronunciation influences
performance; the number of times the pattern -INT =
/int/ occurs in the training set affects performance on TINT.
Note, however, that the other primes also have effects. Priming
with the similarly spelled nonrhyme TENT also improves performance;
the effect is smaller because vowels are the primary
source of ambiguity in orthographic-phonological correspondences
and, hence, the primary source of error. Training on
MINT has a larger facilitating effect because it provides feedback
concerning the primary source of ambiguity. The exception
word PINT has interfering effects complementary to the facilitative
effects of MINT. Finally, the unrelated prime RASP has very
small negative effects.
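The prime (learn) and target (test) procedure can be sketched in a few lines. The sketch below is a toy illustration only: it uses a single layer of logistic units trained by gradient descent on squared error instead of the hidden-unit network described in this article, and the class TinyNet, the function prime_then_test, the feature vectors, and the learning rate are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """One layer of logistic units; a toy stand-in, not the paper's network."""

    def __init__(self, n_in, n_out, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + b)
                for row, b in zip(self.w, self.b)]

    def error(self, x, target):
        """Sum of squared differences between output and target."""
        return sum((o - t) ** 2 for o, t in zip(self.forward(x), target))

    def learn(self, x, target, rate=0.5):
        """One gradient step on squared error (constant factor folded into rate)."""
        out = self.forward(x)
        for j, (o, t) in enumerate(zip(out, target)):
            delta = (t - o) * o * (1.0 - o)
            for i, xi in enumerate(x):
                self.w[j][i] += rate * delta * xi
            self.b[j] += rate * delta


def prime_then_test(net, prime, target, n_trials=10):
    """Alternate learning on the prime with testing on the target."""
    errors = [net.error(*target)]
    for _ in range(n_trials):
        net.learn(*prime)
        errors.append(net.error(*target))
    return errors


# Hypothetical (input, output) patterns; the two output units stand for
# two competing vowel pronunciations.
target = ([1, 0, 1, 1], [1, 0])
rhyme_prime = ([0, 1, 1, 1], [1, 0])      # shares spelling and vowel
exception_prime = ([0, 1, 1, 1], [0, 1])  # shares spelling, other vowel

rhyme_errors = prime_then_test(TinyNet(4, 2), rhyme_prime, target)
exception_errors = prime_then_test(TinyNet(4, 2), exception_prime, target)
```

Each run starts from the same initial weights, so the two error sequences differ only because of the prime. In this toy setting the rhyme prime lowers the target's error and the conflicting prime raises it, which mirrors the direction of the MINT and PINT effects but says nothing about their size in the actual model.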
Note that the priming effects illustrated in Figure 15 are not
characteristic of all of the words in the training set after 250
epochs of training. TINT is somewhat unusual in that the
model's performance is relatively poor, due in part to the fact
that TINT is low in frequency and the fact that there are few
-INT words in the corpus. There are smaller priming effects for
target words that yield smaller error scores. Figure 15 accurately
illustrates the influences of training on related words, but
these effects are more salient earlier in the training sequence
when error scores are larger.
The model clarifies why some effects of word type are obtained
in behavioral studies and others are not. When experimenters
compare performance on two types of words, they are
attempting to observe the net effect of a particular aspect of
word structure (e.g., regularity defined in terms of word bodies)
against a background of noise provided by the effects of all other
properties of the words. For this reason, experimenters routinely
attempt to equate stimuli in terms of these other properties
(e.g., frequency, length, initial phoneme). There is a net exception
effect for lower frequency words because the regular
correspondence is encountered many more times than the irregular
one; repeated experience with words such as TINT, MINT,
and HINT has a negative impact on the weights from the point
of view of PINT. Conversely, exposure to an exception such as
PINT tends to have relatively small effects on a regular inconsistent
word such as TINT because the exception word is encountered
much less often than the set of rhyming regular inconsistent
words. It is not that PINT has no effect on TINT; in the priming
experiment, the effect was observed once it was magnified
through repetition. The effect can also be observed earlier in
the training sequence; eventually it recedes into the background
provided by exposure to many other words. The model corroborates
the common assumption that word bodies are relevant
to naming; however, it suggests that other aspects of word structure
also matter.
One other point should be noted. We also examined repetition
priming, that is, the effects of 10 trials of training on TINT
itself. This resulted in a much larger decrease in TINT's error
score, from 8.92 to 2.50. As stated previously, the main factor
that influences performance on a word is the number of times
the model is exposed to the word itself; effects of neighboring
words are relatively small. Thus, presenting an exception word
such as PINT with much greater frequency would have less effect
on TINT than a small number of exposures to TINT itself.
The model's behavior can be further clarified by examining
yet another type of word, which contains what Seidenberg, Waters,
Barnes, and Tanenhaus (1984) and Backman et al. (1984)
called ambiguous spelling patterns. These spelling patterns,
such as -OWN, -OVE, and -EAR, are associated with two or more
pronunciations, each of which occurs in many words (e.g.,
BLOWN, FLOWN, KNOWN, GROWN, TOWN, FROWN, DROWN,
GOWN). For inconsistent spelling patterns such as -INT or -AVE,
the number of words with the regular pronunciation greatly exceeds
the number of words with the exceptional pronunciation.
For the ambiguous spelling patterns, however, the ratio is more
nearly equal. Hence, during training, the model is exposed to
many examples of each pronunciation. We constructed a set of
24 high-frequency and 24 low-frequency words containing
these spelling patterns, matched with the stimuli in the Taraban
and McClelland (1987) set in terms of frequency. Mean phonological
error scores for these words (using the weights from 250
epochs) and the other stimuli in the Taraban and McClelland
experiment are presented in Figure 16. As before, there are
negligible differences between the word types in the higher frequency
range. Among the lower frequency words, the ambiguous
items yield better performance than the exceptions, but
worse performance than the regular inconsistents. Performance
is better than on the exceptions because the model receives less
training on the exceptional pronunciation than on either pronunciation
of the ambiguous spelling pattern. Performance is
worse than on the regular inconsistent words because the model
is repeatedly exposed to both pronunciations. Thus, there are
graded effects of regularity owing to the nature of the input during
acquisition.7
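The contrast between inconsistent and ambiguous spelling patterns can be made concrete by counting, for each word body, the share of words that take its most common pronunciation. The sketch below is illustrative only: the word list is a small hand-picked sample, not the training corpus, the pronunciation labels are informal placeholders, and the counts are over word types, whereas the model's experience also depends on how often each word is presented.

```python
from collections import Counter, defaultdict

# (word, word body, body pronunciation). The pronunciation labels are informal
# placeholders chosen for this sketch, not the model's phonological encoding.
LEXICON = [
    ("BLOWN", "OWN", "oan"), ("FLOWN", "OWN", "oan"),
    ("KNOWN", "OWN", "oan"), ("GROWN", "OWN", "oan"),
    ("TOWN", "OWN", "oun"), ("FROWN", "OWN", "oun"),
    ("DROWN", "OWN", "oun"), ("GOWN", "OWN", "oun"),
    ("MINT", "INT", "int"), ("TINT", "INT", "int"),
    ("HINT", "INT", "int"), ("LINT", "INT", "int"),
    ("PINT", "INT", "aint"),
]


def dominant_share(lexicon):
    """For each word body, the proportion of words with its commonest pronunciation."""
    counts = defaultdict(Counter)
    for _word, body, pronunciation in lexicon:
        counts[body][pronunciation] += 1
    return {body: max(c.values()) / sum(c.values())
            for body, c in counts.items()}


shares = dominant_share(LEXICON)
```

In this sample the share for -OWN is 0.5, because the two pronunciations are evenly split, and the share for -INT is 0.8, because a single exception stands against four regular words.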
7 Ambiguous words have been used in only one study of skilled readers
(Seidenberg, Waters, Barnes, & Tanenhaus, 1984, Experiment 1).
The model simulates the results of this experiment quite closely. However,
the ambiguous words were in the higher frequency range, in which
they do not differ from regular words. In Backman, Bruck, Hébert, and
Seidenberg's (1984) developmental study (described later), children's
Figure 16. Model's performance on Taraban and McClelland (1987)
stimuli and on a set of ambiguous words (such as TOWN and LOVE).
[Plot not reproduced. The x-axis is Frequency (High, Low); the legible legend entries are Exception, Regular, and Ambiguous.]
Characteristics of the hidden units. Evidence as to how orthographic
and phonological information are encoded by the network
can be obtained by examining the patterns of activations
over the hidden units produced by different words. Unlike the
model's orthographic and phonological units, the hidden units
do not have specific, predetermined roles. Rather, their representational
and functional roles emerge as a result of experience
in learning to perform the task that is imposed on the network
by the training procedure. Recall that the activation of a hidden
unit is a function of the weights on the connections coming into
it. At first, each hidden unit has random incoming and outgoing
connection strengths. Gradually these are adjusted through experience,
so that units come to perform useful, generally partially
overlapping parts of the task. Because of the task that these
units need to perform--they must allow reconstruction of the
orthography as well as construction of the phonology--the values
of these weights are affected by feedback concerning both
orthography and phonology.
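How a hidden unit's activation depends on its incoming weights can be sketched as follows. The sketch assumes the logistic activation function that is standard in back-propagation networks, with a bias term for each unit; the network size, the random starting weights, and the input pattern are toy values chosen only for illustration.

```python
import math
import random


def logistic(x):
    """Squash a net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activations(pattern, weights, biases):
    """Activation of each hidden unit for one input pattern.

    weights[j][i] is the connection from input unit i to hidden unit j.
    """
    return [
        logistic(sum(w * x for w, x in zip(row, pattern)) + bias)
        for row, bias in zip(weights, biases)
    ]


# Toy sizes and random starting weights, chosen only for illustration.
rng = random.Random(0)
n_inputs, n_hidden = 6, 4
weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
           for _ in range(n_hidden)]
biases = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

pattern = [1, 0, 1, 1, 0, 0]  # hypothetical orthographic input
activations = hidden_activations(pattern, weights, biases)
```

Before training, such activations reflect only the random starting weights; any systematic relation between a unit's activation and properties of the input arises from later weight adjustments.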
Consider first the pattern of activation over the hidden units
produced by the word LINT (Figure 17). LINT activates 23
units, 22 very strongly (net activation > .8) and one more
weakly (net activation < .6). We can determine how many of
these units are also activated by the orthographically similar
rhyme MINT and by the unrelated word SAID. A total of 14 units
are activated by both LINT and MINT, and 3 by LINT, MINT, and
SAID; 1 unit was activated by both LINT and SAID. The remaining
5 units were "unique" to LINT, in the sense that they were