From sound to grammar:
theory, representations
and a computational model
Marco A. Piccolino-Boniforti
Clare Hall
8th February 2014
This dissertation is submitted for the degree of
Doctor of Philosophy
at the University of Cambridge
Contents

Abstract
Declaration
Acknowledgements

1 Introduction: From sound to grammar
  1.1 From sound to grammar
  1.2 Thesis outline

2 Background: Variability and invariance
  2.1 The study of variability and invariance
  2.2 Traditional approaches
    2.2.1 Minimal invariant units
    2.2.2 The role of context
    2.2.3 Beads on a string
  2.3 Challenges
    2.3.1 Indexical variation
    2.3.2 Linguistic factors
    2.3.3 Auditory processes
    2.3.4 Automatic speech recognition

3 Theoretical framework: A rational prosodic analysis
  3.1 Analytic foundations
    3.1.1 Rational analysis
    3.1.2 Bayes' theorem
    3.1.3 Firthian prosodic analysis
  3.2 Central assumptions
    3.2.1 Specificity of task and environment
    3.2.2 Optimal behaviour
    3.2.3 Auditory patterns and linguistic features

4 Assessment: Perceptual-magnet effect
  4.1 Motivation
    4.1.1 The perceptual-magnet effect
    4.1.2 Context-dependent PME
  4.2 Computations
    4.2.1 Feldman et al.'s rational model
    4.2.2 A multi-class extension
  4.3 Simulations
    4.3.1 Method
    4.3.2 Results and discussion

5 Representations: Auditory processes and linguistic categories
  5.1 Auditory processes
    5.1.1 Cochlear model
    5.1.2 Auditory primal sketch
  5.2 Linguistic categories
    5.2.1 The relevance vector machine
    5.2.2 RVM: an example

6 Evaluation: Binary classification tasks
  6.1 Data
  6.2 Simulation A: relevance vector machine
    6.2.1 Motivation
    6.2.2 Research questions
    6.2.3 Method
    6.2.4 Results
  6.3 Simulation B: auditory primal sketch
    6.3.1 Motivation
    6.3.2 Research questions
    6.3.3 Method
    6.3.4 Results
  6.4 Simulation C: cochlear model
    6.4.1 Motivation
    6.4.2 Research questions
    6.4.3 Method
    6.4.4 Results
  6.5 Discussion

7 Model: Predicting prefixes
  7.1 Motivation
    7.1.1 Acoustic cues
    7.1.2 Behavioural evidence
      7.1.2.1 Increased word identification in noise
      7.1.2.2 Predictive looks at target images
  7.2 Computations
    7.2.1 Goal
    7.2.2 Environment
    7.2.3 Constraints
    7.2.4 A formal model
  7.3 Processes and representations
    7.3.1 Fine-tuned learned pattern
    7.3.2 Prefix-like prosody
    7.3.3 Other model components

8 Simulation: Linking the computational model to a behavioural experiment
  8.1 Motivation
  8.2 Implementation
    8.2.1 Overview
    8.2.2 Segmentation
    8.2.3 Feature extraction and concatenation
    8.2.4 Training
    8.2.5 Recognition
  8.3 Input
    8.3.1 Dataset
    8.3.2 Parameter choices
      8.3.2.1 Segmentation
      8.3.2.2 Feature extraction
      8.3.2.3 Training
  8.4 Method
  8.5 Results
  8.6 Discussion

9 Conclusion: Main contributions

Bibliography
List of Figures

4.1.1 Illustration of PME for equally spaced stimuli in one-dimensional acoustic space (top) and corresponding representation in perceptual space (bottom). Stimuli closer to the prototype (stimulus 0) are attracted more, and thus are less discriminable from neighbouring stimuli.

4.1.2 Listeners' individual Ps (circles, joined by continuous lines) and NPs (squares, joined by dashed lines) for the three allophonic contexts (F2 variation). Data from Barrett (1997). Each line joining values of Ps and NPs across subjects shows the great individual variability in terms of absolute values. Despite the variability, the values of Ps and NPs for each listener tend to spread over all available acoustic space. This is discussed in greater detail at the end of section 4.3.1.

4.2.1 Behaviour of the Feldman and Griffiths (2007) model in the case of one category (left) and multiple categories (right).

4.3.1 Histogram plots for F2 onset values of, respectively, /u:/, /lu:/ and /ju:/.

4.3.2 A sample plot of S (circles) and E[T|S] (squares) for a continuum of stimuli varying along the F2-onset axis. Solid lines show p(c|S) for /u:/, /lu:/ and /ju:/ (from left to right respectively), while dotted lines show the probability density function (multiplied by 100 for visibility) for each category.

4.3.3 The measures of displacement (top left), warping (bottom left), and identification (right, solid curves) for an idealised subject in the case of three categories (from left to right: /u:/, /lu:/, /ju:/). Category prior distributions based on prototypes are indicated in the right pane by the dotted lines.

4.3.4 Individual results for two subjects (left: S1, right: S2) for PME simulations with constant category variance (7234) and three levels of noise σ²_S: top: 1000, middle: 5000 and bottom: 10000 respectively. For an explanation of the plots see figure 4.3.2 and the text in this section.

4.3.5 Individual results for two subjects (left: S9, right: S10) for PME simulations with constant category variance (7234) and three levels of noise σ²_S: top: 1000, middle: 5000 and bottom: 10000 respectively. For an explanation of the plots see figure 4.3.2 and the text in this section.

4.3.6 Individual results for the first six subjects (top: S1, S2; middle: S3, S4; bottom: S5, S6) for PME simulations with constant category variance (7234) and the highest level of noise (σ²_S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.

4.3.7 Individual results for the last four subjects (top: S7, S8; bottom: S9, S10) for PME simulations with constant category variance (7234) and the highest level of noise (σ²_S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section.

5.1.1 Rhythmogram for the word instability. From top to bottom: spectrogram, waveform and rhythmogram (event and prominence detection).

5.1.2 Main processing stages to produce a rhythmogram (word: instability). From top to bottom: waveform, hair cell model output (activation in the auditory nerve), modulation spectrogram (multiresolution amplitude modulation), rhythmogram (event and prominence detection).

5.2.1 F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/.

5.2.2 Binary RVM classifiers: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Grey stars represent relevance vectors retained by the models. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5. Left panel: categories /u:/ (white circles) vs. non-/u:/ (black triangles). Right panel: categories /ju:/ (black triangles) vs. non-/ju:/ (white circles).

5.2.3 Binary RVM classifier: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Categories /lu:/ (black triangles) vs. non-/lu:/ (white circles). Grey stars represent relevance vectors retained by the model. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5.

5.2.4 Categories /lu:/ vs. non-/lu:/. Compare to figure 5.2.3. Left panel: two instances from /lu:/ with very low F1 values have been assigned to the /lu:/ category. Right panel: two previously correctly classified instances from /lu:/ with very low F1 values have been assigned to the non-/lu:/ category.

5.2.5 Categories /lu:/ vs. non-/lu:/. Compare to figure 5.2.3. Five instances from the non-/lu:/ category with very high F2 values have been assigned to the competing category to simulate an upper threshold.

6.2.1 Simulation A: classification accuracy and sparsity of RVM and SVM. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.

6.3.1 Simulation B: classification accuracy and sparsity of APS vs. energy. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.

6.4.1 Simulation C: classification accuracy and sparsity of APS with cochlear model (CM) vs. APS without cochlear model (NCM). Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits.

7.1.1 Spectrograms showing acoustic differences between mistimes (true prefix, top) and mistakes (pseudo-prefix, bottom) in the context of the same utterance (I'd be surprised if Tess mistimes/mistakes it). See section 7.1.1 for details. From Smith et al. (2012).

7.2.1 A graphical model of prefix prediction. See section 7.2.4 for an explanation.

8.2.1 Components of the model introduced in chapter 7 that were implemented for the simulation presented in this chapter (solid lines).

8.2.2 Overview of the model implementation's architecture. See section 8.2.1 for details. Thin lines on oscillograms represent acoustic chunks of increasing length.

8.2.3 The segmentation, feature extraction and feature concatenation processes as implemented. See sections 8.2.2 and 8.2.3 for explanation.

8.2.4 Resampling procedures in the feature extraction process. See section 8.2.3 for details.

8.2.5 A schematic representation of the training procedure. See section 8.2.4 for explanation.

8.2.6 A schematic representation of the recognition procedure. See section 8.2.5 for details.

8.4.1 A sample plot showing curves of proportion of looks to targets (solid lines) and competitors (dashed lines) for the match (grey) and mismatch (black) conditions. Data from Hawkins et al. (in prep).

8.4.2 Eye-tracking results for mis/dis from Hawkins et al. (in prep) in terms of proportion of looks to targets (and competitors) for group M1 (left) and group M2 (right). Group M1 was chosen for comparison with model output. See text for explanation.

8.4.3 A plot showing bias looks to targets involving a true prefix when listening to either a true (grey line) or pseudo (black line) prefix for the M1 group.

8.5.1 RVM model output for the three kinds of feature vectors: APS, MFCC and APS+MFCC. The left panels show average true prefix class probabilities for input tokens of true prefixes (grey line) and pseudo prefixes (black line). The right panels show number of relevance vectors (RV) retained and area under the ROC curve for each model step.
Abstract
Marco A. Piccolino-Boniforti
From sound to grammar: theory, representations and a computational
model
This thesis contributes to the investigation of the sound-to-grammar mapping by de-
veloping a computational model in which complex acoustic patterns can be represented
conveniently, and exploited for simulating the prediction of English prefixes by human
listeners.
The model is rooted in the principles of rational analysis and Firthian prosodic ana-
lysis, and formulated in Bayesian terms. It is based on three core theoretical assumptions:
first, that the goals to be achieved and the computations to be performed in speech re-
cognition, as well as the representation and processing mechanisms recruited, crucially
depend on the task a listener is facing, and on the environment in which the task occurs.
Second, that whatever the task and the environment, the human speech recognition
system behaves optimally with respect to them. Third, that internal representations of
acoustic patterns are distinct from the linguistic categories associated with them.
The representational level exploits several tools and findings from the fields of machine
learning and signal processing, and interprets them in the context of human speech re-
cognition. Because of their suitability for the modelling task at hand, two tools are
dealt with in particular: the relevance vector machine (Tipping, 2001), which is cap-
able of simulating the formation of linguistic categories from complex acoustic spaces,
and the auditory primal sketch (Todd, 1994), which is capable of extracting the multi-
dimensional features of the acoustic signal that are connected to prominence and rhythm,
and of representing them in an integrated fashion. Model components based on these tools
are designed, implemented and evaluated.
The implemented model, which accepts recordings of real speech as input, is com-
pared in a simulation with the qualitative results of an eye-tracking experiment. The
comparison provides useful insights about model behaviour, which are discussed.
Throughout the thesis, a clear distinction is drawn between the computational, rep-
resentational and implementation devices adopted for model specification.
Declaration
This dissertation is the result of my own work and includes nothing which is the outcome
of work done in collaboration except where specifically indicated in the text.
This dissertation does not exceed 80,000 words, including footnotes, references and
appendices, but excluding bibliographies, as required by the Degree Committee of the
Faculty of Modern and Medieval Languages.
Acknowledgements
This research was funded by an ESR fellowship of the EU MRTN-035561 research train-
ing network Sound to Sense.
I am particularly grateful to my supervisor and coordinator of Sound to Sense, Sarah
Hawkins, who inspired me with her passion for interdisciplinary research, encouraged
me during particularly hard times, challenged me intellectually and ultimately made
this opportunity of professional and personal growth possible. I am also very thankful
to my advisor, Dennis Norris, who was always very approachable and supportive, and
from whom I tried to grasp the gifts of sharp thinking and clarity of expression.
The interdisciplinary nature of my project required me to gather knowledge in many
areas. I greatly benefited from the workshops and discussions with many senior research-
ers in Sound to Sense, in particular Guy Brown, Richard Ogden and Martin Cooke, who
also welcomed me for a research stay. I am also grateful for the discussions and fun time
with fellow early stage researchers, and particularly to Bogdan Ludusan for his contri-
bution to my work and to Meghan Clayards for sharing her analyses. I also give thanks
to Rachel Baker for sharing data and analyses, and to my colleagues at the phonetics
lab and Linguistics department for fostering a positive, supportive and fun environment.
Finally, I would have never managed to accomplish this daunting task without the
loving support of my family, my girlfriend Silvia, my colleagues Marco and Sergio, and
the so many wonderful friendships that I was blessed with during my stay in Cambridge
and back home.
1 Introduction:
From sound to grammar
1.1 From sound to grammar
When we listen to someone speaking, e.g. during a telephone conversation, we can pay
attention to a number of different things: the words they are saying, their accent, their
sex, their age, their mood, their physical and even mental condition. We extract this
wealth of information from a single source (the speaker), often simultaneously. Despite
the fact that we might occasionally get some of this information wrong, in most cases,
even in the presence of noise, we succeed in a task whose “inner workings” turn out to
be quite complex to understand. Not only can we extract different kinds of information
from the same person: we can also extract the same kind of information from different
persons. So, for example, we are able to recognise one and the same word even when
it is pronounced by two people of different age, sex, geographical origin; or to recognise
two individuals as females despite differences in the words they are saying, their voice
quality, their pitch. This is possible because the recognition of speech relies on a subtle
relationship between variability and invariance. Speech researchers have been trying to
shed more light onto this complex relationship for some 60 years now (Jusczyk and Luce,
2002). So far, however, many of the questions aimed at a better understanding of it are
still in need of an adequate answer (Luce and McLennan, 2005).
Some of these questions concern the relationship between acoustic patterns and gram-
matical function, in short the sound-to-grammar mapping. While most models of spoken
word recognition to date postulate an obligatory stage of phonemic analysis as the only
“beneficiary” of acoustic information, beyond which recognition becomes a matter of
pattern matching on combinations of symbols, an increasing body of experimental data
suggests that acoustic patterns can be informative and drive recognition well beyond the
phonemic level of analysis. So, for example, a complex acoustic pattern can be a direct
cue to a grammatical category such as a morpheme (see e.g. Baker, 2008).
This thesis contributes to the investigation of the sound-to-grammar mapping by de-
veloping a computational model in which complex acoustic patterns can be represented
conveniently, and exploited for simulating the prediction of specific grammatical features
by human listeners.
The computational model described here is based on the following central theoretical
assumptions about human speech recognition. These assumptions, whose rationale requires some explanation, are discussed in greater detail in chapter 3:
1. the specific characteristics of the recognition process are strictly connected to the
particular task in the context of which speech recognition happens (3.2.1);
2. the recognition process can be interpreted as a problem of optimal decision making
while reasoning under uncertainty (3.2.2);
3. there is a distinction in memory between the representation of acoustic patterns
and the linguistic features associated with them (3.2.3).
These assumptions impose important constraints on the fundamental properties of
the representational and processing tools and techniques used for modelling, which will
be introduced in chapter 5.
An important contribution of this thesis to the analysis of the sound-to-grammar mapping lies in making explicit connections between findings and methods from various fields (most notably theoretical linguistics, experimental phonetics, experimental psychology, machine learning and signal processing) towards the common goal of developing a computational model. Another important contribution is the development of an implemented architecture that supports running simulations with real speech and comparing the results with behavioural data from human listeners.
1.2 Thesis outline
Chapter 2 - Background: Variability and invariance
An investigation of the relationship between sound and grammar is necessarily con-
cerned with the broader issue of variability and invariance in speech recognition. I first
introduce the study of this issue (2.1), and show how researchers in human speech re-
cognition have dealt with it in the past (2.2). I then describe some issues from the
behavioural, linguistic, neuro-physiological and engineering perspectives that challenge
these traditional approaches (2.3) and motivate the development of improved theories
and computational models.
Chapter 3 - Theoretical framework: A rational prosodic analysis
I first introduce a theoretical framework for the study of the sound-to-grammar map-
ping that is based on Rational Analysis (3.1.1), Bayesian statistics (3.1.2) and Firthian
Prosodic Analysis (3.1.3). I then discuss the central theoretical assumptions that form
the foundations of a computational model for the sound-to-grammar mapping (3.2): 1)
a proper characterisation of speech recognition should account for the specific task that
is pursued by listeners, and for the environment in which the task is performed (3.2.1);
2) human speech recognition can be cast as a problem of optimal decision making while
reasoning under uncertainty (3.2.2); 3) acoustic patterns and linguistic categories are not
the same thing, and they shall not be confused in models of human speech recognition
(3.2.3).
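
As a reference point for this Bayesian formulation, the posterior probability of a linguistic category c given an acoustic signal S follows from Bayes' theorem; the notation below is mine, chosen to match the p(c|S) curves plotted in chapter 4:

\[
p(c \mid S) = \frac{p(S \mid c)\, p(c)}{\sum_{c'} p(S \mid c')\, p(c')}
\]

An optimal listener, in the rational-analysis sense of 3.2.2, acts on this posterior, weighing acoustic evidence p(S|c) against prior expectations p(c) set by the task and the environment.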
Chapter 4 - Assessment: Perceptual-magnet effect
Modelling the perceptual-magnet effect (PME) helps to investigate the mapping between
acoustic information and linguistic categories. After introducing PME, I consider work
that suggests that the behaviour of listeners can be accounted for by assuming context-
dependent prototypes (4.1), rather than phonemic categories. Although a recently proposed rational Bayesian model of PME (4.2) elegantly explains the behaviour of listeners in the case of very simplified data, the simulations I present (4.3) show that the existence of context-dependent prototypes poses important challenges for the way in which phonological and grammatical categories are represented in most current psycholinguistic models.
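
To fix intuitions, here is a minimal sketch of the rational account on which such models rest, stated in the notation of figures 4.3.2-4.3.7; the Gaussian forms follow Feldman and Griffiths (2007), but the sketch is mine, and chapter 4 gives the authoritative formulation. A listener hears a noisy stimulus S and infers the speaker's intended target T for category c:

\[
T \mid c \sim \mathcal{N}(\mu_c, \sigma_c^2), \qquad S \mid T \sim \mathcal{N}(T, \sigma_S^2),
\]
\[
E[T \mid S, c] = \frac{\sigma_c^2\, S + \sigma_S^2\, \mu_c}{\sigma_c^2 + \sigma_S^2}, \qquad E[T \mid S] = \sum_c p(c \mid S)\, E[T \mid S, c].
\]

Because E[T|S,c] always lies between S and the category mean, percepts are drawn towards prototypes: the magnet effect.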
Chapter 5 - Representations: Auditory processes and linguistic categories
I introduce some modelling tools and techniques from the fields of machine learning and
signal processing, which are compatible with the theoretical principles outlined in chapter
3, and well suited for representing complex auditory patterns and the associated linguistic
categories. Because of their suitability for the implementation of the model developed
in chapter 7, two kinds of representations are dealt with. For the representation of
auditory processes (5.1), I first introduce a cochlear model (5.1.1) whose output is fed to
the auditory primal sketch (5.1.2). The auditory primal sketch is capable of extracting
the multi-dimensional features of the acoustic signal that are connected to prominence
and rhythm, and of representing them in an integrated fashion. For the representation of
linguistic categories associated with complex auditory patterns I introduce the relevance
vector machine (5.2), a sparse Bayesian machine learning technique based on the concept
of prototypical exemplars.
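
For readers unfamiliar with the technique, a compact statement of the model as defined by Tipping (2001), with the kernel K deliberately left unspecified: the RVM expands the decision function over the training exemplars and places an individual zero-mean Gaussian prior on each weight,

\[
y(x) = \sum_{i=1}^{N} w_i\, K(x, x_i) + w_0, \qquad p(w_i \mid \alpha_i) = \mathcal{N}(w_i \mid 0, \alpha_i^{-1}).
\]

Maximising the marginal likelihood over the hyperparameters α_i drives most of them to infinity, pruning the corresponding exemplars; the few that survive are the relevance vectors. For binary classification, the sigmoid σ(y(x)) is read as a category membership probability, whose 0.5 contour is the decision line drawn in figures 5.2.2 and 5.2.3.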
Chapter 6 - Evaluation: Binary classification tasks
I design and implement model components based on the modelling tools described in
chapter 5, in order to test their suitability for inclusion in the computational model
described in the next chapter, and their relative advantages over other established mod-
elling techniques and tools. The implemented components are evaluated by means of
probabilistic binary classification tasks. The dataset used is first described (6.1). In
the first simulation, a model component based on the relevance vector machine is eval-
uated against one based on the support vector machine, a modelling technique which
is more widespread but seems less suited to the simulation of aspects of human speech
recognition (6.2). In the second simulation, a model component based on the auditory
primal sketch is evaluated against a model that uses only the energy envelope of the signal (6.3). Finally, in the third simulation, a model component based on the auditory primal sketch without the cochlear model is compared to an otherwise identical model in which the cochlear model is included (6.4). The general outcomes of the evaluation are then briefly
discussed (6.5).
Chapter 7 - Model: Predicting prefixes
I develop a computational model of prefix prediction for British English in which it is
assumed that listeners, by analysing fine-tuned, learned auditory patterns in the proper
prosodic and grammatical context, can set prefix prediction as an intermediate task in
order to fulfil higher-level goals. The model is first motivated (7.1) in terms of acoustic
analyses (7.1.1), and behavioural experiments (7.1.2). The computational aspects of the
model are dealt with in terms of goal (7.2.1), environment (7.2.2) and constraints (7.2.3).
The model is then given a formal description with the aid of a Bayesian network (7.2.4).
Those model components that are implemented in the simulation are also described in
terms of processes and representations (7.3).
Chapter 8 - Simulation: Linking the computational model to a behavioural
experiment
I implement those model components that enable a qualitative comparison between
model output and the output of the eye-tracking experiment described in section 7.1.2.2.
The motivation for the simulation is first explained (8.1). I then give a detailed account
of the system architecture devised for implementing the model, with the various stages
that it involves (8.2). I further describe the dataset and model parameters used in the
simulation (8.3), and explain the method used for the qualitative comparison (8.4). I
finally present the results of the comparison (8.5) and discuss them (8.6).
Chapter 9 - Conclusion: Main contributions
The main contributions of this thesis to the investigation of the sound-to-grammar map-
ping, and more generally to the study of human speech recognition are summarised.
2 Background:
Variability and invariance
2.1 The study of variability and invariance
Our understanding of how speech recognition works on a neuro-physiological basis is,
at present, quite fragmentary (see Young, 2008, for a recent review). This, however, is
not a major obstacle to its characterisation on a functional or formal basis (see Marr,
1982), and available neuro-physiological insights can be put to good use for constraining
hypotheses about the functional and formal properties of speech recognition, insofar as
they contribute to explaining variability and invariance. Functional and formal character-
isations of speech recognition constitute a great deal of the work conducted in the last six
decades in fields as diverse as psychology, linguistics and statistical pattern recognition.
When considering the role of variability and invariance, a researcher can maintain one
among several positions between two hypothetical extremes. One extreme would con-
sider variability as being always inherently “bad”, because it is random or irrelevant. The
other extreme would consider it as being always inherently “good”, because it is system-
atic and informative. Evidently neither extreme is defensible since both would imply, for
listeners, the inability to make any kind of useful generalisation. Experimental evid-
ence, reviewed in the following sections, shows that in fact some variability is random,
and some is systematic, some is irrelevant and some is informative. Determining what is
irrelevant and what is informative, however, depends on the exact characterisation of the
task listeners are faced with (function), and thus of the mapping process that enables them to accomplish the task (form). Advances in speech recognition research can be characterised precisely as refinements in our knowledge of the mapping process, triggered by observations about function and form.
2.2 Traditional approaches
2.2.1 Minimal invariant units
First published in 1951, Jakobson, Fant and Halle’s Preliminaries to Speech Analysis
(Jakobson et al., 1951) represented an innovative blend of theoretical and experimental
work on the investigation of the properties of speech. The study quickly became popular, thanks in part to an international conference on speech communication held at MIT in
the following year (Perkell and Klatt, 1986, Preface). The book’s influence on the kind
of questions asked by researchers in speech communication has been long-lasting.
The primary goal of the Preliminaries was to propose questions about the nature
of the “ultimate discrete entities of language”, i.e. about linguistic form. What made
it particularly interesting to practitioners of several disciplines, as compared to other
linguistic investigations with the same goal (Twaddell, 1935; Trubetzkoy, 1939; Jones,
1950), was the great attention paid to the articulatory, acoustic and perceptual correlates
of the units they identified as the ultimate discrete components of language: distinctive
features. A distinctive feature was characterised as a choice faced by a listener between
two polar qualities of the same category (see Jakobson et al., 1951, p.3).
In their tentative sketch, the authors gave a systematic account of many articulatory
and acoustic correlates of distinctive features. The development of the sound spectro-
graph at Bell Laboratories (Potter et al., 1947) was invaluable in the determination of
the acoustic correlates (Fant, 2004). As to the articulatory aspects, the analysis was
influenced by the work of Chiba and Kajiyama (Chiba and Kajiyama, 1941; Fant, 2004).
Jakobson et al.’s definition, however, was based on a perceptual criterion: a “choice
faced by a listener”. Even the categories they adopted followed a terminology based on
perception, despite the explicit acknowledgement that their auditory observations were
not based on a systematic experimental survey.
It was only with the development of the Pattern Playback machine at the Haskins
Laboratories (Cooper et al., 1951) that a more detailed knowledge about the mapping
between acoustic stimuli and perceptual judgements on the identification of (synthetic)
phonemes and syllables, intended as bundles of distinctive features, could be gathered.
The Pattern Playback was a synthesiser in which a tone wheel modulated a light source
at about 50 harmonically related frequencies. A transparent or reflective spectrogram,
usually hand-painted, filtered specific portions of this harmonic source, which were passed
to a photo-tube and converted to sound. The great novelty introduced by the Pattern
Playback consisted in the flexibility it gave to researchers in the manipulation of sounds.
This simple, yet powerful technique was key to the discovery of fundamental perceptual
phenomena, such as categorical perception (Liberman et al., 1957); the role of spectral
loci (Delattre et al., 1955) and that of the main spectral prominences of the transient
relative to the vocalic part (Liberman et al., 1952) for the perception of the occlusive in
CV syllables. A limitation of the method was its unsuitability for the faithful reproduction of aperiodic portions of the spectrogram.
The model of language developed in the Preliminaries followed a computational per-
spective that was explicitly formulated according to the principles of the newborn re-
search field of information theory (Shannon, 1948). It was the intention of the authors
to establish a codebook that could faithfully and efficiently represent the transmission
of spoken messages. Distinctive features seemed the appropriate unit of analysis for
this endeavour. Jakobson and colleagues pursued this goal by identifying and “stripping
away” all acoustic variability in the speech signal that was considered as redundant,
while keeping those acoustic correlates that were deemed as essential to the definition of
the invariant units of analysis. The same purpose underlay the experimental work at the
Haskins Labs (Liberman et al., 1952) and found its linguistic counterpart in structural-
ist approaches to the analysis of language, including previous work by Jakobson himself
(Bloomfield, 1933; Jakobson, 1939; Harris, 1951).
Evidently, the notion of redundancy could only be elaborated with respect to a certain
task, or functional criterion. All the work contained in the Preliminaries assumed that
this task was a
“test of the intelligibility of speech, [where] an English speaking announcer
pronounces isolated root words (bill, put, fig, etc.), and an English speaking
listener endeavors to recognize them correctly” (Jakobson et al., 1951, p. 1).
This was most likely a choice dictated by experimental and analytical constraints. How-
ever, such a task has little to do with actual speech communication. The authors them-
selves, in the introduction of the book, highlighted the difference between the two tasks
very clearly.
Fant offered a critical retrospective of the featural approach, coming to the conclusion
that “the hunt for maximum economy often leads to solutions that impair the phonetic
reality of features” and that “a simple one-to-one relationship between phonetic events
and phonological entities is exceptional” (Fant, 1986, p. 482).
While many at MIT and Haskins were exploring the question of minimal units, re-
searchers at the Harvard Psycho-Acoustic Laboratory and elsewhere were realising the
importance of context (intended both as the whole acoustic neighbourhood and the num-
ber of possible lexical choices available to the listener) for the recognition of word stimuli,
both in isolation and in relation to a whole sentence.
2.2.2 The role of context
Miller et al. (1951) found that 1) level of background noise, 2) number of lexical items to
be considered, 3) word vs. non-word and 4) syntactic/semantic context all had a great
influence on the intelligibility scores of spoken stimuli: less background noise, smaller
number of available choices, word status and previous context improved recognition, with noise level and number of available choices influencing the threshold of noise for intelligibility.
Ladefoged and Broadbent (1957) demonstrated the role of the preceding acoustic con-
text for the identification of the vowel in one out of four possible monosyllabic, synthes-
ised words in English. In their study, subjects listened to the carrier sentence Please say
what this word is, synthesised with the Parametric Artificial Talker (Lawrence, 1953).
Acoustic parameters in vowel formants were varied, as if the sentence was uttered by
different talkers. The sentence was followed by an acoustic token, which was exactly the
same for different carrier sentences. Despite this fact, for example, 97% of the listeners
recognised the token after one version of the carrier sentence as bit, whereas 92% of them recognised the same token after a different version of the carrier as bet. This study was
regarded as positive evidence for the theory of Joos (1948), according to which “the
phonetic quality of a vowel [i.e. the acoustic correlates of a perceptual category] depends
on the relationship between the formant frequencies for that vowel and the formant fre-
quencies of other vowels pronounced by the same speaker” (Ladefoged and Broadbent,
1957, p. 99).
In a different study, Miller and Selfridge (1950) investigated the role of the units of
analysis from an information-theoretical point of view. While Jakobson et al.’s focus
was on the codebook (that is on representational issues concerning the identity of the
invariant units) the main interest of Miller and Selfridge was rather on the code, intended
as a concatenation of atomic units. For the authors’ purposes, the units could have been
either phonemes, or words, or any other element amenable to sequential representation. In that experimental study, the authors investigated the role of what they
named “verbal context” on the recall of spoken passages of text by listeners. To this
purpose, they devised so-called nth order approximations to the English language, i.e.
statistical models of a language based on the knowledge of the relative frequency of
successive units (phones, syllables, words), up to the nth unit.
To implement these models, Miller and Selfridge presented a sequence of n words to an
English speaker and asked her/him to complete the sequence with one more word. The n+1 sequence was then presented to another speaker, who completed it with a further
word. The completed sequences were then recorded by a male speaker and played to
listeners. All listeners heard sequences of various lengths (10, 20, 30 and 50 words) and
various orders of approximation and were asked, after having listened to each sequence,
to write down as many words in the correct order as they could possibly remember. Miller
and Selfridge’s main findings were that both higher order of approximation and shortness
of the sequence correlated with higher recall scores, with the two factors interacting:
higher order approximations seemed to help recall especially with longer sequences.
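
Miller and Selfridge's elicitation procedure has a direct statistical analogue. As an illustration only, the sketch below generates nth-order approximations by sampling continuations from n-gram statistics estimated over a corpus, in place of the human completions used in the original study; the corpus file and function names here are hypothetical.

import random
from collections import defaultdict

def build_ngram_model(tokens, n):
    """For each (n-1)-word history, collect every continuation observed in the corpus."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        history = tuple(tokens[i:i + n - 1])
        model[history].append(tokens[i + n - 1])
    return model

def approximate_language(model, n, length, seed):
    """Generate an nth-order approximation (n >= 2): each new word depends
    only on the n-1 preceding words, mirroring Miller and Selfridge's chains."""
    sequence = list(seed)  # seed: n-1 words occurring contiguously in the corpus
    while len(sequence) < length:
        history = tuple(sequence[-(n - 1):])
        continuations = model.get(history)
        if not continuations:  # dead end: no continuation ever observed
            break
        # choosing from the raw list samples in proportion to corpus frequency
        sequence.append(random.choice(continuations))
    return sequence

# Hypothetical usage with a whitespace-tokenised corpus:
# tokens = open("corpus.txt").read().split()
# model = build_ngram_model(tokens, n=3)
# print(" ".join(approximate_language(model, n=3, length=20, seed=tokens[:2])))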
Studies like these, albeit very different as to methodology adopted and immediate
goals, all acknowledged the fact that the amount of contextual information available to
listeners strongly influences the recognition of the message at various levels of analysis.
2.2.3 Beads on a string
The identification of minimal invariant units and the investigation of their concatenative
properties constitute the so-called beads-on-a-string view of speech recognition. This view has formed the basis for most accounts of human and automatic speech recognition to this day (Luce and McLennan, 2005). Current psycholinguistic models of spoken
word recognition that rely on this principle include Trace (McClelland and Elman, 1986),
Shortlist A (Norris, 1994), and PARSYN (Luce et al., 2000). It is also an integral part of
the widespread Hidden Markov Model (HMM) approach to automatic speech recognition
(Baker, 1975; Jelinek et al., 1975). As noted above, its origins have connections to
information theory and structural linguistics. Crucially for our discussion about the
relationship between sound and grammar, this approach postulates 1) a mapping between
the acoustic signal and a sequence of discrete, abstract units, all of which belong to the
same level of linguistic analysis (in most cases either distinctive-featural or phonemic);
2) a concatenation of these units as input to further levels of analysis (e.g. the word
level). Early accounts of human speech recognition that adopted this approach were Fry
(1959) and Halle & Stevens (1962). The account offered by Fry constituted the basis for
one of the first automatic speech recognisers ever built, the speech typewriter described
in Denes (1959).
The accounts of human speech recognition that adopt the beads-on-a-string view differ
from one another in many respects, particularly as to the mapping function that is used
to obtain the sequence of symbolic units from the acoustic stream; however, they all
share the assumption that acoustic information only serves the purpose of guiding the
recognition of minimal units, and does not intervene directly in the determination of
other kinds of linguistic structure. The examples presented in section 2.3 suggest that
this might not be the case: rather, human listeners seem to rely on acoustic cues in
order to gather information about other kinds of linguistic structure as well, including
grammatical categories.
2.3 Challenges
2.3.1 Indexical variation
A beads-on-a-string view requires a certain degree of abstraction of the units of analysis,
which arises from discretisation. This in turn implies that what is usually termed in-
dexical variation, e.g. differences in speaking rate, among talkers or in affective states
(Luce and McLennan, 2005), is not accounted for by the units, and thus unexplained
variance within the units increases. If indexical variation did not influence the recog-
nition of minimal units, it would represent an independent issue. However, this does
not seem to be the case. The already mentioned study by Ladefoged and Broadbent
(1957), which ultimately simulated a difference among talkers, already pointed to this.
Additional evidence was collected in further studies.
Peters (1955) found that messages uttered by a single talker in noise were reliably
more intelligible than messages uttered by multiple talkers. Creelman (1957) found an
inverse relationship between performance on the identification of words and number of
talkers. Findings like these were confirmed in later studies, such as Mullennix et al.
(1989).
Several studies highlighted the influence of speaking rate on the recognition of phon-
emes for which the rate of temporal change had been found to be relevant. Liberman
and colleagues (Liberman et al., 1956; Miller and Liberman, 1979) showed it for the
[w]/[b] distinction. Verbrugge and Shankweiler (1977) cross-spliced syllables from con-
texts at fast speaking rates into contexts at slower speaking rates. This affected vowel identification: for example, subjects misidentified [a] as [ʌ].
Investigating the role of variation in talker, speech rate and amplitude in recognition
memory, Bradlow et al. (1999) found that variation in both talker and speech rate had
an influence on recognition judgements (old vs. new), while amplitude did not. In the
case of an old word, however, listeners were reliably able to indicate whether it was
repeated by the same talker, at the same rate or at the same amplitude. This hinted to
the fact that some kinds of information (amplitude, in this case) are nonetheless stored
even when they do not influence recognition judgements.
Thus, there is evidence to postulate some interaction between indexical variation and
phoneme or word recognition. This was the main motivation for the development of
alternative approaches to speech recognition. Some approaches postulate the retention of a very large number of encountered patterns (exemplars) in long-term memory
(Goldinger, 1998). Such a formulation accounts for generalisation effects by postulat-
ing analogical processes between stored and new exemplars at the time of recognition.
Other approaches postulate the co-existence of multiple units of linguistic analysis that
would be triggered according to the task at hand and/or the phonological structure of
a specific language (3.1.3). These two perspectives are not necessarily at odds, as they
mainly differ at the representational level (5.2). The next section will present selected
examples in which a direct sound-to-grammar mapping seems necessary to explain the
behaviour of listeners.
2.3.2 Linguistic factors
Indexical variation is not the only kind of variability that is not accommodated adequately in mainstream models of human speech recognition: acoustic variability due to linguistic factors other than phonemic identity also needs to be accounted for in such models,
since there is plenty of evidence that listeners are sensitive to it. Being able to account
for and exploit this kind of variability is the main motivation for the research presented
in this thesis.
A long research tradition acknowledges the fact that, in many languages, acoustic
cues not directly mappable onto phonemes or distinctive features play an important role
in the perceptual identification of morphological, lexical and syntactic boundaries. For
example, in English specific segment and syllable duration relationships may signal an
upcoming pause or prosodic boundary, such as the end of an utterance (see e.g. Klatt,
1976). In several cases, different acoustic features of variable granularity, arranged into
complex configurations, contribute together to the definition of linguistic structure, e.g.
in signalling word segmentation (Smith, 2004, for English).
Other experiments show that listeners are also sensitive to subtle but systematic vari-
ations in acoustic parameters which are linked to differences in prosodic structure, which
in turn are triggered by lexical differences. For example, Salverda et al. (2003), using
the visual paradigm (Tanenhaus and Spivey-Knowlton, 1996) and cross-splicing, found
that subjects were sensitive to acoustic differences, particularly in duration, which were
due to the monosyllabic vs. polysyllabic nature of a word (e.g. ham- as in ham vs.
hamster). Kemps et al. (2005a) arrived at similar conclusions for morphologically com-
plex vs. morphologically simple words (e.g. in Dutch singular/plural nouns: boek- as in
boek [buk] vs. boeken [bukə]). Baker (2007a; 2008) investigated the perception of true
vs. pseudo prefixes in English, e.g. dis- as in distasteful (true, i.e. productive and with
clear compositional meaning) vs. distinctive (pseudo). In a fill-the-gap type listening
experiment in noise, she found that indeed cross-splicing some true prefixes with pseudo-
prefixed stems and vice versa had a negative impact on recognition performance. In this
case, although there was interaction with sentence focus (nuclear vs. post-nuclear stress
on the accented syllable), some variation should clearly be attributed to morphological
differences.
These and many other findings suggest that models of human speech recognition
should account for many more sources of variability than simply phonemic identity; and
that these sources are not limited to indexical properties, but include prosodic structure
as a manifestation of grammatical differences at various levels.
2.3.3 Auditory processes
One of the main limitations of most models of spoken word recognition is their reliance
upon strings of segments (either features or phonemes) as input to the model (2.2.3).
In addition to the increasing amount of behavioural evidence about the role of phonetic
detail in speech recognition (2.3), psycho-acoustic and neuro-physiological studies also
show that sound waves undergo substantial, partly still unexplained transformations
along their journey through the auditory nerves and on the cerebral cortex. While a
complete account of these transformations is both impossible and inappropriate in this
context, it is still worthwhile to review the main findings that document the way acous-
tic information is encoded during recognition. Some of these turn out to be informative
when it comes to the design of processing stages in models (7.3), and to the understand-
ing of the role of variability. While most neuro-physiological evidence about auditory
mechanisms comes from laboratory animals rather than humans, it seems that some
of these mechanisms, particularly those happening at the auditory periphery, are also
applicable to humans (Pickles, 2008). A comprehensive review of the findings of the last
25 years about the neural representation of speech can be found in Young (2008), which
constitutes the main information source for this section.
Neuro-physiological data shows that brains perform elaborate transformations on the
input signal. Moreover, these transformations are not limited to feature extraction, but
seem to suggest the formation of auditory objects as a response to specific behavioural
needs. While details of the representations in the auditory cortex still escape us, and
acknowledging that, because of the interaction with the language areas, auditory areas
in human brains might behave in even more complex ways than animal data suggests
(see e.g. Zatorre and Gandour, 2008), neuro-physiological data provides an independent
source of evidence of the direct relationship between complex acoustic patterns and
meaning-bearing linguistic categories.
A major finding that seems to encompass, to different degrees, all levels of the neural
representation of speech is the so-called tonotopic nature of the representation: in the
auditory system different frequency bands are analysed separately. This happens during
the conversion from mechanical to neural signal, at the interface between the inner ear
and the auditory nerve.
Hair cells are arranged along the whole extent of the basilar membrane; because of their arrangement, specific fibres of the auditory nerve departing from them respond
to specific frequencies, i.e. their discharge rates become high only in the presence of
excitations that fall within a certain frequency range. For this reason, many models
have interpreted the basilar-membrane / hair-cells analysis of the signal as a bank of
bandpass filters (Patterson et al., 1988). This approximation can be useful, but it is a
gross simplification in several respects. First, frequency selectivity varies significantly with sound pressure level; second, auditory fibres undergo saturation effects; finally,
complex interactions among auditory fibres give rise to inhibitory effects: excitation of
fibres with a certain best frequency can suppress the excitatory levels of neighbouring
fibres. Inhibitory mechanisms are still not fully understood, particularly when it comes
to higher neural regions.
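
To make the filterbank idiom concrete, here is a minimal sketch of a gammatone filterbank in the spirit of Patterson et al. (1988). It assumes the Glasberg and Moore (1990) equivalent-rectangular-bandwidth formula and fourth-order filters; it is the linear idealisation just described, not the cochlear model used later in the thesis.

import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Hz) at centre frequency fc (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Gammatone impulse response: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)  # bandwidth factor conventionally paired with 4th-order filters
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))  # crude peak normalisation

def gammatone_filterbank(signal, fs, centre_freqs):
    """One output channel per centre frequency: the 'bank of bandpass filters' picture."""
    return np.stack([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])

# Hypothetical usage: a 30-channel decomposition of one second of noise.
# fs = 16000
# x = np.random.randn(fs)
# channels = gammatone_filterbank(x, fs, np.geomspace(100.0, 6000.0, 30))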
Fibres in the auditory nerve can also be classified according to their dynamic range
and thresholds, i.e. their activity span, from the lowest sound pressure levels at which
discharge rates are observed, to the levels at which they attain saturation. From a
temporal perspective, the neural encoding at the auditory periphery displays effects such
as stimulus onset enhancement (a sudden increase in neural activity as a response to the
onset of a sound after silence) and inhibitory effects among successive sounds. There
is also evidence of specialised mechanisms to extract amplitude modulation information
(Joris et al., 2004).
Despite these transformations and non-linearities, the transmission of information
along the auditory nerve can be considered as quite faithful to the original signal, and
thus easily interpretable. The same is not true higher up along the auditory pathways.
While data about these levels of representation is more fragmentary, it is still possible
to determine some of their salient characteristics.
Fibres of the auditory nerve terminate in the cochlear nucleus, which contains between
five and ten neural subsystems operating in parallel. Neurons that make up this nucleus
are of different kinds. Among the few studied, primary-like and chopper neurons display
different responses to the input from the auditory nerve. Primary-like neurons behave
similarly to neurons in the auditory nerve, i.e. they offer a response which is quite faithful
and has high frequency resolution, as their main role is to transmit acoustic information
to other centres of the brain for auditory localisation. Chopper neurons, on the other
hand, seem to be more robust to ambient noise and differences in sound pressure levels.
They achieve this result by being sensitive to low pressure levels, while at the same
time having a mechanism to regulate their dynamic range in order to avoid saturation.
This behaviour has suggested the hypothesis that chopper neurons might
possess a switching mechanism that regulates their responses to input auditory nerve
fibres of various thresholds and dynamic ranges. Both primary-like and chopper neurons
display higher gain levels than neurons in the auditory nerve. Combined with the tonotopic architecture, this results in an improved spectral representation of prominent events like
vowel formants.
The cochlear nucleus is one of the structures connected to the inferior colliculus. The
inferior colliculus is distinctive in the kind of response it provides to amplitude
modulation. While neurons in lower areas of the auditory system provide a fairly
straightforward representation of amplitude modulation, responses in the inferior
colliculus are mostly transient, and are observed particularly in relation to transient
events in the input signal: conversely, acoustic portions representing steady states
(e.g. the central parts of many vowels) are not accompanied by significant neural
activity. This has been interpreted as a mechanism of perceptual enhancement for
acoustic events like stop bursts relative to vocalic portions.
While aspects of tonotopic organisation are also observable in the auditory cortex,
responses at that level are not as easy to correlate with inputs as they are in the auditory
nerve, cochlear nucleus and inferior colliculus, despite the transformations undergone in
these earlier stages. Young (2008) lists three reasons for this.
First, in animals (e.g. marmosets), cortical neurons seem to be selective for sounds
that are important for the species, like the vocalisations of conspecifics, whereas the
same or similar sounds do not enjoy such selectivity in other species. That is, cortical neurons
seem to respond to sounds as meaningful objects, rather than to their bare spectral and
temporal features.
Second, despite the tonotopic organisation, neurons in the auditory cortex are highly
adaptable, in the sense that their characteristic frequency can shift if a certain task
demands it, sometimes only temporarily. Moreover, in their degree of response, neurons
are also sensitive to stimulus frequency.
Third, simple models based on the responses of cortical neurons to particular sets
of stimuli characterised by similar spectro-temporal features do not seem to have
high predictive power, suggesting that the way neurons respond to sounds is more
complex.
Thus, neurophysiological evidence also strongly suggests that the mapping between
acoustic patterns and linguistic units is a very complex one. While it is still impossible
to model all these transformations, they should at least be acknowledged in models of
speech recognition that try to give an account of representations, processes and the way
these are implemented in the brain.
2.3.4 Automatic speech recognition
Moving beyond beads-on-a-string is not only a theoretical necessity imposed by the
explanation of data like those presented in the previous sections. With constant de-
velopments in automatic speech recognition technology, researchers in that area have
increasingly become aware of the intrinsic limitations of a classical HMM framework,
where context-free or context-dependent acoustic models of phones (or syllables) are
the only interface between the acoustic signal and linguistic categories (Jurafsky and
Martin, 2009). The main trigger of this awareness has been the issue of pronunciation
variability, mostly understood as variability due to geographical or sociolinguistic factors.
Traditionally, this issue has been tackled by explicitly listing in a dictionary several pro-
nunciations for the same word. This solution, however, has proven to be unsatisfactory,
particularly when dealing with spontaneous speech (Ostendorf, 1999). For this reason,
many researchers have been considering alternative approaches (Baker et al., 2009).
The HMM framework has proven to be a flexible and powerful formalism for the mod-
elling of many aspects of speech recognition. Yet, its intrinsic limitations are well known
and constitute a major bottleneck for bridging the gap between human and machine
performance (Ostendorf, 1999). Among these, the most relevant ones include the lack
of embedded mechanisms for the modelling of event durations and the assumption of
conditional independence for successive acoustic observations. A further limitation of
standard HMM architectures is given by the blending of acoustic detail due to the
representation of the acoustic space via mixtures of Gaussians or other kinds of distributions
based on summary statistics.
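To make the duration limitation concrete: a standard HMM state with self-transition probability a_ii implicitly assigns to a state occupancy of duration d (in frames) the geometric probability

    P(d) = a_ii^(d-1) (1 - a_ii),

which decays monotonically with d and therefore fits poorly the roughly bell-shaped duration distributions observed for real speech events.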
Many alternatives have been proposed to overcome these issues. Some of these, rather
than discarding the HMM framework, try to enhance its capabilities (Ostendorf et al.,
1996; Deng et al., 2006, II.B). Other proposals aim at a different characterisation of the
recognition process. Among the latter, a few proposals concentrate their attention on
the modelling of the articulatory aspects of speech, e.g. by trying to model vocal tract
dynamics (Deng et al., 2005, 2006). Such proposals are interesting in that they try to
give a unified account of production and perception, along the lines of popular theories
of speech perception such as the motor theory of Liberman et al. (1967; 1985) and the
direct-realist account of Fowler (1990). These theories, however, are controversial, and
reliance upon production mechanisms
is not necessary in order to account for many aspects of speech recognition (Jusczyk and
Luce, 2002). From the point of view of the discussion here, it is thus more interesting to
look at architectures that allow more freedom regarding the nature of the perceptual units
involved in recognition, while at the same time remaining agnostic about the relationship
between production and perception. Among these, two interesting implemented
proposals are template-based systems (De Wachter et al., 2007; Demange and Van Com-
pernolle, 2009; Maier and Moore, 2007) and graphical models for ASR (Bilmes, 2003;
Bilmes and Bartels, 2005).
3 Theoretical framework:
A rational prosodic analysis
3.1 Analytic foundations
3.1.1 Rational analysis
A cognitive system can be described from several perspectives: for example, its purpose
and the goals it strives to achieve; the mechanisms adopted to achieve those goals;
and the physical properties that make those mechanisms work effectively. According to
Marr (1982), the levels of explanation of any information processing system are usually
only loosely coupled: thus, it should be possible to describe a cognitive system from a
particular perspective, while only sketching the others. This assumption is indeed crucial
for any endeavour that strives to model a relatively complex system.
Anderson (1990; 1991) acknowledges this independence principle as one of the found-
ations of what he termed a rational analysis of cognitive systems. Anderson’s rational
analysis assumes that a cognitive system has a purpose, which can be described by form-
ally defining the task it has to achieve and the environment in which it operates. As
the name implies, a rational analysis assumes that cognitive systems behave rationally
in taking decisions. According to Anderson’s terminology, “rational” means that the
system, which is optimally adapted to its environment and to the task, makes use of all
available information about the task and the environment to fulfil its goals.
While the characterisation of cognitive systems in terms of their purpose has been
widely accepted, the concepts of optimality and rationality are seen by many as not ac-
counting for behavioural data about irrational and non-optimal decision making (Kahneman
and Tversky, 1973; Kahneman et al., 1982; and Lopes, 1991 for a critical review).
However, Chase et al. (1998) have argued that giving more relevance to the constraints
that the environment imposes on the cognitive system, and to simple approximations to
optimal solutions (a bounded rationality, as they call it), accommodates these
discrepancies by clarifying what it means to be optimal for a particular
system.
Rational analyses have been developed for the characterisation of many aspects of
perceptual and cognitive systems (Chater and Oaksford, 1999; Oaksford and Chater,
2008): from causal relations (Griffiths, 2005) to associative memory (Anderson and
Schooler, 1991), from continuous speech recognition (Norris and McQueen, 2008) to
category learning (Sanborn et al., 2010).
3.1.2 Bayes’ theorem
An important advantage of a rational analysis over a mechanistic explanation of a cog-
nitive system is that it can be readily expressed in formal terms by using Bayes’ theorem.
This is particularly useful when information from the environment and the task that is
available to the cognitive system is uncertain or incomplete, as is the case for perceptual
systems (2.1). In the analysis of speech recognition, the great variability found in acous-
tic patterns must be harmonised with the persistence of the linguistic and extra-linguistic
categories identified (2.3). By adopting probabilistic reasoning, we can associate an am-
biguous acoustic pattern to a set of competing linguistic structures (hypotheses) with
different degrees of confidence, and also update confidence scores as soon as new, per-
haps disambiguating acoustic evidence for or against a particular linguistic hypothesis
becomes available.
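The updating just described is an instance of Bayes' theorem. Writing h for a linguistic hypothesis and a for the acoustic evidence observed so far (a purely notational sketch, anticipating only loosely the formulation developed in later chapters):

    P(h | a) = P(a | h) P(h) / Σ_h' P(a | h') P(h')

The confidence (posterior probability) in each hypothesis thus combines its prior plausibility with how well it predicts the acoustic evidence, renormalised over the competing hypotheses; as new evidence arrives, the current posterior serves as the prior for the next update.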
Bayesian principles (see e.g. Griffiths and Yuille, 2008) constitute a powerful tool for
probabilistic reasoning and hypothesis testing. While in a frequentist approach
hypotheses are evaluated exclusively on the observed evidence, in the Bayesian framework a
hypothesis can be given a prior probability, independently of the observed evidence.
In the case of speech recognition, this means that hypotheses about linguistic categories
can be constrained by many additional linguistic and non-linguistic factors, which we can
loosely define as “context”.
Bayesian probability theory has been at the core of HMM automatic speech recognition
technology for more than twenty years (Jurafsky and Martin, 2009). More recently, and
particularly after the work of Anderson, it has also gained great popularity for the
modelling of cognitive systems (see e.g. Griffiths and Tenenbaum, 2006, and the early
discussion in Watanabe, 1985). Scharenborg et al. (2005) give a unified account of human
and automatic speech recognition, showing how describing the task of speech recognition
as reasoning under uncertainty in a Bayesian setting helps to bridge the gap between
modelling endeavours in ASR and HSR, despite differences at the implementation level
(human brains vs. computers). Finally, Norris and McQueen (2008) show convincingly
how a formulation of continuous spoken word recognition as a Bayesian problem of
optimal decision making accounts elegantly for many effects that in other modelling
frameworks would require special treatment.
For the particular purpose of this thesis, probabilistic Bayesian modelling offers sev-
eral advantages over other kinds of statistical modelling. First of all, at the core of the
Bayesian framework is a treatment of data and hypotheses in probabilistic terms. As
already described (2.1), such a treatment is required by the very nature of the problem
at hand (high degrees of random variability in the acoustic patterns; inherent ambiguity
of certain linguistic structures). Second, it is desirable to constrain the scoring of hy-
potheses about linguistic categories based on the broad “context” in which they operate
because, as we will see (3.2), the linguistic categories that are recruited during recogni-
tion are assumed to be task- and environment-specific. Probabilities offer a convenient
mechanism for doing so: frequency effects can be easily modelled with prior probabilities,
and contextual effects by incorporating other sources of evidence. Finally, because of
the underlying probabilistic reasoning, Bayesian modelling can be applied equally well to
various kinds of representations for data and hypotheses: atomic symbols, scalar values,
discrete and continuous distributions, complex structures like graphs. Particular kinds
of Bayesian models offer additional advantages. Those offered by sparse Bayesian models
are discussed in section 5.2.
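As a minimal illustration of these points, consider the following Python sketch. The hypotheses, priors and likelihood values are entirely made up for the example; frequency effects enter through the prior, and each new stretch of acoustic evidence updates the posterior over the competing hypotheses:

    import numpy as np

    # Three competing (hypothetical) word hypotheses, with priors
    # standing in for lexical frequency.
    hypotheses = ["pat", "bat", "mat"]
    posterior = np.array([0.5, 0.3, 0.2])   # initial prior

    # Likelihoods of two successive acoustic observations under each
    # hypothesis (illustrative numbers only).
    evidence = [np.array([0.6, 0.3, 0.1]),
                np.array([0.7, 0.2, 0.1])]

    for likelihood in evidence:
        posterior = posterior * likelihood   # Bayes' theorem...
        posterior /= posterior.sum()         # ...with renormalisation
        print(dict(zip(hypotheses, posterior.round(3))))

After each observation the scores remain proper probabilities, so further sources of evidence (contextual, indexical) could be folded in by the same multiplicative mechanism.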
3.1.3 Firthian prosodic analysis
A beads-on-a-string view, in which speech is treated as a concatenation of homogeneous
units, is not sufficient to account for human performance in the recognition of examples
like those in 2.3.2. Those examples show that listeners’ judgements are driven, to various
degrees, by diverse cues that cannot be located on a single segment, or that cannot be
related to short term spectral properties. The discussion in 2.3.1 has also pointed out
that what is usually considered as indexical variation has in fact a direct influence on
recognition performance, and hence cannot be excluded from a comprehensive model of
human speech recognition.
While the urge to overcome beads-on-a-string represents an element of relative novelty
in psycholinguistic modelling (Luce and McLennan, 2005), in descriptive linguistics the
question has been extensively investigated since at least the 1940s. Some of the accounts
elaborated in that context, however, have remained fairly marginal and less widespread
than works that, regarding the issue of variability and invariance, were based on more
“orthodox” views (e.g. Chomsky and Halle, 1968). One framework for linguistic analysis
provides particularly helpful insights in this respect: Firthian Prosodic Analysis, or simply
Prosodic Analysis (Palmer, 1970, henceforth FPA), developed by J.R. Firth and his co-workers
at the School of Oriental and African Studies in London (Firth, 1948). Its development
was motivated by the unsuitability, according to Firthians, of classical methods of phon-
emic analysis (e.g. Pike, 1947) to the description of many regularities in languages. Firth
attributed the classical analyses that considered the phonology of a language as a unitary
system of phonemic contrasts (the beads-on-a-string of section 2.2.3) to the influence of
Roman script, noting how other writing systems, based on different principles, were more
suited to a more economical description of the languages they had been developed for.
Firth mainly disputed the mostly paradigmatic and mono-systemic nature of phonemic
approaches.
Firth started by considering how certain acoustic patterns (‘phonetic exponents’ in
FPA terminology) are more economically and profitably described by referring primarily
to their collocation within a certain linguistic structure (their syntagmatic properties),
rather than to their spectral similarity to other segments occurring in a different context
(their paradigmatic properties). For example, in British English, in words like pat and
tap, from an acoustic point of view there are potentially many more commonalities (e.g.
in terms of degree of aspiration, duration, intensity) between syllable-initial [p] and [t]
vs. syllable-final [p] and [t] than between both [p]’s vs. both [t]’s. Such commonalities
are determined, in this specific case, by syllabic structure, and can thus be predicted
quite independently from the actual segmental content. In this example, the syllable is a
suitable context for the prediction of many phonological properties of the word, which in
turn determine many of its observable acoustic patterns. Other properties, conversely,
would require the consideration of a wider context in which the word is embedded. This
clear distinction between ‘sounds’ (segments in a traditional sense) and ‘prosodies’ (the
properties of a given phonological context) allows one to dispense with the transform-
ational rules that became one of the main points of interest in Chomsky and Halle’s
The Sound Pattern of English (Chomsky and Halle, 1968) and, under different formula-
tions, of many successive generative approaches to phonological theory (e.g. Prince and
Smolensky, 1993).
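The pat/tap example can be caricatured in a few lines of Python (the feature names and values below are invented purely for illustration):

    # Phonetic exponents predicted jointly from syllabic position (a
    # prosody) and segment identity, rather than from the segment alone.
    exponents = {
        ("onset", "p"): {"aspiration": "strong", "duration": "long"},
        ("onset", "t"): {"aspiration": "strong", "duration": "long"},
        ("coda", "p"): {"aspiration": "weak", "duration": "short"},
        ("coda", "t"): {"aspiration": "weak", "duration": "short"},
    }
    # Onset [p] shares its exponents with onset [t], not with coda [p]:
    # the commonalities follow the context, as argued above.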
Particularly relevant for our investigation of the sound-to-grammar mapping is the
nature, within a Firthian analysis, of context. Prosodies can be associated with linguistic
units of all kinds. We might have prosodies which serve the purpose of delimiting a
syllable, or a word, but also prosodies that mark grammatical categories, like verb vs.
noun (e.g. a ‘stress’ prosody in many bisyllabic English words, like re’bel vs. ’rebel),
active vs. passive (e.g. a ‘nasality’ prosody in the Eritrean language Bilin, see Robins,
1970), or prefix vs. non-prefix (Ogden et al., 2000). In addition, some prosodies might be
associated with aspects of speech that usually fall under the label of indexical variation
(2.3.1): mood, gender, register etc. Thus, in FPA terms, a language is a collection of
interacting subsystems, rather than a monolithic, hierarchical system.
In FPA, there is a clear distinction between phonological structure and the phonetic
manifestations thereof. A prosody, which is an aspect, or element, of phonological struc-
ture, will be manifested at the acoustic level by phonetic exponents (acoustic patterns).
That is, a prosody can be thought of as an invariant (and thus abstract) element associ-
ated with a particular linguistic context, which is realised acoustically by a co-occurrence,
or relation, of acoustic features forming a consistent acoustic pattern. The linguistic
context is of great importance for the definition of a prosody. A similar acoustic pat-
tern appearing in two different linguistic contexts is not automatically considered to be
associated with the same prosody: in such a case, acoustic similarity might have no relev-
ance whatsoever from a phonological point of view. This view is at odds with the one
presented in section 2.2.3.
While in most cases both a generative approach and FPA, albeit very differently, are
able to adequately explain the same linguistic data, in some circumstances the FPA ap-
proach succeeds where a standard phonemic analysis is not straightforward. An example
of this is the data of Hawkins and Nguyen (2004) on coda voicing. By examining pairs
such as led and let, they found that in addition to the well known effects on vowel dur-
ation, in non-rhotic varieties of British English coda voicing also affects the duration of
/l/, which is longer, and F2 and centre of gravity as measured at /l/’s onset, which are
mostly lower, as compared to the voiceless condition. The influence of coda voicing on
/l/ onset cannot be easily linked to anticipatory co-articulation, and thus is not easily
motivated without giving the right weight to the broader linguistic structure. An FPA
analysis accounts very naturally for this phonetic behaviour by interpreting coda voicing
as a property of the whole syllable, which is thus manifested by several, possibly
non-adjacent, acoustic cues. In addition to the weight given to linguistic context, FPA's
polysystematicity also gives a consistent explanation of linguistic contrasts which would
otherwise seem rather opaque. An example is given by pairs of prefixed words that differ in
historical origin (Hawkins and Smith, 2001). Unknown, unnatural and innate all contain
prefixes which are monosyllabic and nasal. Furthermore, all words bear primary stress
on the second syllable. Despite these similarities, the first two words are rhythmically
very different from the third, displaying a longer /n/. The difference is, however,
easily explained if one considers the origin of the prefixes (Germanic in the first two
cases, Latinate in the third) and thus postulates two co-existing linguistic systems.
Cases like these, together with the other examples provided in section 2.3, relating
to phonetic detail signalling prosodic boundaries through rich phonetic exponents,
differences between morphologically simple and complex words, and prefixed vs. pseudo-
prefixed words, suggest that in terms of explanatory power, an FPA-style approach to
the modelling of variability seems to offer substantial advantages over a beads-on-a-string
one. However, there are several issues connected to its adoption. In the first place, one
must not forget that FPA is a framework for linguistic analysis, which does not make any
claim regarding the exact nature of the psychological processes and representations driv-
ing human speech recognition (Firth, 1948). This said, there seems to be at least some
evidence suggesting that an FPA-style analysis might actually be adopted by listeners:
for a start, the data presented in 2.3 requires an analysis of this sort; additionally, some
neuro-physiological evidence seems to support it too (Hawkins and Smith, 2001).
A second difficulty is given by the non-formalised, non-exhaustive nature of FPA
descriptions. As already noted, an information-processing system can be characterised
at several levels (Marr, 1982). A computational model of an information-processing
system requires at least 1) an explicit statement about the task to be carried out by
the system, 2) the development of representational devices and procedures to carry out
the simulation. In an FPA analysis, neither aspect is usually dealt with in great detail.
A computational model which adopts an FPA-style approach, however, should tackle
both aspects explicitly. The discussion of this issue will form the core of chapter 7.
There I will show that a Bayesian perspective of the kind adopted in Norris & McQueen
(2008) represents an elegant solution to the explicit formulation of the computational
task, and that the same probabilistic framework underlying it also allows us to make use
of representational devices of various kinds towards the same task, thus preserving the
spirit of a Firthian analysis.
Albeit not mainstream, the Firthian approach to the analysis of spoken language
has found its way into more recent linguistic accounts. Among these, we might recall
Declarative Phonology (Coleman, 1998) and the work of Local and colleagues (Kelly
and Local, 1989; Ogden, 1999; Local, 2003). Much of this theoretical work has been
applied to models of speech production and implemented in speech synthesis systems,
first with YorkTalk (Coleman, 1990) and later with ProSynth (Ogden et al., 2000).
Polysp, a descriptive model of human speech recognition by Hawkins and Smith, is
largely based on Firthian principles (Hawkins and Smith, 2001; Hawkins, 2003). One
account that, despite not having explicit connections to FPA, nonetheless shares some of
its features is Jusczyk’s WRAPSA model of speech recognition development (Jusczyk
1993; 2000). Automatic speech recognition systems that, despite being quite different from each
other, might represent suitable tools for the implementation of an FPA-style approach
include graphical models for ASR (Bartels and Bilmes, 2010) and Leuven’s template-
based speech recognizer (De Wachter et al., 2007).
3.2 Central assumptions
The theoretical framework that shapes the computational model of the sound-to-grammar
mapping introduced in the next chapters is based on three central assumptions:
1. the goals to be achieved and the computations to be performed in speech recogni-
tion, as well as the representation and processing mechanisms recruited, crucially
depend on the task a listener is facing, and on the environment in which the task
occurs (3.2.1);
2. whatever the task and the environment, the human speech recognition system
behaves optimally with respect to them (3.2.2);
3. internal representations of acoustic patterns are distinct from the linguistic features
associated with them (3.2.3).
The following sections will better qualify, and provide evidence for, these claims.
3.2.1 Specificity of task and environment
In a rational analysis perspective, the definition of optimality, and the analysis itself,
depend crucially on the task that the cognitive system is facing (3.1.1). This means that
the structure of the information processing system being recruited to accomplish the
task might be substantially different depending on the task faced by the listener, and
on the specific characteristics of the environment in which the task occurs (Hawkins and
Nguyen, 2004; Norris and Kinoshita, 2008).
Continuous spoken word recognition has been, and still is, at the core of most modelling
efforts both in psycholinguistics and engineering (Pisoni and Levi, 2007; Baker et al.,
2009). In a natural environment, however, spoken word recognition as usually intended is
the main goal of just one kind of task: dictation, i.e. the derivation of a lawful sequence
of written words from an acoustic input.
Continuous spoken word recognition per se is thus the primary object of enquiry, both
from a psychological and an engineering perspective, only if the task to be explained and
simulated corresponds to dictation. It cannot, however, be automatically assumed as a
primary goal for all tasks which involve the recognition of continuous speech. According
to the specific task and environment, the role played by spoken word recognition is greater
or smaller: in some cases it constitutes a necessary intermediate goal, along with other
goals; in other cases it plays an auxiliary, perhaps marginal role. An example supporting
this argument is presented by Hawkins and Smith (2001). Most speakers of English
possess a wide variety of expressions for conveying the meaning of “I don’t know”. Each
of these varieties, however, is perceived as appropriate only in a specific environment
and conveys different kinds of semantic and pragmatic information. These varieties
might range from a usually rude I... do... not... know to a rather stylized intonation
and rhythm configuration, with very weak segmental articulation, that signals that
the speaker is not very engaged in the conversation: [ə̃ə̃ə̃]. This very context-specific
acoustic pattern, because of its uniqueness, does not necessarily require a familiar listener
to recognize the sequence of words “I”, “don't” and “know” as their main goal.
For any task that differs from dictation, the necessity and importance of spoken word
recognition as a goal should thus constitute a hypothesis to be tested in its own right,
and assessed by experiment. Just as spoken word recognition appears to be central to
some of the tasks which involve listening, and less important to others, so the recognition
of grammatical structures of other kinds should not be expected to differ in this respect.
Even in the case of an easily characterisable task like dictation, the nature of the
environment may vary: the kinds of acoustic patterns that one might expect to encounter,
and the type and number of linguistic structures that one might need to recruit, will be
very different if dictation involves writing down some telephone number or address, as
opposed e.g. to writing down a dictation passage at school, or having a business decision
dictated with the intention of writing a letter.
Firthian Prosodic Analysis is a convenient tool to envisage the recruitment of different
language structures based on the task and environment in which speech recognition op-
erates. As already noted, in FPA, linguistic structures are organised into self-contained,
albeit interacting subsystems, and particular linguistic contrasts are triggered only by
context-specific phonological contrasts, or prosodies, and their sometimes complex acous-
tic manifestations (3.1.3). By carefully considering task and environment, we can de-
velop computational models of speech recognition which are limited enough in scope
to make the most out of detailed phonetic descriptions, linguistic analyses, behavioural
experiments, and simulations. We can then integrate various models, always considering
carefully the respective tasks and environments and, if necessary, revising the models in
order to accommodate any emerging interactions. By postulating an optimal beha-
viour of listeners with respect to the task and environment, as envisaged by a rational
analysis (3.1.1), and by exploiting the Bayesian framework through the probabilistic in-
terpretation of hypotheses and combinations thereof (3.1.2), we have a principled way to
express the models formally, to implement them, and to perform this kind of integration.
3.2.2 Optimal behaviour
When defining the task a listener is facing, and characterising the environment in which
it is performed, we need to specify what it means for a listener to behave optimally
with respect to task and environment (3.1.1). Norris and McQueen (2008) give two
examples of task-specific optimal behaviour: in tasks requiring speeded decisions, it
would amount to “making the fastest decision possible while achieving a given level
of accuracy”; whereas in tasks which require a response based on a fixed amount of
perceptual evidence, it could be defined as “selecting the response (word) that is most
probable, given the available input” (p. 358).
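In the second case, the optimal decision rule can be stated compactly as maximum a posteriori selection over candidate words w, given the available input:

    ŵ = argmax_w P(w | input)

with the posterior computed, as in 3.1.2, from prior and likelihood.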
Whereas a definition of optimality in most experimental settings (where most aspects
of task and environment are strictly controlled) is fairly trivial, a formal definition of
what is optimal behaviour in more natural contexts becomes quite challenging. Let us
recall the already introduced example of dictation, and particularly of having an address
dictated on the phone. If part of the environment in which the task is performed is a very
expensive international call, and the listener is the caller, she might accept a lower level
of confidence in the correctness of the address, with the aim of keeping the conversation
as short as possible, and hence paying less. This in turn should be weighed against other
contextual factors, such as how wealthy (and stingy) the person is, how important it is
for her to get the address right, and whether she knows that she can double-check the
address at a later point with an online mapping service. Each of these factors could be
important in determining what constitutes an optimal strategy for achieving the task of
recognising and writing down the address and, consequently, how speech recognition will
be performed (degree of attention to the subtleties of the acoustic signal, reliance upon
previous knowledge, degree of adaptation to the voice of the speaker, success measures
for the task). A complex optimality criterion like this is largely dependent on factors that
are not easily observable, and which are hard to capture in a simple, general-purpose
model of speech recognition. This seems a compelling reason to develop computational
models that are at first small in scope, and are then gradually expanded and connected
to include new combinations of tasks and environments, thus enabling what Luce and
McLennan call “cumulative progress” in the understanding of human speech recognition
(Luce and McLennan, 2005).
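Purely as an illustration of how such a criterion might begin to be formalised (the quantities and their values are hypothetical), the caller's dilemma can be framed as a running cost-benefit test in Python:

    def worth_listening_longer(delta_p, cost_error, cost_per_second):
        # One more second of (expensive) call time is rational only if
        # the expected reduction in the cost of getting the address
        # wrong exceeds the price of that second.
        return delta_p * cost_error > cost_per_second

    # E.g. a 2% expected gain in accuracy, a cost of 10 units for a
    # wrong address, and a call charge of 0.5 units per second:
    print(worth_listening_longer(0.02, 10.0, 0.5))   # False

A serious treatment would of course need to estimate these quantities and fold in the other contextual factors just listed; the point is only that optimality is relative to such a cost structure.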
3.2.3 Auditory patterns and linguistic features
Finding a way to account gracefully for both generalisation properties and preservation
of detail in human speech recognition is arguably one of the most prominent topics
among researchers in the field (Luce and McLennan, 2005; Pisoni and Levi, 2007). The
discussion about variability and invariance in chapter 2 showed how this issue has been
tackled in the past.
Early theoretical accounts based on a beads-on-a-string approach (2.2.3) placed the
burden of this conversion, from variable signal to invariant categories, almost exclusively
on the phonemic level. This simplification
was also adopted by many subsequent psycholinguistic models. A few theories and
models conceded the role of privileged unit of analysis of this conversion to the word,
as was the case for the original Cohort theory (Marslen-Wilson and Welsh, 1978). All
those accounts, however, tended to identify the unit of analysis, rather than the units.
As already discussed, such a rigid interpretation of human speech recognition fails to
account for numerous phenomena, such as: the role played in recognition by indexical
factors (2.3.1); physiological data about the nature of auditory representations along the
auditory nerve and in the brain (2.3.3); and behavioural data about the role of grammar
(2.3.2).
At the other end of the spectrum, some psycholinguistic accounts tried to do away
completely with abstract representations by holding all exemplars and the phonetic de-
tail they carry in memory, and envisaging recognition as an analogical process (e.g.
Goldinger, 1998). While accounts of this latter type can do justice to some data, es-
pecially those regarding the influence of speaker identity on word recognition (Nygaard
et al., 1994), they fail to account for the combinatorial and generalisation properties of
human language.
The task- and environment-specific, optimal-behaviour approach adopted in this thesis,
as outlined in the previous sections, does not encourage any kind of general-purpose
account about the nature of mental representations, let alone their hardware implement-
ation in the brain. Despite not having a strong position about the specific nature of
mental representations, the theoretical framework adopted here assumes some form of
distinction between the internal representations of auditory patterns and the linguistic
and indexical features associated with them. For example, the internal representation
of a single auditory pattern could be recalled to provide at once information about a
non-canonical acoustic realisation of a specific lexical item, the sex of the speaker
associated with that particular realisation, and her identity. Once again, we believe that
the whole picture can only emerge after the careful analysis and modelling of several
specific computational tasks, their environments and constraints. Since there are many
possible combinations of tasks, environments and constraints, the picture will certainly
be a complex one. Since, on the other hand, the concepts of task- and
environment-specificity and of optimal behaviour are transversal and build upon the same principles,
we should expect a certain amount of convergence at the representation level.
4 Assessment:
Perceptual-magnet effect
4.1 Motivation
In this chapter I introduce a perceptual phenomenon known as the perceptual-magnet
effect (PME), and describe an account of it from the literature, which is based on a
rational analysis (3.1.1). While, as already mentioned, rational accounts like the one
presented here are gaining popularity, and this model in particular offers an elegant
explanation of PME for somewhat artificial datasets, I make the case for the use of
more “natural” data in the development of such accounts, to prevent a twofold risk: on
one hand, that of equating acoustic features with phonological categories (3.2.3); on the
other hand, that of concentrating too much on the computational aspects without any
specification of the representational and processing devices involved. Recent efforts to
include these aspects as well in rational models (see e.g. Sanborn et al., 2010) should indeed
be welcomed. The increasing convergence of HSR and ASR methods (Scharenborg et al.,
2005) is also highly beneficial in this respect. The model of prefix prediction developed
in chapter 7, which builds upon the theoretical framework of chapter 3 and on the
tools presented in chapter 5, tries to be as explicit as possible in dealing with both the
computational and representational aspects.
4.1.1 The perceptual-magnet effect
Perceptual-magnet effect (henceforth PME) is a term that has become common in psy-
chological literature since it was introduced for the first time by Kuhl (1991). It is used
in order to describe the shrinkage of perceptual space, manifested as reduced
discrimination, around vowels and liquids whose quality listeners consider prototypical (i.e., good).
According to PME, two sounds which are separated by a certain acoustic distance are
less easily discriminable when they are in the proximity of good exemplars (prototypes)
of the category they belong to.
[Figure 4.1.1 here: stimuli 0 to 9 plotted as feature values in acoustic space (top) and in perceptual space (bottom).]
Figure 4.1.1: Illustration of PME for equally spaced stimuli in one-dimensional acoustic
space (top) and corresponding representation in perceptual space (bottom).
Stimuli closer to the prototype (stimulus 0) are attracted more, and thus are
less discriminable from neighbouring stimuli.
Consider for example figure 4.1.1. It represents a series of 9 stimuli which vary along
one acoustic dimension, say F2 frequency at the steady state of /u:/ in British English.
The top vector (circles) shows the stimuli in acoustic space, where there is a constant
increase in F2 mean frequency: the stimuli are thus equally spaced. The bottom vector
(squares), by contrast, shows the stimuli as PME would predict they are perceived by a
native listener of BE: under the influence of the category’s prototype (stimulus 0 in the
figure), which might correspond, in this case, to the mean F2 frequency of all the /u:/
vowels that the listener has heard before when listening to other BE speakers, stimuli
which are closer to the prototype in acoustic space tend to be squeezed in perceptual
space, whereas stimuli farther away from the prototype in acoustic space are much more
clearly distinguishable from each other in perceptual space.
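Anticipating the rational model discussed in section 4.2, this kind of warping can be sketched in Python: stimuli are treated as noisy realisations of intended targets drawn from Gaussian categories, and the percept is the posterior mean of the target. The parameter values are illustrative, and two categories are used (rather than the single prototype of figure 4.1.1) so that the non-uniform warping becomes visible:

    import numpy as np

    def gauss(x, mu, var):
        # Gaussian density, used for the posterior over categories.
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def perceived(S, mus, var_c, var_s, weights):
        # Posterior mean of the intended target given stimulus S, for a
        # mixture of Gaussian categories with means mus, category
        # variance var_c and perceptual-noise variance var_s.
        post = np.array([w * gauss(S, m, var_c + var_s)
                         for w, m in zip(weights, mus)])
        post /= post.sum(axis=0)
        shrunk = np.array([(var_c * S + var_s * m) / (var_c + var_s)
                           for m in mus])
        return (post * shrunk).sum(axis=0)

    S = np.linspace(0.0, 8.0, 9)   # equally spaced stimuli
    p = perceived(S, mus=[0.0, 8.0], var_c=1.0, var_s=1.0,
                  weights=[0.5, 0.5])
    print(np.diff(p))   # spacing shrinks near the prototypes (0 and 8)
                        # and stretches near the category boundary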
Questions about the existence and nature of this kind of perceptual warping have given
rise to a lively debate among scholars, particularly during the second half of the 1990s.
The major points of criticism raised by those who are not convinced about the existence of
a PME in real speech address the experimental methods used in order to assess the PME
and its generalisability. With respect to experimental methodology, sceptics pointed out
that in almost all cases investigators elicited judgements about isolated, synthetic sounds,
which varied along one or two parameters at most (usually F1 and/or F2). One example
of this kind of critique can be found in Lotto et al. (1998) (with interesting follow-
ups in Guenther, 2000 and Lotto, 2000). The second argument brought forward by
PME opponents concerned the difficulty of generalising PME across sounds, sound classes
and languages. Although several studies (mainly conducted by Kuhl, Iverson and co-
workers) postulate a PME for some language-specific vowel phonemes (e.g. American
English /i:/: Iverson and Kuhl, 1995; German /i/: Diesch et al., 1999; Swedish /y/:
Kuhl, 1992) and liquids (American English /r/ and /l/: Iverson and Kuhl, 1996, Iverson
et al., 2003), other studies did not find any such effect, and thus questioned at least its
generalisability (several Australian English vowels: Thyer et al., 2000; American English
/i:/: Lotto et al., 1998, Frieda et al., 1999 and Lively and Pisoni, 1997). Most authors
who do not agree with a PME analysis explain experimental results that seem to support
it in terms of the more classical categorical perception (Liberman et al., 1957). In other
classes of sounds, most notably stops and fricatives, listeners’ perception seems to be
strictly categorical. Explaining vowel data with categorical perception has the advantage
of providing a unified account for both consonants and vowels.
Most of the experimental work done on the PME starts from the assumption that
prototypes are the best instance, as judged by listeners of a specific language, of a
certain phoneme. In the next section I describe a study by Barrett-Jones and Hawkins
(Barrett, 1997; Hawkins and Barrett Jones, 2004) that challenges this assumption by
showing how contextual information affects the way listeners perceive prototypes, and
ultimately brings to the foreground the question about the nature of linguistic units.
4.1.2 Context-dependent PME
Barrett-Jones and Hawkins (Barrett, 1997; Hawkins and Barrett Jones, 2004) investigated
the nature of prototypes in human speech recognition.
In her thesis, Barrett-Jones wanted to test whether context sensitivity, in terms of
allophonic variation, would affect PME. If the PME were found to be context-dependent,
this would have had some implications regarding the nature of the phonological units of
representation of speech sounds. She tested this hypothesis by eliciting goodness ratings
and similarity judgements from listeners for the Southern British English vowel /u:/
in three different allophonic contexts: isolation (/u:/ ), preceding lateral (/lu:/ ) and
preceding palatal glide (/ju:/ ). These three syllables also happen to be (pseudo-) words
in SBE (ooh!, Lou/loo, you), a fact which gives an added degree of naturalness to the
laboratory experiments.
After having recorded 50 tokens for each of the three monosyllables from a 29-year-old
male speaker of SBE, Barrett-Jones analysed them in terms of F1, F2 and F3 frequency
at the beginning and at the end of the vocalic portion. The measurements were taken
as outlined in Barrett-Jones (1997, p. 59), and served to synthesise stimuli
for the perceptual experiments. As, unsurprisingly, F2 onset appeared to be the cue that best
distinguished the three allophones from each other, stimuli were synthesised by varying
this parameter systematically (from 800 Hz to 1600 Hz), with minimal variation over
the remaining parameters, which was necessary due to naturalness concerns (see Barrett-
Jones’ thesis for further details). In a first experiment Barrett-Jones asked subjects (8,
different for each monosyllable) to give comparative judgements about which one was
the better exemplar for all possible pairs of stimuli of the same monosyllable (excluding
X-X pairs, thus giving 9 × 9 − 9 = 72 pairs). She then assigned one point to each of the
“winning” stimuli. In giving their judgements, subjects were asked to focus on the vowel.
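The scoring scheme can be sketched as follows; the judgement function is a hypothetical stand-in for a real subject (here, arbitrarily, stimulus 4 is treated as the best exemplar):

    from itertools import permutations

    stimuli = range(9)
    scores = {s: 0 for s in stimuli}

    def preferred(a, b):
        # Stand-in for a subject's comparative judgement of a pair.
        return a if abs(a - 4) < abs(b - 4) else b

    # Every ordered pair of distinct stimuli (9 * 9 - 9 = 72 pairs)
    # yields one judgement; the winner of each pair earns a point.
    for a, b in permutations(stimuli, 2):
        scores[preferred(a, b)] += 1

    print(max(scores, key=scores.get))   # the candidate prototype: 4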
This experiment showed that the most-preferred (or prototypes, P) and rarely-preferred
(or non-prototypes, NP) tokens differed for each of the three contexts. A further outcome
of this experiment was that individual differences were indeed quite large. This observation
led Barrett-Jones to have ten randomly-selected subjects repeat the experiment for all
three contexts. A plot of these listeners’ judgements based on her data is shown in
figure 4.1.2. These data suggest that, while absolute inter-speaker values for
Ps and NPs can still contain some useful information, attention to individual values
and relative distances is very important for a proper understanding of the
phenomenon. The data presented here will be described at the end of section 4.3.1 and
in the simulations following it.
In a subsequent experiment Barrett-Jones used the individually-measured Ps and NPs
in order to test for differences in discriminability, which could have offered evidence for
PME. The measure of discriminability adopted was d’ (d-prime: Green and Swets,
1966). The results showed that discrimination was consistently worse around Ps. As a
last, crucial step in order to ascertain the plausibility of context-dependent perceptual
magnets, Barrett-Jones performed another discrimination test in which the same subject
was asked to give same/different judgements for two distinct blocks of stimuli. The first
block tested for discriminability around the synthesised, subject-specific prototype (as
previously found through comparative judgements) for one of the three contexts, whereas
the second block tested for discriminability around a “candidate” prototype which took
An Introduction to Statistical Inference and Its Applications.pdf
 
Classification System for Impedance Spectra
Classification System for Impedance SpectraClassification System for Impedance Spectra
Classification System for Impedance Spectra
 
tutorial.pdf
tutorial.pdftutorial.pdf
tutorial.pdf
 
Electrónica digital: Logicsim
Electrónica digital: LogicsimElectrónica digital: Logicsim
Electrónica digital: Logicsim
 
probabilidades.pdf
probabilidades.pdfprobabilidades.pdf
probabilidades.pdf
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAP
 
Clustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory EClustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory E
 
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsIntroduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
 

Recently uploaded

Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 

Recently uploaded (20)

Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 

From sound to grammar: theory, representations and a computational model

Binary classification tasks 76
6.1 Data . . . 76
6.2 Simulation A: relevance vector machine . . . 77
6.2.1 Motivation . . . 77
6.2.2 Research questions . . . 78
6.2.3 Method . . . 78
6.2.4 Results . . . 79
6.3 Simulation B: auditory primal sketch . . . 79
6.3.1 Motivation . . . 79
6.3.2 Research questions . . . 81
6.3.3 Method . . . 81
6.3.4 Results . . . 81
6.4 Simulation C: cochlear model . . . 82
6.4.1 Motivation . . . 82
6.4.2 Research questions . . . 83
6.4.3 Method . . . 83
6.4.4 Results . . . 83
6.5 Discussion . . . 84

7 Model:
Predicting prefixes 86
7.1 Motivation . . . 86
7.1.1 Acoustic cues . . . 87
7.1.2 Behavioural evidence . . . 88
7.1.2.1 Increased word identification in noise . . . 88
7.1.2.2 Predictive looks at target images . . . 90
7.2 Computations . . . 93
7.2.1 Goal . . . 93
7.2.2 Environment . . . 94
7.2.3 Constraints . . . 96
7.2.4 A formal model . . . 97
7.3 Processes and representations . . . 101
7.3.1 Fine-tuned learned pattern . . . 101
7.3.2 Prefix-like prosody . . . 102
7.3.3 Other model components . . . 103

8 Simulation:
Linking the computational model to a behavioural experiment 104
8.1 Motivation . . . 104
8.2 Implementation . . . 105
8.2.1 Overview . . . 107
8.2.2 Segmentation . . . 109
8.2.3 Feature extraction and concatenation . . . 111
8.2.4 Training . . . 113
8.2.5 Recognition . . . 114
8.3 Input . . . 116
8.3.1 Dataset . . . 116
8.3.2 Parameter choices . . . 116
8.3.2.1 Segmentation . . . 116
8.3.2.2 Feature extraction . . . 117
8.3.2.3 Training . . . 119
8.4 Method . . . 119
8.5 Results . . . 124
8.6 Discussion . . . 128

9 Conclusion:
Main contributions 130

Bibliography 131
List of Figures

4.1.1 Illustration of PME for equally spaced stimuli in one-dimensional acoustic space (top) and corresponding representation in perceptual space (bottom). Stimuli closer to the prototype (stimulus 0) are attracted more, and thus are less discriminable from neighbouring stimuli. . . . 37

4.1.2 Listeners' individual Ps (circles, joined by continuous lines) and NPs (squares, joined by dashed lines) for the three allophonic contexts (F2 variation). Data from Barrett (1997). Each line joining values of Ps and NPs across subjects shows the great individual variability in terms of absolute values. Despite the variability, the values of Ps and NPs for each listener tend to spread over all available acoustic space. This is discussed in greater detail at the end of section 4.3.1. . . . 41

4.2.1 Behaviour of the Feldman and Griffiths (2007) model in the case of one category (left) and multiple categories (right). . . . 44

4.3.1 Histogram plots for F2 onset values of, respectively, /u:/, /lu:/ and /ju:/. . . . 48

4.3.2 A sample plot of S (circles) and E[T|S] (squares) for a continuum of stimuli varying along the F2′ axis. Solid lines show p(c|S) for /u:/, /lu:/ and /ju:/ (from left to right respectively), while dotted lines show the probability density function (multiplied by 100 for visibility) for each category. . . . 50
4.3.3 The measures of displacement (top left), warping (bottom left), and identification (right, solid curves) for an idealised subject in the case of three categories (from left to right: /u:/, /lu:/, /ju:/). Category prior distributions based on prototypes are indicated in the right pane by the dotted lines. . . . 51

4.3.4 Individual results for two subjects (left: S1, right: S2) for PME simulations with constant category variance (7234) and three levels of noise σ²S: 1000 (top), 5000 (middle) and 10000 (bottom). For an explanation of the plots see figure 4.3.2 and the text in this section. . . . 53

4.3.5 Individual results for two subjects (left: S9, right: S10) for PME simulations with constant category variance (7234) and three levels of noise σ²S: 1000 (top), 5000 (middle) and 10000 (bottom). For an explanation of the plots see figure 4.3.2 and the text in this section. . . . 54

4.3.6 Individual results for the first six subjects (top: S1, S2; middle: S3, S4; bottom: S5, S6) for PME simulations with constant category variance (7234) and the highest level of noise (σ²S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section. . . . 55

4.3.7 Individual results for the last four subjects (top: S7, S8; bottom: S9, S10) for PME simulations with constant category variance (7234) and the highest level of noise (σ²S = 10000). For an explanation of the plots see figure 4.3.2 and the text in this section. . . . 56

5.1.1 Rhythmogram for the word instability. From top to bottom: spectrogram, waveform and rhythmogram (event and prominence detection). . . . 63

5.1.2 Main processing stages to produce a rhythmogram (word: instability). From top to bottom: waveform, hair cell model output (activation in the auditory nerve), modulation spectrogram (multiresolution amplitude modulation), rhythmogram (event and prominence detection). . . . 64
5.2.1 F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. . . . 70

5.2.2 Binary RVM classifiers: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Grey stars represent relevance vectors retained by the models. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5. Left panel: categories /u:/ (white circles) vs. non-/u:/ (black triangles). Right panel: categories /ju:/ (black triangles) vs. non-/ju:/ (white circles). . . . 71

5.2.3 Binary RVM classifier: F1 and F2 onset values for [u:] from /u:/, /lu:/ and /ju:/. Categories /lu:/ (black triangles) vs. non-/lu:/ (white circles). Grey stars represent relevance vectors retained by the model. The black dotted line represents evaluation of the RVM decision function at category membership probability = 0.5. . . . 72

5.2.4 Categories /lu:/ vs. non-/lu:/. Compare to figure 5.2.3. Left panel: two instances from /lu:/ with very low F1 values have been assigned to the /lu:/ category. Right panel: two previously correctly classified instances from /lu:/ with very low F1 values have been assigned to the non-/lu:/ category. . . . 74

5.2.5 Categories /lu:/ vs. non-/lu:/. Compare to figure 5.2.3. Five instances from the non-/lu:/ category with very high F2 values have been assigned to the competing category to simulate an upper threshold. . . . 75

6.2.1 Simulation A: classification accuracy and sparsity of RVM and SVM. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits. . . . 80
6.3.1 Simulation B: classification accuracy and sparsity of APS vs. energy. Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits. . . . 82

6.4.1 Simulation C: classification accuracy and sparsity of APS with cochlear model (CM) vs. APS without cochlear model (NCM). Top: area under the curve (AUC): accuracy. Bottom: number of decision vectors (DV): sparsity. Each of the S1...S5 bar charts represents a model trained on a single speaker. All values averaged over 5 train/test splits. . . . 84

7.1.1 Spectrograms showing acoustic differences between mistimes (true prefix, top) and mistakes (pseudo-prefix, bottom) in the context of the same utterance (I'd be surprised if Tess mistimes/mistakes it). See section 7.1.1 for details. From Smith et al. (2012). . . . 88

7.2.1 A graphical model of prefix prediction. See section 7.2.4 for an explanation. . . . 98

8.2.1 Components of the model introduced in chapter 7 that were implemented for the simulation presented in this chapter (solid lines). . . . 106

8.2.2 Overview of the model implementation's architecture. See section 8.2.1 for details. Thin lines on oscillograms represent acoustic chunks of increasing length. . . . 108

8.2.3 The segmentation, feature extraction and feature concatenation processes as implemented. See sections 8.2.2 and 8.2.3 for explanation. . . . 110

8.2.4 Resampling procedures in the feature extraction process. See section 8.2.3 for details. . . . 112

8.2.5 A schematic representation of the training procedure. See section 8.2.4 for explanation. . . . 113
8.2.6 A schematic representation of the recognition procedure. See section 8.2.5 for details. . . . 115

8.4.1 A sample plot showing curves of proportion of looks to targets (solid lines) and competitors (dashed lines) for the match (grey) and mismatch (black) conditions. Data from Hawkins et al. (in prep). . . . 121

8.4.2 Eye-tracking results for mis/dis from Hawkins et al. (in prep) in terms of proportion of looks to targets (and competitors) for group M1 (left) and group M2 (right). Group M1 was chosen for comparison with model output. See text for explanation. . . . 122

8.4.3 A plot showing bias looks to targets involving a true prefix when listening to either a true (grey line) or pseudo (black line) prefix for the M1 group. . . . 123

8.5.1 RVM model output for the three kinds of feature vectors: APS, MFCC and APS+MFCC. The left panels show average true prefix class probabilities for input tokens of true prefixes (grey line) and pseudo prefixes (black line). The right panels show the number of relevance vectors (RV) retained and the area under the ROC curve for each model step. . . . 125
Abstract

Marco A. Piccolino-Boniforti
From sound to grammar: theory, representations and a computational model

This thesis contributes to the investigation of the sound-to-grammar mapping by developing a computational model in which complex acoustic patterns can be represented conveniently, and exploited for simulating the prediction of English prefixes by human listeners.

The model is rooted in the principles of rational analysis and Firthian prosodic analysis, and formulated in Bayesian terms. It is based on three core theoretical assumptions: first, that the goals to be achieved and the computations to be performed in speech recognition, as well as the representation and processing mechanisms recruited, crucially depend on the task a listener is facing, and on the environment in which the task occurs. Second, that whatever the task and the environment, the human speech recognition system behaves optimally with respect to them. Third, that internal representations of acoustic patterns are distinct from the linguistic categories associated with them.

The representational level exploits several tools and findings from the fields of machine learning and signal processing, and interprets them in the context of human speech recognition. Because of their suitability for the modelling task at hand, two tools are dealt with in particular: the relevance vector machine (Tipping, 2001), which is capable of simulating the formation of linguistic categories from complex acoustic spaces, and the auditory primal sketch (Todd, 1994), which is capable of extracting the multi-dimensional features of the acoustic signal that are connected to prominence and rhythm, and of representing them in an integrated fashion. Model components based on these tools are designed, implemented and evaluated.

The implemented model, which accepts recordings of real speech as input, is compared in a simulation with the qualitative results of an eye-tracking experiment. The comparison provides useful insights about model behaviour, which are discussed.

Throughout the thesis, a clear distinction is drawn between the computational, representational and implementation devices adopted for model specification.
Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed 80,000 words, including footnotes, references and appendices, but excluding bibliographies, as required by the Degree Committee of the Faculty of Modern and Medieval Languages.
Acknowledgements

This research was funded by an ESR fellowship of the EU MRTN-035561 research training network Sound to Sense. I am particularly grateful to my supervisor and coordinator of Sound to Sense, Sarah Hawkins, who inspired me with her passion for interdisciplinary research, encouraged me during particularly hard times, challenged me intellectually and ultimately made this opportunity of professional and personal growth possible. I am also very thankful to my advisor, Dennis Norris, who was always very approachable and supportive, and from whom I tried to grasp the gifts of sharp thinking and clarity of expression.

The interdisciplinary nature of my project required me to gather knowledge in many areas. I greatly benefited from the workshops and discussions with many senior researchers in Sound to Sense, in particular Guy Brown, Richard Ogden and Martin Cooke, who also welcomed me for a research stay. I am also grateful for the discussions and fun time with fellow early stage researchers, and particularly to Bogdan Ludusan for his contribution to my work and to Meghan Clayards for sharing her analyses. I also give thanks to Rachel Baker for sharing data and analyses, and to my colleagues at the phonetics lab and Linguistics department for fostering a positive, supportive and fun environment.

Finally, I would never have managed to accomplish this daunting task without the loving support of my family, my girlfriend Silvia, my colleagues Marco and Sergio, and the many wonderful friendships that I was blessed with during my stay in Cambridge and back home.
1 Introduction: From sound to grammar

1.1 From sound to grammar

When we listen to someone speaking, e.g. during a telephone conversation, we can pay attention to a number of different things: the words they are saying, their accent, their sex, their age, their mood, their physical and even mental condition. We extract this wealth of information from a single source (the speaker), often simultaneously. Despite the fact that we might occasionally get some of this information wrong, in most cases, even in the presence of noise, we succeed in a task whose "inner workings" turn out to be quite complex to understand.

Not only can we extract different kinds of information from the same person: we can also extract the same kind of information from different persons. So, for example, we are able to recognise one and the same word even when it is pronounced by two people of different age, sex or geographical origin; or to recognise two individuals as female despite differences in the words they are saying, their voice quality or their pitch.

This is possible because the recognition of speech relies on a subtle relationship between variability and invariance. Speech researchers have been trying to shed more light onto this complex relationship for some 60 years now (Jusczyk and Luce, 2002). So far, however, many of the questions aimed at a better understanding of it are still in need of an adequate answer (Luce and McLennan, 2005).
Some of these questions concern the relationship between acoustic patterns and grammatical function, in short the sound-to-grammar mapping. While most models of spoken word recognition to date postulate an obligatory stage of phonemic analysis as the only "beneficiary" of acoustic information, beyond which recognition becomes a matter of pattern matching on combinations of symbols, an increasing body of experimental data suggests that acoustic patterns can be informative and drive recognition well beyond the phonemic level of analysis. So, for example, a complex acoustic pattern can be a direct cue to a grammatical category such as a morpheme (see e.g. Baker, 2008).

This thesis contributes to the investigation of the sound-to-grammar mapping by developing a computational model in which complex acoustic patterns can be represented conveniently, and exploited for simulating the prediction of specific grammatical features by human listeners.

The computational model described here is based on the following central theoretical assumptions about human speech recognition. These, which require some explanation as to their rationale, are illustrated in greater detail in chapter 3:

1. the specific characteristics of the recognition process are strictly connected to the particular task in the context of which speech recognition happens (3.2.1);
2. the recognition process can be interpreted as a problem of optimal decision making while reasoning under uncertainty (3.2.2), as sketched below;
3. there is a distinction in memory between the representation of acoustic patterns and the linguistic features associated with them (3.2.3).
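Assumption 2 is the Bayesian core of the model. As a minimal sketch of what "optimal decision making under uncertainty" means here (an illustration only, not the implementation developed later in the thesis): a listener who receives a noisy one-dimensional acoustic cue can combine Gaussian category likelihoods with prior expectations via Bayes' theorem, and choose the category with the highest posterior probability. The category labels echo those used in chapter 4, but all numerical values below are invented.

```python
import math

# Hypothetical Gaussian categories over a one-dimensional acoustic cue
# (e.g. F2 onset in Hz). All means, variances and priors are invented
# for illustration.
categories = {
    "/u:/":  {"mean": 1100.0, "var": 7000.0, "prior": 1 / 3},
    "/lu:/": {"mean": 1400.0, "var": 7000.0, "prior": 1 / 3},
    "/ju:/": {"mean": 1800.0, "var": 7000.0, "prior": 1 / 3},
}

def gaussian_pdf(x, mean, var):
    """Likelihood p(S | c) of cue value x under one category."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(cue):
    """Bayes' theorem: p(c | S) is proportional to p(S | c) p(c)."""
    joint = {name: c["prior"] * gaussian_pdf(cue, c["mean"], c["var"])
             for name, c in categories.items()}
    total = sum(joint.values())
    return {name: v / total for name, v in joint.items()}

# The optimal decision selects the category with the highest posterior.
probs = posterior(1350.0)
print(max(probs, key=probs.get), {k: round(v, 3) for k, v in probs.items()})
```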
These assumptions impose important constraints on the fundamental properties of the representational and processing tools and techniques used for modelling, which will be introduced in chapter 5.

An important contribution of this thesis to the analysis of the sound-to-grammar mapping lies in making explicit connections between findings and methods from various fields (most notably: theoretical linguistics, experimental phonetics, experimental psychology, machine learning and signal processing) towards the common goal of developing a computational model. Another important contribution is the development of an implemented architecture that supports running simulations with real speech, and comparing the results with behavioural data from human listeners.

1.2 Thesis outline

Chapter 2 - Background: Variability and invariance

An investigation of the relationship between sound and grammar is necessarily concerned with the broader issue of variability and invariance in speech recognition. I first introduce the study of this issue (2.1), and show how researchers in human speech recognition have dealt with it in the past (2.2). I then describe some issues from the behavioural, linguistic, neuro-physiological and engineering perspectives that challenge these traditional approaches (2.3) and motivate the development of improved theories and computational models.

Chapter 3 - Theoretical framework: A rational prosodic analysis

I first introduce a theoretical framework for the study of the sound-to-grammar mapping that is based on Rational Analysis (3.1.1), Bayesian statistics (3.1.2) and Firthian Prosodic Analysis (3.1.3). I then discuss the central theoretical assumptions that build the foundations of a computational model for the sound-to-grammar mapping (3.2): 1) a proper characterisation of speech recognition should account for the specific task that is pursued by listeners, and for the environment in which the task is performed (3.2.1); 2) human speech recognition can be cast as a problem of optimal decision making while reasoning under uncertainty (3.2.2); 3) acoustic patterns and linguistic categories are not the same thing, and they should not be conflated in models of human speech recognition (3.2.3).
Chapter 4 - Assessment: Perceptual-magnet effect

Modelling the perceptual-magnet effect (PME) helps investigate the mapping between acoustic information and linguistic categories. After introducing the PME, I consider work suggesting that the behaviour of listeners can be accounted for by assuming context-dependent prototypes (4.1), rather than phonemic categories. Although a recently proposed rational, Bayesian model of the PME (4.2) elegantly explains the behaviour of listeners in the case of very simplified data, the simulations I present (4.3) show that the existence of context-dependent prototypes poses important challenges for the way in which phonological and grammatical categories are represented in most current psycholinguistic models.
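For concreteness, the core computation of the Feldman and Griffiths model referenced in 4.2 can be sketched as follows. Under the model, a stimulus S is a category sample T corrupted by Gaussian noise, so the optimal reconstruction of T given one category is a weighted average of S and the category mean; with several categories, the single-category estimates are mixed in proportion to p(c|S), which is what produces the warping towards prototypes plotted in figure 4.3.2. A minimal sketch, borrowing the constant category variance (7234) and one of the noise levels (5000) from the simulation settings in the chapter 4 figure captions; the category means and uniform priors are invented.

```python
import math

def expected_target(s, categories, noise_var):
    """Multi-category E[T|S] in the spirit of Feldman and Griffiths (2007):
    single-category posterior means, mixed by the category posteriors p(c|S)."""
    # p(S | c): category variance and noise variance add up.
    liks = [c["prior"] * math.exp(-(s - c["mean"]) ** 2 / (2 * (c["var"] + noise_var)))
            / math.sqrt(2 * math.pi * (c["var"] + noise_var))
            for c in categories]
    total = sum(liks)
    estimate = 0.0
    for c, lik in zip(categories, liks):
        # Single-category optimal estimate: weighted average of S and the mean.
        e_t_c = (c["var"] * s + noise_var * c["mean"]) / (c["var"] + noise_var)
        estimate += (lik / total) * e_t_c
    return estimate

# Invented means for three categories along the cue axis.
cats = [{"mean": m, "var": 7234.0, "prior": 1 / 3} for m in (1100.0, 1400.0, 1800.0)]
for s in (1050.0, 1250.0, 1600.0):
    print(s, "->", round(expected_target(s, cats, noise_var=5000.0), 1))
```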
Chapter 5 - Representations: Auditory processes and linguistic categories

I introduce some modelling tools and techniques from the fields of machine learning and signal processing which are compatible with the theoretical principles outlined in chapter 3, and well suited to representing complex auditory patterns and the associated linguistic categories. Because of their suitability for the implementation of the model developed in chapter 7, two kinds of representations are dealt with. For the representation of auditory processes (5.1), I first introduce a cochlear model (5.1.1) whose output is fed to the auditory primal sketch (5.1.2). The auditory primal sketch is capable of extracting the multi-dimensional features of the acoustic signal that are connected to prominence and rhythm, and of representing them in an integrated fashion. For the representation of linguistic categories associated with complex auditory patterns I introduce the relevance vector machine (5.2), a sparse Bayesian machine learning technique based on the concept of prototypical exemplars.

Chapter 6 - Evaluation: Binary classification tasks

I design and implement model components based on the modelling tools described in chapter 5, in order to test their suitability for inclusion in the computational model described in the next chapter, and their relative advantages over other established modelling techniques and tools. The implemented components are evaluated by means of probabilistic binary classification tasks. The dataset used is first described (6.1). In the first simulation, a model component based on the relevance vector machine is evaluated against one based on the support vector machine, a modelling technique which is more widespread but seems less suited to the simulation of aspects of human speech recognition (6.2). In the second simulation, a model component based on the auditory primal sketch is evaluated against a model that just picks up the energy envelope of the signal (6.3). Finally, in the third simulation, a model component based on the auditory primal sketch without a cochlear model is compared to an otherwise identical model in which the cochlear model is included (6.4). The general outcomes of the evaluation are then briefly discussed (6.5).
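The two measures reported in all three simulations, accuracy as area under the ROC curve (AUC) and sparsity as the number of decision vectors retained, can be illustrated with a small sketch. Since scikit-learn ships a support vector machine but no relevance vector machine, the sketch uses an SVM stand-in and counts its support vectors; for an RVM, the analogous count would be the retained relevance vectors. The data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-in for two-dimensional acoustic feature vectors of two categories.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # probabilistic binary classification

print("AUC (accuracy):", round(roc_auc_score(y_te, scores), 3))
print("decision vectors (sparsity):", len(clf.support_))
```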
Chapter 7 - Model: Predicting prefixes

I develop a computational model of prefix prediction for British English in which it is assumed that listeners, by analysing fine-tuned, learned auditory patterns in the proper prosodic and grammatical context, can set prefix prediction as an intermediate task in order to fulfil higher-level goals. The model is first motivated (7.1) in terms of acoustic analyses (7.1.1) and behavioural experiments (7.1.2). The computational aspects of the model are then dealt with in terms of goal (7.2.1), environment (7.2.2) and constraints (7.2.3). The model is then given a formal description with the aid of a Bayesian network (7.2.4). Those model components that are implemented in the simulation are also described in terms of processes and representations (7.3).

Chapter 8 - Simulation: Linking the computational model to a behavioural experiment

I implement those model components that enable a qualitative comparison between model output and the output of the eye-tracking experiment described in section 7.1.2.2. The motivation for the simulation is first explained (8.1). I then give a detailed account of the system architecture devised for implementing the model, with the various stages that it involves (8.2). I further describe the dataset and model parameters used in the simulation (8.3), and explain the method used for the qualitative comparison (8.4). I finally present the results of the comparison (8.5) and discuss them (8.6).

Chapter 9 - Conclusion: Main contributions

The main contributions of this thesis to the investigation of the sound-to-grammar mapping, and more generally to the study of human speech recognition, are summarised.
2 Background: Variability and invariance

2.1 The study of variability and invariance

Our understanding of how speech recognition works on a neuro-physiological basis is, at present, quite fragmentary (see Young, 2008, for a recent review). This, however, is not a major obstacle to its characterisation on a functional or formal basis (see Marr, 1982), and available neuro-physiological insights can be put to good use for constraining hypotheses about the functional and formal properties of speech recognition, insofar as they contribute to explaining variability and invariance. Functional and formal characterisations of speech recognition constitute a great deal of the work conducted in the last six decades in fields as diverse as psychology, linguistics and statistical pattern recognition.

When considering the role of variability and invariance, a researcher can maintain one among several positions between two hypothetical extremes. One extreme would consider variability as being always inherently "bad", because it is random or irrelevant. The other extreme would consider it as being always inherently "good", because it is systematic and informative. Evidently neither extreme is defensible, since both would imply, for listeners, the inability to make any kind of useful generalisation. Experimental evidence, reviewed in the following sections, shows that in fact some variability is random and some is systematic, some is irrelevant and some is informative. Determining what is irrelevant and what is informative, however, depends on the exact characterisation of the task listeners are faced with (function), and thus of the mapping process that enables them to accomplish the task (form). Advances in speech recognition research can be characterised precisely as refinements in knowledge of the mapping process, triggered by observations about function and form.
2.2 Traditional approaches

2.2.1 Minimal invariant units

First published in 1951, Jakobson, Fant and Halle's Preliminaries to Speech Analysis (Jakobson et al., 1951) represented an innovative blend of theoretical and experimental work on the investigation of the properties of speech. That study quickly became popular, thanks also to an international conference on speech communication held at MIT in the following year (Perkell and Klatt, 1986, Preface). The book's influence on the kind of questions asked by researchers in speech communication has been long-lasting.

The primary goal of the Preliminaries was to propose questions about the nature of the "ultimate discrete entities of language", i.e. about linguistic form. What made it particularly interesting to practitioners of several disciplines, as compared to other linguistic investigations with the same goal (Twaddell, 1935; Trubetzkoy, 1939; Jones, 1950), was the great attention paid to the articulatory, acoustic and perceptual correlates of the units they identified as the ultimate discrete components of language: distinctive features. A distinctive feature was characterised as a choice faced by a listener between two polar qualities of the same category (see Jakobson et al., 1951, p. 3).
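To make the notion concrete: on this view a phoneme is a bundle of such binary choices, and two phonemes contrast whenever at least one feature takes opposite polar values. A minimal sketch with a deliberately simplified, partly hypothetical feature set (the inventory and values in the Preliminaries differ):

```python
# Phonemes as bundles of binary distinctive features; True/False stand for
# the two polar qualities. The feature set is simplified for illustration
# and does not reproduce Jakobson et al.'s inventory.
features = {
    "p": {"vocalic": False, "voiced": False, "nasal": False},
    "b": {"vocalic": False, "voiced": True,  "nasal": False},
    "m": {"vocalic": False, "voiced": True,  "nasal": True},
}

def contrasts(a, b):
    """The features whose polar values distinguish two phonemes."""
    return [f for f in features[a] if features[a][f] != features[b][f]]

print(contrasts("p", "b"))  # ['voiced']
print(contrasts("b", "m"))  # ['nasal']
```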
In their tentative sketch, the authors gave a systematic account of many articulatory and acoustic correlates of distinctive features. The development of the sound spectrograph at Bell Laboratories (Potter et al., 1947) was invaluable in the determination of the acoustic correlates (Fant, 2004). As to the articulatory aspects, the analysis was influenced by the work of Chiba and Kajiyama (Chiba and Kajiyama, 1941; Fant, 2004). Jakobson et al.'s definition, however, was based on a perceptual criterion: a "choice faced by a listener". Even the categories they adopted followed a terminology based on perception, despite the explicit acknowledgement that their auditory observations were not based on a systematic experimental survey.

It was only with the development of the Pattern Playback machine at the Haskins Laboratories (Cooper et al., 1951) that more detailed knowledge about the mapping between acoustic stimuli and perceptual judgements on the identification of (synthetic) phonemes and syllables, intended as bundles of distinctive features, could be gathered. The Pattern Playback was a synthesiser in which a tone wheel modulated a light source at about 50 harmonically related frequencies. A transparent or reflective spectrogram, usually hand-painted, filtered specific portions of this harmonic source, which were passed to a photo-tube and converted to sound. The great novelty introduced by the Pattern Playback consisted in the flexibility it gave researchers in the manipulation of sounds. This simple yet powerful technique was key to the discovery of fundamental perceptual phenomena, such as categorical perception (Liberman et al., 1957), the role of spectral loci (Delattre et al., 1955), and that of the main spectral prominences of the transient relative to the vocalic part (Liberman et al., 1952) for the perception of the occlusive in CV syllables. A limitation of this method consisted in its unsuitability for the faithful reproduction of aperiodic portions of the spectrogram.

The model of language developed in the Preliminaries followed a computational perspective that was explicitly formulated according to the principles of the newborn research field of information theory (Shannon, 1948). It was the intention of the authors to establish a codebook that could faithfully and efficiently represent the transmission of spoken messages. Distinctive features seemed the appropriate unit of analysis for this endeavour. Jakobson and colleagues pursued this goal by identifying and "stripping away" all acoustic variability in the speech signal that was considered redundant, while keeping those acoustic correlates that were deemed essential to the definition of the invariant units of analysis. The same purpose underlay the experimental work at the Haskins Labs (Liberman et al., 1952) and found its linguistic counterpart in structuralist approaches to the analysis of language, including previous work by Jakobson himself (Bloomfield, 1933; Jakobson, 1939; Harris, 1951).
Evidently, the notion of redundancy could only be elaborated with respect to a certain task, or functional criterion. All the work contained in the Preliminaries assumed that this task was a "test of the intelligibility of speech, [where] an English speaking announcer pronounces isolated root words (bill, put, fig, etc.), and an English speaking listener endeavors to recognize them correctly" (Jakobson et al., 1951, p. 1). This was most likely a choice dictated by experimental and analytical constraints. However, such a task has little to do with actual speech communication. The authors themselves, in the introduction to the book, highlighted the difference between the two tasks very clearly. Fant offered a critical retrospective of the featural approach, coming to the conclusion that "the hunt for maximum economy often leads to solutions that impair the phonetic reality of features" and that "a simple one-to-one relationship between phonetic events and phonological entities is exceptional" (Fant, 1986, p. 482).

While many at MIT and Haskins were exploring the question of minimal units, researchers at the Harvard Psycho-Acoustic Laboratory and elsewhere were realising the importance of context (intended both as the whole acoustic neighbourhood and the number of possible lexical choices available to the listener) for the recognition of word stimuli, both in isolation and in relation to a whole sentence.

2.2.2 The role of context

Miller et al. (1951) found that 1) level of background noise, 2) number of lexical items to be considered, 3) word vs. non-word status and 4) syntactic/semantic context all had a great influence on the intelligibility scores of spoken stimuli: less background noise, a smaller number of available choices, word status and previous context improved recognition, with noise level and the number of available choices influencing the threshold of noise for intelligibility.
Ladefoged and Broadbent (1957) demonstrated the role of the preceding acoustic context for the identification of the vowel in one out of four possible monosyllabic, synthesised words in English. In their study, subjects listened to the carrier sentence Please say what this word is, synthesised with the Parametric Artificial Talker (Lawrence, 1953). Acoustic parameters in vowel formants were varied, as if the sentence were uttered by different talkers. The sentence was followed by an acoustic token, which was exactly the same across the different carrier sentences. Despite this fact, for example, 97% of the listeners recognised the token after one version of the carrier sentence as bit, whereas 92% of them recognised the same token after another version of the carrier as bet. This study was regarded as positive evidence for the theory of Joos (1948), according to which "the phonetic quality of a vowel [i.e. the acoustic correlates of a perceptual category] depends on the relationship between the formant frequencies for that vowel and the formant frequencies of other vowels pronounced by the same speaker" (Ladefoged and Broadbent, 1957, p. 99).
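Joos's relational account anticipates what is now commonly called speaker normalisation. As a minimal sketch of one standard operationalisation (not a procedure from the studies above, and with invented formant values): z-scoring formant measurements within each talker locates a vowel relative to that talker's own formant range, so that tokens with different absolute frequencies can receive the same relative value.

```python
import statistics

# Invented F1 values (Hz) for the same set of vowel tokens produced by two
# hypothetical talkers with different vocal tract sizes.
talkers = {
    "A": [300.0, 320.0, 340.0, 700.0, 720.0],
    "B": [380.0, 400.0, 420.0, 860.0, 880.0],
}

def normalise(values):
    """Z-score formants within one talker: vowel quality as position
    relative to that talker's own formant distribution."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [round((v - mu) / sd, 2) for v in values]

# Different absolute frequencies, near-identical relative positions.
for talker, f1 in talkers.items():
    print(talker, normalise(f1))
```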
In a different study, Miller and Selfridge (1950) investigated the role of the units of analysis from an information-theoretic point of view. While Jakobson et al.'s focus was on the codebook (that is, on representational issues concerning the identity of the invariant units), the main interest of Miller and Selfridge was rather the code, intended as a concatenation of atomic units. For the authors' purposes, the units could have been phonemes, words, or any other element amenable to sequential representation. In that experimental study, the authors investigated the role of what they named "verbal context" in the recall of spoken passages of text by listeners. To this purpose, they devised so-called nth order approximations to the English language, i.e. statistical models of a language based on the knowledge of the relative frequency of successive units (phones, syllables, words), up to the nth unit.

To implement these models, Miller and Selfridge presented a sequence of n words to an English speaker and asked her/him to complete the sequence with one more word. The n+1 sequence was then presented to another speaker, who completed it with a further word. The completed sequences were then recorded by a male speaker and played to listeners. All listeners heard sequences of various lengths (10, 20, 30 and 50 words) and various orders of approximation, and were asked, after having listened to each sequence, to write down as many words in the correct order as they could possibly remember.

Miller and Selfridge's main findings were that both higher order of approximation and shortness of the sequence correlated with higher recall scores, with the two factors interacting: higher order approximations seemed to help recall especially with longer sequences.
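In modern terms, Miller and Selfridge's nth order approximations are n-gram language models: each next word is sampled given a fixed number of preceding words, with probabilities given by observed relative frequencies. A minimal sketch of the same idea in code, with a toy corpus invented for illustration:

```python
import random
from collections import defaultdict

# Toy corpus standing in for the accumulated human completions.
corpus = ("the listener heard the word and the listener wrote "
          "the word down and the word was heard again").split()

def build_model(corpus, context):
    """Map each context-word history to the words observed to follow it;
    repetitions encode relative frequency."""
    model = defaultdict(list)
    for i in range(len(corpus) - context):
        model[tuple(corpus[i:i + context])].append(corpus[i + context])
    return model

def generate(model, history, length, seed=0):
    """Extend the history one sampled word at a time."""
    random.seed(seed)
    out = list(history)
    for _ in range(length):
        key = tuple(out[-len(history):])
        if key not in model:   # dead end: this history was never observed
            break
        out.append(random.choice(model[key]))
    return " ".join(out)

model = build_model(corpus, context=1)   # one word of context
print(generate(model, ("the",), length=8))
```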
Studies like these, albeit very different in methodology and immediate goals, all acknowledged the fact that the amount of contextual information available to listeners strongly influences the recognition of the message at various levels of analysis.

2.2.3 Beads on a string

The identification of minimal invariant units and the investigation of their concatenative properties constitute the so-called beads-on-a-string view of speech recognition. This view has built the basis for most accounts of human and automatic speech recognition until today (Luce and McLennan, 2005). Current psycholinguistic models of spoken word recognition that rely on this principle include Trace (McClelland and Elman, 1986), Shortlist A (Norris, 1994), and PARSYN (Luce et al., 2000). It is also an integral part of the widespread Hidden Markov Model (HMM) approach to automatic speech recognition (Baker, 1975; Jelinek et al., 1975). As noted above, its origins have connections to information theory and structural linguistics.

Crucially for our discussion of the relationship between sound and grammar, this approach postulates 1) a mapping between the acoustic signal and a sequence of discrete, abstract units, all of which belong to the same level of linguistic analysis (in most cases either distinctive-featural or phonemic); 2) a concatenation of these units as input to further levels of analysis (e.g. the word level).

Early accounts of human speech recognition that adopted this approach were Fry (1959) and Halle & Stevens (1962). The account offered by Fry constituted the basis for one of the first automatic speech recognisers ever built, the speech typewriter described in Denes (1959). The accounts of human speech recognition that adopt the beads-on-a-string view differ among themselves in many respects, particularly as to the mapping function used to obtain the sequence of symbolic units from the acoustic stream; however, they all share the assumption that acoustic information only serves the purpose of guiding the recognition of minimal units, and does not intervene directly in the determination of other kinds of linguistic structure. The examples presented in section 2.3 suggest that this might not be the case: rather, human listeners seem to rely on acoustic cues to gather information about other kinds of linguistic structure as well, including grammatical categories.
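The HMM formulation cited above makes the two postulates explicit: the hidden states are the discrete units, arranged left to right like beads, and the acoustic signal is modelled as emissions from that state sequence. A minimal sketch of the forward algorithm over an invented two-state, three-symbol toy model (illustrative parameters only, not those of any particular recogniser):

```python
import numpy as np

# Toy HMM: two phoneme-like hidden states emitting one of three quantised
# acoustic symbols. All parameters are invented for illustration.
init = np.array([0.9, 0.1])           # p(first state)   states: /b/, /i/
trans = np.array([[0.6, 0.4],         # p(next state | current state)
                  [0.0, 1.0]])        # left-to-right: no going back
emit = np.array([[0.7, 0.2, 0.1],     # p(symbol | state /b/)
                 [0.1, 0.3, 0.6]])    # p(symbol | state /i/)

def forward(observations):
    """Forward algorithm: p(observations | model), summing over all
    hidden beads-on-a-string state sequences."""
    alpha = init * emit[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return float(alpha.sum())

print(forward([0, 1, 2, 2]))   # likelihood of a short symbol sequence
```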
inverse relationship between word-identification performance and the number of talkers. Findings like these were confirmed in later studies, such as Mullennix et al. (1989).

Several studies highlighted the influence of speaking rate on the recognition of phonemes for which the rate of temporal change had been found to be relevant. Liberman and colleagues (Liberman et al., 1956; Miller and Liberman, 1979) demonstrated this for the [w]/[b] distinction. Verbrugge and Shankweiler (1977) cross-spliced syllables from contexts at fast speaking rates into contexts at slower speaking rates. This affected vowel identification: for example, subjects misidentified [a] as [ʌ]. Investigating the role of variation in talker, speech rate and amplitude in recognition memory, Bradlow et al. (1999) found that variation in both talker and speech rate influenced recognition judgements (old vs. new), while amplitude did not. In the case of an old word, however, listeners were reliably able to indicate whether it was repeated by the same talker, at the same rate or at the same amplitude. This hints at the fact that some kinds of information (amplitude, in this case) are nonetheless stored even when they do not influence recognition judgements.

Thus, there is evidence for some interaction between indexical variation and phoneme or word recognition. This has been the main motivation for the development of alternative approaches to speech recognition. Some approaches postulate the retention of a very large number of encountered patterns (exemplars) in long-term memory (Goldinger, 1998). Such a formulation accounts for generalisation effects by postulating analogical processes between stored and new exemplars at the time of recognition. Other approaches postulate the co-existence of multiple units of linguistic analysis, triggered according to the task at hand and/or the phonological structure of a specific language (3.1.3). These two perspectives are not necessarily at odds, as they mainly differ at the representational level (5.2). The next section presents selected examples in which a direct sound-to-grammar mapping seems necessary to explain the
behaviour of listeners.

2.3.2 Linguistic factors

Indexical variation is not the only kind of variability that is inadequately accommodated in mainstream models of human speech recognition: acoustic variability due to linguistic factors other than phonemic identity must also be accounted for in such models, since there is ample evidence that listeners are sensitive to it. Being able to account for, and exploit, this kind of variability is the main motivation for the research presented in this thesis.

A long research tradition acknowledges the fact that, in many languages, acoustic cues not directly mappable onto phonemes or distinctive features play an important role in the perceptual identification of morphological, lexical and syntactic boundaries. For example, in English specific segment- and syllable-duration relationships may signal an upcoming pause or prosodic boundary, such as the end of an utterance (see e.g. Klatt, 1976). In several cases, different acoustic features of variable granularity, arranged into complex configurations, contribute together to the definition of linguistic structure, e.g. in signalling word segmentation (Smith, 2004, for English).

Other experiments show that listeners are also sensitive to subtle but systematic variations in acoustic parameters which are linked to differences in prosodic structure, which in turn are triggered by lexical differences. For example, Salverda et al. (2003), using the visual-world paradigm (Tanenhaus and Spivey-Knowlton, 1996) and cross-splicing, found that subjects were sensitive to acoustic differences, particularly in duration, due to the monosyllabic vs. polysyllabic nature of a word (e.g. ham- as in ham vs. hamster). Kemps et al. (2005a) arrived at similar conclusions for morphologically complex vs. morphologically simple words (e.g. Dutch singular/plural nouns: boek- as in boek [buk] vs. boeken [bukə]). Baker (2007a; 2008) investigated the perception of true vs. pseudo-prefixes in English, e.g. dis- as in distasteful (true, i.e. productive and with
clear compositional meaning) vs. distinctive (pseudo). In a fill-the-gap listening experiment in noise, she found that cross-splicing some true prefixes onto pseudo-prefixed stems, and vice versa, did indeed have a negative impact on recognition performance. Although there was an interaction with sentence focus (nuclear vs. post-nuclear stress on the accented syllable), some of the variation in this case should clearly be attributed to morphological differences.

These and many other findings suggest that models of human speech recognition should account for many more sources of variability than phonemic identity alone, and that these sources are not limited to indexical properties, but include prosodic structure as a manifestation of grammatical differences at various levels.

2.3.3 Auditory processes

One of the main limitations of most models of spoken word recognition is their reliance upon strings of segments (either features or phonemes) as input to the model (2.2.3). In addition to the increasing amount of behavioural evidence about the role of phonetic detail in speech recognition (2.3), psycho-acoustic and neuro-physiological studies also show that sound waves undergo substantial, and partly still unexplained, transformations along their journey through the auditory nerve and across the cerebral cortex. While a complete account of these transformations is both impossible and inappropriate in this context, it is still worthwhile to review the main findings documenting the way acoustic information is encoded during recognition. Some of these findings turn out to be informative for the design of processing stages in models (7.3) and for the understanding of the role of variability. While most neuro-physiological evidence about auditory mechanisms comes from laboratory animals rather than humans, it seems that some of these mechanisms, particularly those occurring at the auditory periphery, also apply to humans (Pickles, 2008). A comprehensive review of the findings of the last 25 years on the neural representation of speech can be found in Young (2008), which
constitutes the main information source for this section.

Neuro-physiological data shows that brains perform elaborate transformations on the input signal. Moreover, these transformations are not limited to feature extraction, but seem to suggest the formation of auditory objects as a response to specific behavioural needs. While the details of the representations in the auditory cortex still escape us, and acknowledging that, because of the interaction with the language areas, auditory areas in human brains might behave in even more complex ways than animal data suggests (see e.g. Zatorre and Gandour, 2008), neuro-physiological data provides an independent source of evidence for a direct relationship between complex acoustic patterns and meaning-bearing linguistic categories.

A major finding that seems to encompass, to different degrees, all levels of the neural representation of speech is the so-called tonotopic nature of the representation: in the auditory system, different frequency bands are analysed separately. This begins during the conversion from mechanical to neural signal, at the interface between the inner ear and the auditory nerve. Hair cells are arranged along the whole extension of the basilar membrane; because of this arrangement, the auditory-nerve fibres departing from them respond to specific frequencies, i.e. their discharge rates become high only in the presence of excitation falling within a certain frequency range. For this reason, many models have interpreted the basilar-membrane/hair-cell analysis of the signal as a bank of bandpass filters (Patterson et al., 1988). This approximation can be useful, but it is a gross simplification in several respects. First, frequency selectivity varies significantly with sound pressure level; second, auditory fibres undergo saturation effects; finally, complex interactions among auditory fibres give rise to inhibitory effects: excitation of fibres with a certain best frequency can suppress the excitatory levels of neighbouring fibres. Inhibitory mechanisms are still not fully understood, particularly when it comes to higher neural regions.
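As a concrete illustration of the filter-bank idealisation just discussed, the sketch below builds a bank of gammatone-style bandpass channels. The gammatone shape and the equivalent-rectangular-bandwidth (ERB) formula follow common published approximations (Glasberg and Moore's ERB scale); the snippet is a didactic sketch rather than a faithful cochlear model, and it omits exactly the non-linearities listed above (level-dependent selectivity, saturation, suppression).

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore approximation), Hz.
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, dur=0.05, order=4, b=1.019):
    # Impulse response of an order-4 gammatone filter centred at fc (Hz):
    # t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t)
    t = np.arange(int(dur * fs)) / fs
    g = (t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
         * np.cos(2 * np.pi * fc * t))
    return g / np.max(np.abs(g))

def filterbank(signal, fs, centre_freqs):
    # One linear bandpass channel per centre frequency ("tonotopy").
    return np.array([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone
channels = filterbank(tone, fs, [250, 500, 1000, 2000, 4000])
print(np.round(np.sqrt((channels ** 2).mean(axis=1)), 3))
# Channel energy peaks in the 1 kHz channel, as tonotopy predicts.
```

A more realistic cochlear model is introduced in section 5.1.1.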
Fibres in the auditory nerve can also be classified according to their dynamic range and thresholds, i.e. their activity span, from the lowest sound pressure levels at which discharge rates are observed to the levels at which they attain saturation. From a temporal perspective, the neural encoding at the auditory periphery displays effects such as stimulus-onset enhancement (a sudden increase in neural activity in response to the onset of a sound after silence) and inhibitory effects among successive sounds. There is also evidence of specialised mechanisms for extracting amplitude-modulation information (Joris et al., 2004). Despite these transformations and non-linearities, the transmission of information along the auditory nerve can be considered quite faithful to the original signal, and thus easily interpretable.

The same is not true higher up the auditory pathways. While data about these levels of representation is more fragmentary, it is still possible to determine some of their salient characteristics. Fibres of the auditory nerve terminate in the cochlear nucleus, which contains between five and ten neural subsystems operating in parallel. The neurons that make up this nucleus are of different kinds. Among the few studied, primary-like and chopper neurons display different responses to the input from the auditory nerve. Primary-like neurons behave similarly to neurons in the auditory nerve, i.e. they offer a response which is quite faithful and has high frequency resolution, as their main role is to transmit acoustic information to other centres of the brain for auditory localisation. Chopper neurons, on the other hand, seem to be more robust to ambient noise and to differences in sound pressure level. They achieve this by being sensitive to low pressure levels while at the same time having a mechanism to regulate their dynamic range in order to avoid saturation. This behaviour has led to the hypothesis that chopper neurons possess a switching mechanism that regulates their responses to input auditory-nerve fibres of various thresholds and dynamic ranges. Both primary-like and chopper neurons display higher gain levels than neurons in the auditory nerve. Combined with the tonotopic
architecture, this results in an improved spectral representation of prominent events like vowel formants.

The cochlear nucleus is one of the structures connected to the inferior colliculus. The inferior colliculus is quite distinctive in the kind of response to amplitude modulation that it provides. While neurons in lower areas of the auditory system provide a fairly straightforward representation of amplitude modulation, responses in the inferior colliculus are mostly transient, and they are observed particularly in relation to transient events in the input signal; conversely, acoustic portions representing steady states (e.g. the central parts of many vowels) are not accompanied by significant neural activity. This has been interpreted as a mechanism of perceptual enhancement for acoustic events like stop bursts relative to vocalic portions.

While aspects of tonotopic organisation are also observable in the auditory cortex, responses at that level are not as easy to correlate with inputs as they are in the auditory nerve, cochlear nucleus and inferior colliculus, despite the transformations undergone in these earlier stages. Young (2008) lists three reasons for this. First, in animals (e.g. marmosets), cortical neurons seem to be selective for sounds that are important for the species, like the vocalisations of conspecifics, as opposed to the same or similar sounds when perceived by other species. That is, cortical neurons seem to respond to sounds as meaningful objects, rather than to their bare spectral and temporal features. Second, despite the tonotopic organisation, neurons in the auditory cortex are highly adaptable, in the sense that their characteristic frequency can shift if a certain task demands it, sometimes only temporarily; moreover, their degree of response is also sensitive to stimulus frequency. Third, simple models based on the responses of cortical neurons to particular sets of stimuli characterised by similar spectro-temporal features do not seem to have high predictive power, suggesting that the way neurons respond to sounds is more
complex.

Thus, neuro-physiological evidence, too, strongly suggests that the mapping between acoustic patterns and linguistic units is a very complex one. While it is still impossible to model all these transformations, they should at least be acknowledged in models of speech recognition that try to give an account of representations, processes and the way these are implemented in the brain.

2.3.4 Automatic speech recognition

Moving beyond beads-on-a-string is not only a theoretical necessity imposed by the explanation of data like those presented in the previous sections. With constant developments in automatic speech recognition technology, researchers in that area have become increasingly aware of the intrinsic limitations of a classical HMM framework, in which context-free or context-dependent acoustic models of phones (or syllables) are the only interface between the acoustic signal and linguistic categories (Jurafsky and Martin, 2009). The main trigger of this awareness has been the issue of pronunciation variability, mostly understood as variability due to geographical or sociolinguistic factors. Traditionally, this issue has been tackled by explicitly listing several pronunciations for the same word in a dictionary. This solution, however, has proven unsatisfactory, particularly when dealing with spontaneous speech (Ostendorf, 1999). For this reason, many researchers have been considering alternative approaches (Baker et al., 2009).

The HMM framework has proven to be a flexible and powerful formalism for modelling many aspects of speech recognition. Yet its intrinsic limitations are well known, and constitute a major bottleneck for bridging the gap between human and machine performance (Ostendorf, 1999). Among these, the most relevant are the lack of an embedded mechanism for modelling event durations and the assumption of conditional independence between successive acoustic observations. A further limitation of standard HMM architectures is the blending of acoustic detail caused by representing the acoustic space via mixtures of Gaussians or other kinds of distributions based on summary statistics.
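The first two limitations can be read directly off the standard HMM factorisation. Writing O = o_1 ... o_T for the acoustic observations and Q = q_1 ... q_T for a state sequence, with b_q(.) the output distribution of state q and a a state's self-transition probability, the classical model assumes

\[
P(O \mid Q) = \prod_{t=1}^{T} b_{q_t}(o_t), \qquad
P(\mathrm{duration} = d) = a^{\,d-1}(1-a).
\]

Each frame depends only on the current state, and state occupancy is implicitly geometric, i.e. monotonically decreasing in d, which is a poor match for the roughly unimodal durations of real speech events.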
Many alternatives have been proposed to overcome these issues. Some, rather than discarding the HMM framework, try to enhance its capabilities (Ostendorf et al., 1996; Deng et al., 2006, II.B). Other proposals aim at a different characterisation of the recognition process. Among the latter, a few concentrate on modelling the articulatory aspects of speech, e.g. by trying to model vocal-tract dynamics (Deng et al., 2005, 2006). Such proposals are interesting in that they try to give a unified account of production and perception, along the lines of popular theories of speech recognition such as Liberman et al. (1967; 1985) and Fowler (1990). These theories, on the other hand, are controversial, and reliance upon production mechanisms is not necessary in order to account for many aspects of speech recognition (Jusczyk and Luce, 2002). From the point of view of the present discussion, it is thus more interesting to look at architectures that allow more freedom regarding the nature of the perceptual units involved in recognition, while remaining agnostic about the relationship between production and perception. Among these, two interesting implemented proposals are template-based systems (De Wachter et al., 2007; Demange and Van Compernolle, 2009; Maier and Moore, 2007) and graphical models for ASR (Bilmes, 2003; Bilmes and Bartels, 2005).
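To make the template-based alternative concrete, the sketch below implements the core of such a system: exemplars are stored as whole feature sequences and compared with the input by dynamic time warping (DTW). This is a minimal illustration of the matching idea under simple assumptions (Euclidean local cost, unit step weights), not the De Wachter et al. architecture, and the template store is a hypothetical stand-in.

```python
import numpy as np

def dtw(x, y):
    # Dynamic time warping distance between two feature sequences
    # (arrays of shape [frames, dims]), with Euclidean local cost.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognise(utterance, templates):
    # templates: {label: [exemplar sequences]} (hypothetical store).
    # Each stored exemplar keeps its full acoustic detail, so indexical
    # and fine phonetic variation is preserved rather than averaged away.
    return min(templates,
               key=lambda lab: min(dtw(utterance, ex) for ex in templates[lab]))
```

Unlike a Gaussian-mixture state, which summarises its training frames, the exemplar store retains every training sequence, trading memory and search cost for preserved detail.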
3 Theoretical framework:
A rational prosodic analysis

3.1 Analytic foundations

3.1.1 Rational analysis

A cognitive system can be described from several perspectives: for example, its purpose and the goals it strives to achieve; the mechanisms adopted to achieve those goals; and the physical properties that make those mechanisms work effectively. According to Marr (1982), the levels of explanation of any information-processing system are usually only loosely coupled: thus, it should be possible to describe a cognitive system from a particular perspective while only sketching the others. This assumption is indeed crucial for any endeavour that strives to model a relatively complex system.

Anderson (1990; 1991) acknowledges this independence principle as one of the foundations of what he termed a rational analysis of cognitive systems. Anderson's rational analysis assumes that a cognitive system has a purpose, which can be described by formally defining the task it has to achieve and the environment in which it operates. As the name implies, a rational analysis assumes that cognitive systems behave rationally in making decisions. In Anderson's terminology, "rational" means that the system, which is optimally adapted to its environment and to the task, makes use of all available information about the task and the environment to fulfil its goals.

While the characterisation of cognitive systems in terms of their purpose has been widely accepted, the concepts of optimality and rationality are seen by many as failing to account for behavioural data about irrational and non-optimal decision making
(Kahneman and Tversky, 1973; Kahneman et al., 1982; see Lopes, 1991 for a critical review). However, Chase et al. (1998) have argued that giving more relevance to the constraints that the environment imposes on the cognitive system, and to simple approximations to optimal solutions (a bounded rationality, as they call it), accommodates these discrepancies by helping to clarify what it means to be optimal for a particular system. Rational analyses have been developed for the characterisation of many aspects of perceptual and cognitive systems (Chater and Oaksford, 1999; Oaksford and Chater, 2008): from causal relations (Griffiths, 2005) to associative memory (Anderson and Schooler, 1991), from continuous speech recognition (Norris and McQueen, 2008) to category learning (Sanborn et al., 2010).

3.1.2 Bayes' theorem

An important advantage of a rational analysis over a mechanistic explanation of a cognitive system is that it can be readily expressed in formal terms using Bayes' theorem. This is particularly useful when the information about the environment and the task that is available to the cognitive system is uncertain or incomplete, as is the case for perceptual systems (2.1). In the analysis of speech recognition, the great variability found in acoustic patterns must be reconciled with the persistence of the linguistic and extra-linguistic categories identified (2.3). By adopting probabilistic reasoning, we can associate an ambiguous acoustic pattern with a set of competing linguistic structures (hypotheses) held with different degrees of confidence, and also update those confidence scores as soon as new, perhaps disambiguating acoustic evidence for or against a particular linguistic hypothesis becomes available. Bayesian principles (see e.g. Griffiths and Yuille, 2008) constitute a powerful tool for probabilistic reasoning and hypothesis testing.
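In its simplest form, and writing h for a linguistic hypothesis and A for the acoustic evidence observed so far (notation introduced here for exposition), Bayes' theorem scores each hypothesis by combining how well it predicts the evidence with how plausible it was beforehand:

\[
P(h \mid A) = \frac{P(A \mid h)\,P(h)}{\sum_{h'} P(A \mid h')\,P(h')}.
\]

Because the posterior P(h | A) can serve as the prior for the next piece of evidence, the theorem directly licenses the incremental updating of confidence scores described above.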
While in a frequentist approach hypotheses are evaluated exclusively on previously observed evidence, in the Bayesian framework a hypothesis can be given a prior probability, independently of the observed evidence. In the case of speech recognition, this means that hypotheses about linguistic categories can be constrained by many additional linguistic and non-linguistic factors, which we can loosely call "context". Bayesian probability theory has been at the core of HMM automatic speech recognition technology for more than twenty years (Jurafsky and Martin, 2009). More recently, and particularly after the work of Anderson, it has also gained great popularity for the modelling of cognitive systems (see e.g. Griffiths and Tenenbaum, 2006, and the early discussion in Watanabe, 1985). Scharenborg et al. (2005) give a unified account of human and automatic speech recognition, showing how describing the task of speech recognition as reasoning under uncertainty in a Bayesian setting helps to bridge the gap between modelling endeavours in ASR and HSR, despite differences at the implementation level (human brains vs. computers). Finally, Norris and McQueen (2008) show convincingly how a formulation of continuous spoken word recognition as a Bayesian problem of optimal decision making elegantly accounts for many effects that in other modelling frameworks would require special treatment.

For the particular purposes of this thesis, probabilistic Bayesian modelling offers several advantages over other kinds of statistical modelling. First, at the core of the Bayesian framework is a treatment of data and hypotheses in probabilistic terms. As already described (2.1), such a treatment is required by the very nature of the problem at hand (high degrees of random variability in the acoustic patterns; inherent ambiguity of certain linguistic structures). Second, it is desirable to constrain the scoring of hypotheses about linguistic categories by the broad "context" in which they operate because, as we will see (3.2), the linguistic categories recruited during recognition are assumed to be task- and environment-specific. Probabilities offer a convenient mechanism for doing so: frequency effects can easily be modelled with prior probabilities, and contextual effects by incorporating other sources of evidence. Finally, because of
the underlying probabilistic reasoning, Bayesian modelling can be applied equally well to various kinds of representations for data and hypotheses: atomic symbols, scalar values, discrete and continuous distributions, and complex structures like graphs. Particular kinds of Bayesian models offer additional advantages; those offered by sparse Bayesian models are discussed in section 5.2.

3.1.3 Firthian prosodic analysis

A beads-on-a-string view, in which speech is treated as a concatenation of homogeneous units, is not sufficient to account for human performance in the recognition of examples like those in 2.3.2. Those examples show that listeners' judgements are driven, to various degrees, by diverse cues that cannot be located on a single segment, or that cannot be related to short-term spectral properties. The discussion in 2.3.1 also pointed out that what is usually considered indexical variation in fact has a direct influence on recognition performance, and hence cannot be excluded from a comprehensive model of human speech recognition. While the urge to move beyond beads-on-a-string represents an element of relative novelty in psycholinguistic modelling (Luce and McLennan, 2005), in descriptive linguistics the question has been investigated extensively since at least the 1940s. Some of the accounts elaborated in that context, however, have remained fairly marginal, and less widespread than works that, on the issue of variability and invariance, were based on more "orthodox" views (e.g. Chomsky and Halle, 1968). One such framework provides particularly helpful insights in this respect: Firthian Prosodic Analysis, or simply Prosodic Analysis (Palmer, 1970; henceforth FPA), developed by J.R. Firth and his co-workers at the School of Oriental and African Studies in London (Firth, 1948). Its development was motivated by the unsuitability, according to Firthians, of classical methods of phonemic analysis (e.g. Pike, 1947) for the description of many regularities in languages. Firth
attributed the classical analyses, which considered the phonology of a language as a unitary system of phonemic contrasts (the beads-on-a-string of section 2.2.3), to the influence of Roman script, noting how other writing systems, based on different principles, were better suited to a more economical description of the languages they had been developed for. Firth mainly disputed the largely paradigmatic and mono-systemic nature of phonemic approaches.

Firth started by considering how certain acoustic patterns ('phonetic exponents' in FPA terminology) are more economically and profitably described by referring primarily to their collocation within a certain linguistic structure (their syntagmatic properties), rather than to their spectral similarity to other segments occurring in a different context (their paradigmatic properties). For example, in British English words like pat and tap, from an acoustic point of view there are potentially many more commonalities (e.g. in degree of aspiration, duration, intensity) between syllable-initial [p] and [t] on the one hand, and between syllable-final [p] and [t] on the other, than between the two [p]s or the two [t]s. Such commonalities are determined, in this specific case, by syllabic structure, and can thus be predicted quite independently of the actual segmental content. In this example, the syllable is a suitable context for the prediction of many phonological properties of the word, which in turn determine many of its observable acoustic patterns. Other properties, conversely, would require consideration of the wider context in which the word is embedded. This clear distinction between 'sounds' (segments in a traditional sense) and 'prosodies' (the properties of a given phonological context) allows one to dispense with the transformational rules that became one of the main points of interest in Chomsky and Halle's The Sound Pattern of English (Chomsky and Halle, 1968) and, under different formulations, of many subsequent generative approaches to phonological theory (e.g. Prince and Smolensky, 1993).
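The sound/prosody division lends itself to a simple structured representation. The toy sketch below, with invented feature names, attaches the aspiration and duration facts just described to syllable positions rather than to individual segment labels; an FPA-style analysis predicts the phonetic exponents from the structural slot, not from the segment symbol alone.

```python
# Toy FPA-style representation (illustrative only; feature names invented).
# Phonetic exponents attach to structural positions ("prosodies"), not to
# segment labels, so syllable-initial /p t/ pattern together against
# syllable-final /p t/.

SYLLABLE_PROSODIES = {
    "onset": {"aspirated": True, "relatively_long": True},
    "coda":  {"aspirated": False, "relatively_long": False},
}

def exponents(segment, position):
    # The same segment label receives different exponents depending on
    # its collocation in structure (its syntagmatic properties).
    return {"segment": segment, **SYLLABLE_PROSODIES[position]}

print(exponents("p", "onset"))  # pat: aspirated, long
print(exponents("p", "coda"))   # tap: unaspirated, short
```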
Particularly relevant for our investigation of the sound-to-grammar mapping is the nature of context within a Firthian analysis. Prosodies can be associated with linguistic units of all kinds. We might have prosodies which serve the purpose of delimiting a syllable or a word, but also prosodies that mark grammatical categories, like verb vs. noun (e.g. a 'stress' prosody in many bisyllabic English words, like re'bel vs. 'rebel), active vs. passive (e.g. a 'nasality' prosody in the Eritrean language Bilin; see Robins, 1970), or prefix vs. non-prefix (Ogden et al., 2000). In addition, some prosodies might be associated with aspects of speech that usually fall under the label of indexical variation (2.3.1): mood, gender, register, etc. Thus, in FPA terms, a language is a collection of interacting subsystems, rather than a monolithic, hierarchical system.

In FPA there is a clear distinction between phonological structure and its phonetic manifestations. A prosody, which is an aspect, or element, of phonological structure, is manifested at the acoustic level by phonetic exponents (acoustic patterns). That is, a prosody can be thought of as an invariant (and thus abstract) element associated with a particular linguistic context, realised acoustically by a co-occurrence, or relation, of acoustic features forming a consistent acoustic pattern. The linguistic context is of great importance for the definition of a prosody. A similar acoustic pattern appearing in two different linguistic contexts is not automatically considered to be associated with the same prosody: in such a case, acoustic similarity might have no relevance whatsoever from a phonological point of view. This view is at odds with the one presented in section 2.2.3.

While in most cases both a generative approach and FPA, albeit very differently, can adequately explain the same linguistic data, in some circumstances the FPA approach succeeds where a standard phonemic analysis is not straightforward. An example is the data of Hawkins and Nguyen (2004) on coda voicing. Examining pairs such as led and let, they found that, in addition to the well-known effects on vowel duration, in non-rhotic varieties of British English coda voicing also affects the duration of /l/, which is longer, and the F2 and spectral centre of gravity measured at the onset of /l/, which are mostly lower, compared with the voiceless condition. The influence of coda voicing on
the onset of /l/ cannot easily be linked to anticipatory co-articulation, and is thus not easily motivated without giving due weight to the broader linguistic structure. An FPA analysis accounts very naturally for this phonetic behaviour by interpreting coda voicing as a property of the whole syllable, which is thus manifested by several, possibly non-adjacent acoustic cues. In addition to the weight given to linguistic context, FPA's polysystematicity also gives a consistent explanation of linguistic contrasts which would otherwise seem rather opaque. An example is given by pairs of prefixed words that differ in historical origin (Hawkins and Smith, 2001). Unknown, unnatural and innate all contain prefixes which are monosyllabic and nasal. Furthermore, all three words bear primary stress on the second syllable. Despite these similarities, the first two words are rhythmically very different from the third, displaying a longer /n/. The difference is easily explained, however, if one considers the origin of the prefixes (Germanic in the first two cases, Latinate in the third) and thus postulates two co-existing linguistic subsystems.

Cases like these, together with the other examples (provided in section 2.3) of phonetic detail signalling prosodic boundaries through rich phonetic exponents, differences between morphologically simple and complex words, and prefixed vs. pseudo-prefixed words, suggest that in terms of explanatory power an FPA-style approach to the modelling of variability offers substantial advantages over a beads-on-a-string one. However, several issues are connected with its adoption. In the first place, one must not forget that FPA is a framework for linguistic analysis which makes no claim regarding the exact nature of the psychological processes and representations driving human speech recognition (Firth, 1948). This said, there seems to be at least some evidence suggesting that an FPA-style analysis might actually be adopted by listeners: for a start, the data presented in 2.3 require an analysis of this sort; additionally, some neuro-physiological evidence seems to support it too (Hawkins and Smith, 2001).

A second difficulty is the non-formalised, non-exhaustive nature of FPA descriptions. As already noted, an information-processing system can be characterised
at several levels (Marr, 1982). A computational model of an information-processing system requires at least 1) an explicit statement of the task to be carried out by the system, and 2) the development of representational devices and procedures to carry out the simulation. In an FPA analysis, neither aspect is usually dealt with in great detail. A computational model which adopts an FPA-style approach, however, should tackle both aspects explicitly. The discussion of this issue will form the core of chapter 7. There I will show that a Bayesian perspective of the kind adopted in Norris and McQueen (2008) represents an elegant solution to the explicit formulation of the computational task, and that the same underlying probabilistic framework also allows us to employ representational devices of various kinds towards that task, thus preserving the spirit of a Firthian analysis.

Albeit not mainstream, the Firthian approach to the analysis of spoken language has found its way into more recent linguistic accounts. Among these, we might recall Declarative Phonology (Coleman, 1998) and the work of Local and colleagues (Kelly and Local, 1989; Ogden, 1999; Local, 2003). Much of this theoretical work has been applied to models of speech production and implemented in speech synthesis systems, first with YorkTalk (Coleman, 1990) and later with ProSynth (Ogden et al., 2000). Polysp, a descriptive model of human speech recognition by Hawkins and Smith, is largely based on Firthian principles (Hawkins and Smith, 2001; Hawkins, 2003). One account that, despite having no explicit connection to FPA, nonetheless shares some of its features is Jusczyk's WRAPSA model of speech recognition development (Jusczyk, 1993; 2000). Automatic speech recognition systems that, despite being quite different from each other, might represent suitable tools for the implementation of an FPA-style approach include graphical models for ASR (Bartels and Bilmes, 2010) and Leuven's template-based speech recogniser (De Wachter et al., 2007).
3.2 Central assumptions

The theoretical framework that shapes the computational model of the sound-to-grammar mapping introduced in the next chapters is based on three central assumptions:

1. the goals to be achieved and the computations to be performed in speech recognition, as well as the representation and processing mechanisms recruited, crucially depend on the task a listener is facing and on the environment in which the task occurs (3.2.1);

2. whatever the task and the environment, the human speech recognition system behaves optimally with respect to them (3.2.2);

3. internal representations of acoustic patterns are distinct from the linguistic features associated with them (3.2.3).

The following sections qualify these claims and provide evidence for them.

3.2.1 Specificity of task and environment

From a rational-analysis perspective, the definition of optimality, and the analysis itself, depend crucially on the task that the cognitive system is facing (3.1.1). This means that the structure of the information-processing system recruited to accomplish the task might differ substantially depending on the task faced by the listener and on the specific characteristics of the environment in which the task occurs (Hawkins and Nguyen, 2004; Norris and Kinoshita, 2008). Continuous spoken word recognition has been, and still is, at the core of most modelling efforts both in psycholinguistics and in engineering (Pisoni and Levi, 2007; Baker et al., 2009). In a natural environment, however, spoken word recognition as usually intended is the main goal of just one kind of task: dictation, i.e. the derivation of a lawful sequence of written words from an acoustic input.
Continuous spoken word recognition per se is thus the primary object of enquiry, from both a psychological and an engineering perspective, only if the task to be explained and simulated corresponds to dictation. It cannot, however, be automatically assumed to be a primary goal of all tasks which involve the recognition of continuous speech. Depending on the specific task and environment, the role played by spoken word recognition is greater or smaller: in some cases it constitutes a necessary intermediate goal, along with other goals; in others it plays an auxiliary, perhaps marginal role. An example supporting this argument is presented by Hawkins and Smith (2001). Most speakers of English possess a wide variety of expressions for conveying the meaning of "I don't know". Each of these varieties, however, is perceived as appropriate only in a specific environment, and conveys different kinds of semantic and pragmatic information. They might range from a usually rude I... do... not... know to a rather stylised intonation and rhythm configuration with very weak segmental articulation, roughly a sequence of nasalised schwas ([ə̃ə̃ə̃]), which signals that the speaker is not very engaged in the conversation. This very context-specific acoustic pattern, because of its uniqueness, does not necessarily require a familiar listener to recognise the sequence of words "I", "don't" and "know" as their main goal. For any task that differs from dictation, the necessity and importance of spoken word recognition as a goal should thus constitute a hypothesis to be tested in its own right, and assessed by experiment. Just as spoken word recognition appears to be central to some of the tasks which involve listening, and less important to others, the recognition of grammatical structures of other kinds should not be expected to differ in this respect. Even in the case of an easily characterisable task like dictation, the nature of the environment may vary: the kinds of acoustic patterns that one might expect to encounter, and the type and number of linguistic structures that one might need to recruit, will be very different if dictation involves writing down a telephone number or address, as opposed, for example, to writing down a dictation passage at school, or taking down a dictated business decision with the intention of writing a letter.
Firthian Prosodic Analysis is a convenient tool for envisaging the recruitment of different language structures according to the task and environment in which speech recognition operates. As already noted, in FPA linguistic structures are organised into self-contained, albeit interacting, subsystems, and particular linguistic contrasts are triggered only by context-specific phonological contrasts, or prosodies, and their sometimes complex acoustic manifestations (3.1.3). By carefully considering task and environment, we can develop computational models of speech recognition which are limited enough in scope to make the most of detailed phonetic descriptions, linguistic analyses, behavioural experiments and simulations. We can then integrate the various models, always considering carefully the respective tasks and environments and, if necessary, revising the models to accommodate any emerging interactions. By postulating optimal behaviour of listeners with respect to the task and environment, as envisaged by a rational analysis (3.1.1), and by exploiting the Bayesian framework through the probabilistic interpretation of hypotheses and combinations thereof (3.1.2), we have a principled way to express the models formally, to implement them, and to perform this kind of integration.

3.2.2 Optimal behaviour

Having defined the task a listener is facing, and characterised the environment in which it is performed, we need to specify what it means for a listener to behave optimally with respect to them (3.1.1). Norris and McQueen (2008) give two examples of task-specific optimal behaviour: in tasks requiring speeded decisions, it would amount to "making the fastest decision possible while achieving a given level of accuracy"; whereas in tasks which require a response based on a fixed amount of perceptual evidence, it could be defined as "selecting the response (word) that is most probable, given the available input" (p. 358).
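The second of these definitions is simply the maximum a posteriori decision rule. Writing x for the available perceptual input and w for a candidate word (notation again chosen for exposition), the optimal listener selects

\[
\hat{w} = \arg\max_{w} P(w \mid x) = \arg\max_{w} P(x \mid w)\,P(w),
\]

where the second equality follows from Bayes' theorem (3.1.2), since the normalising denominator does not depend on w.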
Whereas a definition of optimality in most experimental settings (where most aspects of task and environment are strictly controlled) is fairly trivial, a formal definition of optimal behaviour in more natural contexts becomes quite challenging. Let us recall the earlier example of dictation, in particular of having an address dictated over the phone. If part of the environment in which the task is performed is a very expensive international call, and the listener is the caller, she might accept a lower level of confidence in the correctness of the address, with the aim of keeping the conversation as short as possible and hence paying less. This in turn will be weighed against other contextual factors, such as how wealthy (and stingy) the person is, how important it is for her to get the address right, and whether she knows that she can double-check the address later with an online mapping service. Each of these factors could be important in determining what constitutes an optimal strategy for achieving the task of recognising and writing down the address and, consequently, how speech recognition will be performed (degree of attention to the subtleties of the acoustic signal, reliance upon previous knowledge, degree of adaptation to the voice of the speaker, success measures for the task). A complex optimality criterion like this depends largely on factors that are not easily observable, and which are hard to capture in a simple, general-purpose model of speech recognition. This seems a compelling reason to develop computational models that are at first small in scope, and are then gradually expanded and connected to include new combinations of tasks and environments, thus enabling what Luce and McLennan call "cumulative progress" in the understanding of human speech recognition (Luce and McLennan, 2005).

3.2.3 Auditory patterns and linguistic features

Finding a way to account gracefully for both the generalisation properties and the preservation of detail in human speech recognition is arguably one of the most prominent topics among researchers in the field (Luce and McLennan, 2005; Pisoni and Levi, 2007). The discussion of variability and invariance in chapter 2 showed how this issue has been tackled in the past.
Early theoretical accounts based on a beads-on-a-string approach (2.2.3) placed the burden of this conversion almost exclusively on the phonemic level. This simplification was also adopted by many subsequent psycholinguistic models. A few theories and models conceded the role of privileged unit of analysis in this conversion to the word, as was the case for the original Cohort theory (Marslen-Wilson and Welsh, 1978). All those accounts, however, tended to identify the unit of analysis, rather than the units. As already discussed, such a rigid interpretation of human speech recognition fails to account for numerous phenomena, such as: the role played in recognition by indexical factors (2.3.1); physiological data about the nature of auditory representations along the auditory nerve and in the brain (2.3.3); and behavioural data about the role of grammar (2.3.2).

At the other end of the spectrum, some psycholinguistic accounts have tried to do away with abstract representations entirely, holding all exemplars, and the phonetic detail they carry, in memory, and envisaging recognition as an analogical process (e.g. Goldinger, 1998). While accounts of this latter type can do justice to some data, especially those regarding the influence of speaker identity on word recognition (Nygaard et al., 1994), they fail to account for the combinatorial and generalisation properties of human language.

The task- and environment-specific, optimal-behaviour approach adopted in this thesis, as outlined in the previous sections, does not encourage any kind of general-purpose account of the nature of mental representations, let alone of their hardware implementation in the brain. While not taking a strong position on the specific nature of mental representations, the theoretical framework adopted here does assume some form of distinction between the internal representations of auditory patterns and the linguistic and indexical features associated with them. For example, the internal representation of a single auditory pattern could be recalled to provide at once information about a non-canonical acoustic realisation of a specific lexical item, the sex of the speaker
associated with that particular realisation, and her identity. Once again, we believe that the whole picture can only emerge from the careful analysis and modelling of several specific computational tasks, their environments and their constraints. Since there are many possible combinations of tasks, environments and constraints, the picture will certainly be a complex one. Since, on the other hand, the concepts of task and environment specificity and of optimal behaviour are transversal and build upon the same principles, we should expect a certain amount of convergence at the representational level.
4 Assessment:
Perceptual-magnet effect

4.1 Motivation

In this chapter I introduce a perceptual phenomenon known as the perceptual-magnet effect (PME) and describe an account of it from the literature which is based on a rational analysis (3.1.1). While, as already mentioned, rational accounts like the one presented here are gaining popularity, and this model in particular offers an elegant explanation of the PME for somewhat artificial datasets, I make the case for the use of more "natural" data in the development of such accounts, to prevent a twofold risk: on the one hand, that of equating acoustic features with phonological categories (3.2.3); on the other, that of concentrating too much on the computational aspects without any specification of the representational and processing devices involved. Recent efforts to include these aspects in rational models as well (see e.g. Sanborn et al., 2010) should indeed be welcomed. The increasing convergence of HSR and ASR methods (Scharenborg et al., 2005) is also highly beneficial in this respect. The model of prefix prediction developed in chapter 7, which builds upon the theoretical framework of chapter 3 and on the tools presented in chapter 5, tries to be as explicit as possible in dealing with both the computational and the representational aspects.

4.1.1 The perceptual-magnet effect

Perceptual-magnet effect (henceforth PME) is a term that has become common in the psychological literature since it was first introduced by Kuhl (1991). It is used to describe the shrinkage of perceptual space, manifested as reduced
discrimination, around vowels and liquids whose quality listeners consider prototypical (i.e., good). According to the PME, two sounds separated by a certain acoustic distance are less easily discriminable when they are in the proximity of good exemplars (prototypes) of the category they belong to.

[Figure: stimuli 0 to 9 on a "Stimulus" axis; feature values in acoustic space (top, prototype marked) and in perceptual space (bottom).]
Figure 4.1.1: Illustration of the PME for equally spaced stimuli in a one-dimensional acoustic space (top) and the corresponding representation in perceptual space (bottom). Stimuli closer to the prototype (stimulus 0) are attracted more, and are thus less discriminable from neighbouring stimuli.

Consider for example figure 4.1.1. It represents a series of stimuli which vary along one acoustic dimension, say F2 frequency at the steady state of /u:/ in British English. The top vector (circles) shows the stimuli in acoustic space, where there is a constant increase in mean F2 frequency: the stimuli are thus equally spaced. The bottom vector (squares), by contrast, shows the stimuli as the PME would predict they are perceived by a native listener of BE: under the influence of the category's prototype (stimulus 0 in the figure), which might correspond, in this case, to the mean F2 frequency of all the /u:/ vowels that the listener has heard before from other BE speakers, stimuli closer to the prototype in acoustic space tend to be squeezed together in perceptual space, whereas stimuli farther from the prototype in acoustic space remain much more clearly distinguishable from each other in perceptual space.
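This warping has a compact Bayesian characterisation, anticipating the rational model discussed in section 4.2.1; the notation below is introduced for exposition, and only the simplest one-category case is shown. If target productions T of a category c are distributed as N(μ_c, σ_c²), and the stimulus S is a noisy percept of the target, S | T ~ N(T, σ_S²), then the optimal reconstruction of the target is the posterior mean

\[
E[T \mid S] = \frac{\sigma_c^2\, S + \sigma_S^2\, \mu_c}{\sigma_c^2 + \sigma_S^2},
\]

a weighted average that pulls every percept towards the category mean. In this one-category case the compression of perceptual space is uniform; the non-uniform, prototype-centred warping sketched in figure 4.1.1 emerges once uncertainty about category membership is taken into account, as in the multi-class treatment of section 4.2.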
Questions about the existence and nature of this kind of perceptual warping have given rise to a lively debate among scholars, particularly during the second half of the 1990s. The major points of criticism raised by those who are not convinced of the existence of a PME in real speech concern the experimental methods used to assess the PME, and its generalisability. With respect to experimental methodology, sceptics have pointed out that in almost all cases investigators elicited judgements about isolated, synthetic sounds, which varied along one or two parameters at most (usually F1 and/or F2). One example of this kind of critique can be found in Lotto et al. (1998) (with interesting follow-ups in Guenther, 2000 and Lotto, 2000). The second argument brought forward by PME opponents concerns the difficulty of generalising the PME across sounds, sound classes and languages. Although several studies (mainly conducted by Kuhl, Iverson and co-workers) postulate a PME for some language-specific vowel phonemes (e.g. American English /i:/: Iverson and Kuhl, 1995; German /i/: Diesch et al., 1999; Swedish /y/: Kuhl, 1992) and liquids (American English /r/ and /l/: Iverson and Kuhl, 1996; Iverson et al., 2003), other studies did not find any such effect, and thus questioned at least its generalisability (several Australian English vowels: Thyer et al., 2000; American English /i:/: Lotto et al., 1998, Frieda et al., 1999 and Lively and Pisoni, 1997). Most authors who do not agree with a PME analysis explain experimental results that seem to support it in terms of the more classical categorical perception (Liberman et al., 1957). In other classes of sounds, most notably stops and fricatives, listeners' perception seems to be
strictly categorical. Explaining vowel data in terms of categorical perception has the advantage of providing a unified account for both consonants and vowels.

Most of the experimental work done on the PME starts from the assumption that prototypes are the best instances, as judged by listeners of a specific language, of a certain phoneme. In the next section I describe a study by Barrett-Jones and Hawkins (Barrett, 1997; Hawkins and Barrett Jones, 2004) that challenges this assumption by showing how contextual information affects the way listeners perceive prototypes, and ultimately brings to the foreground the question of the nature of linguistic units.

4.1.2 Context-dependent PME

Barrett-Jones and Hawkins (Barrett, 1997; Hawkins and Barrett Jones, 2004) investigated the nature of prototypes in human speech recognition. In her thesis, Barrett-Jones set out to test whether context sensitivity, in terms of allophonic variation, would affect the PME. If the PME were found to be context-dependent, this would have implications for the nature of the phonological units of representation of speech sounds. She tested this hypothesis by eliciting goodness ratings and similarity judgements from listeners for the Southern British English vowel /u:/ in three different allophonic contexts: isolation (/u:/), preceding lateral (/lu:/) and preceding palatal glide (/ju:/). These three syllables also happen to be (pseudo-)words in SBE (ooh!, Lou/loo, you), a fact which lends an added degree of naturalness to the laboratory experiments. After recording 50 tokens of each of the three monosyllables from a 29-year-old male speaker of SBE, Barrett-Jones analysed them in terms of F1, F2 and F3 frequency at the beginning and at the end of the vocalic portion. The measurements were taken as outlined in Barrett-Jones (1997, p. 59), and served to synthesise stimuli for the perceptual experiments. As, unsurprisingly, F2 onset seemed to be the cue that most clearly distinguished the three allophones from one another, stimuli were synthesised by varying
this parameter systematically (from 800 Hz to 1600 Hz), with minimal variation in the remaining parameters, which was necessary for reasons of naturalness (see Barrett-Jones' thesis for further details). In a first experiment Barrett-Jones asked subjects (eight, different for each monosyllable) to give comparative judgements about which was the better exemplar, for all possible pairs of stimuli of the same monosyllable (excluding X-X pairs, thus giving 9 × 9 − 9 = 72 pairs). She then assigned one point to each of the "winning" stimuli. In giving their judgements, subjects were asked to focus on the vowel. This experiment showed that the most-preferred (or prototype, P) and rarely-preferred (or non-prototype, NP) tokens differed for each of the three contexts. A further outcome of this experiment was that individual differences were indeed quite large. This observation led Barrett-Jones to have ten randomly selected subjects repeat the experiment for all three contexts. A plot of these listeners' judgements, based on her data, is shown in figure 4.1.2. These data hint at the fact that while inter-speaker, absolute values for Ps and NPs can still contain some useful information, attention to individual values and relative distances is very important for a proper understanding of the phenomenon. The data presented here will be described at the end of section 4.3.1 and in the simulations following it. In a subsequent experiment Barrett-Jones used the individually measured Ps and NPs to test for differences in discriminability, which could offer evidence for the PME. The measure of discriminability adopted was d' (d-prime: Green and Swets, 1966). The results showed that discrimination was consistently worse around Ps. As a last, crucial step in ascertaining the plausibility of context-dependent perceptual magnets, Barrett-Jones performed another discrimination test in which the same subject was asked to give same/different judgements for two distinct blocks of stimuli. The first block tested for discriminability around the synthesised, subject-specific prototype (as previously found through comparative judgements) for one of the three contexts, whereas the second block tested for discriminability around a "candidate" prototype which took