English Pages 237 [240] Year 2011
Linguistische Arbeiten
539
Edited by Klaus von Heusinger, Gereon Müller, Ingo Plag, Beatrice Primus, Elisabeth Stark and Richard Wiese
Gero Kunter
Compound Stress in English: The Phonetics and Phonology of Prosodic Prominence
De Gruyter
Dissertation at the University of Siegen
ISBN 978-3-11-025469-3 e-ISBN 978-3-11-025470-9 ISSN 0344-6727
Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/New York
Production: Hubert & Co. GmbH & Co. KG, Göttingen
∞ Printed on acid-free paper
Printed in Germany
www.degruyter.com
Contents
1 Introduction
2 Compounds, stress and prominence: concepts and issues
  2.1 What is a compound?
  2.2 Prominence patterns in compounds
  2.3 Prominence in the autosegmental-metrical framework
3 The corpus
4 Perception of compound prominence patterns
  4.1 Introduction
  4.2 Pretest
  4.3 Method
    4.3.1 Participants
    4.3.2 Stimuli
    4.3.3 Procedure
  4.4 Results
    4.4.1 Overall results
    4.4.2 Intrarater reliability
    4.4.3 Perception ratings by items
    4.4.4 Summary of results
  4.5 Discussion
5 Acoustic correlates of compound prominence
  5.1 Previous research
    5.1.1 Pitch and fundamental frequency
    5.1.2 Loudness, intensity and spectral balance
    5.1.3 Duration
    5.1.4 Non-modal phonation
    5.1.5 Summary and research questions
  5.2 Material and measurements
    5.2.1 Pitch measurements
    5.2.2 Duration
    5.2.3 Intensity
    5.2.4 Spectral balance
    5.2.5 Non-modal phonation
  5.3 Procedure
  5.4 Results
  5.5 Discussion
6 Classification and prediction of compound prominence patterns
  6.1 Data
    6.1.1 Automatic measurement procedure and evaluation
    6.1.2 Vowel-intrinsic properties
    6.1.3 Summary of acoustic measurements
  6.2 Prediction of median prominence ratings
    6.2.1 Predictors in the regression analysis
    6.2.2 Regression analysis
    6.2.3 Predictions for the Boston corpus
  6.3 Classification of the Boston corpus
    6.3.1 Training set
    6.3.2 Model application and evaluation
7 What determines compound prominence patterns?
  7.1 Methodology
  7.2 Hypothesis testing with unbalanced data
  7.3 The structural hypothesis
  7.4 The semantic hypothesis
  7.5 Structural and semantic hypotheses combined
  7.6 Analogical effects
  7.7 General discussion
8 Within- and across-speaker variation
  8.1 Methodology
  8.2 Within-speaker variability
    8.2.1 Data
    8.2.2 Results
    8.2.3 Discussion
  8.3 Across-speaker variability
    8.3.1 Data
    8.3.2 Results
    8.3.3 Discussion
  8.4 General discussion
9 Conclusion
A Introduction to linear regression and mixed-effects models
B NOUN + NOUN compounds used in the variability study
References
List of Tables
4.1 Intraclass correlation coefficients for different rater groups
5.1 Distribution of phonation types
5.2 Summary of acoustic measurements in the test set
5.3 Structure of random effects
5.4 Mixed-effects models for phonetic properties
5.5 Estimated acoustic properties of left and right prominence
5.6 Pitch accent types by prominence pattern
6.1 Summary of acoustic measurements for the complete corpus
6.2 Predictors in the full regression model
6.3 Pearson correlation coefficients of acoustic properties
6.4 Regression analysis: average prominence ratings
6.5 Effects of correlation structure on Sright coefficient
6.6 Confusion matrix for LDA predicting prominence patterns
6.7 Results of classification validation by three judges
7.1 Mean prominence estimates for argument-head and modifier-head compounds by head morphology
7.2 Analysis of covariance testing structural effects
7.3 Semantic relations claimed to trigger right prominence
7.4 Semantic categories claimed to trigger right prominence
7.5 Possible semantic relations in modifier-head compounds
7.6 Analysis of variance testing effects of semantic features
7.7 Analysis of covariance testing structural and semantic effects
7.8 Predictive accuracy of analogical models
8.1 Within-speaker variability and predominant prominence patterns
8.2 Within-speaker variability and argument structure
8.3 Within-speaker variability and semantic relation N1 HAS N2
8.4 Within-speaker variability and N1 IS A PROPER NOUN
8.5 Within-speaker variability and proper nouns
8.6 Within-speaker variability for different speakers
8.7 Across-speaker variability and predominant prominence pattern
8.8 Across-speaker variability and argument structure
8.9 Across-speaker variability and semantic relation N1 HAS N2
8.10 Across-speaker variability and semantic relation N2 FOR N1
8.11 Across-speaker variability and sentential position
8.12 Across-speaker variability of types occurring more than two times
List of Figures
4.1 Screenshot of the software used for the perception element
4.2 Density estimate and histogram of perceived prominence ratings
4.3 Cluster analysis of raters based on part-whole correlation
4.4 Scatterplot of general trend and individual ratings by three raters
4.5 Density estimates for responses in the perception experiment
4.6 Normal and non-normal density estimates
4.7 Scatterplots of median ratings and ISEs
4.8 Distribution of median perception ratings
5.1 Pitch tracking for roadways
5.2 Modal and non-modal phonation
5.3 Linear regression models for phonetic parameters
5.4 Stylized shape of pitch contours
6.1 Partial slopes of median prominence rating regression analysis
6.2 Scatterplot of observed and predicted prominence ratings
6.3 Density estimates for observed and predicted prominence ratings
6.4 Density estimate of predicted prominence ratings
6.5 Median prominence ratings and posterior probabilities
7.1 Graph of mean prominence estimates for structural factors
7.2 Interaction of frequency and prominence estimate by argument structure
7.3 Graph of main effects for semantic factors
8.1 rF for type-speaker combinations
8.2 Strip chart of log frequency by within-speaker variability
8.3 Strip chart showing rF for type-environment combinations
8.4 Predominant prominence pattern by across-speaker variability
Acknowledgements
Over the last few years, I had the fortune of receiving much support from so many people, which made this book possible in the first place. Of course, my highest gratitude is due to my PhD advisor Ingo Plag for giving me the freedom to explore on my own, while at the same time offering guidance and encouragement whenever this freedom threatened to lead me astray. I am also deeply grateful to my co-advisor Harald Baayen, whose prowess in all matters statistical has opened to me a completely new world of [...]

[...] 0) with degrees of freedom df1 = n − 1 and df2 = (k − 1)(n − 1), where k is the number of raters and n the number of test items. This coefficient allows statements about how consistent the ratings of the participants are, and thus, to what extent the participants can be considered reliable judges of prominence patterns in English compounds. For the whole group of raters, the consistency is only medium (ICC(3, 1) = 0.341, F(104, 3120) = 17.1, p < 0.01). Apparently, the prominence ratings vary strongly for any given compound, even if differences in scale use are ignored. Given the examination of the part-whole correlations, this was to be expected.

We saw in Figure 4.3 that the part-whole correlation coefficient separates four groups of raters. We can now assess whether these groups represent four different strategies of judging the prominence pattern in compounds, or whether they are inhomogeneous groups of raters whose ratings are, to a large degree, influenced by random factors. Table 4.1 lists the ICC and the F statistic for each of the four groups. The degree of intrarater consistency and the part-whole correlations agree in their assessment of the proficiency of the rater groups. In the ‘circle’ group, not only does each rater have a high part-whole coefficient (meaning that the ratings from each judge correspond well to the general trend), we also find a high degree of
Table 4.1: Intraclass correlation coefficients for the four different rater groups. The labels correspond to the symbols used in Figure 4.3.

Label        ICC(3, 1)   F                       p
‘diamond’    0           F(104, 104) < 1         n.s.
‘square’     0.103       F(104, 520) = 1.69      p < 0.01
‘triangle’   0.312       F(104, 520) = 3.72      p < 0.01
‘circle’     0.631       F(104, 1664) = 30.1     p < 0.01

agreement, with fairly little variation, in the ratings for any given compound. Apparently, the participants in this group can be regarded as consistent and proficient judges of the prominence pattern in the test stimuli. On the other hand, the two raters in the ‘diamond’ group show no consistency with each other, nor do their ratings show a correlation with the majority of ratings. It has to be suspected that these two raters failed to follow the experimental instructions. This does not seem to be the case for the raters in the ‘square’ and ‘triangle’ groups, though. Even if their part-whole correlations are rather low, their ratings do reflect, to a certain degree (more so for the ‘triangle’ group, less so for the ‘square’ group), the general trend, as the correlation is significant for all raters. In general, a judge from these groups is capable of assessing the prominence pattern in a compound, but each rating deviates considerably and randomly from the (assumed) true value for the prominence relation. Members of these groups can be described as less proficient raters than those from the ‘circle’ group.⁴

A comparison between a proficient rater and less proficient ones is illustrative. Rater 19 is from the ‘square’ group (r̄_pw = 0.324), rater 16 is from the ‘triangle’ group (r̄_pw = 0.579) and rater 3 is from the ‘circle’ group (r̄_pw = 0.844), and hence is considered proficient. All coefficients are within one standard deviation of the average coefficient for the respective group. The relation between the ratings from these three raters and the general trend is displayed in Figure 4.4. The left column shows the individual ratings from each participant on the y axis, plotted against the general trend (the mean rating for each item, excluding the respective individual rating) on the x axis. The right column displays scatterplots for the deviation of each rating from the general trend for each rater on the y axis.
The line in each scatterplot is the regression line of a nonparametric local linear regression model that estimates the relationship between the two axes.

⁴ The analysis presented in Kunter (2010) yields additional support to the conclusion that only members from the ‘circle’ group can be considered reliable judges: no other pair of raters was found that had a similarly high degree of consistency as the raters in the ‘circle’ cluster.
The nonparametric regression used here is described in detail in Bowman and Azzalini (1997: ch. 3). In brief, a nonparametric regression, unlike linear regression, does not assume a linear relation between predictors and response variable (of course, nonparametric regression, like all nonparametric models, is still robust for data where a linear relation does exist). The underlying model takes the general form

y = m(x) + ε,    (4.2)

where y is the response variable, m(x) the relation function in x, and ε the error term. The local linear regression is a nonparametric regression that estimates m(x) in a way that is similar to the density estimation described above. It uses a kernel function to construct a local mean for any data point x by considering the data points within a given bandwidth. The local mean yields, in turn, the estimated regression function m̂(x). If there is no relation at all between the response variable y and the predictors, the result of m̂(x) is the mean μ of the response variable for any value of x. This is referred to as the ‘no effect’ model. Similarly, if there is a linear relation, m̂(x) may approximate the linear equation y = ax + e. Graphically, this can be illustrated by a reference band of two standard errors around μ (for the ‘no effect’ model) or around ax (for the ‘linear’ model). Bowman and Azzalini (1997: 86f) describe statistical tests for both types of model estimations.

In the left column of Figure 4.4, reference bands for the ‘linear’ model are displayed, as we would expect a linear relation between the general trend and the ratings for a given rater if the ratings from that rater are to allow inferences about the prominence patterns in the experiment. In the second and third rows, the regression lines stay within two standard errors around the assumed linear relations (test of ‘linear’ model: p = 0.803 for rater 16, p = 0.514 for rater 3).
Apparently, there is a tendency for both raters to give high ratings for compounds that received a high rating from the majority, and correspondingly for compounds with low ratings. This is not true for rater 19, though. The regression line in the top left panel falls partly outside of the reference band (test of ‘linear’ model: p = 0.02). While this rater shows an agreement with the general trend at the extremes (a very high or very low general trend coincides with a very high or very low individual rating, respectively), the middle range shows a marked deviation from linearity. In fact, in the interval between −200 and 100 on the scale of the general trend, no relation between general trend and the individual ratings is discernible at all. Here, the regression line is parallel to the x axis. In other words, rater 19 is capable of giving meaningful prominence ratings only for stimuli in which one element receives an unusually high or low prominence rating by the majority. The rater fails to provide meaningful ratings for less clear cases, where the response by rater 19 is largely unpredictable.
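The local linear estimate m̂(x) discussed above can be illustrated with a minimal sketch. It is not the procedure used for Figure 4.4: the Gaussian kernel, the function name `local_linear`, and the bandwidth value in the usage note are assumptions made for illustration only. At each evaluation point x0, a straight line is fitted by kernel-weighted least squares, and its value at x0 is taken as the estimate.

```python
import math


def local_linear(x, y, x0, h):
    """Local linear estimate of m(x0) with a Gaussian kernel (assumed) and bandwidth h."""
    # Kernel weights: observations close to x0 count more than distant ones.
    w = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
    # Weighted moments of the centred predictor.
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    # Weighted moments involving the response.
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
    # Intercept of the weighted least-squares line, i.e. the fitted value at x0.
    return (s2 * t0 - s1 * t1) / (s2 * s0 - s1 ** 2)


# Hypothetical data: an exactly linear relation and a constant response.
trend = [-300.0, -200.0, -100.0, 0.0, 100.0, 200.0]
linear_ratings = [0.5 * t + 10.0 for t in trend]
flat_ratings = [7.0] * len(trend)
```

With exactly linear data, `local_linear(trend, linear_ratings, 50.0, 100.0)` reproduces the line (the ‘linear’ case), and with a constant response it returns that constant (the ‘no effect’ case). The reference bands and the significance tests are not part of this sketch.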
[Figure 4.4 (plot omitted): three rows of panels for Rater 19, Rater 16 and Rater 3. Left column: Individual rating (y axis) against General trend (x axis). Right column: Deviation (y axis) against General trend (x axis).]
Figure 4.4: Scatterplots of the general trend and raters 19, 16, and 3 from ‘square’, ‘triangle’, and ‘circle’ group, respectively.
Apart from the linearity of the relation between the individual ratings and the general trend, the scatterplots also illustrate the magnitude of the part-whole correlation. The three raters are arranged with increasing part-whole correlation coefficients from top to bottom, and the deviation of the ratings from the general trend decreases accordingly, reflected in the degree of ‘cloudiness’ of the scatterplot. This deviation reflects the degree of rating uncertainty (or rating ‘errors’) for each judgement and rater. If a rater is to be judged as consistent, we do not expect to find a systematic relation between the deviation and the general trend. The error included in the ratings should be independent of the actual prominence pattern. Furthermore, the error should be low for a consistent rater. In the right column of Figure 4.4, these deviations from the general trend are plotted for each rater against the respective general trend. The reference bands provided correspond to a ‘no effect’ model of the regression estimate. This model is adequate if, indeed, the deviations cannot be predicted on the basis of the general trend for the respective rater.
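The quantities plotted in Figure 4.4 can be sketched as follows. The general trend for a rater is the mean rating per item with that rater's own rating excluded, and the deviation is the rater's rating minus this trend. The sketch assumes that the part-whole correlation is a Pearson correlation between a rater's ratings and this leave-one-out trend; the function names and the toy ratings are hypothetical.

```python
import math


def general_trend_excluding(ratings, rater):
    """Mean rating per item over all raters except `rater`."""
    others = [r for r in ratings if r != rater]
    n_items = len(ratings[rater])
    return [sum(ratings[o][i] for o in others) / len(others) for i in range(n_items)]


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def part_whole(ratings, rater):
    """Return (correlation with the leave-one-out trend, trend, deviations)."""
    trend = general_trend_excluding(ratings, rater)
    own = ratings[rater]
    deviation = [o - t for o, t in zip(own, trend)]
    return pearson(own, trend), trend, deviation


# Hypothetical slider ratings for four items by three raters.
toy = {
    "r1": [-300, -100, 100, 300],
    "r2": [-200, -100, 100, 200],
    "r3": [-100, 0, 0, 100],
}
r_pw, trend_r3, dev_r3 = part_whole(toy, "r3")
```

In this toy example, rater "r3" stays closer to the scale centre than the others, so the deviation is positive where the trend is negative and negative where it is positive, which is the pattern described for rater 19.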
If we look at the deviations for rater 16 and rater 3 (second and third panel in the right column), the regression line does not fall outside the reference band. No relation between the deviation and the general trend is detectable (test of ‘no effect’ model: p = 0.873 for rater 16, p = 0.721 for rater 3). For rater 19, however, the deviation is markedly positive for compounds with a negative general trend, and markedly negative for compounds with a positive general trend. This relation is significant (test of ‘no effect’ model: p < 0.001). Both deviations point towards a tendency of rater 19 to provide judgements that are closer to the neutral point of the scale than would be expected from the general trend. This behaviour is a common trait of raters from the ‘square’ group. All other five cases show a similar relation between deviation and the general trend, which is always significant (at an alpha level of α = 0.01). For the two other groups, such a significant relation is rather rare. Only one rater from the ‘triangle’ group shows a significant pattern similar to that of rater 19. For the ‘circle’ group, we do find a significant relation in 13 cases. Ten of these, however, are to the effect that the deviation and the general trend go in the same direction, which means that the respective rater gave more confident ratings than the average. These raters, then, are overshooting the general trend, while raters from the ‘square’ group rather undershoot it.

In summary, the investigation of the relation between deviation and the general trend demonstrates further differences between the groups. Raters from the ‘circle’ group show either little deviation from the general trend, or deviation towards the scale extremes. Raters from the ‘triangle’ group tend to show no relation between the deviation and the general trend. In the ‘square’ group, the raters show a significant relation between deviation and the general trend.
In contrast to the ‘circle’ group, the deviation here tends, in general, towards the scale centre.

To determine whether the inconsistencies between raters can be accounted for by the available demographic factors of the participants, a linear mixed-effects model was fitted with the perception ratings as the response variable, the participants’ sex (SEX, male or female) and the state of origin (ORIGIN, California or other) as predictors, and the identifiers for compound types and participants as random factors. Markov Chain Monte Carlo samples were used to evaluate the coefficients in the model (cf. Baayen et al. 2008). The corresponding analysis of variance showed no significant main effect for SEX or ORIGIN whatsoever. There was also no trace of an interaction between SEX and ORIGIN (F(1, 3356) < 1 for all three tests). This suggests that the observed differences in proficiency are not due to a factor introduced by the choice of participants.

As mentioned at the beginning of this section, a potential source of inconsistencies might be due to differences in the rating procedure. In particular, it is not unknown from rating tasks that raters may use different ranges of the available scale, and may also tend to avoid very high and very low ratings, preserving these
scale extremes for anticipated extraordinary stimuli (cf. Taylor and Parker 1964, Bortz and Döring 2002). It is not implausible to assume that the inconsistencies reported above are, at least partly, due to such differences of the rating scale between participants. A Tukey test for additivity reveals a significant interaction between the participants and the items (F(1, 3119) = 20.231, p < 0.01), which implies that some participants gave higher or lower ratings for some items than other participants. This interaction is further evidence that some raters placed their judgements on different ranges of the available scale.

As the different ratings for a given compound are compared to establish the central tendency of the perceived prominence pattern, these differences have to be addressed before this comparison is possible. Accordingly, the data for each rater were standardized using a quotient transformation (see Jajuga and Walesiak 2000 for a discussion of different standardization procedures) that involves the division of each rater’s response by the standard deviation for that rater. While this transformation ensures a scale range that is comparable between raters, it also preserves information about the location of each rating in relation to the central point of the rating scale, and hence, about the element that is perceived as prominent: transformed negative and positive ratings still correspond to left-prominent and right-prominent compounds, respectively. Using the transformed data in another hierarchical cluster analysis, every rater was assigned to the same groups of raters as in the analysis with untransformed data in Figure 4.3, and any differences between the two analyses were only on the lowest level of branching. Apparently, the four different rater groups remain untouched by the transformation, so we can continue with the same classification of raters as before.
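The per-rater standardization described above can be sketched in a few lines. Each rating is divided by the standard deviation of the ratings from the same rater, without subtracting a mean, so that the sign of each rating (left versus right prominence) is preserved. The function name and the toy data are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, as the text does not specify which estimator was used.

```python
import statistics


def standardize_per_rater(ratings_by_rater):
    """Divide each rater's ratings by that rater's own standard deviation."""
    transformed = {}
    for rater, ratings in ratings_by_rater.items():
        sd = statistics.stdev(ratings)  # sample SD; an assumption, see lead-in
        transformed[rater] = [r / sd for r in ratings]
    return transformed


# Hypothetical raters: "wide" uses the full slider range, "narrow" only a tenth of it.
raw = {
    "wide": [-400, -100, 200, 300],
    "narrow": [-40, -10, 20, 30],
}
scaled = standardize_per_rater(raw)
```

After the transformation, both hypothetical raters have identical ratings, as they differ only in the range of the scale they use, and negative ratings remain negative.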
Generally, the ICCs can be expected to be higher after the transformation process, as scale range differences are reduced. This is, indeed, the case, but the improvement of the consistency is not overwhelming. Using the transformed ratings, the ICC for all raters changes from 0.341 to 0.372 (F(104, 3120) = 19.4). It is the ‘circle’ group that profits most from a transformation of the ratings. Here, the coefficient changes from 0.631 to 0.677 (F(104, 1664) = 36.6). The consistency increase for the ‘square’ group is marginal (from 0.103 to 0.116, F(104, 520) = 1.78), and only small for the ‘triangle’ group (from 0.312 to 0.330, F(104, 520) = 3.96, all F statistics significant at α = 0.01).

In conclusion, there are differences in scale use that may be reduced by normalizing the ratings on a rater basis. The effect is only small, however, and only for the group of proficient raters does the consistency increase noticeably after the transformation. On the other hand, the transformation does reduce the effect of different scale ranges, and ensures greater comparability of the ratings between raters. Indeed, with the transformed ratings, the interaction between rater and test item that was found to be significant for the raw data now disappears (Tukey’s test for additivity, F(1, 3119) = 0.543, p = 0.461).
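As an illustration of the consistency coefficient reported above, the following sketch computes ICC(3, 1) from a complete items-by-raters matrix. It assumes the usual two-way definition, ICC(3, 1) = (MS_items − MS_error) / (MS_items + (k − 1) MS_error), with the F statistic MS_items / MS_error on n − 1 and (n − 1)(k − 1) degrees of freedom, which matches the degrees of freedom given in the text. The function name and the toy ratings are hypothetical and not taken from the experiment.

```python
def icc_3_1(ratings):
    """ICC(3, 1) for a list of n items, each a list of k rater scores.

    Returns (icc, f_statistic, (df1, df2)).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_items = k * sum((m - grand) ** 2 for m in item_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_items - ss_raters
    ms_items = ss_items / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    icc = (ms_items - ms_error) / (ms_items + (k - 1) * ms_error)
    return icc, ms_items / ms_error, (n - 1, (n - 1) * (k - 1))


# Hypothetical ratings: six items (rows) judged by four raters (columns).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc, f_stat, df = icc_3_1(toy_ratings)
```

Because the rater main effect is removed before the error term is computed, a rater who is consistently higher or lower than the others does not lower this coefficient; differences in the width of the scale range used, however, do.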
The transformed ratings therefore are the basis for the further analysis. This decision has consequences for the range that ratings may take. The scale units are now transformed to pseudo-standard units,⁵ and no longer directly reflect the absolute slider position that was mapped onto the [−499, 500] interval. Typically, most ratings fall between −2.5 and 2.5, but occasional values outside that range are possible.
4.4.3 Perception ratings by items

Since the transformation process just described ensures comparability of the ratings placed by different raters for a given compound, a more detailed analysis of the test stimuli becomes possible. Based on the assumption that there are different prominence patterns in English compounds, we expect these patterns to emerge in the distribution of ratings per test item. Figure 4.5 displays rating density estimates for each of the 105 compounds. The dotted line indicates the scale centre at zero. If the prominence pattern for a given item can be assessed unambiguously, the ratings for that item are normally distributed. The density estimate for such an item in Figure 4.5 resembles a bell curve, that is, there is a single peak indicating the area chosen most frequently by the raters. Deviations from that peak can be explained by random variation around the mean rating. The frequency of ratings monotonically decreases with increasing distance from the mean rating.

Shapiro-Wilk tests for each item reveal, however, that half (53 out of 105, or 50.5 percent) of the test items have a significantly non-normal distribution (all significant at α = 0.05). In these cases, marked with an asterisk in Figure 4.5, the ratings do not follow the described pattern that is expected for unambiguous items. An inspection of the density estimates reveals that the majority of compounds that have a significantly non-normal distribution are characterized either by a second peak (e.g. war efforts, Massachusetts house) or by an overly long tail (e.g. Seabrook, house speaker) in the density estimate. This means that in these cases, a sufficiently large number of raters gave judgements that differ markedly from those of the majority. The difference is too large to be expected under a normal distribution of ratings. The deviant ratings, then, reflect a perception that is significantly different from that of the remaining raters.
There are compounds in which the location of the secondary peak is on the same side of the scale, such as cancer patient or phone calls. The non-normality in these cases is due to a set of highly confident ratings that were placed at a large distance from the scale centre. In many cases, however, the position of the

⁵ The prefix ‘pseudo’ is appropriate as the standardization was not applied uniformly across all ratings, but on a per-rater basis. The standard deviation of all ratings does not equal 1.0, but is 1.055.

[Figure 4.5 (plot omitted): one density panel for each of the test compounds. x axis: Transformed perception ratings (−4 to 4). y axis: Density (0.0 to 1.0). An asterisk after the compound name marks a significantly non-normal distribution.]
Figure 4.5: Density estimates for responses in the perception experiment (N = 31 per item). See text for details.
[Figure 4.6 appears here: two density panels, Boston harbor (left) and eviction notices (right). x-axis: transformed rating (−3 to 3); y-axis: density (0.0 to 0.6).]
Figure 4.6: Density estimates for an item with normally distributed ratings (left panel), and for a non-normal distribution (right panel).
secondary peak is on the opposite half of the scale, as in the case of campaign promise, eviction notices, or HIV virus. This indicates considerable disagreement in the way the items are judged, as the different scale halves correspond to differences in the most prominent element.
In Figure 4.6, we see in the left panel an example of a compound with a normal distribution of ratings (Boston harbor, Shapiro-Wilk test of normality: W = 0.963, p = 0.339), and in the right panel an example where the ratings are not normally distributed (eviction notices, W = 0.881, p = 0.003). A reference band for a normal distribution is overlaid that indicates the acceptable margin of deviation from a normal distribution (two standard errors). Both density estimates show two peaks, one above and one below the neutral point at zero. In the left example, however, the valley between the two peaks, which represents a slightly lower frequency of ratings around zero, still falls within what can be expected for ratings which are normally distributed. Accordingly, no part of the density estimate lies outside of the normality reference band. The situation is different in the example in the right panel. Here, the decrease in frequency between the two peaks is so large that it cannot be attributed to the variation expected under the assumption of normally distributed ratings. Moreover, the second peak is so high that it, too, falls outside of the reference band, signifying a very high frequency of ratings. In summary, the examples show that it is rather safe to assume that Boston harbor in the left panel has a single average rating that is located close to the scale centre. Most raters perceived both elements to be of more or less equal prominence. This cannot be said for eviction notices. The majority of ratings are on the positive side, so most raters perceived right-prominence.
A substantial number of ratings were placed on the negative half, though; these ratings correspond to perceived left-prominence.
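The two-peak pattern described here can be made concrete with a small sketch. The following Python fragment is an illustration only and not the procedure used in the study: it assumes a plain Gaussian kernel density estimate with a hand-picked bandwidth `h`, and the ratings are invented values on the transformed scale. It locates the local maxima of the estimate on a grid and checks whether they fall on opposite halves of the scale. The names `gaussian_kde`, `find_peaks` and `split` are hypothetical.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def find_peaks(density, lo, hi, steps=600):
    """Locate local maxima of the density on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [density(x) for x in xs]
    return [
        (xs[i], ys[i])
        for i in range(1, steps)
        if ys[i - 1] < ys[i] >= ys[i + 1]
    ]


# Invented ratings for a single item (NOT data from the experiment).
ratings = [-1.6, -1.4, -1.3, -1.2, -1.1, -0.9,
           0.5, 0.6, 0.7, 0.7, 0.8, 0.9, 1.0, 1.1]

h = 0.4  # bandwidth: an assumption, chosen by hand for this sketch
kde = gaussian_kde(ratings, h)
peaks = find_peaks(kde, -3.0, 3.0)

# True if the estimate has at least one peak on each half of the scale.
split = any(x < 0 for x, _ in peaks) and any(x > 0 for x, _ in peaks)
print(peaks, split)
```

A peak on either side of zero, as flagged by `split`, corresponds to the situation in which raters disagree about which element is the more prominent one. Whether such a pattern is compatible with normality is a separate question that the sketch does not address.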
The perceived prominence pattern varies strongly for eviction notices, with some raters judging the prominent element to be eviction, while others perceived notices as more prominent. In both Boston harbor and eviction notices, there is a saddle point below −2.0 on the transformed scale. This change in the course of the density estimate is still within the normality reference band, and therefore does not contribute much to the non-normality of the right example. Yet, it is a marked feature of many of the density estimates displayed in Figure 4.5. For many compounds, the density estimate reaches values well below −2.0, and marginal peaks are not infrequently found in this region. These marginal peaks show that the respective raters were very confident of perceiving left-prominence in the compound, and accordingly placed their rating on the far-left side of the scale. It should be noted that a corresponding observation cannot be made for the right-hand side.
The existence of compounds where large between-rater disagreement can be found as to which element is more prominent extends the notion of variability introduced in the previous section. We have seen there that not every speaker is similarly successful in judging compound prominence patterns. In addition to this, we now also find that a given compound may receive opposing ratings by larger numbers of judges. The perception of prominence is, apparently, not always an unambiguous issue, and there are some cases where no coherent judgement seems to be possible.
A question that arises from this observation is whether there is a prominence relation that leads to increased variability in the ratings. It might be possible, for instance, that it is right-prominent compounds that show a particularly high degree of variability. This question may be addressed by comparing the average prominence rating for each item with the deviation from this average.
In compounds where the ratings are placed homogeneously, the deviation from the average should be lower than if there is large disagreement between raters. It is not advisable, however, to apply the arithmetic mean as a measure of the central tendency of ratings. The reason becomes immediately obvious if we again look at the two examples in Figure 4.6. We saw that the left item, Boston harbor, follows a normal distribution. The arithmetic mean, therefore, corresponds well with the central tendency that can be visually derived from the density estimate. With M = −0.031, the arithmetic mean is just below the scale centre, which seems plausible given that the two peaks in the density estimate are located around zero and of comparable height. The right item, eviction notices, was shown not to be normally distributed. Accordingly, the arithmetic mean of the ratings does not adequately reflect the central tendency for this item: With M = 0.214, it is very markedly below the peak position at 0.705 on the transformed rating scale. An alternative for eviction notices might be, then, to use the peak position as the indicator of the central tendency. However, for Boston harbor, the left peak is only slightly lower than the right one, yet using only the peak information would
discard all the ratings that are represented by the left peak. Clearly, neither the arithmetic mean nor the position of the density estimate peak is an acceptable measure of central tendency in the present data. The appropriate measure, then, seems to be the median. As it is defined as the second quartile (the point below which fifty percent of the distribution of the data can be found), it does not assume a normal (or any other) distribution of the data, but is derived from the ranks of the data. For both examples, the median (MED) provides a reasonable approximation of the central tendency of ratings. It is MED = −0.049 for Boston harbor and MED = 0.558 for eviction notices, which are both more plausible than either the arithmetic mean or the mode of the density estimate.
Under the assumption that rating consistency for a given compound is reflected in a normal distribution of the ratings, the degree of consistency of ratings can be assessed by the degree of deviation from a normal distribution. Bowman and Azzalini (1997: 38) propose the integrated square error (ISE) between the density estimate and an assumed normal density as a goodness-of-fit statistic for normality. It takes the form

$$\mathrm{ISE} = \int \left\{ \hat{f}(x) - \phi(x - \hat{\mu}; \hat{\sigma}) \right\}^{2} \, dx, \tag{4.3}$$

with $\hat{f}(x)$ the density estimate in $x$, and $\phi(x - \hat{\mu}; \hat{\sigma})$ the density of a normal distribution based on estimations of the mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$.6 While the ISE can be invoked as a rather conservative test for normality as compared to the Shapiro-Wilk test, the statistic is used here to compare the degree of homogeneity of different compounds. Returning to the examples from above, the normality of Boston harbor is reflected in a low integrated square error (ISE = 0.004), while for the non-normal eviction notices, the ISE is clearly larger (ISE = 0.026).
A Spearman correlation test between the median prominence rating and the ISE for each compound reveals a significant, medium positive correlation between the two (ρ = 0.202, p = 0.039). While this suggests an increased disagreement between ratings for compounds with right-prominence, a closer look at the data reveals a more detailed picture. Figure 4.7 displays a scatterplot for the median prominence ratings and the ISE for each item. It also contains the regression line of a local linear regression, with ISE as the response variable and the median rating as the covariate. In addition, reference bands for the ‘no effect’ model are overlaid (see the previous section for a brief introduction to the concept of non-parametric regression). The ratings of all participants went into the left panel,
6 Bowman and Azzalini (1997: 38) point out that a correction is necessary for the estimated standard deviation to cancel out the bias introduced in the smoothing operation. For the sake of clarity, this correction factor is not displayed here. It is, however, included in the present calculations of ISE.
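As a rough illustration of how the integrated square error in (4.3) behaves, the following Python sketch approximates the integral numerically. It rests on several assumptions that are not taken from the text: a Gaussian kernel with a hand-picked bandwidth `h`, trapezoidal integration on a fixed grid, and invented ratings. The bias correction for the standard deviation mentioned in the footnote is deliberately left out, so the values are not comparable to the ISE figures reported in this chapter. All function names are hypothetical.

```python
import math
import statistics


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(normal_pdf(x, xi, h) for xi in data) / len(data)


def integrated_square_error(data, h, lo=-6.0, hi=6.0, steps=1200):
    """Trapezoidal approximation of the integral of (f_hat - phi)^2.

    The normal density uses the plain sample mean and standard deviation;
    the smoothing-bias correction is omitted (assumption of this sketch).
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        diff = kde(x, data, h) - normal_pdf(x, mu, sigma)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * diff * diff * dx
    return total


# Invented ratings (NOT data from the experiment).
unimodal = [-1.5, -1.0, -0.7, -0.4, -0.2, 0.0, 0.2, 0.4, 0.7, 1.0, 1.5]
bimodal = [-1.5, -1.4, -1.3, -1.2, -1.1, 1.1, 1.2, 1.3, 1.4, 1.5]

h = 0.4  # bandwidth: an assumption, chosen by hand
ise_unimodal = integrated_square_error(unimodal, h)
ise_bimodal = integrated_square_error(bimodal, h)
print(ise_unimodal, ise_bimodal)
```

In this toy setting the clearly two-peaked sample yields the larger value, which mirrors the way the statistic is used in the text: a larger ISE signals a stronger departure from a single, normally distributed cluster of ratings.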
[Figure 4.7 appears here: two scatterplots. x-axis: median prominence rating (−2.0 to 1.0); y-axis: integrated square error (0.00 to 0.04).]
Figure 4.7: Scatterplots showing median ratings and integrated square errors. The left panel includes all raters (N = 31), the right panel only the proficient raters (N = 17). See text for details.
while in the right panel, only the ratings from the group of raters that showed high consistency in their ratings are included (the ‘circle’ group, see section 4.4.2). While the scatterplot on the left panel demonstrates that compounds with highly inconsistent ratings (resulting in high integrated square errors) may occur over the whole range of median ratings, the regression line shows that the relation between the two is not linear (test of linearity: p < 0.001). In particular, there is a pronounced increase of ISEs with increased median ratings, but this linear increase only sets in for positive median ratings. For negative ratings, no linear relation is determinable. The local linear regression shows a highly significant effect (test of no effect: p < 0.001). In other words, compounds with a more prominent right element also tend to receive highly inconsistent ratings.
A distinctly different picture emerges from the scatterplot on the right panel that contains only the consistent raters. Here, the steep incline of the regression line for right-prominent compounds has mostly vanished. It now falls almost completely within the reference band for the no-effect model. Indeed, the test for no effect now finds only weak evidence of an effect of the median rating on the ISE (test of ‘no effect’ model: p = 0.072). This means that for a given compound, the ratings of the consistent raters (the ‘circle’ group) have a similar deviation from normality, irrespective of the median of the ratings for that compound. In general, the exclusion of the less consistent raters increases the normality of the ratings per item, as assessed by the ISEs (Welch t-test, t = 3.554, df = 207.696, p < 0.001, Cohen’s d = 0.694). This reduction is visualized by the consistently lower regression line on the right panel in Figure 4.7. So, in summary, the exclusion of less proficient raters increases the consistency of ratings per item.
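The group comparison reported above can be sketched in a few lines of Python. The fragment computes Welch’s t statistic, the Welch-Satterthwaite degrees of freedom, and an effect size. Two points are assumptions of the sketch rather than facts from the text: Cohen’s d is computed here with the pooled standard deviation, which is only one of several common variants, and the ISE values are invented. The p-value is omitted because it would require the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df


def cohens_d(a, b):
    """Cohen's d based on the pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Invented per-item ISE values (NOT data from the experiment).
ise_all_raters = [0.020, 0.031, 0.012, 0.026, 0.018, 0.035]
ise_consistent = [0.010, 0.015, 0.008, 0.022, 0.009, 0.017]

t, df = welch_t(ise_all_raters, ise_consistent)
d = cohens_d(ise_all_raters, ise_consistent)
print(t, df, d)
```

Unlike the classical t-test, the Welch variant does not assume equal variances in the two groups, which is why its degrees of freedom are generally not an integer, as in the value reported in the text.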
In particular, the high rating variability for right-prominent compounds is
[Figure 4.8 appears here: two overlaid density estimates, labelled ‘consistent raters’ and ‘all raters’. x-axis: median prominence rating (−2 to 2); y-axis: density (0.0 to 0.7).]
Figure 4.8: Distribution of median perception ratings (N = 105).
effectively eliminated, as the ISEs as a measure of variability are evenly spread out between zero and 0.03. It should be noted, nonetheless, that the ratings for some items may still deviate considerably from the median rating even for the group of proficient raters. For instance, King appointee was confidently judged as left-prominent by the large majority (13 ratings or 76.5 percent below −0.75), yet two raters (11.8 percent) gave high positive ratings (above 0.75) that indicate a clear right-prominent percept for these two raters. The ISE for this compound is, as expected, high (ISE = 0.042). Disagreement in the prominence pattern, it seems, is not an infrequent phenomenon, even among proficient raters.
What the scatterplots in Figure 4.7 do not display clearly is the overall distribution of median prominence ratings. This is provided in Figure 4.8. Here, the solid line is the density estimate of the median prominence ratings from the group of proficient raters. The dashed line is based on all ratings. The overall shape of both density estimates is visually fairly similar. This is confirmed by a test of equal densities, which does not report a significant difference between the two density estimates (p = 0.33). In both cases, the distribution has two modes, one above and one below zero. Between the two peaks, close to the scale centre at zero, is a marked valley in the curve. The density estimate for all raters, however, shows a bias towards the neutral point. The location of both density peaks is closer to zero, and the dividing valley between the peaks is not as pronounced as in the case of the density estimate based on the consistent raters. Apparently, the inclusion of the less proficient raters partly obfuscates the clear bimodal distribution present in the ratings from
the consistent group. In the remainder of this section, we therefore concentrate on the latter group.
The shape of the density estimate suggests that there are two distinct types of prominence patterns in English NOUN + NOUN compounds. In the first type, corresponding to the first peak at −1.112, the left element of the compound is the prominent one, as in cancer patient (MED = −1.154) or water pipes (MED = −1.145). The compound with the lowest median rating is bingo games (MED = −1.978). The left element is highly prominent over the right one in this item. In the second type, reflected in the second peak at 0.786, the right element is perceived as more prominent. Examples for this type are college board (MED = 0.778) and neighborhood residents (MED = 0.749). In capital police, the median prominence rating is largest (MED = 1.367). Here, the right element was perceived as very prominent.
The data obtained from the perception experiment provides no indication that there is another prominence pattern apart from the two that manifest themselves as modes in the density estimates. In particular, a hierarchical cluster analysis (Ward clustering with Euclidean distance measure) based on the median ratings reaches an optimal data separation if two different clusters are assumed, as indicated by the Calinski-Harabasz index. The two clusters are separated at a median rating of 0.25, which corresponds closely to the location of the valley in the density estimate. Thus, no gain in data separation is achieved by admitting a third cluster. However, the density estimate demonstrates that the assignment to the two patterns has to deal with some overlap. There is a notable number of compounds that receive ratings right between both peaks. This becomes visible, for instance, by the observation that the density estimate in the interval [−0.4, 0.4] still is larger than 0.15. In fact, 11 compounds (10.5 percent) received ratings within this range.
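The logic of the cluster analysis can be illustrated with a self-contained Python sketch. It implements a naive one-dimensional Ward clustering (each step merges the two clusters whose fusion increases the within-cluster sum of squares the least) and evaluates the Calinski-Harabasz index for two to five clusters. The median ratings are invented, the candidate range of cluster numbers is an assumption, and the helper names are hypothetical; the analysis in the text was presumably carried out with dedicated statistical software.

```python
import statistics


def ward_partitions(values):
    """Agglomerative Ward clustering of one-dimensional data.

    Returns a dict that maps a number of clusters k to the partition
    (a list of clusters) obtained when k clusters remain.
    """
    clusters = [[v] for v in sorted(values)]
    partitions = {len(clusters): [c[:] for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Increase in within-cluster sum of squares if a and b merge.
                cost = (len(a) * len(b) / (len(a) + len(b))) * (
                    statistics.fmean(a) - statistics.fmean(b)
                ) ** 2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        partitions[len(clusters)] = [c[:] for c in clusters]
    return partitions


def calinski_harabasz(clusters):
    """Ratio of between- to within-cluster dispersion, scaled by df."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    grand = statistics.fmean([v for c in clusters for v in c])
    between = sum(
        len(c) * (statistics.fmean(c) - grand) ** 2 for c in clusters
    )
    within = sum(
        (v - statistics.fmean(c)) ** 2 for c in clusters for v in c
    )
    return (between / (k - 1)) / (within / (n - k))


# Invented median ratings (NOT data from the experiment).
medians = [-1.55, -1.30, -1.18, -1.10, -1.02, -0.90, -0.65,
           0.50, 0.70, 0.80, 0.90, 1.10]

partitions = ward_partitions(medians)
scores = {k: calinski_harabasz(partitions[k]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher index indicates a better trade-off between compact clusters and well-separated cluster centres, so the number of clusters with the highest index is taken as the preferred solution. For the toy data, admitting a third cluster does not improve on the two-cluster solution.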
Nevertheless, the clustering implies that these instances are due to the expected variation around the two main modes in the distribution, and do not constitute a separate type of prominence pattern. What the distribution of all ratings already had suggested becomes even clearer for the distribution in Figure 4.8. The first prominence pattern occurs much more frequently than the second in the random set of compounds used in the perception experiment. The amplitude of the left peak is more than twice the amplitude of the second (0.683 vs. 0.327). The participants clearly placed more ratings on the left-hand side of the scale. 71 of the 105 compounds have a median rating below zero (67.6 percent). Even if the different peak amplitudes were ignored, the distribution is still far from symmetrical. We do find a valley in the density curve that is located almost exactly at the scale centre, but the right peak is significantly closer to the scale centre than the left peak (Wilcoxon signed rank test with null hypothesis that the median of ratings on each half of the scale is in the same absolute position, V = 62, p < 0.001). This means that if the participants perceived the left member as the prominent one, and therefore chose to place their rating on the left-hand
side of the scale, they placed it at a larger distance from the scale centre than if they perceived the right element as the prominent one. In other words, the participants were much more determined to give a clear rating for left-prominent compounds than for right-prominent ones.
4.4.4 Summary of results
In the preceding section, an experiment was presented that tested the perception of prominence patterns in English NOUN + NOUN constructions. Participants could indicate their prominence percept on an unmarked graphical scale. 105 compounds were rated by 32 native raters. One rater was excluded due to a faulty use of the experimental scale. Not all raters turned out to be equally proficient in the rating task. Two raters gave responses that disagreed completely with the responses of the majority. A group of 17 raters turned out to be highly proficient in the rating task, based on a high agreement with the general trend as well as a high intraclass correlation. Six of the remaining 12 raters turned out to provide unreliable responses for compounds, as they show a strong tendency towards the scale centre for both left-prominent and right-prominent compounds. The other six raters did not demonstrate such a tendency, but showed a much larger degree of random variation around the general trend than the group of 17 consistent raters. Rater proficiency was neither related to the sex of the participants, nor to their origin within the United States.
Not all compounds showed a similar degree of inconsistency. Stimuli with a more prominent right member demonstrated more variability in the ratings than those with a more prominent left member. Exclusion of the inconsistent raters reduced this effect considerably. Based on the median prominence ratings from the consistent raters, two different types of prominence patterns emerge: compounds with a more prominent left element, and compounds with a more prominent right element. 71 of the 105 compounds (67.6 percent) received ratings that indicate left-prominence. No third type is traceable in a cluster analysis, but there is some overlap between the two patterns. Raters gave more confident ratings for left-prominent compounds than for right-prominent compounds.
4.5 Discussion
A review of the literature on English compounds may give the impression that perceiving and identifying prominence patterns is a straightforward task that
leaves little room for disagreement. Accordingly, there is a notable lack of experiments that examine the perception of prominence. The analysis has shown that this assumption is far from valid: listeners vary considerably in the way they assess prominence in compounds. Similar observations have been occasionally reported for the perception of lexical stress. For instance, Youssef and Mazurkewich (1998) found in their production and perception study of the acquisition of lexical stress by L2 speakers of English that their control group of native English listeners were not always able to correctly identify which syllable of the stimuli carried lexical stress. The rate of misidentifications was particularly large if it was the antepenultimate syllable that received primary stress, where between 22.1 and 29.4 percent of the primary stresses were not identified by the control group. Altmann (2006), in her study of stress acquisition, does not report comparable misidentification rates of lexical stress for her control group of native English listeners. Her analysis shows, however, that English listeners are not equally proficient in perceiving lexical stress in all contexts. When it comes to prominence in constructions larger than a single word, Streefkerk (2002) reports similar rating inconsistencies for perceived prominence in Dutch sentences. Five out of ten subjects were identified as highly consistent raters, while two raters provided highly unreliable responses.
It is quite probable that the high disagreement rate, both in our experiment and in the research just mentioned, is at least partly due to the fact that naive raters were tested. As mentioned above, it is not uncommon that, when compared to trained subjects, untrained subjects show less consistent responses, and may fail to make categorical distinctions that are available to trained subjects (see Warren 1999: 123 for a brief discussion of this topic).
This is related to the observation that a hesitant usage of the rating scale may be more common if the rating task is unfamiliar to the raters (Bortz and Döring 2002). This hesitant usage leads to the error of central tendency that was observed above for members of the ‘square’ group, who tended to give ratings closer to the scale centre than members of other groups. Another source of considerable individual differences may lie in the rating task itself. The experiment demanded a metalinguistic judgement from the participants if the term is understood as in, for instance, Birdsong (1989: 1) (“Metalinguistic performance in its broadest sense can be understood as any objectification of language”). In the present case, the participants were instructed to provide a judgement of a linguistic dimension, namely, the comparison of the perceptual prominence of two lexical units. Chaudron (1983) as well as Birdsong (1989) have found that test subjects may show considerable variability in metalinguistic tasks. Birdsong (1989: 69) comes to the conclusion that different cognitive and analytical skills may strongly affect metalinguistic performance even so far as to obscure the linguistic structure under observation. This is clearly not
the case in the present data, but the central tendency of some raters shows that skill differences have to be acknowledged in perception experiments such as the present one. The alternative, to use trained raters, however, is also not without disadvantages. As mentioned above, trained subjects may be more successful in discerning acoustic details, but to the extent that distinctions may be based not on a linguistic form, but on a metalinguistic re-analysis of the acoustic signal. Studies like that of Ladd et al. (1994), in which speech scientists gave prominence ratings significantly different from those of untrained subjects, suggest that training in prominence rating tasks may lead to an increasing reliance on a single acoustic cue such as pitch targets. This might skew the resulting prominence rating if the cue is not used consistently to signal prominence, or if there are other cues that stand in a trading relation to pitch.
In the present experiment, an alternative is presented to relying either on the adequacy of ratings from untrained subjects or on possibly overspecialized ratings by trained raters. Measures of rater reliability as described in section 4.4.2 allow us to identify and exclude less proficient raters, and thus improve the validity of the experimental data. Similar procedures are frequently applied, for instance, in psychological rating studies if rater training is not a viable choice (e.g. Wirtz and Caspar 2002: ch. 8). Streefkerk (2002: 33) goes so far as to base her prominence ratings on a single, ‘optimal’ rater. As she used a binary decision task (prominent vs. non-prominent), the possible degree of variability may have been reduced for this rater. Yet, in the light of the results reported here, this approach seems somewhat dubious, as variability in the ratings, and a certain degree of inconsistency, was found to be inherent to every participant.
Instead, for the by-item analysis, we used median ratings for the 17 most consistent raters, which seems more appropriate for the graphic rating scale used here. The different rater proficiencies, or, as it were, the different metalinguistic skills, could not be attributed to one of the demographic variables available for the raters (SEX and ORIGIN). The source of rating skills, while interesting in itself, was not a research question at the centre of the present experiment. A more detailed investigation of factors contributing to different prominence rating performances has therefore to be left to future research. If the ratings that are judged as reliable are pooled for each item, the distribution of ratings shows, as seen above, two concentrations of ratings, one representing left-prominent compounds, the other one representing right-prominent compounds. If we look at the frequencies of the respective prominence types, we find that nearly a third of our random set of types (32.4 percent) receives median ratings that indicate right-prominence. Right-prominence is, as it seems, not a marginal phenomenon at all, which is a conclusion which might be reached if, for instance, dictionaries were the source of prominence information: in the CELEX database (Baayen et al. 1995), for instance, only ten percent of the lemmas are
right-prominent, while Teschner and Whitley (2004) report about 17 percent compounds with right-prominence. Both sources are based on dictionary data, which may explain the low number of right-prominent compounds. As we have seen above in section 2.1, the prominence pattern of a NOUN + NOUN construction has been applied as a criterion for word status for a long time, and it is rather likely that this criterion has influenced the decision of dictionary compilers to include or reject a given construction with right prominence. This is supported by the figures reported in Sproat (1994). Sproat, who looked at NOUN + NOUN constructions irrespective of their word status, found that about 30 percent of the data in his test set showed right-prominence, which is rather close to the ratio found here.
At first glance, the findings agree fairly well with the metrical phonology framework. Yet, as pointed out above, one of the basic assumptions of metrical phonology is that, of two adjacent constituents that are combined in a larger structure, one, and only one, constituent is ‘strong’ (more prominent), and the other is ‘weak’ (less prominent). We find this assumption largely confirmed in the distribution of our rating data. Some compounds have a prominent (‘strong’) left element, and in others, the right element is ‘strong’. The data do not provide evidence for a category in which both elements are perceived as equally strong. However, we do find 11 compounds that receive a median rating within ±0.4, such as Boston harbor in Figure 4.6, or, perhaps even more strikingly, senate committee (see Figure 4.5), which has a median rating of almost zero (MED = 0.004), with a very low dispersion (the interquartile range is IQR = 0.228), and a low deviation from normality (ISE = 0.005).
In these cases, both elements are similarly prominent.7 As outlined in section 2.3, the autosegmental variant of metrical phonology links the ‘strong’–‘weak’ distinction to the presence of pitch accents, and the two different prominence patterns are the result of the presence or absence of a pitch accent on the second element. No prediction about prominence is made in the absence of pitch accents on both elements. It could be argued, then, that instances of near-zero prominence ratings are instances in which all elements are unaccented. This does not apply here, though, as pitch accents are present in the two examples above (acoustical inspection following the criteria for accent classification in Beckman and Hirschberg 1994 revealed H* accents on senate in senate committee and on harbor in Boston harbor). Equal prominence, it seems, is possible even with accented elements.
7 It might be argued that participants also chose the scale centre at zero to indicate high uncertainty as to which of the two elements was perceived as more prominent. As the experimental design did not offer a way to indicate the degree of certainty, there is no way of identifying such a rating behaviour. Yet, it seems safe to assume that a very high degree of uncertainty can only arise if neither of the two elements is perceived as predominantly prominent, so an interpretation of ratings around zero as equal prominence appears to be appropriate even in these cases.
However, without the detailed look at
the actual acoustic material of the stimuli that is provided in the next chapter, an interpretation of compounds with ambiguous prominence patterns is postponed to chapter 6.2.3. For the time being, we can conclude that the perceived prominence in compounds that occur in natural speech forms a continuum. This continuum is not without structure, but it is less discrete than the division made by metrical phonology would predict. The metrical account, then, is a useful generalization to describe the general pattern of the data, but it is not unusual to find compounds that cannot be assigned to one of the two most frequent prominence patterns. Metrical Phonology, in its strict, binary sense, fails to provide an explanation for this. Furthermore, we have seen that these two prominence patterns are not perceived equally well: right-prominent compounds are more prone to receive highly variable ratings than left-prominent compounds. A similar result can be found in Farnetani et al. (1988) who report that their participants were significantly more successful in identifying left-prominent compounds than right-prominent compounds. Even with the insufficiency of a metrical account as outlined above, the description of the role of pitch accents above may help to explain this asymmetry, which seems to emerge across different experimental designs and tasks. Assuming that the left element is generally accented in a compound, the rating judgement about whether the left or the right element is more prominent is equivalent to a judgement whether the right element has the same accentuation as the left element or not: A compound is judged as right-prominent if the accentuation is the same in both elements. Although the actual type of pitch accent on a syllable may vary, the autosegmental approach to intonation regards accentuation as a binary feature. Any given syllable is either accented, or not (e.g. Gussenhoven 2004: 17). 
It is tempting to assume that this distinction may be perceived categorically. The field of categorical perception has been explored in great detail since the 1950s (see Repp 1984 for an overview). The classic study by Liberman et al. (1957) examined the discrimination of stimuli in which the second formant was continuously altered between values typical for the English voiced obstruents /b/, /d/, and /g/. Subjects were then presented with two neighbouring stimuli, and asked to judge whether they perceived them as identical or as different. Liberman et al. found that the discrimination rate was only high if the values for the second formant were on opposite sides of the boundary frequency that separated the continua possible for either of the two phonemes in question. In other words, stimuli that belonged to the same category, but were still phonetically different, received lower discrimination rates than stimuli from different categories, even if the phonetic differences were similar in size. Similar discriminatory effects have been reported for other, but not for all phonological contrasts. For instance, voicing contrasts between /b/ and /p/ demonstrate a categorical perception if
closure durations (in intervocalic position, Liberman et al. 1961) or the duration of the preceding vowels (in word-final position, Raphael 1972) are manipulated. Categorical discrimination does not, however, seem to be the only mode of perception of phonological contrasts. For instance, Repp (1984: 286) summarizes numerous experiments on the distinction between modified fricative stimuli, where within-category discrimination was often found to be as high as between-category discrimination. He concludes that fricatives may, in many contexts, be perceived continuously, and not as a case of categorical perception. For intonational categories, there is some evidence that the distinctions are instances of categorical perception. Pierrehumbert and Steele (1989) report an experiment in which participants were asked to imitate stimuli in which the type of pitch accent was manipulated gradually between L*+H and L+H*. The intonation curves of the responses pattern as two distinct categories, despite the gradience of the input. Pierrehumbert and Steele (1989) hesitate to assign this observation to either the perceptual or the production system, though. Gussenhoven (2004: 62f) reports experimental results which suggest that perception of intonational contours is, indeed, categorical, at least in instances where different types of pitch accent are associated with different semantic and paralinguistic structures. Finally, Schneider et al. (2006) find that low and high boundary tones are perceived categorically in German. Based on these observations of categorical effects in intonation perception, it is a plausible hypothesis to treat prominence-lending pitch accents in compounds in categorical terms as well. This assumption provides an explanation for the large difference in disagreement between left-prominent and right-prominent compounds. In the case of left prominence, subjects have to discriminate between an accented and an unaccented element.
Categorical perception predicts a high discrimination rate, as these belong to different perceptual categories, and accordingly, the variability and disagreement between ratings is low. On the other hand, right-prominent compounds confront the participants with a within-category comparison, which is associated with low discrimination rates. This is what we have found above (see Figure 4.7), in particular when the ratings from all participants were included. Differences in proficiency among participants are also not unknown in categorical perception experiments. Repp (1984: 303f) reports a perception experiment in which some participants were highly successful in discrimination even for within-category comparisons, while others showed less proficiency in these cases. He concludes that different subjects may rely on different strategies in discrimination tasks (with a broad distinction between auditory strategies that utilize auditory cues separately, and phonetic strategies in which different cues are integrated into a phonetic category). These differences in discrimination strategies may lead to the difference in reliability of the raters that we found in the perception experiment at hand. The effect of training on prominence ratings reported in Ladd et al. (1994), who cautiously interpreted this effect as the result of different
listening strategies, may probably be attributed to the same phenomenon. However, such strategic differences call for explicit experimental verification, which cannot be provided by our data. In summary, we have seen that the predictions made by the autosegmental-metrical model of intonation outlined e.g. in Gussenhoven (2004) are, to a large extent, confirmed by the results of the perception experiment reported here. There are two perceptual classes of compounds, and the two classes differ in the element that is perceived as prominent. If we accept the presence or absence of a pitch accent on the right element as the decisive factor which distinguishes between the two classes, the larger variation of ratings for right-prominent compounds may be explained by the increased difficulty of discrimination between items belonging to the same category. However, the autosegmental model fails to account for two observations. First, we have seen that some compounds received clearly ambiguous ratings, so there are cases in which the perceived prominence structure depends, to a certain degree, on the listener. Second, we found compounds with ratings that fall clearly, and consistently, between the two peaks in the probability distribution. These compounds do not suggest a category of ‘level’ prominence, but their existence shows that there is much more variability in the perception of prominence than allowed in an autosegmental description. The traditional answer, of course, is to delegate this variability to performance. Indeed, Gussenhoven (2006) recently re-addressed the issue of how listeners perceive and classify different intonational patterns. He draws a sharp line between phonological and phonetic differences, the former being an important structural element of the language, the latter being the result of implementation variability. The necessity of this distinction, or even the plausibility of it, is not without dispute.
It is precisely this stance that is criticized by Kohler (2006). He perceives in most phonological accounts of intonation an invalid predominance of the notion of discrete categories, as this calls for unjustified simplifications of the phonetic facts (“Phonetic variability needs to be projected onto [discrete categories] without residue”, Kohler 2006: 125). A similarly critical stance is presented by Pierrehumbert, Beckman, and Ladd (2000) in their position article ‘Conceptual foundations of phonology as a laboratory science’. In their view, the competence/performance distinction that has been prevalent for a long time in phonological theory as a means of abstracting away from phonetic variation, and which is basically equivalent to ‘the old phonology-phonetics dichotomy’ criticized by Kohler (2006: 124), is methodologically unsound. While they maintain a distinction between a phonetic and a phonological analysis, they acknowledge that the variation inherent in natural speech data has to be included in the linguistic analysis:
All data about language come from performance, and all present difficulties of interpretation relating to the nature and context of the performance. Like scientists in other fields, we must assess the weight to assign to various types of data; statistics provide one tool for making such an assessment. But no matter how we weight the data, we must acknowledge that all data ultimately originate in performance. The notion that some data represent ‘mere performance’ does not in itself constitute sufficient grounds for discarding data. (Pierrehumbert et al. 2000: 290)
The next chapter, in which the relation between acoustic correlates and perceived prominence is examined, avoids such an unjustified treatment of the data retrieved in the perception experiment. The inherent gradiency of the acoustic form and of the auditory percept is taken for granted, but this does not entail the claim that the data are altogether unstructured. The occurrence of areas of high density shows without doubt that listeners are able to assign prominence in most compounds to one of the two concentrations in the available acoustic prominence space. These concentrations can be interpreted as phonological categories, thus showing the usefulness of the abstraction offered by a phonological description, but they do not have sharply defined boundaries, a finding that is perhaps always observable as soon as phonological categories are investigated based on the phonetic material. Thus, categorical membership is not seen to be assigned by a deterministic function with a binary output, but rather by a probability function with continuous output that reflects the likelihood of membership in one of the categories. This is in line with recent exemplar-based models of perception (most notably, Johnson 1997, Pierrehumbert 2001 for phoneme perception and Goldinger 1998 for word perception). In these models, perceptual categories emerge from acoustic similarities between the stored representations, and are not defined on the basis of discrete features. Consequently, they are highly capable of dealing with variable data, and with continuous category boundaries. A similar exemplar-based model for the present perceptual data seems to be a promising approach for future research. What is still unresolved is the question briefly raised above, namely whether the apparent categorical perception of prominence patterns is present only in the categorization of the listener, or whether speakers also produce two acoustically distinct prominence patterns.
It might be possible, for instance, that pitch, associated with accentuation and hence with perceptual prominence, is used by speakers as an acoustic continuum, with no obviously preferred pitch ranges. Categorical perception would still be possible, as the discrimination rate between two perceptual categories changes suddenly at a specific boundary. Thus, as Liberman et al. (1961) have shown, there is no linear mapping between closure duration in plosives and the categorization rate of a stimulus as voiced or unvoiced. On the other hand, it is also possible that speakers target two clearly distinct acoustic forms when they produce the two different prominence patterns of compounds. Indeed, Redi (2003) has shown that for the production of two different pitch accent types, speakers do show categorical effects in the production, a claim that had lacked empirical support. The next chapter, therefore, investigates the relation
between different acoustic properties and the perceived prominence pattern of a given compound.
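The probabilistic view of category membership described in this discussion can be made concrete with a small example. The following is a minimal sketch in the spirit of exemplar-based models, not an implementation of any of the models cited above: the function name `membership_probability`, the exponential similarity function, the `sensitivity` parameter, and all exemplar values (imagined here as pitch differences in semitones between the left and the right element) are assumptions made purely for illustration.

```python
import math


def membership_probability(stimulus, exemplars, sensitivity=1.0):
    """Return a probability of category membership for each category label.

    `exemplars` maps a category label to a list of stored values on a single
    acoustic dimension (hypothetical data, not taken from the corpus).
    """
    # Each stored exemplar is activated according to its similarity to the
    # stimulus; similarity decays exponentially with acoustic distance.
    activation = {
        label: sum(math.exp(-sensitivity * abs(stimulus - e)) for e in stored)
        for label, stored in exemplars.items()
    }
    total = sum(activation.values())
    # Normalizing the summed activations yields a continuous output
    # instead of a binary category decision.
    return {label: a / total for label, a in activation.items()}


# Invented exemplar clouds for the two prominence patterns.
exemplars = {
    "left": [5.0, 6.0, 7.0],
    "right": [0.0, 1.0, 2.0],
}

for stimulus in (6.0, 3.5, 1.0):
    probs = membership_probability(stimulus, exemplars)
    print(stimulus, {label: round(p, 2) for label, p in probs.items()})
```

A stimulus close to one cloud of exemplars receives a membership probability near 1 for that category, whereas a stimulus halfway between the two clouds receives roughly equal probabilities, which mirrors the continuous category boundary discussed above.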
5 Acoustic correlates of compound prominence
The acoustic characteristics of stress in English have been subject to a considerable amount of research, and acoustic differences between utterances with respect to their perceptual prominence have been subject to numerous studies in the past. For a long time, however, a phonetically adequate description was missing for stress phenomena. The earliest literature saw these perceptual differences to be realized primarily by differences in loudness, and assigned other acoustic parameters a secondary role. This becomes evident, for instance, in Bloomfield (1933: 110f), who uses the terms ‘stress’, ‘intensity’, and ‘loudness’ synonymously, or in Lutstorf (1960) who writes, ‘[s]o we may take it for granted that accent in English is primarily force of breath, though the other factors must not be neglected’ (Lutstorf 1960: 72). However, later experimental work such as that presented by Fry (1955, 1958) showed that this is a rather restricted view of perceptual prominence. Fry found length and pitch to play a much more important role in the perception of stress than previously thought. As shown below, his experiments indicate that pitch, duration, and intensity are all contributing to prominence, but that pitch is capable of overruling contradictory length and loudness cues. However, a framework that incorporated these results into a coherent and linguistically adequate theory of prominence phenomena was not introduced until the development of metrical phonology (cf. Liberman and Prince 1977) and the integration of metrical phonology and prosodic intonation by Pierrehumbert (1980) into what Ladd (2008: 43) termed the “autosegmental-metrical (AM) theory of intonational phonology”. In this framework, introduced briefly above in section 2.3, three different types of phonological prominence are identified, with each type corresponding to a different level of the prosodic hierarchy (cf. Selkirk 1984). 
On the lowest level there is the unstressed syllable, held to be marked by spectral characteristics of the vowel (i.e., presence or absence of vowel reduction). An unstressed syllable is also held to differ from a stressed syllable, the second level of phonological prominence, in intensity and duration: a stressed syllable remains unreduced and is longer and louder than an unstressed syllable (cf. Beckman and Edwards 1994, Sluijter and Heuven 1996a, Gussenhoven 2004: 14f). Prosodically, only stressed syllables may form the head of a foot (cf. Beckman and Edwards 1994: 9). The third and highest level of prominence marking lies in the presence of accentuation. Each stressed syllable is a potential point of association for a pitch accent, and an accented syllable is perceived as more prominent than an unaccented syllable (cf. Beckman and Edwards 1994: 12). In the autosegmental-metrical view, a
pitch accent is defined as a single or complex tone that cues prominence of the associated syllable (cf. Gussenhoven 2004: 17, Ladd 2008: 54). It follows from this definition that pitch features are the predominant acoustic correlates of accentuation, and intensity and duration are assigned only a secondary role. The identification of these different levels of perceptual prominence has been regarded as a significant contribution of the autosegmental-metrical framework to the understanding of stress phenomena. It allows a phonetically accurate description of the different levels involved. For instance, Beckman and Edwards (1994) note:
Many previous phonetic studies have compared the intensities, durations, and F0 excursions in “stressed” versus “unstressed” syllables without controlling systematically for the levels of the stress hierarchy involved. [...] Therefore, it comes as no surprise when such different corpora yield conflicting results. (Beckman and Edwards 1994: 17)
In the same spirit, Sluijter and Heuven (1996a) point out that the interpretation of experimental studies on acoustic correlates of prominence has to take the difference between stress and accent into account, which is not always evident in earlier research, and hence that “[m]uch of this research, however, suffered from co-variation of accent and stress” (Sluijter and Heuven 1996a: 630). As discussed above in section 2.3, current phonological theory assumes the difference between left-prominent and right-prominent compounds to be one of accentuation. The distinction between unstressed and stressed syllables is considered as largely irrelevant here, as both elements in a compound are fully specified lexical units consisting of at least one stress foot.¹ According to the view presented, for instance, in Ladd (1984) or Gussenhoven (2004), a left-prominent compound that is uttered in citation form has an accented left element and an unaccented right element, while in a right-prominent compound both elements are accented. Hence, the primary difference between the two types of prominence patterns is expected to lie in the pitch pattern of the right element, which should carry all characteristics of accentuation in the case of right prominence, but lack these characteristics in the case of left prominence. Thus, acoustic measures within the elements of NOUN + NOUN constructions should be compatible with those reported for the distinction between accented and unaccented syllables (cf., for instance, Sluijter and Heuven 1996a, Campbell and Beckman 1997). In particular, pitch in the two elements should be strongly dependent on the respective prominence pattern. The scarce acoustic data available for English
¹ Apparent counterexamples for this claim are cases involving man as the second element such as clergyman /ˈklɜdʒɪmən/. However, the exceptions seem to be restricted to lexicalized forms, and productively formed compounds of this type are expected to occur with an unreduced vowel in man (cf. Marchand 1969).
compounds, namely that reported in Farnetani et al. (1988), Plag (2006), and Adams (2007), seem to support this assumption. However, a review of the literature reveals that the distinction between an accented and an unaccented syllable is less straightforward than this description may suggest. In fact, it is shown below that not only pitch, but also intensity and duration have been found to co-vary with the stressed-unstressed as well as with the accented-unaccented contrasts. Just as it is hardly possible to assign one of these correlates to only one of these two contrasts, it is shown that an interpretation of accentuation as the effect of pitch targets alone is an invalid simplification. Because of this, the subsequent analysis is not restricted to pitch measurements, as might be seen appropriate in a strictly intonational view of compound prominence. Instead, a more complete set of measures is used that includes pitch, duration, and intensity as potential correlates of prominence. In the first section of this chapter, I first review the literature on the different acoustic properties that have been associated with prominence perception of spoken language. It is shown that accented and unaccented syllables usually differ in their pitch, intensity, and duration, and that creaky voice phonation is more likely to occur in non-prominent, and thus potentially unaccented, syllables. If the different compound prominence patterns are indeed realized by different patterns of accentuation, we may expect to find these differences reflected in the acoustic measures taken from the involved elements. The following section presents hypotheses about the distribution of these properties in left- and right-prominent NOUN + NOUN compounds on the basis of previous data on accentuation. These hypotheses are tested using the prominence ratings obtained in the perception experiment from the preceding chapter.
Linear mixed-effects models are used to reveal co-variations between the prominence rating that a given compound has received and the acoustic measures taken from the respective left and right element. The results from these models are discussed in the context of the autosegmental-metrical model in the final section.
5.1 Previous research
This section summarizes the phonetic properties that have been found to be associated with perceptual prominence in English. In general, a structure with a higher pitch, greater loudness, and longer duration is usually perceived as more prominent than a corresponding structure in which these properties are less pronounced. The following review of the literature presents these findings in more detail.
5.1.1 Pitch and fundamental frequency
Pitch is the perceptual sensation of tone height, which may be defined as “the attribute of auditory sensation that can be ordered on a scale extending from high to low” (Warren 1999: 56). In a pure tone, this attribute correlates very closely to the frequency of the sine wave of the tone, with minor influences of the amplitude of the wave (Moore 2005). While this relation holds, in general, also in complex tones, it has been frequently demonstrated that the relation between perceived pitch and periodicity of the signal, the tone’s ‘fundamental frequency’² or F0, may be less straightforward. The seminal series of experiments reported by Schouten (1940), for instance, demonstrates that listeners may still perceive the pitch of the fundamental frequency in a signal, even if the spectrum corresponding to that frequency has been filtered out. The “residual pitch”, in Schouten’s terms, is reconstructed from the higher harmonics still present in the signal. The fundamental frequency of voiced speech sounds is determined largely by the rate of glottal pulses per time unit (for a detailed description of the physiophysical mechanisms of the vocal fold cycle, see, for instance, Laver 1994). In the case of a steady state vowel with little timing variation of the glottal pulses, the perceived pitch is of the same frequency as the fundamental frequency of that vowel (Beckman 1986: 107f), and hence, pitch and fundamental frequency are frequently treated as near-equivalences for purposes of speech analysis. The use of the fundamental frequency as an acoustic operationalization of pitch has the advantage that this allows simple measurement and experimental manipulation of tones in speech. This research paradigm has been employed in many studies that addressed the relation of pitch and prominence within the autosegmental-metrical framework of intonation.
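The relation between the components of a complex tone and its fundamental frequency, and the reason why a pitch can in principle still be recovered when the lowest component is missing, can be illustrated with a short calculation. The sketch below assumes idealized, integer-valued harmonic frequencies (invented values); the helper name `fundamental_frequency` is hypothetical, and the calculation says nothing about how the auditory system actually derives the residual pitch.

```python
from functools import reduce
from math import gcd


def fundamental_frequency(component_frequencies_hz):
    """Largest frequency (in Hz) that evenly divides all component frequencies.

    Assumes idealized, integer-valued component frequencies.
    """
    return reduce(gcd, component_frequencies_hz)


# A complex tone with a 200 Hz fundamental and three higher harmonics.
full_spectrum = [200, 400, 600, 800]

# The same tone with the 200 Hz component filtered out.
filtered_spectrum = [400, 600, 800]

print(fundamental_frequency(full_spectrum))      # 200
print(fundamental_frequency(filtered_spectrum))  # 200
```

In both cases the remaining harmonics share the same periodicity, so the calculation returns the same value even though the filtered signal contains no energy at that frequency.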
Building on the privileged role of pitch in prominence perception that has been known since the 1950s (cf. Bolinger 1958, Fry 1958), it is central to current phonological theory of prominence (cf., in particular, Pierrehumbert 1980, Beckman and Pierrehumbert 1986, Gussenhoven 2004, Ladd 2008) to assume that placement of a phonological tone (i.e., a pitch accent) increases the prominence of that syllable. In other words, accented syllables are usually more prominent than other syllables that are not accented, and pitch is treated as the primary cue to this level of perceptual prominence. Experimental research has provided ample support for this assumption. For instance, Rietveld and Gussenhoven (1985) demonstrate that the prominence of an accented syllable increases if the size of the associated pitch excursion is increased. The exclusive role of pitch in accentuation is supported, for example, by Sluijter and Heuven (1996a), who find F0 to correlate only with the accented-unaccented distinction, but not with the
² Technically, the fundamental frequency is the greatest common divisor of the frequencies of the sine waves that constitute the complex sound (Johnson 2003: 9).
stressed-unstressed dimension. They explain previous findings in which pitch was asserted to be a correlate of lexical stress (Fry 1958 may be regarded as an example) to be the result of a co-variation in the experimental setup between stress and accentuation. Terken (1997: 97) summarizes this direct association between pitch and perceived prominence: “The outcomes of the experiments conducted so far agree in showing that the prominence associated with an accented syllable is proportional with the size of the F0 change: greater F0 changes tend to elicit higher prominence ratings.” However, this apparent direct relation between pitch and prominence is complicated in utterances containing more than just one accent, as Terken (1997) also notes. Starting with Pierrehumbert (1979), the prominence perception of two consecutive pitch peaks with interspersed low-pitched material has been studied in detail (cf. Rietveld and Gussenhoven 1985, Gussenhoven and Rietveld 1988, Ladd et al. 1994, Gussenhoven et al. 1997). Invariably, it has been found that the two pitch peaks do not have to be at the same level to be perceived as equally prominent. Instead, the later pitch peak may be perceived as equally prominent as the earlier peak, even though the actual pitch level of the later peak is not as high as that of the earlier peak. The explanation for this effect is seen in a steady decline of the pitch baseline that extends from the beginning to the end of an utterance.
Irrespective of pitch excursions induced by accentuation patterns or locally restricted variations such as the lowering of fundamental frequency in the vicinity of voiced obstruents (Kingston and Diehl 1994), this pitch declination appears to be determined, to a large extent, by the continuing decrease of the subglottal pressure during the course of an utterance (Collier 1975, Strik and Boves 1995).³ As a process that is largely beyond the control of the speaker, pitch declination is held to be expected by the listener, and the judgement of pitch excursion size is normalized with respect to the lowered baseline (Pierrehumbert 1979). Thus, pitch declination provides an explanation for the asymmetry between pitch peak height and perceived prominence in accent sequences. Gussenhoven and Rietveld (1988) conclude that “raters take the declination effect into account when judging the prominence of accent peaks, and thus judge later peaks to be more prominent than earlier peaks when F0 is equal” (Gussenhoven and Rietveld 1988: 386). Plag (2006) finds a correspondence of the declination effect also within NOUN + NOUN constructions. In his analysis, left-prominent NOUN + NOUN constructions showed a clearly higher pitch in the left element than in the right element. In the case of right-prominent constructions, the difference was much smaller; still, only a minority of cases with right prominence had a higher pitch in the right element. Plag argues that, due to pitch declination, right-prominent
³ For a detailed discussion of pitch declination, also with focus on the degree of grammaticalization and speaker control of the phenomenon, see Gussenhoven (2004: ch. 6).
compounds should not be expected to have equal pitch in both elements in general. Plag does not make an attempt to relate the observed pitch measurements to patterns of pitch accentuation, but his data agrees well with the interpretation of NOUN + NOUN prominence patterns in terms of pitch accents. Further support for such an association between prominence patterns and F0 peaks can be found in Farnetani et al. (1988). They report that the pitch peak difference between the left element and the right element is significantly larger if the structure is left-prominent than in the case of right-prominent structures. In addition, they find that the direction of the pitch contour at the boundary between the left and the right element is rising for right-prominent compounds, signalling a second tonal target in the structure, and conclude that in right-prominent compounds, each element is accented, while the right element is unaccented if the structure is left-prominent. However, the analysis by Farnetani et al. does not provide robust evidence for this association in all sentential positions. The observed association between pitch peak difference and prominence pattern was found to be significant only in sentence-final position. Whether this is due to the small sample size (their data is restricted to four variations of a single NOUN + NOUN construction spoken by five different speakers) or to a systematic interaction of sentence intonation with compound accentuation is not clear, as other experimental research yielded contradictory results. Plag (2006) reports that the right-prominence of a subtype of NOUN + NOUN constructions is neutralized in sentence-initial position, while in medial and final position, this subtype (NOUN + NOUN compounds in which the first element denotes the author of a literary or musical work, e.g. Mahler opera) shows consistent right-prominence.
This finding contrasts with Adams (2007), whose data suggests that in clause-final or sentence-final position, it is not possible to claim confidently that there is a single pitch accent within left-prominent structures and two pitch accents within right-prominent structures.⁴ In sum, while there is some evidence that the distribution of pitch accents within a NOUN + NOUN structure is directly linked to the perceived prominence pattern, the evidence is still sketchy in its details. The number of different stimuli in Farnetani et al. (1988) is too limited to permit a generalization about NOUN + NOUN constructions in general. Plag (2006) uses a deliberately gradient approach to investigate the different stress patterns in compounds. This approach does not allow a direct interpretation of his results in an autosegmental-metrical framework.
⁴ It is unclear, however, to what extent these findings are due to the stimuli used. For example, in the sentence I thought the wind was strong at the beach yesterday, but I didn’t see the whitecaps, where whitecaps is the construction under examination, it is not unthinkable that many speakers placed a nuclear accent on the preceding see. Such a nuclear accent blocks any subsequent pitch accent (cf. Beckman and Edwards 1994) that might contribute to the prominence pattern of whitecaps in other environments.
Furthermore, the interaction with sentential position is as yet unclear, as Farnetani et al. (1988), Plag (2006), and Adams (2007) report conflicting effects of this factor. The relative height of F0 peaks is, however, only one aspect of the shape of a pitch contour, and it is possible that other parameters also affect the perceived prominence of accented syllables. Indeed, Hermes and Rump (1994) show that there is a difference in perceived prominence between falling and rising pitch contours. They compared the perceived prominence of syllables with falling and rising pitch and found that if both were perceived with equal prominence, the excursion size of the fall was reliably smaller than that of the rise. If the excursion size is held constant, a fall is perceived as more prominent than a rise (cf. also Terken and Hermes 2000: 109). Yet, the contribution of pitch change direction and range of pitch change to compound prominence patterns is not well-researched. Farnetani et al. (1988) describe the direction of the pitch change as falling in both elements, irrespective of the prominence pattern. In the case of right prominence, however, they observe the fall on the right element to be a high fall. Still, their analysis leaves open whether this description allows a larger generalization of the effect of the pitch contour slope on perceived prominence, or whether the effect is due to the particular structure of their stimuli. Adams (2007), who measured the pitch minima and maxima within the left and right element of ADJ + NOUN structures, reports that the pitch range, i.e. the difference between pitch maximum and minimum, is larger in right-prominent structures than in left-prominent ones.
A possible interpretation of this finding within an autosegmental-metrical view is that the pitch accent on each element of a right-prominent compound is significantly more frequently one of the complex accent types, while the single pitch accent on the left element of left-prominent compounds is more frequently a H* accent. This tentative interpretation of Adams’s findings is further complicated by interactions with the sentential position in which the structure occurs. The difference in pitch ranges between left- and right-prominent structures was observable only in the final position of subordinate clauses and in subject position, but not in statement-final position. In this position, the pitch measurements in the left and the right element did not differ significantly between the two prominence patterns. This is in contradiction to Farnetani et al. (1988), who did not find a significant effect of pitch range in either the left or in the right element, irrespective of the sentential position. In summary, pitch has been found to affect the perception of prominence and stress in numerous ways. First, a syllable that is associated with a tonal target, and thus carries a pitch accent, is perceived as more prominent than a syllable without such a link to the intonational contour. Second, in two consecutive pitch accents, the second pitch accent does not have to surpass the first pitch accent in pitch height to be perceived as equally or even more prominent due
to the declination effect. Third, the shape of the pitch contour also influences the perceived prominence, and a falling pitch seems to be perceived as more prominent than a rising contour. The scarce experimental data obtained for NOUN + NOUN constructions is, by and large, compatible with the view that prominence patterns in compounds are realized by different distributions of pitch accents, as assumed in the autosegmental-metrical framework. In general, left-prominent compounds seem to have a pitch accent on the left element, while in right-prominent compounds, both elements are accented (cf. Farnetani et al. 1988: 171). Yet, a controlled analysis that tests the compliance of NOUN + NOUN prominence patterns with pitch accent theory on the basis of a sufficiently large data set is still lacking. Furthermore, the influence of the sentential position in which the compound occurs is as yet unclear and needs further clarification. The same is true for other acoustic properties such as intensity, which is investigated in the next subsection.
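The second point of this summary, the normalization of peak height against a declining baseline, can be pictured with a small numerical sketch. It assumes, purely for illustration, a linear baseline in Hz and expresses each peak as an excursion above that baseline in semitones; the function names, the starting value of 120 Hz, and the slope of -10 Hz per second are hypothetical and are not taken from the studies cited above.

```python
import math


def semitones(f_hz, ref_hz):
    """Interval between two frequencies in semitones."""
    return 12 * math.log2(f_hz / ref_hz)


def baseline_hz(t_s, start_hz=120.0, slope_hz_per_s=-10.0):
    """Hypothetical baseline that declines linearly over the utterance."""
    return start_hz + slope_hz_per_s * t_s


def excursion_above_baseline(peak_hz, t_s):
    """Size of a pitch peak relative to the baseline at the time of the peak."""
    return semitones(peak_hz, baseline_hz(t_s))


# Two peaks of identical height (180 Hz), one early and one late.
early = excursion_above_baseline(180.0, 0.2)
late = excursion_above_baseline(180.0, 1.2)
print(round(early, 2), round(late, 2))

# A later peak that is lower in Hz but has the same excursion as the early peak.
lower_late_peak_hz = 180.0 * baseline_hz(1.2) / baseline_hz(0.2)
print(round(lower_late_peak_hz, 1))
```

Under these assumptions, the later of two equally high peaks has the larger excursion relative to the baseline, and a later peak can be lower in absolute terms while matching the excursion of the earlier one.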
5.1.2 Loudness, intensity and spectral balance

Much of the earlier literature such as Bloomfield (1933) and Lutstorf (1960) shows a considerable inconsistency in the terminology and operationalization of loudness phenomena. This is particularly evident, for instance, in Lutstorf (1960), who uses terms such as “force of breath”, “loudness”, “intensity”, and “amplitude” interchangeably. Since then, advances in the psychology of perception and in acoustic and auditory phonetics have brought about a significant clarification of concepts that pertain to the volume of a sound. This chapter follows the conventions found, for instance, in Moore (2005), who defines loudness as the “attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud” (Moore 2005: 409). ‘Intensity’ is understood as a function of sound pressure level that changes during the production of a sound, and is one of the acoustic correlates that contribute to the perception of loudness. The ‘amplitude’ of a sound wave can be defined as the distance between a positive and a negative peak in the waveform when seen, for instance, in an oscilloscope. When calculated over the course of an analysis window (RMS amplitude), it is a measure of acoustic intensity (cf. Johnson 2003: 31), and accordingly, this chapter uses ‘intensity’ and ‘amplitude’ synonymously unless noted differently. The relation between loudness and intensity is thus similar to that between pitch and fundamental frequency in that the first refers to the perceptual sensation and the second to a physical correlate. However, while pitch and F0 are very closely tied to each other, loudness is influenced by numerous additional factors other than the intensity of the signal. For instance, a sound is perceived as less
loud if presented after a sound of high loudness (Moore 2005). It has also long been observed that loudness is affected by the frequency of the signal: two sounds of equal intensity are perceived at different loudness levels if their frequency is not the same (S. S. Stevens 1935). An overview of factors contributing to loudness is presented in Beckman (1986: ch. 5). Despite these additional factors that influence perceived loudness, most acoustic studies use intensity or RMS amplitudes as an approximation of loudness. This is also the case with most studies in which the acoustic cues of prominence are addressed. Advances in the operationalization of loudness as cued by intensity have led to a better understanding of the role of loudness in the perception of prominence. Even though experiments such as Fry (1955, 1958) and Bolinger (1958) have challenged the primacy of loudness as a cue to perceived prominence, the data presented in Fry (1955) and Fry (1958) indicate that intensity does contribute significantly to the perception of prominence, a finding that has been replicated in numerous later experiments (e.g. Adams and Munro 1978, Beckman and Pierrehumbert 1986). A more prominent syllable is shown to have a higher overall intensity, and it is possible to shift the perceived location of prominence from one syllable to another by manipulation of the syllable intensities. However, these studies make no attempt to control the phonological level of prominence, and thus they leave it unresolved whether the observed intensity differences are associated with lexical stress or with accentuation. Sluijter and Heuven (1996a) address this, and while they find differences in intensity both along the stressed-unstressed and the accented-unaccented distinction, they also find that the difference is smaller for the distinction between stressed and unstressed syllables than for that between accented and unaccented syllables. 
Thus, they conclude that “there is a small effect of stress [. . .] on overall intensity, but that there is a considerably larger effect of accent” (Sluijter and Heuven 1996a: 632). As noted above, intensity or amplitude is frequently used to approximate perceived loudness, an approximation that abstracts away from the differences in perceived loudness between tones of different frequencies. Kochanski et al. (2005) instead use a loudness measure that attempts to replicate auditory loudness in a more realistic way, namely a modified version of the model presented in S. S. Stevens (1972). In this model, the total loudness of a sound is calculated as a weighted sum of the loudnesses of different equidistant frequency bands. Their analysis shows that loudness, if modelled in an auditorily adequate way, is an extremely reliable predictor of accentuation that is even capable of outperforming pitch in their classification models. In the light of this, it may be assumed that intensity plays a significant role in compound prominence patterns as well, an assumption that is supported by the two pertinent experimental studies, Farnetani et al. (1988) and Adams (2007). Farnetani et al. (1988) report an overall decrease of intensity from left
to right element, regardless of the prominence pattern. Yet, more importantly, other things being equal, the intensity difference between the two elements in a left-prominent NOUN + NOUN structure is larger than in the corresponding right-prominent structure. The study also finds an interaction of intensity difference with sentential position, with the difference being significantly larger sentence-finally than sentence-initially. These findings agree with those reported in Adams (2007), who likewise finds that intensity in the left element is significantly higher than in the right element if the construction is left-prominent. Although she does not comment on this, her data implies that the contrast between intensity differences in left-prominent and in right-prominent constructions interacts with sentential position. It seems that the intensity difference in sentence-initial position is much more similar between left-prominent and right-prominent constructions than in clause-final or statement-final position, which agrees with the findings reported by Farnetani et al. (1988). However, Adams (2007) does not report whether this contextual difference is, indeed, statistically significant. As indicated already in Farnetani et al. (1988), intensity shows a tendency to decrease steadily over the course of an utterance, similar to the declination observed for pitch that has been described above (cf. Fant 1997, Trouvain and Traunmüller 1998). However, the implications of intensity declination for the perception of prominence are not well understood, and have not been the topic of systematic exploration comparable to pitch declination, possibly due to the nearly exclusive focus that the relation between pitch and prominence has received. Still, it seems plausible to assume that similar mechanisms of listener expectation affect the perception of two consecutive intensity peaks, so that the second peak does not need to surpass the first peak to be perceived as more prominent. 
The data reported in Farnetani et al. (1988) and Adams (2007) agree with this assumption, but a detailed examination of the mechanisms is not available as yet. What most studies on intensity reviewed thus far have in common is that they assume that an increased perceived prominence is associated with an increased overall intensity, that is, with an increase of energy across the whole frequency spectrum. Fairly recently, this assumption has been challenged. Sluijter et al. (1997) investigate the influence of both overall intensity increases as well as intensity increases restricted to the higher bands of the frequency spectrum (i.e., an increase in ‘spectral balance’)5 on the perception of lexical stress. While overall intensity does turn out to be a significant predictor of lexical stress, Sluijter et al. find that spectral balance is much more important for the stressed-unstressed distinction than overall intensity. In a stressed syllable, intensity is
5 There are different ways of operationalizing the distribution of intensity across the frequency spectrum, and similarly, the terminology applied to it may also vary between authors. Other terms in use include, for instance, “spectral tilt” and “spectral emphasis”. Heldner (2003) presents an overview and an attempt to establish a coherent terminology.
spread across the whole frequency spectrum more or less evenly (disregarding the natural decrease of intensity with increasing frequency that is due to the shape of the vocal tract resonators, see K. N. Stevens 1998: ch. 3). An unstressed syllable, in contrast, has significantly less intensity in the higher ranges of the spectrum. Sluijter et al. conclude that spectral balance, “implemented as the acoustical reflection of greater vocal effort, is a reliable stress cue, close in strength to duration” (Sluijter et al. 1997: 511), thus clearly outperforming overall intensity as a cue to stress. A similar association between spectral balance and lexical stress has been reported for other languages as well, such as Dutch (Sluijter and Heuven 1996a,b), Swedish (Heldner et al. 1999), and Catalan (Astruc and Prieto 2006). While these findings suggest that a flat spectral balance is robustly associated with the unstressed-stressed distinction (a position also advanced by Gussenhoven 2004: 20), this association could not be replicated in Campbell and Beckman (1997). While observing a difference in spectral balance between accented and unaccented syllables, Campbell and Beckman “found virtually no difference for this measure between stressed versus unstressed syllables in the absence of the accent contrast” (Campbell and Beckman 1997: 70). Thus, the association of spectral balance with lexical stress is not undisputed, and needs further exploration. Furthermore, the effect of spectral balance has not been investigated in the context of NOUN + NOUN prominence patterns at all. In sum, previous research suggests that intensity is strongly associated with accentuation, and less strongly with lexical stress. 
Accordingly, intensity may be expected to be important in the discrimination of NOUN + NOUN prominence patterns as well, and indeed, the experimental data shows that the left element has, on average, a higher intensity than the right element if the whole compound is left-prominent. The experimental results imply that sentential position may interact with this, but the evidence is not fully convincing. Spectral balance, on the other hand, appears to be primarily a correlate of lexical stress, and less so of accentuation. If we follow the description outlined in section 2.3 and assume that the different compound prominence patterns are realized by different accent patterns, it is expected that there is no notable association between spectral balance and the prominence pattern of a compound. Applicable experimental data is not available yet.
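The two intensity-related measures reviewed in this subsection can be illustrated with a minimal sketch. It computes an RMS level over an analysis window and a simple two-band energy difference as a stand-in for spectral balance. The function names, the 500 Hz band boundary, the reference amplitude, and the synthetic test signals are all assumptions made for illustration; they do not reproduce the operationalizations used in the studies cited above.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude of the samples in one analysis window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_db(samples, reference=1.0):
    """RMS level in dB relative to an assumed reference amplitude."""
    return 20.0 * math.log10(rms_amplitude(samples) / reference)

def band_energy(samples, sample_rate, f_low, f_high):
    """Summed squared DFT magnitude for bins with f_low <= f < f_high (naive DFT)."""
    n = len(samples)
    energy = 0.0
    for k in range(n // 2 + 1):
        if not f_low <= k * sample_rate / n < f_high:
            continue
        angle = 2.0 * math.pi * k / n
        re = sum(s * math.cos(angle * i) for i, s in enumerate(samples))
        im = sum(s * math.sin(angle * i) for i, s in enumerate(samples))
        energy += re * re + im * im
    return energy

def spectral_balance_db(samples, sample_rate, split_hz=500.0):
    """Level above split_hz minus level below it, in dB (assumed two-band measure)."""
    tiny = 1e-12  # guards against log(0) if a band happens to be empty
    low = band_energy(samples, sample_rate, 0.0, split_hz)
    high = band_energy(samples, sample_rate, split_hz, sample_rate / 2 + 1.0)
    return 10.0 * math.log10((high + tiny) / (low + tiny))

def tone(freq_hz, amplitude, sample_rate=8000, n=400):
    """Synthetic sine wave, used only to exercise the functions above."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# Two artificial signals: same low component, different high-band energy.
flat = [a + b for a, b in zip(tone(200, 1.0), tone(2000, 0.8))]
tilted = [a + b for a, b in zip(tone(200, 1.0), tone(2000, 0.1))]

for name, signal in (("flat", flat), ("tilted", tilted)):
    print(name, round(level_db(signal), 2),
          round(spectral_balance_db(signal, 8000), 2))
```

In this toy setting the two signals differ only modestly in overall level but strongly in the band difference, which is the kind of dissociation between overall intensity and spectral balance that the studies above are concerned with.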
5.1.3 Duration

Early experimentation reported by Fry (1955) indicates that a longer duration may contribute to the perception of a syllable as prominent, and consequently, duration has been regarded as one of the acoustic correlates of stress phenomena. In the autosegmental-metrical model, however, it is usually regarded as a clearly
secondary acoustic cue. Some authors (for instance, Terken and Hermes 2000: 100) assign the influence of duration to the domain of lexical stress, and thus imply that the contribution of duration to accentuation is negligible. However, Turk and Sawusch (1996) and Turk and White (1999) have shown that the duration of a syllable increases significantly if an accent is placed on it. Measuring the durations of primary stressed syllables of identical material both in accented and in unaccented position, Turk and White (1999) found that the lengthening effect of accentuation averaged between 15.9 percent and 23.4 percent, depending on whether the stressed syllable was part of a disyllabic or a monosyllabic word, respectively. Similar duration differences between accented and unaccented words are reported in Jong (2004). Nevertheless, it is worth noting that duration seems to be associated not only with accentuation, but also with lexical stress. Sluijter and Heuven (1996a) and Jong (2004), whose experiments control for these two different levels of prosodic prominence, consistently find that duration co-varies with either level. Thus, on average, a stressed syllable is longer than an unstressed syllable, and an accented syllable is longer than an unaccented one. In agreement with these findings, Farnetani et al. (1988) also report significantly different durational patterns between left-prominent and right-prominent NOUN + NOUN structures. They found not only that right-prominent compounds were significantly longer than left-prominent compounds containing the same two elements, but also that the right element is significantly longer if the structure is right-prominent, while the duration of the left element is invariant with respect to the prominence pattern. Adams (2007) reports a similar durational difference that depends on the prominence pattern, at least for one of her experimental conditions. 
She found that on average, the left element was only slightly shorter than the right element if the structure was left-prominent, but significantly shorter if the structure was right-prominent. This effect was significant only in clause-final position and not in subject or sentence-final position. While Adams does not report absolute duration values, the ratios given for left- and right-prominent ADJ + NOUN structures in clause-final position suggest that the duration difference is of an order similar to that reported by Turk and White (1999) and Jong (2004) for effects of accentuation. However, from the autosegmental-metrical viewpoint, it is somewhat unexpected to find an apparent shortening of the left element if it is not prominent, as this element is claimed to be accented regardless of the prominence pattern.6
6 Adams (2007: 1003) reports an “overall trend of [left element] lengthening” which she considers “consistent for [left-prominent] compounds in all intonational environments”. Given that she does not report her average durations for left and right elements, but only duration ratios, one may wonder to what extent her interpretation of the data is the only plausible one. Her duration ratios (left duration divided by right duration) approximate 1.0 in the presence of left prominence, but are clearly smaller than that
In sum, duration is claimed to be a reliable correlate of accentuation in general, to the effect that an accented syllable is, on average, longer than an unaccented one. Likewise, the different prominence patterns observed in NOUN + NOUN constructions are also reflected in the relation between the durations of the two elements. The existing findings that relate to NOUN + NOUN constructions are, however, not fully compatible with an autosegmental-metrical interpretation in so far as the different prominence patterns seem to affect mostly the left element. However, the previous analyses are somewhat contradictory on this point, and the temporal pattern in the two different types of NOUN + NOUN constructions is as yet unclear. After this review of the research on pitch, intensity, and duration in the different stress and accent conditions, the next section briefly investigates the relation between these conditions and the phonation mode that is employed, as the literature provides reason to assume that the choice of phonation is also associated with different levels of perceptual prominence.
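Why duration ratios alone leave the interpretation open can be shown with a small numerical sketch. The millisecond values below are invented purely for illustration and are not taken from Adams (2007), Farnetani et al. (1988), or any other study; the helper name is likewise hypothetical.

```python
def duration_ratio(left_ms, right_ms):
    """Left duration divided by right duration."""
    return left_ms / right_ms

# Hypothetical baseline: both elements equally long under left prominence.
left_prominent = (200.0, 200.0)

# Hypothetical scenario A: under right prominence the LEFT element is
# shortened by 20 percent and the right element is unchanged.
shortened_left = (160.0, 200.0)

# Hypothetical scenario B: under right prominence the RIGHT element is
# lengthened by 25 percent and the left element is unchanged.
lengthened_right = (200.0, 250.0)

for label, (left, right) in (
    ("left-prominent baseline", left_prominent),
    ("right-prominent, left shortened", shortened_left),
    ("right-prominent, right lengthened", lengthened_right),
):
    print(f"{label}: ratio = {duration_ratio(left, right):.2f}")
```

Both right-prominent scenarios produce the same ratio, so a reported ratio below 1.0 does not by itself show which of the two elements changed; absolute durations would be needed to decide between a shortening and a lengthening account.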
5.1.4 Non-modal phonation

While modal phonation mode is most frequently encountered in non-pathological speech, speakers are also capable of employing other modes of glottal vibration. The most frequent non-modal phonation mode found in English is creaky voice phonation, which Gerratt and Kreiman (2001) define as “a train of discrete laryngeal excitations, or ‘pulses’, of extremely low frequency, with almost complete damping of the vocal tract between excitations” (Gerratt and Kreiman 2001: 375f).7 In contrast to languages such as Jalapa Mazatec, English is generally not assumed to make use of different phonation modes on a phonological level (Gordon and Ladefoged 2001). However, creaky voice is occasionally employed in English to express paralinguistic information such as boredom, and also occurs as an idiosyncratic, speaker-specific speech characteristic (cf. Laver 1994: 196).
in right prominence. In other words, in left-prominent compounds, both durations are nearly identical, while the left element is shorter than the right element in right-prominent compounds. Nothing in her data suggests that there is an intrinsic duration difference between left and right elements. Thus, in the light of Turk and White (1999), the data in Adams (2007) also seems to speak more in favour of a shortening effect of the left element in case of right prominence, and less in favour of the interpretation given by Adams of a lengthening effect of the left element in the case of left prominence.
7 Other terms such as ‘glottalization’ and ‘vocal fry’ are also in frequent use. Gerratt and Kreiman (2001) provide an overview of the terminology found in the literature, as well as a summary of the acoustical and perceptual aspects of this phonation mode.
More relevant to the present topic is the observation that there is a relation between voice quality and prosody in English. For instance, Dilley et al. (1996) report that creaky voice is not infrequently found in word-initial vowels that occur at phrase boundaries. Likewise, Epstein (2002) reports that non-modal phonation is frequent in syllables with low boundary tones. In addition, she found non-modal phonation to occur more frequently if the word is not prominent. It is the latter finding that suggests a relation between creaky voice and the prominence patterns of English compounds. Apart from word-initial glottalization, we may expect to find creaky voice phonation almost exclusively in the right element of left-prominent constructions if the autosegmental-metrical view is taken. The right element of right-prominent constructions, as well as the left element regardless of prominence pattern, is expected to be accented, thus blocking non-modal phonation. This expectation is supported by an impressionistic investigation of the data from the Boston corpus, but no conclusive, methodologically sound investigation of such a co-variation of non-modal phonation and the prominence pattern of NOUN + NOUN constructions in English is currently available. Strictly speaking, the phonation mode is a potential articulatory correlate of prominence, not an acoustic one such as intensity, fundamental frequency, and duration. This is even more so the case as creaky voice will be operationalized not on the basis of the acoustic correlates associated with itself, but in a categorical fashion as outlined below in section 5.2.5. Nevertheless, despite this slight methodological inconsistency, the fact that creaky voice may play a role in discriminating between left and right prominence seems sufficient to justify its inclusion in the present analysis.
5.1.5 Summary and research questions

This section has reviewed the acoustic correlates that contribute to perceived prominence. The autosegmental-metrical framework assumes that prominence beyond the level of lexical stress is usually cued by the presence of a pitch accent on the respective element, and thus provides a theory of how the two prominence patterns found in English NOUN + NOUN constructions are implemented. Hence, the focus of the review has been on the correlates of accentuation. In addition, we have reviewed previous studies that investigate the acoustic characteristics of compounds. The survey reveals that the description of accentuation merely in terms of pitch contours does not provide a sufficient account of an accented syllable. Such a description is found, for instance, in Terken and Hermes (2000: 101), who assign intensity and duration the role of primary cues to lexical stress, and pitch the role of primary cue to accentuation. A highly similar account is found in
Gussenhoven (2004: 20). However, there is ample experimental evidence that an accent is characterized by an increased intensity and a longer duration, in addition to the marked pitch excursion. Correspondingly, it is to be expected that all three acoustic dimensions are also involved in signalling the difference between a left-prominent and a right-prominent NOUN + NOUN construction. While the experimental data available for English compounds is limited, the results are largely compatible with the claim that the difference between a left-prominent and a right-prominent NOUN + NOUN construction lies primarily in the accentuation of the right element. This is concluded in the experiments reported by Farnetani et al. (1988), but their results are based on the acoustic analysis of a very small number of tokens. The data used in Adams (2007) are more extensive, but the primary aim of her analysis is not to investigate the acoustic correlates of the different prominence patterns, but the interaction of these patterns with larger prosodic units. Hence, it is not fully possible to derive an acoustic description of left and right prominence from her analysis. Furthermore, her stimuli consist of contrasting phrasal and lexical ADJ + NOUN constructions. It does not necessarily follow that her findings for ADJ + NOUN phrases also extend to NOUN + NOUN compounds that have right prominence. The aim of the present analysis, then, is to see to what extent the potential acoustic properties (characteristics of the pitch contour, intensity, duration, spectral balance) are interrelated with the perceived prominence patterns in NOUN + NOUN constructions. This investigation allows us to test the claim proposed within the autosegmental-metrical framework, namely that left-prominent compounds have a single accent on the left element, while right-prominent compounds have an accent on both elements. This claim is made explicit in Farnetani et al. 
(1988), but on a very limited data base only, as well as in Gussenhoven (2004), but here from a rather theoretical viewpoint. The remainder of this chapter addresses empirically the question of whether the acoustical measurements are compatible with the autosegmental-metrical view. The Boston corpus continues to serve as the data source for the analysis. As the discussion in Kochanski et al. (2005) suggests, most studies of correlates to perceived prominence have focused on synthetic or reiterant speech, but the results of such studies may possibly exaggerate the influence of pitch. On the other hand, the body of literature that uses natural speech obtained in a non-laboratory setting is still very small, and it is largely an open question to what extent the effects observed in controlled settings remain relevant in less artificial contexts. Thus, the analysis uses the perceived prominence rating from the perception experiment reported in chapter 4 and investigates in separate statistical models the association between this rating and the different acoustic correlates that are potentially involved in accentuation.
Based on the assumed distribution of accents, we may formulate specific hypotheses about this influence. Given that in right-prominent compounds both elements are expected to be accented, the two elements should not differ largely in their average pitch, duration, and intensity. The right element may have a lower pitch than the left element, due to the declination effect, but the difference should not be overly large. A similar prediction is made for intensity, which should also be only slightly lower in the right element. Duration, however, is expected to be, on average, the same in both elements. For the other acoustic properties discussed so far, the predictions are less clear. It has been shown above that pitch falls are perceived as more prominent than contours with level or rising pitch. Thus, if there is a significant contribution of the pitch slope to perceived prominence, we may expect the pitch slope in the right element of right-prominent compounds to show a steeper fall than in the left element, as this would emphasize the prominence of the right element. Regardless of the question raised above whether spectral balance is associated with lexical stress or with accentuation, the two elements should not differ in the distribution of intensity along the frequency spectrum for right-prominent compounds. Finally, modal phonation should be prevalent for all elements. In left-prominent compounds, the left element is accented as in right-prominent compounds. We expect its average pitch, duration, and intensity to be comparable to those of the left element in right-prominent compounds. However, as the right element is unaccented, it is expected to show a clearly lower pitch, shorter duration, and lower intensity than the left element, and hence also than the right element of right-prominent structures. If pitch slope is significant in left-prominent compounds, we expect a falling slope on the left element, thus increasing its prominence. 
The spectral balance is expected to differ between the left and the right element only if this property is associated with accentuation, as proposed by Campbell and Beckman (1997). In this case, spectral balance of the left element should be lower than that of the right element, indicating that intensity is more evenly distributed in the prominent element than in the non-prominent one. If non-modal phonation is used to mark non-prominent material, we expect to find more creaky voice phonation in left-prominent compounds, but only in the right element.
5.2 Material and measurements

The 105 compounds from the perception experiment in chapter 4 also formed the data base for the present analysis of the acoustic correlates of compound prominence patterns, which are operationalized by the continuous median prominence ratings described above. The acoustic measures were obtained with the phonetic analysis software Praat (Boersma and Weenink 2007). The measurement intervals were manually marked in Praat textgrids on the basis of a combined waveform and spectrogram view. The usual criteria for determining segment boundaries were applied (see Ladefoged 2003 and Turk et al. 2006 for summaries of these criteria). After the segmentation, a script was run that measured various acoustic properties as described below for all 210 measurement intervals. In each element of the NOUN + NOUN constructions, the measurement interval consisted of the syllable with primary stress. Due to metrical operations that may take effect in compounds (such as stress shift and early accent placement, cf. Shattuck-Hufnagel et al. 1994), the primarily stressed syllable is not always the one that normally receives primary stress if the respective element occurs in isolation. This is particularly true for the left element, where the primary stress frequently moves towards the left edge of the compound. The examples in (1) show compounds from the present data set in which such a stress shift has taken place. The primary stressed syllable within the left and right element is indicated by an accent.8
(1) tábulation érrors
    táxation commíttee
    Mássachusetts tówns
For the present analysis, the sonorant part of the rime of each element was chosen as the relevant measurement interval, including post-vocalic nasals, liquids, and approximants, similar to the approach taken, for instance, in Beckman and Pierrehumbert (1986: 150). The decision to include these sonorants as well, if present, was made primarily on practical grounds. While it is fairly easy to draw a segmentation line between a vowel and a nasal, a boundary between a vowel and a following liquid or approximant can often be drawn only in an arbitrary fashion (see, for example, Ladefoged 2003: 98 and Turk et al. 2006: 14 for discussions of the problems involving the segmentation of these sound sequences). The present procedure circumvents these difficulties. A further technical advantage is that the inclusion of the sonorant part of the rime maximizes the measurement interval, which is useful for measurements that require a minimum duration of the signal. Pitch is one of these measurements. The algorithm used in Praat, similar to other pitch tracking algorithms, presupposes a minimum duration of $3 / f_{\min}$, where $f_{\min}$
8 The fact that all examples in (1) are right-prominent compounds is not incidental. Shattuck-Hufnagel et al. (1994) show that early accent placement occurs particularly frequently if the following word is also accented, which is assumed to be the case in right-prominent compounds.
is the minimum acceptable pitch for the measurement interval. This minimal duration fulfils the requirement that at least two complete waveform cycles are included in the algorithm windows (cf. Ladefoged 2003: 77f for an illustration of this requirement). Thus, for an $f_{\min}$ of 75 Hz, the default for male speakers, a pitch may only be extracted if the duration of a given measurement interval is $3 / 75\,\mathrm{Hz} = 0.04\,\mathrm{s}$ or longer. Accordingly, inclusion of the sonorant part of the rime increases the chances for a successful pitch extraction. These arguments in favour of maximizing the length of the measurement interval might be extended to include sonorant parts of the onset as well. However, it is a widespread assumption in Metrical Phonology to regard the onset of a syllable as irrelevant in the process of stress assignment to a syllable.9 Thus, if the onset is considered irrelevant for the assignment of stress in a syllable, it seems plausible to assume that the onset is also largely irrelevant for the location of accents. The syllable onset was therefore not included in the measurement interval.
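The minimum-duration requirement for pitch extraction described above amounts to simple arithmetic, sketched below. The function names are hypothetical, and the factor of three periods is taken from the text rather than read from any software; the 50 Hz floor is an invented example of a lowered setting, not a value reported here.

```python
def min_interval_duration(pitch_floor_hz, periods=3):
    """Shortest interval (in seconds) spanning `periods` cycles at the pitch floor."""
    return periods / pitch_floor_hz

def can_extract_pitch(interval_duration_s, pitch_floor_hz):
    """True if the interval is long enough for the assumed 3 / f_min requirement."""
    return interval_duration_s >= min_interval_duration(pitch_floor_hz)

# Pitch floors: 75 Hz and 100 Hz are the male and female defaults named in
# the chapter; 50 Hz is a hypothetical lowered floor.
for floor_hz in (75, 100, 50):
    print(f"{floor_hz} Hz floor: minimum duration "
          f"{min_interval_duration(floor_hz):.3f} s")

# A hypothetical 35 ms rime is too short for a 75 Hz floor
# but long enough for a 100 Hz floor.
print(can_extract_pitch(0.035, 75), can_extract_pitch(0.035, 100))
```

The sketch also makes the trade-off visible: lowering the pitch floor lengthens the minimum interval, which is one reason why a longer measurement interval is an advantage.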
5.2.1 Pitch measurements

There are different possibilities of measuring pitch characteristics if the interplay with the prominence pattern is to be examined. One obvious choice is to consider the F0 peak in the measurement interval, an approach found, for instance, in Farnetani et al. (1988). Underlying this choice is the assumption that an accent is best characterized by the tonal height of the associated pitch target. There are, however, numerous cases in which an uncritical identification of an F0 peak with a pitch accent may be erroneous. Figure 5.1 illustrates such a case. Here, roadways occurs at the end of a sentence, and is followed by a brief pause. The right element of the compound has a pitch peak that is aligned with the right edge. This high boundary tone serves as a continuation rise (cf. Bolinger 1989; L-H% in the ToBI classification, Beckman and Hirschberg 1994), and is not primarily related to the prominence relation between the two elements. Thus, a measurement procedure for pitch peaks has to avoid confounding peaks that are linked to boundary tones with accentual ones, a condition that is hard to meet if automatic measurements are used (see section 6.3 below for models which depend on automatic pitch measurements). Another possibility is the calculation of mean F0 throughout the interval, a measure used for instance by Campbell and Beckman (1997) and Štekauer et al. (2007). This measure has the advantage of being fairly indifferent to localized pitch variations and algorithmic misreadings, as well as to the influence of
9 For instance, Hayes (1995) argues that “prevocalic segments in the syllable (i.e. onset segments) are prosodically inert: that VC is prosodically equivalent to CVC and CCVC, V: to CV: and CCV:, and so on.” (Hayes 1995: 51, emphasis omitted)
[Figure 5.1: Pitch tracking for roadways showing the effect of a boundary tone on the pitch contour in the right element. Axes: pitch (Hz, 0 to 300) against time (s, 13.99 to 14.52); labelled intervals: road, ways.]
boundary tones. Furthermore, it is a stable measurement that also allows a manual calculation of mean pitch by counting the visible pulses in the waveform and dividing by the length of the measurement interval. Such a manual calculation may be useful if the pitch extraction algorithm provided by Praat fails, which may be the case for instance in the presence of creaky voice phonation. However, average pitch is an abstraction of the pitch contour, which may obscure differences between different types of pitch accents while introducing numeric differences that may not be representative on a phonological level. For instance, a localized pitch peak with a high excursion on the one hand and a rather level pitch contour at mid-range on the other will yield fairly similar average pitches. Likewise, similarities between pitch peaks may also be obscured. For instance, both L+H* and H* are pitch accent types with high pitch targets, but the low pitch in the earlier part of L+H* accents is likely to result in a lower average pitch than for H*. A third pitch measure is found in Plag (2006), who measured the pitch at the mid-point of the vowel in each of the elements of his compounds. While this approach has the benefit of minimizing the effect of adjacent segments on the pitch within the vowel, as well as reducing the influence of boundary tones, it assumes that the pitch at the vowel mid-point is representative of the whole pitch contour, an assumption that may not hold for complex pitch accent types such as L+H*. Because of this disadvantage, mid-vowel pitch was not taken into closer consideration here. All pitch measures discussed so far stipulate that high perceptual prominence is also associated with high pitch; hence, all prominence-lending pitch accents are expected to be high accents. This expectation might be problematic, as
pitch accents in English can be either low, high, or a combination of the two. Yet, low tones in English are associated with pragmatic meanings that make their occurrence in the present data rather unlikely. For instance, Pierrehumbert and Hirschberg (1990: 291) suggest that an L* accent is used to mark “items that S [the speaker] intends to be salient but not to form part of what S is predicating in the utterance”, while L*+H accents express uncertainty in the speaker (Pierrehumbert and Hirschberg 1990: 295, cf. also Hirschberg 2004: 534). None of these interpretations seems probable within the news context of the corpus used here. A look at the distribution of pitch accent types that were provided by the corpus compilers for the stimuli used in the perception experiment supports this assumption. As Table 5.6 shows, only three elements were annotated as L*. Two of these occur in the two elements of health clinics (recording F3ATRLP3). Inspection of the pitch contour of this item suggests that these two accents may just as well be identified as !H*, given that they occur at the end of a sentence and receive a very low pitch due to pitch declination, rather than as the result of “an apparent tone target [. . .] which is in the lowest part of the speaker’s pitch range”, which is the formal definition of L* accents in the ToBI system (Beckman and Hirschberg 1994: 3). Mean and peak pitch were determined using the Praat auto-correlation algorithm (Boersma 1993), with the pitch settings adjusted as appropriate. A pitch range of 75–300 Hz was chosen as the default for male speakers, while for female speakers, the default range was 100–500 Hz. These settings were adjusted if required by the recording. For instance, pitch readings in creaky voice phonations required lower pitch floor settings. The automatic pitch readings were verified by an acoustic and visual inspection of the recordings.
In six of the 210 measurements, the automated extraction failed to return appropriate pitch means, and mean pitch was calculated manually by dividing the number of visible periods by the duration of the measurement interval. For the present data set, peak and average pitch measures are highly correlated ($r_P = 0.96$, $p < 0.001$), so the choice of measure is unlikely to affect the outcome greatly. The technical advantage of easy recoverability in case of algorithmic failures was the primary reason to continue the analysis with mean pitch, although the pitch peak measures were used in the calculation of pitch slopes (see below). To reflect the fact that listeners perceive pitch changes on a logarithmic rather than on a linear scale (see Nolan 2003 for a discussion of the benefits of different pitch scales), all mean pitch measurements were transformed to semitones relative to the lowest observed frequency using the transformation equation

\[ f_{i,ST} = 12 \cdot \frac{\log(f_i / \min f)}{\log(2)}, \tag{5.1} \]
where $f_i$ is the $i$-th mean pitch measurement, and $\min f$ the minimum of all
observed mean pitch measurements (all in Hz). Apart from the higher perceptual accuracy of semitones as compared to Hertz frequencies, the transformation also facilitates further analysis using statistical models that depend on a linear relation between the variables of the analysis, such as the linear mixed-effects model employed below. To capture the differences between the types of pitch accents that are proposed for English, a description of the height of the tonal target is not sufficient. A further characteristic of pitch contours is whether the pitch is rising, falling, or remains at a fairly stable level. The review above suggests that these different contours may differ in their respective prominence. The pitch slope, i.e. the slope of a line drawn between the pitch maximum and pitch minimum in the measurement interval, is a way to capture these differences between contours. For a given interval, it is calculated as

\[ S = \frac{f_{\max} - f_{\min}}{t_{\max} - t_{\min}}, \tag{5.2} \]
where $f_{\max}$ and $f_{\min}$ are the pitch maximum and minimum in the interval, and $t_{\max}$ and $t_{\min}$ are the times at which the maximum and minimum pitches are observed. All pitch maxima and minima were marked manually after an inspection of the pitch contours. No attempt was made here to separate boundary tones from pitch accents, and a rising pitch slope in a right element may also be affected by a high boundary tone that is aligned with the right edge of the compound. For instance, the pitch slope in ways from Figure 5.1 indicates a rising pitch. Still, the decision to determine a pitch slope that disregards differences between boundary and accentual tones seems justified. First, the stimuli in Hermes and Rump (1994), who observed prominence differences between falling and rising pitch contours, were not designed to represent any explicit configuration of pitch accents and boundary tones. Thus, it is not yet clear whether the effect of pitch slopes on prominence follows the phonological distinction between boundary tones and pitch accents. Second, cases in which the effect of boundary tones is extreme, as in ways in the example above, are identified as potential outliers in the analysis below. The approach chosen here thus seems robust enough to allow the inclusion of boundary tones without invalidating the analysis. As the pitch contour does not change much for the unitonal accents (L* and, more relevant here, H*), the pitch slope S is expected to be small for these accent types. The bi-tonal accent types show a greater degree of movement in the pitch contour, and hence, the absolute value of S should be larger than for a unitonal accent. Furthermore, a falling pitch accent type should have a negative value for S, as the pitch minimum occurs later in the interval than the maximum. A positive S, then, is expected for rising types such as L+H*. Thus, in sum, two different measures of pitch were gathered.
The first measure is average pitch throughout the measurement interval in semitones relative to the
lowest average pitch observed. Pitch slope S, calculated as the slope between the pitch maximum and pitch minimum in the measurement interval, is the second measure of pitch.
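The two pitch measures can be illustrated with a minimal sketch of equations (5.1) and (5.2). The function names and the example values are hypothetical. The slope inputs are assumed to be in semitones, since Table 5.2 reports pitch slope in ST/s.

```python
import math


def to_semitones(f_hz: float, f_ref_hz: float) -> float:
    """Equation (5.1): semitone distance of f_hz from the reference f_ref_hz."""
    return 12.0 * math.log(f_hz / f_ref_hz) / math.log(2.0)


def pitch_slope(f_max: float, t_max: float, f_min: float, t_min: float) -> float:
    """Equation (5.2): slope between pitch maximum and minimum in an interval."""
    if t_max == t_min:
        raise ValueError("maximum and minimum must occur at different times")
    return (f_max - f_min) / (t_max - t_min)


# Hypothetical mean pitches (Hz); the lowest value serves as the reference.
means_hz = [72.46, 144.92, 222.20]
reference = min(means_hz)
print([round(to_semitones(f, reference), 2) for f in means_hz])

# Falling contour: the maximum (10 ST) precedes the minimum (4 ST).
print(pitch_slope(f_max=10.0, t_max=0.02, f_min=4.0, t_min=0.12))  # negative
# Rising contour: the minimum precedes the maximum.
print(pitch_slope(f_max=10.0, t_max=0.12, f_min=4.0, t_min=0.02))  # positive
```

The sign of the result follows from the order of the two time points alone, which is why a falling contour yields a negative S.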
5.2.2 Duration

Duration was obtained directly from the length of the measurement intervals for each element, i.e. the sonorant part of the syllable rime. The duration measurements were log-transformed to reduce the influence of observations with extreme lengths. Furthermore, as shown below in section 5.3, the relation between perceived prominence rating and element duration appears to be non-linear and is better described in a logarithmic way. Farnetani et al. (1988) report a significant overall duration difference between left-prominent and right-prominent compounds, as well as a significantly longer right element in right-prominent structures. These measures are not taken into closer consideration for the present analysis. Unlike in Farnetani et al., the present data does not consist of minimal pairs, and hence, the overall duration may be longer for one of the two prominence pattern types for reasons that are independent of the phonetic implementation. For instance, the left element Massachusetts is among the longest left elements in the test set (four syllables, MED = 2 syllables). At the same time, it occurs in two of the 105 compounds in the test set (Massachusetts towns, Massachusetts House). Given that these compounds may be expected to be right-prominent (due to their locative semantics, see section 7.4 for a discussion of this expectation), a statistically significant association between overall duration and prominence ratings might be due to specific element durations, and not an acoustic correlate of the prominence implementation. The same reasoning speaks against inclusion of left and right element durations separately.
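As a brief illustration of the log-transformation, the sketch below applies the natural logarithm to a duration in seconds. The choice of the natural logarithm is an assumption, although it appears consistent with the values reported in Table 5.2 (0.045 s corresponds to about −3.101). The function name is hypothetical.

```python
import math


def log_duration(duration_s: float) -> float:
    """Natural logarithm of a duration in seconds (assumed base, see lead-in)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return math.log(duration_s)


# Shortest left-element duration reported in Table 5.2.
print(round(log_duration(0.045), 3))  # -3.101
```

The transformation compresses long durations more strongly than short ones, which reduces the leverage of extreme values.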
5.2.3 Intensity

The intensity for each measurement interval was retrieved from Praat intensity objects. These are calculated using a Kaiser analysis window with shape parameter a = 20 (Kaiser 1966) and the pitch floor as derived above. The mean intensity $I$ for an interval ranging from $t_1$ to $t_2$ is

\[ I_{t_1,t_2} = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} a(t)\,dt, \tag{5.3} \]
where a(t) is the intensity at time t. The scale used is dB SPL, which uses as a reference point the auditory threshold for a 1,000 Hz sine wave (Johnson 2003: 50).
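Equation (5.3) can be approximated numerically for a sampled intensity contour. The following sketch uses the trapezoidal rule with linear interpolation at the interval edges. It averages the dB values directly, as the equation is written; the function names and the toy contour are hypothetical, and this is not claimed to reproduce Praat's internal computation.

```python
def interpolate(times, values, t):
    """Linearly interpolate the contour at time t (times strictly increasing)."""
    for k in range(len(times) - 1):
        ta, tb = times[k], times[k + 1]
        if ta <= t <= tb:
            va, vb = values[k], values[k + 1]
            return va + (vb - va) * (t - ta) / (tb - ta)
    raise ValueError("t lies outside the sampled contour")


def mean_intensity(times, values, t1, t2):
    """Approximate (1 / (t2 - t1)) * integral of a(t) dt over [t1, t2]."""
    if not t1 < t2:
        raise ValueError("t1 must be smaller than t2")
    points = [(t1, interpolate(times, values, t1))]
    points += [(t, v) for t, v in zip(times, values) if t1 < t < t2]
    points.append((t2, interpolate(times, values, t2)))
    area = sum((tb - ta) * (va + vb) / 2.0
               for (ta, va), (tb, vb) in zip(points, points[1:]))
    return area / (t2 - t1)


# Toy contour rising linearly from 60 to 80 dB over one second.
times = [0.0, 0.5, 1.0]
levels = [60.0, 70.0, 80.0]
print(mean_intensity(times, levels, 0.25, 0.75))  # 70.0
```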
5.2.4 Spectral balance

While the notion of spectral balance as a descriptor of the slope of the frequency spectrum is rather well-established (see, for instance, Jackson et al. 1985, Campbell 1995, Sluijter and Heuven 1996b, Heldner 2003), there is no agreed operationalization of this concept: Jackson et al. (1985), for example, calculate the difference between the first and the second harmonic, while Sluijter et al. (1995) use the difference between the first harmonic and the strongest harmonic in the third formant. Sluijter and Heuven (1996b) divide the spectrum into four disjoint bands that capture F0 and the first three formants, respectively, and measure the mean intensity for each band. The measure used here is a simplified version of the latter approach: only two instead of four frequency bands are used, with the point of division chosen at 1,000 Hz so that it falls between the first and the second formant for most types of vowels and sonorants. Thus, the spectral balance $B$ (in dB) is calculated as

\[ B = I_{high} - I_{low}, \tag{5.4} \]
where $I_{low}$ is the mean intensity from 0 to 1,000 Hz and $I_{high}$ is the mean intensity from 1,000 to 4,000 Hz, both taken from a long-term average spectrum with a bandwidth of 100 Hz. In the case of low spectral balance, which is expected to occur for less prominent elements, the intensity in the high band is much lower than in the low band, and hence, $B$ is clearly negative, while more prominent elements are expected to have less difference between the two bands.
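The two-band measure in equation (5.4) can be sketched as follows. The spectrum is represented here as a list of (centre frequency, level in dB) pairs with 100 Hz bins, which is a hypothetical data structure, and band levels are averaged directly in dB, which is an assumption about the averaging method.

```python
def band_mean_db(spectrum, lo_hz, hi_hz):
    """Mean level (dB) of all bins whose centre frequency lies in [lo_hz, hi_hz)."""
    levels = [db for freq, db in spectrum if lo_hz <= freq < hi_hz]
    if not levels:
        raise ValueError("no spectrum bins in the requested band")
    return sum(levels) / len(levels)


def spectral_balance(spectrum, split_hz=1000.0, top_hz=4000.0):
    """Equation (5.4): B = I_high - I_low, with the bands divided at split_hz."""
    i_low = band_mean_db(spectrum, 0.0, split_hz)
    i_high = band_mean_db(spectrum, split_hz, top_hz)
    return i_high - i_low


# Toy long-term average spectrum: 60 dB below 1,000 Hz, 45 dB above.
toy_spectrum = [(50.0 + 100.0 * k, 60.0 if k < 10 else 45.0) for k in range(40)]
print(spectral_balance(toy_spectrum))  # -15.0
```

A value closer to zero would indicate relatively more energy in the higher band, as expected for more prominent elements.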
5.2.5 Non-modal phonation

The operationalization of phonation modes other than modal phonation is rather complex. Epstein (2002), who found that the occurrence of creaky voice may be related to the prominence of the respective speech material, based her analysis on the Liljencrants-Fant model of the glottal source (Fant 1995). It uses parameters based on the waveform of the glottal flow that require inverse filtering of the speech signal. Blomgren et al. (1998) tested modal and non-modal registers for acoustic parameters and found significant differences for jitter, shimmer, and the signal-to-noise ratio. They also conducted a perception experiment and found that the stimuli with non-modal phonation were identified as such in every case by the participants. The proposed acoustic parameters depend, however, strongly on the characteristics of the speech material. For example, jitter measurements that calculate the variability of glottal pulse timings require a vowel with a sufficiently long steady state. This also applies to shimmer, the variability of pulse amplitudes. As the data from the Boston corpus are based on natural speech, it cannot be assumed that all relevant vowels are also suitable for this type of voice analysis.
[Figure 5.2: campaign promise pronounced with modal (above, file F3ASQ1P1) and non-modal phonation (below, file F2BJRLP2). Both panels: frequency axis 0 to 5000 Hz.]
For example, in a pitch contour that shows a strong fall, the periodicity will also vary substantially during the vowel. An identification of creaky voice phonation merely by instrumental means, then, is clearly problematic. However, listeners were capable of correctly identifying creaky voice and modal phonation in 95 percent of the test items in Blomgren et al. (1998) by acoustic inspection only. Apparently, this non-modal phonation mode provides sufficient acoustical cues to be perceived in most cases as distinct from modal phonation. Thus, Gerratt and Kreiman (2001) conclude that “[w]hatever the basis for the perceptual distinction, these studies provide good cumulative evidence that [creaky voice] differs categorically from modal phonation.” (Gerratt and Kreiman 2001: 376). In consequence, the identification of creaky voice seems to be regarded as a rather uncontroversial task.
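To make the jitter and shimmer measures mentioned above more concrete, the following sketch computes a relative cycle-to-cycle perturbation: the mean absolute difference between consecutive values divided by the mean value. Applied to pulse periods this corresponds to a common "local" jitter definition, and applied to pulse amplitudes to the analogous shimmer measure. Whether Blomgren et al. (1998) used exactly this formula is not stated in the text, so it is an assumption, as are the function name and the example values.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("at least two consecutive cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_abs_diff / mean_value


# Hypothetical glottal pulse periods in seconds.
regular_periods = [0.005, 0.005, 0.005, 0.005, 0.005]
irregular_periods = [0.008, 0.014, 0.009, 0.016]

print(relative_perturbation(regular_periods))    # 0.0
print(relative_perturbation(irregular_periods))  # clearly above zero
```

The example also shows why such measures are problematic with a changing pitch contour: a steady fall produces non-zero differences between consecutive periods even in fully modal phonation.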
Table 5.1: Distribution of phonation types in left and right elements (N = 105, respectively).

Element         Phonation       n     Percent
Left element    Modal           98    93.3
                Creaky voice     7     6.7
Right element   Modal           72    68.6
                Creaky voice    33    31.4
Hence, the phonation mode of each element in the 105 compounds was classified either as modal or non-modal, based both on the acoustic impression and on the characteristics of the spectrogram and the waveform, such as the irregularity of pitch periods, an overall decrease of the amplitude of the waveform (cf. Gordon and Ladefoged 2001), and the near-complete damping of the vocal tract between glottal pulses, resulting in correspondingly strongly separated pulses visible in the spectrogram (cf. Gerratt and Kreiman 2001). Figure 5.2 illustrates these differences of waveform and spectrogram between modal and non-modal phonation. Each panel displays a recording of campaign promise by a different female speaker from the Boston corpus. In the lower panel, the one illustrating creaky voice phonation, the waveform shows clearly isolated spikes for most parts of the right element [prɒmɪs] (the IPA subscript diacritic [ ̰ ] indicates creaky voice phonation), which are furthermore separated by time intervals of varying length. The irregular pulsation timing is particularly visible in the spectrogram, which shows vertical bars of high energy followed by brief periods of very little energy (the effect of damping) at irregular intervals, with an average period length much longer than in the left element with modal phonation, thus indicating the very low pitch associated with creaky voice. In contrast to this, the right element in the upper panel is characterized by a high periodicity, with only little variation, of both the peaks in the waveform and the vertical bars in the spectrogram. There is also no indication of a damping effect, as even the periods between glottal pulses show considerable degrees of spectral energy. The narrow spacing of the periods clearly indicates that the pitch in the left and right element is within a similar range, unlike in the lower panel.
All in all, the visual cues to creaky voice, even if not always as striking as in the given examples, are highly reliable for the identification of non-modal phonation. Accordingly, the phonation mode of all elements involved in the analysis was classified by the author as either creaky or modal on the basis of a combined audio-visual investigation. The distribution of non-modal phonation within left and right elements is summarized in Table 5.1, while Table 5.2 summarizes the measurements for
mean pitch (in Hz and transformed to semitones), intensity, duration (in seconds and log-transformed), spectral balance, and pitch slope.

Table 5.2: Summary of acoustic measurements in left (l) and right (r) elements (N = 105, respectively).

Measure                 Element   Range                 Mean      SD      Median
Mean pitch (Hz)         l         83.41 to 342.40       168.10    45.74   164.80
                        r         72.46 to 222.20       129.00    37.69   118.60
Mean pitch (ST)         l         2.44 to 26.89         13.94     4.73    14.22
                        r         0.00 to 19.40         9.29      4.85    8.53
Intensity (dB SPL)      l         55.16 to 78.03        70.04     4.15    70.58
                        r         55.62 to 75.75        67.19     4.18    67.67
Duration (s)            l         0.045 to 0.302        0.124     0.051   0.116
                        r         0.042 to 0.371        0.149     0.070   0.129
log Duration            l         −3.101 to −1.199      −2.164    0.401   −2.153
                        r         −3.161 to −0.991      −2.006    0.464   −2.045
Spectral balance (dB)   l         −39.24 to −4.55       −16.52    5.32    −16.54
                        r         −29.19 to −5.43       −15.40    4.73    −14.74
Pitch slope (ST/s)      l         −86.03 to 72.59       4.94      33.77   11.81
                        r         −199.90 to 108.00     −27.64    42.88   −30.04
5.3 Procedure

Section 5.1.5 made several predictions about the distribution of pitch, intensity, duration, spectral balance, pitch slope, and phonation mode within each element of a NOUN + NOUN construction, depending on whether the construction is left-prominent or right-prominent. This section tests the validity of these predictions by fitting linear mixed-effects models for each of the respective variables (see Appendix A for an introduction to linear regression and mixed-effects models). For each of the acoustic variables discussed above (pitch, intensity, duration, pitch slope, spectral balance, and phonation mode), a separate regression model is fitted using each measure as the response variable. The first five response variables are continuous, and therefore fitted by linear mixed-effects models, while phonation mode, coded as a discrete, binomial variable, is fitted by a logistic mixed-effects model. The initial number of observations is 210 (105 compounds with two elements each). Each model contains two fixed main effects. The first fixed effect, ELEMENT, is a discrete factor with values ‘left’ and ‘right’, and encodes the information from which element of a NOUN + NOUN construction the respective measurement
was taken. The second fixed effect, PROMINENCE, is a continuous variable that reflects the median prominence ratings obtained from the reliable raters in the perception experiment. As discussed in chapter 4, negative and positive values correspond to compounds in which the left and right element is rated as more prominent, respectively. Hence, ELEMENT indicates whether the respective acoustic measure differs categorically between the left and the right element, while PROMINENCE represents the linear relationship between left- and right-prominence on the one hand and the acoustic measure on the other. Also included in each model is the interaction PROMINENCE × ELEMENT. A significant interaction indicates that the relationship between perceived prominence and the measure differs between left and right element. As Crawley (2005: ch. 10) illustrates, a fixed-effect structure such as this estimates a separate regression equation of the continuous predictor on the response variable for each level of the discrete factor. Thus, it allows us to investigate the association between the perceived prominence pattern and the acoustic measure separately for each element of the compounds. This interaction structure between PROMINENCE and ELEMENT can be illustrated if we consider the model for mean pitch. First, we expect to find a significant main effect of ELEMENT due to pitch declination: the right element occurs later in the utterance, and is thus expected to have a lower pitch in general. In the model, this would be represented in a significantly different intercept for the right element. Second, we expect a significant coefficient for PROMINENCE, as prominence and pitch are closely related in the autosegmental-metrical view.
Most importantly, however, we expect a significant interaction between the two predictors, as an autosegmental-metrical interpretation of prominence patterns proposes different mean pitches in the left and right member depending on the prominence pattern: pitch in the left element should not alter greatly between left- and right-prominence, as this element is held to be accented in both cases, while pitch in the right element should reflect alternation between non-accented (left-prominence) and accented (right-prominence). Following Baayen et al. (2008), speaker and item identifications were included as random intercepts (SPEAKER and ITEM, respectively). For each mixed-effects model, the appropriateness of the random effect structure was verified using two likelihood ratio tests: one comparing the model containing both SPEAKER and ITEM (the full model) with a reduced model containing only SPEAKER, and one comparing the full model with the reduced model containing only ITEM. Only those random effects were retained that yielded a significant increase in log likelihood (cf. Baayen 2008a: 253). Thus, for intensity and duration, ITEM was the only random effect, while for pitch slope and phonation modes, both random effects were excluded. Fixed effects regression models were used instead for these two variables. The structure of random effects in the different regression models is summarized in Table 5.3.
Table 5.3: Structure of random effects.

Phonetic property    Random effects
Mean pitch           SPEAKER, ITEM
Intensity            ITEM
Duration             ITEM
Spectral balance     SPEAKER, ITEM
Pitch slope          n.a.
Phonation mode       n.a.
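The likelihood ratio tests used to verify the random effect structure can be sketched as follows. The sketch compares a full and a reduced model that differ by one parameter, so that the test statistic is referred to a chi-square distribution with one degree of freedom. The log-likelihood values are hypothetical and do not come from the models reported here; the function name is likewise an illustration.

```python
import math


def likelihood_ratio_test(loglik_full: float, loglik_reduced: float):
    """Return (statistic, p value) for a comparison with one extra parameter."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # Survival function of the chi-square distribution with df = 1.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log likelihoods: full model (SPEAKER and ITEM) vs. ITEM only.
statistic, p_value = likelihood_ratio_test(-512.3, -520.9)
print(round(statistic, 1), p_value < 0.001)  # 17.2 True
```

In this hypothetical case the random effect that was dropped in the reduced model would be retained, as its removal significantly lowers the log likelihood.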
The different random effect structures are not unexpected. As noted in section 6.1.2, vowel phonemes are known to have intrinsic pitch, intensity, and duration values. Kiefte and Kluender (2005) have shown that the perception of vowels is also influenced by spectral balance. These intrinsic variations of the respective measures are modelled by the random effect ITEM. The random factor SPEAKER accounts for a large amount of the variance in the model for mean pitch, which is not surprising given the natural pitch differences between individuals, in particular between the sexes. Between-speaker differences may be expected to exist for intensity as well. However, it is highly likely that intensity differences between speakers have already been removed during recording, as all recordings of the Boston corpus were taken in a professional studio environment. On the other hand, spectral balance may be partly due to physiological differences, which would explain the random effect SPEAKER for this cue. No random effect was required for the model of pitch slope, so the sources of variation of this measure seem to lie elsewhere. As noted above in section 5.2, the pitch measurements were transformed to the logarithmic semitone scale, relative to the lowest mean pitch that occurred in the data. While this transformation is advisable from a perceptual viewpoint, it is also advantageous methodologically, as the mixed-effects models used here assume a linear relation between continuous predictor and response variables. Unlike for pitch, with its logarithmic nature, there is no prior reason to assume a non-linear relation for the other response variables. However, an investigation of the preliminary mixed-effects models revealed a considerable non-normality of the residuals of the model for duration, which suggests that the relation between prominence ratings and this acoustic measure is not well approximated by a linear model (cf., for instance, Crawley 2002: ch. 17 for a discussion of this aspect of model criticism). This non-linearity was addressed by refitting the model with the log-transformed duration as response variable, which removed the non-normal distribution of the residuals. However, inspection of the model with log-transformed durations revealed three observations whose absolute standardized residuals exceeded 2.5 standard units, which is one possible indicator
of potentially harmful outliers (cf. Baayen 2008a).¹⁰ Two of these observations were taken from the two elements of the same compound (wristband, recording F2BPROP1), which seems to feature a temporal pattern that is rather extreme for English compounds, as it has the shortest left element duration (0.045 s, M = 0.124 s, SD = 0.051) and the longest right element duration (0.371 s, M = 0.149 s, SD = 0.070) of all items in the test set. Accordingly, this item was excluded, and the model refitted with the remaining 208 observations. An acoustic investigation of the data points whose absolute standardized residuals exceeded 2.5 standard units in the other linear models revealed that some of these measurements are, indeed, due to circumstances that invalidate them for inclusion in the respective models. Thus, for the model of spectral balance, the left element of school districts (recording M2BS20P3) shows a striking reduction of energy in the whole higher range of the frequency spectrum, which results in the lowest spectral balance of all observations (−39.24 dB). However, this value appears to result from a recording glitch rather than an intended pronunciation. The exceptional nature of this data point is also reflected in the very high standardized residual (3.97 standard units). Thus, the model for spectral balance is based on 209 observations. For the model of pitch slope, four data points were excluded in total, all representing right elements. Three of them showed a steeply falling pitch, but their slope measurement is rather unreliable due to the creaky voice phonation found in the pertaining measurement intervals (standardized residuals: 3.06, 4.49, 2.78). The fourth observation is the right element of the item roadways (Figure 5.1). Here, the steeply rising slope (108.0 ST/s, by far the largest in the whole data set) is due to the excessive effect of the boundary tone (standardized residual: 3.67).
Accordingly, the number of observations for this model was reduced to 206.
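The screening criterion used above (absolute standardized residuals above 2.5 standard units) can be sketched as follows. Standardization is done here by dividing each residual by the sample standard deviation of all residuals, which is an assumption about the procedure. The function name and the residual values are hypothetical.

```python
import statistics


def flag_outliers(residuals, threshold=2.5):
    """Indices of residuals whose absolute standardized value exceeds threshold."""
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r / sd) > threshold]


# Hypothetical model residuals with one extreme observation at the end.
residuals = [0.2, -0.1, 0.3, -0.3, 0.1, -0.2, 0.0, 0.25, -0.25, 0.15,
             -0.15, 0.05, -0.05, 0.1, -0.1, 0.2, -0.2, 0.3, -0.3, 3.0]
print(flag_outliers(residuals))  # [19]
```

As in the procedure described above, a flagged observation is only a candidate for exclusion: each case was inspected acoustically before a decision was made.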
5.4 Results

The results from the six regression models are summarized in Table 5.4. The reported p values were estimated using Markov chain Monte Carlo simulations (see Baayen et al. 2008 for details), with the exceptions of pitch slope and creaky voice, where the probabilities were obtained directly from the linear model
¹⁰ In fixed-effects linear regression models, there are numerical ways of identifying data that may be regarded as potentially harmful outliers, such as leverage calculation or calculation of Cook’s distance (cf., for instance, Baayen 2008a: 188ff). At the time of writing, similar methods are not available for the linear mixed-effects models used here (Baayen 2008, personal communication).
Table 5.4: Mixed-effects models for phonetic properties using Helmert contrasts.

Mean pitch               B         Std. Err.   t         p
(Intercept)              11.55     1.40        8.25      < 0.001   ***
PROMINENCE               −0.15     0.31        −0.48     0.635
(Intercept)|ELEMENT      −1.49     0.17        −8.94     < 0.001   ***
PROMINENCE|ELEMENT       1.91      0.17        11.07     < 0.001   ***

Intensity                B         Std. Err.   t         p
(Intercept)              68.55     0.41        169.32    < 0.001   ***
PROMINENCE               −0.14     0.42        −0.34     0.733
(Intercept)|ELEMENT      −0.86     0.17        −5.03     < 0.001   ***
PROMINENCE|ELEMENT       1.30      0.18        7.42      < 0.001   ***

log Duration             B         Std. Err.   t         p
(Intercept)              −2.10     0.04        −56.35    < 0.001   ***
PROMINENCE               −0.03     0.04        −0.85     0.399
(Intercept)|ELEMENT      0.10      0.03        3.55      < 0.001   ***
PROMINENCE|ELEMENT       0.06      0.03        2.26      0.025     *

Spectral balance         B         Std. Err.   t         p
(Intercept)              −16.07    0.79        −20.32    < 0.001   ***
PROMINENCE               −0.47     0.42        −1.12     0.264
(Intercept)|ELEMENT      0.62      0.31        1.98      < 0.05    *
PROMINENCE|ELEMENT       0.35      0.32        1.08      0.280

Pitch slope              B         Std. Err.   t         p
(Intercept)              −11.14    2.53        −4.40     < 0.001   ***
PROMINENCE               −2.40     2.62        −0.92     0.360
(Intercept)|ELEMENT      −10.50    2.53        −4.14     < 0.001   ***
PROMINENCE|ELEMENT       10.47     2.62        4.00      < 0.001   ***

Creaky voice             B         Std. Err.   z         p
(Intercept)              −1.86     0.25        −7.56     < 0.001   ***
PROMINENCE               −0.11     0.26        −0.43     0.669
(Intercept)|ELEMENT      0.64      0.25        2.59      0.010     **
PROMINENCE|ELEMENT       −0.69     0.26        −2.64     0.010     **
and the logistic regression model, respectively. As the analysis is to reveal the influence of both predictors, no backwards elimination of non-significant terms (cf. Crawley 2002) has been applied. The regression models use Helmert contrasts to represent the influences of the discrete variable ELEMENT and the continuous variable PROMINENCE on the response variables (cf. Crawley 2002: ch. 18). Thus, the intercept coefficient and the coefficient for PROMINENCE are estimates of the intercept and slope of the
overall regression, regardless of the element from which the respective measurement was taken. The coefficients for the two separate regression equations are derived from the overall regression by addition and subtraction, respectively, of the parameters for (Intercept)|ELEMENT (the offset to the intercept) and PROMINENCE|ELEMENT (the slope modification). In order to derive the regression parameters for right elements, the coefficients for (Intercept)|ELEMENT and PROMINENCE|ELEMENT are added to the coefficients for intercept and PROMINENCE, respectively, while they are subtracted for the regression for left elements. Hence, a significant coefficient for PROMINENCE|ELEMENT indicates that the effect of prominence is different between the two elements, while a significant coefficient for (Intercept)|ELEMENT indicates an average difference between the two elements. Considering, for instance, the mixed-effects model for mean pitch, we find a significant coefficient for PROMINENCE|ELEMENT, which indicates that the effect of prominence pattern on average pitch is, indeed, different for left and right elements. The left element in a compound with a prominence rating of zero is estimated to have an average pitch of 11.552 ST − (−1.494 ST) = 13.046 ST (the overall intercept coefficient minus the offset coefficient (Intercept)|ELEMENT), and the average pitch is estimated to change by −0.150 ST − 1.910 ST = −2.060 ST (the overall slope coefficient minus the coefficient for the slope adjustment PROMINENCE|ELEMENT) with each increase of the prominence rating by one unit. For right elements, the estimated intercept is 11.552 ST + (−1.494 ST) = 10.058 ST, and the estimated slope is −0.150 ST + 1.910 ST = 1.760 ST. Thus, the mixed-effects model estimates the average pitch $\hat{P}_i$ in the left and right element of compound $i$ as in (5.5):

\[ \begin{aligned} \hat{P}_{i,left} &= 13.046 - 2.060 \times x_i \\ \hat{P}_{i,right} &= 10.058 + 1.760 \times x_i, \end{aligned} \tag{5.5} \]
where $x_i$ is the prominence rating for the $i$-th compound. Given the significant slope difference between the two equations, we may conclude that the association between pitch and prominence pattern is not the same in left and right elements. More specifically, the negative slope in the equation for left elements indicates that right-prominent compounds have a clearly lower mean pitch in the left element than left-prominent compounds, while the positive slope for right elements shows that the mean pitch in right-prominent compounds is higher than in left-prominent compounds. The top panel in Figure 5.3 illustrates these conclusions. In this panel, the pitch measurements are plotted using filled (left element) and empty (right element) circles, while the solid and dashed lines are the regression lines associated with left and right elements, respectively, and thus represent estimations for mean pitch in each element at a given prominence rating. The decreasing pitch in the left
88 25
●
10
15
● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ●
●
●
5
Pitch (ST)
20
●
0
●
● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ● ●● ●● ● ● ●●● ● ●●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●
● ●
● ● ●● ● ●
●
● ●● ● ● ● ● ●● ● ●
75 70
●
65 60
●
● ●
−1.0 −2.0 −10
−3.0
● ●
● ●
●
●
−30
log Duration (log s)
●
−20
● ●
● ●●
●
●
● ● ● ●
●●
● ● ●
● ● ●
●
● ● ●● ● ● ● ●
● ●
● ●
● ●
● ● ● ●● ● ● ● ●●
● ● ● ● ●
● ● ● ●
● ● ● ●● ● ● ●● ● ●● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●
● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●
●
● ●
●● ●●
● ● ● ● ●
● ●
● ●
●
●
● ●
● ●
●
● ●
●
●
● ●
● ● ●● ● ●
● ●
● ● ● ● ●
● ●
● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●
● ●
● ● ●
● ● ●
● ● ●
●
●
−40
Spectral balance (dB SPL)
●
● ● ● ● ● ●
●
55
Intensity (dB SPL)
●
●
● ● ●● ●
● ● ●
● ●
● ●
● ●
● ● ●
●
●
●
● ●
●
● ●
● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●
●
● ●●
● ●
● ● ● ●
●
●
●
●● ● ●
●
●
●
●●
●
●
●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●
●
●
● ●
● ●
● ● ●●
● ● ● ● ● ● ● ● ●
●
●
●
● ●
● ● ●●
●
●
●
●
●
●
●
●
● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ●
● ● ●● ● ● ●
● ● ● ● ●
●
●
50 0 −100
● ●
● ●
●
● ● ● ●● ● ● ● ● ●● ● ● ●●
●●● ●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ●
● ● ● ● ● ●
● ●
●
−200
Pitch slope (ST/s)
● ● ● ● ● ● ●● ● ● ●●●● ●● ● ● ● ●●● ● ● ● ● ● ●● ● ● ●● ●● ●●●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●
●
creak
●
●
● ●
modal
●
●
●
● ●● ●●● ●● ●● ● ●● ●● ● ●●●● ● ● ● ● ● ●● ● ● ●●● ●●● ●●● ● ●● ●●●●● ●●●●● ●●● ● ● ●●● ● ●● ● ● ●●● ●● ● ● ●
● ●
●
●
● ●
●
●
●●● ● ● ● ● ●●● ●● ● ● ● ●●● ● ●● ●● ● ● ● ● ●●● ● ● ● ●●● ●● ●
●
●
●
● ●● ●
●
●
Phonation
● ● ●● ● ●● ● ●●●● ● ●
−2.0
−1.5
−1.0
●● ● ● ●●
●● ● ●●● ●● ● ●●●
●●● ●●●
−0.5
● ●
● ●
0.0
0.5
1.0
Median prominence rating ●
left element
●
right element
Figure 5.3: Regression models for phonetic parameters. See text for details.
●● ● ●● ●
1.5
89
element with increasing prominence rating is shown in the clearly falling solid regression line, while the positive association between pitch in the right element and prominence ratings shows in the positive slope of the dashed regression line. Each measurement is represented as a dot in the panel. Solid and unfilled circles represent measurements from left and right elements, respectively. Corresponding plots are included for intensity, duration, spectral balance, and pitch slope as well.

Given the mixed-effect model for mean pitch, we may address the predictions made above about the distribution of pitch in the two elements under a given prominence pattern. The significant interaction term indicates that we do, indeed, find different pitch contour configurations for left- and right-prominent compounds, and that the differences lie in the two elements. The two regression lines in the corresponding panel in Figure 5.3 converge in the area associated with right-prominence. Hence, in right-prominent compounds, left and right elements have nearly indistinguishable pitch values, while the pitch of left elements is clearly higher than that of right elements if the whole compound is left-prominent. As neither of the two regression lines is parallel to the x axis,[11] we may conclude that pitch in both elements is susceptible to the prominence pattern of the construction. In particular, mean pitch in the left element decreases with increasing prominence ratings, which is an unexpected finding if we assume that the left element is always accented, irrespective of the prominence pattern. This finding is not an artefact of using mean pitch instead of peak pitch (see the discussion in section 5.2.1 above), as a corresponding model with peak pitch as the response variable yields highly similar results.
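The way the two per-element regression lines follow from the overall coefficients, and where they converge, can be retraced with a few lines of Python. This is only an illustrative sketch using the mean-pitch coefficients quoted in the text; the names `element_line` and `crossing_point` are hypothetical helpers introduced here, not part of the original analysis.

```python
# Coefficients of the mean-pitch model as quoted in the text (in ST).
INTERCEPT = 11.552              # overall intercept
PROMINENCE = -0.150             # overall slope
INTERCEPT_BY_ELEMENT = -1.494   # (Intercept)|ELEMENT: offset to the intercept
PROMINENCE_BY_ELEMENT = 1.910   # PROMINENCE|ELEMENT: slope modification


def element_line(sign):
    """Return (intercept, slope); sign is -1 for left, +1 for right elements."""
    return (INTERCEPT + sign * INTERCEPT_BY_ELEMENT,
            PROMINENCE + sign * PROMINENCE_BY_ELEMENT)


def crossing_point(line_a, line_b):
    """Prominence rating at which two regression lines predict the same value."""
    (a0, a1), (b0, b1) = line_a, line_b
    return (b0 - a0) / (a1 - b1)


left = element_line(-1)    # offsets are subtracted for left elements
right = element_line(+1)   # offsets are added for right elements
print("left:  %.3f %+.3f * x" % left)
print("right: %.3f %+.3f * x" % right)
print("lines cross at x = %.3f" % crossing_point(left, right))
```

Under these figures the two lines meet at a positive prominence rating, which is consistent with the observation that they converge in the range associated with right-prominence.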
Most importantly, in the peak-pitch model the slope for the left element is also significantly falling (−2.341), and that for the right element is rising (1.221).

[11] A regression line parallel to the x axis has a slope of zero, and indicates that there is no correlation between the response variable and the predictor. The Helmert contrasts used in Table 5.4 do not directly indicate whether this is the case for one of the two elements, but this information may easily be derived from the model using, for instance, treatment coding with the respective factor level as the reference level (cf. Crawley 2002: 339). Such a contrast recoding is reported only in cases where the respective panel in Figure 5.3 leaves doubts whether the slope of a regression line is different from zero. The slope for the right element in the durational model (see below) is such an instance.

For intensity, the results are very similar. As displayed in the second panel of Figure 5.3, the two regression lines intersect in the positive range of the prominence rating scale, indicating that right-prominent compounds have a nearly equal intensity in both elements, while the left element has a significantly higher intensity than the right element if the compound is left-prominent. Accordingly, the regressions for left and right elements have a negative (−1.446) and positive (1.161) slope, respectively. As with pitch, intensity in the left element is not
invariant with respect to the prominence pattern, but shows a continuous decrease with increasing prominence rating.

The results from the duration analysis, however, deviate from those for intensity and pitch. While we found pitch and intensity in right-prominent constructions to be nearly indistinguishable between the two elements, and clearly higher in the left element in left-prominent constructions, the situation is reversed with duration. Here, left-prominent compounds have nearly identical left and right durations, as indicated by the intersection of the two regression lines in Figure 5.3 at a prominence rating of about −1.5. Furthermore, the duration of the right element remains constant, irrespective of the overall prominence rating (treatment contrasts show that the slope for right elements is non-significant, B = 0.031, Std.Err. = 0.048, t = 0.652, p = 0.515), while the duration of the left element decreases with increasing prominence ratings (B = −0.096, Std.Err. = 0.048, t = −2.017, p = 0.045).

In the model for spectral balance, both the overall slope estimate and the slope modifier are non-significant, which speaks against a relation between spectral balance and the prominence pattern in NOUN + NOUN constructions. The slight decreasing trend for the left element with increasing prominence rating, observable in the non-parallel regression lines in Figure 5.3, is statistically not significant, and possibly due to the general difference between left and right elements that is reflected in the significant coefficient for (Intercept)|ELEMENT.

In contrast to this, pitch slope is strongly affected by the prominence pattern of the compound. In left-prominent compounds, the pitch contour in the left element is typically rising (reflected in high, positive pitch slope measurements) or level, and only infrequently characterized by a fall (i.e. low, negative pitch slopes), while the reverse is true for the right element, which almost invariably shows a clear fall. This becomes evident both in the clear separation of solid (left element) and unfilled (right element) circles on the left-hand side of the scale in Figure 5.3 as well as in the large difference between the respective regression lines in the lower range of the prominence rating scale and the corresponding differences in regression slopes (−12.866 and 8.064 for left and right element, respectively). The distinction between left and right element is neutralized in right-prominent compounds. Here, the two regression lines converge, suggesting, as with pitch and intensity, that the two elements do not show a systematic difference. Contrary to the hypothesis suggested above, right-prominence is apparently not strongly associated with falling pitch in the right element.

Finally, if we consider the distribution of creaky voice in the left and right element in relation to the perceived prominence, non-modal phonation occurs, indeed, more frequently if the compound is left-prominent, but this effect is largely restricted to the right element (treatment contrasts reveal that the probability for creaky voice in the left element is not affected by the prominence pattern, B = 0.581, Std.Err. = 0.435, z = 1.334, p = 0.182). The increased probability
of creaky voice in the right element with decreasing prominence rating, however, is significant (B = −0.805, Std.Err. = 0.294, z = −2.744, p = 0.006).[12] This relation is illustrated by the lowest panel of Figure 5.3, which uses a different visualization from the linear mixed-effects models for the continuous response variables discussed so far. Here, the points in the lower half represent elements with modal phonation, and points in the top half elements with creaky voice phonation. As before, solid circles represent left elements and unfilled circles right elements. The two lines show the probability of creaky voice phonation in relation to the prominence rating. Thus, as the probability for creaky voice in the left element is not significantly affected by the prominence pattern, and rather low in general, the solid line (indicating the probability for creaky voice in the left element) stays very low regardless of the prominence rating (the probability increases non-significantly from 0.03 to 0.15 across the whole scale of observed prominence ratings). In the right element, by contrast, the probability of creaky voice decreases from 0.59 to 0.09 in the same range, which is reflected in the clearly falling dashed line (indicating probability for creaky voice in the right element).

Using the regression models presented above, we may estimate the acoustic properties of a typical left-prominent and a typical right-prominent compound. The perception experiment described in chapter 4 provides the corresponding prominence ratings. Figure 4.8 illustrates that the prominence rating most likely to occur for left-prominent and right-prominent compounds is −1.112 and 0.786, respectively. Using these peaks in the rating distribution as representative prominence ratings for typical left- and right-prominence, respectively, we may use the regression models from above to estimate the acoustic properties typically expected for these fictitious NOUN + NOUN constructions.
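For mean pitch, this estimation step amounts to evaluating the two regression lines of (5.5) at the two representative ratings. The following sketch does exactly that; it assumes the rounded coefficients from (5.5) and ignores all random effects, and the names `PITCH_LINES`, `TYPICAL_RATING` and `predict` are introduced here purely for illustration.

```python
# Fixed-effect regression lines for mean pitch from equation (5.5), in ST:
# (intercept, slope) for the left (N1) and right (N2) element.
PITCH_LINES = {"N1": (13.046, -2.060), "N2": (10.058, 1.760)}

# Representative ratings quoted in the text (peaks of the rating distribution).
TYPICAL_RATING = {"left-prominent": -1.112, "right-prominent": 0.786}


def predict(line, rating):
    """Evaluate a regression line at a given prominence rating."""
    intercept, slope = line
    return intercept + slope * rating


estimates = {
    pattern: {elem: predict(line, rating) for elem, line in PITCH_LINES.items()}
    for pattern, rating in TYPICAL_RATING.items()
}

for pattern, by_elem in estimates.items():
    diff = by_elem["N1"] - by_elem["N2"]
    print(f"{pattern}: N1 = {by_elem['N1']:.2f} ST, "
          f"N2 = {by_elem['N2']:.2f} ST, N1 - N2 = {diff:.2f} ST")
```

The same procedure would apply to the other continuous measures, given the coefficients of the respective models.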
The estimates are given separately for each property in Table 5.5. All random effects are ignored for the estimates in this table. Hence, they are not specific to a given speaker or item, but representative of a larger population. Columns N1 and N2 correspond to the left and the right element of the two exemplary NOUN + NOUN constructions. The difference in the last column is calculated as the corresponding measure in the left element minus the measure in the right element, while the difference rows are calculated as the difference in a given element (either N1 or N2) between a left-prominent and a right-prominent compound. Thus, for instance, the mean pitch of the left element of left-prominent compounds is typically 3.91 ST higher than the mean pitch of the left element of right-prominent compounds.

[12] Note that the coefficients in the logistic regression are logit-transformed probabilities ($\mathrm{Logit} = \ln \frac{p}{1-p}$). The inverse transformation has to be applied to the respective coefficient to obtain an estimate of the probability for creaky voice.

Table 5.5: Estimated acoustic properties for fictitious, representative left-prominent and right-prominent compounds (prominence rating −1.112 and 0.786, respectively).

Acoustic property          Prominence pattern   N1        N2        Difference
Mean pitch (ST)            Left                 15.34     8.10      7.24
                           Right                11.43     11.44     −0.01
                           Difference           3.91      −3.34
Intensity (dB SPL)         Left                 71.0      66.4      4.6
                           Right                68.3      68.6      −0.3
                           Difference           2.7       −2.2
Duration (s)               Left                 0.123     0.130     −0.007
                           Right                0.103     0.138     −0.035
                           Difference           0.021     −0.007
Spectral balance (dB SPL)  Left                 −15.79    −15.32    −0.46
                           Right                −17.34    −15.54    −1.79
                           Difference           1.55      0.22
Pitch slope (ST/s)         Left                 13.67     −30.61    44.27
                           Right                −10.75    −15.30    4.55
                           Difference           24.42     −15.30
Phonation mode (p)         Left                 0.041     0.418     0.060
                           Right                0.115     0.135     1.033
                           Difference           0.332     4.610

The interpretation of the estimates for phonation mode differs from the other estimations. As the pertaining model is a logistic model, the table provides the average probability p for creaky voice in the corresponding element. Here, the difference column lists the odds ratio for non-modal phonation q between left and right element, calculated as $q = \frac{p_{N1}(1 - p_{N2})}{p_{N2}(1 - p_{N1})}$, where q indicates how many times more likely creaky voice is to occur in the left than in the right element under a given prominence pattern. Analogously, the difference row shows for a given element how many times more likely creaky voice is expected in left-prominence as compared to right-prominence. Thus, for instance, the token-wise difference of 4.610 for the right element indicates that creaky voice is almost five times more likely to occur in the right element of a left-prominent compound than in the right element of a right-prominent compound.

These estimations provide a picture of the acoustic difference between a right-prominent and a left-prominent compound, and of the differences between the elements in each type of prominence pattern. Thus, a typical left-prominent compound will have a mean pitch in the left element that is 7.24 ST higher than in the right element, and the intensity is higher by 4.6 dB SPL. The spectral balance in the left element is steeper by 0.46 dB SPL. Both elements have a nearly indistinguishable duration. The pitch contour in the left element is expected to be rising, and falling in the right element. Furthermore, creaky voice in the
right element occurs with a probability of 0.418, which is 1/0.060 = 16.740 times more likely than in the left element. Corresponding differences for a right-prominent compound as well as the within-element differences for left- and right-prominence can be retrieved from the respective cells in the table.
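The logit back-transformation and the odds ratio used for the phonation estimates can be sketched as follows. The probabilities are the rounded values printed in Table 5.5, so the results only approximate the figures in the table, which were presumably computed from unrounded estimates; the function names are hypothetical and chosen for this illustration only.

```python
import math


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))


def inv_logit(value):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-value))


def odds_ratio(p_a, p_b):
    """Odds of an event with probability p_a relative to one with p_b."""
    return (p_a * (1.0 - p_b)) / (p_b * (1.0 - p_a))


# Rounded creaky-voice probabilities for a left-prominent compound (Table 5.5).
p_n1, p_n2 = 0.041, 0.418

# Odds ratio between left and right element, and its reciprocal.
q = odds_ratio(p_n1, p_n2)
print(f"q = {q:.3f}, 1/q = {1 / q:.1f}")

# Right element (N2) compared across left- and right-prominent compounds.
print(f"N2, left- vs right-prominent: {odds_ratio(0.418, 0.135):.2f}")

# The two transformations are inverses of each other.
print(f"round trip: {inv_logit(logit(p_n2)):.3f}")
```

Because an odds ratio compares odds rather than probabilities, a value of about 4.6 does not mean that the probability itself is 4.6 times higher; with the rounded table values the probabilities differ by a factor of roughly three.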
5.5 Discussion

The analysis presented in this section investigated the relations between prominence ratings and five different phonetic measures (mean pitch, intensity, duration, pitch slope, spectral balance) as well as a classification of the phonation mode, within the context of the autosegmental-metrical framework. In this framework, left prominence is held to be realized by an accented left element and an unaccented right element, while right prominence has a pitch accent on both elements. As an accented syllable has been shown to be associated with higher pitch, longer duration, and higher intensity than an unaccented syllable, these three measures were expected to be fairly constant in left and right elements in right-prominent compounds, but to differ greatly in left-prominent compounds, with larger figures in the left element than in the right element. A corresponding difference in spectral balance may be expected if this measure is also associated with accentuation, but not if it is a feature of lexical stress. The shape of the pitch contour, expressed by the pitch slope in the two elements, was suggested above to differ between the two prominence patterns as well, and it was supposed that the alleged higher perceptual prominence of a falling pitch contour is used to increase the prominence of the right element if necessary. Finally, creaky voice, described to be frequently associated with non-prominent material, was expected only in the right element of left-prominent compounds.

Some of these predictions have found support in the analysis reported above. There is a rather reliable association between the perceived prominence rating on the one hand and pitch and intensity on the other. Both measures are statistically indistinguishable in both elements of right-prominent compounds, while the left element in left-prominent compounds has a significantly higher pitch and intensity than the right element.
This provides strong support for the assumption that compound prominence patterns are, indeed, realized by differences in the accentuation pattern. Despite the similarity of pitch and intensity in right-prominent compounds, the right element is still perceived as more prominent than the left element. As outlined above, this effect is very likely due to the declination effect of pitch and intensity: listeners are held to expect the continued decrease of intensity and pitch in the right element, and perceive an increased prominence in a syllable in which this decrease is suspended.
Table 5.6: Distribution of pitch accent types across each element by prominence pattern (N = 33).

Prominence pattern   Element   *    H*   L*   L+H*   None
left-prominent       left      0    13   1    6      3
                     right     1    1    1    0      20
right-prominent      left      3    1    1    3      2
                     right     1    6    0    1      2
The analysis of pitch slopes allows us to further specify the shape of accents encountered for the different prominence patterns. The finding that right-prominent compounds usually have a slightly falling pitch in both elements suggests that the prevalent accent type is H* (which, in the ToBI convention, subsumes Pierrehumbert’s 1980 falling H*+L type), and only occasionally a rising accent such as L+H*. In left-prominent compounds, the latter accent type seems to be much more frequent for the left element (as indicated by the occurrence of many steeply positive slopes that indicate a rising pitch), but numerous H* accents with less steep pitch slopes are present as well. The fact that most right elements of left-prominent compounds have a large negative pitch slope, together with the results from the analysis of mean pitch above, reflects the continued decrease of pitch from the high tone target in the left element towards a prosodically neutral pitch level in the right element.

An investigation of the pitch accent annotations provided in the Boston corpus suggests that this is a plausible characterization. The distribution of pitch accent types for left-prominent compounds (i.e. those with negative prominence ratings) and right-prominent compounds (i.e. those with positive prominence ratings) is summarized in Table 5.6. For the sake of clarity, H* and the downstepped !H*, and L+H* and the downstepped L+!H* have been collapsed in the table, similar to the procedure in Pitrelli et al. (1994). The asterisk marks pitch accents of an unidentifiable type. The table shows that, in the case of right-prominent compounds, L+H* has been chosen for the right element only once, and three times for the left element. In left-prominent compounds, six left elements are assigned this rising pitch accent type, while 13 left elements are classified as H*. 20 right elements of left-prominent compounds have not been classified as accented.
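The counts in Table 5.6 can also be condensed into the share of elements that carry any pitch accent at all. The sketch below merely re-tabulates the published counts; the dictionary layout and the helper `accented_share` are assumptions made for this illustration.

```python
# Pitch-accent counts from Table 5.6; "*" marks accents of unidentifiable type.
ACCENT_TYPES = ("*", "H*", "L*", "L+H*", "None")
COUNTS = {
    ("left-prominent", "left"): (0, 13, 1, 6, 3),
    ("left-prominent", "right"): (1, 1, 1, 0, 20),
    ("right-prominent", "left"): (3, 1, 1, 3, 2),
    ("right-prominent", "right"): (1, 6, 0, 1, 2),
}


def accented_share(row):
    """Proportion of elements in a table row that carry any pitch accent."""
    total = sum(row)
    unaccented = row[ACCENT_TYPES.index("None")]
    return (total - unaccented) / total


for (pattern, element), row in COUNTS.items():
    print(f"{pattern:15s} {element:5s} n={sum(row):2d} "
          f"accented={accented_share(row):.2f}")
```

Read this way, the table shows a mostly unaccented right element in left-prominent compounds, whereas in right-prominent compounds both elements are accented in the majority of cases.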
Apparently, speakers do not utilize falling pitch contours to increase the perceptual prominence of the right element in right-prominent compounds. Thus, Figure 5.4 is a stylization of the pitch contours of left-prominent and right-prominent compounds, restricted to the measurement intervals within the two elements.

[Figure 5.4: Stylized shape of pitch contours in the sonorant rime parts of primary stressed syllables in left- and right-prominent compounds. Two panels (left prominence, right prominence), each showing Noun1 and Noun2.]

Taken together, these results support the large importance of pitch for the perceived prominence pattern in compounds. However, the fact that the results for intensity are highly similar to those for pitch shows that a characterization of accentuation by pitch configurations only, such as that presented in Gussenhoven
(2004), is not sufficient. It seems that the prominence pattern systematically covaries with intensity, and hence, that intensity (or loudness as the psychoacoustic sensation) is an important co-variate of accentuation as well. In the light of this, it may indeed be possible that the contribution of intensity to phonological prominence has been seriously underestimated in accounts of prosodic prominence, as Kochanski et al. (2005) and Kohler (2005) imply in their criticisms of the autosegmental-metrical view as being too much restricted to pitch effects. Nevertheless, the findings for pitch and intensity are not fully in line with the conclusion drawn in Kochanski et al. (2005), namely that “prominence and pitch movements should be treated as largely independent and equally important variables” (Kochanski et al. 2005: 1052). The model for pitch clearly shows a relation between pitch and prominence, and left-prominent compounds have a clearly different distribution of pitch targets than right-prominent compounds. It seems that the incompatibility between the IViE corpus used by Kochanski et al. and other data sources rests, indeed, in particularities of the former corpus, and is perhaps tied to British English varieties. Kochanski et al. found in simulation runs that a pitch peak of 2.4 ST was necessary to create a strong bias towards prominence in their classification algorithm, but the speakers in their recordings only rarely demonstrated pitch excursions of this size. In the Boston corpus used here, the distribution of pitch seems to be strikingly different. As Table 5.5 shows, the left and right element in left-prominent compounds differ typically by about 7 ST, which would clearly be sufficient to influence the prediction of prominence in the model of Kochanski et al. The data from the Boston corpus is thus in line with the pitch ranges examined in the studies quoted in Kochanski et al. (2005) that use synthetic speech (Gussenhoven et al. 
1997, Rietveld and Gussenhoven 1985, Terken 1991), which have been treated as evidence for the predominant role of pitch in perceptual accentuation. What this discussion illustrates is that it may be premature to generalize on the phonetic implementation of accentuation over different varieties of English. Implementation differences were found by Kochanski et al. (2005) even for the different British English varieties covered in the IViE corpus: for Belfast, loudness, duration, and F0 are classifiers of comparable performance, while for Leeds, duration and F0 were found to be rather uninformative as classifiers. It is not unlikely that the distributional differences between pitch in their data
and the present data are due to the different varieties of English investigated. Apparently, dialectal variability in the implementation of prominence differences is still largely unexplored, and merits further attention in future research.

That notwithstanding, the mixed-effects models have by and large been able to confirm the hypotheses for mean pitch and intensity in English NOUN + NOUN constructions as outlined above. The corresponding hypothesis for spectral balance was formulated somewhat hesitantly, as the literature reports equivocal results with respect to the role of this acoustic property in accentuation. The present analysis does not find spectral balance to be affected to any significant degree by the prominence pattern of a NOUN + NOUN construction. This supports the view put forward by Sluijter and Heuven (1996a) and Gussenhoven (2004) that the principal domain of this acoustic feature is the distinction between lexically stressed and unstressed syllables. The analysis presents no supporting evidence for the opposing claim made by Campbell and Beckman (1997), namely that spectral balance differences depend on accentuation.

The results of the analysis of durational changes in relation to the different prominence patterns are somewhat unexpected from an autosegmental-metrical viewpoint. Turk and White (1999), for instance, found a consistent amount of lengthening in syllables with pitch accents. Assuming that the right element is accented only in right-prominent compounds, we predicted a duration increase in the right element with increasing prominence rating, a prediction that is not supported by the mixed-effects model. Instead, right duration stays more or less constant regardless of the prominence pattern, while the duration of the left element decreases with increasing prominence ratings that represent right-prominence. This finding contradicts the results reported in Farnetani et al.
(1988), where the right element was found to be longer if the structure was right-prominent, while the left element showed a constant length regardless of the prominence pattern. Yet, the model presented above is in line with Adams (2007), who found, at least for a subset of her data, that it is the duration of the left element that co-varies with the prominence pattern of the construction.

A potential reason for the discrepancies between the present results and Adams (2007) on the one hand and Turk and White (1999) on the other may lie in lengthening effects that occur at prosodic boundaries. As Turk and Shattuck-Hufnagel (2000) and Byrd et al. (2006) summarize, there is rich empirical support for the view that syllables in the vicinity of word or phrase boundaries tend to be longer than elsewhere, and that this lengthening effect is particularly strong phrase-finally. To test whether these effects are responsible for the distribution of duration measures in the present corpus, the position of the 105 stimuli was classified as either sentence-initial, sentence-final, or sentence-medial (i.e. neither initial nor final).[13] Addition of this factor to the mixed-effect model, however, did not yield any evidence of either general lengthening effects at sentence boundaries or particularly long elements in sentence-final position, irrespective of prominence rating or compound element. Thus, the results of the present analysis seem to be largely independent of sentence boundaries, a finding that is also, at least partially, supported by Adams (2007).

[13] This classification allows detection of lengthening effects at the highest type of prosodic boundary, i.e. at sentence boundaries. As the stimuli used here are taken from an uncontrolled corpus, they are not very well-suited to detailed classification of boundary types at lower prosodic levels such as intonational phrases, as this would result in a highly unbalanced number of tokens per boundary type.

Thus, in the absence of further empirical data that specifically addresses the length of the left element in compounds, it seems a viable conclusion to regard duration as an acoustic correlate of NOUN + NOUN prominence patterns, but as one that is largely independent of accentuation. Apparently, speakers shorten the left element if it occurs in a right-prominent compound, while the duration of the right element does not vary systematically.

Finally, the analysis of phonation mode supports the findings reported in Epstein (2002). As expected, non-modal phonation occurs almost exclusively within non-prominent elements. More specifically, the right element of left-prominent compounds may feature creaky voice, while non-modal phonation is virtually absent in the left element, or in any element in right-prominent compounds. The few exceptions of left elements with creaky voice are probably linked to the observation that word-initial vowels may optionally be glottalized if they occur at prosodic boundaries (Dilley et al. 1996), and do not pertain to the prominence pattern of the whole compound. The observation that creaky voice occurs in some, but not all non-accented contexts suggests that this phonation mode may only be chosen in contexts where it does not offer acoustic cues that are in conflict with those pertaining to accentuation. Given that we have found accentuation to be characterized by high pitch and intensity, the acoustic properties of creaky voice, as discussed above in section 5.1.4, are in opposition to these characteristics, and thus rather unlikely to be found in an accented element.

In conclusion, then, the analysis above provides strong support for the autosegmental-metrical account that describes the two prominence patterns that occur in English NOUN + NOUN constructions as two different accentuation patterns. In the first pattern, which is perceived as left-prominent, mean pitch and intensity provide solid acoustic evidence for a single pitch accent on the left element, which is usually of a level or rising type. In the second pattern, which is perceived as right-prominent, each element is accented, and rising types are rather infrequent. Duration has also been found to co-vary with perceived prominence ratings, but in a way suggesting that the left element is shortened in right-prominent compounds. Apparently, neither accentual lengthening nor lengthening occurring at prosodic boundaries provides a satisfactory explanation for this shortening.

However, in this description, accentuation is treated as a binary feature: an element of a compound is either accented or unaccented. While pitch accents in
the autosegmental-metrical framework may occur in different types that may also have different pragmatic effects (see, for instance, the uses of low pitch accents described in Hirschberg 2004: 534), no further gradiency of perceptual prominence of different accent types is proposed. A closer look at the data questions whether this treatment is justified. If we consider the differences listed in Table 5.5 for the left element (column N1) in both prominence patterns, we find considerable changes of pitch and intensity, even though the left element is expected to be accented irrespective of the prominence pattern. The models in Table 5.4 for pitch and intensity both show a significant negative effect of prominence on the respective measure in the left element. Hence, in left-prominent compounds, the accent on the left element is articulated particularly strongly, and less strongly in right-prominent compounds.

That accents may differ in their relative prominence, and that this difference may be related to the size of pitch excursion, has been observed elsewhere as well, for instance by Terken and Hermes (2000), who introduce the term ‘accent strengths’ to refer to different degrees of prominence found in accented syllables. They conclude that “the question as to whether different degrees of accent strength are to be treated in a categorical or gradient way has not yet been answered satisfactorily” (Terken and Hermes 2000: 123). The present results provide at least a partial answer to this question. It seems that we find at least two different types of accent strength at work in compound prominence patterns: in a compound with left prominence, a ‘strong’ accent (associated with high pitch and high intensity) is used, while compounds with right prominence feature a weaker accent (with pitch and intensity that are higher than in unaccented syllables, but lower than in strongly accented ones) on each element.
Apparently, these types do not occur randomly, but are systematically associated with the left and right element of the two prominence patterns. The discussion in chapter 4 argued that N OUN + N OUN prominence patterns are perceived categorically, and thus, by extension, it seems plausible that the different accent strengths in N OUN + N OUN constructions are also used in a categorical way. Such a difference in accent strength provides also a possible explanation for the observation made above in chapter 4 that there is only little disagreement in prominence ratings for left-prominent compounds, but disagreement between raters increases with increasing prominence ratings. Apparently, the strong accentuation of the left element in contrast to an unaccented right element is very clearly perceived, and thus leaves little room for ambiguity in the ratings. On the other hand, the two accents of intermediate strength found in right-prominent compounds are perceptually highly similar, so that the complications involving categorical perception (see section 4.5) lead to an increased disagreement rate between listeners. The autosegmental-metrical framework, with ToBI as the proposed system of prosodic transcription, does not provide means of describing varying levels
of accent strength, as Terken and Hermes (2000) point out. In this framework, the presence of a pitch accent is seen to cue prosodic prominence of the stressed syllable with which the accent is associated. Potentially different degrees of prosodic prominence are discarded, as are potential prominence differences due to differences in pitch accent type. This simplification inherent in the ToBI labelling system is deliberate, as the system is intended to provide a symbolic representation, and not an accurate transcription of finer phonetic details (see Beckman et al. 2005 for a discussion). Yet, this restriction to a phonemic level may run the risk of missing finer, but linguistically relevant differences such as those found between the accents involved in NOUN + NOUN prominence patterns. For instance, Kohler (2006) provides examples where the ToBI system seems to fail to describe phonologically relevant differences in the intonation pattern, and he also proposes an alternative transcription system (PROLAB) for intonational prosody that encodes three different degrees of accent strength (“default”, “partially reduced”, and “reinforced”, Kohler 2006: 127). It is unclear, however, to what extent this system is successful in avoiding “the impressionistic and subjective character of the transcription of accent strength that still prevails” (Terken and Hermes 2000: 122) with current prosodic transcription systems. For an adequate account of accents that pertain to the realization of NOUN + NOUN prominence patterns, an extension of existing transcription systems seems to be necessary.
6 Classification and prediction of compound prominence patterns
The previous two chapters have investigated the perception and the acoustics of prominence patterns in English NOUN + NOUN compounds on the basis of a selection of items from the Boston corpus. The perception experiment in chapter 4 has demonstrated that most listeners are capable of assigning a rating to a given compound that reflects the prominence relation between the two elements. However, the perception experiment has also suggested that there is a considerable degree of variation between different raters, and that therefore the number of raters has to meet a critical size to obtain reliable ratings. In a study that intends to examine a larger corpus of compounds such as, for instance, the 4,367 compounds found in the Boston corpus (see chapter 3), this methodology quickly reaches its limitations. It is perhaps for this methodological reason that empirical studies of the factors that determine the choice of a prominence pattern for a given English compound are very rare, and most authors rely on a small set of hand-selected types. Chapter 7 below discusses this issue in more detail. However, the analysis of the acoustics of prominence patterns in the preceding chapter has shown that there are a number of reliable acoustic co-variates of the prominence ratings obtained from the participants in the perception experiment. Utilizing these findings in a statistical model that approximates the ratings obtained by human listeners suggests itself as a promising alternative to perception or classification experiments. This chapter proposes two such models. The first one is a linear regression model that uses acoustic measurements from compounds to estimate the median prominence rating for that compound, i.e. the prominence relation between the left and right element on a continuous scale.
The results thus correspond to the ratings obtained in the perception experiment from human listeners.¹ The second model also utilizes acoustic measurements, but attempts to assign each compound to one of the two prominence patterns that are available for English NOUN + NOUN constructions (i.e. ‘left prominence’ and ‘right prominence’). The second model thus acts as a classifier that maps continuous acoustic measures onto two discrete categories. Both models are constructed on the basis of the median prominence ratings obtained for the 105 NOUN + NOUN compounds in the perception experiment in chapter 4, and use the acoustic properties that have been established as correlates to the prominence rating in the previous chapter as model predictors. As noted above, the purpose of both models is to provide estimations of the prominence pattern for the remaining NOUN + NOUN constructions from the Boston corpus which have not been rated in the perception experiment. The acoustic measures for these compounds have been obtained by a Praat script that offers a semiautomatic way of retrieving acoustic data from a large set of recordings. This method of data retrieval is described and evaluated in the next section, which also summarizes the acoustic data obtained for the full set of compounds in the Boston corpus. The section also discusses the impact of vowel-intrinsic properties on the present statistical modelling.
¹ An earlier version of this model has been presented in Kunter and Plag (2007), and their predictions for the Boston corpus have been used in Plag et al. (2008) to test hypotheses about factors that have been proposed to determine the choice of prominence patterns in English NOUN + NOUN constructions. In chapter 7, their data is re-investigated using the more accurate predictions produced by the present model. Differences between the present model and its predecessor are discussed below where appropriate.
6.1 Data
As outlined above in chapter 3, the Boston corpus contains 4,367 NOUN + NOUN constructions that met the criteria to be treated as nominal compounds. In each element of these constructions, the sonorant part of the rime of the syllable with primary stress was manually segmented following the segmentation criteria from section 5.2.² In the previous chapter, it was found that mean pitch, intensity, duration, and pitch slope are significant co-variates of the perceived prominence pattern, which suggests them as potentially successful predictors in statistical models that attempt to predict the perceived prominence pattern of new, unrated constructions. The present models also include spectral balance as a potential predictor, even if the previous chapter provided no clear evidence for a systematic association between this property and the perceived prominence rating. Despite this, it will be shown below that the prediction profits from inclusion of this predictor.
² At this point, I wish to express my gratitude to Christian Grau, Christina Kellenter, Henner Metz, Hiromi Pat Noda, Taivi Rüüberg, and Linda Zirkel for their endless patience while praating this large amount of sound files, and to the Deutsche Forschungsgemeinschaft, who supported their work by grant PL-151/5-1.
In addition to these predictors, the present models also feature a further measure called ‘peak distance’ ($\Delta_{\text{peak}}$). The peak distance is defined as the duration of the span between the pitch peak in the left measurement interval and the peak in the right measurement interval. Unlike the other acoustic parameters discussed up to now, the peak distance is not an independent acoustic measure that is under direct control of the speaker. On the one hand, it is strongly affected by the alignment of the pitch peaks in the two elements, which seems to differ between left-prominent and right-prominent compounds (cf. Figure 5.4 above). On the other hand, the peak distance is also determined by the number of unstressed syllables that occur between the primary stress in the left and the right element. For instance, the peak distance in a compound such as welfare department is bound to be long due to the two intervening unstressed syllables -fare and de-. Thus, $\Delta_{\text{peak}}$ may not be interpreted directly as a correlate of prominence perception if it is found to contribute to the prediction of compound prominence patterns. Its primary reason for inclusion as a predictor is to address the declination effects observable for pitch and intensity during the course of an utterance (cf. section 5.1.1). We will return to this aspect of peak distance when the incorporation of the predictors into the regression model is discussed.
Not included in the statistical models presented here is the phonation mode found in the left and right element. Even though the previous chapter suggests that creaky voice phonation is inhibited in elements that are accented, and thus the presence of creaky voice in the right element is strongly associated with a perceived left-prominence of the respective NOUN + NOUN construction, this observation is not exploited to improve the predictive accuracies of the present models. The construction of a reliable classifier that assigns an element to either modal or creaky phonation is beyond the scope of this chapter. The involved acoustic correlates of non-modal phonation (cf. Blomgren et al. 1998 for a review) require fairly sensitive measures that depend on speech material with a sustained steady state in the measurement intervals. For instance, Blomgren et al. (1998) use speech material with a duration of 6 s for their analysis of phonation modes, which is obviously much longer than typically found in the naturally occurring vowels of the Boston corpus.
Thus, the inclusion of phonation mode as a potential predictor would require a manual classification of the input material, which is in contradiction to the purpose of the present models to provide an assessment of the perceived prominence pattern with as little supervision as possible. Kunter and Plag (2007) reported a regression model for the same data set that used a somewhat different set of predictors. In particular, their analysis incorporated ‘mean absolute pitch slope’ measures for the two elements instead of the pitch slope as defined in equation (5.2) above. The mean absolute pitch slope averages the absolute pitch change between two consecutive analysis frames, and thus is a measure of pitch variability within a measurement interval. It does not, however, allow conclusions about the actual shape of the pitch contour, and was therefore discarded in favour of the phonologically more informative pitch slopes.
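The limitation of the mean absolute pitch slope noted above can be illustrated with a minimal sketch. The function names, the use of semitones per analysis frame as the unit, and the toy contours are assumptions made for illustration; they are not taken from Kunter and Plag (2007). The sketch averages the absolute pitch change between consecutive frames and shows that a rising contour and its falling mirror image receive the same value.

```python
import math


def hz_to_semitones(f_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def mean_absolute_pitch_slope(pitch_hz):
    """Average absolute pitch change (in semitones) between consecutive frames.

    pitch_hz: pitch readings in Hz, one per analysis frame, all voiced.
    """
    st = [hz_to_semitones(f) for f in pitch_hz]
    steps = [abs(b - a) for a, b in zip(st, st[1:])]
    if not steps:
        raise ValueError("need at least two pitch frames")
    return sum(steps) / len(steps)


# Hypothetical contours: a steady rise and the same contour reversed (a fall).
rise = [100.0, 110.0, 125.0, 150.0]
fall = list(reversed(rise))

# Both contours are equally 'variable', although their shapes are opposite.
same_variability = math.isclose(
    mean_absolute_pitch_slope(rise), mean_absolute_pitch_slope(fall)
)
```

Because the sign of each frame-to-frame change is discarded, the measure cannot distinguish a rise from a fall, which is the reason given above for preferring the signed pitch slope.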
6.1.1 Automatic measurement procedure and evaluation
The measures were taken automatically by a Praat script written specifically for this purpose by the present author. Given that the primary design goal of the script was to allow a mostly unsupervised data collection from a large corpus, the script included tests whether automatic adjustments to several program parameters were required, in particular to those relevant for the pitch extraction algorithm. Like all automatic pitch trackers (cf. Ladefoged 2003: 86), the auto-correlation method used in Praat (cf. Boersma 1993) depends on suitable settings for the pitch range that is considered for a measurement. Inappropriate settings may introduce misreadings such as period doubling or octave jumps in the pitch contour, or may result in failure to extract a pitch measurement at all. While these settings can be adjusted manually for small data sets, an automated reading of pitch measurements requires a reliable strategy for appropriate pitch settings. Multiple such strategies are conceivable. Hirst (2007), for instance, uses the interquartile range of the distribution of pitch values as lower and upper boundaries for automated pitch readings in a two-step algorithm. While this approach may work well for modal phonation, inspection of the readings for the Boston corpus showed that it yielded an unacceptable number of incorrect readings, which may be due to the frequency of elements with nonmodal phonation (see section 5.1.4 above). The following algorithm was devised to detect these problematic instances, and adjust the pertinent settings accordingly. The pitch settings obtained in this way were used to obtain measurements for minimum, maximum, and mean pitch. Furthermore, the lower pitch boundary was also used as a setting for the intensity algorithm. As initial values for pitch floor and ceiling, 100 Hz and 500 Hz were chosen as pitch boundary settings for female speakers, and 75 Hz and 300 Hz for male speakers, following the pitch setting suggestions in the Praat manual. Using these settings, the measurement interval was checked for the following four error conditions:
1. A pitch contour could be extracted for half or less of a given interval.
2. A pitch contour could be extracted for half or more of the measurement interval, but pitch extraction failed for 20 percent or more of the total length of the interval.
3. The minimal pitch extracted from a given interval was less than 0.5 semitones higher than the pitch floor setting.
4. The pitch contour showed extraordinarily steep changes, which were assumed to represent octave jumps due to incorrect pitch boundary settings.
The second condition ensured that very short failures of the pitch tracker did not cause a change of settings. Inspection of the recordings showed that these cases were often due to very brief changes to creaky voice within a syllable with otherwise modal phonation, in particular in instances of initial glottalization (cf. Dilley et al. 1996). In these cases, the algorithm was adjusted to provide adequate readings for the stretch with modal phonation. The fourth condition was true if the difference between two adjacent pitch readings, or between two pitch readings separated only by unvoiced frames, was larger than 9 semitones.
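The four error conditions can be expressed compactly in code. The following Python sketch is not the Praat script used in this study; it assumes that a pitch contour is available as a list with one entry per analysis frame (a value in Hz, or `None` where no pitch could be extracted), and it approximates proportions of the interval by proportions of frames. The function names are hypothetical.

```python
import math


def semitone_diff(f1_hz, f2_hz):
    """Signed distance from f2_hz to f1_hz in semitones."""
    return 12.0 * math.log2(f1_hz / f2_hz)


def error_conditions(frames, floor_hz):
    """Return the set of error condition numbers (1 to 4) that apply.

    frames: one entry per analysis frame, F0 in Hz or None if unvoiced.
    floor_hz: pitch floor setting used for the extraction.
    """
    voiced = [f for f in frames if f is not None]
    unvoiced_count = len(frames) - len(voiced)
    errors = set()

    # Integer comparisons avoid rounding problems at the 50 % and 20 % limits.
    if len(voiced) * 2 <= len(frames):
        errors.add(1)                      # pitch found for half or less
    elif unvoiced_count * 5 >= len(frames):
        errors.add(2)                      # more than half, but >= 20 % missing

    if voiced:
        if semitone_diff(min(voiced), floor_hz) < 0.5:
            errors.add(3)                  # minimum hugs the pitch floor
        # Dropping unvoiced frames also compares readings that are
        # separated only by unvoiced frames.
        if any(abs(semitone_diff(a, b)) > 9.0 for a, b in zip(voiced, voiced[1:])):
            errors.add(4)                  # suspected octave jump
    return errors
```

Treating a contour that is voiced for exactly half of the interval as an instance of the first condition is a choice made here, since the first two conditions overlap at that point.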
If one or more of these error conditions was encountered, the settings for pitch floor and ceiling values were automatically reduced by one third, and the voicing threshold was reduced from the standard value 0.45 to 0.1125 to increase the sensitivity of the pitch extraction algorithm. The pitch measurement was repeated with the new settings. If, after three reductions, one or more of these conditions was still fulfilled, no pitch readings were returned for this interval. Furthermore, no pitch was measured if the duration of the measurement interval was so short that only three or fewer periods with the floor frequency would fit within it, as this was regarded as the bare minimum required for any sensible pitch extraction (cf. Ladefoged 2003: 77f). In 26 cases of the 4,367 compounds found in the Boston corpus, the automatic pitch measurement algorithm failed for one of these reasons. Exclusion of these cases reduced the number of compounds available to the present models to 4,341. The validity of the remaining pitch measurements was verified by a comparison with manual F0 measurements for a subset of 100 randomly chosen NOUN + NOUN constructions from the corpus. For each element of these compounds, the fundamental frequency was derived from a narrowband spectrum with a bandwidth of 43 Hz. This bandwidth separates the individual harmonics that constitute the speech signal (cf. Ladefoged 2003: 106), and thus allows the identification of the frequency of the lowest harmonic, which is, by definition, the fundamental frequency (cf. Johnson 2003: ch. 5). This visual inspection of the spectrogram yields a pitch measurement that is independent of the pitch extraction algorithm used in Praat. Strictly speaking, the use of a spectrum to derive the fundamental frequency is adequate only at a given point in time, or for a sound with a steady F0. Obviously, this requirement is not met in many compound elements, especially those with rising or falling pitch.
Here, the long-term average spectrum was investigated, and the location of the first harmonic was extrapolated visually. A comparison of manual and automatic measurements for the 100 item subset indicates that the magnitude of error of unsupervised pitch extractions is largely negligible. The two (log-transformed) measures show a very high correlation (r = 0.986, p < 0.001), indicating that both measures yield nearly identical results. This is supported by the low mean difference between the two measures (M = 0.057 ST, SD = 0.975). In only five cases (2.5 percent), the difference exceeds 3 ST. A non-parametric analysis of covariance (cf. Bowman and Azzalini 1997: ch. 6 and section 4.4 above) shows that the automatic pitch extraction performs equally well in both elements (test of equality: h = 33.64, p = 0.726), while a non-parametric regression finds no relation between the manual measurements and the difference between automatic and manual measurements (test of no effect model: h = 23.922, p = 0.925). In other words, the error introduced by the automatic pitch extraction averages to zero, irrespective of the actual fundamental frequency, and irrespective of whether the left or the right element is measured.
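The descriptive part of this comparison can be sketched as follows. The sketch assumes paired lists of automatic and manual F0 measurements in Hz; the five value pairs are invented for illustration and are not data from the Boston corpus, and the function names are hypothetical. It computes the difference of each pair in semitones, the mean and standard deviation of these differences, the number of pairs that differ by more than 3 ST, and the Pearson correlation of the log-transformed measures.

```python
import math
import statistics


def semitone_differences(auto_hz, manual_hz):
    """Difference (automatic minus manual) of each pair, in semitones."""
    return [12.0 * math.log2(a / m) for a, m in zip(auto_hz, manual_hz)]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented example values (Hz); the last pair mimics a creaky-voice item.
manual = [110.0, 145.0, 180.0, 210.0, 95.0]
auto = [111.0, 144.0, 182.0, 209.0, 78.0]

diffs = semitone_differences(auto, manual)
mean_diff = statistics.fmean(diffs)          # average error in ST
sd_diff = statistics.stdev(diffs)            # spread of the error in ST
n_large = sum(1 for d in diffs if abs(d) > 3.0)
r = pearson_r([math.log(f) for f in auto], [math.log(f) for f in manual])
```

The non-parametric tests reported above are not reproduced here; the sketch covers only the correlation and the summary of the semitone differences.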
That notwithstanding, the differences are largest for very low fundamental frequencies. Four of the five observations where automatic and manual frequency measurements differed by more than 3 ST were pronounced with creaky voice and had an F0 below 100.0 Hz. The non-significant test of the ‘no effect’ model implies, however, that the deviations average to zero across the whole frequency range. Hence, the automatic pitch extraction seems to be adequate even for problematic data points with non-modal phonation. This assumption is supported by the research presented in Batliner et al. (2007), who found that the classification algorithms used to identify different degrees of emotional speech were only slightly impaired by differences introduced by choosing automatically extracted pitch over manually corrected pitch measurements.
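To summarize the adjustment strategy described in this section, the following Python sketch shows the retry loop around a pitch extraction step. It is an illustration under stated assumptions, not the original Praat script: `extract` and `has_error` are hypothetical callbacks standing in for the pitch tracker and for the four error checks, the minimum-duration test is applied with the current floor at every attempt, and the reduced voicing threshold is kept for all retries. The text leaves these last two points open.

```python
# Initial pitch range settings in Hz, as given in the text.
INITIAL_RANGE_HZ = {"female": (100.0, 500.0), "male": (75.0, 300.0)}
STANDARD_VOICING_THRESHOLD = 0.45
REDUCED_VOICING_THRESHOLD = 0.1125
MAX_REDUCTIONS = 3


def measure_with_adjustment(extract, has_error, duration_s, sex):
    """Try pitch extraction, lowering the pitch range after each failure.

    extract(floor_hz, ceiling_hz, threshold) -> pitch frames (hypothetical)
    has_error(frames, floor_hz) -> True if an error condition applies
    Returns (frames, floor_hz, ceiling_hz), or None if no reading is possible.
    """
    floor_hz, ceiling_hz = INITIAL_RANGE_HZ[sex]
    threshold = STANDARD_VOICING_THRESHOLD
    for _ in range(MAX_REDUCTIONS + 1):
        # Three or fewer periods at the floor frequency: interval too short.
        if duration_s * floor_hz <= 3.0:
            return None
        frames = extract(floor_hz, ceiling_hz, threshold)
        if not has_error(frames, floor_hz):
            return frames, floor_hz, ceiling_hz
        # Reduce floor and ceiling by one third and raise the sensitivity.
        floor_hz *= 2.0 / 3.0
        ceiling_hz *= 2.0 / 3.0
        threshold = REDUCED_VOICING_THRESHOLD
    return None  # still failing after three reductions


# Toy stand-ins: extraction 'fails' as long as the floor is above 80 Hz.
calls = []


def toy_extract(floor_hz, ceiling_hz, threshold):
    calls.append((floor_hz, ceiling_hz, threshold))
    return [90.0] * 10


def toy_has_error(frames, floor_hz):
    return floor_hz > 80.0


result = measure_with_adjustment(toy_extract, toy_has_error, 0.2, "female")
```

In this toy run the first attempt uses the initial female settings and fails, and the second attempt succeeds with a floor of about 66.7 Hz and the reduced voicing threshold.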
6.1.2 Vowel-intrinsic properties
An issue when measuring pitch, intensity, and duration in speech segments is the well-known observation that different phonemes may have different intrinsic acoustic characteristics. For example, Fairbanks and House (1950) find significant intensity differences between English vowel phonemes in that a high vowel has, on average, a lower intensity than a low vowel (e.g. 14.1 dB for /u/ and 18.3 dB for /æ/). Likewise, House and Fairbanks (1953) have shown in their influential study that there are significant differences in average duration (between 0.195 s for /u/ and 0.244 s for /æ/) and average F0 (between 118.0 Hz for /a/ and 129.8 Hz for /u/). Whalen and Levitt (1995) have shown that the higher average F0 of high vowels is indeed a universal phenomenon attributable to the production process of vowels. These vowel-intrinsic properties have the potential of impairing the predictive power of a statistical model that estimates the perceived prominence on the basis of the acoustic measurements. For instance, in a compound such as wristband, the intrinsic fundamental frequency of the first vowel phoneme /i/ is higher than that of the following /æ/. If the statistical model fails to separate these intrinsic pitch values from a pitch increase that is caused by accentuation of the respective element, the performance of the model will decrease. In an experiment that attempts to identify the acoustic correlates of stress or of accentuation, these influences may be controlled for by using either material that consists of reiterant speech (thus contrasting, for instance, /’nana/ with /na’na/ as in Sluijter and Heuven 1996b), or by using pairs of natural stimuli that differ only in the prominence pattern (such as paper bág – páper bag in Farnetani et al. 1988). Both paradigms allow a direct comparison of acoustic characteristics under different experimental conditions, with vowel-intrinsic influences held constant.
A further, statistically more advanced alternative was used above in chapter 5, where the effect particular to a certain combination of vowels in the
left and right element was accounted for by the inclusion of the construction as a random effect in the mixed-effects models. However, none of these methods is suitable for a model that aims to estimate the prominence pattern of a larger corpus of NOUN + NOUN constructions. The choice of elements is unrestrained and thus obviously precludes a minimal pair or reiterant speech approach. Mixed-effects models are also unsuitable in such a situation, as it is methodologically unclear how a random effect is to be incorporated into a prediction of additional data that was not used in the initial fit of the model (Harald Baayen 2008, personal communication). A different solution to the problem of vowel-intrinsic characteristics is presented, for instance, in Lutstorf (1960) and Streefkerk (2002). Here, average values are determined for each phoneme type, and these averages are used to normalize all observations in the data set. However, any normalization procedure along these lines requires that the phonemic class of each pertaining vowel in the data set is identified. The Boston corpus includes an automatically generated phonemic transcription that uses a stochastic model to identify speech phonemes (see Kimball et al. 1992 for a description of the algorithm), but the validity of this transcription is not documented, and would require careful re-examination. As with a manual classification of the phonation mode, this highly time-consuming, and possibly error-prone, process would defeat the purpose of the present models to provide a largely unsupervised estimation of the prominence pattern. In addition to this practical consideration, it is not fully clear whether a mathematical normalization process is an adequate approximation of the way listeners react to vowel-intrinsic characteristics. The results reported by Streefkerk (2002) suggest that this is not the case.
In her study, the power of acoustic measures to predict the perceived prominence of syllables was not improved, or even decreased, if the measurements were normalized by vowel type (Streefkerk 2002: 100). It seems that, at least in the perception of prominence, vowel-intrinsic properties are accounted for in a way that is not compatible with the mathematically simple normalization processes currently available. For these reasons, the present models ignore any vowel-intrinsic effect, and thus accept a certain degree of predictive inaccuracy.³
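For illustration, the kind of phoneme-wise normalization discussed in this section can be sketched as follows. This is a generic mean-centering scheme assumed here for the purpose of the example; the procedures actually used by Lutstorf (1960) and Streefkerk (2002) may differ in detail, and the function name, vowel labels and intensity values are invented.

```python
from collections import defaultdict
from statistics import fmean


def normalize_by_vowel(observations):
    """Subtract the mean of each vowel type from its observations.

    observations: list of (vowel_label, value) pairs.
    Returns the normalized values in the original order.
    """
    by_vowel = defaultdict(list)
    for vowel, value in observations:
        by_vowel[vowel].append(value)
    means = {vowel: fmean(values) for vowel, values in by_vowel.items()}
    return [value - means[vowel] for vowel, value in observations]


# Invented intensity readings (dB) for a high and a low vowel.
readings = [("u", 14.0), ("u", 16.0), ("ae", 18.0), ("ae", 20.0)]
normalized = normalize_by_vowel(readings)
```

After normalization, the intrinsic 4 dB offset between the two vowel types has disappeared, and only the deviation of each observation from its vowel-specific average remains. The sketch also makes the practical obstacle visible: the vowel label of every observation has to be known in advance.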
³ Some modifications of the regression models reported below imply that the deteriorating influence of vowel-intrinsic characteristics may indeed be negligible. For these modifications, the vowel height in each element of the 105 compounds used in the regression analysis was classified as 0 for a high vowel, 1 for a mid vowel, and 2 for a low vowel. Inclusion of the difference between the height of the left element and the right element did not significantly improve the model, nor were there significant interactions with pitch or intensity measurements.
Table 6.1: Summary of acoustic measurements in left and right elements (N = 4341, respectively).
Measure                 Element   Range                 Mean      SD      Median
Mean pitch (Hz)         l         53.06 to 390.20       167.10    51.50   160.40
                        r         37.96 to 315.00       137.40    43.92   129.20
Intensity (dB SPL)      l         55.16 to 88.44        70.37     4.17    70.56
                        r         37.73 to 86.53        68.09     5.09    68.41
Duration (s)            l         0.035 to 0.350        0.120     0.045   0.115
                        r         0.032 to 0.539        0.134     0.061   0.121
Spectral balance (dB)   l         −39.82 to 1.80        −16.77    5.06    −16.22
                        r         −37.37 to 1.11        −16.27    4.86    −15.77
Pitch slope (ST/s)      l         −445.4 to 741.2       0.52      47.34   −7.16
                        r         −1083.0 to 845.1      −18.12    76.06   −21.96
Peak distance (s)                 0.055 to 0.843        0.364     0.173   0.337
6.1.3 Summary of acoustic measurements
Both models discussed in this chapter use the 105 NOUN + NOUN constructions that were measured and documented above in chapters 4 and 5 (for documentation, refer to Figure 4.8 for an illustration of the distribution of median prominence ratings, and to Table 5.2 for a summary of the acoustic measurements). The only parameter that was not calculated in the analysis in chapter 5 is $\Delta_{\text{peak}}$. For the 105 compounds, measurements range from 0.055 s to 0.843 s (M = 0.364 s, SD = 0.173, MED = 0.337 s). The data obtained from the 4,341 compounds found in the Boston corpus are summarized in Table 6.1. The row labels ‘l’ and ‘r’ indicate whether the measurements were taken from the left or the right element. An exception to this is the peak distance measure discussed at the beginning of this section, which is derived from the pitch peak position in both elements. The occurrence of a positive maximum for the two spectral balance measures is rather unexpected: in the source-filter model, the amplitude of harmonics is expected to decrease with increasing order of the harmonic (cf. Johnson 2003: 80). Thus, there is usually less energy in the higher range of the frequency spectrum than in the lower range, resulting in a negative spectral balance. In the present data, there are only two positive spectral balance measurements (1.795 dB in the left element of hate mail and 1.110 dB in the right element of classmates, both in recording M1BS23P1). Inspection of this recording reveals that this text has been high-pass filtered in the production process with a cut-off point at approximately 225 Hz, thus accounting for the very low spectral energy at lower frequencies. If all five compounds from this recording were excluded, the maximum observed spectral balance in the left and right element would be −3.023 dB and −0.513 dB, respectively.
6.2 Prediction of median prominence ratings
This section presents the first of two statistical models that are used to predict the perceived prominence pattern in English compounds on the basis of a set of acoustic properties. As the section heading implies, this model is phonetic in nature: it does not attempt to abstract away differences between compounds that may have been introduced by factors such as an incomplete articulatory implementation by the respective speaker. Instead, it is acknowledged that there may be compounds in which the prominence relation is perceived more clearly than in others, because the respective speaker has produced these compounds with clearer acoustic cues to the respective prominence pattern. The perception experiment reported in chapter 4 illustrates this: even though two clear prominence patterns emerge overall, there is some degree of variation for compounds that may be assigned to either pattern. For instance, the acoustic information in the compound racing days (recording M2BS23PA) provides listeners with sufficient cues to perceive the left element as prominent over the right element. Accordingly, it receives a median prominence rating that is typical of the majority of left-prominent compounds (−1.086). The same prominence pattern is also found in the compound bingo games (recording M2BS08P2), but here the acoustic information indicates such a high degree of prominence of the left element that the compound receives the lowest prominence rating of the whole test set (−1.978). Similar instances also exist for right-prominent compounds. The present statistical model reproduces this perceptual continuum of the prominence relation between the two elements of a compound, and correspondingly uses the median prominence rating as the target variable.
Linear regression analysis suggests itself as the appropriate statistical technique, as it is an established methodology to allow the prediction of a continuous response variable on the basis of a set of predictor variables (see Appendix A for a brief introduction to the topic). However, there are several methodological aspects that have to be considered to ensure the adequacy of the resulting regression model. These aspects are discussed in the next subsection, while the model specification process is described below in subsection 6.2.2. This subsection also evaluates the accuracy of the predictions. The last subsection applies the regression model to the data from the Boston corpus.
6.2.1 Predictors in the regression analysis There are different possible ways of incorporating the measurements summarized in Table 6.1 above into a regression model. Perhaps the most obvious choice is to treat all measures as separate, independent predictors; mean pitch would
109
be represented by one predictor for the left and another predictor for the right element, and likewise for the other acoustic properties. However, upon closer consideration, this choice may not be optimal. Taking up mean pitch as an example, inclusion of mean pitch measurements from the left and right element as separate predictors means that the model is unable to make use of the additional information that each two measurements are linked to each other; this link consists of the fact that they were produced in the same compound by the same speaker. Thus, the overall variance for the pitch measures increases, but this increased variance is unrelated to variance due to differences in the accentuation pattern. As an effect, it is likely that the usefulness of mean pitch as a predictor in the analysis is severely reduced if no information is provided to the model that indicates that the measures from the left and from the right element of the same compound belong to the same observation. It is possible to incorporate speaker information as a random effect in a mixedeffect model, similar to the models used above in section 5.4. In such a model, differences in fundamental frequency are taken care of by adjustments to the intercept that depend on an estimation specific to each speaker. However, it turned out that such a model is inferior to the alternative proposed in the remainder of this section.4 The regression model presented in this section pursues an alternative way of incorporating the interrelation between measurements taken from the left and the right element of a given compound by using a derived measure for each pair of pitch measurements. This derived measure, Δ pitch , is calculated in semitones as in (6.1): Δ pitch = 12 ·
log( fle f t / fright ) , log(2)
(6.1)
where fle f t and fright are the mean pitch measurements in the left and the right element, respectively. Similar differences are calculated for intensity as Δint = Ile f t − Iright
(6.2)
Δdur = tle f t − tright ,
(6.3)
and for duration as where Ile f t and Iright are the intensities, and tle f t and tright are the durations of the left and right element, respectively. While the transformation from a logarithmic 4
Anticipating the regression model presented below that uses only fixed effects, an equivalent model in which speaker information was incorporated as an additional random intercept yielded an estimated variance for this random effect that was effectively zero. In other words, the contribution of speaker information to the goodness of fit is negligible.
110
to a linear scale is frequently used for pitch measures in order to approximate perception of frequency differences (cf. Nolan 2003 for a discussion), a corresponding transformation was not regarded as appropriate for these two measures. Similar difference measures have been used in earlier studies that investigated the prominence relation between two syllables, words, or elements of a compound, for instance in Morton and Jassem (1965), Farnetani et al. (1988), and Plag (2006). In these studies, difference measures were found to co-vary strongly with the prominence distinctions that were the focus of the investigation, which suggests a corresponding approach for the present regression model.5 Conceptually, each of these three differences neutralizes context-dependent or speaker-dependent differences between the compounds. Δ pitch addresses differences in average fundamental frequency between different speakers, for instance between men and women. Δint eliminates volume differences present in different recordings, which may be due to inconsistent microphone distances, amplification variations, or speaker inconsistencies. Δdur accounts, at least partially, for differences in speaking rate that depend either on the speaker or on the specific text. For all three derived measures, a positive difference indicates that the respective measure is larger in the left element than in the right element, while a negative difference represents the reverse case. In contrast to these difference measures, the two remaining acoustic properties were incorporated as separate predictors for the left and right element, namely Ble f t and Bright for spectral balance and Sle f t and Sright for pitch slope, respectively. Incorporation into a single derived measure comparable to those just mentioned is neither conceptually plausible in a way that is comparable to the three differences above, nor did the inclusion of such difference measures improve the regression model. 5
⁵ Instead of using differences, Taylor and Wales (1987) argue in favour of contrast ratios in studies of perceptual acoustic prominence. Contrast ratios have been shown to be an adequate operationalization of the perception of visual luminance, and are calculated as $C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}}$, where $I_{\max}$ and $I_{\min}$ are the highest and lowest luminance, respectively (Michelson 1927). Taylor and Wales (1987) apply this formula to pitch, duration, and intensity measures, and find that these contrast ratios consistently outperform division or subtraction ratios in their linear regression models. However, this outcome in favour of contrasts could not be replicated in the present data set. All models containing contrasts for any of the measures under observation performed worse than corresponding models that used subtraction or division ratios instead. It is possible that the strong predictive power of contrasts in Taylor and Wales (1987) may be attributed to the nature of their material. In test sentences like John must win the race today, a speaker deliberately alternated prominence between the modal and the full verb. As a result, the modal verb is stressed and accented in one reading, but unstressed and unaccented in the other. As discussed above, such confounding of different levels of prominence marking may be harmful to the explanatory power of the analysis.
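For comparison, the three ways of relating two measurements mentioned in this footnote (contrast, subtraction, division) can be written out as follows. This is a sketch only: the function name and the two duration values are invented, and how Taylor and Wales (1987) assign the two values to the formula is not specified here.

```python
def michelson_contrast(a: float, b: float) -> float:
    """(max - min) / (max + min) for two measurements with a positive sum."""
    hi, lo = max(a, b), min(a, b)
    if hi + lo <= 0:
        raise ValueError("contrast is only computed here for a positive sum")
    return (hi - lo) / (hi + lo)


# Invented durations (in seconds) of a left and a right element.
left_dur, right_dur = 0.5, 0.25

contrast = michelson_contrast(left_dur, right_dur)
subtraction = left_dur - right_dur
division = left_dur / right_dur

print(contrast, subtraction, division)
```

Because the formula is built on the maximum and minimum, swapping the two arguments leaves the contrast unchanged, whereas the subtraction changes its sign.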
Table 6.2: Predictors included in the full regression model fitting median prominence ratings.
Measure                                     Symbol                                                 Unit
Pitch difference                            $\Delta_{\text{pitch}}$                                ST
Intensity difference                        $\Delta_{\text{int}}$                                  dB SPL
Duration difference                         $\Delta_{\text{dur}}$                                  s
Spectral balance                            $B_{\text{left}}$, $B_{\text{right}}$                  dB SPL
Pitch slope                                 $S_{\text{left}}$, $S_{\text{right}}$                  ST/s
Peak distance                               $\Delta_{\text{peak}}$                                 s
Declination-corrected pitch difference      $\Delta_{\text{pitch}} \times \Delta_{\text{peak}}$    ST s
Declination-corrected intensity difference  $\Delta_{\text{int}} \times \Delta_{\text{peak}}$      dB s
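The two product terms in the last rows of Table 6.2 can be sketched as follows. The sketch z-standardizes both components before multiplying them, so that the product is not dominated by the variable with the larger scale; it then compares how strongly the product correlates with one of its components with and without standardization. All values and names are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def z_standardize(values):
    """Subtract the mean, then divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


# Invented pitch differences (ST) and peak distances (s) for six tokens.
d_pitch = [2.0, -1.0, 0.5, 3.0, -2.5, 1.0]
d_peak = [0.30, 0.45, 0.25, 0.60, 0.35, 0.50]

# Interaction term formed from the raw variables.
raw_product = [p * k for p, k in zip(d_pitch, d_peak)]

# Interaction term formed after z-standardization.
z_pitch = z_standardize(d_pitch)
z_peak = z_standardize(d_peak)
z_product = [p * k for p, k in zip(z_pitch, z_peak)]

# Correlation between the pitch predictor and its interaction term.
r_raw = pearson(d_pitch, raw_product)
r_z = pearson(z_pitch, z_product)
print(round(r_raw, 2), round(r_z, 2))
```

In this toy data set the raw product is almost collinear with the pitch difference, while the product of the standardized variables is much less so. Whether the same holds to a comparable degree in real data depends on the distributions involved.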
Table 6.2 summarizes the measures that were included as predictors in the first, fully specified regression model. The last two rows list two interaction terms that both involve $\Delta_{\text{peak}}$: one interaction with $\Delta_{\text{pitch}}$, and another with $\Delta_{\text{int}}$. These terms were included to model potential effects of declination: it is reasonable to assume that, for instance, the difference between left and right pitch increases with a longer intervening time span, as the latter pitch peak will be subject to a general pitch declination, irrespective of the prosodic structure (cf. Gussenhoven 2004: ch. 6 and section 5.1 above for discussion). A corresponding assumption holds also for intensity. Thus, $\Delta_{\text{pitch}} \times \Delta_{\text{peak}}$ designates the pitch difference corrected for declination effects, and $\Delta_{\text{int}} \times \Delta_{\text{peak}}$ designates the corresponding intensity difference. Both terms are incorporated technically as interaction terms. In multiple regression, an interaction between two continuous predictors $x_1$ and $x_2$ is usually implemented as the product of both variables (cf. Bauer and Curran 2005). A regression model including such an interaction takes the form
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon, \tag{6.4}$$
where $\beta_0$ is the regression intercept, $\beta_1$ and $\beta_2$ are the regression coefficients for the main effects of $x_1$ and $x_2$, respectively, and $\beta_3$ is the coefficient for the interaction term expressed as the product of $x_1$ and $x_2$. As Jaccard et al. (1990) point out, inclusion of an interaction term in a regression model as in (6.4) may lead to strong collinearities between the model predictors, depending on the scale of $x_1$ and $x_2$. For instance, if the range of $x_2$ exceeds that of $x_1$ by an order of magnitude, a change in $x_2$ will influence the product $x_1 x_2$ to a greater extent than a change in $x_1$. Jaccard et al. (1990) thus propose a transformation of the variables before the formation of the interaction variable, and suggest either centring or standardization. Given the large differences in scale between the predictors involved in the present case, $\Delta_{\text{pitch}}$, $\Delta_{\text{int}}$, and $\Delta_{\text{peak}}$ were standardized using the usual z-transformation that involves subtraction of the mean, followed by a division by the standard deviation of the respective
Table 6.3: Pearson correlation coefficients for the acoustic properties in the regression model. Associated p values are given in parentheses.
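Row labels: $Z\Delta_{\text{int}}$, $\Delta_{\text{dur}}$, $B_{\text{left}}$, $B_{\text{right}}$, $S_{\text{left}}$, $S_{\text{right}}$, $Z\Delta_{\text{peak}}$
Column labels: $Z\Delta_{\text{pitch}}$, $Z\Delta_{\text{int}}$, $\Delta_{\text{dur}}$, $B_{\text{left}}$, $B_{\text{right}}$, $S_{\text{left}}$, $S_{\text{right}}$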
0.66 (