Courses on Speech Prosody [1 ed.] 9781443882972, 9781443876001


English Pages 203 Year 2015



Courses on Speech Prosody

Edited by Alexsandro Rodrigues Meireles

This book first published 2015

Cambridge Scholars Publishing
Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2015 by Alexsandro Rodrigues Meireles and contributors

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-4438-7600-3
ISBN (13): 978-1-4438-7600-1

TABLE OF CONTENTS

Acknowledgments

Introduction

Chapter One: Assessment of Speech Production in Speech Therapy Data
Aline N. Pessoa and Lílian K. Pereira

Chapter Two: Setfon: The Problem of the Analysis of Prosodic, Textual and Acoustic Data
Ana Cristina F. Matte, Rubens T. Ribeiro, Alexsandro R. Meireles and Adelma L. O. S. Araújo

Chapter Three: Stress Assignment Contrasted in Spanish and Brazilian Portuguese Prosodic Non-Verbal Words
Antônio R. M. Simões

Chapter Four: Electroglottography
Maurílio N. Vieira

Chapter Five: Time-Normalization of Fundamental Frequency Contours: A Hands-On Tutorial
Pablo Arantes

Chapter Six: HMM-based Speech Synthesis Using the HTS-2.2
Sarah Negreiros de Carvalho and Fábio Violaro

Chapter Seven: Speech Prosody: Theories, Models and Analysis
Yi Xu

Chapter Eight: The Validity of Some EGG Measures to Predict Laryngeal Voice Quality Settings: Perceptual and Phonatory Correlates
Zuleica Camargo, Sandra Madureira and Luiz Carlos Rusilo

ACKNOWLEDGEMENTS

We would like to thank Roseli de Fátima Dias A. Barbosa for her permission to publish her painting on the book cover, and Nathália Reis for the design of the book cover. Also, we acknowledge the Brazilian research institutions that financed the II School of Prosody: CAPES, FAPES, UFES, and LBASS.

INTRODUCTION

In recent years, speech prosody research in Brazil has grown significantly, mainly due to a series of events organized in the country and to the support of the Brazilian government. The 1st Brazilian Colloquium on Speech Prosody, organized by Dr. João Antônio de Moraes, took place at the Federal University of Rio de Janeiro in 2007 and gathered researchers from all over the country. The main objectives of the Brazilian Colloquium on Speech Prosody series are to promote discussion, to disseminate work on the various fields of prosody research throughout Brazil, and to bring together researchers from related fields such as linguistics, speech therapy, phonetics, neuroscience, engineering, psychology, and language teaching, among others.

Following the success of this first enterprise, the Luso-Brazilian Association of Speech Sciences (LBASS) was created in 2007 to support the organization of the Speech Prosody 2008 Conference, which was held in Campinas in May 2008. This international conference, organized by Drs. Plínio Barbosa, César Reis, and Sandra Madureira, built on the success of the earlier initiative and gathered prosody researchers from all over the world.

The 2nd Brazilian Colloquium on Speech Prosody, organized by Dr. Plínio Barbosa, took place at the State University of Campinas in 2009. This was the first Brazilian event supported by LBASS. At this event, as a further means of developing Brazilian prosody research, the I School of Prosody was proposed, with the mission of providing a ground for the development of prosody studies through the interplay of young and senior prosody researchers.

In order to disseminate Brazilian prosody studies, these two events take place regularly, each every two years and in alternation. The following events have therefore been held in previous years:

a) 3rd Colloquium on Speech Prosody, organized by Dr. César Reis at the Federal University of Minas Gerais in 2011;
b) I School of Speech Prosody, organized by Sandra Madureira and Zuleica Camargo at PUC-SP in 2010;
c) II School of Speech Prosody, organized by Dr. Alexsandro Meireles at the Federal University of Espirito Santo in 2012;
d) 4th Colloquium on Speech Prosody, organized by Dr. Miguel Oliveira at the Federal University of Alagoas in 2013.


Due to increasing support from Brazilian grant agencies in recent years, the II School of Speech Prosody was able to bring international researchers to Brazil for the first time, which greatly contributed to improving speech prosody research in Brazil and to giving it visibility. The internationalization of speech prosody events continued at the 4th Colloquium on Speech Prosody, and will continue at the upcoming III School of Prosody, to be held in Campinas in August 2014. It is clear from the above that there is an increasing number of quality works on speech prosody research in Brazil, which is undoubtedly the direct result of the initiatives proposed by LBASS.

This book arose from the fact that there are very few textbooks dealing directly with speech prosody methodology and data. For this reason, we propose a book made up of a selection of the best-attended prosody courses given during the II School of Prosody, and we expect this enterprise to contribute to prosody education not only in Brazil but also in other countries. We also expect this book to be the first of a series on speech prosody research in Brazil.

Alexsandro Rodrigues Meireles (Editor)

CHAPTER ONE

ASSESSMENT OF SPEECH PRODUCTION IN SPEECH THERAPY DATA

ALINE N. PESSOA¹ AND LÍLIAN K. PEREIRA²

Abstract

Based on phonological therapy, this course addressed the complex relationships between speech perception and speech production by integrating information from the physiological, perceptual and acoustic spheres. Such foundations allow speech therapists to explore and make inferences about the manifestations of speech in cases without alterations as well as in cases with diverse disturbances. The content aimed to cover practical issues, from the composition of the corpus to be assessed to the possible uses of instruments for phonetic analysis. Through lectures and examples with speech therapy data, the course aimed to exemplify the different instrumental analyses and their correlations with clinical data, which allowed short- and long-term speech instances to be detailed. In the case of short-term data, the focus was on speech segments (consonant and vowel sounds). Regarding long-term data, recurring aspects of speech emission were highlighted, such as the prosodic properties and, among them, vocal quality and dynamics. Thus, the goal of this course was to perform phonetic analysis using different instruments as auxiliary clinical tools for understanding speech characteristics.

Keywords: clinical phonetics, speech therapy, acoustic analysis, auditory perception, speech production.

¹ Department of Speech Therapy/Audiology - Federal University of Espirito Santo (UFES), Vitória-ES-Brazil. Laboratory of Cognition and Acoustic Analysis (LIAAC) – Pontifical Catholic University of Sao Paulo (PUCSP).
² Laboratory of Cognition and Acoustic Analysis (LIAAC) – Pontifical Catholic University of Sao Paulo (PUCSP).


1. Clinical context

From reflection on the interface between two fields of study – speech therapy and phonological therapy – we identified the relationships between speech perception and production from an indissoluble and dynamic perspective. From the theoretical perspectives involved in these relationships, we identified the different areas of knowledge in speech therapy, the field concerned with neurovegetative functions and human communication (orofacial motricity, swallowing, breathing, oral language, hearing and balance, voice, reading and writing).

A well-known and much-debated challenge in speech therapy, which calls for consistent theoretical and practical training, is the choice of adequate instruments for evaluating and monitoring speech perception and production skills. This challenge arises from the need to obtain data and to interpret them analytically, in order to understand the evidence, repercussions and implications of the phenomenon, especially from a clinical perspective.

Starting from this discussion, we proposed this short course as an introductory reflection on clinical data from patients with hearing loss who use hearing devices (personal amplification device [PAD] and/or cochlear implant [CI])³. The data were interpreted in line with phonetic-acoustic theory, the science that studies speech sounds in order to characterize the mechanisms involved in the production and perception of the sounds of a language, in this case Brazilian Portuguese (BP). We thus pointed out the instances involved in interpreting data on the basis of the principles of phonetics: articulatory aspects (transmitter, speaker, production of speech sounds or phonemes); acoustic aspects (related to the message, its transmission and the sound wave, considering the parameters of frequency (Hz), intensity (dB) and duration (ms) and their possible relationships in the short and long term); and perceptual aspects (involving the receiver/listener, understanding and the complexity of central auditory processing).

³ PAD is an electronic hearing device comprising a system of individual sound amplification based on electro-acoustic algorithms. It consists of the following components: microphone; amplifier (with analog, digital or hybrid processing of electro-acoustic signals); and receiver. The CI is a type of electronic hearing device that provides the sensation of hearing to users via electrical stimulation of the auditory system. It consists of an external part (microphone, microprocessor and transmitter) and an internal part (receiver and stimulator, a reference electrode and a set of electrodes that are surgically inserted into the cochlea (inner ear) and/or the auditory neural pathways). The goal is to take the acoustic signals, encoded electrically, to the brain, where they will be decoded and interpreted as sounds.

2. Theoretical Perspectives

Based on the acoustic theory of speech production (Fant, 1960), and considering that the acoustic characteristics of each speech sound are determined by the constrictions and bifurcations of the vocal tract and by the fundamental frequency (f0), we proposed to address three sets of issues: the methodological aspects involved in understanding speech data on the basis of these theoretical assumptions; the aspects to be considered in the short- and long-term steps of acoustic analysis, including possible approaches from other spheres (physiological and auditory-perceptual); and the basic principles for adopting a method of systematizing and interpreting evidence from statistical correlations.

2.1 Acoustic analysis

The applicability of acoustic phonetics to the analysis of voice and speech production in the clinical, educational, artistic, technological and expert fields is indisputable. However, applying experimental phonetics raises methodological challenges, which can be detailed according to the following procedures:

1) formulation of hypotheses;
2) selection/constitution of data – corpus/subjects;
3) recording of the corpus, with attention to environmental conditions (noise, reverberation, control of intensity (dB) and microphone);
4) data analysis (choice of instruments, spectral analysis);
5) data processing; and
6) interpretation of results.

We highlight the caution and sensitivity required when defining the corpus (delimitation of the corpus; spontaneous, semi-mediated or mediated speech, or text reading; context, phonetically balanced or unbalanced; random productions or speech excerpts), since these factors are directly related to the interpretation of the data. With respect to data collection, for studio recordings it is recommended to use a soundboard for digitizing sound files, a headset microphone and sound editing software. In the specific situation of recordings made in loco with children in speech therapy, it is recommended to use a portable digital recorder and a headset microphone. The acoustics of the environment should also be taken into consideration.


As a tool for acoustic analysis, we used the free software Praat (www.fon.hum.uva.nl/praat/), which allowed the speech data obtained from speech therapy to be presented as spectra and spectrograms of the sound wave. Spectra are diagrams that represent the amplitude and frequency of the simple (sinusoidal) components of a sound at a given point in time, and spectrograms are diagrams that show how the spectrum of a sound evolves through time, with either broadband or narrowband resolution. These representations allowed introductory concepts to be addressed, so that the theoretical and methodological assumptions linking speech therapy and clinical phonetics could be adopted. Segmentation of the acoustic signal is a basic step of acoustic analysis and, at the same time, a complex one. The criteria used for segmentation must be rigorously adopted and maintained, and the articulatory characteristics of the segments, their acoustic correlates and the notion of coarticulation must be taken into account.
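The broadband/narrowband distinction mentioned above comes down to the length of the analysis window: a short window gives a wide analysis bandwidth and a long window a narrow one. The following Python sketch is only an illustration of that trade-off and is not part of the course material or of Praat. The sampling rate, the Hann window, the 1000 Hz test tone and the 5 ms and 30 ms window lengths are all assumptions chosen for the example.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed value for this synthetic example


def hann_frame(tone_hz, window_s, sample_rate=SAMPLE_RATE):
    """One Hann-windowed frame of a pure tone, window_s seconds long."""
    n = round(window_s * sample_rate)
    return [
        math.sin(2 * math.pi * tone_hz * t / sample_rate)
        * (0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1)))
        for t in range(n)
    ]


def magnitude_at(frame, freq_hz, sample_rate=SAMPLE_RATE):
    """Magnitude of the frame's spectrum evaluated at one frequency."""
    step = -2j * math.pi * freq_hz / sample_rate
    return abs(sum(x * cmath.exp(step * t) for t, x in enumerate(frame)))


def half_power_bandwidth(window_s, tone_hz=1000.0, grid_hz=2.0):
    """Width (Hz) of the spectral peak of a pure tone at the -3 dB level.

    The width is measured by stepping away from the tone frequency in
    grid_hz steps until the magnitude drops below peak / sqrt(2).
    """
    frame = hann_frame(tone_hz, window_s)
    threshold = magnitude_at(frame, tone_hz) / math.sqrt(2)
    upper = tone_hz
    while magnitude_at(frame, upper) >= threshold:
        upper += grid_hz
    lower = tone_hz
    while magnitude_at(frame, lower) >= threshold:
        lower -= grid_hz
    return upper - lower


if __name__ == "__main__":
    for label, window_s in (("broadband", 0.005), ("narrowband", 0.030)):
        width = half_power_bandwidth(window_s)
        print(f"{label}: window {window_s * 1000:.0f} ms, bandwidth about {width:.0f} Hz")
```

With these assumed settings the short window should give a bandwidth several times wider than the long one. This is why a broadband spectrogram blurs individual harmonics but follows rapid events well, whereas a narrowband spectrogram resolves harmonics at the cost of temporal detail.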

3. Brazilian Portuguese research

3.1 The relevance of segmental aspects

With regard to the segments of BP, their articulatory and acoustic characteristics were specified and the following points were discussed: a) vowels: periodic complex sounds, whose acoustic characteristics are determined by a sum of sine waves, produced with no obstruction of the air passage but with resonance regions in which energy is more or less concentrated in the tube; and b) consonants: sounds produced with some kind of obstruction (partial or total) in the vocal tract, which interrupts the air passage. The criteria for classifying consonantal segments are: 1) manner of articulation; 2) place of articulation; 3) voicing; and 4) nasality vs. orality. We highlighted the path of the egressive airstream from the chamber, the role of the vocal folds and the soft palate, and the place and manner of articulation. Especially for speech therapy, we emphasized the peculiarities of rhotics (Gregio et al., 2012): although they are grouped under this name, the group is not defined by common articulatory characteristics.

The examples and studies cited for the specific case of vowels explore formant patterns that indicate limited movement in vowel articulation. In this sense, findings of decreased frequency of the first (F1) and second (F2) formants, related respectively to the degree of openness of the vocal tract (vertical positioning of tongue and jaw) and to anterior-posterior tongue movements, and even findings of a tendency towards a fronted and lowered tongue, have been reported (Benedicte et al., 1989; Mendes, 2003; Barzaghi-Ficker, 2003; Cukier & Camargo, 2005; Campisi et al., 2005; Pereira, 2007; Serkhane et al., 2006; Seifert et al., 2002; Peng et al., 2008). Such evidence has also been described by means of other methodological approaches, for example at the supra-segmental level, as in Pessoa’s thesis (2012).

Regarding segmental analysis of speech data, we identified studies (Barzaghi-Ficker, 2013; Pereira, 2008, 2012) that, following the precepts of phonetic-acoustic analysis and based on the acoustic theory of speech production (Fant, 1960) and articulatory phonology (Browman & Goldstein, 1986, 1990, 1992), detailed the production of the alveolar plosive consonants of BP in subjects with hearing impairments. In the study by Pereira and Madureira (2012), measures of duration (ms), f0 and formant values (Hz) were extracted with Praat from a corpus of words containing plosive consonants (‘tata’, ‘data’, ‘cata’ and ‘cada’) inserted in a carrier phrase. The acoustic analysis showed that the most affected parameter was the percentage of voicing over the total duration of the consonants. In this sense, the assessment of the production characteristics of the alveolar plosive consonants of BP in different positions (in this case, the unstressed position of two-syllable words stressed on the second syllable) in the speech of a subject with hearing impairment, especially as regards the voicing contrast, pointed to clinical developments.
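To make the voicing measure above more concrete, the sketch below computes the duration of a labelled consonant interval and the percentage of analysis frames within it that are voiced. It is a minimal illustration, not the procedure used by Pereira and Madureira (2012): the frame-based f0 track, the convention that unvoiced frames carry an f0 of 0, the 10 ms frame step and all the values are assumptions made for the example.

```python
def voicing_percentage(f0_track, start_ms, end_ms):
    """Percentage of frames in [start_ms, end_ms) that are voiced.

    f0_track is a list of (time_ms, f0_hz) pairs; a frame counts as
    unvoiced when its f0 value is 0 or None (assumed convention).
    """
    frames = [f0 for time_ms, f0 in f0_track if start_ms <= time_ms < end_ms]
    if not frames:
        raise ValueError("no analysis frames inside the interval")
    voiced = sum(1 for f0 in frames if f0)
    return 100.0 * voiced / len(frames)


# Hypothetical f0 track: one frame every 10 ms, voicing switched off
# between 140 ms and 200 ms.
track = [(i * 10, 120.0 if i < 14 or i >= 20 else 0.0) for i in range(30)]

# Hypothetical consonant boundaries taken from a manual segmentation.
consonant_start_ms, consonant_end_ms = 100, 200

duration_ms = consonant_end_ms - consonant_start_ms
percent_voiced = voicing_percentage(track, consonant_start_ms, consonant_end_ms)
print(duration_ms, percent_voiced)
```

In this invented example, voicing continues for the first four of the ten frames in the consonant interval. A measure of this kind depends directly on the segmentation criteria, which is one reason they need to be applied consistently.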
In view of this research, which listed hypotheses on the importance of considering factors such as focus, degree of accentuation and coarticulation as sources of interference, the results are analyzed from the assumptions of articulatory phonology and acoustic phonetics. Thus, decreased synchronicity between articulatory gestures, which may be present in the speech of subjects with hearing impairment, would delay the interruption of voicing and, consequently, change how voicing is perceived in these subjects’ speech. This type of description produces notable evidence for therapy, which could not be detailed without the acoustic tool.

At this point, it is important to stress the methodological adequacy and consistency of the statistical analysis adopted: significance of the data found; correlations between speech production and perception; intra- and inter-subject comparison; analysis procedures; and a predetermined significance level. Some situations require greater care and consideration, for example pre-judgments, auditory memory and linguistic knowledge in cases of individuals with hearing impairment (limited auditory perception).

3.2 Approaching supra-segmental aspects

Inextricably linked to segmental aspects, the prosodic approach deals with the set of speech phenomena that include long-term variability in frequency, intensity and duration. Definitions of the field vary; according to Barbosa (2010), it comprises the analysis of phonic units and their relationships, from the syllable to the oral text. Prosody makes it possible to: define the mode of utterance (declarative, interrogative or exclamatory); organize speech structurally through chaining and prominence, interacting with the syntax; perform a pragmatic function (integration of the message with its context); define attitudes through speech and, therefore, establish the relationship between the speakers; express emotions; and even characterize the speaker (socially and physically, for example) (Barbosa, 2010). It is known that prosodic elements, from the earliest vocalizations and gurgles of babies and young children, play a fundamental role in the perception and production of speech sounds, being connected to the symbolic and cognitive development that pervades the relationship between sound and sense.

3.3 Sounds and prosody – complementarity

In this context, permeated by segmental and supra-segmental instances, we identify the methodological challenge of undertaking phonetics-based studies that consider the many variables involved in delimiting the research corpus, exploring contexts for recording speech samples in therapeutic settings and without comparison with adult patterns. It is possible not to depend on a standardized corpus, which offers conditions for analyzing spontaneous speech excerpts (Pessoa et al., 2010, 2011, 2012).

Complementarity between segmental and prosodic elements in speech therapy has been addressed in studies on speech production (Guedes, 2003; Magri, 2003; Andrade, 2004; Camargo et al., 2004; Peralta, 2005; Cukier et al., 2006; Lima et al., 2007; Magri et al., 2007; Blaj et al., 2007; Camargo & Madureira, 2010; Madureira & Camargo, 2010; Rusilo et al., 2011), especially because of the importance of reflecting on the impact of speech development on communication and interaction in the paralinguistic field (short-term adjustments of vocal quality used to signal excitement, communicative intentions, etc.) and in the extra-linguistic field (long-term vocal quality), which is not always consciously controlled (Mackenzie-Beck, 2005).

Considering the variability of speech patterns and the complex interactions between perception and production mechanisms described by dynamic models of speech (Boothroyd, 1986; Fujimura & Hirano, 1995; Lindblom, 1990; Barbosa, 2006, 2007, 2009; Xu, 2011; Hirst, 2011; Fourcin & Abberton, 2008), together with the variety of possible and predictable adjustments, allows us to understand the combinations that result from detailed compensations. This is supported by the acoustic sphere, which is based on the interdependent functioning of the laryngeal and supralaryngeal structures of the vocal tract (Cukier, 2006; Gregio et al., 2006). Thus, we highlight the refinement of the action of the vocal tract in phonation and the various compensations that its plasticity makes possible.

4. Perception analysis – VPAS

In the approach to vocal quality and vocal dynamics, the Vocal Profile Analysis Scheme (VPAS) (Camargo & Madureira, 2008) has allowed the exploration of clinical data on the long-term tendencies of speech production that characterize a particular speaker (the product of respiratory, laryngeal/phonatory and supralaryngeal/articulatory activities). Mackenzie-Beck (2005) stated that they are: “those features that are present more or less all the time in which a person is speaking”, i.e., “An almost-permanent quality traversing all the sounds that emanate from the speaker’s mouth” (Abercrombie, 1967). From the phonetic point of view, there is the notion of setting (adjustment): “Recurrent feature translated as a tendency of the vocal apparatus to be subjected to a particular muscle long-term adjustment” (Laver, 1980).

From the acoustic point of view, vocal quality and dynamics have been explored in our group by combining a group of acoustic measures (Hammarberg & Gauffin, 1995; Barbosa, 2009; Camargo & Madureira, 2009; Rusilo et al., 2011). Long-term acoustic measures were extracted using the Expression Evaluator script (Barbosa, 2009). Such measures are extracted from speech excerpts, and labeling of the excerpts is not required. This method of analysis therefore does not require a standardized speech sample, and it is applied to assess the acoustic correlates of vocal quality settings and vocal dynamics.

Auditory-perceptual analysis (through the VPAS-PB script) and acoustic analysis (through the Expression Evaluator script), and the correlations between them, based on dynamic models and the methodological procedures of experimental phonetics, allow speech production to be approached in speakers with and without speech alterations. Because such instruments work with spontaneous speech, they allow production to be discussed without setting up a dichotomy between normality and pathology.

Methodologically, controlled experimental situations are indisputably necessary, and such systematic experimental control can be indispensable in our quest to understand the mechanisms underlying speech. However, with the advent of prosodic studies, especially those of expressiveness, such procedures can compromise and hinder the recognition of factors that contribute to the description and understanding of particular elements of human speech (Xu, 2010).

For the acoustic approach, a combination of acoustic measures (Barbosa, 2006, 2007, 2009) relating to f0, the first derivative of f0, intensity, spectral decline and the long-term spectrum was used (Camargo & Madureira, 2010; Pessoa et al., 2010, 2012; Rusilo et al., 2011; Lima-Bonfim, 2012; Camargo et al., 2012; Queiroz, 2012). Methodologically, we highlight the use of acoustic measures obtained through long-term techniques (from the processing of speech excerpts and not of isolated units). This procedure does not require labeling of vowel and consonantal segments, which may not be well delimited in certain productions, whether in earlier stages of language development or because of particular characteristics of the speakers, such as productions considered altered for the age bracket.
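As a rough illustration of what long-term measures of f0 and of its first derivative can look like, the Python sketch below summarizes a whole f0 contour without any segment labels. It is not the Expression Evaluator script, whose measures and settings are not reproduced here; the function names, the choice of semitones relative to the median, the 10 ms frame step and the five-frame contour are assumptions made for the example.

```python
import math
import statistics


def semitones(f0_hz, ref_hz):
    """Distance in semitones between f0_hz and a reference frequency."""
    return 12 * math.log2(f0_hz / ref_hz)


def long_term_f0_summary(f0_hz, frame_step_s):
    """Summarize an f0 contour taken from an unlabelled speech excerpt.

    f0_hz holds one value per analysis frame, with None for unvoiced
    frames (assumed convention). The derivative is only computed between
    consecutive voiced frames, so voicing gaps do not create jumps.
    """
    voiced = [f for f in f0_hz if f]
    median_hz = statistics.median(voiced)
    st = [semitones(f, median_hz) if f else None for f in f0_hz]
    st_voiced = [s for s in st if s is not None]
    derivative = [
        (b - a) / frame_step_s
        for a, b in zip(st, st[1:])
        if a is not None and b is not None
    ]
    return {
        "median_hz": median_hz,
        "range_st": max(st_voiced) - min(st_voiced),
        "mean_abs_df0_st_per_s": statistics.mean(abs(d) for d in derivative),
    }


# Hypothetical five-frame contour with one unvoiced frame, 10 ms step.
summary = long_term_f0_summary([100.0, 200.0, None, 200.0, 100.0], 0.01)
print(summary)
```

Because the summary is computed over every voiced frame in the excerpt, no vowel or consonant boundaries are needed, which is the property the text emphasizes for productions that are hard to segment.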
From the earliest babblings, especially in children with hearing impairment, the perception of acoustic signals during the articulatory movements of babbling may provide evidence of the importance of discovering the capabilities of the vocal tract and of learning the relationships between movement and perception, through the sequence of motor gestures and adjustments based on auditory feedback (Meier et al., 1997; Bailly, 1997; Boysson-Bardies et al., 1999; Serkhane et al., 2006; Iverson et al., 2007). In children, the phonatory system is oscillating and, due to the nonlinear relationships between its elements, it can show highly variable patterns. A method of description able to include all these mobilizations can indicate, over time, changes in the adopted patterns that define the maturation of the mechanism, the learning of phonatory control categories and, possibly, the use of different vocalizations in social contexts (Buder et al., 2007).

In this way, the relevance of speech therapies for the therapeutic process is confirmed, because perceptual and acoustic data enable the speech therapist to explore and make inferences about speech manifestations, both in cases without alterations and in cases of disturbances of diverse origins. Such evidence can help define clinical outcomes, i.e., indicators for hearing care services. In addition, this evidence can outline clinical indicators and therapeutic management, as demonstrated in this short course for children with hearing impairment with regard to strategies related to articulatory control and precision in reaching the acoustic target.

There are two challenges: a) making an appropriate and well-articulated decision to approach the data from consistent methodological assumptions; and b) reflecting on the relationship between the segmental and the supra-segmental, which unfolds into the influence of temporal processes, variability and the combinations between the parameters of frequency, intensity and duration offered by the devices. We also emphasize the need to perform phonetic-acoustic analysis that allows modes of speech and voice production to be inferred and correlated with the plane of speech perception.

5. Final considerations

Delimiting evidence and clinical outcome indicators has been the great challenge in studies that relate the spheres involved in speech production and perception. Through the innovative technologies offered by the CI, it has been possible to achieve good performance in hearing acuity, revealing speakers with excellent detection, discrimination and recognition of speech sounds across a wide range of frequencies (reaching higher frequencies) and even at the weakest intensities, which can be observed through consistent responses to sound stimuli. Such speakers show good results on specific tasks of speech sound perception (vowel and segmental), thus confirming the efficiency of the electrical stimulation technology provided by the CI.

The correlation between speech therapy and clinical phonetics is a fertile field, and it allows clinical evolution to be understood. The data of this study highlighted the importance of creating a population-based database, aimed at service indicators and at decision making regarding the population and technological procedures and, above all, based on a speech corpus collected during therapeutic procedures, with the data discussed longitudinally. In view of the contributions of the analysis of prosodic elements, we suggest incorporating tools for speech assessment into the routine clinical monitoring of this population and into research that considers speech as its central object.


References

Barbosa, P. A. (2009) Detecting changes in speech expressiveness in participants of a radio program. In: Proceedings of Interspeech. v. 1, 2155-2158. Brighton, United Kingdom.

Camargo, Z. A.; Madureira, S. (2009) Dimensões perceptivas das alterações de qualidade vocal e suas correlações aos planos da acústica e da fisiologia. Delta. Documentação de Estudos em Linguística Teórica e Aplicada (PUCSP. Impresso), v. 25, p. 285-317.

Camargo, Z.; Navas, A. L. (2008) Fonética e fonologia aplicadas à aprendizagem. In: Zorzi, J.; Capellini, S. Dislexia e outros distúrbios de leitura-escrita. São José dos Campos: Pulso. p. 127-157.

Fant, G. (2000) Half a century in phonetics and speech research. Fonetik 2000, Swedish phonetics meeting in Skövde, May 24-26.

Gregio, F. N.; Gama-Rossi, A.; Madureira, S.; Camargo, Z. N. (2006) Modelos teóricos de produção e percepção da fala como um modelo dinâmico. Rev CEFAC, São Paulo, v. 8, n. 2, 244-247.

Johnson, K. (2003) Acoustic & Auditory Phonetics. Malden: Blackwell.

Kent, R. D.; Read, C. (2002) The Acoustic Analysis of Speech. 2nd ed. Albany, NY: Singular Thomson Learning.

Pereira, L. K.; Madureira, S. (2012) A produção das plosivas alveolares /T/ e /D/ por um sujeito com deficiência auditiva: Um estudo fonético-acústico. Intercâmbio (PUCSP), v. XXIII, p. 128-151.

Pessoa, A. N.; Novaes, B. C. A.; Pereira, L. K.; Camargo, Z. A. (2011) Dados de dinâmica e qualidade vocal: correlatos acústicos e perceptivo-auditivos da fala em criança usuária de implante coclear. Journal of Speech Sciences 1(2): 17-33.

Rusilo, L. C.; Madureira, S.; Camargo, Z. (2011) The validity of some acoustic measures to predict voice quality settings: trends between acoustic and perceptual correlates of voice quality. Proceedings of the Fourth ISCA Tutorial and Research Workshop on Experimental Linguistics. Paris: ISCA, p. 115-118.

CHAPTER TWO

SETFON: THE PROBLEM OF THE ANALYSIS OF PROSODIC, TEXTUAL AND ACOUSTIC DATA

ANA CRISTINA FRICKE MATTE,¹ RUBENS TAKIGUTI RIBEIRO,² ALEXSANDRO MEIRELES³ AND ADELMA L. O. S. ARAÚJO⁴

Abstract

Setfon is an open source web information system for data collection in the field of speech sciences. The system is component-based and emerged from the need to process an increasing amount of acoustic-phonological data, in order to solve the problem of statistical significance in studies of emotional and stylistic speech. This chapter presents the software and its components from the different perspectives involved in data collection: Acoustic Phonetics, Phonology, Semiotics, Information Technology and Computation.

Keywords: technology, acoustic phonetics, phonology, phonostylistics.

¹ Universidade Federal de Minas Gerais, UFMG, Faculdade de Letras, POSLIN, Belo Horizonte, MG, Brasil.
² Universidade Federal de Lavras, UFLA, Faculdade de Ciência da Computação, TecnoLivre, Lavras, MG, Brasil.
³ Universidade Federal do Espírito Santo, UFES, Departamento de Línguas e Letras, PPGEL, Vitória, ES, Brasil.
⁴ Universidade Federal de Minas Gerais, UFMG, Faculdade de Letras, POSLIN, Belo Horizonte, MG, Brasil.


1. What is Setfon?

Research on speech is generally divided into acoustic, acoustic-phonetic, phonological and expressive (emotion, attitude, etc.) approaches. The expressive approaches focus especially on the content of what has been said, trying to relate a communicative intention to a particular sound expression. Therefore, physical, linguistic and semiotic information is needed to determine whether this relationship exists and, if it does, its degree and type. Data collection, hence, cannot focus on only one aspect of speech production, which significantly increases the number of parameters to be collected and analyzed.

Setfon is an open source web information system for data collection in the field of speech sciences. The system addresses the problem of collecting and managing data through software components. It was created from the need to process an increasing amount of acoustic-phonological data, in order to meet the demands of statistical significance in studies of expressive speech. In addition, since it is available online, Setfon permits the creation of a national database that can be shared among speech scientists.

This work would not have been possible without an interdisciplinary team with very different concerns but common goals. This chapter explores the different facets of the work by presenting the program’s context and technology.

2. Phonetic-Acoustic Database

According to many scholars, working with phonetics means standing between linguistic and non-linguistic studies. The situation is further complicated when it comes to articulatory and acoustic phonetics, given the amount of knowledge of physics and human anatomy involved in those fields. However, phonetic studies cannot be disconnected from phonological studies, so the rejection of phonetics as part of linguistics by some phonologists is not justified.

From our point of view, the idea that phonetics is not part of linguistics is largely the result of a reversal of priorities in how phoneticians spend their research time, which can be divided into raising hypotheses and preparing the experiment, data collection, and analysis. The elaboration of the problem and the analysis of data are epistemologically and linguistically well founded, but occupy a third or less of the phonetician’s research time; most of the time is spent collecting acoustic or articulatory data on speech sounds. The work of segmenting and labeling sound samples is the main reason why the work of the linguist phonetician is misinterpreted.

Unfortunately, many researchers who develop automatic speech segmentation software, electrical engineers for instance, are not motivated by linguistic concerns. The automatic speech segmentation programs developed by engineers usually aim at speech synthesis or recognition, and therefore restrict phonetic transcription to a single element linked to a phonic segment. On the other hand, programs such as Praat, an open source speech analysis program for the speech science community, allow several levels of information to be connected to a sound file. Nevertheless, in Praat the whole process of labeling is necessarily manual (labeling being understood as the process of linking information, e.g. phonetic transcription, speech rate applied to the speaker, or emotion reported by the speaker, to a predetermined sound unit). These automatic speech segmentation programs undoubtedly represent major advances for the linguist, but they are insufficient to reverse the priorities in the schedule of phonetic research. This reversal is vital so that phonetic studies applied to speech technology can reach international competitiveness.

3. Setfon: Proposal optimization

To solve this problem, a semi-automatic labeling system was developed that not only meets the needs of the linguist phonetician but also allows for future interaction with speech recognition and speech synthesis. This system is Setfon: an algorithm for the production and organization of phonological semiolabelers. Phonological semiolabelers are the products of a tool for annotating and organizing data from textual, syntactic and semantic analyses, information about the recording, and any other information relevant to acoustic phonetic data (e.g. phonological and phonetic labels). This tool aims to speed up the process of preparing data for phonetic analysis, and represents an important advance for phonetics research in Brazil, given the originality of the proposal.

Setfon’s algorithm combines a sound file (.wav or other format) with a text file (.txt) in order to obtain a labeled segmentation with access to information such as duration, intonation curve, and labeling, which can receive new information according to the researcher’s demands. Finally, it returns information about each segment in tables. Given the nature of this process, the tool manager is, from a computational viewpoint, a system of various programs, each responsible for one of the tasks necessary for labeling sound samples and obtaining tables.


Some of these programs already existed, such as Praat, Ortofon (Albano and Moreira, 1996) and SilWeb (Matte, Meireles and Fraguas, 2006).

Setfon is dedicated to the linguistic analysis of speech; support for research on speech synthesis is optional. Thus, it is not a segmentation program, but an algorithm that automates and manages the association of linguistic and extralinguistic information with segments, whose size is determined by the needs of each piece of research. Importantly, due to its versatility, Setfon is more than a simple program: it is designed to be an application server that can be easily adapted for many different purposes in experimental phonetics research, and whose components are straightforward to maintain.

Since the goal of the project is the design of this general algorithm, the implementation of the analysis, initially restricted to the sentence in terms of duration and intonation curve, enables immediate tests with casual speech. Using casual and controlled corpora yielded results that were immediately applicable to research on Brazilian Portuguese phonostylistics (Mendes, 2009). Since this is a web resource, its potential application in speech technologies is wide in terms of man/machine interaction: telephone, web, home appliances, and speech disorders.

4. Software Development

The design of the phonological semiolabeler followed the component-based software development process proposed by Brito et al. (2005), which is divided into the following steps: (i) domain analysis, (ii) modeling the components, (iii) implementation of the components, (iv) testing the components, (v) implementation of the web interface, and (vi) integration testing. During the development of each Setfon component, interfaces were created for individual use (shell scripts), making it possible to perform independent tests.

Component-oriented programming is a technical approach that solves computational problems through atomic logical structures and well-defined interfaces. Components encapsulate black-box processes, that is, processes that do not require detailed knowledge of the implementation strategy, since they are not coupled at the modular level.

The phonological semiolabeler forms a complex data set. To build it, each attribute is handled by a separate component. The component-based process closely mirrors the semi-automatic and manual activities performed by researchers to obtain acoustic data. Most operations are atomic, and they have well-defined inputs and outputs. In this sense, four essential components have been identified: (i) audio segmentation, (ii) fonotranscriber of text to phonological transcription, (iii) TextGrid handler, and (iv) data noise extraction. These components are handled by a web tier that functions both as the controller of the steps involved in the extraction process and as the direct interface with the user researcher.

Figure 1: Main page of Setfon

The main tool of Setfon is shown in Figure 1. This is a web interface that starts the process from a sound file and a text file (with the corresponding semantic values). To obtain the acoustic data, the sound file must be evaluated together with its respective TextGrid, which is filled with phonological segments and other relevant data. Obtaining the TextGrid, in turn, requires three sub-steps: (i) phonologically transcribe the text and segment it into VV units, (ii) generate a TextGrid containing only the data from the sound file, still without phonological segments, and (iii) insert the phonological segments into the TextGrid.

The main strategy for solving this problem is to define the inputs and outputs of each component as files of different types. Each component hence receives one or more input files and produces a resulting file. The web tier presents the files (central region of Figure 1) and the possible operations on these files (bottom of Figure 1). To perform an operation, one must select the input files (by clicking on them) and then trigger the desired operation. Each component in this process uses the technologies and techniques most suitable for its purpose.
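The file-in, file-out contract described above can be illustrated with a minimal Python sketch. It is not Setfon's code (the source states that the components are implemented with PHP, Praat and shell scripts); the `Component` class, the `apply` helper and the two placeholder operations are hypothetical names introduced only to show how black-box steps can be chained when each one reads one or more files and writes exactly one result file.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List
import tempfile


@dataclass
class Component:
    """A black-box step: reads input files, writes exactly one result file."""
    name: str
    run: Callable[[List[Path], Path], None]


def transcribe(inputs: List[Path], output: Path) -> None:
    # Placeholder for a text-processing component: it only upper-cases the text.
    text = inputs[0].read_text(encoding="utf-8")
    output.write_text(text.upper(), encoding="utf-8")


def merge(inputs: List[Path], output: Path) -> None:
    # Placeholder for a component that combines several files into one.
    parts = [p.read_text(encoding="utf-8").strip() for p in inputs]
    output.write_text("\n".join(parts) + "\n", encoding="utf-8")


def apply(component: Component, inputs: List[Path], workdir: Path) -> Path:
    """Run one component on the selected files and return the new file."""
    output = workdir / f"{component.name}.out"
    component.run(inputs, output)
    return output


def demo() -> str:
    """Chain two components, mimicking 'select files, then trigger an operation'."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        text = workdir / "sentence.txt"
        text.write_text("o nome da fruta", encoding="utf-8")
        labels = apply(Component("transcriber", transcribe), [text], workdir)
        merged = apply(Component("merger", merge), [text, labels], workdir)
        return merged.read_text(encoding="utf-8")
```

Because every step communicates only through files, a component can be tested on its own or replaced by another implementation without touching the rest of the chain, which is the property the text attributes to the component-based design.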

5. A disassembly line: speech analysis

Setfon works as a disassembly line: the product ‘speech’ is decoupled in time and its qualities are analyzed and arranged in order to allow the visualization of its parts before displaying the whole set.

Two types of segmentation are needed to obtain phonostylistic data: a macro-segmentation based on stress groups and a micro-segmentation based on VV units. Although the automation of the process dictated the choice of segment types and was responsible for the use of phrase and syllable, the method of data inclusion was created in order to enhance the semiotic approach, which considers the text as a whole.

The starting point of Setfon was the SilWeb software, which was originally designed to return the accentual classification of every word and syllable. Working with UML, despite not having been brought to completion, allowed a neat algorithm compatible with other applications, some of which were eventually incorporated into the program. The programming was started in Matlab and completed in PHP. It was based on the phonological studies by Mattoso Câmara (1970), and predicts, with 99% accuracy, the accentual classification of any Brazilian Portuguese (henceforth BP) word or logatome that follows the rules of BP phonotactics. The program also returns syllables and consonant-vowel(-glide) masks. The program was tested on a large corpus, CETEN-Folha, while three researchers involved in the project were working on a thematic project called “Integrating Parameters in Continuous and Discrete Models of Knowledge and Lexical Phonic”, coordinated by Eleonora Albano, held at UNICAMP, and funded by FAPESP until January 2005.
The behavior of units larger than a phoneme, such as speech rate and the inter-group perceptual-center (VV unit), consisting of syllable-sized units from the first unit following the stressed segment up to the next stressed segment (Marcus, 1981), has been significantly related to the tensile potential behavior of the text, according to the results of Matte (2005) on emotional speech.

The application of phonological semiolabelers, with the implementation of Setfon, meets researchers’ demands to test which independent variables imply variation at the expression level, instead of sticking to a single working hypothesis. As an example, consider the hypothesis of the tensile curve of temporality M (Matte 2004a), proposed in 2001. M is a combination of three elements directly derived from a semiotic analysis of five levels of temporality in text content (Matte, 2004b), two of which are discarded by the formula of M due to theoretical obsolescence. During this phase of the research, the prosodic component speech rate was significantly correlated with the variation of M.

Setfon speeds up the process of gathering and organizing data to allow the testing of different hypotheses; for example, a test with each temporal component alone and in different combinations, including those dropped by the original formula of M. In addition, it enables a leap towards a predictable and desirable step of the research regarding a semiotic analysis of the lexicon, by testing the relationship between semiotemporal keyword analysis and prosodic results. It is reasonable to predict that the analysis of a possible tensile content of the lexicon, linked to a syntactic analysis, may enable an automation of the semiotic analysis, with speech synthesis in view, based on the assumption of vocal caricature (Matte, 2004a).

Besides enabling the testing of a greater number and variety of cases in less time, the agility provided by Setfon allows changes in strategy whenever the results call for it, without causing significant delays to the research.

The project can be divided into three blocks: phonological study, computational study of the management, and programming of subsidiary tools in the interface between computer science and linguistics.

6. Components

The linguistic study of the behavior of speech rate, silent pauses and intonation curve (f0), using the methods of experimental phonetics and phonology, enables a segmentation grounded in the linguistic sense of sentence prosody. The segmentation of the sentence follows the concept of the VV unit, starting at the first vowel and ending at the beginning of the last vowel of the sentence, given that the transition between a consonant and a subsequent vowel is perceived more accurately than the transition between a vowel and a consonant, as shown by Barbosa (1996), Cummins (2002), and Pompino-Marschall (1989).

The length of the sentence, segmented with this method, can only generate information about the speech rate if the units are also VV units, so it was necessary to remodel the SilWeb program (Matte, Meireles and Fráguas, 2006) to perform accentual analysis and phonic decomposition of words with phonological transcription, in addition to its current capability of splitting CV syllables, for use in the database of the thematic project mentioned above. This program, SilWebVV, also provides the data needed to calculate the z-score of the sentence, a relative measure of duration that takes into account the intrinsic duration of phonic segments; this calculation is also done automatically.

The overall design of the component-oriented program allows other existing programs to be added to the process, thereby improving the final result. It works like a grid of text that links different layers of information to each piece of sound-sentence. In a strict sense, it is a network of informational classes of different natures, connected to continuous media (in this case, sound) through digital identities. Implemented in this way, the phonological semiolabeler Setfon is receptive to updates, some of which are predictable and desired by the phonetics community, such as the replacement of the sentence segmenter with a phone segmenter, as well as the implementation of an updated f0 analyzer.

The three blocks that organize this project were developed as needed, and often simultaneously.
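The z-score mentioned in this section can be sketched as follows. The chapter does not give the formula, so the sketch assumes one common formulation of duration normalization: the measured duration of a unit minus the sum of the reference mean durations of its phones, divided by the square root of the sum of their reference variances. The reference table below is entirely hypothetical and serves only to make the example runnable.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical reference statistics per phone: (mean duration, standard
# deviation), both in milliseconds. Real values would come from a corpus.
REFERENCE: Dict[str, Tuple[float, float]] = {
    "o": (80.0, 25.0),
    "N": (60.0, 20.0),
    "p": (90.0, 20.0),
    "l": (55.0, 15.0),
}


def duration_zscore(
    duration_ms: float,
    phones: List[str],
    reference: Dict[str, Tuple[float, float]],
) -> float:
    """Normalize a unit's duration by the intrinsic durations of its phones.

    Assumed formula: z = (dur - sum(mean_i)) / sqrt(sum(sd_i ** 2)).
    """
    mean_sum = sum(reference[p][0] for p in phones)
    variance_sum = sum(reference[p][1] ** 2 for p in phones)
    return (duration_ms - mean_sum) / math.sqrt(variance_sum)


# A unit whose duration equals the sum of the reference means has z = 0.
example = duration_zscore(285.0, ["o", "N", "p", "l"], REFERENCE)
```

Under this assumption, a positive value indicates a unit that is longer than expected from its segmental content, and a negative value a shorter one, which is what makes the measure relative rather than absolute.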

7. Phonological Labels

The concept of a phonological semiolabeler proposed here is an approach to speech analysis that deals with speech sounds as objects. A phonological semiolabeler is a class of objects whose attributes are intrinsic or acquired data. The objects are segments of speech sound that may have different sizes. At this point, we have adopted the VV unit (vowel-to-vowel) (Marcus, 1981) and the stress group to support the segmentation (Barbosa, 2006). These objects are obtained by automatic analysis of stretches of speech accompanied by an orthographic transcription (Barbosa, 1996). The stress group is a sequence of VV units obtained by quantitative and qualitative analysis of VV durations. It is, therefore, an analysis dependent on the original attributes.

On the one hand, the intrinsic data are essentially qualitative independent variables, which can be obtained by automatic acoustic analysis. On the other hand, acquired data are parameters whose automation is still an unexplored possibility, given their reliance on qualitative analysis. Currently, syntactic and semantic parsers can help the process, although the semiotic analysis is completely manual.


While different in nature, both the stress group and the VV unit can receive intrinsic and acquired attributes. The first acquired attribute of the VV unit is a phonological label obtained by transcription and phonological segmentation of the text corresponding to a speech sound. Only once the phonological label is available is it possible to calculate the unit’s intrinsic attributes (duration, intensity, frequency, formant configuration) and to create objects of the class Accentual Group. The objects of this class have intrinsic prosodic attributes such as speech rate, melodic curve, intensity variation, duration variation, stress position, and number of VV units. Acquired attributes are totally dependent on the type of result desired, and may arise from syntactic parsers, semantic parsers and/or specific content, such as tensile or narrative semiotic analysis, just to name a few.
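The object view described in this section can be sketched with two small Python classes. This is only an illustration under stated assumptions: the field names, the use of a dictionary for acquired attributes, and the definition of speech rate as VV units per second are choices made for the example, not details given in the chapter.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class VVUnit:
    """A vowel-to-vowel unit treated as an object."""
    label: str           # phonological label (the first acquired attribute)
    duration_ms: float   # intrinsic attribute, measured from the signal
    acquired: Dict[str, Any] = field(default_factory=dict)


@dataclass
class AccentualGroup:
    """A stress group: a sequence of VV units plus its own attributes."""
    units: List[VVUnit]
    acquired: Dict[str, Any] = field(default_factory=dict)

    @property
    def n_vv(self) -> int:
        # Intrinsic attribute: number of VV units in the group.
        return len(self.units)

    @property
    def duration_ms(self) -> float:
        # Intrinsic attribute derived from the units' durations.
        return sum(u.duration_ms for u in self.units)

    @property
    def speech_rate(self) -> float:
        # Assumed definition: VV units per second.
        return self.n_vv / (self.duration_ms / 1000.0)


# Illustrative durations, loosely based on the labels shown in Figure 4.
group = AccentualGroup(
    units=[VVUnit("O_n", 400.0), VVUnit("om", 320.0)],
    acquired={"informant_sex": "F"},  # hypothetical acquired attribute
)
```

Keeping acquired attributes in an open dictionary reflects the statement that they depend entirely on the type of result desired, while the intrinsic attributes are computed from the units themselves.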

8. Ortosil

Matte, Meireles and Fraguas (2006) developed a phonological syllabic-accentual parser for linguistic applications: SilWeb. Briefly, the program returns the following lexical information from a phonological input: 1) the word accentual class, 2) the number and type of syllables (stressed, pre-stressed and post-stressed), and 3) syllabic masks (with or without the presence of glides). An example of this analysis is shown in Figure 2 below.

Figure 2: Phonological syllabic-accentual analysis of a word written in “Ortofon” transcription (Matte, Meireles, and Fraguas, 2006, p. 47)


In Figure 2, the researcher has entered the word transcribed in “Ortofon” (eSkaLda’NtI), and the program returns linguistic information that is pre-programmed in the source code. This type of transcription (letter-to-phone conversion) was proposed by Albano & Moreira (1996) for speech synthesis, and was performed by the program Ortofon (restricted use). Thus, even though SilWeb generates linguistic information, linguists and others interested in corpus linguistics who want to take advantage of this program must know how to transcribe in “Ortofon”, which makes the practical use of the program more difficult. To make the program easier to use for the scientific community, we developed a program for converting data using phonological spelling: Ortosil. This computational tool follows a similar theory, but is independent of Ortofon’s principles.

Ortosil emerged from our experience with Ortofon (Albano and Moreira, 1996); however, as our aim was to develop a program that reflects the phonological knowledge of Portuguese, we based our transcription more precisely on the phonological analysis proposed by Mattoso Câmara Jr., with some fundamental changes. According to Mattoso Câmara Jr. (1970), the phonological system of the Portuguese language is composed of the following phonemes: 19 consonants, 7 vowels and 2 archiphonemes. Within this phonemic framework, one can transcribe any BP word. A comparison of our phonological analysis with Mattoso Câmara Jr.’s follows.

Analysis of consonants: Mattoso Câmara Jr. proposes 19 BP consonant phonemes: /ʃ ʒ p b t d k g f v s z m n ɲ ɾ ʎ l R/. All of these phonemes occur at the beginning of the syllable, and therefore have little articulatory variability (see, among others, Taurosa, 1992; Keating et al. 1999).
Our transcription of these phonemes is identical to this analysis, except for symbol changes that make computational implementation easier, namely: /sh zh p b t d k g f v s z m n nh l lh R/. Furthermore, we chose to represent the tap [ɾ] with the same archiphoneme symbol /R/, since it is one of its possible pronunciations. As can be seen, the consonantal phonemic chart for the beginning of the syllable and/or word (except for the phoneme /r/) is uncontroversial, owing to articulatory stability. However, there is wide dialectal variation in the pronunciation of consonants at the end of the syllable and/or word in Brazilian Portuguese. To explain this variation, Mattoso Câmara Jr. used the classical notion of archiphoneme from Russian structuralism.

Consonantal archiphonemes: According to Trubetzkoy (1939), archiphonemes are symbols that represent the loss of phonemic contrast in certain phonetic contexts. An archiphoneme is represented by the unmarked phoneme in capital letters. For instance, /s z/ are Portuguese phonemes because we find minimal pairs such as “saca” and “Zaca”, in which the exchange of one phoneme for the other modifies the word meaning. However, this phonemic contrast is neutralized in syllable-final or word-final position. Considering the Portuguese word “mas”, some dialects pronounce it as [mas] and others as [maz], which indicates loss of phonemic distinction in this context. In such cases of phoneme neutralization, Mattoso Câmara Jr. (1970, p. 42) used the notion of archiphoneme. Therefore, /S/ handles the variation between /s z ʃ ʒ/, and /N/ handles the variation between /m n ɲ/. Our transcription uses exactly these symbols. E.g. “pasta” is transcribed as /pa'StA/. There are also other unstable consonants that do not fit the definition of archiphoneme. In these cases, we used the notion of arquisegment, i.e., an unspecified segment (see Archangeli, 1988).

Consonantal arquisegments: Arquisegments, as also pointed out by Albano and Moreira (1996), are phonologically underspecified elements; i.e., their phonetic realizations vary with social, dialectal and individual features of language. Nonetheless, unlike these authors, we consider a much larger number of these phonological elements, namely: /P B T D K G R L/. These arquisegments are needed to represent variations in standard Portuguese speech. For example, the word “pacto” can be pronounced in a continuum between [paktʊ] and [pakɪtʊ], and is transcribed as /pa'KtO/.

Analysis of vowels: According to Mattoso Câmara Jr.’s analysis, BP has: (a) 7 oral vowel phonemes (/a ɛ e i ɔ o u/), (b) 5 oral vowel phonemes in pre-stressed and/or medial post-stressed position (/a e i o u/), and (c) 3 oral vowel phonemes in final post-stressed position (/a i u/).
Our analysis considers the same set of phonemes for the first two positions (a, b), but not for the third (c), for which vowel archiphonemes are used.

Vowel archiphonemes: As with the consonants, the archiphonemes /E O/ are used in BP final post-stressed syllables, since there is no phonological distinction between /e i/ or /o u/ in this context. Regarding /a/, the arquisegment /A/ is used to represent the variation between [a] and [ɐ] in the same context as /E O/. Similarly, the arquisegments /I U/ are used to represent the semivowels /i u/. Table 1 below shows all possible Ortosil transcriptions with BP examples, compared with Mattoso Câmara Jr.’s analysis (1970).


Table 1: Comparison of Mattoso Câmara’s and Ortosil’s phonological analysis.

9. Computational algorithm of Ortosil

The computational algorithm of Ortosil follows the same theoretical principles as SilWeb (described above). Lexical analysis is done syllable by syllable, based on the representation of a syllable as onset, nucleus, and coda (see Matte, Meireles, and Fraguas, 2006, p. 42). Before the analysis, however, all letters are lowercased (function “tolower” in the C language) and the stressed syllables are marked (function “stress”). Stress marking relies on Portuguese orthography, following the rules for detecting the stress class (oxytone, paroxytone, proparoxytone) given in BP prescriptive grammars. Here are some of these cases:

1) final stress: if the word ends in “r” and has no accent diacritic, it is oxytone. E.g. parTIR, aMAR;
2) penultimate stress: if the word ends in “o”, has more than one syllable and no accent diacritic, it is paroxytone. E.g. PAto, maROto;
3) antepenultimate stress: if the word has a graphic accent on the antepenultimate syllable, it is proparoxytone. E.g. paRÁbola, aBÓbora.

After this first stage of data processing, the syllabic analysis starts, which generates the “Ortosil” transcription. The computational algorithm that represents this analysis is described in Figure 3. What is not described in this figure, however, is the loop that occurs after the syllabic nucleus (or post-syllabic nucleus, or simple coda, or complex coda), which restarts the syllabic treatment until the end of the word.
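The three orthographic cases listed above can be written as a short Python sketch. It is not the Ortosil implementation: the function name `stress_class` is hypothetical, the word is assumed to be already split into syllables, and only the three rules quoted in the text are covered, so any other word yields `None`.

```python
from typing import List, Optional

# Acute and circumflex vowels, treated here as the "graphic accent".
GRAPHIC_ACCENTS = set("áéíóúâêô")


def has_graphic_accent(syllable: str) -> bool:
    return any(ch in GRAPHIC_ACCENTS for ch in syllable)


def stress_class(syllables: List[str]) -> Optional[str]:
    """Apply the three example rules; return None when none of them applies."""
    syls = [s.lower() for s in syllables]  # mirrors the lowercasing step
    word = "".join(syls)
    accented = [i for i, s in enumerate(syls) if has_graphic_accent(s)]
    if accented:
        # Rule 3: graphic accent on the antepenultimate syllable.
        if len(syls) >= 3 and accented[0] == len(syls) - 3:
            return "proparoxytone"
        return None  # accented words outside rule 3 are not covered here
    if word.endswith("r"):
        return "oxytone"      # rule 1: ends in "r", no diacritic
    if word.endswith("o") and len(syls) > 1:
        return "paroxytone"   # rule 2: ends in "o", more than one syllable
    return None
```

For instance, `stress_class(["par", "tir"])` follows rule 1 and `stress_class(["pa", "rá", "bo", "la"])` follows rule 3. A complete system would need the remaining rules of the prescriptive grammar as well as a syllabifier, both of which are left out of this sketch.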


Figure 3: Computational algorithm of Ortosil

9.1 Steps in the disassembly line

Setfon can be defined as a disassembly line: text and sound are entered, resulting in analytical information from the acoustic, phonetic-phonological, prosodic, and verbal content components. To do so, the steps are:

Sound analyzer: Automatic segmentation of sentences. This segmentation is currently done with the open source program Praat, using a script (designed by Plínio Barbosa with the collaboration of Meireles and Matte) for automatic segmentation of syllable-sized units. In addition to delimiting VV units and stress groups, this script saves samples of these sounds in separate files and creates a database that can be retrieved for other types of analysis. Information on sentence duration (in milliseconds) is obtained at this stage, as well as information about the beginnings and ends of sentences, which is used by the annotator of continuous media. The possibility of hearing the sentence from the middle of the transition from the previous sentence until the middle of the transition to the next one may become available in the near future in all displays of the annotations.

Phonics label: Sentences are run through Fonotranscriber, a text parser that obtains the phonological transcription, consonant-vowel masks (with or without semi-vowels), word and VV unit stress, and the number of VVs. Each piece of information is registered in the database and can be assigned class or subclass status depending on the type of information. As Setfon currently works with reference to a sentence-sized segment, classes refer to one-to-one information, that is, one piece of information for each sentence, while sub-classes contain more than one piece of information for each sentence (for example, an identity for each word linked to a single sentence).

z-score: Based on the information obtained by the sound analyzer (duration layer) and the text parser (phonological transcription), calculate the z-score for the sentence and create the corresponding layer (Barbosa, 2006).

Speech rate: From the number of VVs per sentence and the duration layer, calculate and create a speech rate layer (Meireles, 2001; Matte, 2004a).

Peaks and averages of f0 derivative peaks: This step is currently performed with Praat, in order to test this sound component as a factor in the analysis of speech rate and emotional speech. The sound analyzer, Praat, calculates the peaks and averages of f0 derivative peaks and links them to each sentence. It deals with three classes of information: the number of peaks, the peak values, and the peak means for each sentence. This approach favors the analysis of the dynamics of the intonation curve with information from the derivative instead of absolute values of f0 (Bally and Holm, 2002).

Semi-automatic annotation of non-acoustic, non-phonetic, and non-phonological data: This annotation is carried out on a web interface that allows the researcher to add any extra information they want. For example: results of semiotic analysis (type of manipulation, value of the object, tensileness, aspectualization, etc.), sex of the informant, data context (instructions to the informant or other relevant data), among others. The researcher indicates the data category and inserts the information for each stress group, and can automatically repeat the same information for a set of stress groups or specify it one group at a time as needed.

Tables: Due to the way they are organized in the database, all results can be retrieved as tables (CSV, comma-separated values) for statistical analysis. The researcher-user only needs to select the desired information before the process begins, so that unnecessary tasks do not increase the computational cost (execution time). Furthermore, when the program sends an e-mail about the completion of a process, it also sends a brief descriptive analysis of the corpus, obtained with functions of the R⁵ program.
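Two of the measures in the steps above, speech rate and the statistics of f0 derivative peaks, can be sketched in Python. The source performs these calculations with Praat and does not spell out the definitions, so the sketch rests on assumptions: speech rate is taken as VV units per second, the derivative is approximated by first differences of an evenly sampled f0 contour, and a peak is a local maximum of that derivative. The function names are hypothetical.

```python
from typing import Dict, List


def speech_rate(n_vv: int, duration_ms: float) -> float:
    """Assumed definition: number of VV units per second."""
    return n_vv / (duration_ms / 1000.0)


def f0_derivative_peaks(f0_hz: List[float], step_s: float) -> Dict[str, object]:
    """Count, list and average the local maxima of the f0 derivative.

    f0_hz  : f0 values sampled at a constant interval (voiced frames only)
    step_s : sampling interval in seconds
    """
    # First differences approximate the derivative in Hz per second.
    deriv = [(b - a) / step_s for a, b in zip(f0_hz, f0_hz[1:])]
    # A peak is a value strictly greater than both of its neighbours.
    peaks = [
        deriv[i]
        for i in range(1, len(deriv) - 1)
        if deriv[i - 1] < deriv[i] > deriv[i + 1]
    ]
    mean = sum(peaks) / len(peaks) if peaks else None
    return {"count": len(peaks), "values": peaks, "mean": mean}


# Illustrative contour (made-up values, 10 ms between samples).
contour = [100.0, 102.0, 108.0, 110.0, 111.0, 116.0, 117.0]
summary = f0_derivative_peaks(contour, step_s=0.01)
rate = speech_rate(n_vv=8, duration_ms=2000.0)
```

The three keys of the returned dictionary correspond to the three classes of information named in the text: the number of peaks, the peak values, and the peak mean for each sentence.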

10. Sound segmenter

The sound segmenter is a system component, implemented with Praat, which receives a sound file as input and produces a semi-filled TextGrid. The code was based on the BeatExtractor Praat script, written by Plínio Barbosa. This component receives a set of parameters (such as the speaker’s gender, the type of filter, and the type of technique to be used) and identifies the VV segmentation points, resulting in a TextGrid with time intervals for each segment and other relevant data, as shown in Figure 4. The component has a thin layer implemented in PHP⁶, which connects the web interface to the Praat script.

⁵ Available at http://cran.r-project.org/.
⁶ Available at http://www.php.net.

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 5.9780045351473925
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "vv"
        xmin = 0
        xmax = 5.9780045351473925
        intervals: size = 8
        intervals [1]:
            xmin = 0
            xmax = 1.6456602147085324
            text = ""
        intervals [2]:
            xmin = 1.6456602147085324
            xmax = 2.04477534539712
            text = "O_n"
        intervals [3]:
            xmin = 2.04477534539712
            xmax = 2.365412781849187
            text = "om"

Figure 4: Example of a Praat TextGrid showing the first two VVs in “O nome da fruta.”
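The intervals listed in a TextGrid such as the one in Figure 4 can be read back with a few lines of code. The following is a minimal Python sketch, not Setfon's own PHP parser: the function name `parse_intervals` and the regular expression are assumptions made for illustration, and the sketch only handles the long text format with labels that contain no quotation marks.

```python
import re

# Excerpt in Praat's long TextGrid text format (values taken from Figure 4).
SAMPLE = '''
intervals: size = 8
intervals [1]:
    xmin = 0
    xmax = 1.6456602147085324
    text = ""
intervals [2]:
    xmin = 1.6456602147085324
    xmax = 2.04477534539712
    text = "O_n"
intervals [3]:
    xmin = 2.04477534539712
    xmax = 2.365412781849187
    text = "om"
'''

# One interval = an index line followed by xmin, xmax and a quoted label.
INTERVAL_RE = re.compile(
    r'intervals \[(\d+)\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"'
)


def parse_intervals(textgrid_text):
    """Return a list of (start, end, label) tuples, one per interval found."""
    return [
        (float(xmin), float(xmax), label)
        for _index, xmin, xmax, label in INTERVAL_RE.findall(textgrid_text)
    ]


if __name__ == "__main__":
    for start, end, label in parse_intervals(SAMPLE):
        print(f"{label!r}: {end - start:.3f} s")
```

Returning plain tuples keeps the sketch independent of any TextGrid library; the duration of each VV unit is simply the difference between the two time stamps.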

To adjust the segmentation settings, the web interface provides a tool for creating and editing an INI file, which stores the configurations to be used. The component, therefore, can receive an optional configuration file as input, in addition to the required sound file. Defining a configuration file is useful because it can be reused in different analyses or in different instances of the system (on different servers). Running scripts from the Praat GUI speeds up the manual processing of the signal, but it requires processing one file at a time, which makes the whole process much more time-consuming. Barbosa’s script was first automated with a shell script, which significantly improved the performance of data collection and constituted the first version of the component for retrieving Setfon’s intrinsic data.
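An optional configuration file of this kind can be sketched as follows in Python, under stated assumptions: the section name `segmenter`, the option names `gender`, `filter` and `technique`, and their values are hypothetical placeholders, since the actual keys of Setfon's INI file are not given here. Defaults are loaded first so that the configuration file stays optional.

```python
import configparser

# Hypothetical defaults, used when no configuration file is supplied.
DEFAULTS = {"gender": "male", "filter": "butterworth", "technique": "amplitude"}

# Hypothetical INI content overriding two of the three options.
SAMPLE_INI = """
[segmenter]
gender = female
filter = hanning
"""


def load_settings(ini_text=None):
    """Return the segmenter settings, with defaults for any missing option."""
    parser = configparser.ConfigParser()
    parser.read_dict({"segmenter": DEFAULTS})  # start from the defaults
    if ini_text:
        parser.read_string(ini_text)           # override with the INI content
    return dict(parser["segmenter"])


if __name__ == "__main__":
    print(load_settings())            # defaults only
    print(load_settings(SAMPLE_INI))  # defaults overridden by the INI text
```

Because the same INI text can be stored and passed to any instance, a configuration defined once can be reused across analyses and servers.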


11. Phonological Labels

Filling in the phonological labels in the TextGrid (produced by the sound segmenter) is performed by two components, the fonotranscriber and the TextGrid handler, both implemented exclusively in PHP. The fonotranscriber is based entirely on the Ortosil rules. The only difference is that the transcription from orthographic form to phonological form is mediated by an automatic phonetic transcription.

First, the fonotranscriber is responsible for converting a text into phonological VVs in three stages: a) orthographic → phonetic conversion; b) phonetic → phonological conversion; and c) sentence → VV conversion. For example, the word “complexo” (text) is reproduced as “kõplɛ'kso” (IPA transcription), then it is transcribed to “koNpleh'KsO” (phonological transcription), and finally it is divided into segments: /oNpl/ /eh'Ks/ /O/ (VV units). The file resulting from this process is a text file in which each line contains one VV unit. To perform the initial transcription, an algorithm based on regular expressions is used, which assesses a word stretch by stretch and converts alphabet symbols into IPA symbols, according to a table of grammatical rules and a list of exceptions. The table of grammatical rules and the list of exceptions are specified in a separate file, since they depend on the alphabet used to represent the text. The other operations are relatively simple and straightforward.

The TextGrid handler component, in turn, has a parser, a handler, and a TextGrid generator. Thus, the component is able to read the semi-filled TextGrid, which is generated by segmenting the audio, include the phonological segments produced by the fonotranscriber, and create, as a result, a completely filled TextGrid. The TextGrid was originally produced as part of the segmentation of the sound, without any qualitative data, and the researcher inserted the phonological labels manually. Initially, a Praat script for inserting the data depended on the GUI and was restricted to processing one file at a time. Setfon rewrote the code in PHP and automated the process, thus eliminating the Praat GUI, which also contributed significantly to improving the performance of data collection.
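The last of the three stages, the division of a phonological string into VV units, can be illustrated with a short Python sketch. This is not the fonotranscriber's PHP code: the function name `split_vv` and the vowel inventory are assumptions, namely that every vowel is a single character and that symbols such as `h`, `N` and `'` attach to the preceding vowel. The sketch works on a single word and ignores units that span word boundaries.

```python
# Assumption: every vowel is one character from this set; all other symbols
# (consonants, "h", "N", the stress mark "'") belong to the current VV unit.
VOWELS = set("aeiouAEIOU")


def split_vv(phonological):
    """Split a phonological string into vowel-onset to vowel-onset units.

    Material before the first vowel is dropped, because a VV unit starts
    at a vowel onset.
    """
    onsets = [i for i, ch in enumerate(phonological) if ch in VOWELS]
    ends = onsets[1:] + [len(phonological)]
    return [phonological[a:b] for a, b in zip(onsets, ends)]


if __name__ == "__main__":
    # One VV unit per line, as in the text file described above.
    for unit in split_vv("koNpleh'KsO"):
        print(unit)
```

With the example from the text, `split_vv("koNpleh'KsO")` yields the three units oNpl, eh'Ks and O.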

12. Acoustic Data Extractor

The acoustic data extractor is a component that evaluates a sound file with the help of a complete TextGrid file and returns the acoustic data, which are stored in a CSV file. This component was written primarily with Praat scripts, but also has a layer implemented in PHP that allows communication with the web interface. The algorithm for data extraction was based on the SGdetector Praat script, written by Plínio Barbosa and adapted by Ana C. F. Matte to obtain a larger number of variables. The script has also been adapted to output an SQL command, so as to enable the immediate inclusion of results in a database.

INSERT INTO `segmentos` (`idSeg`, `arquivo`, `segmento`, `tIni`, `tFim`, `dur`, `f1media`, `f1mediana`, `f1dp`, `f2media`, `f2mediana`, `f2dp`, `f3media`, `f3mediana`, `f3dp`, `f4media`, `f4mediana`, `f4dp`, `zScore`, `zSuavisado`, `posicao`, `tamanho`)
VALUES
('1','Monicalit1_22','Og','72.91081323201482','73.05954890102598','0.17183238822112457','451','420','59.70','1739','1428','535.76','2674','2518',''f3#dp:2'','3819','3762','211.73','-4.212','0','6','7'),
('2','Monicalit1_22','al','72.91081323201482','73.05954890102598','0.17685845326224958','451','420','59.70','1739','1428','535.76','2674','2518','299.50','3819','3762','211.73','-3.818','0','-5','7'),

Figure 5 – Initial excerpt of a result from the acoustic data extractor (Santos, 2008). Data: input file name, segment transcription, start and end time of the segment, segment duration, data on the first four formants, and analysis of relative duration, stress group position and size. Moreover, in another script, data on intensity, speech rate, and stress group’s melodic curve are obtained.
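As an illustration of how rows like those in Figure 5 can be sent to a database, the following Python sketch uses an in-memory SQLite database and a parameterized INSERT. It is only a sketch under stated assumptions: Setfon's script emits the SQL text itself and the choice of SQLite, the reduced column list and the function name `insert_segments` are hypothetical simplifications; only the column names and values are taken from the figure.

```python
import sqlite3

# Subset of the columns shown in Figure 5 (formant columns omitted for brevity).
COLUMNS = ["idSeg", "arquivo", "segmento", "tIni", "tFim", "dur",
           "zScore", "posicao", "tamanho"]

# Values copied from the two rows of Figure 5.
ROWS = [
    (1, "Monicalit1_22", "Og", 72.91081323201482, 73.05954890102598,
     0.17183238822112457, -4.212, 6, 7),
    (2, "Monicalit1_22", "al", 72.91081323201482, 73.05954890102598,
     0.17685845326224958, -3.818, -5, 7),
]


def insert_segments(conn, rows):
    """Insert segment rows using placeholders instead of string concatenation."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO segmentos ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    conn.executemany(sql, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segmentos (idSeg INTEGER PRIMARY KEY, arquivo TEXT, "
    "segmento TEXT, tIni REAL, tFim REAL, dur REAL, zScore REAL, "
    "posicao INTEGER, tamanho INTEGER)"
)
insert_segments(conn, ROWS)

if __name__ == "__main__":
    for row in conn.execute("SELECT segmento, dur, zScore FROM segmentos"):
        print(row)
```

Placeholders leave quoting to the database driver, which avoids malformed values of the kind that string-built SQL can produce.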

13. Semiotic Labels

The insertion of semiotic labels is still done directly by the researcher, through a web interface that organizes data as attributes of the stress group class. The interface enables the automatic replication of entries and the creation of different attributes depending on the type of analysis. Although created for the insertion of data from semiotic analysis, the possibility of creating new attributes gives the tool sufficient flexibility for the insertion of any kind of qualitative data for the purposes of parametric statistical analysis, for example. This category would include, for instance, instructions given to the informant to produce a more neutral or more emotive, faster or slower speech, what is called circumstantial information about the data. This system can add, according to the purpose of the research, information such as corpus classification, and snippets of international, local or regional TV news (Mendes, 2009).


14. Conclusion

The system, registered on Sourceforge.net⁷ under the GPL v.2, reached the expected main objective of enabling a significant increase in data collection for phonostylistics studies. Previously manual processes have now been automated and standardized using Setfon. Although Setfon is an important step for emotional speech studies, several improvements can still be explored, such as incorporating a speech recognition component (allowing text generation from the sound file), offering a feature that allows asynchronous interaction (to minimize client/server connectivity during data processing), creating a framework that provides specific features for speech studies, and offering Setfon features under a web service architecture.

7 Available at http://www.sourceforge.net/projects/setfon.

References

Albano, E. C.; Barbosa, P.; Gama-Rossi, A.; Madureira, S.; Silva, A. (1998) A interface fonética-fonologia e a interação prosódia-segmentos. In: Seminário do Grupo de Estudos Linguísticos do Estado de São Paulo – GEL’97, 45, 1997, Campinas, UNICAMP, 1998. Anais do GEL, Campinas: v. 17, p. 135-143.
Albano, E. C.; Moreira, A. A. (1996) Archisegment-based letter-to-phone conversion for concatenative speech synthesis in Portuguese. In: ICSLP’96. Proceedings of ICSLP’96, v. 3, p. 1708-1711.
Ambler, S. W. (2002) Agile Modeling: Effective Practices for XP and the UP. New York: John Wiley & Sons.
Archangeli, D. (1988) Aspects of underspecification theory. Phonology, v. 5, p. 183-207.
Bailly, G.; Holm, B. (2002) Learning the hidden structure of speech: from communicative functions to prosody. Cadernos de Estudos Lingüísticos, Campinas, n. 43, p. 37-53.
Barbosa, P. A. (2001) Generating duration from a cognitively plausible model of rhythm production. In: EUROSPEECH. Aalborg, Denmark. Proceedings of EUROSPEECH 2001. Aalborg, Denmark: v. 2, p. 967-970.
Barbosa, P. A. (1996) At least two macrorhythmic units are necessary for modeling Brazilian Portuguese duration: emphasis on segmental duration generation. Cadernos de Estudos Lingüísticos, v. 31, n. 1, p. 33-53.


Barbosa, P. A. (2006) Incursões em torno do Ritmo da Fala. Campinas: Pontes.
Bisol, L. (2001) Introdução a estudos de fonologia do Português. 3 ed. Porto Alegre: EDIPUCRS.
Brito, P. H. da S.; Barbosa, M. A. M.; Guerra, P. A. de C.; Rubira, C. M. F. (2005) Um Processo para o Desenvolvimento Baseado em Componentes com Reuso de Componentes. Relatório Técnico IC-0522. Campinas, SP: Instituto de Computação, set. 2005. Retrieved May 30, 2010 from http://www.ic.unicamp.br/~reltech/2005/05-22.pdf.
Cagliari, L. C. (1999) Acento em Português. Campinas, SP: Ed. do Autor. (Série Linguística, v. 4)
Callou, D.; Leite, Y. (1990) Introdução à Fonética e à Fonologia. Rio de Janeiro: Jorge Zahar.
Deitel & Deitel. (2004) How to program java. 4. ed. London: Pearson Education.
Fowler, M. (Ed.). (2003) Patterns of Enterprise Application Architecture. Boston: Pearson Education. (The Addison-Wesley Signature Series)
Hawkins, S. (2003) Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, v. 31, p. 373-405.
Hendler, J.; Berners-Lee, T.; Miller, E. (2002) Integrating Applications on the Semantic Web (English version). Journal of the Institute of Electrical Engineers of Japan, v. 22, n. 10, p. 676-680. Retrieved June, 2011 from http://www.w3.org/2002/07/swint.
Koivunen, M.-R.; Swick, R.; Prud’hommeaux, E. (2003) Annotea Shared Bookmarks. In: KCAP 2003 Workshop on Knowledge Markup and Semantic Annotation, 2003, Sanibel, Florida. Proceedings of KCAP 2003. Sanibel, Florida. Retrieved May 30, 2010 from http://www.w3.org/2001/Annotea/Papers/KCAP03/annoteabm.html.
Marcus, S. M. (1981) Acoustic Determinants of Perceptual-Center (P-center) location. Perception and Psychophysics, v. 30, n. 3, p. 247-256.
Matte, A. C. F. (2005) Análise quantitativa da tensividade do conteúdo verbal tendo em vista o estudo da expressão da emoção na fala e o modelamento prosódico. Cadernos de Estudos Lingüísticos (UNICAMP), v. 45, n. 1, Campinas, SP.
Matte, A. C. F. (2004a) Relating Emotional Content to Speech Rate in Brazilian Portuguese. In: Speech Prosody CD. Proceedings of Speech Prosody 2004, 2004. Nara, Japan.
Matte, A. C. F. (2004b) Tempo Fonoestilístico e Semi-simbólico: a árvore gerativa da temporalidade. Estudos Lingüísticos, v. 33, Campinas.


Matte, A. C. F.; Meireles, A. R.; Fraguas, C. C. (2006) SIL Web-analisador fonológico silábico-acentual de texto escrito. Revista de Estudos da Linguagem, Belo Horizonte, v. 14, n. 1, p. 31-50.
Mattoso Câmara Jr., J. (1970) Estrutura da língua portuguesa. Petrópolis, RJ: Vozes.
Meireles, A. R. (2001) Processos fonético-fonológicos decorrentes do aumento da velocidade de fala no Português Brasileiro. 2001. Dissertação (Mestrado em Letras/Estudos Literários) – Faculdade de Letras, Universidade Federal de Minas Gerais, Belo Horizonte.
Mendes, C. A. (2009) Expressão e o Conteúdo da Fala do Jornal Nacional. 2009. Dissertação (Mestrado em Letras/Estudos Linguísticos) – Faculdade de Letras, Universidade Federal de Minas Gerais, Belo Horizonte.
Pompino-Marschall, B. (1989) On the psychoacoustic nature of the P-center phenomenon. Journal of Phonetics, v. 17, n. 3, p. 175-192, Jul.
Santos, M. C. V. (2008) A interferência dos sinais de pontuação na proficiência de leitura de textos em prosa em sala de aula. 2008. Dissertação (Mestrado em Letras/Estudos Linguísticos) – Faculdade de Letras, Universidade Federal de Minas Gerais, Belo Horizonte.
Staa, Arndt von. (2000) Programação Modular: Desenvolvendo programas complexos de forma organizada e segura. São Paulo: Campus.
Taurosa, S. (1992) The Instability of Word-Final Consonants and its Effects on Word Recognition in Connected Speech. Hong Kong Journals Online. Retrieved May 30, 2010 from http://sunzi1.lib.hku.hk/hkjo/view/10/1000034.pdf.
Trubetzkoy, N. K. (1970) Principes de Phonologie [Grundzüge der Phonologie]. Tradução de J. Cantineau. Paris: Klincksieck, [1939]. Retrieved June 15, 2011 from http://www.univie.ac.at/Hausa/SprawiODL/Trubetzkoy.html.

Web pages

Praat. Software de análise acústica de fala. Retrieved May 30, 2010 from http://www.praat.org.
R. Software para análise estatística. Retrieved May 30, 2010 from http://cran.r-project.org/.
PHP.net. Compêndio oficial de informações sobre a linguagem de programação PHP. Retrieved May 30, 2010 from http://www.php.net.


Setfon. Programa para coleta de dados em fonética acústica e semiótica. Retrieved May 30, 2010 from http://www.sourceforge.net/projects/setfon.

CHAPTER THREE

STRESS ASSIGNMENT CONTRASTED IN SPANISH AND BRAZILIAN PORTUGUESE PROSODIC NON-VERBAL WORDS

ANTÔNIO R.M. SIMÕES¹

Abstract

This study discusses stress assignment in prosodic, non-verbal words in Brazilian Portuguese. Descriptive analyses of stress assignment already proposed for Spanish (see Roca 1990, 1999, 2006) are used as a contrast in this study. Given the conflicting claims regarding stress assignment in Brazilian Portuguese (e.g. Mateus 1983; Bisol 1994, 2005; Lee 2007; Cagliari 1999), there is still a need to revisit discussions on stress assignment in Portuguese. In general, descriptions that use generative metrical theory to explain how stress assignment works in Spanish are in agreement in terms of the interplay between the morphological and phonological domains. Similar descriptions for Portuguese still require far more abstraction and the use of artifacts not needed for Spanish, which leaves unchallenged Mattoso Câmara Jr.’s (1970, 1979) claim that lexical stress is unpredictable in Brazilian Portuguese.

Keywords: stress assignment, prosodic stress, syllable weight, Spanish, Portuguese, mora, trochaic foot, moraic foot, non-verbal words, extrametricality, catalexis.

1 University of Kansas, USA. I am very thankful to Juliette Blevins at CUNY, and Seung Hwa Lee at UFMG in Brazil, for their helpful discussions with me about lexical stress. I am also thankful to José Augusto Carvalho at UFES in Brazil, for his help with two of my bibliographical references.


1. Preliminaries

This chapter is the result of a mini-course prepared for the 2012 School of Prosody (II Escola de Prosódia) that took place at the Federal University of Espírito Santo, in Vitória, Brazil. Most of its content was kept in the form in which it was originally written for the mini-course. No attempt has been made to eliminate some of the original class features because they turned out to be helpful in understanding the claim made here.

One way to compare Portuguese and Spanish in terms of stress assignment is to examine their word stress patterns. In Brazilian Portuguese and Spanish, there is strong pressure to stress words on the penultimate syllable (paroxytones). In both languages, the great majority of words are paroxytones, as some of the examples below illustrate. Despite this coincidence and many others in both languages, predicting lexical stress in Brazilian Portuguese is not as governable as it is in Spanish.

Among the differences in phonological and phonetic patterns between the two languages, one of the most significant is the phonetic reduction or deletion of postonic syllables in Brazilian Portuguese, compared with the relatively less affected vowel quality of postonic syllables in Spanish. This difference in the behavior of postonic syllables is reflected in many ways. In versification, for instance, syllables are counted differently in Spanish and Portuguese, in ways that reflect the rhythmic patterns of each language. Portuguese counts the syllables up to the last stressed syllable, whereas Spanish counts the syllables up to the last stressed syllable plus one, regardless of the physical existence of a postonic syllable. Thus, Martí’s verses below have eight syllables each. As an illustration, if we counted the verses in the same manner as Portuguese versification does, these verses would have seven syllables each. The dot (.) indicates syllable boundaries after resyllabification, accounted as needed. The last lexically stressed syllables in each verse are in capitals.

Yo . soy . u . nhom.bre . sin.CE.ro (8 syllables) Note: also Yo.so.yu.nhom.
De . don.de. cre.ce. la. PAL.ma, (8 syllables)
Y an.tes . de. mo.rir.me . QUIE.ro (8 syllables)
E.char . mis .ver.sos. de.l AL.ma. (8 syllables)
(...)


Oi.go un . sus.pi.ro, a. tra.VÉS (7+1 = 8 syllables)
De . las . tie.rra.s y . la . MAR, (7+1 = 8 syllables)
Y . no e.s un . sus.pi.ro,—ES (7+1 = 8 syllables)
Que . mi hi.jo . va a . des.per.TAR. (7+1 = 8 syllables)
(Poesía de José Martí, Versos Sencillos, 1981)
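The two counting conventions can be stated compactly in code. The following Python sketch assumes that a verse has already been divided into metrical syllables by hand, with the stressed syllable written in capitals as in the verses above; the function name `metrical_count` and the tradition labels are hypothetical.

```python
def metrical_count(syllables, tradition):
    """Count the metrical syllables of a verse.

    `syllables` is a list of already resyllabified syllables; fully
    capitalized items are treated as stressed (an assumption of this sketch).
    """
    stressed = [i for i, s in enumerate(syllables, start=1) if s.isupper()]
    if not stressed:
        raise ValueError("no stressed syllable marked in capitals")
    last_stress = stressed[-1]
    if tradition == "portuguese":
        return last_stress       # count stops at the last stressed syllable
    if tradition == "spanish":
        return last_stress + 1   # one more, whether or not a syllable follows
    raise ValueError(f"unknown tradition: {tradition}")


# "Yo soy un hombre sincero": eight syllables, stress on the seventh.
VERSE_PAROXYTONE = ["Yo", "soy", "u", "nhom", "bre", "sin", "CE", "ro"]
# "Oigo un suspiro, a través": seven syllables, stress on the seventh.
VERSE_OXYTONE = ["Oi", "go un", "sus", "pi", "ro a", "tra", "VÉS"]

if __name__ == "__main__":
    for verse in (VERSE_PAROXYTONE, VERSE_OXYTONE):
        print(metrical_count(verse, "spanish"), metrical_count(verse, "portuguese"))
```

Both verses come out as eight syllables under the Spanish convention and seven under the Portuguese one, although the second verse has no postonic syllable at all.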

These verses help to show how different the two languages are with regard to the phonological features of postonic syllables, which is key to understanding the difficulty researchers have had in including oxytone words in a general system of stress assignment in Brazilian Portuguese. In Portuguese, one can attempt to suggest a solution similar to Spanish in the case of paroxytones and proparoxytones, but not for oxytones. Portuguese scholars may or may not have captured this difference between the two languages when they changed verse counting in Portuguese a couple of centuries ago (Guerreiro, 1784; Castilho, 1851). Before then, versification in Portuguese was similar to Spanish. These differences, as well as the similarities between the two languages, can be further refined. The illustration below shows pairs of paroxytones in both languages. The first word of each pair is written in Portuguese and the second one is its equivalent in Spanish.

MEsa-MEsa (Eng. table)
esCAda-escaLEra (Eng. stairs, ladder)
TEto-TEcho (Eng. ceiling)
LÁpis-LÁpiz (Eng. pencil)
caDERno-cuaDERno (Eng. notebook)
caDEIra-SIlla (Eng. chair)
SAla-SAla (Eng. (living) room)
coZInha-coCIna (Eng. kitchen)

The majority of words in Portuguese and Spanish are stressed on the penultimate syllable, as in the examples above. For instance, the natural pressure to prosodically stress the penultimate syllable in Spanish is so great that words with underlying antepenultimate stress often surface with penultimate stress in natural discourse, e.g. período → períOdo, ¡Pórtate bien! → ¡PorTAte bien! (The stress marker (´) indicates the lexically stressed syllable nucleus.) Dictionaries of the Spanish language show some of these cases as both proparoxytones and paroxytones. Similar trends to change stress to the paroxytonic position can be found in Brazilian Portuguese, although the phonological processes in Portuguese are different from Spanish:

abóbora → aBObra (Eng. pumpkin); xícara → XIcra (Eng. cup)

Although the preceding examples show similar stress assignment trends in both languages, there are important differences, such as the sound deletion and resyllabification in the above words in Portuguese, which ought to be taken into account when attempting to propose stress assignment algorithms. In Spanish, and in Portuguese to a certain extent, a great number of words that are stressed on the antepenultimate syllable are learned words (palabras/palavras cultas). Sometimes they are borrowed from Greek and sometimes from Latin, e.g. Arquímedes in Spanish but ArquiMEdes in Portuguese, Demóstenes in both languages, hipérbaton or hipérbato in Spanish and hipérbato in Portuguese, épsilon in Spanish and ép[i]silon, ép[i]silo, íp[i]silon or ip[i]siLOne in Portuguese, máximum or máximo in Spanish and máximo in Portuguese, régimen in Spanish but reGIme in Portuguese. As seen in these examples, while there are significant trends in Spanish, Portuguese shows no clear trends, that is to say, it is less predictable. This lack of a clear trend permeates Portuguese, contrary to Spanish.

A comparison of trends to paroxytone patterns in Spanish hypocoristics with no such trends in Brazilian Portuguese further reveals the difficulty that researchers face in creating an algorithm to predict stress assignment in Brazilian Portuguese. Hypocoristics in both languages are enlightening in this discussion. Whereas Spanish has predominant patterns of disyllabic paroxytones for hypocoristics, Brazilian Portuguese produces disyllables, monosyllables, paroxytones and oxytones for hypocoristics, without particular trends, as Table 1 illustrates. Spanish does have dialectal variations that use monosyllables in hypocoristics, e.g. Daniel ~ Dan, Cristina ~ Cris, but this happens less frequently than the patterns above, and it usually happens in closed syllables (CVC), while monosyllables in Brazilian Portuguese generally have open syllables (CV). Therefore, in Brazilian Portuguese the variations are much greater.

Table 1. A comparison of trend in Spanish hypocoristics and the lack of trend in BP hypocoristics.

Spanish
Noun | Hypocoristic | Stress Pattern
Adriana | Adri | paroxytone
Daniel | Dani, Dan | paroxytone, monosyllable
Francisco | Pancho | paroxytone
Juan Ramón | Juanra | paroxytone
José | JOse, Pepe | paroxytone
MiGUEL | MIguel | paroxytone
Ignacio | Nacho | paroxytone
Ariel | Ari | paroxytone
Antonio | Toño | paroxytone

Brazilian Portuguese
Noun | Hypocoristic | Stress Pattern
Benedito | Benê | oxytone
Fernando | Nando, Fê | paroxytone, monosyllable
Francisco | Chico, Chicô | paroxytone, oxytone
Gustavo | Gugu, Gu | paroxytone, monosyllable
José | Zé | monosyllable
Rodrigo | Ro | monosyllable
Pedro | PePEU, PEpe | oxytone, paroxytone
Maria José | Zezé | oxytone
Antônio | Totonho, Tunico, Tô, Tonho | paroxytone, trisyllable, disyllable, monosyllable

Penultimate stress is also more predictable in contemporary Spanish loanwords (LW), acronyms (AC) and foreign proper names (FN), but not in Portuguese, as depicted in the comparison below.

Table 2. A comparison of paroxytone and oxytone trends in SP and BP loanwords. This table was produced with the help of eleven native speakers of SP and five native speakers of BP, who answered a questionnaire sent to them by e-mail, in which I asked them to indicate the stressed syllables in the words in this table. LW stands for loanwords, AC for acronyms and FN for foreign proper names.

Spanish
LW, AC, FN | Stress Pattern
Barman | BARman
Email | Email
Cocktail | COCtel
Karaoke | karaOke
Gorbachev | GorbaCHEV, GorBAchev
Muhammad Ali | MuhamMAD, MuHAMmad
PEMEX | PEmex
PC | Pc (PEce)

Brazilian Portuguese
LW, AC, FN | Stress Pattern
barman | BarMAN
email | eMAIL
cocktail | coqueTEL
karaoke | KaraoKE
Gorbachev | GorbaCHEV[i]
Muhammad Ali | MuhamMAD[i]
PEMEX | peMEX
PC | pC (peCE)

These examples further show the greater tendency in Spanish to stress the penultimate syllable, compared to Portuguese. The next section will discuss the notion of prosodic words along with other notions which are common in the American generative frame of (Autosegmental) Metrical Theory (Leben, 1973; Liberman, 1975; Bruce, 1977; Liberman and Prince, 1977; Goldsmith, 1976, 1990; and all the works that followed afterwards). This study is only using the generative frame to discuss metrical theory as it has been applied to Spanish, and to show how difficult, if not impossible and unmotivated, it is to predict lexical stress in Portuguese. In other words, although the generative frame is very useful to discuss stress assignment in any language, this study does not support the claim that it can predict stress assignment in Portuguese.

Given the three types of syllable prominences in words of the two languages, e.g. the Spanish triplet CÉlebre, ceLEbre, celeBRÉ, and the Brazilian Portuguese triplet PÁssara, pasSAra², passaRÁ, the stress windows in the two languages include proparoxytones (antepenultimate syllable stress), paroxytones (penultimate syllable stress) and oxytones (last syllable stress). In order to discuss stress assignment in the following sections, it would be helpful to first review the concepts of prosodic word, prosodic stress and syllable weight, and then discuss the cases of the most common patterns, the paroxytones, then the proparoxytones and finally the oxytones.

2 Postonic vowels in Brazilian Portuguese change considerably in quality, but in the case of the /a/, an inherently strong vowel and more resistant to significant changes in quality, the stress contrast in this triplet is still a valid one, especially when used in the careful, clear speech of the acrolect.

2. The Prosodic Word

Structures are present in all aspects of life. This is inescapable. Language units have internal structures, and usually these structures have a nucleus and optional satellite elements at any level, from the features that make up a sound or phone, to the syllable, to the word, phrase, sentence, paragraph and discourse. At word level, the nucleus is the stressed syllable, but in metrical theories, it is said to be the metrical foot, which is discussed below.

2.1 Identification of prosodic stress

Although I do not naïvely follow the common division of words into two main classes – content and function words – it is common and often helpful to say in Linguistics that content words are the only ones that receive lexical stress. In this same view, function words are unstressed. Content words are words with stronger semantic content, or with a meaning that is easier to know. Function words have weaker semantic content because their meanings are less obvious to guess. Content words are verbs, nouns and other words with a meaning that is more obvious to understand. Function words are “little” words like prepositions, pronouns attached to verbs, conjunctions, and similar words that either have a less obvious meaning or are satellites of or attached to a content word because they cannot stand by themselves in discourse. Such a classification is, as are all classifications, a helpful view, but it has serious flaws. Human language does not have sharp divisions or classifications. Function words sometimes have one prominent syllable if they have more than one syllable, e.g. the word for “while,” enquanto in Portuguese and mientras in Spanish.

Prosody is a cover term for intonation, rhythm, stress, quality and other non-segmental elements. The prosodic word has stress; in general, clitics, that is to say, the term corresponding to function words, do not. Therefore, in addition to saying that clitics have very weak semantic content, a common argument to defend the idea that clitics do not have stress is that they need the prosodic word to be in a sentence or phrase. On the other hand, prosodic words can stand by themselves.

Prosodic words
- ¿Quiere eso? - No.
- ¿Tal vez más tarde? - Tal vez.

Prosodic words and clitics
- ¿De dónde vino? - *De³
- Si Pablo viene, ¿vienes también? - *Si. (in English, if)

In sum, prosodic words are verbs, nouns, adjectives, adverbs, indefinite adjectives and pronouns, demonstrative adjectives and pronouns, possessive pronouns, subject pronouns, prepositional pronouns, numerals and interrogatives. These classes of words are considered the only ones to receive prosodic stress.

2.2 Syllable weight

Mora (μ) is the unit that makes the prominent weight of prosodic feet. In Spanish, its weight is generally uniform, one mora. In Spanish, contrary to English, even in complex nuclei like diphthongs, all vocalic features of a syllable nucleus fit into one mora, as illustrated with the word sentimiento:

3 Asterisks are traditionally used in Linguistics to indicate “not well-formed structures” or “not legal forms.”

[Tree diagram of the word [sentimiento], with the tiers Prosodic word (PW), Prosodic foot (PF), Syllables (σ), Moras (μ) and Segments.]

Figure 1. The prosodic structure of the Spanish word sentimiento, to illustrate the concept of mora (μ).

Spanish does not contrast the so-called heavy or bimoraic syllables and light or monomoraic syllables. The words “see, no, array, key, pie and tear (=rip)” in English, for example, contrast heavy and light syllables: 'seμaμ, noμμ, aμ.'rraμyμ, 'keμyμ, 'piμeμ, 'teaμrμ. Hence, in English, one-syllable words are binary in terms of the number of moras or morae. The majority of English words have one syllable. Spanish needs two syllables to have two moras and the majority of Spanish words have two syllables. Brazilian Portuguese shows these Spanish characteristics without the regularity or uniformity found in Spanish. For example, in Rio, and maybe in some other areas of Brazil, but particularly in Rio, we find bimoraic and monomoraic contrasts, e.g. when answering the phone: “Alô!” [aμ.loμəμ]. It is important to keep in mind that we are thinking of a language system as found in social mesolects or higher. If we analyze Spanish basilects, we are dealing with completely different varieties of Spanish. The same can be said about Portuguese.

2.3 Paroxytonic words

The structure of paroxytonic words contains a nucleus in the two last syllables, identified with the symbols “< >”: cua, , carre, pensa, melo, desespenos, desafortunada. These nuclei are prosodic feet, as illustrated with the word finalmente in the diagram below.


Figure 2. The prosodic foot <men.te> in the Spanish word finalmente. Properties of this structure: Satellites: fi, nal. Nucleus: men.te; final (end of foot coincides with end of word); trochaic (left of foot; not iambic, which is right); binary.

Therefore, agreement among researchers when determining stress in Spanish paroxytones is greater than in Portuguese, because words like “finalmente” contain all the basic structural requirements of finality, trochaicity and binarity. In Brazilian Portuguese, we also find similarities in paroxytones. The obstacle one will find is how to fit proparoxytones and oxytones into these ideal structural property requirements. According to the generative frame of Metrical Theory, in Spanish, although oxytones seem, after a superficial look, to have a unary foot (co.li.), and proparoxytones a ternary foot (), this can be solved with a morphological interpretation, as discussed below.

2.4 Proparoxytonic words

A common statement in the generative frame of Metrical Theory is that in Spanish, a great number of words have proparoxytonic stress because proparoxytones include vowels without morae in their structures. I do not know of an empirical basis on which to make this claim, but it is helpful, when understanding Metrical Theory, to assume that some syllables have no mora, that is to say, are extrametrical. Therefore, words like mínimo, sábana, número will have a non-moraic vowel in their root (mínim-o, sában-a, númer-o). Consequently, root morphemes such as these, with one syllable without weight, carry the potential to cause retraction of stress, as illustrated with the word espárrag-o below, because their penultimate syllable (-rra- in the case of espárrago) is invisible to rules, or extrametrical. This characteristic or idiosyncrasy of proparoxytones leads to the conclusion that stress assignment in Spanish is morphologically conditioned. This morphological condition also applies in a different way to oxytones.

Figure 3. The morphological interpretation of proparoxytones as underlying paroxytones.

This opacity or invisibility is usually considered a relatively “small problem” in generative Metrical Theory. Other words carry this invisibility, e.g. murciélago, máscara. Portuguese, for example, commonly deletes non-moraic vowels as in “máscara”, resulting in the phonetic mascra. Therefore, root morphemes can cause stress retraction. Suffixes, in particular derivational suffixes, can also cause retraction. Some examples of these suffixes are –ic as in metálico, canónico. Hence, when the morpheme –ic is in paroxytonic position, it causes retraction of stress, because the [i] vowel is not moraic.

Other suffixes that cause retraction of non-verbal forms in this framework are –ul (capítulo), –metr (cronómetro), –log (prólogo), –graf (bolígrafo), –fon (teléfono), –nom (astrónomo, ecónomo). There are exceptions. Astronómico, for example, could be seen as an exception. However, the limit of stress assignment to the three last syllables would be violated with the stress on the 4th syllable (*astrónomico). Here, as in other seemingly exceptional cases, morphological elements help understanding. The main explanation in cases such as astronómico is made through the concept of morphological nuclei. In other words, the morphemes in astronómico are astr-o-nom-ic-o. The morphemes –nom and –ic are some of the morphemes that have a non-moraic vowel. Given that –ic, and not –nom, is the morpheme that characterizes “astronómico” as an adjective, then –ic is the nuclear morpheme that keeps the vowel invisible. In this exceptional case, –nom then is not a nuclear morpheme and as a result it carries one mora.

The same arguments can be attempted in Portuguese and most of the examples in Spanish are similar in Brazilian Portuguese: mínimo, número, canônico, metálico, cronômetro, astrônomo, etc. Therefore, according to the preceding discussion based on the generative frame of Metrical Theory for Spanish, proparoxytones have the conditions of a syllable foot on the right of the word, just as paroxytones do. These conditions are: only morphemes that function as a morphological nucleus can retain non-moraic vowels. This also explains why this type of word is not common. Given the condition above, the same principles of paroxytones apply to proparoxytones: binarity, finality and trochaicity.
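The three-syllable limit invoked above for *astrónomico can be made concrete with a small sketch. It is only an illustration: the function name is hypothetical, and the words are syllabified by hand and passed in together with the index of the stressed syllable, since no syllabification procedure is given in this chapter.

```python
# Illustrative sketch (hypothetical helper, not part of the analysis itself):
# classify a word by the position of its stressed syllable, counted from the
# right edge, and flag stress that falls outside the three-syllable window.

STRESS_TYPES = {1: "oxytone", 2: "paroxytone", 3: "proparoxytone"}


def stress_type(syllables, stressed_index):
    """syllables: hand-syllabified word; stressed_index: 0-based, from the left."""
    if not 0 <= stressed_index < len(syllables):
        raise ValueError("stressed_index must point at one of the syllables")
    position_from_end = len(syllables) - stressed_index
    return STRESS_TYPES.get(position_from_end, "outside the stress window")


examples = [
    (["so", "fá"], 1),
    (["ver", "du", "ra"], 1),
    (["as", "tró", "no", "mo"], 1),
    (["as", "tro", "nó", "mi", "co"], 2),
    (["as", "tró", "no", "mi", "co"], 1),  # the starred form *astrónomico
]
for syllables, index in examples:
    print("".join(syllables), "->", stress_type(syllables, index))
```

The check says nothing about why stress lands where it does; it only expresses the window that the morphological account is meant to respect.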

2.5 Oxytonic words

If we keep the same view that we have been using in this discussion, oxytonic words are also morphologically conditioned. Whereas derivational morphemes or morphemes with morpho-syntactic function interfere in the irregularity of stress assignment of proparoxytones in Spanish, in the case of oxytones, the reason for irregularities has to do with morphemes whose function is exclusively morphological. Given the complexities in showing that oxytones and paroxytones carry similar conditions, a review of the morphological structure of words can be of help. Langacker (1972, 74f) is still one of the best ones that I know.

“Most complex lexical items are formed by attaching affixes to a basic morpheme called [a] root. Affixes that precede roots are called prefixes, and those that follow the root are called suffixes. […] It is customary to distinguish between inflectional and derivational affixes. Inflectional affixes are those that mark number, gender, case, tense, and certain other categories; derivational are those that are not inflectional. Although the distinction is not a sharp one, inflectional and derivational affixes do tend to have certain contrasting properties. For one thing, derivational morphemes are in general independently meaningful, whereas inflectional categories are frequently introduced as sentence trappings by agreement rules. Second, derivational affixes have the potential to change the grammatical class of the elements to which they are attached. For example, the addition of the derivational suffix –ful to the noun care results in an adjective […] Third, there is a universal tendency for derivational affixes to occur closer to the root than inflectional affixes,” as in the figure below.


Figure 4. The morphological word. Note: IA = inflectional morpheme; DA = derivational morpheme.

In Spanish, the morphemes that have exclusively morphological functions are the class markers and markers of person/number, which normally appear at the very end of words. A distributive test helps us to know which ones mark class and which ones do not. If we add the –er morpheme to joya, we obtain joyero (jewelry box), and not *joyaero. Therefore this “a” in joya is a class marker. Likewise, from guante, we obtain guantera (and not *guanteera). In the case of café, we obtain cafetera with the insertion of “t” to preserve the “e”, which indicates that the “e” in “café” is not a class marker. Likewise, maní – manicito, sofá – sofacito, need the insertion of “c” to preserve the vowels “i” and “a”. In these cases of the diminutive –it, these vowels are not class markers either.

The conclusion is that oxytones ending in a vowel behave as if they lack a class marker. But oxytones ending in consonants (e.g. papel, mujer, patrón) also share this characteristic. Taking into account that in Spanish the majority of words ending in a consonant are oxytones, this leads to the conclusion that there is a relationship between oxytones and the lack of word markers. Furthermore, this lack of a word marker leaves an “empty” slot at the end of words.

But of course, all common Spanish words or palabras patrimoniales have class markers. The explanation is that some words have prosodic structure, but not segmental structure. In other words, their morphemes are irregular. This irregularity is called catalexis. Basically, Spanish has regular moraic class markers (all five vowels a, e, i, o, u) and catalectic class markers.

Having prosodic but not segmental structure creates opacity. In other words, this description of stress assignment in oxytones says that the segment is invisible although there is a prosodic structure. But there is evidence for this claim.
For instance, if we take a look at the diagrams for sofá and mujer, showing the derivations of the singular and plural forms, this claim may be more easily accepted despite the opacity of the description with respect to the oxytones.

Figure 5. Spanish words sofá and mujer.

Compare the diagram below of the plural forms sofás and mujeres:

Figure 6. Spanish plural forms of the words sofá and mujer in the preceding Figure 5.

Therefore, plural formation in Spanish also supports this idea of catalectic class markers. Plurals are formed in Spanish with the morpheme/suffix –s. Since –s is inflectional, it goes after derivational morphemes. In words with regular plural formation, the process is transparent: pal-o-s, car-a-s, cruc-e-s, curs-i-s, trib-u-s. But in the case of catalectic class markers, adding plural morphemes forces the catalectic element to acquire a segmental content: sofá-e-s, maní-e-s, mujer-e-s, caiman-e-s. This additional “e” with the plural morpheme reveals the “emptiness” of the catalectic space, as well as other characteristics of Spanish such as the preference for CV sequences and the use of epenthetic “e”, which is the only vowel that Spanish uses when a vowel is to be inserted. And in its plural form, the word becomes paroxytone.

If one accepts these generativist argumentations regarding oxytones, it is reasonable to say that all oxytones, regardless of their ending in a vowel or a consonant and contrary to paroxytones and proparoxytones, lack a class marker, that is to say, inflectional morphemes. This is unusual. A valid way of explaining this lack of inflectional morphemes is to assume that the space is there, and it can be said to be a prosodic “space,” not segmental. Therefore, the underlying structure of oxytones should have a prosodic mora that is filled out with a physical segment as needed. In the position where there should be an inflectional morpheme vowel, this vowel is not physically present. This leads, among other consequences, to assigning an underlying “e” vowel that will surface in plural forms of oxytones.

This description has a number of advantages and the most obvious is the generalization of assigning only “s” to plural, instead of two rules, one for adding –s and another to add –es. In the case of words like café, té, the insertion of “e” still happens, and in actual speech this fusion is common in Spanish (mijito for mi hijito, alcol for alcohol, etc.). Of course some dialects have variations (e.g. sofás, rubís), but the description of stress assignment must be maintained within the same register of a dialect of reference. Otherwise, if one considers how different Spanish can be in different social and geographical areas, stress assignment as a general trend would be even more complex if not impossible to describe.
It is simpler and more desirable to stay within the limits of a given register that is considered representative, and then move on into different dialectal variations. For the intended description of stress assignment for this discussion, we used the careful speech of general Spanish or the Spanish of Latin America’s altiplanos as reference. Taking the educated register into consideration, we need to keep in mind the following. The morphological nucleus is the morpheme that selects the class marker that identifies the word. By adding a derivational suffix, the morphological nucleus changes. This change makes it possible to introduce a different class marker. In the word verd-e, for instance, the morpheme that selects the class marker is verd, and it requires “e.” If we add “–ur” to the root, we obtain verd-ur-a, or yet, verd+or+catalexis.


As –ur and –or become the morphological nuclei, they select the segment “a” and a prosodic mora as class markers. Among morphemes that select class markers, the most common are: –dad, –itud, –ción, –zón, –dor, –al, –il, –y, –r.

There are obviously serious flaws with the framework just discussed. It requires accepting an ad hoc description and the assumption that the opacity is an acceptable price to pay, given the other advantages of the description. These flaws are even worse in the case of Portuguese. In the case of oxytones in Brazilian Portuguese, for example, one can attempt to propose a process similar to the one in Spanish. All non-verbal plurals would be formed with one rule, by adding “s” to the singular form. Through this view, in Brazilian Portuguese, in the case of oxytones, there is no physical segment in the slot where normally one expects an inflectional morpheme. Words need a class marker to indicate lexical class. The explanation, then, seems to be similar to the one for Spanish, which would be desirable if there were no flaws because it would require only one rule for plural formation. Therefore, if oxytones have an invisible prosodic unit that we can call mora, it materializes as “e.” Words like “sofá,” “café,” “colibri,” “mulher,” “mar,” etc. would then have an “s” added in their plural forms. After the addition of “s,” “e” would be inserted before “s,” and finally other known phonological processes of Brazilian Portuguese would apply as needed:

sofá_ → sofá_s → sofáes → sofás
café_ → café_s → cafées → cafés
colibri_ → colibri_s → colibries → colibris
mulher_ → mulher_s → mulheres → mulheres
mar_ → mar_s → mares → mares
azul_ → azul_s → azules → azues → azuis
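The derivations listed above can be written out as a short sketch. This is a hedged illustration only: the function name is hypothetical, and the last stage (the “other known phonological processes”) is reduced to the two adjustments needed to reproduce the forms listed here, namely fusion of the inserted “e” with a stem-final vowel, and loss of “l” followed by “e” surfacing as “i”. It is not offered as a general account of Brazilian Portuguese plurals.

```python
# Illustrative sketch of the single-rule plural derivation for oxytones.
# The final-stage adjustments are simplifying assumptions that only cover
# the example words discussed in the text.

VOWELS = set("aeiouáéíóúâêôãõ")


def pluralize_oxytone(singular):
    """Return the derivation steps for an oxytone as a list of strings."""
    steps = [singular + "_"]        # empty (catalectic) class-marker slot
    steps.append(singular + "_s")   # the one plural rule: add "s"
    steps.append(singular + "es")   # the empty slot materializes as "e"
    form = steps[-1]
    if singular[-1] in VOWELS:
        form = singular + "s"       # inserted "e" fuses with the stem vowel
    elif singular[-1] == "l":
        form = singular[:-1] + "es"  # "l" is dropped between vowels ...
        steps.append(form)
        form = singular[:-1] + "is"  # ... and "e" then surfaces as "i"
    steps.append(form)
    return steps


for word in ["sofá", "café", "colibri", "mulher", "mar", "azul"]:
    print(" → ".join(pluralize_oxytone(word)))
```

Each printed line follows the same order of steps as the corresponding derivation in the text, including the vacuous last step for mulher and mar.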

The processes above can also be viewed as tree diagrams, like in Spanish, in the singular and plural forms. Figure 7 shows diagrams of the Portuguese singular forms of sofá and mulher.


Figure 7. Brazilian Portuguese words sofá and mulher.

As can be seen, however, it would be necessary to propose biased underlying forms based on historical forms for other words, e.g. *brasiles, *papeles, *colchones, which have no synchronic correspondences or motivation in Brazilian Portuguese. Then, it may be simpler, more efficient and more realistic to maintain Mattoso Câmara Jr.’s (1970, 1979) claim that lexical stress in Portuguese is unpredictable. One must use memory to learn stress placement in Portuguese.

3. Conclusion

This study, which was originally a mini-course at the Federal University of Espirito Santo in Brazil, examined lexical stress assignment in Spanish to help to understand why lexical stress is unpredictable in Brazilian Portuguese. The degree of abstraction, the creation of artifacts and various other flaws involved in attempting an algorithm for lexical stress assignment in Brazilian Portuguese make such an algorithm unnecessarily complicated. In the history of sciences, we well know that the more complicated a description is, the farther we are from the truth. This is obviously a reference to Occam’s Razor.

Although human languages are rule governed, stress assignment is not always predictable. Stress assignment is most likely unpredictable in Russian (Molczanow et al., 2013) and very likely unpredictable in English as well (Plag, 2006; Domahs et al., 2014), in spite of the claims in Chomsky and Halle (1968) that it can be predicted, since careful inspection of stress occurrences in English, e.g. “ápplecake” vs. “piecáke,” contradicts their stress assignment algorithm in SPE. Therefore, many aspects of supposedly predictable lexical stress in these languages, including Brazilian Portuguese, are more effectively described if lexical stress is part of the lexicon, that is to say learned by memory.

References

Bisol, Leda. (1994) O Acento e o Pé Métrico Binário. In Letras de Hoje 98, 25-36. Porto Alegre: EDIPUCRS.
—, org. (2005) Introdução a estudos de fonologia do português brasileiro. Porto Alegre: EDIPUCRS.
Bruce, Gösta. (1977) Swedish word accents in sentence perspective. Lund: Gleerup.
Cagliari, Luiz Carlos. (1999) O Acento em Português. Campinas: Editora do Autor.
Câmara Jr., Joaquim Mattoso. (1970) Estrutura da Língua Portuguesa. Petrópolis, RJ: Vozes.
—. (1979) The Portuguese language. Chicago: University of Chicago Press.
Castilho, António Feliciano de. (1851) Tractado de metrificação portugueza. Para em pouco tempo, e até sem mestre, se aprender a fazer versos de todas as medidas e composições. Lisboa: Imprensa Nacional. Published in 1874 as Tratado de metrificação portugueza: seguido de considerações sobre a declamação e a poética. Porto: Livraria Moré-Editora.
Chomsky, Noam and Morris Halle. (1968) The Sound Pattern of English. New York: Harper & Row.
Domahs, U., I. Plag and R. Carroll. (2014) Word stress assignment in German, English and Dutch: Quantity-sensitivity and extrametricality revisited. In Journal of Comparative Germanic Linguistics, 17, 1, 59-96.
Goldsmith, J. A. (1976) Autosegmental Phonology, Ph.D. dissertation, MIT.
—. (1990) Autosegmental and Metrical Phonology. Oxford: Blackwell Publishers.
Guerreiro, Miguel do Couto. (1784) Tratado da versificaçaõ portugueza. Lisboa: Oficina Patr. De Francisco Luiz Ameno.
Hualde, José Ignacio. (2005) The Sounds of Spanish. New York: Cambridge University Press.
Langacker, Ronald. (1972) Fundamentals of linguistic analysis. New York: Harcourt Brace Jovanovich, Inc.


Leben, William. (1973) Suprasegmental phonology. Ph.D. Thesis, MIT. (New York: Garland Press, 1980.)
Lee, Seung-Hwa. (2007) O Acento Primário no Português: Uma Análise Unificada na Teoria da Otimalidade. In O Acento em Português: Abordagens Fonológicas. São Paulo: Parábola Editorial, 121-143.
Liberman, M. (1975) The intonational system of English, Ph.D. dissertation. Boston, Massachusetts Institute of Technology.
Liberman, M. and Prince, A. (1977) On stress and linguistic rhythm. Linguistic Inquiry 8: 249-336.
Mateus, Maria Helena Mira. (1983) O Acento de Palavra em Português: Uma Nova Proposta. Lisboa: Boletim de Filologia 27, 211-229.
Molczanow, J., Domahs, U., Knaus, J. & Wiese, R. (2013) The lexical representation of word stress in Russian: Evidence from event-related potentials. The Mental Lexicon, 8, 164-194.
Núñez Cedeño, Rafael A. and Morales-Front, Alfonso. (1998) Fonología generativa contemporánea de la lengua española. Washington, D.C.: Georgetown University Press.
Pierrehumbert, J. (1980) The Phonology and Phonetics of English Intonation. Ph.D. dissertation, MIT, Cambridge, MA. [Published in 1987 by Indiana University Linguistics Club, Bloomington.]
Piñeros, Carlos-Eduardo. (2009) Estructura de los sonidos del español. Upper Saddle River, New Jersey: Pearson Education, Inc.
Plag, Ingo. (2006) The variability of compound stress in English: structural, semantic, and analogical factors. English Language and Linguistics Volume 10 Issue 01 / May 2006, 143-172.
Roca, Iggy M. (1999) “Stress in the Romance Languages.” In Word Prosodic Systems in the Languages of Europe, editor van der Hulst, H. Berlin, DEU: Mouton de Gruyter, 659-811.
—. (1990) “Diachrony and synchrony in Spanish stress”, Journal of Linguistics 26: 133–164.
—. (2006) “The Spanish stress window”, in F. Martínez-Gil & S. Colina (eds.), Optimality-theoretic advances in Spanish phonology, Amsterdam: John Benjamins. 239-77.
Sosa, Juan Manuel. (1999) La entonación del español: su estructura fónica, variabilidad y dialectología. Madrid: Cátedra.

CHAPTER FOUR

ELECTROGLOTTOGRAPHY

PROF. MAURÍLIO NUNES VIEIRA¹

Abstract

This chapter presents the basic principles of the electroglottographic (EGG) technique and describes its application in the analysis of voice disorders and vocal quality. Recommendations concerning the recording and pre-processing of EGG signals are discussed in detail. Finally, normative data is presented for various objective parameters measured from recordings of dysphonic patients.

Keywords: voice, electroglottography, dysphonia

1. Introduction²

Electroglottography is a technique attributed to Fabre (1940, 1957), who originally devised a method for electrical detection of blood pulsation. In this technique, a high frequency carrier is applied to the body (by means of contact electrodes) to be amplitude-modulated by the transcutaneous impedance changes associated with the vascular pulse. In 1957, Fabre suggested that the technique could be appropriate for detecting impedance changes across the neck caused by the vocal folds’ vibrations. This basic principle was confirmed experimentally in a number of studies (e.g., Fant et al., 1966; van Michel, 1967; Lecluse, Brocaar & Verschuure, 1975; Gilbert, Potter & Hoodin, 1984).

¹ Department of Electronic Engineering/School of Engineering, Federal University of Minas Gerais.
² This chapter is an abridged version of “EGG Assessment of Voice Disorders”, in Vieira (1997), “Automated Measures of Dysphonias and the Phonatory Effects of Asymmetries in the Posterior Larynx”, Ph.D. Thesis, University of Edinburgh.

The main appeal of electroglottography is the availability of relatively inexpensive and easy-to-use devices to obtain a signal related to the phonatory mechanism. The technique is non-invasive and can be used in running speech. Moreover, the electroglottographic (EGG) signal suffers little influence from vocal tract resonances, being especially useful for F0 tracking (Krishnamurthy, 1983; Orlikoff, 1995). Also, EGG signals permit a certain speculation about the phonatory settings and the effects of vocal fold pathologies on phonation. Currently, electroglottography is utilized in the study of voice production, in the assessment of voice disorders, and as a visual feedback aid during the treatment of dysphonias (e.g., Fourcin et al., 1995). However, as will be discussed in the course of the chapter, electroglottography suffers from many limitations and users should be aware of them (Baken, 1987, pp. 219-227; Colton & Conture, 1990).

Electroglottography is also known as “electrolaryngography” or simply “laryngography.” Recognizing that this technique relates not only to the glottis, but also to the larynx as a unit, Baken (1987, p. 217) pointed out that, “laryngography has long been used by radiologists to refer to contrast-medium visualization of the larynx […]. To prevent any confusion, it seems wisest to retain the term electroglottography despite its literal inaccuracy.”

The next sections of this chapter will describe the technique and its limitations, focusing on methodological aspects for data acquisition, preconditioning, and analysis. Methods for the improvement of the EGG signal quality and the extraction of objective parameters will be described. Relevant aspects of the EGG technique that may not be detected by objective analysis will also be addressed.

2. The technique

The Electroglottograph. A schematic representation of the most important features of an electroglottograph is shown in Figure 1. A high frequency signal (carrier) is passed through the neck by means of two metallic electrodes “A” and “B”, transformers being used to provide electrical isolation to the users. In commercial instruments, the amplitude and frequency of the carrier are of the order of 0.5 V and 0.3-5.0 MHz, respectively (Baken, 1987). The carrier is amplitude-modulated by the variation of the electrical impedance between the electrodes (ZAB) and is demodulated by an envelope detector. The detected signal is then amplified, subject to an automatic gain control (AGC), being also limited to a certain bandwidth, as indicated by a band pass (BP) filter. Figure 1 represents the vocal folds in a partially closed position. Possible current paths through the anterior and posterior parts of the larynx are suggested.

It has been demonstrated experimentally, based on the comparison of diverse instruments (Lecluse, Brocaar & Verschuure, 1975), that carrier frequencies below 50 kHz result in an undesired sensitivity to capacitive components of the impedance ZAB. These would include electrode-to-neck capacitances and, possibly, capacitances between other laryngeal folds (e.g., vestibular folds).

The electrodes are usually but not necessarily circular, with a diameter of approximately 2.5 cm. Smaller electrodes may be necessary for children. It has been shown (Lecluse, Brocaar & Verschuure, 1975) that the “best” positioning of the electrodes in the larynx varies among instruments from different manufacturers. To reduce the resistive component of the electrode-to-neck impedance, it has been suggested that the electrodes and the skin should be cleaned, a layer of conductive paste being also recommended (Colton & Conture, 1990). A simpler approach involves the use of saline (Frøkyaer-Jensen, 1996). Rothenberg and Mahshie (1988, p. 338) have pointed out that the use of a conductive paste may also help in case of facial hair.
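The modulation and demodulation steps described above can be illustrated with a small numerical sketch. All values are assumptions chosen to keep the example short: the carrier is scaled down from the megahertz range to 10 kHz, the modulation depth is exaggerated, and the envelope detector is modelled as full-wave rectification followed by a one-period moving average. It does not reproduce the circuit of any actual instrument.

```python
import math


def moving_average(values, width):
    """Average over a sliding window of `width` samples (valid part only)."""
    window_sum = sum(values[:width])
    out = [window_sum / width]
    for i in range(width, len(values)):
        window_sum += values[i] - values[i - width]
        out.append(window_sum / width)
    return out


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Assumed, scaled-down values (not those of a real electroglottograph).
FS = 100_000        # sampling rate (Hz)
F_CARRIER = 10_000  # carrier frequency (Hz); real devices use 0.3-5.0 MHz
F_VOICE = 100       # rate of the simulated impedance change (Hz)
DEPTH = 0.05        # modulation depth, exaggerated for clarity
N = 5_000           # 50 ms of signal

# Slow signal standing in for the impedance change between the electrodes.
impedance_change = [math.sin(2 * math.pi * F_VOICE * n / FS) for n in range(N)]

# Amplitude modulation of the carrier by the slow signal.
modulated = [(1 + DEPTH * z) * math.sin(2 * math.pi * F_CARRIER * n / FS)
             for n, z in enumerate(impedance_change)]

# Envelope detection: rectify, then average over exactly one carrier period.
PERIOD = FS // F_CARRIER
envelope = moving_average([abs(x) for x in modulated], PERIOD)

# Align the slow signal with the window positions before comparing.
reference = impedance_change[PERIOD - 1:]
similarity = pearson(envelope, reference)
print(f"correlation between envelope and impedance change: {similarity:.4f}")
```

A correlation close to 1 indicates that the detected envelope follows the slow impedance change while the carrier itself has been averaged out.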

Figure 1: Main features of an electroglottograph. This figure is not intended to reflect construction details of any specific commercial model, but only the most likely features. HF GEN. = High frequency generator; A, B = metallic electrodes positioned at the level of the thyroid cartilage; ZAB = effective electrical impedance between the electrodes; AM DEM. = slope detector (amplitude demodulator); AMP. = amplifier; BP FILT. = band pass filter; AGC = automatic gain control.


Figure 1 shows that the amplitude of the EGG signal is dynamically controlled by an AGC circuit. The feedback loop could also act on the amplitude of the carrier, as has been mentioned elsewhere (Rothenberg & Mahshie, 1988). This variable and, to some extent, unpredictable carrier amplitude implies that no absolute measure of translaryngeal impedance can be obtained from this instrument.

The bandwidth of the EGG signal, indicated by the band pass filter in Figure 1, is approximately 0-10 kHz in commercial devices (Baken, 1987). A highpass cutoff frequency at 10-25 Hz is often available to reduce slow EGG components caused by factors such as the change in the vertical position of the larynx or movement of the electrodes (Baken, 1987; Colton & Conture, 1990). These slow components can affect F0 estimation and related measures (Vieira, McInnes & Jack, 1996), sometimes leading to saturation (clipping) of the signal.
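The effect of such a highpass cutoff on slow components can be sketched numerically. The filter below is a generic first-order (RC-style) high-pass filter, chosen here only as an assumption for illustration; the sampling rate, the 20 Hz cutoff, the 1 Hz “drift” and the 150 Hz tone standing in for the glottal cycle are likewise hypothetical values, not measurements.

```python
import math


def high_pass(samples, cutoff_hz, fs):
    """First-order high-pass filter: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / fs)
    out = [0.0]
    for n in range(1, len(samples)):
        out.append(a * (out[-1] + samples[n] - samples[n - 1]))
    return out


def peak(values):
    """Largest absolute value in the second half, after the filter has settled."""
    return max(abs(v) for v in values[len(values) // 2:])


FS = 10_000   # assumed sampling rate (Hz)
CUTOFF = 20   # Hz, within the 10-25 Hz range mentioned in the text
t = [n / FS for n in range(2 * FS)]  # 2 s of signal

# Slow baseline movement (e.g., larynx or electrode displacement).
drift = [5.0 * math.sin(2 * math.pi * 1.0 * x) for x in t]
# Faster component standing in for the vocal fold vibration.
voicing = [math.sin(2 * math.pi * 150.0 * x) for x in t]

drift_out = peak(high_pass(drift, CUTOFF, FS))
voicing_out = peak(high_pass(voicing, CUTOFF, FS))
print(f"1 Hz drift, input peak 5.00, output peak {drift_out:.2f}")
print(f"150 Hz tone, input peak 1.00, output peak {voicing_out:.2f}")
```

The slow component is strongly attenuated while the faster one passes almost unchanged, which is the behaviour the cutoff is meant to provide.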

Figure 2: EGG waveform. Schematic representation of a typical waveform (modal voice, healthy larynx) and corresponding coronal sections of the vocal folds. (1) first subglottal contact; (2) increasing vertical contact; (3) maximum contact and medial pressure; (4) subglottal separation of the folds; (5) separation of the upper rim of the folds; (6) subglottal approximation. The arrow near label 4 points to the so-called “knee” of the waveform. Composed with drawings of the vocal folds by Wendler (1993).

EGG signal and dynamic vocal fold contact area. A typical EGG signal is shown in Figure 2, where the signal amplitude increases with increasing vocal fold contact (i.e., the EGG is representing the electrical admittance 1/ZAB). The use of an “inverted” waveform, representing the electrical impedance, is preferred by some authors, especially when the EGG signal is displayed simultaneously with the transglottal airflow, since these waveforms bear some similarities (see, e.g., Bless et al., 1992).

The vocal contact corresponding to some points of the EGG waveform over one vibratory cycle is indicated by coronal sections of the vocal folds in Figure 2. The time interval between the labels 1 and 2 (t1,2) is the closing phase. The closed phase (t2,4), the opening phase (t4,5), and the open phase (t5,1) are similarly defined. Notice the fast rise in the signal intensity during the closing phase of the glottal period. The opening movement of the folds in healthy larynges is slower than the closing movement. The increase in vocal contact leads to a plateau (closed phase) in the waveform. This is also reflected in the slower opening phase of the EGG signal.

In the open phase, the wave shape is almost flat, and the slow rise in the electroglottogram may be interpreted as a rise in the translaryngeal admittance due to an increased capacitance associated with the approximation of the vocal folds. However, as Baken (1987, p. 222) pointed out, “the open phase [of the EGG signal] should be interpreted with extreme caution.” It has been demonstrated by Fant and colleagues (1966) and by Rothenberg and Mahshie (1988) that the automatic gain control has a high pass characteristic that may distort the EGG open phase. It is also likely that other laryngeal movements may be significant factors during the open phase, considering that the variation in the amplitude of the carrier due to vocal fold contact is only 0.1-0.5% (Rothenberg, 1992). There is a consensus that the amplitude of the (high pass filtered) EGG signal mostly represents the dynamics of the vocal fold contact area (Bless et al., 1992).
A zero voltage in the output of the instrument indicates, thus, that the contact area is not changing. Fully adducted vocal folds (e.g., in an effort closure gesture) would also cause zero output level in the EGG signal.

The relationships between EGG signals and vocal contact area have been studied theoretically by Childers and co-workers (1986) and by Titze (1984, 1989b). Some considerations on the study by Childers and colleagues will be given in this chapter. The study by Titze (1989b) describes a modification of his earlier vocal fold contact model (Titze, 1984), allowing the control of the glottal shape by means of four parameters related to pre-phonatory (static) and phonatory (dynamic) states of the folds. The parameters were (1) an “abduction” quotient defining a pre-phonatory glottal gap; (2) a “shape” quotient establishing a pre-phonatory difference between the displacements of the inferior and superior edges of the folds; (3) a “bulging” quotient representing the amount of pre-phonatory medial bulging; and finally (4) a “phase” quotient, related to the (dynamic) vertical phase delay, expressed as the ratio of the glottal depth to the mucosal-wave wavelength. The vocal folds were modeled as a stretched ribbon fixed at the extremities but allowed to bend and flex in the vertical dimension, accounting thus for the vertical phase difference. Realistic EGG-like waveforms, representing vocal fold contact area, were simulated. The author emphasized (p. 199) that the asymmetry in the opening and closing phases of the signal “is a combination of convergence and phase delay. Neither parameter alone can cause it.” He also pointed out that the “knee” (see Figure 2) in the contact waveform may be caused by increased medial surface bulging.

Experimental investigations using excised canine larynges (Scherer, Druker & Titze, 1988) partially confirmed that the relationship between EGG signal and vocal contact area is linear. This linearity has also been suggested by a simple experimental simulation that revealed an almost uniform density of electric field lines between the electrodes (Titze, 1990). In this study, the electrodes were positioned in a tank, an electrolyte solution simulated the neck conductivity, and a non-conducting acrylic wedge simulated an open glottis. As observed by the authors of these studies, though, the whole neck structure had not been realistically simulated in the experiments and the assumption of a linear relationship between EGG signal and vocal contact needs further evidence.

Another biomechanical correlate of the EGG signal appears to be the longitudinal length of vocal contact, which has been shown to be proportional to the signal intensity (Krishnamurthy, 1983). This relationship, not valid for the open phase of the normal larynges studied, has been demonstrated by using synchronized EGG signals and high speed laryngeal films.

Figure 3: Baseline fluctuation in the electroglottogram (EGG vs. time). Notice also the clipping in some of the initial cycles.


It is widely agreed (Bless et al., 1992) that the EGG signal may reflect not only vocal contact but also other movements in the neck. These movements are assumed to be “slow” compared to the fundamental frequency and may cause a fluctuation in the EGG baseline (Figure 3), depending on the lower cutoff frequency of the instrument. As summarized by Baken (1987) and Colton and Conture (1990), possible sources of baseline drift include (1) vertical movements of the larynx; (2) movements of the head; (3) movements of other neck structures (e.g., pharyngeal walls, ventricular folds, base of the tongue); and (4) movement of the electrodes in the neck. Colton and Conture have also pointed out that the device can pick up noise from the electrical power system (50/60 Hz and harmonics), which can fall in the F0 range.

Having presented the principles of electroglottography, the typical waveform in modal voice, and some of the problems associated with the technique, the next section will address the effects of phonatory settings and voice disorders on the EGG signal.

EGG signal, phonatory settings, and voice disorders. For the purpose of comparison, a normal EGG waveform is presented in Figure 4a, while Figure 4b shows an abnormal waveform recorded from an adult male suffering from mutational dysphonia. In such cases, the voice is characterized by a high F0, typically in the female range. This patient, whose larynx failed to develop completely during puberty, had a speaking F0 of approximately 212 Hz.

There are two main aspects to be pointed out in Figure 4b. First, the patient spoke in falsetto and this is partially indicated by a sinusoidal-like EGG waveform in most of the cycles. In falsetto, the vocal folds may not touch and the capacitive effect of the approximation seems to be picked up by the electrolaryngograph. The lack of a complete glottal closure during phonation can be verified in Figure 4d, where the maximum closure in a stroboscopic sequence is presented. A longitudinal chink is clearly seen and the arytenoids, especially in the patient’s right side, seem to be forcing the folds in the posterior direction, compensating the pull of the cricothyroid (CT) muscle.

The second relevant aspect in Figure 4b is the occurrence of “spikes” in some glottal cycles. The spikes appear to have been caused by short vocal fold contacts, increasing the level of airflow modulation. This hypothesis is supported by Figure 4c, where the simultaneously recorded acoustic signal (dotted lines) exhibits increased amplitude in response to these apparent contacts. A possibility that cannot be ruled out, but seems unlikely here, is that the spikes would indicate a drop in the impedance due to mucous bridges between the folds, as was described by Krishnamurthy (1983) and Childers and co-workers (1986).

Another case of reduced medial contact, less evident than the previous example, is presented in Figure 5a. Notice the shortening of the closed phase and the long opening phase. This electroglottogram was taken from a patient suffering from a (recovering) paralysis of the left recurrent nerve, with possible limitation in the action of the (left) lateral cricoarytenoid (LCA) muscle. This was suggested by endoscopic images that indicated a difference in the vertical level of the folds during adduction. A frame of glottal closure is shown in Figure 5b, showing that the vocal process in the patient’s left side is resting on top of the contralateral vocal process. This change in vertical position, which may not be easily seen in videoendoscopic images, was responsible for the reduction in vocal contact. Reduced vocal contact (or no contact at all) is also common in cases of vocal fold bowing (due, e.g., to vocal fatigue or ageing) and in other degrees of paralysis of the recurrent nerve.

Figure 4: EGG signal and falsetto. (a) Electroglottogram in modal voice (/a/ vowel), where “ms/div” stands for milliseconds per horizontal (time) division; (b) electroglottogram in falsetto (/a/ vowel) from a patient with mutational dysphonia; (c) EGG signal (solid line) and acoustic signal (dotted line) from the same patient; (d) stroboscopic frame showing the maximum glottal closure; notice the longitudinal chink. Time scales: (a) 2 ms/div; (b) and (c) 10 ms/div.

Figure 5: Reduced medial contact. (a) Electroglottogram (/a/ vowel); (b) maximum glottal closure. Note the superior level of the patient’s left vocal process (arrow) caused, possibly, by a paralysis of the right superior laryngeal nerve. Time scale in (a): 2 ms/div.

Data in Figure 6a and Figure 6b refer to a case of nodules. The electroglottogram has a “notch” in the opening phase, as commonly seen in lesions of this sort (Neil, Wechsler & Robinson, 1977; Childers et al., 1986). With caution it can be said that the more a lesion (such as a nodule, polyp, or cyst) is located at the superior part of the folds and away from the glottal midline, the less it will interfere with vocal contact and, therefore, the smaller will be the EGG notch. Childers and co-workers (1986) simulated the effect of nodules (and other factors) on the EGG. Their vocal contact model considered (1) the vertical phase difference; (2) the angle in the transverse plane at which the vocal folds close and separate; and (3) a phase difference, observed in high speed films, along the length of the folds. This longitudinal phase difference was described as a “zipperlike” movement that closes the glottis in the anterior-posterior direction and opens it in the opposite direction. Simulations indicated that the larger the lesion, the larger the notch (Figure 6c), as might be expected, and that the closer the simulated nodule was to the anterior commissure, the later the notch appeared at the opening phase (Figure 6d). While this study provides interesting guidelines, it remains to be validated with real (i.e., non-simulated) data. One factor that may complicate this analysis is the possible existence of mucous bridges (Figure 6b), since they continue providing a path of high electrical conductance despite the glottal opening and may distort the EGG (Krishnamurthy, 1983; Childers et al., 1986). Finally, the irregularities (indicated by arrows in Figure 6c) in the beginning of the closing phase of the electroglottograms simulated by Childers and co-workers (1986) were also present in their simulations of non-dysphonic larynges. Such irregularities apparently indicate an imprecision of their vocal contact model. They have been seen in the data

used for the present study only in connection with certain lesions, as will be discussed later. Figure 7 shows the effects of a Reinke’s oedema on the EGG signal. This disorder is characterized by a marked increase in the mass of the folds due to oedematous fluid being accumulated into the Reinke’s space. This increases the mucosal wave, which becomes irregular, as suggested by the electroglottogram in Figure 7a. This figure also shows the corresponding acoustic waveform (dotted lines), where marked cycle-to-cycle amplitude changes (shimmer) are clear. Excessive shimmer (and possibly jitter), combined with a drop in F0, leads to a characteristic deep and harsh voice. Data in Figure 7 refer to a female patient whose F0 was approximately 155 Hz. Irregular EGG waveforms of this type have also been observed in patients with large polyps and, sometimes, papillomas.

Figure 6: Notches in the EGG signal. (a) Electroglottogram (/a/ vowel) from a patient with nodules, seen in (b); notice the mucous bridge between the folds (arrow). (c) Simulated effect of the size of a nodule on the EGG, where the contact area increases from 0.5% to 5%. The arrows indicate abnormalities caused apparently by imperfections in the model. (d) Simulated effect of the position of a nodule (modeled as an increase of 5% in the contact area). The position of the nodule, represented as a fraction of the fold’s longitudinal length L, is measured from the vocal process. (Figures “c” and “d” were adapted from Childers et al., 1986.) Time scale in (a): 2 ms/div.

Figure 7: Effects of a Reinke’s oedema on the EGG signal. (a) Irregular electroglottogram (solid line) and acoustic signal (dotted lines) in the /a/ vowel. (b) Videoendoscopic image. Time scale in (a): 2 ms/div.

Figure 8: Effects of a polyp located at the anterior part of the fold. (a) Electroglottogram (solid lines) and acoustic signal (dotted lines) in a low-pitched voice (≈ 120 Hz, /a/ vowel). (b) The same, in a high-pitched (and creaky) voice. It was difficult to estimate F0 in this case. (c) Videoendoscopic image. This picture has a poorer quality, but the polyp can still be identified. Time scale in (a) and (b): 5 ms/div.

The electroglottograms and the endoscopic image shown in Figure 8 were taken from a male patient suffering from a polyp located at the anterior-inferior part of the right vocal fold (Figure 8c). Videoendoscopic images were misty but the lesion can be seen. The electroglottogram in Figure 8a shows that the polyp affected the EGG signal in almost all phases of the glottal cycle because it apparently interfered with the vocal contact both vertically and longitudinally. Notice the hump in the open phase (detail 1 in Figure 8a) and the abnormality in the beginning of the

closing phase (detail 2 in Figure 8a). Small abnormalities in the beginning of the EGG closing phase can be important indications of small subglottal lesions that may not be easily detected in videostroboscopic examinations. The electroglottogram in Figure 8b was taken when the patient was sustaining a high-pitched vowel, which also sounded creaky. The use of high-pitched vowels can be particularly useful to detect minor lesions in the edge of the folds. A prominent aspect of the electroglottogram in Figure 8b is the occurrence of contacts (indicated by the numbers 1-4) that were irregular in amplitude and time, causing marked excitations in the oral tract, as seen in the acoustic signal (dotted lines). A more detailed observation of the acoustic and EGG waveforms between these major contacts (e.g., between contacts 3 and 4 in Figure 8b) suggests that the folds were somehow vibrating at a higher frequency. It appears that only a short length of the folds, most likely the mid third, was allowed to vibrate in this higher frequency. The folds did not touch each other and the EGG oscillations seem to reflect a capacitive coupling. In this section, a number of cases were discussed to illustrate general relationships between the EGG signal, vocal fold vibration, vocal settings, and lesions. The analysis presented here should be seen as ad hoc interpretations and caution should be taken when extending these observations to other cases. Methods for objective characterization of EGG waveforms will be presented in later sections of the chapter, after the forthcoming description of the recommended data collection procedures and signal pre-conditioning methods.

3. EGG signal acquisition

Equipment. EGG waveforms presented in this chapter were provided by a Portable Electro-Laryngograph (Laryngograph Ltd., London). This instrument is battery-operated, which gives a certain immunity to electrical noise. The signal is applied by means of two electrodes held on the surface of the patient’s neck, at the level of the thyroid cartilage, by elastic straps. The device uses a carrier frequency of 4 MHz and has an automatic gain control and a manual adjustment of the output signal level. The signal bandwidth is limited from 10 Hz to 5,000 Hz, subject to a pass band fluctuation of ± 0.5 dB.

Stimuli. To focus on vocal fold behavior and reduce coupling effects with the upper vocal tract, sustained vowels were used as stimuli. Although the shape of EGG signals is expected to be vowel-independent, electroglottographic perturbations may vary across vowels. Only /a/ vowels were included because movements of the tongue and epiglottis (especially in /i/ and /u/) can connect to the larynx through the aryepiglottic fold and the hyoid bone (Rossi & Autesserre, 1981), affecting the signal. Recall that the tongue is pulled forward in /i/ and retracted in /u/, but left near its neutral position (i.e., the schwa) in /a/ vowels.

Recording procedures. The EGG recording set-up used in this study is shown schematically in Figure 9. This diagram also presents details concerning the acquisition of the simultaneously recorded acoustic signal. The patients were seated in a soundproof booth (Amplivox-Burgess Audiometer Booth), which is not necessary for EGG recordings but provides good-quality audio signals. During the recording session, communication with patients occurred through an intercom (not indicated in Figure 9).

Figure 9: EGG and acoustic recording set-up. The output of the instrument (Portable Electro-Laryngograph) and the acoustic signals were recorded simultaneously with a Sony 55 ES DAT recorder. Oscilloscopes were used to monitor the signals during the recordings. The patient was seated in an isolated booth (Amplivox-Burgess Audiometer Booth), the acoustic signal being picked up by a Shure SM10A microphone (mic.) and amplified to line level by a Shure SM11 pre-amplifier (pre-amp.). Data was recorded at the Voice Clinic of the Royal Infirmary of Edinburgh (Vieira et al., 1995).

Recordings were taken when the patients were instructed to take a deep breath and sustain their voice as long as possible, in comfortable levels of pitch and loudness. As an initial procedure, patients were asked to take off metallic necklaces, because it was observed that they introduced significant noise in the signal. An adequate electrode position for each

individual was determined after some trial. This was done by asking the patient to hum or sustain a vowel while the electrodes were adjusted on the neck and the electroglottogram was simultaneously monitored in one of the oscilloscopes. In a satisfactory electrode position, the EGG signal amplitude was maximized and no artifactual waveform distortion was apparent. The instrument’s output level control was also adjusted during this initial procedure. The electrodes were then kept in position with an elastic strap. During the recordings, the EGG wave shape was continuously monitored and other procedures included:

1. Asking the patients to press the electrodes to the neck with their fingers, in case of weak signals. Notice that the fingers represent a fixed conductance path that is not reflected in the high pass filtered signal. This procedure has been suggested by Colton and Conture (1990);
2. Asking the patients to cough, when abnormalities appeared in the shape of the electroglottogram during the course of recordings. When the abnormality was being caused by mucus, coughing often cleaned up the vocal folds and the abnormality disappeared. Otherwise, the position of the electrodes was checked again;
3. Asking the patient to produce a louder voice, in case of a “weak” signal. In voice disorders with poor vocal contact (e.g., paralysis, bowing) or caused by inadequate breath support (e.g., asthma), this procedure sometimes led to an improved contact for at least the initial 2 or 3 seconds of the utterance.

The EGG signal was fed into one of the channels (line input) of a Sony 55 ES digital audio tape recorder (DAT), being digitized at 48,000 samples per second, 16 bits per sample. The RMS (root mean squared) recording level, as read in the DAT display, was kept between -20 and -12 dB (ref: full scale) to reduce clipping caused by baseline fluctuations. In sustained vowels, provided that there was no vibrato or vocal tremor, this level could be raised to approximately -6 dB (ref: full scale). Subsequently, the recordings were played back and the analogue outputs of the DAT were redigitized with a Sound Blaster 16 audio card at 22,050 samples per second, 16 bits per sample. Although extra A/D (analogue-to-digital) and D/A conversions can degrade signal quality, a similar procedure has been considered equivalent to direct sampling by computers in a comparative assessment of data acquisition methods (Doherty & Shipp, 1988).

A total of 222 recordings (140 female speakers, 82 male speakers) were available for analysis, as shown in the column named “original” in Table 1. The “selected” column criteria will be discussed further. Caution should be taken when using EGG signals for automatic analysis: while recordings characterized by weak, noisy, or extremely irregular waveforms may still be interpreted subjectively, they should be disqualified for objective measurements. Considerations on the selection of EGG recordings for automatic analysis are given next.

4. EGG signal quality assessment

As summarized by Watson (1995, p. 133), four basic factors can deteriorate EGG signals:

“1. Poor vocal function
2. Neck conduction paths which contain little vocal fold contact […]
3. Inaccurate placement of electrodes
4. Poor signal processing technique (phase errors, poor low frequency […] response, inadequate signal-to-noise ratio, signal clipping, etc.).”

Before relying on objective measures, it is of primary importance to verify that the signal has not been affected by the problems listed above. The inadvertent use of objective measures from noisy or highly irregular electroglottograms is certainly the major risk of EGG techniques. Watson (1995) proposed a criterion based on a measure (NNE; see footnote 3) of the noise-to-signal ratio of the signal: he found that signals with an NNE value above -15 dB would meet his subjective criteria for accepting or rejecting other measures from an EGG recording. His data consisted of 1-second long extracts from a longer sustained /a/ vowel. While Watson’s (1995) method is a useful reference, estimates of the noise-to-signal ratio increase with increasing levels of jitter and shimmer, even in the absence of high-frequency noise (Hillenbrand, 1987). This means that signals with no problem other than, say, a high level of shimmer, may be rejected.

3 NNE stands for “Normalised Noise Energy,” a measure of noise-to-signal ratio proposed by Kasuya and co-workers (1986, 1986a,b).

Table 1: Incidence of voice disorders in the corpus. Overview of the pathologies in the original and selected group of EGG recordings. M = male, F = Female, MTD = muscular tension dysphonia, NAD = no abnormality detected, Rec. = recurrent laryngeal nerve, L = left, R = right.

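| Disorder | Original F | Original M | Original F+M | Selected F | Selected M | Selected F+M |
|---|---|---|---|---|---|---|
| Acid laryngitis | 2 | 5 | 7 | 2 | 4 | 6 |
| Carcinoma | 1 | 3 | 4 | 1 | 2 | 3 |
| Cysts | 8 | 1 | 9 | 6 | 1 | 7 |
| Granuloma | 0 | 1 | 1 | 0 | 1 | 1 |
| Leukoplakia | 0 | 2 | 2 | 0 | 2 | 2 |
| Mutational dysphonia | 0 | 2 | 2 | 0 | 2 | 2 |
| Myopathy | 2 | 7 | 9 | 1 | 4 | 5 |
| NAD | 15 | 9 | 24 | 13 | 8 | 21 |
| Nodules | 23 | 2 | 25 | 11 | 2 | 13 |
| Papillomas | 3 | 4 | 7 | 1 | 3 | 4 |
| Polyp | 2 | 2 | 4 | 1 | 2 | 3 |
| Presbyphonia | 3 | 5 | 8 | 1 | 3 | 4 |
| Psychogenic/MTD | 36 | 12 | 48 | 15 | 8 | 23 |
| Rec. paralysis (L) | 10 | 10 | 20 | 3 | 5 | 8 |
| Rec. paralysis (R) | 1 | 2 | 3 | 0 | 1 | 1 |
| Rec. paralysis (R+L) | 1 | 0 | 1 | 0 | 0 | 0 |
| Reinke’s oedema | 3 | 0 | 3 | 2 | 0 | 2 |
| Selective paralysis | 3 | 1 | 4 | 3 | 1 | 4 |
| Sulcus vocalis | 3 | 1 | 4 | 2 | 1 | 3 |
| Surgical scar | 5 | 4 | 9 | 3 | 4 | 7 |
| Vocal abuse/misuse | 9 | 1 | 10 | 6 | 1 | 7 |
| Web | 1 | 2 | 3 | 0 | 1 | 1 |
| Other | 9 | 6 | 15 | 6 | 4 | 10 |
| TOTAL | 140 | 82 | 222 | 77 | 60 | 137 |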

To reduce the chances of using inappropriate signals for objective measures and to accept as many recordings as possible, a semi-automatic method was devised for evaluating signal quality. Firstly, some cycles were visually inspected at the initial, mid, and final portions of each recording, and a subjective score was given to the signal; that is, “unusable” = 1, “poor” = 2, “fair” = 3, and “good” = 4 (Figure 10). Signals of type 1 suggested no vocal contact at all. “Poor” signals had excessive noise and a “weak” amplitude; that is, with a peak-to-peak amplitude below 14,000 units (< -13 dB ref: full scale), this value being larger than 28,000 (> -7 dB ref: full scale) units in “strong” signals. Recordings presenting abnormalities that could lead to meaningless measures were considered “inadequate” and included in the type 2 group. Signals of type

3 had “strong” amplitudes but visible noise. Finally, signals of type 4 presented no problems during the inspection. Figure 11 shows the proportions of each type of signal in the 222 cases. In the next step of the evaluation procedure, the recordings were analyzed by an F0 detection algorithm (Vieira, McInnes & Jack, 1996) to obtain the percentage of voiced intervals. Recordings were accepted for further analysis when (1) the subjective score was 3 or 4; and (2) the percentage of detected voiced intervals was larger than 75%. Although true unvoiced (or silent) intervals may have occurred during the sustained vowels, most of the estimated unvoiced intervals represented voiced-to-unvoiced errors; that is, glottal cycles not detected by the F0 tracking algorithm. The 75% threshold is somewhat arbitrary, but appeared appropriate to reject inadequate recordings not detected during the visual inspection. Excessive unvoiced intervals were interpreted as an indication of anomalies that could lead to unreliable measures. Approximately 90% of the 222 recordings had unvoiced intervals below 25%. Recordings from 137 patients, representing 62% (137/222) of the initial cases, were accepted by this quality control scheme. The voicing-detection criterion eliminated 20 of the 157 signals of type 3 or 4. The average number of cycles detected in the 137 “approved” signals was 2,032 (range: 265-6,498).
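The two-step acceptance rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the "intermediate" amplitude label, and the use of 65,536 units as the full-scale peak-to-peak span of a 16-bit recording are hypothetical and are not taken from the original implementation.

```python
import math

# Assumption: full-scale peak-to-peak span of a 16-bit sample, in A/D units.
FULL_SCALE_PP = 65536.0


def pp_level_db(pp_units):
    """Peak-to-peak amplitude expressed in dB relative to full scale."""
    return 20.0 * math.log10(pp_units / FULL_SCALE_PP)


def amplitude_class(pp_units):
    """Label the amplitude with the thresholds quoted in the text.

    'weak' is below 14,000 units and 'strong' is above 28,000 units;
    the 'intermediate' label is a hypothetical name for the gap between.
    """
    if pp_units < 14000:
        return "weak"
    if pp_units > 28000:
        return "strong"
    return "intermediate"


def accept_recording(subjective_score, voiced_percent):
    """Accept when the visual score is 3 or 4 and voicing exceeds 75%."""
    return subjective_score >= 3 and voiced_percent > 75.0


if __name__ == "__main__":
    print(round(pp_level_db(14000), 1), round(pp_level_db(28000), 1))
    print(accept_recording(3, 80.0), accept_recording(4, 60.0))
```

Converting the two amplitude thresholds with `pp_level_db` gives values close to the -13 dB and -7 dB figures quoted for “weak” and “strong” signals.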

Figure 10: Quality control of electroglottograms. (a) Type 1 (unusable) signal. Vertical scale in analogue-to-digital units. (b) Type 2 (poor) signal. (c) Type 2 (irregular) signal; notice the irregularities in the waveform that led to a high incidence of unvoiced intervals. (d) Type 3 (fair) signal. (e) Type 4 (good) signal; notice the systematic shimmer. (f) Type 4 signal that also resulted in excessive unvoiced intervals; in this case, more than one zero crossing occurred between consecutive negative and positive peaks. Time scales: 5 ms/div in all panels except (d), which uses 10 ms/div.

Figure 11: EGG quality. Proportion of signal types according to subjective criteria: type 1 (unusable), 9%; type 2 (poor/inadequate), 20%; type 3 (fair), 49%; type 4 (good), 22%. Data from 222 dysphonic speakers. See complementary details in text and in Figure 10.

The last row in Table 1 shows that 45% (63/140) of signals from female speakers were rejected, this proportion being 27% (22/82) for male speakers. The degradation in EGG signals from females relates not only to anatomical differences (i.e., smaller vocal folds and larger angle of the thyroid cartilage; details in Titze, 1989a) but also to the increased incidence of psychogenic and muscular tension dysphonias in women. These disorders usually present a “hypofunctional” state characterized by poor respiratory support, longitudinal chinks, and vocal fold bowing (e.g., Stemple, 1993, pp. 76-99), resulting in less vocal contact. Notice also the high rejection of signals in cases of paralysis, as would be expected. The forthcoming section presents the numerical results of the analysis of 137 signals from dysphonic speakers. Statistical distributions will be obtained and ranges of severity will be suggested for each objective measure.

5. EGG Baseline Fluctuation

The lower cutoff frequency (10 Hz) of the EGG instrument is inadequate to remove baseline fluctuations, which are caused mainly by vertical movements of the larynx. Such fluctuations should be attenuated to avoid errors in automatic measurements. Band pass filtering (60-5,000 Hz) was used to remove possible low-frequency baseline drift, 50 Hz hum, as well as noise from the 5,000 Hz upper frequency of the Electro-Laryngograph up to the Nyquist frequency (i.e., 11,025 Hz). The lower cutoff frequency (-3 dB) was arbitrarily fixed at 60 Hz, a value that is above the frequency of the British electrical power system (50 Hz) and adequate for deep male voices. Since relevant information is carried in the shape of the electroglottogram, filter-induced distortions must be avoided. These distortions can result, for example, from a non-linear phase response, a poor transient response, or ripple in the filter pass band (Kuo, 1966). A possible way of avoiding phase distortions is the use of FIR (Finite Impulse Response) filters with exact linear phase. A well-known property of FIR filters (Oppenheim & Schafer, 1975, pp. 237-239) shows that the phase response of a filter with N coefficients h(n) is linear when the coefficients are symmetric in relation to the centre of the impulse response; that is, h(n) = h(N-n-1), 0 ≤ n ≤ N-1. This approach has been applied to EGG signals (Krishnamurthy, 1983; Eady et al., 1992), but it has certain disadvantages. In particular, (1) there is ripple in the pass band; (2) the order of adequate FIR filters is relatively high; and (3) there is a long time shift in the filtered signal, although the

correction of this time delay to keep the synchrony between acoustic and EGG signals would be simple.
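The symmetry condition and the delay correction mentioned above can be illustrated with a short sketch. It assumes a symmetric filter with an odd number N of coefficients, for which the delay is (N-1)/2 samples; the three-tap coefficients and all function names are hypothetical and serve only to show the mechanism, not the filters used in the cited studies.

```python
def is_linear_phase(h, tol=1e-12):
    """Check the symmetry h(n) = h(N-n-1) for all 0 <= n <= N-1."""
    n_taps = len(h)
    return all(abs(h[i] - h[n_taps - 1 - i]) <= tol for i in range(n_taps))


def fir_filter(h, x):
    """Causal FIR filtering, y[n] = sum_k h[k] * x[n-k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def compensate_delay(y, n_taps):
    """Drop the first (N-1)/2 samples so the output realigns with the input."""
    delay = (n_taps - 1) // 2
    return y[delay:]


if __name__ == "__main__":
    h = [0.25, 0.5, 0.25]          # hypothetical symmetric coefficients
    x = [0.0] * 5 + [1.0] + [0.0] * 7   # unit impulse at sample 5
    y = fir_filter(h, x)
    aligned = compensate_delay(y, len(h))
    print(is_linear_phase(h), y.index(max(y)), aligned.index(max(aligned)))
```

Once the shift is removed, events in the filtered EGG line up again with the simultaneously recorded acoustic signal, which is the simple correction the text refers to.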

[Block diagram: x(n) → H(z) → time reversal (-n) → H(z) → time reversal (-n) → y(n). Corresponding frequency-domain steps: X(z); X(z)·H(z); X(z⁻¹)·H(z⁻¹); X(z⁻¹)·|H(z)|²; X(z)·|H(z)|².]

Figure 12: Recursive filtering with zero phase shift. The right hand side indicates the steps in the frequency domain. The input sequence x(n) is filtered by the generic filter H(z), where z = exp(jω) and ω is the normalized angular frequency (ω = 2πf/FS, where FS is the sampling frequency and 0 ≤ | f | < FS/2). The output is time-reversed by the blocks “-n” and the procedure is repeated, resulting in an output y(n) that is independent of the phase response of the filter H(z).

An alternative approach to remove the baseline drift, based on recursive filters, was adopted here. The method, proposed by Kormylo and Jain (1974), is shown in Figure 12 and involves two consecutive filtering-and-time-inversion procedures to achieve zero phase-shift and zero time-delay. As shown in this figure, the output signal y(n) does not depend on the phase response of the filter H(z) because the distortion introduced in the first pass is exactly compensated during the second pass, which processes the file in the reverse time direction. The procedure is obviously noncausal and is not suitable for real time processing. However, because the order of a recursive filter is significantly smaller than the order of an equivalent FIR filter (Rabiner et al., 1974) and the time inversion of a file is a simple multiplication-free operation, the two-pass filtering offers a reduction in the processing time (compared with FIR filtering), at least on ordinary personal microcomputers. The two-pass filtering method can also be implemented in “quasi real time” by processing small blocks of data

with an overlap between consecutive blocks to compensate for transient effects (Czarnach, 1982). In the implemented version of the method, though, each file was processed as a single block, restricting the transients to short segments (≈ 30-ms long) at the beginning and end of the recordings. An example of such filtering is given in Figure 13. Further technical details on the influence of baseline fluctuations on EGG signals are available in Vieira, McInnes and Jack (1996).

On EGG F0 estimation. The automatic and precise determination of individual glottal cycles is an important pre-requisite for the extraction of features related to the wave shape. Since most relevant information carried by the EGG signal is encoded in the wave shape, time-domain analysis seems more appropriate than analysis in other transformed domains. Provided that the recorded signals are adequate (i.e., not affected by the artifacts mentioned before), a precise delimitation of glottal cycles may be carried out automatically, permitting, in a second step, the estimation of parameters related to morphological features of the signals.

Figure 13: Removal of baseline drift. Filtered signal (solid line) after removal of baseline drift from the original electroglottogram (dotted line). Time scale: 10 ms/div.
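The two-pass scheme of Figure 12 can be sketched as follows. This is a minimal illustration, not the filter used in the study: the recursive filter H(z) is replaced here by a hypothetical first-order high-pass section (the study used a 60-5,000 Hz band pass whose design is not given in the text), and the function names are assumptions.

```python
import math


def highpass_first_order(x, cutoff_hz, fs_hz):
    """One-pole recursive high-pass; a stand-in for the generic H(z)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    alpha = rc / (rc + dt)
    y = [0.0] * len(x)
    for n in range(1, len(x)):
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1])
    return y


def zero_phase_highpass(x, cutoff_hz, fs_hz):
    """Filter, reverse in time, filter again, reverse again (Figure 12)."""
    forward = highpass_first_order(x, cutoff_hz, fs_hz)
    backward = highpass_first_order(forward[::-1], cutoff_hz, fs_hz)
    return backward[::-1]


if __name__ == "__main__":
    fs = 22050.0
    # Synthetic test signal: a 200 Hz tone riding on a slow 2 Hz drift.
    x = [math.sin(2 * math.pi * 200.0 * n / fs)
         + 2.0 * math.sin(2 * math.pi * 2.0 * n / fs)
         for n in range(4410)]
    y = zero_phase_highpass(x, 60.0, fs)
    print(max(abs(v) for v in y[1000:3000]))
```

Because the second pass undoes the phase shift of the first, the overall magnitude response is |H(z)|² and events such as zero crossings stay where they were, apart from short transients at both ends of the file.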

EGG signals have been used extensively for precise F0 determination (Hess & Indefrey, 1987; Orlikoff & Baken, 1989; Schoentgen & Guchteneere, 1991; Fourcin et al., 1995). Period boundaries can be delimited, for example, by (1) peaks in the waveforms, (2) peaks in the differentiated EGG signal (DEGG), or (3) by level crossings. EGG peaks are the least recommended strategy because they are relatively broad, being, therefore, less precisely identified. The other techniques mentioned above exploit the closing phase, which is the fastest EGG event, being, therefore, less affected by artifacts. Differentiation leads, desirably, to a single DEGG positive peak. Because high pass filtering is also a by-product of differentiation, there is no need for removing the EGG baseline drift beforehand. An F0 tracking algorithm based on DEGG peak-picking has been described by Hess and

Indefrey (1987), who used a method for interpolating the peaks to achieve accuracy greater than 0.5% in F0 estimation from non-dysphonic speakers. This method is attractive but has limitations in dysphonic voices. As pointed out by Childers and colleagues (1990), even the DEGG from a non-dysphonic speaker may present noisy or multiple peaks. In particular, any inflection in the closing phase of the EGG may result in undesirable extra peaks in the DEGG, as shown in Figure 14. These problems are obviously aggravated in the pathological case due to noisy signals or abnormal waveforms.

Figure 14: Gross errors in F0 estimation based on differentiated EGG signals (DEGG). The numbers in brackets are the F0 values associated with the indicated time interval. The amplitude of the DEGG was scaled to fit the figure. Time scale: 2 ms/div. After Vieira, McInnes and Jack (1996).

Up-going level crossing is another possible way of determining the boundaries of glottal periods. A simple approach would be the use of midlevel crossings, the crossing point being the mid-value between a negative peak and the following positive peak. This method also avoids a previous high pass filtering but may not be precise, since it relies on the use of peaks to define the crossing levels. Besides, the extraction of other waveform parameters (discussed in a subsequent section) would require baseline drift removal. The detection of zero crossings in a band pass filtered EGG appears thus to be a more adequate strategy for delimiting the glottal periods in EGG signals from dysphonic speakers. Assuming that the EGG signal has been band-pass filtered as described earlier, a relatively simple F0 tracking algorithm was implemented. Instantaneous F0 values were defined as the inverse of the elapsed time between consecutive linearly interpolated upgoing zero crossings. Fundamental frequency estimates were restricted to 50-500 Hz, a range assumed to be adequate for the speaking voices of the (adult) subjects used in this study. Details on the implemented F0 detection algorithm are given in Vieira, McInnes and Jack (1996). Examples of F0 contours extracted with the described algorithm are presented in Figure 15.
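The zero-crossing strategy just described can be sketched as below, assuming the EGG has already been band-pass filtered so that it oscillates around zero. The function names are hypothetical, and simply discarding estimates outside 50-500 Hz is an assumption about how the range restriction might be applied; the published algorithm (Vieira, McInnes and Jack, 1996) contains further details not reproduced here.

```python
import math


def upgoing_zero_crossings(x, fs_hz):
    """Times (in seconds) of up-going zero crossings, linearly interpolated."""
    times = []
    for n in range(1, len(x)):
        if x[n - 1] < 0.0 <= x[n]:
            # Fraction of a sample between x[n-1] and the zero level.
            frac = -x[n - 1] / (x[n] - x[n - 1])
            times.append((n - 1 + frac) / fs_hz)
    return times


def instantaneous_f0(x, fs_hz, f0_min=50.0, f0_max=500.0):
    """Inverse of the time between consecutive up-going zero crossings."""
    times = upgoing_zero_crossings(x, fs_hz)
    f0_values = []
    for t0, t1 in zip(times, times[1:]):
        f0 = 1.0 / (t1 - t0)
        if f0_min <= f0 <= f0_max:  # assumption: out-of-range values dropped
            f0_values.append(f0)
    return f0_values


if __name__ == "__main__":
    fs = 22050.0
    tone = [math.sin(2 * math.pi * 155.0 * n / fs + 0.3) for n in range(2205)]
    print(instantaneous_f0(tone, fs)[:3])
```

The interpolation step matters because, at 22,050 samples per second, a glottal period spans only a few dozen to a few hundred samples, so rounding each crossing to the nearest sample would add artificial jitter.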

Figure 15: Examples of F0 contours. Top parts are the corresponding electroglottograms. (a) F0 tracking at the end of phonation; the “ticks” in this F0 contour indicate zero crossings. (b) A phonatory glide.

Figure 16: Dissociations. (a) Occasional cycles detected in noisy EGG at the end of phonation (patient with unilateral paralysis). (b) Detected EGG signal without audio, due, possibly, to the mechanical vibration of a large polyp seen in panel (c). Figures “a” and “b” after Vieira, McInnes, and Jack (1996). Time scale in (a) and (b): 5 ms/div.

Simultaneous observations of EGG and acoustic signals. Comparison between acoustic and EGG recordings can show interesting situations (“dissociations”) already mentioned in the literature (Lebrun & Hasquin-Deleval, 1971; Löfqvist, Carlborg & Kitzing, 1982). They could be seen in cases where (1) a clear audio signal is detected while the EGG trace fades, and (2) vice-versa. The former is illustrated in Figure 16a, taken from a patient who suffered from unilateral paralysis. The

electroglottogram suggests a possible tenuous vocal fold approximation, with no contact, but close enough to modulate the airflow and be picked up capacitively. The latter is exemplified in Figure 16b, taken from a patient affected by a large and flaccid polyp (Figure 16c). Videostroboscopic images suggested that the polyp could be muffling the airflow. See also Vieira, McInnes and Jack (2002) for a comparison between acoustic and electroglottographic jitter measures. It is worth mentioning that there is an intrinsic delay of approximately 0.6 ms between acoustic and “simultaneous” EGG recordings because acoustic waves are slower than the electrical EGG signals. This delay corresponds roughly to the time required for acoustic waves (speed ≈ 340 m/s) to propagate from the larynx to the microphone, assuming a total distance of ≈ 21 cm (corresponding to a 17-cm long vocal tract [Flanagan, 1972; p. 24] and a 4-cm microphone-to-mouth distance).
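The delay estimate above is simple arithmetic and can be checked directly. The sketch below uses the distances and speed of sound quoted in the text; expressing the delay in samples at the 22,050 Hz redigitization rate is an added illustration, and the constant and function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 340.0   # approximate value quoted in the text
VOCAL_TRACT_M = 0.17         # 17-cm vocal tract
MIC_TO_MOUTH_M = 0.04        # 4-cm microphone-to-mouth distance
SAMPLE_RATE_HZ = 22050.0     # redigitization rate reported earlier


def acoustic_delay_s(distance_m, speed_m_s=SPEED_OF_SOUND_M_S):
    """Propagation time of sound over the given distance, in seconds."""
    return distance_m / speed_m_s


if __name__ == "__main__":
    delay = acoustic_delay_s(VOCAL_TRACT_M + MIC_TO_MOUTH_M)
    print(round(delay * 1000.0, 2), "ms")
    print(round(delay * SAMPLE_RATE_HZ, 1), "samples")
```

At this sampling rate the delay amounts to roughly 14 samples, which is worth removing before comparing the timing of EGG and acoustic events cycle by cycle.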

6. F0-synchronised EGG measures

The examination of EGG waveforms in foregoing sections indicated that some details about the vocal fold behavior may be obtained by subjectively interpreting the signals. Subjective interpretation depends, of course, on the expertise of the interpreter. On the other hand, expert knowledge can be represented in automatic algorithms, but the ability to focus on fine details is sacrificed for the sake of reliability and computational feasibility. Time-domain EGG measures are relatively simple to compute and interpret in electroglottograms from normal larynges (e.g., Baken, 1987; Lindsey & Howard, 1989; Orlikoff, 1991; Fourcin et al., 1995; Colton and Conture, 1990). To a much lesser extent, EGG objective measures have been applied to signals from dysphonic speakers, mostly to differentiate a dysphonic group from a non-dysphonic control group (Haji et al., 1986; Horiguchi et al., 1987; Childers, 1992; Hall, 1995). While successful differentiation has been reported in the cited papers, little attention has been paid to how specific pathologies (especially organic disorders) can affect objective measures. An exception is the study of Dejonckere and Lebacq (1985), who found that nodules can increase the ratio between the duration of the closed and open phases. This section describes algorithms implemented for the extraction of common EGG measures and their application in the analysis of recordings from dysphonic speakers. Since the automatic analysis of “abnormal” electroglottograms has been limited, it appears prudent to start with known measures. The study is aimed at finding (1) normative values for the

measures; (2) the measures’ numeric range in a relatively large dysphonic population; and (3) possible vocal disorders that may systematically affect the measurements.

6.1 Common EGG parameters

A method was implemented, combined with the F0 detection algorithm, to identify certain anchor points in each glottal period. These points, used to define the measurements, are shown in Figure 17, and are nine in all: the positive (P) and negative (N) peaks; the points corresponding to 10% (a, f), 25% (b, e), and 90% (c, d) of the peak-to-peak amplitude; and the up-going zero crossing (z). Points “a” and “c” indicate the beginning and end of the closing phase, while “d” and “f” delimit the opening phase. Points “b” and “e” will be justified in the forthcoming discussions. The parameterization in Figure 17 is similar to the method described by Marasek (1995), who used a smaller set of points (a, c, d, f) to obtain measures for classifying EGG signals according to different phonatory settings.

Figure 17: Parameterization of the EGG pulse. Amplitudes were normalized from 0% (negative peak, “N”) to 100% (positive peak, “P”). The crosses (“x”) indicate the zero crossings (z) found by the F0 detection algorithm.
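The anchor points of Figure 17 can be illustrated with a minimal sketch. It assumes that a single glottal cycle has already been isolated (for instance between two consecutive negative peaks) and that the level crossings are located by linear interpolation between samples; the function names and the interpolation step are illustrative assumptions, not the implementation used in the study.

```python
def _interp(x, i, level):
    """Fractional sample index where x crosses `level` between i and i + 1."""
    return i + (level - x[i]) / (x[i + 1] - x[i])


def anchor_points(cycle):
    """Locate points a, b, c (rising) and d, e, f (falling) in one EGG cycle.

    `cycle` is a sequence of samples covering one glottal period.
    Amplitudes are normalized from 0 (negative peak N) to 1 (positive peak P).
    Returned positions are in (fractional) sample units.
    """
    lo, hi = min(cycle), max(cycle)
    if hi == lo:
        raise ValueError("flat cycle: no peak-to-peak amplitude")
    norm = [(v - lo) / (hi - lo) for v in cycle]
    peak = norm.index(max(norm))
    points = {"P": float(peak)}

    # (rising-edge name, falling-edge name, normalized level)
    for up, down, level in (("a", "f", 0.10), ("b", "e", 0.25), ("c", "d", 0.90)):
        # Walk back from the peak to the first sample below the level.
        i = peak
        while i > 0 and norm[i - 1] >= level:
            i -= 1
        if i == 0:
            raise ValueError("rising edge does not cross level %.2f" % level)
        points[up] = _interp(norm, i - 1, level)

        # Walk forward from the peak to the first sample below the level.
        j = peak
        while j < len(norm) - 1 and norm[j + 1] >= level:
            j += 1
        if j == len(norm) - 1:
            raise ValueError("falling edge does not cross level %.2f" % level)
        points[down] = _interp(norm, j, level)
    return points


# Synthetic triangular pulse (hypothetical data): 10-sample rise, 40-sample fall.
triangle = [i / 10 for i in range(10)] + [1 - k / 40 for k in range(41)]
pts = anchor_points(triangle)
```

Searching outward from the positive peak, rather than scanning the whole cycle, keeps the sketch insensitive to small ripples near the baseline, although multiple or noisy peaks would still require additional handling.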

Normalized Closing and Opening phases. Although the duration of the opening phase is intuitively F0-dependent, it is not clear to what extent the duration of the closing phase depends on the fundamental period. With this uncertainty in mind, the fundamental period (T) was used to compute normalized closing (Cp) and opening (Op) phases (see Figure 18):

$$ C_p = \frac{A}{T} \times 100\%, \tag{1} $$

$$ O_p = \frac{C + D}{T} \times 100\%, \tag{2} $$

where A is the duration of the closing phase, C and D being the duration of the “first” and “second” parts of the opening phase, respectively. The division of the opening phase into such parts is not conventional and no attempt was made to separate C and D automatically, since a reliable separation can be difficult even in a manual analysis. A small closing phase (i.e., Cp ranging from ≈ 5-10%, as will be justified later) is an important feature in healthy voices. An alternative representation of the closing phase would be in terms of its slope; that is, the ratio between the amplitude and the duration of the closing phase. However, the use of slope measures is restricted because comparing measures of EGG amplitude across speakers is not meaningful. With this limitation in mind, Orlikoff (1995) showed that there was a strong intra-speaker correlation (mean r = 0.87, 10 subjects) between the slope of EGG closing phases and corresponding peak-to-peak acoustic values.

Figure 18. Closing and opening phases. “A” is the estimation of the closing phase, “C+D” representing the estimate of the combined first and second part of the opening phase.
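Equations 1 and 2 can be sketched as follows, assuming the anchor times a, c, d, and f of Figure 17 are already known for a cycle. The function name and the example times are hypothetical and only serve to illustrate the normalization by the fundamental period T.

```python
def normalized_phases(a, c, d, f, period):
    """Return (Cp, Op) in percent of the fundamental period.

    a, c: start and end of the closing phase (10% and 90% levels, rising edge).
    d, f: start and end of the opening phase (90% and 10% levels, falling edge).
    All arguments must share the same time unit (e.g., milliseconds).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    closing = c - a          # duration A in Equation 1
    opening = f - d          # duration C + D in Equation 2
    return 100.0 * closing / period, 100.0 * opening / period


# Hypothetical cycle at 125 Hz (T = 8 ms): 0.6 ms closing, 3.0 ms opening.
cp, op = normalized_phases(a=0.4, c=1.0, d=1.6, f=4.6, period=8.0)
```

With these illustrative numbers Cp is 7.5% and Op is 37.5%, that is, a closing phase inside the 5-10% interval mentioned above.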

Speed Index. This measurement combines the closing and opening phases and reflects the symmetry of the waveform (Baken, 1987, p. 210) [4]:

$$ SI = \frac{C_p - O_p}{C_p + O_p}, \tag{3} $$

[4] According to Baken’s (1987) definition, though, the closing phase would be the interval between points “a” and “P” (Figure 17), the opening phase being the interval between “P” and “f”. The modification adopted here was used to try to avoid problems with multiple or noisy peaks.

“Op” and “Cp” have been defined earlier in this chapter. Notice that speed indices range from -1 (i.e., a zero closing phase) to +1 (a zero opening phase) and that a zero speed index indicates identical closing and opening phases. As a limitation of this measure, it is not possible to say, for example, whether an SI of small magnitude comes from an increased closing phase or a reduced opening phase. Electroglottograms from healthy voices are expected to be asymmetrical with a shorter closing phase, providing negative speed indices of ≈ -0.65, at least for the definitions of Op and Cp adopted here.
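As a worked illustration of the speed index, take the hypothetical values Cp = 7.5% and Op = 37.5% (chosen for illustration only, not measured data):

$$ SI = \frac{7.5 - 37.5}{7.5 + 37.5} = \frac{-30}{45} \approx -0.67, $$

which is close to the value of about -0.65 expected for healthy voices. The limiting cases follow in the same way: Cp = 0 gives SI = -Op/Op = -1, Op = 0 gives SI = Cp/Cp = +1, and Cp = Op gives SI = 0.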

Figure 19. Contact quotient (CQ). “B” is the estimate of the duration of vocal fold contact according to the 25% level criterion. CQ = (B/T) × 100%.

Contact Quotient. Also known as “closed quotient,” this is possibly the most popular EGG measure, having been defined as

$$ CQ = \frac{B}{T} \times 100\%, \tag{4} $$

where B is the estimate of the duration of vocal contact (Figure 19). The criterion of 25% for the determination of this duration is identical to the value used by Orlikoff (1995), who said (p. 1067-1068), supported by other studies, that “[a 25% level] represented the lowest level that would ensure freedom from waveform artifacts.” Values of 30-35% have also been suggested elsewhere (e.g., Rothenberg & Mahshie, 1988; Lindsey & Howard, 1989).
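The effect of the level criterion on Equation 4 can be sketched as below. The sketch assumes a single isolated cycle and, for brevity, estimates B by counting the samples above the chosen level instead of interpolating the crossings, so the result is only accurate to about one sample; the function name and the synthetic pulse are hypothetical.

```python
def contact_quotient(cycle, level=0.25):
    """Approximate CQ (%) as the fraction of one period spent above `level`.

    `level` is expressed as a fraction of the peak-to-peak amplitude
    (0.25 corresponds to the 25% criterion adopted in the text).
    """
    lo, hi = min(cycle), max(cycle)
    if hi == lo:
        raise ValueError("flat cycle: no peak-to-peak amplitude")
    threshold = lo + level * (hi - lo)
    above = sum(1 for v in cycle if v > threshold)   # samples counted as contact
    return 100.0 * above / len(cycle)


# Synthetic triangular pulse (hypothetical data), one period of 50 samples.
pulse = [i / 10 for i in range(10)] + [1 - k / 40 for k in range(40)]
cq_25 = contact_quotient(pulse, level=0.25)
cq_35 = contact_quotient(pulse, level=0.35)
```

Raising the criterion from 25% to 35% necessarily shortens the estimated contact duration, so CQ values obtained with different criteria are not directly comparable.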


Contact quotients are expected to fall around 50% in non-dysphonic speakers using modal voice and “soft” intensity levels, rising to approximately 60% for “loud” voices (Orlikoff, 1995). Acoustic power is therefore related to increased CQ values and, as said before, to reduced closing phases. High CQ values (i.e., ≈ 60-70%) over a wide F0 range have been shown to be a peculiarity of trained singers (Lindsey & Howard, 1989). In singers with little or no training, CQ values are smaller (40-50%) and may even drop with rising F0 values. Lindsey and Howard (1989) also showed that CQ values in the “singing voice” of trained singers are higher than CQ values in their modal “speaking voices.”

Cycle-to-cycle perturbations. These measures usually refer to jitter (i.e., the amount of cycle-to-cycle period perturbations), and shimmer (similarly, the amount of cycle-to-cycle amplitude perturbations). Expressions for jitter and shimmer are similar and can be found in studies of acoustic voice signals, as briefly reviewed below. The reader is referred to a detailed survey on perturbation measures provided by Hiller (1985).

The measurement of jitter was originally introduced by Lieberman (1961) to study the influence of emotional factors on the fundamental frequency. Later (Lieberman, 1963), he applied his ideas of vocal perturbation to study laryngeal disorders. In his 1963 paper, he defined the “Perturbation Factor” as the percentage of cycle-to-cycle variations with absolute values ≥ 0.5 ms, and suggested that this measure could be used to detect pathologic laryngeal conditions. He also noticed that perturbations above 0.5 ms “generally occurred at rapid formant transitions for connected speech” (p. 346). Currently, there is a certain consensus (Titze, 1995, p. 28) that cycle-to-cycle perturbation analysis should be limited to sustained vowels to “elicit a stationary process in vocal fold vibration.” A common measure of jitter is the so-called “average jitter” (Heiberger & Horii, 1982):

$$ \text{average jitter} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| T(i+1) - T(i) \right|, \tag{5} $$

where T(i) is the duration of the ith period and N is the total number of cycles. The lack of time normalization in this expression results in F0-dependent jitter values. Lieberman (1961, p. 602) observed that |T(i+1) − T(i)| “increased with the duration of the periods until their duration reached 6 msec (sic). It was independent of the period duration for periods longer than 6 msec.” In other words, his data suggested that cycle-to-cycle period changes were F0-dependent for frequencies above ≈ 167 Hz (i.e., periods shorter than 6 ms). Therefore, jitter estimation in time units is not a recommended strategy. Jitter normalization is usually achieved by using the utterance’s mean fundamental period ($\bar{T}$), as in the “jitter factor” (Heiberger & Horii, 1982):

$$ \text{jitter factor} = \frac{\dfrac{1}{N-1} \sum_{i=1}^{N-1} \left| T(i+1) - T(i) \right|}{\bar{T}} \times 100\%, \tag{6a} $$

$$ \bar{T} = \frac{1}{N} \sum_{i=1}^{N} T(i), \tag{6b} $$

where T(i) is the duration of the ith period and N is the total number of cycles. To compensate for variations in the fundamental frequency, Koike (1973) proposed the “Relative Average Perturbation” (RAP), where the instantaneous fundamental period is compared to a local 3-point average, and normalized by the utterance’s mean period:

$$ \text{RAP} = \frac{\dfrac{1}{N-2} \sum_{i=2}^{N-1} \left| \dfrac{T(i-1) + T(i) + T(i+1)}{3} - T(i) \right|}{\bar{T}} \times 100\%, \tag{7} $$

where T(i) is the duration of the ith period and N is the total number of cycles, $\bar{T}$ being defined in Equation 6b. Davis (1976) suggested that a 5-point running average would be more appropriate to differentiate healthy and dysphonic speakers based on acoustic jitter. The normalization by a local F0 estimate appears to have been originally suggested by Askenfelt and Hammarberg (1980), in their “Perturbation Magnitude” (PM) measure:

$$ \text{PM} = \frac{1}{N-1} \sum_{i=1}^{N-1} \frac{\left| F_0(i+1) - F_0(i) \right|}{F_0(i)} \times 100\%, \tag{8} $$

where F0(i) is the fundamental frequency of the ith period and N is the total number of cycles. Another definition, the “First Order Perturbation Factor” (PF1), introduced by Titze and Liang (1993), was used here:

$$ \text{PF1} = \frac{1}{N-1} \sum_{i=1}^{N-1} \frac{\left| T(i+1) - T(i) \right|}{0.5 \left[ T(i+1) + T(i) \right]} \times 100\%, \tag{9} $$

where T(i) is the duration of the ith period and N is the total number of cycles. This measure was adopted because (1) it incorporates a period normalization; (2) the instantaneous estimates depend only on two cycles, simplifying the implementation of the period tracking strategy; and (3) there is no compensation for F0 fluctuation, as in Equation 7. This type of compensation may mask true cycle-to-cycle perturbations, possibly leading to underestimated jitter values (Titze, Horii & Scherer, 1987). Shimmer was also measured with a first order perturbation function:

$$ \text{shimmer} = \frac{1}{N-1} \sum_{i=1}^{N-1} \frac{\left| P(i+1) - P(i) \right|}{0.5 \left[ P(i+1) + P(i) \right]} \times 100\%, \tag{10} $$

where P(i) is the positive peak value of the ith cycle. Logarithms in expressions for shimmer are found in the literature, but this has been discouraged in current recommendations for voice analysis (Titze, 1995, p. 27). A smoothed significant peak could have been used to avoid the effects of mucous strands and noise. However, it might be useful having a measure sensitive to such factors because mucus being collected in the vocal folds may be indicative of abnormal vibration modes (Maran, 1995). This may possibly happen in situations where the mucosal wave (and hence mucus) is being interrupted before reaching the ventricles. The measure of shimmer defined in Equation 10 focuses on perturbations during maximum vocal contact. These perturbations can be induced by lesions, mucus and phonatory settings. A creaky voice can be a cause of excessive shimmer. In this setting, the folds can vibrate in longitudinal modes of higher order, possibly with left-to-right asymmetry, resulting in multiple and most likely irregular contacts.
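The perturbation measures of Equations 5, 6, 7, 9, and 10 can be sketched compactly as below. The sketch assumes that the period durations T(i) and positive peak values P(i) have already been extracted by the F0 tracker; the function names and the example sequences are illustrative assumptions, not the chapter's implementation.

```python
def average_jitter(periods):
    """Equation 5: mean absolute difference between consecutive periods."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return sum(diffs) / len(diffs)


def jitter_factor(periods):
    """Equation 6: average jitter normalized by the mean period, in percent."""
    mean_period = sum(periods) / len(periods)
    return 100.0 * average_jitter(periods) / mean_period


def rap(periods):
    """Equation 7: deviation from a local 3-point average, in percent."""
    mean_period = sum(periods) / len(periods)
    devs = [
        abs((periods[i - 1] + periods[i] + periods[i + 1]) / 3.0 - periods[i])
        for i in range(1, len(periods) - 1)
    ]
    return 100.0 * (sum(devs) / len(devs)) / mean_period


def pf1(values):
    """Equations 9 and 10: first order perturbation function, in percent.

    Pass period durations to obtain jitter, or peak amplitudes for shimmer.
    Each difference is normalized by the mean of the two cycles involved.
    """
    terms = [abs(b - a) / (0.5 * (a + b)) for a, b in zip(values, values[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical sequences: alternating periods (ms) and steady peak amplitudes.
example_periods = [4.0, 6.0, 4.0, 6.0]
example_peaks = [1.0, 1.0, 1.0, 1.0]
jitter_pf1 = pf1(example_periods)
shimmer_pf1 = pf1(example_peaks)
```

For the strictly alternating periods used here the 3-point smoothing in RAP yields a smaller value than PF1, which illustrates how compensation for local F0 trends can lower the reported perturbation.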

6.2 Normative EGG data

Initially, it should be emphasized that a database containing recordings from speakers with no laryngeal abnormality was not available. It would be difficult to recruit a large number of subjects to match the dysphonic group and submit this control group to a complete ENT (ear, nose, and throat) examination, including videoendoscopy. The use of voices from speakers with “no history of voice disorders” without a thorough ENT examination was considered inadequate because laryngeal problems can occur without affecting the voice, although other symptoms may appear (e.g., coughing or dry throat). With the lack of a non-dysphonic control group in mind, an attempt was made to obtain ranges corresponding to the degree of impairment of the voice as detected in each objective measure. Further research should attempt to complement the results reported below with data from a non-dysphonic population.

Normalized closing and opening phases. The statistical distribution of the normalized closing phase (Cp) is shown in Figure 20. The values were divided into four ranges corresponding to quartiles (i.e., 25% groups) of the population. Considering that a fast closure is a primary feature for efficient voice production, it is expected that the first quartile represents values from healthy voices. The smallest closure (Cp = 3.75%) was taken from a singer in the “no abnormality detected” group. The largest measure (33.74%), on the other hand, came from a speaker with respiratory problems whose electroglottogram presented an open phase with a slowly rising slope (Figure 21). It is difficult to interpret this EGG pattern, but videostroboscopic images suggested that closure was being accomplished with little vertical phase difference, because no mucosal wave was observed. Moreover, the “zipper-like” pattern of closure in the anterior-posterior direction was not present, and a reduction of the longitudinal length of contact could be observed, explaining the slow rise in the EGG signal.

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus closing phase/T (%); ranges 0-3 are marked below the horizontal axis.]

Figure 20: Normalized closing phase (Cp = closing phase/T). The ranges on the bottom of the image correspond to the quartiles, equivalent to (0) 0% ≤ Cp < 8.3%; (1) 8.3% ≤ Cp < 10.5%; (2) 10.5% ≤ Cp < 13.9%; and (3) 13.9% ≤ Cp. The last bin in the histogram concentrates all values above 29%.
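The quartile-based ranges can be sketched as follows. The sketch assumes the three Cp boundaries quoted in the caption of Figure 20 (8.3%, 10.5%, and 13.9%) and treats each boundary as the lower limit of the next range; the function names are hypothetical, and the quartile method used by `statistics.quantiles` is not necessarily the one used in the study.

```python
from bisect import bisect_right
from statistics import quantiles

# Cp boundaries (%) between ranges 0|1, 1|2, and 2|3, taken from Figure 20.
CP_BOUNDARIES = (8.3, 10.5, 13.9)


def severity_range(value, boundaries=CP_BOUNDARIES):
    """Return the range index (0-3) into which a measurement falls."""
    return bisect_right(boundaries, value)


def quartile_boundaries(values):
    """Estimate the three quartile cut points from a set of measurements."""
    return quantiles(values, n=4)


# The extreme Cp values reported in the text fall in the outer ranges.
smallest = severity_range(3.75)
largest = severity_range(33.74)

# Hypothetical measurements, only to show how boundaries could be derived.
cuts = quartile_boundaries([float(v) for v in range(1, 138)])
```

Passing a different tuple of boundaries to `severity_range` would apply the same lookup to any other measure whose ranges increase monotonically, such as jitter or shimmer.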

Figure 21: EGG with slow closing phase. Top: Electroglottogram (time scale: 5 ms/div). “N,” “a”, “c”, “P” and “A” have been defined in Figure 17 and Figure 18. There are no prominent peaks in the acoustic signal after the closure, as typically seen in /a/ vowels. The microphone signal is noisy and perceptually breathy. Bottom: videostroboscopic sequence with little indication of vertical phase difference or mucosal wave. Notice also the bowed shape of the vocal folds and the adduction of the vestibular folds in this 66-year-old male patient.

The few reported values of normalized closing and opening phases appear to be those by Kelman (1981). In this study, the opening and closing phases were defined in terms of the peak of the waveform instead of the 90% level adopted here. The lower level is not clearly stated in the paper, but a figure in the cited paper (Figure 1, p. 75) suggests that it was equivalent to the 10% value adopted here. The values of the normalized closing phase in Kelman’s (1981) non-dysphonic speakers ranged from 5.9-8.5% (3 men) and 12.8-17.5% (4 women). Differences in the definition of the phases led to larger values compared to those obtained here (Figure 20). There was also a large correlation (r = 0.93) between Kelman’s (1981) F0 values and normalized opening phase measures. This correlation was only -0.11 in the 137 cases studied here.

The statistical distributions of normalized opening phase values are shown in Figure 22. In the absence of a reference range for non-dysphonic speakers, the modal class in this figure was taken as representative of healthy voices. A few ranges are indicated in the figure, suggesting (0) no abnormality detected in the measure, and (1a) slightly short, (2a) short, (1b) slightly high, or (2b) high opening phases. The central range was based on 25% of the population around the modal class; the next 25% up defined the range 1b, and the remaining cases defined the range 2b. The ranges 1a and 2a were similarly obtained. There was a high correlation (r = 0.74) between normalized opening phase measures and contact quotient measures. This is somewhat predictable, keeping in mind that the opening phase accounts for most of the contact phase.

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus opening phase/T (%); ranges 2a, 1a, 0, 1b, and 2b are marked below the horizontal axis.]

Figure 22. Normalized opening phase (Op = Opening phase/T). The shaded bin is the modal class. The ranges on the bottom of the picture are approximately (0) 32.7% ≤ Op < 38.9%, (1a) 22.4% ≤ Op ≤ 32.7%, (2a) Op ≤ 22.4%, (1b) 38.9% ≤ Op < 42.8%, and (2b) Op ≥ 42.8%.

Speed Index. Speed index (SI) measures are shown in Figure 23. There appears to have been little dependence between SI measures and the fundamental frequency, the correlation coefficient between these measures being r = -0.12. The largest magnitude (-0.85) was taken from recordings of the singer with the shortest closing phase mentioned before. Regarding studies of other researchers, there appears to be no study reporting SI values for EGG signals. Baken (1987, p. 213) reproduced SI values obtained by Sonesson (1960) based on glottal area waveforms, but the magnitude of these measures was ≈ 10 times smaller than the values obtained here.

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus speed index; ranges 0-3 are marked below the horizontal axis.]

Figure 23. Speed index (SI) measures. Ranges at the bottom correspond to (0) -1.00 ≤ SI < -0.71; (1) -0.71 ≤ SI < -0.60; (2) -0.60 ≤ SI < -0.41; and (3) -0.41 ≤ SI ≤ +1.00.

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus contact quotient (%); ranges 2a, 1a, 0, 1b, and 2b are marked below the horizontal axis.]

Figure 24: Contact quotient (CQ). The shaded bin is the modal class. Ranges at the bottom correspond approximately to (0) 51% ≤ CQ < 57%, (1a) 44% ≤ CQ < 51%, (2a) 0% ≤ CQ < 44%, (1b) 57% ≤ CQ < 65%, and (2b) 65% ≤ CQ ≤ 100%.

Contact Quotient. The distributions of contact quotient (CQ) measures are shown in Figure 24. Values concentrated in the central range are expected to represent normal voices, being in agreement with reported values (e.g., Kelman, 1981; Lindsey & Howard, 1989). The ranges were determined as before (i.e., Figure 22), except that the boundary between the ranges 1b and 2b (dashed line in Figure 24) was introduced based on the study by Lindsey and Howard (1989). CQ values falling in the ranges 1a or 2a suggest poor vocal function and possibly a glottal chink. High CQ values may indicate a mass increase or hyperfunction (“pressed voices”), but trained singers are expected to have CQ values in the 1b or 2b ranges (Lindsey and Howard, 1989). A contact quotient in the central range suggests that there is little air leakage and that aerodynamic energy is being efficiently transferred into vocal fold movement.

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus jitter (%); ranges 0-3 are marked below the horizontal axis.]

Figure 25. Jitter measures. The last bin indicates the interval 2.3-10%. Ranges are approximately (0) 0% ≤ jitter < 0.30%; (1) 0.30% ≤ jitter < 0.54%; (2) 0.54% ≤ jitter < 1.08%; and (3) 1.08% ≤ jitter ≤ 10%.

Jitter. The statistical distribution of jitter measures (Figure 25) had a marked modal class (range 1, 0.30-0.54%). Values for non-dysphonic speakers have been reported by Orlikoff (1995). His measures, in milliseconds, can be converted to 0.34-0.91% (men) and 0.51-1.36% (women), by normalizing to the reported mean F0 values (men: 110 Hz, women: 223 Hz). His values appear to be larger than the range suggested here for normal voices. They also had an apparent dependence on the fundamental frequency that has not been observed in the values obtained here: the correlation coefficient between jitter and F0 values was r = -0.071 in the 137 speakers.
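The conversion from jitter in time units to a percentage can be written explicitly. Assuming the mean period is taken as the reciprocal of the mean F0 (an approximation, since the mean of the periods is not exactly the reciprocal of the mean frequency):

$$ \text{jitter}\,(\%) = \frac{\text{jitter}\,(\text{ms})}{\bar{T}\,(\text{ms})} \times 100 = \frac{\text{jitter}\,(\text{ms}) \times \bar{F}_0\,(\text{Hz})}{10}. $$

For a mean F0 of 110 Hz the mean period is about 9.09 ms, so a jitter of 0.34% corresponds to roughly 0.031 ms; this millisecond figure is back-computed here from the percentage quoted above for illustration and is not a value read from the cited paper.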

Figure 26. Speed of closure and acoustic intensity. (a) Electroglottogram and superposed audio signal (time scale: 5 ms/div). Arrows point out the effects of a polyp, shown in (b).

An EGG signal with large jitter is given in Figure 26. The electroglottogram does not look severely affected but instantaneous jitter values were approximately 7%, revealing the importance of automatic measurement. The patient had a polyp (indicated by the arrow in Figure 26b) that affected either the opening phase (arrow 1 in Figure 26a) or the closing phase (arrow 2 in Figure 26a) of each cycle. Notice that affected closing phases had a smaller slope and the amplitudes of the corresponding acoustic cycles were significantly reduced. This is a clear illustration of the relationship between speed of closure and acoustic intensity. Figure 26a also shows that electroglottographic and acoustic shimmer are not necessarily correlated.

Shimmer. The distributions of shimmer measures are shown in Figure 27, where four ranges for the degree of perturbation are suggested. The few reported EGG shimmer measures (Haji et al., 1986; Orlikoff, 1995) have been calculated in decibels, using Horii’s (1979) definition:

$$ \text{shimmer (dB)} = \frac{20}{N-1} \sum_{i=1}^{N-1} \left| \log \left[ \frac{P(i)}{P(i+1)} \right] \right|, \tag{11} $$

where P(i) is the positive amplitude of the ith cycle and N is the number of cycles analyzed. The relationship between instantaneous shimmer values in decibels, 20·|log[P(i)/P(i+1)]|, and values obtained with the first order perturbation function, 2·|P(i) − P(i+1)|/[P(i) + P(i+1)], is practically linear in the range of values observed in EGG and acoustic signals and can be approximated by:

$$ \text{shimmer (PF1, \%)} \approx 11.42 \times \text{shimmer (dB)}, \tag{12} $$

for shimmer (PF1) in the 0-40% range. Equation 12 is plotted in Figure 28.
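The near-linearity behind Equation 12 can be checked numerically. The sketch below assumes base-10 logarithms in the decibel definition and compares the exact instantaneous conversion, obtained by writing both measures in terms of the amplitude ratio P(i)/P(i+1), with the linear approximation; the function names are hypothetical.

```python
import math


def shimmer_db(p1, p2):
    """Instantaneous shimmer in decibels for two consecutive peak amplitudes."""
    return 20.0 * abs(math.log10(p1 / p2))


def shimmer_pf1(p1, p2):
    """Instantaneous shimmer (%) with the first order perturbation function."""
    return 100.0 * abs(p1 - p2) / (0.5 * (p1 + p2))


def db_to_pf1(db):
    """Exact conversion: recover the amplitude ratio, then apply PF1."""
    ratio = 10.0 ** (db / 20.0)
    return 200.0 * (ratio - 1.0) / (ratio + 1.0)


def db_to_pf1_linear(db, slope=11.42):
    """Linear approximation of Equation 12."""
    return slope * db


# Mean value reported by Orlikoff (1995), converted both ways.
exact = db_to_pf1(0.183)
approx = db_to_pf1_linear(0.183)
```

The ratio between the exact conversion and the decibel value decreases slowly from about 11.5 for very small perturbations to about 11.4 near 40%, which is consistent with a single fitted slope of 11.42 over that range.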

[Histogram: number of occurrences (left axis) and cumulative percentage (right axis) versus shimmer (%); ranges 0-3 are marked below the horizontal axis.]

Figure 27. Shimmer. Ranges are approximately (0) 0% ≤ shimmer < 1.40%; (1) 1.40% ≤ shimmer < 3.10%; (2) 3.10% ≤ shimmer < 5.50%; and (3) shimmer ≥ 5.50%. The nine values in the “more” class spread up to 31.32%.

Orlikoff (1995) reported an average shimmer of 0.183 dB (range: 0.077-0.460 dB) for 20 non-dysphonic speakers, these values being equivalent to 2.09% (range: 0.88-5.25%) according to Equation 12. Another study (Haji et al., 1986) obtained a mean shimmer of ≈ 0.15 dB (1.71%) for 30 “normal” speakers. The range for these 30 speakers was ≈ 0.03-0.23 dB (0.34-2.62%). Measures from 33 dysphonic speakers by Haji and co-workers (1986) varied from approximately 0.04 dB (0.46%) to 1.8 dB (20.56%). Shimmer is not a critical measure, provided that F0 tracking is free from period doublings and halvings. The measures obtained from the 137 speakers analyzed here appear to be consistent with other studies.

A scatter plot in the jitter × shimmer plane is shown in Figure 29. The correlation between shimmer and jitter measures was r = 0.71. A close value (r = 0.86) has been reported by Haji and colleagues (1986) for the correlation between shimmer (dB) and jitter (in semitones). The regression line in the log × log scales of Figure 29 indicates that shimmer can be roughly predicted from jitter by:

$$ \text{shimmer (PF1)} \approx 4.77 \times \text{jitter (PF1)} \tag{13} $$

[Plot: shimmer % (PF1), from 0 to 200, versus shimmer (dB), from 0 to 40.]

Figure 28. Conversion of instantaneous shimmer measures. See text for definitions of shimmer (dB) and shimmer (PF1).

Summary. The objective measures discussed in this chapter can be combined in a single chart, as shown in Figure 30. A similar display is used in commercially available software (Multi-Dimensional Voice Program, Kay Elemetrics Corp., Pine Brook, NJ, USA). The chart is divided into areas, corresponding to the ranges determined previously. Other axes could obviously have been added.

[Log-log scatter plot: shimmer (%) versus jitter (%), with ranges 0-3 marked.]

Figure 29. Jitter × shimmer plane. Ranges (0-3) have been defined in Figure 25 and Figure 27.

[Multi-axis chart with axes CQ (high), jitter, CQ (low), CQ Std Dev, shimmer, and speed index; ranges 0-3 are marked along the axes.]

Figure 30. EGG profile. The two sets of connected dots suggest a nearly normal electroglottogram (inner set) and a moderately to extremely affected signal (outer set). To simplify the display of values below and above the central range, contact quotient measures (CQ) were split into two separate axes: CQ (low) axis, including values in the 0, 1a, and 2a ranges, and CQ (high) including values in the 1b and 2b ranges.

7. Concluding Remarks

This chapter introduced the basic principles of electroglottography and described its application to the analysis of voice disorders. Special attention has been given to the assessment of the quality of EGG signals to improve the reliability of automatic measures. A number of algorithms have been implemented for baseline drift removal, F0 tracking, and automatic measurement of waveform features. Ranges of severity for each measure were obtained based mostly on quartiles of the studied population. It is hoped that these measures and respective ranges may help fill the gap in reference data for dysphonic speakers.

Although a number of general guidelines for the interpretation of EGG signals have been given in the chapter, it is not possible, in general, to ascribe specific types of pathology to particular types of EGG signals or ranges of EGG measures. It should be noted that the electroglottogram basically represents the dynamic behavior of the vocal folds’ contact area. Therefore, the details seen in the signal depend not only on the physical integrity of the vocal folds, but also on the aerodynamic aspects of phonation. As an example, a patient with unilateral paralysis of the recurrent nerve may still be able to move the non-affected vocal fold beyond the midline to achieve sufficient contact for phonation. Using only EGG analysis (i.e., without medical history details), it would be virtually impossible to discriminate such well-compensated paralysis from, say, a normal larynx with inadequate respiratory support.

The objective measures studied in this chapter are valuable, though, to quantify the extent of contact and the regularity of the vibration, which are particularly useful to monitor the effects of phonosurgery and/or speech therapy. The objective analysis of EGG signals has little value for differential diagnosis. However, EGG parameters falling outside the expected “normal” ranges should motivate a careful examination of videoendoscopic images. Moreover, because EGG waveforms sometimes present marginal perturbations that may not be detected by automatic measures, the interpretation of waveforms seems essential in clinical practice. Abnormalities in the EGG closing phase (which relates to the lower parts of the vocal folds) have a particular importance, keeping in mind that videoendoscopic views are confined to the superior parts of the vocal folds, and lesions in the inferior part may not be easily perceived in stroboscopic pseudo slow motion sequences.

References

Askenfelt, A., & Hammarberg, B. (1980). Speech waveform perturbation analysis. Speech Transmission Laboratory Quarterly Progress and Status Report 4 (Royal Institute of Technology, Stockholm), pp. 40-49.
Baken, R. J. (1987). Clinical Measurement of Speech and Voice. Taylor & Francis, London.
Bless, D. M., Baken, R. J. (also transcribers and editors), Hacki, T., Fritzell, B., Laver, J., Schutte, H., Hirano, M., Loebell, E., Titze, I., Faure, M. A., Muller, A., Wendler, J., Fex, S., Kotby, M. N., Brewer, D., Sonninen, A., & Hurme, P. (discussants) (1992). International association of logopedics and phoniatrics (IALP) voice committee discussion of assessment topics. Journal of Voice 6, pp. 194-210.
Childers, D. G. (1992). Detection of laryngeal function using speech and electroglottographic data. IEEE Transactions on Biomedical Engineering 39, pp. 19-25.
Childers, D. G., Hicks, D. M., Moore, G. P., & Alsaka, Y. A. (1986). A model for vocal fold vibratory motion, contact area, and the electroglottogram. Journal of the Acoustical Society of America 80, pp. 1309-1320.
Childers, D. G., Hicks, D. M., Moore, G. P., Eskenazi, L., & Lalwani, A. L. (1990). Electroglottography and vocal fold physiology. Journal of Speech and Hearing Research 33, pp. 245-254.
Colton, R. H., & Conture, E. G. (1990). Problems and pitfalls of electroglottography. Journal of Voice 4, pp. 10-24.
Czarnach, R. (1982). Recursive processing by noncausal digital filters. IEEE Transactions on Acoustics, Speech, and Signal Processing 30, pp. 363-370.
Davis, S. B. (1976). Computer evaluation of laryngeal pathology based on inverse filtering of speech. Speech Communication Research Laboratory Monograph 13 (Santa Barbara, CA).
Dejonckere, P. H., & Lebacq, J. (1985). Electroglottography and vocal nodules: an attempt to quantify the shape of the signal. Folia Phoniatrica 37, pp. 195-200.
Doherty, E. T., & Shipp, T. (1988). Tape recorder effects on jitter and shimmer extraction. Journal of Speech and Hearing Research 31, pp. 485-490.
Eady, S., Dickson, C., Snell, R., Woolsey, J., Ollek, P., Wynrib, A., & Clayards, J. (1992). A microcomputer-based system for real-time analysis and display of laryngograph signals. Proceedings ICSLP’92: International Conference on Spoken Language Processing, pp. 1601-1604.
Fabre, P. (1940). Sphygmographie par Simple Contact d’Électrodes Cutanées. C. R. Séanc. Soc. Biol. 133, pp. 639-640.
Fabre, P. (1957). Un Procédé Électrique Percutané d’Inscription de l’Accolement Glottique au Cours de la Phonation. Glottographie de Haute Fréquence. Bull. Acad. Natn. Méd. 141, pp. 66-70.
Fant, G., Ondrácková, J., Lindqvist, J., & Sonesson, B. (1966). Electrical glottography. Speech Transmission Laboratory Quarterly Progress and Status Report 4 (Royal Institute of Technology, Stockholm), pp. 15-21.
Flanagan, J. (1972). Speech Analysis, Synthesis and Perception (Second Edition). Springer-Verlag, New York.
Fourcin, A., Abberton, E., Miller, D., & Howells, D. (1995). Laryngograph: speech pattern element tools for therapy, training and assessment. European Journal of Disorders of Communication 30, pp. 132-139.
Frøkyaer-Jensen (1996). Personal communication.
Gilbert, H. R., Potter, C. R., & Hoodin, R. (1984). Laryngography as a measure of vocal fold contact area. Journal of Speech and Hearing Research 27, pp. 178-182.
Haji, T., Horiguchi, S., Baer, T., & Gould, W. J. (1986). Frequency and amplitude perturbation analysis of electroglottograph during sustained phonation. Journal of the Acoustical Society of America 80, pp. 58-62.
Hall, K. D. (1995). Variations across time in acoustic and electroglottographic measures of phonatory function in women with and without vocal nodules. Journal of Speech and Hearing Research 38, pp. 783-793.
Heiberger, V. L., & Horii, Y. (1982). Jitter and shimmer in sustained phonation. In: Speech and Language: Advances in Basic Research and Practice (Vol. 7), edited by N. Lass (Academic Press, New York), pp. 299-332.
Hess, W. H., & Indefrey, H. (1987). Accurate time-domain pitch determination of speech signals by means of a laryngograph. Speech Communication 6, pp. 55-68.
Hillenbrand, J. (1987). A methodological study of perturbations and additive noise in synthetically generated voice signals. Journal of Speech and Hearing Research 30, pp. 448-461.
Hiller, S. M. (1985). Automatic Acoustic Analysis of Waveform Perturbations. Ph.D. thesis (University of Edinburgh).
Horiguchi, S., Haji, T., Baer, T., & Gould, W. J. (1987). Comparison of electroglottographic and acoustic waveform perturbation measures. In: Laryngeal Function in Phonation and Respiration, edited by T. Baer, C. S. Sasaki, and K. S. Harris (College Hill Press, Boston), pp. 509-518.
Kasuya, H., Ogawa, S., & Kikuchi, Y. (1986). An adaptive comb filtering method as applied to acoustic analysis of pathological voice. Proceedings of ICASSP’86: International Conference on Acoustic, Speech, and Signal Processing (IEEE; Tokyo), pp. 669-672.
Kasuya, H., Ogawa, S., Kikuchi, Y., & Ebihara, S. (1986a). An acoustic analysis of pathological voices and its application to the evaluation of laryngeal pathology. Speech Communication 5, pp. 171-181.
Kasuya, H., Ogawa, S., Mashima, K., & Ebihara, S. (1986b). Normalised noise energy as an acoustic measure to evaluate pathologic voice. Journal of the Acoustical Society of America 80, pp. 1329-1334.
Kelman, A. W. (1981). Vibratory pattern of the vocal folds. Folia Phoniatrica 33, pp. 73-99.
Koike, Y. (1973). Application of some acoustic measures for the evaluation of laryngeal dysfunction. Studia Phonologica 7, pp. 17-23.
Kormylo, J. J., & Jain, V. K. (1974). Two-pass recursive digital filter with zero phase shift. IEEE Transactions on Acoustics, Speech, and Signal Processing 22, pp. 384-387.
Krishnamurthy, A. K. (1983). A Study of Vocal Fold Vibration and the Glottal Sound Source Using Synchronized Speech, Electroglottography and Ultra-High Speed Laryngeal Films. Ph.D. thesis (University of Florida).
Kuo, F. F. (1966). Network Analysis and Synthesis. John Wiley & Sons, New York.
Lebrun, Y., & Hasquin-Deleval, J. (1971). On the so-called ‘dissociations’ between electroglottogram and phonogram. Folia Phoniatrica 23, pp. 225-227.
Lecluse, F. L., Brocaar, M. P., & Verschuure, J. (1975). The electroglottography and its relation to glottal activity. Folia Phoniatrica 27, pp. 215-224.
Lieberman, P. (1961). Perturbations in vocal pitch. Journal of the Acoustical Society of America 33, pp. 597-603.
Lieberman, P. (1963). Some acoustic measures of the fundamental periodicity of normal and pathologic larynges. Journal of the Acoustical Society of America 35, pp. 344-353.
Lindsey, G., & Howard, D. M. (1989). Larynx excitation in the singing voice. Speech, Hearing and Language Work in Progress 3 (University College, London), pp. 169-177.
Löfqvist, A., Carlborg, B., & Kitzing, P. (1982). Initial validation of an indirect measure of subglottal pressure during vowels. Journal of the Acoustical Society of America 72, pp. 633-635.
Maran, A. G. D. (1995). Personal communication.
Marasek, K. (1995). An attempt to classify Lx signals. Proceedings of EUROSPEECH’95: 4th European Conference on Speech Communication and Technology (ESCA; Madrid), pp. 1729-1732.
Neil, W. F., Wechsler, E., & Robinson, M. P. (1977). Electrolaryngography in laryngeal disorders. Clinical Otolaryngology 2, pp. 33-40.
Oppenheim, A. V., & Schafer, R. W. (1975). Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ.
Orlikoff, R. F. (1991). Assessment of the dynamics of vocal fold contact from the electroglottogram: data from normal subjects. Journal of Speech and Hearing Research 34, pp. 1066-1072.
Orlikoff, R. F. (1995). Vocal stability and vocal tract configuration: an acoustic and electroglottographic investigation. Journal of Voice 9, pp. 173-181.
Orlikoff, R. F., & Baken, R. J. (1989). The effect of the heartbeat on vocal fundamental frequency. Journal of Speech and Hearing Research 32, pp. 576-582.
Rabiner, L. R., Kaiser, J. F., Herrmann, O., & Dolan, M. T. (1974). Some comparisons between FIR and IIR digital filters. Bell System Technical Journal 53, pp. 305-331.
Rossi, M., & Autesserre (1981). Movements of the hyoid and the larynx and the intrinsic frequency of vowels. Journal of Phonetics 9, pp. 233-249.
Rothenberg, M. (1992). A multichannel electroglottograph. Journal of Voice 6, pp. 36-43.
Rothenberg, M., & Mahshie, J. J. (1988). Monitoring vocal fold abduction through vocal fold contact area. Journal of Speech and Hearing Research 31, pp. 338-351.
Scherer, R. C., Druker, D. G., & Titze, I. R. (1988). Electroglottography and direct measurement of vocal contact area.
In: Vocal Fold Physiology: Voice Production, Mechanisms and Functions, edited by O. Fujimura (Raven Press, New York), pp. 279-292. Schoentgen, J., & Guchteneere, R. (1991). An algorithm for the measurement of jitter. Speech Communication 10, pp. 279-292.

96

Chapter Four

Sonesson, B. (1960). On the anatomy and vibratory pattern of the human vocal folds. Acta Otolaryngologica (Stockholm), Supplement. 156, pp. 1-80. Stemple, J. C. (1993). Voice Therapy: Clinical Studies. Mosby-Year, St. Louis. Titze, I. R., (1984). Parameterization of the glottal area, glottal flow, and vocal fold contact area. Journal of the Acoustical Society of America 75, pp.570-580. —. (1989a). Physiologic and acoustic differences between male and female voices. Journal of the Acoustical Society of America 85, pp. 1693-1707. —. (1989b). A four-parameter model of the glottis and vocal fold contact area. Speech Communication 8, pp. 191-201. —. (1990). Interpretation of the electroglottographic signal. Journal of Voice 4, pp. 1-9. —. (1995). Workshop on Acoustic Voice Analysis: Summary Statement (National Center for Voice and Speech, Iowa), pp. 26-30. Titze, I. R., Horii, Y., & Scherer, R. C. (1987). Some technical considerations in voice perturbation measurements. Journal of Speech and Hearing Research 30, pp. 252-260. Titze, I. R., & Liang, H. (1993). Comparison of F0 extraction methods for high-precision voice perturbation measurements. Journal of Speech and Hearing Research 36, pp. 1120-1133. Van Michel, P. C. (1967). Morphologie de la courbe glottographique dans certains troubles fonctionnels du larynx. Folia Phoniatrica 19, pp. 192202. Vieira, M. N., McInnes, F. R., Jack, M. A., Maran, A., Watson, C., & Little, M. (1995). Methodological aspects in a multimedia database of vocal fold pathologies. Proceedings of EUROSPEECH’95: 4th European Conference on Speech Communication and Technology (ESCA; Madrid), pp. 1867-1870. Vieira, M. N., McInnes, F. R., & Jack, M. A. (1996). Analysis of the effects of electroglottographic baseline fluctuation on the F0 estimation in pathological voices. Journal of the Acoustical Society of America 99, pp. 3171-3178. Vieira, M. N., McInnes, F. R., & Jack, M. A. (2002). 
On the influence of laryngeal pathologies on acoustic and electroglottographic jitter measures. Journal of the Acoustical Society of America 111, pp. 10451055.

Electroglottography

97

Watson, C. (1995). Quality analysis of laryngography in a busy hospital ENT voice clinic. European Journal of Disorders of Communication 30, pp. 132-139. Wendler, J. (1993). Stroboscopy: Principles and Clinical Application in the Investigation of the Larynx. Edited by ATMOS Medizintechnik GmbH & Co. .

CHAPTER FIVE

TIME-NORMALIZATION OF FUNDAMENTAL FREQUENCY CONTOURS: A HANDS-ON TUTORIAL

PABLO ARANTES¹

Abstract

In this chapter, a worked example taken from published research is presented in detail to demonstrate the application of the time-normalization technique to fundamental frequency contours. We present the time-normalization algorithm and its implementation as a Praat script, as well as the role of each parameter in shaping the average time-normalized contour generated by the script. As an exercise in reproducible research, we also share R code that can be used to plot the contours generated by the script.

Keywords: prosody, intonation, fundamental frequency.

1. Introduction

In this chapter, we present a hands-on tutorial on how to perform time normalization of fundamental frequency (F0) contours by means of a Praat script called time_normalized_f0, written for this purpose. Section 2 presents some research scenarios where the technique might be helpful. The algorithm that performs the time-normalization is presented in section 3. Basic instructions on how to use the script are provided in section 4. In section 6, we also demonstrate how to use the R statistical computing environment (R Core Team 2014) to analyze the output of the Praat script in order to obtain a graphical representation of the mean time-normalized F0 contours generated by the Praat script.

It should be mentioned that the phonetician Yi Xu first presented the basic ideas behind the time-normalization of F0 contours that inspired the creation of the time_normalized_f0 script (Xu 1993 is his seminal work). Xu developed and maintains a Praat script called ProsodyPro that, among other things, performs time-normalization of F0 contours. Although the core algorithm driving the two scripts is fundamentally the same, the script we present here is more focused on the time-normalization task and in some respects offers users more fine-grained options, whereas ProsodyPro can be seen as a whole suite of functionalities that goes beyond time-normalization.²

While the time-normalization technique is not exactly new, the way we present it here can be said to be original because this tutorial aims to be an exercise in reproducible research (Wicherts et al. 2006; Fomel and Claerbout 2009). The results of the data analysis, presented in detail in the following sections, were originally published in a peer-reviewed article (Arantes et al. 2012), but the data and the code used to plot the graphics in that article were not made publicly available at the time. The Praat script source code is now being made available to the public, together with the data files.³ The R code that analyzes the data and generates the graphics in Arantes et al. (2012) is displayed in the body of this chapter and can be copied and pasted into the R command prompt. The goal of reproducible research is to make it easy for other researchers to replicate published results. The data and code being shared can be reused and, perhaps more importantly, scrutinized by other researchers, maximizing the chances of finding and correcting errors. Finally, working with real data in a tutorial has the advantage that readers can develop a feeling for how the workflow presented could be applied in other research settings, perhaps even in their own research.

¹ São Carlos Federal University, Brazil.

2. Some use cases for time-normalization

In this section, we present two example use cases for the time-normalization technique. The second example will be explored more systematically in the hands-on section of the chapter.

² See http://www.phon.ucl.ac.uk/home/yi/ProsodyPro/ for more information on Xu’s script and the time-normalization technique.
³ Both the script source code and the data files used in this tutorial can be found on my GitHub profile: https://github.com/parantes.

The first use case arises when someone is interested in syntagmatic comparisons over a specific domain, for instance successive syllables in a word or words in a sentence. Here, time-normalization serves the purpose of creating an average contour from multiple raw contours that are usually repetitions of the same linguistic pattern being investigated. An example of the first use case is the scenario where a researcher wants to describe the behavior of F0 in words with a growing number of prestressed syllables, aiming to investigate the possible existence of secondary stresses among prestressed syllables. Following a number of phonological analyses of Brazilian Portuguese (see Arantes 2010 for a review), for instance, the researcher could test the hypothesis that secondary stresses are placed following a binary pattern, with prestressed syllables being parsed into trochaic feet. Words such as “já” [ˈʒa], “jacá” [ʒa.ˈka], “jacaré” [ʒa.ka.ˈɾɛ], “jacarandá” [ʒa.kɐ.ɾɐ̃.ˈda] and “jacarepaguá” [ʒa.ka.ɾɛ.pa.ˈɡwa] could be used to this end. The researcher would look for evidence of peaks in the contour in the first and last syllables in “jàcaré”⁴, the second and last syllables in “jacàrandá” and, finally, the first, third and last syllables in “jàcarèpaguá”. In order to determine whether the hypothesis is true or false, the researcher has to determine how many peaks can be seen in each contour and which syllable each of them is aligned with. Comparisons in this case are syntagmatic: the researcher is interested in what happens from one syllable to the next in the same contour. Overlaying the contours of the five words would not help the researcher answer her question.

Another common use case is the comparison of two or more conditions that are expected to generate different contours over a comparable domain: for instance, establishing how the statement–question distinction changes the contour of the same sentence, or how different sequences of tones change the contour of the same sequence of syllables. In these examples, the comparisons are not only syntagmatic but also paradigmatic, i.e., we are comparing not only how one contour evolves over time, unfolding from syllable to syllable or word to word, but also how contours in one condition differ from contours in another condition.

In this chapter, we discuss more systematically one example where both syntagmatic and paradigmatic comparisons are of interest. The complete analysis of the data presented here can be found in Arantes et al. (2010). The experimental design of the material in question contrasts a set of noun phrases carrying a target noun appearing in three conditions: as a new referent, as a given referent, and in a control condition. Target word size, measured in syllables, and stress type (lexical stress on the penultimate syllable, i.e. paroxytone words, or on the last syllable, i.e. oxytone words) were also controlled as independent variables. Experimental materials consisted of two-sentence narratives, such as the one shown in Example 1. The first and second occurrences of the target word “peregrinação” form a coreference chain.

(1) Uma peregrinação[NEW] reuniu muitas pessoas em volta da igreja. A peregrinação[GIVEN] durou o dia todo.
‘A peregrination gathered a lot of people around the church. The peregrination lasted all day.’

⁴ The location of alleged secondary stresses is indicated by the grave accent (`).

In the control condition, the target word is not in a coreference chain, as in Example 2.

(2) A peregrinação passou por aqui.
‘The peregrination passed by.’

In this use case, the researchers wanted to observe how lexical features of the target words (size and stress) interact with the givenness variable in shaping F0 contours. For each of the six combinations of word size (three levels) and stress type (two levels), contours of the three different givenness levels will be overlaid.

Figure 1 shows raw (gray lines) and smoothed⁵ F0 contours of the noun phrase “A peregrinação” (‘the peregrination’) when the noun is a new referent (top panel) and a given referent (middle panel). Both panels show 1.25 seconds of the sentence contours. Six labeled intervals shown below each contour identify the six vowel-to-vowel (V-V) units comprising the NP in which the target word is embedded. Notice that corresponding intervals in the two conditions do not have the same duration. That hinders direct interval-by-interval comparisons because interval boundaries do not coincide. By way of example, take the first interval, labeled “ap”, in Figure 1. In the “new” condition (top panel), its duration is 217 ms, whereas in the “given” condition (middle panel) it is 117 ms.

⁵ See section 5.1 for a discussion of how smoothing works.

3. Time-normalization algorithm

Time-normalization can solve the comparison issue by standardizing interval size. It achieves this goal by taking a fixed number of uniformly spaced values in each interval. Information about the location of F0 values in real time is lost and, as a result, intervals are no longer measured in time units but in time-normalized F0 samples. What we gain is the ability to align intervals of equal length. The bottom panel in Figure 1 shows the time-normalized versions of the two contours overlaid. It is now possible to compare the shape of the two curves interval by interval.

How is it done? The goal of the algorithm is to output a time-normalized contour from two input items: an F0 contour (stored in a Pitch object) and a user-provided interval segmentation (stored in a TextGrid object) indicating the temporal landmarks of each segmented interval and a corresponding label. The algorithm expects a fixed number m of intervals to be provided. The number n of samples to be taken per interval also has to be specified. This number defines the length of each interval and the total length, m · n, of the contour. Once we have the input items, the following steps take place:

1. Each labeled interval is divided into n isochronous subintervals, shown as small rectangles above each labeled interval in the top and middle panels in Figure 1. In this example, n = 5.
2. One F0 value is taken at the midpoint of each subinterval. Red ticks above the subinterval rectangles in Figure 1 indicate the temporal location of each value. Red dots corresponding to each tick are superimposed on the smoothed contours.

The bottom panel in Figure 1 shows the two time-normalized contours overlaid. The horizontal axis shows normalized time instead of absolute time. Values start at 1 and go up to 30, since the scope of the analysis is the first six intervals in the test sentences.
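The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not the code of the time_normalized_f0 script: the function names are hypothetical, the F0 contour is represented as plain lists of frame times and values, and the F0 value at each sampling point is obtained by linear interpolation between neighboring frames, which is an assumption about how values are read off the contour.

```python
from bisect import bisect_left


def sample_points(start, end, n):
    """Return the midpoints of n equal-duration subintervals of [start, end]."""
    step = (end - start) / n
    return [start + (k + 0.5) * step for k in range(n)]


def f0_at(times, values, t):
    """Read F0 at time t by linear interpolation between frames (an assumption)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    hi = bisect_left(times, t)  # index of the first frame at or after t
    lo = hi - 1
    weight = (t - times[lo]) / (times[hi] - times[lo])
    return values[lo] + weight * (values[hi] - values[lo])


def time_normalize(times, values, intervals, n):
    """Sample n F0 values in each (start, end) interval; returns m * n values."""
    contour = []
    for start, end in intervals:
        for t in sample_points(start, end, n):
            contour.append(f0_at(times, values, t))
    return contour


# Hypothetical data: frames every 10 ms, F0 rising linearly from 100 Hz.
times = [0.01 * k for k in range(60)]
values = [100.0 + 50.0 * t for t in times]
# Two intervals of unequal duration (217 ms and 117 ms) yield 5 samples each.
normalized = time_normalize(times, values, [(0.0, 0.217), (0.217, 0.334)], 5)
```

Whatever the real duration of an interval, it contributes exactly n values to the output, which is what makes interval-by-interval alignment possible.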
In a realistic research scenario, the conditions being compared will not be represented by a single contour. The researcher will record a number r of repetitions of a carrier sentence that contains instances of the pattern being investigated. In this case, we want to compute the average contour of each condition. How do we do that? After being processed by the algorithm, each of the resulting r time-normalized contours can be thought of as a sequence of $f_{i,j}$ values, with i in the [1, n·m] interval and j in the [1, r] interval. According to this notation, the set of r contours can be represented as follows.

contour 1: $(f_{1,1}, f_{2,1}, \ldots, f_{i,1}, \ldots, f_{(n \cdot m)-1,1}, f_{n \cdot m,1})$
contour 2: $(f_{1,2}, f_{2,2}, \ldots, f_{i,2}, \ldots, f_{(n \cdot m)-1,2}, f_{n \cdot m,2})$
...
contour j: $(f_{1,j}, f_{2,j}, \ldots, f_{i,j}, \ldots, f_{(n \cdot m)-1,j}, f_{n \cdot m,j})$
...
contour r: $(f_{1,r}, f_{2,r}, \ldots, f_{i,r}, \ldots, f_{(n \cdot m)-1,r}, f_{n \cdot m,r})$

The mean contour will be a single sequence of $\bar{f}_i$ values, with i in the [1, n·m] interval, obtained by applying the following formula.

(3) $\bar{f}_i = \frac{1}{r} \sum_{j=1}^{r} f_{i,j}$

Once mean contours are obtained for each condition being tested, they can be analyzed side-by-side in the case of syntagmatic comparisons or overlaid in the case of paradigmatic comparisons. Section 6 will demonstrate how to proceed from the raw data produced by the script to obtaining mean contours that can then be plotted.
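The mean contour described above, that is, the position-by-position average over the r time-normalized contours, can be sketched in a few lines of Python. The function name is hypothetical and the sketch assumes that every contour has the same n·m length and contains no missing (unvoiced) values; how missing values should be treated is a separate decision that is not addressed here.

```python
def mean_contour(contours):
    """Average r time-normalized contours position by position."""
    length = len(contours[0])
    if any(len(c) != length for c in contours):
        raise ValueError("all contours must have the same n * m length")
    r = len(contours)
    return [sum(c[i] for c in contours) / r for i in range(length)]


# Hypothetical example: r = 3 repetitions, each with n * m = 4 samples (Hz).
repetitions = [
    [120.0, 130.0, 125.0, 110.0],
    [118.0, 134.0, 127.0, 112.0],
    [122.0, 132.0, 123.0, 108.0],
]
average = mean_contour(repetitions)
```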

4. Script usage

In this section, we go over the most basic information a user has to master in order to make proper use of the time_normalized_f0 script. That includes what the script expects as input and the parameters that it lets the user change.

The inputs to the script are pairs of Pitch and TextGrid objects. Sound files are required only inasmuch as they are needed to extract the raw F0 contours that are stored as Pitch objects. Once the user has the Pitch files, sound files are no longer necessary. The time_normalized_f0 script does not extract Pitch objects automatically for the user. That design decision was made to encourage users to review the extracted contours and correct F0 errors made by Praat’s To Pitch function before feeding Pitch objects to the script.

For each Pitch object, the user is expected to provide a TextGrid object with a matching name. The TextGrid needs to have at least one interval tier to hold the labeled intervals. The script assumes the relevant segmentation is in the same tier throughout all TextGrid files. Additional tiers are ignored, so the same TextGrid can be used for other purposes. All non-empty intervals in the designated tier will be processed. The script does not assume any particular transcription alphabet, so any character can be used as a label.

The rest of this section reviews the parameters that users are free to change without modifying the script’s source code. They are all available from the script’s graphical user interface (GUI for short), shown in Figure 2.

• Pattern: the default option is *, which causes the script to work on all the files in the selected folders. If the user wants to select only a subset of the files, a pattern can be specified. For instance, if the user only wants to analyze files whose names start with the string “subj1”, the appropriate pattern would be “subj1*”.⁶
• Pitch folder and Grid folder: paths of the folders where Pitch and TextGrid files are stored. They can be the same folder or different folders.
• Report folder and Report: path of the folder and name of the file (with extension, usually .txt) of the report generated by the script.
• Tier: number of the TextGrid tier where the relevant segmentation is stored.
• Smooth: allows users to choose whether or not smoothing should be applied to F0 contours.
• Bandwidth: how much smoothing should be applied (the greater the number, the less smoothing is applied; see section 5.1 for more details).
• Interpolation: which kind of interpolation to apply in voiceless intervals. Options are quadratic, linear or no interpolation.

⁶ Technically, this field will accept a regular expression, although limited to string literals and the quantifier operator *. See http://en.wikipedia.org/wiki/Regular_expression.


[Figure 1 image: three panels. Top and middle panels: F0 (Hz, 110 to 200) against time (s), each with the six labeled V-V intervals below the contour. Bottom panel: F0 (Hz, 110 to 200) against normalized time (samples, 1 to 30).]

Figure 1: F0 contours paired with segmentation indicating syllable-sized units. The top and middle panels show contours in real time; the bottom panel shows the time-normalized representations of both contours.


Figure 2: Graphical user interface of the time_normalized_f0 script


• Unvoiced: the string the script writes when it samples an unvoiced part of the F0 contour. The default value is NA, which is the string used by R to represent missing values.
• Interval range: position of the first and last intervals in the specified tier to be sampled. The number of the last interval has to be equal to or greater than the number of the first interval.
• Samples: number of samples taken in each surveyed interval. Each interval can have a different number. If just one value is provided, that number will be used for all intervals.

The report file generated by the time_normalized_f0 script is a plain text file with a table-like structure. Tab characters separate columns, and each row is written on its own line. Default columns are listed below:

• file: the name given to each Pitch and its matching TextGrid file.
• position: sequential position of intervals in a contour.
• label: string label given to an interval.
• sample: number of samples in a contour.
• f0: F0 value at a given time.
• step: duration of subintervals.
• time: real time location of time-normalized samples.
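Because the report is a tab-separated table, it can be read with standard tools. The Python sketch below is an illustration only: the rows are invented, the presence of a header row naming the columns is an assumption, and the function name is hypothetical. It shows how the string chosen in the Unvoiced field (NA by default) could be mapped to a missing value when loading the data.

```python
import csv
import io

# Hypothetical report excerpt; a header row naming the columns is assumed.
REPORT = (
    "file\tposition\tlabel\tsample\tf0\tstep\ttime\n"
    "subj1_new\t1\tap\t1\t142.5\t0.0434\t0.0217\n"
    "subj1_new\t1\tap\t2\tNA\t0.0434\t0.0651\n"
)


def read_report(handle, unvoiced="NA"):
    """Parse the tab-separated report; unvoiced samples become None."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        f0 = None if row["f0"] == unvoiced else float(row["f0"])
        rows.append({
            "file": row["file"],
            "label": row["label"],
            "sample": int(row["sample"]),
            "f0": f0,
        })
    return rows


rows = read_report(io.StringIO(REPORT))
```

With a real report, the `io.StringIO` object would be replaced by an open file handle.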

5. Exploring the script

The time_normalized_f0 script offers the user some options regarding contour preprocessing. One of them is the smoothing of raw F0 contours. Smoothing can help mitigate micromelodic effects that might not be of interest to a researcher. Section 5.1 presents Praat’s smoothing algorithm and shows the effect of applying different degrees of smoothing to a single raw contour, as well as how the mean contour generated by the time_normalized_f0 script is affected by smoothing. The remainder of this section deals with two factors that can potentially influence the way mean time-normalized contours look when the technique is applied to a set of raw F0 contours. Section 5.2 discusses the effect of the number of samples taken in each segmented interval: the more samples we take, the more resolution we get. We try to establish when taking more samples stops paying off. In section 5.3, we gauge the effect that adding progressively more raw contours has on the mean time-normalized contour generated by the script. We suggest that researchers experiment with these parameters and observe the effect that each of them has on their own corpus.

5.1 Contour preprocessing: smoothing

Hirst (2005, 2011) suggests that raw F0 contours be analyzed as the final product of the interaction between macromelodic and micromelodic components. The macromelodic component “corresponds to the underlying laryngeal gesture” (Hirst 2011, p. 62). Micromelody can be understood as the local perturbations caused by segments whose articulation requires greater impediment to airflow, such as stops and fricatives, and which “are not perceived as contributing per se to the overall pitch pattern of the utterance” (Hirst 2005, p. 176). In general, research in prosody is interested in uncovering patterns in the macromelodic component that are assumed to be present in a series of raw F0 contours. In such a setting, micromelodic perturbations can be seen as noise rather than meaningful variation.

In order to counter micromelodic effects, smoothing procedures can be applied to raw contours prior to the application of the time-normalization procedure. A number of smoothing procedures exist that achieve different degrees of success in the task of getting rid of micromelodic perturbations. Our script offers the user the option of applying Praat’s built-in smoothing function to the contours to be analyzed. In the remainder of this section, we explain the principle behind Praat’s Smooth function, examine the effect of applying it to a single F0 contour, and show how changes in single curves affect the shape of the time-normalized contour generated by the script after averaging ten smoothed contours.

The Smooth function performs Fourier analysis on an F0 contour to find its frequency components.⁷ Low frequency components are associated with the slow-varying macromelodic component. High frequency ones are caused by the rapid variation typical of micromelodic effects. The uppermost panel in Figure 3 shows the spectrum of a raw F0 contour. It is possible to see frequency components of varying amplitude up to 25 Hz. The initial part of the same contour is shown in the time domain as the red line in Figure 4.

⁷ The function is not documented in detail in Praat’s help system. In a message sent to a mailing list, Praat’s creator provides the following explanation (Boersma 2011): “In the time domain, it’s a convolution with a Gaussian with the said bandwidth. That is, a multiplication in the frequency domain (of the Fourier-transformed pitch curve) with a shape of exp(-(frequency / bandwidth)^2). Before the filtering, the voiceless stretches are filled up with voicing by linear interpolation and (at the edges) constant extrapolation; after the filtering, these parts are made voiceless again.”

Smoothing is achieved through changes in the underlying spectrum of the raw F0 contour that try to get rid of some of its higher frequency components. This is done by multiplying the raw contour spectrum by a Gaussian function of variable bandwidth; a new contour is then resynthesized from the modified spectrum. The height of the Gaussian curve falls off quickly towards its tails, attenuating or altogether suppressing higher frequency components. The smaller the Gaussian bandwidth, the greater the attenuation. Figure 3 shows four spectra modified by multiplication by Gaussians of different bandwidths: 10, 7, 5 and 3 Hz. The positive halves of the Gaussians of said bandwidths are superimposed over the corresponding spectra. When using the Smooth function, the user sets the bandwidth value, expressed as a frequency value in Hertz.

Figure 4 shows the changes undergone in the time domain by the raw F0 contour of the phrase “A patarata”, spoken by a male speaker of Brazilian Portuguese, when different degrees of smoothing are applied to it. Bandwidth values of 10, 7, 5 and 3 Hz were used. Some micromelodic effects can be seen in the raw contour: a sudden F0 increase following occlusion release in the voiceless stops [p] and [t], and a ripple caused by the alveolar tap [ɾ]. These perturbations are not seen in the smoothed contours. As a side effect, this particular smoothing procedure seems to flatten out the contour, making peaks less sharp and valleys shallower. Smaller bandwidth values make this flattening effect more noticeable. This side effect can be explained by the slight amplitude attenuation of low frequency components in the spectra, especially those under 5 Hz. See the bottom panels in Figure 3 for evidence of this.

Ideally, the smoothed contour should be free of micromelodic perturbations while retaining essential macromelodic information. If raw and smoothed contours become too different from each other, one can suspect that macromelodic information may have been lost in the process. In order to quantify the loss of information caused by the smoothing process, the absolute difference between each F0 value in the raw and smoothed contours was calculated (each contour comprises 175 F0 values; the analysis frame is approximately 10 ms). Table 1 shows median and maximum absolute deviations, converted to the semitone scale. The data in the table indicate that deviations are not extreme, especially for larger values of the bandwidth parameter.
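The frequency-domain smoothing described in this section can be sketched in Python as follows. This is a minimal illustration, not Praat’s implementation: it uses a plain discrete Fourier transform on a fully voiced contour, applies the weight exp(-(frequency / bandwidth)^2) quoted from Boersma to each frequency component, and resynthesizes the contour. The function name, the 10 ms frame duration default and the synthetic test contour are assumptions, and the handling of voiceless stretches and contour edges performed by Praat is not reproduced.

```python
import cmath
import math


def gaussian_smooth(values, bandwidth_hz, frame_s=0.01):
    """Weight each frequency component by exp(-(f / bandwidth)^2) and resynthesize."""
    n = len(values)
    # Forward discrete Fourier transform (direct O(n^2) form, fine for short contours).
    spectrum = [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above the Nyquist frequency mirror the negative frequencies.
        freq = min(k, n - k) / (n * frame_s)
        spectrum[k] *= math.exp(-((freq / bandwidth_hz) ** 2))
    # Inverse transform; the weights are symmetric, so the result is real.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# Synthetic contour: 1 s at 10 ms per frame, so that DFT bin k corresponds to k Hz.
n_frames = 100
raw = [
    150.0
    + 10.0 * math.sin(2 * math.pi * 1 * t / n_frames)   # slow, "macromelodic" movement
    + 5.0 * math.sin(2 * math.pi * 20 * t / n_frames)   # fast, "micromelodic" ripple
    for t in range(n_frames)
]
smoothed = gaussian_smooth(raw, bandwidth_hz=5.0)
```

With a 5 Hz bandwidth, the 20 Hz ripple is multiplied by exp(-16) and virtually disappears, while the 1 Hz movement is multiplied by exp(-0.04), roughly 0.96. That small loss in the slow component is the kind of low-frequency attenuation that the text associates with the flattening of peaks and valleys.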


[Figure 3 image: five spectrum panels titled “No smoothing”, “Smoothing bandwidth 10 Hz”, “Smoothing bandwidth 7 Hz”, “Smoothing bandwidth 5 Hz” and “Smoothing bandwidth 3 Hz”, each plotting sound pressure level (dB/Hz, 100 to 140) against frequency (Hz, 0 to 25).]

Figure 3: Fourier spectrum showing individual frequency components of raw and smoothed F0 contours. The raw contour spectrum is the uppermost panel. Spectra of smoothed contours are shown below, with the bandwidth value indicated in the titles. Positive halves of Gaussians (dashed lines) of said bandwidths are superimposed over the corresponding spectra.


Table 1: Median and maximum absolute deviation for different values of the smoothing factor (in Hertz) in a single F0 contour. Deviation is expressed in semitones relative to F0 in the raw contour.

smooth level (Hz)   median deviation (st)   maximum deviation (st)
10                  0.121                   0.830
7                   0.179                   1.063
5                   0.253                   1.175
3                   0.430                   1.574

Successive decreases in bandwidth tend to increase the median deviation by 30 to 40% and the maximum by 10 to 25%. Maximum deviation coincides with the burst of the first [t]. In general, provided bandwidth values are kept above 5 Hz, deviations will be under a quarter of a semitone.
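The kind of deviation measure reported in Table 1 can be sketched as follows. The text does not spell out the conversion to semitones, so the formula used here, 12 · log2(smoothed / raw) taken in absolute value at each frame, is an assumption, as are the function name and the illustrative values.

```python
import math
from statistics import median


def semitone_deviation(raw, smoothed):
    """Median and maximum absolute deviation, in semitones, frame by frame."""
    deviations = [abs(12 * math.log2(s / r)) for r, s in zip(raw, smoothed)]
    return median(deviations), max(deviations)


# Hypothetical F0 values in Hz for three frames of a raw and a smoothed contour.
raw_f0 = [100.0, 100.0, 100.0]
smoothed_f0 = [100.0, 200.0, 100.0 * 2 ** (1 / 12)]
median_dev, max_dev = semitone_deviation(raw_f0, smoothed_f0)
```

In this toy example the three deviations are 0, 12 and 1 semitones, so the median is 1 semitone and the maximum is 12 semitones (one octave).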

In addition to testing the effect of smoothing on a single raw contour, we also tested how different degrees of smoothing change the overall shape of the time-normalized contour generated by averaging ten raw F0 contours. For this analysis, five samples were taken for each labeled interval.⁸ Figure 5 shows five mean time-normalized contours, one for the unsmoothed contours (condition labeled “none”) and the others for contours that were smoothed using bandwidth values of 10, 7, 5 and 3 Hz before the time-normalization analysis.

As in the single-contour analysis shown in Figure 4, applying smaller bandwidth values to individual F0 curves results in smoother mean time-normalized contours. The flattening of peaks and valleys can also be observed. Table 2 shows the median and maximum absolute deviation between the unsmoothed mean time-normalized contour and the mean contours generated from previously smoothed curves.⁹ As expected, median and maximum deviations increase when bandwidth values decrease. The median deviation values are similar to the ones in Table 1, although the maximum values are systematically smaller. Successive decreases in the value of the bandwidth parameter tend to increase the median deviation by 30 to 45% and the maximum by 25 to 35%.

Taking into consideration all of the data discussed in this section, bandwidth values between 5 and 10 Hz seem to be a good compromise between doing away with micromelodic effects and avoiding the flattening out of movements that may have intonational significance.

⁸ For this analysis, we segmented the phrase “A patarata” into five vowel-to-vowel (V-V) units: ap-at-aR-at-5p.
⁹ All time-normalized contours are composed of 25 F0 values and deviations are calculated over all values.

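[Figure 4 image: raw and smoothed F0 contours (110 to 170 Hz) of the phrase “A patarata” plotted against time (s, 0 to 0.85), with one line per smoothing factor (none, 10, 7, 5, 3) and segment labels below the contours.]

Figure 4: Smaller bandwidth values result in smoother contours and a smaller range of F0 movement.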


Table 2: Median and maximum absolute deviation for different values of the smoothing factor (in Hertz) in time-normalized F0 contours. Deviation is expressed in semitones relative to F0 in the raw contour.

smooth level (Hz)    median deviation (st)    maximum deviation (st)
10                   0.1084                   0.4411
7                    0.1904                   0.6047
5                    0.2979                   0.7946
3                    0.5249                   1.1644
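As a rough illustration of how deviation figures of this kind can be obtained, the sketch below expresses the difference between a smoothed and an unsmoothed contour in semitones and summarizes it by the median and maximum of its absolute value. The function names `st_deviation` and `summarize_deviation` and the example F0 values are hypothetical; they are not taken from the chapter's script or data.

```r
# Deviation, in semitones, of a smoothed contour relative to the raw one.
# Both arguments are numeric vectors of F0 values in Hz of equal length.
st_deviation <- function(f0_smoothed, f0_raw) {
  stopifnot(length(f0_smoothed) == length(f0_raw))
  12 * log2(f0_smoothed / f0_raw)
}

# Median and maximum of the absolute deviation over all contour points.
summarize_deviation <- function(f0_smoothed, f0_raw) {
  dev <- abs(st_deviation(f0_smoothed, f0_raw))
  c(median = median(dev), maximum = max(dev))
}

# Made-up values for illustration only (not data from the chapter).
raw      <- c(120, 135, 150, 140, 125)
smoothed <- c(122, 134, 146, 139, 127)
summarize_deviation(smoothed, raw)
```

Working in semitones rather than Hertz makes the deviations comparable across speakers with different F0 ranges, since a semitone is a ratio-based unit.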

[Figure 5 plot: mean time-normalized contours for smoothing conditions none, 10, 7, 5 and 3; vertical axis: F0 (Hz), from 120 to 160; horizontal axis: Normalized time (5 samples/V-V unit), from 0 to 25.]

Figure 5: Smaller bandwidth values result in smoother contours. As bandwidth values decrease, peaks are less sharp and valleys are shallower.

5.2 Effect of number of samples per interval

In this section, we explore how the time-normalized contour produced by the script changes as a result of the number of samples taken from the raw F0 contours in each user-defined interval. For this analysis, the following script parameters were kept constant: ten raw contours were averaged; raw contours were smoothed using a 7 Hz bandwidth; unvoiced intervals were quadratically interpolated; contours were all divided into five vowel-to-vowel (V-V) intervals: ap-at-aɾ-at-ɐp (comprising the noun phrase “A patarata”). The number of samples taken in each segmented interval was varied: 3, 5, 10 and 15, resulting in contours of length 15, 25, 50 and 75. The resulting time-normalized contours are shown in Figure 6.

As could be expected, increasing the number of samples per interval smooths the time-normalized contour, mainly rounding out sharp edges. Compare in Figure 6 how the valley and peaks go from slightly rough to smooth as the number of samples grows. This effect happens because the time step between samples decreases as the script takes more samples per interval. Table 3 shows the mean step duration for the different numbers of samples per interval tested here. Notice that the coefficient of variation is almost the same for all conditions tested.

Figure 6: More samples per interval add more detail to the contour, rounding sharp corners.

Table 3: Mean duration and coefficient of variation of step for the different numbers of samples taken per interval. More samples imply shorter steps.

samples    mean step (ms)    CV
3          55.2              24.9%
5          33.1              24.6%
10         16.5              25.5%
15         11                25.6%
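The relation between the number of samples and the step between them can be sketched in a few lines of R. The code below is only an illustration under stated assumptions: it places samples at the centre of n equal slices of each interval (the chapter's Praat script may place them differently), the interval boundaries and the toy contour are invented, and the names `sample_times` and `step_stats` are hypothetical.

```r
# Times at which n equally spaced samples would be taken inside each interval.
# ASSUMPTION: samples sit at the centre of n equal slices of the interval;
# the chapter's script may place them differently.
sample_times <- function(starts, ends, n) {
  unlist(Map(function(s, e) s + (seq_len(n) - 0.5) * (e - s) / n, starts, ends))
}

# Mean step (ms) and coefficient of variation of the step across intervals.
step_stats <- function(starts, ends, n) {
  step_ms <- 1000 * (ends - starts) / n
  c(mean_step_ms = mean(step_ms), cv = sd(step_ms) / mean(step_ms))
}

# Hypothetical boundaries (s) of five V-V intervals; not measured data.
bounds <- c(0.00, 0.15, 0.33, 0.50, 0.70, 0.85)
starts <- head(bounds, -1)
ends   <- tail(bounds, -1)

# Sample a toy raw contour (time in s, F0 in Hz) by linear interpolation.
raw_time <- seq(0, 0.85, by = 0.01)
raw_f0   <- 140 + 15 * sin(2 * pi * raw_time / 0.85)
normalized <- approx(raw_time, raw_f0, xout = sample_times(starts, ends, 5))$y

length(normalized)            # 5 intervals x 5 samples
step_stats(starts, ends, 5)
```

Under this assumption the step in each interval is the interval duration divided by n, so its coefficient of variation equals that of the interval durations whatever the value of n. This is consistent with the nearly constant CV values reported for the four conditions.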


Two factors should be considered when deciding the number of samples to be taken: the time step used by Praat’s F0 extraction function and the typical length of the intervals in the corpus being analyzed. Praat’s autocorrelation algorithm for F0 extraction defines its analysis step (i.e., the time gap between successive estimates of F0) based on a user-provided estimate of the minimum F0 value in the sound file being analyzed. Typical values for men and women are 70 Hz and 120 Hz, giving analysis steps of about 10 and 6 ms, respectively. In the scenario explored in this section, the intervals are coextensive with syllable-sized units, containing a consonant and a vowel. Interval duration in this case is around 150 ms. The balance to be achieved here is to sample enough that relevant details show up in the resulting time-normalized contour while avoiding sampling at intervals smaller than those used during the extraction phase. If the user is interested in comparing F0 patterns that develop over larger domains, like a word or a phrase, analysis intervals will of course be longer than those considered here. In this case, the user should estimate the average interval duration for their corpus and adjust the number of samples taken per interval so that the resulting sampling step is slightly larger than the analysis step used during the F0 extraction phase. Doing this requires keeping a record of the Pitch floor parameter value passed to Praat’s To Pitch (ac) function.
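This trade-off can be checked with a small calculation. The sketch below assumes that Praat's analysis step is 0.75 divided by the pitch floor, which is the behaviour the Praat manual describes for To Pitch (ac) when the time step argument is left at 0; this gives values close to the 10 and 6 ms mentioned above. The function names and the 150 ms interval duration are illustrative choices, not part of the chapter's script.

```r
# Default analysis step (in s) of Praat's To Pitch (ac) when "Time step" is 0.
# ASSUMPTION: step = 0.75 / pitch floor, as described in the Praat manual.
praat_step <- function(pitch_floor_hz) 0.75 / pitch_floor_hz

# Sampling step (in s) implied by taking n samples in an interval.
sampling_step <- function(interval_dur_s, n) interval_dur_s / n

# Does the time-normalization sample more finely than the F0 extraction did?
oversamples <- function(interval_dur_s, n, pitch_floor_hz) {
  sampling_step(interval_dur_s, n) < praat_step(pitch_floor_hz)
}

round(1000 * praat_step(c(70, 120)), 1)   # analysis steps in ms

# Compare the sampling step with the analysis step for a 150 ms interval.
n <- c(3, 5, 10, 15)
data.frame(
  samples     = n,
  step_ms     = 1000 * sampling_step(0.150, n),
  oversampled = oversamples(0.150, n, pitch_floor_hz = 70)
)
```

With a 70 Hz pitch floor and 150 ms intervals, only the 15-sample condition yields a sampling step shorter than the analysis step, so it is the one condition that would sample more finely than the extraction itself.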

5.3 Effect of number of contours averaged

In this section, we explore how the time-normalized contour produced by the script changes as a function of the number of contours averaged. For this analysis, the following script parameters were kept constant: raw contours were smoothed using a 7 Hz bandwidth, unvoiced intervals were quadratically interpolated, and five F0 samples were taken in each segmented interval. The number of contours fed to the script was varied: 5, 10, 20 and 40 contours. A list of 48 files (repetitions of the same carrier sentence containing the noun phrase “A patarata”) was randomized and the first 5, 10, 20 and 40 files on the list were processed in each script pass.

The resulting time-normalized contours are shown in Figure 7. There was no noticeable change in the overall shape of the contours, except perhaps at the fall of the second peak. Apart from that, the differences are in the overall height of the contour as a whole. Adding more raw contours seems to cause a slight raising of the resulting time-normalized contour. For each of the 25 time-normalized samples (horizontal axis in Figure 7), there are four normalized F0 values, one for each number of averaged contours tested (5, 10, 20, 40).


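[Figure 7 plot: mean time-normalized contours for nrep = 5, 10, 20 and 40 averaged contours; vertical axis: F0 (Hz), from 120 to 160; horizontal axis: Normalized time (5 samples/V-V unit), from 0 to 25.]

Figure 7: Effect of number of contours.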

To quantify the amount of variation caused by the number of contours, the coefficient of variation¹⁰ (CV) was calculated for each time-normalized sample. The median CV across all samples was 1.12%, with a maximum of 1.62%. This result suggests that successive doublings in the number of contours cause little variability in the time-normalized contours. At least for the subjects that provided the sentences used here, between 5 and 10 repetitions per condition seems to be a reasonable choice. Since the acquisition and labeling of data can be time-consuming, knowing how much data is enough to get a representative pattern is helpful. As different subjects may exhibit different behavior in this respect, it is advisable to run preliminary tests like the ones shown in this section before settling on a sample size.

10 CV is a relative measure of dispersion. It makes it possible to compare standard deviations of samples with different means. It is computed by dividing the sample standard deviation (SD) by its mean (SD/mean).
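The per-sample CV described above can be sketched in R as follows. The contours here are simulated purely for illustration, and the object names (`cv`, `contours`, `cv_per_sample`) are hypothetical; only the SD/mean definition comes from the text.

```r
# Coefficient of variation: standard deviation divided by the mean.
cv <- function(x) sd(x) / mean(x)

# Hypothetical mean contours: one row per time-normalized sample (25 rows),
# one column per number of averaged contours. Values are simulated, with
# less noise when more contours are averaged.
set.seed(1)
base <- 140 + 12 * sin(seq(0, 2 * pi, length.out = 25))
contours <- sapply(c(n5 = 5, n10 = 10, n20 = 20, n40 = 40),
                   function(n) base + rnorm(25, sd = 6 / sqrt(n)))

# One CV per time-normalized sample, computed across the four conditions.
cv_per_sample <- apply(contours, 1, cv)
c(median = median(cv_per_sample), maximum = max(cv_per_sample)) * 100  # in %
```

Computing the CV row by row keeps the comparison local to each point of the normalized time axis, so that a high F0 region and a low F0 region are judged on the same relative scale.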


6. Data analysis

This section demonstrates how to use the R programming language to analyze and plot the data generated by the Praat script. First, we investigate the contents of the plain text file case-study.txt. Data in the file is arranged in a tabular structure with tab characters separating columns. Table-like data such as this is represented inside R as a data frame object, which can have columns of data of different types, such as strings, integers, floating-point numbers and factors. In R, we can load the file as a data frame and then print its first lines to get a sense of the data file.

library(ggplot2)
library(dplyr)
library(reshape2)
tnf0