Automatic Speech Translation: Fundamental Technology for Future Cross-Language Communications 9782919875023

Automatic Speech Translation introduces recent results of Japanese research and development in speech translation and sp




Table of contents :
Cover
Half Title
Title Page
Copyright Page
Table of Contents
Preface to the Series
Preface
1. Introduction to Speech Translation
1.1. Introduction
1.2. Configuration of Automatic Speech Translation Systems
1.3. Requirements for an Automatic Speech Translation System
1.4. History in Brief
2. Speech Recognition
2.1. Introduction
2.2. Fundamental Concepts of Speech Recognition
2.2.1 The Concept of Speech Recognition
2.2.2 The Problem of Speech Recognition
2.3. Speech Pattern Representation
2.3.1 Characteristics of the Japanese Speech Signal
2.3.2 Representation of Acoustical Patterns of Speech
2.3.3 Signal Processing and Speech Analysis Methods
2.3.4 Discrete Representation of the Speech Pattern
2.3.5 Speech Units
2.4. Phoneme-Based HMM Phoneme Recognition
2.4.1 The Hidden Markov Model
2.4.2 Discrete HMM Phoneme Model
2.4.3 Continuous-Mixture HMM
2.4.4 Hidden Markov Network
2.4.5 Successive State Splitting Algorithm
2.5. Continuous Speech Recognition
2.5.1 Approach to Large-Vocabulary Continuous Speech Recognition
2.5.2 Measure of Task Complexity
2.6. HMM-LR Continuous Recognition
2.6.1 Outline of HMM-LR
2.6.2 Speech Recognition Using Context-free Grammar
2.6.3 LR Parsing
2.6.4 Generalized LR Parsing
2.6.5 Operation of HMM-LR Speech Recognition
2.6.6 Japanese Speech Recognition System by HMM-LR
2.6.7 Sentence Recognition Using Two-level LR Parsing
2.7. Speaker Adaptation
2.7.1 Speaker Adaptation by Vector Quantization
2.7.2 Speaker Adaptation by Vector Field Smoothing
2.7.3 Speaker Adaptation Based on Vector Field Smoothing with Continuous Mixture Density HMM
2.7.4 Speaker Adaptation of Hidden Markov Network
2.8. Speaker-Independent Speech Recognition
2.9. Performance Score of Continuous Speech Recognition
3. Language Translation of Spoken Language
3.1. Problems in Spoken Language Translation
3.2. Intention Translation
3.3. Unification-Based Utterance Analysis
3.3.1 Basic Concept of Parsing
3.3.2 Unification-Based Utterance Analysis
3.3.3 HPSG Style Grammar for Spoken Japanese
3.3.4 Syntactic and Semantic Constraints
3.3.5 Parsing Based on Unification Grammar
3.3.6 Resolution of Zero-Pronoun
3.3.7 Ruling out the Erroneous Spoken Input
3.3.8 Experimental Results
3.4. Utterance Transfer
3.4.1 The Transfer Process
3.4.2 The Use of Domain and Language Knowledge
3.5. Utterance Generation
3.5.1 Language Generation Based on Feature Structure
3.5.2 Knowledge Representation of Phrase Description
3.5.3 Generation Algorithm
3.5.4 Towards Efficient Generation
3.6. Contextual Processing Based on Dialogue Interpretation
3.6.1 Plan Recognition
3.6.2 Dialogue Interpretation
3.6.3 Contextual Processing
3.7. New Approach for Language Translation
3.7.1 Example-Based Language Translation
4. Speech Synthesis
4.1. Introduction
4.2. Speech Synthesis by Rule
4.2.1 Outline of Speech Synthesis by Rule
4.2.2 More Natural Sounding Speech
4.3. Speech Synthesis Using a Non-uniform Speech Unit
4.3.1 ATR-v Talk System
4.3.2 Selection of Appropriate Speech Unit
4.3.3 Phonetic Tagging of Speech Data Set
4.3.4 Unit Combination Design
4.3.5 Unit Segment Reduction
4.4. Prosody Control
4.4.1 Segmental Duration Control
4.4.2 Amplitude Control
4.4.3 Fundamental Frequency Control
4.5. Voice Conversion
4.5.1 Voice Conversion Based on Vector Quantization
4.5.2 Generation of Mapping Codebook
4.5.3 Experiment on Voice Conversion
4.5.4 Cross-language Voice Conversion
5. Experimental System of Speech Translation
5.1. ASURA
5.1.1 Speech Recognition
5.1.2 Spoken Language Translation
5.1.3 Performance
5.2. International Joint Experiment on Interpreting Telephony
5.3. Intertalker
5.3.1 Overview
5.3.2 Speech Recognition
5.3.3 Language Translation
5.3.4 Speech Synthesis
5.3.5 Performance
6. Future Directions
6.1. Introduction
6.2. Future Directions of Speech Translation
6.2.1 Recognition of Spontaneous Speech
6.2.2 Prosody Extraction and Control in Speech Processing
6.2.3 Translation of Colloquial Utterance
6.2.4 Integrated Control of Speech and Language Processing
6.2.5 Mechanism of Spontaneous Speech Interpretation
6.3. International Cooperation
References
Index


Japanese Technology Reviews Editor in Chief Toshiaki Ikoma, University of Tokyo

Section Editors Section A: Electronics Toshiaki Ikoma, University of Tokyo

Section B: Computers and Communications Kazumoto Iinuma, NEC Corporation, Kawasaki and Tadao Saito, University of Tokyo

Section C: New Materials Hiroaki Yanagida, University of Tokyo and Noboru Ichinose, Waseda University, Tokyo

Section D: Manufacturing Engineering Fumio Harashima, University of Tokyo

Section E: Biotechnology Isao Karube, University of Tokyo; Reiko Kuroda, University of Tokyo

GENERAL INFORMATION

Aims and Scope
Japanese Technology Reviews is a series of tracts which examines the status and future prospects for Japanese technology.

Automatic Speech Translation Fundamental Technology for Future Cross-Language Communications

Editor in Chief Toshiaki Ikoma, University of Tokyo

Section Editors Section A: Electronics Section B: Computers and Communications

Toshiaki Ikoma, University of Tokyo; Tadao Saito, University of Tokyo; Kazumoto Iinuma, NEC Corporation, Kawasaki

Section C: New Materials Hiroaki Yanagida, University of Tokyo Noboru Ichinose, Waseda University, Tokyo

Section D: Manufacturing Engineering Fumio Harashima, University of Tokyo

Section E: Biotechnology

Isao Karube, University of Tokyo Reiko Kuroda, University of Tokyo

Section B: Computers and Communications

Volume 10
Machine Vision: A Practical Technology for Advanced Image Processing
Masakazu Ejiri

Volume 15
Cryptography and Security
Edited by Shigeo Tsujii

Volume 16
VLSI Neural Network Systems
Yuzo Hirai

Volume 28
Automatic Speech Translation: Fundamental Technology for Future Cross-Language Communications
Akira Kurematsu and Tsuyoshi Morimoto

Automatic Speech Translation Fundamental Technology for Future Cross-Language Communications

by Akira Kurematsu University of Electro-Communications, Tokyo, Japan

and Tsuyoshi Morimoto ATR Interpreting Telecommunications Laboratories, Kyoto, Japan

CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor & Francis Group, an informa business

First published 1996 by OPA (Overseas Publishers Association)
Published 2021 by CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 1996 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
ISBN 13: 978-2-919875-02-3 (pbk)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace

apply to the capability of machine translation in the speech translation system. The automatic speech translation system will receive feedback from the speaker and hearer in the understanding of the dialogue, which would be difficult for a text machine translation system. The hearer will be able to ask questions when he/she cannot understand the meaning of the output synthesized speech.

3. Handling spoken language is very different from handling written text. Although the sentences in spoken dialogues are usually short and the sentence structure is not particularly complex, spoken dialogues include elliptic and anaphoric expressions. They also tend to include many syntactically ill-formed expressions. In addition to this ungrammatical nature, a spoken language translation system must tolerate the errors and ambiguities caused by the speech recognition component, which are difficult to detect. To tolerate these inevitable recognition errors and ambiguities, the language parsing process that follows speech recognition must be robust and efficient; that is, it must be capable of selecting optimal results among multiple candidates.

4. One basic requirement for a spoken language translation system is real-time operation. The automatic speech translation system simply cannot take more than several seconds to output interpreted speech. The hearer is waiting for the output from the system, and a consecutive reply is required to respond to the previous utterance. This means that high-speed processing of both speech recognition and language translation is required. An efficient algorithm is crucial if an exponential increase in computation time is to be avoided.

5. Although the ultimate goal is for the automatic speech interpretation system to be used for universal dialogue in an unlimited domain, present technology limits us to specific, task-oriented domains. Only by sharply constraining the specific applications can we make early progress toward an automatic speech interpretation system. Step-by-step upgrading of the technology is a reasonable strategy.

6. The automatic speech interpretation system will be used by monolingual users; that is, the speaker will not know the target language and the hearer will not understand the source language. This situation also imposes stringent requirements on the accuracy of interpretation to prevent misunderstanding. Clarification of the speech recognition and language translation results will be needed. The ease of use, or "user friendliness", of the system, incorporated into an intelligent human interface, is very important.

1.4. History in Brief

A brief history of research into automatic speech translation systems throughout the world is given below.

In 1983, NEC Research Laboratories in Japan dramatically demonstrated a laboratory model of an automatic speech translation system at Telecom-83 in Geneva. It was the first trial of real-time bi-directional speech translation between selected languages, namely between Japanese and English and between Japanese and Spanish. Although it was a small and quite limited demonstration of a conversation about map inquiries, it attracted the attention of the audience and of people in related fields. Dr. Kouji Kobayashi, the president of NEC at that time, advocated the necessity of researching automatic telephone interpretation for future world-wide telephone communication using different languages.

In Japan, the general outlook for the research and development of an automatic telephone translation system was reported in 1986 under the sponsorship of the Ministry of Posts and Telecommunications. It described the necessity for long-term research and development in various related fields such as speech recognition, machine translation, speech synthesis, artificial intelligence and computer science. ATR Interpreting Telephony Research Laboratories was established in 1986. Since then, the research and development of automatic speech translation has been extended to several institutes around the world, inspired by the research activities of ATR in Japan.

Speech translation between English and French based on the phrase-book approach, i.e., stored set phrases, was conducted at British Telecom Research Laboratories in 1987 [Stentiford-87]. The system was based on a set of more than 400 common business phrases stored in memory. A phrase-book message retrieved through crude keyword speech recognition is sent to a distant computer system, which translates it with reference to the phrase-book and outputs it through a speech synthesizer. The advantage of the phrase-book approach is the good quality of the translation. However, the disadvantage of this approach is the inflexibility of the expressions.

The Toshiba Research and Development Center reported the results of an experiment in real-time bi-directional machine translation of a keyboard conversation, in the manner of Unix talk, between Japanese and English [1988]. It was shown that frequent ellipses and unknown words caused syntax errors and that there was some meta-dialogue, i.e., dialogue about the previous interchanges. The results of the experiment gave an insight into the desirable characteristics of an interactive dialogue translation system, although speech input and output were not included in the experiment.

At Carnegie Mellon University (CMU), a speech translation system that can deal with spoken utterances input via a microphone was developed in 1988 [Tomita-88]. The domain was a simple doctor-patient conversation. For the linguistic part, the Universal Parser Architecture, based on a knowledge-based approach, was used. Domain-specific knowledge was efficiently and dynamically utilized at run time. Noisy phoneme sequences of spoken utterances were handled by extending the run-time generalized LR parser. Syntactic parsing results in an Interlingua, a language-independent but domain-specific representation of meaning. Generation of the target language from an Interlingua representation is undertaken by mapping the Interlingua representation into a frame structure of the target language and then generating the sentence. Although it covered a limited domain on a small scale, the system hinted at the possibilities of a speech translation system.

At ATR Interpreting Telephony Laboratories, an experimental speech translation system that translates spoken Japanese into English (SL-TRANS) was developed in 1989 [Kurematsu-92(a) and (b)] [Morimoto-92]. It aimed at total speech translation with a constrained speaking style of phrase-by-phrase separation. The task domain was inquiries on international conference registration, with a vocabulary of about 500 words. After that, many improvements in mechanism and efficiency were made. Details will be described in the following chapters of this book.

There were several new approaches to speech translation at CMU. Direct memory access translation for speech input was proposed.

The phoneme-based Direct Memory Access Translation System (DMTRANS) integrates phonological and contextual knowledge for speech understanding in a massively parallel activation-marker-passing network. A new model of speech-to-speech dialog translation (ΦDMDIALOG) was proposed at CMU [Kitano-91]. ΦDMDIALOG provides a scheme of nearly concurrent parsing and generation. Several types of markers are passed in a memory network, which represents knowledge from the morphophonetic level to the discourse level, as well as world knowledge. The ability of these memory network models to be expanded to larger systems will require further study.

CMU also developed another speech-to-speech translation system (JANUS) [Waibel-91]. The JANUS system was implemented for the Conference Registration Task, with a vocabulary of about 500 words, between English and Japanese and between English and German. For acoustic modelling, an LVQ algorithm with context-dependent phonemes was used for speaker-independent recognition. The search module of the recognizer built a sorted list of sentence hypotheses. The N-best algorithm was implemented with up to 100 hypotheses. Trigram language models were introduced using word-class-specific equivalence classes (digits, names, towns, languages, etc.). As for the MT components, the generalized-LR-parser-based syntactic approach was used following Tomita's system, and the most useful analysis from the module was mapped onto a common meaning representation. Improvements in robustness and extendability were undertaken continuously.

The NEC Research Laboratory developed a two-way speech translation system (INTERTALKER) in 1991 [Hatazaki-92]. INTERTALKER can deal with speech translation between Japanese and English with a vocabulary size of about 500 words for a sight-seeing guide domain. Multiple-language translation is available through the use of language-independent interlingua expressions. The acceptable expressions of the language are fixed in form and the sentence form is restricted. A detailed description will be given in Chapter 5 of this book.

At ATR, an improved version of SL-TRANS, called ASURA, was developed in 1993. It integrated the technologies of speech recognition, language translation and speech synthesis [Morimoto-93]. ASURA is a Japanese-to-English/German speech translation system. The sentence uttered by the Japanese speaker is recognized by a speaker-adaptive continuous speech recognition system (ATREUS) [Nagai-93]. The system incorporated an analysis of the expressions of linguistic intention of utterances. The task domain was inquiries and explanations regarding an international conference.

In applying this translation method to a goal-oriented dialogue corpus, the experimental system showed its effectiveness in translating dialogues with ordinary conversational expressions from Japanese into English and German.

An effort was made to build a restricted-domain spoken language translation system (VEST) at AT&T Bell Laboratories in 1991 [Roe-91]. VEST is a bi-directional spoken language translation system for English and Spanish. It recognizes several speakers and is limited to a few hundred words. The key new idea is that speech recognition and language analysis are tightly coupled by using the same language model, an augmented phrase structure grammar, for both. A prototype of the speech translation system was developed on a small scale.

SRI investigated a prototype speech translation system (SLT) [Rayner-93]. The SLT system can translate queries from spoken English to spoken Swedish in the domain of air travel information systems. The speech recognizer used is a fast version of SRI's DECIPHER speaker-independent continuous speech recognition system [Murveit-91]. It uses context-dependent phonetic-based Hidden Markov Models (HMMs). Text processing for language is performed by the SRI Core Language Engine (CLE), a general natural-language processing system [Alshawi-92]. The English grammar is a large general-purpose feature grammar, which has been augmented with a small number of domain-specific rules. The Swedish grammar has been adapted fairly directly from the English one. Each CLE grammar associates surface strings with representations in Quasi-Logical Form.

CHAPTER 2

Speech Recognition

2.1. Introduction

Speech recognition refers to the automatic recognition and understanding of a human utterance by machine. More specifically, speech recognition is the task of converting the acoustical speech signal of an utterance into a sequence of linguistic symbols. Speech recognition plays the role of front-end processing in automatic speech translation. Since speech recognition functions to convert a speech signal to text, it is the most important component in automatic speech translation. In speech translation, speaker-independent continuous speech recognition, which recognizes speech uttered continuously, is required. Recently, great progress has been made in speech recognition, owing to advances in related technologies. Three major contributions are computing technologies, the availability of extensive speech databases, and the creation of powerful algorithms based on a statistical approach. In this chapter, the essentials of speech recognition, especially continuous speech recognition, will be presented.

2.2. Fundamental Concepts of Speech Recognition

2.2.1. The Concept of Speech Recognition

Speech recognition is the process of converting a continuous speech signal into a discrete symbol sequence. In other words, speech recognition can be expressed as the process of searching for the most probable candidate among many hypotheses of the uttered speech. In continuous speech recognition, the number of hypotheses is enormously large, and it is difficult to check all of them. Typical continuous speech recognition consists of five major steps: signal analysis, acoustical speech modelling, language modelling, search and acoustical pattern matching, and language processing.

(Figure 2.1: input speech is converted by acoustic analysis into feature parameters; a search process scores their similarity against an acoustic model under the constraints of a language model and outputs the recognition results.)

Figure 2.1. General principle of continuous speech recognition.

The general principle of continuous speech recognition is depicted in Figure 2.1. Speech analysis extracts important features from the speech waveform. Acoustical speech modelling is the process of expressing a basic speech unit such as the phoneme. Language modelling represents the language characteristics needed to recognize the spoken language. Search and acoustical pattern matching are used to search the candidates and select the most probable one. Language processing deals with the language constraints of syntax, semantics and pragmatics.

2.2.2. The Problem of Speech Recognition

In speech translation, large-vocabulary recognition of continuous utterances is essential. The vocabulary size that will be required is at least several thousand words, even for a goal-oriented dialogue in a limited domain. For the recognition of large-vocabulary continuous speech, the problem of attaining high recognition performance must be solved in order to lessen the burden of language processing. Continuous speech recognition of input separated into phrases is the first practical step towards the ultimate, fully continuous speech recognition system. The recognition of phrases or sentences based on phoneme recognition has been investigated.

The problems of how the speech recognition system is to be used in the real field are referred to as pattern recognition problems. The principal concern of the speech recognizer is how to understand the input message spoken by a human voice. Key aspects that the speech recognizer must address to overcome these problems are:

1. Input speech is pronounced continuously, and contextual variations in the characteristics of the speech signal are large.
2. Much variance exists in the idiosyncrasies of individual speakers' characteristics.
3. The environment of the speech input will vary in terms of microphone characteristics, environmental noise, and communication line characteristics such as telephone circuits.
4. Ambiguities inherently exist in spoken language.

2.3. Speech Pattern Representation

2.3.1. Characteristics of the Japanese Speech Signal

Generally, speech recognition performance is not greatly influenced by differences in phonetic and linguistic characteristics. However, it is said that it differs fairly substantially between languages because of the variety of phonetic and linguistic phenomena. Before describing the method of speech recognition, the characteristics of the speech signal in Japanese are described in this section. An example of a Japanese speech waveform (/soredewadozo/) is shown in Figure 2.2. The periodic characteristics can be seen in the vowel parts such as /o/, /e/ or /a/. In the fricative part such as /s/, the waveform shows a noise-like, irregular form.

Figure 2.2. Example of the Japanese speech waveform /soredewadozo/.

In the plosive part such as /d/, the plosion can be seen. It is common in speech recognition to treat the speech signal in terms of spectral features. The characteristics of normal Japanese speech, from the standpoint of speech recognition, can be described as follows:

1. The number of phonemes is small compared to European languages. There are five vowels: /a/, /i/, /u/, /e/, /o/. The total number of phonemes is around 20, and they are categorized as unvoiced plosives, voiced plosives, unvoiced fricatives, voiced fricatives, nasals and semivowels.

2. The number of syllables is small: there are about 100 syllables in Japanese. In Japanese there is a strong phonotactic constraint that consonants basically do not concatenate. Therefore, the Japanese syllable is composed of a sequence of a consonant and a vowel, or of a vowel only, and consonants do not appear at the end of the syllable.

3. There are peculiar sounds, e.g. double consonants as in /kitta/, the nasal /N/, and long vowels as in /kiita/. These sounds have to be discriminated by reference to phoneme duration.

4. An accent produced by differences in pitch frequency appears. For example, two words with the same segmental form, such as /hashi/ (bridge) and /hashi/ (chopsticks), have to be discriminated as having different meanings.

5. The phenomenon of sound change by liaison appears when words are concatenated. For example, if /kara/ (vacant) and /hako/ (box) are concatenated, the result is pronounced /karabako/ (empty box).

6. Devocalization happens under certain conditions, e.g. /h(i)kari/. The vowels /i/ or /u/ followed by unvoiced consonants are usually devoiced.

Although the small number of phonemes in Japanese helps in discriminating between phonemes, it also causes greater variety in the acoustic characteristics. Furthermore, the many homonyms in Japanese make it more difficult to understand the meaning of the recognized result. These kinds of language-specific knowledge about the characteristics of speech are important in developing a speech recognition system for Japanese.

2.3.2. Representation of Acoustical Patterns of Speech

Although a speech signal is non-stationary, it can be regarded as stationary over a short time period. This is because the movement of the organs of articulation is gradual. Spectrum analysis is therefore performed over a short duration, for instance 15 or 30 ms. Such short-time speech information is called a frame. By shifting the frame, we derive a time sequence of feature parameters. A window function is applied to the speech waveform.

2.3.3. Signal Processing and Speech Analysis Methods

The signal processing and speech analysis methods used for speech recognition follow several steps, as described below:

1. Sampling
   An analog speech signal picked up by a microphone is sampled and converted into a digital signal by an A/D converter. The sampling frequency is 12 kHz and the speech is quantized to 16 bits.

2. Higher-frequency enhancement
   Since the energy of the speech signal $\{x_t\}$ ($t$ indicates time) is weighted more toward lower frequencies, the higher frequencies are enhanced to increase the accuracy of speech analysis by means of a linear differential filter:

   $$x'_t = x_t - a\,x_{t-1} \qquad (2.1)$$

   The coefficient $a$ is taken as 0.98.

3. Data window
   Spectrum analysis of the speech signal is calculated by extracting a short-time signal (20 ms), which is multiplied by a window function:

   $$\hat{x}_t = h_t\, x'_t \qquad (2.2)$$

   The window function is the Hamming window $\{h_t\}$, expressed as

   $$h_t = 0.54 - 0.46\cos\!\left(\frac{2\pi t}{N}\right), \qquad t = 1, 2, \ldots, N \qquad (2.3)$$

   where $N$ is the window length. The speech signal is analyzed frame by frame, with the window shifted at short intervals, for instance 5 ms.

4. Auto-correlation analysis
   Before linear predictive coding (LPC) is performed, the auto-correlation functions are calculated. From the speech signal $\{x_1, x_2, \ldots, x_N\}$ the auto-correlation function is

   $$v_\tau = \sum_{t=1}^{N-\tau} x_t\, x_{t+\tau} \qquad (2.4)$$

5. LPC analysis
   Linear predictive coding (LPC) is based on the model that the speech signal $x_t$ is predicted from the past $p$ samples:

   $$x_t = -a_1 x_{t-1} - a_2 x_{t-2} - a_3 x_{t-3} - \cdots - a_p x_{t-p} \qquad (2.5)$$

   The coefficients $\{a_i\}$ are determined so as to minimize the mean square error. The value of $p$ is taken as 16. The solution for $\{a_i\}$ is obtained by solving the so-called Yule-Walker equation:

   $$\begin{bmatrix} v_0 & v_1 & v_2 & \cdots & v_{p-1} \\ v_1 & v_0 & v_1 & \cdots & v_{p-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ v_{p-1} & v_{p-2} & v_{p-3} & \cdots & v_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = - \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} \qquad (2.6)$$

   $$\sigma^2 = \sum_{i=0}^{p} a_i v_i \qquad (a_0 = 1) \qquad (2.7)$$

   The matrix in (2.6) is called a Toeplitz matrix. There are recursive procedures, the Levinson-Durbin algorithm and the Saito-Itakura algorithm, for solving the equation.

6. LPC cepstrum
   The cepstrum is the inverse Fourier transform of the log spectrum:

   $$\log f(\omega) = \sum_{n} c_n e^{-jn\omega} \qquad (2.8)$$

When the power spectrum $f(\omega)$ is of the all-pole type, a recursive relationship holds between the prediction coefficients $\{a_i\}$ and the cepstrum coefficients $\{c_n\}$.

(Figure 3.12 lists three transfer rules; in the legible Rule-3, a transfer rule for IFT, an input with IFT INFORM is rewritten to IFT REQUEST when the predicate of its object is an action.)

Figure 3.12. Transfer rules. (From reference [Suzuki-92].)

case role constituents. The second spells out the description of a sentence according to the rules of the target language's writing style and morphology.

3.5.1. Language Generation Based on Feature Structure

In accordance with the grammatical framework of analysis and transfer, language generation based on the unification algorithm is carried out.

Figure 3.13. English generation by phrase description.

Language generation requires linguistic knowledge of both specific expression-like idioms and general grammatical constructions. Surface strings should be efficiently produced by applying this knowledge. A language-generation method for feature-structure-based unification grammar has been proposed [Kikui-92]. Linguistic generation rules can be defined for a comparatively large unit, because the variety of sentences to be generated is not very large. A generation rule is described in a phrase description, which is composed of a syntactic phrase structure, syntactic and semantic constraints, and application constraints. An outline of the generation process is shown in Figure 3.13. A set of trees annotated with feature structures is employed to represent generation knowledge. Each tree represents a fragment of a syntactic structure and is jointly represented with a semantic structure. Idiomatic constructions can be described by making a tree that contains lexical specifications and is linked with a specific semantic structure. Generation is executed by successively activating phrase descriptions that can subsume the whole semantics of the input feature structure. The generation system is based on Semantic Head Driven Generation [Shieber-89], which is an efficient algorithm for the unification-based formalism. A multiple index network of feature structures is used to efficiently select relevant generation knowledge from the knowledge base.

3.5.2. Knowledge Representation of Phrase Description

The generation process will be explained using the feature structure in Figure 3.14 as an example. This feature structure is composed of an IFT (illocutionary force type) intention part, which expresses a request from the speaker to the hearer, and a propositional content part, which represents the content of the input sentence. The simplest way to implement the generation process is to link the input feature structure and a language expression as a pair. Phrase descriptions are prepared for this purpose. An example of phrase descriptions is shown in Figure 3.15. The generation system generates word sequences that coincide with the input semantic structure by selecting phrase descriptions. For example, when the input shown in Figure 3.14 is given, the output sentence shown in No. 1 of Figure 3.15 is generated.

[[reln REQUEST]                      ; (Intention)
 [action [[reln SEND]                ; (Content of Request)
          [agen *HEARER*]
          [obje *PAPERS*]
          [recp *SPEAKER*]]]]

Figure 3.14 Example of simplified feature structure after transfer process.

Number | Semantic Feature Structure | Language Expression
1  | [[reln REQUEST] [action [[reln SEND] [agen *HEARER*] [obje *PAPERS*] [recp *SPEAKER*]]]] | "Please send me the papers."
2  | [[reln REQUEST] [action [[reln SEND] [agen *HEARER*] [obje *ANNOUNCEMENT*] [recp *SPEAKER*]]]] | "Please send me the announcement."
50 | <Feature structure to express "gratitude"> | "Thank you."
51 | <Feature structure to express "greeting"> | "How do you do?"

Figure 3.15. Example of phrase description.

(a-1) [[sem [[reln REQUEST] [action ?X]]] [syn [[cat S-TOP]]]]
(a-2) [[sem ?X] [syn [[cat VP] [vform BASE]]]]
(b)   [[sem [[reln SEND] [agen ?X] [recp ?Y] [obje ?Z]]] [syn [[cat VP] [vform BASE]]]]  "send"
(b-2) [[sem ?Y] [syn [[cat NP] [case acc]]]]
(b-3) [[sem ?Z] [syn [[cat NP] [case acc]]]]
(c)   [[sem *SPEAKER*] [syn [[cat NP] [case acc]]]]
(d)   [[sem *PAPERS*] [syn [[cat NP] [case acc]]]]

Figure 3.16. Phrase description in tree expression.

The common parts are extracted and marked as variables with the mark "?" for greater efficiency. The root indicates the semantic feature and the leaves indicate the description patterns. By applying the same phrase description in various situations, an unlimited variety of sentences can be generated. The PD is annotated with feature structures. The PD consists of a structure definition part and a feature structure annotation part. The structure definition part defines the structure of a tree, expressed by a list in which the first element corresponds to a mother node and the remaining elements to daughters. Each daughter may be a tree or a simple node. The annotation part specifies the feature structure symbol of each element. A description of a feature structure contains tags or variables. Each node should have a semantic and a syntactic feature structure. The semantic feature on the root node of a PD represents the semantics of the PD.

3.5.3. Generation Algorithm

Input to the generation process is a feature structure. The task of the generation system is to generate a syntax tree corresponding to the input semantic feature structure. The generation algorithm proceeds in the following steps:

1. Step 1: The initial node is constructed by combining the input feature structure and the initial syntactic feature.
2. Step 2: The PDs whose semantic structures subsume the semantic structure of the expanding node are activated. A tree that has a root node satisfying the constraints with the initial node is picked from the PDs. The root node of the selected PD and the initial node are unified by copying this tree.
3. Step 3: If all leaf nodes are lexically identified, the process terminates.
4. Step 4: For each unlexicalized leaf node, a tree is selected from the PDs that has a root node satisfying the constraints with it. As in Step 2, the leaf node and the selected PD are unified by copying this tree.

The steps described above can be explained in more detail using the input feature structure shown in Figure 3.14 as an example. The initial node, shown in Figure 3.17, is obtained by combining the input of Figure 3.14 with the initial syntactic category (S-TOP). The phrase description (1-1) in Figure 3.15 is selected, and the result of unification is obtained as shown in Figure 3.18. After the unification, the feature of the variable (?X) of the nodes (1-1) and (1-2) is converted to the action feature of the input semantic structure. Since a leaf node of the obtained tree is not yet lexicalized, the process proceeds to Step 4. By selecting tree (2) in the PD, this node is unified with the selected root node.

Figure 3.17. Feature structure of the initial node.

Figure 3.18. Syntax structure immediately after applying the tree of Figure 3.16(a).

In the same manner, the PDs (3) and (4) are applied to the nodes (2-2) and (2-3). The final sentence structure is obtained as shown in Figure 3.19. By traversing the tree from left to right, the sentence "Please send me the papers." is generated.

Figure 3.19. Completed syntactic structure (partially shown).

3.5.4. Towards Efficient Generation

Feature-structure-directed generation is useful for a bidirectional grammar, allowing analysis and generation to share the same grammar in the two modules. An appropriate auxiliary phrase is added if necessary. Typed feature structures are used to describe the control of the generation process in a declarative way. The disjunctive feature structure is introduced to avoid the inefficiency of making multiple copies of the phrase structure when the generation process encounters multiple rule candidates.

3.6. Contextual Processing Based on Dialogue Interpretation

Dialogue interpretation using context-sensitive processing is effective in spoken language interpretation. How to select the correct speech recognition hypotheses for robust spoken language processing is a significant problem in a speech translation system. How to disambiguate ambiguous sentences and how to predict the next utterance are important issues in dialogue processing. The plan recognition model for dialogue understanding assumes that an utterance is an action in the conversation.

3.6.1. Plan Recognition

In devising a language translation system for spoken dialogue, one of the problems to be solved is how to adequately translate the underlying meaning of the source utterance, or the speaker's intention, into the target language. In spoken dialogue, smoothness of communication depends on understanding the speaker's underlying meaning. Considerable research has been focused on a plan recognition model for solving ellipses or phrases, or choosing an appropriate translated word [Iida-92]. The model consists of plans, objects and inference rules. Plans for task-oriented knowledge and pragmatics are used in the model: a domain plan, which manages the structure of domain-dependent action hierarchies; a dialogue plan, which manages global changes of topics in a domain; a communication plan, which represents the sequence of information exchange; and an interaction plan, which manages demand-response pairs in the dialogue.

(Figure 3.20 relates four levels of plans to the utterances of the example dialogue below: a dialogue plan for the global structure of the dialogue, a domain plan of domain-specific actions and objects, a communication plan for dialogue development and information-exchange actions, and an interaction plan for utterance turn-taking.)

Dialogue example:
  Utterance 1 (customer): Will you send me a registration form?
  Utterance 2 (secretariat): All right.
  Utterance 3 (secretariat): Will you give me your name and address?
  Utterance 4 (customer): (My) name is Mayumi Suzuki, and (my) address is ...
  Utterance 5 (customer): When is the deadline?
  Utterance 6 (secretariat): December 1.
  Utterance 7 (secretariat): Please return the form as soon as possible.

Figure 3.20. Process of construction of dialogue structure. (From reference [Iida-92].)

The plan recognizer assumes a goal; the analyzed utterance is matched against the various plans, and a chaining path from the input to a goal is inferred. An example of a dialogue structure is shown in Figure 3.20.

3.6.2. Dialogue Interpretation

In the dialogue analysis of spoken language, the communicative act type is concerned not only with the intentional part of an utterance but also with the property of the topic of the propositional contents [Yamaoka-91]. For this purpose, a set of communicative act types can be defined, each of which indicates a particular type of speech act.

Table 3.6. Typical communicative act types of Japanese. (From reference [Yamaoka-91].)

1. Demand Class
   ASK-ACTION         "ACT-wa WH-desu-ka?"
   CONFIRM-ACTION     "ACT-suru-no-desu-ka?"
   REQUEST-ACTION     "ACT-shite-kudasai."
   OFFER-ACTION       "ACT-shi-masu."
   ASK-VALUE          "OBJ-wa WH-desu-ka?" / "OBJ-wo onegai-shi-masu."
   CONFIRM-VALUE      "OBJ-wa VAL-desu-ka?"
   ASK-STATEMENT      "STA-wa WH-desu-ka?"
   CONFIRM-STATEMENT  "STA-desu-ka?"
   GREETING-OPEN      "Moshimoshi."
   GREETING-CLOSE     "Sayonara."

2. Response Class
   INFORM-ACTION      "ACT-shite-kudasai."
   INFORM-VALUE       "OBJ-wa VAL-desu."
   INFORM-STATEMENT   "STA-desu-(ga)."
   AFFIRMATIVE        "Hai.", "Soudesu."
   NEGATIVE           "Iie."
   ACCEPT-ACTION      "Wakarimashita."
   REJECT-ACTION      "ACT-deki-masen."
   ACCEPT-OFFER       "Arigatou-gozaimasu."
   REJECT-OFFER       "(Iie) kekkou-desu."
   GREETING-OPEN      "Hai."
   GREETING-CLOSE     "Shitsurei-shi-masu."

3. Confirm Class
   CONFIRMATION       "Wakarimashita."

(ACT denotes a phrase about an ACTION, OBJ about an OBJECT, STA about a STATEMENT, and WH an interrogative, respectively.)


Table 3.7. Typical utterance pairs in Japanese dialogue. (From reference [Yamaoka-91].)

Demand Class         Response Class
ASK-ACTION           INFORM-ACTION
CONFIRM-ACTION       AFFIRMATIVE/NEGATIVE, INFORM-ACTION
REQUEST-ACTION       ACCEPT-ACTION, REJECT-ACTION
OFFER-ACTION         ACCEPT-OFFER, REJECT-OFFER
ASK-VALUE            INFORM-VALUE
CONFIRM-VALUE        AFFIRMATIVE/NEGATIVE, INFORM-VALUE
ASK-STATEMENT        INFORM-STATEMENT
CONFIRM-STATEMENT    AFFIRMATIVE/NEGATIVE, INFORM-STATEMENT
GREETING-CLOSE       GREETING-CLOSE

A set of communicative act types, shown in Table 3.6, is defined through a linguistic and pragmatic analysis of a dialogue corpus in the conference registration domain. Five types of illocutionary force for modal expressions are defined: INFORM (representative, declarative, expressive), ASK (interrogative), CONFIRM (interrogative), REQUEST (directive, imperative), and OFFER (commissive). Three classes of topic property are defined: ACTION, OBJECT (which has a property value), and STATEMENT. A set of utterance pairs in cooperative dialogues can be defined as shown in Table 3.7.

Two sets of pragmatic knowledge concerning the usage of isolated utterances are described separately. One is the surface form used to express the speaker's intention. The other is the propositional contents in terms of a predicate and its case values. An outline of the communicative act type analysis concerning the property of a topic is shown in Figure 3.21. In order to infer communicative act types from an utterance, a rewriting system which can control the application of rules is employed as an inference engine.

Prediction of the next utterance has also been investigated using a dialogue understanding model in the framework of plan recognition [Yamaoka-91]. The various plans are used to predict expectations about the communicative act type of the next utterance. By referring to the goal list, which contains the incomplete plans regarded as possibilities and expectations for future goals, the next utterance can be predicted within the ongoing dialogue.

(Figure 3.21 shows the semantic structure with pragmatics entering an intention part analysis and a proposition contents analysis, driven by a rewriting engine with rules for the intention part and rules for the propositional contents, and producing the communicative act.)

Figure 3.21. Configuration of the communicative act type analysis system. (From reference [Yamaoka-91].)

The next utterance is predicted in terms of two types of information: one regarding the communicative act type, and the other regarding the constituents that make up the propositional contents. The capability of selecting the correct surface form of the next utterance, in particular as regards expressions of the speaker's intention, has been confirmed [Yamaoka-91].

3.6.3. Contextual Processing

A dialogue model, as well as a broad explanation of the dialogue process, is useful for language translation. A computational model for contextual processing using constraints on the dialogue participants' mental states has been studied. Shared goals and mutual beliefs between dialogue participants are taken from the context of the task-oriented dialogue. Communication acts performed by dialogue participants can be interpreted based on such contextual information [Dohsaka-90].

Referents of omitted pronouns in Japanese dialogue are identified through the interpretation of pragmatic constraints on the use of linguistic expressions in context. Honorific relationships, the speaker's point of view, and the speaker's range of information are exploited by the model. The interpretation mechanism is regarded as an integration of constraint satisfaction and objective inference.


3.7. New Approach for Language Translation

Conventional approaches to language translation are mostly aimed at treating syntactic and semantic information based on formal linguistic grammars. However, dialogue utterances involve various kinds of intention expressions and are often fragmentary. A new approach is emerging in language translation to overcome the shortcomings of conventional rule-based translation systems. Traditional language translation systems rely upon the extensive use of rules or lexical expressions, and they face the following difficulties: (1) improvement in translation quality is slow, and (2) processing time sometimes increases intolerably. The new approach makes use of the increasing availability of large-scale corpora and of bulk processing power.

3.7.1. Example-Based Language Translation

Example-based language translation has been proposed to overcome the difficulties that have arisen in the traditional approach [Nagao-84]. Example-based language translation is usually called example-based machine translation (EBMT). In EBMT, a database consisting of translation examples from bilingual text is prepared. Examples whose source side is most similar to the input phrase or sentence are retrieved from the example database, and a translation output is produced based on the retrieved examples. This framework uses best-matching between the input and the stored examples, and selects the most plausible target expression from many candidates. The semantic distance between the input and an example must be calculated. A measure of the semantic distance is determined based on a thesaurus, which is a hierarchical organization of word concepts: the distance between words is derived from the distance between their concepts, which is determined according to their locations in the thesaurus hierarchy. The best match based on semantic distance is useful for target word selection of function words (e.g., Japanese particles and English prepositions) [Sumita-91].

Sentence and clause type patterns play an important role in an intuitive translation. As an extension of EBMT, a method called Transfer-Driven

(Figure 5.3 shows the participating sites: ATR in Japan; Siemens AG in Munich and Karlsruhe University in Germany; and Carnegie Mellon University in Pittsburgh, USA; the language pairs handled include German-Japanese and German-English.)

Figure 5.3. General configuration of international joint experiment.

Karlsruhe University in Germany. The general configuration of the international experiment is shown in Figure 5.3. The ASURA system was interconnected with English and German speech translation systems over international telecommunication channels. Figure 5.4 shows the configuration of the joint experiment. A DSP-based front-end processor calculates the output probabilities of the hidden states of the HMnet for each frame.

(Figure 5.4 shows, at the ATR site, input speech passing through speech recognition and language translation under communication control, with speech synthesis producing the Japanese output speech; a communication channel links this to the Siemens/Karlsruhe University (SIEMENS/KU) speech translation system, which handles German speech input and output.)

Figure 5.4. General configuration of the international joint experiment of telephone interpretation.

Each party shared equal responsibility: each site developed the units for own-language speech recognition, language translation from the source language to the target language, and speech synthesis. The configuration of the experimental setup between Japan and the USA is shown in Figure 5.5. The Japanese speech synthesizer, ATR-v Talk, was used as the Japanese speech output system. The experiment was held on January 28, 1993. In addition to speech translation, a video-conference system was used so that each speaker could see what his/her partner was doing at the other end. Several dialogues on the conference registration task were uttered in the experiment. Figure 5.6 shows a picture of the joint experiment. As a whole, the experiment was successful and received favourable comments from all over the world.

(Figure 5.5: at the ATR site, Japanese speech input ("moshimoshi") goes through Japanese speech recognition and Japanese-to-English language translation, and Japanese speech synthesis produces Japanese speech output ("konnichiwa"); an international communication network connects the communication control modules of both sites; at the CMU site, English speech input ("how are you?") goes through English speech recognition and English-to-Japanese translation, and English speech synthesis produces English speech output ("hello").)

Figure 5.5. Configuration of experimental system setup between Japan and USA. (From reference [Yato-93].)


Figure 5.6. Picture of joint experiment of automatic speech translation.

5.3. INTERTALKER

5.3.1. Overview

INTERTALKER is an experimental automatic interpretation system developed at NEC Research Laboratories in 1991. It recognizes naturally spoken speech and accomplishes bi-directional speech translation between Japanese and English. A general view of INTERTALKER is shown in Figure 5.7 [Hatazaki-92b]. Its characteristics are as follows:

(1) The domain is task-oriented dialogue input, with a vocabulary size of about 500 words, on sight-seeing guidance.
(2) Speaker-independent continuous speech recognition is accomplished.
(3) Speech recognition and language translation are tightly integrated using a conceptual representation.
(4) The language-independent expression of a sentence makes easy extension to multiple-language translation from one source language possible: for instance, from Japanese to English, French and Spanish.
(5) New Japanese speech synthesis is introduced, significantly improving the clarity and intelligibility of any synthesized sentence.

(Figure 5.7: speech input is processed by Japanese/English speech recognition into a conceptual representation, from which English, French, Spanish or Japanese text is generated and passed to the corresponding speech synthesis module to produce the speech output.)

Figure 5.7. General outline of the automatic interpretation system INTERTALKER. (From reference [Hatazaki-92b].)

Detailed explanations are given in the following sections.

5.3.2. Speech Recognition

In speech recognition, an input utterance is recognized and a conceptual representation is obtained as a result. The system accomplishes speaker-independent continuous recognition, controlled by a finite-state network grammar. The Japanese speech recognition system uses demi-syllable speech units. The demi-syllable unit is robust against contextual variations caused by co-articulation, since it contains the transitional information of speech within a unit. The number of demi-syllable units in Japanese is 241. Each unit is modelled by Gaussian mixture density HMMs. Speaker-independent HMMs are trained with task-independent training data; the training data are 250 phonetically balanced word utterances spoken by 100 speakers. The English recognition system uses 354 diphone units, and the other processes are the same as for Japanese. A word dictionary gives the demi-syllable sequence for each word, and the finite-state grammar network and the demi-syllable HMMs are compiled into a single network. Phonological variations and word-juncture models are automatically expanded in the network. Figure 5.8 shows the process involved in speech recognition and language translation.

(Figure 5.8: demi-syllable-based speech recognition, constrained by the network grammar and task knowledge, produces a conceptual representation via the conceptual relationship table; text generation then proceeds through conceptual wording, syntactic selection, word ordering and morphological generation using the target language dictionary, and the resulting text is passed to speech synthesis for speech output.)

Figure 5.8. Integration using a conceptual representation. (From reference [Hatazaki-92b].)

The best path for the input utterance is searched for in the finite-state network. Semantic and task knowledge is incorporated in the network grammar, and a conceptual representation is obtained as the recognition result. A bundle search, a fast frame-synchronous search algorithm, is introduced for the network search to reduce computation time; in the bundle search, the word-level search is undertaken only once, for the most likely occurrence. Figure 5.9 shows an example of a conceptual representation in relation to the network grammar. The obtained best path in the network is an arc sequence from the start node of the network. Both the conceptual primitives of the key words and their semantic dependency relationships are extracted from the conceptual relationship table.

Figure 5.9. Generation of a conceptual representation from the network grammar. (From reference [Hatazaki-92b].)

The conceptual representation, which takes the form of an acyclic graph, is built up using the conceptual primitives and their relationships. Pragmatic information is attached to the nodes of the conceptual representation.

5.3.3. Language Translation

Based on the conceptual representation, a sentence is generated directly in each target language. The conceptual representation is a language-independent expression of a sentence that was developed for the text-to-text machine translation system PIVOT [Okumura-91], and it readily supports multi-lingual translation. The conceptual representation is transformed into a syntactic dependency structure using the target language dictionary. The steps of generation are as follows:

1. Conceptual wording: The target sentence structure is determined pragmatically and syntactically. Suitable expressions of clause or sentence, as well as the subject and predicate, are selected.
2. Syntactic selection: Syntactic information is given to the nodes in the semantic structure and the syntactic structure is generated. The morphological information for surface cases and modalities is produced.
3. Word ordering: Word-order properties for the syntactic structure are determined.
4. Morphological generation: The nodes in the syntactic structure are arranged according to the word-order properties. Surface morphemes for each node are generated and combined into words.

5.3.4. Speech Synthesis

Translated text is converted to speech using a rule-based speech synthesizer. A Japanese text-to-speech system with high intelligibility is used. First, accentuation and pause insertion rules are applied to the sentence. Then, synthesized speech is generated using a pitch-controlled residual waveform excitation technique [Iwata-90].

5.3.5. Performance

A real-time experiment on speech interpretation has been conducted. Two kinds of tasks are implemented in the system: concert ticket reservation and a tour guide. The concert ticket reservation task has a 500-word vocabulary and a word perplexity of 5.5. Performance has been evaluated on the tour guide task. Using 30 sentences spoken by ten talkers, the rate of correct sentence recognition is 83% and the word accuracy is 95.5%. The sentence-level translation accuracy is 93%. When the key words in a sentence are recognized correctly, the conceptual representation is obtained even if some errors appear in unimportant parts.

CHAPTER 6

Future Directions

6.1. Introduction

The important issue in speech translation is to develop key technologies for translating spontaneously spoken utterances. Such utterances exhibit a wide variety of speech and language phenomena. Characteristics of spoken dialogue in terms of language and speech phenomena are shown in Figure 6.1. In spontaneous speech, phenomena such as strong co-articulation, speaker-dependent variation, and collapsed or missing phones appear quite often. Moreover, prosody plays an important role in conveying extra-linguistic information such as the speaker's intention and emotional expression. On the spoken language side, fragmentary and strongly context-dependent utterances, inversions, repetitions or re-phrasings, and ungrammatical expressions appear frequently.

Figure 6.1. Characteristics of spoken dialogue. (Figure labels, along the speech and language axes: clear pronunciation speech, simple dialogue, grammatically well-formed sentence, dialect speech, free utterance, non-constraint speech, spontaneous dialogue, spontaneous speech, illocutionary expression, grammatically ill-formed sentence, idiomatic expression, emotional expression.)

The perspective of advanced speech translation, incorporating various kinds of information and knowledge about context, situation and prosody, is shown in Figure 6.2.

Figure 6.2. Advanced spoken dialogue translation. (Figure blocks: speech input → speech recognition → language analysis → language translation → language generation → speech synthesis → speech output; supporting knowledge: semantics/pragmatics, dialogue context, intention, prosody, dialogue situation, emotion, psychological state.)

6.2. Future Directions of Speech Translation

6.2.1. Recognition of Spontaneous Speech

Speech recognition technology must be able to deal with both acoustic and linguistic variation. On the acoustic side, much effort should be directed at developing more precise and more robust context-dependent phoneme models to cover wide acoustic variations. Linguistic constraints must be exploited more extensively. Speaker-independent recognition of large vocabularies should be extended, and dynamic speaker adaptation will be developed to eliminate the undesirable necessity of uttering training words in advance.

In addition, the problem of how to define and manage a language model for spontaneous speech should be investigated. Since unnecessary or unimportant words such as "Uh" or "Eh" are frequently inserted in spontaneous utterances, the treatment of colloquial phenomena such as inversion and word insertion should be studied. A language-model management mechanism that interacts with a higher-level language-processing component and restructures the model dynamically, according to shifts in the intentions or topics of a dialogue, will also have to be investigated.
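As one very small illustration of the kind of preprocessing this implies, the sketch below strips filler words from a recognition hypothesis and then picks a topic-specific word list that could constrain a later recognition pass. The filler list, the topics and the vocabularies are invented examples, not part of any ATR system.

```python
# Toy sketch: filler-word removal from a recognition hypothesis plus a naive
# topic-driven choice of the active vocabulary.  All word lists are invented.

FILLERS = {"uh", "eh", "um", "well"}

TOPIC_VOCABULARIES = {
    "hotel":   {"room", "reservation", "night", "check-in"},
    "concert": {"ticket", "seat", "performance", "reserve"},
}

def remove_fillers(hypothesis):
    """Drop filler words before passing the hypothesis to language processing."""
    return [w for w in hypothesis if w.lower() not in FILLERS]

def select_topic(hypothesis):
    """Pick the topic whose vocabulary overlaps most with the cleaned hypothesis."""
    words = set(w.lower() for w in hypothesis)
    return max(TOPIC_VOCABULARIES, key=lambda t: len(words & TOPIC_VOCABULARIES[t]))

if __name__ == "__main__":
    hyp = "Uh I would like to uh reserve a ticket for the performance".split()
    cleaned = remove_fillers(hyp)
    print(cleaned, "->", select_topic(cleaned))
```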


6.2.2. Prosody Extraction and Control in Speech Processing

Prosody, including intonation, power and duration, carries important information in spontaneous speech. It helps not only to resolve ambiguities in sentence meaning, but also to extract extra-linguistic information such as the speaker's attitude, intention and emotion. Prosodic control of the spoken style will also be introduced into speech synthesis.

6.2.3. Translation of Colloquial Utterances

Robust translation will be required in spoken language translation. A new paradigm of language translation, called "example-based translation", is promising for spontaneous utterances. It translates an input by using a set of translation examples, each of which is very similar to a portion of the input, and is therefore quite attractive as a way of establishing robust translation. The translation mechanism retrieves similar examples from a database; a large collection of examples, each describing a pair of a source text and its translation, is set up as knowledge, extracted from a large bilingual corpus (a minimal retrieval sketch is given after Section 6.2.4 below). An efficient way to proceed will be to integrate the rule-based and example-based approaches. The dialogue situation at the time of utterance should also be taken into consideration to translate context-dependent expressions properly.

6.2.4. Integrated Control of Speech and Language Processing

Integrated control of speech and language processing will be important in spontaneous speech translation. The information needed by the language model in speech recognition should be provided at the syntactic, semantic and pragmatic levels, and speech information such as prosody should be provided to language processing as an important cue to understanding spoken language. The situation and environment of the dialogue should be recognized and maintained properly. Such situational information includes the environment, such as the domain and subject of the dialogue, the intentional and mental status of the participants, and the dialogue progression status, such as the topic or focus. It will be referred to by both the speech-processing and the language-processing units.
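Referring back to the example-based approach of Section 6.2.3, the following is a minimal sketch of the retrieval step only: given a bilingual example database, it returns the stored translation whose source side is most similar to the input. Word overlap is used here as a crude stand-in for the semantic-distance measures of the actual research, and the example pairs are invented for illustration.

```python
# Minimal sketch of example-based translation retrieval: find the stored
# source/target pair whose source side is closest to the input sentence.
# Word overlap stands in for a thesaurus-based semantic distance; the example
# database below is invented for illustration.

EXAMPLES = [
    ("i would like to reserve a room for tonight",
     "konya heya wo yoyaku shitai no desu ga"),
    ("i would like to reserve a ticket for the concert",
     "konsaato no chiketto wo yoyaku shitai no desu ga"),
    ("how much is the fare to kyoto",
     "kyoto made no ryoukin wa ikura desu ka"),
]

def similarity(a, b):
    """Jaccard word overlap between two sentences (a crude semantic-distance proxy)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def translate_by_example(source):
    """Return the target side of the most similar stored example."""
    best_src, best_tgt = max(EXAMPLES, key=lambda ex: similarity(source, ex[0]))
    return best_tgt, best_src

if __name__ == "__main__":
    tgt, matched = translate_by_example("i would like to reserve two tickets for the concert")
    print(f"matched example: {matched!r} -> {tgt!r}")
```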


6.2.5. Mechanism of Spontaneous Speech Interpretation

The mechanism for spontaneous speech processing must be consistent with a mechanism that handles associative knowledge, such as translation usage examples and word co-occurrence information [Iida-93]. The suitability of applying the various types of knowledge, such as pragmatic knowledge, metaphorical associative knowledge, and associative knowledge based on social behaviour, cannot be determined without grammatical and semantic structural information. Accordingly, each required type of information acts as a key for resolving a local subproblem, and the subproblems are all processed in parallel. Global interpretation is accomplished by solving these subproblems, which include the speaker's intention and mental state, the effects on the outer world, and pragmatic interpretation. The final solution is the one with the fewest contradictions.
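A minimal sketch of this selection principle follows: each candidate interpretation is checked against a set of constraints standing in for the subproblem solutions (dialogue plan, social behaviour, domain knowledge), and the candidate violating the fewest constraints is kept. Both the candidates and the constraints are invented placeholders.

```python
# Toy sketch of choosing the interpretation with the fewest contradictions.
# Candidate interpretations and constraint checks are invented placeholders.

CANDIDATES = [
    {"intention": "request", "politeness": "high", "topic": "reservation"},
    {"intention": "inform",  "politeness": "low",  "topic": "reservation"},
]

# Each constraint returns True when the candidate is consistent with one subproblem's solution.
CONSTRAINTS = [
    lambda c: c["intention"] == "request",          # dialogue-plan subproblem
    lambda c: c["politeness"] == "high",            # social-behaviour knowledge
    lambda c: c["topic"] == "reservation",          # domain/pragmatic knowledge
]

def contradictions(candidate):
    """Count how many constraints the candidate violates."""
    return sum(not check(candidate) for check in CONSTRAINTS)

def best_interpretation(candidates):
    """Keep the interpretation with the lowest number of contradictions."""
    return min(candidates, key=contradictions)

if __name__ == "__main__":
    print(best_interpretation(CANDIDATES))
```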

6.3. International Cooperation

Interest in speech translation is growing all over the world. The research project at the ATR Interpreting Telephony Research Laboratories has greatly stimulated further research on speech translation. At ATR, an extended project on speech translation is being continued at a newly established research organization, the ATR Interpreting Telecommunications Research Laboratories [Morimoto-93], where extensive efforts have been undertaken in research on spontaneous speech translation. Research is also active in various other groups, for example CMU, AT&T Bell Labs and SRI. In Germany, the national project VERBMOBIL was launched in 1993; the long-term vision behind VERBMOBIL is a portable translation device that can be carried to a meeting with speakers of other languages [Wahlster-93]. An automatic speech translation project has also started recently in Korea.

International cooperation is essential in the development of speech translation systems, since speech processing and language translation require deep insight into each native language. Sharing speech and language database resources will be effective in promoting the smooth development of speech translation systems.

References

Abbreviations Used
ACL: Association for Computational Linguistics
COLING: International Conference on Computational Linguistics
EUROSPEECH: European Conference on Speech Communication and Technology
IEICE: Institute of Electronics, Information and Communication Engineers
SST: Symposium on Speech Technology
ICASSP: International Conference on Acoustics, Speech, and Signal Processing
ICSLP: International Conference on Spoken Language Processing
TMI: International Conference on Theoretical and Methodological Issues in Machine Translation

[Abe-90] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, "Voice Conversion Through Vector Quantization", Jour. of Acoust. Soc. of Japan, Vol. E-11, No. 2, 71-76, (1990)
[Abe-91] M. Abe, K. Shikano, H. Kuwabara, "Statistical Analysis of Bilingual Speaker's Speech for Cross-Language Voice Conversion", Jour. of Acoust. Soc. of America, Vol. 90, No. 1, 76-82, (1991)
[Ashawi-92] H. Ashawi (ed.), The Core Language Engine, MIT Press, (1992)
[ATR-93] ATR International (ed.), Automatic Telephone Interpretation, Tokyo: Ohm Co., (1993)
[Dohsaka-90] K. Dohsaka, "Identifying the Referents of Zero-Pronouns in Japanese Based on Pragmatic Constraint Interpretation", Proc. of ECAI-90, 240-245, (1990)
[Ehara-91] T. Ehara, T. Morimoto, "Contents and Structure of the ATR Bilingual Database of Spoken Dialogues", Proc. Int. Joint Conf. ACH and ALLC, 131-136, (1991)
[Furuse-92] O. Furuse, H. Iida, "An Example-Based Method for Transfer-Driven Machine Translation", Proc. of TMI-92, 139-150, (1992)
[Gunji-87] T. Gunji, Japanese Phrase Structure Grammar: A Unification-Based Approach, (Dordrecht: Reidel), (1987)
[Hanazawa-90] T. Hanazawa, K. Kita, T. Kawabata, S. Nakamura, K. Shikano, "ATR HMM-LR Continuous Speech Recognition System", Proc. ICASSP-90, 53-56, (1990)
[Hatazaki-92a] K. Hatazaki, K. Yoshida, A. Okumura, Y. Mitome, T. Watanabe, M. Fujimoto, K. Narita, Proc. of 44th Convention Inf. Proc. Society of Japan, Vol. 3, 219-220, (1992)


[Hatazaki-92b] K. Hatazaki, J. Noguchi, A. Okumura, K. Yoshida, T. Watanabe, "INTERTALKER: An Experimental Automatic Interpretation System Using Conceptual Representation", Proc. ICSLP-92, pp. 393-396, (1992)
[Hattori-92] H. Hattori, S. Sagayama, "Vector Field Smoothing Principle for Speaker Adaptation", Proc. ICSLP-92, 381-384, (1992)
[Hayes-86] P. Hayes, A. Hauptmann, J. Carbonell, M. Tomita, "Parsing Spoken Language: A Semantic Case-frame Approach", Proc. COLING-86, 587-592, (1986)
[Iida-92] H. Iida, H. Arita, "Natural Language Understanding on a Four-layer Plan Recognition Model", Jour. of Information Processing, Vol. 15, No. 1, 60-71, (1992)
[Iida-89] H. Iida, K. Kogure, T. Aizawa, "An Experimental Spoken Natural Dialogue Translation System Using a Lexicon-Driven Grammar", Proc. of European Conf. on Speech Technology and Communications 89, (1989)
[Iida-93] H. Iida, "Prospects for Advanced Spoken Dialogue Processing", IEICE Trans. Inf. & Syst., pp. 108-114, (1993)
[Iwahashi-92a] N. Iwahashi, N. Kaiki, Y. Sagisaka, "Concatenative Speech Synthesis by Minimum Distortion Criteria", Proc. ICASSP-92, 65-68, (1992)
[Iwahashi-92b] N. Iwahashi, Y. Sagisaka, "Speech Segment Network Approach for an Optimal Synthesis Unit Set", Proc. ICSLP-92, 479-482, (1992)
[Iwata-90] K. Iwata, Y. Mitome, J. Kametani, M. Akamatsu, S. Tomotake, K. Ozawa, T. Watanabe, "A Rule-based Speech Synthesizer Using Pitch Controlled Residual Wave Excitation Method", Proc. ICSLP-90, 6.6.1-6.6.4, (1990)
[Kaiki-92] N. Kaiki, Y. Sagisaka, "Pause Characteristics and Local Phrase-Dependency Structure in Japanese", Proc. ICSLP-92, 357-360, (1992)
[Kay-80] M. Kay, "Algorithm Schemata and Data Structures in Syntactic Processing", Tech. Report CSL-80-12, Xerox PARC, (1980)
[Kikui-92] G. Kikui, "Feature Structure Based Semantic Head Driven Generation", Proc. of COLING-92, 32-38, (1992)
[Kindaichi-81] H. Kindaichi, K. Akinaga, Japanese Accent Dictionary (Meikai Nihongo Akusento Jiten), Sanseido, (1981)
[Kita-89] K. Kita, T. Kawabata, "HMM Continuous Speech Recognition Using Predictive LR Parsing", Proc. ICASSP-89, 703-796, (1989)
[Kita-91] K. Kita, T. Takezawa, T. Morimoto, "Continuous Speech Recognition Using Two-Level LR Parsing", Trans. IEICE, Vol. 74-E, No. 7, 1806-1810, (1991)
[Kitano-89] H. Kitano, H. Tomabechi, T. Mitamura, H. Iida, Proc. of EUROSPEECH-89, 198-201, (1989)
[Kitano-91] H. Kitano, "ΦDM-Dialog: An Experimental Speech-to-Speech Dialog Translation System", IEEE Computer, pp. 31-50, (1991)
[Kogure-90] K. Kogure, H. Iida, T. Hasegawa, K. Ogura, "NADINE: An Experimental Dialogue Translation System from Japanese to English", Proc. Info Japan-90, 57-64, (1990)


[Kosaka-92] T. Kosaka, S. Sagayama, "An Algorithm for Automatic HMM Structure Generation in Speech Recognition", Proc. SST-92, pp. 104-109, (1992)
[Kosaka-93] T. Kosaka, J. Takami, S. Sagayama, "Rapid Speaker Adaptation Using Speaker Mixture Allophone Models Applied to Speaker-Independent Speech Recognition", Proc. ICASSP-93, II 570-573, (1993)
[Kurematsu-90] A. Kurematsu, "ATR Speech Database: As a Tool of Speech Recognition and Analysis", Speech Communication, Vol. 9, No. 4, 357-363, (1990)
[Kurematsu-92a] A. Kurematsu, H. Iida, T. Morimoto, "Language Processing in Connection with Speech Translation at ATR Interpreting Telephony Research Laboratories", Speech Communication, Vol. 10, No. 1, pp. 1-9, (1991)
[Kurematsu-92b] A. Kurematsu, "Future Perspective of Automatic Telephone Interpretation", IEICE Trans. Commun., E75-B, No. 1, 14-19, (1992)
[Minami-92] Y. Minami, T. Matsuoka, K. Shikano, "Evaluation of HMM by Training Using Speaker Independent Speech Database", Tech. Report of IEICE, SP91-113, (1992)
[Morimoto-92] T. Morimoto, M. Suzuki, T. Takezawa, G. Kikui, M. Nagata, M. Tomokiyo, "A Spoken Language Translation System: SL-TRANS2", Proc. of COLING-92, 1048-1052, (1992)
[Morimoto-93a] T. Morimoto, A. Kurematsu, "Automatic Speech Translation at ATR", Proc. Fourth Machine Translation Summit, pp. 83-96, (1993)
[Morimoto-93b] T. Morimoto, T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata, A. Kurematsu, "ATR's Speech Translation System: ASURA", Proc. EUROSPEECH-93, 1291-1294, (1993)
[Murveit-90] H. Murveit, R. Moore, "Integrating Natural Language Constraints into HMM-based Speech Recognition", Proc. ICASSP-90, 573-576, (1990)
[Murveit-91] H. Murveit, J. Butzberger, M. Weintraub, "Speech Recognition in SRI's Resource Management and ATIS Systems", Proc. DARPA Workshop on Speech and Natural Language, (1991)
[Nagai-91] A. Nagai, S. Sagayama, K. Kita, "Phoneme-context-dependent Parsing Algorithm for HMM-based Continuous Speech Recognition", Proc. EUROSPEECH-91, (1991)
[Nagai-92a] A. Nagai et al., "Hardware Implementation of Realtime 1000-word HMM-LR Continuous Speech Recognition", Proc. ICASSP-92, pp. 1511-1514, (1992)
[Nagai-92b] A. Nagai, J. Takami, S. Sagayama, "The SSS-LR Continuous Speech Recognition System: Integrating SSS-Driven Allophone Models and a Phoneme-Context Dependent LR Parser", Proc. ICSLP-92, 1511-1514, (1992)
[Nagai-93] A. Nagai, Y. Yamaguchi, S. Sagayama, A. Kurematsu, "ATREUS: A Comparative Study of Continuous Speech Recognition System at ATR", Proc. ICASSP-93, 139-142, (1993)
[Nagao-84] M. Nagao, "Framework of a Machine Translation between Japanese and English by Analogy Principle", in Artificial and Human Intelligence, eds. A. Elithorn and R. Banerji, (Amsterdam: North-Holland), 173-180, (1984)


[Nagata-90] M. Nagata, K. Kogure, "HPSG-Based Lattice Parser for Spoken Japanese in a Spoken Language Translation System", Proc. ECAI-90, 461-466, (1990)
[Nagata-92] M. Nagata, "An Empirical Study on Rule Granularity and Unification Interleaving Toward an Efficient Unification-Based Parsing System", Proc. COLING-92, 177-183, (1992)
[Nagata-93] M. Nagata, T. Morimoto, "A Unification-Based Japanese Parser for Speech-to-Speech Translation", IEICE Trans. Inf. & Syst., Vol. E76-D, No. 1, (1993)
[Nakamura-89] S. Nakamura, K. Shikano, "Spectrogram Normalization Using Fuzzy Vector Quantization", Jour. of Acoust. Soc. of Japan, Vol. 45, No. 2, 107-114, (1989)
[Okumura-91] A. Okumura, K. Muraki, S. Akamine, "Multi-lingual Sentence Generation from the PIVOT Interlingua", Proc. Machine Translation Summit, 67-71, (1991)
[Ohkura-92] K. Ohkura, M. Sugiyama, S. Sagayama, "Speaker Adaptation Based on Transfer Vector Field Smoothing with Continuous Mixture Density HMMs", Proc. ICSLP-92, 369-372, (1992)
[Pollard-87] C. Pollard, Information-Based Syntax and Semantics, Volume 1: Fundamentals, CSLI Lecture Notes Number 13, CSLI, (1987)
[Rabiner-93] L. Rabiner, B. Juang, Fundamentals of Speech Recognition, (New York: Prentice Hall), (1993)
[Rayner-93] M. Rayner et al., "Spoken Language Translation with Mid-90's Technology: A Case Study", Proc. ICSLP-93, pp. 1299-1302, (1993)
[Roe-91] D. Roe, F. Pereira, R. Sproat, M. Riley, "Toward a Spoken Language Translator for Restricted-domain Context-free Languages", Proc. of EUROSPEECH-91, pp. 1063-1066, (1991)
[Sagayama-93] S. Sagayama, J. Takami, A. Nagai, H. Singer, K. Yamaguchi, K. Ohkura, K. Kita, A. Kurematsu, "ATREUS: A Speech Recognition Front-end for a Speech Translation System", Proc. EUROSPEECH-93, 1287-1290, (1993)
[Sagisaka-88] Y. Sagisaka, "Speech Synthesis by Rule Using an Optimal Selection of Non-Uniform Synthesis Units", Proc. ICASSP-88, 679-682, (1988)
[Sagisaka-89] Y. Sagisaka, "On the Unit Set Design for Speech Synthesis by Rule Using Non-Uniform Units", Jour. Acoust. Soc. Am., Suppl. Vol. 86, S79, FF24, Fall (1989)
[Sagisaka-90] Y. Sagisaka, "On the Prediction of Global F0 Shape for Japanese Text-to-Speech", Proc. ICASSP-90, (1990)
[Sagisaka-92] Y. Sagisaka, N. Kaiki, N. Iwahashi, K. Mimura, "ATR ν-Talk Speech Synthesis System", Proc. ICSLP-92, (1992)
[Sato-92] H. Sato, "Text-to-Speech Synthesis of Japanese", in Speech Science and Technology (ed. Shuzo Saito), 158-178, (1992)
[Seneff-89] S. Seneff, "TINA: A Probabilistic Syntactic Parser for Speech Understanding Systems", Proc. ICASSP-89, 711-714, (1989)


[Shiever-89] S. Shieber et al., "Semantic-Head-Driven Generation", Computational Linguistics, Vol. 16, No. 1, 30-42, (1990)
[Shikano-88] K. Shikano, "Towards the overcome of individual variations on speech", ATR Journal, No. 3, pp. 10-13, (1988)
[Stentiford-87] M. Stentiford et al., "A Speech Driven Language Translation System", Proc. EUROSPEECH-87, pp. 418-421, (1987)
[Sumita-91] E. Sumita, H. Iida, "Experiments and Prospects of Example-Based Machine Translation", Proc. 29th Annual Meeting ACL, 185-192, (1991)
[Suzuki-92] M. Suzuki, "A Method of Utilizing Domain and Language Specific Constraints in Dialogue Translation", Proc. COLING-92, 756-762, (1992)
[Takami-92a] J. Takami, S. Sagayama, "A Successive State Splitting Algorithm for Efficient Allophone Modeling", Proc. ICASSP-92, pp. 1573-1576, (1992)
[Takami-92b] J. Takami, A. Nagai, S. Sagayama, "Speaker Adaptation of the SSS (Successive State Splitting)-based Hidden Markov Network for Continuous Speech Recognition", Proc. SST-92, pp. 437-442, (1992)
[Takeda-92] K. Takeda, K. Abe, Y. Sagisaka, "On the Basic Scheme and Algorithms in Non-Uniform Unit Speech Synthesis", in Talking Machines, 93-105, (North-Holland: Elsevier Science Publishers), (1992)
[Tomabechi-92] H. Tomabechi, "Quasi-Destructive Graph Unification with Structure-Sharing", Proc. ACL-92, 440-446, (1992)
[Tomita-86] M. Tomita, Generalized LR Parsing, Kluwer Academic Publishers, (1991)
[Wahlster-93] W. Wahlster, "Verbmobil: Translation of Face-To-Face Dialogues", Proc. of the Fourth Machine Translation Summit, pp. 127-135, (1993)
[Waibel-91] A. Waibel et al., "JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies", Proc. ICASSP-91, pp. 793-796, (1991)
[Yamaoka-91] T. Yamaoka, H. Iida, "Dialogue Interpretation Model and Its Application to Next Utterance Prediction for Spoken Language Processing", Proc. EUROSPEECH-91, 849-852, (1991)
[Yato-93] F. Yato, T. Morimoto, Y. Yamazaki, A. Kurematsu, "Important Issue for Automatic Interpreting Telephone Technologies", Proc. of Int. Symp. on Spoken Dialogue 1993, pp. 235-238, (1993)
[Yoshida-89] K. Yoshida, T. Watanabe, S. Koga, "Large Vocabulary Word Recognition Based on Demi-Syllable Hidden Markov Model Using Same Amount of Training Data", Proc. ICASSP-89, pp. 1-4, (1989)
[Yoshimoto-88] K. Yoshimoto, "Identifying Zero Pronouns in Japanese Dialogues", Proc. COLING-88, (1988)

Index

activation-marker-passing 6 accent 12, 72 acoustic pattern 9, 16 algorithm 4, 9, 26 active chart parsing 53 beam search 25 forward-backward 29 generation 62 HMM-LR 25 Levinson-Durbin 14 N-best 7 trellis 29 Viterbi 29 VFS 38 allophone 16, 87 ambiguities 4, 11 amplitude 79 analysis 13 anaphoric 1 annotation 48 ASURA 7, 86, 88-90 ATR 5, 6 ATR ν-Talk 74 automatic speech translation 1, 2 auto-correlation 14 Baum-Welch 18 Bayes' rule 23 bilingual text 69 British Telecom 5 bundle search 95 Carnegie Mellon University (CMU) 6 cepstrum 14 clusters 15 code 15 codebook 81 codebook mapping 32

code vector 36 codeword 36 colloquial utterance 100 communication act 66 communication plan 66 communicative goal 56 computing 9 concepts (source language) 56 conceptual representation 96 consonant 12 contextual knowledge 56 contextual processing 68 context-dependent 7, 16 continuous speech recognition 10 large vocabulary 22, 24 control 100 convert 9 Core Language Engine (CLE) 8 cross-language voice conversion 84 database 9 demi-syllable 16, 95 devocalization 12 dialogue interpretation 66 plan 65 dialogue translation method 45 dialogue utterance 42 discourse structure 43, 54 discrete representation 15 distortion 77 DMDIALOG 7 DMTRANS 6 domain 4 domain specific 6, 8 domain plan 65 duration control 30 dynamic time warping (DTW) 34, 82

ellipses 65 elliptic 4 enhancement 13 entropy 77 erroneous spoken input 53 example-based language 69 EBMT (example-based machine translation) 69 feature parameter 16 feature structures 45, 57, 59-60, 62, 64 feature rewriting process 55 finite state network grammar 95 frame 13 French 5 fricative 11-12 fundamental frequency 79 fuzzy mapping 32 membership function 32 vector quantization (VQ) 30, 32 Gaussian distribution 30 Gaussian output 18, 20-21 German 7, 8 goal oriented dialogue 44 grammar ambiguity in 27 case-frame based semantic 46 context free (CFG) 24-26, 30 head-driven phrase structure 48-49, 54 inter-phrase 31 intra-phrase 32 Japanese phrase structure 49 lexical-syntactic 46 stochastic 46 syntactic 46 unification based 46, 50 unification based spoken Japanese 53

Index Hamming window 13 HEAD feature 52 Hidden Markov Model (HMM) 8, 17 discrete 18 continuous-mixture 18 continuous-mixture density 37 perplexity 24 phoneme models 24, 28 network 19-21 network (speaker adaptation of) 38 histogram 82 homonyms 12 honorific expressions 42 honorific relationships 68 human-to-human communication 2 idioms 60 illocational force type (IFT) 44, 67 indirect expressions 43 individuality control 80 intention 42 intention translation method 88 international j oint experiment 91 INTERTALKER 7,91 interpolation 36 JANUS

7

k-nearest neighbor 30 knowledge-based approach 6 knowledge representation 61 labelling 15 language generation 59 language translation 42 problems in 42 lexical dictionaries 46 lexical entries 54 liaison 12 linear predictive coding 14 linguistic knowledge 59 log-power 18


LPC analysis 14 cepstrum 14-15 delta 18 LR Parsing generalized 6, 7, 24, 27 table 26 two-level 31-32 stack operation 28 LVQ algorithm 7 machine translation 5, 70 mapping 6 mapping codebook 81 Markov 8 source 77 state 17 massively parallel 6 maximum likelihood values 20 memory access 6 meta-dialogue 5 model(s) acoustic 23 language 23 phoneme 24 probabilistic 22 modelling 9-10 monolingual 4 mora 78 morphology 46 multiple codebook 18, 30 multiple operations 27 mutual beliefs 68 nasals 12 nearest-neighbor labelling 16 NEC 5, 7 negation 42 network 7 noise 11 non-terminal symbols 48 non-uniform speech 73 oriented knowledge 65

parser (feature structure) 51 parsing 46 particles 43 case 43 path equations 49, 51 pattern recognition 10-11 performance 89, 90 performance score 40 perplexity 23-25 phoneme 7, 10, 12, 16 perplexity 24 parsing trees 29 phonetic 11 phonetic tagging 76 phrase book 5 phrase description 62 pitch 12 PIVOT 96 plan recognition 65 plosive 11-12 posteriori probability 23 pragmatic constraints 68 information 54 knowledge 67 presupposition 54 processing 9-10 propositional contents 67 prosody 73, 100 Quasi Logical 8

recognition 1 results 30 reference speakers 37 regression model 79 rewriting environment 58 rewriting rule 26 Saito-Itakura algorithm 14 sampling 13 segmental distortion 77 segmental duration 78

semantic(s) 10 case frame 53 constraints 50 distance 69 feature 44, 62 parallelism constraint 53 transfer 44 semivowel 12 sentence (Japanese) 57 shared goals 68 signal 9 similarities 83 simulated annealing 78 simulated telephone 42 SL-TRANS 6, 7 SLT 8 smoothing 36 source 56 speaker adaptive/adaptation 31-32 speaker dependent 31 speaker mixture 38 speaker-tied mixture weight training 39 spectral difference 75 spectrum analysis 13 spectrum mapping 34, 80 spectrum normalization 35 speech modelling 9 segment network (SSN) 78 synthesis 71 translation 2, 6, 99 unit(s) 16, 75 speech recognition 9 speaker-independent 38 spoken dialogue 98 spoken language 4 spontaneously spoken speech 98, 101 SSS-LR 87 stack operations 27 SUBCAT feature 52 sub-phrases 43 successive state splitting 20

supervised speaker adaptation 35 Swedish 8 syllable 12, 16 syntax 1, 10, 22 synthesis unit entry (SUE) 74 constraints 50 errors 6 synthesis 1 table 26 action table 26 go to table 26 target 56 target language 4 task-oriented 4 task-oriented dialogue 68

units 16 Unix Talk 6 utterance 67 utterance generation 58-59 utterance transfer 55 variance/variations 11, 17 context-dependent 19 vector 15 VEST 8 Viterbi alignment 30

voice conversion 80 vowel 12, 16 vector field smoothing (VFS) 35, 88 vector quantization (VQ) 15, 30, 32 (see also fuzzy VQ) verbs auxiliary 43-44

waveform 11 window 13 word co-occurrence 101

Yule-Walker equation 14 zero pronoun 42, 50 heuristics 54 resolution of 53

Volume 1 (Manufacturing Engineering)

AUTOMOBILE ELECTRONICS by Shoichi Washino Volume 2 (Electronics)

MMIC—MONOLITHIC MICROWAVE INTEGRATED CIRCUITS by Yasuo Mitsui

Volume 3 (Biotechnology)

PRODUCTION OF NUCLEOTIDES AND NUCLEOSIDES BY FERMENTATION by Sadao Teshiba and Akira Furuya Volume 4 (Electronics)

BULK CRYSTAL GROWTH TECHNOLOGY by Shin-ichi Akai, Keiichiro Fujita, Masamichi Yokogawa, Mikio Morioka and Kazuhisa Matsumoto Volume 5 (Biotechnology)

RECENT PROGRESS IN MICROBIAL PRODUCTION OF AMINO ACIDS

by Hitoshi Enei, Kenzo Yokozeki and Kunihiko Akashi Volume 6 (Manufacturing Engineering)

STEEL INDUSTRY I: MANUFACTURING SYSTEM

by Tadao Kawaguchi and Kenji Sugiyama Volume 7 (Manufacturing Engineering)

STEEL INDUSTRY II: CONTROL SYSTEM by Tadao Kawaguchi and Takatsugu Ueyama Volume 8 (Electronics)

SEMICONDUCTOR HETEROSTRUCTURE DEVICES by Masayuki Abe and Naoki Yokoyama

Volume 9 (Manufacturing Engineering)

NETWORKING IN JAPANESE FACTORY AUTOMATION by Koichi Kishimoto, Hiroshi Tanaka, Yoshio Sashida and Yasuhisa Shiobara Volume 10 (Computers and Communications)

MACHINE VISION: A PRACTICAL TECHNOLOGY FOR ADVANCED IMAGE PROCESSING by Masakazu Ejiri Volume 11 (Electronics)

DEVELOPMENT OF OPTICAL FIBERS IN JAPAN by Hiroshi Murata Volume 12 (Electronics)

HIGH-PERFORMANCE BiCMOS TECHNOLOGY AND ITS APPLICATIONS TO VLSIs by Ikuro Masuda and Hideo Maejima Volume 13 (Electronics)

SEMICONDUCTOR DEVICES FOR ELECTRONIC TUNERS by Seiichi Watanabe

Volume 14 (Biotechnology)

RECENT ADVANCES IN JAPANESE BREWING

TECHNOLOGY

by Takashi Inoue, Jun-ichi Tanaka and Shunsuke Mitsui Volume 15 (Computers and Communications)

CRYPTOGRAPHY AND SECURITY edited by Shigeo Tsujii Volume 16 (Computers and Communications)

VLSI NEURAL NETWORK SYSTEMS by Yuzo Hirai

Volume 17 (Biotechnology)

ANTIBIOTICS I: β-LACTAMS AND OTHER ANTIMICROBIAL AGENTS by Isao Kawamoto and Masao Miyauchi Volume 18 (Computers and Communications)

IMAGE PROCESSING: PROCESSORS AND APPLICATIONS TO RADARS AND FINGERPRINTS by Shin-ichi Hanaki

Volume 19 (Electronics)

AMORPHOUS SILICON SOLAR CELLS by Yukinori Kuwano Volume 20 (Electronics)

HIGH DENSITY MAGNETIC RECORDING FOR HOME VTR: HEADS AND MEDIA by Kazunori Ozawa

Volume 21 (Biotechnology)

ANTIBIOTICS II: ANTIBIOTICS BY FERMENTATION by Sadao Teshiba, Mamoru Hasegawa, Takashi Suzuki, Yoshiharu Tsubota, Hidehi Takebe, Hideo Tanaka, Mitsuyasu Okabe and Rokuro Okamoto Volume 22 (Biotechnology)

OLIGOSACCHARIDES: PRODUCTION, PROPERTIES, AND APPLICATIONS edited by Teruo Nakakuki Volume 23 (Computers and Communications)

JAPANESE TELECOMMUNICATION NETWORK by Kimio Tazaki and Jun-ichi Mizusawa

Volume 24 (Biotechnology)

ADVANCES IN POLYMERIC SYSTEMS FOR DRUG DELIVERY by Teruo Okano, Nobuhiko Yui, Masayuki Yokoyama and Ryo Yoshida Volume 25 (Biotechnology)

ON-LINE SENSORS FOR FOOD PROCESSING edited by Isao Karube Volume 26 (Biotechnology)

HMG-CoA REDUCTASE INHIBITORS AS POTENT CHOLESTEROL LOWERING DRUGS

by Yoshio Tsujita, Nobufusa Serizawa, Masahiko Hosobuchi, Toru Komai and Akira Terahara Volume 27 (Manufacturing Engineering)

VISUAL OPTICS OF HEAD-UP DISPLAYS (HUDs) IN AUTOMOTIVE APPLICATIONS edited by Shigeru Okabayashi Volume 28 (Computers and Communications)

AUTOMATIC SPEECH TRANSLATION FUNDAMENTAL TECHNOLOGY FOR FUTURE CROSS-LANGUAGE COMMUNICATIONS by Akira Kurematsu and Tsuyoshi Morimoto Volume 29 (Electronics)

TFT/LCD: Liquid Crystal Displays Addressed by Thin-Film Transistors

by Toshihisa Tsukada