Toward Spontaneous Speech Recognition and Understanding



Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552 Japan
Tel/Fax: +81-3-5734-3480
[email protected]
http://www.furui.cs.titech.ac.jp/

0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization


0010-11

Speech recognition technology
(Figure) Progress from 1980 to 2000 and beyond, plotted by speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech) against vocabulary size (20, 200, 2,000, 20,000 words, unrestricted). Applications range from voice commands, name dialing and digit strings through directory assistance, word spotting, office dictation and form fill by voice to network agent and intelligent messaging, system-driven dialogue, 2-way dialogue, transcription and natural conversation.

0201-01

Categorization of speech recognition tasks

                    Dialogue                                  Monologue
Human to human      (Category I) Switchboard, Call Home       (Category II) Broadcast news (Hub 4),
                    (Hub 5), meeting task                     lecture, presentation, voice mail
Human to machine    (Category III) ATIS, Communicator,        (Category IV) Dictation
                    information retrieval, reservation

Major speech recognition applications
• Conversational systems for accessing information services
  – Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment
  – Multimodal speech recognition technology
• Systems for transcribing, understanding and summarizing ubiquitous speech documents such as broadcast news, meetings, lectures, presentations and voicemails

0010-12

Mechanism of state-of-the-art speech recognizers
(Figure) Speech input undergoes acoustic analysis, producing a feature vector sequence x1 ... xT. A global search then finds the word sequence w1 ... wk that maximizes

    P(x1 ... xT | w1 ... wk) P(w1 ... wk)

where the acoustic probability P(x1 ... xT | w1 ... wk) is supplied by the phoneme inventory and pronunciation lexicon, and the prior P(w1 ... wk) by the language model. The maximizing word sequence is output as the recognized result.
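To make the decision rule concrete, the following minimal Python sketch scores candidate word sequences by log P(X|W) + log P(W) and returns the best one. The scoring functions are illustrative stand-ins (not the actual acoustic or language models of any system described here), and the candidate sequences are assumed to be enumerated externally; a real recognizer explores this space with the search techniques listed on the next slide.

```python
import math

# Hypothetical stand-ins for the two knowledge sources in the diagram:
# acoustic_score(X, W) plays the role of log P(x1..xT | w1..wk),
# language_score(W) the role of log P(w1..wk).
def acoustic_score(features, words):
    return -0.5 * len(features) * len(words)     # dummy value, for illustration only

def language_score(words, bigram_logprob):
    """Approximate log P(w1..wk) from a bigram table {(prev, cur): log prob}."""
    seq = ["<s>"] + list(words) + ["</s>"]
    return sum(bigram_logprob.get((seq[i - 1], seq[i]), math.log(1e-6))
               for i in range(1, len(seq)))

def decode(features, candidate_sequences, bigram_logprob):
    """Global search: maximize log P(X|W) + log P(W) over the candidate sequences W."""
    return max(candidate_sequences,
               key=lambda W: acoustic_score(features, W) + language_score(W, bigram_logprob))
```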

0010-13

State-of-the-art algorithms in speech recognition
(Figure) The same architecture annotated with typical techniques: acoustic analysis uses LPC or mel cepstrum, time derivatives, and auditory models, with cepstrum subtraction and SBR/MLLR for normalization and adaptation; the phoneme inventory consists of context-dependent, tied-mixture sub-word HMMs learned from speech data; the global search uses frame-synchronous beam search, stack search, fast match, and A* search; the language model is a bigram, trigram, FSN, or CFG.

0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0104-05

Digital sound spectrogram
(Figure) Example spectrogram display: waveform, pitch, power (0–80 dB) and spectrum (0–6 kHz) plotted as functions of time.

0109-22

Feature vector (short-time spectrum) extraction from speech
(Figure) The waveform is divided into frames by a sliding time window: the frame length sets the analysis window size, the frame period sets how far the window advances, and one feature vector (a short-time spectrum) is computed per frame.
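A minimal numpy sketch of this framing step, assuming a 25 ms frame length, a 10 ms frame period and a Hamming window (typical values; the slide itself does not fix them):

```python
import numpy as np

def frame_signal(x, sample_rate, frame_length=0.025, frame_period=0.010):
    """Slice a waveform into overlapping frames and apply a Hamming window.
    One short-time feature vector is then computed from each windowed frame.
    Assumes the signal is at least one frame long."""
    n_len = int(frame_length * sample_rate)
    n_hop = int(frame_period * sample_rate)
    n_frames = max(1, 1 + (len(x) - n_len) // n_hop)
    window = np.hamming(n_len)
    return np.stack([x[i * n_hop:i * n_hop + n_len] * window for i in range(n_frames)])
```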

0112-11

Spectral structure of speech
(Figure) The logarithmic short-term speech spectrum is the sum of a spectral fine structure, determined by the fundamental frequency F0, and a spectral envelope whose peaks are the resonances (formants). What we are hearing is characterized by this structure.

0112-12

Relationship between logarithmic spectrum and cepstrum
(Figure) Taking the inverse DFT of the logarithmic spectrum yields the cepstrum. The spectral fine structure, a fast periodic function of frequency, and the spectral envelope, a slow periodic function of frequency, concentrate at different (high and low) quefrency positions, so the two can be separated.
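This separation can be sketched in a few lines of numpy: the cepstrum is the inverse transform of the log magnitude spectrum, and keeping only the low-quefrency coefficients retains the envelope. A generic illustration, not the analysis of any particular system above.

```python
import numpy as np

def cepstrum(frame, n_coeff=13):
    """FFT cepstrum of one windowed frame: inverse transform of the log magnitude
    spectrum. Low-quefrency coefficients describe the spectral envelope; the
    fundamental period appears as a peak at a higher quefrency."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10    # avoid log(0)
    c = np.fft.irfft(np.log(spectrum))
    return c[:n_coeff]                               # low-quefrency (envelope) part
```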

0103-09

Comparison of spectral envelopes by LPC, LPC cepstrum, and FFT cepstrum methods
(Figure) Log-amplitude spectra over 0–3 kHz: the short-time spectrum overlaid with the spectral envelopes obtained by LPC, by LPC cepstrum, and by FFT cepstrum.

0103-10

Cepstrum and delta-cepstrum coefficients
(Figure) The feature parameter (vector) trajectory is described at each frame by an instantaneous vector (cepstrum) and a transitional (velocity) vector (delta-cepstrum).
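A common way to obtain the transitional vector is a linear-regression slope over a few neighbouring frames; the window width below is an assumption, since the slide does not specify it.

```python
import numpy as np

def delta(features, width=2):
    """Delta (velocity) coefficients: regression slope of each cepstral coefficient
    over a +/- `width` frame window. `features` is a (frames x coefficients) array."""
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    return np.stack([
        sum(k * (padded[t + width + k] - padded[t + width - k])
            for k in range(1, width + 1)) / denom
        for t in range(len(features))
    ])
```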

0103-11

MFCC-based front-end processor
(Figure) Speech → FFT-based spectrum → mel-scale triangular filters → log → DCT → acoustic vector, augmented with Δ and Δ² (delta and delta-delta) coefficients.
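A hedged sketch of this front end (the filter count, coefficient count and mel-scale constants are common defaults, not values taken from the slide):

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):      return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_inv(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    edges = mel_inv(np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(frame, sample_rate, n_filters=24, n_ceps=13):
    """FFT -> mel-scale triangular filters -> log -> DCT, as in the block diagram.
    Delta and delta-delta coefficients are appended separately (see delta() above)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    fbank = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    return dct(np.log(fbank + 1e-10), type=2, norm="ortho")[:n_ceps]
```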


0104-08

Structure of phoneme HMMs
(Figure) Each phoneme is modeled by a left-to-right HMM whose states have output probability densities b1(x), b2(x), b3(x) and associated transition probabilities. The feature vector sequence is aligned over time against concatenated models for phoneme k-1, phoneme k, phoneme k+1, and so on.
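Evaluating P(X | model) for such an HMM is a forward-algorithm recursion; the log-domain sketch below is generic (any state count, any output densities) rather than tied to the particular values in the figure.

```python
import numpy as np

def forward_log_likelihood(log_b, log_A, log_pi):
    """Forward algorithm in the log domain.
    log_b[t, j]: log b_j(x_t), log output probability of state j for frame t
    log_A[i, j]: log transition probability from state i to state j
    log_pi[j]:   log initial state probability
    Returns log P(x_1..x_T | model)."""
    T, N = log_b.shape
    alpha = log_pi + log_b[0]
    for t in range(1, T):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)
```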

0104-07

Units of speech (after J. Makhoul & R. Schwartz)
(Figure) The utterance "grey whales" decomposed from words into phonemes and allophones, aligned with the speech signal: amplitude waveform, spectrogram (frequency in kHz versus time in seconds), and the corresponding allophone models.

0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0103-14

Language model is crucial!
• Rudolph the red nose reindeer.
• Rudolph the Red knows rain, dear.
• Rudolph the Red Nose reigned here.

• This new display can recognize speech.
• This nudist play can wreck a nice beach.

0011-21

An example of FSN (Finite State Network) grammar
(Figure) A finite state network over the 13-word vocabulary I, WANT, NEED, THREE, ONE, A, AN, BOOK, BOOKS, COAT, COATS, NEW, OLD. Words label the arcs between numbered grammar states, so the network accepts only sentences such as "I want a new book" or "I need three old coats".
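A toy version of such a network can be written as a state-transition table; the arcs below are an illustrative approximation, since the exact topology of the original figure is not fully recoverable.

```python
# Each key is a grammar state; each arc is (word, next_state).
FSN = {
    "S0": [("I", "S1")],
    "S1": [("WANT", "S2"), ("NEED", "S2")],
    "S2": [("A", "S3"), ("AN", "S3"), ("ONE", "S4"), ("THREE", "S4")],
    "S3": [("NEW", "S3"), ("OLD", "S3"), ("BOOK", "END"), ("COAT", "END")],
    "S4": [("NEW", "S4"), ("OLD", "S4"), ("BOOKS", "END"), ("COATS", "END")],
}

def accepts(words, state="S0"):
    """True if the word sequence can be parsed by the network."""
    if not words:
        return state == "END"
    return any(accepts(words[1:], nxt)
               for w, nxt in FSN.get(state, ()) if w == words[0])

print(accepts("I WANT A NEW BOOK".split()))   # True
print(accepts("I NEED BOOK A".split()))       # False
```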

0205-01

Syntactic language models
• Rely on a formal grammar of a language.
• Syntactic sentence structures are defined by rules that can represent global constraints on word sequences.
• Mainly based on context-free grammars.
• Very difficult to extend to spontaneous speech.

0205-02

Problems of context-free grammars
• Overgeneration problem: the grammar generates not only correct sentences but also many incorrect ones.
• Ambiguity problem: the number of syntactic ambiguities in one sentence becomes unmanageable as the number of phrases grows.
• Suitability for spontaneous speech is arguable.
→ Stochastic context-free grammars

0012-01

Statistical language modeling
Probability of the word sequence w_1^k = w_1 w_2 ... w_k:

    P(w_1^k) = ∏_{i=1}^{k} P(w_i | w_1 w_2 ... w_{i-1}) = ∏_{i=1}^{k} P(w_i | w_1^{i-1})

    P(w_i | w_1^{i-1}) = N(w_1^i) / N(w_1^{i-1})

where N(w_1^i) is the number of occurrences of the string w_1^i in the given training corpus.

Approximation by Markov processes:
    Bigram model:  P(w_i | w_1^{i-1}) = P(w_i | w_{i-1})
    Trigram model: P(w_i | w_1^{i-1}) = P(w_i | w_{i-2} w_{i-1})

Smoothing of the trigram by the deleted interpolation method:

    P(w_i | w_{i-2} w_{i-1}) = λ1 P(w_i | w_{i-2} w_{i-1}) + λ2 P(w_i | w_{i-1}) + λ3 P(w_i)
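The counting and interpolation above can be sketched directly; the interpolation weights below are fixed for illustration, whereas in practice they are estimated on held-out data.

```python
from collections import defaultdict
import math

class InterpolatedTrigram:
    """Trigram model smoothed by deleted interpolation:
    P(w | u v) = l1*P_tri(w|u,v) + l2*P_bi(w|v) + l3*P_uni(w)."""

    def __init__(self, sentences, lambdas=(0.6, 0.3, 0.1)):
        self.l1, self.l2, self.l3 = lambdas
        self.tri = defaultdict(int)      # (u, v, w) counts
        self.bi = defaultdict(int)       # (v, w) counts
        self.uni = defaultdict(int)      # w counts
        self.bi_ctx = defaultdict(int)   # (u, v) history counts
        self.uni_ctx = defaultdict(int)  # v history counts
        self.total = 0
        for s in sentences:
            words = ["<s>", "<s>"] + list(s) + ["</s>"]
            for i in range(2, len(words)):
                u, v, w = words[i - 2], words[i - 1], words[i]
                self.tri[(u, v, w)] += 1
                self.bi[(v, w)] += 1
                self.uni[w] += 1
                self.bi_ctx[(u, v)] += 1
                self.uni_ctx[v] += 1
                self.total += 1

    def logprob(self, w, u, v):
        """log P(w | u v) under the interpolated model."""
        p3 = self.tri[(u, v, w)] / self.bi_ctx[(u, v)] if self.bi_ctx[(u, v)] else 0.0
        p2 = self.bi[(v, w)] / self.uni_ctx[v] if self.uni_ctx[v] else 0.0
        p1 = self.uni[w] / self.total if self.total else 0.0
        return math.log(self.l1 * p3 + self.l2 * p2 + self.l3 * p1 + 1e-12)
```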


0202-11

Complete Hidden Markov Model of a simple grammar
(Figure) A recognition network for the two-word vocabulary YES/NO: from a silence state, the grammar enters the word YES (phoneme models 'YE' and 'S', states S(1)–S(6)) with P(w_t = YES | w_{t-1} = sil) = 0.2, or the word NO (phoneme models 'N' and 'O', states S(7)–S(12)) with P(w_t = NO | w_{t-1} = sil) = 0.2, and returns to silence with probability 1 after either word. Word models are built by concatenating phoneme HMM states, each with transition probabilities P(s_t | s_{t-1}) and output probabilities P(Y | s_t).

0205-03

Variations of N-gram language models
• Various smoothing techniques
• Word class language models
• N-pos (parts of speech) models
• Combination with a context-free grammar
• Extending N-grams to the processing of long-range dependencies
• Cache-based adaptive/dynamic language models

0201-08

Good-Turing estimate
For any n-gram that occurs r times, we should pretend that it occurs r* times:

    r* = (r + 1) n_{r+1} / n_r

where n_r is the number of n-grams that occur exactly r times in the training data.
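A direct implementation of the adjusted counts (with the usual caveat that real systems smooth the count-of-counts n_r before applying the formula):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Adjusted counts r* = (r + 1) n_{r+1} / n_r, where n_r is the number of
    distinct n-grams seen exactly r times. When n_{r+1} is 0 the raw count is
    kept, a simplification of what real implementations do."""
    n = Counter(ngram_counts.values())            # n[r] = number of n-grams with count r
    return {gram: (r + 1) * n[r + 1] / n[r] if n[r + 1] else float(r)
            for gram, r in ngram_counts.items()}
```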

0108-16

Katz smoothing algorithm
Katz smoothing extends the intuitions of the Good-Turing estimate by combining higher-order models with lower-order models. With r = C(w_{i-1} w_i):

    P_Katz(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})        if r > k
                          = d_r C(w_{i-1} w_i) / C(w_{i-1})    if k ≥ r > 0
                          = α(w_{i-1}) P(w_i)                  if r = 0

where

    d_r = ( r*/r − (k+1) n_{k+1} / n_1 ) / ( 1 − (k+1) n_{k+1} / n_1 )

and

    α(w_{i-1}) = ( 1 − Σ_{w_i: r>0} P_Katz(w_i | w_{i-1}) ) / ( 1 − Σ_{w_i: r>0} P(w_i) )
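The discount factor d_r can be computed directly from the count-of-counts, as in this sketch (it assumes n_r > 0 for the relevant r; real implementations smooth the count-of-counts first):

```python
def katz_discount(r, n, k=5):
    """Discount d_r for an n-gram seen r times (formula above), using Good-Turing
    adjusted counts r* = (r + 1) n_{r+1} / n_r.
    n: count-of-counts dictionary; n-grams seen more than k times are undiscounted."""
    if r > k:
        return 1.0
    r_star = (r + 1) * n.get(r + 1, 0) / n[r]
    ratio = (k + 1) * n.get(k + 1, 0) / n[1]
    return (r_star / r - ratio) / (1.0 - ratio)
```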


0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0205-04

Spontaneous speech corpora
• Spontaneous speech variations: extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, repetitions, style shifting, ...
• "There's no data like more data"
  – Large structured collection of speech is essential.
• How to collect natural data?
• Labeling and annotation of spontaneous speech is difficult: how do we annotate the variations, how do the phonetic transcribers reach a consensus when there is ambiguity, and how do we represent a semantic notion?

0205-05

Spontaneous speech corpora (cont.)
• How to ensure the corpus quality?
• Research in automating or creating tools to assist the verification procedure is by itself an interesting subject.
• Task dependency: it is desirable to design a task-independent data set and an adaptation method for new domains → benefit of a reduced application development cost.

0202-13

Main database characteristics (Becchetti & Ricotti)

Name                       No. CD   No. hours   Gigabytes   No. speakers   No. units
TI Digits                     3       ~14          2            326        >2,500 numbers
TIMIT                         1        5.3         0.65         630        6,300 sentences
NTIMIT                        2        5.3         0.65         630        6,300 sentences
RM1                           4       11.3         1.65         144        15,024 sentences
RM2                           2        7.7         1.13           4        10,608 sentences
ATIS0                         6       20.2         2.38          36        10,722 utterances
Switchboard (Credit Card)     1        3.8         0.23          69        35 dialogues
TI-46                         1        5           0.58          16        19,136 isol. words
Road Rally                    1       ~10         ~0.6          136        dialogues/sentences
Switchboard (Complete)       30      250          15            550        2,500 dialogues
ATC                           9       65           5.0          100        30,000 dialogues
Map Task                      8       34           5.1          124        128 dialogues
MARSEC                        1        5.5         0.62           –        53 monologues
ATIS2                         6      ~37          ~5              –        12,000 utterances
WSJ-CSR1                     18       80           9.2            –        38,000 utterances

0202-14

Further database characteristics (Becchetti & Ricotti)

Name                       Transcription   TA    Speech style    Recording     SR (kHz)   Sponsor
                           based on              environment
TI Digits                  Word            No    Reading         QR            20         TI
TIMIT                      Phone           Yes   Reading         QR            16         DARPA
NTIMIT                     Phone           Yes   Reading         Tel            8         NYNEX
RM1                        Sentence        No    Reading         QR            20         DARPA
RM2                        Sentence        No    Reading         QR            20         DARPA
ATIS0                      Sentence        No    Reading spon.   Ofc           16         DARPA
Switchboard (Credit Card)  Word            Yes   Conv. spon.     Tel            8         DARPA
TI-46                      Word            No    Reading         QR            16         TI
Road Rally                 Word            Yes   Reading spon.   Tel            8         DoD
Switchboard (Complete)     Word            Yes   Conv. spon.     Tel            8         DARPA
ATC                        Sentence        Yes   Spon.           RF             8         DARPA
Map Task                   Sentence        Yes   Conv. spon.     Ofc           20         HCRC
MARSEC                     Phone           –     Spon.           Various       16         ESRC
ATIS2                      Sentence        No    Spon.           Ofc           16         DARPA
WSJ-CSR1                   Sentence        Yes   Reading         Ofc           16         DARPA

0112-13

Entropy: Amount of information (Task difficulty)
• Yes / No (equally likely):  1 bit  (log2 2)
• Digits 0–9, each with probability 0.1:  3.32 bits  (−log2 0.1)
• Digits with P(0) = 0.91 and P(1) ... P(9) = 0.01 each:
      −(0.91 log2 0.91 + 9 × 0.01 log2 0.01) = 0.722 bits
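The three cases are the entropy formula H = −Σ p log2 p applied to different distributions:

```python
import math

def entropy_bits(probs):
    """H = -sum p * log2(p), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))            # 1.0 bit   (Yes / No)
print(entropy_bits([0.1] * 10))            # ~3.32 bits (equiprobable digits)
print(entropy_bits([0.91] + [0.01] * 9))   # ~0.72 bits (skewed digits)
```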


0112-14

Test-set perplexity
Vocabulary: A, B.  Test sentence: A B B.
(Figure) A two-state model assigns probability 1/8 to the initial A and 1/4 to each following B (the remaining arcs carry the complementary probabilities 7/8 and 3/4).

Test-sentence entropy: log2 8 + log2 4 + log2 4 = 7 bits (for A B B)
Entropy per word: 7 / 3 = 2.33 bits
Test-set perplexity (branching factor): 2^2.33 ≈ 5.0
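The same computation in a couple of lines, using the per-word probabilities the model assigns to the test sentence:

```python
import math

def test_set_perplexity(word_probs):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    bits = -sum(math.log2(p) for p in word_probs)
    return 2 ** (bits / len(word_probs))

# The A/B example above: the words of "A B B" receive probabilities 1/8, 1/4, 1/4.
print(test_set_perplexity([1/8, 1/4, 1/4]))   # ~5.0
```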

0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0109-36


A unigram grammar network where the unigram probability is attached as the transition probability from starting state S to the first state of each word HMM.

0108-9

A bigram grammar network where the bigram probability P(wj | wi) is attached as the transition probability from word wi to wj.

0108-10

A trigram grammar network where the trigram probability P(wk | wi, wj) is attached to the transition from grammar state (wi, wj) to the next word wk. Illustrated here is a two-word vocabulary, so there are four grammar states in the trigram network.

0011-10

Two-pass search structure used in the Japanese broadcast-news transcription system
(Figure) A text database is used for language model training, producing bigram and trigram models; a speech database undergoes acoustic analysis and acoustic model training, producing tied-state Gaussian-mixture phone HMMs. In the first pass, a beam-search decoder using the bigram produces N-best hypotheses with acoustic scores; in the second pass, these hypotheses are rescored with the trigram to give the recognition results.
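The second pass can be sketched as a simple N-best rescoring loop; the language-model weight and the hypothesis format are assumptions for illustration, not details of the broadcast-news system itself.

```python
def rescore_nbest(nbest, trigram_logprob, lm_weight=10.0):
    """Second pass: re-rank N-best hypotheses from the first (bigram, beam-search)
    pass with a trigram language model.
    nbest: list of (word_list, acoustic_log_likelihood) pairs from the first pass.
    trigram_logprob(u, v, w): log P(w | u v)."""
    def lm_score(words):
        seq = ["<s>", "<s>"] + list(words) + ["</s>"]
        return sum(trigram_logprob(seq[i - 2], seq[i - 1], seq[i])
                   for i in range(2, len(seq)))
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))
```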


0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0204-03

Intelligible message?
(Figure) The main factors influencing speech communication: a speaker and a listener communicate over a noisy channel. Speaker-side factors include the speech signal, individual characteristics (gender, dialect, pitch), manner of speaking, linguistic characteristics of the message, cognitive load, dialogue context and cooperativeness; channel factors include the microphone, distance, reverberation, distortion, filters and pervasive noise; listener-side factors include hearing, vocabulary, dialogue context, feedback, social context and redundancy. The possible distortions in the communication and the redundancy of the speech signal interact, making the message intelligible or unintelligible to the listener.

Difficulties in (spontaneous) speech recognition
• Lack of systematic understanding of variability
  – Structural or functional variability
  – Parametric variability
• Lack of complete structural representations of (spontaneous) speech
• Lack of data for understanding non-structural variability

0010-24

Overview of the Science and Technology Agency Priority Program "Spontaneous Speech: Corpus and Processing Technology"
(Figure) A large-scale spontaneous speech corpus supports speech recognition, which produces transcriptions, and understanding (information extraction and summarization), which produces summarized text, keywords and synthesized voice. World knowledge, linguistic information, para-linguistic information and discourse information feed into the understanding stage.

0112-05

Overall design of the Corpus of Spontaneous Japanese (CSJ)
(Figure) The CSJ consists of roughly 7M words of spontaneous monologue (digitized speech, transcription, POS and speaker information) for speech recognition, plus a 500k-word Core manually tagged with segmental and prosodic information for training a morphological analysis and POS tagging program.

0201-02

Test-set perplexity and OOV rate for the two language models
(Figure) Test-set perplexity (up to about 800) plotted against out-of-vocabulary rate (up to about 7%) for the two language models, SpnL and WebL.

Word accuracy for each combination of models
(Figure) Word accuracy (roughly 40–75%) for each combination of acoustic model (RdA or SpnA), language model (WebL or SpnL) and speaker adaptation (without or with).

0204-13

Mean and standard deviation for each attribute of presentation speech

                       Acc     AL      SR     PP    OR     FR     RR
Mean                   68.6   -53.1   15.0   224   2.09   8.59   1.56
Standard deviation      7.5     2.2    1.2    61   1.18   3.67   0.72

Acc: word accuracy (%), AL: averaged acoustic frame likelihood, SR: speaking rate (number of phonemes/sec), PP: word perplexity, OR: out-of-vocabulary rate, FR: filled pause rate (%), RR: repair rate (%)

0201-05

Summary of correlation between various attributes
(Figure) Correlations (and spurious correlations) among the attributes: Acc (word accuracy), OR (out-of-vocabulary rate), RR (repair rate), FR (filled pause rate), SR (speaking rate), AL (averaged acoustic frame likelihood) and PP (word perplexity).

0202-03

Linear regression models of the word accuracy (%) with the six presentation attributes

Speaker-independent recognition:
    Acc = 0.12 AL − 0.88 SR − 0.020 PP − 2.2 OR + 0.32 FR − 3.0 RR + 95
Speaker-adaptive recognition:
    Acc = 0.024 AL − 1.3 SR − 0.014 PP − 2.1 OR + 0.32 FR − 3.2 RR + 99

Acc: word accuracy, SR: speaking rate, PP: word perplexity, OR: out-of-vocabulary rate, FR: filled pause rate, RR: repair rate
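For reference, the speaker-independent regression can be evaluated directly; the attribute values in the example call are the means from the earlier table.

```python
def predicted_accuracy_si(AL, SR, PP, OR, FR, RR):
    """Speaker-independent regression model quoted above (word accuracy in %)."""
    return 0.12 * AL - 0.88 * SR - 0.020 * PP - 2.2 * OR + 0.32 * FR - 3.0 * RR + 95

# Predicted accuracy for the mean presentation attributes:
print(predicted_accuracy_si(AL=-53.1, SR=15.0, PP=224, OR=2.09, FR=8.59, RR=1.56))
```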


0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0011-11

Human speech generation and recognition process
(Figure) Speech generation proceeds from message formulation (text generation: intention, semantics) through the language code (phonemes, words, syntax, prosody) and neuro-muscular controls (articulatory motions) to the vocal tract system (speech production), which outputs the acoustic waveform. The information rate rises from a discrete signal of about 50 bps at the message level, through roughly 200 bps and 2,000 bps, to a continuous signal of 30,000–50,000 bps in the waveform. Speech recognition reverses the process: acoustic processing (basilar membrane motion, neural transduction, spectrum analysis, feature extraction and coding) is followed by linguistic decoding (language translation: phonemes, words, syntax) and message understanding.

0010-22

A communication-theoretic view of speech generation & recognition
(Figure) A message source produces M with probability P(M); a linguistic channel (language, vocabulary, grammar, semantics, context, habits) turns it into a word sequence W with probability P(W | M); an acoustic channel (speaker, reverberation, noise, transmission characteristics, microphone) turns W into the observed signal X with probability P(X | W); the speech recognizer operates on X.

9903-4

Message-driven speech recognition
Maximization of the a posteriori probability:

    max_M P(M | X) = max_M Σ_W P(M | W) P(W | X)                      (1)

Using Bayes' rule, this can be expressed as

    max_M P(M | X) = max_M Σ_W P(X | W) P(W | M) P(M) / P(X)          (2)

For simplicity, we approximate the equation as

    max_M P(M | X) ≈ max_{M,W} P(X | W) P(W | M) P(M) / P(X)          (3)

P(X | W) is given by hidden Markov models, and P(M) is taken as a uniform probability for all M. We assume that P(W | M) can be expressed as

    P(W | M) ≈ P(W)^(1−λ) P(W | M)^λ                                  (4)

where λ, 0 ≤ λ ≤ 1, is a weighting factor.
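In the log domain, Eq. (4) is simply a weighted combination of the two language-model scores; the sketch below is a trivial restatement with λ left as a free parameter.

```python
def combined_lm_logprob(log_p_w, log_p_w_given_m, lam=0.5):
    """log of P(W)^(1-lambda) * P(W|M)^lambda from Eq. (4);
    lambda in [0, 1] weights the message-dependent model against the generic one."""
    return (1.0 - lam) * log_p_w + lam * log_p_w_given_m
```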


9809-1

Word co-occurrence score
P(W | M) is represented by word co-occurrences of nouns:

    CoScore(wi, wj) = log [ p(wi, wj) / ( p(wi) p(wj) )^(1/2) ]

p(wi, wj): probability of observing words wi and wj in the same news article
p(wi), p(wj): probabilities of observing wi and wj in all the articles
The square-root term compensates for the probabilities of words with very low frequency.
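A sketch of estimating CoScore from a collection of articles, treating each article as a set of nouns (smoothing of zero counts is omitted):

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(articles):
    """CoScore(wi, wj) = log[ p(wi, wj) / sqrt(p(wi) p(wj)) ], with probabilities
    estimated as fractions of articles containing the word(s).
    `articles` is a list of sets of nouns."""
    n = len(articles)
    single = Counter(w for a in articles for w in a)
    pair = Counter(frozenset(p) for a in articles for p in combinations(sorted(a), 2))
    scores = {}
    for key, count in pair.items():
        wi, wj = tuple(key)
        scores[(wi, wj)] = math.log((count / n) /
                                    math.sqrt((single[wi] / n) * (single[wj] / n)))
    return scores
```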

0204-11

Generic block diagram for spontaneous speech understanding
(Figure) Speech is processed by a large-vocabulary or keyword recognizer driven by a language model; an interface between recognition and natural language (e.g. N-best lists or word graphs) passes hypotheses to a parser that deals with speech fragments, also driven by a language model; additional semantic, pragmatic or dialogue constraints are applied before the database query is issued.

0204-12

Generic semi-automatic language acquisition for speech understanding
(Figure) Data collected from speakers (or speaker-machine interaction) is annotated and used for training (annotated learning) of the lexicon, syntax, semantics, pragmatics and discourse components (short- and long-term structural relationships). These knowledge sources drive decoding within an integrated understanding framework that maps speech input to meaning.

0010-23

An architecture of a detection-based speech understanding system
(Figure) The speech input is processed by a bank of detectors (Detector 1 ... Detector N), each of which can be built by discriminative training consistent with the Neyman-Pearson lemma. Their outputs feed an integrated search and confidence evaluation stage, driven by a partial language model, which produces the understanding and response. Partial language modeling and detection-based search still need to be solved.

0205-03

Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0205-06

Sayings
• The shortest complete description is the best understanding – Ockham
• If I had more time I could write a shorter letter – B. Pascal
• Make everything as simple as possible – A. Einstein

0205-07

From speech recognition to summarization
LVCSR (Large Vocabulary Continuous Speech Recognition) systems can transcribe read speech with 90% word accuracy or higher.

Current target: LVCSR systems for spontaneous speech recognition, to generate closed captions, abstracts, etc.
Spontaneous speech features: filled pauses, disfluency, repetition, deletion, repair, etc.
Outputs from LVCSR systems include recognition errors.

→ Automatic speech summarization / important information extraction

0205-08

Summarization levels
(Figure) Summarization ranges from indicative summarization (topics, sentences) to informative summarization (abstracts, summarized utterances, raw utterances), with information extraction and speech understanding in between. Targets include closed captioning and lecture summarization.

0205-09

Automatic speech summarization system
(Figure) Spontaneous speech (news, lectures, meetings, etc.) is transcribed by an LVCSR module that uses an acoustic model trained on a speech database, a language model trained on a language database, and speaker adaptation. The transcription is passed to a summarization module that uses a summarization model, a context model, a summary corpus and a knowledge database, and whose outputs serve captioning, taking minutes, making abstracts and indexing (understanding).

0205-10

Approach to speech summarization utterance by utterance
(Figure) From each transcribed utterance (e.g. a 10-word sequence w1 ... w10), a set of words is extracted (sentence compaction) at a specified ratio, e.g. extracting 7 words out of 10 (70%), to produce a summarized (compressed) sentence such as w1 w2 w3 w6 w7 w8 w9.

0205-11

Target of summarized speech
Maintaining the original meaning of the speech as much as possible:
• Information extraction → significant (topic) words
• Linguistic correctness → linguistic likelihood
• Semantic correctness → dependency structure between words
• Word error exclusion → acoustic and linguistic reliability

0205-12

Summarization score
For a summarized sentence of M words, V = v1, v2, ..., vM, the summarization score is

    S(V) = Σ_{m=1}^{M} [ L(vm | ... v_{m-1}) + λ_I I(vm) + λ_C C(vm) + λ_T Tr(vm) ]

where
    L(vm | ... v_{m-1}): linguistic score (trigram)
    I(vm): significance (topic) score (amount of information)
    C(vm): confidence score (acoustic and linguistic reliability)
    Tr(vm): word concatenation score (word dependency probability)

0205-13

Linguistic score
Linguistic likelihood of word strings (bigram/trigram) in a summarized sentence:

    log P(vm | v_{m-2} v_{m-1})

The linguistic score is trained using a summarization corpus.

0205-14

Word significance score
Amount of information:

    fi log (FA / Fi)

fi: number of occurrences of topic word ui in the transcribed speech
Fi: number of occurrences of ui in all the training articles
FA: summation of Fi over all the training articles (FA = Σi Fi)

Significance scores of words other than topic words, and of reappearing topic words, are fixed.


0205-15

Confidence score
Acoustic and linguistic reliability of a word hypothesis, measured by its posterior probability in the word graph:

    C(w_{k,l}) = log [ α_k P_ac(w_{k,l}) P_lg(w_{k,l}) β_l / P_G ]

(Figure: a word graph from the start node S to the end node T)
w_{k,l}: word hypothesis between node k and node l (k, l: node indices in the graph)
C(w_{k,l}): log posterior probability of w_{k,l}
α_k: forward probability from the beginning node S to node k
β_l: backward probability from node l to the end node T
P_ac: acoustic likelihood of w_{k,l}
P_lg: linguistic likelihood of w_{k,l}
P_G: forward probability from the beginning node S to the end node T

0205-16

Word concatenation score
A penalty for concatenating, in the summary, two words that have no dependency relation in the original sentence.
(Figure) Example: in "the beautiful cherry blossoms in Japan", the words are linked by intra-phrase dependencies within phrase 1 ("the beautiful cherry blossoms") and phrase 2 ("in Japan") and by an inter-phrase dependency between them; the extracted summary "the beautiful Japan" is grammatically correct but incorrect as a summary.

0205-17

Dependency structure
Dependency grammar: each modifier depends on a head, through left-headed and right-headed dependencies, e.g. "The cherry blossoms bloom in spring".
A phrase structure grammar for dependency, DCFG (Dependency Context Free Grammar), has rules of the form

    α → β α (right-headed dependency)
    α → α β (left-headed dependency)
    α → w

where α, β are non-terminal symbols (NP, VP, PP, DET, ADJ, ...) and w is a terminal symbol (word).

0205-18

Word concatenation score based on SDCFG
Word dependency probability: if the dependency structure between words is deterministic, the probability is 0 or 1; if it is ambiguous, a Stochastic DCFG (SDCFG) is used. The dependency probability d(wm, wl, i, k, j) between wm and wl is calculated using inside-outside probabilities based on the SDCFG, and the concatenation score between wm and wn is

    T(wm, wn) = log Σ_{i=1}^{m} Σ_{k=m}^{n-1} Σ_{j=n}^{L} Σ_{l=n}^{j} d(wm, wl, i, k, j)

(Figure: the inside and outside probabilities span the word sequence w1 ... wi-1 wi ... wm ... wk wk+1 ... wn ... wl wj wj+1 ... wL; S is the initial symbol, α and β are non-terminal symbols, and w is a word.)

0205-19

Dynamic programming for summarizing each utterance
Selecting the set of words that maximizes the summarization score, e.g. extracting w2 w4 w5 w7 w10 from the transcription result w1 w2 ... w10. The best score for an m-word summary ending with a given word pair is computed recursively:

    g(m, l, n) = max_k { g(m−1, k, l) + ...
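The recurrence is cut off at this point in the copy above. As a rough illustration of the idea, the sketch below selects, by dynamic programming over word pairs, the fixed-length subsequence of a transcribed utterance with the highest total score; it uses a bigram linguistic score plus a per-word score (standing in for the significance and confidence terms) and omits the word-concatenation term, so it is a simplified stand-in rather than the original formulation.

```python
def summarize_utterance(words, m, word_score, bigram_logprob):
    """Pick the m-word subsequence of `words` maximizing the sum of per-word scores
    and bigram scores between consecutive selected words.
    g[j][i]: best score of a length-j summary whose last word is words[i]."""
    n = len(words)
    m = min(m, n)
    g = [[float("-inf")] * n for _ in range(m + 1)]
    back = [[None] * n for _ in range(m + 1)]
    for i in range(n):
        g[1][i] = word_score(words[i])
    for j in range(2, m + 1):
        for i in range(j - 1, n):
            for k in range(j - 2, i):
                s = g[j - 1][k] + bigram_logprob(words[k], words[i]) + word_score(words[i])
                if s > g[j][i]:
                    g[j][i], back[j][i] = s, k
    # Trace back from the best final word.
    i = max(range(m - 1, n), key=lambda t: g[m][t])
    picked = []
    for j in range(m, 0, -1):
        picked.append(words[i])
        if back[j][i] is not None:
            i = back[j][i]
    return list(reversed(picked))
```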