Information Extraction: Towards Scalable, Adaptable Systems (Lecture Notes in Computer Science, 1714) 3540666257, 9783540666257

Information extraction (IE) is a new technology enabling relevant content to be extracted from textual information avail


English Pages 184 [175] Year 1999


Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen

1714


Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo

Maria Teresa Pazienza (Ed.)

Information Extraction Towards Scalable, Adaptable Systems


Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editor Maria Teresa Pazienza Department of Computer Science, Systems and Production University of Roma, Tor Vergata Via di Tor Vergata, I-00133 Roma, Italy E-mail: [email protected]

Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Information extraction : towards scalable, adaptable systems / Maria Teresa Pazienza (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; 1714 : Lecture notes in artificial intelligence) ISBN 3-540-66625-7

CR Subject Classification (1998): I.2, H.3 ISBN 3-540-66625-7 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. © Springer-Verlag Berlin Heidelberg 1999 Printed in Germany Typesetting: Camera-ready by author SPIN: 10705092 06/3142 – 5 4 3 2 1 0

Printed on acid-free paper

Preface

The ever-growing interest in new approaches to information management is strictly related to the explosion of collections of documents made accessible through communication networks. The enormous amount of daily available information imposes the development of IE (Information Extraction) technologies that enable one to:
- access relevant documents only, and
- integrate the extracted information into the user's environment.

In fact, the classic application scenario for IE foresees, for example:
1. a company interested in getting detailed synthetic information (related to predefined categories);
2. the documents, as sources of information, located in electronically accessible sites (agencies' news, web pages, companies' textual documentation, international regulations etc.);
3. the extracted information eventually being inserted in private data bases for further processing (e.g. data mining, summary and report generation, form filling, ...).

A key problem for a wider deployment of IE systems is in their flexibility and easy adaptation to new application frameworks. Most of the commonly available IE systems are based on specific domain-dependent methodologies for knowledge extraction (they ignore how to pass to templates related to other domains or different collections of documents). The need exists for more principled techniques for managing templates in a domain-independent fashion by using the general structures of language and logic. A few attempts have been made to derive templates directly from corpora. This process is similar to deriving knowledge structures and lexicons directly from corpora. This methodological approach (adaptability) could push for a rapid customization of existing IE systems to new domains.

The missing availability of robust natural language processing (NLP) tools is an obstacle in developing efficient systems for information management and broadcasting. The use of written texts as sources of knowledge lags behind other applications: it is crucial to characterize the suitable framework to support and simplify the construction phase for NL-based applications. The present software engineering methodologies are not adequate, while the automatic manipulation of unstructured natural language texts will become an important business niche. Information Extraction technology is required to reach performance levels similar to those of Information Retrieval (IR) systems, which have proved to be commercially viable.


In many respects, IR and IE are very often used with a similar meaning when the interest is in extracting, from a very large collection of textual documents, useful information matching linguistic properties. Likewise, the Message Understanding Conferences (MUC) and the Text Retrieval Conferences (TREC) are the most qualified environments in which different IE and IR approaches, respectively, are evaluated with respect to the ability of identifying relevant information from texts. In both these competitions, innovative approaches have been implemented, evidencing the role of NLP systems. Nevertheless the definition of how accurate an approximation to explicit linguistic processing is required for good retrieval performance is still under debate.

Multilingual information extraction (IE) methodologies are more and more necessary. Even if the most common language used in electronic texts is English, the number of languages adopted to write documents circulating and accessible through networks is increasing. Systems developed for such an application must rely on linguistic resources being available in several languages. Traditionally, these resources (mainly lexicons) have been hand-built at a high cost and present obvious problems for size extension and portability to new domains. Most of the resources needed for IE systems are still developed by hand. This is a highly time consuming task for very expensive human experts. A possible solution is in extracting linguistic knowledge from corpora. This requires developing systems that, in a unified approach, would be able to
1. extract such linguistic knowledge, and
2. represent it, preferably at a meta-level independently from source language and application domain.

Parallel corpora may be considered as valuable sources of this meta-knowledge, in case aligned multilingual parallel corpora are available and tools for equivalent processing have been developed. Alignment in these texts is mandatory and it must be verified at some level (at least paragraphs and sentences). Two different frameworks exist for this task:
- use of some sort of traditional linguistic analysis of the texts, or
- a statistical approach.

The former seems to be based on the same kind of information they are trying to extract. The latter, based on simpler assumptions (e.g. a significant correlation exists in the relative length of sentences which are translations of each other), is currently used.

All these themes will be analyzed and debated at SCIE99, the SChool on Information Extraction, organized by the Artificial Intelligence Research Group of the University of Roma Tor Vergata (Italy) and supported by the European Space Agency (ESA), the Italian Association for Artificial Intelligence (AI*IA) and the National Institution for Alternative Forms of Energy (ENEA).


In recent years, SCIE99 (the second conference, SCIE97 being the first) appears to have become an important forum in which to analyze and discuss major IE concerns. By comparing the lectures held at the School on Information Extraction, SCIE97 (Information Extraction: Multidisciplinary contributions to an emerging Information Technology, Pazienza M.T. (Ed.), Lecture Notes in Artificial Intelligence 1299, Springer-Verlag, Berlin Heidelberg New York, 1997) and what was debated at SCIE99 and gathered in this book, as the current stage of the research and development in IE technology, the strong requirement for technology deployment emerges as a novelty, i.e. the availability of robust adaptable systems to test either different methodologies or new application scenarios without being forced to redefine knowledge resources and the kind of processing. The first phase (aimed at defining topics to be covered, at different extents of generality, in an IE system) appears to be concluded; a new spirit calls for technological deployment for effective, adaptable IE systems!

I would like to thank individually all my colleagues from the Artificial Intelligence Research Group of the University of Roma Tor Vergata (and particularly Roberto Basili and Michele Vindigni) who supported my efforts at organizing SCIE99 and editing this book.

Roma, July 1999

Maria Teresa Pazienza

Organization

SCIE99 is organized by the University of Roma, Tor Vergata (Italy).

Program Committee
- Luigia Carlucci Aiello, University of Roma "La Sapienza"
- Elisa Bertino, University of Milano
- Domenico Saccà, University of Calabria
- Lorenza Saitta, University of Torino
- Maria Teresa Pazienza, University of Roma, Tor Vergata

Organizing Committee
- Roberto Basili, University of Roma, Tor Vergata
- Cristina Cardani, University of Roma, Tor Vergata
- Maria Teresa Pazienza, University of Roma, Tor Vergata
- Michele Vindigni, University of Roma, Tor Vergata
- Fabio Massimo Zanzotto, University of Roma, Tor Vergata

Supporting Institutions

SCIE99 has been partially supported by
- AI*IA, Italian Association for Artificial Intelligence
- ENEA, National Institute for Alternative Forms of Energy
- ESA, European Space Agency
- University of Roma, Tor Vergata, Italy

Table of Contents

- Can We Make Information Extraction More Adaptive? Yorick Wilks and Roberta Catizone
- Natural Language Processing and Digital Libraries. Jean-Pierre Chanod
- Natural Language Processing and Information Retrieval. Ellen M. Voorhees
- From Speech to Knowledge. Veronica Dahl
- Relating Templates to Language and Logic. John F. Sowa
- Inferential Information Extraction. Marc Vilain (The MITRE Corporation)
- Knowledge Extraction from Bilingual Corpora. Harold Somers
- Engineering of IE Systems: An Object-Oriented Approach. Roberto Basili, Massimo Di Nanni, Maria Teresa Pazienza

Can We Make Information Extraction more Adaptive?

Yorick Wilks and Roberta Catizone



The University of Sheffield {yorick,roberta}@dcs.shef.ac.uk

Abstract. It seems widely agreed that IE (Information Extraction) is now a tested language technology that has reached precision+recall values that put it in about the same position as Information Retrieval and Machine Translation, both of which are widely used commercially. There is also a clear range of practical applications that would be eased by the sort of template-style data that IE provides. The problem for wider deployment of the technology is adaptability: the ability to customize IE rapidly to new domains. In this paper we discuss some methods that have been tried to ease this problem, and to create something more rapid than the bench-mark one-month figure, which was roughly what ARPA teams in IE needed to adapt an existing system by hand to a new domain of corpora and templates. An important distinction in discussing the issue is the degree to which a user can be assumed to know what is wanted, to have preexisting templates ready to hand, as opposed to a user who has a vague idea of what is needed from a corpus. We shall discuss attempts to derive templates directly from corpora; to derive knowledge structures and lexicons directly from corpora, including discussion of the recent LE project ECRAN which attempted to tune existing lexicons to new corpora. An important issue is how far established methods in Information Retrieval of tuning to a user’s needs with feedback at an interface can be transferred to IE.

1 Introduction

Information Extraction (IE) has already reached the level of success at which Information Retrieval and Machine Translation (on differing measures, of course) have proved commercially viable. By general agreement, the main barrier to wider use and commercialization of IE is the relative inflexibility of the template concept: classic IE relies on the user having an already developed set of templates, as was the case with US Defence agencies from where the technology was largely developed (see below), and this is not generally the case. The intellectual and practical issue now is how to develop templates, their subparts (like named entities or NEs), the rules for filling them, and associated knowledge structures, as rapidly as possible for new domains and genres. This paper discusses the quasi-automatic development and detection of templates, template-fillers, lexicons and knowledge structures for new domains and genres, using a combination of machine learning, linguistic resource extrapolation and human machine interface elicitation and feedback techniques.

The authors are grateful for discussion and contributions from Hamish Cunningham, Robert Gaizauskas, Louise Guthrie and Evelyne Viegas. All errors are our own of course.

Pazienza (Ed.): Information Extraction, LNAI 1714, pp. 1–16, 1999. © Springer-Verlag Berlin Heidelberg 1999

2 Background: The Information Extraction Context

Extracting and managing information has always been important for intelligence agencies, but it is clear that, in the next decade, technologies for these functions will also be crucial to education, medicine, and commerce. It is estimated that 80% of our information is textual, and Information Extraction (IE) has emerged as a new technology as part of the search for better methods of finding, storing, accessing and mining such information. IE itself is an automatic method for locating important facts in electronic documents (e.g. newspaper articles, news feeds, web pages, transcripts of broadcasts, etc.) and storing them in a database for processing with techniques like data mining, or with off-the-shelf products like spreadsheets, summarisers and report generators.

The historic application scenario for Information Extraction is a company that wants, say, the extraction of all ship sinkings, recorded in public news wires in any language world-wide, put into a single database showing ship name, tonnage, date and place of loss etc. Lloyds of London had performed this particular task with human readers of the world's newspapers for a hundred years.

The key notion in IE is that of a "template": a linguistic pattern, usually a set of attribute-value pairs, with the values being text strings, created by experts to capture the structure of the facts sought in a given domain, and which IE systems apply to text corpora with the aid of extraction rules that seek those fillers in the corpus, given a set of syntactic, semantic and pragmatic constraints.

IE as a modern language processing technology was developed largely in the US, but with strong development centres elsewhere [17, 18, 30, 34, 27]. Over 25 systems, world wide, have participated in the recent MUC competitions, most of which have a generic structure [34], and previously unreliable tasks of identifying names, dates, organizations, countries, and currencies automatically (often referred to as TE, or Template Element, tasks) have become extremely accurate (over 95% accuracy for the best systems).

In interpreting MUC figures, it should also be borne in mind that the overall recall and precision of human-provided information as a whole is estimated to be about 20% worse [15, 13, 14] than the best human performance; it was measured by how well intelligence analysts perform the task manually when compared to a "gold star" experienced intelligence analyst.

Adaptivity in the MUC development context has meant the one-month period in which competing centres adapt their system to new training data sets provided by ARPA; this period therefore provides a benchmark for human-only adaptivity of IE systems. This paper describes the adaptivity problem, to new domains and genres, that constitutes the central problem to the extension and acceptability of IE, and to increase the principled multi-linguality of IE systems, which we take to mean extending their ability to extract information in one language and present it to a user in another.
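The template notion described in this section (a set of attribute-value pairs filled by extraction rules) can be sketched in a few lines. This is only an illustration under stated assumptions: the class `ShipSinkingTemplate`, the single regular-expression rule `SINKING_RULE`, and the example sentence are all hypothetical and invented here; real IE systems of the period combined many rules with syntactic, semantic and pragmatic constraints.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShipSinkingTemplate:
    """A template: attribute-value pairs whose values are text strings."""
    ship_name: Optional[str] = None
    tonnage: Optional[str] = None
    date: Optional[str] = None
    place: Optional[str] = None


# One hypothetical extraction rule; each named group is one template slot.
SINKING_RULE = re.compile(
    r"The (?P<tonnage>[\d,]+)-ton (?P<ship_name>[A-Z][\w ]+?) sank "
    r"off (?P<place>[A-Z][\w ]+?) on (?P<date>\d{1,2} \w+ \d{4})"
)


def fill_template(sentence: str) -> Optional[ShipSinkingTemplate]:
    """Return a filled template if the rule matches the sentence, else None."""
    match = SINKING_RULE.search(sentence)
    if match is None:
        return None
    return ShipSinkingTemplate(**match.groupdict())


# Invented example sentence, not taken from any real news wire.
example = "The 12,000-ton Maria Rosa sank off Cape Finisterre on 3 March 1998."
filled = fill_template(example)
print(filled)
```

The adaptivity problem discussed in this paper shows up directly in such a sketch: both the slot inventory and the rule are hand-written for one domain and would have to be rewritten for another.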

3 Previous Work on ML and Adaptive Methods for IE

The application of Machine Learning methods to aid the IE task goes back to work on the learning of verb preferences in the eighties by Grishman & Sterling [31] and Lehnert [38], as well as early work at [...] on learning to find named expressions (NEs) [5]. The most interesting developments since then have been a series of extensions to the work of Lehnert and Riloff on Autoslog [47], which was called an automatic induction of a lexicon for IE, but which is normally described as a method of learning extraction rules from (document, filled template) pairs, that is to say the rules (and associated type constraints) that assign the fillers to template slots from text. These rules are then sufficient to fill further templates from new documents. No conventional learning algorithm was used but, since then, Soderland has extended the work by attempting to use a form of Muggleton's ILP (Inductive Logic Programming) system for that task, and Cardie [12] has sought to extend it to areas like learning the determination of coreference links.

Muggleton's [52] learning system at York has provided very good evaluated figures indeed (in world wide terms) in learning part of speech tagging and is being extended to grammar learning. Muggleton also has experimented with user interaction with a system that creates semantic networks of the articles and the relevant templates, although so far its published successes have been in areas like Part-of-Speech tagging that are not inherently structural (in the way template learning arguably is).

Grishman at NYU [24] and Morgan [41] at Durham have done pioneering work using user interaction and definition to define usable templates, and Riloff [48] has attempted to use some form of the user-feedback methods of Information Retrieval, including user-marking of negative and positive (document, filled template) pairings. Collier at Sheffield [16] tried to learn the template structure itself directly (i.e. unsupervised) from a corpus, together with primitive extraction rules, rather than how to fill a given template.
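To make concrete what "learning extraction rules from pairs" can mean, the following toy sketch induces one rule from texts paired with the filler of a single slot. It is an assumption-laden illustration, not the algorithm of Autoslog or of any system cited above: the rule form (the words immediately left and right of the filler), the function names, and the training sentences are all hypothetical.

```python
from collections import Counter


def learn_context_rule(examples):
    """Induce one rule from (tokens, filler) pairs.

    The rule is the (left word, right word) context that most often
    surrounds the annotated filler in the training examples.
    """
    contexts = Counter()
    for tokens, filler in examples:
        for i in range(1, len(tokens) - 1):
            if tokens[i] == filler:
                contexts[(tokens[i - 1], tokens[i + 1])] += 1
    if not contexts:
        raise ValueError("no filler found with both a left and a right neighbour")
    return contexts.most_common(1)[0][0]


def apply_rule(rule, tokens):
    """Return every token that occurs in the learned context."""
    left, right = rule
    return [
        tokens[i]
        for i in range(1, len(tokens) - 1)
        if tokens[i - 1] == left and tokens[i + 1] == right
    ]


# Invented training data: each text is paired with its ship-name filler.
training = [
    ("the freighter Aurora sank off Malta".split(), "Aurora"),
    ("the freighter Benita sank near Crete".split(), "Benita"),
    ("the tanker Carina burned in port".split(), "Carina"),
]
rule = learn_context_rule(training)
found = apply_rule(rule, "yesterday the freighter Delia sank in a storm".split())
print(rule, found)
```

Even this toy version shows why the quality and quantity of the training pairs matter: a context seen only once ("tanker ... burned") never becomes a rule here.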

4 UDIE: What Would It Be Like to Have a User-Driven IE System?

User Driven IE is a concept only at the moment: its aim is to address several areas of research such as how to use machine learning techniques to allow a system to be adapted to a new domain without expert intervention, and how the user will interact with the system. Below we discuss the structures that must be learned and proposed strategies for learning them, and the proposed interaction with users that we envision will be necessary to customize a system to a particular application.

A number of issues arise in connection with designing user-driven IE. First, the quality of the system depends partly on the quality of the training data it is provided with (cf. the above figure on the low quality of much of the human MUC data, compared with the best human data). This makes the provision of tools to involve users in this process as part of their normal work-flow important (see e.g. [23]). Secondly, the type of the learned data structures impacts the maintainability of the system. Stochastic models, for example, perform well in certain cases, but cannot be hand-tailored to squeeze out a little extra performance, or to eliminate an obvious error. This is an advantage of error-driven transformation-based learning of patterns for IE with a deterministic automaton-based recognition engine, such as the work of the team at [...] [1], following the work of Brill [8], as well as for all work done in the ILP paradigm.

4.1 Supervised Template Learning

Brill-style transformation-based learning methods are one of the few ML methods in NLP to have been applied above and beyond the part-of-speech tagging origins of virtually all ML in NLP. Brill's original application triggered only on POS tags; later [7] he added the possibility of lexical triggers. Since then the method has been extended successfully to e.g. speech act determination [50], and a template learning application was designed by Vilain [54]. A fast implementation based on the compilation of Brill-style rules to deterministic automata was developed at Mitsubishi labs [49] (see also [19]).

The quality of the transformation rules learned depends on factors such as:
1. the accuracy and quantity of the training data;
2. the types of pattern available in the transformation rules;
3. the feature set available for use in the pattern side of the transformation rules.

The accepted wisdom of the machine learning community is that it is very hard to predict which learning algorithm will produce optimal performance, so it is advisable to experiment with a range of algorithms running on real data. There have as yet been no systematic comparisons between these initial efforts and other conventional machine learning algorithms applied to learning extraction rules for IE data structures (e.g. example-based systems such as TiMBL [22] and [...] [42]).

Such experiments should be considered as strongly interacting with the issues discussed below (section 3 on the lexicon), where we propose extensions to earlier work done by us and others [4?] on unsupervised learning of the surface forms (subcategorization patterns) of a set of root template verbs: this was work that sought to cover the range of corpus forms under which a significant verb's NEs might appear in text. Such information might or might not be available in a given set of (document, filled template) pairs (e.g. it would NOT be if the verbs appeared in sentences only in canonical forms). Investigation is still needed on the trade off between the corpus-intensive and the pair-based methods, if templates have not been pre-provided for a very large corpus selection (for, if they had, the methodology above could subsume the subcategorization work below). It will be, in practice, a matter of training sample size and richness.

4.2 Unsupervised Template Learning

We should remember that there is also a possible unsupervised notion of template learning, developed in a Sheffield PhD thesis by Collier [16], one that can be thought of as yet another application of the old technique of Luhn [40] to locate, in a corpus, statistically significant words and use those to locate the sentences in which they occur as key sentences. This has been the basis of a range of summarisation algorithms and Collier proposed a form of it as a basis for unsupervised template induction, namely that those sentences, if they contained corpus-significant verbs, would also contain sentences corresponding to templates, whether or not yet known as such to the user. Collier cannot be considered to have proved that such learning is effective, only that some prototype results can be obtained.

4.3 User Input and Feedback at the Interface

An overall aim of UDIE would be to find the right compromise for a user of IE between automatic and user-driven methods. An important aspect of UDIE that supplements the use of learning methods is a user interface quite different from developer-orientated interfaces such as [...] [21]. There will be a range of ways in which a user can indicate to the system their interests, in advance of any automated learning or user feedback, since it would be foolish to ignore the extent to which a user may have some clear notions of what is wanted from a corpus. However, and this is an important difference from classic IE descending from MUC, we will not always assume in what follows that the user does have "templates in mind", but only that there are facts of great interest to the user in a given corpus and it can be the job of this system interface to help elicit them in a formal representation.

It is crucial to recall here that one of the few productive methods for optimising traditional IR in the last decade has been the use of user-feedback methods, typically ones where a user can indicate from retrieved documents that, say, these ten are good and these ten bad. These results are then fed back to optimise the retrieval iteratively by modifying the request. It is not easy to adapt this methodology directly to IE, even though now, with full text IR available for large indexed corpora, one can refer to sentence documents being retrieved by IR, documents of precisely the span of a classic IE template, so that one might hope for some transfer of IR optimisation methods.

However, although the user can mark sentences so retrieved as good or bad, the "filled template" part of the (document, filled template) pairings cannot be so marked by a user who, by definition, is not assumed to be familiar with template formalisms. In this section of the work we shall mention ways in which a user can indicate preferences, needs and choices at the interface that contribute to template construction whose application he can assess, though not their technical structure. Doing this will require the possibility of a user marking, on the screen, key portions of text, ones that contain the desired facts; as well as the ability to input, in some form of an interface language (English, Italian etc.), concepts in key facts or template content (including predicates and ranges of fillers). This aspect of the paper is complementary to supervised learning methods for templates, lexicons and KR structures, none of which need assume that the user does have a full and explicit concept of what is wanted from a corpus.
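The IR feedback loop recalled above (the user marks retrieved items as good or bad and the request is modified accordingly) is often illustrated with a Rocchio-style update. The paper does not name a specific algorithm, so the choice of Rocchio here is an assumption made for illustration, as are the weights `alpha`, `beta`, `gamma`, the helper `term_vector`, and the example texts.

```python
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> raw count (whitespace tokenisation)."""
    return Counter(text.lower().split())


def rocchio_update(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards documents marked good and away from those marked bad.

    All arguments are term -> weight mappings; the weights are illustrative
    defaults, not values taken from the paper.
    """
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in good_docs:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(good_docs)
    for doc in bad_docs:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(bad_docs)
    # Terms pushed to zero or below are dropped from the new request.
    return {term: weight for term, weight in updated.items() if weight > 0}


query = term_vector("ship sinking")
good = [term_vector("ship sank tonnage lost"), term_vector("tanker sank off coast")]
bad = [term_vector("sinking fund bond market")]
new_query = rocchio_update(query, good, bad)
print(sorted(new_query.items(), key=lambda item: -item[1]))
```

The difficulty raised in the text is visible here: the update only needs good/bad marks on retrieved sentences, whereas a filled template has internal structure that such marks alone do not supply.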

5 Adapting System Lexicons for a New Domain

Virtually all IE systems use lexicons, and there is universal agreement that lexicons need to be adapted or tuned to new user domains. The disagreement is about what tuning implies and whether there is real benefit in terms of recall and precision. Those in the Information Retrieval tradition of information access are usually skeptical about the latter, since statistical measures tend to bring their own internal criterion of relevance and semantic adaptation. Researchers like Strzalkowski [53] and Krovetz [37] have consistently argued that lexical adaptation, taken as far as domain-based sense tagging, does improve IR. In this paper we intend to adapt and continue our work on lexical tuning to provide some evaluable measure of the effectiveness or otherwise of lexical tuning for IE. That term has meant a number of things: the notion (as far back as Wilks 1972 [56]) has meant adding a new sense to a lexicon on corpus evidence because the text could not be accommodated to an existing lexicon. In 1990 Pustejovsky used the term to mean adding a new subcategorization pattern to an existing sense entry from corpus evidence. In the IE tradition there have been a number [46, 32] of pioneering efforts to add new words and new subcategorization/preference patterns to a lexicon from a corpus as a prolegomenon to IE.

5.1 Background on Lexical Acquisition by Rule-Based Methods

Lexical Tuning (LT) is closely related to, but fundamentally different from, a group of related theories that are associated with phrases like "lexical rules"; all of them seek to compress lexicons by means of generalizations, and we take that to include DATR [25], methods developed under AQUILEX [9], as well as Pustejovsky's Generative Lexicon [44] and Buitelaar's more recent research on under-specified lexicons [11]. All this work can be traced back to early work by Givon [29] on lexical regularities, done, interestingly to those who think corpus and [...] research began in the 1980s, in connection with the first computational work on Webster's Third Dictionary at SDC in Santa Monica under John Olney in 1966. All this work can be brought under the heading "data compression" whether or not that motive is made explicit.

Givon became interested in what is now called "systematic polysemy", as distinguished from homonymy (which is deemed unsystematic), with key examples like "grain", which is normally given an [...] sense in a dictionary, dated earlier than a mass noun sense of "grain in the mass", and this lexical extension can be found in many nouns, and indeed resurfaced in Briscoe and Copestake's famous "grinding rule" [9] that added a mass substance sense for all animals, as in "rabbit all over the road". The argument was that, if such extensions were systematic, they need not be stored individually but could be developed when needed unless explicitly overridden. The paradigm for this was the old AI paradigm of default reasoning: Clyde is an elephant and all elephants have four legs BUT Clyde has three legs. To many of us, it has been something of a mystery why this foundational cliche of AI has been greeted later within computational linguistics as remarkable and profound.

Gazdar's DATR is the most intellectually adventurous of these systems and the one that makes lexical compression the most explicit, drawing as it does on fundamental notions of science as a compression of the data of the world. The problem has been that language is one of the most recalcitrant aspects of the world and it has proved hard to find generalizations above the level of morphology; those to do with meaning have proved especially elusive. Most recently, there has been an attempt to generalise DATR to cross-language generalizations, which has exacerbated the problem. One can see that, in English, Dutch and German, respectively, HOUSE, HUIS and HAUS are the "same word" (a primitive concept DATR requires). But, whereas HOUSE has a regular plural, HAUS does not, so even at this low level, significant generalizations are very hard to find. Most crucially, there can be no appeals to meaning from the concept of "same word": TOWN (Eng.) and TUIN (Dut.) are plainly the same word in some sense, at least etymologically and phonetically, and may well obey morphological generalizations although now, unlike the HOUSE cases above, they have no relation of meaning at all, as TUIN now means garden.

Perhaps the greatest missed opportunity here has been any attempt to link DATR to established quantitative notions of data compression in linguistics, like Minimum Description Length, which gives a precise measure of the compaction of a lexicon, even where significant generalizations may be hard to spot by eye or mind, in the time honoured manner.

The systems which seek lexical compression by means of rules, in one form or another, can be discussed by particular attention to Buitelaar, since Briscoe and Pustejovsky differ in matters of detail and rule format (in the case of Briscoe) but not in principle. Buitelaar continues Pustejovsky's campaign against unstructured list views of lexicons (viewing the senses of a word merely as a list, as some dictionaries are said to do) in favour of a clustered approach, one which, in his terms, distinguishes "systematic polysemy" [11] from mere homonymy (like the ever present senses of BANK). Systematic polysemy is a notion deriving directly from Givon's examples, though it is not clear whether it would cover cases like the different kinds of emitting and receiving banks covered in a modern dictionary (e.g. sperm bank, blood bank, bottle bank etc.).

Clustering a word's senses in an optimally revealing way is something no one could possibly object to, and our disquiet at this starting point is that the examples he produces, and in particular his related attack on word sense disambiguation programs (including the present author's) as assuming a list-view of sense, are misguided. Moreover, as Nirenburg and Raskin [43] have pointed out in relation to Pustejovsky, those who criticise list views of sense then normally go on in their papers to describe and work with the senses of a word as a list!

Buitelaar's opening argument against standard WSD activities could seem ill conceived: his counter-example is supposed to be one where two senses of BOOK must be kept in play and so WSD should not be done. The example is "A long book heavily weighted with military technicalities, in this edition it is neither so long nor so technical as it was originally". Leaving aside the question of whether or not this is a sentence, let us accept that Buitelaar's list (!) of possible senses (and glosses) of BOOK is a reasonable starting point (with our numbering added): (i) the information content of a book (military technicalities); (ii) its physical appearance (heavily weighted); and (iii) the events involved in its construction (long) (ibid. p. 25). The issue, he says, is to which sense of BOOK does the "it" refer, and his conclusion is that it cannot be disambiguated between the three. This seems to us quite wrong, as a matter of the exegesis of English: "heavily weighted" is plainly metaphorical and refers to content (i), not the physical appearance (ii) of the book. We have no trouble taking LONG as referring to the content (i) since not all long books are physically large: it depends on the print etc. On our reading the "it" is univocal between the senses of BOOK in this case.
However, nothing depends on an example, well or ill-chosen, and it may well be that there are indeed cases where more than one sense must remain in play in a word's deployment; poetry is often cited, but there may well be others, less peripheral to the real world of the Wall Street Journal. The main point in any answer to Buitelaar must be that, whatever is the case about the above issue, WSD programs have no trouble capturing it: many programs, and certainly that of (Stevenson and Wilks, 1997) that he cites and its later developments, work by constraining senses and are perfectly able to report results with more than one sense still attaching to a word, just as some POS taggers result in more than one tag per word in the output. Close scholars of AI will also remember that Mellish [28], Hirst [33] and Small [51] all proposed methods by which polysemy might be computationally reduced by degree and not in an all or nothing manner. Or, as one might put it, under-specification, Buitelaar's key technical term, can seem no more than an implementation detail in any effective tagger!

Let us turn to the heart of Buitelaar's position: the issue of systematicity (one with which other closely related authors' claims about lexical rules can be taken together). If he wants, as he does, to cluster a word's senses if they are close semantically (and ignoring the fact that LDOCE's homonyms, say, in general do do that!) then what has that desire got to do with his talk about systematicness within classes of words, where we can all agree that systematicness is a virtue wherever one can obtain it? Buitelaar lists clusters of nouns (e.g. blend, competition, flux, transformation) that share the same top semantic nodes in some structure like a modified WordNet (act/evt/rel in the case of the list just given, which can be read as action, extent or relation). Such structures, he claims, are manifestations of systematic polysemy, but what is one to take that to mean, say by contrast with Levin's [39] verb classes where, she claims, the members of the class share certain syntactic and semantic properties and, on that basis, one could in principle predict additional members? That is simply not the case here: one does not have to be a firm believer in natural kinds to see that the members of this class have nothing systematic in common, but are just arbitrarily linked by the same "upper nodes". Some such classes are natural classes, as with the class he gives linked by being both animate and food, all of which, unsurprisingly, are animals and are edible, at least on some dietary principles, but there is no systemic relationship here of any kind. Or, to coin a phrase, one might say that the list above is just a list and nothing more!

In all this, we intend no criticism of his useful device, derived from Pustejovsky, for showing disjunctions and conjunctions of semantic types attached to lexical entries, as when one might mark something as act AND relation or an animal sense as animate OR food. This is close to older devices in artificial intelligence such as multiple perspectives on structures (in Bobrow and Winograd's KRL [6]), multiple formulas for related senses of a word (in Wilks [55]), and so on. Showing these situations as conjunctions and disjunctions of types may well be a superior notation, though it is quite proper to continue to point out that the members of conjuncts and disjuncts are, and remain, in lists!
Finally, Buitelaar's proposal to use these methods (via CoreLex) to acquire a lexicon from a corpus may also be an excellent approach. Our point here is that that method (capturing the content of e.g. adjective-noun instances in a corpus) has no particular relationship to the theoretical machinery described above, and is not different in kind from the standard NLP projects of the 70s like Autoslog [47], to take just one of many possible examples.

5.2

Another Approach to Lexical Acquisition

We now have developed a set of modular techniques in joint work between Rome University and Sheffield (under the EC-funded ECRAN project) with which to implement and evaluate lexical adaptation in a general manner, and in the context of a full pattern-matching IE system, one not yet fully evaluated in competition:

1. a general verb-pattern matcher from corpora;
2. a Galois Lattice [3] developed as a sorting frame for corpus subcategorization frames for individual verbs to display their inclusion properties;
3. a general word sense disambiguation program that produces the best results world-wide for general text [57];
4. measures of mapping of the subcategorization patterns (with sense disambiguated noun fillers) at lattice nodes against an existing lexicon of verbs by subcategorization patterns (LDOCE).

These are now sufficient tools to experiment systematically with the augmentation of a lexicon from a corpus with subcategorization and preference patterns and the determination of a novel sense from lattice nodes whose pattern set falls outside some cluster measure for sense. This aspect of the paper should be seen as interacting strongly with the section above on unsupervised learning of trial template structures from a corpus (Section 4.2 above): in that work a template is induced, seeded only by corpus-significant verbs. In this lexicon-adaptation development, the verbs are explicit seeds and what is sought is the corpus variety of subcategorization patterns within which template fillers for those verbs are to be found and partially ordered. It will be an experimental question whether the subcategorization variety located by the method of this section can be learned and generalised by any of the ML techniques of Section 3 above.

6

Adapting Knowledge Structures for a New Domain

It is a truism, and one of IE's inheritances from classical AI, that its methods, however superficial some believe them, are dependent on extensive domain knowledge; this takes the form of, at least, hierarchical structures expressing relationships between entities in the domain world of the user. These are often shown as hierarchies in the classical manner but the relationship between child and parent nodes may be variously: part-of, membership, subset, control-over and so on. The simplest sort of example would be an upper node representing Company-X, and lower children being Division-A, Division-B, Division-C, which could have a relationship conventionally to be described as part-of (in that the divisions are part of the company) but which some might prefer to insist were really set membership relations over, say, employees (in that any employee of Division-A is also an employee of Company-X; all these are matters of the interpretation of simple diagrams).

There is little dispute such structures are needed for sophisticated IE systems; the interesting questions are: can they be acquired automatically for a new domain, and are they distinct from lexical knowledge? As to the latter question, the types of knowledge cannot be entirely distinct. When systems like [10] attempted to locate genus hierarchies for LDOCE from parsing the definitions in the dictionary, the resulting hierarchies could be seen equally well as lexical knowledge, or as IS-A hierarchies in the world (a daisy is a plant which... etc.). Pattern matching work on corpora (Pustejovsky [45, 48]) to establish and augment such relations was presented as lexicon adaptation but the results could equally well have been claimed as knowledge structure discovery. One could oversimplify a complex philosophical area by saying that the differences between lexical and KR interpretations of such hierarchical structure are in part about:

– interpretation: KR hierarchies can be interpreted in a number of ways, whereas as lexical structures they are normally seen only as naive concept inclusion (e.g. the concept Plant covers the concept Daisy);
– transitivity of structure: KR hierarchies are concerned with inference and hence the transitivity of the interpretations of the ISA and part-of etc. links. This is not normally an issue in local lexical relations.
– scope/intensionality etc.: simple KR hierarchies of the sort we have suggested need to support more complex logical concepts (e.g. scope, or the question of interpretation as a concept or role, as in "The President must be over 35", which is both about a current President like Clinton and about any President as such). Again, these are not issues in lexical relations as normally described.

Simple trial KR hierarchies for new domains could be inferred from combining, for example:

1. Initial inference of the ontological population of a new domain by standard IR concepts of significant presence of a set of words compared to standard texts.
2. Attempting to locate partial KR structures for any members of that set that take part in existing EWN or other semantic hierarchies, using a word sense disambiguation program first if applicable to filter inappropriate senses from the existing hierarchy parts so selected.
3. Using some partial parser on domain corpora to locate "significant triples" over all instances of the word set selected by (i) in the manner of Grishman and Sterling [31].
4. Developing an algorithm to assign these structures (whose non-significant words will be sense-tagged and located in hierarchies, chunks of which can be imported) in combination to a minimum domain-appropriate KR structure that would then have to be edited and pruned by hand within a user interface.
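The first of these steps, finding words with a significant presence in a domain corpus relative to standard text, can be sketched as below. This is only an illustration under stated assumptions: the function name `significant_terms`, the add-one smoothing, the thresholds and the two toy corpora are all hypothetical and are not taken from the systems described here.

```python
from collections import Counter


def significant_terms(domain_tokens, reference_tokens, min_ratio=3.0, min_count=2):
    """Return (word, ratio) pairs for words that are markedly more frequent
    in the domain corpus than in the reference corpus.

    The ratio is the word's relative frequency in the domain corpus divided
    by its add-one smoothed relative frequency in the reference corpus.
    """
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    vocab_size = len(set(dom) | set(ref))

    scored = []
    for word, count in dom.items():
        if count < min_count:
            continue  # ignore words seen too rarely to judge
        p_domain = count / dom_total
        p_reference = (ref[word] + 1) / (ref_total + vocab_size)
        ratio = p_domain / p_reference
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpora (hypothetical): a "domain" text about mergers and a "standard" text.
domain = "the merger of the bank and the insurer followed the merger talks".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
terms = significant_terms(domain, reference, min_ratio=2.0)
```

On these toy inputs the function word "the" is frequent in both corpora and is filtered out, while "merger" survives as a candidate for the ontological population of the domain. A real system would use a much larger reference corpus and probably a proper significance test rather than a raw ratio.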

7

Adaptivity to Text Genre

Another strand of investigation we believe to be of great concern to users of IE is the adaptation, for a given domain, to a new text genre, such as moving from informal messages to formalised reports, without changing domain, and the issue of whether or not templates and their extraction rules need retraining. This was faced as a hand-crafting task in early MUCs when the genre was Navy messages, which had a jargon and syntactic forms quite unlike conventional English. Khosravi [36] has investigated whether an IE-like approach to speech-act and message matching can transfer between dialogue and email messages and found a substantial degree of transfer, of the order of 60%. It has long been known that certain lexical forms are distinctive of register and genre, as in certain languages (Czech is often cited) there are still "formal dialects" for certain forms of communication. In cases like these, standard ML techniques (n-grams and language-recognition methods) would be sufficient to indicate genre. These could be augmented by methods such as the ECRAN-like methods (see above) of adapting to new subcategorization patterns and preferences of verbs of interest in new text genres.
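As an illustration of how n-gram methods of the language-recognition kind might indicate genre, the following sketch compares character-trigram profiles by cosine similarity. It is a toy under stated assumptions: the function names, the two sample texts and the choice of trigrams with cosine similarity are hypothetical and are not a description of any system cited here.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count the character n-grams of a text (lower-cased, space-padded)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items() if gram in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def guess_genre(text, samples, n=3):
    """Return the genre whose sample text has the most similar n-gram profile."""
    profile = ngram_profile(text, n)
    return max(samples, key=lambda genre: cosine(profile, ngram_profile(samples[genre], n)))


# Hypothetical genre samples: terse message jargon versus formal report prose.
samples = {
    "message": "rqst status asap. unit on stn. eta 0900. ack rqst.",
    "report": "The committee has reviewed the annual report and recommends approval.",
}
genre_a = guess_genre("ack status asap, eta 1200", samples)
genre_b = guess_genre("The board has reviewed the proposal and recommends it.", samples)
```

With realistic amounts of sample text per genre, the same profile comparison could serve as a cheap first signal of whether templates and extraction rules are being applied outside the genre they were built for.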

8

Multilingual IE

Given an IE system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in another. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending an IE system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language dependent lexical knowledge the system requires, which unavoidably must be extended for each new language.

At Sheffield, we have adapted the architecture of the LaSIE system [26], an IE system originally designed to do monolingual extraction from English texts, to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific, conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments with extending the system to French and Spanish have shown substantial results, and by a method quite different from attaching a classic monolingual IE system to a machine translation (MT) system.

The M-LaSIE (multilingual) system relies on a robust domain model that constitutes the central exchange through which all multilingual information circulates. The addition of a new language to the IE system consists mainly of mapping a new monolingual lexicon to the domain model and adding a new syntactic/semantic analysis front-end, with no interaction at all with other languages in the system. The language independent domain model can be compared to the use of an interlingua representation in MT (see, e.g., [35]). An IE system, however, does not require full generation capabilities from the intermediate representation, and the task will be well-specified by a limited 'domain model' rather than a full unrestricted 'world model'. This makes an interlingua representation feasible for IE, because it will not involve finding solutions to all the problems of such a representation, only those issues directly relevant to the current IE task. A French-Spanish-English prototype of this architecture has been implemented and successfully tested on a limited amount of data. The architecture has been further developed in the AVENTINUS project [20].
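The separation between a single language-independent domain model and per-language lexicons can be pictured with the toy sketch below. Everything in it is an assumption made for illustration: the concept names, the lexicon entries, and the functions `add_language` and `to_concepts` are hypothetical and do not reproduce the actual M-LaSIE data structures.

```python
# Language-independent domain model: each concept and its parent (hypothetical).
DOMAIN_MODEL = {
    "COMPANY": "ORGANISATION",
    "ACQUIRE": "EVENT",
}

# Language-dependent lexicons: word forms mapped onto domain-model concepts.
LEXICONS = {
    "en": {"company": "COMPANY", "acquires": "ACQUIRE"},
    "fr": {"société": "COMPANY", "acquiert": "ACQUIRE"},
}


def add_language(code, lexicon):
    """Add a language by mapping its lexicon onto the existing domain model.

    No other language's lexicon is consulted or changed.
    """
    unknown = set(lexicon.values()) - set(DOMAIN_MODEL)
    if unknown:
        raise ValueError(f"concepts missing from the domain model: {sorted(unknown)}")
    LEXICONS[code] = dict(lexicon)


def to_concepts(text, lang):
    """Map the known words of a text onto language-independent concepts."""
    lexicon = LEXICONS[lang]
    return [lexicon[word] for word in text.lower().split() if word in lexicon]


# Extending coverage to Spanish touches only the new lexicon.
add_language("es", {"empresa": "COMPANY", "adquiere": "ACQUIRE"})
english = to_concepts("The company acquires", "en")
french = to_concepts("La société acquiert", "fr")
spanish = to_concepts("La empresa adquiere", "es")
```

The three inputs yield the same concept sequence, which is the property that lets downstream template filling be written once against the domain model rather than once per language.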

9

Conclusion

This has been not so much a paper as a disguised research proposal and compares very unfavourably therefore with those described above who have been prepared to begin the difficult work of making IE adaptive. Another important area not touched on here (though it is to be found in Cardie [12]) is the application of ML methods to the crucial notion of co-reference (e.g. [2]), and particularly its role in relating documents together: or cross-document management, a form of data-fusion between information about individuals in different documents, who may in fact be different although they have the same names.

References

1. J. Aberdeen, J. Burger, D. Day, L. Hirschman, P. Robinson, and M. Vilain. MITRE - Description of the Alembic System used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 141–156, 1995.
2. S. Azzam, K. Humphreys, and R. Gaizauskas. Using coreference chains for text summarization. In Proceedings of the ACL '99 Workshop on Coreference and its Applications, Maryland, 1999.
3. R. Basili, M. Pazienza, and P. Velardi. Acquisition of selectional patterns from sub-languages. Machine Translation, 8, 1993.
4. R. Basili, R. Catizone, M.T. Pazienza, M. Stevenson, P. Velardi, M. Vindigni, and Y. Wilks. An empirical approach to lexical tuning. In Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, LREC, First International Conference on Language Resources and Evaluation, Granada, Spain, 1998.
5. D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a High-Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997.
6. D.G. Bobrow and T. Winograd. An overview of KRL, a knowledge representation language. Cognitive Science 1, pages 3–46, 1977.
7. E. Brill. Some Advances in Transformation-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on AI (AAAI-94), Seattle, Washington, 1994.
8. E. Brill. Transformation-Based Error-Driven Learning and Natural Language. Computational Linguistics, 21(4), December 1995.
9. E. Briscoe, A. Copestake, and V. de Paiva. Default inheritance in unification-based approaches to the lexicon. Technical report, Cambridge University Computer Laboratory, 1991.
10. R. Bruce and L. Guthrie. Genus disambiguation: A study in weighted preference. In Proceedings of COLING-92, pages 1187–1191, Nantes, France, 1992.
11. P. Buitelaar. A lexicon for underspecified semantic tagging. In Proceedings of the ACL-Siglex Workshop on Tagging Text with Lexical Semantics, Washington, D.C., 1997.
12. Claire Cardie. Empirical methods in information extraction. AI Magazine. Special Issue on Empirical Natural Language Processing, 18(4), 1997.
13. N. Chinchor. The statistical significance of the MUC-5 results. In Proceedings of the Fifth Message Understanding Conference (MUC-5), pages 79–83. Morgan Kaufmann, 1993.
14. N. Chinchor and B. Sundheim. MUC-5 Evaluation Metrics. In Proceedings of the Fifth Message Understanding Conference (MUC-5), pages 69–78. Morgan Kaufmann, 1993.
15. N. Chinchor, L. Hirschman, and D.D. Lewis. Evaluating message understanding systems: An analysis of the third message understanding conference (MUC-3). Computational Linguistics, 19(3):409–449, 1993.
16. R. Collier. Automatic Template Creation for Information Extraction. PhD thesis, UK, 1998.
17. J. Cowie, L. Guthrie, W. Jin, W. Ogden, J. Pustejovsky, R. Wang, T. Wakao, S. Waterman, and Y. Wilks. CRL/Brandeis: The Diderot System. In Proceedings of Tipster Text Program (Phase I). Morgan Kaufmann, 1993.
18. J. Cowie and W. Lehnert. Information extraction. Special NLP Issue of the Communications of the ACM, 1996.
19. H. Cunningham. JAPE – a Jolly Advanced Pattern Engine. 1997.
20. H. Cunningham, S. Azzam, and Y. Wilks. Domain Modelling for AVENTINUS (WP 4.2). LE project LE1-2238 AVENTINUS internal technical report, University of Sheffield, UK, 1996.
21. H. Cunningham, R.G. Gaizauskas, and Y. Wilks. A General Architecture for Text Engineering (GATE) – a new approach to Language Engineering R&D. Technical Report CS-95-21, Department of Computer Science, University of Sheffield, 1995. Also available as http://xxx.lanl.gov/ps/cmp-lg/9601009.
22. W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg memory based learner version 1.0. Technical report, ILK Technical Report 98-03, 1998.
23. D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the 5th Conference on Applied NLP Systems (ANLP-97), 1997.
24. J.Sterling.NYU E.Agichtein R.Grishman A.Borthwick.
25. R. Evans and G. Gazdar. DATR: A language for lexical knowledge representation. Computational Linguistics, 22(2), pages 167–216, 1996.
26. R. Gaizauskas. XI: A Knowledge Representation Language Based on Cross-Classification and Inheritance. Technical Report CS-95-24, Department of Computer Science, University of Sheffield, 1995.
27. R. Gaizauskas and Y. Wilks. Information Extraction: Beyond Document Retrieval. Journal of Documentation, 1997. In press (Also available as Technical Report CS-97-10).
28. G. Gazdar and C. Mellish. Natural Language Processing in Prolog. Addison-Wesley, 1989.
29. T. Givon. Transformations of ellipsis, sense development and rules of lexical derivation. Technical Report SP-2896, Systems Development Corp., Santa Monica, CA, 1967.
30. R. Grishman. Information extraction: Techniques and challenges. In M-T. Pazienza, editor, Proceedings of the Summer School on Information Extraction (SCIE-97), LNCS/LNAI. Springer-Verlag, 1997.
31. R. Grishman and J. Sterling. Generalizing automatically generated patterns. In Proceedings of COLING-92, 1992.
32. R. Grishman and J. Sterling. Description of the Proteus system as used for MUC-5. In Proceedings of the Fifth Message Understanding Conference (MUC-5), pages 181–194. Morgan Kaufmann, 1993.
33. G. Hirst. Semantic Interpretation and the Resolution of Ambiguity. CUP, Cambridge, England, 1987.
34. J.R. Hobbs. The generic information extraction system. In Proceedings of the Fifth Message Understanding Conference (MUC-5), pages 87–91. Morgan Kaufmann, 1993.
35. W.J. Hutchins. Machine Translation: past, present, future. Chichester: Ellis Horwood, 1986.
36. H. Khosravi and Y. Wilks. Extracting pragmatic content from e-mail. Journal of Natural Language Engineering, 1997. Submitted.
37. R. Krovetz and B. Croft. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2), 1992.
38. W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, and E. Riloff. University of Massachusetts: Description of the CIRCUS system as used for MUC-4. In Proceedings of the Fourth Message Understanding Conference (MUC-4), pages 282–288. Morgan Kaufmann, 1992.
39. B. Levin. English Verb Classes and Alternations. Chicago, IL, 1993.
40. H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1, pages 309–317, 1957.
41. R. Morgan, R. Garigliano, P. Callaghan, S. Poria, M. Smith, A. Urbanowicz, R. Collingham, M. Costantino, and C. Cooper. Description of the LOLITA System as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 71–86, San Francisco, 1995. Morgan Kaufmann.
42. S. Muggleton. Recent advances in inductive logic programming. In Proc. 7th Annu. ACM Workshop on Comput. Learning Theory, pages 3–11. ACM Press, New York, NY, 1994.
43. S. Nirenburg and V. Raskin. Ten choices for lexical semantics. Technical report MCCS-96-304, Computing Research Lab, Las Cruces, NM, 1996.
44. J. Pustejovsky. The Generative Lexicon. MIT, 1995.
45. J. Pustejovsky and P. Anick. Automatically acquiring conceptual patterns without an annotated corpus. In Proceedings of the Third Workshop on Very Large Corpora, 1988.
46. E. Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of Eleventh National Conference on Artificial Intelligence, 1993.
47. E. Riloff and W. Lehnert. Automated dictionary construction for information extraction from text. In Proceedings of Ninth IEEE Conference on Artificial Intelligence for Applications, pages 93–99, 1993.
48. E. Riloff and J. Shoen. Automatically acquiring conceptual patterns without an annotated corpus. In Proceedings of the Third Workshop on Very Large Corpora, 1995.
49. E. Roche and Y. Schabes. Deterministic Part-of-Speech Tagging with Finite-State Transducers. Computational Linguistics, 21(2):227–254, June 1995.
50. K. Samuel, S. Carberry, and K. Vijay-Shanker. Dialogue act tagging with transformation-based learning. In Proceedings of the COLING-ACL 1998 Conference, pages 1150–1156, 1998.
51. S. Small and C. Rieger. Parsing and comprehending with word experts (a theory and its realisation). In W. Lehnert and M. Ringle, editors, Strategies for Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, NJ, 1982.
52. David Page, Stephen Muggleton, James Cussens, and Ashwin Srinivasan. Using inductive logic programming for natural language processing. In ECML Workshop Notes on Empirical Learning of Natural Language Tasks, pages 25–34, Prague, 1997.
53. T. Strzalkowski, Jin Wang, Fang Lin, and Jose Perez-Caballo. Natural Language Information Retrieval, chapter Evaluating Natural Language Processing Techniques in Information Retrieval, pages 113–146. Kluwer Academic Publishers, 1997.
54. Mark Vilain.
55. Y. Wilks. Grammar, Meaning and the Machine Analysis of Meaning. Routledge and Kegan Paul, 1972.
56. Y. Wilks, L. Guthrie, J. Guthrie, and J. Cowie. Combining Weak Methods in Large-Scale Text Processing, in Jacobs 1992, Text-Based Intelligent Systems. Lawrence Erlbaum, 1992.
57. Y. Wilks and M. Stevenson. Sense tagging: Semantic tagging with a lexicon. In Proceedings of the SIGLEX Workshop "Tagging Text with Lexical Semantics: What, why and how?", Washington, D.C., April 1997. Available as http://xxx.lanl.gov/ps/cmp-lg/9705016.

Natural Language Processing And Digital Libraries

Jean-Pierre Chanod

Xerox Research Centre Europe
6, chemin de Maupertuis, 38240 Meylan, France
http://www.xrce.xerox.com

Abstract. As one envisions a document model where language, physical location and medium - electronic, paper or other - impose no barrier to effective use, natural language processing will play an increasing role, especially in the context of digital libraries. This paper presents language components based mostly on finite-state technology that improve our capabilities for exploring, enriching and interacting in various ways with documents. This ranges from morphology to part-of-speech tagging, NP extraction and shallow parsing. We then focus on a series of on-going projects which illustrate how this technology is already impacting the building and sharing of knowledge through digital libraries.

1

Linguistic Components

The first part of this article concentrates on linguistic tools built at XRCE, as they represent a fairly extensive suite of consistent, robust and multilingual components ranging from morphology, part-of-speech (POS) tagging, shallow parsing and sense disambiguation. They are based mostly on finite-state technology. Combinations of rule-based and statistical methods are applied whenever appropriate. They rely on fully developed language resources and are integrated into the same unified architecture for all languages, the Xerox Linguistic Development Architecture (XeLDA) developed by the Advanced Technology Systems group. These tools are used to develop descriptions of more than 15 languages and are integrated into higher level applications, such as terminology extraction, information retrieval or translation aid, which are highly relevant in the area of digital libraries.

1.1

Finite-State Calculus

Finite-state technology is the fundamental technology Xerox language R&D is based on. The basic calculus is built on a central library that implements the fundamental operations on finite-state networks. It is the result of a long-standing research effort [21, 24, 28, 30]. An interactive tutorial on finite-state calculus is also available at the XRCE finite-state home page (http://www.xrce.xerox.com/research/mltt/fst/home.html). Besides the basic operations (concatenation, union, intersection, composition, replace operator) the library provides various algorithms to improve further the compaction, speed and ease of use of the networks. The calculus also includes specific functions to describe two-level rules and to build lexical transducers. The finite-state calculus is widely used in linguistic development, to create tokenisers, morphological analysers, noun phrase extractors, shallow parsers and other language-specific components [22].

Pazienza (Ed.): Information Extraction, LNAI 1714, pp. 17–31, 1999. © Springer-Verlag Berlin Heidelberg 1999

1.2

Morphology

Morphological variations can be conveniently represented by finite-state transducers which encode on the one side surface forms and on the other side normalised representations of such surface forms [23]. More specifically:

1. the allowed combinations of morphemes can be encoded as a finite-state network;
2. the rules that determine the context-dependent form of each morpheme can be implemented as finite-state transducers (cf. two-level morphology [25]);
3. the lexicon network and the rule transducers can be composed into a single automaton, a lexical transducer, that contains all the morphological information about the language including derivation, inflection and compounding.

Lexical transducers have many advantages. They are bi-directional (the same network for both analysis and generation), fast (thousands of words per second) and compact. They also provide an adequate formalism for a multilingual approach to language processing, as major European languages and non-Indo-European languages (e.g. Finnish, Hungarian, Basque) can be described in this framework.

1.3

Part-of-Speech Tagging

The general purpose of a part-of-speech tagger is to associate each word in a text with its morphosyntactic category (represented by a tag), as in the following example:

This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT

The process of tagging consists in three steps:

1. tokenisation: break a text into tokens
2. lexical lookup: provide all potential tags for each token
3. disambiguation: assign to each token a single tag
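The three steps can be sketched as a toy pipeline that reproduces the tagged example above. This is only an illustration under stated assumptions: the lexicon, the extra VERB reading of "sentence", the default tag for unknown words and the single contextual rule are hypothetical stand-ins for a finite-state tokeniser, a morphological analyser with guesser, and a trained disambiguator.

```python
import re

# Hypothetical toy lexicon: each word form maps to its possible tags,
# listed in no particular order of preference.
LEXICON = {
    "this": ["PRON", "DET"],
    "is": ["VAUX_3SG"],
    "a": ["DET"],
    "sentence": ["VERB", "NOUN_SG"],
    ".": ["SENT"],
}


def tokenise(text):
    """Step 1: break the text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lookup(tokens):
    """Step 2: attach all potential tags to each token.

    Unknown words default to NOUN_SG here; a real guesser would use affixes.
    """
    return [(token, LEXICON.get(token.lower(), ["NOUN_SG"])) for token in tokens]


def disambiguate(candidates):
    """Step 3: keep a single tag per token.

    A single contextual rule (prefer a noun reading after a determiner)
    stands in for an HMM or a set of disambiguation rules.
    """
    tagged = []
    previous_tag = None
    for token, tags in candidates:
        choice = tags[0]
        if previous_tag == "DET" and "NOUN_SG" in tags:
            choice = "NOUN_SG"
        tagged.append((token, choice))
        previous_tag = choice
    return tagged


def tag(text):
    """Run the three steps and format the result as token+TAG pairs."""
    return " ".join(f"{tok}+{t}" for tok, t in disambiguate(lookup(tokenise(text))))


tagged_sentence = tag("This is a sentence.")
```

Without the contextual rule, "sentence" would keep its first listed reading (VERB); the rule is what selects NOUN_SG after the determiner, which is the kind of decision the disambiguation step exists to make.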

Each step is performed by an application program that uses language specific data:

– The tokenisation step uses a finite-state transducer to insert token boundaries around simple words (or multi-word expressions), punctuation, numbers, etc.
– Lexical lookup requires a morphological analyser to associate each token with one or more readings. Unknown words are handled by a guesser that provides potential part-of-speech categories based on affix patterns.
– In the XRCE language suite, disambiguation is based on probabilistic methods (Hidden Markov Model) [10], which offer various advantages such as ease of training and speed. However, some experiments [7] showed that a limited number of disambiguation rules could reach the same level of accuracy. This may become the source of interesting developments in POS tagging, as one deals with highly inflective, agglutinative and/or free-word-order languages for which simple contextual analysis and restricted tagsets are not adequate [16].

1.4

Noun Phrase Extraction

Finite-state Noun Phrase extraction [6, 27, 33, 31] consists in extracting patterns associated with candidate NPs. Such patterns can be defined by regular expressions based on sequences of tags such as:

ADJ* NOUN+ (PREP NOUN)

The example above specifies that an NP can be represented by a sequence of one or more nouns [NOUN+] preceded by any number of adjectives [ADJ*] and optionally followed by a preposition and a noun [(PREP NOUN)], the optionality being indicated in the regular expression by the parentheses. Such a pattern would cover phrases like "digital libraries", "relational morphological analyser", "information retrieval system" or "network of networks". Due to overgeneration, the same pattern would also cover undesirable sequences such as "art museum on Tuesday" in "John visited the art museum on Tuesday". This highlights that simple noun phrase extraction based on pattern matching requires further processing, be it automatic (e.g. by using fine-grain syntactic or semantic subcategorisation in addition to part-of-speech information, or by using corpus-based filtering methods) or manual (e.g. validation by terminologists or indexers).

1.5

Incremental Finite-State Parsing

Incremental Finite State Parsing (IFSP) is an extension of finite state technology to the level of phrases and sentences, in the more general framework of shallow parsing of unrestricted texts [19, 1]. IFSP computes syntactic structures, without fully analysing linguistic phenomena that require deep semantic or pragmatic knowledge. For instance, PP-attachment, coordinated or elliptic structures are not always fully analysed. The annotation scheme remains underspecified with respect to yet unresolved issues, especially if finer-grained linguistic information is necessary. This underspecification prevents parse failures, even on complex sentences. It also prevents some early linguistic interpretation based on too general parameters. Syntactic information is added at the sentence level in an incremental way [2, 3], depending on the contextual information available at a given stage. The implementation relies on a sequence of networks built with the replace operator. The parsing process is incremental in the sense that the linguistic description attached to a given transducer in the sequence relies on the preceding sequence of transducers and can be revised at a later stage. The parser output can be used for further processing such as extraction of dependency relations over unrestricted corpora. In tests on French corpora (technical manuals, newspaper), precision is around 90-97% for subjects (84-88% for objects) and recall around 6 -92% for subjects (80-90% for objects). The system being highly modular, the strategy for dependency extraction may be adjusted to different domains of application, while the first phase of syntactic annotation is general enough to remain the same across domains. Here is a sample sentence extracted from this current section:

Annotation (bracketing and function tags not recoverable): The parsing process is incremental in the sense that the linguistic description attached to a given transducer in the sequence relies on the preceding sequence of transducers and can be revised at a later stage.

Dependency extraction (relation labels not recoverable): (description, rely) (process, be) (description, revise) (process, revise) (revise, at, stage) (rely, at, stage) (rely, on, sequence) (be, in, sense) (late, stage) (given, transducer) (linguistic, description) (description, attach) (process, incremental) (sequence, at, stage) (sequence, of, transducer) (transducer, in, sequence) (description, to, transducer)

1.6

Sense Disambiguation

The word sense disambiguation (WSD) system developed at XRCE is based on two existing systems. The first system [32], the Semantic Dictionary Lookup, is built on top of Locolex (cf. infra). It uses information about collocates and subcategorization frames derived from the Oxford-Hachette French Dictionary [9]. The disambiguation process relies on dependency relations computed by the incremental finite-state parser. The second system [12] is an unsupervised transformation-based semantic tagger first built for English. Semantic disambiguation rules are automatically extracted from dictionary examples and their sense numberings. Because senses and examples have been defined by lexicographers, they provide a reliable linguistic source for constructing a database of semantic disambiguation rules. And dictionaries appear as a valuable semantically tagged corpus.

2

Some Language Applications Relevant to Digital Libraries

2.1

LOCOLEX: A Machine Aided Comprehension Dictionary

LOCOLEX [5, 13] is an on-line bilingual comprehension dictionary which aids the understanding of electronic documents written in a foreign language. It displays only the appropriate part of a dictionary entry when a user clicks on a word in a given context. The system disambiguates parts of speech and recognises multiword expressions such as compounds (e.g. …), phrasal verbs (e.g. …), idiomatic expressions (e.g. …) and proverbs (e.g. …). In such cases LOCOLEX displays the translation of the whole phrase and not the translation of the word the user has clicked on. For instance, someone may use a French/English dictionary to understand the following text written in French:

[French example text not recoverable]

When the user clicks on the word …, LOCOLEX identifies its POS and base form. It then displays the corresponding entry, here the noun cadre, with its different sense indicators and associated translations. In this particular context, the verb reading of the word is ignored by LOCOLEX. Actually, in order to make the entry easier to use, only essential elements are displayed:

cadre nm
1. constr, art (of a picture, a window) frame
2. (scenery) setting
3. (milieu) surroundings
4. (structure, context) framework
5. (employee) executive
6. (of a bike, motorcycle) frame

The word train in the same example above is part of a verbal multiword expression, aller bon train. In our example, the expression is inflected and two adverbs have been stuck in between the head verb and its complement. Still LOCOLEX retrieves only the equivalent expression in English, to be flying around, and not the entire entry for train.

train 5 nm rumeurs aller bon train: to be flying around

LOCOLEX uses an SGML-tagged bilingual dictionary (the Oxford-Hachette French English Dictionary). To adapt this dictionary to LOCOLEX required the following:

– revision of an SGML-tagged dictionary to build a disambiguated active dictionary (DAD);
– rewriting multi-word expressions as regular expressions using a special grammar;
– building a finite state machine that compactly associates index numbers with dictionary entries.



The lookup process itself may be represented as follows:

– split the sentence string into words (tokenisation);
– normalise the case and spelling of each word;
– identify all possible morpho-syntactic usages (base form and morpho-syntactic tags) of each word in the sentence;
– disambiguate the POS;
– find relevant entries (including possible homographs or compounds) in the dictionary for the lexical form(s) chosen by the POS disambiguator;
– use the result of the morphological analysis and disambiguation to eliminate irrelevant parts of the dictionary entry;
– process the regular expressions to see if they match the word's actual context in order to identify special or idiomatic usages;
– display to the user only the most appropriate translation based on the part of speech and surrounding context.
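The lookup steps listed above can be sketched in a few lines of Python. This is only a toy illustration, not the LOCOLEX implementation: the lexicon, the entries, the determiner-based disambiguation rule, the example sentence and all function names are assumptions made for the example. Real systems rely on finite-state morphological analysers and a proper POS disambiguator.

```python
import re

# Toy lexicon (hypothetical): surface form -> possible (base form, POS) readings.
LEXICON = {
    "les": [("le", "DET")],
    "cadres": [("cadrer", "VERB"), ("cadre", "NOUN")],
    "rumeurs": [("rumeur", "NOUN")],
    "vont": [("aller", "VERB")],
    "apparemment": [("apparemment", "ADV")],
    "bon": [("bon", "ADJ")],
    "train": [("train", "NOUN")],
}

# Toy bilingual entries (hypothetical): (base form, POS) -> translation.
ENTRIES = {
    ("cadre", "NOUN"): "frame; setting; framework; executive",
    ("cadrer", "VERB"): "to centre",
    ("rumeur", "NOUN"): "rumour",
    ("train", "NOUN"): "train",
}

# Multiword expressions written as regular expressions over "base/POS" items;
# up to two adverbs may sit between the head verb and its complement.
MWES = [
    (re.compile(r"\baller/VERB( \S+/ADV){0,2} bon/ADJ train/NOUN\b"),
     "to be flying around"),
]


def tokenise(sentence):
    """Split the sentence string into words."""
    return re.findall(r"\w+", sentence)


def analyse(tokens):
    """Normalise case, then keep one (base form, POS) reading per token."""
    chosen = []
    for token in tokens:
        readings = LEXICON.get(token.lower(), [(token.lower(), "UNK")])
        reading = readings[0]
        # Naive stand-in for POS disambiguation: prefer a noun after a determiner.
        if chosen and chosen[-1][1] == "DET":
            nouns = [r for r in readings if r[1] == "NOUN"]
            reading = nouns[0] if nouns else reading
        chosen.append(reading)
    return chosen


def lookup(sentence, clicked_index):
    """Return the translation to display for the clicked token."""
    readings = analyse(tokenise(sentence))
    parts, spans, position = [], [], 0
    for base, tag in readings:
        item = f"{base}/{tag}"
        spans.append((position, position + len(item)))
        parts.append(item)
        position += len(item) + 1  # +1 for the joining space
    tagged = " ".join(parts)
    start, end = spans[clicked_index]
    # A multiword expression covering the clicked word wins over the plain entry.
    for pattern, translation in MWES:
        for match in pattern.finditer(tagged):
            if match.start() <= start and end <= match.end():
                return translation
    return ENTRIES.get(readings[clicked_index], "(no entry)")
```

With this sketch, clicking on the last word of the invented sentence "Les rumeurs vont apparemment bon train" returns the translation of the whole expression, while clicking on "rumeurs" returns the plain noun entry.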

2.2 Multilingual Information Retrieval and Data Mining

Many of the linguistic tools being developed at XRCE are being used in applied research into multilingual information retrieval and, more broadly, in data mining [4, 14, 15, 17]. Multilingual information retrieval allows the interrogation of texts written in a target language by users asking questions in a source language. In order to perform this retrieval, the following linguistic processing steps are performed on the documents and the query:

– Automatically recognise the language of the text.
– Perform the morphological analysis of the text using Xerox finite state analysers.
– Part of speech tag the words in the text using the preceding morphological analysis and the probability of finding part-of-speech tag paths in the text.
– Lemmatise, i.e. normalise or reduce to dictionary entry form, the words in the text using the part of speech tags.

This morphological analysis, tagging, and subsequent lemmatisation of analysed words has proved to be as useful an improvement for information retrieval as any information-retrieval-specific stemming. To process a given query, an intermediate form of the query must be generated which compares the normalised language of the query to the indexed text of the documents. This intermediate form can be constructed by replacing each word with target language words through an online bilingual dictionary. The intermediate query, which is in the same language as the target documents, is passed along to a traditional information retrieval system, such as SMART. This simple word-based method is the first approach we have been testing at XRCE. Initial runs indicate that incorporating multi-word expression matching can significantly improve results. The multi-word expressions most interesting for information retrieval are terminological expressions, which most often appear as noun phrases in English.

Data mining refers to the extraction of structured information from unstructured text. A number of applications of linguistic processing can be applied to semantic extraction. One example of data-mining research that we are pursuing at XRCE is terminological extraction from text. A first pass at terminological extraction is possible once morphological analysers and a tagger have been created for a language. One need then only define a regular pattern of tagged words that corresponds with a noun phrase in that language. At present, we are able to extract such noun phrases from English, French, German, Spanish and Italian text.
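The idea of defining a regular pattern over tagged words can be illustrated with a minimal Python sketch. It assumes the text has already been tagged, uses a hypothetical two-tag inventory (ADJ, NOUN), and takes "any run of adjectives and nouns ending in a noun" as a simplified English noun-phrase pattern; the actual patterns and tagsets used for each language are not given here.

```python
import re


def extract_noun_phrases(tagged_words):
    """Return maximal (ADJ|NOUN)* NOUN sequences from (word, tag) pairs."""
    # Encode each tag as one character so that an ordinary regular
    # expression can describe the noun-phrase pattern over the tag sequence.
    codes = {"ADJ": "A", "NOUN": "N"}
    tag_string = "".join(codes.get(tag, "x") for _, tag in tagged_words)
    phrases = []
    for match in re.finditer(r"[AN]*N", tag_string):
        words = [word for word, _ in tagged_words[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hypothetical tagger output for a short English sentence.
EXAMPLE = [
    ("the", "DET"), ("multilingual", "ADJ"), ("information", "NOUN"),
    ("retrieval", "NOUN"), ("system", "NOUN"), ("uses", "VERB"),
    ("bilingual", "ADJ"), ("dictionaries", "NOUN"),
]
```

Because each word maps to exactly one character, match offsets in the tag string are also word offsets, which keeps the extraction step trivial. Extending the sketch to another language would mainly mean changing the pattern.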

3 Some Recent Projects for Digital Libraries

The past few years have seen a remarkable expansion of digital networks, especially those using the Internet. Information sources accessed via the Internet are the components of a digital library (DL). They are a mixture of public and private information, free of charge or paying, and include books, reports, magazines, newspapers, video and sound recordings, scientific data, etc. The role of language processing tools is becoming prominent in a number of digital libraries projects, including:

– Callimaque: a collaborative project for virtual libraries
– Lirix: an industrial extension of Callimaque
– TwentyOne
– PopEye and Olive

3.1 Callimaque: A Collaborative Project for Virtual Libraries

Digital libraries represent a new way of accessing information distributed all over the world, via the use of a computer connected to the Internet. Whereas a physical library deals primarily with physical data, a digital library deals with electronic documents such as texts, pictures, sounds and video. We expect more from a digital library than only the possibility of browsing its documents. A digital library front-end should provide users with a set of tools for querying and retrieving information, as well as annotating pages of a document, defining hyper-links between pages, or helping to understand multilingual documents.

Callimaque [20] is a virtual library resulting from a collaboration between XRCE and research/academic institutions of the Grenoble area ([...]). It reconstructs the early history of information technology in France. The project is based on a similar project, the CLASS project, which was started by the University of Cornell several years ago under the leadership of Stuart Lynn to preserve brittling old books. The CLASS project runs over conventional networks and all scanned material is in English.

The Callimaque project included the following steps:

– Scanning and indexing around 1000 technical reports and 2000 theses written at the University of Grenoble, using a Xerox [...] system integrated with a scanner, a high-speed printer, and software for dequeueing, indexing, storing, etc. Digitised documents can be reworked page by page and even restructured at the user's convenience. 30 Gbytes of memory are needed to store the images.
– Abstracts were OCRed to allow for textual search.
– Documents were recorded on a relational database on a UNIX server.

A number of identifiers (title, author, reference number, abstract, etc.) are associated with each document to facilitate the search. With a view to making these documents widely accessible, Xerox has developed software that authorises access to this database by any client using the http protocol used by the World Wide Web. The base is thus accessible via any PC, Macintosh, UNIX station or even from a simple terminal (the Web address is http://callimaque.grenet.fr). Print on demand facilities connected to the network allow the users to make copies of the scanned material.

The module of interrogation of the database is linked to a translation assistance tool developed at XRCE (see above, Locolex). It will not only enable readers unfamiliar with French to clarify certain abstracts, but also make keyword searching more intelligent by proposing synonyms, normalising inflected forms and resolving ambiguities. Multilingual terminology will help processing non-French queries.

Language processing tools, esp. finite-state based tools, have been used during the construction of the digital library (OCR, terminology extraction). They also actively support the end-user interaction with the final system (search, translation aid):

– OCR: documents collected in the Callimaque digital libraries have been scanned, but an additional step was taken by OCR-ing the abstracts. This allows for textual search beyond basic references such as title and author names. The OCR system integrates lexical information derived from lexical transducers. This drastically improves the accuracy of the recognition, as opposed to OCR based on simple word lists. This accuracy improvement becomes even more striking when dealing with highly inflectional languages.
– Bilingual terminology extraction: many of the documents integrated in the Callimaque library include bilingual abstracts, even when the body of the document is monolingual (generally written in French). Noun-phrase extraction, combined with alignment, was performed on such bilingual abstracts. This led to the construction of a bilingual term base (French, English), which has been used as a Callimaque-specific resource for indexing, cross-lingual search and translation aid.
– Cross-lingual search: the bilingual term base has been encoded following the Locolex approach. This allows for matches beyond simple string matching. For instance, an English query like [...] matches French documents containing expressions like [...] or [...]. This shows that matches are made regardless of morphological variations (e.g. singular or plural) or even regardless of the translation of [...]. Even more, the system matches [...] against [...], allowing for the insertion of the adjective [...] within the expression. This is permitted by the regular language used to describe expressions encoded in Locolex.
– Translation aid: a natural enhancement of the system consists in providing translation aid based on the bilingual term base. This naturally expands to any bilingual resources (e.g. a general dictionary) which can be made available to the end-user.
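The kind of matching described for cross-lingual search (tolerating number variation, alternative translations and an inserted adjective) can be approximated with ordinary regular expressions. The sketch below is an assumption-laden illustration: the English term, the French pattern, the sample documents and the function name are all invented for the example, and the real term base is encoded with the Locolex formalism rather than Python's `re` module.

```python
import re

# Hypothetical bilingual term base: English term -> pattern over French text.
# The pattern accepts singular or plural forms, two alternative translations
# of the modifier, and one optional word inserted after the head noun.
TERM_BASE = {
    "neural network": re.compile(
        r"\bréseaux?( \w+)? (de neurones|neuronaux|neuronal)\b",
        re.IGNORECASE,
    ),
}


def cross_lingual_search(english_query, french_documents):
    """Return the French documents matching the pattern stored for the query."""
    pattern = TERM_BASE.get(english_query.lower())
    if pattern is None:
        return []
    return [doc for doc in french_documents if pattern.search(doc)]


# Invented sample documents.
DOCUMENTS = [
    "Un réseau de neurones simple",
    "Des réseaux neuronaux profonds",
    "Les réseaux artificiels de neurones",
    "Le réseau routier",
]
```

A plain string comparison would find at most one of the first three documents; the pattern finds all three and still rejects the fourth.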

3.2 Lirix

Lirix is the follow-up to the Callimaque multilingual extensions. It is a multilingual information retrieval system built over Topic, the search engine from Verity, and the XeLDA linguistic platform developed at XRCE. The overall goal of the Lirix project is to design and build a general, portable, robust and innovative multilingual information system, with particular attention to CLIR (Cross Lingual Information Retrieval). The user can enter a query using his/her mother tongue, and perform retrieval across languages (by translation/expansion of the original query). Lirix relies on general-purpose dictionaries (unlike Callimaque, which is restricted to specialized term bases) and can run on any collection already indexed with Verity: no re-indexing is necessary.

CLIR systems have to cope with two difficult problems:

– They might miss relevant documents because of translation difficulties, esp. if the query translation is based on word by word analysis. For instance, multiword expressions require the relevant level of contextual analysis to be properly identified and then translated (e.g. [...] means [...] in French but, taken as a standalone word, [...] cannot be translated by [...] and vice versa). It is also necessary to expand the query to related words that are not direct translations of the initial query (e.g. [...] in the query [...]).
– The results are often very noisy because, in the absence of sense disambiguation, all senses of the words found in the initial query are retained for translation (e.g. the translation of the French word cadre could be [...], [...], etc.).

To overcome these difficulties, Lirix integrates advanced language processing capabilities:

– Multiword expressions recognition: Lirix uses the contextual dictionary lookup of XeLDA to identify multiword expressions and provide the adequate translation (e.g. [...] is translated by [...] in French). This improves accuracy and reduces noise, as indicated earlier.
– Relational morphology: this method generates all related word forms from the words in the query. This enables a fuzzy search of related words in the corpus. A query like [...] will return documents containing [...] but also [...].
– Pair extraction with shallow parsing: some words can be translated into numerous unrelated words (cf. translations of the French word cadre in the previous section). Searching the corpus for every possible translation will most likely retrieve a lot of non-relevant documents. One way to improve this situation is through shallow parsing and extraction of syntactically related pairs. For example, if the query contains [...], the system will search for the pair [...] in the corpus.

One interesting feature of shallow parsing is that extracted pairs are derived from a wide set of dependency relations between head words, such as subject/verb, verb/object, noun/adjective, etc. As a consequence, the pairs are not limited to short distance relations, as a more basic notion of co-occurrence within a window of a given size would impose. Also, the relations being syntactically motivated, they are less sensitive to noise due to word order, surface structure and proximity between otherwise unrelated elements. In the longer term, this approach could be further refined with sense disambiguation techniques as described above.
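The benefit of searching for syntactically related pairs rather than for every translation of every word can be shown with a small Python sketch. It assumes that a shallow parser has already produced (head, dependent) pairs for both the query and the corpus; the dictionary contents, the pairs and the function name are hypothetical, and this is not presented as the Lirix implementation.

```python
from itertools import product


def translate_pair(pair, bilingual, corpus_pairs):
    """Keep only the translation combinations attested as a pair in the corpus."""
    head, dependent = pair
    candidates = product(bilingual.get(head, []), bilingual.get(dependent, []))
    return [candidate for candidate in candidates if candidate in corpus_pairs]


# Hypothetical general-purpose dictionary: each word has several translations.
BILINGUAL = {
    "recruter": ["recruit", "hire"],
    "cadre": ["frame", "setting", "framework", "executive"],
}

# Hypothetical verb/object pairs extracted from the target-language corpus.
CORPUS_PAIRS = {("hire", "executive"), ("paint", "frame"), ("recruit", "volunteer")}
```

Translating the two query words independently would yield eight word combinations; restricting the search to pairs attested in the corpus keeps only the combination that is actually used, which acts as a light form of sense disambiguation.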

3.3 TwentyOne

Twenty-One [26, 11, 18, 34] is an EU funded project which aims to develop a tool for efficient dissemination of multimedia information in the field of sustainable development. The project started in March 1996 and will finish in 1999. The demonstrator incorporates one of the major project results: a so-called search engine which combines full text search with several other information retrieval functions. The search engine technology has been evaluated by the user partners, and at the TREC conferences in 1997 and 1998. The following key features are envisaged:

– Multimedia handling: support for the disclosure of a variety of formats (paper, word processor docs, audio, video)
– Cross-language retrieval: support for four languages (English, German, Dutch and French). Users can query the multilingual database in their mother tongue; documents will be presented in translation.

Twenty-One is of potential interest for:

– people looking for information about sustainable development (end-users)
– organisations that want to disseminate their publications to interested parties in various languages (info-providers)
– digital archives (textual and/or multimedia) in need of automatic disclosure tools

3.4 PopEye: Disclosure of Video Material Using Subtitles

Reuse of library material plays an important role in keeping down the budget for audio visual production. However, cataloguing and indexing of film and video recordings is an expensive and time consuming process. It requires the expertise of specialised archivists, and has not yet been automated. There are no computer systems yet in existence that can understand and interpret moving images, or that have the unrestricted ability to recognise and comprehend speech. But there is a source of information available that can enable existing computer technology to access the content of a video: the subtitles. Subtitles are produced either to translate foreign language programmes or (for native language programmes) to help people with hearing difficulties to understand the soundtrack.

Automated indexing. Many video library systems use textual descriptions to classify a video programme; but these usually offer a description only of the entire programme, not of each separate sequence (shot) that it contains. Subtitles are time-coded; they can provide the basis for an index that points to an exact location within a video programme. Pop-Eye's indexes will connect text to image sequences, providing a completely new approach to the indexing of visual material. Natural language processing techniques are applied to the subtitles, to index them and partially translate them into each of the three languages supported by the project (Dutch, English, German). Pop-Eye will extract time-coded texts from the subtitles of film or video sequences and then automatically generate multilingual indexes for the programmes. A hit in the index will reveal a still image (frame) or short clipping from the programme for the user to evaluate, not just a general indication that relevant images might be contained somewhere on a video tape.

Retrieval across the Internet. The Pop-Eye retrieval interface is based on a standard Internet browser. Using the World Wide Web, Pop-Eye enables organisations, such as the broadcasting companies who are members of Pop-Eye ([...] and [...]), to give users inside and outside the organisation access to their video archive. Retrieval of video sequences using Pop-Eye will be fast; this makes it cost-effective for any kind of programme production, and also opens new business opportunities: film and video archives can be made available to anyone who has access to the Internet.

Not just archived, but available. Many archive systems focus on building large video archives, storing as many hours as possible of film and video on huge piles of disks. By contrast, Pop-Eye is one of the very few systems that concentrates on obtaining good recall when searching stored video material. Pop-Eye does not just add keywords to a film or video sequence; it uses available text information (the subtitles) to guide users, in their own language, to the exact location within the video that might be of interest to them.
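The core idea of a subtitle-based index, a mapping from words to exact positions in the programme, fits in a few lines of Python. This is a minimal sketch under stated assumptions: subtitles are given as (start time in seconds, text) pairs, words are indexed as plain lower-cased strings, and the function name and sample data are invented. The linguistic processing and translation steps that Pop-Eye applies to the subtitles are left out.

```python
import re
from collections import defaultdict


def build_subtitle_index(subtitles):
    """Map each lower-cased word to the start times of the subtitles containing it."""
    index = defaultdict(list)
    for start_seconds, text in subtitles:
        # A word is recorded once per subtitle, however often it occurs in it.
        for word in sorted(set(re.findall(r"\w+", text.lower()))):
            index[word].append(start_seconds)
    return index


# Invented time-coded subtitles: (start time in seconds, text).
SUBTITLES = [
    (12.0, "The harbour is closed"),
    (95.5, "Ships leave the harbour at dawn"),
]

INDEX = build_subtitle_index(SUBTITLES)
```

A hit such as `INDEX["harbour"]` gives the time codes at which a still image or short clip could be shown to the user, instead of a pointer to the programme as a whole.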

3.5 Olive

OLIVE [29] is developing a system which automatically produces indexes from the sound track of a programme (television or radio). This allows archives to be searched by keywords and corresponding visual or soundtrack material to be retrieved. OLIVE broadens the applicability of PopEye, which is using subtitles and video time codes to index broadcast material. The system will facilitate radio and video production of broadcast material which incorporates existing material, for example news and documentaries. It will also be a valuable tool for programme researchers designing new programmes and other content providers, such as advertisers. Through the provision of bibliographic material, transcripts and video stills, the system will save time by allowing material to be pre-viewed before it is actually retrieved from an archive.

4 Conclusion

One of the major challenges for NLP in the process of pushing research results towards real-life applications resides not in the NLP technology itself but rather in integration, i.e. in the ability to plug NLP into contexts that are relevant for end users. Digital libraries represent one of the best cases for evaluating the impact of such integration. We are only at an early stage in this process, and future work will require continuous effort in various directions: improving the accuracy and precision of current NLP tools; expanding such tools to new languages, esp. to take into consideration the growing language diversity on the Internet and in electronic documents at large; pushing research in critical areas, such as semantic disambiguation, use of thesaurus, and knowledge extraction, in order to build the next generation of NLP tools; developing smarter user interfaces, taking into account the feedback collected from current NLP applications; and integrating recent advances in distributed practices and multimodal and multimedia knowledge repositories.

References

1. Steven Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht, 1991.
2. Salah Aït-Mokhtar and Jean-Pierre Chanod. Incremental finite-state parsing. In Proceedings of Applied Natural Language Processing, Washington, DC, 1997.
3. Salah Aït-Mokhtar and Jean-Pierre Chanod. Subject and object dependency extraction using finite-state transducers. In ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
4. Roberto Basili and Maria Teresa Pazienza. Lexical acquisition and information extraction. In SCIE 1997, pages 44–72, 1997.
5. D. Bauer, F. Segond, and A. Zaenen. Locolex: the translation rolls off your tongue. In Proceedings of the ACH-ALLC Conference, pages 6–8, Santa Barbara, 1995.
6. D. Bourigault. An endogenous corpus-based method for structural noun phrase disambiguation. In 6th Conf. of EACL, Utrecht, 1993.
7. Jean-Pierre Chanod and Pasi Tapanainen. Tagging French: comparing a statistical and a constraint-based method. In Seventh Conference of the European Chapter of the ACL, Dublin, 1995.
8. Jean-Pierre Chanod and Pasi Tapanainen. A non-deterministic tokeniser for finite-state parsing. In ECAI '96 Workshop on Extended Finite State Models of Language, Budapest, 1996.
9. M-H Corréard and V. Grundy, editors. The Oxford Hachette French Dictionary. Oxford University Press-Hachette, Oxford, 1994.
10. Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech tagger. In Proceedings of ANLP-92, pages 133–140, Trento, 1992.
11. Franciska de Jong. Twenty-one: a baseline for multilingual multimedia retrieval. In Proceedings of the Fourteenth Twente Workshop on Language Technology TWLT-14, pages 189–195, University of Twente, 1998.
12. L. Dini, V. Di Tomaso, and F. Segond. Ginger II: an example-driven word sense disambiguator. Computers and the Humanities, 1999. To appear.
13. F. Segond and P. Tapanainen. Using a finite-state based formalism to identify and generate multiword expressions. Technical report, Xerox Research Centre Europe, Grenoble, 1995.
14. Gregory Grefenstette, editor. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, 1994.
15. Gregory Grefenstette, Ulrich Heid, and Thierry Fontenelle. The DECIDE project: Multilingual collocation extraction. In Seventh Euralex International Congress, University of Gothenburg, Sweden, Aug 13-18, 1996.
16. Jan Hajic and Barbora Hladka. Czech language processing / POS tagging. In First International Conference on Language Resources and Evaluation, Granada, 1998.
17. Djoerd Hiemstra. A linguistically motivated probabilistic model of information retrieval. In Christos Nicolaou and Constantine Stephanidis, editors, Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries: ECDL'98, pages 569–584. Springer-Verlag, 1998.
18. Djoerd Hiemstra and Franciska de Jong. Cross-language retrieval in Twenty-One: using one, some or all possible translations? In Proceedings of the Fourteenth Twente Workshop on Language Technology TWLT-14, pages 19–26, University of Twente, 1998.
19. Karen Jensen, George E. Heidorn, and Stephen D. Richardson, editors. Natural Language Processing: the PLNLP Approach. Kluwer Academic Publishers, Boston, 1993.
20. L. Julliard, M. Beltrametti, and F. Renzetti. Information retrieval and virtual libraries: the Callimaque model. In CAIS'95, Edmonton, Canada, June 1995.
21. Ronald M. Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378, 1994.
22. L. Karttunen, J-P Chanod, G. Grefenstette, and A. Schiller. Regular expressions for language engineering. Journal of Natural Language Engineering, 2(4):307–330, 1997.
23. Lauri Karttunen. Constructing lexical transducers. In Proceedings of the 15th International Conference on Computational Linguistics, Coling, Kyoto, Japan, 1994.
24. Lauri Karttunen. The replace operator. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL-95, pages 16–23, Boston, 1995.
25. Kimmo Koskenniemi. A General Computational Model for Word-Form Recognition and Production. PhD thesis, Department of General Linguistics, University of Helsinki, 1983.
26. W. Kraaij. Multilingual functionality in the TwentyOne project. In Proceedings of the AAAI Spring Symposium on Cross Language Text and Speech Retrieval, Palo Alto, March 1997.
27. M. Lauer and M. Dras. A probabilistic model of compound nouns. In 7th Joint Australian Conference on Artificial Intelligence, 1994.
28. Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–312, 1997.
29. K. Netter and F.M.G. de Jong. Olive: speech based video retrieval. In Language Technology in Multimedia Information Retrieval. Proceedings Twente Workshop on Language Technology (TWLT14), Enschede, 1998.
30. E. Roche and Y. Schabes, editors. Finite-State Language Processing. MIT Press, Cambridge, Massachusetts, 1997.
31. Anne Schiller. Multilingual finite-state noun phrase extraction. In ECAI '96 Workshop on Extended Finite State Models of Language, Budapest, Aug. 11-12, 1996.
32. F. Segond, E. Aimelet, and L. Griot. "All you can use!" or how to perform word sense disambiguation with available resources. In Second Workshop on Lexical Semantic System, Pisa, Italy, 1998.
33. T. Strzalkowski. Natural language information retrieval. Information Processing and Management, 31(3):1237–1248, 1995.
34. W.G. ter Stal, J.-H. Beijert, G. de Bruin, J. van Gent, F.M.G. de Jong, W. Kraaij, K. Netter, and G. Smart. Twenty-one: cross-language disclosure and retrieval of multimedia documents on sustainable development. Computer Networks and ISDN Systems, 30(13):1237–1248, 1998.

Natural Language Processing and Information Retrieval

Ellen M. Voorhees

National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
[email protected]

Abstract. Information retrieval addresses the problem of finding those documents whose content matches a user's request from among a large collection of documents. Currently, the most successful general purpose retrieval methods are statistical methods that treat text as little more than a bag of words. However, attempts to improve retrieval performance through more sophisticated linguistic processing have been largely unsuccessful. Indeed, unless done carefully, such processing can degrade retrieval effectiveness. Several factors contribute to the difficulty of improving on a good statistical baseline, including: the forgiving nature but broad coverage of the typical retrieval task; the lack of good weighting schemes for compound index terms; and the implicit linguistic processing inherent in the statistical methods. Natural language processing techniques may be more important for related tasks such as question answering or document summarization.

1 Introduction

Imagine that you want to research a problem such as eliminating pests from your garden or learning the history of the city you will visit on your next holiday. One strategy is to gather recommendations for items to read; that is, to ask for references to documents that discuss your problem rather than to ask for specific answers. Computer systems that return documents whose contents match a stated information need have historically been called information retrieval (IR) systems, though lately they are more often called document retrieval or text retrieval systems to distinguish them from systems that support other kinds of information-seeking tasks.

Information retrieval systems search a collection of natural language documents with the goal of retrieving exactly the set of documents that pertain to a user's question. In contrast to database systems that require highly structured data and have formal semantics, IR systems work with unstructured natural language text. And in contrast to expert systems, IR systems do not attempt to deduce or generate specific answers but return (pieces of) documents whose content is similar to the question. While IR systems have existed for decades, today the World Wide Web search engines are probably the best-known examples of text retrieval systems. Other examples include systems that support literature searches at libraries, and patent- or precedent-searching systems in law firms. The underlying technology of retrieval systems (estimating the similarity of the content of two texts) is more broadly applicable, encompassing such tasks as information filtering, document summarization, and automatic construction of hypertext links.

M.T. Pazienza (Ed.): Information Extraction, LNAI 1714, pp. 32–48, 1999. © Springer-Verlag Berlin Heidelberg 1999

Information retrieval can be viewed as a great success story for natural language processing (NLP): a major industry has been built around the automatic manipulation of unstructured natural language text. Yet the most successful general purpose retrieval methods rely on techniques that treat text as little more than a bag of words. Attempts to improve retrieval performance through more sophisticated linguistic processing have been largely unsuccessful, resulting in minimal differences in effectiveness at a substantially greater processing cost, or even degrading retrieval effectiveness.

This paper examines why linguistically-inspired retrieval techniques have had little impact on retrieval effectiveness. A variety of factors are indicated, ranging from the nature of the retrieval task itself to the fact that current retrieval systems already implicitly incorporate features the linguistic systems make explicit. The next section provides general background by describing both how current retrieval systems operate and the evaluation methodology used to decide if one retrieval run is better than another. Section 3 provides an overview of recent NLP and IR research, including a case study of a particular set of NLP experiments to illustrate why seemingly good ideas do not necessarily lead to enhanced IR performance. The final section suggests some related tasks that may benefit more directly from advances in NLP.

2 Background

Text retrieval systems have their origins in library systems that were used to provide bibliographic references to books and journals in the library's holdings [1]. This origin has had two major influences on how the retrieval task is defined. First, retrieving (pointers to) documents rather than actual answers was the natural extension to the manual processes that were used in the libraries at the time, and this continues to be the main focus of the task. Second, retrieval systems are expected to handle questions on any subject matter included in a relatively large amount of text. This requirement for domain-independence and large amounts of text precluded knowledge-based approaches for text understanding from being incorporated into retrieval systems, because the requisite knowledge structures were not available and the processing was too slow. Instead, the majority of information retrieval systems use statistical approaches to compute the similarity between documents and queries. That is, they use word counting techniques and assume that two texts are about the same topic if they use the same words.

A basic understanding of how these current retrieval systems work is required to appreciate how linguistic processing might affect their performance. This section provides a summary of the current practice in IR based on the results of an on-going series of evaluations known as the Text REtrieval Conference (TREC) workshops. The final part of the section describes common practices for retrieval system evaluation.

2.1 The Basics of Current IR Systems

Retrieval systems consist of two main processes, indexing and matching. Indexing is the process of selecting terms to represent a text. Matching is the process of computing a measure of similarity between two text representations. In some environments human indexers assign terms, which are usually selected from a controlled vocabulary. A more common alternative is to use automatic indexing, where the system itself decides on the terms based on the full text of the document. A basic automatic indexing procedure for English might proceed as follows:

1. split the text into strings of characters delimited by white space, considering such strings to be "words" (tokenization);
2. remove very frequent words such as prepositions and pronouns (removal of stop words); and
3. conflate related word forms to a common stem by removing suffixes (stemming).

The resulting word stems would be the terms for the given text.

In early retrieval systems, queries were represented as Boolean combinations of terms, and the set of documents that satisfied the Boolean expression was retrieved in response to the query. While this Boolean model is still in use today, it suffers from some drawbacks: the size of the retrieved set is difficult to control, and the user is given no indication as to whether some documents in the retrieved set are likely to be better than others in the set. Thus most retrieval systems return a ranked list of documents in response to a query. The documents in the list are ordered such that the documents the system believes to be most like the query are first on the list.

The vector-space model is another early retrieval model still in use today. In this model, documents and queries are represented by vectors in T-dimensional space, where T is the number of distinct terms used in the documents and each axis corresponds to one term. Given a query, a vector system produces a ranked list of documents ordered by similarity to the query, where the similarity between a query and a document is computed using a metric on the respective vectors.

Other retrieval models exist, including several different probabilistic models and models based on word proximity. One of the findings of the TREC workshops is that retrieval systems based on quite different models exhibit similar retrieval effectiveness. That is, retrieval effectiveness is not strongly influenced by the specifics of the model used as long as the model incorporates appropriate term weighting. Term weighting, on the other hand, has been shown to have a primary effect on retrieval quality, with the best weights combining term frequency (tf), inverse document frequency (idf), and document length (dl) factors. In this formulation, the tf factor weights a term proportionally to the number of times it occurs in the text, the idf factor weights a term inversely proportional to the number of documents in the collection that contain the term, and the dl factor compensates for widely varying document lengths.
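The indexing procedure and the vector-space ranking described above can be put together in a short Python sketch. It is an illustration under simplifying assumptions, not a description of any particular TREC system: the stop list is tiny, the stemmer only strips three suffixes, the idf variant log(1 + N/df) is one arbitrary choice among many, and cosine normalisation stands in for the dl factor. All function names and sample texts are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "and", "in", "of", "the", "to"}  # tiny illustrative list


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenise on white space, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]


def rank(query, documents):
    """Return the indexes of matching documents, most similar first."""
    doc_counts = [Counter(index_terms(d)) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in doc_counts for term in counts)

    def weights(counts):
        # tf * idf; terms that never occur in the collection are ignored.
        return {t: tf * math.log(1 + n_docs / doc_freq[t])
                for t, tf in counts.items() if t in doc_freq}

    def cosine(u, v):
        # Dividing by the vector lengths plays the role of the dl factor.
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    query_vec = weights(Counter(index_terms(query)))
    scored = [(cosine(query_vec, weights(c)), i) for i, c in enumerate(doc_counts)]
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]


DOCS = [
    "Eliminating pests from the garden",
    "The history of the city",
    "Garden pests and city gardens",
]
```

In this toy collection the query "garden pests" ranks the third document above the first, because the third mentions the stem "garden" twice and carries fewer unrelated terms; the second document shares no term with the query and is not returned.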

2.2 The TREC Workshops

The relative merit of different retrieval approaches (for example, different weighting schemes) is evaluated using test collections, benchmark tasks for which the correct answers are known. Because retrieval performance is known to vary widely across queries, test collections need to contain a sufficient number of queries to make comparisons meaningful. Further, an observed difference in retrieval performance between two systems is generally considered valid only if it is repeatable across multiple collections. Thus statements regarding best practices in IR must be based on hundreds of retrieval runs. TREC provides the necessary infrastructure to support such comparisons (http://trec.nist.gov).

The TREC workshops are designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing results. Started in 1992, the conference is co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA). For each TREC, NIST provides a test set of documents and questions. Participants run their retrieval systems on the data, and return to NIST a list of the retrieved top-ranked documents. NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results. The TREC cycle ends with a workshop that is a forum for participants to share their experiences.

TREC's success depends on having a diverse set of participants. Since the relevance judgments (the "correct answers") are based on pooled results, the pools must contain the output from many different kinds of systems for the final test collections to be unbiased. Also, a variety of different candidate techniques must be compared to make general recommendations as to good retrieval practice. Fortunately, TREC has grown in both the number of participants and the number of different retrieval tasks studied since the first TREC. The latest TREC, TREC-7 held in November 1998, had 56 participating groups from 13 different countries and included representatives from the industrial, academic, and government sectors.

The first TREC conferences contained just two main tasks, ad hoc and routing. Additional subtasks known as "tracks" were introduced into TREC in TREC-4 (1995). The main ad hoc task provides an entry point for new participants and provides a baseline of retrieval performance. The tracks invigorate TREC by focusing research on new areas or particular aspects of text retrieval. To the extent the same retrieval techniques are used for the different tasks, the tracks also validate the findings of the ad hoc task. Figure 1 shows the number of experiments performed in each TREC, where the set of runs submitted for one track by one participant is counted as one experiment.
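The pooling step mentioned above, in which the results returned by participants are merged before being judged, can be sketched as follows. The text does not say how many documents per run enter the pool, so the `depth` parameter, its default value, the function name and the sample runs are assumptions made for illustration only.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from every submitted run."""
    pool = set()
    for ranked_doc_ids in runs.values():
        pool.update(ranked_doc_ids[:depth])
    return pool


# Invented runs: system name -> ranked list of document identifiers.
RUNS = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d5"],
}
```

Only documents in the pool are judged, which is why the pool needs contributions from many different kinds of systems: a relevant document that no participating system ranks highly is never seen by the assessors.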

[Figure: bar chart of the number of experiments (vertical axis, 0 to 140) for TREC 1 through TREC 7, broken down by task: Ad Hoc, Routing, Interactive, Spanish, Confusion, DB Merging, Filtering, Chinese, NLP, Speech, X Lingual, High Precision, VLC, Query.]

Fig. 1. Number of TREC experiments by TREC task

2.3 Best Practices

Enough different experiments have been run in TREC to support general conclusions about best practices for IR, that is, retrieval techniques incorporated by most retrieval systems because they have been shown to be beneficial [3]. One such practice, term weighting, has already been mentioned as being critical to retrieval success.

Another primary factor in the effectiveness of retrieval systems is good query formulation. Of course, the best way of getting a good query is to have the user provide one. Unfortunately, users don't tend to provide sufficient context, usually offering a few keywords as an initial question. Retrieval systems compensate by performing query expansion, adding related terms to the query. There are several different ways such expansion can be accomplished, but the most commonly used method is through blind feedback. In this technique, a retrieval run consists of two phases. In the first phase, the original query is used to retrieve a list of documents. The top documents on the list are assumed to be relevant and are used as a source of discriminating terms; these terms are added to the query and the query is reweighted. The second phase uses the reformulated query to retrieve a second document list that is returned to the user.

Two other techniques, the use of passages and phrasing, are now used by most retrieval systems, though they do not have as large an impact on the final results as weighting and query formulation do. Phrasing is the determination of compound index terms, i.e., an index term that corresponds to more than one word stem in the original text. Most frequently the phrases are word pairs that co-occur in the corpus (much) more frequently than expected by chance. Generally, both the individual word stems and the compound term are added to the query. Passages are subparts of a document. They are used as a means of finding areas of homogeneous content within large documents that cover a variety of subjects.
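The two-phase blind feedback run described above can be sketched as follows. This is a toy illustration, not the weighting used by any TREC system: scoring is a weighted term count, expansion terms are chosen by how many of the top documents contain them, and the parameters `top_k`, `n_new_terms` and `new_weight` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Query = Dict[str, float]       # term -> weight
Corpus = Dict[str, List[str]]  # document id -> list of (stemmed) terms


def rank(query: Query, docs: Corpus) -> List[str]:
    """Rank documents by the weighted count of query terms they contain."""
    def score(doc_id: str) -> float:
        counts = Counter(docs[doc_id])
        return sum(weight * counts[term] for term, weight in query.items())
    # Ties are broken by document id so the ranking is deterministic.
    return sorted(docs, key=lambda doc_id: (-score(doc_id), doc_id))


def expand_query(query: Query, docs: Corpus, top_k: int = 2,
                 n_new_terms: int = 2, new_weight: float = 0.5) -> Query:
    """Phase one: retrieve, assume the top documents are relevant, add terms."""
    assumed_relevant = rank(query, docs)[:top_k]
    candidates = Counter(
        term
        for doc_id in assumed_relevant
        for term in set(docs[doc_id])
        if term not in query
    )
    best = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    expanded = dict(query)
    expanded.update({term: new_weight for term, _ in best[:n_new_terms]})
    return expanded


docs = {
    "d1": ["somatotropin", "growth", "hormone"],
    "d2": ["somatotropin", "hormone", "pituitary"],
    "d3": ["bank", "river"],
    "d4": ["growth", "hormone", "therapy"],
}
query = {"somatotropin": 1.0}

first_ranking = rank(query, docs)
expanded = expand_query(query, docs)
second_ranking = rank(expanded, docs)  # phase two: the list shown to the user
```

In this toy corpus the expansion adds "hormone" and "growth", so the second phase moves d4 above the unrelated d3 even though d4 never mentions the original query term.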

2.4 Evaluating Retrieval System Effectiveness

Throughout this paper I assume it is possible to decide that one retrieval run is more effective than another. This subsection describes the evaluation methodology used to make this determination.

Retrieval experiments are performed using test collections. A test collection consists of a set of documents, a set of questions (called "topics" in TREC), and, for each question, a list of the documents that are relevant to that question, the relevance assessments. Relevance assessments are generally binary (a document is either relevant or not) and assumed to be exhaustive (if a document is not listed as being relevant, it is irrelevant). A number of different effectiveness measures can be computed using the relevance assessments of a test collection.

A very common method of evaluating a retrieval run is to plot precision against recall. Precision is the proportion of retrieved documents that are relevant, and recall is the proportion of relevant documents that are retrieved. While a perfect retrieval run will have a value of 1.0 for both recall and precision, in practice precision and recall are inversely related.

The effectiveness of individual queries varies greatly, so the average of the precision and recall values over a set of queries is used to compare different schemes. The precision of an individual query can be interpolated to obtain the precision at a standard set of recall values (for example, 0.0 to 1.0 in increments of .1). The precision at these recall points is then averaged over the set of queries in the test collection. The "3-point" average precision is used below as a single measure of retrieval effectiveness in a case study; this average is the mean of the precision values at each of 3 recall values (.2, .5, and .8).

Another single-valued measure called "(non-interpolated) average precision" was introduced in the TREC workshops and is used to discuss the TREC results below. The average precision for a single topic is the mean of the precision values obtained after each relevant document is retrieved. The mean average precision for a run consisting of multiple queries is the mean of the average precision scores of each of the queries in the run. In geometric terms, the average precision for a single query is the area underneath the uninterpolated recall-precision graph.
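The measures defined above can be made concrete with a short sketch. Two details are assumptions rather than statements from the text: interpolated precision at a recall level is taken as the highest precision observed at any recall at or above that level, and relevant documents that are never retrieved contribute zero to the average precision.

```python
from typing import List, Sequence, Set, Tuple


def recall_precision_points(ranking: List[str],
                            relevant: Set[str]) -> List[Tuple[float, float]]:
    """(recall, precision) after each retrieved document, in rank order."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values obtained after each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def interpolated_precision(ranking: List[str], relevant: Set[str],
                           recall_level: float) -> float:
    """Highest precision at any point whose recall is >= recall_level."""
    candidates = [p for r, p in recall_precision_points(ranking, relevant)
                  if r >= recall_level]
    return max(candidates, default=0.0)


def three_point_average(ranking: List[str], relevant: Set[str],
                        levels: Sequence[float] = (0.2, 0.5, 0.8)) -> float:
    """Mean interpolated precision at recall .2, .5 and .8."""
    values = [interpolated_precision(ranking, relevant, lvl) for lvl in levels]
    return sum(values) / len(values)


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the per-query average precision scores of a run."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4"}
ap = average_precision(ranking, relevant)
three_pt = three_point_average(ranking, relevant)
```

Here the relevant documents appear at ranks 1 and 4, so the precision values after each relevant document are 1.0 and 0.5, and the average precision is 0.75.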

3 Current Applications of NLP to IR

Before discussing how NLP is used in IR, it is necessary to define what constitutes "natural language processing". The very fact that retrieval systems operate on natural language text and return useful results demonstrates that, at some level, text retrieval is natural language processing. IR systems must at least tokenize the text¹, which is fairly trivial for English, but is more of a challenge in languages such as German (with its extensive use of compound forms) or Chinese (where there are very few syntactic clues to word boundaries). Many retrieval systems also perform stemming, a type of morphological processing. Nonetheless, in common usage "NLP for IR" has the more specific meaning of using linguistically-inspired processing to improve text retrieval system effectiveness [4,5].

¹ Not all systems tokenize the text into words. Systems based on n-grams [6] use word fragments as index terms. Other systems such as the MultiText system [7] do not index at all, but treat the entire document collection as one long string and define queries as arbitrary patterns over the string.

In most cases the NLP has focused on improving the representation of text (either documents or queries) during indexing. Matching the resulting query and document representations then proceeds in the usual way, though special processing may be used to decide if two individual terms match. For example, if index terms are noun phrases, then a partial match may be made if two terms share a common head but are not identical.

This section reviews some of the recent research in applying NLP techniques to information retrieval indexing. The section begins by examining a particular experiment as a case study of the types of issues involved when incorporating NLP techniques within existing retrieval frameworks. It then looks at the research that has been undertaken in the context of the TREC program, especially the NLP track in TREC-5 (1996) [8].

3.1 A Case Study

The case study involves an investigation into using the semantic information encoded in WordNet, a manually-constructed lexical system developed by George Miller and his colleagues at Princeton University [9], to enhance access to collections of text. The investigation took place several years ago and is described in detail elsewhere [10,11]. It is summarized here to illustrate some of the pitfalls of linguistic processing.

WordNet is a system that reflects current psycholinguistic theories about how humans organize their lexical memories. The basic object in WordNet is a set of strict synonyms called a synset. By definition, each synset in which a word appears is a different sense of that word. Synsets are organized by the lexical relations defined on them, which differ depending on part of speech. For nouns (the only part of WordNet used in the experiment), the lexical relations include antonymy, hypernymy/hyponymy (is-a relation) and three different meronym/holonym (part-of) relations. The is-a relation is the dominant relationship, and organizes the synsets into a set of approximately ten hierarchies.

The focus of the investigation was to exploit the knowledge encoded in WordNet to ameliorate the effects synonyms and homographs have on text retrieval systems that use word matching. In the case of homographs, words that appear to be the same represent two distinct concepts, such as 'bank' meaning both the sides of a river and a financial institution. With synonyms, two distinct words represent the same concept, as when both 'board' and 'plank' mean a piece of wood. Homographs depress precision because false matches are made, while synonyms depress recall because true matches are missed. In principle, retrieval effectiveness should improve if matching is performed not on the words themselves, but on the concepts the words represent.

This idea of conceptual indexing is not new to IR. Controlled vocabularies generally have a canonical descriptor term that is to be used for a given concept. Concept matching has also been used successfully in limited domains by systems such as SCISOR [12] and FERRET [13]; in these systems, meaning structures are used to represent the concepts and sophisticated matching algorithms operate on the structures. Less knowledge-intensive approaches to concept matching have also been developed. For example, abstracting away from the particular words that happen to be used in a given text is the motivation behind latent semantic indexing [14]. The point of our investigation was to see if WordNet synsets could be used as concepts in a general-purpose retrieval system.

Successfully implementing conceptual indexing using synsets requires a method for selecting a single WordNet synset as the meaning for each noun in a text, i.e., a word sense disambiguation procedure. The disambiguation procedure used will not be described here. For this discussion, the important feature of the procedure is that it used the contents of a piece of text (document or query) and the structure of WordNet itself to return either one synset id or a failure indicator for each ambiguous noun in the text. The synset ids were used as index terms as described in the next paragraph.

The experiments used an extended vector space model of information retrieval that was introduced by Fox [15]. In this model, a vector is a collection of subvectors where each subvector represents a different aspect of the documents in the collection. The overall similarity between two extended vectors is computed as the weighted sum of the similarities of corresponding subvectors. That is, the similarity between query Q and document D is

$$\mathrm{sim}(Q, D) = \sum_{\text{subvector } i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)$$

where α_i reflects the importance of subvector i in the overall similarity between texts and sim_i is the similarity metric for vectors of type i.

For the conceptual indexing experiments, document and query vectors each contained three subvectors: stems of words not found in WordNet or not disambiguated, synonym set ids of disambiguated nouns, and stems of the disambiguated nouns. The second and third subvectors are alternative representations of the text in that the same text word causes an entry in both subvectors. The noun word stems were kept to act as a control group in the experiment. When the weight of the synset id subvector is set to zero in the overall similarity measure, document and query texts are matched solely on the basis of word stems.

To judge the effectiveness of the conceptual indexing, the performance of the sense vectors was compared to the performance of a baseline run (see Table 1). In the baseline run, both document and query vectors consisted of just one subvector that contained word stems for all content words. The table gives the effectiveness of the baseline run and three different sense-based vector runs for five standard test collections. The five test collections are

CACM  3204 documents on computer science and 50 queries,
CISI  1460 documents on information science and 35 queries,
CRAN  1400 documents on engineering and 225 queries,
MED   1033 documents on medicine and 30 queries, and
TIME  423 documents extracted from Time Magazine and 83 queries.

Each row in the table gives the average 3-point precision value obtained by the four different retrieval runs for a particular collection, where the average is over the number of queries in that collection. For each of the sense-based vector runs, the percentage change in 3-point precision over the standard run is also given. Thus, the entry in row 'MED', column '211' of the table indicates that the average precision for the MED collection when searched using sense-based vectors 211 (explained below) is .4777, which is a 13.6% degradation in effectiveness as compared to the average precision of .5527 obtained when using standard stem-based vectors.

Table 1. 3-point average precision for sense-based vector runs

Collection  baseline  110             211             101
            3-pt      3-pt    %       3-pt    %       3-pt    %
CACM        .3291     .1994   -39.4   .2594   -21.2   .2998    -8.9
CISI        .2426     .1401   -42.3   .1980   -18.4   .2225    -8.3
CRAN        .4246     .2729   -35.7   .3261   -23.2   .3538   -16.7
MED         .5527     .4405   -20.3   .4777   -13.6   .4735   -14.3
TIME        .6891     .6044   -12.3   .6462    -6.2   .6577    -4.6
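The extended vector similarity behind these runs can be sketched in a few lines. The sketch assumes cosine similarity for every subvector, which the text does not specify, and it reads each run label as the three subvector weights in the order non-noun stems, synset ids, noun stems. The synset id and the example terms are invented for illustration.

```python
import math
from typing import Dict, Sequence

Vector = Dict[str, float]  # term (stem or synset id) -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extended_similarity(query: Sequence[Vector], doc: Sequence[Vector],
                        alphas: Sequence[float]) -> float:
    """Weighted sum of the similarities of corresponding subvectors."""
    return sum(alpha * cosine(q, d) for alpha, q, d in zip(alphas, query, doc))


# Subvector order: (non-noun stems, synset ids, noun stems).
# The query uses the noun 'separation'; the document uses the verb 'separate'.
# Both conflate to the same stem, but they land in different subvectors.
query_subvectors = ({}, {"synset:hypothetical-id": 1.0}, {"separ": 1.0})
doc_subvectors = ({"separ": 1.0}, {}, {})

control_101 = extended_similarity(query_subvectors, doc_subvectors, (1, 0, 1))

# Baseline: a single subvector holding the stems of all content words.
baseline = extended_similarity(({"separ": 1.0},), ({"separ": 1.0},), (1,))
```

Because a term only matches within the same subvector, the shared stem scores 0.0 under the '101' weighting while the single-vector baseline scores 1.0. A stem-only run over subvectors is therefore not equivalent to the baseline.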

The three sense-based vector runs differ in the way the subvectors were weighted when computing the overall similarity between documents and queries, and these weights are used to label the runs. The run labeled '110' gives equal weight to the non-noun word stems and the synset ids and ignores the noun word stems. This run represents a true conceptual indexing run. The run labeled '211' gives the non-noun word stems twice the weight given to each of the synset ids and the noun word stems. This run weights the non-noun stems twice to counterbalance the fact that both the noun stems and the noun senses are included. The final run ('101') is a control run: all of the word stems get equal weight and the synset ids are ignored. This is not equivalent to the baseline run since the overall similarity measure only counts a term match if the term occurs in the same subvector in both the query and document.

Clearly, the effectiveness of the sense-based vectors was worse than that of the stem-based vectors, sometimes very much worse. As is usually the case with retrieval experiments, examination of individual query results shows that some queries were helped by the conceptual indexing while others were hurt by it. For example, the retrieval effectiveness of MED query 20 was improved by the sense-based vectors. Query 20 requests documents that discuss the effects of 'somatotropin', a human growth hormone. Many of the relevant documents use the variant spelling 'somatotrophin' for the hormone and thus are not retrieved in the standard run. Since the synset that represents the hormone includes both spellings as members of the set, documents that use either spelling are indexed with the same synset identifier in the sense-based run and match the query.

In contrast, the retrieval effectiveness of MED query 16 was severely degraded by the sense-based vectors. The query requests documents on separation anxiety in infant and preschool children. It retrieves 7 relevant documents in the top 15 for the standard run but only 1 relevant document in the top 15 for the '110' run. The problem is selecting the sense of 'separation' in the query. WordNet contains eight senses of the noun 'separation'. With few clues to go on in the short query text, the indexing procedure selected a sense of 'separation' that was not used in any document. The query's separation concept could therefore never match any document, and retrieval performance suffered accordingly.

In this particular set of experiments, almost all of the degradation in retrieval performance can be attributed to missing term matches between documents and queries when using sense-based vectors that are made when using standard word stem vectors. The missed matches have several causes: different senses of a noun being chosen for documents and queries when in fact the same sense is used; the inability to select any senses in some queries due to lack of context; and adjectives and verbs that conflate to the same stem as a noun in the standard run but are maintained as separate concepts in the sense-based runs.

The importance of finding matches between document and query terms is confirmed by the degradation in performance of the control run '101' compared to the baseline run. The only major difference between the control run, which ignores the senses and just uses the word stems, and the baseline run, which also uses only word stems, is the introduction of subvectors in the '101' run. In the sense-based vectors, stems of words that are not nouns or nouns that are not in WordNet are in one subvector and stems of WordNet nouns are in the other subvector. The extended vector similarity measure matches a word stem in the document vector only if that word stem appears in the same subvector in the query. Therefore, adjectives and verbs that conflate to the same stem as a noun get counted as a match in the baseline run but do not match in the '101' run.

Of course, the fact that the conceptual indexing failed in this one experiment does not mean that concepts are inherently inferior to word stems. A disambiguation procedure that was able to resolve word senses more consistently between documents and queries would have improved the sense-based results above, as would an indexing procedure that could recognize concepts implied by words other than nouns. But the experiment does offer some broader insights into improving word-based retrieval through linguistically selected index terms.

Linguistic techniques must be essentially perfect to help. The state of the art in linguistic processing of domain-independent text (e.g., part-of-speech tagging, sense resolution, parsing, etc.) is such that errors still occur. Thus the effect of errors on retrieval performance must be considered when trying to use these techniques to overcome the deficiencies of word stem indexing. Unfortunately, in the particular case of word sense disambiguation, a common error (incorrectly resolving two usages of the same sense differently) is disastrous for retrieval effectiveness. Sanderson found that disambiguation accuracy of at least 90% was required just to avoid degrading retrieval effectiveness [16]. This is a very high standard of performance for current NLP technology.

Queries are difficult. Queries are especially troublesome for most NLP processing because they are generally quite short and offer little to assist linguistic processing. But to have any effect whatsoever on retrieval, queries must also contain the type of index terms used in documents, or at least have some way of interacting with the documents' index terms.

Nonlinguistic techniques implicitly exploit linguistic knowledge. Even if done perfectly, linguistic techniques may provide little benefit over appropriate statistical techniques because the statistical techniques implicitly exploit the same information the linguistic techniques make explicit. Again using sense disambiguation as an example, in practice homographs are not a major contributor to retrieval failure unless the query is extremely short (one word) or the searcher is interested in very high recall [17]. If a document has enough terms in common with a query to have a high similarity to the query, then the contexts in the two texts are similar and any polysemous words will likely be used in the same sense. In fact, the method of computing similarities among texts can be used to build a classifier to discriminate among word senses [18].

Term normalization might be beneficial. Term normalization, i.e., mapping variant spellings or formulations of the same lexical item to a common form, may be one area in which linguistic approaches improve on simple word stems. The use of somatotropin/somatotrophin is one example of this effect. Proper nouns are a more general class of lexical items that word stem approaches do not handle very well, but are regular enough to be accurately captured by more sophisticated techniques [19]. Although current IR test collections do not contain enough queries that depend on proper nouns to be able to quantify how much special processing helps, in other retrieval environments such as web search engines providing special processing for names is noticeably better.

3.2 TREC-5 NLP Track

Sense resolution is but one approach to using NLP to improve indexing. The NLP track in TREC-5 invited participants to try any NLP approach on the test collection consisting of almost 75,000 Wall Street Journal articles (240MB of text) and TREC topics 251–300. Four groups submitted runs to the track. While the track accepted both automatic and manual runs, only the automatic runs will be discussed here in keeping with the focus of the rest of the paper.

The MITRE group [20] had experience building trainable natural language algorithms for information extraction tasks by participating in the Message Understanding Conferences (MUC). However, TREC-5 was their first entry into TREC, and they were not able to complete all they had hoped to do by the time of the TREC-5 conference. The run they did submit to the NLP track consisted of pre- and post-processing steps applied to a basic SMART² statistical run. The preprocessing step aimed to automatically locate and remove from the query statement extraneous material that might mislead a stem-based search. The post-processing step aimed to re-order the ranked output of the SMART search based on learning which were the important keywords and phrases in the query and giving documents containing those terms higher ranks. As implemented for the track, neither process had any appreciable impact (either positive or negative) on the SMART results.

² SMART is a retrieval system based on the vector space model that was developed at Cornell University.

The other three entries in the NLP track tested syntactic phrasing (sometimes in conjunction with other NLP techniques) as a possible improvement over statistical phrases. As noted in Section 2.3, one of the findings of TREC is that phrasing in some form is generally useful. Most systems use statistical phrasing where a "phrase" is any pair of words that co-occur in documents sufficiently frequently. Generally the pair and both the individual word stems are used as index terms. Statistical phrases are clearly only a rough approximation to natural language phrases. Some frequently co-occurring pairs such as 'early fourth' are not phrases at all. Documents containing non-compositional collocations such as 'hot dog' and 'White House' are still (incorrectly) indexed by their component words. Phrases longer than two words are ignored. The internal structure of the phrase is also ignored so that 'college junior' is conflated with 'junior college'. The question is to what extent these problems affect retrieval.

The Xerox TREC-5 NLP track entry directly compared the effectiveness of retrieval runs using statistical phrasing vs. a specific kind of syntactic phrasing [21]. The syntactic phrasing was accomplished by using a light parser to perform a shallow syntactic analysis of text. Pairs of words that the parse found to be in one of the following relations were extracted as phrases: subject-verb, verb-direct object, verb-adjunct, noun modifying noun, adjective modifying noun, adverb modifying verb. Phrases that included a stop word as a phrase component were discarded. For each of the remaining phrases, the component words were stemmed and alphabetically sorted to form the final index term. Figure 2, derived from figures given in the Xerox paper, shows the phrases detected by the statistical and syntactic methods for an example query.

Original text (non-stop words in italics): Where and for what purpose is scuba diving done professionally?
Statistical phrases (in WSJ corpus): dive scub (diving scuba)
Xerox syntactic phrases: dive scub (diving scuba); dive profess (diving professionally)

Fig. 2. Phrases derived for an example query by both statistical and syntactic methods

Using the mean average precision measure to evaluate the retrieval runs, the use of the syntactic phrases increased effectiveness 15% as compared to a baseline run with no phrases (from .200 to .231). Using the statistical phrases improved the mean average precision by only 7% over the same baseline (from .200 to .215), so the syntactic phrases did have a positive effect. But this gain came at a cost in processing time; indexing the 240MB document text took 36 hours longer using the parsing than it did using the statistical methods. Also, the syntactic phrasing was only beneficial when starting with the longer version of the TREC topics. When only the short version of the topics was used (e.g., the single sentence shown in Figure 2), the syntactic phrasing run degraded the baseline effectiveness by 30%.

The CLARIT NLP track entry [22] was also an evaluation of the use of syntactic phrases for document indexing. The main goal of the study was to compare different kinds of syntactic phrases to each other, rather than compare syntactic phrases to statistical phrases. The syntactic phrases used by the CLARIT system are noun phrases, and the different types of phrases tested were full noun phrases (e.g., "heavy construction industry group"), adjacent subphrases in the noun phrase (e.g., "heavy construction industry"), and head modifier pairs (e.g., "construction industry", "industry group", "heavy construction"). Four different CLARIT runs were made: a base case consisting of only single words; single words plus head modifier pairs; single words plus head modifier pairs plus full noun phrases; and single words plus all types of phrases. The most effective run was the run that included single words plus head modifier pairs only, which increased mean average precision by 13% over the base case of words only (from .183 to .206). A second set of runs performed after TREC-5 used a more effective query weighting scheme that improved all the runs. With this weighting scheme, the head modifier pairs run was still the most effective, with an increase in mean average precision of 9% over the base case of no phrases (from .221 to .240). These results all used the long version of the topics. Even when using the long version, CLARIT noted that they did not see as much of an effect on retrieval performance using phrases as expected because the queries contained so few phrases. They also noted that appropriately weighting phrases is an important factor in phrase-based indexing.

The focus of the GE-led group has been on NLP techniques for information retrieval since TREC began [5,23]. Because their earlier experiments demonstrated that the NLP techniques work significantly better with longer query statements, much of their TREC-5 work was an investigation into performance of their system when the topic statements were expanded with large amounts of hand-selected document text. Such expansion significantly improves the performance of both statistical and NLP runs, though the NLP runs may get somewhat more of a boost.

TREC-5 was also the year the GE group introduced a stream architecture. In this architecture different independent processes produce index terms for a text and a combination mechanism resolves the various candidate index term sets into one final set. The stream architecture provides a convenient testbed to investigate the relative contributions of the different streams. The group implemented a variety of statistical and linguistic streams including word stems; head modifier pairs (derived from verb object and subject verb combinations in addition to noun phrases); unnormalized noun groups; and names. Similar to the CLARIT findings, the results of the stream architecture experiments suggest that having some phrases is an improvement over no phrases, but simpler phrases (in this case the unnormalized noun groups) work better than more complicated phrases.

The TREC-5 NLP track participants found the same types of difficulties in trying to improve on statistical IR system effectiveness as were encountered in the case study. Queries are short and therefore don't offer much opportunity to perform processing that will significantly affect retrieval. Large degradation in performance is possible unless the NLP works very well and the term weighting is not disturbed. The statistical phrases capture most of the salient information that can be exploited by syntactic phrases. These are the issues that need to be addressed to improve retrieval effectiveness through linguistic processing.
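The statistical phrasing discussed in this section can be sketched as follows. This is an illustrative simplification under stated assumptions: a candidate phrase is any pair of adjacent non-stop words, "sufficiently frequently" is modeled as a minimum number of documents (`min_docs`, a hypothetical parameter), the stop list is a toy one, and tokens are assumed to be already lowercased and stemmed.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Phrase = Tuple[str, str]

# Toy stop list for the example only.
STOP_WORDS = {"a", "the", "in", "is", "of", "and", "to"}


def candidate_pairs(tokens: List[str]) -> Set[Phrase]:
    """Adjacent non-stop word pairs, with the two words sorted alphabetically."""
    pairs = set()
    for first, second in zip(tokens, tokens[1:]):
        if first not in STOP_WORDS and second not in STOP_WORDS:
            pairs.add(tuple(sorted((first, second))))
    return pairs


def statistical_phrases(docs: Iterable[List[str]], min_docs: int = 2) -> Set[Phrase]:
    """Keep the pairs that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(candidate_pairs(tokens))
    return {pair for pair, count in doc_freq.items() if count >= min_docs}


docs = [
    "he attended a junior college in ohio".split(),
    "the college junior lives in ohio".split(),
    "a junior college is affordable".split(),
]
phrases = statistical_phrases(docs)
```

Sorting the pair discards word order, so 'college junior' and 'junior college' become the same index term. This is the conflation the text identifies as a weakness of statistical phrases.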

4 Summary

The explosive growth in the number of full-text, natural language documents that are available electronically makes tools that assist users in finding documents of interest indispensable. Information retrieval systems address this problem by matching query language statements (representing the user's information need) against document surrogates. Intuitively, natural language processing techniques should be able to improve the quality of the document surrogates and thus improve retrieval performance. But to date explicit linguistic processing of document or query text has afforded essentially no benefit for general-purpose (i.e., not domain specific) retrieval systems as compared to less expensive statistical techniques.

The question of statistical vs. NLP retrieval systems is miscast, however. It is not a question of either one or the other, but rather a question of how accurate an approximation to explicit linguistic processing is required for good retrieval performance. The techniques used by the statistical systems are based on linguistic theory in that they are effective retrieval measures precisely because they capture important aspects of the way natural language is used. Stemming is an approximation to morphological processing. Finding frequently co-occurring word pairs is an approximation to finding collocations and other compound structures. Similarity measures implicitly resolve word senses by capturing word forms used in the same contexts. Current information retrieval research demonstrates that more accurate approximations cannot yet be reliably exploited to improve retrieval.

So why should relatively crude approximations be sufficient? The task in information retrieval is to produce a ranked list of documents in response to a query. There is no evidence that detailed meaning structures are necessary to accomplish this task. Indeed, the IR literature suggests that such structures are not required. For example, IR systems can successfully process documents whose contents have been garbled in some way such as by being the output of OCR processing [24,25] or the output of an automatic speech recognizer [26]. There has even been some success in retrieving French documents with English queries by simply treating English as misspelled French [27]. Instead, retrieval effectiveness is strongly dependent on finding all possible (true) matches between documents and queries, and on an appropriate balance in the weights among different aspects of the query. In this setting, processing that would create better linguistic approximations must be essentially perfect to avoid causing more harm than good.

This is not to say that current natural language processing technology is not useful. While information retrieval has focused on retrieving documents as a practical necessity, users would much prefer systems that are capable of more intuitive, meaning-based interaction. Current NLP technology may now make these applications feasible, and research efforts to address appropriate tasks are underway. For example, one way to support the user in information-intensive tasks is to provide summaries of the documents rather than entire documents. A recent evaluation of summarization technology [28] found statistical approaches quite effective when the summaries were simple extracts of document texts, but generating more cohesive abstracts will likely require more developed linguistic processing. Another way to support the user is to generate actual answers. A first test of systems' ability to find short text extracts that answer fact-seeking questions will occur in the "Question-Answering" track of TREC-8. Determining the relationships that hold among words in a text is likely to be important in this task.

Acknowledgements

My thanks to Donna Harman and Chris Buckley for improving this paper through their comments.

References

1. Sparck Jones, K., Willett, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann, San Francisco (1997)
2. Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18 (1975) 613–620
3. Sparck Jones, K.: Further Reflections on TREC. Information Processing and Management. (To appear.)
4. Sparck Jones, K.: What is the Role of NLP in Text Retrieval? In: Strzalkowski, T. (ed.): Natural Language Information Retrieval. Kluwer. (In press.)
5. Perez-Carballo, J., Strzalkowski, T.: Natural Language Information Retrieval: Progress Report. Information Processing and Management. (To appear.)
6. D'Amore, R.J., Mah, C.P.: One-Time Complete Indexing of Text: Theory and Practice. Proceedings of the Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press (1985) 155–164
7. Cormack, G.V., Clarke, C.L.A., Palmer, C.R., To, S.S.L.: Passage-Based Query Refinement. Information Processing and Management. (To appear.)
8. Strzalkowski, T.: NLP Track at TREC-5. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 97–101. Also at http://trec.nist.gov/pubs.html
9. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998)
10. Voorhees, E.M.: Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press (1993) 171–180
11. Voorhees, E.M.: Using WordNet for Text Retrieval. In: Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998) 285–303
12. Rau, L.F.: Conceptual Information Extraction and Retrieval from Natural Language Input. In: Sparck Jones, K., Willett, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann, San Francisco (1997) 527–533
13. Mauldin, M.L.: Retrieval Performance in FERRET. Proceedings of the Fourteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. ACM Press (1991) 347–355
14. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41 (1990) 391–407
15. Fox, E.A.: Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Unpublished doctoral dissertation, Cornell University, Ithaca, NY. University Microfilms, Ann Arbor, MI
16. Sanderson, M.: Word Sense Disambiguation and Information Retrieval. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag (1994) 142–151
17. Krovetz, R., Croft, W.B.: Lexical Ambiguity in Information Retrieval. ACM Transactions on Information Systems 10 (1992) 115–141
18. Leacock, C., Towell, G., Voorhees, E.M.: Towards Building Contextual Representations of Word Senses Using Statistical Models. In: Boguraev, B., Pustejovsky, J. (eds.): Corpus Processing for Lexical Acquisition. MIT Press (1996) 98–113
19. Paik, W., Liddy, E.D., Yu, E., McKenna, M.: Categorizing and Standardizing Proper Nouns for Efficient Information Retrieval. In: Boguraev, B., Pustejovsky, J. (eds.): Corpus Processing for Lexical Acquisition. MIT Press (1996) 61–73
20. Burger, J.D., Aberdeen, J.S., Palmer, D.D.: Information Retrieval and Trainable Natural Language Processing. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 433–435. Also at http://trec.nist.gov/pubs.html
21. Hull, D.A., Grefenstette, G., Schulze, B.M., Gaussier, E., Schütze, H., Pedersen, J.O.: Xerox TREC-5 Site Report: Routing, Filtering, NLP, and Spanish Tracks. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 167–180. Also at http://trec.nist.gov/pubs.html
22. Zhai, C., Tong, X., Milić-Frayling, N., Evans, D.A.: Evaluation of Syntactic Phrase Indexing: CLARIT NLP Track Report. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 347–357. Also at http://trec.nist.gov/pubs.html
23. Strzalkowski, T., Guthrie, L., Karlgren, J., Leistensnider, J., Lin, F., Perez-Carballo, J., Straszheim, T., Wang, J., Wilding, J.: Natural Language Information Retrieval: TREC-5 Report. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 291–313. Also at http://trec.nist.gov/pubs.html
24. Taghva, K., Borsack, J., Condit, A.: Results of Applying Probabilistic IR to OCR Text. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag (1994) 202–211
25. Kantor, P.B., Voorhees, E.M.: Report on the TREC-5 Confusion Track. Proceedings of the Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238 (1997) 65–74. Also at http://trec.nist.gov/pubs.html
26. Garofolo, J.S., Voorhees, E.M., Auzanne, C.G.P., Stanford, V.M., Lund, B.A.: 1998 TREC-7 Spoken Document Retrieval Track Overview and Results. Proceedings of the Seventh Text REtrieval Conference (TREC-7). (In press.) Also at http://trec.nist.gov/pubs.html
27. Buckley, C., Mitra, M., Walz, J., Cardie, C.: Using Clustering and SuperConcepts Within SMART: TREC 6. Proceedings of the Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240 (1998) 107–124. Also at http://trec.nist.gov/pubs.html
28. Mani, I., House, D., Klein, G., Hirschman, L., Obrst, L., Firmin, T., Chrzanowski, M., Sundheim, B.: The TIPSTER SUMMAC Text Summarization Evaluation: Final Report. MITRE Technical Report MTR 98W0000138, McLean, Virginia (1998). Also at http://www.nist.gov/itl/div894/894.02/related_projects/tipster_summac/final_rpt.html

From Speech to Knowledge

Veronica Dahl

Simon Fraser University, Burnaby, B.C. V5A 1S6, Canada
[email protected]
http://www.cs.sfu.ca/people/Faculty/Dahl

Abstract. In human communication, assumptions play a central role. Linguists and logicians have uncovered their many facets. Much of AI work is also concerned with the study of assumptions in one way or another. Work on intuitionistic and linear logic has provided formally characterized embodiments of assumptions which have been influential on logic programming (e.g. [9, 22, 14]). In this article we examine some uses of assumptive logic programming for speech-driven database creation and consultation, for speech driven robot control, and for web access through language. This type of research can help relieve health problems related to the present typing/screen model of computer use. It can also partially address the need to integrate voice recognition, voice synthesis, and AI, along the route towards making computers into true extensions of our human abilities: extensions that adapt to our biology, rather than requiring our bodies to adapt.

1 Introduction

More than twenty years have elapsed since the first efforts towards declarative programming catapulted computing sciences from the old number-crunching paradigm into a newer paradigm of inferential engines: an era in which we no longer measure efficiency in terms of calculations per second, but in terms of inferences per second, a fantastic qualitative leap. Logic programming, at the heart of this revolution in computing sciences, has been responsible for many beautiful incarnations, particularly in Artificial Intelligence, of the idea of programming through logic. Notable among them, natural language processing applications have blossomed around the axis of parsing-as-deduction (an expression coined by Fernando Pereira) ever since Alain Colmerauer developed the first logic grammar formalism, Metamorphosis Grammars [10]. These applications mostly span either language-to-language translation (e.g. French to English), with some meaning representation formalism mediating between the source language and the target language, or language-to-query translation, e.g. for using human languages as database front ends.

Pazienza (Ed.): Information Extraction, LNAI 1714, pp. 49–75, 1999. © Springer-Verlag Berlin Heidelberg 1999

The latter kind of applications exploit a natural closeness between the logic programming style of queries and human language questions. Consider for instance representing the query "Find the names of employees who work for First Bank Corporation" as a Datalog or as a Prolog query:

query(X):- works(X,'First Bank Corporation').

versus its SQL equivalent

select employee_name from works where company-name= "First Bank Corporation"

or its QUEL equivalent

range of t is works
retrieve (t.person-name) where t.company-name= "First Bank Corporation"

Traditional database query languages are in fact closer to computer programs than to human language questions: notions irrelevant to the question itself need to be explicitly represented, such as the range of a tuple variable, or operations such as selecting. In contrast, logic programming based queries are almost readable by people with little background of either logic programming or database theory.

While written text has long been used for database consultation (e.g. [1]), its use for representing knowledge itself lags behind. Database updates through language have been studied (e.g. [17]), but database creation through language has not, to the best of our knowledge, been attempted yet. This is partly because creating knowledge bases through human language presents more difficulties than consulting them or updating them, and partly because the availability of reasonably efficient speech analysis and synthesis software is relatively new. Typing in the human language sentences necessary to create a knowledge base is probably as time consuming as, and perhaps more error prone than, typing in the information itself, coded in one of the current knowledge representation formalisms. With a good speech analyzer, however, dictating natural language sentences that represent a given corpus of knowledge becomes a much more attractive task.

The field of speech analysis and recognition has in fact been slowly reaching maturity, to the point in which very effective speech software is now available at relatively low cost. For instance, products such as Microsoft's speech agent or Dragon Co.'s Naturally Speaking software can recognize a person's speech modalities after about half an hour of training. That person can then dictate into a microphone, including punctuation marks, as s/he would dictate to a secretary, and see the written form of his or her utterances appear on the screen, being placed into a text file or being used as commands to the computer. Speech editing facilities are of course available (e.g. for correcting mistakes made by the speaker or by the speech software).

Impressive as it is, speech software has not yet been properly put together with artificial intelligence. Virtual Personalities Inc.'s verbal robots (verbots) come about the closest, but they mimic Weizenbaum's Eliza style of "understanding", abounding in default responses such as "What does that suggest to you?", "I see", etc.

Net technology is also mature enough for many of its applications to be profitably augmented with speech capabilities. Also, hardware technology is swiftly moving towards networks of wireless, portable, small personal computers that have little to envy our previous "big" computers in terms of power. The pieces of the puzzle are laid down to now attempt more human like communication with computers: computers in a wide sense, including robots, virtual worlds, and the Internet. Industry is already identifying the need to integrate "voice recognition software so that the computer can listen to you, voice synthesis so it can talk back to you, and AI programs to guess what you really want" (Newsweek, March 1998, interview to Bill Gates).

In this article we examine some applications of logic programming that in our view should be explored along the route towards making computers into true extensions of our human abilities: extensions that adapt to our biology, rather than requiring our bodies to adapt. Our presentation style is intuitive rather than formal, since we assume little previous knowledge of logic grammars, language processing, or logic programming. Technical details can be found in the references given. Readers familiar with logic programming can skim through Section 2, noticing only our notation conventions.

Section 2 describes our logic programming tools, and at the same time shows step-by-step the construction of a (simplistic, but adequate for exemplifying purposes) first prototype grammar for gleaning knowledge from natural language sentences, based on assumptive logic programming. The resulting databases may include general rules as well as facts. Section 3 proposes a more detailed approach to the creation of knowledge bases (this section partially overlaps with [16]), and introduces a new type of assumption reasoning for dealing with discourse. Section 4 discusses other types of knowledge bases that can be driven through speech: concept-based retrieval, robot control, generating animations and controlling virtual worlds. Finally, we present our concluding remarks. A sample session from a more encompassing database creation prototype than the one developed in Section 2 is shown in the appendices, which also show a sample interaction with a Spanish consultable virtual world.

2 A First Prototype for Creating and Consulting Knowledge Bases

2.1 Definite Clause Grammars

The Basic Formalism. Imagine rewrite grammar rules that can include variables or functional symbols as arguments, so that rewriting involves unification. What you have is called metamorphosis grammars [10] or DCGs [8]. Through them you can, for instance, declare a noun phrase to be constituted by a name, or by a quantifier, an adjective and a noun (we take notational liberties for consistency throughout this paper; ":-" stands for "rewrite into"):

noun_phrase:- name.
noun_phrase:- quant, adj, noun.

More usefully, we can augment the grammar symbols with arguments which automate the construction of a meaning representation:

noun_phrase(X,true) :- name(X).
noun_phrase(X,(A,N)):- quant, adj(X,A), noun(X,N).

Variables are capitalized. The first rule, for instance, commands the rewrite of any symbol matching noun_phrase(X,true) into name(X) (where X has been matched in the same way). Variable names are, as in Prolog, local to the clause (rule) in which they appear (since they are implicitly universally quantified within it). The second argument of "noun_phrase" constructs the noun phrase's "meaning". In the case of a complex noun phrase, this meaning is composed by the meanings of the adjective (A) and the noun (N) (both of which will involve X, as we shall next see). In the case of a simple noun phrase (a name), its meaning is "true", a primitive Prolog predicate which is always satisfied.

Words are preceded by "#", and "or" can be represented as ";" (thus the quantifier rule below shows six alternative quantifying words). Proper names must be written in lower case. We can make up rewrite rules for the remaining grammar symbols as well, e.g.

name(rahel):- #rahel.
name(estha):- #estha.
quant:- #the; #a; #an; #some; #all; #every.
adj(X,wise(X)):- #wise.
adj(X,true).
noun(X,owl(X)):- #owl.

Notice that adjectives are optional: if not present, the trivial representation "true" is generated. Here are some noun phrases and their representations, as obtained by the above grammar:

the wise owl    (wise(X),owl(X))
an owl          (true,owl(X))
rahel           rahel
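
As a rough illustration outside the paper's own notation, the noun phrase rules above can be mimicked in Python. This is only a sketch under stated assumptions: the names `parse_noun_phrase`, `NAMES`, `QUANTIFIERS`, `ADJECTIVES` and `NOUNS` are hypothetical, the logic variable is replaced by the fixed placeholder string "X", and unification and backtracking are reduced to trying the name rule before the quantified rule.

```python
# Hypothetical Python analogue of the noun_phrase rules; not the author's system.
NAMES = {"rahel", "estha"}
QUANTIFIERS = {"the", "a", "an", "some", "all", "every"}
ADJECTIVES = {"wise": "wise"}
NOUNS = {"owl": "owl"}


def parse_noun_phrase(words, var="X"):
    """Return (referent, meaning, remaining_words), or None if no rule applies."""
    if not words:
        return None
    # noun_phrase(X,true) :- name(X).
    if words[0] in NAMES:
        return words[0], "true", words[1:]
    # noun_phrase(X,(A,N)) :- quant, adj(X,A), noun(X,N).
    if words[0] in QUANTIFIERS:
        rest = words[1:]
        # adj(X,wise(X)) :- #wise.   adj(X,true).   (adjectives are optional)
        if rest and rest[0] in ADJECTIVES:
            adj_meaning = f"{ADJECTIVES[rest[0]]}({var})"
            rest = rest[1:]
        else:
            adj_meaning = "true"
        # noun(X,owl(X)) :- #owl.
        if rest and rest[0] in NOUNS:
            noun_meaning = f"{NOUNS[rest[0]]}({var})"
            return var, f"({adj_meaning},{noun_meaning})", rest[1:]
    return None
```

Calling `parse_noun_phrase("the wise owl".split())` yields the meaning string `(wise(X),owl(X))`, and a phrase without an adjective receives the trivial `true` in the adjective position, mirroring the optional adjective rule.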

Querying Logic Databases. From representations such as the above we can, for instance, directly query a Prolog database about the subject domain (e.g. owls) and obtain the answers automatically. For instance, to ask for a wise owl, we would write the representation of that noun phrase as a Prolog query:

?- wise(X),owl(X).

Prolog will respond with X=owsa with respect to a database in which owsa has been defined to be wise and to be an owl, e.g.

wise(owsa).
owl(owsa).

Initializing Knowledge Bases. We could also create a database of knowledge from such representations. Let us first add verb phrases, and compose sentences from noun phrases and verb phrases, e.g. through the rules

verb(X,Y,saw(X,Y)):- #saw.
verb(X,Y,likes(X,Y)):- #likes.
verb_phrase(X,Head,VP):- verb(X,Y,Head),noun_phrase(Y,VP).
sentence((Head,NP,VP)):- noun_phrase(X,NP),verb_phrase(X,Head,VP).

Anonymous variables (i.e. variables which only appear once in a rule and thus do not need a specific name) are noted "_". To analyze a string from a given start symbol, we simply query the primitive predicate analyze(Symbol), and follow the prompts, e.g.

?- analyze(sentence(S)).
Enter the string to be analyzed, followed by a return: estha saw the wise owl
S=saw(estha,_x16074),wise(_x16074),owl(_x16074)

Notice that in the sentence's representation we have all the elements needed to create a Prolog rule representing to a database the knowledge that Estha saw the wise owl. We merely need to rearrange its components in Prolog rule fashion:

saw(estha,X):- wise(X),owl(X).

where ":-" now is read as "if": if X is wise and an owl, then estha saw it.
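
The rearrangement step can be sketched in Python, assuming the analyzer's output is available as a flat list of goal strings with the head first. The helper `to_rule` is hypothetical and not part of the system described here; it simply drops the trivial `true` goals and emits a fact when no body goals remain.

```python
def to_rule(goals):
    """Turn [head, goal1, goal2, ...] into Prolog clause text, dropping 'true'."""
    head, *body = goals
    # The spurious 'true' goals carry no information, so they are filtered out.
    body = [goal for goal in body if goal != "true"]
    if not body:
        return f"{head}."          # no conditions left: a plain fact
    return f"{head}:- {','.join(body)}."
```

For the sentence analyzed above, `to_rule(["saw(estha,X)", "wise(X)", "owl(X)"])` produces the clause text `saw(estha,X):- wise(X),owl(X).`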

Assumption Grammars for Relating Long-distant Constituents. Now suppose you want to add a relative clause to the complex noun phrase rule:

noun_phrase(X,(A,N,R)):- quant, adj(X,A), noun(X,N), #that, relative(X,R).

For noun phrases in which it is the subject that is missing (to be identified with the relative's antecedent), the relative clause reduces to a verb phrase, so we could simply write instead:

noun_phrase(X,(A,N,H,VP)):- quant, adj(X,A), noun(X,N), #that, verb_phrase(X,H,VP).

This rule makes X (the representation of the relative's antecedent) the subject of the relative clause's verb phrase. We can now test, for instance, the sentence

estha saw the owl that likes rahel

from which we obtain

X=saw(estha,_x19101),true,owl(_x19101),likes(_x19101,rahel),true

However, extrapolating this technique to relatives from which another noun phrase than the subject is missing (e.g. "the owl that rahel saw" or "the owl that rahel gave a scolding to") would necessitate passing the antecedent X, which needs to be identified with the missing noun phrase, all the way to the place where the noun phrase is missing. This is inconvenient because it imposes the addition of X into symbols that have no direct business with it, since they merely act as transfer entities for X. It would be convenient to have a way of hypothesizing a potential relative's antecedent as such, in a more global fashion, and then using it wherever it is required (i.e. where a noun phrase is expected and cannot be found).

This is exactly what we can do with assumption grammars, in which a hypothesis (a linear assumption¹) is noted as a Prolog predicate preceded by "+", and its use (consumption) is noted "-". An assumption is available during the continuation of the present computation, and vanishes upon backtracking.

¹ Linear assumptions [9] can be consumed only once, whereas intuitionistic assumptions can be consumed any number of times.

So, for instance, we can define a relative as a sentence with a missing noun phrase preceded by a relative pronoun, through the grammar rule:

relative(X,R):- #that, +missing_np(X), sent(R).

The appropriate antecedent (i.e. the variable to be bound with X; remember that variables are implicitly quantified within each rule, so variables of the same name in different rules are a priori unrelated) can then be transmitted to the relative clause through the noun phrase rule:

noun_phrase(X,(A,N,R)):- quant, adj(X,A), noun(X,N), relative(X,R).

and the missing noun phrase is associated with this antecedent through consumption at the point in which it is shown missing (i.e. as the last noun phrase rule, to be tried after all others have failed):

noun_phrase(X,true) :- -missing_np(X).

Finally, we allow for noun phrases with no relative clauses:

relative(_,true).

A sentence such as "the owl that estha saw likes rahel" now yields the representation

S=likes(_x19102,rahel),(true,owl(_x19102),saw(estha,_x19102),true,true),true

The spurious "true" predicates disappear upon writing these results into a file. Again, from such a representation we can then construct the Prolog database definition:

likes(A,rahel):- owl(A),saw(estha,A).
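
The role played by the linear assumption in the relative clause rules can be sketched in Python. This is a simplified illustration, not the assumption grammar machinery itself: the class `LinearAssumptions` and the function `parse_object` are hypothetical names, matching is by predicate name rather than by unification, and the retraction of assumptions on backtracking is not modelled.

```python
class LinearAssumptions:
    """A store in which each assumption can be consumed at most once."""

    def __init__(self):
        self._store = []

    def assume(self, name, value):
        """Analogue of +name(value): make the hypothesis available."""
        self._store.append((name, value))

    def consume(self, name):
        """Analogue of -name(Value): use the hypothesis and delete it."""
        for index, (stored_name, value) in enumerate(self._store):
            if stored_name == name:
                del self._store[index]
                return value
        return None  # nothing left to consume


def parse_object(words, assumptions):
    """Object noun phrase: a name if one is present, else the assumed gap."""
    if words and words[0] in {"rahel", "estha"}:
        return words[0], words[1:]
    # Last resort, like noun_phrase(X,true) :- -missing_np(X).
    return assumptions.consume("missing_np"), words
```

For "the owl that estha saw", the antecedent is assumed when "that" is read; when the verb "saw" finds no object in the input, `parse_object([], assumptions)` consumes the assumption and returns the antecedent, and a second attempt finds nothing left to consume.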

3 A More Detailed Approach to Knowledge Base Creation

Going beyond our first prototype for knowledge base creation and consultation involves decisions such as what language subset is going to be covered, how exactly it is going to be represented, etc. In this section we propose one possible approach, and we argue that a simple higher level extension of LP, timeless assumptions, greatly facilitates the task of going from discourse to databases. More research is needed to provide a thorough proof of concept.

3.1 Underlying Conventions

Let us examine what kinds of English descriptions we shall admit, and what kinds of representations for database relations should be extracted automatically from them.

The Natural Language Subset. Our subset of language consists of sentences in the active voice, where relation words (nouns, verbs and adjectives) correspond to database predicates, and their complements to arguments of these predicates. For instance, "John reads Ivanhoe to Mary" generates the Prolog assertion reads(john,ivanhoe,mary).

Vocabulary reasonably common to all databases belongs to the static part of our system (e.g. articles, prepositions, common verbs such as "to be", etc.), and vocabulary specific to each application (e.g. proper names, nouns, verbs proper of the application, etc.) is entered at creation time, also through spoken language, with menu-driven help from the system.

In the interest of ease of prototyping, we shall first only use universal quantifiers (as in the clausal form of logic), whether explicit or implicit, and only the restrictive type of relative clauses (i.e. those of the form "(All) ... that ...", where the properties described by the relative clause restrict the range of the variable introduced by the quantifier "all"). Such relatives, as we have seen, translate into additional predicates in the clause's body, as in:

People like cats that purr.

for which a possible translation is

like(P,C):- person(P), cat(C), purrs(C).

It is easy to include alternative lexical definitions in the language processing module of our system, so that all words for a given concept, say "people" and "persons", translate into a single database relation name (say, "people"). Thus we can allow the flexibility of synonyms together with the programming convenience of having only one constant for each individual: no need for equality axioms and their related processing overhead.


Semantic Types. It is useful to have information about semantic types. For instance, we may have informed the database that people like animals, and may not have explicitly said that people like cats that purr. But if we knew that cats are animals, we could easily infer that people like cats and that they like cats that purr, given the appropriate query.

If we wanted to reject semantically anomalous input, much of this could be done through simply checking type compatibility between, say, the expected argument of a predicate and its actual argument. For instance, if "imagine" requires a human subject, and appears in a sentence with a non-human subject, the user can be alerted of the anomaly and appropriate action can be taken.

Variables can be typed when introduced by a natural language quantifier, and constants can be typed in the lexicon, when defining proper names. The notation used to represent a semantic type can reflect the relevant set inclusion relationships in the type hierarchy. Much effort has been devoted to efficient encodings of type hierarchies. A recent survey and breakthrough results for logic programming are presented in [1]. These results can be transferred to a menu-driven module of our system which will question the user about a topmost class in the database's domain, its subsets, etc., and accordingly construct the class hierarchy, encoded in such a way that set inclusion relationships are decidable with little more than unification.

Set-orientation. Our logic programming representation of a database should be set-oriented. That is, rather than having n clauses for representing a unary property that n individuals satisfy, we can have one clause for the entire set (e.g. biblical([adam,eve])). Database primitives for handling relations on sets are provided.

Intensionally as well as extensionally represented sets should be allowed, in two ways. First, having semantic types associated to each variable and constant makes it possible to give intensional replies rather than calculating an extensionally represented set. For instance, we can reply "all birds" to the question of which birds fly, if the database expresses that all entities of type bird fly, rather than checking the property of flying on each individual bird (assuming we even were to list all birds individually). We can even account for exceptions, e.g. by replying "all birds except penguins". The user can always choose to request an extensional reply if the first, intensional answer is not enough.

Secondly, we have found it useful in some cases to represent sets of objects as a type and an associated cardinality (e.g. in a query such as "How many cars are in stock?" we do not really want to have a name for each of the cars, being only interested in the number of (indistinguishable) entities of type car).

Events. In order to correctly relate information about the same event given in different sentences of a discourse, we translate n-ary relations into sets of binary ones linked by an event number. For instance, if we input the sentence "John gave Rover to Mary.", instead of generating the ternary relation

gave(john,rover,mary)

the analyzer can generate a representation that keeps track of the event, or information number, within the discourse, through the three assertions²:

gave(1,who,john).
gave(1,what,rover).
gave(1,to,mary).

This method roughly corresponds to event logics, dating as far back as 1979 [11]. It provides us with a simple way to be flexible as to the number of arguments in a relation. Notice that the event number is not necessarily the same as the sentence number. If our next sentence is "This happened in 1998", then the system needs to recognize the event described in the previous sentence as the one referred to by "this", and add the following clause to the database:

gave(1,when,1998).

In order for our natural language analyzer to be able to effect such translations, it needs information about the semantic types of each argument of a relation. This can be given in the lexicon, specifically in the definitions of nouns, verbs and adjectives, since these words typically correspond to predicates in a database. Lexical definitions for these can either be input once and for all for a given domain, or elicited from the user in menu-driven fashion. The latter option has the attraction of providing extensibility to the system, which can then admit new words into its vocabulary.
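
The translation of an n-ary relation into binary facts linked by an event number can be sketched in Python. The function `event_facts` and its argument layout are assumptions made for illustration only; the role labels (who, what, to, when) are the ones used in the example above.

```python
def event_facts(event_id, relation, roles):
    """Split one n-ary relation into binary facts linked by an event number.

    `roles` is a list of (role, value) pairs, e.g. [("who", "john")].
    """
    return [f"{relation}({event_id},{role},{value})." for role, value in roles]
```

With this representation, a later sentence that refers back to the same event only contributes one more fact with the same event number, for example `event_facts(1, "gave", [("when", 1998)])`, without changing the arity of anything already stored.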

3.2 From Sentences to Discourse: Timeless Assumptions

While it is certain that we must restrict the range of natural language accepted by our speech applications, we should seek naturalness and allow not only isolated sentences or queries, but the flexibility of discourse and paraphrase as well. Determining which entities co-specify (i.e. refer to the same individual) is one of the most important problems in understanding discourse. For instance, if a user enters the information: "John Smith works in the toy department. It does not meet fire safety standards. His country's laws are not enough to protect him", the system translating this input into a knowledge base needs to identify "it" with the toy department mentioned, and "his" and "him" as relating to John Smith.

² In fact, semantic types can also be generated at the language analysis stage, but we overlook them for simplicity.

A thorough treatment of co-specification is a complex and widely studied issue, involving not only syntactic but also semantic, pragmatic and contextual notions (consider for instance "Don't step on it", referring to a snake just seen, or "John kicked Sam on Monday and it hurt", where "it" refers to an abstract proposition rather than a concrete individual). A discussion of possible treatments of co-specification is beyond the scope of this article, but we shall introduce a methodology, timeless assumptions, for facilitating the detection of co-specification. This basic technique can be adapted to incorporate different criteria for co-specification determination.

Timeless assumptions allow us to consume assumptions after they are made (as before), but also, when a program requires them to be consumed at a point in which they have not yet been made, they will be assumed to be "waiting" to be consumed, until they are actually made (the Prolog cut, noted "!", is a control predicate preventing the remaining clauses from being considered upon backtracking):

% Assumption:
% the assumption being made was expected by a previous consumption
=X:- -wait(X), !.
% if there is no previous expectation of X, assume it linearly
=X:- +X.
% Consumption:
% uses an assumption, and deletes it if linear
=-X:- -X, !.
% if the assumption has not yet been made,
% adds its expectation as an assumption
=-X:- +wait(X).

With these definitions it no longer matters whether an assumption is first made and then consumed, or first "consumed" (i.e. put in a waiting list until when it is actually made) and then made. We can use timeless assumptions, for instance, to build a logic programmed database directly from

The blue car stopped. Two people came out of it. or

Two people came out of it after the blue car stopped.

To cover both cases, we could timelessly assume an object of description D for each noun phrase, represented by a variable X and with features F, that appears in the discourse:

=object(X,F,D).

When encountering a pronoun with matching features F', in a sentence that further describes the referred object as D', replace the timeless assumption by one which adds D' to what is known of X. With some notational license:

=object(X,F,D&D').

Once the discourse ends, we can firm the assumption into a regular database clause.

Of course, there will be cases of unresolvable ambiguity, in which even the most informed co-specification resolution criteria will fail (as they will in human communication). And introducing the complexities of discourse into a system that depends on it fully for gathering information bits is no small challenge. But the availability of a time-independent assumption mechanism can greatly help. Timeless assumptions have also proved useful for treating other linguistic phenomena such as coordination [14].
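
The order-independence of timeless assumptions can be sketched in Python. This is a simplified model under stated assumptions: the class `TimelessAssumptions` and its method names are hypothetical, items are matched by plain equality instead of unification, and backtracking is not modelled.

```python
class TimelessAssumptions:
    """Assumptions that may be consumed before or after they are made."""

    def __init__(self):
        self._made = []     # assumptions made but not yet consumed
        self._waiting = []  # consumptions requested before the assumption exists

    def assume(self, item):
        """Analogue of =X."""
        if item in self._waiting:
            # A previous consumption was already waiting for this assumption.
            self._waiting.remove(item)
            return "matched waiting consumption"
        self._made.append(item)
        return "assumed"

    def consume(self, item):
        """Analogue of =-X."""
        if item in self._made:
            self._made.remove(item)
            return "consumed"
        # Not made yet: record the expectation instead of failing.
        self._waiting.append(item)
        return "waiting"

    def pending(self):
        """Return (unconsumed assumptions, unmatched consumptions)."""
        return list(self._made), list(self._waiting)
```

Whether the noun phrase ("the blue car") is seen before the pronoun ("it") or after it, the store ends up with nothing pending once both have been processed, which is the property the two example discourses above rely on.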

4 Further Applications Amenable to Speech

As we hope to have shown in previous sections, the database field is one of the main candidates for speech interactions through logic, being sufficiently studied, and given that it also includes logic-based incarnations such as Datalog. Other applications, while less studied, are emerging as good candidates too. For instance, the new need for processing massive amounts of web data, which are mostly expressed in natural language, opens up several interesting possibilities. In this section we briefly discuss current research which sets the stage for speech interactions to be also incorporated into them.

4.1 Speech Driven Robot Control

Controlling Mobile Mini-robots. Using speech to control robots is also an interesting application of natural language processing through logic: since robots' worlds are fairly restricted, commands tend to be relatively simple and fairly devoid of the ambiguity that plagues other natural language applications. Preliminary work has been done jointly with Marie-Claude Thomas and Andrew Fall to control a simple world of mini robots through natural language commands [4]. These robots move in an enclosed room, avoiding obstacles through sensors. They know about time units, they can detect features such as room temperature and humidity, move to sources of light, avoid obstacles, and recharge themselves through sources of light positioned within the room.

Natural language commands translate into a specially developed formal logic system. This high degree of formalization is currently believed necessary in robotics, and allows us in particular great economy: the syntax of our logic system is the representation language for commands, and its semantics is the high level execution specification, resulting in several calls to Prolog and C routines. Natural language commands are of the form exemplified below:

– Go to the nearest source of light in ten minutes.
– Go to point … taking care that the temperature remains between … and … degrees.
– Let robot … pass.
– Give … to robot ….
– Stop as soon as the humidity exceeds ….

Imperative sentences with an implicit subject, which are rare in other applications, are common here. Complements are typically slots to be filled in, mostly by constants.

Our approach combines the two current main approaches in robotics: a high-level deductive engine generates an overall plan, while dynamically consulting low level, distributed robotics programs which interface dynamically with the robot's actions and related information. Obviously, adding a speech component to such systems would enhance them greatly. Mobile computers and wireless networks, as well as already available speech software which can be adapted, will help towards the goal of transporting these specialized robots to wherever they are needed, while communicating with them in more human terms.

Controlling Virtual, Visual World Robots. Another interesting family of Internet applications of speech is that of robots that operate in virtual, visual worlds. For instance, Internet-based VRML animations have been generated through English-controlled partial order planners [3]. The next step is for such systems to accept speech rather than written input.

This research, done in collaboration with Andrea Schiel and Paul Tarau, presents a proof-of-concept Internet-based agent programmed in BinProlog, which receives a natural language description of a robot's goal in a blocks world, and generates the VRML animation of a sequence of actions by which the robot achieves the stated goal. It uses a partial order planner.

The interaction storyboard is as follows. The user types in an NL request via an HTML form, stating a desirable final state to be achieved. This is sent over the Internet to a BinProlog based CGI script working as a client connected to the multi-user World-State server. The NL statement is used to generate an expression of the goal as a conjunction of post-conditions to be achieved. The goal is passed to the planner module, also part of the CGI script. The plan materializes as a VRML animation, ending in a visual representation of the final state, to be sent back as the result of the CGI script, as well as in an update of the World-State database.

Our analyzer, based on Assumption Grammars, can deal with multisentential input and with anaphora (e.g. relating pronouns to antecedents in previous sentences). Pronoun resolution involves intuitionistic rather than linear assumptions (since an antecedent can be referred to by a pronoun more than once).

The results we prototyped for the blocks world can be transposed to other domains, by using domain-tailored planners and adapting the NL grammar to those specific domains. A more interesting extension is the development of a single agent to produce VRML animations from NL goals with respect to different applications of planning. This might be achieved by isolating the domain-specific knowledge into an application-oriented ontology; adapting a partial order planner to consult this ontology modularly; and likewise having the grammar examine the hierarchy of concepts in order to make sense of domain-oriented words.

4.2 Web Access Through Language

T he interse tion etween logi p rogr m m ing nd the nternetis verynew ut r p idly growing fi eld. e entlogi - sed we p p li tionsh ve een p resented 1 8 0 nd sp e i lissue o the ourn lo Logi rogr m m ing on thissu je t is urrently underp rep r tion. m ong the ex iting new p p li tionsth tthese inter tions re m king p ossi le on ep t- sed retriev l nd virtu lworldsst nd out sp rti ul rlyp rom ising or long-dist n e inter tions distri uted work nd inter tive te hing through the we 6 . ndowing these udding p p li tionswith essthrough l ngu ge nd in p rti ul rsp ee h would gre tly enh n e their p ilities. n p rti ul r m ultilingu l essto virtu lworldsoverthe nternetwould help rem ove geogr p hi nd l ngu ge rriersto oop er tion. relim in rywork in this dire tion is . Speech Driven Virtual World Communication T hisrese r h im p lem ents in- rolog sed virtu l world running under ets p e nd xp lorer lled LogiM OO with m ultilingu l nd extensi le n tur l l ngu ge rontend 1 3 whi h will lso e endowed with sp ee h s well s visu l om p onent. t llows (written tp resent om m uni tion etween dist ntusers in re l tim e hiding the om p lexities o the distri uted om m uni tion m odel through the usu l m et p hors p l es (st rting rom de ultlo y p orts ility to move orteleport rom one p l e to nother wizard residenton the server ownership o o je ts the ilityto transfer ownership nd uilt-in notifi er gentw thing orm ess ges s kground thre d. LogiM OO m kesuse o line r nd intuitionisti ssum p tion te hniques nd is sed on seto em edd le logi p rogr m m ing om p onentswhi h interop er te with st nd rd e tools. m m edi te ev lu tion o world knowledge ythe p rser yields rep resent tions whi h m inim ize the unknowns llowing us to de l with dv n ed n tur ll ngu ge onstru tslike n p hor nd rel tiviz tion effi iently. 
e t ke dv nt ge o the sim p li ityo our ontrolled l ngu ge to p rovide swell n e sy d p t tion to other n tur l l ngu ges th n nglish with nglish-like rep resent tions s univers linterlingu .

om Sp

to

nowl

3

The peculiar features of the world to be consulted (a virtual world) induced novel parsing features which are interesting in themselves: flexible handling of dynamic knowledge; immediate evaluation of noun phrase representations, allowing us to be economic with representation itself; inference of some basic syntactic categories from the context; treatment of nouns as proper nouns; and easy extensibility within the same language as well as into other natural languages.

Speech Driven Concept Based Retrieval

As anyone knows who has tried to query the Internet through the existing search engines, there is a glaring need for intelligent access to that fantastic but frustratingly mechanical repository of world knowledge that the Internet has become. From this perspective alone, logic should play a major role, given that deduction is obviously needed to make enough sense of a query as to minimize noise (the number of irrelevant documents obtained for a given query to a search engine) and silence (the number of failures to find any relevant documents that do exist). For instance, within a forestry domain of interest, we can use a taxonomy of forestry-related concepts, which allows the search engine to specialize or generalize given concepts (e.g. going from "water" to "lakes" or vice versa), and to use the contextual information provided by forestry domains in order to avoid nonsensical answers. Search engines that base their search on keywords rather than semantics, in contrast, have been known to respond, for instance, to a query for documents related to "clear cuts near water" with "complete poetical works from William Wordsworth", among a list of other equally wrong associations.

Concept-based search engines are starting to appear, but to our knowledge there are none yet that go much beyond a shallow use of keywords and concept classifications. Full meaning extraction and comparison, however, can only be done once the subtleties of natural language semantics are taken into account. A challenging task, but again, one for which logic programming is particularly suited.

Assumption Grammars have been proposed, in conjunction with other techniques from Artificial Intelligence and Databases (concept hierarchies, multilayered databases and intelligent agents), for intelligently searching information pertaining to a specific industry on the web [7].
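As a rough illustration of the specialize/generalize step mentioned above, the following Python sketch expands a query over a toy taxonomy. The taxonomy entries and the function names (generalize, specialize, expand_query) are assumptions made for this example only; they are not taken from the system discussed here.

```python
# Toy taxonomy: child concept -> parent concept (all entries are illustrative).
PARENT = {
    "lake": "water",
    "river": "water",
    "stream": "river",
    "clear cut": "harvesting",
}


def generalize(concept):
    """Return the chain of ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def specialize(concept):
    """Return all descendants of a concept, sorted alphabetically."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parent in PARENT.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)


def expand_query(concepts):
    """Map each query concept to itself plus its descendants, so that a
    query about 'water' can also match documents indexed under 'lake'."""
    return {c: [c] + specialize(c) for c in concepts}
```

With such an expansion, a query mentioning "water" would also be matched against "lake", "river" and "stream", which is the kind of conceptual rather than keyword-level matching argued for above.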

5

Concluding Remarks

Computers have become so ubiquitous that it is high time to develop alternative computer work modes than the present typing/screen based model. We are the first generation with exponents that have spent twenty or thirty years working in front of a computer terminal, and the ill effects are all too visible around us: tendonitis; eye, neck and back strain; Carpal Tunnel syndrome... Speech-driven knowledge base creation and consultation, just as speech driven robot control or programming, and web access through language, could bring relief from such problems. They can also partially address the present need to integrate voice recognition software, voice synthesis and programs.


In this article we have proposed some emerging approaches towards such ends. However, putting all the pieces of the puzzle together will require careful crafting. Within the logic-based database field, some recent developments could prove most valuable towards this objective, like the uses of Inductive Logic Programming to automate the construction of natural language interfaces for database queries [19]. Logic grammar formalisms have been developed, moreover, with linguistic ease of expression in mind. In helping linguists write executable grammars in terms that are not too removed from their own, we might by the way be able to tap on linguistic expertise that might be most valuable for our own language processing applications.

The availability of powerful while mobile computers and applications (e.g. [1]) also adds unprecedented potential to speech interfacing software, for instance for business people or academics who often travel, and who could therefore make the most of such results if they were portable.

Finally, our experience with the circumscribed domain of database creation could prove useful as a first step towards an even more daring application: that of programming through natural language. Efforts are under way for a [...]/North America Compulog Network of Centres of Excellence to launch a cooperation around NLP projects using logic programming, presently coordinated jointly by Dr. [...] and the author. With the present article we hope to stimulate further interaction along these lines, both in a geographic sense and across areas.

6

Appendix I: Sample Database Creation and Consultation Session

In the following sample creation session, user input is prompted by ">".

?- go.
Enter information for the database one sentence at a time
When you are done, say stop

> anne is a person
I am adding the clause:
person(anne)

> garfield is a cat
I am adding the clause:
cat(garfield)

> garfield purrs
I am adding the clause:
purrs(garfield)

> earth is the mother of anne
I am adding the clause:
mother_of(earth,anne)

> anne is canadian
I am adding the clause:
canadian(anne)

> people like cats that purr
I am adding the clause:
likes(_x38299,_x38336):-
(person(_x38299),true),cat(_x38336),purrs(_x38336),true

> the mother of anne likes cats that purr
I am adding the clause:
likes(_x39082,_x39147):-
(mother_of(_x39082,anne),true,true),cat(_x39147),
purrs(_x39147),true

> eve and adam like paradise
I am adding the clause:
likes([eve,adam],paradise)

> peter is angry with rover
I am adding the clause:
angry_with(peter,rover)

> the mother of anne gives rover to peter
I am adding the clause:
gives(_x37230,rover,peter):-
(mother_of(_x37230,anne),true,true),true

> anne is intelligent
I am adding the clause:
intelligent(anne)

> a person that likes cats that purr is intelligent
I am adding the clause:
intelligent(_x38073):-
person(_x38073),likes(_x38073,_x38193),cat(_x38193),
purrs(_x38193),true

> stop
You can consult the database by typing "answer."
yes
?- answer.

> who is intelligent
Representation of the query:
question(_x2599,(intelligent(_x2599),true))
Answer: anne

> who likes cats that purr
Representation of the query:
question(_x2889,(likes(_x2889,_x2948),cat(_x2948),purrs(_x2948),
true))
Answer: anne
Answer: earth

> who is angry with rover
Representation of the query:
question(_x3169,(angry_with(_x3169,rover),true))
Answer: peter

> who gives rover to peter
Representation of the query:
question(_x3459,(gives(_x3459,rover,peter),true))
Answer: earth

> earth gives rover to who
Representation of the query:
question(_x3749,(gives(earth,rover,_x3749),true))
Answer: peter

> earth gives who to peter
Representation of the query:
question(_x4039,(gives(earth,_x4039,peter),true))
Answer: rover

> who likes paradise
Representation of the query:
question(_x4257,(likes(_x4257,paradise),true))
Answer: [eve,adam]

> who likes garfield
Representation of the query:
question(_x4643,(likes(_x4643,garfield),true))
Answer: anne
Answer: earth

> stop
Goodbye

N.B. The mother of anne (earth) is not identified as one of the answers to "who is intelligent" because she has not been declared to be a person.
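The sentence-to-clause mapping seen in this session can be imitated, for a handful of sentence shapes only, with surface patterns. The Python sketch below is a toy under strong assumptions: the system described in this paper is built on logic grammars, not regular expressions, and the pattern list and the name to_clause are hypothetical. It only shows the input/output relation for the simplest sentences of the session.

```python
import re

# Hypothetical surface patterns; each template refers to the regex groups
# by position ({0} is the first group, {1} the second, and so on).
PATTERNS = [
    (re.compile(r"^(\w+) is an? (\w+)$"), "{1}({0})"),
    (re.compile(r"^(\w+) is the (\w+) of (\w+)$"), "{1}_of({0},{2})"),
    (re.compile(r"^(\w+) is (\w+) with (\w+)$"), "{1}_with({0},{2})"),
    (re.compile(r"^(\w+) is (\w+)$"), "{1}({0})"),
    (re.compile(r"^(\w+) (\w+)s$"), "{1}s({0})"),
]


def to_clause(sentence):
    """Return a Prolog-style fact for a simple sentence, or None if no
    pattern applies (e.g. for sentences with relative clauses)."""
    text = sentence.strip().lower()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None
```

Sentences such as "people like cats that purr" fall outside these patterns, which is precisely where a grammar-based analysis is needed.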

7

Appendix II- The Database Created Through the Session in Appendix I

% dyn_compiled: person/1:
person(anne).

% dyn_compiled: cat/1:
cat(garfield).

% dyn_compiled: purrs/1:
purrs(garfield).

% dyn_compiled: mother_of/2:
mother_of(earth,anne).

% dyn_compiled: canadian/1:
canadian(anne).

% likes/2:
likes(A,B) :-
    person(A),
    true,
    cat(B),
    purrs(B).
likes(A,B) :-
    mother_of(A,anne),
    true,
    true,
    cat(B),
    purrs(B).
likes([eve,adam],paradise).

% angry_with/2:
angry_with(peter,rover).

% gives/3:
gives(A,rover,peter) :-
    mother_of(A,anne),
    true,
    true.

% intelligent/1:
intelligent(anne).
intelligent(A) :-
    person(A),
    likes(A,B),
    cat(B),
    purrs(B).
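To see why the consultation session answers as it does, the facts and rules above can be re-evaluated by hand. The following Python sketch does this with plain sets. It is an illustration only: the names individuals and who are assumptions of the example, and the fact likes([eve,adam],paradise) is left out because it plays no role in the two queries checked here.

```python
# Facts from the database above, as Python sets.
person = {"anne"}
cat = {"garfield"}
purrs = {"garfield"}
mother_of = {("earth", "anne")}
intelligent_facts = {"anne"}
individuals = {"anne", "earth", "garfield", "peter", "rover"}


def likes(a, b):
    """likes(A,B) holds if A is a person, or the mother of anne,
    and B is a cat that purrs."""
    purring_cat = b in cat and b in purrs
    by_person_rule = a in person and purring_cat
    by_mother_rule = (a, "anne") in mother_of and purring_cat
    return by_person_rule or by_mother_rule


def intelligent(a):
    """intelligent(A) holds as a fact, or if A is a person who likes
    some cat that purrs."""
    if a in intelligent_facts:
        return True
    return a in person and any(likes(a, b) for b in cat if b in purrs)


def who(predicate):
    """Return, in alphabetical order, the individuals satisfying a predicate."""
    return sorted(x for x in individuals if predicate(x))
```

Here who(intelligent) yields only anne: earth likes garfield through the second likes rule, but the rule for intelligent also requires person(earth), which was never asserted.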

8

Appendix III- Sample Interaction with a Spanish Consultable Virtual World Through LogiMOO

This Spanish session illustrates, among other things, flexibility re. nouns within a controlled English coverage. Since the virtual world admits the crafting of new objects, and we do not want to restrict the user to crafting only those objects whose vocabulary is known, the system assumes from context that an unknown word is a noun, and proceeds to translate it as itself. Thus we get a definite "Spanglish" flavour in the internal representations obtained, but for the purposes of reacting to the Spanish commands, this does not matter much. Ultimately, a bilingual dictionary consultation on line should generate the translation, perhaps in consultation with the user as to different possible translations. Other shortcuts are also taken (e.g. guest room translates into the single identifier guestroom, clitic pronouns are in an unnatural position, etc.). The Spanish interactions shown translate into:


I am Paul. Dig a guestroom. Go there. Dig a kitchen.
Go to the hall. Look.
I am the wizard. Where am I?
Dig the bedroom. Go there. Dig a kitchen, open a port to the south of the kitchen, go there, open a port to the north of the bedroom. Go there. Build a portrait. Give it to the wizard. Look.
I am Diana. Build a car. Where is the car?
Build a Gnu. Who has it? Where is the Gnu? Where am I?
Give the wizard the Gnu that I built. Who has it?

test_data("Yo soy Paul.").
test_data("Cave una habitacion_huespedes. Vaya alli. Cave una cocina.").
test_data("Vaya al vestibulo. Mire.").
test_data("Yo soy el brujo. Donde estoy yo?").
test_data("Cave el dormitorio. Vaya alli. Cave una cocina, abra una puerta alsur de la cocina, vaya alli, abra una puerta alnorte del dormitorio. Vaya alli. Construya un cuadro. Dese lo al brujo. Mire.").
test_data("Yo soy Diana. Construya un automovil. Donde esta el automovil?").
test_data("Construya un Gnu. Quien tiene lo? Donde esta el Gnu? Donde estoy yo?").
test_data("Dele al brujo el Gnu que yo construi. Quien tiene lo?").

/*
TRACE:
==BEGIN COMMAND RESULTS==
TEST: Yo soy Paul.
WORDS: [yo,soy,paul,.]
SENTENCES: [yo,soy,paul]
==BEGIN COMMAND RESULTS==
login as: paul with password: none
your home is at http://199.60.3.56/~veronica
SUCCEEDING(iam(paul))
==END COMMAND RESULTS==
TEST: Cave una habitacion_huespedes. Vaya alli. Cave una cocina.
WORDS: [cave,una,habitacion_huespedes,.,
vaya,alli,.,cave,una,cocina,.]
SENTENCES: [cave,una,habitacion_huespedes]
[vaya,alli] [cave,una,cocina]
==BEGIN COMMAND RESULTS==
SUCCEEDING(dig(habitacion_huespedes))
you are in the habitacion_huespedes
SUCCEEDING(go(habitacion_huespedes))
SUCCEEDING(dig(cocina))
==END COMMAND RESULTS==
TEST: Vaya al vestibulo. Mire.
WORDS: [vaya,al,vestibulo,.,mire,.]
SENTENCES: [vaya,al,vestibulo] [mire]
==BEGIN COMMAND RESULTS==
you are in the lobby
SUCCEEDING(go(lobby))
user(veronica,none,'http://...').
user(paul,none,'http://...').
login(paul).
online(veronica).
online(paul).
place(lobby).
place(habitacion_huespedes).
place(cocina).
contains(lobby,veronica).
contains(lobby,paul).
SUCCEEDING(look)
==END COMMAND RESULTS==
TEST: Yo soy el brujo. Donde estoy yo?
WORDS: [yo,soy,el,brujo,.,donde,estoy,yo,?]
SENTENCES: [yo,soy,el,brujo] [donde,estoy,yo]
==BEGIN COMMAND RESULTS==
login as: wizard with password: none
your home is at http://199.60.3.56/~veronica
SUCCEEDING(iam(wizard))
you are in the lobby
SUCCEEDING(whereami)
==END COMMAND RESULTS==
TEST: Cave el dormitorio. Vaya alli. Cave una cocina, abra una puerta alsur de la cocina, vaya alli, abra una puerta alnorte del dormitorio. Vaya alli. Construya un cuadro. Dese lo al brujo. Mire.
WORDS: [cave,el,dormitorio,.,vaya,alli,.,
cave,una,cocina,(,),abra,una,puerta,alsur,
de,la,cocina,(,),vaya,alli,(,),abra,una,
puerta,alnorte,del,dormitorio,.,vaya,alli,
.,construya,un,cuadro,.,dese,lo,al,brujo,.,
mire,.]
SENTENCES: [cave,el,dormitorio] [vaya,alli]
[cave,una,cocina] [abra,una,puerta,alsur,
de,la,cocina] [vaya,alli] [abra,una,puerta,
alnorte,del,dormitorio] [vaya,alli]
[construya,un,cuadro] [dese,lo,al,brujo]
[mire]
==BEGIN COMMAND RESULTS==
SUCCEEDING(dig(bedroom))
you are in the bedroom
SUCCEEDING(go(bedroom))
SUCCEEDING(dig(cocina))
SUCCEEDING(open_port(south,cocina))
you are in the cocina
SUCCEEDING(go(cocina))
SUCCEEDING(open_port(north,bedroom))
you are in the bedroom
SUCCEEDING(go(bedroom))
SUCCEEDING(craft(cuadro))
logimoo:# 'wizard:I give you cuadro'
SUCCEEDING(give(wizard,cuadro))
user(veronica,none,'http://...').
user(paul,none,'http://...').
user(wizard,none,'http://...').
login(wizard).
online(veronica).
online(paul).
online(wizard).
place(lobby).
place(habitacion_huespedes).
place(cocina).
place(bedroom).
contains(lobby,veronica).
contains(lobby,paul).
contains(bedroom,wizard).
contains(bedroom,cuadro).
port(bedroom,south,cocina).
port(cocina,north,bedroom).
has(wizard,cuadro).
crafted(wizard,cuadro).
SUCCEEDING(look)
==END COMMAND RESULTS==
TEST: Yo soy Diana. Construya un automovil. Donde esta el automovil?
WORDS: [yo,soy,diana,.,construya,un,
automovil,.,donde,esta,el,automovil,?]
SENTENCES: [yo,soy,diana]
[construya,un,automovil]
[donde,esta,el,automovil]
==BEGIN COMMAND RESULTS==
login as: diana with password: none
your home is at http://199.60.3.56/~veronica
SUCCEEDING(iam(diana))
SUCCEEDING(craft(automovil))
automovil is in lobby
SUCCEEDING(where(automovil))
==END COMMAND RESULTS==
TEST: Construya un Gnu. Quien tiene lo? Donde esta el Gnu? Donde estoy yo?
WORDS: [construya,un,gnu,.,quien,tiene,
lo,?,donde,esta,el,gnu,?,donde,estoy,yo,?]
SENTENCES: [construya,un,gnu]
[quien,tiene,lo] [donde,esta,el,gnu]
[donde,estoy,yo]
==BEGIN COMMAND RESULTS==
SUCCEEDING(craft(gnu))
diana has gnu
SUCCEEDING(who(has,gnu))
gnu is in lobby
SUCCEEDING(where(gnu))
you are in the lobby
SUCCEEDING(whereami)
==END COMMAND RESULTS==
TEST: Dele al brujo el Gnu que yo construi. Quien tiene lo?
WORDS: [dele,al,brujo,el,gnu,que,yo,
construi,.,quien,tiene,lo,?]
SENTENCES: [dele,al,brujo,el,gnu,que,yo,
construi] [quien,tiene,lo]
==BEGIN COMMAND RESULTS==
logimoo:# 'wizard:I give you gnu'
SUCCEEDING(give(wizard,gnu))
wizard has gnu
SUCCEEDING(who(has,gnu))
==END COMMAND RESULTS==
SUCCEEDING(test)
==END COMMAND RESULTS==
*/

References 1. 2.

3.

4.

5.

.

.

.

9.

10. 11. 12. 13.

.

l . u . uo to S. o o t n . S u t u. umption mm o nowl Sy t m . Informatica 22(4) p 435 444 199 . . l . u S. o o t n . S u t u. Sp ni nt to Lo i oo- tow multilin u l vi tu l wo l . Informatica volum 2 jun 1999. 2 .S i l . l n . u. Generating Internet Based VRML Animations through Natural Language Controlled Partial Order Planners ni l po t Simon niv ity 199 . 1 . l . ll n . . om . ivin o ot t ou n tu l l n u . Proceedings 1995 International Conference on Systems, Man and Cybernetics p 1904 190 july 1995. 0 . . om . l n . ll. Lo i l nnin in o oti . Proceedings 1995 International Conference on Systems, Man and Cybernetics p 2951 2955 july 1995. 0 S. o ot . l n . u. i tu l nvi onm nt o oll o tiv L nin . Proc. World Multiconference on Systemics, Cybernetics and Informatics (SCI’98) and 4th International Conference on Information Systems Analysis and Synthesis (ISAS’98) l n o lo i 199 . 2 . . i n . ll S. o ot . l n . u. n-Lin ou i ov y u in tu l L n u . Proc. RIAO’97 , Computer-Assisted Searching on the Internet pp. 33 355 ill niv ity ont l un 199 . 3 . . . i n . . . n. nit l u mm o L n u n ly i u v y o t o m li m n omp i on wit n ition two k . Artificial Intelligence vol. 13 p 231 2 19 0. 52 . i n . ill . Extending definite clause grammars with scoping constructs p 3 3 3 9. n vi . . n Sz i . ( .) nt n tion l on n in Lo i o mmin 1990. 49 54 . olm u . t mo p o i mm . Lecture Notes in Computer Science 3 p 133 1 9 Sp in l 19 . 49 52 . . . ow l ki. Logic for Problem Solving. o t - oll n 19 9. 5 . ll. oun tion o t xonomi n o in . Computational Intelligence 14(4) 1 45 199 . 5 . u . o . l n S. o o t. Lo i n xt n il ultii tu l o l wit tu l L n u ont ol Journal of Logic Programming 3 (3) p 331 353 1999. 2

14.

15.

1 .

1 . 1 .

19.

20. 21. 22.


. l . u n . Li. umption mm o o in tu l L nu . Proceedings International Conference on Logic Programming’97 p 25 2 0 199 . 49 0 . l. Lo i l i no u tiv tu l L n u on ult l t . Proc. V Int. Conf. on Very Large Databases, Rio de Janeiro p 24 31 19 9. 50 . l. lo i o l n u . n . pt . k n .S. n ( .) The Logic Programming Paradigm: A 25 year perspective p 429 451 Sp in l 1999. 51 . vi on. tu l L n u nt o o min t p t . ICDE p 9 19 4. 50 . u . o n . m n il o ( .). Proc. of the 2nd International Workshop on Logic Programming Tools for Internet Applications L 9 199 . 2 . . ll n . . oon y. L nin to t u i u in n u tiv Lo i o mmin . Proc. Thirteenth National Conference on Artificial Intelligence p 1050 1055 199 . 4 S. . Lok . Adding Logic Programming Behaviour to the World Wide Web i niv. o l ou n u t li 199 . 2 . . t n L. lli. i to y ppli tion . Proc. 8th Annual ACM Symposium on User Interface Software and Technology 1995. 4 . . ill n t u . Som u o i -o lo i in omput tion l lin ui ti . n Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics p 24 255 19 . 49

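Relating Templates to Language and Logic

John F. Sowa
Westchester Polytechnic Univ.
[email protected]
http://west.poly.edu

Abstract. Syntactic theories relate sentence structure to the details of morphemes, inflections, and word order. Semantic theories relate sentences to the details of formal logic and model theory. But many of the most successful programs for information extraction (IE) are based on domain-dependent templates that ignore the details at the center of attention of the major theories of syntax and semantics. This paper shows that it is possible to find a more primitive set of operations, called the canonical formation rules, which underlie both the template-filling operations of IE and the more formal operations of parsers and theorem provers. These rules are first stated in terms of conceptual graphs and then generalized to any knowledge representation, including predicate calculus, frames, and the IE templates. As a result, the template-filling operations of IE become part of a more general set of operations that can be used in various combinations to process knowledge of any kind, including linguistic knowledge, at any level of detail.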

Abstract. yn h or s r l s n n sr r o h ls o orph s nfl ons n wor or r. n h or s r l s n n s o h ls o or l log n o l h ory. ny o h os s ss l progr s or n or on x r on ( ) r s on o n p n n pl s h gnor h ls h n ro n on o h jor h or s o syn x n s n s. h s p p r shows h s poss l o n or pr s o op r ons ll h non l or on r l s wh h n rl oh h pl ll ng op r ons o n h or or l op r ons o p rs rs n h or pro rs. h s r l s r rs s n r s o on p l gr phs n h n g n r lz o ny knowl g r pr s n on n l ng pr l l s r s n h pl s. s r s l h pl ll ng op r ons o o p r o or g n r l s o op r ons h n s n ro s o n ons o pro ss knowl g o ny k n n l ng l ng s knowl g ny l l o l.

1

Relating Different Language Levels

Pazienza (Ed.): Information Extraction, LNAI 1714, pp. 76–94, 1999. © Springer-Verlag Berlin Heidelberg 1999

Since the 1960s, many theoretical and computational linguists have assumed that language processing, either in the human brain or in a computer program, is best performed by an integrated system of modules that operate on different language levels: phonology, morphology, syntax, semantics, pragmatics, and general world knowledge. Chomsky and his students started from the bottom and defined syntactic structures for the first three levels. Richard Montague and his colleagues started in the middle with symbolic logic for the semantic level and worked downward into syntax and upward to pragmatics. Roger Schank and his students started from the top with a variety of knowledge representations: conceptual dependencies, scripts, MOPs (memory organization packets), TOPs (thematic organization packets), and problem-dependent representations for case-based reasoning about general world knowledge. Although these schools of thought shared some common views about the existence of different language levels, their theoretical foundations and notations were so radically different that collaboration among them was impossible, and adherents of different paradigms ignored each other's work.

During the 1980s, Prolog demonstrated that a general logic-based approach could be fast enough to handle every level from phonology to world knowledge.


Prolog-like unification grammars for the lower levels could be integrated with deductive reasoning and possible-world semantics for the higher levels. In one form or another, logic would be the underlying mechanism that supported every level of language processing. Various notations for different subsets of logic were developed: feature structures, description logics, discourse representation structures, conceptual graphs, SNePS (semantic network processing system), and many variations of predicate calculus. Although these systems used different notations, their common basis in logic made it possible for techniques developed for any one of the systems to be adapted to most if not all of the others.

During the 1990s, however, the MUC series of message understanding conferences and the ARPA Tipster project showed that the integrated systems designed for detailed analysis at every level are too slow for information extraction. They cannot process the large volumes of text on the Internet fast enough to find and extract the information that is relevant to a particular topic. Instead, competing groups with a wide range of theoretical orientations converged on a common approach: domain-dependent templates for representing the critical patterns of concepts and a limited amount of syntactic processing to find appropriate phrases that fill slots in the templates [7].

The group at SRI International ([1],[8]) found that TACITUS, a logic-based text-understanding system, was far too slow. It spent most of its time on syntactic nuances that were irrelevant to the ultimate goal. They replaced it with FASTUS, a finite-state processor that is triggered by key words, finds phrase patterns without attempting to link them into a formal parse tree, and matches the phrases to the slots in the templates. Cowie [3] observed that the FASTUS templates, which are simplified versions of a logic-based approach, are hardly distinguishable from the sketchy scripts that De Jong ([4],[5]) developed as a simplified version of a Schankian approach.

Although IE systems have achieved acceptable levels of recall and precision on their assigned tasks, there is more work to be done. The templates are hand-tailored for each domain, and their success rates on homogeneous corpora evaporate when they are applied to a wide range of documents. The high performance of template-based IE comes at the expense of a laborious task of designing specialized templates. Furthermore, that task can only be done by highly trained specialists, usually the same researchers who implemented the system that uses the templates.

A practical IE system cannot depend on the availability of human consultants for routine customization. It should automatically construct new templates from information supplied by users who have some familiarity with the domain, but no knowledge of how the IE system works. But the laborious task of deriving customized templates for a new domain is very different from the high-speed task of using the templates. Whereas the extraction task does shallow processing of large volumes of text, the customization task requires detailed understanding of the user's questions and the context in which they are asked. It depends on all the syntactic, semantic, pragmatic, and logical nuances that are ignored in the high-speed search and extraction task.
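The keyword-triggered, pattern-based style of extraction described above can be pictured with a small Python sketch. It is not FASTUS or any actual MUC system: the template, its slot names, the trigger words, and the function fill_template are all assumptions invented for this illustration. Each slot is paired with a regular expression, and no parse tree is built.

```python
import re

# A hypothetical hand-tailored template: slot name -> phrase pattern.
TEMPLATE = {
    "person": re.compile(
        r"([A-Z][a-z]+ [A-Z][a-z]+) (?:was|has been) (?:named|appointed)"),
    "position": re.compile(r"(?:named|appointed) (?:as )?([a-z ]+?) of\b"),
    "organization": re.compile(r"\bof ([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"),
}

# Key words that trigger the template; other sentences are skipped.
TRIGGERS = ("named", "appointed")


def fill_template(sentence):
    """Fill the slots of TEMPLATE from one sentence, or return None
    if no trigger word occurs in it."""
    if not any(word in sentence for word in TRIGGERS):
        return None
    filled = {}
    for slot, pattern in TEMPLATE.items():
        match = pattern.search(sentence)
        filled[slot] = match.group(1) if match else None
    return filled
```

The sketch also makes the cost visible: every pattern is tied to one domain and one way of phrasing the event, which is why such templates must be redesigned by hand for each new task.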


Ultimately, a practical IE system must be able to perform detailed text understanding, but on a much smaller amount of information than the search and extraction task. When deriving customized templates, the system must focus on specific information about the user's requirements. The customization task must do more than translate a single query into SQL. It may start with a query, but it must continue with a clarification dialog to resolve ambiguities and fill in background knowledge. But linking the customization stage with the extraction stage requires a common semantic framework that can accommodate both.

The purpose of this paper is to show how the IE templates fit into a larger framework that links them to the more detailed issues of parse trees, discourse structures, and formal semantics. This framework is related to logic, but not in the same way as the logic-based systems of the 1980s. Instead, it depends on a small set of lower-level operations, called the canonical formation rules, which were originally developed in terms of conceptual graphs [12]. But those operations can be generalized to any knowledge representation language, including predicate calculus, frames, and IE templates. This paper presents the canonical formation rules, and relates them to conceptual graphs (CGs), predicate calculus, frames, and templates. The result is not a magic solution to all the problems, but a framework in which they can be addressed.

2

Canonical Formation Rules

All operations on conceptual graphs are based on combinations of six canonical formation rules, each of which performs one basic graph operation. Logically, each rule has one of three possible effects: it makes a CG more specialized, it makes a CG more generalized, or it changes the shape of a CG, but leaves it logically equivalent to the original. All the rules come in pairs: for each specialization rule, there is an inverse generalization rule; and for each equivalence rule, there is an inverse equivalence rule that transforms a CG to its original shape. These rules are fundamentally graphical: they are easier to show than to describe.

The first two rules, which are illustrated in Fig. 1, are copy and simplify. At the top is a conceptual graph for the sentence "The cat Yojo is chasing a mouse". The boxes are called concepts, and the circles are called conceptual relations. In each box is a type label, such as Cat, Chase, and Mouse. In the concept [Cat: Yojo], the type field is separated by a colon from the referent field, which contains the name of a specific cat named Yojo. The agent (Agnt) relation links the concept of chasing to the concept of the cat Yojo, and the theme (Thme) relation links it to the concept of a mouse.

The down arrow in Fig. 1 represents the copy rule. One application of the rule copies the Agnt relation, and a second application copies the subgraph → (Thme) → [Mouse]. The dotted line connecting the two [Mouse] concepts is a coreference link that indicates that both concepts refer to the same individual. The copies in the bottom graph are redundant, since they add no new information. The up arrow represents two applications of the simplify rule, which performs the inverse operations of erasing redundant copies.

[Figure 1: the top graph is [Cat: Yojo] ← (Agnt) ← [Chase] → (Thme) → [Mouse]; the Copy arrow leads down to a bottom graph with duplicated Agnt and Thme relations and two coreferent [Mouse] concepts; the Simplify arrow leads back up.]

Fig. 1. Copy and simplify rules

The copy and simplify rules are called equivalence rules because any two CGs that can be transformed from one to the other by any combination of copy and simplify rules are logically equivalent. The two formulas in predicate calculus that are derived from the CGs in Fig. 1 are also logically equivalent. In typed predicate calculus, each concept of the top CG maps to a quantified variable whose type is the same as the concept type. If no other quantifier is written in the referent field, the default quantifier is the existential ∃. The top CG maps to the following formula:

(∃x:Cat)(∃y:Chase)(∃z:Mouse) (name(x,'Yojo') ∧ agnt(y,x) ∧ thme(y,z)),


which is true or false under exactly the same circumstances as the formula that corresponds to the bottom CG:

(∃x:Cat)(∃y:Chase)(∃z:Mouse)(∃w:Mouse) (name(x,'Yojo') ∧ agnt(y,x) ∧ agnt(y,x) ∧ thme(y,z) ∧ thme(y,w) ∧ (z = w))

By the inference rules of predicate calculus, the redundant copy of agnt(y,x) can be erased. The equation z = w, which corresponds to the coreference link between the two [Mouse] concepts, allows the variable w to be replaced by z. After the redundant parts have been erased, the simplification of the second formula transforms it back to the first.

Fig. 2 illustrates the restrict and unrestrict rules. At the top is a CG for the sentence "A cat is chasing an animal." By two applications of the restrict rule, it is transformed to the CG for "The cat Yojo is chasing a mouse." The first step is a restriction by referent of the concept [Cat], which represents some indefinite cat, to the more specific concept [Cat: Yojo], which represents an individual cat named Yojo. The second step is a restriction by type of the concept [Animal] to a concept of the subtype [Mouse]. Two applications of the unrestrict rule perform the inverse transformation of the bottom graph to the top graph. The restrict rule is called a specialization rule, and the unrestrict rule is a generalization rule. The more specialized graph implies the more general one: if the cat Yojo is chasing a mouse, it follows that a cat is chasing an animal.

[Figure 2: the top graph is [Cat] ← (Agnt) ← [Chase] → (Thme) → [Animal]; the Restrict arrow leads down to [Cat: Yojo] ← (Agnt) ← [Chase] → (Thme) → [Mouse]; the Unrestrict arrow leads back up.]

Fig. 2. Restrict and unrestrict rules

Equivalent operations can be performed on the corresponding formulas in predicate calculus. The top graph corresponds to the formula

(∃x:Cat)(∃y:Chase)(∃z:Animal) (agnt(y,x) ∧ thme(y,z)).

Restriction by referent adds the predicate name(x,'Yojo'), and restriction by type replaces the type label Animal with Mouse:

(∃x:Cat)(∃y:Chase)(∃z:Mouse) (name(x,'Yojo') ∧ agnt(y,x) ∧ thme(y,z))

By the rules of predicate calculus, this formula implies the previous one.

Fig. 3 illustrates the join and detach rules. At the top are two CGs for the sentences "Yojo is chasing a mouse" and "A mouse is brown." The join rule overlays the two identical copies of the concept [Mouse] to form a single CG for the sentence "Yojo is chasing a brown mouse." The detach rule performs the inverse operation. The result of join is a more specialized graph that implies the one derived by detach.

[Figure 3: the top graphs are [Cat: Yojo] ← (Agnt) ← [Chase] → (Thme) → [Mouse] and [Mouse] → (Attr) → [Brown]; the Join arrow leads down to a single graph in which the two [Mouse] concepts are merged; the Detach arrow leads back up.]

Fig. 3. Join and detach rules

In predicate calculus, join corresponds to identifying two variables, either by an equality operator such as z = w or by a substitution of one variable for every occurrence of the other. The conjunction of the formulas for the top two CGs is

((∃x:Cat)(∃y:Chase)(∃z:Mouse) (name(x,'Yojo') ∧ agnt(y,x) ∧ thme(y,z))) ∧ ((∃w:Mouse)(∃v:Brown) attr(w,v))

After substituting z for all occurrences of w and deleting redundancies,

(∃x:Cat)(∃y:Chase)(∃z:Mouse)(∃v:Brown) (name(x,'Yojo') ∧ agnt(y,x) ∧ thme(y,z) ∧ attr(z,v))

By the rules of predicate calculus, this formula implies the previous one.

Although the canonical formation rules are easy to visualize, the formal specifications require more detail. They are most succinct for the simple graphs, which are CGs with no contexts, no negations, and no quantifiers other than existentials. The following specifications, stated in terms of the abstract syntax, can be applied to a simple graph u to derive another simple graph w.

1. Equivalence rules. The copy rule copies a graph or subgraph. The simplify rule performs the inverse operation of erasing a copy. Let v be any subgraph of a simple graph u; v may be empty or it may be all of u.
– Copy. The copy rule makes a copy of any subgraph v of u and adds it to u to form w. If c is any concept of v that has been copied from a concept d in u, then c must be a member of exactly the same coreference sets as d. Some conceptual relations of v may be linked to concepts of u that are not in v; the copies of those conceptual relations must be linked to exactly the same concepts of u.
– Simplify. The simplify rule is the inverse of copy. If two subgraphs v1 and v2 of u are identical, they have no common concepts or conceptual relations, and corresponding concepts of v1 and v2 belong to the same coreference sets, then v2 may be erased. If any conceptual relations of v1 are linked to concepts of u that are not in v1, then the corresponding conceptual relations of v2 must be linked to exactly the same concepts of u, which may not be in v2.
2. Specialization rules. The restrict rule specializes the type or referent of a single concept node. The join rule merges two concept nodes to a single node. These rules transform u to a graph w that is more specialized than u.
– Restrict. Any concept or conceptual relation of u may be restricted by type by replacing its type with a subtype. Any concept of u with a blank referent may be restricted by referent by replacing the blank with some other existential referent.
– Join. Let c and d be any two concepts of u whose types and referents are identical. Then w is the graph obtained by deleting d, adding c to all coreference sets in which d occurred, and attaching to c all arcs of conceptual relations that had been attached to d.
3. Generalization rules. The unrestrict rule, which is the inverse of restrict, generalizes the type or referent of a concept node. The detach rule, which is the inverse of join, splits a graph in two parts at some concept node. The last two rules transform u to a graph w that is a generalization of u.
– Unrestrict. Let c be any concept of u. Then w may be derived from u by unrestricting c either by type or by referent: unrestriction by type replaces the type label of c with some supertype; and unrestriction by referent erases an existential referent to leave a blank.


– Detach. Letc be any conceptof u. T hen w may be derived from u by making a copyd of c, detaching one or more arcs of conceptualrelations thathad been attached to c, and attaching them to d. Although the six canonical formation rules have been explicitly stated in terms of conceptual graphs, equivalent operations can be performed on any knowledge representation.T he equivalents for predicate calculus were illustrated for Figs.1 , 2, and 3.Equivalentoperations can also be performed on frames and templates: the copy and simplify rules are similar to the CG versions;restrict corresponds to filling slots in a frame or specializing a slotto a subtype;and join corresponds to inserting a pointer thatlinks slots in two diff erentframes or templates. For nested contexts, the formation rules depend on the levelof nested negations.A positive context(sign +) is nested in an even number negations (possibly zero). A negative context(sign -) is nested in an odd number of negations. – Zero negations. A contextthathas no attached negations and is notnested in any other contextis defined to be positive. – Negated context.T he negation relation (Neg) or its abbreviation bythe ∼ or ¬ symbolreverses the sign of anycontextitis attached to: a negated context contained in a positive contextis negative;a negated contextcontained in a negative contextis positive. – Scoping context. A contextc with the type label SC and no attached conceptual relations is a scoping context, whose sign is the same as the sign of the contextin which itis nested. Letu be a conceptual graph in which some conceptis a contextwhose designator is a nested conceptual graph v. T he following canonical formation rules convertu to another CG w by operating on the nested graph v, while leaving everything else in u unchanged. 1 . Equivalence rules. – If v is a CG in the contextC, then letw be the graph obtained by performing a copy or simplify rule on v. 
– A context of type Negation whose referent is another context of type Negation is called a double negation. If u is a double negation that includes the graph v, then let w be the graph obtained by replacing u with a scoping context around v:

   [Negation: [Negation: v]] => [SC: v].

A double negation or a scoping context around a conceptual graph may be drawn or erased at any time. If v is a conceptual graph, the following three forms are equivalent: ∼[ ∼[ v]], [v], v.


2. Specialization rules.
– If C is positive, then let w be the result of performing any specialization rule in C.
– If C is negative, then let w be the result of performing any generalization rule in C.

3. Generalization rules.
– If C is positive, then let w be the result of performing any generalization rule in C.
– If C is negative, then let w be the result of performing any specialization rule in C.

In summary, negation reverses the effect of generalization and specialization, but it has no effect on the equivalence rules. Corresponding operations can be performed on formulas in predicate calculus. For frames and templates, the treatment of negation varies from one implementation to another; some systems have no negations, and others have many special cases that must be treated individually. But for any knowledge representation that supports negation, the same principle holds: negation reverses generalization and specialization.
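The sign rules for nested contexts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Context`, the label "Sheet" for the outermost context, and the function `rule_to_apply` are hypothetical names introduced here, not part of any conceptual graph standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """A nested context. The label is 'Negation', 'SC' (scoping), or
    'Sheet' (a hypothetical label for the outermost context)."""
    label: str
    parent: Optional["Context"] = None

    def sign(self) -> str:
        # Count the negations from this context outward, including itself.
        negations = 0
        ctx = self
        while ctx is not None:
            if ctx.label == "Negation":
                negations += 1
            ctx = ctx.parent
        # Even number of negations (possibly zero): positive; odd: negative.
        return "+" if negations % 2 == 0 else "-"


def rule_to_apply(desired_effect: str, ctx: Context) -> str:
    """Kind of rule to perform inside ctx so that the whole graph undergoes
    desired_effect ('equivalence', 'specialization', or 'generalization')."""
    if desired_effect == "equivalence":
        return "equivalence"  # negation has no effect on equivalence rules
    flip = {"specialization": "generalization",
            "generalization": "specialization"}
    return desired_effect if ctx.sign() == "+" else flip[desired_effect]


sheet = Context("Sheet")
neg = Context("Negation", sheet)
scoping = Context("SC", neg)          # same sign as the enclosing context
double_neg = Context("Negation", scoping)
print(sheet.sign(), neg.sign(), scoping.sign(), double_neg.sign())  # + - - +
print(rule_to_apply("specialization", neg))  # generalization
```

The parity test is the whole mechanism: a scoping context adds nothing to the count, and each negation flips which kind of rule yields a specialization of the enclosing graph.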

3 Notation-Independent Rules of Inference

The canonical formation rules, which can be formulated in equivalent versions for conceptual graphs and predicate calculus, extract the logical essence from the details of syntax. As a result, the rules of inference can be stated in a notation-independent way. In fact, they are so completely independent of notation that they apply equally well to any knowledge representation for which it is possible to define rules of generalization, specialization, and equivalence. That includes frames, templates, discourse representation structures, feature structures, description logics, expert system rules, SQL queries, and any semantic representation for which the following three kinds of rules can be formulated:
– Equivalence rules. The equivalence rules may change the appearance of a knowledge representation, but they do not change its logical status. If a graph or formula u is transformed to another graph or formula v by any equivalence rule, then u implies v, and v implies u.
– Specialization rules. The specialization rules transform a graph or formula u to a graph or formula v that is logically more specialized: v implies u.
– Generalization rules. The generalization rules transform a graph or formula u to a graph or formula v that is logically more generalized: u implies v.

The notation-independent rules of inference were formulated by the logician Charles Sanders Peirce. Peirce [10] had originally invented the algebraic notation for predicate calculus with notation-dependent rules for modus ponens and instantiation of universally quantified variables. But he continued to search for a simpler and more general representation, which expressed the logical operations diagrammatically, in what he called a more iconic form. In 1897, Peirce


invented existential graphs and introduced rules of inference that depend only on the operations of copying, erasing, and combining graphs. These five rules are so general that they apply to any version of logic for which the corresponding operations can be defined:

1. Erasure. In a positive context, any graph or formula u may be replaced by a generalization of u; in particular, u may be erased (i.e. it may be replaced by a blank, which is the universal generalization).
2. Insertion. In a negative context, any graph or formula u may be replaced by a specialization of u; in particular, any graph may be inserted (i.e. it may replace the blank).
3. Iteration. If a graph or formula u occurs in a context C, another copy of u may be drawn in the same context C or in any context nested in C.
4. Deiteration. Any graph or formula u that could have been derived by iteration may be erased.
5. Equivalence. Any equivalence rule (copy, simplify, or double negation) may be performed on any graph, subgraph, formula, or subformula in any context.

Each of these rules preserves truth: if the starting graph or formula u happens to be true, the resulting formula v must also be true. Peirce's only axiom is the blank sheet of assertion. A blank sheet, which says nothing, cannot be false. Any statement that is derivable from the blank by these rules is a theorem, which must always be true.

When applied to entire graphs or formulas, these rules support propositional logic; but when they are applied to subgraphs and coreference links, they support full first-order logic. Peirce's rules take their simplest form when they are applied to his original existential graphs or to conceptual graphs, which are a typed version of existential graphs. When they are applied to the predicate calculus notation, Peirce's rules must accommodate various special cases that depend on the properties of each of the logical operators. That accommodation transforms Peirce's five rules to the rules of natural deduction, which were defined by Gerhard Gentzen over thirty years later. For Peirce's original statement of the rules, see [11]. For further examples and discussion of their application to other logics, see [13], [14].
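As an illustration of how the rules preserve truth at the propositional level, the following Python sketch derives q from p and ∼[p ∼[q]] (modus ponens) by deiteration, double-negation erasure, and erasure, then checks each step by brute force over all truth assignments. The list-and-tuple encoding and every function name are assumptions made for this sketch; it covers only the propositional case, not coreference links.

```python
from itertools import product

# A context is a list of items asserted together. An item is either an
# atom (a string) or a negated context written ("neg", [items...]).


def holds(items, world):
    """A context is true when every item in it is true."""
    return all(
        world[item] if isinstance(item, str) else not holds(item[1], world)
        for item in items
    )


def implies(g, h, atoms):
    """Brute-force check that g implies h under every truth assignment."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if holds(g, world) and not holds(h, world):
            return False
    return True


def deiterate(outer, neg_index, atom):
    """Erase a copy of atom inside the negation at outer[neg_index];
    permitted because atom also occurs in the enclosing context."""
    assert atom in outer, "deiteration needs a copy in the outer context"
    tag, inner = outer[neg_index]
    pruned = list(inner)
    pruned.remove(atom)
    return outer[:neg_index] + [(tag, pruned)] + outer[neg_index + 1:]


def erase_double_negation(items, index):
    """Replace ~[ ~[ v ] ] at items[index] by the contents v."""
    tag, inner = items[index]
    assert tag == "neg" and len(inner) == 1
    assert isinstance(inner[0], tuple) and inner[0][0] == "neg"
    return items[:index] + list(inner[0][1]) + items[index + 1:]


def erase(items, index):
    """Erasure on the sheet of assertion (a positive context)."""
    return items[:index] + items[index + 1:]


step0 = ["p", ("neg", ["p", ("neg", ["q"])])]   # p, and "if p then q"
step1 = deiterate(step0, 1, "p")                # p, ~[ ~[ q ] ]
step2 = erase_double_negation(step1, 1)         # p, q
step3 = erase(step2, 0)                         # q
steps = [step0, step1, step2, step3]
for before, after in zip(steps, steps[1:]):
    assert implies(before, after, ["p", "q"])   # each rule preserves truth
print(step3)  # ['q']
```

The truth-table check is not part of Peirce's system; it is added here only to confirm that each rewriting step yields a graph implied by the previous one.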

4 Generalization Hierarchies

The rules of inference of logic define a generalization hierarchy over the terms of any logic-based language. Fig. 4 shows a hierarchy in conceptual graphs, but an equivalent hierarchy could be represented in any knowledge representation language. For each dark arrow in Fig. 4, the graph above is a generalization, and the graph below is a specialization. The top graph says that an animate being is the agent (Agnt) of some act that has an entity as the theme (Thme) of the act. Below it are two specializations: a graph for a robot washing a truck, and a graph for an animal chasing an entity. Both of these graphs were derived from the top graph by repeated applications of the rule for restricting type labels to


subtypes. The graph for an animal chasing an entity has three specializations: a human chasing a human, a cat chasing a mouse, and the dog Macula chasing a Chevrolet. These three graphs were also derived by repeated application of the rule of restriction. The derivation from [Animal] to [Dog: Macula] required both a restriction by type from Animal to Dog and a restriction by referent from the blank to the name Macula.
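The derivation from [Animal] to [Dog: Macula] can be sketched in Python as two successive restrictions. The type hierarchy below is a simplified, hypothetical single-parent table loosely based on the labels in Fig. 4, and the names `Concept`, `restrict_type`, and `restrict_referent` are introduced only for this illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical hierarchy: each type maps to one immediate supertype.
SUPERTYPE = {
    "Animate": "Entity", "Animal": "Animate", "Robot": "Animate",
    "Human": "Animal", "Cat": "Animal", "Dog": "Animal", "Mouse": "Animal",
    "Truck": "Entity", "Chevrolet": "Entity",
    "Chase": "Act", "Wash": "Act",
}


@dataclass(frozen=True)
class Concept:
    type: str
    referent: Optional[str] = None  # None stands for the blank referent

    def __str__(self):
        if self.referent is None:
            return f"[{self.type}]"
        return f"[{self.type}: {self.referent}]"


def is_subtype(t, ancestor):
    """True if t equals ancestor or lies below it in SUPERTYPE."""
    while t is not None:
        if t == ancestor:
            return True
        t = SUPERTYPE.get(t)
    return False


def restrict_type(c, subtype):
    """Restriction by type: replace the type label with a subtype."""
    if not is_subtype(subtype, c.type):
        raise ValueError(f"{subtype} is not a subtype of {c.type}")
    return replace(c, type=subtype)


def restrict_referent(c, name):
    """Restriction by referent: replace a blank referent with a name."""
    if c.referent is not None:
        raise ValueError("only a blank referent may be restricted")
    return replace(c, referent=name)


animal = Concept("Animal")
dog = restrict_referent(restrict_type(animal, "Dog"), "Macula")
print(animal, "=>", dog)  # [Animal] => [Dog: Macula]
```

Unrestriction would run the same two steps in reverse: replace Dog by a supertype and erase the name to leave a blank.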

[Figure 4: conceptual graphs arranged from the most general (top) to the most specialized (bottom). Concept labels: Animate, Act, Entity, Robot, Wash, Truck, Animal, Chase, Human, Cat, Mouse, Dog: Macula, Chevrolet, Senator, Secretary, Desk, Cat: Yojo, Brown, Vigorous, Cat: Tigerlily, Gray. Relation labels: Agnt, Thme, Arnd, Attr, Manr.]

Fig. 4. A generalization hierarchy

Besides restriction, a join was used to specialize the graph for a human chasing a human to the graph for a senator chasing a secretary around a desk. The join was performed by merging the concept [Chase] in the upper graph with the concept [Chase] in the following graph:

   [Chase] → (Arnd) → [Desk]

Since the resulting graph has three relations attached to the concept [Chase], it is not possible to represent the graph on a single line in a linear notation. Instead, a hyphen may be placed after the concept [Chase] to show that the


attached relations are continued on subsequent lines:

   [Chase]-
      (Agnt) → [Senator]
      (Thme) → [Secretary]
      (Arnd) → [Desk]

For the continued relations, it is not necessary to show both arcs, since the direction of one arrow implies the direction of the other one.

The two graphs at the bottom of Fig. 4 were derived by both restriction and join. The graph on the left says that the cat Yojo is vigorously chasing a brown mouse. It was derived by restricting [Cat] to [Cat: ’Yojo’] and by joining the following two graphs:

   [Mouse] → (Attr) → [Brown]
   [Chase] → (Manr) → [Vigorous]

The relation (Manr) represents manner, and the relation (Attr) represents attribute. The bottom right graph of Fig. 4 says that the cat Tigerlily is chasing a gray mouse. It was derived from the graph above it by one restriction and one join. All the derivations in Fig. 4 can be reversed by applying the generalization rules from the bottom up instead of the specialization rules from the top down: every restriction can be reversed by unrestriction, and every join can be reversed by detach.

The generalization hierarchy, which is drawn as a tree in Fig. 4, is an excerpt from a lattice that defines all the generalizations and specializations that are possible with the rules of inference. Ellis, Levinson, and Robinson ([6]) implemented such lattices with high-speed search mechanisms for storing and retrieving graphs. They extended their techniques to systems that can access millions of graphs in time proportional to the logarithm of the size of the hierarchy. Their techniques, which were designed for conceptual graphs, can be applied to any notation, including frames and templates.
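The join that produces the senator graph can be sketched as follows. This is a minimal illustration, assuming a hypothetical encoding in which concepts are a dictionary from node identifiers to labels and each relation is a triple (name, source, target); coreference sets are omitted for brevity.

```python
def join(concepts, relations, c, d):
    """Join concepts c and d: delete d and attach to c every arc that
    had been attached to d. The two labels must be identical."""
    if concepts[c] != concepts[d]:
        raise ValueError("only identical concepts may be joined")
    merged = {k: v for k, v in concepts.items() if k != d}
    rewired = [
        (rel, c if src == d else src, c if tgt == d else tgt)
        for rel, src, tgt in relations
    ]
    return merged, rewired


def linear(concepts, relations, hub):
    """Linear notation with a hyphen after the hub concept and one
    continued relation per line."""
    lines = [f"[{concepts[hub]}]-"]
    for rel, src, tgt in relations:
        if src == hub:
            lines.append(f"   ({rel}) → [{concepts[tgt]}]")
    return "\n".join(lines)


# Two graphs on one sheet: a senator chasing a secretary, and a chase
# around a desk. Nodes 2 and 4 are both [Chase].
concepts = {1: "Senator", 2: "Chase", 3: "Secretary", 4: "Chase", 5: "Desk"}
relations = [("Agnt", 2, 1), ("Thme", 2, 3), ("Arnd", 4, 5)]

concepts, relations = join(concepts, relations, 2, 4)
print(linear(concepts, relations, 2))
# [Chase]-
#    (Agnt) → [Senator]
#    (Thme) → [Secretary]
#    (Arnd) → [Desk]
```

Detach, the inverse operation, would copy node 2 back into a new node and move the (Arnd) arc to the copy.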

5 Frames and Templates

Predicate calculus and conceptual graphs are domain-independent notations that can be applied to any subject whatever. In that respect, they resemble natural languages, which can express anything that anyone can express in any artificial language plus a great deal more. The templates for information extraction and the frames for expert systems are usually highly specialized to a particular application.


One expert system for diagnosing cancer patients represented knowledge in a frame with the following format:

   (defineType MedPatient
      (supertype Person)
      . . .
      (motherMelanoma
         (type Boolean)
         (question (’Has the patient’s mother had melanoma?’)) ))

This frame says that a medical patient, MedPatient, has a supertype Person. Then it lists several attributes, including one named motherMelanoma, which has two facets: one facet declares that the values of the attribute must be of type Boolean; and the other specifies a character string called a question. Whenever the system needs the current value of the motherMelanoma attribute, it prints the character string on the display screen, and a person answers yes or no. Then the system converts the answer to a Boolean value (T or F), which becomes the value of the attribute.

Such frames are simple, but they omit important details. The words mother and melanoma appear in a character string that is printed as a question for some person at a computer display. Although the person may know the meaning of those words, the system cannot relate them to the attribute motherMelanoma, which by itself has no more meaning than the character string "MM". Whether or not the system can generate correct answers using values of that attribute depends on how the associated programs happen to process the character strings.

To express those details, Fig. 5 shows a conceptual graph for the sentence "The patient's mother suffered from melanoma."
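The behavior of the MedPatient frame described above can be mimicked in a short Python sketch. The classes `Slot` and `Frame` and the function `slot_value` are hypothetical names, and the accepted answers ("yes", "y", "no", "n") are an assumption; the original system's code is not given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Slot:
    type: type       # the "type" facet, e.g. bool for Boolean
    question: str    # the "question" facet, an uninterpreted string


@dataclass
class Frame:
    name: str
    supertype: str
    slots: Dict[str, Slot]


MED_PATIENT = Frame(
    name="MedPatient",
    supertype="Person",
    slots={
        "motherMelanoma": Slot(
            type=bool,
            question="Has the patient's mother had melanoma?",
        ),
    },
)


def slot_value(frame: Frame, slot_name: str,
               ask: Callable[[str], str] = input):
    """Print the slot's question, read an answer, and convert it to the
    declared type. The question text is never analyzed by the program."""
    slot = frame.slots[slot_name]
    answer = ask(slot.question + " ").strip().lower()
    if slot.type is bool:
        if answer in ("yes", "y"):
            return True
        if answer in ("no", "n"):
            return False
        raise ValueError(f"expected yes or no, got {answer!r}")
    return slot.type(answer)


# A canned answer stands in for the person at the display.
print(slot_value(MED_PATIENT, "motherMelanoma", ask=lambda q: "yes"))  # True
```

The sketch makes the limitation concrete: nothing links the words in the question string to the attribute name, so renaming `motherMelanoma` to `MM` would change nothing in the program's behavior.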

Past