Studies in Computational Intelligence 939
Roussanka Loukanova Editor
Natural Language Processing in Artificial Intelligence— NLPinAI 2020
Studies in Computational Intelligence Volume 939
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Editor Roussanka Loukanova Department of Algebra and Logic Institute of Mathematics and Informatics Bulgarian Academy of Sciences Sofia, Bulgaria
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-63786-6 ISBN 978-3-030-63787-3 (eBook) https://doi.org/10.1007/978-3-030-63787-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Computational and technological developments that incorporate natural language are proliferating. Adequate coverage encounters difficult problems related to partiality, underspecification, and context-dependency, which are signature features of information in nature and natural languages. Furthermore, agents (humans or computational systems) are information conveyors, interpreters, or participate as components of informational content. Generally, language processing depends on agents’ knowledge, reasoning, perspectives, and interactions.

To address the above challenges and advance further research, by sharing ideas, the Special Session on Natural Language Processing in Artificial Intelligence—NLPinAI 2020 (http://www.icaart.org/NLPinAI.aspx?y=2020) was held within the 12th International Conference on Agents and Artificial Intelligence—ICAART 2020 (http://www.icaart.org/?y=2020), which took place in Valletta, Malta, February 22–24, 2020. Some of the chapters of this book are extended and improved work based on selected papers presented at the Special Session on Natural Language Processing in Artificial Intelligence—NLPinAI 2020. The selected, shorter conference papers were published in: Proceedings of the 12th International Conference on Agents and Artificial Intelligence. Volume 1: ICAART, pp. 391–464, February 22–24, 2020. SciTePress—Science and Technology Publications, Lda. URL: https://www.scitepress.org/ProceedingsDetails.aspx?ID=w1amKRhgSWI=&t=1

The book covers theoretical work, applications, approaches, and techniques for computational models of information and its presentation by language (artificial, human, or natural in other ways). The goal is to promote computational systems of intelligent natural language processing and related models of computation, language, thought, mental states, reasoning, and other cognitive processes. The chapters of the book range over a variety of topics, e.g.,

• Logic Approaches to Natural Language Processing
• Lambek Calculus
• Classical Logic
• First-Order Linear Logic
• Type Theories for Applications to Natural Language
• Decidability
• Computational Complexity
• Dialogical Argumentation
• Computational Grammar
• Large-Scale Grammars of Natural Languages
• Grammar Learning
• Information Theory
• Shannon Information
• Machine Learning of Language
August 2020
Roussanka Loukanova Department of Algebra and Logic Institute of Mathematics and Informatics Bulgarian Academy of Sciences Sofia, Bulgaria
Contents
Lambek Calculus with Classical Logic (Wojciech Buszkowski) . . . 1
Partial Orders, Residuation, and First-Order Linear Logic (Richard Moot) . . . 37
A Hyperintensional Theory of Intelligent Question Answering in TIL (Marie Duží and Michal Fait) . . . 69
Learning Domain-Specific Grammars from a Small Number of Examples (Herbert Lange and Peter Ljunglöf) . . . 105
The Semantic Level of Shannon Information: Are Highly Informative Words Good Keywords? A Study on German (Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Clemens Rietdorf, and Tariq Yousef) . . . 139
Towards Aspect Extraction and Classification for Opinion Mining with Deep Sequence Networks (Joschka Kersting and Michaela Geierhos) . . . 163
Dialogical Argumentation and Textual Entailment (Davide Catta, Richard Moot, and Christian Retoré) . . . 191
A Novel Approach to Determining the Quality of News Headlines (Amin Omidvar, Hossein Pourmodheji, Aijun An, and Gordon Edall) . . . 227
Lambek Calculus with Classical Logic Wojciech Buszkowski
Abstract One of the most natural extensions of the Lambek calculus augments this logic with connectives of classical propositional logic. Actually, the resulting logic can be treated as a classical modal logic with binary modalities. This paper shows several basic properties of the latter logic in two versions: nonassociative and associative (axiom systems, algebras and frames, completeness, decidability, complexity), and some closely related logics. We discuss certain earlier results and add new ones. Keywords Lambek calculus · Classical logic · Residuated groupoid · Residuated semigroup · Ternary frame · Hilbert system · Sequent system · Completeness · Modal logic · Decidability · Computational complexity
1 Introduction

1.1 Overview

The Lambek calculus, introduced by Lambek [31] under the name Syntactic Calculus, is a propositional logic which admits three binary connectives: · (product), \ (first residual) and / (second residual); the residuals are also regarded as (substructural) implications and written → and ←, respectively. We denote this logic by L. It can be axiomatized as a logic of arrows ϕ ⇒ ψ, where ϕ, ψ are formulas. The axioms are all arrows:

(id) ϕ ⇒ ϕ
(a1) (ϕ · ψ) · χ ⇒ ϕ · (ψ · χ)
(a2) ϕ · (ψ · χ) ⇒ (ϕ · ψ) · χ
and the inference rules are as follows:

(r1)  ϕ · ψ ⇒ χ
      ═════════
      ψ ⇒ ϕ\χ

(r2)  ϕ · ψ ⇒ χ
      ═════════
      ϕ ⇒ χ/ψ

(cut-1)  ϕ ⇒ ψ    ψ ⇒ χ
         ──────────────
         ϕ ⇒ χ
As usually, the double line in a rule means that it expresses two rules: top-down and bottom-up. Some examples of proofs can be found in the next subsection. This axiomatization follows the algebraic axioms defining residuated semigroups, which are the algebraic models of L. A residuated semigroup is an ordered algebra (A, ·, \, /, ≤) such that (A, ≤) is a poset, (A, ·) is a semigroup, and \, / are binary operations on A satisfying the residuation laws:

(RES) for all a, b, c ∈ A:  a · b ≤ c  iff  b ≤ a\c  iff  a ≤ c/b.

One refers to \, / as residual operations for product ·. Lambek [32] also considered a nonassociative version of this logic, nowadays called the nonassociative Lambek calculus and denoted NL, which omits (a1), (a2). Its algebraic models are residuated groupoids, defined like residuated semigroups except that product need not be associative. In models ⇒ is interpreted as ≤.

In this paper we consider the extensions of NL and L by connectives of classical logic: ¬, ∨, ∧ (→, ↔ are defined) and constants ⊥ and ⊤, interpreted as the least and the greatest element in algebras. We denote these logics NL-CL and L-CL, respectively. They can be axiomatized by adding to NL and L the following axioms and rules.

(a∧) ϕ ∧ ψ ⇒ ϕ    ϕ ∧ ψ ⇒ ψ
(r∧) from ϕ ⇒ ψ and ϕ ⇒ χ, infer ϕ ⇒ ψ ∧ χ
(a∨) ϕ ⇒ ϕ ∨ ψ    ψ ⇒ ϕ ∨ ψ
(r∨) from ϕ ⇒ χ and ψ ⇒ χ, infer ϕ ∨ ψ ⇒ χ
(a⊥) ⊥ ⇒ ϕ
(a⊤) ϕ ⇒ ⊤
(D) ϕ ∧ (ψ ∨ χ) ⇒ (ϕ ∧ ψ) ∨ (ϕ ∧ χ)
(¬.1) ϕ ∧ ¬ϕ ⇒ ⊥
(¬.2) ⊤ ⇒ ϕ ∨ ¬ϕ
(a∧), (r∧), (a∨), (r∨) axiomatize ∨, ∧ in lattices; (a⊥) and (a⊤) are the axioms for bounds. (D) is a distributive law for ∧, ∨; the other distributive laws are derivable. (¬.1) and (¬.2) are the axioms for negation. Both logics are very natural. The intended models for L in linguistics are algebras of languages, i.e., algebras of all subsets of Σ+ (the set of all nonempty finite strings over Σ). From the point of view of modal logics, a model is based on the complex algebra of a relational frame (W, R), where R ⊆ W³.
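To make the language-model reading concrete, here is a small sketch of my own (not from the chapter), in plain Python over the two-letter alphabet {a, b}: it implements the product and the two residuals on sets of nonempty strings and brute-force checks the residuation law (RES) on a handful of small languages. The names prod, under and over are ad hoc.

```python
# A finite-check sketch of the language model of L: subsets of Sigma+ with
# product X.Y and the two residuals X\Y, X/Y.  Illustration only.

from itertools import product

def prod(X, Y):
    """X . Y = { x y : x in X, y in Y } (concatenation of languages)."""
    return {x + y for x in X for y in Y}

def under(X, Z):
    """X \\ Z = { b : X . {b} is a subset of Z }, computed exactly for finite
    nonempty X, Z: every such b must be a proper suffix of some string in Z."""
    candidates = {z[i:] for z in Z for i in range(1, len(z))}
    return {b for b in candidates if prod(X, {b}) <= Z}

def over(Z, Y):
    """Z / Y = { a : {a} . Y is a subset of Z }, dually via proper prefixes."""
    candidates = {z[:i] for z in Z for i in range(1, len(z))}
    return {a for a in candidates if prod({a}, Y) <= Z}

# Residuation law (RES): X.Y <= Z  iff  Y <= X\Z  iff  X <= Z/Y.
strings = [''.join(w) for n in (1, 2) for w in product('ab', repeat=n)]
langs = [{s} for s in strings] + [{'a', 'ab'}, {'ba', 'b'}, {'aa', 'ab', 'ba', 'bb'}]

for X in langs:
    for Y in langs:
        for Z in langs:
            left, mid, right = prod(X, Y) <= Z, Y <= under(X, Z), X <= over(Z, Y)
            assert left == mid == right, (X, Y, Z)
print("(RES) holds on all tested triples")
```

Because the tested languages are finite and nonempty, every candidate for X\Z is a proper suffix of some string in Z, so the residuals are computed exactly rather than approximated.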
Dynamic interpretations of L lead to relation algebras. These algebras are boolean algebras of sets, and ∨, ∧, ¬ are interpreted as set-theoretic union, intersection and complement.

One also considers NL and L with constant 1, interpreted as the unit element for ·. The axioms for 1 are:

(ax1) 1 · ϕ ⇔ ϕ    ϕ · 1 ⇔ ϕ

Here and further ϕ ⇔ ψ means: ϕ ⇒ ψ and ψ ⇒ ϕ. The resulting logics are denoted by NL1 and L1 and their extensions with classical connectives by NL1-CL and L1-CL. Notice that NL1 (resp. L1) is not a conservative extension of NL (resp. L). Since 1 ⇒ p/p is provable in NL1, then (p/p)\p ⇒ p is so, but the latter is not provable in NL. The same example works for L1 versus L and the extensions with classical connectives.

Remark 1 In the literature on linear logics, Lambek connectives ·, \, / and constants 1, 0 (see below for 0) are referred to as multiplicative, and lattice connectives ∨, ∧ and constants ⊥, ⊤ as additive. In linear logics the notation differs from ours: e.g., one writes ⊥ for our 0, whereas ⊤ and 1 are used in our sense, and ⊕ for our ∨. Our notation is similar to that in substructural logics [16].

The researchers in Lambek calculi studied many extensions of NL and L. Usually, these extensions differ from ours: they do not use the complete set of classical connectives. Some of them can easily be defined, using the axioms and rules written above. L with ∨, ∧ and (a∨), (r∨), (a∧), (r∧) is the logic of lattice-ordered residuated semigroups (we write l.o.r. semigroups, and similarly in other contexts). This logic is called Multiplicative-Additive Lambek Calculus (MAL)¹; see [24], where the acronym is MALC. MANL is defined in a similar way. Adding (D) to these logics yields DMAL and DMANL. They are the logics of distributive l.o.r. (write: d.l.o.r.) semigroups and groupoids, respectively. With 1 and (ax1) one obtains MAL1, MANL1, DMAL1, DMANL1. Each logic can be enriched with ⊥, ⊤; we use no acronyms for these variants.

In categorial grammars, NL and L serve as type processing logics. There were considered extensions with several products and the corresponding residuals, with unary modal operators, with ∨, ∧ interpreted in lattices (also distributive lattices), and others. The generative power of categorial grammars based on L and NL is restricted to context-free languages [11, 39]. MAL can generate some languages beyond this class [23], and similarly for L-CL. We briefly discuss categorial grammars at the end of this section.

Substructural logics are often defined as extensions of L with ∨, ∧ (interpreted in lattices) and 1 by new axioms and rules. Sequent systems for these logics omit certain structural rules (weakening, contraction, exchange), characteristic of intuitionistic and classical logics; this justifies the name. Full Lambek Calculus FL amounts to MAL1; it is often regarded as the basic substructural logic. The connectives \, / are treated as nonclassical implications. Assuming the commutative law for ·, the
name resembles Multiplicative-Additive Linear Logic (MALL).
4
W. Buszkowski
two conditionals ϕ\ψ and ψ/ϕ collapse in one ϕ → ψ. One defines (substructural) negations: ∼ ϕ = ϕ\0, −ϕ = 0/ϕ, where 0 is a new constant (interpreted as an arbitrary designated element). These negations are a kind of minimal negation; they collapse in one, if is commutative. Linear logics assume the double negation laws; for Noncommutative MALL: ∼ −ϕ ⇒ ϕ, − ∼ ϕ ⇒ ϕ [1]; the converse arrows are provable. Many nonclassical logics can be presented as axiomatic extensions of FL with 0, e.g., many-valued logics, fuzzy logics, constructive logic with strong negation, and others. A thorough discussion of substructural logics and the corresponding algebras can be found in [16]. Since (ϕ\ψ)/χ ⇔ ϕ\(ψ/χ ) is provable in L, then ∼ ϕ/ψ ⇔ ϕ\ − ψ is provable in FL with 0. This is a substructural counterpart of the contraposition law of intuitionistic logic: (ψ → ¬ϕ) ↔ (ϕ → ¬ψ). Many laws of this kind, which show interplay of negation(s) with Lambek connectives, can be proved in substructural and linear logics. This does not hold for NL-CL and L-CL in their basic form. It, however, does not mean that the latter are less interesting. In a very natural sense, to be discussed in Sect. 3, they treat Lambek connectives as modal operators. Roughly is a binary ♦ and its residuals \, / are similar to ↓ , the backward-looking necessity operator. So NL-CL and L-CL can be treated as classical modal logics with binary modalities. The present paper focuses on this point of view. The modal logic interpretation of Lambek calculi was addressed by many authors. In the world of categorial grammars, it was employed by, e.g., Morrill [38], Moortgat [36], Moot and Retoré [37], from the perspective of dynamic logics by, e.g., van Benthem [6], and in the framework of substructural logics by, e.g., Restall [42]. These works, however, usually concern different logics, either weaker than ours, e.g., negation-free, without (D), or incomparable with them, e.g., with several modalities, connected by special axioms. In [14] a system equivalent to NL-CL is denoted by BFNL (from: Boolean Full NL). The main results are: (1) the strong finite model property (the proof uses algebraic methods, taken from substructural logics), which implies the decidability of provability from finitely many assumptions, (2) the equivalence of categorial grammars based on this logic (also extended by finitely many assumptions) and contextfree grammars. In fact, this paper starts from a weaker logic DFNL, i.e., our DMANL, and the results for BFNL are stated at the end with proofs merely outlined. A more general framework, employing residuated algebras with n−ary operations, appears in [12]. It is well-known that the provability from (finitely many) assumptions is undecidable for L [11], hence for L-CL as well, since the latter is a strongly conservative extension of the former (see Sect. 2). Therefore, neither L, nor L-CL possesses the strong finite model property. Kaminski and Francez [21] study L-CL and NL-CL, denoted PL and PNL (from: L and NL with propositional logic), in the form of Hilbert-style systems. They prove the strong completeness with respect to the corresponding classes of Kripke frames and the strong finite model property for PNL, using filtration of Kripke frames.
Lambek Calculus with Classical Logic
5
Several other results are sparse in the literature. The main aim of the present paper is to collect together the most important ones. Nonetheless this paper is not a typical survey. We obtain some new results, write new proofs and simplify earlier proofs. In Sect. 2 we discuss algebras and Kripke frames, corresponding to NL-CL, LCL and some related logics. We prove the strong completeness of these logics with respect to the corresponding classes of algebras and frames. As a consequence, we show that some logics are strongly conservative extensions of others. We also show that these logics are not weakly complete with respect to some classes of intended models. Section 3 presents these logics as Hilbert-style systems (H-systems). The systems from [21] are replaced by others, which makes the analogy with modal logics, in particular: the minimal tense logic Kt , transparent. We add some new modal axioms and study the resulting logics. In particular, cyclic logics are closely related to cyclic linear logics. If one adds axioms ϕ ψ → ϕ and ϕ ψ → ψ (corresponding to rules of left weakening in sequent systems), then the resulting logics reduce to classical logic. We show how the standard method of filtration [8] can be adjusted to NL-CL and its extensions; the notion of a suitable set of formulas seems new. Section 4 concerns decidability and complexity. It is known that L is NP-complete [41] and the provability from (finitely many) assumptions in NL is PTIME [11]. We write a proof of the undecidability of L-CL, which simplifies and corrects the proofs in [27, 28]. NL-CL is PSPACE-complete [34]. The provability from (finitely many) assumptions in NL-CL is EXPTIME-complete; essentially in [44, 45].
1.2 Categorial Grammars Lambek’s intention was to extend the type reduction procedure in categorial grammars, proposed by Ajdukiewicz [2] and modified by Bar-Hillel [4]. Categorial grammars are formal grammars assigning types (categories) to expressions of a language. More precisely, a type lexicon assigns some types to lexical atoms (words), whereas the types of compound expressions are derived by a type reduction procedure (independent of the particular language). The term categorial grammar first appeared in Bar-Hillel et al. [5]. This paper, later than and referring to [31], employs reductions of the form2 : (red\) α, α\β ⇒ β (red/) α/β, β ⇒ α Here α and β are syntactic types; they can be identified with \, /−formulas of L. An expression v1 . . . vn , where v1 , . . . , vn are words, is assigned type α, if for some types α1 , . . . , αn such that vi : αi , i = 1, . . . , n, according to the type lexicon the sequence (α1 , . . . , αn ) reduces to α by finitely many applications of (red\), (red/). For instance, from ‘Jane’: pn, ‘John’: pn and ‘meets’: ( pn\s)/ pn one derives ‘Jane 2 Reductions
in [2, 4] are more involved, since they employ many-argument types.
6
W. Buszkowski
meets John’: s by two reductions. This example uses two atomic types: s (sentence) and pn (proper noun). The arrows (red\) and (red/) are provable in L (even NL), if one replaces comma with product; e.g., for (red\), apply (r1) (bottom-up) to α\β ⇒ α\β. Therefore all derivations based on these reductions can be performed in L (even NL, if one adds bracketing). L yields many new arrows. We list some; for the arrow form of L, comma in (L2) must be replaced with . (L1) Type-raising laws: α ⇒ (β/α)\β and α ⇒ β/(α\β) (L2) Composition laws: α\β, β\γ ⇒ α\γ and α/β, β/γ ⇒ α/γ (L3) Geach laws3 : α\β ⇒ (γ \α)\(γ \β) and α/β ⇒ (α/γ )/(β/γ ) To prove (L1)(left), apply (r1) to (red/) (β/α) α ⇒ β. The proof of (L2) needs the derivable rules: from α ⇒ β infer γ α ⇒ γ β and α γ ⇒ β γ . The derivation of the first rule uses (L1) β ⇒ γ \(γ β), which yields α ⇒ γ \(γ β), by the premise and (cut-1), and finally γ α ⇒ γ β, by (r1). We prove (L2)(right). (β/γ ) γ ⇒ β yields (α/β) ((β/γ ) γ ) ⇒ (α/β) β by the (first) derivable rule. The required arrow is obtained by (a1), (cut-1), (red/), (cut-1) and (r2). Other formal proofs in L are left to the reader. Due to these new laws, parsing becomes more flexible. Let us recall Lambek’s example. We assign s/( pn\s) (type of subject) to ‘she’ and (s/ pn)\s (type of object) to ‘him’. This yields ‘she meets John’: s by two applications of (red/), but ‘she meets him’: s needs L. In the sequence s/( pn\s), ( pn\s)/ pn, (s/ pn)\s (outer parentheses omitted), one reduces s/( pn\s), ( pn\s)/ pn to s/ pn by (L2), then s/ pn, (s/ pn)\s to s by (red\). Another way: expand s/( pn\s) to (s/ pn)/(( pn\s)/ pn) by (L3), then use (red/) and (red\). In a categorial grammar based on (red\) and (red/) only, ‘he’ has to be assigned both types. Accordingly, parsing with L enables one to restrict type lexicons and to see logical connections between different types of the same word (expression). By (L1) every proper noun (type pn) can also be treated as a full noun phrase, both subject (type s/( pn\s)) and object (type (s/ pn)\s). Therefore ‘and’ in ‘Jane and some teacher’ can be assigned (α\α)/α, where α is the type of subject or object. Another type of ‘and’ can be α\(α/α), but it is equivalent to the former by the laws of L: (L4) Mixed associativity: (α\β)/γ ⇔ α\(β/γ ) (L1) are provable in NL; (L2)–(L4) require associativity. Notice that (L4) is needed for ‘Jane meets him’: s, since the type of ‘meets’ must be transformed into pn\(s/ pn). In these examples only two atomic types appear. Lambek [33] elaborates a categorial grammar for a large part of English, which uses 33 atomic types, e.g., π (subject), π1 (first person singular subject), π2 (second person singular subject and any plural subject), π3 (third person singular subject), s (statement), s1 (statement in present tense), s2 (statement in past tense), and others. The grammar is based on the 3 Geach
[17] shows the usefulness of (L3) in language description.
Lambek Calculus with Classical Logic Table 1 Inference rules of sequent systems for L and NL Rule L
7
NL
( ⇒)
,α,β, ⇒γ
,αβ, ⇒γ
[(α,β)]⇒γ
[αβ]⇒γ
(⇒ )
⇒α ⇒β
,⇒αβ
⇒α ⇒β ( ,)⇒αβ
(\ ⇒)
,β, ⇒γ ⇒α
,,α\β, ⇒γ
[β]⇒γ ⇒α
[(,α\β)]⇒γ
(⇒ \)
α, ⇒β
⇒α\β
(α, )⇒β
⇒α\β
(/ ⇒)
,α, ⇒γ ⇒β
,α/β,, ⇒γ
[α]⇒γ ⇒β
[(α/β,)]⇒γ
(⇒ /)
,β⇒α
⇒α/β
( ,β)⇒α
⇒α/β
(cut)
,α, ⇒β ⇒α
,, ⇒β
[α]⇒β ⇒α
[]⇒β
calculus of pregroups, but this is not essential here: everything can be translated into L with some non-lexical assumptions, as e.g., πi ⇒ π , si ⇒ s. Formally, a categorial grammar based on a logic L can be defined as a triple G = (, I, α0 ) such that is a finite lexicon (alphabet), I is a map which assigns a finite set of types (i.e., formulas of L) to each v ∈ , and α0 is a designated type. One refers to , I and α0 as the lexicon, the type lexicon and the principal type of G. In examples we write v : α for α ∈ I (v), and similarly for compound strings assigned type α (see below). Usually, L is given in the form of a sequent system (of intuitionistic form). A sequent is an expression of the form α1 , . . . , αn ⇒ α, where αi , α are formulas of L. For nonassociative logics, like NL, the antecedent of a sequent is a bracketed string of types (precisely: an element of the free groupoid generated by the set of formulas), called a bunch. Sequent systems for L and NL were proposed by Lambek [31, 32]. The axioms are (id) α ⇒ α and the rules are shown in Table 1 In L, , stand for finite sequences of formulas, in NL for bunches. Formulas are denoted by α, β, γ . In models, each comma in the antecedent of a sequent is interpreted as product. One says that G assigns type α to a string v1 . . . vn , where all vi belong to , if there exist types αi ∈ I (vi ), i = 1, . . . , n, such that α1 , . . . , αn ⇒ α is provable in L (for nonassociative logics: under some bracketing of the antecedent). The language of G is defined as the set of all u ∈ + such that G assigns α0 to u. Often one takes an atomic type for α0 , e.g., s—the type of sentence. In rules for NL, [] is the result of substitution of for x in the context [x]. The context [x] can be defined as a bracketed string of formulas, containing one
8
W. Buszkowski
Table 2 Rules for ∨, ∧ in MAL and MANL Rule MAL (∨ ⇒)
,α, ⇒γ
MANL
,β, ⇒γ
,α∨β, ⇒γ
[α]⇒γ [β]⇒γ
[α∨β]⇒γ
(⇒ ∨)
⇒αi
⇒α1 ∨α2
Same as MAL
(∧ ⇒)
,αi , ⇒β
,α1 ∧α2 , ⇒β
[αi ]⇒β
[α1 ∧α2 ]⇒β
(⇒ ∧)
⇒α ⇒β
⇒α∧β
Same as MAL
special variable x (a place for substitution). Since in L the antecedents of sequents are nonempty (Lambek’s restriction), then must be nonempty in (⇒ \), (⇒ /) for L. Systems for logics with 1 neglect this restriction. Lambek [31, 32] proved the cut-elimination theorem for both systems: every provable sequent can be proved without (cut). This immediately yields the decidability of NL and L, since in all remaining rules the premises consist of subformulas of formulas appearing in the conclusion and the size of every premise is less than the size of the conclusion. Therefore the proof-search procedure for a cut-free proof is finite. These results remain true for several richer logics, discussed above, e.g., MANL, MAL and their versions with 1, 0 and (D). Table 2 shows axioms and rules for ∨, ∧. Another consequence of the cut-elimination theorem is the subformula property: every provable sequent has a proof using only subformulas of formulas appearing in this sequent. Therefore MAL is a conservative extension of its language restricted fragments, e.g., L, L with ∧, etc., and similarly for other logics, admitting cut elimination. Unfortunately, no cut-free sequent systems are known for NL-CL and L-CL, mainly considered in this paper. Therefore the sequent systems, presented above, are not much important in what follows except for one application, mentioned below. With (cut), some sequent systems for NL-CL and L-CL can be formed easily: simply add (D), (a⊥), (a ), (¬.1) and (¬.2) as new axioms to the sequent systems for MANL and MAL, respectively. Kanazawa [23] studies categorial grammars based on MAL. He provides examples of types with ∧, ∨, illustrating feature decomposition of types. For instance, ‘walks’: (np ∧ sg)\s and ‘walk’: (np ∧ pl)\s, where np is a type of noun phrase, whereas sg and pl are types of singular and plural phrase, respectively. In [14] types with ∨ are used to eliminate Lambek’s non-lexical assumptions, mentioned above. Instead of them one can define π = π1 ∨ π2 ∨ π3 , s = s1 ∨ s2 with the same effect.
Lambek Calculus with Classical Logic
9
Kanazawa [23] proves that the languages generated by categorial grammars based on MAL are closed under finite intersections and unions. The proof essentially uses a cut-free system for MAL. In fact, if L 1 , L 2 are ( −free) context-free languages, then L 1 ∩ L 2 can be generated by a categorial grammar based on the (\, /, ∧)−fragment of MAL, i.e., the product-free L with ∧; we denote it here by L0 . This follows from the cut-elimination theorem, the reversibility of (⇒ ∧) and a restricted reversibility of (∧ ⇒). Precisely: (1) if ⇒ α ∧ β is provable, then both ⇒ α and ⇒ β are provable, (2) If , α ∧ β, ⇒ γ is provable in L0 , γ does not contain ∧, and ∧ does not occur in the scope of \, /, then , α, ⇒ γ or , β, ⇒ γ is provable in L0 . Both claims can easily be proved by induction on cut-free proofs. Let G 1 , G 2 be grammars such that L(G 1 ) = L 1 and L(G 2 ) = L 2 ; both grammars are based on the product-free L and have the same lexicon but no common atomic type. Let G be the grammar which to any v ∈ assigns the conjunction of all types assigned by G 1 and G 2 to v; its principal type equals α1 ∧ α2 , where αi is the principal type of G i . It is easy to verify that L(G) = L(G 1 ) ∩ L(G 2 ). Therefore the generative power of categorial grammars based on MAL (even L0 ) is greater than those based on L. L0 is complete with respect to language models [9], hence with respect to boolean residuated semigroups (see Sect. 2), and the latter holds for L-CL as well. Consequently, L-CL is a conservative extension of L0 . So every grammar based on L0 can also be treated as a grammar based on L-CL. Therefore the grammars based on L-CL can generate some languages which are not context-free, e.g., {a n bn cn : n ≥ 1}. Types with ¬ make it possible to express negative information: u : ¬α means that u is not of type α; see [10] for a discussion of categorial grammars with negative information. Our grammars do not assign types to the empty string . Kuznetsov [29] shows that all context-free languages, also containing , are generated by categorial grammars based on L1. The construction, however, of a categorial grammar for a given language employs quite complicated types. In practice, it is easier to extend the type lexicon by assigning the principal type to , if one wants to have in the language. In this paper we often consider the consequence relations for the given logics, i.e., provability from a set of assumptions. Categorial grammars, studied in the literature, are usually based on pure logics. This agrees with the principle of lexicality: all information on the particular language is contained in the type lexicon. In practice, however, this logical purity may be inconvenient. Besides non-lexical assumptions for atomic types, like those used by Lambek [33], one can approximate a stronger but less efficient logic, e.g., L, L-CL, by a weaker but more efficient logic, e.g., NL, NL-CL, by adding to the latter some particular arrows, provable in the former only, as assumptions. Finally, L with finitely many assumptions can generate arbitrary recursively enumerable languages [11]. Above we have merely described some very basic aspects of categorial grammars. To keep this paper in a reasonable size, we cannot elaborate more on them. The reader is referred to [35–38] for a thorough discussion.
10
W. Buszkowski
2 Algebras and Frames The algebraic models of L (resp. NL), i.e., residuated semigroups (resp. residuated groupoids), have been defined in Sect. 1. By a boolean residuated groupoid (we write: b.r. groupoid) we mean an algebra (A, , \, /, ∨, ∧,− , ⊥, ) such that (A, ∨, ∧,− , ⊥, ) is a boolean algebra and (A, , \, /, ≤) is a residuated groupoid, where ≤ is the boolean ordering: a ≤ b ⇔ a ∨ b = b. If is associative, the algebra is called a boolean residuated semigroup (we write: b.r. semigroup). Remark 2 In the literature on substructural logics, one considers residuated lattices (A, , \, /, 1, ∨, ∧), where (A, ∨, ∧) is a lattice and (A, , \, /, 1, ≤) is a residuated monoid, i.e., a residuated semigroup with the unit element for product (≤ is the lattice ordering). They are algebraic models of FL. Following this terminology, one might use the term ‘residuated boolean algebra’ for our ‘b.r. semigroup’ [20, 34]. Our term, however, seems more precise and will be used in this paper. We do not assume that the algebra is unital, i.e., admits 1. If so, it is referred to as a b.r. monoid and for the nonassociative case a b.r. unital groupoid. Like residuated lattices, b.r. groupoids and semigroups can be axiomatized by a finite set of equations, hence these classes are (algebraic) varieties. The axioms are the standard axioms for boolean algebras, i.e., the associative, commutative and distributive laws for ∨, ∧, x ∨ ⊥ = x, x ∧ = x, x ∨ x − = , x ∧ x − = ⊥ and the following axioms for , \, /: (R1) (R2) (R3) (R4) (R5)
x (x\y) ≤ y, (x/y) y ≤ x x ≤ y\(y x), x ≤ (x y)/y x y ≤ (x ∨ z) y, x y ≤ x (y ∨ z) x\y ≤ x\(y ∨ z), x/y ≤ (x ∨ z)/y (x y) z = x (y z) (for b.r. semigroups).
Indeed, (R1)–(R4) are valid in b.r. groupoids and (R5) in b.r. semigroups. (R1), (R2) hold by (RES). Using (RES), one easily proves the monotonicity condition: (MON) if x ≤ y, then z x ≤ z y, x z ≤ y z, z\x ≤ z\y, and x/z ≤ y/z, which yields (R3), (R4). Conversely, (RES) follow from (R1)–(R4) in lattices. First, (MON) follows from (R3), (R4). We show: x y ≤ z iff y ≤ x\z. Assume x y ≤ z. Then y ≤ x\(x y) ≤ x\z. Assume y ≤ x\z. Then x y ≤ x (x\z) ≤ z. In residuated groupoids \, / are antitone in the bottom argument. (AON) if x ≤ y, then y\z ≤ x\z and z/y ≤ z/x Assume x ≤ y. Then x (y\z) ≤ y (y\z) ≤ z. This yields y\z ≤ x\z, by (RES). For / the argument is dual. In b.r. groupoids (even in l.o.r. groupoids), (MON) can be strengthened to the following distributive laws:
Lambek Calculus with Classical Logic
11
(x ∨ y) z = (x z) ∨ (y z) z (x ∨ y) = (z x) ∨ (z y)
(1)
z\(x ∧ y) = (z\x) ∧ (z\y) (x ∧ y)/z = (x/z) ∧ (y/z)
(2)
and (AON) to: (x ∨ y)\z = (x\z) ∧ (y\z) z/(x ∨ y) = (z/x) ∧ (z/y)
(3)
In what follows we assume the higher priority of unary operators over binary operators, and , \, / and related operators over classical binary connectives and the corresponding operation symbols. For classical connectives, ∧, ∨ are stronger than →, ↔. So parentheses on the right-hand side of each equation, written above, can be omitted. Formulas of NL-CL and L-CL are formed out of propositional variables p, q, r, . . . and constants ⊥, by means of the connectives , \, /, ∨, ∧, ¬. Given an algebra A = (A, . . .), where the operations and designated elements in . . . correspond to the basic connectives and constants (in particular − to ¬), a valuation in A is a homomorphism of the (free) algebra of formulas into A. A formula ϕ is said to be: (1) true in A for valuation μ, if μ(ϕ) = ,4 (2) valid in A, if it is true in A for all valuations, (3) valid in the class of algebras A, if it is valid in every algebra from A. Let be a set of formulas. We say that entails ϕ in A, if ϕ is true in every A ∈ A for any μ such that all formulas in are true in A for μ. The connectives →, ↔ are defined as usually (in classical logic).5 ϕ → ψ = ¬ϕ ∨ ψ ϕ ↔ ψ = (ϕ → ψ) ∧ (ψ → ϕ) An arrow ϕ ⇒ ψ is said to be true in A for μ, if μ(ϕ) ≤ μ(ψ). This is equivalent to μ(ϕ → ψ) = . So arrows are identified with classical conditionals. Therefore the notions, defined in the preceding paragraph, can also be applied to (sets of) arrows. A logic L is weakly complete with respect to A, if the theorems of L are precisely the formulas (arrows) valid in A. L is strongly complete with respect to A, if for any set of formulas (arrows) and any formula (arrow) ϕ, ϕ is provable in L from
if and only if entails ϕ in A. (ϕ is provable in L from , if ϕ is provable in L enriched with all formulas (arrows) from as assumptions. In opposition to axioms, assumptions need not be closed under substitutions.) Clearly weak completeness follows from strong completeness. Theorem 1 NL-CL (resp. L-CL) is strongly complete with respect to b.r. groupoids (resp. b.r. semigroups). Proof The proof is routine. For soundness, one observes that all axioms are valid and all rules preserve the truth for μ in every b.r. groupoid. should not be confused with 1 ≤ μ(ϕ), admitted in substructural logics. formulas of first-order logic we write ⇒, ⇔.
4 This 5 In
12
W. Buszkowski
For completeness, one constructs a Lindenbaum–Tarski algebra. A syntactic work is needed. One shows that the following monotonicity rules are derivable in both systems (this means: the conclusion is provable from the premise). Here ∗ represents each of the connectives ∨, ∧, . ϕ⇒ψ χ ∗ϕ ⇒χ ∗ψ
ϕ⇒ψ ϕ∗χ ⇒ψ ∗χ
ϕ⇒ψ ¬ψ ⇒ ¬ϕ
ϕ⇒ψ ϕ⇒ψ ϕ⇒ψ ϕ⇒ψ χ \ϕ ⇒ χ \ψ ϕ/χ ⇒ ψ/χ ψ\χ ⇒ ϕ\χ χ /ψ ⇒ χ /ϕ Let be a set of arrows. One defines a binary relation: ϕ ∼ ψ iff ϕ ⇔ ψ is provable from . By the monotonicity rules, ∼ is a congruence on the algebra of formulas. The quotient algebra is a b.r. groupoid (resp. semigroup) for the case of NL-CL (resp. L-CL. By [ϕ] we denote the equivalence class of ∼ containing ϕ. For the valuation μ defined by μ( p) = [ p] , one gets μ(ϕ) = [ϕ] for any formula ϕ. One proves: [ϕ] ≤ [ψ] iff ϕ ⇒ ψ is provable from . Consequently, ϕ ⇒ ψ is provable from if and only if [ϕ → ψ] = [ ] , which is equivalent to μ(ϕ → ψ) = [ ] . Therefore all arrows in are true for μ in the quotient algebra. If ϕ ⇒ ψ is not provable from , then [ϕ → ψ] = [ ] , which means that ϕ ⇒ ψ is not true for μ; so does not entail ϕ ⇒ ψ in b.r. groupoids (semigroups). Remark 3 A formula ϕ is said to be provable in the system, if ⇒ ϕ is provable. For either system, there holds: ϕ ⇒ ψ is provable from if and only if ϕ → ψ is so. Indeed, from ϕ ⇒ ψ we get ¬ϕ ∨ ϕ ⇒ ¬ϕ ∨ ψ, by monotonicity, hence ⇒ ϕ → ψ, by lattice laws, (¬.2), (cut-1) and the definition of →. Conversely, from ⇒ ϕ → ψ we get ϕ ∧ ⇒ ϕ ∧ (¬ϕ ∨ ψ), by monotonicity and the definition of →, hence ϕ ⇒ ϕ ∧ ψ, by (D), bounded lattice laws, monotonicity and (¬.1), which yields ϕ ⇒ ψ, by (a∧) and (cut-1). Notice that this equivalence expresses a kind of Deduction Theorem for arrows. Deduction Theorem for provability does not hold in these logics. For instance, in both logics p/q is provable from p, but p → p/q is not provable. The proof of Theorem 1 can be adapted for several other logics. In a similar way one proves that NL (resp. L) is strongly complete with respect to residuated groupoids (resp. residuated semigroups), NL1 (resp. L1) with respect to residuated unital groupoids (resp. residuated monoids), MANL (resp. MAL) with respect to l.o.r. groupoids (resp. semigroups) and so on. Remark 4 Every formula of our logics can be translated into a term of the first-order theory of the corresponding algebras. By t (ϕ) we denote the translation of ϕ. It is reasonable to change the symbols for operations ∨, ∧ in terms and algebras: write ∪ for ∨ and ∩ for ∧. For instance, t ( p ∨ ¬ p) = x ∪ x − . By completeness, ϕ ⇒ ψ is valid in the class if and only if t (ϕ) ≤ t (ψ) is so. By strong completeness, ϕ ⇒ ψ is provable from if and only if t (ϕ) ≤ t (ψ) follows from {t (χ ) ≤ t (χ ) : (χ ⇒ χ ) ∈
Lambek Calculus with Classical Logic
13
} in this class. For a finite , the latter condition is equivalent to the validity of the Horn formula: the conjunction of all formulas t (χ ) ≤ t (χ ), for χ ⇒ χ in , implies t (ϕ) ≤ t (ψ). As a consequence, formal proofs of arrows in a system can be replaced by algebraic proofs of the corresponding first-order formulas, which are often shorter and easier. For logics with ∨, ∧, atomic formulas with ≤ can be replaced by equations s = t, hence Horn formulas by quasi-equations: s1 = t1 ∧ · · · ∧ sn = tn ⇒ s = t. We consider some special constructions of b.r. semigroups (groupoids). Given a groupoid (G, ·), one defines operations , \, / on P(G). X Y = {a · b : a ∈ X, b ∈ Y } X \Y = {b ∈ G : X {b} ⊆ Y } X/Y = {a ∈ G : {a} Y ⊆ X } (P(G), , \, /, ⊆) is a residuated groupoid. Clearly P(G) is a boolean algebra of sets, hence this construction yields a b.r. groupoid, which we refer to as the powerset algebra of (G, ·). If (G, ·) is a semigroup, then this construction yields a b.r. semigroup. If (G, ·, 1) is a unital groupoid (resp. monoid), then this construction yields a b.r. unital groupoid (resp. b.r. monoid), where {1} is the unit element for . One also considers relation algebras P(W 2 ) with , \, / defined as follows. R S = {(x, y) ∈ W 2 : ∃z ((x, z) ∈ R ∧ (z, y) ∈ S)} R\S = {(z, y) ∈ W 2 : R {(z, y)} ⊆ S} R/S = {(x, z) ∈ W 2 : {(x, z)} S ⊆ R}
They are algebraic models of L1-CL; IdW = {(x, x) : x ∈ W } is the unit element for . Clearly is the relative product (often written as R ◦ S or R; S). For L-CL, one considers algebras P(U ), where U ⊆ W 2 is a transitive relation (then P(U ) is closed under ; in definitions of \, / one replaces W 2 with U ). The strong completeness of L with respect to the latter algebras was shown in [3]. In linguistics, the intended models of L are powerset algebras of ( + , ·), where + is the set of all nonempty finite strings over a (finite) alphabet and · is the concatenation of strings (this operation is associative). Subsets of + are called −free languages on . One often refers to these algebras as language models. For NL, strings are replaced by bracketed strings (phrase structures). For bracketed strings x, y one defines: x · y = (x, y) (in examples, comma can be omitted). For instance, a · (b, c) = (a, (b, c)), but we write (Jane (meets John)). By (+) we denote the set of all bracketed strings on . More precisely, ( (+) , ·) is the free groupoid generated by , and ( + , ·) is the free semigroup generated by . By we denote the empty string. ∗ = + ∪ { } is the set of all finite strings on . ∗ with concatenation and is the free monoid generated by . We also define (∗) = (+) ∪ { } and assume · x = x = x · for x ∈ (∗) ; this yields the free unital groupoid generated by . The powerset algebras of ∗ and (∗) are the intended models of L1 and NL1, respectively; { } is the unit element.
14
W. Buszkowski
By Theorem 1, L-CL and NL-CL are (strongly) sound with respect to powerset algebras of semigroups and groupoids, respectively, and consequently to language models. L (resp. NL) is strongly complete with respect to powerset algebras of semigroups [9] (resp. groupoids [25]). The proofs of the latter results employ some labeled deductive systems. As a consequence, L-CL (resp. NL-CL) is a strongly conservative extension of L (resp. NL). This means: for any set of arrows and any arrow ϕ ⇒ ψ in the language of the weaker system, this arrow is provable from this set in the weaker system if and only if it is so in the stronger system (for = ∅, this defines a conservative extension). We prove it for L. The ‘only-if’ part is obvious. For the ‘if’ part, assume that φ ⇒ ψ is not provable from in L. Since L is strongly complete with respect to powerset algebras of semigroups, there exist a semigroup (G, ·) and a valuation μ in P(G) such that all arrows in are true for μ but ϕ ⇒ ψ is not. The powerset algebra can be expanded to a b.r. semigroup. So does not entail ϕ ⇒ ψ in b.r. semigroups. By Theorem 1, ϕ ⇒ ψ is not provable from in L-CL. This argument also works for NL versus NL-CL. The very result has already been proved in [21]; the proof uses frame models (see below). Pentus [40] proves the weak completeness of L with respect to language models P( + ) (the proof is quite involved). This does not hold for NL and language models P( (+) ). We recall an example of Došen [15]. The arrow (( p q)/r ) r ⇒ p r is valid in these algebras. For assume a ∈ μ((( p q)/r ) r ). Then a = b · c, where b ∈ μ(( p q)/r ) and c ∈ μ(r ). Hence b · c ∈ μ( p q), which yields b ∈ μ( p), since in (+) b, c are the only elements such that a = b · c. Consequently a ∈ μ( p r ). This arrow, however, is not provable in NL, since it is not valid in residuated groupoids. Take the free group generated by { p, q, r }. Every group is a residuated semigroup (hence groupoid), with a b = a · b (write ab), a\b = a −1 b, a/b = ab−1 , and ≤ being the identity relation. For μ defined by μ( p) = p and similarly for q, r , one gets μ((( p q)/r ) r ) = pqr −1r = pr = μ( p r ). The same example shows that NL-CL is not weakly complete with respect to language models P( (+) ): this arrow is not provable in NL-CL, since NL-CL is a conservative extension of NL. The following, stronger proposition implies that L-CL is not weakly complete with respect to language models P( + ) and NL1-CL (resp. L1-CL) is not weakly complete with respect to language models P( ∗ ) (resp. P( (∗) )). Proposition 1 NL-CL (resp. L-CL) is not weakly complete with respect to powerset algebras of groupoids (resp. semigroups). Also NL1-CL (resp. L1-CL) is not weakly complete with respect to powerset algebras of unital groupoids (resp. monoids). Proof Let (G, ·) be a groupoid. In the powerset algebra, ∅/ X = ∅ for any nonempty X ⊆ G. We consider the arrow ⊥/( p\( p p)) ⇒ ⊥. Since p ⇒ p\( p p) is provable in NL, then μ( p\( p p)) = ∅ for any valuation μ in P(G). This holds, if μ( p) = ∅; if μ( p) = ∅, then μ( p\( p p)) = G = ∅. Therefore our arrow is valid in powerset algebras of groupoids. This arrow, however, is not provable in L1-CL (the strongest logic), since it is not valid in relation algebras P(W 2 ). Indeed, let
Lambek Calculus with Classical Logic
15
W = {a, b, c} (three different elements), μ( p) = {(a, b), (b, c)}. Then μ( p p) = {(a, c)}, μ( p\( p p)) = {(b, c)} ∪ {(a, x); x ∈ W } and μ(⊥/( p\( p p))) = {(x, c) : x ∈ W } = ∅. This proof also shows that NL with constants ⊥, and axioms (a⊥), (a ) is not weakly complete with respect to powerset algebras of groupoids, and similarly for L, NL1 and L1, extended in this way; also see [30]. Example 1 Here is another arrow valid in powerset algebras of groupoids but not provable in L1-CL: (¬q)/( p\( p p)) ⇒ ¬(q/( p\( p p))). By (2) and the law: x ≤ y − iff x ∧ y = ⊥, the arrow is valid in powerset algebras of groupoids (resp. provable in L1-CL) if and only if ⊥/( p\( p p)) ⇒ ⊥ is so. Example 2 Došen’s example does not work for NL1; (( p q)/r ) r ⇒ p r is not valid in language models P( (∗) ). Indeed, for μ(r ) = { }, if this arrow is true, then μ( p q) ⊆ μ( p), which need not be true. A good example is q (1/( p\ p)) ⇒ (1/( p\ p)) q. This arrow is valid in powerset algebras of unital groupoids. Indeed 1 ⇒ p\ p is provable in NL1, hence 1/( p\ p) ⇒ 1 is so. In these models, μ(1/( p\ p)) equals ∅ or {1}, hence it commutes with μ(q). This arrow, however, is not valid in relation algebras P(W 2 ). Define W , μ( p) as in the proof of Proposition 1 and μ(q) = {(a, b)}. Then μ( p\ p) = {(b, b), (c, c)} ∪ {(a, x) : x ∈ W }, μ(1/( p\ p)) = {(b, b), (c, c)}. So for μ the left-hand side of the arrow equals {(a, b)} and the right-hand side is empty. Therefore NL1 is not weakly complete with respect to powerset algebras of unital groupoids, hence with respect to language models P( (∗) ). Since relation algebras P(W 2 ) are models of L1, this example also shows that L1 is not weakly complete with respect to powerset algebras of monoids, hence with respect to language models P( ∗ ). We turn to frame models, characteristic of Kripke semantics for modal logics. A pair (W, R) such that R ⊆ W 3 is called a (ternary) relational frame. On P(W ) one defines operations , \, / (sometimes they are written R , \ R , / R ). X Y = {u ∈ W : R(u, v, w) for some v ∈ X, w ∈ Y } X \Y = {w ∈ W : X {w} ⊆ Y } X/Y = {v ∈ W : {v} Y ⊆ X } Accordingly: w ∈ X \Y iff for all u, v ∈ W , if R(u, v, w) and v ∈ X then u ∈ Y . Also: v ∈ X/Y iff for all u, w ∈ W , if R(u, v, w) and w ∈ Y then v ∈ X . We have: X Y ⊆ Z iff Y ⊆ X \Z iff X ⊆ Z /Y . Indeed, each condition is equivalent in first-order logic to the formula: ∀u,v,w (R(u, v, w) ∧ v ∈ X ∧ w ∈ Y ⇒ u ∈ Z ). Consequently, (P(W ), , \, /, ∪, ∩,− , ∅, W ) is a b.r. groupoid. One refers to this algebra as the complex algebra of the frame (W, R). A frame (W, R) is said to be associative, if (X Y ) Z = X (Y Z ) for all X, Y, Z ⊆ W . If a frame is associative, then its complex algebra is a b.r. semigroup. The associativity of (W, R) is equivalent to the following condition. ∀u,x,y,z∈W [∃v (R(v, x, y) ∧ R(u, v, z)) ⇔ ∃w (R(u, x, w) ∧ R(w, y, z))]
(4)
16
W. Buszkowski
The next proposition is equivalent to the strong completeness of Hilbert-style systems for NL-CL and L-CL with respect to frame models, proved in [21]. We, however, outline a different proof, using Theorem 1 and the following representation theorem for b.r. groupoids (semigroups). Theorem 2 ([7]) Every b.r. groupoid (resp. b.r. semigroup) is isomorphic to a subalgebra of the complex algebra of some (resp. associative) frame. Proposition 2 NL-CL (resp. L-CL) is strongly complete with respect to the complex algebras of (resp. associative) relational frames. Proof We prove the strong completeness of NL-CL. If ϕ ⇒ ψ is provable from in this logic, then entails ϕ ⇒ ψ in b.r. groupoids, hence in the complex algebras of frames. Assume that ϕ ⇒ ψ is not provable from . By Theorem 1, there exist a b.r. groupoid A and a valuation μ in A such that all arrows in are true for μ in A, but ϕ ⇒ ψ is not. By the representation theorem, there exists a frame (W, R) and a monomorphism h from A to the complex algebra of (W, R). Clearly h ◦ μ is a valuation in the latter algebra. All arrows in are true for h ◦ μ, but ϕ ⇒ ψ is not. Therefore does not entail ϕ ⇒ ψ in the complex algebras of frames. A similar argument works for L-CL. We recall some main steps of the proof of Theorem 2, since we need them later. Let A be a b.r. groupoid. One defines the canonical frame. W consists of all ultrafilters on the boolean algebra underlying A. For F, G, H ∈ W one defines: R(F, G, H ) iff G H ⊆ F. Here X Y = {a b : a ∈ X, b ∈ Y } for X, Y ⊆ A. The canonical embedding h : A → P(W ) is defined by h(a) = {F ∈ W : a ∈ F}. One shows that h is a monomorphism of A into the complex algebra of (W, R). The following lemma is crucial. Lemma 1 Let F1 , F2 be proper filters and let F be an ultrafilter in the boolean algebra such that F1 F2 ⊆ F. Then, there exist ultrafilters G 1 , G 2 such that F1 ⊆ G 1 , F2 ⊆ G 2 and G 1 G 2 ⊆ F. We prove h(a b) = h(a) R h(b). We show ⊆. Let F ∈ h(a b). Then a b ∈ F. Since F is an ultrafilter, then a b = ⊥, hence a = ⊥ and b = ⊥. One defines Fa = {x ∈ A : a ≤ x} and Fb similarly. Fa , Fb are proper filters and Fa Fb ⊆ F. There exist ultrafilters G 1 , G 2 as in the lemma. We have G 1 ∈ h(a), G 2 ∈ h(b) and R(F, G 1 , G 2 ). This yields F ∈ h(a) R h(b). We show ⊇. Let F ∈ h(a) R h(b). Then, R(F, G 1 , G 2 ), i.e., G 1 G 2 ⊆ F for some G 1 ∈ h(a), G 2 ∈ h(b). Since a b ∈ G 1 G 2 , then a b ∈ F, and consequently F ∈ h(a b). For other steps the reader is referred to [7]. If A is a b.r. semigroup, then in the complex algebra of the canonical frame is associative. We only prove (⇒) of (4). Let F1 , F2 , F3 , H ∈ W . Assume that R(G 1 , F1 , F2 ) and R(H, G 1 , F3 ), i.e., F1 F2 ⊆ G 1 and G 1 F3 ⊆ H , for some G 1 ∈ W . Since the powerset operation on P(A) is associative and preserves ⊆, then F1 F2 F3 ⊆ H . Define F = {x ∈ A : ∃ y,z (y ∈ F2 ∧ z ∈ F3 ∧ y z ≤ x)}. Clearly F is a proper filter, F2 F3 ⊆ F, and F1 F ⊆ H . By the lemma, there
Lambek Calculus with Classical Logic
17
exists G 2 ∈ W such that F ⊆ G 2 and F1 G 2 ⊆ H . This yields R(G 2 , F2 , F3 ) and R(H, F1 , G 2 ). This finishes the proof. Remark 5 DMANL and DMAL (also: with ⊥, ) are strongly complete with respect to the complex algebras of relational frames and associative frames, respectively. This can be proved like Proposition 2, using the following representation theorem [7]: every (also: bounded) d.l.o.r. groupoid (resp. semigroup) is isomorphic to a subalgebra of the complex algebra of some (resp. associative) relational frame. (Precisely, we mean the − −free reduct of the complex algebra.) The proof is similar to that of Theorem 2 except that ultrafilters are replaced with prime filters. As a consequence, NL-CL (resp. L-CL) is a strongly conservative extension of DMANL (resp. DMAL), also with bounds. Interestingly, this does not hold for logics with 1. Proposition 3 NL1-CL (resp. L1-CL) is a non-conservative extension of DMANL1 (resp. DMAL1), also with bounds. Proof In b.r. unital groupoids, if a ≤ 1, then a a = a. We show it. Assume a ≤ 1. Define b = a − ∧ 1. We have 1 = ∧ 1 = (a ∨ a − ) ∧ 1 = (a ∧ 1) ∨ b = a ∨ b. For x, y ≤ 1, x y ≤ x ∧ y (indeed, x y ≤ x 1 = x and x y ≤ 1 y = y). So a b = ⊥. This yields a = a 1 = a (a ∨ b) = (a a) ∨ ⊥ = a a. It follows that x ∧ 1 ≤ (x ∧ 1) (x ∧ 1) is valid in b.r. unital groupoids, and the corresponding arrow is provable in NL1-CL. This arrow, however, is not provable in DMAL1 (even with bounds). It suffices to observe that x ∧ 1 ≤ (x ∧ 1) (x ∧ 1) is not valid in bounded d.l.o.r. monoids, e.g., in MV-algebras,6 i.e., algebras of manyvalued logics of Łukasiewicz. Consider the closed interval [0, 1] ⊆ R (the standard model of Ł∞ ), where x ∧ y = min(x, y), x y = max(0, x + y − 1); the number 1 is both and the multiplicative unit. Then x ∧ 1 = x, but x ≤ x x is not true for x = 21 . In fact, in b.r. unital groupoids x y = x ∧ y for all x, y ≤ 1. As above, x y ≤ x ∧ y. Also x ∧ y = (x ∧ y) (x ∧ y) ≤ x y. At the end of this section, we consider some operations definable in b.r. groupoids. First, one defines the De Morgan dual of and its dual residual operations. a • b = (a − b− )− a\• b = (a − \b− )− a/• b = (a − /b− )− There hold dual residuation laws. (RES•) For all elements a, b, c, c ≤ a • b iff a\• c ≤ b iff c/• b ≤ a. In the complex algebra of (W, R): X • Y = {u ∈ W : ∀v,w (R(u, v, w) ⇒ v ∈ X ∨ w ∈ Y )} / X ∧ u ∈ Y )} X \• Y = {w ∈ W : ∃u,v (R(u, v, w) ∧ v ∈ 6 Every
MV-algebra is a bounded d.l.o.r. commutative monoid, where 1 = , 0 = ⊥.
(5)
18
W. Buszkowski
X/• Y = {v ∈ W : ∃u,w (R(u, v, w) ∧ w ∈ / Y ∧ u ∈ X )} Our definition of in the complex algebra of (W, R) takes the first element of the triple (u, v, w) as an element of X Y . The reason is to make this definition compatible with the standard definition of ♦X in analogous algebras for modal logics. Some authors, however, prefer the third element of (u, v, w) in this role. One can define 2 and 3 as follows. X 2 Y = {v ∈ W : ∃w,u (R(u, v, w) ∧ w ∈ X ∧ u ∈ Y )} X 3 Y = {w ∈ W : ∃u.v (R(u, v, w) ∧ u ∈ X ∧ v ∈ Y )} We also set 1 = . By \i , /i we denote the residual operations for i and by •i , \i• , /i• the corresponding dual operations. As observed in several papers, e.g., [26], the new operations are definable in terms of , \, / and − . X 2 Y = (Y − / X )− X \2 Y = (Y − X )− X/2 Y = X − \Y − X 3 Y = (Y \X − )− X \3 Y = X − /Y − X/3 Y = (Y X − )− The following equivalences: X ∩ (Y Z ) = ∅ iff Y ∩ (Z 2 X ) = ∅ iff Z ∩ (X 3 Y ) = ∅ show that 2 is the left and 3 the right conjugate of in the sense of Jónsson and Tarski; see [20]. Sedlár and Tedder [43] study DMANL enriched with 2 , 3 and their residuals; they provide complete (with respect to frames) axiom systems for some language restricted fragments, leaving the problem for the full logic open. By Proposition 2, NL-CL is a conservative extension of all complete logics of this kind.
3 H-Systems and Modal Logics In this section we present Hilbert-style systems (H-systems) for NL-CL, L-CL and their extensions. In these systems one derives provable formulas; see Remark 3. We treat them as classical modal logics with binary modalities , \, /. We also consider their extensions by new axioms, natural in the frameworks of modal and substructural logics. At the end, we show how the standard method of filtration can be adjusted for binary modalities.
Lambek Calculus with Classical Logic
19
3.1 Unary Modalities First, we recall a H-system for Kt (the minimal tense logic [8]), just to illuminate a close relationship between NL-CL and Kt . Kt is a classical modal logic with unary modalities , ↓ . Dual modalities ♦, ♦↓ are defined as follows: ♦ϕ = ¬¬ϕ, ♦↓ ϕ = ¬↓ ¬ϕ. In tense logics, one usually writes F, P, G, H for ♦, ♦↓ , , ↓ , respectively. The corresponding frames are of the form (W, R), where R ⊆ W 2 . A frame model is a triple M = (W, R, V ) such that (W, R) is a frame and V is a map from the set of propositional variables to P(W ). The truth predicate u |= M ϕ, where u ∈ W and ϕ is a formula, is defined as usually. (|= p) (|= ¬) (|= ∧) (|= ) (|= ↓ )
u |= M p iff u ∈ V ( p) u |= M ¬ϕ iff u |= M ϕ u |= M ϕ ∧ ψ iff u |= M ϕ and u |= M ψ u |= M ϕ iff v |= M ϕ for any v ∈ W such that R(u, v) u |= M ↓ ϕ iff v |= M ϕ for any v ∈ W such that R(v, u)
ϕ is valid in M, if w |= M ϕ for all w ∈ W , and in the frame (W, R), if it is valid in all models (W, R, V ). Kt can be presented as the following H-system [8]. The axioms are all tautologies of classical logic (in the modal language) and the modal axioms7 : (K) (ϕ → ψ) → (ϕ → ψ) (K↓ ) ↓ (ϕ → ψ) → (↓ ϕ → ↓ ψ) (A♦↓ ) ♦↓ ϕ → ϕ (A↓ ♦) ϕ → ↓ ♦ϕ Its inference rules are modus ponens and two necessitation rules. (MP)
ϕ ϕ→ψ ϕ ϕ (RN) (RN↓ ) ↓ . ψ ϕ ϕ
One derives the monotonicity rules for modalities.

(r-MON1) from ϕ → ψ infer □ϕ → □ψ and ♦ϕ → ♦ψ
(r-MON2) from ϕ → ψ infer □↓ϕ → □↓ψ and ♦↓ϕ → ♦↓ψ

The following residuation rule is derivable:

(r-RES1) from ♦ϕ → ψ infer ϕ → □↓ψ, and conversely

7 Precisely, in [8] ♦ is primitive and □ is defined. The additional axiom ♦ϕ ↔ ¬□¬ϕ is needed.
The top-down part of (r-RES1) is derived by (r-MON2), (A↓♦) and the bottom-up part by (r-MON1), (A♦↓). If ♦ is admitted as primitive, then (r-RES1) can replace the four modal axioms, written above, and (RN), (RN↓); see the next subsection, where an analogous claim is proved for logics with binary modalities. (With □ primitive, the additional axiom □ϕ ⇔ ¬♦¬ϕ is needed.) The modal axiom scheme:

(B) ϕ → □♦ϕ

is valid in (W, R) if and only if R is symmetrical: R(u, v) implies R(v, u).8 In models M, based on symmetric frames, u |=M □ϕ iff u |=M □↓ϕ. We prove a syntactic counterpart of this equivalence.

Proposition 4 In Kt (B) (as a scheme) is deductively equivalent to □ϕ ↔ □↓ϕ.

Proof The second scheme yields (B), by (A↓♦). For the converse, (B) yields ♦□ϕ → ϕ, hence □ϕ → □↓ϕ, by (r-RES1). Also, □↓ϕ → □♦□↓ϕ is an instance of (B), and □♦□↓ϕ → □ϕ, by (A♦↓) and (r-MON1). This yields □↓ϕ → □ϕ.

Consequently, in Kt with (B) (and its extensions) □ and □↓ collapse, and one can remove □↓ from the language. One omits all axioms and rules for □↓.
3.2 Binary Modalities

Now we turn to binary modalities. They are precisely ⊗, \, / of L, added to the standard language of classical propositional logic. So one can define other modalities, e.g., •, \•, /•, •i, etc., as in Sect. 2. A (ternary) frame has been defined there. A model is a triple M = (W, R, V) such that (W, R) is a frame and V maps the set of variables into P(W). The truth definition is standard for variables and classical connectives. One defines:

(|= ⊗) u |=M ϕ ⊗ ψ iff for some v, w ∈ W, R(u, v, w), v |=M ϕ and w |=M ψ
(|= \) w |=M ϕ\ψ iff for all u, v ∈ W, if R(u, v, w) and v |=M ϕ then u |=M ψ
(|= /) v |=M ϕ/ψ iff for all u, w ∈ W, if R(u, v, w) and w |=M ψ then u |=M ϕ

The notions of validity in a model and in the frame are defined as in Sect. 3.1. We also define entailment in models on a class of ternary frames F. A set of formulas Γ
entails a formula ϕ in models on F, if ϕ is valid in every model M = (W, R, V) such that (W, R) ∈ F and all formulas from Γ are valid in M. We write Γ |=F ϕ for this entailment relation,9 and similarly Γ |=A ϕ for the entailment relation in a class of algebras A (see Sect. 2). For a model M = (W, R, V), one defines μM(ϕ) = {u ∈ W : u |=M ϕ}. It is easy to show that μM is a valuation in the complex algebra of (W, R); furthermore,

8 A scheme is valid in (W, R), if all its instances are so.
9 This relation should not be confused with the stronger relation: Γ entails ϕ in F, if ϕ is valid in every frame from F such that all formulas from Γ are valid in this frame.
every valuation in this complex algebra equals μ M for some model M = (W, R, V ). Clearly ϕ is valid in M if and only if μ M (ϕ) = W (W = in this algebra). Let F , C, and A denote now the class of ternary frames, the class of their complex algebras, and the class of b.r. groupoids, respectively. The following equivalences are true for any and ϕ. (6)
|=F ϕ iff |=C ϕ iff |=A ϕ The second equivalence follows from Theorem 2. Equation (6) also hold for the associative case: F is the class of associative frames, C of their complex algebras, and A of b.r. semigroups. Since in b.r. groupoids distributes over ∨ and satisfies ⊥ a = ⊥ = a ⊥, then it can be treated as a binary normal possibility operator. In a ternary frame, R can be interpreted as an accessibility relation in the following sense: R(u, v, w) means that from the world (state) u one can access a pair of worlds (states) (v, w) in one step. There are many natural examples of ternary frames. Every groupoid (G, ·) determines the frame (G, R), where: R(u, v, w) iff u = v · w, for u, v, w ∈ G. The complex algebra of this frame coincides with the powerset algebra P(G). Given a set W with a partial function f from W 2 to W (the domain of f is contained in W 2 ), one obtains the frame (W, R f ), where: R f (u, v, w) iff f (v, w) is defined and equals u. Every relation algebra P(W 2 ) coincides with the complex algebra of the frame (W 2 , R f ), where f is the composition of pairs: f ((x, y), (z, u)) is defined iff y = z; f ((x, y), (y, u)) = (x, u). Example 3 Another example employs formal logics. We consider a propositional logic L such that all rules have two premises. Its formulas will be denoted by α, β, γ . We define a frame (W, R) such that W is the set of formulas of this logic and R is defined as follows: R(α, β, γ ) iff α can be derived from β and γ by one application of some rule. Then, modal formulas of NL-CL can encode proof schemes in L. Let L be the logic of positive implication with (MP) as the only rule and the following axioms. (A1) α → (β → α) (A2) [α → (β → γ )] → [(α → β) → (α → γ )] We write a proof of α → α. 1 2 3 4 5
1. [α → ((β → α) → α)] → [(α → (β → α)) → (α → α)]   (A2)
2. α → ((β → α) → α)   (A1)
3. (α → (β → α)) → (α → α)   (MP 1, 2)
4. α → (β → α)   (A1)
5. α → α   (MP 3, 4)
We consider a model (W, R, V ), where W, R are as above and V ( p) (resp. V (q)) is the set of all axioms (A1) (resp. (A2)). Then, α → α |= M (q p) p and every formula γ such that γ |= M (q p) p is of this form for some α. If, additionally, V (r ) is the set of provable formulas, then r ↔ p ∨ q ∨ (r r ) is valid in M.
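Example 3 can be replayed mechanically on the five formulas of the displayed proof. The sketch below is our own encoding (the restriction to these five worlds is ours): R holds when a formula follows from two others by one application of (MP), the two premises being treated as an unordered pair, and we check that α → α satisfies (q ⊗ p) ⊗ p.

```python
def imp(a, b):            # the formula a -> b
    return ('->', a, b)

a, b = 'a', 'b'
f2 = imp(a, imp(imp(b, a), a))          # instance of (A1)
f4 = imp(a, imp(b, a))                  # instance of (A1)
f5 = imp(a, a)
f3 = imp(f4, f5)                        # (a -> (b -> a)) -> (a -> a)
f1 = imp(f2, f3)                        # instance of (A2)

W = [f1, f2, f3, f4, f5]
# R(x, y, z): x is obtained from y and z by one application of (MP);
# the two premises are treated as a set, so both orders are included.
R = {(x, y, z) for x in W for y in W for z in W
     if z == imp(y, x) or y == imp(z, x)}

V = {'p': {f2, f4},                     # instances of (A1)
     'q': {f1}}                         # instances of (A2)

def sat(u, phi):
    """Truth in (W, R, V); formulas are ('var', s) or ('prod', A, B)."""
    if phi[0] == 'var':
        return u in V[phi[1]]
    if phi[0] == 'prod':                # the binary modality (x)
        _, A, B = phi
        return any(sat(v, A) and sat(w, B) for (x, v, w) in R if x == u)

goal = ('prod', ('prod', ('var', 'q'), ('var', 'p')), ('var', 'p'))
print(sat(f5, goal))                    # True: a -> a satisfies (q (x) p) (x) p
```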
To obtain a more precise description of proof schemes of L with rules r1 , . . . , rn , the relation R from Example 3 could be replaced by relations R1 , . . . , Rn , each corresponding to one rule. The resulting modal logic admits several products 1 , . . . , n and the related residual operators. If one-premise rules r j appeared in L, it would be reasonable to represent them by binary relations R j ⊆ W 2 (as in Sect. 3.1), corresponding to unary operators ♦ j . Such multi-modal logics can be useful in applications. We, however, discuss logics with one product. Most results can easily be generalized for the case of many products and unary modalities, at least if no special connections between them are assumed. Now we discuss H-systems. The system PNL of Kaminski and Francez [21] is formulated in the language of classical propositional logic enriched by , \, /. Its axioms are all tautologies of classical logic (in the extended language). Its rules are (MP) and (r1), (r2) from Sect. 1 with ⇒ replaced by →. PL also admits the associative law for : (ϕ ψ) χ ↔ ϕ (ψ χ ) (7) as an axiom. These systems are strongly complete with respect to the corresponding classes of ternary frames. Theorem 3 ([21]) For any set of formulas and any formula ϕ, ϕ is provable from
in PNL (resp. PL) if and only if entails ϕ in models on (resp. associative) ternary frames. Proof The present proof is different from that in [21]. Like in the proof of Theorem 1, one shows that PNL (resp. PL) is strongly complete with respect to b.r. groupoids (resp. b.r. semigroups): ϕ is provable from in the system if and only if
Γ |=A ϕ, where A is the class of b.r. groupoids (resp. b.r. semigroups). Then, one applies (6). By Proposition 2, PNL (resp. PL) is simply a H-system for NL-CL (resp. L-CL). Both systems yield the same provable formulas (see Remark 3) and the same consequence relation. Clearly ⊥ and ⊤ can be defined in these H-systems: ⊥ = p ∧ ¬p, ⊤ = p ∨ ¬p, for some fixed p. We use the latter acronyms in what follows.

Another H-system for NL-CL is similar to Kt. It is convenient to take • instead of ⊗ as a primitive modal operator. Clearly ⊗ becomes definable: ϕ ⊗ ψ = ¬(¬ϕ • ¬ψ). Rules (r1), (r2) can be replaced by the following axioms and rules.

(K1) (ϕ → ψ) • χ → (ϕ • χ → ψ • χ)
(K2) ϕ • (ψ → χ) → (ϕ • ψ → ϕ • χ)
(K\) ϕ\(ψ → χ) → (ϕ\ψ → ϕ\χ)
(K/) (ψ → χ)/ϕ → (ψ/ϕ → χ/ϕ)
(A1\) ϕ ⊗ (ϕ\ψ) → ψ
(A1/) (ϕ/ψ) ⊗ ψ → ϕ
(A2\) ψ → ϕ\(ϕ ⊗ ψ)
(A2/) ϕ → (ϕ ⊗ ψ)/ψ
(RN1) from ϕ infer ψ • ϕ
(RN2) from ϕ infer ϕ • ψ
(RN\) from ϕ infer ψ\ϕ
(RN/) from ϕ infer ϕ/ψ
There is a clear analogy between (K) and (K1), (K2), between (K↓ ) and (K\), (K/), between (A♦↓ ) and (A1\), (A1/), between (A↓ ♦) and (A2\), (A2/), between (RN) and (RN1), (RN2), and between (RN↓ ) and (RN\), (RN/). The following monotonicity rules are easily derivable in both axiomatizations. (MON•) from ϕ → ψ infer χ • ϕ → χ • ψ and ϕ • χ → ψ • χ (MON) from ϕ → ψ infer χ ϕ → χ ψ and ϕ χ → ψ χ (MON\) from ϕ → ψ infer χ \ϕ → χ \ψ and ψ\χ → ϕ\χ (MON/) from ϕ → ψ infer ϕ/χ → ψ/χ and χ /ψ → χ /ϕ Both H-systems for NL-CL are equivalent (the provability relation is the same). We outline a proof. S1 stands for PNL from [21] and S2 for the system similar to Kt . First, (r1), (r2) are derivable in S2 . We derive (r1). Assume ϕ ψ → χ . By (RN\), we get ϕ\(ϕ ψ → χ ), hence ϕ\(ϕ ψ) → ϕ\χ , by (K\) and (MP). This yields ψ → ϕ\χ , by (A2\) and classical logic. Assume ψ → ϕ\χ . We get ϕ ψ → ϕ (ϕ\χ ), by (MON), which yields ϕ ψ → χ , by (A1\) and classical logic. Second, axioms (K1)–(A2/) are provable and rules (RN1)–(RN/) are derivable in S1 . It is easy to prove (A1\), (A1/), (A2\) and (A2/). Using (r1), (r2), one proves the distributive law. ϕ (ψ ∨ χ ) ↔ (ϕ ψ) ∨ (ϕ χ ) (ψ ∨ χ ) ϕ ↔ (ψ ϕ) ∨ (χ ϕ)
(8)
In S1 • is defined: ϕ • ψ = ¬(¬ϕ ⊗ ¬ψ). Using (8), monotonicity rules and classical logic, one proves the following law.

ϕ • (ψ ∧ χ) ↔ (ϕ • ψ) ∧ (ϕ • χ)   (ψ ∧ χ) • ϕ ↔ (ψ • ϕ) ∧ (χ • ϕ)
(9)
From (ϕ → ψ) ∧ ϕ → ψ one obtains (K1), using (MON•), (9) and classical logic. (K2) is obtained similarly. (K\) and (K/) are obtained in a similar way, using the same classical tautology, (MON\), (MON/) and the following laws analogous to (2), easily provable in S1 . ϕ\(ψ ∧ χ ) ↔ (ϕ\ψ) ∧ (ϕ\χ ) (ψ ∧ χ )/ϕ ↔ (ψ/ϕ) ∧ (χ /ϕ)
(10)
We derive (RN1). From (¬ψ) ⊗ ⊥ ↔ ⊥ one obtains ψ • ⊤ ↔ ⊤. Assume ϕ. Then ⊤ → ϕ by classical logic. Hence ψ • ⊤ → ψ • ϕ by (MON•), which yields ⊤ → ψ • ϕ by classical logic, and consequently, ψ • ϕ by (MP). The derivation of (RN2) is similar. We derive (RN\). Assume ϕ. Then ψ ⊗ ⊤ → ϕ by classical logic. Hence ⊤ → ψ\ϕ by (r1), which yields ψ\ϕ by (MP). The derivation of (RN/) is similar.
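As a semantic sanity check on the second axiomatization, the sketch below (our own code, not part of the chapter) evaluates (A1\), (A2\) and (K1) in the complex algebra of a randomly generated ternary frame; by soundness with respect to b.r. groupoids they should denote the whole universe W under every valuation.

```python
from itertools import product
import random

random.seed(1)
W = frozenset(range(3))
R = {t for t in product(W, repeat=3) if random.random() < 0.4}

def prod(X, Y):        # X (x) Y in the complex algebra
    return frozenset(u for (u, v, w) in R if v in X and w in Y)

def under(X, Y):       # X \ Y
    return frozenset(w for w in W
                     if all(u in Y for (u, v, w2) in R if w2 == w and v in X))

def neg(X): return W - X
def impl(X, Y): return neg(X) | Y                      # classical implication
def bullet(X, Y): return neg(prod(neg(X), neg(Y)))     # the dual product

def powerset(S):
    S = list(S)
    for m in range(1 << len(S)):
        yield frozenset(x for i, x in enumerate(S) if m >> i & 1)

for A, B, C in product(powerset(W), repeat=3):
    a1_under = impl(prod(A, under(A, B)), B)                 # (A1\)
    a2_under = impl(B, under(A, prod(A, B)))                 # (A2\)
    k1 = impl(bullet(impl(A, B), C),
              impl(bullet(A, C), bullet(B, C)))              # (K1)
    assert a1_under == W and a2_under == W and k1 == W
print("axioms valid in this complex algebra")
```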
S2 enriched by the associative law for • is a H-system for L-CL. Clearly this law for • implies this law for ⊗, and conversely.
3.3 Other Modal Axioms At first we consider some analogues of the symmetry axiom (B). For R ⊆ W 3 , the symmetry property of a binary relation has different counterparts. We list three. (WS) for all u, v, w ∈ W , if R(u, v, w) then R(u, w, v) (weak symmetry) (Cy) for all u, v, w ∈ W , if R(u, v, w) then R(w, u, v) (cyclicity) (FS) for all u 1 , u 2 , u 3 ∈ W , if R(u 1 , u 2 , u 3 ) then R(u i1 , u i2 , u i3 ) for any permutation (i 1 , i 2 , i 3 ) of (1, 2, 3) (full symmetry) Clearly (FS) is equivalent to the conjunction of (WS) and (Cy). (WS) corresponds to the commutative law. (COM) ϕ ψ ↔ ψ ϕ Precisely, (COM) is valid in the frame (W, R) if and only if R satisfies (WS). Like in algebras, the scheme (COM) is deductively equivalent to the scheme ϕ\ψ ↔ ψ/ϕ. Modal logics admitting (COM) are said to be commutative. In commutative logics \, / collapse in one operator, which we denote by → (similar to ↓ ), just to distinguish it from →. One omits all axioms and rules for / and writes → for \ in the remaining ones. Like Theorems 1 and 3, one proves that Commutative NL-CL (resp. L-CL) is strongly complete with respect to commutative b.r. groupoids (resp. semigroups) and models on (resp. associative) ternary frames, satisfying (WS). Returning to Example 3, let us note that the relation R, defined there, satisfies (WS), if the premises of a rule are treated as a set (their order is inessential). The frames satisfying (Cy) are said to be cyclic. We look at the corresponding logics closer, since they can be regarded as classical counterparts of cyclic linear logics. In particular, Cyclic MALL of Yetter [46] can be presented as MAL1 with 0, ⊥ (see Sect. 1), admitting the cyclic axiom: ϕ\0 ⇔ 0/ϕ. So two substructural negations collapse in one, written ∼ . Cyclic MALL also assumes the double negation law: ϕ ∼∼ ⇔ ϕ; the contraposition rule: from ϕ ⇒ ψ infer ψ ∼ ⇒ ϕ ∼ is derivable. So the resulting negation is a De Morgan negation. One obtains the following contraposition law10 : (CONT) ϕ ∼ /ψ ⇔ ϕ\ψ ∼ . ϕ ∼ /ψ ⇔ (ϕ\0)/ψ ⇔ ϕ\(0/ψ) ⇔ ϕ\ψ ∼ The second ⇔ follows from the associativity of . One can consider weaker logics of this kind, e.g., without constants 1, 0 (also in algebras) and/or with nonassociative product. In them, ∼ is a primitive connective; the double negation law and (CONT) are admitted as axioms and the contraposition rule is assumed. In this way, from 10 (CONT)
is equivalent to other laws of this kind, e.g., ϕ\ψ ⇔ ϕ ∼ /ψ ∼ .
MANL one obtains the nonassociative version of Cyclic MALL without multiplicative constants [13, 19]. Let us refer to this logic as Cyclic MANL. Remark 6 In the literature on linear logics and Lambek calculi, the extensions admitting the double negation law are often referred to as ‘classical’, like in [13, 19]. This usage of ‘classical’ seems misleading: this law holds in genuine nonclassical logics, e.g., many-valued logics and relevance logics. As in the literature on substructural logics, the term ‘cyclic’ is preferred here. By Cyclic NL-CL we mean NL-CL enriched with the following axiom scheme. (CONT¬) ¬ϕ/ψ ↔ ϕ\¬ψ In its arrow version ↔ is replaced by ⇔. Clearly (CONT¬) is valid in (W, R) if and only if in the complex algebra of (W, R), for all X, Y ⊆ W X 2 Y = X 3 Y ; the latter condition is equivalent to (Cy) for R. Cyclic NL-CL is an extension of Cyclic MANL, if one translates the latter’s ϕ ∼ as ¬ϕ. Therefore classical negation behaves in the former like cyclic negation in linear logics. In particular, with \, / it fulfils contraposition laws. We prove the strong completeness of Cyclic NL-CL with respect to cyclic ternary frames. Proposition 5 For any set of formulas and any formula ϕ, ϕ is provable from in Cyclic NL-CL if and only if entails ϕ in models on cyclic ternary frames. Proof A b.r. groupoid A is said to be cyclic, if a − /b = a\b− for all a, b ∈ A. Like Theorem 1, one proves that Cyclic NL-CL is strongly complete with respect to cyclic b.r. groupoids. Since (CONT¬) is valid in cyclic ternary frames, then the complex algebras of these frames are cyclic b.r. groupoids. Consequently, Cyclic NL-CL is sound with respect to the complex algebras of cyclic frames. For completeness, like Theorem 2 one shows that every cyclic b.r. groupoid is isomorphic to a subalgebra of the complex algebra of some cyclic frame. It suffices to observe that, if A is a cyclic b.r. groupoid, then the canonical frame (W, R) is cyclic. Indeed, assume R(F, G, H ), i.e., G H ⊆ F. We show R(H, F, G), i.e., F G ⊆ H . Suppose F G ⊆ H . There exists a ∈ F, b ∈ G such that a b ∈ / H. Then (a b)− ∈ H . We have (a b)− = b\a − . Indeed, c ≤ b\a − iff b c ≤ a − iff b ≤ a − /c iff b ≤ a\c− iff a b ≤ c− iff c ≤ (a b)− , for any c ∈ A. Therefore b (a b)− ≤ a − , by (R1). Since b (a b)− ∈ F by the assumption, we get a − ∈ F, which contradicts a ∈ F. One defines a 2 b = (b− /a)− and a 3 b = (b\a − )− in any b.r. groupoid. Then a b = a 2 b = a 3 b is valid in cyclic b.r. groupoids. The proof of Proposition 5 implicitly uses a b = a 3 b. Also a • b = b/a − = b− \a is valid these algebras. The corresponding logical equivalences are analogous to the scheme ϕ ↔ ↓ ϕ for unary modalities. In linear logics, one defines the operation par (a De Morgan dual of ). Its classical counterpart, here denoted by • , satisfies in algebras a • b = b • a. One obtains a • b = a/b− = a − \b, which yields a/b = a • b− , a\b = a − • b (definitions of /, \ in terms of par and negation in algebras of cyclic linear logics).
Analogues of Proposition 5 can be proved for Cyclic L-CL, i.e., L-CL with (CONT¬), and versions with multiplicative constants and (COM). Cyclic L1-CL is an extension of Cyclic MALL. A closer examination of these logics and their applications must be deferred to another paper. In cyclic commutative logics, corresponding to frames satisfying (FS), (CONT¬) takes the following form. (ϕ → ¬ψ) ↔ (ψ → ¬ϕ) For logics with 1, the corresponding frames are of the form (W, R, E), where (W, R) is as above and E ⊆ W satisfies: ∃e∈E R(u, e, w) ⇔ u = w ∃e∈E R(u, v, e) ⇔ u = v for all u, v, w ∈ W . This yields E X = X = X E, for any X ⊆ W , in the complex algebra of (W, R). Accordingly, this algebra is a b.r. unital groupoid. All results of this section hold for the logics, discussed here, enriched with 1, but we omit all details. The modal axiom scheme (T) ϕ → ♦ϕ is analogous to the following scheme of contraction laws in substructural logics. (CON) ϕ → ϕ ϕ (T) is valid in (W, R), R ⊆ W 2 , if and only if R is reflexive. Similarly, (CON) is valid in (W, R), R ⊆ W 3 , if and only if R(u, u, u) holds for any u ∈ W ; we say that this frame is reflexive. Like Theorem 3 one proves the strong completeness of NL-CL (resp. L-CL) with (CON) with respect to models on (resp. associative) reflexive ternary frames. The algebraic condition corresponding to (CON) is: a ≤ a a for any element a (one says that is square-increasing). In the proof the following observation is essential: if in A is square increasing, then in the canonical frame R(F, F, F) holds for any ultrafilter F. We show it. Assume that in A is self-increasing. Then a ∧ b ≤ (a ∧ b) (a ∧ b) ≤ a b. Hence, for all a, b ∈ F, a b ∈ F, which yields F F ⊆ F, i.e., R(F, F, F). Let us note that the stronger schemes ϕ → ϕ ψ, ψ → ϕ ψ lead to the inconsistent logic. Fix a provable formula ϕ0 . In the first scheme replace ϕ with ϕ0 and ψ with ϕ0 \ψ; then use (A1\). This yields ϕ0 → ψ, for any ψ, and consequently, every formula ψ is provable. The converse schemes: (LWE) ϕ ψ → ϕ ϕ ψ → ψ express the algebraic conditions: a b ≤ a, a b ≤ b, for all elements a, b (one says that is decreasing). These conditions correspond to the left-weakening rules in sequent systems for substructural logics. Their analogue for ♦ is ♦ϕ → ϕ. This scheme is valid in (W, R), R ⊆ W 2 , if and only if, for any u, v ∈ W , R(u, v) implies u = v, i.e., R ⊆ IdW . The resulting
logic is not interesting as a modal logic: one proves ♦ϕ ↔ ϕ ∧ ♦⊤. The situation is similar for (LWE). These schemes are valid in (W, R), R ⊆ W³, if and only if, for any u, v, w ∈ W, R(u, v, w) implies u = v = w. For such models M, one obtains the following truth condition.

u |=M ϕ ⊗ ψ iff u ∈ U, u |=M ϕ and u |=M ψ, where U = {u : R(u, u, u)}

Therefore the following scheme is valid.

ϕ ⊗ ψ ↔ ϕ ∧ ψ ∧ (⊤ ⊗ ⊤)
(11)
In fact, in NL-CL (11) is deductively equivalent to (LWE). We prove the algebraic version of this equivalence.

Proposition 6 For any b.r. groupoid A, the following conditions are equivalent: (i) ⊗ is decreasing, (ii) a ⊗ b = a ∧ b ∧ (⊤ ⊗ ⊤) for all a, b ∈ A.

Proof Clearly (i) follows from (ii). We prove the converse. Assume (i). Then a ⊗ b ≤ a ∧ b. We obtain:

a ∧ b ∧ (⊤ ⊗ ⊤) = a ∧ b ∧ ((a ∨ a−) ⊗ (b ∨ b−))
= (a ∧ b ∧ a ⊗ b) ∨ (a ∧ b ∧ a ⊗ b−) ∨ (a ∧ b ∧ a− ⊗ b) ∨ (a ∧ b ∧ a− ⊗ b−)
= a ⊗ b ∨ ⊥ ∨ ⊥ ∨ ⊥ = a ⊗ b

This yields (ii).
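Proposition 6 is easy to test in the complex algebra of a frame with R ⊆ {(u, u, u) : u ∈ W}, where ⊗ is decreasing by construction; a minimal sketch (our own naming) follows.

```python
from itertools import product

W = frozenset(range(4))
U = frozenset({1, 3})                        # the reflexive points
R = {(u, u, u) for u in U}                   # R contained in the diagonal

def prod(X, Y):                              # X (x) Y in the complex algebra
    return frozenset(u for (u, v, w) in R if v in X and w in Y)

def powerset(S):
    S = list(S)
    for m in range(1 << len(S)):
        yield frozenset(x for i, x in enumerate(S) if m >> i & 1)

top_top = prod(W, W)
assert top_top == U                          # T (x) T is the set U of reflexive points
for X, Y in product(powerset(W), repeat=2):
    assert prod(X, Y) <= X and prod(X, Y) <= Y          # (x) is decreasing
    assert prod(X, Y) == X & Y & top_top                # Proposition 6(ii)
print("Proposition 6 verified on this frame")
```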
The resulting logic amounts to classical logic with a new variable pU and definitions:

ϕ ⊗ ψ = ϕ ∧ ψ ∧ pU   ϕ\ψ = ϕ ∧ pU → ψ   ϕ/ψ = ψ ∧ pU → ϕ

Then ⊤ ⊗ ⊤ ↔ pU is provable. In b.r. unital groupoids, ⊤ ⊗ ⊤ = ⊤; hence, if ⊗ is decreasing, then a ⊗ b = a ∧ b and a\b = b/a = a− ∨ b for all elements a, b. Accordingly NL1-CL with (LWE) amounts to classical logic. Clearly L-CL with (LWE) equals NL-CL with (LWE), since the latter's product is associative (and commutative).

Remark 7 Lambek calculi and linear logics are often interpreted as logics of actions (programs); see e.g. [6, 18]. For sentences of natural language, expressing an action, ⊗ can be interpreted as the conjunction (superposition) of actions, which need not be commutative. 'Susan met John and John bought flowers' is not synonymous to 'John bought flowers and Susan met John'. One might claim that the truth of either sentence implies the truth of both 'Susan met John' and 'John bought flowers'. It is tempting to employ L-CL with the axiom-scheme ϕ ⊗ ψ → ϕ ∧ ψ, equivalent to (LWE), for logical analysis of such sentences. The preceding paragraph, however, shows
that this logic is too strong. Its product almost coincides with classical conjunction; with 1 it even coincides. Therefore a weaker logic must be employed, e.g., DMAL or DMAL1, either with (LWE). Another option is L-CL or L1-CL with (LWE) replaced by the rule: from ϕ ⊗ ψ infer ϕ ∧ ψ. The modal axiom (4): ♦♦ϕ → ♦ϕ, valid in transitive binary frames, can be adapted for binary modalities in several ways. We leave an analysis of these options to further research.
3.4 Filtration

Kaminski and Francez [21] prove the strong finite model property of PNL with respect to models on ternary frames: if ϕ is not provable from a finite set Γ in PNL, then Γ does not entail ϕ in models on finite ternary frames. By Theorem 2, this result is equivalent to the strong finite model property of NL-CL with respect to b.r. groupoids, established in [12, 14]. Nonetheless, the direct proof in [21] is interesting: it uses a filtration of a frame model whose worlds are certain sets of formulas. This filtration, however, is defined in a nonstandard way: worlds are subsets of a finite set of formulas. Here we briefly explain how to adapt the standard method of filtration (as in [8] for unary modalities) for logics with ⊗, \, /.

Let Σ be a set of formulas of NL-CL. Σ is said to be suitable, if it is closed under subformulas and satisfies the conditions:

(\) if ϕ\ψ ∈ Σ then ϕ ⊗ (ϕ\ψ) ∈ Σ,
(/) if ϕ/ψ ∈ Σ then (ϕ/ψ) ⊗ ψ ∈ Σ.

Every finite set Σ0 can be extended to a finite suitable set Σ. First, add to Σ0 all subformulas of formulas from Σ0. Second, add to the obtained set new formulas, according to (\) and (/) (these steps do not iterate). The resulting set is suitable.

Let M = (W, R, V) be a model, where R ⊆ W³. Let Σ be a set of formulas. We define an equivalence relation ∼Σ ⊆ W².

u ∼Σ v iff for any ϕ ∈ Σ, u |=M ϕ ⇔ v |=M ϕ   (12)

By [u]Σ we denote the equivalence class of ∼Σ containing u; the subscript Σ is often omitted. We define WΣ = {[u] : u ∈ W}. A filtration of M through Σ is defined as a model Mf = (WΣ, Rf, Vf) such that Vf(p) = {[u] : u ∈ V(p)} and Rf ⊆ (WΣ)³ satisfies the following conditions for all u, v, w ∈ W:

(f1) if R(u, v, w), then Rf([u], [v], [w]),
(f2) if Rf([u], [v], [w]), ϕ ⊗ ψ ∈ Σ, v |=M ϕ and w |=M ψ, then u |=M ϕ ⊗ ψ.

The lemma below explains the role of (\) and (/).
Lemma 2 Let satisfy (\) (resp. (/)), and let R f ⊆ (W )3 satisfy (f2). Thus, for all ϕ, ψ, if ϕ\ψ ∈ (resp. ϕ/ψ ∈ ), R f ([u], [v], [w]), v |= M ϕ (resp. w |= M ψ), and w |= M ϕ\ψ (resp. v |= M ϕ/ψ), then u |= M ψ (resp. u |= M ϕ). Proof We prove the first part only. Assume that R f ([u], [v], [w]), ϕ\ψ ∈ , v |= M ϕ and w |= M ϕ\ψ. Since ϕ (ϕ\ψ) ∈ , then u |= ϕ (ϕ\ψ), by (f2). Since (A1\) is valid in M, then u |= M ψ. Lemma 3 (Filtration Lemma) Let be a suitable set of formulas. Let M f be a filtration of M = (W, R, V ) through . Then, for all χ ∈ and u ∈ W , u |= M χ if and only if [u] |= M f χ . Proof Induction on χ ∈ . Let χ = p. If u |= M p, then u ∈ V ( p), which yields [u] ∈ V f ( p), hence [u] |= M f p. Assume [u] |= M f p. Then [u] ∈ V f ( p), hence u ∼ v for some v ∈ V ( p). Since v |= M p, then u |= M p. The arguments for classical connectives are routine. We consider the cases χ = ϕ ψ and χ = ϕ\ψ. Assume u |= M ϕ ψ. There exist v, w such that R(u, v, w), v |= M ϕ and w |= M ψ. We have R f ([u], [v], [w]), by (f1), and [v] |= M f ϕ, [w] |= M f ψ, by the induction hypothesis. Consequently [u] |= M f ϕ ψ. Assume [u] |= M f ϕ ψ. There exist v, w such that R f ([u], [v], [w]), [v] |= M f ϕ and [w] |= M f ψ. So v |= M ϕ, w |= M ψ, by the induction hypothesis. By (f2), u |= M ϕ ψ. Assume w |= M ϕ\ψ. Let R f ([u], [v], [w]) and [v] |= M f ϕ. Then v |= M ϕ, by the induction hypothesis, hence u |= M ψ, by Lemma 2. This yields [u] |= M f ψ. Consequently [w] |= M f ϕ\ψ. Assume [w] |= M f ϕ\ψ. Let R(u, v, w) and v |= M ϕ. Then [v] |= M f ϕ, by the induction hypothesis. Since R f ([u], [v], [w]) by (f1), then [u] |= M f ψ. Consequently u |= M ψ, by the induction hypothesis. We have shown w |= M ϕ\ψ. For a set closed under subformulas, we define the smallest and the largest filtration of M through : (sf) R s ([u], [v], [w]) iff there exist u ∈ [u], v ∈ [v], w ∈ [w] such that R(u , v , w ), (lf) R l ([u], [v], [w]) iff for all ϕ, ψ, if ϕ ψ ∈ , v |= M ϕ and w |= M ψ, then u |= M ϕ ⊗ ψ. It is easy to verify that R s and R l satisfy (f1), (f2) and, for any R f satisfying (f1), (f2), R s ⊆ R f ⊆ R l . Our proof of the following theorem uses filtration in the sense, defined above. Theorem 4 ([21]) NL-CL possesses the strong finite model property with respect to models on ternary frames. Proof let be a finite set of formulas, and let ϕ be a formula not provable from in NL-CL. We set 0 = ∪ {ϕ}. We extend 0 to a finite suitable set . By Theorem 3, there exists a model M = (W, R, V ) such that all formulas from are valid but ϕ is not valid in M. We construct a filtration M f of M through . One can put R f = R s or R l = R l (both work). By Lemma 3 all formulas from are valid but ϕ is not valid in M f . Since W is finite, then does not entail ϕ in models on finite frames.
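The following sketch (our own function and variable names) implements the equivalence classes of (12) and the smallest filtration (sf) for the binary modality, and checks condition (f2) by brute force on a toy model; only the connectives needed here are interpreted.

```python
def sat(M, u, phi):
    """Truth at world u in M = (W, R, V); formulas are ('var', p), ('not', A),
    ('and', A, B) or ('prod', A, B) for the binary modality."""
    W, R, V = M
    tag = phi[0]
    if tag == 'var': return u in V[phi[1]]
    if tag == 'not': return not sat(M, u, phi[1])
    if tag == 'and': return sat(M, u, phi[1]) and sat(M, u, phi[2])
    if tag == 'prod':
        return any(sat(M, v, phi[1]) and sat(M, w, phi[2])
                   for (x, v, w) in R if x == u)

def smallest_filtration(M, Sigma):
    """Classes of the relation (12) and the smallest filtration (sf)."""
    W, R, V = M
    def cls(u):
        return frozenset(v for v in W
                         if all(sat(M, u, f) == sat(M, v, f) for f in Sigma))
    Wf = {cls(u) for u in W}
    Rf = {(cls(u), cls(v), cls(w)) for (u, v, w) in R}        # (sf)
    Vf = {p: {cls(u) for u in V[p]} for p in V}
    return Wf, Rf, Vf

# a toy model and a set of formulas closed under subformulas
W, R, V = range(4), {(0, 1, 2), (0, 3, 2), (1, 1, 1)}, {'p': {1, 3}, 'q': {2}}
M = (W, R, V)
p, q = ('var', 'p'), ('var', 'q')
Sigma = [p, q, ('prod', p, q)]

Wf, Rf, Vf = smallest_filtration(M, Sigma)
print(len(Wf), "classes,", len(Rf), "triples")    # 3 classes, 2 triples
# condition (f2) for the smallest filtration, checked by brute force:
for (U, Vc, Wc) in Rf:
    if all(sat(M, v, p) for v in Vc) and all(sat(M, w, q) for w in Wc):
        assert all(sat(M, u, ('prod', p, q)) for u in U)
```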
Actually, this yields the bounded finite model property. By the size of Γ (s(Γ)) we mean the number of variables and connectives occurring in formulas from Γ (for connectives, we count their occurrences). The number of all subformulas of formulas from Γ is not greater than s(Γ). If Σ is the smallest suitable set containing Σ0, then Σ consists of at most 2s(Σ0) formulas. Consequently, WΣ has at most 2^n elements, where n = 2s(Σ0). We obtain: ϕ is provable from Γ in NL-CL if and only if Γ entails ϕ in models on ternary frames with at most 2^n elements. Clearly this implies the decidability of the provability from finite sets in NL-CL.

Analogous results can be obtained for all nonassociative extensions, discussed in Sect. 3.3. The following observations are crucial. If R ⊆ W³ satisfies (WS) (resp. (Cy), (FS)), then Rs satisfies (WS) (resp. (Cy), (FS)). If R is reflexive, then Rs is reflexive. If R ⊆ {(u, u, u) : u ∈ W}, then Rs ⊆ {([u], [u], [u]) : [u] ∈ WΣ} (this is not very useful, since the corresponding logic reduces to classical logic). They cannot be adapted, at least directly, for associative versions of these logics. L-CL is undecidable (see Sect. 4), hence it does not possess the finite model property (FMP).11 Consequently, filtration does not preserve associativity: for R ⊆ W³ satisfying (4), Rf need not satisfy (4).
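The bound just obtained is a simple arithmetic function of the input; a small helper (hypothetical names, counting formulas represented as nested tuples) computes the size s of a finite set of formulas and the resulting bound 2^n with n = 2s.

```python
def size(phi):
    """Number of variable and connective occurrences in a formula.
    Formulas: ('var', p) or (op, A, ...) for unary/binary connectives."""
    if phi[0] == 'var':
        return 1
    return 1 + sum(size(arg) for arg in phi[1:])

def model_size_bound(Sigma0):
    n = 2 * sum(size(phi) for phi in Sigma0)   # the suitable set has <= 2*s(Sigma0) formulas
    return 2 ** n                              # so at most 2^n equivalence classes

p, q = ('var', 'p'), ('var', 'q')
Sigma0 = [('prod', p, ('under', p, q)), q]     # s(Sigma0) = 5 + 1 = 6
print(model_size_bound(Sigma0))                # 2**12 = 4096
```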
4 Decidability and Complexity NL-CL is decidable, and similarly for the provability from a finite set [14, 21]. The provability in the pure logic NL-CL is PSPACE-complete. Lin and Ma [34] prove it by: (1) a polynomial translation of K, i.e., Kt without ↓ , in NL-CL, (2) a polynomial translation of NL-CL in Kt (in two steps: first, NL-CL in K2t , i.e., Kt ↓ with the second pair of modalities 2 , 2 : axioms and rules for them copy those for ↓ , , second, a polynomial translation of K2t in Kt ). Since K and Kt are PSPACEcomplete [8], (1) implies that NL-CL is PSPACE-hard and (2) that it is PSPACE. Let A be a class of algebras. By Eq(A) (resp. Queq(A)) we denote the set of all equations (resp. quasi-equations) valid in A; we refer to the first-order language of A (see Remark 4). A universal sentence is a sentence ∀x1 ,...,xn ϕ, where ϕ is a quantifier-free first-order formula. The universal theory of A is defined as the set of all universal sentences valid in A and denoted ThU (A). By G (resp. SG) we denote the class of groupoids (resp. semigroups), by RG (resp. RSG) the class of residuated groupoids (resp. semigroups), by DLRG the class of d.l.o.r. groupoids, and by bDLRG the class of bounded d.l.o.r. groupoids. BRG (resp. BRSG) denotes the class of b.r. groupoids (resp. semigroups). Shkatov and van Alten [44] prove that ThU (bDLRG) is EXPTIME-complete; this proof also yields the EXPTIME-completeness of Queq(bDLRG). The same authors [45] prove the EXPTIME-completeness of the universal theory of normal modal algebras. Details are too involved to be discussed here. It follows (see Remark 4) that the provability from finite sets in DMANL with ⊥, is EXPTIME-complete. Since NL11 FMP:
every unprovable formula can be falsified in a finite model.
CL is a strongly conservative extension of DMANL with ⊥, ⊤ (see Remark 5), then the provability from finite sets in NL-CL is EXPTIME-hard. It is also EXPTIME. The proof from [44] that ThU(bDLRG) is EXPTIME, which uses some characterization of the partial algebras being subalgebras of algebras in bDLRG, can be adjusted for BRG, as is done in [45] for normal modal algebras with unary modal operators. Therefore Queq(BRG) is EXPTIME, which shows that the provability from finite sets in NL-CL is EXPTIME-complete. The same is true for Commutative NL-CL.

For associative logics the situation radically changes. L-CL is undecidable. This was explicitly stated by Kurucz et al. [27] who proved a more general result. In the next paper [28], not referring to the Lambek calculus, the same authors proved a closely related result: classical propositional logic enriched with a binary modality, distributing over disjunction, is undecidable. Since the undecidability of L-CL is an important result for our subject-matter, we present a proof below. Our proof essentially follows that in [28] (which simplifies the approach of [27]), but further simplifies it and repairs an error, namely a wrong definition of the equation e(q), encoding a quasi-equation q.

We consider quasi-equations s1 = t1 ∧ · · · ∧ sn = tn ⇒ s0 = t0, n ≥ 0, in the first-order language of semigroups, i.e., ⊗ is the only operation symbol,12 which can appear in terms si, ti. For a class of algebras A, admitting an operation ⊗, Queq⊗(A) denotes the set of all quasi-equations of this form valid in A. Clearly Queq⊗(SG) = Queq(SG). Queq(SG) is undecidable. This amounts to the classical result of computability theory: the word problem for semigroups is undecidable. We will show that Queq(SG) can be encoded in Eq(BRSG), which yields the undecidability of Eq(BRSG). As a consequence, L-CL is undecidable (see Remark 4).

Lemma 4 Queq(SG) = Queq⊗(BRSG)

Proof ⊆ is obvious. To prove ⊇ we observe that every semigroup (G, ⊗) is isomorphic to a subalgebra of the semigroup reduct of a b.r. semigroup, namely P(G) with operations ⊗, \, / defined as in Sect. 2 (except that · is replaced with ⊗). The map h(a) = {a} is a monomorphism of (G, ⊗) into P(G).

We define a term σ(x) in the first-order language of BRSG (as it has been noted in Remark 4, in terms ∪, ∩ are used for ∨, ∧).

σ(x) := x ∪ ⊤ ⊗ x ∪ x ⊗ ⊤ ∪ ⊤ ⊗ x ⊗ ⊤
(13)
Lemma 5 Let A be a b.r. semigroup. For any a ∈ A: (i) a ≤ σ(a), (ii) ⊤ ⊗ σ(a) ≤ σ(a), σ(a) ⊗ ⊤ ≤ σ(a), (iii) σ(⊥) = ⊥.

Proof (i) and (iii) are obvious. We show (ii) ⊤ ⊗ σ(a) ≤ σ(a).

⊤ ⊗ σ(a) = ⊤ ⊗ a ∪ ⊤ ⊗ ⊤ ⊗ a ∪ ⊤ ⊗ a ⊗ ⊤ ∪ ⊤ ⊗ ⊤ ⊗ a ⊗ ⊤
≤ ⊤ ⊗ a ∪ ⊤ ⊗ a ∪ ⊤ ⊗ a ⊗ ⊤ ∪ ⊤ ⊗ a ⊗ ⊤ = ⊤ ⊗ a ∪ ⊤ ⊗ a ⊗ ⊤ ≤ σ(a)

12 We use ⊗ instead of · for the semigroup operation, since the former is used in b.r. semigroups.
The proof of σ(a) ⊗ ⊤ ≤ σ(a) is similar.
By BSG we denote the class of boolean semigroups, i.e., boolean algebras with an associative operation ⊗ which distributes over ∪ in both arguments. Notice that σ(x) is a term in the language of BSG. Lemma 5(iii), however, needs ⊤ ⊗ ⊥ = ⊥ = ⊥ ⊗ ⊤, which is valid in BRSG, but not in BSG. For A ∈ BSG and c ∈ A, we define a map gc : A → A as follows: gc(x) = x ∪ σ(c) for x ∈ A. On the set Bc = gc[A] we define an operation ⊗c: a ⊗c b = a ⊗ b ∪ σ(c).

Lemma 6 The map gc is an epimorphism of (A, ⊗) onto (Bc, ⊗c).

Proof gc(a) ⊗c gc(b) = (a ∪ σ(c)) ⊗ (b ∪ σ(c)) ∪ σ(c) = a ⊗ b ∪ a ⊗ σ(c) ∪ σ(c) ⊗ b ∪ σ(c) ⊗ σ(c) ∪ σ(c). We have a ⊗ σ(c) ≤ ⊤ ⊗ σ(c) ≤ σ(c), by Lemma 5. Similarly σ(c) ⊗ b ≤ σ(c) and σ(c) ⊗ σ(c) ≤ σ(c). Consequently, the right-hand side of the second equation equals a ⊗ b ∪ σ(c), i.e., gc(a ⊗ b).

Corollary 1 (Bc, ⊗c) is a semigroup.
In boolean algebras one defines: a − b = a ∩ b−, a ∸ b = (a − b) ∪ (b − a) (symmetric difference). We need the following properties.

a = b ⇔ a ∸ b = ⊥   (14)

a ∪ (a ∸ b) = a ∪ b = b ∪ (a ∸ b)   (15)
For a quasi-equation q := s1 = t1 ∧ · · · ∧ sn = tn ⇒ s0 = t0 (in the language of SG) we define a term tq and an equation e(q) as follows.

tq := (s1 ∸ t1) ∪ · · · ∪ (sn ∸ tn)   e(q) := s0 ∪ σ(tq) = t0 ∪ σ(tq)

Lemma 7 For any quasi-equation q (in the language of SG), q ∈ Queq(SG) if and only if e(q) ∈ Eq(BRSG).

Proof We prove the if-part. Assume e(q) ∈ Eq(BRSG). We show q ∈ Queq(BRSG). Let A ∈ BRSG. Let a valuation μ in A be such that μ(si) = μ(ti) for all i = 1, . . . , n. Then, μ(tq) = ⊥, hence μ(σ(tq)) = ⊥, by Lemma 5(iii). Consequently, μ(s0) = μ(s0 ∪ σ(tq)) = μ(t0 ∪ σ(tq)) = μ(t0), where the second equation holds, since e(q) is valid in BRSG. So q is valid in BRSG, hence in SG, by Lemma 4.

We prove the only-if part. Assume e(q) ∉ Eq(BRSG). There exist A ∈ BRSG and a valuation μ in A such that μ(s0 ∪ σ(tq)) ≠ μ(t0 ∪ σ(tq)). For c = μ(tq), we consider the semigroup (Bc, ⊗c) and the epimorphism gc of (A, ⊗) onto (Bc, ⊗c), defined above. We show gc(μ(si)) = gc(μ(ti)), for all i = 1, . . . , n. We denote ai = μ(si), bi = μ(ti) for i = 0, 1, . . . , n. By (15), si ∪ tq = ti ∪ tq is valid in BRSG, for any i = 1, . . . , n. By Lemma 5(i), tq ≤ σ(tq) is valid as well. This yields: gc(ai) = ai ∪ σ(c) = ai ∪ c ∪ σ(c) = bi ∪ c ∪ σ(c) = bi ∪ σ(c) = gc(bi). On the other hand, gc(a0) = a0 ∪ σ(c) ≠ b0 ∪ σ(c) = gc(b0). Therefore q is not true in (Bc, ⊗c) for the valuation gc ◦ μ. So q ∉ Queq(SG).
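Since the passage from q to e(q) is purely syntactic, it can be written out directly. The sketch below uses our own term representation ('T' and 'B' standing for ⊤ and ⊥, 'o' for the operation ⊗) and builds tq, σ(tq) and the two sides of e(q) for a sample quasi-equation.

```python
# Terms as nested tuples: variables are strings, ('o', s, t) is the semigroup
# operation, ('cup', s, t) / ('cap', s, t) are join / meet, ('neg', s) is
# complement; 'T' and 'B' stand for the top and bottom elements.

def o(s, t):     return ('o', s, t)
def cup(s, t):   return ('cup', s, t)
def cap(s, t):   return ('cap', s, t)
def minus(s, t): return cap(s, ('neg', t))               # s - t
def symdiff(s, t): return cup(minus(s, t), minus(t, s))  # symmetric difference

def sigma(x):
    # sigma(x) = x U (T o x) U (x o T) U (T o x o T), as in Eq. (13)
    return cup(cup(x, o('T', x)), cup(o(x, 'T'), o('T', o(x, 'T'))))

def encode(premises, s0, t0):
    """premises: list of pairs (s_i, t_i); returns the two sides of e(q)."""
    parts = [symdiff(s, t) for s, t in premises]
    tq = parts[0] if parts else 'B'          # empty join taken as bottom
    for part in parts[1:]:
        tq = cup(tq, part)
    st = sigma(tq)
    return cup(s0, st), cup(t0, st)

# a sample quasi-equation  x o y = y o x  =>  x = y, just to show the shape
lhs, rhs = encode([(o('x', 'y'), o('y', 'x'))], 'x', 'y')
print(lhs)
```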
Theorem 5 ([27]) L-CL is undecidable.

Proof By Lemma 7, Eq(BRSG) is undecidable. This implies the undecidability of L-CL (see Remark 4).

In the proof of Lemma 7, Eq(BRSG) can be replaced with Eq(BSG). Indeed, Lemma 5(iii) is used in the if-part only. With BSG, we assume e(q) ∈ Eq(BSG), which yields e(q) ∈ Eq(BRSG). So we can continue as above. Consequently, Eq(BSG) is undecidable [28].

Now we point out the error in [28]. In this paper e(q) := s0 ∸ t0 ≤ σ(tq). The argument for the only-if part of Lemma 7 does not work. Assuming e(q) ∉ Eq(BSG), one obtains μ(s0 ∸ t0) ≠ ⊥, hence μ(s0) ≠ μ(t0). This, however, does not imply gc(μ(s0)) ≠ gc(μ(t0)) (claimed in [28]), since gc need not be a monomorphism. In our proof, this inequality, written gc(a0) ≠ gc(b0), holds by the form of e(q). For honesty, let us note that in [27] e(q) is defined differently: (s0 ∪ σ(tq)) ∸ (t0 ∪ σ(tq)) ≤ σ(tq) (in fact, the construction is more complicated, but it takes this form if a noncommutative operation, used in [27], is replaced with ∸). This works!

In the same way one proves the undecidability of L1-CL (not stated in [27, 28]). It suffices to note that the word problem for monoids is also undecidable. By dual constructions, one proves the undecidability of intuitionistic logic with an associative binary modality which distributes over ∧; we denote it by •. This is briefly noted in [27] without any details, but it is not difficult to recover them. Intuitionistic ↔ replaces ∸. Our σ(x) is replaced by δ(x) := x ∩ ⊥ • x ∩ x • ⊥ ∩ ⊥ • x • ⊥. Finally:

tq := (s1 ↔ t1) ∩ · · · ∩ (sn ↔ tn)   e(q) := s0 ∩ δ(tq) = t0 ∩ δ(tq)

By dualizing the arguments, written above, one can prove the undecidability of Ld IL, i.e., intuitionistic logic augmented with dual Lambek connectives •, \•, /•, with the axioms (id), (a1), (a2) (for •), the rules corresponding to (RES•) and (cut-1). Interestingly, L-IL is decidable [22]: the proof employs a cut-free sequent system for this logic.
5 Conclusion Although the topics presented above were earlier considered by several scholars in different contexts, it seems that NL-CL, L-CL and their extensions were never studied systematically. This motivated the author to write this survey, also containing some new observations. Hopefully, it will stimulate further research. There remain some open mathematical problems, e.g., the complexity of extensions of NL-CL by new axioms, the (un)decidability of analogous extensions of L-CL, and others. The major problem, however, is to find good applications of these formalisms in NLP and elsewhere (some have been pointed out in Introduction and Remark 7).
Acknowledgements The author thanks Roussanka Loukanova for her essential help in the edition of this chapter and two anonymous referees for their valuable comments.
References 1. Abrusci, V.M.: Phase semantics and sequent calculus for pure noncommutative classical linear propositional logic. J. Symb. Log. 56(4), 1403–1451 (1991). https://doi.org/10.2307/2275485 2. Ajdukiewicz, K.: Die syntaktische Konnexität. Stud. Philos. 1(1), 1–27 (1935) 3. Andréka, H., Mikulás, S.: Lambek calculus and its relational semantics: completeness and incompleteness. J. Log. Lang. Inf. 3(1), 1–37 (1994). https://doi.org/10.1007/BF01066355 4. Bar-Hillel, Y.: A quasi-arithmetical notation for syntactic description. Language 29(1), 47–58 (1953). https://doi.org/10.2307/410452 5. Bar-Hillel, Y., Gaifman, C., Shamir, E.: On categorial and phrase-structure grammars. Bull. Res. Counc. Isr. F9, 1–16 (1960) 6. van Benthem, J.: Language in Action: Categories, Lambdas and Dynamic Logic. Studies in Logic and the Foundations of Mathematics, vol. 130. North-Holland, Amsterdam (1991) 7. Bimbo, K., Dunn, J.M.: Generalized Galois Logics. Relational Semantics for Nonclassical Logical Calculi. CSLI Lecture Notes, vol. 188. CSLI Publications, Stanford (2008) 8. Blackburn, P., de Rijke, M., Venema, Y.: Modal Logic. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, Cambridge (2001). https://doi.org/10.1017/ CBO9781107050884 9. Buszkowski, W.: Completeness results for Lambek syntactic calculus. Z. Math. Log. Grundl. Math. 32(1), 13–28 (1986) 10. Buszkowski, W.: Categorial grammars with negative information. In: Wansing, H. (ed.) Negation. A Notion in Focus, pp. 107–126. de Gruyter, Berlin (1996). https://doi.org/10.1515/ 9783110876802.107 11. Buszkowski, W.: Lambek calculus with nonlogical axioms. In: Casadio, C., Scott, P.J., Seely, R.A. (eds.) Language and Grammar. Studies in Mathematical Linguistics and Natural Language. CSLI Lecture Notes, vol. 168, pp. 77–93. CSLI Publications, Stanford (2005) 12. Buszkowski, W.: Interpolation and FEP for logics of residuated algebras. Log. J. IGPL 19(3), 437–454 (2011). https://doi.org/10.1093/jigpal/jzp094 13. Buszkowski, W.: On classical nonassociative Lambek calculus. In: Amblard, M., de Groote, P., Pogodalla, S., Retoré, C. (eds.) Logical Aspects of Computational Linguistics. Celebrating 20 Years of LACL (1996–2016), pp. 68–84. Springer, Berlin (2016). https://doi.org/10.1007/ 978-3-662-53826-5_5 14. Buszkowski, W., Farulewski, M.: Nonassociative Lambek calculus with additives and contextfree languages. In: Grumberg, O., Kaminski, M., Katz, S., Wintner, S. (eds.) Languages: From Formal to Natural: Essays Dedicated to Nissim Francez on the Occasion of His 65th Birthday, pp. 45–58. Springer, Berlin (2009). https://doi.org/10.1007/978-3-642-01748-3_4 15. Došen, K.: A brief survey of frames for the Lambek calculus. Z. Math. Log. Grundl. Math. 38(1), 179–187 (1992). https://onlinelibrary.wiley.com/doi/abs/10.1002/malq.19920380113 16. Galatos, N., Jipsen, P., Kowalski, T., Ono, H.: Residuated Lattices: An Algebraic Glimpse at Substructural Logics. Studies in Logic and The Foundations of Mathematics, vol. 151. Elsevier B. V., Amsterdam (2007) 17. Geach, P.T.: A program for syntax. Synthese 22(1–2), 3–17 (1971). https://doi.org/10.1007/ BF00413597 18. Girard, J.Y.: Linear logic. Theor. Comput. Sci. 50(1), 1–101 (1987). https://doi.org/10.1016/ 0304-3975(87)90045-4 19. de Groote, P., Lamarche, F.: Classical non-associative Lambek calculus. Stud. Log. 71(3), 355–388 (2002). https://doi.org/10.1023/A:1020520915016
20. Jónsson, B., Tsinakis, C.: Relation algebras as residuated Boolean algebras. Algebra Universalis 30, 469–478 (1993). https://doi.org/10.1007/BF01195378 21. Kaminski, M., Francez, N.: Relational semantics of the Lambek calculus extended with classical propositional logic. Stud. Log. 102, 479–497 (2014). https://doi.org/10.1007/s11225-0139474-7 22. Kaminski, M., Francez, N.: The Lambek calculus extended with intuitionistic propositional logic. Stud. Log. 104, 1051–1082 (2016). https://doi.org/10.1007/s11225-016-9665-0 23. Kanazawa, M.: The Lambek calculus enriched with additional connectives. J. Log. Lang. Inf. 1(2), 141–171 (1992). https://doi.org/10.1007/BF00171695 24. Kanovich, M., Kuznetsov, S., Scedrov, A.: The complexity of multiplicative-additive Lambek calculus: 25 years later. In: de Queiroz, R., Iemhoff, R., Moortgat, M. (eds.) Logic, Language, Information, and Computation. WoLLIC 2019. Lecture Notes in Computer Science, vol. 11541, pp. 356–372. Springer, Berlin (2019). https://doi.org/10.1007/978-3-662-59533-6_22 25. Kołowska-Gawiejnowicz, M.: Powerset residuated algebras and generalized Lambek calculus. Math. Log. Q. 43(1), 60–72 (1997). https://doi.org/10.1002/malq.19970430108 26. Kurtonina, N.: Frames and labels. a modal analysis of categorial inference. Ph.D. thesis, OTS, Utrecht University (1995) 27. Kurucz, A., Németi, I., Sain, I., Simon, A.: Undecidable varieties of semilattice-ordered semigroups, of Boolean algebras with operators, and logics extending Lambek calculus. Bull. IGPL 1(1), 91–98 (1993). https://doi.org/10.1093/jigpal/1.1.91 28. Kurucz, A., Németi, I., Sain, I., Simon, A.: Decidable and undecidable logics with a binary modality. J. Log. Lang. Inf. 4(3), 191–206 (1995). https://doi.org/10.1007/BF01049412 29. Kuznetsov, S.: Lambek grammars with one division and one primitive type. Log. J. IGPL 20(1), 207–211 (2012). https://doi.org/10.1093/jigpal/jzr031 30. Kuznetsov, S.: Trivalent logics arising from L-models for the Lambek calculus with constants. J. Appl. Non-Class. Log. 14(1–2), 1312–137 (2014). https://doi.org/10.1080/11663081.2014. 911522 31. Lambek, J.: The mathematics of sentence structure. Am. Math. Mon. 154–170 (1958) 32. Lambek, J.: On the calculus of syntactic types. In: Jakobson, R. (ed.) Structure of Language and Its Mathematical Aspects. Proceedings of Symposia in Applied Mathematics, vol. 12, pp. 166–178. American Mathematical Society, Providence (1961). https://doi.org/10.1090/psapm/ 012 33. Lambek, J.: From Word to Sentence: A Computational Algebraic Approach to Grammar. Polimetrica (2008). https://books.google.se/books?id=ZHgRaRaadJ4C 34. Lin, Z., Ma, M.: On the complexity of the equational theory of residuated Boolean algebras. In: Väänänen, J., Hirvonen, Å., de Queiroz, R. (eds.) Logic, Language, Information, and Computation. WoLLIC 2016. Lecture Notes in Computer Science, vol. 9803, pp. 265–278. Springer, Berlin (2016). https://doi.org/10.1007/978-3-662-52921-8_17 35. Moortgat, M.: Multimodal linguistic inference. J. Log. Lang. Inf. 5(3/4), 349–385 (1996). https://doi.org/10.1007/BF00159344 36. Moortgat, M.: Categorial type logic. In: van Benthem, J., ter Meulen, A. (eds.) Handbook of Logic and Language, pp. 93–177. Elsevier, Amsterdam; The MIT Press, Cambridge (1997). https://doi.org/10.1016/8978-044481714-3/50005-9 37. Moot, R., Retoré, C.: The Logic of Categorial Grammars: A Deductive Account of Natural Language Syntax and Semantics. Lecture Notes in Computer Science, vol. 6850. Springer, Berlin (2012). 
https://doi.org/10.1007/978-3-642-31555-8 38. Morrill, G.: Type Logical Grammar: Categorial Logic of Signs. Kluwer, Dordrecht (1994). https://doi.org/10.1007/978-94-011-1042-6 39. Pentus, M.: Lambek grammars are context free. In: 1993 Proceedings 8th Annual IEEE Symposium on Logic in Computer Science, pp. 429–433. IEEE (1993). https://doi.org/10.1109/ LICS.1993.287565 40. Pentus, M.: Models for the Lambek calculus. Ann. Pure Appl. Log. 75(1), 179–213 (1995). https://doi.org/10.1016/0168-0072(94)00063-9. Invited Papers Presented at the Conference on Proof Theory, Provability Logic, and Computation
41. Pentus, M.: Lambek calculus is NP-complete. Theor. Comput. Sci. 357(1), 186–201 (2006). https://doi.org/10.1016/j.tcs.2006.03.018 42. Restall, G.: An Introduction to Substructural Logics. Routledge, Abingdon (2000) 43. Sedlár, I., Tedder, A.: Lambek calculus with conjugates. Stud. Log. (2020). https://doi.org/10. 1007/s11225-020-09913-2 44. Shkatov, D., Van Alten, C.: Complexity of the universal theory of bounded residuated distributive lattice-ordered groupoids. Algebra Universalis 80(36) (2019). https://doi.org/10.1007/ s00012-019-0609-1 45. Shkatov, D., Van Alten, C.: Complexity of the universal theory of modal algebras. Stud. Log. 108(2), 221–237 (2020). https://doi.org/10.1007/s11225-019-09853-6 46. Yetter, D.N.: Quantales and (noncommutative) linear logic. J. Symb. Log. 55(1), 41–64 (1990). https://doi.org/10.2307/2274953
Partial Orders, Residuation, and First-Order Linear Logic Richard Moot
Abstract We will investigate proof-theoretic and linguistic aspects of first-order linear logic. We will show that adding partial order constraints in such a way that each sequent defines a unique linear order on the antecedent formulas of a sequent allows us to define many useful logical operators. In addition, the partial order constraints improve the efficiency of proof search. Keywords Lambek calculus · Residuation · First-order linear logic
1 Introduction Residuation is a standard principle which holds for the Lambek calculus and many of its variants. However, even though first-order linear logic can embed the Lambek calculus and some of its variants, linear logic formulas need not be part of a residuated triple (or pair). In this paper, we will show that first-order linear logic can respect the residuation principle by adding partial order constraints on the variables and constants in a sequent. We investigate the number of connectives definable this way and compare these connectives to the connectives definable in other type-logical grammars. We conclude by investigating some of the applications of these results, both in terms of linguistic modelling and in terms of improving upon the efficiency of proof search.
2 Categorial Grammars and Residuation Lambek introduced his syntactic calculus first as a calculus based on residuation [15, Sect. 7, with a sequent calculus in Sect. 8]. The principle of residuation is shown as Eq. 1. A→C/B
⇐⇒
A•B →C
⇐⇒
B → A\C
(1)
The Lambek calculus is then defined using just the principle of residuation together together with reflexivity and transitivity of the derivation arrow and associativity of the product ‘•’. Table 1 lists the full set of rules of the residuation-based representation of the Lambek calculus. In the Lambek calculus, the standard interpretation of the product ‘•’ is as a type of concatenation, with the implications ‘\’ and ‘/’ its residuals. Using the residuation calculus, we can derive standard cancellation schemes such as the following. Refl A\C → A\C Res\,• A • (A \ C) → C
Refl C/B→C/B Res/,• (C / B) • B → C
Showing us that when we compose C / B with a B to its right, we produce a C, and that when we compose A \ C with an A to its left, we produce a C. Figure 1 shows a standard visual representation of the residuation principle in the form of a triangle, where the each of the vertices of the triangle corresponds to one of the Lambek calculus connectives.
Table 1 Residuation-based presentation of the Lambek calculus Identity A→A
A→B B→C Trans A→C
Refl
Residuation A • B → C Res •,/ A→C/B
A • B → C Res •,\ B → A\C
A→C/B Res/,• A•B →C
B → A\C Res\,• A•B →C Associativity
A • (B • C) → (A • B) • C
Ass1
(A • B) • C → A • (B • C)
Ass2
Partial Orders, Residuation, and First-Order Linear Logic
39
Fig. 1 Visual representation of residuation
We can ‘read off’ many of the principles from this triangle, for example, the three different ways of concatenating the elements of a residuated triple are: 1. composing A and B to produce A • B, 2. composing C / B and B to produce C, 3. composing A and A \ C to produce C. The residuation presentation of the Lambek calculus naturally forms a category. This not only gives the Lambek calculus a category theoretic foundation—something Girard [9] argues is an important, deeper level of meaning for logics—but it can also play the role of an alternative type of natural language semantics for the Lambek calculus [4, 16], to be contrasted with the more standard semantics for type-logical grammars in the tradition of Montague [19]. An alternative combinatorial representation of residuation is found in Table 2. This presentation uses the two application principles we have derived above as axioms, Table 2 Došen’s presentation of the Lambek calculus Identity A→A
A→B B→C Trans A→C
Refl
Application A • (A \ B) → B
Appl\
(B / A) • A → B
Appl/
Co-Application A → B \ (B • A)
Coappl\
A → (A • B) / B
Coappl/
Monotonicity A → B C → D Mon A → B C → D Mon A → B C → D Mon• \ / B \C → A\ D A•C → B • D C/B→ D/A Associativity A • (B • C) → (A • B) • C
Ass1
(A • B) • C → A • (B • C)
Ass2
40
R. Moot
and adds two additional principles of co-application, easily obtained from the identity on the product formulas together with a residuation step. Refl A • B → A • B Res •,/ A → (A • B) / B
Refl B • A → B • A Res •,\ A → B \ (B • A)
The advantage of this presentation is that, besides transitivity, the only recursive rules are the monotonicity principles for the three connectives. This makes this presentation especially convenient for inductive proofs. For example, the completeness proofs of Došen [7] use this presentation.
2.1 Residuation in Extended Lambek Calculi Many of the extensions and variants of the Lambek calculus which have been proposed keep the principle of residuation central. For example, the multimodal Lambek calculus simply uses multiple families of residuated connectives {/i , •i , \i } for members i of a fixed, small set I of modes. Similarly, the unary connectives ‘♦’ and ‘’ connectives are a residuated pair [14, 20, 25]. However, some other formalisms do not use residuation as their central tool for defining connectives. These formalisms either add connectives corresponding to alternative algebraic principles, or abandon residuation altogether. Formalisms in the former group take residuation for some of its connectives and add additional principles such as dual residuation, Galois connections, and dual Galois connections for other connectives [1, 3]. Formalisms in the latter group abandon residuation as a key principle (without replacing it with another algebraic principle), or only preserve it for some of their connectives. These formalisms include lambda grammars [24], hybrid type-logical grammars [12, 13] and first-order linear logic [21, 22].
2.2 Residuation and First-Order Linear Logic The main theme of this paper will be to investigate what types of connectives are definable in first-order linear logic when we restrict ourselves to residuated connectives. We will look at generalised forms of concatenation and their residuals and see how we can define these in first-order linear logic. Some of these definable connectives require us to explicitly specify partial order constraints on some of the positions to preserve the required information. The resulting grammar system then has two components: for a sentence to be grammatical, a logical statement has to be derivable (as is standard for type-logical grammars) but also a corresponding partial order definition must be consistent. This gives us
Partial Orders, Residuation, and First-Order Linear Logic
41
Table 3 The sequent calculus for first-order intuitionistic multiplicative linear logic
AA
Ax
A , A C Cut , C
, A, B C L⊗ , A ⊗ B C
A B R⊗ , A ⊗ B
A , B C L , , A B C
, A B R AB
, A C L∃∗ , ∃x.A C
A[x := t] R∃ ∃x.A
, A[x := t] C L∀ , ∀x.A C
A R∀∗ ∀x.A
a mechanism to specify the relative order of grammatical constituents (logical formulas in type-logical grammars). The property we want to preserve locally in each statement is that the strings corresponding to the antecedent formulas can be linearly ordered in a unique way.
3 First-Order Linear Logic A sequent or a statement is an expression of the form A1 , . . . , An C (for some n ≥ 0), which we will often shorten to C. We call the antecedent, formulas Ai in antecedent formulas, and C the succedent of the statement. We assume the sequent comma is both associative and commutative and treat statements which differ only with respect to the order of the antecedent formulas to be equal. Table 3 shows the sequent calculus rules for first-order multiplicative intuitionistic linear logic. The R∀ and L∃ rule have the standard side condition that there are no free occurrences of x in and C. Cut-free proof search for the sequent calculus is decidable (the decision problem is NP complete [17]), and sequent proof search can be used as a practical decision procedure [18]. Decidability presupposes both cut elimination (which, as usual, is a simple enough proof even though there are many rule permutations to verify) and a restriction on the choice of t for the L∀ and R∃ rules. A standard solution is to use unification for this purpose, effectively delaying the choice of t to the most general term required by the axioms in backward chaining cut-free proof search. This of course requires us to verify the eigenvariable conditions for the R∀ and L∃ rules are still satisfied after unification. We can see this in action in the following failed attempt to prove ∀y[a ⊗ b(y)] a ⊗ ∀x.b(x) (the reader can easily verify all other proof attempts fail as well).
42
R. Moot
Y =x Ax b(Y ) b(x) ∀R∗ Ax aa b(Y ) ∀x.b(x) R⊗ a, b(Y ) a ⊗ ∀x.b(x) L⊗ a ⊗ b(Y ) a ⊗ ∀x.b(x) L∀ ∀y.[a ⊗ b(y)] a ⊗ ∀x.b(x) Tracing the proof from the endsequent upwards to the axioms, we start by replacing y by a fresh metavariable Y to be unified later, then follow the proof upwards to the axioms. For the b predicates, we compute the most general unifier of x and Y , which is x. But then, the antecedent of the ∀R rule becomes b(x), which fails to respect the eigenvariable condition for x. We can improve on the sequent proof procedure for first-order linear logic, even exploiting some of the rule permutabilities [18]. However, in Sect. 3.2 we will present a proof net calculus for first order linear logic, following Girard [8], which intrinsically avoids the efficiency problems caused by rule permutations. Before we do so, however, we will briefly recall how we can use first-order linear logic for modelling natural languages.
3.1 First-Order Linear Logic and Natural Language Grammars For type-logical grammars, a lexicon is a mapping from words to formulas in the corresponding logic. In first-order linear logic, this mapping is parametric for two position variables L and R, corresponding respectively to the left and right position of the string segment corresponding to the word. In general, for a sentence with n words, we assign the formula of word wi (for 1 ≤ i ≤ n) the string positions i − 1 and i. This simply follows the fairly standard convention in the parsing literature to represent substrings of the input string by pairs of integers. As noted by Moot and Piazza [22], we can translate Lambek calculus formulas to first-order linear logic formulas as follows. p x,y = p(x, y)
(2)
‖A • B‖^{x,z} = ∃y. ‖A‖^{x,y} ⊗ ‖B‖^{y,z}   (3)
‖A \ C‖^{y,z} = ∀x. ‖A‖^{x,y} ⊸ ‖C‖^{x,z}   (4)
‖C / B‖^{x,y} = ∀z. ‖B‖^{y,z} ⊸ ‖C‖^{x,z}   (5)
Equation 5 states that when C/B is a formula spanning string x, y (that is, having x as its left edge and y as its right edge), that means combining it with a formula B having y as its left edge and any z as its right edge will produce a formula C starting at x (the left edge of C/B) and ending at z (the right edge of B).
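The translation in Eqs. (2)–(5) is easy to implement; the sketch below (our own tuple representation of formulas and fresh-variable generator) produces the MILL1 formula for a Lambek formula given its left and right string positions.

```python
import itertools

_fresh = itertools.count()
def fresh():
    return f"x{next(_fresh)}"

def translate(formula, left, right):
    """Translate a Lambek formula with string positions (left, right) into
    first-order linear logic, following Eqs. (2)-(5).
    Lambek formulas: atoms as strings, ('prod', A, B), ('under', A, B) = A\\B,
    ('over', A, B) = A/B.  Output: ('atom', p, l, r), ('tensor', F, G),
    ('lolli', F, G), ('exists', x, F), ('forall', x, F)."""
    if isinstance(formula, str):
        return ('atom', formula, left, right)
    op, a, b = formula
    if op == 'prod':                                   # Eq. (3)
        y = fresh()
        return ('exists', y, ('tensor', translate(a, left, y),
                              translate(b, y, right)))
    if op == 'under':                                  # Eq. (4): a \ b
        x = fresh()
        return ('forall', x, ('lolli', translate(a, x, left),
                              translate(b, x, right)))
    if op == 'over':                                   # Eq. (5): a / b
        z = fresh()
        return ('forall', z, ('lolli', translate(b, right, z),
                              translate(a, left, z)))

# example: the formula (np\s)/np of a transitive verb spanning positions 1-2
print(translate(('over', ('under', 'np', 's'), 'np'), 1, 2))
```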
Fig. 2 Figure 1 with the corresponding translations in first-order logic
Figure 2 shows how this translation forms a residuated triple.1 Note how combining (the translations of) A and B to A • B, A and A \ C to C, and C / B and B to C all correspond to the concatenation of an x, y segment to a y, z segment to form an x, z segment.
3.2 Proof Nets
Multiplicative linear logic has an attractive, graph-based representation of proofs called proof nets. It is relatively simple to add the first-order quantifiers to proof nets [2, 8]. The choice for intuitionism is justified by our interest in natural language semantics: the Curry–Howard isomorphism between proofs in multiplicative intuitionistic linear logic and linear lambda terms gives us a simple and principled way of defining the syntax-semantics interface, thereby connecting our grammatical analyses to formal linguistic semantics in the tradition of Montague [19]. Proof nets can be defined in two different ways.
1. We can define them inductively, as instructions for how to build proof nets from simpler ones.
2. We can define proof nets as instances of a more general class of objects called proof structures.
Even though the inductive definition of proof nets is useful for proving that all proof nets have certain properties, it is not immediately obvious how to determine whether something is or is not a proof net, since its inductive structure is not immediately visible (unlike, say, for sequent proofs). The second way of producing proof nets starts from proof structures. Given a sequent, there is a very direct procedure to enumerate its proof structures. Not all of these proof structures will be proof nets (that is, correspond to the inductive definition of proof nets, or, equivalently, to provable sequents). A correctness condition allows us to distinguish the proof nets from other structures.
1 To show this in full detail would require us to do the simple but tedious job of proving that this definition satisfies the monotonicity and Application/Co-Application principles of Table 2.
Table 4 Logical links for MILL1 proof structures
Proof structures are built from the links shown in Table 4. The formulas drawn above a link are called its premisses and the formulas drawn below it are called its conclusions. Each connective is assigned two links: one where it occurs as a premiss (the left link, corresponding to the left rule for the connective in the sequent calculus) and one where it occurs as a conclusion (corresponding to the right rule in the sequent calculus). We call the formula occurrence containing the main connective of a link its main formula and all other formula occurrences its active formulas. The logical links are divided into four groups:
1. the tensor links are the binary rules drawn with solid lines (the left link for '⊸' and the right link for '⊗'),
2. the par links are the binary rules drawn with dashed lines (the left link for '⊗' and the right link for '⊸'; par is the name for the multiplicative, classical disjunction of linear logic, '`'),
3. the existential links are the unary rules drawn with solid lines (the left link for '∀' and the right link for '∃'),
4. the universal links are the unary rules drawn with dashed lines and labeled with the corresponding eigenvariable (the left link for '∃' and the right link for '∀').
Definition 1 A proof structure is a tuple S = ⟨F, L⟩ where F is a set of formula occurrences and L is a set of the links connecting these formula occurrences such that each local subgraph is an instantiation of one of the links in Table 4 (for some A, B, x, t), and such that
• each formula is at most once the premiss of a link,
• each formula is at most once the conclusion of a link.
Finally, the quantifier links and eigenvariables have the following additional conditions.
• each quantifier link uses a distinct bound variable,
• all conclusions and hypotheses of S are closed,
• all eigenvariables of links in S are used strictly, meaning that we cannot substitute a constant cx for any set of occurrences of an eigenvariable x and obtain a proof structure with the same conclusions and hypotheses.
The formulas which are not the premisses of any link in a proof structure with hypotheses are the conclusions of the structure. The formulas which are not the conclusions of any link are the hypotheses of the structure. Formulas which are both the premiss and the conclusion of a link in a proof structure are its internal formulas. All other formulas (that is, formulas which are either hypotheses or conclusions of the proof structure) are its external formulas. This definition essentially follows Girard [8], incorporating the notion of strictly used eigenvariables from Bellin and van de Wiele [2] and the proof structures with hypotheses of Danos [5]. The requirement that eigenvariables are used strictly avoids the case where, for example, a subproof ∀x.a(x) ⊢ ∃y.a(y) instantiates x and y to the eigenvariable z of a universal link elsewhere in the proof. Given that, by definition, we can replace such occurrences by a new constant cz, this is a minor technicality to facilitate the verification of the correctness of the universal links in a proof net.
Figure 3 shows, on the left hand side, the formula unfolding for the underivable sequent ∀y[a ⊗ b(y)] ⊢ a ⊗ ∀x.b(x). We want derivable sequents A1, . . . , An ⊢ C to correspond to proof structures (and proof nets) with exactly the Ai as hypotheses and C as a conclusion. The proof structure on the left hand side of Fig. 3 has a and b(Y) as additional conclusions and a and b(x) as additional hypotheses. By identifying these formulas (and substituting x for Y) we obtain the proof structure shown on the right hand side of Fig. 3. In the current case, this is the unique identification of atomic formulas producing a proof structure such that the only hypothesis is ∀y[a ⊗ b(y)] and the only conclusion is a ⊗ ∀x.b(x). In the general case, there can be many ways of identifying atomic formulas and this is the central problem for proof search using proof nets.
Fig. 3 Two proof structures for the sequent ∀y[a ⊗ b(y)] ⊢ a ⊗ ∀x.b(x)
Underivability in the sequent calculus follows from the fact that there is no proof where the ∀ right rule is performed below the ∀ left rule (the intuitionistic version of this sequent, ∀y[a ∧ b(y)] ⊢ a ∧ ∀x.b(x), is derivable, but it requires us to use the antecedent formula ∀y[a ∧ b(y)] twice, which produces the correct order between the ∀ left and right rules). We will see below why the proof structure on the right of Fig. 3 is not a proof net.
Definition 2 Given a proof structure P, a component is a maximal, connected substructure containing only tensor and existential links.
We obtain the components of a proof structure by first removing the par and universal links, then taking each (maximal) connected substructure. Components can be single formulas. The components of the proof structure on the right of Fig. 3 correspond to the induced substructures of {∀y.[a ⊗ b(y)], a ⊗ b(x)}, {a, ∀x.b(x), a ⊗ ∀x.b(x)}, and {b(x)}. For the first and last of these structures, the occurrences of x (all of them free) will be replaced by cx. The second substructure contains the universal link for x (and only bound occurrences of x) and its formulas will therefore be unchanged. The corresponding sequents are given in Eqs. 6–8.

∀y.[a ⊗ b(y)] ⊢ a ⊗ b(cx)    (6)
a, ∀x.b(x) ⊢ a ⊗ ∀x.b(x)    (7)
b(cx) ⊢ b(cx)    (8)
The reader can verify that all of these are derivable (although we cannot combine these three proofs into a single proof of the required endsequent). Before we turn to the correctness condition, we need another auxiliary notion from Bellin and van de Wiele [2]. Definition 3 Given a proof structure P and the eigenvariable x of a universal link l in P, the existential frontier of x in P is the set of formula occurrences A1 , . . . , An such that each Ai is the main formula of an existential link li where x occurs free in the active formula of li but not in its main formula Ai . In Fig. 3, the formula ∀y.[a ⊗ b(y)] is the only formula in the existential frontier of x. To decide whether a proof structure is a proof net in linear logic, we need a correctness condition on proof structures. Given that the two universal links correspond to sequent calculus rules with side conditions on the use of their eigenvariable, it should come as no surprise that we need to keep track of free occurrences of eigenvariables for deciding correctness. Typical correctness conditions involve graph switchings and graph contractions. Girard [8], and Bellin and van de Wiele [2] extend the switching condition of Danos and Regnier [6] for first-order linear logic. Here, we will extend the contraction condition of Danos [5] to the first-order case. Definition 4 An abstract proof structure A = V, L is obtained from a proof structure P = F, L by replacing each formula A ∈ F by the set of eigenvariables freely
Fig. 4 Proof structure (left) and abstract proof structure (right) for the sequent ∀y[a ⊗ b(y)] ⊢ a ⊗ ∀x.b(x)
occurring in A, plus the eigenvariable x in case A is on the existential frontier of a universal link of P. Figure 4 shows the proof structure and corresponding abstract proof structure of the proof structure we’ve seen before on the right of Fig. 3. We have simply erased the formula information and kept only the information of the free variables at each node. The top node and only hypothesis of the structure, which corresponds to a closed formula (the formula ∀y.[a ⊗ b(y)]), is on the existential frontier of x (there is an occurrence of x in the active formula of the link) and therefore has the singleton set {x} assigned to it. Table 5 shows the contractions for first-order linear logic. Each contraction is an edge contraction on the abstract proof structure, deleting an edge or a joined pair of edges, and identifying the two incident vertices vi and v j . The resulting vertex is incident both to all nodes incident to vi (except v j ) and to all nodes incident to v j (except vi ). The eigenvariables assigned to the resulting vertex are the set union of the eigenvariables assigned to vi and v j . For the universal contraction u the eigenvariable corresponding to the eigenvariable x of the link is removed. The contraction p verifies that the two premisses of a single par link can be joined in a single point. The contraction u verifies that all free occurrences of the eigenvariable of a universal link (and its existential frontier) can be found at the vertex corresponding to the premiss of the link. The contraction c contracts a component. All contractions remove one edge (or, in the case of the par contraction p, a linked pair of edges) and keep all other edges the same, reducing the length of the paths which passed through the contracted edge by one. Contractions can produce self-loops and multiple edges between two nodes, but can never remove self-loops. Definition 5 A proof structure is a proof net iff its abstract proof structure contracts to a single vertex using the contractions of Table 5. The contraction system as presented is not confluent. For the critical cases, when a pair of vertices v1 and v2 is connected by two or more links of different types (par, universal or component), we can contract any of these multiple links connecting v1
Table 5 Contractions for first-order linear logic. Conditions: vi ≠ vj and, for the u contraction, all occurrences of x are at vj
Fig. 5 Failed contraction sequence for the abstract proof structure on the right of Fig. 3
and v2 and produce a self-loop for all others. An easy solution to ensure confluence is to treat all self-loops as equivalent.2 Figure 5 shows how the abstract proof structure of Fig. 3 fails to contract to a single vertex. The final structure shown on the right of the figure cannot be further contracted: the par (p) contraction requires the two edges of the par link to end in the same vertex, whereas the universal (u) contraction requires all occurrences of x to be at the vertex from which the x edge is leaving.
Lemma 1 ⊢ C is derivable if and only if there is a proof net of C.
See Bellin and van de Wiele [2] for a proof, which adapts trivially to the current context.
2 A more elegant solution for ensuring confluence would replace the right-hand side of the p and u contractions by the left-hand side of the c contraction.
4 Residuation and Partial Orders
So far, we have discussed proof-theoretic properties of first-order linear logic while only hinting at its applications as a formalism for natural language processing. In this section, I will suggest some principles for writing grammars using first-order linear
logic, essentially in the form of constraints on the formulas. These constraints apply only to constants and variables used as string positions and not to other applications of first-order variables (such as grammatical case, island constraints and scoping constraints). The principles presented here should not be taken in a dogmatic way. It may turn out that a larger class of grammars has significant applications or better mathematical properties. The goal is merely to provide some terra firma for exploring both linguistic applications and mathematical properties. Indeed, some known classes of type-logical grammars are outside the residuated fragment investigated in this paper [24], even though it is possible to follow Kubota and Levine [13] and combine residuated connectives with non-residuated ones in the more general framework proposed here. The main property we want our formulas to preserve is that we can always uniquely define a linear order on the string segments (pairs of position variables) used in the formulas of first-order linear logic. This is already somewhat of a shift with respect to standard first-order linear logic: an atomic formula p(x0, x1, x2, x3) represents two string segments x0, x1 and x2, x3 without any claims about the relative order of these two segments. This gives us the freedom to build these two strings independently and let other lexical items in the grammar decide in which relative order these two segments will ultimately appear in the derived string. Adding the linear order requirement requires us to add an explicit relation between these two segments (either x1 ≤ x2, for the linear order x0, x1, x2, x3, or x3 ≤ x0 for the linear order x2, x3, x0, x1). It is possible to define residuated connectives for string segments which are not linearly ordered. However, we would then be limited by the fact that any connective which linearises such segments (by ordering some of the previously unordered segments) would not be residuated. For example, suppose we want to define a connective combining two unordered string segments x0, x1 and x2, x3 by concatenating them to (or 'wrapping' them around) a segment x1, x2, producing the complex segment x0, x3. This would entail the linear order to be x0, x1, x2, x3, and therefore the two segments x0, x1 and x2, x3 assigned to one of the residuals must be linearly ordered as well, simply because the alternative order x2, x3, x0, x1 has become incompatible with the linear order after concatenation. A restriction to residuated connectives therefore sacrifices some flexibility for writing grammars in first-order linear logic. We will return briefly to this point in the discussion of Sect. 7.
4.1 Residuation for the Lambek Calculus Revisited
We have already looked at the Lambek calculus connectives and their translation into linear logic from the point of view of residuation. Figure 6 presents a simplified version of Fig. 2. It focuses only on the position variables, which have been placed at the appropriate points in the triangle. Each variable occurs on exactly two of the three points of the triangle. The place where a variable is absent determines the quantifier: '∃' for '⊗' (that is, the bottom
Fig. 6 Lambek calculus residuation translated into first-order linear logic
node), and '∀' for the two '⊸' nodes (the two top nodes). Downwards movement—from A and B to A ⊗ B, from A and A ⊸ C to C, and from B ⊸ C and B to C—corresponds to concatenation: we combine a first string with left position X and right position Y with a second string with left position Y and right position Z to form a new string starting at the left position X of the first and ending at the right position Z of the second. A variable shared between the bottom position and one of the top positions of the figure must appear in both of these in either a left position or a right position (as, respectively, variables X and Z in Fig. 6). A variable shared among the two top positions must appear in a right position in one and a left position in the other. Variable Y in the figure illustrates this case. Seen from the point of view of string segments, the bottom element contains exactly the combination of the string segments of the left and right elements, with some of them (that is, those positions occurring both left and right) concatenated.
4.2 Partial Orders As a general principle, we want the left-to-right order of the position variables and constants to be globally coherent. This means that we do not want X to be left of Y at one place and to the right of it at another (at least not unless they are equal). Formally, this means that the variables in a formula and in a proof are partially ordered. More precisely, we have only argued for antisymmetry (that is X ≤ Y and Y ≤ X entail X = Y ). To be a partial order, we also need reflexivity (X ≤ X ) and transitivity (X ≤ Y and Y ≤ Z entail X ≤ Z , or, in our terms: if X occurs to the left of Y and Y occurs to the left of Z then X occurs to the left of Z ). We can add explicit partial order constraints to first-order linear logic, where a lexical entry specifies explicitly how some of its variables are ordered. In a system with explicit partial order constraints, a sequent is derivable if it is derivable in firstorder linear logic (as before) but also satisfies all lexical constraints on the partial order. We will see in what follows how this can be useful. Instead of using partial order constraints to obtain extra expressivity, we can also see it as a way of improving efficiency. For example, when we look at a sentence like.
1. John gave Mary flowers.
With formulas np, ((np\s)/np)/np, np, and np, we obtain the formula np(0, 1) for "John" and

∀Y.[np(2, Y) ⊸ ∀Z.[np(Y, Z) ⊸ ∀X.[np(X, 1) ⊸ s(X, Z)]]]

for "gave" (using the standard Lambek calculus translation). This produces the orders 0 < 1 for "John" and X ≤ 1 < 2 ≤ Y ≤ Z for "gave". Without any partial order constraints, it would be possible to identify np(0, 1) with np(Y, Z). With the constraint, this would fail, since unifying Y with 0 would entail 2 ≤ 0, contradicting 0 < 2. We will give a more detailed and interesting example in Sect. 5.3.
The residuation principle for generalised forms of concatenation requires us to be able to uniquely reconstruct the linear order of any of the three elements in a residuated triple based on the linear order of the two others. As we will see, for three position variables and two string segments, the Lambek calculus connectives are the only available residuated triple. But what happens when we increase the number of variables, and thereby the number of string positions? Figure 7 shows two solutions with four position variables. The residuated triple at the top represents an infixation connective A\3a C and a circumfixion connective C/3a B. Note that since this last connective is represented by the pair of white rectangles, it positions itself 'around' the B formula. The infixation operation corresponds, at the string level, to the adjoining operation of tree adjoining grammars [10] and to the simplest version of the discontinuous connectives of Morrill et al. [23]. Given the concatenation operation, we can obtain its residuals by plugging them into the Application/Co-Application principles and adding the required quantifiers to make them derivable. However, the general principle is very simple and we can 'read off' the definitions directly (although the reader is invited to verify that all the Application/Co-Application principles hold). For the topmost residuated triple this gives the following definition (this connective is labeled 3a to indicate it is the first connective with 3 string segments).

‖A •3a B‖^{x0,x3} = ∃x1, x2.[ ‖A‖^{x0,x1,x2,x3} ⊗ ‖B‖^{x1,x2} ]
‖A \3a C‖^{x1,x2} = ∀x0, x3.[ ‖A‖^{x0,x1,x2,x3} ⊸ ‖C‖^{x0,x3} ]
‖C /3a B‖^{x0,x1,x2,x3} = ‖B‖^{x1,x2} ⊸ ‖C‖^{x0,x3}

We can see that the patterns are very similar to the translation of the Lambek calculus connectives: the variables shared between A and B (in the current case x1 and x2) are quantified existentially for the A ⊗ B case, the variables shared between A and C are quantified universally for the A ⊸ C case (x0 and x3 here), and the variables shared between B and C (none for this case) are quantified universally for the B ⊸ C case. In total each variable is quantified in exactly one of the translation cases.
The residuated triple at the bottom of Fig. 7 assigns positions x0, x1 to its A formula and positions x2, x3 to its B formula. In this case, the positions assigned to A ⊗ B are underdetermined: we can say that nothing is known about the relation between x1 and x2, or between x0 and x3. This case therefore explicitly requires an additional partial order constraint to be a residuated triple. The recursive definitions are as follows.
Fig. 7 Two families of connectives with three segments and four position variables
‖A •3b B‖^{x0,x1,x2,x3} = ‖A‖^{x0,x1} ⊗ ‖B‖^{x2,x3}
‖A \3b C‖^{x2,x3} = ∀x0, x1.[ ‖A‖^{x0,x1} ⊸ ‖C‖^{x0,x1,x2,x3} ]
‖C /3b B‖^{x0,x1} = ∀x2, x3.[ ‖B‖^{x2,x3} ⊸ ‖C‖^{x0,x1,x2,x3} ]
The key case is A •3b B, where there would be a loss of information in the information passed to the two subformulas without the additional constraint that x1 ≤ x2. Now it may seem that this connective is just a formal curiosity. However, it is essentially this pattern, notably the A\3b C connective, which figures in the analysis of the well-known crossed dependencies for Dutch verb clusters of Morrill et al. [23].
5 The General Case Given linear order of the string position variables, each additional string variable increases the number of possible connectives. We have seen the case for three position variables (the Lambek calculus connectives) and the two residuated triples for four position variables. Are these the only possibilities? And, more generally, how many residuated connectives exist for k position variables. We want our residuated triples to combine two sequences of components, one containing elementary segments labeled a (corresponding to the left residual) and the other containing elementary segments labeled b (corresponding to the right residual) while allowing an ‘empty’ component between two other components (but not at the beginning or end of a generalised concatenation). Residuated triples can use the
Fig. 8 Finite state automaton of concatenation-like operations
‘empty’ segment 1, which corresponds to a sort of placeholder or hole for another segment. 1. the first segment must be a (concatenations with b as first segment are obtained by left-right symmetry of the residuated triple), 2. there can be no consecutive a segments (that it, if two a segments have already been concatenated, we ‘lose’ the internal structure), 3. for the same reasons, there can be no consecutive b segments, 4. consecutive 1 segments do not increase expressivity and are therefore excluded, 5. there must be at least one b segment, 6. the last segment cannot be 1 (and, as a consequence of item 1, neither can the first segment). The finite state automaton shown in Fig. 8 generates all strings which satisfy these requirements. From the start state q0 , the only valid symbol is a. The condition that we cannot repeat the last symbol then ensures that the states where the last symbol was a (states q1 and q4 ) can only continue with a 1 or a b symbol. Similarly, the states where the last symbol was 1 (states q2 and q5 ) can only continue with an a or a b symbol, and the state where the last symbol was b (state q3 ) can only continue with 1 or a. Finally, the states q3 , q4 and q5 denote the states where we have seen at least one b symbol. These are accepting states except for q5 (because its last symbol is 1). We can now show that this machine generates only one two-symbol string ab (corresponding to three string positions and to the simple concatenation of a and b) and two three-symbol strings (with four string positions, namely a1b and aba). Table 6 shows the concatenation-like operations definable with two, three, and four total string segments. The a segments correspond to empty rectangles, the b segments to filled rectangles and the 1 segments to empty spaces between the other segments. We can read off the free variables and their linear order for each of the subformulas of a residuated triple.
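Since the automaton of Fig. 8 is small, its language can also be checked by brute force. The sketch below is my own illustration (not code accompanying the paper); it hard-codes the transitions described above and enumerates the accepted segment patterns of a given length, which is one way of reproducing the entries of Tables 6 and 7.

TRANSITIONS = {
    'q0': {'a': 'q1'},
    'q1': {'1': 'q2', 'b': 'q3'},
    'q2': {'a': 'q1', 'b': 'q3'},
    'q3': {'1': 'q5', 'a': 'q4'},
    'q4': {'1': 'q5', 'b': 'q3'},
    'q5': {'a': 'q4', 'b': 'q3'},
}
ACCEPTING = {'q3', 'q4'}

def patterns(k):
    """All accepted strings of length k (k = total number of string segments)."""
    results = []
    def step(state, word):
        if len(word) == k:
            if state in ACCEPTING:
                results.append(word)
            return
        for symbol, target in TRANSITIONS[state].items():
            step(target, word + symbol)
    step('q0', '')
    return results

print(patterns(2))   # ['ab']
print(patterns(3))   # ['a1b', 'aba']
print(len(patterns(4)), len(patterns(5)))   # 5 10, matching Tables 6 and 7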
Table 6 Concatenation-like operations for two to four string segments
For example, the A (and B ⊸ C) segments of the first item with four segments correspond to a formula with free variables x0, x1, x2, x3 (in that linear order) whereas the B (and A ⊸ C) formula corresponds to a formula with free variables x1, x2, x3, x4. Finally, the result of the concatenation formula C (and A ⊗ B) corresponds to variables x0, x4, with three separate concatenation operations. We concatenate a1a to b1b to produce abab. The number of variables shared by the left branch A and the right branch B corresponds to the number of concatenations of elementary segments. If we name this residuated triple 4a, its recursive definition is as follows.

‖A •4a B‖^{x0,x4} = ∃x1, x2, x3.[ ‖A‖^{x0,x1,x2,x3} ⊗ ‖B‖^{x1,x2,x3,x4} ]
‖A \4a C‖^{x1,x2,x3,x4} = ∀x0.[ ‖A‖^{x0,x1,x2,x3} ⊸ ‖C‖^{x0,x4} ]
‖C /4a B‖^{x0,x1,x2,x3} = ∀x4.[ ‖B‖^{x1,x2,x3,x4} ⊸ ‖C‖^{x0,x4} ]
As another example, the fourth item with four segments (and five variables) assigns the A (and B ⊸ C) segments the sequence of variables x0, x1, x2, x3, the B (and A ⊸ C) formula the variables x3, x4, and the C (and A ⊗ B) formula the variables x0, x1, x2, x4. If we name this residuated triple 4d, we obtain the following recursive definitions.
Table 7 Concatenation-like operations for five string segments
‖A •4d B‖^{x0,x1,x2,x4} = ∃x3.[ ‖A‖^{x0,x1,x2,x3} ⊗ ‖B‖^{x3,x4} ]
‖A \4d C‖^{x3,x4} = ∀x0, x1, x2.[ ‖A‖^{x0,x1,x2,x3} ⊸ ‖C‖^{x0,x1,x2,x4} ]
‖C /4d B‖^{x0,x1,x2,x3} = ∀x4.[ ‖B‖^{x3,x4} ⊸ ‖C‖^{x0,x1,x2,x4} ]
Table 7 shows the concatenation-like operations definable with five string segments. We give an example of only one of these, because it illustrates a new pattern. As we have seen, some concatenation-like operations require additional order constraints to uniquely define a linear order, for each subformula, on all variables occurring exactly once in this subformula. This was the case for the second possibility with three segments, where we could not infer the order between the A segment x0 , x1 and the B segment x2 , x3 without explicitly requiring x1 ≤ x2 . The second item of Table 7, 5b, shows a different type of underdetermination. When we give the translation of the table entry into a residuated triple 5b, we obtain the following.
‖A •5b B‖^{x0,x3,x4,x5} = ∃x1, x2.[ ‖A‖^{x0,x1,x2,x3,x4,x5} ⊗ ‖B‖^{x1,x2} ]
‖A \5b C‖^{x1,x2} = ∀x0, x3, x4, x5.[ ‖A‖^{x0,x1,x2,x3,x4,x5} ⊸ ‖C‖^{x0,x3,x4,x5} ]
‖C /5b B‖^{x0,x1,x2,x3,x4,x5} = ‖B‖^{x1,x2} ⊸ ‖C‖^{x0,x3,x4,x5}

The problematic connective here is C/5b B. The order information of its subformulas B and C does not allow us to unambiguously reconstruct the full order: it is compatible with an alternative linear order x0, x3, x4, x1, x2, x5, which is the sixth entry 5f in Table 7. The left residuals of 5b and 5f cannot be distinguished without an explicit constraint on the linear order for the left residual. In the case above, we need to explicitly state that x0 ≤ x1 and x2 ≤ x3 (technically, since x0 is the leftmost element of the triple, the first constraint is superfluous).
5.1 How Many Residuated Connectives Are There for Concatenation-Like Operations? Since the finite state automaton of Fig. 8 is deterministic, each transition produces a symbol and it is therefore easy to use the automaton to enumerate the number of strings3 of a certain length k. We can also use the machine to directly compute the number of words, either by using a standard dynamic programming approach or by solving the linear recurrence specified by the automaton to produce a closed form. For example, there is a single length 1 path to q1 (the path from the start state q0 ). For paths of length greater than 1, the number of paths to q1 of length k is equal to the number of paths of length k − 1 to q2 . In general, the number of paths of length k to a state is the sum of the paths of length k − 1 which can reach this state in one step. Writing out the full definition then gives the following set of linear recurrences, where p[Q][K ] denotes the number of paths of length K which reach state Q. In addition, p[k] denotes the number of accepting paths of length k and it is the sum of the number of paths to the two accepting states q3 and q4 . p[q1 ][1] = 1 p[q1 ][k] = p[q2 ][k − 1] p[q2 ][k] = p[q1 ][k − 1] p[q3 ][k] = p[q4 ][k − 1] + p[q5 ][k − 1] + p[q1 ][k − 1] + p[q2 ][k − 1] p[q4 ][k] = p[q3 ][k − 1] + p[q5 ][k − 1] p[q5 ][k] = p[q3 ][k − 1] + p[q4 ][k − 1] 3 In
the literature on finite state automata it is common to refer to sequences of symbols produced by such an automaton as “words”. However, we reserve “words” to refer to elements in the lexicon of a type-logical grammar and exclusively use “string” for a sequence of symbols produced by a finite state automaton.
p[k] = p[q3 ][k] + p[q4 ][k]
We can simplify these equations by observing that for each k there is exactly one path of length k arriving at q3 from either q2 (for k − 1 even) or q1 (for k − 1 odd). So we can simplify p[q1][k − 1] + p[q2][k − 1] to 1. In addition, because of the symmetries in the automaton, there are exactly as many paths reaching q4 as there are reaching q5 for any k, so we can replace p[q5][k] by p[q4][k] without changing the results. This simplifies the equations as follows.

p[q3][0] = p[q3][1] = 0
p[q4][0] = p[q4][1] = p[q4][2] = 0
p[q3][k] = 2 ∗ p[q4][k − 1] + 1    (k > 1)
p[q4][k] = p[q3][k − 1] + p[q4][k − 1]    (k > 2)
p[k] = p[q3][k] + p[q4][k]
We can now show the following.

p[q3][k] = p[q4][k]    (for k odd)    (9)
p[q3][k] = p[q4][k] + 1    (for k even and ≥ 2)    (10)
This is an easy induction: it is trivially true for k = 1. Now assume Eqs. 9 and 10 hold for all k′ < k. If k is even, k − 1 is odd, and the induction hypothesis gives us p[q3][k − 1] = p[q4][k − 1] and we need to show that p[q3][k] = p[q4][k] + 1, given k ≥ 2. Using p[q3][k − 1] = p[q4][k − 1], we can simplify p[q4][k] = p[q3][k − 1] + p[q4][k − 1] to p[q4][k] = 2 ∗ p[q4][k − 1]. But since p[q3][k] = 2 ∗ p[q4][k − 1] + 1, we have therefore shown that p[q3][k] = p[q4][k] + 1. If k is odd, k − 1 is even, and the induction hypothesis gives us p[q3][k − 1] = p[q4][k − 1] + 1. We have already verified k = 1, so we only need to verify k ≥ 3. Again, using p[q3][k − 1] = p[q4][k − 1] + 1 to substitute p[q4][k − 1] + 1 for p[q3][k − 1] in the equation for p[q4][k] produces p[q4][k] = 2 ∗ p[q4][k − 1] + 1 and we have therefore shown that p[q3][k] = p[q4][k] as required. We can use Eqs. 9 and 10 to further simplify the machine equations and end up with the following. For k odd, we have

p[q4][k] = 2 ∗ p[q4][k − 1] + 1
p[q3][k] = 2 ∗ p[q3][k − 1] − 1

and therefore
p[k] = 2 ∗ p[q4][k − 1] + 1 + 2 ∗ p[q3][k − 1] − 1 = 2 ∗ p[k − 1]

For k even and ≥ 2, we have

p[q4][k] = 2 ∗ p[q4][k − 1]
p[q3][k] = 2 ∗ p[q3][k − 1] + 1

and therefore

p[k] = 2 ∗ p[q4][k − 1] + 2 ∗ p[q3][k − 1] + 1 = 2 ∗ p[k − 1] + 1

The number of residuated connectives definable in first-order linear logic with partial order constraints therefore corresponds to sequence A000975 of the Online Encyclopedia of Integer Sequences [26]. This gives us the following sequence of numbers of residuated triples: 0, 1, 2, 5, 10, 21, 42, 85, 170, 341, 682, . . . for 1, 2, 3, . . . total string components and for 2, 3, 4, . . . total string positions.4
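As a quick numerical sanity check of the counting argument (my own verification, not part of the paper), the simplified recurrence can be iterated directly:

def residuated_triples(k_max):
    """Number of residuated triples for 1..k_max total string segments,
    using the simplified recurrence for p[q3] and p[q4] above."""
    q3 = [0] * (k_max + 1)
    q4 = [0] * (k_max + 1)
    for k in range(2, k_max + 1):
        q3[k] = 2 * q4[k - 1] + 1
        q4[k] = q3[k - 1] + q4[k - 1]
    return [q3[k] + q4[k] for k in range(1, k_max + 1)]

print(residuated_triples(11))
# [0, 1, 2, 5, 10, 21, 42, 85, 170, 341, 682]  -- sequence A000975, as stated above
# (equivalently, the value for k segments is 2**k // 3)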
5.2 Well-Nestedness One important property often imposed on linguistic formalisms is the property of well-nestedness [11]. In the current context, this means that with respect to the finite state automaton of Fig. 8, we restrict ourselves to those paths where, whenever we encounter an a symbol after a b, there can be no further b symbols (see Fig. 9). In other words, the bs are sandwiched between the as, but not inversely. The simplest non-wellnested combination is abab. We can write out the linear recurrences as before. The number of paths to q3 and q5 are easily established to be the following (Fig. 9).
4 A closed form solution for this recurrence is the following [26]: p[n] = ⌈2(2^(n−1) − 1)/3⌉.
Fig. 9 Variant of the finite state automaton of Fig. 8 for well-nested operations
p[q3][2k] = k
p[q3][2k + 1] = k
p[q5][2k] = k − 1    (k > 0)
p[q5][2k + 1] = k
Then, given that p[q6][n] = p[q4][n − 1], we can establish the number of paths to q4 as follows.

p[q4][2k] = p[q3][2k − 1] + p[q5][2k − 1] + p[q4][2(k − 1)]
p[q4][2k + 1] = p[q3][2k] + p[q5][2k] + p[q4][2(k − 1) + 1]

Simplifying the above recurrence with the calculated values for q3 and q5 produces the following.

p[q4][2k] = p[q4][2(k − 1)] + 2(k − 1) = k(k − 1)
p[q4][2k + 1] = p[q4][2(k − 1) + 1] + 2k − 1 = k²
= p[q3 ][2k] + p[q4 ][2k]
= k + k(k − 1) p[2k + 1] = p[q3 ][2k + 1] + p[q4 ][2k + 1] = k + k2
= k2 = k(k + 1)
An alternative way to state this same solution is the following.
p[n] = ⌊n/2⌋ ∗ ⌈n/2⌉
Accordingly, the number of well-nested residuated connectives is the following 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36, . . . for 1, 2, 3, 4, 5, . . . segments and 2, 3, 4, 5, 6, . . . string position variables. This corresponds to sequence A002620 of the Online Encyclopedia of Integer Sequences [26]. As a sanity check, we can verify that 4 out of 5 of the four segment possibilities of Table 6 are well-nested (only 4a is not) whereas 6 out of 10 of the five segment possibilities of Table 7 are well-nested (the exceptions being 5a, 5c, 5d, and 5h).
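The same kind of check works for the well-nested case (again my own sketch, not code from the paper): enumerate all patterns satisfying the conditions of Sect. 5, drop those containing a b after an a that itself follows a b, and compare with ⌊n/2⌋ ∗ ⌈n/2⌉.

from itertools import product

def is_valid(w, well_nested=False):
    if w[0] != 'a' or w[-1] == '1' or 'b' not in w:
        return False
    if any(x == y for x, y in zip(w, w[1:])):          # no repeated adjacent segments
        return False
    if well_nested:
        seen_b = seen_b_then_a = False
        for s in w:
            if s == 'b':
                if seen_b_then_a:                      # a b after an a that follows a b
                    return False
                seen_b = True
            elif s == 'a' and seen_b:
                seen_b_then_a = True
    return True

def count(n, well_nested=False):
    return sum(is_valid(w, well_nested)
               for w in map(''.join, product('a1b', repeat=n)))

print([count(n, well_nested=True) for n in range(1, 9)])
# [0, 1, 2, 4, 6, 9, 12, 16]  -- equal to ⌊n/2⌋·⌈n/2⌉, sequence A002620
print([count(n) for n in range(1, 9)])
# [0, 1, 2, 5, 10, 21, 42, 85] -- the unrestricted counts of Sect. 5.1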
5.3 Partial Order Constraints in Practice
As an example, we will give an analysis of the sentence 'John left before Mary did' based on the analysis of Morrill et al. [23]. We assign 'John' and 'Mary' the formulas np(0, 1) and np(3, 4) respectively (based on their positions in the string). We assign 'left' the formula np\s, which at positions 1, 2 translates to ∀A.[np(A, 1) ⊸ s(A, 2)]. We assign 'before' the formula ((np\s)\(np\s))/s (that is, it selects a sentence to its right and a vp = np\s to its left to return a vp). This translates to the following formula.

∀B.[s(3, B) ⊸ ∀D.[∀x0.[np(x0, D) ⊸ s(x0, 2)] ⊸ ∀C.[np(C, D) ⊸ s(C, B)]]]

Finally, the complicated formula is assigned to 'did'. In terms of the residuated connectives it is assigned the formula ((vp/3a vp)/vp)\4d (vp/3a vp). As a reminder, we restate the relevant translations of the connectives occurring in this formula.

‖C / B‖^{x0,x1} = ∀x2.[ ‖B‖^{x1,x2} ⊸ ‖C‖^{x0,x2} ]
‖C /3a B‖^{x0,x1,x2,x3} = ‖B‖^{x1,x2} ⊸ ‖C‖^{x0,x3}
‖A \4d C‖^{x3,x4} = ∀x0, x1, x2.[ ‖A‖^{x0,x1,x2,x3} ⊸ ‖C‖^{x0,x1,x2,x4} ]

Given these translations, we can translate this formula into first-order linear logic as follows.

‖((vp/3a vp)/vp)\4d (vp/3a vp)‖^{4,5}
= ∀F, I, J.[ ‖(vp/3a vp)/vp‖^{F,I,J,4} ⊸ ‖vp/3a vp‖^{F,I,J,5} ]
= ∀F, I, J.[ ∀x1.[ ‖vp‖^{4,x1} ⊸ ‖vp/3a vp‖^{F,I,J,x1} ] ⊸ [ ‖vp‖^{I,J} ⊸ ‖vp‖^{F,5} ] ]
= ∀F, I, J.[ ∀x1.[ ‖vp‖^{4,x1} ⊸ [ ‖vp‖^{I,J} ⊸ ‖vp‖^{F,x1} ] ] ⊸ [ ‖vp‖^{I,J} ⊸ ‖vp‖^{F,5} ] ]
We have left the final vp = np\s subformulas untranslated. We can see that except for some fairly complicated manipulations with string positions, to which we will return shortly, the formula simply indicates it selects a function from two vp's into a single vp to become a vp modifier. Given these translations, Fig. 10 shows the formula unfolding for the sentence 'John left before Mary did'. Each node indicates the corresponding linear order on the variables occurring once in this subformula. The complex formula for 'did' has many branchings but referring back to the position variables allows us to identify which node corresponds to which subformula in the translation. For example, the node labeled F, I, J, x1 corresponds to (the leftmost occurrence of) the formula vp/3a vp. Table 8 shows the possible matchings between positive and negative atomic formulas (the filled rectangles represent the solution which will be discussed below). The rows of the table represent the choices for the positive formulas, whereas the columns represent the choices for the negative formulas. The positive s(0, 5) formula represents the conclusion, the other positive formulas are those which are premisses of their link. Each of the candidate proof structures for the goal sequent is one of the perfect matchings of the positive with the negative formulas. However, since there are n! matchings, brute force search is to be avoided as much as possible. Just for the current example, there are 5! = 120 choices for the np formulas and the same number of choices for the s formulas. Given that these choices are independent, this amounts to a total of 14,400 different possible proof structures. Fortunately, there are quite a number of constraints on the possible connections in the proof structure. The partial order constraints are one of those. Figure 11 summarises the partial order constraints for the structure of Fig. 10. The partial order constraints allow us to avoid connecting s(3, B) to s(A, 2) since it fails both the 3 ≤ B constraint (when unifying B to 2) and the A ≤ 1 constraint (when unifying A to 3). A slightly less obvious connection which fails the constraint is the connection between s(x2, x1) and s(H, J). Here we have J ≤ 4, but also 4 < x1. Unifying J to x1 would therefore produce the contradictory x1 ≤ 4 and 4 < x1. Many potential axiom connections are excluded by a simple failure of unification between the two atoms: the positive atom s(x2, x1) cannot connect either to s(A, 2) or to s(E, 5) (since x1 does not unify with either 2 or 5). Finally, the contractability condition excludes many other connections. The metavariables F, I, and J have free occurrences at many nodes, including at the nodes to which the arrows of the universal links for x1, x2 and x3 point. This notably means none of F, I, and J can unify with any of x1, x2 and x3 without violating the contraction condition. Similarly, x0 cannot unify with B, C, or D. In general, the eigenvariable of a universal link can never appear on the 'wrong' side of its link (the part to which the arrow points), since this would correspond to a violation of the eigenvariable condition in the sequent calculus. Now, returning to our proof structure, we can see there is only a single possibility for the positive atomic formula s(x2, x1). We have already seen that s(A, 2) and s(E, 5) do not unify and that s(H, J) fails on the partial order constraint. This leaves only s(C, B) and s(G, x1).
However, s(G, x1 ) fails on the proof net condition: unifying G to x2 produces an occurrence of x2 on the 4, x1 node of the proof structure
Fig. 10 Proof structure formed from the formula unfolding for ‘John left before Mary did’
Table 8 Possible axiom connections for the proof structure in Fig. 10, with the columns representing the negative occurrences and the rows the positive ones
Fig. 11 The partial order constraints corresponding to the proof structure of Fig. 10
above the G link (since it is on the existential frontier of G). And a reduction of the par link requires an identification of this node with the F, I, J, x1 node, thereby producing an occurrence of x2 on the wrong side of its universal link. Therefore, the only possible connection for s(x2 , x1 ) is to s(C, B), unifying C = x2 and B = x1 . This fills in the first cell labeled 1 of Table 8. This unification then turns the positive np(C, D) formula into np(x2 , D) which can only unify with np(x2 , F), filling cell 2 of the table. We can now turn to the goal formula s(0, 5). Since we have already connected the s(C, B) formula to s(x2 , x1 ) this option is no longer available, and the s(A, 2) and s(G, x1 ) options are excluded by failure of unification. Finally, s(H, J ) is excluded because the J ≤ 4 partial order constraint would contradict unifying J to 5. This leaves only the s(E, 5) possibility, unifying E to 0, as indicated by cell 3 of the table. After these unifications the negative np(E, F) has become np(0, D) which only unifies with np(0, 1), instantiating D to 1, and filling cell 4 of the table. We have now essentially solved the linking problem and the remaining s connections can only be made in a single way, filling cells 5–7 in the table. Following that, we can apply similar reasoning to the np connections and fill the remaining cells (cells 8–10).
What we have shown is that even for a quite complex proof structure such as the one in Fig. 10, the partial order constraints combined with the proof net conditions can allow us to produce the unique solution while avoiding all backtracking. Given the essentially non-deterministic nature of natural language parsing (sentences can have multiple readings and our parser should therefore produce as many proofs), we will in many cases be required to use some form of backtracking. But this example gives an illustration of how powerful the combined constraints are.
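The two filters used in this example, unification failure and violation of the partial order, are easy to prototype. The sketch below is my own illustration of the idea and not the parser used for the paper (in particular, it ignores the contractability condition): a candidate axiom link is kept only when the two atoms unify and the resulting substitution leaves the collected order constraints consistent.

def consistent(pairs):
    """pairs: set of (x, y, strict) meaning x ≤ y or x < y.
    Inconsistent iff the transitive closure contains a strict cycle."""
    closure = set(pairs)
    ints = sorted({v for (x, y, _) in pairs for v in (x, y) if isinstance(v, int)})
    closure |= {(i, j, True) for i, j in zip(ints, ints[1:])}   # 0 < 1 < 2 < ...
    changed = True
    while changed:
        changed = False
        for (a, b, s1) in list(closure):
            for (c, d, s2) in list(closure):
                if b == c and (a, d, s1 or s2) not in closure:
                    closure.add((a, d, s1 or s2))
                    changed = True
    return not any(s and x == y for (x, y, s) in closure)

def link_ok(pos, neg, constraints):
    """pos, neg: (predicate, argument tuple); constraints: set of (x, y, strict)."""
    if pos[0] != neg[0]:
        return False
    subst = {}
    for a, b in zip(pos[1], neg[1]):
        a, b = subst.get(a, a), subst.get(b, b)
        if a == b:
            continue
        if isinstance(a, int) and isinstance(b, int):
            return False                       # two distinct position constants
        var, val = (b, a) if isinstance(a, int) else (a, b)
        subst[var] = val
    return consistent({(subst.get(x, x), subst.get(y, y), s)
                       for (x, y, s) in constraints})

# s(3, B) cannot link to s(A, 2): it forces B = 2 and A = 3,
# contradicting 3 ≤ B and A ≤ 1 (as discussed above).
print(link_ok(('s', (3, 'B')), ('s', ('A', 2)), {(3, 'B', False), ('A', 1, False)}))   # False
# s(x2, x1) cannot link to s(H, J): J ≤ 4 and 4 < x1 become contradictory.
print(link_ok(('s', ('x2', 'x1')), ('s', ('H', 'J')), {('J', 4, False), (4, 'x1', True)}))   # False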
6 The Empty String
Up until now, we have not explicitly allowed string segments to be empty. However, there are some well-known applications of the empty string, notably the treatment of extraction in variants of the Lambek calculus. We can add a variant of extraction as a residuated pair as follows.

‖A ⊸ C‖^{y,z} = ∀x.[ ‖A‖^{x,x} ] ⊸ ‖C‖^{y,z}
‖A ⊗ B‖^{y,z} = ∀x.[ ‖A‖^{x,x} ] ⊗ ‖B‖^{y,z}

Even though this works in many cases, there is a potential problem here: suppose the extracted element is a vp, that is the Lambek calculus formula np\s, with the standard translation into first-order linear logic of ∀x0.[np(x0, x1) ⊸ s(x0, x2)] corresponding to a vp at positions x1, x2. When we plug this formula into the A argument of the implication selecting an empty argument, this results in unifying x1 and x2, producing the formula

∀x1∀x0.[np(x0, x1) ⊸ s(x0, x1)]

for this extracted vp. Compare this to an extracted formula corresponding to s/np. It would be translated into ∀x2.[np(x1, x2) ⊸ s(x0, x2)] at positions x0, x1. Turning this into the empty string identifies x0 with x1, producing the following.

∀x1∀x2.[np(x1, x2) ⊸ s(x1, x2)]

The problem now is that this is equivalent to the formula for the extracted vp we computed before!
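This collapse can be verified mechanically. The following is a toy check of my own (not from the paper): treating the outermost block of universal quantifiers as a set and renaming bound variables by order of first occurrence makes the two gap formulas literally identical.

def canonical(prefix_vars, matrix):
    """Rename the ∀-prefix variables of a formula by order of first use in its matrix."""
    order = []
    def scan(t):
        if isinstance(t, tuple):
            for s in t:
                scan(s)
        elif t in prefix_vars and t not in order:
            order.append(t)
    scan(matrix)
    renaming = {v: f"v{i}" for i, v in enumerate(order)}
    def rename(t):
        if isinstance(t, tuple):
            return tuple(rename(s) for s in t)
        return renaming.get(t, t)
    return rename(matrix)

# ∀x1∀x0.[np(x0,x1) ⊸ s(x0,x1)]   -- the 'empty' np\s
vp_gap  = canonical({'x0', 'x1'},
                    ('lolli', ('np', 'x0', 'x1'), ('s', 'x0', 'x1')))
# ∀x1∀x2.[np(x1,x2) ⊸ s(x1,x2)]   -- the 'empty' s/np
snp_gap = canonical({'x1', 'x2'},
                    ('lolli', ('np', 'x1', 'x2'), ('s', 'x1', 'x2')))
print(vp_gap == snp_gap)   # True: the two gaps can no longer be told apart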
Though it would seem that there is not much of a difference between concatenating the empty string to the left or to the right of an np constituent, there should be a difference in behaviour between an np\s gap and an s/np gap: for example, the first, but not the second, can be modified by a subject-oriented adverb of type (np\s)\(np\s). The naive first-order translation fails to make this distinction. There is a solution, and it consists of moving the universal quantifier out. Instead of the universal quantifier having only the A formula as its scope, we turn it into an existential quantifier which has the entire A ⊸ C formula as its scope, as follows.

‖A ⊸ C‖^{y,z} = ∃x.[ ‖A‖^{x,x} ⊸ ‖C‖^{y,z} ]

This allows us to correctly distinguish these two cases, but at the price of no longer having a residuated pair for the extraction phenomena.5
5 This analysis also makes an unexpected empirical claim: the treatment of parasitic gapping in type-logical grammars using the linear logic exponential '!' would require the exponential to have scope over the quantified variable representing the empty string. We therefore need to claim that parasitic gapping can only happen with atomic formulas.
7 Discussion
One obvious aspect of first-order linear logic which hasn't been mentioned thus far is that the Horn clause fragment corresponds to a lexicalised version of multiple context-free grammars [21, 28]. Horn clauses for first-order linear logic are of the form ∀x0, . . . , xn.[ p1 ⊗ . . . ⊗ pm ⊸ q ] for predicates pi and q, or equivalently ∀x0, . . . , xn.( p1 ⊸ (. . . ⊸ ( pm ⊸ q))), and they code each segment of an MCFG by a pair of string positions. In the context of MCFG it is well-known that each additional segment increases the generative capacity. When the maximum arity is 2, each predicate has a single segment and we have context-free grammars which allow us to generate languages such as aⁿbⁿ. When the maximum arity is 4, we can generate aⁿbⁿcⁿdⁿ, with maximum arity 6 aⁿbⁿcⁿdⁿeⁿfⁿ, and so on [11]. It is unclear which of these classes best captures the properties we want with respect to the string languages needed for the analysis of natural languages. It is generally assumed that a reasonable minimum is 4 (that is, two string segments per predicate). For example the languages generated by tree adjoining grammars and several similar formalisms are strictly included in this class (more precisely, the tree adjoining languages have the additional constraint of well-nestedness, whereas the multiple context free languages in general do not [27]). It is unclear to me which would be the right number of components to consider. Values between 4 and 6 components would seem to suffice for most applications, and it is unclear whether there are good linguistic reasons for abandoning well-nestedness. The well-nested, residuated connectives seem to be the same as those definable in the Displacement calculus. Indeed, I have elsewhere already implicitly assumed a
linear order for all subproofs when relating the Displacement calculus to first-order linear logic [21]. One interesting area of further investigation would be to relax the linear order constraint. For example, we let our sequent compute a unique partial order over the initial position variables (now no longer linearly ordered) and consider the sentence grammatical when the input string is a valid linearisation of this partial order. This would be potentially interesting for languages with relatively free word order.
8 Conclusions
This paper has discussed several aspects of adding partial order constraints to first-order linear logic. Although somewhat odd from the logical point of view, adding order constraints to the variables in first-order linear logic allows us to preserve the standard algebraic and category theoretic perspectives on type-logical grammars. In addition, some linguistically interesting operations can only be defined as part of a residuated triple when we impose partial order constraints on the string position variables. We have also shown how partial order constraints can be used as a mechanism for improving proof search by filtering out choices inconsistent with this order.
References 1. Areces, C., Bernardi, R., Moortgat, M.: Galois connections in categorial type logic. Electron. Notes Theor. Comput. Sci. 53, 3–20 (2004). https://doi.org/10.1016/S1571-0661(05)82570-8 2. Bellin, G., van de Wiele, J.: Empires and kingdoms in MLL. In: Girard, J.Y., Lafont, Y., Regnier, L. (eds.) Advances in Linear Logic, pp. 249–270. Cambridge University Press, Cambridge (1995) 3. Bernardi, R., Moortgat, M.: Continuation semantics for the Lambek–Grishin calculus. Inf. Comput. 208(5), 397–416 (2010). https://doi.org/10.1016/j.ic.2009.11.005 4. Coecke, B., Grefenstette, E., Sadrzadeh, M.: Lambek vs. Lambek: functorial vector space semantics and string diagrams for Lambek calculus. Ann. Pure Appl. Log. 164(11), 1079– 1100 (2013). https://doi.org/10.1016/j.apal.2013.05.009 5. Danos, V.: La logique linéaire appliquée à l’étude de divers processus de normalisation (principalement du λ-calcul). Ph.D. thesis, University of Paris VII (1990) 6. Danos, V., Regnier, L.: The structure of multiplicatives. Arch. Math. Log. 28, 181–203 (1989). https://doi.org/10.1007/BF01622878 7. Došen, K.: A brief survey of frames for the Lambek calculus. Z. Math. Log. Grundl. Math. 38, 179–187 (1992). https://doi.org/10.1002/malq.19920380113 8. Girard, J.Y.: Quantifiers in linear logic II. In: Corsi, G., Sambin, G. (eds.) Nuovi problemi della logica e della filosofia della scienza, CLUEB, Bologna, Italy, vol. II. (1991). Proceedings of the Conference with the Same Name, Viareggio, Italy (1990) 9. Girard, J.Y.: The Blind Spot: Lectures on Logic. European Mathematical Society, Zürich (2011) 10. Joshi, A., Schabes, Y.: Tree-adjoining grammars. In: Rosenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages 3: Beyond Words, pp. 69–123. Springer, New York (1997)
11. Kallmeyer, L.: Parsing Beyond Context-Free Grammars. Cognitive Technologies. Springer, Berlin (2010) 12. Kubota, Y., Levine, R.: Gapping as like-category coordination. In: Béchet, D., Dikovsky, A. (eds.) Logical Aspects of Computational Linguistics. Lecture Notes in Computer Science, vol. 7351, pp. 135–150. Springer, Nantes (2012). https://doi.org/10.1007/978-3-642-31262-5_9 13. Kubota, Y., Levine, R.: Type-Logical Syntax. MIT Press, Cambridge (2020) 14. Kurtonina, N., Moortgat, M.: Structural control. In: Blackburn, P., de Rijke, M. (eds.) Specifying Syntactic Structures, pp. 75–113. CSLI, Stanford (1997) 15. Lambek, J.: The mathematics of sentence structure. Am. Math. Mon. 65, 154–170 (1958). https://doi.org/10.1080/00029890.1958.11989160 16. Lambek, J.: Categorial and categorical grammars. In: Oehrle, R.T., Bach, E., Wheeler, D. (eds.) Categorial Grammars and Natural Language Structures. Studies in Linguistics and Philosophy, vol. 32, pp. 297–317. Reidel, Dordrecht (1988) 17. Lincoln, P.: Deciding provability of linear logic formulas. In: Girard, J.Y., Lafont, Y., Regnier, L. (eds.) Advances in Linear Logic, pp. 109–122. Cambridge University Press, Cambridge (1995) 18. Lincoln, P., Shankar, N.: Proof search in first-order linear logic and other cut-free sequent calculi. In: Proceedings of Logic in Computer Science (LICS’94), pp. 282–291. IEEE Computer Society Press (1994) 19. Montague, R.: The proper treatment of quantification in ordinary English. In: Thomason, R. (ed.) Formal Philosophy. Selected Papers of Richard Montague. Yale University Press, New Haven (1974) 20. Moortgat, M.: Multimodal linguistic inference. J. Log. Lang. Inf. 5(3–4), 349–385 (1996). https://doi.org/10.1007/BF00159344 21. Moot, R.: Extended Lambek calculi and first-order linear logic. In: Casadio, C., Coecke, B., Moortgat, M., Scott, P. (eds.) Categories and Types in Logic, Language, and Physics: Essays dedicated to Jim Lambek on the Occasion of this 90th Birthday. Lecture Notes in Artificial Intelligence, vol. 8222, pp. 297–330. Springer, Berlin (2014). https://doi.org/10.1007/978-3642-54789-8_17 22. Moot, R., Piazza, M.: Linguistic applications of first order multiplicative linear logic. J. Log. Lang. Inf. 10(2), 211–232 (2001). https://doi.org/10.1023/A:1008399708659 23. Morrill, G., Valentin, O., Fadda, M.: The displacement calculus. J. Log. Lang. Inf. 20(1), 1–48 (2011). https://doi.org/10.1007/s10849-010-9129-2 24. Oehrle, R.T.: Term-labeled categorial type systems. Linguist. Philos. 17(6), 633–678 (1994). https://doi.org/10.1007/BF00985321 25. Oehrle, R.T.: Multi-modal type-logical grammar. In: Borsley, R., Börjars, K. (eds.) Nontransformational Syntax: Formal and Explicit Models of Grammar, pp. 225–267. WileyBlackwell, Hoboken (2011) 26. OEIS Foundation: On-line encyclopedia of integer sequences (OEIS). http://oeis.org (1964). Accessed 23 Jul 2020 27. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars. Theor. Comput. Sci. 88, 191–229 (1991). https://doi.org/10.1016/0304-3975(91)90374-B 28. Wijnholds, G.: Investigations into categorial grammar: symmetric pregroup grammar and displacement calculus. Master’s thesis, Utrecht University (2011)
A Hyperintensional Theory of Intelligent Question Answering in TIL Marie Duží and Michal Fait
Abstract The paper deals with natural language processing and question answering over large corpora of formalised natural language texts. Our background theory is the system of Transparent Intensional Logic (TIL) which is a partial, hyperintensional, typed λ-calculus. Having a fine-grained analysis of natural language sentences in the form of TIL constructions, we apply Gentzen’s system of natural deduction adjusted for TIL to answer questions in an ‘intelligent’ way. It means that our system derives logical consequences entailed by the input sentences rather than merely searching answers by keywords. The theory of question answering must involve special rules rooted in the rich semantics of a natural language, and the TIL system makes it possible to formalise all the semantically salient features of natural languages in a fine-grained way. In particular, since TIL is a logic of partial functions, it is apt for dealing with non-referring terms and sentences with truth-value gaps. It is important because sentences often come attached with a presupposition that must be true so that a given sentence had any truth-value. And since answering is no less important than raising questions, we also propose a method of adequate unambiguous answering questions with presuppositions. In case the presupposition of a question is not true (because either false or ‘gappy’), there is no unambiguous direct answer, and an adequate complete answer is instead a negated presupposition. There are two novelties; one is the analysis and answering of Wh-questions that transform into λ-terms referring to α-objects where α is not the type of a truth-value. The second is integration of special rules rooted in the semantics of natural language into Gentzen’s system of natural deduction, together with a heuristic method of searching relevant sentences in the labyrinth of input text data that is driven by constituents of a given question.
M. Duží (B) · M. Fait VSB-Technical University Ostrava, Department of Computer Science FEI, 17. listopadu 15, 708 33 Ostrava, Czech Republic e-mail: [email protected] M. Fait e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. Loukanova (ed.), Natural Language Processing in Artificial Intelligence—NLPinAI 2020, Studies in Computational Intelligence 939, https://doi.org/10.1007/978-3-030-63787-3_3
Keywords Question answering · Wh-Question · Lambda calculus · Natural deduction · Transparent Intensional Logic TIL · Anaphoric references · Property modifiers · Factive verbs · Partiality · Presuppositions
1 Introduction Formal analysis of interrogative sentences and appropriate answers should not be missing from any formal system dealing with natural language processing because questioning and answering plays an essential role in our communication and has many logically relevant features. To this end, many systems of erotetic logic have been developed.1 In general, these logics specify axioms and rules that are special for questioning and answering. The systems of erotetic logics are valuable, as they render many exciting features of Yes-No questions and answers. However, many other important features of questions stem from their presuppositions.2 Everybody who is at least partially acquainted with the methods applied in the social sciences has heard of the importance to consider the presuppositions of a question in questionnaires. Yet, to our best knowledge, none of the systems of erotetic logic deals with the presuppositions of questions in a satisfactory way. This is unsatisfactory because they fail to consider properly partial functions, which lack a value at some of their arguments. For instance, propositions (in their capacity as truth-bearers) can have truth-value gaps. Moreover, we need question-answering systems that would be able to extract pieces of information from natural language texts and answer not only Yes-No questions but also Wh-questions, which is beyond the capacities of ordinary erotetic logics. In the era of information overload, the systems that can answer questions raised over the large corpora of text data in an ‘intelligent’ way gain more and more interest in the research community. To achieve such a goal, logic and computational linguistics are the disciplines that should work hand in hand in natural language processing and question answering. Moreover, the logical system should make it possible to render all the semantically salient features of natural language in a fine-grained way. We have a suitable system at hand, though. It is Tichý’s Transparent Intensional Logic (TIL), which comes with a procedural semantics that assigns abstract procedures to the terms and expressions of natural language as their meanings. These procedures are rigorously defined as TIL constructions which produce lower-order objects as their products or in well-defined cases fail to produce an object. In case of empirical expressions, the produced entity is a possible-world semantic (PWS) α-intension viewed as a partial function from possible worlds to a function from instants of time to α-typed entities (where α is a placeholder for specific types) such that each world-time pair is taken to at most one value of type α.3 1 See
for instance, [27, 28, 42, 45, 50].
2 See [36].
3 For more details on TIL, see, for instance, [48] or [18].
In this paper, we introduce a system that derives the logical consequences of information recorded in the huge knowledge bases of text data. Thus, the system not only answers the questions by providing explicit knowledge sought by keywords. It answers in an ‘intelligent’ way and computes inferable knowledge [21] such that rational human agents would produce if only this were not beyond their time and space capacities. To this end, we apply Gentzen’s system of natural deduction adjusted to our background theory of Transparent Intensional Logic (TIL). Duží and Horák in [14] introduce the system that applies the goal-driven, backward-chaining strategy of inferring answers by general resolution method adjusted for TIL. It seems to be a natural choice because by applying the goal-driven strategy, we can easily solve the problem of searching for relevant information resources in the huge labyrinth of input data. Yet, a problem arises here, namely the problem of integrating special rules rooted in the rich natural language semantics into the deduction process because input formulas for the resolution method must come in the Skolem clausal form (see [13]). These semantic rules include, inter alia, the rules of left and right subsectivity for property modifiers, the rules for handling non-referring terms and propositions with truth-value gaps, the rules dealing with factive verbs like ‘knowing’ or ‘regretting’, presuppositions of sentences, de dicto vs de re attitudes, and many other. There are two main goals of the paper. The first one is to illustrate how to derive answers deduced from the sentences extracted from natural language texts, and how to integrate the special rules of natural language semantics into the system of Gentzen’s natural deduction. This integration has been dealt with in the paper [12] by the authors published in the proceedings of the ICAART 2020 conference. The second main goal and the novelty of this paper is the analysis of Wh-questions and deducing answers to them. In English, one can find two main types of questions: open-ended and closedended. While open-ended questions have many options, closed-ended questions have simple answers with few options. Here we do not deal with open-ended questions. Rather, we concentrate on closed-ended questions, which are Yes-No questions and simple Wh-questions that are similar to Yes-No questions, but the variety of answers is greater. They are, for instance questions beginning with ‘what’ asking for a thing, ‘when’ asking for time, ‘who’ asking for a person, ‘where’ asking for a place or location of something, ‘why’ asking for a reason, and ‘how’ asking for directions or instructions. The analysis of Wh-questions transforms in our TIL formalism into λ-terms denoting procedures that produce α-intensions where α is not a truth-value. The sought answer should provide an object of type α, which is the value of the α-intension asked for in the world and time of evaluation. The classical system of natural deduction does not make it possible because in such a system we deal only with formulas denoting truth-values. We show how to deal with such λ-terms and how by suitable substitutions obtain the sought values that serve as answers to Whquestions. In terms of practical applications, our theoretical results are being implemented as one of the most important components of an intelligent question-answering system over large corpora of natural language texts. 
To this end we are making use of the Normal Translation Algorithm (NTA) that has been developed in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University of
Brno [29]. NTA is a method that integrates the logical analysis of sentences with a linguistic approach to semantics. The result of NTA so far is a corpus of more than six thousand constructions obtained by the analysis of newspaper sentences, which serve as input for our inference machine. Furthermore, our procedural approach makes it possible to implement the extensional logic of hyperintensions so as to provide relevant information from a wide range of natural-language resources.

The rest of the paper is organized as follows. Section 2 introduces our background system of Transparent Intensional Logic (TIL). In Sect. 3 we briefly describe the rules of natural deduction adjusted to TIL. In Sect. 4 we introduce the semantic rules and their formalization in TIL. Section 5 deals with Wh-questions and answers to them. Section 6 illustrates our method of intelligent question answering by two case studies. Finally, Sect. 7 contains some concluding remarks.
2 Foundations of TIL

TIL is a rich and expressive system with a procedural (as opposed to set-theoretical) semantics that makes it possible to properly analyse, in a fine-grained way, almost all the semantically salient features of natural language. Referring for details to numerous papers (see, e.g., [5, 6, 8, 11, 15, 16, 30]) and in particular to the book [18], we just briefly recapitulate. The meaning of a linguistic term is conceived as an abstract, algorithmically structured procedure encoded by the term, where the structure of the meaning procedure is almost isomorphic to the structure of the term.4 These procedures can be viewed as instructions specifying how, in any possible world and time, to evaluate or compute the object denoted by the term, if any.5 Pavel Tichý, the founder of TIL, dubbed these meaning procedures constructions. The general semantic schema involving the meaning (i.e., a construction) of an expression E, its denotation (i.e., the object, if any, denoted by E), and its reference (i.e., the value of an intension, if the denotation is an intension, in the actual world at the present time) is depicted in Fig. 1. The relation of encoding (or expressing) between an expression and the construction assigned to it as its meaning is semantically primary.
Fig. 1 General semantic schema: the expression E expresses a construction (its meaning); the construction v-constructs the denotation; E denotes this denotation; and, if the denotation is an intension, it has a value at a world/time pair w, t, namely the reference.
4 See [11] for the mereology of abstract procedures.
5 A kindred theory of procedural semantics has been introduced by Moschovakis in [43] and further
developed by Loukanova, see, e.g., [38, 40, 41]. Moschovakis likened those meaning procedures to ‘generalized algorithms’.
Once we have the meaning construction, we can examine which object (the denotation, if any) the construction produces, prove what is entailed by the construction, examine its structure, etc. Denotation is thus semantically secondary. Moreover, the denotation can be missing in the case of meaningful, yet non-denoting terms like, e.g., 'the greatest prime number' or 'tg(π/2)'. Importantly, empirical terms, such as the definite descriptions 'the first player in the WTA singles ranking', 'the president of the Czech Republic' or the predicate 'is a married man', invariably denote the condition that an individual, a set, etc., must satisfy in order to be (in) its extension at the world/time pair of evaluation. We model these conditions as PWS (possible-world semantic) intensions, i.e. functions with the domain of possible worlds. However, the extensions of these intensions in any possible world at any time, a fortiori the extensions in the actual world at the present time, are not a semantic issue; they are empirical facts. We say that a term refers to its extension in a world/time pair of evaluation. Hence, the world/time-relative extensions (if any) of intensions denoted by empirical linguistic terms fall outside the purview of the semantics. The semantics of definite descriptions, predicates, and all other linguistic terms is top-down, for full referential transparency. This is to say that the above semantic schema, which illustrates the relation between a linguistic term, the procedure (construction) that is its meaning, and its denotation (if any), is the same schema in any context, be it a hyperintensional, intensional or extensional context.6 A term or expression expresses a (privileged) construction as its meaning and denotes the entity (if any) that the construction produces.7

6 The three kinds of context have been defined in [18, Sect. 2.6]. See also [16]. Briefly, in a hyperintensional context a construction is an object of predication. In an intensional context, the function produced by a construction is an object of predication, while in an extensional context the value (if any) of the produced function is an object of predication.
7 Indexicals are the only exception: while the sense of an indexical remains constant (a free variable with a type assignment), its denotation varies in keeping with its contextual embedding. See [18, Sect. 3.4].

The denoted entity (if any) can be an object of one of three basic kinds:
• a non-procedural entity, i.e. an object belonging to a type of order 1 (see below) that comprises all partial functions whose domain and range do not contain any constructions, including nullary functions like numbers and individuals;
• a procedural entity, i.e. a construction belonging to a type of order n > 1 (see below);
• a partial function involving constructions in its domain or range, i.e. an object belonging to a type of order n > 1 (see below).

Tichý defined six kinds of constructions, namely variables, Trivialization, Composition, (λ-)Closure, Execution, and Double Execution. While variables and Trivializations are atomic constructions that do not contain any other constituents but themselves, Compositions and Closures are molecular constructions that consist of constituents other than just themselves. Abstract or physical objects on which constructions operate are not their constituents; they are beyond constructions and must
be 'grabbed' and supplied to be operated on. Atomic constructions serve this grabbing role. Trivialization roughly corresponds to a constant of formal languages; where X is an object whatsoever of the TIL ontology, the Trivialization 0X produces X. Using the terminology of programming languages, the Trivialization 0X is just a pointer to X, a simple grabbing mechanism. Variables produce objects of their respective ranges dependently on valuations; they v-construct. Composition [F A1 . . . An] is the procedure of applying the function f produced by F to the tuple argument a1, . . . , an produced by the constituents A1, . . . , An, in order to obtain the value of f, if any; dually, Closure [λx1 . . . xm C] is the procedure of declaring or constructing a function by abstracting over the values of the λ-bound variables in the ordinary manner of λ-calculi. TIL is a logic of partial functions, i.e. functions that can lack a value at some of their arguments. Thus, a Composition fails to produce an object if the function f is not defined at the argument a1, . . . , an. In such a case we say that the Composition is v-improper. While in this paper we do not need and thus do not define Single Execution, Double Execution is an important construction; it executes a given displayed construction twice over, thus decreasing the mode of occurrence of the displayed construction to the executed mode. Thus, we define:

Definition 1 (construction)
(i) Variables x, y, ... are constructions that construct objects (elements of their respective ranges) dependently on a valuation v; they v-construct.
(ii) Where X is an object whatsoever (even a construction), 0X is the construction Trivialization that constructs X without any change in X.
(iii) Let X, Y1, ..., Yn be arbitrary constructions. Then Composition [X Y1 ...Yn] is the following construction. For any valuation v, if X does not v-construct a function that is defined at the n-tuple of objects (if any) v-constructed by Y1, ..., Yn, the Composition [X Y1 ...Yn] is v-improper. If X does v-construct such a function, then [X Y1 ...Yn] v-constructs the value of this function at the n-tuple.
(iv) (λ-)Closure [λx1 ...xm Y] is the following construction. Let x1, x2, ..., xm be pair-wise distinct variables and Y a construction. Then Closure [λx1 ...xm Y] v-constructs the function f that takes any members B1, ..., Bm of the respective ranges of the variables x1, ..., xm into the object (if any) that is v(B1/x1, ..., Bm/xm)-constructed by Y, where v(B1/x1, ..., Bm/xm) is like v except for assigning B1 to x1, ..., Bm to xm.
(v) Where X is an object whatsoever, 2X is the construction Double Execution. If X is not itself a construction, or if X does not v-construct a construction, or if X v-constructs a v-improper construction, then 2X is v-improper. Otherwise 2X v-constructs what is v-constructed by the construction v-constructed by X.
(vi) Nothing is a construction, unless it so follows from (i) through (v).

From the formal point of view, TIL is a typed λ-calculus that operates on functions (intensional level) and their values (extensional level), as ordinary λ-calculi do; in addition to this dichotomy, there is, however, the highest, hyperintensional level of
procedures producing lower-level objects.8 And since these procedures can themselves serve as objects on which other higher-order procedures operate, there is a fundamental dichotomy between two modes in which constructions can occur, namely displayed (as an object to be operated on) and executed (to v-construct a lower-level object). In principle, constructions are displayed by Trivialization. A dual operation to Trivialization is Double Execution, which executes constructions twice over. Hence, while 0X displays X, 20X voids the effect of Trivialization and is thus equivalent to the executed X. Below we refer to this equivalence as the 20-rule.

To avoid the vicious-circle problem and to keep track of particular logical strata in its stratified ontology, the TIL ontology is organized into a ramified hierarchy of types built over a base. For natural language processing, we use the epistemic base consisting of four atomic types, namely o (the set of truth-values), ι (individuals), τ (times or real numbers), and ω (possible worlds). The type of constructions is ∗n, where n is the order of the construction.

Definition 2 (ramified hierarchy of types) Let B be a base, where a base is a collection of pair-wise disjoint, non-empty sets. Then:
T1 (types of order 1)
(i) Every member of B is an elementary type of order 1 over B.
(ii) Let α, β1, ..., βm (m > 0) be types of order 1 over B. Then (α β1 ... βm), i.e. the collection of all m-ary partial mappings from β1 × ... × βm into α, is a functional type of order 1 over B.
(iii) Nothing is a type of order 1 over B unless it so follows from (i) and (ii).
Cn (constructions of order n)
(i) Let x be a variable ranging over a type of order n. Then x is a construction of order n over B.
(ii) Let X be a member of a type of order n. Then 0X, 2X are constructions of order n over B.
(iii) Let X, X1, ..., Xm (m > 0) be constructions of order n over B. Then [X X1 ...Xm] is a construction of order n over B.
(iv) Let x1, ..., xm, X (m > 0) be constructions of order n over B. Then [λx1 ...xm X] is a construction of order n over B.
(v) Nothing is a construction of order n over B unless it so follows from Cn (i)–(iv).
Tn+1 (types of order n + 1) Let ∗n be the collection of all constructions of order n over B. Then
(i) ∗n and every type of order n are types of order n + 1.
(ii) If α, β1, ..., βm (m > 0) are types of order n + 1 over B, then (α β1 ... βm) (see T1 (ii)) is a type of order n + 1 over B.
(iii) Nothing is a type of order n + 1 over B unless it so follows from (i) and (ii).

8 For an introduction to particular theories of hyperintensions, see [33].
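To fix intuitions, the following sketch models constructions as a simple algebraic data type. It is purely illustrative and ours, not part of the TIL formalism or of the implemented system, and it deliberately ignores the ramified typing of Definition 2; all names in it are our own.

```haskell
-- A minimal, illustrative AST for TIL constructions (Definition 1).
-- The ramified typing of Definition 2 is omitted.
data Construction
  = Var String                        -- variables x, y, ...
  | Triv Object                       -- Trivialization   0X
  | Comp Construction [Construction]  -- Composition      [X Y1 ... Ym]
  | Clos [String] Construction        -- Closure          [λx1 ... xm Y]
  | Exec Construction                 -- Single Execution 1X
  | DblExec Construction              -- Double Execution 2X
  deriving (Eq, Show)

-- Objects that a Trivialization can display: either a non-construction
-- entity (represented here by a mere label) or a construction itself.
data Object
  = Entity String
  | Constr Construction
  deriving (Eq, Show)
```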
Notational conventions. That an object X belongs to a type α is denoted by 'X/α'. That a construction C v-constructs an α-object (provided it is not v-improper) is denoted by 'C → α'; we will often say that C is typed to v-construct an α-object, for short. Throughout, variables w → ω and t → τ (possibly with subscripts) are used as ranging over possible worlds and times, respectively. If C → ατω, then the frequently used Composition [[C w] t], aka the extensionalization of the α-intension v-constructed by C, is abbreviated as Cwt. We use classical infix notation without Trivialization for the truth-value functions ∧ (conjunction), ∨ (disjunction), ⊃ (implication), and ¬ (negation). Also, identities =α of α-objects are written in the infix way without Trivialization and without the superscript α whenever no confusion arises.

Empirical sentences and terms denote (PWS-)intensions, functions with the domain of possible worlds ω; they are frequently mappings from ω to chronologies of α-objects, hence functions of types ((ατ)ω), or ατω, for short. Where the variables w, t range over possible worlds (w → ω) and times (t → τ), respectively, constructions of intensions are usually Closures of the form λwλt [. . . w . . . t . . . ]. For a simple example of the analysis of an empirical sentence, consider the sentence "John is a student". First, the types: Student/(oι)τω is a property of individuals and John/ι an individual. The sentence "John is a student" encodes as its meaning the hyper-proposition (i.e. a construction of a proposition) λwλt [0Studentwt 0John]. The property Student must be extensionalized first, Studentwt → (oι), and only then can it be applied to the individual John/ι, thus obtaining a truth-value (T or F) according to whether John belongs to the population of students in the given world w and time t of evaluation: [0Studentwt 0John] → o. By abstracting over the values of the variables w, t, the proposition of type oτω that John is a student is produced.

We model sets and relations by their characteristic functions. Hence, (oι), (oιι) are the types of a set of individuals and of a binary relation-in-extension between individuals, respectively. Quantifiers ∀α, ∃α are type-theoretically polymorphic total functions of types (o(oα)) defined as follows. Where B is a construction that v-constructs a set of α-objects, [0∀α B] v-constructs T if B v-constructs the set of all α-objects, otherwise F; [0∃α B] v-constructs T if B v-constructs a non-empty set, otherwise F. Instead of [0∀α λx A], [0∃α λx A] we write '∀xA', '∃xA' whenever no confusion arises.

The TIL system introduced above may look familiar to those acquainted with Montague-style semantics. Yet, TIL deviates in four relevant respects from the version of the λ-calculus made popular by Montague's Intensional Logic. First, and most importantly, meanings are not identified with (or modelled as) mappings from world/time pairs. Instead, Montague-like meanings (i.e. mappings) are the products of our meaning procedures (TIL constructions). Thus, while Montague's system is an intensional logic operating on functions and their values, TIL is a hyperintensional logic operating on constructions of functions, functions, and their values.
Second, variables are not linguistic items. The term 'y' expresses an atomic procedure as its meaning and picks out the entity that an assignment function has assigned to y as its value. Thus, three entities are involved: a term, a variable (a procedure), and a value. Furthermore, our variables can themselves occur as products of procedures placed higher up. This is essential in what follows, in particular for operations into hyperintensional contexts. Third, the analysis of a piece of language does not amount to translating it from some natural language into an artificial language (say, the λ-calculus), which in turn receives an interpretation, which is transferred back to the natural-language sentence. Instead, our λ-calculus is an inherently interpreted formal language, which serves as a device to directly denote (talk about) meanings; our λ-terms denote TIL constructions. Meaning procedures are studied by studying their structure and constituents as encoded in the λ-calculus of TIL, in virtue of the isomorphism between formulae and meanings. It should be stressed again that our constructions are not linguistic entities; they are higher-order abstract procedures. Fourth, in TIL we have explicit intensionalisation and temporalisation. Whereas Montague's IL combines worlds and times, TIL treats worlds and times as two distinct ground types, which enables separate variables ranging over these two separate types.9

To complete our brief introduction to the formal apparatus of TIL, we are going to define the substantial distinction between a construction occurring in executed vs displayed mode. To define the distinction, we must take the following factors into account. A construction C can occur in displayed mode only as a sub-construction within another construction D that operates on C. Therefore, C itself must be produced by another sub-construction E of D. And it is necessary to define this distinction for occurrences of constructions, because one and the same construction C can occur executed in D and at the same time serve as an input/output object for another sub-construction E of D that operates on C. The distinction between displayed and executed mode can be characterised like this (with the rigorous definition coming below). Let C be a sub-construction of a construction D. Then an occurrence of C is displayed in D if the execution of D does not involve the execution of this occurrence of C. Otherwise, C occurs in executed mode within D, i.e. C occurs as a constituent of the procedure D.

Let us illustrate an occurrence of a displayed construction by a simple example. Consider this sentence: "John is solving the equation Sin(x) = 0". When solving the problem of seeking the numbers x such that the value of the function Sine at x equals zero, John is not related to the set of multiples of the number π, i.e. to an object of type (oτ). If he were, John would have already solved the problem. Rather, John wishes to find the product of the construction λx [0 = [0Sin x] 0 0]. In other words, the sentence expresses John's relation-in-intension to this very construction, and Solve is thus an object of type (oι∗1)τω. Therefore, the whole sentence encodes this construction:

λwλt [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]]

9 For critical comments on Montague's IL and a comparison with TIL, see [18, Sect. 1.5].
Types and type checking: 0Solve → (oι∗1)τω; 0Solvewt → (oι∗1); 0John → ι; 0Sin → (ττ); 0 = → (oττ); 0 0 → τ; x → τ; [0Sin x] → τ; [0 = [0Sin x] 0 0] → o; [λx [0 = [0Sin x] 0 0]] → (oτ); 0[λx [0 = [0Sin x] 0 0]] → ∗1; [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]] → o; λwλt [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]] → oτω.
The construction [λx [0 = [0Sin x] 0 0]] is displayed (using Trivialization) as the second argument of the relation 0Solvewt. The evaluation of the truth-conditions expressed by the sentence consists in checking, for any possible world w and for any time t, whether John and this construction occur in the extensionalized relation-in-intension of solving as its first and second argument, respectively. Hence the execution of the procedure encoded by the sentence does not involve the execution of the equation λx [0 = [0Sin x] 0 0]; this is something John is tasked with. The execution steps specified by the above Closure, i.e. its constituents, are as follows. Each construction is an executed part of itself, hence the Closure (1) is a constituent of itself.
(1) λwλt [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]]
(2) λt [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]]
(3) [0Solvewt 0John 0[λx [0 = [0Sin x] 0 0]]]
(4) 0Solvewt
(5) [0Solve w]
(6) 0Solve
(7) w
(8) t
(9) 0John
(10) 0[λx [0 = [0Sin x] 0 0]]
Definition 3 (displayed vs executed mode of occurrence of a construction) Let C be a construction and D a sub-construction of C. Then:
(i) If C is identical to 0X and D is identical to X, then the occurrence of D and of all the sub-constructions of D are displayed in C.
(ii) If D is displayed in C and C is a sub-construction of a construction E such that E is not identical to 2F for any construction F, then the occurrence of D and of all the sub-constructions of D are displayed in E.
(iii) If D is identical to C, then the occurrence of D is executed in C.
(iv) If C is identical to [X1 X2 . . . Xm] and D is identical to one of the constructions X1, X2, . . ., Xm, then the occurrence of D is executed in C.
(v) If C is identical to [λx1 . . . xm X] and D is identical to X, then the occurrence of D is executed in C.
(vi) If C is identical to 2X and D is identical to X, then the occurrence of D is executed in C.
(vii) If C is identical to 20X and D is identical to X, then the occurrence of D is executed in C.
(viii) If an occurrence of D is executed in a construction E such that this occurrence of E is executed in C, then the occurrence of D is executed in C.
Definition 4 (constituent of a construction) Let C be a construction. Then the constituents of C are those sub-constructions of C that occur in executed mode in C.
Corollary. Each construction C is a constituent of itself, namely its improper constituent. All the other constituents of C are its proper constituents.10
10 The notions of being proper or improper constituents should not be confused with the notions of being v-proper or v-improper constructions.
Analogously to formal languages, variables can occur free or bound within a construction. In TIL, there are two kinds of binding of variables, namely λ-binding and binding by Trivialization, o-binding. Thus, an occurrence of a variable can also be double-bound, as when a λ-bound occurrence is also o-bound. Yet in the case of double-bound variables, we say that the variable is simply o-bound, because o-binding is stronger than λ-binding. If a construction C is displayed, then all its sub-constructions, including its variables, are displayed as well and therefore o-bound, regardless of whether they occur within the scope of a λ-operator. The general rule is that a higher context is dominant over a lower one. Thus, we define:
Definition 5 (free variable, bound variable, open/closed construction) Let C be a construction with at least one occurrence of a variable ξ.
(i) Let C be ξ. Then the occurrence of ξ in C is free.
(ii) Let C be 0X. Then every occurrence of ξ in C is o-bound.
(iii) Let C be [λx1 . . . xn Y]. Any occurrence of ξ in Y that is one of xi, 1 ≤ i ≤ n, is λ-bound in C unless it is o-bound in Y. Any other occurrence of ξ in Y that is neither o-bound nor λ-bound in Y is free in C.
(iv) Let C be [X X1 . . . Xm]. Any occurrence of ξ that is free, o-bound, λ-bound in one of X, X1, . . . , Xm is, respectively, free, o-bound, λ-bound in C.
(v) Let C be 2X. Then any occurrence of ξ that is free, o-bound, λ-bound in a constituent of C is, respectively, free, o-bound, λ-bound in C.
(vi) An occurrence of ξ is free, λ-bound, o-bound in C only due to (i)–(v).
A construction with at least one occurrence of a free variable is an open construction. A construction without any occurrence of a free variable is a closed construction.
Corollary. If a construction D is displayed in C, then all the variables occurring in D are o-bound in C.
Importantly, o-bound variables occur in the displayed mode, and thus they are not constituents of the super-construction C in which they occur. Therefore, the product of C is invariant with respect to the valuation of the o-bound variables occurring in C. In other words, o-bound variables figure just as objects to operate on, rather than as executed variables v-producing objects dependently on an assignment function. A few examples. Let x → τ be a variable ranging over numbers. Then while x v-constructs numbers, 0x v-constructs just x for any valuation v (or independently of valuations); therefore, 0x → ∗1. The Composition [0 > x 0 0] v-constructs T if the number v-constructed by x is greater than zero. The Trivialization of this Composition, 0[0 > x 0 0], produces just the Composition [0 > x 0 0] independently of valuations. Hence, the Closure λx [0 > x 0 0] constructs the set of positive numbers, an
(oτ)-object. The variable x is λ-bound here. But the Closure λx 0[0 > x 0 0] constructs a constant function of type (∗1 τ) that associates any number with the construction [0 > x 0 0]. This is so because the variable x is now o-bound rather than λ-bound.
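To make the binding behaviour of Definition 5 concrete, here is a small sketch, ours and purely illustrative, which builds on the Construction and Object types sketched after Definition 2. It computes the variables occurring free in a construction; note that Trivialization o-binds everything it displays, and that the sketch deliberately ignores the subtle case of 20X, where the displayed construction becomes executed again.

```haskell
import Data.List (nub, (\\))

-- Free variables per Definition 5 (simplified; assumes the Construction
-- and Object types from the earlier sketch are in scope).
freeVars :: Construction -> [String]
freeVars (Var x)       = [x]
freeVars (Triv _)      = []               -- everything displayed is o-bound
freeVars (Comp f args) = nub (concatMap freeVars (f : args))
freeVars (Clos xs y)   = freeVars y \\ xs -- λ-binding of the parameters
freeVars (Exec x)      = freeVars x
freeVars (DblExec x)   = freeVars x       -- the 20X special case is ignored

-- For instance, in λx 0[0 > x 0 0] the variable x is o-bound, hence not
-- free, whereas in λx [0 > x 0 0] it is λ-bound.
```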
3 Natural Deduction in TIL

The standard rules of a proof calculus are, in TIL, applicable to the constituents of those constructions that produce truth-values. To avoid misunderstanding, we follow Church and Gentzen in the classical style of a proof calculus, as it is applied, for instance, in HOL languages. Hence, we do not apply the Curry-Howard correspondence.11 The rules of the natural deduction system follow its general pattern and are thus introduced in I/E pairs.12 The rules dealing with truth-functions, namely conjunction introduction (∧-I) and elimination (∧-E), disjunction introduction (∨-I) and elimination (∨-E), implication introduction (⊃-I) and elimination (⊃-E, known also as modus ponendo ponens, MPP), are standard, as in propositional logic. Additionally, there are rules dealing with the quantifiers (general ∀ and existential ∃). Again, these rules are of two kinds, namely introduction and elimination rules. Yet, quantifiers in TIL (see above) are not special symbols; rather, they are functions applicable to classes of objects. Hence, our task is to explain how the rules for quantifiers are introduced into the TIL system. Here is how. Let x → α, B(x) → o: the variable x is free in B; [λx B] → (oα), ∀/(o(oα)), C → α: a construction that is not v-improper. Then the general-quantifier elimination in full TIL detail consists of these steps:

[0∀λx B]        ∅
[[λx B] C]      ∀-E
B(C/x)          β-reduction
where B(C/x) arises from B by a collision-less, valid substitution of the construction C for all occurrences of the variable x in B. To be sure, if the condition B is true of all the elements of type α, it must also be true of the object v-constructed by C. Note, however, that in order for the rule to be truth-preserving, C must not fail to produce such an object. Otherwise, if C were v-improper, the Composition [[λx B] C] would be v-improper as well, though by assumption [0∀λx B] v-constructs T. For the sake of simplicity, we write this rule in the shortened, ordinary form:

X ⊢ [0∀λx B]
--------------------- (∀-E)
X ⊢ B(C/x)
11 More precisely, Church in [2] applies Hilbert's deductive system with the rules of lambda conversion and many additional axioms. Gordon and Melham in the HOL system apply natural deduction in the sequent form [26, Sect. 22.3].
12 The rules of natural deduction adjusted to TIL were first briefly introduced in [22].
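The caveat that C must be v-proper is what separates the TIL rule from its classical counterpart. The following toy sketch, ours and not part of the calculus itself, merely illustrates the point with partial values: applying the class [λx B] to an argument that yields no value yields no truth-value either, even if B holds of every genuine value.

```haskell
-- Toy model of the v-improper caveat of (∀-E): a missing argument value
-- (Nothing) makes the whole Composition value-less, never True or False.
applyClass :: (a -> Bool) -> Maybe a -> Maybe Bool
applyClass _ Nothing  = Nothing        -- C v-improper: Composition improper
applyClass b (Just c) = Just (b c)     -- C v-proper: ordinary instantiation
```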
The dual rule ∀-I then receives this form (y being a 'fresh' free variable, i.e. "local" to this part of the proof):

X ⊢ B(y/x)
--------------------- (∀-I)
X ⊢ [0∀λx B]
In classical logic the existential quantifier ∃ is dual to the general quantifier ∀. Thus, it might seem that whereas the rule ∃-I for ∃-introduction is unproblematic, the difficulties would arise with the rule ∃-E for elimination of the existential quantifier. This is true, but not the whole truth. Since TIL is a logic of partial functions, we must be careful also with the ∃-I rule, so as not to derive that there is a value of a function at an argument when there is none. As in classical logic, the rules for the existential quantifier functions, ∃/(o(oα)), are parallel to those for disjunction (∨). Let x, y → α, B → o, [λx B] → (oα), ∃/(o(oα)), [0∃λx B] → o, C → o. Then the rule of existential quantifier elimination (∃-E) is:

X ⊢ [0∃λx B]        Y, B(y) ⊢ C
---------------------------------------- (∃-E)
X, Y ⊢ C
where the 'fresh' variable y does not occur free in any construction in Y or in C.
Comment. Recall the rule for eliminating disjunction; it is rather complicated:

X ⊢ A ∨ B        Y, A ⊢ C        Z, B ⊢ C
------------------------------------------------ (∨-E)
X, Y, Z ⊢ C
Roughly, it says this: consider both disjuncts A and B, and if you manage to prove another construction C taking first A as an assumption and then B, you have proved C from A ∨ B. The rule is well justified. Proving C from A is equivalent to proving A ⊃ C, and proving C from B is equivalent to proving B ⊃ C. Hence, we have proved (A ⊃ C) ∧ (B ⊃ C), which is equivalent to (A ∨ B) ⊃ C. By modus ponendo ponens, we have proved C. This suggests that to eliminate an existential quantification [0∃λx B] and derive another construction C, we should be able to conclude C starting from B with any 'value' substituted for x in B. We do this by substituting a 'fresh' free variable y that does not occur free in C (or anywhere outside the proof sequence).

The rule of existential quantifier introduction is valid in its classical form provided it is applied to a constituent. Recall that a constituent of B is a construction that does not occur displayed in B. Let D → α be a construction that occurs as a constituent of the construction B, the other types as above. Since, by assumption, B produces the truth-value T and D is its constituent, B is of the form of a Composition [. . . D . . . ]. Then, by the definition of Composition, the construction D cannot be v-improper, and the Composition [[λx B] D] v-constructs T as well. Thus, the set of α-elements produced by λx B is non-empty, and the application of the ∃ quantifier is truth-preserving. As a result, we obtain the classical existential quantifier introduction (∃-I) rule:
X ⊢ B(D/x)
--------------------- (∃-I)
X ⊢ [0∃λx B]
The crucial condition in this rule is that D occurs as a constituent of B. Hence, this rule quantifies over constituents; it does not apply into a hyperintensional context. If it were not so, we might 'magically' derive the existence of a non-existent object, which would not be correct, of course. For instance, consider this sentence as an assumption:

"a calculates the cotangent of π"

The analysis of the sentence is this construction:

λwλt [0Calculatewt a 0[0Cot 0π]]

Types. Calculate/(oι∗n)τω; a → ι; Cot/(ττ); π/τ; [0Cot 0π] → τ; 0[0Cot 0π] → ∗1. Since 0[0Cot 0π] is a constituent of the above assumption, we can apply the (∃-I) rule to derive that there is a construction c → ∗1 such that a calculates c:

λwλt [0∃λc [0Calculatewt a c]]

However, if we applied the (∃-I) rule to the displayed construction [0Cot 0π], we would attempt to derive that there is a number x → τ such that a calculates x:

λwλt [0∃λx [0Calculatewt a 0x]]

Such a derivation is invalid for three reasons. First, there is no such number, because the function cotangent is not defined at the number π. Second, even if there were such a number, it makes no sense to compute a mere number. Third, the variable x is o-bound and thus not amenable to λ-binding, as explained above. Yet, any logic that deserves the claim of being hyperintensional should also explain how to operate into a hyperintensional context, i.e. how to operate with a displayed construction. Thus, besides the classical rule of existential quantifier introduction, in TIL we can also quantify into hyperintensional contexts. Referring for details to [16], where the authors introduce the rules for quantifying into hyperintensional contexts, we briefly recapitulate. Among those rules there is the rule for quantifying over an object supplied by Trivialization. We can quantify over an object produced by Trivialization inside a displayed construction, because Trivialization does not fail to supply such an object for any valuation v. Consider an argument of the form

a calculates the cotangent of π
-----------------------------------------
a calculates the cotangent of something

It is obviously valid. There is 'something', namely the number π, the cotangent of which a calculates. But careless application of the rule (∃-I) is not valid:
λwλt [0Calculatewt a 0[0Cot 0π]]
-----------------------------------------
λwλt [0∃λx [0Calculatewt a 0[0Cot x]]]

The reason is this. The Trivialization 0[0Cot x] constructs the Composition [0Cot x] independently of any valuation v, because the variable x is o-bound. Thus, from the fact that at w, t it is true that a calculates [0Cot 0π], we cannot validly infer that a calculates [0Cot x], because a calculates the cotangent of π rather than of x. Put differently, the class of numbers constructed by λx [0Calculatewt a 0[0Cot x]] will be non-empty according as a calculates [0Cot x], and regardless of a's calculating [0Cot 0π]. The problem just described, of λx being unable to catch the occurrence of x inside the Trivialized construction, is TIL's way of phrasing the standard objection to quantifying-in. Yet in TIL we have a way out (or perhaps rather, a way in). In order to validly infer the conclusion, we need to pre-process the Composition [0Cot x] and substitute the Trivialization of π, v-produced by another free variable y, for x. Only then can the conclusion be inferred. To this end we deploy the polymorphic functions Subn/(∗n ∗n ∗n ∗n) and Trα/(∗n α) that operate on constructions in the manner stipulated by the following definition.

Definition 6 (Subn, Trα) Let C1/∗n+1 → ∗n, C2/∗n+1 → ∗n, C3/∗n+1 → ∗n v-construct constructions D1, D2, D3, respectively. Then the Composition [0Subn C1 C2 C3] v-constructs the construction D that results from D3 by the collision-less substitution of D1 for all occurrences of D2 in D3. The function Trα/(∗n α) returns as its value the Trivialization of its α-argument.

Example 1 Let the variable y → τ. Then [0Trτ y] v(π/y)-constructs 0π. The Composition [0Sub1 [0Trτ y] 0x 0[0Cot x]] v(π/y)-constructs the Composition [0Cot 0π]. Hence, the Composition [0Sub1 [0Trτ y] 0x 0[0Cot x]] is v(π/y)-congruent with [0Cot 0π]. Importantly, the variable y is free for λ-binding in the former, unlike the variable x, which is o-bound. Below we will omit the superscripts n and α and write simply 'Sub' and 'Tr' whenever no confusion can arise.

It should be clear now how to validly derive that a calculates the cotangent of something if a calculates the cotangent of π. The valid argument, in full TIL notation, is this:

λwλt [0Calculatewt a 0[0Cot 0π]]
-----------------------------------------------------------
λwλt [0∃λy [0Calculatewt a [0Sub [0Tr y] 0x 0[0Cot x]]]]
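Definition 6 is, in effect, an algorithm on construction terms. The following sketch, ours and deliberately simplified (it performs plain structural replacement and does not handle collisions of variables), indicates how Sub and Tr might be modelled over the Construction type sketched in Sect. 2. Note that, unlike ordinary substitution, it descends into Trivialized, i.e. displayed, constructions, which is precisely what operating into hyperintensional contexts requires.

```haskell
-- Tr: return the Trivialization of its argument (cf. Definition 6).
tr :: Object -> Construction
tr = Triv

-- Sub: replace every occurrence of d2 by d1 inside d3.  Collision-less
-- renaming is omitted in this sketch; Construction and Object are the
-- types from the earlier sketch (with derived Eq).
sub :: Construction -> Construction -> Construction -> Construction
sub d1 d2 d3
  | d3 == d2  = d1
  | otherwise = case d3 of
      Comp f args     -> Comp (sub d1 d2 f) (map (sub d1 d2) args)
      Clos xs y       -> Clos xs (sub d1 d2 y)
      Exec x          -> Exec (sub d1 d2 x)
      DblExec x       -> DblExec (sub d1 d2 x)
      Triv (Constr c) -> Triv (Constr (sub d1 d2 c))   -- into displayed mode
      other           -> other
```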
There are two rules for quantifying into a hyperintensional context, one that quantifies over a construction, the other over an object produced by Trivialization, as the above example illustrates. We are not going to introduce these rules here, because doing so would needlessly make our exposition hard to read. Suffice it to say that the substitution method introduced above must be applied whenever a conflict of contexts would otherwise arise. The last technical devices that we need are the introduction and elimination of the left-most λwλt. They are applied when dealing with empirical propositions.
If the assumptions are empirical propositions, as is often the case when processing natural-language texts, our task is to derive a proposition that is logically entailed by the propositions in the premises. Logical entailment between propositions is defined below. In prose, a proposition P → oτω is entailed by the propositions Q1, . . . , Qn iff necessarily, i.e. in all possible worlds and times in which all the assumptions Q1, . . . , Qn are true, the proposition P is true as well. Hence, in any world w0 and at any time t0 of evaluation, the derivation sequence must be truth-preserving from the premises to the conclusion. Thus, the typical sequence of derivation steps is this. We have assumptions of the form λwλt [. . . w . . . t . . .] → oτω, and we assume that the propositions produced by these constructions are true in the world w0 at the time t0 of evaluation. Using the detailed notation, we obtain the Composition [[[λw [λt [. . . w . . . t . . .]]] w0] t0] → o. By applying restricted β-reduction twice, we eliminate the left-most λwλt, thus obtaining [. . . w0 . . . t0 . . .] → o.13 Now we proceed with derivation steps until a conclusion of the form [. . . w0 . . . t0 . . .] → o is derived. Since we are to derive a proposition, we finally abstract over the values of the variables w0, t0, thus introducing the left-most λwλt to construct a proposition: λwλt [. . . w . . . t . . .] → oτω. In order to simplify the derivations, in what follows we omit the initial and final steps of λ-elimination and λ-introduction, respectively.
4 Semantic Rules

There are many features of the rich semantics of natural language that must be formalized by special rules not found in formal logical languages. TIL is a logical system that has been applied primarily to the analysis of natural language, because it is a powerful system apt for this task. Since it is beyond the scope of this paper to deal with all the semantic peculiarities of natural language, we refer for details to [18]. To illustrate the problems we have to deal with when building up a question-answering system over natural language corpora, we are now going to deal with factive verbs and the presuppositions triggered by them, property modifiers, and anaphoric references.
13 The restricted β-reduction is just the substitution of variables for variables of the same type. Since
variables are not v-improper for any valuation v, it is a strictly equivalent conversion. In the logic of partial functions, such as TIL, β-reduction ‘by name’ is in general not an equivalent transformation. If the construction that is substituted for λ-bound variables is v-improper, in some cases the redex does not v-construct the same object as the contractum. For this reason, in our applications we use β-reduction by value. For details, see [15] or [19].
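Footnote 13 points to a genuinely computational issue. The toy sketch below, ours and only a loose analogy with Maybe standing in for partiality, shows why substituting an unevaluated argument ('by name') can produce a value where the original redex has none, so that the two are not equivalent; β-reduction by value propagates the failure instead.

```haskell
-- Redex [[λx B] C] evaluated by value: if the argument C has no value,
-- the whole application has none either.
redexByValue :: (Maybe a -> Maybe b) -> Maybe a -> Maybe b
redexByValue body arg = arg >>= \v -> body (Just v)

-- Contractum B(C/x) obtained by name: the unevaluated argument is merely
-- substituted, so B may ignore it and still yield a value.
contractumByName :: (Maybe a -> Maybe b) -> Maybe a -> Maybe b
contractumByName body arg = body arg

-- Example: B does not evaluate x (say, x occurs only displayed).
ignoreArg :: Maybe a -> Maybe Bool
ignoreArg _ = Just True
-- redexByValue     ignoreArg Nothing  ==  Nothing    (v-improper)
-- contractumByName ignoreArg Nothing  ==  Just True  (not equivalent)
```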
4.1 Factive Attitudes and Presuppositions

Factive verbs, like 'know that', 'regret that', 'be sorry', 'be proud', 'be indifferent', 'be glad that', 'be sad that', etc., presuppose that the embedded clause denotes a true proposition. For instance, if one asks, "Does John regret his being late?" and John was not late, there is no direct answer Yes or No. For both answers entail that John did come late. In such a case an appropriate answer conveys the information that the presupposition is not true, like "It is not true that John regrets his coming late, because he was not late". Note that while the direct answer applies narrow-scope negation, the complete answer denies by wide-scope negation.14 Hence, both John regretting and John not regretting his being late entail that John was late. If John was not late, he could neither regret nor not regret it; therefore, the proposition that he regrets it has a truth-value gap. Schematically, if K is a factive verb and X its complement clause, the following rules are valid:

K(X) ⊢ X        ¬K(X) ⊢ X

Factive verbs should be distinguished from implicative verbs like 'to manage' or 'to dare'. While sentences applying factive verbs presuppose the truth of the embedded clause, those with implicative verbs only entail it.15 Schematically, where I is an implicative verb and X the complement clause, we have the following rules:

I(X) ⊢ X        ¬I(X) ⊢ ¬X

TIL is a logic of partial functions, and as such it is apt for dealing with presuppositions and truth-value gaps. Yet partiality, as we all know very well, brings about technical complications. To manage them properly, we define the properties of propositions True, False and Undefined, all of type (ooτω)τω, as follows (P → oτω):

[0Truewt P] v-constructs T if Pwt, otherwise F;
[0Falsewt P] v-constructs T if ¬Pwt, otherwise F;
[0Undefinedwt P] = ¬[0Truewt P] ∧ ¬[0Falsewt P].

Now we can rigorously define the difference between a presupposition and a mere entailment. Let P, Q be constructions of propositions. Then
14 For details on narrow- and wide-scope negation see [9], and for answering questions with presuppositions, see [3].
15 We are not going to deal with implicative verbs here; yet see [44], and also [1], for details. Note, however, that the notion of presupposition that these authors deal with is pragmatic in nature, while we deal with logical presuppositions, the definition of which comes below. It appears that the implicative verbs listed above presuppose a weaker version of a presupposition; 'to manage something' presupposes 'to try that something' (and a certain difficulty of the task) and 'to dare' presupposes a sort of 'wanting'. We are grateful to an anonymous referee for this note.
Q is entailed by P iff ∀w∀t [[0Truewt P] ⊃ [0Truewt Q]];
Q is a presupposition of P iff ∀w∀t [[[0Truewt P] ∨ [0Falsewt P]] ⊃ [0Truewt Q]].

Hence, we have: Q is a presupposition of P iff ∀w∀t [¬[0Truewt Q] ⊃ [0Undefinedwt P]]. If a presupposition of a proposition P is not true, then P has no truth-value.

Factive verbs being a special case of attitudinal verbs, they thus denote relations-in-intension of an individual to the meaning of the embedded clause, which is a construction of a proposition.16 Hence, if K is the meaning of a factivum, then K/(oι∗n)τω. Furthermore, let c/∗n+1 → ∗n, 2c → oτω, 2cwt → o be a variable ranging over constructions of propositions, and a → ι. Then the rules (FA) dealing with factive propositional attitudes are:

[0Kwt a c] ⊢ 2cwt        ¬[0Kwt a c] ⊢ 2cwt        (FA)

Hence, the analysis of the above example, together with the proof that John came late, comes down to these constructions. First, the types: Regret/(oι∗n)τω; John/ι; Late/((oι)τω(oι)τω): a property modifier (see Sect. 4.2); Coming/(oι)τω. For the sake of simplicity, we now ignore the preprocessing of the anaphoric pronoun 'his', substituting John for 'his' directly. See, however, Sect. 4.3.

(1) [0Regretwt 0John 0[λwλt [[0Late 0Coming]wt 0John]]]        ∅
(2) 20[λwλt [[0Late 0Coming]wt 0John]]wt                       1, FA
(3) [λwλt [[0Late 0Coming]wt 0John]]wt                         2, 20-E
(4) [[0Late 0Coming]wt 0John]                                  3, β-r
In this proof, we also applied the β-reduction of λ-calculi (step 4) and the elimination of Trivialization by Double Execution, (20-E), in step 3. For any construction C, the rule (20-E) is:

20C = C
16 Here we deal with an empirical case of an attitude to a proposition. In case of mathematical attitudes the embedded clause denotes a truth-value. For details and the rules for mathematical factive attitudes, see [18, Sect. 5.1].
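Since the whole machinery of this subsection turns on partiality, a small sketch may help; it is ours and only illustrative, modelling a proposition as a partial map from a world/time pair to a truth-value and checking the presupposition condition over an explicitly given finite set of such pairs.

```haskell
-- Propositions with truth-value gaps: Nothing stands for 'undefined'.
type WorldTime   = (Int, Int)                 -- stand-in for a <w, t> pair
type Proposition = WorldTime -> Maybe Bool

trueAt, falseAt, undefinedAt :: Proposition -> WorldTime -> Bool
trueAt      p wt = p wt == Just True
falseAt     p wt = p wt == Just False
undefinedAt p wt = not (trueAt p wt || falseAt p wt)

-- Q is a presupposition of P iff, whenever P has a truth-value, Q is true
-- (here checked only over the finite list of <w, t> pairs supplied).
isPresuppositionOf :: [WorldTime] -> Proposition -> Proposition -> Bool
isPresuppositionOf wts q p =
  all (\wt -> not (trueAt p wt || falseAt p wt) || trueAt q wt) wts
```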
4.2 Property Modifiers

Property modifiers are mostly denoted by adjectives; they are functions-in-extension that, applied to a root property, return the modified property as their value. In this subsection we deal with properties of individuals and with modifiers of such properties, of type ((oι)τω(oι)τω). There are three basic kinds of modifiers, namely intersective, subsective, and privative. Here are some examples.

(a) Intersective. "A yellow elephant is yellow and is an elephant."
(b) Subsective. "A skilful surgeon is a surgeon."
(c) Privative. "A forged passport is a non-passport."

We are not going to analyse these modifiers in detail here. The TIL analysis has been introduced in numerous papers, see, e.g. [7, 30–32]. The issue we deal with below is the rule of left subsectivity.17 The principle of left subsectivity is trivially (by definition) valid for intersective modifiers. If Jumbo is a yellow elephant, then Jumbo is yellow. Yet how about the other modifiers? If Jumbo is a small elephant, is Jumbo small? If you factor out small from small elephant, the conclusion says that Jumbo is small. Yet this would seem a strange thing to say, for something appears to be missing: Jumbo is a small what? Nothing and nobody can be said to be small (or forged, skilful, good, notorious, or whatnot) without any sort of qualification. A complement providing an answer to the question 'a what?' is required. We introduce a rule of left subsectivity that is valid for all kinds of modifiers, including subsective and privative ones. The idea is simple. From 'a is an [M P]' we infer that a is an M with respect to something. The scheme of the left subsectivity rule is this (SI being the substitution of identical properties, Leibniz's Law):

(1) a is an [M P]                        assumption
(2) a is an (M something)                1, EG
(3) M* is the property (M something)     definition
(4) a is an M*                           2, 3, SI
To put the rule on more solid grounds of TIL, let π = (oι)τω for short, M → (ππ) be a modifier, P → π an individual property, [M P] → π the property resulting from applying M to P. Further, let = /(oππ) be the identity relation between properties, and let p →v π range over properties, x →v ι over individuals. Then the proof of the rule is this (additional type: ∃/(o(oπ))):
17 Here we partly draw on material from [18, Sect. 4.4] and from [31].
(1) [[M P]wt a]                                   assumption
(2) ∃p [[M p]wt a]                                1, ∃I
(3) [λx ∃p [[M p]wt x] a]                         2, β-expansion
(4) [λwλt [λx ∃p [[M p]wt x]]wt a]                3, β-expansion
(5) M* = λwλt [λx ∃p [[M p]wt x]]                 definition
(6) [M*wt a]                                      4, 5, SI

Any valuation of the free occurrences of the variables w, t that makes the first premise true will, together with step five, make the conclusion true. Left subsectivity (LS), dressed up in full TIL notation, is this:

[[M P]wt a]        [M* = λwλt λx ∃p [[M p]wt x]]
-------------------------------------------------------- (LS)
[M*wt a]
This specification of the rule easily dismantles the objections raised against the (LS) principle by Gamut [24, Sect. 6.3.11] and Geach [25]. Summarising very briefly, there are arguments against (LS) of the following kind: if Jumbo is a small elephant and a large mammal, then Jumbo is small and large, a contradiction! Yet there is no contradiction, because Jumbo is small as an elephant and large as a mammal. Hence the properties p, q with respect to which Jumbo is a [0Small p] and a [0Large q] are distinct. Of course, nobody and nothing is absolutely small or absolutely large. Everybody is made small by something and made large by something else. And nobody is absolutely good (except God) or absolutely bad; rather, everybody has something they do well and something they do poorly. That is, everybody is both good and bad, which here just means being good at something and being bad at something else, without generating a paradox (Good, Bad/(ππ)):

λwλt ∀x [∃p [[0Good p]wt x] ∧ ∃q [[0Bad q]wt x]]

But nobody can be good at something and bad at the same thing simultaneously. Similarly, if Jumbo is a small elephant and Mickey is a large mouse, then Jumbo is small and Mickey is large; this does not entail that Jumbo is smaller than Mickey. Again, to derive the conclusion, it would have to be granted that Jumbo is small with respect to the same property with respect to which Mickey is large, which is not so.
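To see the factored-out property M* at work, here is a small, purely illustrative sketch of ours: properties are modelled extensionally at one fixed world/time pair, a modifier maps properties to properties, and M* holds of an individual iff the individual is [M p] for some property p from a given pool.

```haskell
-- A toy model of property modifiers and the rule (LS).
type Individual = String
type Property   = Individual -> Bool       -- extensionalized at fixed <w,t>
type Modifier   = Property -> Property

-- M* = λx ∃p [[M p] x], with the existential quantifier restricted to a
-- finite pool of properties under consideration.
leftFactor :: Modifier -> [Property] -> Property
leftFactor m pool x = any (\p -> m p x) pool
```

With a modifier small and the pool [elephant, mammal], leftFactor small [elephant, mammal] makes Jumbo 'small' iff Jumbo is a small elephant or a small mammal, which is exactly the relativized reading that dismantles the objections of Gamut and Geach.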
4.3 Anaphoric References and Substitution Method

Resolving anaphoric references is a hard nut to crack for every linguist dealing with the semantics of natural languages, because there are frequently many ambiguities as to which part of the foregoing discourse an anaphoric pronoun refers. Logic cannot disambiguate any sentence, of course. Instead, logic can contribute to disambiguation and better communication by making these hidden features explicit and logically tractable. If a sentence or term is ambiguous, we furnish it with multiple constructions
as its proposed meanings and leave it to the agent to decide which of these meanings is the intended one. To deal with anaphoric references, we apply a generalization of Hans Kamp's Discourse Representation Theory (DRT); see [34, 35]. 'DRT' is an umbrella term for a collection of logical and computational linguistic methods developed for a dynamic interpretation of natural language, where each sentence is interpreted within a certain discourse. DRT as presented in [34] is a first-order theory. Thus, only terms denoting individuals (indefinite or definite noun phrases) can introduce so-called discourse referents, which are free variables that are updated when interpreting the discourse. Since TIL semantics is procedural, hence hyperintensional and higher-order, not only individuals but entities of any type, like properties of individuals, propositions, relations-in-intension, and even constructions (i.e., meanings of antecedent expressions), can be linked to anaphoric variables. Moreover, the thoroughgoing typing of the universe of TIL makes it possible to determine the respective type-theoretically appropriate antecedent, which also contributes to disambiguation.18

For instance, the ambiguous anaphoric reference to properties, as in Neale's example "John loves his wife and so does Peter", has been analysed in [15]. The authors prove that the sentence entails that John and Peter share a property. Only it is ambiguous which one; there are two options, (i) loving John's wife and (ii) loving one's own wife. The property predicated of Peter in 'so does Peter' is a function of the property predicated of John in 'John loves his wife'. Since the source clause is ambiguous between attributing (i) or (ii) to John, the target clause is likewise ambiguous between attributing (i) or (ii) to Peter. The ambiguity of the anaphoric expression 'his wife' as applied to John is visited upon the likewise anaphoric expression 'so does'. The authors propose analyses of both readings and show that unrestricted β-reduction 'by name' reduces both readings to the strict one, on which John and Peter love John's wife, which is undesirable.19 The solution consists in the application of β-reduction 'by value', which makes use of the above-defined functions Sub and Tr. To recall, the function Sub operates on constructions so that the Composition [0Sub C1 C2 C3] produces a construction D that is the result of the collision-less substitution of the product of C1 for the product of C2 into the product of C3. The function Tr/(∗n α) produces the Trivialization of the α-object.

What is also special about "John loves his wife, and so does Peter" is that it involves two anaphoric terms, namely 'his' and 'so does'. It might seem tempting, though, to analyse "John loves his wife" as though it were synonymous with "John loves John's wife". Then "So does Peter" would unambiguously attribute to Peter the property of loving John's wife. But this analysis would not be plausible, as it would entirely annihilate the anaphoric character of 'his'.

18 The algorithm for dynamic discourse representation within TIL has been specified in [8] and implemented by Kotová [37]. It is applied in a multi-agent system to govern the communication of individual agents by messaging.
19 Loukanova in [38] also warns against unrestricted β-reduction and its undesirable results.

Instead, the form of the solution
must be in terms of the resolution of verb-phrase ellipsis. It needs to be spelt out which of two properties applies to John in "John loves his wife" and so applies to Peter in "So does Peter". The property (i) of loving John's wife is produced by

λwλt λx [0Lovewt x [0Wife-ofwt 0John]]

while the property (ii) of loving one's own wife is produced by

λwλt λx [0Lovewt x 2[0Sub [0Tr x] 0y 0[0Wife-ofwt y]]]

From the logical point of view, anaphoric pronouns denote variables whose valuation is supplied by referring to an appropriate antecedent. To this end, we apply the substitution method that exploits the functions Sub and Tr. To adduce an example of referring to the meaning of a term, i.e. to the encoded construction, the sentence "Sin of π equals zero and John knows it" encodes the following construction as its meaning:20

λwλt [[[0Sin 0π] = 00] ∧ 2[0Sub [0Tr 0[[0Sin 0π] = 00]] 0it 0[0Knowwt 0John it]]]
Types. Sin/(ττ); 0, π/τ; [[0Sin 0π] = 00] → o; Know/(oι∗n)τω; John/ι; it → ∗n. Note that the result of the substitution (the application of the Sub function) is an adjusted construction, [0Knowwt 0John 0[[0Sin 0π] = 00]]. But the second argument of the conjunction must be a truth-value; hence, the adjusted construction must be executed, therefore the Double Execution. This analysis is fully compositional. The meaning of "John knows it", λwλt [0Knowwt 0John it], contains the free variable it as its constituent. If the sentence is uttered in isolation, the valuation assignment is a pragmatic matter of the speaker/interpreter. However, if the sentence is embedded in the discourse context, the variable it becomes bound, and the value assignment is provided by the substitution method.21
20 We analyse Know(ing)/(oι∗n)τω as a hyperintensional attitude, i.e. the relation-in-intension of an individual to a hyperproposition (a construction of a truth-value or of a PWS proposition). In the case of mathematics it is obvious that such attitudes must relate an individual to the very procedure rather than to its product; it makes no sense to know a truth-value without any mathematical operation producing it. In the empirical case, intensional attitudes are also thinkable. Yet, since intensional attitudes inevitably yield a variant of the well-known paradox of logical/mathematical omniscience, we vote for the hyperintensional analysis here.
21 A similar stance and solution can be found in [39].
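Continuing the sketches from Sects. 2 and 3 (ours and illustrative only), the resolution of the anaphoric variable 'it' can be expressed as one call of the sub function: the Trivialization of the antecedent's meaning is substituted for the variable inside the displayed attitude construction.

```haskell
-- Assumes the Construction/Object types and the sub function sketched
-- earlier.  The antecedent meaning is the construction [[0Sin 0π] = 00].
antecedentMeaning :: Construction
antecedentMeaning =
  Comp (Triv (Entity "="))
       [ Comp (Triv (Entity "Sin")) [Triv (Entity "pi")]
       , Triv (Entity "0") ]

-- [0Know_wt 0John it], with 'it' still unresolved.
knowsIt :: Construction
knowsIt = Comp (Triv (Entity "Know_wt")) [Triv (Entity "John"), Var "it"]

-- [0Know_wt 0John 0[[0Sin 0π] = 00]]: 'it' replaced by the Trivialization
-- of the antecedent's meaning, as the Sub/Tr analysis above prescribes.
resolved :: Construction
resolved = sub (Triv (Constr antecedentMeaning)) (Var "it") knowsIt
```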
5 Wh-Questions

Up until now we have considered classical truth-preserving derivations only. In other words, so far only constructions producing truth-values have figured in our proof sequences and rules. Such a method makes it possible to answer Yes-No questions. For instance, if the query put forward is "Is a going to Brussels?", then we aim at deriving from the input textual sentences the consequence confirming the answer in the affirmative: "a is going to Brussels." If we succeed, the answer is 'Yes', otherwise 'No'.22 Yet, one can also ask, "Who is going to Brussels?", and we want the system to provide the answer 'a'. Before introducing the method of deriving answers to Wh-questions, we briefly recapitulate the theory of questions and answers as developed within the TIL system.23

From the logical point of view, interrogative empirical sentences denote α-intensions of a type ((ατ)ω), or ατω for short. The direct answer to the question posed by an interrogative sentence is the value, if any, of the denoted α-intension, i.e. an object of type α. Interrogative empirical sentences can be classified according to many criteria, and various categorisations of questions have been proposed. Questions can be open-ended or closed-ended. Open questions give the respondent greater freedom to provide information or opinions on a topic, while closed questions call for an answer of a specific type. In this paper, we deal with closed questions. These questions can be classified into three basic types, to wit Yes-No questions, Wh-questions and exclusive-or questions. Yes-No questions like "Does John regret his being late?", "Did the Pope ever visit Prague?" present a proposition whose actual truth-value the inquirer would like to know. In the case of Wh-questions like "Who is the Pope?", "When did you stop smoking?", "Who are the members of the European Union?", "Why did John come late?" there is a much greater variety of possible answers, because the type of the denoted intension is ατω for any type α other than o. It can be the type of an individual, a set of individuals, a time moment, a location, a property, or whatever else. In the case of exclusive-or questions like "Are you going by train, or by car?", "Is Tom an assistant, or a professor?" the adequate answer does not provide a truth-value; instead, it conveys the information on which of the alternatives is the case.

Concerning answers, we distinguish between direct and complete answers. A direct answer directly provides the α-value of the asked α-intension. A complete answer is the proposition that this α-object is the value of the posed α-intension. For instance, the direct answer to the Wh-question "Who is the No.1 player in the WTA singles ranking?" is Ashleigh Barty, while the complete answer is the proposition that Ashleigh Barty is the No.1 player in the WTA singles ranking. Obviously, to each direct answer there corresponds a respective complete answer. Possible direct answers to Wh-questions determine the type α of the α-intension in question.

22 If a does not exist, the answer is just that a does not exist, because there is an existential presupposition of a's existence.
23 For more details, see, for instance, [3], [18, Sect. 3.6], [46].
For instance, a possible direct answer to the question (1)
“Who are the first three players in WTA ranking singles?”
is a set of individuals, currently (writing in July 2020) {Ashleigh Barty, Simona Halep, Karolina Plíšková}, which is an object of type (oι). Thus, the analysis of this question consists in a construction of a property of individuals, an object of type (oι)τω : (1∗ )
λwλt λx [[0WTA-Ranking wt x] ≤ 0 3] → (oι)τω
Types: x →v ι; 0 WTA-Ranking/(τι)τω : attribute of an individual that is an empirical function assigning to individuals their current position (if any) in WTA ranking singles; 3/τ, ≤ /(oττ). On the other hand, the direct answer to the question (2)
“Who is the No. 1 player in the WTA singles ranking?”
should be a single individual; currently, she is Ashleigh Barty. Hence, the analysis of the question must produce an individual office (or role), an object of type ιτω . Here is how: (2∗ )
λwλt [0I λx [[0WTA-Ranking wt x] = 01]] → ιτω
Additional type. I/(ι(oι)): the singularizer, i.e. a function that associates a set S of individuals with the only member of S provided S is a singleton, and that is otherwise (if S is an empty or a multi-valued set) undefined. Note that the question presupposes that the set of individuals to whom the WTA ranking is assigned is non-empty. If it were empty, then there could be no direct answer, and our method would provide a complete answer informing that the presupposition is not true. In our case, the answer would be that there is nobody to whom the WTA ranking is assigned.24 The problem to solve is now this. Having derived a set of TIL constructions of propositions which can provide answers to Yes/No questions, we need a method for answering Wh-questions as well. Assume, for instance, that we have extracted and formalised the sentence (3):
(3) "Ashleigh Barty is the first player in the WTA singles ranking."
(3*) λwλt [[0WTA-Ranking wt 0Barty] = 01]
24 The general analytic schema for questions with presuppositions applies a rigorously defined, strict function ‘If-then-else’. Where P is a presupposition of a question Q, in plain English, the schema is “If P then Q else non-P”. For details, see [4, 17].
To answer the question (2), we aim at finding a suitable substitution for the variable x occurring in (2*). In other words, using the terminology of formal languages, we are looking for a proper construction with which (2*) can be unified. Obviously, in this simple example (3*) serves the goal. Here is the derivation of the desired answer 0Barty:

(1) [0I λx [[0WTA-Ranking wt x] = 01]]            Question
(2) [[0WTA-Ranking wt x] = 01]                    1, I-E, λ-E
(3) [[0WTA-Ranking wt 0Barty] = 01]               assumption
(4) x = 0Barty                                    2, 3, Unification
Comments. We derived that, on the assumption (3), the direct answer to the question (2) is Barty. Of course, WTA-Ranking is an empirical attribute denoting an intension of type (τι)τω. Hence, who is the No.1 in the WTA ranking depends crucially on the possible world and time in which we evaluate. The fact that we derived Barty as the direct answer to the question (2) does not mean that Barty is necessarily the No.1 in the WTA ranking, of course. This answer is only entailed by the proposition denoted by (3). Barty has not always been, and will not always be, No.1; and if the circumstances were different, i.e. in another possible world in which Barty had lost to, for instance, Svitolina, the latter would be the No.1.25
When deriving the answer to a Wh-question, we apply an adjusted version of Robinson's unification algorithm as known from the general resolution method. We are looking for a construction with the same constituents as a given question, up to the variable whose value we want to obtain. The adjustment of the algorithm is this. Sometimes the constituents are not strictly identical; it suffices that, by applying basic arithmetic, we can conclude that this or that construction is suitable for answering. To illustrate, consider this knowledge base:

λwλt [[0WTA-Ranking wt 0Barty] = 01]
λwλt [[0WTA-Ranking wt 0Halep] = 02]
λwλt [[0WTA-Ranking wt 0Pliskova] = 03]
λwλt [[0WTA-Ranking wt 0Kenin] = 04]
λwλt [[0WTA-Ranking wt 0Svitolina] = 05]
...
λwλt [[0WTA-Ranking wt 0Kvitova] = 012]
...

To derive the answer to the question (1), i.e. "who are the first three players in WTA ranking singles?", we can use the first three constructions, because the question transforms into the construction

λwλt λx [[0WTA-Ranking wt x] ≤ 03] → (oι)τω
25 We are grateful to an anonymous referee for this comment.
and the condition that the ranking be less than or equal to 3 is met. Hence, we derive the answer {Barty, Halep, Pliskova}.
Here is another example: "The US President met the Czech President in the Reduta Jazz Club, Prague, in 1994". This sentence is multiply ambiguous. The ambiguity concerns the question of who met with whom in the Reduta Jazz Club. The ambiguities stem from the interplay between the time reference 1994 and the current/then presidencies.26 Of course, those who know the history of the relations between the United States and the Czech Republic remember that a memorable moment occurred in 1994, when the then US President Bill Clinton visited the Czech Republic and took in the music at the Reduta Jazz Club. The then President of the Czech Republic, Václav Havel, presented Clinton with a saxophone, and Clinton jammed with the band for a few songs. Under this reading, the sentence presupposes the existence of the Czech and US Presidents in 1994. Both definite descriptions occur with supposition de re with respect to 1994, because both the de re principles are valid with respect to the year 1994. In particular, there is an existential presupposition that both presidential offices were occupied in 1994. Hence, we have the following derivation:

(1) [If ∀u [[01994 u] ⊃ [[0Exist wu λwλt [0Preswt 0USA]] ∧ [0Exist wu λwλt [0Preswt 0CR]]]]] then ∃v [[01994 v] ∧ [0Meet wv [0Preswv 0USA] [0Preswv 0CR] 0Reduta]] else fail]      ∅
(2) ∀u [[01994 u] ⊃ [0Clinton = [0Preswu 0USA]]]      ∅
(3) ∀u [[01994 u] ⊃ [0Havel = [0Preswu 0CR]]]      ∅
(4) ∀u [[01994 u] ⊃ ∃x [x = [0Preswu 0USA]]]      2, ∃-I
(5) ∀u [[01994 u] ⊃ [0Exist wu λwλt [0Preswt 0USA]]]      4, Def. Exist
(6) ∀u [[01994 u] ⊃ ∃x [x = [0Preswu 0CR]]]      3, ∃-I
(7) ∀u [[01994 u] ⊃ [0Exist wu λwλt [0Preswt 0CR]]]      6, Def. Exist
(8) [[01994 v] ⊃ [0Exist wv λwλt [0Preswt 0USA]]]      5, ∀-E v/u
(9) [[01994 v] ⊃ [0Exist wv λwλt [0Preswt 0CR]]]      7, ∀-E v/u
(10) [01994 v]      Assumption
(11) [0Exist wv λwλt [0Preswt 0USA]]      8, 10, MPP
(12) [0Exist wv λwλt [0Preswt 0CR]]      9, 10, MPP
(13) [[0Exist wv λwλt [0Preswt 0USA]] ∧ [0Exist wv λwλt [0Preswt 0CR]]]      11, 12, ∧-I
(14) [[01994 v] ⊃ [[0Exist wv λwλt [0Preswt 0USA]] ∧ [0Exist wv λwλt [0Preswt 0CR]]]]      13, ⊃-I
(15) ∀u [[01994 u] ⊃ [[0Exist wu λwλt [0Preswt 0USA]] ∧ [0Exist wu λwλt [0Preswt 0CR]]]]      14, ∀-I, u/v
(16) ∃v [[01994 v] ∧ [0Meet wv [0Preswv 0USA] [0Preswv 0CR] 0Reduta]]      1, 15, Def. if-then-else
(17) [[01994 v] ⊃ [0Clinton = [0Preswv 0USA]]]      2, ∀-E, v/u
26 This example is taken from [10] where a detailed analysis of various readings is provided together with the analysis of presuppositions with respect to current time and the year 1994.
(18) [[01994 v] ⊃ [0Havel = [0Preswv 0CR]]]      3, ∀-E, v/u
(19) [[01994 v] ∧ [0Meet wv [0Preswv 0USA] [0Preswv 0CR] 0Reduta]]      16, ∃-E
(20) [01994 v]      19, ∧-E
(21) [0Meet wv [0Preswv 0USA] [0Preswv 0CR] 0Reduta]      19, ∧-E
(22) [0Clinton = [0Preswv 0USA]]      17, 20, MPP
(23) [0Havel = [0Preswv 0CR]]      18, 20, MPP
(24) [0Meet wv 0Clinton 0Havel 0Reduta]      21, 22, 23, SI
(25) [01994 v] ∧ [0Meet wv 0Clinton 0Havel 0Reduta]      20, 24, ∧-I
(26) ∃v [[01994 v] ∧ [0Meet wv 0Clinton 0Havel 0Reduta]]      25, ∃-I
Types. t, u, v → τ; 1994/(oτ); Exist/(oιτω)τω: the property of an individual office of being occupied; Pres(ident-of)/(ιι)τω; USA, CR/ι; Reduta/ι: the Reduta Jazz Club in Prague; Meet/(oιιι)τω: the relation-in-intension, who meets with whom, where.
Remark 1 For the sake of simplicity, we omitted the presupposition that the whole year 1994 must precede the time of evaluation t, a presupposition which is obviously met.27 Yet, we take into account the presupposition that both presidents had to exist in 1994. To this end, we apply the If-then-else-fail function, which is defined as follows:

2[0I λc [Pwt ∧ c = 0Q]]wt

Types. I/(∗n(o∗n)): the singulariser on constructions, i.e. the function that associates a singleton of constructions with the only element of the set, and is otherwise undefined; the variable c → ∗n; P → oτω: the presupposition of Q; Q → ατω. To make the constructions easier to read, instead of '[0If-then-else-fail Pwt 0Q]', we simply write 'If Pwt then Qwt else fail'.
We have derived that at some time in 1994 Clinton met with Havel in the Reduta Jazz Club, Prague. By applying the unification algorithm to properly chosen constructions from the above derivation, we can now answer questions like

"Who met with whom in the Reduta Jazz Club in 1994?"
"When did Clinton meet with Havel in the Reduta Jazz Club?"
"Where did the US President meet with the Czech President in 1994?"

For instance, the first question transforms into the construction (x, y → ι)

λwλt λxy [∃v [[01994 v] ∧ [0Meet wv x y 0Reduta]]]

Having applied (λ-E) rules, unification with (16) yields the answer 'the US President with the Czech President', while by unifying with (26) we obtain the answer 'Clinton with Havel'.

27 For
the analysis of sentences in past and future tenses with time references and their presuppositions, see [4, 47].
Similarly, the second question amounts to the following construction (c → (oτ): a variable ranging over time intervals):

λwλt λc [∃v [[c v] ∧ [0Meet wv 0Clinton 0Havel 0Reduta]]]

By application of λ-E rules and unifying with (26), the answer 'in 1994' is produced.
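To see the matching idea at work outside the TIL notation, here is a minimal Python sketch (an illustration of ours, not a piece of the TIL proof machinery). Ground facts play the role of the formalised ranking sentences, a query supplies a test on the numerical value, and the adjusted unification amounts to checking that test with basic arithmetic; all names are illustrative stand-ins for the constructions above.

    # A toy knowledge base standing in for the formalised WTA-Ranking sentences.
    FACTS = [
        ("WTA-Ranking", "Barty", 1),
        ("WTA-Ranking", "Halep", 2),
        ("WTA-Ranking", "Pliskova", 3),
        ("WTA-Ranking", "Kenin", 4),
        ("WTA-Ranking", "Svitolina", 5),
    ]

    def answer(predicate, value_test):
        """Return every individual x such that a fact (predicate, x, n) matches
        the query and its value n passes the arithmetic test."""
        return {x for (p, x, n) in FACTS if p == predicate and value_test(n)}

    # "Who is the No. 1 player in the WTA singles ranking?"
    print(answer("WTA-Ranking", lambda n: n == 1))   # {'Barty'}

    # "Who are the first three players in WTA ranking singles?"
    # The constituents are not identical to any single fact; the arithmetic
    # test n <= 3 decides which facts are suitable for answering.
    print(answer("WTA-Ranking", lambda n: n <= 3))   # {'Barty', 'Halep', 'Pliskova'}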
6 Two Case Studies

In this section, we illustrate the above-introduced methods by two examples of integrating the rules dealing with property modifiers and with factive verbs, respectively.
6.1 Reasoning with Property Modifiers

Scenario. John is a married man. John's partner is Eve. John is a member of a sports club and a student. All students like holidays. Everybody married believes that his/her partner is fantastic. Frank is a student. Frank thinks that Peter is an actor.
Question. Does John believe that Eve is fantastic?
To formalise our mini knowledge base, we start with assigning types to the objects that receive mention in the text: John, Eve, Peter, Frank, S(port)C(lub)/ι; Married m/((oι)τω (oι)τω); Married, Actor, Student/(oι)τω; Partner(-of)/(ιι)τω; Fantastic/(oι)τω; Member, Like/(oιι)τω; Holidays/α: for the sake of simplicity, we don't analyse the type of holidays, which is harmless here; Believe, Think/(oι∗n)τω; w → ω; t → τ; x, y → ι. The analysis of the sentences of our scenario comes down to these constructions:

A. λwλt [[0Married m 0Man]wt 0John]
B. λwλt [[0Partner wt 0John] = 0Eve]
C. λwλt [[0Member wt 0John 0SC] ∧ [0Student wt 0John]]
D. λwλt ∀x [[0Student wt x] ⊃ [0Like wt x 0Holidays]]
E. λwλt ∀x [[0Married wt x] ⊃ [0Believe wt x [0Sub [0Tr [0Partner wt x]] 0y 0[λwλt [0Fantastic wt y]]]]]
F. λwλt [0Student wt 0Frank]
G. λwλt [0Think wt 0Frank 0[λwλt [0Actor wt 0Peter]]]
Conclusion/question: Q. λwλt [0Believe wt 0John 0[λwλt [0Fantastic wt 0Eve]]]
To derive the answer, we are going to apply the system of Gentzen’s natural deduction (ND) adapted to TIL. In addition to the standard rules of the ND system, we need the rule of left subsectivity (LS) for dealing with the property modifier Married m .
The rule is as follows:

(LS)
[[0Married m 0Man]wt x]
―――――――――――――――――――――
[0Married wt x]
Informally, this rule represents the fact that "A married man is married". We must also deal with technical rules and functions specific to TIL. For instance, applications of the functions Sub and Tr must be evaluated appropriately, or Leibniz's law of substitution of identicals, specified for TIL in [20, 23], must be appropriately applied. Here is the derivation:

(1) [[0Married m 0Man]wt 0John]      ∅
(2) [[0Partner wt 0John] = 0Eve]      ∅
(3) ∀x [[0Married wt x] ⊃ [0Believe wt x [0Sub [0Tr [0Partner wt x]] 0y 0[λwλt [0Fantastic wt y]]]]]      ∅
(4) [[0Married wt 0John] ⊃ [0Believe wt 0John [0Sub [0Tr [0Partner wt 0John]] 0y 0[λwλt [0Fantastic wt y]]]]]      3, ∀-E, 0John/x
(5) [[0Married wt 0John] ⊃ [0Believe wt 0John [0Sub [0Tr 0Eve] 0y 0[λwλt [0Fantastic wt y]]]]]      2, 4, SI (Leibniz)
(6) [0Married wt 0John]      1, LS
(7) [0Believe wt 0John [0Sub [0Tr 0Eve] 0y 0[λwλt [0Fantastic wt y]]]]      5, 6, MPP
(8) [0Believe wt 0John 0[λwλt [0Fantastic wt 0Eve]]]      7, Sub, Tr
The answer to the question Q is Yes, of course; it follows from our mini knowledge base that John indeed believes that Eve is fantastic. However, in this proof we simplified the situation. We took into account only the premises relevant for deriving the conclusion, ignoring the others. For instance, from the premises D and F, one can infer (by applying ∀-E and MPP) that "Frank likes holidays". Similarly, by applying ∧-E, ∀-E, and MPP to the premises C and D, we can infer that John likes holidays. Yet, these conclusions are pointless when answering the question Q. In practice, there is a vast number of sentences formalised in the form of TIL constructions, so that extracting the relevant ones is not easy. Moreover, implementation of the method within an interactive question-answering system calls for an algorithm for selecting appropriate input sentences, so as to reduce the inferring of consequences that are not needed. To this end, we propose a simple solution that nevertheless restricts the number of input premises, and thus also the length of the proofs, significantly. We select only those sentences that talk about the objects that receive mention in a given question. In our example, the following constructions would be selected, because they contain the constituents 0Believe, 0John, 0Fantastic, and 0Eve, which they have in common with the question Q:
A. λwλt [[0Married m 0Man]wt 0John]
B. λwλt [[0Partner wt 0John] = 0Eve]
C. λwλt [[0Member wt 0John 0SC] ∧ [0Student wt 0John]]
E. λwλt ∀x [[0Married wt x] ⊃ [0Believe wt x [0Sub [0Tr [0Partner wt x]] 0y 0[λwλt [0Fantastic wt y]]]]]
The premises D, F, and G are irrelevant because they do not have any constituent in common with the question Q. This heuristic method guarantees neither that all the selected constructions are necessary for deriving the answer (in our case the premise C is superfluous), nor that the chosen set is sufficient for deriving the answer. It may happen that, during the proof process, the heuristic method must be iterated to select additional input sentences. Nevertheless, it turns out that in most cases a one-step heuristic is sufficient, and the process of proving is effectively optimised.
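As a rough illustration of this heuristic (in Python, with the formalised sentences reduced to simplified constituent sets; the sets below are stand-ins for the TIL constructions A–G), premise selection is a simple overlap test, and a second pass can be used when the method has to be iterated, as in Sect. 6.2:

    # Each premise is represented by the set of its constituents (simplified).
    KB = {
        "A": {"Married_m", "Man", "John"},
        "B": {"Partner", "John", "Eve"},
        "C": {"Member", "John", "SC", "Student"},
        "D": {"Student", "Like", "Holidays"},
        "E": {"Married", "Believe", "Sub", "Tr", "Partner", "Fantastic"},
        "F": {"Student", "Frank"},
        "G": {"Think", "Frank", "Actor", "Peter"},
    }

    def select(question_constituents, kb, iterations=1):
        """Select premises sharing a constituent with the question; further
        iterations add premises overlapping with the ones already selected."""
        selected = {k for k, cs in kb.items() if cs & question_constituents}
        for _ in range(iterations - 1):
            pool = question_constituents.union(*(kb[k] for k in selected))
            selected |= {k for k, cs in kb.items() if cs & pool}
        return selected

    # Question Q: "Does John believe that Eve is fantastic?"
    print(sorted(select({"Believe", "John", "Fantastic", "Eve"}, KB)))
    # ['A', 'B', 'C', 'E']  -- D, F, and G are filtered out, as in the text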
6.2 Reasoning with Factive Propositional Attitudes

Scenario. The Mayor of Ostrava is Tomáš Macura. Prof. Vondrák likes teaching. The Mayor of Ostrava regrets that the President of the Technical University of Ostrava (TUO) does not know (yet) that he (the President of TUO) will go to Brussels. The President of TUO is prof. Snášel. Prof. Snášel likes swimming. Prof. Vondrák is a politician.
Question. Will prof. Snášel go to Brussels?
Types: Snasel, Macura, Vondrak, Brussels/ι; President(of TUO), Mayor(of Ostrava)/ιτω; Like/(oια)τω; Know, Regret/(oι∗n)τω; Swimming, Teaching/α28; Politician/(oι)τω; Go/(oιι)τω.
Knowledge base:

A. λwλt [0Mayor wt = 0Macura]
B. λwλt [0Like wt 0Vondrak 0Teaching]
C. λwλt [0Regret wt 0Mayor wt 0[λwλt ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]]]
D. λwλt [0President wt = 0Snasel]
E. λwλt [0Like wt 0Snasel 0Swimming]
F. λwλt [0Politician wt 0Vondrak]
Conclusion/question: Q. λwλt [0Go wt 0Snasel 0Brussels]
What is interesting about this example is that it makes it possible to demonstrate a top-down derivation from hyperintensional level of the complement of Regret28 For the sake of simplicity, we assign type α to these activities, because this simplification is harmless to the derivation we are going to demonstrate.
ing/not knowing that "he will go to Brussels" to the extensional level of Snasel's going to Brussels. It is made possible by the application of the rules for factive attitudes defined above, plus the resolution of anaphoric references by the substitution method. To recapitulate, here are the rules (c → ∗n, 2c → oτω, 2cwt → o):

(FA1)
[0Regret wt a c]
――――――――――――――
2cwt

(FA2)
¬[0Know wt a c]
――――――――――――――
2cwt
For the selection of constructions that are relevant for deriving the answer, we now apply the heuristics described above. The constituents of the question Q are 0Go, 0Snasel, and 0Brussels. These constituents occur as sub-constructions of the sentences C, D, and E:

C. λwλt [0Regret wt 0Mayor wt 0[λwλt ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]]]
D. λwλt [0President wt = 0Snasel]
E. λwλt [0Like wt 0Snasel 0Swimming]
In sentence C there is another constituent, namely 0Mayor, and this same constituent also occurs in the premise A. By iterating the heuristics, we include A among the premises as well:

A. λwλt [0Mayor wt = 0Macura]

The proof of the argument, i.e. the derivation of the answer to the question Q from the premises A, C, D, and E, is as follows:

(1) [0Regret wt 0Mayor wt 0[λwλt ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]]]      ∅
(2) [0President wt = 0Snasel]      ∅
(3) [0Like wt 0Snasel 0Swimming]      ∅
(4) [0Mayor wt = 0Macura]      ∅
(5) 20[λwλt ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]]wt      1, FA1
(6) [λwλt ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]]wt      5, 20-E
(7) ¬[0Know wt 0President wt [0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]      6, β-r
(8) 2[0Sub [0Tr 0President wt] 0he 0[λwλt [0Go wt he 0Brussels]]wt      7, FA2
(9) 2[0Sub [0Tr 0Snasel] 0he 0[λwλt [0Go wt he 0Brussels]]wt      8, 2, SI
(10) 2[0Sub 00Snasel 0he 0[λwλt [0Go wt he 0Brussels]]wt      9, Tr
(11) 20[λwλt [0Go wt 0Snasel 0Brussels]]wt      10, Sub
(12) [λwλt [0Go wt 0Snasel 0Brussels]]wt      11, 20-E
(13) [0Go wt 0Snasel 0Brussels]      12, β-r
Since we proved that the premises A, C, D, and E entail that Snasel is going to Brussels, the answer to the question Q is Yes. However, the above scenario makes it possible to answer other questions as well. For instance, we can ask:

Q1. "Who is going to Brussels?"          λwλt λx [0Go wt x 0Brussels]
Q2. "Where does Snasel go?"              λwλt λy [0Go wt 0Snasel y]
Q3. "Who is the Mayor of Ostrava?"       λwλt λz [0Mayor wt = z]

and many other Wh-questions. The technique of answering consists in applying the unification algorithm to a given question and an appropriate formalised sentence from the scenario, as described above. For instance, the first question is easily unified with the construction (13), thus producing the answer 'Snasel'.
7 Conclusion

In this paper, we introduced a system for 'intelligent' question answering over natural language texts. The system derives answers to questions as logical consequences of assumptions extracted from given text corpora. When designing such a system, we had to solve several problems. First, natural language sentences must be analysed in a fine-grained way so that all the semantically salient features of a language are captured by an adequate formalisation. To this end, we exploited the system of Transparent Intensional Logic (TIL). Second, there are special rules rooted in the rich semantics of natural language which are not found in standard proof calculi. The problem is how to integrate these rules with a given proof system. We addressed this problem by means of natural deduction adapted to TIL. Third, there is the problem of how to extract, from large corpora of input text data, just those sentences that are needed for deriving the answer. As a solution, we proposed a heuristic method driven by the constituents of a given question. Last but not least, we dealt with the problem of answering Wh-questions.
There are two novelties in the paper. While previous proposals based on TIL tacitly presupposed that it is possible to pre-process the natural language sentences first and then apply a standard proof calculus, we gave up this assumption, because it turned out to be unrealistic. Instead, we opted for Gentzen's natural deduction system, so that the special semantic rules can be smoothly inserted into the derivation process together with the standard I/E rules of the proof system. Yet, by applying the forward-chaining strategy of the natural deduction system, we faced the problem of extracting those sentences that are relevant for the derivation of the answer. As a solution, we proposed a heuristic method that selects those sentences that have some constituents in common with the posed question. The second novel result is the method of answering Wh-questions. The analysis of such questions yields constructions of the form λx [. . . ]. A direct answer provides
the value of the variable x by the substitution that unifies a given query with an appropriate sentence from an input knowledge base. Future research will concentrate on the comparison of this approach with a system that derives answers using the backward-chaining strategy of the general resolution method and the sequent calculus, and on an effective implementation thereof.
Acknowledgements This research was supported by the Grant Agency of the Czech Republic, project no. GA18-23891S, Hyperintensional Reasoning over Natural Language Texts, and by the internal grant agency of VSB-Technical University of Ostrava, project SGS No. SP2020/62, Application of Formal Methods in Knowledge Modelling and Software Engineering III. Michal Fait was also supported by the Moravian-Silesian regional program No. RRC/10/2017 "Support of science and research in Moravian-Silesian region 2017". This research was supported by the University of Oxford project 'New Horizons for Science and Religion in Central and Eastern Europe' funded by the John Templeton Foundation. The opinions expressed in the publication are those of the author(s) and do not necessarily reflect the view of the John Templeton Foundation. A short version of the paper (see [12]) was presented at the Special Session on Natural Language Processing in Artificial Intelligence (NLPinAI) of the 12th International Conference on Agents and Artificial Intelligence, ICAART 2020, Valletta, Malta. This paper is its major extension. In addition to many touches here and there that have been made to improve the quality, we added a section on the logical analysis of Wh-questions and on deriving answers to Wh-questions. Moreover, the section dealing with TIL natural deduction has been extended with the rules for existential quantification into hyperintensional contexts. Also, the two case studies in Sect. 6 have been revised, and Wh-questions are answered here.
References 1. Baglini, R., Francez, I.: The implications of managing. J. Semant. 33(3), 541–560 (2016). https://doi.org/10.1093/jos/ffv007 2. Church, A.: A formulation of the simple theory of types. J. Symbol. Logic 5(2), 56–68 (1940). https://doi.org/10.2307/2266170 ˇ 3. Cíhalová, M., Duží, M.: Questions, answers, and presuppositions. Computacion y Sistemas 19(4), 647–659 (2015). https://doi.org/10.13053/CyS-19-4-2327 4. Duží, M.: Tenses and truth-conditions: a plea for if-then-else. In: Peliš, M., (ed.) Logica Yearbook 2009, pp. 63–80. College Publications, London (2010). http://collegepublications.co.uk/ logica/?00017 5. Duží, M.: Extensional Logic of Hyperintensions. In: Düsterhöft, A., Klettke, M., Schewe, K., (eds.) Conceptual modelling and its theoretical foundations - essays dedicated to Bernhard Thalheim on the occasion of his 60th birthday. Lecture Notes in Computer Science, vol. 7260, pp. 268–290. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-28279-9_19 6. Duží, M.: Towards an Extensional Calculus of Hyperintensions. Organon F 19(supplementary issue 1), 20–45 (2012). http://www.klemens.sav.sk/fiusav/organon/?q=en/towardsextensional-calculus-hyperintensions 7. Duží, M.: Property Modifiers and Intensional Essentialism. Computacion y Sistemas 21(4), 601–613 (2017). https://doi.org/10.13053/CyS-21-4-2811 8. Duží, M.: Logic of dynamic discourse; anaphora resolution. In: Yoshida, N., Kiyoki, Y., Chawakitchareon, P., Koopipat, C., Hansuebsai, A., Sornlertlamvanich, V., Thalheim, B., Jaakkola, H. (eds.) Information Modelling and Knowledge Bases XXIX, Frontiers in Artificial Intelligence and Applications, vol. 301, pp. 263–279. IOS Press, Amsterdam (2018). https://doi.org/10.3233/978-1-61499-834-1-263 9. Duží, M.: Negation and presupposition, truth and falsity. Stud. Log. Gramm. Rhetor. 54(67), 15–46 (2018). https://doi.org/10.2478/slgr-2018-0014
10. Duží, M.: Ambiguities in Natural Language and Time References. In: Horák, A., Osolsobˇe, K., Rambousek, A., Rychlý, P. (eds.) Slavonic Natural Language Processing in the 21st Century, pp. 28–50. Brno, Czech republic, Tribun EU (2019) 11. Duží, M.: If structured propositions are logical procedures then how are procedures individuated? Synthese 196(4), 1249–1283 (2019). https://doi.org/10.1007/s11229-017-1595-5 12. Duží, M., Fait, M.: Integrating special rules rooted in natural language semantics into the system of natural deduction. In: Rocha, A.P., Steels, L., van den Herik J. (eds.) ICAART 2020 Proceedings of the 12th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 410–421 (2020). https://doi.org/10.5220/0009369604100421 13. Duží, M., Fait, M., Menšík, M.: Adjustment of Goal-driven Resolution for Natural Language Processing in TIL. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) Proceedings of the 13th Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019, pp. 71–81. Tribun EU (2019). http://nlp.fi.muni.cz/raslan/2019/paper04-duzi.pdf 14. Duží, M., Horák, A.: Hyperintensional Reasoning Based on Natural Language Knowledge Base. Int. J. Uncertain. Fuzziness Knowlege-Based Syst. Rhetor. 28(3), 443–468 (2020). https:// doi.org/10.1142/S021848852050018X 15. Duží, M., Jespersen, B.: Procedural isomorphism, analytic information and β-conversion by value. Logic J. IGPL 21(2), 291–308 (2013). https://doi.org/10.1093/jigpal/jzs044 16. Duží, M., Jespersen, B.: Transparent quantification into hyperintensional objectual attitudes. Synthese 192(3), 635–677 (2015). https://doi.org/10.1007/s11229-014-0578-z 17. Duží, M., Jespersen, B.: An intelligent question-answer system over natural-language texts. In: Kim, S.B., Zelinka, I., Hoang Duy, V., Trong Dao, T., Brandstetter P. (eds.) AETA 2018 – Recent Advances in Electrical Engineering and Related Sciences: Theory and Applications. Lecture Notes in Electrical Engineering, vol. 554, pp. 162–174. Springer, Berlin (2020). https:// doi.org/10.1007/978-3-030-14907-9_17 18. Duží, M., Jespersen, B., Materna, P.: Procedural Semantics for Hyperintensional Logic - Foundations and Applications of Transparent Intensional Logic. Logic, Epistemology, and the Unity of Science, vol. 17. Springer, Berlin (2010). https://doi.org/10.1007/978-90-481-8812-3 19. Duží, M., Kosterec, M.: A Valid Rule of conversion for the Logic of Partial Functions. Organon F 24(1), 10–36 (2017). http://www.klemens.sav.sk/fiusav/organon/?q=sk/valid-rulev-conversion-logic-partial-functions 20. Duží, M., Materna, P.: Validity and Applicability of Leibniz’s Law of Substitution of Identicals. In: Arazim, P., Laviˇcka, T. (eds.) Logica Yearbook 2016, pp. 17–35. College Publications, London (2017). http://collegepublications.co.uk/logica/?00030 21. Duží, M., Menšík, M.: Logic of inferable knowledge. In: Jaakkola, H., Thalheim, B., Kiyoki, Y., Yoshida, N. (eds.) Information Modelling and Knowledge Bases XXVIII, Frontiers in Artificial Intelligence and Applications, vol. 292, pp. 405–425. IOS Press, Amsterdam (2017). https:// doi.org/10.3233/978-1-61499-720-7-405 22. Duží, M., Menšík, M.: Inferring knowledge from textual data by natural deduction. Computacion y Sistemas 24(1), 29–48 (2020). https://doi.org/10.13053/CyS-24-1-3345 23. Fait, M., Duží, M.: Substitution Rules with Respect to a Context. In: Kim, S.B., Zelinka, I., Hoang Duy, V., Trong Dao, T., Brandstetter, P. (eds.) 
AETA 2018 – Recent Advances in Electrical Engineering and Related Sciences: Theory and Applications. Lecture Notes in Electrical Engineering, vol. 554, pp. 55–66. Springer, Berlin (2020). https://doi.org/10.1007/ 978-3-030-14907-9_6 24. Gamut, L.T.F.: Logic, Language and Meaning, vol. II - Intensional Logic and Logical Grammar. The University of Chicago Press, London (1991). https://press.uchicago.edu/ucp/books/book/ chicago/L/bo3628700.html 25. Geach, P.T.: Good and evil. Analysis 17(2), 32–43 (1956). https://doi.org/10.2307/3326442 26. Gordon, M.J.C., Melham, T.F.: Introduction to HOL. Cambridge University Press, Cambridge (1993). https://doi.org/10.1017/S0956796800001180 27. Harrah, D.: The Logic of Questions. In: Gabbay, D.M., Guenthner, F. (eds.) Handbook of Philosophical Logic. Handbook of Philosophical Logic, vol. 8, pp. 1–60. Springer, Dordrecht (2002). https://doi.org/10.1007/978-94-010-0387-2_1
28. Higginbotham, J.: Interrogatives. In: Hale, K., Keyser, S.J. (eds.) The View from Building 20: Essays in Linguistic in Honor od Sylvain Bromberger, pp. 195–227. The MIT Press, Cambridge, MA (1993) 29. Horák, A.: The normal translation algorithm in transparent intensional logic for Czech. Ph.D. thesis, Masaryk University, Brno (2002). https://www.fi.muni.cz/~hales/disert/thesis.pdf 30. Jespersen, B.: Structured lexical concepts, property modifiers, and transparent intensional logic. Philos. Stud. 172(2), 321–345 (2015). https://doi.org/10.1007/s11098-014-0305-0 31. Jespersen, B.: Left subsectivity: how to infer that a round peg is round. Dialectica 70(4), 531–547 (2016). https://doi.org/10.1111/1746-8361.12159 32. Jespersen, B., Carrara, M., Duží, M.: Iterated privation and positive predication. J. Appl. Logic 25(supplement), 548–571 (2017). https://doi.org/10.1016/j.jal.2017.12.004 33. Jespersen, B., Duží, M.: Introduction to the special issue on hyperintensionality. Synthese 192(3), 525–534 (2015). https://doi.org/10.1007/s11229-015-0665-9 34. Kamp, H.: A theory of truth and semantic representation. In: Groenendijk, J., Janssen, T., Stokhof, M. (eds.) Formal Methods in the Study of Language, Part 1, pp. 227–322. Mathematical Center, Amsterdam (1981) 35. Kamp, H., Reyle, U.: From discourse to logic. In: Introduction to Model-Theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Studies in Linguistics and Philosophy, vol. 42, 1 edn. Springer, Dordrecht (1993). https://doi.org/10.1007/978-94017-1616-1 36. Keenan, E.L., Hull, R.D.: The logical presuppositions of questions and answers. In: Petöfi, J.S., Franck, D. (eds.) Präsuppositionen in Philosophie und Linguistik, vol. 7, pp. 441–466. Athenäum, Frankfurt (1973) 37. Kotová, I.: Logika Dynamického Diskursu. Master’s thesis, VSB-Technical University of Ostrava (2018). https://hdl.handle.net/10084/128520. In czech 38. Loukanova, R.: β-reduction and antecedent-anaphora relations in the language of acyclic recursion. In: Cabestany, J., Hernández, F.S., Prieto, A., Corchado, J.M. (eds.) Bio-Inspired Systems: Computational and Ambient Intelligence, 10th International Work-Conference on Artificial Neural Networks, IWANN 2009. Proceedings, Part I. Lecture Notes in Computer Science, vol. 5517, pp. 496–503. Springer, Berlin (2009). https://doi.org/10.1007/978-3-642-02478-8_62 39. Loukanova, R.: Algorithmic semantics of ambiguous modifiers with the type theory of acyclic recursion. In: Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012, pp. 117–121 (2012). https://doi.org/10.1109/WI-IAT.2012.246 40. Loukanova, R.: Gamma-reduction in type theory of acyclic recursion. Fundam. Inform. 170(4), 367–411 (2019). https://doi.org/10.3233/FI-2019-1867 41. Loukanova, R.: Gamma-satar canonical forms in the type-theory of acyclic algorithms. In: van den Herik, J., Rocha,. A.P. (eds.) Agents and Artificial Intelligence. ICAART 2018. Lecture Notes in Computer Science, vol. 11352, pp. 383–407. Springer, Cham (2019). https://doi.org/ 10.1007/978-3-030-05453-3_18 42. Łupkowski, P.: Erotetic inferences in natural language dialogues. In: Proceedings of the Logic & Cognition Conference, pp. 39–48 (2012) 43. Moschovakis, Y.N.: A logical calculus of meaning and synonymy. Linguist. Philos. 29(1), 27–89 (2006). https://doi.org/10.1007/s10988-005-6920-7 44. Nadathur, P.: Causal necessity and sufficiency in implicativity. In: Proceedings of SALT, vol. 
26, pp. 1002–1021 (2016). https://doi.org/10.3765/salt.v26i0.3863 45. Peliš, M., Majer, O.: Logic of questions and public announcements. In: Bezhanishvili, N., Löbner, S., Schwabe, K., Spada, L. (eds.) Logic, Language, and Computation - 8th International Tbilisi Symposium on Logic, Language, and Computation, TbiLLC 2009. Lecture Notes in Computer Science, vol. 6618, pp. 145–157. Springer, Berlin (2011). https://doi.org/10.1007/ 978-3-642-22303-7_9 46. Tichý, P.: Questions, answers, and logic. Am. Philos. Q. 15(4), 275–284 (1978). Reprinted in [49, pp. 293–304]
47. Tichý, P.: The logic of temporal discourse. Linguist. Philos. 3(3), 343–369 (1980). https://doi. org/10.1007/BF00401690. Reprinted in [49, pp. 373–403] 48. Tichý, P.: The Foundations of Frege’s Logic. De Gruyter, Berlin (1988). https://doi.org/10. 1515/9783110849264 49. Tichý, P.: Collected Papers in Logic and Philosophy. Czech Academy of Science and University of Otago Press, Prague, Dunedin, Filosofia (2004) 50. Wisniewski, A.: The Posing of Questions: Logical Foundations of Erotetic Inferences, Synthese Library, vol. 252. Springer, Dordrecht (1995). https://doi.org/10.1007/978-94-015-8406-7
Learning Domain-Specific Grammars from a Small Number of Examples Herbert Lange and Peter Ljunglöf
Abstract In this chapter we investigate the problem of grammar learning from a perspective that diverges from previous approaches. These prevailing approaches to learning grammars usually attempt to infer a grammar directly from example corpora without any additional information. This either requires a large training set or suffers from bad accuracy. We instead view learning grammars as a problem of grammar restriction or subgrammar extraction. We start from a large-scale grammar (called a resource grammar) and a small number of example sentences, and find a subgrammar that still covers all the examples. To accomplish this, we formulate the problem as a constraint satisfaction problem, and use a constraint solver to find the optimal grammar. We created experiments with English, Finnish, German, Swedish, and Spanish, which show that 10–20 examples are often sufficient to learn an interesting grammar for a specific application. We also present two extensions to this basic method: we include negative examples and allow rules to be merged. The resulting grammars can more precisely cover specific linguistic phenomena. Our method, together with the extensions, can be used to provide a grammar learning system for specific applications. This system is easy-to-use, human-centric, and can be used by non-syntacticians. Based on this grammar learning method, we can build applications for computer-assisted language learning and interlingual communication, which rely heavily on the knowledge of language and domain experts who often lack the competence to develop required grammars themselves. Keywords Grammar learning · Grammar restriction · Domain-specific grammar · Constraint satisfaction
H. Lange (B) · P. Ljunglöf Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden e-mail: [email protected] P. Ljunglöf e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. Loukanova (ed.), Natural Language Processing in Artificial Intelligence—NLPinAI 2020, Studies in Computational Intelligence 939, https://doi.org/10.1007/978-3-030-63787-3_4
1 Introduction Many currently common trends in NLP are aiming at general purpose language processing for tasks such as information retrieval, machine translation, and text summarization. But there are cases of usage where it is not necessary to handle language in general. Instead, it is possible to restrict the language which such a system needs to recognize or produce. The reason for restricting the language can be in requirements of very high precision, e.g., in safety-critical systems, or in order to build domain-specific systems, e.g., special-purpose dialogue systems. In this context, domain-specific means that an application is used in a specific application domain, and is unrelated to the concept from cognitive science. A simple example of a domain specific language is the kind of language used in cooking recipes. They involve a specific vocabulary but also certain kinds of syntactic constructions: imperative in English and Swedish, and subjunctive with an unspecified subject in German. Domain-specific languages in the sense we use the term are related to Controlled Natural Languages [11], which are natural language fragments with clear formal properties. One common aspect of these high-precision applications is that they can be built using computational grammars to provide the required reliability and interpretability. These domain-specific grammars often have to be developed by grammar engineers who are not always experts in the application domain. Domain experts on the other hand usually lack the skills necessary to create the grammars themselves. We present a method to bridge the gap between the two parties. Starting from a general-purpose resource grammar, we use example sentences to automatically infer a domain-specific grammar. We apply constraint satisfaction methods to ensure that the examples are covered. It is also possible to apply various objective functions to guarantee that the result is optimal in a certain way. Even though our results so far are very promising, we have no guarantee that the method we propose is working for all kinds of domain-specific languages. The main limitation we see is that not all domains are covered sufficiently by the resource grammar. This chapter is an extension of our paper from the Special Session on Natural Language Processing in Artificial Intelligence at the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020) [22]. In addition to the technique presented in the original paper, in this chapter we extend the basic method in two ways. Firstly, we account for negative examples. Besides sentences that have to be included in the language of the new grammar, it is also possible to give sentences which the grammar should not be able to parse. This allows for an iterative refinement process. Secondly, we generalize the method, which uses grammar rules as the basic units, to subtrees of size1 larger than 1, as the basic units. In combination with the ability to merge rules, we move away from extracting pure subgrammars, towards learning modified grammars. The resulting grammars can more precisely cover specific linguistic phenomena. 1 The
size of a tree is the number of nodes in the tree.
For our experiments we use the Grammatical Framework (GF) [29, 31] as the underlying grammar formalism, but we believe that the main ideas should be transferable to other formalisms. The main requirements are that there is a general purpose resource grammar and that syntactic analyses can be translated into logic constraints in a way similar to what we present in this chapter. The work in this chapter is based on our experiments using GF and GF’s Resource Grammar Library (RGL) [30], and we leave as future work to explore how our methods can be used for other grammar formalisms.
1.1 Use Case: Language Learning

Our primary use case is language learning. We have developed a grammar-based language learning tool for computer-assisted language learning [19–21], which automatically generates translation exercises based on multilingual computational grammars. Each exercise topic is defined by a specialized grammar that can recognize and generate examples that showcase a topic. Creating those grammars requires experience in grammar writing and knowledge of the grammar formalism. However, language teachers usually lack these skills. Our idea is to let the teacher write down a set of example sentences that show which kind of language fragment they have in mind, and the system will automatically infer a suitable grammar. One exercise topic could focus on gender and number agreement, another one on relative clauses, while yet another could focus on inflecting adjectives or the handling of adverbs. The optimal final grammar should cover and generalize from the given examples. At the same time it should not over-generate, i.e., cover ungrammatical expressions, and instead reduce the syntactic ambiguity, i.e., the number of syntactic analyses or parse trees, as much as possible. Completely covering the examples, generalizing from them, and not over-generating are usually mutually exclusive and contradictory requirements. So the best we can hope for is a balance between these requirements.
1.2 Use Case: Interlingual Communication Another use case for our research is domain-specific applications, such as dialogue systems, expert systems, and apps to support communication in situations where participants do not share a common language. These situations can, for example, be found in the healthcare sector, especially when involving immigrants [33]. Here misunderstandings can cause serious problems. In the development of such systems, the computational linguists who are specialists on the technological side have to collaborate with informants who have deep knowledge of the language and domain. A common domain is established by dis-
cussing example sentences. These sentences can be automatically translated into a domain-specific grammar, which can be refined by generating new example sentences based on the initial grammar and receiving feedback about them from the informants. Such an iterative, example-based, development of application-specific grammars allows for close collaboration between the parties involved. The result is a high quality domain-specific application.
2 Background We want to give some insight into the related work and relevant background here. There is a long history of approaches to grammar learning. Our grammar learning technique makes use of computational grammar resources, some of which are presented here. An essential part for our work is the concept of an abstract grammar, which we will define and explain. Furthermore, using constraint solving is a common computational approach to problem solving.
2.1 Previous Work on Grammar Learning

Grammar inference—the generation of a grammar from examples—has been a field of active research for quite some time. Common grammar learning approaches are:

• grammar inference, where a grammar is inferred from unannotated corpus data,
• data-oriented parsing (DOP), where subtrees from an example treebank are used as an implicit grammar,
• probabilistic context-free grammars (PCFGs), where, for each rule of a context-free grammar (CFG), a probability is learned from a corpus, and
• subgrammar extraction, where a subset of syntactic rules is extracted from a more general grammar.

We draw inspiration from DOP and several of the other approaches. Most of these methods require no or very little linguistic information besides the training examples. As a result, they tend to require more training data to learn more expressive grammars. The main difference of our method is that it additionally requires linguistic knowledge in the form of a wide-coverage resource grammar. This, however, allows the method to learn reasonable grammars, usually from fewer than 10 examples.
Grammar Inference: There has been a lot of work on general grammar inference, using both supervised and unsupervised methods (see, e.g., overviews by [6, 9]). Most of these approaches have focused on context-free grammars, but there has also been work on learning grammars in more expressive formalisms (e.g., [7]). In traditional grammar inference one starts from a corpus and learns a completely new grammar. Because the only input to the inference algorithm is an unannotated
corpus, it can require a larger amount of data to learn a reasonable grammar. Clark reports results on the ATIS corpus of around 740 sentences [5]. DOP: In DOP [1, 2], the grammar is not explicitly inferred. Instead a treebank is seen as an implicit grammar which is used by the parser. It tries to combine subtrees from the treebank to find the most probable parse tree. The DOP model is interesting because it has some similarities with our approach, especially with the second extension where we use subtrees in grammar learning (see Sect. 9). Bod also reports results for DOP on the Penn Treebank ATIS corpus with 750 trees [2]. PCFGs: Another approach that could come to mind when thinking about learning grammars is PCFGs, an extension of context-free grammars where each grammar rule is assigned a probability. Parsing with a PCFG involves finding the most probable parse tree [26, Chap. 11]. The probabilities for a PCFG can be learned from annotated or unannotated corpora using the Inside-Outside algorithm, an instance of Expectation Maximization (EM) algorithms [23, 27]. The approach works for both formal and natural languages, as Pereira and Shabes [27] show. They present experiments involving the palindrome language L = {ww R | w ∈ {a, b}∗ } and the Penn treebank, and show that training on bracketed strings results in significantly better results than training on raw text. The results they report are based on 100 input sentences for the palindrome language and on the ATIS corpus using 700 bracketed sentences. Subgrammar Extraction: There has been some previous work on subgrammar extraction [13, 18]. Both articles present approaches to extract an application-specific subgrammar from a large-scale grammar focusing on more or less expressive grammar formalisms: CFG, systemic grammars (equivalent to typed unification based grammars), and HPSG. However, both approaches are substantially different from our approach, either in the input they take or in the constraints they enforce on the resulting grammar. Logic Approaches: To our knowledge, there have been surprisingly few attempts to use logic or constraint-based approaches, such as theorem proving or constraint optimization, for learning grammars from examples. One exception is [14], in which the authors experiment with Boolean satisfiability (SAT) constraint solvers to learn context-free grammars. They report results similar to ours, but only focus on formal languages over {a, b}∗ .
2.2 Abstract Grammars In our work we use the Grammatical Framework (GF). However, the way in which we state our problem allows us to substitute the grammar formalism used by any other that fulfills some basic requirements. Grammatical Framework [30, 31] is a multilingual grammar formalism based on a separation between abstract and concrete syntax. The abstract level is meant to be language-independent, and every abstract syntax can have several associated concrete syntaxes, which act as language-specific realizations of the abstract rules and trees.
In this chapter we only make use of the high-level abstract syntax. This high-level view makes it possible to transfer our approach to other grammar formalisms with a comparable level of abstraction. An abstract syntax can be expressed as a many-sorted signature (Definition 4.1), and we build the techniques in this chapter on top of the concept of many-sorted signatures. As a consequence, the techniques work for any grammar formalism that can be expressed as a many-sorted signature. In this chapter, when we talk about grammars, we usually refer to many-sorted signatures.
Definition 4.1 (Many-sorted Signature) A many-sorted signature is a tuple Σ = ⟨S, F⟩ where S is a set of sorts and F is a set of function symbols, together with a mapping typeΣ : F → S∗ × S which expresses the type of each function symbol.
Many-sorted signatures build the foundation of many-sorted algebras [36, p. 680]. However, signatures already provide a suitable abstraction for our needs, so we do not need the full algebraic framework.
Notation: Instead of giving the two sets and the type function separately, we follow Wirsing [36] and give a signature in the following form:

    signature Bool ≡
      sort Bool
      functions
        true : → Bool
        false : → Bool
        and : Bool, Bool → Bool
        or : Bool, Bool → Bool
    endsignature
Each function symbol in a many-sorted signature has a function type given by the typing function type (Definition 4.1). The sort on the right-hand side of the arrow is the result type of the function and the sorts on the left-hand side are the argument types. In this chapter we can look at grammars from various perspectives. From a linguistic point of view we talk about syntax rules, syntactic rules, or grammar rules. From the perspective of many-sorted signatures we talk about functions. For that reason the terms rule and function are used interchangeably. The same is the case for the terms category and sort, because sorts in many-sorted signatures are used to express syntactic categories. The many-sorted signature in Fig. 1 can be expressed in GF syntax as shown in Fig. 2. In this chapter, we do not want to go into detail about the GF abstract syntax (see, e.g., [31] for more details). Instead we will express all relevant grammars in this chapter as many-sorted signatures. To make the grammar in Fig. 1 more accessible, we give a short explanation of the rules. This grammar has three constant functions: man_N representing the noun man, sleep_V representing the intransitive verb sleep, and detSg_Det representing the singular definite article the. The function UseN converts a noun into a common noun, i.e., a noun phrase without determiner, and DetCN adds the determiner to form a complete noun phrase. The function UseV forms a verb phrase
Fig. 1 Many-sorted signature for an example abstract grammar
Fig. 2 Example GF abstract syntax for the signature in Fig. 1
from an intransitive verb. Finally, PredVP combines a subject noun phrase and a verb phrase into a declarative clause.
Definition 4.2 (Abstract Syntax Tree) An abstract syntax tree t consists of a root node ⟨f, C⟩ (which we write as f : C), and a potentially empty list of children t1 . . . tn (n ≥ 0; if n = 0, the list is empty), where each child ti is itself an abstract syntax tree. A node without children is called a leaf node. The tree t is valid according to a many-sorted signature Σ = ⟨S, F⟩, if C ∈ S and there exists a function f ∈ F with the type f : C1, . . . , Cn → C, such that every child ti is a valid abstract syntax tree and has a root node ⟨fi, Ci⟩ (or, using our notation for nodes, fi : Ci).
Note that this definition is different from the graph-theoretic definition of a tree as a connected acyclic graph. Instead it is similar to abstract syntax trees as used in programming language design and compiler construction [32, Sect. 2.5]. However, the difference is not important for the methods and results of this chapter. An example of an abstract syntax tree of the English sentence The man sleeps can be seen in Fig. 3. The tree is valid according to the grammar Example (Fig. 1).
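As a rough illustration of Definitions 4.1 and 4.2, the Example grammar and the validity check can be rendered in Python as follows. This is only a sketch of ours: the category names (N, CN, Det, NP, V, VP, Cl) follow common GF RGL conventions and are assumed here, and trees are encoded as nested (function, children) pairs.

    # The Example signature of Fig. 1: function symbol -> (argument sorts, result sort)
    SIGNATURE = {
        "man_N":     ([], "N"),
        "sleep_V":   ([], "V"),
        "detSg_Det": ([], "Det"),
        "UseN":      (["N"], "CN"),
        "DetCN":     (["Det", "CN"], "NP"),
        "UseV":      (["V"], "VP"),
        "PredVP":    (["NP", "VP"], "Cl"),
    }

    def valid(tree, signature):
        """Definition 4.2: the root function must be typed in the signature, the
        children's result sorts must match its argument sorts, and every child
        must itself be a valid abstract syntax tree."""
        fun, children = tree
        if fun not in signature:
            return False
        arg_sorts, _ = signature[fun]
        if len(arg_sorts) != len(children):
            return False
        return all(child[0] in signature
                   and signature[child[0]][1] == sort
                   and valid(child, signature)
                   for child, sort in zip(children, arg_sorts))

    # Abstract syntax tree of "the man sleeps" (cf. Fig. 3)
    the_man_sleeps = ("PredVP",
                      [("DetCN", [("detSg_Det", []), ("UseN", [("man_N", [])])]),
                       ("UseV", [("sleep_V", [])])])
    print(valid(the_man_sleeps, SIGNATURE))   # True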
Fig. 3 Abstract syntax tree for the example sentence the man sleeps— note that the English surface words and the dotted edges are not part of the abstract syntax
2.3 Wide-Coverage and Resource Grammars

For various grammar formalisms there exist large grammars describing significant parts of natural languages. Examples include the HPSG resource grammars developed within the DELPH-IN collaboration [8] or grammars created for the XMG metagrammar compiler [37]. The resource grammar available in GF is called the GF resource grammar library (RGL) [30], a general-purpose grammar library for more than 30 languages which covers a wide range of common grammatical constructions. The main purpose of the RGL is to act as a programming interface (API) when building domain-specific grammars. It provides high-level access to the linguistic constructions, facilitating the development of specific applications. The inherent multilinguality also makes it easy to create and maintain multilingual applications. However, to use the RGL for developing GF grammars, it is necessary to learn the GF formalism as well as how to write grammars in general. This limits the user group drastically. In contrast, the methods presented in this chapter allow non-grammarians to create grammars for their own domain or application without knowledge of the grammar formalism and the methods of grammar writing.
2.4 Constraint Satisfaction Problems

Many interesting problems can be formulated as constraint satisfaction problems (CSP) [34, Chap. 6]. For a CSP, the goal is to find an assignment to a number of variables that respects some given constraints. CSPs are classified depending on the domains of the variables and the kinds of constraints that are allowed. If an objective function is added to a CSP to provide a judgment of optimality, we talk of a constraint optimization problem (COP) instead. In this chapter we formulate our problem based on Boolean variables in the constraints, but require integer operations
in the objective functions. An objective function is the function whose value has to be maximized or minimized while solving the constraints. We use the IBM ILOG CPLEX Optimization Studio2 to find solutions to this integer linear problem (ILP) restricted to 0/1 integers. Other solvers, such as the free and open-source solver GLPK (GNU Linear Programming Kit)3, can be used as well, but currently the free alternatives suffer from larger performance issues.
3 Learning a Subgrammar

In this section we describe the task and the problem to be solved. We start from a large, expressive, but over-generating resource grammar and a small set of example sentences. From this input we want to infer a subgrammar that covers the examples and is optimal with respect to some objective function. One possible objective function would be, e.g., to reduce the number of syntactic analyses per sentence. Sections 3.1 and 3.2 outline a formalization of the concepts, notions, and tasks involved in modeling subgrammar extraction as a constraint problem. The description is rather informal; a completely precise formalization is outside the scope of this chapter.
3.1 Subgrammar Extraction by Tree Selection

We assume that we already have a parser for the resource grammar that returns all possible parse trees for the example sentence. That means we can start from a set of trees for each sentence. From the syntax trees we can extract the list of syntactic functions involved (Definition 4.3). The grammar we want to learn still has to be able to cover all these sentences and should be optimal according to some optimality criterion.
Definition 4.3 (Flattened Abstract Syntax Tree) Given an abstract syntax tree t, we define a flattened representation of t as the set

tflat = { f | f : C is a node in the abstract syntax tree t}

The resulting representation loses all structural and type information but is sufficient for our purposes.
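Using the same nested-pair tree encoding as in the earlier sketch, the flattening of Definition 4.3 is a one-line recursion (again only an illustrative Python rendering):

    def flatten(tree):
        """Collect the set of function symbols occurring in an abstract syntax
        tree given as a nested (function, children) pair."""
        fun, children = tree
        return {fun}.union(*(flatten(child) for child in children))

    the_man_sleeps = ("PredVP",
                      [("DetCN", [("detSg_Det", []), ("UseN", [("man_N", [])])]),
                       ("UseV", [("sleep_V", [])])])
    print(flatten(the_man_sleeps))
    # the set of all seven function symbols used in the tree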
2 http://www.cplex.com/. 3 https://www.gnu.org/software/glpk/.
Definition 4.4 (Subgrammar) Given a many-sorted signature Σ = ⟨S, F⟩, a subgrammar is a many-sorted signature Σ′ = ⟨S′, F′⟩ with S′ ⊆ S, F′ ⊆ F, and for all f ∈ F′: typeΣ′(f) = typeΣ(f).
Subgrammar Learning Problem As mentioned earlier, we assume that we already have a parser that can convert example sentences into sets of parse trees. This means that the actual computational problem we try to solve with the techniques presented in this chapter is the following:

• Given: a set of n ≥ 1 tree sets, F = {F1, . . . , Fn}, where each Fk is a set of mk ≥ 1 trees, Fk = {tk1, ..., tkmk}
• Problem: select at least one tree tkik (1 ≤ ik ≤ mk) from each Fk, while minimizing the objective function

Possible objective functions for our problem include:

rules: the number of rules in the resulting grammar (i.e., reducing the grammar size)
trees: the number of all initial parse trees tki that are, intended or not, valid in the resulting grammar (i.e., reducing the syntactic ambiguity)
rules+trees: the sum of rules and trees
weighted: a modification of rules+trees where each rule is weighted by the number of occurrences in all Fk

The problem we describe here is an instance of a set covering problem. It is a generalization of the Hitting Set problem [12, Sect. A3.1], which is NP-complete [17], meaning that the subgrammar learning problem is NP-complete as well.
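As a rough illustration of the problem statement under the rules objective, the following Python sketch enumerates all ways of picking one tree per sentence by brute force. The tree sets are invented toy data, each tree given directly as its flattened rule set; for the rules objective, picking exactly one tree per sentence suffices.

    from itertools import product

    # tree_sets[k] = candidate parse trees for sentence k, each as a rule set
    tree_sets = [
        # "the man sleeps": two hypothetical analyses from the resource grammar
        [frozenset({"PredVP", "DetCN", "detSg_Det", "UseN", "man_N", "UseV", "sleep_V"}),
         frozenset({"PredVP", "MassNP", "UseN", "man_N", "UseV", "sleep_V"})],
        # "the woman runs": one analysis
        [frozenset({"PredVP", "DetCN", "detSg_Det", "UseN", "woman_N", "UseV", "run_V"})],
    ]

    def best_selection(tree_sets):
        """Choose one tree per sentence so that the union of rules is smallest."""
        best_choice, best_rules = None, None
        for choice in product(*tree_sets):
            rules = set().union(*choice)
            if best_rules is None or len(rules) < len(best_rules):
                best_choice, best_rules = choice, rules
        return best_choice, best_rules

    choice, rules = best_selection(tree_sets)
    print(len(rules), sorted(rules))
    # 9 rules: the first analysis of sentence 1 is kept, because it shares
    # DetCN and detSg_Det with the only analysis of sentence 2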
3.2 Modeling Subgrammar Extraction as a Constraint Problem
Even though there exist other solutions to the related class of set covering problems, a natural approach to this problem is to model it as a constraint satisfaction problem. An outline of the system architecture is shown in Fig. 4. Given the set of trees for each sentence, there are various possible ways to model the problem, depending on the choice of the atomic units we want to represent by the logical variables. The logical variables can encode subtrees of various sizes, ranging from subtrees of size 1, i.e., single nodes or syntactic functions, to subtrees of arbitrarily larger sizes. There are also different ways to split a tree into these larger subtrees. In the following, we use subtrees of size 1, but see Sect. 9 for an extension to larger subtrees. As a result, we can represent an abstract syntax tree t as the set of labels in the tree, t_flat = {r_1, …, r_n} (Definition 4.3). This results in a loss of structural information but does not have any negative effect on the outcome of our approach. An example can be seen in Figs. 5 and 6. We start with Fig. 5 and go from left to right, starting with the sentences. Each sentence results in one or several syntax trees, which can then be represented in a flattened form (Definition 4.3).
Fig. 4 The outline of our grammar learning system, where s_1, …, s_n are the input sentences, G_R is the original resource grammar, and G is the resulting subgrammar
The resulting constraints can be seen in Fig. 6. We have variables for each sentence (s_i), each tree (t_ij), and all the syntax rules occurring in the trees (r_k). First we enforce that all sentences have to be covered, then we describe for each sentence which trees have to be covered, and finally for each tree, if it should be covered, which rules have to be covered. The solution to this problem gives rise to a new grammar, which, following the description of our subgrammar learning problem, also covers all examples. Statement 4.1 shows that the resulting grammar is again a many-sorted signature, and according to Statement 4.2 it is also a valid subgrammar of the original grammar following Definition 4.4.
Statement 4.1 From the solution to the CSP and the original resource grammar G_R, we can construct a new grammar G, i.e., a new many-sorted signature Σ_G.
Proof (Outline) The CSP solution consists of the set of variables with value 1. From the solution we can select the set F_G of variables that represent syntax rules.4 The new grammar has the signature Σ_G = ⟨S_G, F_G⟩, where:
F_G = { r | r is a variable that represents a syntax rule and has value 1 }
S_G = ⋃_{r ∈ F_G} { C_1, …, C_n | type_GR(r) = (C_1, …, C_n) }
type_G = type_GR |_{F_G}
(i.e., the restriction of type_GR to the set F_G)
Statement 4.2 The new grammar G described by the signature Σ_G = ⟨S_G, F_G⟩ is a subgrammar of the original grammar G_R given by the signature Σ_GR = ⟨S_GR, F_GR⟩.
4 Thus F_G is a subset of the union of all flattened syntax trees, or F_G ⊆ t_{11,flat} ∪ t_{12,flat} ∪ ⋯ ∪ t_{n m_n,flat}.
Fig. 5 Sentences and tree representations
Fig. 6 Encoding of Fig. 5 as logical constraints
Proof (Outline) By definition, G is a subgrammar of G_R if the following holds:
• F_G ⊆ F_GR: by definition of F_G in Statement 4.1, and the description of the learning problem and abstract syntax trees (Definition 4.2)
• S_G ⊆ S_GR: by definition of S_G in Statement 4.1, and the definition of the typing function on G_R, type_GR : F_GR → S_GR* × S_GR (Definition 4.1)
• ∀ f ∈ F_G : type_G(f) = type_GR(f): by definition of type_G(f) in Statement 4.1 and function restriction
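Statement 4.1 can be read almost directly as code: keep the rule variables with value 1, collect the sorts occurring in their types, and restrict the typing function. The sketch below assumes the solution is given as a plain dictionary and that the typing table maps each rule to its argument sorts and result sort; both representations are illustrative stand-ins rather than the actual implementation.

```python
# Sketch of Statement 4.1: constructing the signature of G from a 0/1 solution.
# solution: dict from variable names to 0/1; type_GR: dict rule -> (argument_sorts, result_sort)

def build_subgrammar(solution, type_GR):
    F_G = {r for r, v in solution.items() if v == 1 and r in type_GR}
    S_G = set()
    for r in F_G:
        args, result = type_GR[r]
        S_G.update(args)
        S_G.add(result)
    type_G = {r: type_GR[r] for r in F_G}      # restriction of type_GR to F_G
    return S_G, F_G, type_G

type_GR = {"PredVP": (("NP", "VP"), "Cl"),
           "UseV": (("V",), "VP"),
           "UseN": (("N",), "CN")}
solution = {"PredVP": 1, "UseV": 1, "UseN": 0}
print(build_subgrammar(solution, type_GR))
# ({'NP', 'VP', 'Cl', 'V'}, {'PredVP', 'UseV'}, {'PredVP': ..., 'UseV': ...})
```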
Fig. 7 Excerpt from the Finnish treebank used in the “Comparing-Against-Treebank” experiment. The Finnish example is followed by the English translation and the abstract syntax tree. Sources of (morpho)syntactic ambiguity are highlighted
4 Bilingual Grammar Learning
If our example sentences are translated into another language, and the resource grammar happens to be bi- or multilingual,5 we can use that knowledge to improve the learning results. This can be relevant because various languages express different features explicitly. As an example, consider Fig. 7. Finnish does not express definite or indefinite articles, so laula laulu can be translated to both sing a song and sing the song. On the other hand, the verb sing in the English imperative phrase sing a song is morphosyntactically ambiguous: it can be singular or plural, while Finnish makes the distinction between laula laulu (singular) and laulakaa laulu (plural). This example shows how English can be used to disambiguate Finnish and vice versa. For each sentence pair (s_i, s′_i), we parse each sentence separately using the resource grammar into the tree sets F_i = {t_{i1}, …, t_{i m_i}} and F′_i = {t′_{i1}, …, t′_{i m′_i}}. We then only keep the trees that occur in both sets, i.e., F_i ∩ F′_i. These filtered tree sets are translated into logical formulas, just as for monolingual learning. The intersection of the trees selects the (morpho)syntactically disambiguated reading. The disambiguation makes the constraint problem smaller and the extracted grammar more likely to be the intended one.
5 The same abstract grammar is used to describe multiple languages in parallel.
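Because a multilingual GF grammar shares one abstract syntax across its languages, the filtering step amounts to a set intersection over abstract trees. In the sketch below, parse is a placeholder for whatever parser is available (returning abstract trees in some canonical form); it is not part of the method itself and the language codes are illustrative.

```python
# Sketch of the bilingual filtering step: keep only the abstract trees that the
# resource grammar assigns to both halves of a translation pair.

def bilingual_tree_set(sentence, translation, parse):
    F_i = parse(sentence, "Fin")            # trees for the Finnish sentence
    F_i_prime = parse(translation, "Eng")   # trees for its English translation
    return F_i & F_i_prime                  # the shared, disambiguated readings

# The surviving trees are then flattened and translated into logical formulas
# exactly as in the monolingual case (Sect. 3.2).
```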
Fig. 8 Translation between logical formulas and ILP constraints
5 Implementation
We have implemented the system we outlined in Fig. 4 and the previous section, as well as all aspects of the following evaluation and extensions. The implementation is done in Haskell and released as open source.6 As constraint solvers, both GLPK and, if available, CPLEX can be used. The system can be treated as a black box that takes a set of sentences {s_1, …, s_n} as an input and produces a grammar G as output, doing so by relying on a resource grammar (labeled G_R). First the sentences are parsed using the resource grammar G_R and the syntax trees are translated into logical formulas, in the way described in Sect. 3.2. The logical formulas are then translated into ILP constraints and handed over to the constraint solver, which returns a list of rule labels that form the basis for the new restricted grammar G. The output of the solver is influenced by the choice of the objective function (candidates are described in Sect. 3.1). The translation between logical formulas and linear inequalities is well established, and an example can be seen in Fig. 8. Conjunctions and disjunctions are converted into sums. The direction of the inequality as well as the multiplication constant are chosen accordingly, depending on whether it is an implication, a conjunction, or a disjunction. In fact, the solver does not necessarily return only one solution. In case of several solutions, they are ordered by the objective value. Choosing the one with the best value is a safe choice, even though there might be a solution with a slightly worse score that actually performs better on the intended task.
6 https://github.com/MUSTE-Project/subgrammar-extraction.
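The whole pipeline of Sect. 3.2 and Fig. 8 can be illustrated with a generic ILP library. The sketch below uses the Python library PuLP (which can also drive GLPK or CPLEX as back ends) purely for illustration; the actual implementation is the Haskell system described above. The sentence variables of Fig. 6 are folded away, since every sentence must be covered anyway, and the implication from a tree to its rules is written with the multiplication constant mentioned in the text; the details of the real encoding may differ.

```python
# Illustrative 0/1 ILP for subgrammar extraction with the "rules" objective.
# tree_sets: one list per sentence; each element is a flattened tree (a set of rule names).
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

def encode(tree_sets):
    prob = LpProblem("subgrammar_extraction", LpMinimize)
    rules = sorted({r for trees in tree_sets for t in trees for r in t})
    r_var = {r: LpVariable(f"r_{r}", cat=LpBinary) for r in rules}
    t_var = {(i, j): LpVariable(f"t_{i}_{j}", cat=LpBinary)
             for i, trees in enumerate(tree_sets) for j in range(len(trees))}

    prob += lpSum(r_var.values())                       # objective: number of kept rules
    for i, trees in enumerate(tree_sets):
        # every sentence must be covered by at least one of its trees
        prob += lpSum(t_var[i, j] for j in range(len(trees))) >= 1
        for j, tree in enumerate(trees):
            # t_ij -> (r_1 and ... and r_n)  becomes  r_1 + ... + r_n >= n * t_ij
            prob += lpSum(r_var[r] for r in tree) >= len(tree) * t_var[i, j]
    return prob, r_var

prob, r_var = encode([[{"PredVP", "UseV", "sleep_V"}],
                      [{"PredVP", "UseV", "run_V"}]])
prob.solve()
print(sorted(r for r, v in r_var.items() if v.value() == 1))
```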
Fig. 9 Setup for an evaluation by rebuilding a known grammar G0
6 Evaluation
Related literature [9] suggests several measures for the performance of grammar inference algorithms, most prominently the methods “Looks-Good-To-Me”, “Rebuilding-Known-Grammar”, and “Compare-Against-Treebank”. Our inferred grammars passed the informal “Looks-Good-To-Me” test, so we designed two experiments to demonstrate the learning capabilities of our approach following the other two methods.
6.1 Rebuilding a Known Grammar
The evaluation process is shown in Fig. 9. It is based on the learning component (Fig. 4), which is highlighted. To evaluate our technique in a quantitative way we start with two grammars G_R and G_0, where G_0 is a subgrammar of G_R. We use G_0 to generate random example sentences s_1, …, s_n. These examples are then used to learn a new grammar G as described in Sect. 5. The aim of the experiment is to see how similar the inferred grammar G is to the original grammar G_0. To measure this similarity, we compute precision and recall in the following way, where F_G0 are the rules of the original grammar and F_G the rules of the inferred grammar:
Precision = |F_G0 ∩ F_G| / |F_G|
Recall = |F_G0 ∩ F_G| / |F_G0|
We can analyze the learning process depending on, e.g., the number of examples, the size of the examples, and the language involved. We conducted this experiment for Finnish, German, Swedish, Spanish, and English. For each of these languages we used the whole GF RGL as the resource grammar G_R, and a small subset containing 24 syntactic and 47 lexical rules as our known grammar G_0 (Fig. 10). We tested the process with an increasing number of random example sentences (from 1 to 20), an increasing maximum depth of the generated syntax trees (from 6 to 10), five languages (Finnish, German, Spanish, Swedish, and English), and our four different objective functions from Sect. 3.1.
Fig. 10 Many-sorted signature used as G_0
6.2 Comparing Against a Treebank
Our second approach to evaluate our grammar learning technique has a more manual and qualitative focus, and is depicted in Fig. 11 (again with the highlighted learning component). Instead of starting from a grammar which we want to rebuild, we start from a treebank ⟨s_1, t_1⟩, …, ⟨s_n, t_n⟩, i.e., a set of example sentences in a language and one gold-standard tree for each sentence. We use the plain sentences s_1, …, s_n from the treebank to learn a new grammar G, using the GF RGL, extended with the required lexicon, as the resource grammar G_R. Then the system parses the sentences with the resulting grammar G, and compares the resulting trees with the original trees in our gold standard. If the original tree t_i for sentence s_i is among the parsed trees t_{i1}, …, t_{i m_i}, it is reported as a success. If the gold standard tree is not covered, we could use a more fine-grained similarity score, such as labeled attachment score (LAS) or tree edit distance. However, because of the limited size of the treebanks we decided against this evaluation measure. The data we used for testing the grammar learning consists of hand-crafted treebanks for the following languages: Finnish, German, Swedish, and Spanish (see Table 1 for statistics and Fig. 7 for a fragment of the Finnish treebank). We exclude English here because we use it as the second language in bilingual learning on the treebank in the next section.
Fig. 11 Setup for an evaluation by comparing the learned grammar to a treebank
6.3 Comparing Against a Bilingual Treebank
Our final experiment is a repetition of the “Compare-Against-Treebank” experiment, but using a bilingual treebank instead of a monolingual one. The treebanks we created contain English translations of all sentences. This means we have access to four bilingual treebanks: Finnish-English, Spanish-English, Swedish-English, and German-English. We used the bilingual learning component described in Sect. 4, using the GF RGL, which is a multi-lingual resource grammar covering all these languages.
7 Results
We conducted the experiments described in the previous section and got very promising results. In this section we discuss the results in detail.
Fig. 12 Results for objective function rules, maximum tree depth 9, and various languages
7.1 Results: Rebuilding a Known Grammar
We ran the first experiment, described in Sect. 6.1, and a selection of the results can be seen in Figs. 12–14. Out of the many possible experiments (5 languages, 4 objective functions, and 5 different tree depths for generating examples) we present 3 representative samples:
• the same objective function and tree depth with various languages,
• the same language and tree depth with various objective functions, and
• the same language and objective function with various tree depths.
We report precision and recall for a sequence of experiments, where for each experiment we generated sets of random sentences with increasing size. All three graphs (Figs. 12, 13, and 14) resemble typical learning curves where the precision stays mostly stable while the recall rises strongly in the beginning and afterwards approaches a more or less stable level. The precision rises slightly between 1 and 5 input sentences. The recall remains almost constant after input of about 5 sentences. With larger input the precision starts to drop slightly when the system learns additional rules that are not part of the original grammar. These curves are pretty much stable across all languages (see Fig. 12), objective functions (see Fig. 13), and maximum tree depth used in sentence generation (see Fig. 14), and show that we get the best results with about 10 examples. One exception can be seen in Fig. 14. With a maximum tree depth of 5 the system can only achieve a recall of about 0.7, which means that for this tree depth it does not encounter all grammar rules. These results confirm that our method is very general and provides good results, especially for really small training sets of only a few to a few dozen sentences. By starting from a linguistically sound source grammar, which our learning technique recovers by extracting a subgrammar, we can show that the learned grammar is sound in a similar way.
Fig. 13 Results for Finnish, maximum tree depth 9, and various objective functions
Fig. 14 Results for English with objective function rules, and various generation depths
7.2 Results: Comparing Against a Treebank
We used the treebanks and the process described in Sect. 6.2 to further evaluate our learning method. Table 1 shows the results of running our experiment on monolingual and bilingual treebanks of four different languages, and with two objective functions, rules+trees and weighted. The table columns are:
• Size: the number of sentences in the treebank
• Accuracy: the percentage of sentences where the correct tree is among the parse trees for the new grammar
• Ambig.: the syntactic ambiguity, i.e., the average number of parse trees per sentence
The system can cover all the sentences of the treebank with our learned grammar and, as the table shows, in most cases the results include the gold standard tree. We inspected more closely the sentences where the grammar fails to find the gold standard tree, and found that the trees usually differ only slightly, so if we used attachment scores instead we would get close to 100% accuracy in every case. A clear exception is the case of the monolingual Finnish treebank. When we use the rules+trees objective function, we have serious problems learning the correct grammar, with only 1 correct sentence out of 22. This is due to a high level of morphosyntactic ambiguity among Finnish word forms. If we instead use the weighted objective function, we get a decent accuracy, but the grammar becomes highly syntactically ambiguous with 115 parse trees per sentence on average.
Table 1 Results for comparing against a treebank. Accuracy (Acc.) means the percentage of sentences where the correct tree is found, and Ambig(uity) means the average number of parse trees per sentence. Each cell gives Acc. (%) / Ambig.

Language | Size | Monolingual rules+trees | Monolingual Weighted | Bilingual rules+trees | Bilingual Weighted
Finnish  | 22   | 5 / 1.0                 | 91 / 115             | 86 / 4.9              | 96 / 8.7
German   | 16   | 75 / 1.1                | 100 / 2.0            | 94 / 1.1              | 100 / 1.5
Swedish  | 10   | 100 / 1.1               | 100 / 2.8            | 100 / 1.1             | 100 / 1.2
Spanish  | 13   | 100 / 1.2               | 92 / 3.7             | 100 / 1.2             | 100 / 2.3
The second part of the experiment, using a bilingual treebank, solves most of the problems involving Finnish while also improving results for other languages.
7.3 Results: Using Bilingual Treebanks
When we repeated the previous experiment using translation pairs as described in Sect. 6.3, we got very similar results for most of the languages, as can be seen on the right side of Table 1. The main difference is that the resulting grammars are more compact for the weighted objective function, resulting in fewer analyses. Notably, for Finnish the average number of trees per sentence drops by one order of magnitude. This is because the high level of syntactic ambiguity of Finnish sentences is reduced when disambiguated using the English translations.
8 Extension 1: Negative Examples
The first addition to the original grammar learning method, to which we dedicated the previous sections, is to add negative examples to the learning process. Negative examples can speed up the learning process for certain syntactic constructions by narrowing down the grammar, e.g., by using positive and negative examples that are minimal pairs concerning the intended linguistic phenomenon. To allow for negative examples, i.e., example sentences that should not be parsable in the new grammar, we have to add additional constraints. These new constraints have to express that, for each of the syntax trees we get from the negative example sentences, at least one rule, of those involved in the parse of the negative example, has to be excluded from the learned grammar. Or conversely, that not all rules can be included. As a logical formula, this results in the negation of the conjunction of all rules for a tree, i.e.,
¬r_1 ∨ ¬r_2 ∨ ⋯ ∨ ¬r_n ≡ ¬(r_1 ∧ r_2 ∧ ⋯ ∧ r_n)
This simple addition allows negative examples in the basic constraint optimization problem described in Sect. 3.2. We will demonstrate in two examples how this feature can be used.
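In the ILP translation this negated conjunction becomes a single linear inequality: if the rules of one tree of a negative example are r_1, …, r_n, then r_1 + ⋯ + r_n ≤ n − 1, so at least one of them is dropped. The fragment below continues the illustrative PuLP sketch from Sect. 5; it is our own sketch, not the actual implementation.

```python
from pulp import lpSum

# neg_trees: flattened trees of a negative example sentence (sets of rule names);
# prob and r_var come from the encode() sketch shown in Sect. 5.
def add_negative_example(prob, r_var, neg_trees):
    for tree in neg_trees:
        if all(r in r_var for r in tree):
            # not all rules of this tree may be kept: r_1 + ... + r_n <= n - 1
            prob += lpSum(r_var[r] for r in tree) <= len(tree) - 1
        # if the tree uses a rule that has no variable (it occurs in no positive
        # example), the tree can never be derived and no constraint is needed
```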
8.1 Examples
We can demonstrate how positive and negative examples can be used together, in general, by having a look at both formal languages and fragments of natural languages.7
8.1.1 Dyck Language
The Dyck language is a language of balanced opening and closing parentheses,
L_Dyck = { w ∈ V* | each prefix of w contains no more )’s than (’s, and there are exactly as many (’s as )’s in w }
with V = {(, )}.8 In our example we extend this to two kinds of bracketing symbols, parentheses “( )” and brackets “[ ]”, i.e., V_Dyck = {(, ), [, ]}. The language L_Bracket = {w | w ∈ V_Dyck*}, the language of all strings over the alphabet V_Dyck, can be expressed with the grammar in Fig. 15. The semantics of the rules is the following:
• Empty introduces an empty string
• LeftP (or RightP) introduces a single left (or right) parenthesis
• LeftS (or RightS) introduces a single left (or right) square bracket
• BothP (or BothS) wraps a balanced pair of parentheses (or square brackets) around a string
• Conc concatenates a pair of strings
Learning the Dyck language of balanced parentheses from the grammar in Fig. 15, using examples, can be either quite trivial or pretty tricky. With the objective function minimizing the number of rules, the grammar learning technique immediately outputs the intended grammar from just the positive examples “( )”, “[ ]”, and “( ) ( )”. However, with the objective function minimizing the number of parse trees the technique learns a wrong grammar, allowing unbalanced parentheses.
7 The examples are in English but the problems we approach are language independent.
8 Usually the alphabet is denoted with the letter Σ; to avoid naming conflicts with signatures, we use the letter V instead.
Fig. 15 Signature of the over-generating Dyck language
Fig. 16 Signature of the resulting grammar to cover the Dyck language
Fig. 17 Signature of a fragment of the GF RGL concerning adverbials
By adding the negative examples “( ]” and “[ )” as well as “(” and “[” we solve the issue and the learning component provides the correct grammar. The resulting grammar in Fig. 16 is a subset of the original grammar in Fig. 15. This example demonstrates how sometimes adding negative examples can help us learn a grammar in a quick and immediate way.
8.1.2 Adverbials
In the previous section we showed how negative examples can help to learn artificial formal languages. The next step is to show that the same applies to natural languages. To show how quickly the proposed technique of grammar learning can learn a desired subgrammar, we have to start from a wide-coverage grammar. We use the full GF RGL again. One of the problems that can be solved with negative examples is the handling of adverbials in the RGL. A relevant fragment of the RGL is shown in Fig. 17. Various syntactic constructions such as prepositional phrases, created by the function PrepNP, are assigned the syntactic category Adv. Furthermore, almost every part of speech can be modified by adverbials, such as noun phrases using AdvNP and verb phrases using AdvVP.
Fig. 18 The two syntactic analyses according to the RGL for the boy reads a book today
This can lead to syntactic ambiguity when an adverb or adverbial, potentially modifying two different parts of speech, appears in the same position of the sentence. Contrary to human intuition, the sentence the boy reads a book today is, according to the syntactic functions in the RGL, syntactically ambiguous and has two different readings (see Fig. 18). One solution to the problem would be to add more positive examples to learn only one of the alternatives. Another solution is to add a negative example that simply rules out one of the readings. In this case such a negative example could be *a book today arrives. It only has the undesired reading of attaching the adverb to the noun phrase. Together with the positive example it can be used to disambiguate the readings, and only the intended first reading, where the adverb modifies the verb phrase, remains. In a similar way, simple cases of syntactic ambiguity can be resolved using negative examples. For more advanced cases, e.g., to distinguish between lexical adverbs as adverbials and prepositional phrases in the same context, we need other mechanisms (see, e.g., Sect. 9).
8.2 Iterative Grammar Learning Process
The inclusion of negative examples allows us to create a human-centric and example-based grammar learning and refinement system. Starting from a wide-coverage grammar, the user can give a set of example sentences. From these examples a first version of the domain-specific grammar can be learned. This domain-specific grammar will then be extended and refined iteratively.
The user can either give more examples to extend the grammar or ask the system for example sentences. These examples can either be marked as acceptable or as erroneous. When a sentence is marked as wrong, it is added as a negative example and the learning process starts again with the new examples. This way negative examples can be used to step-wise refine the grammar. Because this process is purely example-based, no knowledge about grammar engineering is required. This means that a wide range of people can use the system, not only linguists. This is especially relevant in use-cases where the people who are involved in creating grammars are specialists in other fields, such as teachers in a language learning setup or healthcare professionals building communication support in their field.
9 Extension 2: Extracting Subtrees as Basic Units
The second modification is the generalization from syntactic rules as the atomic units to subtrees. Together with a method to merge syntactic rules into new, more specific rules, this allows us to address some shortcomings of the technique presented in the previous sections. The idea behind this extension is similar to the approach used by Bod [1] for data-oriented parsing (DOP). In DOP, the use of larger subtrees resulted in significantly better learning results [2]. To be able to use subtrees as basic units, we need to be able to create a constraint satisfaction problem of a similar structure as we used before to formulate our original constraint optimization problem, i.e., similar to a many-sorted signature (see Sect. 2.2). Instead of converting the syntax trees into lists of syntax rules to be converted into logical variables, we can split the syntax trees into all possible subtrees up to a certain size. The splitting happens in such a way that we get a list of splits, and each split contains only subtrees that can be reassembled into the original tree. This is necessary to guarantee that the inferred grammar can still cover all the example sentences (Fig. 19).
Because our grammars are equivalent to many-sorted signatures and our syntax rules are similar to functions in mathematics, we can combine several of them into a new function using function composition. For example, if we have the two rules PredVP : NP, VP → Cl and UseV : V → VP, we can combine them into a new rule PredVP#?#UseV : NP, V → Cl. The resulting structure is again a many-sorted signature. The main motivation for using subtrees as the basic units is that when merging subtrees into new grammar rules, we can create more precise and specific grammars than the wide-coverage grammar. This also means that we step away from pure subgrammar extraction into creating more independent grammars.
Having the splits into subtrees, we can translate the splits into logical variables. The procedure follows along the same lines as it worked for syntax rules. However, one additional level has to be introduced. Previously, to cover a tree, all its rules had to be covered. Now we have the additional level of splits. That means, to cover a tree, at least one of the splits has to be covered, and to cover a split, all its subtrees have to be covered. So for the example in Fig. 19, to cover the tree, we end up with the following constraint involving the splits:
(PredVP ∧ DetCN ∧ UseV ∧ UseN ∧ theSg_Det ∧ man_N ∧ sleep_V)
∨ (PredVP#DetCN#? ∧ theSg_Det ∧ UseN#man_N ∧ UseV#sleep_V)
∨ (PredVP#?#UseV ∧ DetCN#theSg_Det#? ∧ UseN#man_N ∧ sleep_V)
∨ ...
The labels for the subtrees are depth-first concatenations of the subtree nodes using the delimiter “#” between the function names, and the question mark character “?” to mark the nodes where the tree has been split. After solving the constraint problem we can either recover the rules from the subtrees in the solution or we can merge the subtrees into new grammar rules.
Fig. 19 Splitting into subtrees for the example tree (Fig. 3)
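The merging of rules by function composition can be sketched at the type level: composing an inner rule into one argument slot of an outer rule yields the merged rule name (with “#” and “?” as in the constraint above) and its new type. The triple representation of rules is an illustrative simplification, not the actual data structure.

```python
# Sketch of merging two rules by function composition.
# A rule is represented as (name, argument_sorts, result_sort).

def compose(outer, pos, inner):
    o_name, o_args, o_res = outer
    i_name, i_args, i_res = inner
    assert o_args[pos] == i_res, "inner result sort must match the argument slot"
    # '?' marks the argument positions of the outer rule that remain open
    name = "#".join([o_name] + [i_name if k == pos else "?" for k in range(len(o_args))])
    args = tuple(o_args[:pos]) + tuple(i_args) + tuple(o_args[pos + 1:])
    return (name, args, o_res)

pred_vp = ("PredVP", ("NP", "VP"), "Cl")
use_v   = ("UseV",   ("V",),       "VP")
print(compose(pred_vp, 1, use_v))
# ('PredVP#?#UseV', ('NP', 'V'), 'Cl')
```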
9.1 Handling Combinatorial Explosion
The previous section shows that we can easily extend the basic technique to include subtrees of arbitrary size. However, a major challenge is the exponential explosion. If we include all splits into subtrees up to the maximum size of 2 for the tree in Fig. 19, we already end up with 19 splits and a total of 133 subtrees, and if we include subtrees up to a size of 3 we end up with 40 splits and 280 subtrees. One way to tackle the problem is to limit the number of subtrees with size larger than 1 per split. As we have just seen, if we allow any split into subtrees up to a
maximum size, we get a large number of splits, quickly growing with the maximum size of subtrees. When limiting the number of subtrees we can significantly reduce this number of splits. For example, if we allowed at most one subtree of a size up to 2, then for the example in Fig. 19 the first split, into only subtrees of size 1, would be allowed. The other two splits given as an example would, however, be disallowed: after starting with a subtree of size 2, only subtrees of size 1 would be allowed for the rest of the split. This solves some of the combinatorial problems but leads to the introduction of additional parameters that influence the learning process.
Fig. 20 Second over-generating grammar for the Dyck language
9.2 Examples
With the addition of subtrees to the learning process, we can revisit the Dyck language and the handling of adverbials. We show how the learning method profits from using subtrees and merging rules to solve the problems in a more powerful and elegant way.
9.2.1 Dyck Language
In Sect. 8.1.1 we presented the Dyck language with two types of brackets. We demonstrated how the Dyck language can be learned using positive and negative examples. This was possible because of the structure of the grammar we defined. With a different definition of the base language, the previous technique cannot learn the desired grammar using just positive and negative examples. This alternative grammar is defined in Fig. 20. With this grammar we cannot just exclude rules to learn the intended language. Instead we need to merge the wrap rule with either the rules to add parentheses or with the rules to introduce square brackets. To learn the grammar we used only positive examples and a moderate subtree size. The examples involved were just the two strings “[ ( ) ]” and “[ ] ( )”, with a subtree size of 3 and at most 2 subtrees allowed in each split.
To guarantee an optimal grammar we used the objective function minimizing the number of rules. The resulting grammar is shown in Fig. 21.
Fig. 21 Resulting grammar for the Dyck language
9.2.2 Adverbials
In a similar way we can return to adverbials. Here a major remaining problem is the fact that various phrases are mapped onto the syntactic category of adverbials (Adv), and not all phrases make sense in every position an adverbial can appear in. For example, prepositional phrases can modify both verbs and nouns. This leads to well-known cases of prepositional phrase (PP) attachment ambiguity. Sometimes the different readings are equally plausible,9 but usually there is one preferred reading, which can be inferred from the lexical semantics of the involved words. To give an example, the sentence I eat pizza with pineapple is structurally ambiguous but the meaning is completely clear to a human. The same is true for the sentence I eat pizza with scissors, even though it would be possible to put scissors on top of a pizza.10 So the main difference between the two sentences is more an aspect of semantics. In the first case clearly the noun is modified by the prepositional phrase while in the second case it is the verb phrase that is modified. One potential learning task is to learn a grammar that only allows prepositional phrases to modify nouns while allowing regular adverbs to modify verbs. We start from the grammar in Fig. 22, which again is a subset of the RGL. With this grammar and the correct examples, the technique can learn a new grammar that disambiguates the attachment ambiguity by allowing only the modification of noun phrases by prepositional phrases. To learn the new grammar, we use the first sentence as a positive example and the second sentence as a negative example. We add additional sentences to the positive examples to reinforce the prepositional attachment to the noun and to allow for regular adverbs modifying verbs. The positive examples we use for training are:
(1) I eat pizza with pineapple
(2) pizza with pineapple is delicious
9 For example, I saw the building with the telescope has again two syntactic analyses, and each one has a plausible semantic interpretation.
10 For some people that would even be as likely as pineapple as a topping.
Fig. 22 Signature of a RGL fragment exposing PP attachment ambiguity
(3) I run today
(4) I sleep now
(5) I run
And the only negative example is:
(6) *I eat pizza with scissors
We combine these examples with the following training parameters: maximum subtree size of 2 and at most 3 merges per split. As a result we end up with a new grammar that fulfills our expectation about the prepositional attachment. The resulting grammar rules are given in Fig. 23. The first rule (AdvNP#?#PrepNP) allows prepositional phrases to modify noun phrases. The subsequent two rules allow the two regular adverbs to modify verb phrases. This grammar meets our expectation about the behavior of adverbials. On the other hand, it is in some cases overly specific and in other cases more general than expected. The second (AdvVP#?#now_Adv) and third rule (AdvVP#?#today_Adv) could be split to make the grammar more general, and the rules UseN and MassNP could be merged because they are the only two rules with matching types. These issues could probably be solved by fine-tuning the parameters.
Despite that, this example shows how learning from subtrees and merging to form new grammar rules can be used to deal with a common attachment ambiguity problem.
Fig. 23 Resulting grammar rules for adverbials including merged rules
10 Discussion
In the previous sections we described the technical details of our system, as well as two extensions of the basic technique. All these aspects are implemented and can be tested and evaluated. However, the work described here is only the beginning of an interesting line of research, and there are still topics open for discussion.
Other Grammar Formalisms In the introduction (Sect. 1) we mentioned two requirements for a grammar formalism in order to be able to use it together with our learning technique: having a wide-scale grammar, and being able to translate syntax trees into constraints. For other formalisms than GF, such as Head-driven Phrase Structure Grammars (HPSG) [28], Lexical-Functional Grammars (LFG) [3, 16], or Lexicalized Tree-Adjoining Grammars (LTAG) [15], large-scale grammars exist, fulfilling the first requirement. For the translation from syntax trees into constraints, a promising approach seems to be the generalization of grammar formalisms in the framework of Constraint-Based Lexicalized Grammar (CBLG) [25, 35], which subsumes, e.g., HPSG and LFG. In CBLG, well-formed tree structures are defined as trees that have fully instantiated feature structures in each node. These feature structures can be treated as complex categories and used in the grammar learning method we present without any major changes.
Influence and Handling of Parameters There are a few open issues involving the choice and effect of the parameters, such as subtree size and how to split trees into subtrees. A serious consequence of starting from the wrong parameters is that it makes it difficult to learn the intended grammar in the iterative process sketched above. Some of the parameters lead to the process becoming too slow to be feasible. The number of variables involved in the process, especially in the objective function, slows down the solving of the problem. The objective function reducing the number of trees is usually unproblematic, but the objective function minimizing the number of rules can lead to serious problems. This is especially the case when including subtrees, because each distinct subtree will be treated as a separate rule. Another issue is a consequence of restricting the number of subtrees included in a split in combination with negative examples. Because positive and negative examples are treated slightly differently, it can happen that positive and negative trees are split differently. This means that not all parts of a negative example can be used to eliminate solutions. These observations are not overly surprising. The more complex a system grows, the more parameters can be tuned, and tuning parameters has a growing effect on the results. In our case, a way to approach this problem is to start from the simplest system, i.e., learning from only positive examples, and to add more features only when necessary. Another approach is to automatically and iteratively increase the parameter values until a suitable solution can be found.
Handling Larger Problem Sizes With the basic learning algorithm we did not encounter any performance issues, even though the problem itself is NP-complete. However, one potential problem is the number of parse trees involved, which can grow exponentially in the length of the sentences [24, p. 7]. If we also split the trees into all possible subtrees, the number grows even more. We currently solve this problem by limiting the number of subtrees, but there are other ways to approach this problem as well. One possible solution is to move away from the formulation of the problem of grammar learning, as covered in this chapter, in terms of parse trees, and instead refer to the states in the parse chart. The chart has a polynomial size [24, p. 87], compared to the exponential growth of the trees, and it should be possible to translate the chart directly to a complex logical formula instead of having to go via parse trees. Another approach is to use a different constraint solving method. Instead of modeling a constraint optimization problem that requires more effort to solve, we can model it as a plain constraint satisfaction problem such as Boolean satisfiability (SAT). This saves us the additional effort in solving a more challenging problem and in translating between logic formulas and ILP constraints, but we lose the guarantee of an optimal solution. However, there are methods to approximate optimal solutions using off-the-shelf SAT solvers, e.g., MiniSAT [10] or SAT+ [4].
Interaction Between Iterative Process and Merging Rules Another open question is how exactly the merging of rules can be included in the iterative grammar generation process. To take advantage of learning from subtrees, it is also possible
to include, in each learning step, singular subtrees to occasionally merge rules in cases where the rule-based learning method is not sufficiently powerful. How well this works in practice is not yet established. Our intuition is that merging rules in a meaningful way requires additional user interaction besides judging positive and negative examples, because merging rules could make a grammar more restrictive than intended and might have to be rolled back.
Multilingual Learning Finally, a very interesting topic we could only touch on shortly is the influence of combining languages in bilingual or multilingual learning. In the preliminary results of the modified “Compare-Against-Treebank” experiment (Sect. 7.2) and an example from the used treebank (Sect. 4), we could show that Finnish and English can be paired up in a meaningful way to disambiguate features of both languages. However, we did not research the influence of the choice of languages involved more thoroughly. Another interesting aspect of pairing languages is when encountering lexical ambiguity. The same kind of lexical ambiguity can span across languages, e.g., bank is ambiguous in many Germanic languages, not always with the same meanings. But in many cases it is possible to disambiguate the meaning of words by using translations.
11 Conclusion
In this chapter we have shown how it is possible for a computer to learn an application- or domain-specific grammar from a very limited number of example sentences. When making use of a large-scale resource grammar, in most cases only around 10 example sentences are enough to get a suitable domain-specific grammar. We evaluated this method in two different ways, with good results that encouraged us to work on the extensions. Based on the results of the initial method, we also presented two modifications. The first one, including negative examples, gives an easy way for humans to influence the learned grammar by giving both sentences that should and that should not be included. The second, learning from subtrees and merging rules, allows for more fine-grained domain-specific grammars. In two examples involving both formal languages and natural language phenomena, we demonstrated that the procedure can learn interesting languages or features using even fewer positive and negative example sentences; already five sentences were often sufficient to achieve the intended result. In Sect. 10 we discussed some of the remaining issues of this method. Notwithstanding, we have presented a framework that can be used for human-centric, iterative grammar learning for domain- and application-specific grammars. There is still work left to be done, including performing more evaluations on different kinds of grammars and example treebanks. But we hope that this idea can find its uses in areas such as computer-assisted language learning, domain-specific dialogue systems, computer games, and more. In our future work, we will especially focus on
ways to use this method in computer-assisted language learning. However, a thorough evaluation of the suitability of the extracted grammars has to be conducted for each of these applications and remains as future work. Furthermore, we plan to explore the use of SAT to model the grammar learning problem. This should help to avoid performance issues but requires a redesign of the whole process to approximate an optimal solution. Finally, we want to include the iterative learning process in a computer-assisted language learning application and evaluate it thoroughly, both with students and language teachers.
Acknowledgements We want to thank Koen Claessen for inspiration and help with the CSP formulation, Krasimir Angelov and Thierry Coquand for pointing us in the direction of many-sorted algebras as a means of formalizing abstract grammars, and three anonymous reviewers for many constructive comments. This chapter is an extended version of [22] presented at the Special Session NLPinAI 2020 at the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020). The work reported in this chapter was supported by the Swedish Research Council, project 2014-04788 (MUSTE: Multimodal semantic text editing).
References 1. Bod, R.: A computational model of language performance: data oriented parsing. In: COLING’92, 14th International Conference on Computational Linguistics. Nantes, France (1992). https://www.aclweb.org/anthology/papers/C/C92/C92-3126/ 2. Bod, R., Scha, R.: Data-oriented language processing. In: Young, S., Bloothooft, G. (eds.) Corpus-based Methods in Language and Speech Processing, Text, Speech, and Language Technology 2, chap. 5, pp. 137–174. ELSNET: Kluwer, Dordrecht (1997). https://doi.org/10.1007/ 978-94-017-1183-8 3. Bresnan, J.: Lexical-Functional Syntax. Blackwell Textbooks in Linguistics. Blackwell, Malden, Mass (2001). https://doi.org/10.1002/9781119105664 4. Claessen, K.: SAT+ (2018). https://github.com/koengit/satplus. Accessed 25 June 2020 5. Clark, A.: Unsupervised induction of stochastic context free grammars using distributional clustering. In: CoNLL, the ACL 2001 Workshop on Computational Natural Language Learning (2001). https://www.aclweb.org/anthology/W01-0713 6. Clark, A., Lappin, S.: Unsupervised learning and grammar induction. In: Clark, A., Fox, C., Lappin, S. (eds.) The Handbook of Computational Linguistics and Natural Language Processing, chap. 8, pp. 197–220. Wiley-Blackwell, Oxford (2010). https://doi.org/10.1002/ 9781444324044.ch8 7. Clark, A., Yoshinaka, R.: Distributional learning of parallel multiple context-free grammars. Mach. Learn. 96(1–2), 5–31 (2014). https://doi.org/10.1007/s10994-013-5403-2 8. DELPH-IN: Deep linguistic processing with HPSG (DELPH-IN) (2020). http://moin.delphin.net/GrammarCatalogue. Accessed 25 June 2020 9. D’Ulizia, A., Ferri, F., Grifoni, P.: A survey of grammatical inference methods for natural language learning. Aritif. Intell. Rev. 36, 1–27 (2011). https://doi.org/10.1007/s10462-0109199-1 10. Eén, N., Sörensson, N.: An extensible SAT-solver. In: Giunchiglia, E., Tacchella, A. (eds.) Theory and Applications of Satisfiability Testing, pp. 502–518. Springer, Berlin (2003). https:// doi.org/10.1007/978-3-540-24605-3_37 11. Fuchs, N.E., Schwitter, R.: Specifying logic programs in controlled natural language. In: CLNLP’95, Workshop on Computational Logic for Natural Language Processing. University of Edinburgh, Edinburgh (1995)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NPCompleteness. W. H. Freeman & Co, New York, USA (1979). https://doi.org/10.5555/574848 13. Henschel, R.: Application-driven automatic subgrammar extraction. In: Computational Environments for Grammar Development and Linguistic Engineering (1997). https://www.aclweb. org/anthology/W97-1507 14. Imada, K., Nakamura, K.: Learning context free grammars by using SAT solvers. In: ICMLA 2009, International Conference on Machine Learning and Applications, pp. 267–272 (2009). https://doi.org/10.1109/ICMLA.2009.28 15. Joshi, A.K., Schabes, Y.: Tree-adjoining grammars. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 3, pp. 69–123. Springer, Berlin, Heidelberg (1997). https://doi.org/10.1007/978-3-642-59126-6_2 16. Kaplan, R.M., Bresnan, J.: Lexical-functional grammar: a formal system for grammatical representations. In: Bresnan, J. (ed.) The Mental Representation of Grammatical Relations, pp. 173–281. MIT Press, Cambridge, MA (1982) 17. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum, New York, USA (1972). https://doi.org/10.1007/978-1-4684-2001-2_9 18. Kešelj, V., Cercone, N.: A formal approach to subgrammar extraction for NLP. Math. Comput. Modell. 45(3), 394–403 (2007). https://doi.org/10.1016/j.mcm.2006.06.001 19. Lange, H.: Computer-assisted language learning with grammars. a case study on Latin learning. Licentiate thesis, Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, Sweden (2018). https://gup.ub.gu.se/publication/269655 20. Lange, H., Ljunglöf, P.: MULLE: A grammar-based Latin language learning tool to supplement the classroom setting. In: NLPTEA 2018, 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 108–112. Melbourne, Australia (2018). http:// aclweb.org/anthology/W18-3715 21. Lange, H., Ljunglöf, P.: Putting control into language learning. In: CNL 2018, 6th International Workshop on Controlled Natural Languages, Frontiers in Artificial Intelligence and Applications, vol. 304, pp. 61–70. IOS Press, Maynooth. Ireland (2018). https://doi.org/10.3233/9781-61499-904-1-61 22. Lange, H., Ljunglöf, P.: Learning domain-specific grammars from a small number of examples. In: ICAART 2020, 12th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 422–430. INSTICC, SciTePress, Valletta, Malta (2020). https://doi.org/10.5220/ 0009371304220430 23. Lari, K., Young, S.: The estimation of stochastic context-free grammars using the insideoutside algorithm. Comput. Speech Lang. 4(1), 35–56 (1990). https://doi.org/10.1016/08852308(90)90022-X 24. Ljunglöf, P.: Expressivity and Complexity of the Grammatical Framework. Ph.D. thesis, University of Gothenburg, Gothenburg, Sweden (2004). https://gup.ub.gu.se/publication/10794 25. Loukanova, R.: An approach to functional formal models of constraint-based lexicalized grammar. Fundam. Inform. 152(4), 341–372 (2017). https://doi.org/10.3233/FI-2017-1524 26. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA (1999) 27. Pereira, F., Schabes, Y.: Inside-outside reestimation from partially bracketed corpora. In: ACL 1992, 30th Annual Meeting of the Association for Computational Linguistics, pp. 128–135. Newark, Delaware, USA (1992). https://www.aclweb.org/anthology/P92-1017 28. 
Pollard, C.J.: Head-driven Phrase Structure Grammar. Studies in Contemporary linguistics. University of Chicago Press, Chicago (1994) 29. Ranta, A.: GF: A multilingual grammar formalism. Lang. Linguist. Compass 3(5), 1242–1265 (2009). https://doi.org/10.1111/j.1749-818X.2009.00155.x 30. Ranta, A.: The GF resource grammar library. Linguist. Issues Lang. Technol. 2(2), 1–63 (2009). https://journals.linguisticsociety.org/elanguage/lilt/article/view/214.html 31. Ranta, A.: Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications (2011). https://www.grammaticalframework.org/gf-book/
32. Ranta, A.: Implementing Programming Languages. An Introduction to Compilers and Interpreters. College Publications (2012). http://www.grammaticalframework.org/ipl-book/ 33. Ranta, A., Angelov, K., Höglind, R., Axelsson, C., Sandsjö, L.: A mobile language interpreter app for prehospital/emergency care. In: Medicinteknikdagarna. Västerås, Sweden (2017). http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-13366 34. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall, Upper Saddle River (2009). http://aima.cs.berkeley.edu/ 35. Sag, I.A., Wasow, T., Bender, E.M.: Syntactic theory: A formal introduction, 2nd edn. No. 152 in CSLI Lecture Notes. Center for the Study of Language and Information, Stanford, CA (2003) 36. Wirsing, M.: Algebraic specification. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. B, chap. 13, pp. 675–788. Elsevier, MIT Press, Cambridge (1990) 37. XMG: eXtensible MetaGrammar (2017). http://xmg.phil.hhu.de/. Accessed 25 June 2020
The Semantic Level of Shannon Information: Are Highly Informative Words Good Keywords? A Study on German
Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Clemens Rietdorf, and Tariq Yousef
Abstract This paper reports the results of a study on automatic keyword extraction in German. We employed in general two types of methods: (A) unsupervised, based on information theory, i.e., (i) a bigram model, (ii) a probabilistic parser model, and (iii) a novel model which considers topics within the discourse of target words for the calculation of their information content, and (B) supervised, employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, even TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, i.e., from the discourse of words.
Keywords Keyword extraction · Information theory · Shannon information · Discourse · Communication · Topic model · Recurrent neural network
M. Kölbl · Y. Kyogoku · J. N. Philipp (B) · M. Richter · C. Rietdorf · T. Yousef Institute of Computer Science, NLP Group, Universität Leipzig, Augustusplatz 10, 04109 Leipzig, Germany e-mail: [email protected] M. Kölbl e-mail: [email protected] Y. Kyogoku e-mail: [email protected] M. Richter e-mail: [email protected] C. Rietdorf e-mail: [email protected] T. Yousef e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. Loukanova (ed.), Natural Language Processing in Artificial Intelligence—NLPinAI 2020, Studies in Computational Intelligence 939, https://doi.org/10.1007/978-3-030-63787-3_5
1 Introduction
This is an extension of the original paper [22] with the following revisions:
(1) The theoretical basis of information content for words is extended, i.e., the philosophical basis and implications of Shannon’s information theory [44] are thoroughly discussed, in particular with regard to the work of Dretske [8]. That is, we focus on the motivation of the link from information content to natural language semantics. Additionally, we are now much more explicit w.r.t. a novel model introduced in the original paper, which uses extra-sentential topics as contexts for the calculation of information content.
(2) A further analysis of the ground truth keywords, since our first study disclosed that the keywords that occur in the text are mostly names. Using a named-entity recogniser (NER) we investigate this further.
(3) In the present paper we discuss some shortcomings of common methods of keyword evaluation: evaluation measures such as precision, recall, and F1 require an exact match of outputted keywords and reference keywords, thereby penalising a non-exact match even if the former are synonyms or hypernyms of the latter. We briefly sketch a novel evaluation approach which is based on more fine-grained semantic representations of words, which we consider promising especially for languages with particularly complex nouns such as German.
The challenge of keyword extraction (and also of generation) is to capture the meaning of a document in a few words. Since the pioneering work of [50] and, in addition, due to the rapidly increasing quantity and availability of digital texts, this is a vital field of research on applications such as automatic summarisation [33], text categorisation [32], information retrieval [27, 51], and question answering [24]. Methodically, there are two general lines of research: supervised and unsupervised approaches. In this study, we propose an innovative, unsupervised approach to keyword extraction that employs Shannon information theory [44]. Shannon’s theory does not have an explicit semantic level in general, let alone an application to the semantics of natural language. However, in previous studies it came to light that Shannon Information can provide the framework for a model of sentence comprehension [11, 17, 18, 23]. To be precise, these studies are concerned with the processing effort required for sentence comprehension, under the premise that more informative linguistic units, such as words, phrases, and sentences, require a higher processing effort than less informative ones, which has proved to be empirically correct. Our approach, however, goes further by trying to relate Shannon’s theory to the level of meaning of the linguistic sign, the general research question being: Can Shannon information be a semantic model of natural language, i.e., can Shannon’s information theory handle a semantic information level which is applicable to natural language and human communication? The specific research question is: does information theory allow the marking of words/phrases that reflect the meaning of a text, or, to put it differently, are above-average informative words good keyword candidates?
1.1 General Application of Shannon’s Information Theory in Language
Since the publication of Shannon’s magnum opus, A Mathematical Theory of Communication, his information theory has been applied to various fields and continues to have a great impact. In this section, we are going to look into the general application of Shannon’s information theory to the field of natural language processing. Shannon’s definition of information differs from the colloquial definition of information, which is similar to meaning or notification. So we use the term information here in the information-theoretical sense. According to Shannon, the information content (Shannon Information, henceforth SI) of a linguistic unit w is thus equivalent to its surprisal and results from the negative logarithm of the probability of this unit w, as given in formula (1):
SI(w) = −log2 P(w)    (1)
The above formula indicates that a word with a lower probability has a higher information content. Therefore rarer words tend to have a higher information content than more frequent words because they occur with a lower probability. The probability of a unit is determined in dependence of the context, i.e., how likely the unit is to occur in a specific context. The information that a linguistic unit w carries, given a context c, is calculated by formula (2):
SI(w) = −log2 P(w|c)    (2)
For example, the context can be regarded as the preceding characters, words, or part-of-speech tags and thus be modelled by n-grams, often bi- or trigrams [4, 13, 36, 37]. Unlike n-gram models, probabilistic parser models [11] calculate the information content of a word based on the change in probability of a parse tree when that word is added to the sentence. Probabilistic parser models are based on statistical frequencies in corpora and are formed from phrase structure rules or dependency rules to which probabilities are assigned [11, 23]. Furthermore, the context is not limited to the individual sentence containing the word in question, but can also be considered across sentences. Thus, the preceding sentences or even the whole document can be included in the context. The novel topic model, which is introduced in [22] and described in more detail in Sect. 4.2.3, is based on such extra-sentential contexts. Words that are highly predictable in a particular context therefore carry less information than words that are less likely or very surprising in that context. According to [11, 17, 18, 23], the information content of single words can also be used to calculate the information content of phrases or even whole sentences. If a phrase p consists of n words, then the information content of the phrase can be calculated as the sum of the information contents of the n words, as given in formula 3:

SI(p) = −log2 P(w1) − log2 P(w2) − … − log2 P(wn)    (3)
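To make formulas 1–3 concrete, the following minimal Python sketch computes the information content of words from relative frequencies and sums them for a phrase. The toy corpus and the resulting probability estimates are purely illustrative; in practice P(w) would be estimated from a large corpus or conditioned on a context as in formula 2.

```python
import math
from collections import Counter

# Illustrative toy corpus; in practice P(w) is estimated from a large corpus.
tokens = "the cat sat on the mat the dog sat on the cat".split()
counts = Counter(tokens)
total = sum(counts.values())

def si(word):
    """Shannon information (surprisal) of a word, formula 1: SI(w) = -log2 P(w)."""
    return -math.log2(counts[word] / total)

def si_phrase(words):
    """Information content of a phrase as the sum of word surprisals (formula 3)."""
    return sum(si(w) for w in words)

print(si("the"))                    # frequent word -> low information content
print(si("dog"))                    # rare word -> high information content
print(si_phrase(["the", "dog"]))    # phrase information as a sum
```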
1.2 Semantic Aspects in Shannon's Information Theory

In the above section, we introduced a general application of Shannon's information theory to natural language processing. However, one of the drawbacks of his information theory is, as admitted by Shannon himself, that the semantic aspects of communication are left aside [44, p. 379]. Dretske, who was dedicated to epistemology and the philosophy of mind, analyses why Shannon's theory cannot deal with semantic issues and proposes some formal modifications to deal with them. According to Dretske, one of the reasons why Shannon's theory is unable to handle semantic content is that semantic notions are assigned to individual items, whereas Shannon's theory refers only to average amounts of information [8, p. 48]; [25, p. 27]; [26, p. 10]. Dretske asserts that, insofar as we are concerned with the informational content of a signal or message, it does not make sense to ask about the average content of two or more messages. Consider the following example: while reading in the newspaper that one's utility stock has declined, one is told that one's wife is waiting downstairs. We may suppose that there is some average to the amount of information received, but it surely makes no sense to ask about the average content of the two messages [8, pp. 47–48]. That is, the content of some messages cannot be quantified by means of an average value.

In accordance with this individualisation, given a source of information s, a signal r, and a particular property F, Dretske defines the informational content of the signal in the following way: a signal r carries the information that s is F = the conditional probability of s's being F, given r (and k), is 1 (but given k alone, less than 1) [8, p. 65]. In his definition, k stands for what the receiver of the information already knows. For example, if one already knows that s is either red or blue, a signal that eliminates s's being blue as a possibility carries the information that s is red. On the contrary, for someone who does not know that s is either red or blue, the same signal might not carry the information that s is red.

In order to realise the idea of allocating semantic information to individual items, Dretske applies the aforementioned surprisal, given in formula 4, instead of Shannon's formula, which is known as information entropy and given in formula 5:

SI(si) = −log P(si)    (4)

SI(S) = −Σ_i P(si) log P(si)    (5)

Here si represents one of various states at the source. SI(si) is the amount of information of the individual state si, whereas SI(S) is the average amount of information. Subsequently, Dretske defines the information transmitted between the states si and ri (one of the individual states at the receiver), given in formula 6:

SI(si, ri) = SI(si) − E(ri)    (6)
where E(ri) is given in formula 7:

E(ri) = −Σ_j P(sj | ri) log P(sj | ri)    (7)
E in the above formula stands for equivocation, i.e., the portion of information which is not transmitted from the source to the receiver. If the equivocation E(ri) is zero, then the signal carries as much information about the source as is generated at the source by the occurrence of si. Yet, as pointed out by some authors [25, p. 29]; [39, p. 17], Dretske makes an error in modifying the formula of equivocation, so that the application of Shannon's formulas to the particular pair of states si and rj is not completely generalised. The correct formula is given in formula 8:

SI(si, rj) = SI(si) − E(si, rj)    (8)

where E(si, rj) is given in formula 9:

E(si, rj) = −log P(si | rj)    (9)
The error is that he uses the same index to denote the state of the source and the state of the receiver, with the result that the individual contribution to the equivocation is only a function of a specific state of the receiver, although the equivocation depends essentially on the communication channel between the source and the receiver, not only on the receiver. However, even given the properly modified version of the formulas, the concept of informational content does not acquire semantic aspects, inasmuch as it is defined in terms of a quantitative theory. Thus, as an important factor for relating information to knowledge, Dretske introduces the philosophical term intentionality. The term intentionality describes the following phenomenon: if a signal carries the information that s is F, it does not necessarily carry the information that s is G, despite the extensional equivalence of "F" and "G" [8, p. 75]. To put it another way, a sentence "He believes that s is F" is used to describe intentionality, in that one cannot substitute "G" for "F" in this sentence without risking a change in truth, even if everything that is F is G, and vice versa.

Such intentionality is not inherent in mere de facto correlations. For instance, suppose that A transmits a message to B, while C transmits the same message to D. In spite of the perfect correlation between the message A transmits and the message D receives, the equivocation between A and D is at a maximum. In other words, the information generated at the source A is not transmitted to the receiver D, and as such there is a mere correlation between A and D, in which no intentionality is admitted. According to Dretske, the ultimate source of intentionality is not a mere correlation based on pure coincidence, but rather exceptionless lawful regularities [8, p. 76]; [25, p. 37], and they can be exhibited only in completely closed and isolated systems
[39, pp. 5–11]. A completely closed and isolated system is a macroscopic system, such as a box filled with particles, which is completely isolated from any external environment; therefore the macro-states averaged over time (such as its volume, pressure, temperature, entropy, etc.) do not change, while the micro-states, namely the particles which interact with each other in the macro-system, vary in space and time. The lawful regularity on which Dretske's theory relies describes the evolution of the system of particles from one initial state to another. The laws are deterministic in the sense that, once an initial state at the microscopic level is determined, all other micro-states are determined. Thus, in exceptionless lawful regularities, no uncertainty is tolerated, because intentionality, which is necessary for semantic information, is guaranteed by errorless transmission in which the equivocation remains zero. Otherwise, from Dretske's point of view, an information transmission chain with equivocation, albeit a small amount, would accumulate a large amount of equivocation at the end point of the transmission.

Now suppose there are two partially closed macro-systems, viz., source and receiver, that can interact with each other only through a channel between the two systems; because of this interaction, their macroscopic variables can change. In this case, in order to get macroscopic dynamics, the exceptionless lawful regularities must be lifted, so that information of the systems can flow from one state to another. This means that an amount of uncertainty is introduced. The uncertainty, on the other hand, is contrary to the exceptionless lawful regularities dominant over the systems. Rogers straightforwardly illustrates the central problem of Dretske's theory as follows: "Dretske's theory requires laws with "certainty" (exceptionless laws) acting on states with "uncertainty" (multivalent states). However, classical physics can only provide laws with certainty acting on states with certainty and/or laws with uncertainty acting on states with uncertainty." [39, p. 11]. Uncertainty can be well managed in statistical physics, but in the deterministic classical system that Dretske supposes, the introduction of uncertainty is problematic.

Dretske's theory is exposed to further criticism. In locating the semantic aspect of a signal in the particular instantiation that links an informational source and receiver, Dretske presupposes a decoupling of the correlation, where each state at the source is perfectly paired with a specific state at the receiver (as may be indicated by his mistaken formal modification), whereas Shannon presupposes an entangled one, where the set of states at the source is correlated with the set of states at the receiver. Dretske's decoupling, however, lacks something important, namely an understanding of how new information is generated. New information is generated from uncertainty, which is nonetheless excluded from Dretske's system on the grounds of the exceptionless lawful regularities. Unlike Dretske's restricted interactions, Shannon's system allows entanglement, which can produce more uncertainty. From this point of view, Shannon's entangled framework offers a robustness that Dretske's decoupling cannot provide [39, pp. 18–19]. Taking into account the criticisms we have seen so far, we have no choice but to conclude that Dretske's modification fails to endow Shannon's information theory with semantics.
But how, then, is semantic information transmitted to the receiver? As a matter of fact, hints capturing the semantic dimension are hiding behind Dretske's own theory: a non-physical information-preserving relation, and not the physical channels Dretske's core theory invokes [39, p. 21]. For instance, when a source S transmits information to two receivers R_A and R_B, there is no physical channel between R_A and R_B, whereas there is a physical channel between S and R_A and between S and R_B, respectively. The crucial point in this example is that there is an informational link between R_A and R_B, although they are physically isolated [8, pp. 38–39]; [25, pp. 33–34]. The information transmitted via such a link between R_A and R_B is not about a state per se, but rather about an abstracted property of the source S, which can also be interpreted as "averaged" in a sense [39, p. 15, pp. 21–22]. In this regard, Rogers gives a further keen insight by referring to Dretske's so-called Xerox principle: if A carries the information of B, and B carries the information of C, then A carries the information of C. Dretske concludes that information-preserving channels cannot allow any equivocation, since this would involve information loss that would cumulatively build error into the transmission of information [8, p. 58]. But the point here is not the errorless transmission, but rather that the information can pass intact from A to C despite equivocation [39, p. 20]. In the same fashion, [9] criticises Dretske's insistence upon a causal requirement which does not allow the possibility of inappropriate sorts of "epistemic luck." Interestingly enough, the inclination towards abstraction or averaging, which can implement semantic aspects into information theory, is totally opposed to the individualisation that Dretske originally aimed at.

In short, semantic information can be transmitted in the form of an abstracted property through a non-physical channel, even with equivocation. This idea can be understood in the following way. Imagine that you are listening to what somebody says. After his speech, you want to show how well you have understood him. In this situation, you would not just repeat word for word what he said, but rather rephrase it. Your rephrasing is enabled by what you have understood from his speech, i.e., the informational content in his speech is abstracted by you. This kind of abstraction brings about the transmission of meaningful information.
1.3 Application of Shannon's Information Theory to Our Study

Further impulses for the application of information theory to natural language and its comprehension are given by semantic and pragmatic language models. In well-known linguistic approaches to the information structure of sentences, new information is considered the relevant part, whereby the term information in linguistics must not be confused with Shannon Information. New information is one side of the common opposition pairs given-new and topic-comment, respectively. Within sentences, new information (after the setting of something given) can be said to form the message, i.e., the information that the human language processor is awaiting. Within alternative semantics [40, 41], the focus position is filled by that new information,
as Krifka [21] puts it: "Focus indicates the presence of alternatives that are relevant for the interpretation of linguistic expressions". That is, the more alternatives there are, the higher the relevance of the actually occurring word, and this relationship, in fact, meets Shannon's definition of information [44]. Highly relevant words are accordingly highly informative. These concepts are also found in information theory, in particular in Hale's surprisal theory [11] (see above), where it is stated that the higher the information content of a word, the higher its surprisal and the higher the processing difficulty. Thus an extremely high information content leads to a high processing effort in language comprehension. If the density of information in a linguistic utterance is "dangerously high" [18], i.e., the information structure within that utterance exhibits extreme peaks and troughs, this might cause massive problems in comprehension. A sentence comment (or focus) that is too informative has an extremely low probability in the discourse and leads to too high a surprisal effect, which endangers sentence comprehension.

How can we measure the applicability of Shannon's information theory as a semantic model of natural language? To this end, we will compare our information theory-based, unsupervised method against a supervised deep-learning method that employs a recurrent neural network (RNN), and we will apply these methods to texts in the German language. We aim to investigate whether keyword extraction using our simple information theory-based approach is able to compete with a state-of-the-art deep-learning technique. In contrast to our information theory-based approach, there is no explicit hypothesis w.r.t. the semantics of words in the deep-learning approach. In order to avoid the RNN performing keyword generation instead of keyword extraction, the algorithm is solely trained on document-keyword pairs. Two baseline methods are used in order to evaluate the quality of the information theory-based approach and the RNN: (1) the TF-IDF measure [32, 42, 50] and (2) TextRank [30], a highly influential graph-based ranking approach to keyword extraction.

As shown in Sect. 1.1, the amount of surprisal that a word causes is equivalent to its information content (formula 1) and proportional to the difficulty when that word is mentally processed [11, 23]: a sign is more informative if it is more surprising. As discussed above, the choice of the context is not trivial, since the information content of a linguistic unit depends strongly on it (formula 2). The information theory-based approach to keyword extraction put forward in this study employs three models with different context definitions:

1. a bigram model that has yielded promising results in a previous pilot study [38], henceforth referred to as bigram model,
2. a probabilistic parser model based on phrase structures, of which [11] claims psycholinguistic plausibility, henceforth referred to as parser model,
3. a novel extra-sentential topic model based on Latent Dirichlet Allocation (LDA) [3] that defines as contexts the topics of the documents that contain the target words, henceforth referred to as topic model.

The idea of topic contexts is to determine how informative/surprising a word w is, given the topics within all its discourses, i.e., the documents in which w occurs.
Topic contexts satisfy the definitions within surprisal theory [11]: the difficulty of processing a word equals its surprisal, both in a small context within sentence boundaries and within a large extra-sentential context [23]; see formula 10.

difficulty ∝ −log2 P(wi | w1...i−1, context)    (10)
In the concept of topic contexts, a distinction is made between (i) overall discourse contexts, which are extra-sentential and can comprise the entire corpus in which the target word occurs, and (ii) local contexts of the target word within the sentence boundaries. The idea of topic contexts is compatible with Discourse Representation Theory [20]: every new word is an increment to the overall context, i.e., the discourse. The information content of the overall discourse increases by the information content of that word, calculated from its discourse and its local context.
2 Related Work

To the best of our knowledge, information theory has rarely been employed for keyword extraction so far. Ravindra [35] successfully applied collocation information, i.e., the Shannon entropy of a collocation of two words, for extractive summarisation. In that study, extremely small, non-extra-sentential contexts are thus exploited, i.e., bigrams, which, as will be demonstrated below, do not yield satisfying results in our study. Mutual information has been used by [1] and by [16] for abstractive summarisation. However, the calculation of mutual information does not take extended and extra-sentential contexts of target words into account, as, in contrast, our approach does (see above). In general, pioneering work in supervised approaches to key phrase extraction comes from [50], who introduced the KEA algorithm, which is based on the features TF-IDF and first occurrence of key phrases and employs a Bayesian classifier. Nowadays, graph-based approaches such as TextRank (TR) [30] are state of the art. TR is based on co-occurrences of words, represented as a directed graph, and on the number of incoming and outgoing links to neighbours on both sides of the target word. A highly effective graph-based approach is introduced by [6, 47], who make use of k-trusses¹ within k-degenerate graphs for keyword extraction. The authors propose that TR is not optimal for keyword extraction since it fails to detect dense substructures or, in other words, influential spreaders/influential nodes [47] within the graph. The idea is to decompose a graph of words with maximum density to the core [14].
¹ A k-truss in a graph is a subset of the graph such that every edge in the subset is supported by at least k − 2 other edges that form triangles with that particular edge. In other words, every edge in the truss must be part of k − 2 triangles made up of nodes that are part of the truss. https://louridas.github.io/rwa/assignments/finding-trusses/.
3 Dataset

We collected 100,673 German-language texts from Heise-News² and split them into a training set containing 90,000 texts and a validation set with 10,673 texts. For each text we have the headline and the text body; for 56,444 texts we also have the lead text, of which 50,000 are in the training set. The length of each text varies between 250 and 5,000 characters. The keywords were extracted from the associated meta-tag. There are 50,590 keywords in total. For this paper, we focused on the keywords that can be found in the headline, the lead, and the text, resulting in 38,633 keywords. The corpus contains a total of 1,340,512 word types when splitting on blanks and 622,366 when filtering with the regex [\w-]+ with Unicode support. The frequency of the keywords varies widely. The three most common keywords are Apple with a frequency of 7,202, Google (5,361), and Microsoft (4,464). On the other hand, 24,245 keywords occur only once. 25,582 keywords are single words, and the longest keyword is Bundesanstalt für den Digitalfunk der Behörden und Organisationen mit Sicherheitsaufgaben.

² https://heise.de.
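As an aside, the two type counts mentioned above can be reproduced with a few lines of Python; the snippet below only illustrates the counting procedure (the corpus itself and the file handling are assumed and not shown).

```python
import re

def count_types(texts):
    """Count word types once by splitting on blanks and once with the regex [\\w-]+
    (Unicode-aware by default in Python 3)."""
    blank_types, regex_types = set(), set()
    pattern = re.compile(r"[\w-]+")
    for text in texts:
        blank_types.update(text.split())
        regex_types.update(pattern.findall(text))
    return len(blank_types), len(regex_types)

# texts would be the 100,673 Heise news items (headline, lead, and body concatenated).
```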
3.1 What Are Keywords?

The four keywords mentioned before are all names. To understand whether this is always the case, we deploy a named-entity recogniser (NER). We use the de_core_news_md language model from spaCy [12]. When running the NER over all 50,590 keywords, sometimes not the entire keyword is tagged as one entity. In this case, when all tags are the same, we combine them and treat the tag as one for the entire keyword. This results in 10,429 keywords with no tags, 7,845 keywords tagged as location (LOC), 10,592 keywords as person (PER), 10,640 keywords as miscellaneous (MISC), and 10,894 keywords as organisation (ORG). 186 keywords have two tags that are not the same, 3 keywords have three tags, and 1 keyword has all four tags, namely DMC-GH1; Lumix G Vario HD 14-140mm/F4.0-5.8 ASPH. O.I.S. with the tags ORG(ORG/MISC)/MISC. This means that about 80% of the keywords are named entities.

Next, we ran the NER on both the training and the validation set. The results for the training set can be found in Table 1 and for the validation set in Table 2. We report numbers for four different combinations of named-entity tags. As expected, the recall is always significantly higher than the precision, because many more words are found than are in the keyword sets. The NER outperforms all other methods on recall and on the accuracy measures we employ, which means that as keyword candidates one should look only for named entities in a text. The bad precision values also show that the evaluation is highly dependent on the ground truth, i.e., the keywords we found with the texts. That the NER did not reach 100% for a1 is due to the fact that the NER itself is not perfect.
Table 1 Precision (Prec), recall (Rec), F1, and the three accuracy values (a1–a3) of the NER for the training set

Methods               | Prec (%) | Rec (%) | F1 (%) | a1 (%) | a2 (%) | a3 (%)
NER ORG/PER           | 15.04    | 22.33   | 15.90  | 65.55  | 24.24  | 7.26
NER ORG/PER/LOC       | 11.75    | 25.78   | 14.51  | 71.46  | 29.79  | 10.05
NER ORG/PER/MISC      | 10.33    | 34.20   | 14.91  | 81.26  | 43.71  | 19.40
NER ORG/PER/LOC/MISC  | 9.15     | 37.06   | 13.94  | 84.71  | 48.12  | 22.37
For example, our longest keyword occurs only once as a keyword, but if we look at all the texts, it occurs a total of 14 times in them. Since it is a very specific organisation name, it could or even should also be a keyword in the other 13 texts.
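A minimal sketch of how named-entity keyword candidates can be obtained with spaCy's de_core_news_md model, as used above. The merging of multi-token entities and any further filtering are simplified compared to our actual procedure, and the example sentence is purely illustrative.

```python
import spacy

nlp = spacy.load("de_core_news_md")

def ner_keyword_candidates(text, allowed_labels=("ORG", "PER", "LOC", "MISC")):
    """Return the set of named entities in a text whose label is in allowed_labels."""
    doc = nlp(text)
    return {ent.text for ent in doc.ents if ent.label_ in allowed_labels}

# Illustrative sentence, not taken from the corpus:
print(ner_keyword_candidates("Apple und Google streiten in Berlin über Patente."))
```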
4 Method

4.1 Baseline

The first baseline approach we employed is TextRank [30]. For keyword extraction, the TextRank algorithm builds a (directed) graph with words (or even sentences) as nodes within a text or a paragraph. The weight of a word is determined within a sliding context window and results essentially from the number of outgoing links of the words directly preceding the target word. Our second baseline was the TF-IDF ranking function of words [19]. This measure is the product of term frequency and inverse document frequency. The term frequency is the total occurrence of a term in a specific document, whereas the inverse document frequency is the quotient of the total number of documents and the number of documents that contain that term.
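The following sketch illustrates the TF-IDF baseline described above. It uses the common log-scaled inverse document frequency; the exact weighting variant and tokenisation used in our experiments may differ.

```python
import math
from collections import Counter

def tfidf_keywords(documents, doc_index, top_k=5):
    """Score every word of documents[doc_index] by TF-IDF and return the top_k words."""
    tokenised = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))
    n_docs = len(documents)

    # Term frequency in the target document times (log-scaled) inverse document frequency.
    tf = Counter(tokenised[doc_index])
    scores = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```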
4.2 Information Theory Based Methods

4.2.1 Bigram Model

We determined the probability of a word on the basis of the probability that it occurs in the context of the preceding word. We chose a bigram model because the chosen corpus contains many technical words which occur only rarely. This leads to the undesired effect that, when calculating with trigrams (or higher-order n-grams), many of the calculated probabilities are 1: the counts of many target words are divided by the counts of their preceding contexts, which occur only once in all the documents, and consequently
the information content of these words is 0. Thus we calculated the information content of a word with formula 11:

SI(wi) = −log2 P(wi | wi−1)    (11)
For the calculation, all bigrams from the headings, leads, and texts of the corpus were extracted and preprocessed, and their frequencies within the corpus were counted. During preprocessing, the words were lowercased in order not to distinguish between uppercase and lowercase forms of the same words. Furthermore, punctuation and special characters were removed; a calculation of the information content of these signs would not be meaningful, since they are not suitable as candidates for keywords. Digits were also replaced by a special character ($) in order not to distinguish between individual numbers (e.g., 1234 and 1243) in the information calculation, which would lead to a disproportionately high information content due to their rarity. All keyword occurrences where the keyword consists of more than one word were combined into a single token. The five most informative words of each text were chosen as keywords.
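The following sketch condenses the bigram model: preprocessing (lowercasing, removing punctuation, replacing digits by $), counting bigrams, computing SI(wi) = −log2 P(wi | wi−1) as in formula 11, and selecting the five most informative words per text. It is a simplified stand-in for our actual pipeline (e.g., the merging of multiword keywords into single tokens is omitted), and it assumes that the text being scored is part of the training corpus, so every bigram has been seen at least once.

```python
import math
import re
from collections import Counter

def preprocess(text):
    text = text.lower()
    text = re.sub(r"\d", "$", text)        # collapse all digits into one symbol
    return re.findall(r"[\w$-]+", text)    # drop punctuation and special characters

def train_bigrams(corpus_texts):
    unigrams, bigrams = Counter(), Counter()
    for text in corpus_texts:
        tokens = preprocess(text)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def keywords(text, unigrams, bigrams, top_k=5):
    tokens = preprocess(text)
    si = {}
    for prev, word in zip(tokens, tokens[1:]):
        p = bigrams[(prev, word)] / unigrams[prev]   # P(w_i | w_{i-1})
        # keep the highest surprisal observed for each word (one possible aggregation)
        si[word] = max(si.get(word, 0.0), -math.log2(p))
    return sorted(si, key=si.get, reverse=True)[:top_k]
```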
4.2.2 Parser Model
This approach is based on a psycholinguistic model presented in [11]. It aims to explain difficulties readers experience when reading so-called garden-path sentences, i.e., grammatically correct sentences which invite the reader to interpret the syntax in a way that later turns out to be wrong. In his paper, Hale uses a famous example due to [2]: The horse raced past the barn fell. Before reading the last word, the reader is invited to interpret The horse as the subject and raced as the verb of the sentence. However, as soon as the word fell is added, the reader is forced to reconsider their interpretation. The crucial assumption here is that sentences are read and processed one word at a time. Given a context-free probabilistic grammar G, i.e., a context-free grammar whose rules have probabilities assigned to them, Hale's model realises syntactic interpretations of a sentence s as parse trees Ts equipped with a likelihood value P(Ts). This value is the product of all the probabilities associated with the grammar rules employed in Ts. The surprisal of a word wi within a sentence s = w1 · · · wn is computed through the prefix probability α(si−1) of the subsentence si−1 = w1 · · · wi−1, which is the probability that si−1 occurs as a prefix of any sentence generated by G. This can also be expressed as a mathematical formula, given by 12:

α(si−1) = Σ_{t ∈ strings of words} Σ_{T ∈ parse trees of si−1 t} P(T)    (12)

The surprisal of wi is then given by formula 13:

SI(wi) = −log2 (α(si) / α(si−1))    (13)
This means that the surprisal of a word measures how dramatically the likelihood of a subsentence being a prefix decreases when the word is added to it. Hale's model uses a modified Earley parser due to Stolcke [46] to compute the prefix probabilities. We created our probabilistic grammar by annotating our model corpus using spaCy [12] with the model de_core_news_md. It yielded 445,210 rules.

For our purposes, using an Earley parser was not feasible. With our extensive number of rules and a test corpus of several million sentences, an Earley parser would have taken unreasonably long to draw the entire information map. Instead, we approximated the prefix probabilities by α(si−1) ≈ max_{T ∈ parse trees of si−1} P(T). The reasoning behind this goes as follows. First, notice that most parse trees of a string of words si−1 t will have marginal probabilities. In fact, the longer t gets, the smaller the probability of each parse tree will become. In addition, the number of alternative parse trees will not grow considerably, because, for the most part, German does a decent job at avoiding syntactic ambiguities. Hence, we may first assume that α(si−1) ≈ Σ_{T ∈ parse trees of si−1} P(T). This may seem like a too radical approximation, but the average probability of a grammar rule is very low, especially for rules of the form Non-Terminal → Terminal. The second step once again takes advantage of the fact that the syntax in German is unambiguous most of the time. This means that the parse forest of an expression si−1 is often small and the probabilities of the parse trees other than the most likely one are smaller by a considerable margin, which justifies the approximation. Hence, instead of using an Earley parser, we annotated all the subsentences of the form w1 . . . wi individually using spaCy and then computed the probability of the resulting tree.

From the results, it becomes clear that this model is very ill-suited to the problem at hand (see Sect. 5). One possible source of errors is our approximation of the prefix probabilities. However, it is unlikely that the surprisal values were distorted enough to assume that the model would have worked better without the approximations. There may be a variety of reasons for the poor performance, but probably the most important one comes directly from the way the model is designed. The type of surprisal Hale's model measures is almost exclusively syntactic in nature, whereas keywords of a text carry semantic information and should be chosen independently of their respective syntactic contexts. Indeed, in the parser model, the words in a sentence serve hardly any role besides providing a syntax tree.
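The approximation can be summarised in a short sketch: given rule probabilities estimated from the annotated corpus and a single (best) analysis per prefix obtained from spaCy, the prefix probability is approximated by the probability of that one tree, and surprisal follows formula 13. How grammar rules are extracted from a spaCy analysis is only indicated here and depends on the grammar formalism; the helper names are illustrative, not part of any library.

```python
import math

def tree_probability(rules, rule_probs):
    """Probability of a single parse tree: the product of its rule probabilities."""
    p = 1.0
    for rule in rules:
        p *= rule_probs[rule]
    return p

def surprisal(prefix_rules_i, prefix_rules_i_minus_1, rule_probs):
    """Formula 13 with alpha approximated by the probability of the best parse tree
    of each prefix, i.e., the approximation described in the text above."""
    alpha_i = tree_probability(prefix_rules_i, rule_probs)
    alpha_i_minus_1 = tree_probability(prefix_rules_i_minus_1, rule_probs)
    return -math.log2(alpha_i / alpha_i_minus_1)

# prefix_rules_i and prefix_rules_i_minus_1 would be the grammar rules used in the
# spaCy analyses of w_1 ... w_i and w_1 ... w_{i-1}, respectively; rule_probs maps
# rules to probabilities estimated from the 445,210 rules of the annotated corpus.
```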
4.2.3 Topic Model
A topic model is a statistical model that tries to detect and identify the topics that appear in a text collection [3]. As pointed out above, Shannon information is calculated relative to a context, although there is no clear definition of what a context could be: it might be n-grams of tokens, part-of-speech tags, or the syntactic context. Our topic model defines the context as a topic and calculates the average Shannon Information value SI for each word in the dataset, depending on the contexts/topics in which it occurs within the discourse of the complete set of texts.
Fig. 1 Topic model workflow: Pre-processing → LDA → SI Calculation → Keyword Selection
Figure 1 shows the model workflow. It consists of four main steps, starting with preprocessing to clean and prepare the dataset for the topic modelling step. After identifying the topics by applying Latent Dirichlet Allocation (LDA), we are able to calculate the SI value for each word in the document. A selection process is then performed to get the keyword candidates by taking into account the SI value and the word frequency within the document. Words with the highest score in each document are selected as predicted keywords.

Preprocessing. Before employing the LDA algorithm to discover the topics, the texts need to be cleaned and prepared. The preprocessing phase starts with converting the texts to lower case, then tokenising them, and removing stopwords and non-alphabetical tokens. Since there is no evidence that stemming or lemmatisation would enhance the topic modelling results [29, 43], we decided not to perform them, especially given the large size of our dataset and the consequent processing time.

Latent Dirichlet Allocation (LDA). This is a generative statistical model used to discover the hidden patterns and contexts in a document collection and to classify the documents according to the discovered topics. LDA supposes that every document is a mixture of topics and that each topic is represented by a probability distribution over the vocabulary. The algorithm takes the number of topics as a parameter and returns a distribution of topics θi for each document di as output.

SI Calculation. At this stage we calculated the average Shannon Information value SI for each word in the dataset using formula 14, where n is the number of contexts (topics) in which the word w occurs, and P(w|ti) is the probability that the word w occurs in the context ti:

SI(w) = −(1/n) Σ_{i=1}^{n} log2 P(w|ti)    (14)
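A minimal sketch of the LDA stage using gensim, which is an assumed implementation choice; the original experiments may have used a different library or different parameters. It returns, per document, the dominant topics with a probability greater than 0.30, capped at three, as described in the example below.

```python
from gensim import corpora
from gensim.models import LdaModel

def dominant_topics(token_lists, num_topics=500, threshold=0.30, max_topics=3):
    """Train LDA on pre-processed token lists and return the dominant topics per document."""
    dictionary = corpora.Dictionary(token_lists)
    bows = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=num_topics)

    result = []
    for bow in bows:
        topics = lda.get_document_topics(bow)          # list of (topic_id, probability)
        strong = sorted((t for t in topics if t[1] > threshold),
                        key=lambda t: t[1], reverse=True)[:max_topics]
        result.append([topic_id for topic_id, _ in strong])
    return result
```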
Example. Let us suppose that the word w occurs in five different documents d1, d2, ..., d5. Performing LDA on the dataset will produce a distribution of topics θi for every document di. This distribution is a K-dimensional vector of probabilities which must sum to 1. The experiment shows that each document has dominant topics with high probabilities. Therefore, we picked the topics which have probabilities greater than 0.30, allowing at most three dominant topics per document. After aggregating the dominant topics covered by the documents d1, d2, ..., d5, let the word w occur in four topics (t1, t2, t3, t4) with frequencies (2, 3, 1, 4), respectively. According to formula 14, the Shannon Information value will be 2.175.

Keyword Selection. Word frequency within a document d might be an indicator of whether a word is a keyword or not. Our assumption is that frequent words are more likely to be keywords, so we decided to multiply the word frequency cd(wi) with SI(wi) to calculate a score for every word wi in the document d and select the words with the highest scores. Using a part-of-speech tagger to tag the text and assign a grammatical category to each token would be very useful: it would enable us to reduce the candidates for the keyword selection process, since some categories such as prepositions, adverbs, adjectives, etc. are unlikely to be keywords. The performance of our model is related to the number of topics we use to classify the articles at the LDA stage. Therefore, in order to find the best parameters, we performed the LDA several times with different topic numbers. The experiment shows that the higher the number of topics is, the better the model works. However, after a specific threshold, we noticed that there is no benefit in increasing the number of topics. Table 2 shows that the model reaches its best performance when 500 topics are used.
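The following snippet reproduces the worked example and the selection step: SI(w) is computed from the word's occurrence frequencies across its dominant topics (formula 14, with P(w|ti) taken as the relative frequency, as in the example), and candidate words are then ranked by frequency times SI. It is a simplified illustration rather than the full implementation.

```python
import math
from collections import Counter

def average_si(topic_frequencies):
    """Formula 14 with P(w|t_i) estimated as relative frequency over the word's topics."""
    total = sum(topic_frequencies)
    probs = [f / total for f in topic_frequencies]
    return -sum(math.log2(p) for p in probs) / len(probs)

print(average_si([2, 3, 1, 4]))   # -> 2.1756..., the 2.175 reported in the example above

def select_keywords(doc_tokens, si_values, top_k=5):
    """Score each word by its in-document frequency times its SI value."""
    freq = Counter(doc_tokens)
    scores = {w: freq[w] * si_values[w] for w in freq if w in si_values}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```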
4.3 Neural Network

Neural networks have become state of the art in many fields in recent years. There are several possible approaches to be considered: (i) a multi-label classification approach, which in our case would be limited to predicting keywords out of our set of 38,633 keywords; (ii) a sequence-to-sequence approach, which would also be limited to the 38,633 keywords in our dataset, or to generating the output sequence over the whole vocabulary of 1,340,512 (or, after filtering, 622,366) word types with some kind of upper boundary on how many keywords a text can have, which for our dataset is about 40; (iii) marking which words in the input are keywords [52], which theoretically has none of the limitations of the other two approaches, but can only give keywords that are part of the text itself. Strictly speaking, the first approach is a classification task, the second is either a classification task or (if using the whole vocabulary for the keyword output) a keyword generation task, and the third is a keyword extraction task. To be in line with the other methods, which are extractive, the neural network tries to follow this third approach. Similar to [52], the neural network predicts whether a word in the input sequence is a keyword.
Fig. 2 Schematic RNN network architecture
But instead of working at the word level, we chose a characterwise approach. In contrast to [52], where only about 140,000 word types occur, our network would have to work on a vocabulary of 1,340,512 (after filtering, 622,366) word types, resulting in a very large neural network with more than 500,000,000 parameters. With the characterwise approach, the neural network is fairly small, having only 1,561,996 parameters. The network architecture is straightforward. The network has three inputs and three outputs, one for each part of a text, i.e., headline, lead, and text. First comes an embedding layer followed by a bidirectional Gated Recurrent Unit (GRU) [5]; these two layers are shared over all three inputs. The output layers are dense layers where the number of units corresponds to the maximum length of each part, i.e., 141, 438, and 5,001. The texts are fed characterwise into the network. If a character is part of a keyword, 1 is outputted; if not, 0 is outputted (see Fig. 2). For the training, all characters with an incidence of less than 80 in the whole dataset were treated as the same character, resulting in a vocabulary of 132 characters. The network was trained for 14 epochs, which took about 121 hours.
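The architecture can be sketched as follows in Keras, an assumed framework choice; the embedding dimension and GRU size are illustrative guesses, whereas the vocabulary size and the per-part output lengths follow the description above. The embedding and the bidirectional GRU are shared across the three inputs, and each dense output predicts, per character position, whether that character belongs to a keyword.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 132                                           # characters after merging rare ones
MAX_LENS = {"headline": 141, "lead": 438, "text": 5001}    # maximum length per text part

# Layers shared over all three inputs.
embedding = layers.Embedding(VOCAB_SIZE, 32)
encoder = layers.Bidirectional(layers.GRU(128))

inputs, outputs = [], []
for part, max_len in MAX_LENS.items():
    inp = layers.Input(shape=(max_len,), name=f"{part}_in")
    hidden = encoder(embedding(inp))
    # One sigmoid unit per character position: 1 = part of a keyword, 0 = not.
    out = layers.Dense(max_len, activation="sigmoid", name=f"{part}_out")(hidden)
    inputs.append(inp)
    outputs.append(out)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```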
4.4 Evaluation Method

As evaluation measures we used precision (Prec), recall (Rec), F1, and accuracy. We determined the latter as follows: accuracy 1 (a1) is the percentage of model-generated keyword sets that share at least one word with the respective keyword set from the dataset. For a2 and a3 we require an intersection of at least two and three words, respectively.
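A sketch of these measures: per-document, set-based precision, recall, and F1, averaged over documents, plus the a1–a3 accuracies defined above. Normalisation of keywords and other implementation details are omitted.

```python
def evaluate(predicted_sets, gold_sets):
    """predicted_sets and gold_sets are parallel lists of keyword sets, one pair per document."""
    precisions, recalls, f1s, hits = [], [], [], [0, 0, 0]
    for pred, gold in zip(predicted_sets, gold_sets):
        overlap = len(pred & gold)
        p = overlap / len(pred) if pred else 0.0
        r = overlap / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
        for k in range(3):                  # a1, a2, a3: at least k+1 common keywords
            hits[k] += overlap >= k + 1
    n = len(gold_sets)
    return (sum(precisions) / n, sum(recalls) / n, sum(f1s) / n,
            [h / n for h in hits])
```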
Table 2 Precision (Prec), recall (Rec), F1, and the three accuracy values (a1–a3) of the employed methods

Methods               | Prec (%) | Rec (%) | F1 (%) | a1 (%) | a2 (%) | a3 (%)
TextRank              | 6.99     | 6.78    | 7.35   | 22.15  | 1.97   | 0.12
TF-IDF                | 3.30     | 3.20    | 3.20   | 18.54  | 2.54   | 0.26
NER ORG/PER           | 15.15    | 22.04   | 15.90  | 65.36  | 24.44  | 7.63
NER ORG/PER/LOC       | 11.86    | 25.55   | 14.53  | 71.21  | 30.24  | 10.63
NER ORG/PER/MISC      | 10.41    | 34.04   | 14.97  | 81.19  | 44.20  | 19.98
NER ORG/PER/LOC/MISC  | 9.20     | 36.94   | 13.99  | 84.62  | 48.54  | 23.08
RNN                   | 0.92     | 0.92    | 0.92   | 1.10   | 0.10   | 0.10
Bigram model          | 3.00     | 17.00   | 5.10   | 11.00  | 3.20   | 0.70
Parser model          | 3.40     | 3.20    | 3.30   | 8.23   | 0.57   | 0.04
Topic model (50)      | 17.39    | 17.42   | 17.39  | 54.50  | 16.20  | 2.88
Topic model (100)     | 19.35    | 19.30   | 19.71  | 57.30  | 18.60  | 3.66
Topic model (300)     | 20.65    | 20.59   | 20.62  | 59.25  | 19.85  | 4.15
Topic model (500)     | 21.48    | 21.42   | 21.45  | 60.89  | 21.15  | 4.58
5 Results

The performances of the respective models are given in Table 2. Although no model achieves high precision, recall, or F1 scores, it is evident that the NER, followed by the topic model, outperforms all remaining models in all measures that we applied. The two baseline models, TextRank and TF-IDF, perform considerably better than the bigram and the parser model. The results of the RNN are of poor quality. The topic model has the highest precision and F1 values, whereas the NER has the highest recall, which is due to the fact that the NER returns many more words than the topic model and so nearly always hits at least one correct keyword.
6 Conclusion and Discussion

We raised the general research question whether Shannon information can be a semantic model of natural language, or, more generally, whether Shannon's information theory can handle a semantic information level which is applicable to natural language and human communication. These questions can, with all due caution, be answered in the affirmative. The novel information theory-based topic model, which considers extra-sentential contexts for the calculation of information content, achieved the highest correspondence with human-created keywords of all employed models apart from the NER, including an RNN, TF-IDF, and a TextRank model.
The high accuracy for the NER is mostly due to the fact that nearly all keywords are names. The topic model showed the highest agreement with the cognitive performance of language comprehension of all models. Even though the possibility that some other factors might have played important roles cannot be completely excluded, we interpret the result as evidence that the parameters of the topic model, i.e., information theory and extra-sentential contexts, are suitable parameters for modelling language comprehension. The implications are the following: (i) Since the topic model is a model of language comprehension, it is a candidate for a semantic model of natural language. (ii) Since the theoretical framework of the topic model is information theory, it might have explanatory power regarding the semantics of natural language. By considering extra-sentential contexts, the topic model follows a principle from Hale's surprisal theory [11], which provides a model of language comprehension. Furthermore, the relatively successful result of taking extra-sentential contexts into account also corresponds to the standpoint taken in the introduction. In Sect. 1.2 we observed that semantic information is transmitted in the form of an abstracted property. In our study, the extracted keywords can be regarded as bearing semantic information in the sense that they are selected on the basis of the extra-sentential contexts, which are not directly extracted from the target documents and which therefore function as an abstracted property.

In order to prevent a premature overestimation of the topic model, we should be aware that even this model yielded modest results at best. It did, however, outperform the two baseline models, which demonstrably perform well on tasks in the English language. The language in focus of the present study, however, was German, which is morphologically much more complex than English and has a considerably larger number of possible tokens. This might have an influence on the information distribution in sentences. Secondly, the possibility of combining words with hyphens is used much more frequently in German than in English, e.g., in combinations such as Ebay-Konzern and PDF-files. The bad results could also be due to a corpus bias: the long keywords are characteristic of texts on technological topics.

But why did the remaining models perform poorly in our study? In particular, the poor performance of the neural network is striking: we assume that it is partly due to the choice to make the network work characterwise. For example, in some cases the neural network predicted FDP together with adjacent whitespace characters as a keyword. Since the whitespaces are not part of the gold-standard keyword, the network predicted an apparently wrong keyword. By contrast, if the network had worked on tokens, this would not have happened, but the network would have been significantly larger. Additionally, there was very little time to test various hyperparameters. The bigram model performed weakly because the size of the exploited contexts is simply too small. In other words, the assumption that the conditional probability of a word given a preceding word could be used to determine an information content that is relevant for semantics proved to be incorrect in this study. The probabilistic parser model, as implemented in this study, captures syntactic intricacies
well; however, it does not seem to be fit for tasks involving semantics alone in the syntactically homogeneous discourse that is technology news. The parser model shows its strength in particular with constructions such as garden-path sentences (see Sect. 4.2.2). Garden-path sentences are grammatical, but they have the flavour of artefacts and are hardly to be expected in factual texts such as the Heise news. Apart from that, for long sentences with more than 80 tokens, the probabilities of the prefix trees were rounded down to 0, which rendered them unusable.

Keyword extraction and generation include the evaluation of the keywords. This is a non-trivial problem. For its solution, common approaches require an exact match of the extracted or generated keywords with keywords from a set assumed as standard/true. This, however, as we will argue, cannot yield satisfying results. In a number of recent state-of-the-art studies (see for instance [14, 15, 28, 47]), the measures precision, recall, and F1 are applied, while a human-created set of keywords serves as the standard, and these measures are based on direct matching. Another method is evaluation by human raters (see for instance [48]); however, Hulth [15] raises objections, referring to a report [7] on considerable diversity within human ratings. A more extrinsic method could be a task-based evaluation, where the generated keywords are used to accomplish a task faster or better than with a baseline approach, as employed for instance in [49] for information retrieval in web data. In order to evaluate the proposed algorithms, the authors apply the measures time, i.e., the time span until a user manages to find a certain webpage or image, and click count, i.e., the number of clicks that a user needs to achieve a result. This kind of evaluation, however, is time-consuming and extremely expensive, in particular when a large number of keyword sets has to be evaluated. An evaluation should thus be able to abstract from direct matching and, where possible, should not employ human ratings, since this would require an expensive second line of research with questionable results.

For the evaluation of keywords, the principal question is: does a ground truth exist, i.e., the true set of keywords based on the true semantic meaning of words and multiword units? Basically, this is an epistemological problem derived from Kant's dictum in his work Kritik der reinen Vernunft that reality cannot be recognised. One could state that the ground truth of keywords arises indirectly, (i) through the choice of method and model and (ii) because keywords are not a natural phenomenon in the world, so that a potentially infinite number of ground-truth keyword sets is conceivable. The epistemological question of what a ground truth can be for keywords was not addressed in this study. In addition, the question arises: how can an evaluation capture the semantic relations of synonymy (see Leibniz's principle of substitutio salva veritate), hypernymy, and hyponymy? If, for instance, a text is about Angela Merkel, Barack Obama, and Gerhard Schröder, and the keyword politicians is outputted, then the latter is a hypernym of the three former, since they are members of the set of (former) politicians. An evaluation by exact matching would not detect this good performance of the model, as there is no match between the chain of signs politicians and the respective names of the three politicians.
An additional problem is the evaluation of complex expressions, i.e., complex words and multiword units or their abbreviations, each of them with an individual semantic representation whose meaning can be computed according to the Fregean principle of compositionality [10]. It is obvious that an evaluation strategy based on exact matching of keywords neglects the human cognitive ability of abstraction, concept formation, and classification, as modelled, for instance, by Ogden and Richards [31] in their famous semiotic triangle, which differentiates between an entity in the world, its linguistic representation, and its concept as a cognitive phenomenon. The semiotic triangle is essentially based on Aristotle's differentiation between things, linguistic utterances, and affections of the soul, and it was also a point of departure for the models of Frege [10], Peirce [34], and finally Sowa [45]. Concept formation and abstraction are thus characteristic human abilities, and the desideratum is that state-of-the-art methods and techniques of keyword evaluation should be able to approximate these abilities. Keyword evaluation should thus go beyond exact matching and should capture concept formation and abstraction.

Given this postulate, we propose that keyword evaluation has to be based on a fine-grained semantic representation of words within a structuralist and distributional framework, i.e., considering co-occurring context words. Applying word embedding techniques allows word semantics to be represented as multi-dimensional vectors whose distances can easily be determined. In addition, a semantic representation of words should be based on decomposed co-occurrences of words. Take as a fictitious example the keywords from above: Angela Merkel, Barack Obama, and Gerhard Schröder. According to the internet resource Wortschatz Universität Leipzig,³ these three words do not have many common co-occurrences: Wahlkampf (election campaign) is a co-occurrence of Angela Merkel and Barack Obama, and Wahlsieg (election victory) is a co-occurrence of Gerhard Schröder. As co-occurrences of Politiker (politician), the resource outputs Wahlen (elections) and Wahlkampforganisation (election campaign organisation). The common intersection is the lemma Wahl (election), which can be obtained either by stemming Wahlen (elections), i.e., the plural form of Wahl (election), or by lexical decomposition. Wahlkampf (election campaign), for instance, can be decomposed into the two lemmas Wahl (election) and Kampf (fight/struggle/campaign). Decomposed word structures can thus be useful to disclose semantic similarities between words.

In order to overcome the non-trivial problem of keyword evaluation, a deeper understanding of the way words work in a language is necessary. One way to achieve that could be to build a word graph and use it to bridge the gap between words; hereby we assume that we know the entire vocabulary of a language. In such a graph, words are close together when they are semantically related. When evaluating whether a keyword is the same as, or close to, another one, these relations can be exploited to find related words with the same or a similar meaning. The construction of a graph-theoretical model and the precise evaluation mechanisms are the subject of future work.

³ https://clarin.informatik.uni-leipzig.de/de?corpusId=deu_news_2012_3M.
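As a first step towards such an evaluation, exact matching can be relaxed to a similarity threshold over word vectors. The sketch below uses spaCy's German vectors purely as an illustration, and the threshold is an arbitrary placeholder rather than a validated choice; it is not the graph-theoretical model proposed above.

```python
import spacy

nlp = spacy.load("de_core_news_md")

def soft_match_precision(predicted, gold, threshold=0.7):
    """Count a predicted keyword as correct if it is sufficiently similar to any gold keyword."""
    if not predicted:
        return 0.0
    hits = 0
    for p in predicted:
        p_doc = nlp(p)
        if any(p_doc.similarity(nlp(g)) >= threshold for g in gold):
            hits += 1
    return hits / len(predicted)

# Example: the hypernym "Politiker" could receive credit against
# {"Angela Merkel", "Barack Obama", "Gerhard Schröder"} if its vector is close enough.
```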
Acknowledgements This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 357550571. The training of the neural network was done on the High Performance Computing (HPC) Cluster of the Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) of the Technische Universität Dresden. Thanks to Caitlin Hazelwood for proofreading this chapter. This chapter is an extended version of the initial paper with the title 'Keyword extraction in German: Information-theory vs. deep learning', published in Proceedings of the 12th International Conference on Agents and Artificial Intelligence (Vol. 1), pp. 459–464, ICAART 2020.
References 1. Aji, S., Kaimal, R.: Document summarization using positive pointwise mutual information. Int. J. Comput. Sci. Inf. Technol. 4(2), 47 (2012). https://doi.org/10.5121/ijcsit.2012.4204 2. Bever, T.G.: The cognitive basis for linguistic structures. Cogn. Dev. Lang. (1970) 3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993– 1022 (2003) 4. Celano, G.G., Richter, M., Voll, R., Heyer, G.: Aspect coding asymmetries of verbs: the case of Russian. In: Proceedings of the 14th Conference on Natural Language Processing, pp. 34–39 (2018) 5. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches (2014). arXiv preprint arXiv:1409.1259. https://doi. org/10.3115/v1/W14-4012 6. Cohen, J.: Graph twiddling in a mapreduce world. Comput. Sci. Eng. 11(4), 29–41 (2009). https://doi.org/10.1109/MCSE.2009.120 7. van Dijk, B.: Parlement européen. In: Evaluation des opérations pilotes d’indexation automatique (Convention spécifique no 52556), Rapport d’évalution finale (1995) 8. Dretske, F.: Knowledge and the Flow of Information. MIT Press, Cambridge (1981) 9. Foley, R.: Dretske’s “information-theoretic” account of knowledge. Synthese 159–184 (1987). https://doi.org/10.1007/BF00413933 10. Frege, G.: Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought. From Frege to Gödel: A Source Book in Mathematical Logic, vol. 1931, pp. 1–82 (1879). https://doi.org/10.4159/harvard.9780674864603.c2 11. Hale, J.: A probabilistic earley parser as a psycholinguistic model. In: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (2001) 12. Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1373–1378 (2015). https://doi.org/10.18653/v1/D15-1162 13. Horch, E., Reich, I.: On “article omission” in German and the “uniform information density hypothesis”. Bochumer Linguistische Arbeitsberichte, p. 125 (2016) 14. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223 (2003). https://doi.org/10.3115/1119355.1119383 15. Hulth, A.: Enhancing linguistically oriented automatic keyword extraction. In: Proceedings of Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics 2004: Short Papers, pp. 17–20 (2004). https://doi.org/10.3115/ 1613984.1613989 16. Huo, H., Liu, X.H.: Automatic summarization based on mutual information. In: Applied Mechanics and Materials, vol. 513, pp. 1994–1997. Trans Tech Publications, Freienbach (2014). https://doi.org/10.4028/www.scientific.net/AMM.513-517.1994 17. Jaeger, T.F.: Redundancy and reduction: speakers manage syntactic information density. Cogn. Psychol. 61(1), 23–62 (2010). https://doi.org/10.1016/j.cogpsych.2010.02.002
18. Jaeger, T.F., Levy, R.P.: Speakers optimize information density through syntactic reduction. In: Advances in Neural Information Processing Systems, pp. 849–856 (2007) 19. Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. (1972). https://doi.org/10.1108/eb026526 20. Kamp, H.: Discourse representation theory: what it is and where it ought to go. Nat. Lang. Comput. 320(1), 84–111 (1988) 21. Krifka, M.: Basic notions of information structure. Acta Linguist. Hung. 55(3–4), 243–276 (2008). https://doi.org/10.1556/aling.55.2008.3-4.2 22. Kölbl, M., Kyogoku, Y., Philipp, J.N., Richter, M., Rietdorf, C., Yousef, T.: Keyword extraction in German: information-theory vs. deep learning. In: ICAART (1), pp. 459–464 (2020). https:// doi.org/10.5220/0009374704590464 23. Levy, R.: Expectation-based syntactic comprehension. Cognition 106(3), 1126–1177 (2008). https://doi.org/10.1016/j.cognition.2007.05.006 24. Liu, R., Nyberg, E.: A phased ranking model for question answering. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 79–88 (2013). https://doi.org/10.1145/2505515.2505678 25. Lombardi, O.: Dretske, Shannon’s theory and the interpretation of information. Synthese 144(1), 23–39 (2005). https://doi.org/10.1007/s11229-005-9127-0 26. Lombardi, O., Holik, F., Vanni, L.: What is Shannon information? Synthese 193(7), 1983–2012 (2016). https://doi.org/10.1007/s11229-015-0824-z 27. Marujo, L., Bugalho, M., Neto, J.P.S., Gershman, A., Carbonell, J.: Hourly traffic prediction of news stories (2013). arXiv preprint arXiv:1306.4608 28. Marujo, L., Ling, W., Trancoso, I., Dyer, C., Black, A.W., Gershman, A., de Matos, D.M., Neto, J.P., Carbonell, J.G.: Automatic keyword extraction on twitter. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 637–643 (2015). https://doi.org/10.3115/v1/P15-2105 29. May, C., Cotterell, R., Van Durme, B.: An analysis of lemmatization on topic models of morphologically rich language (2016). arXiv preprint arXiv:1608.03995 30. Mihalcea, R., Tarau, P.: Textrank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004) 31. Ogden, C.K., Richards, I.A.: The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbolism, vol. 29. K. Paul, Trench, Trubner & Company, Limited, London (1923). https://doi.org/10.1038/111566b0 32. Özgür, A., Özgür, L., Güngör, T.: Text categorization with class-based and corpus-based keyword selection. In: International Symposium on Computer and Information Sciences, pp. 606– 615. Springer (2005). https://doi.org/10.1007/11569596_63 33. Pal, A.R., Maiti, P.K., Saha, D.: An approach to automatic text summarization using simplified Lesk algorithm and wordnet. Int. J. Control. Theory Comput. Model. 3 (2013). https://doi.org/ 10.5121/ijctcm.2013.3502 34. Peirce, C.S.: Collected Papers of Charles S. Peirce. In: Hartshorne, C., Weiss, P., Burks, A.W. (eds.) (1932) 35. Ravindra, G.: Information theoretic approach to extractive text summarization. Ph.D. thesis, Supercomputer Education and Research Center, Indian Institute of Science, Bangalore (2009) 36. Richter, M., Kyogoku, Y., Kölbl, M.: Estimation of average information content: comparison of impact of contexts. 
In: Proceedings of SAI Intelligent Systems Conference, pp. 1251–1257. Springer (2019). https://doi.org/10.1007/978-3-030-29513-4_91 37. Richter, M., Kyogoku, Y., Kölbl, M.: Interaction of information content and frequency as predictors of verbs’ lengths. In: International Conference on Business Information Systems, pp. 271–282. Springer (2019). https://doi.org/10.1007/978-3-030-20485-3 38. Rietdorf, C., Kölbl, M., Kyogoku, Y., Richter, M.: Summarisation by information maps. A pilot study (2019). Submitted 39. Rogers, T.M.: Is Dretske’s Theory of Information Naturalistically Grounded? How emergent communication channels reference an abstracted ontic framework (2007). https://www. researchgate.net/publication/326561084. Unpublished
40. Rooth, M.: Association with focus. Ph.D. thesis, Department of Linguistics, University of Massachusetts, Amherst (1985). Unpublished 41. Rooth, M.: A theory of focus interpretation. Nat. Lang. Semant. 1(1), 75–116 (1992). https:// doi.org/10.1007/BF02342617 42. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988). https://doi.org/10.1016/0306-4573(88)90021-0 43. Schofield, A., Mimno, D.: Comparing apples to apple: the effects of stemmers on topic models. Trans. Assoc. Comput. Linguist. 4, 287–300 (2016). https://doi.org/10.1162/tacl_a_00099 44. Shannon, C.E.: A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x 45. Sowa, J.F., Way, E.C.: Implementing a semantic interpreter using conceptual graphs. IBM J. Res. Dev. 30(1), 57–69 (1986). https://doi.org/10.1147/rd.301.0057 46. Stolcke, A.: An efficient probabilistic context-free parsing algorithm that computes prefix probabilities (1994). arXiv preprint arXiv:cmp-lg/9411029 47. Tixier, A., Malliaros, F., Vazirgiannis, M.: A graph degeneracy-based approach to keyword extraction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1860–1870 (2016).https://doi.org/10.18653/v1/D16-1191 48. Turney, P.D.: Learning algorithms for keyphrase extraction. Inf. Retr. 2(4), 303–336 (2000). https://doi.org/10.1023/A:1009976227802 49. Vijayarajan, V., Dinakaran, M., Tejaswin, P., Lohani, M.: A generic framework for ontologybased information retrieval and image retrieval in web data. Hum.-Centric Comput. Inf. Sci. 6(1), 18 (2016). https://doi.org/10.1186/s13673-016-0074-1 50. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: Kea: practical automated keyphrase extraction. In: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pp. 129–152. IGI Global, Pennsylvania (2005) 51. Yang, Z., Nyberg, E.: Leveraging procedural knowledge for task-oriented search. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 513–522 (2015). https://doi.org/10.1145/2766462.2767744 52. Zhang, Q., Wang, Y., Gong, Y., Huang, X.J.: Keyphrase extraction using deep recurrent neural networks on Twitter. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 836–845 (2016). https://doi.org/10.18653/v1/D16-1080
Towards Aspect Extraction and Classification for Opinion Mining with Deep Sequence Networks Joschka Kersting and Michaela Geierhos
Abstract This chapter concentrates on aspect-based sentiment analysis, a form of opinion mining where algorithms detect sentiments expressed about features of products, services, etc. We especially focus on novel approaches for aspect phrase extraction and classification trained on feature-rich datasets. Here, we present two new datasets, which we gathered from the linguistically rich domain of physician reviews, as other investigations have mainly concentrated on commercial reviews and social media reviews so far. To give readers a better understanding of the underlying datasets, we describe the annotation process and inter-annotator agreement in detail. In our research, we automatically assess implicit mentions or indications of specific aspects. To do this, we propose and utilize neural network models that perform the here-defined aspect phrase extraction and classification task, achieving F1-score values of about 80% and accuracy values of more than 90%. As we apply our models to a comparatively complex domain, we obtain promising results. Keywords Sentiment analysis · Opinion mining · Text mining · Neural networks · Deep learning
1 Introduction

Researchers are becoming increasingly interested in sentiment analysis, also known as opinion mining, because of the steadily growing, mostly user-generated amount of textual data on the web. Pertinent examples are review websites and social media websites. Due to the nature of these websites, recent research in sentiment analysis
has often been document-based, as its main goal is to identify an overall sentiment for whole documents or sentences. In contrast, aspect-based sentiment analysis (ABSA) aims at a fine-grained analysis of opinions expressed about the features of products, services, etc. This approach is necessary because most expressed opinions are related to single aspects of the overall entities. These aspects might contradict each other and numerous opinions might be voiced in one sentence. Up to now, research has mainly focused on domains with a common vocabulary represented by bag-of-words models, and it has neglected that aspects are sometimes represented implicitly by phrases rather than by direct remarks. The research has also overlooked including an understanding of why users rate in a certain manner, which is necessary when using review data that is available in substantial amounts [49]. ABSA is therefore a challenging and rewarding field of research to investigate. Opinion mining (i.e., sentiment analysis) involves three approaches, focusing on the document, the sentences or the aspects. Document-level and sentence-level sentiment analysis suppose one opinion per document or sentence. In ABSA, this is different because it assumes that more than one opinion can be expressed. The following example demonstrates the necessity of ABSA rather well: “Doctor Doe was very nice but she did not shake my hand.” The expressions in bold demonstrate that the physician’s friendliness is rated twice: once positively, and once negatively. Finding an expression such as the negative one shown above is simple for a human but difficult for an algorithm.
1.1 Contribution

Our research addresses two of the three subtasks involved in ABSA: namely, aspect term extraction and aspect class (i.e., category) classification, excluding aspect polarity classification [16]. With this research, we advance the field of ABSA by incorporating implicit, indicated aspect remarks in longer phrases, which are often long and complex in their form and also unique, due to differing wordings, insertions, etc. We decided to investigate German texts, a morphologically rich and complex language, and selected physician reviews for the dataset because of the high variety of vocabulary (e.g., professions and diseases in the medical sector versus health-related common vocabulary). By presenting our datasets, describing the annotation process and also delivering a rich set of examples, we create a new gold standard for user-generated content in analyzing physician reviews. These datasets are the base for training neural network architectures by applying a variety of word embedding technologies. In doing this, we evaluate our networks without separating the steps of aspect phrase extraction and classification, in contrast to approaches used for some shared tasks [59]. This chapter is an extended version of [35], enhancing the previous study by adding a second dataset with a new set of aspect classes that are explained and presented in detail through several examples. This extended version also updates the state of the art and adds more details regarding the annotation process. We also describe
the annotation process for the new dataset, which was evaluated by calculating the inter-annotator agreement. In comparison to our previous study, we added improved deep neural network architectures and word embedding configurations. For example, we use an architecture with an attention mechanism and also apply several newly trained word embedding algorithms. The combinations of these technologies were thoroughly evaluated. We also provide more training details and reasons for our design decisions.
1.2 Domain Focus Our research focuses on physician reviews, where it is not possible to perform keyword spotting, due to the fact that several aspects are only implicitly mentioned or indicated by long phrases that also have off-topic insertions. This is especially the case for physician reviews because they deal with services that are usually personal, sensitive, and based on trust [10, 34]. Nevertheless, a great amount of ABSA studies proposes that nouns and noun phrases do represent aspects sufficiently or that aspects are explicitly mentioned in some form [8, 12, 32, 55, 59, 63]. This is especially the case for product and service reviews—so-called search goods. These search goods (e.g., smartphones) will always contain the same specific parts (e.g., a smartphone has a display, battery, etc.). But there are also experience goods, whose performance can only be evaluated after experiencing them because their nature is different and subjective [78]. Most ABSA studies focus on products [16] and services that use a rather limited vocabulary. For instance, the aspects of hotels or restaurants can be rated by using simple nouns: breakfast, cleanliness, etc. Such domains are more often characterized as experience domains than as services. The term ‘experience goods’ characterizes physician reviews rather well. They characterize special services that involve intimate, personal, and private components. Each performed health service is unique, due to several involved factors such as the patient’s personality, the symptoms, the current and any previous diseases, the personality of the health-care provider, and circumstances such as the rooms and the patient’s current feelings. In addition, services performed by persons are generally reviewed on the basis of the staff’s behavior, so keywords focus on the reliability, empathy, and the ambiance of the rooms [79]. These types of health reviews can be accessed on physician review websites (PRWs) such as RateMDs1 in the USA, Jameda2 in Germany and Pincetas3 in Lithuania, to mention a smaller country. On PRWs, users can write qualitative review texts or assign quantitative grades (e.g., between one and five stars, where five is the best), which can usually be assigned to certain aspects such as the physician’s friendliness or perceived competence. The review websites also offer functionalities such as blogging, appointment services, etc. 1 http://ratemds.com,
last visit was on 2020-05-19.
2 http://jameda.de, last visit was on 2020-05-19.
3 http://pincetas.lt, last visit was on 2020-05-19.
Yet, trust is still an important issue on PRWs. Some physicians prefer not to be rated at all, which has led to legal disputes, and others feel unfairly rated. Moreover, users generally feel anonymous, even though they may be identified by the PRW or their physician through technical information or their review texts [2, 10, 34]. The outline of this chapter follows these steps: Sect. 2 deals with related research, current approaches and datasets in the domain of ABSA. Section 3 describes our dataset in general as well as the aspect classes and the annotation process in particular. Section 4 presents our neural network architectures and their functionality for identifying aspect phrases in German physician review texts. The discussion and evaluation can be found in Sect. 5. Section 6 concludes and proposes future work.
2 State of Research Extracting aspect phrases and classifying them is the core task in ABSA [12]. This is different in comparison to classification tasks, where a whole document receives a target value. In ABSA, words and phrases must be found from sequential data. Consequent steps in extraction and classification are to identify sentiment-related words or phrases and then distinguish their polarity, meaning their positivity or negativity [12]. Besides this, sentiment analysis must also tackle other issues such as analyzing emotions or detecting sarcasm [80]. Since 2004, the subject of ABSA has been closely investigated by the research community. Especially, Hu and Liu [32] published a key work. Such studies typically deal with products like televisions, in which sentences like the following are common: “The screen is sharp and nice, but I hate the voice quality.” Since such reviews are published on the product page in an online shop, common aspects can be derived from other sources or by extracting the nouns and noun phrases [59, 61]. Another approach is to use topic modeling with a list of seed words [53, 81]. Furthermore, some scholars [59, 61] have written annotation guidelines which state that [a]n opinion target expression [...] is an explicit reference (mention) to the reviewed entity [...]. This reference can be a named entity, a common noun or a multi-word term [61].
They used annotated data in more than one language for extracting aspect terms and polarity—but not in German, and their domains involved hotels and restaurants. The results for the separated steps, such as aspect term extraction and classification, were almost all below 50% [59]. Other approaches use dependency and constituency parsing in order to find relevant words. However, these words are mostly nouns, too [55]. The main issue with nouns is the assumption that aspects are mostly explicitly mentioned in texts through single words or short noun phrases. This does not comply with the complexity of human language texts. While there are explicit aspect remarks in the form of nouns and noun phrases, there are also implicit forms that come in forms such as adjectives, verbs, etc., and in complex constructions of these. A rather simple example is the adjective “expensive”, which can refer to the price of a smartphone [44].
Section 3, which presents the data of this work, will demonstrate that the reviews contain implicit mentions of aspects in various forms (e.g., several word types). However, some scholars extract only the most frequent nouns and group them into synonym classes in order to find aspect phrases in texts [12]. In comparison to the previously mentioned studies, Wojatzki et al. [76] take a different path, by using customer reactions from Twitter and not review texts. In contrast to our study, their social media data revolve around the German railway company Deutsche Bahn. They annotated data with the goal of identifying aspects in texts by extracting the corresponding words or phrases, refraining from using data from customer reviews or from a sensitive health- and trust-related domain [10, 34]. Furthermore, the aspects in [76] relate to a large train corporation and are thus not as diverse as those related to all different kinds of physicians, medical professions, diseases, symptoms and the sensitive patient-physician relationship [34]. In that railway domain, most aspects can be described with nouns, such as "too loud atmosphere" or the "poor connectivity", and specifically refer to things such as the seats, the noise level, etc. Hence, the general topic of this study has already been touched, yet the domain, approach, and dataset are different. Another study to mention here builds an ABSA pipeline for data on retail, human resources, and banking in the Netherlands. It is the scientific contribution of [16] to ABSA in the sector of commercial and financial services. Here, it is interesting that their approach is based on earlier studies that suggest regarding aspect term extraction as a sequential labeling task. Inside, Outside, Beginning (IOB) tags [7] are used to determine inside, outside, and beginning tokens of aspect phrases. They used more than 20 classes per domain and annotated their data manually. For aspect term extraction, they achieved very high scores. But some of their aspect classification results are below 50%. According to the domain and presented samples, it seems that most aspect terms are nouns.
There is also a variety of datasets that may be used for further analyses based on words in aspect phrases, such
as SentiWS [64] or SentiWordNet [3]. Identifying not only aspects but also the words that describe them in a document encourages further approaches for new analyses. This may enable researchers that use existing and tested data instead of annotating data anew. In general, there are a multitude of studies that work with PRWs [5, 6, 10, 11, 21– 24, 29, 30, 34, 35, 39, 41, 46, 48, 54, 68, 70, 75]. PRWs exist in at least 28 European countries [15]. The portals are quite similar, but show differences, too: RateMDs has four quantitative rating classes [27, 75] while Jameda has 25, Docfinder4 has nine in Austria and Medicosearch5 from Switzerland has six [15]. Zeithaml [78] already mentioned that medical diagnoses are among the most difficult things to evaluate. Other scholars found that physician reviews are very influential in choosing an adequate physician [22] and that most ratings are positive [24]. The importance of physician reviews is underlined by these studies. Finally, researchers must always remember that reviews from PRWs have a specific vocabulary and that trust is a basic requirement [34].
3 New Datasets for Aspect Phrases in Physician Reviews

Our dataset consists of physician reviews from several PRWs located in three German-speaking countries (Austria, Germany, and Switzerland), where the written language is German.
3.1 Data Collection and Overview We gathered the ground data in mid-2018, from March to July. For data download, a distributed crawler framework was developed. We checked the sites manually and found an index that we crawled first. This enabled us to directly access all sites that correspond to a physician. Then, we saved all reviews including the ratings and other information that were publicly displayed. To keep the costs for all stakeholders as low as possible we respected good behavior and did not cause much site traffic. Hence, we collected the data over several weeks before saving them to a relational database [15]. Apart from reviews, including ratings and average ratings for a physician, we collected further information such as the address, opening hours, and introductory texts if written by the physicians themselves. This data might be useful in the future. To gain a broader view, we also collected data from a Spanish, an American, and a Lithuanian PRW in their corresponding language. This gave us the opportunity to make general observations or qualitative comparisons, e.g., regarding the use of
4 http://docfinder.at, last visit was on 2020-05-19.
5 http://medicosearch.ch, last visit was on 2020-05-19.
Table 1 Statistics for German-language PRWs [35, 36]

                                Jameda       Docfinder    Medicosearch
# Physicians                    413,218      20,660       16,146
# Review texts                  1,956,649    84,875       8,547
# Professions                   293          51           139
Avg. rating                     1.68         4.31         4.82
Rating system (best to worst)   1–6          5–1          5–1
Men/Women                       53/47%       71/29%a      No Data
Length (Char.)                  383          488          161

a Only few data were available
rating classes. Detailed observations on the researched data were taken from the three German-language PRWs: Jameda, Medicosearch, and Docfinder. General statistics on this data can be found in Table 1, which demonstrates that most physicians are listed on Jameda in Germany. Fewer physicians are listed on the Austrian Docfinder and the Swiss Medicosearch, while Docfinder and Jameda are both visited more often than Medicosearch, presumably due to the higher number of reviews. On average, the ratings are very positive. The higher number of professions listed in Jameda and Medicosearch may be caused by allowing non-official professions to be listed. For our further steps, we excluded reviews that were in languages other than German. However, the acquired data are not representative for all physicians in the named countries nor in any country, for that matter. The data neither represent all physicians or patients, nor do they provide a consistent rating for a country’s health-care system. Still, they do provide many insights that would otherwise not be possible to date and the combination of textual reviews and quantitative ratings in the form of grades encourages a multitude of analyses. For example, as there are over 2 million sentences, there is a rich linguistic corpus of evaluative statements. The relation among physician and patients is sensitive [34] and, apart from protection, requires further evaluation due to data security questions. Moreover, since patients almost anonymously [10] reveal their findings and use subjective phrases, the data source is also rich in non-standard vocabulary. While these statements also contain many incorrectly spelled words and other typos, they are usually written honestly but may sometimes not reveal the true intention of the reviewer. The analyses also involve how people rate services. While experience goods and services are more difficult to evaluate than products and medical diagnoses are the most difficult to evaluate [78], physician reviews are usually rated on the behavior of the staff [79]. This is revealed by statements such as: “He did not look me in the eye.” and “Dr. Mueller did not shake my hand and answered my questions waaay too briefly.” Further challenges are that PRWs can contain errors, they can decide which information (not) to publish [11], and rating schemes on PRWs are not scientifically
grounded, as far as we know. A scientific approach would be to employ scales such as the RAND questionnaire which uses 50 items to measure patient satisfaction [24].
3.2 Rating Classes To find aspect classes that should be annotated, we used the available classes that can be assigned on Jameda, Docfinder, and Medicosearch. We further used qualitative methods to set fixed classes, using the base from all three PRWs. Examples of the classes that can be found on the PRWs are “explanation,” “friendliness,” etc. The rating classes for annotation were developed by discussing classes from the PRWs and then merging them semantically. A selection of classes that can be found on the websites is presented in Table 2 (translated from German). While Jameda has the most classes, it presents only a subset of them for each profession (Table 2). To extract the aspect terms and classify them, we manually annotated a dataset in order to use it for supervised machine learning. For this study, we chose the aspect target “physician” and annotated over 26,000 sentences in two datasets. The first dataset (“fkza”, an acronym of its German letters) has 11,237 sentences
Table 2 Rating classes on PRWs (selection; translated from German) [15, 35]

Jameda: Treatment, counselling/care, commitment, discretion, explanation, relationship of trust, addressing concerns, time taken, friendliness, anxious patients, waiting time for an appointment, waiting time at the doctor’s office, opening hours, consultation hours, entertainment in the waiting room [...]

Jameda, profession “dentist”a: Treatment, explanation, mutual trust, time taken, friendliness, anxious patients, waiting time for an appointment, waiting time at the doctor’s office, consultation hours, care, entertainment in the waiting room, alternative healing methods, child-friendliness, barrier-free access, equipment at the doctor’s office, accessibility by telephone, parking facilities, public accessibility

Docfinder: Overall assessment, empathy of the physician, trust in the physician, peace of mind with treatment, range of services offered, equipment of the doctor’s office/premises, care by medical assistants, satisfaction in waiting time for an appointment, satisfaction in waiting time in the waiting room

Medicosearch: Relationship of trust (the service provider has taken my problem seriously), relationship of trust (the service provider has taken time for me), information behavior (the service provider has given me comprehensive information), information behavior (the explanations of the service provider were understandable), recommendation (the service provider has fulfilled my expectations), recommendation (I recommend this service provider)

a Example extracted from Jameda (limited number of classes available).
containing the following classes: friendliness, competence, time taken and explanation.6 The second dataset (the acronym “bavkbeg”) contains 15,467 sentences with six classes: treatment, alternative healing methods, relationship of trust, child-friendliness, care/commitment and overall/recommendation,7 where the final aspect class applies not solely to the physician but to the physician, the team, and the offices in general. The second set of classes (“bavkbeg”) has a larger dataset than the first because fewer of its sentences actually contain an aspect phrase. Both sets have roughly the same number of sentences that a machine learning model learns from, as experiments have shown. Systems usually extract the target and the aspect together [80]. In our PRW data, we found three opinion targets: the physician, the team, and the doctor’s office (e.g., “parking situation”). Furthermore, one particular target corresponds to a general evaluation that has just one aspect: overall/recommendation. An example here is: “Totally satisfied.” Before we describe the annotation process, the named aspect classes are clearly defined below.
3.2.1 First Rating Class Dataset (fkza)
• Friendliness deals with the question of whether the physician treats his/her patients respectfully and kindly, whether he/she is nice or nasty when greeting them, and whether he/she looks them in the eye. Generally, friendliness refers to the degree of devotion. Examples (translated from German) are:
  – “She was nice to me and her assistants are quite efficient.”
  – “I don’t understand it, he neither greeted me, nor did he listen.”
• Competence describes the subjectively felt or demonstrated expertise of a physician. The question is whether a rater sensed that the physician knows what to do, why to do it, and how to do it. It neither asks about the general quality of treatment nor does it cover friendliness or empathy. This class also includes whether a physician is good in his/her profession in general: i.e., “knows his job” or “knows how to reduce anxiety” are seen as ratings of competence.
  – “He is not competent but has a conscientious manner.”
• Time Taken refers to the amount of time a physician takes during his/her appointments with patients. Time is a crucial aspect for the perceived quality of a treatment and the treatment itself. When a physician takes sufficient time or much time, patients see this as a positive signal and therefore express positive sentiment toward that physician. However, in the German language and especially in this class, other
6 The aspect classes were translated from German: Freundlichkeit, Kompetenz, Zeit genommen, Aufklärung.
7 The aspect classes were originally in German: Behandlung, Alternativheilmethoden, Vertrauensverhältnis, Kinderfreundlichkeit, Betreuung/Engagement, Gesamt/Empfehlung.
words are placed between words such as “take” and “time”, comparable to these English phrases:
  – “She took a lot of time and [...]”
  – “However, there is one remarkable drawback, because the practice is always overcrowded, so the personal consultation is rather short.”
• Explanation deals with the clarifications a physician uses to explain symptoms, diseases, and (especially) treatments in an understandable manner, because patients naturally need to be informed well. This class is separated from the time taken class because a lengthy conversation may indicate the amount of time used, but the quality of the explanation depends on the details described during this conversation or the questions that the physician asks and answers.
  – “I received a detailed clarification from Dr. Müller.”
  – “The consultation was great and I was very well informed about my disease pattern.”
3.2.2 Second Rating Class Dataset (bavkbeg)
• Alternative Healing Methods describes whether a physician offers alternatives to his/her patients: that is, whether the physician discusses possible treatments with patients and offers other ways. This does not involve the explanation of a treatment, but rather the general attitude towards alternatives, including alternatives that deviate from conventional medicine, which some patients seek.
  – “She also offers alternative methods.”
  – “He is absolutely not a single bit open for homeopathy.”
  – “The doctor deliberately treats alternatively, which I disliked.”
• Treatment deals with the way a physician treats his/her patients, in comparison to the competence class of the first dataset. As the datasets were annotated one after the other, certain decisions arose during annotation that needed to be made in order to classify certain cases. That is, a sentence such as “The doctor is conscientious.” belongs to competence, but the phrase “treats conscientiously” applies to treatment. In German, these two words (adjective and adverb) would look and sound the same. Therefore, this class is rather narrowly defined.
  – “I was satisfied with the treatment.”
  – “She treats conscientiously.”
• Care/Commitment includes all evaluations of whether a physician is further interested or involved in caring for and committing to the patient and the treatment. This can be seen in phone calls after a visit, in asking a patient further about his or her well-being, continuing to work even after the shift, etc.
  – “After the surgery, the doctor came by to ask me how I am.”
  – “He also postponed his workday off for me. Very committed doctor!!!”
• Overall/Recommendation indicates that the physician, the team, and the office are generally recommendable so that patients are satisfied and return. If this expressed satisfaction refers solely to the competence of the physician, it would be classified as competence.
  – “I came here based on recommendations.”
  – “I’d love to go again. You’re in good hands!”
• Child-Friendliness describes how a physician takes care of minors. Children need special protection, and when talking to a physician is involved, they may not be able to express their medical needs and current situation accordingly. Physicians are therefore especially required to be able to handle children, which this aspect class investigates: i.e., “Is a physician treating pediatric patients well?”
  – “She does not talk to my child at all.”
  – “However, he listens to my son.”
  – “Listening to kids—not possible for Dr. Müller.”
• Relationship of Trust describes the sensitive relationship between patient and physician. It concerns the question of whether the patient has confidence in his/her medical service provider and whether this is expressed in a review. Patients often visit the same physician for years, which can be a way to express their trust. Other forms of trust can be seen in these examples:
  – “I feel taken seriously.”
  – “He doesn’t understand me.”
  – “I’ve been going to Dr. Doe for 10 years.”

All of the aforementioned classes can be clearly distinguished. A noun or a single word alone usually does not clearly indicate the class, so multi-word phrases need to be annotated. The overall task was to review and discuss linguistic constructions that arose during the annotation. We created guidelines and documented borderline cases. Yet the process is complicated because the typical German sentence structure provides much flexibility, especially since the word order can be changed quite freely.
3.3 Annotation Process We started the annotation process by splitting reviews into sentences using the tokenizer of the spaCy library [25]. Extremely short sentences and rather long sentences were excluded; sentences that started in the heading and continued in the review were merged. We used sentences instead of full review texts as these contain aspects represented by complex phrases and (generally speaking) because sentence-level annotation is more efficient than at the document level. Overall, we had over 2 million review sentences. As many sentences do not contain relevant information, we annotated approx. 10,000 sentences on the basis of whether they contain an evaluative
Fig. 1 Number of tokens per sentence with aspect phrases for the first dataset (fkza) [35, 36]
statement. Using this set, and with a high level of agreement among participating persons, we built a Convolutional Neural Network (CNN) [43] classifier,8 which determined whether a sentence contained an evaluative statement. The annotation was conducted when the first set of rating classes, fkza, was already developed while the other classes were not set yet. This especially applies to fringe cases (i.e., phrases that could in theory be assigned to more than one class) that created the need to make decisions while annotating the corresponding dataset. Hence, this preselection may contain a certain bias. We tackled this issue by randomly saving all sentences with a minimal probability for an evaluative statement to a new file and then using this to annotate aspect phrases and their classes. For each dataset, we randomly set up a new input file. We regarded this approach as superior and less constrained in comparison to other approaches, such as using seed words [13] which would have caused a loss of information because our vocabulary is diverse and consists of longer phrases that indicate aspects. The dataset was annotated by one person while two other trained persons contributed to discussions, evaluations, and the inter-annotator agreement. In the first rating class set (fkza), there are 11,237 sentences: 6,337 with one or more of the four classes, 4,900 without. In each sentence, it was possible to annotate several aspects and also the same aspect several times: i.e., a sentence could contain all four classes and also the friendliness class twice. We stored our annotations in a database where we also saved the tokenization. Figure 1 shows the sentence length for fkza measured in counted tokens. Most of the sentences are brief, but there are some instances of longer sentences. The second rating class set (bavkbeg) includes 15,467 sentences: 6,600 contain evaluative statements and 8,867 did not. The higher number of sentences without evaluative statements derives from the fact that the six classes of the second set 8 The
neural network was inspired by [37].
did not appear in the data often, especially when compared to the first set. During annotation, this quota was even worse for some time. Consequently, we searched for approaches to increase the number of sentences for annotation that contain the desired classes. As we had viewed over 11,000 sentences, we were confident enough to train a neural network classifier that can predict whether a sentence contains one or more aspect classes in general (multi-output, multi-class classification). At that time, we had seen about 3,000 sentences with aspect phrases and about 8,000 without. The architecture we used is a CNN combined with a bidirectional Long Short-Term Memory [31, 37]. The accuracy was high for the test set (over 90%) and can be considered a good metric, because we also over- and undersampled the data, cutting the number of sentences without useful aspect phrases to reduce their overweight in the classification. This led to sufficient classification results and produced a new input file for the annotations. In each line there was a sentence that contained one of the six mentioned classes (on an alternating basis) and for each sentence more classes may have been predicted: For example, sentence one had the classes childfriendliness and also treatment. The sentences were chosen on the basis of whether they contained one of the classes and not on a high probability for a class. In the end, each sentence that was proposed for annotation had a high probability for at least one class, while each class appeared in every sixth sentence (with a high probability).9 We did this again twice, while also increasing the value for what was regarded as a classification: i.e., going from a probability of 50%, which was required to expect a sentence to belong to a class, to a higher value such as 70%. Figure 2 presents the sentence length of the sentences that contained an aspect phrase. As can be seen, in comparison to Fig. 1, most sentences are rather short, but there are also many long sentences, such as the following example which was translated from German and based on the datasets: “Is able to do what is necessary [competence] and I trust in him [relationship of trust], a good match: Dr. Meyer knows what he is doing [competence] and always welcomes me so perfectly, [friendliness] he takes a lot of time [time taken] for anyone who is visiting and answers every question [explanation], I totally recommend [overall/recommendation] him to anyone.”
This example is an illustration of a common review sentence. The aspect phrases are printed in bold, followed by their classes. The authors usually use colloquial language and often misspell names of diseases or treatments, in part referring to the same aspect more than once while using different phrases and implications, but often not just nouns. This is different than in other studies [59, 76]. Furthermore, our dataset is larger than [59] (e.g., English laptop ratings: 3,308 sentences; Dutch restaurant reviews: 2,286 sentences). What is more, they annotate just one of potentially several mentions of an aspect in a sentence. Wojatzki et al. [76] have a slightly larger dataset and not all sentences in their dataset contain aspect phrases. Although their dataset includes 2,000 sentences more than our first dataset (fkza), we have two datasets and thus more data. 9 The
scheme was as follows: sentence 1, classes predicted: [1]; sentence 2, classes predicted: [2, 5]; ...; sentence 7, classes predicted: [1]; etc.
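The chapter does not include code for this pre-selection step. The following is a minimal, hypothetical tf.keras sketch of a CNN combined with a bidirectional LSTM for multi-label sentence classification as described above; vocabulary size, layer sizes, and sequence handling are assumptions, not the authors' actual configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6      # the six aspect classes of the second set (bavkbeg)
VOCAB_SIZE = 50000   # assumption

# Input: integer-encoded, padded review sentences.
model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 300),
    layers.Conv1D(128, 5, activation="relu"),        # CNN feature extraction
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.LSTM(64)),           # biLSTM over the convolved features
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="sigmoid")  # multi-output: one probability per class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Sentences whose predicted probability for at least one class exceeds a threshold (50%, later raised to, e.g., 70%) would then be written to the annotation input file in the alternating fashion described in the footnote above.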
Fig. 2 Number of tokens per sentence with aspect phrases for the second dataset (bavkbeg)
The way how users describe their consultations in reviews makes the annotation task a difficult endeavor. The inter-annotator agreement was calculated on the basis of tagging: i.e., each word received a tag that marked it as part of an aspect phrase (including the corresponding class) or as part of a non-relevant phrase (None-class, see Sect. 4). For the first rating class set (fkza), the approach was conducted as follows. From the data that were annotated by the main annotator, we selected roughly 3%: i.e., 337 sentences. The other two persons re-annotated the data from scratch. Looking into previously annotated datasets was allowed. Table 3 demonstrates our evaluation results. Cohen’s Kappa [14] was calculated for each of the two annotators. We used the well-tested Scikit-learn software library for this [58], achieving a substantial agreement among annotators with scores of 0.722 up to 0.857. All values between 0.61 and 0.80 can be considered substantial agreement; values above this can be regarded as almost perfect [42]. In addition, we calculated Krippendorff’s Alpha [40] by using the Natural Language Toolkit (NLTK) [7] because it allows one to consider all annotators at once, which we did. The score of 0.771 can be regarded as good; 1.0 would be the best. Alpha provides features such as calculating for many annotators at once (not only two) and can also handle missing data and any number of classes [40].
Table 3 Inter-annotator agreement for three annotators and two datasets [35, 36]

                                      First set (fkza)         Second set (bavkbeg)
Annotators                            R&B    R&J    B&J        R&B    R&J    B&J
Cohen’s Kappa                         0.722  0.857  0.730      0.731  0.719  0.710
Krippendorff’s alpha (for all 3)             0.771                    0.720
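Both agreement measures can be computed with the libraries named above. The following is a small sketch with toy tag sequences (not the actual annotations); annotator names and tags are placeholders.

from sklearn.metrics import cohen_kappa_score
from nltk.metrics.agreement import AnnotationTask

# Token-level tags assigned by two annotators to the same tokens (toy data).
tags_r = ["O", "I-friendliness", "I-friendliness", "O", "I-competence"]
tags_b = ["O", "I-friendliness", "O", "O", "I-competence"]

# Pairwise Cohen's Kappa, as reported per annotator pair in Table 3.
print(cohen_kappa_score(tags_r, tags_b))

# Krippendorff's Alpha over all annotators at once via NLTK:
# AnnotationTask expects (coder, item, label) triples.
tags_j = ["O", "I-friendliness", "I-friendliness", "O", "O"]
triples = [(coder, i, tag)
           for coder, tags in [("R", tags_r), ("B", tags_b), ("J", tags_j)]
           for i, tag in enumerate(tags)]
print(AnnotationTask(data=triples).alpha())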
The inter-annotator agreement for the second dataset was calculated in the same manner as for the first one. As we calculated it before having finished the annotation, 3% corresponds to 358 sentences of the data. If we sensed that the task had been largely misunderstood, the annotator was allowed to start over from the beginning (which happened once). As can be seen in Table 3, the scores were sufficient after starting again. The agreement can be considered substantial although there is no “almost perfect” agreement between two annotators. Krippendorff’s Alpha has a value of 0.720, which is a good score. The Kappa scores all range slightly above 0.70.
4 Neural Networks for Aspect Phrase Extraction and Classification

In our solution for aspect phrase extraction and classification, we ultimately used our annotated dataset to perform supervised machine learning. Along the way, we tried several paths that all failed to work out, even though we searched the literature to find the best system. Liu [44], for example, presents four approaches to extract aspects: (1) using frequent nouns (and noun phrases), (2) utilizing opinion and target relations, (3) applying supervised learning and (4) making use of topic modeling [44]. We attempted all four approaches, none of which provided sufficient results, except for supervised learning. Furthermore, Sect. 2 supports our conclusion. In our tests, topic modeling found “topics” that were not sharply separated and were not understandable to a human: i.e., humans would have drawn a topic completely differently. Frequent nouns are not sufficient either, as mentioned above. The detection rate was extremely low, and the extraction of relations led to no results. Some approaches, like the one of [57] from 2013, are outdated as they fail to use state-of-the-art pretraining to enrich their algorithms with adequate text representations such as word embeddings [9, 19, 51]. For instance, to find candidate phrases, we used spaCy [25] for dependency parsing and the results of [38] for constituency parsing. Moreover, we constructed machine learning algorithms for IOB tagging, of which our final approach was the best. Evaluation results will be presented in Sect. 5. The literature indicates that IOB tagging is usually adequate [16]. For our domain, this does not seem appropriate: We have long phrases with differing start and end words, including punctuation, due to the user-generated nature. Hence, the aspect phrases are not as predictable as named entities, such as: “Mrs. Jeanette Muller” and “Jeanette Muller”, in comparison to our data (in German) “Dr. Meyer hat sich einige Zeit genommen.” (translated: “Dr. Meyer took some time.”). This can also be compared to cases such as: “Dr. Meyer nimmt sich für seine Patienten viel Zeit.” (translated: “Dr. Meyer takes a lot of time for his patients.”). In the German phrase, “for his patients” also has to be annotated because it is located in the middle of a phrase [35, 36]. IOB tagging seems to be an imperfect fit for challenging cases like ours. However, “I” is sufficient when the start of a phrase receives less importance: i.e.,
Fig. 3 BiLSTM-CRF model architecture [35, 36]
when the Beginning (“B”) tag is left out. Hence, “I” and “O” are sufficient. In the end, every word is marked according to whether it belongs to an aspect phrase of a specific class or not. As experiments showed, using just “I” and “O” tags is sufficient and superior to using IOB tags. For labeling sequential data such as text, studies like [71] propose an architecture composed of a Conditional Random Field (CRF) and a bidirectional Recurrent Neural Network (RNN) for feature extraction. We also tried to use additional features such as named entities, token lemmas, etc. But this approach did not work as well, due to the use of user-generated content, which has various mistakes, while nouns are of limited importance to us. Furthermore, tests with Part-of-Speech (PoS) tags did not improve our results. Figure 3 shows the architecture of our system. As Fig. 3 demonstrates, a bidirectional Long Short-Term Memory (biLSTM) [31] is used for feature extraction. This means that features are extracted in both directions from the text data and thus words before and after the current one are included in the calculations. A time-distributed dense layer then aligns the features and a CRF finds the optimal set of tags for the words in a sentence. Using a biLSTM and CRF is state-of-the-art [1]. In between, there are “BatchNormalization” layers in order to keep the activation values in a normalized range. We also use dropout layers10 that regularize the data and thus hinder overfitting due to a limited dataset size (manual annotations require much human work). As input, we use sentences with vectorized tokens. For supervised learning, the tags resemble the following format: “I-friendliness” or “O” for a non-relevant word. We trained our system to find aspect phrases and classes together, in the same step. We found that it is crucial to have pretrained vectors that embed common knowledge about words and subword units, so we trained vectors on all available reviews, not only the annotated ones. Further measures for embedding training focus on incorrectly split and spelled words by using only lowercase. Even though we have a comparatively limited amount of data, vectors with 300 dimensions worked the best. This dimensionality led to no overfitting and increased the recall values.

10 SpatialDropout was used only for the first dataset (fkza) and the biLSTM-CRF with FastText embeddings, while we relied on a regular dropout layer for the other cases. While SpatialDropout performed slightly better in this case, the overall effect was marginal and we thus trusted the normal Dropout, which is more suitable for sequence processing in general. The difference between Dropout and SpatialDropout is that the latter drops whole feature maps instead of single elements [72].
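The chapter publishes no implementation code. The following is a minimal, hypothetical tf.keras sketch of the layer stack shown in Fig. 3, using the unit size (around 30) and dropout (0.3) reported below and a plain softmax output in place of the final CRF layer (CRF layers are available in add-on packages but are omitted here to keep the sketch self-contained). The vocabulary size and tag inventory are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

# Token-level supervision: one tag per (lowercased) token, e.g.
# tokens: ["dr.", "meyer", "nimmt", "sich", "viel", "zeit", "."]
# tags:   ["O", "O", "I-time_taken", "I-time_taken", "I-time_taken", "I-time_taken", "O"]
NUM_TAGS = 5        # four fkza aspect classes plus "O" (assumption)
VOCAB_SIZE = 50000  # assumption
EMB_DIM = 300       # dimensionality of the self-trained FastText vectors

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM),  # pretrained vectors would be loaded here
    layers.Bidirectional(layers.LSTM(30, return_sequences=True)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

In the published model, the softmax output is replaced by a CRF that decodes the globally best tag sequence for each sentence.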
Fig. 4 BiLSTM-attention model architecture
This was especially the case when comparing this dimensionality to one of 25. We trained our own vectors using FastText [9]. When using other vectors apart from FastText, we omitted the trainable embedding layer and fed the vectors directly into the model. However, as our user-generated data contained several mistakes, lowercasing the data and using the subword information, such as character n-grams, helped achieve better representations. The Skipgram algorithm [9] learns vector representations for words that can predict their surrounding words. We allocated enough time for parameter tuning and testing a multitude of architecture configurations including CNNs, more RNN layers and other types of RNNs, with and without CRFs. The best results for the biLSTM-CRF model were reached with rather small values such as 0.3 for dropout, a unit size of around 30 for the biLSTM, and a small epoch and batch size. RMSprop also showed satisfactory results, and we also tested current solutions such as BERT [19] (in its German, cased version by Deepset [17]). As BERT uses “wordpiece” tokens and not words, we use the weighted sum of the last hidden state [26]. The embedding layer in Fig. 3 contains all the vectors. Another model architecture we tested performed well enough to be reported here (see Fig. 4). It consisted of a biLSTM and an attention layer; attention layers are usually used together with recurrent neural networks [73] and calculate a weight (or attention) for data such as words [82]. Dependencies in sequence data can therefore be modeled without respect to the distance. We use multiplicative (dot-product) self-attention for sequence processing [69, 73]. However, using fewer layers and only a small portion of dropout worked here. To achieve a more thoroughly tested system, we also used different word embeddings. That is, we tested BERT vectors, as mentioned above. These were pretrained for German text, including capital letters (768 dimensions) [17]. As this model is well established, we fine-tuned it on the physician reviews, inspired by [4], thereby reaching a perplexity score of 3.78 and a loss of about 1.37. We used both language models to calculate the contextual input vectors (for results, see Table 4). We also used FLAIR embeddings [1] (also with uppercase letters) due to their contextual functionality, good performance results and the usage of characters, which makes them well suited for user-generated content. We also used BERT with uppercase letters because we wanted to benefit from the pretrained model available for the German language [17]. For FLAIR embeddings, we kept this configuration. We used a dimensionality of 768 to compare it to the BERT embeddings and trained RoBERTa embeddings [47] ourselves from scratch, using the physician reviews, but did not achieve usable results: i.e., the perplexity score of those embeddings remained high above 900.
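As an illustration of the embedding training described above, the following hedged sketch uses the fasttext package with the reported settings (skipgram, 300 dimensions, lowercased input with subword information); the file names and the character n-gram range are assumptions, not the authors' actual setup.

import fasttext

# One lowercased review sentence per line; the file name is a placeholder.
model = fasttext.train_unsupervised(
    "physician_reviews_lowercased.txt",
    model="skipgram",   # skipgram, as reported above
    dim=300,            # 300-dimensional vectors
    minn=3, maxn=6,     # character n-gram range (fastText defaults; an assumption here)
)
model.save_model("review_vectors.bin")
print(model.get_word_vector("freundlich")[:5])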
5 Evaluation and Discussion

Tables 4, 5, 6, and 7 present our evaluation results such as precision, recall and F1 score per label as well as accuracy and an average per measure. We evaluated the two datasets each with a biLSTM-CRF and a biLSTM-Attention network using different word embeddings such as self-trained FastText, pretrained BERT, self-fine-tuned BERT and self-trained FLAIR embeddings. While our accuracy is always around 0.95 and thus high, we regard the F1 scores as more important. For example, in Table 4, we have an average F1 value of 0.80, which is not weighted. This score can be regarded as good. These are better scores in comparison to [59] or [76]. Their domain uses a less complex wording and nouns usually describe most of the aspects. Furthermore, their models barely achieve a score of 0.50 and they perform aspect phrase extraction and classification in two steps instead of one, which propagates errors forward. We separated the classes with the “physician” aspect target into two datasets and thus trained models that recognize all the classes present in a dataset. However, there are also other approaches [71] that use trained models for each class. This may result in overlapping aspect phrase borders, with some words being labeled more than once for several classes. Our approach also achieves better results in bare numbers. However, when taking a closer look at our results in Table 4, our precision scores are generally better than our recall scores. This is also consistent with Tables 5, 6, and 7. This may be caused by a limited amount of training data. As overfitting was an issue during model building and training, we therefore aimed at improving the recall results: Our model is relevant only in its practical application when it generalizes well to new data. Apart from our self-trained FastText vectors, we used BERT vectors, self-fine-tuned BERT vectors and self-trained FLAIR vectors. We expected generally and notably better results when using more sophisticated word
Table 4 Evaluation results of the biLSTM-CRF model for the first dataset (fkza) with different embeddings (except for BERT, the vectors were self-trained) [35, 36]

Vectors          FastText            BERT                own BERT            FLAIR
Measures         P     R     F1      P     R     F1      P     R     F1      P     R     F1
I-explanation    0.81  0.71  0.76    0.73  0.67  0.70    0.75  0.65  0.69    0.82  0.44  0.57
I-friendliness   0.75  0.74  0.75    0.75  0.69  0.72    0.67  0.77  0.72    0.65  0.75  0.70
I-competence     0.68  0.67  0.67    0.69  0.65  0.67    0.84  0.53  0.65    0.94  0.38  0.54
I-time           0.85  0.80  0.82    0.87  0.77  0.82    0.85  0.80  0.82    0.82  0.78  0.80
O                0.97  0.98  0.97    0.97  0.98  0.97    0.96  0.98  0.97    0.95  0.99  0.97
Accuracy         0.95                0.94                0.94                0.94
Average          0.81  0.78  0.80    0.80  0.75  0.78    0.81  0.75  0.77    0.84  0.67  0.72
Table 5 Evaluation results of the biLSTM-attention model for the first dataset (fkza) with different embeddings (except for BERT, the vectors were self-trained)

Vectors          FastText            BERT                own BERT            FLAIR
Measures         P     R     F1      P     R     F1      P     R     F1      P     R     F1
I-explanation    0.80  0.60  0.69    0.71  0.64  0.67    0.73  0.63  0.68    0.74  0.61  0.67
I-friendliness   0.80  0.67  0.73    0.67  0.76  0.71    0.74  0.76  0.75    0.81  0.59  0.68
I-competence     0.77  0.58  0.66    0.70  0.63  0.66    0.77  0.61  0.68    0.73  0.55  0.62
I-time_taken     0.88  0.78  0.83    0.74  0.80  0.77    0.85  0.78  0.81    0.91  0.80  0.85
O                0.96  0.98  0.97    0.97  0.97  0.97    0.97  0.98  0.97    0.96  0.98  0.97
Accuracy         0.95                0.94                0.95                0.94
Average          0.84  0.72  0.78    0.76  0.76  0.76    0.81  0.75  0.78    0.83  0.71  0.76
Table 6 Evaluation results of the biLSTM-CRF model for the second dataset (bavkbeg) with different vector representations (except for BERT, the vectors were self-trained)

Vectors                FastText            BERT                own BERT            FLAIR
Measures               P     R     F1      P     R     F1      P     R     F1      P     R     F1
I-altern._healing_m.   0.78  0.71  0.75    0.65  0.71  0.68    0.80  0.60  0.68    0.85  0.49  0.62
I-treatment            0.71  0.71  0.71    0.70  0.58  0.63    0.84  0.63  0.72    0.75  0.52  0.61
I-care/commitment      0.73  0.62  0.67    0.77  0.50  0.61    0.82  0.65  0.73    0.79  0.32  0.45
I-overall/recom.       0.80  0.64  0.71    0.67  0.71  0.69    0.80  0.67  0.73    0.89  0.54  0.67
I-child-friendliness   0.76  0.81  0.79    0.77  0.81  0.79    0.75  0.81  0.78    0.64  0.75  0.69
I-relation._of_trust   0.83  0.79  0.81    0.83  0.80  0.81    0.83  0.80  0.81    0.95  0.59  0.73
O                      0.96  0.97  0.97    0.96  0.97  0.96    0.96  0.98  0.97    0.94  0.99  0.97
Accuracy               0.94                0.94                0.95                0.93
Average                0.80  0.75  0.77    0.76  0.73  0.74    0.83  0.73  0.78    0.83  0.60  0.68
vector calculation technologies such as BERT because these consider context, are state-of-the-art, and—in the case of BERT—were pretrained on larger quantities of data. Still, our comparatively simple FastText embeddings achieved better recall and overall scores than in Table 4: The recall scores of 0.67 to 0.80 (and 0.98 for label “O”) are favorable, especially when also reasoning about the F1 scores of 0.76, 0.75, 0.67, 0.82 and 0.97. In the domain and data that were used, these values are satisfying. We can explain the accuracy value of 0.95 with the high appearance of the label “O”, as this boosts the accuracy score in general. We reduced the overweight of “O” labels by training our models only on sentences that contain an aspect phrase. It is crucial to have scores for precision and recall that are not too distinct. This is why when choosing the best model for the first dataset (fkza, Table 4 and 5), we regard the biLSTM-CRF model as superior to the biLSTM-Attention model due to the superior scores and because we regard FastText embeddings superior among the various word embedding configurations: The scores are higher yet not too distinct from each other.
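The per-label precision, recall, and F1 values as well as the unweighted averages reported in Tables 4, 5, 6, and 7 can be reproduced from the gold and predicted token tags with standard tooling; the following sketch uses toy tags rather than the actual test data.

from sklearn.metrics import accuracy_score, classification_report

# Gold and predicted tags for all test tokens, flattened (toy example).
y_true = ["O", "I-friendliness", "I-friendliness", "O", "I-time_taken", "O"]
y_pred = ["O", "I-friendliness", "O", "O", "I-time_taken", "O"]

print(accuracy_score(y_true, y_pred))
# Per-label precision/recall/F1 plus a macro (unweighted) average,
# mirroring the "Average" rows of the tables.
print(classification_report(y_true, y_pred, digits=2))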
Table 7 Evaluation results of the biLSTM-attention model for the second dataset (bavkbeg) with different vector representations (except for BERT, the vectors were self-trained)

Vectors                FastText            BERT                own BERT            FLAIR
Measures               P     R     F1      P     R     F1      P     R     F1      P     R     F1
I-altern._healing_m.   0.78  0.69  0.73    0.70  0.67  0.68    0.73  0.61  0.66    0.93  0.54  0.68
I-treatment            0.71  0.69  0.70    0.84  0.63  0.72    0.82  0.65  0.72    0.85  0.57  0.68
I-care/commitment      0.63  0.69  0.66    0.71  0.54  0.62    0.79  0.56  0.65    0.66  0.58  0.61
I-overall/recom.       0.72  0.73  0.73    0.68  0.75  0.71    0.85  0.62  0.72    0.82  0.67  0.74
I-child-friendliness   0.77  0.75  0.76    0.79  0.77  0.78    0.80  0.84  0.82    0.79  0.72  0.76
I-relation._of_trust   0.77  0.80  0.79    0.80  0.79  0.80    0.86  0.76  0.81    0.85  0.79  0.81
O                      0.97  0.97  0.97    0.96  0.97  0.97    0.96  0.98  0.97    0.96  0.98  0.97
Accuracy               0.94                0.94                0.95                0.94
Average                0.76  0.76  0.76    0.78  0.73  0.75    0.83  0.72  0.76    0.84  0.69  0.75
For example, as Table 4 reveals, BERT embeddings [19] enable our model to achieve an F1 score of 0.67 for competence, the same as for our embeddings. While on average the precision scores are 0.80 compared to 0.81, the recall is lower with 0.75 to 0.78, so we prefer our model, which we believe also reflects our user-generated data better. BERT embeddings are remarkable, though, because they are not trained on our domain and it is helpful to have embeddings calculated for every word in relation to its context words [19]. Yet, we must note that the fine-tuned embeddings actually perform slightly worse than embeddings that are not fine-tuned. We do not have an explanation for this, as we trained the embeddings for over 24 hours and saw a learning curve, evident in a decreasing loss. Regarding the second dataset (bavkbeg) and the evaluation scores in Tables 6 and 7, we can also conclude that the biLSTM-CRF model using FastText vectors performs best. While the FastText vectors obtain an average F1 score of 0.77, the self-fine-tuned BERT embeddings achieve an F1 score of 0.78. However, FastText vectors achieve a smaller discrepancy of 0.05, compared to 0.10, in relation to precision and recall. The evaluation results in Table 7 are almost as good. Here, the FastText vectors enable an F1 score of 0.76, which is better balanced because precision and recall also have a value of 0.76. Still, the biLSTM-CRF model performs slightly better. The attention model is simpler, though, as Figs. 3 and 4 demonstrate. In this section, we have shown a number of approaches (along with evaluation scores) for automatically processing our annotated dataset and building a generalizing model that learns from the data and can process new data. However, it is not possible to conduct a direct comparison to other models and datasets such as those of [59] or [76]. But a comparison to the values presented in studies dealing with shared tasks indicates the superiority of our approach. IO tagging with a biLSTM-CRF model improved our evaluation scores, but the self-trained word vectors were nevertheless important in achieving the final values shown in Tables 4, 5, 6, and 7. Numbers may not show the full picture, so we propose a manual evaluation as an additional step.
We therefore wrote sentences that we regard as fringe cases and as cases that can be hard to tag. The results of our algorithm on these sentences were positive. In addition to these measures, we also annotated a dataset with higher inter-annotator agreement scores than Wojatzki et al. [76]. Our Cohen's Kappa scores are comparable to those of [76], even though they do not clearly report scores for the aspect spans, and we used fewer human annotators. The inter-annotator agreement for aspects is 0.79–1.0. Pontiki et al. [59] use the F1 score for measuring annotator agreement; however, this score is difficult to compare.
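Cohen's Kappa [14] measures chance-corrected agreement between two annotators. As a small illustration with toy labels (not the real annotation data), it can be computed with scikit-learn [58]:

# Toy illustration: Cohen's Kappa [14] for two annotators' aspect labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["friendliness", "none", "competence", "none",        "time_taken"]
annotator_b = ["friendliness", "none", "competence", "explanation", "time_taken"]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 would mean perfect agreement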
6 Conclusion

We presented two new datasets and several algorithms for ABSA, first introducing the related literature and characterizing numerous studies that tackle aspect extraction, classification and sentiment analysis in general. We then presented the data domain of physician reviews, where rating aspects are often mentioned only implicitly or merely indicated. Although we presented convincing algorithms and evaluation scores, there is still much work to be done before ABSA can be expanded from rather common review domains with a known and clearer vocabulary to fields that are more complicated and have a real-world use. This study also provides a detailed presentation of our data, describing the rating classes available on PRWs rather than providing only general information. We then described the definition process for aspects and showed the classes to be annotated. We annotated four aspect classes for the first dataset (fkza): friendliness, competence, time taken and explanation, which are all equally related to physicians performing health-care services. The second dataset (bavkbeg) consists of six classes, of which all but one apply to the physician: treatment, alternative healing methods, relationship of trust, child-friendliness, care/commitment and overall/recommendation (a general class that does not involve the physician directly). The first dataset has 11,237 annotated sentences, the second 15,467. We plan to expand the process to other classes and opinion targets such as the doctor's office and the team until we have a holistic view of the entire domain of physician reviews and PRWs. To illustrate our findings, we provided example phrases and sentences, included comparisons to other datasets, and gave details regarding the differentiation of classes. The inter-annotator agreement we calculated achieves favorable scores. Our approach for extracting and classifying aspect phrases uses a biLSTM-CRF model. We also tested a biLSTM-Attention model and various word embedding configurations, including transformers and subword information. To provide information for other scholars, we discussed the difficulties that arise when machine learning systems perform aspect extraction. In comparison to the work of other scholars, our approach conducts two steps in one: extracting and classifying the aspect phrases (their words). Even if the comparison is difficult to undertake, our datasets and models seem to outperform other approaches such as those of Pontiki et
al. [59]. This is impressive, because we consider our domain complex since it uses the morphologically complex German language. For the future, we do not only plan to build more annotated datasets, but we also want to include opinion extraction: A possible method would be to build on existing knowledge such as integrating a second stack of neural network layers that learn from other annotated datasets. We could also annotate more sentences in our dataset in relation to their general sentiment. During the initial trial runs, it became evident that most sentences can be considered either positive or negative. A more fine-grained scale is not possible, because the wording does not exhibit a fine-grained gradation. For instance, the phrase “He was very/really/quite friendly.” will express the same information no matter which of the three adverbs is chosen, whether the adverbs are left out or whether two adverbs are applied. In theory, it would be possible to find guidelines that make a distinction possible, but this would be very subjective and not easy for human annotators to agree on. Moreover, such rules would be difficult to remember and reaching an agreement is challenging because of phrases such as: “He was very friendly and competent.” The moderating word “very” cannot be annotated twice with the guidelines and tool we use. So if this were the decisive factor between a positive and very positive phrase, we would have to rely on the fact that a neural network can consider the context accordingly. In this case, the word “competent” would be labeled as very positive. But in other cases where this word occurs again, but alone, it would be labeled as positive, not as very positive. Acknowledgements This study is an invited, extended work based on [35]. Another related study is [36], which was written and submitted during the same period as [35]. This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Centre On-The-Fly Computing (SFB 901). We thank Rieke Roxanne Mülfarth, Frederik Simon Bäumer and Marvin Cordes for their support with the data collection.
References 1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638– 1649. ACL, Santa Fe, NM, USA (2018). https://www.aclweb.org/anthology/C18-1139 2. Apotheke-Adhoc: Von Jameda zur Konkurrenz geschickt. [sent by Jameda to the competitors]. https://www.apotheke-adhoc.de/nachrichten/detail/apothekenpraxis/von-jameda-zurkonkurrenz-geschickt-bewertungsportale/ (2018). Accessed 28 Oct 2019 3. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: Proceedings of the 7th LREC, vol. 10, pp. 2200– 2204. ELRA (2010) 4. Beltagy, I., Lo, K., Cohan, A.: SCIBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 3615–3620. ACL (2019) 5. Bidmon, S., Elshiewy, O., Terlutter, R., Boztug, Y.: What patients value in physicians: Analyzing drivers of patient satisfaction using physician-rating website data. J. Med. Internet Res. 22(2), e13830 (2020). https://doi.org/10.2196/13830
6. Bidmon, S., Elshiewy, O., Terlutter, R., Boztug Y.: What patients really value in physicians and what they take for granted: an analysis of large-scale data from a physician-rating website. J. Med. Internet Res. 22(2), e13830 (2019). https://doi.org/10.2196/13830 7. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python, 1st edn. O’Reilly Media, Sebastopol (2009) 8. Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G.A., Reynar, J.: Building a sentiment summarizer for local service reviews. In: Proceedings of the WWW Workshop on NLP Challenges in the Information Explosion Era, vol. 14, pp. 339–348. ACM (2008) 9. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. ACL 5, 135–146 (2017) 10. Bäumer, F.S., Grote, N., Kersting, J., Geierhos, M.: Privacy matters: detecting nocuous patient data exposure in online physician reviews. In: Proceedings of the 23rd International Conference on Information and Software Technologies, vol. 756, pp. 77–89. Springer (2017). https://doi. org/10.1007/978-3-319-67642-5_7 11. Bäumer, F.S., Kersting, J., Kuršelis, V., Geierhos, M.: Rate your physician: findings from a Lithuanian physician rating website. In: Proceedings of the 24th International Conference on Information and Software Technologies, Communications in Computer and Information Science, vol. 920, pp. 43–58. Springer (2018). https://doi.org/10.1007/978-3-319-99972-2_4 12. Chinsha, T.C., Shibily, J.: A syntactic approach for aspect based opinion mining. In: Proceedings of the 9th IEEE International Conference on Semantic Computing, pp. 24–31. IEEE (2015). https://doi.org/10.1109/icosc.2015.7050774 13. Cieliebak, M., Deriu, J.M., Egger, D., Uzdilli, F.: A Twitter corpus and benchmark resources for German sentiment analysis. In: Proceedings of the 5th International Workshop on Natural Language Processing for Social Media, pp. 45–51. ACL (2017). https://doi.org/10.18653/v1/ W17-1106 14. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960) 15. Cordes, M.: Wie bewerten die anderen? Eine übergreifende Analyse von Arztbewertungsportalen in Europa. [What do the others think? An overarching analysis of doctor rating portals in Europe]. Master’s thesis, Paderborn University (2018) 16. De Clercq, O., Lefever, E., Jacobs, G., Carpels, T., Hoste, V.: Towards an integrated pipeline for aspect-based sentiment analysis in various domains. In: Proceedings of the 8th ACL Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 136– 142. ACL (2017). https://doi.org/10.18653/v1/w17-5218 17. deepset: deepset – open sourcing German BERT (2019). https://deepset.ai/german-bert. Accessed 28 Nov 2019 18. Deng, L., Wiebe, J.: MPQA 3.0: an entity/event-level sentiment corpus. In: Proceedings of the 2015 Conference of the North American Chapter of the ACL: Human Language Technologies, pp. 1323–1328. ACL (2015) 19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint (2018) 20. Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., Xu, K.: Adaptive recursive neural network for target-dependent twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the ACL, pp. 49–54. ACL (2014) 21. Ellimoottil, C., Leichtle, S.W., Wright, C.J., Fakhro, A., Arrington, A.K., Chirichella, T.J., Ward, W.H.: Online physician reviews: the good, the bad and the ugly. Bull. 
Am. Coll. Surg. 98(9), 34–39 (2013) 22. Emmert, M., Meier, F., Pisch, F., Sander, U.: Physician choice making and characteristics associated with using physician-rating websites: cross-sectional study. J. Med. Internet Res. 15(8), e187 (2013) 23. Emmert, M., Sander, U., Esslinger, A.S., Maryschok, M., Schöffski, O.: Public reporting in Germany: the content of physician rating websites. Methods Inf. Med. 51(2), 112–120 (2012) 24. Emmert, M., Sander, U., Pisch, F.: Eight questions about physician-rating websites: a systematic review. J. Med. Internet Res. 15(2), e24 (2013). https://doi.org/10.2196/jmir.2360
25. ExplosionAI: Spacy (2019). https://spacy.io/. Accessed 06 Nov 2019 26. ExplosionAI: GitHub - explosion/spacy-transformers/ – spaCy pipelines for pre-trained BERT, XLNet and GPT-2 (2020). https://github.com/explosion/spacy-transformers. Accessed 20 May 2020 27. Gao, G.G., McCullough, J.S., Agarwal, R., Jha, A.K.: A changing landscape of physician quality reporting: Analysis of patients’ online ratings of their physicians over a 5-year period. J. Med. Internet Res. 14(1), e38 (2012). https://doi.org/10.2196/jmir.2003 28. Garcia-Pablos, A., Cuadros, M., Rigau, G.: W2VLDA: almost unsupervised system for aspect based sentiment analysis. Expert Syst. Appl. 91, 127–137 (2018). https://doi.org/10.1016/j. eswa.2017.08.049 29. Geierhos, M., Bäumer, F., Schulze, S., Stuß, V.: “I grade what I get but write what I think.” inconsistency analysis in patients’ reviews. In: ECIS 2015 Completed Research Papers. AIS (2015). https://doi.org/10.18151/7217324 30. Hao, H., Zhang, K.: The voice of Chinese health consumers: A text mining approach to webbased physician reviews. J. Med. Internet Res. 18(5), e108 (2016). https://doi.org/10.2196/ jmir.4430 31. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735 32. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM (2004) 33. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: Proceedings of the 19th National Conference on Artificial Intelligence, pp. 755–760. AAAI (2004) 34. Kersting, J., Bäumer, F., Geierhos, M.: In reviews we trust: But should we? experiences with physician review websites. In: Proceedings of the 4th International Conference on Internet of Things, Big Data and Security, pp. 147–155. SCITEPRESS (2019). https://doi.org/10.5220/ 0007745401470155 35. Kersting, J., Geierhos, M.: Aspect phrase extraction in sentiment analysis with deep learning. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence: Special Session on Natural Language Processing in Artificial Intelligence, pp. 391–400. SCITEPRESS (2020) 36. Kersting, J., Geierhos, M.: Neural learning for aspect phrase extraction and classification in sentiment analysis. In: Proceedings of the 33rd International Florida Artificial Intelligence Research Symposium (FLAIRS) Conference. AAAI (2020) 37. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751. ACL (2014) 38. Kitaev, N., Klein, D.: Constituency parsing with a self-attentive encoder. In: Proceedings of the 56th Annual Meeting of the ACL, vol. 1, pp. 2676–2686. ACL (2018) 39. Kordzadeh, N.: Investigating bias in the online physician reviews published on healthcare organizations’ websites. Decis. Support Syst. 118, 70–82 (2019). https://doi.org/10.1016/j. dss.2018.12.007 40. Krippendorff, K.: Computing Krippendorff’s alpha-reliability. Technical report 1-25-2011, University of Pennsylvania (2011). https://repository.upenn.edu/asc_papers/43 41. Lagu, T., Norton, C.M., Russo, L.M., Priya, A., Goff, S.L., Lindenauer, P.K.: Reporting of patient experience data on health systems’ websites and commercial physician-rating websites: mixed-methods analysis. J. Med. Internet Res. 21(3), e12007 (2019). https://doi.org/10.2196/ 12007 42. 
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977). https://doi.org/10.2307/2529310 43. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 44. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5(1), 1–167 (2012)
45. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal, C.C., Zhai, C.X. (eds.) Mining Text Data, pp. 415–463. Springer, Berlin (2012) 46. Liu, J., Hou, S., Evans, R., Xia, C., Xia, W., Ma, J.: What do patients complain about online: a systematic review and taxonomy framework based on patient centeredness. JMIR 21(8), e14634 (2019). https://doi.org/10.2196/14634 47. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR p, o. S. (2019) 48. Löpez, A., Detz, A., Ratanawongsa, N., Sarkar, U.: What patients say about their doctors online: a qualitative content analysis. J. Gen. Intern. Med. 27(6), 685–692 (2012). https://doi.org/10. 1007/s11606-011-1958-4 49. McAuley, J., Leskovec, J., Jurafsky, D.: Learning attitudes and attributes from multi-aspect reviews. In: Proceedings of the 12th IEEE International Conference on Data Mining, pp. 1020– 1025. IEEE (2012). http://arxiv.org/pdf/1210.3926v2 50. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014). https://doi.org/10.1016/j.asej.2014.04.011 51. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR pp. 1–12 (2013) 52. Mitchell, M., Aguilar, J., Wilson, T., Van Durme, B.: Open domain targeted sentiment. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1643–1654. ACL (2013) 53. Mukherjee, A., Liu, B.: Aspect extraction through semi-supervised modeling. In: Proceedings of the 50th Annual Meeting of the ACL, vol. 1, pp. 339–348. ACL (2012) 54. Murphy, G.P., Radadia, K.D., Breyer, B.N.: Online physician reviews: is there a place for them. Risk Manag. Healthc. Policy 12, 85–89 (2020) 55. Nguyen, T.H., Shirai, K.: Phrasernn: Phrase recursive neural network for aspect-based sentiment analysis. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2509–2514. ACL (2015) 56. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2), 1–135 (2008). https://doi.org/10.1561/1500000001 57. Paul, M.J., Wallace, B.C., Dredze, M.: What affects patient (dis) satisfaction? analyzing online doctor ratings with a joint topic-sentiment model. In: Proceedings of the Workshops at the 27th AAAI Conference on Artificial Intelligence. AAAI (2013) 58. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 59. Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S., AL-Smadi, M., Al-Ayyoub, M., Zhao, Y., Qin, B., De Clercq, O., Hoste, V., Apidianaki, M., Tannier, X., Loukachevitch, N., Kotelnikov, E., Bel, N., Jiménez-Zafra, S.M., Eryi˘git, G.: SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th International Workshop on Semantic Evaluation, pp. 19–30. ACL (2016). http://www.aclweb.org/anthology/S16-1002 60. Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., Androutsopoulos, I.: SemEval-2015 task 12: aspect based sentiment analysis. In: Proceedings of the 9th International Workshop on Semantic Evaluation, pp. 486–495. ACL (2015). 
http://aclweb.org/anthology/S/S15/S152082.pdf 61. Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., Androutsopoulos, I.: SemEval 2016 task 5: aspect based sentiment analysis (ABSA-16) annotation guidelines (2016) 62. Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., Manandhar, S.: SemEval-2014 task 4: aspect based sentiment analysis. In: Proceedings of the 8th International Workshop on Semantic Evaluation, pp. 27–35. ACL (2014) 63. Qiu, G., Liu, B., Bu, J., Chen, C.: Opinion word expansion and target extraction through double propagation. Comput. Linguist. 37(1), 9–27 (2011). https://doi.org/10.1162/coli_a_00034
64. Remus, R., Quasthoff, U., Heyer, G.: SentiWS - A publicly available German-language resource for sentiment analysis. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the International Conference on Language Resources and Evaluation, pp. 1168–1171. ELRA (2010). http://www.lrec-conf. org/proceedings/lrec2010/summaries/490.html 65. Ruppenhofer, J., Klinger, R., Struß, J.M., Sonntag, J., Wiegand, M.: IGGSA shared tasks on German sentiment analysis (GESTALT). In: Proceedings of the 12th KONVENS, pp. 164–173 (2014). http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus-3196 66. Ruppenhofer, J., Struß, J.M., Wiegand, M.: Overview of the IGGSA 2016 shared task on source and target extraction from political speeches. In: Proceedings of the IGGSA 2016 Shared Task on Source and Target Extraction from Political Speeches, pp. 1–9. Ruhr Universität Bochum, Bochumer Linguistische Arbeitsberichte (2016) 67. Saeidi, M., Bouchard, G., Liakata, M., Riedel, S.: SentiHood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods. In: Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1546–1556. COLING/ACL (2016) 68. Sharma, R.D., Tripathi, S., Sahu, S.K., Mittal, S., Anand, A.: Predicting online doctor ratings from user reviews using convolutional neural networks. Int. J. Mach. Learn. Comput. 6(2), 149–154 (2016). https://doi.org/10.18178/ijmlc.2016.6.2.590 69. Shen, T., Zhou, T., Long, G., Jiang, J., Pan, S., Zhang, C.: Disan: Directional self-attention network for RNN/CNN-free language understanding. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI (2018) 70. Terlutter, R., Bidmon, S., Röttl, J.: Who uses physician-rating websites? Differences in sociodemographic variables, psychographic variables, and health status of users and nonusers of physician-rating websites. J. Med. Internet Res. 16(3), e97 (2014). https://doi.org/10.2196/ jmir.3145 71. Toh, Z., Su, J.: Nlangp at SemEval-2016 task 5: Improving aspect based sentiment analysis using neural network features. In: Proceedings of the 10th International Workshop on Semantic Evaluation, pp. 282–288. ACL (2016). https://doi.org/10.18653/v1/s16-1045 72. Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656. IEEE (2015). https://doi.org/10.1109/cvpr.2015.7298664 73. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 5998–6008. Curran Associates (2017) 74. Vinodhini, G., Chandrasekaran, R.: Sentiment analysis and opinion mining: a survey. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2(6), 282–292 (2012) 75. Wallace, B.C., Paul, M.J., Sarkar, U., Trikalinos, T.A., Dredze, M.: A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J. Am. Med. Inform. Assoc. 21(6), 1098–1103 (2014). https://doi.org/10.1136/amiajnl-2014-002711 76. Wojatzki, M., Ruppert, E., Holschneider, S., Zesch, T., Biemann, C.: GermEval 2017: shared task on aspect-based sentiment in social media customer feedback. In: Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Springer (2017) 77. 
Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 13(3), 55–75 (2018) 78. Zeithaml, V.: How consumer evaluation processes differ between goods and services. Mark. Serv. 9(1), 186–190 (1981) 79. Zeithaml, V.A., Parasuraman, A., Berry, L.L., Berry, L.L.: Delivering Quality Service: Balancing Customer Perceptions and Expectations. Free Press (1990). https://books.google.de/ books?id=RWPMYP7-sN8C 80. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: a survey. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 8(4), 1–25 (2018). https://doi.org/10.1002/widm.1253 81. Zhao, W.X., Jiang, J., Yan, H., Li, X.: Jointly modeling aspects and opinions with a maxentlda hybrid. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 56–65. ACL (2010)
82. Zhou, J., Huang, J.X., Chen, Q., Hu, Q.V., Wang, T., He, L.: Deep learning for aspect-level sentiment classification: survey, vision, and challenges. IEEE Access 7, 78454–78483 (2019). https://doi.org/10.1109/access.2019.2920075
Dialogical Argumentation and Textual Entailment Davide Catta, Richard Moot, and Christian Retoré
Abstract In this chapter, we introduce a new dialogical system for first order classical logic which is close to natural language argumentation, and we prove its completeness with respect to usual classical validity. We combine our dialogical system with the Grail syntactic and semantic parser developed by the second author in order to address automated textual entailment, that is, we use it for deciding whether or not a sentence is a consequence of a short text. This work—which connects natural language semantics and argumentation with dialogical logic—can be viewed as a step towards an inferentialist view of natural language semantics. Keywords Dialogical logic · Argumentation · Inferentialism
1 Presentation: Argumentation, Inference, Semantics

This work takes its inspiration from the observation that logical or natural language inferences should be related to inferentialism, i.e., a view of the semantics of a formula or of a sentence as the inferential possibilities of the statement in reasoning or argumentation. Let us first present inferentialism, which, although not new, is not that well-known in logic nor in natural language semantics.

Inferentialism

A problem with the standard view of both natural language semantics and logical interpretations of formulas is that the models or possible worlds in which a sentence is
true cannot be computed or even enumerated [33]. As far as pure logic is concerned there is an alternative view of meaning called inferentialism [5, 12, 14, 15, 36]. Although initially inferentialism was introduced within a constructivist view of logic [14], there is no necessary conceptual connection between accepting an inferentialist position and rejecting classical logic as explained in [12]. As its name suggests inferentialism replaces truth as the primary semantic notion by the inferential activity of an agent. According to this paradigm, the meaning of a sentence is viewed as the knowledge needed to understand the sentence. This view is clearly stated by Cozzo [12] A theory of meaning should be a theory of understanding. The meaning of an expression (word or sentence) or of an utterance is what a speaker-hearer must know (at least implicitly) about that expression, or that utterance, in order to understand it.
This requirement has some deep consequences: as speakers are only able to store a finite amount of data, the knowledge needed to understand the meaning of the language itself should also be finite or at least recursively enumerable from a finite set of data and rules. Consequently, an inferentialist cannot agree with the Montagovian view of the meaning of a proposition as the possible worlds in which the proposition is true [29]. In particular because there is no finite way to enumerate the infinity of possible worlds nor to finitely enumerate the infinity of individuals and of relations in a single of those possible worlds. Let us present an example of the knowledge needed to understand a word. If it is a referring word like “eat”, one should know what it refers to the action that someone eats something, possibly some postulates related to this word like the eater is animated,1 and how to compose it with other words eat is a transitive verb, a binary predicate; if it is a non referring word like “which” one should know that it combines with a sentence without subject (a unary predicate), and a noun (a unary predicate), and makes a conjunction of those two predicates. Observe that this knowledge is not required to be explicit for human communication. Most speakers would find it difficult to explicitly formulate these rules, especially the grammatical ones. However, this does not mean that they do not possess this knowledge. An important requirement for a theory of meaning is that the speaker’s knowledge can be observed, i.e., his knowledge can be observed in the interactions between the speaker(s), the hearer(s) and the environment. This requirement is supported by the famous argument against the private language of Wittgenstein [40]. This argument can be presented as follows. Imagine that two speakers have the same use of a sentence S in all possible circumstances. Assume that one of the two speakers includes as part of the meaning of S some ingredient that cannot be observed. This ingredient has to be ignored when defining the knowledge needed to master the meaning of S. Indeed, according to the inferentialist view, a misunderstanding that can neither be isolated nor observed should be precluded. 1 Our
system is able to deal with metaphoric use, like The cash machine ate my credit card; see, e.g., [37].
Another requirement for a theory of meaning is the distinction between sense and force, on which we shall be brief. Since Frege [19], philosophy of language has distinguished between the sense of a sentence and its force. The sense of a sentence is the propositional content conveyed by the sentence, while its force is its mood—this use of the word “mood” is more general than its linguistic use for verbs. Observe that the same propositional content can be asserted, refuted, hoped etc., as in the three following sentences: Is the window open? Open the window! The window is open. Here we focus on assertions and questions. Observe that this draws a connection between inferentialism and the study of dialogue in linguistics and philosophy. Indeed, the most common interaction between the speaker(s), the hearer(s) and the environment is a dialogue in natural language. Only argumentative dialogues are relevant from an inferentialist perspective, but they still are dialogues and as such the wide literature on dialogue may be helpful to deepen the connection developed in this paper, in particular [3, 10, 20, 24]. The force or mood of the interventions of the speaker and the hearer in argumentative dialogue are particular cases of functional roles in dialogues (see e.g., [38]), and this connection with different moods of dialogue interaction paves the way to a similar treatment of other dialogues which are not limited to argumentative interaction. An intermediate step between the argumentative dialogues we deal with here and the general dialogues we plan to study could be argumentative dialogues with non-standard inferences resulting from enthymemes or from psychological disorders, studied respectively in [6] and [4]. But as a first exploration of inferential semantics, let us focus on the applications of inferentialism to argumentative dialogues and in particular to textual entailment.

Inferentialism and Textual Entailment

We will illustrate the application of the inferentialist view to natural language semantics with a very natural task: the recognition of natural language inference, a task also known as textual entailment. In the current context, we use textual entailment in a more limited sense than it is generally used in natural language processing tasks. Textual entailment in natural language processing generally aims to obtain human-like performance on relating a text and a possible conclusion [13]. Natural language processing systems are evaluated on their ability to approach the performance of humans when deciding between entailment, contradiction and unknown (i.e., neither the entailment relation nor the contradiction relation holds between the text and the given candidate conclusion). We consider textual entailment from a purely logical point of view, taking entailment and contradiction in their strictly logical meanings. In our opinion, a minimal requirement for a textual entailment system should be that it can handle the well-known syllogisms of Aristotle, as well as a number of other patterns [34], with perfect accuracy.2

2 The patterns in the well-known corpus for testing textual entailment recognition called FraCaS [11] greatly vary in difficulty. We expect only some of them (monotonicity, syllogisms) to be handled easily, whereas we expect others (plurals, temporal inference for aspectual classes) to be much more difficult for systems based on automated theorem provers, or, indeed, for any automated system.
The computational correspondence between natural language sentences and logical formulas is obtained both theoretically and practically using type-logical grammars. In a sense, type-logical grammars are designed to produce logical meanings for grammatical sentences: they compute the possible meanings of a sentence, viewed as logical formulae. In particular, the Grail platform, a wide-coverage categorial parser which maps French sentences to logical formulas [30–32], will be presented. We then use Grail and dialogical logic to solve some examples of textual entailment from the FraCaS database [11].

Overview

This chapter is structured as follows. Section 2 introduces dialogical logic for classical first order logic and dialogical validity. Section 3 gives a proof of the fact that the class of formulas that are dialogically valid is (almost) equal to the class of formulas that are valid in the standard meaning of the term, i.e., true in all interpretations. Section 4 is an introduction to categorial grammars and Grail. We conclude with applications of this semantics to the problem of textual entailment, using examples taken from the FraCaS database. Hence this paper deepens and extends our first step in this direction, “inferential semantics and argumentative dialogues” [9], mainly because of a strengthened link with natural language semantics and textual entailment:

• Technically, we provide a proof that classical validity is equivalent, in a sense that will be specified below, to dialogical validity.

• We do not require the formulas to be in negation normal form, as done in [9]. In this way the strategies correspond to derivations in a two-sided sequent calculus, i.e., with multiple hypotheses and multiple conclusions.

• We use Grail to connect natural language inference (or textual entailment) to argumentative dialogues: indeed, Grail turns natural language sentences into logical formulas (delivered as DRSs)—while our previous work only assumed that sentences could be turned into logical formulas.

• This leads us closer to inferentialist semantics: a sentence S can be interpreted as all argumentative dialogues (possibly expressed in natural language) whose conclusion is S—under assumptions corresponding to word meaning and to the speaker's beliefs.
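On one standard logical reading, a text entails a hypothesis when the implication from the conjunction of the premises' formulas to the hypothesis' formula is valid. The toy sketch below checks such a validity claim by brute force for a propositionalized Aristotelian syllogism; it is only an illustration of this reduction, and is neither Grail nor the first-order dialogical procedure developed in this chapter.

# Toy illustration of "textual entailment = validity of premises => hypothesis".
from itertools import product

def implies(a: bool, b: bool) -> bool:
    return (not a) or b

def valid(formula, atoms) -> bool:
    # True if the formula holds under every truth assignment to the given atoms
    return all(formula(dict(zip(atoms, values)))
               for values in product([False, True], repeat=len(atoms)))

# "Socrates is a man" and "every man is mortal" entail "Socrates is mortal",
# propositionalized for the single individual Socrates:
syllogism = lambda v: implies(v["man"] and implies(v["man"], v["mortal"]), v["mortal"])
print(valid(syllogism, ["man", "mortal"]))  # True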
2 Dialogical Logic

The formal developments of the inferentialist view of meaning—as far as logic is concerned—are usually based on natural deduction [18]. We choose a different approach, and give a formal implementation of the ideas of inferentialist meaning theory using the tools of dialogical logic. In our view, the connection with semantics based on the notion of argument is clearer within the latter paradigm: an argument in favor of a statement is often developed when a critical audience, real or imaginary, doubts the truth or the plausibility of the proposition. In this case, in order to
successfully assert the statement, a speaker or proponent of it must be capable of providing all the justifications that the audience is entitled to demand. Taking this idea seriously, an approximation of the meaning of a sentence in a given situation can be obtained by studying the argumentative dialogues that arise once the sentence is asserted in front of such a critical audience. This type of situation is captured—with a reasonable degree of approximation—by dialogical logic. In the dialogical logic framework, knowing the meaning of a sentence means being able to provide a justification of the sentence to a critical audience. Note that with this type of methodology the requirement of manifestability required to attribute knowledge of the meaning of a sentence to a locutor is automatically met. The locutor who asserts a certain formula is obliged to make his knowledge of the meaning manifest so that he can answer the questions and objections of his interlocutor. In addition, any concessions made by his interlocutor during the argumentative dialogue will form the linguistic context in which to evaluate the initial assertion. Our approach shares some similarities with the one in [17, 27], which propose a formal implementation of the inferentialist view of meaning in the frame of Ludics [22]: the main objects of Ludics, called designs, can be viewed as argumentative strategies. We now progressively present dialogical logic. Although the study of dialectics— the art of correct debate—and logic—the science of valid reasoning—have been intrinsically linked since their beginnings [7, 8, 35], modern mathematical logic had to wait until the 50s of the last century to ensure that the logical concept of validity was expressed through the use of dialogical concepts and techniques. Inspired by the Philosophical Investigations of Wittgenstein [40], the German mathematician and philosopher Lorenzen [28] proposed to analyze the concept of validity of a formula F through the concept of winning strategy in a particular type of two-player game. This type of game is nothing more than an argumentative dialogue between a player, called proponent, who affirms the validity of a certain formula F and another player, called opponent, who contests its validity. The proponent starts the argumentative dialogue by affirming a certain formula. The opponent takes turns and attacks the claim made by the proponent according to its logical form. The proponent can, depending on his previous assertion and on the form of the attack made by the opponent, either defend his previous claim or counter attack. The debate evolves following this pattern. The proponent wins the debate if he has the last word, i.e., the defence against one of the attacks made by the opponent is a proposition that the opponent can not attack without violating the debate rules. Dialogical logic was initially conceived by Lorenzen as a foundation for intuitionistic logic (IL). Lorenzen’s idea was, roughly speaking, the following. It is possible to define a “natural” class of dialogue games in which, given a formula F of I L, the proponent can always win a game on F, no matter how the opponent chooses to attack his assertion in the debate, if F is IL-valid. This intuition was formalized as the completeness of the dialogical proof system with respect to provability or validity in any model: Completeness of a dialogical proof system: Given a logical language L and a notion of validity for L (either proof theoretical or model theoretical) and a notion
of dialogical game, a formula F is valid in L if and only if, there is a winning strategy for the proponent of F in the class of games under consideration. A winning strategy can be intuitively understood as an algorithm that takes as input the moves of the game made so far, and outputs moves for the proponent which guarantee he will always win. Unfortunately, almost 40 years of work were needed to get a first correct proof of the completeness theorem [16]. Subsequently, different systems of dialogical logic were developed. We present here a system of dialogical logic that is complete for classical first order logic.
2.1 Language, Formulas, Subformulas, Trees

Throughout the paper we assume that a first order language L has been defined by a set of constants and function symbols (each of them with an arity) and a set of predicate symbols (each of them with a specified arity), including a particular 0-ary predicate, i.e., a proposition letter written ⊥ – beware that, in some of the sequent calculi to be introduced later on, this proposition letter has no particularity, while in some others it has a specific rule, namely ex falso quod libet sequitur. The set of terms T is defined as the smallest set containing constants and variables (their set is V) and closed under function symbols: if t1, . . . , tn are n terms and f is an n-ary function symbol, the expression f(t1, . . . , tn) is a term as well. An atomic formula is an expression P(t1, . . . , tn) where P is an n-ary predicate symbol and t1, . . . , tn are n terms. In particular a proposition letter is an atomic formula. Formulas and sub-formulas are defined as usual, except that we have no negation connective—in some of the proof systems considered in the paper G ⇒ ⊥ enjoys all of the properties of ¬G, while in some others it does not. The set of terms of the language will be denoted by T, the set of variables of the language will be denoted by V, and the set of formulas of the language will be denoted by F. We here recall the definitions of formulas and two notions of subformula, the usual one and the Gentzen one. The set of formulas as well as the multisets of positive (resp. negative) occurrences of subformulas sub+(F) (resp. sub−(F)) of a formula F are defined as follows, where ⊎ denotes multiset union:

• if F is an atomic formula P(t1, . . . , tn) then F is a formula and sub+(F) = {F}, sub−(F) = ∅

• if F1 and F2 are formulas then F = F1 ∗ F2 with ∗ ∈ {∧, ∨} is a formula and sub+(F) = sub+(F1) ⊎ sub+(F2) ⊎ {F}, sub−(F) = sub−(F1) ⊎ sub−(F2)

• if F1 and F2 are formulas then F = F1 ⇒ F2 is a formula and sub+(F) = sub−(F1) ⊎ sub+(F2) ⊎ {F}, sub−(F) = sub+(F1) ⊎ sub−(F2)
• if F1 is a formula and x is a variable then F = Qx.F1 with Q ∈ {∃, ∀} is a formula and sub+(F) = sub+(F1) ⊎ {F}, sub−(F) = sub−(F1)

• nothing else is a formula.

The subformula occurrences of a formula F are simply sub+(F) ⊎ sub−(F). Observe that, because we use multisets, which distinguish the different occurrences of the same subformula in a formula, a given occurrence of a subformula in F is either positive or negative, but not both, although the same underlying formula may appear both in sub+(F) and in sub−(F). Let F[t/x] stand for the formula obtained by replacing the free occurrences of x in F by the term t in T. The Gentzen variants gv(G) of a formula G are obtained by recursively replacing the quantified variables of G by terms in T, starting from the outermost quantifiers:

• gv(A(t1, . . . , tn)) = {A(t1, . . . , tn)}: there are no variants but the formula itself when the formula is atomic

• gv(F1 ∗ F2) = {F1 ∗ F2} ∪ gv(F1) ∪ gv(F2) when ∗ is a binary connective among ∧, ∨, ⇒

• gv(Qx F1) = {Qx F1} ∪ ⋃t∈T gv(F1[t/x]) when Q is a quantifier among ∃, ∀

The positive (resp. negative) Gentzen subformulas of a formula F are the Gentzen variants gv(H) with H a positive (resp. negative) occurrence of a subformula in F, i.e., for H ∈ sub+(F) (resp. H ∈ sub−(F)). Thus any Gentzen subformula is anchored to an occurrence of a subformula. The above logical definitions about formulas are standard (see, e.g., [21]), but given that they may differ slightly from one textbook to another, we prefer to present them in full detail. The paper also deals with trees, because proofs and strategies are trees. We recall that a tree can either be defined as a set of prefix-closed sequences (the empty sequence corresponds to the root), or inductively: given a family of trees t1, . . . , tn, one may define a new tree f(t1, . . . , tn), whose daughters are the ti's, by adding a new root f. In a tree there is a unique path from a node to another, in particular from any node x to the root: such a path is called a branch, and its length is the height of the node x; the branch is said to be maximal when it is not the prefix of a longer branch, and in this case x is called a leaf. The paper does not need a more formal reminder about trees to be understood.
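These inductive clauses translate directly into code. The following sketch uses a tuple encoding of formulas introduced here purely for illustration (it is not the chapter's notation) and returns the positive and negative occurrence multisets as lists.

# Illustrative transcription of the sub+/sub- definitions above.
def sub(F):
    # returns the (positive, negative) multisets of subformula occurrences, as lists
    tag = F[0]
    if tag == "atom":                      # ("atom", predicate, term, ...)
        return [F], []
    if tag in ("and", "or"):               # (tag, F1, F2)
        p1, n1 = sub(F[1]); p2, n2 = sub(F[2])
        return p1 + p2 + [F], n1 + n2
    if tag == "imp":                       # ("imp", F1, F2): F1 switches polarity
        p1, n1 = sub(F[1]); p2, n2 = sub(F[2])
        return n1 + p2 + [F], p1 + n2
    if tag in ("forall", "exists"):        # (tag, variable, F1)
        p1, n1 = sub(F[2])
        return p1 + [F], n1

# a(x) => forall y. a(y): the occurrence of a(x) is negative, that of a(y) is positive
body = ("imp", ("atom", "a", "x"), ("forall", "y", ("atom", "a", "y")))
positive, negative = sub(body)
print(negative)  # [('atom', 'a', 'x')]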
2.2 Argumentation Forms

Let us consider a set Aux of auxiliary symbols containing the symbols ∧1, ∧2, ∨, ∃ and the expressions ∀[t/x] for all terms t in T and variables x in L, and nothing else. Following the terminology of Felscher [16], an argumentation form Arg is a function assigning to each non-atomic formula F in F a set of pairs, each consisting of one
question (also called an attack in the literature) and one answer (also called a defense in the literature), with questions being either formulas in F or symbols in Aux and answers being formulas in F.3

Arg(F1 ⇒ F2) = {(F1, F2)}
Arg(F1 ∧ F2) = {(∧1, F1), (∧2, F2)}
Arg(F1 ∨ F2) = {(∨, F1), (∨, F2)}
Arg(∀x F) = {(∀[t/x], F[t/x]) | t ∈ T}
Arg(∃x F) = {(∃, F[t/x]) | t ∈ T}

In a pair (∀[t/x], F[t/x]) ∈ Arg(∀x F) the term t in the question ∀[t/x] is called the chosen term. In each couple (∃, F[t/x]) the term t in the answer F[t/x] is called the chosen term. Given a formula F, a question q that belongs to a couple (q, a) ∈ Arg(F) is called a question on F. An answer a is called an answer to q whenever the couple (q, a) is an element of Arg(F). So, for example, if F is F1 ∧ F2, both ∧1 and ∧2 are questions on F, but only F1 is an answer to ∧1 and only F2 is an answer to ∧2. If F = F1 ∨ F2, the symbol ∨ is a question on F, and both F1 and F2 are answers to ∨. Consider the case where F is F1 ⇒ F2. In this case F1 is a question on F and F2 is an answer to F1. The use of the expression “question” to qualify F1 may sound odd. The bizarre impression that the expression “question” generates when associated with F1 disappears if we paraphrase F1 as a question in the following way: “could you convince me that F2 holds by assuming the hypothesis that F1 holds?”.
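For a finite stock of terms, the argumentation form can likewise be spelled out in code. The sketch below reuses the illustrative tuple encoding of the previous sketch; the naive substitution ignores variable capture.

# Illustrative version of Arg for a finite set of terms.
def substitute(F, x, t):
    if F[0] == "atom":
        return ("atom", F[1]) + tuple(t if s == x else s for s in F[2:])
    if F[0] in ("forall", "exists"):
        return F if F[1] == x else (F[0], F[1], substitute(F[2], x, t))
    return (F[0],) + tuple(substitute(G, x, t) for G in F[1:])

def arg(F, terms):
    tag = F[0]
    if tag == "imp":      # question: assert the antecedent, answer: the consequent
        return [(F[1], F[2])]
    if tag == "and":
        return [("∧1", F[1]), ("∧2", F[2])]
    if tag == "or":
        return [("∨", F[1]), ("∨", F[2])]
    if tag == "forall":   # the attacker chooses the term t
        return [(("∀", t), substitute(F[2], F[1], t)) for t in terms]
    if tag == "exists":   # the defender chooses the term t
        return [("∃", substitute(F[2], F[1], t)) for t in terms]

# the two possible answers to the question ∨ on a(x) ∨ b(x)
print(arg(("or", ("atom", "a", "x"), ("atom", "b", "x")), terms=[]))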
2.3 Prejustified Sequences

Moves are pairs (i, s) with i ∈ {?, !} and s being either a formula or an auxiliary symbol. Moves are called attacks whenever i = ? and defences whenever i = !. Some moves are called assertions. There are two types of assertions: all defences are assertions, and attacks of the form (?, F) where F is a formula are assertions. A prejustified sequence is a sequence of moves M = M0, M1, . . . , Mj, . . ., together with a partial function f : M → M such that for all Mi in the sequence for which f is defined, f(Mi) is an Mj with j < i. We say that f(Mi) is the enabler of Mi or that Mi is enabled by f(Mi). Given a prejustified sequence M = M0, M1, . . . , Mj, . . ., an attack move Mn in M of the form (?, s) is said to be justified whenever f(Mn) is an assertion (i, F) and s is a question on F. The attack move Mn = (?, s) is called an existential attack whenever s = ∃, and a universal attack whenever s = ∀[t/x].
Dialogical Argumentation and Textual Entailment
199
A defence Mn of the form (!, F) is justified if f (Mn ) = (?, s) is a justified attack of M j , M j is an assertion (i, F ), s a question on F , and the couple (s, F) belongs to Arg(F ). An assertion Mn of the form (i, F) is a reprise if and only if there exists another move M j , with j < i and j having opposite parity of i, of the form (i , F). An assertion Mn is called an existential repetition, if its enabler is of the form (?, ∃) and there is another move of the same parity Mn with n < n having the same enabler. A pre-justified sequence in which each move is justified is called justified sequence
2.4 Games Definition 1 A game for a formula F is a pre-justified sequence M0 , M1 , . . . , M j , . . . such that 1. M0 is (!, F) and M1 , . . . , M j , . . . is a justified sequence in which each oddindexed move is enabled by its immediate predecessor and in which each evenindexed move is enabled by a preceding odd-index move, 2. if an even-indexed move asserts an atomic formula then it is a reprise, 3. for all even m, n if Mm and Mn are defence moves that assert the subformula F1 of F and are enabled by the same move M j , then m = n unless M j is (?, ∨) and f (M j ) = (F1 ∨ F1 ) The item 1 is usually called the formal rule. Odd-index moves are opponent moves (O-moves) while even index moves are called proponent moves (P-moves). A move M is legal for a game G if the prejustified sequence G, M is a game. A game G is won by the proponent if, and only if, it is finite and there is no O-move which is legal for G. It is won by the opponent otherwise. Remark 1 All formulas that are asserted in a Game G for a formula F are, by construction of the game, Gentzen subformulas of F We give three examples of games. Each game will be represented as a table with two columns. The first column, read from top to bottom represents the sequence of moves of the game. The second column shows the value of the function f at each point of the game. The first is a game won by P for the formula a(x) ∨ ¬a(x) where a(x) is an atomic formula. M0 M1 M2 M3 M4
= (!, a(x) ∨ ¬a(x)) = (?, ∨) = (!, ¬a(x)) = (?, a(x)) = (!, a(x))
M0 M1 M2 M1
200
D. Catta et al.
The game starts by the assertion of a(x) ∨ ¬a(x). The move M1 is a justified attack on M0 . In fact f (M1 ) = M0 and ∨ is a question on a(x) ∨ ¬a(x). The following defence move M2 is justified. In fact f (M2 ) = M1 is a justified attack, and ¬a(x) is an answer to the question ∨ on the formula a(x) ∨ ¬a(x). The assertion M3 is itself a justified attack. f (M3 ) = M2 , M2 is an assertion of ¬a(x) = a(x) ⇒ ⊥ and a(x) is a question on a(x) ⇒ ⊥. The final move M4 is a justified defence. In fact f (M4 ) = M1 , M1 is a justified attack on M0 and a(x) is an answer to ?∨ which, in turn, is a question on a(x) ∨ ¬a(x). Moreover M4 is a reprise: there exists another assertion, the assertion M3 , that asserts the same formula a(x), with 3 odd and smaller than 4. The second example is a game that is not won by P for the formula a(x) ∨ ¬a(y) M0 M1 M2 M3
= (!, a(x) ∨ ¬a(y)) = (?, ∨) M0 = (!, ¬a(y)) M1 = (?, a(y)) M2
The game starts by the assertion of a(x) ∨ ¬a(y). The move M1 is a justified attack on M0 , since f (M1 ) = M0 and ∨ is a question on a(x) ∨ ¬a(y). The following defence move M2 is justified because f (M2 ) = M1 is a justified attack, and ¬a(y) is an answer to the question ∨ on the formula a(x) ∨ ¬a(y). The assertion M3 in its turn is a justified attack. f (M3 ) = M2 , M2 is an assertion of ¬a(y) = a(y) ⇒ ⊥ and a(y) is a question on a(y) ⇒ ⊥. The game ends here and its lost by P. Remark that P cannot extend the game: he cannot assert ⊥ as an answer to a(y) in M3 , because there is no move Mk by O that asserts ⊥. For the same reason he cannot assert a(x) as an answer to ∨ in M1 . Moreover he cannot assert ¬a(y) a second time because of condition 3 on the definition of dialogical games. The third and last example is a game won by P for the formula a ∨ b ⇒ a. This formula is not a tautology. The fact that P can win a game on this formula means that the notion of validity cannot be captured using only the definition of game. M0 M1 M2 M3 M4
= (!, a ∨ b ⇒ a) = (?, a ∨ b) = (?, ∨) = (!, a) = (!, a)
M0 M1 M2 M1
The game starts by the assertion of a ∨ b ⇒ a. The following move M1 is a justified attack on M0 , because F(M1 ) = M0 and a ∨ b is a question on a ∨ b ⇒ a. The following attack move M2 is justified. In fact f (M2 ) = M1 is an assertion of a ∨ b and ∨ is an attack on this last formula. The defence move M3 is justified, since a is an answer on ∨ which is a question on a ∨ b. Finally the move M4 is a justified defence. f (M4 ) = M1 is a justified attack, a is an answer to a ∨ b if this last is a question on a ∨ b ⇒ a. Moreover M4 is a reprise, because there exists an earlier assertion by O, namely M3 , that asserts a.
Here are some properties that will be used in the final section of this chapter. Proposition 1 If a finite game G = M0 , . . . , Mn is won by P in the sense defined above (O has no further legal move left) then Mn is the assertion of some atomic formula a(t1 , . . . tm ). Proposition 2 For all games G, for all formulas F and for all subformula F of F. If there is an P-move (O-move) in G that asserts F then F is a positive (negative) subformula of F. Proof Let F be any formula and G any game. We show the proposition by induction on the length of G. If the length is 1 then G consist of only one move that is a P-move asserting the formula F and so the proposition holds. Suppose that the proposition holds for all games G having length n and let G be a game having length n + 1. Let Mn be the last move of G . Suppose that Mn is a P-move (the argument for O-moves runs in a very similar way) We have three cases. 1. If Mn is not an assertion, the proposition holds automatically by induction hypothesis. 2. If Mn is a defence asserting some formula F , then, since G is a game, the sequence M1 . . . Mn is justified. Thus Mn is enabled by some O-move Mk with (k < n). If Mk := (?, F1 ) (the other cases are easier) then it is an attack against Mk−1 and Mk1 is a P-move that asserts F1 ⇒ F . By induction hypothesis F1 ⇒ F is a positive subformula of F, and F1 is a negative subformula of F. Thus F is a positive subformula of F by definition. 3. If Mn is an assertion and an attack, let F be the asserted formula. As before there must exists an enabler of Mn , call it Mk (k < n), Mk is necessarily an O-move that asserts the formula F ⇒ F . By induction hypothesis this last formula is a negative subformula of F, thus F is a positive subformula of F by definition. An easy consequence of the latter proposition is the following Proposition 3 Let G be a game for a formula F and let Mn be a P-move in G asserting an atomic formula a(t1 , . . . tm ). Then this latter formula appears both as a negative and positive subformula of F
2.5 Strategies Informally speaking, a strategy for a player is an algorithm for playing the game that tells the player what to do for every possible situation throughout the game. We informally describe how a strategy should operate and then formalize this notion. Imagine being engaged in a game G, that the last move of G was played according to the strategy, and that it is now your opponent’s turn to play. Your opponent could extend the game in different ways: for example if you are playing chess, you are white and you just made your first move by moving a pawn to a certain position of
the chessboard, black can in turn move a pawn or move a knight. If you are playing according to the strategy, the strategy should tell you how to react against either type of move. If black moves a pawn to C6 and you just moved your pawn to C3, then move your knight to H3. If black moves a knight to H6 and you just moved your pawn to C3, then move your pawn to B4. Therefore, a strategy can be viewed as a tree in which each node is a move in the game, the moves of my opponent have at most one daughter, and my moves have as many daughters as there are available moves for my opponent. In game semantics and in the dialogical logic literature, it is rather standard to represent a strategy as a tree of games [1, 16, 25]. Nevertheless, one should keep in mind that a strategy is a function telling one player which move to play when it is her turn, whatever the history of the game is, and this is easily represented as a tree. We thus formalize the notion of strategy as follows. Given a game G we say that a variable x appears in the game if, and only if, the variable x appears in some asserted formula or in the choice of some universal attack. Let (vi)i∈I be an enumeration of the variables in L. A strategy S is a prefix-closed set of games for the same formula (i.e., a tree of games for the same formula) such that:

1. if G belongs to the strategy and the last move of G is a P-move that is neither an assertion of a universally quantified formula nor an existential attack, then G, M belongs to the strategy for all moves M legal for G,

2. if G, M and G, M′ belong to S and M, M′ are P-moves, then M = M′,

3. if G, M and G, M′ belong to S and M, M′ are O-moves and universal attacks, then M = M′; moreover the chosen variable is the first variable in the enumeration that does not appear in G,

4. if G, M and G, M′ belong to S and M, M′ are O-moves and existential defences, then M = M′; moreover the term chosen to defend is the first variable in the enumeration that does not appear in G,

5. if G belongs to S and the last move of G is an O-move that is a question on an existential quantifier, then G, M belongs to S, where M is enabled by the last move of G.

A strategy S is P-winning iff each game in S is won by P. Given that a strategy is a tree of games, in what follows we will sometimes speak of nodes of a strategy as a shortcut for moves of a game G that belongs to the strategy.
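The view of a strategy as a tree of games suggests an equally simple data structure: each node records the P-move to play and one subtree per O-reply. The fragment below, in an illustrative encoding of our own, mirrors the first example game of Sect. 2.4.

# Illustrative strategy tree: a node holds P's move and maps each O-reply to the next node.
from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    p_move: tuple                                  # the move P plays at this node
    replies: dict = field(default_factory=dict)    # maps each O-reply to the next node

strategy = StrategyNode(("!", "a(x) ∨ ¬a(x)"), {
    ("?", "∨"): StrategyNode(("!", "¬a(x)"), {
        ("?", "a(x)"): StrategyNode(("!", "a(x)")),
    }),
})
print(strategy.replies[("?", "∨")].p_move)  # ('!', '¬a(x)')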
2.6 Validity
Definition 2 Given a first-order formula F we say that F is dialogically valid if, and only if, there exists a winning strategy S for the formula.
Figure 1 shows three examples of winning strategies. The blue dotted arrows represent the function f that points back from P-moves to the O-move that enables
Fig. 1 Three winning strategies
them. Keep in mind that every O-move is enabled by the immediately preceding move. We give an explanation of the strategy for the drinker theorem, i.e., the formula ∃x(a(x) ⇒ ∀y a(y)). Call the moves in the strategy, from the root to the leaf, M0, M1, . . . , M8. The strategy starts with a P-move asserting the drinker theorem. The subsequent move M1 by O, (?, ∃), is an attack move directed toward the move M0. A possible paraphrase of M1 would be: “could you choose a term to instantiate the existential formula you asserted?”. By condition 5 in the definition of strategy, the move M2 must be a defence move enabled by M1. In the picture there is an arrow pointing back from M2 to M1, and the formula asserted in M2, a(c) ⇒ ∀y a(y), is an answer to ∃. Since each O-move in a game is enabled by the immediately preceding P-move, and since a strategy is a tree of games, O has no choice but to attack the move M2 by asserting a(c); a(c) is a question on the formula asserted by M2. The player P chooses (move M4) to answer a(c) by asserting ∀y a(y). The player O attacks M4 by choosing, as the definition of strategy prescribes, a variable w that does not appear in the game, and asking the player P to assert a[w/y] (move M5). The player P cannot immediately answer the question in M5: there is no O-move in the dialogue that ends with M5 that is an assertion of a(w), thus by condition 1 in the definition of games, the move M6 cannot be the assertion of a(w). The player P decides, instead, to again answer the question ∃ played by O in M1 by instantiating the drinker theorem with the variable w, i.e., the move M6 is (!, a(w) ⇒ ∀y a(y)) and it is enabled by M1 (this move is an existential repetition). The player O is obliged by the definition of game to assert, as move M7, a(w) as a question on the formula asserted by P in M6. At this point the player P can answer the question ?∀[w/y] in M5 by making the move M8 = (!, a(w)) and he wins the game.
3 Dialogical Validity Is Equivalent to Classical Validity
In this section we show the equivalence, for a formula F, between the dialogical validity of F (the existence of a winning strategy for the proponent of F) and the classical validity of F (F being true in all interpretations). To prove this equivalence, we use a particular version of the sequent calculus, GKs (see Table 1). The calculus GKs (for Gentzen Kleene strategy) is equivalent to the sequent calculus GKc (Gentzen Kleene classical, see Table 2); GKc is complete for first order logic (see [39] pp. 84–86) in the sense that a sequent Γ ⊢ Δ can be derived if and only if the corresponding formula (the conjunction of Γ implying the disjunction of Δ) is classically valid. Later on we shall consider a restriction (strategic derivations) on the use of the left implication introduction rule and on the use of the right existential introduction rule in the calculus GKs. This restriction does not affect the completeness of the sequent calculus with respect to validity. The restriction on the use of the left implication introduction rule was already studied by Herbelin in his PhD thesis [23] for a sequent calculus for propositional classical logic called LKQ. The restriction on the use of the left implication introduction rule was used by Herbelin to prove the correspondence between winning strategies for propositional classical logic and derivations in LKQ.
Table 1 The GKs sequent calculus (Γ, Δ stand for multisets of formulas)

Id:   Γ, A ⊢ A, Δ

⇒R:  from  Γ, A ⊢ B, Δ  infer  Γ ⊢ A ⇒ B, Δ
⇒L:  from  Γ, A ⇒ B ⊢ A, Δ  and  Γ, A ⇒ B, B ⊢ Δ  infer  Γ, A ⇒ B ⊢ Δ

∧R:  from  Γ ⊢ A, Δ  and  Γ ⊢ B, Δ  infer  Γ ⊢ A ∧ B, Δ
∧L1: from  Γ, A, A ∧ B ⊢ Δ  infer  Γ, A ∧ B ⊢ Δ
∧L2: from  Γ, B, A ∧ B ⊢ Δ  infer  Γ, A ∧ B ⊢ Δ

∨R:  from  Γ ⊢ A, B, Δ  infer  Γ ⊢ A ∨ B, Δ
∨L:  from  Γ, A ∨ B, A ⊢ Δ  and  Γ, A ∨ B, B ⊢ Δ  infer  Γ, A ∨ B ⊢ Δ

∃R:  from  Γ ⊢ ∃x A, A[t/x], Δ  infer  Γ ⊢ ∃x A, Δ
∃L:  from  Γ, A[y/x], ∃x A ⊢ Δ  infer  Γ, ∃x A ⊢ Δ
∀R:  from  Γ ⊢ A[y/x], Δ  infer  Γ ⊢ ∀x A, Δ
∀L:  from  Γ, A[t/x], ∀x A ⊢ Δ  infer  Γ, ∀x A ⊢ Δ
Definition 3 The sequent calculus GKs is defined by the rules in Table 1. The sequent calculus GKc is defined by the rules in Table 2. In both calculi Greek upper-case letters Γ, Δ, . . . stand for multisets of formulas. In the Id-rule A is an atomic formula. In the ∀R and ∃L rules the variable y does not occur in the conclusion sequent. The bold formulas in the conclusion of each rule are called active formulas.
A derivation (or a proof) π of a sequent Γ ⊢ Δ in GKs (resp. GKc) is a tree of sequents constructed according to the rules of GKs (resp. GKc) in which leaves are Id-rules (resp. Id-rules or ⊥L-rules) and whose root, also called conclusion, is Γ ⊢ Δ. A sequent Γ ⊢ Δ is said to be derivable or provable in a sequent calculus GKX whenever there exists a proof with conclusion Γ ⊢ Δ. When the sequent is ⊢ F the formula F is said to be provable in GKX.
A binary (resp. unary) rule R is said to be admissible in a sequent calculus GKX if one may derive, from any sequents S1, S2 (resp. S1) that are premises of R and, possibly, axioms (also called identity rules) of GKX, the conclusion of R applied to S1, S2 (resp. S1).
Table 2 The GKc sequent calculus

Id:   Γ, A ⊢ A, Δ
⊥L:  Γ, ⊥ ⊢ Δ

⇒R:  from  Γ, A ⊢ B, A ⇒ B, Δ  infer  Γ ⊢ A ⇒ B, Δ
⇒L:  from  Γ, A ⇒ B ⊢ A, Δ  and  Γ, A ⇒ B, B ⊢ Δ  infer  Γ, A ⇒ B ⊢ Δ

∧R:  from  Γ ⊢ A, A ∧ B, Δ  and  Γ ⊢ B, A ∧ B, Δ  infer  Γ ⊢ A ∧ B, Δ
∧Li: from  Γ, Ai, A1 ∧ A2 ⊢ Δ  infer  Γ, A1 ∧ A2 ⊢ Δ

∨Ri: from  Γ ⊢ Ai, A1 ∨ A2, Δ  infer  Γ ⊢ A1 ∨ A2, Δ
∨L:  from  Γ, A ∨ B, A ⊢ Δ  and  Γ, A ∨ B, B ⊢ Δ  infer  Γ, A ∨ B ⊢ Δ

∃R:  from  Γ ⊢ ∃x A, A[t/x], Δ  infer  Γ ⊢ ∃x A, Δ
∃L:  from  Γ, A[y/x], ∃x A ⊢ Δ  infer  Γ, ∃x A ⊢ Δ
∀R:  from  Γ ⊢ A[y/x], ∀x A, Δ  infer  Γ ⊢ ∀x A, Δ
∀L:  from  Γ, A[t/x], ∀x A ⊢ Δ  infer  Γ, ∀x A ⊢ Δ
Remark 2 The calculi GKs and GKc differ in several aspects.
• First of all, in GKc there is a rule for ⊥ (called ex falso quodlibet sequitur) while there is no rule for ⊥ in GKs: this entails that in GKc the negation ¬G of a formula G can be defined as ¬G = G ⇒ ⊥, while in GKs the formula G ⇒ ⊥ is “almost” a negation, enjoying tertium non datur but not some other properties.
• Moreover, each premise of a rule in GKc contains the active formula of the conclusion, while this is not the case in GKs. In GKs only the left introduction rules and the right introduction rule for the existential quantifier have this property. Finally, the right introduction rule for disjunction, using the terminology of linear logic, is additive in GKc (contexts in premise sequent(s) and conclusion sequent
are the same) and multiplicative in GKs (contexts may be different in the premise sequent(s) and are concatenated in the conclusion sequent).
• The sequent ⊥ ⊢ is provable in GKc but not in GKs, while ⊥ ⊢ ⊥ is provable in both GKc and GKs.

Proposition 4 Contraction and weakening are admissible in GKs, i.e., for all A, Γ, Δ:
• if Γ, A, A ⊢ Δ is provable in GKs then Γ, A ⊢ Δ is provable in GKs,
• if Γ ⊢ A, A, Δ is provable then Γ ⊢ A, Δ is provable in GKs,
• if Γ ⊢ Δ is provable in GKs then Γ, A ⊢ Δ is provable in GKs,
• if Γ ⊢ Δ is provable in GKs then Γ ⊢ A, Δ is provable in GKs.
Proof (Sketch) The admissibility of contraction and weakening in GKs is an easy adaptation of the proof contained in [39, pp. 78–81].

Proposition 5 All rules of GKs are admissible in GKc.

Proof One shows that if the premises of a rule R from GKs are derivable in GKc then the conclusion of R is derivable in GKc, using as base case the identity rule, which is the same in the two systems. The proof uses the admissibility of weakening and contraction for GKc [39]. We show just one case; all other cases are similar. Suppose that, in GKc, there is a derivation of the premise of the right introduction rule for ∨ in GKs, i.e., that the sequent Γ ⊢ A, B, Δ is derivable. We want to show that the sequent Γ ⊢ A ∨ B, Δ is derivable using the rules of GKc. By the admissibility of weakening we have a derivation of the sequent Γ ⊢ A, B, A ∨ B, Δ in GKc. We can then construct the following derivation using the rules of GKc:
Γ ⊢ A, B, A ∨ B, Δ
Γ ⊢ A, A ∨ B, Δ          (∨R2)
Γ ⊢ A ∨ B, Δ             (∨R1)
We now prove that GKs and GKc prove almost the same sequents.

Proposition 6 Given a pair of multisets of formulas Γ, Δ, the sequent Γ ⊢ Δ, ⊥ is provable in GKs if, and only if, Γ ⊢ Δ is provable in GKc.

Proof For the left to right direction: if Γ ⊢ Δ, ⊥ is provable in GKs, then we can conclude, by Proposition 5, that Γ ⊢ Δ, ⊥ is provable in GKc. Since GKc is complete for classical first order logic we can conclude that the formula ∧Γ ⇒ (∨Δ ∨ ⊥) (the conjunction of Γ implying the disjunction of Δ or ⊥) is valid. But this means that the formula ∧Γ ⇒ ∨Δ is also valid. Using completeness again we can conclude that the sequent Γ ⊢ Δ is provable in GKc. For the right to left direction, remark that all rules of GKc except the rule ⊥L are admissible in GKs because of the admissibility of contraction and weakening for GKs. However, if the rule ⊥L is used to prove a sequent Γ, ⊥ ⊢ Δ in GKc, the sequent Γ, ⊥ ⊢ Δ, ⊥ will be provable in GKs using an instance of the identity rule.

Remark 3 GKs is stronger than minimal logic; for example both Peirce’s law and the law of excluded middle are provable in GKs. We show a derivation of Peirce’s law and a derivation of the law of the excluded middle.
Derivation of Peirce’s law:
(a ⇒ b) ⇒ a, a ⊢ b, a                 (Id)
(a ⇒ b) ⇒ a ⊢ a ⇒ b, a                (⇒R)
a, (a ⇒ b) ⇒ a ⊢ a                    (Id)
(a ⇒ b) ⇒ a ⊢ a                       (⇒L, from the two sequents above)
⊢ ((a ⇒ b) ⇒ a) ⇒ a                   (⇒R)

Derivation of the law of excluded middle:
a ⊢ a, ⊥                              (Id)
⊢ a ⇒ ⊥, a                            (⇒R)
⊢ (a ⇒ ⊥) ∨ a                         (∨R)
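For readers who like to experiment, the following sketch (our own encoding, not code from the chapter) replays the excluded-middle derivation above with formulas as nested tuples and sequents as pairs of tuples; only the ⇒R and ∨R steps are implemented, and multisets are approximated by tuples.

Imp, Or, Bot, a = '=>', 'or', 'bot', 'a'

axiom      = ((a,), (a, Bot))                        # Id:  a ⊢ a, ⊥
after_impr = ((), ((Imp, a, Bot), a))                # ⇒R:  ⊢ a ⇒ ⊥, a
after_orr  = ((), ((Or, (Imp, a, Bot), a),))         # ∨R:  ⊢ (a ⇒ ⊥) ∨ a

def imp_r(premise, antecedent, succedent):
    # ⇒R of GKs: from Γ, A ⊢ B, Δ conclude Γ ⊢ A ⇒ B, Δ
    left, right = premise
    assert antecedent in left and succedent in right
    new_left = tuple(f for f in left if f != antecedent)     # drops every copy; fine for this toy
    new_right = ((Imp, antecedent, succedent),) + tuple(f for f in right if f != succedent)
    return new_left, new_right

def or_r(premise, disj1, disj2):
    # ∨R of GKs: from Γ ⊢ A, B, Δ conclude Γ ⊢ A ∨ B, Δ
    left, right = premise
    assert disj1 in right and disj2 in right
    rest = tuple(f for f in right if f not in (disj1, disj2))
    return left, ((Or, disj1, disj2),) + rest

assert imp_r(axiom, a, Bot) == after_impr
assert or_r(after_impr, (Imp, a, Bot), a) == after_orr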
3.1 From Strategies to Proofs in GKs
Let us say a few words that justify the introduction of the calculus GKs. GKs has been chosen strategically because it is “easy” to map winning strategies of dialogical logic to its derivations. The fact that all binary rules of GKs are context sharing (or additive) is motivated by the fact that, as explained below, we recursively associate sequents to the nodes of a strategy starting from the root of the strategy. Using this methodology it would be hard to split the sequent in the right way, as required by a context-splitting (or multiplicative) rule. The fact that all the left introduction rules of GKs carry the active formula of the conclusion in the premises of the rule is motivated by the fact that, as we will see below, left introduction “corresponds” to an attack move by P. The player P can attack the same formula many times. This corresponds, in a GKs derivation, to a left introduction rule having the same active formula and being used many times in the derivation. The fact that the only right introduction rule in which the premise carries the active formula of the conclusion is the existential rule is motivated by the following fact: right introduction rules correspond to defence moves by P, and P can answer the same question on an existential formula many times. This corresponds, in a derivation in GKs, to an existential right rule having the same active formula and being used at different points of the derivation. Finally, the admittedly odd treatment of negation in GKs is motivated by the following fact: without such treatment the correspondence between strategies and derivations would be trickier and more tedious to prove. One would have to give a dialogical meaning to the constant ⊥. This is not impossible, but the definitions become less harmonious and the proofs longer. We can now proceed with our correspondence proof. Because the sequent calculus GKs is sound for classical logic, the following proposition shows that a formula F with a winning strategy is true in any interpretation.

Proposition 7 Given a formula F, if there is a winning strategy for F then there is a proof π of F in the sequent calculus GKs.

Proof This proposition results from the results below:
• The function π(S) defined below associates to each O-move of a strategy S a sequent, thus yielding a tree of sequents π(S).
• The tree of sequents π(S) enjoys the eigenvariable condition (Proposition 8).
• The tree of sequents π(S) can easily be turned into a proof of GKs without losing the eigenvariable property (Proposition 9).

Let us consider D(S), the O-restriction of a strategy S, the tree of O-moves only, obtained from S (which is a tree of P- and O-moves) by forgetting all the P-moves and adding a root. Given a strategy S for a formula F we associate, by induction on the length n of a sequence of O-moves M = M1, M3, . . . , M2n−1 of D(S), a sequent Γ_M ⊢ Δ_M to each sequence of O-moves M in D(S) as follows:
1. If M is the empty sequence, then π(M) = ⊢ F.
2. If the sequence ends with an O-move which is an assertion:
a. if the assertion is a defence move (!, F′) then we associate the sequent Γ, F′ ⊢ Δ, where Γ ⊢ Δ is the sequent associated to the prefix of the sequence;
b. if the assertion is an attack move (?, F′) against an assertion (in the game) of F′ ⇒ C then we associate the sequent Γ, F′ ⊢ C, Δ, where Γ ⊢ Δ is the sequent associated to the prefix of the sequence, from which we have erased the formula F′ ⇒ C on the right of ⊢.
3. If the sequence ends in a move that is not an assertion then it should be an attack (?, s), where s is an auxiliary symbol. Two cases may occur:
a. If s is either ?∨, ∧1, ∧2 or ∀[w/x] then we associate the sequent Γ ⊢ Δ′, where Γ is equal to the Γ associated to the prefix of the sequence, and Δ′ is obtained from Δ, the right-hand side associated to the prefix of the sequence, in the following way: Δ′ = (Δ − C) ∪ R, where C is the formula such that s is a question on C, and R is the multiset of answers to s (remark that the formulas that are answers to s do not necessarily occur in some move of the strategy);
b. if s is ∃ then we associate the sequent Γ ⊢ F′(t), Δ, where F′(t) is the formula asserted by P in its defence against (?, s) and Γ ⊢ Δ is the sequent associated to the prefix of the sequence. Remark that the P-defence must exist in the strategy by clause 5 of the definition of strategy.

The following proposition shows that the tree of sequents π(S) satisfies the variable restriction on ∀R and ∃L, known as the eigenvariable condition:

Proposition 8 Let S be a winning strategy and let π(S) be the tree of sequents associated with S by the above procedure. Let M be a sequence in D(S) ending with a move that is
• either an attack against a universal quantifier (?, ∀[w/x])
• or a defence against an existential attack (!, A[w/x]).
Then the variable w does not appear free in the sequent associated to the proper prefix of M.

In order to prove Proposition 7 we just need some little syntactic manipulations on π(S) to obtain a sequent calculus proof of F:
Proposition 9 To each sequence of O-moves M in D(S) we can associate a derivation πM of Γ_M ⊢ Δ_M—the sequent associated to M.

Proof By well-founded induction on (D(S), ≺). Suppose that for each suffix M′ of M the proposition holds. We associate a derivation to M by considering the last move M2n of the game associated with M, that is, the unique game G in S ending in a P-move such that M is obtained from G by erasing P-moves. We only prove some cases which are not straightforward.
1. If M2n is an attack (?, A) on the assertion of A ⇒ C, we distinguish cases depending on the form of A.
• If A is atomic then the immediate suffix of M is M(!, C), for which the proposition holds by hypothesis. We associate to M the following derivation: from π_{M(!,C)}, a derivation of Γ, A ⇒ C, C ⊢ F, Δ, and the axiom Γ, A ⇒ C ⊢ A, Δ (an instance of Id, since the atomic formula A has been asserted by O and therefore occurs in Γ), the rule ⇒L yields Γ, A ⇒ C ⊢ F, Δ.
• If A = A1 ⇒ A2 then M has two immediate suffixes, namely M(?, A1) and M(!, C), for which the proposition holds by hypothesis. We associate to M the following derivation: from π_{M(?,A1)}, a derivation of Γ, (A1 ⇒ A2) ⇒ C, A1 ⊢ A2, Δ, the rule ⇒R yields Γ, (A1 ⇒ A2) ⇒ C ⊢ A1 ⇒ A2, Δ; from this sequent and π_{M(!,C)}, a derivation of Γ, (A1 ⇒ A2) ⇒ C, C ⊢ F, Δ, the rule ⇒L yields Γ, (A1 ⇒ A2) ⇒ C ⊢ F, Δ.
2. If M2n is an existential repetition asserting a formula F[t/x] we proceed as follows; we only consider the case where F[t/x] = (F1 ⇒ F2)[t/x]. By induction hypothesis there is a derivation of the sequent Γ, F1[t/x] ⊢ ∃x(F1 ⇒ F2), F2[t/x], Δ associated to the direct suffix M(?, F1[t/x]) of M. We associate to M the following derivation: from π_{M(?,F1[t/x])} the rule ⇒R yields Γ ⊢ ∃x(F1 ⇒ F2), (F1 ⇒ F2)[t/x], Δ, and the rule ∃R then yields Γ ⊢ ∃x(F1 ⇒ F2), Δ.
This proves Proposition 7 and thus assures us that if a formula is dialogically valid then it is provable in the sequent calculus, hence true in any interpretation.
3.2 From Proofs in GKs to Strategies
We have just shown that if a formula is dialogically valid then it is provable in the sequent calculus GKs. We now show the converse, by turning a GKs sequent calculus derivation into a winning strategy, but we shall impose a restriction on the derivations of GKs—a restriction which derives exactly the same sequents.
Indeed not all derivations in GKs are the image of some winning strategy. For instance, the two derivations below, where c(x), a, b are atomic formulas, are not the image of any winning strategy S although there are winning strategies for the two formulas (bold formulas are active occurrences of formulas in the sequent):

c(x) ⊢ c(x)                       (Id)
∀x c(x) ⊢ c(x)                    (∀L)
∀x c(x) ⊢ ∃x c(x)                 (∃R)

a ⇒ b, a ⊢ a, b                   (Id)
a ⇒ b ⊢ a, a ⇒ b                  (⇒R)
b, a ⇒ b, a ⊢ b                   (Id)
b, a ⇒ b ⊢ a ⇒ b                  (⇒R)
a ⇒ b ⊢ a ⇒ b                     (⇒L, from the two derived sequents)
⊢ (a ⇒ b) ⇒ (a ⇒ b)               (⇒R)
This leads us to restrict proofs of GKs to strategic proofs, which derive the same sequents but always correspond to winning strategies, and to proceed as follows:
• We describe a procedure that turns a proof into a strategy, by tree traversal from the root to higher nodes—the order of traversal of daughters is irrelevant.
• By looking at how the derivation should be made in order for the procedure to be successful, we define a subclass of derivations of GKs called strategic derivations.
• We show that the subclass is complete, in the sense that if the sequent is provable then it corresponds to a strategic derivation.
We describe a procedure, that we call p2s (from a Proof in GKs to a strategy), that converts a proof π of a formula F into a winning strategy S for F. The procedure p2s explores the proof π starting from the root and proceeding by level-order traversal. The procedure associates to π a prefix-closed set of games for the formula F. Assume that for each node a of the proof of the formula F having depth n, the branch of the derivation from the root to a is already associated with a prefix-closed set Sa of games for the formula F, and also assume that each maximal game G in Sa ends with an O-move. The prefix-closed set of games Sa1 associated with a1, where a1 is any daughter of a, is defined as follows:
1. If a1 is obtained by an identity rule Γ, A ⊢ A, Δ then Sa1 = Sa ∪ {G, (!, A)}, where A is the active formula of the identity rule and G is a maximal game in Sa such that (!, A) is legal for G.
2. If a1 is labelled with a sequent obtained from a right introduction rule with active formula A:
• if A is neither a conjunction nor a universal formula then Sa1 = Sa ∪ {G, (!, A), (?, s)}, where G is a maximal game in Sa such that (!, A) is legal for G and (?, s) is an attack move such that s is a question on A;
• if A is ∀x A′ then Sa1 = Sa ∪ {G, (!, ∀x A′), (?, ∀[w/x])}, where G is a maximal game in Sa such that (!, ∀x A′) is legal for G and the variable w in (?, ∀[w/x]) is the variable that appears in the premise of a1 but not in a1;
• if A is B ∧ C then Sa1 = Sa ∪ {G, (!, B ∧ C), (?, ∧1)} ∪ {G, (!, B ∧ C), (?, ∧2)}, where G is a maximal game in Sa such that the P-move (!, B ∧ C) is legal for G.
3. If a1 is labelled with a sequent obtained from a left introduction rule with active formula A:
• if A is neither a disjunction nor an implication then Sa1 = Sa ∪ {G, (?, q), (!, a)}, where G is a maximal game in Sa such that (?, q) is legal for G, (?, q) is a P-move where q is a question on A, and (!, a) is an O-move such that the couple (q, a) ∈ Arg(A);
• if A is B ∨ C then Sa1 = Sa ∪ {G, (?, ∨), (!, B)} ∪ {G, (?, ∨), (!, C)}, where G is a maximal game in Sa such that the P-move (?, ∨) is legal for G;
• if A is B ⇒ C then Sa1 = Sa ∪ {G, (?, B), (?, q1)} ∪ . . . ∪ {G, (?, B), (?, qn)} ∪ {G, (?, B), (!, C)}, where G is a maximal game in Sa such that the P-move (?, B) is legal for G and each qi is a question on B.
The above lines inductively define the mapping of a proof in GKs to a prefix-closed set of games, but are all the obtained prefix-closed sets of games strategies? Not always:
• If in (2) the active formula is an existentially quantified formula ∃x B then P asserts the formula and next it is attacked by O with (?, ∃). By the definition of a strategy, P has to assert B[t/x]. This means that a1 should have just one daughter a2 in which the formula B[t/x] is active.
• A similar situation occurs in (3) when the active formula is a conditional A ⇒ B: P has to assert A, so A must be the active formula of the left premise of the ⇒L rule.
In order to overcome this problem, we introduce the following definition.

Definition 4 (Strategic derivations) A derivation π in GKs is said to be strategic whenever it satisfies the two following conditions:
• for each application of a left implication introduction rule, the formula occurrence A in the left-hand premise is active:
  from  Γ, A ⇒ B ⊢ A, Δ  and  Γ, A ⇒ B, B ⊢ Δ  infer  Γ, A ⇒ B ⊢ Δ   (⇒L)
• for each application of a right existential introduction rule, the formula occurrence A[t/x] is active in the premise:
  from  Γ ⊢ ∃x A, A[t/x], Δ  infer  Γ ⊢ ∃x A, Δ   (∃R)
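Here is a small sketch of how the two conditions of Definition 4 could be checked mechanically. The data layout (dictionaries with 'rule', 'active', 'premises' and, for ∃R, a hypothetical 'witness_instance' field holding the occurrence A[t/x]) is ours, not the authors'.

def is_strategic(node):
    # node: {'rule': str, 'active': formula, 'premises': [subderivations], ...}
    rule = node['rule']
    premises = node.get('premises', [])
    if rule == 'impL':
        # active formula is ('imp', A, B); Definition 4 requires A to be active
        # in the left-hand premise.
        if premises[0].get('active') != node['active'][1]:
            return False
    if rule == 'existsR':
        # 'witness_instance' stands in for A[t/x]; Definition 4 requires it to be
        # the active formula of the premise.
        if premises[0].get('active') != node['witness_instance']:
            return False
    return all(is_strategic(p) for p in premises)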
Proposition 10 If π is a strategic derivation of F then the procedure above outputs a winning strategy for F.

Given this last proposition we can conclude our proof with the following lemma.

Lemma 1 For any multisets of formulas Γ, Δ, there is a derivation of the sequent Γ ⊢ Δ in GKs if, and only if, there is a strategic derivation of Γ ⊢ Δ.
Proof The direction from right to left is straightforward: each strategic derivation is a derivation in GKs. The other direction results from a structural induction on the derivation π in GKs. All cases are straightforward except when π ends with the application of a ⇒L rule or an ∃R rule. Let us discuss the ∃R rule, which together with a similar result in [23], dealing with all the propositional cases, entails our proposition. If π ends in an ∃R rule application then, by induction hypothesis, there is a strategic derivation π1 of its premise Γ ⊢ A[t/x], ∃x A, Δ. If A[t/x] is active we are done. If not, we can suppose, without loss of generality, that the rule application in which A[t/x] is active is just above the last rule R of π1. The “hard” case is when R is a ∀R rule and A[t/x] = ∃y B(t, y), i.e., π1 has the following shape:

⋮
⊢ B(t, t′), ∃y B(t, y), D(y), Δ
⊢ ∃y B(t, y), D(y), Δ                  (∃R)
⊢ ∃y B(t, y), ∀w D, Δ                  (∀R)

The problem is that the term t can contain a free occurrence of y. In this case we permute the ∃R upwards in this way:

⋮
⊢ B(t, t′), ∃y B(t, y), D(y), Δ
⊢ ∃y B(t, y), D(y), Δ                  (∃R)
⊢ ∃x A, D(y), Δ                        (∃R)
⊢ ∃x A, ∀w D, Δ                        (∀R)

This way we obtain the strategic proof we wanted. This concludes our proof of the equivalence between winning strategies for our dialogical games and the existence of a proof in classical logic (here viewed, without loss of generality, as a strategic GKs proof). Our result, the equivalence of proofs in GKs and winning strategies, can be proved for a sequent calculus which is complete for classical logic. We did not do so, because the proof is much trickier and the intuitive meaning of ⊥ in games is obscure, so we prefer to present it for GKs, which is equivalent to the complete sequent calculus GKc in the sense of Proposition 6.
4 Categorical Grammars and Automated Theorem Proving Type-logical grammars are a family of frameworks for the analysis of natural language based on logic and type theory. Type-logical grammars are generally fragments of intuitionistic linear logic, with the Curry-Howard isomorphism of intuitionistic logic serving as the syntax-semantics interface. Figure 2 shows the standard architecture of type-logical grammars.
Fig. 2 The standard architecture of type-logical grammars
1. given some input text, a lexicon translates words into formulas, resulting in a judgment in some logical calculus, such as the Lambek calculus or some variant/extension of it,
2. the grammaticality of a sentence corresponds to the provability of this statement in the given logic (where different proofs can correspond to different interpretations/readings of a sentence),
3. there is a forgetful mapping from the grammaticality proof into a proof of multiplicative, intuitionistic linear logic,
4. by the Curry-Howard isomorphism, this produces a linear lambda-term representing the derivational meaning of the sentence (that is, it provides instructions for how to compose the meanings of the individual words),
5. we then substitute complex lexical meanings for the free variables corresponding to the lexical entries to obtain a representation of the logical meaning of the sentence,
6. finally, we use standard theorem proving tools (in first- or higher-order logic) to compute entailment relations between (readings of) sentences.
To make this more concrete, we present a very simple example, using the Lambek calculus. The Lambek calculus has two connectives (we ignore the product connective ‘•’ here, since it has somewhat more complicated natural deduction rules and it is not used in the examples): A/B, pronounced A over B, representing a formula looking for a B constituent to its right to form an A, and B\A, pronounced B under A, representing a formula looking for a B constituent to its left to form an A. Table 3 shows the logical rules of the calculus. We’ll look at the French sentence ‘Un Suédois a gagné un prix Nobel’ (A Swede won a Nobel prize). Figure 3 shows a Lambek calculus proof of this sentence. It shows that when we assign the formula n, for (common) noun, to ‘prix’ and n\n to ‘Nobel’, we can derive ‘prix Nobel’ as an n. Similarly, when we assign np/n to ‘un’ we can combine
Table 3 The Lambek calculus

[\E]: from  Γ ⊢ A  and  Δ ⊢ A\B  infer  Γ, Δ ⊢ B
[/E]: from  Δ ⊢ B/A  and  Γ ⊢ A  infer  Δ, Γ ⊢ B
[\I]: from  A, Γ ⊢ B  infer  Γ ⊢ A\B
[/I]: from  Γ, A ⊢ B  infer  Γ ⊢ B/A
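As a toy illustration of the two elimination rules of Table 3 (this is not the Grail parser, and it only handles categories whose argument part is atomic), forward and backward application can be coded directly on category strings:

def combine(left, right):
    # /E: A/B followed by B gives A
    if left.endswith('/' + right):
        return left[:-(len(right) + 1)]
    # \E: B followed by B\A gives A
    if right.startswith(left + '\\'):
        return right[len(left) + 1:]
    return None

# 'prix' : n combined with 'Nobel' : n\n gives n; then 'un' : np/n gives np.
assert combine('n', 'n\\n') == 'n'
assert combine('np/n', 'n') == 'np'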
Fig. 3 Lambek calculus proof of ‘Un Suédois a gagné un prix Nobel’ (A Swede won a Nobel prize)
Table 4 Multiplicative intuitionistic linear logic with linear lambda-term labeling

[⊸E]: from  Γ ⊢ M : A ⊸ B  and  Δ ⊢ N : A  infer  Γ, Δ ⊢ (M N) : B
[⊸I]: from  Γ, x : A ⊢ M : B  infer  Γ ⊢ λx.M : A ⊸ B
this with ‘prix Nobel’ of type n to produce ‘un prix Nobel’ as a noun phrase np. We can continue the proof as shown in Fig. 3 to show that ‘Un Suédois a gagné un prix Nobel’ is a main, declarative sentence s. The lambda-term of the corresponding linear logic proof (according to the rules of Table 4) is (g (u p)) (u s) (we have simplified a bit here, treating ‘a gagné’ and ‘prix Nobel’ as units). We then substitute the lexical semantics to obtain the logical representation of the meaning of this sentence. The simple substitutions are suédois for s and prix_Nobel for p. The two complicated substitutions are the two occurrences of u, which are translated as follows.
λP^{e→t} λQ^{e→t} ∃x.[(P x) ∧ (Q x)]
This is the standard Montague-style analysis of a generalised quantifier. It abstracts over two properties P and Q and states that there is an x which satisfies these two properties. Because of our choice of np for the quantifier (instead of a more standard higher-order type like s/(np\s)), the type for the transitive verb has to take care of the quantifier scope. The lexical entry for the transitive verb below chooses the subject wide scope reading.
λN^{(e→t)→t} λM^{(e→t)→t} (M λx.(N λy.gagner(x, y)))
Fig. 4 Grail output for the semantics of ‘Un Suédois a gagné un prix Nobel’
Substituting these terms into the lambda-term for the derivation and normalising the resulting term produces the following. ∃x∃y.[suédois(x) ∧ prix_Nobel(y) ∧ gagner(x, y)] Even though this is an admittedly simple example, it is important to note that, although slightly simplified for presentation here, the output for this example and other examples in this paper are automatically produced by the wide-coverage French parser which is part of the Grail family of theorem provers [31]: Grail uses a deep learning model to predict the correct formulas for each word, finds the best way to combine these lexical entries and finally produces a representation of a logical formula. The full Grail output for the meaning of the example sentence is shown in Fig. 4. Grail uses discourse representation structures [26] for its meaning representation, which is essentially a graphical way to represent formulas in first-order logic. Besides providing a readable presentation of formulas, discourse representation structures also provide a dynamic way of binding, with applications to the treatment of anaphora in natural language. The variables d0 , y0 and z 0 in the top part of the rightmost box represent existentially quantified variables, y0 is a swede, z 0 is a prize (named after Nobel) and d0 is a variable for an eventuality—essentially denoting a slice of space-time—the inner box indicates that this ‘winning’ event must have occurred at a time before ‘maintenant’ (now), i.e., in the past. Even though the meaning assigned is in some ways simplistic, the advantage is that it can be automatically obtained and that it is of exactly the right form for logic-based entailment tasks.
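As an illustration of step 5 of the recipe above (lexical substitution followed by normalisation), the following sketch rebuilds the example meaning with NLTK's lambda-term machinery; it assumes a standard NLTK installation, uses ASCII predicate names, and is of course not Grail's own code.

from nltk.sem.logic import Expression, ApplicationExpression

read = Expression.fromstring
un      = read(r'\P Q.exists x.(P(x) & Q(x))')     # Montague-style 'un'
a_gagne = read(r'\N M.M(\x.N(\y.gagner(x,y)))')    # 'a gagné', subject wide scope
suedois = read('suedois')
prix    = read('prix_Nobel')

def app(f, arg):
    return ApplicationExpression(f, arg)

# Derivational term (g (u p)) (u s), then beta-reduction:
meaning = app(app(a_gagne, app(un, prix)), app(un, suedois)).simplify()
print(meaning)
# exists x.(suedois(x) & exists y.(prix_Nobel(y) & gagner(x,y))), up to variable renaming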
5 Textual Entailment
What can dialogical argumentation contribute to the study of textual entailment? In natural language processing, textual entailment is usually defined as a relation between text fragments that holds whenever the truth of one text fragment follows
from another text. Textual entailment recognition is the task of determining, given text fragments, whether the relation of textual entailment holds between these texts. Our examples below are taken from the FraCaS benchmark, but translated into French. This is due to the fact that our methodology involves the use of Grail, and the latter is developed mainly for the French language. Recently a French version of the FraCaS data set has been developed [2]. In this work we do not use this version and the examples are translated by us. In future work, however, we will evaluate examples taken from [2]. The FraCaS benchmark was built in the mid 1990s; the aim was developing a general framework for computational semantics. The data set consists of problems, each containing one or more statements and one yes/no question. An example taken from the data set is the following:
(1) A Swede won a Nobel prize.
(2) Every Swede is a Scandinavian.
(3) Did a Scandinavian win a Nobel prize? [Yes]
5.1 First Example We illustrate our methodology to solve inference problems using examples. First of all we turn the question (3) into an assertion, i.e., (4) Some Scandinavian won a Nobel prize. We then translate each sentence in French and use Grail on each sentence in order to get a logical formula. In the enumeration below we report, in order: the sentence in English. A word-for-word translation, then a more natural paraphrase which takes into account French grammar and idioms, and, finally, the logical formula that Grail outputs from the input of the latter (5) A Swede won a Nobel prize Un suédois a gagné un Nobel prix Un suédois a gagné le prix Nobel ∃x∃y.[suédois(x) ∧ prix_Nobel(y) ∧ gagner(x, y)] (6) Every Swede is a Scandinavian Tout suédois est un scandinave Tout suédois est scandinave ∀u.(suédois(u) ⇒ scandinave(u)) (7) Some Scandinavian won a Nobel prize Un scandinave a gagné un Nobel prix Un scandinave a gagné un prix Nobel ∃w.∃z.[(scandinave(w) ∧ (prix_Nobel(z) ∧ gagner(w, z))]
We then construct a winning strategy for the formula H1 ∧ H2 ∧ . . . ∧ Hn ⇒ C where each Hi is the logical formula that Grail associates to each statement from the data set, and C is the formula that Grail associates to the assertion obtained from the pair question–answer in the data set.
F = ∃x∃y.[su(x) ∧ p_N(y) ∧ g(x, y)] ∧ ∀u.[su(u) ⇒ sc(u)] ⇒ ∃w∃z.[sc(w) ∧ (p_N(z) ∧ g(w, z))]
In the above formula su stands for suédois, p_N for prix_Nobel, g for gagner and sc for scandinave. A winning strategy for the formula F is shown in Fig. 5 in two steps.
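As a cross-check of this first example (and only as an illustration: the chapter's point is the dialogical strategy of Fig. 5, not an off-the-shelf prover), the same entailment can be fed to NLTK's resolution prover, assuming a standard NLTK installation:

from nltk.sem.logic import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
h1 = read('exists x.exists y.(suedois(x) & prix_Nobel(y) & gagner(x,y))')
h2 = read('all u.(suedois(u) -> scandinave(u))')
c  = read('exists w.exists z.(scandinave(w) & prix_Nobel(z) & gagner(w,z))')

print(ResolutionProver().prove(c, [h1, h2]))   # expected to print True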
5.2 Second Example
(8) Some Irish delegates finished the survey on time.
(9) Did any delegates finish the survey on time? [Yes]
The answer to the question is affirmative. This means that if (8) is true then the sentence “some delegate finished the survey on time” must also be true.
(10) Some Irish delegates finished the survey on time
Certains irlandais délégués ont terminé l’enquête à temps
Certains délégués irlandais ont terminé l’enquête à temps
∃x∃y((délégué(x) ∧ irlandais(x)) ∧ (enquête(y) ∧ terminé-à-temps(x, y)))
(11) Some delegates finished the survey on time
Certains délégués ont terminé l’enquête à temps
Certains délégués ont terminé l’enquête à temps
∃x∃y(délégué(x) ∧ (enquête(y) ∧ terminé-à-temps(x, y)))
We have that F1 ⇒ F2, where
F1 = ∃x∃y((délégué(x) ∧ irlandais(x)) ∧ (enquête(y) ∧ terminé-à-temps(x, y)))
F2 = ∃x∃y(délégué(x) ∧ (enquête(y) ∧ terminé-à-temps(x, y)))
Figure 6 shows the winning strategy for this formula.
Fig. 5 Winning strategy showing entailment for the first example
Fig. 6 Winning strategy showing entailment for the second example
5.3 Third Example
(12) No delegate finished the report on time
(13) Did any Scandinavian delegate finish the report on time? [No]
In this example, the question should get a negative reply. A positive answer would be implied by the existence of a Scandinavian delegate who finished the report in the time allotted. Thus the sentence (12) plus the sentence Some Scandinavian delegate finished the report on time should imply a contradiction. We first translate the two sentences in French and use Grail to get the corresponding logical formulas.
(14) No delegate finished the report on time
Aucun délégué n’a terminé le rapport à temps
Aucun délégué n’a terminé le rapport à temps
∀x(délégué(x) ⇒ ¬terminé-le-rapport-à-temps(x))
(15) Some Scandinavian delegate finished the report on time
Un scandinave délégué a terminé le rapport à temps
Un délégué scandinave a terminé le rapport à temps
∃x((délégué(x) ∧ scandinave(x)) ∧ terminé-le-rapport-à-temps(x))
The two formulas
F1 = ∀x(délégué(x) ⇒ ¬terminé-le-rapport-à-temps(x))
F2 = ∃x((délégué(x) ∧ scandinave(x)) ∧ terminé-le-rapport-à-temps(x))
are contradictory. So there exists a winning strategy for the formula ¬(F1 ∧ F2), as shown in Fig. 7. We recall that the expression ¬F is just a shortcut for F ⇒ ⊥, which enjoys most of the properties of negation but not all of them. For reasons of space, two moves are omitted at the end of the strategy in Fig. 7. Both of them are assertion moves of the form (!, ⊥). The move closer to the root of the strategy is an O-move. The second, which is the leaf of the strategy, is a P-move enabled by the O-move that is just below the root of the strategy.
5.4 Fourth Example In the last example we focus on a series of sentences that our system should not solve, because the question asked neither has a positive nor a negative answer. (16) A Scandinavian won a Nobel prize. (17) Every Swede is a Scandinavian (18) Did a Swede win a Nobel prize? [Don’t know]
Fig. 7 Winning strategy showing contradiction for the third example
This means that, on the basis of the information provided, we can neither say that a Swede has won a Nobel Prize nor that there are no Swedes who have won a Nobel Prize.
(19) A Scandinavian won a Nobel prize
Un scandinave a gagné un Nobel prix
Un scandinave a gagné un prix Nobel
∃x∃y(scandinave(x) ∧ (prix_Nobel(y) ∧ gagner(x, y)))
(20) Every Swede is a Scandinavian
Tout suédois est un scandinave
Tout suédois est scandinave
∀u.(suédois(u) ⇒ scandinave(u))
Call the formula in (19) F1 and the formula in (20) F2. In dialogical logic terms, the fact that we do not have enough information to answer the question (18), either in a positive fashion or in a negative way, means that there is no winning strategy for the formula F1 ∧ F2 ⇒ F3 nor for the formula F1 ∧ F2 ⇒ ¬F3, where the formula F3 is
F3 = ∃w∃z(suédois(w) ∧ (prix_Nobel(z) ∧ gagner(w, z)))
In general, given a sentence F of first order logic, it is not decidable whether F is valid. However, in some cases we can manage this problem, and luckily the present case is one of them. We consider what a winning strategy for the formula F1 ∧ F2 ⇒ F3 must look like. A winning strategy S for this formula will necessarily contain a dialogue whose last move Mn is a P-move that asserts suédois(t) for some term t in the language. Since suédois(t) is an atomic formula, by Proposition 3 above suédois(t) must occur both as a positive and negative Gentzen subformula of F1 ∧ F2 ⇒ F3, but this is not the case. Thus there is no winning strategy for the latter formula.
Let us now discuss why there is no winning strategy for the formula F1 ∧ F2 ⇒ ¬F3. First of all, Proposition 1 guarantees that each game won by P ends with the assertion of some atomic formula, and that this assertion is a P-move. By Proposition 3 above, the only candidate for this is again suédois(t) for some term t in the language. If the P-move Mn is an assertion of suédois(t), then it must be an attack. If it were a defence instead, this would mean that there must be a formula F′ of the form ∀w.suédois(w), or ∃w suédois(w), or F′′ ∨ suédois(t), or F′′ ∧ suédois(t), or F′′ ⇒ suédois(t), such that P asserts F′. This implies that this formula F′ must be a positive Gentzen subformula of F1 ∧ F2 ⇒ ¬F3. But no such formula exists. Thus the move Mn asserting suédois(t) must be an attack. Since the only formula that can be attacked by this means is the formula suédois(t) ⇒ scandinave(t), O can answer by asserting scandinave(t), and P cannot win the game. Thus there is no winning strategy for the formula F1 ∧ F2 ⇒ ¬F3.
6 Conclusion In this paper, we adapted our simple version of argumentative dialogues and strategies of [9] to two-sided sequents (hypotheses and conclusions): this point of view better matches natural language statements, because the assumption sentences of a textual entailment task can be viewed as sequent calculus hypotheses, while the conclusion text can be viewed as the conclusion of the sequent. In the present paper, we successfully use the syntactic and semantic platform Grail to “translate” natural language sentences into DRS that can be viewed as logical formulas. This brings us closer to inferentialist semantics: a sentence S can be interpreted as all argumentative dialogues in natural language whose conclusion is S—under assumptions corresponding to word meaning and to the speaker beliefs. We are presently working to extend our work with semantics modelled in classical first order logic to a broader setting which models semantics in first-order modal logic. Indeed, modal reasoning (temporal, deontic, alethic, etc.) is rather common in natural language argumentation. Regarding the architecture of our model of natural language argumentation we would like to encompass lexical meaning as axioms along the lines of [9] and to use hypotheses to model the way the two speakers differ in their expectations, beliefs and knowledge, using insights from existing work on functional roles in dialogue modelling [38]. We plan to explore the connection between our restricted view of dialogue, which only concerns argumentative dialogues (i.e., games), and well-developed theories of discourse and dialogue such as the one presented in the books [3, 20] or the more innovative approach of [27] whose viewpoint is closer to ours.
References 1. Abramsky, S., McCusker, G.: Game semantics. In: Berger, U., Schwichtenberg, H. (eds.) Computational Logic, pp. 1–55. Springer, Berlin (1999) 2. Amblard, M., Beysson, C., de Groote, P., Guillaume, B., Pogodalla, S.: A French version of the FraCaS test suite. In: LREC 2020 - Language Resources and Evaluation Conference (2020) 3. Asher, N., Lascarides, A.: Logics of Conversation. Cambridge University Press, Cambridge (2003) 4. Boritchev, M., Amblard, M.: Picturing questions and answers - a formal approach to slam. In: Amblard, M., Musiol, M., Rebuschi, M. (eds.) (In)coherence of discourse - Formal and Conceptual issues of Language, Language, Cognition and Mind. Springer, Berlin (2019). To appear 5. Brandom, R.: Articulating Reasons: An Introduction to Inferentialism. Harvard University Press, Harvard (2000) 6. Breitholtz, E.: Enthymemes in dialogue: a micro-rhetorical approach. Ph.D. thesis, Humanistiska fakulteten. Göteborgs universitet (2014) 7. Castelnérac, B., Marion, M.: Arguing for inconsistency: dialectical games in the academy. In: Primiero, G. (ed.) Acts of Knowledge: History, Philosophy and Logic. College Publications (2009)
8. Castelnérac, B., Marion, M.: Antilogic. Balt. Int. Yearb. Cogn., Log. Commun. 8(1) (2013). https://doi.org/10.4148/1944-3676.1079 9. Catta, D., Pellissier, L., Retoré, C.: Inferential semantics as argumentative dialogues. In: González, S., González-Briones, A., Gola, A., Katranas, G., Ricca, M., Loukanova, R., Prieto, J. (eds.) Distributed Computing and Artificial Intelligence, Special Sessions, 17th International Conference, Advances in Intelligent Systems and Computing, pp. 72–81 (2020). https://doi. org/10.1007/978-3-030-53829-3_7 10. Cooper, R.: Update conditions and intensionality in a type-theoretic approach to dialogue semantics. In: Fernández, R., Isard, A. (eds.) Proceedings of the 17th Workshop on the Semantics and Pragmatics of Dialogue (2013) 11. Cooper, R., Crouch, D., Eijck, J.V., Fox, C., Genabith, J.V., Jaspars, J., Kamp, H., Milward, D., Pinkal, M., Poesio, M., Pulman, S., Briscoe, T., Maier, H., Konrad, K.: Using the framework (1996). FraCaS deliverable D16 12. Cozzo, C.: Meaning and Argument: A Theory of Meaning Centred on Immediate Argumental Role. Stockholm Studies in Philosophy. Almqvist & Wiksell International (1994) 13. Dagan, I., Roth, D., Sammons, M., Zanzotto, F.M.: Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies, vol. 6(4). Morgan & Claypool Publishers, San Rafael (2013). https://doi.org/10.2200/S00509ED1V01Y201305HLT023 14. Dummett, M.A.E.: What is a theory of meaning? In: Guttenplan, S. (ed.) Mind and Language. Oxford University Press, Oxford (1975) 15. Dummett, M.A.E.: The Logical Basis of Metaphysics. Harvard University Press, Harvard (1991) 16. Felscher, W.: Dialogues as a foundation for intuitionistic logic. In: Gabbay, D.M., Guenthner, F. (eds.) Handbook of Philosophical Logic, pp. 115–145. Springer Netherlands, Dordrecht (2002). http://dx.doi.org/10.1007/978-94-017-0458-8_2 17. Fouqueré, C., Quatrini, M.: Argumentation and inference a unified approach. In: The Baltic International Yearbook of Cognition, Logic and Communication Volume 8: Games, Game Theory and Game Semantics, pp. 1–41. New Paririe Press (2013) 18. Francez, N.: Proof Theoretical Semantics. Studies in Logic, vol. 57. College Publication (2015) 19. Frege, G.: The thought: a logical inquiry. Mind 65(259), 289–311 (1956). https://doi.org/10. 1093/mind/65.1.289 20. Ginzburg, J.: The Interactive Stance. Oxford University Press, Oxford (2012) 21. Girard, J.Y.: Proof-Theory and Logical Complexity – vol. I. Studies in Proof Theory. Bibliopolis, Napoli (1987) 22. Girard, J.Y.: Locus solum. Math. Struct. Comput. Sci. 11(3), 301–506 (2001) 23. Herbelin, H.: Séquents qu’on calcule : de l’interprétation du calcul des séquents comme calcul de λ-termes et comme calcul de stratégies gagnantes. Thèse d’université, Université Paris 7 (1995) 24. Hunter, J., Asher, N., Lascarides, A.: A formal semantics for situated conversation. Semant. Pragmat. 11, 1–52 (2018). https://doi.org/10.3765/sp.11.10 25. Hyland, M.: Game semantics. In: Pitts, A., Dybjer, P. (eds.) Semantics and Logics of Computation, pp. 131–182. Cambridge University Press, Cambridge (1997) 26. Kamp, H., Reyle, U.: From Discourse to Logic. Kluwer Academic Publishers, Dordrecht (1993) 27. Lecomte, A.: Meaning, Logic and Ludics. Imperial College Press, London (2011) 28. Lorenzen, P., Lorenz, K.: Dialogische Logik. Wissenschaftliche Buchgesellschaft (1978). https://books.google.fr/books?id=pQ5sQgAACAAJ 29. Montague, R.: English as a formal language. In: Visentini, B. (ed.) 
Linguaggi nella Societa e nella Tecnica, pp. 189–224. Edizioni di Communità, Milan, Italy (1970). (Reprinted in R. Thomason (ed) The collected papers of Richard Montague. Yale University Press, 1974.) 30. Moot, R.: A type-logical treebank for French. J. Lang. Model. 3(1), 229–264 (2015). http://dx. doi.org/10.15398/jlm.v3i1.92 31. Moot, R.: The Grail theorem prover: type theory for syntax and semantics. In: Chatzikyriakidis, S., Luo, Z. (eds.) Modern Perspectives in Type Theoretical Semantics, pp. 247–277. Springer, Berlin (2017). https://doi.org/10.1007/978-3-319-50422-3_10
32. Moot, R.: The Grail family of theorem provers (syntactic and semantic parser) (2018). https:// richardmoot.github.io 33. Moot, R., Retoré, C.: Natural language semantics and computability. J. Log., Lang. Inf. 28, 287–307 (2019). https://doi.org/10.1007/s10849-019-09290-7 34. Moss, L.: Natural logic. In: Lappin, S., Fox, C. (eds.) The Handbook of Contemporary Semantic Theory, 2 edn., pp. 559–592. Blackwell, Hoboken (2015) 35. Novaes, C.D.: Medieval “obligationes” as logical games of consistency maintenance. Synthese 145(3), 371–395 (2005). http://www.jstor.org/stable/20118602 36. Prawitz, D.: The epistemic significance of valid inference. Synthese 187(3), 887–898 (2012). https://doi.org/10.1007/s11229-011-9907-7 37. Retoré, C.: The montagovian generative lexicon T yn : a type theoretical framework for natural language semantics. In: Matthes, R., Schubert, A. (eds.) 19th International Conference on Types for Proofs and Programs (TYPES 2013), Leibniz International Proceedings in Informatics (LIPIcs), vol. 26, pp. 202–229. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2014). https://doi.org/10.4230/LIPIcs.TYPES.2013.202 38. Sabah, G., Prince, V., Vilnat, A., Ferret, O., Vosniadiou, S., Dimitracopoulou, A., Papademetriou, E., Tsivgouli, M.: What dialogue analysis can tell about teacher strategies related to representational changes. In: Kayser, D., Vosniadou, S. (eds.) Modelling Changes in Understanding: Case Studies in Physical Reasoning, Advances in Learning and Instruction, pp. 223–279. Pergamon Press, Oxford (2000) 39. Troelstra, A.S., Schwichtenberg, H.: Basic Proof Theory. Cambridge University Press, USA (1996) 40. Wittgenstein, L.: Philosophische Untersuchungen/Philosophical Investigations. Oxford University Press, Oxford (1953). Translated by G. E. M, Anscombe/Bilingual edition
A Novel Approach to Determining the Quality of News Headlines Amin Omidvar, Hossein Pourmodheji, Aijun An, and Gordon Edall
Abstract Headlines play a pivotal role in engaging and attracting news readers since headlines are the most visible parts of the news articles, especially in online media. Due to this importance, news agencies are putting much effort into producing high-quality news headlines. However, there is no concise definition of headline quality. We consider headlines as high quality if they are attractive to readers, and highly related to the article contents. While almost all the previous studies considered headline quality prediction as either clickbait detection (which is a binary text classification problem), or popularity prediction (which is a regression problem), our model employs four quality indicators to incorporate these two factors. In this paper, we first discuss the previous works on the news headline quality detection. We then propose a machine learning-based model to predict the quality of a headline based on four quality indicators before the publication of the news article. The proposed model is an extended version of our previously proposed model in a way that it considers sentiment features of headlines as well. We conduct experiments on a news dataset and compare our method with the state-of-the-art NLP models. Keywords Headline quality · Deep learning
A. Omidvar (B) · H. Pourmodheji · A. An Department of Electrical Engineering and Computer Science, York University, Toronto, ON, Canada e-mail: [email protected] H. Pourmodheji e-mail: [email protected] A. An e-mail: [email protected] G. Edall The Globe and Mail, Toronto, ON, Canada e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. Loukanova (ed.), Natural Language Processing in Artificial Intelligence—NLPinAI 2020, Studies in Computational Intelligence 939, https://doi.org/10.1007/978-3-030-63787-3_8
1 Introduction People’s attitude toward reading news articles is changing in a way that people are more willing to read online articles than paper-based ones. In the past, people bought a newspaper, saw almost all the pages while scanning headlines, and read through articles that seemed interesting [1]. The role of headlines was to help readers have a clear understanding of the topics of the article. However, today, online news publishers are changing the role of headlines in a way that headlines are the most important way to gain readers’ attention. One important reason is that online newspapers rely on the income generated from the subscriptions and clicks made by their readers [2]. Furthermore, publishers need to attract more readers than their competitors if they want to succeed in this competitive industry. The aforementioned reasons are the most important ones why some of the online news media come up with exaggerating headlines to lure the readers into clicking on their headlines. These likable headlines may increase the number of clicks but, at the same time, will disappoint the readers since they exaggerate the content of the news articles [3]. Therefore, having a mechanism that can predict the quality of news headlines before publication can help authors choose those headlines that not only increase readers’ attention but also satisfy their expectations. Also, using high-quality headlines will help to bring more traffic from social media, e.g., Twitter, Facebook to the news portal, which will increase revenue through both advertisements, and subscriptions. However, determining the quality of headlines is not an easy task. One important reason is a lack of concise definition for the headline quality in previous studies. Some of the previous works have considered quality as how interesting headlines are from the readers’ point of view. In this case, they usually measure quality by considering the number of readers’ clicks, or likes on social media. Some of the other works have considered accurate headlines as high quality. Accurate headlines are those that are related to the content of their articles. We consider a headline that can attract many viewers, i.e., popular, and is highly related to the content of its article, i.e., not clickbait as a high-quality headline. We define four quality indicators to determine the quality of headlines, which we describe in Sect. 3. In general, previous studies can be classified into two categories: clickbait detection and headline popularity prediction. Most of the previous studies are in the clickbait detection category. Clickbait detection has been considered as a binary text classification problem where the goal is to determine whether a headline is misleading or not. All misleading headlines have low quality since they do not satisfy the expectations of their readers. However, clickbait detection models cannot detect high-quality headlines. The problem is that not all non-clickbait headlines are of high quality since they may not be interesting enough to catch the attention of newsreaders. The focus of studies in the headline popularity prediction category is to predict the popularity of news headlines mostly in terms of the number of visits, shared, or liked by newsreaders. Almost all of these studies considered headline popularity prediction as a regression problem. Unpopular headlines were considered low-quality
headlines since they failed to engage newsreaders. However, not all popular headlines have high quality since they may be clickbait. Thus, neither clickbait detection nor popularity prediction models can determine high-quality headlines. In this work, we propose a novel headline quality prediction approach that can distinguish high-quality headlines from low-quality ones. The main contributions of this research are as follows: 1. We propose a novel approach to determining the quality of published headlines using dwell time and click counts of the articles, and we provide four headline quality indicators. By using this approach, we can label the quality of headlines in a news article data set of any size automatically, which is not possible by employing human annotators. Using human annotators to label data is costly and requires much time and effort. It may also result in inconsistent labels due to subjectivity. To the best of our knowledge, none of the previous research has done this for headline quality detection. 2. We develop a deep neural network (DNN) based predictive model that incorporates some advanced features of DNN to predict the quality of unpublished headlines using the data labelled by the above approach. The proposed model uses the proposed headline quality indicators and considers the similarity between the headline and its article. The rest of this paper is organized as follows. In Sect. 2, the most relevant works regarding headline quality in computer science and psychology fields are studied. In Sect. 3, we propose four quality indicators to represent the quality of headlines. Also, we label our dataset using a novel way to calculate the proposed quality indicators for published news articles. Next, we propose a novel deep learning architecture in Sect. 4 to predict the headline quality for unpublished news articles. We use the calculated headline quality from Sect. 3 as ground truth to train our model. Then in Sect. 5, our proposed model is compared with baseline models. Finally, this study is wrapped up with a conclusion in Sect. 6.
2 Related Work Many studies in different areas such as computer science, psychology, anthropology, and communication have been conducted on the popularity, and accuracy of the news headlines over the past few years. In this section, the most relevant works in the domain of computer science, and psychology are briefly described. Researchers manually examined 151 news articles from four online sections of the El Pais, which is a Spanish Newspaper, to find out features that are important to catch the readers’ attention. They also analyzed how important linguistic techniques such as vocabulary, and words, direct appeal to the reader, informal language, and simple structures are in order to gain the attention of readers [4]. In another research, more than a million headlines from two content marketing platforms were examined to find out how essential features are in terms of turning
guest users into subscribed ones and catching users’ attention. The authors found out clickbait techniques can boost subscription rates and users’ attention temporarily [5]. In order to determine the organic reach of the tweets, social sharing patterns were analyzed in [6]. To reach this goal, several tweets from newspapers, which are known to publish a high ratio of clickbait, and non-clickbait contents, were gathered. The authors showed how the differences between customer demographics, follower graph structure, and type of text content could influence the quality of the tweets. Then, they used these differences as a set of features to recognize clickbait headlines. Ecker et al. [7] studied how misinformation in news headlines could affect newsreaders. They found out headlines have a vital role in shaping readers’ attitudes toward the content of news. In [2], they extracted features from the content of 69,907 news articles in order to find approaches that can help to attract clicks. They discovered that the sentiment of a headline is strongly correlated to the popularity of its news article. Most of the published news articles have negative headlines, and the production rate of negative headlines have been remained constant over time. They also showed headlines that are intensely positive, or negative tend to attract more clicks, while neutral ones get fewer clicks. Some distinctive characteristics between accurate and clickbait headlines in terms of words, entities, sentence patterns, and paragraph structures are discovered in [8]. In the end, the authors proposed an interesting set of 14 features to recognize how accurate headlines are. In another work, a linguistically-infused network was proposed to distinguish clickbait from accurate headlines using the passages of both article and headline along with the article’s images [9]. To do that, they employed Long Short-Term Memory (LSTM) and Convolutional Neural Network architectures to process text, and image data, respectively. One interesting research measured the click-value of individual words of headlines. The authors proposed a headline click-based topic model (HCTM) based on latent Dirichlet allocation (LDA) to identify words that can bring more clicks for headlines [10]. In another related research [11], a useful software tool was developed to help authors to compose effective headlines for their articles. The software uses state of the art NLP techniques to recommend keywords to authors for inclusion in articles’ headlines to make headlines look more intersecting than the other works. The authors calculated two local and global popularity measures for each keyword and use a supervised regression model to predict how likely headlines will be widely shared on social media. Deep Neural Networks have become a widely used technique that has produced very promising results in news headline popularity tasks in recent years [12–14]. Most NLP approaches employ deep learning models, and they do not usually need heavy feature engineering and data cleaning. However, most of the traditional methods rely on the graph data of the interactions between users and contents. For detecting clickbait headlines, many research have been conducted so far [15– 18]. In [19], authors launched a clickbait challenge competition, and also released two supervised, and unsupervised datasets that contain over 80,000, and 20,000 samples, respectively. Each sample contains news content such as headline, article, media, and keywords. 
Samples of the supervised dataset are labelled via five human
annotators into four categories: not clickbaiting, slightly clickbaiting, considerably clickbaiting, and heavily clickbaiting. The average of the judges' scores is taken as the final score of each sample, and the goal is to predict this score for the samples in the test dataset. A leading model in the clickbait challenge competition [3], called albacore,1 employed a bi-directional GRU along with fully connected NN layers to determine how clickbait-like each headline is. The authors showed that the headline posted on Twitter, i.e., the postText field, is the most important feature for predicting the judges' scores, possibly because the human evaluators only saw the posted headline when labelling each sample. This approach not only ranked first in terms of Mean Squared Error (MSE) but was also the fastest among all proposed models. Another study on the clickbait challenge dataset proposed a federated hierarchical hybrid model to detect clickbait where titles and contents are stored by different parties [20]. Almost all previous clickbait detection research assumes that all required data is available locally and can be used to train a machine learning model. However, in many real-world situations, training data is distributed across different parties, and the parties cannot share their data with each other because of data privacy policies. The authors assumed two parties and assigned different fields of the training data to each of them; they then proposed a hierarchical hybrid network architecture to train their clickbait detection model, which achieves accuracy comparable to models that have access to all the data locally. Another emerging research topic is finding clickbait videos on YouTube. One of the most successful clickbait detection techniques for YouTube [21] uses thumbnails, headlines, comments, and video statistics to decide how clickbait-like videos are. The authors created a dataset by crawling YouTube and extracting these fields. They converted text data, i.e., comments and headlines, and thumbnails to embedding vectors using sent2vec [22] and a CNN, respectively. They also found that when a user watches a clickbait video, YouTube's recommender system recommends them more clickbait videos, resulting in a frustrating experience for that user: after watching a clickbait video, users are recommended 4.1 times more clickbait videos than non-clickbait ones. In [14], the authors used only headlines to predict the popularity of news articles. The proposed method was evaluated on Russian and Chinese datasets with over 800,000 samples in total. They focused on headlines because users click on an article after reading its headline, so the headline is the starting point of the user's decision to open the article. They used a Bidirectional LSTM (BLSTM) to predict the number of views for each news article. In another related work [13], the authors employed a BLSTM to predict the popularity of news posted on Facebook using only headlines; this was the first attempt at predicting news popularity on social media using only posted news titles.

1 https://www.clickbait-challenge.org/#results.
A test-rollout strategy has been used on Yahoo's home page to choose the best headline variant from the users' point of view [23]. In this strategy, multiple variations of a news headline are shown to randomized, same-size user buckets for a defined period, and the variant that gains the most clicks is then selected for the article permanently. This strategy has also been used to select the best banner in online advertising. However, a large proportion of an article's clicks occur early in its life, since freshness is a key factor in news article popularity. To the best of our knowledge, none of the previous studies analyzed the quality of headlines by considering both their popularity and their truthfulness, i.e., being non-clickbait. The reason is that almost all previous research, especially on clickbait detection, treated the problem as a binary classification task, and most of it depends on human evaluators to label the dataset. In our proposed data labelling approach, we determine the quality of headlines based on four quality indicators that capture both popularity and validity, and we introduce a novel approach to calculate these four quality indicators automatically from users' activity logs. Our trained deep learning model then determines not only how popular headlines are but also how honest and accurate they are. This research is an extended version of our model for headline quality detection [24]. Since polarity can play an important role in the popularity of headlines, we extended the model by adding a pre-trained sentiment analysis neural network to its architecture. We used the Yelp dataset2 to train the sentiment analysis network and then transferred it to our new domain.
3 Labelling Data

In this section, a novel approach is introduced to calculate the quality of published headlines based on users' interactions with articles. This approach is used for labelling our dataset.
3.1 Data

Our data is provided by The Globe and Mail,3 a major Canadian newspaper. It consists of a news corpus dataset (articles and their metadata) and a log dataset (readers' interactions with the news website). Every time a reader opens an article, writes a comment, or takes any other trackable action, the action is stored as a record in a log data warehouse.
2 https://www.yelp.com/dataset. 3 https://www.theglobeandmail.com/.
Generally, every record contains 246 captured attributes such as event ID, user ID, time, date, browser, and IP address. The data we obtained have been anonymized. The log data provide useful insights into readers' behaviours. However, there is noise and there are inconsistencies in the clickstream data that should be cleaned before calculating any measures, applying any models, or extracting any patterns (e.g., users may leave an article open in the browser for a long time while doing other activities, such as browsing other websites in another tab). In such cases, some articles may record falsely long dwell times for some readers. There are approximately 2 billion records of users' actions in the log dataset. We use the log dataset to find how many times each article has been read and how much time users spent reading it. We call these two measures click count and dwell time, respectively.
3.2 Quality Indicators

Due to the high cost of labelling supervised training data using human annotators, large datasets are not available for most NLP tasks [25]. In this section, we calculate the quality of published articles using the articles' click count and dwell time measures. Using the proposed approach, we can label a database of any size automatically and use those labels as ground truth to train deep learning models. The dwell time of article $a$ is computed using formula (1):

$D_a = \dfrac{\sum_{u} T_{a,u}}{C_a}$    (1)
where $C_a$ is the number of times article $a$ was read, and $T_{a,u}$ is the total amount of time that user $u$ has spent reading article $a$. Thus, the dwell time of article $a$, i.e., $D_a$, is the average amount of time spent on the article during a user visit. The values of read count and dwell time are normalized to the range zero to one. Considering these two measures of headline quality, we can define four quality indicators, shown by the four corners of the rectangle in Fig. 1. We did not normalize articles' dwell time by article length, since the correlation and mutual information between reading time and article length were 0.2 and 0.06, respectively, which indicates a very low dependency between these two variables in our dataset.

Indicator 1: High Dwell Time but Low Read Count. Articles close to this indicator were interesting for users because of their high dwell time, but their headlines were not interesting enough to motivate users to click on the articles. However, those users who did read these articles spent a significant amount of time reading them.

Indicator 2: High Dwell Time and High Read Count. Articles close to indicator 2 had interesting headlines, since many users opened them, and the articles were impressive as well because of their high dwell time.
Fig. 1 Representing news headlines’ quality with respect to the four quality indicators
Indicator 3: Low Dwell Time but High Read Count. Articles close to this indicator have a high read count but low dwell time. These headlines were interesting for users, but their articles were not. We call these headlines misleading, since the articles do not meet readers' expectations. As we can see in Fig. 1, very few articles reside in this group.

Indicator 4: Low Dwell Time and Low Read Count. Headlines of these articles were not successful in attracting users, and those who read them did not spend much time reading them.

In formula (2), we measure the Euclidean distance between a headline and each of the four quality indicators. We then subtract each calculated distance from $\sqrt{2}$ to convert it into a similarity. Finally, the Softmax function is used to convert the calculated similarities into a probability distribution:

$\begin{bmatrix} P_{a,1} \\ P_{a,2} \\ P_{a,3} \\ P_{a,4} \end{bmatrix} = \mathrm{Softmax}\!\left(\begin{bmatrix} \sqrt{2} - \|(C_a,\ 1-D_a)\|_2 \\ \sqrt{2} - \|(1-C_a,\ 1-D_a)\|_2 \\ \sqrt{2} - \|(1-C_a,\ D_a)\|_2 \\ \sqrt{2} - \|(C_a,\ D_a)\|_2 \end{bmatrix}\right)$    (2)
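For concreteness, the following minimal Python sketch implements formula (2); the function name, the use of NumPy, and the assumption that click count and dwell time are already normalized per article are ours, not part of the original pipeline.

```python
import numpy as np

def soft_labels(click_count, dwell_time):
    """Soft quality labels per formula (2): similarity (sqrt(2) minus Euclidean
    distance) to each of the four indicator corners, passed through a softmax.

    click_count, dwell_time: per-article values, already normalized to [0, 1].
    Returns an (n_articles, 4) array of probabilities over indicators 1..4.
    """
    c = np.asarray(click_count, dtype=float)
    d = np.asarray(dwell_time, dtype=float)
    # Indicator corners in (read count, dwell time) space:
    # 1: (0, 1)   2: (1, 1)   3: (1, 0)   4: (0, 0)
    corners = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
    points = np.stack([c, d], axis=1)                                        # (n, 2)
    dist = np.linalg.norm(points[:, None, :] - corners[None, :, :], axis=2)  # (n, 4)
    sim = np.sqrt(2.0) - dist                           # distance -> similarity
    exp = np.exp(sim - sim.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

# Example: an article with many clicks but little reading time leans toward indicator 3.
print(soft_labels([0.9], [0.1]).round(3))
```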
4 Predict Headline Quality

In this section, we propose a novel model to predict the quality of unpublished news headlines. To the best of our knowledge, we are the first to consider latent features of headlines, bodies, and the semantic relation between them to find the quality of news headlines.
4.1 Problem Definition

We consider the task of headline quality prediction as a multiclass classification problem. We assume our input is a dataset $D = \{(H_i, A_i)\}_{i=1}^{N}$ of $N$ news articles, where each news article consists of a header $H_i$ and an article body $A_i$. An approach for learning the quality of a headline is to define a conditional probability $P(I_j \mid H_i, A_i, \theta)$ for each quality indicator $I_j$ with respect to the header text $H_i = \{t_1, t_2, \ldots, t_K\}$ and the article text $A_i = \{z_1, z_2, \ldots, z_m\}$, parameterized by a model with parameters $\theta$. We then estimate our prediction for each news article in our database as:

$\hat{y}_i = \underset{j \in \{1,2,3,4\}}{\mathrm{argmax}}\ P(I_j \mid H_i, A_i, \theta)$    (3)
4.2 Proposed Model

In this section, we propose a deep learning model to predict the quality of headlines before publication. The architecture of the proposed model is illustrated in Fig. 2.
4.2.1 Embedding Layer
This layer, which is available in the Keras library,4 converts the one-hot encoding of each word in headlines and articles to dense word embedding vectors. The embedding vectors are initialized using GloVe embedding vectors [26]. We find that 100-dimensional embedding vectors lead to the best result. Also, we use a dropout layer with a rate of 0.2 on top of the embedding layer, dropping 20% of the output units.
4 https://keras.io/layers/embeddings/.
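A minimal Keras sketch of this layer is given below; the vocabulary size is a placeholder, and the GloVe weight matrix is stubbed with random values, whereas in practice each row would hold the GloVe vector of the corresponding vocabulary word.

```python
import numpy as np
from tensorflow.keras import initializers, layers

vocab_size = 50_000                         # hypothetical vocabulary size
# Placeholder matrix; in practice each row is the 100-d GloVe vector of a vocabulary word.
glove_matrix = np.random.normal(size=(vocab_size, 100)).astype("float32")

embedding = layers.Embedding(
    input_dim=vocab_size, output_dim=100,
    embeddings_initializer=initializers.Constant(glove_matrix),  # initialize with GloVe
)
dropout = layers.Dropout(0.2)               # a rate of 0.2 drops 20% of the embedding outputs
```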
Fig. 2 The proposed model for predicting news headlines’ quality according to the four quality indicators
4.2.2 Similarity Matrix Layer
Because one of the main characteristics of a high-quality headline is that it should be related to the body of its article, the main goal of this layer is to find out how related each headline is to the article's body. The inputs to this layer are the word embedding vectors of the headline and of the first paragraph of the article. We use the first paragraph since it is used extensively in news summarization tasks owing to its high importance in representing the whole news article [27]. In Fig. 2, each cell $c_{i,j}$ represents the similarity between a word $z_i$ from the article and a word $t_j$ from the headline, calculated as the cosine similarity between their embedding vectors using formula (4). Using the cosine similarity function enables our model to capture the semantic relation between the embedding vectors of the two words:

$C_{ij} = \dfrac{\vec{z}_i^{\,T}\,\vec{t}_j}{\|\vec{z}_i\|\ \|\vec{t}_j\|}$    (4)
Also, the 2-d similarity matrix allows us to use a 2-d CNN, which has shown excellent performance for text classification through abstracting visual patterns from text data [28]. In fact, matching headlines and articles is viewed as an image recognition problem, and a 2-d CNN is used to solve it.
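A possible NumPy sketch of formula (4) applied to whole sequences is shown below; the function and variable names are ours.

```python
import numpy as np

def similarity_matrix(article_emb, headline_emb, eps=1e-8):
    """Cosine similarity between every article word (rows) and headline word (columns).

    article_emb:  (m, d) embeddings of the first paragraph's words (z_i)
    headline_emb: (k, d) embeddings of the headline's words (t_j)
    Returns an (m, k) matrix C with C[i, j] = cos(z_i, t_j).
    """
    a = article_emb / (np.linalg.norm(article_emb, axis=1, keepdims=True) + eps)
    h = headline_emb / (np.linalg.norm(headline_emb, axis=1, keepdims=True) + eps)
    return a @ h.T
```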
4.2.3 Convolution and Max-Pooling Layers
Three convolutional network layers, each containing a 2-d CNN and a 2-d max-pooling layer, are used on top of the similarity matrix layer. The whole similarity matrix, i.e., $x^{(0)} = M$, is scanned by the first 2-d CNN layer to generate the first feature map $x^{(1,k)}_{i,j}$ based on formula (5):

$x^{(1,k)}_{i,j} = f\left(\sum_{y=1}^{v_k}\sum_{m=1}^{v_k} w^{(1,k)}_{y,m}\cdot x^{(0)}_{i+y,\,j+m} + b^{(1,k)}\right)$    (5)

Then, different levels of matching patterns are extracted from the similarity matrix in each convolutional network layer based on formula (6):

$x^{(l+1,k)}_{i,j} = f\left(\sum_{p=1}^{n_l}\left(\sum_{y=1}^{v_k}\sum_{m=1}^{v_k} w^{(l+1,k)}_{y,m}\cdot x^{(l,p)}_{i+y,\,j+m} + b^{(l+1,k)}\right)\right)$    (6)
where $x^{(l+1)}$ is the computed feature map at level $l+1$, $w^{(l+1,k)}$ is the $k$-th square kernel at level $l+1$, which scans the whole feature map $x^{(l)}$ from the previous layer, $v_k$ is the size of the kernel, $n_l$ is the number of feature maps at level $l$, $b^{(l+1)}$ is the bias parameter at level $l+1$, and ReLU [29] is chosen as the activation function $f$. We then obtain the pooled feature maps by applying a dynamic pooling method [30]. We use (5 × 5), (3 × 3), and (3 × 3) kernel sizes, 8, 16, and 32 filters, and a (2 × 2) pool size in the three convolutional network layers, respectively. The result of the final 2-d max-pooling layer is flattened to a 1-d vector, which then passes through a dropout layer with a rate of 0.2. Finally, the size of the output vector is reduced to 100 using a fully-connected layer.
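A hedged Keras sketch reflecting the layer sizes stated above is given below; the input dimensions are assumptions, and plain max-pooling stands in for the dynamic pooling method of [30].

```python
from tensorflow import keras
from tensorflow.keras import layers

def matching_cnn(headline_len=30, paragraph_len=60):
    """Three Conv2D + MaxPooling2D blocks over the similarity matrix: kernels (5x5),
    (3x3), (3x3); filters 8, 16, 32; (2x2) pooling; then flatten, dropout 0.2 and a
    100-unit fully-connected layer."""
    return keras.Sequential([
        keras.Input(shape=(paragraph_len, headline_len, 1)),  # similarity matrix as a 1-channel image
        layers.Conv2D(8, (5, 5), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.2),
        layers.Dense(100, activation="relu"),  # reduce the flattened vector to 100 dimensions
    ])
```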
4.2.4 BERT
Google's Bidirectional Encoder Representations from Transformers (BERT) [31] is employed to transform variable-length inputs, i.e., headlines and articles, into fixed-length vectors in order to find the latent features of both. BERT's goal is to produce a language model using the Transformer architecture; details on how the Transformer works are provided in [32]. BERT is pre-trained on a huge dataset to learn general knowledge that can be combined with the knowledge acquired on a small dataset. We use the publicly available pre-trained BERT model, i.e., BERT-Base, Uncased,5 published by Google. We employ BERT as a feature-based transfer learning approach in our model.

5 https://github.com/google-research/bert#pre-trainedmodels.
After encoding each headline into a fixed-length vector using BERT, a multi-layer perceptron is used to project each encoded headline into a 100-d vector. The same procedure is performed for the articles as well.
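As an illustration of feature-based use of BERT followed by a projection to 100 dimensions, a sketch using the Hugging Face transformers library is shown below; this library choice, the [CLS]-vector pooling, and the maximum sequence length are our assumptions, not necessarily the authors' tooling.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()                                    # feature-based use: BERT weights stay frozen
projection = torch.nn.Linear(768, 100)         # trainable head mapping the encoding to 100-d

def encode(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        output = bert(**tokens)
    pooled = output.last_hidden_state[:, 0, :]  # [CLS] token vector as the fixed-length encoding
    return projection(pooled)                   # shape (1, 100)

vec = encode("A sample headline about the news")
```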
4.2.5 Topic Modelling
We use Non-Negative Matrix Factorization (NNMF) [33] and Latent Dirichlet Allocation (LDA) [34] from the Scikit-learn library to find topics in both headlines and articles. Since headlines are significantly shorter than articles, we use separate topic models for headlines and articles. Even though both NNMF and LDA can be used for topic modelling, their approaches differ: the former is based on linear algebra, while the latter relies on probabilistic graphical modelling. We find that NNMF extracts more meaningful topics than LDA on our news dataset. We create a matrix A in which each article is represented as a row and the columns are the TF-IDF values of the article's words. TF-IDF stands for term frequency-inverse document frequency, a statistical measure of how important a word is to an article within a group of articles. The term frequency (TF) part is the number of times a word appears in an article divided by the total number of words in that article. The inverse document frequency (IDF) part weighs down frequent words and scales up rare words across the entire corpus. We then use NNMF to factorize matrix A into two matrices W and H, the document-to-topic matrix and the topic-to-word matrix, respectively. When these two matrices are multiplied, the result approximates matrix A with the lowest error (formula (7)):

$A_{n\times v} \approx W_{n\times t}\, H_{t\times v}$    (7)
where $n$ is the number of articles, $v$ is the size of the vocabulary, and $t$ is the number of topics ($t \ll v$), which we set to 50. As shown in Fig. 2, we use the topics, i.e., each row of matrix $W$, as input features to the Feedforward Neural Network (FFNN) part of our model.
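A minimal Scikit-learn sketch of this TF-IDF plus NNMF step is given below; parameter choices other than the 50 topics are ours. Separate calls would be made for headlines and for articles, as described above.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_features(texts, n_topics=50):
    """TF-IDF + NNMF topic model: returns the document-to-topic matrix W, whose rows
    serve as FFNN input features, plus the fitted vectorizer and model for reuse."""
    vectorizer = TfidfVectorizer(stop_words="english")
    A = vectorizer.fit_transform(texts)              # n_documents x vocabulary TF-IDF matrix
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    W = nmf.fit_transform(A)                         # document-to-topic matrix
    return W, vectorizer, nmf                        # nmf.components_ is the topic-to-word matrix H
```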
4.2.6 Sentiment Analysis
Positive or negative headlines attract more clicks than neutral ones. The idea in this part is to add a pre-trained sentiment analysis model to our proposed architecture using a feature-based transfer learning technique. Adding this analysis helps our model estimate the probability of each quality indicator more accurately. We use the Yelp dataset to train a deep FFNN. There are 6,685,900 reviews in the dataset, each with a star rating between 1 and 5. We remove non-English reviews from the Yelp dataset, as well as reviews containing special characters that are not commonly used in news headlines. Moreover, we try to make
the distributions of the source domain (Yelp reviews) and the destination domain (news headlines) similar in terms of length by removing very short and very long reviews from Yelp. After applying these processing steps, we end up with 195,224 reviews. Our task is ranking learning, since it has characteristics of both classification (we have five classes) and regression (the classes are ordered 5 > 4 > 3 > 2 > 1). We follow the approach in [35] to convert stars into an appropriate labelling form. As in Sect. 4.2.4, we employ pre-trained BERT to extract latent features from reviews, which are used as the input to the sentiment analysis model. Our sentiment analysis model is a deep FFNN containing six hidden layers. The layer sizes from the input layer to the output layer are 768, 1000, 500, 300, 200, 150, 100, and 5, respectively. The rectifier activation function is used for all layers except the last one, which uses the sigmoid activation function. After pretraining, we remove the output layer because it is too close to the target function and may therefore be biased toward the Yelp reviews' labels. We then employ the pre-trained network in our proposed architecture, as shown in Fig. 2, where the input is the BERT embedding of news headlines and the output is a 100-dimensional dense vector of sentiment latent features.
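A hedged Keras sketch of this network is shown below; the training loop and the conversion of stars into ordinal targets following [35] are omitted.

```python
from tensorflow import keras
from tensorflow.keras import layers

def sentiment_ffnn():
    """Deep FFNN over 768-d BERT review encodings with hidden sizes 1000, 500, 300,
    200, 150 and 100 (ReLU) and a 5-unit sigmoid output layer for the star labels."""
    model = keras.Sequential([keras.Input(shape=(768,))])
    for units in (1000, 500, 300, 200, 150, 100):
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(5, activation="sigmoid"))
    return model

pretrained = sentiment_ffnn()
# ... pretrain on BERT encodings of Yelp reviews and their converted star labels ...
pretrained.pop()   # drop the 5-unit output layer; the 100-d activations become sentiment features
```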
4.2.7 FFNN
As we can see in Fig. 2, FFNN layers are used in different parts of our proposed model. The rectifier is used as the activation function of all layers except the last one. The activation function of the last layer is softmax, which calculates the probability of the input example being in each quality indicator. We find that using the batch normalization layer before the activation layer in all layers helps to reduce the loss of our model. The reason is that the batch normalization layer normalizes the input to the activation function so that the data are centred in the linear part of the activation function.
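A minimal sketch of the Dense-BatchNorm-activation ordering described here, written as a Keras helper; the example input size is arbitrary.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dense_bn_relu(x, units):
    """A Dense layer with batch normalization applied before the activation,
    following the ordering described above (Dense -> BatchNorm -> ReLU)."""
    x = layers.Dense(units, use_bias=False)(x)   # the bias is redundant when BatchNorm follows
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

# Example usage on a 150-d feature vector:
inputs = keras.Input(shape=(150,))
hidden = dense_bn_relu(inputs, 100)
```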
5 Evaluation

5.1 Baselines

For evaluation, we have compared our proposed model with the following baseline models.
5.1.1 EMB + 1-d CNN + FFNN
This embedding layer is like the embedding layer of the proposed model; it converts a one-hot representation of the words to dense 100-d vectors. The
dropout layer is used on top of the embedding layer with a rate of 0.2, dropping 20% of the output units. Also, we use GloVe embedding vectors to initialize the word embedding vectors [26]. The next layer is a 1-d CNN, which works well for identifying patterns in data with a single spatial dimension, such as text, time series, and signals. Many recent NLP models have employed 1-d CNNs for text classification tasks [36]. The architecture comprises two convolutional layers on top of the embedding layer. The last layer is a single-layer FFNN using softmax as its activation function.
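A possible Keras sketch of this baseline is given below; the filter counts and the global max-pooling step used to obtain a fixed-length vector are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def cnn1d_baseline(vocab_size=50_000, max_len=30):
    """EMB + 1-d CNN + FFNN baseline: two Conv1D layers over the embedded headline
    and a softmax over the four quality indicators."""
    return keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, 100),
        layers.Dropout(0.2),
        layers.Conv1D(64, 3, activation="relu"),
        layers.Conv1D(64, 3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(4, activation="softmax"),
    ])
```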
5.1.2 Doc2Vec + FFNN
Doc2Vec6 is an implementation of the Paragraph Vector model proposed in [37]. It is an unsupervised learning algorithm that learns fixed-length vector representations for pieces of text of varying length, such as paragraphs and documents. The goal is to learn paragraph vectors by predicting the surrounding words in contexts sampled from the paragraph. It consists of two models: the Paragraph Vector Distributed Memory model (PV-DM) and the Paragraph Vector Distributed Bag of Words model (PV-DBOW), which ignores word order. The former has much higher accuracy than the latter, but their combination yields the best result. We convert headlines and bodies into two separate 100-d embedded vectors. These vectors are fed into an FFNN comprising two hidden layers of sizes 200 and 50, respectively. ReLU is used as the activation function of all FFNN layers except the last, which employs the softmax function.
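A minimal gensim sketch of the Doc2Vec step is shown below; the corpus is a placeholder and the training hyperparameters are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(texts, dm):
    """Train a 100-d Doc2Vec model; dm=1 gives PV-DM, dm=0 gives PV-DBOW."""
    docs = [TaggedDocument(words=t.lower().split(), tags=[i]) for i, t in enumerate(texts)]
    return Doc2Vec(docs, vector_size=100, dm=dm, epochs=20, min_count=1)

headlines = ["first sample headline", "second sample headline"]   # placeholder corpus
pv_dm = train_doc2vec(headlines, dm=1)
pv_dbow = train_doc2vec(headlines, dm=0)
vector = pv_dm.infer_vector("an unseen headline".split())          # 100-d representation
```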
5.1.3 EMB + BGRU + FFNN
This baseline is a Bidirectional Gated Recurrent Unit (BGRU) on top of the embedding layer. A GRU employs two gates, the reset gate $r_t$ and the update gate $z_t$, to track input sequences without using separate memory cells:

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$    (8)

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$    (9)

$\tilde{h}_t = \tanh(W_h x_t + r_t * (U_h h_{t-1}) + b_h)$    (10)

$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$    (11)
6 https://radimrehurek.com/gensim/models/doc2vec.html.
In formulas (8) and (9), $W_r$, $U_r$, $b_r$, $W_z$, $U_z$, and $b_z$ are the parameters of the GRU that are learned during the training phase. The candidate state and the new state at time $t$ are then calculated by formulas (10) and (11), respectively. In formulas (10) and (11), $*$ denotes an elementwise multiplication between the reset gate and the past state, so it determines which part of the previous state should be forgotten. Furthermore, the update gate in formula (11) determines which information from the past should be kept and which should be updated. The forward pass reads the post text from $x_1$ to $x_N$, and the backward pass reads it from $x_N$ to $x_1$:

$\overrightarrow{h}_n = \overrightarrow{GRU}\left(x_n,\ \overrightarrow{h}_{n-1}\right)$    (12)

$\overleftarrow{h}_n = \overleftarrow{GRU}\left(x_n,\ \overleftarrow{h}_{n+1}\right)$    (13)

$h_n = \left[\overrightarrow{h}_n,\ \overleftarrow{h}_n\right]$    (14)
Moreover, the input to the FFNN layer is the concatenation of the last outputs of the forward and backward passes.
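A hedged Keras sketch of this baseline follows; the vocabulary size, sequence length, and GRU width are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def bgru_baseline(vocab_size=50_000, max_len=30, gru_units=64):
    """EMB + BGRU + FFNN baseline: embedded post text, a bidirectional GRU whose final
    forward and backward states are concatenated, and a softmax over the four indicators."""
    return keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, 100),
        layers.Bidirectional(layers.GRU(gru_units)),   # concatenates forward and backward states
        layers.Dense(4, activation="softmax"),
    ])
```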
5.1.4 EMB + BLSTM + FFNN
This baseline is a Bidirectional Long Short-Term Memory (BLSTM) network [38] on top of the embedding layer. The embedding and FFNN layers are the same as in the previous baseline; the only difference is using LSTM instead of GRU.
5.2 Evaluation Metrics

Mean Absolute Error (MAE) and Relative Absolute Error (RAE) are used to compare the results of the proposed model with those of the baseline models on the test dataset. As shown in formula (15), RAE is relative to a simple predictor that always outputs the average of the ground truth. The ground truth, the predicted values, and the average of the ground truth are denoted by $P_{ij}$, $\hat{P}_{ij}$, and $\bar{P}_j$, respectively.

$RAE = \dfrac{\sum_{i=1}^{N}\sum_{j=1}^{4}\left|\hat{P}_{ij} - P_{ij}\right|}{\sum_{i=1}^{N}\sum_{j=1}^{4}\left|P_{ij} - \bar{P}_j\right|}$    (15)
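The two metrics can be computed as in the following NumPy sketch, where p_true and p_pred are N x 4 arrays of ground-truth and predicted indicator probabilities.

```python
import numpy as np

def mae(p_true, p_pred):
    """Mean absolute error over all N x 4 indicator probabilities."""
    return np.mean(np.abs(np.asarray(p_true) - np.asarray(p_pred)))

def rae(p_true, p_pred):
    """Relative absolute error: the model's total absolute error divided by the total
    absolute error of always predicting the per-indicator mean of the ground truth."""
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    baseline_error = np.abs(p_true - p_true.mean(axis=0, keepdims=True)).sum()
    return np.abs(p_pred - p_true).sum() / baseline_error
```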
5.3 Experimental Results

We train our proposed model with the same configuration twice: once using hard labels (assigning a label of 1 to one quality indicator and 0 to the other three for each sample), and once using soft labels, calculated by formula (2). We use the categorical cross-entropy loss function for the former and the MSE loss function for the latter. We found that our proposed model is trained more efficiently with soft targets than with hard targets, in line with what was shown in [39]. The reason could be that soft targets provide more information per training example than hard targets, and much less variance in the gradient between training examples. For instance, in a task such as MNIST [40], one image of a 1 may be given probabilities $10^{-6}$ and $10^{-9}$ of being a 7 and a 9, respectively, while for another image of a 1 it may be the other way round. We therefore decided to train our proposed model and all the baseline models using soft targets only. The results of the proposed model and the baseline models on the test data are shown in Table 1. The loss function of the proposed model and of all the baseline models is the MSE between the quality indicators predicted by the models and the ground truth calculated in Sect. 3.2 (soft labels). Moreover, we use the Adam optimizer for the proposed model and all our baseline models [41]. We split our dataset into train, validation, and test sets using 70%, 10%, and 20% of the data, respectively. Our proposed model achieved the best results, with the lowest RAE among all models. Surprisingly, TF-IDF performs better than the other baseline models. This may be because the number of articles in our dataset is not large (28,751), so complex baseline models may overfit the training dataset. We also report the performance of the proposed model with and without the pre-trained sentiment analysis part. The results show that the pre-trained sentiment analysis part helps the model decrease RAE by more than two points.

Table 1 Comparison between the proposed model and baseline models
Models                                      MAE     RAE
EMB + 1-D CNN + FFNN                        0.044   105.08
Doc2Vec + FFNN                              0.043   101.61
EMB + BLSTM + FFNN                          0.041   97.92
EMB + BGRU + FFNN                           0.039   94.38
TF-IDF + FFNN                               0.038   89.28
Proposed Model without Sentiment Analysis   0.034   80.56
Proposed Model                              0.030   78.23
6 Conclusions

We defined the quality of news headlines using four proposed quality indicators and proposed a method to calculate headline quality with respect to these indicators using users' browsing history data. News media can use this approach not only to analyze the quality of their published headlines but also to build a labelled training dataset without any help from human evaluators. Moreover, we proposed a novel model to predict the quality of headlines before their publication, using a combination of sentiment, topics, and latent features of headlines and articles, along with their similarities. We used soft labels rather than hard labels to train the proposed neural network model and our baselines more efficiently. The experiment was conducted on a real dataset obtained from a major Canadian newspaper. The results showed that the proposed model outperformed all the baselines in terms of Mean Absolute Error (MAE) and Relative Absolute Error (RAE). As headlines play an important role in catching readers' attention, the proposed method is of great practical value for online news media. It helps authors predict how popular a newly written, unpublished headline will be, and it determines whether the headline is accurate by matching it to the content of the article. It can thereby help authors choose the best headline among candidates for a newly written, unpublished article. For future work, we will investigate generative models that produce high-quality headlines for news articles. Generating high-quality headlines for a newly written article can help its author choose a headline that is not only related to the content of the article but also attractive from the readers' point of view.

Acknowledgements This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), The Globe and Mail, and the Big Data Research, Analytics and Information Network (BRAIN) Alliance established by the Ontario Research Fund – Research Excellence Program (ORF-RE).
References

1. Kuiken, J., Schuth, A., Spitters, M., Marx, M.: Effective Headlines of Newspaper Articles in a Digital Environment. Digit. Journal. 5, 1300–1314 (2017). https://doi.org/10.1080/21670811.2017.1279978
2. Reis, J., Benevenuto, F., de Melo, P.O.S.V., Prates, R., Kwak, H., An, J.: Breaking the News: First Impressions Matter on Online News, pp. 357–366 (2015)
3. Omidvar, A., Jiang, H., An, A.: Using neural network for identifying clickbaits in online news media. In: Annual International Symposium on Information Management and Big Data, pp. 220–232 (2018)
4. Palau-Sampio, D.: Reference press metamorphosis in the digital context: clickbait and tabloid strategies in Elpais.com. Commun. Soc. 29, 63–71 (2016). https://doi.org/10.15581/003.29.2.63-79
5. Rony, M.M.U., Hassan, N., Yousuf, M.: Diving deep into clickbaits: who use them to what extents in which topics with what effects? In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 232–239 (2017)
6. Chakraborty, A., Sarkar, R., Mrigen, A., Ganguly, N.: Tabloids in the Era of Social Media? Understanding the Production and Consumption of Clickbaits in Twitter (2017)
7. Ecker, U.K.H., Lewandowsky, S., Chang, E.P., Pillai, R.: The effects of subtle misinformation in news headlines. J. Exp. Psychol. Appl. (2014). https://doi.org/10.1037/xap0000028
8. Chakraborty, A., Paranjape, B., Kakarla, S., Ganguly, N.: Stop clickbait: detecting and preventing clickbaits in online news media. In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 9–16. IEEE, Davis (2016)
9. Glenski, M., Ayton, E., Arendt, D., Volkova, S.: Fishing for Clickbaits in Social Images and Texts with Linguistically-Infused Neural Network Models (2017)
10. Zhang, Y., Wallace, B.C.: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification (2015)
11. Szymanski, T., Orellana-Rodriguez, C., Keane, M.T.: Helping News Editors Write Better Headlines: A Recommender to Improve the Keyword Contents & Shareability of News Headlines (2017)
12. Bielski, A., Trzcinski, T.: Understanding multimodal popularity prediction of social media videos with self-attention. IEEE Access 6, 74277–74287 (2018). https://doi.org/10.1109/ACCESS.2018.2884831
13. Stokowiec, W., Trzciński, T., Wołk, K., Marasek, K., Rokita, P.: Shallow reading with deep learning: predicting popularity of online content using only its title. In: International Symposium on Methodologies for Intelligent Systems, pp. 136–145 (2017)
14. Voronov, A., Shen, Y., Mondal, P.K.: Forecasting popularity of news article by title analyzing with BN-LSTM network. In: Proceedings of the 2019 International Conference on Data Mining and Machine Learning, pp. 19–27 (2019)
15. Fu, J., Liang, L., Zhou, X., Zheng, J.: A convolutional neural network for clickbait detection. In: 2017 4th International Conference on Information Science and Control Engineering (ICISCE), pp. 6–10 (2017)
16. Venneti, L., Alam, A.: How Curiosity can be modeled for a clickbait detector (2018)
17. Wei, W., Wan, X.: Learning to identify ambiguous and misleading news headlines. In: IJCAI International Joint Conference on Artificial Intelligence, pp. 4172–4178 (2017)
18. Zhou, Y.: Clickbait detection in tweets using self-attentive network. In: Clickbait Challenge 2017 (2017)
19. Potthast, M., Gollub, T., Hagen, M., Stein, B.: The Clickbait Challenge 2017: towards a regression model for clickbait strength. In: Proceedings of the Clickbait Challenge (2017)
20. Liao, F., Zhuo, H.H., Huang, X., Zhang, Y.: Federated Hierarchical Hybrid Networks for Clickbait Detection (2019). arXiv:1906.00638
21. Zannettou, S., Chatzis, S., Papadamou, K., Sirivianos, M.: The good, the bad and the bait: detecting and characterizing clickbait on YouTube. In: 2018 IEEE Security and Privacy Workshops (SPW), pp. 63–69 (2018)
22. Pagliardini, M., Gupta, P., Jaggi, M.: Unsupervised learning of sentence embeddings using compositional n-gram features (2017). arXiv:1703.02507
23. Mao, Y., Chen, M., Wagle, A., Pan, J., Natkovich, M., Matheson, D.: A batched multi-armed bandit approach to news headline testing. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 1966–1973 (2018)
24. Omidvar, A., Pourmodheji, H., An, A., Edall, G.: Learning to determine the quality of news headlines. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence – Volume 1: NLPinAI, pp. 401–409. SciTePress, Setúbal (2020)
25. Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder (2018). arXiv:1803.11175
26. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543
27. Lopyrev, K.: Generating News Headlines with Recurrent Neural Networks, pp. 1–9 (2015)
28. Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., Cheng, X.: Text Matching as Image Recognition, pp. 2793–2799 (2016). https://doi.org/10.1007/s001700170197
29. Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving deep neural networks for LVCSR using rectified linear units and dropout. In: Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613 (2013)
30. Socher, R., Huang, E.H., Pennington, J., Ng, A.Y., Manning, C.D.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 (2011)
31. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
33. Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Glob. Optim. 58, 285–319 (2014). https://doi.org/10.1007/s10898-013-0035-4
34. Hoffman, M.D., Blei, D.M., Bach, F.: Online learning for latent dirichlet allocation. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010 (2010)
35. Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1279–1284 (2008)
36. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative Study of CNN and RNN for Natural Language Processing (2017)
37. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on International Conference on Machine Learning, vol. 32, pp. II-1188–II-1196. JMLR.org (2014)
38. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
39. Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network, pp. 1–9 (2015). https://doi.org/10.1063/1.4931082
40. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE (1998). https://doi.org/10.1109/5.726791
41. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 – Conference Track Proceedings (2015)