English Pages 126 Year 2021
Studies in Computational Intelligence 999
Roussanka Loukanova Editor
Natural Language Processing in Artificial Intelligence – NLPinAI 2021
Studies in Computational Intelligence Volume 999
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Editor Roussanka Loukanova Department of Algebra and Logic Institute of Mathematics and Informatics Bulgarian Academy of Sciences Sofia, Bulgaria
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-90137-0 ISBN 978-3-030-90138-7 (eBook) https://doi.org/10.1007/978-3-030-90138-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Computational and technological developments that incorporate natural language are proliferating. Adequate coverage of Natural Language Processing in Artificial Intelligence encounters problems on developments of specialised computational approaches and algorithms. Many difficulties are due to ambiguities in natural language and dependency of interpretations on contexts and agents, which can arise in computational systems based on nature of languages. Classical approaches proceed with relevant updates, and new developments emerge in theories of formal and natural languages, computational models of information and reasoning, and related computerised applications. The book covers theoretical work, approaches, applications, and techniques for computational models of information, language, and reasoning. Its focus is on computational processing of human language and relevant medium languages, which can be theoretically formal, or for programming and specification of computational systems. The goal is to promote intelligent natural language processing, along with models of computation, language, reasoning, and other cognitive processes. The Special Session on Natural Language Processing in Artificial Intelligence— NLPinAI 2021 (http://www.icaart.org/NLPinAI.aspx?y=2021) was held within the 13th International Conference on Agents and Artificial Intelligence—ICAART 2021 (http://www.icaart.org/?y=2021), by distance, Online Streaming, 4–6 February 2021. The series of the special sessions Natural Language Processing in Artificial Intelligence (NLPinAI) and its post-conference book volumes address the above challenge, by advancements of further research, and also sharing ideas and feedback between researchers. The book sequence Natural Language Processing in Artificial Intelligence— NLPinAI covers a variety of topics, e.g. 
• Logic Approaches to Natural Language Processing
• Classical and Non-Classic Logics for applications to NLP
• Type Theories for Applications to Natural Language
• Computational Grammar
• Large-Scale Grammars of Natural Languages
• Syntax, Semantics, Syntax-Semantics Interfaces
• Information Theory
• Statistical Approaches in Computational Linguistics and NLP
• Machine Learning of Grammar and Language
• Integrated Approaches in Computational Linguistics and NLP
The chapters of this book volume, NLPinAI 2021, are based on extended work on selected topics of the Special Session on Natural Language Processing in Artificial Intelligence – NLPinAI 2021.

Chapter 1 presents new developments of CatLog. The CatLog system is a categorial grammar parser and theorem-prover originally developed by Glyn Morrill and his co-authors. There are two variants of the extended Lambek calculus in two versions of CatLog, both of which are undecidable. The chapter focuses on fragments where the usage of the subexponential is restricted by specialised bracket (non-negative/non-positive) conditions. The authors prove that these fragments are decidable, and place them in the complexity hierarchy. Then, they present a practically important problem of predicting brackets, and prove one decidability and one undecidability result.

Chapter 2 is on automated reasoning as a computer assistant for building proofs of theorems in logic, with a focus on using the Isabelle proof assistant. The authors link two approaches, Epistemic Logic and Public Announcement Logic. Systems of epistemic logic can model reasoning with knowledge of agents. Public announcements can update knowledge of a system, users, and agents. The chapter presents formalisations of axiomatic systems for epistemic and public announcement logic, which improve the foundations of automated reasoning for logic and information.

Chapter 3 of the volume NLPinAI 2021 is a specialised, extensive study of verbal valences for Norwegian. The work presents an exhaustive resource catalogue, NorVal, which contains formal descriptions of the valence features of more than 6300 lemmas. The theoretical research together with the valence resource NorVal has great potential for applications to NLP, not only for Norwegian, but also for computerised translation systems, and as a resource for human use in translations and transcripts. It also serves as an example for similar developments for other human languages, including English.

Chapter 4 is on further prospects of computational linguistics of Arabic. It is a discussion of the potential and challenges that the Arabic language presents to NLP, by the nature of Arabic morphology, script, transcription, and transliteration.

September 2021
Roussanka Loukanova
Contents
Decidable Fragments of Calculi Used in CatLog
Max I. Kanovich, Stepan G. Kuznetsov, Stepan L. Kuznetsov, and Andre Scedrov
Interactive Theorem Proving for Logic and Information
Jørgen Villadsen, Asta Halkjær From, Alexander Birch Jensen, and Anders Schlichtkrull
A Valence Catalogue for Norwegian
Lars Hellan
Arabic Computational Linguistics: Potential, Pitfalls and Challenges
Elie Wardini
Author Index
Decidable Fragments of Calculi Used in CatLog
Max I. Kanovich (1,2), Stepan G. Kuznetsov (3), Stepan L. Kuznetsov (2,4; corresponding author), and Andre Scedrov (5)
1 University College London, Gower Street, London, UK
2 Computer Science Department, HSE University, 11 Pokrovsky Blvd., Moscow, Russia
3 Mathematics Department, HSE University, 6 Usacheva Street, Moscow, Russia
4 Steklov Mathematical Institute of RAS, 8 Gubkina Street, Moscow, Russia
5 Department of Mathematics, University of Pennsylvania, 209 South 33rd Street, Philadelphia, PA, USA
Abstract. CatLog is a categorial grammar parser/theorem-prover developed by Glyn Morrill and his co-authors. CatLog is based on an extension of Lambek calculus. A distinctive feature of this extension is the usage of brackets for controlled non-associativity and a subexponential modality whose contraction rule interacts with bracketing in a sophisticated way. We consider two variants of the calculus, appearing in different versions of CatLog. Both systems are, unfortunately, undecidable in general. We consider fragments where the usage of subexponential is restricted by so-called bracket non-negative/non-positive conditions, prove that these fragments are decidable, and pinpoint their place in the complexity hierarchy. We also consider a more complicated, but more practically interesting problem of inducing (guessing) brackets. For this problem, we prove one decidability and one undecidability result, and leave some open questions for further research.
Keywords: Lambek calculus · Categorial grammars · Subexponential modalities · Bracket modalities

1 Introduction
The Lambek calculus was introduced by J. Lambek [17] for mathematical description of natural language syntax, in the framework of categorial grammars. The idea of categorial grammar goes back to Ajdukiewicz [2] and Bar-Hillel [3], and Lambek-style grammars form a subclass of categorial grammar formalisms.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 1–24, 2022. https://doi.org/10.1007/978-3-030-90138-7_1
In a categorial grammar, each lexeme is annotated by a syntactic type (category), which is a formula of a specific non-classical substructural logic, e.g., the Lambek calculus. A sentence is considered grammatical if the sequent composed from types of lexemes is derivable in the given calculus. The original Lambek calculus is capable of handling basic cases like “John loves Mary.” In this example, syntactic types are assigned as follows:

John, Mary : N
loves : (N\S)/N

Here N and S are variables (primitive types), meaning “noun phrase” and “sentence” respectively. The \ and / operations, called left and right divisions, are directed implications. The type for “loves,” (N\S)/N, means that this word lacks two noun phrases (N), one on the left and one on the right, to form a sentence (S). Derivability of the sequent N, (N\S)/N, N → S in the Lambek calculus justifies “John loves Mary” as a grammatically correct sentence.

The Lambek calculus is usually formulated as a Gentzen-style sequent calculus. Formulae are constructed from a countable set of variables (p1, p2, p3, ...) using two divisions (\ and /) and product (·). A sequent is an expression of the form Γ → B, where Γ is a sequence of formulae and B is a formula. Axioms are sequents of the form A → A, and inference rules are as follows:

$$\frac{\Gamma \to B \quad \Delta_1, C, \Delta_2 \to D}{\Delta_1, C/B, \Gamma, \Delta_2 \to D}\ (/L) \qquad \frac{\Gamma, B \to C}{\Gamma \to C/B}\ (/R)$$

$$\frac{\Gamma \to A \quad \Delta_1, C, \Delta_2 \to D}{\Delta_1, \Gamma, A \backslash C, \Delta_2 \to D}\ (\backslash L) \qquad \frac{A, \Gamma \to C}{\Gamma \to A \backslash C}\ (\backslash R)$$

$$\frac{\Delta_1, A, B, \Delta_2 \to D}{\Delta_1, A \cdot B, \Delta_2 \to D}\ (\cdot L) \qquad \frac{\Delta \to A \quad \Gamma \to B}{\Delta, \Gamma \to A \cdot B}\ (\cdot R)$$
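The rules above can be read bottom-up as a proof search procedure, since every premise contains strictly fewer connectives than its conclusion. The following is a minimal sketch of such a search, not the algorithm used in CatLog. The tuple encoding of formulae, the function name `derivable`, and the `allow_empty` switch (which models a non-emptiness constraint on the antecedent in the two right division rules) are assumptions made purely for illustration.

```python
from functools import lru_cache

# Illustrative encoding (an assumption, not taken from the paper):
#   an atom is a str; ("/", C, B) is C/B; ("\\", A, C) is A\C; ("*", A, B) is A·B.

@lru_cache(maxsize=None)
def derivable(gamma, goal, allow_empty=False):
    """Backward cut-free search for the sequent gamma -> goal (gamma is a tuple)."""
    if len(gamma) == 1 and gamma[0] == goal:            # axiom A -> A
        return True
    if isinstance(goal, tuple):                         # right rules
        op, x, y = goal
        nonempty_ok = bool(gamma) or allow_empty        # constraint on /R and \R
        if op == "/" and nonempty_ok and derivable(gamma + (y,), x, allow_empty):
            return True
        if op == "\\" and nonempty_ok and derivable((x,) + gamma, y, allow_empty):
            return True
        if op == "*" and any(
            derivable(gamma[:i], x, allow_empty) and derivable(gamma[i:], y, allow_empty)
            for i in range(len(gamma) + 1)
        ):
            return True
    for k, f in enumerate(gamma):                       # left rules
        if not isinstance(f, tuple):
            continue
        op, x, y = f
        left, right = gamma[:k], gamma[k + 1:]
        if op == "/":      # f = x/y takes its argument sequence from the right
            for j in range(len(right) + 1):
                if derivable(right[:j], y, allow_empty) and \
                   derivable(left + (x,) + right[j:], goal, allow_empty):
                    return True
        elif op == "\\":   # f = x\y takes its argument sequence from the left
            for i in range(len(left) + 1):
                if derivable(left[i:], x, allow_empty) and \
                   derivable(left[:i] + (y,) + right, goal, allow_empty):
                    return True
        else:              # product on the left: replace A·B by A, B
            if derivable(left + (x, y) + right, goal, allow_empty):
                return True
    return False

loves = ("/", ("\\", "N", "S"), "N")      # (N\S)/N
adj = ("/", "CN", "CN")                   # CN/CN
very = ("/", adj, adj)                    # (CN/CN)/(CN/CN)

print(derivable(("N", loves, "N"), "S"))          # "John loves Mary"
print(derivable((very, "CN"), "CN", True))        # empty antecedents allowed
print(derivable((very, "CN"), "CN", False))       # non-emptiness constraint imposed
```

Under these assumptions the three calls print `True`, `True`, and `False`. The search is exponential in the worst case; memoisation via `lru_cache` only avoids re-exploring identical subgoals.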
The Lambek calculus comes in two variants, depending on whether we allow antecedents (left-hand sides of sequents) to be empty. The natural way to disallow empty antecedents is to impose the constraint “Γ is non-empty” on /R and \R. This constraint is called Lambek’s restriction and exists in the original Lambek calculus [17]. Later on, however, Lambek also introduced a variant of his calculus without this restriction [18]. From the point of view of algebraic logic (see [8]), the Lambek calculus with Lambek’s restriction is the logic of residuated partially ordered semigroups, while the Lambek calculus without this restriction corresponds to residuated partially ordered monoids. On the other hand, as a logical system, the Lambek calculus without Lambek’s restriction is a non-commutative intuitionistic variant of Girard’s [9] linear logic (see Abrusci [1]). From this perspective, Lambek’s restriction seems unnatural.

Lambek’s restriction is, however, desirable for linguistic applications. Without this restriction, Lambek grammars overgenerate, i.e., accept incorrect phrases as correct ones. This can be seen in the following example [21, § 2.5]: “very book.” Although ungrammatical, this phrase is accepted under the following natural type assignment:

book : CN
very : (CN/CN)/(CN/CN)
Here the primitive type CN stands for “common noun,” a noun phrase without an article. (Unlike N, such a phrase cannot be directly used as an object or subject.) Without Lambek’s restriction, the sequent (CN/CN)/(CN/CN), CN → CN is derivable, which declares “very book” a correct common noun group (which is actually not the case). Informally, the absence of Lambek’s restriction allows the usage of empty words. In our example, there is an “empty adjective” of type CN/CN between very and book: compare with a grammatically correct phrase “very interesting book,” (CN/CN)/(CN/CN), CN/CN, CN → CN.

More sophisticated natural language phenomena require extended versions of the Lambek calculus. In this paper, we focus on systems developed by G. Morrill and his co-authors for the CatLog natural language parser/theorem-prover [26, 28]. For a more general overview of Lambek-style categorial grammar formalisms, see Buszkowski [4], Carpenter [6], Morrill [31], Moot and Retoré [21], etc. In our linguistic examples, we follow Morrill [31] and later papers by Morrill and his co-authors.

The limitations of grammars based on the “pure” Lambek calculus show up when one tries to analyze complex and compound sentences. The core construction here is relativisation, which connects a dependent clause to the main sentence. In some easy cases, relativisation can still be described by means of the Lambek calculus. For example, in the noun phrase “the girl whom John loves” the following type assignment does the job:

the : N/CN
girl : CN
whom : (CN\CN)/(S/N)
John : N
loves : (N\S)/N
The sequent N/CN, CN, (CN\CN)/(S/N), N, (N\S)/N → N is derivable in the Lambek calculus, because “John loves” is a sentence lacking a noun phrase on the right, i.e., an object of type S/N. In more complicated situations, however, the “pure” Lambek calculus is insufficient. This can be seen in examples like “the girl whom John met yesterday” or “the paper that John signed without reading.” In the first example, the gap in S which should be filled by N is located in the middle: “John met ... yesterday,” which cannot be handled by Lambek divisions. In the second example, there are even two gaps, which should be filled by the same N (the paper): “John signed ... without reading ...” These two phenomena are called medial and parasitic extraction respectively, and Morrill suggests a structural (subexponential) modality, denoted by !, to handle them. This modality allows permutation (for medial extraction) and some form of contraction (for parasitic extraction).

On the other hand, extraction from compound sentences leads to overgeneration. The standard example here is “the girl whom John loves Mary and Pete loves.” This phrase is ungrammatical. However, it is parsed as a common noun group, CN, since “John loves Mary and Pete loves” is of type S/N (cf. “John loves Mary and Pete loves Ann” being of type S). In order to overcome this issue, a sentence obtained from two other sentences using “and” should be made an island, which cannot be penetrated by extraction. In Morrill’s system, islands are introduced and managed using brackets and bracket modalities, the idea of which goes back to Morrill himself [22] and Moortgat [20].

Thus, Morrill’s systems include both a subexponential and brackets. Moreover, they interact in a subtle way, since in the case of parasitic extraction one should penetrate islands. Morrill’s systems also include lattice-theoretic meet (∧) and join (∨), in other words, additive conjunction and disjunction.

In [16], we give a detailed proof-theoretic analysis of Morrill’s systems and prove undecidability of the corresponding derivability problems. The latter is unfortunate, since these systems were designed to be used in natural language parsing software. The present paper is a more optimistic sequel of [16]. Here we prove that these systems enjoy naturally defined decidable fragments, and prove upper complexity bounds for them. While the present paper is mostly self-contained, we suggest that the reader also consult the article [16] for a deeper discussion of Morrill’s systems from a proof-theoretic point of view.

The rest of the paper is organised as follows. In Sect. 2, we introduce three of Morrill’s systems, with certain proof-theoretic clarifications, following [16]. In Sect. 3, we define the bracket non-negative and bracket non-positive conditions, which are restrictions on the usage of bracket modalities under the subexponential. In Sect. 4, we consider the systems without additive operations and prove that imposing appropriate bracket conditions leads to decidability and, moreover, an NP upper complexity bound. In Sect. 5, we do the same for systems with additive operations; here the upper bound is PSPACE. Notice that these complexity bounds are tight, since the corresponding lower bounds are known already for systems without subexponentials and bracket modalities [11,32]. In Sect. 6, we consider a more complicated, but more practically interesting algorithmic problem of inducing brackets. For this problem, we prove undecidability (even under bracket conditions) for one of Morrill’s systems and, for the other one, decidability and complexity upper bounds. In the concluding Sect. 7 we discuss directions of further research.

2 Morrill’s Calculi

We define three systems, denoted by $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ respectively. The first one originates in works of Morrill and Valentín [30] and Morrill [24,25] from 2015–2017. The second one appears in more recent works of Morrill [27,28] from 2018–2019; however, here Morrill comes back to the ideas from earlier publications [23,31]. Finally, the third system, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, is a variant of $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ which employs Lambek’s restriction.
The formulations of Morrill’s systems we use here are those presented in our article [16]. These formulations, compared to Morrill’s original ones, are slightly clarified in order to maintain desired proof-theoretic properties (mainly cut elimination); see [16] for details. These clarifications do not alter linguistic applications of the systems. On the other hand, we notice that the systems presented here and in [16] are only fragments of the ones constructed by Morrill: Morrill’s original systems have up to 45 connectives, but we focus on the behaviour of brackets and subexponentials.

The version of the ‘newer’ Morrill’s calculus with Lambek’s non-emptiness restriction, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, was introduced in [16]. Morrill’s original formulations do not accommodate Lambek’s restriction. The possibility of consistently imposing Lambek’s restriction is in fact quite interesting, since more standard (sub)exponential modalities happen to be incompatible with this restriction [15]. The interaction of the subexponential with bracketing, however, made imposing Lambek’s restriction possible (which is desirable from the linguistic point of view).

Let us formally define the syntax of the calculi in question, following our article [16] and earlier works by Morrill.

Definition 1.1. Formulae are built from variables (primitive types) p1, p2, p3, ... and the unit constant 1 using five binary operations: \ (left division), / (right division), · (product, or multiplicative conjunction), ∧ (additive conjunction), ∨ (additive disjunction), and three unary operations: $\langle\rangle$, $[]^{-1}$ (bracket modalities), and ! (subexponential modality).

The sequential syntax of Morrill’s systems is more involved than the syntax of usual sequent calculi. Namely, left-hand sides of sequents, besides “comma” as a metasyntactic version of multiplication, include brackets for designating islands (controlled non-associativity) and so-called stoups for handling the subexponential modality. The formal definition, following Morrill [28], is as follows.

Definition 1.2. We define the following three notions simultaneously: stoup, tree term, and meta-formula.
• A stoup is a multiset of formulae: ζ = {A1, ..., An}. Here the order does not matter, while the number of occurrences does.
• A tree term is either a formula or a bracketed expression of the form [Ξ], where Ξ is a meta-formula.
• A meta-formula is an expression of the form ζ; Γ, where ζ is a stoup and Γ is a linearly ordered sequence of tree terms.

In a meta-formula, Γ could be empty; in this case, it is denoted by Λ. An empty stoup is omitted: we write just Γ instead of ∅; Γ. We use comma both for concatenation of tree term sequences and for multiset union of stoups. Adding one formula to the stoup is written as ζ, A (the bureaucratic way to write it would be $\zeta \uplus \{A\}$).
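The mutually recursive notions of Definition 1.2 map naturally onto a small recursive data type. The sketch below is only one possible representation, assuming formulae are kept as opaque strings; the class names `MetaFormula` and `Bracket` and the helpers `show` and `bracket_pairs` are hypothetical names introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Union

Formula = str  # formulae are kept opaque in this sketch (an assumption)

@dataclass
class MetaFormula:
    """ζ; Γ : a stoup (multiset of formulae) plus an ordered list of tree terms."""
    stoup: Counter = field(default_factory=Counter)
    terms: List["TreeTerm"] = field(default_factory=list)

@dataclass
class Bracket:
    """The tree term [Ξ]: a bracketed meta-formula, i.e. an island."""
    body: MetaFormula

TreeTerm = Union[Formula, Bracket]

def show(xi):
    """Render a meta-formula; an empty stoup is omitted, an empty Γ is written Λ."""
    gamma = ", ".join(t if isinstance(t, str) else f"[{show(t.body)}]" for t in xi.terms)
    zeta = ", ".join(sorted(xi.stoup.elements()))
    return f"{zeta}; {gamma or 'Λ'}" if zeta else (gamma or "Λ")

def bracket_pairs(xi):
    """Count pairs of brackets at all nesting depths."""
    return sum(1 + bracket_pairs(t.body) for t in xi.terms if isinstance(t, Bracket))

# The stoup {A, A} with tree terms B and the island [C; Λ].
xi = MetaFormula(Counter(["A", "A"]), ["B", Bracket(MetaFormula(Counter(["C"]), []))])
xi.stoup["D"] += 1            # "ζ, D": add one formula to the stoup
print(show(xi))
print(bracket_pairs(xi))
```

Using `Counter` for the stoup keeps multiplicities while ignoring order, which is exactly the multiset behaviour the definition asks for.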
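Definition 1.3. A sequent (in Morrill’s terminology, h-sequent) is an expression of the form Ξ → C, where C is a formula and Ξ is a meta-formula.

Let us first define $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$. These calculi share the same axioms and rules for all connectives, except the subexponential modality.

$$\frac{}{A \to A}\ (\mathrm{id})$$

$$\frac{\zeta_1; \Gamma \to B \quad \Xi(\zeta_2; \Delta_1, C, \Delta_2) \to D}{\Xi(\zeta_1, \zeta_2; \Delta_1, C/B, \Gamma, \Delta_2) \to D}\ (/L) \qquad \frac{\zeta; \Gamma, B \to C}{\zeta; \Gamma \to C/B}\ (/R)$$

$$\frac{\zeta_1; \Gamma \to A \quad \Xi(\zeta_2; \Delta_1, C, \Delta_2) \to D}{\Xi(\zeta_1, \zeta_2; \Delta_1, \Gamma, A \backslash C, \Delta_2) \to D}\ (\backslash L) \qquad \frac{\zeta; A, \Gamma \to C}{\zeta; \Gamma \to A \backslash C}\ (\backslash R)$$

$$\frac{\Xi(\zeta; \Delta_1, A, B, \Delta_2) \to D}{\Xi(\zeta; \Delta_1, A \cdot B, \Delta_2) \to D}\ (\cdot L) \qquad \frac{\zeta_1; \Delta \to A \quad \zeta_2; \Gamma \to B}{\zeta_1, \zeta_2; \Delta, \Gamma \to A \cdot B}\ (\cdot R)$$

$$\frac{\Xi(\zeta; \Delta_1, \Delta_2) \to A}{\Xi(\zeta; \Delta_1, \mathbf{1}, \Delta_2) \to A}\ (\mathbf{1}L) \qquad \frac{}{\Lambda \to \mathbf{1}}\ (\mathbf{1}R)$$

$$\frac{\Xi(\zeta; \Delta_1, A_1, \Delta_2) \to C \quad \Xi(\zeta; \Delta_1, A_2, \Delta_2) \to C}{\Xi(\zeta; \Delta_1, A_1 \vee A_2, \Delta_2) \to C}\ (\vee L) \qquad \frac{\Xi \to A_i}{\Xi \to A_1 \vee A_2}\ (\vee R_i),\ i = 1, 2$$

$$\frac{\Xi(\zeta; \Delta_1, A_j, \Delta_2) \to C}{\Xi(\zeta; \Delta_1, A_1 \wedge A_2, \Delta_2) \to C}\ (\wedge L_j),\ j = 1, 2 \qquad \frac{\Xi \to A_1 \quad \Xi \to A_2}{\Xi \to A_1 \wedge A_2}\ (\wedge R)$$

$$\frac{\Xi(\zeta; \Delta_1, A, \Delta_2) \to B}{\Xi(\zeta; \Delta_1, [[]^{-1}A], \Delta_2) \to B}\ ([]^{-1}L) \qquad \frac{[\Xi] \to A}{\Xi \to []^{-1}A}\ ([]^{-1}R)$$

$$\frac{\Xi(\zeta; \Delta_1, [A], \Delta_2) \to B}{\Xi(\zeta; \Delta_1, \langle\rangle A, \Delta_2) \to B}\ (\langle\rangle L) \qquad \frac{\Xi \to A}{[\Xi] \to \langle\rangle A}\ (\langle\rangle R)$$

The two systems, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, also share two rules for the subexponential modality:

$$\frac{\Xi(\zeta, A; \Gamma_1, \Gamma_2) \to B}{\Xi(\zeta; \Gamma_1, {!A}, \Gamma_2) \to B}\ (!L) \qquad \frac{\Xi(\zeta; \Gamma_1, A, \Gamma_2) \to B}{\Xi(\zeta, A; \Gamma_1, \Gamma_2) \to B}\ (!P)$$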
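The other two rules, !R and !C, are different. In $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, they are as follows:

$$\frac{\zeta; \Lambda \to B}{\zeta; \Lambda \to {!B}}\ (!R),\ \zeta \ne \varnothing \qquad \frac{\Xi(\zeta_1, \zeta_2; \Gamma_1, [\zeta_2, \zeta_3; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, \zeta_2, \zeta_3; \Gamma_1, \Gamma_2, \Gamma_3) \to C}\ (!C),\ \zeta_2 \ne \varnothing$$

In $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, these rules are as follows:

$$\frac{A; \Lambda \to B}{A; \Lambda \to {!B}}\ (!R) \qquad \frac{\Xi(\zeta_1, A; \Gamma_1, [\zeta_2, A; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, A; \Gamma_1, [[\zeta_2; \Gamma_2]], \Gamma_3) \to C}\ (!C)$$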
Finally, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ is obtained from $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ by imposing Lambek’s non-emptiness restriction in the following form:
• in the \R and /R rules, ζ; Γ is required to be non-empty (i.e., to be not ∅; Λ);
• in the !C rule, ζ2; Γ2 is required to be non-empty (in the same sense);
• the unit constant 1, with axiom 1R and rule 1L, is removed.

As shown in [16], the cut rule in the following form:

$$\frac{\xi; \Pi \to A \quad \Xi(\zeta; \Gamma_1, A, \Gamma_2) \to C}{\Xi(\xi, \zeta; \Gamma_1, \Pi, \Gamma_2) \to C}\ (\mathrm{cut})$$

is admissible in all three systems in question, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$. Thus, all derivations we analyze will be cut-free, but we may use cut to simplify construction of derivations.

In what follows, it will be convenient to restrict the id axiom to its atomic subcase: pi → pi, where pi is a variable. This restriction does not change the set of derivable sequents, due to the following lemma (which is mathematical folklore).

Lemma 1.1. In each of the three systems $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, the sequent A → A, for any A, has a cut-free derivation in which all id axioms are in the atomic form.

Proof. As usual, we proceed by induction on the structure of A. The base case of A = pi is given. For A = 1, we just apply the 1L rule to the 1R axiom. The other cases are considered as follows:

$$\dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_1, A_1 \backslash A_2 \to A_2}\ (\backslash L)}{A_1 \backslash A_2 \to A_1 \backslash A_2}\ (\backslash R) \qquad \dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_2 / A_1, A_1 \to A_2}\ (/L)}{A_2 / A_1 \to A_2 / A_1}\ (/R)$$

$$\dfrac{\dfrac{A_1 \to A_1}{A_1 \wedge A_2 \to A_1}\ (\wedge L) \quad \dfrac{A_2 \to A_2}{A_1 \wedge A_2 \to A_2}\ (\wedge L)}{A_1 \wedge A_2 \to A_1 \wedge A_2}\ (\wedge R) \qquad \dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_1, A_2 \to A_1 \cdot A_2}\ (\cdot R)}{A_1 \cdot A_2 \to A_1 \cdot A_2}\ (\cdot L)$$

$$\dfrac{\dfrac{A_1 \to A_1}{A_1 \to A_1 \vee A_2}\ (\vee R) \quad \dfrac{A_2 \to A_2}{A_2 \to A_1 \vee A_2}\ (\vee R)}{A_1 \vee A_2 \to A_1 \vee A_2}\ (\vee L) \qquad \dfrac{\dfrac{A \to A}{[A] \to \langle\rangle A}\ (\langle\rangle R)}{\langle\rangle A \to \langle\rangle A}\ (\langle\rangle L)$$

$$\dfrac{\dfrac{A \to A}{[[]^{-1}A] \to A}\ ([]^{-1}L)}{[]^{-1}A \to []^{-1}A}\ ([]^{-1}R) \qquad \dfrac{\dfrac{\dfrac{A \to A}{A; \Lambda \to A}\ (!P)}{A; \Lambda \to {!A}}\ (!R)}{{!A} \to {!A}}\ (!L)$$

Notice that the application of !R here is valid in all three systems; the other rules, !P and !L, are the same.

The power of Morrill’s approach can be illustrated by the derivation for “the paper that John signed without reading.” This example shows medial and parasitic extraction, in a bracket-aware setting. With bracket modalities, the type assignment is as follows:

the : N/CN
paper : CN
that : $([]^{-1}[]^{-1}(CN \backslash CN))/(S/{!N})$
John : N
signed : $(\langle\rangle N \backslash S)/N$
without : $([]^{-1}((\langle\rangle N \backslash S) \backslash (\langle\rangle N \backslash S)))/(\langle\rangle N \backslash S)$
reading : $(\langle\rangle N \backslash S)/N$
In order to parse this phrase, we should first put the correct bracketing: “the paper [[ that [ John ] signed [[ without reading ]] ]].” Notice that here we distinguish single-bracketed weak islands and double-bracketed strong ones. Less obviously, “without reading” linguistically is a weak island (and it is going to be penetrated using !C), but it is double-bracketed here. The trick is that in the newer Morrill’s systems, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ or $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, double brackets become single after applying !C, looking from bottom to top. (For the older system, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, also in the view of !C, one should start without bracketing this island.) The derivation of the corresponding sequent in $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ is presented in Fig. 1. This figure is a copy of [16, Fig. 2], which is in its turn an adaptation of the derivation given by Morrill [28, Fig. 24]. We include this derivation here for the convenience of the reader.

Also notice that the role of Lambek’s restriction here is twofold. Besides disallowing empty words, it also disallows empty islands to be filled by parasitic extraction. An example is the incorrect phrase “the man who likes,” which can be parsed in $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ using an empty subject island: “the man [[ who [[ ]] likes ]],” see [16, Fig. 4]. The first reference for this example is [27, Footnote 1]. Lambek’s restriction prevents this.

Now let us discuss algorithmic questions. Unlike other rules, in the contraction rule !C the premise is more complex than the conclusion. Namely, parts of the stoup (ζ2 in $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, and A in $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$) get copied. This makes the proof search space potentially infinite and yields an unfortunate consequence: as shown in [16], derivability problems in $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ are algorithmically undecidable.
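To see concretely why contraction makes naive backward search non-terminating, one can apply the 2015-style !C rule bottom-up a few times and watch the sequent grow. The snippet below is a toy sketch under stated assumptions: meta-formulae are modelled as (stoup, terms) pairs with opaque string formulae, the rule is applied at the top level only, the side condition is read as "ζ2 is non-empty", and the helper names `contract_2015` and `size` are hypothetical.

```python
from collections import Counter

def contract_2015(stoup, terms, zeta2, zeta3, i, j):
    """Read the 2015-style !C rule bottom-up at the top level.

    Conclusion: ζ1, ζ2, ζ3; Γ1, Γ2, Γ3     Premise: ζ1, ζ2; Γ1, [ζ2, ζ3; Γ2], Γ3
    Here Γ2 = terms[i:j]; an island is modelled as a (stoup, terms) pair.
    """
    stoup, zeta2, zeta3 = Counter(stoup), Counter(zeta2), Counter(zeta3)
    if not zeta2:
        raise ValueError("side condition: ζ2 must be non-empty")
    zeta1 = stoup - zeta2 - zeta3
    if zeta1 + zeta2 + zeta3 != stoup:
        raise ValueError("ζ2 and ζ3 must be taken from the stoup")
    island = (zeta2 + zeta3, list(terms[i:j]))      # ζ2 is copied into the island
    return zeta1 + zeta2, list(terms[:i]) + [island] + list(terms[j:])

def size(stoup, terms):
    """Number of formula occurrences plus the number of bracket pairs."""
    return sum(stoup.values()) + sum(
        1 if isinstance(t, str) else 1 + size(*t) for t in terms)

# Start from "N; S" and keep contracting with ζ2 = {N}, ζ3 = ∅, Γ2 = Λ.
stoup, terms = Counter(["N"]), ["S"]
sizes = [size(stoup, terms)]
for _ in range(3):
    stoup, terms = contract_2015(stoup, terms, ["N"], [], 0, 0)
    sizes.append(size(stoup, terms))
print(sizes)
```

Each step adds a fresh island containing a copy of the stoup formula, so the premise is strictly larger than the conclusion and the process can be repeated indefinitely.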
Fig. 1. Derivation for “the paper [[that [John] signed [[without reading]]]]” in $!_b^{2018}\mathrm{MALC}(\mathrm{st})$
In practice (i.e., in CatLog), however, categorial grammars based on Morrill’s systems are already used for parsing natural language sentences. This means that for sequents which actually occur in practice the proof search procedure terminates. In other words, there are practically important fragments of Morrill’s calculi, for which derivability problems are algorithmically decidable.

3 Polarity and Bracket Restrictions

In what follows, we designate these decidable fragments by imposing certain easily checkable syntactic conditions on formulae and sequents. These conditions are called the bracket non-negative condition (BNNC for short) and the bracket non-positive condition (BNPC).

The BNNC was suggested by Morrill and Valentín [30], who presented an exponential-time decision algorithm for sequents obeying this condition. In [13], we sketched a proof of the NP upper bound for derivability under the BNNC, in the case without additives; here we give a more detailed proof. The BNNC, however, is useful for the ‘older’ Morrill’s system $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$. For ‘newer’ systems $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, here we introduce a novel dual constraint, namely, the BNPC, and prove the corresponding decidability and complexity results.

Let us first recall the standard notion of positive and negative subformulae in a given formula/sequent.

Definition 1.4. For a formula A or a sequent Ξ → B, we define two finite sets, $\mathrm{SubFm}^{+}(A)$ and $\mathrm{SubFm}^{-}(A)$ (resp., $\mathrm{SubFm}^{+}(\Xi \to B)$ and $\mathrm{SubFm}^{-}(\Xi \to B)$), by joint recursion.

$$\begin{aligned}
\mathrm{SubFm}^{+}(p_i) &= \{p_i\} \\
\mathrm{SubFm}^{+}(\mathbf{1}) &= \{\mathbf{1}\} \\
\mathrm{SubFm}^{+}(A \backslash B) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \backslash B\} \\
\mathrm{SubFm}^{+}(B / A) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{B / A\} \\
\mathrm{SubFm}^{+}(A \cdot B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \cdot B\} \\
\mathrm{SubFm}^{+}(A \wedge B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \wedge B\} \\
\mathrm{SubFm}^{+}(A \vee B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \vee B\} \\
\mathrm{SubFm}^{+}([]^{-1}A) &= \mathrm{SubFm}^{+}(A) \cup \{[]^{-1}A\} \\
\mathrm{SubFm}^{+}(\langle\rangle A) &= \mathrm{SubFm}^{+}(A) \cup \{\langle\rangle A\} \\
\mathrm{SubFm}^{+}({!A}) &= \mathrm{SubFm}^{+}(A) \cup \{{!A}\}
\end{aligned}$$

$\mathrm{SubFm}^{+}(\Xi)$ is the union of $\mathrm{SubFm}^{+}(A)$, where A is a formula in Ξ, either in a stoup or as a tree term.

$$\mathrm{SubFm}^{+}(\Xi \to B) = \mathrm{SubFm}^{-}(\Xi) \cup \mathrm{SubFm}^{+}(B)$$

$$\begin{aligned}
\mathrm{SubFm}^{-}(p_i) &= \mathrm{SubFm}^{-}(\mathbf{1}) = \varnothing \\
\mathrm{SubFm}^{-}(A \backslash B) &= \mathrm{SubFm}^{-}(B / A) = \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{-}(B) \\
\mathrm{SubFm}^{-}(A \cdot B) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{-}(B) \\
\mathrm{SubFm}^{-}(A \wedge B) &= \mathrm{SubFm}^{-}(A \vee B) = \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{-}(B) \\
\mathrm{SubFm}^{-}([]^{-1}A) &= \mathrm{SubFm}^{-}(\langle\rangle A) = \mathrm{SubFm}^{-}({!A}) = \mathrm{SubFm}^{-}(A)
\end{aligned}$$

$\mathrm{SubFm}^{-}(\Xi)$ is the union of $\mathrm{SubFm}^{-}(A)$, where A is a formula in Ξ, either in a stoup or as a tree term.

$$\mathrm{SubFm}^{-}(\Xi \to B) = \mathrm{SubFm}^{+}(\Xi) \cup \mathrm{SubFm}^{-}(B)$$

Elements of $\mathrm{SubFm}^{+}(A)$ are called positive subformulae of A, and elements of $\mathrm{SubFm}^{-}(A)$ are negative ones (similarly for Ξ → B). Cut-free proofs enjoy the polarized subformula property:
Lemma 1.2. For any sequent Ξ′ → B′ in a cut-free derivation of the goal sequent Ξ → B we have $\mathrm{SubFm}^{+}(\Xi' \to B') \subseteq \mathrm{SubFm}^{+}(\Xi \to B)$ and $\mathrm{SubFm}^{-}(\Xi' \to B') \subseteq \mathrm{SubFm}^{-}(\Xi \to B)$.

Proof. Obvious from the form of inference rules: all formulae which appear in premises are subformulae of the conclusion, with the same polarities.

Definition 1.5. A sequent Ξ → B obeys the bracket non-negative condition (BNNC), if for any $!F \in \mathrm{SubFm}^{-}(\Xi \to B)$ and for any F ∈ ζ, where ζ is one of the stoups inside Ξ, the set $\mathrm{SubFm}^{+}(F)$ does not include formulae of the form $[]^{-1}A$ and the set $\mathrm{SubFm}^{-}(F)$ does not include formulae of the form $\langle\rangle A$.

In other words, the BNNC means that negative occurrences of !-formulae, which can undergo !C, cannot include bracket modalities, the rules for which remove brackets (i.e., $[]^{-1}$ introduced by $[]^{-1}L$ and $\langle\rangle$ introduced by $\langle\rangle R$). This allows controlling the number of contractions (applications of !C) by counting brackets and bracket modalities, see Lemma 1.3 below.

The BNPC is a dual condition. Under this condition, if a bracket modality got introduced by a rule which introduces a pair of brackets (i.e., $[]^{-1}R$ for $[]^{-1}$ and $\langle\rangle L$ for $\langle\rangle$), then later on such a modality is not allowed to undergo !C.

Definition 1.6. A sequent Ξ → B obeys the bracket non-positive condition (BNPC), if for any $!F \in \mathrm{SubFm}^{-}(\Xi \to B)$ and for any F ∈ ζ, where ζ is one of the stoups inside Ξ, the set $\mathrm{SubFm}^{-}(F)$ does not include formulae of the form $[]^{-1}A$ and the set $\mathrm{SubFm}^{+}(F)$ does not include formulae of the form $\langle\rangle A$.

Notice that the BNNC and the BNPC (respectively) are exactly the conditions on formulae under ! which are violated in the undecidability proofs in [16].
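Both conditions are purely syntactic, so they can be checked by a direct transcription of Definitions 1.4 to 1.6. The following is a minimal sketch, assuming a tuple encoding of formulae and a flattened view of the antecedent; the encoding and the names `sub`, `bang_bodies`, and `obeys` are illustrative assumptions, not part of the paper or of CatLog.

```python
# Illustrative encoding (an assumption): atoms and the unit are str;
# ("\\", A, B) is A\B, ("/", B, A) is B/A, ("*", A, B), ("&", A, B), ("|", A, B)
# are product, additive conjunction, additive disjunction;
# ("<>", A), ("[]-1", A), ("!", A) are the three unary modalities.

def sub(f, positive=True):
    """SubFm+(f) if positive, else SubFm-(f), by joint recursion."""
    if isinstance(f, str):
        return {f} if positive else set()
    own = {f} if positive else set()          # a formula is a positive subformula of itself
    op = f[0]
    if op == "\\":                            # A\B: A flips polarity, B keeps it
        return sub(f[1], not positive) | sub(f[2], positive) | own
    if op == "/":                             # B/A: A flips polarity, B keeps it
        return sub(f[2], not positive) | sub(f[1], positive) | own
    if op in ("*", "&", "|"):
        return sub(f[1], positive) | sub(f[2], positive) | own
    return sub(f[1], positive) | own          # unary modalities keep polarity

def bang_bodies(left, stoups, goal):
    """All F with !F negative in the sequent, plus all stoup formulae.

    left: every formula occurring in the antecedent (stoups and tree terms,
    at any bracket depth); stoups: the formulae that occur in stoups.
    """
    negative = sub(goal, False).union(*(sub(a, True) for a in left))
    return [g[1] for g in negative if isinstance(g, tuple) and g[0] == "!"] + list(stoups)

def obeys(left, stoups, goal, condition):
    """Check the BNNC or the BNPC (condition is the string "BNNC" or "BNPC")."""
    pos_bad, neg_bad = ("[]-1", "<>") if condition == "BNNC" else ("<>", "[]-1")
    def has(fs, op):
        return any(isinstance(g, tuple) and g[0] == op for g in fs)
    return not any(has(sub(F, True), pos_bad) or has(sub(F, False), neg_bad)
                   for F in bang_bodies(left, stoups, goal))

# Type of "that": ([]^{-1}[]^{-1}(CN\CN))/(S/!N); the only !-body is N.
that = ("/", ("[]-1", ("[]-1", ("\\", "CN", "CN"))), ("/", "S", ("!", "N")))
print(obeys([that], [], "CN", "BNNC"), obeys([that], [], "CN", "BNPC"))

bad = ("!", ("[]-1", "p"))                    # ![]^{-1}p on the left
print(obeys([bad], [], "q", "BNNC"), obeys([bad], [], "q", "BNPC"))
```

With this encoding the first line prints `True True` and the second `False True`: a bracket modality $[]^{-1}$ occurring positively under a negative ! violates the BNNC but not the BNPC.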
4 Decidable Multiplicative Fragments
We start with “purely multiplicative” fragments of the calculi in question, i.e., fragments without the additive connectives $\wedge$ and $\vee$. Since our calculi are cut-free, these fragments are axiomatized simply by removing the rules $\wedge L$, $\wedge R$, $\vee L$, and $\vee R$ from the corresponding full systems. Here the corresponding conditions on brackets yield decidability.

Theorem 1.1. The derivability problem in $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, for sequents without $\wedge$ and $\vee$ and obeying the BNNC, is decidable and belongs to the NP class. The same holds for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, with the BNPC instead of the BNNC.

Notice that the corresponding lower bound, NP-hardness, is due to Pentus [32], who proved NP-hardness of the Lambek calculus itself, without brackets and subexponentials.

Theorem 1.1 immediately follows from the following lemma, which establishes a polynomial upper bound on the derivation size. Indeed, such a polynomial size
derivation serves as the necessary NP witness for derivability. (In other words, a non-deterministic algorithm can guess the derivation and then check that it is correct, and this is all done in polynomial time.)

Lemma 1.3. If a sequent without $\wedge$ and $\vee$ is derivable in $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ and obeys the BNNC, then its cut-free derivation is of polynomial size w.r.t. the size of the sequent. The same holds for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, with the BNPC instead of the BNNC.

Proof. Let $n$ be the size of the sequent in question, measured as the total number of symbols (including brackets). Let us estimate the number of rule applications in the derivation.

1. Contraction rule ($!C$). The key consideration in the proof of the lemma is the upper bound on the number of contractions, i.e., applications of $!C$. Let us denote this number by $\#!C$. Also let $\#B^{+}$ be the number of applications of $[]^{-1}L$ and $\langle\rangle R$ (each of these rules introduces a pair of brackets) and let $\#B^{-}$ be the number of applications of $\langle\rangle L$ and $[]^{-1}R$ (these rules erase brackets). Finally, let $\#[]$ be the number of pairs of brackets in the goal sequent.

In the case of $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, each application of $!C$ erases a pair of brackets. Therefore, $\#[] = \#B^{+} - \#B^{-} - \#!C$, whence

$$\#!C = \#B^{+} - \#B^{-} - \#[] \le \#B^{+}.$$

Now we recall that our goal sequent obeys the BNNC. Therefore, no formula of the form $[]^{-1}A$ introduced by $[]^{-1}L$ (i.e., in the negative polarity) gets included into a formula of the form $!F$ in the antecedent (i.e., in negative polarity) or a formula $F$ in a stoup. The same holds for formulae of the form $\langle\rangle A$ introduced by $\langle\rangle R$. Thus, each rule application counted in $\#B^{+}$ is connected to a unique occurrence of the corresponding modality in the goal sequent. This yields $\#B^{+} \le n$, whence $\#!C \le n$.

For $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, the argument is similar. Now each application of $!C$ introduces a new pair of brackets (making a single-bracketed island a double-bracketed one). Therefore, $\#[] = \#B^{+} - \#B^{-} + \#!C$, whence

$$\#!C = \#[] + \#B^{-} - \#B^{+} \le \#[] + \#B^{-}.$$

Dually, the BNPC guarantees that each rule application counted in $\#B^{-}$ corresponds to a unique occurrence of a modality in the goal sequent. These occurrences are negative ones for $\langle\rangle$ and positive ones for $[]^{-1}$. This yields $\#B^{-} \le n$. Hence, $\#!C \le \#[] + \#B^{-} \le 2n$.

Thus, in both cases we have $\#!C \le 2n$.

2. Logical rules. All rules, except $!C$ and $!P$, are logical rules (recall that all our derivations are cut-free). Each logical rule introduces exactly one new
occurrence of a connective or a modality. Such occurrences either trace down to the goal sequent or get contracted (i.e., merged with another occurrence) by $!C$.

In the case of $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ or $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, counting logical rules is simple. Indeed, each application of $!C$ contracts a formula $A$ (in the stoup), which is a subformula of the goal sequent. Thus, the number of logical rules introducing connectives or modalities which get contracted is bounded by $\#!C \cdot n \le 2n^2$. The number of logical rules introducing connectives or modalities which trace down to the goal sequent is less than or equal to $n$. Thus, the total number of logical rule applications in the derivation is less than or equal to $2n^2 + n$, which is polynomial.

The case of $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ is more involved, since in this system $!C$ may contract an arbitrary part of the stoup, $\zeta_2$. However, we still establish the needed upper bound by proving the following statement: let $\xi$ be a stoup in a $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ derivation of a sequent obeying the BNNC (in a sequent of the form $\Theta(\xi; \Pi) \to E$); then each formula in $\xi$ traces down to a distinct subformula occurrence in the goal sequent. In other words, two formulae in the same stoup cannot get contracted (merged) by $!C$.

Let us prove this statement. Suppose the contrary. Let a formula $F$ appear two times as a subformula in a stoup $\xi$, and let these two occurrences get contracted below. This means that $F$ is a subformula of a formula $G$ which belongs to $\zeta_2$ in the formulation of the $!C$ rule:

$$\frac{\Xi(\zeta_1, \zeta_2; \Gamma_1, [\zeta_2, \zeta_3; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, \zeta_2, \zeta_3; \Gamma_1, \Gamma_2, \Gamma_3) \to C}\ {!C}$$

The copies of $G$, however, are separated by the brackets embracing $\zeta_2, \zeta_3; \Gamma_2$. This means that, when going upwards from the $!C$ application to $\Theta(\xi; \Pi) \to E$, this pair of brackets should be destroyed at some point. This could be performed using $[]^{-1}L$ or $\langle\rangle R$.

In the first case, our formula $F$ should be a subformula of $[]^{-1}A$ (since it is the only formula inside the brackets), and the latter is a subformula of $G$. Moreover, the polarity is positive in both cases. This violates the BNNC, since $G$ is a member of a stoup. In the second case, $F$ should be a negative subformula of $\langle\rangle A$ (which is the only formula outside the brackets), and the latter is a negative subformula of $G$. Again, the BNNC is violated. Contradiction.

The statement entails that for each application of $!C$ the total number of connective and modality occurrences in $\zeta_2$ is bounded by $n$, which is the length of the goal sequent. Now the same argument as for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ shows that the total number of logical rule applications is less than or equal to $n^2 + n$ (here $\#!C \le n$).

3. Permutation rule ($!P$). The reasoning here is similar to the previous case. Each application of $!P$ introduces a formula into the stoup, and such a formula either traces down to a designated subformula occurrence in the goal sequent or gets contracted by $!C$. This gives the same upper bound: $2n^2 + n$ for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, and $n^2 + n$ for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$.
Summing up. In a cut-free derivation, each sequent is either the goal one or a premise of one of the inference rules considered above. Each logical rule has at most two premises; $!P$ and $!C$ have one. Thus, for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ the total number of sequents in the derivation is less than or equal to

$$1 + 2n + 2(2n^2 + n) + (2n^2 + n) = 6n^2 + 5n + 1,$$

which is polynomial. For $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, the upper bound is even a bit smaller:

$$1 + n + 2(n^2 + n) + n^2 + n = 3n^2 + 4n + 1.$$
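The final summation can be re-checked mechanically. The short Python sketch below recomputes the count from its four ingredients (the goal sequent, one premise per contraction, two premises per logical rule, one premise per permutation) and compares it with the closed forms stated above. The function names are illustrative only.

```python
def sequent_bound(contractions, logical_rules):
    """Goal sequent + !C premises + logical-rule premises + !P premises."""
    permutations = logical_rules        # !P obeys the same bound as logical rules
    return 1 + contractions + 2 * logical_rules + permutations


def bound_2018(n):
    # #!C <= 2n, logical rules <= 2n^2 + n
    return sequent_bound(2 * n, 2 * n * n + n)


def bound_2015(n):
    # #!C <= n, logical rules <= n^2 + n
    return sequent_bound(n, n * n + n)
```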
5 Decidable Fragments with Additives
Now let us consider the full systems, with additive operations. Here the corresponding bracket conditions also yield decidability, but the complexity is higher.

Theorem 1.2. The derivability problem in $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, for sequents obeying the BNNC, is decidable and belongs to the PSPACE class. The same holds for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, with the BNPC instead of the BNNC.

Notice that the upper bound here is again tight: the corresponding lower bound, PSPACE-hardness of the Lambek calculus with additive operations, was shown by Kanovich [11] and Kanazawa [10]. Moreover, the minimalistic fragment with only two connectives, $\backslash$ and $\wedge$, is already PSPACE-hard [14].

In the presence of additive operations, there is no hope of obtaining a global polynomial upper bound on the size of the derivation (as in Lemma 1.3). The reason is that the $\vee L$ and $\wedge R$ rules copy big parts of the sequent to both premises, and this can make the derivation exponentially large. However, we shall establish a “local” upper bound, namely, prove that the length of each path in the derivation tree, from the goal sequent to an axiom, is polynomial. In other words, we shall show that our derivations have polynomial height. (As usual, the height of a tree is the length of the longest path from the root to a leaf.)

Lemma 1.4. If a sequent is derivable in $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ and obeys the BNNC, then its cut-free derivation has polynomial height w.r.t. the size of the sequent. The same holds for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, with the BNPC instead of the BNNC.

Proof. Let $\delta$ be a path in the derivation tree from the root (goal sequent) to an axiom leaf. We shall estimate the number of rule applications on $\delta$ in terms of $n$, the size of the goal sequent. Let us relativize the parameters used in Lemma 1.3 to the path $\delta$. Namely, $\#_\delta !C$, $\#_\delta B^{+}$, and $\#_\delta B^{-}$ are, respectively, the numbers of contractions (applications of $!C$), introductions of brackets (applications of $[]^{-1}L$ and $\langle\rangle R$), and removals of brackets ($\langle\rangle L$ and $[]^{-1}R$) along $\delta$. As in Lemma 1.3, $\#[]$ denotes the total number of pairs of brackets in the goal sequent.
1. Contraction rule ($!C$). The case of $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ here is easier. Each application of $!C$ on $\delta$ adds a pair of brackets which is either removed by an application of $[]^{-1}R$ or $\langle\rangle L$ below or traces down to the goal sequent. This gives

$$\#_\delta !C \le \#_\delta B^{-} + \#[].$$

Due to the BNPC, we have $\#_\delta B^{-} \le n$, since each rule application counted in $\#_\delta B^{-}$ traces down to a separate modality occurrence in the goal sequent. Therefore, $\#_\delta !C \le 2n$.

The situation with $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ is a bit trickier. One cannot just claim $\#_\delta !C \le \#_\delta B^{+}$, since a pair of brackets erased by an application of $!C$ could have been introduced in another branch of the derivation, outside $\delta$. This issue is resolved in the following way. Let us take the whole derivation tree. For each application of $\vee L$ or $\wedge R$ let us remove one of its premises and the whole subtree above it, keeping our path $\delta$ intact. (If $\delta$ traverses an application of $\vee L$ or $\wedge R$, we remove the premise which is not on $\delta$; otherwise, the choice is arbitrary.) The resulting subtree (which is not required to be a valid derivation tree, of course) will be denoted by $D$.

By $\#_D B^{+}$ let us denote the number of $[]^{-1}L$ and $\langle\rangle R$ applications inside $D$. On the one hand, each pair of brackets erased by an application of $!C$ on the path $\delta$ was introduced by such a rule application. Indeed, each application of $\vee L$ or $\wedge R$ copies all the brackets from the conclusion to both premises; therefore, the brackets cannot escape from $D$. Therefore, $\#_\delta !C \le \#_D B^{+}$.

On the other hand, under the BNNC no negative occurrence of $[]^{-1}$ (introduced by $[]^{-1}L$) and no positive occurrence of $\langle\rangle$ (introduced by $\langle\rangle R$) can be contracted by $!C$. Moreover, inside $D$ two such occurrences cannot be identified by $\vee L$ or $\wedge R$. Therefore, each connective introduced by $[]^{-1}L$ or $\langle\rangle R$ in $D$ traces down to a separate modality occurrence in the goal sequent. Therefore, $\#_D B^{+} \le n$.

Thus, in both cases the number of $!C$ applications on $\delta$ is linearly bounded.

2. Logical rules and permutation rule. Here the argument is the same as in the proof of Lemma 1.3, but relativized to $\delta$. For $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, we again show that two formulae in the same stoup cannot get contracted. Therefore, in all three systems the total number of connective and modality occurrences contracted by one application of $!C$ is bounded by $n$.

Now, each connective occurrence introduced on the path $\delta$ either gets contracted by $!C$ or traces down to the goal sequent. This gives an upper bound on the number of logical rule applications on $\delta$: $n^2 + n$ for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ and $2n^2 + n$ for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$. The same upper bounds hold for the number of permutations (applications of $!P$).

Summing up our upper bounds gives the same polynomial estimates on the number of rule applications on $\delta$ (i.e., the length of $\delta$) as Lemma 1.3 gives for
the whole derivation tree: $3n^2 + 4n + 1$ for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ and $6n^2 + 5n + 1$ for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$.

Constructing a PSPACE decision algorithm for a calculus with a polynomial bound on derivation heights is quite a standard task. In a similar situation, for the multiplicative-additive fragment of linear logic, Lincoln et al. [19] use alternating Turing machines [7] in order to prove the PSPACE upper bound. In this article, we directly construct a non-deterministic depth-first search algorithm which works in polynomially bounded memory space.

Proof (of Theorem 1.2). Lemma 1.4 guarantees that, under the corresponding bracket condition, any cut-free derivation in each of the three calculi has a polynomially bounded height. Let us construct a non-deterministic algorithm guessing such a derivation in the following way.

The algorithm starts from the goal sequent (i.e., the root) and then tries to build a correct derivation tree in the depth-first manner. In memory, the algorithm keeps a stack (a ‘last-in-first-out’ structure) of sequents for which proof search is postponed, and one ‘active’ sequent which is being considered right now. For each sequent (both the active one and those in the stack) the algorithm also keeps the length of the path from this sequent to the goal one.

In the beginning, the stack is empty and the active sequent is the goal one. At each step, the algorithm performs a non-deterministic guess of which inference rule to apply in order to derive the active sequent. Recall that each rule has at most two premises. If it has one premise, then the active sequent gets replaced by this premise, increasing the length parameter by 1. If there are two premises, then the left one becomes active, while the right one is put onto the stack. (Proof search for the right premise is postponed to the future.)

At some point, the algorithm will either:
1. exceed the fixed polynomial bound on derivation height;
2. not be able to apply any inference rule;
3. reach an axiom instance.

In the first and second cases the algorithm returns the answer “no” and terminates. (This does not mean that the sequent is not derivable; possibly the concrete series of non-deterministic guesses just suggested a wrong derivation strategy.) In the successful third case, the algorithm checks the stack. If the stack is empty, the algorithm terminates, returning “yes.” This indeed means that the sequent is derivable, and the algorithm has constructed a derivation (though the complete derivation tree was never kept in memory). If the stack is not empty, the algorithm pops the topmost sequent from the stack, makes it active, and recursively applies proof search to this sequent.

Intuitively, the stack represents sequents that ‘sit’ on the right-hand side of the branching points along a path in the derivation tree. The algorithm applies depth-first search, and actually tries all the paths from the goal to an axiom leaf, from left to right. One can easily see that the existence of a correct derivation tree is equivalent to the existence of ‘correct’ non-deterministic guesses, after
which the algorithm returns “yes.” Therefore, we have indeed constructed a non-deterministic algorithm solving the derivability problem in each of our calculi.

Moreover, at each moment of execution the stack includes sequents located at different heights in the tree which is supposed to be a derivation. Since the height of this tree is polynomially bounded, we get a polynomial bound on the amount of memory used (the size of each sequent is also polynomially bounded). Thus, the derivability problem belongs to the NPSPACE class. Finally, by Savitch’s theorem [33] we have NPSPACE = PSPACE, which finishes the proof.
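The stack discipline of this search can be illustrated on a much smaller system. The Python sketch below is only an illustration under stated assumptions: it replaces the calculi of this paper by a toy cut-free calculus for lattice sequents $A \to B$ (atoms, $\wedge$, $\vee$), and it simulates the non-deterministic guesses by backtracking over all rule choices. The names `rule_instances` and `derivable` are hypothetical. As in the proof, one sequent is active, postponed right premises are kept on a stack together with their depths, and the search gives up once the height bound is exceeded.

```python
# Toy calculus (an assumption for illustration): a sequent is a pair (A, B)
# read as A -> B; a formula is an atom (str) or ("and", A, B) / ("or", A, B).

def rule_instances(sequent):
    """All backward rule applications to `sequent`, as tuples of premises."""
    left, right = sequent
    options = []
    if isinstance(left, str) and left == right:
        options.append(())                                  # axiom p -> p
    if isinstance(right, tuple) and right[0] == "and":      # and-R, two premises
        options.append(((left, right[1]), (left, right[2])))
    if isinstance(left, tuple) and left[0] == "or":         # or-L, two premises
        options.append(((left[1], right), (left[2], right)))
    if isinstance(left, tuple) and left[0] == "and":        # and-L, two variants
        options += [((left[1], right),), ((left[2], right),)]
    if isinstance(right, tuple) and right[0] == "or":       # or-R, two variants
        options += [((left, right[1]),), ((left, right[2]),)]
    return options


def derivable(goal, height_bound):
    """Depth-first proof search; the stack holds postponed right premises."""

    def run(active, depth, stack):
        if depth > height_bound:            # outcome 1: height bound exceeded
            return False
        for premises in rule_instances(active):   # one loop pass = one guess
            if not premises:                # outcome 3: axiom reached
                if not stack:
                    return True             # nothing postponed: answer "yes"
                (postponed, d), rest = stack[-1], stack[:-1]
                if run(postponed, d, rest):
                    return True
            elif len(premises) == 1:
                if run(premises[0], depth + 1, stack):
                    return True
            else:
                left, right = premises
                if run(left, depth + 1, stack + ((right, depth + 1),)):
                    return True
        return False                        # outcome 2, or every guess failed

    return run(goal, 0, ())
```

The stack is an immutable tuple, so a failed guess leaves the caller's stack unchanged and backtracking needs no explicit undo step. The sketch makes no attempt to stay within polynomial space itself, since the Python call stack grows with the number of search steps; it only mirrors the bookkeeping described in the proof.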
6 Inducing Brackets
The algorithmic problem of parsing using categorial grammars is actually harder than proving sequents in the calculus these grammars are based on. First, a word of the language can have several syntactic types, so before proving the sequent the algorithm should determine, for each word, which of these types should be used. This is a minor issue, however, since our algorithms are non-deterministic, so we can just guess the correct type assignment.

But there is another, more serious issue. As one can see from the examples in Sect. 2, the sequence of types should be properly bracketed before starting proof search. In real natural language data, however, there are no brackets. Therefore, ideally, the number and position of brackets should be guessed, or induced, by the algorithm, rather than requested from the user.

Without the subexponential modality, a parsing algorithm which automatically induces brackets was developed by Morrill et al. [29]. The key to decidability of bracket induction is the fact that, without the subexponential, the number of bracket pairs in the goal sequent is bounded by the number of bracket modalities (and the latter is fixed). In this section we show that, with respect to the possibility of effective bracket induction, Morrill’s two systems with the subexponential behave differently.

Formally, the bracket induction problem is formulated as follows: given a sequent of the form $A_1, \ldots, A_n \to B$ (without brackets and stoups), is there a way to put brackets on the left-hand side so that the resulting sequent is derivable in the given calculus?

Let us start with the older system, $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$. For the multiplicative-only fragment, without $\wedge$ and $\vee$, we again get NP decidability.

Theorem 1.3. The bracket induction problem for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, for sequents without $\wedge$ and $\vee$ obeying the BNNC, is decidable and belongs to the NP class.

Proof. If the sequent in question becomes derivable after adding brackets, then, following the reasoning from the proof of Lemma 1.3, we get

$$\#[] = \#B^{+} - \#B^{-} - \#!C \le \#B^{+} \le n.$$

Here $n$ is the size of the original sequent without brackets. Indeed, $\#B^{+}$ is the number of applications of $[]^{-1}L$ and $\langle\rangle R$, and, thanks to the BNNC, each such
application traces down to an occurrence of the corresponding bracket modality, not a pair of brackets, in the goal sequent. Now, since the number of brackets is linearly bounded, the bracketing can be guessed by an NP algorithm in polynomial time, along with the derivation itself.

For the system including additive operations we predictably get PSPACE.

Theorem 1.4. The bracket induction problem for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, for sequents obeying the BNNC, is decidable and belongs to the PSPACE class.

Proof. Again, we establish an upper bound on $\#[]$, which allows non-deterministic guessing of the bracketing. Unlike in the previous theorem, this upper bound cannot be directly extracted from the proof of Lemma 1.4. Still, we can use a trick similar to the one used in that proof.

Suppose the sequent becomes derivable after imposing a certain bracketing. In this derivation, for each application of $\vee L$ or $\wedge R$ let us remove one of its premises (e.g., the right one) and the whole subtree above it. Of course, the resulting tree $D$ is no longer necessarily a correct derivation. However, it is useful for counting brackets. Namely, now each subformula has a unique trace either to the goal sequent or to a contraction (application of $!C$), just as in the case without $\wedge$ and $\vee$. This yields the same estimate:

$$\#[] = \#_D B^{+} - \#_D B^{-} - \#_D !C \le \#_D B^{+} \le n,$$

where the $\#_D$ counts are taken in the modified “derivation” tree.

Now, using the linear bound on the number of brackets, we non-deterministically guess the bracketing, and then apply the non-deterministic polynomial space algorithm from Theorem 1.2. This yields NPSPACE complexity, which is the same as PSPACE by Savitch’s theorem.

For the newer system, $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$, the situation is the opposite.

Theorem 1.5. The bracket induction problem for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ for sequents obeying the BNPC is undecidable.

Proof. The general line of the proof is standard. Its ideas go back to Lincoln et al. [19].
We encode a well-known undecidable problem, derivability in type-0 grammars (which are closely related to semi-Thue systems). For our purposes we do not need non-terminal symbols in type-0 grammars. Thus, a type-0 grammar over alphabet $\Sigma$ is a triple $G = (\Sigma, P, s)$, where $s \in \Sigma$ is the starting symbol and $P$ is a finite set of rewriting rules of the form $x_1 \ldots x_k \Rightarrow y_1 \ldots y_m$, where $x_i, y_j \in \Sigma$, $k \ge 1$, $m \ge 0$. A derivation in $G$ is a sequence of words starting with $s$, such that each next word is obtained from the previous one by applying a rewriting rule (i.e., replacing a subword $x_1 \ldots x_k$ with $y_1 \ldots y_m$). A word $w$ is derivable if there exists a derivation with $w$ being its last word.

Given a type-0 grammar $G$, let us define the set $\mathcal{A}_G$ as follows:

$$\mathcal{A}_G = \{(x_1 \cdot \ldots \cdot x_k)/(y_1 \cdot \ldots \cdot y_m) \mid x_1 \ldots x_k \Rightarrow y_1 \ldots y_m \text{ is a rewriting rule of } G\}.$$

The elements of $\mathcal{A}_G$ will be denoted by $A_1, \ldots, A_N$. Each $A_i$ encodes one of the rewriting rules.
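To make the objects of the reduction concrete, here is a small Python sketch of the grammar side. It is an illustration under stated assumptions: words are strings over single-character symbols, rules are pairs `(lhs, rhs)` with non-empty `lhs`, and the encoding of a rule is a structural tuple `("/", xs, ys)` rather than an actual formula of the calculus. Since derivability in type-0 grammars is undecidable, the search is deliberately bounded by a number of steps, so it can confirm derivability but can never refute it in general.

```python
def one_step(word, rules):
    """Yield every word obtained from `word` by one rule application."""
    for lhs, rhs in rules:
        start = word.find(lhs)
        while start != -1:
            yield word[:start] + rhs + word[start + len(lhs):]
            start = word.find(lhs, start + 1)


def derivable_within(rules, start, target, max_steps):
    """True if `target` is derivable from `start` in at most `max_steps` steps."""
    frontier, seen = {start}, {start}
    for _ in range(max_steps):
        if target in seen:
            return True
        frontier = {w for v in frontier for w in one_step(v, rules)} - seen
        seen |= frontier
    return target in seen


def encode_rules(rules):
    """One tuple ("/", (x1, ..., xk), (y1, ..., ym)) per rule x1...xk => y1...ym."""
    return [("/", tuple(lhs), tuple(rhs)) for lhs, rhs in rules]
```

For example, with the rules `[("s", "ab"), ("b", "bb")]` the word `abbb` is reached from `s` in three steps, and the two rules are encoded as `("/", ("s",), ("a", "b"))` and `("/", ("b",), ("b", "b"))`.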
We shall prove that a word $a_1 \ldots a_n$ is derivable from $s$ in $G$ if and only if one can put brackets on the left-hand side of the following sequent so that it becomes derivable:

$$![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s.$$
This gives a computable reduction of derivability in $G$ to the bracket induction problem for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ under the BNPC (since the sequent in question obeys the BNPC), which establishes undecidability of the latter.

Let us start with the “only if” direction: from derivation in $G$ to inducing brackets. Suppose that $a_1 \ldots a_n$ is derived in $G$ from $s$ in $r$ steps. Then we put brackets as follows:

$$![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, \underbrace{[[\Lambda]], \ldots, [[\Lambda]]}_{r \text{ times}}, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s.$$
Notice that here we put brackets over empty parts of the sequent (in other words, make them strong islands). Suppose that the rewriting rules used in the derivation $s \Rightarrow \ldots \Rightarrow a_1 \ldots a_n$ are the rules with numbers $i_1, \ldots, i_r$. Then we derive our sequent in the following way. Using $!C$, we put copies of the corresponding $A_{i_j}$’s into the islands $[[\Lambda]]$, one into each island. The islands become single-bracketed: $[[]^{-1}!A_{i_j}; \Lambda]$. Then we use $[]^{-1}$’s to remove the remaining pairs of brackets and remove the leading $![]^{-1}!A_1, \ldots, ![]^{-1}!A_N$ using $\cdot R$. The corresponding derivation is presented in Fig. 2(a).

At the top of this derivation there is the sequent $A_{i_1}, \ldots, A_{i_r}; a_1, \ldots, a_n \to s$. We show its derivability by induction on $r$. If $r = 0$, then it is just $s \to s$. For the induction step, consider the last, $r$-th, rewriting rule applied in the derivation:

$$s \Rightarrow \ldots \Rightarrow a_1 \ldots x_1 \ldots x_k \ldots a_n \Rightarrow a_1 \ldots y_1 \ldots y_m \ldots a_n.$$

This rewriting is simulated using $A_{i_r} = (x_1 \cdot \ldots \cdot x_k)/(y_1 \cdot \ldots \cdot y_m)$ from the stoup, as shown in Fig. 2(b). The topmost sequent, $A_{i_1}, \ldots, A_{i_{r-1}}; a_1, \ldots, x_1, \ldots, x_k, \ldots, a_n \to s$, is derivable by the induction hypothesis.

The “if” part, from inducing brackets to rewriting in $G$, is performed in a rather standard way, using the bracket-forgetting projection [16]. Let us consider $!\mathrm{L}^*$, the Lambek calculus (see Introduction) without Lambek’s non-emptiness restriction, extended with a full-power exponential modality $!$. The exponential modality is governed by the following rules:

$$\frac{\Gamma_1, A, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!L} \qquad \frac{!A_1, \ldots, !A_n \to B}{!A_1, \ldots, !A_n \to {!B}}\ {!R} \qquad \frac{\Gamma_1, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!W}$$

$$\frac{\Gamma_1, \Phi, !A, \Gamma_2 \to C}{\Gamma_1, !A, \Phi, \Gamma_2 \to C}\ {!P_1} \qquad \frac{\Gamma_1, !A, \Phi, \Gamma_2 \to C}{\Gamma_1, \Phi, !A, \Gamma_2 \to C}\ {!P_2} \qquad \frac{\Gamma_1, !A, !A, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!C}$$
Notice that here left-hand sides of sequents are just sequences of formulae, there are no stoups or brackets.
Fig. 2. Derivations for simulating rewritings in a type-0 grammar via bracket induction in $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$
One can easily see that if one takes a sequent derivable in $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$, erases all brackets and bracket modalities, and translates meta-formulae of the form $\zeta; \Gamma$, where $\zeta = \{A_1, \ldots, A_n\}$, as $!A_1, \ldots, !A_n, \Gamma$, then the resulting sequent will be derivable in $!\mathrm{L}^*$. (The converse does not hold: bracketing prevents some of the derivations.) This translation is called the bracket-forgetting projection (BFP).

Now let us suppose that our sequent,

$$![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s,$$

becomes derivable after putting some brackets on it. Independently of the bracketing imposed, the BFP gives the following sequent:

$$!!A_1, \ldots, !!A_N, a_1, \ldots, a_n \to (!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s,$$
which is derivable in $!\mathrm{L}^*$. Now let us consider the following derivable sequents: $!A_i \to {!!A_i}$ and $(!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s \to s$. The first one is derived in $!\mathrm{L}^*$ by one application of $!R$, and the derivation of the second one is as follows:

$$\dfrac{\dfrac{s \to s}{!!A_1, \ldots, !!A_N, s \to s}\ {!W},\ N \text{ times}}{(!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s \to s}\ {\cdot L},\ N \text{ times}$$

Using cut, we derive $!A_1, \ldots, !A_N, a_1, \ldots, a_n \to s$. (For cut elimination in $!\mathrm{L}^*$, see [12].)

Now we perform the standard backwards translation, from derivations in non-commutative linear logic to computations (in our case, in a type-0 grammar), which goes back to Lincoln et al. [19]. A detailed proof can be found, e.g., in [16], Lemma 1, implication 4 ⇒ 1. Using this translation, we conclude that $a_1 \ldots a_n$ is derivable in $G$ from $s$.
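The bracket-forgetting projection used in this proof is a plain structural recursion, which the following Python sketch spells out. The encoding is hypothetical: atoms are strings, compound formulae are tuples `(op, arg, ...)`, and a bracketed island inside a tree term is a tuple `("bracket", stoup, tree)`. Placing the $!$-prefixed stoup formulae of a nested island at the position of that island is an assumption of this sketch; the text above only specifies the translation of a single meta-formula $\zeta; \Gamma$.

```python
BRACKET_MODALITIES = {"<>", "[]-1"}


def erase_modalities(formula):
    """Remove every bracket modality occurring inside a formula."""
    if isinstance(formula, str):
        return formula
    op, *args = formula
    args = [erase_modalities(arg) for arg in args]
    return args[0] if op in BRACKET_MODALITIES else (op, *args)


def forget_brackets(stoup, tree):
    """Translate the meta-formula `stoup; tree` into a flat list of formulae."""
    flat = [("!", erase_modalities(f)) for f in stoup]
    for item in tree:
        if isinstance(item, tuple) and item[0] == "bracket":
            flat.extend(forget_brackets(item[1], item[2]))   # drop the brackets
        else:
            flat.append(erase_modalities(item))
    return flat
```

Applied to an antecedent containing `("!", ("[]-1", ("!", A)))`, an empty double-bracketed island, and an atom `"a"`, the function returns `[("!", ("!", A)), "a"]`, in line with the sequent $!!A_1, \ldots, !!A_N, a_1, \ldots, a_n \to \ldots$ obtained in the proof regardless of the bracketing.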
7 Conclusion and Future Work
In this paper, we have shown that the systems with brackets and a subexponential proposed by Morrill as basic calculi for the CatLog natural language parser, while being undecidable in general, enjoy natural decidable fragments. These fragments are designated by syntactic restrictions called the bracket non-negative/non-positive conditions (BNNC/BNPC). Moreover, the algorithmic complexity of these fragments is the same as for the systems without brackets and the subexponential. Namely, with additive operations we get PSPACE and without them we get NP. As noticed by one of the reviewers, these complexity results could be easily extended to discontinuous operations used in Morrill’s systems along with standard Lambek ones.

Morrill’s full systems, however, include other sources of undecidability (besides the contraction rule for $!$). One such source is the Kleene star, which Morrill calls ‘existential exponential.’ The Kleene star is governed by an omega-rule [28]; thus, the system includes infinitary action logic, which is known to be $\Pi^0_1$-complete [5]. Another potential source of undecidability is the presence of quantifiers. The development of appropriate syntactic restrictions on these connectives in order to restore decidability is still an open problem.

Another observation made by one of the reviewers is that our decidability results also entail the finite reading property for $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$ under the BNNC and for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ under the BNPC. The finite reading property means that for a sentence with brackets imposed there could exist only a finite number of different derivations. Indeed, even in the broader systems with additives we have managed to prove a polynomial upper bound
on the height of the derivation tree (Lemma 1.4). This yields a finite, though exponential, bound on the size of the derivation and, thus, a double-exponential bound on the number of possible derivations. The choice of types for each word is also finite, since so is the lexicon.

For the more complicated, but at the same time more practically interesting algorithmic problem of inducing brackets, the situation is as follows. For Morrill’s ‘older’ system, $!_b^{2015}\mathrm{MALC}^*(\mathrm{st})$, the complexity of the bracket induction problem, under the BNNC, is the same as for the derivability problem. For the ‘newer’ system, $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$, unfortunately, the bracket induction problem is undecidable even under the BNPC. The undecidability construction, however, crucially depends on empty bracketed islands, i.e., on the violation of Lambek’s non-emptiness restriction. We conjecture that for the system with Lambek’s restriction, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, with the BNPC imposed, the bracket induction problem is decidable. The complexity of this problem is left as an open question. Another open question is whether the bracket induction problem for $!_b^{2018}\mathrm{MALC}^*(\mathrm{st})$ (without Lambek’s restriction) becomes decidable if we impose both the BNPC and the BNNC (i.e., disallow any bracket modalities in the scope of the subexponential and in stoups).

Acknowledgement. We are grateful to Glyn Morrill for a number of very helpful interactions we benefited from at various stages of our work. We would also like to thank the reviewers for their efforts. The work of Max Kanovich was partially supported by EPSRC Programme Grant EP/R006865/1: “Interface Reasoning for Interacting Systems (IRIS).” The part by Stepan G. Kuznetsov was prepared within the framework of the Academic Fund Program at HSE University in 2021–2022 (grant № 21-04-027). The work of Stepan L. Kuznetsov and the early part of the work of Andre Scedrov (until July 2020) was performed within the framework of the HSE University Basic Research Program. The work of Stepan L. Kuznetsov was also partially supported by the Council of the President of Russia for Support of Young Russian Researchers and Leading Research Schools of the Russian Federation (grant MK-1184.2021.1.1) and by the Russian Foundation for Basic Research (grant № 20-01-00435).
References
1. Abrusci, V.M.: A comparison between Lambek syntactic calculus and intuitionistic linear logic. Zeitschrift für mathematische Logik und Grundlagen der Mathematik 36, 11–15 (1990)
2. Ajdukiewicz, K.: Die syntaktische Konnexität. Stud. Philos. 1, 1–27 (1935)
3. Bar-Hillel, Y.: A quasi-arithmetical notation for syntactic description. Language 29(1), 47–58 (1953)
4. Buszkowski, W.: Type logics in grammar. In: Hendriks, V.F., Malinowski, J. (eds.) Trends in Logic: 50 Years of Studia Logica. TREN, vol. 21, pp. 337–382. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-017-3598-8_12
5. Buszkowski, W.: On action logic: equational theories of action algebras. J. Log. Comput. 17(1), 199–217 (2007). https://doi.org/10.1093/logcom/exl036
6. Carpenter, B.: Type-Logical Semantics. MIT Press, Cambridge (1997)
7. Chandra, A.K., Kozen, D.C., Stockmeyer, L.J.: Alternation. J. ACM 28(1), 114–133 (1981). https://doi.org/10.1145/322234.322243
8. Galatos, N., Jipsen, P., Kowalski, T., Ono, H.: Residuated Lattices: An Algebraic Glimpse on Substructural Logics. Studies in Logic and the Foundations of Mathematics, vol. 151. Elsevier, Amsterdam (2007)
9. Girard, J.Y.: Linear logic. Theor. Comput. Sci. 50(1), 1–101 (1987). https://doi.org/10.1016/0304-3975(87)90045-4
10. Kanazawa, M.: Lambek calculus: recognizing power and complexity. In: Gerbrandy, J., Marx, M., de Rijke, M., Venema, Y. (eds.) JFAK. Essays Dedicated to Johan van Benthem on the Occasion of His 50th Birthday. Vossiuspers, Amsterdam University Press (1999)
11. Kanovich, M.: Horn fragments of non-commutative logics with additives are PSPACE-complete. In: 1994 Annual Conference of the European Association for Computer Science Logic, Kazimierz, Poland (1994)
12. Kanovich, M., Kuznetsov, S., Nigam, V., Scedrov, A.: Subexponentials in non-commutative linear logic. Math. Struct. Comput. Sci. 29(8), 1217–1249 (2019). https://doi.org/10.1017/S0960129518000117
13. Kanovich, M., Kuznetsov, S., Scedrov, A.: Undecidability of the Lambek calculus with subexponential and bracket modalities. In: Klasing, R., Zeitoun, M. (eds.) FCT 2017. LNCS, vol. 10472, pp. 326–340. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-55751-8_26
14. Kanovich, M., Kuznetsov, S., Scedrov, A.: The complexity of multiplicative-additive Lambek calculus: 25 years later. In: Iemhoff, R., Moortgat, M., de Queiroz, R. (eds.) WoLLIC 2019. LNCS, vol. 11541, pp. 356–372. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-662-59533-6_22
15. Kanovich, M., Kuznetsov, S., Scedrov, A.: Reconciling Lambek’s restriction, cut-elimination, and substitution in the presence of exponential modalities. J. Log. Comput. 30(1), 239–256 (2020). https://doi.org/10.1093/logcom/exaa010
16. Kanovich, M., Kuznetsov, S., Scedrov, A.: The multiplicative-additive Lambek calculus with subexponentials and bracket modalities. J. Log. Lang. Inf. 30, 31–88 (2021). https://doi.org/10.1007/s10849-020-09320-9
17. Lambek, J.: The mathematics of sentence structure. Am. Math. Monthly 65, 154–170 (1958). https://doi.org/10.1080/00029890.1958.11989160
18. Lambek, J.: On the calculus of syntactic types. In: Jakobson, R. (ed.) Structure of Language and Its Mathematical Aspects, Proceedings of Symposia in Applied Mathematics, vol. 12, pp. 166–178. AMS, Providence (1961)
19. Lincoln, P., Mitchell, J., Scedrov, A., Shankar, N.: Decision problems for propositional linear logic. Ann. Pure Appl. Log. 56(1–3), 239–311 (1992). https://doi.org/10.1016/0168-0072(92)90075-B
20. Moortgat, M.: Multimodal linguistic inference. J. Log. Lang. Inf. 5(3–4), 349–385 (1996). https://doi.org/10.1007/BF00159344
21. Moot, R., Retoré, C.: The Logic of Categorial Grammars: A Deductive Account of Natural Language Syntax and Semantics. LNCS, vol. 6850. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-31555-8
22. Morrill, G.: Categorial formalisation of relativisation: pied piping, islands, and extraction sites. Technical report LSI-92-23-R, Universitat Politècnica de Catalunya (1992)
23. Morrill, G.: A categorial type logic. In: Casadio, C., Coecke, B., Moortgat, M., Scott, P. (eds.) Categories and Types in Logic, Language, and Physics. LNCS, vol. 8222, pp. 331–352. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54789-8_18
24
M. I. Kanovich et al.
24. Morrill, G.: Grammar logicised: relativisation. Linguist. Philos. 40(2), 119–163 (2017). https://doi.org/10.1007/s10988-016-9197-0 25. Morrill, G.: Parsing logical grammar: CatLog3. In: Loukanova, R., Liefke, K. (eds.) Proceedings of the Workshop on Logic and Algorithms in Computational Linguistics (LACompLing 2017), pp. 107–131. Stockholm University, Stockholm (2017) 26. Morrill, G.: The CatLog3 technical manual. Technical report, Universitat Polit`ecnica de Catalunya (2018). http://www.lsi.upc.edu/∼morrill/CatLog3/ CatLog3.pdf 27. Morrill, G.: A note on movement in logical grammar. J. Lang. Model. 6(2), 353–363 (2018). https://doi.org/10.15398/jlm.v6i2.233 28. Morrill, G.: Parsing/theorem-proving for logical grammar CatLog3. J. Log. Lang. Inf. 28(2), 183–216 (2019). https://doi.org/10.1007/s10849-018-09277-w 29. Morrill, G., Kuznetsov, S., Kanovich, M., Scedrov, A.: Bracket induction for Lambek calculus with bracket modalities. In: Foret, A., Kobele, G., Pogodalla, S. (eds.) FG 2018. LNCS, vol. 10950, pp. 84–101. Springer, Heidelberg (2018). https://doi. org/10.1007/978-3-662-57784-4 5 30. Morrill, G., Valent´ın, O.: Computation coverage of TLG: nonlinearity. In: Kanazawa, M., Moss, L., de Paiva, V. (eds.) Third Workshop on Natural Language and Computer Science, NLCS 2015. EPiC Series in Computing, vol. 32, pp. 51–63 (2015). https://doi.org/10.29007/96j5 31. Morrill, G.V.: Categorial Grammar: Logical Syntax, Semantics, and Processing. Oxford University Press, Oxford (2011) 32. Pentus, M.: Lambek calculus is NP-complete. Theor. Comput. Sci. 357(1–3), 186– 201 (2006). https://doi.org/10.1016/j.tcs.2006.03.018 33. Savitch, W.J.: Relationships between nondeterministic and deterministic tape complexities. J. Comput. Syst. Sci. 4(2), 177–192 (1970). https://doi.org/10.1016/ S0022-0000(70)80006-X
Interactive Theorem Proving for Logic and Information

Jørgen Villadsen¹(✉), Asta Halkjær From¹, Alexander Birch Jensen¹, and Anders Schlichtkrull²

¹ Technical University of Denmark, Kongens Lyngby, Denmark, {jovi,ahfrom,aleje}@dtu.dk
² Aalborg University Copenhagen, Copenhagen, Denmark, [email protected]
Abstract. Automated reasoning is the study of computer programs that can build proofs of theorems in a logic. Such programs can be either automatic theorem provers or interactive theorem provers. The latter are also called proof assistants because the user constructs the proofs with the help of the system. We focus on the Isabelle proof assistant. The system ensures that the proofs are correct, in contrast to pen-and-paper proofs which must be checked manually. We present applications to logical systems and models of information, in particular selected modal logics extending classical propositional logic. Epistemic logic allows intelligent systems to reason about the knowledge of agents. Public announcements can change the knowledge of the system and its agents. In order to account for this, epistemic logic can be extended to public announcement logic. An axiomatic system consists of axioms and rules of inference for deriving statements in a logic. Sound systems can only derive valid statements and complete systems can derive all valid statements. We describe formalizations of sound and complete axiomatic systems for epistemic logic and public announcement logic, thereby strengthening the foundations of automated reasoning for logic and information. Keywords: Interactive theorem proving · Propositional logic · Epistemic logic · Public announcement logic · Isabelle/HOL proof assistant
1 Introduction

Automated reasoning technology has matured tremendously in recent decades. However, the main applications are found in verification of hardware and software systems as well as in many areas of mathematics. We present a series of applications to logical systems and models of information, in particular classical propositional logic and selected modal logics extending classical propositional logic. On the one hand, we interpret interactive theorem proving narrowly and focus on the Isabelle proof assistant [45]. On the other hand, we interpret logic and information broadly and consider three logics in the area: propositional logic, epistemic logic (EL) and public announcement logic (PAL).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 25–48, 2022. https://doi.org/10.1007/978-3-030-90138-7_2
Building up to formalizations of formulas, we start with a formalization of binary trees and a number of functions operating on these. Thereafter, we formalize a prover for propositional logic as a simple example to introduce the reader to the idea of formalizing logics in Isabelle. We use a so-called deep embedding of logics where formulas are essentially binary trees. By using a datatype for formulas we can prove soundness, completeness and termination of the prover.

Moving on, we formalize epistemic logic, a logic for reasoning about both the factual and higher-order knowledge of agents, and a deductive proof system that enables this reasoning from a few axioms and inference rules. Again we use the deep embedding approach and prove soundness and completeness.

Finally, we formalize public announcement logic with countably many agents. Public announcement logic extends epistemic logic with an operator for publicly announcing information. The formalization includes proofs of soundness and completeness for a variant of the well-known PA + DIST! + NEC! axiomatic system. The completeness proof builds on the one for epistemic logic by reducing formulas into that logic.

Our definitions are given in Isabelle’s precise language of higher-order logic and every step of our soundness and completeness proofs is mechanically checked. With formalizations of sound and complete axiomatic systems for epistemic logic and public announcement logic, we strengthen the foundations of automated reasoning for logic and information. The formalizations are available here: https://hol.compute.dtu.dk/ITPLI

The present paper extends our 3-page paper at the International Workshop on Logical Aspects in Multi-Agent Systems and Strategic Reasoning, which was not formally published and covered only the formalization of epistemic logic [20].

Summing up, in the present paper we focus on propositional logic, epistemic logic and public announcement logic. As a supplement to pen-and-paper proofs of soundness and completeness, we describe the use of the powerful Isabelle proof assistant for interactive theorem proving.

Other logics have been formalized in Isabelle. We mention some of them here and leave the rest for our discussion of related work, together with results in proof assistants other than Isabelle.

• Michaelis and Nipkow formalize several proof systems for classical propositional logic [39, 40]. From, Eschen and Villadsen formalize a number of axiomatic systems for propositional logic [19]. In the present paper we consider modal logics going beyond classical propositional logic.
• From, Lund and Villadsen formalize a number of small provers for classical propositional logic [21, 71, 72]. In the present paper we use a similar prover as a motivational example.

We recommend the survey on the use of formalizations in computer science by Ringer et al. [59] and, for the state of the art in mathematics, the official published account of the now completed Flyspeck project [25].
The paper is organized as follows: Sect. 2 introduces the reader to Isabelle/HOL and how to deeply embed logics. Section 3 explains our formalization of epistemic logic. Section 4 explains our formalization of public announcement logic. We discuss related work in Sect. 5 and conclude in Sect. 6.
2 Isabelle/HOL and Deep Embeddings of Logics

Isabelle is a generic proof assistant originally developed at the University of Cambridge and Technische Universität München [45]. The most widely used instance of Isabelle today is Isabelle/HOL, based on classical higher-order logic, and in the following we often use the name Isabelle to refer to Isabelle/HOL.

In order to provide a gentle introduction to programming and proving in Isabelle, we start with a formalization of binary trees and a number of functions operating on these. We further prove a few interesting properties about these functions. In Isabelle/HOL, programming is not limited to the computable fragments of HOL. For instance, a function may return a boolean value that is the result of quantifying over all elements of a type, e.g. stating that all natural numbers are either odd or even. As such, the concept of programming in Isabelle/HOL goes beyond its usual meaning in the context of traditional programming languages like Haskell and Java.

Finally, we briefly consider a formalization of a prover for propositional logic. This mainly serves the purpose of introducing the reader to formalizing logics in Isabelle using a deep embedding approach. In this approach, formulas are defined as a datatype which enables the definition of semantics, a proof system and a small prover as functions that operate on this datatype. In turn, we can prove termination, soundness and completeness of the prover.

2.1 Formally Verified Functional Programming

The following is a rather straightforward example of formally verified functional programming in Isabelle/HOL: a typical solution to an exercise in the Isabelle tutorial [44]. We start with a datatype of trees with labels at the nodes. The labels can be of any type, as specified by the type variable 'a, and so-called cartouches ‹…› delineate the three components of a Node:

    datatype 'a tree = Tip | Node ‹'a tree› 'a ‹'a tree›
We may collect the contents of such trees into a set by writing a simple functional program:

    fun set :: ‹'a tree ⇒ 'a set› where
      ‹set Tip = {}›
    | ‹set (Node l a r) = set l ∪ {a} ∪ set r›
Note that we can use the usual set notation and operators in our definition. The type declaration can be omitted in which case it is inferred automatically.
We can then write a predicate on trees labelled by integers that checks if they are binary search trees:

    fun ord :: ‹int tree ⇒ bool› where
      ‹ord Tip = True›
    | ‹ord (Node l a r) = ((∀i ∈ set l. i < a) ∧ ord l ∧ (∀i ∈ set r. a < i) ∧ ord r)›
This checks if they are ordered such that for all nodes, every element in the left subtree is smaller than the element at the node while every element in the right subtree is larger. The following insertion function is supposed to preserve this order:

    fun ins :: ‹int ⇒ int tree ⇒ int tree› where
      ‹ins i Tip = Node Tip i Tip›
    | ‹ins i (Node l a r) =
        (if i < a then Node (ins i l) a r
         else if a < i then Node l a (ins i r)
         else Node l a r)›
In the ord function we have exploited universal quantification over a finite set, which is computable, but really this program could also be written in an ordinary functional language. This is a good thing as it helps build familiarity with the proof assistant. The next two lines take things a step further:

    theorem [simp]: ‹set (ins i t) = {i} ∪ set t›
      by (induct t) auto
There could potentially be a mistake in the ins function where certain elements are not inserted or other elements are dropped. Moreover, we might have to test a lot of inputs to uncover such a mistake. The theorem above, stated for all elements and all trees, rules out such errors. The proof works by induction on the tree, using Isabelle’s proof method auto to discharge the two resulting cases. With that result in hand we can also prove that ins preserves the binary search tree order:

    theorem ‹ord t ⟹ ord (ins i t)›
      by (induct t) simp_all
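To make the contrast between testing and proving concrete, the following is a minimal Python sketch of the same three functions. It is only an illustration and not part of the formalization: the names tree_set, ord_tree and from_list are our own (from_list is a hypothetical helper), and None stands in for Tip. The two theorems above hold for all inputs, whereas a program like this can only check them on the finitely many inputs we try.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Mirrors the constructor Node l a r; None stands in for Tip."""
    left: "Optional[Node]"
    label: int
    right: "Optional[Node]"


def tree_set(t):
    """All labels of the tree (counterpart of the Isabelle function set)."""
    if t is None:
        return set()
    return tree_set(t.left) | {t.label} | tree_set(t.right)


def ord_tree(t):
    """Binary search tree check (counterpart of the Isabelle function ord)."""
    if t is None:
        return True
    return (all(i < t.label for i in tree_set(t.left)) and ord_tree(t.left)
            and all(t.label < i for i in tree_set(t.right)) and ord_tree(t.right))


def ins(i, t):
    """Insert i while keeping the order; a duplicate leaves the tree unchanged."""
    if t is None:
        return Node(None, i, None)
    if i < t.label:
        return Node(ins(i, t.left), t.label, t.right)
    if t.label < i:
        return Node(t.left, t.label, ins(i, t.right))
    return t


def from_list(xs):
    """Hypothetical helper (not in the paper): insert the elements left to right."""
    t = None
    for x in xs:
        t = ins(x, t)
    return t
```

Checking, say, tree_set(ins(i, t)) == {i} | tree_set(t) for a handful of values of i and t is ordinary testing; the Isabelle theorem replaces that finite sample with a statement about every i and every t.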
Writing a machine-checked proof requires a higher level of abstraction, considering both how properties are expressed and proved.

2.2 Termination

So far we have only considered programs that are trivially total. The fun command will prove both pattern completeness and termination automatically. An advanced alternative is to use the function command, which does not prove either, and thus we have to do so manually afterwards, for example using Isar for the formal proofs [74]. Pattern completeness must be proved immediately, here with simp_all, and termination is shown later with the termination command.
We need to prove the termination of our micro provers manually. To illustrate the technique, we consider the McCarthy 91 function, which is an old test case for formal verification [34, 36]. The definition itself is simple, but the nested recursion makes termination non-obvious:

    function M :: ‹int ⇒ int› where
      ‹M i = (if 100 < i then i − 10 else M (M (i + 11)))›
      by simp_all
It is called the 91 function because M i = 91 for all i ≤ 100 (and M i = i − 10 otherwise). This is easy to show once termination has been established. We do so below. To prove termination we show a well-founded relation between the recursive calls and function input:

    termination
    proof
      let ?R = ‹measure (λi. nat (101 − i))›
      show ‹wf ?R› by simp
Briefly, (x, y) ∈ measure f ⟷ f x < f y. Any relation defined via measure is well-founded by construction. What remains to be shown is that both i + 11 and M (i + 11) are related to i, to justify the inner and outer recursive call, respectively. We consider only the branch of the if where the recursion happens and as such the first case is trivial given our measure:

      fix i :: int
      assume *: ‹¬ 100 < i›
      then show ‹(i + 11, i) ∈ ?R› by simp
For the other case, we assume that i + 11 is an input that M terminates for, as expressed by M_dom:

      assume ‹M_dom (i + 11)›
This M_dom predicate allows us to prove properties about the inputs that M terminates on, even though we have yet to prove that these are in fact all inputs. In particular, we note that when M terminates, the output is “mostly” larger than the input:

      moreover have ‹M_dom j ⟹ j − 11 < M j› for j
        by (induct j rule: M.pinduct) (auto simp: M.psimps)
Since the inner recursive call is on i + 11, the output is in fact larger than the input i and this is enough to relate the two, proving termination of the outer recursive call:

      ultimately have ‹i + 11 − 11 < M (i + 11)› by blast
      then show ‹(M (i + 11), i) ∈ ?R› using * by simp
    qed
Having proved termination, we can now perform induction over the call graph (as expressed by M.induct) to prove that the function can be defined without recursion:

    theorem ‹M i = (if 100 < i then i − 10 else 91)›
      by (induct i rule: M.induct) simp
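As an informal cross-check of the argument above, here is a small Python sketch. It is not part of the formalization, and the names measure and closed_form are ours. The function measure mirrors nat (101 − i), where negative values are truncated to 0, and the comments state the two decrease conditions that the termination proof establishes for every i with ¬ 100 < i.

```python
def M(i: int) -> int:
    """McCarthy 91 function with the nested recursion from the definition."""
    return i - 10 if 100 < i else M(M(i + 11))


def measure(i: int) -> int:
    """Counterpart of nat (101 - i): negative differences are truncated to 0."""
    return max(101 - i, 0)


def closed_form(i: int) -> int:
    """The non-recursive characterization stated in the final theorem."""
    return i - 10 if 100 < i else 91


def decreases(i: int) -> bool:
    """For an input on the recursive branch, both recursive arguments
    must have a strictly smaller measure than i itself."""
    inner = measure(i + 11) < measure(i)       # argument of the inner call
    outer = measure(M(i + 11)) < measure(i)    # argument of the outer call
    return inner and outer
```

Evaluating decreases on a range of inputs with i ≤ 100, or comparing M with closed_form, only samples finitely many points; the Isabelle proof covers all integers.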
This was an example of a function with a difficult termination proof. We also need to give explicit measures to prove termination of our provers in the coming sections but then the automation takes over, making them more suitable as starting points for exploration. Coming up with the measure can be tricky enough without struggling to prove that it works. We note that this declarative way of proving termination is similar to how a mathematician would do it.

2.3 A Prover for Propositional Logic
The following is a formalization of a simple prover for propositional logic. The prover is implicitly based on a sequent calculus for formulas in negation normal form. We start with a datatype for formulas:

    datatype 'a form = Paf 'a | Naf 'a | Con ‹'a form› ‹'a form› | Dis ‹'a form› ‹'a form›
Formulas can be combined using conjunction (Con) and disjunction (Dis). The type variable 'a allows for any representation of atomic formulas. We do not include negation as usual; instead, an atomic formula can appear as either positive (Paf: positive atomic formula) or negative (Naf: negative atomic formula). The following function defines the semantics of formulas, where an interpretation i maps elements of the type 'a to truth values:

    fun val where
      ‹val i (Paf n) = i n›
    | ‹val i (Naf n) = (¬ i n)›
    | ‹val i (Con p q) = (val i p ∧ val i q)›
    | ‹val i (Dis p q) = (val i p ∨ val i q)›
We exploit the built-in Boolean operators for negation, conjunction and disjunction. Alongside the semantics, we define a sequent calculus as a function for proving formulas:

    function cal where
      ‹cal e [] = (∃n ∈ fst e. n ∈ snd e)›
    | ‹cal e (Paf n # s) = cal ({n} ∪ fst e, snd e) s›
    | ‹cal e (Naf n # s) = cal (fst e, snd e ∪ {n}) s›
    | ‹cal e (Con p q # s) = (cal e (p # s) ∧ cal e (q # s))›
    | ‹cal e (Dis p q # s) = cal e (p # q # s)›
      by pat_completeness simp_all
The sequent calculus operates on a list of formulas, recursively decomposing them. We construct a set of the positive and a set of the negative literals in e. The function terminates once the list of formulas is empty; the truth is then determined by whether some atom appears in both literal sets. We need to prove that our cal function terminates:

    termination
      by (relation ‹measure (λ(_, s). ∑p ← s. size p)›) simp_all
We obtain a termination proof by providing a suitable measure based on the second argument of the cal function: the sum of sizes of the formulas in the list that we decompose. Because we have defined our sequent calculus as a function, we can immediately obtain a prover by proper initialization of this function:

    definition ‹prover p ≡ cal ({}, {}) [p]›
We showcase the prover by running it on a list of formulas (applied to each element individually):

    value ‹map prover [Paf n, Naf n, Con (Paf n) (Naf n), Dis (Paf n) (Naf n)]›
Trivially, only the last formula is a tautology so the result is a list with three False values and then a single True value. Isabelle interactively displays the result of running the prover on the formulas. We now move on to the question of soundness and completeness for the sequent calculus. We first define an intermediate abbreviation sat that captures that at least one literal in e (positive or negative) is satisfied by the interpretation i:

    abbreviation ‹sat i e ≡ (∃n ∈ fst e. i n) ∨ (∃n ∈ snd e. ¬ i n)›
This definition is useful for stating the soundness and completeness properties of our sequent calculus:

    lemma sound_and_complete: ‹cal e s ⟷ (∀i. (∃p ∈ set s. val i p) ∨ sat i e)›
      by (induct rule: cal.induct) auto
Because we state soundness and completeness as a single property, and for any call pattern of cal, we need to consider both the contents of the sets of positive and negative literals e, and the list of formulas s. The sequent calculus returns true if and only if, for all interpretations, truth either follows from a formula in the list or from one of the literals. The proof is by induction over the rules of the sequent calculus. We finally formulate soundness and completeness for the prover:

    theorem main: ‹prover p ⟷ (∀i. val i p)›
      unfolding sound_and_complete prover_def by simp
The stated lemma is weaker than for the sequent calculus and a proof can be obtained by simple rewriting. As such, the proof goal is easily discharged by Isabelle’s automation.
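The prover of this section is small enough to transcribe into an ordinary language. The following Python sketch is an illustration under our own assumptions, not the formalization itself: formulas are tagged tuples, interpretations are dictionaries, and valid is a hypothetical brute-force truth-table check that plays the role of the right-hand side ∀i. val i p.

```python
from itertools import product

# Formulas in negation normal form as tagged tuples:
#   ("Paf", n), ("Naf", n), ("Con", p, q), ("Dis", p, q)


def val(i, f):
    """Truth value of f under the interpretation i (a dict from atoms to bool)."""
    tag = f[0]
    if tag == "Paf":
        return i[f[1]]
    if tag == "Naf":
        return not i[f[1]]
    if tag == "Con":
        return val(i, f[1]) and val(i, f[2])
    return val(i, f[1]) or val(i, f[2])  # Dis


def cal(pos, neg, s):
    """Sequent calculus: pos/neg collect the literals seen so far, s is the work list."""
    if not s:
        return bool(pos & neg)  # some atom occurs both positively and negatively
    f, rest = s[0], s[1:]
    tag = f[0]
    if tag == "Paf":
        return cal(pos | {f[1]}, neg, rest)
    if tag == "Naf":
        return cal(pos, neg | {f[1]}, rest)
    if tag == "Con":
        return cal(pos, neg, [f[1]] + rest) and cal(pos, neg, [f[2]] + rest)
    return cal(pos, neg, [f[1], f[2]] + rest)  # Dis


def prover(p):
    """Start with empty literal sets and the single formula p."""
    return cal(frozenset(), frozenset(), [p])


def atoms(f):
    """Atoms occurring in f (helper for the truth-table check)."""
    if f[0] in ("Paf", "Naf"):
        return {f[1]}
    return atoms(f[1]) | atoms(f[2])


def valid(p):
    """Hypothetical reference check: p is true under every interpretation of its atoms."""
    names = sorted(atoms(p))
    return all(val(dict(zip(names, bits)), p)
               for bits in product([False, True], repeat=len(names)))
```

On the four example formulas the two functions agree, which is what theorem main guarantees for every formula; here the agreement is merely observed on the inputs we try, and termination of cal is taken on trust rather than proved.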
3 Epistemic Logic

Epistemic logic provides a foundation for reasoning about the knowledge of agents, both factual (“I know the sky is blue”) and higher-order (“I know that you know that I know the sky is blue”). A deductive proof system enables this reasoning with just a few axioms and inference rules. We formalize epistemic logic with countably many agents in the proof assistant Isabelle/HOL [17]. We include soundness and completeness proofs for the axiomatic system Kn based on the textbook Reasoning About Knowledge by Fagin, Halpern, Moses and Vardi [15]. Our definitions and proofs are specified in the precise language of higher-order logic and every step of our reasoning is mechanically checked. While the results are not new, we use them to showcase the level of precision and guarantee achievable by formalizing work in a proof assistant. Our formalization can also serve as a starting point for similar logics or proof systems. Our completeness proof does not follow the one by Fagin et al. [15] to the letter but is inspired by Fitting’s [16] consistency properties as formalized by Berghofer [5]. We have adapted them from first-order logic to epistemic logic.

3.1 Syntax and Semantics
The formal language L for epistemic logic is a propositional language extended with modal operators K1, . . . , Kn for expressing knowledge of agents. For example, the formula K1 ϕ ∧ K2 K1 ϕ ∧ ¬K1 K2 K1 ϕ states that: (i) agent 1 knows ϕ, (ii) agent 2 knows that agent 1 knows ϕ, but (iii) agent 1 does not know that agent 2 knows (i). The language is deeply embedded as a datatype in Isabelle/HOL:

    datatype 'i fm
      = FF (‹⊥›)
      | Pro id
      | Dis ‹'i fm› ‹'i fm› (infixr ‹∨› 30)
      | Con ‹'i fm› ‹'i fm› (infixr ‹∧› 35)
      | Imp ‹'i fm› ‹'i fm› (infixr ‹⟶› 25)
      | K 'i ‹'i fm›
We define a constructor for each primitive of our syntax, e.g. FF for falsity with the alternative notation ⊥. Similarly, we give infix syntax for the binary connectives, which all associate to the right and are given suitable precedences.
The type id is an abbreviation for strings of characters, used as labels for the propositions. We fix this instead of using a type variable in order to ease notation later. The type variable 'i is an arbitrary type for agents. In our informal example, we used natural numbers, but we do not commit ourselves to any specific type. Our soundness proof holds for any type while the completeness proof holds for any countable type 'i. We need the agent labels 'i to be countable, such that the language itself is countable. Countability of the syntax is a standard prerequisite for our way of proving completeness. We introduce negation into the syntax as an abbreviation:

    abbreviation Neg (‹¬ _› [40] 40) where
      ‹Neg p ≡ p ⟶ ⊥›
The semantics of epistemic logic formulas is based on a model of possible worlds as formalized by Kripke structures:

    datatype ('i, 's) kripke = Kripke (π: ‹'s ⇒ id ⇒ bool›) (K: ‹'i ⇒ 's ⇒ 's set›)
There are two components: an interpretation π that assigns truth values to propositions for each state (possible world), and a relation K that, when viewed as a function, takes an agent and a state and returns a set of states. This set is to be understood as the states the agent considers possible given the information available in the input state. We should mention the type variables ('i, 's). The type 'i is again an arbitrary type for agents while 's is the type of states. Thus, the formalization is generic over the type of agents and possible worlds. The double turnstile, M, s ⊨ ϕ, denotes the semantics of a formula ϕ ∈ L under a Kripke structure M and state s. We formalize it as the following function:

    primrec semantics :: ‹('i, 's) kripke ⇒ 's ⇒ 'i fm ⇒ bool› (‹_, _ ⊨ _› [50, 50] 50) where
      ‹(_, _ ⊨ ⊥) = False›
    | ‹(M, s ⊨ Pro i) = π M s i›
    | ‹(M, s ⊨ (p ∨ q)) = ((M, s ⊨ p) ∨ (M, s ⊨ q))›
    | ‹(M, s ⊨ (p ∧ q)) = ((M, s ⊨ p) ∧ (M, s ⊨ q))›
    | ‹(M, s ⊨ (p ⟶ q)) = ((M, s ⊨ p) ⟶ (M, s ⊨ q))›
    | ‹(M, s ⊨ K i p) = (∀t ∈ K M i s. M, t ⊨ p)›
No combination of model and state satisfies ⊥. The logical operators are defined by recursively obtaining the semantics of each subformula and combining the Boolean values through the built-in operators in Isabelle/HOL. The case for a proposition i looks up and returns the truth value of s and i in π M (the latter gives the π component of the Kripke structure M). Lastly, we have the case for a modal operator Ki p which requires the semantics of p to be true in every state agent i considers possible (from the current state). With the semantics in place, we can prove various properties of the modal operator Ki, for instance the following (see the formalization for the proof):

    theorem distribution: ‹M, s ⊨ (K i p ∧ K i (p ⟶ q) ⟶ K i q)›
The above states that the operator Ki distributes over implication.
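The semantic clauses translate almost literally into a recursive evaluator. The sketch below is a Python illustration under our own assumptions rather than the formalization: formulas are tagged tuples, a model is given by two functions pi and K, and the two-state example model (with its table ACCESS) is entirely hypothetical.

```python
# Formulas as tagged tuples:
#   ("FF",), ("Pro", x), ("Dis", p, q), ("Con", p, q), ("Imp", p, q), ("K", i, p)


def Neg(p):
    """Negation as an abbreviation: p implies falsity."""
    return ("Imp", p, ("FF",))


def holds(pi, K, s, f):
    """Counterpart of M, s |= f, where the model M is given by the functions pi and K."""
    tag = f[0]
    if tag == "FF":
        return False
    if tag == "Pro":
        return pi(s, f[1])
    if tag == "Dis":
        return holds(pi, K, s, f[1]) or holds(pi, K, s, f[2])
    if tag == "Con":
        return holds(pi, K, s, f[1]) and holds(pi, K, s, f[2])
    if tag == "Imp":
        return (not holds(pi, K, s, f[1])) or holds(pi, K, s, f[2])
    if tag == "K":
        # the agent f[1] knows f[2] if it holds in every state considered possible
        return all(holds(pi, K, t, f[2]) for t in K(f[1], s))
    raise ValueError(f"unknown formula: {f!r}")


# Hypothetical example model with states 0 and 1; "p" is true only in state 0.
def pi(s, x):
    return x == "p" and s == 0


# Agent 1 can tell the states apart, agent 2 cannot.
ACCESS = {(1, 0): {0}, (1, 1): {1}, (2, 0): {0, 1}, (2, 1): {0, 1}}


def K(i, s):
    return ACCESS[(i, s)]
```

In state 0 of this model agent 1 knows p while agent 2 does not, and agent 1 knows that agent 2 does not know p. Instances of the distribution formula evaluate to true in every state, as the theorem predicts for all models.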
    A1:  if p is a propositional tautology, then ⊢ p
    A2:  ⊢ Ki p ∧ Ki (p ⟶ q) ⟶ Ki q
    R1:  from ⊢ p and ⊢ p ⟶ q, infer ⊢ q
    R2:  from ⊢ p, infer ⊢ Ki p

Fig. 1. Our axiomatic system for epistemic logic.
3.2 Axiomatic System
The distribution theorem can be recognized in the very compact axiomatic system Kn (cf. Fig. 1). We adopt the usual syntax that the provability of a formula ϕ ∈ L is denoted by the turnstile symbol: ⊢ ϕ. The system is inductively defined as follows:

    inductive SystemK :: ‹'i fm ⇒ bool› (‹⊢ _› [50] 50) where
      A1: ‹tautology p ⟹ ⊢ p›
    | A2: ‹⊢ (K i p ∧ K i (p ⟶ q) ⟶ K i q)›
    | R1: ‹⊢ p ⟹ ⊢ (p ⟶ q) ⟹ ⊢ q›
    | R2: ‹⊢ p ⟹ ⊢ K i p›
A1 states that any classical propositional tautology is provable, A2 is similar to the distribution theorem, R1 is simply modus ponens and R2 states that agents also know the provable formulas. The definition tautology in A1 relies on a semantics that treats modal formulas Ki ϕ as if they were propositional symbols. This is the semantic equivalent of allowing all substitution instances of propositional tautologies, but is simpler to formalize.

3.3 Soundness

For the axiomatic system Kn to be sound, every formula in L provable in system Kn must be valid with respect to the semantics:

    ∀ϕ ∈ L. ⊢ ϕ ⟶ (∀M, s. M, s ⊨ ϕ)

That is, no combination of proof rules leads to a formula that is not valid. It does not follow that all valid formulas are provable, however, which is why we also need completeness.
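The idea behind the tautology premise of A1, treating each modal formula Ki ϕ as an opaque propositional symbol, can be sketched as a truth-table check. The following Python code is an illustration under our own assumptions and not the definition used in the formalization: formulas are tagged tuples as in ("K", i, p), and the helper names atoms and eval_prop are hypothetical.

```python
from itertools import product

# Formulas as tagged tuples:
#   ("FF",), ("Pro", x), ("Dis", p, q), ("Con", p, q), ("Imp", p, q), ("K", i, p)


def atoms(f):
    """Propositional symbols and modal formulas, both treated as atoms."""
    tag = f[0]
    if tag in ("Pro", "K"):
        return {f}
    if tag == "FF":
        return set()
    return atoms(f[1]) | atoms(f[2])


def eval_prop(g, f):
    """Evaluate f propositionally; g assigns a truth value to every atom."""
    tag = f[0]
    if tag == "FF":
        return False
    if tag in ("Pro", "K"):
        return g[f]  # modal formulas are looked up, never unfolded
    a, b = eval_prop(g, f[1]), eval_prop(g, f[2])
    if tag == "Dis":
        return a or b
    if tag == "Con":
        return a and b
    return (not a) or b  # Imp


def tautology(f):
    """True if f holds under every assignment to its atoms."""
    xs = sorted(atoms(f), key=repr)
    return all(eval_prop(dict(zip(xs, vs)), f)
               for vs in product([False, True], repeat=len(xs)))
```

Under this reading K1 p ∨ ¬K1 p counts as a tautology, while the distribution axiom A2 does not, since K1 p, K1 (p ⟶ q) and K1 q are three unrelated atoms. This is why A2 has to be a separate axiom.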
Our formalized proof of soundness requires extra work for the rule A1. The following theorem states soundness for this rule:

    theorem tautology: ‹tautology p ⟹ M, s ⊨ p›
Note that the quantification ∀p ∈ L and ∀M s is implicit in Isabelle/HOL. See the formalization for the proof. Proving soundness for system Kn is now straightforward. The following theorem captures the property:

    theorem soundness: ‹⊢ p ⟹ M, s ⊨ p›
      by (induct p arbitrary: s rule: SystemK.induct) (simp_all add: tautology)
The proof strategy is to apply induction over the rules of the system. Once we supply the tautology theorem, the simplification proof method in Isabelle/HOL discharges all subgoals.

3.4 Completeness

We now want to demonstrate that system Kn is not only sound, but also complete, namely that every valid formula in L is provable:

    ∀ϕ ∈ L. (∀M, s. M, s ⊨ ϕ) ⟶ ⊢ ϕ

The formalized proof follows Fagin et al. [15] and builds on maximal consistent sets of formulas. A formula ϕ is Kn-consistent if its negation is not provable: ⊬ ¬ϕ. A finite set of formulas ϕ1, . . . , ϕn is Kn-consistent if we cannot prove that they imply a contradiction: ⊬ ϕ1 ⟶ . . . ⟶ ϕn ⟶ ⊥. Finally, an infinite set of formulas is Kn-consistent if all its finite subsets are. Instead of working directly with this definition, we start from Fitting’s consistency properties [5], which define the class C of consistent sets S directly from the connectives of the formula, instead of referencing the axiom system:

    definition consistency :: ‹'i fm set set ⇒ bool› where
      ‹consistency C ≡ ∀S ∈ C.
        (∀p. ¬ (Pro p ∈ S ∧ (¬ Pro p) ∈ S)) ∧
        ⊥ ∉ S ∧
        (∀Z. (¬ (¬ Z)) ∈ S ⟶ S ∪ {Z} ∈ C) ∧
        (∀A B. (A ∧ B) ∈ S ⟶ S ∪ {A, B} ∈ C) ∧
        (∀A B. (¬ (A ∨ B)) ∈ S ⟶ S ∪ {¬ A, ¬ B} ∈ C) ∧
        (∀A B. (A ∨ B) ∈ S ⟶ S ∪ {A} ∈ C ∨ S ∪ {B} ∈ C) ∧
        (∀A B. (¬ (A ∧ B)) ∈ S ⟶ S ∪ {¬ A} ∈ C ∨ S ∪ {¬ B} ∈ C) ∧
        (∀A B. (A ⟶ B) ∈ S ⟶ S ∪ {¬ A} ∈ C ∨ S ∪ {B} ∈ C) ∧
        (∀A B. (¬ (A ⟶ B)) ∈ S ⟶ S ∪ {A, ¬ B} ∈ C) ∧
        (∀A. tautology A ⟶ S ∪ {A} ∈ C) ∧
        (∀A i. ¬ (K i A ∈ S ∧ (¬ K i A) ∈ S))›
All but the last two conditions are standard and ensure downwards saturation [67] of each set: the satisfiability of any member is guaranteed by conditions on its subformulas, and consistency is ensured at the bottom. The penultimate line ensures that the consistent sets contain all tautologies. This is a technical trick that makes them easier to work with: since no tautology can break consistency, we might as well include them. Similarly, the last condition ensures that no agent both knows and does not know the same formula A. We connect the definition of consistency to provability in system Kn through the following theorem:

    theorem K_consistency: ‹consistency {set G | G. ¬ ⊢ imply G ⊥}›
The completeness proof follows the usual recipe: (i) assume a valid formula ϕ has no derivation, (ii) then its negation is Kn-consistent, and (iii) we can extend the set {¬ϕ} in a standard way (due to Lindenbaum [68]) to a maximally consistent set [15], which (iv) has a model. This contradicts the validity assumption. The completeness theorem is:

    theorem completeness:
      assumes ‹∀(M :: ('i :: countable, 'i fm set) kripke) s. M, s ⊨ p›
      shows ‹⊢ p›
For technical reasons we have to require validity in a specific universe, namely one in which the possible worlds are sets of formulas, but this is implied by the usual assumption of validity in all universes. Conversely, given the provability of p, that is ⊢ p, the soundness result implies that p is valid in all universes.
4 Public Announcement Logic

We now move beyond static knowledge of agents and consider information updates as well. The formal language L! for public announcement logic is an extension of that of epistemic logic with the operator [r]! p for any formulas r and p, meaning “p is true after the public announcement of r”. For example, [K1 ρ ∧! K2 σ]! τ means that τ is true after the public announcement that agent 1 knows ρ and agent 2 knows σ. In the formalization [18], we again deeply embed the language as a datatype in Isabelle/HOL:

    datatype 'i pfm
      = FF (‹⊥!›)
      | Pro' id (‹Pro!›)
      | Dis ‹'i pfm› ‹'i pfm› (infixr ‹∨!› 30)
      | Con ‹'i pfm› ‹'i pfm› (infixr ‹∧!› 35)
      | Imp ‹'i pfm› ‹'i pfm› (infixr ‹⟶!› 25)
      | K' 'i ‹'i pfm› (‹K!›)
      | Ann ‹'i pfm› ‹'i pfm› (‹[_]! _› [50, 50] 50)
We have added primes to some constructors to disambiguate them from the epistemic logic. We say that a formula is static if it does not contain any announcement operators.
    PA1:   if p is a propositional tautology, then ⊢! p
    PA2:   ⊢! K! i p ∧! K! i (p ⟶! q) ⟶! K! i q
    PR1:   from ⊢! p and ⊢! p ⟶! q, infer ⊢! q
    PR2:   from ⊢! p, infer ⊢! K! i p
    PR3:   from ⊢! p, infer ⊢! [r]! p
    PFF:   ⊢! [r]! ⊥! ⟷! (r ⟶! ⊥!)
    PPro:  ⊢! [r]! x ⟷! (r ⟶! x)
    PDis:  ⊢! [r]! (p ∨! q) ⟷! [r]! p ∨! [r]! q
    PCon:  ⊢! [r]! (p ∧! q) ⟷! [r]! p ∧! [r]! q
    PImp:  ⊢! [r]! (p ⟶! q) ⟷! ([r]! p ⟶! [r]! q)
    PK:    ⊢! [r]! K! i p ⟷! (r ⟶! K! i ([r]! p))

Fig. 2. Our axiomatic system for public announcement logic.
The bi-implication operator is central to our development and we introduce it as an abbreviation:

    abbreviation PIff :: ‹'i pfm ⇒ 'i pfm ⇒ 'i pfm› (infixr ‹⟷!› 25) where
      ‹p ⟷! q ≡ (p ⟶! q) ∧! (q ⟶! p)›
The semantics depend on the notion of the restriction of a model to the worlds in which a specific formula is true. We formalize the semantics as the function psemantics and restriction as the function restrict. They are defined by mutual recursion:

    fun psemantics :: ‹('i, 'w) kripke ⇒ 'w ⇒ 'i pfm ⇒ bool› (‹_, _ ⊨! _› [50, 50] 50)
      and restrict :: ‹('i, 'w) kripke ⇒ 'i pfm ⇒ ('i, 'w) kripke› where
      ‹(M, w ⊨! ⊥!) = False›
    | ‹(M, w ⊨! Pro! x) = π M w x›
    | ‹(M, w ⊨! (p ∨! q)) = ((M, w ⊨! p) ∨ (M, w ⊨! q))›
    | ‹(M, w ⊨! (p ∧! q)) = ((M, w ⊨! p) ∧ (M, w ⊨! q))›
    | ‹(M, w ⊨! (p ⟶! q)) = ((M, w ⊨! p) ⟶ (M, w ⊨! q))›
    | ‹(M, w ⊨! K! i p) = (∀v ∈ K M i w. M, v ⊨! p)›
    | ‹(M, w ⊨! [r]! p) = ((M, w ⊨! r) ⟶ (restrict M r, w ⊨! p))›
    | ‹restrict M p = Kripke (π M) (λi w. {v. v ∈ K M i w ∧ (M, v ⊨! p)})›
As can be seen, the clauses for the static connectives are the same as for epistemic logic; what is new is the clause for [_]! and the definition of restrict.
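The mutual recursion between the semantics and the restriction can be sketched in Python for finite models. This is a sketch under stated assumptions, not the formalization itself: formulas are nested tuples, a model is a pair (pi, acc) of a valuation function and an accessibility function, and all names are hypothetical.

```python
# Formulas: ("ff",), ("pro", x), ("dis"|"con"|"imp", p, q), ("k", i, p), ("ann", r, p).
# Model: (pi, acc), where pi(w, x) is the truth of atom x at world w and
# acc(i, w) is the set of worlds agent i considers possible at w.

def sem(model, w, f):
    """Truth of formula f at world w of the model."""
    pi, acc = model
    tag = f[0]
    if tag == "ff":
        return False
    if tag == "pro":
        return pi(w, f[1])
    if tag == "dis":
        return sem(model, w, f[1]) or sem(model, w, f[2])
    if tag == "con":
        return sem(model, w, f[1]) and sem(model, w, f[2])
    if tag == "imp":
        return (not sem(model, w, f[1])) or sem(model, w, f[2])
    if tag == "k":
        return all(sem(model, v, f[2]) for v in acc(f[1], w))
    if tag == "ann":
        r, p = f[1], f[2]
        return (not sem(model, w, r)) or sem(restrict(model, r), w, p)
    raise ValueError(f"unknown formula tag: {tag!r}")

def restrict(model, r):
    """Keep the valuation; drop accessibility to worlds where r is false."""
    pi, acc = model
    return (pi, lambda i, w: {v for v in acc(i, w) if sem(model, v, r)})

# Two worlds: atom 0 holds at world "a" only; agent 1 cannot tell them apart.
M = (lambda w, x: (w, x) == ("a", 0), lambda i, w: {"a", "b"})
p = ("pro", 0)
print(sem(M, "a", ("k", 1, p)))              # False: world "b" is still possible
print(sem(M, "a", ("ann", p, ("k", 1, p))))  # True: after announcing p, agent 1 knows p
print(sem(M, "b", ("ann", p, ("ff",))))      # True: p is false at "b", so the clause holds vacuously
```

The last line shows why even falsity holds after a false announcement: the implication in the clause for announcements has a false antecedent.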
We restrict the model not by removing worlds but by removing every agent’s accessibility to those worlds. The idea behind this semantics is that for [r]! p to be true in model M at world w, either r is false at w in M (a false announcement), or p is satisfied at w in the restricted model restrict M r, where only r-worlds are accessible.

4.1 Axiomatic System
We adopt the syntax ⊢! ρ for the provability of ρ in the following axiomatic system, inspired by the system described by Baltag and Renne [2]. It is defined inductively (cf. Fig. 2):

  inductive PA :: 'i pfm ⇒ bool (⊢! _ [50] 50) where
    PA1: ptautology p =⇒ ⊢! p
  | PA2: ⊢! (K! i p ∧! K! i (p −→! q) −→! K! i q)
  | PR1: ⊢! p =⇒ ⊢! (p −→! q) =⇒ ⊢! q
  | PR2: ⊢! p =⇒ ⊢! K! i p
  | PR3: ⊢! p =⇒ ⊢! [r]! p
  | PFF: ⊢! ([r]! ⊥! ←→! (r −→! ⊥!))
  | PPro: ⊢! ([r]! Pro! x ←→! (r −→! Pro! x))
  | PDis: ⊢! ([r]! (p ∨! q) ←→! [r]! p ∨! [r]! q)
  | PCon: ⊢! ([r]! (p ∧! q) ←→! [r]! p ∧! [r]! q)
  | PImp: ⊢! (([r]! (p −→! q)) ←→! ([r]! p −→! [r]! q))
  | PK: ⊢! (([r]! K! i p) ←→! (r −→! K! i ([r]! p)))
Rules PA1, PA2, PR1 and PR2 are analogous to the rules A1, A2, R1 and R2 of epistemic logic (ptautology is implemented in the same style as tautology). In addition to the announcement necessitation rule PR3, the system has six axioms: one for each combination of [_]! with ⊥!, atomic formulas, ∨!, ∧!, −→! and K!. The axioms for the binary connectives simply distribute [_]! over each connective, while the ones for ⊥! and atomic formulas rephrase [_]! as an implication. The axiom for [_]! and knowledge says that “i knows p after an announcement r if and only if the announcement r, whenever truthful, is known by i to make p true” [2].

4.2 Reducing to Epistemic Logic
We implement the reduction from public announcement logic to epistemic logic operationally, as guided by the reduction axioms. We do so in two steps. The first operation, reduce’ r p, translates the formula [r]! p into an equivalent formula in epistemic logic when p itself is static:

  primrec reduce' :: 'i pfm ⇒ 'i pfm ⇒ 'i pfm where
    reduce' r ⊥! = (r −→! ⊥!)
  | reduce' r (Pro! x) = (r −→! Pro! x)
  | reduce' r (p ∨! q) = (reduce' r p ∨! reduce' r q)
  | reduce' r (p ∧! q) = (reduce' r p ∧! reduce' r q)
  | reduce' r (p −→! q) = (reduce' r p −→! reduce' r q)
  | reduce' r (K! i p) = (r −→! K! i (reduce' r p))
  | reduce' r ([p]! q) = undefined
The second operation, reduce p, reduces the PAL-formula p into epistemic logic by recursion over the syntax:
  primrec reduce :: 'i pfm ⇒ 'i pfm where
    reduce ⊥! = ⊥!
  | reduce (Pro! x) = Pro! x
  | reduce (p ∨! q) = (reduce p ∨! reduce q)
  | reduce (p ∧! q) = (reduce p ∧! reduce q)
  | reduce (p −→! q) = (reduce p −→! reduce q)
  | reduce (K! i p) = K! i (reduce p)
  | reduce ([r]! p) = reduce' (reduce r) (reduce p)
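The two-step reduction can be illustrated with a small Python sketch over tuple-encoded formulas. The encoding and the name reduce1 (standing in for the primed operation) are assumptions for illustration. Where the Isabelle definition returns undefined for a non-static argument, this sketch raises an exception.

```python
# Formulas: ("ff",), ("pro", x), ("dis"|"con"|"imp", p, q), ("k", i, p), ("ann", r, p).

def reduce1(r, p):
    """Rewrite the announcement of r before a static p into a static formula."""
    tag = p[0]
    if tag in ("ff", "pro"):
        return ("imp", r, p)
    if tag in ("dis", "con", "imp"):
        return (tag, reduce1(r, p[1]), reduce1(r, p[2]))
    if tag == "k":
        return ("imp", r, ("k", p[1], reduce1(r, p[2])))
    raise ValueError("reduce1 expects a static formula")

def reduce(p):
    """Eliminate all announcements, innermost first."""
    tag = p[0]
    if tag in ("ff", "pro"):
        return p
    if tag in ("dis", "con", "imp"):
        return (tag, reduce(p[1]), reduce(p[2]))
    if tag == "k":
        return ("k", p[1], reduce(p[2]))
    if tag == "ann":
        # Both arguments are reduced first, so reduce1 only sees static formulas.
        return reduce1(reduce(p[1]), reduce(p[2]))
    raise ValueError(f"unknown formula tag: {tag!r}")

a, b = ("pro", 0), ("pro", 1)
print(reduce(("ann", a, ("k", 1, b))))
# ('imp', ('pro', 0), ('k', 1, ('imp', ('pro', 0), ('pro', 1))))
print(reduce(("ann", a, ("ann", b, ("ff",)))))
# ('imp', ('imp', ('pro', 0), ('pro', 1)), ('imp', ('pro', 0), ('ff',)))
```

The second example shows a nested announcement: the inner one is reduced to an implication first, and the outer one is then distributed over that implication.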
We stay within the pfm type rather than fm, even though we do not use the extra constructors, since our axiomatic system is defined over the pfm type. To prove completeness, we must prove that the reduction preserves the semantics. We do so by first considering the basic reduce’ operation with a static target:

  lemma reduce'_semantics:
    assumes static q
    shows ((M, w |=! [p]! (q))) = (M, w |=! reduce' p q)
    using assms by (induct q arbitrary: w) auto
With this lemma we can prove that reduce preserves the semantics:

  lemma reduce_semantics: (M, w |=! p) = (M, w |=! reduce p)
We refer to the formalization for the proof by structural induction.

4.3 Soundness

We prove the proof system sound in the same way as for epistemic logic:

  theorem soundness:
    assumes ⊢! p
    shows M, w |=! p
    using assms by (induct p arbitrary: M w rule: PA.induct) (simp_all add: ptautology)
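Soundness means that every provable formula holds at every world of every model. As an informal cross-check outside Isabelle, the following Python sketch evaluates one instance of the PK axiom on all Kripke models with two worlds, one agent and two atoms. The tuple encoding, the function names and the chosen instance are assumptions for illustration, and a brute-force check of this kind is no substitute for the inductive proof.

```python
from itertools import product

def sem(model, w, f):
    """Truth of tuple-encoded formula f at world w; model = (pi, acc)."""
    pi, acc = model
    tag = f[0]
    if tag == "ff":
        return False
    if tag == "pro":
        return pi(w, f[1])
    if tag == "dis":
        return sem(model, w, f[1]) or sem(model, w, f[2])
    if tag == "con":
        return sem(model, w, f[1]) and sem(model, w, f[2])
    if tag == "imp":
        return (not sem(model, w, f[1])) or sem(model, w, f[2])
    if tag == "k":
        return all(sem(model, v, f[2]) for v in acc(f[1], w))
    if tag == "ann":  # restrict accessibility to worlds where the announcement holds
        r = f[1]
        cut = (pi, lambda i, u: {v for v in acc(i, u) if sem(model, v, r)})
        return (not sem(model, w, r)) or sem(cut, w, f[2])
    raise ValueError(f"unknown formula tag: {tag!r}")

def iff(p, q):
    return ("con", ("imp", p, q), ("imp", q, p))

WORLDS, ATOMS = (0, 1), (0, 1)
SUBSETS = [frozenset(s) for s in ((), (0,), (1,), (0, 1))]

def holds_on_all_small_models(f):
    """Check f at every world of every two-world, single-agent model."""
    for values in product([False, True], repeat=len(WORLDS) * len(ATOMS)):
        truth = dict(zip(product(WORLDS, ATOMS), values))
        for succ in product(SUBSETS, repeat=len(WORLDS)):
            model = (lambda w, x, t=truth: t[(w, x)],
                     lambda i, w, s=succ: s[w])
            if not all(sem(model, w, f) for w in WORLDS):
                return False
    return True

r, p = ("pro", 0), ("pro", 1)
pk = iff(("ann", r, ("k", 1, p)), ("imp", r, ("k", 1, ("ann", r, p))))
not_an_axiom = ("imp", ("ann", r, ("k", 1, p)), ("k", 1, p))

print(holds_on_all_small_models(pk))            # True
print(holds_on_all_small_models(not_an_axiom))  # False
```

The second formula is falsified when the agent considers possible a world where both r and p are false: knowledge of p after the announcement does not imply knowledge of p beforehand.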
The lemma ptautology is analogous to the theorem tautology from the formalization of epistemic logic.

4.4 Completeness

We prove the proof system complete. Recall that the static formulas are those in which [_]! does not occur. The proof system is complete for such formulas:

  theorem static_completeness:
    assumes static p
      and ∀ (M :: ('i :: countable, 'i fm set) kripke) w. M, w |=! p
    shows ⊢! p
The reason is that
• ⊢! contains all the axioms of ⊢,
• ⊢ is complete, and
• a static formula is straightforwardly a formula of epistemic logic.
With this theorem in place we can prove completeness for all formulas:

  theorem completeness:
    assumes ∀ (M :: ('i :: countable, 'i fm set) kripke) w. M, w |=! p
    shows ⊢! p
We do it by proving that if p is true in all models then so is the formula reduce p, since the reduction preserves the semantics. The formula reduce p does not contain [_]! and is therefore static. By static completeness, reduce p is provable: ⊢! reduce p. Additionally, we prove from the reduction axioms PFF, PPro, PDis, PCon, PImp and PK that ⊢! p ←→! reduce p, and thus that p is provable: ⊢! p.
5 Related Work

For a good overview of the topic of formalizing logical meta-theory we recommend a recent paper by Blanchette [6]. Several frameworks have been developed for proving logical calculi complete. These frameworks allow the reuse of syntax, semantics and proof ideas to formalize logical systems and their soundness and completeness as well as other results:

• Michaelis and Nipkow formalize a bouquet of different proof systems all based on the same syntax for propositional logic [39, 40]. The framework formalizes sequent calculus, natural deduction, Hilbert systems and resolution.
• The framework by Blanchette, Popescu and Traytel allows proofs of soundness and completeness for proof systems for different logics [8–11]. This is possible because their framework is parameterized on the specific syntax and semantics. In a related paper’s supplementary material, Blanchette and Popescu [7] show that a formalized tableau for many-sorted first-order logic in negation normal form with equality fits in the framework. This supplementary material is unfortunately not up to date with recent Isabelle versions.
• A third development is frameworks for proving completeness of saturation provers. Schlichtkrull et al. [60, 63, 64] formalize the completeness of resolution in a generic way that allows for different provers to be built from the development, which is based on the work by Bachmair and Ganzinger [1]. The development is used to show the soundness and completeness of a particular prover using binary resolution and with a specific strategy for removing redundant clauses, but other provers would also fit [61, 62]. Tourret and Blanchette reformalize this result [69, 70] based on the more general theory of saturation provers by Waldmann et al. [73].

Outside of the mentioned frameworks, a number of self-contained formalizations of sequent calculi in proof assistants appear in the literature:

• Ridge and Margetson [37, 57, 58] formalized in Isabelle/HOL soundness and completeness for a sequent calculus for formulas in negation normal form and with a term language of only variables.
• Braselmann and Koepke [12, 13] formalized in Mizar soundness and completeness of a sequent calculus.
• Schlöder and Koepke [66] formalized its completeness considering also uncountable languages.
• A more exotic result is the formalization by Ilik [28] in Coq of completeness of a sequent calculus with respect to a Kripke-semantics for classical first-order logic [29].

The following formalizations appear if we broaden the scope to include intuitionistic logic:

• Persson [50] formalized in ALF the soundness of a sequent calculus for intuitionistic first-order logic.
• Herbelin, Kim and Lee [27] formalized in Coq the completeness of a sequent calculus for intuitionistic first-order logic restricted to formulas with implication and universal quantification as the only logical symbols. Their formalization applied a Kripke-style semantics.

If we broaden the scope further to look beyond sequent calculi, we can mention several other formalizations:

• Jensen, Larsen, Schlichtkrull and Villadsen [32, 65] formalized in Isabelle/HOL an axiomatic system for classical logic.
• Raffalli [53] formalized in Phox natural deduction for classical logic.
• Persson [50] formalized in ALF natural deduction for intuitionistic logic.
• Peltier [49] formalized in Isabelle/HOL superposition.
• Paulson [46–48] formalized in Isabelle/HOL Gödel’s Incompleteness Theorems, but this does not include a completeness proof.
• Popescu and Traytel present a formalization of Gödel’s Incompleteness Theorems [52].
• Jensen, Hindriks and Villadsen [30, 31] also present an approach to formalize in Isabelle/HOL a verification framework for agent programs.

Let us now turn to formalizations of modal logic. These logics contain a single necessity operator rather than one Ki for each agent i in a set of agents:

• Bentzen [3] formalized S5 in Lean.
• Neeley [42] formalized modal systems K, T, S4 and S5 in Lean.

In the context of epistemic logic we found two formalizations in Lean of the S5 system for epistemic logic and PAL. We instead opted to formalize Kn. The S5 system extends Kn in that it has a number of additional axioms, and it is sound and complete when considering Kripke models in which the accessibility relation is an equivalence relation rather than any relation. Additionally there is work on a formalization of intuitionistic epistemic logic in Coq.

• Neeley [42, 43] formalized S5 for epistemic logic and public announcement logic in Lean. Her proof system includes an axiom for the composition of public announcement operators, instead of our axiom for distribution over implication (PImp) and our announcement necessitation rule (PR3).
• Li [35] formalized S5 for epistemic logic and public announcement logic in Lean but only formalized the logical equivalence of the reduction axioms, not the completeness of a proof system that includes them.
• Hagemeier [24] is formalizing intuitionistic epistemic logic in Coq. It is presented in a number of slides, memos and draft memos. We look forward to the finished presentation of the work.
• The Twelf distribution [51] includes a formalization in the LF logical framework [26] of a sequent calculus and natural deduction proof system for classical S5. The Twelf system [51] is worth mentioning by itself. It provides a uniform metalanguage for specifying logics and proof systems and proving meta-theoretical properties like cut-elimination. However, we are not aware of any formalizations of semantic completeness like we present here.

Other interesting proposals for epistemic logic appear in the literature:

• Kądziołka [33] formalized a solution to a puzzle and introduces a logic tailored to the problem that turns out to be very similar to the possible worlds model of epistemic logic.
• Xiong, Ågotnes and Zhang [75] presented a variant of epistemic logic that adds the notion of secret knowledge as a first-class citizen. The notion of secrets can be defined in terms of the knowledge operator, but a new modality for secrets is introduced. The authors argue that the main principles can be studied this way, for instance when considering a language with an operator for secrets and without the usual knowledge operator.

Our formalizations rely on deep embedding of formulas. In contrast, using a shallow embedding of the logic means that we write formulas directly in the proof assistant’s logic. The advantages of a shallow embedding include not having to formalize semantics, and usually the automation has an easier time proving theorems. The advantage of a deep embedding is that we can obtain formalized soundness and completeness theorems, cf. Sect. 2.

• Benzmüller and Paulson [4] formalized in Isabelle/HOL a shallow encoding of modal logic.
• Gleißner, Steen and Benzmüller [22, 23] showed effective automation for a wide range of modal logics due to the use of a shallow embedding.
• Reiche and Benzmüller [54] formalized in Isabelle/HOL a shallow embedding of PAL.

Giselle Reis [55] sees formalizing logics in proof assistants as one of several ways to facilitate meta-theory. Concretely, she looks at three methods for facilitating meta-theory. Firstly, she considers using linear logic and subexponential linear logic as a framework for meta-theoretical reasoning. The idea is that certain logics can be expressed in the meta-logic of linear logic and subexponential linear logic. These logics allow some meta-theoretical properties to be proved automatically. Secondly, she considers the use of proof assistants to prove meta-theoretical properties, which is similar to our work here. She notes:

  “One of the issues when developing proofs of meta-properties by hand is the sheer complexity and number of cases. By implementing these proofs in proof assistants, the computer will not let us skip cases or overlook details.”
We share this experience. Reis experienced that using Coq to formalize logics required her to write specific tactics to do parts of the proofs automatically. In our Isabelle formalization we instead relied on the Isar proof language and the built-in tactics of Isabelle. Reis also explains that working in proof assistants can be combined with the approach of using linear logic as a framework: the idea is that linear logic can be formalized in Coq and then one can use this formalized linear logic to prove properties of other logics. Giselle Reis also notes that formalizations of logics require a significant amount of work:

  “The fact that each of these works is a publication (or collection of publications) itself is evidence that formalizing meta-theory is far from trivial work and cannot be done as a matter of fact.”

We agree with this perspective and see the building of frameworks in proof assistants and formalizing more logics within them as a way to improve this situation. Additionally, improving the proof assistants themselves will help this agenda.

Lastly, Reis considers a solution where the computer aids only in parts of the meta-reasoning, which leaves a part to be done by hand. In particular she considers two systems that can be used for this:

• GAPT [14] (General Architecture for Proof Theory) is a proof theory framework containing common components of proof theory such as data structures, algorithms, parsers and automated deduction. GAPT interfaces to a number of automated reasoning tools and its focus is on transformation and further processing of proofs.
• Sequoia [56] is a tool for helping with the meta-theory of sequent calculi, which can import and export LaTeX code.

Reis concludes that each method has its strengths and weaknesses and also that much work can be done to make them better and easier to use.
6 Concluding Remarks

For artificial intelligence (AI) in general and for natural language processing (NLP) in particular, the interrelationship between logic and information is pivotal [38]:

  “There is a bi-directional relation between logic and information. On the one hand, information underlies the intuitive understanding of standard logical notions such as inference (which may be thought of as the process that turns implicit information into explicit information) and computation. On the other hand, logic provides a formal framework for the study of information itself.”

We have considered fundamental axiomatic systems for both epistemic logic (EL) and public announcement logic (PAL). Instead of presenting pen-and-paper proofs of soundness and completeness we have used automated reasoning, in the form of the Isabelle proof assistant, as a powerful interactive tool. We share the vision of Rob Nederpelt and Herman Geuvers [41, p. 385]:

  “In the future, we expect an enormous increase in the use of proof assistants. Our vision is that formalising a mathematical proof may become as easy as writing mathematics in a mathematical text editor such as LaTeX (Lamport, 1985) and that a mathematical proof will only be accepted for publication when it has been formally checked.”
But, in fact, we do not need to choose between pen-and-paper and mechanically checked proofs, as they can successfully coexist.

Acknowledgement. We thank Frederik Krogsdal Jacobsen for comments on drafts.
References

1. Bachmair, L., Ganzinger, H., McAllester, D.A., Lynch, C.: Resolution theorem proving. In: Robinson, J.A., Voronkov, A. (eds.) Handbook of Automated Reasoning, vol. 2, pp. 19–99. Elsevier and MIT Press (2001)
2. Baltag, A., Renne, B.: Dynamic epistemic logic. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2016 edn. Stanford University, Metaphysics Research Lab (2016)
3. Bentzen, B.: A Henkin-style completeness proof for the modal logic S5. CoRR (2019). https://arxiv.org/abs/1910.01697
4. Benzmüller, C., Paulson, L.C.: Quantified multimodal logics in simple type theory. Logica Universalis 7(1), 7–20 (2013)
5. Berghofer, S.: First-Order Logic According to Fitting. Archive of Formal Proofs (2007). http://isa-afp.org/entries/FOL-Fitting.html
6. Blanchette, J.C.: Formalizing the metatheory of logical calculi and automatic provers in Isabelle/HOL (invited talk). In: Mahboubi, A., Myreen, M.O. (eds.) Proceedings of the 8th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2019, Cascais, Portugal, 14–15 January 2019, pp. 1–13. ACM (2019)
7. Blanchette, J.C., Popescu, A.: Mechanizing the metatheory of Sledgehammer. In: Fontaine, P., Ringeissen, C., Schmidt, R.A. (eds.) FroCoS 2013. LNCS (LNAI), vol. 8152, pp. 245–260. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40885-4_17
8. Blanchette, J.C., Popescu, A., Traytel, D.: Abstract completeness. Archive of Formal Proofs (2014). https://isa-afp.org/entries/Abstract_Completeness.html. Formal proof development
9. Blanchette, J.C., Popescu, A., Traytel, D.: Unified classical logic completeness. In: Demri, S., Kapur, D., Weidenbach, C. (eds.) IJCAR 2014. LNCS (LNAI), vol. 8562, pp. 46–60. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08587-6_4
10. Blanchette, J.C., Popescu, A., Traytel, D.: Abstract soundness. Archive of Formal Proofs (2017). https://isa-afp.org/entries/Abstract_Soundness.html. Formal proof development
11. Blanchette, J.C., Popescu, A., Traytel, D.: Soundness and completeness proofs by coinductive methods. J. Autom. Reason. 58(1), 149–179 (2016). https://doi.org/10.1007/s10817-016-9391-3
12. Braselmann, P., Koepke, P.: Gödel’s completeness theorem. Formal. Math. 13(1), 49–53 (2005)
13. Braselmann, P., Koepke, P.: A sequent calculus for first-order logic. Formal. Math. 13(1), 33–39 (2005)
14. Ebner, G., Hetzl, S., Reis, G., Riener, M., Wolfsteiner, S., Zivota, S.: System description: GAPT 2.0. In: Olivetti, N., Tiwari, A. (eds.) IJCAR 2016. LNCS (LNAI), vol. 9706, pp. 293–301. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40229-1_20
15. Fagin, R., Halpern, J.Y., Vardi, M.Y., Moses, Y.: Reasoning about Knowledge. MIT Press (1995)
16. Fitting, M.: First-Order Logic and Automated Theorem Proving. Graduate Texts in Computer Science, 2nd edn. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-2360-3
17. From, A.H.: Epistemic logic. Archive of Formal Proofs (2018). https://isa-afp.org/entries/Epistemic_Logic.html. Formal proof development
18. From, A.H.: Public announcement logic. Archive of Formal Proofs (2021). https://isa-afp.org/entries/Public_Announcement_Logic.html. Formal proof development
19. From, A.H., Eschen, A.M., Villadsen, J.: Formalizing axiomatic systems for propositional logic in Isabelle/HOL. In: Kamareddine, F., Sacerdoti Coen, C. (eds.) Intelligent Computer Mathematics - 14th International Conference, CICM 2021, Timisoara, Romania, 26–31 July 2021, Proceedings, Lecture Notes in Artificial Intelligence, vol. 12833, pp. 32–46. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81097-9_3
20. From, A.H., Jensen, A.B., Villadsen, J.: Formalized soundness and completeness of epistemic logic. In: LAMAS 2021 - 11th Workshop on Logical Aspects of Multi-Agent Systems (2021)
21. From, A.H., Lund, S.T., Villadsen, J.: A case study in computer-assisted meta-reasoning. In: Special Session on Computational Linguistics, Information, Reasoning, and AI 2021 (CompLingInfoReasAI 2021), Lecture Notes in Networks and Systems, 18th International Conference Distributed Computing and Artificial Intelligence, vol. 332, pp. 53–63. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86887-1_5
22. Gleißner, T., Steen, A.: The MET: the art of flexible reasoning with modalities. In: Benzmüller, C., Ricca, F., Parent, X., Roman, D. (eds.) Rules and Reasoning - Second International Joint Conference, RuleML+RR 2018, Luxembourg, 18–21 September 2018, Proceedings, Lecture Notes in Computer Science, vol. 11092, pp. 274–284. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99906-7_19
23. Gleißner, T., Steen, A., Benzmüller, C.: Theorem provers for every normal modal logic. In: Eiter, T., Sands, D. (eds.) LPAR 2021, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, 7–12 May 2017, EPiC Series in Computing, vol. 46, pp. 14–30. EasyChair (2017). https://easychair.org/publications/paper/6bjv
24. Hagemeier, C.: Formalizing intuitionistic epistemic logic in Coq (2021). https://www.ps.uni-saarland.de/~hagemeier/bachelor.php. BSc thesis
25. Hales, T.C., et al.: A formal proof of the Kepler conjecture. Forum Math. Pi 5, 1–29 (2017). https://doi.org/10.1017/fmp.2017.1
26. Harper, R., Honsell, F., Plotkin, G.D.: A framework for defining logics. J. ACM 40(1), 143–184 (1993). https://doi.org/10.1145/138027.138060
27. Herbelin, H., Kim, S.Y., Lee, G.: Formalizing the meta-theory of first-order predicate logic. J. Korean Math. Soc. 54(5), 1521–1536 (2017)
28. Ilik, D.: Constructive completeness proofs and delimited control. Ph.D. thesis. École Polytechnique (2010). https://tel.archives-ouvertes.fr/tel-00529021/document
29. Ilik, D., Lee, G., Herbelin, H.: Kripke models for classical logic. Ann. Pure Appl. Logic 161(11), 1367–1378 (2010)
30. Jensen, A.B.: Towards verifying GOAL agents in Isabelle/HOL. In: ICAART 2021 - Proceedings of the 13th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 345–352. SciTePress (2021)
31. Jensen, A.B., Hindriks, K.V., Villadsen, J.: On using theorem proving for cognitive agent-oriented programming. In: ICAART 2021 - Proceedings of the 13th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 446–453. SciTePress (2021)
32. Jensen, A.B., Larsen, J.B., Schlichtkrull, A., Villadsen, J.: Programming and verifying a declarative first-order prover in Isabelle/HOL. AI Commun. 31(3), 281–299 (2018). https://doi.org/10.3233/AIC-180764
33. Kądziołka, J.: Solution to the xkcd blue eyes puzzle. Archive of Formal Proofs (2021). https://isa-afp.org/entries/Blue_Eyes.html. Formal proof development
34. Krauss, A.: Defining Recursive Functions in Isabelle/HOL (2021). https://isabelle.in.tum.de/doc/functions.pdf
35. Li, J.: Formalization of PAL·S5 in proof assistant. CoRR (2020). https://arxiv.org/abs/2012.09388
36. Manna, Z., Pnueli, A.: Formalization of properties of functional programs. J. ACM 17(3), 555–569 (1970). https://doi.org/10.1145/321592.321606
37. Margetson, J., Ridge, T.: Completeness theorem. Archive of Formal Proofs (2004). http://isa-afp.org/entries/Completeness.html. Formal proof development
38. Martinez, M., Sequoiah-Grayson, S.: Logic and information. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Spring 2019 edn., Metaphysics Research Lab, Stanford University (2019)
39. Michaelis, J., Nipkow, T.: Formalized proof systems for propositional logic. In: Abel, A., Forsberg, F.N., Kaposi, A. (eds.) 23rd International Conference on Types for Proofs and Programs, TYPES 2017, 29 May–1 June 2017, Budapest, Hungary, LIPIcs, vol. 104, pp. 5:1–5:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)
40. Michaelis, J., Nipkow, T.: Propositional proof systems. Archive of Formal Proofs (2017). http://isa-afp.org/entries/Propositional_Proof_Systems.html. Formal proof development
41. Nederpelt, R., Geuvers, H.: Type Theory and Formal Proof: An Introduction. Cambridge University Press (2014). https://doi.org/10.1017/CBO9781139567725
42. Neeley, P.: A formalization of dynamic epistemic logic. Master’s thesis, Carnegie Mellon University (2021). https://paulaneeley.com/wp-content/uploads/2021/05/draft1.pdf
43. Neeley, P.: Results in modal and dynamic epistemic logic: a formalization in Lean. Slides Lean Together Workshop (2021). https://leanprover-community.github.io/lt2021/slides/paula-LeanTogether2021.pdf
44. Nipkow, T.: Programming and Proving in Isabelle/HOL (2021). https://isabelle.in.tum.de/doc/prog-prove.pdf
45. Nipkow, T., Wenzel, M., Paulson, L.C. (eds.): Isabelle/HOL – A Proof Assistant for Higher-Order Logic. LNCS, vol. 2283. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45949-9
46. Paulson, L.C.: Gödel’s incompleteness theorems. Archive of Formal Proofs (2013). http://isa-afp.org/entries/Incompleteness.html. Formal proof development
47. Paulson, L.C.: A machine-assisted proof of Gödel’s incompleteness theorems for the theory of hereditarily finite sets. Rev. Symb. Log. 7(3), 484–498 (2014). https://doi.org/10.1017/S1755020314000112
48. Paulson, L.C.: A mechanised proof of Gödel’s incompleteness theorems using Nominal Isabelle. J. Autom. Reason. 55(1), 1–37 (2015). https://doi.org/10.1007/s10817-015-9322-8
49. Peltier, N.: A variant of the superposition calculus. Archive of Formal Proofs (2016). http://isa-afp.org/entries/SuperCalc.shtml. Formal proof development
50. Persson, H.: Constructive completeness of intuitionistic predicate logic. Ph.D. thesis, Chalmers University of Technology (1996). http://web.archive.org/web/20001011101511/www.cs.chalmers.se/~henrikp/Lic/
51. Pfenning, F., Schürmann, C.: System description: Twelf – a meta-logical framework for deductive systems. In: CADE 1999. LNCS (LNAI), vol. 1632, pp. 202–206. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48660-7_14
52. Popescu, A., Traytel, D.: A formally verified abstract account of Gödel’s incompleteness theorems. In: Fontaine, P. (ed.) Automated Deduction - CADE 27, pp. 442–461. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29436-6_26
53. Raffalli, C.: Krivine’s abstract completeness proof for classical predicate logic. https://github.com/craff/phox/blob/master/examples/complete.phx (2005, possibly earlier)
54. Reiche, S., Benzmüller, C.: Public announcement logic in HOL. In: Martins, M.A., Sedlár, I. (eds.) Dynamic Logic. New Trends and Applications - Third International Workshop, DaLi 2020, Prague, Czech Republic, 9–10 October 2020, Revised Selected Papers, Lecture Notes in Computer Science, vol. 12569, pp. 222–238. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-65840-3_14
55. Reis, G.: Facilitating meta-theory reasoning (invited paper). In: Pimentel, E., Tassi, E. (eds.) Proceedings Sixteenth Workshop on Logical Frameworks and Meta-Languages: Theory and Practice, Pittsburgh, USA, 16 July 2021, Electronic Proceedings in Theoretical Computer Science, vol. 337, pp. 1–12. Open Publishing Association (2021). https://doi.org/10.4204/EPTCS.337.1
56. Reis, G., Naeem, Z., Hashim, M.: Sequoia: a playground for logicians. In: Peltier, N., Sofronie-Stokkermans, V. (eds.) IJCAR 2020. LNCS (LNAI), vol. 12167, pp. 480–488. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51054-1_32
57. Ridge, T.: A mechanically verified, efficient, sound and complete theorem prover for first order logic. Archive of Formal Proofs (2004). http://isa-afp.org/entries/Verified-Prover.shtml. Formal proof development
58. Ridge, T., Margetson, J.: A mechanically verified, sound and complete theorem prover for first order logic. In: Hurd, J., Melham, T. (eds.) TPHOLs 2005. LNCS, vol. 3603, pp. 294–309. Springer, Heidelberg (2005). https://doi.org/10.1007/11541868_19
59. Ringer, T., Palmskog, K., Sergey, I., Gligoric, M., Tatlock, Z.: QED at large: a survey of engineering of formally verified software. Found. Trends Program. Lang. 5(2–3), 102–281 (2019). https://doi.org/10.1561/2500000045
60. Schlichtkrull, A., Blanchette, J., Traytel, D., Waldmann, U.: Formalizing Bachmair and Ganzinger’s ordered resolution prover. J. Autom. Reason. 64(7), 1169–1195 (2020). https://doi.org/10.1007/s10817-020-09561-0
61. Schlichtkrull, A., Blanchette, J.C., Traytel, D.: A verified functional implementation of Bachmair and Ganzinger’s ordered resolution prover. Archive of Formal Proofs (2018). https://isa-afp.org/entries/Functional_Ordered_Resolution_Prover.html. Formal proof development
62. Schlichtkrull, A., Blanchette, J.C., Traytel, D.: A verified prover based on ordered resolution. In: Mahboubi, A., Myreen, M.O. (eds.) Proceedings of the 8th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2019, Cascais, Portugal, 14–15 January 2019, pp. 152–165. ACM (2019). https://doi.org/10.1145/3293880.3294100
63. Schlichtkrull, A., Blanchette, J.C., Traytel, D., Waldmann, U.: Formalization of Bachmair and Ganzinger’s ordered resolution prover. Archive of Formal Proofs (2018). https://isa-afp.org/entries/Ordered_Resolution_Prover.html. Formal proof development
64. Schlichtkrull, A., Blanchette, J.C., Traytel, D., Waldmann, U.: Formalizing Bachmair and Ganzinger’s ordered resolution prover. In: Galmiche, D., Schulz, S., Sebastiani, R. (eds.) Automated Reasoning - 9th International Joint Conference, IJCAR 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, 14–17 July 2018, Proceedings, Lecture Notes in Computer Science, vol. 10900, pp. 89–107. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94205-6_7
65. Schlichtkrull, A., Villadsen, J., From, A.H.: Students’ Proof Assistant (SPA). In: Quaresma, P., Neuper, W. (eds.) Proceedings 7th International Workshop on Theorem Proving Components for Educational Software (ThEdu), Electronic Proceedings in Theoretical Computer Science, vol. 290, pp. 1–13. Open Publishing Association (2019). https://doi.org/10.4204/EPTCS.290.1
66. Schlöder, J.J., Koepke, P.: The Gödel completeness theorem for uncountable languages. Formal. Math. 20(3), 199–203 (2012)
67. Smullyan, R.M.: First-Order Logic. Springer, Heidelberg (1968). https://doi.org/10.1007/978-3-642-86718-7
68. Tarski, A.: Logic, Semantics, Metamathematics: Papers from 1923 to 1938. Hackett Publishing (1983)
69. Tourret, S.: A comprehensive framework for saturation theorem proving. Archive of Formal Proofs (2020). https://isa-afp.org/entries/Saturation_Framework.html. Formal proof development
70. Tourret, S., Blanchette, J.: A modular Isabelle framework for verifying saturation provers. In: Hritcu, C., Popescu, A. (eds.) CPP 2021: 10th ACM SIGPLAN International Conference on Certified Programs and Proofs, Virtual Event, Denmark, 17–19 January 2021, pp. 224–237. ACM (2021). https://doi.org/10.1145/3437992.3439912
71. Villadsen, J.: A micro prover for teaching automated reasoning. In: Seventh Workshop on Practical Aspects of Automated Reasoning (PAAR 2020) - Presentation Only/Online Papers, pp. 1–12 (2020). http://www.eprover.org/EVENTS/PAAR-2020.html
72. Villadsen, J.: Tautology checkers in Isabelle and Haskell. In: Calimeri, F., Perri, S., Zumpano, E. (eds.) Proceedings of the 35th Edition of the Italian Conference on Computational Logic (CILC 2020), Rende, Italy, 13–15 October 2020, CEUR Workshop Proceedings, vol. 2710, pp. 327–341. CEUR-WS.org (2020). http://ceur-ws.org/Vol-2710/paper-21.pdf
73. Waldmann, U., Tourret, S., Robillard, S., Blanchette, J.: A comprehensive framework for saturation theorem proving. In: Peltier, N., Sofronie-Stokkermans, V. (eds.) IJCAR 2020. LNCS (LNAI), vol. 12166, pp. 316–334. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51074-9_18
74. Wenzel, M.: The Isabelle/Isar Reference Manual (2021). https://isabelle.in.tum.de/doc/isar-ref.pdf
75. Xiong, Z., Ågotnes, T., Zhang, Y.: The logic of secrets. In: LAMAS 2020 - 10th Workshop on Logical Aspects of Multi-Agent Systems (2020)
A Valence Catalogue for Norwegian
Lars Hellan
Norwegian University of Science and Technology (NTNU), N-7491 Trondheim, Norway
Abstract. Essential aspects of a verb’s usage reside in its valence environments. The Norwegian valence resource presented here, called NorVal, has 6,300 verb lemmas, about 3,360 of which are associated with sets of frames. Entries are organized along two dimensions: an enumeration of frame-specific entries, about 15,750 in total, and an enumeration of lemmas, counting 6,300. About 300 frame types are distinguished, inducing the 15,750 frame-specific entries; they take into account most grammatical factors distinguishing verb frames and verb-headed construction types. Both the frame types and the two dimensions of entries are represented in string-based formalisms, enabling simple procedures for comparing individual valence frames, frame-specific entries, and entries representing lemmas, and for doing statistics over types and combinations of all of these. The paper illustrates the resource with respect to its representation of light reflexives, verb particles, and frames including sentential constituents.
Keywords: Verb valence · Norwegian · Set of valence frames for a lemma (valpod) · Frame-specific lexical entry (lexval) · Labeling code · Object · Indirect object · Oblique · Transitivity · Clausal argument (declarative · interrogative · infinitival) · Verb particle · Secondary predicate · Reflexive · Minimal sentence · Logical form of frame type
1 Introduction
NorVal (https://doi.org/10.18710/8U3L2U; https://typecraft.org/tc2wiki/NorVal_resources) is a resource representing the valence potential of more than 6,300 verb lemmas of Norwegian. Two features of its formal design reflect the circumstance that some verb lemmas take more than one valence frame. One feature resides in a format of lexical entries consisting of a lemma and one frame, as a pair; such an entry format we call a lexval. The other feature is the construal of the valence of a lemma as formally a set of lexvals, such that a multi-valent lemma is represented by an enumeration of lexvals, of the form ‘, , …’. Such a set representation (with at least one member per set) is called a valpod. The system counts about 15,750 lexvals, and while there are as many valpods as there are lemmas, more than 3,360 valpods are multi-membered. A further formal feature resides in the representation of valence frame types. About 300 different frame types are recognized as distinguishing between the lexvals, classified
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 49–104, 2022. https://doi.org/10.1007/978-3-030-90138-7_3
according to a modular one-string annotation system called Construction Labeling (CL; cf. Hellan and Dakubu [19], Dakubu and Hellan [11]). It counts types for valence and grammatical functions, relevant for many linguistic areas, and could be compared with ‘Universal Dependency Grammar’ (see footnote 2; cf. Marneffe et al. [35]). The CL system is algorithmically co-operative with a typed feature-system for syntactic and semantic analysis, as outlined in Hellan [14, 15]. The verb inventory is partly derived from, and consistently kept in sync with, the verb lexicon of the computational grammar NorSource (cf. Hellan and Bruland [20]), a grammar based on the framework Head-Driven Phrase Structure Grammar (‘HPSG’, cf. Pollard and Sag [37], using a typed feature structure design (cf. Carpenter [4])), and with the platform LKB for grammar development (cf. Copestake [5]). The design and content of this verb lexicon were in turn at the outset in 2001 informed by the lexicons of TROLL (Hellan et al. [22]) and NorKompLex (cf. Nordgård [36]). The verb inventories of NorVal and NorSource have been applied in the online valence corpus Norwegian Valency Corpus, cf. Hellan et al. [25], and in the comparative online valence resource MultiVal, cf. Hellan et al. [23], derived from HPSG grammars of Norwegian, Ga, Spanish, and Bulgarian. The NorVal resources as such are currently not accessible online, but a DOI connected to this article, viz. Hellan [17], provides excerpts from the files. We call the valence resource a ‘catalogue’ rather than a ‘dictionary’ because of its stripped-down formal format, and because, unlike what is normally understood by a dictionary, it does not offer senses as designated features of its lexical entries.
Seen from a theoretical viewpoint, the assembly of a verb’s environments is likely to reflect essential aspects of the verb’s meaning, and many studies, mostly for English, have aligned verb meanings with valence frames and frame alternations, although most of them cover only a limited set of all the environments that a verb can occur in. NorVal opens up the opposite strategy: with a classification of phenomena which makes limited semantic pre-commitments, it nevertheless enlarges the basis of phenomena that could reflect semantic factors, and thus may allow the identification of factors that have so far not been systematically explored. In the following, Sect. 2 gives an overview of most of the frame types covered and their encoding in terms of the Construction Labeling system. Sect. 3 outlines and illustrates the format of lexvals and valpods. Sect. 4 illustrates how the catalogue can be used in analysis of the phenomena of clausal arguments, particles and light reflexives, all playing a central role in the Norwegian valence system. Section 5 assesses the catalogue relative to issues of redundancy, to the notion of ‘valency class’, and to the prospect of combining valence information on a large scale with sense information. It also compares NorVal in some respects to other valence resources. Section 6 considers possible directions and domains in which the catalogue could be used or further developed.
2 Universal Dependency Grammar: https://universaldependencies.org/.
2 Representing Frame Types
We here describe phenomena categorized in NorVal, notions used in the classifications, and their encoding. The terms used are largely rooted in Scandinavian grammatical tradition, of course influenced by Latin grammar, and harmonizing well with the notions connected to ‘valence’ used in Tesnière [41]. The early phases of Generative Grammar, e.g., Ross [40], brought about a wealth of descriptive labels that were soon assimilated also in Scandinavian linguistics. Although the present exposition has little room for weighing terminological alternatives against each other, some essential choices of terminology are commented on. Norwegian, belonging to the Mainland Scandinavian branch of North Germanic, is an SVO language with strict order among the argument structure constituents. They generally (with well-known provisos not relevant to the present discussion) occur before possible adjuncts. Only personal pronouns have case, for subject vs. non-subject form. Argument structure constituents can be analyzed in terms of traditional grammatical function terms such as ‘subject’ and ‘object’ (the notion ‘object’ being sub-classified as ‘direct’ and ‘indirect’ object when there are two objects), and ‘oblique’ for a prepositional phrase with argument status, the preposition then typically counting as ‘selected’ by the verb. Following the terminology proposed in Marantz [34], subjects and objects (indirect as well as direct object) count as direct arguments, the governed item in an oblique as an indirect argument. The linear order between the arguments is strictly subject - indirect object - direct object - oblique(s). (While the framework of Lexical Functional Grammar (LFG; cf. 
Bresnan [2]) is one that generally supports the formal use of ‘grammatical function’ notions like those here used, it may be noted that while the constituents of a double object construction in LFG are often referred to as ‘object’ and ‘second object’, we here use the traditional terminology of Norwegian grammars. Unlike the tradition of some grammars, we also reserve the notion ‘indirect object’ for NPs, while the prepositional alternative with til (‘to’) or for is counted as an oblique constituent.) A further dimension of grammatical classification resides in notions such as ‘transitive’, ‘intransitive’, ‘ditransitive’ and more, which qualify the overall composition of a valence frame (or construction) rather than any of its constituent parts. We call this dimension of notions global relative to the valence frames. Although there are dependencies between which global label a construction type may carry and which grammatical function labels would be used to qualify its constituents, the two dimensions can be used independently and serve different purposes. Our labeling system is designed accordingly, starting with a global label which indicates which types of grammatical functions are realized, and then continuing with constituent-specific labels of the types outlined above when such qualifications are called for. For instance, the global notion transitive will declare the frame as consisting of a subject and an object (i.e., a direct object, since an indirect object only occurs with a direct object), where both have participant status (thus neither being an expletive pronoun), and where both are direct arguments, thus not preceded by a preposition. In the following, we first introduce notions for the classification of arguments, and then the global notions.
2.1 Argument Labels
Annotation labels for the grammatical functions ‘subject’, ‘object’, ‘direct’ and ‘indirect object’, ‘oblique’, and ‘complement’ are, using the annotation system Construction Labeling (‘CL’) mentioned above, su (for ‘subject’), ob (for ‘object’, and for ‘direct object’ when there is an indirect object present), iob (for ‘indirect object’), obl (for ‘oblique’) and comp (for ‘complement’, used for clausal complements not having object status; cf. Dalrymple and Lødrup [12]). Further argument labels are introduced below. Passive, as a regular grammatical process, applies to both direct and indirect objects (the pattern of so-called ‘symmetrical passive’), and also to the governed NP of oblique constituents. The process is heavily resorted to in defining criteria for grammatical functions, but is not by itself represented in valence frames, patterns induced through passivization processes counting as ‘productively derivable’. The use of the expletive personal pronoun det is pervasive, especially in subject but also in object position, annotated suExpl and obExpl, respectively. (Holen [26] notes that while for most languages pronoun resolution algorithms will be defined for subject as a first choice, this is not so for Norwegian, given the likelihood that the subject may be an expletive.) Constructions with a subject expletive are commonly divided into impersonals (with no direct argument participant), presentationals (with one or two direct argument participants, of which one, the ‘presented’ participant, is indefinite), and extraposition (where the expletive pronoun can be, metaphorically speaking, seen as ‘holding the place’ of a clausal subject or object). An additional pattern which we call extralinking is like extraposition except that the clause is governed by a preposition. The annotation label for an extraposed clause is expn, as in expnDECL when this is a declarative clause, while the prefix for an extralinked clause is exlnk. 
The expletive subject in all these patterns obeys standard criteria of subjecthood, also when in embedded clauses (it thus has a status clearly different from the seemingly similar expletive element in German), and therefore regularly carries the grammatical function su. A widely used construction is that of secondary predication (also called ‘small clauses’, and abbreviated as sc in annotation), where the predicatively functioning phrase - commonly referred to as a predicative – can be an adjective phrase (abbreviated scA), a prepositional phrase (abbreviated scP), a noun phrase (abbreviated scN), or a predicational particle phrase (abbreviated scPredprtcl – see shortly below). They can be predicated of either the subject or the object, referred to as, resp., subject predicatives and object predicatives. They are, following Jespersen [27, 28], referred to as bound predicatives in that they form part of argument structure, as opposed to free predicatives which can apply to either subjects or objects, and have adjunct status. Valence frames often include what is commonly referred to as particles, here analyzed as adverbs, and annotated as prtcl, this counting as a grammatical function. Many particles relate to locative and directional adverbs, but only carrying bleached versions of such meanings if at all. Locative and directional adverbs are in many cases homonyms with prepositions. A rule of thumb distinguishing a sequence ‘preposition + NP’ from a sequence ‘particle + NP’ is that the latter can alternate with the sequence in opposite order, subject to conditions pertaining to weight and category of the NP, while for a sequence ‘preposition + NP’, this is impossible. Moreover, we analyze a preposition as necessarily taking a complement, called its governee, while particles take
neither complements nor specifiers. Consequently a particle can occur without an NP preceding or following it, while a preposition always has a governed NP following it. Distinct from ‘particles’ as now discussed, analyzable as adverbs, are so-called predicate particles, in English exemplified by as, in German by als, and in Norwegian typically as som or for, as in Jeg anser det som håpløst ‘I regard it as hopeless’ or in Jeg anser det for å være håpløst ‘I regard it as being hopeless’. They are close to both prepositions and complementizers, and their typical behavior as part of predicatives is what motivates the label predicate particles, in that they serve to mediate a predication, their complement being a predicate. In contrast, a predicatively used adverb is the predicate. By an ‘NP’ we understand a phrase headed by a noun or a personal pronoun, in the limiting case consisting of the noun only, and also including cases where a quantifier or an adjective occur in a position standardly held by a noun and is arguably ‘derived’ into a noun use. When a standard NP position is held by an NP, we make no annotation for it, except when it occurs as predicative in a small clause (see above) or copula construction, or as governee of a preposition, which we annotate oblN. Partly overlapping with positions where NPs occur are various types of sentential embedded constructions such as declarative clauses (annotated DECL as in suDECL, obDECL, oblDECL, etc. for declarative embedded clauses and correspondingly for the other types), infinitival clauses (annotated with different labels according to control relations, see below) and interrogative clauses (INTERR), the latter sometimes distinguished as ‘yes-no’ clauses, corresponding to those introduced with ‘whether’ in English, and constituent-wh-clauses (annotated INTERRyn and INTERRwh, respectively). 
Concerning the configuration annotated as oblDECL, and similarly for the other clause types, it is to be noted that Norwegian, like other Scandinavian languages, can freely embed all kinds of clauses under prepositions (English, for instance, in contrast, here uses a gerund or other more nominal type of construct). Pervasive throughout the Indo-European languages is what we call the light reflexive pronoun, which in Norwegian has the form seg in the third person, and in the other persons and numbers has the same form as the non-subject pronoun. The label ‘light’ is to contrast it with the form seg selv, which can always be replaced by an NP, whereas for seg this is not always possible (as in Ola skammer seg ‘Ola is ashamed’; see Sect. 4). The light reflexive can occur as indirect and direct object (abbreviated iobRefl and obRefl, respectively) and as a governee of a preposition, either in an oblique constituent (abbreviated oblRefl) or in a PP serving as object predicative (abbreviated scPPrefl). Constructions such as Ola skammer seg (‘Ola is ashamed’) and the presentational Det setter seg en katt her (‘There seats itself a cat here’) are not uncommonly seen as intransitive, but the status of seg as direct object in these constructions is argued for in Hellan and Beermann [21] and Hellan [16].
Some specifications have semantic impact, including the following:
– A verb expressing movement of its subject or object has the respective argument specifications suDir and obDir.
– An oblique expressing a location is marked oblLoc. While in other constructions classified as oblique there is a preposition counting as ‘selected’ by the verb, in locative oblique constructions any locative preposition can occur, the argument dependence here residing in the verb requiring a locative specification. Locative adverbs in such constructions also count as oblique.
– In the formation of ‘possessor raising’, where the object expresses the ‘possessor’ and the oblique expresses the area ‘possessed’, as in hun stryker ham over ansiktet (‘she strokes him over the face’), the oblique is marked oblPRTOFob for ‘the oblique is part of the object’ (in the sense that, in the example, the face is part of him). The subject can also be ‘possessor’.
– The specification of a predicative gives two aspects of semantic information, namely of which argument it is predicated, and whether that argument is at the same time semantically an argument of the verb. The cases where it is semantically an argument of the verb are marked scSuArg or scObArg, where ‘sc’ is for ‘secondary predicate’, the interspersed ‘Su’ or ‘Ob’ indicates whether the predicate is predicated of the subject or the object, and ‘Arg’ indicates semantic connection both to the verb and to the predicative. The cases where it is not semantically an argument of the verb (referred to as ‘non-argument’) are marked as, respectively, scSuNrg and scObNrg, where the interspersed ‘Su’ and ‘Ob’ are as above, and ‘Nrg’ indicates the lack of a semantic connection to the verb. Thus, in a construction like He sang the room empty, the status as ‘empty’ ascribed to the object comes from the secondary predicate, not from the verb (even though the grammatical function as object here obtains relative to the verb). The label ‘scObNrg’ describes such a constellation, literally meaning ‘secondary predicate predicated of a non-argument object’.
– A further marking of a predicative can be used when the property it expresses is caused (as in the above case He sang the room empty), namely by a suffix Csd at the end of the specification pattern mentioned above, thus a label scObNrgCsd.
– When an infinitival clause is ‘controlled’, i.e., having its understood subject interpreted as identical to a constituent in the matrix clause, this control status is marked by ‘Eq’ (for ‘equi-NP-control’), as opposed to ‘Abs’ when no such identity is understood; the ‘Eq’ mark is followed by an identification of which of the verb’s arguments is the controller (which can be subject, indirect object, direct object or oblique), and finished with ‘Inf’. An example is obEqSuInf, meaning ‘object consisting of an infinitival clause equi-controlled by the matrix subject’. An infinitival object not controlled is marked obAbsinf.
– For ‘extraposition’ constructions, where an ‘extraposed’ clause is linked to an expletive det, there is a need to indicate whether the clause is linked to subject or object function. Examples are, respectively, det koster henne mye krefter å slåss alene (‘it costs her much effort to fight alone’), where the infinitival clause serves as ‘logical subject’, and de overlot det til meg å finne en løsning (‘they left it for me to find a solution’), where the infinitival clause serves as ‘logical object’. We indicate these linking-directions in the ‘global’ label rather than infixing them in the label initiated by expn, since the specification of control status of an infinitive constituting the extraposed clause may already require a linking-direction. Thus, for det koster henne mye krefter å slåss alene (‘it costs her much effort to fight alone’), where the extraposed clause å slåss alene is a controlled infinitive with the indirect object henne (‘her’) as controller, the relevant label for indicating the infinitival control is expnEqIobInf, whereby a ‘linking-director’ is already present in the expn specification. To avoid confusion, therefore, the label for indicating the ‘logical’ role of the extraposed infinitive (as ‘logical subject’) is included in the global label (here ditrExpnSu, so that the full frame specification becomes ditrExpnSu-obMeas-expnEqIobInf), rather than in the expn specification.
The formalism also allows for more explicit semantic information (as outlined in Hellan [14, 15]), but these facilities are currently used only minimally, the closest being marking for aspectual values as exemplified in the discussion of spise ‘eat’ in the next section. We summarize the argument labels now mentioned in Table 1 below, where we indicate explicitly the logical role of the various label components, according to their grammatical function (GF), whether they constitute the GF in full or are embedded inside the constituent realizing the GF, semantic role of GF, dependency target of GF (predicated of or controlled by), and semantic argument status of GF (dependent of verb or not). The total number of argument labels is near 80, thus many more than listed here, but most of the principles of their internal composition are reflected in this table.
2.2 Global Labels
Notions such as ‘transitive’, ‘intransitive’, ‘ditransitive’, and more, qualify the overall composition of a valence frame (or construction) rather than any of its constituent parts.
We call this dimension of notions global relative to the valence frames, and the labels reflecting them are the global labels. The formal role of a global label is to declare which grammatical functions are realized in a given construction, and in some cases it also declares the participant structure of the frame and its linking to the grammatical functions. The simplest global labels in these respects are the following:
(1)
intr – one participant, grammatical function: su
tr – two participants, grammatical functions: su and ob
ditr – three participants, grammatical functions: su, iob, and ob
impers – no participants, grammatical function: su (expletive)
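Table 1. Labels for arguments (column 1) and decomposition of the labels (other columns)

| Label | GF | Carrier of the GF | Embedded in PP carrying the GF | Semantic role or function | Target of dependency | Sem-arg status of target |
|---|---|---|---|---|---|---|
| suExpl | su | Expl | | | | |
| obExpl | ob | Expl | | | | |
| expnDECL | expn | DECL | | | | |
| exlnkDECL | exlnk | DECL | | | | |
| prtcl | prtcl | | | | | |
| oblN | obl | | N | | | |
| suDECL | su | DECL | | | | |
| obDECL | ob | DECL | | | | |
| oblDECL | obl | | DECL | | | |
| iobRefl | iob | Refl | | | | |
| obRefl | ob | Refl | | | | |
| oblRefl | obl | | Refl | | | |
| scPPrefl | sc | | Refl | | | |
| suDir | su | | | Dir | | |
| obDir | ob | | | Dir | | |
| oblLoc | obl | | | Loc | | |
| oblPRTOFob | obl | | | PRTOFob | | |
| scSuNrg | sc | | | | Su | Nrg |
| scObNrg | sc | | | | Ob | Nrg |
| scObNrgCsd | sc | | | Csd | Ob | Nrg |
| obEqSuInf | ob | Inf | | Eq | Su | |
| expnEqIobInf | expn | Inf | | Eq | Iob | |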
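The internal composition of the labels in Table 1 can be recovered mechanically from their spelling. The following is a minimal sketch, assuming (as the listed labels suggest, though this is not stated as a rule of the CL code) that the grammatical function is the initial lowercase run and that every further component begins with an uppercase letter. The function name split_label is hypothetical.

```python
import re

# Assumption: GF = leading lowercase run; each later component starts with
# one or more uppercase letters, optionally followed by lowercase letters.
MORPH = re.compile(r"^[a-z]+|[A-Z]+[a-z]*")

def split_label(label):
    """Split an argument label into its GF prefix and further components."""
    return MORPH.findall(label)

for label in ["suExpl", "oblPRTOFob", "scObNrgCsd", "obEqSuInf", "expnEqIobInf"]:
    gf, *components = split_label(label)
    print(label, "-> GF:", gf, "components:", components)
```

Under this assumption scObNrgCsd yields the components Ob, Nrg and Csd after the GF sc, matching the corresponding row of the table.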
More complex global labels are built with the above symbols as initial parts but with further symbols indicating further aspects of frame structure. As initial symbols, the above symbols retain their grammatical functions contributions as in (1), but possibly with further constituents added, while the semantic linking is defined anew for each more complex global label. Examples of such complex global labels, with grammatical functions, are listed below; included here are also two global labels used for copulas, one with the pattern ‘copX’ for predicative use, and ‘copIdX’ for identity predication (‘X’ ranging over ‘Adj’, ‘PP’, ‘N’ etc. in the first case, and ‘N’ and clausal arguments in the second):
(2)
intrObl – grammatical functions: su and obl
trObl – grammatical functions: su, ob and obl
impersObl – grammatical functions: su (expletive) and obl
intrScpr – grammatical functions: su and sc
trScpr – grammatical functions: su, ob and sc
intrPresnt – grammatical functions: su (expletive) and pres
trPresnt – grammatical functions: su (expletive), ob and pres
intrExpn – grammatical functions: su (expletive) and expn
trExpnSu – grammatical functions: su (expletive), ob and expn
trExpnOb – grammatical functions: su (expletive), ob and expn
copAdj – grammatical functions: su and sc
copIdN – grammatical functions: su and id
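As an illustration of how the declarations in (1) and (2) can be put to use, the sketch below transcribes them into a lookup table. The dictionary layout and the function names are assumptions made for this example and are not part of the NorVal files; the entry for impersObl follows Table 2 in declaring su and obl.

```python
# GF declarations transcribed from (1) and (2); layout is illustrative only.
GLOBAL_GFS = {
    "intr": ["su"], "tr": ["su", "ob"], "ditr": ["su", "iob", "ob"],
    "impers": ["su"], "intrObl": ["su", "obl"], "trObl": ["su", "ob", "obl"],
    "impersObl": ["su", "obl"],  # su, obl as in Table 2
    "intrScpr": ["su", "sc"], "trScpr": ["su", "ob", "sc"],
    "intrPresnt": ["su", "pres"], "trPresnt": ["su", "ob", "pres"],
    "intrExpn": ["su", "expn"], "trExpnSu": ["su", "ob", "expn"],
    "trExpnOb": ["su", "ob", "expn"],
    "copAdj": ["su", "sc"], "copIdN": ["su", "id"],
}

# Global labels whose subject is an expletive according to (1) and (2).
EXPLETIVE_SUBJECT = {"impers", "impersObl", "intrPresnt", "trPresnt",
                     "intrExpn", "trExpnSu", "trExpnOb"}

def labels_declaring(gf):
    """Return, sorted, all global labels that declare the given GF."""
    return sorted(g for g, gfs in GLOBAL_GFS.items() if gf in gfs)

def has_expletive_subject(global_label):
    return global_label in EXPLETIVE_SUBJECT

print(labels_declaring("obl"))
print(has_expletive_subject("intr"), has_expletive_subject("intrPresnt"))
```

The last line shows that a shared prefix does not imply shared linking: intr has a participant subject, while intrPresnt has an expletive one.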
We first comment on the suffix ‘Obl’. A two-participant construction where the nonsubject participant is expressed by an oblique (as in I rely on Mary) is called intransitive oblique, abbreviated intrObl. We thus reserve the notion ‘transitive’ for configurations where there is a formal object serving as a direct argument. The construction type which in the present system is called transitive oblique, abbreviated trObl, thus has a formal object and in addition an oblique constituent (as in I tell him about Mary). These labeling conventions are well rooted in general and typological linguistics, however there are also conventions that would favor a notion like ‘transitive oblique’ as applying to what we here call ‘intransitive oblique’ (as in I rely on Mary). Using the term ‘transitive’ here could be seen as anchoring the notion of transitivity more in the semantic binary relation expressed than in the formal pattern. Also the present use of ‘transitive’ can however be seen as semantically grounded, taking as a ‘prototypically’ transitive relation one where force emanates from one participant targeted at another participant, and counting the formal configuration generally used in the language for expressing such a relation as the grammatical transitivity notion (cf. Creissels [7]). These conventions may be inter-translatable, but being in the literal coding strict opposites, care must be taken in observing the difference. We then note cases where the semantic linking for the labels in (1) does not carry over to global labels where they are prefixes. Thus, although as lone-standing labels intr and tr have a participant subject, in intrPresnt and trPresnt, the subject is an expletive; likewise in the extraposition labels. 
A similar point holds for the global label trScpr, which declares the syntactic frame as su-ob-sc, while the semantic status of the object depends on whether it is tied to the frame-bearing verb or only to the secondary predicate, the latter indicated with the infix Nrg (for ‘non-argument’) in the label for the secondary predicate; cf. the discussion above concerning He sang the room empty. The same holds for the global label intrScpr, which declares the syntactic frame as su-sc, and the semantic status of the subject depends on whether it is tied to the frame-bearing verb or only to the secondary predicate, the latter as in He seems happy. (In both cases, the role of secondary predicate can be held by an infinitive, construction types often referred to as raising constructions.)
As a matter of ‘default’ convenience, in the characterization of frames where direct arguments are NPs (in the formal sense mentioned above), no indication is given in the argument labels to this effect; this thus holds for subjects, both types of objects and for ‘presented’ NPs in presentationals (thus, there are no argument labels ‘suN’, ‘obN’, etc.). However, in a ‘non-default’ transitive construction like Who comes first will decide whether we leave, subject and object will have specification, the full specification of the frame in such a case being tr-suINTERR-obINTERR. We summarize the factors represented in global labels in Table 2 below, these being first the grammatical functions they declare, then the semantic participant structures they represent, then the status of subject as ‘full’ or ‘expletive’, then a parameter left open in the global level but specified in an argument label, viz., the status of an NP of which a predicative is predicated relative to the matrix verb, and finally the ‘target’ marking for an extraposed clause. In the column for number of participants, propositional participants are marked as ‘prop’ - as ‘opt(ionally) prop’ when the GF per se can be non-propositional, as with the GF of most declarative, interrogative and infinitival clauses, and simply ‘prop’ for secondary predicates, since these always constitute the predicate of a proposition.
2.3 Global and Argument Labels Together
In the frame type specifications, the DTD of the combinations sets the global label first, followed by argument labels, ordered with subject specifications first, then indirect object, then direct object, etc., so that any combination of labels has a unique internal order. The system has about 60 global labels and 80 labels for specification of arguments, but the main factors are still represented in the frame type specifications in Tables 1 and 2. These tables also show how these factors can be recognized in the ‘morphology’ of label types. 
Information mining in NorVal therefore can target even single ‘morphs’ within the individual labels. This notwithstanding, not all aspects of argument structure information are formally represented in the code. For instance, the global label tr does not itself say how the GFs su and ob are linked to the two participants indicated. This is not necessary for information extraction where one knows how tr is to be interpreted; still, explicitness is useful, and in a type-theoretic underpinning of the CL system presented in Hellan [14] (cf. Carpenter [4], Loukanova [32] for background), all global and argument labels are defined as types relative to a grammatical system where also semantic linking is made explicit. For instance, the type tr here has a type specification illustrated in (3), which includes the semantic linking of subject and object into a semantic space using semantic ‘actant’ notions:
(3)
Table 2. Global labels with GF-declarations and semantic content

| Global label | GFs declared | Semantic arguments of the verb | Subject explet | Predication target (Nrg = not sem. arg of verb) | Correlate of ‘extraposed’ clause |
|---|---|---|---|---|---|
| intr | su | 1 (opt prop) | | | |
| tr | su, ob | 2 (opt 1 or 2 prop) | | | |
| ditr | su, iob, ob | 3 (opt 1 or 2 prop) | | | |
| impers | su | 0 | x | | |
| intrObl | su, obl | 2 (opt 1 or 2 prop) | | | |
| trObl | su, ob, obl | 3 (opt 1, 2, or 3 prop) | | | |
| impersObl | su, obl | 1 | x | | |
| intrScpr | su, sc | 1 (prop) | | su Nrg | |
| intrScpr | su, sc | 2 (1 prop) | | su Arg | |
| trScpr | su, ob, sc | 2 (1 prop) | | ob Nrg | |
| trScpr | su, ob, sc | 3 (1 prop) | | ob Arg | |
| intrPresnt | su, pres | 1 | x | | |
| trPresnt | su, ob, pres | 2 | x | | |
| intrExpn | su, expn | 1 (prop) | x | | ‘logical subject’ |
| trExpnSu | su, ob, expn | 2 (1, opt 2, prop) | x | | ‘logical subject’ |
| trExpnOb | su, ob, expn | 2 (1, opt 2, prop) | x | | ‘logical object’ |
| copAdj | su, sc | 1 (opt prop) | | | |
| copIdN | su, id | 2 (opt 1 prop) | | | |
Moreover, relative to this system, also the hyphens between labels in the CL string have an interpretation, namely as unification operations. In the present setting, though, these type-theoretic aspects will not be discussed, since our present concern is how the frame types as enumerated in the CL formalism can be used in defining a valence catalogue. A first step is to identify exactly those combinations of global labels and argument labels which correspond to distinguishable valence frames in the language. The table in Appendix 1 shows combinations amounting to about 300 valence type specifications relevant for Norwegian.
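The ordering convention of Sect. 2.3 (global label first, then subject, indirect object and direct object specifications, and so on) can be sketched as a small composition routine. The text fixes the order only up to the direct object; the ranking of the remaining GF prefixes in GF_ORDER below is an assumption for illustration, as are the function names.

```python
import re

# Order su < iob < ob is from the text; everything after 'ob' is assumed.
GF_ORDER = ["su", "iob", "ob", "obl", "prtcl", "sc", "pres", "expn", "exlnk", "comp"]

def gf_of(arg_label):
    """GF prefix of an argument label, taken as its leading lowercase run."""
    return re.match(r"[a-z]+", arg_label).group(0)

def frame_type(global_label, arg_labels):
    """Join a global label and argument labels into one hyphenated string."""
    # sorted() is stable, so labels with the same GF keep their given order.
    ordered = sorted(arg_labels, key=lambda a: GF_ORDER.index(gf_of(a)))
    return "-".join([global_label] + ordered)

print(frame_type("ditr", ["obINTERRyn", "iobRefl"]))
print(frame_type("ditrExpnSu", ["expnEqIobInf", "obMeas"]))
```

With these assumptions the two calls reproduce the specifications ditr-iobRefl-obINTERRyn and ditrExpnSu-obMeas-expnEqIobInf used in the text, whatever order the argument labels are supplied in.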
3 Lexvals and Valpods
3.1 Lexvals
Reflecting the circumstance that a lemma can occur in more than one valence environment, we define a lexval as the combination of a lemma and one of its valence environments. The practical format for notation of lexvals is exemplified in (4), with the lemma occurring before the underline and the frame type specification after (ditr-iobRefl-obINTERRyn being one of the 300 frame types defined in the CL code):
(4)
undre__ditr-iobRefl-obINTERRyn
This expresses that the verb lemma undre (‘wonder’) can occur in a ditransitive frame with a reflexive indirect object and a yes-no-interrogative clause as direct object, an example being Hun undrer seg hvorvidt vi kommer (‘She wonders whether we are coming’). When the verb selects a preposition or particle, this is represented as in (5), as exemplified in Hun lurer på om vi kommer (‘She wonders whether we are coming’), where på is a selected preposition:
(5)
lure-på__intrObl-oblINTERRyn
The ‘selected’ preposition is represented in form as hyphenated to the lemma, and by category as indicated through the label part Obl. Again the general frame type follows the underline, while the lexically specific information precedes it. In both of these cases the formal structure is that of an ordered pair, as represented more explicitly in (6a) and (6b):
(6)
a.
b.
In both cases the lemma as such is the first member, and the valence frame is the second member. In (6b), the valence frame in question for lure is divided into the lexically specific information to the left of the underline and the general frame type to the right. In both cases the formal connection between logical format and practical format is clear.
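The correspondence between the practical string format and the ordered-pair format can be illustrated with a short Python sketch. It is not part of the NorVal tooling; the names `Lexval` and `parse_lexval` are hypothetical, and the sketch assumes that the double underline separates lexical from frame information, that lemmas contain no hyphen, and that every hyphen-separated piece after the global label counts as one label (so a compound specification like oblN-ACTIVITY would come out as two pieces).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexval:
    """Hypothetical container for the parts of a lexval string."""
    lemma: str          # verb lemma, e.g. "lure"
    selected: tuple     # selected prepositions/particles, e.g. ("på",)
    global_label: str   # e.g. "intrObl"
    arg_labels: tuple   # e.g. ("oblINTERRyn",)


def parse_lexval(text: str) -> Lexval:
    # Lexically specific information precedes the underline,
    # the general frame type follows it.
    lexical, frame = text.split("__", 1)
    lemma, *selected = lexical.split("-")
    global_label, *arg_labels = frame.split("-")
    return Lexval(lemma, tuple(selected), global_label, tuple(arg_labels))


undre = parse_lexval("undre__ditr-iobRefl-obINTERRyn")
lure = parse_lexval("lure-på__intrObl-oblINTERRyn")
```

Here `lure.selected` holds the selected preposition på, while `undre.selected` is empty, mirroring the difference between (4) and (5).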
3.2 Valpods
To illustrate the notion of valpod, consider the following set of lexvals, being the complete set of lexvals for the lemma spise ‘eat’ (the examples with translations being not part of the lexvals):
(7)
spise__intr (Ex.: de spiser ‘they eat’)
spise-av__intrObl-oblN-ACTIVITY (Ex.: hun spiser av vellingen ‘she eats of the porridge’)
spise-på__intrObl-oblN-ACTIVITY (Ex.: hun spiser på brødstykket ‘she eats of the bread’)
spise__tr (Ex.: de spiser kjøttet ‘they eat the meat’)
spise-innpå__trObl-obRefl-oblN (Ex.: hun spiser seg innpå ham ‘she eats herself onto him’ (= ‘she shortens the distance to him’))
spise-opp__trPrtcl (Ex.: hun spiser opp grøten ‘she eats up the porridge’)
spise__trScpr-obRefl-scObNrgCsd-scPred (Ex.: hun spiser seg frisk ‘she eats herself healthy’)
spise__trScpr-scObNrgCsd-scPred (Ex.: hun spiser tallerkenen tom ‘she eats the plate empty’)
spise-i__trScpr-scPPrefl (Ex.: hun spiser i seg maten ‘she gobbles the food into her’)
The valpod of spise is construed not as a simple enumeration of the lines in (7), but through abstracting out the lemma, and listing all the lexvals with a variable ‘V’ in the place of the lemma, with the lemma itself outside the set representation, as shown in (8):
(8) spise:{V__intr & V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl & V__trScpr-scObNrgCsd-scPred & V__trScpr-obRefl-scObNrgCsd-scPred & V-i__trScpr-scPPrefl}
Formally speaking, the valpod of spise is thus an ordered pair consisting of the lemma and the abstraction part ‘:{…}’. The abstraction itself we call a valpod type. Thus, the valpod type associated with spise is (9) (it may be noted that a valpod, and thus also a valpod type, is written on one line).
(9) :{V__intr & V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl & V__trScpr-scObNrgCsd-scPred & V__trScpr-obRefl-scObNrgCsd-scPred & V-i__trScpr-scPPrefl}
The valpod type is in principle a set (where the prefixed ‘:’ corresponds to a lambda operator inducing the characteristic function of the set) and may intersect with valpod types of other lemmas, or be in super- or subset relations to them. The valpod type being a set, the order of the elements in the list defining the set is in principle not essential. For convenience, however (for the purpose of by-eye observation, or for string matching for multiple lexvals), certain ordering conventions are observed (largely reflecting alphabetic order). Relative to global label, the ordering relative to the initial part of the label is cop… > ditr… > impers… > intr… > tr…. Within each set of lexvals sharing the initial part, lexvals with this part as lone-standing global label are ordered first, followed by lexvals where the global label is complex, according to the alphabetic order of the suffixes attached, so that tr comes before trObl, which in turn comes before trPrtcl, etc. Corresponding conventions apply to the ordering of argument labels. Within each type of sequence thereby obtained, if lexvals have selected items, then they are alphabetically ordered according to the selected items, so that, e.g., V-av__intrObl-oblN-ACTIVITY precedes V-på__intrObl-oblN-ACTIVITY. As mentioned, the number of multi-membered valpods is more than 3000, and it is among these that most issues of information extraction will probably be formulated. Given the set design of the valpods, it is envisaged that recognized techniques of information extraction over sets can be applied. Given at the same time the conventions of ordering in the valpod representations just outlined, and the strict DTDs for the composition of valence frame types (cf. Sect. 2.3), it is clear that the design of the lexval and valpod inventories is amenable also to inspection ‘by eye’ and search over defined strings.
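The step from the lexvals in (7) to the valpod in (8) can be sketched as follows in Python. This is an illustration under assumptions, not the catalogue's own implementation: the function name `valpod_type` is hypothetical, the valpod type is modelled as a `frozenset` of strings, and the ordering conventions are deliberately left out, since only set membership matters for the formal notion.

```python
def valpod_type(lexvals):
    """Abstract the lemma out of each lexval, putting the variable 'V' in its place."""
    members = set()
    for lexval in lexvals:
        lexical, frame = lexval.split("__", 1)
        _lemma, *selected = lexical.split("-")
        members.add("-".join(["V", *selected]) + "__" + frame)
    return frozenset(members)


# The nine lexvals of spise 'eat' as listed in (7).
SPISE_LEXVALS = [
    "spise__intr",
    "spise-av__intrObl-oblN-ACTIVITY",
    "spise-på__intrObl-oblN-ACTIVITY",
    "spise__tr",
    "spise-innpå__trObl-obRefl-oblN",
    "spise-opp__trPrtcl",
    "spise__trScpr-obRefl-scObNrgCsd-scPred",
    "spise__trScpr-scObNrgCsd-scPred",
    "spise-i__trScpr-scPPrefl",
]

spise_type = valpod_type(SPISE_LEXVALS)
spise_valpod = ("spise", spise_type)  # ordered pair: lemma plus valpod type

# Subset relations between valpod types are ordinary set comparisons.
has_intr_and_tr = frozenset({"V__intr", "V__tr"}) <= spise_type
```

Representing the valpod type as a set makes intersection and super- or subset checks against other lemmas available as built-in operations.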
3.3 Further Illustration
With 13 members, the valpod of the lemma bry (‘bother, concern’) has slightly more members than spise, and a rather different valpod:
(10) Valpod for bry (‘bother, care, concern’): bry:{V__tr & V__tr-obRefl & V-med__trObl-oblAbsinf & V-med__trObl-oblDECL & V-med__trObl-oblINTERR & V-med__trObl-oblAbsinf & V-med__trObl-obRefl-oblDECL & V-med__trObl-obRefl-oblINTERR & V-med__trObl-obRefl-oblN & V-om__trObl-obRefl-oblDECL & V-om__trObl-obRefl-oblEqObInf & V-om__trObl-obRefl-oblINTERR & V-om__trObl-obRefl-oblN}
Exemplifications of the lexvals are given in (11a) below, and some of the lexval labels are paraphrased in (11b):
(11) a.
bry__tr (Ex.: de bryr dem ‘they bother them’)
bry__tr-obRefl (Ex.: hun bryr seg ‘she cares’)
bry-med__trObl-oblAbsinf (Ex.: vi bryr dem med å ta opp store spørsmål ‘we bother them by raising big questions’)
bry-med__trObl-oblINTERR (Ex.: vi bryr dem med hva som skal gjøres ‘we bother them with what is to be done’)
bry-med__trObl-oblN (Ex.: vi bryr dem med spørsmål ‘we bother them with questions’)
bry-med__trObl-obRefl-oblAbsinf (Ex.: de bryr seg med å avhjelpe nød i landet ‘they care to counteract need in the country’)
bry-med__trObl-obRefl-oblDECL (Ex.: de bryr seg ikke med at det gjenstår oppgaver ‘they don’t care that there remain tasks’)
bry-med__trObl-obRefl-oblINTERR (Ex.: de bryr seg med hva som blir sagt ‘they care about what gets said’)
bry-med__trObl-obRefl-oblN (Ex.: hun bryr seg med problemene ‘she cares about the problems’)
bry-om__trObl-obRefl-oblEqObInf (Ex.: de bryr seg om å snakke med ofrene ‘they care about speaking with the victims’)
bry-om__trObl-obRefl-oblDECL (Ex.: de bryr seg ikke om at det ble liggende igjen noen kasser ‘they don’t care that some boxes were remaining’)
bry-om__trObl-obRefl-oblINTERR (Ex.: de bryr seg om hva som blir sagt ‘they care what is being said’)
bry-om__trObl-obRefl-oblN (Ex.: han bryr seg om henne ‘he cares about her’)
b.
V__tr-obRefl: transitive where the object is a light reflexive pronoun
V-med__trObl-oblAbsinf: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an infinitival clause with arbitrary control of the subject
V-med__trObl-oblINTERR: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an interrogative clause (whether a yes-no interrogative or a constituent wh-interrogative)
V-med__trObl-oblN: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an NP
V-med__trObl-obRefl-oblAbsinf: transitive plus an oblique PP where the object is a light reflexive pronoun, the selected preposition is med, and the preposition governs an infinitival clause with arbitrary control of the subject
V-med__trObl-obRefl-oblDECL: transitive plus an oblique PP where the object is a light reflexive pronoun, the selected preposition is med, and the preposition governs a declarative clause
Although the differentiating parameters in (10) may seem like rather minor and almost mechanical twists around the basic patterns bry med and bry om, there is no automatism in the availability of these patterns: most combinations of a verb with a selected preposition take just a nominal governee, and once a clausal governee is allowed, it does not follow that all three types of clausal complements are allowed. Indeed, the patterns with seg om in (10) appear only with three other verbs, viz. forklare (‘explain’), forsikre (‘ensure’) and overbevise (‘convince’). The full set of patterns with med instantiated here obtains only with sammenholde (‘align, compare’), and those with seg med with no other verbs at all. Needless to say, the whole valpod type in (10) is unique to bry. For a ‘brief-consultation’ dictionary entry for bry, it might still be felt that the features ‘regular NP vs. light reflexive’ as object, and ‘om vs. med’ as selected preposition will suffice as information, the further possibilities being inferable from the meaning of bry and the meanings of om and med. While these possibilities may well be ‘inferable’ for a person, there exists at this point no formal mechanism for such inferences, let alone a formal repository of word senses from which such an inference could be made. On the contrary, what a meticulous listing of environments such as the one given here may contribute to is an ‘extensional’ circumscription of verb senses, by which one may get closer to defining such possible ‘inferences’. Illustrating the point is the circumstance that the valpods of spise and bry have only one lexval in common. Furthermore, nine of the frames for bry contain a clausal argument, against none for spise; all of these clausal arguments are introduced by a preposition. Obviously this must reflect something about the meanings of the verbs. To get closer to a diagnosis of ‘what’ such a meaning-valence connection may reflect, valpods with the amount of detail displayed in (10) and (7) are required.
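The comparison of the two valpods can be reproduced with plain set operations. The following Python sketch is illustrative only: it takes the spise frames from (7) and the bry frames as exemplified in (11a), and it assumes that a clausal argument can be recognized by the substrings DECL, INTERR, Inf or inf in an argument label (the names `CLAUSAL_MARKERS` and `clausal_labels` are hypothetical).

```python
SPISE_TYPE = frozenset({
    "V__intr", "V-av__intrObl-oblN-ACTIVITY", "V-på__intrObl-oblN-ACTIVITY",
    "V__tr", "V-innpå__trObl-obRefl-oblN", "V-opp__trPrtcl",
    "V__trScpr-scObNrgCsd-scPred", "V__trScpr-obRefl-scObNrgCsd-scPred",
    "V-i__trScpr-scPPrefl",
})

BRY_TYPE = frozenset({
    "V__tr", "V__tr-obRefl",
    "V-med__trObl-oblAbsinf", "V-med__trObl-oblINTERR", "V-med__trObl-oblN",
    "V-med__trObl-obRefl-oblAbsinf", "V-med__trObl-obRefl-oblDECL",
    "V-med__trObl-obRefl-oblINTERR", "V-med__trObl-obRefl-oblN",
    "V-om__trObl-obRefl-oblEqObInf", "V-om__trObl-obRefl-oblDECL",
    "V-om__trObl-obRefl-oblINTERR", "V-om__trObl-obRefl-oblN",
})

# Assumed substrings marking a clausal argument label.
CLAUSAL_MARKERS = ("DECL", "INTERR", "Inf", "inf")


def clausal_labels(frame):
    """Return the argument labels of a frame that look clausal."""
    labels = frame.split("__", 1)[1].split("-")[1:]  # drop the global label
    return [lab for lab in labels if any(m in lab for m in CLAUSAL_MARKERS)]


shared = SPISE_TYPE & BRY_TYPE
spise_clausal = [lab for f in SPISE_TYPE for lab in clausal_labels(f)]
bry_clausal = [lab for f in BRY_TYPE for lab in clausal_labels(f)]
bry_all_oblique = all(lab.startswith("obl") for lab in bry_clausal)
```

On these data the intersection contains the plain transitive frame only, spise has no clausal argument label, and every clausal label of bry is an oblique one.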
4 Using the Resource
Any phenomenon labeled in the resource can be efficiently searched relative to the construction types in which it occurs, and with regard to items it may contain, syntagmatically or as represented by labeled features. Thus, discoveries can be made regarding in how many patterns something thought of as ‘one thing’ can actually appear. One can also focus on larger types of assemblies as within valpods and see patterns regarding what they contain and thus kinds of phenomena that appear together. Where labels have been assigned but categorization is nevertheless in doubt or subject to revision, one can efficiently get an overview of relevant items. Regarding generalizations, the resource allows for ‘frame-predictions’ and ‘sense predictions’ at large. This section illustrates these aspects.
4.1 Clausal Arguments
Among the totality of the 15,700 lexvals in the resource, 1140 lexvals contain an argument specified with DECL as a defining label, 849 lexvals contain an argument specified with INTERR as a defining label, 1066 lexvals contain an argument specified as a controlled infinitive, and 267 lexvals contain an argument specified as an absolute infinitive, meaning that more than 3000 lexvals, or about 20% of the lexvals, contain such an argument. The distributions are rendered in the following tables (Tables 3, 4, 5, 6 and 7).

Table 3. Clausal arguments of type ‘declarative’
Argument label | Instances
suDECL | 87
obDECL | 460
oblDECL | 485
expnDECL | 89
oblExlnkDECL | 5
DECL | 1142

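The proportion stated for clausal arguments can be checked with a few lines of Python. The figures are those given in the running text; the variable names are hypothetical, and the simple sum assumes that no lexval is counted under two clausal types, so it is an upper bound on the number of distinct lexvals.

```python
TOTAL_LEXVALS = 15_700

# Lexval counts per clausal argument type, as stated in the text.
clausal_counts = {
    "DECL": 1140,
    "INTERR": 849,
    "controlled infinitive": 1066,
    "absolute infinitive": 267,
}

total_clausal = sum(clausal_counts.values())
share = total_clausal / TOTAL_LEXVALS

print(f"{total_clausal} lexvals, {share:.1%} of the catalogue")
```

The sum comes to 3322, which matches the statement that more than 3000 lexvals, roughly a fifth of the catalogue, contain a clausal argument.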
Table 4. Clausal arguments of type ‘interrogative’
Argument label | Instances
suINTERR | 22
obINTERR | 235
compINTERR | 77
expnINTERR | 48
oblExlnkINTERR | 1
oblINTERR | 432
INTERR | 849

Table 5. Clausal arguments of type ‘controlled infinitive’
Argument label | Instances
suEqObInf (‘subject is an infinitive controlled by object’) | 21
obEqSuInf (‘object is an infinitive controlled by subject’) | 135
obEqIobInf (‘object is an infinitive controlled by indirect object’) | 51
oblEqSuInf (‘oblique is an infinitive controlled by subject’) | 291
oblEqObInf (‘oblique is an infinitive controlled by object’) | 476
expnEqObInf (‘extraposed is an infinitive controlled by object’) | 31
EqInf | 1066

Table 6. Clausal arguments of type ‘absolute infinitive’
Argument label | Instances
suAbsinf | 17
obAbsinf | 35
oblAbsinf | 160
expnAbsinf | 28
Absinf | 267

Table 7. Clausal arguments of type ‘bare infinitives’
Argument label | Instances
obEqIobBareinf | 2
scBareinf | 18
obEqBareinf | 2
Bareinf | 22

An immediate observation concerning the clausal arguments is that they to a large extent obtain as governees of prepositions, i.e., as obliques. This is summarized in the table (Table 8) further below. Thus, for the arguments specified with INTERR as a defining label, about half appear in an oblique PP, the same holds for arguments specified with infinitive as a defining label, and it holds for almost half of the arguments specified with DECL as a defining label.

Table 8. Oblique clausal arguments summarized
Argument label | Instances
oblDECL | 485
oblINTERR | 432
oblEqSuInf | 291
oblEqObInf | 476
oblAbsinf | 160
Oblique clausal arguments | 1844

As shown in (10) above, the verb bry ‘bother’ has valence frames for all of the three clausal types declarative, interrogative and controlled infinitive. The catalogue contains no less than 217 verbs which have this capacity. In the case of bry, all of these arguments are embedded in a PP; how pervasive is this, compared with having the clauses as direct arguments (i.e., as subject, object, complement, or extraposed)? As the following tables (Tables 9, 10 and 11) show, within these valpods, the distribution is not unlike those for the total set of valpods, so, also here with a majority of oblique arguments:

Table 9. Declarative arguments in the 217 valpods with all types of clausal arguments
Argument label | Instances
suDECL | 19
obDECL | 88
oblDECL | 204
expnDECL | 22

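The oblique shares mentioned for declaratives and interrogatives can be worked out from the table figures. The Python sketch below uses the oblique counts of Table 8 and the totals of Tables 3 and 4; the dictionary names are hypothetical, and the calculation assumes that the table totals are the appropriate denominators.

```python
# Oblique clausal arguments (Table 8).
oblique = {
    "oblDECL": 485,
    "oblINTERR": 432,
    "oblEqSuInf": 291,
    "oblEqObInf": 476,
    "oblAbsinf": 160,
}

# Totals per clausal type (last rows of Tables 3 and 4).
totals = {"DECL": 1142, "INTERR": 849}

oblique_total = sum(oblique.values())
decl_share = oblique["oblDECL"] / totals["DECL"]
interr_share = oblique["oblINTERR"] / totals["INTERR"]

print(oblique_total, f"{decl_share:.0%}", f"{interr_share:.0%}")
```

The sum reproduces the total of Table 8, and the two shares agree with ‘almost half’ for declaratives and ‘about half’ for interrogatives.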
Table 10. Interrogatives in the 217 valpods with all types of clausal arguments
Argument label | Instances
suINTERR | 6
obINTERR | 48
compINTERR | 23
expnINTERR | 25
oblINTERR | 212

The verbs of these valpods are listed in Appendix 2. Seeing whether they may have some meaning factors or other properties enabling this array of argument types for exactly these verbs, may be an interesting undertaking. Here, the role of the prepositions obviously also has to be assessed. Even if the number of prepositions in these roles is not more than 15–20, so that they also have a foot in what we may call the domain of ‘structural’ parameters, their inherent semantics is obviously essential.
Table 11. Controlled infinitives in the 217 valpods with all types of clausal arguments
Argument label | Instances
obEqSuInf | 45
obEqIobInf | 13
oblEqSuInf | 97
oblEqObInf | 110
expnEqObInf | 1

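A query for verbs whose valpods contain all three clausal types can be sketched as follows. This is a toy illustration in Python, not the procedure used to obtain the 217 verbs: the catalogue below holds only a few frames taken from the examples in this chapter, and the label tests (DECL, INTERR, and labels containing Eq and ending in Inf for controlled infinitives) as well as the function names are assumptions.

```python
ALL_THREE = {"declarative", "interrogative", "controlled infinitive"}


def clausal_kinds(valpod_type):
    """Collect the kinds of clausal argument found in a valpod type."""
    kinds = set()
    for frame in valpod_type:
        for label in frame.split("__", 1)[1].split("-")[1:]:
            if "DECL" in label:
                kinds.add("declarative")
            elif "INTERR" in label:
                kinds.add("interrogative")
            elif "Eq" in label and label.endswith("Inf"):
                kinds.add("controlled infinitive")
    return kinds


def verbs_with_all_three(catalogue):
    return sorted(
        lemma for lemma, vtype in catalogue.items()
        if ALL_THREE <= clausal_kinds(vtype)
    )


# Toy catalogue: a subset of the frames shown for bry, lure and spise.
toy_catalogue = {
    "bry": {
        "V__tr",
        "V-om__trObl-obRefl-oblDECL",
        "V-om__trObl-obRefl-oblINTERR",
        "V-om__trObl-obRefl-oblEqObInf",
    },
    "lure": {"V-på__intrObl-oblINTERRyn"},
    "spise": {"V__intr", "V__tr"},
}

result = verbs_with_all_three(toy_catalogue)
```

Of the three toy entries only bry qualifies; absolute and bare infinitives are deliberately not counted as controlled infinitives by the test on the label ending.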
The overall number of lexvals declared for including an oblique argument is around 4600 (global labels including trObl, intrObl, ditrObl, impersObl, …PrtclObl), distributed over 113 frame types, thus nearly one third of all lexvals and more than one third of all frame types, underlining the role of obliques in general. As a last observation relating to clausal arguments, if a verb has any of the types of clausal argument occurring in (a frame type in) its valpod, then the valpod will have at least one more member. There are only two exceptions, as indicated in Table 12 below showing frame distribution for the 2917 valpods with only one member. This is a sweeping example of frame-prediction. This clearly suggests that clausal arguments are in some sense ‘secondary’ within some order of establishment among argument types. How an account for this may go, and analytic consequences, we leave open here. This subsection illustrates what we in the introduction to the section mentioned as cases where something one may think of as ‘one thing’, like a declarative clausal argument, has a wide spectrum of distributions, and the catalogue allows one to get a concise picture of them all. It also illustrates how focus on larger types of assemblies offered through valpod representations allows one to see patterns of cooccurrence of complex entities, as in a possible study of verbs taking all three kinds of clausal arguments.
4.2 Particles and Secondary Predicates
As mentioned, particle is a functional category whose part of speech is adverb. The total number of lexvals containing particles is 1331, distributed over 45 frame types. These adverbs can also serve as directional or locative adjuncts (relative to the well-known formally differentiated groups of directional and locative adverbs as in ut ‘out’, inn ‘in’, etc., vs. ute ‘out(side)’, inne ‘in(side)’, etc., the particle uses of adverbs are reserved for the ‘directional’ versions), which is a different function than particle, and also as predicatives in (bound) predicative constructions, and a question is how to keep instances of such functions apart.
Table 12. Number of valpods with a unique member (i.e., univalent verbs)
Number | Frame type of the unique lexval in valpod
2,145 | Transitive
656 | Intransitive
88 | Transitive with light reflexive object
65 | Intransitive with oblique
42 | Transitive with a particle
36 | Intransitive with a directional subject
28 | Ditransitive
22 | Impersonal
16 | Transitive with directional object
15 | Transitive plus oblique
14 | Ditransitive with light reflexive as indirect object
12 | Transitive with light reflexive object plus oblique
12 | Transitive with light reflexive directional object
11 | Transitive with a particle
10 | Transitive with a particle and light reflexive object
… | …
2 | Subject-controlled infinitive as unique frame: plikte å, unnlate å
0 | Declarative or interrogative argument, extraposed clause, or absolute infinitive

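The frame prediction behind the last rows of Table 12, namely that a valpod containing a declarative, interrogative, extraposed or absolute-infinitive argument has at least one further member, can be phrased as a check over a catalogue. The Python sketch below is hypothetical throughout: the marker strings, the function name, the label assumed for plikte, and the made-up entry demo (included only to show what a violation would look like) are not taken from the resource.

```python
# Assumed substrings for the argument types covered by the prediction.
PREDICTED_MARKERS = ("DECL", "INTERR", "expn", "Absinf")


def prediction_violations(catalogue):
    """Return lemmas whose only frame contains one of the marked argument types."""
    violations = []
    for lemma, vtype in catalogue.items():
        if len(vtype) != 1:
            continue  # multi-membered valpods satisfy the prediction trivially
        (frame,) = vtype
        labels = frame.split("__", 1)[1].split("-")[1:]
        if any(m in lab for lab in labels for m in PREDICTED_MARKERS):
            violations.append(lemma)
    return sorted(violations)


toy_catalogue = {
    # Assumed label for the subject-controlled infinitive frame of plikte.
    "plikte": {"V__tr-obEqSuInf"},
    "spise": {"V__intr", "V__tr"},
    # Made-up univalent entry that would contradict the prediction.
    "demo": {"V__tr-obDECL"},
}

violations = prediction_violations(toy_catalogue)
```

Run over the full catalogue, such a check would be expected to return an empty list, in line with the count of 0 in the last row of the table; the subject-controlled infinitive of plikte is outside the marked types and is therefore not flagged.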
We regard adverbs as in principle incapable of taking NP objects,3 while an adverb with directional or locative function may have prepositional complements (like in English of the house being a complement of out in out of the house), which we may take as a capacity reserved for the non-particle uses. When such an adverb occurs alone, however, it may take some consideration to decide whether it is a directionally used adverb, a predicatively used adverb, or a particle. For instance, ut in Han kastet den ut (‘He threw it out’) is presumably a directionally used adverb (a type instantiated by 245 lexvals where an object undergoes directional movement, coded with obDir as an argument specification); in De frøs ham ut (‘They froze him out’), ut is presumably a predicatively used adverb (instantiated by 33 lexvals coded with scObNrgCsd as argument specification); and in De skjemte ham ut (‘They spoiled him’) ut is presumably a particle, a use type instantiated in 664 lexvals, coded as __trPrtcl as a global specification.
3 Apparent countercases like ut porten ‘out the gate’ as in Han løp ut porten ‘He ran out through the gate’ can be analyzed as generally representing an understood ‘through’ or ‘along’ (cf. Jørgensen [29]), and thus in principle following the pattern of (ran) out of the house. (The opposite position may lead to regarding adverbs as ‘intransitive prepositions’, a notion we thus reject.)
Among the 1331 lexvals having Prtcl as part of their global label, 943 have __trPrtcl as initial part of the label, and 372 have __intrPrtcl as initial part of the label. The numbers of lexvals having these strings as the full global label are, respectively, 664 and 253. The combinations most frequent among lexvals with more complex global labels involve the substring PrtclObl, obtaining with no less than 167 lexvals, while 277 lexvals have still other global labels including Prtcl as part; cf. Appendix 1 for an overview. Given the close connections between the function of particle and the functions of (directional) adjunct and predicative mentioned above, it is obvious that both criteria and assignments will be in constant need of further scrutiny. The case of particles thus is a prime instance of what we referred to in the section introduction as cases where labels have been assigned but categorization is nevertheless in doubt or subject to revision, and the catalogue allows one to efficiently get an overview of relevant items. In this case this will include not just lexvals marked Prtcl in their global label but also lexvals marked for Scpr in their global label and lexvals where an argument specification includes Dir. The notion of frame prediction as conducted relative to the catalogue can be illustrated by a subset of the lexvals marked for Scpr in their global label, namely the causative predicative type. It is represented by the label trScpr-scObNrgCsd (read as ‘transitive with a secondary predicate predicated of a non-argument object, causatively interpreted’) for non-reflexive versions, and by the label trScpr-obRefl-scObNrgCsd for the corresponding reflexive constructions; some valpods have both, so that the number of valpods containing the type is 26, on the current count. A commonly entertained description of this construction is that it expresses an activity leading to a result not inherent as goal in the concept of that activity (which, in an example like De spiste kjøleskapet tomt (‘They ate the refrigerator empty’), is to say that creating food containers is not in the lexical semantics of the notion ‘eating’). This suggests that verbs instantiating the construction can occur intransitively as well, as a frame prediction. 19 of the 26 valpods containing a trScpr-scObNrgCsd item indeed do contain lexvals with the type intransitive. 14 of the 26 valpods also contain the frame type transitive, but for 9 of those, the frame with this specification co-occurs with the frame for intransitive, as the expectation would be. For 5 of the valpods with a trScpr-scObNrgCsd item containing the frame type transitive, however, there is no instance of intransitive. This is then an instance of a frame prediction not immediately borne out, but conceivably resolvable through further analysis.4
4 The 5 ‘aberrant’ verbs are ergre (‘annoy’), kjøpe (‘buy’), skjenke (apart from senses ‘give, donate, endow’, here meaning ‘pour’), spyle (‘flush’) and stue (‘stow’), and one may try to find a factor distinguishing those to be worked into the prediction.
4.3 Light Reflexives
In many linguistic discussions, constructions with what we call light reflexives (LR) have been rather vaguely categorized, often assigned a status intermediate between transitive and intransitive. All the more astounding is that there are no less than 2050 lexvals with LRs, thus about 15% of the total amount of lexvals, and they spread over a wide array of construction types, as illustrated in Table 13 below.
Table 13. Examples of Light Reflexive (LR) constructions, with number of instances
Informal frame description | Example | Translation | Inst.
V_LR | hun vasker seg | She washes herself | 667
V_LR_P_NP | hun befatter seg med dem | She deals with them | 333
Vobjcontrol_LR_P_Inf | hun tvinger seg til å sitte | She forces Refl to sit | 79
V_LR_LOC | hun oppholder seg her | She stays here | 15
V_LR_DIR | hun smyger seg hit | She slithers hereto | 76
V_LR_Prtcl | de dummer seg ut | They make fools of Refl | 139
Vobjcontrol_LR_Inf | hun tillater seg å komme | She allows Refl to come | 8
V_LR_SCPRcaus-Ap | hun løper seg frisk | She runs Refl healthy | 14
V_LR_SCPRstate-AP | hun befinner seg vel | She is well | 9
V_LR_SCPRprtclP | hun oppfører seg som en idiot | She behaves like an idiot | 38
V_LR_PPpossrais | hun gnir seg i nakken | She rubs Refl in the neck | 23
Vrais-to-obj_LR_Inf | hun viser seg å komme | She turns out to come | 1
V_LR_NP | hun lærer seg spansk | She teaches Refl Spanish | 133
V_P_LR | hun rører på seg | She moves | 13
V_SCPR[P_LR_NP] | hun jafser i seg maten | She gobbles up the food | 48
Expl_V_LR_Extraposed | det viser seg at hun lyver | It turns out that she lies | 20
Expl_V_LR_LOC | det satte seg en katt her | There sat down a cat here | 16

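Since light reflexives are marked in the argument labels, locating them in a list of lexvals amounts to a lookup on those labels. The following Python sketch assumes that the labels obRefl, iobRefl, oblRefl and scPPrefl are the relevant ones and matches them exactly; the mapping `LR_LABELS`, its descriptions and the function name are hypothetical, and further reflexive labels in the full CL code would need to be added.

```python
# Assumed mapping from reflexive argument labels to the function of the LR.
LR_LABELS = {
    "obRefl": "object",
    "iobRefl": "indirect object",
    "oblRefl": "governee of an oblique preposition",
    "scPPrefl": "governee in a predicative PP",
}


def light_reflexive_functions(lexval):
    """List the grammatical functions in which a lexval has a light reflexive."""
    labels = lexval.split("__", 1)[1].split("-")[1:]  # drop the global label
    return [LR_LABELS[lab] for lab in labels if lab in LR_LABELS]


examples = [
    "spise-innpå__trObl-obRefl-oblN",
    "undre__ditr-iobRefl-obINTERRyn",
    "spise-i__trScpr-scPPrefl",
    "spise__tr",
]
for lexval in examples:
    print(lexval, light_reflexive_functions(lexval))
```

Counting the non-empty results per function over the whole catalogue would give figures of the kind reported in Table 13.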
As noted earlier, given that LRs occur in designated positions for direct objects, indirect objects and prepositional governees, they are here categorized for these grammatical functions, following Hellan and Beermann [21], Hellan [16]. In these terms, Table 13 shows that there are 667 lexvals where the LR is object,5 333 where the LR is object followed by an oblique, 139 where the LR is object followed by a particle, 130 where the LR is indirect object followed by an object, 79 where the LR is object followed by an object controlled infinitive, and 61 where the LR is governee inside a PP serving as object predicative or oblique, to mention the most frequent. Among multivalent verbs, the reflexive frames are about 1900 in number, constituting about 17% of the totality of frames (lexvals), while among the univalent verbs, the reflexive frames, 150 in number, constitute about 5%. This difference may reflect a tendency for reflexive frames to ‘live on’ the presence of non-reflexive variants of the same overall frame in the valpod of any given verb, without thereby being reducible simply to instances of these patterns. (This recalls the dependence mentioned in Sect. 4.1 of verbs with lexvals with clausal arguments to also have lexvals with non-clausal arguments, although in that case the dependency is more clear-cut.) The latter observations are formally facilitated by the construct of valpods, with a division into uni-membered and multi-membered valpods. Within the array of lexvals as such, the formal probe for the status as ‘LR’ is the argument specification Refl as in obRefl, iobRefl, and oblRefl, and refl as in scPPrefl. This is thus again an instance of how something one may think of as ‘one thing’, i.e., ‘the light reflexive’, has a wide spectrum of distributions, and the catalogue allows one to get a concise picture of them all.
5 In presentational constructions with an LR (as in Det setter seg en katt ‘there seats itself a cat’), the LR also, in our analysis, counts as object; here, the expletive det carrying subject status, en katt is assigned the function of ‘presented’, following Hellan and Beermann [21], Hellan [16]. These constructions have not yet been fully registered in the catalogue, see Sect. 5.1.
4.4 Conclusions
Particles and light reflexives as we have here classified them are generally not so much in focus in formal descriptions; still, finding the ‘rhythm’ of the language very much includes mastery of these aspects. Having concise overviews of these phenomena ought to feed into both linguistic investigations and more practical applications (for pedagogical purposes such as L2 learning, for instance, systematic use of examples displaying the patterns would recommend itself). In view of their pervasive use of prepositions, the ‘rhythm’ aspect also holds for constructions with clausal arguments, but they also hold another aspect of linguistic analysis, namely complexity of constructions: along with clausal adjuncts, these constructions hold the main recursive power within text composition. Identifying uses of these facilities may be crucial to understanding aspects of text structure, useful for studying styles, and, for instance, for diagnosing parameters for what constitutes ‘easy’ and ‘complex’ language. In both respects, thus, a catalogue like the present, through its functions illustrated, will lend itself well to formal, theoretical, descriptive, as well as practical purposes. We pursue some of these points in Sect. 6.
5 Discussion
5.1 Issues of Redundancy
As mentioned at the beginning of Sect. 2, constructions analyzable as ‘passive’ are counted as regularly predictable from patterns with the verb in active form and certain argument distributions counted as ‘basic’.6 The alternating distribution of particles as preceding or following an NP object likewise counts as regularly predictable, and likewise do of course constructions with so-called ‘gaps’ due to front positioning of wh-elements or ‘topicalized’ elements. For ‘presentationals’ and ‘extraposition’, which mostly alternate with structures where the ‘logical subject’ is also the syntactic subject, the matter is less clear, currently with ‘extraposition’ instances being explicitly listed, while most presentational options alternating with the frame intrObl-oblLoc are left unspecified. Another construction type often perceived as having a possibly ‘derived’ status is causative secondary predicates. Following up on the considerations in Sect. 4.2 about ‘predictability’ of this construction, a proposal for constructions like Han spiste tallerkenen tom ‘He ate the plate empty’ has been that the transitive frame of spise ‘eat’ undergoes a rule of ‘Object Deletion’, and subsequently the addition of the ‘small clause’. Such a reasoning would reduce the valpod of spise from its current number of 9 elements to a number of 5, omitting the intransitive and the causative secondary predicate (with scObNrgCsd) frames, and thus exemplifies an analytic approach to enhancing non-redundancy, successful or not.7 While not many frames are involved in this case, an issue of further reach is whether intransitive frames can be generally predictable from transitive ones, such that there would be a ‘lean’ inventory having only one of the two in valpods where the relevant conditions are met, thereby extending ‘Object Deletion’ beyond the domain invoked above. In the catalogue there are 264 valpods with exactly these two frames (the object then being nominal), and about 750 valpods where the two frames co-exist among other frame types, thus altogether more than 1000 candidate cases. It is doubtful whether such a potential rule, accommodating spise, could be ‘widened’ to cover these 1000 verbs, while at the same time being restricted so as to not apply to the about 2000 verbs which have only a transitive frame (cf. Table 12).8 In these considerations, the sheer number of items to take into account has a weight by itself, aside from linguistic issues, and is a contribution from a catalogue design. These are illustrations of ‘specificity’ vs. ‘redundancy’ concerns as they will apply to a resource like the present.
6 In valence assignments in the valence corpus sustained by the grammar Norsource, such ‘derived’ structures are accordingly assigned valence frames reflecting their ‘base’ structure.
7 Rather than (8) above, the valpod thus would be: spise:{V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl & V-i__trScpr-scPPrefl}.
8 A factor here is also that the semantics of ‘input’ and ‘output’ must be consistently related (synonymy in the cases of passives etc., addition of causation between defined entities in the previous case). So, in Nei, han er opptatt, han sitter og spiser (‘No, he is occupied, he is sitting eating’), the verb spise ‘eat’ is arguably used as a one-participant concept; would this be a relevant proposal in many of the 1000 cases? Or how many are analyzable in terms of causativization (or ‘anticausativization’)? A related concern is when a transitive and an intransitive frame are ‘collapsed’ using symbols for optionality, such as in the use of parentheses in notations like ‘NP V (NP)’ for expressing ‘optionality’ of an object: a constant semantic relation of meaning must be defined between such options.
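The two groups of candidate cases for a generalized ‘Object Deletion’, valpods consisting of exactly the intransitive and the transitive frame and valpods where the two co-exist among other frames, can be separated by a simple classification. The Python sketch below is illustrative only: the function name and the category strings are hypothetical, the entry verbX is made up, and the bry entry is reduced to two of its frames.

```python
PAIR = frozenset({"V__intr", "V__tr"})


def classify_intr_tr(valpod_type):
    """Classify a valpod type by how the intransitive/transitive pair occurs in it."""
    frames = frozenset(valpod_type)
    if not PAIR <= frames:
        return "no pair"
    return "exactly the pair" if frames == PAIR else "pair among other frames"


toy_catalogue = {
    "spise": {
        "V__intr", "V-av__intrObl-oblN-ACTIVITY", "V-på__intrObl-oblN-ACTIVITY",
        "V__tr", "V-innpå__trObl-obRefl-oblN", "V-opp__trPrtcl",
        "V__trScpr-scObNrgCsd-scPred", "V__trScpr-obRefl-scObNrgCsd-scPred",
        "V-i__trScpr-scPPrefl",
    },
    "bry": {"V__tr", "V__tr-obRefl"},    # reduced: transitive but no intransitive
    "verbX": {"V__intr", "V__tr"},       # made-up entry with exactly the two frames
}

classes = {lemma: classify_intr_tr(vt) for lemma, vt in toy_catalogue.items()}
```

Tallying the two positive categories over the full catalogue would correspond to the 264 and roughly 750 valpods mentioned in the discussion above.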
5.2 Valpod Intersections vs. ‘Valency Classes’ As is clear from the foregoing, the multi-membered valpods provide ample ground for investigating commonalities between valencies of verbs: what are formally intersections between valpod types may point to interesting similarities between the verbs hosting these types. Abstractly speaking, the notion of valpod intersection is not so remote from the notion of a valence class; however, differences ought to be noted. The notion is tied to Levin [31] and underlies projects and resources such as the Leipzig Valency Classes Project (cf. Malchukov and Comrie [33]), the online database ValPal (http://valpal.info) created from the project, and VerbNet (http://verbs.colorado.edu/~mpalmer/projects/ver bnet.html, cf. Korhonen et al. [30]). A ‘valency class’ is in principle a set of verbs sharing valence frame potential, although the frame types in question need not fully exhaust the frames that each of the verbs can take. What identifies a valence class is rather the recurrence of a small number of frame types across a significant number of verbs - socalled ‘alternation pairs’ -, where these frame types in these lexvals express a common meaning, or meaning alternations over a common semantic parameter. A well-known case is the so-called spray/load alternations, whose common semantic parameter resides in processes involving two incrementally affected participants, with one or the other of the incremental processes being understood as completed, and where the shared pair of frames is of the form ‘_ NP [[prep]NP]PP ’ where the ‘completed’ process is indicated by the NP and the non-completed by the PP, with a different ‘prep’ according to which of the processes it represents. Thus, in spray the wall with paint, with represents the aspect of consumption of paint, the wall indicating its completion, while in spray paint onto the wall, onto represents the aspect of coverage of the wall. 
What distinguishes the notion of ‘valence class’ from general valpod intersection is the explicit identification of a semantic parameter interrelating the frame instances in question; the notion of valpod intersection, on the other hand, is purely formal, with no assumption of semantic relations. However, based on the NorVal catalogue, one can of course pursue issues recognizable as related to the ‘valence class’ notion, namely as issues of sense prediction.
5.3 Valence Frames and Senses
A standard dictionary will identify senses. For instance, the verb lære has at least two senses, one corresponding to English learn, and one corresponding to English teach, and a dictionary may mention the difference. The valence catalogue, in contrast, does not mark the distinction. It just enumerates the following frames for lære (with examples and translations here indicated), without noting that the first two and the last two instantiate the ‘teaching’ lære (with another person, explicit or not, as target of instruction) while the others all instantiate the sense of increasing own knowledge:
A Valence Catalogue for Norwegian
(12)
lære__ditr (Ex.: hun lærer ham mordvinsk ‘she teaches him Mordvinian’)
lære__ditr-obEqIobInf (Ex.: jeg lærer dem å skrive ‘I teach them to write’)
lære__ditr-iobRefl-obEqIobInf (Ex.: hun lærer seg å lese ‘she learns to read’)
lære-av__intrObl-oblN (Ex.: jeg lærer av henne ‘I learn from her’)
lære-om__intrObl-oblN (Ex.: de lærer om vaksiner ‘they learn about vaccines’)
lære-om__intrObl-oblDECL (Ex.: de lærer om at utseendet kan bedra ‘they learn about [that appearance deceives]’)
lære-om__intrObl-oblAbsinf (Ex.: de lærer om å bygge solceller ‘they learn about building solar cells’)
lære-om__intrObl-oblINTERR (Ex.: de lærer om hva som kan gå galt ‘they learn about what can go wrong’)
lære__tr-obEqSuInf (Ex.: hun lærer å lese ‘she learns to read’)
lære__tr-obDECL (Ex.: de lærer at det er galt å lyve ‘they learn that lying is wrong’)
lære__tr (Ex.: de lærer gangetabellen ‘they learn the multiplication table’)
lære-bort__trPrtcl (Ex.: vi lærer bort koden ‘we teach away the code’)
lære-opp__trPrtcl (Ex.: de lærer opp lærlinger ‘they educate apprentices’)
Thus, although (as noted in Sect. 2) the frame type code in many respects reflects semantic parameters, these parameters are general, while the notion of ‘sense’ now in question resides in semantic properties distinguishing words from one another, thus lexically specific properties. How do senses relate to valence frames as here recorded? As a general observation, they stand in a many-to-many relation. The list in (12) illustrates the circumstance that there can be many valence frames reflecting the same sense. Conversely, the situation of one valence frame representing many senses can be illustrated with a verb like løpe ‘run’ which, among its senses, has one of directional movement and one of pure directionality, both of which can be expressed in the frame ‘verb + directional PP’, illustrated below:9
(13)
a. Han løper fra stadion til broen. ‘He runs from the stadium to the bridge.’
b. Linjen løper fra punkt A til punkt B. ‘The line runs from point A to point B.’
In a valence repository like the present one, classification of a verb’s senses thus has to be a cross-classification relative to its valence frames, so that given the set of senses available for a verb (its lex-senses), each lex-sense has to be specified for which lexvals it can be instantiated by, and each lexval has to have a specification for which lex-senses it can express.
9 Also for løpe there is a frame which can be used only for one of the senses, viz. the frame with a caused secondary predicate, which is available only for the reading involving actual movement:
a. Hun løper seg frisk. ‘She runs herself healthy.’
b. *Linjen løper seg krum. ‘The line runs itself curved.’
The circumstance alluded to in (b) can be expressed, e.g., by Linjen krummer seg ‘The line curves’, just not with the valence frame in question.
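The cross-classification of lex-senses and lexvals can be pictured as a many-to-many relation stored in both directions. Below is a minimal sketch under stated assumptions: the sense tags (‘lære:teach’ etc.) are ad hoc labels, not NorVal notation; the two løpe keys are descriptive placeholders, since the text does not give the frame labels; and only a few of the lære lexvals from (12) are included.

```python
from collections import defaultdict


def cross_classify(pairs):
    """Build both directions of the many-to-many relation from (lexval, sense) pairs."""
    senses_of = defaultdict(set)   # lexval -> lex-senses it can express
    lexvals_of = defaultdict(set)  # lex-sense -> lexvals that can instantiate it
    for lexval, sense in pairs:
        senses_of[lexval].add(sense)
        lexvals_of[sense].add(lexval)
    return dict(senses_of), dict(lexvals_of)


# Sense tags are ad hoc; the løpe keys are placeholders, not NorVal lexvals.
PAIRS = [
    ("lære__ditr", "lære:teach"),
    ("lære-bort__trPrtcl", "lære:teach"),
    ("lære__tr", "lære:learn"),
    ("lære-av__intrObl-oblN", "lære:learn"),
    ("løpe + directional PP", "løpe:movement"),
    ("løpe + directional PP", "løpe:pure-directionality"),
    ("løpe + caused secondary predicate", "løpe:movement"),
]

senses_of, lexvals_of = cross_classify(PAIRS)
print(sorted(lexvals_of["lære:learn"]))
print(sorted(senses_of["løpe + directional PP"]))
```

Keeping both directions explicit mirrors the requirement stated above: each lex-sense lists the lexvals that can instantiate it, and each lexval lists the lex-senses it can express.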
Standard dictionaries indicate senses by definitions, synonyms or near-synonyms, or short examples. If one were to envisage an extension of a valence catalogue like NorVal with senses, the cross-classification would of course affect the structure of lexvals and valpods. The ‘catalogue’ aspect would also raise the issue of annotation format, that is, of whether definitions and synonyms can be rendered in a format that would combine with the structures already in the catalogue. From a formal-theoretical viewpoint, both the CL code and its type-theoretic counterpart have in principle formats for representing ‘lexical semantics’ and ‘situation types’, as outlined in Hellan [14, 15]. What would be required first, however, would be an overview of how many senses are distinguished in a normal-size dictionary, and how they distribute over the lemmas; the exact formalization of the senses would be immaterial at such a stage.10
10 A matter closely related to senses is that of multi-word expressions (MWEs), including idioms. They mostly follow the patterns of non-MWEs as far as the syntax is concerned, and so their forms can in principle be specified within the existing frame repertory, while their meanings would be encoded once senses are eventually encoded. The form specification would be an extension of the format already used for ‘selected’ items. For a preliminary outline concerning the type of MWEs called Light Verb Constructions, see Hellan [14, 18].
6 Final Remarks
6.1 Comparison with Other Valence Resources
Among existing monolingual valence dictionaries can be mentioned, for English: FrameNet (5213 verbs, per July 2021);11 VerbNet (6340 verbs);12 PropBank (5649 verbs);13 and EngVallex;14 for German: Evalbu;15 for Czech: Vallex;16 and for Polish: Walenty.17 All of these resources offer excellent online user interfaces, most of them are associated with a corpus accessible from the interface, and most of them expose their analyses for concrete sentences illustrating frames for the verbs. In many cases this includes semantic representations, in the forms of, roughly categorized, AVMs (attribute-value matrices) (e.g., FrameNet), predicate decomposition in the style of Generative or Jackendovian Semantics (e.g., VerbNet), or annotation with semantic roles (e.g., PropBank). Some resources also offer comprehensive descriptions or definitions, like Evalbu. As was outlined in Sect. 2.1, many of the classificatory labels in NorVal carry general semantic content, and a resource connected to NorVal provides information about ‘logical form’ corresponding to this content, viz. the computational grammar NorSource, which analyzes sentences instantiating all of the 15,700 lexvals, displaying their predicate-logic forms, as an online service (cf. Appendix 1).18 None of the valence resources mentioned above are accompanied by such a facility. (This also applies to the way in which the grammar generates valence information in a corpus analyzed by the grammar.) The main feature of NorVal is still the compactness of its enumerations of the totality of frames in which the totality of verbs of the language can occur, whereby investigations drawing on multiple phenomena jointly can be relatively easily conducted. We are not in a position to compare this feature with the relevant corresponding mechanisms of the valence resources mentioned.
11 https://framenet.icsi.berkeley.edu/fndrupal.
12 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html.
13 https://propbank.github.io/.
14 http://ucnk.ff.cuni.cz.
15 https://grammis.ids-mannheim.de/verbvalenz.
16 http://ucnk.ff.cuni.cz.
17 http://clip.ipipan.waw.pl/Walenty; cf. Przepiórkowski et al. [18].
18 A device for displaying the feature structures associated with each frame type label is also envisaged, cf. https://typecraft.org/tc2wiki/NorVal_resources.
6.2 Extendability to Other Languages
The catalogue in its content of course applies only to Norwegian, but its formalism is applicable to any language (the label system already having extensions for many languages). Extended use of the design may open avenues for cross-linguistic comparison, not only for entire catalogues, but also for investigations directed at particular phenomena. Currently a resource similar to NorVal is a valence dictionary for the West African language Ga, by Dakubu [9], containing 470 lemmas and 1834 lexvals, constructed with the same frame description code, but without a formal grouping into valpods. In addition to morpho-syntactic features, it also provides semantic role and situation type labels for all lexvals.19 Like NorVal, these resources build on extensive previous lexicographical work, represented by Dakubu [8]. In general, when valence resources with large lexical coverage are created from pre-existing lexical resources, the effort of super-imposing the valence-sensitive organization is relatively small compared to the effort that went into the original resource; this holds whichever system of organization is used.
19 Dakubu [10] is a monograph expanded from Dakubu [9]. An illustration of valence comparison relative to these resources for Ga vs. Akan is given in Beermann and Hellan [1], based on the lemma ba ‘come’ and its 18 different lexvals in Ga.
The possibilities of using the NorVal design in a more general methodology with such aims might lie in the following direction of steps, based on the implicit assumption that the assembly of a verb’s environments reflects essential aspects of the verb’s meaning:
1. For some other language L, identify general differences between Norwegian (N) and L in the grammatical encoding of argument structure.
2. Modulo these differences, map the frame types defined for N (call this set FrameSet_N) to a putative set of frame types for L, viz. FrameSet_L, thus constructing a list of correspondence pairs:
FrameSet_N | FrameSet_L
FrameN-1 | FrameL-1
FrameN-2 | FrameL-2
… | …
(This will hardly be a one-to-one correspondence, and many frames may lack a counterpart on the other side.)
3. Establish a correspondence of basic verb synonyms between N and L, thus, a list of verb pairs:
VN-1 | VL-1
VN-2 | VL-2
… | …
(Presumably about 2/3 of verbs in one language have single-verb translations in another; this will still give about 4000 synonyms in L.)
4. Assign to each verb VL-n the valpod of its corresponding verb in N, i.e., a valpod where each lexval has a frame in FrameSet_L established in the mapping in point 2.
A somewhat related strategy was assumed in the Leipzig Valency Classes Project presented in Malchukov and Comrie [33], where about 80 verbs from English were mapped to counterparts in about 30 languages, with the aim of establishing the valence potential of these 80 ‘meanings’ across these languages. While the above sketch would be scaled up 50 times compared to the Leipzig project as regards number of verbs, and presumably more extensive as regards frame types, cross-linguistic research like in this project would be essential relative to point 1 above (and likewise other studies under the heading ‘contrastive valency’, like those in Hellan et al. [24]). Whether sustainable methodologies can be developed along these lines of course remains to be seen.
6.3 Possible Applications
6.3.1 Minimal Sentences and POS-Based Valence Annotation
A research initiative described in Quasthoff et al. [39] establishes what is called typical sentences as representative of valence frames, and aims at characterizing the POS patterns of these sentences as parts-of-speech-signatures for the valence frames in question. Although so far conducted for only a few frame types, in German, one can envisage this as an initiative applicable to all valence frame types in all languages. Typical sentences tend to be the shortest sentences where each constituent of the frame in question is realized, and thus come close to the format of minimal sentence examples used in NorVal for each lexval.
To illustrate, corresponding to the first two lines in (12), the POS signatures of the frame types in the left column will be the POS-strings matching the example sentences; lexically specific signatures include the lemma of the head of the frame, as in the rightmost column, while frame type specific signatures mark only POS values, as in the third column in Table 14 below. Appendix 1 below lists lexval instances for all the 300 frame types represented in NorVal, and supplies examples for each lexval. From this overview one could readily construct 300 lexically specific signatures corresponding to these 300 lexvals, on the model of column 4, and, on the model of column 3, one could make 300 frame type specific signatures applicable across all lexvals.
Table 14. Illustrating POS-signatures for frames matching minimal sentences
Lexval and frame | Minimal example | POS-signature for the frame | POS-signature for the lexval
lære__ditr | hun lærer ham mordvinsk | PRON V PRON N | PRON læreV PRON N
lære__ditr-obEqIobInf | jeg lærer dem å skrive | PRON V PRON INF V-INF | PRON læreV PRON INF V-INF
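The two kinds of signature in Table 14 can be derived mechanically from a POS-tagged minimal sentence. The following is a minimal sketch, assuming the input is already lemmatized and tagged with the tag set used in Table 14; the function names, the hand-tagged input and the exact-match lookup are illustrative assumptions, not part of NorVal or of the approach in Quasthoff et al. [39].

```python
def frame_signature(tagged):
    """POS-only signature; `tagged` is a list of (lemma, pos) pairs."""
    return " ".join(pos for _, pos in tagged)


def lexval_signature(tagged, head_lemma):
    """Like frame_signature, but the head verb is written as lemma + POS."""
    return " ".join(
        lemma + pos if (lemma == head_lemma and pos == "V") else pos
        for lemma, pos in tagged
    )


def annotate(tagged, signature_index):
    """Return the lexval whose lexically specific signature matches the clause.

    Returns None if nothing matches; ambiguity and adjuncts are not handled.
    """
    for lemma, pos in tagged:
        if pos == "V":
            hit = signature_index.get(lexval_signature(tagged, lemma))
            if hit is not None:
                return hit
    return None


# Hand-tagged minimal sentences from Table 14, as (lemma, POS) pairs.
S1 = [("hun", "PRON"), ("lære", "V"), ("han", "PRON"), ("mordvinsk", "N")]
S2 = [("jeg", "PRON"), ("lære", "V"), ("de", "PRON"), ("å", "INF"), ("skrive", "V-INF")]

INDEX = {
    lexval_signature(S1, "lære"): "lære__ditr",
    lexval_signature(S2, "lære"): "lære__ditr-obEqIobInf",
}

# A new clause with the same POS pattern and head lemma receives the same lexval.
NEW = [("vi", "PRON"), ("lære", "V"), ("hun", "PRON"), ("norsk", "N")]
print(frame_signature(S1), "|", lexval_signature(S1, "lære"), "|", annotate(NEW, INDEX))
```

Exact string matching is the simplest possible lookup; sentences with adjuncts or reordered constituents would need a looser matching scheme, which this sketch does not attempt.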
An application of such resources may be automatic (or semi-automatic) annotation of POS-annotated corpora for valence. While ‘manual’ corpus annotation for valence is costly, and parser-based automatic annotation for valence presupposes specific technology and can be somewhat error-prone (cf. Hellan et al. [25]), valence annotation based on POS-annotation would be a valuable alternative, possibly a preferred one.
6.3.2 Valence Information in Dictionaries
While the absence of sense specification probably makes it difficult to develop a resource like NorVal into a dictionary, there may be ways in which it could enrich a dictionary through inclusion of valence information. Since standard public dictionaries recognize mainly just intransitive, transitive and reflexive as valence variants, a fine-grained specification like that in (12) for lære would not be readily incorporable, not because of the terms or formulas used (the CL formulas can be turned into normal prose) but because of information overload when so many distinctions are drawn. A better strategy will be to make use of example sentences, presenting them structured according to patterns so that a user gets a direct impression of what the possible expressions are. In a case like lære, one such strategy may be to assemble all the example sentences in (12) en bloc to indicate the richness of patterns available for lære; another would be to use the sentences as frame-wise illustrations relative to the various frames (which need not be named per se); given a recognition of just intransitive, transitive and reflexive, a partial bundling of the frames corresponding to these main groups may be the best alternative. Regardless of which strategy is chosen, the possibility of enriching the stock of examples from corpora could in turn be built in, with access to valence annotation which has already been done, or is executed on call, established by either of the strategies considered above.
6.3.3 Valence Resources in Second Language (L2) Acquisition
NLP-based resources for Norwegian include the NorSource-based grammar-correcting application A Norwegian Grammar Sparrer,20 where freely chosen inputs get a grammatical correctness check and, relative to some phenomena, if ungrammatical, also feedback on what is wrong, with correct versions automatically generated. Corrections concerning valence are here included, but since a verb can have many valence frames, such corrections will have far less ‘determinacy’ than, e.g., corrections for gender, use of articles and the like.
20 https://typecraft.org/tc2wiki/A_Norwegian_Grammar_Sparrer.
A more adequate use of the valence catalogue may reside in the multitude of instances of patterns that can be automatically generated, for instance, verbs with reflexives and particles, verbs with particle plus oblique, etc. While online interfaces could be used for accessing such patterns, one also has the possibility of printing leaflets for various patterns. Needless to say, these perspectives align well with what was said above about applying valence information in the context of dictionaries.
6.3.4 Valence and ‘Complexity Assignment’
When every verb occurrence in a text is annotated for the frame type it has on that occurrence, one can entertain strategies for measuring or assessing complexity of texts in terms of valence, as alluded to in Sect. 4.4. Fairly exact measures can be conceived: the higher the number of arguments in a frame, the higher the score of the frame may be, and clausal arguments may induce a higher score than non-clausal ones, to mention some obvious possibilities. Procedures for ‘counting together’ all the annotations in a text could be defined, and the text as a whole could receive a value, attuned to length as a further factor. Many factors of assumed complexity would of course fall outside such a calculation, such as noun phrase structures, the status of a relative clause as relative or an adverbial clause as adverbial (while the internal structure of such clauses would be measured); possible complexity effects of ‘wh-movement’ would not be measurable in terms of valence, and likewise effects of ‘passive’ when this is not a valence factor, as in NorVal. Nevertheless, most of these factors are in turn easily identifiable in a text and can be added to a total calculation.
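One way such a valence-based score could be computed is sketched below. Everything numerical here is an assumption made for illustration: the base arities read off the frame-type prefix (intr, tr, ditr), the extra argument per oblique, the bonus of 0.5 per clausal argument, and the normalization per 100 tokens are hypothetical choices, and frame types outside these three prefixes (copular, impersonal, etc.) are deliberately not covered.

```python
import re

# Hypothetical weights: base arity by frame-type prefix.
BASE_ARITY = (("ditr", 3), ("intr", 1), ("tr", 2))
# Argument specifications counted as clausal (declarative, interrogative, infinitive).
CLAUSAL = re.compile(r"DECL|INTERR|[Ii]nf")


def frame_score(frame_type, clausal_bonus=0.5):
    """Score a frame type: number of arguments plus a bonus per clausal argument."""
    head, *specs = frame_type.split("-")
    for prefix, arity in BASE_ARITY:
        if head.startswith(prefix):
            break
    else:
        raise ValueError(f"frame type not covered by this sketch: {frame_type}")
    oblique = re.search(r"Obl(\d?)", head)
    if oblique:
        # 'Obl' adds one argument, 'Obl2' adds two.
        arity += int(oblique.group(1) or 1)
    clausal = sum(1 for spec in specs if CLAUSAL.search(spec))
    return arity + clausal_bonus * clausal


def text_score(lexvals, n_tokens):
    """Sum of frame scores over all annotated verb occurrences, per 100 tokens."""
    total = sum(frame_score(lv.split("__", 1)[1]) for lv in lexvals)
    return 100 * total / n_tokens


# Annotations of three verb occurrences in a hypothetical 15-token text.
ANNOTATED = ["lære__ditr", "lære__tr-obDECL", "lære-om__intrObl-oblN"]
print(text_score(ANNOTATED, 15))
```

The factors that fall outside valence, such as noun phrase structure or wh-movement, would have to enter as additional terms in `text_score`; the sketch leaves them out.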
Such calculations would constitute only one side of what a ‘complexity’ investigation would require, the other being how language users actually perceive texts in terms of what they, if asked, would count as ‘complex’, or related properties like ‘difficult’, ‘unclear’, or even ‘coherent’. Relative to defined domains of communication, such as law texts and public assignments or guidelines, such investigations may be coupled to initiatives towards ‘easy’ or ‘clear’ language in the public sector, and what has there been established as ‘better’ practice could be matched against the outcomes of the formal complexity calculations, in cycles of revision and broadening of scope.
6.4 Extending the Catalogue
With explicit information about valence frames, and a minimal format for representation of sameness or relatedness of meaning as envisaged in Sect. 5.3, the catalogue could be extended to register the various ways in which verbs have correlates among nouns and adjectives, and even among other verbs. An example of the latter is how verbs with an initial morpheme that could be characterized as a ‘particle prefix’ have counterparts where that morpheme is absent (counting about 1300 lemmas), and in many cases (about 35%) with a particle similar to that morpheme as a possible valence frame item, thus cases like innsette ‘insert’ vs. sette inn ‘set in’. Representing to what extent such pairs have exactly the same meaning or rather only a ‘motivational’ similarity (300 and 130 cases, respectively, on a current count) will be a natural feature of a valence catalogue, and be useful in L2 applications.
Illustrating for nouns and what is counted as ‘nominalizations’, nouns corresponding to lære ‘learn, teach’ (as commented on in Sect. 5.3) include lærer ‘teacher’, læring ‘learning’, lære (‘taught conception’), and, related to lære opp, opplæring ‘training, education’. Lærer carries the ‘teach’ sense exclusively, and likewise opplæring; læring can carry either the ‘learn’ sense or the ‘teach’ sense, while (en) lære is an ‘inner object’ relative to either the ‘learn’ or the ‘teach’ sense. Some kind of sameness marking across the entries in a noun lexicon and the verb lexicon relative to these lex-senses of lære can be both formally manageable and instructive, e.g., for L2 purposes. This suggests a likely domain into which the catalogue could be extended. It would involve not only the enumeration of relevant nouns and adjectives (these also relating to each other), and an explicit recognition of sense-identifiers (although not sense representations or descriptions), but also an extended lemma architecture for representing derivation, subsumption and morphological relatedness (see Hellan (to appear)). This being all said, 6,300 is not the maximal number of verbs in Norwegian, and for hardly any of them is it likely that the catalogue at its present stage, even with descriptive parameters kept constant, gives a complete representation of their valence potential. So, obviously, these will remain dimensions in which also to develop the catalogue.
Acknowledgments. I am grateful to Dorothee Beermann, the editor Roussanka Loukanova, and the reviewers of this chapter, for comments and advice.
Appendix 1 Overview of Frame Types
The first column in the following table lists lexvals, with one lexval for each frame type. The ordering of the rows reflects a standard ordering of frame types (alphabetically, and according to internal DTD, cf. Sect. 3). Each frame type is presented as part of a lexval, and thus with a lemma to its left, so that when searching according to the alphabetical order of frame types, one must ignore what is to the left of the ‘__’. The English translations often do not quite match the valence pattern of the source sentence, and the points of deviance are marked by a suffix on the word concerned: VR, VP, VL, VT, VI, VNS mean that V (mostly the verb, but also a preposition) differs in valence from the Norwegian counterpart with respect to, respectively, reflexive, preposition, particle, finite complementizer, infinitival marker, non-split predicate (in most cases the factor is missing in the translation, but in some cases in the original). To see the logical forms associated with the various frame types, the example sentences in column 2 can be entered into the online grammar parse window at http://regdili.hf.ntnu.no:8081/linguisticAce/parse, where in most cases an MRS (‘Minimal Recursion Semantics’; cf. Copestake et al. [6]) representation, close to a standard predicate logic representation, is displayed for each parse. Further supporting facilities are described at https://typecraft.org/tc2wiki/NorVal_resources (Table 15).
Appendix 2 Verbs Allowing for All Three Types of Clausal Arguments: Declaratives, Interrogatives and Infinitives
See Table 16.
Table 15. Lexvals instantiating frame types illustrated with examples
Lexval identifier | Example for frame type | English translation
bli__copAdj | dette blir hyggelig | This will be nice
bli__copAdj-suDECL | at det etableres en god praksis blir avgjørende | That a good practice gets established is decisive
bli__copAdj-suINTERR | hvem som vinner blir avgjørende | Who wins will be decisive
bli__copAdv | hun blir her | She remains here
være__copExpnAdj-expnAbsinf | det er hyggelig å løpe maraton | It is nice to run marathon
være__copExpnAdj-expnDECL | det er hyggelig at hun vant | It is nice that she won
være__copExpnAdj-expnINTERRwh | det er uvisst hvem som vinner | It is uncertain who will win
være__copExpnAdj-expnINTERRyn | det er uvisst om hun vinner | It is uncertain whether she will win
bli__copExpnN-expnAbsinf | det blir en ære å motta gjesteforskerinvitasjoner | It becomes an honor to receive guest researcher invitations
bli__copExpnN-expnDECL | det blir en ære at dere inviterer meg | It becomes an honor that you invite me
bli__copExpnN-expnINTERRwh | det blir et hovedspørsmål hvem som snakker | It becomes a main theme who talks
bli__copExpnN-expnINTERRyn | det blir et hovedspørsmål om han fortsetter | It becomes a main theme if he continues
bli__copExpnPP-expnDECL | det blir under tvil at han fortsetter | It will be under doubt that he continues
bli__copExpnPP-expnINTERRyn | det blir under kontinuerlig vurdering hvorvidt han fortsetter i stillingen | It will be under continued consideration whether he continues in the position
bli__copIdAbsinf | en slik avtale blir å avbryte samarbeidet | Such a deal will be to discontinue the cooperation
bli__copIdAbsinf-suAbsinf | å inngå en slik avtale blir å avbryte samarbeidet | To enter into such a deal will be to discontinue the cooperation
bli__copIdDECL | innholdet i avtalen blir at vi frastår eiendommen | The content of the deal will be that we relinquish the property
bli__copIdN | han blir den nye representanten | He becomes the new representative
bli__copIdINTERRyn | spørsmålet blir om han kommer | The question will be whether he comes
bli__copIdINTERRwh | spørsmålet blir hvem som kommer | The question will be who comes
være__copImpersAdjLoc | det er fint i Finnmark | It is fine in Finnmark
være__copN | han er bonde | He is a farmer
være__copN-suAbsinf | å inngå denne avtalen er en skandale | To enter this deal is a scandal
være__copN-suDECL | at han får komme er en skandale | That he is eligible for coming is a scandal
være__copN-suINTERRwh | hvem som vinner er et spørsmål | Who will win is a question
være__copN-suINTERRyn | om han vinner er et spørsmål | Whether he will win is a question
være__copPP | hun er i Finnmark | She is in Finnmark
være__copPP-suDECL | at han får komme er under sterk tvil | That he gets admitted is under strong doubt
være__copPredprtcl | hun er som en ninja | She is like a ninja
være__copToFind | han er å treffe på strandeiendommen | He is to be met with at the beach property
være__copToughAdj | han er hyggelig å snakke med | He is pleasant to talk with
yte__ditr | de yter oss kompensasjon | They provide us compensation
koste-1__ditrExpnSu-obMeas-expnEqIobInf | det koster henne mye krefter å slåss alene | It costs her much effort to fight alone
lage__ditr-iobRefl | hun lager seg en modell | She makes herself a model
merke__ditr-iobRefl-obDECL | hun merker seg at du kommer | She notesR that you are coming
motsette__ditr-iobRefl-obEqIobInf | hun motsetter seg å skulle åpne paraden | She resistsR to have to open the parade
tenke__ditr-iobRefl-obINTERR | jeg tenker meg hva det er | I imagineR what it is
tilgi__ditr-obDECL | hun tilgir oss at vi forløp oss | She forgives us that we made a faux pas
be__ditr-obEqIobBareinf | hun ber dem komme | She asksI them to come
befale__ditr-obEqIobInf | jeg befaler deg å gå | I order you to go
garantere__ditr-obEqSuInf | hun garanterer dem å bidra | She guarantees them to contribute
innprente__ditr-obINTERR | vi innprenter dem hva som skal gjøres | We tell them what is to be done
undre__ditr-iobRefl-obINTERRwh | han undrer seg hvem som kommer | He wondersR who is coming
vise__ditr-obINTERRyn | instrumentet viser oss om det blir væromslag | The instrument tells us whether there will be a weather change
undre__ditr-iobRefl-obINTERRyn | han undrer seg om vi kommer | He wondersR whether we are coming
kaste__ditrObl-oblPRTOFiob | ekornet kaster oss nøtter i hodet | The squirrel throws us nuts on our heads
frarøve__ditr-suAbsinf | å si slikt frarøver politikerne respekten | To say such things deprivesP politicians of their respect
gi__ditr-suDECL | at saken gjenåpnes gir oss mot | That the case is reopened gives us courage
vise__ditr-suDECL-obDECL | at de er så ekstra vennlige viser oss at de har baktanker | That they are so utterly friendly shows us that they have back-thoughts
vise__ditr-suDECL-obINTERR | at de applauderer viser oss hvordan de tenker | That they applaud shows us how they think
vise__ditr-suINTERR | hvem som kommer vil vise oss planen | Who comes will show us the plan
vise__ditr-suINTERR-obINTERR | hvem som kommer vil vise oss hva som kommer til å skje | Who comes will show us what will happen
hagle__impers | det hagler | It hails
gå-i__impersObl-oblN | det går i døren | “Someone moves by the door”
tykne-til__impersPrtcl | det tykner til | “It gets more overcast”
tørke__intr | klærne tørker | The clothes dry
ende__intrAdv | det ender godt | It ends well
skje__intrAdvExpn-expnDECL | det skjer ofte at folk blir syke | It often happens that people get sick
skje__intrAdvPresnt | det skjer ofte ulykker | There often occur accidents
skje__intrAdv-suDECL | at folk blir syke skjer ofte | That people get sick often happens
skulle__intrAuxmodScpr-scSuNrg-scBareinf | han skal gå | He shall go
være__intrAuxpassScpr-scSuNrg-scPass | han er skutt | He is shot
ha-perf__intrAuxperfScpr-scSuNrg-scPerf | han har kommet | He has come
huske__intrComp-compINTERR | hun husker hvem som kommer | She recalls who comes
bevise__intrComp-suDECL-compINTERR | at de støtter ham beviser hvem som står bak | That they support him proves who stands behind
mankere__intrExpn-expnAbsinf | det mankerer å løse den siste oppgaven | It fails to solve the last task
trengs__intrExpn-expnDECL | det trengs at en spesialist ser på det | It is necessaryNS that a specialist looks at it
ryktes__intrExpn-expnINTERR | det ryktes hva som holdes skjult | It is rumouredNS what is being kept hidden
stå__intrLghtScpr-scAdj | kjelleren står tom | The basement stands empty
debutere-som__intrLghtScpr-scPredprtcl | han debuterer som forfatter | He makes debutNS as author
fremstå-som__intrLghtScpr-scSuNrg-scPredprtcl | han fremstår som hovedtaler | He stands upNS as main speaker
konferere-med-om__intrObl2-obl1N-obl2Absinf | vi konfererer med dem om å finne en løsning | We confer with them aboutI finding a solution
samtale-med-om__intrObl2-obl1N-obl2DECL | vi samtaler med dem om at man kan finne løsninger | We talk with them aboutT (the circumstance) that one can find a solution
skjenne-på-for__intrObl2-obl1N-obl2EqObl1Inf | vi skjenner på dem for å ha knust ruten | We scoldP them forI having broken the window glass
forhandle-med-om__intrObl2-obl1N-obl2EqSuInf | de forhandler med geriljaen om å kunne komme ut | They negotiate with the guerilla aboutI getting out
underhandle-med-om__intrObl2-obl1N-obl2INTERR | vi underhandler med dem om hvorvidt vi kan få visse fordeler | We negotiate with them about whether we can get certain advantages
kappkjøre-med-om__intrObl2-obl1N-obl2N | de kappkjører med prinsen om prisen | They race with the prince about the prize
dages-for__intrOblExlnk-oblExlnkAbsinf | det dages for å starte det endelige slag | It is timeNS forI starting the final battle
helle-mot__intrOblExlnk-oblExlnkDECL | det heller mot at det blir ekstraomganger | It tends towards that there will be extra time
dages-for__intrOblExlnk-oblExlnkINTERR | det dages for hvordan rettferdighet kan skje fyldest | It is timeNS for how justice can be honored
avhenge-av__intrOblExpn-expnDECL | det avhenger av deg at turen går bra | It depends on you that the trip goes well
avhenge-av__intrOblExpn-expnINTERRwh | det avhenger av deg hvem som vil vinne | It depends on you who will win
avhenge-av__intrOblExpn-expnINTERRyn | det vil avhenge av deg om turen går bra | It will depend on you whether the voyage goes well
bero-på__intrOblExpn-oblINTERRwh-expnINTERRwh | det vil bero på hvem som kommer hvem som vil vinne | It will depend on who comes who will win
avhenge-av__intrOblExpn-oblINTERRwh-expnINTERRyn | det vil avhenge av hvem som kommer hvorvidt vi får delta | It will depend on who comes whether we may participate
avhenge-av__intrOblExpn-oblINTERRyn-expnINTERRwh | det vil avhenge av hvorvidt vi deltar hvem som vil vinne | It will depend on whether we participate who will win
86
L. Hellan Table 15. (continued)
Lexval identifier
Example for frame type
English translation
avhenge-av__intrOblExpn-oblINTERRyn-expnINTERRyn
det vil avhenge av hvorvidt vi deltar om de vil vinne
It will depend on whether we marticipate who will win
bidra-til__intrObl-oblAbsinf
de bidrar til å løse problemene
They contribute toI solving the problems
blånekte-på__intrObl-oblDECL
de blånekter på at de visste noe
They denyP that they knew something
bløffe-om__intrObl-oblEqSuInf
de bløffer om å ville nå klimamålene
They bluff aboutI wishing to reach the climate goals
fable-om__intrObl-oblINTERR
de fabler om hvordan de kan oppfinne en ny art
They fabulate about how they can create a new species
avhenge-av__intrObl-oblINTERRwh
utfallet vil avhenge av hvem som kommer
The outcome will depend on who comes
bo__intrObl-oblLoc
de bor her
They live here
bomme-på__intrObl-oblN
han bommer på målet
He missedP the target
spise-på__intrObl-oblN-ACTIVITY
hun spiser på brødstykket
She eats of the bread
fryse__intrObl-oblPRTOFsu
han fryser på ryggen
He freezes on his back
røre-på__intrObl-oblRefl
han rører på seg
He movesP,R
tegne-til__intrOblRais-oblRaisInf
det tegner til å bli uvær
It seemsP to become bad weather
minne-om__intrObl-suAbsinf
å spise reker minner om nedlagte havner
To eat shrimps reminds (one) of abandoned harbours
tyde-på__intrObl-suDECL-oblDECL
at kursen synker tyder på at det verste er over
That the price goes down indicatesP that the worst is over
bero-på__intrObl-suDECL-oblN
at han fikk jobb beror på deg
That he got a job is dueNS to you
komme-an-på__intrPrtclObl-suINTERR-oblINTERR
hvem som får kjøre kommer an på om været blir bra
Who may drive dependsL on whether the weather gets good
avhenge-av__intrObl-suINTERR-oblN
om han får jobb vil avhenge av deg
Whether he gets a job will depend on you
peke__intrPath-suDir-PUREORIENTATION
pilen peker mot øst
The arrow points towards east
trengs__intrPresnt
det trengs en spesialist
There is needed a specialist (cf. French: Il faut un spécialiste)
sprette__intrPresntDir
det hopper en katt opp i stolen
There jumps a cat into the chair
stå__intrPresntLoc
det står en kommode her
There stands a chest of drawers here
versere-om__intrPresntObl-oblDECL
det verserer rykter om at han kommer
There are circulating rumours aboutT him coming
versere-om__intrPresntObl-oblINTERR
det verserer rykter om hva vi har i vente
There are circulating rumours about what we may expect
versere-om__intrPresntObl-oblN
det verserer rykter om ham
There are circulating rumours about him
vike-unna__intrPrtcl
de viker unna
They shy away
høre-med__intrPrtclExpn-expnAbsinf
det hører med å snakke med pressen
It belongsL to talk with the press
høre-med__intrPrtclExpn-expnDECL
det hører med at pressen stiller opp
It belongsL that the press appears
komme-an-på__intrPrtclOblExpn-expnDECL
det kommer an på deg at denne turen går bra
It dependsL on you that this tour goes well
komme-an-på__intrPrtclOblExpn-expnINTERRwh
det vil komme an på deg hvem som vinner
It dependsL on you who will win
komme-an-på__intrPrtclOblExpn-expnINTERRyn
det vil komme an på deg om turen går bra
It dependsL on you whether this tour goes well
komme-an-på__intrPrtclOblExpn-oblINTERRwh-expnINTERRwh
det vil komme an på hvem som kommer hvem som vil vinne
It dependsL on who comes who will win
komme-an-på__intrPrtclOblExpn-oblINTERRwh-expnINTERRyn
det vil komme an på hvem som kommer hvorvidt vi vil vinne
It dependsL on who comes whether we will win
komme-an-på__intrPrtclOblExpn-oblINTERRyn-expnINTERRwh
det vil komme an på hvorvidt vi deltar hvem som vil vinne
It dependsL on whether we participate who will win
komme-an-på__intrPrtclOblExpn-oblINTERRyn-expnINTERRyn
det vil komme an på hvorvidt vi deltar om de vinner
It dependsL on whether we participate whether they will win
rakke-ned-på__intrPrtclObl-oblAbsinf
de rakker ned på å samle inn penger
They slanderL,P,I collecting money
rippe-opp-i__intrPrtclObl-oblDECL
vi ripper opp i at saken aldri ble etterforsket
We rip up inT the case being never investigated
sjalte-over-til__intrPrtclObl-oblEqSuInf
vi sjalter over til å snakke positivt
We switch over toI talking positively
skvære-opp-i__intrPrtclObl-oblINTERR
vi skværer opp i hvordan vi driver klubben
We set straightNS,L,I how we run the club
komme-an-på__intrPrtclObl-oblINTERRwh
dette vil komme an på hvem som kommer
This dependsL on who comes
komme-an-på__intrPrtclObl-oblINTERRyn
dette vil komme an på om været blir bra
This dependsL on whether the weather is ok
munne-ut-i__intrPrtclObl-oblLoc
elven munner ut i Rhinen
The river runs out in the Rhine
rippe-opp-i__intrPrtclObl-oblN
vi ripper opp i saken
We rip up in the case
knekke-av__intrPrtclObl-oblPRTOFob
kvisten knekker av på midten
The branch breaks off in the middle
se-ut-til__intrPrtclOblRais-oblRaisInf
det ser ut til å regne
It seemsL,P to rain
komme-an-på__intrPrtclObl-suDECL-oblN
at han får denne jobben kommer an på anbefalingene
That he gets the job dependsL on the recommendations
komme-an-på__intrPrtclObl-suINTERR-oblN
om han får jobb vil komme an på deg
Whether he gets the job dependsL on you
se-ut-som__intrPrtclScpr-scSuNrg-scPredprtclN
hun ser ut som en vinner
She looksL like a winner
se-ut-som__intrPrtclScpr-scSuNrg-scPredprtclS
hun ser ut som hun sitter
She looksL like she sits
holde-på__intrPrtcl-SUSTAINEDACTIVITY
han holder på
He keeps on
innløpe__intr-RESULT
pengebidragene innløper
The cash contributions comeC in
synes__intrScprExpn-scAdj-expnAbsinf
det synes fristende å prøve igjen
It seems tempting to try again
virke__intrScprExpn-scAdj-expnDECL
det virker rart at de får komme
It seems strange that they are allowed to come
virke__intrScprExpn-scAdj-expnINTERR
det virker tvilsomt om dette vil virke
It seems dubious whether this will work
se-ut__intrScprPrtcl-scSuNrg-scAdj
han ser syk ut
He looksL ill
fungere-som__intrScpr-scPredprtcl
han fungerer som forsanger
He functions as lead singer
sne-inne__intrScpr-scSuNrgCsd-scAdv
landsbyen sner inne
The village snows under
gå__intrScpr-scSuNrgCsd-scPred
motoren går varm
The motor runs hot
måtte__intrScpr-scSuNrg-scDir
hun må vekk
She must off
synes__intrScpr-scSuNrg-scInf
oppskriften synes å fungere
The recipe seems to work
synes__intrScpr-scSuNrg-scN
han synes en snill prest
He seems like a kind priest
tykkes__intrScpr-scSuNrg-scPred
han tykkes glad
He seems happy
opptre-som__intrScpr-scSuNrg-scPredprtcl
vi opptrer som forsøkspersoner
We figure as research subjects
lyde-som__intrScpr-scSuNrg-scPredprtclN
det lyder som et signal
It sounds like a signal
lyde-som__intrScpr-scSuNrg-scPredprtclS
hun lyder som hun er redd
She sounds as if she is afraid
koste-1__intr-suAbsinf
å slåss alene koster
To fight alone costs
ryktes__intr-suDECL
at noe holdes skjult ryktes
That something is kept secret is rumouredNS
ryke__intr-suDir
de ryker ut av turneringen
They drop out of the tournament
vare-1__intr-suDirTemp
møtet varer til middag
The meeting lasts until dinner
spørs__intr-suINTERR
hva som vil skje nå spørs
What will now happenNS is a question
hensvinne__intrPresnt-RESULT
det hensvinner bevis
There disappears evidence
herde__tr
de herder metallet
They harden the metal
rangere__trAdv
vi rangerer henne høyt
We rank her high
skikke__trAdv-obRefl
han skikker seg vel
He behavesR well
umuliggjøre__trExpnOb-expnAbsinf
de umuliggjør det å finne en fredsplan
They makeNS it impossible to find a peace plan
beklage__trExpnOb-expnCOND
jeg beklager det om du føler deg neglisjert
I regret it if you feel neglected
beklage__trExpnOb-expnDECL
jeg beklager det at du føler deg neglisjert
I regret it that you feel neglected
mangle__trExpnSu-expnAbsinf
det mangler å løse den siste oppgaven
It fails to solve the last task
forundre__trExpnSu-expnCOND
det forundrer meg om du kommer
It astonishes me if you come
gagne__trExpnSu-expnDECL
det gagner oss at de flytter hit
It benefits us that they move here
glede__trExpnSu-expnEqObInf
det gleder dem å bli omtalt slik
It pleases them to be talked about like that
ryste__trExpnSu-expnINTERR
det ryster oss hvor dårlig ledelsen er organisert
It shakes us how badly the leadership is organized
ane__trExpnSu-expnINTERRwh
det aner dem hva som vil skje
It occursP to them what will happen
angå__trExpnSu-expnINTERRyn
det angår dem hvorvidt du kommer
It concerns them whether you come
ta__trExpnSu-obMeas-expnAbsinf
det tar tre timer å gå dit
It takes three hours to go there
anstå__trExpnSu-obRefl-expnAbsinf
det anstår seg å gå i hvitt
It behooves itself to go in white
høve__trExpnSu-obRefl-expnDECL
det høver seg at man går i hvitt
It behooves itself that one goes in white
syne__trExpnSu-obRefl-expnINTERRwh
det syner seg hvem som kommer
It shows itself who comes
syne__trExpnSu-obRefl-expnINTERRyn
det syner seg om det fins håp
It shows itself whether there is hope
ordne__trImpers-obRefl
det ordner seg
It arranges itself
simulere__tr-obAbsinf
han simulerer å være syk
He simulates to be sick
simulere__tr-obDECL
han simulerer at han er syk
He simulates that he is sick
tro__tr-obDECL-obV
vi tror han kommer
We think he comes
tømme__tr-obDir
vi tømmer innholdet ut i elven
We empty the content out into the river
tore__tr-obEqBareinf
hun tør komme
She dares come
unnlate__tr-obEqSuInf
hun unnlater å melde seg
She fails to report
dø__tr-obEventunit
de dør en pinefull død
They die a painful death
eliminere__tr-obINTERR
de eliminerer hvem som kan ha gjort det
They eliminate who may have done it
stevne-for-for__trObl2-obl1N-obl2DECL
vi stevner dem for retten for at de har begått blasfemi
We drag them to court forT having committed blasphemy
stevne-for-for__trObl2-obl1N-obl2EqObInf
vi stevner dem for retten for å ha bespottet gud
We drag them to court forT having committed blasphemy
innklage-til-for__trObl2-obl1N-obl2INTERR
vi innklager dem til domstolen for hva de gjorde med dokumentene
We drag them to court for what they did with the documents
vedde-med-på__trObl2-obl1N-obl2N
jeg vedder et stort beløp med Ola på hesten
I bet a big amount with Ola on the horse
samordne-med-om__trObl2-obRefl-obl1N-obl2Absinf
de samordner seg med hjelpemannskapene om å finne en løsning
They consultR with the rescue forces aboutI finding a solution
rådføre-med-om__trObl2-obRefl-obl1N-obl2INTERR
hun rådfører seg med dem om hvordan man kan behandle soppskader
She consultsR with them about how one can treat fungal damage
rådføre-med-om__trObl2-obRefl-obl1N-obl2N
hun rådfører seg med dem om soppskader
She consultsR with them about fungal damage
overlate-til__trOblExpnOb-expnAbsinf
de overlater det til bøndene å finne en løsning
They leave it to the farmers to find a solution
anspore-til__trOblExpnSu-oblEqObInf-expnDECL
det ansporer ham til å fokusere at han får applaus
It spurs him toI focus that he gets applause
anspore-til__trOblExpnSu-oblEqObInf-expnEqObInf
det ansporer ham til å fokusere å høre tilropene
It spurs him toI focus to hear the applause
anspore-til__trOblExpnSu-oblN-expnEqObInf
det ansporer ham til innsats å høre tilropene
It spurs him to extra effort to hear the shouts
ekvivalere-med__trObl-obAbsinf-oblAbsinf
man kan ikke ekvivalere å trene med å øve
One cannot equivalateI training withI practizing
ekvivalere-med__trObl-obAbsinf-oblN
man kan ikke ekvivalere å trene med øving
One cannot equivalateI training with practice
henstille-om__trObl-obDECL-oblN
de henstiller til dem at det utvises måtehold
They urgeP them that restraint be exercised
overlate-til__trObl-obEqOblInf-oblN
de overlater til bøndene å finne en løsning
They leave to the farmers to find a solution
velge-fremfor__trObl-obEqSuInf-oblEqSuInf
vi velger å ta kveldstjeneste fremfor å stå vakt
We choose to take evening service beforeI standing guard
velge-fremfor__trObl-obEqSuInf-oblN
vi velger å ta kveldstjeneste fremfor utmarsj
We choose to take evening service before march
koste-1-på__trObl-obEqSuInf-oblRefl
hun koster på seg å kjøpe en ny bil
She affordsP,R to buy a new car
foreslå-for__trObl-obINTERR-oblN
vi foreslår for turistene hva de bør gjøre
We propose for the tourists what they should do
kurse-i__trObl-oblAbsinf
vi kurser dem i å arrangere foredrag
We educate them inI arranging talks
lønne-for__trObl-oblDECL
vi lønner dem for at de gjorde arbeidet
We compensate them forT doing the work
lønne-for__trObl-oblEqObInf
vi lønner dem for å ha gjort arbeidet
We compensate them forI having done the work
overraske-med__trObl-oblEqSuInf
de overrasker oss med å levere fullt regnskap
They surprise us withI delivering a full account
rettlede-i__trObl-OblINTERR
de rettleder dem i hvordan man setter opp regnskap
They guide them in how one sets up accounts
utplassere__trObl-oblLoc
vi utplasserer dem i skogen
We locate them in the forest
innkalle-til__trObl-oblN
vi innkaller dem til møte
We summon them to a meeting
dunke__trObl-oblPRTOFob
han dunker dem i ryggen
He bangs them in their backs
raske-med__trObl-oblRefl
de rasker med seg eiendelene
They shuffle with them the belongings
ytre-om__trObl-obRefl-oblAbsinf
de ytrer seg om å beholde naturens likevekt
They pronounce themselves aboutI keeping the balance of nature
akke-over__trObl-obRefl-oblDECL
han akker seg over at administrasjonen er inkompetent
He rantsR aboutT the administration being incompetent
akke-over__trObl-obRefl-oblEqObInf
han akker seg over å måtte sitte i flere møter
He rantsR aboutI having to sit in more meetings
avfinne-med__trObl-obRefl-oblINTERR
hun avfinner seg med hva hun får
She resignsR to what she receives
befinne__trObl-obRefl-oblLoc
hun befinner seg i Afrika
She isR in Africa
beflitte-med__trObl-obRefl-oblN
hun beflitter seg med oppgaven
She busies herself with the task
holde__trObl-obRefl-oblPRTOFob
hun holder seg for nesen
She holdsR,P her nose
ha-til__trOblRais-oblRaisObInf
de vil ha ham til å ha løyet
They allegeP him to have lied
overbevise-om__trObl-suDECL
at dødstallene går ned overbeviser oss om denne fremgangsmåten
That the death tolls go down convinces us about this procedure
overbevise-om__trObl-suDECL-oblDECL
at dødstallene går ned overbeviser oss om at denne fremgangsmåten er riktig
That the death tolls go down convinces us aboutT this procedure being right
anspore-til__trObl-suDECL-oblEqObInf
at tilhengerne jubler ansporer ham til å fokusere skikkelig
That the fans are cheering spurs him toI focus properly
overbevise-om__trObl-suDECL-oblINTERR
at dødstallene går ned overbeviser oss om hvilken fremgangsmåte som er riktig
That the death tolls go down convinces us about which procedure is right
anspore-til__trObl-suDECL-oblN
at tilhengerne jubler ansporer ham til innsats
That the fans are cheering spurs him to further effort
avholde-fra__trObl-suEqObInf-oblEqObInf
å høre jubelen avholder oss fra å avbryte
Hearing the cheering keepsI us fromI quitting
forhindre-fra__trObl-suEqObInf-oblN
å være nedvurdert forhindrer dem fra rettferdig dømming
Being underestimated preventsI them from fair judging
forkjøle__tr-obRefl
hun forkjøler seg
She getsR,NS a cold
kare__tr-obRefl-obDir
hun karer seg til uthuset
She scrambles herself to the outhouse
skyve__trPath-obRefl-obDir
hun skyver seg frem
She pushes herself forward
tilfalle__trPresnt
det vil tilfalle oss utbytte
There will accrueP to us gains
smyge__trPresntDir-obRefl
det smyger seg en katt langs muren
There slithers a cat along the wall
oppholde__trPresntLoc-obRefl
det oppholder seg en beboer her
There staysR an inhabitant here
åpne__trPresnt-obRefl
det åpner seg nye muligheter
There openR new possibilities
bolte-igjen__trPrtcl
vi bolter igjen porten
We boltL the gate
ha-til__trPrtclExpnOb-expnDECL
de vil ha det til at målet var ugyldig
They haveP it that the goal was invalid
provosere-fram__trPrtcl-obDECL
vi provoserer frem at det blir et brudd
We provokeL that there becomes a schism
finne-på__trPrtcl-obEqSuInf
de finner på å stenge veiene nå
They decideL to close the roads now
finne-ut__trPrtcl-obINTERR
vi finner ut hvorvidt de har rett
We find out whether they are right
lekse-opp-for__trPrtclObl-obINTERR
han lekser opp for oss hva som var gått galt
He lists up for us what had gone wrong
ale-opp-til__trPrtclObl-oblEqObInf
vi aler den opp til å løpe veddeløp
We breedL it toI do racing
fritte-ut-om__trPrtclObl-oblINTERR
vi fritter dem ut om hvorvidt man kan få finansiering
We ask them out about whether one can get financing
hyre-inn-til__trPrtclObl-oblN
han hyrer dem inn til høyonna
He hiresL them for the harvesting
knekke-av__trPrtclObl-oblPRTOFob
han knekker den av på midten
He breaks it off at the middle
hisse-opp-over__trPrtclObl-obRefl-oblDECL
hun hisser seg opp over at han sviktet
She getsR,NS angry overT his failing
skape-om-til__trPrtclObl-obRefl-oblEqObInf
han skaper seg om til å bli et mønsterindivid
He reshapesL himself toI become a model individual
peile-inn-på__trPrtclObl-obRefl-oblINTERR
hun peiler seg inn på hva de har fore
She findsR,P out what they are planning
skrubbe-opp__trPrtclObl-obRefl-oblPRTOFob
hun skrubber seg opp på kneet
She rubsR,L,P her knee
skape-om-til__trPrtclObl-obRefl-oblN
han skaper seg om til en shapeshifter
He reshapesL himself as a shapeshifter
skitne-til__trPrtcl-obRefl
han skitner seg til
He dirtensL himself
peke-ut-som__trPrtclScpr-obRefl-scObNrg-scPredprtcl
hun peker seg ut som fremragende
She emerges as excellent
anse__trScprExpnOb-scObNrg-scAdj-expnAbsinf
vi anser det ufornuftig å nekte all skyld
We deem it unwise to deny all guilt
anse__trScprExpnOb-scObNrg-scAdj-expnDECL
vi anser det tvilsomt at det vil holde seg slik
We deem it doubtful that it will remain like this
anse__trScprExpnOb-scObNrg-scAdj-expnINTERR
vi anser det tvilsomt hvorvidt det vil holde seg slik
We deem it doubtful whether it will remain like this
anse-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnAbsinf
vi anser det som ufornuftig å nekte all skyld
We deem it as unwise to deny all guilt
anse-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnDECL
vi anser det som tvilsomt at det vil komme bedre forslag
We consider it as doubtful that there will come better proposals
vurdere-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnINTERR
vi vurderer det som tvilsomt hvorvidt det vil komme bedre forslag
We deem it as doubtful whether there will come better proposals
anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnAbsinf
vi anser det for å være mulig å vinne
We considerP it to be possible to win
anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnDECL
vi anser det for å være mulig at vi vinner
We considerP it to be possible that we win
anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnINTERR
vi anser det for å være åpent hvorvidt vi vinner
We considerP it to be open whether we win
anse-som__trScprExpnOb-scObNrg-scPredprtclN-expnAbsinf
vi anser det som en dårlig taktikk å nekte all skyld
We consider it as a bad tactic to deny all guilt
vurdere-som__trScprExpnOb-scObNrg-scPredprtclN-expnDECL
vi vurderer det som et omen at det regner svart regn
We consider it as an omen that it rains black rain
vurdere-som__trScprExpnOb-scObNrg-scPredprtclN-expnINTERR
vi vurderer det som et åpent spørsmål hvorvidt det vil komme bedre forslag
We consider it as an open question whether there will come better proposals
spandere-på__trScpr-obDECL-scPPrefl
de spanderer på seg at firmaet får ny logo
They afford for themselves that the company gets a new logo
spandere-på__trScpr-obEqSuInf-scPPrefl
hun spanderer på seg å kjøpe en ny jakke
She affords for herself to buy a new jacket
låse-ut__trScpr-obRefl-scObDir
hun låser seg ut
She locks herself out
låse-inne__trScpr-obRefl-scObLoc
hun låser seg inne
She locks herself in
le__trScpr-obRefl-scObNrgCsd-scPred
de ler seg skakke
They laugh themselves merry
føle__trScpr-obRefl-scObNrg-scBareinf
hun følte seg forfalle innvendig
She felt herself decay on the inside
kalle__trScpr-obRefl-scObNrg-scN
han kaller seg et talent
He calls himself a talent
kjenne__trScpr-obRefl-scObNrg-scPred
hun kjenner seg trygg
She feelsR safe
snakke-til__trScpr-obRefl-scPP
han snakker seg til fordeler
He talks himself to advantages
forholde__trScpr-obRefl-scPred
hun forholder seg rolig
She remainsR quiet
konstituere-som__trScpr-obRefl-scPredprtcl
de konstituerer seg som et parti
They constitute themselves as a party
anse-for__trScpr-obRefl-scPredprtclInf
hun anser seg for å være kompetent
She considers herself asI being competent
la__trScpr-obRefl-scSuNrg-scBareinf-suRAISsuMob
stjernen lot seg se
The star let itself see
vise__trScpr-obRefl-scSuNrg-scInf
oppskriften viser seg å fungere
The recipe turnsR,L out to function
fortone__trScpr-obRefl-scSuNrg-scPred
situasjonen fortoner seg ufarlig
The situation appearsR undangerous
presse__trScpr-scObCsd
de presser sitronene flate
They squeeze the lemons flat
stasjonere__trScpr-scObLoc
vi stasjonerer ham i Sydamerika
We station him in South America
synge__trScpr-scObNrgCsd-scPred
hun synger folk glade
She sings people happy
føle__trScpr-scObNrg-scBareinf
jeg føler smitten snike seg inn i meg
I feel the contagion enter into me
la__trScpr-scObNrg-scBareinf-obRAISsuMob
de lot sangen synge
They let the song sing (be sung)
anta__trScpr-scObNrg-scInf
vi antar henne å være kompetent
We assume her to be competent
kalle__trScpr-scObNrg-scN
han kaller dem feiginger
He calls them cowards
erklære__trScpr-scObNrg-scPred
vi erklærer byen friskmeldt
We declare the town healthy
forutsette__trScpr-scPasscmplx
jeg forutsetter arten utryddet
I presuppose the species extinct
rappe-til__trScpr-scPPrefl
hun rapper til seg pengene
She snatchesP,R the money
regne-som__trScpr-scPredprtcl
vi regner dem som farlige
We count them as dangerous
anse-for__trScpr-scPredprtclInf
vi anser henne for å være kompetent
We regardP her to be competent
forekomme__trScpr-scSuNrg-scInf
oppskriften forekommer meg å fungere
The recipe appearsP to me to function
synes__trScpr-scSuNrg-scN
han synes meg en skurk
He seemsP to me a crook
tykkes__trScpr-scSuNrg-scPred
han tykkes meg glad
He seemsP to me happy
forekomme-som__trScpr-scSuNrg-scPredprtcl
han forekommer meg som fortapt
He seemsP to me as lost
koste-1__tr-suAbsinf
å drive storgård koster penger
To run a big farm costs money
bety__tr-suAbsinf-obAbsinf
å slutte betyr å gi opp
To stop means to give up
forarge__tr-suDECL
at han sang foran kirkedøren forarget mange
That he sang before the church door angered many
implisere__tr-suDECL-obDECL
at regnskapet stemmer impliserer at han har snakket sant
That the accounts are correct implies that he has spoken the truth
ankomme__tr-suDir
båten ankommer byen
The boat arrivesP to the city
hoppe__tr-suDir-obLengthunit
han hopper fem meter
He jumps five metres
liste__tr-suDir-obRefl
hun lister seg unna
She sneaksR away
huge__tr-suEqObInf
å høre så vakker sang huger meg
To hear such a beautiful song pleases me
indikere__tr-suINTERR
hvem som kommer vil indikere planen
Who comes will indicate the plan
antyde__tr-suINTERR-obINTERR
hvem som kommer vil antyde hva vi kan vente
Who comes will indicate what we should expect
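The Lexval identifiers in Table 15 share a visible surface pattern: a lemma, optionally followed by hyphen-separated selected items (prepositions, particles, or a sense index such as "1"), then a double underscore, then a frame label followed by hyphen-separated argument labels. The sketch below splits an identifier along those separators. It assumes only this surface pattern as it appears in the table; the names `LexvalId` and `parse_lexval` are hypothetical and not part of the catalogue, and the code makes no attempt to interpret the individual labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexvalId:
    lemma: str       # verb lemma, e.g. "avhenge"
    selected: tuple  # hyphen-separated items after the lemma, e.g. ("av",)
    frame: str       # first label after "__", e.g. "intrOblExpn"
    specs: tuple     # remaining labels, e.g. ("oblINTERRwh", "expnINTERRyn")


def parse_lexval(identifier: str) -> LexvalId:
    """Split an identifier of the assumed form lemma[-item...]__frame[-label...]."""
    head, sep, tail = identifier.strip().partition("__")
    if not sep or not head or not tail:
        raise ValueError(f"not a lemma__frame identifier: {identifier!r}")
    lemma, *selected = head.split("-")
    frame, *specs = tail.split("-")
    return LexvalId(lemma, tuple(selected), frame, tuple(specs))


# Example with an identifier taken from Table 15.
example = parse_lexval("avhenge-av__intrOblExpn-oblINTERRwh-expnINTERRyn")
print(example.lemma, example.selected, example.frame, example.specs)
```

Splitting purely on separators keeps the sketch independent of any label inventory, so it does not distinguish a sense index ("koste-1") from a preposition or particle; a consumer that needs that distinction would have to add it.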
Table 16. Lemmas, where relevant with light reflexives indicated by seg, and with selected prepositions or particles indicated
Lemma
(“_” means that a direct clausal argument is also possible)
English basic translation
angre
_, på, for
Repent
anse
som, for
Regard
anta
Assume
arrangere
Arrange
ause
seg opp over
Get upset
avfinne
seg med
Resign to
avhenge
av
Depend
avsky
for
Detest
avtale
_, med, om
Agree
avtalefeste
Fix as an agreement
bable
om
Babble
begripe
seg på
Understand
bekjenne
Confess
beklage
seg over
Complain, regret, apologize
bekrefte
Confirm
bekymre
seg over
Worry
beregne
Calculate
berekne
Calculate
bestemme
seg for
Decide
blogge
om
Blog
bløffe
om
Bluff
blåse
av, i
Sniff
bortvise
for
Expel
botlegge
for
Fine
briske
seg over
Brag
bry
(seg) med, om
Bother
bøtelegge
for
Fine
bøtlegge
for
Fine
domfelle
for
Sentence
drite
i
Not care
drømme
_, om
Dream
ergre
_, seg over
Annoy
erindre
Recall, remember
erkjenne
Realize, recognize
fable
om, over
Fantasize
fabulere
om, over
Fantasize
fantasere
om, over
Fantasize
fastholde
Maintain
fiksere
på
Fixate
finne
_, ut
Find out
fokusere
_, på
Focus
forakte
for
Loathe
forankre
i
Anchor, base
forarge
_, seg over
Annoy
forbause
Surprise
forberede
på, for
Prepare
forbitre
seg over
Be angry
fordømme
for
Denounce
foreholde
Make aware
forekomme
Seem, occur
foreskrive
Prescribe
foreslå
Suggest, propose
forespeile
Make expect
forestille
seg
Imagine
foresveve
Seem
forklare
Explain
formode
Assume
forsikre
(seg) om
Ensure
forsone
seg med
Reconcile
forstå
Understand
forsvare
(seg), med, mot
Defend
fortenke
i
Disagree
fortvile
over
Despair
forundre
(seg), _, over
Wonder, be astounded
fravike
Depart
fryde
(seg), _, over
Delight
frykte
_, for
Fear
fullrose
for
Compliment
fulltakke
for
Thank
fundere
på
Muse, ponder
furte
over
Sulk
førebu
(seg), på
Prepare
garantere
_, for
Guarantee
gasse
seg over
Glee
gjenoppleve
Re-experience
godta
Accept
godte
seg over
Glee
gratulere
med
Congratulate
gremme
seg over
Dismay
gremmes
over
Dismay
gruble
på, over
Brood
grunne
på, over
Ponder
grøsse
over
Shudder
gråte
over
Cry, weep
henholde
seg til
Refer
hisse
seg opp over
Get angry
hovmode
seg over
Gloat
hugse
Remember
humre
over
Hum
huske
Remember
hylle
for
Hail
informere
om
Inform
innklage
for
Report, accuse
innprente
Impress
innrømme
Admit
innse
Realize
irettesette
for
Reproach
irritere
_, seg over
Irritate
jamre
seg over
Moan, wail
joike
_, om
Joik
juble
over
Cheer
kjangse
på
Take a chance
kjempe
om, for
Fight
klage
om, på
Complain
komme
på
Remember
kompensere
for
Compensate
komplimentere
for
Compliment
krangle
om
Quarrel
kreditere
for
Give credit
kritisere
_, for
Criticize
lure
på/ seg fra, seg til
Wonder/sneak
lyve
om
Lie
lære
_, om
Learn, teach
melde
_, fra om
Report
minne
om
Remind
more
_, seg over
Entertain
nytte
_, seg av
Make use of
oppleve
Experience
overbevise
_, om
Convince
overraske
Surprise
passe
Suit
planlegge
Plan
programfeste
Fix within a program
påstå
Assert
refse
for
Scold
regne
ut, med
Calculate, count
rekne
ut, med
Calculate, count
reminisere
om
Reminisce
respektere
Respect
respondere
_, på
Respond
rettferdiggjøre
Justify
rose
for
Compliment
saksøke
for
Sue (court)
se
_, på
See
si
_, fra om
Say
sjenere
seg for
Be shy
skamme
seg for, seg over
Be ashamed
skjemme
seg for, seg over
Be ashamed
skjemmes
for, over
Be ashamed
skjenne
på for, over
Reprimand
skjerme
for
Protect
skjønne
_, seg på
Understand
skryte
over
Boast
skuffe
_, med
Disappoint
skumle
om, over
Murmur
skvaldre
om
Gossip
skåle
for
Cheer
skåne
for
Protect
slåss
om, for
Fight
småprate
om
Smalltalk
snakke
om
Talk
sole
seg i
Bask
sone
for
Spend a sentence
spotte
for
Mock
steile
over
Resent
stipulere
Stipulate
straffe
for
Punish
stusse
over
Be surprised
stø
seg på
Support
støe
seg på
Support
stønne
over
Sigh
stå
for
Stand
sukke
over
Sigh
sutre
over
Wail
svare
_, på
Answer
svi
for
Suffer
synes
Think
syte
over
Wail
søke
om
Apply
sørge
over
Mourn
ta
opp
Take
takke
for
Thank
tenke
_, på, over
Think
tie
om
Be quiet
tekste
Text
tilgi
Forgive
tilkjennegi
Show, declare
tilstå
Confess
tipse
om
Tip
tiske
om
Whisper
tjene
på
Earn
trekke
inn, fra
Pull, withdraw
triumfere
over
Triumph
trives
med
Thrive
trygge
(seg) mot
Ensure
trøste
(seg) med
Comfort
tåle
Suffer, sustain
uffe
seg over
Puff
unngjelde
for
Pay, suffer
unnskylde
_, seg for
Excuse
uroe
_, seg over
Worry
utelate
Omit
utgyde
seg over, seg om
Complain
utgyte
seg over, seg om
Complain
uttale
seg om
Pronounce
vedde
på, om
Bet
vedgå
Acknowledge
vedkjenne
seg
Acknowledge
vedstå
(seg)
Acknowledge
vedta
Accept
velge
Choose
vemmes
over
Quail
vente
_, på
Wait, expect
verge
(seg) mot, seg for
Protect
verje
(seg) mot, seg for
Protect
verne
(seg) mot, seg for
Protect
vise
_, til
Show
vite
_, om
Know
vitse
om
Joke
vredes
over
Feel wrath
vrøvle
om
Talk nonsense
vurdere
Assess
våse
om
Talk nonsense
ymte
om
Hint
øse
seg opp over
Get upset
åpne
for, opp for
Open
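The middle column of Table 16 can be read mechanically: entries are comma-separated, and, per the table header, "_" signals that a direct clausal argument is also possible. A minimal sketch of that reading follows. The function name `parse_selection` is hypothetical, and the code assumes nothing beyond the comma convention and the "_" marker; in particular it leaves items such as "(seg)" or "seg over" uninterpreted.

```python
def parse_selection(cell: str):
    """Split one selection cell into (direct_clausal_ok, selected_items)."""
    items = [part.strip() for part in cell.split(",") if part.strip()]
    # "_" is the table's marker for a possible direct clausal argument.
    direct_clausal_ok = "_" in items
    selected = [item for item in items if item != "_"]
    return direct_clausal_ok, selected


# Example cells taken from Table 16 (lemmas "ergre" and "anse").
print(parse_selection("_, seg over"))
print(parse_selection("som, for"))
```

An empty cell yields `(False, [])`, which matches rows such as "anta" where the middle column is blank.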
A Valence Catalogue for Norwegian
103
References 1. Beermann, D., Hellan, L.: Enhancing grammar and valence resources for Akan and Ga. In: West African Languages. Linguistic Theory and Communication. Wydawnictwa Uniwersytetu Warszawskiego, Warzawa, pp. 166–185 (2020). ISBN 978-83-235-4623-8 2. Bresnan, J.: Lexical Functional Grammar. Blackwell, Oxford (2001) 3. Calzolari, N., et al. (eds.): Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavík, Iceland. ELRA (2014) 4. Carpenter, B.: The Logic of Typed Feature Structures. Cambridge University Press, Cambridge (1992) 5. Copestake, A.: Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford (2002) 6. Copestake, A., Flickinger, D., Sag, I., Pollard, C.: Minimal recursion semantics: an introduction. J. Res. Lang. Comput. 3, 281–332 (2005) 7. Creissels, D.: Transitivity, valency, and voice. Ms. European Summer School in Linguistic Typology. Porquerolles (2016) 8. Dakubu, M.E.K.: Ga-English Dictionary with English-Ga Index. Black Mask Publishers, Accra (2009) 9. Dakubu, M.E.K.: Ga Toolbox project expanded with Construction Labeling valence information. Ms (2010). https://typecraft.org/tc2wiki/Ga_Valence_Profile 10. Dakubu, M.E.K.: Ga Verbs and their constructions. Monograph ms, Univ. of Ghana (2011) 11. Dakubu, M.E.K., Hellan, L.: A labeling system for valency: linguistic coverage and applications. In: Hellan, L., Malchukov, A., Cennamo, M. (eds.) Contrastive Studies in Valency. John Benjamins Publ. Co., Amsterdam (2017) 12. Dalrymple, M., Lødrup, H.: The grammatical functions of complement clauses. In: Proceedings of the LFG00 Conference. CSLI Publications (2000) 13. Haugen, T.A.: Polyvalent adjectives in Norwegian: aspects of their semantics and complementation patterns. Ph.D. dissertation, University of Oslo (2012) 14. Hellan, L.: Construction-based compositional grammar. J. Logic Lang. Inform. 28(2), 101– 130 (2019). https://doi.org/10.1007/s10849-019-09284-5 15. 
Hellan, L.: Interoperable semantic annotation. In: LREC workshop ISA-16, 6th Joint ACLISO Workshop on Interoperable Semantic Annotation (2020) 16. Hellan, L.: Representing Light Reflexives in a valence resource for Norwegian. Presentation at SLE 2020 (2020) 17. Hellan, L.: Supplementary data for: ‘a valence catalogue for Norwegian. In: Loukanova (ed.) Natural Language Processing in Artificial Intelligence, NLPinAI 2021. Springer (2021). https://doi.org/10.18710/8U3L2U. (Further resources are displayed at: https://typecraft.org/ tc2wiki/NorVal_resources) 18. Hellan, L.: Unification and selection in Light Verb Constructions. A study of Norwegian. In: Pompei, A., Mereu, L., Piunno, V. (eds.) Light verb constructions as complex verbs. Features, typology and function (Series “Trends in Linguistics. Studies and Monographs”), Mouton de Gruyter (to appear) 19. Hellan, L., Dakubu, M.E.K.: Identifying verb constructions cross-linguistically. In: Studies in the Languages of the Volta Basin 6.3. Legon: Linguistics Department, University of Ghana (2010). https://typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction 20. Hellan, L., Bruland, T.: A cluster of applications around a Deep Grammar. In: Vetulani, Z., et al. (eds.) Proceedings from The Language & Technology Conference (LTC) 2015, Poznan (2015). Web server version at http://regdili.hf.ntnu.no:8081/linguisticAce/parse 21. Hellan, L., Beermann, D.: Presentational and related constructions in Norwegian with reference to German. In: Abraham, W., Leiss, E., Fujinawa, Y. (eds.) Thetics and Categoricals. [LA 262], John BenjaminsPublishing Company (2020)
104
L. Hellan
22. Hellan, L., Johnsen, L.G., Pitz, A.: TROLL. Ms, University of Trondheim. (Downloadable at Nasjonalbiblioteket) (1989) 23. Hellan, L., Beermann, D., Bruland, T., Dakubu, M.E.K., Marimon, M.:. MultiVal: towards a multilingual valence lexicon. In: Calzolari, et al. (eds.) Proceedings of LREC 2014 (2014). (web demo: https://typecraft.org/tc2wiki/Multilingual_Verb_Valence_Lexicon and http://reg dili.hf.ntnu.no:8081/multilanguage_valence_demo/multivalence) 24. Hellan, L., Malchukov, A.L., Cennamo, M. (eds): Contrastive studies in Valency. John Benjamins Publ. Co., Amsterdam & Philadelphia (2017) 25. Hellan, L., Beermann, D., Bruland, T., Haugland, T., Aamot, E.: Creating a Norwegian valence corpus from a deep grammar. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds.) Human Language Technology. Challenges for Computer Science and Linguistics. 8th Language & Technology Conferene, LTC 2017. LNCS, vol. 12598. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-66527-2_1. ISBN 978-3-030-66526-5, https://typecraft.org/tc2wiki/Norwegian_ Valency_Corpus) 26. Holen, G.I.: Automatic anaphora resolution for Norwegian. In: Branco, A. (ed.) 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, Lagos, Portugal, pp. 151–166. Springer, Berlin. (2007) https://doi.org/10.1007/978-3-540-71412-5_11 27. Jespersen, O.: Analytic Syntax. Holt, Rinehart and Winston, New York (1969, orig. edition 1937) 28. Jespersen, O.: The Philosophy of Grammar. Routledge, London. (2010, orig. edition 1924) 29. Jørgensen, F.: The semantic representation of location in machine translation. Cand. Philol. thesis, University of Oslo (2004) 30. Korhonen, A., Briscoe, T.: extended lexical-semantic classification of english verbs. In: Proceedings of the HLT/NAACL Workshop on Computational Lexical Semantics, Boston, MA (2004) 31. Levin, B.: English Verb Classes and Alternations. University of Chicago Press, Chicago (1991) 32. 
Loukanova, R.: An approach to functional formal models of constraint-based lexicalized grammar (CBLG). Fund. Inform. 152(4), 341–372 (2017). https://doi.org/10.3233/FI-20171524 33. Malchukov, A.L., Comrie, B. (eds.): Valency Classes in the World’s Languages. Mouton De Gruyter, Berlin (2015) 34. Marantz, A.: Grammatical Relations. MIT Press, Cambridge (1985) 35. Marneffe, M.-C., Manning, C.D., Nivre, J., Zeman, D.: Universal Dependencies. Computational Linguistics (2021). https://doi.org/10.1162/COLI_a_00402 36. Nordgård, T.: Norwegian Computational Lexicon (NorKompLeks). In: Proceedings of NoDaLiDa 1998 (1998) 37. Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. Chicago University Press, Chicago (1994) ´ 38. Przepiórkowski, A., Hajnicz, E., Patejuk, A., Woli´nski, M., Skwarski, F., Swidzi´ nski, M.: Walenty: towards a comprehensive valence dictionary of Polish. In: Calzolari et al. (eds.) (2014) 39. Quasthoff, U., Hellan, L., Körner, E., Eckart, T., Goldhahn, D., Beermann, D.: Typical Sentences as a Resource for Valence. LREC 2020 (2020). http://www.lrec-conf.org/proceedings/ lrec2020/index.html 40. Ross, J.R.: Constraints on variables in syntax. PhD dissertation, MIT (1967) 41. Tesnière, L.: Éleménts de syntaxe structurale. Klincksieck, Paris (1959)
Arabic Computational Linguistics: Potential, Pitfalls and Challenges
Elie Wardini
Department of Asian, Middle Eastern and Turkish Studies, Stockholm University, Stockholm, Sweden
Abstract. Arabic computational linguistics, though still relatively new, is rapidly gaining pace. While the development of tools for computational linguistics in many languages has come a very long way, and progress has been achieved in creating tools for Arabic, Arabic computational linguistics is still in need of much attention. It is not obvious that tools developed for, let us say, English will need only minor modifications before they can be applied to Arabic. Computational tools developed for English rely heavily on the enormous work achieved in English linguistics in general, and in corpus linguistics in particular. If Arabic computational linguistics is to achieve its potential, it needs to mirror the hard work done in other languages. Researchers in Arabic computational linguistics should also fully understand the nature of the data they are working with. The present article is not a review of the field, but rather a discussion of the potential, pitfalls, and challenges of Arabic computational linguistics. We discuss what research in this field can contribute to linguistic and pedagogical research on Arabic, the problem of defining what ‘Arabic (language)’ is from a linguistic point of view, the nature of the Arabic script, transcription and transliteration, and finally corpus building.

Keywords: Arabic · Computational linguistics · Natural language processing · Corpus linguistics
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 105–117, 2022. https://doi.org/10.1007/978-3-030-90138-7_4

1 The Potential of CL/NLP for Arabic

Computational linguistics in Arabic has enormous potential. Arabic is one of the world’s larger languages, has an attested history spanning over 1500 years, and possesses a rich literature covering a very wide range of topics and every genre. Computational linguistics (CL) and natural language processing (NLP) in Arabic have the potential to vastly increase the pace of study of Arabic in every domain. To date, most grammars and dictionaries of Arabic are not corpus-based. Grammar books reproduce earlier outlines of Arabic grammar, which are mostly discussions with the 9th and 10th century grammarians of Arabic. Most Arabic dictionaries take on a more or less purist role, keeping users on the right path and, in the case of neologisms, suggesting ‘good and correct’ words for modern times. Most, if not all, early as well as
modern grammarians, have a prescriptive approach. To my knowledge only Badawi et al. Modern Written Arabic: A Comprehensive Grammar (2004) is a corpus based grammar, and Wehr’s Arabisches Wörterbuch (first published in 1952) is a corpus based dictionary of Modern Standard Arabic. As for the spoken variants of Arabic, there are many (but not enough!) good descriptive grammars and dictionaries, yet with notable exceptions (for example, see Salloum and Habash 2014 and Samih 2017, among others), systematic descriptions of spoken variants are not the focus of many working on CL and NLP in Arabic. So, CL/NLP assisted grammars and dictionaries based on large corpora will be very welcome descriptive correctives to the mostly prescriptive existing grammars and dictionaries of Arabic. This is true for all domains of Arabic grammar, phonetic, morphological, syntactic, lexical, etc. Another field to which CL/NLP in Arabic has the potential to provide important contributions is literature. The field of Arabic literature has a large, international, truly active, and productive collegium of researchers. This research is by definition corpus based. Yet it is only with years of experience, hard work, and large-scale cooperation that researchers gather insight from comprehensive sets of texts. CL/NLP could give researchers the tools to do research on corpora including, for example, all the works of a single author, country, period, genre, etc. and in any combination they would see fit and in any domain(s) they need: Patterns of lexical usage, phraseology, topoi across periods, authors, and genres. The list is endless. One excellent example of this is the Kitab project (see below). Education and especially language learning/teaching could benefit from CL/NLP in Arabic. As we will discuss below, the Arabic script poses special difficulties with regard to language learning/teaching. The morphology of Standard Arabic poses other issues. 
With CL/NLP, educators and linguists could identify the more accessible parts of Arabic grammar and, more importantly, identify empirically the difficult parts, separating those that are essential for understanding from those that are redundant. For example, much time is spent in the Arabic grammar classroom on nominal case and verbal moods. There is reason to argue that nominal case is mostly redundant and that verbal mood is generally governed by particles (but not so for the Quran), and thus that both could get less focus in the classroom. CL/NLP could identify which words and constructions, together with their connotations, are the most commonly used and should thus form the basis for textbooks and dictionaries, balancing the focus between the lexicon and grammar based on empirical and comprehensive data. Here too, the list is long.

Lastly, it is worth mentioning that CL/NLP in Arabic would benefit the research field itself. With such a large potential corpus and wide usage, Arabic could be a major contributor to the modeling and development of CL/NLP. As is well known for linguistic theory as well as many other fields, English seems to be the starting point, with correctives quickly following from well-known European or Asian languages. Arabic only on rare occasions makes a significant contribution. CL/NLP in Arabic could be an important contributor to the further development of the field.

CL/NLP in Arabic is gaining more and more attention. The number of researchers in the field is indeed growing (see McEnery et al. (eds.) (2019), Eddakrouri’s website and the references below for a handful of examples). Yet there are some essential elements in this growing field that seem to be lacking and some important issues that seem to
be ignored, thus hindering its development (see also Ditters 2013, updated 2017). The present contribution should be read as an attempt by a linguist specialist in Arabic and Semitic languages to identify some issues related to CL/NLP in Arabic. It is our hope that computational scientists join forces to a greater extent with specialists in the field of Arabic in order to achieve the potential of Arabic computational linguistics.
2 What is ‘Arabic’?

Languages quite often ignore or even defy borders and sociopolitics. The term ‘Arabic (language)’ is a sociopolitical term and not a linguistic term in the narrow sense. I will explain, but first let me emphasize that this is not unique to Arabic. ‘Norwegian (language)’, ‘Swedish (language)’, etc. are also sociopolitical terms, to use Scandinavia as an example. The classification of a given spoken variant as ‘Norwegian’, ‘Swedish’, etc. does not depend on the linguistic traits that are characteristic of that variant, but rather on the geographical location where the variant is used and whether it falls within the borders of Norway, Sweden, etc. In other words, in Norway one speaks Norwegian and in Sweden one speaks Swedish, irrespective of how linguistically similar the spoken variants on either side of the border are to each other, or how linguistically dissimilar they may be to other variants that fall within the same borders.

From a linguistic perspective the distinction between ‘linguistics’ in the narrow sense and ‘sociopolitics’ is important. It is important to apply the correct tools to a given task. One should apply ‘linguistic’ tools to linguistic data and ‘sociopolitical’ tools to sociopolitical data. The field of sociolinguistics has the important task of bridging the fields of linguistics and sociopolitics, since speakers of a language are not one thing or the other, ‘performers of speech acts’ or ‘social actors’, but a complex combination of many aspects. Language indeed is part and parcel of the sociopolitical space. Yet it is important to note that linguistics in the narrow sense is concerned with ‘speech acts’, more precisely with the mechanisms by which language encodes meaning and with how language variants function as linguistic systems, even when they are affected by sociopolitical or other factors. In the context of computational linguistics and natural language processing, the tools are primarily linguistic by design.
Though one could see the need for and benefits of incorporating sociopolitical or other aspects into the models, CL/NLP primarily aims at identifying and processing linguistic features in language usage. So, applying CL/NLP models without a good grasp of linguistics and a deep linguistic knowledge of the languages being studied inevitably leads to dubious results.

A distinction that linguists have increasingly shied away from is that between the terms ‘language’ and ‘dialect’. These terms are quite difficult to define linguistically, so linguists usually quote Max Weinreich, who popularized the following statement in the mid-1940s: “Language is a dialect with an army and a navy”, and one may be tempted to add: “a priesthood”. The terms ‘language’ and ‘dialect’ have moved more and more into the domain of sociopolitics and have become less and less useful in the domain of linguistics. Linguists tend to prefer the term ‘(language) variant’. A language variant is a distinct linguistic system which is more or less close to, or distant from, other variants. Clusters of variants that share many linguistic traits are commonly classified under a certain language, keeping in mind the often ‘fuzzy’ borders between different variant clusters. Linguists are often at
odds as to which traits to include in defining a cluster and what ‘many traits’ actually means in practice. More often than not, different linguistic clusters form a continuum with other clusters, making it difficult to distinguish clear ‘borders’ between clusters. For comparison, the term ‘Arabic’ is the equivalent of all ‘Scandinavian’ language variants, mainland and insular, from Old Norse to the present. This is not strange, since ‘Arabic’ is spoken from the Atlantic to Central Asia, from the Sub-Sahara to Turkey, including Malta and historically Spain and Sicily. Many researchers in the field of CL/NLP in Arabic seem to want to apply models that are intended to cover the Quran as well as news articles in a modern newspaper; others apply them to spoken variants. Applying CL/NLP to ‘Arabic’ as if it were a single linguistic system is the equivalent of applying computational linguistic models to ‘Scandinavian’ as if it were a single linguistic system. This is not a linguistically viable approach.

What then is Arabic? Arabic is a complex cluster (of clusters) of distinct (language) variants that are closer to each other than to clusters of other languages, such as Aramaic or Hebrew (despite the valid arguments of Retsö 2013). Variants of Arabic are classified in different ways, depending on the linguistic traits that are used. One major classification is the distinction between the literary variant(s), also called Classical Arabic, Fuṣḥā, Arabiyya, Standard Arabic, etc., on the one hand and the spoken variants on the other. The linguistic differences between the literary Arabic variants and the spoken Arabic variants are significant. The literary variant(s) have not been the mother tongue of any person during the period for which they have been attested. The literary variant is what is taught at school and used in most written Arabic. The spoken variants, on the other hand, are the different mother tongues of speakers of Arabic.
The distinction between literary Arabic and spoken Arabic is as old as the earliest attestations of Arabic. In principle, literary Arabic is highly standardized and follows the grammar outlined by early grammarians in the 9th and 10th centuries. Empirically though, literary Arabic is not the monolithic giant it is usually depicted as. Literary Arabic exhibits variation depending on the texts and periods. Modern Standard Arabic is the term used to designate the literary Arabic used from the late 19th century to the present. Are there regional differences in Modern Standard Arabic? CL/NLP would be of tremendous help in plotting the variation of literary Arabic from its earliest stages and mapping the cohesiveness and/or variation it exhibits.

The general attitude among speakers of Arabic is that the ‘dialects’ are of lesser value, and not really ‘languages’. Linguists beg to differ. The spoken variants exhibit important linguistic differences between the different clusters. Different criteria are used to classify these clusters. Some classifications are regional, either in terms of East (Mashriqi, from Egypt towards the east) vs West (Maghribi, west of Egypt), or in terms of smaller regions of the Middle East: Gulf Arabic, Northern Iraqi Arabic, Levantine Arabic, Egyptian Arabic, etc. Another type of classification is based on historical settlement patterns: Bedouin vs Sedentary, and within the latter Rural vs Urban (see Versteegh 2014 and Palva 2006). All these classifications nevertheless belie the fact that even at the smaller regional level, the spoken variants of Arabic form even smaller distinct sub-clusters. So Egyptian, Iraqi, Syrian etc. Arabic are umbrella names covering distinct language variants used in a certain territory. Remember though, the ‘borders’ between clusters mostly do not follow state borders.
The Arabic used on social media is noteworthy in this context. There is an increasing number of studies on this phenomenon. CL/NLP has great potential to assist in studying these texts. Yet, a note of caution. The Arabic written on social media is generally hybrid in nature. Even when, let us say, an Egyptian person writes on Facebook, the texts produced more often than not do not represent Egyptian Arabic. Rather, they are often a mixture of Egyptian Arabic, Modern Standard Arabic and, to varying degrees, English. These texts are as genuine as any other text, and their ‘hybrid’ nature is not a stigma. So, researchers studying Arabic on social media should be aware of the nature of the texts they are working with. Moreover, there are no standards for writing Arabic on social media: sometimes the Arabic script is used, at other times the Latin script. There is very seldom any consistency in the orthography.

Multilingualism is another phenomenon that should be taken into consideration when applying CL/NLP to ‘Arabic’. In our context, speakers, and at times even writers, use more than one variant of Arabic. The most obvious examples are religious terms, which very often are Standard Arabic. Speakers, depending on the context, interlocutors, and circumstances, may use several Arabic variants in one single conversation or text. For example, most novels are written in Standard Arabic. Yet authors do use dialectal terms and phrases, at times even whole passages, in order to achieve certain effects. Dialogues in novels are very often written in a spoken variant. Speakers may use a certain variant to make a certain point. So, researchers must distinguish between, on the one hand, linguistic elements that have been incorporated as loans and have become part and parcel of a certain variant of Arabic (for example, the term ‘please’ has been incorporated into many spoken variants of Arabic) and, on the other hand, multilingualism, diglossia, translanguaging, etc.
If a speaker of a certain variant of Arabic uses a Modern Standard Arabic phrase, or even an English phrase, these do not automatically become part of the linguistic system of that variant. The question of “which Arabic” is therefore an essential aspect that needs to be addressed when modeling CL/NLP for Arabic. One mitigating, and less than optimal, factor is that most CL/NLP models for Arabic are applied mostly to literary Arabic. Well designed and corpus based studies using computational linguistics could be a boost to research not only to the specific literary Arabic variants, but if applied correctly also to the spoken variants. This requires deep knowledge and awareness among researchers about the nature of Arabic.
3 The Arabic Script

The Arabic script is a so-called abjad script. This means that the script represents mainly the consonants of the language. Long vowels are as a rule, with some exceptions, marked in this mainly consonantal script with the consonants alif, waw or ya, the so-called matres lectionis. Short vowels and doubled/long consonants are marked with diacritical signs above or below the consonants. These diacritical signs are very seldom used in writing, with the exception of some types of texts such as the Quran or some types of children’s or beginners’ books, where short vowels are fully or partially marked. Fully ‘vocalized’ texts (i.e. where all the diacritics are represented in the script) are visually cumbersome and thus avoided, especially in smaller print. Authors and editors do add
a diacritical sign here or there, often not systematically, with the aim of ‘disambiguation’. The rule though is that most written Arabic texts do not include markers for short vowels or doubling of consonants, all of which are phonemically and morphemically significant. The saying goes: “One usually reads in order to understand; in Arabic, one needs to understand in order to read.” This poses difficulties not only for readers of Arabic, but especially for learners of Arabic and, in our context, for applying CL/NLP models to Arabic texts. For example, in the Arabic script the string “ktb” ≈ [kataba ‘he wrote’, kutiba ‘it was written’, kutub ‘books’, kattaba ‘he caused someone to write’, kuttiba ‘he was made to write’, …], or the string “lwm” ≈ [lawm ‘a blame’, lūm ‘blame someone’, …]. Anyone working with RegEx (regular expressions) will realize the consequences of this type of script. CL/NLP modeling in Arabic should anticipate that a search for the expression “ktb” would return an array of possibilities [“kataba”, “kutiba”, “kutub”, “kattaba”, …] rather than a single well-defined ‘unambiguous’ item. In short, the Arabic script is ambiguous. Simplistic attempts at using NLP to ‘disambiguate’ Arabic are equivalent to trying to produce matter from nothing. What you feed into the model is what you get out. If you feed in ambiguity, as the Arabic script does, the result is ambiguity. As an illustration, I entered the string “hlk” into the Madamira disambiguation demo webpage (see [15], see also Pasha et al. 2014). As ‘disambiguation’ Madamira returned the somewhat rare “ahhalaki” ‘he made you.Fem.Sing. competent’ (retrieved July 17th, 2021). The string “hlk” should rather return an array:
CalimaStar, an excellent analyzer, (retrieved August 22nd, 2021, see [8], see also Taji et al. 2018), on the other hand, does exactly this, it returns an array of 11 lemmas and 59 analyses. This example reveals at least two issues: the ambiguity of the Arabic script, and as importantly the limitations of the training set/methods used by Madamira. More on corpora and training sets below. It is clear that the Arabic script with its over abundance of homographs in itself presents a special challenge to CL/NLP models.
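The point that one unvocalized string corresponds to an array of readings can be sketched in a few lines of Python. This is only a minimal illustration, not how Madamira or CalimaStar work: the four vocalized forms are the “ktb” readings listed above, spelled out with explicit Unicode code points, and the helper names (`skeleton`, `readings_by_skeleton`, `diacritic_tolerant_pattern`) are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Arabic letters and diacritics as explicit code points, so that every
# diacritic in the vocalized forms below is visible in the source code.
KAF, TA, BA = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"

# Vocalized readings of the string "ktb" mentioned in the text.
VOCALIZED = {
    "kataba": KAF + FATHA + TA + FATHA + BA + FATHA,
    "kutiba": KAF + DAMMA + TA + KASRA + BA + FATHA,
    "kutub": KAF + DAMMA + TA + DAMMA + BA,
    "kattaba": KAF + FATHA + TA + SHADDA + FATHA + BA + FATHA,
}


def skeleton(word):
    """Drop combining marks (Unicode category Mn), i.e. the diacritics."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def readings_by_skeleton(forms):
    """Group reading labels by the unvocalized string they collapse to."""
    index = defaultdict(list)
    for label, form in forms.items():
        index[skeleton(form)].append(label)
    return dict(index)


def diacritic_tolerant_pattern(bare):
    """Regex that matches a bare string with optional diacritics after each letter."""
    marks = "[\u064B-\u0652]*"
    return re.compile("".join(re.escape(ch) + marks for ch in bare))


index = readings_by_skeleton(VOCALIZED)
pattern = diacritic_tolerant_pattern(KAF + TA + BA)
```

Here `index` has a single key, the bare string, pointing to all four readings, and `pattern` matches each vocalized form. A search on the unvocalized text therefore cannot by itself tell these readings apart; any choice among them has to come from context or from an annotated corpus.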
4 Arabic Morphology

In addition to the script itself, Arabic morphology presents a different set of issues. The stems of Arabic nouns, adjectives and verbs permutate. Verbal stems as well as the plurals of nouns and adjectives (the so-called ‘broken plurals’) are the biggest ‘culprits’. As
examples from English, the verbs come and see permutate: “come”, “comes”, “came”; “see”, “sees”, “saw”, “seen” respectively. An example from Norwegian: “bok” ‘book’ is the singular form, while “bøker” is the plural form. In Arabic this phenomenon is pervasive. Thus, in order to identify the lemma behind a certain string in a text, CL/NLP models need to accommodate numerous permutations, again by returning an array: lemma ≈ [“stem 0”, “stem 1”, …]. As an example, I have recorded in the Quran 17 stem permutations for the basic and very frequent word ʾatā ‘to come’ (attested 264 times) and similarly 11 for the word raʾā ‘to see’ (attested 267 times); kitāb ‘book’ is the singular form while “kutub” is the plural form. Admittedly, the words ʾatā and raʾā with their numerous permutations (due to the hamza and the long vowel in their roots) are on the extreme side. Still, stem permutations need to be given due attention in any CL/NLP model that is to be applied to Arabic.
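The relation lemma ≈ [“stem 0”, “stem 1”, …] can be sketched as a simple lookup table that is inverted so that any attested stem leads back to its lemma. The entries below are only the examples given in this section; the names `STEMS`, `build_form_index` and `lemmas_for` are hypothetical, and a real resource would hold far more permutations per lemma (17 for ʾatā in the Quran, for instance).

```python
from collections import defaultdict

# Lemma -> stem permutations, using only the examples from this section.
STEMS = {
    "come": ["come", "comes", "came"],
    "see": ["see", "sees", "saw", "seen"],
    "bok": ["bok", "bøker"],
    "kitāb": ["kitāb", "kutub"],
}


def build_form_index(stems):
    """Invert the table: each surface form maps to the set of possible lemmas."""
    index = defaultdict(set)
    for lemma, forms in stems.items():
        for form in forms:
            index[form].add(lemma)
    return index


FORM_INDEX = build_form_index(STEMS)


def lemmas_for(form):
    """Return every lemma a form may belong to (an array, possibly empty)."""
    return sorted(FORM_INDEX.get(form, set()))
```

Returning a list rather than a single value is deliberate: once unvocalized Arabic strings are used as keys, one form will often belong to several lemmas, and the lookup should expose that rather than hide it.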
5 Arabic Orthography

The orthography of Arabic also presents its own set of issues. Consider the following string “wlsylmnh”:
(Note that this same string could be read as: wa-la-sa-yuʿallimannahu ‘he will surely teach him’). Arabic orthography attaches certain parts of speech together into one ‘word’. So the question of “what is a word?” arises. In the example above the English phrase contains 6 words, while the Arabic contains only one. Linguists, including computational linguists, have come to a practical solution: a word is a string of characters separated by a space or punctuation. While this works reasonably well for tokenization in English and Norwegian, among others, it does not work well for Arabic. Or rather, when an Arabic text is tokenized as it is processed in most programming languages and their libraries or packages, the results returned will differ significantly from results in other languages. For example, most NLP libraries/packages include lists of words that can be omitted while processing, e.g. the definite article, pronouns, etc. Most of these ‘words’ in Arabic are clitics, forming part of the ‘word’ in Arabic orthography. So, for example, a tf-idf analysis of an English text would process different token types, and thus different information, than it would for an Arabic text. This issue comes in addition to the above-mentioned issues related to script and stem permutations. The words discussed above, ʾatā ‘to come’, raʾā ‘to see’, and kitāb ‘book’, return the following arrays respectively in the case of the Quran:
Moreover, the ratios between lemma, token type, and number of attestations in a vocalized Arabic text differ considerably from those in a non-vocalized text. The former has a larger number of token types/unique forms in relation to lemma and/or attestation, but with less ambiguity; the latter has a smaller number of token types/unique forms in relation to lemma and/or attestation, but with more ambiguity.

Scripts are in their essence approximations, conventions that attempt to represent language. No script is perfect. Linguists rely on transcription (see below) in order to adapt scripts to the different languages they are studying. But even transcriptions need to make compromises. The Arabic script, due to its origin in Semitic scripts, has a major drawback: vowels and consonant doubling are represented by diacritics that are more often than not omitted in written texts. Attempts at reforming the Arabic script inevitably lead, as is the case with the Greek script, to uproar. In the context of CL/NLP, the researcher needs to pay extra attention to this fact. Off-the-shelf models rarely yield adequate results.
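The tokenization issue raised in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions: the English phrase is an illustrative six-word rendering, the segmentation of “wlsylmnh” is written by hand following the hyphenation wa-la-sa-…-hu given above, and the one-item stop list is hypothetical.

```python
import re


def whitespace_tokens(text):
    """The practical definition of 'word': strings between spaces or punctuation."""
    return re.findall(r"\w+", text)


arabic_string = "wlsylmnh"
english_phrase = "and he will surely teach him"  # illustrative rendering only

# Hand-written clitic segmentation (an assumption, following wa-la-sa-...-hu).
SEGMENTATION = {"wlsylmnh": ["w", "l", "s", "ylmn", "h"]}

# A hypothetical stop list containing the conjunction "w" ('and').
STOP_WORDS = {"w"}


def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]


raw = whitespace_tokens(arabic_string)
segmented = [part for tok in raw for part in SEGMENTATION.get(tok, [tok])]

# Before segmentation the stop list has nothing to match: "w" is not a token.
filtered_raw = remove_stop_words(raw)
# After segmentation the clitic "w" is visible and can be filtered out.
filtered_segmented = remove_stop_words(segmented)
```

With whitespace tokenization the Arabic string stays a single token while the English phrase splits into six, so a stop list or a tf-idf count operates on very different units in the two languages unless a segmentation step of some kind comes first.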
6 Ambiguity

The Arabic script, as we have seen above, is ambiguous and polysemous. Not only should CL/NLP models account for this, the models should not ‘extract’ more information from these texts than is present in the texts themselves. Indeed, language in general is more often than not ambiguous. One need only read legal texts to see how ‘heavy’ they are with specialized terminology, redundancy, and repetition, all in order to make legal texts as unambiguous as possible. And still, legal texts need scholars to interpret them, due to their legal implications and their inherent and unavoidable ambiguity. Ambiguity is present even more so in other, less carefully worked texts. Pronouns are prime examples of ambiguity in language. Consider the following sentence:
In English the string “his apple” is ambiguous. Norwegian, on the other hand, has two different sets of possessive pronouns that translate English “his/her/its”, specifically “hans/hennes/dets” and “sin/sitt”, with different antecedents: the subject as antecedent for “sin/sitt” and a non-subject as antecedent for “hans/hennes/dets”. In this specific case Norwegian is less ambiguous than English. Applying CL/NLP to Arabic texts should assist reasonably well with understanding these texts better and finding correlations within a text or with other texts. CL/NLP methods are excellent tools for identifying collocations, frequency, syntax, semantic contexts, etc. CL/NLP could and should help parse and process digitalized Arabic texts, including large amounts of text, and help produce corpus-based grammars and dictionaries.
7 Transcription and Transliteration

Gone are the days of ASCII. With Unicode, many of the limitations and difficulties encountered due to ASCII are solved. There is no rational or technical reason why scholars today should still use ASCII-inspired transcription or transliteration models (for example the often used Buckwalter model developed in 1988, see [7], see also Habash et al. 2007). Unicode provides the means to produce consistent transcriptions that are both human and machine readable. Unicode also handles the Arabic script quite well. Given that the Arabic script is written from right to left, some issues may arise with certain applications or programming languages. Unicode text is stored in logical (reading) order, and it is left to the displaying software to render Arabic strings from right to left. Not all applications are able to handle this and/or diacritics correctly. This can produce issues, especially with formatting. My experience though is that applications and programming languages that handle RegEx well will not have noteworthy issues with the Arabic script.

There could still be a need to transcribe the Arabic texts, or at least a portion of them, into a Latin script based text. The tags and encoding should at least be in the Latin script. In my experience, due to the ambiguity of the Arabic script and also due to the orthography of Arabic, a combination of Arabic script and transcription produces the best results. In this context, it is nevertheless important to distinguish between the technical terms transliteration and transcription. These are often confused. Commonly used terms such as ‘romanization’ should be avoided. Transliteration is a mapping into a Latin based script of the characters that occur in the Arabic text as they are attested in the text, character for character, diacritical sign for diacritical sign.
Transliteration is very important since it gives the researcher or the CL model a clear picture of what is actually written in the Arabic text. It answers questions such as: What information is available in the text? Did the Arabic text include diacritics or not? Transcription, on the other hand, falls into the domain of interpretation, especially so in languages with consonantal or abjad scripts such as Arabic or other Semitic scripts. For English one could mention the string “gh”, which in “laugh” is interpreted as /laf/ and in “sigh” as /sī/; or, as noted above, “lwm” could be interpreted as /lawm/ or /lūm/. Transcription surely reduces ambiguity. But one should remember that transcription is the result of the transcriber’s interpretation of the text.

There is a set of factors that need to be accounted for when applying CL/NLP to Arabic. In general, these factors make the computational linguistic processes more cumbersome for Arabic. Yet, one ignores or downplays them at one’s own peril. A major rule for any scholar is: Know your data, and know it well. On the other hand, the benefits of well-designed models for Arabic far outweigh the efforts required. So, in order for CL/NLP to work for Arabic there is very important groundwork that needs to be done. This work can be summarized in one phrase: specially designed, tagged and annotated corpora.
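The difference between the two operations can be sketched as follows. The mapping table is a small illustrative one made up for this example (it is not the Buckwalter scheme nor any published standard, and the use of “~” for the doubling sign is an arbitrary choice); the candidate transcriptions for “lwm” are the two given in the text.

```python
# Illustrative transliteration table: one Latin character per Arabic character,
# diacritics included. Only the characters needed for the examples are listed.
TRANSLIT_TABLE = {
    "\u0643": "k", "\u062A": "t", "\u0628": "b",   # kaf, ta, ba
    "\u0644": "l", "\u0648": "w", "\u0645": "m",   # lam, waw, mim
    "\u064E": "a", "\u064F": "u", "\u0650": "i",   # fatha, damma, kasra
    "\u0651": "~",                                 # shadda (arbitrary symbol)
}
DIACRITICS = {"\u064E", "\u064F", "\u0650", "\u0651"}


def transliterate(text):
    """Character-for-character mapping; unknown characters are kept as they are."""
    return "".join(TRANSLIT_TABLE.get(ch, ch) for ch in text)


def has_diacritics(text):
    """Answers the question: did the Arabic text include diacritics or not?"""
    return any(ch in DIACRITICS for ch in text)


# Transcription is interpretation: it is stored as a set of candidate readings
# for a transliterated string, not computed mechanically from the characters.
TRANSCRIPTIONS = {"lwm": ["lawm", "lūm"]}

unvocalized = "\u0644\u0648\u0645"                # lam, waw, mim
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba with fatha each

translit_unvocalized = transliterate(unvocalized)
translit_vocalized = transliterate(vocalized)
candidates = TRANSCRIPTIONS.get(translit_unvocalized, [])
```

The transliteration of the unvocalized string is simply “lwm”, with nothing added, while the list of candidates records the interpretations a transcriber would have to choose between. Keeping the two layers separate makes it possible to tell afterwards what was in the text and what was supplied by the analyst.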
8 Arabic Corpora

At the University of Oslo in the early 1990s nearly everybody was involved in digitalizing documents. OCR was still in its infancy, scanners had relatively low resolution and most
E. Wardini
documents were printed. Still, the push to digitalize archives, from old documents to more recent texts, was in full swing. Some were assigned to scanners. Others were tasked with proof-reading the OCRed documents. But most importantly, the linguistically savvy among the participants were assigned the task of encoding and tagging the texts. AI (artificial intelligence), NN (neural networks), ML (machine learning), NLP, etc. were all unknown then, but somehow anticipated. Maybe by good fortune, since without access to AI, researchers who at the time wanted to tap into the growing body of digitalized texts relied on RegEx and specialized software such as Conc or CasualConc, to mention a few. Tags were the reliable means of retrieving and connecting desired data from extensive corpora (see for example Text Encoding Initiative, see [26]). Accurately digitalized texts and well-developed encoding, tagging, and lemmatization not only opened treasure troves to scholars, but also provided the emerging AI, ML, NN, NLP with reliable datasets to train models. This sequence of events is key. AI, ML, and CL models are only as good as the datasets they are trained with. For example, I entered the following string “kyf ḥlk” into Madamira’s demo website, where they claim they can disambiguate not only Standard Arabic, but also Egyptian (Cairo?) Arabic. The site returned the following:
This was a trick question and maybe somewhat unfair. The string “kyf ḥlk” is rather Levantine Arabic /kīf ḥālak/ or Modern Standard Arabic /kayfa ḥāluka/, not Egyptian (Cairo) Arabic /ez-zayyak/. Yet Madamira still returned a ‘result’ and it was not: This string is not Egyptian (Cairo) Arabic. Even if Egyptian Arabic and Modern Standard Arabic coexist in Egypt, they are still distinct variants of Arabic (see above). Similarly, Google Translate, which relies heavily on parallel corpora, more often than not does not return adequate translations into Arabic. The intention here is not to cast a shadow on any specific project. Nor is this an overview or review of existing projects. Rather, the aim is to highlight the important gap in the work on CL/NLP in Arabic: The lack of adequately digitalized and encoded corpora. Many researchers in Arabic computational linguistics seem to want to leapfrog the extensive work that has preceded the development and successes achieved in languages such as English, other European languages or some Asian languages. So what are the challenges that CL/NLP in Arabic faces and needs to overcome before major successes can be achieved?
9 Specialized Corpora
Text corpora are tools. And as tools they are, or should be, designed to fulfill a certain purpose. Most Arabic corpora are rather collections of texts, text repositories, for example al-Maktaba al-Shamila (see [16]), archive.org (see [2]), al-Waraq (see [29]), etc. Most texts are in PDF format and are simply scans of the Arabic printed texts. These texts can be OCRed, but the quality is usually not good. Moreover, OCR of Arabic texts is still in its infancy and much work needs to be done before high quality digital texts can be produced from scans without extensive human intervention. In the context of CL/NLP,
Arabic Computational Linguistics
these repositories are not of great value. But in terms of making the Arabic texts available, they are of immense value. Some of these repositories do provide digitalized texts, such as al-Maktaba al-Shamila and al-Waraq, among others, some free and others paid. The main goal is to make Arabic texts available to the general public. One is not sure though about the copyright status of some of these sites. Some other projects are aimed more generally at researchers. One such project is the Shamela: A Large-Scale Historical Arabic Corpus (see [23], see also Belinkov et al. 2016). They state on their homepage: “We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities”. A similar project is the Kitab project (see [14]). They are working on providing digital Arabic texts of high quality. The project provides tools for searching the digitalized texts, but most importantly, and something they are excellent at, they provide tools to compare texts. Using chunks of some 300 words they perform excellent intertextual analysis in order to find relations between texts (see [13]). The choice of 300 words is interesting, since with fewer words the results would be very noisy due to the ambiguity of the Arabic script, and with more words the analysis would not yield good results. The Kitab project does provide metadata to the texts in the project. These comprise information concerning the document itself: source, author, genre, etc. To my knowledge the project does not provide texts that are encoded and tagged at the level of the token or phrase.
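The general idea of comparing texts in chunks of some 300 words can be sketched as follows. This is a generic illustration and does not reproduce the Kitab project's actual method or software: the function names, the use of word trigrams, the Jaccard measure and the threshold of 0.5 are all assumptions made for the example.

```python
def chunks(tokens, size=300):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def shingles(tokens, n=3):
    """Return the set of word n-grams that occur in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_passages(text_a, text_b, size=300, n=3, threshold=0.5):
    """Yield (i, j, score) for chunk pairs whose n-gram overlap reaches `threshold`."""
    sets_a = [shingles(c, n) for c in chunks(text_a.split(), size)]
    sets_b = [shingles(c, n) for c in chunks(text_b.split(), size)]
    for i, sa in enumerate(sets_a):
        for j, sb in enumerate(sets_b):
            score = jaccard(sa, sb)
            if score >= threshold:
                yield i, j, score


# Synthetic placeholder "texts": text_b reuses the first 300 words of text_a.
text_a = " ".join(f"w{i}" for i in range(600))
text_b = " ".join(f"x{i}" for i in range(300)) + " " + " ".join(f"w{i}" for i in range(300))

print(list(shared_passages(text_a, text_b)))  # [(0, 1, 1.0)]
```

Fixed chunk boundaries are the simplest choice, but a reused passage that straddles two chunks is scored lower than it deserves. Overlapping windows would be one way to reduce this effect.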
Not surprisingly, the Quran has received the most attention, with excellent projects, such as quran.com, Corpus Coranicum (see [9]), The Quranic Arabic Corpus (see [27]), among many others. Special mention goes to the Tanzil.net project, which makes its digital version of the Quran manuscript of Madina (muṣḥaf al-madīna) available for use for free with attribution. Most of the Quran projects do provide transliterations and/or transcriptions of the Arabic text. All provide more or less advanced search engines. Some, such as The Quranic Arabic Corpus, provide morphological information and analysis of the Quran text. Most of these projects are nevertheless designed to help in reading and studying the Quran, rather than being tools for CL/NLP. Projects like The Corpus Coranicum stand out. Corpus Coranicum state on their webpage: “The project offers systematic access to early Qur’anic manuscripts with images and transliterated text. In parallel, a catalogue of variant readings included in the works of the Islamic scholarly tradition is produced”. The availability of the digital Tanzil.net text of the Quran, on the other hand, provides an excellent basis for those who might want to apply CL/NLP to the text of the Quran. But the text as it is from Tanzil.net should be seen as raw data which needs to be encoded and tagged before it is of much use for CL/NLP. In other words, and without reviewing all the repositories of Arabic texts or Arabic text projects, digitalized Arabic texts are not readily available, and the quality of those that are available varies. Furthermore, properly encoded and tagged texts that are adequate for CL and model training are pressing desiderata.
10 Encoding Texts
Throwing CL/NLP models developed for English at Arabic just does not work. There is clearly a pressing need for systematic, sustained and long-term efforts to prepare and develop well-designed and well-executed extensive Arabic datasets aimed specifically at CL/NLP. The teams involved should consist of scholars who are well versed both in CL/NLP and Arabic linguistics. To the extent possible, international standards and conventions should be used, but special attention should also be given to the requirements of Arabic. This is a two-pronged approach: 1. Preparing Arabic datasets that are adapted to be used in CL/NLP; 2. Developing and adapting models to be used with Arabic. These two processes go hand in hand, the one feeding and providing corrections to the other. There are some projects that are pushing in this direction. The above-mentioned Madamira and CalimaStar are such examples. Another is arTenTen: Corpus of the Arabic Web (see [1], see also Arts et al. 2014). The tendency though is to use CL models to tag the texts. arTenTen state on their webpage: “The arTenTen corpus was tagged by the Stanford Arabic parser […]”. The Stanford University Arabic Natural Language Processing (see [24]) provides software for CL/NLP in Arabic. For Syriac, one could mention the excellent work of the Simtho project, with limited resources, at Beth Mardutho (see [6]). There is no reason why, as in the early 1990s for European and other languages, students and others could not be tasked to digitalize Arabic texts. A team, or several, should be able to adapt international guidelines to encode Arabic texts. Then linguistically savvy participants should be put to the heavy and cumbersome, yet crucially important, work of encoding, tagging and lemmatizing the digitalized texts. Then datasets should be prepared and tested for use with ML and NLP. It is only when this process reaches a certain maturity that more extensive CL/NLP work can be done even on non-encoded texts.
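What encoding and tagging at the level of the token can look like, and why tags make retrieval reliable, can be sketched in a few lines of Python. The element and attribute names (`s`, `w`, `n`, `lemma`, `pos`) follow the general spirit of the Text Encoding Initiative, but their use here is an assumption for illustration only; a real project would follow the TEI Guidelines in full. The annotation values are illustrative and are not drawn from an annotated corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical token annotations: (transliterated form, lemma, part of speech).
# The values are illustrative only, not taken from an existing corpus.
annotated = [
    ("kyf", "kayfa", "ADV"),
    ("lwm", "lawm", "NOUN"),
]


def encode_sentence(tokens, sent_id):
    """Wrap annotated tokens in a TEI-like <s> element with one <w> per token."""
    s = ET.Element("s", {"n": sent_id})
    for i, (form, lemma, pos) in enumerate(tokens, start=1):
        w = ET.SubElement(s, "w", {"n": str(i), "lemma": lemma, "pos": pos})
        w.text = form
    return s


def find_by_lemma(sentence, lemma):
    """Retrieve the attested forms of all tokens tagged with the given lemma."""
    return [w.text for w in sentence.iter("w") if w.get("lemma") == lemma]


sentence = encode_sentence(annotated, "s1")
print(ET.tostring(sentence, encoding="unicode"))
print(find_by_lemma(sentence, "lawm"))  # ['lwm']
```

Because the lemma is recorded explicitly, the query does not depend on guessing how an ambiguous string such as “lwm” should be read. The interpretive work has been done once, by the annotator, and is available to every later search or model.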
References
1. arTenTen: Corpus of the Arabic Web. https://www.sketchengine.eu/artenten-arabic-corpus/
2. Archive.org. https://archive.org
3. Arts, T., Belinkov, Y., Habash, N., Kilgarriff, A., Suchomel, V.: arTenTen: Arabic corpus and word sketches. J. King Saud Univ. Comput. Inf. Sci. 26, 357 (2014). https://doi.org/10.1016/j.jksuci.2014.06.009
4. Badawi, E.M., Carter, M.G., Gully, A.: Modern Written Arabic: A Comprehensive Grammar. Routledge, London (2004)
5. Belinkov, Y., Magidow, A., Romanov, M., Shmidman, A., Koppel, M.: Shamela: A Large-Scale Historical Arabic Corpus (2016)
6. Beth Mardutho. https://bethmardutho.org/simtho/
7. Buckwalter developed in 1988. http://www.qamus.org/transliteration.htm
8. CalimaStar. https://calimastar.abudhabi.nyu.edu/analyzer/
9. Corpus Coranicum. https://corpuscoranicum.de
10. Ditters, E.: Issues in Arabic computational linguistics. In: Owens, J. (ed.) The Oxford Handbook of Arabic Linguistics. Online Publication (2013)
11. Eddakrouri, A.: https://sites.google.com/a/aucegypt.edu/infoguistics/directory/Corpus-Linguistics/arabic-corpora
12. Habash, N., Soudi, A., Buckwalter, T.: On Arabic transliteration. In: Soudi, A., Bosch, A., Neumann, G. (eds.) Arabic Computational Morphology. Text, Speech and Language Technology, vol. 38. Springer, Dordrecht (2007). https://doi.org/10.1007/978-1-4020-6046-5_2
13. The History of the Arabic Book: A New Chapter. Institute for Advanced Study, Near Eastern Studies and Digital Scholarship @IAS Joint Lecture, 4 March 2021. See also https://www.youtube.com/watch?v=Z6KkpF3-73U
14. Kitab project. http://kitab-project.org
15. Madamira demo webpage. https://camel.abudhabi.nyu.edu/madamira/. See also http://innovation.columbia.edu/technologies/cu14012_arabic-language-disambiguation-for-naturallanguage-processing-applications
16. al-Maktaba al-Shamila. https://shamela.ws
17. McEnery, T., Hardie, A., Younis, N. (eds.): Arabic Corpus Linguistics. Edinburgh University Press, Edinburgh (2019)
18. Palva, H.: Dialect classification. In: Versteegh, C.H.M., Eid, M. (eds.) Encyclopedia of Arabic Language and Linguistics, vol. 1, A-Ed, pp. 604–613. Brill, Leiden (2006)
19. Pasha, A., et al.: MADAMIRA: a fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, pp. 1094–1101 (2014)
20. Retsö, J.: What is Arabic? In: Owens, J. (ed.) The Oxford Handbook of Arabic Linguistics. Online Publication (2013)
21. Salloum, W., Habash, N.: ADAM: Analyzer for Dialectal Arabic Morphology. J. King Saud Univ. Comput. Inf. Sci. 26, 372–378 (2014)
22. Samih, Y.: Dialectal Arabic Processing Using Deep Learning. Inaugural-Dissertation. Heinrich-Heine-Universität Düsseldorf, Düsseldorf (2017)
23. Shamela: A Large-Scale Historical Arabic Corpus. https://arxiv.org/abs/1612.08989
24. Stanford University Arabic Natural Language Processing. https://nlp.stanford.edu/projects/arabics.html
25. Taji, D., Khalifa, S., Obeid, O., Eryani, F., Habash, N.: An Arabic morphological analyzer and generator with copious features. In: Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 140–150. Brussels, Belgium, 31 October 2018
26. Text Encoding Initiative. https://tei-c.org
27. The Quranic Arabic Corpus. https://corpus.quran.com
28. Versteegh, C.H.M.: The Arabic Language, 2nd edn. Edinburgh University Press, Edinburgh (2014)
29. al-Waraq. https://alwaraq.net/
30. Wardini, E.: The Quran: Key Words in Context, vol. 1–5. Gorgias Press, Piscataway (2020)
31. Wardini, E.: The Quran: Key Word Collocations, vol. 1–16. Gorgias Press, Piscataway (2021)
32. Wehr, H.: Arabisches Wörterbuch für die Schriftsprache der Gegenwart. In: Hans, W., Milton, C.J. (eds.) Leipzig. English translation: A Dictionary of Modern Written Arabic (Arabic-English), 4th edn. Considerably enl. and amended by the author. Spoken Language Services, New York (1994)
Author Index
F
From, Asta Halkjær, 25
H
Hellan, Lars, 49
J
Jensen, Alexander Birch, 25
K
Kanovich, Max I., 1
Kuznetsov, Stepan G., 1
Kuznetsov, Stepan L., 1
S
Scedrov, Andre, 1
Schlichtkrull, Anders, 25
V
Villadsen, Jørgen, 25
W
Wardini, Elie, 105
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. Loukanova (Ed.): NLPinAI 2021, SCI 999, p. 119, 2022. https://doi.org/10.1007/978-3-030-90138-7