Natural Language Processing in Artificial Intelligence ― NLPinAI 2021 (Studies in Computational Intelligence, 999) 3030901378, 9783030901370

The book covers theoretical work, approaches, applications, and techniques for computational models of information, lang


English Pages 126 Year 2021


Table of contents :
Preface
Contents
Decidable Fragments of Calculi Used in CatLog
1 Introduction
2 Morrill's Calculi
3 Polarity and Bracket Restrictions
4 Decidable Multiplicative Fragments
5 Decidable Fragments with Additives
6 Inducing Brackets
7 Conclusion and Future Work
References
Interactive Theorem Proving for Logic and Information
1 Introduction
2 Isabelle/HOL and Deep Embeddings of Logics
2.1 Formally Verified Functional Programming
2.2 Termination
2.3 A Prover for Propositional Logic
3 Epistemic Logic
3.1 Syntax and Semantics
3.2 Axiomatic System
3.3 Soundness
3.4 Completeness
4 Public Announcement Logic
4.1 Axiomatic System
4.2 Reducing to Epistemic Logic
4.3 Soundness
4.4 Completeness
5 Related Work
6 Concluding Remarks
References
A Valence Catalogue for Norwegian
1 Introduction
2 Representing Frame Types
2.1 Argument Labels
2.2 Global Labels
2.3 Global and Argument Labels Together
3 Lexvals and Valpods
3.1 Lexvals
3.2 Valpods
3.3 Further Illustration
4 Using the Resource
4.1 Clausal Arguments
4.2 Particles and Secondary Predicates
4.3 Light Reflexives
4.4 Conclusions
5 Discussion
5.1 Issues of Redundancy
5.2 Valpod Intersections vs. ‘Valency Classes’
5.3 Valence Frames and Senses
6 Final Remarks
6.1 Comparison with Other Valence Resources
6.2 Extendability to Other Languages
6.3 Possible Applications
6.4 Extending the Catalogue
Appendix 1 Overview of Frame Types
Appendix 2 Verbs Allowing for All Three Types of Clausal Arguments: Declaratives, Interrogatives and Infinitives
References
Arabic Computational Linguistics: Potential, Pitfalls and Challenges
1 The Potential of CL/NLP for Arabic
2 What is ‘Arabic’?
3 The Arabic Script
4 Arabic Morphology
5 Arabic Orthography
6 Ambiguity
7 Transcription and Transliteration
8 Arabic Corpora
9 Specialized Corpora
10 Encoding Texts
References
Author Index

Studies in Computational Intelligence 999

Roussanka Loukanova   Editor

Natural Language Processing in Artificial Intelligence – NLPinAI 2021

Studies in Computational Intelligence Volume 999

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092


Editor Roussanka Loukanova Department of Algebra and Logic Institute of Mathematics and Informatics Bulgarian Academy of Sciences Sofia, Bulgaria

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-90137-0 ISBN 978-3-030-90138-7 (eBook) https://doi.org/10.1007/978-3-030-90138-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Computational and technological developments that incorporate natural language are proliferating. Adequate coverage of Natural Language Processing in Artificial Intelligence encounters problems on developments of specialised computational approaches and algorithms. Many difficulties are due to ambiguities in natural language and dependency of interpretations on contexts and agents, which can arise in computational systems based on nature of languages. Classical approaches proceed with relevant updates, and new developments emerge in theories of formal and natural languages, computational models of information and reasoning, and related computerised applications. The book covers theoretical work, approaches, applications, and techniques for computational models of information, language, and reasoning. Its focus is on computational processing of human language and relevant medium languages, which can be theoretically formal, or for programming and specification of computational systems. The goal is to promote intelligent natural language processing, along with models of computation, language, reasoning, and other cognitive processes. The Special Session on Natural Language Processing in Artificial Intelligence— NLPinAI 2021 (http://www.icaart.org/NLPinAI.aspx?y=2021) was held within the 13th International Conference on Agents and Artificial Intelligence—ICAART 2021 (http://www.icaart.org/?y=2021), by distance, Online Streaming, 4–6 February 2021. The series of the special sessions Natural Language Processing in Artificial Intelligence (NLPinAI) and its post-conference book volumes address the above challenge, by advancements of further research, and also sharing ideas and feedback between researchers. The book sequence Natural Language Processing in Artificial Intelligence— NLPinAI covers a variety of topics, e.g. 
• Logic Approaches to Natural Language Processing
• Classical and Non-Classic Logics for applications to NLP
• Type Theories for Applications to Natural Language
• Computational Grammar
• Large-Scale Grammars of Natural Languages
• Syntax, Semantics, Syntax-Semantics Interfaces
• Information Theory
• Statistical Approaches in Computational Linguistics and NLP
• Machine Learning of Grammar and Language
• Integrated Approaches in Computational Linguistics and NLP

The chapters of this book volume, NLPinAI 2021, are based on extended work on selected topics of the Special Session on Natural Language Processing in Artificial Intelligence (NLPinAI 2021).

Chapter 1 presents new developments of CatLog. The CatLog system is a categorial grammar parser and theorem-prover originally developed by Glyn Morrill and his co-authors. There are two variants of extended Lambek calculus in two versions of CatLog, both of which are undecidable. The chapter focuses on fragments where the usage of the subexponential is restricted by specialised bracket (non-negative/non-positive) conditions. The authors prove that these fragments are decidable, and place them in the complexity hierarchy. Then, they present a practically important problem of predicting brackets, and prove one decidability and one undecidability result.

Chapter 2 is on automated reasoning as a computer assistant for building proofs of theorems in logic, with a focus on using the Isabelle proof assistant. The authors link two approaches, Epistemic Logic and Public Announcement Logic. Systems of epistemic logic can model reasoning with knowledge of agents. Public announcements can update knowledge of a system, users, and agents. The chapter presents formalisations of axiomatic systems for epistemic and public announcement logic, which improve the foundations of automated reasoning for logic and information.

Chapter 3 of the volume NLPinAI 2021 is a specialised, extensive study of verbal valences for Norwegian. The work presents an exhaustive resource catalogue, NorVal, which contains formal descriptions of the valence features of more than 6300 lemmas. The theoretical research together with the valence resource NorVal has great potential for applications to NLP, not only for Norwegian, but also for computerised translation systems, as well as a resource for human usage in translations and transcripts. It also presents an example for similar developments for other human languages, including English.

Chapter 4 is on further prospects of computational linguistics of Arabic. It is a discussion of the potential and challenges that the Arabic language presents to NLP, by the nature of the Arabic morphology, script, transcription, and transliteration.

September 2021
Roussanka Loukanova

Contents

Decidable Fragments of Calculi Used in CatLog
Max I. Kanovich, Stepan G. Kuznetsov, Stepan L. Kuznetsov, and Andre Scedrov

Interactive Theorem Proving for Logic and Information
Jørgen Villadsen, Asta Halkjær From, Alexander Birch Jensen, and Anders Schlichtkrull

A Valence Catalogue for Norwegian
Lars Hellan

Arabic Computational Linguistics: Potential, Pitfalls and Challenges
Elie Wardini

Author Index

Decidable Fragments of Calculi Used in CatLog

Max I. Kanovich (1,2), Stepan G. Kuznetsov (3), Stepan L. Kuznetsov (2,4; corresponding author), and Andre Scedrov (5)

1 University College London, Gower Street, London, UK
2 Computer Science Department, HSE University, 11 Pokrovsky Blvd., Moscow, Russia
3 Mathematics Department, HSE University, 6 Usacheva Street, Moscow, Russia
4 Steklov Mathematical Institute of RAS, 8 Gubkina Street, Moscow, Russia
5 Department of Mathematics, University of Pennsylvania, 209 South 33rd Street, Philadelphia, PA, USA

Abstract. CatLog is a categorial grammar parser/theorem-prover developed by Glyn Morrill and his co-authors. CatLog is based on an extension of Lambek calculus. A distinctive feature of this extension is the usage of brackets for controlled non-associativity and a subexponential modality whose contraction rule interacts with bracketing in a sophisticated way. We consider two variants of the calculus, appearing in different versions of CatLog. Both systems are, unfortunately, undecidable in general. We consider fragments where the usage of the subexponential is restricted by so-called bracket non-negative/non-positive conditions, prove that these fragments are decidable, and pinpoint their place in the complexity hierarchy. We also consider a more complicated, but more practically interesting problem of inducing (guessing) brackets. For this problem, we prove one decidability and one undecidability result, and leave some open questions for further research.

Keywords: Lambek calculus · Categorial grammars · Subexponential modalities · Bracket modalities

1 Introduction

The Lambek calculus was introduced by J. Lambek [17] for mathematical description of natural language syntax, in the framework of categorial grammars. The idea of categorial grammar goes back to Ajdukiewicz [2] and Bar-Hillel [3], and Lambek-style grammars form a subclass of categorial grammar formalisms.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 1–24, 2022. https://doi.org/10.1007/978-3-030-90138-7_1


In a categorial grammar, each lexeme is annotated by a syntactic type (category), which is a formula of a specific non-classical substructural logic, e.g., the Lambek calculus. A sentence is considered grammatical if the sequent composed from types of lexemes is derivable in the given calculus. The original Lambek calculus is capable of handling basic cases like “John loves Mary.” In this example, syntactic types are assigned as follows:

$$\text{John},\ \text{Mary} \triangleright N \qquad \text{loves} \triangleright (N \backslash S)/N$$

Here N and S are variables (primitive types), meaning “noun phrase” and “sentence” respectively. The \ and / operations, called left and right divisions, are directed implications. The type for “loves,” (N\S)/N, means that this word lacks two noun phrases (N), one on the left and one on the right, to form a sentence (S). Derivability of the sequent N, (N\S)/N, N → S in the Lambek calculus justifies “John loves Mary” as a grammatically correct sentence.

The Lambek calculus is usually formulated as a Gentzen-style sequent calculus. Formulae are constructed from a countable set of variables (p1, p2, p3, ...) using two divisions (\ and /) and product (·). A sequent is an expression of the form Γ → B, where Γ is a sequence of formulae and B is a formula. Axioms are sequents of the form A → A, and inference rules are as follows:

$$\frac{\Gamma \to B \quad \Delta_1, C, \Delta_2 \to D}{\Delta_1, C/B, \Gamma, \Delta_2 \to D}\ {/}L \qquad \frac{\Gamma, B \to C}{\Gamma \to C/B}\ {/}R$$

$$\frac{\Gamma \to A \quad \Delta_1, C, \Delta_2 \to D}{\Delta_1, \Gamma, A \backslash C, \Delta_2 \to D}\ {\backslash}L \qquad \frac{A, \Gamma \to C}{\Gamma \to A \backslash C}\ {\backslash}R$$

$$\frac{\Delta_1, A, B, \Delta_2 \to D}{\Delta_1, A \cdot B, \Delta_2 \to D}\ {\cdot}L \qquad \frac{\Delta \to A \quad \Gamma \to B}{\Delta, \Gamma \to A \cdot B}\ {\cdot}R$$
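The six rules above can be read from conclusion to premises as a search procedure. The following is a minimal sketch of such a root-first, cut-free search, not the procedure used in CatLog. The tuple encoding of formulae and the names `over`, `under` and `derivable` are assumptions of this illustration. The rules are implemented exactly as stated, with no side condition on the antecedent.

```python
from functools import lru_cache

# Encoding (an assumption of this sketch): a string is a variable;
# ("/", C, B) is C/B, ("\\", A, C) is A\C, ("*", A, B) is the product A·B.

def over(c, b):
    return ("/", c, b)

def under(a, c):
    return ("\\", a, c)

@lru_cache(maxsize=None)
def derivable(gamma, goal):
    """Root-first proof search for the sequent gamma -> goal (gamma is a tuple)."""
    if gamma == (goal,):                                   # axiom A -> A
        return True
    if isinstance(goal, tuple):                            # right rules
        op, x, y = goal
        if op == "/" and derivable(gamma + (y,), x):       # /R: Gamma, B -> C
            return True
        if op == "\\" and derivable((x,) + gamma, y):      # \R: A, Gamma -> C
            return True
        if op == "*":                                      # ·R: split the antecedent
            for k in range(len(gamma) + 1):
                if derivable(gamma[:k], x) and derivable(gamma[k:], y):
                    return True
    for i, f in enumerate(gamma):                          # left rules
        if not isinstance(f, tuple):
            continue
        op, x, y = f
        left, right = gamma[:i], gamma[i + 1:]
        if op == "*" and derivable(left + (x, y) + right, goal):   # ·L
            return True
        if op == "/":      # f = C/B with C = x, B = y; Gamma stands right of f
            for j in range(len(right) + 1):
                if derivable(right[:j], y) and derivable(left + (x,) + right[j:], goal):
                    return True
        if op == "\\":     # f = A\C with A = x, C = y; Gamma stands left of f
            for j in range(len(left) + 1):
                if derivable(left[j:], x) and derivable(left[:j] + (y,) + right, goal):
                    return True
    return False

loves = over(under("N", "S"), "N")                         # (N\S)/N
print(derivable(("N", loves, "N"), "S"))                   # "John loves Mary"
```

The search terminates because every premise has fewer connectives than its conclusion. Memoisation with `lru_cache` is a convenience of the sketch and does not change which sequents are accepted.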

The Lambek calculus comes in two variants, depending on whether we allow antecedents (left-hand sides of sequents) to be empty. The natural way to disallow empty antecedents is to impose the constraint “Γ is non-empty” on /R and \R. This constraint is called Lambek's restriction and exists in the original Lambek calculus [17]. Later on, however, Lambek also introduced a variant of his calculus without this restriction [18]. From the point of view of algebraic logic (see [8]), the Lambek calculus with Lambek's restriction is the logic of residuated partially ordered semigroups, while the Lambek calculus without this restriction corresponds to residuated partially ordered monoids. On the other hand, as a logical system, the Lambek calculus without Lambek's restriction is a non-commutative intuitionistic variant of Girard's [9] linear logic (see Abrusci [1]). From this perspective, Lambek's restriction seems unnatural.

For linguistic applications, however, Lambek's restriction is desirable. Without this restriction, Lambek grammars overgenerate, i.e., accept incorrect phrases as correct ones. This can be seen in the following example [21, § 2.5]: “very book.”


Being ungrammatical, this phrase is accepted under the following natural type assignment:

$$\text{book} \triangleright CN \qquad \text{very} \triangleright (CN/CN)/(CN/CN)$$
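One way to see how this assignment lets the phrase through is to write the derivation out. The following is a sketch that uses only the /L and /R rules of the Lambek calculus as stated earlier. The single step that needs an empty antecedent is the /R inference in the left branch, which derives CN/CN from nothing.

```latex
\[
\dfrac{
  \dfrac{CN \to CN}{\to CN/CN}\ {/}R
  \qquad
  \dfrac{CN \to CN \quad CN \to CN}{CN/CN,\ CN \to CN}\ {/}L
}{
  (CN/CN)/(CN/CN),\ CN \to CN
}\ {/}L
\]
```

With the constraint “Γ is non-empty” imposed on /R, the left branch is blocked and this derivation is no longer available.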

Here the primitive type CN stands for “common noun,” a noun phrase without an article. (Unlike N, such a phrase cannot be directly used as an object or subject.) Without Lambek's restriction, the sequent (CN/CN)/(CN/CN), CN → CN is derivable, which declares “very book” a correct common noun group (which is actually not the case). Informally, the absence of Lambek's restriction allows the usage of empty words. In our example, there is an “empty adjective” of type CN/CN between very and book: compare with a grammatically correct phrase “very interesting book,” (CN/CN)/(CN/CN), CN/CN, CN → CN.

More sophisticated natural language phenomena require extended versions of the Lambek calculus. In this paper, we focus on systems developed by G. Morrill and his co-authors for the CatLog natural language parser/theorem-prover [26, 28]. For a more general overview of Lambek-style categorial grammar formalisms, see Buszkowski [4], Carpenter [6], Morrill [31], Moot and Retoré [21], etc. In our linguistic examples, we follow Morrill [31] and later papers by Morrill and his co-authors.

The limitations of grammars based on the “pure” Lambek calculus show up when one tries to analyze complex and compound sentences. The core construction here is relativisation, which connects a dependent clause to the main sentence. In some easy cases, relativisation can still be described by means of the Lambek calculus. For example, in the noun phrase “the girl whom John loves” the following type assignment does the job:

$$\text{John} \triangleright N \qquad \text{loves} \triangleright (N \backslash S)/N \qquad \text{the} \triangleright N/CN \qquad \text{girl} \triangleright CN \qquad \text{whom} \triangleright (CN \backslash CN)/(S/N)$$

The sequent N/CN, CN, (CN\CN)/(S/N), N, (N\S)/N → N is derivable in the Lambek calculus, because “John loves” is a sentence lacking a noun phrase on the right, i.e., an object of type S/N.

In more complicated situations, however, the “pure” Lambek calculus is insufficient. This can be seen on examples like “the girl whom John met yesterday” or “the paper that John signed without reading.” In the first example, the gap in S which should be filled by N is located in the middle: “John met ... yesterday,” which cannot be handled by Lambek divisions. In the second example, there are even two gaps, which should be filled by the same N (the paper): “John signed ... without reading ...” These two phenomena are called medial and parasitic extraction respectively, and Morrill suggests a structural (subexponential) modality, denoted by !, to handle them. This modality allows permutation (for medial extraction) and some form of contraction (for parasitic extraction).

On the other hand, extraction from compound sentences leads to overgeneration. The standard example here is “the girl whom John loves Mary and Pete loves.” This phrase is ungrammatical. However, it is parsed as a common noun group, CN, since “John loves Mary and Pete loves” is of type S/N (cf. “John loves Mary and Pete loves Ann” being of type S). In order to overcome this issue, a sentence obtained from two other sentences using “and” should be made an island, which cannot be penetrated by extraction. In Morrill's system, islands are introduced and managed using brackets and bracket modalities, the idea of which goes back to Morrill himself [22] and Moortgat [20].

Thus, Morrill's systems include both a subexponential and brackets. Moreover, they interact in a subtle way, since in the case of parasitic extraction one should penetrate islands. Morrill's systems also include lattice-theoretic meet (∧) and join (∨), in other words, additive conjunction and disjunction.

In [16], we give a detailed proof-theoretic analysis of Morrill's systems and prove undecidability of the corresponding derivability problems. The latter is unfortunate, since these systems were designed to be used in natural language parsing software. The present paper is a more optimistic sequel of [16]. Here we prove that these systems enjoy naturally defined decidable fragments, and prove upper complexity bounds for them. While the present paper is mostly self-contained, we suggest that the reader also consult the article [16] for a deeper discussion of Morrill's systems from a proof-theoretic point of view.

The rest of the paper is organised as follows. In Sect. 2, we introduce three of Morrill's systems, with certain proof-theoretic clarifications, following [16]. In Sect. 3, we define the bracket non-negative and bracket non-positive conditions, which are restrictions on the usage of bracket modalities under the subexponential. In Sect. 4, we consider the systems without additive operations and prove that imposing appropriate bracket conditions leads to decidability and, moreover, an NP upper complexity bound. In Sect. 5, we do the same for systems with additive operations; here the upper bound is PSPACE. Notice that these complexity bounds are tight, since the corresponding lower bounds are known already for systems without subexponentials and bracket modalities [11,32]. In Sect. 6, we consider a more complicated, but more practically interesting algorithmic problem of inducing brackets. For this problem, we prove undecidability (even under bracket conditions) for one of Morrill's systems, and decidability and complexity upper bounds for the other one. In the concluding Sect. 7 we discuss directions of further research.

2 Morrill's Calculi

We define three systems, denoted by $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ respectively. The first one originates in works of Morrill and Valentín [30] and Morrill [24,25] from 2015–2017. The second one appears in more recent works of Morrill [27,28] from 2018–2019; however, here Morrill comes back to the ideas from earlier publications [23,31]. Finally, the third system, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, is a variant of $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ which employs Lambek's restriction.


The formulations of Morrill's systems we use here are those presented in our article [16]. These formulations, if compared to Morrill's original ones, are a bit clarified in order to maintain desired proof-theoretic properties (mainly cut elimination); see [16] for details. These clarifications do not alter linguistic applications of the systems. On the other hand, we notice that the systems presented here and in [16] are only fragments of the ones constructed by Morrill: Morrill's original systems have up to 45 connectives, but we focus on the behaviour of brackets and subexponentials.

The version of the ‘newer’ Morrill's calculus with Lambek's non-emptiness restriction, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, was introduced in [16]. Morrill's original formulations do not accommodate Lambek's restriction. The possibility of consistently imposing Lambek's restriction is in fact quite interesting, since more standard (sub)exponential modalities happen to be incompatible with this restriction [15]. The interaction of the subexponential with bracketing, however, made imposing Lambek's restriction possible (which is desirable from the linguistic point of view).

Let us formally define the syntax of the calculi in question, following our article [16] and earlier works by Morrill.

Definition 1.1. Formulae are built from variables (primitive types) p1, p2, p3, ... and the unit constant 1 using five binary operations: \ (left division), / (right division), · (product, or multiplicative conjunction), ∧ (additive conjunction), ∨ (additive disjunction), and three unary operations: $\langle\rangle$, $[]^{-1}$ (bracket modalities), and ! (subexponential modality).

The sequential syntax of Morrill's systems is more involved than the syntax of usual sequent calculi. Namely, left-hand sides of sequents, besides “comma” as a metasyntactic version of multiplication, include brackets for designating islands (controlled non-associativity) and so-called stoups for handling the subexponential modality. The formal definition, following Morrill [28], is as follows.

Definition 1.2. We define the following three notions simultaneously: stoup, tree term, and meta-formula.
• A stoup is a multiset of formulae: ζ = {A1, ..., An}. Here the order does not matter, while the number of occurrences does.
• A tree term is either a formula or a bracketed expression of the form [Ξ], where Ξ is a meta-formula.
• A meta-formula is an expression of the form ζ; Γ, where ζ is a stoup and Γ is a linearly ordered sequence of tree terms.

In a meta-formula, Γ could be empty; in this case, it is denoted by Λ. An empty stoup is omitted: we write just Γ instead of ∅; Γ. We use comma both for concatenation of tree term sequences and for multiset union of stoups. Adding one formula to the stoup is written as ζ, A (the bureaucratic way to write it would be $\zeta \uplus \{A\}$).
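The mutually recursive notions of Definition 1.2 map naturally onto a pair of record types. The following is a minimal sketch under stated assumptions: formulae are kept as opaque strings, a stoup is stored as a sorted tuple (a canonical form for a multiset), and the names `MetaFormula`, `Bracketed`, `meta`, `add_to_stoup` and `show` are inventions of this illustration rather than anything from CatLog.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Formula = str  # assumption: formulae are opaque strings in this sketch

@dataclass(frozen=True)
class Bracketed:
    """A tree term of the form [Xi], where Xi is a meta-formula."""
    body: "MetaFormula"

@dataclass(frozen=True)
class MetaFormula:
    """zeta; Gamma: a stoup plus an ordered sequence of tree terms."""
    stoup: Tuple[Formula, ...] = ()
    terms: Tuple[Union[Formula, Bracketed], ...] = ()

def meta(stoup=(), terms=()):
    # Sorting makes the stoup order-insensitive while keeping multiplicities.
    return MetaFormula(tuple(sorted(stoup)), tuple(terms))

def add_to_stoup(xi, a):
    """The operation written 'zeta, A' in the text."""
    return meta(xi.stoup + (a,), xi.terms)

def show(x):
    """Render with the conventions of the text: Λ for empty Gamma, empty stoup omitted."""
    if isinstance(x, str):
        return x
    if isinstance(x, Bracketed):
        return "[" + show(x.body) + "]"
    gamma = ", ".join(show(t) for t in x.terms) or "Λ"
    return gamma if not x.stoup else ", ".join(x.stoup) + "; " + gamma

xi = meta(["N"], ["S/N", Bracketed(meta([], ["N"]))])
print(show(xi))
```

Storing the stoup in sorted order means that two meta-formulae compare equal exactly when their stoups agree as multisets and their tree term sequences agree as sequences.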


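Definition 1.3. A sequent (in Morrill's terminology, h-sequent) is an expression of the form Ξ → C, where C is a formula and Ξ is a meta-formula.

Let us first define $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$. These calculi share the same axioms and rules for all connectives, except the subexponential modality.

$$\frac{}{A \to A}\ id$$

$$\frac{\zeta_1; \Gamma \to B \quad \Xi(\zeta_2; \Delta_1, C, \Delta_2) \to D}{\Xi(\zeta_1, \zeta_2; \Delta_1, C/B, \Gamma, \Delta_2) \to D}\ {/}L \qquad \frac{\zeta; \Gamma, B \to C}{\zeta; \Gamma \to C/B}\ {/}R$$

$$\frac{\zeta_1; \Gamma \to A \quad \Xi(\zeta_2; \Delta_1, C, \Delta_2) \to D}{\Xi(\zeta_1, \zeta_2; \Delta_1, \Gamma, A \backslash C, \Delta_2) \to D}\ {\backslash}L \qquad \frac{\zeta; A, \Gamma \to C}{\zeta; \Gamma \to A \backslash C}\ {\backslash}R$$

$$\frac{\Xi(\zeta; \Delta_1, A, B, \Delta_2) \to D}{\Xi(\zeta; \Delta_1, A \cdot B, \Delta_2) \to D}\ {\cdot}L \qquad \frac{\zeta_1; \Delta \to A \quad \zeta_2; \Gamma \to B}{\zeta_1, \zeta_2; \Delta, \Gamma \to A \cdot B}\ {\cdot}R$$

$$\frac{\Xi(\zeta; \Delta_1, \Delta_2) \to A}{\Xi(\zeta; \Delta_1, \mathbf{1}, \Delta_2) \to A}\ \mathbf{1}L \qquad \frac{}{\Lambda \to \mathbf{1}}\ \mathbf{1}R$$

$$\frac{\Xi(\zeta; \Delta_1, A_1, \Delta_2) \to C \quad \Xi(\zeta; \Delta_1, A_2, \Delta_2) \to C}{\Xi(\zeta; \Delta_1, A_1 \vee A_2, \Delta_2) \to C}\ {\vee}L \qquad \frac{\Xi \to A_i}{\Xi \to A_1 \vee A_2}\ {\vee}R_i,\ i = 1, 2$$

$$\frac{\Xi(\zeta; \Delta_1, A_j, \Delta_2) \to C}{\Xi(\zeta; \Delta_1, A_1 \wedge A_2, \Delta_2) \to C}\ {\wedge}L_j,\ j = 1, 2 \qquad \frac{\Xi \to A_1 \quad \Xi \to A_2}{\Xi \to A_1 \wedge A_2}\ {\wedge}R$$

$$\frac{\Xi(\zeta; \Delta_1, A, \Delta_2) \to B}{\Xi(\zeta; \Delta_1, [[]^{-1}A], \Delta_2) \to B}\ []^{-1}L \qquad \frac{[\Xi] \to A}{\Xi \to []^{-1}A}\ []^{-1}R$$

$$\frac{\Xi(\zeta; \Delta_1, [A], \Delta_2) \to B}{\Xi(\zeta; \Delta_1, \langle\rangle A, \Delta_2) \to B}\ \langle\rangle L \qquad \frac{\Xi \to A}{[\Xi] \to \langle\rangle A}\ \langle\rangle R$$

The two systems, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, also share two rules for the subexponential modality:

$$\frac{\Xi(\zeta, A; \Gamma_1, \Gamma_2) \to B}{\Xi(\zeta; \Gamma_1, {!}A, \Gamma_2) \to B}\ {!}L \qquad \frac{\Xi(\zeta; \Gamma_1, A, \Gamma_2) \to B}{\Xi(\zeta, A; \Gamma_1, \Gamma_2) \to B}\ {!}P$$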


The other two rules, !R and !C, are different. In $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, they are as follows:

$$\frac{\zeta; \Lambda \to B}{\zeta; \Lambda \to {!}B}\ {!}R,\ \zeta \ne \varnothing \qquad \frac{\Xi(\zeta_1, \zeta_2; \Gamma_1, [\zeta_2, \zeta_3; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, \zeta_2, \zeta_3; \Gamma_1, \Gamma_2, \Gamma_3) \to C}\ {!}C,\ \zeta_2 \ne \varnothing$$

In $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, these rules are as follows:

$$\frac{A; \Lambda \to B}{A; \Lambda \to {!}B}\ {!}R \qquad \frac{\Xi(\zeta_1, A; \Gamma_1, [\zeta_2, A; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, A; \Gamma_1, [[\zeta_2; \Gamma_2]], \Gamma_3) \to C}\ {!}C$$

Finally, $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ is obtained from $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ by imposing Lambek's non-emptiness restriction in the following form:
• in the \R and /R rules, ζ; Γ is required to be non-empty (i.e., to be not ∅; Λ);
• in the !C rule, ζ2; Γ2 is required to be non-empty (in the same sense);
• the unit constant 1, with axiom 1R and rule 1L, is removed.

As shown in [16], the cut rule in the following form:

$$\frac{\xi; \Pi \to A \quad \Xi(\zeta; \Gamma_1, A, \Gamma_2) \to C}{\Xi(\xi, \zeta; \Gamma_1, \Pi, \Gamma_2) \to C}\ cut$$

is admissible in all three systems in question, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$. Thus, all derivations we analyze will be cut-free, but we may use cut to simplify construction of derivations.

In what follows, it will be convenient to restrict the id axiom to its atomic subcase: pi → pi, where pi is a variable. This restriction does not change the set of derivable sequents, due to the following lemma (which is mathematical folklore).

Lemma 1.1. In each of the three systems $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, the sequent A → A, for any A, has a cut-free derivation in which all id axioms are in the atomic form.

Proof. As usual, we proceed by induction on the structure of A. The base case of A = pi is given. For A = 1, we just apply the 1L rule to the 1R axiom. The other cases are considered as follows:

$$\dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_1, A_1 \backslash A_2 \to A_2}\ {\backslash}L}{A_1 \backslash A_2 \to A_1 \backslash A_2}\ {\backslash}R \qquad \dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_2/A_1, A_1 \to A_2}\ {/}L}{A_2/A_1 \to A_2/A_1}\ {/}R$$

$$\dfrac{\dfrac{A_1 \to A_1}{A_1 \wedge A_2 \to A_1}\ {\wedge}L \quad \dfrac{A_2 \to A_2}{A_1 \wedge A_2 \to A_2}\ {\wedge}L}{A_1 \wedge A_2 \to A_1 \wedge A_2}\ {\wedge}R \qquad \dfrac{\dfrac{A_1 \to A_1 \quad A_2 \to A_2}{A_1, A_2 \to A_1 \cdot A_2}\ {\cdot}R}{A_1 \cdot A_2 \to A_1 \cdot A_2}\ {\cdot}L$$


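$$\dfrac{\dfrac{A_1 \to A_1}{A_1 \to A_1 \vee A_2}\ {\vee}R \quad \dfrac{A_2 \to A_2}{A_2 \to A_1 \vee A_2}\ {\vee}R}{A_1 \vee A_2 \to A_1 \vee A_2}\ {\vee}L \qquad \dfrac{\dfrac{A \to A}{[A] \to \langle\rangle A}\ \langle\rangle R}{\langle\rangle A \to \langle\rangle A}\ \langle\rangle L$$

$$\dfrac{\dfrac{A \to A}{[[]^{-1}A] \to A}\ []^{-1}L}{[]^{-1}A \to []^{-1}A}\ []^{-1}R \qquad \dfrac{\dfrac{\dfrac{A \to A}{A; \Lambda \to A}\ {!}P}{A; \Lambda \to {!}A}\ {!}R}{{!}A \to {!}A}\ {!}L$$

Notice that the application of !R here is valid in all three systems; the other rules, !P and !L, are the same.

The power of Morrill's approach can be illustrated by the derivation for “the paper that John signed without reading.” This example shows medial and parasitic extraction, in a bracket-aware setting. With bracket modalities, the type assignment is as follows:

$$\text{the} \triangleright N/CN \qquad \text{paper} \triangleright CN \qquad \text{John} \triangleright N$$

$$\text{signed} \triangleright (\langle\rangle N \backslash S)/N \qquad \text{reading} \triangleright (\langle\rangle N \backslash S)/N$$

$$\text{that} \triangleright ([]^{-1}[]^{-1}(CN \backslash CN))/(S/{!}N)$$

$$\text{without} \triangleright ([]^{-1}((\langle\rangle N \backslash S) \backslash (\langle\rangle N \backslash S)))/(\langle\rangle N \backslash S)$$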

In order to parse this phrase, we should first put the correct bracketing: “the paper [[ that [ John ] signed [[ without reading ]] ]].” Notice that here we distinguish single-bracketed weak islands and double-bracketed strong ones. Less obviously, “without reading” linguistically is a weak island (and it is going to be penetrated using !C), but it is double-bracketed here. The trick is that in the newer Morrill's systems, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ or $!_b^{2018}\mathrm{MALC}(\mathrm{st})$, double brackets become single after applying !C, looking from bottom to top. (For the older system, $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, also in the view of !C, one should start without bracketing this island.) The derivation of the corresponding sequent in $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ is presented in Fig. 1. This figure is a copy of [16, Fig. 2], which is in its turn an adaptation of the derivation given by Morrill [28, Fig. 24]. We include this derivation here for the convenience of the reader.

Also notice that the role of Lambek's restriction here is twofold. Besides disallowing empty words, it also disallows empty islands to be filled by parasitic extraction. An example is the incorrect phrase “the man who likes,” which can be parsed in $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ using an empty subject island: “the man [[ who [[ ]] likes ]],” see [16, Fig. 4]. The first reference for this example is [27, Footnote 1]. Lambek's restriction prevents this.

Now let us discuss algorithmic questions. Unlike other rules, in the contraction rule !C the premise is more complex than the conclusion. Namely, parts of the stoup (ζ2 in $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, and A in $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$ and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$) get copied. This makes the proof search space potentially infinite and yields an unfortunate consequence: as shown in [16], derivability problems in $!_b^{2015}\mathrm{MALC}^{*}(\mathrm{st})$, $!_b^{2018}\mathrm{MALC}^{*}(\mathrm{st})$, and $!_b^{2018}\mathrm{MALC}(\mathrm{st})$ are algorithmically undecidable.
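The copying performed by !C can be made concrete with a small sketch. It models only the 2018-style rule, read from conclusion to premise, and only the island it touches. The representation of an island as a `(bracket_depth, stoup, terms)` triple, the function name `contract_2018`, and the placeholder tree terms `"X"` and `"Y"` are assumptions of this illustration.

```python
from collections import Counter

def contract_2018(outer_stoup, island, a):
    """Read the 2018-style !C rule from conclusion to premise.

    Conclusion:  Xi(zeta1, A; Gamma1, [[zeta2; Gamma2]], Gamma3) -> C
    Premise:     Xi(zeta1, A; Gamma1, [zeta2, A; Gamma2], Gamma3) -> C
    The island is a (bracket_depth, stoup, terms) triple; the premise island is returned.
    """
    depth, stoup, terms = island
    if depth != 2:
        raise ValueError("!C applies to a double-bracketed island")
    if outer_stoup[a] == 0:
        raise ValueError("A must occur in the outer stoup")
    # A is copied into the island's stoup: it also stays in the outer stoup.
    return (1, stoup + Counter([a]), terms)

outer = Counter(["N"])                       # zeta1, A with A = N
island = (2, Counter(), ("X", "Y"))          # [[ zeta2; Gamma2 ]] with an empty zeta2
premise_island = contract_2018(outer, island, "N")
print(premise_island)
```

After the step the sequent contains one more formula occurrence than before, while one pair of brackets has been consumed. The growth in formula occurrences is what prevents a naive bound on the size of premises.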


Fig. 1. Derivation for “the paper [[that [John] signed [[without reading]]]]” in $!_b^{2018}\mathrm{MALC}(\mathrm{st})$

In practice (i.e., in CatLog), however, categorial grammars based on Morrill’s systems are already used for parsing natural language sentences. This means that for sequents which actually occur in practice the proof search procedure terminates. In other words, there are practically important fragments of Morrill’s calculi, for which derivability problems are algorithmically decidable.

3 Polarity and Bracket Restrictions

In what follows, we designate these decidable fragments by imposing certain easily checkable syntactic conditions on formulae and sequents. These conditions are called the bracket non-negative condition (BNNC for short) and the bracket non-positive condition (BNPC). The BNNC was suggested by Morrill and Valentín [30], who presented an exponential-time decision algorithm for sequents obeying this condition. In [13],
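we sketched a proof of the NP upper bound for derivability under the BNNC, in the case without additives; here we give a more detailed proof. The BNNC, however, is useful for the ‘older’ Morrill’s system $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$. For the ‘newer’ systems $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, here we introduce a novel dual constraint, namely, the BNPC, and prove the corresponding decidability and complexity results.

Let us first recall the standard notion of positive and negative subformulae in a given formula/sequent.

Definition 1.4. For a formula $A$ or a sequent $\Xi \to B$, we define two finite sets, $\mathrm{SubFm}^{+}(A)$ and $\mathrm{SubFm}^{-}(A)$ (resp., $\mathrm{SubFm}^{+}(\Xi \to B)$ and $\mathrm{SubFm}^{-}(\Xi \to B)$), by joint recursion.

\[
\begin{aligned}
\mathrm{SubFm}^{+}(p_i) &= \{p_i\}\\
\mathrm{SubFm}^{+}(\mathbf{1}) &= \{\mathbf{1}\}\\
\mathrm{SubFm}^{+}(A \backslash B) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \backslash B\}\\
\mathrm{SubFm}^{+}(B / A) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{B / A\}\\
\mathrm{SubFm}^{+}(A \cdot B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \cdot B\}\\
\mathrm{SubFm}^{+}(A \wedge B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \wedge B\}\\
\mathrm{SubFm}^{+}(A \vee B) &= \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{+}(B) \cup \{A \vee B\}\\
\mathrm{SubFm}^{+}([]^{-1}A) &= \mathrm{SubFm}^{+}(A) \cup \{[]^{-1}A\}\\
\mathrm{SubFm}^{+}(\langle\rangle A) &= \mathrm{SubFm}^{+}(A) \cup \{\langle\rangle A\}\\
\mathrm{SubFm}^{+}(!A) &= \mathrm{SubFm}^{+}(A) \cup \{!A\}
\end{aligned}
\]

$\mathrm{SubFm}^{+}(\Xi)$ is the union of $\mathrm{SubFm}^{+}(A)$, where $A$ is a formula in $\Xi$, either in a stoup or as a tree term.

\[
\mathrm{SubFm}^{+}(\Xi \to B) = \mathrm{SubFm}^{-}(\Xi) \cup \mathrm{SubFm}^{+}(B)
\]

\[
\begin{aligned}
\mathrm{SubFm}^{-}(p_i) &= \mathrm{SubFm}^{-}(\mathbf{1}) = \varnothing\\
\mathrm{SubFm}^{-}(A \backslash B) &= \mathrm{SubFm}^{-}(B / A) = \mathrm{SubFm}^{+}(A) \cup \mathrm{SubFm}^{-}(B)\\
\mathrm{SubFm}^{-}(A \cdot B) &= \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{-}(B)\\
\mathrm{SubFm}^{-}(A \wedge B) &= \mathrm{SubFm}^{-}(A \vee B) = \mathrm{SubFm}^{-}(A) \cup \mathrm{SubFm}^{-}(B)\\
\mathrm{SubFm}^{-}([]^{-1}A) &= \mathrm{SubFm}^{-}(\langle\rangle A) = \mathrm{SubFm}^{-}(!A) = \mathrm{SubFm}^{-}(A)
\end{aligned}
\]

$\mathrm{SubFm}^{-}(\Xi)$ is the union of $\mathrm{SubFm}^{-}(A)$, where $A$ is a formula in $\Xi$, either in a stoup or as a tree term.

\[
\mathrm{SubFm}^{-}(\Xi \to B) = \mathrm{SubFm}^{+}(\Xi) \cup \mathrm{SubFm}^{-}(B)
\]

Elements of $\mathrm{SubFm}^{+}(A)$ are called positive subformulae of $A$, and elements of $\mathrm{SubFm}^{-}(A)$ are negative ones (similarly for $\Xi \to B$). Cut-free proofs enjoy the polarized subformula property: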


Lemma 1.2. For any sequent $\Xi' \to B'$ in a cut-free derivation of the goal sequent $\Xi \to B$ we have $\mathrm{SubFm}^{+}(\Xi' \to B') \subseteq \mathrm{SubFm}^{+}(\Xi \to B)$ and $\mathrm{SubFm}^{-}(\Xi' \to B') \subseteq \mathrm{SubFm}^{-}(\Xi \to B)$.

Proof. Obvious from the form of inference rules: all formulae which appear in premises are subformulae of the conclusion, with the same polarities.

Definition 1.5. A sequent $\Xi \to B$ obeys the bracket non-negative condition (BNNC), if for any $!F \in \mathrm{SubFm}^{-}(\Xi \to B)$ and for any $F \in \zeta$, where $\zeta$ is one of the stoups inside $\Xi$, the set $\mathrm{SubFm}^{+}(F)$ does not include formulae of the form $[]^{-1}A$ and the set $\mathrm{SubFm}^{-}(F)$ does not include formulae of the form $\langle\rangle A$.

In other words, the BNNC means that negative occurrences of $!$-formulae, which can undergo $!C$, cannot include bracket modalities, the rules for which remove brackets (i.e., $[]^{-1}$ introduced by $[]^{-1}L$ and $\langle\rangle$ introduced by $\langle\rangle R$). This allows controlling the number of contractions (applications of $!C$) by counting brackets and bracket modalities, see Lemma 1.3 below.

The BNPC is a dual condition. Under this condition, if a bracket modality got introduced by a rule which introduces a pair of brackets (i.e., $[]^{-1}R$ for $[]^{-1}$ and $\langle\rangle L$ for $\langle\rangle$), then later on such a modality is not allowed to undergo $!C$.

Definition 1.6. A sequent $\Xi \to B$ obeys the bracket non-positive condition (BNPC), if for any $!F \in \mathrm{SubFm}^{-}(\Xi \to B)$ and for any $F \in \zeta$, where $\zeta$ is one of the stoups inside $\Xi$, the set $\mathrm{SubFm}^{-}(F)$ does not include formulae of the form $[]^{-1}A$ and the set $\mathrm{SubFm}^{+}(F)$ does not include formulae of the form $\langle\rangle A$.

Notice that the BNNC and the BNPC (respectively) are exactly the conditions on formulae under $!$ which are violated in the undecidability proofs in [16].
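Both conditions are purely syntactic, so they can be checked by a direct recursion over formulae. The following Python sketch is only an illustration under stated assumptions: the tuple encoding of formulae, the flattened view of a sequent (a list of tree-term formulae, a list of stoup formulae, and a succedent), and all function names are hypothetical and do not come from CatLog or from the paper.

```python
# Hypothetical encoding of formulae as nested tuples (not from the paper):
#   ("atom", "p"), ("one",), ("under", A, B) for A\B, ("over", B, A) for B/A,
#   ("prod", A, B), ("and", A, B), ("or", A, B),
#   ("box", A) for []^{-1}A, ("dia", A) for <>A, ("bang", A) for !A.

def sub_pos(f):
    """Positive subformulae of f (joint recursion with sub_neg)."""
    tag = f[0]
    if tag in ("atom", "one"):
        return frozenset({f})
    if tag == "under":                      # f = A\B
        _, a, b = f
        return sub_neg(a) | sub_pos(b) | {f}
    if tag == "over":                       # f = B/A
        _, b, a = f
        return sub_neg(a) | sub_pos(b) | {f}
    if tag in ("prod", "and", "or"):
        _, a, b = f
        return sub_pos(a) | sub_pos(b) | {f}
    if tag in ("box", "dia", "bang"):
        return sub_pos(f[1]) | {f}
    raise ValueError(f"unknown connective: {tag}")

def sub_neg(f):
    """Negative subformulae of f."""
    tag = f[0]
    if tag in ("atom", "one"):
        return frozenset()
    if tag == "under":
        _, a, b = f
        return sub_pos(a) | sub_neg(b)
    if tag == "over":
        _, b, a = f
        return sub_pos(a) | sub_neg(b)
    if tag in ("prod", "and", "or"):
        _, a, b = f
        return sub_neg(a) | sub_neg(b)
    if tag in ("box", "dia", "bang"):
        return sub_neg(f[1])
    raise ValueError(f"unknown connective: {tag}")

def controlled(antecedent, stoups, succedent):
    """Formulae F such that !F is negative in the sequent, plus stoup members."""
    left = list(antecedent) + list(stoups)
    negative = frozenset().union(*(sub_pos(a) for a in left)) | sub_neg(succedent)
    return {g[1] for g in negative if g[0] == "bang"} | set(stoups)

def obeys(condition, antecedent, stoups, succedent):
    """Check 'BNNC' or 'BNPC' (Definitions 1.5 and 1.6) for a flattened sequent."""
    box_side, dia_side = (sub_pos, sub_neg) if condition == "BNNC" else (sub_neg, sub_pos)
    return all(
        all(g[0] != "box" for g in box_side(f))
        and all(g[0] != "dia" for g in dia_side(f))
        for f in controlled(antecedent, stoups, succedent)
    )

# Example: ![]^{-1}!(x/y), a  ->  (![]^{-1}!(x/y)) . s
x, y, s, a = (("atom", c) for c in "xysa")
guarded = ("bang", ("box", ("bang", ("over", x, y))))
bnpc_ok = obeys("BNPC", [guarded, a], [], ("prod", guarded, s))
bnnc_ok = obeys("BNNC", [guarded, a], [], ("prod", guarded, s))
print(bnpc_ok, bnnc_ok)
```

Brackets play no role in the check, which is why the sketch takes the antecedent as a flat list.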

4 Decidable Multiplicative Fragments

We start with “purely multiplicative” fragments of the calculi in question, i.e., fragments without additive connectives, $\wedge$ and $\vee$. Since our calculi are cut-free, these fragments are axiomatized simply by taking the rules $\wedge L$, $\wedge R$, $\vee L$, and $\vee R$ away from the corresponding full systems. Here the corresponding conditions on brackets yield decidability.

Theorem 1.1. The derivability problem in $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, for sequents without $\wedge$ and $\vee$ and obeying the BNNC, is decidable and belongs to the NP class. The same holds for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, with BNPC instead of BNNC.

Notice that the corresponding lower bound, NP-hardness, is due to Pentus [32], who proved NP-hardness of the Lambek calculus itself, without brackets and subexponentials.

Theorem 1.1 immediately follows from the following lemma which establishes a polynomial upper bound on the derivation size. Indeed, such a polynomial size
derivation serves as the necessary NP witness for derivability. (In other words, a non-deterministic algorithm can guess the derivation and then check that it is correct, and this is all done in polynomial time.)

Lemma 1.3. If a sequent without $\wedge$ and $\vee$ is derivable in $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ and obeys the BNNC, then its cut-free derivation is of polynomial size w.r.t. the size of the sequent. The same holds for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, with BNPC instead of BNNC.

Proof. Let $n$ be the size of the sequent in question, measured as the total number of symbols (including brackets). Let us estimate the number of rule applications in the derivation.

1. Contraction rule ($!C$). The key consideration in the proof of the lemma is the upper bound on the number of contractions, i.e., applications of $!C$. Let us denote this number by $\#!C$. Also let $\#B^{+}$ be the number of applications of $[]^{-1}L$ and $\langle\rangle R$ (each of these rules introduces a pair of brackets) and let $\#B^{-}$ be the number of applications of $\langle\rangle L$ and $[]^{-1}R$ (these rules erase brackets). Finally, let $\#[]$ be the number of pairs of brackets in the goal sequent.

In the case of $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, each application of $!C$ erases a pair of brackets. Therefore, $\#[] = \#B^{+} - \#B^{-} - \#!C$, whence

\[
\#!C = \#B^{+} - \#B^{-} - \#[] \le \#B^{+}.
\]

Now we recall that our goal sequent obeys the BNNC. Therefore, no formula of the form $[]^{-1}A$ introduced by $[]^{-1}L$ (i.e., in the negative polarity) gets included into a formula of the form $!F$ in the antecedent (i.e., in negative polarity) or a formula $F$ in a stoup. The same holds for formulae of the form $\langle\rangle A$ introduced by $\langle\rangle R$. Thus, each rule application counted in $\#B^{+}$ is connected to a unique occurrence of the corresponding modality in the goal sequent. This yields $\#B^{+} \le n$, whence $\#!C \le n$.

For $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, the argument is similar. Now each application of $!C$ introduces a new pair of brackets (making a single-bracketed island a double-bracketed one). Therefore, $\#[] = \#B^{+} - \#B^{-} + \#!C$, whence

\[
\#!C = \#[] + \#B^{-} - \#B^{+} \le \#[] + \#B^{-}.
\]

Dually, the BNPC guarantees that each rule application counted in $\#B^{-}$ corresponds to a unique occurrence of a modality in the goal sequent. These occurrences are negative ones for $\langle\rangle$ and positive ones for $[]^{-1}$. This yields $\#B^{-} \le n$. Hence, $\#!C \le \#[] + \#B^{-} \le 2n$.

Thus, in both cases we have $\#!C \le 2n$.

2. Logical rules. All rules, except $!C$ and $!P$, are logical rules (recall that all our derivations are cut-free). Each logical rule introduces exactly one new
occurrence of a connective or a modality. Such occurrences either trace down to the goal sequent or get contracted (i.e., merged with another occurrence) by $!C$.

In the case of $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ or $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, counting logical rules is simple. Indeed, each application of $!C$ contracts a formula $A$ (in the stoup), which is a subformula of the goal sequent. Thus, the number of logical rules introducing connectives or modalities which get contracted is bounded by $\#!C \cdot n \le 2n^2$. The number of logical rules introducing connectives or modalities which trace down to the goal sequent is less than or equal to $n$. Thus, the total number of logical rule applications in the derivation is less than or equal to $2n^2 + n$, which is polynomial.

The case of $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ is more involved, since in this system $!C$ may contract an arbitrary part of the stoup, $\zeta_2$. However, we still establish the needed upper bound by proving the following statement: let $\xi$ be a stoup in a $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ derivation of a sequent obeying the BNNC (in a sequent of the form $\Theta(\xi; \Pi) \to E$); then each formula in $\xi$ traces down to a distinct subformula occurrence in the goal sequent. In other words, two formulae in the same stoup could not get contracted (merged) by $!C$.

Let us prove this statement. Suppose the contrary. Let a formula $F$ appear two times as a subformula in a stoup $\xi$, and these two occurrences get contracted below. This means that $F$ is a subformula of a formula $G$ which belongs to $\zeta_2$ in the formulation of the $!C$ rule:

\[
\frac{\Xi(\zeta_1, \zeta_2; \Gamma_1, [\zeta_2, \zeta_3; \Gamma_2], \Gamma_3) \to C}{\Xi(\zeta_1, \zeta_2, \zeta_3; \Gamma_1, \Gamma_2, \Gamma_3) \to C}\ {!C}
\]

The copies of $G$, however, are separated by brackets embracing $\zeta_2, \zeta_3; \Gamma_2$. This means that, when going upwards from the $!C$ application to $\Theta(\xi; \Pi) \to E$, this pair of brackets should be destroyed at some point. This could be performed using $[]^{-1}L$ or $\langle\rangle R$.

In the first case, our formula $F$ should be a subformula of $[]^{-1}A$ (since it is the only formula inside the brackets), and the latter is a subformula of $G$. Moreover, the polarity is positive in both cases. This violates the BNNC, since $G$ is a member of a stoup. In the second case, $F$ should be a negative subformula of $\langle\rangle A$ (which is the only formula outside the brackets), and the latter is a negative subformula of $G$. Again, the BNNC gets violated. Contradiction.

The statement entails that for each application of $!C$ the total number of connective and modality occurrences in $\zeta_2$ is bounded by $n$, which is the length of the goal sequent. Now the same argument as for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$ shows that the total number of logical rule applications is less than or equal to $n^2 + n$ (here $\#!C \le n$).

3. Permutation rule ($!P$). The reasoning here is similar to the previous case. Each application of $!P$ introduces a formula into the stoup, and such a formula either traces down to a designated subformula occurrence in the goal sequent or gets contracted by $!C$. This gives the same upper bound: $2n^2 + n$ for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, and $n^2 + n$ for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$.

Summing up. In a cut-free derivation, each sequent is either the goal one or a premise of one of the inference rules considered above. Each logical rule has at most two premises; $!P$ and $!C$ have one. Thus, for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$ the total number of sequents in the derivation is less than or equal to $1 + 2n + 2(2n^2 + n) + (2n^2 + n) = 6n^2 + 5n + 1$, which is polynomial. For $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, the upper bound is even a bit smaller: $1 + n + 2(n^2 + n) + n^2 + n = 3n^2 + 4n + 1$.
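The final arithmetic can be double-checked mechanically. The short Python sketch below simply adds up the four contributions named in the proof (the goal sequent, the premises of $!C$, of logical rules, and of $!P$) and compares the totals with the closed forms; the function names are illustrative only.

```python
def sequent_bound_2018(n: int) -> int:
    """Goal sequent + !C premises + logical-rule premises + !P premises (2018 systems)."""
    contractions = 2 * n            # #!C <= 2n
    logical = 2 * n * n + n         # logical rule applications, at most two premises each
    permutations = 2 * n * n + n    # !P applications, one premise each
    return 1 + contractions + 2 * logical + permutations

def sequent_bound_2015(n: int) -> int:
    """The same count for the 2015 system, where #!C <= n."""
    contractions = n
    logical = n * n + n
    permutations = n * n + n
    return 1 + contractions + 2 * logical + permutations

# Compare with the closed forms stated in the text for a range of sizes.
closed_forms_agree = all(
    sequent_bound_2018(n) == 6 * n * n + 5 * n + 1
    and sequent_bound_2015(n) == 3 * n * n + 4 * n + 1
    for n in range(200)
)
print(closed_forms_agree)
```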

5 Decidable Fragments with Additives

Now let us consider the full systems, with additive operations. Here the corresponding bracket conditions also yield decidability, but the complexity is higher.

Theorem 1.2. The derivability problem in $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, for sequents obeying the BNNC, is decidable and belongs to the PSPACE class. The same holds for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, with BNPC instead of BNNC.

Notice that the upper bound here is again tight: the corresponding lower bound, PSPACE-hardness of the Lambek calculus with additive operations, was shown by Kanovich [11] and Kanazawa [10]. Moreover, the minimalistic fragment with only two connectives, $\backslash$ and $\wedge$, is already PSPACE-hard [14].

In the presence of additive operations, there is no hope to obtain a global polynomial upper bound on the size of the derivation (like in Lemma 1.3). The reason is that the $\vee L$ and $\wedge R$ rules copy big parts of the sequent to both premises, and this can make the derivation exponentially large. However, we shall establish a “local” upper bound, namely, prove that the length of each path in the derivation tree, from the goal sequent to an axiom, is polynomial. In other words, we shall show that our derivations have polynomial height. (As usual, the height of a tree is the length of the longest path from the root to a leaf.)

Lemma 1.4. If a sequent is derivable in $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ and obeys the BNNC, then its cut-free derivation has polynomial height w.r.t. the size of the sequent. The same holds for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, with BNPC instead of BNNC.

Proof. Let $\delta$ be a path in the derivation tree from the root (goal sequent) to an axiom leaf. We shall estimate the number of rule applications on $\delta$ in terms of $n$, the size of the goal sequent. Let us relativize the parameters used in Lemma 1.3 to the path $\delta$. Namely, $\#_\delta !C$, $\#_\delta B^{+}$, and $\#_\delta B^{-}$ are, respectively, the numbers of contractions (applications of $!C$), introductions of brackets (applications of $[]^{-1}L$ and $\langle\rangle R$), and removals of brackets ($\langle\rangle L$ and $[]^{-1}R$), along $\delta$. As in Lemma 1.3, $\#[]$ denotes the total number of pairs of brackets in the goal sequent.


1. Contraction rule ($!C$). The case of $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$ here is easier. Each application of $!C$ on $\delta$ adds a pair of brackets which is either removed by an application of $[]^{-1}R$ or $\langle\rangle L$ below or traces down to the goal sequent. This gives

\[
\#_\delta !C \le \#_\delta B^{-} + \#[].
\]

Due to the BNPC, we have $\#_\delta B^{-} \le n$, since each rule application counted in $\#_\delta B^{-}$ traces down to a separate modality occurrence in the goal sequent. Therefore, $\#_\delta !C \le 2n$.

The situation with $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ is a bit trickier. One cannot just claim $\#_\delta !C \le \#_\delta B^{+}$, since a pair of brackets erased by an application of $!C$ could have been introduced in another branch of the derivation, outside $\delta$.

This issue is resolved in the following way. Let us take the whole derivation tree. For each application of $\vee L$ or $\wedge R$ let us remove one of its premises and the whole subtree above it, keeping our path $\delta$ intact. (If $\delta$ traverses an application of $\vee L$ or $\wedge R$, we remove the premise which is not on $\delta$; otherwise, the choice is arbitrary.) The resulting subtree (which is not required to be a valid derivation tree, of course) will be denoted by $D$.

By $\#_D B^{+}$ let us denote the number of $[]^{-1}L$ and $\langle\rangle R$ applications inside $D$. On the one hand, each pair of brackets erased by an application of $!C$ on path $\delta$ was introduced by such a rule application. Indeed, each application of $\vee L$ or $\wedge R$ copies all the brackets from the conclusion to both premises; therefore, the brackets cannot escape from $D$. Therefore,

\[
\#_\delta !C \le \#_D B^{+}.
\]

On the other hand, under the BNNC no negative occurrence of $[]^{-1}$ (introduced by $[]^{-1}L$) and no positive occurrence of $\langle\rangle$ (introduced by $\langle\rangle R$) can be contracted by $!C$. Moreover, inside $D$ two such occurrences cannot be identified by $\vee L$ or $\wedge R$. Therefore, each connective introduced by $[]^{-1}L$ or $\langle\rangle R$ in $D$ traces down to a separate modality occurrence in the goal sequent. Therefore, $\#_D B^{+} \le n$.

Thus, in both cases the number of $!C$ applications on $\delta$ is linearly bounded.

2. Logical rules and permutation rule. Here the argument is the same as in the proof of Lemma 1.3, but relativized to $\delta$. For $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, we again show that two formulae in the same stoup could not get contracted. Therefore, in all three systems the total number of connective and modality occurrences contracted by one application of $!C$ is bounded by $n$.

Now, each connective occurrence introduced on the path $\delta$ either gets contracted by $!C$ or traces down to the goal sequent. This gives an upper bound on the number of logical rule applications on $\delta$: $n^2 + n$ for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $2n^2 + n$ for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$. The same upper bounds hold for the number of permutations (applications of $!P$).

Summing up our upper bounds gives the same polynomial estimates on the number of rule applications on $\delta$ (i.e., the length of $\delta$), as Lemma 1.3 gives for
the whole derivation tree: $3n^2 + 4n + 1$ for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $6n^2 + 5n + 1$ for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$.

Constructing a PSPACE decision algorithm for a calculus with a polynomial bound on derivation heights is quite a standard task. In a similar situation, for the multiplicative-additive fragment of linear logic, Lincoln et al. [19] use alternating Turing machines [7] in order to prove the PSPACE upper bound. In this article, we directly construct a non-deterministic depth-first search algorithm which works in polynomially bounded memory space.

Proof (of Theorem 1.2). Lemma 1.4 guarantees that any cut-free derivation in each of the three calculi has a polynomially bounded height. Let us construct a non-deterministic algorithm guessing such a derivation in the following way.

The algorithm starts from the goal sequent (i.e., the root) and then tries to build a correct derivation tree in the depth-first manner. In the memory, the algorithm keeps a stack (a ‘last-in-first-out’ structure) of sequents, proof search for which is postponed, and one ‘active’ sequent which is being considered right now. For each sequent (both the active one and those in the stack) the algorithm also keeps the length of the path from this sequent to the goal one. In the beginning, the stack is empty and the active sequent is the goal one.

At each step, the algorithm performs a non-deterministic guess which inference rule to apply in order to derive the active sequent. Recall that each rule has at most two premises. If it has one premise, then the active sequent gets replaced by this premise, increasing the length parameter by 1. If there are two premises, then the left one becomes active, while the right one is put onto the stack. (Proof search for the right premise is postponed to the future.)

At some point, the algorithm will either:

1. exceed the fixed polynomial bound on derivation height;
2. not be able to apply any inference rule;
3. reach an axiom instance.

In the first and second cases the algorithm returns the answer “no” and terminates. (This does not mean that the sequent is not derivable; possibly this particular series of non-deterministic guesses just followed a wrong derivation strategy.)

In the successful third case, the algorithm checks the stack. If the stack is empty, the algorithm terminates returning “yes.” This indeed means that the sequent is derivable, and the algorithm has constructed a derivation (though the complete derivation tree was never kept in the memory). If the stack is not empty, the algorithm pops the topmost sequent from the stack, makes it active and recursively applies proof search to this sequent.

Intuitively, the stack represents sequents that ‘sit’ on the right-hand side of the branching points along a path in the derivation tree. The algorithm applies depth-first search, and actually tries all the paths from the goal to an axiom leaf, from left to right. One can easily see that the existence of a correct derivation tree is equivalent to the existence of ‘correct’ non-deterministic guesses, after
which the algorithm returns “yes.” Therefore, we have indeed constructed a non-deterministic algorithm solving the derivability problem in one of our calculi.

Moreover, at each time of execution the stack includes sequents located at different heights in the tree which is supposed to be a derivation. Since the height of this tree is polynomially bounded, we get a polynomial bound on the amount of memory used (the size of each sequent is also polynomially bounded). Thus, the derivability problem belongs to the NPSPACE class. Finally, by Savitch’s theorem [33] we have NPSPACE = PSPACE, which finishes the proof.
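The control flow of this search (one active sequent, a stack of postponed right premises, and a depth counter per sequent) can be sketched in Python. This is a hedged illustration, not the authors' implementation: non-deterministic guessing is simulated by backtracking, so the sketch shows the bookkeeping but does not itself respect the polynomial space bound, and the calculus plugged in is a toy one (atoms with ∧R and ∨R only), chosen purely to make the example runnable. All names are hypothetical.

```python
def derivable(goal, is_axiom, rule_options, max_height):
    """Depth-first proof search with an explicit stack of postponed sequents.

    rule_options(sequent) returns a list of guesses; each guess is the list
    of premises (one or two) of some rule whose conclusion is the sequent.
    """
    def search(active, depth, stack):
        if depth > max_height:                 # case 1: height bound exceeded
            return False
        if is_axiom(active):                   # case 3: axiom instance reached
            if not stack:
                return True                    # nothing postponed: answer "yes"
            postponed, postponed_depth = stack[-1]
            return search(postponed, postponed_depth, stack[:-1])
        for premises in rule_options(active):  # try every possible guess
            if len(premises) == 1:
                if search(premises[0], depth + 1, stack):
                    return True
            else:
                left, right = premises         # postpone the right premise
                if search(left, depth + 1, stack + ((right, depth + 1),)):
                    return True
        return False                           # case 2: no guess succeeds

    return search(goal, 0, ())

# Toy calculus (an assumption for illustration only): a sequent is a pair
# (set of atoms, formula); formulae are atoms (str), ("and", A, B), ("or", A, B).
def toy_axiom(sequent):
    atoms, formula = sequent
    return isinstance(formula, str) and formula in atoms

def toy_rules(sequent):
    atoms, formula = sequent
    if isinstance(formula, tuple) and formula[0] == "and":
        return [[(atoms, formula[1]), (atoms, formula[2])]]    # two premises
    if isinstance(formula, tuple) and formula[0] == "or":
        return [[(atoms, formula[1])], [(atoms, formula[2])]]  # two guesses
    return []

atoms = frozenset({"p", "q"})
goal = (atoms, ("and", "p", ("or", "r", "q")))
found = derivable(goal, toy_axiom, toy_rules, max_height=2)
too_shallow = derivable(goal, toy_axiom, toy_rules, max_height=1)
underivable = derivable((atoms, ("and", "p", "r")), toy_axiom, toy_rules, max_height=5)
print(found, too_shallow, underivable)
```

In the toy run, the right conjunct is pushed onto the stack while the left one is checked, and it is popped only after the left branch reaches an axiom, mirroring the description above.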

6 Inducing Brackets

The algorithmic problem of parsing using categorial grammars is actually harder than proving sequents in the calculus these grammars are based on. First, a word of the language can have several syntactic types, so before proving the sequent the algorithm should determine, for each word, which of these types should be used. This is a minor issue, however, since our algorithms are non-deterministic, so we can just guess the correct type assignment.

But there is another issue, a more serious one. As one can see from examples in Sect. 2, the sequence of types should be properly bracketed before starting proof search. In real natural language data, however, there are no brackets. Therefore, ideally, the number and position of brackets should be guessed, or induced, by the algorithm, rather than requested from the user.

Without the subexponential modality, a parsing algorithm which automatically induces brackets was developed by Morrill et al. [29]. The key to decidability of bracket induction is the fact that, without the subexponential, the number of bracket pairs in the goal sequent is bounded by the number of bracket modalities (and the latter is fixed). In this section we show that, with respect to the possibility of effective bracket induction, the two Morrill’s systems with the subexponential behave differently.

Formally, the bracket induction problem is formulated as follows: given a sequent of the form $A_1, \ldots, A_n \to B$ (without brackets and stoups), is there a way to put brackets on the left-hand side, so that the resulting sequent would be derivable in the given calculus?

Let us start with the older system, $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$. For the multiplicative-only fragment, without $\wedge$ and $\vee$, we again get NP decidability.

Theorem 1.3. The bracket induction problem for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, for sequents without $\wedge$ and $\vee$ obeying the BNNC, is decidable and belongs to the NP class.

Proof. If the sequent in question becomes derivable after adding brackets, then, following the reasoning from the proof of Lemma 1.3, we get

\[
\#[] = \#B^{+} - \#B^{-} - \#!C \le \#B^{+} \le n.
\]

Here $n$ is the size of the original sequent without brackets. Indeed, $\#B^{+}$ is the number of applications of $[]^{-1}L$ and $\langle\rangle R$, and, thanks to the BNNC, each such
application traces down to an occurrence of the corresponding bracket modality, not a pair of brackets, in the goal sequent. Now, since the number of brackets is linearly bounded, the bracketing can be guessed by an NP algorithm in polynomial time, along with the derivation itself.

For the system including additive operations we predictably get PSPACE.

Theorem 1.4. The bracket induction problem for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, for sequents obeying the BNNC, is decidable and belongs to the PSPACE class.

Proof. Again, we establish an upper bound on $\#[]$, which allows non-deterministic guessing of the bracketing. Unlike the previous theorem, however, this upper bound cannot be directly extracted from the proof of Lemma 1.4. However, we can use a trick similar to the one used in the proof of Lemma 1.4.

Suppose the sequent becomes derivable after imposing a certain bracketing. In this derivation, for each application of $\vee L$ or $\wedge R$ let us remove one of its premises (e.g., the right one) and the whole subtree above it. Of course, the resulting tree $D$ is no longer necessarily a correct derivation. However, it is useful for counting brackets. Namely, now each subformula has a unique trace either to the goal sequent or to a contraction (application of $!C$), just as in the case without $\wedge$ and $\vee$. This yields the same estimation: $\#[] = \#_D B^{+} - \#_D B^{-} - \#_D !C \le \#_D B^{+} \le n$, where the $\#_D$ counts are taken in the modified “derivation” tree.

Now, using the linear bound on the number of brackets, we non-deterministically guess the bracketing, and then apply the non-deterministic polynomial space algorithm from Theorem 1.2. This yields NPSPACE complexity, which is the same as PSPACE by Savitch’s theorem.

For the newer system, $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$, the situation is opposite.

Theorem 1.5. The bracket induction problem for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ for sequents obeying the BNPC is undecidable.

Proof. The general line of the proof is standard. Its ideas go back to Lincoln et al. [19]. We encode a well-known undecidable problem, derivability in type-0 grammars (which are closely related to semi-Thue systems).

For our purposes we do not need non-terminal symbols in type-0 grammars. Thus, a type-0 grammar over alphabet $\Sigma$ is a triple $G = (\Sigma, P, s)$, where $s \in \Sigma$ is the starting symbol and $P$ is a finite set of rewriting rules of the form $x_1 \ldots x_k \Rightarrow y_1 \ldots y_m$, where $x_i, y_j \in \Sigma$, $k \ge 1$, $m \ge 0$. A derivation in $G$ is a sequence of words starting with $s$, such that each next word is obtained from the previous one by applying a rewriting rule (i.e., replacing a subword $x_1 \ldots x_k$ with $y_1 \ldots y_m$). A word $w$ is derivable if there exists a derivation with $w$ being its last word.

Given a type-0 grammar $G$, let us define the set $\mathcal{A}_G$ as follows:

\[
\mathcal{A}_G = \{(x_1 \cdot \ldots \cdot x_k)/(y_1 \cdot \ldots \cdot y_m) \mid x_1 \ldots x_k \Rightarrow y_1 \ldots y_m \text{ is a rewriting rule of } G\}.
\]

The elements of $\mathcal{A}_G$ will be denoted by $A_1, \ldots, A_N$. Each $A_i$ encodes one of the rewriting rules.
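To make the encoding concrete, here is a small Python sketch of such a grammar (rules as pairs of symbol tuples), of the string form of the formulae in the set defined above, and of a bounded search for derivations. Everything in it is an illustrative assumption: the names are hypothetical, an empty right-hand side is printed as the unit `1` (the text does not fix a notation for that case), and the search is cut off after `max_steps` rewriting steps because derivability in type-0 grammars is undecidable in general.

```python
def encode_rule(lhs, rhs):
    """String form of (x1 . ... . xk)/(y1 . ... . ym) for the rule lhs => rhs."""
    def product(symbols):
        # Assumption: an empty product (m = 0) is written as the unit "1".
        return " · ".join(symbols) if symbols else "1"
    return f"({product(lhs)})/({product(rhs)})"

def derivable_within(rules, start, target, max_steps):
    """Breadth-first search for a derivation of `target` from `start`
    that uses at most `max_steps` rewriting steps."""
    target = tuple(target)
    frontier = {(start,)}
    seen = set(frontier)
    for _ in range(max_steps + 1):
        if target in frontier:
            return True
        next_frontier = set()
        for word in frontier:
            for lhs, rhs in rules:
                k = len(lhs)
                for i in range(len(word) - k + 1):
                    if word[i:i + k] == lhs:           # rewrite one occurrence
                        new_word = word[:i] + rhs + word[i + k:]
                        if new_word not in seen:
                            seen.add(new_word)
                            next_frontier.add(new_word)
        frontier = next_frontier
    return False

# A toy grammar: s => a s b and s => (empty word).
RULES = [(("s",), ("a", "s", "b")), (("s",), ())]
ENCODED = [encode_rule(lhs, rhs) for lhs, rhs in RULES]
in_three = derivable_within(RULES, "s", ("a", "a", "b", "b"), 3)
in_two = derivable_within(RULES, "s", ("a", "a", "b", "b"), 2)
print(ENCODED, in_three, in_two)
```

The word "a a b b" needs three rewriting steps here, which corresponds to the number r of bracketed empty islands used in the reduction.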


We shall prove that a word $a_1 \ldots a_n$ is derivable from $s$ in $G$ if and only if one can put brackets on the left-hand side of the following sequent so that it becomes derivable:

\[
![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s.
\]

This gives a computable reduction of derivability in $G$ to the bracket induction problem for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ under the BNPC (since the sequent in question obeys the BNPC), which establishes undecidability of the latter.

Let us start with the “only if” direction: from derivation in $G$ to inducing brackets. Suppose that $a_1 \ldots a_n$ is derived in $G$ from $s$ in $r$ steps. Then we put brackets as follows:

\[
![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, \underbrace{[[\Lambda]], \ldots, [[\Lambda]]}_{r \text{ times}}, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s.
\]

Notice that here we put brackets over empty parts of the sequent (in other words, make them strong islands). Suppose that the rewriting rules used in the derivation $s \Rightarrow \ldots \Rightarrow a_1 \ldots a_n$ are rules with numbers $i_1, \ldots, i_r$. Then we derive our sequent in the following way. Using $!C$, we put copies of the corresponding $A_{i_j}$’s into the islands $[[\Lambda]]$, one into each island. The islands become single-bracketed: $[\,[]^{-1}!A_{i_j}; \Lambda\,]$. Then we use $[]^{-1}$’s to remove the remaining pairs of brackets and remove the leading $![]^{-1}!A_1, \ldots, ![]^{-1}!A_N$ using $\cdot R$. The corresponding derivation is presented on Fig. 2(a).

At the top of this derivation there is the sequent $A_{i_1}, \ldots, A_{i_r}; a_1, \ldots, a_n \to s$. We show its derivability by induction on $r$. If $r = 0$, then it is just $s \to s$. For the induction step, consider the last, $r$-th, rewriting rule applied in the derivation:

\[
s \Rightarrow \ldots \Rightarrow a_1 \ldots x_1 \ldots x_k \ldots a_n \Rightarrow a_1 \ldots y_1 \ldots y_m \ldots a_n.
\]

This rewriting is simulated using $A_{i_r} = (x_1 \cdot \ldots \cdot x_k)/(y_1 \cdot \ldots \cdot y_m)$ from the stoup, as shown on Fig. 2(b). The topmost sequent, $A_{i_1}, \ldots, A_{i_{r-1}}; a_1, \ldots, x_1, \ldots, x_k, \ldots, a_n \to s$, is derivable by the induction hypothesis.

The “if” part, from inducing brackets to rewriting in $G$, is performed in a rather standard way, using the bracket-forgetting projection [16]. Let us consider $!\mathbf{L}^{*}$, the Lambek calculus (see Introduction) without Lambek’s non-emptiness restriction, extended with a full-power exponential modality $!$. The exponential modality is governed by the following rules:

\[
\frac{\Gamma_1, A, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!L}
\qquad
\frac{!A_1, \ldots, !A_n \to B}{!A_1, \ldots, !A_n \to {!B}}\ {!R}
\qquad
\frac{\Gamma_1, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!W}
\]

\[
\frac{\Gamma_1, \Phi, !A, \Gamma_2 \to C}{\Gamma_1, !A, \Phi, \Gamma_2 \to C}\ {!P_1}
\qquad
\frac{\Gamma_1, !A, \Phi, \Gamma_2 \to C}{\Gamma_1, \Phi, !A, \Gamma_2 \to C}\ {!P_2}
\qquad
\frac{\Gamma_1, !A, !A, \Gamma_2 \to C}{\Gamma_1, !A, \Gamma_2 \to C}\ {!C}
\]

Notice that here left-hand sides of sequents are just sequences of formulae; there are no stoups or brackets.


Fig. 2. Derivations for simulating rewritings in a type-0 grammar via bracket induction in $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$

One can easily see that if one takes a sequent derivable in $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$, erases all brackets and bracket modalities, and translates meta-formulae of the form $\zeta; \Gamma$, where $\zeta = \{A_1, \ldots, A_n\}$, as $!A_1, \ldots, !A_n, \Gamma$, then the resulting sequent will be derivable in $!\mathbf{L}^{*}$. (The opposite does not hold: bracketing prevents some of the derivations.) This translation is called the bracket-forgetting projection (BFP).

Now let us suppose that our sequent,

\[
![]^{-1}!A_1, \ldots, ![]^{-1}!A_N, a_1, \ldots, a_n \to (![]^{-1}!A_1) \cdot \ldots \cdot (![]^{-1}!A_N) \cdot s
\]

becomes derivable after putting some brackets on it. Independently of the bracketing imposed, the BFP gives the following sequent:

\[
!!A_1, \ldots, !!A_N, a_1, \ldots, a_n \to (!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s,
\]
which is derivable in $!\mathbf{L}^{*}$. Now let us consider the following derivable sequents: $!A_i \to {!!A_i}$ and $(!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s \to s$. The first one is derived in $!\mathbf{L}^{*}$ by one application of $!R$, and the derivation of the second one is as follows:

\[
\dfrac{\dfrac{s \to s}{!!A_1, \ldots, !!A_N, s \to s}\ {!W},\ N \text{ times}}{(!!A_1) \cdot \ldots \cdot (!!A_N) \cdot s \to s}\ {\cdot L},\ N \text{ times}
\]

Using cut, we derive $!A_1, \ldots, !A_N, a_1, \ldots, a_n \to s$. (For cut elimination in $!\mathbf{L}^{*}$, see [12].)

Now we perform the standard backwards translation, from derivations in non-commutative linear logic to computations (in our case, in a type-0 grammar), which goes back to Lincoln et al. [19]. A detailed proof can be found, e.g., in [16], Lemma 1, implication 4 ⇒ 1. Using this translation, we conclude that $a_1 \ldots a_n$ is derivable in $G$ from $s$.

7 Conclusion and Future Work

In this paper, we have shown that the systems with brackets and a subexponential proposed by Morrill as basic calculi for the CatLog natural language parser, while being undecidable in general, enjoy natural decidable fragments. These fragments are designated by syntactic restrictions called the bracket non-negative/non-positive conditions (BNNC/BNPC). Moreover, algorithmic complexity of these fragments is the same as for the systems without brackets and the subexponential. Namely, with additive operations we get PSPACE and without them we get NP.

As noticed by one of the reviewers, these complexity results could be easily extended to discontinuous operations used in Morrill’s systems along with standard Lambek ones. Full Morrill’s systems, however, include other sources of undecidability (besides the contraction rule for $!$). One such source is the Kleene star, which Morrill calls ‘existential exponential.’ The Kleene star is governed by an omega-rule [28]; thus, the system includes infinitary action logic, which is known to be $\Pi^0_1$-complete [5]. Another potential source of undecidability is the presence of quantifiers. The development of appropriate syntactic restrictions on these connectives in order to restore decidability is still an open problem.

Another observation made by one of the reviewers is that our decidability results also entail the finite reading property for $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$ under the BNNC and for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ and $!^{2018}_b\mathrm{MALC}(\mathrm{st})$ under the BNPC. The finite reading property means that for a sentence with brackets imposed there could exist only a finite number of different derivations. Indeed, even in the broader systems with additives we have managed to prove a polynomial upper bound
on the height of the derivation tree (Lemma 1.4). This yields a finite, though exponential, bound on the size of the derivation and, thus, a double-exponential bound on the number of possible derivations. The choice of types for each word is also finite, since so is the lexicon.

For the more complicated, but at the same time more practically interesting algorithmic problem of inducing brackets, the situation is as follows. For the ‘older’ Morrill’s system $!^{2015}_b\mathrm{MALC}^{*}(\mathrm{st})$, complexity of the bracket induction problem, under the BNNC, is the same as for the derivability problem. For the ‘newer’ system, $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$, unfortunately, the bracket induction problem is undecidable even under the BNPC. The undecidability construction, however, crucially depends on empty bracketed islands, i.e., on the violation of Lambek’s non-emptiness restriction. We conjecture that for the system with Lambek’s restriction, $!^{2018}_b\mathrm{MALC}(\mathrm{st})$, with the BNPC imposed, the bracket induction problem is decidable. Complexity of this problem is left as an open question. Another open question is whether the bracket induction problem for $!^{2018}_b\mathrm{MALC}^{*}(\mathrm{st})$ (without Lambek’s restriction) becomes decidable if we impose both the BNPC and the BNNC (i.e., disallow any bracket modalities in the scope of the subexponential and in stoups).

Acknowledgement. We are grateful to Glyn Morrill for a number of very helpful interactions we benefited from at various stages of our work. We would also like to thank the reviewers for their efforts. The work of Max Kanovich was partially supported by EPSRC Programme Grant EP/R006865/1: “Interface Reasoning for Interacting Systems (IRIS).” The part by Stepan L. Kuznetsov was prepared within the framework of the Academic Fund Program at HSE University in 2021–2022 (grant № 21-04-027). The work of Stepan L. Kuznetsov and the early part of the work of Andre Scedrov (until July 2020) was performed within the framework of the HSE University Basic Research Program. The work of Stepan L. Kuznetsov was also partially supported by the Council of the President of Russia for Support of Young Russian Researchers and Leading Research Schools of the Russian Federation (grant MK-1184.2021.1.1) and by the Russian Foundation for Basic Research (grant № 20-01-00435).

References 1. Abrusci, V.M.: A comparison between Lambek syntactic calculus and intuitionistic linear logic. Zeitschrift f¨ ur mathematische Logik und Grundlagen der Mathematik 36, 11–15 (1990) 2. Ajdukiewicz, K.: Die syntaktische Konnexit¨ at. Stud. Philos. 1, 1–27 (1935) 3. Bar-Hillel, Y.: A quasi-arithmetical notation for syntactic description. Language 29(1), 47–58 (1953) 4. Buszkowski, W.: Type logics in grammar. In: Hendriks, V.F., Malinowski, J. (eds.) Trends in Logic: 50 Years of Studia Logica. TREN, vol. 21, pp. 337–382. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-017-3598-8 12 5. Buszkowski, W.: On action logic: equational theories of action algebras. J. Log. Comput. 17(1), 199–217 (2007). https://doi.org/10.1093/logcom/exl036 6. Carpenter, B.: Type-Logical Semantics. MIT Press, Cambridge (1997)

Decidable Fragments of Calculi Used in CatLog

23

7. Chandra, A.K., Kozen, D.C., Stockmeyer, L.J.: Alternation. J. ACM 28(1), 114– 133 (1981). https://doi.org/10.1145/322234.322243 8. Galatos, N., Jipsen, P., Kowalski, T., Ono, H.: Residuated Lattices: An Algebraic Glimpse on Substructural Logics. Studies in Logic and the Foundations of Mathematics, vol. 151. Elsevier, Amsterdam (2007) 9. Girard, J.Y.: Linear logic. Theor. Comput. Sci. 50(1), 1–101 (1987). https://doi. org/10.1016/0304-3975(87)90045-4 10. Kanazawa, M.: Lambek calculus: recognizing power and complexity. In: Gerbrandy, J., Marx, M., de Rijke, M., Venema, Y. (eds.) JFAK. Essays Dedicated to Johan van Benthem on the Occasion of His 50th Birthday. Vossiuspers, Amsterdam University Press (1999) 11. Kanovich, M.: Horn fragments of non-commutative logics with additives are PSPACE-complete. In: 1994 Annual Conference of the European Association for Computer Science Logic, Kazimierz, Poland (1994) 12. Kanovich, M., Kuznetsov, S., Nigam, V., Scedrov, A.: Subexponentials in noncommutative linear logic. Math. Struct. Comput. Sci. 29(8), 1217–1249 (2019). https://doi.org/10.1017/S0960129518000117 13. Kanovich, M., Kuznetsov, S., Scedrov, A.: Undecidability of the Lambek calculus with subexponential and bracket modalities. In: Klasing, R., Zeitoun, M. (eds.) FCT 2017. LNCS, vol. 10472, pp. 326–340. Springer, Heidelberg (2017). https:// doi.org/10.1007/978-3-662-55751-8 26 14. Kanovich, M., Kuznetsov, S., Scedrov, A.: The complexity of multiplicativeadditive Lambek calculus: 25 years later. In: Iemhoff, R., Moortgat, M., de Queiroz, R. (eds.) WoLLIC 2019. LNCS, vol. 11541, pp. 356–372. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-662-59533-6 22 15. Kanovich, M., Kuznetsov, S., Scedrov, A.: Reconciling Lambek’s restriction, cutelimination, and substitution in the presence of exponential modalities. J. Log. Comput. 30(1), 239–256 (2020). https://doi.org/10.1093/logcom/exaa010 16. 
Kanovich, M., Kuznetsov, S., Scedrov, A.: The multiplicative-additive Lambek calculus with subexponentials and bracket modalities. J. Log. Lang. Inf. 30, 31–88 (2021). https://doi.org/10.1007/s10849-020-09320-9 17. Lambek, J.: The mathematics of sentence structure. Am. Math. Monthly 65, 154– 170 (1958). https://doi.org/10.1080/00029890.1958.11989160 18. Lambek, J.: On the calculus of syntactic types. In: Jakobson, R. (ed.) Structure of Language and Its Mathematical Aspects, Proceedings of Symposia in Applied Mathematics, vol. 12, pp. 166–178. AMS, Providence (1961) 19. Lincoln, P., Mitchell, J., Scedrov, A., Shankar, N.: Decision problems for propositional linear logic. Ann. Pure Appl. Log. 56(1–3), 239–311 (1992). https://doi. org/10.1016/0168-0072(92)90075-B 20. Moortgat, M.: Multimodal linguistic inference. J. Log. Lang. Inf. 5(3–4), 349–385 (1996). https://doi.org/10.1007/BF00159344 21. Moot, R., Retor´e, C.: The Logic of Categorial Grammars: A Deductive Account of Natural Language Syntax and Semantics. LNCS, vol. 6850. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-31555-8 22. Morrill, G.: Categorial formalisation of relativisation: pied piping, islands, and extraction sites. Technical report LSI-92-23-R, Universitat Polit`ecnica de Catalunya (1992) 23. Morrill, G.: A categorial type logic. In: Casadio, C., Coecke, B., Moortgat, M., Scott, P. (eds.) Categories and Types in Logic, Language, and Physics. LNCS, vol. 8222, pp. 331–352. Springer, Heidelberg (2014). https://doi.org/10.1007/9783-642-54789-8 18

24

M. I. Kanovich et al.

24. Morrill, G.: Grammar logicised: relativisation. Linguist. Philos. 40(2), 119–163 (2017). https://doi.org/10.1007/s10988-016-9197-0 25. Morrill, G.: Parsing logical grammar: CatLog3. In: Loukanova, R., Liefke, K. (eds.) Proceedings of the Workshop on Logic and Algorithms in Computational Linguistics (LACompLing 2017), pp. 107–131. Stockholm University, Stockholm (2017) 26. Morrill, G.: The CatLog3 technical manual. Technical report, Universitat Polit`ecnica de Catalunya (2018). http://www.lsi.upc.edu/∼morrill/CatLog3/ CatLog3.pdf 27. Morrill, G.: A note on movement in logical grammar. J. Lang. Model. 6(2), 353–363 (2018). https://doi.org/10.15398/jlm.v6i2.233 28. Morrill, G.: Parsing/theorem-proving for logical grammar CatLog3. J. Log. Lang. Inf. 28(2), 183–216 (2019). https://doi.org/10.1007/s10849-018-09277-w 29. Morrill, G., Kuznetsov, S., Kanovich, M., Scedrov, A.: Bracket induction for Lambek calculus with bracket modalities. In: Foret, A., Kobele, G., Pogodalla, S. (eds.) FG 2018. LNCS, vol. 10950, pp. 84–101. Springer, Heidelberg (2018). https://doi. org/10.1007/978-3-662-57784-4 5 30. Morrill, G., Valent´ın, O.: Computation coverage of TLG: nonlinearity. In: Kanazawa, M., Moss, L., de Paiva, V. (eds.) Third Workshop on Natural Language and Computer Science, NLCS 2015. EPiC Series in Computing, vol. 32, pp. 51–63 (2015). https://doi.org/10.29007/96j5 31. Morrill, G.V.: Categorial Grammar: Logical Syntax, Semantics, and Processing. Oxford University Press, Oxford (2011) 32. Pentus, M.: Lambek calculus is NP-complete. Theor. Comput. Sci. 357(1–3), 186– 201 (2006). https://doi.org/10.1016/j.tcs.2006.03.018 33. Savitch, W.J.: Relationships between nondeterministic and deterministic tape complexities. J. Comput. Syst. Sci. 4(2), 177–192 (1970). https://doi.org/10.1016/ S0022-0000(70)80006-X

Interactive Theorem Proving for Logic and Information

Jørgen Villadsen¹(✉), Asta Halkjær From¹, Alexander Birch Jensen¹, and Anders Schlichtkrull²

¹ Technical University of Denmark, Kongens Lyngby, Denmark
{jovi,ahfrom,aleje}@dtu.dk
² Aalborg University Copenhagen, Copenhagen, Denmark
[email protected]

Abstract. Automated reasoning is the study of computer programs that can build proofs of theorems in a logic. Such programs can be either automatic theorem provers or interactive theorem provers. The latter are also called proof assistants because the user constructs the proofs with the help of the system. We focus on the Isabelle proof assistant. The system ensures that the proofs are correct, in contrast to pen-and-paper proofs which must be checked manually. We present applications to logical systems and models of information, in particular selected modal logics extending classical propositional logic. Epistemic logic allows intelligent systems to reason about the knowledge of agents. Public announcements can change the knowledge of the system and its agents. In order to account for this, epistemic logic can be extended to public announcement logic. An axiomatic system consists of axioms and rules of inference for deriving statements in a logic. Sound systems can only derive valid statements and complete systems can derive all valid statements. We describe formalizations of sound and complete axiomatic systems for epistemic logic and public announcement logic, thereby strengthening the foundations of automated reasoning for logic and information.

Keywords: Interactive theorem proving · Propositional logic · Epistemic logic · Public announcement logic · Isabelle/HOL proof assistant

1 Introduction

Automated reasoning technology has matured tremendously in recent decades. However, the main applications are found in verification of hardware and software systems as well as in many areas of mathematics. We present a series of applications to logical systems and models of information, in particular classical propositional logic and selected modal logics extending classical propositional logic. On the one hand, we interpret interactive theorem proving narrowly and focus on the Isabelle proof assistant [45]. On the other hand, we interpret logic and information broadly and consider three logics in the area: propositional logic, epistemic logic (EL) and public announcement logic (PAL).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 25–48, 2022. https://doi.org/10.1007/978-3-030-90138-7_2


Building up to formalizations of formulas, we start with a formalization of binary trees and a number of functions operating on these. Thereafter, we formalize a prover for propositional logic as a simple example to introduce the reader to the idea of formalizing logics in Isabelle. We use a so-called deep embedding of logics where formulas are essentially binary trees. By using a datatype for formulas we can prove soundness, completeness and termination of the prover.

Moving on, we formalize epistemic logic, a logic for reasoning about both the factual and higher-order knowledge of agents, and a deductive proof system that enables this reasoning from a few axioms and inference rules. Again we use the deep embedding approach and prove soundness and completeness.

Finally, we formalize public announcement logic with countably many agents. Public announcement logic extends epistemic logic with an operator for publicly announcing information. The formalization includes proofs of soundness and completeness for a variant of the well-known PA + DIST! + NEC! axiomatic system. The completeness proof builds on the one of epistemic logic by reducing formulas into that logic.

Our definitions are given in Isabelle’s precise language of higher-order logic and every step of our soundness and completeness proofs is mechanically checked. With formalizations of sound and complete axiomatic systems for epistemic logic and public announcement logic, we strengthen the foundations of automated reasoning for logic and information. The formalizations are available here: https://hol.compute.dtu.dk/ITPLI

The present paper extends our 3-page paper at the International Workshop on Logical Aspects in Multi-Agent Systems and Strategic Reasoning, which was not formally published and covered only the formalization of epistemic logic [20].

Summing up, in the present paper we focus on propositional logic, epistemic logic and public announcement logic.
As a supplement to pen-and-paper proofs of soundness and completeness, we describe the use of the powerful Isabelle proof assistant for interactive theorem proving.

Other logics have been formalized in Isabelle. We mention some of them here and leave the rest for our discussion of related work, together with results in proof assistants other than Isabelle.

• Michaelis and Nipkow formalize several proof systems for classical propositional logic [39, 40]. From, Eschen and Villadsen formalize a number of axiomatic systems for propositional logic [19]. In the present paper we consider modal logics going beyond classical propositional logic.
• From, Lund and Villadsen formalize a number of small provers for classical propositional logic [21, 71, 72]. In the present paper we use a similar prover as a motivational example.

We recommend the survey on the use of formalizations in computer science by Ringer et al. [59] and the state of the art in mathematics in the form of the official published account of the now completed Flyspeck project [25].


The paper is organized as follows: Sect. 2 introduces the reader to Isabelle/HOL and how to deeply embed logics. Section 3 explains our formalization of epistemic logic. Section 4 explains our formalization of public announcement logic. We discuss related work in Sect. 5 and conclude in Sect. 6.

2 Isabelle/HOL and Deep Embeddings of Logics

Isabelle is a generic proof assistant originally developed at the University of Cambridge and Technische Universität München [45]. The most used instance of Isabelle today is Isabelle/HOL, based on classical higher-order logic, and in the following we often use the name Isabelle to refer to Isabelle/HOL. In order to provide a gentle introduction to programming and proving in Isabelle, we start with a formalization of binary trees and a number of functions operating on these. We further prove a few interesting properties about these functions. In Isabelle/HOL, programming is not limited to the computable fragments of HOL. For instance, a function may return a boolean value that is the result of quantifying over all elements of a type, e.g. stating that all natural numbers are either odd or even. As such, the concept of programming in Isabelle/HOL goes beyond its usual meaning in the context of traditional programming languages like Haskell and Java. Finally, we briefly consider a formalization of a prover for propositional logic. This mainly serves the purpose of introducing the reader to formalizing logics in Isabelle using a deep embedding approach. In this approach, formulas are defined as a datatype which enables the definition of semantics, a proof system and a small prover as functions that operate on this datatype. In turn, we can prove termination, soundness and completeness of the prover.

2.1 Formally Verified Functional Programming

The following is a rather straightforward example of formally verified functional programming in Isabelle/HOL: a typical solution to an exercise in the Isabelle tutorial [44]. We start with a datatype of trees with labels at the nodes. The labels can be of any type, as specified by the type variable 'a, and so-called cartouches ‹…› delineate the three components of a Node:

  datatype 'a tree = Tip | Node ‹'a tree› ‹'a› ‹'a tree›

We may collect the contents of such trees into a set by writing a simple functional program:

  fun set :: ‹'a tree ⇒ 'a set› where
    ‹set Tip = {}›
  | ‹set (Node l a r) = set l ∪ {a} ∪ set r›

Note that we can use the usual set notation and operators in our definition. The type declaration can be omitted in which case it is inferred automatically.


We can then write a predicate on trees labelled by integers that checks if they are binary search trees:

  fun ord :: ‹int tree ⇒ bool› where
    ‹ord Tip = True›
  | ‹ord (Node l a r) = ((∀ i ∈ set l. i < a) ∧ ord l ∧ (∀ i ∈ set r. a < i) ∧ ord r)›

This checks if they are ordered such that for all nodes, every element in the left subtree is smaller than the element at the node while every element in the right subtree is larger. The following insertion function is supposed to preserve this order:

  fun ins :: ‹int ⇒ int tree ⇒ int tree› where
    ‹ins i Tip = Node Tip i Tip›
  | ‹ins i (Node l a r) =
      (if i < a then Node (ins i l) a r
       else if a < i then Node l a (ins i r)
       else Node l a r)›

In the ord function we have exploited universal quantification over a finite set, which is computable, but really this program could also be written in an ordinary functional language. This is a good thing as it helps build familiarity with the proof assistant. The next two lines take things a step further:

  theorem [simp]: ‹set (ins i t) = {i} ∪ set t›
    by (induct t) auto

There could potentially be a mistake in the ins function where certain elements were not inserted or other elements forgotten. Moreover, we might have to test a lot of inputs to uncover such a mistake. The theorem above, stated for all elements and all trees, rules out such errors. The proof works by induction on the tree and using Isabelle’s proof method auto to discharge the two resulting cases. With that result in hand we can also prove that ins preserves the binary search tree order:

  theorem ‹ord t ⟹ ord (ins i t)›
    by (induct t) simp_all
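To contrast testing with the machine-checked theorems above, the following is a minimal Python sketch that mirrors the tree functions and checks both properties on all small cases. It is an illustration only, not part of the formalization: the names `elements`, `ord_tree`, `build` and `check_small_cases` are hypothetical, `None` stands in for `Tip`, and a finite check of this kind establishes nothing beyond the inputs it enumerates.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Node:
    left: "Tree"
    label: int
    right: "Tree"


Tree = Optional[Node]  # None plays the role of Tip


def elements(t: Tree) -> frozenset:
    """Mirror of `set`: collect all labels of the tree."""
    if t is None:
        return frozenset()
    return elements(t.left) | {t.label} | elements(t.right)


def ord_tree(t: Tree) -> bool:
    """Mirror of `ord`: check the binary search tree order."""
    if t is None:
        return True
    return (all(i < t.label for i in elements(t.left)) and ord_tree(t.left)
            and all(t.label < i for i in elements(t.right)) and ord_tree(t.right))


def ins(i: int, t: Tree) -> Tree:
    """Mirror of `ins`: insert i, leaving the tree unchanged if i is present."""
    if t is None:
        return Node(None, i, None)
    if i < t.label:
        return Node(ins(i, t.left), t.label, t.right)
    if t.label < i:
        return Node(t.left, t.label, ins(i, t.right))
    return t


def build(labels) -> Tree:
    """Insert the labels one by one into an empty tree."""
    t = None
    for x in labels:
        t = ins(x, t)
    return t


def check_small_cases(n: int = 5) -> bool:
    """Check both theorems for every insertion order of up to n labels."""
    for k in range(n + 1):
        for labels in permutations(range(k)):
            t = build(labels)
            for i in range(-1, n + 1):
                if elements(ins(i, t)) != {i} | elements(t):
                    return False
                if ord_tree(t) and not ord_tree(ins(i, t)):
                    return False
    return True
```

Calling `check_small_cases()` only covers trees with at most five labels, whereas the two Isabelle theorems hold for all trees and all integers.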

Writing a machine-checked proof requires a higher level of abstraction, considering both how properties are expressed and proved.

2.2 Termination

So far we have only considered programs that are trivially total. The fun command will prove both pattern completeness and termination automatically. An advanced alternative is to use the function command, which does not prove either, and thus we have to do so manually afterwards, for example using Isar for the formal proofs [74]. Pattern completeness must be proved immediately, here with simp_all, and termination is shown later with the termination command.


We need to prove the termination of our micro provers manually. To illustrate the technique, we consider the McCarthy 91 function, which is an old test case for formal verification [34, 36]. The definition itself is simple, but the nested recursion makes termination non-obvious:

  function M :: ‹int ⇒ int› where
    ‹M i = (if 100 < i then i - 10 else M (M (i + 11)))›
    by simp_all

It is called the 91 function because M i = 91 for all i ≤ 100 (and M i = i - 10 otherwise). This is easy to show once termination has been established. We do so below. To prove termination we show a well-founded relation between the recursive calls and function input:

  termination
  proof
    let ?R = ‹measure (λi. nat (101 - i))›
    show ‹wf ?R› by simp

Briefly, (x, y) ∈ measure f ⟷ f x < f y. Any relation defined via measure is well-founded by construction. What remains to be shown is that both i + 11 and M (i + 11) are related to i, to justify the inner and outer recursive call, respectively. We consider only the branch of the if where the recursion happens and as such the first case is trivial given our measure:

    fix i :: int
    assume *: ‹¬ 100 < i›
    then show ‹(i + 11, i) ∈ ?R› by simp

For the other case, we assume that i + 11 is an input that M terminates for, as expressed by M_dom:

    assume ‹M_dom (i + 11)›

This M_dom predicate allows us to prove properties about the input that M terminates on, even though we are still to prove that this is in fact all input. In particular, we note that when M terminates, the output is “mostly” larger than the input:

    moreover have ‹M_dom j ⟹ j - 11 < M j› for j
      by (induct j rule: M.pinduct) (auto simp: M.psimps)

Since the inner recursive call is on i + 11, the output is in fact larger than the input i and this is enough to relate the two, proving termination of the outer recursive call:

    ultimately have ‹i + 11 - 11 < M (i + 11)› by blast
    then show ‹(M (i + 11), i) ∈ ?R› using * by simp
  qed


Having proved termination, we can now perform induction over the call graph (as expressed by M.induct) to prove that the function can be defined without recursion:

  theorem ‹M i = (if 100 < i then i - 10 else 91)›
    by (induct i rule: M.induct) simp
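The measure argument and the closed form can be sanity-checked numerically. The Python sketch below is an illustration under stated assumptions, not the formal proof: `closed_form`, `measure` and `check` are hypothetical helper names, and `measure` imitates `nat (101 - i)` by truncating negative values to 0. It tests, on a finite range, that the recursive definition agrees with the closed form and that both recursive arguments are strictly smaller in the measure.

```python
def M(i: int) -> int:
    """McCarthy 91 function, transcribed from the recursive definition."""
    return i - 10 if 100 < i else M(M(i + 11))


def closed_form(i: int) -> int:
    """Non-recursive characterisation stated by the final theorem."""
    return i - 10 if 100 < i else 91


def measure(i: int) -> int:
    """Imitates nat (101 - i): negative differences are truncated to 0."""
    return max(101 - i, 0)


def check(lo: int = -200, hi: int = 200) -> bool:
    """Compare M with the closed form and test the measure on [lo, hi]."""
    for i in range(lo, hi + 1):
        if M(i) != closed_form(i):
            return False
        if not 100 < i:
            # Inner recursive call: argument i + 11.
            if not measure(i + 11) < measure(i):
                return False
            # Outer recursive call: argument M (i + 11).
            if not measure(M(i + 11)) < measure(i):
                return False
    return True
```

The range is kept small so that Python's recursion limit is not an issue; the Isabelle proof, in contrast, covers every integer.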

This was an example of a function with a difficult termination proof. We also need to give explicit measures to prove termination of our provers in the coming sections but then the automation takes over, making them more suitable as starting points for exploration. Coming up with the measure can be tricky enough without struggling to prove that it works. We note that this declarative way of proving termination is similar to how a mathematician would do it.

2.3 A Prover for Propositional Logic

The following is a formalization of a simple prover for propositional logic. The prover is implicitly based on a sequent calculus for formulas in negation normal form. We start with a datatype for formulas:

  datatype 'a form = Paf 'a | Naf 'a | Con ‹'a form› ‹'a form› | Dis ‹'a form› ‹'a form›

Formulas can be combined using conjunction (Con) and disjunction (Dis). The type variable 'a allows for any representation of atomic formulas. We do not include a general negation connective; instead, an atomic formula can appear as either positive (Paf: positive atomic formula) or negative (Naf: negative atomic formula). The following function defines the semantics of formulas, where an interpretation i maps elements of the type 'a to truth values:

  fun val where
    ‹val i (Paf n) = i n›
  | ‹val i (Naf n) = (¬ i n)›
  | ‹val i (Con p q) = (val i p ∧ val i q)›
  | ‹val i (Dis p q) = (val i p ∨ val i q)›

We exploit the built-in Boolean operators for negation, conjunction and disjunction. Alongside the semantics, we define a sequent calculus as a function for proving formulas:

  function cal where
    ‹cal e [] = (∃ n ∈ fst e. n ∈ snd e)›
  | ‹cal e (Paf n # s) = cal ({n} ∪ fst e, snd e) s›
  | ‹cal e (Naf n # s) = cal (fst e, snd e ∪ {n}) s›
  | ‹cal e (Con p q # s) = (cal e (p # s) ∧ cal e (q # s))›
  | ‹cal e (Dis p q # s) = cal e (p # q # s)›
    by pat_completeness simp_all


The sequent calculus operates on a list of formulas, recursively decomposing them. We construct a set of the positive and a set of the negative literals in e. The function terminates once the list of formulas is empty; the truth is then determined by whether some atom appears in both literal sets. We need to prove that our cal function terminates:

  termination
    by (relation ‹measure (λ(_, s). ∑p ← s. size p)›) simp_all

We obtain a termination proof by providing a suitable measure based on the second argument of the cal function: the sum of sizes of the formulas in the list that we decompose. Because we have defined our sequent calculus as a function, we can immediately obtain a prover by proper initialization of this function:

  definition ‹prover p ≡ cal ({}, {}) [p]›

We showcase the prover by running it on a list of formulas (applied to each element individually):

  value ‹map prover [Paf n, Naf n, Con (Paf n) (Naf n), Dis (Paf n) (Naf n)]›

Trivially, only the last formula is a tautology so the result is a list with three False values and then a single True value. Isabelle interactively displays the result of running the prover on the formulas. We now move on to the question of soundness and completeness for the sequent calculus. We first define an intermediate abbreviation sat that captures that at least one literal in e (positive or negative) is satisfied by the interpretation i.

  abbreviation ‹sat i e ≡ (∃ n ∈ fst e. i n) ∨ (∃ n ∈ snd e. ¬ i n)›

This definition is useful for stating the soundness and completeness properties of our sequent calculus:

  lemma sound_and_complete: ‹cal e s ⟷ (∀ i. (∃ p ∈ set s. val i p) ∨ sat i e)›
    by (induct rule: cal.induct) auto

Because we state soundness and completeness as a single property, and for any call pattern of cal, we need to consider both the contents of the sets of positive and negative literals e, and the list of formulas s. The sequent calculus returns true if and only if, for all interpretations, truth either follows from a formula in the list or from one of the literals. The proof is by induction over the rules of the sequent calculus. We finally formulate soundness and completeness for the prover:

  theorem main: ‹prover p ⟷ (∀ i. val i p)›
    unfolding sound_and_complete prover_def by simp


The stated lemma is weaker than for the sequent calculus and a proof can be obtained by simple rewriting. As such, the proof goal is easily discharged by Isabelle’s automation.
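As an executable illustration of this section, the following Python sketch mirrors val, cal and prover and compares the prover with a brute-force truth-table check on all small formulas. It is a sketch under assumptions, not the Isabelle development: formulas are encoded as tagged tuples chosen for this example, the pair e is split into two arguments `pos` and `neg`, and `atoms`, `valid` and `formulas` are hypothetical helpers.

```python
from itertools import product

# Formulas in negation normal form as tagged tuples (encoding chosen for this sketch):
# ("Paf", n), ("Naf", n), ("Con", p, q), ("Dis", p, q)


def val(i, p):
    """Mirror of `val`: truth value of p under the interpretation i (a dict)."""
    tag = p[0]
    if tag == "Paf":
        return i[p[1]]
    if tag == "Naf":
        return not i[p[1]]
    if tag == "Con":
        return val(i, p[1]) and val(i, p[2])
    return val(i, p[1]) or val(i, p[2])  # "Dis"


def cal(pos, neg, s):
    """Mirror of `cal`: pos and neg play the roles of fst e and snd e."""
    if not s:
        return bool(pos & neg)
    p, rest = s[0], s[1:]
    tag = p[0]
    if tag == "Paf":
        return cal(pos | {p[1]}, neg, rest)
    if tag == "Naf":
        return cal(pos, neg | {p[1]}, rest)
    if tag == "Con":
        return cal(pos, neg, [p[1]] + rest) and cal(pos, neg, [p[2]] + rest)
    return cal(pos, neg, [p[1], p[2]] + rest)  # "Dis"


def prover(p):
    """Mirror of `prover`: start with empty literal sets and the single formula."""
    return cal(frozenset(), frozenset(), [p])


def atoms(p):
    """Names of the atomic formulas occurring in p."""
    return {p[1]} if p[0] in ("Paf", "Naf") else atoms(p[1]) | atoms(p[2])


def valid(p):
    """Brute-force check that p is true under every interpretation of its atoms."""
    names = sorted(atoms(p))
    return all(val(dict(zip(names, bits)), p)
               for bits in product([False, True], repeat=len(names)))


def formulas(depth, names=("a", "b")):
    """All formulas over the given atom names up to the given nesting depth."""
    if depth == 0:
        return [(t, n) for t in ("Paf", "Naf") for n in names]
    smaller = formulas(depth - 1, names)
    return smaller + [(t, p, q) for t in ("Con", "Dis")
                      for p in smaller for q in smaller]


examples = [("Paf", "n"), ("Naf", "n"),
            ("Con", ("Paf", "n"), ("Naf", "n")),
            ("Dis", ("Paf", "n"), ("Naf", "n"))]
print([prover(f) for f in examples])  # [False, False, False, True]
```

Evaluating `all(prover(p) == valid(p) for p in formulas(2))` compares the two on a few thousand formulas; the theorem main establishes the same agreement for every formula.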

3 Epistemic Logic

Epistemic logic provides a foundation for reasoning about the knowledge of agents, both factual (“I know the sky is blue”) and higher-order (“I know that you know that I know the sky is blue”). A deductive proof system enables this reasoning with just a few axioms and inference rules. We formalize epistemic logic with countably many agents in the proof assistant Isabelle/HOL [17]. We include soundness and completeness proofs for the axiomatic system Kn based on the textbook Reasoning About Knowledge by Fagin, Halpern, Moses and Vardi [15]. Our definitions and proofs are specified in the precise language of higher-order logic and every step of our reasoning is mechanically checked. While the results are not new, we use them to showcase the level of precision and guarantee achievable by formalizing work in a proof assistant. Our formalization can also serve as a starting point for similar logics or proof systems. Our completeness proof does not follow the one by Fagin et al. [15] to the letter but is inspired by Fitting’s [16] consistency properties as formalized by Berghofer [5]. We have adapted them from first-order logic to epistemic logic.

3.1 Syntax and Semantics

The formal language L for epistemic logic is a propositional language extended with modal operators K1, …, Kn for expressing knowledge of agents. For example, the formula K1 ϕ ∧ K2 K1 ϕ ∧ ¬K1 K2 K1 ϕ states that: (i) agent 1 knows ϕ, (ii) agent 2 knows that agent 1 knows ϕ, but (iii) agent 1 does not know that agent 2 knows (i). The language is deeply embedded as a datatype in Isabelle/HOL:

  datatype 'i fm
    = FF (‹⊥›)
    | Pro id
    | Dis ‹'i fm› ‹'i fm› (infixr ‹∨› 30)
    | Con ‹'i fm› ‹'i fm› (infixr ‹∧› 35)
    | Imp ‹'i fm› ‹'i fm› (infixr ‹⟶› 25)
    | K 'i ‹'i fm›

We define a constructor for each primitive of our syntax, e.g. FF for falsity with the alternative notation ⊥. Similarly, we give infix syntax for the binary connectives, which all associate to the right and are given suitable precedences.


The type id is an abbreviation for strings of characters, used as labels for the propositions. We fix this instead of using a type variable in order to ease notation later. The type variable 'i is an arbitrary type for agents. In our informal example, we used natural numbers, but we do not commit ourselves to any specific type. Our soundness proof holds for any type while the completeness proof holds for any countable type 'i. We need the agent labels 'i to be countable, such that the language itself is countable. Countability of the syntax is a standard prerequisite for our way of proving completeness. We introduce negation into the syntax as an abbreviation:

  abbreviation Neg (‹¬ _› [40] 40) where
    ‹Neg p ≡ p ⟶ ⊥›

The semantics of epistemic logic formulas is based on a model of possible worlds as formalized by Kripke structures:

  datatype ('i, 's) kripke = Kripke (π: ‹'s ⇒ id ⇒ bool›) (K: ‹'i ⇒ 's ⇒ 's set›)

There are two components: an interpretation π that assigns truth values to propositions for each state (possible world), and a relation K that, when viewed as a function, takes an agent and a state and returns a set of states. This set is to be understood as the states the agent considers possible given the information available in the input state. We should mention the type variables ('i, 's). The type 'i is again an arbitrary type for agents while 's is the type of states. Thus, the formalization is generic over the type of agents and possible worlds. The double turnstile, M, s ⊨ ϕ, denotes the semantics of a formula ϕ ∈ L under a Kripke structure M and state s. We formalize it as the following function:

  primrec semantics :: ‹('i, 's) kripke ⇒ 's ⇒ 'i fm ⇒ bool› (‹_, _ ⊨ _› [50,50] 50) where
    ‹(_, _ ⊨ ⊥) = False›
  | ‹(M, s ⊨ Pro i) = π M s i›
  | ‹(M, s ⊨ (p ∨ q)) = ((M, s ⊨ p) ∨ (M, s ⊨ q))›
  | ‹(M, s ⊨ (p ∧ q)) = ((M, s ⊨ p) ∧ (M, s ⊨ q))›
  | ‹(M, s ⊨ (p ⟶ q)) = ((M, s ⊨ p) ⟶ (M, s ⊨ q))›
  | ‹(M, s ⊨ K i p) = (∀ t ∈ K M i s. M, t ⊨ p)›

No combination of model and state satisfies ⊥. The logical operators are defined by recursively obtaining the semantics of each subformula and combining the Boolean values through the built-in operators in Isabelle/HOL. The case for a proposition i looks up and returns the truth value of s and i in π M (the latter gives the π component of the Kripke structure M). Lastly, we have the case for a modal operator Ki p which requires the semantics of p to be true in every state agent i considers possible (from the current state). With the semantics in place, we can prove various properties of the modal operator Ki, for example the following (see the formalization for the proof):

  theorem distribution: ‹M, s ⊨ (K i p ∧ K i (p ⟶ q) ⟶ K i q)›

The above states that the operator Ki distributes over implication.
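To make the semantics concrete, the following Python sketch evaluates formulas over small finite Kripke structures and checks the distribution property on every two-state, single-agent model. It is only an illustration under assumptions: formulas are tagged tuples chosen for this example, a structure is passed as two dictionaries `pi` and `K`, and `holds` and `all_small_models` are hypothetical names rather than part of the formalization.

```python
from itertools import product

# Formulas as tagged tuples (encoding chosen for this sketch):
# ("FF",), ("Pro", x), ("Dis", p, q), ("Con", p, q), ("Imp", p, q), ("K", i, p)


def holds(pi, K, s, f):
    """M, s |= f for the Kripke structure M = (pi, K).

    pi: dict mapping (state, proposition) to bool
    K:  dict mapping (agent, state) to the set of states considered possible
    """
    tag = f[0]
    if tag == "FF":
        return False
    if tag == "Pro":
        return pi[(s, f[1])]
    if tag == "Dis":
        return holds(pi, K, s, f[1]) or holds(pi, K, s, f[2])
    if tag == "Con":
        return holds(pi, K, s, f[1]) and holds(pi, K, s, f[2])
    if tag == "Imp":
        return (not holds(pi, K, s, f[1])) or holds(pi, K, s, f[2])
    if tag == "K":
        return all(holds(pi, K, t, f[2]) for t in K[(f[1], s)])
    raise ValueError(f"unknown formula: {f!r}")


def all_small_models(states=(0, 1), props=("p", "q"), agent="a"):
    """Enumerate every interpretation and accessibility function on the states."""
    subsets = [frozenset(c for c, keep in zip(states, bits) if keep)
               for bits in product([False, True], repeat=len(states))]
    keys = [(s, x) for s in states for x in props]
    for truth in product([False, True], repeat=len(keys)):
        pi = dict(zip(keys, truth))
        for choice in product(subsets, repeat=len(states)):
            K = {(agent, s): ts for s, ts in zip(states, choice)}
            yield pi, K


P, Q = ("Pro", "p"), ("Pro", "q")
distribution = ("Imp",
                ("Con", ("K", "a", P), ("K", "a", ("Imp", P, Q))),
                ("K", "a", Q))
```

Checking `holds(pi, K, s, distribution)` for every enumerated model and state succeeds on these 256 structures, which is consistent with, but far weaker than, the theorem proved for all Kripke structures.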

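$$\mathrm{A1}\;\; \frac{p \text{ is a propositional tautology}}{\vdash p} \qquad\quad \mathrm{A2}\;\; \vdash K_i\, p \land K_i (p \to q) \to K_i\, q$$

$$\mathrm{R1}\;\; \frac{\vdash p \qquad \vdash p \to q}{\vdash q} \qquad\quad \mathrm{R2}\;\; \frac{\vdash p}{\vdash K_i\, p}$$

Fig. 1. Our axiomatic system for epistemic logic.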

3.2 Axiomatic System

The distribution theorem can be recognized in the very compact axiomatic system Kn (cf. Fig. 1). We adopt the usual syntax that the provability of a formula ϕ ∈ L is denoted by the turnstile symbol: ⊢ ϕ. The system is inductively defined as follows:

  inductive SystemK :: ‹'i fm ⇒ bool› (‹⊢ _› [50] 50) where
    A1: ‹tautology p ⟹ ⊢ p›
  | A2: ‹⊢ (K i p ∧ K i (p ⟶ q) ⟶ K i q)›
  | R1: ‹⊢ p ⟹ ⊢ (p ⟶ q) ⟹ ⊢ q›
  | R2: ‹⊢ p ⟹ ⊢ K i p›

A1 states that any classical propositional tautology is provable, A2 is similar to the distribution theorem, R1 is simply modus ponens and R2 states that agents also know the provable formulas. The definition tautology in A1 relies on a semantics that treats modal formulas Ki ϕ as if they were propositional symbols. This is the semantic equivalent of allowing all substitution instances of propositional tautologies, but is simpler to formalize.

3.3 Soundness

For the axiomatic system Kn to be sound, every formula in L provable in system Kn must be valid with respect to the semantics:

  ∀ϕ ∈ L. ⊢ ϕ ⟶ (∀M, s. M, s ⊨ ϕ)

That is, no combination of proof rules leads to a formula that is not valid. It does not follow that all valid formulas are provable, however, which is why we also need completeness.
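Since soundness of rule A1 hinges on what tautology means, the following Python sketch shows one way to read that definition: modal formulas K i p are treated as opaque propositional symbols and all truth assignments to the resulting atoms are enumerated. This is an assumption-laden illustration, not the Isabelle definition: formulas are tagged tuples chosen for this example, and `atoms`, `eval_prop` and `tautology` are hypothetical Python names.

```python
from itertools import product

# Formulas as tagged tuples (encoding chosen for this sketch):
# ("FF",), ("Pro", x), ("Dis", p, q), ("Con", p, q), ("Imp", p, q), ("K", i, p)


def atoms(f):
    """Subformulas that the propositional reading cannot look inside."""
    tag = f[0]
    if tag == "FF":
        return set()
    if tag in ("Pro", "K"):
        return {f}
    return atoms(f[1]) | atoms(f[2])


def eval_prop(assignment, f):
    """Evaluate f propositionally; Pro and K formulas are looked up as atoms."""
    tag = f[0]
    if tag == "FF":
        return False
    if tag in ("Pro", "K"):
        return assignment[f]
    a, b = eval_prop(assignment, f[1]), eval_prop(assignment, f[2])
    if tag == "Dis":
        return a or b
    if tag == "Con":
        return a and b
    return (not a) or b  # "Imp"


def tautology(f):
    """True if f holds under every assignment of truth values to its atoms."""
    names = sorted(atoms(f), key=repr)
    return all(eval_prop(dict(zip(names, bits)), f)
               for bits in product([False, True], repeat=len(names)))


P, Q = ("Pro", "p"), ("Pro", "q")
KP = ("K", "a", P)
distribution = ("Imp", ("Con", KP, ("K", "a", ("Imp", P, Q))), ("K", "a", Q))
```

Under this reading `tautology(("Imp", KP, KP))` is true, while `tautology(distribution)` is false, because the three K formulas are independent atoms; this is one way to see why A2 is stated as a separate axiom.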


Our formalized proof of soundness requires extra work for the rule A1. The following theorem states soundness for this rule:

  theorem tautology: ‹tautology p ⟹ M, s ⊨ p›

Note that the quantification over p ∈ L and over all M and s is implicit in Isabelle/HOL. See the formalization for the proof. Proving soundness for system Kn is now straightforward. The following theorem captures the property:

  theorem soundness: ‹⊢ p ⟹ M, s ⊨ p›
    by (induct p arbitrary: s rule: SystemK.induct) (simp_all add: tautology)

The proof strategy is to apply induction over the rules of the system. Once we supply the tautology theorem, the simplification proof method in Isabelle/HOL discharges all subgoals.

3.4 Completeness

We now want to demonstrate that system Kn is not only sound, but also complete, namely that every valid formula in L is provable:

  ∀ϕ ∈ L. (∀M, s. M, s ⊨ ϕ) ⟶ ⊢ ϕ

The formalized proof follows Fagin et al. [15] and builds on maximal consistent sets of formulas. A formula ϕ is Kn-consistent if its negation is not provable: ⊬ ¬ϕ. A finite set of formulas ϕ1, …, ϕn is Kn-consistent if we cannot prove that they imply a contradiction: ⊬ ϕ1 ⟶ … ⟶ ϕn ⟶ ⊥. Finally, an infinite set of formulas is Kn-consistent if all its finite subsets are. Instead of working directly with this definition, we start from Fitting’s consistency properties [5], which define the class C of consistent sets S directly from the connectives of the formula, instead of referencing the axiom system:

  definition consistency :: ‹'i fm set set ⇒ bool› where
    ‹consistency C ≡ ∀ S ∈ C.
      (∀ p. ¬ (Pro p ∈ S ∧ (¬ Pro p) ∈ S)) ∧
      ⊥ ∉ S ∧
      (∀ Z. (¬ (¬ Z)) ∈ S ⟶ S ∪ {Z} ∈ C) ∧
      (∀ A B. (A ∧ B) ∈ S ⟶ S ∪ {A, B} ∈ C) ∧
      (∀ A B. (¬ (A ∨ B)) ∈ S ⟶ S ∪ {¬ A, ¬ B} ∈ C) ∧
      (∀ A B. (A ∨ B) ∈ S ⟶ S ∪ {A} ∈ C ∨ S ∪ {B} ∈ C) ∧
      (∀ A B. (¬ (A ∧ B)) ∈ S ⟶ S ∪ {¬ A} ∈ C ∨ S ∪ {¬ B} ∈ C) ∧
      (∀ A B. (A ⟶ B) ∈ S ⟶ S ∪ {¬ A} ∈ C ∨ S ∪ {B} ∈ C) ∧
      (∀ A B. (¬ (A ⟶ B)) ∈ S ⟶ S ∪ {A, ¬ B} ∈ C) ∧
      (∀ A. tautology A ⟶ S ∪ {A} ∈ C) ∧
      (∀ A i. ¬ (K i A ∈ S ∧ (¬ K i A) ∈ S))›


All but the last two conditions are standard and ensure downwards saturation [67] of each set: the satisfiability of any member is guaranteed by conditions on its subformulas, and consistency is ensured at the bottom. The penultimate line ensures that the consistent sets contain all tautologies. This is a technical trick that makes them easier to work with: since no tautology can break consistency, we might as well include them. Similarly, the last condition ensures that no agent both knows and does not know the same formula A. We connect the definition of consistency to provability in system Kn through the following theorem:

  theorem K_consistency: ‹consistency {set G | G. ¬ ⊢ imply G ⊥}›
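To make the finite case concrete, the following Python sketch builds the chained implication ϕ1 ⟶ . . . ⟶ ϕn ⟶ ⊥ from a list of formulas. It assumes that imply in the theorem above folds its list argument from the right; the tuple encoding of formulas and the names used here are illustrative and are not taken from the formalization.

```python
# Illustrative encoding (an assumption, not the Isabelle datatype):
# ("ff",) is falsity, ("pro", x) is an atom, ("imp", p, q) is an implication.
FF = ("ff",)


def imply(ps, q):
    """Return p1 -> (p2 -> ... (pn -> q)) for ps = [p1, ..., pn]."""
    result = q
    for p in reversed(ps):
        result = ("imp", p, result)
    return result


a, b = ("pro", "a"), ("pro", "b")
chain = imply([a, b], FF)  # a -> (b -> ff)
```

Under this reading, a list G of formulas is consistent exactly when the formula imply(G, FF) has no derivation.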

The completeness proof follows the usual recipe: (i) assume a valid formula ϕ has no derivation, (ii) then its negation is Kn-consistent, and (iii) we can extend the set {¬ϕ} in a standard way (due to Lindenbaum [68]) to a maximally consistent set [15], which (iv) has a model. This contradicts the validity assumption. The completeness theorem is:

  theorem completeness:
    assumes ‹∀(M :: ('i :: countable, 'i fm set) kripke) s. M, s ⊨ p›
    shows ‹⊢ p›

For technical reasons we have to require validity in a specific universe, namely one in which the possible worlds are sets of formulas, but this is implied by the usual assumption of validity in all universes. Given the provability of p, that is, ⊢ p, the soundness result implies that p is valid in all universes.
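Step (iii) of the recipe can be illustrated with a small Python sketch of a Lindenbaum-style extension. It is a toy model under stated assumptions: formulas are purely propositional tuples, the enumeration is a finite list rather than an enumeration of all formulas, and brute-force satisfiability stands in for Kn-consistency. None of the names below come from the formalization.

```python
from itertools import product

# Illustrative encoding: ("ff",), ("pro", x), ("imp", p, q).
FF = ("ff",)


def neg(p):
    return ("imp", p, FF)


def atoms(p):
    if p[0] == "ff":
        return set()
    if p[0] == "pro":
        return {p[1]}
    return atoms(p[1]) | atoms(p[2])


def holds(p, val):
    if p[0] == "ff":
        return False
    if p[0] == "pro":
        return val[p[1]]
    return (not holds(p[1], val)) or holds(p[2], val)


def satisfiable(formulas):
    """Stand-in consistency oracle: is some valuation a model of all formulas?"""
    names = sorted(set().union(*map(atoms, formulas)))
    return any(
        all(holds(p, dict(zip(names, bits))) for p in formulas)
        for bits in product([False, True], repeat=len(names))
    )


def lindenbaum(start, enumeration, consistent=satisfiable):
    """Add each enumerated formula in turn whenever the set stays consistent."""
    s = set(start)
    for p in enumeration:
        if consistent(s | {p}):
            s.add(p)
    return s


a, b = ("pro", "a"), ("pro", "b")
candidates = [a, b, neg(b), ("imp", a, b)]
extended = lindenbaum({neg(a)}, candidates)
```

The result is maximal relative to the enumeration: every candidate is either in the extended set or would make it inconsistent.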

4 Public Announcement Logic

We now move beyond static knowledge of agents and consider information updates as well. The formal language L! for public announcement logic is an extension of that of epistemic logic with the operator [r]! p for any formulas r and p, meaning “p is true after the public announcement of r”. For example, [K1 ρ ∧! K2 σ]! τ means that τ is true after the public announcement that agent 1 knows ρ and agent 2 knows σ. In the formalization [18], we again deeply embed the language as a datatype in Isabelle/HOL:

  datatype 'i pfm
    = FF (‹⊥!›)
    | Pro' id (‹Pro!›)
    | Dis ‹'i pfm› ‹'i pfm› (infixr ‹∨!› 30)
    | Con ‹'i pfm› ‹'i pfm› (infixr ‹∧!› 35)
    | Imp ‹'i pfm› ‹'i pfm› (infixr ‹⟶!› 25)
    | K' 'i ‹'i pfm› (‹K!›)
    | Ann ‹'i pfm› ‹'i pfm› (‹[_]! _› [50, 50] 50)

We have added primes to some constructors to disambiguate them from the epistemic logic. We say that a formula is static if it does not contain any announcement operators.

Interactive Theorem Proving for Logic and Information

[Figure 2: the rules PA1, PA2, PR1, PR2 and PR3 and the reduction axioms PFF, PPro, PDis, PCon, PImp and PK, as stated by the inductive definition of PA in Sect. 4.1.]

Fig. 2. Our axiomatic system for public announcement logic.

The bi-implication operator is central to our development and we introduce it as an abbreviation:

  abbreviation PIff :: ‹'i pfm ⇒ 'i pfm ⇒ 'i pfm› (infixr ‹⟷!› 25) where
    ‹p ⟷! q ≡ (p ⟶! q) ∧! (q ⟶! p)›

The semantics depend on the notion of the restriction of a model to the worlds in which a specific formula is true. We formalize the semantics as the function psemantics and restriction as the function restrict. They are defined by mutual recursion:

  fun psemantics :: ‹('i, 'w) kripke ⇒ 'w ⇒ 'i pfm ⇒ bool› (‹_, _ ⊨! _› [50, 50] 50)
    and restrict :: ‹('i, 'w) kripke ⇒ 'i pfm ⇒ ('i, 'w) kripke› where
    ‹(M, w ⊨! ⊥!) = False›
  | ‹(M, w ⊨! Pro! x) = π M w x›
  | ‹(M, w ⊨! (p ∨! q)) = ((M, w ⊨! p) ∨ (M, w ⊨! q))›
  | ‹(M, w ⊨! (p ∧! q)) = ((M, w ⊨! p) ∧ (M, w ⊨! q))›
  | ‹(M, w ⊨! (p ⟶! q)) = ((M, w ⊨! p) ⟶ (M, w ⊨! q))›
  | ‹(M, w ⊨! K! i p) = (∀v ∈ K M i w. M, v ⊨! p)›
  | ‹(M, w ⊨! [r]! p) = ((M, w ⊨! r) ⟶ (restrict M r, w ⊨! p))›
  | ‹restrict M p = Kripke (π M) (λi w. {v. v ∈ K M i w ∧ (M, v ⊨! p)})›

As can be seen, the semantics for each formula is defined the same as for epistemic logic, a semantics for [_]! is added, and restrict is defined.
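The mutual recursion between the semantics and restrict can be mirrored in a small executable Python sketch. This is only an illustration under assumptions: formulas are encoded as tuples, a model is a pair of Python functions, and the names Kripke, sem and the tag strings are hypothetical stand-ins rather than the Isabelle definitions.

```python
class Kripke:
    """A model given by a valuation pi(w, x) and accessibility acc(i, w)."""

    def __init__(self, pi, acc):
        self.pi = pi    # pi(w, x) -> bool: is atom x true at world w?
        self.acc = acc  # acc(i, w) -> set of worlds agent i considers possible


def sem(M, w, p):
    """Truth of formula p at world w of model M, clause by clause."""
    tag = p[0]
    if tag == "ff":
        return False
    if tag == "pro":
        return M.pi(w, p[1])
    if tag == "dis":
        return sem(M, w, p[1]) or sem(M, w, p[2])
    if tag == "con":
        return sem(M, w, p[1]) and sem(M, w, p[2])
    if tag == "imp":
        return (not sem(M, w, p[1])) or sem(M, w, p[2])
    if tag == "k":    # ("k", agent, formula)
        return all(sem(M, v, p[2]) for v in M.acc(p[1], w))
    if tag == "ann":  # ("ann", announcement, formula)
        return (not sem(M, w, p[1])) or sem(restrict(M, p[1]), w, p[2])
    raise ValueError(tag)


def restrict(M, r):
    """Keep all worlds but make only r-worlds accessible."""
    return Kripke(M.pi, lambda i, w: {v for v in M.acc(i, w) if sem(M, v, r)})


# Two worlds; atom x holds only at world 0; agent "a" cannot tell them apart.
M = Kripke(lambda w, x: (w, x) == (0, "x"), lambda i, w: {0, 1})
x = ("pro", "x")
knows_x = ("k", "a", x)
```

In this model agent a does not know x at world 0, but does know it after x has been announced, and an announcement of x at world 1 is false, so any formula holds after it.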


We restrict the model, not by removing worlds but by removing every agent’s accessibility to those worlds. The idea behind the semantics is that for [r]! p to be true in model M and world w, either r is falsified at M and w (a false announcement), or p is satisfied at w in the restricted model restrict M r, where only r-worlds are accessible.

4.1 Axiomatic System

We adapt the syntax ⊢! ρ for the provability of ρ in the following axiomatic system, inspired by the system described by Baltag and Renne [2]. It is defined inductively (cf. Fig. 2):

  inductive PA :: ‹'i pfm ⇒ bool› (‹⊢! _› [50] 50) where
    PA1: ‹ptautology p ⟹ ⊢! p›
  | PA2: ‹⊢! (K! i p ∧! K! i (p ⟶! q) ⟶! K! i q)›
  | PR1: ‹⊢! p ⟹ ⊢! (p ⟶! q) ⟹ ⊢! q›
  | PR2: ‹⊢! p ⟹ ⊢! K! i p›
  | PR3: ‹⊢! p ⟹ ⊢! [r]! p›
  | PFF: ‹⊢! ([r]! ⊥! ⟷! (r ⟶! ⊥!))›
  | PPro: ‹⊢! ([r]! Pro! x ⟷! (r ⟶! Pro! x))›
  | PDis: ‹⊢! ([r]! (p ∨! q) ⟷! [r]! p ∨! [r]! q)›
  | PCon: ‹⊢! ([r]! (p ∧! q) ⟷! [r]! p ∧! [r]! q)›
  | PImp: ‹⊢! (([r]! (p ⟶! q)) ⟷! ([r]! p ⟶! [r]! q))›
  | PK: ‹⊢! (([r]! K! i p) ⟷! (r ⟶! K! i ([r]! p)))›

Rules PA1, PA2, PR1 and PR2 are analogous to the rules A1, A2, R1 and R2 of epistemic logic (ptautology is implemented in the same style as tautology). In addition, the system has six axioms, one for each combination of [_]! with ⊥!, atomic formulas, ∨!, ∧!, ⟶! and K!. The axioms for the binary connectives simply distribute [_]! over each connective, while the ones for ⊥! and atomic formulas rephrase [_]! as an implication. The axiom for [_]! and knowledge says that “i knows p after an announcement r if and only if the announcement r, whenever truthful, is known by i to make p true.” [2].

4.2 Reducing to Epistemic Logic

We implement the reduction from public announcement logic to epistemic logic operationally, as guided by the reduction axioms. We do so in two steps. The first operation, reduce' r p, translates the formula [r]! p into an equivalent formula in epistemic logic when p itself is static:

  primrec reduce' :: ‹'i pfm ⇒ 'i pfm ⇒ 'i pfm› where
    ‹reduce' r ⊥! = (r ⟶! ⊥!)›
  | ‹reduce' r (Pro! x) = (r ⟶! Pro! x)›
  | ‹reduce' r (p ∨! q) = (reduce' r p ∨! reduce' r q)›
  | ‹reduce' r (p ∧! q) = (reduce' r p ∧! reduce' r q)›
  | ‹reduce' r (p ⟶! q) = (reduce' r p ⟶! reduce' r q)›
  | ‹reduce' r (K! i p) = (r ⟶! K! i (reduce' r p))›
  | ‹reduce' r ([p]! q) = undefined›

The second operation, reduce p, reduces the PAL-formula p into epistemic logic by recursion over the syntax:


  primrec reduce :: ‹'i pfm ⇒ 'i pfm› where
    ‹reduce ⊥! = ⊥!›
  | ‹reduce (Pro! x) = Pro! x›
  | ‹reduce (p ∨! q) = (reduce p ∨! reduce q)›
  | ‹reduce (p ∧! q) = (reduce p ∧! reduce q)›
  | ‹reduce (p ⟶! q) = (reduce p ⟶! reduce q)›
  | ‹reduce (K! i p) = K! i (reduce p)›
  | ‹reduce ([r]! p) = reduce' (reduce r) (reduce p)›
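The two-step reduction can be traced on small inputs with the following Python sketch. It is an illustration under assumptions: formulas are encoded as tuples with string agent and atom names, reduce1 plays the role of reduce', and the undefined case is modelled by raising an exception. None of these names are part of the formalization.

```python
BINARY = ("dis", "con", "imp")


def static(p):
    """A formula is static if it contains no announcement operator."""
    if p[0] == "ann":
        return False
    return all(static(q) for q in p[1:] if isinstance(q, tuple))


def reduce1(r, p):
    """Counterpart of reduce': translate [r]! p for a static p."""
    tag = p[0]
    if tag in ("ff", "pro"):
        return ("imp", r, p)
    if tag in BINARY:
        return (tag, reduce1(r, p[1]), reduce1(r, p[2]))
    if tag == "k":
        return ("imp", r, ("k", p[1], reduce1(r, p[2])))
    raise ValueError("reduce1 expects a static formula")


def reduce(p):
    """Remove all announcements, innermost first."""
    tag = p[0]
    if tag in ("ff", "pro"):
        return p
    if tag in BINARY:
        return (tag, reduce(p[1]), reduce(p[2]))
    if tag == "k":
        return ("k", p[1], reduce(p[2]))
    if tag == "ann":
        # Both arguments are reduced first, so reduce1 only sees static input.
        return reduce1(reduce(p[1]), reduce(p[2]))
    raise ValueError(tag)


x, y = ("pro", "x"), ("pro", "y")
simple = ("ann", x, ("k", "a", y))   # [x]! K! a y
nested = ("ann", ("ann", x, y), y)   # [[x]! y]! y
```

For instance, the formula [x]! K! a y reduces to x ⟶! K! a (x ⟶! y), and a nested announcement is handled because its arguments are reduced before reduce1 is applied.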

We stay within the pfm type rather than fm, even though we do not use the extra constructors, since our axiomatic system is defined over the pfm type. To prove completeness, we must prove that the reduction preserves the semantics. We do so by first considering the basic reduce' operation with a static target:

  lemma reduce'_semantics:
    assumes ‹static q›
    shows ‹((M, w ⊨! [p]! (q))) = (M, w ⊨! reduce' p q)›
    using assms by (induct q arbitrary: w) auto

With this lemma we can prove that reduce preserves the semantics:

  lemma reduce_semantics: ‹(M, w ⊨! p) = (M, w ⊨! reduce p)›

We refer to the formalization for the proof by structural induction.

4.3 Soundness

We prove the proof system sound in the same way as for epistemic logic:

  theorem soundness:
    assumes ‹⊢! p›
    shows ‹M, w ⊨! p›
    using assms by (induct p arbitrary: M w rule: PA.induct) (simp_all add: ptautology)

The lemma ptautology is analogous to the theorem tautology from the formalization of epistemic logic.

4.4 Completeness

We prove the proof system complete. Recall that the static formulas are those in which [_]! does not occur. The proof system is complete for such formulas:

  theorem static_completeness:
    assumes ‹static p› ‹∀(M :: ('i :: countable, 'i fm set) kripke) w. M, w ⊨! p›
    shows ‹⊢! p›

The reason is that

• ⊢! contains all the axioms of ⊢,
• ⊢ is complete, and
• a static formula is straightforwardly a formula of epistemic logic.


With this theorem in place we can prove completeness for all formulas:

  theorem completeness:
    assumes ‹∀(M :: ('i :: countable, 'i fm set) kripke) w. M, w ⊨! p›
    shows ‹⊢! p›

We do it by proving that if p is true in all models then so is the formula reduce p, since the reduction is sound. The formula reduce p does not contain [_]! and is therefore static. By static completeness, reduce p is provable, ⊢! reduce p. Additionally, we prove from the reduction axioms PDis, PCon, PImp, PK, PFF and PPro that ⊢! p ⟷! reduce p, and thus that p is provable, ⊢! p.

5 Related Work

For a good overview of the topic of formalizing logical meta-theory we recommend a recent paper by Blanchette [6].

Several frameworks have been developed for proving logical calculi complete. These frameworks allow the reuse of syntax, semantics and proof ideas to formalize logical systems and their soundness and completeness as well as other results:

• Michaelis and Nipkow formalize a bouquet of different proof systems all based on the same syntax for propositional logic [39, 40]. The framework formalizes sequent calculus, natural deduction, Hilbert systems and resolution.
• The framework by Blanchette, Popescu and Traytel allows proofs of soundness and completeness for proof systems for different logics [8–11]. This is possible because their framework is parameterized on the specific syntax and semantics. In a related paper’s supplementary material, Blanchette and Popescu [7] show that a formalized tableau for many-sorted first-order logic in negation normal form with equality fits in the framework. This supplementary material is unfortunately not up to date with recent Isabelle versions.
• A third development is frameworks for proving completeness of saturation provers. Schlichtkrull et al. [60, 63, 64] formalize the completeness of resolution in a generic way that allows for different provers to be built from the development, which is based on the work by Bachmair and Ganzinger [1]. The development is used to show the soundness and completeness of a particular prover using binary resolution and with a specific strategy for removing redundant clauses, but other provers would also fit [61, 62]. Tourret and Blanchette reformalize this result [69, 70] based on the more general theory of saturation provers by Waldmann et al. [73].

Outside of the mentioned frameworks, a number of self-contained formalizations of sequent calculi in proof assistants appear in the literature:

• Ridge and Margetson [37, 57, 58] formalized in Isabelle/HOL soundness and completeness for a sequent calculus for formulas in negation normal form and with a term language of only variables.
• Braselmann and Koepke [12, 13] formalized in Mizar soundness and completeness of a sequent calculus.


• Schlöder and Koepke [66] formalized its completeness considering also uncountable languages.
• A more exotic result is the formalization by Ilik [28] in Coq of completeness of a sequent calculus with respect to a Kripke-semantics for classical first-order logic [29].

The following formalizations appear if we broaden the scope to include intuitionistic logic:

• Persson [50] formalized in ALF the soundness of a sequent calculus for intuitionistic first-order logic.
• Herbelin, Kim and Lee [27] formalized in Coq the completeness of a sequent calculus for intuitionistic first-order logic restricted to formulas with implication and universal quantification as the only logical symbols. Their formalization applied a Kripke-style semantics.

If we broaden the scope further to look beyond sequent calculi, we can mention several other formalizations:

• Jensen, Larsen, Schlichtkrull and Villadsen [32, 65] formalized in Isabelle/HOL an axiomatic system for classical logic.
• Raffalli [53] formalized in Phox natural deduction for classical logic.
• Persson [50] formalized in ALF natural deduction for intuitionistic logic.
• Peltier [49] formalized in Isabelle/HOL superposition.
• Paulson [46–48] formalized in Isabelle/HOL Gödel’s Incompleteness Theorems, but this does not include a completeness proof.
• Popescu and Traytel present a formalization of Gödel’s Incompleteness Theorems [52].
• Jensen, Hindriks and Villadsen [30, 31] also present an approach to formalize in Isabelle/HOL a verification framework for agent programs.

Let us now turn to formalizations of modal logic. These logics contain a single necessity operator □ rather than one Ki for each agent i in a set of agents:

• Bentzen [3] formalized S5 in Lean.
• Neeley [42] formalized modal systems K, T, S4 and S5 in Lean.

In the context of epistemic logic we found two formalizations in Lean of the S5 system for epistemic logic and PAL. We instead opted to formalize Kn. The S5 system extends Kn in that it has a number of additional axioms, and it is sound and complete when considering Kripke models in which the accessibility relation is an equivalence relation rather than any relation. Additionally there is work on a formalization of intuitionistic epistemic logic in Coq.

• Neeley [42, 43] formalized S5 for epistemic logic and public announcement logic in Lean. Her proof system includes an axiom for the composition of public announcement operators, instead of our axiom for distribution over implication (PImp) and our announcement necessitation rule (PR3).
• Li [35] formalized S5 for epistemic logic and public announcement logic in Lean but only formalized the logical equivalence of the reduction axioms, not the completeness of a proof system that includes them.


• Hagemeier [24] is formalizing intuitionistic epistemic logic in Coq. It is presented in a number of slides, memos and draft memos. We look forward to the finished presentation of the work.
• The Twelf distribution [51] includes a formalization in the LF logical framework [26] of a sequent calculus and natural deduction proof system for classical S5. The Twelf system [51] is worth mentioning by itself. It provides a uniform metalanguage for specifying logics and proof systems and proving meta-theoretical properties like cut-elimination. However, we are not aware of any formalizations of semantic completeness like the one we present here.

Other interesting proposals for epistemic logic appear in the literature:

• Kądziołka [33] formalized a solution to a puzzle and introduced a logic tailored to the problem that turns out to be very similar to the possible worlds model of epistemic logic.
• Xiong, Ågotnes and Zhang [75] presented a variant of epistemic logic that adds the notion of secret knowledge as a first-class citizen. The notion of secrets can be defined in terms of the knowledge operator, but a new modality for secrets is introduced. The authors argue that the main principles can be studied this way, for instance when considering a language with an operator for secrets and without the usual knowledge operator.

Our formalizations rely on deep embedding of formulas. In contrast, using a shallow embedding of the logic means that we write formulas directly in the proof assistant’s logic. The advantages of a shallow embedding include not having to formalize semantics, and usually the automation has an easier time proving theorems. The advantage of a deep embedding is that we can obtain formalized soundness and completeness theorems, cf. Sect. 2.

• Benzmüller and Paulson [4] formalized in Isabelle/HOL a shallow encoding of modal logic. Gleißner, Steen and Benzmüller [22, 23] showed effective automation for a wide range of modal logics due to the use of a shallow embedding.
• Reiche and Benzmüller [54] formalized in Isabelle/HOL a shallow embedding of PAL.

Giselle Reis [55] sees formalizing logics in proof assistants as one of several ways to facilitate meta-theory. Concretely, she looks at three methods for facilitating meta-theory. Firstly, she considers using linear logic and subexponential linear logic as a framework for meta-theoretical reasoning. The idea is that certain logics can be expressed in the meta-logic of linear logic and subexponential linear logic. These logics allow some meta-theoretical properties to be proved automatically. Secondly, she considers the use of proof assistants to prove meta-theoretical properties, which is similar to our work here. She notes:

  One of the issues when developing proofs of meta-properties by hand is the sheer complexity and number of cases. By implementing these proofs in proof assistants, the computer will not let us skip cases or overlook details.


We share this experience. Reis experienced that using Coq to formalize logics required her to write specific tactics to do parts of the proofs automatically. In our Isabelle formalization we instead relied on the Isar proof language and the built-in tactics of Isabelle. Reis also explains that working in proof assistants can be combined with the approach of using linear logic as a framework: the idea is that linear logic can be formalized in Coq and then one can use this formalized linear logic to prove properties of other logics. Giselle Reis also notes that formalizations of logics require a significant amount of work:

  The fact that each of these works is a publication (or collection of publications) itself is evidence that formalizing meta-theory is far from trivial work and cannot be done as a matter of fact.

We agree with this perspective and see the building of frameworks in proof assistants and formalizing more logics within them as a way to improve this situation. Additionally, improving the proof assistants themselves will help this agenda.

Lastly, Reis considers a solution where the computer aids only in parts of the meta-reasoning, which leaves a part to be done by hand. In particular she considers two systems that can be used for this. GAPT [14] (General Architecture for Proof Theory) is a proof theory framework containing common components of proof theory such as data structures, algorithms, parsers and automated deduction. GAPT interfaces to a number of automated reasoning tools and its focus is on transformation and further processing of proofs. Sequoia [56] is a tool for helping with the meta-theory of sequent calculi, which can import and export LaTeX code. Reis concludes that each method has its strengths and weaknesses and also that much work can be done to make them better and easier to use.

6 Concluding Remarks

For artificial intelligence (AI) in general and for natural language processing (NLP) in particular, the interrelationship between logic and information is pivotal [38]:

  There is a bi-directional relation between logic and information. On the one hand, information underlies the intuitive understanding of standard logical notions such as inference (which may be thought of as the process that turns implicit information into explicit information) and computation. On the other hand, logic provides a formal framework for the study of information itself.

We have considered fundamental axiomatic systems for both epistemic logic (EL) and public announcement logic (PAL). Instead of presenting pen-and-paper proofs of soundness and completeness we have used automated reasoning, the Isabelle proof assistant, as a powerful interactive tool. We share the vision of Rob Nederpelt and Herman Geuvers [41, p. 385]:

  In the future, we expect an enormous increase in the use of proof assistants. Our vision is that formalising a mathematical proof may become as easy as writing mathematics in a mathematical text editor such as LaTeX (Lamport, 1985) and that a mathematical proof will only be accepted for publication when it has been formally checked.


But, in fact, we do not need to choose between pen-and-paper and mechanically checked proofs, as they can successfully coexist.

Acknowledgement. We thank Frederik Krogsdal Jacobsen for comments on drafts.

References 1. Bachmair, L., Ganzinger, H., McAllester, D.A., Lynch, C.: Resolution theorem proving. In: Robinson, J.A., Voronkov, A. (eds.) Handbook of Automated Reasoning, vol. 2, pp. 19–99. Elsevier and MIT Press (2001) 2. Baltag, A., Renne, B.: Dynamic epistemic logic. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2016 edn. Stanford University, Metaphysics Research Lab (2016) 3. Bentzen, B.: A Henkin-style completeness proof for the modal logic S5. CoRR (2019). https:// arxiv.org/abs/1910.01697 4. Benzmüller, C., Paulson, L.C.: Quantified multimodal logics in simple type theory. Logica Universalis 7(1), 7–20 (2013) 5. Berghofer, S.: First-Order Logic According to Fitting. Archive of Formal Proofs (2007). http://isa-afp.org/entries/FOL-Fitting.html 6. Blanchette, J.C.: Formalizing the metatheory of logical calculi and automatic provers in Isabelle/HOL (invited talk). In: Mahboubi, A., Myreen, M.O. (eds.) Proceedings of the 8th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2019, Cascais, Portugal, 14–15 January 2019, pp. 1–13. ACM (2019) 7. Blanchette, J.C., Popescu, A.: Mechanizing the metatheory of Sledgehammer. In: Fontaine, P., Ringeissen, C., Schmidt, R.A. (eds.) FroCoS 2013. LNCS (LNAI), vol. 8152, pp. 245–260. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40885-4_17 8. Blanchette, J.C., Popescu, A., Traytel, D.: Abstract completeness. Archive of Formal Proofs (2014). https://isa-afp.org/entries/Abstract_Completeness.html. Formal proof development 9. Blanchette, J.C., Popescu, A., Traytel, D.: Unified classical logic completeness. In: Demri, S., Kapur, D., Weidenbach, C. (eds.) IJCAR 2014. LNCS (LNAI), vol. 8562, pp. 46–60. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08587-6_4 10. Blanchette, J.C., Popescu, A., Traytel, D.: Abstract soundness. Archive of Formal Proofs (2017). https://isa-afp.org/entries/Abstract_Soundness.html. Formal proof development 11. 
Blanchette, J.C., Popescu, A., Traytel, D.: Soundness and completeness proofs by coinductive methods. J. Autom. Reason. 58(1), 149–179 (2016). https://doi.org/10.1007/s10817-0169391-3 12. Braselmann, P., Koepke, P.: Gödel’s completeness theorem. Formal. Math. 13(1), 49–53 (2005) 13. Braselmann, P., Koepke, P.: A sequent calculus for first-order logic. Formal. Math. 13(1), 33–39 (2005) 14. Ebner, G., Hetzl, S., Reis, G., Riener, M., Wolfsteiner, S., Zivota, S.: System description: GAPT 2.0. In: Olivetti, N., Tiwari, A. (eds.) IJCAR 2016. LNCS (LNAI), vol. 9706, pp. 293–301. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40229-1_20 15. Fagin, R., Halpern, J.Y., Vardi, M.Y., Moses, Y.: Reasoning about Knowledge. MIT Press (1995) 16. Fitting, M.: First-Order Logic and Automated Theorem Proving. Graduate Texts in Computer Science, 2nd edn. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-2360-3 17. From, A.H.: Epistemic logic. Archive of Formal Proofs (2018). https://isa-afp.org/entries/ Epistemic_Logic.html. Formal proof development


18. From, A.H.: Public announcement logic. Archive of Formal Proofs (2021). https://isa-afp. org/entries/Public_Announcement_Logic.html. Formal proof development 19. From, A.H., Eschen, A.M., Villadsen, J.: Formalizing axiomatic systems for propositional logic in Isabelle/HOL. In: Kamareddine, F., Sacerdoti Coen, C. (eds.) Intelligent Computer Mathematics - 14th International Conference, CICM 2021, Timisoara, Romania, 26–31 July 2021, Proceedings, Lecture Notes in Artificial Intelligence, vol. 12833, pp. 32–46. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81097-9_3 20. From, A.H., Jensen, A.B., Villadsen, J.: Formalized soundness and completeness of epistemic logic. In: LAMAS 2021–11th Workshop on Logical Aspects of Multi-Agent Systems (2021) 21. From, A.H., Lund, S.T., Villadsen, J.: A case study in computer-assisted meta-reasoning. In: Special Session on Computational Linguistics, Information, Reasoning, and AI 2021 (CompLingInfoReasAI 2021), Lecture Notes in Networks and Systems, 18th International Conference Distributed Computing and Artificial Intelligence, vol. 332, pp. 53–63. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86887-1_5 22. Gleißner, T., Steen, A.: The MET: the art of flexible reasoning with modalities. In: Benzmüller, C., Ricca, F., Parent, X., Roman, D. (eds.) Rules and Reasoning - Second International Joint Conference, RuleML+RR 2018, Luxembourg, 18–21 September 2018, Proceedings, Lecture Notes in Computer Science, vol. 11092, pp. 274–284. Springer, Cham (2018). https://doi. org/10.1007/978-3-319-99906-7_19 23. Gleißner, T., Steen, A., Benzmüller, C.: Theorem provers for every normal modal logic. In: Eiter, T., Sands, D. (eds.) LPAR 2021, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, 7–12 May 2017, EPiC Series in Computing, vol. 46, pp. 14–30. EasyChair (2017). https://easychair.org/publications/paper/ 6bjv 24. 
Hagemeier, C.: Formalizing intuitionistic epistemic logic in Coq (2021). https://www.ps.unisaarland.de/~hagemeier/bachelor.php. BSc thesis 25. Hales, T.C., et al.: A formal proof of the Kepler conjecture. Forum Math. Pi 5, 1–29 (2017). https://doi.org/10.1017/fmp.2017.1 26. Harper, R., Honsell, F., Plotkin, G.D.: A framework for defining logics. J. ACM 40(1), 143–184 (1993). https://doi.org/10.1145/138027.138060 27. Herbelin, H., Kim, S.Y., Lee, G.: Formalizing the meta-theory of first-order predicate logic. J. Korean Math. Soc. 54(5), 1521–1536 (2017) 28. Ilik, D.: Constructive completeness proofs and delimited control. Ph.D. thesis. École Polytechnique (2010). https://tel.archives-ouvertes.fr/tel-00529021/document 29. Ilik, D., Lee, G., Herbelin, H.: Kripke models for classical logic. Ann. Pure Appl. Logic 161(11), 1367–1378 (2010) 30. Jensen, A.B.: Towards verifying GOAL agents in Isabelle/HOL. In: ICAART 2021 - Proceedings of the 13th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 345–352. SciTePress (2021) 31. Jensen, A.B., Hindriks, K.V., Villadsen, J.: On using theorem proving for cognitive agentoriented programming. In: ICAART 2021 - Proceedings of the 13th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 446–453. SciTePress (2021) 32. Jensen, A.B., Larsen, J.B., Schlichtkrull, A., Villadsen, J.: Programming and verifying a declarative first-order prover in Isabelle/HOL. AI Commun. 31(3), 281–299 (2018). https:// doi.org/10.3233/AIC-180764 33. Kadziołka, J.: Solution to the xkcd blue eyes puzzle. Archive of Formal Proofs (2021). https:// isa-afp.org/entries/Blue_Eyes.html. Formal proof development 34. Krauss, A.: Defining Recursive Functions in Isabelle/HOL (2021). https://isabelle.in.tum.de/ doc/functions.pdf 35. Li, J.: Formalization of PAL·S5 in proof assistant. CoRR (2020). https://arxiv.org/abs/2012. 09388


36. Manna, Z., Pnueli, A.: Formalization of properties of functional programs. J. ACM 17(3), 555–569 (1970). https://doi.org/10.1145/321592.321606 37. Margetson, J., Ridge, T.: Completeness theorem. Archive of Formal Proofs (2004). http://isaafp.org/entries/Completeness.html. Formal proof development 38. Martinez, M., Sequoiah-Grayson, S.: Logic and information. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Spring 2019 edn., Metaphysics Research Lab, Stanford University (2019) 39. Michaelis, J., Nipkow, T.: Formalized proof systems for propositional logic. In: Abel, A., Forsberg, F.N., Kaposi, A. (eds.) 23rd International Conference on Types for Proofs and Programs, TYPES 2017, 29 May–1 June 2017, Budapest, Hungary, LIPIcs, vol. 104, pp. 5:1–5:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017) 40. Michaelis, J., Nipkow, T.: Propositional proof systems. Archive of Formal Proofs (2017). http://isa-afp.org/entries/Propositional_Proof_Systems.html. Formal proof development 41. Nederpelt, R., Geuvers, H.: Type Theory and Formal Proof: An Introduction. Cambridge University Press (2014). https://doi.org/10.1017/CBO9781139567725 42. Neeley, P.: A formalization of dynamic epistemic logic. Master’s thesis, Carnegie Mellon University (2021). https://paulaneeley.com/wp-content/uploads/2021/05/draft1.pdf 43. Neeley, P.: Results in modal and dynamic epistemic logic: a formalization in Lean. Slides Lean Together Workshop (2021). https://leanprover-community.github.io/lt2021/ slides/paula-LeanTogether2021.pdf 44. Nipkow, T.: Programming and Proving in Isabelle/HOL (2021). https://isabelle.in.tum.de/ doc/prog-prove.pdf 45. Nipkow, T., Wenzel, M., Paulson, L.C. (eds.): Isabelle/HOL – A Proof Assistant for HigherOrder Logic. LNCS, vol. 2283. Springer, Heidelberg (2002). https://doi.org/10.1007/3-54045949-9 46. Paulson, L.C.: Gödel’s incompleteness theorems. Archive of Formal Proofs (2013). 
http:// isa-afp.org/entries/Incompleteness.html, Formal proof development 47. Paulson, L.C.: A machine-assisted proof of Gödel’s incompleteness theorems for the theory of hereditarily finite sets. Rev. Symb. Log. 7(3), 484–498 (2014). https://doi.org/10.1017/ S1755020314000112 48. Paulson, L.C.: A mechanised proof of Gödel’s incompleteness theorems using Nominal Isabelle. J. Autom. Reason. 55(1), 1–37 (2015). https://doi.org/10.1007/s10817-015-9322-8 49. Peltier, N.: A variant of the superposition calculus. Archive of Formal Proofs (2016). http:// isa-afp.org/entries/SuperCalc.shtml, Formal proof development 50. Persson, H.: Constructive completeness of intuitionistic predicate logic. Ph.D. thesis, Chalmers University of Technology (1996). http://web.archive.org/web/20001011101511/ www.cs.chalmers.se/~henrikp/Lic/ 51. Pfenning, F., Schürmann, C.: System description: Twelf — a meta-logical framework for deductive systems. In: CADE 1999. LNCS (LNAI), vol. 1632, pp. 202–206. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48660-7_14 52. Popescu, A., Traytel, D.: A formally verified abstract account of Gödel’s incompleteness theorems. In: Fontaine, P. (ed.) Automated Deduction - CADE 27, pp. 442–461. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29436-6_26 53. Raffalli, C.: Krivine’s abstract completeness proof for classical predicate logic. https://github. com/craff/phox/blob/master/examples/complete.phx (2005, possibly earlier) 54. Reiche, S., Benzmüller, C.: Public announcement logic in HOL. In: Martins, M.A., Sedlár, I. (eds.) Dynamic Logic. New Trends and Applications - Third International Workshop, DaLi 2020, Prague, Czech Republic, 9–10 October 2020, Revised Selected Papers, Lecture Notes in Computer Science, vol. 12569, pp. 222–238. Springer, Cham (2020). https://doi.org/10. 1007/978-3-030-65840-3_14



A Valence Catalogue for Norwegian

Lars Hellan

Norwegian University of Science and Technology (NTNU), N-7491 Trondheim, Norway

Abstract. Essential aspects of a verb’s usage reside in its valence environments. The Norwegian valence resource presented here, called NorVal, has 6,300 verb lemmas. About 3,360 of them are associated with sets of frames, and the organization of entries is divided into one enumeration of the frame-specific entries, about 15,750 in total, and one enumeration of lemmas, counting 6,300. About 300 frame types are distinguished, inducing the 15,750 frame-specific entries and taking into account most grammatical factors distinguishing verb frames and verb-headed construction types. Both the frame types and the two dimensions of entries are represented in string-based formalisms, enabling simple procedures for comparing individual valence frames, frame-specific entries, and entries representing lemmas, and for doing statistics over types and combinations of all of these. The paper illustrates the resources relative to their representation of light reflexives, verb particles, and frames including sentential constituents.

Keywords: Verb valence · Norwegian · Set of valence frames for a lemma (valpod) · Frame-specific lexical entry (lexval) · Labeling code · Object · Indirect object · Oblique · Transitivity · Clausal argument (declarative · interrogative · infinitival) · Verb particle · Secondary predicate · Reflexive · Minimal sentence · Logical form of frame type

1 Introduction

NorVal¹ is a resource representing the valence potential of more than 6,300 verb lemmas of Norwegian. Two features of its formal design reflect the circumstance that some verb lemmas take more than one valence frame. One feature resides in a format of lexical entries consisting of a lemma and one frame, as a pair; such an entry format we call a lexval. The other feature is the construal of the valence of a lemma as formally a set of lexvals, such that a multi-valent lemma is represented by an enumeration of lexvals, in the form ‘, , …’. Such a set representation (with a minimum of one member per set) is called a valpod. The system counts about 15,750 lexvals, and while there are as many valpods as there are lemmas, more than 3,360 valpods are multi-membered.

A further formal feature resides in the representation of valence frame types. About 300 different frame types are recognized as distinguishing between the lexvals, classified according to a modular one-string annotation system called Construction Labeling (CL; cf. Hellan and Dakubu [19], Dakubu and Hellan [11]). It counts types for valence and grammatical functions, relevant for many linguistic areas, and could be compared with ‘Universal Dependency Grammar’² (cf. Marneffe et al. [35]). The CL system is algorithmically co-operative with a typed feature system for syntactic and semantic analysis, as outlined in Hellan [14, 15].

The verb inventory is partly derived from, and consistently kept in sync with, the verb lexicon of the computational grammar NorSource (cf. Hellan and Bruland [20]), a grammar based on the framework Head-Driven Phrase Structure Grammar (‘HPSG’, cf. Pollard and Sag [37]), using a typed feature structure design (cf. Carpenter [4]) and the platform LKB for grammar development (cf. Copestake [5]). The design and content of this verb lexicon were in turn informed at the outset, in 2001, by the lexicons of TROLL (Hellan et al. [22]) and NorKompLex (cf. Nordgård [36]). The verb inventories of NorVal and NorSource have been applied in the online valence corpus Norwegian Valency Corpus (cf. Hellan et al. [25]) and in the comparative online valence resource MultiVal (cf. Hellan et al. [23]), derived from HPSG grammars of Norwegian, Ga, Spanish, and Bulgarian. The NorVal resources as such are currently not accessible online, but a DOI connected to this article, viz. Hellan [17], provides excerpts from the files.

We call the valence resource a ‘catalogue’ rather than a ‘dictionary’ due to its stripped-down formal format, and since, unlike what is normally understood by a dictionary, it does not offer senses as designated features of its lexical entries.

Seen from a theoretical viewpoint, the assembly of a verb’s environments is likely to reflect essential aspects of the verb’s meaning, and many studies, mostly for English, have aligned verb meanings with valence frames and frame alternations; most of them, however, cover only a limited set of the environments that a verb can occur in. NorVal allows for the opposite strategy: with a classification of phenomena which makes limited semantic pre-commitments, it nevertheless enlarges the basis of phenomena that could reflect semantic factors, and thus may enable the identification of factors that have so far not been systematically explored.

In the following, Sect. 2 gives an overview of most of the frame types covered and their encoding in terms of the Construction Labeling system. Sect. 3 outlines and illustrates the format of lexvals and valpods. Sect. 4 illustrates how the catalogue can be used in the analysis of clausal arguments, particles and light reflexives, all of which play a central role in the Norwegian valence system. Section 5 assesses the catalogue relative to issues of redundancy, to the notion of ‘valency class’, and to the prospect of combining valence information on a large scale with sense information. It also compares NorVal in some respects to other valence resources. Section 6 considers possible directions and domains in which the catalogue could be used or further developed.

¹ https://doi.org/10.18710/8U3L2U; https://typecraft.org/tc2wiki/NorVal_resources.

² Universal Dependency Grammar: https://universaldependencies.org/.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 49–104, 2022. https://doi.org/10.1007/978-3-030-90138-7_3
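The lexval and valpod notions introduced in this section can be illustrated with a minimal Python sketch. The tuple encoding, the function names and the placeholder lemmas are assumptions made for illustration only; they are not NorVal's actual file format, and the frame strings merely stand in for frame type labels of the kind described in Sect. 2.

```python
from collections import defaultdict

# A lexval pairs a lemma with exactly one valence frame.
# The lemmas are placeholders; the tuple encoding is an assumption.
LEXVALS = [
    ("lemma_a", "intr"),
    ("lemma_a", "tr"),
    ("lemma_a", "trObl"),
    ("lemma_b", "tr"),
]


def build_valpods(lexvals):
    """Group lexvals by lemma: a valpod is the set of lexvals of one lemma."""
    valpods = defaultdict(set)
    for lemma, frame in lexvals:
        valpods[lemma].add((lemma, frame))
    return dict(valpods)


def multi_membered(valpods):
    """Return the lemmas whose valpod contains more than one lexval."""
    return sorted(lemma for lemma, pod in valpods.items() if len(pod) > 1)


valpods = build_valpods(LEXVALS)
print(len(valpods))             # one valpod per lemma
print(multi_membered(valpods))  # lemmas that take more than one frame
```

Counting over such a structure corresponds to the figures quoted above: the number of valpods equals the number of lemmas, and the multi-membered valpods are those of lemmas with several frames.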


2 Representing Frame Types

We here describe phenomena categorized in NorVal, notions used in the classifications, and their encoding. The terms used are largely rooted in Scandinavian grammatical tradition, of course influenced by Latin grammar, and harmonizing well with the notions connected to ‘valence’ used in Tesnière [41]. The early phases of Generative Grammar, e.g., Ross [40], brought about a wealth of descriptive labels that were soon assimilated also in Scandinavian linguistics. Although the present exposition has little room for weighing terminological alternatives against each other, some essential choices of terminology are commented on.

Norwegian, belonging to the Mainland Scandinavian branch of North Germanic, is an SVO language with strict order among the argument structure constituents. They generally (with well-known provisos not relevant to the present discussion) occur before possible adjuncts. Only personal pronouns have case, for subject vs. non-subject form. Argument structure constituents can be analyzed in terms of traditional grammatical function terms such as ‘subject’ and ‘object’ (the notion ‘object’ being sub-classified as ‘direct’ and ‘indirect’ object when there are two objects), and ‘oblique’ for a prepositional phrase with argument status, the preposition then typically counting as ‘selected’ by the verb. Following the terminology proposed in Marantz [34], subjects and objects (indirect as well as direct object) count as direct arguments, the governed item in an oblique as an indirect argument. The linear order among the arguments is strictly subject – indirect object – direct object – oblique(s). (While the framework of Lexical Functional Grammar (LFG; cf. Bresnan [2]) is one that generally supports the formal use of ‘grammatical function’ notions like those used here, it may be noted that while the constituents of a double object construction in LFG are often referred to as ‘object’ and ‘second object’, we here use the traditional terminology of Norwegian grammars. Unlike the tradition of some grammars, we also reserve the notion ‘indirect object’ for NPs, while the prepositional alternative with til (‘to’) or for is counted as an oblique constituent.)

A further dimension of grammatical classification resides in notions such as ‘transitive’, ‘intransitive’, ‘ditransitive’ and more, which qualify the overall composition of a valence frame (or construction) rather than any of its constituent parts. We call this dimension of notions global relative to the valence frames. Although there are dependencies between which global label a construction type may carry and which grammatical function labels would be used to qualify its constituents, the two dimensions can be used independently and serve different purposes. Our labeling system is designed accordingly, starting with a global label which indicates which types of grammatical functions are realized, and then continuing with constituent-specific labels of the types outlined above when such qualifications are called for. For instance, the global notion transitive will declare the frame as consisting of a subject and an object (i.e., a direct object, since an indirect object only occurs with a direct object), where both have participant status (thus neither being an expletive pronoun), and where both are direct arguments, thus not preceded by a preposition. In the following, we first introduce notions for the classification of arguments, and then the global notions.


2.1 Argument Labels

Annotation labels for the grammatical functions ‘subject’, ‘object’, ‘direct’ and ‘indirect object’, ‘oblique’, and ‘complement’ are, using the annotation system Construction Labeling (‘CL’) mentioned above, su (for ‘subject’), ob (for ‘object’, and for ‘direct object’ when there is an indirect object present), iob (for ‘indirect object’), obl (for ‘oblique’) and comp (for ‘complement’, used for clausal complements not having object status; cf. Dalrymple and Lødrup [12]). Further argument labels are introduced below.

Passive, as a regular grammatical process, applies to both direct and indirect objects (the pattern of so-called ‘symmetrical passive’), and also to the governed NP of oblique constituents. The process is heavily resorted to in defining criteria for grammatical functions, but is not by itself represented in valence frames, patterns induced through passivization processes counting as ‘productively derivable’.

The use of the expletive personal pronoun det is pervasive, especially in subject but also in object position, annotated suExpl and obExpl, respectively. (Holen [26] notes that while for most languages pronoun resolution algorithms will be defined for subject as a first choice, this is not so for Norwegian, given the likelihood that the subject may be an expletive.) Constructions with a subject expletive are commonly divided into impersonals (with no direct argument participant), presentationals (with one or two direct argument participants, of which one, the ‘presented’ participant, is indefinite), and extraposition (where the expletive pronoun can be, metaphorically speaking, seen as ‘holding the place’ of a clausal subject or object). An additional pattern which we call extralinking is like extraposition except that the clause is governed by a preposition. The annotation label for an extraposed clause is expn, as in expnDECL when this is a declarative clause, while the prefix for an extralinked clause is exlnk. The expletive subject in all these patterns obeys standard criteria of subjecthood, also when in embedded clauses (it thus has a status clearly different from the seemingly similar expletive element in German), and therefore regularly carries the grammatical function su.

A widely used construction is that of secondary predication (also called ‘small clauses’, and abbreviated as sc in annotation), where the predicatively functioning phrase, commonly referred to as a predicative, can be an adjective phrase (abbreviated scA), a prepositional phrase (abbreviated scP), a noun phrase (abbreviated scN), or a predicational particle phrase (abbreviated scPredprtcl; see shortly below). They can be predicated of either the subject or the object, referred to as, respectively, subject predicatives and object predicatives. They are, following Jespersen [27, 28], referred to as bound predicatives in that they form part of argument structure, as opposed to free predicatives, which can apply to either subjects or objects and have adjunct status.

Valence frames often include what is commonly referred to as particles, here analyzed as adverbs, and annotated as prtcl, this counting as a grammatical function. Many particles relate to locative and directional adverbs, but carry only bleached versions of such meanings, if at all. Locative and directional adverbs are in many cases homonymous with prepositions. A rule of thumb distinguishing a sequence ‘preposition + NP’ from a sequence ‘particle + NP’ is that the latter can alternate with the sequence in opposite order, subject to conditions pertaining to weight and category of the NP, while for a sequence ‘preposition + NP’ this is impossible. Moreover, we analyze a preposition as necessarily taking a complement, called its governee, while particles take neither complements nor specifiers. Consequently a particle can occur without an NP preceding or following it, while a preposition always has a governed NP following it.

Distinct from ‘particles’ as now discussed, analyzable as adverbs, are so-called predicate particles, in English exemplified by as, in German by als, and in Norwegian typically by som or for, as in Jeg anser det som håpløst ‘I regard it as hopeless’ or in Jeg anser det for å være håpløst ‘I regard it as being hopeless’. They are close to both prepositions and complementizers, and their typical behavior as part of predicatives is what motivates the label predicate particles, in that they serve to mediate a predication, their complement being a predicate. In contrast, a predicatively used adverb is the predicate.

By an ‘NP’ we understand a phrase headed by a noun or a personal pronoun, in the limiting case consisting of the noun only, and also including cases where a quantifier or an adjective occurs in a position standardly held by a noun and is arguably ‘derived’ into a noun use. When a standard NP position is held by an NP, we make no annotation for it, except when it occurs as predicative in a small clause (see above) or copula construction, or as governee of a preposition, which we annotate oblN.

Partly overlapping with positions where NPs occur are various types of sentential embedded constructions such as declarative clauses (annotated DECL as in suDECL, obDECL, oblDECL, etc. for declarative embedded clauses, and correspondingly for the other types), infinitival clauses (annotated with different labels according to control relations, see below) and interrogative clauses (INTERR), the latter sometimes distinguished as ‘yes-no’ clauses, corresponding to those introduced with ‘whether’ in English, and constituent-wh-clauses (annotated INTERRyn and INTERRwh, respectively). Concerning the configuration annotated as oblDECL, and similarly for the other clause types, it is to be noted that Norwegian, like other Scandinavian languages, can freely embed all kinds of clauses under prepositions (English, in contrast, here uses a gerund or another more nominal type of construct).

Pervasive throughout the Indo-European languages is what we call the light reflexive pronoun, which in Norwegian has the form seg in the third person, and the same form as the non-subject uses of the other persons and numbers. The label ‘light’ is to contrast it with the form seg selv, which can always be replaced by an NP, whereas for seg this is not always possible (as in Ola skammer seg ‘Ola is ashamed’; see Sect. 4). The light reflexive can occur as indirect and direct object (abbreviated iobRefl and obRefl, respectively) and as a governee of a preposition, either in an oblique constituent (abbreviated oblRefl) or in a PP serving as object predicative (abbreviated scPPrefl). Constructions such as Ola skammer seg (‘Ola is ashamed’) and the presentational Det setter seg en katt her (‘There seats itself a cat here’) are not uncommonly seen as intransitive, but the status of seg as direct object in these constructions is argued for in Hellan and Beermann [21] and Hellan [16].


Some specifications have semantic impact, including the following:

– A verb expressing movement of its subject or object has the respective argument specifications suDir and obDir.

– An oblique expressing a location is marked oblLoc. While in other constructions classified as oblique there is a preposition counting as ‘selected’ by the verb, in locative oblique constructions any locative preposition can occur, the argument dependence here residing in the verb requiring a locative specification. Locative adverbs in such constructions also count as oblique.

– In the formation of ‘possessor raising’, where the object expresses the ‘possessor’ and the oblique expresses the area ‘possessed’, as in hun stryker ham over ansiktet (‘she strokes him over the face’), the oblique is marked oblPRTOFob for ‘the oblique is part of the object’ (in the sense that, in the example, the face is part of him). The subject can also be the ‘possessor’.

– The specification of a predicative gives two aspects of semantic information, namely of which argument it is predicated, and whether that argument is at the same time semantically an argument of the verb. The cases where it is semantically an argument of the verb are marked scSuArg or scObArg, where ‘sc’ is for ‘secondary predicate’, the interspersed ‘Su’ or ‘Ob’ indicates whether the predicate is predicated of the subject or the object, and ‘Arg’ indicates semantic connection both to the verb and to the predicative. The cases where it is not semantically an argument of the verb (referred to as ‘non-argument’) are marked as, respectively, scSuNrg and scObNrg, where the interspersed ‘Su’ and ‘Ob’ are as above, and ‘Nrg’ indicates the lack of a semantic connection to the verb. Thus, in a construction like He sang the room empty, the status as ‘empty’ ascribed to the object comes from the secondary predicate, not from the verb (even though the grammatical function as object here obtains relative to the verb). The label ‘scObNrg’ describes such a constellation, literally meaning ‘secondary predicate predicated of a non-argument object’.

– A further marking of a predicative can be used when the property it expresses is caused (as in the above case He sang the room empty), namely a suffix Csd at the end of the specification pattern mentioned above, thus a label scObNrgCsd.

– When an infinitival clause is ‘controlled’, i.e., has its understood subject interpreted as identical to a constituent in the matrix clause, this control status is marked by ‘Eq’ (for ‘equi-NP-control’), as opposed to ‘Abs’ when no such identity is understood; the ‘Eq’ mark is followed by an identification of which of the verb’s arguments is the controller (which can be subject, indirect object, direct object or oblique), and finished with ‘Inf’. An example is obEqSuInf, meaning ‘object consisting of an infinitival clause equi-controlled by the matrix subject’. An infinitival object not controlled is marked obAbsinf.

– For ‘extraposition’ constructions, where an ‘extraposed’ clause is linked to an expletive det, there is a need to indicate whether the clause is linked to subject or object function. Examples are, respectively, det koster henne mye krefter å slåss alene (‘it costs her much effort to fight alone’), where the infinitival clause serves as ‘logical subject’, and de overlot det til meg å finne en løsning (‘they left it for me to find a solution’), where the infinitival clause serves as ‘logical object’. We indicate these linking-directions in the ‘global’ label rather than infixing them in the label initiated by expn, since the specification of control status of an infinitive constituting the extraposed clause may already require a linking-direction. Thus, for det koster henne mye krefter å slåss alene (‘it costs her much effort to fight alone’), where the extraposed clause å slåss alene is a controlled infinitive with the indirect object henne (‘her’) as controller, the relevant label for indicating the infinitival control is expnEqIobInf, whereby a ‘linking-director’ is already present in the expn specification. To avoid confusion, therefore, the label for indicating the ‘logical’ role of the extraposed infinitive (as ‘logical subject’) is included in the global label (here ditrExpnSu, so that the full frame specification becomes ditrExpnSu-obMeas-expnEqIobInf), rather than in the expn specification.

The formalism also allows for more explicit semantic information (as outlined in Hellan [14, 15]), but these facilities are currently put to minimal use, the closest being marking for aspectual values as exemplified in the discussion of spise ‘eat’ in the next section.

We summarize the argument labels now mentioned in Table 1 below, where we indicate explicitly the logical role of the various label components, according to their grammatical function (GF), whether they constitute the GF in full or are embedded inside the constituent realizing the GF, semantic role of the GF, dependency target of the GF (predicated of or controlled by), and semantic argument status of the GF (dependent of the verb or not). The total number of argument labels is near 80, thus many more than listed here, but most of the principles of their internal composition are reflected in this table.

2.2 Global Labels

Notions such as ‘transitive’, ‘intransitive’, ‘ditransitive’, and more, qualify the overall composition of a valence frame (or construction) rather than any of its constituent parts. We call this dimension of notions global relative to the valence frames, and the labels reflecting them are the global labels. The formal role of a global label is to declare which grammatical functions are realized in a given construction, and in some cases it also declares the participant structure of the frame and its linking to the grammatical functions. The simplest global labels in these respects are the following:

(1) intr – one participant, grammatical function: su
    tr – two participants, grammatical functions: su and ob
    ditr – three participants, grammatical functions: su, iob, and ob
    impers – no participants, grammatical function: su (expletive)


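Table 1. Labels for arguments (column 1) and decomposition of the labels (other columns)

| Label | GF | Carrier of the GF | Embedded in PP carrying the GF | Semantic role or function | Target of dependency | Sem-arg status of target |
|---|---|---|---|---|---|---|
| suExpl | su | Expl | | | | |
| obExpl | ob | Expl | | | | |
| expnDECL | expn | DECL | | | | |
| exlnkDECL | exlnk | DECL | | | | |
| prtcl | prtcl | | | | | |
| oblN | obl | | N | | | |
| suDECL | su | DECL | | | | |
| obDECL | ob | DECL | | | | |
| oblDECL | obl | | DECL | | | |
| iobRefl | iob | Refl | | | | |
| obRefl | ob | Refl | | | | |
| oblRefl | obl | | Refl | | | |
| scPPrefl | sc | | Refl | | | |
| suDir | su | | | Dir | | |
| obDir | ob | | | Dir | | |
| oblLoc | obl | | | Loc | | |
| oblPRTOFob | obl | | | PRTOFob | | |
| scSuNrg | sc | | | | Su | Nrg |
| scObNrg | sc | | | | Ob | Nrg |
| scObNrgCsd | sc | | | Csd | Ob | Nrg |
| obEqSuInf | ob | Inf | | Eq | Su | |
| expnEqIobInf | expn | Inf | | Eq | Iob | |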
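The decomposition shown in Table 1 suggests that an argument label can be segmented mechanically into a grammatical-function prefix and a sequence of further components. The following Python sketch illustrates this under stated assumptions: the prefix list is limited to the GF labels named in this section, and the rule that each component starts with upper-case letters and continues in lower case is a segmentation heuristic adopted here, not a definition taken from the CL system.

```python
import re

# GF prefixes named in this section. Longer prefixes are listed before
# shorter ones that they contain, so that "obl" is tried before "ob".
GF_PREFIXES = ("exlnk", "prtcl", "comp", "expn", "iob", "obl", "ob", "sc", "su")

# Assumed segmentation rule: a component is one or more upper-case letters
# followed by any lower-case letters, e.g. "Eq", "DECL", "PRTOFob".
COMPONENT = re.compile(r"[A-Z]+[a-z]*")


def split_argument_label(label):
    """Split an argument label into (GF prefix, list of components)."""
    for gf in GF_PREFIXES:
        if label.startswith(gf):
            rest = label[len(gf):]
            components = COMPONENT.findall(rest)
            # Accept only if the components cover the remainder exactly.
            if "".join(components) == rest:
                return gf, components
    raise ValueError(f"unrecognized argument label: {label}")


for example in ("obEqSuInf", "scObNrgCsd", "oblPRTOFob", "prtcl"):
    print(example, split_argument_label(example))
```

For obEqSuInf the sketch yields the prefix ob and the components Eq, Su and Inf, matching the reading ‘object consisting of an infinitival clause equi-controlled by the matrix subject’ given above. Which column of Table 1 each component belongs to is not decided by the sketch.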

More complex global labels are built with the above symbols as initial parts, followed by further symbols indicating further aspects of frame structure. As initial symbols, the above symbols retain their grammatical function contributions as in (1), possibly with further constituents added, while the semantic linking is defined anew for each more complex global label. Examples of such complex global labels, with their grammatical functions, are listed below; included here are also two global labels used for copulas, one with the pattern ‘copX’ for predicative use, and one with the pattern ‘copIdX’ for identity predication (‘X’ ranging over ‘Adj’, ‘PP’, ‘N’ etc. in the first case, and over ‘N’ and clausal arguments in the second):


(2) intrObl – grammatical functions: su and obl
    trObl – grammatical functions: su, ob and obl
    impersObl – grammatical functions: su (expletive) and obl
    intrScpr – grammatical functions: su and sc
    trScpr – grammatical functions: su, ob and sc
    intrPresnt – grammatical functions: su (expletive) and pres
    trPresnt – grammatical functions: su (expletive), ob and pres
    intrExpn – grammatical functions: su (expletive) and expn
    trExpnSu – grammatical functions: su (expletive), ob and expn
    trExpnOb – grammatical functions: su (expletive), ob and expn
    copAdj – grammatical functions: su and sc
    copIdN – grammatical functions: su and id

We first comment on the suffix ‘Obl’. A two-participant construction where the non-subject participant is expressed by an oblique (as in I rely on Mary) is called intransitive oblique, abbreviated intrObl. We thus reserve the notion ‘transitive’ for configurations where there is a formal object serving as a direct argument. The construction type which in the present system is called transitive oblique, abbreviated trObl, thus has a formal object and in addition an oblique constituent (as in I tell him about Mary). These labeling conventions are well rooted in general and typological linguistics; however, there are also conventions that would favor a notion like ‘transitive oblique’ as applying to what we here call ‘intransitive oblique’ (as in I rely on Mary). Using the term ‘transitive’ here could be seen as anchoring the notion of transitivity more in the semantic binary relation expressed than in the formal pattern. The present use of ‘transitive’ can, however, also be seen as semantically grounded, taking as a ‘prototypically’ transitive relation one where force emanates from one participant targeted at another participant, and counting the formal configuration generally used in the language for expressing such a relation as the grammatical transitivity notion (cf. Creissels [7]). These conventions may be inter-translatable, but since they are strict opposites in their literal coding, care must be taken in observing the difference.

We then note cases where the semantic linking for the labels in (1) does not carry over to global labels where they are prefixes. Thus, although as lone-standing labels intr and tr have a participant subject, in intrPresnt and trPresnt the subject is an expletive; likewise in the extraposition labels. A similar point holds for the global label trScpr, which declares the syntactic frame as su-ob-sc, while the semantic status of the object depends on whether it is tied to the frame-bearing verb or only to the secondary predicate, the latter indicated with the infix Nrg (for ‘non-argument’) in the label for the secondary predicate; cf. the discussion above concerning He sang the room empty. The same holds for the global label intrScpr, which declares the syntactic frame as su-sc, and where the semantic status of the subject depends on whether it is tied to the frame-bearing verb or only to the secondary predicate, the latter as in He seems happy. (In both cases, the role of secondary predicate can be held by an infinitive, construction types often referred to as raising constructions.)
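The way the complex global labels extend the simple ones in (1) can be sketched in Python as follows. Reading each label as a base plus a suffix, and the function names used, are assumptions made for this sketch on the basis of the examples in (2); they are not an official specification of the CL system, and only the labels listed in this section are covered.

```python
# GFs declared by the simple global labels in (1).
BASES = {
    "intr": ["su"],
    "tr": ["su", "ob"],
    "ditr": ["su", "iob", "ob"],
    "impers": ["su"],
}

# GF added by each suffix seen in (2); an assumption of this sketch.
SUFFIX_GF = {
    "Obl": "obl", "Scpr": "sc", "Presnt": "pres",
    "Expn": "expn", "ExpnSu": "expn", "ExpnOb": "expn",
}

# Suffixes whose frames have an expletive subject.
EXPLETIVE_SUFFIXES = {"Presnt", "Expn", "ExpnSu", "ExpnOb"}


def split_global_label(label):
    """Return (base, suffix) for a non-copula global label."""
    for base in BASES:
        suffix = label[len(base):]
        if label.startswith(base) and (suffix == "" or suffix in SUFFIX_GF):
            return base, suffix
    raise ValueError(f"unrecognized global label: {label}")


def declared_gfs(label):
    """List the grammatical functions that a global label declares."""
    if label.startswith("copId"):  # identity predication, e.g. copIdN
        return ["su", "id"]
    if label.startswith("cop"):    # predicative copula, e.g. copAdj
        return ["su", "sc"]
    base, suffix = split_global_label(label)
    extra = [SUFFIX_GF[suffix]] if suffix else []
    return BASES[base] + extra


def has_expletive_subject(label):
    """True for impersonal, presentational and extraposition labels."""
    if label.startswith("cop"):
        return False
    base, suffix = split_global_label(label)
    return base == "impers" or suffix in EXPLETIVE_SUFFIXES


def has_formal_object(label):
    """True if the label declares the GF ob (the formal notion of transitivity)."""
    return "ob" in declared_gfs(label)


for example in ("intrObl", "trObl", "trPresnt", "copIdN"):
    print(example, declared_gfs(example), has_formal_object(example))
```

On this reading, intrObl declares no ob and so does not count as transitive in the formal sense adopted here, while trObl does. The subject of trPresnt comes out as expletive even though the lone-standing tr has a participant subject.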


As a matter of ‘default’ convenience, in the characterization of frames where direct arguments are NPs (in the formal sense mentioned above), no indication is given in the argument labels to this effect; this thus holds for subjects, both types of objects and for ‘presented’ NPs in presentationals (thus, there are no argument labels ‘suN’, ‘obN’, etc.). However, in a ‘non-default’ transitive construction like Who comes first will decide whether we leave, subject and object will have a specification, the full specification of the frame in such a case being tr-suINTERR-obINTERR.

We summarize the factors represented in global labels in Table 2 below, these being first the grammatical functions they declare, then the semantic participant structures they represent, then the status of the subject as ‘full’ or ‘expletive’, then a parameter left open at the global level but specified in an argument label, viz., the status of an NP of which a predicative is predicated relative to the matrix verb, and finally the ‘target’ marking for an extraposed clause. In the column for number of participants, propositional participants are marked as ‘prop’: as ‘opt(ionally) prop’ when the GF per se can be non-propositional, as with the GF of most declarative, interrogative and infinitival clauses, and simply as ‘prop’ for secondary predicates, since these always constitute the predicate of a proposition.

2.3 Global and Argument Labels Together

In the frame type specifications, the DTD of the combinations sets the global label first, followed by argument labels, ordered with subject specifications first, then indirect object, then direct object, etc., so that any combination of labels has a unique internal order. The system has about 60 global labels and 80 labels for specification of arguments, but the main factors are still represented in the frame type specifications in Tables 1 and 2. These tables also show how these factors can be recognized in the ‘morphology’ of label types. Information mining in NorVal can therefore target even single ‘morphs’ within the individual labels.

This notwithstanding, not all aspects of argument structure information are formally represented in the code. For instance, the global label tr does not itself say how the GFs su and ob are linked to the two participants indicated. This is not necessary for information extraction where one knows how tr is to be interpreted; still, explicitness is useful, and in a type-theoretic underpinning of the CL system presented in Hellan [14] (cf. Carpenter [4], Loukanova [32] for background), all global and argument labels are defined as types relative to a grammatical system where semantic linking is also made explicit. For instance, the type tr here has a type specification illustrated in (3), which includes the semantic linking of subject and object into a semantic space using semantic ‘actant’ notions:

(3)

A Valence Catalogue for Norwegian


Table 2. Global labels with GF-declarations and semantic content

| Global label | GFs declared | Semantic arguments of the verb | Subject explet | Predication target (Nrg = not sem. arg of verb) | Correlate of ‘extraposed’ clause |
|---|---|---|---|---|---|
| intr | su | 1 (opt prop) | | | |
| tr | su, ob | 2 (opt 1 or 2 prop) | | | |
| ditr | su, iob, ob | 3 (opt 1 or 2 prop) | | | |
| impers | su | 0 | x | | |
| intrObl | su, obl | 2 (opt 1 or 2 prop) | | | |
| trObl | su, ob, obl | 3 (opt 1, 2, or 3 prop) | | | |
| impersObl | su, obl | 1 | x | | |
| intrScpr | su, sc | 1 (prop) | | su Nrg | |
| intrScpr | su, sc | 2 (1 prop) | | su Arg | |
| trScpr | su, ob, sc | 2 (1 prop) | | ob Nrg | |
| trScpr | su, ob, sc | 3 (1 prop) | | ob Arg | |
| intrPresnt | su, pres | 1 | x | | |
| trPresnt | su, ob, pres | 2 | x | | |
| intrExpn | su, expn | 1 (prop) | x | | ‘logical subject’ |
| trExpnSu | su, ob, expn | 2 (1, opt 2, prop) | x | | ‘logical subject’ |
| trExpnOb | su, ob, expn | 2 (1, opt 2, prop) | X | | ‘logical object’ |
| copAdj | su, sc | 1 (opt prop) | | | |
| copIdN | su, id | 2 (opt 1 prop) | | | |

Moreover, relative to this system, also the hyphens between labels in the CL string have an interpretation, namely as unification operations. In the present setting, though, these type-theoretic aspects will not be discussed, since our present concern is how the frame types as enumerated in the CL formalism can be used in defining a valence catalogue. A first step is to identify exactly those combinations of global labels and argument labels which correspond to distinguishable valence frames in the language. The table in Appendix 1 shows combinations amounting to about 300 valence type specifications relevant for Norwegian.


3 Lexvals and Valpods

3.1 Lexvals

Reflecting the circumstance that a lemma can occur in more than one valence environment, we define a lexval as the combination of a lemma and one of its valence environments. The practical format for notation of lexvals is exemplified in (4), with the lemma occurring before the underline and the frame type specification after (ditr-iobRefl-obINTERRyn being one of the 300 frame types defined in the CL code):

(4) undre__ditr-iobRefl-obINTERRyn

This expresses that the verb lemma undre (‘wonder’) can occur in a ditransitive frame with a reflexive indirect object and a yes-no-interrogative clause as direct object, an example being Hun undrer seg hvorvidt vi kommer (‘She wonders whether we are coming’). When the verb selects a preposition or particle, this is represented as in (5), as exemplified in Hun lurer på om vi kommer (‘She wonders whether we are coming’), where på is a selected preposition:

(5) lure-på__intrObl-oblINTERRyn

The ‘selected’ preposition is represented in form as hyphenated to the lemma, and by category as indicated through the label part Obl. Again the general frame type follows the underline, while the lexically specific information precedes it. In both of these cases the formal structure is that of an ordered pair, as represented more explicitly in (6a) and (6b):

(6)

a.

b.

In both cases the lemma as such is the first member, and the valence frame is the second member. In (6b), the valence frame in question for lure is divided into the lexically specific information to the left of the underline and the general frame type to the right. In both cases the formal connection between logical format and practical format is clear.
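The ordered-pair reading of the practical format can be made concrete with a small parser. The following is only a minimal sketch: the names `Lexval` and `parse_lexval` are hypothetical and not part of the catalogue, and the sketch assumes that the lemma itself contains no hyphen and that every hyphen after the underline separates labels.

```python
from typing import NamedTuple, Optional, Tuple


class Lexval(NamedTuple):
    """Illustrative container; field names are assumptions, not catalogue terms."""
    lemma: str                    # e.g. "lure"
    selected: Optional[str]       # selected preposition or particle, e.g. "på"
    global_label: str             # e.g. "intrObl"
    arg_labels: Tuple[str, ...]   # e.g. ("oblINTERRyn",)


def parse_lexval(text: str) -> Lexval:
    """Split 'lemma[-selected]__global[-arg]*' into its parts."""
    # Lexically specific information precedes the underline,
    # the general frame type follows it.
    head, frame = text.split("__", 1)
    lemma, _, selected = head.partition("-")
    global_label, *args = frame.split("-")
    return Lexval(lemma, selected or None, global_label, tuple(args))


example_4 = parse_lexval("undre__ditr-iobRefl-obINTERRyn")
example_5 = parse_lexval("lure-på__intrObl-oblINTERRyn")
```

Here `example_4` has no selected item, while `example_5` carries `på` as its selected preposition, mirroring the difference between (4) and (5).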


3.2 Valpods

To illustrate the notion of valpod, consider the following set of lexvals, which is the complete set of lexvals for the lemma spise ‘eat’ (the examples with translations are not part of the lexvals):

(7) spise__intr (Ex.: de spiser ‘they eat’)
    spise-av__intrObl-oblN-ACTIVITY (Ex.: hun spiser av vellingen ‘she eats of the porridge’)
    spise-på__intrObl-oblN-ACTIVITY (Ex.: hun spiser på brødstykket ‘she eats of the bread’)
    spise__tr (Ex.: de spiser kjøttet ‘they eat the meat’)
    spise-innpå__trObl-obRefl-oblN (Ex.: hun spiser seg innpå ham ‘she eats herself onto him’ (= ‘she shortens the distance to him’))
    spise-opp__trPrtcl (Ex.: hun spiser opp grøten ‘she eats up the porridge’)
    spise__trScpr-obRefl-scObNrgCsd-scPred (Ex.: hun spiser seg frisk ‘she eats herself healthy’)
    spise__trScpr-scObNrgCsd-scPred (Ex.: hun spiser tallerkenen tom ‘she eats the plate empty’)
    spise-i__trScpr-scPPrefl (Ex.: hun spiser i seg maten ‘she gobbles the food into her’)

The valpod of spise is construed not as a simple enumeration of the lines in (7), but through abstracting out the lemma, and listing all the lexvals with a variable ‘V’ in the place of the lemma, with the lemma itself outside the set representation, as shown in (8):

(8) spise:{V__intr & V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl & V__trScpr-scObNrgCsd-scPred & V__trScpr-obRefl-scObNrgCsd-scPred & V-i__trScpr-scPPrefl}

Formally speaking, the valpod of spise is thus an ordered pair consisting of the lemma and the abstraction part ‘:{…}’. The abstraction itself we call a valpod type. Thus, the valpod type associated with spise is (9) (it may be noted that a valpod, and thus also a valpod type, is written on one line).


(9) :{V__intr & V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl & V__trScpr-scObNrgCsd-scPred & V__trScpr-obRefl-scObNrgCsd-scPred & V-i__trScpr-scPPrefl}

The valpod type is in principle a set (where the prefixed ‘:’ corresponds to a lambda operator inducing the characteristic function of the set) and may intersect with valpod types of other lemmas, or be in super- or subset relations to them. The valpod type being a set, the order of the elements in the list defining the set is in principle not essential. For convenience, however (for the purpose of by-eye observation, or for string matching for multiple lexvals), certain ordering conventions are observed (largely reflecting alphabetic order). Relative to global label, the ordering relative to the initial part of the label is cop… > ditr… > impers… > intr… > tr…. Within each set of lexvals sharing the initial part, lexvals with this part as lone-standing global label are ordered first, followed by lexvals where the global label is complex, according to the alphabetic order of the suffixes attached, so that tr comes before trObl, which in turn comes before trPrtcl, etc. Corresponding conventions apply to the ordering of argument labels. Within each type of sequence thereby obtained, if lexvals have selected items, then they are alphabetically ordered according to the selected items, so that, e.g., V-av__intrObl-oblN-ACTIVITY precedes V-på__intrObl-oblN-ACTIVITY.

As mentioned, the number of multi-membered valpods is more than 3000, and it is probably among these that most questions for information extraction will be formulated. Given the set design of the valpods, it is envisaged that recognized techniques of information extraction over sets can be applied. Given at the same time the ordering conventions in the valpod representations just outlined, and the strict DTDs governing the composition of valence frame types (cf. Sect. 2.3), it is clear that the design of the lexval and valpod inventories is also amenable to inspection ‘by eye’ and to search over defined strings.
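Since a valpod type is a set, the subset and intersection relations mentioned above map directly onto ordinary set operations. The following minimal sketch parses the one-line notation of (8) into a Python `frozenset`; the function name `parse_valpod` and the second valpod type `toy_type` are invented for illustration and do not come from the catalogue.

```python
from typing import FrozenSet, Tuple


def parse_valpod(text: str) -> Tuple[str, FrozenSet[str]]:
    """Split 'lemma:{A & B & ...}' into the lemma and its valpod type (a set)."""
    lemma, _, body = text.partition(":")
    members = body.strip().strip("{}").split("&")
    return lemma.strip(), frozenset(m.strip() for m in members if m.strip())


# The valpod of spise, transcribed from (8).
SPISE = (
    "spise:{V__intr & V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY"
    " & V__tr & V-innpå__trObl-obRefl-oblN & V-opp__trPrtcl"
    " & V__trScpr-scObNrgCsd-scPred & V__trScpr-obRefl-scObNrgCsd-scPred"
    " & V-i__trScpr-scPPrefl}"
)

lemma, spise_type = parse_valpod(SPISE)

# Hypothetical valpod type of some other lemma, invented for this example.
toy_type = frozenset({"V__intr", "V__tr"})

is_subset = toy_type <= spise_type   # subset relation between valpod types
shared = toy_type & spise_type       # intersection of valpod types
```

Because the members are stored as a set, the written order of the lexvals plays no role in these comparisons; the ordering conventions matter only for string matching and inspection by eye.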
3.3 Further Illustration

With 13 members, the valpod of the lemma bry (‘bother, concern’) has slightly more members than that of spise, and a rather different valpod type:

(10) Valpod for bry (‘bother, care, concern’): bry:{V__tr & V__tr-obRefl & V-med__trObl-oblAbsinf & V-med__trObl-oblDECL & V-med__trObl-oblINTERR & V-med__trObl-obRefl-oblAbsinf & V-med__trObl-obRefl-oblDECL & V-med__trObl-obRefl-oblINTERR & V-med__trObl-obRefl-oblN & V-om__trObl-obRefl-oblDECL & V-om__trObl-obRefl-oblEqObInf & V-om__trObl-obRefl-oblINTERR & V-om__trObl-obRefl-oblN}

Exemplifications of the lexvals are given in (11a) below, and some of the lexval labels are paraphrased in (11b):

(11) a. bry__tr (Ex.: de bryr dem ‘they bother them’)
        bry__tr-obRefl (Ex.: hun bryr seg ‘she cares’)
        bry-med__trObl-oblAbsinf (Ex.: vi bryr dem med å ta opp store spørsmål ‘we bother them by raising big questions’)
        bry-med__trObl-oblINTERR (Ex.: vi bryr dem med hva som skal gjøres ‘we bother them with what is to be done’)
        bry-med__trObl-oblN (Ex.: vi bryr dem med spørsmål ‘we bother them with questions’)
        bry-med__trObl-obRefl-oblAbsinf (Ex.: de bryr seg med å avhjelpe nød i landet ‘they care to counteract need in the country’)
        bry-med__trObl-obRefl-oblDECL (Ex.: de bryr seg ikke med at det gjenstår oppgaver ‘they don’t care that there remain tasks’)
        bry-med__trObl-obRefl-oblINTERR (Ex.: de bryr seg med hva som blir sagt ‘they care about what gets said’)
        bry-med__trObl-obRefl-oblN (Ex.: hun bryr seg med problemene ‘she cares about the problems’)
        bry-om__trObl-obRefl-oblEqObInf (Ex.: de bryr seg om å snakke med ofrene ‘they care about speaking with the victims’)
        bry-om__trObl-obRefl-oblDECL (Ex.: de bryr seg ikke om at det ble liggende igjen noen kasser ‘they don’t care that some boxes were remaining’)
        bry-om__trObl-obRefl-oblINTERR (Ex.: de bryr seg om hva som blir sagt ‘they care what is being said’)
        bry-om__trObl-obRefl-oblN (Ex.: han bryr seg om henne ‘he cares about her’)


     b. V__tr-obRefl: transitive where the object is a light reflexive pronoun
        V-med__trObl-oblAbsinf: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an infinitival clause with arbitrary control of the subject
        V-med__trObl-oblINTERR: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an interrogative clause (whether a yes-no interrogative or a constituent wh-interrogative)
        V-med__trObl-oblN: transitive plus an oblique PP where the selected preposition is med, and the preposition governs an NP
        V-med__trObl-obRefl-oblAbsinf: transitive plus an oblique PP where the object is a light reflexive pronoun, the selected preposition is med, and the preposition governs an infinitival clause with arbitrary control of the subject
        V-med__trObl-obRefl-oblDECL: transitive plus an oblique PP where the object is a light reflexive pronoun, the selected preposition is med, and the preposition governs a declarative clause

Although the differentiating parameters in (10) may seem like rather minor and almost mechanical twists around the basic patterns bry med and bry om, there is no automatism in the availability of these patterns: most combinations of a verb with a selected preposition take just a nominal governee, and once a clausal governee is allowed, it does not follow that all three types of clausal complements are allowed. Indeed, the patterns with seg om in (10) appear only with three other verbs, viz. forklare (‘explain’), forsikre (‘ensure’) and overbevise (‘convince’). The full set of patterns with med instantiated here obtains only with sammenholde (‘align, compare’), and those with seg med with no other verbs at all. Needless to say, the whole valpod type in (10) is unique to bry.

For a ‘brief-consultation’ dictionary entry for bry, it might still be felt that the features ‘regular NP vs. light reflexive’ as object, and ‘om vs. med’ as selected preposition, will suffice as information, the further possibilities being inferable from the meaning of bry and the meanings of om and med. While these possibilities may well be ‘inferable’ for a person, there exists at this point no formal mechanism for such inferences, let alone a formal repository of word senses from which such an inference could be made. On the contrary, what a meticulous listing of environments such as the present one may contribute to is an ‘extensional’ circumscription of verb senses, by which one may get closer to defining such possible ‘inferences’.

Illustrating the point is the circumstance that the valpods of spise and bry have only one lexval in common. Furthermore, nine of the frames for bry contain a clausal argument, against none for spise; all of these clausal arguments are introduced by a preposition. Obviously this must reflect something about the meanings of the verbs. To get closer to a diagnosis of ‘what’ such a meaning-valence connection may reflect, valpods with the amount of detail displayed in (10) and (7) are required.
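The comparison between spise and bry can be replayed as a set computation. The sketch below is illustrative only: the member lists are transcribed by hand from (8) and (10) and cross-checked against (11a), and the regular expression `CLAUSAL` is an assumed approximation of which argument labels count as clausal (DECL, INTERR, Absinf and controlled infinitives such as EqObInf), not a definition taken from the catalogue.

```python
import re

SPISE = {
    "V__intr", "V-av__intrObl-oblN-ACTIVITY", "V-på__intrObl-oblN-ACTIVITY",
    "V__tr", "V-innpå__trObl-obRefl-oblN", "V-opp__trPrtcl",
    "V__trScpr-scObNrgCsd-scPred", "V__trScpr-obRefl-scObNrgCsd-scPred",
    "V-i__trScpr-scPPrefl",
}
BRY = {
    "V__tr", "V__tr-obRefl",
    "V-med__trObl-oblAbsinf", "V-med__trObl-oblDECL", "V-med__trObl-oblINTERR",
    "V-med__trObl-obRefl-oblAbsinf", "V-med__trObl-obRefl-oblDECL",
    "V-med__trObl-obRefl-oblINTERR", "V-med__trObl-obRefl-oblN",
    "V-om__trObl-obRefl-oblDECL", "V-om__trObl-obRefl-oblEqObInf",
    "V-om__trObl-obRefl-oblINTERR", "V-om__trObl-obRefl-oblN",
}

# Assumed pattern for clausal argument labels (an approximation).
CLAUSAL = re.compile(r"DECL|INTERR|Absinf|Eq\w*Inf")


def frame_of(lexval_type: str) -> str:
    """Return the frame type, i.e. the part after the double underline."""
    return lexval_type.split("__", 1)[1]


def clausal(valpod_type):
    """Members of a valpod type whose frame contains a clausal argument."""
    return {t for t in valpod_type if CLAUSAL.search(frame_of(t))}


shared = SPISE & BRY
bry_clausal = clausal(BRY)
spise_clausal = clausal(SPISE)
# A selected preposition shows up as 'V-<prep>' before the underline.
all_prepositional = all(t.startswith("V-") for t in bry_clausal)
```

Under these assumptions, `shared` contains only `V__tr`, `bry_clausal` has nine members, `spise_clausal` is empty, and every clausal frame of bry carries a selected preposition, in line with the observations in the text.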


4 Using the Resource

Any phenomenon labeled in the resource can be efficiently searched relative to the construction types in which it occurs, and with regard to items it may contain, syntagmatically or as represented by labeled features. Thus, discoveries can be made regarding in how many patterns something thought of as ‘one thing’ can actually appear. One can also focus on larger types of assemblies as within valpods and see patterns regarding what they contain and thus kinds of phenomena that appear together. Where labels have been assigned but categorization is nevertheless in doubt or subject to revision, one can efficiently get an overview of relevant items. Regarding generalizations, the resource allows for ‘frame-predictions’ and ‘sense predictions’ at large. This section illustrates these aspects.

4.1 Clausal Arguments

Among the totality of the 15,700 lexvals in the resource, 1140 lexvals contain an argument specified with DECL as a defining label, 849 lexvals contain an argument specified with INTERR as a defining label, 1066 lexvals contain an argument specified as a controlled infinitive, and 267 lexvals contain an argument specified as an absolute infinitive, meaning that more than 3000 lexvals, or about 20% of the lexvals, contain such an argument. The distributions are rendered in the following tables (Tables 3, 4, 5, 6 and 7).

Table 3. Clausal arguments of type ‘declarative’

| Argument label | Instances |
|---|---|
| suDECL | 87 |
| obDECL | 460 |
| oblDECL | 485 |
| expnDECL | 89 |
| oblExlnkDECL | 5 |
| DECL | 1142 |

An immediate observation concerning the clausal arguments is that they to a large extent obtain as governees of prepositions, i.e., as obliques. This is summarized in the table (Table 8) further below. Thus, for the arguments specified with INTERR as a defining label, about half appear in an oblique PP, the same holds for arguments specified with infinitive as a defining label, and it holds for almost half of the arguments specified with DECL as a defining label.

As shown in (10) above, the verb bry ‘bother’ has valence frames for all of the three clausal types declarative, interrogative and controlled infinitive. The catalogue contains no fewer than 217 verbs which have this capacity. In the case of bry, all of these arguments are embedded in a PP; how pervasive is this, compared with having the clauses as direct arguments (i.e., as subject, object, complement, or extraposed)? As the following tables (Tables 9, 10 and 11) show, within these valpods the distribution is not unlike that for the total set of valpods, thus also here with a majority of oblique arguments.

Table 4. Clausal arguments of type ‘interrogative’

| Argument label | Instances |
|---|---|
| suINTERR | 22 |
| obINTERR | 235 |
| compINTERR | 77 |
| expnINTERR | 48 |
| oblExlnkINTERR | 1 |
| oblINTERR | 432 |
| INTERR | 849 |

Table 5. Clausal arguments of type ‘controlled infinitive’

| Argument label | Instances |
|---|---|
| suEqObInf (‘subject is an infinitive controlled by object’) | 21 |
| obEqSuInf (‘object is an infinitive controlled by subject’) | 135 |
| obEqIobInf (‘object is an infinitive controlled by indirect object’) | 51 |
| oblEqSuInf (‘oblique is an infinitive controlled by subject’) | 291 |
| oblEqObInf (‘oblique is an infinitive controlled by object’) | 476 |
| expnEqObInf (‘extraposed is an infinitive controlled by object’) | 31 |
| EqInf | 1066 |

Table 6. Clausal arguments of type ‘absolute infinitive’

| Argument label | Instances |
|---|---|
| suAbsinf | 17 |
| obAbsinf | 35 |
| oblAbsinf | 160 |
| expnAbsinf | 28 |
| Absinf | 267 |

Table 7. Clausal arguments of type ‘bare infinitives’

| Argument label | Instances |
|---|---|
| obEqIobBareinf | 2 |
| scBareinf | 18 |
| obEqBareinf | 2 |
| Bareinf | 22 |

Table 8. Oblique clausal arguments summarized

| Argument label | Instances |
|---|---|
| oblDECL | 485 |
| oblINTERR | 432 |
| oblEqSuInf | 291 |
| oblEqObInf | 476 |
| oblAbsinf | 160 |
| Oblique clausal arguments | 1844 |

Table 9. Declarative arguments in the 217 valpods with all types of clausal arguments

| Argument label | Instances |
|---|---|
| suDECL | 19 |
| obDECL | 88 |
| oblDECL | 204 |
| expnDECL | 22 |

Table 10. Interrogatives in the 217 valpods with all types of clausal arguments

| Argument label | Instances |
|---|---|
| suINTERR | 6 |
| obINTERR | 48 |
| compINTERR | 23 |
| expnINTERR | 25 |
| oblINTERR | 212 |

The verbs of these valpods are listed in Appendix 2. Seeing whether they may have some meaning factors or other properties enabling this array of argument types for exactly these verbs, may be an interesting undertaking. Here, the role of the prepositions obviously also has to be assessed. Even if the number of prepositions in these roles is not more than 15–20, so that they also have a foot in what we may call the domain of ‘structural’ parameters, their inherent semantics is obviously essential.

Table 11. Controlled infinitives in the 217 valpods with all types of clausal arguments

| Argument label | Instances |
|---|---|
| obEqSuInf | 45 |
| obEqIobInf | 13 |
| oblEqSuInf | 97 |
| oblEqObInf | 110 |
| expnEqObInf | 1 |
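The claim that oblique arguments form the majority within the 217 valpods can be checked with simple arithmetic over the instance counts of Tables 9, 10 and 11. The sketch below copies those counts and computes, per clause type, the share of instances whose label starts with `obl`; treating the label prefix as the test for ‘oblique’ is an assumption made for this illustration.

```python
# Instance counts copied from Tables 9, 10 and 11.
TABLES = {
    "declarative": {
        "suDECL": 19, "obDECL": 88, "oblDECL": 204, "expnDECL": 22,
    },
    "interrogative": {
        "suINTERR": 6, "obINTERR": 48, "compINTERR": 23,
        "expnINTERR": 25, "oblINTERR": 212,
    },
    "controlled infinitive": {
        "obEqSuInf": 45, "obEqIobInf": 13, "oblEqSuInf": 97,
        "oblEqObInf": 110, "expnEqObInf": 1,
    },
}


def oblique_share(counts):
    """Fraction of instances whose argument label marks an oblique."""
    total = sum(counts.values())
    oblique = sum(n for label, n in counts.items() if label.startswith("obl"))
    return oblique / total


shares = {name: oblique_share(counts) for name, counts in TABLES.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} oblique")
```

On these figures the oblique share comes out at roughly 61% for declaratives, 68% for interrogatives and 78% for controlled infinitives, so obliques are in the majority for each of the three clause types.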

The overall number of lexvals declared as including an oblique argument is around 4600 (global labels including trObl, intrObl, ditrObl, impersObl, …PrtclObl), distributed over 113 frame types, thus nearly one third of all lexvals and more than one third of all frame types, underlining the role of obliques in general.

As a last observation relating to clausal arguments, if a verb has any of the types of clausal argument occurring in (a frame type in) its valpod, then the valpod will have at least one more member. There are only two exceptions, as indicated in Table 12 below, which shows the frame distribution for the 2917 valpods with only one member. This is a sweeping example of frame-prediction. It clearly suggests that clausal arguments are in some sense ‘secondary’ within some order of establishment among argument types. How an account of this may go, and its analytic consequences, we leave open here.

This subsection illustrates what we mentioned in the introduction to the section as cases where something one may think of as ‘one thing’, like a declarative clausal argument, has a wide spectrum of distributions, and the catalogue allows one to get a concise picture of them all. It also illustrates how a focus on larger types of assemblies, offered through valpod representations, allows one to see patterns of co-occurrence of complex entities, as in a possible study of verbs taking all three kinds of clausal arguments.

4.2 Particles and Secondary Predicates

As mentioned, particle is a functional category whose part of speech is adverb. The total number of lexvals containing particles is 1331, distributed over 45 frame types. These adverbs can also serve as directional or locative adjuncts, which is a different function from that of particle, and as predicatives in (bound) predicative constructions, and a question is how to keep instances of such functions apart. (Relative to the well-known formally differentiated groups of directional and locative adverbs, as in ut ‘out’, inn ‘in’, etc., vs. ute ‘out(side)’, inne ‘in(side)’, etc., the particle uses of adverbs are reserved for the ‘directional’ versions.)


Table 12. Number of valpods with a unique member (i.e., univalent verbs)

| Number | Frame type of the unique lexval in valpod |
|---|---|
| 2,145 | Transitive |
| 656 | Intransitive |
| 88 | Transitive with light reflexive object |
| 65 | Intransitive with oblique |
| 42 | Transitive with a particle |
| 36 | Intransitive with a directional subject |
| 28 | Ditransitive |
| 22 | Impersonal |
| 16 | Transitive with directional object |
| 15 | Transitive plus oblique |
| 14 | Ditransitive with light reflexive as indirect object |
| 12 | Transitive with light reflexive object plus oblique |
| 12 | Transitive with light reflexive directional object |
| 11 | Transitive with a particle |
| 10 | Transitive with a particle and light reflexive object |
| … | … |
| 2 | Subject-controlled infinitive as unique frame: plikte å, unnlate å |
| 0 | Declarative or interrogative argument, extraposed clause, or absolute infinitive |
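The frame prediction read off Table 12 (a valpod containing a clausal argument has at least one more member) lends itself to a mechanical check. The following minimal sketch runs such a check over a toy inventory; the function name `unpredicted`, the regular expression for clausal labels and, in particular, the frame label given for plikte are assumptions made for illustration, since the text does not state how the catalogue encodes that frame.

```python
import re

# Assumed pattern for clausal argument labels (an approximation).
CLAUSAL = re.compile(r"DECL|INTERR|Absinf|Eq\w*Inf")


def unpredicted(valpods):
    """Lemmas whose only lexval contains a clausal argument."""
    return sorted(
        lemma
        for lemma, types in valpods.items()
        if len(types) == 1
        and any(CLAUSAL.search(t.split("__", 1)[1]) for t in types)
    )


valpods = {
    # Excerpts of the valpod types given in (8) and (10).
    "spise": {"V__intr", "V__tr"},
    "bry": {"V__tr", "V-med__trObl-oblDECL", "V-om__trObl-obRefl-oblINTERR"},
    # Hypothetical encoding of 'subject-controlled infinitive as unique frame'.
    "plikte": {"V__tr-obEqSuInf"},
}

exceptions = unpredicted(valpods)
```

On this toy inventory only `plikte` is returned, matching its status as one of the two exceptions listed in Table 12; run over the full catalogue, the same kind of check would be expected to return just those two verbs.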

We regard adverbs as in principle incapable of taking NP objects,3 while an adverb with directional or locative function may have prepositional complements (like in English of the house being a complement of out in out of the house), which we may take as a capacity reserved for the non-particle uses. When such an adverb occurs alone, however, it may take some consideration to decide whether it is a directionally used adverb, a predicatively used adverb, or a particle. For instance, ut in Han kastet den ut (‘He threw it out’) is presumably a directionally used adverb (a type instantiated by 245 lexvals where an object undergoes directional movement, coded with obDir as an argument specification); in De frøs ham ut (‘They froze him out’), ut is presumably a predicatively used adverb (instantiated by 33 lexvals coded with scObNrgCsd as argument specification); and in De skjemte ham ut (‘They spoiled him’) ut is presumably a particle, a use type instantiated in 664 lexvals, coded as __trPrtcl as a global specification.

3 Apparent countercases like ut porten ‘out the gate’ as in Han løp ut porten ‘He ran out through the gate’ can be analyzed as generally representing an understood ‘through’ or ‘along’ (cf. Jørgensen [29]), and thus in principle following the pattern of (ran) out of the house. (The opposite position may lead to regarding adverbs as ‘intransitive prepositions’, a notion we thus reject.)


Among the 1331 lexvals having Prtcl as part of their global label, 943 have __trPrtcl as initial part of the label, and 372 have __intrPrtcl as initial part of the label. The numbers of lexvals having these strings as the full global label are, respectively, 664 and 253. The combinations most frequent among lexvals with more complex global labels involve the substring PrtclObl, obtaining with no fewer than 167 lexvals, while 277 lexvals have still other global labels including Prtcl as part; cf. Appendix 1 for an overview.

Given the close connections between the function of particle and the functions of (directional) adjunct and predicative mentioned above, it is obvious that both criteria and assignments will be in constant need of further scrutiny. The case of particles thus is a prime instance of what we referred to in the section introduction as cases where labels have been assigned but categorization is nevertheless in doubt or subject to revision, and the catalogue allows one to efficiently get an overview of relevant items. In this case this will include not just lexvals marked Prtcl in their global label but also lexvals marked for Scpr in their global label and lexvals where an argument specification includes Dir.

The notion of frame prediction as conducted relative to the catalogue can be illustrated by a subset of the lexvals marked for Scpr in their global label, namely the causative predicative type. It is represented by the label trScpr-scObNrgCsd (read as ‘transitive with a secondary predicate predicated of a non-argument object, causatively interpreted’) for non-reflexive versions, and by the label trScpr-obRefl-scObNrgCsd for the corresponding reflexive constructions; some valpods have both, so that the number of valpods containing the type is 26, on the current count. A commonly entertained description of this construction is that it expresses an activity leading to a result not inherent as goal in the concept of that activity (which, in an example like De spiste kjøleskapet tomt (‘They ate the refrigerator empty’), is to say that emptying food containers is not in the lexical semantics of the notion ‘eating’). This suggests, as a frame prediction, that verbs instantiating the construction can occur intransitively as well. 19 of the 26 valpods containing a trScpr-scObNrgCsd item indeed do contain lexvals with the type intransitive. 14 of the 26 valpods also contain the frame type transitive; for 9 of those, the frame with this specification co-occurs with the frame for intransitive, as the expectation would be. For 5 of the valpods with a trScpr-scObNrgCsd item containing the frame type transitive, however, there is no instance of intransitive. This is then an instance of a frame prediction not immediately borne out, but conceivably resolvable through further analysis.4

4 The 5 ‘aberrant’ verbs are ergre (‘annoy’), kjøpe (‘buy’), skjenke (apart from senses ‘give, donate, endow’, here meaning ‘pour’), spyle (‘flush’) and stue (‘stow’), and one may try to find a factor distinguishing those to be worked into the prediction.

4.3 Light Reflexives

In many linguistic discussions, constructions with what we call light reflexives (LR) have been rather vaguely categorized, often assigned a status intermediate between transitive and intransitive. All the more astounding is that there are no fewer than 2050 lexvals with LRs, thus about 15% of the total amount of lexvals, and they spread over a wide array of construction types, as illustrated in the Table below (Table 13).


Table 13. Examples of Light Reflexive (LR) constructions, with number of instances

| Informal frame description | Example | Translation | Inst. |
|---|---|---|---|
| V_LR | hun vasker seg | She washes herself | 667 |
| V_LR_P_NP | hun befatter seg med dem | She deals with them | 333 |
| Vobjcontrol_LR_P_Inf | hun tvinger seg til å sitte | She forces Refl to sit | 79 |
| V_LR_LOC | hun oppholder seg her | She stays here | 15 |
| V_LR_DIR | hun smyger seg hit | She slithers hereto | 76 |
| V_LR_Prtcl | de dummer seg ut | They make fools of Refl | 139 |
| Vobjcontrol_LR_Inf | hun tillater seg å komme | She allows Refl to come | 8 |
| V_LR_SCPRcaus-AP | hun løper seg frisk | She runs Refl healthy | 14 |
| V_LR_SCPRstate-AP | hun befinner seg vel | She is well | 9 |
| V_LR_SCPRprtclP | hun oppfører seg som en idiot | She behaves like an idiot | 38 |
| V_LR_PPpossrais | hun gnir seg i nakken | She rubs Refl in the neck | 23 |
| Vrais-to-obj_LR_Inf | hun viser seg å komme | She turns out to come | 1 |
| V_LR_NP | hun lærer seg spansk | She teaches Refl Spanish | 133 |
| V_P_LR | hun rører på seg | She moves | 13 |
| V_SCPR[P_LR_NP] | hun jafser i seg maten | She gobbles up the food | 48 |
| Expl_V_LR_Extraposed | det viser seg at hun lyver | It turns out that she lies | 20 |
| Expl_V_LR_LOC | det satte seg en katt her | There sat down a cat here | 16 |

As noted earlier, given that LRs occur in designated positions for direct objects, indirect objects and prepositional governees, they are here categorized for these grammatical functions, following Hellan and Beermann [21], Hellan [16]. In these terms, the Table shows that there are 667 lexvals where the LR is object,5 333 where the LR is object followed by an oblique, 139 where the LR is object followed by a particle, 130 where the LR is indirect object followed by an object, 79 where the LR is object followed by an object controlled infinitive, and 61 where the LR is governee inside a PP serving as object predicative or oblique, to mention the most frequent.

Among multivalent verbs, the reflexive frames are about 1900 in number, constituting about 17% of the totality of frames (lexvals), while among the univalent verbs, the reflexive frames, 150 in number, constitute about 5%. This difference may reflect a tendency for reflexive frames to ‘live on’ the presence of non-reflexive variants of the same overall frame in the valpod of any given verb, without thereby being reducible simply to instances of these patterns. (This recalls the dependence mentioned in Sect. 4.1 of verbs with lexvals with clausal arguments to also have lexvals with non-clausal arguments, although in that case the dependency is more clear-cut.) The latter observations are formally facilitated by the construct of valpods, with a division into uni-membered and multi-membered valpods. Within the array of lexvals as such, the formal probe for the status as ‘LR’ is the argument specification Refl as in obRefl, iobRefl, and oblRefl, and refl as in scPPrefl.

This is thus again an instance of how something one may think of as ‘one thing’, i.e., ‘the light reflexive’, has a wide spectrum of distributions, and the catalogue allows one to get a concise picture of them all.

5 In presentational constructions with an LR (as in Det setter seg en katt ‘there seats itself a cat’), the LR also, in our analysis, counts as object; here, the expletive det carrying subject status, en katt is assigned the function of ‘presented’, following Hellan and Beermann [21], Hellan [16]. These constructions have not yet been fully registered in the catalogue, see Sect. 5.1.

4.4 Conclusions

Particles and light reflexives as we have here classified them are generally not much in focus in formal descriptions; still, finding the ‘rhythm’ of the language very much includes mastery of these aspects. Having concise overviews of these phenomena ought to feed into both linguistic investigations and more practical applications (for pedagogical purposes such as L2 learning, for instance, systematic use of examples displaying the patterns would recommend itself). In view of their pervasive use of prepositions, the ‘rhythm’ aspect also holds for constructions with clausal arguments, but they also bear on another aspect of linguistic analysis, namely complexity of constructions: along with clausal adjuncts, these constructions hold the main recursive power within text composition. Identifying uses of these facilities may be crucial to understanding aspects of text structure, useful for studying styles, and, for instance, for diagnosing parameters for what constitutes ‘easy’ and ‘complex’ language. In both respects, thus, a catalogue like the present, through its functions illustrated, will lend itself well to formal, theoretical, descriptive, as well as practical purposes. We pursue some of these points in Sect. 6.
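The formal probe for light reflexives described in Sect. 4.3 (the argument specification Refl or refl) can be sketched as a string test over lexvals. This is a minimal illustration, assuming that the probe always appears at the end of an argument label, as in obRefl, iobRefl, oblRefl and scPPrefl; the function name `has_light_reflexive` is hypothetical.

```python
def has_light_reflexive(lexval: str) -> bool:
    """True if some argument label of the lexval ends in 'Refl' or 'refl'."""
    frame = lexval.split("__", 1)[1]
    arg_labels = frame.split("-")[1:]   # drop the global label
    return any(label.endswith(("Refl", "refl")) for label in arg_labels)


# The lexvals of spise, transcribed from (7).
LEXVALS = [
    "spise__intr",
    "spise-av__intrObl-oblN-ACTIVITY",
    "spise-på__intrObl-oblN-ACTIVITY",
    "spise__tr",
    "spise-innpå__trObl-obRefl-oblN",
    "spise-opp__trPrtcl",
    "spise__trScpr-obRefl-scObNrgCsd-scPred",
    "spise__trScpr-scObNrgCsd-scPred",
    "spise-i__trScpr-scPPrefl",
]

lr_lexvals = [lv for lv in LEXVALS if has_light_reflexive(lv)]
```

For spise the probe picks out three lexvals: the two with obRefl and the one with scPPrefl. The same filter applied to the whole inventory would give the kind of overview of LR constructions summarized in Table 13.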

5 Discussion 5.1 Issues of Redundancy As mentioned at the beginning of Sect. 2, constructions analyzable as ‘passive’ are counted as regularly predictable from patterns with the verb in active form and certain argument distributions counted as ‘basic’.6 The alternating distribution of particles as preceding or following an NP object likewise counts as regularly predictable, and likewise do of course constructions with so-called ‘gaps’ due to front positioning of wh-elements or ‘topicalized’ elements. For ‘presentationals’ and ‘extraposition’, which mostly alternate with structures where the ‘logical subject’ is also the syntactic subject, 6 In valence assignments in the valence corpus sustained by the grammar Norsource, such

‘derived’ structures are accordingly assigned valence frame reflecting their ‘base’ structure.

A Valence Catalogue for Norwegian

73

the matter is less clear, currently with ‘extraposition’ instances being explicitly listed, while most presentational options alternating with the frame intrObl-oblLoc are left unspecified. Another construction type often perceived as having a possibly ‘derived’ status is causative secondary predicates. Following the considerations in Sect. 4.2 about ‘predictability’ of this construction, a proposal for constructions like Han spiste tallerkenen tom ‘He ate the plate empty’ has been that the transitive frame of spise ‘eat’ undergoes a rule of ‘Object Deletion’, followed by the addition of the ‘small clause’. Such reasoning would reduce the valpod of spise from its current 9 elements to 5, omitting the intransitive and the causative secondary predicate (with scObNrgCsd) frames, and thus exemplifies an analytic approach to enhancing non-redundancy, successful or not.7 While not many frames are involved in this case, an issue of further reach is whether intransitive frames can be generally predictable from transitive ones, such that there would be a ‘lean’ inventory having only one of the two in valpods where the relevant conditions are met, thereby extending ‘Object Deletion’ beyond the domain invoked above. In the catalogue there are 264 valpods with exactly these two frames (the object then being nominal), and about 750 valpods where the two frames co-exist among other frame types, thus altogether more than 1000 candidate cases. It is doubtful whether such a potential rule, accommodating spise, could be ‘widened’ to cover these 1000 verbs, while at the same time being restricted so as not to apply to the about 2000 verbs which have only a transitive frame (cf. Table 12).8 In these considerations, the sheer number of items to take into account carries weight by itself, aside from linguistic issues, and making that number visible is a contribution of the catalogue design. These are illustrations of ‘specificity’ vs. ‘redundancy’ concerns as they will apply to a resource like the present.
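The counts cited above (valpods consisting of exactly an intransitive and a transitive frame, valpods where the two co-exist with other frames, and valpods with a transitive frame only) are the kind of query that a catalogue in this format supports directly. The following is a minimal sketch in Python, assuming that a valpod is stored as a set of frame type labels per lemma. The frame labels intr, tr and trPrtcl occur in the catalogue, but the toy entries and their assignment to the verbs lese and frakte are hypothetical and not taken from NorVal.

```python
# Toy valpods: lemma -> set of frame type labels.
# The verb-to-frame assignments below are hypothetical sample data.
VALPODS = {
    "spise": {"intr", "tr", "trPrtcl"},  # both frames, among others
    "lese": {"intr", "tr"},              # exactly the two frames
    "frakte": {"tr"},                    # transitive frame only
}

PAIR = {"intr", "tr"}


def classify(valpod):
    """Classify one valpod relative to the intransitive/transitive pair."""
    if valpod == PAIR:
        return "exact"
    if PAIR <= valpod:  # subset test: both frames present, plus others
        return "among_others"
    if valpod == {"tr"}:
        return "tr_only"
    return "other"


def tally(valpods):
    """Count how many valpods fall into each class."""
    counts = {"exact": 0, "among_others": 0, "tr_only": 0, "other": 0}
    for frames in valpods.values():
        counts[classify(frames)] += 1
    return counts


counts = tally(VALPODS)
```

Run over the full catalogue, a tally of this kind would yield the figures discussed in the text; whether a general rule can separate the first two classes from the third is a linguistic question that the count itself does not settle.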

7 Rather than (8) above, the valpod thus would be: spise: {V-av__intrObl-oblN-ACTIVITY & V-på__intrObl-oblN-ACTIVITY & V__tr & V-innpå__trOblobRefl-oblN & V-opp__trPrtcl & V-i__trScpr-scPPrefl}.

8 A factor here is also that the semantics of ‘input’ and ‘output’ must be consistently related (synonymy in the cases of passives etc., addition of causation between defined entities in the previous case). So, in Nei, han er opptatt, han sitter og spiser (‘No, he is occupied, he is sitting eating’), the verb spise ‘eat’ is arguably used as a one-participant concept; would this be a relevant proposal in many of the 1000 cases? Or how many are analyzable in terms of causativization (or ‘anticausativization’)? A related concern is when a transitive and an intransitive frame are ‘collapsed’ using symbols for optionality, such as in the use of parentheses in notations like ‘NP V (NP)’ for expressing ‘optionality’ of an object: a constant semantic relation of meaning must be defined between such options.


L. Hellan

5.2 Valpod Intersections vs. ‘Valency Classes’

As is clear from the foregoing, the multi-membered valpods provide ample ground for investigating commonalities between valencies of verbs: what are formally intersections between valpod types may point to interesting similarities between the verbs hosting these types. Abstractly speaking, the notion of valpod intersection is not so remote from the notion of a valence class; however, differences ought to be noted. The notion of a valence class is tied to Levin [31] and underlies projects and resources such as the Leipzig Valency Classes Project (cf. Malchukov and Comrie [33]), the online database ValPal (http://valpal.info) created from the project, and VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet.html, cf. Korhonen et al. [30]). A ‘valency class’ is in principle a set of verbs sharing valence frame potential, although the frame types in question need not fully exhaust the frames that each of the verbs can take. What identifies a valence class is rather the recurrence of a small number of frame types across a significant number of verbs (so-called ‘alternation pairs’), where these frame types in these lexvals express a common meaning, or meaning alternations over a common semantic parameter. A well-known case is the so-called spray/load alternations, whose common semantic parameter resides in processes involving two incrementally affected participants, with one or the other of the incremental processes being understood as completed, and where the shared pair of frames is of the form ‘_ NP [[prep]NP]PP ’ where the ‘completed’ process is indicated by the NP and the non-completed by the PP, with a different ‘prep’ according to which of the processes it represents. Thus, in spray the wall with paint, with represents the aspect of consumption of paint, the wall indicating its completion, while in spray paint onto the wall, onto represents the aspect of coverage of the wall.
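Taken formally, a valpod intersection is a plain set operation over frame type labels. The following minimal Python sketch illustrates this, assuming each valpod is held as a set of labels. The four frame types listed for lære are taken from the catalogue entries for that verb; verbA and verbB and their frames are hypothetical placeholders.

```python
from itertools import combinations

# Toy valpods. Only the lære frames come from the catalogue;
# verbA and verbB are hypothetical placeholders.
VALPODS = {
    "lære": {"ditr", "tr", "tr-obDECL", "tr-obEqSuInf"},
    "verbA": {"ditr", "tr", "intr"},
    "verbB": {"tr", "intr"},
}


def intersection(verb1, verb2, valpods=VALPODS):
    """Frame types shared by the valpods of two verbs."""
    return valpods[verb1] & valpods[verb2]


def verbs_sharing(frames, valpods=VALPODS):
    """Verbs whose valpod contains every frame type in `frames`."""
    wanted = set(frames)
    return sorted(v for v, pod in valpods.items() if wanted <= pod)


def all_pairwise(valpods=VALPODS):
    """Intersection for every unordered pair of verbs."""
    return {
        (a, b): valpods[a] & valpods[b]
        for a, b in combinations(sorted(valpods), 2)
    }
```

A call such as `verbs_sharing({"tr", "intr"})` returns the verbs hosting a given frame pair. Nothing in the computation says whether the shared frames are semantically related; that would have to be established separately.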
What distinguishes the notion of ‘valence class’ from general valpod intersection is the explicit identification of a semantic parameter interrelating the frame instances in question: the notion of valpod intersection, on the other hand, is purely formal, with no assumption of semantic relations. However, based on the NorVal catalogue, one can of course pursue issues recognizable as related to the ‘valence class’ notion, namely as issues of sense prediction.

5.3 Valence Frames and Senses

A standard dictionary will identify senses. For instance, the verb lære has at least two senses, one corresponding to English learn, and one corresponding to English teach, and a dictionary may mention the difference. The valence catalogue, in contrast, does not mark the distinction. It just enumerates the following frames for lære (with examples and translations here indicated), without noting that the first two and the last two instantiate the ‘teaching’ lære (with another person, explicit or not, as target of instruction) while the others all instantiate the sense of increasing one’s own knowledge:


(12)
lære__ditr (Ex.: hun lærer ham mordvinsk ‘she teaches him Mordvinian’)
lære__ditr-obEqIobInf (Ex.: jeg lærer dem å skrive ‘I teach them to write’)
lære__ditr-iobRefl-obEqIobInf (Ex.: hun lærer seg å lese ‘she learns to read’)
lære-av__intrObl-oblN (Ex.: jeg lærer av henne ‘I learn from her’)
lære-om__intrObl-oblN (Ex.: de lærer om vaksiner ‘they learn about vaccines’)
lære-om__intrObl-oblDECL (Ex.: de lærer om at utseendet kan bedra ‘they learn about [that appearance deceives]’)
lære-om__intrObl-oblAbsnf (Ex.: de lærer om å bygge solceller ‘they learn about building solar cells’)
lære-om__intrObl-oblINTERR (Ex.: de lærer om hva som kan gå galt ‘they learn about what can go wrong’)
lære__tr-obEqSuInf (Ex.: hun lærer å lese ‘she learns to read’)
lære__tr-obDECL (Ex.: de lærer at det er galt å lyve ‘they learn that lying is wrong’)
lære__tr (Ex.: de lærer gangetabellen ‘they learn the multiplication table’)
lære-bort__trPrtcl (Ex.: vi lærer bort koden ‘we teach away the code’)
lære-opp__trPrtcl (Ex.: de lærer opp lærlinger ‘they educate apprentices’)

Thus, although (as noted in Sect. 2) the frame type code in many respects reflects semantic parameters, these parameters are general, while the notion of ‘sense’ now in question resides in semantic properties distinguishing all words from one another, thus lexically specific properties. How do senses relate to valence frames as here recorded? As a general observation, they stand in a many-to-many relation. The list in (12) illustrates the circumstance that there can be many valence frames reflecting the same sense. Conversely, the situation of one valence frame representing many senses can be illustrated with a verb like løpe ‘run’ which, among its senses, has one of directional movement and one of pure directionality, both of which can be expressed in the frame ‘verb + directional PP’, illustrated below:9

(13)
a. Han løper fra stadion til broen. ‘He runs from the stadium to the bridge.’
b. Linjen løper fra punkt A til punkt B. ‘The line runs from point A to point B.’

In a valence repository like the present, thus, classification of a verb’s senses has to be a cross-classification relative to its valence frames, so that given the set of senses available for a verb (its lex-senses), each lex-sense has to be specified for which lexvals it can be instantiated by, and each lexval has to have a specification for which lex-senses it can express.

9 Also for løpe there is a frame which can be used only for one of the senses, viz. the frame with a caused secondary predicate, which is available only for the reading involving actual movement: a. Hun løper seg frisk. ‘She runs herself healthy.’ b. *Linjen løper seg krum. ‘The line runs itself curved.’ The circumstance alluded to in (b) can be expressed, e.g., by Linjen krummer seg ‘The line curves’, just not with the valence frame in question.
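The cross-classification between lex-senses and lexvals can be pictured as a pair of mutually inverse mappings. The sketch below, in Python, is only an illustration under stated assumptions: the sense identifiers (teach, learn, move, extend) are invented names, the assignment for lære follows the prose description of (12), and the two løpe identifiers are placeholders, since the actual NorVal labels for those frames are not given here.

```python
from collections import defaultdict

# Sense identifiers are illustrative names, not NorVal labels.
# The two løpe lexval identifiers are placeholders for the
# 'verb + directional PP' frame and the caused-secondary-predicate frame.
SENSE_TO_LEXVALS = {
    ("lære", "teach"): [
        "lære__ditr",
        "lære__ditr-obEqIobInf",
        "lære-bort__trPrtcl",
        "lære-opp__trPrtcl",
    ],
    ("lære", "learn"): [
        "lære__ditr-iobRefl-obEqIobInf",
        "lære-av__intrObl-oblN",
        "lære__tr-obEqSuInf",
        "lære__tr-obDECL",
        "lære__tr",
    ],
    ("løpe", "move"): ["løpe__DIRECTIONAL-PP", "løpe__CAUSED-SCPR"],
    ("løpe", "extend"): ["løpe__DIRECTIONAL-PP"],
}


def invert(sense_to_lexvals):
    """Build the lexval -> set of sense names mapping."""
    lexval_to_senses = defaultdict(set)
    for (_lemma, sense), lexvals in sense_to_lexvals.items():
        for lexval in lexvals:
            lexval_to_senses[lexval].add(sense)
    return dict(lexval_to_senses)


LEXVAL_TO_SENSES = invert(SENSE_TO_LEXVALS)
```

The many-to-many character shows up in both directions: the ‘teach’ sense of lære maps to several lexvals, while the placeholder directional-PP lexval of løpe maps back to two senses.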


Standard dictionaries indicate senses by definitions, synonyms or near-synonyms, or short examples. If one were to envisage an extension of a valence catalogue like NorVal with senses, the cross-classification would of course affect the structure of lexvals and valpods. The ‘catalogue’ aspect would also raise the issue of annotation format, that is, of whether definitions and synonyms can be rendered in a format that would combine with the structures already in the catalogue. From a formal-theoretical viewpoint, both the CL code and its type-theoretic counterpart have in principle formats for representing ‘lexical semantics’ and ‘situation types’, as outlined in Hellan [14, 15]. What would be required first, however, would be an overview of how many senses are distinguished in a normal-size dictionary, and how they distribute over the lemmas; the exact formalization of the senses would be immaterial at such a stage.10

6 Final Remarks

6.1 Comparison with Other Valence Resources

Among existing monolingual valence dictionaries can be mentioned, for English: FrameNet (5213 verbs, per July 2021);11 VerbNet (6340 verbs);12 PropBank (5649 verbs);13 and EngVallex;14 for German: Evalbu;15 for Czech: Vallex;16 and for Polish: Walenty.17 All of these resources offer excellent online user interfaces, most of them are associated with a corpus accessible from the interface, and most of them expose their analyses for concrete sentences illustrating frames for the verbs. In many cases this includes semantic representations, in the forms of, roughly categorized, AVMs (attribute-value matrices) (e.g., FrameNet), predicate decomposition in the style of Generative or Jackendovian Semantics (e.g., VerbNet), or annotation with semantic roles (e.g., PropBank). Some resources also offer comprehensive descriptions or definitions, like Evalbu. As was outlined in Sect. 2.1, many of the classificatory labels in NorVal carry general semantic content, and a resource connected to NorVal provides information about ‘logical form’ corresponding to this content, viz. the computational grammar NorSource, which analyzes sentences instantiating all of the 15,700 lexvals, displaying their predicate-logic forms, as an online service (cf. Appendix 1).18 None of the valence resources mentioned above are accompanied by such a facility. (This also applies to the way in which the grammar generates valence information in a corpus analyzed by the grammar.) The main feature of NorVal is still the compactness of its enumerations of the totality of frames in which the totality of verbs of the language can occur, whereby investigations drawing on multiple phenomena jointly can be relatively easily conducted. We are not in a position to compare this feature with the relevant corresponding mechanisms of the valence resources mentioned.

10 A matter closely related to senses are multi-word expressions (MWEs), including idioms. They mostly follow the patterns of non-MWEs as far as the syntax is concerned, and so their forms can in principle be specified within the existing frame repertory, while their meanings would be encoded once senses are eventually encoded. The form specification would be an extension of the format already used for ‘selected’ items. For a preliminary outline concerning the type of MWEs called Light Verb Constructions, see Hellan [14, 18].
11 https://framenet.icsi.berkeley.edu/fndrupal.
12 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html.
13 https://propbank.github.io/.
14 http://ucnk.ff.cuni.cz.
15 https://grammis.ids-mannheim.de/verbvalenz.
16 http://ucnk.ff.cuni.cz.
17 http://clip.ipipan.waw.pl/Walenty; cf. Przepiórkowski et al. [18].
18 A device for displaying the feature structures associated with each frame type label is also envisaged, cf. https://typecraft.org/tc2wiki/NorVal_resources.

6.2 Extendability to Other Languages

The catalogue in its content of course applies only to Norwegian, but its formalism is applicable to any language (the label system already having extensions for many languages). Extended use of the design may open avenues for cross-linguistic comparison, not only for entire catalogues, but also for investigations directed at particular phenomena. Currently a resource similar to NorVal is a valence dictionary for the West African language Ga, by Dakubu [9], containing 470 lemmas and 1834 lexvals, compiled with the same frame description code, but without a formal grouping into valpods. In addition to morpho-syntactic features, it also provides semantic role and situation type labels for all lexvals.19 Like NorVal, these resources build on extensive previous lexicographical work, represented by Dakubu [8]. In general, when valence resources with large lexical coverage are created from pre-existing lexical resources, the effort of super-imposing the valence-sensitive organization is relatively small compared to the effort that went into the original resource; this holds whichever system of organization is used. The possibilities of using the NorVal design in a more general methodology with such aims might lie in the following direction of steps, based on the implicit assumption that the assembly of a verb’s environments reflects essential aspects of the verb’s meaning:

1. For some other language L, identify general differences between Norwegian (N) and L in the grammatical encoding of argument structure.
2. Modulo these differences, map the frame types defined for N (let’s call it FrameSet_N) to a putative set of frame types for L, viz. FrameSet_L, thus constructing a list of correspondence pairs:

FrameSet_N | FrameSet_L
FrameN-1 | FrameL-1
FrameN-2 | FrameL-2
… | …

(This will hardly be a one-to-one correspondence, and many frames may lack a counterpart on the other side.)

19 Dakubu [10] is a monograph expanded from Dakubu [9]. An illustration of valence comparison relative to these resources for Ga vs. Akan is given in Beermann and Hellan [1], based on the lemma ba ‘come’ and its 18 different lexvals in Ga.


3. Establish a correspondence of basic verb synonyms between N and L, thus, a list of verb pairs:

VN-1 | VL-1
VN-2 | VL-2
… | …

(Presumably about 2/3 of verbs in one language have single-verb translations in another; this still will give about 4000 synonyms in L.)

4. Assign to each verb VL-n the valpod of its corresponding verb in N, i.e., a valpod where each lexval has a frame in FrameSet_L established in the mapping in point 2.

A somewhat related strategy was assumed in the Leipzig Valency Classes Project presented in Malchukov and Comrie [33], where about 80 verbs from English were mapped to counterparts in about 30 languages, with the aim of establishing the valence potential of these 80 ‘meanings’ across these languages. While the above sketch would be 50 times scaled up compared to the Leipzig project as regards number of verbs, and presumably more extensive as regards frame types, cross-linguistic research like in this project would be essential relative to point 1 above (and likewise other studies under the heading ‘contrastive valency’, like those in Hellan et al. [24]). Whether sustainable methodologies can be developed along these lines remains of course to be seen.

6.3 Possible Applications

6.3.1 Minimal Sentences and POS-Based Valence Annotation

A research initiative described in Quasthoff et al. [39] establishes what is called typical sentences as representative of valence frames, and aims at characterizing the POS patterns of these sentences as parts-of-speech-signatures for the valence frames in question. Although so far conducted for only a few frame types, in German, one can envisage this as an initiative applicable to all valence frame types in all languages. Typical sentences tend to be the shortest sentences where each constituent of the frame in question is realized, and thus come close to the format of minimal sentence examples used in NorVal for each lexval.
To illustrate, corresponding to the first two lines in (12), the POS signatures of the frame types in the left column will be the POS-strings matching the example sentences; lexically specific signatures include the lemma of the head of the frame, as in the rightmost column, while frame type specific signatures mark only POS values, as in the third column in Table 14 below. Appendix 1 below lists lexval instances for all the 300 frame types represented in NorVal, and supplies examples for each lexval. From this overview one could readily construct 300 lexically specific signatures corresponding to these 300 lexvals, on the model of the fourth column, and on the model of the third column, one could make 300 frame type specific signatures applicable across all lexvals.


Table 14. Illustrating POS-signatures for frames matching minimal sentences

Lexval and frame | Minimal example | POS-signature for the frame | POS-signature for the lexval
lære__ditr | hun lærer ham mordvinsk | PRON V PRON N | PRON læreV PRON N
lære__ditr-obEqIobInf | jeg lærer dem å skrive | PRON V PRON INF V-INF | PRON læreV PRON INF V-INF
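The two kinds of signature in Table 14 can be derived mechanically from a POS-tagged minimal sentence. The following Python sketch shows one way to do so; the function names, the (token, tag) input format, and the exact-match lookup in annotate are assumptions made for illustration, not a description of the procedure in Quasthoff et al. [39].

```python
def frame_signature(tagged):
    """Frame type specific signature: the POS tags only."""
    return " ".join(pos for _token, pos in tagged)


def lexval_signature(tagged, head_index, lemma):
    """Lexically specific signature: the head's tag is prefixed by its lemma."""
    parts = []
    for i, (_token, pos) in enumerate(tagged):
        parts.append(lemma + pos if i == head_index else pos)
    return " ".join(parts)


def annotate(tagged, signature_table):
    """Return the frame type whose signature matches exactly, else None."""
    return signature_table.get(frame_signature(tagged))


# The two minimal examples from Table 14, as (token, POS) pairs.
EX_DITR = [("hun", "PRON"), ("lærer", "V"), ("ham", "PRON"), ("mordvinsk", "N")]
EX_DITR_INF = [
    ("jeg", "PRON"), ("lærer", "V"), ("dem", "PRON"),
    ("å", "INF"), ("skrive", "V-INF"),
]

# Signature table built from the minimal examples.
SIGNATURES = {
    frame_signature(EX_DITR): "ditr",
    frame_signature(EX_DITR_INF): "ditr-obEqIobInf",
}
```

Applied to a POS-tagged corpus sentence such as vi lærer dere norsk, tagged PRON V PRON N, `annotate` would return ditr. Exact matching is the simplest possible choice; real corpus sentences contain modifiers and adjuncts, and one POS string may correspond to more than one frame type, so a practical annotator would need a more tolerant matching step.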

An application of such resources may be automatic (or semi-automatic) annotation of POS-annotated corpora for valence. While ‘manual’ corpus annotation for valence is costly in effort, and parser-based automatic annotation for valence presupposes specific technology and can be somewhat error-prone (cf. Hellan et al. [25]), valence annotation based on POS-annotation would be a valuable alternative, possibly a preferred one.

6.3.2 Valence Information in Dictionaries

While the absence of sense specification probably makes it difficult to develop a resource like NorVal into a dictionary, there may be ways in which it could enrich a dictionary through inclusion of valence information. Since standard public dictionaries recognize mainly just intransitive, transitive and reflexive as valence variants, a fine-grained specification like that in (12) for lære would not be readily incorporable, not because of terms or formulas used (the CL formulas can be turned into normal prose) but because of overload of information when so many distinctions are drawn. A better strategy will be to make use of example sentences, presenting them structured according to patterns so that a user gets a direct impression of what are possible expressions. In a case like lære, one such strategy may be to assemble all the example sentences in (12) en bloc to indicate the richness of patterns available for lære; another would be to use the sentences as frame-wise illustrations relative to the various frames (which need not be named per se); given a recognition of just intransitive, transitive and reflexive, a partial bundling of the frames corresponding to these main groups may be the best alternative. Regardless of which strategy is chosen, the possibility of enriching the stock of examples from corpora could in turn be built in, with access to valence annotation which has already been done, or is executed on call, established by either of the strategies considered above.
6.3.3 Valence Resources in Second Language (L2) Acquisition

NLP-based resources for Norwegian include the NorSource-based grammar-correcting application A Norwegian Grammar Sparrer,20 where freely chosen inputs get a grammatical correctness check and, for some phenomena, if the input is ungrammatical, also feedback on what is wrong together with automatically generated correct versions. Corrections concerning valence are here included, but since a verb can have many valence frames, such corrections will have far less ‘determinacy’ than, e.g., corrections for gender, use of articles and the like.

20 https://typecraft.org/tc2wiki/A_Norwegian_Grammar_Sparrer.


A more adequate use of the valence catalogue may reside in the multitude of instances of patterns that can be automatically generated, for instance, verbs with reflexives and particles, verbs with particle plus oblique, etc. While online interfaces could be used for accessing such patterns, one has also the possibility of printing leaflets for various patterns. Needless to say, these perspectives align well with what was said above about applying valence information in the context of dictionaries.

6.3.4 Valence and ‘Complexity Assignment’

When every verb occurrence in a text is annotated for the frame type it has on that occurrence, one can entertain strategies for measuring or assessing complexity of texts in terms of valence, as alluded to in Sect. 4.4. Fairly exact measures can be conceived: the higher the number of arguments in a frame, the higher the score of the frame may be, and clausal arguments may induce a higher score than non-clausal ones, to mention some obvious possibilities. Procedures for ‘counting together’ all the annotations in a text could be defined, and the text as a whole could receive a value, attuned to length as a further factor. Many factors of assumed complexity would of course fall outside such a calculation, such as noun phrase structures, the status of a relative clause as relative or an adverbial clause as adverbial (while the internal structure of such clauses would be measured); possible complexity effects of ‘wh-movement’ would not be measurable in terms of valence, and likewise effects of ‘passive’ when this is not a valence factor, as in NorVal. Nevertheless most of these factors are in turn easily identifiable in a text and can be added to a total calculation.
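A valence-based complexity measure of the kind just outlined could be sketched as follows in Python. Everything specific here is an assumption made for illustration: the weights, the normalization by token count, and the per-occurrence argument counts, which are taken as given annotations and not derived from the frame type labels.

```python
from dataclasses import dataclass


@dataclass
class FrameOccurrence:
    """One annotated verb occurrence (argument counts are assumed given)."""
    frame: str      # frame type label, e.g. "tr-obDECL"
    n_args: int     # total number of arguments in the frame
    n_clausal: int  # how many of those arguments are clausal


def frame_score(occ, arg_weight=1.0, clausal_bonus=0.5):
    """Score grows with argument count; clausal arguments add a bonus."""
    return arg_weight * occ.n_args + clausal_bonus * occ.n_clausal


def text_score(occurrences, n_tokens, **weights):
    """Sum of frame scores, normalized by text length in tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    total = sum(frame_score(occ, **weights) for occ in occurrences)
    return total / n_tokens


# Hypothetical annotations for a nine-token text with two verb occurrences.
SAMPLE = [
    FrameOccurrence("tr", n_args=2, n_clausal=0),
    FrameOccurrence("tr-obDECL", n_args=2, n_clausal=1),
]
```

With the default weights, the two sample occurrences score 2.0 and 2.5, and `text_score(SAMPLE, 9)` gives 0.5. Factors outside valence, such as noun phrase structure or relative clauses, would enter as further terms in the sum.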
Such calculations would constitute only one side of what a ‘complexity’ investigation would require, the other being how language users actually perceive texts in terms of what they, if asked, would count as ‘complex’, or related properties like ‘difficult’, ‘unclear’, or even ‘coherent’. Relative to defined domains of communication, such as law texts and public assignments or guidelines, such investigations may be coupled to initiatives towards ‘easy’ or ‘clear’ language in the public sector, and what has there been established as ‘better’ practice could be matched against the outcomes of the formal complexity calculations, in cycles of revision and broadening of scope.

6.4 Extending the Catalogue

With explicit information about valence frames, and a minimal format for representation of sameness or relatedness of meaning as envisaged in Sect. 5.3, the catalogue could be extended to register the various ways in which verbs have correlates among nouns and adjectives, and even among other verbs. An example of the latter is how verbs with an initial morpheme that could be characterized as a ‘particle prefix’ have counterparts where that morpheme is absent (counting about 1300 lemmas), and in many cases (about 35%) with a particle similar to that morpheme as a possible valence frame item, thus cases like innsette ‘insert’ vs sette inn ‘set in’. Representing to what extent such pairs have exactly the same meaning or rather only a ‘motivational’ similarity (300 and 130 cases, respectively, on a current count) will be a natural feature of a valence catalogue, and be useful in L2 applications.


Illustrating for nouns and what is counted as ‘nominalizations’, nouns corresponding to lære ‘learn, teach’ (as commented on in Sect. 5.3) include lærer ‘teacher’, læring ‘learning’, lære (‘taught conception’), and related to lære opp, opplæring ‘training, education’. Lærer carries the ‘teach’ sense exclusively, and likewise opplæring; læring can be either the ‘learn’ sense or the ‘teach’ sense, while (en) lære is an ‘inner object’ relative to either the ‘learn’ or the ‘teach’ sense. Some kind of sameness marking across the entries in a noun lexicon and the verb lexicon relative to these lex-senses of lære can be both formally manageable and instructive, e.g., for L2 purposes. This suggests a likely domain into which the catalogue could be extended. It would involve not only the enumeration of relevant nouns and adjectives (these also relating to each other), and an explicit recognition of sense-identifiers (although not sense representations or descriptions), but also an extended lemma architecture for representing derivation, subsumption and morphological relatedness (see Hellan (to appear)).

All this being said, 6,300 is not the maximal number of verbs in Norwegian, and for hardly any of them is it likely that the catalogue at its present stage, even with descriptive parameters kept constant, gives a complete representation of their valence potential. So, obviously, these will remain dimensions in which also to develop the catalogue.

Acknowledgments. I am grateful to Dorothee Beermann, the editor Roussanka Loukanova, and the reviewers of this chapter, for comments and advice.

Appendix 1 Overview of Frame Types

The first column in the following table lists lexvals, with one lexval for each frame type. The ordering of the rows reflects a standard ordering of frame types (alphabetically, and according to internal DTD, cf. Sect. 3). Each frame type is presented as part of a lexval, and thus with a lemma to its left, so that when searching according to the alphabetical order of frame types, one must ignore what is to the left of the ‘__’. The English translations often do not quite match the valence pattern of the source sentence, and the points of deviance are marked in this way: a word V carrying one of the marks R, P, L, T, I, NS (mostly the verb, but also a preposition) differs in valence from the Norwegian counterpart with respect to, respectively, reflexive, preposition, particle, finite complementizer, infinitival marker, non-split predicate (in most cases the factor is missing in the translation, but in some cases in the original). To see the logical forms associated with the various frame types, the example sentences in column 2 can be entered into the online grammar parse window at http://regdili.hf.ntnu.no:8081/linguisticAce/parse, where in most cases an MRS (‘Minimal Recursion Semantics’; cf. Copestake et al. [6]) representation, close to a standard predicate logic representation, is displayed for each parse. Further supporting facilities are described at https://typecraft.org/tc2wiki/NorVal_resources (Table 15).

Appendix 2 Verbs Allowing for All Three Types of Clausal Arguments: Declaratives, Interrogatives and Infinitives

See Table 16.


Table 15. Lexvals instantiating frame types illustrated with examples

Lexval identifier | Example for frame type | English translation
bli__copAdj | dette blir hyggelig | This will be nice
bli__copAdj-suDECL | at det etableres en god praksis blir avgjørende | That a good practice gets established is decisive
bli__copAdj-suINTERR | hvem som vinner blir avgjørende | Who wins will be decisive
bli__copAdv | hun blir her | She remains here
være__copExpnAdj-expnAbsinf | det er hyggelig å løpe maraton | It is nice to run marathon
være__copExpnAdj-expnDECL | det er hyggelig at hun vant | It is nice that she won
være__copExpnAdj-expnINTERRwh | det er uvisst hvem som vinner | It is uncertain who will win
være__copExpnAdj-expnINTERRyn | det er uvisst om hun vinner | It is uncertain whether she will win
bli__copExpnN-expnAbsinf | det blir en ære å motta gjesteforskerinvitasjoner | It becomes an honor to receive guest researcher invitations
bli__copExpnN-expnDECL | det blir en ære at dere inviterer meg | It becomes an honor that you invite me
bli__copExpnN-expnINTERRwh | det blir et hovedspørsmål hvem som snakker | It becomes a main theme who talks
bli__copExpnN-expnINTERRyn | det blir et hovedspørsmål om han fortsetter | It becomes a main theme if he continues
bli__copExpnPP-expnDECL | det blir under tvil at han fortsetter | It will be under doubt that he continues
bli__copExpnPP-expnINTERRyn | det blir under kontinuerlig vurdering hvorvidt han fortsetter i stillingen | It will be under continued consideration whether he continues in the position
bli__copIdAbsinf | en slik avtale blir å avbryte samarbeidet | Such a deal will be to discontinue the cooperation
bli__copIdAbsinf-suAbsinf | å inngå en slik avtale blir å avbryte samarbeidet | To enter into such a deal will be to discontinue the cooperation
bli__copIdDECL | innholdet i avtalen blir at vi frastår eiendommen | The content of the deal will be that we relinquish the property
bli__copIdN | han blir den nye representanten | He becomes the new representative
bli__copIdINTERRyn | spørsmålet blir om han kommer | The question will be whether he comes


Table 15. (continued)

Lexval identifier | Example for frame type | English translation
bli__copIdINTERRwh | spørsmålet blir hvem som kommer | The question will be who comes
være__copImpersAdjLoc | det er fint i Finnmark | It is fine in Finnmark
være__copN | han er bonde | He is a farmer
være__copN-suAbsinf | å inngå denne avtalen er en skandale | To enter this deal is a scandal
være__copN-suDECL | at han får komme er en skandale | That he is eligible for coming is a scandal
være__copN-suINTERRwh | hvem som vinner er et spørsmål | Who will win is a question
være__copN-suINTERRyn | om han vinner er et spørsmål | Whether he will win is a question
være__copPP | hun er i Finnmark | She is in Finnmark
være__copPP-suDECL | at han får komme er under sterk tvil | That he gets admitted is under strong doubt
være__copPredprtcl | hun er som en ninja | She is like a ninja
være__copToFind | han er å treffe på strandeiendommen | He is to be met with at the beach property
være__copToughAdj | han er hyggelig å snakke med | He is pleasant to talk with
yte__ditr | de yter oss kompensasjon | They provide us compensation
koste-1__ditrExpnSu-obMeas-expnEqIobInf | det koster henne mye krefter å slåss alene | It costs her much effort to fight alone
lage__ditr-iobRefl | hun lager seg en modell | She makes herself a model
merke__ditr-iobRefl-obDECL | hun merker seg at du kommer | She notesR that you are coming
motsette__ditr-iobRefl-obEqIobInf | hun motsetter seg å skulle åpne paraden | She resistsR to have to open the parade
tenke__ditr-iobRefl-obINTERR | jeg tenker meg hva det er | I imagineR what it is
tilgi__ditr-obDECL | hun tilgir oss at vi forløp oss | She forgives us that we made a faux pas
be__ditr-obEqIobBareinf | hun ber dem komme | She asksI them to come
befale__ditr-obEqIobInf | jeg befaler deg å gå | I order you to go
garantere__ditr-obEqSuInf | hun garanterer dem å bidra | She guarantees them to contribute
innprente__ditr-obINTERR | vi innprenter dem hva som skal gjøres | We tell them what is to be done
undre__ditr-iobRefl-obINTERRwh | han undrer seg hvem som kommer | He wondersR who is coming
vise__ditr-obINTERRyn | instrumentet viser oss om det blir væromslag | The instrument tells us whether there will be a weather change
undre__ditr-iobRefl-obINTERRyn | han undrer seg om vi kommer | He wondersR whether we are coming


L. Hellan Table 15. (continued)

Lexval identifier

Example for frame type

English translation

kaste__ditrObl-oblPRTOFiob

ekornet kaster oss nøtter i hodet

The squirrel throws us nuts on our heads

frarøve__ditr-suAbsinf

å si slikt frarøver politikerne respekten

To say such things deprivesP politicians of their respect

gi__ditr-suDECL

at saken gjenåpnes gir oss mot

That the case is reopened gives us courage

vise__ditr-suDECL-obDECL

at de er så ekstra vennlige viser oss at de har baktanker

That they are so utterly friendly shows us that they have back-thoughts

vise__ditr-suDECL-obINTERR

at de applauderer viser oss hvordan de tenker

That they applaude shows us how they think

vise__ditr-suINTERR

hvem som kommer vil vise oss planen

Who comes will show us the plan

vise__ditr-suINTERR-obINTERR

hvem som kommer vil vise oss hva som kommer til å skje

Who comes will show us what will happen

hagle__impers

det hagler

It hails

gå-i__impersObl-oblN

det går i døren

“Someone moves by the door”

tykne-til__impersPrtcl

det tykner til

“It gets more overcast”

tørke__intr

klærne tørker

The clothes dry

ende__intrAdv

det ender godt

It ends well

skje__intrAdvExpn-expnDECL

det skjer ofte at folk blir syke

It often happens that people get sich

skje__intrAdvPresnt

det skjer ofte ulykker

There often occur accidents

skje__intrAdv-suDECL

at folk blir syke skjer ofte

That people get sic koften happens

skulle__intrAuxmodScpr-scSuNrg-scBareinf

han skal gå

He shall go

være__intrAuxpassScpr-scSuNrg-scPass

han er skutt

He is shot

ha-perf__intrAuxperfScpr-scSuNrg-scPerf

han har kommet

He has come

huske__intrComp-compINTERR

hun husker hvem som kommer

She recalls who comes

bevise__intrComp-suDECL-compINTERR

at de støtter ham beviser hvem som står bak

That they support him proves who stands behind

mankere__intrExpn-expnAbsinf

det mankerer å løse den siste oppgaven

It fails to solve the last task

trengs__intrExpn-expnDECL

det trengs at en spesialist ser på det

It is necessaryNS that a specialist looks at it

ryktes__intrExpn-expnINTERR

det ryktes hva som holdes skjult

It is rumouredNS what is being kept hidden


stå__intrLghtScpr-scAdj

kjelleren står tom

The basement stands empty

debutere-som__intrLghtScpr-scPredprtcl

han debuterer som forfatter

He makes debutNS as author

fremstå-som__intrLghtScpr-scSuNrg-scPredprtcl

han fremstår som hovedtaler

He stands upNS as main speaker

konferere-med-om__intrObl2-obl1N-obl2Absinf

vi konfererer med dem om å finne en løsning

We confer with them aboutI finding a solution

samtale-med-om__intrObl2-obl1N-obl2DECL

vi samtaler med dem om at man kan finne løsninger

We talk with them aboutT (the circumstance) that one can find a solution

skjenne-på-for__intrObl2-obl1N-obl2EqObl1Inf

vi skjenner på dem for å ha knust ruten

We scoldP them forI having broken the window glass

forhandle-med-om__intrObl2-obl1N-obl2EqSuInf

de forhandler med geriljaen om å kunne komme ut

They negotiate with the guerilla aboutI getting out

underhandle-med-om__intrObl2-obl1N-obl2INTERR

vi underhandler med dem om hvorvidt vi kan få visse fordeler

We negotiate with them about whether we can get certain advantages

kappkjøre-med-om__intrObl2-obl1N-obl2N

de kappkjører med prinsen om prisen

They race with the prince about the prize

dages-for__intrOblExlnk-oblExlnkAbsinf

det dages for å starte det endelige slag

It is timeNS forI starting the final battle

helle-mot__intrOblExlnk-oblExlnkDECL

det heller mot at det blir ekstraomganger

It tends towards that there will be extra time

dages-for__intrOblExlnk-oblExlnkINTERR

det dages for hvordan rettferdighet kan skje fyldest

It is timeNS for how justice can be honored

avhenge-av__intrOblExpn-expnDECL

det avhenger av deg at turen går bra

It depends on you that the trip goes well

avhenge-av__intrOblExpn-expnINTERRwh

det avhenger av deg hvem som vil vinne

It depends on you who will win

avhenge-av__intrOblExpn-expnINTERRyn

det vil avhenge av deg om turen går bra

It will depend on you whether the voyage goes well

bero-på__intrOblExpn-oblINTERRwh-expnINTERRwh

det vil bero på hvem som kommer hvem som vil vinne

It will depend on who comes who will win

avhenge-av__intrOblExpn-oblINTERRwh-expnINTERRyn

det vil avhenge av hvem som kommer hvorvidt vi får delta

It will depend on who comes whether we may participate

avhenge-av__intrOblExpn-oblINTERRyn-expnINTERRwh

det vil avhenge av hvorvidt vi deltar hvem som vil vinne

It will depend on whether we participate who will win


avhenge-av__intrOblExpn-oblINTERRyn-expnINTERRyn

det vil avhenge av hvorvidt vi deltar om de vil vinne

It will depend on whether we participate whether they will win

bidra-til__intrObl-oblAbsinf

de bidrar til å løse problemene

They contribute toI solving the problems

blånekte-på__intrObl-oblDECL

de blånekter på at de visste noe

They denyP that they knew something

bløffe-om__intrObl-oblEqSuInf

de bløffer om å ville nå klimamålene

They bluff aboutI wishing to reach the climate goals

fable-om__intrObl-oblINTERR

de fabler om hvordan de kan oppfinne en ny art

They fabulate about how they can create a new species

avhenge-av__intrObl-oblINTERRwh

utfallet vil avhenge av hvem som kommer

The outcome will depend on who comes

bo__intrObl-oblLoc

de bor her

They live here

bomme-på__intrObl-oblN

han bommer på målet

He missesP the target

spise-på__intrObl-oblN-ACTIVITY

hun spiser på brødstykket

She eats of the bread

fryse__intrObl-oblPRTOFsu

han fryser på ryggen

He freezes on his back

røre-på__intrObl-oblRefl

han rører på seg

He movesP,R

tegne-til__intrOblRais-oblRaisInf

det tegner til å bli uvær

It seemsP to become bad weather

minne-om__intrObl-suAbsinf

å spise reker minner om nedlagte havner

To eat shrimps reminds (one) of abandoned harbours

tyde-på__intrObl-suDECL-oblDECL

at kursen synker tyder på at det verste er over

That the price goes down indicatesP that the worst is over

bero-på__intrObl-suDECL-oblN

at han fikk jobb beror på deg

That he got a job is dueNS to you

komme-an-på__intrPrtclObl-suINTERR-oblINTERR

hvem som får kjøre kommer an på om været blir bra

Who may drive dependsL on whether the weather gets good

avhenge-av__intrObl-suINTERR-oblN

om han får jobb vil avhenge av deg

Whether he gets a job will depend on you

peke__intrPath-suDir-PUREORIENTATION

pilen peker mot øst

The arrow points towards east

trengs__intrPresnt

det trengs en spesialist

There is needed a specialist

sprette__intrPresntDir

det hopper en katt opp i stolen

There jumps a cat into the chair

stå__intrPresntLoc

det står en kommode her

There stands a chest of drawers here

versere-om__intrPresntObl-oblDECL

det verserer rykter om at han kommer

There are circulating rumours aboutT him coming


versere-om__intrPresntObl-oblINTERR

det verserer rykter om hva vi har i vente

There are circulating rumours about what we may expect

versere-om__intrPresntObl-oblN

det verserer rykter om ham

There are circulating rumours about him

vike-unna__intrPrtcl

de viker unna

They shy away

høre-med__intrPrtclExpn-expnAbsinf

det hører med å snakke med pressen

It belongsL to talk with the press

høre-med__intrPrtclExpn-expnDECL

det hører med at pressen stiller opp

It belongsL that the press appears

komme-an-på__intrPrtclOblExpn-expnDECL

det kommer an på deg at denne turen går bra

It dependsL on you that this tour goes well

komme-an-på__intrPrtclOblExpn-expnINTERRwh

det vil komme an på deg hvem som vinner

It dependsL on you who will win

komme-an-på__intrPrtclOblExpn-expnINTERRyn

det vil komme an på deg om turen går bra

It dependsL on you whether this tour goes well

komme-an-på__intrPrtclOblExpn-oblINTERRwh-expnINTERRwh

det vil komme an på hvem som kommer hvem som vil vinne

It dependsL on who comes who will win

komme-an-på__intrPrtclOblExpn-oblINTERRwh-expnINTERRyn

det vil komme an på hvem som kommer hvorvidt vi vil vinne

It dependsL on who comes whether we will win

komme-an-på__intrPrtclOblExpn-oblINTERRyn-expnINTERRwh

det vil komme an på hvorvidt vi deltar hvem som vil vinne

It dependsL on whether we participate who will win

komme-an-på__intrPrtclOblExpn-oblINTERRyn-expnINTERRyn

det vil komme an på hvorvidt vi deltar om de vinner

It dependsL on whether we participate whether they will win

rakke-ned-på__intrPrtclObl-oblAbsinf

de rakker ned på å samle inn penger

They slanderL,P,I collecting money

rippe-opp-i__intrPrtclObl-oblDECL

vi ripper opp i at saken aldri ble etterforsket

We rip up inT the case being never investigated

sjalte-over-til__intrPrtclObl-oblEqSuInf

vi sjalter over til å snakke positivt

We switch over toI talking positively

skvære-opp-i__intrPrtclObl-oblINTERR

vi skværer opp i hvordan vi driver klubben

We set straightNS,L,I how we run the club

komme-an-på__intrPrtclObl-oblINTERRwh

dette vil komme an på hvem som kommer

This dependsL on who comes

komme-an-på__intrPrtclObl-oblINTERRyn

dette vil komme an på om været blir bra

This dependsL on whether the weather is ok

munne-ut-i__intrPrtclObl-oblLoc

elven munner ut i Rhinen

The river runs out in the Rhine

rippe-opp-i__intrPrtclObl-oblN

vi ripper opp i saken

We rip up in the case


knekke-av__intrPrtclObl-oblPRTOFob

kvisten knekker av på midten

The branch breaks off in the middle

se-ut-til__intrPrtclOblRais-oblRaisInf

det ser ut til å regne

It seemsL,P to rain

komme-an-på__intrPrtclObl-suDECL-oblN

at han får denne jobben kommer an på anbefalingene

That he gets the job dependsL on the recommendations

komme-an-på__intrPrtclObl-suINTERR-oblN

om han får jobb vil komme an på deg

Whether he gets the job dependsL on you

se-ut-som__intrPrtclScpr-scSuNrg-scPredprtclN

hun ser ut som en vinner

She looksL like a winner

se-ut-som__intrPrtclScpr-scSuNrg-scPredprtclS

hun ser ut som hun sitter

She looksL like she sits

holde-på__intrPrtcl-SUSTAINEDACTIVITY

han holder på

He keeps on

innløpe__intr-RESULT

pengebidragene innløper

The cash contributions comeC in

synes__intrScprExpn-scAdj-expnAbsinf

det synes fristende å prøve igjen

It seems tempting to try again

virke__intrScprExpn-scAdj-expnDECL

det virker rart at de får komme

It seems strange that they are allowed to come

virke__intrScprExpn-scAdj-expnINTERR

det virker tvilsomt om dette vil virke

It seems dubious whether this will work

se-ut__intrScprPrtcl-scSuNrg-scAdj

han ser syk ut

He looksL ill

fungere-som__intrScpr-scPredprtcl

han fungerer som forsanger

He functions as lead singer

sne-inne__intrScpr-scSuNrgCsd-scAdv

landsbyen sner inne

The village snows under

gå__intrScpr-scSuNrgCsd-scPred

motoren går varm

The motor runs hot

måtte__intrScpr-scSuNrg-scDir

hun må vekk

She must off

synes__intrScpr-scSuNrg-scInf

oppskriften synes å fungere

The recipe seems to work

synes__intrScpr-scSuNrg-scN

han synes en snill prest

He seems like a kind priest

tykkes__intrScpr-scSuNrg-scPred

han tykkes glad

He seems happy

opptre-som__intrScpr-scSuNrg-scPredprtcl

vi opptrer som forsøkspersoner

We figure as research subjects

lyde-som__intrScpr-scSuNrg-scPredprtclN

det lyder som et signal

It sounds like a signal

lyde-som__intrScpr-scSuNrg-scPredprtclS

hun lyder som hun er redd

She sounds as if she is afraid

koste-1__intr-suAbsinf

å slåss alene koster

To fight alone costs

ryktes__intr-suDECL

at noe holdes skjult ryktes

That something is kept secret is rumouredNS

ryke__intr-suDir

de ryker ut av turneringen

They drop out of the tournament

vare-1__intr-suDirTemp

møtet varer til middag

The meeting lasts until dinner


spørs__intr-suINTERR

hva som vil skje nå spørs

What will now happenNS is a question

hensvinne__intrPresnt-RESULT

det hensvinner bevis

There disappears evidence

herde__tr

de herder metallet

They harden the metal

rangere__trAdv

vi rangerer henne høyt

We rank her high

skikke__trAdv-obRefl

han skikker seg vel

He behavesR well

umuliggjøre__trExpnOb-expnAbsinf

de umuliggjør det å finne en fredsplan

They makeNS it impossible to find a peace plan

beklage__trExpnOb-expnCOND

jeg beklager det om du føler deg neglisjert

I regret it if you feel neglected

beklage__trExpnOb-expnDECL

jeg beklager det at du føler deg neglisjert

I regret it that you feel neglected

mangle__trExpnSu-expnAbsinf

det mangler å løse den siste oppgaven

It fails to solve the last task

forundre__trExpnSu-expnCOND

det forundrer meg om du kommer

It astonishes me if you come

gagne__trExpnSu-expnDECL

det gagner oss at de flytter hit

It benefits us that they move here

glede__trExpnSu-expnEqObInf

det gleder dem å bli omtalt slik

It pleases them to be talked about like that

ryste__trExpnSu-expnINTERR

det ryster oss hvor dårlig ledelsen er organisert

It shakes us how badly the leadership is organized

ane__trExpnSu-expnINTERRwh

det aner dem hva som vil skje

It occursP to them what will happen

angå__trExpnSu-expnINTERRyn

det angår dem hvorvidt du kommer

It concerns them whether you come

ta__trExpnSu-obMeas-expnAbsinf

det tar tre timer å gå dit

It takes three hours to go there

anstå__trExpnSu-obRefl-expnAbsinf

det anstår seg å gå i hvitt

It behooves itself to go in white

høve__trExpnSu-obRefl-expnDECL

det høver seg at man går i hvitt

It behooves itself that one goes in white

syne__trExpnSu-obRefl-expnINTERRwh

det syner seg hvem som kommer

It shows itself who comes

syne__trExpnSu-obRefl-expnINTERRyn

det syner seg om det fins håp

It shows itself whether there is hope

ordne__trImpers-obRefl

det ordner seg

It arranges itself

simulere__tr-obAbsinf

han simulerer å være syk

He simulates to be sick

simulere__tr-obDECL

han simulerer at han er syk

He simulates that he is sick

tro__tr-obDECL-obV

vi tror han kommer

We think he comes

tømme__tr-obDir

vi tømmer innholdet ut i elven

We empty the content out into the river


tore__tr-obEqBareinf

hun tør komme

She dares come

unnlate__tr-obEqSuInf

hun unnlater å melde seg

She fails to report

dø__tr-obEventunit

de dør en pinefull død

They die a painful death

eliminere__tr-obINTERR

de eliminerer hvem som kan ha gjort det

They eliminate who may have done it

stevne-for-for__trObl2-obl1N-obl2DECL

vi stevner dem for retten for at de har begått blasfemi

We drag them to court forT having committed blasphemy

stevne-for-for__trObl2-obl1N-obl2EqObInf

vi stevner dem for retten for å ha bespottet gud

We drag them to court forT having committed blasphemy

innklage-til-for__trObl2-obl1N-obl2INTERR

vi innklager dem til domstolen for hva de gjorde med dokumentene

We drag them to court for what they did with the documents

vedde-med-på__trObl2-obl1N-obl2N

jeg vedder et stort beløp med Ola på hesten

I bet a big amount with Ola on the horse

samordne-med-om__trObl2-obRefl-obl1N-obl2Absinf

de samordner seg med hjelpemannskapene om å finne en løsning

They consultR with the rescue forces aboutI finding a solution

rådføre-med-om__trObl2-obRefl-obl1N-obl2INTERR

hun rådfører seg med dem om hvordan man kan behandle soppskader

She consultsR with them about how one can treat fungal damage

rådføre-med-om__trObl2-obRefl-obl1N-obl2N

hun rådfører seg med dem om soppskader

She consultsR with them about fungal damage

overlate-til__trOblExpnOb-expnAbsinf

de overlater det til bøndene å finne en løsning

They leave it to the farmers to find a solution

anspore-til__trOblExpnSu-oblEqObInf-expnDECL

det ansporer ham til å fokusere at han får applaus

It spurs him toI focus that he gets applause

anspore-til__trOblExpnSu-oblEqObInf-expnEqObInf

det ansporer ham til å fokusere å høre tilropene

It spurs him toI focus to hear the applause

anspore-til__trOblExpnSu-oblN-expnEqObInf

det ansporer ham til innsats å høre tilropene

It spurs him to extra effort to hear the shouts

ekvivalere-med__trObl-obAbsinf-oblAbsinf

man kan ikke ekvivalere å trene med å øve

One cannot equivalateI training withI practizing

ekvivalere-med__trObl-obAbsinf-oblN

man kan ikke ekvivalere å trene med øving

One cannot equivalateI training with practice

henstille-om__trObl-obDECL-oblN

de henstiller til dem at det utvises måtehold

They urgeP them that restraint be exercised

overlate-til__trObl-obEqOblInf-oblN

de overlater til bøndene å finne en løsning

They leave to the farmers to find a new solution


velge-fremfor__trObl-obEqSuInf-oblEqSuInf

vi velger å ta kveldstjeneste fremfor å stå vakt

We choose evening service beforeI standing guard

velge-fremfor__trObl-obEqSuInf-oblN

vi velger å ta kveldstjeneste fremfor utmarsj

We choose to take evening service before march

koste-1-på__trObl-obEqSuInf-oblRefl

hun koster på seg å kjøpe en ny bil

She affordsP,R to buy a new car

foreslå-for__trObl-obINTERR-oblN

vi foreslår for turistene hva de bør gjøre

We propose for the tourists what they should do

kurse-i__trObl-oblAbsinf

vi kurser dem i å arrangere foredrag

We educate them inI arranging talks

lønne-for__trObl-oblDECL

vi lønner dem for at de gjorde arbeidet

We compensate them forT doing the work

lønne-for__trObl-oblEqObInf

vi lønner dem for å ha gjort arbeidet

We compensate them forI having done the work

overraske-med__trObl-oblEqSuInf

de overrasker oss med å levere fullt regnskap

They surprise us withI delivering a full account

rettlede-i__trObl-OblINTERR

de rettleder dem i hvordan man setter opp regnskap

They guide them in how one sets up accounts

utplassere__trObl-oblLoc

vi utplasserer dem i skogen

We locate them in the forest

innkalle-til__trObl-oblN

vi innkaller dem til møte

We summon them to a meeting

dunke__trObl-oblPRTOFob

han dunker dem i ryggen

He bangs them in their backs

raske-med__trObl-oblRefl

de rasker med seg eiendelene

They shuffle with them the belongings

ytre-om__trObl-obRefl-oblAbsinf

de ytrer seg om å beholde naturens likevekt

They pronounce themselves aboutI keeping the balance of nature

akke-over__trObl-obRefl-oblDECL

han akker seg over at administrasjonen er inkompetent

He rantsR aboutT the administration being incompetent

akke-over__trObl-obRefl-oblEqObInf

han akker seg over å måtte sitte i flere møter

He rantsR aboutI having to sit in more meetings

avfinne-med__trObl-obRefl-oblINTERR

hun avfinner seg med hva hun får

She resignsR to what she receives

befinne__trObl-obRefl-oblLoc

hun befinner seg i Afrika

She isR in Africa

beflitte-med__trObl-obRefl-oblN

hun beflitter seg med oppgaven

She busies herself with the task

holde__trObl-obRefl-oblPRTOFob

hun holder seg for nesen

She holdsR,P her nose


ha-til__trOblRais-oblRaisObInf

de vil ha ham til å ha løyet

They allegeP him to have lied

overbevise-om__trObl-suDECL

at dødstallene går ned overbeviser oss om denne fremgangsmåten

That the death tolls go down convinces us about this procedure

overbevise-om__trObl-suDECL-oblDECL

at dødstallene går ned overbeviser oss om at denne fremgangsmåten er riktig

That the death tolls go down convinces us aboutT this procedure being right

anspore-til__trObl-suDECL-oblEqObInf

at tilhengerne jubler ansporer ham til å fokusere skikkelig

That the fans are cheering spurs him toI focus properly

overbevise-om__trObl-suDECL-oblINTERR

at dødstallene går ned overbeviser oss om hvilken fremgangsmåte som er riktig

That the death tolls go down convinces us about which procedure is right

anspore-til__trObl-suDECL-oblN

at tilhengerne jubler ansporer ham til innsats

That the fans are cheering spurs him to further effort

avholde-fra__trObl-suEqObInf-oblEqObInf

å høre jubelen avholder oss fra å avbryte

Hearing the cheering keepsI us fromI quitting

forhindre-fra__trObl-suEqObInf-oblN

å være nedvurdert forhindrer dem fra rettferdig dømming

Being underestimated preventsI them from fair judging

forkjøle__tr-obRefl

hun forkjøler seg

She getsR,NS a cold

kare__tr-obRefl-obDir

hun karer seg til uthuset

She scrambles herself to the outhouse

skyve__trPath-obRefl-obDir

hun skyver seg frem

She pushes herself forward

tilfalle__trPresnt

det vil tilfalle oss utbytte

There will accrueP to us gains

smyge__trPresntDir-obRefl

det smyger seg en katt langs muren

There slithers a cat along the wall

oppholde__trPresntLoc-obRefl

det oppholder seg en beboer her

There staysR an inhabitant here

åpne__trPresnt-obRefl

det åpner seg nye muligheter

There openR new possibilities

bolte-igjen__trPrtcl

vi bolter igjen porten

We boltL the gate

ha-til__trPrtclExpnOb-expnDECL

de vil ha det til at målet var ugyldig

They haveP it that the goal was invalid

provosere-fram__trPrtcl-obDECL

vi provoserer frem at det blir et brudd

We provokeL that there becomes a schism

finne-på__trPrtcl-obEqSuInf

de finner på å stenge veiene nå

They decideL to close the roads now

finne-ut__trPrtcl-obINTERR

vi finner ut hvorvidt de har rett

We find out whether they are right


lekse-opp-for__trPrtclObl-obINTERR

han lekser opp for oss hva som var gått galt

He lists up for us what had gone wrong

ale-opp-til__trPrtclObl-oblEqObInf

vi aler den opp til å løpe veddeløp

We breedL it toI do racing

fritte-ut-om__trPrtclObl-oblINTERR

vi fritter dem ut om hvorvidt man kan få finansiering

We ask them out about whether one can get financing

hyre-inn-til__trPrtclObl-oblN

han hyrer dem inn til høyonna

He hiresL them for the harvesting

knekke-av__trPrtclObl-oblPRTOFob

han knekker den av på midten

He breaks it off at the middle

hisse-opp-over__trPrtclObl-obRefl-oblDECL

hun hisser seg opp over at han sviktet

She getsR,NS angry overT his failing

skape-om-til__trPrtclObl-obRefl-oblEqObInf

han skaper seg om til å bli et mønsterindivid

He reshapesL himself toI become a model individual

peile-inn-på__trPrtclObl-obRefl-oblINTERR

hun peiler seg inn på hva de har fore

She findsR,P out what they are planning

skrubbe-opp__trPrtclObl-obRefl-oblPRTOFob

hun skrubber seg opp på kneet

She rubsR,L,P her knee

skape-om-til__trPrtclObl-obRefl-oblN

han skaper seg om til en shapeshifter

He reshapesL himself as a shapeshifter

skitne-til__trPrtcl-obRefl

han skitner seg til

He dirtensL himself

peke-ut-som__trPrtclScpr-obRefl-scObNrg-scPredprtcl

hun peker seg ut som fremragende

She emerges as excellent

anse__trScprExpnOb-scObNrg-scAdj-expnAbsinf

vi anser det ufornuftig å nekte all skyld

We deem it unwise to deny all guilt

anse__trScprExpnOb-scObNrg-scAdj-expnDECL

vi anser det tvilsomt at det vil holde seg slik

We deem it doubtful that it will remain like this

anse__trScprExpnOb-scObNrg-scAdj-expnINTERR

vi anser det tvilsomt hvorvidt det vil holde seg slik

We deem it doubtful whether it will remain like this

anse-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnAbsinf

vi anser det som ufornuftig å nekte all skyld

We deem it as unwise to deny all guilt

anse-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnDECL

vi anser det som tvilsomt at det vil komme bedre forslag

We consider it as doubtful that there will come better proposals

vurdere-som__trScprExpnOb-scObNrg-scPredprtclAdj-expnINTERR

vi vurderer det som tvilsomt hvorvidt det vil komme bedre forslag

We deem it as doubtful whether there will come better proposals

anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnAbsinf

vi anser det for å være mulig å vinne

We considerP it to be possible to win

anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnDECL

vi anser det for å være mulig at vi vinner

We considerP it to be possible that we win

anse-for__trScprExpnOb-scObNrg-scPredprtclInf-expnINTERR

vi anser det for å være åpent hvorvidt vi vinner

We considerP it to be open whether we win


anse-som__trScprExpnOb-scObNrg-scPredprtclN-expnAbsinf

vi anser det som en dårlig taktikk å nekte all skyld

We consider it as a bad tactic to deny all guilt

vurdere-som__trScprExpnOb-scObNrg-scPredprtclN-expnDECL

vi vurderer det som et omen at det regner svart regn

We consider it as an omen that it rains black rain

vurdere-som__trScprExpnOb-scObNrg-scPredprtclN-expnINTERR

vi vurderer det som et åpent spørsmål hvorvidt det vil komme bedre forslag

We consider it as an open question whether there will come better proposals

spandere-på__trScpr-obDECL-scPPrefl

de spanderer på seg at firmaet får ny logo

They afford for themselves that the company gets a new logo

spandere-på__trScpr-obEqSuInf-scPPrefl

hun spanderer på seg å kjøpe en ny jakke

She affords for herself to buy a new jacket

låse-ut__trScpr-obRefl-scObDir

hun låser seg ut

She locks herself out

låse-inne__trScpr-obRefl-scObLoc

hun låser seg inne

She locks herself in

le__trScpr-obRefl-scObNrgCsd-scPred

de ler seg skakke

They laugh themselves merry

føle__trScpr-obRefl-scObNrg-scBareinf

hun følte seg forfalle innvendig

She felt herself decay on the inside

kalle__trScpr-obRefl-scObNrg-scN

han kaller seg et talent

He calls himself a talent

kjenne__trScpr-obRefl-scObNrg-scPred

hun kjenner seg trygg

She feelsR safe

snakke-til__trScpr-obRefl-scPP

han snakker seg til fordeler

He talks himself to advantages

forholde__trScpr-obRefl-scPred

hun forholder seg rolig

She remainsR quiet

konstituere-som__trScpr-obRefl-scPredprtcl

de konstituerer seg som et parti

They constitute themselves as a party

anse-for__trScpr-obRefl-scPredprtclInf

hun anser seg for å være kompetent

She considers herself asI being competent

la__trScpr-obRefl-scSuNrg-scBareinf-suRAISsuMob

stjernen lot seg se

The star let itself see

vise__trScpr-obRefl-scSuNrg-scInf

oppskriften viser seg å fungere

The recipe turnsR,L out to function

fortone__trScpr-obRefl-scSuNrg-scPred

situasjonen fortoner seg ufarlig

The situation appearsR undangerous

presse__trScpr-scObCsd

de presser sitronene flate

They squeeze the lemons flat

stasjonere__trScpr-scObLoc

vi stasjonerer ham i Sydamerika

We station him in South America

synge__trScpr-scObNrgCsd-scPred

hun synger folk glade

She sings people happy

føle__trScpr-scObNrg-scBareinf

jeg føler smitten snike seg inn i meg

I feel the contagion enter into me

la__trScpr-scObNrg-scBareinf-obRAISsuMob

de lot sangen synge

They let the song sing (be sung)

anta__trScpr-scObNrg-scInf

vi antar henne å være kompetent

We assume her to be competent

kalle__trScpr-scObNrg-scN

han kaller dem feiginger

He calls them cowards

erklære__trScpr-scObNrg-scPred

vi erklærer byen friskmeldt

We declare the town healthy

forutsette__trScpr-scPasscmplx

jeg forutsetter arten utryddet

I presuppose the species extinct

rappe-til__trScpr-scPPrefl

hun rapper til seg pengene

She snatchesP,R the money

regne-som__trScpr-scPredprtcl

vi regner dem som farlige

We count them as dangerous

anse-for__trScpr-scPredprtclInf

vi anser henne for å være kompetent

We regardP her to be competent

forekomme__trScpr-scSuNrg-scInf

oppskriften forekommer meg å fungere

The recipe appearsP to me to function

synes__trScpr-scSuNrg-scN

han synes meg en skurk

He seemsP to me a crook

tykkes__trScpr-scSuNrg-scPred

han tykkes meg glad

He seemsP to me happy

forekomme-som__trScpr-scSuNrg-scPredprtcl

han forekommer meg som fortapt

He seemsP to me as lost

koste-1__tr-suAbsinf

å drive storgård koster penger

To run a big farm costs money

bety__tr-suAbsinf-obAbsinf

å slutte betyr å gi opp

To stop means to give up

forarge__tr-suDECL

at han sang foran kirkedøren forarget mange

That he sang before the church door angered many

implisere__tr-suDECL-obDECL

at regnskapet stemmer impliserer at han har snakket sant

That the accounts are correct implies that he has spoken the truth

ankomme__tr-suDir

båten ankommer byen

The boat arrivesP to the city

hoppe__tr-suDir-obLengthunit

han hopper fem meter

He jumps five metres

liste__tr-suDir-obRefl

hun lister seg unna

She sneaksR away

huge__tr-suEqObInf

å høre så vakker sang huger meg

To hear such a beautiful song pleases me

indikere__tr-suINTERR

hvem som kommer vil indikere planen

Who comes will indicate the plan

antyde__tr-suINTERR-obINTERR

hvem som kommer vil antyde hva vi kan vente

Who comes will indicate what we should expect
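
The lexval identifiers in Table 15 appear to share one surface shape: a lemma (with any selected particles or prepositions attached by hyphens), a double underscore, and then a hyphen-separated sequence of frame labels. A minimal Python sketch of splitting an identifier along those lines follows. The names `Lexval` and `parse_lexval` are hypothetical and not part of the catalogue, and the sketch assumes only the separators visible in the table. It does not try to tell argument labels such as `oblN` from situation-type labels such as `RESULT`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Lexval:
    """One lexval identifier, split on its visible separators (illustrative only)."""
    lemma: str               # e.g. "komme-an-på" (lemma plus selected items)
    global_label: str        # first frame label, e.g. "intrPrtclObl"
    other_labels: List[str]  # remaining hyphen-separated labels, in order


def parse_lexval(identifier: str) -> Lexval:
    # The lemma may itself contain hyphens, so split on "__" first.
    lemma, sep, frame = identifier.partition("__")
    if not sep or not lemma or not frame:
        raise ValueError(f"not a lexval identifier: {identifier!r}")
    # Assumption: within the frame part, "-" only separates labels.
    labels = frame.split("-")
    return Lexval(lemma=lemma, global_label=labels[0], other_labels=labels[1:])


if __name__ == "__main__":
    for ident in ("hagle__impers", "komme-an-på__intrPrtclObl-suINTERR-oblN"):
        print(parse_lexval(ident))
```

Splitting on the double underscore before splitting on hyphens matters because lemmas such as `komme-an-på` and `koste-1` contain hyphens themselves.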


Table 16. Lemmas, where relevant with light reflexives indicated by seg, and with selected prepositions or particles indicated

Lemma

(“_” means that a direct clausal argument is also possible)

English basic translation

angre

_, på, for

Repent

anse

som, for

Regard

anta

Assume

arrangere

Arrange

ause

seg opp over

Get upset

avfinne

seg med

Resign to

avhenge

av

Depend

avsky

for

Detest

avtale

_, med, om

Agree

avtalefeste

Fix as an agreement

bable

om

Babble

begripe

seg på

Understand

bekjenne

Confess

beklage

seg over

Complain, regret, apologize

bekrefte

Confirm

bekymre

seg over

Worry

beregne

Calculate

berekne

Calculate

bestemme

seg for

Decide

blogge

om

Blog

bløffe

om

Bluff

blåse

av, i

Sniff

bortvise

for

Expel

botlegge

for

Fine

briske

seg over

Brag

bry

(seg) med, om

Bother

bøtelegge

for

Fine

bøtlegge

for

Fine

domfelle

for

Sentence

drite

i

Not care

drømme

_, om

Dream

ergre

_, seg over

Annoy

erindre

Recall, remember

erkjenne

Realize, recognize

fable

om, over

Fantasize

fabulere

om, over

Fantasize

fantasere

om, over

Fantasize

fastholde

Maintain

fiksere



Fixate

finne

_, ut

Find out

fokusere

_, på

Focus

forakte

for

Loathe

forankre

i

Anchor, base

forarge

_, seg over

Annoy

forbause

Surprise

forberede

på, for

Prepare

forbitre

seg over

Be angry

fordømme

for

Denounce

foreholde

Make aware

forekomme

Seem, occur

foreskrive

Prescribe

foreslå

Suggest, propose

forespeile

Make expect

forestille

seg

Imagine

foresveve

Seem

forklare

Explain

formode

Assume

forsikre

(seg) om

Ensure

forsone

seg med

Reconcile

forstå

Understand

forsvare

(seg), med, mot

Defend

fortenke

i

Disagree

fortvile

over

Despair

forundre

(seg), _, over

Wonder, be astounded

fravike

Depart

fryde

(seg), _, over

Delight

frykte

_, for

Fear

fullrose

for

Compliment

fulltakke

for

Thank

fundere



Muse, ponder

furte

over

Sulk

førebu

(seg), på

Prepare

garantere

_, for

Guarantee

gasse

seg over

Glee

gjenoppleve

Re-experience

godta

Accept

godte

seg over

Glee

gratulere

med

Congratulate

gremme

seg over

Dismay

gremmes

over

Dismay

gruble

på, over

Brood

grunne

på, over

Ponder

grøsse

over

Shudder

gråte

over

Cry, weep

henholde

seg til

Refer

hisse

seg opp over

Get angry

hovmode

seg over

Gloat

hugse

Remember

humre

over

Hum

huske

Remember

hylle

for

Hail

informere

om

Inform

innklage

for

Report, accuse

innprente

Impress

innrømme

Admit

innse

Realize

irettesette

for

Reproach

irritere

_, seg over

Irritate

jamre

jamre seg over

Moan, wail

joike

_, om

Joik

juble

over

Cheer

kjangse



Take a chance

kjempe

om, for

Fight

klage

om, på

Complain

komme



Remember

kompensere

for

Compensate

komplimentere

for

Compliment

krangle

om

Quarrel

kreditere

for

Give credit

kritisere

_, for

Criticize

lure

på/ seg fra, seg til

Wonder/sneak

lyve

om

Lie

lære

_, om

Learn, teach

melde

_, fra om

Report

minne

om

Remind

more

_, seg over

Entertain

nytte

_, seg av

Make use of

oppleve

Experience

overbevise

_, om

Convince

overraske

Surprise

passe

Suit

planlegge

Plan

programfeste

Fix within a program

påstå

Assert

refse

for

Scold

regne

ut, med

Calculate, count

rekne

ut, med

Calculate, count

reminisere

om

Reminisce

respektere

Respect

respondere

_, på

Respond

rettferdiggjøre

Justify

rose

for

Compliment

saksøke

for

Sue (court)

se

_, på

See

si

_, fra om

Say

sjenere

seg for

Be shy

skamme

seg for, seg over

Be ashamed

skjemme

seg for, seg over

Be ashamed

skjemmes

for, over

Be ashamed

skjenne

på for, over

Reprimand

skjerme

for

Protect

skjønne

_, seg på

Understand

skryte

over

Boast

skuffe

_, med

Disappoint

skumle

om, over

Murmur

skvaldre

om

Gossip

skåle

for

Cheer

skåne

for

Protect

slåss

om, for

Fight

småprate

om

Smalltalk

snakke

om

Talk

sole

seg i

Bask

sone

for

Spend a sentence

spotte

for

Mock

steile

over

Resent

stipulere

Stipulate

straffe

for

Punish

stusse

over

Be surprised

stø

seg på

Support

støe

seg på

Support

stønne

over

Sigh

stå

for

Stand

sukke

over

Sigh

sutre

over

Wail

svare

_, på

Answer

svi

for

Suffer

synes

Think

syte

over

Wail

søke

om

Apply

sørge

over

Mourn

ta

opp

Take

takke

for

Thank

tenke

_, på, over

Think

tie

om

Be quiet

tekste

Text

tilgi

Forgive

tilkjennegi

Show, declare

tilstå

Confess

tipse

om

Tip

tiske

om

Whisper

tjene



Earn

trekke

inn, fra

Pull, withdraw

triumfere

over

Triumph

trives

med

Thrive

trygge

(seg) mot

Ensure

trøste

(seg) med

Comfort (continued)

102

L. Hellan Table 16. (continued)

Lemma

“_”, means that a direct clausal argument is also English basic translation possible)

tåle

Suffer, sustain

uffe

seg over

Puff

unngjelde

for

Pay, suffer

unnskylde

_, seg for

Excuse

uroe

_, seg over

Worry

utelate

Omit

utgyde

seg over, seg om

Complain

utgyte

seg over, seg om

Complain

uttale

seg om

Pronounce

vedde

på, om

Bet

vedgå

Acknowledge

vedkjenne

seg

Acknowledge

vedstå

(seg)

Acknowledge

vedta

Acccept

velge

Choose

vemmes

over

Quail

vente

_, på

Wait, expect

verge

(seg) mot, seg for

Protect

verje

(seg) mot, seg for

Protect

verne

(seg) mot, seg for

Protect

vise

_, til

Show

vite

_, om

Know

vitse

om

Joke

vredes

over

Feel wrath

vrøvle

om

Talk nonsense

vurdere

Assess

våse

om

Talk nonsense

ymte

om

Hint

øse

seg opp over

Get upset

åpne

for, opp for

Open

References 1. Beermann, D., Hellan, L.: Enhancing grammar and valence resources for Akan and Ga. In: West African Languages. Linguistic Theory and Communication. Wydawnictwa Uniwersytetu Warszawskiego, Warzawa, pp. 166–185 (2020). ISBN 978-83-235-4623-8 2. Bresnan, J.: Lexical Functional Grammar. Blackwell, Oxford (2001) 3. Calzolari, N., et al. (eds.): Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavík, Iceland. ELRA (2014) 4. Carpenter, B.: The Logic of Typed Feature Structures. Cambridge University Press, Cambridge (1992) 5. Copestake, A.: Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford (2002) 6. Copestake, A., Flickinger, D., Sag, I., Pollard, C.: Minimal recursion semantics: an introduction. J. Res. Lang. Comput. 3, 281–332 (2005) 7. Creissels, D.: Transitivity, valency, and voice. Ms. European Summer School in Linguistic Typology. Porquerolles (2016) 8. Dakubu, M.E.K.: Ga-English Dictionary with English-Ga Index. Black Mask Publishers, Accra (2009) 9. Dakubu, M.E.K.: Ga Toolbox project expanded with Construction Labeling valence information. Ms (2010). https://typecraft.org/tc2wiki/Ga_Valence_Profile 10. Dakubu, M.E.K.: Ga Verbs and their constructions. Monograph ms, Univ. of Ghana (2011) 11. Dakubu, M.E.K., Hellan, L.: A labeling system for valency: linguistic coverage and applications. In: Hellan, L., Malchukov, A., Cennamo, M. (eds.) Contrastive Studies in Valency. John Benjamins Publ. Co., Amsterdam (2017) 12. Dalrymple, M., Lødrup, H.: The grammatical functions of complement clauses. In: Proceedings of the LFG00 Conference. CSLI Publications (2000) 13. Haugen, T.A.: Polyvalent adjectives in Norwegian: aspects of their semantics and complementation patterns. Ph.D. dissertation, University of Oslo (2012) 14. Hellan, L.: Construction-based compositional grammar. J. Logic Lang. Inform. 28(2), 101– 130 (2019). https://doi.org/10.1007/s10849-019-09284-5 15. 
Hellan, L.: Interoperable semantic annotation. In: LREC workshop ISA-16, 6th Joint ACLISO Workshop on Interoperable Semantic Annotation (2020) 16. Hellan, L.: Representing Light Reflexives in a valence resource for Norwegian. Presentation at SLE 2020 (2020) 17. Hellan, L.: Supplementary data for: ‘a valence catalogue for Norwegian. In: Loukanova (ed.) Natural Language Processing in Artificial Intelligence, NLPinAI 2021. Springer (2021). https://doi.org/10.18710/8U3L2U. (Further resources are displayed at: https://typecraft.org/ tc2wiki/NorVal_resources) 18. Hellan, L.: Unification and selection in Light Verb Constructions. A study of Norwegian. In: Pompei, A., Mereu, L., Piunno, V. (eds.) Light verb constructions as complex verbs. Features, typology and function (Series “Trends in Linguistics. Studies and Monographs”), Mouton de Gruyter (to appear) 19. Hellan, L., Dakubu, M.E.K.: Identifying verb constructions cross-linguistically. In: Studies in the Languages of the Volta Basin 6.3. Legon: Linguistics Department, University of Ghana (2010). https://typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction 20. Hellan, L., Bruland, T.: A cluster of applications around a Deep Grammar. In: Vetulani, Z., et al. (eds.) Proceedings from The Language & Technology Conference (LTC) 2015, Poznan (2015). Web server version at http://regdili.hf.ntnu.no:8081/linguisticAce/parse 21. Hellan, L., Beermann, D.: Presentational and related constructions in Norwegian with reference to German. In: Abraham, W., Leiss, E., Fujinawa, Y. (eds.) Thetics and Categoricals. [LA 262], John BenjaminsPublishing Company (2020)

104

L. Hellan

22. Hellan, L., Johnsen, L.G., Pitz, A.: TROLL. Ms, University of Trondheim. (Downloadable at Nasjonalbiblioteket) (1989) 23. Hellan, L., Beermann, D., Bruland, T., Dakubu, M.E.K., Marimon, M.:. MultiVal: towards a multilingual valence lexicon. In: Calzolari, et al. (eds.) Proceedings of LREC 2014 (2014). (web demo: https://typecraft.org/tc2wiki/Multilingual_Verb_Valence_Lexicon and http://reg dili.hf.ntnu.no:8081/multilanguage_valence_demo/multivalence) 24. Hellan, L., Malchukov, A.L., Cennamo, M. (eds): Contrastive studies in Valency. John Benjamins Publ. Co., Amsterdam & Philadelphia (2017) 25. Hellan, L., Beermann, D., Bruland, T., Haugland, T., Aamot, E.: Creating a Norwegian valence corpus from a deep grammar. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds.) Human Language Technology. Challenges for Computer Science and Linguistics. 8th Language & Technology Conferene, LTC 2017. LNCS, vol. 12598. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-66527-2_1. ISBN 978-3-030-66526-5, https://typecraft.org/tc2wiki/Norwegian_ Valency_Corpus) 26. Holen, G.I.: Automatic anaphora resolution for Norwegian. In: Branco, A. (ed.) 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, Lagos, Portugal, pp. 151–166. Springer, Berlin. (2007) https://doi.org/10.1007/978-3-540-71412-5_11 27. Jespersen, O.: Analytic Syntax. Holt, Rinehart and Winston, New York (1969, orig. edition 1937) 28. Jespersen, O.: The Philosophy of Grammar. Routledge, London. (2010, orig. edition 1924) 29. Jørgensen, F.: The semantic representation of location in machine translation. Cand. Philol. thesis, University of Oslo (2004) 30. Korhonen, A., Briscoe, T.: extended lexical-semantic classification of english verbs. In: Proceedings of the HLT/NAACL Workshop on Computational Lexical Semantics, Boston, MA (2004) 31. Levin, B.: English Verb Classes and Alternations. University of Chicago Press, Chicago (1991) 32. 
Loukanova, R.: An approach to functional formal models of constraint-based lexicalized grammar (CBLG). Fund. Inform. 152(4), 341–372 (2017). https://doi.org/10.3233/FI-20171524 33. Malchukov, A.L., Comrie, B. (eds.): Valency Classes in the World’s Languages. Mouton De Gruyter, Berlin (2015) 34. Marantz, A.: Grammatical Relations. MIT Press, Cambridge (1985) 35. Marneffe, M.-C., Manning, C.D., Nivre, J., Zeman, D.: Universal Dependencies. Computational Linguistics (2021). https://doi.org/10.1162/COLI_a_00402 36. Nordgård, T.: Norwegian Computational Lexicon (NorKompLeks). In: Proceedings of NoDaLiDa 1998 (1998) 37. Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. Chicago University Press, Chicago (1994) ´ 38. Przepiórkowski, A., Hajnicz, E., Patejuk, A., Woli´nski, M., Skwarski, F., Swidzi´ nski, M.: Walenty: towards a comprehensive valence dictionary of Polish. In: Calzolari et al. (eds.) (2014) 39. Quasthoff, U., Hellan, L., Körner, E., Eckart, T., Goldhahn, D., Beermann, D.: Typical Sentences as a Resource for Valence. LREC 2020 (2020). http://www.lrec-conf.org/proceedings/ lrec2020/index.html 40. Ross, J.R.: Constraints on variables in syntax. PhD dissertation, MIT (1967) 41. Tesnière, L.: Éleménts de syntaxe structurale. Klincksieck, Paris (1959)

Arabic Computational Linguistics: Potential, Pitfalls and Challenges

Elie Wardini
Department of Asian, Middle Eastern and Turkish Studies, Stockholm University, Stockholm, Sweden

Abstract. Arabic computational linguistics, though still relatively new, is gaining pace rapidly. While the development of tools for computational linguistics in many languages has come a very long way, and progress has been achieved in creating tools for Arabic, Arabic computational linguistics is in need of much attention. It is not obvious that tools developed for, let us say, English will need only minor modifications before they can be applied to Arabic. Computational tools developed for English rely heavily on the enormous work achieved in English linguistics in general, and corpus linguistics in particular. If Arabic computational linguistics is to achieve its potential, it needs to mirror the hard work done in other languages. Researchers in Arabic computational linguistics should also fully understand the nature of the data they are working with. The present article is not a review of the field, but rather a discussion of the potential, pitfalls, and challenges of Arabic computational linguistics. We will discuss what research in this field can contribute to linguistic and pedagogical research on Arabic. We will also discuss issues related to defining what ‘Arabic (language)’ is from a linguistic point of view, the nature of the Arabic script, transcription and transliteration, and finally corpus building.

Keywords: Arabic · Computational linguistics · Natural language processing · Corpus linguistics

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. Loukanova (Ed.): NLPinAI 2021, SCI 999, pp. 105–117, 2022. https://doi.org/10.1007/978-3-030-90138-7_4

1 The Potential of CL/NLP for Arabic

Computational linguistics in Arabic has enormous potential. Arabic is one of the world’s larger languages, has an attested history spanning over 1500 years, and possesses a rich literature covering a very wide range of topics and every genre. Computational linguistics (CL) and natural language processing (NLP) in Arabic have the potential to vastly increase the pace of study of Arabic in every domain.

To date, most grammars and dictionaries of Arabic are not corpus based. Grammar books reproduce earlier outlines of Arabic grammar, which are mostly discussions with the 9th and 10th century grammarians of Arabic. Most Arabic dictionaries take on the role of more or less purist guides, keeping users on the right path and, in the case of neologisms, suggesting ‘good and correct’ words for modern times. Most, if not all, early as well as
modern grammarians have a prescriptive approach. To my knowledge, only Badawi et al., Modern Written Arabic: A Comprehensive Grammar (2004), is a corpus based grammar, and Wehr’s Arabisches Wörterbuch (first published in 1952) is a corpus based dictionary of Modern Standard Arabic. As for the spoken variants of Arabic, there are many (but not enough!) good descriptive grammars and dictionaries. Yet, with notable exceptions (for example, see Salloum and Habash 2014 and Samih 2017, among others), systematic descriptions of spoken variants are not the focus of many working on CL and NLP in Arabic. So, CL/NLP assisted grammars and dictionaries based on large corpora will be very welcome descriptive correctives to the mostly prescriptive existing grammars and dictionaries of Arabic. This is true for all domains of Arabic grammar: phonetic, morphological, syntactic, lexical, etc.

Another field to which CL/NLP in Arabic has the potential to provide important contributions is literature. The field of Arabic literature has a large, international, truly active, and productive collegium of researchers. This research is by definition corpus based. Yet it is only with years of experience, hard work, and large-scale cooperation that researchers gather insight from comprehensive sets of texts. CL/NLP could give researchers the tools to do research on corpora including, for example, all the works of a single author, country, period, genre, etc., in any combination they see fit and in any domain(s) they need: patterns of lexical usage, phraseology, topoi across periods, authors, and genres. The list is endless. One excellent example of this is the Kitab project (see below).

Education, and especially language learning/teaching, could benefit from CL/NLP in Arabic. As we will discuss below, the Arabic script poses special difficulties with regard to language learning/teaching. The morphology of Standard Arabic poses other issues.
With CL/NLP, educators and linguists could identify the more accessible parts of Arabic grammar and, more importantly, identify empirically the difficult parts, distinguishing those that are essential for understanding from those that are redundant. For example, much time is spent in the Arabic grammar classroom on nominal case and verbal moods. There is reason to argue that nominal case is mostly redundant and that verbal mood is generally governed by particles (but not so for the Quran), and thus that both could get less focus in the classroom. CL/NLP could identify which words and constructions, together with their connotations, are the most commonly used and should thus form the basis for textbooks and dictionaries, balancing the focus between the lexicon and grammar based on empirical and comprehensive data. Here too, the list is long.

Lastly, it is worth mentioning that CL/NLP in Arabic would benefit the research field itself. With such a large potential corpus and wide usage, Arabic could be a major contributor to the modeling and development of CL/NLP. As is well known for linguistic theory as well as many other fields, English seems to be the starting point, with correctives quickly following from well known European or Asian languages. Arabic only on rare occasions makes a significant contribution. CL/NLP in Arabic could be an important contributor to the further development of the field.

CL/NLP in Arabic is gaining more and more attention. The number of researchers in the field is indeed growing (see McEnery et al. (eds.) (2019), Eddakrouri’s website and the references below for a handful of examples). Yet there are some essential elements in this growing field that seem to be lacking and some important issues that seem to
be ignored, thus hindering its development (see also Ditters 2013, updated 2017). The present contribution should be read as an attempt by a linguist specializing in Arabic and Semitic languages to identify some issues related to CL/NLP in Arabic. It is our hope that computational scientists will join forces to a greater extent with specialists in the field of Arabic in order to achieve the potential of Arabic computational linguistics.

2 What is ‘Arabic’?

Languages quite often ignore or even defy borders and sociopolitics. The term ‘Arabic (language)’ is a sociopolitical term and not a linguistic term in the narrow sense. I will explain, but first let me emphasize that this is not unique to Arabic. ‘Norwegian (language)’, ‘Swedish (language)’, etc. are also sociopolitical terms, just to use Scandinavia as an example. The classification of a given spoken variant as ‘Norwegian’, ‘Swedish’, etc. is not dependent on the linguistic traits that are characteristic of that given variant, but rather on the geographical location where the variant is used and whether it falls within the borders of Norway, Sweden etc. In other words, in Norway one speaks Norwegian, in Sweden one speaks Swedish, irrespective of how linguistically similar the spoken variants across the border are to each other or how linguistically dissimilar they may be to other variants that fall within the same borders.

From a linguistic perspective the distinction between ‘linguistics’ in the narrow sense and ‘sociopolitics’ is important. It is important to apply the correct tools to a given task. One should apply ‘linguistic’ tools to linguistic data and ‘sociopolitical’ tools to sociopolitical data. The field of sociolinguistics has the important task of bridging the fields of linguistics and sociopolitics, since speakers of language are not one thing or the other, ‘performers of speech acts’ or ‘social actors’, but a complex combination of many aspects. Language indeed is part and parcel of the sociopolitical space. Yet it is important to note that linguistics in the narrow sense is concerned with ‘speech acts’, more precisely the mechanisms by which language encodes meaning and how language variants as linguistic systems function, even when they are affected by sociopolitical or other factors. In the context of computational linguistics and natural language processing, the tools are primarily linguistic by design.
Though one could see the need for and benefits of incorporating sociopolitical or other aspects into the models, CL/NLP primarily aims at identifying and processing linguistic features in language usage. So, applying CL/NLP models without a good grasp of linguistics and a deep linguistic knowledge of the languages being studied inevitably leads to dubious results.

A distinction linguists have more and more shied away from is that between the terms ‘language’ and ‘dialect’. These terms are quite difficult to define linguistically, so linguists usually quote Max Weinreich, who popularized the following statement in the mid-1940s: “Language is a dialect with an army and a navy”, and one may be tempted to add: “a priesthood”. The terms ‘language’ and ‘dialect’ have moved more and more into the domain of sociopolitics and have become less and less useful in the domain of linguistics. Linguists tend to prefer the term ‘(language) variant’. A language variant is a distinct linguistic system which is closer to or further from other variants. Clusters of variants that share many linguistic traits are commonly classified under a certain language, keeping in mind the often ‘fuzzy’ borders across different variant clusters. Linguists are often at
odds as to which traits to include in defining a cluster and what ‘many traits’ actually means in practice. More often than not, different linguistic clusters form a continuum with other clusters, making it difficult to distinguish clear ‘borders’ between clusters.

For comparison, the term ‘Arabic’ is the equivalent of all ‘Scandinavian’ language variants, mainland and insular, from Old Norse to the present. This is not strange, since ‘Arabic’ is spoken from the Atlantic to Central Asia, from the Sub-Sahara to Turkey, including Malta and historically Spain and Sicily. Many researchers in the field of CL/NLP in Arabic seem to want to apply models that are intended to cover the Quran as well as news articles in a modern newspaper; others apply them to spoken variants. Applying CL/NLP to ‘Arabic’ as if it were a single linguistic system is the equivalent of applying computational linguistic models to ‘Scandinavian’ as if it were a single linguistic system. This is not a linguistically viable approach.

What then is Arabic? Arabic is a complex cluster (of clusters) of distinct (language) variants that are closer to each other than to clusters of other languages, such as Aramaic or Hebrew (despite the valid arguments of Retsö 2013). Variants of Arabic are classified in different ways, depending on the linguistic traits that are used. One major classification is the distinction between the literary variant(s), also called Classical Arabic, Fuṣḥā, ʿArabiyya, Standard Arabic, etc., on the one hand and the spoken variants on the other. The linguistic differences between the literary Arabic variants and spoken Arabic variants are significant. The literary variant(s) have not been the mother tongue of any person during the period that they have been attested. The literary variant is what is taught at school and used in most written Arabic. The spoken variants, on the other hand, are the different mother tongues of speakers of Arabic.
The distinction between literary Arabic and spoken Arabic is as old as the earliest attestations of Arabic. In principle, literary Arabic is highly standardized and follows the grammar outlined by early grammarians in the 9th and 10th centuries. Empirically though, literary Arabic is not the monolithic giant it is usually depicted as. Literary Arabic exhibits variation depending on the texts and periods. Modern Standard Arabic is the term used to designate the literary Arabic used from the late 19th century to the present. Are there regional differences in Modern Standard Arabic? CL/NLP would be of tremendous help in plotting the variation of literary Arabic from its earliest stages and mapping the cohesiveness and/or variation it exhibits.

The general attitude among speakers of Arabic is that the ‘dialects’ are of lesser value, and not really ‘languages’. Linguists beg to differ. As for the spoken variants, they exhibit important linguistic differences between the different clusters. Different criteria are used to classify these clusters. One is regional: East (Mashriqi, from Egypt towards the east) vs West (Maghribi, west of Egypt), or smaller regions of the Middle East: Gulf Arabic, Northern Iraqi Arabic, Levantine Arabic, Egyptian Arabic, etc. Another type of classification is based on historical settlement patterns: Bedouin vs Sedentary and, within the latter, Rural vs Urban (see Versteegh 2014 and Palva 2006). All these classifications nevertheless belie the fact that even at the smaller regional level, the spoken variants of Arabic form even smaller distinct sub-clusters. So Egyptian, Iraqi, Syrian etc. Arabic are umbrella names covering distinct language variants used in a certain territory. Remember though, the ‘borders’ between clusters mostly do not follow state borders.

The Arabic used on social media is noteworthy in this context. There is an increasing number of studies on this phenomenon. CL/NLP has great potential to assist in studying these texts. Yet, a note of caution: the Arabic written on social media is generally hybrid in nature. Even when, let us say, an Egyptian person writes on Facebook, the texts produced are, more often than not, not pure Egyptian Arabic. Rather, they are often a mixture of Egyptian Arabic, Modern Standard Arabic and, to varying degrees, English. These texts are as genuine as any other text, and their ‘hybrid’ nature is not a stigma. So, researchers studying Arabic on social media should be aware of the nature of the texts they are working with. Moreover, there are no standards for writing Arabic on social media: sometimes the Arabic script is used, at other times the Latin script. There is very seldom any consistency in the orthography.

Multilingualism is another phenomenon that should be taken into consideration when applying CL/NLP to ‘Arabic’. In our context, speakers, and at times even writers, use more than one variant of Arabic. The most obvious examples are religious terms, which very often are Standard Arabic. Speakers, depending on the context, interlocutors, and circumstances, may use several Arabic variants in one single conversation or text. For example, most novels are written in Standard Arabic. Yet authors do use dialectal terms and phrases, at times even whole passages, in order to achieve certain effects. Dialogues in novels are very often written in a spoken variant. Speakers may use a certain variant to make a certain point. So, researchers must distinguish between, on the one hand, linguistic elements that have been incorporated into a certain variant of Arabic as loans and have become part and parcel of it (for example, the term ‘please’ has been incorporated into many spoken variants of Arabic) and, on the other hand, multilingualism, diglossia, translanguaging, etc.
If a speaker of a certain variant of Arabic uses a Modern Standard Arabic phrase, or even an English phrase, these do not automatically become part of the linguistic system of that variant.

The question of “which Arabic” is therefore an essential aspect that needs to be addressed when modeling CL/NLP for Arabic. One mitigating, though less than optimal, factor is that most CL/NLP models for Arabic are applied mostly to literary Arabic. Well designed, corpus based studies using computational linguistics could be a boost to research not only on the specific literary Arabic variants but, if applied correctly, also on the spoken variants. This requires deep knowledge and awareness among researchers about the nature of Arabic.

3 The Arabic Script

The Arabic script is a so-called abjad script. This means that the script represents mainly the consonants of the language. Long vowels are, as a rule and with some exceptions, marked in this mainly consonantal script with the consonants alif, waw or ya, the so-called matres lectionis. Short vowels and doubled/long consonants are marked with diacritical signs above or below the consonants. These diacritical signs are very seldom used in writing, with the exception of some types of texts such as the Quran or some types of children’s or beginners’ books, where short vowels are fully or partially marked. Fully ‘vocalized’ texts (i.e. where all the diacritics are represented in the script) are visually cumbersome and thus avoided, especially in smaller print. Authors and editors do add
a diacritical sign here or there, often not systematically, with the aim of ‘disambiguation’. The rule, though, is that most written Arabic texts do not include markers for short vowels or doubling of consonants, all of which are phonemically and morphemically significant. The saying goes: “One usually reads in order to understand; in Arabic, one needs to understand in order to read.” This poses difficulties not only for readers of Arabic, but especially for learners of Arabic and, in our context, for applying CL/NLP models to Arabic texts. For example, in the Arabic script the string “ktb” ≈ [kataba ‘he wrote’, kutiba ‘it was written’, kutub ‘books’, kattaba ‘he caused someone to write’, kuttiba ‘he was made to write’, …], or the string “lwm” ≈ [lawm ‘a blame’, lūm ‘blame someone’, …]. Anyone working with RegEx (regular expressions) will realize the consequences of this type of script. CL/NLP modeling in Arabic should anticipate that a search for the expression “ktb” would return an array of possibilities [“kataba”, “kutiba”, “kutub”, “kattaba”, …] rather than a well defined single ‘unambiguous’ item.

In short, the Arabic script is ambiguous. Simplistic attempts at using NLP to ‘disambiguate’ Arabic are equivalent to trying to produce matter from nothing. What you feed into the model is what you get out: if you feed in ambiguity, as the Arabic script does, the result is ambiguity. As an illustration, I entered the string “hlk” into the Madamira disambiguation demo webpage (see [15], see also Pasha et al. 2014). As ‘disambiguation’ Madamira returned the somewhat rare “ahhalaki” ‘he made you.Fem.Sing. competent’ (retrieved July 17th, 2021). The string “hlk” should rather return an array:

CalimaStar, an excellent analyzer (retrieved August 22nd, 2021, see [8], see also Taji et al. 2018), on the other hand, does exactly this: it returns an array of 11 lemmas and 59 analyses. This example reveals at least two issues: the ambiguity of the Arabic script and, just as importantly, the limitations of the training set/methods used by Madamira. More on corpora and training sets below. It is clear that the Arabic script, with its overabundance of homographs, in itself presents a special challenge to CL/NLP models.
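
The one-to-many relation between an unvocalized string and its possible readings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini-lexicon below is hypothetical and contains just a handful of the forms mentioned in this section, and the helper names `skeleton`, `build_index` and `readings` are invented for the example. It is not how Madamira or CalimaStar work internally.

```python
import unicodedata
from collections import defaultdict


def skeleton(word: str) -> str:
    """Drop combining marks (short vowels, shadda, sukun) from an Arabic-script word.

    The string is deliberately not normalized first, so that precomposed
    letters such as alif with hamza are left untouched.
    """
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Hypothetical mini-lexicon of vocalized forms: (Arabic script, transcription, gloss).
LEXICON = [
    ("\u0643\u064E\u062A\u064E\u0628\u064E", "kataba", "he wrote"),
    ("\u0643\u064F\u062A\u0650\u0628\u064E", "kutiba", "it was written"),
    ("\u0643\u064F\u062A\u064F\u0628", "kutub", "books"),
    ("\u0643\u064E\u062A\u0651\u064E\u0628\u064E", "kattaba", "he caused someone to write"),
    ("\u0644\u064E\u0648\u0652\u0645", "lawm", "a blame"),
]


def build_index(lexicon):
    """Group every vocalized form under its unvocalized skeleton."""
    index = defaultdict(list)
    for vocalized, transcription, gloss in lexicon:
        index[skeleton(vocalized)].append((transcription, gloss))
    return index


INDEX = build_index(LEXICON)


def readings(unvocalized: str):
    """Return all known transcriptions for an unvocalized string (an array, not one item)."""
    return [transcription for transcription, _ in INDEX.get(unvocalized, [])]


KTB = "\u0643\u062A\u0628"  # the consonantal string k-t-b
print(readings(KTB))  # ['kataba', 'kutiba', 'kutub', 'kattaba']
```

The point of the sketch is the return type: a lookup on the bare string yields a list of candidates, and choosing among them requires information that is not in the string itself.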

4 Arabic Morphology

In addition to the script itself, Arabic morphology presents a different set of issues. The stems of Arabic nouns, adjectives and verbs permutate. Verbal stems as well as the plurals of nouns and adjectives (the so-called ‘broken plurals’) are the biggest ‘culprits’. As
examples from English, the verbs come and see permutate: “come”, “comes”, “came”; “see”, “sees”, “saw”, “seen” respectively. An example from Norwegian: “bok” ‘book’ is the singular form, while “bøker” is the plural form. In Arabic this phenomenon is pervasive. Thus, in order to identify the lemma behind a certain string in a text, CL/NLP models need to accommodate numerous permutations, again in the form of returning an array: lemma ≈ [“stem 0”, “stem 1”, …]. As an example, I have recorded in the Quran 17 stem permutations for the basic and very frequent word ʾatā ‘to come’ (attested 264 times) and similarly 11 for the word raʾā ‘to see’ (attested 267 times); kitāb ‘book’ is the singular form while “kutub” is the plural form. Admittedly, the words ʾatā and raʾā, with their numerous permutations (due to the hamza and the long vowel in their roots), are on the extreme side. Still, stem permutations need to be given due attention in any CL/NLP model that will be applied to Arabic.
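
The relation lemma ≈ [“stem 0”, “stem 1”, …] can be illustrated with a small reverse index. The following is a minimal sketch, assuming a hand-written table that contains only the English, Norwegian and Arabic examples given above; the names `STEMS`, `invert` and `lemmas_for` are invented for the illustration.

```python
from collections import defaultdict

# Illustrative table: lemma -> attested stem permutations (examples from the text only).
STEMS = {
    "come": ["come", "comes", "came"],
    "see": ["see", "sees", "saw", "seen"],
    "bok": ["bok", "bøker"],
    "kitāb": ["kitāb", "kutub"],
}


def invert(stems_by_lemma):
    """Build the reverse mapping: surface form -> set of candidate lemmas."""
    lemmas_by_form = defaultdict(set)
    for lemma, forms in stems_by_lemma.items():
        for form in forms:
            lemmas_by_form[form].add(lemma)
    return lemmas_by_form


LEMMAS_BY_FORM = invert(STEMS)


def lemmas_for(form):
    """Return every lemma that lists this form among its permutations."""
    return sorted(LEMMAS_BY_FORM.get(form, set()))


print(lemmas_for("came"))   # ['come']
print(lemmas_for("kutub"))  # ['kitāb']
```

The reverse index stores a set of lemmas per form rather than a single lemma, because in unvocalized Arabic one surface string may belong to several lemmas at once.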

5 Arabic Orthography

The orthography of Arabic also presents its own set of issues. Consider the following string “wlsylmnh”:

(Note that this same string could be read as: wa-la-sa-yuʿallimannahu ‘he will surely teach him’). Arabic orthography attaches certain parts of speech together into one ‘word’. So the question of “what is a word?” arises. In the example above the English phrase contains 6 words, while the Arabic contains only one. Linguists, including computational linguists, have come to a practical solution: a word is a string of characters separated by a space or punctuation. While this works reasonably well for tokenization in English and Norwegian, among others, it does not work well for Arabic. Or rather, when an Arabic text is tokenized as is done in most programming languages and their libraries or packages, the results returned will differ significantly from the results for other languages. For example, most NLP libraries/packages include lists of words that can be omitted while processing, e.g. the definite article, pronouns, etc. Most of these ‘words’ in Arabic are clitics, forming part of the ‘word’ in Arabic orthography. So, for example, performing a tf-idf analysis on an English text would process different token types and information than it would on an Arabic text. This issue is added to the above mentioned issues related to script and stem permutations. The words discussed above, ʾatā ‘to come’, raʾā ‘to see’, and kitāb ‘book’, return the following arrays respectively in the case of the Quran:

Moreover, the ratios between lemma, token type, and number of attestations in a vocalized Arabic text differ considerably from those in a non-vocalized text. The former has a larger number of token types/unique forms in relation to lemma and/or attestation, but with less ambiguity; the latter has a smaller number of token types/unique forms in relation to lemma and/or attestation, but with more ambiguity.

Scripts are in their essence approximations, conventions that attempt to represent language. No script is perfect. Linguists rely on transcription (see below) in order to adapt scripts to the different languages they are studying. But even transcriptions need to make compromises. In the case of the Arabic script, due to its origin in Semitic scripts, it has a major drawback: the representation of vowels and consonant doubling with diacritics that are more often than not omitted in written texts. Attempts at reforming the Arabic script inevitably lead, as is the case with the Greek script, to uproar. In the context of CL/NLP, the researcher needs to pay extra attention to this fact. Off-the-shelf models rarely yield adequate results.
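
The tokenization issue discussed in this section, where a ‘word’ defined by spaces and punctuation hides several clitics, can be made concrete with a toy example. The sketch below rests on loud assumptions: it works on the consonantal transliteration “wlsylmnh” rather than on Arabic script, the clitic inventories `PROCLITICS` and `ENCLITICS` are hypothetical and far too small for real use, and the minimum stem length is an arbitrary choice. It only shows that a whitespace tokenizer returns one token where a clitic-aware analysis returns an array of candidate segmentations.

```python
import re

# Hypothetical, deliberately tiny clitic inventories in consonantal transliteration.
PROCLITICS = ("w", "l", "s")  # assumed: wa- 'and', la- 'surely', sa- (future)
ENCLITICS = ("h",)            # assumed: -hu 'him'


def whitespace_tokens(text):
    """The practical definition: a word is a run of characters between spaces/punctuation."""
    return re.findall(r"\w+", text)


def segmentations(token, min_stem=2):
    """Enumerate every way of peeling listed proclitics and one enclitic off a token."""
    results = []

    def peel(prefixes, rest):
        # Option 1: stop peeling proclitics here, without or with an enclitic.
        if len(rest) >= min_stem:
            results.append(prefixes + [rest])
        for enc in ENCLITICS:
            stem = rest[: -len(enc)]
            if rest.endswith(enc) and len(stem) >= min_stem:
                results.append(prefixes + [stem, "+" + enc])
        # Option 2: peel one more proclitic and continue.
        for pro in PROCLITICS:
            if rest.startswith(pro) and len(rest) - len(pro) >= min_stem:
                peel(prefixes + [pro + "+"], rest[len(pro):])

    peel([], token)
    return results


print(len(whitespace_tokens("wlsylmnh")))                  # 1
print(len(whitespace_tokens("he will surely teach him")))  # 5
for analysis in segmentations("wlsylmnh"):
    print(analysis)
```

With these toy inventories the single token yields eight candidate segmentations, one of which is w+ l+ s+ ylmn +h. Which candidate is correct cannot be decided from the string alone, which is why stop-word lists and tf-idf computed over whitespace tokens behave differently for Arabic.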

6 Ambiguity

The Arabic script, as we have seen above, is ambiguous and polysemous. Not only should CL/NLP models account for this; they should also not ‘extract’ more information from these texts than is present in the texts themselves. Indeed, language in general is more often than not ambiguous. One need only read legal texts to see how ‘heavy’ they are with specialized terminology, redundancy, and repetition, all in order to make them as unambiguous as possible. And still, legal texts need scholars to interpret them, due to their legal implications and their inherent and unavoidable ambiguity. Ambiguity is even more present in other, less carefully worked texts. Pronouns are prime examples of ambiguity in language. Consider the following sentence:

In English the string “his apple” is ambiguous. Norwegian, on the other hand, has two different sets of possessive pronouns that translate English “his/her/its”, namely “hans/hennes/dets” and “sin/sitt”, with different antecedents: the subject is the antecedent of “sin/sitt” and a non-subject is the antecedent of “hans/hennes/dets”. In this specific case Norwegian is less ambiguous than English.

Applying CL/NLP to Arabic texts should help us understand these texts better and find correlations within a text or with other texts. CL/NLP are excellent tools for identifying collocations, frequency, syntax, semantic contexts, etc. CL/NLP could and should help parse and process digitalized Arabic texts, including large amounts of text, and produce corpus-based grammars and dictionaries.

7 Transcription and Transliteration

Gone are the days of ASCII. With UNICODE many of the limitations and difficulties encountered due to ASCII are solved. There is no rational or technical reason why scholars today should still use ASCII-inspired transcription or transliteration models (for example the often used Buckwalter model developed in 1988, see [7], see also Habash et al. 2007). UNICODE provides the means to produce consistent transcriptions that are both human and machine readable. UNICODE also handles the Arabic script quite well. Given that the Arabic script is written from right to left, some issues may arise with certain applications or programming languages. UNICODE stores Arabic text in logical order, i.e. the order in which it is read and typed, and leaves it to the displaying application to lay the characters out from right to left. Not all applications are able to handle this and/or diacritics correctly. This can cause problems, especially with formatting. My experience, though, is that applications and programming languages that handle RegEx well will not have noteworthy issues with the Arabic script.

There could still be a need to transcribe the Arabic texts, or at least a portion of them, into a Latin-script-based text. The tags and encoding should at least be in the Latin script. In my experience, due to the ambiguity of the Arabic script and also due to the orthography of Arabic, a combination of Arabic script and transcription produces the best results. In this context, it is nevertheless important to distinguish between two technical terms that are often confused: transliteration and transcription. Often used terms such as ‘romanization’ should be avoided.

Transliteration is a mapping into a Latin-based script of the characters that occur in the Arabic text as they are attested in the text, character for character, diacritical sign for diacritical sign.
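
The character-for-character nature of transliteration can be sketched in a few lines of Python. The mapping table below is a hypothetical fragment chosen for this illustration (three letters and three vowel signs); it is not the Buckwalter scheme nor any published standard, and the function names are likewise assumptions.

```python
import unicodedata

# Hypothetical fragment of a Unicode-based transliteration table (illustration only).
TRANSLITERATION = {
    "\u0643": "k",  # kāf
    "\u062a": "t",  # tāʾ
    "\u0628": "b",  # bāʾ
    "\u064e": "a",  # fatḥa
    "\u0650": "i",  # kasra
    "\u064f": "u",  # ḍamma
}


def transliterate(text, table=TRANSLITERATION):
    """Map each stored character in turn; characters outside the table are left unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def has_diacritics(text):
    """True if the text contains at least one combining mark (Unicode category Mn)."""
    return any(unicodedata.category(ch) == "Mn" for ch in text)


unvocalized = "\u0643\u062a\u0628"                     # the bare consonants k-t-b
vocalized_verb = "\u0643\u064e\u062a\u064e\u0628\u064e"  # with three fatḥa signs
vocalized_noun = "\u0643\u064f\u062a\u064f\u0628"        # with two ḍamma signs

for text in (unvocalized, vocalized_verb, vocalized_noun):
    print(transliterate(text), has_diacritics(text))
```

The unvocalized string comes out as `ktb`, i.e. exactly what is written and nothing more, while the two vocalized strings come out as `kataba` and `kutub`. Choosing between those two readings for the bare `ktb` would already be transcription. The loop also visits the characters in the order in which they are read, regardless of how an application displays them.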
Transliteration is very important since it gives the researcher or the CL model a clear picture of what is actually written in the Arabic text. It answers questions such as: What information is available in the text? Did the Arabic text include diacritics or not? Transcription, on the other hand, falls into the domain of interpretation, especially so in languages with consonantal or abjad scripts such as Arabic or other Semitic scripts. For English one could mention the string “gh”, which in “laugh” is interpreted as /laf/ and in “sigh” as /sī/; or, as noted above, “lwm” could be interpreted as /lawm/ or /lūm/. Transcription surely reduces ambiguity. But one should remember that transcription is the result of the transcriber’s interpretation of the text.

There is a set of factors that need to be accounted for when applying CL/NLP to Arabic. In general, these factors make computational linguistic processing more cumbersome for Arabic. Yet, one ignores or downplays them at one’s own peril. A major rule for any scholar is: Know your data, and know it well. On the other hand, the benefits of well-designed models for Arabic far outweigh the effort required. So, in order for CL/NLP to work for Arabic, there is very important groundwork that needs to be done. This work can be summarized in one phrase: specially designed, tagged and annotated corpora.

8 Arabic Corpora

At the University of Oslo in the early 90s nearly everybody was involved in digitalizing documents. OCR was still in its infancy, scanners had relatively low resolution, and most documents were printed. Still, the push to digitalize archives, from old documents to more recent texts, was in full swing. Some were assigned to scanners. Others were tasked with proofreading the OCRed documents. But most importantly, the linguistically savvy among the participants were assigned the task of encoding and tagging the texts. AI (artificial intelligence), NN (neural networks), ML (machine learning), NLP, etc. were all unknown then, but somehow anticipated. This was perhaps fortunate: without access to AI, researchers who at the time wanted to tap into the growing body of digitalized texts relied on RegEx and specialized software such as Conc or CasualConc, to mention a few. Tags were the reliable means of retrieving and connecting desired data from extensive corpora (see for example the Text Encoding Initiative, [26]). Accurately digitalized texts and well-developed encoding, tagging, and lemmatization not only opened treasure troves to scholars, but also provided the emerging AI, ML, NN, and NLP with reliable datasets for training models. This sequence of events is key. AI, ML, and CL models are only as good as the datasets they are trained on.

For example, I entered the string “kyf ḥlk” on Madamira’s demo website, where they claim they can disambiguate not only Standard Arabic, but also Egyptian (Cairo?) Arabic. The site returned the following:

This was a trick question and maybe somewhat unfair. The string “kyf ḥlk” is Levantine Arabic /kīf ḥālak/ or Modern Standard Arabic /kayfa ḥāluka/ rather than Egyptian (Cairo) Arabic /ez-zayyak/. Yet Madamira still returned a ‘result’, and that result was not: This string is not Egyptian (Cairo) Arabic. Even if Egyptian Arabic and Modern Standard Arabic coexist in Egypt, they are still distinct variants of Arabic (see above). Similarly, Google Translate, which relies heavily on parallel corpora, more often than not fails to return adequate translations into Arabic.

The intention here is not to cast a shadow on any specific project, nor to give an overview or review of existing projects. Rather, the aim is to highlight the important gap in the work on CL/NLP in Arabic: the lack of adequately digitalized and encoded corpora. Many researchers in Arabic computational linguistics seem to want to leapfrog the extensive work that preceded the development and successes achieved in languages such as English, other European languages, or some Asian languages. So what are the challenges that CL/NLP in Arabic faces and needs to overcome before major successes can be achieved?

9 Specialized Corpora

Text corpora are tools. And as tools they are, or should be, designed to fulfill a certain purpose. Most Arabic corpora are rather collections of texts, i.e. text repositories, for example al-Maktaba al-Shamila (see [16]), archive.org (see [2]), al-Waraq (see [29]), etc. Most texts are PDF files that are simply scans of the printed Arabic texts. These texts can be OCRed, but the quality is usually not good. Moreover, OCR of Arabic texts is still in its infancy and much work needs to be done before high-quality digital texts can be produced from scans without extensive human intervention. In the context of CL/NLP, these repositories are not of great value. But in terms of making the Arabic texts available, they are of immense value. Some of these repositories do provide digitalized texts, such as al-Maktaba al-Shamila and al-Waraq, among others, some free and others paid. The main goal is to make Arabic texts available to the general public. The copyright status of some of these sites, though, is unclear.

Some other projects are aimed more at researchers. One such project is Shamela: A Large-Scale Historical Arabic Corpus (see [23], see also Belinkov et al. 2016). They state on their homepage: “We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities”.

A similar project is the Kitab project (see [14]). They are working on providing digital Arabic texts of high quality. The project provides tools for searching the digitalized texts, but most importantly, and something they are excellent at, they provide tools for comparing texts. Using chunks of some 300 words they perform excellent intertextual analysis in order to find relations between texts (see [13]). The choice of 300 words is interesting: with fewer words the results would be very noisy due to the ambiguity of the Arabic script, and with more words the analysis would not yield good results. The Kitab project does provide metadata for the texts in the project. These comprise information concerning the document itself: source, author, genre, etc. To my knowledge the project does not provide texts that are encoded and tagged at the level of the token or phrase.
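
The idea of comparing texts chunk by chunk can be sketched as follows. This is not the Kitab project's algorithm, which is not described here; the sketch swaps in a simple and widely used stand-in, Jaccard overlap of word trigrams between fixed 300-word chunks. All function names, the trigram size, and the threshold are assumptions made for the illustration.

```python
def chunk(tokens, size=300):
    """Cut a token list into consecutive, non-overlapping chunks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def shingles(tokens, n=3):
    """Set of word n-grams occurring in a chunk."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Share of n-grams common to both sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_passages(text_a, text_b, size=300, n=3, threshold=0.5):
    """Return (chunk index in A, chunk index in B, score) for every pair above threshold."""
    chunks_a = chunk(text_a.split(), size)
    chunks_b = chunk(text_b.split(), size)
    hits = []
    for i, chunk_a in enumerate(chunks_a):
        grams_a = shingles(chunk_a, n)
        for j, chunk_b in enumerate(chunks_b):
            score = jaccard(grams_a, shingles(chunk_b, n))
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# Synthetic demonstration: text B opens with the second 300 words of text A.
words_a = [f"w{i}" for i in range(600)]
words_b = words_a[300:] + [f"x{i}" for i in range(300)]
print(shared_passages(" ".join(words_a), " ".join(words_b)))
```

The chunk size plays the role discussed above: very small chunks share n-grams by chance, while very large chunks dilute a reused passage among unrelated material. Fixed chunk boundaries are a simplification of this sketch; a reused passage that straddles two chunks would score lower than one that happens to align.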
Not surprisingly, the Quran has received the most attention, with excellent projects such as quran.com, Corpus Coranicum (see [9]), and The Quranic Arabic Corpus (see [27]), among many others. Special mention goes to the Tanzil.net project, which makes its digital version of the Quran manuscript of Madina (muṣḥaf al-madīna) available for use free of charge with attribution. Most of the Quran projects provide transliterations and/or transcriptions of the Arabic text. All provide more or less advanced search engines. Some, such as The Quranic Arabic Corpus, provide morphological information and analysis of the Quran text. Most of these projects are nevertheless designed to help in reading and studying the Quran, rather than as tools for CL/NLP. Projects like Corpus Coranicum stand out. Corpus Coranicum states on its webpage: “The project offers systematic access to early Qur’anic manuscripts with images and transliterated text. In parallel, a catalogue of variant readings included in the works of the Islamic scholarly tradition is produced”. The availability of the digital Tanzil.net text of the Quran, on the other hand, provides an excellent basis for those who might want to apply CL/NLP to the text of the Quran. But the text as provided by Tanzil.net should be seen as raw data that needs to be encoded and tagged before it is of much use for CL/NLP.

In other words, and without reviewing all the repositories of Arabic texts or Arabic text projects, digitalized Arabic texts are not readily available, and the quality of those that are available varies. Furthermore, properly encoded and tagged texts that are adequate for CL and model training are pressing desiderata.

10 Encoding Texts

Throwing CL/NLP models developed for English at Arabic simply does not work. There is clearly a pressing need for systematic, sustained and long-term efforts to prepare and develop well-designed and well-executed extensive Arabic datasets aimed specifically at CL/NLP. The teams involved should consist of scholars who are well versed in both CL/NLP and Arabic linguistics. To the extent possible, international standards and conventions should be used, but special attention should also be given to the requirements of Arabic. This is a two-pronged approach: 1. Preparing Arabic datasets that are adapted for use in CL/NLP; 2. Developing and adapting models for use with Arabic. These two processes go hand in hand, the one feeding and providing corrections to the other.

There are some projects that are pushing in this direction. The above-mentioned Madamira and CalimaStar are such examples. Another is arTenTen: Corpus of the Arabic Web (see [1], see also Arts et al. 2014). The tendency, though, is to use CL models to tag the texts. arTenTen states on its webpage: “The arTenTen corpus was tagged by the Stanford Arabic parser […]”. Stanford University Arabic Natural Language Processing (see [24]) provides software for CL/NLP in Arabic. For Syriac, one could mention the excellent work done, with limited resources, by the Simtho project at Beth Mardutho (see [6]).

There is no reason why, as in the early 90s for European and other languages, students and others could not be tasked with digitalizing Arabic texts. One or several teams should be able to adapt international guidelines for encoding Arabic texts. Then linguistically savvy participants should be put to the heavy and cumbersome, yet crucially important, work of encoding, tagging and lemmatizing the digitalized texts. Then datasets should be prepared and tested for use with ML and NLP. It is only when this process reaches a certain maturity that more extensive CL/NLP work can be done even on non-encoded texts.
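
What encoding, tagging and lemmatization at the level of the token buy the researcher can be sketched with a small Python example. The record layout, field names, tag values, and the three sample tokens are all hypothetical and chosen for this illustration; they do not follow the Text Encoding Initiative or any existing Arabic corpus. Each record pairs the Arabic-script form with a transcription, in line with the combination recommended in Sect. 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str           # the word as written in the Arabic text
    transcription: str  # the annotator's reading of that form
    lemma: str          # dictionary form, given in transcription
    pos: str            # part-of-speech tag (hypothetical tag set)


# Three hand-annotated tokens; the last two are written identically.
corpus = [
    Token("الكتاب", "al-kitāb", "kitāb", "NOUN"),  # 'the book'
    Token("كتب", "kataba", "kataba", "VERB"),      # 'he wrote'
    Token("كتب", "kutub", "kitāb", "NOUN"),        # 'books'
]


def find_by_lemma(tokens, lemma):
    """All attestations of a lemma, whatever their written form."""
    return [t for t in tokens if t.lemma == lemma]


def readings(tokens, form):
    """All transcriptions the annotators assigned to one written form."""
    return sorted({t.transcription for t in tokens if t.form == form})


print([t.form for t in find_by_lemma(corpus, "kitāb")])
print(readings(corpus, "كتب"))
```

A search on the raw string would conflate the verb and the plural noun and would miss the form carrying the article; the annotated records retrieve both attestations of the lemma and keep the two readings of the identical written form apart.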

References

1. arTenTen: Corpus of the Arabic Web. https://www.sketchengine.eu/artenten-arabic-corpus/
2. Archive.org. https://archive.org
3. Arts, T., Belinkov, Y., Habash, N., Kilgarriff, A., Suchomel, V.: arTenTen: Arabic corpus and word sketches. J. King Saud Univ. Comput. Inf. Sci. 26, 357 (2014). https://doi.org/10.1016/j.jksuci.2014.06.009
4. Badawi, E.M., Carter, M.G., Gully, A.: Modern Written Arabic: A Comprehensive Grammar. Routledge, London (2004)
5. Belinkov, Y., Magidow, A., Romanov, M., Shmidman, A., Koppel, M.: Shamela: A Large-Scale Historical Arabic Corpus (2016)
6. Beth Mardutho. https://bethmardutho.org/simtho/
7. Buckwalter developed in 1988. http://www.qamus.org/transliteration.htm
8. CalimaStar. https://calimastar.abudhabi.nyu.edu/analyzer/
9. Corpus Coranicum. https://corpuscoranicum.de
10. Ditters, E.: Issues in Arabic computational linguistics. In: Owens, J. (ed.) The Oxford Handbook of Arabic Linguistics. Online Publication (2013)
11. Eddakrouri, A.: https://sites.google.com/a/aucegypt.edu/infoguistics/directory/Corpus-Linguistics/arabic-corpora
12. Habash, N., Soudi, A., Buckwalter, T.: On Arabic transliteration. In: Soudi, A., Bosch, A., Neumann, G. (eds.) Arabic Computational Morphology. Text, Speech and Language Technology, vol. 38. Springer, Dordrecht (2007). https://doi.org/10.1007/978-1-4020-6046-5_2
13. The History of the Arabic Book: A New Chapter. Institute for Advanced Study, Near Eastern Studies and Digital Scholarship @IAS Joint Lecture, 4 March 2021. See also https://www.youtube.com/watch?v=Z6KkpF3-73U
14. Kitab project. http://kitab-project.org
15. Madamira demo webpage. https://camel.abudhabi.nyu.edu/madamira/. See also http://innovation.columbia.edu/technologies/cu14012_arabic-language-disambiguation-for-natural-language-processing-applications
16. al-Maktaba al-Shamila. https://shamela.ws
17. McEnery, T., Hardie, A., Younis, N. (eds.): Arabic Corpus Linguistics. Edinburgh University Press, Edinburgh (2019)
18. Palva, H.: Dialect classification. In: Versteegh, C.H.M., Eid, M. (eds.) Encyclopedia of Arabic Language and Linguistics, vol. 1, A-Ed, pp. 604–613. Leiden, Brill (2006)
19. Pasha, A., et al.: MADAMIRA: a fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, pp. 1094–1101 (2014)
20. Retsö, J.: What is Arabic? In: Owens, J. (ed.) The Oxford Handbook of Arabic Linguistics. Online Publication (2013)
21. Salloum, W., Habash, N.: ADAM: Analyzer for Dialectal Arabic Morphology. J. King Saud Univ. Comput. Inf. Sci. 26, 372–378 (2014)
22. Samih, Y.: Dialectal Arabic Processing Using Deep Learning. Inaugural-Dissertation. Heinrich-Heine-Universität Düsseldorf, Düsseldorf (2017)
23. Shamela: A Large-Scale Historical Arabic Corpus. https://arxiv.org/abs/1612.08989
24. Stanford University Arabic Natural Language Processing. https://nlp.stanford.edu/projects/arabics.html
25. Taji, D., Khalifa, S., Obeid, O., Eryani, F., Habash, N.: An Arabic morphological analyzer and generator with copious features. In: Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 140–150. Brussels, Belgium, 31 October 2018
26. Text Encoding Initiative. https://tei-c.org
27. The Quranic Arabic Corpus. https://corpus.quran.com
28. Versteegh, C.H.M.: The Arabic Language, 2nd edn. Edinburgh University Press, Edinburgh (2014)
29. al-Waraq. https://alwaraq.net/
30. Wardini, E.: The Quran: Key Words in Context, vol. 1–5. Gorgias Press, Piscataway (2020)
31. Wardini, E.: The Quran: Key Word Collocations, vol. 1–16. Gorgias Press, Piscataway (2021)
32. Wehr, H.: Arabisches Wörterbuch für die Schriftsprache der Gegenwart. In: Hans, W., Milton, C.J. (eds.) Leipzig. English translation: A Dictionary of Modern Written Arabic (Arabic-English), 4th edn. Considerably enl. and amended by the author. New York: Spoken Language Services (1994)

Author Index

F
From, Asta Halkjær, 25

H
Hellan, Lars, 49

J
Jensen, Alexander Birch, 25

K
Kanovich, Max I., 1
Kuznetsov, Stepan G., 1
Kuznetsov, Stepan L., 1

S
Scedrov, Andre, 1
Schlichtkrull, Anders, 25

V
Villadsen, Jørgen, 25

W
Wardini, Elie, 105

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. Loukanova (Ed.): NLPinAI 2021, SCI 999, p. 119, 2022. https://doi.org/10.1007/978-3-030-90138-7