Programming for Corpus Linguistics: How to Do Text Analysis with Java 9781474470780

The ability to program a computer has become increasingly important in work that involves corpora. Specialised research

121 31 20MB

English Pages 256 [250] Year 2022

Recommend Papers

Cluster Analysis for Corpus Linguistics 9783110363814, 9783110350258

The standard scientific methodology in linguistics is empirical testing of falsifiable hypotheses. As such the process o

157 64 4MB Read more

Cluster Analysis for Corpus Linguistics 9783110350258, 9783110363814, 9783110393170

The standard scientific methodology in linguistics is empirical testing of falsifiable hypotheses. As such the process o

162 90 7MB Read more

Java: How To Learn Java Programming: How To Improve Your Java Coding In 2020/2021: 5 Programming Languages To Learn For stupid In Tech

429 67 521KB Read more

Statistics for Corpus Linguistics 9781474471381

This book in the Edinburgh Textbooks in Empirical Linguistics series is a comprehensive introduction to the statistics c

114 81 32MB Read more

Get Programming with Java

842 161 9MB Read more

Programming With Java (Mastering Programming Languages Series)

Programming With Java: A Comprehensive Guide to Mastering the Modern Programming Landscape Unlock the Power of Java – Yo

114 85 949KB Read more

Doing Linguistics with a Corpus: Methodological Considerations for the Everyday User (Elements in Corpus Linguistics) 1108744850, 9781108744850

Paradoxically, doing corpus linguistics is both easier and harder than it has ever been before. On the one hand, it is e

326 15 2MB Read more

Corpus Linguistics 9781474470865

GBS_insertPreviewButtonPopup('ISBN:9780748611652); Corpus Linguistics has quickly established itself as the leadi

115 20 4MB Read more

Arabic Corpus Linguistics 9780748677382

An overview of current corpus-based research on the Arabic language Takes a perspective-based approach to the practice o

99 83 4MB Read more

Java Programming For Kids ages 12 – 18 : Simple, Concise & Easy guide to Java Programming Language

542 104 4MB Read more

Programming for Corpus Linguistics: How to Do Text Analysis with Java
9781474470780

Author / Uploaded
Oliver Mason

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

Programming for Corpus Linguistics

EDINBURGH TEXTBOOKS IN EMPIRICAL LINGUISTICS

CORPUS LINGUISTICS by Tony McEnery and Andrew Wilson LANGUAGE AND COMPUTERS A PRACTICAL INTRODUCTION TO THE COMPUTER ANALYSIS OF LANGUAGE

by GeoffBarnbrook STATISTICS FOR CORPUS LINGUISTICS by Michael Oakes COMPUTER CORPUS LEXICOGRAPHY by Vincent B. Y. Ooi THE BNC HANDBOOK EXPLORING THE BRITISH NATIONAL CORPUS WITH SARA

by Guy Aston and Lou Burnard PROGRAMMING FOR CORPUS LINGUSTICS HOW TO DO TEXT ANALYSIS WITH JAVA

by Oliver Mason

EDITORIAL ADVISORY BOARD

Ed Finegan University of Southern California, USA Dieter Mindt Freie Universitat Berlin, Germany Bengt Altenberg Lund University, Sweden Knut Hofland Norwegian Computing Centre for the Humanities, Bergen, Norway ]anAarts Katholieke Universiteit Nijmegen, The Netherlands Pam Peters Macquarie University, Australia

If you would like information on forthcoming titles in this series, please contact Edinburgh University Press, 22 George Square, Edinburgh EH8 9LF

EDINBURGH TEXTBOOKS IN EMPIRICAL LINGUISTICS

Series Editors: Tony McEnery and Andrew Wilson

Programming for Corpus Linguistics How to Do Text Analysis with Java

Oliver Mason

EDINBURGH

UNIVERSITY

PRESS

EDINBURGH

UNIVERSITY

PRESS

© Oliver Mason, 2000 Edinburgh University Press 22 George Square, Edinburgh EH8 9LF Transfered to digital print 2006 Printed and boood by CPI Antony Rowe, Eastbonrne

A CIP record for this book is available from the British Library ISBN-10 0 7486 1407 9 (paperback) ISBN-13 9 7807 4861 407 3 (paperback)

The right of Oliver Mason to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

Contents I Programming and Corpus Linguistics

1

1 Introduction 1.1 PROGRAMMING IN CORPUS LINGUISTICS. 1.1.1 The Computer in Corpus Linguistics . . . 1.1.2 Which Programming Language? . . . . . 1.1.3 Useful Aspects ofJava . . . . . . . . . . 1.1.4 Programming Language Classification . . 1.2 ROAD-MAP . . . . . . . . . . . 1.2.1 What is Covered . . . . . . . . . 1.2.2 Other Features of Java . . . . . . 1.3 GETTING JAVA . . . . . . . . . . . . . 1.4 PREPARING THE SOURCE . . . . . . . 1.4.1 Running your First Program . . . . . . . . . . 1.5 SUMMARY . . . . . . . . . . . . . . . . . . . . . . .

3 4 4 5 5 6 8 8 9 10 10

2 Introduction to Basic Programming Concepts 2.1 WHAT DOES A PROGRAM DO? . . . . . . . . 2.1.1 What is an Algorithm? . . . . . . . . . . 2.1.2 How to Express an Algorithm . . . . . . . 2.2 CONTROL FLOW . . ..... 2.2.1 Sequence . . . . . . . . . . . 2.2.2 Choice . . . . . . . . . . . . 2.2.3 Multiple Choice . . . . . . . 2.2.4 Loop . . . . . . . . ..... . ..... 2.3 VARIABLES AND DATA TYPES . . 2.3.1 Numerical Data . . . . . . 2.3.2 Character Data . . . . . . 2.3.3 Composite Data . . . . . . 2.4 DATA STORAGE . . . . . . . . . 2.4.1 Internal Storage: Memory 2.4.2 External Storage: Files . . 2.5 SUMMARY . . . . . . . . . . . .

13 13 14 15 18 19 19 20 21 25 25 26 27 27 27 28 29

. . . . . .

11

12

.....

CONTENTS

3 Basic Corpus Concepts 3.1 HOW TO COLLECT DATA 3.1.1 Typing . . . 3.1.2 Scanning . . 3.1.3 Downloading 3.1.4 Other Media 3.2 HOW TO STORE TEXTUAL DATA . 3.2.1 Corpus Organisation . . . . 3.2.2 File Formats . . . . . . . . 3.3 MARK-UP AND ANNOTATIONS . 3.3.1 Why Use Annotations? . . . 3.3.2 Different Ways to Store Annotations . 3.3.3 Error Correction .. 3.4 COMMON OPERATIONS . 3.4.1 Word Lists .. 3.4.2 Concordances . 3.4.3 Collocations 3.5 SUMMARY . . . . .

31 31 31 31 32 32 33 33 35 37 38 38 42 42 42 43 44 44

4 Basic Java Programming 4.1 OBJECT-ORIENTED PROGRAMMING 4.1.1 What is a Class, What is an Object? 4.1.2 Object Properties . . 4.1.3 Object Operations . . . . . . . . 4.1.4 The Class Definition . . . . . . . 4.1.5 Accessibility: Private and Public . 4.1.6 APis and their Documentation 4.2 INHERITANCE . . . . . . . 4.2.1 Multiple Inheritance 4.3 SUMMARY . . . . . . . . .

47 47 47

5 The Java Class Library 5.1 PACKAGING IT UP . . . . . . 5.1.1 Introduction . . . . . . . 5.1.2 The Standard Packages . 5.1.3 Extension Packages . . . 5.1.4 Creating your Own Package 5.2 ERRORS AND EXCEPTIONS . 5.3 STRING HANDLING IN JAVA 5.3.1 String Literals .. 5.3.2 Combining Strings .. . 5.3.3 The String API . . . . . 5.3.4 Changing Strings: The StringBuffer 5.4 OTHER USEFUL CLASSES . 5 .4.1 Container Classes . 5.4.2 Array . . . . . . . . .

48 48

49 53 55

57 58 58

61 61 61 63 64 64 65 66 66 66

67 77 78 78 79

CONTENTS

5.5

5.6

5.4.3 Vector . . . . 5.4.4 Hashtable .. 5.4.5 Properties . . 5.4.6 Stack .... 5.4.7 Enumeration . . . . . . . . . . . . . . . . . . . . . . THE COLLECTION FRAMEWORK . . . . . . . . . . . . . . . . 5.5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . 5.5.2 Collection . . 5.5.3 Set .. 5.5.4 List . . . . . 5.5.5 Map . . . . . 5.5.6 Iterator .. . 5.5.7 Collections . . . . SUMMARY . . . . . . . .

6 Input/Output 6.1 6.2

6.3

6.4

6.5 6.6

THE STREAM CONCEPT . . . . . . 6.1.1 Streams and Readers . . . . . FILE HANDLING . . . . . . . . 6.2.1 Reading from a File . . . . . 6.2.2 Writing to a File . . . . . . . CREATING YOUR OWN READERS 6.3.1 The ConcordanceReader . . . 6.3.2 Limitations & Problems . . . RANDOM ACCESS FILES . . . . . 6.4.1 Indexing . . . . . . . . . . . 6.4.2 Creating the Index . . . . . . 6.4.3 Complex Queries . . . . . . . SUMMARY . . . . . . . . . . . . . . STUDY QUESTIONS . . . . . . . .

97 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . .

. . . .

. . . .

7 Processing Plain Text 7.1 7.2

7.3

7.4

80 81 83 84 84 87 87 88 90 91 92 93 94 95

SPLITTING A TEXT INTO WORDS 7.1.1 Problems with Tokenisation . . . THE STRINGTOKENIZER CLASS . . . 7 .2.1 The StringTokenizer API . . . . . 7 .2.2 The PreTokeniser Explained . . . . . . . 7 .2.3 Example: The FileTokeniser . . . . . . . 7 .2.4 The FileTokeniser Explained . . . CREATING WORD LISTS . . . . . . . . . . . . . . . 7.3.1 Storing Words in Memory . . . . 7.3.2 Alphabetical Wordlists . . . . . . 7.3.3 Frequency Lists . . . . . . . . . . 7.3.4 Sorting and Resorting . . . . . . SUMMARY . . . . . . . . . . . . . . . .

. . . . 97 . . . . 98 . 99 . 99 101 . . 105 105 116 117 117 . 118 . 128 . 129 . . . . 130

133 133 134 135 135 137 138 139 142 142 143 144 148 150

CONTENTS

8 Dealing with Annotations 801 INTRODUCTION 0 802 WHAT IS XML? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80201 An Informal Description of XML 0 803 WORKING WITH XML 0 0 0 0 0 0 0 0 0 0 80301 Integrating XML into your Application 80302 An XML Tokeniser 0 80303 An XML Checker 8.4 SUMMARY 0 0 0 0 0 0 0 0

153

II

177

o

9

Language Processing Examples Stemming 901 INTRODUCTION 0 0 0 0 0 0 0 902 PROGRAM DESIGN 0 0 0 0 0 903 IMPLEMENTATION 0 0 0 0 0 90301 The Stemmer Class 0 0 90302 The RuleLoader Class 90303 The Rule Class 903.4 The Rule File 0 9.4 TESTING 0 0 0 0 0 9.401 Output 0 0 0 0 9.402 Expansion 0 0 0 905 STUDY QUESTIONS

153 153 154 156 156 157 172 174

179 179 180 181 181 185 188 192 192 192 193 193

10 Part of Speech Tagging 1001 INTRODUCTION 1002 PROGRAM DESIGN 1003 IMPLEMENTATION 0 100301 The Processor 0 100302 The Lexicon 0 100303 The Suffix Analyser 1003.4 The Transition Matrix 10.4 TESTING 0 0 0 0 0 0 0 1005 STUDY QUESTIONS

211

11 Collocation Analysis 11.1 INTRODUCTION 11.1.1 Environment 0 0 0 0 0 110102 Benchmark Frequency 11.1.3 Evaluation Function 11.2 SYSTEM DESIGN 0 11.3 IMPLEMENTATION 0 0 0 11.301 The Collocate 0 0 0 11.302 The Comparators 0

213 213 214 214 215 215 216 216 218

o

o

195 195 196 197 197 203 206 208 210

CONTENTS 11.3.3 The Span . . . . . . . . . . . . . . . . 11.3.4 The Collocator . . . . . . . . . . . . . 11.3.5 The Utility Class . . . . . . . . . . . . 11.4 TESTING . . . . . . . . . . . . . . . . . . . . 11.5 STUDY QUESTIONS . . . . . . . . . . . . .

III

. . . . .

Appendices

219 221 225 225 227

229

12 Appendix 12.1 A LIST OF JAVA KEYWORDS . . . . . . . . . . . . . . . . . . . 12.2 RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 RINGCONCORDANCEREADER . . . . . . . . . . . . . . . . . . 12.4 REFERENCES

231 231 231 231 235

Index

237

To Joanna, for all her help and assistance

1 Introduction

Corpus linguistics is all about analysing language data in order to draw conclusions about how language works. To make valid claims about the nature of language, one usually has to look at large numbers of words, often more than one million. Such amounts of text are clearly outside the scope of manual analysis, and so we need the help of computers. But the computer, powerful though it is, is not an easy tool to use for someone with a humanities background, and so its use is generally restricted to whatever ready-made programs are available at the moment. This book is an introduction to computer programming, aimed at corpus linguists. It has been written to enable corpus linguists without any prior knowledge of programming (but who know how to switch a computer on and off) to create programs to aid them in their research, analysing texts and corpora. For this purpose it introduces the basic concepts of programming using the programming language Java. This language is not only suitable for beginners, but also has a number of features which are very desirable for corpus processing. This book is also suitable for students of computer science, who do have some background in computing itself, and want to venture into the language processing field. The basics of text processing are explained in chapter 3, which should do what chapter 2 does for non-programmers: give enough of an introduction to make practical work in the field possible, before proceeding with examples and applications later on. After having finished this book, you will know how to produce your own tools for basic text processing tasks, such as creating word lists, computing parameters of a text, and concordancing. To give you an idea how to go about developing more complex software we will look at a few example projects: a stemmer, a part-ofspeech tagger, and a collocation program. What this book obviously cannot do is to provide a full discussion of both corpus linguistics and programming. Both subjects are large enough for introductory books in their own right, and in computing especially there are many of them, targeted at all levels. In corpus linguistics there have been some introductory books published recently, and the other books in this series will serve well if you want to delve deeper into the subject. This book brings the two areas together. Hopefully, this book will whet your appetite and will make you want to go further from the foundations provided here.

INTRODUCTION

4

1.1

PROGRAMMING IN CORPUS LINGUISTICS

Corpus linguistics originates from linguistics, as a branch concentrating on the empirical analysis of data. However, the role of the computer is an extremely important one, and without machine-readable corpora corpus linguistics didn't get very far. In order to process such corpora one requires software, i.e. computer programs that analyse the data, and that creates a need for programming skills. This section will briefly discuss some of the problems arising from the current situation of corpus linguistics, arguing that the IT skills crisis that threatens industry also has an effect on the academic world.

1.1.1

The Computer in Corpus Linguistics

The computer is the basic tool of the corpus linguist when it comes to analysing large amounts of language data. But as soon as the task at hand goes beyond the gathering of a small set of concordance lines from a small corpus, one finds that there is no adequate software available for the task at hand. This means that in order to perform an empirical investigation, one either has to resort to doing it manually, or create an appropriate piece of software oneself. Manual data analysis is not only tedious and time-consuming, it is also prone to errors, as the human mind is not suited for dull repetitive tasks, which is what counting linguistic events often amounts to. This, however, is exactly what the computer is extremely good at. Taking the point further it would not be far-fetched to say that corpus linguistics in its current form cannot work without the help of the computer. Some techniques, for example collocational analysis (Church and Hanks, 1990; Clear, 1993), and the study of register variation (Biber, 1988) could simply not be applied manually. In fact, a corpus is really only useful if available in machinereadable form, and the meaning of 'corpus' is indeed coming to imply machinereadable, so that a printed corpus is now the exception rather than the rule (McEnery and Wilson, 1996). This dependency on the computer for all but the most basic work is obviously not without problems. The problem here is that a researcher is limited by the available software and its functionality, and will often have to change the way in which he or she approaches a question in order to be able to use a given program. Developers on the other hand are guided by what they are interested in themselves (due to lack of communication with potential users), or what is easy to implement. The main requirement for corpus linguistics software thus seems to be flexibility, so that a corpus can be explored in new ways, unforeseen by the developer. This, however, can only be achieved with some sort of programming on the user's part. When there is no software available which is suitable for a given task there are two main solutions: users can develop the right software themselves, or get a developer in to do it. The problem with the do-it-yourself approach is that it requires programming expertise, which can be difficult to acquire. Programming itself is not only time-consuming, but it can also be rather frustrating when it the resulting program does not work properly and needs to be debugged. It is generally worth checking to what degree a computer needs to be involved in a project, see Barnbrook (1996): quite often the effort required to adjust the problem in a way to be able to use a

PROGRAMMING IN CORPUS LINGUISTICS

5

computer is far higher than doing it manually, especially when it does not involve repetitive or large-scale counting tasks. 1.1~2

Which Programming Language?

A computer program is written in a special code, the so-called programming language. There are several levels of machine instructions, and the programmer usually uses a higher level language, which gets translated into the actual machine code, which depends on the computer's processor. Programming has gone a long way since the middle of the last century, when it basically consisted of configuring switches on large front panels of computers which needed huge air-conditioned rooms to be stored in. Instead, writing a computer program nowadays is not very different from writing an academic paper: you sit down in front of the screen and type into a text editor, composing the program as you go along, changing sections of it, and occasionally trying out how it works. Today there is a multitude of programming languages around, with Java being one of the most recent developments. All of these languages have been developed for a specific purpose, and with different design goals in mind. There are languages more suitable for mathematical computing (like Fortran), Artificial Intelligence research (like Lisp) and logic programming (like Prolog). Thus it is the first task to find a language that is well suited for the typical processes that take place in corpus analysis. However, there are other aspects as well: the language should be reasonably easy to learn for non-computing experts, as few corpus linguists have a degree in computer science and it should be available on different platforms, as both personal computers and larger workstations are typically used in corpus processing. For this book Java has been chosen. Java is mainly known for being used to 'enhance' web pages with bits of animations and interactivity, but at the same time it is a very powerful and well-designed general purpose programming language, an aspect that has contributed to making it one of the most widespread languages in a relatively short time. In the following sections we will have a brief look at those aspects of Java that make it particularly useful for corpus analysis, before you will find a detailed outline of the road ahead.

1.1.3 Useful Aspects of Java The first feature of the Java language that is particularly suitable for working with corpus material is its advanced capability to deal with character strings. While most other languages are restricted to the basic Latin alphabet, with possibly a few accented characters as extensions, Java supports the full Unicode character set (see http: I /www. unicode. org /).That means that it is quite easy to deal with languages other than English, however many different letters they should have. This includes completely non-Latin alphabets such as Greek, Cyrillic and even Chinese. Apart from being able to deal with different character sets without problems, Java itself has a very simple and straightforward syntax, which makes it easy to learn. The instructions, the so-called source code, are easily readable and can be understood quickly, as the designers of the language decided to leave out some powerful but cryptic language constructs. The syntax originates from the programming language

6

INTRODUCTION

C, but it is only a carefully chosen subset, which makes it easier to learn and maintain programs. The loss of expressional power is only marginal and not really relevant for most programming tasks anyway. A more important aspect is that Java is object-oriented. We will discuss in detail what this means in chapter 4, so for now all we need to know is that it allows you to write programs on a higher level of abstraction. Instead of having to adjust yourself to the way the computer works, using Java means that you can develop programs a lot faster by concentrating on the task at hand rather than how exactly the computer is going to do the processing. Java combines elements from a variety of older programming languages. Its designers have literally followed a pick-and-mix approach, taking bits of other languages that they thought would combine into a good language. It also seems that the language was targeted at a general audience, rather than specialist programmers. Several aspects have been built into the language that aid the developer, such as the automatic handling of memory and the straightforward syntax. It also comes with an extensive library of components that make it very simple to build fairly complex programs. When a computer executes a program, a lot of things can go wrong. This can be due to external data which either is not there or is in a different format than expected, or the computer's disk could be full just when the program wants to write to it, or it tries to open a network connection to another machine when you have just accidentally disconnected the network cable. These kinds of errors cannot be dealt with by changing the program, as they happen only at run time and are not predictable. As a consequence, they are difficult to handle, and catering for all possible errors would make programming prohibitively complicated. Java, however, has an error handling mechanism built into it, which is easy to understand, adds only a small overhead to the source code, and allows you to recover from all potential errors in a graceful way. This allows for more robust programs that won't simply crash and present the user with a cryptic and unhelpful error message.

1.1.4 Programming Language Classification There are, of course, alternatives to Java, mostly the languages that it took pieces from in the first place. However, none of the alternatives seems to be quite as good a mixture as Java, which is one reason for Java's unparalleled success in recent years. In a rather short period of time Java matured into a language that is being used for all kinds of applications, from major bank transaction processing down to animating web pages with little pictures or bouncing text. In this section we will have a brief look at what kind of a language Java is, and how it can be placed in a general classification of programming languages. This will help you to understand more about how Java works. One feature with which programming languages can be classified is the way their programs are run on the computer. Some programs are readily understood by the machine and run just by themselves. These are compiled languages. Here the source code is processed by a translation program that transforms it into a directly executable program. The translation program is called a compiler. Compiled Ian-

PROGRAMMING IN CORPUS LINGUISTICS

7

guages require some initial effort to be translated, but then they are fairly speedy. Thus, compiled languages are best suited for programs that are run very frequently, and which don't get changed too often, as every change requires are-compilation · and the respective delay while the compilation takes place. This also slows down development, but an added bonus is that a lot of errors in the source code can be detected by the compiler and can be corrected before the program actually runs for real. The other major class are interpreted languages. Here the source code is interpreted by another program as the program is executed. This other program is the interpreter. This means that execution time is slower, as the same source code has to be re-interpreted every time it is executed, which is not the case with compiled languages, where this happens only once, during the translation. On the positive side it means that the programs can be run immediately, without the need for compilation, and also that they are not dependent on the actual machine, as they run through the interpreter. The interpreter is machine dependent, but not necessarily the program itself. Development time is quick, but some errors might be unnoticed until the program actually runs, as it is not checked in advance. This distinction perfectly matches natural languages: if you want to read a book in a language you can't speak, you wait until someone translates the book for you. This might take a while, but then you can read it very quickly, as the translation is now in your own language. On the other hand, if it is not worth getting it translated (if you only want to read it once, for example) you could get someone to read it and translate it as you go along. This is slower, as the translator needs to read it, translate it, and then tell you what it means in your language. Also, if you want to go back to a chapter and read it a second time it will have to be translated again. On the positive side you don't have to wait for the translation of the whole book to be completed before you can start 'reading'. Java has elements of both these types of languages: the source code is initially compiled, but not into a machine-specific executable program, but into an intermediate code, the so-called byte code. This byte code is then interpreted by the Java run-time interpreter. This interpreter is called the Java Virtual Machine (JVM), and it provides the same environment to a Java program, whatever the actual type of the computer is that it runs on. Arguably you get the worst of both worlds, the delay during the compilation and the slow execution speed at the same time. However, it works both ways: the byte code can be interpreted muchmore quickly than actual source code, as it is already pre-translated and error checked. It also keeps the portability, as the byte code is machine independent, and the same 'compiled' program works on a Windows PC as it does on a Unix workstation or an Apple Macintosh. When you are working with Java, you first write your program in the form of its source code. Then you compile it, using the command j avac, which results in a pre-compiled byte code, the so-called class file. If you want to execute that program, you then start the Java run-time environment with the java command, which will run the commands by interpreting the byte code.

8

INTRODUCTION

1.2 ROAD-MAP In a book like this it is obviously not possible to cover either all aspects of Java programming or all aspects of corpus linguistics, so some decisions had to be made as to what would make it into the book and what was to be left out. This section briefly lists what you will find in the rest of the book, and what you won't find. Java has quickly grown into a very comprehensive programming language, and tomes with thousands of pages are being produced to describe it. However, a lot of this is not really relevant for the corpus linguist, but that still leaves a lot of useful elements for text and corpus analysis.

1.2.1

What is Covered

In the following chapter, Introduction to Basic Programming Concepts, there is a brief introduction to computer programming in general. It is intended for readers who have not done any programming so far, and will provide the foundation for later chapters. Basic Corpus Concepts, the third chapter, introduces the basics of corpus linguistics. By reading both of them, a framework is established to bring corpus linguistics and programming together, to see how programming can be utilised for linguistic research.

The next chapter, Basic Java Programming, introduces the Java programming language, building up on the concepts introduced in the first chapter. It will show how they can be realised in actual programming code. In The Java Class Library we then have a look at some of the standard classes that you can use in your own programs. Reusing existing classes greatly speeds up program development, and it also reduces the scope for errors, as existing classes are less likely to contain a lot of undetected bugs. Then, Input/Output gets you going with the single most important task in corpus linguistics. After showing how to read texts and print out text we will apply our newly acquired skills to investigate several different ways of creating a concordance. Afterwards, in chapter 7, we look into processing full texts instead of just single words. Identifying words in a stream of input data is one of the fundamental processing steps, on which a lot of other tasks depend later on. Most corpora nowadays contain annotations. How to process annotations in a corpus is the topic of chapter 8. Here we will look at mark-up, concentrating on XML, which is not only becoming the new standard for the Web, but as a simplified variant of SGML is also very relevant for corpus encoding. And finally we will investigate three case studies, which are self-contained little projects you might find useful in your day-to-day research. Taking up threads from two companion books in the series, we will implement a stemmer from a brief description in Oakes' Statistics for Corpus Linguistics, a parts-of-speech tagger described in a study question of McEnery & Wilson's Corpus Linguistics, and see how we can compute collocations as described in Barnbrook's Language and Computers. All these case studies start off from a simple description of how to do it, and you will end up with a working program that you can start using straight away.

ROAD-MAP

9

1.2.2 Other Features of Java As mentioned before, Java has developed (and still is developing further) into a vast language with extensions for all kinds of areas. However, a lot of those will not be directly relevant for your purposes, and in this section we will touch on some features which had to be left out of this book, but might be relevant for your further work. If you want to learn more about those features, I suggest you get a general Java book, like Horton (1999). Throughout this list you will find one recurring theme, namely that system specific operations are generalised to a more abstract level which is then incorporated into the language. A lot of operations which need to access external hardware components (such as a graphics card, or a network adapter) or other software modules (such as a database) are specific to a certain machine or operating system. However, by creating an abstract layer they can be treated identically across platforms, with the Java run-time environment filling the gap between the common operations and the way they actually work on a given computer. Graphics Graphics are another of Java's strong points. Dealing with graphical output is very platform specific, which means that there is no general support for it if your development environment needs to be portable. The developers of Java, however, have designed an abstract toolkit that defines graphical operations and renders them on the target platform. In the first version, this scheme suffered from deficiencies and subtle differences between the way buttons and other elements work in Windows and on the X Window system on Unix, but the developers chose a slightly different approach which works out much better: instead of relying on the operating system to provide user interface widgets, the new toolkit (called Swing) only makes use of basic operations such as drawing lines and providing windows, and all widgets are realised by the toolkit via those primitives. By leaving aside the 'native' widgets, user interface behaviour is now truly consistent across all platforms, and it is possible to program sophisticated user interfaces in a portable way. There is a large number of components, from labels, buttons and checkboxes to complex structures such as tables and ready-made file choosers. All these work on both proper applications as well as applets, even though there could be problems as browser implementations of Java always lag behind a bit and might not fully support Swing without needing an upgrade. Databases Similar to graphical operations, all database-related functionality has also been abstracted into a generalised system, the so-calledJDBC (Java Database Connectivity). Here the programmer has access to a set of standard operations to retrieve data from databases, regardless of what database is actually backing up the application. To provide compatibility there is a set of drivers which map those standardised operations onto the specific commands implemented by the database itself. Together with the portable graphic environment this makes it easy to build graphical interfaces to database systems, which will continue to work even when the

INTRODUCTION

database itself is switched to another version, as long as there is a driver available for it. Networking Java is closely associated with the Internet. This is mainly because it is used as a programming language for applets, small applications which run within a web browser when you visit a certain page, but it also has a range of components which allow easy access to data across networks, just as if it was on the local machine. Opening a web page from within an application is as easy as reading a file on your local hard drive. Furthermore it is possible to execute programs on other machines on a network, so that distributed applications can be implemented. By utilising the capacity of multiple computers at the same time very powerful and resource-intensive software can be developed.

1.3 GETTINGJAVA Unlike a lot of other languages, compilers for Java are freely available on the Internet. You can download the latest version from the Sun website, see the resources section in section 12.2 on page 231. There are two different packages available, one called JRE and one called JDK or SDK. The JRE is the 'Java Runtime-Environment'. You need this if you just want to run programs written in Java. It contains the standard class library plus a virtual machine for your computer. The JDK or SDK is the 'Java Development Kit', or the 'System Development Kit'. Sun, as the distributor, has changed the name of that package from version 1.2 onwards. This package contains all you need to develop programs in Java, and this includes the run-time environment as well. In fact, the Java compiler, j avac is itself written in Java. If you want to compile your own programs in Java you will need to get this package. You also need an editor, in order to write the source files. You can use any text editor you want, as long as you can produce plain text files. Some companies offer so-called IDEs (Integrated Development Environment), sometimes with their own optimised compilers. As long as they are supporting the full Java standard you can use one of those for development.

1.4 PREPARING THE SOURCE There are a few more points to notice before we can start to write programs in Java; these are only minor details but nevertheless important, and remembering them can save you a lot of struggling with compiler error messages at later stages. The source code of a class has to be in a plain text file. It is not possible to write a program in a word processing package and then compile it directly. You would either have to save it as plain text, or, what would be more sensible to start with, use a separate text editor for writing them. Java source files have to have the extension . java, so files must be saved accordingly. The Java compiler will reject any files

PREPARING THE SOURCE

11

that have other extensions. Also, the filename needs to match that of the class that is defined in it. Each class thus has to be in a separate file, unless it is not publically accessible from within other classes. Even if you could define several classes in one source file it would make it more difficult to find the relevant file when looking for the definition of a particular class. It makes matters much easier if the class Phrase can always be found in the file Phrase. java. The Java compiler will transform this into a binary file called Phrase. class which can be used by the JVM.

1.4.1

Running your First Program

Before we start with the theory, here is a simple example that you can try out for yourself. So far we are still fairly limited as to what language constructs we can use, so it necessarily has to be a small example. In the following chapters you will see what other classes Java provides to make programming easier, and how to deal with input and output. The class we are developing now simply echoes its command-line arguments. That means you call it with a number of parameters, for example java Echo This is a test of the Echo class and as a result the class will print This is a test of the Echo class on the screen. The first two parts of the above command-line are the call to the Java interpreter and the name of the class to execute, and they are not counted as parameters. When you try to execute a class, the Java interpreter loads the corresponding compiled bytecode into memory and analyses it. It looks for a method of the form public static void main(String args[]) This means there has to be a method called main ( ) which takes as argument an array of Strings, and it also has to be declared public and static. This is so that it can be accessed from the outside, and also without there being an instance of the object available. The array of Strings will be the command-line parameters, and with this knowledge we can now code the Echo class: I*

*

Echo.java

*I public class Echo { public static void main{String args[]) System. out .println (""The command-line arguments are: ••); for(int i = 0; i < args.length; i++) { System.out.println(i+". "+args[i] );

II end of class Echo

The definition of a class is introduced by an access modifier followed by the keyword class. Most classes you will deal with will be public, but again there are more fine-grained options. We will ignore these for the time being as they are

12

INTRODUCTION

not relevant to the material in this book. After the name of the class the definition itself is enclosed in curly brackets. This is called a block of lines, and it is a recurring pattern in Java. You will find blocks in several other places, for example in method definitions. Just save this listing into a file called Echo. java, compile it with j avac Echo . java and then run it in the way described above. You will see that it behaves slightly differently, in order to make it a bit more interesting. Instead of simply echoing the parameters as they are, they are put into a numbered list on separate lines. When the class gets executed, the JVM locates the main ( ) method and passes the command-line parameters to it. Here we print out a header line before iterating through all the parameters. Note how we use the length field of the array args in the condition part of the for-loop. Inside the loop we have access to the loop counter in the variable i, and in the print statement we simply put the variable, a literal string containing a full stop and a space character, and the current parameter together into one string which will be put on the screen.

1.5 SUMMARY In this introductory chapter we first discussed the role of the computer in corpus linguistics, emphasising the fact that it becomes more and more relevant to be able to program when working with computer corpora. The preferred solution to this is to learn Java, a language that is suitable for text and corpus processing, yet easy to learn at the same time. Furthermore, Java is machine independent, which means that you can run Java programs on any platform. This is especially important when working with different machines at home and at work; a less powerful computer might not be as fast or might not be capable of processing large amounts of data, but at least you won't have to change your programs. This is due to the hybrid nature of Java, which is halfway between a compiled and an interpreted language. After looking at a road-map of what you will find in this book, several other features of Java have been introduced briefly, so that if you want to extend your knowledge of programming towards graphical user interfaces, databases, or networking facilities, you know that you can still use Java, which saves you from having to learn another programming language. And finally, we have seen how to acquire the necessary tools that we need to develop our own programs in Java. This includes the development kit, which is available for free downloading. We have also written our first Java program. In the next chapter we will have a look at programming in general. You will be made familiar with basic concepts that are necessary to begin programming, and these will be applied throughout the rest of the book. The emphasis here is on the bare necessities, keeping in mind that ultimately most readers are not interested in computer science as such, but only as a means to aid them in their linguistic research.

2 Introduction to Basic Programming Concepts This section is for beginners who have not done any programming yet. It briefly introduces the algorithm, a general form of a computer program, and ways to express tasks in algorithmic form. Plenty of examples from non-programming domains are used to make it accessible to newcomers.

2.1

WHAT DOES A PROGRAM DO?

A computer program is simply a list of instructions which the machine executes in the specified order. These instructions have to be in a code that the machine can interpret in some way, which is roughly what a computer language is. In order to make it easier for humans to write programs, special programming languages have been developed which are more abstract and thus on a level closer to human thinking than the low-level manipulation of zeroes and ones that the computer ultimately does. One can go even further and design programs in an abstraction from programming languages, which would be the equivalent of jotting down an outline of a book or paper before filling in the gaps with prose text. This outline is called an algorithm in computing terminology. An algorithm is very much like a recipe, in that it describes step by step what needs to be done in order to achieve a desired result. The only problem is that the computer cannot cook, and will therefore be extremely stupid when it comes to interpreting the recipe: it does so without trying to make sense of it, which means it does literally what you tell it to do. You will find that while you're new to programming it usually does something that you didn't want it to do, mainly because it is so difficult for a programmer to think in the same simplistic and narrow-minded way as a computer does when executing a program. And that is also one of the main 'skills' of programming: you have to think on an abstract level during the design phase, but when it comes to coding you also need to be able to think in a manner as straightforward and pedantic as a computer. In the following sections we will have a closer look at what an algorithm is and how we can most easily express it. In the discipline of software engineering a variety of methods have been developed over time, and we will investigate some of them which might be useful for our purposes. The most important point here is that we don't care too much about the methods themselves, as we only view them as a means to the end of creating a computer program.

14

2.1.1

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

What is an Algorithm?

Suppose you are standing in front of a closed door, which we know is not locked. How do we get to the other side of the door? This is something most people learn quite early in life and don't spend much time thinking about later on, but the computer is like a little child in many ways, and it does not come with the knowledge of how to walk through a door. If you want to teach a robot to move around in the house, you need to describe the task it should perform in a list of simple steps to follow, such as: take hold of the door handle, press it down, pull or push door and move through the resulting gap. Unless your robot is already a rather sophisticated one, you will now have to tell it what a door handle looks like and how it can be pressed down. Ultimately this would be described in more and more detail, until you have arrived at the level of simple movements of its hands or their equivalents. Of course you will have to describe a lot of movements only once, like 'move arm downwards' and you can refer to them once they have been defined. Such repeated 'procedures' could be things like gripping a handle, pushing an object, and eventually opening doors. As a computer has no built-in intelligence, it needs to be told everything down to the last detail. However, at that rate writing a program to accomplish a certain task would probably take longer than doing it yourself, and there are a lot of basic operations that are required every time. In order to make programming easier, programming languages with higher levels of abstraction have been developed. Here the most likely tasks you would want to do are readily available as single commands, like print this sentence on the screen at the current position ofthe cursor. You don't have to worry about all the details, and thus high-level languages are considerably easier to learn and speed up development, with less scope to produce errors, as the size of programs (measured in the number of instructions written by the programmer) is reduced. If you want to know, for example, what the length of a word is, you would have to go through it from beginning to end, adding one to a counter for each letter until you have got to the last letter of the word, making sure that you don't stop too early or too late. In Java, however, there is a single instruction that computes the length of a piece of text for you, and you haven't got to worry about all the details of how this is accomplished. High-level languages are one step towards easier communication with computers. Of course we are still a long way away from Star Trek-like interaction with them, but at least we no longer have to program machines on the level of individual bits and bytes and operations like load register X from memory address Y, shift accumulator content to the right or decrement accumulator and jump to address Z if accumulator is not zero. Programming, then, is the formulation of an algorithm in a way that the computer can understand and execute. The first step in this is to be clear about what it is you want the computer to do. Unless you know exactly what you want you will not be able to get the computer to do it for you. From here there are several ways to continue, and a whole discipline, Software Engineering, is concerned with researching them. For our purposes we have to choose a method which is a good compromise between cost and benefits; we don't want to have to spend too much time learning how to

WHAT DOES A PROGRAM DO?

15

do this, but would still like to profit from using it. After all, programming is only a means to an end, which is exploring corpora.

2.1.2

How to Express an Algorithm

A very natural and intuitive way to develop programs is called stepwise refinement (Wirth, 1971). Here you start with a general (and necessarily vague) outline of the algorithm, going over it multiple times. Every time each step is formulated in more detail, until you reach a level which matches the programming language after a number of iterations. This style of programming is called top-down programming, as you start at a high level of abstraction working your way 'down' to the level of the computer. The opposite would be bottom-up, where you start at small low-level routines which you combine into larger modules, until you reach the application level. Most of the time you will actually use both methods at the same time, developing the program from both ends until they meet somewhere in the middle. In order to describe the algorithm we will be using pseudo-code, a language that is a mixture of some formal elements and natural language. At first, the description will contain more English, but with subsequent iterations there will be less and less of it, and more and more formal elements. As an example consider the task of creating a sorted reverse word list from a text file. This is a list of all words of a text, sorted backwards, so that words with the same ending appear next to each other. Such a list could be used for morphological analysis, or an investigation of rhyming patterns. We start off with the following description: 1 2 3 4 5

read each word from the text reverse the letters in the word sort the list alphabetically reverse the letters back print out the sorted list

By reversing the letters of the words we can simply sort them alphabetically, and then reverse them back afterwards. This means we can use existing mechanisms for sorting lists instead of having to create a special customised one. That list of instruction would be sufficient for a human being to create such a list, but for the computer it is not precise enough. Nevertheless we can now go through it again and fill in a few more gaps: 1 for each word in the text read the next word from the text 2 reverse the word 3 insert the word into a list 4 5 sort the list 6 for all words in the list reverse word 7 print word 8

Note that we now have changed the structure of the algorithm by splitting it up into three main steps: creating a list of reversed words, sorting the list, and printing the list. This new version is much more precise in that it reflects how each word is being dealt with at a time, and we now also get an idea of how much work is involved: the first part of the program is executed once for each word (or token) in the text we're

16

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

looking at, whereas the second and third parts operate on the list of unique words (the types). We also have made explicit the relationship between a word and the list of words, and that we insert a word into the list, which was not obvious from the first draft. While the second attempt is much more precise, it's still not good enough. Here is attempt number three: 1 open text file for reading 2 create empty list 3 while there are more words in the file 4 read next word from file 5

reverse word

6 check if word is in list 7 YES: skip word 8 NO: insert word into list 9 close input file 10 sort list alphabetically 11 for all words in list 12 reverse word 13 print word

What we have added here are quite obvious points that a human being would not think about. If someone asked you to write down a shopping list you would take an empty piece of paper, just as you would open up a book and start on page one when you would want to read it. But for the computer you have to make all these steps explicit, and these include opening the text for reading and setting up the word list. Statements and Expressions A computer program is typically made up of statements and expressions. In the above algorithm, you have statements like reverse word and close input file. These statements consist of a command keyword (reverse and close) and an expression that describes what they are operating on (word and input file respectively). An expression evaluates to a certain data type. For example, a statement to print the time could look like

print time Here we have time as an expression, which is evaluated when the computer executes the statement. You don't write 'print 10: 30', because you want the computer to print the time at that moment when the statement is executed. Therefore 'time' would need to be an expression, whose evaluation triggers reading the computers internal clock and returning the time of day in some form. The print, on the other hand, is a statement. It works the same way all the time, and is a direct instruction to the computer. White time would ask the computer to retrieve the current time, print tells it what to do with that. As a further example, consider the literal string 'the'. This literal value could be replaced by the expression 'most frequent word in most English corpora'. The difference here is that the expression cannot be taken literally, it has to be evaluated first to make sense. An expression in tum can be composed out of sub-expressions which are combined by operators. These are either the standard mathematical operations, addition, subtraction, and so on, or operations on non-numerical data, like

WHAT DOES A PROGRAM DO?

17

string concatenation. This will become clearer if we look at an example in actual Java code: String wordl = "aard"; String word2 = Vark"; String word3 = wordl + word2; 11

In this simple example we first declare two variables (labelled 'containers' for storing data) of the data type String. A variable declaration consists of a data type, a variable name, and it optionally can contain an initial assignment of a value as well. Here we are assigning the literal values aard and vark to the two variables called wordl and word2 using the assignment operator '='. A variable declaration counts as a statement, and in Java statements have to be terminated by a semicolon. The third line works exactly the same, only that this time we are not assigning a literal String value to word3, but instead an expression which evaluates to a String. The plus sign signifies the concatenation of two (or more) strings, so the expression wordl + word2 evaluates to another (unnamed) String which we assign to word3. At the end of that sequence of statements the variable word3 will contain the value 'aardvark'. As you can see from that example, a variable is a kind of expression. It cannot be used in place of a statement, but it can be used with commands that require an expression of the same type as the variable. Assignments are a bit special, as they are technically commands, but without a command keyword. We will discuss data types, statements and expressions in a bit more detail later on, but there is one more type that we need to know about before continuing: the boolean expression. A boolean expression is an expression that evaluates to a truth value. This can be represented by the literals true and false, which are kind of minimal boolean expressions. A boolean expression we have already encountered is word is in list in the last version of the reverse word list program. If you evaluate this expression, it is either true (if the word is in the list) or false (if it isn't). Most boolean expressions used in programming are much simpler, testing just for equality or comparing values. For this there are a number of boolean operators (listed in table 2.1 ). Note the difference between the assignment indicator, the single equals sign, and the boolean operator 'equal to' which is a double equals sign.

To illustrate the use of these operators, here are a few simple examples. Suppose you were writing a scheduling program, then you could print a notice to go for lunch if the expression time == 12: 3 0 was true. Combining two expressions, the system wouldtellyoutogohomeiftime >= 5:30 II to-do-list== 'empty' was true. This means that you can leave work either after 5:30, or when you have done all the things you had to do. As soon as one of the two sub-expressions evaluates to true, the whole expression becomes true as well. A mean manager, however, might want to change the logical OR to a logical AND, in which case you have to wait until 5:30 even if you have done your day's work, or you have to work overtime if you haven't finished going through your in-tray by 5:30, because then the whole expression would only be true if both of the sub-expressions are true.

18

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

Operator

-> < >= 2 4=5, 7 >=5 5 1 && 3 < 5) !(2 > 0)

Table 2.1: Boolean Operators

2.2

CONTROL FLOW

When you are reading a text, you start at the beginning and proceed along each line until you get to the end, unless you come across references like footnotes or literature references. In those cases you 'jump' to the place where the footnote is printed, and then you resume reading at the place where you had stopped reading before. This is essentially the same way that the computer will go through your program. However, there are also lines in the listings we looked at in section 2.1.2 that indicate repetition (while there are more words) and branching (check if word is in list). Such statements are directing the flow of control through the program, and are very important for understanding the way the computer executes it. In this section we will discuss in greater detail why control flow is such an important concept in programming. In order to keep track of which commands to execute next the computer needs a pointer to the current position within the program, just as you need to keep track of where you are in a text you are reading. This position marker is called the program counter and it stores the address of the next instruction to be executed. It also has an auxiliary storage space in case it temporarily needs to jump to another position, like looking at a footnote in an article. Here it stores the current value of the program counter in a separate place and then loads it with the address of the 'footnote'. The next instruction will then be read from a different location. Once the end of the footnote is reached, an end-marker notifies the processor to reload the old value from the temporary space back into the program counter, and execution resumes where it was interrupted before. In a program we will actually make use of several ways to direct the control flow. Maybe you remember the so-called Interactive Fiction books, which were a kind of fantasy role-playing game for one person. It was a collection of numbered paragraphs, and things would happen that tell you where to continue from. You 'fight' against an ogre by throwing a few dice, and depending on the outcome you win or lose. The last lines of that paragraph would typically be something like "if you beat the ogre, go to 203, otherwise continue at 451." Unlike traditional narratives, these books did not have a single strand of plot, but you could 'read' it multiple times, exploring different 'alternate realities', depending on whether you defeated the ogre or not.

CONTROL FLOW

19

Computer programs are a lot like that. Depending on outside influences the computer takes different paths through a program, and there are several ways of changing the flow of control. We will now discuss three of them in more detail: sequence, choice, and loop.

2.2.1

Sequence

This is the general way in which a program is executed, in sequence. Starting at the first instruction, the program counter is simply incremented to point to the next instruction and most instructions will not actually influence the program counter. A sequence is like reading a book from start to finish without ever going backwards or skipping parts of it. Even though it is the default, a program cannot do much if it would only be able to go through each statement once. The main strength of a computer is its speed, and that can best be exploited by making it perform some task repeatedly. For this we need other means, which we will look at below. The expressiveness of a purely sequential program is fairly limited; it's like a pocket calculator that can perform mathematical operations, but cannot react to the result in any way. Any algorithm that is more complex than that will require a more differentiated control flow. Before discussing the other ways of directing the control flow, there is one other concept that we need to know about, the block. A block is a sequence of statements grouped together so that they are syntactically equivalent to a single statement. For example, in the previous example we had a block right at the end: 11 for all words in list reverse word 12 print word 13

The bottom two lines are a block, and in pseudo-code they are grouped together by being at the same level of indentation. In a Java program, and in a number of other languages, blocks are delimited by curly brackets. The reason for having blocks is so that the computer can determine the scope of certain commands; otherwise, how should it know that the print word statement is also to be executed for each word? By putting those two statements into a block it becomes clear immediately. We will see more examples where blocks are essential in the following sections.

2.2.2

Choice

If there was only sequential execution, each time a computer program was run it

would basically be the same as before. More variation can be introduced through another type of control flow, the choice. Here the processor evaluates a condition, and depending on the result branches into one of two or more alternatives. The simple alternative is expressed in abstract form as IF X THEN A ELSE B where X is the condition to be evaluated, and A and B are the two alternative blocks of statements. The condition X must be a boolean expression, i.e. it evaluates to either true or false. If X is true, then it is block A that gets executed, otherwise it is block B. This can now be used for decision making: Imagine a ticket counter at a cinema, where the people are buying the tickets at a machine. In order to check whether the

20

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

clients are allowed to watch a film, it asks them their age. Then it checks whether the age is sufficient with respect to the rating of the film they want to watch and prints out a different statement depending on the outcome of the check. In pseudo-code (with some comments written in square brackets) this could look like this: 1 2 3 4 5 6 7 8 9

rating = 15 [or any other appropriate value according to the film] print ''Please enter your age, input age [at this stage the user enters his or her age] if age>= rating [the operator stands for 'greater than or equal to'] then [the following block gets executed if the condition is fulfilled] print "OK, you're old enough to watch this" else [the following block gets executed if the condition does not apply] print "Come back in a few years" endif

The pseudo-code syntax for simple branching is if . . . then . . . else ... endi f. The e 1 se and endi f parts are needed to mark the boundaries of the two blocks, the one that gets executed when the condition is true, and the other one which gets executed when the condition is false. In the above example we first assign the value 15 to the label rating. This is not strictly necessary, but makes it easier if the next film has a different rating: all that is necessary is to change this one line, and everything will work properly. Otherwise, if the rating value is used several times, all of those instances would have to be changed, but it is very easy to just forget one, and then you have a problem. For this reason, there will often be a couple of 'definitions' at the beginning of a source file, where values are assigned to labels. There are two types of such labels, ones that can change their assigned value and those that can't. The ones which can change are called variables, while the ones that always keep the same value are called constants. Constants are declared in a way that forbids you to ever assign a different value to them, and this is checked by the compiler: if you try and do it, you will get an error message. We will further discuss the topic of variables below; for the moment all you need to know is that a variable is basically a symbolic name for a value of some kind. If the value can change, it is called a variable, if the value is immutable it is a constant. Once we have read in the client's age, we are in a position to compare it to the threshold as defined by the film's rating. This is a simple test where we see if the age is greater or equal to the rating. Here the action parts are only messages printed to the screen, but in a real life application we would take more action, like initiating a ticket purchase or displaying alternative films that have a lower rating.

2.2.3 Multiple Choice There is another form of choice control flow, where the test condition evaluates to one of several possible values. In this case, which really is only a special case of the one-way choice, you need to provide a corresponding block of statements for each possible outcome. This will usually be a rather limited set of options. The simple alternative is by far more frequent than the multiple choice. Here is a brief example of a multiple choice situation: imagine a part-of-speech tagger, a program that assigns labels to words according to their word classes. These

21

CONTROL FLOW

labels are often rather cryptic, and we would like to have them mapped onto something more readable. This short example will map some tags from a widely used English tagset into more human-digestible labels. The syntax for this is switch . . . case . . . case, where the expression following the switch statement is matched against each case, and if it matches the corresponding block is executed. A special case is default which matches if no other case did. switch tag case "JJOO print "adjective" case "NN"

print "noun" case OOVBGOO print ooverb-ingoo case OODTOO print "determiner" default print "unknown tag encountered!"

You can see that a switch statement is not much more than a convenient way to combine a series of if statements.

2.2.4 Loop Computers are very good at doing the same thing over and over again, and unlike humans they don't lose concentration when doing something for the five hundred and seventh time. In order to get the computer to do something repeatedly we use a loop to control the flow of execution. The first type of loop is governed by a condition. It basically looks like a branch, but it has only one block which is executed if the condition is true. Once the end of the block has been reached, the condition is re-examined, and if it is still true, the block is executed again. If it is false, the program continues after the end of the block. For our example, we want to write a program that scans through a text and finds the first occurrence of a certain word. Once it has found the word it will stop and print out the position of the word in the text. In pseudo-code this looks like: 1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19

searchWord = "scooter" searchPosition = -1 currentPosition = 0 open input text for reading while text-has-more-words && searchPosition read word increment currentPosition if word == searchWord then

searchPosition

==

-1

= currentPosition

endif endwhile close input text if searchPosition == -1 then print the search word does not occur in the text oo else print the search word first occurs at position "+searchPosition endif 00

11

22

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

We start off by initialising a few variables. If you initialise a variable, you assign an initial value to it. These variables are: searchWord, the word we will be looking for, searchPosition, the position it first occurs at, and a counter that keeps track of which position we're at, currentPosi tion. The searchPosi tion variable also doubles as an indicator whether we have found the word yet: it initially is set to -1, and keeps that value until it matches a word in the text. After opening our input text for reading, we enter into the loop. This loop is governed by a complex condition, which is made up of two sub-conditions. In our case they are combined with a logical AND, which means the whole condition is true if and only if both sub-conditions are true as well. As soon as either of the sub-conditions becomes false, the whole condition will become false as well. The condition here is that there are more words available to read (otherwise we wouldn't know when to stop reading), and also that the value of searchPosi tion is -1. This means that we haven't yet found the word we are looking for. In the loop we read a word and note its position in the text, and then we compare it to our target word: if they are the same, we assign the current text position to the variable searchPosi tion. Once that has happened, the variable will no longer be -1, and the second sub-condition becomes false, causing the loop to be terminated. Otherwise searchPosition will remain -1, and the loop is executed again. Once we're done with the loop, we investigate the value of the searchPosi tion variable: if it is still -1, the loop terminated because the end of the text had been reached before the search word was encountered. In that case we print out a message that the word has not been found in the text. Otherwise we print out a message that it has, including its position. Please note the way the variable is printed: if it was enclosed in double quotes like the rest of the message, we would simply have printed the word 'searchPosition' instead of the value of the variable searchPosi tion. What we have to do to print the actual value is to attach it to the message with a plus sign outside the double quotes. This while loop is characterised by the fact that the condition is evaluated before the body of the loop is reached. This means that the body might not be evaluated at all, namely when the condition is false to start off with. This could be the case if the input text is empty, i.e. if it contains no words (or maybe doesn't exist). Another type of loop has the condition after the body, so that the body is executed at least once. They are quite similar, so let's look at the same task with a different loop: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

searchWord = "scooter" searchPosition = -1 currentPosition = 0 open input text for reading do read word increment currentPosition if word == searchWord then searchPosition = currentPosition endif while text-has-more-words && searchPosition == -1 close input text if searchPosition == -1 then print "the search word does not occur in the text" else

CONTROL FLOW 18

23

print "the search word first occurs at position "+searchPosition

19 endif

We start off with the same initialisations, and then enter the loop's body, as indicated by the do keyword (line 5). Apart from that it seems to be identical to the previous program. However, there is one important difference: if the input text is empty, i.e. it does not contain any words, the program nevertheless tries to read a word from it when the loop's body is executed for the first time, which would cause an error. This is because the condition 'text-has-more-words' (line 12) is first checked after the body has been executed, and so there is some scope for trouble. Using a head-driven while-loop thus is inherently safer, as you do not have the constraint that the body gets executed once whatever the condition, and therefore head-driven loops are a lot more common than do-loops. This is not to say that do-loops are never used; you just have to be a bit careful when deciding which loop to choose. The third type of loop is used when we know in advance how often we want to execute a block. It is effectively a short-hand form of the while-loop, and we will start with the verbose form to look at the concept first. Let's assume we want to know how often our search word occurs within the first 100 words of the text. To make things easier we will assume that the text has at least 100 words, so that we don't have to test for that.

1 2 3 4 5 6 7 8

position = 0 searchWord = "by" counter = 0 open text for reading while position < 100 read word if word == searchWord then

9

increment counter

10 endif 11 increment position 12 endwhile 13 close input text 14 print "number of occurrences: "+counter

In order to keep track of whether we've reached 100 words, we keep count in a variable called position. We start with position 0 (computers always start counting at 0) and check if the current word equals our search word. If so, we add one to the counter variable. Then we increment the position value and repeat until we have reached 100. Note that, since we started at zero, the loop is not executed when position has the value 100, as this would be the 101 st time. In most languages there is a special loop type which can be used if you know exactly how often a loop needs to be executed. As there is no real pseudo-code way to express this in a way which is different from the previous example, we will now add in a bit of real Java in the next listing. The loop is actually called a for-loop and looks like this:

24

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

int counter = 0; String searchWord = "by"; FileReader input =new FileReader("inputfile"); for(int position = 0; position< 100; position++) String word = readNextWord(input); if(word.equals(searchWord)) { counter++;

System. out .println( "number of occurrences: "+counter);

A few points need mentioning: unlike pseudo-code, Java requires variables to be declared before they can be used (which we are doing in the line by specifying a data type, in t, before the name of the variable, counter), and for numbers we use the in t data type, which is short for integer; instead of keywords like endi f, blocks of code are enclosed in curly brackets ({ .. }), and all statements have to be ended with a semicolon. We also cheated a bit, as the operation of reading a word from the input text is simply replaced by a call to a procedure, which we called readNextWord ( ) here, as it is not very straightforward. The whole of chapter 7 is devoted to getting words out of input data, so for now we just assume that there is a way to do this. This is actually a good example of top-down design: we postpone the details until later and concentrate on what is important at the current stage. Another point, which might be slightly confusing, is the way we are comparing string variables. For reasons we will discuss later, we cannot use the double equal sign here, so we have to use a special method, equals () to do that. How exactly this works will not be relevant for now, but you will see in chapter 5 how to work with String variables. The for keyword is followed by a group of tlrree elements, separated by semicolons and enclosed in round brackets: for(int position

=

0; position < 100; position++)

{

The first one is the initialiser, which is executed before the loop starts. One would usually assign the starting value to the counting variable, like position = 0 here. In the 'verbose' while form above it would be equivalent to the first line. The second element is the condition. The loop's body is executed while this condition is true, so in our case while position is smaller than 100 (compare to line 5 of the pseudo-code listing above). The final element is a statement that is executed after the loop's body has been processed, and here we increment the loop position by one. The expression position++ is a short hand form of position = position + 1, where we assign to position the current value ofposi tion plus one. This would be line 10 in the previous example, where it says increment position. For-loops are more flexible than this; you can have any expression which evaluates to a truth value as the condition, and the final statement can also be more complex than a simple increment statement. We will learn more about the full power of for-loops in later chapters.

VARIABLES AND DATA TYPES

25

VARIABLES AND DATA TYPES

2.3

We have already come across data types when we were talking about expressions on page 16. There we used time as an example. All expressions have a data type, which tells the computer how to deal with them. The data type of the 'time' expression depends on the computer language you're using: it could either be a specific type handling dates and times, or it could be a sequence of characters (e.g. '1 0 : 3 0 '), or even (as it is common in the Unix operating system) a single number representing the number of seconds that have elapsed since 12:00 a.m. on January the 1st 1970. In Java there is a special class, Date, which contains references to dates and times. When an expression is evaluated, you need somewhere to store the result. You also need a way of accessing it, and this is done through variables. We have already come across variables and constants as symbolic labels for certain values, some of which can be changed by assigning new values to them. From a technical point of view, variables are places in memory which have a label attached to them which the program can use to reference it. For example, if you want to count how many words there are in a text, you would have a variable 'counter' in which you store the number of words encountered so far. When you read the next word, you take the current value stored under the label 'counter' and increment it by one. When you have finished counting, you can print out the final value. In pseudo-code this could look like 1 counter = 0 2 while more words are available

3 4

read next word

counter ~ counter + 1 5 endwhile 6 print counter

At the beginning of this program, the variable is initialised to be zero, which means it is assigned an initial value. If this is not done, the storage area this label refers to would just contain a random value, which might lead to undesired results. If you forget to initialise a variable, the Java compiler generally gives you a warning. It is best to develop the habit of always initialising variables before they are used. A variable can only contain one sort of data, which means that the data in the computer's memory which it points to is interpreted in a certain way. The basic distinction is between numerical data, character data, and composite data. In the following sections we will walk through the different data types, starting with numerical types.

2.3.1

Numerical Data

These are numbers of different kinds. The way a number is stored in memory is optimised depending on what it is used for. The basic distinction here are integer numbers (numbers with no decimal places) and floating point numbers (with decimal places). The wider the range of the value, the more memory it takes up, so it is worth choosing the right data type. The numerical types available in Java together with their required storage space is shown in table 2.2.

26

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

data type byte short int long float double

description small number (-128 to 127) small number (-32768 to 32767) number (+/- 2 billion) large number floating-point number double precision floating point number

size (bytes) 1 2 4

8 4

8

Table 2.2: Numerical data types in Java

Unlike some other programming languages, all numerical data types in Java are signed, which means they have a positive and a negative range. This effectively halves the potential range that would be available if there were only positive values, but is more useful for most computing purposes. A single byte, for example, can store 256 different values. In Java this is mapped onto the range -127 to +128, so you still have the same amount of different values, only the maximum and minimum values are different. Most often you will probably require in t variables, mainly for counting words or properties of words. They have a range of about two thousand million either side of zero, which should be sufficient for most purposes. Floating-point variables should only be used when working with decimal places, such as proportions or perhaps probabilities, as they cannot represent larger integer values without losing precision. A float has a precision of about 40 decimal places, whereas a double goes to about 310.

2.3.2

Character Data

A character is a symbol that is not interpreted as a numerical value. This includes letters and digits, and special symbols such as punctuation marks and idiograms. The ability to process symbols is one important aspect that distinguishes a general purpose computer from a pocket calculator, and one that is obviously extremely relevant for corpus linguistics. Internally each character is represented by a number, its index in the character table. In order to allow the exchange of textual data between different computers there are standard mappings from characters to index values. While mainframe computers usually work with the EBCDIC character table, most other machines use the ASCII character set. This defines 127 symbols, which includes control codes (e.g. line break and page break), the basic letters used in English, the digits, and punctuation. Quite often there are also extensions for regional characters up to a maximum of 256 symbols (which is the maximum number of different values for a single byte). Java, however, uses the Unicode character set, which provides more than 65,000 characters, enough for most of the world's alphabets. Single letters are represented by the char data type, while sequences ofletters are represented by Strings:

DATA STORAGE

27

II note the single quotes char rnyinitial ~ 'M'; String rnyName ~ "Mason"; II note the double quotes I I this is a String, not a char String aLetter ~ "a";

Literal characters have to be enclosed in single quotes, while for strings you have to use double quotes. If you enclose a single character by double quotes (as in the third line), it will be treated as a String.

2.3.3

Composite Data

The data types described so far enable one to do most kinds of programming, but usually real world data is more complex than just simple numerical values or single characters. If you want to deal with word forms, for example, it would be rather tedious to handle each letter individually. The solution for this lies in composite data types. Historically they started off as records with several fields consisting either of primitive types (the two types described in the previous paragraphs) or other composite types. This allowed you to store related data together in the same place, so you could have a word and its frequency in the same composite variable, and you could always keep track of the relationship between the two entities. Later on it became apparent that one might not only want to have related data together, but also instructions that are relevant only to that data. The result of this is the object, a combination of data fields and operations that can be applied to them. With the example of the word, we could create a word object from a text file, and we could define operations to retrieve the length of the word, the overall frequency of the word in the text, its base form, or a distribution graph that showed us all where all the occurrences showed up in a text. Objects are a very powerful programming idiom: they are basically re-usable building blocks that can be combined to create an application. As objects are a fundamental part of Java we will discuss them in more detail in chapter 4.

2.4

DATA STORAGE

When processing data there is one fundamental question that we need to look at next: where and how do we keep that data? We already know that we can store individual data items in variables, but this is only really suitable for a small number of items that we are processing. If you were to store all words of a text in different variables you would for example have to know in advance how many there are, and then you can't even really do anything useful with them. Instead you would have to follow a different approach, storing the words in an area where you can access them one by one or manipulate them as a whole, if you for example want to sort them. Principally there are two places where you can store your data, either directly in the computer's memory or in a file on the computer's hard disk.

2.4.1

Internal Storage: Memory

Storing your data in memory has one big advantage: accessing it is very fast. The reason for that is that the time needed by the computer to read a value out of a memory chip is much shorter than the time needed to read data of an external storage

28

INTRODUCTION TO BASIC PROGRAMMING CONCEPTS

medium. This is because there are no moving parts involved, but just connections through which electrons flow at high speed. Moving parts are much slower, especially when the storage device needs to be robust. Compare for example the time it needs to read a file from a floppy disk and a hard disk, and then drop both of them from a large height onto the floor and repeat the procedure. You basically pay for the increased speed of the hard disk with its low robustness. If memory is so much faster, why does anybody then bother to store data elsewhere? There are two main reasons for this: (a) when you tum off the power supply, the content of memory gets erased, and (b) there is only a limited amount of memory available. So external storage is used for backup purposes, as it does not have to rely on a permanent supply of electricity, and because its capacity is much bigger. So how do you actually store data in memory? For this purpose Java provides a number of so-called container classes, which you can imagine as a kind of shelf that you can put your variables on. There are different ways to organise the way your variables are stored so that you can find them again when you need them, and we will discuss these in chapter 4 when we are looking at the actual classes. Basically you can store data as a sequence or with a keyword. You would use the sequence if you are processing a short text and want to look at the words as they appear, so you really need to preserve the order in which they come in the data. By storing them in the right container class you can then access them by their position in the text. Remember that computers start counting at zero, so the first word of your text would be at position 0, the second at position 1, and so forth, with word n being at position n-1. Access is extremely quick, and you don't have to follow a certain order. This is called random access, and this data structure is called an array. Arrays can either be of a fixed size, or they can be dynamic, i.e. they can grow as you add more elements to them. Access by keyword is a bit more sophisticated. Imagine you want to store a word frequency list and then retrieve values for a set of words from that list. The easiest way to do this is in a table that is indexed by the word, so that you can locate the information immediately without having to trawl through all of the words to find it. This works exactly like a dictionary: you would never even think about reading entry number 4274 in a dictionary (unless you are taking random samples of entries), but instead you would be looking for the entry for 'close'. In fact, you wouldn't even know that 'close' is the 4274th entry, as they are not numbered, but you wouldn't really want to know that anyway. There are several possible data structures that allow data access by keyword, all with different advantages and disadvantages. We will discuss those in chapter 4 when we investigate Java's collection classes in more detail.

2.4.2 External Storage: Files We have now heard about several ways of storing data in the computer's main memory, but what if you can't store it because you have got too much data? You could not, for example, load all of the British National Corpus (BNC) into memory at once, as it's far too big. And even if you want to store just a single text, which would fit into memory easily, you need to get it from somewhere. This somewhere is an ex-

SUMMARY

29

ternal storage medium, and the data is kept in a file. As opposed to internal storage, or memory, external storage is usually much bigger, non-volatile (i.e. it keeps its content even if there is no electricity available), a lot cheaper, and much slower to access. A file is just a stretch of data that can be accessed by its name. In a way the external storage is organised in a way similar to the access by keyword method we came across earlier. In order to get at the data which is stored in a file you need to open it. Generally you can open a file either for reading or for writing, so you either read data out of it or write data to it. You cannot open a file for reading that does not exist, and when you open an existing file for writing its previous content is erased. Data in a file is stored in an unstructured way, and when writing to a file the data is just added on to the end. When reading it, you will read the contents in the same order that you have written them in. There is also another way of accessing files: instead of opening them for sequential reading or writing you can open them for random access. Here you can jump to any position in the file, and both read or write at the same time, just like you would do when accessing an array in the internal memory. The drawback of this is that it is slightly slower, and also more complex. We will look at different ways of accessing files in more detail in chapter 6.

2.5 SUMMARY In this chapter we have first learned what algorithms are, namely formalised descriptions of how things are done. The level of formality of an algorithm is quite arbitrary, which can be used to guide the software development process: starting from a very informal and imprecise outline we gradually go into more detail using more and more formal elements until we reach the level of the programming language. Individual statements can be combined into blocks and the way the computer executes them can be commanded using branching and loops. This control flow allows the computer to repeat operations multiple times and make decisions based on boolean expressions. A program consists of both instructions and data. Variables and constants are named labels which can be used to store and retrieve individual variables. Larger numbers of variables can be stored in container classes within the computer's memory or in external data files on the hard disk. At the end of the chapter we discussed the different properties of internal and external storage media: internal memory is faster to access, but more limited in size, whereas external storage is slower but there is far more of it, and it keeps information even if there is no electrical power available.

3 Basic Corpus Concepts This chapter introduces relevant concepts from corpus linguistics. It is intended as a brief introduction to enable you to understand the principal problems that you will encounter when working with corpora.

3.1

HOW TO COLLECT DATA

Unless you are working with a research group that already has a set of corpora available that you want to analyse, your first question will be where to get data from and how to assemble it to a corpus. For the remainder of this section we will assume that you want to build your own corpus for analysis. There are several reasons why you might want to do this. The main one is that the kind of data you want to analyse does not (yet) exist as a corpus. This is in fact highly likely when you choose your topic without knowing what corpora already exist and are accessible publically. On the one hand you will have the advantage of not having to compromise on the subject of your research (e.g. by restricting its scope), but you might have to put extra work into collecting appropriate texts. Quite often you might want to compare a special sample of language with genera/language; here you would collect some data (e.g. political speeches, or learner essays) and use existing corpora (like the BNC or the Bank of English) as a benchmark.

3.1.1

Typing

Typing your data in is undoubtedly the most tedious and time consuming way of getting it into the computer. It is also prone to errors, as the human typist will either correct the data, or mistype it. Nevertheless, this is often the only possible solution, e.g. when dealing with spoken data, or when no other alternative (see following sections) is available. The most important point to consider when entering corpus data at the keyboard is that you want it to be in a format that can easily be processed by a computer. There is no point in having a machine-readable corpus otherwise. Over the years (mainly prompted through advancements in computing technology) several different formats have evolved. More on data formats below in section 3.2.2.

3.1.2

Scanning

Scanning seems to be a much better way of entering data. All you need to do (provided you have access to appropriate hardware) is to put a page at a time into the

32

BASIC CORPUS CONCEPTS

scanner, press (or click on) a button, and the scanner will read in the page and dump it on your system. Just as easy as photocopying it seems. Unfortunately it is not that easy; there is one important step which complicates things tremendously. That is the transition from the page image to the page content, in other words the reading process. While humans are usually quite good at reading, to a computer a page looks like a large white area with random black dots on it. Using quite sophisticated software for optical character recognition (OCR), the dots are matched with letter shapes, which are then written into an output file as plain text. This transformation from image to text can be easy for high quality printed material on white paper. Older data, however, can quite often lead to mistakes, due to lack of contrast between paper and letters, or because the lead type has rubbed off with time and bits of the letter shapes are missing. Imperfect printing can lead to letters being misinterpreted by the computer, where the most common examples are mistaking c fore (and vice versa), or ni form. Luckily a lot of these recognition errors can be caught with modem spell checking software, but it all adds to the effort required for getting the texts into machinereadable form before you even start analysing them, and spell checkers are not infallible either. Further complications arise once you start to look at data which is formatted in multiple columns, intermixed with figures or tables. It is very hard for the computer to figure out what the flow of the text is, and thus a lot of manual intervention is required for such data.

3.1.3 Downloading The best solution-as always-is to re-use material that somebody else has already made machine-readable. There are plenty of texts available on the Internet, from a variety of sources, and of varying quality. For a list of possible sources see the appendix. Generally there are two types of texts available on the Internet, web pages and text files. Web pages can easily be saved from a browser, and they contain many different texts. Quite often you can find newspaper articles or even academic papers in this form. Text files are usually too large to display on a single page, and they are available for downloading rather than on-line viewing. A number of sites contain whole books, like Project Gutenberg and other similar ventures. The Oxford Text Archive also allows downloading of full texts, in different formats. Most of the books available for free downloading are classics which are out of copyright; while this makes them an easy source of data, their linguistic relevance might be a bit dubious.

3.1.4 Other Media It is increasingly possible to buy collections of text on CD ROM for not very much money. These are usually whole years worth of newspapers or classics which are out of copyright and are sold as a kind of digital library, often with software that displays the texts on the computer screen. While this seems like a very convenient source of data, it is often the case that the texts are in some encoded format, which makes them

HOW TO STORE TEXTUAL DATA

33

impossible to use without the included software. This format could be for indexing or compression purposes, and it means that you will have to use the software provided on the CD to look at the data. Sometimes you might be able to save the text you are looking at from within the program, or you can 'print' it into a file which you can then process further. It might require quite some computing expertise to extract texts from such collections, so it is probably easier to find the texts on the Internet, which is presumably where a lot of these texts originated from in the first place (see previous section). With newspapers, where you often get back issues of a whole year on a CD, you will almost certainly have problems accessing the text. These CDs are mainly meant as an archive of news articles, so they are linked up with search mechanisms which often provide the only point of entry to a text. The same caveats as mentioned before apply, even more so since the publishers might see some commercial value in the material and thus encode or even encrypt it to prevent access other than through the interface provided. The same applies to a lot of encyclopedias which are available on CD, often on the cover disks of computing magazines. One issue that is often neglected here is that of copyright, which is a right minefield. If you are in doubt whether you are allowed to use some data, it is best to consult an expert, as that can save you a lot of trouble in the long run.

3.2

HOW TO STORE TEXTUAL DATA

Collecting data is one part of the whole problem, another one is to store it. This is important, as how you manage your data greatly influences what you can do with it during analysis, and how easy this analysis will be. Apart from organising your data, you also need to look at the physical storage, which means the format you store your data in. There are a great number of ways to represent textual data on a computer, especially when it is more than just the plain text, such as annotations of some sort. Careful planning at the outset of your research project can save you many headaches later on, and handling your data is as important as your actual investigation, so any time spent on it is well worth the effort.

3.2.1

Corpus Organisation

Once you have collected all your texts, you need to store them in some form on your computer. Depending on what your analysis will be about, there are different ways of organising your material, but it is always worth trying to anticipate other uses of your data. Try not to exclude certain avenues of analysis on the grounds that you are not interested in them; your own interests might change or somebody else might come along wishing to analyse your data in an unforeseen way. It is probably a good choice to partition your data into smaller parts which have attributes in common. What attributes these are depends entirely on your data, so it is impossible to give ultimate guidelines which are suitable for each and every corpus. The first step is to look around and see if there are similar corpora whose structure you might want to mirror. This is the approach that has been chosen for the LOB corpus (Lancaster-Oslo/Bergen corpus of written British English), which took the design of the corpus!Brown (American written English) to allow direct comparison

34

BASIC CORPUS CONCEPTS

between the two. Other corpora then picked up on the same structure as well, like for example FLOB and Frown, the modem equivalents of Brown and LOB created at the University of Freiburg, and the Kolhapur corpus of Indian English. As it is so widely used, the organisation of these corpora is shown in table 3.1 (from Taylor, Leech, and Fligelstone (1991) ).

A B

c D E F G

H J K L M

N p

R

press (reportage) press (editorial) press (reviews) religion skills & hobbies popular lore belles lettres, biography, essays miscellaneous learned & scientific writings general fiction mystery & detective fiction science fiction adventure & western fiction romance & love story humour

Table 3.1: Brown/LOB structure If you cannot find any other similar corpora whose structure you can follow, you will have to think of one yourself. It might be useful to partition your data in a kind of tree structure, taking the hlost important attributes to decide on the split (see figure 3.1 ). For instance, if your corpus contains both written and spoken material, you might want to take the first split across that particular attribute, and within the written partition you might want to distinguish further between books and newspapers, or adult and children's books, or female/male authors. Here you could choose whatever is most convenient for your later analysis, and you could re-arrange the partitioning if necessary for other studies.

Another useful criterion to structure your corpus is date. If you are looking at the way usage patterns change across time, it might be good solution to have different sub-corpora for different years or chronological ranges. Depending on your time scale you might wish to distinguish between decades, years, months, or even days. This is also a relatively neutral way of dividing up your corpus, which doesn't make too many assumptions about the data. In the end, the partitioning is mainly a matter of convenience, as all attributes should be stored within the data files, allowing you to select data according to any of them, provided they are marked up in a sensible way (see next section on how

35

HOW TO STORE TEXTUAL DATA

/~

Written

/~~

Books

Journals

Dialogue

/~

/~

American

Monologue

Newspapers

British

Radio broadcasts

Academic

lectures

Figure 3.1: A sample taxonomy of text types

to mark up attributes). Grouping your corpus data in a sensible way, however, can greatly speed up the analysis, as you don't have to trawl through all your material to find relevant parts for comparisons; instead you can just select a block of data grouped according to the attribute you are looking for. There has been some work on developing guidelines for structuring corpus material, and the results of the EU project EAGLES are available on the Internet. The address of the site can be found in the appendix.

3.2.2 File Formats Mter having decided on a logical structure, you need to think about the physical structure, i.e. how to store the data on your computer. One straightforward way is to mirror the hierarchical structure in the directory tree on your hard disk. This way the attributes of a text would be reflected by its position on the disk, but processing might be more difficult as you will have to gather text files from a set of separate directories. It also makes rearranging the data more complicated. Another possibility is to have a separate file for each text, and keep a list of a text's attributes (such as author, title and source) somewhere else. When you want to pick a sample of your corpus, you can consult the list of attributes and then select those files which contain the texts you are interested in. This is easier for processing, but it might be inefficient if you have too many files in a single directory. Some operating systems might also put an upper limit on the number of files you can have in a single directory. In principle the choice between storing texts in individual files on your computer, or whether lumping them together in one or more large files is largely a matter of convenience. If all the data is in one big file, processing it is easier, as there is only one file you have to work with, as opposed to a large number of smaller files. But once your corpus has grown in size, a single large file becomes unwieldy and if it does not fit onto a single floppy or ZIP disk it might be difficult to copy it or back it up. You might also have to think about marking document boundaries in a big file, whereas individual little files could comprise self-contained documents with all related necessary meta information.

BASIC CORPUS CONCEPTS

36

Once you get to storing the material in the right place on your computer, the next question is what format to choose. In principle there are three kinds of formats to choose from, all with their special advantages and disadvantages: 1. plain ASCII text 2. marked-up ASCII text (SGML/XMLI... ) 3. word processing text (Word/RTF/... )

Plain ASCII text is the most restricted format. All you can store in it are the words and some simple layout information such as line breaks. You cannot even represent non-ASCII characters, like accented letters and umlauts, which severely restricts you when it comes to working with occasional foreign words or even non-English corpora. In order to overcome this problem, a whole host of different conventions has evolved, each of which requires a special program to work with it. If you had a concordancing program suitable for corpus X, the chances are that you could not use it on corpus Y, or at least you would not able to exploit all the information stored in that corpus. After years of developing idiosyncratic mark-up schemes, a standard format has been established: SGML, or its simplified version, XML. These are formal descriptions of how to encode a text, and how to represent information that otherwise could not be added straightforwardly into the text. As this standard is increasingly used for marking up corpus data, a lot of software has been developed to work with it. More details of mark-up will be discussed in the next section. Another popular choice for corpus formats seem to be word processing files. There are three main reasons for this: 1. many people have a word processor available 2. plenty of existing files are already in this format 3. you can highlight important words or change their font

Unless it represents spoken data, a corpus is basically a collection of text documents. Thus it seems only natural to use a word processor to store them, which allows you to use highlighting and font changes to mark up certain bits of text. You can format ergative verbs in a bold font, then search for bold text during the analysis, and with built-in macro languages you might even be able to write a basic concordancing program. So what's wrong with that? The problem here is that you would be limiting yourself to just a single product on a single platform, supported by a single vendor. What if the next version of the word processor changes some font formatting codes? Suddenly all your laboriously inserted codes for marking up ergative verbs could have disappeared from the file. Even though it seems a big improvement to be able to mark certain words by choosing a different typeface over not being able to do it at all with a plain ASCII text, it severely limits the way in which you can process the file later. By choosing a word processor format you can then only analyse your text within this word processor, as the file format will almost certainly not be readable by any other software. As a word processor is not designed to be used for corpus analysis, you cannot do very much more than just search for words.

MARK-UP AND ANNOTATIONS

37

Furthermore, what happens if the word processor is discontinued? You might be tempted to say that a certain product is so widespread that it will always be around, but think back about ten years: then there was a pre-PC computer in widespread use, the Amstrad PCW. It came with a word processing program (Locoscript), and it would save its data on 3 inch floppy disks (remember that the 'standard' floppy size nowadays is 3.5 inch). Suppose you had started your corpus gathering in those days using this platform. You took a break from working on it, and now, a few years later, you want to continue turning it into a really useful resource. The first problem is that you swapped your PCW for a PC (or an Apple Mac) at the last round of equipment spendings, but suppose you manage to find another one in a computing museum. Unfortunately it doesn't have Locoscript on, as the disk got lost somehow. Without any supporting software you are not able to make use of the text files, and even if there might be a Locoscript version for PCs you still can't do anything because PCs don't have drives which can read 3 inch floppies. Many individual attempts of gathering corpus data have chosen this form of storing data, and as soon as the question of a program for analysis arises, people realise that they have reached a dead end. Converting the data back into a usable format will almost certainly result in a loss of information, and consequently much time has been wasted during setting up. Technology changes so quickly that it would be foolish to tie yourself to one particular product. Instead you should go for an open, non-proprietary format that can be read by any program and that is well-documented, so that you can write your own software to process it if you need to. You should also store it on a variety of media which are likely to be available at future dates. This of course includes the necessity of making backup copies to insure yourself against accidental loss of your data in case your disk drive breaks down. As we will see in the following section, marked-up ASCII is the best option. It is readable by any program on any computing platform, and if you choose the right kind of mark-up you will find plenty of software readily available to work with it. The main point of this section is that you should generally do what other people have done already, so that you can gain from their experience and re-use their software and possibly their data. It also makes it easier for other people to use your data if it is not kept in an obscure format nobody else can use. Corpora have been collected for more than forty years, so it would be foolish to ignore the lessons learned by your predecessors. A particular choice might make sense in the short term, but in the long run it is best to follow the existing roads rather than go out into the wild on your own. This particularly applies to the topic of annotations, which we will look at next.

3.3 MARK-UP AND ANNOTATIONS In the previous section we discussed ways of organising and storing your corpus, how to structure it and how to keep it on a computer physically; in this section we will look at why you would want to use annotations and how they can be stored. In chapter 8 we will then develop programs to process annotations.

BASIC CORPUS CONCEPTS

38

3.3.1 Why Use Annotations? Annotations can be used to label items of interest in a corpus. For example, if you are analysing anaphoric references, you cannot simply use a program to extract them automatically, as their identification is still an unsolved research problem. The only reliable way to find them is to go through the data (perhaps after it has been processed by an auxiliary program which identified pronouns) and label the references yourself. Once you have added annotations to your data, you can search for them and thus retrieve instances of the features and process them. This means, of course, that the program you are using is capable of looking for the annotations you put in earlier. While it is very unlikely to be able to ask a concordancing program to find anaphoric references, it is much easier to instruct it to find all labels marking instances of those. In practice, most corpora have some kind of annotations. The degree of granularity varies widely, and more sophisticated annotations (i.e. those that need to be put in manually) are mainly restricted to smaller corpora, whereas larger corpora typically contain only such annotations which can be added automatically using computer programs, e.g. part-of-speech tags and perhaps sentence boundaries and paragraph breaks. Any material you can get from other sources will most likely be without any annotaions.

3.3.2 Different Ways to Store Annotations We have already seen that there are different ways to store your data files, and there is an even larger variety of formats in which to encode annotations. In this section we will briefly touch upon this subject, but we will not spend too much time on describing obsolete formats. There are typically two different kinds of annotations, those relating to the whole text, such as author, and title, and those which only apply to parts of it (usually words or groups of words), such as part-of-speech or anaphoric references. While the former are usually kept separate in a header block at the beginning of the text, the latter necessarily have to be included inside the text.

Header Information Usually a corpus will contain just the plain text, unless it was converted from published sources (such as newspaper articles or typesetter's tapes) or structured collections (such as databases, see Galle et al. (1992)). In this case it sometimes contains extra information, which is attached to it at the beginning in a header describing what the title of the document is, when it was published, in which section, and so on. This kind of external information can easily be kept separate from the actual text through some kind of boundary marker which indicates the start of the text proper. Some mark-up schemes use a number of tags to indicate which attributes of a document are filled with what values. A newspaper article could for example be annotated thus: \begin{header} \title{Labour Party To Abolish Monarchy} \date{Ol/04/2001} \author{A.P. Rilfool} \source{The Monday Times}

MARK-UP AND ANNOTATIONS

39

\end{header} \begin{text} In an announcement yesterday the Prime Minister said that \end{text}

Other labels could be used, or they could be enclosed in angle brackets, as in this example from McEnery and Wilson (1996):

This line in the so-called COCOA format (from an early computer program) indicates that Charles Dickens is the author of the following text. There are many different ways in which such information can be stored, and each concordance program will understand its own format only. Structuring the header information is a problem of standardisation, as different corpora could be annotated in different variants, using different tag labels, and without any specification of what kinds of information will be given. This makes it difficult for software to process different corpora, as it cannot rely on them being in one particular form. Also, without having a defined set of properties which are being defined, any program will have problems allowing the user to exploit the information which is available. The solution is to have one single format which is used for all corpora, as then the writers of corpus processing software can tailor the software to that format.

Text-Internal Annotations Adding annotation for other aspects of a corpus require that annotation be added within the text. If you want to analyse the part-of-speech of the individual words, or the reference points of anaphora, you will have to mark those up in the text itself, and this creates a number of problems. The first is to keep the annotation separable from the text. If at some future point you want to get back to the original, unannotated, text, and you cannot easily work out what is part of the text and what is annotation then you will have a problem. When researchers first started gathering corpora, they had to make up their own formats for storing them. Early computers only had upper case letters, so special conventions were introduced to distinguish upper and lower case, e.g. by putting an asterisk character before any upper case letter, treating all other letters as lower case. This made dealing with the data rather complicated, as programs had to be able to handle all these conventions, but there were no real alternatives. Worse still, everybody used their own conventions, mainly because people wanted to encode different things, which meant that neither software nor data could easily be exchanged or reused. As there were not that many people working with corpora initially there was no awareness that this could tum into a problem. However, with corpus linguistics becoming more and more mainstream it quickly became an issue, as not everybody could start from scratch developing their own software and collecting their own data, and so committees were set up to develop standards. This has gone so far that projects will sometimes no longer get funding if they create resources that don't follow these standards.

40

BASIC CORPUS CONCEPTS

SGML The only file format which can be understood generally is plain ASCII text. ASCII allows no non-printable characters apart from line breaks and tabulator marks, and no characters with a code higher than 127. However, you can now read your data on virtually any machine using any software, but how do you mark up your ergative verb occurrences? This is where mark-up languages enter the scene. A mark-up language allows you to store any kind of information separated from the text in an unambiguous way. This means you can easily recognise it, and if you don't need it you can simply remove it from the text with a computer program, without any worries that you might also remove parts of the text at the same time. A mark-up language typically reserves a number of characters which you then cannot use in your data unless they are 'escaped', i.e. replaced by a certain code. For example, using HTML, the hypertext mark-up language, if you want to write a web page about your R&D unit you have to write 'R& D', where amp is the entity name of the ampersand character, and the ampersand and the semicolon are used to separate it from the other text. The de facto standard format to encode corpora is using SGML, which actually is a language to define mark-up languages. SGML is a well established standard (ISO 8879), and as it cannot be changed easily because of this, software developers have written quite a few tools to work with it in the knowledge that it will be here to stay and thus the time and effort they invest will be worth it. And indeed you can easily find a large number of programs, a lot of them very cheap or even free, that will allow you to work with SGML data. SGML can be used to specify a set of tags and make their relationships explicit in a formal grammar which describes valid combinations of tags. For example, you could specify that a 'chapter' tag can contain one or more 'sections', which in turn contain 'paragraphs'. If you try to put a 'chapter' tag inside a 'section' you violate the structural integrity of the data, and a piece of software called parser will tell you that there is an error in your mark-up. SGML uses the angle brackets('') and the ampersand('&') as special characters to separate your tags from the text, though that could be changed in a mark-up language definition. The Text Encoding Initiative (TEl) has spent a considerable amount of effort on producing guidelines for marking-up texts for scholarly use in the humanities. These text encoding guidelines are available on the Internet (see section 12.2 on resources), and there is even a 'TEl Lite', a version stripped down to the bare essentials of marking up textual data. The TEl guidelines provide a standard format, but that does not mean that a corpus will be marked up in all the detail provided by them. There is a special customised version of these guidelines, the Corpus Encoding Standard (CES) which defines different levels of annotation, according to the detail expressed in the mark-up. Being a customised version of it, CES is TEl conformant. For the newcomer SGML is not too easy a field to start working in, as there is a lot of terminology, a large number of opaque acronyms, and programs which tend to be reluctant to deal with your data, giving cryptic error messages as an excuse. The reason for that is that SGML is very powerful, but also extremely complex. As a result, software is slow and difficult to write, which has prevented SGML from being widely used for about the last ten years.

MARK-UP AND ANNOTATIONS

41

However, there is a wide variety of mark-up languages that have been defined for all possible areas of academia and commerce, the best known of which is HTML. HTML is a mark-up language for hypertext documents, widely used on the Internet for formatting web pages. It follows the standard SGML conventions regarding tag format. HTML Part of the success of HTML is that it is simple. Its limited power and resulting ease of use have contributed majorly to the success of the world wide web. When you are downloading web pages, they will be marked-up in HTML, so you could conceivably use it for marking up other corpus data as well. HTML was developed originally for the basic presentation of scientific articles, with extensions to allow more general formatting of information, and thus it contains a number of tags which allow you to mark paragraphs and headings, and even crossreferences, but there is no tag. Of course you could use one of the existing tags for font-change or italic and bold printing, but that only gives you a limited range of tags, and there is no obvious link between the tag and its meaning. If you just add your own tags, your data is no longer HTML conformant and no browser would know how to handle those tags. For a solution to that dilemma we need to go back one step and look at what HTML is itself. HTML is an application of SGML.It is a definition of a language (L) to mark up (M) hypertext (HT) documents. But a corpus is not primarily a hypertext, and so you might want to use another SGML-based format to mark it up. This other format has been designed to be a simplified subset of SGML, easy to use for the human user, flexible, powerful, and easy to process for the computer. The outcome of that design process is XML, the eXtensible Mark-up Language. It is a lot easier to write programs that can process XML than it is to do the same for SGML, but SGML conformant documents can easily be converted to be XML conformant, so there is no compatibility issue. The constraints on XML are less strict, as we will see in chapter 8, when we write a couple of tools to work with XML data. The corpus encoding schemes that have been developed for SGML are now also available for XML (see XCES in the appendix), and XML looks like it is going to be a major breakthrough in mark up. For all these reasons it is advisable to use XML to store your data; it's extensible, so you can just add your own tags if you have to, and there is also a lot of software available to process it, despite the fact that its specification is still considerably less settled than SGML. XML will be described in more detail in chapter 8, where we will develop a couple of programs for handling texts which are encoded in XML format. Another reason for using XML is its future development. By many people it is seen as the mark-up language which will transform the way data is stored and exchanged world-wide, so knowing XML is a useful skill in itself. Annotating data is a complex topic, worthy of its own book(s). For our purposes we will keep things simple, as we are concentrating on the operations we want to perform on the data. For examples further on in the book we will be using plain text,

42

BASIC CORPUS CONCEPTS

with no annotations, stored in a single file on the computer's hard disk. However, if you decide to add annotations to any data you are collecting, it is worth using XML as the mark-up format. One of the sample applications we will be looking at in the final part of this book will allow you to process such data.

3.3.3 Error Correction Another point is worth mentioning in connection with mark-up: it is unavoidable that there will be typographical errors in a corpus, as all methods to enter data are prone to the introduction of glitches. Whether there is a smudge on the page that is scanned in, or just a momentary lapse of the keyboarder, errors will always creep in. Unless the amounts of data involved are very small, it is generally not worth correcting those, as too much time is spent on this, which could probably be used for more productive tasks. Furthermore, even sending the data through a spell checker is not guaranteed to make it any better, as rare words could not be recognised and changed, while other errors might happen to coincide with the spelling of other, correct words and remain uncorrected. There is, however, a kind of error which should be corrected, namely faulty mark-up. Unlike natural language, mark-up has a strict grammar, which can easily be checked automatically. So a simple computer program (like the one we will develop in chapter 8 ) can quickly tell you whether there are any inconsistencies, and more importantly where they are. However, this applies only to the syntactic wellformedness, as the computer cannot check whether the mark-up has been applied correctly on the semantic leveL

3.4 COMMON OPERATIONS In this section we will investigate some of the basic operations which are usually applied to corpus data. These operations will later be implemented in Java, so it is important that you have an idea of how they work, and what what they can be used for. All of them are described in more detail in the accompanying books of the series, namely McEnery and Wilson (1996) and Barnbrook (1996}, with the latter having an emphasis on how they can be used in linguistic research.

3.4.1 Word Lists A word list is basically a list of all word types of a text, ordered by some criterion. The most common form is that of a word-frequency list, where the words are sorted in descending order of frequency, so that the most common words are at the top of the list, and the rare ones at the bottom. A word-frequency list can give you an idea of what a text is like: if it is a firstperson narrative, you would expect to see a lot of occurrences of the personal pronoun 'I', and certain other often used words might reflect what the text is actually about. They can also be used to compare different texts by working out which words are more frequently used in one text than another.

43

COMMON OPERATIONS processing tasks, such as creating for example, what the length of a have got to the last letter of the task of creating a sorted reverse read each lowing description: reverse the letters in the ext for each fill in a few more gaps: ch word in the text read the next xt word from the text reverse the text reverse the word insert the for all words in the list reverse s in the list reverse word print ecise in that it reflects how each program is executed once for each xplicit the relationship between a ist of words, and that we insert a more words in the file read next read next word from file reverse from file reverse word check if YES: skip eck if word is in list

word lists, computing parameters of a te word is, you would have to go through it word, making sure that you don't stop to This is a l word list from a text file. reverse the lette word from the text sort the list alphabetically word word in the text read the next word fro word from the text reverse the word in word insert the word into a list sort word into a list sort the list for all word print word Note that we now have word Note that we now have changed the word is being dealt with at a time, and word (or token) in the text we're lookin word and the list of words, and that we word into the list, which was not obviou word from file reverse word check if w YES: sk word check if word is in list NO: i YES: skip word word is in list NO: insert word into list close i word

Figure 3.2: Concordances for the word 'word' An alternative way of ordering a word list is alphabetically. Here you could for example see which inflected forms of a lemma's paradigm are used more often than others.

3.4.2

Concordances

Looking at a word list of a text might be useful for getting a first impression of what words are used in the text, and possibly also the style it is written in, but it necessarily has to be rather superficial and general. It is far more useful to look at the individual words themselves, analysing the local context they are being used in. This allows you to analyse the way meaning is constituted by usage. The tool for doing that is the concordance. Concordances have been around long before corpus linguistics, but only with the advent of computers and machine-readable corpora has it become feasible to use them on an ad-hoc basis for language research. Previously concordances took years to produce, and were only compiled for certain works of literary or religious importance. However, it is now possible to create exhaustive concordances for sizeable amounts of texts in a few seconds on a fairly standard desktop computer. There are basically two types of concordances, KWIC and KWOC. These two acronyms stand for keyword in/out of context, and refer to the way the visual presentation is laid out: KWIC, which is by far the more common of the two, consists of a list of lines with the keyword in the centre, with as much context to either side of it as fits in the line. KWOC, on the other hand, shows the keyword on the margin of the page, next to a sentence or paragraph which makes up the context. In figure 3.2 you can see an example concordance in KWIC format. This is an extract from a concordance of the word word in the text of this book (before this figure was added to it). KWIC displays allow quick and easy scanning of the local contexts, so that one can find out what phrases and expressions a word is being used in, or what adjectives

44

BASIC CORPUS CONCEPTS

modify it if it is a noun. For this purpose most of the time only a narrow context is required. As soon as the context exceeds a single line on a printed page or a computer screen, it loses its big advantage, namely that the keyword (or node word) can be instantly located in the middle of it. Most computer software for corpus analysis provides a way of producing a KWIC display, while the KWOC format is not really used much in corpus linguistics.

3.4.3

Collocations

By sorting concordance lines according to the words to the left or right of the node word it is very easy to spot recurrent patterns in the usage of the node word. However, if you have more than a couple of hundred lines to look through, it becomes very difficult to not lose sight of the big picture. One way to reduce large amounts of concordance lines to a list of a few relevant words is to calculate the collocations of anode word. The basic concept underlying the notion of collocation is that words which are important to a node word will tend to occur in its local context more often than one would expect if there was no 'bond' of some sort between them. For example, the word cup will be found in the context of the word tea more often than the word spade, due to the comparatively fixed expression cup of tea and the compound tea cup. There is no particular reason why spade should share the same link to tea as cup. Collocations are not limited to straightforward cases like this; quite often a list of collocates comes up with a few words one would expect, but also with a number of words which are obvious once you think about them, albeit not ones a native speaker would come up with using their intuition. This fact makes collocation a useful tool for corpus analysis, as it helps to unearth subliminal patterns in language use of which speakers are generally not aware. A number of studies (e.g. Stubbs (1995)) show how certain words (e.g. the apparently neutral verb to cause) are always used with predominantly negative words, even though the word itself does not actually have any negative connotations when taken out of context. One important aspect when working with collocations is how they are evaluated: there is a list of almost a dozen different functions which compute significance scores for a word given a few basic parameters such as the overall size of the corpus, the frequency of the node word, the frequency of the collocate, and the joint frequency of the two (i.e. the count of how often the collocate occurs in the context of the node word). Some of these functions yield similar results, but they all have certain biases, some towards rare words, some towards more frequent words, and it is important to know about these properties before relying on the results for any analysis.

3.5

SUMMARY

In this chapter we have looked at the first step of corpus exploration, building your own corpus. There are several ways of acquiring data from a variety of sources, but most often you will not have much choice when you want to work with specific text material. It is important to design your corpus structure in an appropriate way

SUMMARY

45

if you want to fully exploit external information related to the texts. Certain avenues of exploration will be much easier if you spend some thoughts on either naming conventions for data files or a directory structure to put your files in. Annotations play an important role when you want to incrementally enrich your data with more information. This is useful if you want to search for linguistic events which are not lexicalised, i.e. where you cannot simply retrieve instances of a certain word form. There are many different ways annotations can be stored in a corpus, and it is best to follow the emerging standards, using XML and the TEl guidelines or the CES. Much work has been done on developing general and flexible ways of marking up corpus data, so it would be a waste of time trying to reinvent the wheel. Finally we've looked at the most common operations that are used to analyse corpus data. Word lists are conceptually easy and can be produced very quickly. They can give an instant impression of what a text is about, or in what way it is different from other texts. Slightly more complex are concordance lines. While word lists deal with types, concordances show tokens in their immediate environment. Sorting concordance lines according to adjacent words can give useful cues as to fixed expressions and phrases a word is used in. Collocations are the next step up from concordance lines. Here we take all words which occur within a certain distance from the node word (the so-called collocates) and assign a score to each according to how likely it is that they are near the node word simply by chance. Several functions exist to compute such scores, and they come up with interesting results. Once we have covered the basics of Java programming, we will start developing some applications to perform the operations described in this chapter.

4 Basic Java Programming Before we can start writing programs in Java we will need to cover some more ground: the basics of object-oriented programming. We have already heard that Java is an object-oriented programming language, and now we are going to see what that means for writing programs in it.

4.1

OBJECT-ORIENTED PROGRAMMING

Object-oriented programming, or OOP, was developed in order to make it easier . to design robust and maintainable software. It basically tries to provide an easy to understand structure or model for the software developer by reducing the overall complexity of a program through splitting it into autonomous components, or objects. OOP is a topic which in itself is taught in University degree-level courses, but for our purposes we will limit ourselves to the aspects of it which are relevant for developing small to medium-scale software projects. The central terms in OOP are class and, of course, object. In the following section we will discuss what they are and how you can use them in programming. We will also learn how Java specifically supports the use and re-use of components through particular conventions.

4.1.1

What is a Class, What is an Object?

Any computer program is constructed of instructions modelling a process, like for example the way a traffic light at a junction works, how tomorrow's weather can be predicted from today's temperature, humidity and air pressure, and so on. These are either copies of real world systems (the traffic light), or theoretical models of those (the weather prediction). In corpus linguistics we can for example implement models of part-of-speech assignment, lexical co-occurrence, and similar aspects of texts. However, we are generally not interested in all aspects of reality, so in our model we restrict ourselves to just those aspects which are deemed important for our application. We thus create an abstraction, leaving out all those irrelevant details, making the model simpler than the real world. So, by creating a computer program we effectively create a model of what is called the domain we are dealing with. In this domain there are objects, for example words, sentences, texts, which all have certain properties. Now, if a programming language has objects as well, it is fairly easy to produce a model of the domain as a program. Writing a program is then easily separated into two phases, the design

48

BASIC JAVA PROGRAMMING

phase, where you look at your domain and work out what objects it contains and which of them are relevant to the solution of your problem, and the implementation phase, where you take your design to the machine to actually create the objects by writing their blueprints as Java source code. The important word in the previous sentence is blueprint. You don't actually specify the objects, but templates of them. This makes sense, as you would usually have to deal with more than one word when analysing a text, and so you create a Word class, which is used as a template to create the actual word objects as needed. This template is called a class!template, and classes are the foundation underlying object-oriented programs. A Java program consists of one or more classes. When it is executed, objects can be created from the classes, and these populate the computer's main memory.

4.1.2

Object Properties

An object has a set of properties associated with it. For example, a traffic light has a state, which is its current light configuration. When you model a traffic light, you need to keep track of this. However, you wouldn't really need to know how high it is, or what colour the case is painted in. These properties are not relevant if you only want to model the functionality. If, on the other hand, you want to create a computer animation of a junction, you would probably want to include them. When designing your class, you need to represent those properties, and the way to do that is to store them in variables. Each class can define any number of variables, and each object instance of the class will have its own set of these variables. What you need to do when you're designing the class is to map those properties on to the set of available data types. We have already come across the simple data types in chapter 2, but classes are data types as well, so you can use other classes to store properties of your objects. For our traffic light example we would want to store the state as a colour, and there are several options how we can represent this: 1>

1>

1>

1>

we could assign a number to each possible colour and store it as a numerical value, e.g. red is 1, amber is 2, and green is 3. we could use the first letter of each colour and store it as a character value, e.g. 'r', 'a', and 'g'. we could use String objects storing the full name, i.e. "red", "amber" and "green". we could define a special class, Colour and have different objects for the respective colours we need for the traffic light.

All of these have their own advantages and disadvantages. From a purist's point of view the final option would be best, but it also is a bit more complex. We will come back to this point further on.

4.1.3 Object Operations Apart from properties, classes also define operations. These allow you to do something with those properties, and they model the behaviour of your domain object in

OBJECT-ORIENTED PROGRAMMING

49

terms of actions. For example, a traffic light can change its state from red to green, and this could be reflected in an appropriate operation. In programming these are called either methods or sometimesfimctions. In this book we will refer to them as methods. A method consists of a sequence of instructions. These instructions have access to the properties of the class, but you can also provide additional information to them in form of parameters. Methods can be called from within other methods, and even from within other classes. We will see when this is possible later on when we are talking about accessibility. Apart from passing information to a method, a method can also return information back to the caller. Typically a method will work on the object's properties (e.g. a lexicon) and the parameters which have been passed to it (e.g. a word to be looked up), compute a value from them (e.g. the most likely word class of the word in the lexicon), and return it. This is in form of a variable, and you will have to define what data type this variable is (e.g. String). A method must have a single data type, which is specified in its declaration. If you do not want to return any data, you can use the keyword void to indicate that. Methods don't have to have any link to the class, but the whole point of OOP is to bundle related things together. Therefore a class should only contain properties and methods which actually have something to do with what the class is modelling. One last point needs mentioning: apart from ordinary methods, a class has a special type of methods, constructors. A constructor is a special method to initialise the properties of an object, and it is always called when a new object is created. It has as its method name the name of the class itself, and it has no return value. As it implicitly creates an instance of the class, you don't specify a return value. A class can have more than one constructor with different parameters.

4.1.4

The Class Definition

We will now look at how we can implement a traffic light as a class in Java. There is a fixed way in which a class is declared in Java. It is introduced by the keyword class followed by the class definition enclosed in curly brackets. A class needs to have a name, which by convention is spelled with an upper case letter and internal capitalisation; thus our example class would be named TrafficLight. We need to save this class in a file "TrafficLight. java" so that the java compiler can handle it properly. For better documentation it is always a good thing to put the name of the file in a comment at the beginning and the end. A comment is a piece of text intended for adding notes to the source file which are to be ignored by the compiler. In this example they are marked by the sequence I * . . . * I; everything between these two pairs of characters is ignored. So we start with: I*

* TrafficLight.java *I class TrafficLight {

II end of class TrafficLight

50

BASIC JAVA PROGRAMMING

We will now add a property to that class, namely the state of the light. To represent this we'll choose a String object, which has the benefit that it is easy to understand without further explanation, unlike a numerical representation where we would require a legend explaining what number corresponds to what colour. It also avoids the temptation of performing arithmetical operations on them, which would not make sense as the numbers don't stand for numerical values, but are just symbols. Properties are declared by a data type followed by a variable name. If we call our light configuration state, the revised class looks like this: I*

* TrafficLight.java *I

class TrafficLight String state;

II end of class TrafficLight

Variables are conventionally put at the top of the class definition, before the method declarations. While class names are spelled with an initial capital letter, variables start with a lower case letter, but can have internal capitalisation. This convention allows you to easily distinguish between a class name and the name of a variable. You don't have to follow this convention, but as it is generally accepted it would be good to do so for consistency. Next we want to add a constructor to our class. As a traffic light always shows a certain colour (unless it is switched oft) we want to use this as a parameter. Constructors are usually next in line after the variable declarations: class TrafficLight String state; TrafficLight(String initialState) state = initialState;

II end of class TrafficLight

The constructor has to be called TrafficLight,just like the class. It takes a string argument, which we call ini tialState. When we want to create an instance of the TrafficLight class, we have to provide this value. The JVM allocates space for the object in memory, and then executes the constructor. There we simply assign the initial state value to the state variable, so that it now has a defined value that we can access from other methods. Next we want to add a method that changes the state. For this we don't need any parameters, as the state can only advance in a well-defined sequence. We simply check what the current state is, and we change the state according to what it should be next. For that we're using a chain of if statements (see section 2.2.2 ).

OBJECT-ORIENTED PROGRAMMING

51

To compare String objects we can't just use the double equal sign, as that would compare the objects, and not the contents of the object. However, when comparing strings what we want to know is whether they have the same sequence of characters as their content. Thus we have to use another method, equals (), to check for equality of content. This is explained in more detail in section 5.3.3. The changeState () method looks like this: void changeS tate () if (state .equals ("red")) { state= "red-amber"; } else if(state.equals("red-amber")) state = "green"; else if (state. equals ("green")) state = "amber"; } else if(state.equals("amber")) { state = "red";

This method checks what the current value of state is, and then assigns assigns a new value to it. It does not return a value, so we declare it as void, and it takes no parameters, so we simply leave the round brackets empty. You cannot leave them out, as they are part of the syntax of method declarations. All we need now is a way of finding out what colour the traffic light is actually displaying right now. We introduce another method, getColour (), which will return the current state of the traffic light. This method looks like this: String getColour () return(state);

It returns a String value, so we specify this by putting the data type before the

method name. You can put it on a separate line, which makes it easier to find the method name (especially if you have larger classes), as it always starts on the first column. We return a value using the return ( ) statement. It exits the method, so any further instructions after it would be ignored; in fact, in this situation the compiler issues a warning and might not compile your class if there are any further statement following the return (). This class is now fully functioning, and can be used to create TrafficLight objects. You can create an instance of it, you can change its state, and you can then also query what its current state is. If we want to make use of this, we would need other classes, for example Junction or Crossing, which would make use of that class. For testing purposes, however, we want to be able to run the class on its own. To execute a class as a program, you need another special method called main (). Whenever you tell the Java interpreter to run a class, it tries to locate a main ( ) method in it, and starts executing it. This method has a special signature, or declaration, which includes a few keywords which we will explain later on:

52

BASIC JAVA PROGRAMMING

public static void rnain(String args[]) TrafficLight rnyLight

~new

TrafficLight("red");

for(int i ~ 0; i < 6; i++) { System. out .print ("My TrafficLight is: "); Systern.out.println(rnyLight.getColour()); rnyLight.changeState();

At the beginning we create a traffic light, myLight, which initially is red. We do this with the new statement. Unlike non-object data types (e.g. int variables) you need to do this explicitly, as creating objects involves a lot more internal work for the JVM. For a primitive type all that is needed is to allocate some storage space, but with an object the JVM also has to call the constructor to make sure it is properly initialised. Furthermore, objects are stored in a different area in memory, which involves some additional internal housekeeping. While you need to explicitly create objects, you don't have to worry about cleaning them up once they are no longer needed. The JVM keeps track of all the objects which are still in use, and those that aren't will be reclaimed automatically through the so-called garbage collection. After having created our traffic light, we tilen enter a loop in which we print out the current state of the light, and then change tile state. After six iterations of the loop we finish. The System. out. print () statement prints its argument on the screen, and System. out. println () does tile same but witil an added line break at the end. We will discuss output in more detail in chapter 6. The whole class now looks like this: /*

* TrafficLight.java *I class TrafficLight String state; TrafficLight(String initialState) state ~ initialState;

void changeS tate () i f (state.equals ("red")) state ~ "red-amber"; else if(state.equals("red-arnber")) state = "green"; else if (state. equals ("green"))

state

=

It

amber";

else if(state.equals("arnber")) state ~ "red";

String getColour () return(state);

public static void

OBJECT-ORIENTED PROGRAMMING

53

main(String args[]) ( TrafficLight myLight = new TrafficLight("red"); for(int i = 0; i < 6; i++) ( System.out.print("My TrafficLight is: "); System.out.println(my Light.getColour()); myLight.changeState() ;

II end of class TrafficLight

If you type this class in, save it as TrafficLight. java, compile it with j avac TrafficLight, and run it with java TrafficLight, you will see the following output:

My My My My My My

TrafficLight TrafficLight TrafficLight TrafficLight TrafficLight TrafficLight

is: is: is: is: is: is:

red red-amber green amber red red-amber

You can see that it repeats its states, so it seems to be functioning properly.

4.1.5

Accessibility: Private and Public

One important notion with classes is that of the outside. You have a class, and everything else outside the class. On the outside there can be complete chaos, but if it is properly designed, you can bank on your class being stable, robust and reliable. The key to this security is not to let anybody from the outside touch your data. Unfortunately such a class might be safe and secure, but it would not actually be useful, as it would not be able to provide any services without an interface to the outside world. So we need to have some entry points that are accessible from the outside. These entry points are summarised in the Application Programming Inteiface, or API. The API of a class gives you an overview of all the services available from that class, and it also tells you how to access them. Without any information on the API of a class you might as well not have the class available at all, as it would be completely useless. We learned earlier in this chapter that a class has properties and methods. In principle the API would include both of these, but in order to guarantee internal consistency the data is usually hidden and can only be accessed via specific methods. Consider the following change to the main () method of our traffic light: public static void main(String args[]) ( TrafficLight myLight =new TrafficLight("red"); for(int i = 0; i < 6; i++) ( if(i == 3) myLight.state = "blue"; System. out. print ("My TrafficLight is: "); System.out.println(my Light.getColour()); myLight.changeState() ;

54

BASIC JAVA PROGRAMMING

What we are doing here is directly assigning a new value to the state variable of the myLight object when the loop variable reaches the value 3. This changes the output to look as follows: My My My My My My

TrafficLight TrafficLight TrafficLight TrafficLight TrafficLight TrafficLight

is: is: is: is: is: is:

red red-amber green blue blue blue

You can see that once it has turned to 'blue', the traffic light never changes, as there is no defined follow-up to a blue light. This is clearly something we want to avoid, as we cannot foresee all possible colours somebody might want to assign to a traffic light. Instead we want to control tightly the way the state can change, and the only way to enforce that is to forbid external access to the state variable. This is known as data-hiding, which means you don't expose the properties of your objects to the outside world, but instead hide them behind methods. In those methods you can perform extra checks to make sure your object retains a consistent state. This is something you should also do in the constructor, otherwise one could simply write: TrafficLight myLight =new TrafficLight("blue");

and get into the same trouble. The easiest way to avoid such potential problems is to make sure the value is valid before assigning it. We will now see how this can be done. Let's assume you want to add the capability of setting the traffic light's state to any value, without going through all intermediate states. You don't want to allow direct access to the variable holding the state, to avoid potential disasters involving blue lights. So you write a method setState () which takes the new state as a parameter, and performs some checks on it before it actually assigns it: void setState(String newState) { if(newState.equals("red") II newState.equals("red-arnber") II newState.equals("green") II newState.equals("arnber")) { state = newState; J else { System.out.println("Illegal state value");

Here we first compare the new state with all allowed values, and only if a match was found we take on board the change. Otherwise we print out a message and keep the state as it was. You could also call this method from the constructor. There it would be a bit more complicated, as the state has not yet been assigned a value, so the easiest way to deal with this would be to assign it a default value before attempting to set the value given by the user. The modified constructor would then look like this:

OBJECT-ORIENTED PROGRAMMING

55

TrafficLight·(String initialState) { state = "red•; setState(initialState);

If the ini tialState value was not valid, the traffic light would simply remain 'red'. But how do we prevent malicious (or ignorant) users of our class from by-passing the setS tate () method and assigning a value directly to the variable anyway? For this the Java language provides accessibility modifiers. There are several of those, but we will only look at the most important, private and public. These modifiers can be applied to both properties and methods (see for example the main ( ) method above, which needs to be declared public). If you declare something as private, it can only be accessed from within the class itself. So our revised version of the TrafficLight class would contain the following variable declaration: private String state;

The methods on the other hand would be declared public, as they would constitute the public interface, or API, that other classes would use to work with. Private methods are not visible from the outside, just like private variables. They are entirely contained within the class, and they can only be accessed from other methods of the class. The main reasons for having private methods are to factor out common pieces of code which are shared by multiple methods, or to provide auxiliary methods which are not directly related to the class' functionality. In that case you wouldn't want them to clutter up the API, and you can hide them by making them private. To summarise this point about access restrictions, you want to keep as much of your actual class hidden from the outside in order to reduce the complexity of interactions. Instead you would want a limited number of clearly defined interfaces to a number of 'services' your class provides. All these services are implemented as public methods, and all others should be kept private.

4.1.6 APis and their Documentation When you are re-using other people's classes (or any of the large Java class library for that matter) you need to know what services (i.e. public methods or constants) they provide, and what you need to do in order to access them. This is quite a vital point, because if you don't know that something exists it might as well not be there. Therefore the designers of Java have included a special mechanism to document methods in a way which can automatically turned into nicely organised web pages directly from comments that you have included in your source code. The basic problem with documentation is that it rarely happens. Programmers are too busy programming and don't want to waste precious time writing documents that go out of date every time they change something in the software or that get lost as soon as you hand them out to a user. The only thing they might just about be prepared to do is to put comments into the code which explains what goes on to programmers who later might have to maintain the software.

56

BASIC JAVA PROGRAMMING

Comments can be added to Java source files in two forms: two slashes at any point on a line mark everything until the end of the line as a comment. This could look for example like this: tokens = tokens + 1;

II add one to the number of tokens

The last character of the source line is the semicolon; all spaces and the text from the two slashes to the end of the line are ignored. If you want comments to span more than one line, you need to enclose them in I * . . . * I, like this: I* Add one to the number of tokens. After this instruction the number of tokens is one larger than before. tokens = tokens + 1;

*I

In this comment style you need to be careful not to forget the closing sequence, otherwise you might find that large parts of your program are being treated as comments. This, in fact, is one of the more frequent uses of this comment style: to comment out chunks of source code that you don't need any more, but that you still don't want to delete as yet. Both comment styles can be used together at any time. Comments are explanations meant for human consumption, which is why they are marked in the source code so that the compiler can ignore them when translating the program. However, a special tool, called j avadoc can extract comments from the source code and create documentation out of it. Whenever something in the program changes, all that is necessary to update the documentation is to make sure that the comments get changed. This is not much effort, as the comments are usually right next to the source code which has been changed. Once that has been done, a single run of j avadoc can then update the on-line documentation. In order to make the extraction task easier for the j avadoc tool, the documentation comments have to follow certain conventions: they need to start with a ' I* *' sequence (whereas a normal comment only requires one asterisk), and certain keywords are marked by'@' signs. Let's look at an example: I** * Look up a word. * This method looks up a word in the lexicon. If it cannot be found * initially, a second attempt is made with the word converted to lower * case. * @param word the word to look up.

* @return true if the word is in the lexicon, false otherwise.

*I public boolean wordLookup(String word)

The most important keywords are @par am and @return, which allow you to specify the method's parameters and return value respectively. When you are documenting a class, you can also use @author and @version tags. For the most relevant set of j avadoc tags see table 4.1. The @par am, @throws, and @see tags can be used multiple times, as can the @author tag. Exceptions are described in section 5.2.

57

INHERITANCE

@author @version @par am @return @throws @see

The name of the class' author A version number (or date) Parameter of a method The return value of a method Exception thrown by the method Crossreference to other method

Table 4.1: Javadoc comment tags The set of all public methods of a class is called its API, and the generated documentation is therefore referred to as API documentation. Together with a brief general description of a class, the API documentation should enable you to use it, even though it is sometimes not that easy, especially with classes that are either complex or badly designed, or both. We will have a look at the actual format of the documentation comments that you have to follow for j avadoc to work when we start developing our first class in the next chapter.

4.2 INHERITANCE Another important aspect of OOP is the ability to derive new classes by extending existing classes. You simply state which class you want to extend, and then you add more methods and variables to it. You can use this to create specialised versions of classes, as we will see in chapter 6, where we extend some existing classes for reading from files to tum them into concordancing tools. Apart from just adding more methods, you can also overwrite existing methods with different functionality. As a simple example consider a pedestrian crossing light. This only has a red and a green light, so we can do away with the transitions to amber and red-amber that we had to handle in the TrafficLight class. So we simply write class PedestrianLight extends TrafficLight { public void changeState() { if (state. equals ("red") ) state

=

"green";

J else { state ;;;; "red";

II end of class PedestrianLight

We don't have to repeat all the other methods, as they are inherited from the superclass, TrafficLight. A PedestrianLight has exactly the same behaviour, the only difference being that it has fewer states. You can even assign an instance of the PedestrianLight class to a TrafficLight variable: they are in an 'is a' relationship, just like a dog is a mammal. We will see how we can use this in later chapters.

58

BASIC JAVA PROGRAMMING

In fact, all classes are arranged in a hierarchy, which means they are all derived from some other class, apart from the class at the top. This class is Object, the most general class. All other classes are directly or indirectly derived from it, and if you write a new class it is implicitly extending Object unless you specify another class (like the PedestrianLight) from which it is derived.

4.2.1

Multiple Inheritance

In Java you can only extend one class at a time, but what if you want to combine two or more classes for joint functionality? In C++, for example, a class can have more than one ancestor class, but this is not allowed in Java for very good reasons. One reason is that it keeps the class hierarchy simple: instead of having a complex network of classes you have a straightforward tree, where each class has exactly one path back to the class Object which is at the root of this tree. The other one is that it avoids potential clashes: if the two parent classes would have methods of the same name, which one should be chosen? If there is only one parent this problem does not arise. But there is also a way of more or less having multiple inheritance in Java. Instead of inheriting a method's implementation you can state that a class simply has a method of that name. This might sound a bit pointless, but is in fact extremely useful. Quite often a class can serve multiple functions in different contexts. The constraint of singular inheritance would make it difficult if not impossible to realise this. Let's assume for example, that you have a class which models things using electricity. They have a property 'required voltage', and methods swi tchOn () and swi tchOff (). Our PedestrianLight would ideally extend both this Electricitem class and the TrafficLight, as it shares features with both. This, however, is not possible in Java, as there can only be a single super-class. The way you do this is through so-called interfaces. An interface!definition is declared just like a class, only you would use the keyword interface instead of class, and all the method bodies are empty. That means you declare a method's signature, but you don't provide the instructions with it. You use the extends keyword to extend a class, and the implements class to implement an interface. This means that you need to provide implementations for the methods declared in the interface. Effectively you can only inherit the implementation from one class, but the functionality of as many interfaces as you care to implement. Interfaces are useful when you have different classes that provide the same service, and you want to be able to exchange them for each other without having to change anything else. An interface can be used to 'isolate' the common functionality, and then you can treat all the classes implementing the interface as if they had the same data type. We will see how this can be used later.

4.3

SUMMARY

In this chapter we have had a brief introduction to object-oriented programming. We have learned how to model our problem using classes which represent entities in the domain we are dealing with, abstracting away all the aspects which are not relevant

SUMMARY

59

to the solution. We have also learned what the relationship is between classes and objects, and that objects encapsulate data and operations related to them. At the end we have seen how to extend classes using inheritance. More common than direct inheritance is the use of interfaces to specify a certain functionality in form of a set of methods. Interfaces allow you to separate different views of an object, and to treat it under different aspects. This is especially useful as objects grow more complex and rich in features.

5

The Java Class Library In this chapter we will first briefly cover the way different parts of the Java class library are organised. Related classes are arranged in so-called packages, and understanding the mechanics of those is fundamental for Java programming. After that we will have a brief look at how Java deals with errors, before we then look in detail at one of the most important classes of the Java language, the String class, which is used to represent sequences of characters. We'll investigate the API of the String class, with examples of how the individual methods it provides can be used. Next we'lllook in slightly less detail at some other useful classes which you will use frequently when writing your own programs, and which are vital if you want to look at existing programs written by other people in order to understand how they work. Some of these classes, however, have been made superfluous by the collection framework, which was introduced in version 1.2 of the JDK. We will look at this framework in the final section. This chapter will contain a few examples to illustrate the usage of the classes described in it. Its main purpose is to introduce those classes and to prepare the ground for using them in the following chapters. For most examples we require external input, which we will be dealing with in the next chapter. So far we have mainly looked at the theoretical background and the principles of object oriented programming, and this chapter is going to make the transition to applying those principles in practice.

5.1 PACKAGING IT UP 5.1.1 Introduction A Java application consists of a number of classes, and through modular design and re-use this number can be quite large. There are classes available for a wide variety of tasks, and it would only be a question of time until two classes were created that had the same name. This would cause a rather difficult situation, namely how you can tell the computer which of two identically named classes you would want to use. As this problem can easily be foreseen, there is a solution built into the language: packages. A package is like a directory on your hard-disk, a container where classes are kept together. Just as when you have two identically named files on your computer's disk, they can be told apart by their full pathname, which includes the location as a path of directories leading to the actual file. This analogy with directories is closer than you might think, as packages directly reflect the directory structure, and the package name of a class is mapped onto a pathname that the NM follows when it tries to find a certain class.

62

THE JAVA CLASS LIBRARY

Basically there are three different types of packages: 1. packages that come with the base distribution of Java 2. extensions which are optional 3. third party and other packages In the following three sections we will look in more detail at each of these types, but first we need to answer another question: how do you actually work with packages? With one exception (which we will deal with in the next section) you have to declare what classes (or packages) you are using in your class. Suppose we want to use a class Date, which is located in the package called java. util, we need to include the following statement before our class definition: import java.util.Date;

This tells the compiler that whenever we are using a class called Date it is in fact the class java. util. Date, i.e. the class from the package java. util. If you would leave out the import statement, you would always have to refer to the class by its full name, which is the package name and the class name separated by a dot (in our case java. util. Date). You can of course do that, but you would save yourself a lot of typing by using the import statement. If you find that you are using a lot of classes from a package you can either enumerate them, as in import java.util.Date; import java.util.Calendar; import java.util.TimeZone;

or you could simply import all classes of the package at once, by writing import java.util.*;

The second method imports a few more classes which you are not actually using. While this might slightly increase the time needed for compiling your class, it has no influence at all on the result, i.e. your own class will be neither larger nor slower. It is, however, better to enumerate the classes when it comes to maintenance: it allows you to see at one glance which other classes you are using, provided you are not importing ones that you don't actually use. In the end this is very much a question of personal style and does not really matter much. But what happens if you import two packages which happen to have classes which have the same name? Suppose you import a package called bham. qtag which has a Tagger class, and another package, lanes. claws, which also has a class called Tagger. Now, when you have a line Tagger myTagger; new Tagger(tagset);

the compiler is at a loss as to which Tagger class you want to use. You could now either not import both packages, which might mean that you will have to write a potentially long list of classes to import individually at the top of your file, or you can simply qualify the class. This means you use the full name, which includes the package, and thus is defined to be unique, as no package can have two classes of the same name. So, you could simply write

PACKAGING IT UP

63

bham.qtag.Tagger myTagger =new bham.qtag.Tagger(tagset); lancs.claws.Tagger myOtherTagger = new lanc.claws.Tagger(tagset);

and the compiler is happy, as there are no more unresolved ambiguities.

5.1.2

The Standard Packages

Part of the power of Java comes from the large number of useful classes that are included in the language specification. Unlike a lot of other languages which only define a minimal set of commands and rely on third party extensions for anything non-trivial, Java includes quite a large set of classes for most purposes. As these classes are guaranteed to be available on any Java system, you can use them without having to worry about whether they are actually installed on the machine of a potential user. This saves you a lot of time, as you don't always have to reinvent the wheel, and the standard packages allow you to concentrate your efforts on the actual logic of your program, without wasting too much time on programming auxiliary classes for other tasks that you might need. Every major revision of Java has included more packages and classes, so it doesn't make much sense to spend too much time listing them all; it is best to consult the on-line documentation of the JDK you have installed on your computer for the authoritative list. In this section we will therefore just have a brief look at packages which contain classes that are important for processing texts and corpora, plus a few more you might be interested in later on. All the classes in the standard package start with java., followed by the actual name of the package. However, you can't import all those classes by typing import java.*;

as the asterisk only applies to classes within a package, and not to sub-packages. The above statement would only include classes from a package called java, but there aren't any, as they are all in sub-packages of java. There is another set of packages that comes with a JDK distribution, which start with sun. These packages are not very well documented, and for a good reason: you shouldn't use them. They are either internal classes that are used behind the scenes, or classes which haven't quite made it into the standard distribution yet. The sun packages are likely to change without notice, and if you want your programs to work for longer than just until the next version of Java is released you should avoid to use them. Generally you won't need them anyway. The most important package is java. lang, and it is so crucial that you don't even have to import it yourself: it is automatically imported by the compiler. It contains some basic classes which are required by the language itself. The package from which you will import classes most often is java. util. It contains a set of utility classes which are not essential but very useful. We will look at some of its classes towards the end of this chapter. All classes which deal with input and output are bundled up in the package java. io. The whole of chapter 6 is devoted to these classes, as they are important for data processing.

64

THE JAVA CLASS LIBRARY

The java. text package provides a number of classes for formatting text in culture-specific ways. For example, some countries use a full stop as a decimal separator, while others use the comma. These classes check the relevant settings in the operating system and format numbers following the right conventions. Similar considerations apply to the translation of the user interface. With the portability of Java this internationalisation plays an important role, which has been neglected in programming for too long. One strong point of Java is the fact that it provides platform-independent access to graphics. Most of this is contained in the package java. awt (AWT stands for abstract window toolkit), and more recently in the extension package j avax. swing (see below). Classes to access the Internet are in java. net. These allow you to establish network connection to other computers or to download web pages from any URL. If you want to access a database from within a Java application you can do so with the java. sql package. SQL is a standard query language for retrieving data from databases, and the classes in this package provide supporting functionality for this.

5.1.3 Extension Packages A while ago developers started to run out of patience with the graphics capabilities provided by the AWT, as even such basic widgets as simple buttons behaved slightly different on different platforms, and developers couldn't rely on their programs working properly everywhere. This was seen by some as proof that the 'write once run anywhere' philosophy of Java was flawed. However, these problems have been successfully solved by using a more low-level approach, where only the very basic operations are managed by the host system, and everything else is realised by Java. So, buttons behave the same on all platforms, as they are no longer 'Windows' buttons or 'XWindow' buttons, but instead Java buttons. The resulting system is much more reliable and consistent across platforms, but considerably larger than the original AWT, which for much of its functionality relied on the underlying operating system. Therefore, the new system, which is called Swing has not been made part of the standard packages, but has instead been put into a new type of package, the Java extensions. These start with j avax, and though they are not included in all Java distributions they are still part of the well-defined Java package system. Several other types of packages also fall into this category, however, these are mainly more esoteric, and we will not discuss them here.

5.1.4

Creating your Own Package

For small one-off projects it might not be too relevant, but whenever you design a class which you might want to re-use at some point in the future you should consider packing it up in a package. This makes it easier to organise your projects and maintain them, as you would put all related classes in the same package. Creating your own packages is very easy: just insert a package statement as the first statement in your source file (that excludes comments, which are allowed to

65

ERRORS AND EXCEPTIONS

come before a package statement). Some of the programming projects that we will look at later have been organised in their own packages.

5.2 ERRORS AND EXCEPTIONS Regardless of how careful you are with writing your programs, there will always be things which go wrong. Handling errors is a very complex matter, and it can take a considerable effort to deal with them properly. Every time your favourite word processor crashes just as you were about to save two hours worth of writing it has come across some circumstances that its programmer hadn't thought of. In general there are two types of errors: those which you can recover from, and those that are fatal. Running out of memory is fatal, as you've just hit the ceiling of what the machine can do. Trying to open a file that does not exist is not such a grave problem; you can just try again opening another file, or abandon the attempt altogether. In Java this is reflected in the distinction between Error (fatal) and Exception (non-fatal). If an error occurs, there is usually not much you can do. Exceptions, on the other hand, only stop the program if you don't handle them. The term for handling an exception is to catch it. This is done in a combination of a try block and one or more catch blocks. As the most frequent exceptions will involve input and output, here is a brief example (you will find more on this in chapter 6 ): II opening a file can produce an IOException FileReader file; try { file; new FileReader("corpus.txt"); J catch(IOException exc) { System.err.println("Problem opening the file: file ; null;

"+exc);

if(file J; null) { System.out.println("Success!");

If there is a problem opening the file 'corpus. txt', an exception is thrown. As this can only happen when we create the FileReader object, we enclose this critical section in a try block. The variable f i 1 e needs to be declared before we enter the block, as its scope would otherwise be restricted to within the block only. A try block is followed by a number of catch blocks, which specify which exception they are dealing with. Any exception that hasn't been caught by a catch block is passed back to the caller of the method the exception was thrown in. In our example the IOException is the only exception that can be thrown when a FileReader is created, so we don't have to worry about any other exceptions. But how do you know what exceptions you have to expect when calling a method? The answer is usually provided in the API. The list of exceptions that a method can potentially throw is as much part of its public interface as the return value or the number of parameters. If an exception is not caught within a method, it has to be declared, using the keyword throws:

66

THE JAVA CLASS LIBRARY

public int countLinesOfFile(String filename) throws IOException {

If in this method you handle the exception with try and catch, you don't have to declare it, unless you rethrow it, i.e. you pass it on after you have dealt with it. The Java compiler will notice if an exception can be thrown in one of your methods which has not been declared, and it will give an error message when you try to compile it. Exceptions are classes, and thus they are organised in a hierarchy, with Exception as the top-level class. When you catch an exception, any sub-classes of the exception you specify will also be caught, so catching an IOException will also deal with a FileNotFoundException, which is a sub-class of IOException. If you're lazy you can simply catch Exceptions directly, but then you lose all the information about what went wrong. However, you can choose exactly of what level of granularity you want to deal with.

STRING HANDLING IN JAVA

5.3

The basic unit of analysis in corpus linguistics is generally the word. So we will use this as the entry point into the Java programming language, especially since it is also important for programming in general: there are a number of occasions where words or sequences of words are used, such as names of files, error messages, help texts, and labels on buttons. In computing terminology a sequence of characters is called a String, and there is a class of that name in Java's basic class library, which we have already come across in the previous chapter.

5.3.1

String Literals

When you are assigning ~ value to a numerical variable, you can simply write the value in the source code as a sequence of digits, for example int aNumber = 13;

To do the same with a String, you have to tell the compiler exactly where the string starts and where it ends. For that reason you need to enclose a string in the source code in double quotes, as in the following example: String weather= "Sunshine, with light showers in the evening.";

The quotes themselves are not part of the String, they are only the delimiters indicating the extent of the string. A sequence of characters such as this is called a literal. It is analoguous to a literal number in that you can use it directly in String expressions and variable assignments.

5.3.2

Combining Strings

In the Java language, Strings have a special position, as they are the only objects which can be handled directly with an operator instead of method calls only.

STRING HANDLING IN JAVA

67

This operator is the concatenation operator('+'), which allows you to combine two Strings. Furthermore it also allows you to convert other objects and primitive data types to Strings, which happens automatically when you are concatenating them, as in the following examples: ' int score = 150; String examplel = "The score is "+score;

In these examples we are using two other data types, an int value and a Date object. The variable example1 will have the value "The score is 150", which is a new String created by converting the variable score to a String and adding it to the specified literal String (note the space before the quote mark; without it there would be no gap between the 'is' and the '150'). String nestedString = "The variable 'examplel' has the value \"" + examplel + "\".";

A more complicated looking example is the variable nestedString: Here we include the value of example1, and enclose it within double quote marks: to make it clear that these do not mark the end of the literal string they are preceded by a backslash. That means that the double quote is to be part of the string, whereas the quote mark following it (the last one in the line, just before the semicolon) is terminating the literal. Strings in Java can contain any character from the Unicode character set, and they can be of any length. One thing you cannot do with them, and this might sound a bit odd at first, is to change them: String objects are immutable, which means that once they have been created they will always contain the same value. There are several reasons for this, mainly to do with security constraints, which we shall not worry about here. The immutability is no big problem for us, as we can always create a new String object with a new value if we want to change it; the fact that an object's value cannot be changed does not mean that all variables of that type cannot be changed either. So, in the API of the String object you will find plenty of methods to access all or part of it, but none to change it. Should you ever require such functionality, there is a class called StringBuffer which can be used to build up character strings, and which can be converted into a String.

5.3.3

The String API

In this section we discuss the most important methods of the String class. These can be divided into several groups: first those to create String objects, as there are multiple ways to do so, and a few basic general methods. Then there are methods to compare Strings, followed by some to transform the content of a String. After methods for finding sequences in Strings we will look at ways of getting at sections of an existing String object. As mentioned above, the String class is part of the java. lang package, which is automatically imported into all classes, so you don't explicitly need to do so yourself. We will not go through all the methods provided by the String class, as it contains quite a large number of them. In order to be able to use a class efficiently you

68

THE JAVA CLASS LIBRARY

will only need to be able to know what kinds of methods it provides, so that you can then look up the exact form of a method you need in the documentation. Apart from the Java Development Kit you can also download its documentation (from the Sun website); this is a set of web pages which list all classes in the standard class library together with all methods and their parameters. This documentation has actually been produced from the class library's source code using the j avadoc tool, which means that it is very easy for you to produce the same kind of documentation by following the j avadoc conventions.

Creating a String There are two commonly used constructors available, which are listed in table 5.1. String(char value[]); String(StringBuffer buffer); Table 5.1: Frequently used String constructors The first constructor uses an array of char values to create a String. This constructor is not used very often, but it is useful when you get an array of characters from some other class (e.g. by retrieving data from a source over a network or out of a database) and want to turn it into a String. The second constructor we will be looking at takes its data from a StringBuffer. We have already heard that a StringBuffer is used to build strings, as they can be changed, unlike String objects. Here we now create an immutable copy of the StringBuffer, which then is disassociated from the original, which means that changing the StringBuffer in any way does not alter the String object. Another way to create a String in Java is by putting together String literals. These can also be mixed with other data types, which get automatically converted. Java supports the '+' operator, which combines individual Strings to one single String: String name = ••Lord Gnome"; int age = 94; String line = My name is ''+name+'' and I'm 11

11

+age+" years old.'';

Here we construct a string called line by concatenating three literal strings (''My name is ","and I'm ",and "years old.") with two variables, a String (name) and an int value (age). As an aside, note the space characters which are included around the places where the variables and the literal strings are concatenated. Without these spaces you would get ''My name isLord Gnomeand I ' m9 4years o 1 d . ", as the components are just stuck together. Another very common way of creating a String object is not actually part of the String API, but is a method of the class Object: the method toString (). This method returns a representation of the object in form of a String, and as it

69

STRING HANDLING IN JAVA

is a method of Object, and all other classes are directly or indirectly derived from Object we can rely on every object having this method. So we can equally well write StringBuffer sb =new StringBuffer(); sb. append ( "some text" ) ; sb. append ( " and even more text") ; String textl = new String(sb); String text2 = sb.toString(); II textl and text2 are both "some text and even more text"

Here we first create an empty StringBuffer object, to which we then append two literal strings. Then we declare two String variables textl and text2, and assign to them in two different ways the contents of the StringBuffer. First we're using the constructor method, which directly creates a String from a given StringBuffer, and then we use the more general way of employing the toString () method, which the StringBuffer inherits from Object. If you have objects from other classes, you can still use that way, as all classes have a toString () method. However, this will not always return what you expect, especially from classes which do not have as direct a textual representation as the StringBuffer.

This is in fact a design aspect that you should keep in mind when writing your own classes: provide a toString () method which returns a sensible representation of the class. For example, we will in a case study in chapter 9 write a class Rule which represents a replacement rule for a morphological stemmer. In the stemmer itself we never actually need a String object which represents a Rule object, but it is very useful to have for debugging purposes. For the Rule class we have chosen the individual parts of the rule to be printed, which makes it easy to identify which rule this object represents. If you don't provide a toString () method yourself, the default method of the parent class Object will be used; as this method has no access to the semantics of your class definition it just prints out the address in memory at which the object is kept. This is enough to distinguish two objects, but is not very useful for any other purposes. Basic String methods

int char[] String

length(); toCharArray(); toString();

Table 5.2: Miscellaneous String methods In this section we cover some general String methods, as shown in table 5.2. As you will notice, they have a data type in front of the method name, unlike the constructors we looked at in table 5.1. This data type described what type the return value of the method will be (and constructors do not have an explicit return value).

70

THE JAVA CLASS LIBRARY

So, the length () method returns an int value, and if you want to assign that to a variable, it needs to be of type int. The first of these methods is length (), which will return the length of a String object, measured in the number of characters. You most often use this method for looping through a string in conjunction with the charAt () method (see below). It can also, for example, be used to produce a list of word length vs. frequency. When looking at the constructors in table 5 .1, we have already come across the array of char values which is used internally to store the content of a String. With the toCharArray () method we have the reciprocal functionality, as it converts the String to an array. This array is a copy of the string's content, so that you can change it without affecting the original String object. You can use this method when you e.g. want to manipulate the individual characters of the string in a way for which there are no other suitable methods provided. And finally, there is the toString () method. This method is inherited from the Object class and thus exists in all classes, but in the String class it is fairly redundant, as a String object is already a String. All this method does is therefore returning a reference to the String object itself.

Comparing Strings Methods for comparing strings seem about as relevant as the to String () method, due to the existence of the comparison operator '= ='. However, it is more complex than one might initially think, as this operator has rather complicated semantics. In order to understand the difference between the comparison operator and the methods for comparing String objects, we need to know more about how objects are handled in the NM. String variables

String Object

other String Object

~!--------...~~

"aardvark"

I

Figure 5.1: Comparing string objects

Looking at diagram 5.1, we can see that a String variable provides a reference to a position in memory where the corresponding object is stored. The exact physical location is not relevant, and can change during the execution of a program, so for all intents and purposes the reference is all we need in order to work with that object. Two variables can point to the same object, as in String varl = new String ( "aardvark" ) ; String var2 = varl;

STRING HANDLING IN JAVA

71

Here varl is assigned to var2, and they both refer to the same physical object. We now continue our example by introducing a few comparisons: if(varl ;; var2) { System.out.println("varl is equal to var2"); if(varl =; "aardvark") { System.out.println("varl is equal to 'aardvark'");

If you were to try this out, you would get the surprising result that varl is equal to var2, but it is apparently not equal to the literal string 'aardvark', even though it obviously is. The reason for this lies in the way the '==' operator works: it compares the references, not what the refered objects actually contain. As we have earlier assigned the two variables to each other, they are identical, and thus they are referring to the same object in memory. In the second comparison, as well as this one: String var3 = "aardvark"; if(varl ;= var3) { System.out.println("varl is equal to var3");

the literal string and var3 are different objects, though their content is the same. It is just the same as if you were comparing two cups of tea, which are both full: they have the same content, but they are two different cups. This means that the '==' operator is only really useful for comparing the primitive types, but not for objects, unless you really want to check whether two values refer to the same physical object. This applies not only to String objects, but to all objects, as the concept of two objects being equal in some way or another generally doesn't coincide with the objects being identical. For this reason there is a method in the class Obj e~t which is intended for defining what 'equality' means, the equals () method. This method takes as its parameter a variable of the type Object, which means it can be used to compare any other objects to our String object, regardless of whether the other object is a String or not. Comparing Whole Strings boolean boolean int int boolean boolean boolean boolean

equals(Object other); equalslgnoreCase(String otherString); compareTo(String otherString); compareTolgnoreCase(String otherString); startsWith(String prefix); startsWith(String prefix, int offset); endsWith(String suffix); regionMatches(int offl, String other, int off2, int length);

Table 5.3: String comparison methods

THE JAVA CLASS LIBRARY

72

The methods for comparing strings (and sections of them) are shown in table 5.3. We start with the two simplest methods, equals ( ) , which tests for straight equality, and equalsignoreCase (),which collates upper and lower case characters during the comparison of the current String object with the argument, which needs to be a String object as well. The equals () method accepts any object, even from different classes (see above), but in order for it to be equal this other object will have to be a String as well. Both methods return a boolean value, true in case the two strings are equal, and false otherwise. This means you can easily use them in boolean expressions, such as String word

= new

String ( "anything" ) ;

if (word.equals ("something")) {

without comparing the return value to some pre-defined value. Unlike equals (),the method compareTo () (and the corresponding noncase-sensitive variant, compareToignoreCase ())returns an int value. This is because it evaluates the sorting order of two strings, and there are three possible options (before, after, same) which are too many to represent as boolean values. The return value is negative if the current object (on which the method is invoked) comes before the other object (the parameter) in lexicographical order, and positive if it comes after it. Both methods will return 0 if the strings are equal, i.e. if the equals () method would return true. If you look at the full API documentation of the String class, you will notice that there is another form of the compareTo ( ) method, which takes an Object as its parameter, just like the equals () method. This method exists because the String class implements an interface Comparable (which specifies just that one method, compareTo ( ) ): By implementing that interface a fixed sorting order is defined, which means that you can use strings in sorted collections (see below) without any further effort on your part. Again, the Comparable interface is fairly general and thus can only specify Object as its parameter, as it can be implemented by any class. Comparing Sections of Strings The next set of methods is used for comparing sections of strings, and we start off with the two simpler ones, s tartsWi th ( ) and endsWi th ( ) , which are very useful for language processing. They basically do what their respective method names say, and like equals () they return a boolean value. Let's look at an example of how they are used: II word is a String defined-elsewhere if (word. startsWith( "un")) { System.out.println( "negative"); }

i f (word. endsWith( "ly")) { System.println( "adverb");

STRING HANDLING IN JAVA

73

i f (word. endsWith ( "ing")) { System.println("ING-form");

This is a slightly oversimplified version of a morphological analyser, which assumes that all words beginning with un- are negatives, and that words ending in -ly are all adverbs and so on. For a real analyser there would clearly have to be more sophisticated checks on other conditions before such statements can be made, but you get the picture of how these methods are used. The s tartsWi th ( ) method has another variant, where you can specify an offset. This means that you don't start at the first character of the String when trying to match the prefix, but that you can skip any number of characters (for example if you have earlier established that the string has a certain other prefix). The 'plain' version of startsWi th () is equivalentto startsWi th ("prefix", 0) ; . The final comparison method, regionMatches (), is rather more complicated. It allows you to match sections of two strings, as specified by two offset variables. The first offset (thisOffset) is applied to the current String object, while the second one (otherOffset) is applied to the parameter. As both parts have to be of the same length in order to be equal, it is only necessary to specify the length of the stretch that is to be compared once. There is also an alternative form of that method where you can also specify (through a boolean parameter, slightly inconsistent compared to the other methods) whether the match should take upper and lower case differences into account. You will rarely need those two methods.

Transforming Strings As you will note by looking at table 5.4, all the methods which transform strings return a String object. This is because they cannot operate on the current object themselves, as String objects are immutable. However, in practice this does not matter much, as you can always assign the return value to the same variable, as in word= word.toLowerCase();

Here you convert the object word to lower case, and what in fact happens is that a new String object is created which contains the lower case version of word; this new object is then assigned to word, and the original mixed case version is discarded. The only drawback of this is a slight cost in performance, but that would only be noticed in time-critical applications churning through massive amounts of data. The first method, concat (),concatenates the String given as a parameter with the value of the current object and returns a new String object (provided the parameter string was not empty) which consists of the two strings combined. Concatenation is the technical term for putting two strings together, as in: String strl = "Rhein"; String str2 = "gold"; String str3 = strl.concat(str2); II str3 is "Rheingold"

74

THE JAVA CLASS LIBRARY

String String String String String String String

concat(String value); trim(); replace( char oldChar, char newChar); toLowerCase(); toLowerCase(Locale locale); toUpperCase(); toUpperCase(Locale locale);

Table 5.4: Methods for transforming Strings As a short form you can also use the concatenation operator, the plus sign, as in String str3 = strl + str2;

Note that again, due to the immutable nature of Strings, the object (strl in the previous example) itself does not get changed through the concat () method, but that a new object is created instead. Sometimes when you are dealing with data read in from external sources it is possible for strings to contain spaces and other control characters at either end. The trim () method can be used to get rid of them: it removes all spaces and nonprintable characters from either end of the String. This could result in an empty string, if the object does not contain any printable characters. The replace () method replaces all occurrences of one character, oldChar, by another character, newChar. This method is not too useful, as the replacement can only be a single character. It would have been more useful to have a general replacement routine, which can replace substrings with each other. The final four methods we will be looking at in this section are used to convert strings into all lower case or all upper case characters. This is quite an important preprocessing step for most language processing, as dictionaries usually contain only one form of a word. Suppose a tagger's dictionary contained information about the word 'bird', but your input text for some reason is all in uppercase, so you come across the word as 'BIRD'. Looking this up in the dictionary would probably fail. It is therefore always best to convert words into a single normalised form before processing, especially when you need to look them up in a dictionary: String word= "Bird"; String lower= word.toLowerCase(); String upper = word. toUpperCase ();

II lower is "bird" II upper is "BIRD"

Now, why are there two variants of each method? The reason behind this is that upper and lower case forms of characters can be language specific. A Locale object defines a country and language combination which can be used to make certain decisions. There are only a few cases where this actually makes a difference: in the Turkish locale, a distinction is made between the letter i with and without a dot, and thus there is an upper case I with a dot above. For the conversion into upper case there also is the German sharps ('JS'), which is rendered as 'SS' in upper case.

STRING HANDLING IN JAVA

75

Finding sequences in Strings One frequent job you will do in corpus analysis is matching words or parts of words. In this section we will look at the methods that the String class provides for finding sequences within a string. With these methods you can either look for a single character or a short String. You can search either forwards (indexOf ())or backward (lastindexOf ()).These methods return a number, which is either the position of the character you were looking for in the string (starting a position 0 if it was the first character), or -1 if it could not be found. You can also start at either end of the String object (by using the corresponding method with just one parameter) or at a position further on (with the two-parameter versions, which require you to specify the position within the string that you want to start the search from).

int int int int int int int int

indexOf(int chr); indexOf(int chr, int from); indexOf(String str); indexOf( String str, int from); lastlndexOf(int chr); lastlndexOf(int chr, int from); lastlndexOf(String str); lastlndexOf(String str, int from);

Table 5.5: Methods for finding sequences in Strings The full list of methods is given in table 5.5, and as you will notice they all return integer values. The meaning of the return value is always the same, either the position where the item you searched for was found, or -1 if it didn't occur in the String. Let's look at a few examples, where we first declare a string and then search for elements in it: String sample= "To be or not to be, that is the question."; II 0 1 2 3 4 II 01234567890123456789012345678901234567890 int x = sample. indexOf ("to"); II x will be 13 x = sample.indexOf('T'); II x will be 0 x = sample. indexOf ("be", 4); II x will be 16 x = sample .lastindexOf ("ion") ; II x will be 37 x = sample.lastindexOf('t' ,25); II x will be 23 x = sample.indexOf('t' ,37); II x will be -1

In the comment lines below the sample declaration you will find numbers giving you the corresponding positions in the string, going from 0 up to 40. This is to make it easier to work out what the return values mean in the examples. We declare an int variable which we will use to assign the return values to. We only need to declare it once, and then we can simply reuse it without having to declare it again. In the initial line we are searching for the string "to". If you look at the sample string, you can see that it first occurs at position 13 (the match is casesensitive), and thus the variable x will be assigned that value. Note the difference between Strings (in double quotes) and characters (in single quotes).

76

THE JAVA CLASS LIBRARY

If you want to find multiple occurrences, you can simply loop over the string, as in this code sample: int location = -1; do { location= sample.indexOf('o' ,location+1); if(location >= 0) { System.out .println( "Match at position "+location); } else { System.out.println("No (further) matches found"); while(location >= 0);

This piece of code will print out all index positions of the String object sample from our previous example at which the letter 'o' occurs. Note how we keep track of the last position to remember where to start looking for further matches. If we didn't add one to the value of location when calling the indexOf () method we would simply remain forever at the position where we'd first found a match. Getting Substrings Once you have found something in a string, you might want to extract a substring that is preceding or following it, or you might simply want to compare a substring with a list of other strings, without searching the full string for each string in the list. There are three methods which you can use for this, and they are listed in table 5.6.

char String String

charAt(int index); substring(int from); substring(int from, int to);

Table 5.6: Methods for getting substrings

The charAt () method gives you access to a single character position within a String. You could use it to loop through all letters in a string, for example: String str ="And now for something completely different ... "; for(int i = 0; i < str.length(); i++) { char c = str.charAt(i); System.out.println("Character at position +i+ +c); 11

11

:

11

Here we first initialise a string str, and then we use a for-loop to walk through all letters. As i is an index position, it has to be smaller than the length, so we use that fact as our loop condition. The output of this example would look like: Character Character Character Character Character

at at at at at

position position position position position

0: 1: 2: 3: 4:

Character at position 45:

A n d n .

77

STRING HANDLING IN JAVA

The substring () method has two variants: in both you specify at which position your substring should start, and in the second one only you specify where it should end. In the first variant the substring is taken to last until the end of the string. With both variant, the method returns the substring resulting from the character positions you've specified. Suppose you want to remove a possible prefix un from a string. You could do that with the following code snippet: if(word.startsWith("un")) { II starts with the prefix "un" word = word.substring(2);

Here we first check that our word does actually start with the prefix before taking it away. The parameter to the substring () method is 2, which is the third character. We thus ignore the two characters at positions 0 and 1, which make up out prefix, and only keep the string from position 2 onwards.

5.3.4

Changing Strings: The StringBuffer

Several times so far we have heard that String objects cannot be changed, i.e. that their content always stays as it was when initially created. Although you can nevertheless work with strings by creating a new String object everytime you want to change it, this is rather inefficient. Looking at the execution time of several Java actions one can see that creating objects is fairly expensive in computing terms, so by creating a lot of objects you could slow down your program. As any corpus processing is likely to involve lots of string processing, one thing you wouldn't want to do is to be inefficient in such a key area. The solution to this dilemma is a related class, the StringBuffer. A StringBuffer is an auxiliary class which allows you to manipulate strings. When you get to a stage where you require a String object, you can easily tum the StringBuffer into one. We have already come across this in section 5.3.3. StringBuffers are suitable for intermediate work, like assembling a String from several components. In fact, it is implicitly used whenever strings are concatenated in Java. The most frequently used methods of the StringBuffer API are given in table 5.7. Again, note the missing return values for the constructors, which implicitly return a StringBuffer object. If you know in advance how long your string is going to be, you can specify the

length in the constructor. Also, you can create a StringBuffer directly from a String. If you wanted to reverse a string, you could write: String str = "Corpus Linguistics StringBuffer sb = new StringBuffer(str); sb.reverse(); str = sb.toString(); II str now is "scitsiugniL suproC" 11 ;

The setCharAt () method allows you to change individual characters, whereas with setLength () you can either reserve more space (if the new length

78

THE JAVA CLASS LIBRARY

StringBuffer StringBuffer char int StringBuffer void void String

StringBuffer(); StringBuffer(int length); StringBuffer(String str); append(char c); append(String str); charAt(int index); length(); reverse(); setCharAt(int index, char c); setLength(int length); toString();

Table 5.7: Frequently used methods from the StringBuffer API

is larger than the current length), or cut off the end of the StringBuf fer (if the new length is less than the current length). Reserving more space is not really necessary, as the StringBuffer will automatically grow as you append more text to it. You would only want to do that if you were about to append a lot of smaller pieces of text, as it is more efficient if space only has to be allocated once.

5.4 OTHER USEFUL CLASSES In this section we will have a look at those classes of the standard Java class library which you will be likely to use most often. It has to be rather brief, but remember that you can always have a closer look in more detail by studying the on-line API documentation. Once you have an idea what a class is good for and how you can use it, the API documentation enables you to make the best use of it. The classes in this section have to do with data structures. A data structure is a container that you can use to store objects in. One important thing is that only objects can be stored: to store primitive values in these you have to use wrapper classes. So, in order to store a set of int values (e.g. word frequencies), you would have to create Integer objects from them, which you can then store. The only exception to this is an array, which we will come to below.

5.4.1

Container Classes

Suppose you wanted to collect the word frequencies of a text: as you read each word you will have to look up how often it has occurred so far, and increment its frequency count by one, adding it to the list if it is a new word. You simply cannot do that with just string and integer variables alone, as you do not know how many words you will need to store, and it would be impractical to compare one variable to a large set of other variables, as you would have to spell all that out explicitly in the program code of what is likely to be a huge program. In fact, everytime you manipulate data you do need somewhere to store the results of intermediate processing, like the current number of occurrences of a word, or probability scores for a set of word class tags

OTHER USEFUL CLASSES

"

79

associated with a word. This is where container classes come in. A container class is an object that contains a number of other objects. You access these objects either by an index number, with a key, or in a sequential order through so-called iterators. You can think of a dictionary as a container class, where you access the entries via a key, namely the headword. As it happens, most entries will be containers themselves, with separate definitions for each sense. Here you access the definition by the sense number. So, in order to look up sense 3 of the word take you first retrieve from the dictionary the list of definitions associated with take, and then you take the third definition from that list. There are different ways of organising data objects, each with their own advantages and disadvantages. Each way has a set of parameters which determine its suitability for the task at hand, and choosing the right option can make a program easy and straightforward, fast, and space efficient. The wrong choice, on the other hand, can make it awkward, slow and space-wasting. Thus it is important to know what the available options and their properties are. From version 1.2 onwards, Java has a very systematic framework of data structures, the so-called collections framework. In previous versions of Java there were only a few loosely related classes available. However, as these older classes are still widely used (partly for reasons of backwards compatibility) it is important to have a look at them. In practice you might prefer to use the classes from the collections framework instead, if you are using Java 1.2.

5.4.2 Array An array is a data structure built into the Java language. Unlike all the other container classes you can store primitive types (int, char and so on) in an array. You declare a variable as an array by adding square brackets to it, for example int frequencies[];

This line only declares frequencies as an array in which you can store int values, it doesn't specify how many of them there could be. In order to use this array, you then need to create it using a new statement, just like other objects: frequencies= new int[5];

Here you reserve space for five values. Arrays must have a defined number of cells, and you cannot change that number unless you use another new statement to reallocate the array. Unlike the StringBuffer's method setLength () however, you will then lose the contents of the previous array. You can access the individual cells in an array by their index positions; note that counting starts at zero, so here you have indices 0 to 4. Arrays have an associated variable, length, which tells you how many elements there are in it. Unfortunately this is very confusing, as it is a public variable, not a method, so while you use length () to find out the length of a String, you use length (without brackets) for the size of an array. The compiler will tell you when you mix these up, but nevertheless it is quite a nuisance to remember this (and to be told so by the compiler if you got it wrong).

THE JAVA CLASS LIBRARY

80

Here is an example to illustrate how to use arrays: first we assign values to some of the cells we allocated above, and then we loop through the array, printing all the values: frequencies[O] frequencies[l] frequencies[2] frequencies[3] frequencies[4]

= = = = =

3776; 42; 6206; 65000; 386;

for(int i = 0; i < frequencies.length; i++) System.out.println(frequencies[i]);

We can access the cells of an array either by number, or through a numerical variable, as it is done in the for-loop. The loop starts at the first element and continues until we have reached the end of the array. Setting the frequencies 'by hand' is rather tedious, but you can of course also use a loop to do that, e.g. when reading data in from a file.

5.4.3

Vector

boolean void boolean Object int boolean Object boolean void int

Vector(); Vector(int initialCapacity); add(Object o); clear(); contains(Object o); elementAt(int index); indexOf(Object o); isEmpty(); remove(int index); remove( Object o); removeElementAt(int index); size();

Table 5.8: Frequently used methods from the Vector API

A Vector is like an expandable array, which means you can add and remove elements without having to worry about keeping track of its size. The most important methods of the Vector class are shown in table 5.8. The elements of a Vector are stored sequentially, which means that you can loop through them by index values. In order to print out all elements of a Vector you could for example write Vector myVector =new Vector(); for(int i = 0; i < myVector.size(); i++) System.out.println("Element "+i+" is "+myVector.elementAt(i));

OTHER USEFUL CLASSES

81

A Vector is very useful if you need to store a number of objects and you don't know in advance how many of them you are going to have. If you know that, you might want to use an array instead. An array has the advantage that it is type safe, i.e. you can only have elements of the same data type in a single array, and you can also use it to store primitive data types. A Vector just stores elements as Objects, so you have to cast them to the appropriate class when you retrieve them: String myString = (String)myVector.elementAt(5);

Here we retrieve the sixth element of the Vector. We know it has to be a String, as we only insert String objects into this Vector, but the JVM doesn't know this (and we can, of course, store any object in this vector if we want to). The elementAt () method only returns objects of the type Object, and in order to tum it into a String we have to cast it to String. This will work as long as the objects actual type is String, otherwise we will get an error. This means you have to take care that you either don't mix different data types in a Vector, or you need to keep track of what objects they are yourself. The same applies to all the other container classes as well. In order to allow you to store any class in it you want, they have to go for the lowest common denominator, which in this case is the 'ultimate' super-class, Object. In practice this is not too much of a problem, as you wouldn't usually mix different data types in a single container anyway.

5.4.4 Hashtable Quite often you want to store values associated with a certain key, for example word frequencies. Here you have a key (the word) and an associated value (the frequency). Storing these in a Vector would be awkward, as you would lose the association between the two. A Hash table is well suited for this, as it allows you to store a key/value pair, and it is very fast as well. It basically works by computing a numerical value from each key, and then using that value to select a location for the associated value. So instead of looking through the whole table to locate an item, you can just compute its position in the table and fetch it from there directly. The most important methods of the Hashtable class are given in table 5.9.

To insert a set of word frequencies into a Hash table we could write something like this: String word; int frequency; Hashtable freqTable =new Hashtable(); II read word and frequency from external source word= readNextWord(); frequency= readNextFrequency(); while(word != null) { freqTable.put(word, new Integer(frequency)); word= readNextWord(); frequency= readNextFrequency();

82

THE JAVA CLASS LIBRARY

void Object• Object Object boolean int Enumeration Enumeration

Hashtable(); Hashtable(int initialCapacity); clear(); put( Object key, Object value); get( Object key); remove( Object key); isEmpty(); size(); keys(); elements();

Table 5.9: Frequently used methods from the Hashtable API

Here we have to make a few assumption regarding our data source: we can read words from some method called readNextWord ( ) , which returns the special value null if no more words are available. Frequencies are read in from a similar method, readNextFrequency (),which returns as an int value the frequency of the word read last. As we can only store objects in a Hash table, we have to create an Integer object with the new statement when inserting the frequency value. We can do this directly in the argument list of the put ( ) method, as we don't require direct access to the new object at this stage. Unlike a Vector you cannot directly access an arbitrary value stored in the Hashtable by a numerical index value, but instead you have to use the key to retrieve it, as in:

String word = ••aardvark"; Integer freq; freq = (Integer)freqTable.get(word); if(freq ==null) { freq =new Integer(O); System.out.println("The frequency of "+word+" is "+freq);

There are two points to note here: First, you have to cast the object you retrieve from the table to the right class, Integer in this case. Second, if the key cannot be found in the Hash table, the get () method will return null. We therefore need to test the return value and act accordingly. We could either print an error message, stating that the word is not in the frequency list, or, in this case, assign to it the value zero, which makes perfect sense in the context of a frequency list. If you don't have a full list of all keys which are stored in the table, but want to print out the full set, you can get it from the Hashtable using the keys () method. This allows you to get access to all the keys, which you can then use to read out their associated values from the Hash table. We will look at that when we get to the Enumeration below.

OTHER USEFUL CLASSES

void String String String Object Enumeration void void void

83

Properties(); Properties(Properties defaultProperties ); clear(); getProperty(String key); getProperty(String key, String default); setProperty(String key, String value); remove( Object key); propertyNames(); list(PrintStream out); load(InputStream in); store(OutputStream out, String headerLine);

Table 5.10: Frequently used methods from the Properties API

5.4.5

Properties

The Properties class (see table 5.10) is an extension of the Hash table, which is more specialised as to the keys and values you can store. In a Hash table there are no restrictions, so you can use any kind of object you want as either keys or values. In a Properties object you can only use Strings, but there are some mechanisms which allow you to provide default values in case a key wasn't found in the Properties object: you can either specify another Properties object which contains default values, so that you can override certain properties and keep the values of others, or you can directly supply a default when trying to retrieve the attribute. This is illustrated in the following code snippet: Properties wordClassTable =new Properties(basicLexicon); wordClassTable. set Property ("koala", "noun") ; String wclassl = wordClassTable.getProperty("koala"); II wclassl is now "noun" String wclass2 = wordClassTable.getProperty("aardvark","unknown"); II wclass2 is now "unknown"

Here we create a Properties object called wordClassTable, in which we want to store words with their associated word classes. We provide a default called basicLexicon, another Properties object which would have to be defined elsewhere. We then add to it the key-value pair 'koala' and 'noun', and after that we try to retrieve the value for 'koala'. As we have just entered it, we will get the same result back, but if we hadn't done that the system would look up 'koala' in the basicLexicon, and return the value stored there if there was one. If the attribute was in neither wordClassTable nor basicLexicon then the value null would be returned. We then also look up 'aardvark', and this time we provide a default value directly in the method call. If it cannot be found, instead of returning null it returns 'unknown', the default we supplied.

84

THE JAVA CLASS LIBRARY

5.4.6 Stack A Stack is a fairly simple data structure. You can put an object on top of the stack, and you can take the topmost element off it, but you cannot access any other of its elements (see table 5.11 ). These actions are called 'push' and 'pop' respectively. Despite this limited way of accessing elements on it, a stack is a useful data structure for certain types of processing. In chapter 8 we will see how we can use a stack to keep track of matching pairs of mark-up tags.

Object Object boolean Object

Stack(); push(); pop(); empty(); peek();

Table 5.11: Frequently used methods from the Stack API

The empty ( ) method allows you to check whether there is anything on the Stack; if you try to pop () an element off an empty Stack, you get an EmptyStackException. The peek () method is a shortcut for: Object obj = myStack.pop(); myStack.push(obj);

It retrieves the topmost element of the Stack without actually removing it off the Stack. Here is a brief example of how a Stack can be used: first we push three String objects on a Stack, and then we pop them off again: Stack theStack =new Stack(); theStack.push( "Stilton"); theStack.push( "Cheddar"); theStack.push("Shropshire Blue"); String cheese = (String)theStack.pop(); II cheese is "Shropshire Blue" cheese = (String)theStack.pop(); II cheese is "Cheddar" cheese= (String)theStack.pop(); II cheese is "Stilton"

As usual, we have to cast the return value of the pop ( ) method to the correct data type. This example illustrates that a stack is a so-called LIFO (last in, first out) structure: the last item we put on the stack ('Shropshire Blue') is the first one that we get when we take an element off it. As we will see in chapter 8, this is ideally suited for processing nested and embedded structures.

5.4.7 Enumeration An Enumeration is somewhat the odd one out in this section. It is not a data structure in itself, but rather an auxilliary type to provide access to the container classes we have just discussed. And, it is not even a class, but just an interface. The full API of the Enumeration is shown in table 5.12.

85

OTHER USEFUL CLASSES

boolean Object

hasMoreElements(); nextElement();

Table 5.12: The Enumeration API With a Vector you can easily access all elements through an index. However, with a Hash table this is quite a different matter, as there is no defined order, and elements are not associated with ascending index values, but with keys, which might not have any natural sequencing. The way to get access to all elements is through enumerating them: You can do that either on keys, or on the elements themselves, and for this purpose the Hash table class has the methods keys () and elements () (see table 5.9 ). The reason that the Enumeration is not actually a class, but an interface is that with an Enumeration you are only interested in the functionality, not the implementation. An Enumeration provides exhaustive access to all of its elements in no particular order, but the way this is implemented might vary depending on what kind of container class you want to go through. Let's see how we could use an Enumeration to print out our word/frequency list which we created in section 5.4.4 (unsorted): Enumeration words = freqTable.keys(); while(words.hasMoreElements()) { String word= (String)words.nextElement(); Integer freq = (Integer)freqTable.get(word); System.out.println(word+": "+freq);

This is the typical way you would use an Enumeration in practice. The two key methods are hasMoreElements () and nextElement ().The former returns true if there are more elements available, and false if there aren't. Here we are using it to control the loop which iterates through all the words, which we have used as keys in the frequency table. As soon as we have reached the end of the Enumeration the loop is terminated. The nextElement () method returns the next object. As the Enumeration is necessarily general, it again returns an object of type Object, which we have to cast to the desired type (which we have to know in advance). In our example we know it's a String, and we can use that to retrieve the associated value from the frequency table. To fill the concept of an Enumeration with a bit more life, here is an example of a possible implementation. This is just a wrapper around an array, which we access through an Enumeration. We call this class ArrayEnumeration, and you can use it to get an Enumeration from an array of objects. We are backing the implementation with two (private) variables, content, which points to the array that we are enumerating the elements of, and counter, which keeps track of what position in the array we are at. Both variable get initialised in the constructor. As the class implements the Enumeration interface (as specified in the class declaration line), we need to provide the two methods which make up the

THE JAVA CLASS LIBRARY

86

Enumeration API. Here we have included documentation comments (see table 4.1 on page 57 for a list) which describe the constructor's parameter, and the return values of the two Enumeration methods. /*

* ArrayEnumeration.java

*I public class ArrayEnumeration implements Enumeration { private Object content[]; private int counter;

/*'1
>'. private XMLElernent readXMLinstr() throws IOException XMLElernent retval = null; String name= readNarne(); skipSpace(); StringBuffer content= new StringBuffer(); while ( ! lookahead ( "?>" ) ) { int c = in.read(); if(c == -1) { throw new XMLParseError("Prernature end of file in XML instruction"); else { content.append((char)c);

retval =new XMLinstruction(narne,content.toString()); return(retval);

And indeed the method is very similar to readComment () which we discussed earlier. As always we want to be able to try out this class, and so we add a main ( ) method to run the XMLTokeniser on some input data. We simply process the data and print out whatever elements we encounter in the input. This makes use of the toString () methods which we overwrote in each of the XMLElement subclasses. This implementation of the main ( ) method as it stands is not meant for general use, so it does not contain that vital check for the presence of command-line parameters which a proper program should have: public static void rnain(String args[]) throws IOException { XMLTokeniser parser int i = 0;

new XMLTokeniser(new FileReader(args[O]));

172

DEALING WITH ANNOTATIONS XMLElement xrnl; do { xrnl = parser.readElement(); System.out.println(i+": "+xrnl); i++; while(xml != null);

II end of class XMLTokeniser

With more than 300 lines of code, the XMLTokeniser is a substantial class. However, by splitting it up into many small methods, most of which fit easily on one screen, the complexity can be reduced considerably. Tokenisers and parsers of formally defined languages are always rather messy, as they have to deal with a lot of variations in the input, and they have to show a certain behaviour when encountering deviations from the input's specification. The XMLTokeniser does not enforce many of the restrictions that XML puts on the shape and form of the input data, so you can try it out on any SGML input file as well, provided you first take out any SGML declarations which we are not handling in the tokeniser. You will find that it works just as well, with the major problem being the restriction that attribute values have to be enclosed in quotes. The next class we will look at, XMLFormCheck, checks if a file contains wellformed XML. It does enforce the restrictions on matching opening and closing tags, but as it does not process a DTD, it cannot tell whether the input data is actually valid XML. However, it is a much smaller class compared to the tokeniser.

8.3.3

An XML Checker

The easiest way to keep track of matching tags is to put them on a stack when an opening tag is encountered, and when coming across a closing tag the topmost tag is taken off the stack and compared to it. Once the end of the input has been reached the stack should be empty, otherwise there were some closing tags missing. A stack is a standard data structure, and we have seen in chapter 5 that there is a stack implementation in the Java class library. I*

* XMLFormCheck.java *I package xml; import import import import

java.util.Stack; java.io.Reader; java.io.FileReader; java.io.IOException;

public class XMLFormCheck { private Stack tagstack; private XMLTokeniser source;

After importing the necessary classes, we declare a Stack variable to keep track of the tags, and an XMLTokeniser to process the input data.

WORKING WITH XML

173

In the constructor we initialise these variables: public XMLFormCheck(Reader input) { tagstack =new Stack(); · source = new XMLTokeniser(input);

Unlike some other classes, we provide a separate method to do the work, so the constructor is left with only preparing the variables. As all the work of processing the XML input is already done in the XMLTokeniser, the check () method which tests the input for well-formedness can be kept quite simple. This shows the benefits of modularisation, as we can now handle basic XML data without a lot of programming overhead. public boolean check() throws IOException XMLElement xml = source.readElement(); boolean wellFormed = true; while(wellFormed && xml != null) { if(xml.isOpeningTag()) { tagstack.push(((XMLTag)xml) .getName()); else if(xml.isClosingTag()) { if(tagstack.isEmpty()) { System.out.println("Line: "+source.currentLine()); System.out.println(" - found spare tag "+ ( (XMLTag)xml) .getName()); wellFormed = false; else { String expected= (String)tagstack.pop(); if(!expected.equals(((XMLTag)xml) .getName())) System.out.println( "Line: "+source.currentLine()); System.out .println (" - found "+ ( (XMLTag)xml) .getName ()); System.out.println("- expected "+expected); wellFormed = false;

xml

source.readElement();

}

while(!tagstack.isEmpty()) { wellFormed = false; System.out.println("leftover tag: "+tagstack.pop()); return(wellFormed);

The check () method returns true if the input is well-formed, and false otherwise. We simply loop through all elements returned by the tokeniser, pushing opening tags onto the stack, and comparing the closing tags with the top element taken off it. As the XMLTokeniser contains a facility to provide the current line of the input, we can make use of that in case we find a mismatch. If the tag stack is not empty at the end of processing, we print out all the tags that are still left on there.

174

DEALING WITH ANNOTATIONS

public static void main(String args[]) throws IOException { XMLFormCheck tester= new XMLFormCheck(new FileReader(args[O])); boolean result = tester.check(); System.out.print("The document "+args[O]+" is "); if(result == false) { System.out.print("not "); System. out .println ("well-formed XML");

II end of class XMLFormCheck

In the main () method we create an instance of the XMLFormCheck class. We then execute the check ( ) method and store the result in a variable, which we use to generate the right output, depending on whether the XML is well-formed or not. When you run this class with java xml.XMLFormCheck myfile.xml you could either get the reassuring The document myfile.xml is well-formed XML or, in case of an error, Line: 1031 - found spare tag test The document myfile.xml is not well-formed XML and leftover tag: test The document myfile.xml is not well-formed XML In this class we have not included a lot of error handling, which is alright as long as you're dealing only with research prototypes that you are using yourself. As soon as you develop software that is to be used by other people you should put in some safeguards for potential errors. This includes the XMLParseErrors which might be thrown by the tokeniser. It is no problem for you to interpret what's gone wrong, but users will get completely confused when they get a scary error message when there was some problem with the input file.

8.4

SUMMARY

In this chapter we have seen how XML mark -up looks like, and how you can process files which are marked up with it. There is an important difference between valid and well-formed, and you can process XML documents using either a DOM parser, which reads the whole document at once, or an event-based parser, which reads it in parts and uses callbacks to communicate with a higher-level application. The most widely used event-based API for XML is SAX.

SUMMARY

175

After the theoretical background, we have implemented a simple tokeniser which splits XML data into distinct elements. This tokeniser does not recognise all possible forms of tags, but should go a long way when processing corpus data, which doesn't necessarily use a lot of fancy mark-up. And finally we have seen how easy it is to create applications based on lowerlevel components like the tokeniser. Once the main job of splitting the input correctly is done, checking for well-formedness is very easy and straightforward, and does not require a lot of programming effort.

9

Stemming In this section we take a look at implementing a stemmer as described in Oakes (1998). Starting from a brief description we will see what it takes to tum a table of production rules into a working program.

9.1

INTRODUCTION

Table 3.10 of Oakes (1998) lists a set of rules used by Paice (1977) for implementing a stemmer. A stemmer is a program that reduces word forms to a canonical form, almost like a lemmatiser. The main difference is that a lemmatiser will only take inflectional endings off a word, effectively resulting in a verb's infinitive or the singular form of a noun. A stemmer on the other hand tries to remove derivational suffixes as well, and the resulting string of characters might not always be a real word. Stemmers are mainly used in information retrieval, where able and ability should be identical for indexing purposes, whereas most linguists would not consider them part of the same inflectional paradigm. There is a further advantage of stemmers: a lemmatiser's requirement to produce proper words comes with the need for a way to recognise them, but a stemmer can do without that. This means that a lemmatiser usually consists of a set of reduction rules with a dictionary to check for the correctness of the result, while for a stemmer a set of rules is sufficient. This means they are in general not only smaller in size, but also much faster. The most widespread stemmers in use are based on an algorithm by Porter (1980), which is quite sophisticated and therefore not as easy to implement as the one of Paice (1977). However, the basic principles are much the same, the major difference being that Porter's algorithm applys a number of tests on the input word before executing a rule, whereas Paice's applys a rule as soon as a matching suffix has been found. Looking at the list of rules there is one major point to notice: as the first matching suffix fires a replacement rule, the order in which the rules are processed is quite important. Consider the case of -ied and -ed. Both rules match the word testified, but the first one accounts for more characters, reducing it correctly to testify, while the second is an 'incomplete' match when looking at the full picture, as it leads to the undesirable testi f i. It is therefore important that the longer rule is applied first; this mode of matching patterns is called longest-matching. The stemming program that we are about to write now will take a word, apply all possible rules to it, and will output a canonical representation of the input word.

180

STEMMING

This might serve as a first step for creating a grouped word list, as it removes inflectional endings, as well as derivational suffixes. As the program does not contain any language specific components apart from the rules it can easily be adopted to work in other languages as well, provided they use suffixes only to form derivations.

9.2

PROGRAM DESIGN

Summarising the algorithm is quite simple: for each word we are dealing with we traverse the list of rules, testing each rule whether it applies to the input word. Once a matching rule has been found, we 'execute' it, which means we take the suffix off the word and append the replacement if there is one. Afterwards we interpret the transfer part: we either continue at another rule or stop the whole process. As an example we will have a look at the first rule: ably

IS

This rule deals with the suffix 'ably' in words like 'reasonably'. Here it would remove the suffix, as the replacement part is empty (as indicated by the dash). This results in 'reason'. Then we jump to the rule whose label is 'IS' and continue processing there. For our example word this would be the end, but e.g. 'advisably' would have been reduced to first 'advis' and then 'adv'. If the word would not end in 'ably', the rule would be skipped and processing would continue with the second rule. When designing the program, we first need to think what kinds of objects we are dealing with. The most basic object, which we will be using as a starting point, is a rule. A rule consists of four elements, the first three of which are optional: label, suffix, replacement and transfer. The label is used for skipping rules during the processing of the list of rules, the suffix determines whether a rule fires, and the replacement specifies what a matching suffix is going to be replaced with. All these can be empty; only the transfer part, which describes what action is being taken after a rule has been fired, is compulsory. Then we need a means to initialise the rules somehow. In order to allow different sets of rules to be used, and to make it easier to modify existing rules, we will be reading the rules from a text file. This could be done from within the same class as the rule definition, but in this example the rule loader is kept in a separate class. On the one hand this introduces two classes with a rather tight coupling, as the rule loader needs to interact with the rule objects it is creating from the data file, and this interdependency is generally not a good thing. On the other hand, rules could be loaded from different sources, like files, networked file servers, direct user input, or even automatically generated from another program. This would be an argument in favour of keeping the initialisation separate from the rule itself. Furthermore, the specification of a rule is unlikely to change, and any major change of its interface would need to be reflected in all other classes anyway. With the present design we keep the classes small and uncluttered and organised by functionality. The final class we then need is the main processor, which is often called the engine. Our stemming engine does all the actual work, initialising the rules at startup, then waiting for input words to come in which will then be subjected to the rules. As we have delegated the initialisation to the rule loader, and all the low level

IMPLEMENTATION

181

processing to the rule class, we just need to deal with the input/output aspects and the co-ordination of the rule matching. We will need to store the rules somehow in memory for the processing, and here we have a straight mapping from what we need and an existing data structure: the List interface provides a sequentially ordered sequence of elements, just what we need for our purposes. The RuleLoader class returns an instance of a list, and we don't even have to worry about which actual implementation it is. We can use either a Vector or a LinkedList, and if we only use the methods provided by the List interface, we can even change the underlying implementation in the RuleLoader class without it affecting the main class. This is generally a good thing to do, as it hides the implementation details behind the interface specification. This higher level of abstraction makes it easier to handle the complexity of larger programs; in this small example it would not make such a big difference.

9.3 IMPLEMENTATION In this section we will go through the implementation of the stemming algorithm and discuss each class in detail. Before we do this, we will briefly look at the whole picture, namely how these classes interact. This interaction turns the three individual classes into a stemming program. The stemming algorithm is implemented using three classes, Stemmer, RuleLoader and Rule. The Stemmer class is the main processing class, which coordinates the work and makes use of the other two classes in the process. It can either be used as a module in a larger application, or in stand-alone mode. If run by itself it takes a list of words from the command-line and prints out their stems. This mode is mainly useful for testing purposes; for use within other applications it provides a method to stem a single word.

9.3.1

The Stemmer Class

As with other programs we will now walk through the source code for the Stemmer class. /*

* Sternmer.java *I

package sternmer; import java.util.List; import java.util.Listiterator; import java.io.IOException;

As we have several different classes making up one single application, we put them together in a package. This package is simply called stemmer (see the discussion on xml above), but the name is really arbitrary as long as you don't want to distribute your classes. We need three classes from the standard library, which we import explicitly. In this case listing them all (instead of using the asterisked 'import all' versions) gives us an immediate picture of what external classes we are depending upon.

182

STEMMING

I*" * This class implements a stemmer for English as described in Oakes (1998) * and Paice (1977). It loads a set of rules from a file and then applies * them to individual words. * @author Oliver Mason * @version 1.0 *I public class Stemmer private List ruleset = null; boolean TRACE = true;

We have two variables, one to store our set of rules in, and one to control the output of diagnostic messages. In order to make it easier to follow the progress of the stemming process through the list of rules, a number of print statements have been added to the code, indicating what rule is currently processed, and whether the matching has been successful or not. Obviously we would not want it to print these messages once the program has been tested and is working properly. We could then take those statements out again, but suppose we change something in the class later on, or we use a different set of rules and want to see if they work properly with the program. In this situation it would be good to have the extra output available again, and with the way the stemmer is implemented here all you need to do is change one single line: the value of the TRACE variable effectively switches the print statements on or off, which makes it very easy to deliberately sprinkle the code with diagnostic messages, none of which you want to see in the final version. You basically have a 'development version', which has the variable set to true at compilation time, and a 'production version' where it is set to false. Quite often such a variable is called DEBUG instead of TRACE, but that makes no difference. The idea is the same, namely having a 'trace' or 'debugging' mode which allows you to inspect internal information of the program. You can think of these print statements like sensors measuring the temperature and oil pressure of a car engine. The car runs perfectly well without them, but you might discover faults much easier with the extra information available. I**

* Constructor.

* The set of rules is loaded from a file. * @param filename the name of the rule-file. * @throws IOException in case anything went wrong. *I public Stemmer(String filename) throws IOException { ruleset = RuleLoader.load(filenarne); if(TRACE) System.err.println("loaded "+ruleset.size()+" rules");

The constructor of the Stemmer class is quite simple; it just initialises the set of rules. We also see TRACE in action: In trace-mode it also prints the number of rules loaded, which is simply the number of elements in the list. Loading the rules from a file involves a whole host of potential problems, which could make the program

IMPLEMENTATION

183

fail. If the rule file does not exist, or the computer's disk drive is broken, or the file has been corrupted, the list of rules cannot be loaded. Since the stemmer relies on the rules to exist, this would be a pretty serious state, and we would need to pass on that information to whoever is using the stemmer. For that reason we do not try to catch the IOException that can be thrown by the RuleLoader class in case things go wrong, but just pass it on. This means that the construction of the stemmer fails, which is perfectly reasonable behaviour in this case. After all, this is an unrecoverable error. !**

* Process a word.

*

A single word is passed through the set of rules.

* @param word the input word. * @return the output of applying the stemming rules to the word. *I public String stem(String word) { Listiterator i = ruleset.listiterator(); boolean finished = false; while(i.hasNext() && !finished) { Ruler= (Rule)i.next(); if(TRACE) System.err.println(word+": "+r); if(r.matches(word)) { word= r.execute(word); if(TRACE) System.err.println("match -> ""+word); String transfer= r.getTransfer(); if("finish".equals(transfer)) { finished = true; else { if(TRACE) System.err.println(""-> "+transfer); finished= !advance(i,transfer);

return (word) ;

The stem () method does most of the work, and that is reflected partly by its length. Methods shouldn't really be much longer than that, and ideally the whole method should fit on the screen at once, which makes it a lot easier to work with. The stern ( ) method is part of the public API, which means this is the entry point that other modules will use when they want to have a word stemmed. At first we set up an iterator to walk through the rule set. This iterator allows us to keep track of which rule we are currently dealing with. Next we create a flag which we use to terminate the stemming once we have reached a rule that has 'finish' as its transfer component. The processing loop is governed by two conditions, the existence of more rules and the fact that we haven't reached a point where a rule told us to stop. As soon as either of these conditions is true, the loop is exited and we return the word's stem to the calling module. In the loop we retrieve the next rule (which we also print when in trace mode) and check if it matches. If it does, we then execute it, which means we apply the suffix replacement to the word. The return value of the rule's execute () method is the modified word, and as we don't need to keep track of the original input we just assign the new word to the same variable, thus overwriting the old value. As an aside remember that this only changes the local copy of the word variable; the original string in the calling method will not have changed. For that reason we

184

STEMMING

will have to send the stem back as a return value from the stern ( ) method. Next we get the transfer part of the rule and compare it to the String finished. The string literal can be used in exactly the same way as a normal string object, and there is a reason that we call the equals () method on the literal and not on the variable transfer: if the value of transfer is null we get a NullPointerException when we try to execute one of its methods. The literal, however, will never be null, and as it is not a problem if the parameter of equals ( ) is null we will never get into trouble here, whatever the value of transfer. If the transfer part is equal to finish we set the variable finished to true, which means that we exit the main loop at the end of this pass. Otherwise we try and locate a rule which has a label that matches the transfer. The advance () method does that, and it returns true if it could find a matching label. When it returns false it means there was no matching label and we ought to stop the processing. Therefore we assign to the finished flag the negation of the return value, as indicated by the exclamation mark (the boolean negation operator, see 2.1 ).

/** * Advance through the set of rules until a matching label has been found. * The iterator given as a parameter is moved forward to the next rule * which matches the given label. * @param iter an iterator through a set of rules. * @param label a label to match. * @return false if no matching label could be found, true otherwise. *I private boolean advance(Listiterator iter, String label) boolean found = false; while(iter.hasNext() && !found) Ruler= (Rule)iter.next(); if(r.matchesLabel(label)) ( //match found iter.previous(); found = true;

return (found) ;

The advance () method moves the iterator forward until it either hits the end of the list of rules or finds one with a matching label. By default we assume that no rule has been found, and we initalise a flag with false. Just like the main loop of the stem () method we proceed while there are more rules to look through and while we haven't found a match. We get the next rule and test if its label matches the one we are looking for. If the match was successful, we need to go back one step, as the iterator will already point to the next rule. By setting it to the previous rule, which in fact is the one we just retrieved, we make sure that the matching rule will be the one selected next time the iterator is accessed. Effectively we are moving the iterator to the rule before the one with the matching label. We then also set the flag to true so that the loop is exited afterwards. If there was no matching rule, the loop will exit because of the first condition, and found will still have the value false.

IMPLEMENTATION

185

I** * * * * *

main method for testing purposes. The first command-line parameter is the rule file, and all subsequent parameters are interpreted as words to be stemmed. @param args command-line parameters. @throws IOException if the rule file could not be loaded.

*I public static void main(String args[]) throws IOException { Stemmer s =new Stemmer(args[O]); for(int i = 1; i < args.length; i++) { System.out.println(args[i]+": "+s.stem(args[i]));

II end of class Stemmer

The final method of the Stemmer class is main () which is called whenever a user runs the Java interpreter directly on it. Other classes can of course also call it. The parameter of main ( ) is an array of strings from the command-line, and we take the first one to be the filename of the rule file which we use to construct an instance of the stemmer itself. We then loop through all the remaining command-line arguments, taking every further argument as a word to be stemmed. In this loop we print the original word and its stem until all parameters have been processed. It is worth noting that the main () method as it stands will cause an exception if the command-line is empty. It is assumed that there is at least one argument, the rule file name, and no check is made whether this argument exists. If the main ( ) method was the main entry point for users of the stemmer this would not be very good, as the user might easily try it out without knowing about the required commandline structure and will promptly see an error message, thus not getting a very good impression of the program. Since the main ( ) method here is only meant to be used for testing purposes the error check has been left out, but it is always a good idea to test any input for correctness, and a user will find it much more friendly if a usage message is printed on the screen instead of getting a cryptic error message.

9.3.2 The RuleLoader Class Next we will take a look at the RuleLoader class. I* * RuleLoader.java

*I package stemmer; import import import import import import import

ava.io.BufferedReader; ava.io.FileReader; ava.io.IOException; ava.util.StringTokenizer; ava.util.NoSuchElementException; ava.util.List; ava.util.Vector;

186

STEMMING

This class is obviously in the same package as the other classes, and here we require a few more standard class library components. Still, they are all listed explicitly for clarity. As an aside, this also speeds up compiling, as the compiler doesn't have to look through the whole package to locate the relevant class information. !** * This class loads a set of rules from a file.

*

* * * * *

*

* *

* *

* *

The file contains one rule per line, empty lines or lines beginning Each line has to contain four with a hash symbol are ignored. elements separated by white spaces:

label (can be "-'' if no label applies)

ending (can be n_n for default match)

replacement (can be "-" if there is none)

transfer (a label or "finish")

All the action takes place in the static method load() which loads the rules from the specified file.

@author Oliver Mason @version 1.0

*I class RuleLoader

As you can see in the comment above the class declaration, you can use HTML mark-up to format the comments. Here we are including a numbered list detailing the individual parts of a rule. This list will also appear in the API documentation generated with j avadoc. The RuleLoader class only contains a single method, which is also rather long: /**

* Load a list of rules from a file. * @pararn filename the name of the file containing the rules. * @return a List containing all the rules from the file.

* @throws IOExcpetion in case anything goes wrong. *I public static List load(String filename) throws IOException { List rules ~ new Vector(); BufferedReader br ~new BufferedReader(new FileReader(filename)); String line ~ br.readLine(); int lineNurnber ~ 1; while(line !~ null) {

The method load () takes a filename as parameter and returns a list of rules. The method is static, which means we will not need to create an instance of the class in order to use it. This is a fairly frequent pattern in classes which are mainly collections of useful routines not operating on a concrete object. If you have a look at the java. lang. Math class you will find that it has the same structure. One thing that might strike you is the fact that there is no class List in java. uti 1, only an interface of that name. So the load ( ) method does not actually return a class, but only an interface, or rather a class which implements the List interface. This is quite a useful aspect of Java, as you can deal with the functionality of a class without having to worry about the actual implementation. Several classes

IMPLEMENTATION

187

implement the List interface, but for anybody using the RuleLoader class all that is relevant is that the class it returns is a List. If the actual choice turns out to be too slow, or to wasteful with space, the actual ve·ctor which is used to implement the List can easily be replaced by some other List class without having any impact on other classes. This decoupling of functionality and implementation is a very good thing to do, and it is cases like this where one suddenly realises the power of object-oriented languages and Java in particular. The load ( ) method itself follows a fairly common pattern, a loop through all lines of a file. First we open the file and wrap a BufferedReader around it for increased efficiency. We read the first line, and to locate errors we also keep track of line numbers. Then we enter a loop while there are still more lines available. An attempt to read a line beyond the end of the file will return null, so we test for that in the loop's condition. Empty lines or those starting with a hash sign are ignored, so we only process a line if this does not apply.

if(line.length() > 0 && !line.startsWith("jl")) { StringTokenizer st =new StringTokenizer(line); try { String label= st.nextToken(); if(label.equals("-")) label= null; String ending= st.nextToken(); if(ending.equals("-")) ending= null; String replacement = st.nextToken{); if(replacement.equals("-")) replacement= null; String transfer = st.nextToken(); Rule rule= new Rule(label,ending,replacement,transfer); rules.add(rule); catch(NoSuchElementException e) { System.err.println("Too few elements in line jl"+lineNumber);

Each line consists of a number of strings, separated by spaces or tab characters. In order to split a line into these elements we use a S tr ingTokeni z er object, which we construct on the input line. White spaces are the default separators, so we don't need to take any further action here. If a line is incomplete, the attempt to get the next token will throw an exception, and in order to trap this we enclose the actual tokenisation of the line in a try I catch block. In this block we read four tokens, assuming that there are at least four tokens available. An empty element is indicated by a dash, so we need to convert those to nulls. Then we create a Rule object with these parameters and add it to the list of rules. If there are fewer than four tokens a NoSuchElernentException will be thrown and be caught, causing the flow of control to be handed over to the catch block. Here we print a diagnostic message to notify the user of the error, and then continue, thus effectively skipping the incomplete line. It is unlikely that this error cannot be recovered from, as it only means that one rule is missing, and execution of the program might well continue without it, as long as the user is aware that there might be problems. This is an example of a recoverable error.

STEMMING

188 line= br.readLine(); lineNumber++;

br.close(); return(rules);

II end of class RuleLoader

At the end of the loop we read the next line and increment the counter for the number of lines. As soon as the end of the file is reached the line variable will be null and the loop will terminate, as the condition is checked right after the line has been read. Following the loop we close the file and then return the list of rules that we have read in.

9.3.3 The Rule Class With the Stemmer and RuleLoader classes finished, all that remains is the Rule class. So far we have followed a top-down approach, delegating the details of any task to a method to be developed later. This way we have reduced the complexity step by step, so that the remaining work appears rather trivial. This is often a good idea when faced with a daunting task, as most problems can easily be decomposed into much simpler sub-problems. By now we have got quite a complete idea of how the Rule class has to look like. It needs to allow testing for a matching suffix and label, we need to be able to apply it, and we need to know where execution continues after the rule has been applied. Once we have reached that stage, the code almost writes itself. I* * Rule.java *I package stemmer;

I** * This class implements a suffix reduction rule. * @author Oliver Mason * @version 1.0

*I class Rule { private private private private

String String String String

label = null; ending = null; replacement = null; transfer = null;

There are four private variables, which are reflecting the separate components of a rule as given in table 3.10 of Oakes (1998). Each rule can have a label, an ending that determines whether the rule applies, a potential replacement for the suffix, and a transfer field which indicates where to continue with the traversal of the rule list. As the components can potentially be empty, we initialise them with null to start off with. This is not strictly necessary, as we assign values to each variables in the constructor, but it is generally a good idea to initialise variables in case no such assignment takes place. The variables will then always be in a defined state when they are accessed.

189

IMPLEMENTATION

!** * Return a textual representation of the rule. * This is used for debugging purposes. * @return the components of the rule as a String. *I public String toString() ( return(label+"/"+ending+"/"+replacement+"/"+transfer);

At the beginning we overwrite the toString () method. This is done to get a more useful representation of the class when printing it. Instead of just giving the address of the class in memory we can now directly print a rule and will see its components. This is useful for debugging purposes, as we can just write System.err.println("Current rule: "+currentRule);

if we assume that currentRule is an object of the Rule class that contains the current rule. The output will just contain all elements of the rule separated by forward slashes. /**

*

Constructor.

* The constructor initialises the rule components from the given * parameters.

* @param 1 the label (can be null).

* @param e the ending to match (can be null) . * @param r the replacement (can be null) . * @param t the transfer action to take. *I

public Rule(String 1, String e, String r, String t) ( label = 1; ending = e; replacement

=

r;

transfer = t;

Next we have the constructor. When a rule is created, all that needs to be done is initialising the variables containing the four components, and that is what we do here. !**

* Test for a match with a word. * A rule matches by default if the ending is null. * @param word the word that is to be tested. * @return true if the rule matches, false otherwise.

*I

public boolean matches(String word) boolean retval = false; if(ending ==null) ( retval = true; } else ( if(word.endsWith(ending)) retval return(retval);

true;

190

STEMMING

The next method is used to check whether a rules applies to a given word. This is like a yes/no question, and so we choose boolean as the return type. The return value will be true if the rule matches, and false otherwise. We assume that the rule does not match to start with, and then we have to deal with a special case: some rules have an empty suffix, which means they will match any word. In this case the ending will have the value null, so we simply test whether it is this and set the return value to true. Otherwise we test if the word ends with the rule's suffix. If we had not checked for a null ending, we would get a NullPointerException here, as the parameter of endsWi th () must not be null. We could of course have coded this method a bit less verbose by leaving out the variable retval, so that it would look like this: public boolean matches(String word) { if(ending ==null) return(true); return(word.endsWith(ending));

However, in this case we would violate the principle that each block should only have one exit. Suddenly there are two return () statements, and if there were any further lines between those two one might easily miss that. Again, it is fairly trivial in a short method, but for longer and more complex methods it always pays off to be verbose. You will find that this is a great help when trying to find out why a program doesn't work as expected, and the hours you save during that process make it worthwhile to type in a few more lines of code. /**

* Execute a rule. * The suffix is replaced by the replacement. There is no test *whether the rule actually matches, so the matches() method needs * to be called before. * @param word the word on which the rule acts. * @return the result of the replacement. */ public String execute(String word) { StringBuffer sb = null; if(ending != null) { sb =new StringBuffer{word.substring(O,word.length()-ending.length())); if(replacement !=null) { sb.append(replacement); else { sb =new StringBuffer(word); return(sb.toString());

Then we come to the most complex method of the class: executing a rule. As we are changing a String instance we make use of a StringBuffer object for temporary storage. We first check whether the ending is empty; if this is the case we just keep the word, which we do in the else branch. Otherwise we take a substring of the word in what looks like a rather complicated expression, but is really simple and straightforward. The substring ( ) method of the String class has one form

IMPLEMENTATION

191

where it takes the starting position and the end position and returns the part of the string between those two. Here we start at the beginning, namely position 0, and take off the number of characters defined by the length of the ending. So we take the word length, subtract the ending's length, and thus arrive at the final position of our new word. If there is a replacement for the suffix in the rule we then append it to the word after we have cut off the suffix, and finally we return a String object derived from our temporary StringBuffer. Once we have dealt with a rule we need to know where to continue. The next method returns the transfer component of the rule, which is either the label of the next rule to apply or the word 'finish'. I** * Retrieve the transfer part of rule. * The transfer of a rule is either a label that execution continues * at, or it will be a marker indicating that the stemming process * has finished. *@return a label or the string "finish", depending on the rule.

*I public String getTransfer() { String retval; if(transfer == null) { "finish u; ret val } else { retval = transfer; return (retval);

In case the transfer part is empty we return 'finish' as well, just to be on the safe side. This would only be relevant if there was a mistake in the rule definition file, as all rules ought to have a transfer part. I** * Match the label of the rule. * @param label a label to be tested against the rule's label. * @return true if the label matches, false otherwise.

*I public boolean matchesLabel(String label) { boolean retval = false; if(label.equals(this.label)) retval = true; return(retval);

II end of class rule

The final method of the Rule class tests whether its label matches a given one. When we transfer to another rule with a label, we need to check each rule whether the label matches. This method is essentially analogous to the matches () method we looked at earlier, but this time we need to test for the full string instead of just the ending, therefore we use equals ( ) instead of endsWi th ( ) .

STEMMING

192 # # Rulefile for Stemmer # Contents from table 3.10 in Oakes (1998) # label ending replacement transfer # IS ably finish ibly ss ly finish ss ss ss finish ous ARY y ies E s ARY y ied ABL ed ABL ing ABL E e ION al AT ion ION finish finish ary ARY IS ability finish ibility IV ity finish ify finish IS abl ABL finish ibl AT iv IV IS AT at finish IS is finish ific finish olut olv finish

9.3.4

0

Format:

The Rule File

To operate properly we also need the rules in a machine-readable form. The format of the rule file is defined by the RuleLoader class, and the set of rules is listed in figure 9.3.4 with a few added comments to aid understanding it. Again, it will make it easier to maintain the rule set if the documentation of the format is included in the file itself, as it means there is no separate piece of paper that you can lose to depend on. You might remember that the RuleLoader class ignores lines which begin with a hash sign, which is traditionally used in Unix script files to mark comments.

9.4 TESTING In the software development cycle the final step is testing. Here we determine if the program works as defined in the specification, and if it does then we have completed the task. If it doesn't, we have to go back to an earlier stage and amend it until it passes the test.

9.4.1

Output

As mentioned earlier, the main () method of the Stemmer class has been designed for testing, so we can now try out a few words and see what the result is. As our starting point in developing this program was just a list of rules, it is somewhat difficult to define what a successful test would look like. All we can do is think of

STUDY QUESTIONS

193

a few complex words and see if we are satisfied with the result. Ideally we would pick words with suffixes which are featured in the rules, but it is always a good idea to also run a test with some 'inadequate' words. If a word has no matching suffixes it should pass through the rules unchanged, otherwise the usefulness of the stemmer would suffer, as we would somehow have to work out beforehand if a word can be stemmed or not. Here is some sample output. The first line is the command-line that we enter, and the following lines are the output of the Stemmer program: java sternmer.Stemmer rules availability astonishing rallied interrogation questioningly availability: avail astonishing: astonish rallied: rally interrogation: interrog questioningly: question >

Even though stemming is not a linguistically meaningful procedure, it can still help you for certain tasks where you don't want to distinguish between different inflected forms or derivational variants. As it happens, the testing revealed an error in the rules file. If you try and run it with revolutionary, you will find that nothing happens. The empty rule following the label ION fires (as the ending is empty and matched anything), causing the stemming procedure to be terminated. However, if you take that line out (or comment it out by putting a hash sign in the first column) the suffix -ary will be removed and the result comes out as revolution. Ideally you would want revolut as the stem, so you really want to step back to cut off the -ion suffix. That would reduce revolutionary and revolving to the same stem revolut.

9.4.2 Expansion As mentioned above, the stemmer is in principle not language specific, though the rules are of course only valid for English words. However, it should not be a difficult task to develop a similar set of rules for other languages that form inflections and derivations by appending suffixes to the stem. In order to make use of the stemmer, you can easily integrate it into a word frequency list program (see study questions).

9.5

STUDY QUESTIONS

1. Modify the programs we have developed so far to make use of stemming. For

what programs would it make sense? How should the data be presented? How can the stemmer be integrated? 2. Look at the rules file more closely. What happens if you change the order of some rules? Can you think of other rules you might want to add? Create a set of words for testing, so that you can observe any changes to the results when changing the rules file. Beware of unexpected side effects. 3. Create a rules file for another language. This can be a living language, a dead language (e.g. Latin) or even a planned language (e.g. Esperanto). Use either

194

STEMMING

a grammar or a description of inflectional endings and suffixes you might find on the Internet. 4. As it is coded here, the execute () method of the class Rule will not allow rules which have an empty ending and a non-empty replacement. Why is that? How will the method behave in such a case? How can it be changed to allow rules that simply add an ending to the word (i.e. have no ending to be deleted, but only a replacement)?

10 Part of Speech Tagging 10.1

INTRODUCTION

In a study question in McEnery and Wilson (1996) a simple part-of-speech tagger is described, and the task for the reader is to manually tag a sentence using the algorithm. In this chapter we will see what it takes to implement this as a computer program, as tagging is rather boring if it is done by hand. Before we look at the program itself there is a brief introduction to some relevant concepts of tagging. Part-of-speech tagging is the process of assigning word class labels to tokens in a running text. The aim of this is to achieve a more sophisticated analysis, as it is then possible to investigate syntactic patterns and to distinguish different forms of a lemma. Sometimes it is even possible to look at word senses, as long as they are tied to different word classes. The program performing that word class assignment is called a tagger, and there are two principal ways a tagger functions: rule-based or stochastic. The first step, which all taggers share, is a lexicon lookup. Here each word of the input text gets assigned a number of potential tags, the number of which depends on the ambiguity of the token. Some words will not be found in the lexicon, in which case there has to be some way of 'guessing' the right set of tags. This is typically done by looking at morphological features, for example the ending, and if that should fail the last resort is often to assign a default set of all open word classes. A rule-based tagger has as its resources a list of rules which describe possible sequences of tags. They typically look at the environment of an ambiguous word and try to work out from the context what the most likely tag would be. For example, light can be either of adj, noun or verb, but in this context a_DET light_??? snack_NOUN

there could be a rule which might look like DET ??? NOUN -> DET ADJ NOUN

and specify that an unknown word surrounded by a determiner and a noun is likely to be an adjective. The problem with rules is that they usually require a lot of effort to create (though there are ways to acquire rules automatically, see Brill (1992)), and you need quite a lot of rules for comprehensive coverage of all possible situations. Also, these rules are generally language specific, or even limited to a particular tagset.

196

PART OF SPEECH TAGGING

As opposed to that, a stochastic tagger uses probabilities which usually have been gathered from a (manually) pre-tagged training corpus. In the above example the tagger would evaluate which of the individual tags are most likely on their own (i.e. what part-of-speech is most frequent for light), and that is then combined with the likelihood of the resulting tag sequence. Some taggers use bigrams, so they just look at the preceding tag, whereas others use trigrams, which means they take three tags into account A stochastic tagger can easily be designed to be language independent. All it needs is a training corpus to extract probabilities, and there are ways to fall back in cases which did not occur in the training data. For an experiment in adapting a stochastic tagger from English to Romanian see Tufis and Mason ( 1998). The tagger described in McEnery and Wilson ( 1996) is a stochastic one, based on bigrams. In the following section we will look at the design, before we then discuss how it can be implemented.

10.2 PROGRAM DESIGN The tagger can conceptually be separated into four different components: 1. 2. 3. 4.

the lexicon, to look up the closed class words the suffix analyser, to deal with the open class words the transition matrix, to incorporate knowledge about context the processor, to assign the tags

It thus seems fairly straightforward to design the tagger using four different classes, one for each of the components. There are a number of points to look out for during the design phase, as we want to keep our options open for further improvements.

Raw Text

rlLexicon

Tagging Engine

.

...

I

Suffix Analyser

Tagged .... Text Transition -------+ Matrix Figure 10.1: The structure of the tagger

Lexicon File

IMPLEMENTATION

197

The structure of the tagging system is shown in figure 10.1. As in figure 6.3, data is indicated by plain boxes, whereas classes are in boxes with shadows. You can see that the 'raw' text passes through the tagging engine where it is turned into tagged text. The engine makes use of the three knowledge sources, lexicon, suffix analyser and transition matrix. Only one of them, the lexicon, requires further external data.

10.3 IMPLEMENTATION 10.3.1 The Processor Just like the stemmer's engine, the processor, implemented as the Tagger class, deals with all general aspects of tagging that need to be done. Just as we did with the stemmer in the previous chapter, we will implement the tagger in a top-down fashion. !*

* Tagger. java *I package tagger; import import import import import import import

!**

*

java.io.BufferedReade r; java.io.PrintStream; java.io.FileReader; java.io.InputStreamRe ader; java.io.FileOutputStre am; java.io.IOException; java.util.StringToken izer;

This class implements a basic tagger, as described in McEnery (1996), page 140-143. It will then The tagger is started with the filename of a lexicon. read data from System.in or a supplied filename and print out the result to System.out or into a file if a third name is supplied. @author Oliver Mason @version 1.0

* & Wilson

* * *

* * *I

public class Tagger { private private private private private private private

Lexicon lex = null; String prevWord = null; char prevTag = '0'; String currWord = null; char currTag = '0'; PrintStream output = System.out; int lineWidth = 0;

The set of variables is quite straightforward. We need a Lexicon (see next section), the tag and word form of both the previous word and the current word (to be able to compute bigram probabilities), and we need somewhere to write the output. To avoid overly long lines of output we want to insert line breaks after a certain column. We are using 1 ineWidth to keep track of how long the current line is. In chapter 8 we have discussed different forms of annotations. With this tagger to support two of them: XML and the 'underscore format'. XML is obviwant we ously the preferred way, but the underscore format has been around for quite a long time, and it is also easier to use for certain tasks, like tagged word lists.

198

PART OF SPEECH TAGGING

public static final int XML = 0; public static final int USCORE = 1; private int printStyle = USCORE;

We define two constants, one for each format, and declare a variable that will contain the actual style to be used. By default we are using the underscore format, but that can easily be changed by adjusting a single line. The constructor of the Tagger class takes two arguments, the name of a lexicon and the format of the output as an int value. This should be either Tagger. XML or Tagger. USCORE, but as you will see further below there are provisions to deal with other values as well. !**

* Constructor. * @param lexicon the name of the lexicon file. * @param outputFormat the output format to use (XML/USCORE) . *I public Tagger(String lexicon, int outputFormat) lex= new Lexicon(); try { lex.load(lexicon); } catch(IOException e) System.err.println{ Lexicon Loader: "+e); 11

printStyle = outputFormat;

If loading the lexicon fails we only print an error message. As a fallback we still have the suffix analyser, so that we can still get some result from the tagging process, even though it might not be very good. Imagine the data file containing the lexicon data got lost: if this would cause the tagger's initialisation to fail the whole program would suddenly be rendered useless. If, on the other hand, we try and do as best as we can without it, the tagger can still continue to do at least something towards the result. Next we have a lookup table which maps the original one-letter-labels used by McEnery and Wilson on to more descriptive names. This is not strictly necessary, but it greatly increases the user-friendliness of the whole application. As the original labels are only one character long, we can easily encapsulate this lookup in a single switch statement: /**

* Map one character tag labels to descriptive labels.

* The abbreviated tag labels as used internally by the tagger are * mapped to more explicit and descriptive labels.

* @param tag the one character tag label

* @retur~ the descriptive label or "???" if no mapping was possible. *I private String map(char tag) { switch(tag) case 'A' : return ( "ART" ) ; case 'C' : return ( "CONJ"); case 'D': return("DET"); case 'E' : return ( "EXTH");

199

IMPLEMENTATION case 'F': return( "FORM") case 'G': return ("GENS") case 'I,: return( "PREP") case 'J': return{ "ADJ"); case 'M': return("NUM"); case 'N': return( "NOUN"); case '0': return ( "OTHER" ) ; case 'P': return( "PRON"); case 'R': return("ADV"); case 'T': return( "INFTO"); case 'U': return ( "INTERJ" ) ; case 'V': return ( "VERB" ) ; case 'X': return( "NEG"); case 'Z': return("LETTER");

return(null);

If the tag was not one of the recognised set, we simply return null to indicate that we were not able to find a longer label for it. The next job we're looking at is generating the output. We have already decided to support both XML and the underscore format, with the current style stored in the variable printStyle. Here is the print () method, which takes a word and its tag and prints them to the output stream in the specified format: /** * Print a tag to the specified output stream. * @param word the word form that has been tagged. * @param tag the tag that has been assigned.

*I private void print(String word, char tag) ( if(word != null) { String outputLine = null; switch(printStyle) { case XML: outputLine new String(""+word+" break; case USCORE: outputLine =new String(word+"_"+map(tag)+" "); break; default: outputLine =new String(word+"/"+map(tag)+" ");

11 ) ;

if(lineWidth > 55) { output.println(outputLine); lineWidth = 0; else { output.print(outputLine); lineWidth += outputLine.length();

If the word is null, nothing happens. Otherwise we define a String called outputLine which we fill with the printable version of this particular word and tag. Then we enter a switch statement according to the print style that has been set. As you can see, we actually support a third style, which is like the underscore format, but instead of the underscore uses a forward slash to separate the tag from the word.

200

PART OF SPEECH TAGGING

This format is used when the printStyle value is neither XML nor USCORE. Clearly this shouldn't happen, but just to make sure that your program will always behave in a defined way it is worth catering for those 'impossible' situations. Alternatively we could have set the default label to one of the other options, such as case USCORE: default: outputLine =new String(word+"_"+rnap(tag)+" "); break;

This would have made USC ORE the default in case printStyle was out of range. Once we have our output line we print it. In order. to add line breaks when needed we keep track of how far we have already filled the line: if it is more than 55 characters we use println (), thus adding a line break, otherwise we use print () which does not add a line break to the end of the output. Talking of output, quite often you don't simply want to print it on the screen, but instead save it in a file. For this reason we are using the instance variable output instead of directly sending the output to System. out. With the setOutput () method you can open a file and make the tagger put the output there. You can see how that is done below when we are discussing the main ( ) method. /** * Set the output stream of the tagger.

* @param out the new value for the output stream. *I

public void setOutput(PrintStream out) output = out;

The setOutput () method is quite straightforward. Methods such as this which only set the value of a variable are often called setters, as opposed to getters which return the value of a variable. As discussed in section 4.1.5, this is a much better way than accessing variables directly. Now we deal with tagging a line of text. Using the PreTokeniser from section 7 .2.1 we split the line into words, and each single token is passed to the tag ( ) method. /**

* * * *

Process a line of text. The line is split into tokens (separated by spaces) and fed into the tagging engine. @param line the line of text to process.

*I

public void processLine(String line) { StringTokenizer st =new StringTokenizer(PreTokeniser.tokenise(line)); while(st.hasMoreTokens()) { Strings= st.nextToken(); tag(s);

IMPLEMENTATION

201

In tag ( ) we first print out the previous word, which should by now have its tag assigned. Remember that this word initially is set to be null, and that print () ignores null values, so for the very first word nothing happens. You always need to make sure that the very first and the very last items are handled properly when dealing with input data. !**

* Tag a word. * * * * *

The word is added to the buffer, and the most likely tag is assigned from combining all possible tags with the possible tags of the previous word. The word that has been tagged before is dropped out of the buffer and sent off to the output routine. @param word the word to tag.

*I

private void tag(String word) { print(prevWord, prevTag); prevWord = currWord; prevTag = currTag; currword = word; currTag = '0'; double score = 0.0; String possTags = lex.lookup(word); if (possTags == null) possTags = "N J V"; StringTokenizer st =new StringTokenizer(possTags); while(st.hasMoreTokens()) { String tmpTag = st.nextToken(); double tmpScore = Matrix.lookup(prevTag, tmpTag.charAt(O)); if(tmpTag.endsWith("@")) tmpScore /= 10; if(tmpTag.endsWith("%")) tmpScore /= 100; if(tmpScore > score) { score = tmpScore; currTag = tmpTag.charAt(O);

Once the previous word has been printed, the current word is assigned to be the new prevWord. This is needed for working out the bigram probabilities. The word to be tagged then becomes the currWord and is assigned a default tag. We then get all possible tags for the word by way of looking it up in the lexicon. This includes the suffix analysis if the word was not found in the lexicon, but for our tagging purposes we don't care where that tags came from. If no possible tags have been found we assume that the word in question can be a noun, verb, or an adjective. We then loop through all possible tags and work out their score. The score of a tag is worked out from the likelihood that it can follow the tag of the previous word. For this we consult the transition matrix with the previous tag and the first character of the current tag. The reason for using only the first character is that some tags have a special qualifier attached, which means they are rare or even very rare. The return value from the matrix lookup is divided by either 10 or 100, depending which modifier (if any) the tag ends with. If this temporary score is bigger than the current tally in score, we assume it is more likely to be the correct tag. We then adjust the settings of the score and tag accordingly. Once the end of the file has been reached we need to handle the last words properly. What we do here is to print the final two words, which we have still stored

202

PART OF SPEECH TAGGING

in prevWord and currWord, end the line by printing a line break, and close the output stream. I** * End *I

the tagging process.

public void close () { print(prevWord, prevTag); print(currWord, currTag); output.println(""); output.close();

When you look back to the tag ( ) method, you will notice that we only print out the prevWord, and the most recent two words are kept in variables, as they still have influence on the tagging decisions to be made. At the end of the data we therefore have to print those two words out separately. Just as the stemmer in the previous chapter, you can either use the tagger from other applications or in stand-alone mode. This is accomplished by the main ( ) method, which acts as a wrapper around the operations of the Tagger class. I** * Main method. * The tagger is run as a stand-alone application.

The command-line

* * * *

parameters are:

the lexicon file

the input file (optional)

the output file (optional) *

*I public static void main(String args[]) throws IOException { if(args.length == 0) { System. err .println ("usage: Tagger LEXICON [INFILE] System.exit(O);

[OUTFILE] ");

Tagger tagger =new Tagger(args[O], USCORE); BufferedReader input = null; if(args.length > 1) { input new BufferedReader(new FileReader(args[l] )) ; } else { input= new BufferedReader(new InputStreamReader(System.in)); if(args.length > 2) { tagger.setOutput(new PrintStream(new FileOutputStream(args[2] ))); String line = input.readLine(); while(line != null) { tagger.processLine(line); line = input.readLine(); tagger.close();

II

end of class Tagger

At first we check whether any command-line parameters are present. If not we print out a usage message and exit the program. Otherwise we create an instance of the

IMPLEMENTATION

203

tagger, taking the first command-line argument as the name of the lexicon. If there is more than one command-line parameter, the second one will be the name of an input file to be tagged, and we use that to create eiter a F i l eReader from a file of that name or an InputStreamReader from System. in. For reading purposes we need a BufferedReader which provides a method to read a line at a time. If there is a third parameter, we use that as output file; the default output target is System. out. The main loop then reads lines from the input source and sends them to the Tagger via processLine () as long as more lines are available. At the end calling the Tagger's close () method makes sure all words are tagged and the output file is closed properly.

10.3.2 The Lexicon The lexicon is basically a list of words with their possible parts-of-speech attached. It should be a persistent object, as we don't want to have to key it in every time we

run the tagger, and the main function it serves is looking up words and returning either their parts of speech if found or some other value if not. It should be easy to add and remove words, and to change the part-of-speech to be assigned to a word. A natural data structure for this task would be a Map, as it involves mapping word forms to associated information, and as we're only dealing with String objects the Property class exactly fulfills the purpose: it has Strings as both key and values, and because it also has convenient methods for storing them in a file it is the most suitable for our purposes here. The lexicon data stored in a Property file will look like this: a=A D abaft=! C aboard=! C about=! C after=! C against=! C ain't=V

If a word form has more than one possible part of speech, they are listed separated by a blank space, like abaft which can be assigned either an I (preposition) or a C (conjunction) tag. I*

* Lexicon. java *I package tagger; import import import import

java.util.Properties; java.io.BufferedinputStream; java.io.FileinputStream; java.io.IOException;

/** * This class implements a basic lexicon.

* @author Oliver Mason * @version 1.0 *I class Lexicon {

PART OF SPEECH TAGGING

204

private Properties storage

null;

!**

* Constructor. *I public Lexicon() new Properties(); storage !**

* Load a lexicon file. * The lexicon is initialised from a file. The file should contain * one line per entry, with the data separated from the entry with * an equal-sign (the standard Java property format) .

*

@param filename the name of the data file.

* @throws IOException if something goes wrong. *I void load(String filename) throws IOException { BufferedinputStream in

new BufferedinputStream( new FileinputStream(filename));

storage.load(in); in. close();

The class Lexicon is not much more than a wrapper around a Properties object. The load ( ) method allows adding multiple lexicon files to the current lexicon, as it does not overwrite its current content (unless the keys happened to be the same). This allows you to have multiple lexicon files, which you can maintain separately and choose as required (e.g. for domain-specific words). When initialising the tagger you just need to call load ( ) repeatedly, once for each lexicon file you want to use. Apart from load (), there is one central method, lookup (), which returns the entry of the word given as parameter. If the word is not in the dictionary, it will return null. There are two special cases to consider: digits and other nonalphabetic strings. It would be impossible (and rather wasteful) to have all possible digit sequences in the lexicon, so we just check if the first character of word is a digit, and if it is so, we just return 'M', the tag for numbers. Similarly, if the first character is neither a digit nor a letter, we return '0', the tag for other entities. !** Look up a word in the dictionary. If the given word form is contained in the dictionary, the associated If the word is not entry is returned as a String, otherwise null. found in the dictionary, suffix-related rules are applied if possible. These are defined in the class Suffix. * @param word the word form to look up. * @return the entry for the word if found, or null otherwise. * @see Suffix. *I String lookup(String word) String retval = null; if(word.length() > 0) if (Character. isDigit (word. charAt ( 0))) return ( "M"); if(Character.isLetter(word.charAt(O))) { String entry= storage.getProperty(word,null);

205

IMPLEMENTATION if(entry == null) { entry= storage.getProperty(word.toLowerCase(),null); if(entry ==null) { retval = Suffix.match(word, this); if(retval ==null) { if(Character.isUpperCase(word.charAt(O))) "N retval else { retval = "N V J"; 11 ;

else retval = entry; else { retval = entry; else { retval = "0";

return (retval);

II end of class Lexicon

This method looks more complicated than it really is. It is a simple sequence of decisions based on the outcome of the lexicon lookup. When you have such deeply nested if statements it is sometimes useful to add comments to the else branch, like l else { retval

=

II entry was in lexicon entry;

II first character not a letter or digit else { retval = "0";

Also, using proper indentation to reflect the nesting makes it a lot easier to follow the code's logic. There are several steps involved in looking a word up: first we check for numbers and other tokens, then we look it up in the original form. However, this lookup is case-sensitive, and if the word is capitalised for some reason (like being at the start of a sentence) this would fail. If we don't find the word, the next step is then to look for it converted to lower case. If this also proves to be unsuccessful we consult the suffix analyser. If we don't get a result from that either, we check the first character of the word: if it is in upper case we assume it is a proper noun and assign it an N tag, otherwise we guess it is one of noun, adjective or verb. During the lookup procedure we call upon the suffix analyser to provide possible tags for a word. We will discuss the suffix analyser in detail in the next section, but if you look at the call retval = Suffix.match(word, this);

you will notice the keyword this. The match () method takes a Lexicon object as its second parameter, in case a word is to be looked up after a suffix has been removed. In this case we need a reference to the current Lexicon object, which is

206

PART OF SPEECH TAGGING

what this is good for: it is a reference to the object itself. Passing it to the suffix analyser allows that object to refer back to the lexicon. Whenever the processor class receives a new word to tag, it first looks it up in the lexicon. If it is not in the lexicon, it submits it to the suffix analyser. We can take a shortcut here, which makes the processor a bit less complicated: we don't really care where the tag information for a word form comes from, whether it was the lexicon or the suffix analyser. Therefore, as soon as the Lexicon class does not find a word form in the lexicon, it passes it directly on to the suffix analysis. This means we don't have to deal with unknown words in the processor. Conceptually we can imagine the suffix analyser being a kind of dynamic lexicon, as it fulfills the same function: it takes a word and returns information about it.

10.3.3

The Suffix Analyser

The suffix analyser is an extremely simple component. All it does is to check the ending of a word and return a pre-defined set of tags if a match is found, or a default set (of V, J and N) if not. The Java String API contains a useful method for this purpose, endsWi th ( ) , so all we need to do is write down the possible known endings as a list of commands like: if (word. endsWith ("less")) return ( "J"); if(word.endsWith("ly")) return("R J@"); if (word. endsWith( "ment")) return ( "N");

and that is all there is to it. There is, however, one caveat: consider the following sequence: if(word.endsWith("y")) return("J N"); if(word.endsWith("ity")) return("N"); if(word.endsWith("ly")) return("R J@");

This sequence will not give the desired result if we want to know what the appropriate tags are for entity or quickly. Both of these end withy, so the first line matches and will return J N,even though the two other lines are a better match. This is exactly the same problem we encountered in the stemmer in the previous chapter when it came to ordering the rules is the right way. What we have to do here in order to prevent this premature matching is to change the order of the lines, so that the longer suffixes are checked before the shorter ones. /*

* Suffix.java *I package tagger; /**

* This class implements a suffix-matching routine for the * POS-tagger. *page 142.

The algorithm is as described in McEnery

* @author Oliver Mason * @version 1.0 *I class Suffix {

& Wilson (1996),

IMPLEMENTATION

207

I**

* Match a suffix. * The word form to be matched is analysed with respect to its ending,

• and if a matching suffix rule has been found, a String describing * the likely tag choices for that word is returned. If no rule could * be matched, null is returned. The rules are applied using longest * match. The default match is "V J N". • @pararn word the word form to match. * @pararn lexicon the lexicon for subsequent searches. * @return a String describing the possible tags or null. *I public static String match(String word, Lexicon lexicon) { if (word.endsWith ("able")) return ( "J"); if (word. endsWith( "al")) return( "J N%"); if(word.endsWith("ance")) return("N"); if(word.endsWith("ant")) return("J N"); if(word.endsWith("ed")) return("V J@"); if (word. endsWith( "er")) return ( "N V J"); if (word.endsWith ( "ers")) return ( "N V"); if (word.endsWith( "est")) return ( "J"); if (word. endsWith( "ful")) return( "J N%"); if(word.endsWith("ing")) return("N V J@"); if(word. endsWith( "ings")) return ( "N"); if(word.endsWith("ity")) return("N"); if(word.endsWith("ive")) return(null); if (word.endsWith( "less")) return( "J"); if (word. endsWith ( "ly") ) return ( "R J@") ; if (word.endsWith( "ment")) return( "N"); if(word.endsWith( "ness")) return( "N"); if(word.endsWith("th")) return("N M"); i f (word.endsWith( "tion")) return( "N"); if (word. endsWith ( "n") ) return ( "J N") ; if(word.endsWith("y")) return("J N"); if(word.endsWith("s")) { return(lexicon.lookup(word.substring(O,word.length()-2))); return(null);

II end of class Suffix

There are of course plenty of other ways in which the suffix analyser could be implemented, some of which would avoid the problem mentioned in the previous paragraph. The list of suffixes could be put into a Properties object, like the lexicon, and then a loop could be used to look up a word in it. If the word is not found, we cut the first letter off it and try again. This is repeated until we have either found a match or used up all the letters in the word. The two advantages that this method has is (a) that the list of suffixes is not hard-coded within the class, but is in an external resource file like the lexicon, and (b) that it automatically picks the longest match and we do not have to worry about putting the suffixes in in the right order. If the entries are in an external file, we can easily add, remove or change entries in it without having to re-compile the Suffix class. This makes maintenance a lot easier, especially when other people are using the tagger who do not have access to the source code. The second point does not seem to important, but just imagine that in a couple of months you want to add a new suffix entry to the list. It is quite possible that by then you will have forgotten that the ordering is relevant. Even if it is documented somewhere you might not remember

208

PART OF SPEECH TAGGING

to look it up, or the documentation might get lost. It is always best to avoid implicit constraints and to keep things as simple as possible for future maintenance. A further way of implementing the suffix matching would be to generate a tree from the list, where each branch is indicated by a letter (see figure 10.2 ). We would then take a word starting by its last letter and progressively go further into the tree as letters match. This is the most efficient and elegant method, but it is also most complex in the implementation. The time required to look up a word is extremely short, and it can even be used to store a whole lexicon in an efficient way.

e~

1< E

~

1

...

b

...

a

c

...

n

...

a

v

...

i

...

J

N

...

a

JN%

u

ll>f

l

•e

g

•n

...

s

•e

...

JN% L.

~

...

N

1

...

J

n

...

N

Figure 10.2: A suffix trie (to be read backwards)

On the other hand, it makes it even more difficult to deal with special cases, like cutting off a final -s and looking the resulting string up in the lexicon again.

10.3.4 The Transition Matrix The transition matrix is also a very straightforward element. It is just a twodimensional array of transition values, which we access through the tag labels which denote its rows and columns. The values are in rounded percentages, i.e. each row adds up to 100. We can simply use a static array of int values which is initialised with the right entries. The mapping between the tag labels and the row/column numbers can easily be achieved through a String: as tag labels are one character only, the position of the tag in the String object is equivalent to the column index. private static String labels= "ACDEFGIJMNPRTUVXZO"; int indexl = labels.indexOf(tl);

209

IMPLEMENTATION

Here, tl is a tag label (a String or a char), and indexl its corresponding index value. The indexOf () method returns the index of its argument in the String object, or -1 if the argument is not contained in the string. This is the easiest way of getting the right index value if the sequence of tags is not complete (like the missing letter H) or not in the right order (like 0 being at the end). If tag labels are more than just one character long we could put them into a Hash table or a Properties object to retrieve their associated index values.

/* * Matrix.java

*I package tagger; /** * This class implements a transition matrix for tag probabilities. @author Oliver Mason @version 1.0

..

*I

class Matrix { private static int matrix[][] = { /*A*/ {0. o. 1, 0, 0, 0, 0, 29,4, 64, 0, 2, 0, 0, 0, 0, !*C*/ {14,2, 5, 2, 0, 0, 9, 7, 1, 19,14,9, 1, o. 13,1, /*D*/ {5, 1, 3, 0, 0, 0, 8, 6, 2, 42,6, 3. 0, 0, 16,0, /*E*/ {2, 0, 1, o. 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 93,0, /*F*/ {0, 2, 0, 0, 49,0, 2, 0, 1, 7, 0, 1, 0, 0, 6, 0, /*G*/ {1, 1, 1, 0, 1, 0, 0, 7, 1, 84,0, 1, 0, 0, 0, 0, /*I*/ {46,0, 10,0, 0, 0, 0, 6, 4, 22,4, 1, 0, 0, 4, 0, /*J*/ {0, 5, 0, 0, 0, 0, 5, 3. 0, 72,0, 1, 4, 0, 0, 0, /*M*/ {0, 2, 1, 0, 1, 0, 4, 3, 2, 57,1, 2, 0, 0, 2, 0, /*N*/ {1, 6, 1, 0, 0, 1, 27. 0, 1, 8, 1, 3' 1, 0, 17,0, /*P*/ {1, 2, 1, 0, 0, 0, 4, 1, o. 1, 1, 7, 2, 0, 71,0, /*R*/ {9, 4, 3, 0, 0, 0, 14,11,1, 3, 6, 11,1, 0, 21, o. /*T*/ {0, o. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 99,0, 0, /*U*/ {1, 1, 1, 0, o. 0, 1, o. 0, 1, 1, 2, 0, 3, 0, 0, /*V*/ {17,2, 3, 0, 0, 0, 11,5, 1, 4, 5, 16,5, 0, 18,3, /*X*/ {7, 1, 2, 0, 0, 0, 5, 4, 0, 1, 6, 8, 4, 0, 56,0, /*Z*/ { 0. 4, 1, 0, 2, 0, 19,2, 0, 4, 0, 1, 0, 0, 10,0, /*0*/ {9, 10,4, 0, 1, 0, 5, 4, 1, 10,13,6, 0, 1, 7, 0, private static String labels = "ACDEFGIJMNPRTUVXZO';

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

1}, 3}, 7}, OJ' 30). 2}, 1}. 9}, 26}, 31). 9}, 16}, OJ. 89). 10). 6}, o. 55}, 0, 29}};

In order to make it easier to check whether the matrix values are right, the corresponding tag labels have been put at the beginning of the line as comments. This form of initialising an array is not really advisable if the array is any larger, as it is difficult to see which cell corresponds to which value. Alternatively we could put the matrix into a resource file. This has the same advantages mentioned above for the suffix analyser, and you should always keep in mind that you might want to change individual entries in the matrix for finetuning, or add or remove a tag, or use a different tagset altogether. If all tagset-related information is kept in external files, you can easily change all this without having to change anything in the program itself. The price to pay is fairly small: a bit more added complexity in the initialisation stage and a minute drop in performance.

210

PART OF SPEECH TAGGING

!**

* * * *

Look a tag sequence up in the matrix. If the tag labels are invalid, an ArrayindexOutOfBounds exception will be thrown. @param tl the first tag to look at. * @param t2 the tag following tl. * @return the transition probability of tl followed by t2. *I

static int lookup(char tl, char t2) { int indexl = labels.indexOf(tl); int index2 = labels.index0f(t2); return(matrix[indexl] [index2]); II end of class Matrix

To look up a bigram value we first need to convert the tag labels from char values to the corresponding index values which we then use to access the matrix. This method is marked static, so we don't need an instance of the Matrix object to call it. This means, however, that all variables we are accessing in this static method have to be declared static as well, so that they are associated with the class and not an object instance. Otherwise the variables would only exist within the context of an object.

10.4 TESTING We now let the tagger loose on some data. Here we're just using a few sentences from the first paragraph of this book; they are saved in a file called 'tagtes t', and using the command

java tagger.Tagger lexicon.prop tagtest tagtest.out we create a file 'tagtest. out'. As the Tagger class is in the package tagger, we need to use the full package name plus the class name for running the class. The first command-line parameter, lexicon. prop, is the name of the Properties file which containes our lexicon. This is what the tagger produces: Corpus_NOUN linguistics_VERB is_VERB all_DET about_PREP analysing_NOUN language_VERB data_VERB in_PREP order_NOUN to_PREP draw_NOUN conclusions_VERB about_PREP how_OTHER language_NOUN works_VERB ._OTHER To_CONJ make_NOUN valid_VERB claims_VERB about_PREP the_ART nature_NOUN of_PREP language_NOUN ,_OTHER one_NUM usually_ADV has_VERB to_PREP look_NOUN at_PREP large_NOUN numbers_VERB of_PREP words_NOUN ,_OTHER often_NOUN more_DET than_PREP one_NUM million_NUM ._OTHER

Unsurprisingly the results are pretty bad. The main reason for this is that there are quite a number of unknown words: for each noun and each verb the tagger has to guess what the right class could be, only supported by the short list of suffixes it has. If we add a few words to the lexicon file, e.g. linguistics=N language=N draw=V N claims=N V conclusions=N

STUDY QUESTIONS

211

make=V valid=J data=N

we get Corpus_NOUN linguistics_NOUN is_VERB all_DET about_PREP analysing_NOUN language_NOUN data_NOUN in_PREP order_NOUN to_PREP draw_NOUN conclusions_NOUN about_PREP how_OTHER language_NOUN works_VERB ._OTHER To_CONJ make_VERB valid_ADJ claims_NOUN about_PREP the_ART nature_NOUN of_PREP language_NOUN ,_OTHER one_NUM usually_ADV has_VERB to_PREP look_NOUN at_PREP large_NOUN numbers_VERB of_PREP words_NOUN ,_OTHER often_NOUN more_DET than_PREP one_NUM million_NUM ._OTHER

Still not exactly correct, but already a lot better. In order to tum this tagger into a usable application you simply have to keep working with the lexicon. You can either tag more and more data and add new words to the lexicon as you go along, or you could try and get hold of a list of words which already have word class tags attached. Then you only need to convert that list into the right form to use it with the tagger.

10.5 STUDY QUESTIONS 1. Modify the programs we have developed so far to make use of tagging. For what programs would it make sense? How should the data be presented? 2. Use the tagger for creating an index (see chapter 6) of a text file, so that you can search for all occurrences of light as an adjective. How would you go about doing that? Can you do it without modifying the text file itself? 3. Some tag labels are hard-coded in the class files. Why is that a bad idea? How could you change it? 4. At present there is only a single file containing closed class words. Add the capability to have additional lexicon files with frequent words that the tagger gets wrong otherwise. 5. Does is make sense to combine the stemmer with the suffix analyser in some way? What could you gain?

11 Collocation Analysis

This chapter shows how to implement the facilities described in Language and Computers, chapter 5. There, Barnbrook describes in detail how to compute collocations.

11.1

INTRODUCTION

Collocational analysis attempts to identify words that occur together significantly more often than one would expect by pure chance. These word pairs then form what is called a collocation. Collocations can be the result of a number of different underlying linguistic processes which influence the combinatory arrangement of words; these processes can be syntactic, semantic, or even prosodic. The way one can identify collocations is by looking at the frequency of words in the environment of the node word. If a potential candidate for a collocation, sometimes called a collocate, has a particularly high frequency in this environment, we can assume that the two words form a collocation. This involves comparing the frequency of a word in the node's environment with a benchmark frequency. The benchmark can either be a general word frequency list, or if none is available we can simply take the frequency of the word in the corpus we are investigating. However, in that case the frequencies are not independent of each other, which might skew the results. The significance of a node-collocate pair is evaluated using a mathematical formula, the evaluation function. There are a number of different ways to do that, and consequently there are about a dozen different formulae in use, each producing slightly different results. The most common ones are the z-score, the t-score, and mutual information. The first two are based on general statistics, while mutual information (MI) originates from information theory. It is very difficult to say that one of them is better than the other, so the best way to get familiar with them is to simply try them out on a number of words. This will quickly give you an idea of what you can expect from each function. The evaluation function assigns a score to each collocate, which indicates the strength of the bond between the node and the collocate. It is generally not possible to compare those scores directly with each other, so you should only use them for ranking the list of collocates. We will later see how a list of collocates can be sorted according to the different scores assigned to them.

COLLOCATION ANALYSIS

214

We have now seen what the important parameters are for computing the collocations of a word: 1> 1> 1>

A definition of the environment. A definition of the benchmark frequency. A method for comparing the frequencies.

You have some degree of choice with these parameters, as there are no definite answers. Collocations have in the past been looked at in a rather haphazard way, using trial-and-error to reach plausible conclusions. Not very much work has been done on the theoretical foundations, so you need to build up some degree of experience by just using it, looking at the data and comparing it with what you would expect. We will now step through the three parameters and see how we handle them in our application.

11.1.1

Environment

Conventionally, researchers have used a fixed span of either three or four words to either side of the node word as the environment in which to look for collocates. However, recent research (see Mason (1997) and Mason (2000)) shows that this is simplifying matters, and that ideally you would have to use a different span depending on the word you are looking at. Each word seems to have its own range of influence on the neighbouring words, and with a simple procedure you can work out what the ideal span should be. For our purposes we will use a variable span, so that you can specify the desired range for both the left and the right side of the node. This means you can choose whatever span you want. It is likely that you might want to vary your choice depending on the type of words you're looking at and the language you are working in: in English with its comparatively fixed word-order one can get away with relatively small values for the span, whereas other languages might require larger values. We will extract the environment from a set of concordance lines. These concordances can be produced using any of the concordancing programs we looked at in chapter 6. You need to make sure that the lines are wide enough to contain enough words for the span size, otherwise the environment will be silently truncated to fill however many words there are available.

11.1.2

Benchmark Frequency

As mentioned above, you should ideally use a general frequency list which is not directly generated from the corpus you are using. For that reason we will use the FreqList class we introduced in chapter 7, which doesn't have to be generated from the same corpus. If you don't have any other suitable corpus material available you could try and look for frequency lists on the Internet, or search the archive of the corpora mailing-list: there are occasional requests for such lists for different languages. When you get such a list from another source you will have to convert it into a format that the FreqLi s t class can understand. The format is just a word followed

SYSTEM DESIGN

215

by whitespace followed by the frequency value. Most frequency lists will already be in that format. For a more detailed description of the class see chapter 7.

11.1.3 Evaluation Function Barnbrook (1996) describes three possible ways to evaluate the frequencies we extracted from the environment: z-score, t-score, and mutual information. We will implement all three of them, so that you can compare the resulting scores. The scores all require the same parameters: the frequencies of the collocate in the environment of the node and in the benchmark list, and the number of tokens of both environment and reference list. The latter are used to compute the relative frequencies which are used for comparison.

11.2 SYSTEM DESIGN Starting at the end, the resulting output of the system should be a list of collocates with their respective scores as computed by the evaluation functions. We therefore need some object to hold a word and the three scores, which we will call Collocate. Objects of this class will be created during the process, so we need accessor methods to read out the scores. This will also be relevant for sorting. As we saw above, the algorithm to compute collocations involves counting the frequencies of words in the neighbourhood of the node word. The input data to that procedure are a set of concordance lines and the size of the span. We create a frequency list from these which we will then compare with the general frequency list. The input data will be lines, which will need to be split into tokens. For this we will use a class Span which will take a concordance line and prepare it for access by 'slot', i.e. the distance to the node. This can be illustrated as follows: +3 -1 -4 -3 -2 NODE +1 +2 +4 +5 ocessing tasks, such as creating word lists, computing parameters of a te

Here we would have enough context for a span of up to 4 words on the left and five on the right. The first and last word is ignored, as it could be truncated, even though it could happen to be complete in the actual line. The 'collocation engine' will trawl through all lines and convert them into tokens using the Span class. This is then used to retrieve the words from within the span we are looking at, in order to count the frequencies of the collocates. We already have a suitable class in FreqList, which was introduced earlier, so we can simply reuse that class. All we need to do is to design a class which performs the necessary operations, our 'engine'. The Col locator class will coordinate the activities involved in the whole process: 1. 2. 3. 4.

tokenise the concordance lines. count the collocate frequencies. compare them with the benchmark. generate the output.

COLLOCATION ANALYSIS

216

The input to that class will be an array of String objects representing concordance lines, and the output will be a list of Collocate objects. As we want to sort the resulting collocates according to different criteria, we need to employ the Comparator mechanism which we have already come across in chapter 7 when sorting word lists. We will thus create a set of comparators which will allow us to sort the collocates according to any of the three scores that we have computed.

11.3 IMPLEMENTATION Following the bottom-up approach during the design phase, we will continue that way for the description of the class implementations. Here we will not describe the FreqL is t class, as this class has already been described earlier in chapter 7.

11.3.1

The Collocate

The basic items which we will get as a result are Collocates, which are an encapsulation of a word form and its collocational relevance as expressed by the scores of the evaluation functions: I* * Collocate.java

*I public class Collocate implements Comparable { private private private private

String double double double

collocate ~ null; zscore ~ 0.0; tscore ~ 0.0; mutual ~ 0.0;

We implement the interface Comparable so that we are able to use the Collocate objects in sorted collections; by default we will opt for alphabetical order of the word forms. In the class we declare variables to store the word form and the three scores. These need to be double values, as their computation involves floating-point arithmetics. In the constructor we initialise all the variables. In the case of the collocate variable this is a simple assignment, but for the scores it means that we have to compute them from the evaluation functions. In this method we have provided a comment in the listing to make clear what the individual parameters stand for. It would not be helpful just to know that the constructor takes a String and four int values as arguments. I** * Constructor. * @param word the collocate.

*

@param f the frequency of the collocate in the sample.

* @param F the frequency of the collocate in the corpus. * @param n the size of the sample. * @param N the size of the corpus. *I

IMPLEMENTATION

217

public Collocate(String word, int f, int F, int n, int N) { double observed; (double)f; double expected ; ( (double)F I (double)N ) * (double)n; double p ; (double)F I (double)N; double std; Math.sqrt((double)n * p * (1.0-p)); collocate ; word; zscore ; (observed - expected) I std; tscore; (observed- expected) I Math.sqrt(observed); mutual ; Math.log(observedlexpected) I Math.log(2.0);

In order to make the computations as explicit as possible we are using some temporary variables to store the observed and expected frequencies. The implementation of the formulae is exactly the same as in Bambrook ( 1996). As the input to the formulae consists of integer variables, we have to promote them to doubles for the actual calculations. Without that we would get rounding errors, as the fractional component of a division's result would get lost. For most classes the toString () method simply prints out the class name followed by a unique identifier. In order to get something more useful, this method should be overwritten with an appropriate replacement. public String toString() { return(collocate+" T: "+tscore+" Z: ''+zscore+" MI: "+mutual);

In this case we want to print the collocate followed by the individual scores. This can serve as a useful short-cut to generating more customised output. The next group of methods are the usual accessor methods to retrieve the values of the individual variables: public String getCollocate () return(collocate);

public double getzscore() { return(zscore);

public double getTscore() { return(tscore);

public double getMI() { return (mutual) ; .

This methods should really have j avadoc-style documentation comments as they constitute a major part of the API; these comments are left out here for space reasons. It might seem too trivial to state that the method getTscore () returns the t-score

COLLOCATION ANALYSIS

218

of the collocate, but remember that somebody else looking at such a comment might appreciate this explicit description. And finally we need to implement the compareTo ( ) method as defined by the Comparable interface. As this method takes an Object as its parameter, we first need to check whether it is a Collocate or not. Only then can we proceed with the comparison, otherwise we would throw a ClassCastException to indicate that the two objects cannot be meaningfully compared. public int compareTo(Object other) throws ClassCastException if(other instanceof Collocate) Collocate coll ~ (Collocate)other; return(collocate.compareTo(coll.collocate)); else { throw new ClassCastException("Collocate !~ "+ other.getClass() .getName());

II end of class Collocate

If you wanted you could of course define meaningful comparisons with other objects, such as Strings, but in general this will be of little value. The compareTo () method is mainly used for maintaining the correct ordering in the sorted container classes, and you should only store objects of the same class (or derived subclasses) in a single container. There would be no reason to store both String and Collocate objects in the same container class.

11.3.2

The Comparators

As mentioned before, we want to be able to sort the Collocate objects according to any of the three scores, without having to write our own custom sort routines. The class itself can only have one single compareTo ( ) method, so we're restricted to just one sorting criterion. The answer provided by Java's sorting methods in the Arrays class (in the java. util package) is that they allow you to specify a separate object that implements the Comparator interface and is capable of comparing two of the objects you want to sort. So we have to implement three different classes, one for each score. The ZCompar a tor class compares two Co 11 ocate objects according to their z-score values. It works analogous to the compareTo ( ) method we just discussed for Collocate: I*

* ZComparator.java *I import java.util.Comparator; public class ZComparator implements Comparator { public int compare(Object ol, Object o2) int retval ~ 0; if(ol instanceof Collocate && o2 instanceof Collocate)

{

IMPLEMENTATION

219

Collocate c1 = (Collocate)o1; Collocate c2 = (Collocate)o2; if(c1.getzscore() < c2.getZscore()) retval = -1; if(c1.getzscore() > c2.getzscore()) retval = 1; else { throw new ClassCastException("ZCornparator: illegal arguments."); return(retval);

II end of class ZComparator

Again we have to cope with the possibility that the objects might not be of the correct type. As it stands, the object will sort the values in ascending order, i.e. the smallest values first, and the largest values at the bottom. In order to change this to descending order you will have to swap the 1 and -1 assignments to retval. Alternatively you could swap the greater-than sign and the less-than sign. The difference between the ZComparator and the other two, TComparator and MIComparator is minimal. Apart from the class name, only the lines doing the actual comparisons are different. For the TComparator these are: if(c1.getTscore() < c2.getTscore()) retval -1; if(c1.getTscore() > c2.getTscore()) retval = 1;

and for the MIComparator they are: if(c1.getMI() < c2.getMI()) retval if(c1.ge~MI() > c2.getMI()) retval

-1; 1;

To change the sorting order from ascending to descending you need to do the same as with the ZComparator.

11.3.3 The Span The Span class is used for preprocessing of the concordance lines that make up the input data. A single Span object will take one line and split it into tokens separated into left and right context. As many tokens as available will be extracted from the line, even though not all of them might be used later. The node word has to be in the middle of the line; this is the format that has been used as output in the concordancing components in chapter 6. I*

* Span.java

*I import java.uti1.StringTokenizer; class Span { private String left[]; private String right[];

For processing we require a StringTokenizer, and for storing the contexts we need two arrays of Strings. You will notice that the class is not declared public, and neither are any of its method. This is because it is only an auxiliary class which

220

COLLOCATION ANALYSIS

doesn't have to be used by classes other than the ones constituting the collocation system. If you wanted to use the class more generally you would have to add the public modifier to it. The constructor takes a single line and splits it into tokens. This is done separately for the left and right hand sides, in a separate method called tokenise ().As we start right in the middle of the node word, the first token of the right side and the last token of the left side has to be ignored, in fact, the first and last tokens of both sides have to be discarded, as the respective words could be truncated as well. Span{String line) { int middle = line.length{) I 2; right= tokenise{line.substring{middle)); left= tokenise{line.substring{O,middle));

In the tokenise () method we can largely build on previous efforts in tokenisation. By using the PreTokeniser class we can be sure that our software will always have the same definition of a token. If for some reason we want to change the PreTokeniser class, all classes using it adopt the new behaviour without requiring recompilation. private String[] tokenise{String line) StringTokenizer st =new StringTokenizer{PreTokeniser.tokenise{line)); String retval[] =new String[st.countTokens{)-2]; String dummy= st.nextToken{); for{int i = 0; i < retval.length; i++) { retval[i] = st.nextToken{); return{retval);

As mentioned before, we skip both the first and the last token of the string, as it might not be a complete word. To do this we allocate two words less than the number of available words, and read the first token before entering the loop in which we assign the valid words to the retval array. The getWord ( ) method retrieves a token from the span according to a given offset. This offset value will be negative for the left context and positive for the right context. For an offset value of zero, null will be returned. String getWord{int offset) String retval = null; if{offset > 0) { // right offset if{offset < {right.length)) { retval = right[offset-1]; }

if{offset < 0) { // left offset offset = left.length + offset; if{offset >= 0) { retval = left[offset];

IMPLEMENTATION

221

return(retval);

II end of class Span

The context arrays start counting at zero, so we need to subtract one from the value for the right hand side in order to map the offset value to the corresponding array index. The calculation for the left hand side looks a bit more complicated: as the offset value is negative, we have to add it to the length of the array to get the right mapping. Assuming the length of the array is 7, an offset value of -1 would result in an index value of 7 + ( -1) , which is 6, the index value of the last element in the array (remember that the valid index range would be from zero to six). The Span class allows us to refer to words in the environment by offset value. For this reason we will need to convert the input data from a list of concordance lines to a list of Span objects, one for each line. It is then a lot easier to operate on those.

11.3.4

The Collocator

The Col locator is our main processing class, the 'engine' of the collocation system. It coordinates all the relevant activities and turns a list of concordances (with a supplied benchmark frequency list) into a list of collocates with their respective scores. I*

* Collocator.java *I

import java.util.*; import java.io.*;

public class Collocator private Span subset[]; private FreqList ftable;

We need two class variables, the set of pre-processed concordance lines stored in subset, and the benchmark frequency list in ftable. In the constructor we fill the subset array with instances of Span objects according to the size of the input data, and we assign the provided FreqList object to ftable: public Collocator(String lines[], FreqList frequencies) subset= new Span[lines.length]; for(int i = 0; i < lines.length; i++) subset[i] =new Span(lines[i]); ftable = frequencies;

The next method, getCollocates (),does most of the work. It takes as parameters the extent of the left and right contexts, and a further value that we haven't

222

COLLOCATION ANALYSIS

discussed yet: the cut-off. The cut-off sets a threshold frequency which can be used to filter out hapax legomena and wrongly spelt words. These words are rare by definition and can disturb the result, as there is not really enough data available to allow a correct assessment of their collocational significance. If the cut-off has a value of zero, all words are included, with a value of one all words occurring once only are rejected, and so forth. The best setting for this parameter depends on the amount of data you are using, so you will have to experiment a bit. public Collocate[] getCollocates(int left, int right, int cutoff) Vector colls =new Vector(); FreqList flist = getFreqList(left,right); int n = flist.getN(); int N = ftable.getN();

In the first part, a Vector is set up which will be used to store the Collocate objects as we go along. Then we create the frequency list for the environment, which will contain the frequency of each collocate that we are evaluating. Then we get the sums of all frequencies stored in the respective frequency lists. These are the size of the environment (n) and the size of the reference data (N), both measured in number of tokens. Iterator it= flist.iterator(); while(it.hasNext()) { String word= (String)it.next(); int freq = flist.getFreq(word); if(freq > cutoff) { int cfreq = ftable.getFreq(word); Collocate coll =new Collocate(word,freq,cfreq,n,N); colls.addElement(coll);

Collocate retval[] =new Collocate[colls.size()]; colls.copyinto(retval); return (retval) ;

In the second part we iterate through our set of candidates (which is equivalent to all keys stored in the environment frequency list). For each word we get the frequency value, check whether it is above the cut-off point, and if so we retrieve the benchmark frequency and create a Collocate object with the necessary parameters. This we then add to the Vector, which we copy into an array of the right size once we've processed all potential collocates. What is left now is the method to produce the frequency list of the environment as stored in subset, our array of Span objects. Here we need to know how many words to either side we should take into account, and we create a FreqList object to which we add the tokens as we go along: private FreqList getFreqList(int left, int right) { FreqList flist =new FreqList(); for(int i = 0; i < subset.length; i++) {

IMPLEMENTATION

223

int offset = -1; String coll = null; for(int j = 0; j < left; j++) { do { coll = subset[i] .getWord(offset); offset--; } while(coll !=null && !Character.isLetterOrDigit(coll.charAt(O))); if(coll !=null) flist.add(coll); offset = 1; for(int j = 0; j < right; j++) { do { coll = subset[i] .getWord(offset); offset++; } while(coll !=null && !Character.isLetterOrDigit(coll.charAt(O))); if(coll !=null) flist.add(coll);

return ( flist);

The outer loop, controlled by the variable i, goes through all the lines of the data set, while the two inner loops, each controlled by a variable j, collect the relevant tokens. We retrieve the word at the offset position via another loop, which skips punctuation, as a comma is treated as a token by the PreTokeniser, but for the collocational analysis we would not normally want punctuation to be included. If we try to retrieve a word which is outside the correct span-range, coll will come back with a value of null which we simply ignore. Otherwise we add the token to the current frequency list, which we duly return at the end of the method. And finally we need to provide a main ( ) method which allows you to run the Col locator class as an application. The setup here is slightly more complex than with previous programs, as we depend on some external data, namely the concordances and the benchmark frequencies. These are kept in two external files, the names of which will have to be passed to the application through the command-line. public static void main(String args[]) throws IOException { if(args.length != 5) { System.err.println("usage: java Collocator " +" "); else { int leftSpan = Integer.parselnt(args[O]); int rightSpan = Integer.parseint(args[1]); int cutoff= Integer.parselnt(args[2]); args[3]; String concfile String freqfile = args[4];

First we check for the presence of the right number of command-line arguments. If there aren't five, then we print a usage message, otherwise we initialise a set

of variables with the corresponding values. As the first three are in t values, we need to convert them, as command-line arguments are always passed to main () as Strings. For this we're using the static method parselnt () of the Integer class. The remaining two parameters are Strings, so we'll just assign them straight away.

224

COLLOCATION ANALYSIS

FreqList flist = new FreqList(); BufferedReader br =new BufferedReader(new FileReader(freqfile)); flist.load(br); br. close() ; String conclist[] = Utility.readTextFile(concfile);

To conclude the preparatory actions we create a FreqLi s t instance and load it from the file which contains the benchmark frequencies, and we load the concordance lines into an array of Strings, which is the format the Col locator constructor requires. This is a generally useful task which is not really related to computing collocations, so the method readTextFile () has been put into a separate class called Utility. We will discuss that class further below. Collocator cells= new Collocator(conclist, flist); Collocate result[] = colls.getCollocates(leftSpan, rightSpan, cutoff); Comparator comp =new ZComparator(); Arrays.sort(result,comp); for(int i = 0; i < result.length; i++) System.out.println(result[i]);

II end of class Collocator

Once we have prepared the input data, the rest is easy: we create a Col locator object with the concordance lines and the frequency list we have loaded from the two data files, and then we retrieve the collocates by invoking the getCollocates () method. This method requires the other parameters we extracted from the commandline. By default we want to have the output sorted according to the z-score, so we create a ZComparator which we use to sort our array of Collocate objects. After that we simply print out the array, making use of the to String () method of the Collocate class which we have overwritten. If you want to have the collocates sorted by any of the other two scores, simply change the line Comparator comp =new ZComparator();

to either Comparator comp =new TComparator();

or Comparator comp =new MIComparator();

This is not a very elegant solution; it would be better to be able to change this through a further command-line parameter. This is left as an exercise to the reader.

TESTING

225

11.3.5 The Utility Class Before we can actually run the application, one more class needs to be described, Utility. This class has the same function as the Math class, to be a collection of useful routines which do not really fit anywhere else and are not related in any way. You can easily add more useful methods to such a class, but they all should be static, so that you never need an actual instance of this class. This class has at present just one method, which reads a text file and returns it in an array of Strings, one per line. This is used to load the concordance lines from an external file. I*

* Utility. java *I

import java.io.*; import java.util.*;

public class Utility { public static String[] readTextFile(String filename) throws IOException { BufferedReader br =new BufferedReader(new FileReader(filename)); Vector v =new Vector(); String line= br.readLine(); while(line != null) { v.addElement(line); line= br.readLine(); String retval[] =new String[v.size()]; v.copyinto(retval); return ( retval) ; II end of class Utility

The method does not require any object variables, so we can declare it static. At first we create a Buff eredReader for the file that we want to read, and a Vee tor for storing the lines. As we don't know initially how many lines there are going to be we will need a dynamic data structure which can grow if its capacity is exceeded. We then fill the vector with the lines we're reading, and once we have finished reading (which is when line will become null) we convert the Vector into an array. This is similar to the way we converted the vector of Collocate objects to an array in the Col locator class above.

11.4 TESTING As mentioned above, setting the Col locator application up is a bit more complicated than it was with previous applications. For this test run we assume that you will have a sample text available that you can use; to stay with the example of Barnbrook (1996) you might want to download the text of Frankenstein from the Gutenberg Project (for the address see section 12.2 ). At the time of writing the latest release was labelled 'frank13 '.The actual filename was 'frank13. txt'.

226

COLLOCATION ANALYSIS

First we need to create a word frequency list of the text. In order to avoid funny results, the 'small print' at the beginning of the file was removed, as was the last line which stated that this was the end of the text. Then you run java FreqListCreator frank13.txt

and shortly after you should find a file called 'frank13. txt. frq' in your working directory. This file contains all tokens with their respective frequencies. The next step is to extract some concordance lines to work on. For this we'll be using the IndexedConcordancer from chapter 6, so you first need to create an index for the file: java IndexCreator frank13.txt

This should give you a file 'frank13. txt. idx' with all words and their index positions in the file. This you can then use to extract some concordances, for example with java IndexedConcordancer frank13.txt monster> monster.cnc

This command runs the IndexedConcordancer on the text 'frank13. txt' and finds occurrences of monster, which there should be 33 of, and writes them to a file called 'monster. cnc'. Once we have gathered all the necessary data we can finally compute the collocations, for example with the command-line: java Collocator 4 4 1 monster.cnc frank13.txt.frq

This uses a context of four words either side and a cut-off of one, and the resulting output is a list of collocations, such as this: and T: -1.4865707297435484 Z: -1.1724742950776923 MI: -0.7353615578509783 in T: -0.749878280057944 Z: -0.6100622138038688 MI: -0.6137617273570343 wasT: -0.6294278992176262 Z: -0.5267567151201137 MI: -0.5311421106802238 toT: 0.041016404968706295 Z: 0.04187778870760741 MI: 0.024362297212415777 with T: 0.10130441121363348 Z: 0.10554580933679193 MI: 0.10723290989738887 my T: 0.16525497220856003 Z: 0.17338910329527538 MI: 0.11076676357094925 of T: 0.1825337589538807 Z: 0.1917126220111746 MI: 0.0962450803874195 but T: 0.3120925776841892 Z: 0.3546740330849931 MI: 0.3597173963902727 aT: 0.5347255942439666 Z: 0.617900223644449 MI: 0.39429054493769894 his T: 0.9108506511037365 Z: 1.3267314626074727 MI: 1.0766754429214191 would T: 1.048847407157465 Z: 2.065712071518409 MI: 1.9525850951967751 My T: 1.1311551673983486 Z: 2.5304697792831545 MI: 2.3208283829966776 then T: 1.2054329022498778 Z: 3.1392109948841322 MI: 2.759940017254379 said T: 1.2094479149445552 Z: 3.1803587962127433 MI: 2.7879543934239757 saw T: 1.2255079657232641 Z: 3.3567642585653785 MI: 2.905790883717834 The T: 1.2665401400317515 Z: 2.447141857581152 MI: 1.8955951166119456 he T: 1.3314054434537774 Z: 2.3091231666264207 MI: 1.5807964857826882 form T: 1.360010890994952 Z: 6.947976364089238 MI: 4.705492233232003 wretch T: 1.366033410036968 Z: 7.401945514954234 MI: 4.875417234674315 goT: 1.3760709417736612 Z: 8.379950754076035 MI: 5.212452221951886 on T: 1.3782496480525577 Z: 2.47829808471388 MI: 1.6855926757942983 behold T: 1.3901234862050316 Z: 10.651784342845051 MI: 5.875417234674314 depart T: 1.398153511594386 Z: 13.120768726868983 MI: 6.460379735395471 IT: 1.5860029573732628 Z: 2.125236084830328 MI: 0.7955515412635342 had T: 1.656710261038665 Z: 2.923867442839361 MI: 1.6274897212307298 created T: 1.712381342743365 Z: 16.069971299116308 MI: 6.460379735395471 meT: 1.7197028792240807 Z: 2.921600018023244 MI: 1.5145179082741203 that T: 1.8226664447295382 Z: 3.0746810096013175 MI: 1.4917129422002628 whom T: 2.134495297902427 Z: 10.019663645907034 MI: 4.460379735395471 the T: 2.4847364278191204 Z: 3.664238161152983 MI: 1.0532043538895979

227

STUDY QUESTIONS

The lines contain first the word, and then the three respective scores. They are sorted in reverse order according to the t-score, so the is the most significant collocate of monster, followed by whom. If you look at the z-score, you will find that created scores highest, and then depart; these two are also the top two collocates with the largest mutual information score.

11.5

STUDY QUESTIONS

1. As it stands the Collocator operates on a set of concordances in an external file that you had to prepare earlier. Combine the Col locator with a concordancing program from chapter 6 so that you can specify the node word instead of a filename, with the concordances being extracted automatically. 2. It is a bit cumbersome to have three different Comparator objects for the significance scores. Can you think of a way of combining them into a single object which can be set to sort according to any of the three scores? This could either be done in the constructor or via a method that allows you to change the sort mode. 3. To facilitate comparing the different scores, think of a way to work out the ranking of each collocate according to each score and then print out a list which will look like: created depart whom

1

1

2

3

3

15

2 1 6

This shows the word and the respective ranking for each of the scores.

12 Appendix 12.1

A LIST OF JAVA KEYWORDS

Java has a set of reserved keywords, which you cannot use as variable names. Some of these are fairly obvious, like if, while others (like goto) are not yet used but merely reserved for possible future use. To spare you from wondering why your program doesn't compile for no apparent reason when you called one of your variables operator, here is a list of the reserved words: abstract boolean break byte byvalue case cast catch char class canst continue default do double

else extends false final finally float for future generic go to if

implements import inner instanceof

int interface long native new null operator outer package private protected public rest return short

static super switch synchronized this throw throws transient true try var void volatile while

12.2 RESOURCES Table 12.1 contains a list of websites which contain useful resources for corpus work. Obviously, the Internet is a very volatile entity, with sites disappearing, changing addresses, and new sites coming up at a rate much higher than revised editions of this book will be published. For that reason, some of the URLs listed here might be no longer valid, even though they were at the time of writing. Your first resort when a link doesn't work should be to search for the name or title of that resource on a search engine of your choice. It is quite likely that you might pick up another link to the resource which still works, or even that you get hold of the resource's address directly.

12.3 RINGCONCORDANCEREADER In this section you will find the listing of the RingConcordanceReader class, which was not printed in full in chapter 6.

232

APPENDIX

http:lljava.sun.comlproductsljdkl http:llwww.blackdown.orgljava-linux http:l/www.apple.comljava http:llwww.clg.bham.ac.uk/ http:llwww.tei-c.org/ http:llwww.w3.orgl http:l/www.cs.vassar.eduiXCES/ http://www.ilc.pi.cnr.itiEAGLES/home.html http:llpromo.netlpg/ http:llota.ahds.ac.ukl http:llwww.unicode.orgl http:llwww.clg.bham.ac.uk/QTAG http:/lwww.megginson.comiSAXIsax.html http://www.xml.com/publriAElfred

JDK for Solaris and Windows JDK for Linux JDK for Apple Corpus Linguistics Group at Birmingham Text Encoding Initiative XML Standard XML Corpus Encoding Standard The EAGLES website Project Gutenberg Oxford Text Archive Unicode website QTag, a stochastic POS tagger The SAX website JElfred is a free XML parser for Java

Table 12.1: Useful Websites

,. * RingConcordanceReader.java

*I import java.io.*;

I** * This class filters input data to produce concordance lines.

* The readLine() method can be used to retrieve a single line. It * extends a BufferedReader, so no further buffering is necessary * with this class. * @author Oliver Mason * @version 1.0

*I public class RingConcordanceReader extends BufferedReader { private private private private private

char buffer[] = null; String node = null; int start = 0; int middle 0; int spaces = -1;

I** * Constructor. * Wraps a RingConcordanceReader around the provided Reader. Unlike the • BufferedReader, the buffer size refers to the line length, and not * to the size of the internal IIO buffer. * @param in the Reader that provides data. * @param node the node word of the concordance line. * @param width the linewidth of a concordance line.

*I public RingConcordanceReader(Reader in, String node, int width) { super(in); buffer= new char[width]; for(int i = 0; i < width; i++) buffer[i] '· this.node =node; middle= (width- node.length()) I 2;

I** * Constructor.

233

RINGCONCORDANCEREADER

* Creates a RingConcordanceReader with a default linewidth of 75 characters. * @param in the Reader that provides data. * @param node the node word of the concordance line.

*I public RingConcordanceReader(Reader in, String node) { this(in,node,75);

I** * Increment

a pointer value by one.

* If the end of the buffer is reached, the pointer value is reset to zero. * @param pointer the old pointer value.

* @param *I

the new pointer value.

private int nextPointer(int pointer) { pointer++; if(pointer >= buffer.length) pointer return(pointer);

0;

I** * Decrement * * * *

a pointer value by one. If the beginning of the buffer is reached, the pointer value is reset to point to the end of it. @param pointer the old pointer value. @param the new pointer value.

*I private int prevPointer(int pointer) { pointer--; if(pointer < 0) pointer return(pointer);

buffer.length - 1;

I** * Read the next character.

* This method is called internally by the readLine() method. * @return true if a character could be read, false when the end of the input has been reached.

* @throws IOException.

*I private boolean readNextChar ( ) throws IOException true; boolean retval I I we'r& in padding mode if (spaces > -1) { II we have used up all spaces if(spaces == 0) { retval = false; else { buffer[start] = start= nextPointer(start); middle= nextPointer(middle);

spaces--; retval = true;

else int next = read(); char nextC; II end of input reached if(next == -1) { spaces = buffer.length I 2; nextC

'

'·

else { (char)next; nextC if(Character.isiSOControl(nextC)) nextC buffer[start] = nextC;

234

APPENDIX start; nextPointer(start); middle nextPointer(middle); retval ; true; return(retval);

/** * Check whether the node word matches the current line.

* * * *

The current content at the middle of the buffer is matched against the node word. The characters before and after the node word have to be non-letter characters, to avoid partial word matches. @return true if there is a match, false otherwise.

*I

private boolean matchNode() { boolean retval ; true; if(Character.isLetter(buffer[prevPointer(middle)])) retval ; false; int position ; middle; for(int i ; 0; i < node.length() && retval ;; true; i++) { if(node.charAt(i) !; buffer[position]) retval; false; position ; nextPointer(position); if(Character.isLetter(buffer[position])) retval ; false; return(retval);

!**

* Read a concordance line. * @return a line or null if no more concordance lines are available. * @throws IOException.

*I

public String readLine() throws IOException String retval ; null; while(readNextChar() && !matchNode()) ; if(matchNode()) { StringBuffer res new StringBuffer(buffer.length); int pos ; start; for(int i ; 0; i < buffer.length; i++) { res.append(buffer[pos]); pas ; nextPointer(pos); retval; res.toString(); return(retval);

!**

* Main method for testing purposes. * The first command-line parameter is interpreted as a node word that * has to match. Data is read from System.in and concordances are printed

* from that.

* @param args

the command-line parameters.

*I

public static void main(String args[]) throws IOException { RingConcordanceReader cr ; new RingConcordanceReader( new InputStreamReader(System.in), args[O]); String line; cr.readLine(); while(line !; null) { System.out.println(line);

REFERENCES

235

line= cr.readLine();

l // end of class RingConcordanceReader

12.4 REFERENCES Bambrook, Geoff (1996) Language and Computers: A practical introduction to the Computer Analysis ofLanguage. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: EUP. Biber, Douglas (1988) Variation across speech and writing. Cambridge: CUP. Brill, Eric (1992) A simple rule-based part of speech tagger. In Third Conference on Applied Natural Language Processing, pages 152-155, Trento, Italy. Bulka, Dov (2000) Java Performance and Scalability: Server-side Programming Techniques, volume 1. Addison-Wesley. Church, Kenneth Ward and Patrick Hanks (1990) Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22-29. Clear, Jeremy (1993) From Firth principles: Computational tools for the study of collocation. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology. John Benjamins, Philadelphia/Amsterdam, pages 271-292. Galle, Matthias, Oliver Jakobs, Peter Kesten, Amancio Kolompar, and Alexander Mehler (1992) Dokumentation des studienprojektes ,aufbereitung des dpa korpus". Technical report, Dept. of Computational Linguistics, University of Trier. Horton, Ivor (1999) Beginning Java 2: A comprehensive tutorial to Java programming. Birmingham: Wrox Press. Mason, Oliver (1997) The weight of words: An investigation of lexical gravity. In Proceedings of PALC'97, pages 361-375, Lodz, Poland. Mason, Oliver (2000) Parameters of collocation: The word in the centre of gravity. In John M. Kirk, editor, Corpora Galore: Analyses and Techniques in Describing English. Rodopi, Amsterdam/Atlanta, GA. McEnery, Tony and Andrew Wilson (1996) Corpus Linguistics. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: EUP. Oakes, Michael P. (1998) Statistics for Corpus Linguistics. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: EUP. Paice, C.D. (1977) Information Retrieval and the Computer. Computer monographs. London: Macdonald and Jane's. Porter, M.F. (1980) An algorithm for suffix stripping. Program, 14(3):130.

236

REFERENCES

Stubbs, Michael (1995) Collocations and semantic profiles: on the cause of the trouble with quantitative studies. Functions of Language, 2(1):23-55. Taylor, Lita, Geoffrey Leech, and Steven Fligelstone (1991) A survey of english machine-readable corpora. In English Computer Corpora: Selected Papers and Research Guide, pages 319-354. Tufis, Dan and Oliver Mason (1998) Tagging romanian texts: a case study for qtag, a language independent probabilistic tagger. In Proceedings of the First International Conference on Language Resources & Evaluation (LREC), pages 589-596, Granada, Spain, May. Wirth, Niklaus (1971) Program development by stepwise refinement. Communications of the ACM, 14(4):221-227, April. Witten, Ian H., Alistair Moffat, and Timothy C. Bell 1994. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold.

Index abbreviation, 134 abstraction, 47, 87 access by keyword, 28 modifier, 11, 55 random, 28, 117 restrictions, 55 action, 49 add(), 89 algorithm choices, 129 collocation,215 definition, 13 sorting, 94 stemming, 179, 181 structure of, 15 tagging, 195 tokenisation, 119, 121, 134 alternate realities, 18 alternative, 19 ambiguity, 195 annotation, 33, 38 boundary marker, 38 external information, 38 font changes, 36 header block, 38 headerinformation,39 standards, 39 API, 55, 183 definition, 57 documentation, 186 event-based, 156 String, 67 application, 61, 101, 127, 225 module, 181 stand-alone, 123, 202 user-friendliness, 198 Application Programming Interface, 53

ArrayEnumeration, 87 ArrayList, 91 Arrays, 143,148,218 ASCII, 26 text, 36 attribute value, 169 AWT,64 backup copies, 37 Bank of English, 31 binarySearch(), 94 block definition, 19 blueprint, 48 BNC, 28, 31, 153 boolean value, 72 branching, 18 break, 166 Brown corpus, 33 browser, 32 buffer, 106 linear, 112 look-ahead, 164 ring, 106, 112 size, 117 BufferedlnputStream, 98 BufferedReade~ 121,139,187,203,225 byte code, 7 cache, 115 callback definition, 156 canonical form, 179 case lower, 74 upper, 74 CDATA, 155, 167 CES,40,45 chaos, 53

237

238 Character, 11 0 character apostrophe, 134 control, 74, 110, 127 encoding, 98 non-printable, 74, 108 normalisation, 165 printable, 127 punctuation, 110, 119, 134, 223 quotation mark, 134 white space, 134, 135 character data, 25 characters control codes, 26 regional, 26 charAt(), 76 class, 48, 100 'proper', 94 abstract, 158, 161 API, 53 complexity, 172 definition, 11 design, 49 example, 11 execute, 101 file, 7, 11 hierarchy, 58, 157 implementation, 120 in OOP, 47 instance, 110, 137 legacy, 87 library, 10, 66, 68, 78, 181, 186 maintenance, 62, 93 outside, 53 services, 53 source code, 10 sub-, 99, 108, 158, 168 super, 57, 81, 107, 111, 162 traffic light, 49 TrafficLight, 51 variable, 210 vs. interface, 85, 186 wrapper, 78, 98 clear(), 89 COCOA format, 39 coding verbosity, 190 Collection, 88 collection framework, 61, 87, 88, 94 Collections, 88, 94

INDEX collections framework, 79 collocate, 45, 213, 222, 224 collocation, 44, 213, 226 significance score, 44 collocation system, 221 collocational analysis, 4 command-line, 100, 137 arguments, 11 comment, 55 definition, 49 style, 56 Comparable, 72, 94, 149, 218 Comparator, 94, 149, 216, 218 comparators, 93 compareTo(), 218 comparison strings, 51 compiler, 6, 56, 135, 186 checks, 20 complexity of a program, 116 concat(), 73 concept, 130 concordance, 43 concordance lines, 105, 125, 215, 219, 221,225,226 concordancer, 117 ConcordanceReader, 106, 111, 112 concordances, 214 concordancing interactive, 117, 128 concordancing program, 38, 112 condition, 19, 24 consistency internal, 53 naming, 50 state, 54 constant, 25 as symbolic label, 163 definition, 20 constructor Collocate, 216 Collocator, 221 definition, 49 design, 120 example, 50 FreqList, 145 IndexCreator, 121 inheritance, 105 Rule, 189

239

INDEX Span,220 Stemmer, 182 String,68 Tagger, 198 XMLTag, 160 XMLTokeniser, 163 container class, 28 container classes, 218 definition, 79 contains(), 89 context, 44, 215, 219, 226 extended, 117 for disambiguation, 195 control flow, 18, 19, 87 multiple choice, 165 variation, 19 copy(), 95 copyright, 32, 33 corpus, 31 benchmark, 31 encoding guidelines, 40 FLOB, 34 Frown, 34 Kolhapur, 34 LOB,33 partition, 33 physical structure, 35 published sources, 38 sample, 31 spoken data, 31 structural integrity, 40 structured collections, 38 typographical errors, 42 word processing, 36 corpus linguistics, 39 corpus position, 118 data format, 133 data processing, 97 data structure, 124, 130, 142, 172, 181 definition, 78 dynamic,225 data type, 17, 25, 81 arra~28, 79,102,124,125 boolean, 190 cas~81, 110,125,168 char, 26, 68 character, 26 composite, 25, 27

double, 216 int, 24 mapping, 48 numerical, 25 primitive, 27, 79 String, 50 data-hiding, 54 database, 9 java.sql, 64 date, 34 decimal separator, 64 decision making, 19 default, 200 default value, 83 delimiter, 13 8, 169 derivations, 193 design bottom-up, 216 complexity, 181 constructor, 173 decoupling, 187 encapsulation, 130, 216 interdependency, 180 modular, 116 modularisation, 173 of classes, 144, 180 of collocator, 215 phase, 47 principles, 120 tagger, 196 top-down, 188, 197 desktop publishing, 133 development version, 182 diagnostic messages, 182 dictionary, 28, 79 digit, 26 digital library, 32 document tree structure, 156 documentation, 49, 55, 192, 208 API, 57, 72, 78, 109 class, 68 comments, 57,217 on-line, 56, 63 DOM, 156 domain, 47 download, 32 DTD, 153, 157, 172 definition, 154 duplicates, 143

240 EAGLES,35 EBCDIC,26 efficiency, 117 endsWith(), 72, 190, 191,206 engine, 180 entity name, 40 Enumeration, 82, 95, 139 environment, 214 of a word, 221 equals(), 24, 51, 71, 72, 89, 150, 184, 191 equalslgnoreCase(), 72 error channel, 141 error condition, 87 error message, 141 evaluation function, 213 exception, 141, 149, 187 handling, 87 run-time, 87 expression boolean, 17, 19,72 definition, 16 loop, 24 multi-word, 147 numerical, 134 sub-condition, 22 variable, 17 external storage, 28 fallback, 198 false, 17 file extension, 123 frequency list, 148 operations, 100 output, 200 pointer, 125 random access, 29 reading, 29, 97 saving to, 144 self-contained documents, 35 storage, 29 text, 138, 180, 225 writing,29 Filelndex, 123 FilelnputStream, 97 FileNotFoundException, 99 FileOutputStream, 97 FileReader, 98, 99, 203 FileTokeniser, 144 FileWriter, 101

INDEX fixed expression, 44 floating-point, 216 floppy disk, 28 formal grammar, 40, 154 FreqList, 144, 214, 221, 222 FreqListCreator, 144 frequency benchmark, 213, 222 cut-off, 222, 226 relative, 145 retrieve, 144, 146 storing, 143 frequency list, see list, frequency frequency table, 85 FrequencyComparator, 149 function, 49 functionality, 144 garbage collection, 52 hapax legomena, 222 hard disk, 28 hash sign for comments, 192 Has~ap,92, 121,124,145 capacity, 124 HashSet, 90 Hashtable, 81 hasMoreElements(), 85 HTML,41, 154,186 human typist, 31 IDE, 10 idiogram, 26 immutable Strings, 67 imperfect printing, 32 implementation, 130, 181, 217 implementation details, 87 implementation phase, 48 import, 62 index, 117,128,226 in file, 123 size, 118 storage of, 124 index number, 79 IndexCreator, 118, 121 IndexedConcordancer, 119, 125 indexOf(), 75 inflections, 193

241

INDEX information structural, 133 information retrieval, 179 inheritance, 59 initialise, 22 initialiser, 24 static, 105 input, 97 InputStream, 97 InputStreamReader, 112, 203 instanceof, 149, 168 instruction method, 49 Integer, 78, 82, 123 parseint(), 125, 147, 223 Interactive Fiction, 18 interface, 58, 119, 156, 216 instance, 121 List, 186 public, see API vs. class, 85, 102, 181 interfaces, 130 intermediate code, 7 intemationalisation, 64 Internet, 10, 214 andXML, 154 java.net, 64 texts on, 32 interpreter, 7 intuition, 44 IOException, 99, 141, 183 iterate, 222 Iterator, 87, 93, 95, 122 iterator, 144, 146, 183, 184 iterator(), 90 iterators, 79 Java extensions, 64 Java interpreter, 51 javac, 7 javadoc,56,68, 186,217 tags, 56 JDBC, 9 JDK, 10, 63, 68 JRE, 10 JVM, 7, 52, 137 and packages, 61 key, 79, 143, 148 key-value pair, 83

key/value pair, 81 keys(), 82 KWIC,43 KWOC,43 language mark-up, 133 lastlndexOf(), 75 layout information, 133 lemmatiser, 179 length, 79 of array, 125 length(), 70, 79 letter, 26 lexicon, 196, 201, 203 file, 204 lookup, 205 lexicon lookup, 195 LIFO, 84 line number, 163 LineNumberReader, 163 linguistic relevance, 32 LinkedList, 91 List, 91, 181 list alphabetical, 143 frequenc~ 144,215,222 merge, 147 sorted, 90, II 9 tagged words, 197 word, 148 word frequency, 142, 193, 213, 226 Locale, 74 look-ahead, 138 lookup table, 198 loop, 12,52 body, 24 counter, 12, 93 definition, 21 for, 23,24 head-driven, 23 processing, 183 while, 22 main(), 123 Collocator, 223 ConcordanceReader, Ill for testing, 148 method, II static, 137

INDEX

242 Stemmer, 185, 192 Tagger, 202 TrafficLight, 51 XMLTokeniser, 171 Map,92, 203 mapping of keys, 92 mark-up, 40 verification, 154 mark-up language, 40 match node, 110 matching case-sensitive, 205 longest-, 179, 207 pattern, 133 suffix, 179, 188 matching pairs, 84 matching words, 75 measurement, 114 memory, 27 core, 97 mental effort, 163 merge collections, 89 method, 49 accessor, 146, 160, 215 declaration syntax, 51 for reading, 164 getter, 200 in interface, 120 inheritance, 162 setter, 200 signature, 51 static, 137, 186, 210, 225 verbosity, 190 mismatch, 111 model, 47 morphological features, 195 multiple inheritance, 58 mutual information, 213 native speaker, 44 negation, 184 network connection, 64 new, 52,79,82 nextElement(), 85 node query, 129 node word, 107, 126, 213, 219

NoSuchElementException, 187 NullPointerException, 190 number, 25 floating point, 25 integer, 25 Object, 58, 81, 150 object, 27, 47, 100 creation of, 52 inOOP,47 OCR,32 ogre, 18 operation class, 48 operator, 16 assignment, 17 boolean, 17 comparison, 70 concatenation, 17, 67 negation, 89 query, 129 optimisation, 130 order alphabetical, 216 ascending, 219 descending, 219 frequency, 149 lexicographical, 72 natural, 90, 94 of rules, 207 sequential, 79, 157 sorting, 72, 219 word, 214 output, 97 OutputStream, 97 package, 157 creating, 64 definition, 61 standard, 63 stemmer, 181 sub-, 63 padding, 109 mode, 109 parameter command-line, 171, 181, 185, 202, 223 method, 49 parse error, 167 part-of-speech

INDEX in lexicon, 203 part-of-speech tagger, 195 example, 20 performance measuring, 114 persistent, 203 pocket calculator, 19 portability, 7 PositionLister, 118 preprocessing, 219 PreTokeniser, 137, 200, 220, 223 PrintWriter, 101 private, 55, 107 variable, 148 probability, 196 bigram, 196, 201, 210 trigram, 196 process, 47 production version, 182 program definition, 13 maintenance, 207 program counter, 18 programming bottom-up, 15 by stepwise refinement, 15 compiled languages, 6 interpreted languages, 7 language, 5 top-down, 15, 24 programming overhead, 173 Properties, 83, 119, 124, 204, 207 Property, 203 property class, 50 object, 48 pseudo-code, 15 block, 19 example, 20 loop, 23,25 public, 55 class, 219 punctuation mark, 26 PushBackReader, 163 query complex, 118 language, 118 processing, 129

243 RandomAccessFile, 125 Readers, 98 reading sequential, 117 readLine(), 98, 106 recurrent pattern, 44 reference to object, 205 register variation, 4 remove(), 89 repetition, 18 representation numerical, 50 research prototypes, 174 reserved keyword, 231 resource creation, 39 resource file, 209 return value, 140, 183 reverse(), 95 robot, 14 rounding errors, 217 rule, 180 file, 183 loader, 180 production, 179 reduction, 179 rule-based, 195 run time, 6 RuntimeException, 87 SAX, 156 scanner, 32 scanning, 31 scope, 19 score, 201, 213 SDK, 10 search, 117 security, 53 seek(), 127 segmentation, 133 sentence, 134 Set, 90 SGML,36,40 and XMLTokeniser, 172 generating output, 147 parser, 40 tags, 40 vs.XML, 153 shared features, 88

244 shopping list, 16 significance, 213, 222 singular inheritance, 58 size(), 90 software engineering, 14 software development cycle, 192 sort example, 101 sort(), 94 Sorted~ap,93, 102,143 SortedSet, 90, 143 sorting, 148 collocates, 216 sorting order, 149 source code, 5, 7, 10, 48, 55 access to, 207 indentation, 205 source file, 157 span,214,215,220 specification, 92, 192 speed of execution, 115 spell checker, 32, 42 Stack definition, 84 example, 172 stack, 172 example, 91 standard library, 97 startsWith(), 72 state, 121 statement, 17 definition, 16 static, 208 stem(), 183 Stemmer, 181, 182 stemmer, 179 implementation, 182 stochastic, 195 storage physical, 33 stream, 97 filter, 97 input, 97 output, 97 String,61,68, 72, 73,112,190,199,203, 208 concatenation, 73 definition, 66

INDEX example, 17 in Properties, 83 literal, 66 methods, 75 string literal, 184 StringBuffer, 67, 68, 77, 126, 163, 190 StringTokenizer, 125, 135, 137, 139, 147, 187,219 substring, 76, 110 substring(), 77, 190 suffix analyser, 196, 205, 206 suffix analysis, 201 suffix replacement, 183 super(), 107 Swing, 9, 64 switch, 165, 198, 199 syntactic patterns, 195 System in, 203 system resources, 142 System.err, 101 System.in, 112 t-score, 213 tag, 153 empty, 170 opening, 170 POS, 201 tagger, 195, 197 tagging engine, 197 TEl, 40,45 template, 48 testing, 51, 192 text annotated, 162 literal, 155 plain, 133, 134 processing, 135 text files, 32 texts academic papers, 32 classics, 32 newspaper articles, 32 this, 107, 205 self-reference, 122 this(), 107 threshold frequency, 222 token, 15, 120, 121, 134, 140, 146, 187, 220

245

INDEX concordance, 45 digits, 204 tokenisation, 133-135, 187, 220 tokeniser pre-, 136 tokens, 219 toString(), 159, 171, 189, 217, 224 trace-mode, 182 traffic light class, 49 training corpus, 196 transition matrix, 196, 201, 208 tree structure, 34 TreeMap, 92 TreeSet, 90 trim(), 74, 103 true, 17 truth value, 17, 24 type, 118 word list, 45 type safe, 81 Type.java, 100 types, 16 underscore format, 197 Unicode, 5, 26, 98, 137 Strings, 67 Unix convention, 192 usage message, 223 Utility, 225 utility classes, 63 variable, 12, 70 boolean, 121 class, 48 comparing, 24 declaration, 17 definition, 17, 20 expression, 25 instance, 163 overwrite, 141 precision, 26 range, 26 read-only, 148 storage area, 25 temporary, 109, 217 vs. literal, 184 Vector, 122, 222, 225 definition, 80

Vehicle, 88 virtual machine, 10 void, 49 web page, 32, 55, 154 well-formed, 172 word boundaries, 134 definition, 134 word frequency, 78 word list, 118 word processing, 36, 133, 134 word-frequency list, 42 WordFreq, 148 WordLister, 143 WordPositionReceiver, 121 wrapper, 85, 202, 204 Writers, 98 XML, 36, 41, 45 comment, 155 declaration, 155 definition, 153 entity declaration, 155 for annotations, 153 generating output, 147 output, 197 parser, 155, 156 processing instruction, 155 specification, 156 well-formed, 172, 174 XMLCharData, 162 XMLData, 159, 161 XMLElement, 158 XMLFormCheck, 172 XMLinstruction, 159 XMLParseError, 167 XMLTag, 159 XMLTokeniser, 162, 171, 173 z-score, 213