Programming for Corpus Linguistics
EDINBURGH TEXTBOOKS IN EMPIRICAL LINGUISTICS
CORPUS LINGUISTICS
by Tony McEnery and Andrew Wilson

LANGUAGE AND COMPUTERS: A PRACTICAL INTRODUCTION TO THE COMPUTER ANALYSIS OF LANGUAGE
by Geoff Barnbrook

STATISTICS FOR CORPUS LINGUISTICS
by Michael Oakes

COMPUTER CORPUS LEXICOGRAPHY
by Vincent B. Y. Ooi

THE BNC HANDBOOK: EXPLORING THE BRITISH NATIONAL CORPUS WITH SARA
by Guy Aston and Lou Burnard

PROGRAMMING FOR CORPUS LINGUISTICS: HOW TO DO TEXT ANALYSIS WITH JAVA
by Oliver Mason
EDITORIAL ADVISORY BOARD
Ed Finegan, University of Southern California, USA
Dieter Mindt, Freie Universität Berlin, Germany
Bengt Altenberg, Lund University, Sweden
Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen, Norway
Jan Aarts, Katholieke Universiteit Nijmegen, The Netherlands
Pam Peters, Macquarie University, Australia
If you would like information on forthcoming titles in this series, please contact Edinburgh University Press, 22 George Square, Edinburgh EH8 9LF
EDINBURGH TEXTBOOKS IN EMPIRICAL LINGUISTICS
Series Editors: Tony McEnery and Andrew Wilson
Programming for Corpus Linguistics How to Do Text Analysis with Java
Oliver Mason
EDINBURGH UNIVERSITY PRESS
© Oliver Mason, 2000
Edinburgh University Press
22 George Square, Edinburgh EH8 9LF
Transferred to digital print 2006
Printed and bound by CPI Antony Rowe, Eastbourne
A CIP record for this book is available from the British Library
ISBN-10 0 7486 1407 9 (paperback)
ISBN-13 978 0 7486 1407 3 (paperback)
The right of Oliver Mason to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
Contents

I Programming and Corpus Linguistics

1 Introduction
  1.1 PROGRAMMING IN CORPUS LINGUISTICS
    1.1.1 The Computer in Corpus Linguistics
    1.1.2 Which Programming Language?
    1.1.3 Useful Aspects of Java
    1.1.4 Programming Language Classification
  1.2 ROAD-MAP
    1.2.1 What is Covered
    1.2.2 Other Features of Java
  1.3 GETTING JAVA
  1.4 PREPARING THE SOURCE
    1.4.1 Running your First Program
  1.5 SUMMARY

2 Introduction to Basic Programming Concepts
  2.1 WHAT DOES A PROGRAM DO?
    2.1.1 What is an Algorithm?
    2.1.2 How to Express an Algorithm
  2.2 CONTROL FLOW
    2.2.1 Sequence
    2.2.2 Choice
    2.2.3 Multiple Choice
    2.2.4 Loop
  2.3 VARIABLES AND DATA TYPES
    2.3.1 Numerical Data
    2.3.2 Character Data
    2.3.3 Composite Data
  2.4 DATA STORAGE
    2.4.1 Internal Storage: Memory
    2.4.2 External Storage: Files
  2.5 SUMMARY

3 Basic Corpus Concepts
  3.1 HOW TO COLLECT DATA
    3.1.1 Typing
    3.1.2 Scanning
    3.1.3 Downloading
    3.1.4 Other Media
  3.2 HOW TO STORE TEXTUAL DATA
    3.2.1 Corpus Organisation
    3.2.2 File Formats
  3.3 MARK-UP AND ANNOTATIONS
    3.3.1 Why Use Annotations?
    3.3.2 Different Ways to Store Annotations
    3.3.3 Error Correction
  3.4 COMMON OPERATIONS
    3.4.1 Word Lists
    3.4.2 Concordances
    3.4.3 Collocations
  3.5 SUMMARY

4 Basic Java Programming
  4.1 OBJECT-ORIENTED PROGRAMMING
    4.1.1 What is a Class, What is an Object?
    4.1.2 Object Properties
    4.1.3 Object Operations
    4.1.4 The Class Definition
    4.1.5 Accessibility: Private and Public
    4.1.6 APIs and their Documentation
  4.2 INHERITANCE
    4.2.1 Multiple Inheritance
  4.3 SUMMARY

5 The Java Class Library
  5.1 PACKAGING IT UP
    5.1.1 Introduction
    5.1.2 The Standard Packages
    5.1.3 Extension Packages
    5.1.4 Creating your Own Package
  5.2 ERRORS AND EXCEPTIONS
  5.3 STRING HANDLING IN JAVA
    5.3.1 String Literals
    5.3.2 Combining Strings
    5.3.3 The String API
    5.3.4 Changing Strings: The StringBuffer
  5.4 OTHER USEFUL CLASSES
    5.4.1 Container Classes
    5.4.2 Array
    5.4.3 Vector
    5.4.4 Hashtable
    5.4.5 Properties
    5.4.6 Stack
    5.4.7 Enumeration
  5.5 THE COLLECTION FRAMEWORK
    5.5.1 Introduction
    5.5.2 Collection
    5.5.3 Set
    5.5.4 List
    5.5.5 Map
    5.5.6 Iterator
    5.5.7 Collections
  5.6 SUMMARY

6 Input/Output
  6.1 THE STREAM CONCEPT
    6.1.1 Streams and Readers
  6.2 FILE HANDLING
    6.2.1 Reading from a File
    6.2.2 Writing to a File
  6.3 CREATING YOUR OWN READERS
    6.3.1 The ConcordanceReader
    6.3.2 Limitations & Problems
  6.4 RANDOM ACCESS FILES
    6.4.1 Indexing
    6.4.2 Creating the Index
    6.4.3 Complex Queries
  6.5 SUMMARY
  6.6 STUDY QUESTIONS

7 Processing Plain Text
  7.1 SPLITTING A TEXT INTO WORDS
    7.1.1 Problems with Tokenisation
  7.2 THE STRINGTOKENIZER CLASS
    7.2.1 The StringTokenizer API
    7.2.2 The PreTokeniser Explained
    7.2.3 Example: The FileTokeniser
    7.2.4 The FileTokeniser Explained
  7.3 CREATING WORD LISTS
    7.3.1 Storing Words in Memory
    7.3.2 Alphabetical Wordlists
    7.3.3 Frequency Lists
    7.3.4 Sorting and Resorting
  7.4 SUMMARY

8 Dealing with Annotations
  8.1 INTRODUCTION
  8.2 WHAT IS XML?
    8.2.1 An Informal Description of XML
  8.3 WORKING WITH XML
    8.3.1 Integrating XML into your Application
    8.3.2 An XML Tokeniser
    8.3.3 An XML Checker
  8.4 SUMMARY

II Language Processing Examples

9 Stemming
  9.1 INTRODUCTION
  9.2 PROGRAM DESIGN
  9.3 IMPLEMENTATION
    9.3.1 The Stemmer Class
    9.3.2 The RuleLoader Class
    9.3.3 The Rule Class
    9.3.4 The Rule File
  9.4 TESTING
    9.4.1 Output
    9.4.2 Expansion
  9.5 STUDY QUESTIONS

10 Part of Speech Tagging
  10.1 INTRODUCTION
  10.2 PROGRAM DESIGN
  10.3 IMPLEMENTATION
    10.3.1 The Processor
    10.3.2 The Lexicon
    10.3.3 The Suffix Analyser
    10.3.4 The Transition Matrix
  10.4 TESTING
  10.5 STUDY QUESTIONS

11 Collocation Analysis
  11.1 INTRODUCTION
    11.1.1 Environment
    11.1.2 Benchmark Frequency
    11.1.3 Evaluation Function
  11.2 SYSTEM DESIGN
  11.3 IMPLEMENTATION
    11.3.1 The Collocate
    11.3.2 The Comparators
    11.3.3 The Span
    11.3.4 The Collocator
    11.3.5 The Utility Class
  11.4 TESTING
  11.5 STUDY QUESTIONS

III Appendices

12 Appendix
  12.1 A LIST OF JAVA KEYWORDS
  12.2 RESOURCES
  12.3 RINGCONCORDANCEREADER
  12.4 REFERENCES

Index
To Joanna, for all her help and assistance
1 Introduction
Corpus linguistics is all about analysing language data in order to draw conclusions about how language works. To make valid claims about the nature of language, one usually has to look at large numbers of words, often more than one million. Such amounts of text are clearly outside the scope of manual analysis, and so we need the help of computers. But the computer, powerful though it is, is not an easy tool to use for someone with a humanities background, and so its use is generally restricted to whatever ready-made programs are available at the moment.

This book is an introduction to computer programming, aimed at corpus linguists. It has been written to enable corpus linguists without any prior knowledge of programming (but who know how to switch a computer on and off) to create programs to aid them in their research, analysing texts and corpora. For this purpose it introduces the basic concepts of programming using the programming language Java. This language is not only suitable for beginners, but also has a number of features which are very desirable for corpus processing.

This book is also suitable for students of computer science, who do have some background in computing itself, and want to venture into the language processing field. The basics of text processing are explained in chapter 3, which should do what chapter 2 does for non-programmers: give enough of an introduction to make practical work in the field possible, before proceeding with examples and applications later on.

After having finished this book, you will know how to produce your own tools for basic text processing tasks, such as creating word lists, computing parameters of a text, and concordancing. To give you an idea of how to go about developing more complex software we will look at a few example projects: a stemmer, a part-of-speech tagger, and a collocation program.

What this book obviously cannot do is to provide a full discussion of both corpus linguistics and programming. Both subjects are large enough for introductory books in their own right, and in computing especially there are many of them, targeted at all levels. In corpus linguistics there have been some introductory books published recently, and the other books in this series will serve well if you want to delve deeper into the subject. This book brings the two areas together. Hopefully, this book will whet your appetite and will make you want to go further from the foundations provided here.
1.1 PROGRAMMING IN CORPUS LINGUISTICS
Corpus linguistics originates from linguistics, as a branch concentrating on the empirical analysis of data. However, the role of the computer is an extremely important one, and without machine-readable corpora corpus linguistics would not have got very far. In order to process such corpora one requires software, i.e. computer programs that analyse the data, and that creates a need for programming skills. This section will briefly discuss some of the problems arising from the current situation of corpus linguistics, arguing that the IT skills crisis that threatens industry also has an effect on the academic world.
1.1.1 The Computer in Corpus Linguistics
The computer is the basic tool of the corpus linguist when it comes to analysing large amounts of language data. But as soon as the task at hand goes beyond the gathering of a small set of concordance lines from a small corpus, one finds that there is no adequate software available for it. This means that in order to perform an empirical investigation, one either has to resort to doing it manually, or create an appropriate piece of software oneself. Manual data analysis is not only tedious and time-consuming, it is also prone to errors, as the human mind is not suited for dull repetitive tasks, which is what counting linguistic events often amounts to. This, however, is exactly what the computer is extremely good at.

Taking the point further it would not be far-fetched to say that corpus linguistics in its current form cannot work without the help of the computer. Some techniques, for example collocational analysis (Church and Hanks, 1990; Clear, 1993), and the study of register variation (Biber, 1988) could simply not be applied manually. In fact, a corpus is really only useful if available in machine-readable form, and the meaning of 'corpus' is indeed coming to imply machine-readable, so that a printed corpus is now the exception rather than the rule (McEnery and Wilson, 1996).

This dependency on the computer for all but the most basic work is obviously not without problems. The problem here is that a researcher is limited by the available software and its functionality, and will often have to change the way in which he or she approaches a question in order to be able to use a given program. Developers on the other hand are guided by what they are interested in themselves (due to lack of communication with potential users), or what is easy to implement. The main requirement for corpus linguistics software thus seems to be flexibility, so that a corpus can be explored in new ways, unforeseen by the developer. This, however, can only be achieved with some sort of programming on the user's part.

When there is no software available which is suitable for a given task there are two main solutions: users can develop the right software themselves, or get a developer in to do it. The problem with the do-it-yourself approach is that it requires programming expertise, which can be difficult to acquire. Programming itself is not only time-consuming, but it can also be rather frustrating when the resulting program does not work properly and needs to be debugged. It is generally worth checking to what degree a computer needs to be involved in a project (see Barnbrook, 1996): quite often the effort required to reformulate a problem so that a
computer can be used at all is far higher than the effort of doing the work manually, especially when the task does not involve repetitive or large-scale counting.

1.1.2 Which Programming Language?
A computer program is written in a special code, the so-called programming language. There are several levels of machine instructions, and the programmer usually uses a higher level language, which gets translated into the actual machine code, which depends on the computer's processor. Programming has come a long way since the middle of the last century, when it basically consisted of configuring switches on large front panels of computers which needed huge air-conditioned rooms to be stored in. Instead, writing a computer program nowadays is not very different from writing an academic paper: you sit down in front of the screen and type into a text editor, composing the program as you go along, changing sections of it, and occasionally trying out how it works.

Today there is a multitude of programming languages around, with Java being one of the most recent developments. All of these languages have been developed for a specific purpose, and with different design goals in mind. There are languages more suitable for mathematical computing (like Fortran), Artificial Intelligence research (like Lisp) and logic programming (like Prolog). Thus the first task is to find a language that is well suited for the typical processes that take place in corpus analysis. However, there are other aspects as well: the language should be reasonably easy to learn for non-computing experts, as few corpus linguists have a degree in computer science, and it should be available on different platforms, as both personal computers and larger workstations are typically used in corpus processing.

For this book Java has been chosen. Java is mainly known for being used to 'enhance' web pages with bits of animations and interactivity, but at the same time it is a very powerful and well-designed general purpose programming language, an aspect that has contributed to making it one of the most widespread languages in a relatively short time. In the following sections we will have a brief look at those aspects of Java that make it particularly useful for corpus analysis, before giving a detailed outline of the road ahead.
1.1.3 Useful Aspects of Java

The first feature of the Java language that is particularly suitable for working with corpus material is its advanced capability to deal with character strings. While most other languages are restricted to the basic Latin alphabet, with possibly a few accented characters as extensions, Java supports the full Unicode character set (see http://www.unicode.org/). That means that it is quite easy to deal with languages other than English, however many different letters they may have. This includes completely non-Latin alphabets such as Greek, Cyrillic and even Chinese.

Apart from being able to deal with different character sets without problems, Java itself has a very simple and straightforward syntax, which makes it easy to learn. The instructions, the so-called source code, are easily readable and can be understood quickly, as the designers of the language decided to leave out some powerful but cryptic language constructs. The syntax originates from the programming language
C, but it is only a carefully chosen subset, which makes it easier to learn and maintain programs. The loss of expressional power is only marginal and not really relevant for most programming tasks anyway.

A more important aspect is that Java is object-oriented. We will discuss in detail what this means in chapter 4, so for now all we need to know is that it allows you to write programs on a higher level of abstraction. Instead of having to adjust yourself to the way the computer works, using Java means that you can develop programs a lot faster by concentrating on the task at hand rather than how exactly the computer is going to do the processing.

Java combines elements from a variety of older programming languages. Its designers have literally followed a pick-and-mix approach, taking bits of other languages that they thought would combine into a good language. It also seems that the language was targeted at a general audience, rather than specialist programmers. Several aspects have been built into the language that aid the developer, such as the automatic handling of memory and the straightforward syntax. It also comes with an extensive library of components that make it very simple to build fairly complex programs.

When a computer executes a program, a lot of things can go wrong. This can be due to external data which either is not there or is in a different format than expected, or the computer's disk could be full just when the program wants to write to it, or it tries to open a network connection to another machine when you have just accidentally disconnected the network cable. These kinds of errors cannot be dealt with by changing the program, as they happen only at run time and are not predictable. As a consequence, they are difficult to handle, and catering for all possible errors would make programming prohibitively complicated. Java, however, has an error handling mechanism built into it, which is easy to understand, adds only a small overhead to the source code, and allows you to recover from all potential errors in a graceful way. This allows for more robust programs that won't simply crash and present the user with a cryptic and unhelpful error message.
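To give a first flavour of what this error handling looks like in practice (it is explained properly in later chapters), here is a minimal sketch. The class name, the file name passed on the command line and the messages are made up for illustration; only the try/catch mechanism and the standard java.io classes are real.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FirstLine {
    public static void main(String args[]) {
        if (args.length == 0) {
            System.out.println("Please give the name of a text file.");
            return;
        }
        try {
            // attempt to open the file and print its first line
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            System.out.println(in.readLine());
            in.close();
        } catch (IOException e) {
            // this block is only entered if something goes wrong at run time,
            // for example if the file does not exist or cannot be read
            System.out.println("Could not read " + args[0] + ": " + e.getMessage());
        }
    }
}

Instead of crashing with a cryptic message, the program reports the problem and ends gracefully.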
1.1.4 Programming Language Classification

There are, of course, alternatives to Java, mostly the languages that it took pieces from in the first place. However, none of the alternatives seems to be quite as good a mixture as Java, which is one reason for Java's unparalleled success in recent years. In a rather short period of time Java matured into a language that is being used for all kinds of applications, from major bank transaction processing down to animating web pages with little pictures or bouncing text. In this section we will have a brief look at what kind of a language Java is, and how it can be placed in a general classification of programming languages. This will help you to understand more about how Java works.

One feature with which programming languages can be classified is the way their programs are run on the computer. Some programs are readily understood by the machine and run just by themselves. These are compiled languages. Here the source code is processed by a translation program that transforms it into a directly executable program. The translation program is called a compiler. Compiled
languages require some initial effort to be translated, but then they are fairly speedy. Thus, compiled languages are best suited for programs that are run very frequently, and which don't get changed too often, as every change requires a re-compilation and the respective delay while the compilation takes place. This also slows down development, but an added bonus is that a lot of errors in the source code can be detected by the compiler and can be corrected before the program actually runs for real.

The other major class are interpreted languages. Here the source code is interpreted by another program as the program is executed. This other program is the interpreter. This means that execution time is slower, as the same source code has to be re-interpreted every time it is executed, which is not the case with compiled languages, where this happens only once, during the translation. On the positive side it means that the programs can be run immediately, without the need for compilation, and also that they are not dependent on the actual machine, as they run through the interpreter. The interpreter is machine dependent, but not necessarily the program itself. Development time is quick, but some errors might go unnoticed until the program actually runs, as it is not checked in advance.

This distinction perfectly matches natural languages: if you want to read a book in a language you can't speak, you wait until someone translates the book for you. This might take a while, but then you can read it very quickly, as the translation is now in your own language. On the other hand, if it is not worth getting it translated (if you only want to read it once, for example) you could get someone to read it and translate it as you go along. This is slower, as the translator needs to read it, translate it, and then tell you what it means in your language. Also, if you want to go back to a chapter and read it a second time it will have to be translated again. On the positive side you don't have to wait for the translation of the whole book to be completed before you can start 'reading'.

Java has elements of both these types of languages: the source code is initially compiled, but not into a machine-specific executable program, but into an intermediate code, the so-called byte code. This byte code is then interpreted by the Java run-time interpreter. This interpreter is called the Java Virtual Machine (JVM), and it provides the same environment to a Java program, whatever the actual type of the computer is that it runs on. Arguably you get the worst of both worlds, the delay during the compilation and the slow execution speed at the same time. However, it works both ways: the byte code can be interpreted much more quickly than actual source code, as it is already pre-translated and error checked. It also keeps the portability, as the byte code is machine independent, and the same 'compiled' program works on a Windows PC as it does on a Unix workstation or an Apple Macintosh.

When you are working with Java, you first write your program in the form of its source code. Then you compile it, using the command javac, which results in a pre-compiled byte code, the so-called class file. If you want to execute that program, you then start the Java run-time environment with the java command, which will run the commands by interpreting the byte code.
1.2 ROAD-MAP

In a book like this it is obviously not possible to cover either all aspects of Java programming or all aspects of corpus linguistics, so some decisions had to be made as to what would make it into the book and what was to be left out. This section briefly lists what you will find in the rest of the book, and what you won't find. Java has quickly grown into a very comprehensive programming language, and tomes with thousands of pages are being produced to describe it. However, a lot of this is not really relevant for the corpus linguist, but that still leaves a lot of useful elements for text and corpus analysis.
1.2.1 What is Covered
In the following chapter, Introduction to Basic Programming Concepts, there is a brief introduction to computer programming in general. It is intended for readers who have not done any programming so far, and will provide the foundation for later chapters. Basic Corpus Concepts, the third chapter, introduces the basics of corpus linguistics. By reading both of them, a framework is established to bring corpus linguistics and programming together, to see how programming can be utilised for linguistic research.
The next chapter, Basic Java Programming, introduces the Java programming language, building on the concepts introduced in the first chapter. It will show how they can be realised in actual programming code. In The Java Class Library we then have a look at some of the standard classes that you can use in your own programs. Reusing existing classes greatly speeds up program development, and it also reduces the scope for errors, as existing classes are less likely to contain a lot of undetected bugs.

Then, Input/Output gets you going with the single most important task in corpus linguistics. After showing how to read texts and print out text we will apply our newly acquired skills to investigate several different ways of creating a concordance. Afterwards, in chapter 7, we look into processing full texts instead of just single words. Identifying words in a stream of input data is one of the fundamental processing steps, on which a lot of other tasks depend later on.

Most corpora nowadays contain annotations. How to process annotations in a corpus is the topic of chapter 8. Here we will look at mark-up, concentrating on XML, which is not only becoming the new standard for the Web, but as a simplified variant of SGML is also very relevant for corpus encoding.

And finally we will investigate three case studies, which are self-contained little projects you might find useful in your day-to-day research. Taking up threads from companion books in the series, we will implement a stemmer from a brief description in Oakes' Statistics for Corpus Linguistics, a part-of-speech tagger described in a study question of McEnery & Wilson's Corpus Linguistics, and see how we can compute collocations as described in Barnbrook's Language and Computers. All these case studies start off from a simple description of how to do it, and you will end up with a working program that you can start using straight away.
1.2.2 Other Features of Java

As mentioned before, Java has developed (and still is developing further) into a vast language with extensions for all kinds of areas. However, a lot of those will not be directly relevant for your purposes, and in this section we will touch on some features which had to be left out of this book, but might be relevant for your further work. If you want to learn more about those features, I suggest you get a general Java book, like Horton (1999).

Throughout this list you will find one recurring theme, namely that system-specific operations are generalised to a more abstract level which is then incorporated into the language. A lot of operations which need to access external hardware components (such as a graphics card, or a network adapter) or other software modules (such as a database) are specific to a certain machine or operating system. However, by creating an abstract layer they can be treated identically across platforms, with the Java run-time environment filling the gap between the common operations and the way they actually work on a given computer.

Graphics

Graphics are another of Java's strong points. Dealing with graphical output is very platform specific, which means that there is no general support for it if your development environment needs to be portable. The developers of Java, however, have designed an abstract toolkit that defines graphical operations and renders them on the target platform. In the first version, this scheme suffered from deficiencies and subtle differences between the way buttons and other elements work in Windows and on the X Window system on Unix, but the developers chose a slightly different approach which works out much better: instead of relying on the operating system to provide user interface widgets, the new toolkit (called Swing) only makes use of basic operations such as drawing lines and providing windows, and all widgets are realised by the toolkit via those primitives. By leaving aside the 'native' widgets, user interface behaviour is now truly consistent across all platforms, and it is possible to program sophisticated user interfaces in a portable way. There is a large number of components, from labels, buttons and checkboxes to complex structures such as tables and ready-made file choosers. All these work on both proper applications as well as applets, even though there could be problems as browser implementations of Java always lag behind a bit and might not fully support Swing without needing an upgrade.

Databases

Similar to graphical operations, all database-related functionality has also been abstracted into a generalised system, the so-called JDBC (Java Database Connectivity). Here the programmer has access to a set of standard operations to retrieve data from databases, regardless of what database is actually backing up the application. To provide compatibility there is a set of drivers which map those standardised operations onto the specific commands implemented by the database itself. Together with the portable graphic environment this makes it easy to build graphical interfaces to database systems, which will continue to work even when the
database itself is switched to another version, as long as there is a driver available for it.

Networking

Java is closely associated with the Internet. This is mainly because it is used as a programming language for applets, small applications which run within a web browser when you visit a certain page, but it also has a range of components which allow easy access to data across networks, just as if it was on the local machine. Opening a web page from within an application is as easy as reading a file on your local hard drive. Furthermore it is possible to execute programs on other machines on a network, so that distributed applications can be implemented. By utilising the capacity of multiple computers at the same time very powerful and resource-intensive software can be developed.
1.3 GETTING JAVA

Unlike a lot of other languages, compilers for Java are freely available on the Internet. You can download the latest version from the Sun website; see the resources section (section 12.2 in the appendix). There are two different packages available, one called JRE and one called JDK or SDK.

The JRE is the 'Java Runtime Environment'. You need this if you just want to run programs written in Java. It contains the standard class library plus a virtual machine for your computer.

The JDK or SDK is the 'Java Development Kit', or the 'System Development Kit'. Sun, as the distributor, has changed the name of that package from version 1.2 onwards. This package contains all you need to develop programs in Java, and this includes the run-time environment as well. In fact, the Java compiler, javac, is itself written in Java. If you want to compile your own programs in Java you will need to get this package.

You also need an editor, in order to write the source files. You can use any text editor you want, as long as you can produce plain text files. Some companies offer so-called IDEs (Integrated Development Environments), sometimes with their own optimised compilers. As long as they support the full Java standard you can use one of those for development.
1.4 PREPARING THE SOURCE

There are a few more points to notice before we can start to write programs in Java; these are only minor details but nevertheless important, and remembering them can save you a lot of struggling with compiler error messages at later stages.

The source code of a class has to be in a plain text file. It is not possible to write a program in a word processing package and then compile it directly. You would either have to save it as plain text, or, what would be more sensible to start with, use a separate text editor for writing your source files. Java source files have to have the extension .java, so files must be saved accordingly. The Java compiler will reject any files
that have other extensions. Also, the filename needs to match that of the class that is defined in it. Each class thus has to be in a separate file, unless it is not publicly accessible from within other classes. Even if you could define several classes in one source file it would make it more difficult to find the relevant file when looking for the definition of a particular class. It makes matters much easier if the class Phrase can always be found in the file Phrase.java. The Java compiler will transform this into a binary file called Phrase.class which can be used by the JVM.
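To make the naming convention concrete, here is a minimal sketch of what such a file might contain. The Phrase class itself is just an empty placeholder invented for illustration; the only point that matters is that a public class called Phrase must live in a file called Phrase.java.

/*
 * Phrase.java -- must be saved under exactly this name,
 * because it defines the public class Phrase
 */
public class Phrase {
    // compiling this file with javac produces Phrase.class
}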
1.4.1 Running your First Program
Before we start with the theory, here is a simple example that you can try out for yourself. So far we are still fairly limited as to what language constructs we can use, so it necessarily has to be a small example. In the following chapters you will see what other classes Java provides to make programming easier, and how to deal with input and output.

The class we are developing now simply echoes its command-line arguments. That means you call it with a number of parameters, for example

java Echo This is a test of the Echo class

and as a result the class will print

This is a test of the Echo class

on the screen. The first two parts of the above command-line are the call to the Java interpreter and the name of the class to execute, and they are not counted as parameters.

When you try to execute a class, the Java interpreter loads the corresponding compiled bytecode into memory and analyses it. It looks for a method of the form

public static void main(String args[])

This means there has to be a method called main() which takes as argument an array of Strings, and it also has to be declared public and static. This is so that it can be accessed from the outside, and also without there being an instance of the object available. The array of Strings will be the command-line parameters, and with this knowledge we can now code the Echo class:

/*
 * Echo.java
 */
public class Echo {
    public static void main(String args[]) {
        System.out.println("The command-line arguments are:");
        for (int i = 0; i < args.length; i++) {
            System.out.println(i + ". " + args[i]);
        }
    }
} // end of class Echo
The definition of a class is introduced by an access modifier followed by the keyword class. Most classes you will deal with will be public, but again there are more fine-grained options. We will ignore these for the time being as they are
not relevant to the material in this book. After the name of the class the definition itself is enclosed in curly brackets. This is called a block of lines, and it is a recurring pattern in Java. You will find blocks in several other places, for example in method definitions.

Just save this listing into a file called Echo.java, compile it with javac Echo.java and then run it in the way described above. You will see that it behaves slightly differently, in order to make it a bit more interesting. Instead of simply echoing the parameters as they are, they are put into a numbered list on separate lines.

When the class gets executed, the JVM locates the main() method and passes the command-line parameters to it. Here we print out a header line before iterating through all the parameters. Note how we use the length field of the array args in the condition part of the for-loop. Inside the loop we have access to the loop counter in the variable i, and in the print statement we simply put the variable, a literal string containing a full stop and a space character, and the current parameter together into one string which will be put on the screen.
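For example, assuming the compilation worked, the command given earlier (java Echo This is a test of the Echo class) should produce output along these lines:

The command-line arguments are:
0. This
1. is
2. a
3. test
4. of
5. the
6. Echo
7. class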
1.5 SUMMARY

In this introductory chapter we first discussed the role of the computer in corpus linguistics, emphasising the fact that it becomes more and more relevant to be able to program when working with computer corpora. The preferred solution to this is to learn Java, a language that is suitable for text and corpus processing, yet easy to learn at the same time. Furthermore, Java is machine independent, which means that you can run Java programs on any platform. This is especially important when working with different machines at home and at work; a less powerful computer might not be as fast or might not be capable of processing large amounts of data, but at least you won't have to change your programs. This is due to the hybrid nature of Java, which is halfway between a compiled and an interpreted language.

After looking at a road-map of what you will find in this book, several other features of Java have been introduced briefly, so that if you want to extend your knowledge of programming towards graphical user interfaces, databases, or networking facilities, you know that you can still use Java, which saves you from having to learn another programming language. And finally, we have seen how to acquire the necessary tools that we need to develop our own programs in Java. This includes the development kit, which is available for free downloading. We have also written our first Java program.

In the next chapter we will have a look at programming in general. You will be made familiar with basic concepts that are necessary to begin programming, and these will be applied throughout the rest of the book. The emphasis here is on the bare necessities, keeping in mind that ultimately most readers are not interested in computer science as such, but only as a means to aid them in their linguistic research.
2 Introduction to Basic Programming Concepts

This section is for beginners who have not done any programming yet. It briefly introduces the algorithm, a general form of a computer program, and ways to express tasks in algorithmic form. Plenty of examples from non-programming domains are used to make it accessible to newcomers.
2.1 WHAT DOES A PROGRAM DO?
A computer program is simply a list of instructions which the machine executes in the specified order. These instructions have to be in a code that the machine can interpret in some way, which is roughly what a computer language is. In order to make it easier for humans to write programs, special programming languages have been developed which are more abstract and thus on a level closer to human thinking than the low-level manipulation of zeroes and ones that the computer ultimately does. One can go even further and design programs in an abstraction from programming languages, which would be the equivalent of jotting down an outline of a book or paper before filling in the gaps with prose text. This outline is called an algorithm in computing terminology. An algorithm is very much like a recipe, in that it describes step by step what needs to be done in order to achieve a desired result. The only problem is that the computer cannot cook, and will therefore be extremely stupid when it comes to interpreting the recipe: it does so without trying to make sense of it, which means it does literally what you tell it to do. You will find that while you're new to programming it usually does something that you didn't want it to do, mainly because it is so difficult for a programmer to think in the same simplistic and narrow-minded way as a computer does when executing a program. And that is also one of the main 'skills' of programming: you have to think on an abstract level during the design phase, but when it comes to coding you also need to be able to think in a manner as straightforward and pedantic as a computer. In the following sections we will have a closer look at what an algorithm is and how we can most easily express it. In the discipline of software engineering a variety of methods have been developed over time, and we will investigate some of them which might be useful for our purposes. The most important point here is that we don't care too much about the methods themselves, as we only view them as a means to the end of creating a computer program.
2.1.1 What is an Algorithm?
Suppose you are standing in front of a closed door, which we know is not locked. How do we get to the other side of the door? This is something most people learn quite early in life and don't spend much time thinking about later on, but the computer is like a little child in many ways, and it does not come with the knowledge of how to walk through a door. If you want to teach a robot to move around in the house, you need to describe the task it should perform in a list of simple steps to follow, such as: take hold of the door handle, press it down, pull or push the door and move through the resulting gap. Unless your robot is already a rather sophisticated one, you will now have to tell it what a door handle looks like and how it can be pressed down. Ultimately this would be described in more and more detail, until you have arrived at the level of simple movements of its hands or their equivalents. Of course you will have to describe a lot of movements only once, like 'move arm downwards' and you can refer to them once they have been defined. Such repeated 'procedures' could be things like gripping a handle, pushing an object, and eventually opening doors.

As a computer has no built-in intelligence, it needs to be told everything down to the last detail. However, at that rate writing a program to accomplish a certain task would probably take longer than doing it yourself, and there are a lot of basic operations that are required every time. In order to make programming easier, programming languages with higher levels of abstraction have been developed. Here the most likely tasks you would want to do are readily available as single commands, like print this sentence on the screen at the current position of the cursor. You don't have to worry about all the details, and thus high-level languages are considerably easier to learn and speed up development, with less scope to produce errors, as the size of programs (measured in the number of instructions written by the programmer) is reduced.

If you want to know, for example, what the length of a word is, you would have to go through it from beginning to end, adding one to a counter for each letter until you have got to the last letter of the word, making sure that you don't stop too early or too late. In Java, however, there is a single instruction that computes the length of a piece of text for you, and you haven't got to worry about all the details of how this is accomplished. High-level languages are one step towards easier communication with computers. Of course we are still a long way away from Star Trek-like interaction with them, but at least we no longer have to program machines on the level of individual bits and bytes and operations like load register X from memory address Y, shift accumulator content to the right or decrement accumulator and jump to address Z if accumulator is not zero.

Programming, then, is the formulation of an algorithm in a way that the computer can understand and execute. The first step in this is to be clear about what it is you want the computer to do. Unless you know exactly what you want you will not be able to get the computer to do it for you. From here there are several ways to continue, and a whole discipline, Software Engineering, is concerned with researching them. For our purposes we have to choose a method which is a good compromise between cost and benefits; we don't want to have to spend too much time learning how to
do this, but would still like to profit from using it. After all, programming is only a means to an end, which is exploring corpora.
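As a foretaste of what such a high-level instruction looks like, here is the word-length example from above in Java. The class and variable names are invented purely for illustration; the point is that the single call word.length() replaces the whole letter-counting procedure.

public class LengthDemo {
    public static void main(String args[]) {
        String word = "aardvark";
        // length() does the counting for us, no stepping through the letters required
        System.out.println(word + " has " + word.length() + " letters");
    }
}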
2.1.2 How to Express an Algorithm
A very natural and intuitive way to develop programs is called stepwise refinement (Wirth, 1971). Here you start with a general (and necessarily vague) outline of the algorithm, going over it multiple times. Every time each step is formulated in more detail, until you reach a level which matches the programming language after a number of iterations. This style of programming is called top-down programming, as you start at a high level of abstraction working your way 'down' to the level of the computer. The opposite would be bottom-up, where you start at small low-level routines which you combine into larger modules, until you reach the application level. Most of the time you will actually use both methods at the same time, developing the program from both ends until they meet somewhere in the middle.

In order to describe the algorithm we will be using pseudo-code, a language that is a mixture of some formal elements and natural language. At first, the description will contain more English, but with subsequent iterations there will be less and less of it, and more and more formal elements.

As an example consider the task of creating a sorted reverse word list from a text file. This is a list of all words of a text, sorted backwards, so that words with the same ending appear next to each other. Such a list could be used for morphological analysis, or an investigation of rhyming patterns. We start off with the following description:

1 read each word from the text
2 reverse the letters in the word
3 sort the list alphabetically
4 reverse the letters back
5 print out the sorted list
By reversing the letters of the words we can simply sort them alphabetically, and then reverse them back afterwards. This means we can use existing mechanisms for sorting lists instead of having to create a special customised one. That list of instructions would be sufficient for a human being to create such a list, but for the computer it is not precise enough. Nevertheless we can now go through it again and fill in a few more gaps:

1 for each word in the text
2   read the next word from the text
3   reverse the word
4   insert the word into a list
5 sort the list
6 for all words in the list
7   reverse word
8   print word
Note that we now have changed the structure of the algorithm by splitting it up into three main steps: creating a list of reversed words, sorting the list, and printing the list. This new version is much more precise in that it reflects how each word is being dealt with at a time, and we now also get an idea of how much work is involved: the first part of the program is executed once for each word (or token) in the text we're
looking at, whereas the second and third parts operate on the list of unique words (the types). We also have made explicit the relationship between a word and the list of words, and that we insert a word into the list, which was not obvious from the first draft.

While the second attempt is much more precise, it's still not good enough. Here is attempt number three:

1  open text file for reading
2  create empty list
3  while there are more words in the file
4    read next word from file
5    reverse word
6    check if word is in list
7      YES: skip word
8      NO: insert word into list
9  close input file
10 sort list alphabetically
11 for all words in list
12   reverse word
13   print word
What we have added here are quite obvious points that a human being would not think about. If someone asked you to write down a shopping list you would take an empty piece of paper, just as you would open up a book and start on page one when you would want to read it. But for the computer you have to make all these steps explicit, and these include opening the text for reading and setting up the word list.

Statements and Expressions

A computer program is typically made up of statements and expressions. In the above algorithm, you have statements like reverse word and close input file. These statements consist of a command keyword (reverse and close) and an expression that describes what they are operating on (word and input file respectively). An expression evaluates to a certain data type. For example, a statement to print the time could look like
print time

Here we have time as an expression, which is evaluated when the computer executes the statement. You don't write 'print 10:30', because you want the computer to print the time at that moment when the statement is executed. Therefore 'time' would need to be an expression, whose evaluation triggers reading the computer's internal clock and returning the time of day in some form. The print, on the other hand, is a statement. It works the same way all the time, and is a direct instruction to the computer. While time would ask the computer to retrieve the current time, print tells it what to do with that.

As a further example, consider the literal string 'the'. This literal value could be replaced by the expression 'most frequent word in most English corpora'. The difference here is that the expression cannot be taken literally, it has to be evaluated first to make sense. An expression in turn can be composed out of sub-expressions which are combined by operators. These are either the standard mathematical operations, addition, subtraction, and so on, or operations on non-numerical data, like
string concatenation. This will become clearer if we look at an example in actual Java code:

String word1 = "aard";
String word2 = "vark";
String word3 = word1 + word2;
In this simple example we first declare two variables (labelled 'containers' for storing data) of the data type String. A variable declaration consists of a data type, a variable name, and it optionally can contain an initial assignment of a value as well. Here we are assigning the literal values aard and vark to the two variables called word1 and word2 using the assignment operator '='. A variable declaration counts as a statement, and in Java statements have to be terminated by a semicolon. The third line works exactly the same, only that this time we are not assigning a literal String value to word3, but instead an expression which evaluates to a String. The plus sign signifies the concatenation of two (or more) strings, so the expression word1 + word2 evaluates to another (unnamed) String which we assign to word3. At the end of that sequence of statements the variable word3 will contain the value 'aardvark'.

As you can see from that example, a variable is a kind of expression. It cannot be used in place of a statement, but it can be used with commands that require an expression of the same type as the variable. Assignments are a bit special, as they are technically commands, but without a command keyword.

We will discuss data types, statements and expressions in a bit more detail later on, but there is one more type that we need to know about before continuing: the boolean expression. A boolean expression is an expression that evaluates to a truth value. This can be represented by the literals true and false, which are kind of minimal boolean expressions. A boolean expression we have already encountered is word is in list in the last version of the reverse word list program. If you evaluate this expression, it is either true (if the word is in the list) or false (if it isn't). Most boolean expressions used in programming are much simpler, testing just for equality or comparing values. For this there are a number of boolean operators (listed in table 2.1). Note the difference between the assignment indicator, the single equals sign, and the boolean operator 'equal to' which is a double equals sign.
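To make that last distinction visible in code, here is a two-line sketch (the variable names are again invented for illustration):

int year = 2000;                     // '=' assigns the value 2000 to the variable year
boolean isRound = (year == 2000);    // '==' compares the two values and evaluates to true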
To illustrate the use of these operators, here are a few simple examples. Suppose you were writing a scheduling program, then you could print a notice to go for lunch if the expression time == 12:30 was true. Combining two expressions, the system would tell you to go home if time >= 5:30 || to-do-list == 'empty' was true. This means that you can leave work either after 5:30, or when you have done all the things you had to do. As soon as one of the two sub-expressions evaluates to true, the whole expression becomes true as well. A mean manager, however, might want to change the logical OR to a logical AND, in which case you have to wait until 5:30 even if you have done your day's work, or you have to work overtime if you haven't finished going through your in-tray by 5:30, because then the whole expression would only be true if both of the sub-expressions are true.
Operator   Meaning                    Example
==         equal to                   4 == 5
!=         not equal to               4 != 5
>          greater than               7 > 5
<          less than                  2 < 5
>=         greater than or equal to   7 >= 5
<=         less than or equal to      5 <= 5
&&         logical AND                (2 > 1 && 3 < 5)
||         logical OR                 (2 > 1 || 3 < 5)
!          logical NOT                !(2 > 0)

Table 2.1: Boolean Operators
2.2 CONTROL FLOW
When you are reading a text, you start at the beginning and proceed along each line until you get to the end, unless you come across references like footnotes or literature references. In those cases you 'jump' to the place where the footnote is printed, and then you resume reading at the place where you had stopped reading before. This is essentially the same way that the computer will go through your program. However, there are also lines in the listings we looked at in section 2.1.2 that indicate repetition (while there are more words) and branching (check if word is in list). Such statements are directing the flow of control through the program, and are very important for understanding the way the computer executes it. In this section we will discuss in greater detail why control flow is such an important concept in programming.

In order to keep track of which commands to execute next the computer needs a pointer to the current position within the program, just as you need to keep track of where you are in a text you are reading. This position marker is called the program counter and it stores the address of the next instruction to be executed. It also has an auxiliary storage space in case it temporarily needs to jump to another position, like looking at a footnote in an article. Here it stores the current value of the program counter in a separate place and then loads it with the address of the 'footnote'. The next instruction will then be read from a different location. Once the end of the footnote is reached, an end-marker notifies the processor to reload the old value from the temporary space back into the program counter, and execution resumes where it was interrupted before.

In a program we will actually make use of several ways to direct the control flow. Maybe you remember the so-called Interactive Fiction books, which were a kind of fantasy role-playing game for one person. It was a collection of numbered paragraphs, and things would happen that tell you where to continue from. You 'fight' against an ogre by throwing a few dice, and depending on the outcome you win or lose. The last lines of that paragraph would typically be something like "if you beat the ogre, go to 203, otherwise continue at 451." Unlike traditional narratives, these books did not have a single strand of plot, but you could 'read' it multiple times, exploring different 'alternate realities', depending on whether you defeated the ogre or not.
Computer programs are a lot like that. Depending on outside influences the computer takes different paths through a program, and there are several ways of changing the flow of control. We will now discuss three of them in more detail: sequence, choice, and loop.
2.2.1 Sequence
This is the general way in which a program is executed, in sequence. Starting at the first instruction, the program counter is simply incremented to point to the next instruction, and most instructions will not actually influence the program counter. A sequence is like reading a book from start to finish without ever going backwards or skipping parts of it. Even though it is the default, a program cannot do much if it is only able to go through each statement once. The main strength of a computer is its speed, and that can best be exploited by making it perform some task repeatedly. For this we need other means, which we will look at below. The expressiveness of a purely sequential program is fairly limited; it's like a pocket calculator that can perform mathematical operations, but cannot react to the result in any way. Any algorithm that is more complex than that will require a more differentiated control flow.

Before discussing the other ways of directing the control flow, there is one other concept that we need to know about, the block. A block is a sequence of statements grouped together so that they are syntactically equivalent to a single statement. For example, in the previous example we had a block right at the end:

11 for all words in list
12   reverse word
13   print word
The bottom two lines are a block, and in pseudo-code they are grouped together by being at the same level of indentation. In a Java program, and in a number of other languages, blocks are delimited by curly brackets. The reason for having blocks is so that the computer can determine the scope of certain commands; otherwise, how would it know that the print word statement is also to be executed for each word? By putting those two statements into a block this becomes clear immediately. We will see more examples where blocks are essential in the following sections.
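As a small illustration of the same idea in Java, here is a rough sketch of how the grouping might look; the list of words is an invented example, and the lines would sit inside a method. The curly brackets take over the role of the indentation in the pseudo-code:

    // a made-up list of words, purely for illustration
    String[] words = { "corpus", "linguistics", "java" };
    for (int i = 0; i < words.length; i++) {
        // the two statements between the curly brackets form one block,
        // so both are executed for every word in the list
        String reversed = new StringBuilder(words[i]).reverse().toString();
        System.out.println(reversed);
    }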
2.2.2 Choice
If there were only sequential execution, each time a computer program was run it would basically behave the same as before. More variation can be introduced through another type of control flow, the choice. Here the processor evaluates a condition and, depending on the result, branches into one of two or more alternatives. The simple alternative is expressed in abstract form as

    IF X THEN A ELSE B

where X is the condition to be evaluated, and A and B are the two alternative blocks of statements. The condition X must be a boolean expression, i.e. it evaluates to either true or false. If X is true, then it is block A that gets executed, otherwise it is block B. This can now be used for decision making: imagine a ticket counter at a cinema, where people are buying tickets at a machine. In order to check whether the
clients are allowed to watch a film, it asks them their age. Then it checks whether the age is sufficient with respect to the rating of the film they want to watch, and prints out a different statement depending on the outcome of the check. In pseudo-code (with some comments written in square brackets) this could look like this:

    rating = 15                [or any other appropriate value according to the film]
    print "Please enter your age"
    input age                  [at this stage the user enters his or her age]
    if age >= rating           [the operator stands for 'greater than or equal to']
    then                       [the following block gets executed if the condition is fulfilled]
        print "OK, you're old enough to watch this"
    else                       [the following block gets executed if the condition does not apply]
        print "Come back in a few years"
    endif
The pseudo-code syntax for simple branching is if ... then ... else ... endif. The else and endif parts are needed to mark the boundaries of the two blocks: the one that gets executed when the condition is true, and the other one which gets executed when the condition is false.

In the above example we first assign the value 15 to the label rating. This is not strictly necessary, but it makes things easier if the next film has a different rating: all that is necessary is to change this one line, and everything will work properly. Otherwise, if the rating value is used several times, all of those instances would have to be changed, and it is very easy to forget one, and then you have a problem. For this reason, there will often be a couple of 'definitions' at the beginning of a source file, where values are assigned to labels. There are two types of such labels: ones that can change their assigned value and ones that can't. The ones which can change are called variables, while the ones that always keep the same value are called constants. Constants are declared in a way that forbids you to ever assign a different value to them, and this is checked by the compiler: if you try to do it, you will get an error message. We will further discuss the topic of variables below; for the moment all you need to know is that a variable is basically a symbolic name for a value of some kind. If the value can change, it is called a variable; if the value is immutable, it is a constant.

Once we have read in the client's age, we are in a position to compare it to the threshold as defined by the film's rating. This is a simple test where we see whether the age is greater than or equal to the rating. Here the action parts are only messages printed to the screen, but in a real-life application we would take more action, like initiating a ticket purchase or displaying alternative films that have a lower rating.
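For readers who want to see roughly how this pseudo-code might translate into Java, here is a sketch; the age value is simply hard-coded, since reading input from the user is a topic for later chapters. Note how the constant is declared with the final keyword, so the compiler will refuse any later attempt to change it:

    final int RATING = 15;   // a constant: the compiler forbids assigning a new value to it
    int age = 12;            // invented value; in a real program this would come from user input

    if (age >= RATING) {
        System.out.println("OK, you're old enough to watch this");
    } else {
        System.out.println("Come back in a few years");
    }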
2.2.3 Multiple Choice

There is another form of choice control flow, where the test condition evaluates to one of several possible values. In this case, which is really just a variant of the simple two-way choice, you need to provide a corresponding block of statements for each possible outcome. This will usually be a rather limited set of options, and the simple alternative is by far more frequent than the multiple choice.

Here is a brief example of a multiple choice situation: imagine a part-of-speech tagger, a program that assigns labels to words according to their word classes. These
labels are often rather cryptic, and we would like to have them mapped onto something more readable. This short example will map some tags from a widely used English tagset into more human-digestible labels. The syntax for this is switch ... case ... case, where the expression following the switch statement is matched against each case, and if it matches, the corresponding block is executed. A special case is default, which matches if no other case did.

    switch tag
        case "JJ"
            print "adjective"
        case "NN"
            print "noun"
        case "VBG"
            print "verb-ing"
        case "DT"
            print "determiner"
        default
            print "unknown tag encountered!"
You can see that a switch statement is not much more than a convenient way to combine a series of if statements.
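As a sketch of what this looks like in real Java (assuming a Java version that allows switching on String values; the tag variable is an invented example), the tagger fragment could be written like this. The break statements stop execution from 'falling through' into the next case:

    String tag = "VBG";   // an example tag, hard-coded for illustration

    switch (tag) {
        case "JJ":
            System.out.println("adjective");
            break;
        case "NN":
            System.out.println("noun");
            break;
        case "VBG":
            System.out.println("verb-ing");
            break;
        case "DT":
            System.out.println("determiner");
            break;
        default:
            System.out.println("unknown tag encountered!");
    }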
2.2.4 Loop

Computers are very good at doing the same thing over and over again, and unlike humans they don't lose concentration when doing something for the five hundred and seventh time. In order to get the computer to do something repeatedly we use a loop to control the flow of execution.

The first type of loop is governed by a condition. It basically looks like a branch, but it has only one block, which is executed if the condition is true. Once the end of the block has been reached, the condition is re-examined, and if it is still true, the block is executed again. If it is false, the program continues after the end of the block. For our example, we want to write a program that scans through a text and finds the first occurrence of a certain word. Once it has found the word it will stop and print out the position of the word in the text. In pseudo-code this looks like:

    searchWord = "scooter"
    searchPosition = -1
    currentPosition = 0
    open input text for reading
    while text-has-more-words && searchPosition == -1
        read word
        increment currentPosition
        if word == searchWord
        then
            searchPosition = currentPosition
        endif
    endwhile
    close input text
    if searchPosition == -1
    then
        print "the search word does not occur in the text"
    else
        print "the search word first occurs at position "+searchPosition
    endif
We start off by initialising a few variables. If you initialise a variable, you assign an initial value to it. These variables are: searchWord, the word we will be looking for; searchPosition, the position it first occurs at; and a counter that keeps track of which position we're at, currentPosition. The searchPosition variable also doubles as an indicator of whether we have found the word yet: it is initially set to -1, and keeps that value until the search word matches a word in the text.

After opening our input text for reading, we enter the loop. This loop is governed by a complex condition, which is made up of two sub-conditions. In our case they are combined with a logical AND, which means the whole condition is true if and only if both sub-conditions are true as well. As soon as either of the sub-conditions becomes false, the whole condition becomes false as well. The condition here is that there are more words available to read (otherwise we wouldn't know when to stop reading), and also that the value of searchPosition is -1, which means that we haven't yet found the word we are looking for. In the loop we read a word and note its position in the text, and then we compare it to our target word: if they are the same, we assign the current text position to the variable searchPosition. Once that has happened, the variable will no longer be -1, and the second sub-condition becomes false, causing the loop to terminate. Otherwise searchPosition remains -1, and the loop is executed again.

Once we're done with the loop, we investigate the value of the searchPosition variable: if it is still -1, the loop terminated because the end of the text had been reached before the search word was encountered. In that case we print out a message that the word has not been found in the text. Otherwise we print out a message that it has, including its position. Please note the way the variable is printed: if it were enclosed in double quotes like the rest of the message, we would simply have printed the word 'searchPosition' instead of the value of the variable searchPosition. To print the actual value we attach it to the message with a plus sign outside the double quotes.

This while loop is characterised by the fact that the condition is evaluated before the body of the loop is reached. This means that the body might not be evaluated at all, namely when the condition is false to start off with. This could be the case if the input text is empty, i.e. if it contains no words (or maybe doesn't exist). Another type of loop has the condition after the body, so that the body is executed at least once. They are quite similar, so let's look at the same task with a different loop:

 1 searchWord = "scooter"
 2 searchPosition = -1
 3 currentPosition = 0
 4 open input text for reading
 5 do
 6     read word
 7     increment currentPosition
 8     if word == searchWord
 9     then
10         searchPosition = currentPosition
11     endif
12 while text-has-more-words && searchPosition == -1
13 close input text
14 if searchPosition == -1
15 then
16     print "the search word does not occur in the text"
17 else
18     print "the search word first occurs at position "+searchPosition
19 endif
We start off with the same initialisations, and then enter the loop's body, as indicated by the do keyword (line 5). Apart from that the program seems to be identical to the previous one. However, there is one important difference: if the input text is empty, i.e. it does not contain any words, the program nevertheless tries to read a word from it when the loop's body is executed for the first time, which would cause an error. This is because the condition 'text-has-more-words' (line 12) is first checked after the body has been executed, and so there is some scope for trouble. Using a head-driven while-loop is thus inherently safer, as you are not bound by the constraint that the body gets executed once regardless of the condition, and therefore head-driven loops are a lot more common than do-loops. This is not to say that do-loops are never used; you just have to be a bit careful when deciding which loop to choose.

The third type of loop is used when we know in advance how often we want to execute a block. It is effectively a short-hand form of the while-loop, and we will start with the verbose form to look at the concept first. Let's assume we want to know how often our search word occurs within the first 100 words of the text. To make things easier we will assume that the text has at least 100 words, so that we don't have to test for that.
 1 position = 0
 2 searchWord = "by"
 3 counter = 0
 4 open text for reading
 5 while position < 100
 6     read word
 7     if word == searchWord
 8     then
 9         increment counter
10     endif
11     increment position
12 endwhile
13 close input text
14 print "number of occurrences: "+counter
In order to keep track of whether we've reached 100 words, we keep count in a variable called position. We start with position 0 (computers always start counting at 0) and check if the current word equals our search word. If so, we add one to the counter variable. Then we increment the position value and repeat until we have reached 100. Note that, since we started at zero, the loop is not executed when position has the value 100, as this would be the 101st time. In most languages there is a special loop type which can be used if you know exactly how often a loop needs to be executed. As there is no real pseudo-code way to express this differently from the previous example, we will now add in a bit of real Java in the next listing. The loop is actually called a for-loop and looks like this:
int counter = 0;
String searchWord = "by";
FileReader input = new FileReader("inputfile");
for(int position = 0; position < 100; position++) {
    String word = readNextWord(input);
    if(word.equals(searchWord)) {
        counter++;
    }
}
System.out.println("number of occurrences: "+counter);
A few points need mentioning: unlike pseudo-code, Java requires variables to be declared before they can be used (which we are doing by specifying a data type, int, before the name of the variable, counter), and for numbers we use the int data type, which is short for integer; instead of keywords like endif, blocks of code are enclosed in curly brackets ({..}); and all statements have to be ended with a semicolon. We also cheated a bit, as the operation of reading a word from the input text is simply replaced by a call to a procedure, which we called readNextWord() here, as it is not very straightforward. The whole of chapter 7 is devoted to getting words out of input data, so for now we just assume that there is a way to do this. This is actually a good example of top-down design: we postpone the details until later and concentrate on what is important at the current stage. Another point, which might be slightly confusing, is the way we are comparing string variables. For reasons we will discuss later, we cannot use the double equals sign here, so we have to use a special method, equals(), to do that. How exactly this works is not relevant for now, but you will see in chapter 5 how to work with String variables.

The for keyword is followed by a group of three elements, separated by semicolons and enclosed in round brackets:

    for(int position = 0; position < 100; position++) {

The first one is the initialiser, which is executed before the loop starts. One would usually assign the starting value to the counting variable, like position = 0 here. In the 'verbose' while form above it would be equivalent to the first line. The second element is the condition. The loop's body is executed while this condition is true, so in our case while position is smaller than 100 (compare line 5 of the pseudo-code listing above). The final element is a statement that is executed after the loop's body has been processed, and here we increment the loop position by one. The expression position++ is a short-hand form of position = position + 1, where we assign to position the current value of position plus one. This corresponds to line 11 in the previous example, where it says increment position. For-loops are more flexible than this; you can have any expression which evaluates to a truth value as the condition, and the final statement can also be more complex than a simple increment statement. We will learn more about the full power of for-loops in later chapters.
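For comparison with the for-loop above, here is a rough sketch of how the earlier word-search while-loop might look in Java. As in the book's own fragment, readNextWord() is an assumed helper procedure, and a hasMoreWords() check is assumed here as well; imports and exception handling are also glossed over:

    String searchWord = "scooter";
    int searchPosition = -1;
    int currentPosition = 0;
    FileReader input = new FileReader("inputfile");

    // keep reading while there are words left and the word has not been found yet
    while (hasMoreWords(input) && searchPosition == -1) {
        String word = readNextWord(input);
        currentPosition++;
        if (word.equals(searchWord)) {
            searchPosition = currentPosition;
        }
    }
    input.close();

    if (searchPosition == -1) {
        System.out.println("the search word does not occur in the text");
    } else {
        System.out.println("the search word first occurs at position " + searchPosition);
    }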
2.3 VARIABLES AND DATA TYPES
We have already come across data types when we were talking about expressions on page 16. There we used time as an example. All expressions have a data type, which tells the computer how to deal with them. The data type of the 'time' expression depends on the computer language you're using: it could either be a specific type handling dates and times, or it could be a sequence of characters (e.g. '10:30'), or even (as is common in the Unix operating system) a single number representing the number of seconds that have elapsed since midnight on January 1st 1970. In Java there is a special class, Date, which represents dates and times.

When an expression is evaluated, you need somewhere to store the result. You also need a way of accessing it, and this is done through variables. We have already come across variables and constants as symbolic labels for certain values, some of which can be changed by assigning new values to them. From a technical point of view, variables are places in memory which have a label attached to them that the program can use to reference them. For example, if you want to count how many words there are in a text, you would have a variable 'counter' in which you store the number of words encountered so far. When you read the next word, you take the current value stored under the label 'counter' and increment it by one. When you have finished counting, you can print out the final value. In pseudo-code this could look like:

    counter = 0
    while more words are available
        read next word
        counter = counter + 1
    endwhile
    print counter
At the beginning of this program, the variable is initialised to zero, which means it is assigned an initial value. If this is not done, the storage area the label refers to would just contain a random value, which might lead to undesired results. If you forget to initialise a variable before using it, the Java compiler will complain. It is best to develop the habit of always initialising variables before they are used.

A variable can only contain one sort of data, which means that the data in the computer's memory which it points to is interpreted in a certain way. The basic distinction is between numerical data, character data, and composite data. In the following sections we will walk through the different data types, starting with numerical types.
2.3.1 Numerical Data
These are numbers of different kinds. The way a number is stored in memory is optimised depending on what it is used for. The basic distinction here is between integer numbers (numbers with no decimal places) and floating-point numbers (with decimal places). The wider the range of the value, the more memory it takes up, so it is worth choosing the right data type. The numerical types available in Java, together with their required storage space, are shown in table 2.2.
data type   description                                     size (bytes)
byte        small number (-128 to 127)                      1
short       small number (-32768 to 32767)                  2
int         number (+/- 2 billion)                          4
long        large number                                    8
float       floating-point number                           4
double      double precision floating-point number          8

Table 2.2: Numerical data types in Java
Unlike some other programming languages, all numerical data types in Java are signed, which means they have a positive and a negative range. This effectively halves the potential range that would be available if there were only positive values, but is more useful for most computing purposes. A single byte, for example, can store 256 different values. In Java this is mapped onto the range -128 to +127, so you still have the same number of different values, only the maximum and minimum values are different. Most often you will probably require int variables, mainly for counting words or properties of words. They have a range of about two thousand million either side of zero, which should be sufficient for most purposes. Floating-point variables should only be used when working with decimal places, such as proportions or perhaps probabilities, as they cannot represent larger integer values without losing precision. A float can hold values with up to about 38 digits (but with only around 7 significant digits of precision), whereas a double goes up to about 308 digits (with around 15 significant digits).
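A small sketch of how these types are declared in Java; the values are arbitrary examples chosen for illustration:

    int wordCount = 1523;                      // counting words: int is normally enough
    long corpusSize = 100000000L;              // very large counts need a long (note the L suffix)
    double relativeFreq = (double) wordCount / corpusSize;   // proportions need floating point
    float roughValue = 0.0153f;                // float only if limited precision is acceptable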
2.3.2 Character Data
A character is a symbol that is not interpreted as a numerical value. This includes letters and digits, and special symbols such as punctuation marks and ideograms. The ability to process symbols is one important aspect that distinguishes a general purpose computer from a pocket calculator, and one that is obviously extremely relevant for corpus linguistics. Internally each character is represented by a number, its index in the character table. In order to allow the exchange of textual data between different computers there are standard mappings from characters to index values. While mainframe computers usually work with the EBCDIC character table, most other machines use the ASCII character set. This defines 128 symbols (codes 0 to 127), which include control codes (e.g. line break and page break), the basic letters used in English, the digits, and punctuation. Quite often there are also extensions for regional characters up to a maximum of 256 symbols (which is the maximum number of different values for a single byte). Java, however, uses the Unicode character set, which provides more than 65,000 characters, enough for most of the world's alphabets. Single letters are represented by the char data type, while sequences of letters are represented by Strings:
// note the single quotes
char myInitial = 'M';
String myName = "Mason";   // note the double quotes
// this is a String, not a char
String aLetter = "a";
Literal characters have to be enclosed in single quotes, while for strings you have to use double quotes. If you enclose a single character in double quotes (as in the last declaration), it will be treated as a String.
2.3.3 Composite Data
The data types described so far enable one to do most kinds of programming, but usually real-world data is more complex than just simple numerical values or single characters. If you want to deal with word forms, for example, it would be rather tedious to handle each letter individually. The solution to this lies in composite data types. Historically they started off as records with several fields consisting either of primitive types (the two kinds described in the previous paragraphs) or other composite types. This allowed you to store related data together in the same place, so you could have a word and its frequency in the same composite variable, and you could always keep track of the relationship between the two entities. Later on it became apparent that one might not only want to have related data together, but also the instructions that are relevant only to that data. The result of this is the object, a combination of data fields and operations that can be applied to them. With the example of the word, we could create a word object from a text file, and we could define operations to retrieve the length of the word, the overall frequency of the word in the text, its base form, or a distribution graph that shows where all the occurrences appear in the text. Objects are a very powerful programming idiom: they are basically re-usable building blocks that can be combined to create an application. As objects are a fundamental part of Java we will discuss them in more detail in chapter 4.
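As a first, very rough sketch of the idea (the class and its methods are invented here purely for illustration; real word objects are developed later in the book), a word object in Java might look something like this:

    // a minimal 'record-like' class combining data fields with operations on them
    class Word {
        String form;      // the word form itself
        int frequency;    // how often it occurs in the text

        Word(String form, int frequency) {
            this.form = form;
            this.frequency = frequency;
        }

        // an operation that belongs to the data: the length of the word form
        int length() {
            return form.length();
        }
    }

A program could then create such an object with new Word("corpus", 42) and ask it for its length, keeping the word form and its frequency together in one place.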
2.4 DATA STORAGE
When processing data there is one fundamental question that we need to look at next: where and how do we keep that data? We already know that we can store individual data items in variables, but this is only really suitable for a small number of items. If you were to store all the words of a text in separate variables you would, for example, have to know in advance how many there are, and even then you could not really do anything useful with them. Instead you have to follow a different approach, storing the words in an area where you can access them one by one or manipulate them as a whole, if you for example want to sort them. In principle there are two places where you can store your data: either directly in the computer's memory or in a file on the computer's hard disk.
2.4.1 Internal Storage: Memory
Storing your data in memory has one big advantage: accessing it is very fast. The reason for that is that the time needed by the computer to read a value out of a memory chip is much shorter than the time needed to read data off an external storage
medium. This is because there are no moving parts involved, just connections through which electrons flow at high speed. Moving parts are much slower, especially when the storage device needs to be robust. Compare, for example, the time it takes to read a file from a floppy disk and from a hard disk, and then drop both of them from a large height onto the floor and repeat the procedure. You basically pay for the increased speed of the hard disk with its lower robustness.

If memory is so much faster, why does anybody bother to store data elsewhere? There are two main reasons for this: (a) when you turn off the power supply, the contents of memory are erased, and (b) there is only a limited amount of memory available. So external storage is used for backup purposes, as it does not have to rely on a permanent supply of electricity, and because its capacity is much bigger.

So how do you actually store data in memory? For this purpose Java provides a number of so-called container classes, which you can imagine as a kind of shelf that you can put your variables on. There are different ways to organise the way your variables are stored so that you can find them again when you need them, and we will discuss these in chapter 4 when we are looking at the actual classes. Basically you can store data as a sequence or under a keyword.

You would use the sequence if you are processing a short text and want to look at the words as they appear, so you really need to preserve the order in which they come in the data. By storing them in the right container class you can then access them by their position in the text. Remember that computers start counting at zero, so the first word of your text would be at position 0, the second at position 1, and so forth, with word n being at position n-1. Access is extremely quick, and you don't have to follow a certain order. This is called random access, and this data structure is called an array. Arrays can either be of a fixed size, or they can be dynamic, i.e. they can grow as you add more elements to them.

Access by keyword is a bit more sophisticated. Imagine you want to store a word frequency list and then retrieve values for a set of words from that list. The easiest way to do this is in a table that is indexed by the word, so that you can locate the information immediately without having to trawl through all of the words to find it. This works exactly like a dictionary: you would never even think about reading entry number 4274 in a dictionary (unless you are taking random samples of entries); instead you would be looking for the entry for 'close'. In fact, you wouldn't even know that 'close' is the 4274th entry, as the entries are not numbered, but you wouldn't really want to know that anyway. There are several possible data structures that allow data access by keyword, all with different advantages and disadvantages. We will discuss those in chapter 4 when we investigate Java's collection classes in more detail.
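To make the two kinds of storage a little more concrete, here is a sketch using two of Java's standard container classes (properly introduced in chapter 4), written in their modern, generic form; the words and frequencies are invented examples. An ArrayList keeps the words in the order they were added and lets you fetch them by position, while a HashMap stores a frequency under each word as a keyword:

    import java.util.ArrayList;
    import java.util.HashMap;

    // sequence: words stay in text order and are accessed by their position
    ArrayList<String> words = new ArrayList<String>();
    words.add("the");
    words.add("cat");
    System.out.println(words.get(0));           // prints "the" (positions start at 0)

    // keyword access: a frequency is looked up by the word itself
    HashMap<String, Integer> frequencies = new HashMap<String, Integer>();
    frequencies.put("the", 2);
    frequencies.put("cat", 1);
    System.out.println(frequencies.get("the")); // prints 2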
2.4.2 External Storage: Files

We have now heard about several ways of storing data in the computer's main memory, but what if you can't store it there because you have too much data? You could not, for example, load all of the British National Corpus (BNC) into memory at once, as it is far too big. And even if you want to store just a single text, which would fit into memory easily, you need to get it from somewhere. This somewhere is an external storage medium, and the data is kept in a file.
As opposed to internal storage, or memory, external storage is usually much bigger, non-volatile (i.e. it keeps its content even if there is no electricity available), a lot cheaper, and much slower to access. A file is just a stretch of data that can be accessed by its name. In a way external storage is organised similarly to the access-by-keyword method we came across earlier.

In order to get at the data which is stored in a file you need to open it. Generally you can open a file either for reading or for writing, so you either read data out of it or write data to it. You cannot open a file for reading that does not exist, and when you open an existing file for writing its previous content is erased. Data in a file is stored in an unstructured way, and when writing to a file the data is just added on at the end. When reading it, you will read the contents in the same order that you have written them in. There is also another way of accessing files: instead of opening them for sequential reading or writing you can open them for random access. Here you can jump to any position in the file, and both read and write at the same time, just as you would do when accessing an array in the internal memory. The drawback of this is that it is slightly slower, and also more complex. We will look at different ways of accessing files in more detail in chapter 6.
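As a foretaste of chapter 6, here is a sketch of opening a file for sequential reading in Java and printing its contents line by line; the file name is an invented example and error handling is kept to a bare minimum:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class PrintFile {
        public static void main(String[] args) throws IOException {
            // open the file for reading; this fails if the file does not exist
            BufferedReader reader = new BufferedReader(new FileReader("mytext.txt"));
            String line;
            // read the contents back in the order in which they were written
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }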
2.5 SUMMARY

In this chapter we have first learned what algorithms are, namely formalised descriptions of how things are done. The level of formality of an algorithm is quite flexible, and this can be used to guide the software development process: starting from a very informal and imprecise outline we gradually go into more detail, using more and more formal elements, until we reach the level of the programming language. Individual statements can be combined into blocks, and the way the computer executes them can be controlled using branching and loops. This control flow allows the computer to repeat operations multiple times and make decisions based on boolean expressions.

A program consists of both instructions and data. Variables and constants are named labels which can be used to store and retrieve individual values. Larger numbers of values can be stored in container classes within the computer's memory or in external data files on the hard disk. At the end of the chapter we discussed the different properties of internal and external storage media: internal memory is faster to access, but more limited in size, whereas external storage is slower but there is far more of it, and it keeps information even if there is no electrical power available.
3 Basic Corpus Concepts

This chapter introduces relevant concepts from corpus linguistics. It is intended as a brief introduction to enable you to understand the principal problems that you will encounter when working with corpora.
3.1 HOW TO COLLECT DATA
Unless you are working with a research group that already has a set of corpora available that you want to analyse, your first question will be where to get data from and how to assemble it into a corpus. For the remainder of this section we will assume that you want to build your own corpus for analysis. There are several reasons why you might want to do this. The main one is that the kind of data you want to analyse does not (yet) exist as a corpus. This is in fact highly likely when you choose your topic without knowing what corpora already exist and are publicly accessible. On the one hand you have the advantage of not having to compromise on the subject of your research (e.g. by restricting its scope); on the other hand you might have to put extra work into collecting appropriate texts. Quite often you might want to compare a special sample of language with general language; here you would collect some data (e.g. political speeches, or learner essays) and use existing corpora (like the BNC or the Bank of English) as a benchmark.
3.1.1 Typing
Typing your data in is undoubtedly the most tedious and time-consuming way of getting it into the computer. It is also prone to errors, as the human typist will either silently 'correct' the data or mistype it. Nevertheless, this is often the only possible solution, e.g. when dealing with spoken data, or when no other alternative (see the following sections) is available. The most important point to consider when entering corpus data at the keyboard is that you want it to be in a format that can easily be processed by a computer; there is no point in having a machine-readable corpus otherwise. Over the years (mainly prompted by advances in computing technology) several different formats have evolved. More on data formats in section 3.2.2 below.
3.1.2 Scanning
Scanning seems to be a much better way of entering data. All you need to do (provided you have access to appropriate hardware) is to put a page at a time into the
scanner, press (or click on) a button, and the scanner will read in the page and dump it on your system. It seems just as easy as photocopying. Unfortunately it is not that easy; there is one important step which complicates things tremendously. That is the transition from the page image to the page content, in other words the reading process. While humans are usually quite good at reading, to a computer a page looks like a large white area with random black dots on it. Using quite sophisticated software for optical character recognition (OCR), the dots are matched with letter shapes, which are then written into an output file as plain text.

This transformation from image to text can be easy for high quality printed material on white paper. Older data, however, can quite often lead to mistakes, due to lack of contrast between paper and letters, or because the lead type has rubbed off with time and bits of the letter shapes are missing. Imperfect printing can lead to letters being misinterpreted by the computer, the most common examples being mistaking c for e (and vice versa), or rn for m. Luckily a lot of these recognition errors can be caught with modern spell-checking software, but it all adds to the effort required for getting the texts into machine-readable form before you even start analysing them, and spell checkers are not infallible either. Further complications arise once you start to look at data which is formatted in multiple columns, or intermixed with figures or tables. It is very hard for the computer to figure out what the flow of the text is, and thus a lot of manual intervention is required for such data.
3.1.3 Downloading

The best solution, as always, is to re-use material that somebody else has already made machine-readable. There are plenty of texts available on the Internet, from a variety of sources, and of varying quality. For a list of possible sources see the appendix. Generally there are two types of texts available on the Internet, web pages and text files. Web pages can easily be saved from a browser, and they contain many different texts. Quite often you can find newspaper articles or even academic papers in this form. Text files are usually too large to display on a single page, and they are available for downloading rather than on-line viewing. A number of sites contain whole books, like Project Gutenberg and other similar ventures. The Oxford Text Archive also allows downloading of full texts, in different formats. Most of the books available for free downloading are classics which are out of copyright; while this makes them an easy source of data, their linguistic relevance might be a bit dubious.
3.1.4 Other Media

It is increasingly possible to buy collections of text on CD-ROM for not very much money. These are usually whole years' worth of newspapers, or classics which are out of copyright, sold as a kind of digital library, often with software that displays the texts on the computer screen. While this seems like a very convenient source of data, it is often the case that the texts are in some encoded format, which makes them
impossible to use without the included software. This format could be for indexing or compression purposes, and it means that you will have to use the software provided on the CD to look at the data. Sometimes you might be able to save the text you are looking at from within the program, or you can 'print' it into a file which you can then process further. It might require quite some computing expertise to extract texts from such collections, so it is probably easier to find the texts on the Internet, which is presumably where a lot of these texts originated from in the first place (see previous section). With newspapers, where you often get back issues of a whole year on a CD, you will almost certainly have problems accessing the text. These CDs are mainly meant as an archive of news articles, so they are linked up with search mechanisms which often provide the only point of entry to a text. The same caveats as mentioned before apply, even more so since the publishers might see some commercial value in the material and thus encode or even encrypt it to prevent access other than through the interface provided. The same applies to a lot of encyclopedias which are available on CD, often on the cover disks of computing magazines. One issue that is often neglected here is that of copyright, which is a right minefield. If you are in doubt whether you are allowed to use some data, it is best to consult an expert, as that can save you a lot of trouble in the long run.
3.2 HOW TO STORE TEXTUAL DATA
Collecting data is one part of the problem; storing it is another. This is important, as how you manage your data greatly influences what you can do with it during analysis, and how easy this analysis will be. Apart from organising your data, you also need to look at the physical storage, which means the format you store your data in. There are a great number of ways to represent textual data on a computer, especially when it contains more than just the plain text, such as annotations of some sort. Careful planning at the outset of your research project can save you many headaches later on, and handling your data is as important as your actual investigation, so any time spent on it is well worth the effort.
3.2.1 Corpus Organisation
Once you have collected all your texts, you need to store them in some form on your computer. Depending on what your analysis will be about, there are different ways of organising your material, but it is always worth trying to anticipate other uses of your data. Try not to exclude certain avenues of analysis on the grounds that you are not interested in them; your own interests might change, or somebody else might come along wishing to analyse your data in an unforeseen way. It is probably a good choice to partition your data into smaller parts which have attributes in common. What attributes these are depends entirely on your data, so it is impossible to give ultimate guidelines which are suitable for each and every corpus. The first step is to look around and see if there are similar corpora whose structure you might want to mirror. This is the approach that was chosen for the LOB corpus (Lancaster-Oslo/Bergen corpus of written British English), which took over the design of the Brown corpus (American written English) to allow direct comparison
between the two. Other corpora then picked up on the same structure as well, for example FLOB and Frown, the modern equivalents of Brown and LOB created at the University of Freiburg, and the Kolhapur corpus of Indian English. As it is so widely used, the organisation of these corpora is shown in table 3.1 (from Taylor, Leech, and Fligelstone (1991)).
A   press (reportage)
B   press (editorial)
C   press (reviews)
D   religion
E   skills & hobbies
F   popular lore
G   belles lettres, biography, essays
H   miscellaneous
J   learned & scientific writings
K   general fiction
L   mystery & detective fiction
M   science fiction
N   adventure & western fiction
P   romance & love story
R   humour

Table 3.1: Brown/LOB structure

If you cannot find any other similar corpora whose structure you can follow, you will have to think of one yourself. It might be useful to partition your data in a kind of tree structure, taking the most important attributes to decide on the split (see figure 3.1). For instance, if your corpus contains both written and spoken material, you might want to take the first split across that particular attribute, and within the written partition you might want to distinguish further between books and newspapers, or adult and children's books, or female/male authors. Here you could choose whatever is most convenient for your later analysis, and you could re-arrange the partitioning if necessary for other studies.
Another useful criterion for structuring your corpus is date. If you are looking at the way usage patterns change across time, it might be a good solution to have different sub-corpora for different years or chronological ranges. Depending on your time scale you might wish to distinguish between decades, years, months, or even days. This is also a relatively neutral way of dividing up your corpus, which doesn't make too many assumptions about the data. In the end, the partitioning is mainly a matter of convenience, as all attributes should be stored within the data files, allowing you to select data according to any of them, provided they are marked up in a sensible way (see the next section on how to mark up attributes).
[Figure 3.1: A sample taxonomy of text types. The node labels include Written, Books, Journals, Newspapers, American, British, Dialogue, Monologue, Radio broadcasts and Academic lectures.]
Grouping your corpus data in a sensible way, however, can greatly speed up the analysis, as you don't have to trawl through all your material to find relevant parts for comparisons; instead you can just select a block of data grouped according to the attribute you are looking for. There has been some work on developing guidelines for structuring corpus material, and the results of the EU project EAGLES are available on the Internet. The address of the site can be found in the appendix.
3.2.2 File Formats

After having decided on a logical structure, you need to think about the physical structure, i.e. how to store the data on your computer. One straightforward way is to mirror the hierarchical structure in the directory tree on your hard disk. This way the attributes of a text would be reflected by its position on the disk, but processing might be more difficult, as you will have to gather text files from a set of separate directories. It also makes rearranging the data more complicated. Another possibility is to have a separate file for each text, and keep a list of each text's attributes (such as author, title and source) somewhere else. When you want to pick a sample of your corpus, you can consult the list of attributes and then select those files which contain the texts you are interested in. This is easier for processing, but it might be inefficient if you have too many files in a single directory. Some operating systems might also put an upper limit on the number of files you can have in a single directory.

In principle, the choice between storing texts in individual files on your computer or lumping them together in one or more large files is largely a matter of convenience. If all the data is in one big file, processing it is easier, as there is only one file you have to work with, as opposed to a large number of smaller files. But once your corpus has grown in size, a single large file becomes unwieldy, and if it does not fit onto a single floppy or ZIP disk it might be difficult to copy it or back it up. You might also have to think about marking document boundaries in a big file, whereas individual small files could comprise self-contained documents with all the necessary related meta-information.
Once you get to storing the material in the right place on your computer, the next question is what format to choose. In principle there are three kinds of formats to choose from, all with their special advantages and disadvantages:

1. plain ASCII text
2. marked-up ASCII text (SGML/XML/...)
3. word processing text (Word/RTF/...)
Plain ASCII text is the most restricted format. All you can store in it are the words and some simple layout information such as line breaks. You cannot even represent non-ASCII characters, like accented letters and umlauts, which severely restricts you when it comes to working with occasional foreign words or even non-English corpora. In order to overcome this problem, a whole host of different conventions has evolved, each of which requires a special program to work with it. If you had a concordancing program suitable for corpus X, the chances are that you could not use it on corpus Y, or at least you would not be able to exploit all the information stored in that corpus. After years of developing idiosyncratic mark-up schemes, a standard format has been established: SGML, or its simplified version, XML. These are formal descriptions of how to encode a text, and how to represent information that otherwise could not be added straightforwardly into the text. As this standard is increasingly used for marking up corpus data, a lot of software has been developed to work with it. More details of mark-up will be discussed in the next section.

Another popular choice for corpus formats seems to be word processing files. There are three main reasons for this:

1. many people have a word processor available
2. plenty of existing files are already in this format
3. you can highlight important words or change their font
Unless it represents spoken data, a corpus is basically a collection of text documents. Thus it seems only natural to use a word processor to store them, which allows you to use highlighting and font changes to mark up certain bits of text. You can format ergative verbs in a bold font, then search for bold text during the analysis, and with built-in macro languages you might even be able to write a basic concordancing program. So what's wrong with that?

The problem here is that you would be limiting yourself to just a single product on a single platform, supported by a single vendor. What if the next version of the word processor changes some font formatting codes? Suddenly all your laboriously inserted codes for marking up ergative verbs could have disappeared from the file. Being able to mark certain words by choosing a different typeface may seem like a big improvement over not being able to do it at all in plain ASCII text, but it severely limits the way in which you can process the file later. By choosing a word processor format you can then only analyse your text within this word processor, as the file format will almost certainly not be readable by any other software. As a word processor is not designed to be used for corpus analysis, you cannot do very much more than just search for words.
Furthermore, what happens if the word processor is discontinued? You might be tempted to say that a certain product is so widespread that it will always be around, but think back about ten years: then there was a pre-PC computer in widespread use, the Amstrad PCW. It came with a word processing program (Locoscript), and it would save its data on 3 inch floppy disks (remember that the 'standard' floppy size nowadays is 3.5 inch). Suppose you had started your corpus gathering in those days using this platform. You took a break from working on it, and now, a few years later, you want to continue turning it into a really useful resource. The first problem is that you swapped your PCW for a PC (or an Apple Mac) at the last round of equipment spending, but suppose you manage to find another one in a computing museum. Unfortunately it doesn't have Locoscript on it, as the disk got lost somehow. Without the supporting software you are not able to make use of the text files, and even if there might be a Locoscript version for PCs you still can't do anything, because PCs don't have drives which can read 3 inch floppies.

Many individual attempts at gathering corpus data have chosen this form of storage, and as soon as the question of a program for analysis arises, people realise that they have reached a dead end. Converting the data back into a usable format will almost certainly result in a loss of information, and consequently much time has been wasted during the set-up. Technology changes so quickly that it would be foolish to tie yourself to one particular product. Instead you should go for an open, non-proprietary format that can be read by any program and that is well documented, so that you can write your own software to process it if you need to. You should also store it on a variety of media which are likely to be available at future dates. This of course includes the necessity of making backup copies to insure yourself against accidental loss of your data in case your disk drive breaks down. As we will see in the following section, marked-up ASCII is the best option. It is readable by any program on any computing platform, and if you choose the right kind of mark-up you will find plenty of software readily available to work with it.

The main point of this section is that you should generally do what other people have done already, so that you can gain from their experience and re-use their software and possibly their data. It also makes it easier for other people to use your data if it is not kept in an obscure format nobody else can use. Corpora have been collected for more than forty years, so it would be foolish to ignore the lessons learned by your predecessors. A particular choice might make sense in the short term, but in the long run it is best to follow the existing roads rather than go out into the wild on your own. This particularly applies to the topic of annotations, which we will look at next.
3.3 MARK-UP AND ANNOTATIONS

In the previous section we discussed ways of organising and storing your corpus, how to structure it and how to keep it on a computer physically; in this section we will look at why you would want to use annotations and how they can be stored. In chapter 8 we will then develop programs to process annotations.
3.3.1 Why Use Annotations?

Annotations can be used to label items of interest in a corpus. For example, if you are analysing anaphoric references, you cannot simply use a program to extract them automatically, as their identification is still an unsolved research problem. The only reliable way to find them is to go through the data (perhaps after it has been processed by an auxiliary program which identifies pronouns) and label the references yourself. Once you have added annotations to your data, you can search for them and thus retrieve instances of the features and process them. This means, of course, that the program you are using must be capable of looking for the annotations you put in earlier. While you are very unlikely to be able to ask a concordancing program to find anaphoric references, it is much easier to instruct it to find all the labels marking instances of them.

In practice, most corpora have some kind of annotations. The degree of granularity varies widely, and more sophisticated annotations (i.e. those that need to be put in manually) are mainly restricted to smaller corpora, whereas larger corpora typically contain only such annotations as can be added automatically using computer programs, e.g. part-of-speech tags and perhaps sentence boundaries and paragraph breaks. Any material you can get from other sources will most likely be without any annotations.
3.3.2 Different Ways to Store Annotations

We have already seen that there are different ways to store your data files, and there is an even larger variety of formats in which to encode annotations. In this section we will briefly touch upon this subject, but we will not spend too much time on describing obsolete formats. There are typically two different kinds of annotations: those relating to the whole text, such as author and title, and those which only apply to parts of it (usually words or groups of words), such as part-of-speech or anaphoric references. While the former are usually kept separate in a header block at the beginning of the text, the latter necessarily have to be included inside the text.
Header Information

Usually a corpus will contain just the plain text, unless it was converted from published sources (such as newspaper articles or typesetter's tapes) or structured collections (such as databases, see Galle et al. (1992)). In this case it sometimes contains extra information, which is attached at the beginning in a header describing what the title of the document is, when it was published, in which section, and so on. This kind of external information can easily be kept separate from the actual text through some kind of boundary marker which indicates the start of the text proper. Some mark-up schemes use a number of tags to indicate which attributes of a document are filled with what values. A newspaper article could for example be annotated thus:

\begin{header}
\title{Labour Party To Abolish Monarchy}
\date{01/04/2001}
\author{A.P. Rilfool}
\source{The Monday Times}
\end{header}
\begin{text}
In an announcement yesterday the Prime Minister said that ...
\end{text}
Other labels could be used, or they could be enclosed in angle brackets, as in this example from McEnery and Wilson (1996):

<A CHARLES DICKENS>
This line in the so-called COCOA format (from an early computer program) indicates that Charles Dickens is the author of the following text. There are many different ways in which such information can be stored, and each concordance program will understand only its own format. Structuring the header information is a problem of standardisation, as different corpora could be annotated in different variants, using different tag labels, and without any specification of what kinds of information will be given. This makes it difficult for software to process different corpora, as it cannot rely on them being in one particular form. Also, without a defined set of properties, any program will have problems allowing the user to exploit the information which is available. The solution is to have one single format which is used for all corpora, as then the writers of corpus processing software can tailor the software to that format.
Text-Internal Annotations

Adding annotation for other aspects of a corpus requires that the annotation be added within the text. If you want to analyse the part-of-speech of the individual words, or the reference points of anaphora, you will have to mark those up in the text itself, and this creates a number of problems. The first is to keep the annotation separable from the text. If at some future point you want to get back to the original, unannotated text, and you cannot easily work out what is part of the text and what is annotation, then you will have a problem.

When researchers first started gathering corpora, they had to make up their own formats for storing them. Early computers only had upper case letters, so special conventions were introduced to distinguish upper and lower case, e.g. by putting an asterisk character before any upper case letter and treating all other letters as lower case. This made dealing with the data rather complicated, as programs had to be able to handle all these conventions, but there were no real alternatives. Worse still, everybody used their own conventions, mainly because people wanted to encode different things, which meant that neither software nor data could easily be exchanged or re-used. As there were not that many people working with corpora initially, there was no awareness that this could turn into a problem. However, with corpus linguistics becoming more and more mainstream it quickly became an issue, as not everybody could start from scratch developing their own software and collecting their own data, and so committees were set up to develop standards. This has gone so far that projects will sometimes no longer get funding if they create resources that don't follow these standards.
SGML

The only file format which can be understood generally is plain ASCII text. ASCII allows no non-printable characters apart from line breaks and tabulator marks, and no characters with a code higher than 127. With plain ASCII you can read your data on virtually any machine using any software, but how do you mark up your ergative verb occurrences? This is where mark-up languages enter the scene. A mark-up language allows you to store any kind of information separated from the text in an unambiguous way. This means you can easily recognise it, and if you don't need it you can simply remove it from the text with a computer program, without any worries that you might also remove parts of the text at the same time. A mark-up language typically reserves a number of characters which you then cannot use in your data unless they are 'escaped', i.e. replaced by a certain code. For example, using HTML, the hypertext mark-up language, if you want to write a web page about your R&D unit you have to write 'R&amp;D', where amp is the entity name of the ampersand character, and the ampersand and the semicolon are used to separate it from the other text.

The de facto standard format for encoding corpora is SGML, which is actually a language for defining mark-up languages. SGML is a well-established standard (ISO 8879), and as it cannot easily be changed because of this, software developers have written quite a few tools to work with it in the knowledge that it is here to stay and thus the time and effort they invest will be worth it. And indeed you can easily find a large number of programs, a lot of them very cheap or even free, that will allow you to work with SGML data. SGML can be used to specify a set of tags and make their relationships explicit in a formal grammar which describes valid combinations of tags. For example, you could specify that a 'chapter' tag can contain one or more 'sections', which in turn contain 'paragraphs'. If you try to put a 'chapter' tag inside a 'section' you violate the structural integrity of the data, and a piece of software called a parser will tell you that there is an error in your mark-up. SGML uses the angle brackets ('<' and '>') and the ampersand ('&') as special characters to separate your tags from the text, though that could be changed in a mark-up language definition.

The Text Encoding Initiative (TEI) has spent a considerable amount of effort on producing guidelines for marking up texts for scholarly use in the humanities. These text encoding guidelines are available on the Internet (see section 12.2 on resources), and there is even a 'TEI Lite', a version stripped down to the bare essentials of marking up textual data. The TEI guidelines provide a standard format, but that does not mean that a corpus will be marked up in all the detail provided by them. There is a special customised version of these guidelines, the Corpus Encoding Standard (CES), which defines different levels of annotation, according to the detail expressed in the mark-up. Being a customised version of the TEI guidelines, CES is TEI conformant.

For the newcomer SGML is not too easy a field to start working in, as there is a lot of terminology, a large number of opaque acronyms, and programs which tend to be reluctant to deal with your data, giving cryptic error messages as an excuse. The reason for that is that SGML is very powerful, but also extremely complex. As a result, software is slow and difficult to write, which has prevented SGML from being widely used for about the last ten years.
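As a small illustration of the point that unambiguous mark-up can be removed mechanically, the following Java sketch strips anything between angle brackets from a line of text using a regular expression. This is a simplification: it assumes that a '>' never occurs inside a tag value and it does not translate entities such as &amp; back into characters. The marked-up sentence is an invented example:

    String markedUp = "<s>The <w pos=\"DT\">the</w> example</s>";
    // remove everything from an opening '<' up to the next '>'
    String plain = markedUp.replaceAll("<[^>]*>", "");
    System.out.println(plain);   // prints "The the example"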
However, there is a wide variety of mark-up languages that have been defined for all possible areas of academia and commerce, the best known of which is HTML. HTML is a mark-up language for hypertext documents, widely used on the Internet for formatting web pages. It follows the standard SGML conventions regarding tag format.
HTML
Part of the success of HTML is that it is simple. Its limited power and resulting ease of use have contributed greatly to the success of the world wide web. When you are downloading web pages, they will be marked up in HTML, so you could conceivably use it for marking up other corpus data as well. HTML was developed originally for the basic presentation of scientific articles, with extensions to allow more general formatting of information, and thus it contains a number of tags which allow you to mark paragraphs and headings, and even cross-references, but there is no tag for the kind of linguistic information we might want to annotate, such as the ergative verbs mentioned above. Of course you could use one of the existing tags for font-change or italic and bold printing, but that only gives you a limited range of tags, and there is no obvious link between the tag and its meaning. If you just add your own tags, your data is no longer HTML conformant and no browser would know how to handle those tags.
For a solution to that dilemma we need to go back one step and look at what HTML is itself. HTML is an application of SGML. It is a definition of a language (L) to mark up (M) hypertext (HT) documents. But a corpus is not primarily a hypertext, and so you might want to use another SGML-based format to mark it up. This other format has been designed to be a simplified subset of SGML, easy to use for the human user, flexible, powerful, and easy to process for the computer. The outcome of that design process is XML, the eXtensible Mark-up Language. It is a lot easier to write programs that can process XML than it is to do the same for SGML, but SGML conformant documents can easily be converted to be XML conformant, so there is no compatibility issue. The constraints on XML are less strict, as we will see in chapter 8, when we write a couple of tools to work with XML data. The corpus encoding schemes that have been developed for SGML are now also available for XML (see XCES in the appendix), and XML looks like it is going to be a major breakthrough in mark-up.
For all these reasons it is advisable to use XML to store your data; it's extensible, so you can just add your own tags if you have to, and there is also a lot of software available to process it, despite the fact that its specification is still considerably less settled than SGML. XML will be described in more detail in chapter 8, where we will develop a couple of programs for handling texts which are encoded in XML format. Another reason for using XML is its future development. Many people see it as the mark-up language which will transform the way data is stored and exchanged world-wide, so knowing XML is a useful skill in itself.
Annotating data is a complex topic, worthy of its own book(s). For our purposes we will keep things simple, as we are concentrating on the operations we want to perform on the data. For examples further on in the book we will be using plain text,
with no annotations, stored in a single file on the computer's hard disk. However, if you decide to add annotations to any data you are collecting, it is worth using XML as the mark-up format. One of the sample applications we will be looking at in the final part of this book will allow you to process such data.
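As a brief illustration of what XML-annotated corpus data can look like, a part-of-speech tagged sentence might be stored like this (the element and attribute names are invented for this example and are not taken from the TEI or CES):

<s>
  <w pos="DET">the</w>
  <w pos="NOUN">cat</w>
  <w pos="VERB">sat</w>
</s>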
3.3.3 Error Correction

Another point is worth mentioning in connection with mark-up: it is unavoidable that there will be typographical errors in a corpus, as all methods to enter data are prone to the introduction of glitches. Whether there is a smudge on the page that is scanned in, or just a momentary lapse of the keyboarder, errors will always creep in. Unless the amounts of data involved are very small, it is generally not worth correcting those, as too much time would be spent on this which could probably be used for more productive tasks. Furthermore, even sending the data through a spell checker is not guaranteed to make it any better, as rare words might not be recognised and changed, while other errors might happen to coincide with the spelling of other, correct words and remain uncorrected.
There is, however, a kind of error which should be corrected, namely faulty mark-up. Unlike natural language, mark-up has a strict grammar, which can easily be checked automatically. So a simple computer program (like the one we will develop in chapter 8) can quickly tell you whether there are any inconsistencies, and more importantly where they are. However, this applies only to the syntactic well-formedness, as the computer cannot check whether the mark-up has been applied correctly on the semantic level.
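To give a flavour of how such a syntactic check can work, here is a minimal sketch (this is not the program from chapter 8; it only checks that start and end tags of the form <tag> and </tag> are properly nested in a hard-coded string, using a simple stack):

import java.util.Stack;

class TagChecker {
    public static void main(String args[]) {
        String text = "<text><p>Hello <b>world</b></p></text>";
        Stack open = new Stack();
        int pos = text.indexOf('<');
        while(pos != -1) {
            int end = text.indexOf('>', pos);
            String tag = text.substring(pos + 1, end);
            if(tag.startsWith("/")) {
                // an end tag must match the most recently opened start tag
                if(open.empty() || !open.pop().equals(tag.substring(1))) {
                    System.out.println("Mark-up error near position " + pos);
                }
            } else {
                open.push(tag);
            }
            pos = text.indexOf('<', end);
        }
        if(open.empty()) {
            System.out.println("Mark-up looks well-formed");
        } else {
            System.out.println("Unclosed tag: " + open.peek());
        }
    }
}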
3.4 COMMON OPERATIONS

In this section we will investigate some of the basic operations which are usually applied to corpus data. These operations will later be implemented in Java, so it is important that you have an idea of how they work, and what they can be used for. All of them are described in more detail in the accompanying books of the series, namely McEnery and Wilson (1996) and Barnbrook (1996), with the latter having an emphasis on how they can be used in linguistic research.
3.4.1 Word Lists

A word list is basically a list of all word types of a text, ordered by some criterion. The most common form is that of a word-frequency list, where the words are sorted in descending order of frequency, so that the most common words are at the top of the list, and the rare ones at the bottom. A word-frequency list can give you an idea of what a text is like: if it is a first-person narrative, you would expect to see a lot of occurrences of the personal pronoun 'I', and certain other often-used words might reflect what the text is actually about. They can also be used to compare different texts by working out which words are more frequently used in one text than another.
Figure 3.2: Concordances for the word 'word'

An alternative way of ordering a word list is alphabetically. Here you could for example see which inflected forms of a lemma's paradigm are used more often than others.
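Before moving on, here is a minimal sketch of how a word-frequency list can be computed in Java (it counts the words of a hard-coded sample string; reading text from a corpus file is covered in chapter 6, the tokenisation is deliberately crude, and the output is not yet sorted by frequency):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;

class WordFrequency {
    public static void main(String args[]) {
        String text = "the cat sat on the mat and the dog sat on the cat";
        Map counts = new HashMap();

        // split the text into words and count each word type
        StringTokenizer tokens = new StringTokenizer(text);
        while(tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            Integer count = (Integer) counts.get(word);
            if(count == null) {
                counts.put(word, new Integer(1));
            } else {
                counts.put(word, new Integer(count.intValue() + 1));
            }
        }

        // print each word type with its frequency
        Iterator it = counts.keySet().iterator();
        while(it.hasNext()) {
            String word = (String) it.next();
            System.out.println(word + "\t" + counts.get(word));
        }
    }
}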
3.4.2 Concordances
Looking at a word list of a text might be useful for getting a first impression of what words are used in the text, and possibly also the style it is written in, but it necessarily has to be rather superficial and general. It is far more useful to look at the individual words themselves, analysing the local context they are being used in. This allows you to analyse the way meaning is constituted by usage. The tool for doing that is the concordance. Concordances have been around long before corpus linguistics, but only with the advent of computers and machine-readable corpora has it become feasible to use them on an ad-hoc basis for language research. Previously concordances took years to produce, and were only compiled for certain works of literary or religious importance. However, it is now possible to create exhaustive concordances for sizeable amounts of texts in a few seconds on a fairly standard desktop computer. There are basically two types of concordances, KWIC and KWOC. These two acronyms stand for keyword in/out of context, and refer to the way the visual presentation is laid out: KWIC, which is by far the more common of the two, consists of a list of lines with the keyword in the centre, with as much context to either side of it as fits in the line. KWOC, on the other hand, shows the keyword on the margin of the page, next to a sentence or paragraph which makes up the context. In figure 3.2 you can see an example concordance in KWIC format. This is an extract from a concordance of the word word in the text of this book (before this figure was added to it). KWIC displays allow quick and easy scanning of the local contexts, so that one can find out what phrases and expressions a word is being used in, or what adjectives
modify it if it is a noun. For this purpose most of the time only a narrow context is required. As soon as the context exceeds a single line on a printed page or a computer screen, it loses its big advantage, namely that the keyword (or node word) can be instantly located in the middle of it. Most computer software for corpus analysis provides a way of producing a KWIC display, while the KWOC format is not really used much in corpus linguistics.
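A minimal sketch of the KWIC idea is shown below (it works on a hard-coded array of words rather than a real corpus, the context width of four words on either side is an arbitrary choice, and no attempt is made to align the node word in a fixed column):

class KwicDemo {
    public static void main(String args[]) {
        String[] words = { "a", "word", "list", "is", "basically", "a",
                           "list", "of", "all", "word", "types", "of",
                           "a", "text" };
        String node = "word";    // the node word we are looking for
        int span = 4;            // number of context words on either side

        for(int i = 0; i < words.length; i++) {
            if(words[i].equals(node)) {
                StringBuffer line = new StringBuffer();
                for(int j = i - span; j <= i + span; j++) {
                    if(j >= 0 && j < words.length) {
                        line.append(words[j]);
                        line.append(' ');
                    }
                }
                System.out.println(line.toString());
            }
        }
    }
}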
3.4.3 Collocations
By sorting concordance lines according to the words to the left or right of the node word it is very easy to spot recurrent patterns in the usage of the node word. However, if you have more than a couple of hundred lines to look through, it becomes very difficult not to lose sight of the big picture. One way to reduce large amounts of concordance lines to a list of a few relevant words is to calculate the collocations of a node word. The basic concept underlying the notion of collocation is that words which are important to a node word will tend to occur in its local context more often than one would expect if there was no 'bond' of some sort between them. For example, the word cup will be found in the context of the word tea more often than the word spade, due to the comparatively fixed expression cup of tea and the compound tea cup. There is no particular reason why spade should share the same link to tea as cup.
Collocations are not limited to straightforward cases like this; quite often a list of collocates comes up with a few words one would expect, but also with a number of words which are obvious once you think about them, albeit not ones a native speaker would come up with using their intuition. This fact makes collocation a useful tool for corpus analysis, as it helps to unearth subliminal patterns in language use of which speakers are generally not aware. A number of studies (e.g. Stubbs (1995)) show how certain words (e.g. the apparently neutral verb to cause) are always used with predominantly negative words, even though the word itself does not actually have any negative connotations when taken out of context.
One important aspect when working with collocations is how they are evaluated: there is a list of almost a dozen different functions which compute significance scores for a word given a few basic parameters such as the overall size of the corpus, the frequency of the node word, the frequency of the collocate, and the joint frequency of the two (i.e. the count of how often the collocate occurs in the context of the node word). Some of these functions yield similar results, but they all have certain biases, some towards rare words, some towards more frequent words, and it is important to know about these properties before relying on the results for any analysis.
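As an example of such a function, the following sketch computes a pointwise mutual information score, one commonly used significance measure, from the four parameters just listed (the frequencies used here are invented, and the other functions mentioned in the text would be computed from the same parameters in a similar way):

class MutualInformation {
    public static void main(String args[]) {
        double corpusSize = 1000000;   // N: total number of tokens in the corpus
        double nodeFreq = 500;         // frequency of the node word
        double collFreq = 800;         // frequency of the collocate
        double jointFreq = 40;         // how often the collocate occurs near the node word

        // expected joint frequency if the two words occurred independently of each other
        double expected = (nodeFreq * collFreq) / corpusSize;

        // mutual information: log2 of the ratio of observed to expected frequency
        double mi = Math.log(jointFreq / expected) / Math.log(2.0);
        System.out.println("MI score: " + mi);
    }
}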
3.5 SUMMARY
In this chapter we have looked at the first step of corpus exploration, building your own corpus. There are several ways of acquiring data from a variety of sources, but most often you will not have much choice when you want to work with specific text material. It is important to design your corpus structure in an appropriate way
if you want to fully exploit external information related to the texts. Certain avenues of exploration will be much easier if you spend some thought on either naming conventions for data files or a directory structure to put your files in. Annotations play an important role when you want to incrementally enrich your data with more information. This is useful if you want to search for linguistic events which are not lexicalised, i.e. where you cannot simply retrieve instances of a certain word form. There are many different ways annotations can be stored in a corpus, and it is best to follow the emerging standards, using XML and the TEI guidelines or the CES. Much work has been done on developing general and flexible ways of marking up corpus data, so it would be a waste of time trying to reinvent the wheel.
Finally we've looked at the most common operations that are used to analyse corpus data. Word lists are conceptually easy and can be produced very quickly. They can give an instant impression of what a text is about, or in what way it is different from other texts. Slightly more complex are concordance lines. While word lists deal with types, concordances show tokens in their immediate environment. Sorting concordance lines according to adjacent words can give useful cues as to fixed expressions and phrases a word is used in. Collocations are the next step up from concordance lines. Here we take all words which occur within a certain distance from the node word (the so-called collocates) and assign a score to each according to how likely it is that they are near the node word simply by chance. Several functions exist to compute such scores, and they come up with interesting results. Once we have covered the basics of Java programming, we will start developing some applications to perform the operations described in this chapter.
4 Basic Java Programming

Before we can start writing programs in Java we will need to cover some more ground: the basics of object-oriented programming. We have already heard that Java is an object-oriented programming language, and now we are going to see what that means for writing programs in it.
4.1 OBJECT-ORIENTED PROGRAMMING
Object-oriented programming, or OOP, was developed in order to make it easier to design robust and maintainable software. It basically tries to provide an easy-to-understand structure or model for the software developer by reducing the overall complexity of a program through splitting it into autonomous components, or objects. OOP is a topic which in itself is taught in university degree-level courses, but for our purposes we will limit ourselves to the aspects of it which are relevant for developing small to medium-scale software projects. The central terms in OOP are class and, of course, object. In the following section we will discuss what they are and how you can use them in programming. We will also learn how Java specifically supports the use and re-use of components through particular conventions.
4.1.1 What is a Class, What is an Object?
Any computer program is constructed of instructions modelling a process, like for example the way a traffic light at a junction works, how tomorrow's weather can be predicted from today's temperature, humidity and air pressure, and so on. These are either copies of real world systems (the traffic light), or theoretical models of those (the weather prediction). In corpus linguistics we can for example implement models of part-of-speech assignment, lexical co-occurrence, and similar aspects of texts. However, we are generally not interested in all aspects of reality, so in our model we restrict ourselves to just those aspects which are deemed important for our application. We thus create an abstraction, leaving out all those irrelevant details, making the model simpler than the real world. So, by creating a computer program we effectively create a model of what is called the domain we are dealing with. In this domain there are objects, for example words, sentences, texts, which all have certain properties. Now, if a programming language has objects as well, it is fairly easy to produce a model of the domain as a program. Writing a program is then easily separated into two phases, the design
phase, where you look at your domain and work out what objects it contains and which of them are relevant to the solution of your problem, and the implementation phase, where you take your design to the machine to actually create the objects by writing their blueprints as Java source code. The important word in the previous sentence is blueprint. You don't actually specify the objects, but templates of them. This makes sense, as you would usually have to deal with more than one word when analysing a text, and so you create a Word class, which is used as a template to create the actual word objects as needed. This template is called a class, and classes are the foundation underlying object-oriented programs. A Java program consists of one or more classes. When it is executed, objects can be created from the classes, and these populate the computer's main memory.
4.1.2 Object Properties
An object has a set of properties associated with it. For example, a traffic light has a state, which is its current light configuration. When you model a traffic light, you need to keep track of this. However, you wouldn't really need to know how high it is, or what colour the case is painted in. These properties are not relevant if you only want to model the functionality. If, on the other hand, you want to create a computer animation of a junction, you would probably want to include them. When designing your class, you need to represent those properties, and the way to do that is to store them in variables. Each class can define any number of variables, and each object instance of the class will have its own set of these variables. What you need to do when you're designing the class is to map those properties on to the set of available data types. We have already come across the simple data types in chapter 2, but classes are data types as well, so you can use other classes to store properties of your objects. For our traffic light example we would want to store the state as a colour, and there are several options for how we can represent this:

- we could assign a number to each possible colour and store it as a numerical value, e.g. red is 1, amber is 2, and green is 3.
- we could use the first letter of each colour and store it as a character value, e.g. 'r', 'a', and 'g'.
- we could use String objects storing the full name, i.e. "red", "amber" and "green".
- we could define a special class, Colour, and have different objects for the respective colours we need for the traffic light.
All of these have their own advantages and disadvantages. From a purist's point of view the final option would be best, but it also is a bit more complex. We will come back to this point further on.
4.1.3 Object Operations

Apart from properties, classes also define operations. These allow you to do something with those properties, and they model the behaviour of your domain object in
terms of actions. For example, a traffic light can change its state from red to green, and this could be reflected in an appropriate operation. In programming these are called either methods or sometimes functions. In this book we will refer to them as methods. A method consists of a sequence of instructions. These instructions have access to the properties of the class, but you can also provide additional information to them in the form of parameters. Methods can be called from within other methods, and even from within other classes. We will see when this is possible later on when we are talking about accessibility. Apart from passing information to a method, a method can also return information back to the caller. Typically a method will work on the object's properties (e.g. a lexicon) and the parameters which have been passed to it (e.g. a word to be looked up), compute a value from them (e.g. the most likely word class of the word in the lexicon), and return it. The returned information is in the form of a variable, and you will have to define what data type this variable is (e.g. String). A method must have a single return data type, which is specified in its declaration. If you do not want to return any data, you can use the keyword void to indicate that. Methods don't have to have any link to the class, but the whole point of OOP is to bundle related things together. Therefore a class should only contain properties and methods which actually have something to do with what the class is modelling.
One last point needs mentioning: apart from ordinary methods, a class has a special type of method, the constructor. A constructor is a special method to initialise the properties of an object, and it is always called when a new object is created. It has as its method name the name of the class itself, and it has no return value. As it implicitly creates an instance of the class, you don't specify a return value. A class can have more than one constructor with different parameters.
4.1.4 The Class Definition
We will now look at how we can implement a traffic light as a class in Java. There is a fixed way in which a class is declared in Java. It is introduced by the keyword class followed by the class definition enclosed in curly brackets. A class needs to have a name, which by convention is spelled with an upper case letter and internal capitalisation; thus our example class would be named TrafficLight. We need to save this class in a file "TrafficLight.java" so that the Java compiler can handle it properly. For better documentation it is always a good thing to put the name of the file in a comment at the beginning and the end. A comment is a piece of text intended for adding notes to the source file which are to be ignored by the compiler. In this example they are marked by the sequence /* ... */; everything between these two pairs of characters is ignored. So we start with:

/*
 * TrafficLight.java
 */
class TrafficLight {

} // end of class TrafficLight
We will now add a property to that class, namely the state of the light. To represent this we'll choose a String object, which has the benefit that it is easy to understand without further explanation, unlike a numerical representation where we would require a legend explaining what number corresponds to what colour. It also avoids the temptation of performing arithmetical operations on them, which would not make sense as the numbers don't stand for numerical values, but are just symbols. Properties are declared by a data type followed by a variable name. If we call our light configuration state, the revised class looks like this:

/*
 * TrafficLight.java
 */
class TrafficLight {
    String state;

} // end of class TrafficLight
Variables are conventionally put at the top of the class definition, before the method declarations. While class names are spelled with an initial capital letter, variables start with a lower case letter, but can have internal capitalisation. This convention allows you to easily distinguish between a class name and the name of a variable. You don't have to follow this convention, but as it is generally accepted it would be good to do so for consistency. Next we want to add a constructor to our class. As a traffic light always shows a certain colour (unless it is switched off) we want to use this as a parameter. Constructors are usually next in line after the variable declarations:

class TrafficLight {
    String state;

    TrafficLight(String initialState) {
        state = initialState;
    }

} // end of class TrafficLight
The constructor has to be called TrafficLight, just like the class. It takes a String argument, which we call initialState. When we want to create an instance of the TrafficLight class, we have to provide this value. The JVM allocates space for the object in memory, and then executes the constructor. There we simply assign the initial state value to the state variable, so that it now has a defined value that we can access from other methods. Next we want to add a method that changes the state. For this we don't need any parameters, as the state can only advance in a well-defined sequence. We simply check what the current state is, and we change the state according to what it should be next. For that we're using a chain of if statements (see section 2.2.2).
To compare String objects we can't just use the double equal sign, as that would compare the objects, and not the contents of the objects. However, when comparing strings what we want to know is whether they have the same sequence of characters as their content. Thus we have to use another method, equals(), to check for equality of content. This is explained in more detail in section 5.3.3. The changeState() method looks like this:

void changeState() {
    if(state.equals("red")) {
        state = "red-amber";
    } else if(state.equals("red-amber")) {
        state = "green";
    } else if(state.equals("green")) {
        state = "amber";
    } else if(state.equals("amber")) {
        state = "red";
    }
}
This method checks what the current value of state is, and then assigns a new value to it. It does not return a value, so we declare it as void, and it takes no parameters, so we simply leave the round brackets empty. You cannot leave them out, as they are part of the syntax of method declarations. All we need now is a way of finding out what colour the traffic light is actually displaying right now. We introduce another method, getColour(), which will return the current state of the traffic light. This method looks like this:

String getColour() {
    return(state);
}
It returns a String value, so we specify this by putting the data type before the method name. You can put it on a separate line, which makes it easier to find the method name (especially if you have larger classes), as it always starts on the first column. We return a value using the return() statement. It exits the method, so any further instructions after it would be ignored; in fact, in this situation the compiler issues a warning and might not compile your class if there are any further statements following the return(). This class is now fully functioning, and can be used to create TrafficLight objects. You can create an instance of it, you can change its state, and you can then also query what its current state is. If we want to make use of this, we would need other classes, for example Junction or Crossing, which would make use of that class. For testing purposes, however, we want to be able to run the class on its own. To execute a class as a program, you need another special method called main(). Whenever you tell the Java interpreter to run a class, it tries to locate a main() method in it, and starts executing it. This method has a special signature, or declaration, which includes a few keywords which we will explain later on:
public static void main(String args[]) {
    TrafficLight myLight = new TrafficLight("red");

    for(int i = 0; i < 6; i++) {
        System.out.print("My TrafficLight is: ");
        System.out.println(myLight.getColour());
        myLight.changeState();
    }
}
At the beginning we create a traffic light, myLight, which initially is red. We do this with the new statement. Unlike non-object data types (e.g. int variables) you need to do this explicitly, as creating objects involves a lot more internal work for the JVM. For a primitive type all that is needed is to allocate some storage space, but with an object the JVM also has to call the constructor to make sure it is properly initialised. Furthermore, objects are stored in a different area in memory, which involves some additional internal housekeeping. While you need to explicitly create objects, you don't have to worry about cleaning them up once they are no longer needed. The JVM keeps track of all the objects which are still in use, and those that aren't will be reclaimed automatically through the so-called garbage collection. After having created our traffic light, we then enter a loop in which we print out the current state of the light, and then change the state. After six iterations of the loop we finish. The System.out.print() statement prints its argument on the screen, and System.out.println() does the same but with an added line break at the end. We will discuss output in more detail in chapter 6. The whole class now looks like this:

/*
 * TrafficLight.java
 */
class TrafficLight {
    String state;

    TrafficLight(String initialState) {
        state = initialState;
    }

    void changeState() {
        if(state.equals("red")) {
            state = "red-amber";
        } else if(state.equals("red-amber")) {
            state = "green";
        } else if(state.equals("green")) {
            state = "amber";
        } else if(state.equals("amber")) {
            state = "red";
        }
    }

    String getColour() {
        return(state);
    }

    public static void main(String args[]) {
        TrafficLight myLight = new TrafficLight("red");

        for(int i = 0; i < 6; i++) {
            System.out.print("My TrafficLight is: ");
            System.out.println(myLight.getColour());
            myLight.changeState();
        }
    }

} // end of class TrafficLight
If you type this class in, save it as TrafficLight.java, compile it with javac TrafficLight.java, and run it with java TrafficLight, you will see the following output:
My TrafficLight is: red
My TrafficLight is: red-amber
My TrafficLight is: green
My TrafficLight is: amber
My TrafficLight is: red
My TrafficLight is: red-amber
You can see that it repeats its states, so it seems to be functioning properly.
4.1.5 Accessibility: Private and Public
One important notion with classes is that of the outside. You have a class, and everything else outside the class. On the outside there can be complete chaos, but if it is properly designed, you can bank on your class being stable, robust and reliable. The key to this security is not to let anybody from the outside touch your data. Unfortunately such a class might be safe and secure, but it would not actually be useful, as it would not be able to provide any services without an interface to the outside world. So we need to have some entry points that are accessible from the outside. These entry points are summarised in the Application Programming Interface, or API. The API of a class gives you an overview of all the services available from that class, and it also tells you how to access them. Without any information on the API of a class you might as well not have the class available at all, as it would be completely useless.
We learned earlier in this chapter that a class has properties and methods. In principle the API would include both of these, but in order to guarantee internal consistency the data is usually hidden and can only be accessed via specific methods. Consider the following change to the main() method of our traffic light:

public static void main(String args[]) {
    TrafficLight myLight = new TrafficLight("red");

    for(int i = 0; i < 6; i++) {
        if(i == 3) myLight.state = "blue";
        System.out.print("My TrafficLight is: ");
        System.out.println(myLight.getColour());
        myLight.changeState();
    }
}
What we are doing here is directly assigning a new value to the state variable of the myLight object when the loop variable reaches the value 3. This changes the output to look as follows:

My TrafficLight is: red
My TrafficLight is: red-amber
My TrafficLight is: green
My TrafficLight is: blue
My TrafficLight is: blue
My TrafficLight is: blue
You can see that once it has turned to 'blue', the traffic light never changes, as there is no defined follow-up to a blue light. This is clearly something we want to avoid, as we cannot foresee all possible colours somebody might want to assign to a traffic light. Instead we want to control tightly the way the state can change, and the only way to enforce that is to forbid external access to the state variable. This is known as data-hiding, which means you don't expose the properties of your objects to the outside world, but instead hide them behind methods. In those methods you can perform extra checks to make sure your object retains a consistent state. This is something you should also do in the constructor, otherwise one could simply write:

TrafficLight myLight = new TrafficLight("blue");
and get into the same trouble. The easiest way to avoid such potential problems is to make sure the value is valid before assigning it. We will now see how this can be done. Let's assume you want to add the capability of setting the traffic light's state to any value, without going through all intermediate states. You don't want to allow direct access to the variable holding the state, to avoid potential disasters involving blue lights. So you write a method setState() which takes the new state as a parameter, and performs some checks on it before it actually assigns it:

void setState(String newState) {
    if(newState.equals("red") || newState.equals("red-amber")
            || newState.equals("green") || newState.equals("amber")) {
        state = newState;
    } else {
        System.out.println("Illegal state value");
    }
}
Here we first compare the new state with all allowed values, and only if a match was found we take on board the change. Otherwise we print out a message and keep the state as it was. You could also call this method from the constructor. There it would be a bit more complicated, as the state has not yet been assigned a value, so the easiest way to deal with this would be to assign it a default value before attempting to set the value given by the user. The modified constructor would then look like this:
TrafficLight(String initialState) {
    state = "red";
    setState(initialState);
}
If the initialState value was not valid, the traffic light would simply remain 'red'. But how do we prevent malicious (or ignorant) users of our class from by-passing the setState() method and assigning a value directly to the variable anyway? For this the Java language provides accessibility modifiers. There are several of those, but we will only look at the most important, private and public. These modifiers can be applied to both properties and methods (see for example the main() method above, which needs to be declared public). If you declare something as private, it can only be accessed from within the class itself. So our revised version of the TrafficLight class would contain the following variable declaration:

private String state;
The methods on the other hand would be declared public, as they would constitute the public interface, or API, that other classes would use to work with. Private methods are not visible from the outside, just like private variables. They are entirely contained within the class, and they can only be accessed from other methods of the class. The main reasons for having private methods are to factor out common pieces of code which are shared by multiple methods, or to provide auxiliary methods which are not directly related to the class' functionality. In that case you wouldn't want them to clutter up the API, and you can hide them by making them private. To summarise this point about access restrictions, you want to keep as much of your actual class hidden from the outside in order to reduce the complexity of interactions. Instead you would want a limited number of clearly defined interfaces to a number of 'services' your class provides. All these services are implemented as public methods, and all others should be kept private.
4.1.6 APIs and their Documentation

When you are re-using other people's classes (or any of the large Java class library for that matter) you need to know what services (i.e. public methods or constants) they provide, and what you need to do in order to access them. This is quite a vital point, because if you don't know that something exists it might as well not be there. Therefore the designers of Java have included a special mechanism to document methods in a way which can automatically be turned into nicely organised web pages directly from comments that you have included in your source code.
The basic problem with documentation is that it rarely happens. Programmers are too busy programming and don't want to waste precious time writing documents that go out of date every time they change something in the software or that get lost as soon as you hand them out to a user. The only thing they might just about be prepared to do is to put comments into the code which explain what goes on to programmers who later might have to maintain the software.
Comments can be added to Java source files in two forms: two slashes at any point on a line mark everything until the end of the line as a comment. This could look for example like this:

tokens = tokens + 1;    // add one to the number of tokens
The last character of the source line is the semicolon; all spaces and the text from the two slashes to the end of the line are ignored. If you want comments to span more than one line, you need to enclose them in /* ... */, like this:

/* Add one to the number of tokens.
   After this instruction the number of tokens
   is one larger than before. */
tokens = tokens + 1;
In this comment style you need to be careful not to forget the closing sequence, otherwise you might find that large parts of your program are being treated as comments. This, in fact, is one of the more frequent uses of this comment style: to comment out chunks of source code that you don't need any more, but that you still don't want to delete as yet. Both comment styles can be used together at any time.
Comments are explanations meant for human consumption, which is why they are marked in the source code so that the compiler can ignore them when translating the program. However, a special tool, called javadoc, can extract comments from the source code and create documentation out of it. Whenever something in the program changes, all that is necessary to update the documentation is to make sure that the comments get changed. This is not much effort, as the comments are usually right next to the source code which has been changed. Once that has been done, a single run of javadoc can then update the on-line documentation.
In order to make the extraction task easier for the javadoc tool, the documentation comments have to follow certain conventions: they need to start with a '/**' sequence (whereas a normal comment only requires one asterisk), and certain keywords are marked by '@' signs. Let's look at an example:

/**
 * Look up a word.
 * This method looks up a word in the lexicon. If it cannot be found
 * initially, a second attempt is made with the word converted to lower
 * case.
 * @param word the word to look up.
 * @return true if the word is in the lexicon, false otherwise.
 */
public boolean wordLookup(String word)
57
INHERITANCE
@author @version @par am @return @throws @see
The name of the class' author A version number (or date) Parameter of a method The return value of a method Exception thrown by the method Crossreference to other method
Table 4.1: Javadoc comment tags The set of all public methods of a class is called its API, and the generated documentation is therefore referred to as API documentation. Together with a brief general description of a class, the API documentation should enable you to use it, even though it is sometimes not that easy, especially with classes that are either complex or badly designed, or both. We will have a look at the actual format of the documentation comments that you have to follow for j avadoc to work when we start developing our first class in the next chapter.
4.2 INHERITANCE Another important aspect of OOP is the ability to derive new classes by extending existing classes. You simply state which class you want to extend, and then you add more methods and variables to it. You can use this to create specialised versions of classes, as we will see in chapter 6, where we extend some existing classes for reading from files to tum them into concordancing tools. Apart from just adding more methods, you can also overwrite existing methods with different functionality. As a simple example consider a pedestrian crossing light. This only has a red and a green light, so we can do away with the transitions to amber and red-amber that we had to handle in the TrafficLight class. So we simply write class PedestrianLight extends TrafficLight { public void changeState() { if (state. equals ("red") ) state
=
"green";
J else { state ;;;; "red";
II end of class PedestrianLight
We don't have to repeat all the other methods, as they are inherited from the superclass, TrafficLight. A PedestrianLight has exactly the same behaviour, the only difference being that it has fewer states. You can even assign an instance of the PedestrianLight class to a TrafficLight variable: they are in an 'is a' relationship, just like a dog is a mammal. We will see how we can use this in later chapters.
In fact, all classes are arranged in a hierarchy, which means they are all derived from some other class, apart from the class at the top. This class is Object, the most general class. All other classes are directly or indirectly derived from it, and if you write a new class it is implicitly extending Object unless you specify another class (like the PedestrianLight) from which it is derived.
4.2.1 Multiple Inheritance
In Java you can only extend one class at a time, but what if you want to combine two or more classes for joint functionality? In C++, for example, a class can have more than one ancestor class, but this is not allowed in Java for very good reasons. One reason is that it keeps the class hierarchy simple: instead of having a complex network of classes you have a straightforward tree, where each class has exactly one path back to the class Object which is at the root of this tree. The other one is that it avoids potential clashes: if the two parent classes both have methods of the same name, which one should be chosen? If there is only one parent this problem does not arise.
But there is also a way of more or less having multiple inheritance in Java. Instead of inheriting a method's implementation you can state that a class simply has a method of that name. This might sound a bit pointless, but is in fact extremely useful. Quite often a class can serve multiple functions in different contexts. The constraint of single inheritance would make it difficult if not impossible to realise this. Let's assume, for example, that you have a class which models things using electricity. They have a property 'required voltage', and methods switchOn() and switchOff(). Our PedestrianLight would ideally extend both this ElectricItem class and the TrafficLight, as it shares features with both. This, however, is not possible in Java, as there can only be a single super-class.
The way you do this is through so-called interfaces. An interface is declared just like a class, only you use the keyword interface instead of class, and all the method bodies are empty. That means you declare a method's signature, but you don't provide the instructions with it. You use the extends keyword to extend a class, and the implements keyword to implement an interface. This means that you need to provide implementations for the methods declared in the interface. Effectively you can only inherit the implementation from one class, but the functionality of as many interfaces as you care to implement. Interfaces are useful when you have different classes that provide the same service, and you want to be able to exchange them for each other without having to change anything else. An interface can be used to 'isolate' the common functionality, and then you can treat all the classes implementing the interface as if they had the same data type. We will see how this can be used later.
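As a sketch of what this might look like for the example just described (the interface name, method names and the 230-volt figure are all invented for illustration; only the interface/implements mechanism itself is the point here):

interface ElectricItem {
    int requiredVoltage();
    void switchOn();
    void switchOff();
}

class PedestrianLight extends TrafficLight implements ElectricItem {

    private boolean on = false;

    PedestrianLight(String initialState) {
        super(initialState);
    }

    // changeState() is overridden as shown in section 4.2 and omitted here

    // the three methods below are required by the ElectricItem interface
    public int requiredVoltage() {
        return(230);
    }

    public void switchOn() {
        on = true;
    }

    public void switchOff() {
        on = false;
    }
}

A Junction class could then treat a PedestrianLight either as a TrafficLight or as an ElectricItem, depending on which aspect of it is relevant at the time.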
4.3 SUMMARY
In this chapter we have had a brief introduction to object-oriented programming. We have learned how to model our problem using classes which represent entities in the domain we are dealing with, abstracting away all the aspects which are not relevant
to the solution. We have also learned what the relationship is between classes and objects, and that objects encapsulate data and operations related to them. At the end we have seen how to extend classes using inheritance. More common than direct inheritance is the use of interfaces to specify a certain functionality in form of a set of methods. Interfaces allow you to separate different views of an object, and to treat it under different aspects. This is especially useful as objects grow more complex and rich in features.
5 The Java Class Library

In this chapter we will first briefly cover the way different parts of the Java class library are organised. Related classes are arranged in so-called packages, and understanding the mechanics of those is fundamental for Java programming. After that we will have a brief look at how Java deals with errors, before we then look in detail at one of the most important classes of the Java language, the String class, which is used to represent sequences of characters. We'll investigate the API of the String class, with examples of how the individual methods it provides can be used. Next we'll look in slightly less detail at some other useful classes which you will use frequently when writing your own programs, and which are vital if you want to look at existing programs written by other people in order to understand how they work. Some of these classes, however, have been made superfluous by the collection framework, which was introduced in version 1.2 of the JDK. We will look at this framework in the final section.
This chapter will contain a few examples to illustrate the usage of the classes described in it. Its main purpose is to introduce those classes and to prepare the ground for using them in the following chapters. For most examples we require external input, which we will be dealing with in the next chapter. So far we have mainly looked at the theoretical background and the principles of object oriented programming, and this chapter is going to make the transition to applying those principles in practice.
5.1 PACKAGING IT UP

5.1.1 Introduction

A Java application consists of a number of classes, and through modular design and re-use this number can be quite large. There are classes available for a wide variety of tasks, and it would only be a question of time until two classes were created that had the same name. This would cause a rather difficult situation, namely how you can tell the computer which of two identically named classes you would want to use. As this problem can easily be foreseen, there is a solution built into the language: packages. A package is like a directory on your hard-disk, a container where classes are kept together. Just as when you have two identically named files on your computer's disk, they can be told apart by their full pathname, which includes the location as a path of directories leading to the actual file. This analogy with directories is closer than you might think, as packages directly reflect the directory structure, and the package name of a class is mapped onto a pathname that the JVM follows when it tries to find a certain class.
Basically there are three different types of packages:

1. packages that come with the base distribution of Java
2. extensions which are optional
3. third party and other packages

In the following three sections we will look in more detail at each of these types, but first we need to answer another question: how do you actually work with packages? With one exception (which we will deal with in the next section) you have to declare what classes (or packages) you are using in your class. Suppose we want to use a class Date, which is located in the package called java.util; we then need to include the following statement before our class definition:

import java.util.Date;
This tells the compiler that whenever we are using a class called Date it is in fact the class java.util.Date, i.e. the class from the package java.util. If you left out the import statement, you would always have to refer to the class by its full name, which is the package name and the class name separated by a dot (in our case java.util.Date). You can of course do that, but you would save yourself a lot of typing by using the import statement. If you find that you are using a lot of classes from a package you can either enumerate them, as in

import java.util.Date;
import java.util.Calendar;
import java.util.TimeZone;
or you could simply import all classes of the package at once, by writing

import java.util.*;
The second method imports a few more classes which you are not actually using. While this might slightly increase the time needed for compiling your class, it has no influence at all on the result, i.e. your own class will be neither larger nor slower. It is, however, better to enumerate the classes when it comes to maintenance: it allows you to see at one glance which other classes you are using, provided you are not importing ones that you don't actually use. In the end this is very much a question of personal style and does not really matter much.
But what happens if you import two packages which happen to have classes which have the same name? Suppose you import a package called bham.qtag which has a Tagger class, and another package, lancs.claws, which also has a class called Tagger. Now, when you have a line

Tagger myTagger = new Tagger(tagset);
the compiler is at a loss as to which Tagger class you want to use. You could now either not import both packages, which might mean that you will have to write a potentially long list of classes to import individually at the top of your file, or you can simply qualify the class. This means you use the full name, which includes the package, and thus is defined to be unique, as no package can have two classes of the same name. So, you could simply write
bham.qtag.Tagger myTagger = new bham.qtag.Tagger(tagset);
lancs.claws.Tagger myOtherTagger = new lancs.claws.Tagger(tagset);
and the compiler is happy, as there are no more unresolved ambiguities.
5.1.2 The Standard Packages
Part of the power of Java comes from the large number of useful classes that are included in the language specification. Unlike a lot of other languages which only define a minimal set of commands and rely on third party extensions for anything non-trivial, Java includes quite a large set of classes for most purposes. As these classes are guaranteed to be available on any Java system, you can use them without having to worry about whether they are actually installed on the machine of a potential user. This saves you a lot of time, as you don't always have to reinvent the wheel, and the standard packages allow you to concentrate your efforts on the actual logic of your program, without wasting too much time on programming auxiliary classes for other tasks that you might need. Every major revision of Java has included more packages and classes, so it doesn't make much sense to spend too much time listing them all; it is best to consult the on-line documentation of the JDK you have installed on your computer for the authoritative list. In this section we will therefore just have a brief look at packages which contain classes that are important for processing texts and corpora, plus a few more you might be interested in later on. All the classes in the standard package start with java., followed by the actual name of the package. However, you can't import all those classes by typing import java.*;
as the asterisk only applies to classes within a package, and not to sub-packages. The above statement would only include classes from a package called java, but there aren't any, as they are all in sub-packages of java. There is another set of packages that comes with a JDK distribution, which start with sun. These packages are not very well documented, and for a good reason: you shouldn't use them. They are either internal classes that are used behind the scenes, or classes which haven't quite made it into the standard distribution yet. The sun packages are likely to change without notice, and if you want your programs to work for longer than just until the next version of Java is released you should avoid to use them. Generally you won't need them anyway. The most important package is java. lang, and it is so crucial that you don't even have to import it yourself: it is automatically imported by the compiler. It contains some basic classes which are required by the language itself. The package from which you will import classes most often is java. util. It contains a set of utility classes which are not essential but very useful. We will look at some of its classes towards the end of this chapter. All classes which deal with input and output are bundled up in the package java. io. The whole of chapter 6 is devoted to these classes, as they are important for data processing.
The java.text package provides a number of classes for formatting text in culture-specific ways. For example, some countries use a full stop as a decimal separator, while others use the comma. These classes check the relevant settings in the operating system and format numbers following the right conventions. Similar considerations apply to the translation of the user interface. With the portability of Java this internationalisation plays an important role, which has been neglected in programming for too long.
One strong point of Java is the fact that it provides platform-independent access to graphics. Most of this is contained in the package java.awt (AWT stands for abstract window toolkit), and more recently in the extension package javax.swing (see below). Classes to access the Internet are in java.net. These allow you to establish network connections to other computers or to download web pages from any URL. If you want to access a database from within a Java application you can do so with the java.sql package. SQL is a standard query language for retrieving data from databases, and the classes in this package provide supporting functionality for this.
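As a small illustration of the culture-specific number formatting mentioned above, the following sketch formats the same value according to two different locales (the choice of locales is arbitrary; the classes used are NumberFormat from java.text and Locale from java.util):

import java.text.NumberFormat;
import java.util.Locale;

class FormatDemo {
    public static void main(String args[]) {
        double value = 1234.56;
        NumberFormat british = NumberFormat.getInstance(Locale.UK);
        NumberFormat german = NumberFormat.getInstance(Locale.GERMANY);
        System.out.println(british.format(value));   // prints 1,234.56
        System.out.println(german.format(value));    // prints 1.234,56
    }
}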
5.1.3 Extension Packages

A while ago developers started to run out of patience with the graphics capabilities provided by the AWT, as even such basic widgets as simple buttons behaved slightly differently on different platforms, and developers couldn't rely on their programs working properly everywhere. This was seen by some as proof that the 'write once run anywhere' philosophy of Java was flawed. However, these problems have been successfully solved by using a more low-level approach, where only the very basic operations are managed by the host system, and everything else is realised by Java. So, buttons behave the same on all platforms, as they are no longer 'Windows' buttons or 'XWindow' buttons, but instead Java buttons. The resulting system is much more reliable and consistent across platforms, but considerably larger than the original AWT, which for much of its functionality relied on the underlying operating system. Therefore, the new system, which is called Swing, has not been made part of the standard packages, but has instead been put into a new type of package, the Java extensions. These start with javax, and though they are not included in all Java distributions they are still part of the well-defined Java package system. Several other types of packages also fall into this category; however, these are mainly more esoteric, and we will not discuss them here.
5.1.4 Creating your Own Package
For small one-off projects it might not be too relevant, but whenever you design a class which you might want to re-use at some point in the future you should consider packing it up in a package. This makes it easier to organise your projects and maintain them, as you would put all related classes in the same package. Creating your own packages is very easy: just insert a package statement as the first statement in your source file (that excludes comments, which are allowed to
come before a package statement). Some of the programming projects that we will look at later have been organised in their own packages.
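For instance, a source file for a class you want to re-use in several corpus projects might begin like this; the package and class names here are only invented for illustration:

// Concordancer.java
// the package statement must be the first statement in the file
package corpustools;

import java.util.Vector;

public class Concordancer {
    // class body goes here
}

Other classes can then refer to this class by its full name, corpustools.Concordancer, or pull it in with an import statement.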
5.2 ERRORS AND EXCEPTIONS

Regardless of how careful you are with writing your programs, there will always be things which go wrong. Handling errors is a very complex matter, and it can take a considerable effort to deal with them properly. Every time your favourite word processor crashes just as you were about to save two hours' worth of writing it has come across some circumstances that its programmer hadn't thought of. In general there are two types of errors: those which you can recover from, and those that are fatal. Running out of memory is fatal, as you've just hit the ceiling of what the machine can do. Trying to open a file that does not exist is not such a grave problem; you can just try again opening another file, or abandon the attempt altogether. In Java this is reflected in the distinction between Error (fatal) and Exception (non-fatal). If an error occurs, there is usually not much you can do. Exceptions, on the other hand, only stop the program if you don't handle them. The term for handling an exception is to catch it. This is done in a combination of a try block and one or more catch blocks. As the most frequent exceptions will involve input and output, here is a brief example (you will find more on this in chapter 6):

// opening a file can produce an IOException
FileReader file;
try {
    file = new FileReader("corpus.txt");
} catch(IOException exc) {
    System.err.println("Problem opening the file: "+exc);
    file = null;
}
if(file != null) {
    System.out.println("Success!");
}
If there is a problem opening the file 'corpus.txt', an exception is thrown. As this can only happen when we create the FileReader object, we enclose this critical section in a try block. The variable file needs to be declared before we enter the block, as its scope would otherwise be restricted to within the block only. A try block is followed by a number of catch blocks, which specify which exception they are dealing with. Any exception that hasn't been caught by a catch block is passed back to the caller of the method the exception was thrown in. In our example the IOException is the only exception that can be thrown when a FileReader is created, so we don't have to worry about any other exceptions. But how do you know what exceptions you have to expect when calling a method? The answer is usually provided in the API. The list of exceptions that a method can potentially throw is as much part of its public interface as the return value or the number of parameters. If an exception is not caught within a method, it has to be declared, using the keyword throws:
public int countLinesOfFile(String filename) throws IOException {
If in this method you handle the exception with try and catch, you don't have to declare it, unless you rethrow it, i.e. you pass it on after you have dealt with it. The Java compiler will notice if an exception can be thrown in one of your methods which has not been declared, and it will give an error message when you try to compile it. Exceptions are classes, and thus they are organised in a hierarchy, with Exception as the top-level class. When you catch an exception, any sub-classes of the exception you specify will also be caught, so catching an IOException will also deal with a FileNotFoundException, which is a sub-class of IOException. If you're lazy you can simply catch Exception directly, but then you lose the more specific information about what went wrong. However, you can choose exactly what level of granularity you want to deal with.
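As a sketch of this, you could treat a missing file differently from any other input problem by catching the more specific FileNotFoundException before the general IOException; the file name here is arbitrary:

FileReader file = null;
try {
    file = new FileReader("corpus.txt");
} catch(FileNotFoundException exc) {
    // the more specific case has to come first
    System.err.println("The file does not exist: "+exc);
} catch(IOException exc) {
    System.err.println("Some other input problem: "+exc);
}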
5.3 STRING HANDLING IN JAVA
The basic unit of analysis in corpus linguistics is generally the word. So we will use this as the entry point into the Java programming language, especially since it is also important for programming in general: there are a number of occasions where words or sequences of words are used, such as names of files, error messages, help texts, and labels on buttons. In computing terminology a sequence of characters is called a String, and there is a class of that name in Java's basic class library, which we have already come across in the previous chapter.
5.3.1 String Literals
When you are assigning a value to a numerical variable, you can simply write the value in the source code as a sequence of digits, for example

int aNumber = 13;
To do the same with a String, you have to tell the compiler exactly where the string starts and where it ends. For that reason you need to enclose a string in the source code in double quotes, as in the following example:

String weather = "Sunshine, with light showers in the evening.";
The quotes themselves are not part of the String, they are only the delimiters indicating the extent of the string. A sequence of characters such as this is called a literal. It is analogous to a literal number in that you can use it directly in String expressions and variable assignments.
5.3.2 Combining Strings
In the Java language, Strings have a special position, as they are the only objects which can be handled directly with an operator instead of method calls only.
This operator is the concatenation operator ('+'), which allows you to combine two Strings. Furthermore it also allows you to convert other objects and primitive data types to Strings, which happens automatically when you are concatenating them, as in the following example:

int score = 150;
String example1 = "The score is "+score;
In this example we are using another data type, an int value. The variable example1 will have the value "The score is 150", which is a new String created by converting the variable score to a String and adding it to the specified literal String (note the space before the quote mark; without it there would be no gap between the 'is' and the '150').

String nestedString = "The variable 'example1' has the value \"" + example1 + "\".";
A more complicated looking example is the variable nestedString: Here we include the value of example1, and enclose it within double quote marks: to make it clear that these do not mark the end of the literal string they are preceded by a backslash. That means that the double quote is to be part of the string, whereas the quote mark following it (the last one in the line, just before the semicolon) is terminating the literal. Strings in Java can contain any character from the Unicode character set, and they can be of any length. One thing you cannot do with them, and this might sound a bit odd at first, is to change them: String objects are immutable, which means that once they have been created they will always contain the same value. There are several reasons for this, mainly to do with security constraints, which we shall not worry about here. The immutability is no big problem for us, as we can always create a new String object with a new value if we want to change it; the fact that an object's value cannot be changed does not mean that all variables of that type cannot be changed either. So, in the API of the String object you will find plenty of methods to access all or part of it, but none to change it. Should you ever require such functionality, there is a class called StringBuffer which can be used to build up character strings, and which can be converted into a String.
5.3.3 The String API
In this section we discuss the most important methods of the String class. These can be divided into several groups: first those to create String objects, as there are multiple ways to do so, and a few basic general methods. Then there are methods to compare Strings, followed by some to transform the content of a String. After methods for finding sequences in Strings we will look at ways of getting at sections of an existing String object. As mentioned above, the String class is part of the java. lang package, which is automatically imported into all classes, so you don't explicitly need to do so yourself. We will not go through all the methods provided by the String class, as it contains quite a large number of them. In order to be able to use a class efficiently you
will only need to know what kinds of methods it provides, so that you can then look up the exact form of a method you need in the documentation. Apart from the Java Development Kit you can also download its documentation (from the Sun website); this is a set of web pages which list all classes in the standard class library together with all methods and their parameters. This documentation has actually been produced from the class library's source code using the javadoc tool, which means that it is very easy for you to produce the same kind of documentation by following the javadoc conventions.
Creating a String

There are two commonly used constructors available, which are listed in table 5.1.

    String(char value[]);
    String(StringBuffer buffer);

Table 5.1: Frequently used String constructors

The first constructor uses an array of char values to create a String. This constructor is not used very often, but it is useful when you get an array of characters from some other class (e.g. by retrieving data from a source over a network or out of a database) and want to turn it into a String. The second constructor we will be looking at takes its data from a StringBuffer. We have already heard that a StringBuffer is used to build strings, as they can be changed, unlike String objects. Here we now create an immutable copy of the StringBuffer, which then is disassociated from the original, which means that changing the StringBuffer in any way does not alter the String object. Another way to create a String in Java is by putting together String literals. These can also be mixed with other data types, which get automatically converted. Java supports the '+' operator, which combines individual Strings into one single String:

String name = "Lord Gnome";
int age = 94;
String line = "My name is "+name+" and I'm "+age+" years old.";
Here we construct a string called line by concatenating three literal strings ("My name is ", " and I'm ", and " years old.") with two variables, a String (name) and an int value (age). As an aside, note the space characters which are included around the places where the variables and the literal strings are concatenated. Without these spaces you would get "My name isLord Gnomeand I'm94years old.", as the components are just stuck together. Another very common way of creating a String object is not actually part of the String API, but is a method of the class Object: the method toString(). This method returns a representation of the object in the form of a String, and as it
is a method of Object, from which all other classes are directly or indirectly derived, we can rely on every object having this method. So we can equally well write

StringBuffer sb = new StringBuffer();
sb.append("some text");
sb.append(" and even more text");
String text1 = new String(sb);
String text2 = sb.toString();
// text1 and text2 are both "some text and even more text"
Here we first create an empty StringBuffer object, to which we then append two literal strings. Then we declare two String variables text1 and text2, and assign to them in two different ways the contents of the StringBuffer. First we're using the constructor method, which directly creates a String from a given StringBuffer, and then we use the more general way of employing the toString() method, which the StringBuffer inherits from Object. If you have objects from other classes, you can still use this approach, as all classes have a toString() method. However, this will not always return what you expect, especially from classes which do not have as direct a textual representation as the StringBuffer.
This is in fact a design aspect that you should keep in mind when writing your own classes: provide a toString() method which returns a sensible representation of the class. For example, in a case study in chapter 9 we will write a class Rule which represents a replacement rule for a morphological stemmer. In the stemmer itself we never actually need a String object which represents a Rule object, but it is very useful to have for debugging purposes. For the Rule class we have chosen the individual parts of the rule to be printed, which makes it easy to identify which rule this object represents. If you don't provide a toString() method yourself, the default method of the parent class Object will be used; as this method has no access to the semantics of your class definition it just prints out the address in memory at which the object is kept. This is enough to distinguish two objects, but is not very useful for any other purposes.
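As a small sketch of this convention, consider a made-up class Token holding a word form and its tag; both the class and its fields are only assumptions for illustration:

public class Token {
    private String form;
    private String tag;

    public Token(String form, String tag) {
        this.form = form;
        this.tag = tag;
    }

    // a readable representation, mainly useful for debugging output
    public String toString() {
        return(form+"/"+tag);
    }
}

Printing a Token object now produces something like cat/NN rather than a memory address.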
Basic String methods

int    length();
char[] toCharArray();
String toString();

Table 5.2: Miscellaneous String methods

In this section we cover some general String methods, as shown in table 5.2. As you will notice, they have a data type in front of the method name, unlike the constructors we looked at in table 5.1. This data type describes what type the return value of the method will be (constructors do not have an explicit return value).
So, the length() method returns an int value, and if you want to assign that to a variable, it needs to be of type int. The first of these methods is length(), which will return the length of a String object, measured in the number of characters. You most often use this method for looping through a string in conjunction with the charAt() method (see below). It can also, for example, be used to produce a list of word length vs. frequency. When looking at the constructors in table 5.1, we have already come across the array of char values which is used internally to store the content of a String. With the toCharArray() method we have the reciprocal functionality, as it converts the String to an array. This array is a copy of the string's content, so that you can change it without affecting the original String object. You can use this method when, for example, you want to manipulate the individual characters of the string in a way for which there are no other suitable methods provided. And finally, there is the toString() method. This method is inherited from the Object class and thus exists in all classes, but in the String class it is fairly redundant, as a String object is already a String. All this method does, therefore, is return a reference to the String object itself.
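To make the word-length example concrete, here is a minimal sketch; the words array stands in for a stream of tokens that would normally come from a corpus file, and we simply assume no word is longer than 19 characters:

String words[] = {"the", "cat", "sat", "on", "the", "mat"};
int lengthFreq[] = new int[20];
for(int i = 0; i < words.length; i++) {
    lengthFreq[words[i].length()]++;
}
for(int len = 1; len < lengthFreq.length; len++) {
    if(lengthFreq[len] > 0) {
        System.out.println("length "+len+": "+lengthFreq[len]+" words");
    }
}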
Comparing Strings

Methods for comparing strings seem about as relevant as the toString() method, due to the existence of the comparison operator '=='. However, it is more complex than one might initially think, as this operator has rather complicated semantics. In order to understand the difference between the comparison operator and the methods for comparing String objects, we need to know more about how objects are handled in the JVM.
Figure 5.1: Comparing string objects
Looking at figure 5.1, we can see that a String variable provides a reference to a position in memory where the corresponding object is stored. The exact physical location is not relevant, and can change during the execution of a program, so for all intents and purposes the reference is all we need in order to work with that object. Two variables can point to the same object, as in

String var1 = new String("aardvark");
String var2 = var1;
Here var1 is assigned to var2, and they both refer to the same physical object. We now continue our example by introducing a few comparisons:

if(var1 == var2) {
    System.out.println("var1 is equal to var2");
}
if(var1 == "aardvark") {
    System.out.println("var1 is equal to 'aardvark'");
}
If you were to try this out, you would get the surprising result that var1 is equal to var2, but it is apparently not equal to the literal string 'aardvark', even though it obviously is. The reason for this lies in the way the '==' operator works: it compares the references, not what the referred objects actually contain. As we have earlier assigned the two variables to each other, they are identical, and thus they are referring to the same object in memory. In the second comparison, as well as this one:

String var3 = "aardvark";
if(var1 == var3) {
    System.out.println("var1 is equal to var3");
}
the literal string and var3 are different objects, though their content is the same. It is just the same as if you were comparing two cups of tea, which are both full: they have the same content, but they are two different cups. This means that the '==' operator is only really useful for comparing the primitive types, but not for objects, unless you really want to check whether two values refer to the same physical object. This applies not only to String objects, but to all objects, as the concept of two objects being equal in some way or another generally doesn't coincide with the objects being identical. For this reason there is a method in the class Object which is intended for defining what 'equality' means, the equals() method. This method takes as its parameter a variable of the type Object, which means it can be used to compare any other objects to our String object, regardless of whether the other object is a String or not.

Comparing Whole Strings

boolean equals(Object other);
boolean equalsIgnoreCase(String otherString);
int     compareTo(String otherString);
int     compareToIgnoreCase(String otherString);
boolean startsWith(String prefix);
boolean startsWith(String prefix, int offset);
boolean endsWith(String suffix);
boolean regionMatches(int off1, String other, int off2, int length);

Table 5.3: String comparison methods
The methods for comparing strings (and sections of them) are shown in table 5.3. We start with the two simplest methods, equals(), which tests for straight equality, and equalsIgnoreCase(), which treats upper and lower case characters as equal during the comparison of the current String object with the argument, which needs to be a String object as well. The equals() method accepts any object, even from different classes (see above), but in order for it to be equal this other object will have to be a String as well. Both methods return a boolean value, true in case the two strings are equal, and false otherwise. This means you can easily use them in boolean expressions, such as

String word = new String("anything");
if(word.equals("something")) {
    // ...
}
without comparing the return value to some pre-defined value. Unlike equals(), the method compareTo() (and the corresponding non-case-sensitive variant, compareToIgnoreCase()) returns an int value. This is because it evaluates the sorting order of two strings, and there are three possible options (before, after, same) which are too many to represent as boolean values. The return value is negative if the current object (on which the method is invoked) comes before the other object (the parameter) in lexicographical order, and positive if it comes after it. Both methods will return 0 if the strings are equal, i.e. if the equals() method would return true. If you look at the full API documentation of the String class, you will notice that there is another form of the compareTo() method, which takes an Object as its parameter, just like the equals() method. This method exists because the String class implements an interface Comparable (which specifies just that one method, compareTo()): by implementing that interface a fixed sorting order is defined, which means that you can use strings in sorted collections (see below) without any further effort on your part. Again, the Comparable interface is fairly general and thus can only specify Object as its parameter, as it can be implemented by any class.

Comparing Sections of Strings

The next set of methods is used for comparing sections of strings, and we start off with the two simpler ones, startsWith() and endsWith(), which are very useful for language processing. They basically do what their respective method names say, and like equals() they return a boolean value. Let's look at an example of how they are used:

// word is a String defined elsewhere
if(word.startsWith("un")) {
    System.out.println("negative");
}
if(word.endsWith("ly")) {
    System.out.println("adverb");
}
i f (word. endsWith ( "ing")) { System.println("ING-form");
This is a slightly oversimplified version of a morphological analyser, which assumes that all words beginning with un- are negatives, and that words ending in -ly are all adverbs, and so on. For a real analyser there would clearly have to be more sophisticated checks on other conditions before such statements can be made, but you get the picture of how these methods are used. The startsWith() method has another variant, where you can specify an offset. This means that you don't start at the first character of the String when trying to match the prefix, but that you can skip any number of characters (for example if you have earlier established that the string has a certain other prefix). The 'plain' version of startsWith() is equivalent to startsWith(prefix, 0). The final comparison method, regionMatches(), is rather more complicated. It allows you to match sections of two strings, as specified by two offset variables. The first offset (thisOffset) is applied to the current String object, while the second one (otherOffset) is applied to the parameter. As both parts have to be of the same length in order to be equal, it is only necessary to specify the length of the stretch that is to be compared once. There is also an alternative form of that method where you can also specify (through a boolean parameter, slightly inconsistent compared to the other methods) whether the match should take upper and lower case differences into account. You will rarely need those two methods.
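As a small sketch of the offset variants (the example words are made up for illustration):

String word = "unhappily";
boolean negated = word.startsWith("un");            // true
boolean stem = word.startsWith("happ", 2);          // true: start matching after the prefix
// compare "happi" in "unhappily" (from offset 2)
// with "happi" in "happiness" (from offset 0), over 5 characters
boolean same = word.regionMatches(2, "happiness", 0, 5);   // true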
Transforming Strings

As you will note by looking at table 5.4, all the methods which transform strings return a String object. This is because they cannot operate on the current object themselves, as String objects are immutable. However, in practice this does not matter much, as you can always assign the return value to the same variable, as in

word = word.toLowerCase();
Here you convert the object word to lower case, and what in fact happens is that a new String object is created which contains the lower case version of word; this new object is then assigned to word, and the original mixed case version is discarded. The only drawback of this is a slight cost in performance, but that would only be noticed in time-critical applications churning through massive amounts of data. The first method, concat(), concatenates the String given as a parameter with the value of the current object and returns a new String object (provided the parameter string was not empty) which consists of the two strings combined. Concatenation is the technical term for putting two strings together, as in:

String str1 = "Rhein";
String str2 = "gold";
String str3 = str1.concat(str2);
// str3 is "Rheingold"
String concat(String value);
String trim();
String replace(char oldChar, char newChar);
String toLowerCase();
String toLowerCase(Locale locale);
String toUpperCase();
String toUpperCase(Locale locale);

Table 5.4: Methods for transforming Strings

As a short form you can also use the concatenation operator, the plus sign, as in

String str3 = str1 + str2;
Note that again, due to the immutable nature of Strings, the object (str1 in the previous example) itself does not get changed through the concat() method, but that a new object is created instead. Sometimes when you are dealing with data read in from external sources it is possible for strings to contain spaces and other control characters at either end. The trim() method can be used to get rid of them: it removes all spaces and non-printable characters from either end of the String. This could result in an empty string, if the object does not contain any printable characters. The replace() method replaces all occurrences of one character, oldChar, by another character, newChar. This method is not too useful, as the replacement can only be a single character. It would have been more useful to have a general replacement routine, which can replace substrings with each other. The final four methods we will be looking at in this section are used to convert strings into all lower case or all upper case characters. This is quite an important preprocessing step for most language processing, as dictionaries usually contain only one form of a word. Suppose a tagger's dictionary contained information about the word 'bird', but your input text for some reason is all in upper case, so you come across the word as 'BIRD'. Looking this up in the dictionary would probably fail. It is therefore always best to convert words into a single normalised form before processing, especially when you need to look them up in a dictionary:

String word = "Bird";
String lower = word.toLowerCase();   // lower is "bird"
String upper = word.toUpperCase();   // upper is "BIRD"
Now, why are there two variants of each method? The reason behind this is that upper and lower case forms of characters can be language specific. A Locale object defines a country and language combination which can be used to make certain decisions. There are only a few cases where this actually makes a difference: in the Turkish locale, a distinction is made between the letter i with and without a dot, and thus there is an upper case I with a dot above. For the conversion into upper case there also is the German sharp s ('ß'), which is rendered as 'SS' in upper case.
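A brief sketch of the locale-sensitive variants; the Locale class lives in java.util, and the example word is arbitrary:

String heading = "LINGUISTIK";
String german = heading.toLowerCase(Locale.GERMAN);        // "linguistik"
String turkish = heading.toLowerCase(new Locale("tr", "TR"));
// in the Turkish locale every capital I becomes a dotless i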
Finding sequences in Strings

One frequent job you will do in corpus analysis is matching words or parts of words. In this section we will look at the methods that the String class provides for finding sequences within a string. With these methods you can either look for a single character or a short String. You can search either forwards (indexOf()) or backwards (lastIndexOf()). These methods return a number, which is either the position of the character you were looking for in the string (starting at position 0 if it was the first character), or -1 if it could not be found. You can also start at either end of the String object (by using the corresponding method with just one parameter) or at a position further on (with the two-parameter versions, which require you to specify the position within the string that you want to start the search from).
int indexOf(int chr);
int indexOf(int chr, int from);
int indexOf(String str);
int indexOf(String str, int from);
int lastIndexOf(int chr);
int lastIndexOf(int chr, int from);
int lastIndexOf(String str);
int lastIndexOf(String str, int from);

Table 5.5: Methods for finding sequences in Strings

The full list of methods is given in table 5.5, and as you will notice they all return integer values. The meaning of the return value is always the same, either the position where the item you searched for was found, or -1 if it didn't occur in the String. Let's look at a few examples, where we first declare a string and then search for elements in it:

String sample = "To be or not to be, that is the question.";
//               0         1         2         3         4
//               01234567890123456789012345678901234567890
int x = sample.indexOf("to");        // x will be 13
x = sample.indexOf('T');             // x will be 0
x = sample.indexOf("be", 4);         // x will be 16
x = sample.lastIndexOf("ion");       // x will be 37
x = sample.lastIndexOf('t', 25);     // x will be 23
x = sample.indexOf('t', 37);         // x will be -1
In the comment lines below the sample declaration you will find numbers giving you the corresponding positions in the string, going from 0 up to 40. This is to make it easier to work out what the return values mean in the examples. We declare an int variable which we will use to assign the return values to. We only need to declare it once, and then we can simply reuse it without having to declare it again. In the initial line we are searching for the string "to". If you look at the sample string, you can see that it first occurs at position 13 (the match is case-sensitive), and thus the variable x will be assigned that value. Note the difference between Strings (in double quotes) and characters (in single quotes).
If you want to find multiple occurrences, you can simply loop over the string, as in this code sample:

int location = -1;
do {
    location = sample.indexOf('o', location+1);
    if(location >= 0) {
        System.out.println("Match at position "+location);
    } else {
        System.out.println("No (further) matches found");
    }
} while(location >= 0);
This piece of code will print out all index positions of the String object sample from our previous example at which the letter 'o' occurs. Note how we keep track of the last position to remember where to start looking for further matches. If we didn't add one to the value of location when calling the indexOf() method we would simply remain forever at the position where we'd first found a match.

Getting Substrings

Once you have found something in a string, you might want to extract a substring that precedes or follows it, or you might simply want to compare a substring with a list of other strings, without searching the full string for each string in the list. There are three methods which you can use for this, and they are listed in table 5.6.
char   charAt(int index);
String substring(int from);
String substring(int from, int to);

Table 5.6: Methods for getting substrings
The charAt() method gives you access to a single character position within a String. You could use it to loop through all letters in a string, for example:

String str = "And now for something completely different ...";
for(int i = 0; i < str.length(); i++) {
    char c = str.charAt(i);
    System.out.println("Character at position "+i+": "+c);
}

Here we first initialise a string str, and then we use a for-loop to walk through all letters. As i is an index position, it has to be smaller than the length, so we use that fact as our loop condition. The output of this example would look like:

Character at position 0: A
Character at position 1: n
Character at position 2: d
Character at position 3:
Character at position 4: n
...
Character at position 45: .
The substring() method has two variants: in both you specify at which position your substring should start, and in the second one you also specify where it should end. In the first variant the substring is taken to last until the end of the string. With both variants, the method returns the substring resulting from the character positions you've specified. Suppose you want to remove a possible prefix un- from a string. You could do that with the following code snippet:

if(word.startsWith("un")) {
    // starts with the prefix "un"
    word = word.substring(2);
}
Here we first check that our word does actually start with the prefix before taking it away. The parameter to the substring() method is 2, which is the third character. We thus ignore the two characters at positions 0 and 1, which make up our prefix, and only keep the string from position 2 onwards.
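Combining indexOf() and substring() is a very common pattern in corpus work. Here is a minimal sketch for splitting a word/tag pair at a slash; the word/TAG format is just an assumption for illustration:

String pair = "corpus/NN";
int slash = pair.indexOf('/');
if(slash >= 0) {
    String word = pair.substring(0, slash);   // "corpus"
    String tag = pair.substring(slash+1);     // "NN"
}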
5.3.4 Changing Strings: The StringBuffer
Several times so far we have heard that String objects cannot be changed, i.e. that their content always stays as it was when initially created. Although you can nevertheless work with strings by creating a new String object every time you want to change it, this is rather inefficient. Looking at the execution time of several Java actions one can see that creating objects is fairly expensive in computing terms, so by creating a lot of objects you could slow down your program. As any corpus processing is likely to involve lots of string processing, one thing you wouldn't want to do is to be inefficient in such a key area. The solution to this dilemma is a related class, the StringBuffer. A StringBuffer is an auxiliary class which allows you to manipulate strings. When you get to a stage where you require a String object, you can easily turn the StringBuffer into one. We have already come across this in section 5.3.3. StringBuffers are suitable for intermediate work, like assembling a String from several components. In fact, it is implicitly used whenever strings are concatenated in Java. The most frequently used methods of the StringBuffer API are given in table 5.7. Again, note the missing return values for the constructors, which implicitly return a StringBuffer object. If you know in advance how long your string is going to be, you can specify the
length in the constructor. Also, you can create a StringBuffer directly from a String. If you wanted to reverse a string, you could write:

String str = "Corpus Linguistics";
StringBuffer sb = new StringBuffer(str);
sb.reverse();
str = sb.toString();
// str now is "scitsiugniL suproC"
The setCharAt () method allows you to change individual characters, whereas with setLength () you can either reserve more space (if the new length
             StringBuffer();
             StringBuffer(int length);
             StringBuffer(String str);
StringBuffer append(char c);
StringBuffer append(String str);
char         charAt(int index);
int          length();
StringBuffer reverse();
void         setCharAt(int index, char c);
void         setLength(int length);
String       toString();

Table 5.7: Frequently used methods from the StringBuffer API
is larger than the current length), or cut off the end of the StringBuffer (if the new length is less than the current length). Reserving more space is not really necessary, as the StringBuffer will automatically grow as you append more text to it. You would only want to do that if you were about to append a lot of smaller pieces of text, as it is more efficient if space only has to be allocated once.
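As a small sketch of this typical use, here we assemble a line of output from an array of words; the words array is only assumed for illustration and would normally come from elsewhere:

String words[] = {"the", "cat", "sat", "on", "the", "mat"};
StringBuffer line = new StringBuffer();
for(int i = 0; i < words.length; i++) {
    if(i > 0) {
        line.append(' ');
    }
    line.append(words[i]);
}
String result = line.toString();   // "the cat sat on the mat"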
5.4 OTHER USEFUL CLASSES

In this section we will have a look at those classes of the standard Java class library which you are likely to use most often. It has to be rather brief, but remember that you can always have a closer look in more detail by studying the on-line API documentation. Once you have an idea what a class is good for and how you can use it, the API documentation enables you to make the best use of it. The classes in this section have to do with data structures. A data structure is a container that you can use to store objects in. One important thing is that only objects can be stored: to store primitive values in these you have to use wrapper classes. So, in order to store a set of int values (e.g. word frequencies), you would have to create Integer objects from them, which you can then store. The only exception to this is an array, which we will come to below.
5.4.1 Container Classes
Suppose you wanted to collect the word frequencies of a text: as you read each word you will have to look up how often it has occurred so far, and increment its frequency count by one, adding it to the list if it is a new word. You simply cannot do that with just string and integer variables alone, as you do not know how many words you will need to store, and it would be impractical to compare one variable to a large set of other variables, as you would have to spell all that out explicitly in the program code of what is likely to be a huge program. In fact, every time you manipulate data you do need somewhere to store the results of intermediate processing, like the current number of occurrences of a word, or probability scores for a set of word class tags
associated with a word. This is where container classes come in. A container class is an object that contains a number of other objects. You access these objects either by an index number, with a key, or in a sequential order through so-called iterators. You can think of a dictionary as a container class, where you access the entries via a key, namely the headword. As it happens, most entries will be containers themselves, with separate definitions for each sense. Here you access the definition by the sense number. So, in order to look up sense 3 of the word take you first retrieve from the dictionary the list of definitions associated with take, and then you take the third definition from that list. There are different ways of organising data objects, each with their own advantages and disadvantages. Each way has a set of parameters which determine its suitability for the task at hand, and choosing the right option can make a program easy and straightforward, fast, and space efficient. The wrong choice, on the other hand, can make it awkward, slow and space-wasting. Thus it is important to know what the available options and their properties are. From version 1.2 onwards, Java has a very systematic framework of data structures, the so-called collections framework. In previous versions of Java there were only a few loosely related classes available. However, as these older classes are still widely used (partly for reasons of backwards compatibility) it is important to have a look at them. In practice you might prefer to use the classes from the collections framework instead, if you are using Java 1.2.
5.4.2 Array

An array is a data structure built into the Java language. Unlike all the other container classes you can store primitive types (int, char and so on) in an array. You declare a variable as an array by adding square brackets to it, for example

int frequencies[];
This line only declares frequencies as an array in which you can store int values; it doesn't specify how many of them there could be. In order to use this array, you then need to create it using a new statement, just like other objects:

frequencies = new int[5];
Here you reserve space for five values. Arrays must have a defined number of cells, and you cannot change that number unless you use another new statement to reallocate the array. Unlike the StringBuffer's method setLength () however, you will then lose the contents of the previous array. You can access the individual cells in an array by their index positions; note that counting starts at zero, so here you have indices 0 to 4. Arrays have an associated variable, length, which tells you how many elements there are in it. Unfortunately this is very confusing, as it is a public variable, not a method, so while you use length () to find out the length of a String, you use length (without brackets) for the size of an array. The compiler will tell you when you mix these up, but nevertheless it is quite a nuisance to remember this (and to be told so by the compiler if you got it wrong).
Here is an example to illustrate how to use arrays: first we assign values to some of the cells we allocated above, and then we loop through the array, printing all the values:

frequencies[0] = 3776;
frequencies[1] = 42;
frequencies[2] = 6206;
frequencies[3] = 65000;
frequencies[4] = 386;

for(int i = 0; i < frequencies.length; i++)
    System.out.println(frequencies[i]);
We can access the cells of an array either by number, or through a numerical variable, as it is done in the for-loop. The loop starts at the first element and continues until we have reached the end of the array. Setting the frequencies 'by hand' is rather tedious, but you can of course also use a loop to do that, e.g. when reading data in from a file.
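If you do need to enlarge an array without losing its contents, you can copy the old values into a bigger array yourself. This is only a sketch of the idea, using System.arraycopy():

int frequencies[] = new int[5];
// ... the array gets filled here ...

// make room for five more values without losing the existing ones
int bigger[] = new int[frequencies.length + 5];
System.arraycopy(frequencies, 0, bigger, 0, frequencies.length);
frequencies = bigger;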
5.4.3 Vector
        Vector();
        Vector(int initialCapacity);
boolean add(Object o);
void    clear();
boolean contains(Object o);
Object  elementAt(int index);
int     indexOf(Object o);
boolean isEmpty();
Object  remove(int index);
boolean remove(Object o);
void    removeElementAt(int index);
int     size();

Table 5.8: Frequently used methods from the Vector API
A Vector is like an expandable array, which means you can add and remove elements without having to worry about keeping track of its size. The most important methods of the Vector class are shown in table 5.8. The elements of a Vector are stored sequentially, which means that you can loop through them by index values. In order to print out all elements of a Vector you could for example write

Vector myVector = new Vector();
for(int i = 0; i < myVector.size(); i++)
    System.out.println("Element "+i+" is "+myVector.elementAt(i));
A Vector is very useful if you need to store a number of objects and you don't know in advance how many of them you are going to have. If you know that, you might want to use an array instead. An array has the advantage that it is type safe, i.e. you can only have elements of the same data type in a single array, and you can also use it to store primitive data types. A Vector just stores elements as Objects, so you have to cast them to the appropriate class when you retrieve them: String myString = (String)myVector.elementAt(5);
Here we retrieve the sixth element of the Vector. We know it has to be a String, as we only insert String objects into this Vector, but the JVM doesn't know this (and we can, of course, store any object in this vector if we want to). The elementAt() method only returns objects of the type Object, and in order to turn it into a String we have to cast it to String. This will work as long as the object's actual type is String, otherwise we will get an error. This means you have to take care that you either don't mix different data types in a Vector, or you need to keep track of what objects they are yourself. The same applies to all the other container classes as well. In order to allow you to store any class in it you want, they have to go for the lowest common denominator, which in this case is the 'ultimate' super-class, Object. In practice this is not too much of a problem, as you wouldn't usually mix different data types in a single container anyway.
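A small sketch of a typical corpus use of a Vector, building a list of distinct word types; the words array again just stands in for a stream of tokens read from a corpus:

String words[] = {"the", "cat", "sat", "on", "the", "mat"};
Vector types = new Vector();
for(int i = 0; i < words.length; i++) {
    if(!types.contains(words[i])) {
        types.add(words[i]);
    }
}
System.out.println(types.size()+" distinct word types");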
5.4.4 Hashtable

Quite often you want to store values associated with a certain key, for example word frequencies. Here you have a key (the word) and an associated value (the frequency). Storing these in a Vector would be awkward, as you would lose the association between the two. A Hashtable is well suited for this, as it allows you to store a key/value pair, and it is very fast as well. It basically works by computing a numerical value from each key, and then using that value to select a location for the associated value. So instead of looking through the whole table to locate an item, you can just compute its position in the table and fetch it from there directly. The most important methods of the Hashtable class are given in table 5.9.
To insert a set of word frequencies into a Hashtable we could write something like this:

String word;
int frequency;
Hashtable freqTable = new Hashtable();

// read word and frequency from external source
word = readNextWord();
frequency = readNextFrequency();
while(word != null) {
    freqTable.put(word, new Integer(frequency));
    word = readNextWord();
    frequency = readNextFrequency();
}
            Hashtable();
            Hashtable(int initialCapacity);
void        clear();
Object      put(Object key, Object value);
Object      get(Object key);
Object      remove(Object key);
boolean     isEmpty();
int         size();
Enumeration keys();
Enumeration elements();

Table 5.9: Frequently used methods from the Hashtable API
Here we have to make a few assumptions regarding our data source: we can read words from some method called readNextWord(), which returns the special value null if no more words are available. Frequencies are read in from a similar method, readNextFrequency(), which returns as an int value the frequency of the word read last. As we can only store objects in a Hashtable, we have to create an Integer object with the new statement when inserting the frequency value. We can do this directly in the argument list of the put() method, as we don't require direct access to the new object at this stage. Unlike a Vector you cannot directly access an arbitrary value stored in the Hashtable by a numerical index value, but instead you have to use the key to retrieve it, as in:
String word = "aardvark";
Integer freq;
freq = (Integer)freqTable.get(word);
if(freq == null) {
    freq = new Integer(0);
}
System.out.println("The frequency of "+word+" is "+freq);
There are two points to note here: First, you have to cast the object you retrieve from the table to the right class, Integer in this case. Second, if the key cannot be found in the Hash table, the get () method will return null. We therefore need to test the return value and act accordingly. We could either print an error message, stating that the word is not in the frequency list, or, in this case, assign to it the value zero, which makes perfect sense in the context of a frequency list. If you don't have a full list of all keys which are stored in the table, but want to print out the full set, you can get it from the Hashtable using the keys () method. This allows you to get access to all the keys, which you can then use to read out their associated values from the Hash table. We will look at that when we get to the Enumeration below.
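This lookup is one half of the frequency-count update described at the start of this section; here is a minimal sketch of the other half, assuming word holds the current token:

Hashtable freqTable = new Hashtable();
String word = "corpus";   // the current token, read from wherever the text comes from

Integer count = (Integer)freqTable.get(word);
if(count == null) {
    // first occurrence of this word
    freqTable.put(word, new Integer(1));
} else {
    freqTable.put(word, new Integer(count.intValue() + 1));
}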
            Properties();
            Properties(Properties defaultProperties);
void        clear();
String      getProperty(String key);
String      getProperty(String key, String defaultValue);
String      setProperty(String key, String value);
Object      remove(Object key);
Enumeration propertyNames();
void        list(PrintStream out);
void        load(InputStream in);
void        store(OutputStream out, String headerLine);

Table 5.10: Frequently used methods from the Properties API
5.4.5 Properties
The Properties class (see table 5.10) is an extension of the Hashtable, which is more specialised as to the keys and values you can store. In a Hashtable there are no restrictions, so you can use any kind of object you want as either keys or values. In a Properties object you can only use Strings, but there are some mechanisms which allow you to provide default values in case a key wasn't found in the Properties object: you can either specify another Properties object which contains default values, so that you can override certain properties and keep the values of others, or you can directly supply a default when trying to retrieve the attribute. This is illustrated in the following code snippet:

Properties wordClassTable = new Properties(basicLexicon);
wordClassTable.setProperty("koala", "noun");
String wclass1 = wordClassTable.getProperty("koala");
// wclass1 is now "noun"
String wclass2 = wordClassTable.getProperty("aardvark", "unknown");
// wclass2 is now "unknown"
Here we create a Properties object called wordClassTable, in which we want to store words with their associated word classes. We provide a default called basicLexicon, another Properties object which would have to be defined elsewhere. We then add to it the key-value pair 'koala' and 'noun', and after that we try to retrieve the value for 'koala'. As we have just entered it, we will get the same result back, but if we hadn't done that the system would look up 'koala' in the basicLexicon, and return the value stored there if there was one. If the attribute was in neither wordClassTable nor basicLexicon then the value null would be returned. We then also look up 'aardvark', and this time we provide a default value directly in the method call. If it cannot be found, instead of returning null it returns 'unknown', the default we supplied.
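The load() and store() methods listed in table 5.10 make a Properties object easy to keep on disk between runs. A minimal sketch, with an arbitrary file name and with the IOException handling left to the surrounding method:

// save the word class table and read it back in
FileOutputStream out = new FileOutputStream("wordclasses.properties");
wordClassTable.store(out, "word class lexicon");
out.close();

Properties reloaded = new Properties();
FileInputStream in = new FileInputStream("wordclasses.properties");
reloaded.load(in);
in.close();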
5.4.6 Stack

A Stack is a fairly simple data structure. You can put an object on top of the stack, and you can take the topmost element off it, but you cannot access any other of its elements (see table 5.11). These actions are called 'push' and 'pop' respectively. Despite this limited way of accessing elements on it, a stack is a useful data structure for certain types of processing. In chapter 8 we will see how we can use a stack to keep track of matching pairs of mark-up tags.
        Stack();
Object  push(Object item);
Object  pop();
boolean empty();
Object  peek();

Table 5.11: Frequently used methods from the Stack API
The empty ( ) method allows you to check whether there is anything on the Stack; if you try to pop () an element off an empty Stack, you get an EmptyStackException. The peek () method is a shortcut for: Object obj = myStack.pop(); myStack.push(obj);
It retrieves the topmost element of the Stack without actually removing it off the Stack. Here is a brief example of how a Stack can be used: first we push three String objects on a Stack, and then we pop them off again: Stack theStack =new Stack(); theStack.push( "Stilton"); theStack.push( "Cheddar"); theStack.push("Shropshire Blue"); String cheese = (String)theStack.pop(); II cheese is "Shropshire Blue" cheese = (String)theStack.pop(); II cheese is "Cheddar" cheese= (String)theStack.pop(); II cheese is "Stilton"
As usual, we have to cast the return value of the pop ( ) method to the correct data type. This example illustrates that a stack is a so-called LIFO (last in, first out) structure: the last item we put on the stack ('Shropshire Blue') is the first one that we get when we take an element off it. As we will see in chapter 8, this is ideally suited for processing nested and embedded structures.
5.4.7 Enumeration An Enumeration is somewhat the odd one out in this section. It is not a data structure in itself, but rather an auxilliary type to provide access to the container classes we have just discussed. And, it is not even a class, but just an interface. The full API of the Enumeration is shown in table 5.12.
85
OTHER USEFUL CLASSES
boolean Object
hasMoreElements(); nextElement();
Table 5.12: The Enumeration API With a Vector you can easily access all elements through an index. However, with a Hash table this is quite a different matter, as there is no defined order, and elements are not associated with ascending index values, but with keys, which might not have any natural sequencing. The way to get access to all elements is through enumerating them: You can do that either on keys, or on the elements themselves, and for this purpose the Hash table class has the methods keys () and elements () (see table 5.9 ). The reason that the Enumeration is not actually a class, but an interface is that with an Enumeration you are only interested in the functionality, not the implementation. An Enumeration provides exhaustive access to all of its elements in no particular order, but the way this is implemented might vary depending on what kind of container class you want to go through. Let's see how we could use an Enumeration to print out our word/frequency list which we created in section 5.4.4 (unsorted): Enumeration words = freqTable.keys(); while(words.hasMoreElements()) { String word= (String)words.nextElement(); Integer freq = (Integer)freqTable.get(word); System.out.println(word+": "+freq);
This is the typical way you would use an Enumeration in practice. The two key methods are hasMoreElements () and nextElement ().The former returns true if there are more elements available, and false if there aren't. Here we are using it to control the loop which iterates through all the words, which we have used as keys in the frequency table. As soon as we have reached the end of the Enumeration the loop is terminated. The nextElement () method returns the next object. As the Enumeration is necessarily general, it again returns an object of type Object, which we have to cast to the desired type (which we have to know in advance). In our example we know it's a String, and we can use that to retrieve the associated value from the frequency table. To fill the concept of an Enumeration with a bit more life, here is an example of a possible implementation. This is just a wrapper around an array, which we access through an Enumeration. We call this class ArrayEnumeration, and you can use it to get an Enumeration from an array of objects. We are backing the implementation with two (private) variables, content, which points to the array that we are enumerating the elements of, and counter, which keeps track of what position in the array we are at. Both variable get initialised in the constructor. As the class implements the Enumeration interface (as specified in the class declaration line), we need to provide the two methods which make up the
THE JAVA CLASS LIBRARY
86
Enumeration API. Here we have included documentation comments (see table 4.1 on page 57 for a list) which describe the constructor's parameter, and the return values of the two Enumeration methods. /*
* ArrayEnumeration.java
*I public class ArrayEnumeration implements Enumeration { private Object content[]; private int counter;
/*'1
>'. private XMLElernent readXMLinstr() throws IOException XMLElernent retval = null; String name= readNarne(); skipSpace(); StringBuffer content= new StringBuffer(); while ( ! lookahead ( "?>" ) ) { int c = in.read(); if(c == -1) { throw new XMLParseError("Prernature end of file in XML instruction"); else { content.append((char)c);
retval =new XMLinstruction(narne,content.toString()); return(retval);
And indeed the method is very similar to readComment() which we discussed earlier. As always we want to be able to try out this class, and so we add a main() method to run the XMLTokeniser on some input data. We simply process the data and print out whatever elements we encounter in the input. This makes use of the toString() methods which we overwrote in each of the XMLElement subclasses. This implementation of the main() method as it stands is not meant for general use, so it does not contain that vital check for the presence of command-line parameters which a proper program should have:

public static void main(String args[]) throws IOException {
    XMLTokeniser parser = new XMLTokeniser(new FileReader(args[0]));
    int i = 0;
    XMLElement xml;
    do {
        xml = parser.readElement();
        System.out.println(i+": "+xml);
        i++;
    } while(xml != null);
}
} // end of class XMLTokeniser
With more than 300 lines of code, the XMLTokeniser is a substantial class. However, by splitting it up into many small methods, most of which fit easily on one screen, the complexity can be reduced considerably. Tokenisers and parsers of formally defined languages are always rather messy, as they have to deal with a lot of variations in the input, and they have to show a certain behaviour when encountering deviations from the input's specification. The XMLTokeniser does not enforce many of the restrictions that XML puts on the shape and form of the input data, so you can try it out on any SGML input file as well, provided you first take out any SGML declarations which we are not handling in the tokeniser. You will find that it works just as well, with the major problem being the restriction that attribute values have to be enclosed in quotes. The next class we will look at, XMLFormCheck, checks if a file contains well-formed XML. It does enforce the restrictions on matching opening and closing tags, but as it does not process a DTD, it cannot tell whether the input data is actually valid XML. However, it is a much smaller class compared to the tokeniser.
8.3.3 An XML Checker
The easiest way to keep track of matching tags is to put them on a stack when an opening tag is encountered, and when coming across a closing tag the topmost tag is taken off the stack and compared to it. Once the end of the input has been reached the stack should be empty, otherwise there were some closing tags missing. A stack is a standard data structure, and we have seen in chapter 5 that there is a stack implementation in the Java class library.

/*
 * XMLFormCheck.java
 */
package xml;

import java.util.Stack;
import java.io.Reader;
import java.io.FileReader;
import java.io.IOException;

public class XMLFormCheck {
    private Stack tagstack;
    private XMLTokeniser source;
After importing the necessary classes, we declare a Stack variable to keep track of the tags, and an XMLTokeniser to process the input data.
In the constructor we initialise these variables:

public XMLFormCheck(Reader input) {
    tagstack = new Stack();
    source = new XMLTokeniser(input);
}
Unlike some other classes, we provide a separate method to do the work, so the constructor is left with only preparing the variables. As all the work of processing the XML input is already done in the XMLTokeniser, the check() method which tests the input for well-formedness can be kept quite simple. This shows the benefits of modularisation, as we can now handle basic XML data without a lot of programming overhead.

public boolean check() throws IOException {
    XMLElement xml = source.readElement();
    boolean wellFormed = true;
    while(wellFormed && xml != null) {
        if(xml.isOpeningTag()) {
            tagstack.push(((XMLTag)xml).getName());
        } else if(xml.isClosingTag()) {
            if(tagstack.isEmpty()) {
                System.out.println("Line: "+source.currentLine());
                System.out.println(" - found spare tag "+((XMLTag)xml).getName());
                wellFormed = false;
            } else {
                String expected = (String)tagstack.pop();
                if(!expected.equals(((XMLTag)xml).getName())) {
                    System.out.println("Line: "+source.currentLine());
                    System.out.println(" - found "+((XMLTag)xml).getName());
                    System.out.println(" - expected "+expected);
                    wellFormed = false;
                }
            }
        }
        xml = source.readElement();
    }
    while(!tagstack.isEmpty()) {
        wellFormed = false;
        System.out.println("leftover tag: "+tagstack.pop());
    }
    return(wellFormed);
}
The check() method returns true if the input is well-formed, and false otherwise. We simply loop through all elements returned by the tokeniser, pushing opening tags onto the stack and comparing each closing tag with the top element taken off it. As the XMLTokeniser provides the current line of the input, we can make use of that when we find a mismatch. If the tag stack is not empty at the end of processing, we print out all the tags that are still left on it.
public static void main(String args[]) throws IOException {
    XMLFormCheck tester = new XMLFormCheck(new FileReader(args[0]));
    boolean result = tester.check();
    System.out.print("The document " + args[0] + " is ");
    if(result == false) {
        System.out.print("not ");
    }
    System.out.println("well-formed XML");
}
} // end of class XMLFormCheck
In the main() method we create an instance of the XMLFormCheck class. We then execute the check() method and store the result in a variable, which we use to generate the right output, depending on whether the XML is well-formed or not. When you run this class with

java xml.XMLFormCheck myfile.xml

you could either get the reassuring

The document myfile.xml is well-formed XML

or, in case of an error, something like

Line: 1031
 - found spare tag test
The document myfile.xml is not well-formed XML

or

leftover tag: test
The document myfile.xml is not well-formed XML

In this class we have not included a lot of error handling, which is alright as long as you're dealing only with research prototypes that you use yourself. As soon as you develop software that is to be used by other people you should put in some safeguards against potential errors. This includes the XMLParseErrors which might be thrown by the tokeniser. It is no problem for you to interpret what has gone wrong, but users will be completely confused if they get a scary error message just because there was some problem with the input file.
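As a minimal sketch of what such a safeguard could look like (the usage message and its exact wording are an illustration only, not part of the original class), the main() method might check its arguments and catch the exception itself instead of letting it propagate:

public static void main(String args[]) {
    // refuse to run without a file name instead of crashing with an exception
    if(args.length < 1) {
        System.err.println("Usage: java xml.XMLFormCheck <file>");
        return;
    }
    try {
        XMLFormCheck tester = new XMLFormCheck(new FileReader(args[0]));
        boolean result = tester.check();
        System.out.print("The document " + args[0] + " is ");
        if(result == false) {
            System.out.print("not ");
        }
        System.out.println("well-formed XML");
    } catch(IOException e) {
        // turn the technical exception into a message the user can act on;
        // an XMLParseError from the tokeniser could be handled in the same way
        System.err.println("Could not process " + args[0] + ": " + e.getMessage());
    }
}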
8.4 SUMMARY
In this chapter we have seen what XML mark-up looks like, and how you can process files which are marked up with it. There is an important difference between valid and well-formed XML, and you can process XML documents using either a DOM parser, which reads the whole document at once, or an event-based parser, which reads it in parts and uses callbacks to communicate with a higher-level application. The most widely used event-based API for XML is SAX.
After the theoretical background, we have implemented a simple tokeniser which splits XML data into distinct elements. This tokeniser does not recognise all possible forms of tags, but should go a long way when processing corpus data, which doesn't necessarily use a lot of fancy mark-up. And finally we have seen how easy it is to create applications based on lower-level components like the tokeniser. Once the main job of splitting the input correctly is done, checking for well-formedness is straightforward and does not require a lot of programming effort.
9 Stemming

In this chapter we take a look at implementing a stemmer as described in Oakes (1998). Starting from a brief description we will see what it takes to turn a table of production rules into a working program.
9.1 INTRODUCTION
Table 3.10 of Oakes (1998) lists a set of rules used by Paice (1977) for implementing a stemmer. A stemmer is a program that reduces word forms to a canonical form, almost like a lemmatiser. The main difference is that a lemmatiser will only take inflectional endings off a word, effectively resulting in a verb's infinitive or the singular form of a noun. A stemmer, on the other hand, tries to remove derivational suffixes as well, and the resulting string of characters might not always be a real word. Stemmers are mainly used in information retrieval, where able and ability should be identical for indexing purposes, whereas most linguists would not consider them part of the same inflectional paradigm.
There is a further advantage of stemmers: a lemmatiser's requirement to produce proper words comes with the need for a way to recognise them, but a stemmer can do without that. This means that a lemmatiser usually consists of a set of reduction rules plus a dictionary to check the correctness of the result, while for a stemmer a set of rules is sufficient. Stemmers are therefore in general not only smaller in size, but also much faster.
The most widespread stemmers in use are based on an algorithm by Porter (1980), which is quite sophisticated and therefore not as easy to implement as the one by Paice (1977). However, the basic principles are much the same, the major difference being that Porter's algorithm applies a number of tests to the input word before executing a rule, whereas Paice's applies a rule as soon as a matching suffix has been found.
Looking at the list of rules there is one major point to notice: as the first matching suffix fires a replacement rule, the order in which the rules are processed is quite important. Consider the case of -ied and -ed. Both rules match the word testified, but the first one accounts for more characters, reducing it correctly to testify, while the second is an 'incomplete' match when looking at the full picture, as it leads to the undesirable testifi. It is therefore important that the longer rule is applied first; this mode of matching patterns is called longest-matching.
The stemming program that we are about to write will take a word, apply all possible rules to it, and output a canonical representation of the input word.
This might serve as a first step for creating a grouped word list, as it removes inflectional endings as well as derivational suffixes. As the program does not contain any language-specific components apart from the rules, it can easily be adapted to work for other languages as well, provided they use suffixes only to form derivations.
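Before we move on to the design, the longest-matching behaviour described above can be illustrated with a small stand-alone sketch. It is not part of the stemmer itself, and the class name and the replacement 'y' are chosen purely for illustration:

public class SuffixOrderDemo {
    public static void main(String args[]) {
        String word = "testified";
        // shorter suffix tried first: an 'incomplete' match
        if(word.endsWith("ed")) {
            System.out.println(word.substring(0, word.length() - 2));        // testifi
        }
        // longer suffix tried first: the desired reduction
        if(word.endsWith("ied")) {
            System.out.println(word.substring(0, word.length() - 3) + "y");  // testify
        }
    }
}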
9.2 PROGRAM DESIGN
Summarising the algorithm is quite simple: for each word we are dealing with, we traverse the list of rules, testing for each rule whether it applies to the input word. Once a matching rule has been found, we 'execute' it, which means we take the suffix off the word and append the replacement if there is one. Afterwards we interpret the transfer part: we either continue at another rule or stop the whole process. As an example we will have a look at the first rule:

ably    -    IS
This rule deals with the suffix 'ably' in words like 'reasonably'. Here it would remove the suffix, as the replacement part is empty (as indicated by the dash). This results in 'reason'. Then we jump to the rule whose label is 'IS' and continue processing there. For our example word this would be the end, but 'advisably', for example, would have been reduced first to 'advis' and then to 'adv'. If the word did not end in 'ably', the rule would be skipped and processing would continue with the second rule.
When designing the program, we first need to think about what kinds of objects we are dealing with. The most basic object, which we will be using as a starting point, is a rule. A rule consists of four elements, the first three of which are optional: label, suffix, replacement and transfer. The label is used for skipping rules during the processing of the list of rules, the suffix determines whether a rule fires, and the replacement specifies what a matching suffix is going to be replaced with. All these can be empty; only the transfer part, which describes what action is taken after a rule has fired, is compulsory.
Then we need a means to initialise the rules somehow. In order to allow different sets of rules to be used, and to make it easier to modify existing rules, we will be reading the rules from a text file. This could be done from within the same class as the rule definition, but in this example the rule loader is kept in a separate class. On the one hand this introduces two classes with a rather tight coupling, as the rule loader needs to interact with the rule objects it is creating from the data file, and this interdependency is generally not a good thing. On the other hand, rules could be loaded from different sources, like files, networked file servers, direct user input, or they could even be generated automatically by another program. This is an argument in favour of keeping the initialisation separate from the rule itself. Furthermore, the specification of a rule is unlikely to change, and any major change to its interface would need to be reflected in all other classes anyway. With the present design we keep the classes small, uncluttered and organised by functionality.
The final class we then need is the main processor, which is often called the engine. Our stemming engine does all the actual work, initialising the rules at startup and then waiting for input words to come in, which are then subjected to the rules. As we have delegated the initialisation to the rule loader, and all the low-level
processing to the rule class, we just need to deal with the input/output aspects and the co-ordination of the rule matching. We will need to store the rules in memory for the processing, and here there is a straight mapping from what we need to an existing data structure: the List interface provides an ordered sequence of elements, just what we need for our purposes. The RuleLoader class returns an instance of a list, and we don't even have to worry about which actual implementation it is. We can use either a Vector or a LinkedList, and if we only use the methods provided by the List interface, we can even change the underlying implementation in the RuleLoader class without it affecting the main class. This is generally a good thing to do, as it hides the implementation details behind the interface specification. This higher level of abstraction makes it easier to handle the complexity of larger programs; in this small example it does not make such a big difference.
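The idea of programming against the interface can be shown in a few lines. This is only a throwaway sketch, and the class and variable names are made up for the occasion:

import java.util.LinkedList;
import java.util.List;
import java.util.Vector;

public class ListDemo {
    public static void main(String args[]) {
        // the rest of the program only ever sees the List interface ...
        List rules = new Vector();
        // List rules = new LinkedList();   // ... so this would be a drop-in replacement
        rules.add("a first rule");
        rules.add("a second rule");
        System.out.println(rules.size() + " rules stored");
    }
}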
9.3 IMPLEMENTATION

In this section we will go through the implementation of the stemming algorithm and discuss each class in detail. Before we do this, we will briefly look at the whole picture, namely how these classes interact. This interaction turns the three individual classes into a stemming program.
The stemming algorithm is implemented using three classes, Stemmer, RuleLoader and Rule. The Stemmer class is the main processing class, which co-ordinates the work and makes use of the other two classes in the process. It can either be used as a module in a larger application, or in stand-alone mode. If run by itself it takes a list of words from the command-line and prints out their stems. This mode is mainly useful for testing purposes; for use within other applications it provides a method to stem a single word.
9.3.1 The Stemmer Class
As with other programs we will now walk through the source code for the Stemmer class.

/*
 * Stemmer.java
 */
package stemmer;

import java.util.List;
import java.util.ListIterator;
import java.io.IOException;
As we have several different classes making up one single application, we put them together in a package. This package is simply called stemmer (see the discussion on xml above), but the name is really arbitrary as long as you don't want to distribute your classes. We need three classes from the standard library, which we import explicitly. In this case listing them all (instead of using the asterisked 'import all' versions) gives us an immediate picture of what external classes we are depending upon.
I*" * This class implements a stemmer for English as described in Oakes (1998) * and Paice (1977). It loads a set of rules from a file and then applies * them to individual words. * @author Oliver Mason * @version 1.0 *I public class Stemmer private List ruleset = null; boolean TRACE = true;
We have two variables, one to store our set of rules in, and one to control the output of diagnostic messages. In order to make it easier to follow the progress of the stemming process through the list of rules, a number of print statements have been added to the code, indicating which rule is currently being processed and whether the matching has been successful or not. Obviously we would not want the program to print these messages once it has been tested and is working properly. We could then take those statements out again, but suppose we change something in the class later on, or we use a different set of rules and want to see if they work properly with the program. In this situation it would be good to have the extra output available again, and with the way the stemmer is implemented here all you need to do is change one single line: the value of the TRACE variable effectively switches the print statements on or off, which makes it very easy to deliberately sprinkle the code with diagnostic messages, none of which you want to see in the final version. You basically have a 'development version', which has the variable set to true at compilation time, and a 'production version' where it is set to false.
Quite often such a variable is called DEBUG instead of TRACE, but that makes no difference. The idea is the same, namely having a 'trace' or 'debugging' mode which allows you to inspect internal information of the program. You can think of these print statements as sensors measuring the temperature and oil pressure of a car engine. The car runs perfectly well without them, but you can discover faults much more easily with the extra information available.
/**
 * Constructor.
 * The set of rules is loaded from a file.
 * @param filename the name of the rule-file.
 * @throws IOException in case anything went wrong.
 */
public Stemmer(String filename) throws IOException {
    ruleset = RuleLoader.load(filename);
    if(TRACE) System.err.println("loaded " + ruleset.size() + " rules");
}
The constructor of the Stemmer class is quite simple; it just initialises the set of rules. We also see TRACE in action: in trace mode the constructor prints the number of rules loaded, which is simply the number of elements in the list. Loading the rules from a file involves a whole host of potential problems, which could make the program
fail. If the rule file does not exist, or the computer's disk drive is broken, or the file has been corrupted, the list of rules cannot be loaded. Since the stemmer relies on the rules existing, this would be a pretty serious state, and we would need to pass that information on to whoever is using the stemmer. For that reason we do not try to catch the IOException that can be thrown by the RuleLoader class in case things go wrong, but just pass it on. This means that the construction of the stemmer fails, which is perfectly reasonable behaviour in this case. After all, this is an unrecoverable error.
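For a calling application this means wrapping the construction in a try block. The following sketch only illustrates that idea; the class name, the rule file name and the test word are all made up:

import java.io.IOException;
import stemmer.Stemmer;

public class StemmerClient {
    public static void main(String args[]) {
        try {
            // construction fails with an IOException if the rule file cannot be read
            Stemmer s = new Stemmer("paice-rules.txt");
            System.out.println(s.stem("testified"));
        } catch(IOException e) {
            System.err.println("Could not load the rule file: " + e.getMessage());
        }
    }
}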
/**
 * Process a word.
 * A single word is passed through the set of rules.
 * @param word the input word.
 * @return the output of applying the stemming rules to the word.
 */
public String stem(String word) {
    ListIterator i = ruleset.listIterator();
    boolean finished = false;
    while(i.hasNext() && !finished) {
        Rule r = (Rule)i.next();
        if(TRACE) System.err.println(word + ": " + r);
        if(r.matches(word)) {
            word = r.execute(word);
            if(TRACE) System.err.println("match -> " + word);
            String transfer = r.getTransfer();
            if("finish".equals(transfer)) {
                finished = true;
            } else {
                if(TRACE) System.err.println(" -> " + transfer);
                finished = !advance(i, transfer);
            }
        }
    }
    return(word);
}
The stem() method does most of the work, and that is partly reflected by its length. Methods shouldn't really be much longer than this, and ideally the whole method should fit on the screen at once, which makes it a lot easier to work with. The stem() method is part of the public API, which means this is the entry point that other modules will use when they want to have a word stemmed.
First we set up an iterator to walk through the rule set. This iterator allows us to keep track of which rule we are currently dealing with. Next we create a flag which we use to terminate the stemming once we have reached a rule that has 'finish' as its transfer component. The processing loop is governed by two conditions: the existence of more rules, and the fact that we haven't reached a point where a rule told us to stop. As soon as either of these conditions no longer holds, the loop is exited and we return the word's stem to the calling module.
In the loop we retrieve the next rule (which we also print when in trace mode) and check if it matches. If it does, we then execute it, which means we apply the suffix replacement to the word. The return value of the rule's execute() method is the modified word, and as we don't need to keep track of the original input we just assign the new word to the same variable, thus overwriting the old value. As an aside, remember that this only changes the local copy of the word variable; the original string in the calling method will not have changed. For that reason we
will have to send the stem back as a return value from the stem() method. Next we get the transfer part of the rule and compare it to the string "finish". The string literal can be used in exactly the same way as a normal string object, and there is a reason that we call the equals() method on the literal and not on the variable transfer: if the value of transfer is null we get a NullPointerException when we try to execute one of its methods. The literal, however, will never be null, and as it is not a problem if the parameter of equals() is null, we will never get into trouble here, whatever the value of transfer. If the transfer part is equal to "finish" we set the variable finished to true, which means that we exit the main loop at the end of this pass. Otherwise we try to locate a rule which has a label that matches the transfer. The advance() method does that, and it returns true if it could find a matching label. When it returns false it means there was no matching label and we ought to stop the processing. Therefore we assign to the finished flag the negation of the return value, as indicated by the exclamation mark (the boolean negation operator, see 2.1).
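This null-safety can be demonstrated in isolation; the following throwaway sketch is purely an illustration and not part of the stemmer:

public class EqualsDemo {
    public static void main(String args[]) {
        String transfer = null;                       // e.g. a rule with no transfer value
        // transfer.equals("finish") would throw a NullPointerException here,
        // but calling equals() on the literal is always safe:
        boolean finished = "finish".equals(transfer);
        System.out.println(finished);                 // prints: false
    }
}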
/**
 * Advance through the set of rules until a matching label has been found.
 * The iterator given as a parameter is moved forward to the next rule
 * which matches the given label.
 * @param iter an iterator through a set of rules.
 * @param label a label to match.
 * @return false if no matching label could be found, true otherwise.
 */
private boolean advance(ListIterator iter, String label) {
    boolean found = false;
    while(iter.hasNext() && !found) {
        Rule r = (Rule)iter.next();
        if(r.matchesLabel(label)) {
            // match found
            iter.previous();
            found = true;
        }
    }
    return(found);
}
The advance() method moves the iterator forward until it either hits the end of the list of rules or finds one with a matching label. By default we assume that no rule has been found, and we initialise a flag with false. Just like in the main loop of the stem() method, we proceed while there are more rules to look through and while we haven't found a match. We get the next rule and test if its label matches the one we are looking for. If the match was successful, we need to go back one step, as the iterator will already point past the matching rule. By setting it back to the previous rule, which in fact is the one we just retrieved, we make sure that the matching rule will be the one selected next time the iterator is accessed. Effectively we position the iterator just before the rule with the matching label. We then also set the flag to true so that the loop is exited afterwards. If there was no matching rule, the loop exits because of the first condition, and found will still have the value false.
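The effect of the previous() call can be tried out on its own; this little sketch has nothing to do with the stemmer and only shows how the iterator moves:

import java.util.ListIterator;
import java.util.Vector;

public class IteratorDemo {
    public static void main(String args[]) {
        Vector v = new Vector();
        v.add("a");
        v.add("b");
        v.add("c");
        ListIterator i = v.listIterator();
        i.next();                        // returns "a"
        i.next();                        // returns "b" -- suppose this is our match
        i.previous();                    // step back one position
        System.out.println(i.next());    // prints "b" again
    }
}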
/**
 * main method for testing purposes.
 * The first command-line parameter is the rule file, and all subsequent
 * parameters are interpreted as words to be stemmed.
 * @param args command-line parameters.
 * @throws IOException if the rule file could not be loaded.
 */
public static void main(String args[]) throws IOException {
    Stemmer s = new Stemmer(args[0]);
    for(int i = 1; i < args.length; i++) {
        System.out.println(args[i] + ": " + s.stem(args[i]));
    }
}
} // end of class Stemmer
The final method of the Stemmer class is main(), which is called whenever a user runs the Java interpreter directly on the class. Other classes can of course also call it. The parameter of main() is an array of strings from the command-line, and we take the first one to be the filename of the rule file, which we use to construct an instance of the stemmer itself. We then loop through all the remaining command-line arguments, taking every further argument as a word to be stemmed. In this loop we print the original word and its stem until all parameters have been processed.
It is worth noting that the main() method as it stands will cause an exception if the command-line is empty. It is assumed that there is at least one argument, the rule file name, and no check is made whether this argument exists. If the main() method were the main entry point for users of the stemmer this would not be very good, as a user might easily try it out without knowing about the required command-line structure and would promptly see an error message, thus not getting a very good impression of the program. Since the main() method here is only meant to be used for testing purposes the error check has been left out, but it is always a good idea to test any input for correctness, and a user will find it much more friendly if a usage message is printed on the screen instead of a cryptic error message.
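A version of main() with such a check might look like the following sketch; the wording of the usage message is of course arbitrary:

public static void main(String args[]) throws IOException {
    // at least a rule file and one word are needed
    if(args.length < 2) {
        System.err.println("Usage: java stemmer.Stemmer <rulefile> <word> [<word> ...]");
        return;
    }
    Stemmer s = new Stemmer(args[0]);
    for(int i = 1; i < args.length; i++) {
        System.out.println(args[i] + ": " + s.stem(args[i]));
    }
}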
9.3.2 The RuleLoader Class

Next we will take a look at the RuleLoader class.

/*
 * RuleLoader.java
 */
package stemmer;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.NoSuchElementException;
import java.util.List;
import java.util.Vector;
This class is obviously in the same package as the other classes, and here we require a few more standard class library components. Still, they are all listed explicitly for clarity. As an aside, this also speeds up compiling, as the compiler doesn't have to look through the whole package to locate the relevant class information.

/**
 * This class loads a set of rules from a file.
 *
 * The file contains one rule per line; empty lines or lines beginning
 * with a hash symbol are ignored. Each line has to contain four
 * elements separated by white spaces: label, suffix, replacement and
 * transfer.
 *
 * All the action takes place in the static method load() which loads
 * the rules from the specified file.
 *
 * @author Oliver Mason
 * @version 1.0
 */