Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images 9783031553882, 9783031553899

This book provides a comprehensive presentation of a recently introduced framework, named "probabilistic indexing" …

English Pages 378 [372] Year 2024

Table of contents :
Preface
Acknowledgements
Persons
Projects
Grants
Contents
Acronyms
Lists of Abbreviations
Mathematical Notation
List of Algorithms
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation and Background
1.2 Information Retrieval
1.3 Pattern Recognition
1.4 Decision Theory
1.5 Handwritten Text Recognition
1.6 Assessing Indexing and Search Performance
1.7 Handwritten Text Recognition and Probabilistic Indexing
References
Chapter 2 State Of The Art
2.1 The field of a Hundred Names
2.2 Taxonomy of KWS approaches
2.2.1 Segmentation Assumptions
2.2.1.1 Word Segmentation
2.2.1.2 Line Segmentation
2.2.1.3 Segmentation-free
2.2.2 Retrieved Objects
2.2.2.1 Word Instances
2.2.2.2 Lines
2.2.2.3 Pages
2.2.3 Query Representation
2.2.3.1 Query-by-String
2.2.3.2 Query-by-Example
2.2.4 Training Requirements
2.2.4.1 Unsupervised
2.2.4.2 Supervised
2.3 Additional Significant Matters for KWS
2.3.1 Hyphenated Words
2.3.2 Abbreviations
2.3.3 Multiple-word Queries
2.4 State of the Art in HTR Models and Methods
References
Chapter 3 Probabilistic Indexing (PrIx) Framework
3.1 Pixel Level Textual Image Representation: 2-D Posteriorgram
3.2 Image Regions for Keyword Indexing and Search
3.3 Position-independent PrIx
3.3.1 Naive Word Posterior Interpretation of P(R | x, v)
3.3.2 Proposed Approximations to P(R | x, v)
3.3.3 Estimating Image-Region RPs from Posteriorgrams
3.3.4 Line-Region RP and 1-D Posteriorgram
3.4 Position-independent PrIx and KWS from a HTR Viewpoint
3.4.1 Comparing the Image Processing and HTR Viewpoints
3.5 Position-dependent PrIx
3.5.1 Relevance of an Horizontal Coordinate Position
3.5.2 Relevance of a Segment of Text-line Image Region
3.5.3 Relevance of a Transcript Ordinal Position
3.6 Query-by-Example Paradigm
3.6.1 Position-independent RPs for Query by Example KWS
3.6.2 Position-dependent RPs for Query by Example KWS
3.7 Relations among Position-Dependent and Independent RPs
3.7.1 Equivalences of Positional RPs and other Posterior Probabilities
3.7.2 Computing Horizontal Coordinate RP from Segment RP
3.7.3 Expected Values of Segments and Ordinal Positions
3.7.4 RP Inequalities Based on Fréchet Bounds
3.8 PrIx Implementation Foreword
References
Chapter 4 Probabilistic Models for Handwritten Text
4.1 Traditional Image Preprocessing and Feature Extraction
4.1.1 Image Preprocessing and Text Segmentation
4.1.2 Text Line Normalization
4.1.3 Feature Extraction
4.2 Optical Modeling
4.3 Hidden Markov Models
4.3.1 Description
4.3.2 HMM Training
4.3.3 HMMs for Optical Modeling in Handwritten Text Recognition
4.3.3.1 Generative Training
4.3.3.2 Discriminative Training
4.4 Artificial Neural Networks
4.4.1 Description
4.4.2 Convolutional Layers
4.4.3 Recurrent Layers
4.4.3.1 Long Short-Term Memory layers
4.4.3.2 Estimating character-level posterior probabilities
4.4.3.3 Connectionist Temporal Classification
4.4.4 CRNN Training Through Gradient Descent
4.4.5 Neural Networks for Handwritten Text
4.5 Key differences between HMMs and CRNNs with CTC
4.6 N-gram Language Models
4.6.1 Combining the Output of a CRNN with a N-gram LM
4.7 Weighted Finite State Transducers (WFST)
4.7.1 The WFST Composition Operation
4.7.2 Handling CTC by Means of Elementary WFST Operations
4.7.3 Lattices Represented as WFST or WFSA
4.7.4 Normalization of Lattice Weights
The Backward and Forward Algorithms
Edge-Posterior Normalization
Sentence-Posterior Normalization
References
Chapter 5 Probabilistic Indexing for Fast and Effective Information Retrieval
5.1 Lexicon-Based and Lexicon-Free PrIx
5.2 Lexicon-based PrIx from Pixel-level Posteriorgrams
5.3 Indexing Lexicon-based Lattices
5.3.1 Position-independent Relevance
A Lower Cost Alternative
5.3.2 Lexicon-based Segment Relevance
5.3.3 Lexicon-based Horizontal Position Relevance
5.3.4 Lexicon-based Ordinal Position Relevance
Disambiguating WG State Positions
Building Ordinal Position PrIxs
5.4 The Out-of-vocabulary Problem
5.5 Indexing Lexicon-free Lattices
5.5.1 From Character to Word Lattices
5.6 Alternative Approaches for Lexicon-free PrIx
5.6.1 Lexicon-free Segment Relevance
5.6.1.1 Encode Character Alignment
5.6.1.2 Disambiguating the Input Class Associated to States
5.6.1.3 From Subpaths to Complete Paths
5.6.1.4 From Character to Word Alignments
5.6.1.5 Disambiguating WFST Paths through Automaton Determinization
5.6.1.6 N-best Paths
5.6.1.7 Indexing Words with Alignment from Character Lattices
5.6.2 Lexicon-free Ordinal Position Relevance
5.6.2.1 Associating Ordinal Word Positions to States
5.6.2.2 Encode Word Counts
5.6.2.3 Indexing Words with Positions from Character Lattices
5.7 Multi-word and Regular-Expression Queries
References
Chapter 6 Empirical Validation of Probabilistic Indexing Methods
6.1 Experimental Setup
6.1.1 Evaluation Protocol: Image Regions, Query Sets and Metrics
6.1.2 Datasets and Query Sets
6.1.3 Statistical Models for Handwritten Text
6.2 Assessing Posteriorgram Methods for Lexicon-based PrIx
6.3 Comparing Position-Dependent RP Definitions
6.4 Evaluating Language Model Impact
6.4.1 Lexicon-based Models
N-gram Order
Lexicon Size
6.4.2 Lexicon-free Models
N-gram Order
Number of Indexed Spots per Line and PrIx Density
6.4.3 Effect of the Optical and Character-label Prior Scales
6.5 Impact of Training-set Size and Data Augmentation
6.6 Correlation between Average Precision and HTR Error Rates
6.7 Results on Other Academic Benchmark Datasets
6.7.1 George Washington
6.7.2 Parzival
6.7.3 Comparison with Previous State-of-the-art Results
6.8 Comparing CRNN and HMM-GMM Optical Modeling
Storage Efficiency
6.9 Experiments for Real Indexing Projects
6.9.1 Passau
6.9.2 Chancery (Himanis)
6.9.3 Teatro del Siglo de Oro (TSO)
6.9.4 Large Bentham Dataset (BEN4)
6.9.5 Carabela
6.9.6 Finnish Court Records (FCR)
6.10 Segmentation-free Evaluation
6.10.1 ICDAR2015 Competition on Handwriting KWS
6.10.2 ICFHR2014 Competition on QbE Handwriting KWS
6.11 Summary
References
Chapter 7 Probabilistic Interpretation of Traditional KWS Approaches
7.1 On the Spotting Versus Recognition Debate
7.2 Distance-based Methods
7.2.1 Simplifying QbE RP for Word-segmented Image Regions
7.2.2 Distance-based Density Estimation
7.2.2.1 The Multi-variance Problem
7.2.2.2 The Multi-mode Problem
7.2.3 Interpretation of Distance-based KWS: Empirical Results
Results
7.3 PHOC-based Methods
7.3.1 Predicting the PHOC of a Word Image Region: PHOCNet
7.3.2 PHOC-based QbE KWS
7.3.3 Probabilistic PHOCNet
7.3.4 PHOCNet Probabilistic Interpretation: Empirical Results
Results
7.3.5 Summary of Results of Distance- and PHOC-based Methods
7.4 HMM-Filler
7.4.1 HMM-Filler Probabilistic Interpretation: Experiments
Results and Discussion
7.4.2 Fast HMM-Filler Computation using Character Lattices
7.5 BLSTM-CTC KWS
7.5.1 BLSTM-CTC KWS Interpretation: Experimental Validation
References
Chapter 8 Probabilistic Indexing Search Extensions
8.1 Multi-Word Boolean and Word-Sequence Queries
8.1.1 Experiments
8.2 Searching for Music Symbol Sequences
8.2.1 Experiments
8.3 Structured Queries for Information Retrieval in Table Images
8.3.1 Experiments
Discussion
8.4 Searching for Hyphenated Words
8.4.1 Experiments
Discussion
8.5 Approximate-Spelling and Wildcard Queries
8.5.1 Approximate-Spelling
8.5.2 Wildcard Spelling
8.5.3 Experiments
Discussion
References
Chapter 9 Beyond Search Applications of Probabilistic Indexing
9.1 Text Analytics Using PrIx
9.2 Estimating Word and Document Frequencies from PrIxs
9.3 Zipf Curves, Running Words and Lexicon Size
9.3.1 Estimating Running Words and Lexicon Size: Results
9.4 Statistical Information Extraction from Text Images
9.4.1 Indexing Semantically Tagged Words and Named Entities
9.4.2 Statistical Information Extraction from Handwritten Forms
9.5 Classification of Large Untranscribed Image Documents
9.5.1 Plaintext Document Classification
9.5.1.1 Feature Selection
9.5.1.2 Feature Extraction
9.5.2 Estimating Text Features from Image PrIxs
9.5.3 Image Document Classification
9.5.4 Open Set Classification
9.5.5 Experiments
9.5.5.1 Dataset
9.5.5.2 Empirical settings
9.5.5.3 Experiments and Results
Threshold-less Closed and Open Set Classification
Threshold-based Open Set Classification and Rejection
9.5.6 Image Document Classification Concluding Remarks
References
Chapter 10 Large-scale Systems and Applications
10.1 Conceptual System Organization and Workflow
10.1.1 PrIx Components
10.1.2 Spots Database
10.1.3 PrIx Search Engine and User Interface
10.2 Architecture Design
10.2.1 Web and Data Servers
10.2.2 PrIx Server and Search Engine
10.2.3 Web Client
10.3 Large-Scale Applications
10.3.1 Trésor des Chartes (Chancery)
10.3.2 Teatro del Siglo de Oro (TSO)
10.3.3 Bentham Papers (Bentham)
10.3.4 Parcels from Indias and Cádiz Archives (Carabela)
10.3.5 Finnish Court Records (FCR)
10.3.6 General Discussion on Large-Scale Applications
References
Chapter 11 Conclusion and Outlook
11.1 Contribution Summary
Probabilistic Indexing Framework (PrIx)
Probabilistic Models of Handwritten Text
Indexing Algorithms based on PrIx
Probabilistic Interpretation of KWS Methods
Beyond Traditional and Academic KWS
11.2 Future Work
Stochastic Definitions of Relevance
Better Statistical Models and Training
Probabilistic Framework Applied to Other Domains
References
Appendix A The Probability Ranking Principle
A.1 Ranking Multiple Relevant Images
A.2 Evaluation Measures and Optimality
A.2.1 Precision-at-k
A.2.2 Recall-at-k
A.2.3 Average Precision (AP)
A.2.4 Discounted Cumulative Gain (DCG)
A.2.5 Normalized Discounted Cumulative Gain (NDCG)
A.3 Global and Mean Measures
References
Appendix B Weighted Finite State Transducers (WFST)
B.1 Introduction
B.2 Description
B.3 WFST Operations
B.3.1 Composition
B.3.2 Shortest Path and Distance
B.4 Determinization
References
Appendix C Text Image Document Collections and Datasets
General Remarks About How the Dataset Statistics Are Reported
C.1 IAM
C.2 The Bentham Papers Collection and Datasets
C.2.1 ICFHR-2014 Competition on HTR (BEN1)
C.2.1.1 Line-level PrIx Experiments
C.2.1.2 Multi-word Page-level PrIx Experiments
C.2.2 ICFHR-2014 Competition on KWS (BEN2)
C.2.3 ICDAR-2015 Competition on KWS (BEN3)
C.2.4 Large Bentham Dataset used in [28] (BEN4)
C.3 George Washington (GW)
Line-level Settings
Word-level Settings
C.4 Parzival (PAR)
C.5 Plantas (PLA)
C.6 Passau Parish Records (PAS)
C.7 Trésor des Chartes and Chancery (CHA)
C.8 Spanish Golden Age Theater (TSO)
C.9 Parcels from Indias and Cádiz Archives: Carabela (CAR)
C.10 Finnish Court Record (FCR)
Dataset for Hyphenated-Word Experiments
C.11 The Vorau-253 Sheet Music Manuscript and Dataset
C.12 A Dataset for Multi-page Handwritten Deeds Classification
References
The Information Retrieval Series

Alejandro Héctor Toselli Joan Puigcerver Enrique Vidal

Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images

The Information Retrieval Series Volume 49

Series Editors
ChengXiang Zhai, University of Illinois, Urbana, IL, USA
Maarten de Rijke, University of Amsterdam, The Netherlands and Ahold Delhaize, Zaandam, The Netherlands

Editorial Board Members
Nicholas J. Belkin, Rutgers University, New Brunswick, NJ, USA
Charles Clarke, University of Waterloo, Waterloo, ON, Canada
Diane Kelly, University of Tennessee at Knoxville, Knoxville, TN, USA
Fabrizio Sebastiani, Consiglio Nazionale delle Ricerche, Pisa, Italy

Information Retrieval (IR) deals with access to and search in mostly unstructured information, in text, audio, and/or video, either from one large file or spread over separate and diverse sources, in static storage devices as well as on streaming data. It is part of both computer and information science, and uses techniques from e.g. mathematics, statistics, machine learning, database management, or computational linguistics. Information Retrieval is often at the core of networked applications, web-based data management, or large-scale data analysis. The Information Retrieval Series presents monographs, edited collections, and advanced text books on topics of interest for researchers in academia and industry alike. Its focus is on the timely publication of state-of-the-art results at the forefront of research and on theoretical foundations necessary to develop a deeper understanding of methods and approaches. This series is abstracted/indexed in EI Compendex and Scopus.

Alejandro Héctor Toselli Universitat Politècnica de València Valencia, Spain

Joan Puigcerver Google Research Zurich, Switzerland

Enrique Vidal Universitat Politècnica de València Valencia, Spain

ISSN 1871-7500 ISSN 2730-6836 (electronic) The Information Retrieval Series ISBN 978-3-031-55388-2 ISBN 978-3-031-55389-9 (eBook) https://doi.org/10.1007/978-3-031-55389-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Preface

The technology presented in this book was developed by members of the Pattern Recognition and Human Language Technology (PRHLT) research center of the Universitat Politècnica de València (UPV), Spain. PRHLT dates back to the 1990s. In addition to the development of fundamental Pattern Recognition (PR) methods, the main application field of PRHLT in those early years was Automatic Speech Recognition (ASR). But the research activity was soon expanded to deal also with Machine Translation (MT) and, later (around 2010), with Computer Vision (CV) and Multimodal Interaction in PR. Also at that time, a new research line was started on Document Analysis and Handwritten Text Recognition (HTR). It soon became clear that the application field of these technologies was not present-day handwritten documents, but the astronomical number of historical series of manuscripts that are held by archives and libraries around the world. Not in vain, it is often speculated that the amount of original handwritten text existing in the world notably surpasses that of printed or typewritten text, including native electronic documents!

When dealing with historical handwritten text, a major issue emerges forcibly: uncertainty. The concept of “ground truth” or “reference” transcript proves elusive: if several paleography experts are asked to interpret a given piece of this kind of handwritten text, what you generally get are several different interpretations, nothing close to the single, unique transcript you would need as ground truth. The reasons are plenty: historical manuscripts are generally plagued with scrawled scripts, extremely abridged and tangled abbreviations, and archaic or outdated word spellings. Moreover, punctuation is rarely used, or is used inconsistently, and reading order is often ambiguous, due to conflicting and even erratic layout. In addition, there are many types of preservation-related degradations, such as lack of contrast or support, dirt spots, severe show-through and bleed-through, etc.

Because of these issues, it promptly became very clear that the classical goal of HTR technologies (given a text image, produce a unique transcript which is “as correct as possible”) was a delusion. Therefore, suitable approaches were needed that explicitly embraced the intrinsic uncertainty of the historical handwritten text.

The first of these approaches was CATTI (Computer Assisted Transcription of Text Images; see Reference [29] in Chapter 1): rather than insisting on finding the best, or the most correct, transcript, just offer the users a set of likely alternatives and let them decide what is “best” or “correct”. In CATTI, the process is interactive and driven by user feedback, so alternative transcription hypotheses tend to follow what the user has previously validated, suggested, or amended.

While CATTI proved useful to cost-effectively transcribe singular historical manuscripts or small sets of specific text images, it is obviously not an option when the historical collection considered amounts to hundreds of thousands, or millions, of pages. In these cases, any approach that requires human expert intervention quickly becomes unscalable, or plainly unrealistic.

Interestingly, when large series of manuscripts are considered, the primary interest is generally not transcription, but making the series searchable. So a naive line of reasoning became very popular to deal with this situation: first transcribe the whole series automatically by means of a “good” HTR system, and then apply a suitable off-the-shelf platform for plain text search. However, because of the intrinsic uncertainty discussed above (and the sheer scale of the task), this naive idea is rather unrealistic; transcription “errors” (or just unexpected interpretations) will be plenty and, what is worse, it will be difficult to assess to what extent the obtained transcripts are “correct”. Therefore, searching for information in such uncontrollably noisy text typically yields a disappointing search experience.

This is how our Probabilistic Indexing (PrIx) framework emerged. It explicitly embraces the inherent ambiguities and uncertainty of historical handwritten text and it capitalizes on this acquiescence to provide excellent search performance, even for massive series of complex manuscripts. As discussed in this book, by adequately modeling the uncertainty, PrIx methods prove very effective in actually allowing users to find the information they are interested in.

Of course, once the information is found in a page or a section of some manuscript, the relevant handwritten text can be automatically transcribed, if needed. Then the user, or an expert paleographer, can revise these specific transcripts (maybe with CATTI assistance) to ensure the required level of reliability. Or it may also be the case that the target of the search process was not to get hold of specific text, but rather to use the discovered information in further studies. In this case, we also discuss in the book (Chapter 9) how the PrIx representation of a large collection of text images can be advantageously used for this purpose.

PRHLT researchers developed PrIx (and HTR) technologies thanks to several projects funded by national and European research programs. Perhaps the most important projects were tranScriptorium and READ, followed by HIMANIS and Carabela. By the end of these projects (2020), we had released an open-source HTR toolkit called PyLaia, based on deep Convolutional-Recurrent Neural Networks (which is nowadays considered one of the best HTR toolkits). On the other hand, the fundamentals of the PrIx framework were already fully developed, and a completely operative set of PrIx tools and the corresponding workflow had been implemented and fully tested in several large-scale series of historical manuscripts (these works and results are described in Chapter 10 of this book).

In sum, by 2020 the PRHLT/UPV PrIx (and HTR) technology was fully mature and ready to be applied in real indexing projects. The first of these projects was the so-called Finnish Court Records, where the PrIxs of more than one million page images were successfully produced through a collaboration of the National Archives of Finland, READ-coop and PRHLT/UPV. Yet PRHLT, as an academic research center of the UPV, was not the most adequate institution to manage this kind of project, which was mostly application-oriented. So, by the end of 2020, the UPV promoted the creation of a spin-off company called tranSkriptorium IA SL (tS). Since its creation, tS has been busy with important indexing projects both in Spain and in other European countries. Given the very nature of PrIx, tS only deals with large documentary series, say from a few tens of thousands to millions of text images. Smaller series and singular transcription works are not considered for now in the tS business plan.

Harnessing the PrIx workflow for real large-scale applications has proved fairly complex. Best results are always achieved by training, or at least fine-tuning, the PrIx models with the most representative images (and transcripts) of the series considered. The amount of training images required to achieve satisfactory results is also difficult to predict. Adequate management of these complexities usually boils down to understanding what the most significant writing style and layout differences are between images of different parts of a series of manuscripts. So the whole process entails a multidisciplinary pipeline where expert paleographers are needed in the center of a loop that we call expert-mediated active learning (outlined in Appendix C of this book). This process ensures results that are perfectly adapted to the particular needs of the archive or library users for each specific document collection.

To manage these complexities, tS has developed a competitive multidisciplinary team with the know-how and skills needed for an efficient and productive use of the PrIx and HTR technologies inherited from PRHLT/UPV. This allows the company to be already self-sustainable, with production and service costs perfectly affordable by most public archives and libraries.

Of course, the contents of this book could make things easier for individuals or institutions that wish to develop their own workflows to process large series of historical documents. In fact, this is the purpose. However, it is worth considering cooperation rather than competition. There are plenty of emerging applications of PrIx (and/or HTR) technologies for documentary management in archives and libraries, and tS is keen to discuss interesting precompetitive collaborations to help develop this kind of application. Moreover, tS, as a spin-off of PRHLT/UPV, enjoys a tight collaboration with PRHLT researchers, who go on doing fundamental research on open or basic problems of document analysis and recognition. This allows cost-effective research on important issues that would otherwise be prohibitive for small companies such as tS.

Let us focus now on the contents of this book. It constitutes a comprehensive, ordered and coherent compilation of our work on Probabilistic Indexing of Handwritten Text Images during the last decade. Parts of this work have been presented in several conferences and workshops in the field of document image processing, as well as in a few journal papers (References [37, 36, 40] in Chapter 1). But a very significant part of the book stems from the mostly unpublished PhD work (Reference [27] in Chapter 1) of one of the authors, as well as from tutorial and keynote presentations that had never been published in proceedings or journals. The book is structured into 11 chapters and three appendices, as follows:

Chapter 1 (Introduction) presents the motivations for effective information retrieval solutions to real-life searching on large collections of handwritten documents. Fundamentals of Pattern Recognition, Statistical Decision Theory, and Handwritten Text Recognition are briefly outlined, along with Information Retrieval and a comprehensive account of the evaluation measures most commonly adopted in this field.

Chapter 2 (State Of The Art) outlines the state of the art in the different fields related to the book contents. Basic techniques and currently available solutions to tackle the relevant problems are presented and compared. More technical details of the most interesting approaches, however, will be presented in Chapter 7, once the notation and the fundamental framework have been completely established.

Chapter 3 (Probabilistic Framework) presents, in a principled and comprehensive manner, the approaches we propose for indexing (as opposed to “spotting”) each region of a handwritten text image which is likely to contain a word.

Chapter 4 (Probabilistic Models for Handwritten Text) describes the models that we have adopted for handwritten text in images, namely hidden Markov models, convolutional and recurrent neural networks, and language models. This chapter also provides full details of weighted finite-state transducer (WFST) concepts and methods, needed in further chapters of the book. From this point of view, word or character graphs or lattices are reviewed in full detail.

Chapter 5 (Probabilistic Indexing for Fast and Effective Information Retrieval) introduces and explains the set of techniques and algorithms developed to generate image probabilistic indexes, to allow for fast search and retrieval of textual information in the indexed images.

Chapter 6 (Empirical Validation of Probabilistic Indexing Methods) presents experimental evaluations of the proposed framework and algorithms on different traditional benchmark datasets and compares them with other traditional approaches. Results on new, much larger and more realistic experimental datasets are also included.

Chapter 7 (Probabilistic Interpretation of Traditional KWS Approaches) reviews the most popular KWS approaches, most of which are based on heuristic arguments, and provides a probabilistic interpretation of these arguments.

Chapter 8 (Probabilistic Indexing Search Extensions) explains how a basic, word-based PrIx can support classical free-text search tools, such as arbitrarily complex Boolean multi-word (AND/OR/NOT) combinations and word sequences, as well as wildcard and flexible or approximate spelling. Other extensions include using entire words in queries to also find word instances that may be hyphenated in unknown or even unexpected manners, as well as tabular queries to retrieve geometrically structured information in handwritten tables.

Chapter 9 (Beyond Search Applications of PrIx) presents new methods that use PrIx not only for searching, but also to deal with text analytics and other related natural language processing and information extraction tasks, directly on untranscribed collections of text images.

Chapter 10 (Large-scale Systems and Applications) shows how the proposed solutions can be used to effectively index real, large collections of handwritten document images, which typically are orders of magnitude larger than the traditional academic datasets. Several showcases of large-scale image collections of historical manuscripts, which have been very successfully indexed using the proposed approaches, are also presented.

Chapter 11 (Conclusion and Outlook) summarizes the contributions of this book and suggests promising lines of future research.

Appendix A (The Probability Ranking Principle) provides the Decision Theoretic background for the PrIx framework and the corresponding methods developed in this book.

Appendix B (Weighted Finite State Transducers) outlines the algebra and basic algorithms of WFSTs, which constitute the main supporting machinery for the algorithms presented (mainly) in Chapter 5.

Appendix C (Text Image Document Collections and Datasets) presents details of the text image collections and datasets used in the experiments reported throughout the book and/or in the large-scale demonstrators presented in Chapter 10.

Valencia (Spain), December, 2023

Alejandro Héctor Toselli, Joan Puigcerver, Enrique Vidal

Acknowledgements

The authors are deeply grateful to the following people and publicly funded projects and grants, which have made possible the writing of several sections of this book.

Persons

Specific thanks are due to:

Name: Contribution
José Andrés: Collaboration in the work of Secs. 8.4 and 9.4
José Ramón Prieto: Collaboration in the work of Sec. 9.5
Dr. Carlos Alonso: Collaboration in compiling datasets of Appendix C.9
Dr. Verónica Romero: Collaboration in work of Sec. 6.9 and Apps. C.8, C.9
Dr. Joan Andreu Sánchez: Collaboration in the work of Sec. 6.9 and Appendix C.2
Dr. Giorgos Sfikas: For providing Fig. 7.3

In general, the authors also warmly acknowledge the help of all the current and former PRHLT members who contributed to code development and to the realization of many of the experiments reported throughout this book.

Projects

Thanks to the tranScriptorium and READ Projects for supporting the initial technological developments of PrIx and the compilation of many of the datasets described in Appendix C. Moreover, thanks to the HIMANIS and Carabela Projects for supporting the development of the large-scale PrIx demonstrators described in Chapter 10, and the compilation of the datasets presented in Appendixes C.7 and C.9, respectively.

Grants

The work of one of the authors, Alejandro H. Toselli, is currently supported by a María Zambrano Grant from the Spanish Ministerio de Universidades and the European Union NextGenerationEU/PRTR.

xi

Contents

Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxi 1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Handwritten Text Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.6 Assessing Indexing and Search Performance . . . . . . . . . . . . . . . . . . . . 9 1.7 Handwritten Text Recognition and Probabilistic Indexing . . . . . . . . . 14 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 State Of The Art
2.1 The field of a Hundred Names
2.2 Taxonomy of KWS approaches
2.2.1 Segmentation Assumptions
2.2.2 Retrieved Objects
2.2.3 Query Representation
2.2.4 Training Requirements
2.3 Additional Significant Matters for KWS
2.3.1 Hyphenated Words
2.3.2 Abbreviations
2.3.3 Multiple-word Queries
2.4 State of the Art in HTR Models and Methods
References

3 Probabilistic Indexing (PrIx) Framework
3.1 Pixel Level Textual Image Representation: 2-D Posteriorgram
3.2 Image Regions for Keyword Indexing and Search
3.3 Position-independent PrIx
3.3.1 Naive Word Posterior Interpretation of $P(R \mid x, v)$
3.3.2 Proposed Approximations to $P(R \mid x, v)$
3.3.3 Estimating Image-Region RPs from Posteriorgrams
3.3.4 Line-Region RP and 1-D Posteriorgram
3.4 Position-independent PrIx and KWS from a HTR Viewpoint
3.4.1 Comparing the Image Processing and HTR Viewpoints
3.5 Position-dependent PrIx
3.5.1 Relevance of an Horizontal Coordinate Position
3.5.2 Relevance of a Segment of Text-line Image Region
3.5.3 Relevance of a Transcript Ordinal Position
3.6 Query-by-Example Paradigm
3.6.1 Position-independent RPs for Query by Example KWS
3.6.2 Position-dependent RPs for Query by Example KWS
3.7 Relations among Position-Dependent and Independent RPs
3.7.1 Equivalences of Positional RPs and other Posterior Probabilities
3.7.2 Computing Horizontal Coordinate RP from Segment RP
3.7.3 Expected Values of Segments and Ordinal Positions
3.7.4 RP Inequalities Based on Fréchet Bounds
3.8 PrIx Implementation Foreword
References

4 Probabilistic Models for Handwritten Text
4.1 Traditional Image Preprocessing and Feature Extraction
4.1.1 Image Preprocessing and Text Segmentation
4.1.2 Text Line Normalization
4.1.3 Feature Extraction
4.2 Optical Modeling
4.3 Hidden Markov Models
4.3.1 Description
4.3.2 HMM Training
4.3.3 HMMs for Optical Modeling in Handwritten Text Recognition
4.4 Artificial Neural Networks
4.4.1 Description
4.4.2 Convolutional Layers
4.4.3 Recurrent Layers
4.4.4 CRNN Training Through Gradient Descent
4.4.5 Neural Networks for Handwritten Text
4.5 Key differences between HMMs and CRNNs with CTC
4.6 $N$-gram Language Models
4.6.1 Combining the Output of a CRNN with a $N$-gram LM
4.7 Weighted Finite State Transducers (WFST)
4.7.1 The WFST Composition Operation
4.7.2 Handling CTC by Means of Elementary WFST Operations
4.7.3 Lattices Represented as WFST or WFSA
4.7.4 Normalization of Lattice Weights
References

5 Probabilistic Indexing for Fast and Effective Information Retrieval
5.1 Lexicon-Based and Lexicon-Free PrIx
5.2 Lexicon-based PrIx from Pixel-level Posteriorgrams
5.3 Indexing Lexicon-based Lattices
5.3.1 Position-independent Relevance
5.3.2 Lexicon-based Segment Relevance
5.3.3 Lexicon-based Horizontal Position Relevance
5.3.4 Lexicon-based Ordinal Position Relevance
5.4 The Out-of-vocabulary Problem
5.5 Indexing Lexicon-free Lattices
5.5.1 From Character to Word Lattices
5.6 Alternative Approaches for Lexicon-free PrIx
5.6.1 Lexicon-free Segment Relevance
5.6.2 Lexicon-free Ordinal Position Relevance
5.7 Multi-word and Regular-Expression Queries
References

6 Empirical Validation of Probabilistic Indexing Methods
6.1 Experimental Setup
6.1.1 Evaluation Protocol: Image Regions, Query Sets and Metrics
6.1.2 Datasets and Query Sets
6.1.3 Statistical Models for Handwritten Text
6.2 Assessing Posteriorgram Methods for Lexicon-based PrIx
6.3 Comparing Position-Dependent RP Definitions
6.4 Evaluating Language Model Impact
6.4.1 Lexicon-based Models
6.4.2 Lexicon-free Models
6.4.3 Effect of the Optical and Character-label Prior Scales
6.5 Impact of Training-set Size and Data Augmentation
6.6 Correlation between Average Precision and HTR Error Rates
6.7 Results on Other Academic Benchmark Datasets
6.7.1 George Washington
6.7.2 Parzival
6.7.3 Comparison with Previous State-of-the-art Results
6.8 Comparing CRNN and HMM-GMM Optical Modeling
6.9 Experiments for Real Indexing Projects
6.9.1 Passau
6.9.2 Chancery (Himanis)
6.9.3 Teatro del Siglo de Oro (TSO)
6.9.4 Large Bentham Dataset (BEN4)
6.9.5 Carabela
6.9.6 Finnish Court Records (FCR)
6.10 Segmentation-free Evaluation
6.10.1 ICDAR2015 Competition on Handwriting KWS
6.10.2 ICFHR2014 Competition on QbE Handwriting KWS
6.11 Summary
References

7 Probabilistic Interpretation of Traditional KWS Approaches
7.1 On the Spotting Versus Recognition Debate
7.2 Distance-based Methods
7.2.1 Simplifying QbE RP for Word-segmented Image Regions
7.2.2 Distance-based Density Estimation
7.2.3 Interpretation of Distance-based KWS: Empirical Results
7.3 PHOC-based Methods
7.3.1 Predicting the PHOC of a Word Image Region: PHOCNet
7.3.2 PHOC-based QbE KWS
7.3.3 Probabilistic PHOCNet
7.3.4 PHOCNet Probabilistic Interpretation: Empirical Results
7.3.5 Summary of Results of Distance- and PHOC-based Methods
7.4 HMM-Filler
7.4.1 HMM-Filler Probabilistic Interpretation: Experiments
7.4.2 Fast HMM-Filler Computation using Character Lattices
7.5 BLSTM-CTC KWS
7.5.1 BLSTM-CTC KWS Interpretation: Experimental Validation
References

8 Probabilistic Indexing Search Extensions
8.1 Multi-Word Boolean and Word-Sequence Queries
8.1.1 Experiments
8.2 Searching for Music Symbol Sequences
8.2.1 Experiments
8.3 Structured Queries for Information Retrieval in Table Images
8.3.1 Experiments
8.4 Searching for Hyphenated Words
8.4.1 Experiments
8.5 Approximate-Spelling and Wildcard Queries
8.5.1 Approximate-Spelling
8.5.2 Wildcard Spelling
8.5.3 Experiments
References

9 Beyond Search Applications of Probabilistic Indexing
9.1 Text Analytics Using PrIx
9.2 Estimating Word and Document Frequencies from PrIxs
9.3 Zipf Curves, Running Words and Lexicon Size
9.3.1 Estimating Running Words and Lexicon Size: Results
9.4 Statistical Information Extraction from Text Images
9.4.1 Indexing Semantically Tagged Words and Named Entities
9.4.2 Statistical Information Extraction from Handwritten Forms
9.5 Classification of Large Untranscribed Image Documents
9.5.1 Plaintext Document Classification
9.5.2 Estimating Text Features from Image PrIxs
9.5.3 Image Document Classification
9.5.4 Open Set Classification
9.5.5 Experiments
9.5.6 Image Document Classification Concluding Remarks
References

10 Large-scale Systems and Applications
10.1 Conceptual System Organization and Workflow
10.1.1 PrIx Components
10.1.2 Spots Database
10.1.3 PrIx Search Engine and User Interface
10.2 Architecture Design
10.2.1 Web and Data Servers
10.2.2 PrIx Server and Search Engine
10.2.3 Web Client
10.3 Large-Scale Applications
10.3.1 Trésor des Chartes (Chancery)
10.3.2 Teatro del Siglo de Oro (TSO)
10.3.3 Bentham Papers (Bentham)
10.3.4 Parcels from Indias and Cádiz Archives (Carabela)
10.3.5 Finnish Court Records (FCR)
10.3.6 General Discussion on Large-Scale Applications
References

11 Conclusion and Outlook
11.1 Contribution Summary
11.2 Future Work
References

A The Probability Ranking Principle
A.1 Ranking Multiple Relevant Images
A.2 Evaluation Measures and Optimality
A.2.1 Precision-at-$k$
A.2.2 Recall-at-$k$
A.2.3 Average Precision (AP)
A.2.4 Discounted Cumulative Gain (DCG)
A.2.5 Normalized Discounted Cumulative Gain (NDCG)
A.3 Global and Mean Measures
References

B Weighted Finite State Transducers (WFST)
B.1 Introduction
B.2 Description
B.3 WFST Operations
B.3.1 Composition
B.3.2 Shortest Path and Distance
B.4 Determinization
References

C Text Image Document Collections and Datasets
C.1 IAM
C.2 The Bentham Papers Collection and Datasets
C.2.1 ICFHR-2014 Competition on HTR (BEN1)
C.2.2 ICFHR-2014 Competition on KWS (BEN2)
C.2.3 ICDAR-2015 Competition on KWS (BEN3)
C.2.4 Large Bentham Dataset used in [28] (BEN4)
C.3 George Washington (GW)
C.4 Parzival (PAR)
C.5 Plantas (PLA)
C.6 Passau Parish Records (PAS)
C.7 Trésor des Chartes and Chancery (CHA)
C.8 Spanish Golden Age Theater (TSO)
C.9 Parcels from Indias and Cádiz Archives: Carabela (CAR)
C.10 Finnish Court Record (FCR)
C.11 The Vorau-253 Sheet Music Manuscript and Dataset
C.12 A Dataset for Multi-page Handwritten Deeds Classification
References

Acronyms

Lists of Abbreviations

ANN     Artificial Neural Network
AP      (raw) Average Precision
BB      Bounding Box
BLSTM   Bi-directional Long-Short Term Memory
BOW     Bag of Words
CBIDC   Content Based Image Document Classification
CER     Character Error Rate
CNN     Convolutional Neural Network
CL      Character Lattice
CRNN    Convolutional-Recurrent Neural Network
CSC     Closed Set Classification
CTC     Connectionist Temporal Classification
DAG     Directed Acyclic Graph
DC      Document Classification
DCG     Discounted Cumulative Gain
DFA     Deterministic Finite Automaton
DL      Deep Learning
DT      Decision Theory
FFN     Feed Forward Network
FSA     Finite State Automaton or Automata
FST     Finite State Transducer
gAP     global Average Precision
GMM     Gaussian Mixture Model
GT      Ground Truth
HMM     Hidden Markov Model
HTR     Handwritten Text Recognition
IG      Information Gain
IR      Information Retrieval
KWS     Keyword Spotting
LM      Language Model
LSTM    Long-Short Term Memory
mAP     mean Average Precision
MDLSTM  Multidimensional Long-Short Term Memory
MLE     Maximum Likelihood Estimation
ML      Machine Learning
MLP     Multi-Layer Perceptron
NDCG    Normalized Discounted Cumulative Gain
OCR     Optical Character Recognition
OOV     Out of Vocabulary
OSC     Open Set Classification
PHOC    Pyramid of Histograms of Characters
PR      Pattern Recognition
PRP     Probability Ranking Principle
PrIx    Probabilistic Indexing (or Index)
PrIxs   Probabilistic Indexes
QbE     Query by Example
QbS     Query by String
RNN     Recurrent Neural Network
R–P     Recall–Precision
RP      Relevance Probability
WFSA    Weighted Finite State Automaton or Automata
WFST    Weighted Finite State Transducer
WER     Word Error Rate
WG      Word (or character) Graph or lattice

Mathematical Notation

Common notation rules adopted across all the Chapters

Symbol(s)    Description
$a, b, \ldots, \alpha, \beta, \ldots$    Scalar variables, sequences, etc.
$A, B, \ldots$    Random variables (sometimes also scalar constants or variables)
$\mathcal{A}, \mathcal{B}, \ldots$    Sets
$\boldsymbol{a}, \boldsymbol{b}, \ldots$    Vectors
$\boldsymbol{a}^T$    Transpose of a vector
$(a, b)^T$    Transpose: vertical vector with components $a$ and $b$
$\boldsymbol{A}, \boldsymbol{B}, \ldots$    Matrices or tensors
$[\boldsymbol{A}]_{i,j}$    Element of a matrix (or tensor) at row $i$ and column $j$
$a_1, \ldots, a_T$, $a_{1:T}$    A sequence of length $T$
$s_1, \ldots, s_N$    A list of $N$ sequences
$|\mathcal{A}|, |a|$    Size of the set $\mathcal{A}$, length of the sequence $a$
$\overset{\mathrm{def}}{=}$    Used in equations to define a symbol or function
$\overset{\star}{=}$    The equality holds under some assumption
$a \leftarrow b$    In algorithms, the value of $b$ is assigned to the variable $a$
$z \in \mathcal{A}$    $z$ belongs to the set $\mathcal{A}$
$v \in w$    $v$ is one of the elements of the sequence $w \equiv w_{1:M}$; i.e., $\exists k : w_k = v$
$u \subset w$    $u \equiv u_{1:K}$ is a substring of the sequence $w \equiv w_{1:M}$, $K \le M$
$z \sqsubseteq Z$    The image region $z$ is within a larger image region $Z$
$P(\cdots)$    Probability mass function
$p(\cdots)$    Probability density function
$P(\cdots; \theta)$    Parametric probability mass function with parameter $\theta$
$p(\cdots; \theta)$    Parametric probability density function with parameter $\theta$
$P_Z(\cdots), p_Z(\cdots)$    Probabilities estimated, or computed, using $Z$ (training set, or WG)
$\mathbb{E}[\cdots]$    Expected value of an expression
$\mathbb{E}[\cdots \mid \cdots]$    Expected value of an expression conditioned on another
$\mathbb{E}_z[\cdots]$    Expected value of a (conditioned) expression with respect to $z$

Meaning of mathematical symbols and expressions frequently used across Chapters

Symbol    Typical usage
$x$    Image region
$\boldsymbol{b}$    A word or character bounding box or segment within a line image
$v$    A word (or its character sequence); a query word in most cases
$w$    A text; i.e., a sequence of words or characters
$c$    A single character. Also, a sequence of CTC character-level labels
$R$    Binary random variable for the relevance probability (RP)
$P(R \mid x, v)$    $\equiv P(R = 1 \mid X = x, Q = v)$: RP of $x$ for the query word $v$
$\tau$    RP threshold
$e$    An edge of a graph (usually a WFST), or of a path of a WFST
$\phi$    A path (sequence of edges) of a graph (WFST)
$\omega(\cdot)$    The weight of an edge $e$, or of a path $\phi$ (WFST)
$l_i(\cdot)$    The input token(s) of an edge $e$ (or of a path $\phi$) (WFST)
$l_o(\cdot)$    The output token(s) of an edge $e$ (or of a path $\phi$) (WFST)
$l(\cdot)$    The token(s) of an edge $e$ (or of a path $\phi$) (WFSA)
$p(\cdot)$    Departing (or "previous") state of an edge $e$, or of a path $\phi$ (WFST)
$n(\cdot)$    Ending (or "next") state of an edge $e$, or of a path $\phi$ (WFST)
$a(q)$    Alignment (horizontal coordinate) associated with a state $q$ (WG)
$\mathcal{F}$    Set of final states of a WFST or a WFSA
$\mathcal{P}(q)$    Set of edges leading to (or predecessors of) the state $q$ (WFST)
$\mathcal{N}(q)$    Set of edges departing from (or next to) the state $q$ (WFST)
$\mathscr{N}(\cdot, \cdot)$    Nearest Neighbor(s) Function

List of Algorithms

5.1 Compute a PrIx from Pixel-level Posteriorgram
5.2 Compute a PrIx for a word lattice of an image region
5.3 Compute a PrIx for a word lattice of an image region (opt1)
5.4 Compute a PrIx for a word lattice of an image region (opt2)
5.5 Compute a segment PrIx from a word lattice of an image region
5.6 Disambiguate word lattice states
5.7 Compute an ordinal position PrIx from a word lattice
5.8 Expansion of subpaths formed by labels of the same class in a WFSA
5.9 Edge alignment encoding as part of the output labels of a WFST
5.10 Disambiguate the input symbols of the states in a WFST
5.11 Convert same-class subpaths into complete paths
5.12 Keep only word alignments from a WFST yielded by Alg. 5.11
5.13 Compute a word index for text segments based on character lattices
5.14 Disambiguate character lattice states
5.15 Edge word count encoding in a lattice as the output labels of a WFST
5.16 Compute a PrIx for text ordinal positions based on a CL
7.1 Fast HMM-Filler Computation using CL

List of Figures

1.1 Illustration of a probabilistic index of a text image from the Bentham Papers collection
1.2 Retrieval operation scheme
1.3 Illustration of R–P curves and gAP results for three typical IR systems working on text images
1.4 Evaluating PrIx performance is based on user-produced GT reference transcripts.
2.1 Accurate word segmentation is not always possible to perform
2.2 Textual context is a useful aid to identify words in handwritten text
2.3 Different types of objects to retrieve after a user query
2.4 Illustration of the different query paradigms used for KWS

3.1 Example of marginalization bounding boxes
3.2 Example of 2-D posteriorgrams for a text image and keyword
3.3 Correlation between exact and approximate RPs obtained for line regions and words of the Bentham test set
3.4 Example of 1-D posteriorgram using a contextual recognizer
3.5 An example of text image with two likely transcripts
3.6 Example of relevant image horizontal coordinates for a given query
3.7 Heat map representing the RP of the image columns
3.8 Example showing that multiple instances of the same keyword may have overlapping segments
3.9 Example of the relationship between position-independent and position-dependent RPs
3.10 Diagram of the relationship among the different RPs for a fixed text image and a keyword
3.11 PrIx workflow
4.1 Page images from different collections used for HTR and KWS experiments

4.2 Examples of skewed and slanted text lines
4.3 An example of a HMM
4.4 Example of the HMM alignment for the letter “a”
4.5 Diagram of the mathematical model of an artificial neuron and an illustration of a multilayer neural network
4.6 Diagram of a two dimensional convolution operation
4.7 Compact unrolled representation of a simple recurrent layer
4.8 Unit diagrams of a simple recurrent layer and a LSTM layer
4.9 Example of the probabilistic interpretation that the CTC makes of the output of a CRNN, applied to a text image
4.10 Coordinates in one-dimensional and two-dimensional input signals of a multidimensional recurrent layer
4.11 Comparison of the outputs from a 2D-LSTM layer and a CNN layer, both trained for a HTR task
4.12 Diagram of the CRNN architecture proposed and used to model the transcripts of a handwritten text line
4.13 An example of WFST
4.14 Example of WFST composition to model the full transcription posterior $P(w \mid x)$
4.15 Example of a lattice represented as a WFST and an equivalent compact version
4.16 Example of edge-posterior graph normalization
4.17 Example of sentence-posterior graph normalization

5.1 Minimal Deterministic Automaton accepting all sequences containing a particular word
5.2 Example of disambiguation of the word position associated to the states of a lattice
5.3 Illustrative example for Alg. 5.8 application
5.4 Illustrative example for Alg. 5.13 applied on a small WFSA
5.4bis Illustrative example for Alg. 5.13 applied on a small WFSA (cont.)
5.5 Example of RP computation for a given regular expression

6.1 mAP and gAP evolution for increasing word $n$-gram order on the IAM dataset
6.2 mAP and gAP evolution for increasing lexicon size, on the IAM dataset
6.3 Lexicon-free mAP and gAP evolution for increasing order of character $n$-gram, on the IAM dataset
6.4 AP evolution with respect to the number of indexed segments per line, on the IAM dataset
6.5 Lexicon-free mAP and gAP evolution on the IAM validation-set with varying optical and prior scales $(\gamma, \eta)$, using a character 8-gram
6.6 Lexicon-based mAP and gAP evolution on the IAM validation-set with varying optical and prior scales $(\gamma, \eta)$, using a word 3-gram

6.7 Lexicon-free mAP and gAP evolution on the GW validation-set with varying optical and prior scales $(\gamma, \eta)$, using a character 6-gram
6.8 AP evolution with respect to the number of lines used to train the CRNN, on the IAM dataset
6.9 Correlation between AP measures and Recognition Error Rates, for the IAM test set
6.10 mAP and gAP evolution for increasing order of the character $n$-gram, on the GW dataset
6.11 AP evolution for increasing order of the character $n$-gram, on the Parzival dataset
6.12 R–P curves of HMMs, CRNNs and 1-best HTR transcripts, on the Bentham and Plantas datasets
6.13 gAP, mAP and R–P curves for six real indexing datasets: TSO, Bentham, FCR, Chancery, Passau and Carabela collections
6.14 Example of spotted results in the ICFHR2014 H-KWS Competition
7.1 An instance of the multi-variance problem of traditional distance-based KWS and in a unit-normed space
7.2 An instance of the multi-mode problem of traditional distance-based KWS
7.3 Work-flow diagram of the feature extraction process used to obtain Zoning Aggregated Hypercolumn features
7.4 gAP and mAP evolution for different values of the sharpness hyperparameter
7.5 PHOC example representation and percentage of words that share the same PHOC representation in IAM and GW datasets
7.6 “Filler” and “keyword” model schemes used in the HMM-Filler approach
7.7 Example of CL and posteriorgram produced by the decoding of line image rendering the text “to be for”
8.1 Scatter plots and histograms of lower-upper bounds of Boolean AND/OR word pair combinations
8.2 R–P curve for Vorau-253 dataset
8.3 Example of geometric reasoning for the column-wise multi-word structured query: “⟨ NAMEN VERSTORBENEN, WOLF ⟩”.
8.4 R–P curve for Passau dataset
8.5 Image region from a document of the FCR collection
8.6 Examples of hyphenated words with different hyphenation symbols
9.1 Plaintext Zipf curve of Bentham test-set GT transcripts
9.2 Comparing the Zipf curves and lexicon sizes computed from Bentham’s GT transcripts and from corresponding PrIxs
9.3 Zipf curves of PAS, TSO, CAR and FCR computed for their GT transcripts and estimated using their PrIxs

9.4 Example of PrIx with tagged entries
9.5 Statistics of “Reason to travel” and “Jobs”
9.6 Real example of visa information statistics estimated using PrIxs
9.7 Age histogram for persons registered in the visa record collection, estimated using the PrIxs of the visa images
9.8 Leaving-one-out classification error rate on two books with three threshold-less MLP models
10.1 Diagram of a PrIx-based Search and Retrieval System
10.2 Diagram of PrIx Components
10.3 Diagram of the PrIx Ingestion component
10.4 Diagram of PrIx search engine and user interface
10.5 Client–server architecture used by the large-scale demonstrators
10.6 Representation of the hierarchical index used in the large-scale demonstrators
10.7 Search engine response time as a function of the number of occurrences retrieved
10.8 Web client graphical interface at root of the hierarchy
10.9 Web client at box (book) level of the hierarchy
10.10 Web client at page level of the hierarchy
10.11 A view of the Chancery PrIx demonstrator showing a result for a bilingual Boolean and word-sequence query
10.12 A view of the TSO PrIx demonstrator showing a result for a Boolean proximity-AND query
10.13 Results for a query using wildcard
10.14 Two views of the FCR PrIx demonstrator showing results for a given query
10.15 PrIx-estimated Zipf curves of the complete collections Chancery, Carabela, TSO, FCR and Bentham
B.1 An illustrative example of a WFST
B.2 Example of the composition of two WFSTs
B.3 Example of a cyclic WFST which admits applying the shortest distance algorithm both in the Tropical and Log semirings
B.4 Original WFST and two determinized versions operated on the Tropical and Real semirings
C.1 Examples of a few pages and segmented lines extracted from the IAM dataset
C.2 Examples of Bentham Papers page images
C.3 Examples of pages and segmented lines from the BEN1 Bentham dataset
C.4 Examples of the query images used in the ICFHR-2014 Competition on KWS

C.5 Examples of page images extracted from the George Washington dataset (GW)
C.6 Examples of a few normalized and binarized line images used from the line-level partition of the George Washington database
C.7 Examples of pages and a couple of segmented (binarized and normalized) text lines from the Parzival dataset
C.8 Examples of page images and a close-up of some text lines from the Plantas dataset (PLA)
C.9 Examples of page images of the Passau dataset (PAS)
C.10 Examples of page images from the Chancery dataset (CHA)
C.11 Examples of TSO images
C.12 Random examples of Carabela page images
C.13 Examples of important difficulties exhibited by Carabela images
C.14 Examples of Finnish Court Records (FCR) dataset images and a close-up
C.15 Two FCR image examples showing bounding boxes of HwFs
C.16 Examples of page images from VORAU-253 sheet music manuscript
C.17 Geometry-wise annotation of music symbols
C.18 Page image examples from JMBD 4949 and JMBD 4950 books
C.19 Example of a four-page deed of class Risk (RI) from the JMBD-4949 book

List of Tables

1.1 Loss matrix for the binary decision problem involved in KWS
1.2 Comparison between HTR and PrIx
3.1 Likely segments where the word “all” is written in Fig. 3.7
3.2 Example highlighting the differences between the segment RPs
3.3 Example highlighting the differences between ordinal RPs
3.4 Example of expected segment boundaries computed from ordinal position RPs
3.5 Example of expected ordinal positions computed from segment RPs
4.1 Operations in Real and Tropical semirings used in WFSTs
5.1 Example of RP computation for a multi-word query
6.1 Main dataset features of: IAM, PAR, GW, BEN1, BEN2, BEN3 and PLA
6.2 Architecture of the CRNN used in the IAM experiments
6.3 Architecture of the CRNN used in the BEN1 experiments
6.4 Interpolated gAP obtained on BEN1 dataset for various posteriorgram-based approximations to the RP
6.5 Lexicon-based/free APs on the IAM dataset for different RPs evaluated in a line-level setting
6.6 Time needed by different algorithms to build the PrIx on the IAM dataset
6.7 Evolution of the total index size and indexing time for increasing maximum number of spots per line, on the IAM dataset
6.8 Architecture of the CRNN used in the GW experiments
6.9 Architecture of the CRNN used in the Parzival experiments
6.10 AP results achieved by different QbS, line-level KWS approaches on the IAM, GW and PAR datasets

6.11 Comparison of gAP results between using HMM and CRNN for PrIxs on the BEN1 and PLA datasets
6.12 gAP and indexing density for PLA and BEN1, using HMM and RNN
6.13 Main dataset features of: TSO, BEN4, FCR, PAS and CAR
6.14 Architecture of the CRNN used in the BEN4 experiments
6.15 Architecture of the CRNN used in the Passau experiments
6.16 Architecture of the CRNN used in the TSO experiments
6.17 Architecture of the CRNN used in the Carabela experiments
6.18 Comparison of various systems in the ICDAR2015 Competition on KWS
6.19 Architecture of the CRNN used in ICFHR2014 Handwriting KWS Competition
6.20 Comparison of multiple systems and measures in the ICFHR2014 H-KWS Competition
7.1 mAP and gAP on the GW dataset for different ranking strategies based on the features extracted in [31]
7.2 Comparison of word-segmentation-based QbE KWS performance achieved by different PHOC-based methods on the GW dataset
7.3 Summary of the KWS results on the word-segmented GW dataset for different approaches
7.4 Line-level gAP results on IAM using HMM-GMM, and different character 𝑛-grams and approximations to the RP
7.5 Preprocessing and average query times and total indexing times, for classical and CL-based HMM-Filler KWS on the IAM dataset
7.6 Comparative gAP results between BLSTM-CTC KWS and the probabilistic interpretation
8.1 gAP for single-word and Boolean AND and OR queries on the BEN1 dataset
8.2 Architecture of the CRNN used for optical modeling of sheet music staves
8.3 AP figures obtained using Plain and HyW PrIxs for the query sets AllWords and MaybeHyph on the FCR test set
8.4 Exact, approximate- and wildcard-spelling retrieval gAP and mAP achieved for the two query sets Q1 and Q2
9.1 Running words and vocabulary sizes estimated from different PrIxs of datasets
9.2 Several statistics of JMBD 4949 and JMBD 4950 books
9.3 Classification error rate of threshold-less methods
9.4 OSC classification + rejection error rate using PrIx and 2 048 words
9.5 Rejection performance for bMLP-2 OSC with PrIx and 2 048 words

10.1 Public URLs to access the PrIx demonstrator for the manuscript series “Trésor des Chartes” and corresponding statistics
10.2 Public URL of the “Teatro del Siglo de Oro” PrIx demonstrator and corresponding statistics
10.3 Public URL of the Bentham demonstrator and corresponding PrIx statistics
10.4 Public URL of the Carabela PrIx demonstrator and corresponding statistics
10.5 Public URL of the FCR PrIx demonstrator and corresponding statistics
10.6 Summary of main features of the PrIx demonstrators
A.1 Example illustrating the calculation of the Global and Mean Average Precision
B.1 Common semirings used in WFSTs
C.1 Statistics of the IAM dataset and partition used in the experiments
C.2 Statistics of the queries used in the IAM dataset
C.3 Statistics of the external corpora used in IAM experiments
C.4 Statistics of the Bentham BEN1 dataset, as proposed in ICFHR-2014 HTRtS and used in our experiments
C.5 Queries for PrIx experiments on the BEN1 Bentham dataset
C.6 Statistics of the query pools generated for page-level multi-word experiments on the BEN1 Bentham dataset
C.7 Statistics of the Bentham partition from the ICFHR-2014 Competition on KWS as used in our experiments
C.8 Keywords of the ICFHR-2014 KWS competition
C.9 Statistics of the Bentham partition from the ICDAR-2015 Competition on KWS
C.10 Basic statistics of the relatively large BEN4 Bentham dataset
C.11 Statistics of the George Washington dataset (GW)
C.12 Statistics of the queries used in the George Washington dataset (GW)
C.13 Statistics of the query set used on the word-level experiments with the George Washington dataset (GW)
C.14 Statistics of the Parzival (PAR) partition used in the experiments
C.15 Statistics of the query set used in the Parzival database (PAR)
C.16 Statistics of the partition of Plantas dataset
C.17 Statistics of the query set used in the test partition of the Plantas database
C.18 Main forms of character transliteration applied to the PAS dataset
C.19 Statistics of the Passau experimental dataset (PAS)
C.20 Basic statistics and partition of the Chancery dataset (CHA)

C.21 Basic statistics of the TSO experimental datasets and the corresponding external text corpus
C.22 Basic statistics of the batches used to train and test the Carabela statistical models
C.23 Basic statistics of the Finnish Court Records (FCR) experimental dataset
C.24 Statistics of the hyphenated lines and hyphenated word (prefix or suffix) fragments of the FCR dataset
C.25 Statistics of the Vorau-253 dataset and partitions used in the experiments
C.26 Number of documents and page images for JMBD 4949 and JMBD 4950: per class, per document & class, and totals

Chapter 1

Introduction

Abstract After an introduction to the motivation and background of the topics considered in this book, the present chapter reviews the most important concepts of the scientific areas upon which the rest of the book stands. This includes Information Retrieval (IR), Pattern Recognition (PR) and Statistical Decision Theory. The chapter also includes a complete account of standard IR metrics, used in this book to assess the performance of the proposed Probabilistic Indexing models and methods. We provide the general equations, as well as algorithmic details that are seldom found in the literature. Probabilistic Indexing is closely related to Handwritten Text Recognition, which is also reviewed here, along with a comparison of the most important similarities and differences between both technologies.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. H. Toselli et al., Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images, The Information Retrieval Series 49, https://doi.org/10.1007/978-3-031-55389-9_1

1.1 Motivation and Background

For thousands of years, humankind used handwriting to preserve and share knowledge. The invention of the printing press by Johannes Gutenberg (circa 1439) allowed an incredible acceleration in the distribution of information, and made it possible for segments of the population that had never had access to it before to start gaining it [23]. In the current digital era, with the usage of computers and digital formats, information can be stored in a cheaper and more convenient way than ever before. In addition, any person around the world with access to a computer with an Internet connection can retrieve any piece of information, even if it is stored anywhere else on the globe. This has the potential to bring about a true democratization of human knowledge, a process that started with the invention of the printing press in the 15th century.

As a matter of fact, digitization works carried out in the last decades by archives and libraries worldwide have produced massive quantities of high-resolution images of historical manuscripts and early printed documents [15, 9, 19, 25]. Billions of text images have been produced through these efforts, and this is only a minuscule part of the amount of manuscripts which are still waiting to be digitized. The aim of manuscript digitization is not only to improve preservation, but also to make the written documents easily accessible to interested scholars and the general public. However, access to the real wealth of these images, namely, their textual contents, remains elusive and there is a fast growing interest in automated methods which allow the users to search for relevant textual information contained in handwritten text images.

In order to use classical plain-text indexing and search Information Retrieval (IR) methods [22, 1, 16, 43], a first step would be to convert the handwritten text images into digital text. But the image collections for which text indexing is highly in demand are so large that the cost of manually transcribing these images is entirely prohibitive, even by means of crowd-sourcing approaches [5, 32]. An obvious alternative to manual transcription is to rely on automatic Handwritten Text Recognition (HTR) [2, 42, 34]. The development of HTR started during the 1950s and 1960s and, in the last 50 years, the field of HTR has improved significantly. However, current state-of-the-art HTR systems achieve good transcription results only if perfect layout, line detection and reading order are taken for granted, as is generally the case in published results. But real historical scanned documents prove elusive even to the most advanced HTR technologies and, despite the great recent advances in the field [14, 30, 26], fully automatic transcripts of the kind of historical images of interest lack the accuracy required to enable useful plain-text indexing and search. Another possibility would be to use computer-assisted transcription methods [29], but so far these methods cannot provide the huge human-effort reductions needed to render semiautomatic transcription of large image collections feasible [35].
HTR accuracy becomes low on real historical text images for many reasons, including unpredictable, erratic layouts, lines with uneven interline spacing and highly variable skew, ambiguous, inconsistent or capricious reading order of layout elements, etc. All or most of these difficulties boil down to the intrinsic uncertainty which underlies the interpretation (by machines and humans alike) of image strokes into actual textual elements or glyphs, and of how to combine these glyphs into characters, words, paragraphs and, in the end, whole textual documents. Interestingly, most or all of these problems disappear or become much less severe if, rather than achieving accurate word-by-word image transcripts, the goal is to determine how likely it is that a given word is or is not written in some indexable image region, such as a text line, a text block or paragraph, or just a full page image.

This goal statement places our textual IR problem close to the field known as Keyword Spotting (KWS), which emerged as an alternative to HTR in the 1990s. A comprehensive survey on KWS approaches for text images¹ can be found in [13] and Chapters 2 and 7 of this book provide detailed insights about the most relevant of these approaches.

¹ Among the works cited in this survey, it is worth noting that many recent developments are inspired, in one form or another, by earlier KWS works in the field of automatic speech recognition (ASR), such as [8, 6, 17, 7, 4, 18, 33]. This is also the case of the work presented in this book.


Generally speaking, KWS aims at determining locations on a text image or image collection which are likely to contain instances of the query words, without explicitly transcribing the image(s). This is also the aim of a framework we have developed during the last decade, called Probabilistic Indexing (PrIx) [37, 27, 36, 40], which is the main topic of the present book. PrIx explicitly adopts the IR point of view to develop search and retrieval methods for untranscribed handwritten text images. However, in contrast with traditional KWS, rather than focusing on searching for specific, given “keywords”, all the likely locations of all the words which are deemed likely keyword candidates are holistically determined and indexed, along with the corresponding likelihoods.

As we will see, HTR and PrIx can advantageously share statistical models and training methods. However, it is important to realize that HTR and PrIx are fundamentally different problems, even if both may rely on identical probability distributions and models. The HTR decision rule attempts to obtain the best sequence of words or characters (transcript) for a given image. Therefore the result epitomizes just the mode of the distribution; once a transcript has been obtained, the distribution itself can be safely discarded. In contrast, PrIx decisions are delayed to the query phase and, for each decision, (an approximation to) the full distribution is used. In other words, rather than aiming at single, unique interpretations of the textual elements of an image (as in HTR), PrIx explicitly embraces the interpretation uncertainty mentioned above, so that a proper interpretation can be disambiguated using the required additional knowledge as and when it becomes available. This explains why proper KWS and PrIx can always achieve better overall search results than those provided by naive KWS based on plain HTR transcripts.
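The contrast between keeping only the mode of a distribution and keeping (an approximation to) the whole distribution can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four transcript hypotheses and their probabilities are invented, and the word score is computed by simply adding up the probability of every hypothesis that contains the word at least once, which is one natural reading of "relevance" rather than the estimation methods developed later in this book.

```python
from collections import defaultdict

# Hypothetical posterior over transcripts of one text-line image.
# The hypotheses and probabilities are invented; they sum to 1.
transcript_posterior = {
    "it matters some": 0.40,
    "if matters some": 0.25,
    "it matter sooner": 0.20,
    "it matters sooner": 0.15,
}


def one_best(posterior):
    """HTR-style decision: keep only the mode of the distribution."""
    return max(posterior, key=posterior.get)


def word_relevance(posterior):
    """PrIx-style summary: for each word, add up the probability of
    every transcript hypothesis that contains it at least once."""
    scores = defaultdict(float)
    for transcript, prob in posterior.items():
        for word in set(transcript.split()):
            scores[word] += prob
    return dict(scores)


best = one_best(transcript_posterior)
scores = word_relevance(transcript_posterior)

print("1-best transcript:", best)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.2f}")
```

In this toy case the word "sooner" does not appear in the 1-best transcript, so an index built from that transcript can never retrieve it, whereas the distribution-based summary still assigns it a non-negligible score that a query can act upon.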
An indexing and search system can be evaluated by measuring its precision and recall performance for a given (large) set of keywords. Precision is high if most of the retrieved results are correct, while recall is high if most of the existing correct results are retrieved. In the case of naive indexing, based on automatic HTR transcripts, precision and recall are fixed numbers, which are closely correlated with the accuracy of the recognized transcripts. In contrast, for a PrIx system based on the likelihood that a keyword is written in an image region, arbitrary precision-recall tradeoffs can be achieved by setting a threshold to decide whether the likelihood is high enough or not. We refer to this flexible search and retrieval framework as the “Precision-Recall Tradeoff Model”. Under this model, it becomes even clearer that proper KWS and PrIx have the opportunity of achieving better results than naive KWS based on HTR transcripts, as previously discussed.

Contributions of this book to the state of the art in handwritten image indexing and search include the following. First, a sound probabilistic framework is presented which helps to understand the relations between PrIx and other classical, possibly non-probabilistic statements of KWS, and provides probabilistic interpretations of many of these approaches. Second, the development of this framework makes it clear that word recognition implicitly or explicitly underlies any proper formulation of KWS, and suggests that the same statistical models and training methods successfully used for HTR can advantageously be used for PrIx as well. Third, experiments carried out using this approach on datasets traditionally used for KWS benchmarking yield results significantly better than those previously published for these datasets. Fourth, PrIx results on new, larger handwritten text image datasets are reported, showing the great potential of the methods proposed in this book for accurate indexing and textual search in large collections of handwritten documents. Fifth, extensions and applications of PrIx are described that go far beyond basic textual information retrieval; these include text analytics, search for information in structured documents, and textual-content-based document classification.
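The Precision-Recall Tradeoff Model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the scored (keyword, region) pairs and the set of truly relevant pairs are invented, and returning a precision of 1.0 when nothing is retrieved is a convention chosen here, not something prescribed by the text.

```python
def precision_recall(scored_hits, relevant, threshold):
    """Precision and recall obtained by retrieving every pair whose
    score reaches the threshold.

    scored_hits: dict mapping (keyword, region) -> relevance score.
    relevant:    set of (keyword, region) pairs that are truly relevant.
    """
    retrieved = {pair for pair, p in scored_hits.items() if p >= threshold}
    true_pos = len(retrieved & relevant)
    # Convention (assumption): an empty result set has precision 1.0.
    precision = true_pos / len(retrieved) if retrieved else 1.0
    recall = true_pos / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical scores and ground truth, invented for illustration.
scored_hits = {
    ("it", "line1"): 0.98,
    ("if", "line1"): 0.06,
    ("matters", "line1"): 0.95,
    ("matter", "line1"): 0.30,
    ("some", "line2"): 0.60,
    ("sooner", "line2"): 0.40,
}
relevant = {("it", "line1"), ("matters", "line1"), ("sooner", "line2")}

for threshold in (0.9, 0.5, 0.35, 0.0):
    p, r = precision_recall(scored_hits, relevant, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold retrieves more pairs, which tends to raise recall at the expense of precision. An index built from a single transcript offers only one such operating point.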

1.2 Information Retrieval

As previously commented, the framework proposed in this book mainly adopts the point of view of Information Retrieval, which aims to build systems that enable users to meet their information needs by finding relevant information among large collections of generally unstructured documents [22]. The academic field of IR originated almost in parallel with computers, back in the 1940s and 1950s, mainly in order to organize the catalogs of companies and libraries [31]. With the popularization and spread of the Internet, the field gained significant importance and has been the central business of various companies (such as Yahoo, Google or the now-defunct Altavista or Lycos).

The term information needs is very fuzzy and can be ambiguous in many cases. For instance, when a user searches for “Paris” while she is planning her holiday trip, she probably wishes to find nice hotels, the best flight fares and interesting attractions. However, when some local Frenchwoman searches for “Paris” on the Internet, she is probably expecting to find other types of information (e.g. the web address of the City Hall, hospitals, etc.). The task of the IR system is to provide the users with relevant answers for their information need.

When we place traditional KWS systems and literature under the IR goggles, the definition of relevant information may seem trivial: a document (i.e. full page, text line, or a specific page location) is relevant if, and only if, it contains an instance of the queried keyword. However, in this case, other sources of ambiguity arise: in particular, the textual content of the documents is unknown. This contrasts with the above web search examples, where the very definition of relevance is ambiguous, but the content of the documents is not. Probability is the standard way of measuring different sources of uncertainty. Thus, the approaches presented in this book can be seen as instances of probabilistic IR methods [22], applied to manage the uncertainty underlying the textual information conveyed by images of handwritten (or printed) documents.

Finally, we want to emphasize that building indexes from text images, as well as interpreting KWS as an instance of Information Retrieval, was among the foundational goals of KWS [20, 21]. However, the indexes that we build are more closely related to the ones used by traditional search engines, which makes them very easy to use and integrate with existing IR systems.


In order to better illustrate our IR-oriented indexing goals, Fig. 1.1 shows an example of one of the probabilistic KWS indexes that we intend to build, for a given (segment of a) page containing handwritten text.

[Figure: text-line image region with pixel-coordinate axes, shown above the following index entries]

Term      Probability   Likely location (bounding box, image ID)
2         0.929           1   36   20   31   71.21.2
21        0.064           1   36   24   31   71.21.2
It        0.982          33   36   27   31   71.21.2
If        0.012          33   36   26   31   71.21.2
matters   0.998          76   35  104   31   71.21.2
matter    0.011          77   36   93   31   71.21.2
···
some      0.832         570  198   78   31   71.21.2
soner     0.016         576  198   83   31   71.21.2

Fig. 1.1: Illustration of a probabilistic index of a text image (071 021 002, from the Bentham Papers collection – http://www.prhlt.upv.es/htr/bentham). Each entry contains an indexed term (word), its relevance probability and its likely location (bounding box and image ID) in the collection.

In this example, the index contains the list of candidate words likely written in the image, each with its location (bounding box) and the location relevance probability. For example, the system finds it very likely that the words “It”, “matters” and “some” are written in the corresponding locations, although it is less certain about the latter and even less so about other alternatives such as “matter”, “If”, “sooner”, etc. At the end of this book, the reader will be able to understand the approaches and algorithms used to build this kind of index and the theory supporting them.
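To make the structure of such an index more concrete, the following minimal sketch stores the entries of Fig. 1.1, as we read them from the figure, and answers single-word queries with a probability threshold. The class name `Spot`, the functions `build_index` and `search`, and the in-memory dictionary layout are all hypothetical choices made for this illustration; they are not the index format used by the systems described in this book. The four location numbers are copied as listed in the figure, without further interpretation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    """One index entry: a term hypothesis with its relevance probability."""
    term: str
    prob: float
    box: tuple      # the four location numbers, copied as listed in Fig. 1.1
    image_id: str


# Entries as read from Fig. 1.1.
SPOTS = [
    Spot("2", 0.929, (1, 36, 20, 31), "71.21.2"),
    Spot("21", 0.064, (1, 36, 24, 31), "71.21.2"),
    Spot("It", 0.982, (33, 36, 27, 31), "71.21.2"),
    Spot("If", 0.012, (33, 36, 26, 31), "71.21.2"),
    Spot("matters", 0.998, (76, 35, 104, 31), "71.21.2"),
    Spot("matter", 0.011, (77, 36, 93, 31), "71.21.2"),
    Spot("some", 0.832, (570, 198, 78, 31), "71.21.2"),
    Spot("soner", 0.016, (576, 198, 83, 31), "71.21.2"),
]


def build_index(spots):
    """Group spots by term, most probable first (hypothetical layout)."""
    index = defaultdict(list)
    for spot in spots:
        index[spot.term].append(spot)
    for hits in index.values():
        hits.sort(key=lambda s: s.prob, reverse=True)
    return dict(index)


def search(index, term, min_prob=0.0):
    """Return the spots of `term` whose probability reaches `min_prob`."""
    return [s for s in index.get(term, []) if s.prob >= min_prob]


index = build_index(SPOTS)
confident = search(index, "matters", min_prob=0.5)
unlikely = search(index, "If", min_prob=0.5)

print(confident)
print(unlikely)
```

With a threshold of 0.5 the query "matters" returns its single high-probability spot, while "If" returns nothing. Lowering the threshold would bring the low-probability alternative back into the result list.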

1.3 Pattern Recognition

As discussed in the previous section, PrIx can be interpreted as a form of IR where the content of the documents is uncertain. Recall that we are dealing with scanned document images containing text. Thus, under a probabilistic formulation, we can only guess which text is likely to actually be written in such images. Pattern Recognition (PR) is a time-honored discipline that deals with the inherent uncertainty of recognition of patterns from arbitrary data [10, 11, 3]. Machine Learning (ML) is one of the fundamental pillars of PR, even though some modern works on ML may forget its origins. As its title suggests, this book focuses on probabilistic (or statistical) PR models. In fact, although many PR methods or tasks may not involve the explicit computation of any probability or distribution, all of them can be properly interpreted in these terms. Specifically, PR uses statistical methods to (1) learn the parameters of the models from given training data, and (2) decide (or predict) optimal actions or outcomes from new (“evaluation” or “test”) data.

Over the years, PR has proven to be a highly effective approach to finding general rules to solve problems from examples. In practice, once we have trained our model from data, we need to check that it actually generalizes well to previously unseen data (i.e., that it produces correct answers for such data). This is an essential step to ensure that our methods are not just memorizing the data supplied during training. Different algorithms and models have different generalization properties. The number of model parameters, the dimensionality of the data, the independence assumptions taken into consideration, etc. are just a few aspects that affect the ability of a PR method to be successful [11, 3, 24]. Both ML and PR ultimately rely on Statistical Decision Theory to develop their learning and decision methods.

1.4 Decision Theory

Decision Theory (DT) provides the reasoning underlying systems that have to decide, under uncertainty, optimal strategies with respect to some utility or loss function. In PR, training methods have to decide the optimal set of parameters according to some optimization criterion (e.g., maximizing a likelihood function, or minimizing a squared error). Similarly, in the operational (evaluation or test) phase, the system has to decide the optimal values of the output variables, according to some criterion (e.g., minimizing the expected classification error). DT can also be advantageously invoked for IR problems. Given a query and some arbitrary object to retrieve, the system needs to decide whether or not that object is relevant for the query (and should therefore be retrieved). In Appendix A, DT is invoked to prove that the PrIx framework proposed and developed in this book is optimal for a broad set of criteria typically adopted in KWS and many other IR applications.

In its simplest form, DT considers a binary classification problem where the two classes are the two possible answers to a yes/no question. As applied specifically to our text-image IR task, the following general question is considered, regarding a given query word 𝑣 and a certain image or image region 𝑥: “Is 𝑣 written in 𝑥?” Associated with this question there is another one which might appear more complex: “What are the locations (if any) of word 𝑣 within 𝑥?”. However, this can often be answered as a byproduct of solving the main question. The probabilistic framework proposed and developed in this book deals with these questions.


First, to model the decision probability associated with the above binary classification problem we need a binary random variable which, following common notation in the IR field, will be denoted 𝑅 (after “relevant”). This entails a slight reformulation of the original question as: “is the image 𝑥 relevant for the query 𝑣?”, considering that 𝑥 is relevant for 𝑣 if at least one instance of 𝑣 is rendered in 𝑥. Second, we propose another random variable 𝑋 over the set of image regions. A value of 𝑋 (i.e., an arbitrary image region) will be denoted as 𝑥. At this point we do not need to consider the possible sizes or shapes of image regions (a page, a paragraph, a line, a word-sized bounding box, etc.) and, until we need to be more specific, we will simply use the term “image” for a value of 𝑋. Finally, we introduce the random variable 𝑄 over the set of all possible user queries. An arbitrary value of 𝑄 would generally be denoted as 𝑞. The proposed framework properly admits arbitrary types of queries: from single words, to boolean word combinations [36], or even “example image patches”, as in “query by example” KWS [39] (see also Chapters 2 and 8). However, to keep the presentation simple, we start by considering only conventional string search, where queries are individual keywords. Therefore, from now on, a generic value of 𝑄 will be denoted as 𝑣. We can now introduce the relevance probability (RP) distribution:²

$$ P(R = 1 \mid X = x, Q = v) \;\equiv\; P(R \mid x, v) \tag{1.1} $$

which denotes the probability that 𝑥 is relevant for the keyword 𝑣. Since 𝑅 is binary, the relevance probability can be obviously interpreted as the statistical expectation that 𝑣 is written in 𝑥. On the other hand, by definition, 𝑃(𝑅 | 𝑥, 𝑣) is the posterior probability underlying the following 2-class classification problem: Given 𝑣, classify each image 𝑥 into one of two classes:

• 1 : 𝑣 is (one of the words) written (somewhere) in 𝑥
• 0 : 𝑣 does not appear in 𝑥
(1.2)

Using a loss matrix 𝜆 to weight the cost of each 0/1 decision, the resulting decision theoretic Bayes’ or minimum expected risk rule amounts to classifying 𝑥 into the class 1 (“yes”) iff [10]:

$$ P(R \mid x, v) > \tau, \qquad \tau = \frac{\lambda_{10} - \lambda_{00}}{\lambda_{01} - \lambda_{11} + \lambda_{10} - \lambda_{00}} \tag{1.3} $$

Table 1.1 shows in detail the meaning of each component of the 𝜆 matrix. According to Eq. (1.3), in the two-class case, this matrix reduces to just a single scalar threshold 𝜏. Under the precision-recall tradeoff model, this is exactly the threshold to be adjusted in order to achieve the required tradeoffs. In the following chapters of this book we explain how to compute relevance distributions for given images.

² To simplify notation, from now on we will generally write 𝑃(𝑅 ···) and 𝑃(𝑎 ···), rather than 𝑃(𝑅 = 1 ···) and 𝑃(𝐴 = 𝑎 ···), respectively, except when the full notation helps enhance clarity and/or avoid ambiguity.

Table 1.1: IR loss matrix. For instance, 𝜆10 is the loss incurred by classifying as “Not relevant” an object which actually is “Relevant”.

| Truth \ Decision | Not relevant | Relevant |
| Not relevant     | 𝜆00          | 𝜆01      |
| Relevant         | 𝜆10          | 𝜆11      |
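As an illustration of the minimum expected risk rule, the following minimal Python sketch compares the expected risk of the two decisions directly from a loss matrix. It is only a sketch under stated assumptions: the function names and the example loss values are hypothetical, and the index order `loss[truth][decision]` is an assumption taken from the caption of Table 1.1 (if the indices are read in the opposite order, the two off-diagonal entries swap roles).

```python
# Assumption: loss[t][d] is the loss of taking decision d when the truth is t,
# with 0 = "not relevant" and 1 = "relevant" (index order as in the caption of Table 1.1).
LOSS = [[0.0, 1.0],   # truth: not relevant -> (correct rejection, false alarm)
        [3.0, 0.0]]   # truth: relevant     -> (miss, correct detection)


def expected_risks(p_relevant, loss):
    """Return the expected risk of deciding 'not relevant' and 'relevant'."""
    p = p_relevant
    risk_no = (1.0 - p) * loss[0][0] + p * loss[1][0]
    risk_yes = (1.0 - p) * loss[0][1] + p * loss[1][1]
    return risk_no, risk_yes


def decide(p_relevant, loss):
    """Minimum expected risk decision: 1 ('relevant') or 0 ('not relevant')."""
    risk_no, risk_yes = expected_risks(p_relevant, loss)
    return 1 if risk_yes < risk_no else 0


def break_even(loss):
    """Relevance probability at which both decisions have the same expected risk."""
    false_alarm_excess = loss[0][1] - loss[0][0]
    miss_excess = loss[1][0] - loss[1][1]
    return false_alarm_excess / (false_alarm_excess + miss_excess)
```

With these hypothetical losses a miss costs three times as much as a false alarm, so `break_even(LOSS)` is 0.25: a region with relevance probability 0.3 is retrieved, while one with 0.2 is not. The break-even value plays the role of the scalar threshold 𝜏, which is why the whole loss matrix collapses into a single number in the two-class case.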

1.5 Handwritten Text Recognition

As already noted, perhaps the simplest approach for textual IR in text images is to first use HTR to transcribe the images and then use off-the-shelf IR tools on the noisy automatic transcripts. Informally speaking, HTR aims to provide automatic ways to transcribe digitized text images into a symbolic format that would allow modern treatment of textual matters such as editing, indexing, and retrieval. For a formal treatment, HTR is stated as the following PR problem: Given an image region 𝑥, obtain a word sequence $\hat{w}$ such that:

$$ \hat{w} = \arg\max_{w} P(w \mid x) \tag{1.4} $$

From a statistical DT point of view, the underlying loss function is $\lambda_{w w'} = 0$ iff $w = w'$ and, therefore, Eq. (1.4) is a minimum expected risk rule which minimizes the statistical expectation of whole transcription error. As will be discussed in detail in Chapter 4, modern approaches to HTR are based on optical models, which deal with how image strokes can be interpreted in terms of text elements or glyphs, such as characters, and language models, which account for how the text elements can be combined to form words and sentences. Both types of models are learned from training examples.

The optimization of Eq. (1.4) has proven to be a computationally hard problem and, actually, no algorithm exists which solves it exactly. However, good approximations are available to solve a similar (albeit apparently more difficult) problem in which $\hat{w}$ is obtained from a pair $(\hat{w}, \hat{a})$ of a word sequence and an alignment which maximize the joint posterior $P(w, a \mid x)$. Formally:

$$ (\hat{w}, \hat{a}) = \arg\max_{w,\,a} P(w, a \mid x) \tag{1.5} $$

An alignment $a = a_1, \ldots, a_n, a_{n+1}$ of a word sequence $w = w_1, \ldots, w_n$ with a (line) image region $x$ is a sequence of (horizontal) coordinates of $x$ which determine how the image is segmented into the words of $w$. There are various algorithms which provide exact or approximate solutions to this optimization problem. The best known, and the most effective and efficient, is the Viterbi algorithm, which provides an exact solution to Eq. (1.5), as well as very fast approximations by means of an accelerating technique known as Viterbi beam search [29].


According to DT, the ultimate goal of Eq. (1.4) (and Eq. (1.5) alike) is to achieve a whole-sentence HTR transcription error rate as low as possible. But, in practice, HTR results are typically evaluated using more fine-grained measures. The most popular metrics are Word Error Rate (WER) and Character Error Rate (CER), which are defined as the number of insertion, deletion and substitution word/character errors divided by the number of words/characters in the reference transcript. WER and CER are considered very adequate to assess HTR systems where text-line image regions are the basic recognition units. However, for recently proposed end-to-end HTR approaches, where text line detection and recognition are done simultaneously, other evaluation measures are required (such as those based on Bag-of-Words and the Hungarian algorithm proposed in [41]) which are more informative about the types of errors; that is, whether they are due to reading order issues or to recognition itself.
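The WER and CER definitions above can be made concrete with a short Python sketch based on the standard Levenshtein (edit) distance. The function names and the example strings are illustrative assumptions, and the sketch assumes a non-empty reference transcript and whitespace-separated words.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))           # distances from the empty reference prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]                             # hypothesis prefix is empty: i deletions
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,           # deletion of ref_item
                            curr[j - 1] + 1,       # insertion of hyp_item
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate; assumes a non-empty reference with whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate; spaces count as characters in this sketch."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For instance, with the hypothetical reference "It matters some" and the hypothesis "If matter some", two of the three words are wrong, so `wer` returns 2/3, whereas only two of the fifteen reference characters are affected and `cer` returns 2/15.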

1.6 Assessing Indexing and Search Performance

The indexing and retrieval effectiveness of IR systems is generally assessed using the standard measures of recall and interpolated precision [22]. In the PrIx framework, for a given query and relevance threshold, recall is the ratio of relevant image regions (lines) correctly retrieved by the system (often called hits) to the total number of relevant regions existing in the image test set. Precision, on the other hand, is the ratio of hits to the total number of (correctly or incorrectly) retrieved regions. Precision is high if most of the retrieved results are correct, while recall is high if most of the existing correct results are retrieved. The adequacy of these and other related measures to evaluate PrIx-based systems is thoroughly discussed in Appendix A. In this section we review the most important concepts and equations, along with the algorithmic details needed to compute these metrics efficiently.

Let $\mathcal{Q}$ be a set of (word) queries and $\tau$ be a relevance threshold. The recall, $\rho(q, \tau)$, and the raw (non-interpolated) precision, $\pi(q, \tau)$, for a given query $q \in \mathcal{Q}$ are defined as:

$$ \rho(q, \tau) = \frac{h(q, \tau)}{r(q)}, \qquad \pi(q, \tau) = \frac{h(q, \tau)}{d(q, \tau)} \tag{1.6} $$

Here $r(q)$ is the number of test image regions which are relevant for $q$, according to the ground-truth (GT), $d(q, \tau)$ is the number of regions retrieved or detected by the system with relevance threshold $\tau$, and $h(q, \tau)$ is the number of detected regions which are actually relevant (also called hits). See Fig. 1.2.

[Figure 1.2: set diagram showing the collection $\mathcal{C}$, the detected set $\mathcal{D}(q, \tau)$, the relevant set $\mathcal{R}(q)$ and the set of hits $\mathcal{H}(q, \tau)$.]

Fig. 1.2: Retrieval operation scheme. $\mathcal{C}$ is the whole collection or set of elements considered. For a given query, $q$, $\mathcal{R}(q)$ is the set of $r(q)$ elements that, according to the reference GT, are relevant to $q$. For a certain relevance threshold $\tau$, $\mathcal{D}(q, \tau)$ is the set of $d(q, \tau)$ elements detected by the system. Finally, $\mathcal{H}(q, \tau) = \mathcal{R}(q) \cap \mathcal{D}(q, \tau)$ is the set of correctly detected elements or hits, with $|\mathcal{H}(q, \tau)| = h(q, \tau)$.

The interrelated trade-off between recall and precision can be conveniently displayed as the so-called recall-precision (R–P) curve, $\pi_q(\rho)$ [12]. Any IR system should allow users to (more or less explicitly) regulate $\tau$ in order to choose the R–P operating point which is most appropriate for each query. Good systems should achieve both high precision and high recall for a wide range of values of $\tau$. A commonly accepted scalar measure which meets this intuition is the area under the R–P curve, here denoted as $\bar{\pi}_q$ and called (raw) average precision (AP) [44, 28]. In addition, to consider all the queries in $\mathcal{Q}$, the (raw) mean average precision (mAP, denoted as $\bar{\pi}$) is used:

$$ \bar{\pi}_q = \int_0^1 \pi_q(\rho)\, d\rho, \qquad \bar{\pi} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \bar{\pi}_q \quad \text{(mAP)} \tag{1.7} $$
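As a small illustration of Eq. (1.6), the Python sketch below computes recall and raw precision for a single query at a given threshold. The function name, the data layout (a list of relevance probability and GT flag pairs) and the example values are assumptions made only for this illustration; a region is taken as detected when its RP exceeds 𝜏.

```python
def recall_precision(scored_regions, tau):
    """Recall and raw precision for ONE query at relevance threshold tau.

    scored_regions: list of (rp, is_relevant) pairs, one per image region, where rp is
    the relevance probability and is_relevant is the ground-truth flag.
    A value of None marks an undefined measure (r(q) = 0 or d(q, tau) = 0).
    """
    r = sum(1 for _, relevant in scored_regions if relevant)            # r(q)
    detected = [(rp, rel) for rp, rel in scored_regions if rp > tau]
    d = len(detected)                                                   # d(q, tau)
    h = sum(1 for _, relevant in detected if relevant)                  # h(q, tau)
    recall = h / r if r else None
    precision = h / d if d else None
    return recall, precision


# Hypothetical relevance probabilities and GT flags for five image regions.
REGIONS = [(0.95, True), (0.80, False), (0.60, True), (0.30, True), (0.10, False)]
```

Sweeping the threshold over these hypothetical regions shows the trade-off: at 𝜏 = 0.5 both recall and precision are 2/3, while at 𝜏 = 0.9 precision rises to 1 and recall drops to 1/3. At 𝜏 = 0.99 nothing is detected, so the raw precision is undefined, one of the extreme cases that motivate interpolated precision.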

Obviously, the mAP is undefined if $\exists\, q \in \mathcal{Q}$ for which $\bar{\pi}_q$ is undefined, which happens if $r(q) = 0$, that is, if no test-set image region is relevant for $q$. On the other hand, Eq. (1.7) equally weights all the queries, thereby ignoring the different amounts of relevant regions for different queries. To circumvent both of these issues, a global averaging scheme can be adopted by computing the total number of test image regions which are relevant for all $q \in \mathcal{Q}$, the total number of regions detected with relevance threshold $\tau$ and the total number of hits, respectively, as:

$$ r = \sum_{q \in \mathcal{Q}} r(q), \qquad d(\tau) = \sum_{q \in \mathcal{Q}} d(q, \tau), \qquad h(\tau) = \sum_{q \in \mathcal{Q}} h(q, \tau) \tag{1.8} $$

Then the overall recall and raw precision, and the (often preferred) global average precision, $\bar{\pi}$ (referred to as gAP or simply as AP), are defined as:

$$ \rho(\tau) = \frac{h(\tau)}{r}, \qquad \pi(\tau) = \frac{h(\tau)}{d(\tau)}, \qquad \bar{\pi} = \int_0^1 \pi(\rho)\, d\rho \quad \text{(gAP)} \tag{1.9} $$

The integral in Eq. (1.9) must be computed numerically. For large datasets and many queries in $\mathcal{Q}$, the cost of this computation can become excessive: as large as $O(N^2)$, where $N = |\mathcal{Q} \times \mathcal{C}|$. However, it can be greatly accelerated as follows [44, 38]. First, for each $q \in \mathcal{Q}$ and each $x \in \mathcal{C}$, compute the RP $P(R \mid x, q)$ (Eq. 1.1) and sort the list of $N$ pairs $(x, q)$ in decreasing order of RP. Let $(x_k, q_k)$ be the $k$-th pair of this list and let $d_k \equiv k$ and $h_k$ be, respectively, the number of regions detected and the number of hits accumulated up to the $k$-th entry of the sorted list. Finally, let $\pi_k \equiv h_k / k$ be the precision up to the $k$-th entry in the sorted list. Then, using the simplest rectangular (Riemann) numerical integration, the gAP is given by:

$$ \bar{\pi} = \sum_{k=1}^{N} \pi_k\, (\Delta\rho)_k = \frac{1}{r} \sum_{k=1}^{N} \pi_k\, g(k) \tag{1.10} $$

where $g(k) \in \{0, 1\}$ is the true relevance (given by the GT) of $x_k$ with respect to $q_k$. The values of $\pi_k$ can be easily computed recursively as:

$$ \pi_0 \overset{\text{def}}{=} 1, \qquad \pi_k = \frac{1}{k}\bigl((k-1)\,\pi_{k-1} + g(k)\bigr), \quad 1 \le k \le N \tag{1.11} $$

The cost of this computation is now dominated by the $O(N \log N)$ complexity of the sort step. This cost remains unchanged if the often preferred trapezoidal numerical integration is used. In this case, Eq. (1.10) becomes:

$$ \bar{\pi} = \frac{1}{r} \sum_{k=1}^{N} \frac{\pi_{k-1} + \pi_k}{2}\, g(k) \tag{1.12} $$
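The sorted-list computation of Eqs. (1.10) to (1.12) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name, the `(rp, g)` data layout and the example values are hypothetical, and entries with equal RP are simply left in their input order.

```python
def global_average_precision(entries, trapezoidal=False):
    """Raw gAP from (rp, g) pairs, one per (region, query) pair, with g in {0, 1}.

    Rectangular integration follows Eq. (1.10); trapezoidal follows Eq. (1.12).
    Tied RP values keep their input order here (no tie handling).
    """
    ranked = sorted(entries, key=lambda entry: entry[0], reverse=True)  # O(N log N)
    r = sum(g for _, g in ranked)          # total number of relevant pairs
    if r == 0:
        raise ValueError("gAP is undefined when no pair is relevant")
    total = 0.0
    pi_prev = 1.0                          # pi_0 = 1 by definition, Eq. (1.11)
    for k, (_, g) in enumerate(ranked, start=1):
        pi_k = ((k - 1) * pi_prev + g) / k             # recurrence of Eq. (1.11)
        if g:                                          # only hits contribute to the sum
            total += (pi_prev + pi_k) / 2 if trapezoidal else pi_k
        pi_prev = pi_k
    return total / r


# Hypothetical sorted list whose GT flags are 1, 0, 1, 1, 0.
EXAMPLE = [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 1), (0.5, 0)]
```

For the hypothetical list above, the precisions after each entry are 1, 1/2, 2/3, 3/4 and 3/5, so the rectangular rule gives (1 + 2/3 + 3/4)/3 = 29/36 and the trapezoidal rule gives 55/72. After sorting, a single linear pass suffices, which is what keeps the overall cost at the level of the sort step.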

Even with the global averaging scheme, raw precision can still be ill-defined in some extreme cases and, moreover, raw R–P curves can present an undesired distinctive saw-tooth shape [12]. Both of these issues are avoided by the so-called interpolated precision, defined as:

$$ \pi'(\rho) = \max_{\rho' :\, \rho' \ge \rho} \pi(\rho') \tag{1.13} $$

Intuitive arguments in favor of $\pi'(\rho)$, which is often adopted in the literature, are discussed in [22]. The same interpolation scheme can be applied to a single query $q$, resulting in the interpolated R–P curve $\pi'_q(\rho)$. Then, the interpolated versions of mAP and gAP are straightforwardly defined using $\pi'_q(\rho)$ and $\pi'(\rho)$, rather than $\pi_q(\rho)$ and $\pi(\rho)$, in Eqs. (1.7) and (1.9), respectively.

The adoption of the interpolated precision $\pi'$ instead of the raw precision $\pi$ prevents directly using Eqs. (1.10) and (1.12) for efficient computation. In this case, it becomes necessary to explicitly compute the precision up to the $k$-th entry, $\pi_k$, together with the corresponding recall, during a first, forward pass over the sorted list. Then the list is visited backwards, applying Eq. (1.13) to compute the interpolated precision up to the $k$-th entry, denoted $\pi'_k$. Finally, the interpolated versions of Eqs. (1.10) and (1.12) are computed by simply replacing $\pi_k$ with $\pi'_k$ in these equations.

Eqs. (1.10), (1.11) and (1.12) all rely on the values of $k$ that are assigned to each pair $(x, q)$ by sorting these pairs according to their RP, $P(R \mid x, q)$. So, a practical problem arises as to what to do in case of RP ties. This problem can be very relevant even for moderately large data and query sets, and more so for high RP values. Specifically, most of the highest RP values tend to be 1.0.³ Clearly, some of these duplicated high-RP entries can be non-relevant (i.e., $g(k) = 0$). Therefore, depending on exactly where these entries are placed in the sorted list, the resulting gAP values can be significantly higher or lower.

To fairly circumvent this problem, ties can be avoided by sorting with duplicate removal. But then, several changes are needed to Eqs. (1.10)–(1.12). To start with, the upper limit $N$ of the sums in Eqs. (1.10) and (1.12) and the span of the recurrence (1.11) is now $N' \le N$, and $g(k) \in \{0, 1\}$ is now $g'(k') \in \mathbb{N}$, redefined as the total number of GT 1's in the $k'$-th collapsed block of entries with the same RP value. Finally, Eq. (1.11) becomes:

$$ \pi_0 \overset{\text{def}}{=} 1, \qquad \pi_{k'} = \frac{d_{k'-1}\, \pi_{k'-1} + g'(k')}{d_{k'-1} + m(k')}, \quad 1 \le k' \le N' \tag{1.14} $$

where $m(k')$ is the number of duplicated entries collapsed in the $k'$-th block and, as before, $d_{k'}$ is the accumulated number of detected regions up to the $k'$-th entry of the sorted list. Note that now $d_{k'}$ is no longer just equal to $k'$, but it can be simply computed incrementally as: $d_0 \overset{\text{def}}{=} 0$, $d_{k'} = d_{k'-1} + m(k')$, $1 \le k' \le N'$.
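The tie-collapsing recurrence of Eq. (1.14) can be sketched as follows, using rectangular integration as in Eq. (1.10). The function name, the `(rp, g)` layout and the example values are assumptions made for illustration; the rounding to six decimal digits follows the suggestion in footnote 3, and `itertools.groupby` is used to collapse consecutive equal RP values after sorting.

```python
from itertools import groupby


def gap_with_ties(entries, ndigits=6):
    """Raw gAP with RP ties collapsed into blocks, following Eq. (1.14).

    entries: (rp, g) pairs with g in {0, 1}; RP values are rounded to `ndigits`
    decimal digits before grouping.
    """
    ranked = sorted(((round(rp, ndigits), g) for rp, g in entries),
                    key=lambda entry: entry[0], reverse=True)
    r = sum(g for _, g in ranked)
    if r == 0:
        raise ValueError("gAP is undefined when no pair is relevant")
    total = 0.0
    d_prev = 0            # d_0 = 0: regions detected before the first block
    pi_prev = 1.0         # pi_0 = 1 by definition
    for _, block in groupby(ranked, key=lambda entry: entry[0]):
        flags = [g for _, g in block]
        m = len(flags)                    # m(k'): entries collapsed in this block
        g_block = sum(flags)              # g'(k'): GT 1's in this block
        pi_block = (d_prev * pi_prev + g_block) / (d_prev + m)   # Eq. (1.14)
        total += pi_block * g_block       # block-wise version of Eq. (1.10)
        d_prev += m                       # d_k' = d_(k'-1) + m(k')
        pi_prev = pi_block
    return total / r


# Hypothetical entries: three of them share the top relevance probability 1.0.
TIED = [(1.0, 1), (1.0, 0), (1.0, 1), (0.7, 1), (0.2, 0)]
```

For these hypothetical entries the first block has three members and two hits (precision 2/3), the second block raises the precision to 3/4, and the result is (2 · 2/3 + 3/4)/3 = 25/36. Because each block is treated as a whole, the value no longer depends on the arbitrary order in which tied entries happen to appear.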

The use of interpolated precision is particularly necessary for fair evaluation of IR results of the naive 1-best KWS approach, where RP can only be 1 or 0, independently of 𝜏. Therefore, in a raw R–P curve, just one R–P point, (𝜌0, 𝜋0), could be defined and the resulting raw gAP would be 0, disallowing comparison with other approaches. In contrast, the interpolated precision curve becomes 𝜋′(𝜌) = 𝜋0 if 0 ≤ 𝜌 ≤ 𝜌0, 𝜋′(𝜌) = 0 otherwise, with a resulting interpolated gAP: 𝜋′ = 𝜋0 · 𝜌0.

Fig. 1.3 illustrates the computation of an R–P curve and the corresponding interpolated gAP values for a typical IR system working on text images. If perfectly correct text were indexed, we would get a single, “ideal” point with 𝜌0 = 𝜋0 = 1 and gAP = 1.0. If automatic (typically noisy) HTR transcripts are naively indexed just as plaintext, precision and recall are also fixed values, albeit not “ideal” (perhaps something like 𝜌0 = 0.75, 𝜋0 = 0.8, with gAP = 0.6). In contrast, PrIx allows for arbitrary precision-recall tradeoffs by setting a threshold on the system confidence (relevance probability). This flexible precision-recall tradeoff model obviously allows for better search and retrieval performance than naive plaintext searching on automatic noisy transcripts.

It is worth recalling that precision/recall assessment requires GT reference data, denoted 𝑔(·) in the above equations. In our case, such GT can be straightforwardly derived from the same GT needed to evaluate HTR performance; namely, the reference transcripts of a selected test set of images. Fig. 1.4 illustrates the complete assessment process and the user involvement in this process.
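The forward and backward passes needed for interpolated gAP, and the 1-best special case discussed above, can be sketched as follows. The function name and data layout are assumptions made for illustration: `entries` holds only the retrieved pairs, `r` is the total number of relevant pairs according to the GT (so that relevant pairs which are never retrieved still count against recall), tied RP values are collapsed into blocks, and rectangular integration is used.

```python
from itertools import groupby


def interpolated_gap(entries, r):
    """Interpolated gAP (rectangular integration) from retrieved (rp, g) pairs.

    entries: retrieved pairs only, g in {0, 1}.
    r: total number of relevant pairs in the ground truth (assumed > 0).
    """
    ranked = sorted(entries, key=lambda entry: entry[0], reverse=True)
    # Forward pass: one (precision, recall increment) point per block of tied RPs.
    points = []
    detected = hits = 0
    for _, block in groupby(ranked, key=lambda entry: entry[0]):
        flags = [g for _, g in block]
        detected += len(flags)
        hits += sum(flags)
        points.append((hits / detected, sum(flags) / r))
    # Backward pass: Eq. (1.13) is a running maximum taken from the high-recall end.
    total = 0.0
    best = 0.0
    for precision, delta_recall in reversed(points):
        best = max(best, precision)
        total += best * delta_recall
    return total


# Hypothetical 1-best index: 15 retrieved pairs, all with RP = 1, of which 12 are hits;
# the GT contains 16 relevant pairs, so recall is 0.75 and precision is 0.8.
ONE_BEST = [(1.0, 1)] * 12 + [(1.0, 0)] * 3
```

With the hypothetical 1-best index above, the single R–P point is (0.75, 0.8) and the function returns their product, 0.6, matching the closed form 𝜋0 · 𝜌0. For a list with distinct RP values, such as GT flags 1, 0, 1, 1, 0 with 𝑟 = 3, the running maximum lifts the saw-tooth dips and yields 5/6, slightly above the raw rectangular value of 29/36.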

³ Up to a reasonable rounding of RP to, say, 6 decimal digits.

[Figure 1.3: plot of Precision (π) against Recall (ρ), both axes ranging from 0 to 1. Curves: Perfect (gAP=1.0), HTR Transcript (gAP=0.6), PrIx (gAP=0.8). Annotations: high relevance threshold τ (top of the plot), low relevance threshold τ (bottom of the plot).]

Fig. 1.3: Illustration of R–P curves and gAP results for three typical IR systems working on text images: an ideal system based on perfect transcripts, another based on plain indexing of HTR transcripts and a PrIx system.

[Figure 1.4: diagram with the following labelled elements: Selected test text images; Probabilistic Indexing (PrIx); Page Image PrIxs; Reference Transcripts; Evaluation; Precision/Recall and Average Precision results.]

Fig. 1.4: Evaluating PrIx performance is based on user-produced GT reference transcripts.


1.7 Handwritten Text Recognition and Probabilistic Indexing

As previously discussed, PrIx and HTR are related technologies which can advantageously share concepts and models. In fact, it can be empirically seen that when PrIx and HTR share the same models, the conventional evaluation measures of PrIx and HTR (gAP and WER) correlate well (see Sec. 6.6 of Chapter 6). Nevertheless, it is important to highlight that HTR and PrIx are fundamentally different technologies, with substantially different application goals. Table 1.2 summarizes the most important similarities and differences between these technologies.

Table 1.2: Probabilistic indexes are not transcripts.

| Automatic Transcription (HTR) | Probabilistic Indexing (PrIx) |
| Generally comes after Layout Analysis and Reading Order determination | Is generally agnostic to Layout structure and Reading Order |
| Typically needs carefully detected lines | Line detection helps, but only if accurate |
| The output is a best, unique (frail!) text interpretation of the given image according to the models used | For the same models, the output is a robust probability distribution of words with their positions in the images |
| The output is aimed to be in reading order (but this is seldom achieved) | In general, Probabilistic Indexing is reading-order agnostic |
| Provides plaintext output. If accuracy is high, it can be directly used in many applications | In its basic form, does not provide any text output; only images marked with word-sized bounding boxes |
| Usually yields only fixed and comparatively low precision-recall performance for the given trained models | Allows flexible, user-controlled precision-recall tradeoffs and search performance is generally much better for the same trained models |

References

1. Bache, R., Ballie, M., Crestani, F.: The likelihood property in general retrieval operations. Information Sciences 234, 97–111 (2013)
2. Bazzi, I., Schwartz, R., Makhoul, J.: An Omnifont Open-Vocabulary OCR System for English and Arabic. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(6), 495–504 (1999)
3. Bishop, C.: Pattern Recognition and Machine Learning. Springer-Verlag New York (2006)
4. Can, D., Saraclar, M.: Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech, and Language Processing 19(8), 2338–2347 (2011)
5. Causer, T., Wallace, V.: Building a volunteer community: results and findings from Transcribe Bentham. Digital Humanities Quarterly 6(2) (2012)
6. Chelba, C., Silva, J., Acero, A.: Soft indexing of speech content for search in spoken documents. Computer Speech & Language 21(3), 458–478 (2007)
7. Chia, T.K., Sim, K.C., Li, H., Ng, H.T.: Statistical lattice-based spoken document retrieval. ACM Trans. Inf. Syst. 28(1), 2:1–2:30 (2010)
8. Christiansen, R., Rushforth, C.: Detecting and locating key words in continuous speech using linear predictive coding. IEEE Trans. on ASSP 25(5), 361–367 (1977)
9. D’Orazio, D.: Oxford and Vatican libraries to digitize 1.5 million pages of ancient texts. The Verge (2012). URL https://www.theverge.com/2012/4/15/2950260/oxford-vatican-libraries-digitize-1-5-million-pages-ancient-texts. Visited on 17-Jun-2018
10. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. J. Wiley and Sons (1973)
11. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, New York, NY, USA (2000)
12. Egghe, L.: The measures precision, recall, fallout and miss as a function of the number of retrieved documents and their mutual interrelations. Inf. Proces. & Management 44(2), 856–876 (2008)
13. Giotis, A.P., Sfikas, G., Gatos, B., Nikou, C.: A survey of document image word spotting techniques. Pattern Recognition 68, 310–332 (2017)
14. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. on PAMI 31(5), 855–868 (2009)
15. Jimenez, C.: British Library books go digital. BBC News (2007). URL http://news.bbc.co.uk/2/hi/technology/7018210.stm. Visited on 17-Jun-2018
16. Joung, Y.J., Yang, L.W.: On character-based index schemes for complex wildcard search in peer-to-peer networks. Information Sciences 272, 209–222 (2014)
17. Kohler, J., Larson, M., Jong de, F., Kraaij, W., R.J.F, O. (eds.): Searching Spontaneous conversational speech workshop, proc. of the ACM SIGIR workshop of the 31th Annual International SIGIR conference. Centre for Telematics and Inf. Tech., Enschede, The Netherlands (2008)
18. Lee, L.s., Glass, J., Lee, H.y., Chan, C.a.: Spoken content retrieval - beyond cascading speech recognition with text retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(9), 1389–1420 (2015)
19. Madrigal, A.C.: Norway Decided to Digitize All the Norwegian Books. The Atlantic (2013). URL https://www.theatlantic.com/technology/archive/2013/12/norway-decided-to-digitize-all-the-norwegian-books/282008/. Visited on 16-Jun-2018
20. Manmatha, R., Han, C., Riseman, E.M.: Word spotting: a new approach to indexing handwriting. In: Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 631–637 (1996). DOI 10.1109/CVPR.1996.517139
21. Manmatha, R., Han, C., Riseman, E.M., Croft, W.B.: Indexing Handwriting Using Word Matching. In: Proceedings of the First ACM International Conference on Digital Libraries, DL ’96, pp. 151–159. ACM, New York, NY, USA (1996). DOI 10.1145/226931.226960
22. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
23. McLuhan, M.: The Gutenberg Galaxy: The Making of Typographic Man. University of Toronto Press (1962)
24. Murphy, K.P.: Machine learning: a probabilistic perspective. The MIT Press, Cambridge, MA (2012)
25. Paniagua, E.: Así se digitaliza la Biblioteca Nacional de España. El País (2018). URL https://retina.elpais.com/retina/2018/01/30/innovacion/1517327412_515602.html. Visited on 16-Jun-2018
26. Puigcerver, J.: Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition? In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 67–72 (2017). DOI 10.1109/ICDAR.2017.20
27. Puigcerver, J.: A probabilistic formulation of keyword spotting. Ph.D. thesis, Universitat Politècnica de València (2018)
28. Robertson, S.: A new interpretation of average precision. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR, pp. 689–690. ACM, New York, NY, USA (2008)
29. Romero, V., Toselli, A.H., Vidal, E.: Multimodal Interactive Handwritten Text Transcription. Perception and Artif. Intell. (MPAI). World Scientific (2012)
30. Sánchez, J.A., Romero, V., Toselli, A.H., Vidal, E.: ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset. In: Proc. of 15th ICFHR, pp. 630–635 (2016)
31. Sanderson, M., Croft, W.B.: The History of Information Retrieval Research. Proceedings of the IEEE 100(Special Centennial Issue), 1444–1451 (2012). DOI 10.1109/JPROC.2012.2189916
32. Shuttleworth, S.: Old weather: Citizen scientists in the 19th and 21st centuries. Science Museum Group Journal 3(3) (2016)
33. Tabibian, S., Akbari, A., Nasersharif, B.: Discriminative keyword spotting using triphones information and n-best search. Information Sciences 423, 157–171 (2018)
34. Toselli, A.H., Juan, A., González, J., Salvador, I., Vidal, E., Casacuberta, F., Keysers, D., Ney, H.: Integrated handwriting recognition and interpretation using finite-state models. Int. Journal of Pattern Recognition and Artificial Intelligence 18(04), 519–539 (2004)
35. Toselli, A.H., Leiva, L.A., Bordes-Cabrera, I., Hernández-Tornero, C., Bosch, V., Vidal, E.: Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription. Digital Scholarship in the Humanities 33(1), 173–202 (2018). DOI 10.1093/llc/fqw064. URL http://dx.doi.org/10.1093/llc/fqw064
36. Toselli, A.H., Vidal, E., Puigcerver, J., Noya-García, E.: Probabilistic multi-word spotting in handwritten text images. Pattern Analysis and Applications (2019)
37. Toselli, A.H., Vidal, E., Romero, V., Frinken, V.: HMM Word-Graph Based Keyword Spotting in Handwritten Document Images. Information Sciences 370(C), 497–518 (2016). DOI 10.1016/j.ins.2016.07.063
38. Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 11–18 (2006)
39. Vidal, E., Toselli, A.H., Puigcerver, J.: High performance Query-by-Example keyword spotting using Query-by-String techniques. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 741–745 (2015). DOI 10.1109/ICDAR.2015.7333860
40. Vidal, E., Toselli, A.H., Puigcerver, J.: Lexicon-based probabilistic indexing of handwritten text images. Neural Computing and Applications pp. 1–20 (2023)
41. Vidal, E., Toselli, A.H., Ríos-Vila, A., Calvo-Zaragoza, J.: End-to-end page-level assessment of handwritten text recognition. Pattern Recognition p. 109695 (2023). DOI 10.1016/j.patcog.2023.109695
42. Vinciarelli, A., Bunke, H., Bengio, S.: Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 709–720 (2004). DOI 10.1109/TPAMI.2004.14
43. Wu, M.S.: Modeling query-document dependencies with topic language models for information retrieval. Information Sciences 312, 1–12 (2015)
44. Zhu, M.: Recall, Precision and Average Precision. Working Paper 2004-09, Department of Statistics & Actuarial Science, University of Waterloo (2004)

Chapter 2

State Of The Art

Abstract As discussed in the previous chapter, the PrIx framework explicitly adopts the IR point of view, but it draws from many concepts and methods developed over the last 50 years in the field of KWS, first for speech signals and more recently for text images. A comprehensive survey of these approaches can be consulted in [23] and a recent review in [36]. This chapter overviews the taxonomy and the most important state-of-the-art approaches to KWS for text images. Detailed insights about the most interesting and/or relevant of these approaches will be provided in Chapter 7 of this book. This chapter also reviews the works carried out so far on certain issues which are very significant for PrIx (and KWS alike) but which are very seldom considered in the KWS literature; namely querying text images for hyphenated, abbreviated and/or multiple words. Finally, since PrIx shares optical and language models with HTR, the state of the art in HTR is also briefly outlined.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. H. Toselli et al., Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images, The Information Retrieval Series 49, https://doi.org/10.1007/978-3-031-55389-9_2

2.1 The field of a Hundred Names

The scientific literature is flooded with works that pursue the aims described earlier. Nevertheless, depending on the authors’ background or community, different names are used for tackling virtually the same problem. One remarkable example is the name Spoken Term Detection (or STD), which is widely used by the speech processing community [57, 48, 35, 13, 30, 38, 47]. Although the name Keyword Spotting (or just KWS) has been used in the past by the speech processing community as well [56, 84], the former name has been broadly adopted in recent years. The contributions from the speech community are very significant, since they tackled the problem earlier than the researchers interested in historical handwritten documents [33, 14, 43, 44, 32]. As a matter of fact, some of the popular strategies to perform keyword spotting for handwritten documents were adopted from the speech community (see [20], for example). This should not come as a surprise to the reader, since the handwritten text recognition community has benefited for a long time from the developments made by their speech recognition colleagues: Hidden Markov and Gaussian Mixture Models were first used for speech recognition [31], and then adopted by text recognition researchers [37], and the same occurred with modern artificial neural network architectures, such as Long Short-Term Memory networks (see [24] and [26]).

Sometimes, even within the same community, two different names are used to refer to the same problem. For instance, in the handwritten documents community, many researchers prefer the term Word Spotting [43, 20]. And, just to make things a bit more confusing, some researchers have used these names to refer to different problems (e.g. [11]). In this book, we will use the term KWS, which is the most popular in the literature nowadays. However, if the reader wants to investigate other works with the same or very similar aims, she should carefully review the literature related to all these topics:

• Keyword Spotting
• Spoken Term Detection (only for speech signals)
• Word Spotting
• Word Detection
• (Speech or text image) Search
• (Speech or text image) Retrieval
• (Speech or text image) Indexing
• (Speech or text image) Mining

2.2 Taxonomy of KWS approaches

KWS systems and publications can be classified according to multiple criteria. This section presents the categories that can be used to classify a particular solution, which will be useful later to understand the assumptions and limitations of the different approaches.

2.2.1 Segmentation Assumptions

One of the most important practical distinctions between Keyword Spotting systems concerns the assumptions that each system makes about the segmentation of the original document images. Collections of real handwritten document images exhibit large variability, and certain assumptions may be reasonable in some cases but not in others. Any assumption that a system makes is a real limitation if that assumption does not hold in practice.


2.2.1.1 Word Segmentation

Many KWS approaches in the literature assume that it is possible to obtain an accurate segmentation of the words present in the documents [43, 1, 2, 64, 69, 49]. Generally, these approaches work directly with the cropped bounding boxes of the words in the document and the query introduced by the user, which can be either a string or another image (see Sec. 2.2.3). Many researchers assume that this simplifies the problem of KWS and helps to determine the best-case performance. However, we can identify the following problems with these approaches:

1. Manual word segmentation is not practical. This is quite obvious, since we aim to automate the processing of handwritten documents. It does not seem reasonable that one of the steps involved requires human labor to be completed. In addition, accurate word segmentation into bounding boxes or bounding polygons is a very tedious (and thus expensive) task.

2. Automatic word segmentation is not good enough. Some authors argue that, although manual word segmentation is not practical for obvious reasons, current automatic word segmentation approaches are good enough to perform word spotting. While this could be true for some academic data sets, actual data from real collections of historical documents clearly contradict this hypothesis. See Fig. 2.1, for example.

(a) RIMES experimental dataset

(b) Alcaraz manuscript

Fig. 2.1: Accurate word segmentation is not always possible to perform. While isolated words are clearly identifiable in image (a), even expert human paleographers would face problems segmenting the individual words in image (b).

3. Working with isolated word images discards useful information. Notice that, if we try to identify whether a given query keyword is written in a particular cropped word region, we are assuming that the word contained in this region is independent of the other words in the document. This assumption is obviously false, and it has been extensively shown in the literature that using textual context information can substantially improve the results [46, 19, 74]. As an example of the importance of context, see Fig. 2.2. In theory, word segmentation does not necessarily discard context information, but virtually all works assuming word segmentation ignore it.

Fig. 2.2: Textual context is a useful aid to identify words in handwritten text. Although the word “House” cannot be read in the image, any reader reasonably versed in politics (or statistics) can infer it from the typical use of the English language or, at least, would guess that words like “automobile” are very unlikely in this context.

2.2.1.2 Line Segmentation

An important subset of papers in KWS assume that accurate line segmentation is feasible in practice [20]. These works typically use the machinery developed for Spoken Term Detection, since the 𝑥-axis in images (or the 𝑦-axis for vertically oriented text) can be interpreted as the time axis in speech. Of course, manual line segmentation is not feasible for real applications, for the same reasons as manual word segmentation. However, current automatic line segmentation approaches offer very good accuracy in many historical collections, and thus this is a less restrictive segmentation assumption in practice [40, 41, 28].

One additional benefit that these systems offer over most word segmentation-based approaches is that they are able to take textual context information into account to improve accuracy. Nevertheless, the computation needed to take the context information into account is not negligible. The time needed to process a text line typically grows exponentially with the size of the textual context considered, unless some pruning and approximation strategies are used to accelerate the process. For instance, the number of states in an 𝑛-gram language model, which directly affects the running time of keyword spotting and text recognition systems using it, increases exponentially with the context size 𝑛 (more details will be discussed in Chapter 4). Yet, these extra computational resources are only needed during a single step of the construction of the PrIx index, as we will see in Chapter 5.
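To make the exponential growth concrete, the following minimal sketch counts the states of an unpruned 𝑛-gram language model, assuming one state per possible history of 𝑛−1 symbols. The function name and the 80-symbol character vocabulary are illustrative assumptions, not taken from this book; real models contain far fewer states because unseen histories are pruned or backed off.

```python
def ngram_state_count(vocab_size: int, n: int) -> int:
    """Upper bound on the number of states of an unpruned n-gram model.

    Assumes one state per possible history of n - 1 symbols, so the
    count is vocab_size ** (n - 1).
    """
    if vocab_size < 1 or n < 1:
        raise ValueError("vocab_size and n must be positive")
    return vocab_size ** (n - 1)


if __name__ == "__main__":
    # Hypothetical character-level vocabulary of 80 symbols.
    for n in range(1, 6):
        print(f"n={n}: {ngram_state_count(80, n)} states")
```

Each increment of 𝑛 multiplies the bound by the vocabulary size, which is why pruning and approximation strategies matter in practice.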


2.2.1.3 Segmentation-free

Finally, the least restrictive scenario is the one where no segmentation assumption is made whatsoever. Many works claim to follow a segmentation-free approach; however, they are actually divided into two clearly separated steps: first, locating and segmenting word-like areas of the image, and then deciding which of these areas are relevant for the given query [59, 1, 51, 58].

Regardless of whether the system is implicitly or explicitly free of any segmentation assumption, we believe that all methods should be evaluated at some point under a fully automatic segmentation-free scenario, mainly because this will be the real operating scenario once the systems are deployed in libraries and archives, where millions of document pages have to be processed. Thus, despite the fact that the methods presented in this book generally operate on segmented lines, we will carry out some experiments where no manual segmentation of the pages is given (see Chapter 6).

2.2.2 Retrieved Objects

Another important aspect of any KWS system is the type of the retrieved objects. In practice, this is usually related to the segmentation assumptions discussed in the previous subsection: most works that operate under a line segmentation assumption retrieve relevant lines for a given query, and systems working under a word segmentation assumption typically retrieve relevant word instances. Although this is the usual practice, it is neither the only possible combination nor necessarily the most advisable one. Fig. 2.3 illustrates the different types of retrieved objects described below.

2.2.2.1 Word Instances

As mentioned before, a large portion of the keyword spotting systems described in the literature operate under the word segmentation assumption; that is, the system assumes that accurately segmented word regions have been extracted from the collection of images. Thus, when a user gives a query keyword to the system, it provides a ranked list of the word regions that, according to the system, correspond to the given keyword [43, 2, 64, 69, 29].

2.2.2.2 Lines

Similarly, the system may retrieve full text line regions where it believes that the user’s query keyword is written. This approach is also followed by many works [20, 21, 53, 76]. The benefit of this approach is twofold: first, it gives the user more context to decide whether or not the retrieved object is actually of interest; second, the human labor required to produce the ground truth is much smaller. In addition, it yields performance measure values that are highly correlated with those of systems retrieving word instances [82, 81].

Fig. 2.3: Different types of objects to retrieve after a user query. For instance, if the user searches for “Labour”, a keyword spotting system could choose to retrieve (a) individual instances of this keyword, (b) lines where this word was written, or (c) whole pages or paragraphs containing the keyword.

2.2.2.3 Pages

Last, but not least, the spotting system could choose to retrieve full pages or paragraphs containing multiple lines. This would give the user further context, and measuring the quality of the results would more closely resemble the traditional applications of Information Retrieval. Remember that the user is trying to spot some keyword in our collection of documents because she needs to satisfy some information need, and it is highly probable that the required information cannot be found in a single text line of our collection. Hence, it would make perfect sense for the system to report full text pages or paragraphs. However, because most keyword spotting works are focused at a very low level (deciding whether or not a word instance in a collection of images is the one that the user was searching for), they disregard this scenario. Some works do evaluate their systems retrieving paragraphs, with more complex queries [81, 75].


2.2.3 Query Representation

The user may interact with the system in different ways in order to present the query keyword. The two classical alternatives are the Query-by-String and Query-by-Example paradigms. Fig. 2.4 shows a representation of the two paradigms, which are explained next.

(a) Query: “country” (typed string)    (b) Query: exemplar word image

Fig. 2.4: Illustration of the different query paradigms used in the Keyword Spotting literature. The yellow box represents the user input. In figure (a) the user types a query string using her keyboard, while in figure (b) she uses an exemplar image containing the keyword to search for.

2.2.3.1 Query-by-String

On the one hand, the Query-by-String (QbS) paradigm assumes that the query keyword is presented to the system as an individual symbol belonging to a vocabulary (lexicon) or, alternatively, as a sequence of characters of a given alphabet. This paradigm is typically adopted in the speech community [56, 84, 57] and in many of the works of the handwriting community that were influenced by it [20, 21, 76, 74].

2.2.3.2 Query-by-Example

On the other hand, the Query-by-Example (QbE) paradigm assumes that an exemplar image containing the query keyword of interest is given to the system, which has to find the instances of the same keyword within the collection of document images [43, 1, 72, 54, 64, 49]. Historically, in this case, KWS has been seen as a particular instance of Content-based Image Retrieval (CBIR) [77, 5, 66], since most researchers focusing on this paradigm have a Computer Vision background, where CBIR has a long tradition.

In this book, we will focus mainly on the QbS paradigm; however, our probabilistic framework will also be applied to the QbE case. Without going into further detail, it will be shown in later chapters (see Chapter 7) that the QbE case only introduces one additional hidden variable with respect to the QbS case, and this only requires minor modifications to the algorithms.

One clear advantage of QbS over QbE is that the user only needs her keyboard to search for any imaginable concept that she wishes, and a broader set of queries (such as Boolean queries or arbitrary regular expressions) can be used. Notice that QbE is more restricted in this sense, since, in principle, we would need at least one exemplar image for each of the keywords forming the query. Nevertheless, the QbE paradigm also offers some advantages over some QbS approaches. In particular, as we will discuss later (see Chapter 5), QbS approaches relying on a closed lexicon are prone to the out-of-vocabulary problem, while QbE approaches are essentially resistant to it.
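The out-of-vocabulary problem can be illustrated with a toy sketch. It assumes a hypothetical closed-lexicon QbS index that maps each lexicon word to scored image regions; the index contents, region identifiers and function name are invented for illustration and do not describe any system cited here.

```python
from typing import Dict, List, Tuple

# Hypothetical index: lexicon word -> list of (region id, confidence score).
Index = Dict[str, List[Tuple[str, float]]]


def search_closed_lexicon(index: Index, query: str) -> List[Tuple[str, float]]:
    """Return the regions indexed for `query`, best score first.

    A closed-lexicon system can only answer queries that were in its
    vocabulary when the index was built; any other query returns
    nothing, regardless of what is actually written in the images.
    """
    hits = index.get(query, [])
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented for this example.
toy_index: Index = {
    "house": [("page1-line3", 0.62), ("page4-line9", 0.91)],
    "labour": [("page2-line1", 0.88)],
}
```

Here a query such as "house" returns its two regions, whereas "parliament" returns an empty list simply because it is missing from the lexicon.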

2.2.4 Training Requirements

Finally, a fundamental distinction among KWS systems is whether they need human-annotated training data to be built. This is a very important distinction, since systems that do not need annotated data (i.e. unsupervised methods) are much cheaper to build than those requiring large quantities of annotated samples (also known as supervised methods).

2.2.4.1 Unsupervised

Initial works on KWS for historical documents typically fall into this category [33, 43, 32]. Virtually all unsupervised approaches have been restricted to the query-by-example paradigm explained before. Typically, researchers apply some feature extraction mechanism to the images (usually pre-segmented word-shaped boxes) in order to extract visual features that can discriminate between similar and dissimilar images. Then, they compare the features extracted from the collection of images against the features obtained from the query, in order to rank them by some similarity (or dissimilarity) measure. Works are still being published under this paradigm [72, 51, 54, 17, 86], although the prevailing trend is to use supervised methods.
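The ranking step described above can be sketched as follows. This is a minimal illustration that assumes the visual features have already been extracted as fixed-length vectors and uses plain Euclidean distance as the dissimilarity measure; both choices, and all names, are assumptions made for this example rather than the method of any cited work.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_dissimilarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank candidate word images by increasing distance to the query."""
    scored = [(name, euclidean(query, feats)) for name, feats in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy 2-D features for three hypothetical word images.
    word_images = {"w1": [0.0, 1.0], "w2": [1.0, 1.0], "w3": [5.0, 5.0]}
    for name, dist in rank_by_dissimilarity([1.0, 1.0], word_images):
        print(name, round(dist, 3))
```

No annotated data is involved at any point: the quality of the ranking depends entirely on how well the chosen features and distance reflect visual similarity.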

2.2.4.2 Supervised

Because visually similar or dissimilar images do not necessarily correspond to relevant or irrelevant pairs of objects, researchers soon noticed that better results could be obtained by using supervised methods. In fact, most recent successful KWS methods are supervised algorithms [20, 21, 2, 69, 76].

This book focuses on supervised methods. As will be shown, our framework involves the probability distribution of the content of the text images (i.e. the posterior probability over transcripts), and supervised methods excel at learning parametric models for this kind of distribution. However, experimental evaluation with different amounts of supervised training data will be carried out, in order to measure the amount of annotation needed.


2.3 Additional Significant Matters for KWS

In the previous sections we have reviewed the most significant research topics that have been widely considered in the field of KWS for text images. Most of the cited works are mainly academically oriented and tend to overlook important features that are actually needed in practical search interfaces aimed at finding textual information in real, large-scale collections of text images. Perhaps the most important of these largely overlooked features are dealing with hyphenated and abbreviated text and allowing combined, multi-word queries.

2.3.1 Hyphenated Words

Hyphenated words are very frequent in historical documents. For example, in many large manuscript collections, such as the Finnish Court Records (see Appendix C), at least one word is hyphenated every other text line, which results in about 10% of the written tokens being fragments of hyphenated words. Most works on HTR sidestep this problem and just try to accurately recognize the prefix/suffix fragments of each hyphenated word [8, 71, 60, 70, 76]. Note that reliable recognition of these fragments is problematic and has not been sufficiently studied so far. However, if the aim is to transcribe text images, a sufficiently accurate character-level recognition of the fragments (and maybe the hyphens themselves) might be an admissible transcription result. This explains why only a handful of HTR works that explicitly deal with this problem can be found in the literature so far [8, 71, 70]. Recent works on line-segmentation-free, full-page or full-paragraph HTR approaches [15, 16] deserve special mention, however: in these works, hyphenation fragments are paired in training and the resulting systems prove able to transcribe hyphenated words into their corresponding entire word forms.

Nevertheless, since our goal is to allow searching for words, users need to use entire words in their queries. Clearly, it is completely unacceptable that users have to figure out the possible fragments into which these words may happen to have been broken when the documents were written! Works that deal with this important practical problem are extremely rare. We only find [82, 88], in addition to our own recent papers [80, 3]. These works will be reviewed in detail in Chapter 8 of this book.
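A small sketch shows the guessing burden that fragment-based search would place on users. It assumes, purely for illustration, that a scribe may break a word at any character position, so a word of k characters has k−1 possible prefix/suffix pairs; the function name is hypothetical and this is not one of the methods reviewed in Chapter 8.

```python
from typing import List, Tuple


def possible_fragments(word: str) -> List[Tuple[str, str]]:
    """All ways to break `word` into a non-empty prefix and suffix.

    Assumes a break may occur at any character position, which
    over-generates with respect to real hyphenation habits but gives
    an upper bound on what a user would have to guess.
    """
    return [(word[:i], word[i:]) for i in range(1, len(word))]


if __name__ == "__main__":
    for prefix, suffix in possible_fragments("labour"):
        print(f"{prefix}- {suffix}")
```

Even for a six-letter word this yields five candidate pairs, each of which would have to be issued as a separate two-fragment query.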

2.3.2 Abbreviations

Abbreviations are used intensively in historical documents. For instance, in typical Latin texts, about 40% of the written words are abbreviated. Moreover, abbreviations seldom follow consistent, homogeneous rules, and this is dramatically the case in handwritten documents. Again, most works on HTR sidestep this problem and just try to accurately recognize (at the character level) the abbreviations as such. This is often referred to as diplomatic or paleographic transcription.

As with hyphenation, historical documents seldom apply consistent or homogeneous abbreviation rules; moreover, abbreviated tokens are very frequently ambiguous, and the right way to expand an abbreviation can only be determined by relying on the textual context provided by the surrounding text. In the field of HTR, this problem has only recently started to be explicitly considered [83, 12, 61].

Also as in the case of hyphenated words, since our goal is to allow searching for words, it is completely unacceptable that users have to figure out the possible abbreviated forms of the words they wish to find. The specific problem of searching for abbreviated words in text images does not seem to have been explicitly considered so far, except for a few works of our own [6, 79, 61].

2.3.3 Multiple-word Queries

The vast majority of works on KWS for text images only consider single-word queries. But in the practical application of this technology for information searching, it is necessary to allow for queries based on multiple words. In a real, large-scale application, users expect to be able to use all the conventional search amenities available when searching for information in electronic text (such as on the Internet). A basic technology requirement is to provide support for word sequences and Boolean AND/OR word combinations.

As previously commented, the field of KWS is more mature for speech documents (where it is often called spoken term detection) than for text images. Thus, a significant amount of work has already been carried out in this area to support the most conventional forms of multiple-word queries [87, 62, 85, 13]. While all these works are certainly interesting and useful, the approaches and techniques proposed are not directly applicable to searching in text images.

Regarding works specifically dealing with text images, we can only cite one paper [55], which describes a simplistic way to combine KWS outcomes for individual keywords to obtain a response similar to that of an AND query. In addition, there are a few papers of our own that explicitly deal with queries based on AND/OR word combinations [50, 9, 75, 61] and word sequences [9, 10, 61]. These works will be reviewed in detail in Chapter 8 of this book.
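As a rough illustration of what Boolean support involves, the sketch below combines per-word relevance probabilities for a single image region under the assumption that the two word events are statistically independent. This is one simple possibility chosen for this example; it is an assumption, not necessarily the approach of the cited papers, and the function names are hypothetical.

```python
def _check_probability(p: float) -> None:
    """Reject values that are not valid probabilities."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"not a probability: {p}")


def and_relevance(p_a: float, p_b: float) -> float:
    """Relevance of 'A AND B', assuming the two events are independent."""
    _check_probability(p_a)
    _check_probability(p_b)
    return p_a * p_b


def or_relevance(p_a: float, p_b: float) -> float:
    """Relevance of 'A OR B', assuming the two events are independent."""
    _check_probability(p_a)
    _check_probability(p_b)
    return p_a + p_b - p_a * p_b


if __name__ == "__main__":
    # Hypothetical relevance probabilities of two words in one region.
    print(and_relevance(0.9, 0.6))
    print(or_relevance(0.9, 0.6))
```

Words written in the same region are rarely independent in practice, so these values should be read as approximations only.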

2.4 State of the Art in HTR Models and Methods

HTR approaches evolved from the first proposals, based on detecting and recognizing the constituent words/characters of a given handwritten text image [42, 63], to segmentation-free approaches based on Hidden Markov Models (HMM), which enable the modeling of syntactic constraints through language models, as in [4, 73, 45].


Since 2009, Recurrent Neural Networks (RNN) and Deep Feed-Forward Neural Networks have become key parts of new recognition systems, with remarkable results. In particular, recognizers using Bi-directional and Multi-dimensional Long Short-Term Memory (LSTM) RNNs [27, 26], trained with the Connectionist Temporal Classification (CTC) loss function [25], far outperformed those based on HMMs. This trend was later consolidated with the introduction of Convolutional Neural Networks (CNN) [22] for automatic feature extraction from images, defining a new architecture type named Convolutional-Recurrent Neural Network (CRNN) [65, 7, 68]. State-of-the-art HTR toolkits like PyLaia [52] and Kraken [34] are based on this architecture type.

Lately, with the advent of Transformer models and their effective attention mechanism [78] (especially the so-called Vision Transformers [18]), new HTR approaches are emerging which achieve results comparable to those of CRNNs on traditional HTR datasets like IAM (Appendix C.1) or the simplest Bentham BEN1 dataset (Appendix C.2.1). These good results [39] were generally achieved with pre-trained Transformer models fine-tuned on the specific dataset on which they are evaluated. Furthermore, such pre-trained models have been fine-tuned on historical manuscript datasets with fair results [67], but it is not entirely clear to what extent this depends on the original pre-training data also including text samples in the same (or a similar) language as the manuscript to be recognized.


References

1. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Efficient Exemplar Word Spotting. In: Proceedings of the British Machine Vision Conference, pp. 67.1–67.11. BMVA Press (2012). DOI 10.5244/C.26.67
2. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word Spotting and Recognition with Embedded Attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(12), 2552–2566 (2014). DOI 10.1109/TPAMI.2014.2339814
3. Andrés, J., Toselli, A.H., Vidal, E.: Search for hyphenated words in probabilistic indices: a machine learning approach. In: 2023 International Conference on Document Analysis and Recognition (ICDAR), pp. 269–285 (2023)
4. Bazzi, I., Schwartz, R., Makhoul, J.: An Omnifont Open-Vocabulary OCR System for English and Arabic. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(6), 495–504 (1999)
5. Bird, C.L., Chapman, S.G., Ibbotson, J.B.: Content-driven navigation of large databases. In: IEEE Coll. on Intelligent Image Databases, pp. 13/1–13/5 (1996). DOI 10.1049/ic:19960751
6. Bluche, T., Hamel, S., Kermorvant, C., Puigcerver, J., Stutzmann, D., Toselli, A.H., Vidal, E.: Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project. In: Proc. of 14th ICDAR (2017)
7. Bluche, T., Louradour, J., Messina, R.O.: Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 01, 1050–1055 (2016)
8. Bluche, T., Ney, H., Kermorvant, C.: The LIMSI handwriting recognition system for the HTRtS 2014 contest. In: 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 86–90. IEEE (2015)
9. Calvo-Zaragoza, J., Toselli, A.H., Vidal, E.: Probabilistic music-symbol spotting in handwritten scores. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 558–563. IEEE (2018)
10. Calvo-Zaragoza, J., Toselli, A.H., Vidal, E., Sánchez, J.A.: Music symbol sequence indexing in medieval plainchant manuscripts. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 882–887. IEEE (2019)
11. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intelligent Systems 28(2), 15–21 (2013). DOI 10.1109/MIS.2013.30
12. Camps, J.B., Vidal-Gorène, C., Vernet, M.: Handling heavily abbreviated manuscripts: HTR engines vs text normalisation approaches. ArXiv abs/2107.03450 (2021)
13. Can, D., Saraclar, M.: Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech, and Language Processing 19(8), 2338–2347 (2011)
14. Chen, F.R., Wilcox, L.D., Bloomberg, D.S.: Word spotting in scanned images using hidden Markov models. In: International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 1–4 (1993). DOI 10.1109/ICASSP.1993.319732
15. Coquenet, D., Chatelain, C., Paquet, T.: SPAN: a simple predict & align network for handwritten paragraph recognition. In: Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part III 16, pp. 70–84. Springer (2021)
16. Coquenet, D., Chatelain, C., Paquet, T.: DAN: a segmentation-free document attention network for handwritten document recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
17. Dey, S., Nicolaou, A., Lladós, J., Pal, U.: Local Binary Pattern for Word Spotting in Handwritten Historical Document. In: A. Robles-Kelly, M. Loog, B. Biggio, F. Escolano, R. Wilson (eds.) Structural, Syntactic, and Statistical Pattern Recognition, pp. 574–583. Springer International Publishing (2016)
18. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020)


19. Fischer, A., Frinken, V., Bunke, H., Suen, C.Y.: Improving HMM-Based Keyword Spotting with Character Language Models. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 506–510 (2013). DOI 10.1109/ICDAR.2013.107
20. Fischer, A., Keller, A., Frinken, V., Bunke, H.: Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters 33(7), 934–942 (2012). DOI 10.1016/j.patrec.2011.09.009. Special Issue on Awards from ICPR 2010
21. Frinken, V., Fischer, A., Manmatha, R., Bunke, H.: A Novel Word Spotting Method Based on Recurrent Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(2), 211–224 (2012). DOI 10.1109/TPAMI.2011.113
22. Fukushima, K., Miyake, S.: Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition. In: S.i. Amari, M.A. Arbib (eds.) Competition and Cooperation in Neural Nets, pp. 267–285. Springer Berlin Heidelberg (1982)
23. Giotis, A.P., Sfikas, G., Gatos, B., Nikou, C.: A survey of document image word spotting techniques. Pattern Recognition 68, 310–332 (2017)
24. Graves, A., Eck, D., Beringer, N., Schmidhuber, J.: Biologically Plausible Speech Recognition with LSTM Neural Nets. In: A.J. Ijspeert, M. Murata, N. Wakamiya (eds.) Biologically Inspired Approaches to Advanced Information Technology, pp. 127–136. Springer Berlin Heidelberg, Berlin, Heidelberg (2004)
25. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In: Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pp. 369–376. ACM, New York, NY, USA (2006). DOI 10.1145/1143844.1143891
26. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. on PAMI 31(5), 855–868 (2009)
27. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’08, pp. 545–552. Curran Associates Inc., Red Hook, NY, USA (2008)
28. Grüning, T., Leifert, G., Strauss, T., Labahn, R.: A Robust and Binarization-Free Approach for Text Line Detection in Historical Documents. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 236–241 (2017). DOI 10.1109/ICDAR.2017.47
29. Gómez, L., Rusiñol, M., Karatzas, D.: LSDE: Levenshtein Space Deep Embedding for Query-by-String Word Spotting. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 499–504 (2017). DOI 10.1109/ICDAR.2017.88
30. Hazen, T.J., Shen, W., White, C.: Query-by-example spoken term detection using phonetic posteriorgram templates. In: 2009 IEEE Workshop on Automatic Speech Recognition Understanding, pp. 421–426 (2009). DOI 10.1109/ASRU.2009.5372889
31. Jelinek, F.: Continuous speech recognition by statistical methods. Proceedings of the IEEE 64(4), 532–556 (1976). DOI 10.1109/PROC.1976.10159
32. Keaton, P., Greenspan, H., Goodman, R.: Keyword spotting for cursive document retrieval. In: Proceedings of the Workshop on Document Image Analysis (DIA ’97), pp. 74–81 (1997). DOI 10.1109/DIA.1997.627095
33. Khoubyari, S., Hull, J.J.: Keyword location in noisy document image. In: 2nd Annual Symposium on Document Analysis and Information Retrieval, pp. 217–231 (1993)
34. Kiessling, B., Kurin, G., Miller, M.T., Smail, K., Miller, M.: Advances and limitations in open source Arabic-script OCR: A case study. Digital Studies/Le champ numérique 11(1) (2021)
35. Kohler, J., Larson, M., de Jong, F., Kraaij, W., Ordelman, R.J.F. (eds.): Searching Spontaneous Conversational Speech workshop, Proc. of the ACM SIGIR workshop of the 31st Annual International SIGIR conference. Centre for Telematics and Inf. Tech., Enschede, The Netherlands (2008)
36. Kumari, L., Sharma, A.: A review of deep learning techniques in document image word spotting. Archives of Computational Methods in Engineering pp. 1–22 (2021)
37. Kundu, A., He, Y., Bahl, P.: Recognition of handwritten word: First and second order hidden Markov model based approach. Pattern Recognition 22(3), 283–297 (1989). DOI 10.1016/0031-3203(89)90076-9


38. Lee, L.s., Glass, J., Lee, H.y., Chan, C.a.: Spoken content retrieval—beyond cascading speech recognition with text retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(9), 1389–1420 (2015)
39. Li, M., Lv, T., Cui, L., Lu, Y., Florêncio, D.A.F., Zhang, C., Li, Z., Wei, F.: TrOCR: Transformer-based optical character recognition with pre-trained models. ArXiv abs/2109.10282 (2021)
40. Likforman-Sulem, L., Zahour, A., Taconet, B.: Text line segmentation of historical documents: a survey. International Journal of Document Analysis and Recognition (IJDAR) 9(2), 123–138 (2007). DOI 10.1007/s10032-006-0023-z
41. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text Line and Word Segmentation of Handwritten Documents. Pattern Recogn. 42(12), 3169–3183 (2009). DOI 10.1016/j.patcog.2008.12.016
42. Mahadevan, U., Nagabushnam, R.: Gap metrics for word separation in handwritten lines. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 124–127 (1995). DOI 10.1109/ICDAR.1995.598958
43. Manmatha, R., Han, C., Riseman, E.M.: Word spotting: a new approach to indexing handwriting. In: Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 631–637 (1996). DOI 10.1109/CVPR.1996.517139
44. Manmatha, R., Han, C., Riseman, E.M., Croft, W.B.: Indexing Handwriting Using Word Matching. In: Proceedings of the First ACM International Conference on Digital Libraries, DL ’96, pp. 151–159. ACM, New York, NY, USA (1996). DOI 10.1145/226931.226960
45. Marti, U.V., Bunke, H.: Handwritten sentence recognition. In: Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, vol. 3, pp. 463–466 (2000). DOI 10.1109/ICPR.2000.903584
46. Marti, U.V., Bunke, H.: Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. In: Hidden Markov Models, pp. 65–90. World Scientific (2001). DOI 10.1142/9789812797605_0004
47. Mary, L., Deekshitha, G.: Searching speech databases: features, techniques and evaluation measures. Springer (2018)
48. Miller, D.R., Kleber, M., Kao, C.L., Kimball, O., Colthurst, T., Lowe, S.A., Schwartz, R.M., Gish, H.: Rapid and accurate spoken term detection. In: 8th Annual Conference of the International Speech Communication Association (2007)
49. Mondal, T., Ragot, N., Ramel, J.Y., Pal, U.: Flexible Sequence Matching technique: An effective learning-free approach for word spotting. Pattern Recognition 60, 596–612 (2016). DOI 10.1016/j.patcog.2016.05.011
50. Noya-García, E., Toselli, A.H., Vidal, E.: Simple and Effective Multi-word Query Spotting in Handwritten Text Images, pp. 76–84. Springer International Publishing (2017)
51. Papandreou, A., Gatos, B., Zagoris, K.: An Adaptive Zoning Technique for Word Spotting Using Dynamic Time Warping. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 387–392 (2016). DOI 10.1109/DAS.2016.79
52. Puigcerver, J.: Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition? In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 67–72 (2017). DOI 10.1109/ICDAR.2017.20
53. Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic interpretation and improvements to the HMM-filler for handwritten keyword spotting. In: 13th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 731–735 (2015). DOI 10.1109/ICDAR.2015.7333858
54. Retsinas, G., Louloudis, G., Stamatopoulos, N., Gatos, B.: Keyword Spotting in Handwritten Documents Using Projections of Oriented Gradients. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 411–416 (2016). DOI 10.1109/DAS.2016.61
55. Riba, P., Almazán, J., Fornés, A., Fernández-Mota, D., Valveny, E., Lladós, J.: e-crowds: A mobile platform for browsing and searching in historical demography-related manuscripts. In: 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 228–233 (2014). DOI 10.1109/ICFHR.2014.46
56. Rohlicek, J.R., Russell, W., Roukos, S., Gish, H.: Continuous hidden Markov modeling for speaker-independent word spotting. In: International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 627–630 (1989). DOI 10.1109/ICASSP.1989.266505

References

31

57. Rose, R.: Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition. Computer Speech & Language 9(4), 309–333 (1995). DOI https://doi.org/10.1006/csla.1995.0015 58. Rothacker, L., Sudholt, S., Rusakov, E., Kasperidus, M., Fink, G.A.: Word hypotheses for segmentation-free word spotting in historic document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 1174–1179 (2017). DOI 10.1109/ICDAR.2017.194 59. Rusi˜nol, M., Aldavert, D., Toledo, R., Llad´os, J.: Browsing Heterogeneous Document Collections by a Segmentation- Free Word Spotting Method. In: 2011 International Conference on Document Analysis and Recognition, pp. 63–67 (2011). DOI 10.1109/ICDAR.2011.22 60. S´anchez, J.A., Romero, V., Toselli, A.H., Villegas, M., Vidal, E.: A set of benchmarks for handwritten text recognition on historical documents. Pattern Recognition 94, 122–134 (2019) 61. S´anchez, J.A., Vidal, E., Bosch, V.: Effective crowdsourcing in the EDT project with probabilistic indexes. In: Document Analysis Systems: 15th IAPR International Workshop, DAS 2022, La Rochelle, France, May 22–25, 2022, Proceedings, pp. 291–305. Springer (2022) 62. Seide, F., Yu, P., Shi, Y.: Towards spoken-document retrieval for the enterprise: Approximate word-lattice indexing with text indexers. In: 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pp. 629–634. IEEE (2007) 63. Seni, G., Cohen, E.: External word segmentation of off-line handwritten text lines. Pattern Recognition 27(1), 41–52 (1994) 64. Sfikas, G., Retsinas, G., Gatos, B.: Zoning Aggregated Hypercolumns for Keyword Spotting. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 283–288 (2016). DOI 10.1109/ICFHR.2016.0061 65. 
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 39(11), 2298–2304 (2017). DOI 10.1109/TPAMI.2016.2646371 66. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000). DOI 10.1109/34.895972 67. Str¨obel, P.B., Clematide, S., Volk, M., Hodel, T.: Transformer-based HTR for historical documents. arXiv preprint arXiv:2203.11008 (2022) 68. Subramani, N., Matton, A., Greaves, M., Lam, A.: A survey of deep learning approaches for ocr and document understanding. ArXiv abs/2011.13534 (2020) 69. Sudholt, S., Fink, G.A.: PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 277–282 (2016). DOI 10.1109/ICFHR.2016.0060 70. Swaileh, W., Lerouge, J., Paquet, T.: A unified french/english syllabic model for handwriting recognition. In: 15th Int. Conf. on Frontiers in Handwriting Rec. (ICFHR), pp. 536–541 (2016) 71. S´anchez, J.A., Romero, V., Toselli, A.H., Vidal, E.: ICFHR2014 Competition on Handwritten Text Recognition on Transcriptorium Datasets (HTRtS). In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 785–790 (2014). DOI 10.1109/ICFHR.2014.137 72. Tarafdar, A., Pal, U., Roy, P.P., Ragot, N., Ramel, J.: A Two-Stage Approach for Word Spotting in Graphical Documents. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 319–323 (2013). DOI 10.1109/ICDAR.2013.71 73. Toselli, A.H., Juan, A., Gonz´alez, J., Salvador, I., Vidal, E., Casacuberta, F., Keysers, D., Ney, H.: Integrated handwriting recognition and interpretation using finite-state models. Int. 
Journal of Pattern Recognition and Artificial Intelligence 18(04), 519–539 (2004) 74. Toselli, A.H., Puigcerver, J., Vidal, E.: Context-aware lattice based filler approach for key word spotting in handwritten documents. In: 2015 13th Int. Conference on Document Analysis and Recognition (ICDAR), pp. 736–740 (2015). DOI 10.1109/ICDAR.2015.7333859 75. Toselli, A.H., Vidal, E., Puigcerver, J., Noya-Garc´ıa, E.: Probabilistic multi-word spotting in handwritten text images. Pattern Analysis and Applications (2019)

32

2 State Of The Art

76. Toselli, A.H., Vidal, E., Romero, V., Frinken, V.: HMM Word-Graph Based Keyword Spotting in Handwritten Document Images. Information Sciences 370(C), 497–518 (2016). DOI 10.1016/j.ins.2016.07.063 77. Toshikazu, K., Takio, K., Hiroyuki, S.: Intelligent Visual Interaction with Image Database Systems : Toward the Multimedia Personal Interface. Journal of Information Processing 14(2), 134–143 (1991) 78. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017) 79. Vidal, E., Romero, V., Toselli, A.H., S´anchez, J.A., Bosch, V., Quir´os, L., Bened´ı, J.M., Prieto, J.R., Pastor, M., Casacuberta, F., et al.: The carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 85–90. IEEE (2020) 80. Vidal, E., Toselli, A.H.: Probabilistic indexing and search for hyphenated words. In: Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Proceedings, Part II 16, pp. 426–442. Springer (2021) 81. Villegas, M., M¨uller, H., Garc´ıa Seco de Herrera, A., Schaer, R., Bromuri, S., Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A., Dellandrea, E., Gaizauskas, R., Mikolajczyk, K., Puigcerver, J., Toselli, A.H., S´anchez, J.A., Vidal, E.: General overview of imageclef at the clef 2016 labs. In: N. Fuhr, P. Quaresma, T. Gonc¸alves, B. Larsen, K. Balog, C. Macdonald, L. Cappellato, N. Ferro (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 267–285. Springer International Publishing, Cham (2016) 82. Villegas, M., Puigcerver, J., Toselli, A.H., S´anchez, J.A., Vidal, E.: Overview of the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task. In: CLEF (Working Notes), pp. 233–253 (2016) 83. 
Villegas, M., Toselli, A.H., Romero, V., Vidal, E.: Exploiting Existing Modern Transcripts for Historical Handwritten Text Recognition. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 66–71 (2016). DOI 10.1109/ICFHR.2016.0025 84. Weintraub, M.: Keyword-spotting using SRI’s DECIPHER large-vocabulary speech- recognition system. In: 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 463–466 (1993). DOI 10.1109/ICASSP.1993.319341 85. Yu, R.P., Thambiratnam, K., Seide, F.: Word-lattice based spoken-document indexing with standard text indexers. In: Searching Spontaneous conversational speech workshop, SIGIR, pp. 54–61. Citeseer (2008) 86. Zagoris, K., Pratikakis, I., Gatos, B.: Unsupervised Word Spotting in Historical Handwritten Document Images Using Document-Oriented Local Features. IEEE Transactions on Image Processing 26(8), 4032–4041 (2017). DOI 10.1109/TIP.2017.2700721 87. Zhou, Z.Y., Yu, P., Chelba, C., Seide, F.: Towards spoken-document retrieval for the internet: Lattice indexing for large-scale web-search architectures. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 415–422 (2006) 88. Ziran, Z., Pic, X., Innocenti, S.U., Mugnai, D., Marinai, S.: Text alignment in early printed books combining deep learning and dynamic programming. Pattern Recognition Letters 133, 109–115 (2020). DOI https://doi.org/10.1016/j.patrec.2020.02.016

Chapter 3

Probabilistic Indexing (PrIx) Framework

Abstract The proposed PrIx framework is formally presented in this chapter. In short, PrIx aims at processing each text image in such a way that all the sets of strokes in the image which can be reasonably interpreted as text elements, such as characters and words, become symbolically represented; that is, represented like electronic text. However, the primary concern of PrIx is to retain all the information needed to also represent the intrinsic uncertainty which underlies text images, and more specifically handwritten text images. A dual presentation is given. First, a “pure” image processing viewpoint is adopted, where each text element in the images is treated just as a small object that has to be somehow detected and identified. This presentation will make it clear that PrIx, and KWS alike, essentially boil down to an object recognition process, where the class’ posterior probability of each object has to be estimated at each image location. Then PrIx will be developed in full detail from another equivalent viewpoint where the underlying object recognition problem is equated to HTR, thereby considering PrIx as a form of HTR which explicitly retains image interpretation uncertainty.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. H. Toselli et al., Probabilistic Indexing for Information Search and Retrieval in Large Collections of Handwritten Text Images, The Information Retrieval Series 49, https://doi.org/10.1007/978-3-031-55389-9_3

3.1 Pixel Level Textual Image Representation: 2-D Posteriorgram

Words are here considered just as small objects that we wish to detect and identify. Each of these objects, denoted by 𝑣, is assumed to belong to a large (open) set of “object classes” which might be called (open) Vocabulary. The posteriorgram of a text image 𝑥 and a (key)word 𝑣 is the probability that 𝑣 uniquely and completely appears in some bounding box containing the pixel (𝑖, 𝑗). In mathematical notation:

$$ P(Q = v \mid X = x, L = (i, j)) \equiv P(v \mid x, i, j), \qquad 1 \le i \le I,\ 1 \le j \le J,\ v \in V \tag{3.1} $$

where 𝐿 is a random variable over the set of locations (pixel coordinates) and 𝐼, 𝐽 are the horizontal and vertical dimensions of 𝑥, respectively. 𝑃(𝑣 | 𝑥, 𝑖, 𝑗) is a proper probability distribution; that is:

$$ \sum_{v} P(v \mid x, i, j) = 1, \qquad 1 \le i \le I,\ 1 \le j \le J \tag{3.2} $$

A simple way to compute 𝑃(𝑣 | 𝑥, 𝑖, 𝑗) is by considering that 𝑣 may have been written in any possible bounding box¹ 𝒃 in ℬ(𝑖, 𝑗), the set of all bounding boxes which contain the pixel (𝑖, 𝑗):

$$ P(v \mid x, i, j) = \sum_{\boldsymbol{b} \in \mathcal{B}(i,j)} P(v, \boldsymbol{b} \mid x, i, j) = \sum_{\boldsymbol{b} \in \mathcal{B}(i,j)} P(\boldsymbol{b} \mid x, i, j)\, P(v \mid x, \boldsymbol{b}, i, j) \tag{3.3} $$

where 𝑃(𝑣 | 𝑥, 𝒃, 𝑖, 𝑗) is the probability that 𝑣 is the (unique) word written in the box 𝒃 (which includes the pixel (𝑖, 𝑗)). Therefore 𝑣 is conditionally independent of (𝑖, 𝑗) given 𝒃, and Eq. (3.3) simplifies to:

$$ P(v \mid x, i, j) = \sum_{\boldsymbol{b} \in \mathcal{B}(i,j)} P(\boldsymbol{b} \mid x, i, j)\, P(v \mid x, \boldsymbol{b}) \tag{3.4} $$

This marginalization process is illustrated in Fig. 3.1, and Fig. 3.2 shows real results of computing 𝑃(𝑣 | 𝑥, 𝑖, 𝑗) in this way for an example image 𝑥 and a specific keyword 𝑣.

Fig. 3.1: Marginalization bounding boxes 𝒃 ∈ ℬ(𝑖, 𝑗). For 𝑣 = “matter”, the thick-line box will provide the highest value of 𝑃(𝑣 | 𝑥, 𝒃), while most of the other boxes will contribute only (very) low amounts to the sum.

The distribution 𝑃(𝒃 | 𝑥, 𝑖, 𝑗) of Eq. (3.4) should be interpreted as the probability that some word (not necessarily 𝑣) is written in the image region delimited by the bounding box 𝒃. Therefore, this probability should be high for word-shaped and word-sized bounding boxes centered around the pixel (𝑖, 𝑗), like some of those illustrated in Fig. 3.1. In contrast, it should be low for boxes which are too small, too large, or too off-center with respect to (𝑖, 𝑗). For simplicity, it can be assumed that this distribution is uniform for all reasonably sized and shaped boxes around (𝑖, 𝑗) (and null for all other boxes), so that it can simply be replaced by a constant in Eq. (3.4). Such a simplification encourages the peaks of the posteriorgram to be rather flat, as in Fig. 3.2.

¹ Vector notation, like 𝒃, will be generally adopted for bounding boxes. In a 2-D digital image, 𝒃 is generally assumed to be a rectangle, represented in ℕ⁴ by the coordinates of two opposed corners. Line image regions, considered later on, are seen as 1-D objects and 𝒃 = (𝑏₁, 𝑏₂)ᵀ ∈ ℕ² is a horizontal segment delimited by 𝑏₁ and 𝑏₂.
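The uniform-box simplification of Eq. (3.4) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the book: the function name `posteriorgram`, the list of candidate boxes and the per-box word posteriors are all hypothetical inputs, and 𝑃(𝒃 | 𝑥, 𝑖, 𝑗) is assumed to be uniform over the candidate boxes that cover each pixel.

```python
from collections import defaultdict


def posteriorgram(width, height, boxes, word_posteriors):
    """Sketch of Eq. (3.4) with P(b | x, i, j) uniform over covering boxes.

    boxes: list of (i1, j1, i2, j2) corners, 1-based and inclusive
           (hypothetical candidate boxes, assumed "reasonably" shaped).
    word_posteriors: one dict {word: P(v | x, b)} per box, each summing to 1.
    Returns {(i, j): {word: P(v | x, i, j)}} for pixels covered by some box;
    pixels covered by no candidate box are simply left out.
    """
    covering = defaultdict(list)  # pixel -> indices of boxes containing it
    for n, (i1, j1, i2, j2) in enumerate(boxes):
        for i in range(max(i1, 1), min(i2, width) + 1):
            for j in range(max(j1, 1), min(j2, height) + 1):
                covering[(i, j)].append(n)

    result = {}
    for pixel, indices in covering.items():
        weight = 1.0 / len(indices)  # uniform P(b | x, i, j)
        acc = defaultdict(float)
        for n in indices:
            for word, p in word_posteriors[n].items():
                acc[word] += weight * p  # marginalize over boxes
        result[pixel] = dict(acc)
    return result


# Two overlapping candidate boxes on a 6 x 2 image (made-up numbers).
boxes = [(1, 1, 4, 2), (3, 1, 6, 2)]
word_posteriors = [
    {"matter": 0.9, "matters": 0.1},
    {"matter": 0.2, "matters": 0.8},
]
pg = posteriorgram(6, 2, boxes, word_posteriors)
```

In this toy case every pixel in the overlap of the two boxes receives the same averaged value, which is the kind of flat peak mentioned above.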


On the other hand, the term 𝑃(𝑣 | 𝑥, 𝒃) is exactly the posterior probability needed by any system capable of recognizing a pre-segmented word image (i.e., a sub-image of 𝑥 bounded by 𝒃). Actually, such an isolated word recognition task can be formally written as the following classification problem:

$$ \hat{v} = \arg\max_{v \in V} P(v \mid x, \boldsymbol{b}) \tag{3.5} $$

In general, any system capable of recognizing pre-segmented word images implicitly or explicitly computes 𝑃(𝑣 | 𝑥, 𝒃) and can thereby be used to obtain the posteriorgram according to Eq. (3.4). For example, using a 𝑘-Nearest Neighbor classifier, it can be approximated just as [3]:

$$ P(v \mid x, \boldsymbol{b}) = \frac{k_v}{k} \tag{3.6} $$

where 𝑘ᵥ is the number of 𝑣-labeled prototypes out of the 𝑘 which are nearest to the image in the bounding box 𝒃 of 𝑥. Obviously, the better the classifier, the better the corresponding posteriorgram estimates.

This is illustrated in Fig. 3.2, which shows two examples of image posteriorgrams obtained according to Eq. (3.4) using two different word image recognizers. In both cases, well-trained optical hidden Markov models (HMM) were used to compute 𝑃(𝑣 | 𝑥, 𝒃) ∀𝒃 ∈ ℬ(𝑖, 𝑗). 𝑃₀(𝑣 | 𝑥, 𝑖, 𝑗) was obtained directly, using a plain, context-agnostic optical recognizer, and 𝑃₂(𝑣 | 𝑥, 𝑖, 𝑗) was produced using a more precise contextual word recognizer, additionally based on a well-trained bi-gram. As can be seen, 𝑃₀ values are only good for the two clear instances of “matter”, but almost vanish for a third instance, probably because of the faint character “m”. Worse still, 𝑃₀ values are relatively high for the similar, but wrong word “matters”; in fact, much higher than for the third, faint instance of the correct one. In contrast, the contextual recognizer leads to high 𝑃₂ values for all three correct instances of “matter”, even for the faint one, while the values for the wrong word are very low. Clearly, bi-grams such as “It matter” and “matter not” are unlikely, thereby preventing 𝑃₂(𝑣 | 𝑥, 𝒃) from being high for any box 𝒃 around the word “matters”. On the other hand, the bi-grams “the matter” and “matter of” are very likely, thereby helping the optical recognizer to boost 𝑃₂(𝑣 | 𝑥, 𝒃) for boxes 𝒃 around the faint instance of “matter”.

Pixel-level posteriorgrams could be directly used for keyword search: given a threshold 𝜏 ∈ [0, 1], a word 𝑣 ∈ 𝑉 is spotted in all image positions where 𝑃(𝑣 | 𝑥, 𝑖, 𝑗) > 𝜏. By varying 𝜏, adequate precision-recall tradeoffs could be achieved.
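Returning to the 𝑘-Nearest Neighbor estimate of Eq. (3.6), the following minimal sketch shows how the ratio 𝑘ᵥ/𝑘 could be computed. The feature vectors, the squared Euclidean distance and the function name `knn_word_posterior` are assumptions made for illustration only; the source does not specify how word images are represented or compared.

```python
from collections import Counter


def knn_word_posterior(query, prototypes, k):
    """Estimate P(v | x, b) as k_v / k, following Eq. (3.6).

    query: feature vector of the sub-image of x bounded by b (hypothetical).
    prototypes: list of (feature_vector, word_label) pairs.
    Returns {word: k_v / k} for the words seen among the k nearest prototypes;
    any other word implicitly gets probability 0.
    """
    if not 1 <= k <= len(prototypes):
        raise ValueError("k must be between 1 and the number of prototypes")

    def squared_distance(a, b):
        # Assumed metric: squared Euclidean distance between feature vectors.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = sorted(prototypes, key=lambda pr: squared_distance(query, pr[0]))[:k]
    counts = Counter(label for _, label in nearest)  # k_v for each word v
    return {word: k_v / k for word, k_v in counts.items()}


# Made-up 2-D features for five labeled prototypes.
prototypes = [
    ((0.0, 0.0), "matter"),
    ((0.1, 0.0), "matter"),
    ((0.2, 0.1), "matters"),
    ((5.0, 5.0), "of"),
    ((5.1, 4.9), "of"),
]
posterior = knn_word_posterior((0.05, 0.0), prototypes, k=3)
# Two of the three nearest prototypes are labeled "matter", one "matters".
```

With small 𝑘 the estimate is coarse (only multiples of 1/𝑘 are possible), which is one reason why a better classifier yields better posteriorgram estimates.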

3.2 Image Regions for Keyword Indexing and Search

Computing the full posteriorgram as in Eq. (3.4) for all the words of a large vocabulary (as needed for indexing purposes) and all the pixels of each page image entails a formidable amount of computation. The same can be said for the exorbitant amount of memory which would be needed to explicitly store all the resulting posterior probabilities. Therefore such a direct approach is obviously inappropriate for indexing purposes and, moreover, it becomes unfeasible for the size of text image collections considered in this work.

Fig. 3.2 (panels, top to bottom: 𝑃₂(𝑣 | 𝑥, 𝑖, 𝑗), 𝑃₀(𝑣 | 𝑥, 𝑖, 𝑗), 𝑥): Identical optical HMMs were carefully trained in order to help computing two 2-D posteriorgrams, 𝑃₀ and 𝑃₂, for a text image 𝑥 and keyword 𝑣 = “matter”. A context-agnostic HMM+0-gram isolated word classifier was used to obtain 𝑃₀. But much better posterior estimates are offered by 𝑃₂, obtained using a contextual, HMM+2-gram classifier.

Clearly, rather than working at the pixel level, some adequately small image regions, 𝑥, which are indexable and suitable search targets for users, need to be defined to compute the RPs introduced in Chapter 1 (Sec. 1.4). While these concerns are seldom discussed in the KWS literature, region proposal [18] has been the focus of a number of studies in the object recognition community, as well as in the field of document analysis; see e.g. [4], which deals with graphic pattern spotting in historical documents, and [12, 16, 23], which apply region proposal neural networks to various document layout analysis tasks.

In the traditional KWS literature, word-sized regions are often considered by default. This is reminiscent of segmentation-based KWS methods, which required previously cropped, accurate word bounding boxes. However, as discussed in Chapter 1, this is not realistic for large image collections. More importantly, by considering isolated words, it is difficult for the underlying word recognizer to take advantage of word linguistic contexts to achieve good spotting precision (as illustrated in Fig. 3.2).

At the other extreme we may consider whole page images, or relevant text blocks thereof, as the search target image regions. While this can be sufficiently adequate for many textual content retrieval applications, a page may typically contain many instances of the word searched for and, on the other hand, users generally like to get narrower responses to their queries.


A particularly interesting intermediate search target level consists of line-shaped regions. Lines are useful targets for indexing and search in practice and, in contrast with word-sized image regions, lines generally provide sufficient linguistic context to allow computing accurate word classification probabilities. Moreover, as will be discussed later on, line region posteriorgrams can be very efficiently computed.

3.3 Position-independent PrIx

Let us now examine how to obtain the RP 𝑃(𝑅 | 𝑥, 𝑣) defined in Eq. (1.1) when 𝑥 is a suitable image region (typically line-shaped, or any other small region). We will say that this scenario assumes a position-independent RP. In this section we examine different approaches to compute the position-independent RP for an image region.

3.3.1 Naive Word Posterior Interpretation of 𝑃(𝑅 | 𝑥, 𝑣)

If 𝑥 were a word-sized, tight bounding box, then 𝑃(𝑣 | 𝑥) could be used as a proxy for 𝑃(𝑅 | 𝑥, 𝑣) as:

$$ \begin{aligned} P(R = 1 \mid x, v) &\approx P(v \mid x) \\ P(R = 0 \mid x, v) &\approx \sum_{u \ne v} P(u \mid x) = 1 - P(v \mid x) \end{aligned} \tag{3.7} $$

As will be discussed in Chapter 7, this is in fact the approach implicitly or explicitly adopted by all word-segmentation-based KWS methods. So it is not surprising that researchers have tried to stretch this idea even if 𝑥 is not a tight word bounding box (i.e., it may contain multiple words). In this case, however, the intuition behind the classification problem underlying 𝑃(𝑣 | 𝑥) is unclear: how should the (unique) “most likely word” 𝑣̂ = arg max_𝑣 𝑃(𝑣 | 𝑥) be interpreted? Moreover, 𝑃(𝑣 | 𝑥) sums up to one for all 𝑣 ∈ 𝑉; but in keyword search, each word actually written in 𝑥 should have high RP and, as discussed in Chapter 9, the sum should rather approach the expected number of different words written in 𝑥.

In [31] (see also Chapter 6) we empirically study whether using 𝑃(𝑣 | 𝑥) with line image regions can still provide useful KWS performance. While the posterior underlying any isolated word classifier can straightforwardly be used to obtain 𝑃(𝑣 | 𝑥), in the comparative experiments reported in [31] we tried to use the same underlying probability distributions for all the methods. To this end, one can realize that 𝑃(𝑣 | 𝑥) can be readily obtained as a simple pixel-average of the posteriorgram as follows:

$$ P(v \mid x) = \sum_{i,j} P(v, i, j \mid x) \approx \frac{1}{I \cdot J} \sum_{i,j} P(v \mid x, i, j) \tag{3.8} $$

where 𝐼 · 𝐽 is the number of pixels of 𝑥 and, for simplicity, possible positions (𝑖, 𝑗) of words are assumed to be equiprobable.
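The pixel average of Eq. (3.8) is simple enough to illustrate directly. The sketch below assumes a hypothetical in-memory layout, a list of rows whose cells are dictionaries mapping words to 𝑃(𝑣 | 𝑥, 𝑖, 𝑗); neither this layout nor the function name comes from the source.

```python
def region_word_posterior(posteriorgram, word):
    """Approximate P(v | x) as the pixel average of P(v | x, i, j), Eq. (3.8).

    posteriorgram: list of J rows, each a list of I cells; every cell is a
                   dict {word: P(v | x, i, j)} (hypothetical layout).
    Words missing from a cell are treated as having probability 0 there.
    """
    cells = [cell for row in posteriorgram for cell in row]
    if not cells:
        raise ValueError("empty posteriorgram")
    return sum(cell.get(word, 0.0) for cell in cells) / len(cells)


# Toy one-row "line" of four pixels: "the" fills the left half,
# "matter" the right half (made-up, fully confident posteriors).
line = [[{"the": 1.0}, {"the": 1.0}, {"matter": 1.0}, {"matter": 1.0}]]
p_the = region_word_posterior(line, "the")        # 0.5
p_matter = region_word_posterior(line, "matter")  # 0.5
```

Both words are certainly written in this toy line, yet each obtains only 0.5 because the averaged values must sum to one over the vocabulary. This is the limitation of the naive interpretation noted above for regions containing several words.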


3.3.2 Proposed Approximations to 𝑃(𝑅 | 𝑥, 𝑣)

To start with, let the correct transcript of 𝑥 be the sequence of words 𝑤 = 𝑤₁, 𝑤₂, . . . , 𝑤ₙ, 𝑤ₖ ∈ 𝑉, 1 ≤ 𝑘 ≤ 𝑛, and let us abuse the notation and write 𝑣 ∈ 𝑤 to denote that ∃𝑘, 𝑤ₖ = 𝑣.² The definition of the class “1” in Eq. (1.2) can then be written as:

$$ (R = 1) \equiv (v \in w) \equiv (w_1 = v \,\lor\, w_2 = v \,\lor \dots \lor\, w_n = v) \tag{3.9} $$

Of course, if 𝑤 were known, the RP 𝑃(𝑅 | 𝑥, 𝑣) would trivially be 1 if 𝑣 ∈ 𝑤 and 0 otherwise. In KWS or PrIx, no transcripts are available, but an obvious, naive idea is to approximate 𝑤 with a best HTR transcription hypothesis, 𝑤̂(𝑥) (see Sec. 3.4). This results in:

$$ P(R \mid x, v) \approx \begin{cases} 1 & \text{if } v \in \hat{w}(x) \\ 0 & \text{otherwise} \end{cases} \tag{3.10} $$

While the simplicity of this idea makes it really enticing (and it has in fact become quite popular), we anticipate that 𝑤̂(𝑥) is seldom accurate enough in practice, and this method generally yields poor precision-recall performance. Therefore, we propose less simple but hopefully more accurate developments. According to [29] and Eq. (3.9), 𝑃(𝑅 | 𝑥, 𝑣) can be exactly written as:

𝑃(𝑅 | 𝑥, 𝑣)

=

𝑛 Õ 𝑘=1

− +

Õ 𝑙