Mathias Soeken Rolf Drechsler Editors
Natural Language Processing for Electronic Design Automation
Editors Mathias Soeken EPFL Integrated Systems Laboratory Lausanne, Switzerland
Rolf Drechsler AG Rechnerarchitektur University of Bremen Bremen, Germany
ISBN 978-3-030-52271-1    ISBN 978-3-030-52273-5 (eBook)
https://doi.org/10.1007/978-3-030-52273-5

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The design of modern hardware systems involves many designers from different fields and with different backgrounds, who work on several abstraction levels, each of which uses dedicated (formal) languages. In addition, customers are tightly integrated into the design process, and hence natural language is the ubiquitous language that connects all participants. Each design process starts with a natural language specification that serves as the starting point for the design flow and contains all requirements for the final system. Since the specification is not formally defined, it is extremely prone to errors, which may only be detected late in the design flow. To overcome this problem, scientists have investigated how natural language processing (NLP) techniques can be utilized to automate the requirements engineering process and the translation of natural language specifications into formal descriptions. Several works on this topic, addressing a variety of aspects, are discussed in this book.

In Chap. 1, Oliver Keszöcze, Betina Keiner, Matthias Richter, Gottfried Antpöhler, and Robert Wille show how methods from electronic design automation (EDA) can be used to semi-automatically translate legal regulations into formal representations. Liana Musat proposes an approach for the semi-formalization of requirements with application in automotive design in Chap. 2. In Chap. 3, Ian G. Harris and Christopher B. Harris show how natural language processing can aid formal verification with a method that generates verification artifacts from design specifications. In Chap. 4, Natalia Vanetik, Marina Litvak, and Efi Levi describe an event summarizer and event detector based on Twitter posts.

The book describes approaches for integrating more automation into the early stages of EDA design flows. Since the main description in these early stages is an informal natural language specification, automation is significantly more difficult.
On the other hand, since errors in early phases can lead to long release delays, automation is very important. The chapters in this book give insight into how natural language processing techniques can be utilized to create interesting
examples of automation in requirements engineering as well as in translating natural language specifications into formal models.

We would like to express our thanks to all the authors of the contributed chapters, who did a great job in submitting manuscripts of very high quality. Finally, we would like to thank Brian Halm, Cynthya Pushparaj, and Charles Glaser from Springer. All this would not have been possible without their steady support.

Lausanne, Switzerland
Bremen, Germany
June 2019
Mathias Soeken Rolf Drechsler
Contents
1 (Semi)automatic Translation of Legal Regulations to Formal Representations: Expanding the Horizon of EDA Applications
  Oliver Keszocze, Betina Keiner, Matthias Richter, Gottfried Antpöhler, and Robert Wille

2 Semi-Formalization of Requirements for Analogue/Mixed-Signal Products with Application in Automotive Domain
  Liana Kampl

3 Generation of Verification Artifacts from Natural Language Descriptions
  Ian G. Harris and Christopher B. Harris

4 Real-World Events Discovering with TWIST
  Natalia Vanetik, Marina Litvak, and Efi Levi

Index
Chapter 1
(Semi)automatic Translation of Legal Regulations to Formal Representations: Expanding the Horizon of EDA Applications

Oliver Keszocze, Betina Keiner, Matthias Richter, Gottfried Antpöhler, and Robert Wille
1.1 Introduction

The ever-increasing complexity of hardware and software systems has led to the development of elaborate design flows for electronic design automation (EDA). Within these flows, checking whether the system works as intended gains more and more relevance. Requirements engineering [1] and design at the Formal Specification Level [2], exploiting languages such as UML [3] or OCL [4], provide proper solutions for this purpose. Based on the initially given (textual) specification, engineers use these solutions to (formally) design and, eventually, verify the desired system to be implemented. This has led to the availability of very efficient EDA tools (see, e.g., [5–7]). However, the application of these tools is not necessarily limited to the design of hardware/software systems.
O. Keszocze
Hardware-Software-Co-Design, Friedrich-Alexander University Erlangen-Nuremberg (FAU), Erlangen, Germany

B. Keiner · M. Richter
gradient.Systemintegration GmbH, Singen, Germany
e-mail: [email protected]; [email protected]

G. Antpöhler
Kassenärztliche Vereinigung Bremen, Bremen, Germany
e-mail: [email protected]

R. Wille
Cyber-Physical Systems, German Research Centre for Artificial Intelligence, Bremen, Germany
Institute for Integrated Circuits, Johannes Kepler University Linz, Linz, Austria
Software Competence Center Hagenberg GmbH (SCCH), Hagenberg, Austria
e-mail: [email protected]; [email protected]; [email protected]

© Springer Nature Switzerland AG 2020
M. Soeken, R. Drechsler (eds.), Natural Language Processing for Electronic Design Automation, https://doi.org/10.1007/978-3-030-52273-5_1
In fact, legal regulations can also be seen as a special kind of specification. Although they do not directly specify the functionality of a system to be developed, many computer applications heavily rely on them. In fact, (software) systems exist which are "in charge" of checking whether, e.g., tax returns, accounting and billing, stock market transactions, etc. are processed in line with the rules and regulations of the respective field and country. Hence, as in hardware/software design, a first (manual) design step in the development of such systems is to formalize these regulations.

Very little work has been done in this field. In [8], the authors model the tax law of Luxembourg using UML/OCL. This process involved many time-consuming meetings with legal experts, and the resulting model was created completely "by hand." Considering that legal rules and laws are usually provided in a structured and precise (albeit not always comprehensive) fashion motivates the use of natural language processing for their formalization.

In this work, we present a methodology which (semi)automatically translates given legal regulations (provided as the real-world wording of the law) into a formal representation. As a case study, we consider rules from the German Regulations on Scales of Fees for Medical Doctors. We show how the proposed approach and the resulting formal representation can be used as a blueprint to incorporate the respective checks into an existing software system. Moreover, the obtained formal descriptions of the considered legal regulations even make it possible to check whether the law itself is consistent. This not only advances the development of the respective systems but also points to ways in which EDA tools can be utilized to create better, non-contradictory laws in general.
1.2 Considered Domain and Problem Formulation

In this work, the translation of legal regulations into a formal representation is discussed using rules from the German Regulations on Scales of Fees for Medical Doctors [9], provided in terms of a so-called Uniform Assessment Standard (German: Einheitlicher Bewertungsmaßstab; in the following, EBM). These regulations specify how medical doctors in Germany are supposed to generate their invoices. For this purpose, all possible services are listed and structured by means of cases. Each case is composed of notes and spans one or more sessions, which are conducted on one or more days. Eventually, one or more services are conducted in each case at the respective sessions. This structure is required since some services can only be accounted once (independently of how many sessions were required) or only once per session. The EBM eventually defines which services can be accounted and at what amount.
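The case structure described above can be captured in a few lines of code. The following Python sketch is illustrative only (the class and field names are our own, not those of the chapter), but it shows how a billing rule of the kind quoted in Fig. 1.1 becomes a mechanical check once the structure is formalized:

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    code: int  # EBM position number, e.g. 45678

@dataclass
class Session:
    day: int                          # day on which the session took place
    services: list = field(default_factory=list)

@dataclass
class Case:
    quarter: tuple                    # (quarter, year), e.g. (3, 2014)
    sessions: list = field(default_factory=list)

    def billed_codes(self):
        """All service codes billed across the sessions of this case."""
        return {s.code for sess in self.sessions for s in sess.services}

def violates_rule(case):
    """Exemplary rule: services 45678 and 54687 must not be billed
    together within the third and fourth quarter of 2014."""
    q, year = case.quarter
    in_range = year == 2014 and q in (3, 4)
    return in_range and {45678, 54687} <= case.billed_codes()
```

With such a model, each legal rule translates into one boolean check over a `Case` instance, which is exactly the kind of artifact the approach aims to generate.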
Fig. 1.1 The EBM: (a) the general structure relating cases, notes, sessions, services, and days; (b) an exemplary rule: "The service 45678 is not to be billed together with EBM position 54687 within the third and fourth quarter of 2014."
if (Case.quarter >= new Quarter(3, 2014) && Case.quarter <= new Quarter(4, 2014)) …

…(Vin > Vin_th), and an inverse current situation must be present. No inverse current is present when the output voltage is smaller than the supply voltage (Vout < Vs). When an inverse current condition is detected in the OFF state, the transistor remains off and all other protection functions are disabled in order to protect them from the undesired situation. The check for inverse current is performed continuously and reported to the microcontroller via the sense pin (the current IS is given the value IS_fault, which is defined in the parametric requirements). If the inverse current condition evaluates to false, all other protection functions are enabled.

The inverse current condition (Vout versus Vs) is also evaluated continuously in ON mode. When Vout > Vs, the inverse current is signalled to the microcontroller for the ON mode via the sensed current, which is given the predefined value IIS(OL). At the same time, all other protection functions are disabled. If no inverse current is detected, the other protection functions are enabled and the sensed current reflected by the IS pin is not influenced by the inverse current protection function.

The figure also shows the power-OFF final exit point, at which the power supply to the switch is turned off. The final exit point can be reached from both the ON and OFF states.

While the requirements for stand-alone functions are usually clearly stated, the requirements for their interactions are often between the lines, spread throughout the entire specification document. Because parts of activity diagrams and the corresponding requirements are closely related, they can easily be checked for the absence of contradictions or missing functionality. This also holds true for interactions and collaborations between functionalities. For example, Fig. 2.8 provides an overview of the ON and OFF states with all the protection functions.
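The decision logic described above can be condensed into a small sketch. The following Python function is our own illustration, not part of the chapter's model; the numeric sense values are placeholders standing in for the IS_fault and IIS(OL) values defined in the parametric requirements:

```python
def inverse_current_status(state, v_out, v_s, IS_FAULT=0.0044, IIS_OL=0.0035):
    """Inverse-current protection decision for a high-side switch.

    Returns (inverse_current, protections_enabled, sense_signal).
    No inverse current is present while Vout < Vs; when it is detected,
    all other protection functions are disabled and a state-dependent
    fault value is reported via the sense pin.
    """
    inverse = v_out > v_s
    if not inverse:
        # Normal operation: protections active, sense pin unaffected.
        return False, True, None
    # Fault reporting differs between the OFF and ON states.
    sense = IS_FAULT if state == "OFF" else IIS_OL
    return True, False, sense
```

The point of writing the behaviour down this way is that the two symmetric cases (inverse current present or absent) are forced to be explicit, which the prose specification does not enforce.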
To make the visualization easier, the protection functions have been included in a separate state machine. The requirements mainly describe what a system should do, not how it will do it. The examples above represent a small part of the entire model representing the requirements. The advantage of using a semi-formal representation is that we can easily work with different levels of detail, depending on who uses the model and how detailed a view they need.

One of the most common problems detected while modelling with these representations is missing information. When a conditional transition is defined, there are two possibilities: the condition is fulfilled, or it is not. In natural language descriptions, it is common that only one case is explicitly expressed, "When the condition X happens then . . . ", without specifying what will happen if condition X is not fulfilled. The implicit assumption is that the system remains in the same state, with nothing changing. This common situation can unfortunately lead to errors if something else is intended but was unintentionally left unstated. When the situation is represented using a SysML state machine or activity diagram, the syntax will be invalid unless both cases are represented.
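The "missing else" problem lends itself to a mechanical check. The following Python sketch (our own illustration, not part of the chapter's tooling) flags guarded transitions whose negated counterpart is never specified:

```python
from collections import defaultdict

def incomplete_transitions(transitions):
    """Given guarded transitions as (source, guard, target) triples,
    report (source, guard) pairs whose negation is never specified,
    i.e. the 'missing else' pattern.  Guards are plain strings and
    'not X' is taken as the negation of 'X'."""
    guards = defaultdict(set)
    for src, guard, _ in transitions:
        guards[src].add(guard)
    missing = []
    for src, gs in guards.items():
        for g in gs:
            neg = g[4:] if g.startswith("not ") else "not " + g
            if neg not in gs:
                missing.append((src, g))
    return missing
```

A SysML tool enforces the same completeness through its syntax; the sketch merely shows that the check itself is trivial once the requirement is in a structured form.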
Fig. 2.8 Integration of normal functionality with protection functions
The graphical models represent the translation of the natural language requirements, and they are explicitly linked to these requirements. Depending on the tooling used, this information, as well as metadata, is available in the property fields. These include the name of the linked element, the element type (e.g. requirement, test case) and the connection type (e.g. realization). Together with this information, basic meta-information for all models is available, such as name, creation date, modification date, status, author and version.
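Such a traceability link can be sketched as a simple record. The field names below are illustrative and do not correspond to any particular SysML tool:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ModelLink:
    """Traceability link between a diagram element and the natural
    language requirement it realizes (illustrative field names)."""
    name: str             # linked element, e.g. "REQ-042 inverse current"
    element_type: str     # e.g. "requirement", "test case"
    connection_type: str  # e.g. "realization", "verification"
    author: str
    version: str
    created: date
    modified: date
```

Keeping this metadata machine-readable is what allows impact analysis when a requirement is later added, changed or deleted.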
2.4.3.4 Requirements Models Reuse
The smart high-side power switch is one of many products widely used in different applications with different loads. In the example above, it is used for controlling the valves of an ABS device. Other applications include automotive applications for lighting, power distribution, heating, motor control or infotainment, as well as industrial applications such as robotics, general load management, electric drives and control systems for energy saving. Due to the large diversity of applications, reuse is essential for efficient product development and a short time to market. While reuse is widely practised at the design level, the question arises whether it also applies to the requirements engineering phases. Although the level of detail is kept low and the requirements are
described at a high level of abstraction, the reuse is significant. Within a product family, models for structure and behaviour can be reused; only the requirements models need to be adapted to the differences between the products, as well as the links between the requirements and the other elements fulfilling them. This means that the differences from one product to another have to be identified so that the models can be adapted to represent the new product, including all related requirements. The advantage of this approach, compared to the reuse and adaptation of a specification based on natural language, is the significantly reduced time spent checking and modelling common parts, handling problems already identified on previous products, and providing a common starting point for design and verification. This fosters uniformity over a large family of products.
2.5 Conclusions

Semi-formal representation of requirements is an important approach for overcoming natural language description problems such as ambiguity, incomprehensibility, incompleteness or inconsistency. A diagram-based approach improves the requirements engineering process, allowing better information processing than a textual description. In order to be successful, the semi-formal representation must be, on the one hand, easy to use and easy to understand and, on the other hand, sufficiently standard so that there is no room for interpretation. This does not mean that semi-formal representations are error-proof, as long as they are the product of manual human work. However, for modelling requirements a basic understanding of the requirement set must be built up, and hence a first step in their quality checking is performed.

Designed for systems engineering applications, including continuous-time systems such as AMS components, SysML represents a preferable choice for modelling the requirements. The capabilities of SysML support the implementation of a methodology and flow to organize and break down requirements, with the possibility to describe the structure and behaviour of the system in a well-readable form. The links between all types of diagrams ensure consistency for the impacted elements when requirements are added, changed or deleted. The modelling process represents an important step in checking the quality of the initial natural language requirements.

The use of the semi-formal representation in the protected power switch example demonstrates that it can be applied successfully even for AMS components and systems. In particular, it is very helpful for modelling the safety mechanisms, which represent a very sensitive issue. Finally, the diagram-based representation inherently allows the reuse of requirements. The benefits of reuse are time savings, better organization of requirements and better requirement descriptions.
References

1. J. Motavalli, The dozens of computers that make modern cars go (and stop). Available: http://www.nytimes.com/2010/02/05/technology/05electronics.html?_r=0
2. ISO 26262-1, Road vehicles – Functional safety, 1st edn., 2011-11-15
3. S. Withall, Software Requirements Patterns (Microsoft Press, 2007)
4. C. Palomares, C. Quer, X. Franch, PABRE-Proj: applying patterns in requirements elicitation, in 21st IEEE International Requirements Engineering Conference (RE'13) (2013)
5. C. Wei, B. Xiaohong, L. Xuefei, A study on airborne software safety requirements patterns, in IEEE 7th International Conference on Software Security and Reliability-Companion (SERE-C), 18–20 June 2013, pp. 131–136
6. G. Grau Colom, An i*-based reengineering framework for requirements engineering. PhD thesis, Universitat Politècnica de Catalunya, July 2008
7. X. Franch, The i* framework: the way ahead, in Sixth International Conference on Research Challenges in Information Science (RCIS), 16–18 May 2012, pp. 1–3
8. W.C. Chu, H. Yang, A formal method to software integration in reuse, in Proceedings of IEEE Computer Software and Applications (COMPSAC-96), 1996, pp. 343–348
9. W.C. Chu, C.P. Hsu, C.A. Lu, H. Xudong, A semi-formal approach to assist software design with reuse, in Proceedings of the IEEE International Conference on Software Maintenance (ICSM '99) (1999), pp. 256–264
10. OMG SysML, the Systems Modeling Language, Object Management Group Std. (2011). Available: http://www.sysml.org
11. H. Naz, M.N. Khokhar, Critical requirements engineering issues and their solution, in International Conference on Computer Modeling and Simulation (ICCMS '09), 20–22 Feb 2009, pp. 218–222
12. W.M. Wilson, Writing effective natural language requirements specifications. Technical report, Naval Research Laboratory (1999)
13. IEEE Computer Society, IEEE Standard Classification for Software Anomalies, IEEE Std 1044-2009
14. J. McLean, Twenty years of formal methods, in Proceedings of the 1999 IEEE Symposium on Security and Privacy (1999), pp. 115–116
15. T. Hoverd, Are formal methods the answer?, in IEE Colloquium on Requirements Capture and Specification for Critical Systems, 24 Nov 1989, pp. 7/1–7/2
16. Object Management Group, OMG Unified Modeling Language (OMG UML) Superstructure, Version 2.2 (2009), http://www.omg.org/spec/UML/2.2/Superstructure/PDF
17. M. Gavrilescu, G. Magureanu, D. Pescaru, I. Jian, Towards UML software models for cyber-physical system applications, in Telecommunications Forum (TELFOR), 20–22 Nov 2012, pp. 1701–1704
18. W. Mueller, Y. Vanderperren, UML and model-driven development for SoC design, in Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06), 22–25 Oct 2006, p. 1
19. R. Damasevicius, V. Stuikys, Application of UML for hardware design based on design process model, in Proceedings of the ASP-DAC 2004, Asia and South Pacific Design Automation Conference, 27–30 Jan 2004, pp. 244–249
20. SAE Aerospace, SAE AS5506B: Architecture Analysis and Design Language (AADL) (2012)
21. P.H. Feiler, B.A. Lewis, S. Vestal, The SAE Architecture Analysis & Design Language (AADL): a standard for engineering performance critical systems, in 2006 IEEE Conference on Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 4–6 Oct 2006, pp. 1206–1211
22. W. Hanbo, Z. Xingshe, D. Yunwei, T. Lei, Timing properties analysis of real-time embedded systems with AADL model using model checking, in 2010 IEEE International Conference on Progress in Informatics and Computing (PIC), 10–12 Dec 2010, vol. 2, pp. 1019–1023
23. E. Senn, J. Laurent, E. Juin, J.-P. Diguet, Refining power consumption estimations in the component based AADL design flow, in Forum on Specification, Verification and Design Languages (FDL 2008), 23–25 Sept 2008, pp. 173–178
24. The ATESST Consortium, EAST-ADL2 specification (2015), http://www.atesst.org
25. UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems (2015), http://www.omg.org/spec/MARTE/1.1/PDF/
26. K.D. Evensen, K.A. Weiss, A comparison and evaluation of real-time software systems modeling languages, presented at the Aerospace Conference, Atlanta, Georgia, 2010
27. J. Helming, M. Koegel, M. Schneider, M. Haeger, C. Kaminski, B. Bruegge, B. Berenbach, Towards a unified requirements modeling language, in Fifth International Workshop on Requirements Engineering Visualization (REV) (2010), pp. 53–57
28. L. Delligatti, SysML Distilled: A Brief Guide to the Systems Modeling Language (Addison-Wesley, 2013)
29. S. Friedenthal, A. Moore, R. Steiner, A Practical Guide to SysML – The Systems Modeling Language (Elsevier, 2012)
30. J. Holt, S. Perry, SysML for Systems Engineering, IET Professional Applications of Computing (The Institution of Engineering and Technology, London, 2009)
31. Infineon Technologies AG, Protected high side drivers, in Bridging Theory into Practice – Fundamentals of Power Semiconductors for Automotive Applications, 2nd edn. (Infineon Technologies AG, Munich, 2008), pp. 125–149
32. Infineon Technologies AG, Introduction to PROFET™ [20.05.2015]. Internet: http://www.infineon.com/dgdl/Introduction+to+PROFET%E2%84%A2.pdf?folderId=db3a30431400ef68011421b54e2e0564&fileId=db3a304332ae7b090132b527d9173083
33. Infineon Technologies AG, Ultimate Power – Perfect Control [20.02.2015]. Internet: http://www.infineon.com/dgdl/Infineon-Automotive_Power_SelectionGuide_2014-BC-v00_00-EN.pdf?fileId=db3a30431ddc9372011e2692f130475f
34. ISO 26262-9, Road vehicles – Functional safety – Part 9: Automotive Safety Integrity Level (ASIL)-oriented and safety-oriented analyses, 1st edn., 2011-11-15
Chapter 3
Generation of Verification Artifacts from Natural Language Descriptions

Ian G. Harris and Christopher B. Harris
I. G. Harris
Department of Computer Engineering, University of California Irvine, Irvine, CA, USA
e-mail: [email protected]

C. B. Harris
Department of Electrical and Computer Engineering, Auburn University, Auburn, Alabama, USA
e-mail: [email protected]

3.1 Introduction

The integrated circuit (IC) design process has evolved greatly, from the manual layout of a small number of components to the automated design of ICs containing billions of transistors. To accommodate the dramatic increases in design complexity, the field of electronic design automation (EDA) was born, starting with simple schematic capture tools and culminating in the complex automation tools available today. EDA tools depend on the existence of a well-defined behavioral model, or model of computation, which can be used to perform synthesis and verification tasks. Over time, the abstraction level of the behavioral models in use has risen to efficiently capture more complex behaviors.

EDA tools have proven effective in supporting synthesis and verification tasks, but the initial behavioral model must be generated manually by human experts. The process of manually creating an accurate and complete behavioral description has always been a central bottleneck in the design process which EDA tools seek to alleviate. Manually generating a behavioral description is expensive, requiring significant time and a large number of well-trained design and verification engineers. A large part of the verification process is devoted to detecting and fixing design errors created during the process of creating a behavioral description.

Developing a natural language specification of design behavior is a well-accepted precondition for generating a formal behavioral model. Natural language specifications are the first concrete behavioral description which is the basis for the
manually generated formal behavioral model. Natural language is preferred as the initial description method mainly because it is much simpler for a designer to use than existing hardware description languages. Natural language specifications also have the advantage that they can be used to communicate behavioral information with non-technical stakeholders, such as a client for whom the design is being made, or a high-level manager. Design and verification engineers use the specification as the main source of behavioral information, to generate a formal behavioral description and to identify corner cases and expected responses for verification. The task of interpreting natural language specifications has been exclusively manual because, generally speaking, only humans with expert design knowledge have the ability to properly interpret specification documents.
3.1.1 Verification Artifacts

Simulation-based verification involves applying test vectors to a system under test and verifying the correctness of the responses. An enormous number of test results must be evaluated when verifying complex systems, so the task of evaluating test results must be automated. The evaluation of test responses is typically automated using one of two approaches.

• Assertion-Based Verification: An assertion is a program invariant which is evaluated automatically during hardware simulation to perform response evaluation. Assertion-based verification is a widely used hardware verification approach [14].
• Transaction-Level Modeling: Transaction-level modeling is a high-level behavioral modeling approach which separates computation from communication [6]. The high level of abstraction enables simulation models to be created more efficiently than models at lower abstraction levels such as the register-transfer level.

Both result evaluation approaches require a tedious and error-prone manual step to generate assertions or simulatable transaction models. The task of formally specifying the behavior in terms of assertions or transaction models can be nearly as difficult as the design task itself.
3.1.2 Verification from Natural Language

We present approaches to generate assertions and transactor models directly from natural language descriptions of the system behavior. The benefit of our work is to simplify the verification process by reducing the amount of manual effort required, as well as the time required to debug the assertion framework itself. It is common for a specification to contain sentences which express constraints on the legal behavior of a system. We present an approach to generate SystemVerilog assertions directly from such constraint sentences. An example of the
Fig. 3.1 Generation of an assertion
Fig. 3.2 Generation of a transactor
goal of this project is shown in Fig. 3.1, which presents an English assertion statement together with the equivalent SystemVerilog assertion which our approach generates. Specifications often describe event sequences which implement features of the behavior. These event sequences are often named (e.g., "write transaction") and may be expressed over multiple sentences. Formally capturing event sequences is essential in order to create a transaction model which will provide a "golden model" of the transaction for response checking. Capturing sequences has the additional complexity that information contained in multiple related sentences must be combined to create a complete model. An example of transaction generation is shown in Fig. 3.2, which shows two English sentences which are part of a larger sequence description and the corresponding portion of the transactor written in Verilog.
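To give a flavor of what such a translation involves, the following Python sketch maps one invented family of English constraint sentences onto SystemVerilog assertion strings with a single pattern. This is a deliberately naive stand-in: the approach described in this chapter uses semantic parsing (Sect. 3.4) and attribute grammars rather than a regular expression, and the sentence pattern and signal names are our own examples:

```python
import re

# One hypothetical constraint-sentence family:
#   "<resp> must be asserted within <n> cycles of <trig>"
PATTERN = re.compile(
    r"(?P<resp>\w+) must be asserted within (?P<n>\d+) cycles? "
    r"(?:of|after) (?P<trig>\w+)", re.IGNORECASE)

def sentence_to_sva(sentence):
    """Map an English constraint sentence to a SystemVerilog assertion
    string; returns None if the sentence does not match the pattern."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return ("assert property (@(posedge clk) %s |-> ##[1:%s] %s);"
            % (m.group("trig"), m.group("n"), m.group("resp")))
```

Even this toy version illustrates the core difficulty: every linguistic variation of the same constraint would need its own pattern, which is exactly why semantic parsing is used instead.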
3.1.3 Chapter Organization The remainder of the chapter is organized as follows. Related work in the application of natural language processing is presented in Sect. 3.2. Section 3.3 outlines some of the key issues that must be considered when generating a formal description from natural language. Our use of semantic parsing to address issues related to linguistic variation is described in Sect. 3.4. Sections 3.5 and 3.6 describe our work in assertion generation and transactor generation, respectively. Section 3.7 presents conclusions from our results and suggests future work.
3.2 Related Work

Tasks related to the processing of natural language documents traditionally fall into the NLP field [26]. NLP includes information extraction as well as many related subproblems such as part-of-speech tagging and syntactic parsing. An information extraction approach which we will leverage for understanding hardware specifications is semantic parsing, which maps natural language sentences to formal meaning representations using syntactic parsing [37].

Existing semantic parsing approaches can be categorized based on the restrictions on the text which they process and the meaning representation used. In all previous work, the structure of the text is restricted in some way. Restrictions may be very strict, such as insisting on a subject, verb, direct object, and indirect object pattern in each sentence [9, 34], or restricting verb tense and the number of independent clauses in a sentence [29]. Several semantic parsers accept spoken dialog but restrict the conversation domain [4, 24]. Research in processing software requirements frequently accepts semi-structured use case textual descriptions [9, 29, 34, 42]. Semantic parsing systems are domain-specific, so the meaning representation must be appropriate to contain information in the chosen domain. Several earlier semantic parsers relied on semantic frames [36], which are similar to structures containing a set of fields that describe the attributes of each structure instantiation [4, 24]. Message sequence charts are often used as meaning representations for systems which process software specifications [19, 29, 42]. Other meaning representations used include hidden understanding models [35] and abstract state machines [30].

An alternative approach for information extraction is the use of machine learning techniques in the form of text mining [12, 22]. Machine learning approaches use statistical techniques to extract association rules involving word groups.
Machine learning has several applications in NLP including document retrieval, text summarization [47], and semantic role labeling [18, 44].
3.2.1 NLP for Hardware and Software Design

Natural language processing has been applied to several different hardware design problems in the past [10]. Researchers have developed a natural language interface to search through circuit simulation results [40, 41]. The simulation process produces a results file containing a set of triples of the form (signal, voltage, time), and natural language queries are used to specify constraints on each parameter. Researchers have generated partial hardware designs from natural language specifications [20, 21] by identifying a set of concepts expressed, together with a textual pattern for each concept. Any sentence which matches a textual pattern can be mapped to structures in a design data structure defined by the authors. The approach taken in [7] defines a grammar to parse natural language expressions and generates VHDL snippets. More recent efforts have improved the sophistication of the analysis by relying on the semi-formal structure of test scenarios described by acceptance tests [43]. A UML class diagram is generated based on the entities referred to in the scenario, and a UML sequence diagram is generated from the sequence of operations described.

NLP has been applied in the software engineering field to support various problems related to program comprehension. Information has been extracted from software artifacts including source code and code comments [15, 45, 48] and development emails [2]. Program comprehension tasks which have been addressed include concern location, aspect mining [15], traceability, artifact summarization [2], program rule extraction, and code search [45, 46]. The work presented in [45] on program rule extraction is most closely related to our work since program rules are essentially the same as assertions. The work in [45, 48] extracts information from comments, while our work extracts information from hardware specifications. iComment uses "rule templates" to match patterns in sentences, while our approach uses an attribute grammar.
3.3 Issues in Formalizing Natural Language for Hardware

When developing approaches to formalize natural language descriptions of hardware, there are several overarching concerns which fundamentally impact the process and must be considered at the outset. We describe two of the most important concerns in order to motivate and justify our approaches.
3.3.1 Computational Models

The information extracted from a specification document must be represented using a formal, unambiguous computational model. Several types of computational models may be used, since the most efficient representation for each type of information may be different. We broadly classify the types of information into three classes: structural information, which describes the physical objects referred to in the text; behavioral information, which defines constraints on events occurring on objects in the ontology; and behavioral constraints, which express general limits on legal behavior.
3.3.1.1 Structural Representation

In a hardware specification, the most basic objects which will be part of the structural description are wires, state elements, and hierarchical combinations of the two. Hardware specifications will also refer to structural blocks in the design which have associated behaviors. Events occurring on wires and state elements define the behavior of the system, and the purpose of a specification is to define allowable relationships between these events. It is possible to define a specification by referring to only the input and output wires, without explicitly defining any internal state elements. However, it is common for hardware specifications to reference some key state elements, and sometimes internal wires, which are assumed to exist in the design. The most appropriate method to represent a structure, and the most commonly used method in previous work in software requirements analysis [28, 31], is the use of a class hierarchy. Each type of object defined in the structure can be represented as a class with attributes which indicate the basic wires and other classes which are aggregated together. A structural example can be seen in the following sentence from the IEEE 1500 standard specification document [23], "The WPP terminals consist of the wrapper parallel input (WPI) terminal(s), wrapper parallel output (WPO) terminal(s), and wrapper parallel control (WPC) terminals." This sentence defines a class called WPP with the attributes WPO, WPI, and WPC. Both WPO and WPI are simple wires, but WPC is a grouping of wires defined elsewhere in the specification document, so WPC would be defined as another class.
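The class-hierarchy representation above can be sketched in code. The following is a minimal illustration, not the chapter's implementation: the WPI, WPO, and WPC names come from the quoted IEEE 1500 sentence, while the Wire class and the number of WPC wires are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Wire:
    """A basic wire; the leaf object of the structural hierarchy (assumed)."""
    name: str

@dataclass
class WPC:
    """Wrapper parallel control terminals: a grouping of wires defined
    elsewhere in the specification, so it becomes its own class."""
    wires: List[Wire] = field(default_factory=list)

@dataclass
class WPP:
    """Class derived from the quoted sentence: the WPP terminals consist
    of the WPI, WPO, and WPC terminals."""
    wpi: Wire
    wpo: Wire
    wpc: WPC

wpp = WPP(wpi=Wire("WPI"), wpo=Wire("WPO"),
          wpc=WPC([Wire("WPC0"), Wire("WPC1")]))
```

Aggregation (WPP holding a WPC instance) plays the role of the attribute relationships described above; an IS-A relationship would instead be modeled with inheritance.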
3.3.1.2 Behavior Representation

There are many well-accepted formal models for hardware behavior [16, 25, 32, 33]. Process-based models represent behavior as a set of concurrent processes which are internally described in an imperative form using a sequential programming model. Each concurrent process can be represented by a control-dataflow graph model. State-based models describe system behavior by defining a set of states and transitions between states. The type of model chosen to capture meaning in a natural language specification depends on the style in which the behavior is described. We demonstrate the relationship between the writing style and the behavioral model used by comparing the following two descriptions of a generic data transfer operation.

Description 1: A data transfer is initiated when the sender asserts the REQ signal. The sender then waits until the ACK signal is asserted before transmitting the data byte.

Description 2: When in the READY state, the sender asserts the REQ signal and transitions to the WAIT state. The sender remains in the WAIT state until the ACK signal is asserted, which causes it to enter the SEND state. In the SEND state, the sender transmits the data byte.

Both of the specifications above describe the same behavior in different styles. The first description clearly outlines a sequence of events and could most naturally be captured with a process-based model as shown in Fig. 3.3a, using a single process and a sequential program to implement the behavior. The second description is explicitly written in terms of states and transitions, so it could be easily represented using a state-based model as shown in Fig. 3.3b.
Fig. 3.3 Models of behavior, (a) process-based, (b) state-based
Representing the meaning of natural language specifications requires the features of both process-based and state-based models, so the best model would incorporate both. There are several appropriate models to choose from including SpecCharts [38] and SpecC [17].
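The contrast between the two model types can be sketched as follows. This is an illustrative sketch rather than code from the chapter: the event strings and the list-based ACK input are assumptions made so the two models can be compared directly.

```python
def sender_process(ack_stream):
    """Process-based model of Description 1: one sequential program."""
    events = ["REQ asserted"]
    while not next(ack_stream):      # wait until ACK is asserted
        pass
    events.append("data byte transmitted")
    return events

def sender_state_machine(ack_inputs):
    """State-based model of Description 2: explicit states and transitions."""
    state, events = "READY", []
    for ack in ack_inputs:
        if state == "READY":
            events.append("REQ asserted")   # assert REQ, move to WAIT
            state = "WAIT"
        elif state == "WAIT" and ack:
            state = "SEND"                  # ACK asserted: enter SEND
        elif state == "SEND":
            events.append("data byte transmitted")
            state = "READY"
    return events
```

Both models yield the same event sequence for the same ACK behavior, which is the point of the comparison above: the difference lies in the style of the description, not in the behavior being described.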
3.3.1.3 Behavioral Constraints

In addition to explicit descriptions of individual features, hardware specifications also contain behavioral constraints which limit particular aspects of behavior more generally. The following statement taken from the I2C bus protocol specification document [39] is an example: "The data on the SDA line must be stable during the HIGH period of the clock." This statement does not describe a sequence of events associated with a particular behavioral feature. Instead, it is a general constraint on almost all explicit behaviors, including read and write transactions. As a result, the statement is not easily represented using either a process-based or state-based representation. Behavioral constraints are a common part of hardware specifications and are often used to express properties for model checking and assertions for checking during simulation. To accommodate the representation of behavioral constraints, it is convenient to use models which are already accepted for specifying properties and assertions. Temporal logics, such as CTL and LTL, are effective for specifying boolean and temporal relationships between events. Constraint satisfaction programming (CSP) [8] can express arithmetic constraints as well. Efficient solvers exist for CTL properties and CSP formulations. Hardware verification languages, such as SystemVerilog and e, allow assertions to be expressed with additional flexibility, including hierarchy and timing precision. The complexity of assertions makes them unsuitable for automatic theorem proving, but they are useful for result checking during simulation.
3.3.2 Linguistic Variation

Linguistic variation is the property of a language which enables a single concept to be expressed in multiple ways.
Morphological – Morphological variation occurs when synonyms, words with the same meaning, are substituted for one another in otherwise identical sentences.

I hate cats.
I detest felines.

These two sentences display morphological variation because they have the same meaning but differ only in the choice of synonym used in each position in the sentence. The words "hate" and "cats" in the first sentence are replaced with their synonyms "detest" and "felines," respectively.

Syntactic – Syntactic variation describes the use of different sentence structures to express a single concept.

Joe hates cats.
Cats are hated by Joe.

The two sentences use largely the same words, but the word order changes from the active voice in the first sentence to the passive voice in the second.

Pragmatic – Pragmatic variation describes two sentences with different literal meanings but the same connotation.

Your breath stinks.
You might want to try using this toothbrush.

These two sentences have different literal meanings, but they convey the same message to most listeners: your breath stinks. The first sentence is direct, while the second is suggestive, allowing the listener to infer the true meaning.

The chief problem associated with linguistic variation is to ensure that the computational models generated from two semantically equivalent sentences are themselves equivalent, independent of any linguistic variation present.
3.3.2.1 Linguistic Variation in Hardware Descriptions

Morphological Variation – Morphological variation does occur in hardware descriptions. For example, the verbs "set" and "assign" are often used interchangeably. This type of variation can be modeled in a straightforward way using a thesaurus to identify words with the same meaning. Each word is associated with some object in the computational model, and a thesaurus enables all words with the same meaning to be associated with the same model object. The modeling of morphological variation is shown in Fig. 3.4, which shows possible models generated from the sentences "I hate cats" and "I hate felines." The entity-relationship diagram (ERD) in Fig. 3.4a relates the words "cat" and "feline" to a single object "cat_type" using the IS-A relation. The logic expression in Fig. 3.4b contains a predicate "cat_type" which is used to describe both cats and felines. By using a thesaurus to define the scope of the nodes in the ERD of Fig. 3.4a and the domain of the "cat_type" predicate in Fig. 3.4b,
Fig. 3.4 Modeling morphological variation, (a) entity-relationship diagram, (b) predicate calculus
we can ensure that the two sentences with the same meaning have identical computational models.

Syntactic Variation – English grammar has a rich syntax providing many ways to express a single idea. This is true in the domain of hardware specifications, for example, in the sentences "Assign X to one" and "Signal X is asserted," which both describe the same signal assignment. In order to generate equivalent computational models from sentences exhibiting syntactic variation, the meaning of the sentence must be extracted in a way which is robust in the presence of different word orderings. Central to the meaning of any sentence is the verb (or verbs) which it contains. Each declarative sentence contains actions or states involving one or more participants, and the verbs in the sentence describe the action or state. Declarative sentences describe events, as in "Signal X is asserted," and states, such as "Signal Y is low." In order to use the information in a sentence, it is essential to detect the event or state, as well as the participants in the event or state. The verb (or verbs) in a sentence describes the type of action or state. Each participant is said to have a semantic role in the sentence with respect to the action, as usually described by the verb. A simple example can be seen in the sentence "Joe loves cats," where "loves" describes the action, "Joe" has the role of the loving thing, and "cats" has the role of the loved thing. Each sentence is assumed to match a semantic frame [13], which is a template describing the action of a sentence and the semantic roles involved in the action. The sentence "Joe loves cats" can be described with a semantic frame which describes the act of loving and contains two participants, the loving thing and the loved thing. The use of semantic frames to represent information in a way which is syntactically neutral is a well-accepted approach in artificial intelligence research.
We perform parsing to fill a semantic frame for each English sentence, allowing the identification of the key words which fill each semantic role.

Pragmatic Variation – Pragmatic variation exists when context indicates that the intended meaning of a sentence is different from the apparent meaning. For example, if two people are at a store buying a toothbrush and one person says, "You might want to try using this toothbrush," then the meaning should be taken literally. However, if the same sentence is spoken by a person who is forced to be in close physical proximity with another person, then the meaning of the sentence is likely to be an insult about the breath of the listener.
Although pragmatic variation can occur in English, it would never be expected in a hardware description. We expect that a hardware description is always stated in a direct manner for a single purpose, specifying system behavior.
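The thesaurus-based treatment of morphological variation can be sketched as a normalization pass over the words of a sentence. The synonym table below is an illustrative assumption that merely mirrors the hate/detest, cats/felines, and set/assign examples above; it is not data from the chapter.

```python
# Map each surface word to a single object in the computational model.
THESAURUS = {
    "hate": "hate", "detest": "hate",
    "cats": "cat_type", "felines": "cat_type",
    "set": "assign", "assign": "assign",
}

def normalize(sentence):
    """Replace every word by its canonical model object where one is known."""
    return [THESAURUS.get(word, word) for word in sentence.lower().split()]
```

Two morphological variants now reduce to the same token sequence, so any model built from the normalized form is identical for both, e.g. normalize("I hate cats") and normalize("I detest felines") produce the same result.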
3.4 Semantic Parsing

One technique that we use to extract important elements from a sentence is semantic parsing, which uses a syntactic parser to process each sentence according to the rules defined by a context-free grammar (CFG). Each sentence is associated with one or more semantic frames, and the key words in the sentence which perform each semantic role defined by the frame are easily located in the resulting parse tree. Although natural languages are not context-free languages, the use of syntactic parsers is well accepted in the NLP community due to the existence of efficient parsing algorithms. A CFG is defined to capture the English subset of interest, and the parser generates a parse tree representation of the sentence in which each node represents a production used in the parse. An example of a syntactic parse is shown using the example sentence "Set P to one" and the CFG shown in Fig. 3.5. The symbols used in the productions of Fig. 3.5 describe standard constituents in English grammar, including sentence (S), verb phrase (VP), and noun phrase (NP). The parse tree resulting from this grammar is shown in Fig. 3.6a.

Fig. 3.5 Simple context-free grammar for English

S → VP
VP → VB NN PP
VB → "set"
PP → IN NN
IN → "to"
NN → "P"
NN → "one"

Semantic parsing uses a CFG containing symbols which are associated with well-defined semantic interpretations. To demonstrate the semantic parsing process, Fig. 3.6b shows the parse tree generated for the example sentence using the grammar in Fig. 3.7. The semantic grammar includes the SIG symbol indicating a signal name, the VAL symbol indicating a signal value, and the ASGN symbol indicating a signal assignment.

Fig. 3.6 Parse trees generated from (a) an English syntactic grammar, (b) a semantic grammar

Fig. 3.7 Semantic grammar

ASGN → "set" SIG "to" VAL
SIG → "P"
VAL → "one"

The key observation of semantic parsing is that it performs semantic role labeling by highlighting relevant domain-specific information in the parse tree. For example, in order to find the name of the signal being assigned, the parse tree can be searched for the SIG symbol, and the signal name is its child. This signal assignment can now be represented using a process-based representation as an assignment statement of the form SIG = VAL;, where the SIG role is the signal being assigned and the VAL role is the value to which it is assigned. We have developed semantic grammars to perform semantic role labeling for all behavioral concepts which we consider. We accommodate the wide range of linguistic variation by extending the grammar to capture all variations which are common in hardware descriptions.
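The role-extraction step can be sketched directly from the semantic grammar of Fig. 3.7. The pattern matcher below is a hand-rolled stand-in for a real syntactic parser (the chapter's system parses much larger grammars), and the SIG/VAL vocabularies are assumptions made for the example.

```python
SIG_WORDS = {"p": "P"}                 # SIG → "P"
VAL_WORDS = {"one": "1", "zero": "0"}  # VAL → "one" (plus an assumed "zero")

def parse_assignment(sentence):
    """Match ASGN → "set" SIG "to" VAL and return the filled semantic roles."""
    words = sentence.lower().split()
    if (len(words) == 4 and words[0] == "set" and words[2] == "to"
            and words[1] in SIG_WORDS and words[3] in VAL_WORDS):
        return {"frame": "ASGN",
                "SIG": SIG_WORDS[words[1]],
                "VAL": VAL_WORDS[words[3]]}
    return None
```

The filled roles give the process-based assignment directly: for "Set P to one" the SIG role is P and the VAL role is 1, yielding the statement P = 1;.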
3.5 Generating Assertions

We present an approach to generate SystemVerilog assertions from behavioral constraints expressed in English. This work depends only on the English text and does not require the existence of a simulatable model of the design. Our approach is based on the use of an attribute grammar to define the formal semantics of a subset of assertion descriptions in English. Attribute grammars were originally developed by Knuth [27] as a simple yet powerful formalism for expressing the semantics of programming languages. We define an attribute grammar which associates key natural language structures with semantically equivalent SystemVerilog code. We use a parser to recognize important natural language structures in assertion descriptions. The evaluation rules which are part of the attribute grammar are used to generate SystemVerilog code for each structure.
3.5.1 Attribute Grammars

An attribute grammar is a context-free grammar enhanced with attribute values and evaluation rules which compute the attribute values as a function of the attributes of adjacent nodes in the parse tree. We define an attribute grammar to capture the semantics of English assertions as the attribute values of the symbols. The attribute value of each symbol of our grammar is a string of SystemVerilog assertion code. Attribute values can be passed from a node to its parent using a synthesized attribute or from a node to its child using an inherited attribute. The grammar that we define in this paper is said to be S-attributed because it only uses synthesized attributes. Each production is associated with an evaluation function which computes the attribute value of the symbol on the left-hand side of the production. The attribute value of the root node of the parse tree is the semantic meaning of the parsed string. In our notation, each production is followed by its attribute evaluation rule in square brackets, and the "+" symbol represents the string concatenation operation.

As a demonstrative example, we modify the semantic grammar in Fig. 3.7 to produce the attribute grammar shown in Fig. 3.8. The attribute values in this grammar are equivalent strings of C code. When the attribute grammar is used to parse the sentence "Set P to one," the attribute value of the ASGN symbol is "P = 1;" which is the C code equivalent to the sentence.

Fig. 3.8 Attribute grammar

ASGN → "set" SIG "to" VAL
  [ASGN.v = SIG.v + " = " + VAL.v + ";"]
SIG → "P"
  [SIG.v = "P"]
VAL → "one"
  [VAL.v = "1"]

An important property of attribute grammars is that they can easily be extended to handle linguistic variation. As an example, consider an English statement which declares that signal P should be assigned the value 1. The grammar shown in Fig. 3.8 will parse the sentence "Set P to one," but it will not parse "P must be assigned to one," which uses the passive voice. However, by including the production shown in Fig. 3.9, the sentence written in the passive voice can be parsed as well. The attribute grammar can be constructed in a general way to accommodate all of the linguistic variation required.

Fig. 3.9 Extension to an attribute grammar

ASGN → SIG "must be assigned to" VAL
  [ASGN.v = SIG.v + " = " + VAL.v + ";"]
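The evaluation rules of Figs. 3.8 and 3.9 can be sketched as follows. Parsing is reduced to two fixed patterns so the synthesized-attribute computation stays visible; the signal and value vocabularies are assumptions made for the example.

```python
SIG_V = {"p": "P"}                 # [SIG.v = "P"]
VAL_V = {"one": "1", "zero": "0"}  # [VAL.v = "1"], plus an assumed "zero"

def asgn_attribute(sentence):
    """Compute ASGN.v, the code string synthesized for an assignment sentence."""
    w = sentence.lower().rstrip(".").split()
    # ASGN → "set" SIG "to" VAL   [ASGN.v = SIG.v + " = " + VAL.v + ";"]
    if len(w) == 4 and w[0] == "set" and w[2] == "to":
        return SIG_V[w[1]] + " = " + VAL_V[w[3]] + ";"
    # ASGN → SIG "must be assigned to" VAL   (the Fig. 3.9 extension)
    if len(w) == 6 and w[1:5] == ["must", "be", "assigned", "to"]:
        return SIG_V[w[0]] + " = " + VAL_V[w[5]] + ";"
    raise ValueError("sentence not covered by the grammar")
```

Both the active and passive forms synthesize the same attribute value, which is exactly how the extended grammar absorbs syntactic variation.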
3.5.2 System Overview

Figure 3.10 shows the structure of our system and the flow of data between its components. The system starts with the English Assertion at the top left. The first processing step uses the Recursive Descent Parser to generate a Parse Tree. The second processing step performs Attribute Evaluation to generate the SystemVerilog Assertion, which is the attribute of the root node of the parse tree. Both processing steps use the Attribute Grammar which we present in this paper. The recursive descent parser uses Backus-Naur Form (BNF) productions of the grammar to perform parsing. The attribute grammar includes evaluation rules, associated with each production, which are evaluated to generate semantically equivalent SystemVerilog assertions. The recursive descent parser which we use is an off-the-shelf component taken from the open-source Natural Language Toolkit [3]. Attribute evaluation is well understood, and we implement an existing technique using a left-to-right depth-first traversal of the parse tree [11].

Fig. 3.10 System structure
3.5.3 Attribute Grammar

We present an attribute grammar which parses a class of assertion descriptions written in English and produces SystemVerilog assertions which are semantically equivalent to the English descriptions. The following subsections present the productions of the grammar, grouped based on the hardware description concepts which the productions recognize. Linguistic variation in English ensures that there are several ways to express each concept. In each subsection we describe a set of ways in which each concept is expressed in English, and we describe the features of our grammar which capture each method of expression. Productions in the grammar are associated with an attribute value, labeled with the suffix .sv, which is a string of equivalent SystemVerilog code. The attribute value of the root node of each parse tree is the SystemVerilog assertion which is equivalent to the parsed sentence.
Fig. 3.11 Productions for constants
CST → "0"
  [CST.sv = "0"]
CTR → CST
  [CTR.sv = CST.sv]
CTR → DET "value" "of" CST
  [CTR.sv = CST.sv]
DET → "a" | "an" | "the"
  [DET.sv = ∅]
3.5.3.1 Constants

Constants may be referred to directly by their names or indirectly by referencing their value. An example of a direct reference to the constant "1" would be "V is assigned to 1," and an indirect reference example would be "V is assigned to the value of 1." In Fig. 3.11, the productions for the CST symbol define all constant names, although only the definition of the constant "0" is shown. The symbol CTR captures indirect constant references.
3.5.3.2 Signals and Storage Elements

The SN symbol captures all valid signal names in the system. It is common practice to provide a list of all key signals and storage elements in any hardware specification, so we assume that such a list is provided, and we generate productions for the SN symbol to recognize signal names. Although all signals and storage elements are described by SN productions, only the one for the signal "awvalid" is shown in Fig. 3.12. The attribute value of each SN production is the name of the signal or storage element. References to signals and storage elements can be either direct or indirect. A direct reference may use only the signal name, such as OPCODE in the sentence, "OPCODE must be reset." A direct reference may also include a determiner and a label specifying what type of storage element is being referred to, such as "The OPCODE register must be reset." The SL symbol describes the possible labels, the SLR symbol describes direct references with labels, and the SDI symbol captures all direct signal references. Indirect signal references indicate the value of the signal rather than the signal itself, such as "The value of the OPCODE register must be reset." The IND symbol describes the "the value of" string used to identify indirect signal references. The SDE symbol captures all indirect signal references, and the SR symbol captures all signal references, both direct and indirect.
Fig. 3.12 Productions for signals and storage elements

SN → "awvalid"
  [SN.sv = "awvalid"]
SL → "signal" | "wire" | "register" | "bus"
  [SL.sv = ∅]
SLR → "the" SN SL
  [SLR.sv = SN.sv]
SDI → SN
  [SDI.sv = SN.sv]
SDI → SLR
  [SDI.sv = SLR.sv]
IND → "the" "value" "of"
  [IND.sv = ∅]
SDE → IND SN
  [SDE.sv = SN.sv]
SDE → IND SLR
  [SDE.sv = SLR.sv]
SR → SDE
  [SR.sv = SDE.sv]
SR → SDI
  [SR.sv = SDI.sv]
3.5.3.3 Events

Hardware descriptions refer to events on signals in order to place constraints on those events. We allow two types of event references: transition references and assignment references. A transition reference can indicate either a rising edge, a falling edge, or a transition of any kind. These transitions are captured by the symbols TU, TD, and TA in Fig. 3.13. The attribute values for the transitions use the SystemVerilog functions for transition detection, $rose, $fell, and $stable. An assignment reference uses a signal assignment as a noun phrase in a sentence. One type of assignment reference refers to a constant and uses a prepositional phrase to indicate the signal being assigned. An example is "a value of 1 on v," where the prepositional phrase "on v" indicates the signal being assigned. The other type of assignment reference uses a gerund phrase to indicate the signal being assigned. An example is "assigning v to 1," where the subject of the gerund "assigning" is the signal name and the prepositional phrase "to 1" indicates the value to which the signal is assigned. The symbol AR captures both types of assignment references, and the ER symbol captures both assignment and transition references.
Fig. 3.13 Productions for events
TU → "a" "rising" "edge" "on" SN
  [TU.sv = "$rose(" + SN.sv + ")"]
TD → "a" "falling" "edge" "on" SN
  [TD.sv = "$fell(" + SN.sv + ")"]
TA → "a" "transition" "on" SN
  [TA.sv = "!$stable(" + SN.sv + ")"]
TR → TU
  [TR.sv = TU.sv]
TR → TD
  [TR.sv = TD.sv]
TR → TA
  [TR.sv = TA.sv]
AG → "assigning" | "setting"
  [AG.sv = ∅]
AR → AG SDI "to" CST
  [AR.sv = SDI.sv + " == " + CST.sv]
AR → CTR "on" SN
  [AR.sv = SN.sv + " == " + CTR.sv]
ER → TR
  [ER.sv = TR.sv]
ER → AR
  [ER.sv = AR.sv]
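The transition-reference rules TU, TD, and TA translate directly into string-building code. The following is a minimal sketch, assuming the phrase has already been tokenized; only the three transition forms of Fig. 3.13 are covered, and the SystemVerilog sampled-value functions $rose, $fell, and $stable are used.

```python
def transition_reference(phrase):
    """Synthesize the .sv attribute string for a transition reference."""
    w = phrase.lower().split()
    if w[:4] == ["a", "rising", "edge", "on"]:      # TU
        return "$rose(" + w[4] + ")"
    if w[:4] == ["a", "falling", "edge", "on"]:     # TD
        return "$fell(" + w[4] + ")"
    if w[:3] == ["a", "transition", "on"]:          # TA
        return "!$stable(" + w[3] + ")"
    raise ValueError("not a transition reference")
```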
3.5.3.4 Comparison Operations

Constraints in hardware descriptions are frequently specified using comparison operations. We accept the following comparisons and their complements: equal, greater than, and less than. Examples include "v is greater than 1" and "v must be equal to w." A comparative relation is typically indicated in text by a form of the verb "to be," sometimes together with a modal verb such as "must" or "can." The productions for comparisons are shown in Fig. 3.14, which captures equality statements; Fig. 3.15, which captures inequality statements; and Fig. 3.16, which captures magnitude comparison statements. The RL symbol describes all of the verbs which indicate a comparative relation, and RN describes their negations. The symbols EQ (EQN), GR (GN), and LS (LN) capture comparisons of the type equal (not equal), greater than (not greater than), and less than (not less than), respectively. The attribute values of comparison productions use the appropriate comparison operators built into SystemVerilog.
Fig. 3.14 Productions for equality statements

RL → "is" | "must" "be" | "remains"
RL → "must" "remain"
  [RL.sv = ∅]
RN → "is" "not" | "must" "not" "be"
RN → "cannot" "be" | "must" "not" "remain"
  [RN.sv = ∅]
EQ → SR RL CST
  [EQ.sv = SR.sv + " == " + CST.sv]
EQ → SR "equals" CST
  [EQ.sv = SR.sv + " == " + CST.sv]
EQ → SR RL "equal" "to" CST
  [EQ.sv = SR.sv + " == " + CST.sv]
EQ → CST RL SDE
  [EQ.sv = SDE.sv + " == " + CST.sv]
EQ → CST RL "equal" "to" SDE
  [EQ.sv = SDE.sv + " == " + CST.sv]
Fig. 3.15 Productions for inequality statements
EQN → SR RN CST
  [EQN.sv = SR.sv + " != " + CST.sv]
EQN → SR RN "equal" "to" CST
  [EQN.sv = SR.sv + " != " + CST.sv]
EQN → SR RL "does" "not" "equal" CST
  [EQN.sv = SR.sv + " != " + CST.sv]
EQN → CST RN SDE
  [EQN.sv = SDE.sv + " != " + CST.sv]
EQN → CST RN "equal" "to" SDE
  [EQN.sv = SDE.sv + " != " + CST.sv]

3.5.3.5 Event Constraints
Hardware descriptions can constrain the possible events which can occur, both transition events and assignment events. We accept event constraints which are exclusionary, indicating that an event cannot occur. Transition events can be constrained in a positive sense by stating that a signal must remain stable, such as “V must be stable.” Figure 3.17 shows the productions for the symbol ST which is used to capture statements of stability and the symbol EX which captures exclusionary event constraints.
Fig. 3.16 Productions for magnitude comparison statements
GR → SR RL "greater" "than" CST
  [GR.sv = SR.sv + " > " + CST.sv]
GR → SR1 RL "greater" "than" SR2
  [GR.sv = SR1.sv + " > " + SR2.sv]
GN → SR RN "greater" "than" CST
  [GN.sv = SR.sv + " <= " + CST.sv]
GN → SR1 RN "greater" "than" SR2
  [GN.sv = SR1.sv + " <= " + SR2.sv]
LS → SR RL "less" "than" CST
  [LS.sv = SR.sv + " < " + CST.sv]
LS → SR1 RL "less" "than" SR2
  [LS.sv = SR1.sv + " < " + SR2.sv]
LN → SR RN "less" "than" CST
  [LN.sv = SR.sv + " >= " + CST.sv]
LN → SR1 RN "less" "than" SR2
  [LN.sv = SR1.sv + " >= " + SR2.sv]
Fig. 3.17 Productions for event constraints
SW → "stable" | "constant" | "fixed"
ST → SR RL SW
  [ST.sv = "$stable(" + SR.sv + ")"]
EX → ER "is" "not" "permitted"
  [EX.sv = "!(" + ER.sv + ")"]
EX → ER "is" "not" "allowed"
  [EX.sv = "!(" + ER.sv + ")"]
3.5.3.6 Boolean Logic

Constraints can be combined using boolean constructs, as in "V is greater than 1 and w is equal to 1." We allow the use of the words "and" and "or" to indicate the boolean combination of constraints. The symbol CB captures all basic constraints which do not include boolean constructs. Figure 3.18 presents productions for CB which match all arithmetic comparison and event constraints. The CA (CO) symbol describes the conjunction (disjunction) of any set of basic constraints. The CL symbol captures all boolean combinations of basic constraints involving "and" and "or" operations.
Fig. 3.18 Productions for boolean logic

CB → EQ
  [CB.sv = EQ.sv]
CB → EQN
  [CB.sv = EQN.sv]
CB → GR
  [CB.sv = GR.sv]
CB → GN
  [CB.sv = GN.sv]
CB → LS
  [CB.sv = LS.sv]
CB → LN
  [CB.sv = LN.sv]
CL → CB
  [CL.sv = CB.sv]
CL → CA
  [CL.sv = CA.sv]
CL → CO
  [CL.sv = CO.sv]
CA → CB "and" CL
  [CA.sv = CB.sv + " && " + CL.sv]
CO → CB "or" CL
  [CO.sv = CB.sv + " || " + CL.sv]
3.5.3.7 Implication

Implication is a concept commonly expressed in hardware descriptions. There are several ways in which implication is expressed in English, using the key words "if," "then," and "when." Examples of implications include "V must be equal to 1 when W is asserted" and "If W is asserted then V must be equal to 1." The productions for the symbols for the antecedent (AN) and consequent (CN) of an implication are shown in Fig. 3.19. Both antecedents and consequents match any boolean combination of basic constraints. The productions for the symbol CI, which captures implication constraints, are also shown in Fig. 3.19.
AN → CL [AN.sv = CL.sv]
CN → CL [CN.sv = CL.sv]
CI → "if" AN "then" CN [CI.sv = "!(" + AN.sv + ") || (" + CN.sv + ")"]
CI → "when" AN "," CN [CI.sv = "!(" + AN.sv + ") || (" + CN.sv + ")"]
CI → CN "when" AN [CI.sv = "!(" + AN.sv + ") || (" + CN.sv + ")"]
Fig. 3.19 Productions for implications

3.5.3.8 Assertion Sentences

S → CL [S.sv = "assert property (" + CL.sv + ");"]
S → CI [S.sv = "assert property (" + CI.sv + ");"]
Fig. 3.20 Productions for assertion sentences
Each assertion is assumed to be expressed in a single sentence. A sentence is the basic unit of text which is parsed, and the sentence symbol S is the root node of any parse tree. We assume that each sentence being parsed is the expression of a constraint on system signals and storage elements. A sentence can be either a boolean combination of basic constraints or an implication. Figure 3.20 shows the productions for the S symbol.
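The semantic rules for implication and for the sentence symbol compose naturally: an implication "if A then C" is rewritten as `!(A) || (C)`, which S then wraps in a SystemVerilog assertion. A hedged Python sketch (function names are illustrative, not the authors' code):

```python
# Sketch of the CI and S attribute rules: implication is translated into
# the equivalent boolean form !(antecedent) || (consequent), and the
# sentence symbol wraps the result in "assert property (...);".
def implication_sv(antecedent_sv, consequent_sv):
    return "!(" + antecedent_sv + ") || (" + consequent_sv + ")"

def assertion_sv(constraint_sv):
    return "assert property (" + constraint_sv + ");"

print(assertion_sv(implication_sv("awvalid == 1", "!(awburst == 2'b11)")))
# assert property (!(awvalid == 1) || (!(awburst == 2'b11)));
```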
3.5.4 Experimental Results

The system was implemented in Python using the API provided in the Natural Language Toolkit [3] to create a recursive descent syntactic parser. All results were generated on an Intel Core i5 processor, 3.2 GHz, with 8 GB RAM.
3.5.4.1 Benchmark Set
To evaluate our system it was necessary to identify a set of assertions for a real system, specified in English. As a benchmark set of assertions, we have used the assertions developed by ARM Inc. for the verification of AXI protocol
assert property (!(awvalid == 1) || (!(awburst == 2'b11)));
Fig. 3.21 SystemVerilog assertion created from an AXI assertion
Fig. 3.22 Parse tree created from an AXI assertion
implementations [1]. We evaluated all assertions checking the write and read channels including write/read address channel checks, write/read data channel checks, and write response channel checks. The benchmark set of assertions consists of 117 individual assertions, expressed in English.
3.5.4.2 Results on Benchmarks
Our tool successfully generated SystemVerilog assertions for 52 of the 117 assertions (44%). The total CPU time required to generate all SystemVerilog assertions was 199 s, an average of 3.82 s per assertion. Figure 3.22 shows the parse tree resulting from the AXI assertion, "A value of 2'b11 on awburst is not permitted when awvalid is high" [1]. The SystemVerilog assertion generated from this English assertion is shown in Fig. 3.21. Near the top of the parse tree, the CI symbol indicates that an implication was recognized, of the form CONSEQUENT "when" ANTECEDENT. The consequent and antecedent are the subtrees under the CN and AN symbols, respectively. The antecedent is the phrase "awvalid is high," the subtree under the EQ symbol, indicating that it represents an equality constraint. The antecedent and consequent appear in the SystemVerilog assertion as the terms "awvalid == 1" and "!(awburst == 2'b11)."
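As a toy illustration of the kind of production exercised by this example (an assumption on our part, not the authors' grammar), the phrase "\<signal\> is high/low" can be mapped onto a SystemVerilog equality constraint as follows:

```python
# Toy sketch of an equality production: match SR "is" LEVEL and
# synthesize the corresponding SystemVerilog comparison string.
LEVELS = {"high": "1", "low": "0"}

def parse_eq(tokens):
    """tokens: [signal, 'is', level]; returns EQ.sv or raises."""
    sig, verb, level = tokens
    if verb != "is" or level not in LEVELS:
        raise ValueError("not an equality constraint")
    return sig + " == " + LEVELS[level]

print(parse_eq("awvalid is high".split()))  # awvalid == 1
```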
3.5.4.3 Limitations of Assertion Generation
The range of English expression which our system can process is limited by the attribute grammar which we have defined. In order to better understand the practical limits of the grammar, we have examined the English assertions in our benchmark set which our grammar failed to parse. We have identified three features of the unparsed assertions which caused our system to fail.

Sequential Constraints: Our system cannot parse sequential constraints which describe properties spanning multiple clock cycles. An example of such a constraint in our benchmark set is, "Recommended that wready is asserted within MAXWAITS cycles of WVALID being asserted."

Object Hierarchy: Our system only accepts constraints directly on the values of signals and storage elements. However, assertions may constrain a more abstract object or event. An example in our benchmark set is, "The size of a read transaction must not exceed the width of the data interface." This assertion refers to the abstract "read transaction" event which is defined elsewhere in the specification, and it constrains the "size" property of this event.

Multiple Sentences: Our system assumes that each assertion is expressed in a single sentence, but this is sometimes not convenient in practice. A sample assertion in our benchmark set which requires multiple sentences is, "The number of write data items matches awlen for the corresponding address. This is triggered when any of the following occurs: . . . ." Accepting assertions expressed across multiple sentences introduces several referencing issues, including anaphora resolution, which relates the word "This" in the second sentence to the constraint defined in the first sentence. Although we did not address them here, reference problems, including anaphora resolution, have been well studied in the field of natural language processing.
In order to address these limitations, it was necessary to expand our work from processing assertions, which describe an instant in time, to processing entire transactions, which model sequences of events over a span of time. Considering transactions is the topic of the following section.
3.6 Generating Transactors

We present an approach to automatically generate simulatable bus transactors directly from natural language bus protocol specifications. Our technique employs semantic parsing to produce Verilog transactors with high timing fidelity [5]; all significant events of the protocol are modeled explicitly. We identify a set of transaction concepts, which are ideas commonly used in natural language descriptions of bus protocols to express different aspects of a transaction. Each transaction concept is recognized in the natural language description using a set of context-free grammar (CFG) productions which we define. The resulting parse
Fig. 3.23 Transaction graph representation, shifting ‘01’ into a shift register
trees are scanned to locate each transaction concept and generate appropriate Verilog code. Automating the generation of bus transactors reduces design and verification time, eliminating the need to manually design and verify each transactor.
3.6.1 Transactions and Transactors

We describe a transaction as a hierarchical sequence of events on a set of signals. Each event may write a value to a signal and read a value from one or more signals. Figure 3.23 shows the transaction graph representation of a transaction to shift the value '01' into a two-bit shift register. We assume that the shift register has two single-bit inputs: Din, the data input, and Clk, the clock input. A bit is shifted into the register by assigning Din to the appropriate value and causing a rising edge on the Clk input. The transaction, referred to as Trans01, is described hierarchically and drawn as a directed acyclic graph in Fig. 3.23. The leaf nodes in the graph are assignments to input signals of the shift register. Each non-leaf node is a transaction which is defined by the sequence of its successor nodes. For example, the ClockEdge transaction is defined by the sequential execution of the assignments Clk=0 and Clk=1.
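The hierarchical structure of Fig. 3.23 can be sketched as nested tuples, with a depth-first traversal recovering the flat sequence of leaf assignments. The node names and the order in which the two bits are shifted are our assumptions; the shared ClockEdge sub-transaction is taken from the text:

```python
# Sketch of the Trans01 transaction graph: non-leaf nodes are
# (name, [children]); leaves (empty child list) are signal assignments.
clock_edge = ("ClockEdge", [("Clk=0", []), ("Clk=1", [])])
trans01 = ("Trans01", [
    ("ShiftBit0", [("Din=0", []), clock_edge]),   # bit order assumed
    ("ShiftBit1", [("Din=1", []), clock_edge]),
])

def flatten(node):
    """Depth-first traversal yielding the leaf assignments in order."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in flatten(child)]

print(flatten(trans01))
# ['Din=0', 'Clk=0', 'Clk=1', 'Din=1', 'Clk=0', 'Clk=1']
```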
3.6.1.1 Bus Transactors
A bus transactor implements a transaction defined in a protocol, acting as the link between a transaction generator and a bus. The generic transactor interface which we assume is shown in Fig. 3.24. In the figure, the transactor is shown to interface with the bus signals on the right. The transactor is implemented as a Verilog task, so a transaction generator invokes the transactor task with a set of arguments defined by the transactor interface. The interface includes the three signals shown in bold on the left side of Fig. 3.24. Many bus protocols include a unique address for each device on the bus, so the Address signal is defined to hold the address of the receiver. Tx contains the data to be transmitted during a write transaction, and Rx contains the data received during a read transaction. The width of the Address, Tx, and Rx signals must be declared in the specification document. References to the Address (Ad), Tx, and Rx signals in the specification document are considered differently than normal signal references. We assume that the Ad and Tx data will be transmitted on the bus and that the Rx data is read from the bus. We
assume that Ad and Tx are stored in transmit queues, so when data is read from them the data is removed from the queue. We also assume that Rx is stored in a receive queue so that when data is written to Rx, it does not overwrite existing data in the queue.

Fig. 3.24 Transactor interface
Fig. 3.25 Structure of the natural language transactor generation system
3.6.2 System Overview

Figure 3.25 depicts the structure of the proposed system for transactor generation from a natural language description. The natural language description is processed using a Semantic Parser to generate a parse tree. The semantic parser is built using an off-the-shelf syntactic parser, the Natural Language Toolkit [3], and a semantic grammar which we define for this application. Information Extraction is applied to the resulting parse tree to generate a semantic representation which contains all behavioral information about each transaction. The semantic representation is used to perform Transactor Generation and generate a set of Verilog transactors which accurately model the behavior of the specified transactions.
3.6.3 Transaction Concepts

We have defined a semantic grammar to identify the expression of transaction concepts in natural language descriptions of bus transaction protocols. The grammar which we present is not sufficient to parse all legal descriptions; however, it is
broad enough to parse a useful subclass of all descriptions. In this section we define the subclass of descriptions which can be parsed by our grammar. We present the syntactic patterns which are recognized by our grammar and a subset of the production rules used to parse each pattern. The productions we use to recognize standard English grammatical constructs (i.e., noun phrase, verb phrase, etc.) are a subset of those presented in [26]. We present a set of transaction concepts which are ideas used in the natural language description to express different aspects of a transaction. The transaction concepts are expressed in the natural language specification to describe the protocol, and each transaction concept can be mapped directly to Verilog constructs. Each transaction concept is recognized in the natural language description using a set of CFG productions which we define.

• Signal Definitions – All input and output signals must be declared in the natural language document. The SIGDEF grammatical symbol captures each signal declaration.
• Transaction References – Each transaction must be referred to in the text using some noun phrase as a unique identifier. The symbol TRANSREF is used to capture transaction references.
• Sequence Descriptions – Each transaction is composed of a sequence of other transactions which are lower in the transaction hierarchy. The three symbols FULLSEQUENCE, PREFIXSEQUENCE, and SUFFIXSEQUENCE are used to capture sequence descriptions.
• Signal Reading/Writing – Transactions at the lowest hierarchical level must directly interact with signals. The symbols ASSIGNDIRECT, ASSIGNSTRUCT, and ASSIGNIMPL are used to capture signal reading and writing.

The transaction concepts and their associated CFG productions are described in the following sections.
3.6.3.1 Signal Definitions
We assume that all input and output signals are declared in the document. The top production used to capture signal definitions is shown in Fig. 3.26. The symbol SIGNAME matches any string and is assumed to be a proper noun which is used to reference the signal. The DIRECTION symbol matches one of the following values: "input," "output," and "input/output." The WIDTH symbol is a numeral indicating the bitwidth of the signal.

SIGDEF → SIGNAME "is an" DIRECTION "signal," WIDTH "bit(s) wide"
Fig. 3.26 Productions to recognize signal definitions
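The SIGDEF surface form is regular enough that it can be sketched as a single pattern. The regular expression below follows the production; the code itself is our illustration, not the authors' implementation:

```python
import re

# Sketch of the SIGDEF pattern of Fig. 3.26: capture the signal name,
# direction, and bit width from a declaration sentence.
SIGDEF = re.compile(
    r"(?P<name>\w+) is an (?P<direction>input/output|input|output) "
    r"signal, (?P<width>\d+) bit\(s\) wide")

m = SIGDEF.match("SDA is an input/output signal, 1 bit(s) wide")
print(m.group("name"), m.group("direction"), m.group("width"))
# SDA input/output 1
```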
3.6.3.2 Transaction References

TRANSREF → SNP | GNP
GNP → VG NP
VG → "sending" | "transmitting" | "receiving"
SNP → NP
Fig. 3.27 Productions to recognize transaction references
We assume that transactions are referred to in one of two ways in the document, either as a simple noun phrase (SNP) or a gerundive noun phrase (GNP). A simple noun phrase is a noun phrase which contains a head noun, a set of modifiers for the noun, and a determiner. An example of a simple noun phrase reference is "a write transaction," with the head noun "transaction," the modifier "write," and the indefinite article "a" which acts as a determiner. A gerundive noun phrase uses a verb (i.e., "send") in gerund form (i.e., "sending") as the first modifier for the head noun. For example, "sending a data byte" refers to a transaction which transmits a byte of data. The gerund used in a gerundive noun phrase describes the movement of information, so we assume that it is either "sending," "transmitting," or "receiving." The gerund indicates the direction of data flow with respect to the bus master. Figure 3.27 shows a subset of the productions used to recognize transaction references and to identify their head noun and modifiers. In the figure, the symbol TRANSREF represents a transaction reference, and the symbol NP represents a noun phrase. In a gerundive noun phrase, the head noun may be plural to indicate iterative execution. For example, in the sentence "Sending a byte is performed by sending 8 bits," the transaction "sending bits" is plural and describes the sending of a single bit eight times. Recognition of plurals is well understood in natural language processing research. We add productions to identify plural head nouns and to identify the numeral representing the number of iterations.
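A toy recognizer for gerundive noun phrases (an assumption on our part, not the authors' grammar) shows how the gerund and an optional numeral before the head noun can be separated out:

```python
# Sketch: recognize a gerundive noun phrase such as "sending 8 bits".
# The gerund gives the direction of data flow; a numeral before a plural
# head noun gives the iteration count (default 1).
GERUNDS = {"sending", "transmitting", "receiving"}

def parse_gnp(phrase):
    tokens = phrase.split()
    if tokens[0] not in GERUNDS:
        return None
    iterations, rest = 1, tokens[1:]
    if rest and rest[0].isdigit():
        iterations, rest = int(rest[0]), rest[1:]
    return {"gerund": tokens[0], "iterations": iterations,
            "head": " ".join(rest)}

print(parse_gnp("sending 8 bits"))
# {'gerund': 'sending', 'iterations': 8, 'head': 'bits'}
```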
3.6.3.3 Sequence Descriptions
The relationship between a transaction and the sequence of sub-transactions which compose it can be expressed with several syntactic patterns. In order to recognize sequence descriptions we define a set of cues, multiword substrings which we expect to find in a sequence description. Cues are included in the production rules of our grammar to allow us to identify the elements of a sequence. We define three different syntactic patterns which indicate that a sentence contains sequence information. The full sequence pattern matches sentences which define the entire sequence of sub-transactions composing a transaction. An example of a sentence which matches the full sequence pattern is the following: “Sending a byte is performed by sending 8 bits and receiving an acknowledge bit.” In this sentence, “Sending a byte” is the
Fig. 3.28 Productions to recognize the full sequence pattern
Fig. 3.29 Productions to recognize the prefix and suffix patterns
S → FULLSEQUENCE
FULLSEQUENCE → TRANSHEAD CUE_FULL_SEQ TRANSLIST
TRANSHEAD → TRANSREF
TRANSLIST → TRANSREF
TRANSLIST → TRANSREF "and" TRANSREF
TRANSLIST → TRANSREF "," TRANSLIST
CUE_FULL_SEQ → "is performed by" | "is transmitted by" | "is executed by" | "is sent by"
S → PREFIXSEQUENCE
S → SUFFIXSEQUENCE
PREFIXSEQUENCE → TRANSHEAD CUE_PREFIX TRANSLIST
SUFFIXSEQUENCE → TRANSHEAD CUE_SUFFIX TRANSLIST
CUE_PREFIX → "begins with" | "starts with"
CUE_SUFFIX → "ends with"
head transaction, “is performed by” is the cue substring, “sending 8 bits” is the first sub-transaction, and “receiving an acknowledge bit” is the second sub-transaction. Figure 3.28 lists the productions added to our grammar to recognize the full sequence pattern and its components. The TRANSHEAD symbol is a reference to the transaction being defined in the sentence. The TRANSLIST symbol represents an unbounded sequence of transaction references. CUE_FULL_SEQ matches the set of substrings which identify the full sequence pattern. CUE_FULL_SEQ is defined as the following set of substrings: “is performed by,” “is transmitted by,” “is executed by,” and “is sent by.” The second and third syntactic patterns to describe sequence are the prefix pattern and the suffix pattern. The only difference between the three sequence syntactic patterns is the cues used. The cues used for the prefix pattern are “begins with” and “starts with.” The cue for the suffix pattern is “ends with.” The productions added to recognize the prefix and suffix patterns are shown in Fig. 3.29.
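The cue-based recognition above can be sketched in a few lines of Python: find a cue substring, treat the text before it as the head transaction, and split the text after it into the ordered sub-transaction list. Names and the splitting heuristic are our assumptions:

```python
import re

# Sketch of the full sequence pattern of Fig. 3.28: the head transaction
# precedes the cue, and the sub-transactions follow it, separated by
# commas and/or "and".
CUES = ["is performed by", "is transmitted by", "is executed by", "is sent by"]

def parse_full_sequence(sentence):
    for cue in CUES:
        if cue in sentence:
            head, rest = sentence.split(cue, 1)
            parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", rest.strip())
            return head.strip(), [p.strip() for p in parts if p.strip()]
    return None

head, subs = parse_full_sequence(
    "Sending a byte is performed by sending 8 bits and receiving an acknowledge bit")
print(head)  # Sending a byte
print(subs)  # ['sending 8 bits', 'receiving an acknowledge bit']
```

The prefix and suffix patterns differ only in the cue list used, so the same routine applies with "begins with"/"starts with" or "ends with".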
3.6.3.4 Reading and Writing Signals
Each transaction is a hierarchical sequence of events on a set of signals. We define three types of signal assignment descriptions, a direct signal assignment, a structured signal assignment, and an implicit signal assignment. A direct signal assignment is expressed with a gerundive noun phrase as in the sentence, “Transmitting a bit is performed by setting X to 1.” In this sentence, the
ASSIGNDIRECT → CUE_DIRECT SIGNAME "to" SIGVALUE
Fig. 3.30 Direct signal assignment production

ASSIGNSTRUCT → CUE_STRUCT "on" SIGNAME
Fig. 3.31 Structured signal assignment production

ASSIGNIMPL → SIGASSIGN "while" SIGCOND
Fig. 3.32 Implicit signal assignment production
gerundive noun phrase “setting X to 1” indicates the direct signal assignment. The top production for a direct signal assignment is shown in Fig. 3.30. In Fig. 3.30, CUE_DIRECT is the set of direct assignment cues which we recognize: “setting” and “assigning.” SIGNAME matches any string and should match the name of a signal defined in another sentence. SIGVALUE matches the set of values to which a signal can be assigned. For single-bit signals, the SIGVALUE set includes “0,” “1,” and “Z.” We also allow the description to apply the gerund “releasing” to a signal to indicate that it should be assigned to the value Z. In this case, the form of the signal assignment is “releasing” SIGNAME, and no SIGVALUE is required. The description can also specify a structured signal assignment which is a predefined sequence of assignments on a signal. Structured signal assignments are assignments which would be commonly understood by any designer without having to be explicitly defined in the specification. We define three structured signal assignments, generating a pulse, generating a rising edge, and generating a falling edge. The top production for a structured signal assignment is shown in Fig. 3.31. In Fig. 3.31, CUE_STRUCT matches the following multiword strings: “generating a pulse,” “generating a rising edge,” and “generating a falling edge.” The transaction description may require a signal assignment to be performed without explicitly declaring the assignment. This type of implicit signal assignment can occur when the word “while” is used to specify a condition on a signal. An example of an implicit signal assignment can be seen in the following sentence: “Sending a bit is performed by setting SDA to 1 while SCL is low.” In this sentence, the gerundive noun phrase “setting SDA to 1” indicates an assignment to signal SDA. The phrase “while SCL is low” implies that the SCL signal must be assigned to 0 before the SDA signal assignment occurs. 
The production for an implicit signal assignment is shown in Fig. 3.32. SIGASSIGN matches either a direct or structured signal assignment, as shown in Figs. 3.30 and 3.31. SIGCOND is a condition on a signal which is assumed to be of the following form: SIGNAME "is" SIGVALUE.
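The ordering implied by a "while" condition can be sketched as follows (function names and the level mapping are our illustration): the conditioned signal is assigned first, so that the condition holds before the explicit assignment occurs.

```python
# Sketch of the implicit assignment of Fig. 3.32: "setting SDA to 1 while
# SCL is low" implies SCL must be driven to 0 before SDA is driven to 1.
LEVELS = {"high": "1", "low": "0", "0": "0", "1": "1", "Z": "Z"}

def implicit_order(assign, while_cond):
    """assign, while_cond: (signal, value) pairs; returns ordered events."""
    cond_sig, cond_val = while_cond
    # The condition must hold before the assignment occurs.
    return [(cond_sig, LEVELS[cond_val]), assign]

# "setting SDA to 1 while SCL is low"
print(implicit_order(("SDA", "1"), ("SCL", "low")))
# [('SCL', '0'), ('SDA', '1')]
```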
3.6.4 Information Extraction

The information extraction stage analyzes the parse tree to find all information needed to generate a simulatable bus transactor. The output of this stage is a computational model containing all extracted information. We present a set of classes which define the model. Information extraction instantiates these classes to generate a set of objects which contain the extracted information.
3.6.4.1 Class Structure
We define two main classes to store information, the Signal class and the Transaction class. The Signal class contains three attributes: SignalName, the name of the signal; Width, the bit width of the signal; and Direction, the input/output direction of the signal. The Transaction class has two subclasses, TerminalTransaction and NonTerminalTransaction.

The TerminalTransaction class defines transactions which directly write or read signals. Objects of the TerminalTransaction class represent one event, either a signal read event or a signal write event. A TerminalTransaction does not describe a sequence of transactions. The TerminalTransaction class has the following attributes:

• R/W – This indicates whether the transaction is a read event or a write event.
• Signal – This is a reference to the signal which is being read (read transaction) or written (write transaction).
• Value – This attribute only has meaning for a write transaction. This attribute is either the value being assigned to a signal {0, 1, Z} or the name of the input queue being read from, {Address, Tx}.

The NonTerminalTransaction class defines transactions which are defined as a sequence of other transactions. A NonTerminalTransaction does not directly assign values to signals. The NonTerminalTransaction class has the following attributes:

• TransID – This is the unique name of the transaction.
• TransList – This is an ordered list of transactions.

Each element of the TransList is a member of the TransListElt class which has the following attributes:

• Transaction – This is a reference to a single transaction object.
• Iterations – This indicates the number of times the associated transaction must be repeated. This attribute is used when a transaction is modified by a numeral in the specification document. For example, the phrase "sending 8 bits" indicates that the transaction to send a bit should be repeated 8 times.
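The class structure described above can be rendered as Python dataclasses. This is a minimal sketch under our own naming and typing assumptions, not the authors' exact code:

```python
# Sketch of the information-extraction model: Signal, the two Transaction
# subclasses, and the TransList elements with iteration counts.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Signal:
    SignalName: str
    Width: int
    Direction: str          # "input", "output", or "input/output"

@dataclass
class TerminalTransaction:
    RW: str                 # "read" or "write"
    Signal: Signal
    Value: str = ""         # {0, 1, Z} or a queue name {Address, Tx}

@dataclass
class TransListElt:
    Transaction: "Union[TerminalTransaction, NonTerminalTransaction]"
    Iterations: int = 1

@dataclass
class NonTerminalTransaction:
    TransID: str
    TransList: List[TransListElt] = field(default_factory=list)

sda = Signal("SDA", 1, "input/output")
bit = TerminalTransaction("write", sda, "Ad")      # write a bit from the Ad queue
trans = NonTerminalTransaction("Transmitting an address",
                               [TransListElt(bit, Iterations=7)])
print(trans.TransList[0].Iterations)  # 7
```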
3.6.4.2 Extraction Process
Information extraction is performed by scanning the parse trees of each sentence to find symbols which have well-defined semantic meanings. When such a symbol is found, an object is added to the semantic representation to capture the meaning of the symbol. The attributes of the object are defined by examining the subtree of the parse tree whose head is the semantic symbol. A Signal object is created for each SIGDEF symbol found. The SIGNAME, DIRECTION, and WIDTH symbols, which are children of the SIGDEF symbol, are used to define the attributes of the Signal object. A Transaction object is created when any of the sequence description symbols are found: FULLSEQUENCE, PREFIXSEQUENCE, and SUFFIXSEQUENCE. The TransID attribute of the transaction object is generated from the transaction reference associated with the TRANSHEAD symbol of the sequence description. The TransList is created from the TRANSLIST symbol of the sequence description. A TransListElt object is created for each TRANSREF symbol in the TRANSLIST. A TRANSLIST may also contain signal read/write events; a TransListElt object is also created for each of these. The result of information extraction is a hierarchy of transaction objects whose TransList attributes refer to the objects in the next lower level of the hierarchy.
3.6.5 Transactor Generation

The transaction graph generated by information extraction is used to create a Verilog description of a simulatable bus transactor. Transactor generation involves a depth-first traversal of the transaction graph, starting at the top-level node. When a leaf node is reached, Verilog code is added to the transactor which performs the operation corresponding to the leaf node. Each leaf node is an assignment involving constants, signals, and the input/output queues Tx, Rx, and Ad. If the signal assignment does not involve a queue, then an appropriate Verilog dataflow assignment statement is generated to perform the assignment. If the assignment involves a queue, special-purpose Verilog tasks are used to access queues. We have defined the q_extract task to extract a number of bits from a queue, and the q_insert task to add bits to a queue. These two tasks maintain the read and write pointers associated with the queue. Information about iterative execution is contained in each element of the TransList of each Transaction object. Iterative execution is modeled using the repeat construct in Verilog. When a TransListElt has Iterations > 1, a repeat construct is generated. The scope of the repeat construct includes the entire subgraph.
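The depth-first code generation can be sketched as follows. The emitted Verilog fragments and the `q_extract` call signature are illustrative assumptions; only the task name and the use of `repeat` for iteration come from the text:

```python
# Sketch of transactor generation: depth-first traversal of the transaction
# graph, wrapping a subgraph in "repeat (N) begin ... end" when its
# TransListElt has Iterations > 1. Non-leaf nodes are
# (name, [(child, iterations), ...]); leaves are Verilog statement strings.
def emit(node, out, indent="  "):
    if isinstance(node, str):            # leaf: a ready Verilog statement
        out.append(indent + node)
        return
    _name, children = node
    for child, iterations in children:
        if iterations > 1:
            out.append(indent + "repeat (%d) begin" % iterations)
            emit(child, out, indent + "  ")
            out.append(indent + "end")
        else:
            emit(child, out, indent)

send_addr_bit = ("Sending an address bit", [
    ("q_extract(Ad, SDA);", 1),          # assign SDA from the Ad queue (signature assumed)
    ("SCL = 1; SCL = 0;", 1),            # pulse on SCL (illustrative)
])
transmit_address = ("Transmitting an address", [(send_addr_bit, 7)])

lines = []
emit(transmit_address, lines)
print("\n".join(lines))
```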
1. Sending a write transaction is performed by sending a start condition, sending a write header byte, sending a data byte, and sending a stop condition.
2. Sending a start condition is performed by setting SDA to 0 and setting SCL to 0.
3. Sending a write header byte is performed by sending an address, sending a bit whose value is 0, and receiving an acknowledge bit.
4. Transmitting an address is performed by sending 7 address bits.
5. Sending an address bit is performed by setting SDA to a bit from Ad while SCL is low and generating a pulse on SCL.
6. Sending a bit is performed by setting SDA to a value while SCL is low and generating a pulse on SCL.
7. Receiving an acknowledge bit is performed by releasing SDA and generating a pulse on SCL.
8. Sending a data byte is performed by sending 8 data bits and receiving an acknowledge bit.
9. Sending a data bit is performed by setting SDA to a bit from Tx while SCL is low and generating a pulse on SCL.
10. Sending a stop condition is performed by setting SDA to 0, setting SCL to 1, and generating a rising edge on SDA.
Fig. 3.33 Natural language specification, I2C write transaction
3.6.6 Experimental Results

The transactor generation system was implemented in Python using the API provided in the Natural Language Toolkit [3] to create a recursive descent syntactic parser. All results were generated on a 2 GHz Intel Core 2 Duo processor. To evaluate our system, we have generated a bus transactor for the I2C serial protocol developed by Philips [39] to support onboard communication. The protocol uses two wires, the data line SDA and the clock line SCL, whose rising edges synchronize data transmission. We have explored a subset of the protocol involving a single Master node, 7-bit addressing, and no use of the repeated start condition. We use the 10-sentence natural language specification of the write transaction shown in Fig. 3.33. To focus on the more interesting part of the example, we have omitted the sentences defining the signals. A unique Transaction object is created to capture the information in each sentence of the natural language description, forming an object hierarchy with the Write transaction as the top-level object. Figure 3.34a shows the partial object hierarchy representing the Transmitting an address transaction generated from sentence 4 in Fig. 3.33. Figure 3.34a includes the Sending an address bit transaction from sentence 5, which is annotated with the number 7 to represent the number of iterations as specified in sentence 4. The q_extract task call removes a bit from the Ad queue and assigns SDA to the bit value. The resulting Verilog code for Transmitting an address is shown in Fig. 3.34b. The bus transactor generation process was performed in 61.5 s of CPU time. The resulting transactor, named i2c_write, is composed of 32 lines of Verilog code. The first portion of the simulation result of invoking i2c_write to write to address 7'b0011001 is shown in Fig. 3.35.
The portions of the write transaction shown include the Start condition from sentence 2 (START), Transmitting an address from sentence 4, and the 0 bit (R/W) referred to in sentence 3.
Fig. 3.34 “Transmitting an address” transaction, (a) transaction graph, (b) Verilog code
Fig. 3.35 Simulation waveform from i2c_write
3.7 Conclusions

We have presented approaches to generate verification artifacts, namely assertions and transactors, directly from natural language text found in a hardware description. The assertions are generated in SystemVerilog and can be used for response checking as part of a standard verification flow. The transactors are simulatable Verilog and can also be used with any Verilog simulator. Our approaches are based on the use of semantic context-free grammars which can be easily extended to accept any desired degree of linguistic variation which might be present in the English hardware description. In its present form, our work can significantly reduce the amount of manual labor which is traditionally devoted to creating and debugging a hardware verification environment.

Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. 1813858.
References

1. ARM Ltd., AMBA 3 AXI Protocol Checker User Guide (2009)
2. A. Bacchelli, T. Dal Sasso, M. D'Ambros, M. Lanza, Content classification of development emails, in Proceedings of the 2012 International Conference on Software Engineering, ICSE (2012)
3. S. Bird, E. Klein, E. Loper, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit (O'Reilly Media, 2009)
4. D. Bobrow, GUS, A Frame Driven Dialog System (Morgan Kaufmann Publishers Inc., 1986)
5. M. Burton, J. Aldis, R. Gunzel, W. Klingauf, Transaction level modeling: A reflection on what TLM is and how TLMs may be classified, in Forum on Specification and Design Languages (FDL), pp. 92–97, 2007
6. L. Cai, D. Gajski, Transaction level modeling: An overview, in International Conference on HW/SW Codesign and System Synthesis (CODES+ISSS), pp. 19–24, 2003
7. W.R. Cyre, J. Armstrong, M. Manek-Honcharik, A.J. Honcharik, Generating VHDL models from natural language descriptions, in Proceedings of the Conference on European Design Automation, EURO-DAC '94, 1994
8. R. Dechter, Constraint Processing (Morgan Kaufmann Publishers, 2003)
9. J. Drazan, V. Mencl, Improved processing of textual use cases: Deriving behavior specifications, in Proceedings of SOFSEM 2007 (Springer, 2007)
10. R. Drechsler, M. Soeken, R. Wille, Automated and quality-driven requirements engineering, in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2014
11. J. Engelfriet, Attribute grammars: Attribute evaluation methods, in Methods and Tools for Compiler Construction (Cambridge University Press, New York, 1984), pp. 103–138
12. R. Feldman, I. Dagan, Knowledge discovery in textual databases (KDT), in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95) (AAAI Press, 1995), pp. 112–117
13. C. Fillmore, Frame semantics, in Linguistics in the Morning Calm, Linguistic Society of Korea (Hanshin Publishing Company, 1982)
14. H. Foster, Applied assertion-based verification: An industry perspective. Found. Trends Electron. Des. Autom. 3(1) (2009)
15. Z.P. Fry, D. Shepherd, E. Hill, L. Pollock, K. Vijay-Shanker, Analysing source code: Looking for useful verb-direct object pairs in all the right places. Software, IET 2(1) (2008)
16. D.D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis and Verification (Springer Publishing Company, Incorporated, 2009)
17. D.D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, S. Zhao, SpecC: Specification Language and Methodology (Kluwer Academic Publishers, 2000)
18. D. Gildea, D. Jurafsky, Automatic labeling of semantic roles. Comput. Linguist. 28, 245–288 (2001)
19. M. Gordon, D. Harel, Generating executable scenarios from natural language, in Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing (Springer, 2009)
20. J.J. Granacki, A.C. Parker, PHRAN-SPAN: A natural language interface for system specifications, in Proceedings of the 24th ACM/IEEE Design Automation Conference, 1987
21. J.J. Granacki, A.C. Parker, Y. Arena, Understanding system specifications written in natural language, in Proceedings of the 10th International Joint Conference on Artificial Intelligence – Volume 2, IJCAI'87, 1987
22. M.A. Hearst, Untangling text data mining, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99, pp. 3–10, 1999
23. IEEE Press, IEEE Std. 1500-2005, IEEE Standard for Embedded Core Test, 2005
24. S. Issar, W. Ward, CMU's robust spoken language understanding system, in EUROSPEECH'93, pp. 2147–2150, 1993
25. A. Jantsch, Modeling Embedded Systems and SoC's: Concurrency and Time in Models of Computation (Morgan Kaufmann, 2004)
26. D. Jurafsky, J.H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn. (Pearson Education, 2009)
27. D.E. Knuth, Semantics of context-free languages. Theory Comput. Syst. 2(2), 127–145 (1968)
70
I. G. Harris and C. B. Harris
28. L. Kof, Natural language processing: Mature enough for requirements documents analysis, in 10th International Conference on Applications of Natural Language to Information Systems (NLDB (Springer, 2005), pp. 91–102 29. L. Kof, Scenarios: Identifying missing objects and actions by means of computational linguistics, in In Proceedings 15th RE, pp. 121–130, 2007 30. L. Kof, Translation of textual specifications to automata by means of discourse context modeling, in Proceedings of the 15th International Working Conference on Requirements Engineering: Foundation for Software Quality (Springer, 2009) 31. S.J. Korner, T. Brumm, Natural language specification improvement with ontologies. Int. J. Semant. Comput. 3, 445–470 (2009) 32. L. Lavagno, A. Sangiovanni-Vincentelli, E. Sentovich, Models of computation for embedded system design, in A. Jerraya, J. Mermet (eds.) System-Level Synthesis (Kluwer Academic Publishers, 1998), pp. 45–102 33. P. Marwedel, Embedded System Design (Springer, 2006) 34. V. Mencl, Deriving behavior specifications from textual use cases, in Oesterreichische Computer Gesellschaft, pp. 3–85403, 2004 35. S. Miller, Hidden understanding models of natural language, in In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 25–32, 1994 36. M. Minsky, A framework for representing knowledge. Technical report, Massachusetts Institute of Technology, Cambridge, 1974 37. R.J. Mooney, Learning for semantic parsing, in In Computational Linguistics and Intelligent Text Processing: Proceedings of the 8th International Conference, 2007 38. S. Narayan, F. Vahid, D.D. Gajski, System specification with the speccharts language. IEEE Des. Test 9(4) (1992) 39. NXP Semiconductors, I2C-bus specification and user manual, rev. 03 edition, June 2007. UM10204 40. T. Samad, S. Director, Natural-language interface for cad: A first step. IEEE Des. Test 2(4) (1985) 41. T. Samad, S.W. 
Director, Towards a natural language interface for CAD, in Proceedings of the 22nd ACM/IEEE Design Automation Conference, 1985 42. L.M. Segundo, R.R. Herrera, K. Yeni Perez Herrera, Uml sequence diagram generator system from use case description using natural language, in Proceedings of the Electronics, Robotics and Automotive Mechanics Conference (IEEE Computer Society, 2007), pp. 360–363 43. M. Soeken, R. Wille, R. Drechsler, Assisted behavior driven development using natural language processing, in TOOLS (50), 2012 44. R.S. Swier, S. Stevenson, Unsupervised semantic role labelling, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004 45. Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. /*icomment: bugs or bad comments?*/. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, SOSP ’07, 2007. 46. S.H. Tan, D. Marinov, L. Tan, G.T. Leavens, @tcomment: Testing javadoc comments to detect comment-code inconsistencies, in 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation (ICST), 2012 47. I.H. Witten. Practical Handbook of Internet Computing, chapter Text Mining (Chapman & Hall/CRC Press, 2005) 48. J. Yang, L. Tan, Inferring semantically related words from software context, in 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), June 2012
Chapter 4
Real-World Events Discovering with TWIST Natalia Vanetik, Marina Litvak, and Efi Levi
4.1 Introduction

One of the most representative examples of microblogging (a form of social media) is Twitter, which allows users to publish short tweets (messages within a 140-character limit) about "what's happening." Alongside "what's on the user's mind" tweets, real-time events are also reported; for example, the Football World Cup and the Israeli-Palestinian conflict were extensively covered by Twitter users. Reporting such events can offer perspectives on news items that differ from those of traditional media and can also provide valuable user sentiment about certain companies and products. Twitter users publish short messages in a fast and summarized way, which makes Twitter a preferred tool for quick dissemination of information over the web. Tweets may relate to anything that comes to a user's mind, and some relate to real-life events. Twitter grows rapidly: on average, around 6,000 tweets are published every second,1 many of them reporting real-life events. To analyze the Twitter information flow efficiently, event detection is needed. A Twitter event is a collection of tweets and re-tweets that discuss the same subject in a relatively short (minutes, hours, or days) time period. Existing algorithms typically detect events by clustering together words with similar burst patterns. Moreover, it is usually necessary to set in advance the number of events to be detected, which is difficult in Twitter due to its real-time nature. The Pear Analytics study [22] states that about 40% of all tweets are pointless "babble" and that about 37% are conversational. Although
1 http://www.internetlivestats.com/twitter-statistics/
N. Vanetik · M. Litvak () · E. Levi Sami Shamoon College of Engineering, Beer Sheva, Israel e-mail: [email protected]; [email protected] © Springer Nature Switzerland AG 2020 M. Soeken, R. Drechsler (eds.), Natural Language Processing for Electronic Design Automation, https://doi.org/10.1007/978-3-030-52273-5_4
such tweets are popular among users, they interfere with the performance of event detection algorithms and should be treated as noise. Event detection in traditional media has been broadly divided into retrospective event detection and new event detection (see [1, 3, 37]), where the former requires data to be pre-collected and the latter addresses the news stream as it arrives in real time. The two main Twitter event detection categories are unspecified event detection, where an event is not identified beforehand, and specified event detection, where an event is known or planned. Our method addresses the former category. Event detection in Twitter employs techniques from different fields of computer science. Unspecified events emerge fast and typically attract the attention of a large number of users. Because no prior information is available, the classic approach to detecting such events exploits temporal bursts of Twitter stream features such as hashtags and specific words. A hashtag is an unspaced phrase prefixed with # that forms a label specifying a main theme of a tweet. Users can start a new theme by initiating their own hashtags, and other users can use these hashtags in their tweets in order to "categorize" them under one or more existing themes. Hashtags make it easier for users to find messages with a specific theme or content. An overview of existing approaches for event detection in Twitter is given in Sect. 4.2. In this chapter we present TWIST, a system that detects and summarizes real-world events in Twitter. TWIST extends the EDCoW algorithm of [32] for event detection. This method is based on clustering of discrete wavelet signals built from individual words generated by Twitter. TWIST performs wavelet analysis and extends it with an additional similarity analysis of tweet texts. This allows TWIST to distinguish between events that occur at the same time and therefore have similar wavelet signals but different content.
In addition, TWIST retrieves the most important tweets, hashtags, and keywords for every detected event and builds its summary from relevant external sources. This chapter is organized as follows. Section 4.2 describes prior work related to event detection in Twitter. Section 4.3 contains a detailed description of the background methods that were adapted in our system. Section 4.4 describes our methodology for event detection and summarization, including the adaptation and extension of the background methods described in the previous section. Section 4.5 describes the TWIST architecture and the implementation of its main modules. Section 4.6 summarizes the pilot study that was performed on seven million collected tweets. The last section concludes the study and outlines future work.
4.2 Related Work

In general, events can be defined as real-world occurrences that unfold over space and time [1, 31, 35, 36]. Event detection from traditional media sources has been addressed in the Topic Detection and Tracking (TDT) research program, which mainly aims at finding events in a stream of news stories [1, 36]. Event detection in traditional media can be divided into retrospective event detection and new event
detection (see [1, 3, 37]), where the former requires the data to be pre-collected and the latter addresses the news stream as it arrives in real time. Twitter content is dynamically changing and increasing by the minute. According to http://tweespeed.com, on average more than 15,000 tweets per minute are published on Twitter. Therefore, event detection in Twitter streams is more challenging and difficult than similar tasks in traditional media. News articles are usually well written, structured, and edited, while Twitter messages are restricted in length and written by anyone. Tweets include many informal, irregular words and abbreviations and contain large numbers of spelling and grammatical errors. Twitter messages often have improper syntactic structure and mix different languages. When tweets report big real-life events, they are usually mixed with a large amount of meaningless messages. Twitter streams contain large amounts of garbage, polluted content, and rumors (see [10]). Therefore, not every word or hashtag in tweets that shows bursts is related to a real-life event. A good example is the popular hashtag "#musicmonday". It shows bursts every Monday because it is commonly used to suggest music on Mondays. However, such bursts obviously do not correspond to an event that the majority of users would pay attention to. Event detection in Twitter is expected to differentiate big events from trivial ones. Many methods for event detection in Twitter are based on techniques from different fields, such as machine learning and data mining, natural language processing, information extraction, text mining, and information retrieval. The two main Twitter event detection categories are unspecified event detection, where an event is not identified beforehand, and specified event detection, where an event is known or planned.
Because no prior information is available about the event, unspecified event detection relies on the temporal signal of Twitter streams to detect the occurrence of a real-world event. These techniques typically require monitoring for bursts or trends in Twitter streams, grouping the features with similar trends into events, and then classifying the events into different categories. On the other hand, specified event detection relies on specific information and features that are known about the event, such as a geo-location, time, type, and description, which are provided by the user or are derived from the event context. These features can be exploited by adapting traditional information retrieval and extraction techniques (such as filtering, query generation and expansion, clustering, and information aggregation) to the unique characteristics of tweets.
4.2.1 Specified Event Detection in Twitter

Specified event detection focuses on already known or planned events. Important information about these events (such as location, time, and people involved), called metadata, is assumed to be known from experts or other internal sources. Specified event detection techniques exploit Twitter textual content, metadata information, or both.
4.2.1.1 General Methods
The work of Popescu and Pennacchiotti, in [25], identifies controversial events, i.e., events that provoke public discussions with opposing opinions in Twitter. Their detection framework is based on the notion of a Twitter snapshot, consisting of a target entity, a given time period, and a set of tweets about the entity from that period. An event detection module first distinguishes between event and nonevent snapshots using supervised gradient-boosted decision trees, and then ranks event snapshots using a multiple-feature regression algorithm. In [26], the framework of [25] is extended with additional features that aim at ranking the entities in a snapshot with respect to their relative importance to the snapshot. These features include relative positional information (e.g., offset of the term in the snapshot), term-level information (term frequency, Twitter corpus IDF), and snapshot-level information (length of snapshot, category, language). The authors of [7] present a novel approach to identifying Twitter messages for concert events using a factor graph model, which simultaneously analyzes individual messages, clusters them according to event type, and induces a canonical value for each event property. The output of the model is a musical event-based clustering of messages, where each cluster is represented by an artist-venue pair.
4.2.1.2 Query-Based Event Detection
Becker et al., in [4], present a system for augmenting information about planned events with Twitter messages, using a combination of simple rules and query-building strategies derived from the event description and its associated aspects. Related work in [6] uses centrality-based approaches to extract high-quality, relevant, and useful Twitter messages related to an event. Massoudi et al., in [17], use a generative language modeling approach based on query expansion quality indicators to retrieve individual microblog messages. Metzler et al., in [18], propose retrieving a ranked list (or timeline) of historical event summaries instead of individual microblog messages in response to an event query. Gu and colleagues, in [13], introduce the ETree event modeling approach, which employs n-gram-based content analysis techniques to group a large number of event-related messages. The n-gram model is used to detect frequent key phrases among the event-related messages, where each phrase represents an initial information block.
4.2.1.3 Event Detection by Geo-location
The work in [15] presents a system that performs geosocial event detection of local festivals in Twitter by modeling and monitoring crowd behaviors. The approach uses geotags to find geographical regularities deduced from behavior patterns.
The work of Sakaki et al., in [29], aims at using tweets for the detection of specific types of events, such as earthquakes and typhoons. Event detection is treated there as a classification problem: a manually labeled Twitter data set comprising related and unrelated events is used to train an SVM with statistical and contextual features.
4.2.2 Unspecified Event Detection in Twitter

The nature of Twitter posts reflects events as they unfold; hence, these tweets are particularly useful for unknown event detection. Unknown events of interest are typically driven by emerging events, breaking news, and general topics that attract the attention of a large number of Twitter users. Because no event information is available ahead of time, unknown events are typically detected by exploiting the temporal patterns or signal of Twitter streams. New events of general interest exhibit a burst of features in Twitter streams that yields, for instance, a sudden increased use of specific keywords. Bursty features that occur frequently together in tweets can then be grouped into trends. In addition to trending events, endogenous or nonevent trends are also abundant on Twitter. Techniques for unspecified event detection in Twitter must discriminate trending events of general interest from trivial or nonevent trends.
4.2.2.1 Textual Features-Based Systems
The TwitterStand system, introduced in [30], is a news processing system that captures tweets corresponding to breaking news. The system uses a naïve Bayes classifier to separate news from irrelevant information and couples it with an online clustering algorithm based on term vectors weighted according to tf-idf, with cosine similarity, in order to form clusters of news. Phuvipadawat and Murata, in [24], present a method for collecting, grouping, ranking, and tracking news in Twitter. Their approach groups together similar messages in order to form a news story. Similarity between messages is computed using tf-idf with an increased weight for proper noun terms, hashtags, and usernames. Petrović and co-authors, in [23], adapt the traditional media news event detection approach of Allan et al. [2] to Twitter. Here, cosine similarity between news documents is used in order to discover new events. The work of Becker et al. [5] uses an online clustering technique to identify real-world event content and its associated Twitter messages. Similar tweets are clustered in a continuous manner and then classified into events and nonevents (trending activities in Twitter that do not reflect any real-world occurrences). Messages are represented by their tf-idf weight vectors, and cluster distance is computed using cosine similarity. A cluster is classified as an event or a nonevent with the help of an SVM trained on labeled cluster features.
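Several of the systems above compare tweets via tf-idf weight vectors and cosine similarity. The following is a minimal, generic sketch of that representation (not the exact weighting scheme of any particular system; function names are ours):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weight vectors for a tiny tweet collection, stored as sparse dicts."""
    n = len(docs)
    tokens = [doc.lower().split() for doc in docs]
    # document frequency: in how many tweets does each term appear?
    df = Counter(term for toks in tokens for term in set(toks))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokens]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Tweets about the same event share rare terms and therefore score high; unrelated tweets share few or no weighted terms.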
Long et al., in [16], integrate topical-word microblog features into a traditional clustering approach. These features are based on "topical words" that are extracted from daily messages on the basis of word frequency, word occurrence in hashtags, and word entropy. Event tracking is performed with the help of maximum-weight bipartite graph matching, where the Jaccard coefficient is used as a cluster similarity measure and cosine similarity is used to measure the distance between messages.
4.2.2.2 Wavelet-Based Systems
Weng and Lee, in [32], proposed an event detection method based on clustering of discrete wavelet signals built from individual hashtags. Unlike the Fourier transforms used for event detection in traditional media, wavelet transformations are localized in both the time and frequency domains and are therefore able to identify the time and the duration of an event within the signal. Wavelets are used to convert the signals from the time domain to a time-scale domain, where the scale is the inverse of hashtag frequency. Signal construction is performed using df-idf (document frequency-inverse document frequency), where df counts the number of tweets (i.e., documents) containing a specific hashtag. Trivial words are filtered out on the basis of (a threshold set on) signal cross-correlation, which measures the similarity between two signals as a function of a time lag. The remaining words are then clustered to form events with a modularity-based graph partitioning technique, which splits the graph into subgraphs, each corresponding to an event. Finally, significant events are detected on the basis of the number of words and the cross-correlation among the hashtags related to an event. Similarly, the authors of [11] propose a continuous wavelet transformation based on hashtag occurrences combined with topic model inference using LDA (see [8]). An abrupt increase in the number of uses of a given hashtag is considered a good indicator of an event happening at that time. When an event is detected within a given time interval, LDA is applied to all tweets related to the hashtag in each corresponding time series to extract a set of latent topics, which provides an improved summary of the event description.
4.2.3 Our Method

Our method aims at unspecified event detection in Twitter. It combines the wavelet-based method of [32] with textual features and therefore allows better separation between events with similar burst patterns. The wavelet approach, in its turn, is responsible for separating and eliminating garbage and non-events from important real-world events. Our system, TWIST, not only detects events but also provides their textual description, compiled from different textual information both internal and external to Twitter.
4.3 Background Methods

This section contains a detailed description of the three methods – EDCoW [32], TextRank [19], and Lingo [21] – used in TWIST for event detection, text ranking, and text clustering, respectively. The following sections describe our methodology for event detection and summarization, including the usage and adaptation of these methods.
4.3.1 EDCoW Algorithm

The EDCoW algorithm was first introduced in [32]. Its main idea is to build wavelet signals for individual words or hashtags that appear in tweets and to use these signals to capture bursts in their appearances. The signals can be computed quickly using wavelet analysis, and their representation requires much less storage than the original tweets. The algorithm filters away trivial words and then detects events by clustering signals together using a graph partitioning approach. EDCoW differentiates big events from trivial ones by computing the significance of an event using the number of words and the connections among the words related to the event. The algorithm has three components: (1) signal construction, (2) cross-correlation computation, and (3) graph partitioning. All these components are described below in detail.
4.3.1.1 Individual Word Signal Construction
The first component of the EDCoW algorithm constructs a signal for each individual hashtag that appears in tweets using wavelets. A wavelet is a wave-like oscillation whose amplitude begins at zero, increases, and then decreases back to zero; Figure 4.1 illustrates this notion. Generally, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelet analysis is applied in EDCoW to build signals for individual words. Wavelet analysis provides precise measurements regarding when and how the frequency of a signal changes over time; wavelets are relatively localized in both time and frequency. The core of wavelet analysis is the wavelet transformation, which converts a signal from the time domain to the time-scale domain (using a scale that can be considered the inverse of frequency). Wavelet transformations are classified into continuous wavelet transformation (CWT) and discrete wavelet transformation (DWT). CWT provides a redundant representation of the signal under analysis and is also time-consuming to compute directly. In contrast, DWT provides a non-redundant, highly efficient wavelet representation of the signal.

Fig. 4.1 Wavelet example

In EDCoW, the signal for each individual word (unigram) is built in two stages. In the first stage, the signal for a word ω at current time T_c is written as a sequence

S_\omega = [s_\omega(1), s_\omega(2), \ldots, s_\omega(T_c)] \qquad (4.1)
where s_ω(t) at each sample point t is the df-idf-like score of ω. This score is defined as

s_\omega(t) = \frac{N_\omega(t)}{N(t)} \times \log\left(\frac{\sum_{i=1}^{T_c} N(i)}{\sum_{i=1}^{T_c} N_\omega(i)}\right) \qquad (4.2)
The first component of (4.2) is df (document frequency), where N_ω(t) is the number of tweets that contain word ω and appear after sample point t−1 but before t, and N(t) is the number of all tweets in the same period of time. Document frequency is the counterpart of term frequency, which is commonly used to measure the importance of a word in text retrieval. The difference between them is that df counts only the number of tweets containing word ω; df is used in order to handle multiple appearances of the same word in a single tweet, in which case the word is usually associated with the same event. The second component of (4.2) is equivalent to idf. The difference is that for conventional idf the text collection is fixed, whereas new tweets are generated very fast in Twitter; the idf component in the formula therefore makes it possible to accommodate new words. The value of s_ω(t) is high if word ω is used more often than others from time t−1 to time t, while it is rarely used before the current time T_c. The second stage builds a smoothed signal with the help of a sliding window that covers a number of first-stage sample points; the sliding window size is denoted by Δ.
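As a concrete sketch, the first-stage score of (4.2) can be computed incrementally from per-interval tweet counts (the function and variable names are ours, for illustration):

```python
import math

def dfidf_signal(word_tweet_counts, total_tweet_counts):
    """First-stage signal of Eq. (4.2): a df-idf-like score per time interval.

    word_tweet_counts[t]  -- N_w(t), tweets containing the word in interval t
    total_tweet_counts[t] -- N(t), all tweets in interval t
    """
    signal, cum_word, cum_total = [], 0, 0
    for n_w, n in zip(word_tweet_counts, total_tweet_counts):
        cum_word += n_w          # running sum of N_w(i) up to the current point
        cum_total += n           # running sum of N(i) up to the current point
        df = n_w / n if n else 0.0
        idf = math.log(cum_total / cum_word) if cum_word else 0.0
        signal.append(df * idf)
    return signal
```

A word that suddenly appears in many tweets after being rare gets a high score, since both its df and its idf components are large at that point.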
Each second-stage sample point captures the magnitude of the change in s_ω(t) in the sliding window, if there is any. In this stage, the signal for word ω at current time T_c is again represented as a sequence:

S_\omega = [s'_\omega(1), s'_\omega(2), \ldots, s'_\omega(T'_c)] \qquad (4.3)
Note that the time units t in the first stage and t' in the second stage are not necessarily the same. For example, the interval between two consecutive time points in the first stage could be 10 min, while that in the second stage could be one hour; in this case Δ = 6. To compute the value of s'_ω(t') at each sample point, the EDCoW algorithm first moves the sliding window to cover the Δ first-stage sample points ending at s_ω((t'−1) · Δ); denote this signal fragment by D_{t'−1}. The algorithm then derives the H-measure of the signal in D_{t'−1} and denotes it by H_{t'−1}. Next, EDCoW shifts the sliding window to cover the first-stage sample points from s_ω((t'−1) · Δ + 1) to s_ω(t' · Δ); this new fragment is denoted by D_{t'}. Then, EDCoW concatenates segments D_{t'−1} and D_{t'} sequentially to form a larger segment D'_{t'}, whose H-measure H_{t'} is also computed. Subsequently, the value of s'_ω(t') is computed as:

s'_\omega(t') = \begin{cases} \dfrac{H_{t'} - H_{t'-1}}{H_{t'-1}}, & \text{if } H_{t'} > H_{t'-1} \\ 0, & \text{otherwise} \end{cases} \qquad (4.4)
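The second-stage computation can be sketched as follows. We assume here that the H-measure is the Shannon entropy of the per-scale wavelet energies (consistent with the wavelet-entropy interpretation in the text), computed with a hand-rolled Haar DWT on power-of-two-length windows; this is an illustrative approximation, not the exact implementation of [32]:

```python
import math

def haar_dwt_energies(x):
    """Per-scale energies of a signal under the Haar DWT (length should be 2^k)."""
    x = list(x)
    energies = []
    while len(x) > 1:
        approx = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x) - 1, 2)]
        detail = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x) - 1, 2)]
        energies.append(sum(d * d for d in detail))
        x = approx
    energies.append(sum(a * a for a in x))  # coarsest approximation energy
    return energies

def h_measure(x):
    """Shannon entropy of the relative per-scale energies (our assumed H-measure)."""
    e = haar_dwt_energies(x)
    total = sum(e) or 1.0
    return -sum((v / total) * math.log(v / total) for v in e if v > 0)

def second_stage_point(prev_window, curr_window):
    """s'_w(t') per Eq. (4.4): relative increase of the H-measure, clipped at zero."""
    h_prev = h_measure(prev_window)
    h_curr = h_measure(list(prev_window) + list(curr_window))  # D'_{t'} = D_{t'-1} ++ D_{t'}
    if h_curr <= h_prev:
        return 0.0
    return (h_curr - h_prev) / (h_prev if h_prev > 0 else 1.0)
```

A flat window yields zero entropy and hence a zero second-stage point, while a burst appearing in the new fragment raises the entropy and produces a positive value.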
If there is no change in s_ω(t) within interval D_{t'}, there will be no significant difference between s'_ω(t') and s'_ω(t'−1). On the other hand, an increase or decrease in the usage of word ω would cause s_ω(t) in interval D_{t'} to appear in more or fewer scales. This translates into an increase or decrease of the wavelet entropy in D'_{t'} relative to that in D_{t'−1}, and the value of s'_ω(t') shows the amount of the change.

4.3.1.2 Discarding Weak Signals
In signal processing, cross-correlation is a common measure of the similarity between two signals, computed as a function of a time lag applied to one of them. The EDCoW algorithm applies cross-correlation twice: once to compute the autocorrelation between every signal and itself, and a second time to find the correlation between every pair of hashtag signals. The second component of EDCoW receives as input a segment of signals; the length of the segment varies depending on the application scenario. This segment is denoted by S^I, and an individual signal in this segment is denoted by S_i^I. For two signal functions f(t) and g(t), their cross-correlation is defined as:

(f \star g)(t) = \sum_{\tau} f^*(\tau)\, g(t + \tau) \qquad (4.5)
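For discrete real-valued signals, (4.5) can be evaluated for every lag with numpy; a small sketch:

```python
import numpy as np

def cross_correlation(f, g):
    """Cross-correlation (Eq. 4.5) of two real signals, one value per time lag.

    For real signals the complex conjugation in (4.5) is a no-op; numpy's
    `correlate` in "full" mode evaluates the sum for every possible lag.
    """
    return np.correlate(np.asarray(f, float), np.asarray(g, float), mode="full")

def autocorrelation_peak(f):
    """Peak of a signal's autocorrelation (at zero lag for any nonzero signal)."""
    return float(np.max(cross_correlation(f, f)))
```

The zero-lag autocorrelation equals the signal's energy, which is why a strong autocorrelation peak indicates a non-trivial (bursty) word signal.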
where f^* is the complex conjugate of f. The computation of cross-correlation basically shifts one signal (in (4.5) it is g) and calculates the dot product between the two signals. Cross-correlation applied to a signal itself is called autocorrelation; it always shows a peak at a lag of zero, unless the signal is a zero signal. Therefore, autocorrelation can be used to evaluate how trivial a word is. The autocorrelation of signal S_i^I is denoted by A_i^I. In order to avoid measuring the cross-correlation between all pairs of signals generated by Twitter messages, EDCoW performs this computation only for pairs of non-trivial signals; the non-triviality of a signal is measured by the value of its autocorrelation. EDCoW discards all signals with A_i^I < θ_1, where the bound θ_1 is set as follows. First, the median absolute deviation (MAD) of all autocorrelation values A_i^I within segment S^I is computed as

\mathrm{MAD} = \mathrm{median}(|A_i^I - \mathrm{median}(A_i^I)|) \qquad (4.6)

MAD is a statistically robust measure of the variability of sample data in the presence of outliers. The boundary θ_1 is then set as

\theta_1 = \mathrm{median}(A_i^I) + \gamma \cdot \mathrm{MAD}(S^I) \qquad (4.7)

Empirically, the value of γ is set to be no less than 10 due to the high skewness of the distribution of the A_i^I. Let the number of the remaining signals be K. The cross-correlation between a pair S_i^I, S_j^I of the remaining signals is denoted by X_{ij}. Because the distribution of X_{ij} also exhibits skewness, EDCoW applies another threshold θ_2 to X_{ij} as follows:

\theta_2 = \mathrm{median}_{S_i^I \in S^I}(X_{ij}) + \gamma \cdot \mathrm{MAD}_{S_i^I \in S^I}(X_{ij}) \qquad (4.8)

where γ is the same as in (4.7). The value of X_{ij} is then set to zero if X_{ij} < θ_2. The remaining non-zero X_{ij} are then arranged in a square matrix to form the correlation matrix M; the diagonal values of M are set to zero. The matrix M is highly sparse because threshold θ_2 was applied.
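The MAD-based thresholding of Eqs. (4.6)-(4.7) can be sketched as follows; the word-to-autocorrelation mapping and the default γ are illustrative values, not taken from [32]:

```python
import statistics

def mad(values):
    """Median absolute deviation (Eq. 4.6)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def threshold(values, gamma=10.0):
    """theta = median + gamma * MAD, the form shared by Eqs. (4.7) and (4.8)."""
    return statistics.median(values) + gamma * mad(values)

def keep_nontrivial(autocorrelations, gamma=10.0):
    """Keep only the signals whose autocorrelation reaches theta_1 (Eq. 4.7)."""
    theta1 = threshold(list(autocorrelations.values()), gamma)
    return {word: a for word, a in autocorrelations.items() if a >= theta1}
```

Because most word signals are near-flat, the median and MAD stay small, and only the few strongly bursty signals survive the threshold.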
4.3.1.3 Grouping Signals to Events
The third component of the EDCoW algorithm views event detection as a graph partitioning problem for a weighted graph whose adjacency matrix is the cross-correlation matrix M constructed during the second stage of the algorithm. Figure 4.2 illustrates the notion of graph partitioning. The purpose is to divide a graph into closely connected subgraphs, where each subgraph stands for an event and contains a set of words with high cross-correlation. The cross-correlation between words in different subgraphs is expected to be low.
Fig. 4.2 Weighted graph partitioning example
Matrix M is a symmetric sparse matrix, and thus it can be viewed as the adjacency matrix of a sparse undirected weighted graph G = (V, E, W). Here, the vertex set V contains all the K signals that remain after filtering by autocorrelation, and the edge set E ⊆ V × V: an edge between two vertices v_i, v_j ∈ V exists if X_{ij} ≥ θ_2, and its weight is w_{ij} = X_{ij}. The event detection problem can then be re-formulated as a graph partitioning problem of separating the graph into closely connected subgraphs. Each subgraph corresponds to an event that contains a set of words with high cross-correlation. Newman, in [20], proposes a metric called modularity to measure the quality of such a partitioning. The modularity of a graph is defined as the sum of the weights of all the edges that fall within subgraphs (after partitioning) minus the expected edge weight sum if the edges were placed at random. A positive modularity indicates the possible presence of a partitioning. Denote the weighted degree of node v_i by d_i = \sum_j w_{ji}, and let m = \frac{1}{2}\sum_i d_i be the sum of all the edge weights in graph G. The modularity of the partitioning is defined as:

Q = \frac{1}{2m} \sum_{i,j} \left( w_{ij} - \frac{d_i d_j}{2m} \right) \delta_{c_i, c_j} \qquad (4.9)

where c_i and c_j are the indices of the subgraphs that nodes v_i and v_j belong to, respectively, and \delta_{c_i,c_j} = 1 if c_i = c_j and 0 otherwise. The goal is to partition G such that Q is maximized. Newman has proposed a very intuitive and efficient approach, based on spectral graph theory, to solve this problem. First, a modularity matrix B of the graph G is built as:

B_{ij} = w_{ij} - \frac{d_i d_j}{2m} \qquad (4.10)
Then, the eigenvector of the symmetric matrix B corresponding to its largest eigenvalue is computed. Finally, G is split into two subgraphs based on the signs of the elements of this eigenvector. The spectral method is recursively applied to each of the two pieces to further divide them into smaller subgraphs. Modularity-based graph partitioning allows EDCoW to detect events without knowing their number in advance: graph partitioning stops automatically when no more subgraphs can be constructed (i.e., Q < 0). The main computational task in this component is finding the largest eigenvalue (and the corresponding eigenvector) of the sparse symmetric modularity matrix B, which can be done efficiently by using the power iteration method.
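A compact sketch of one bisection step follows, assuming the correlation matrix is small enough for a dense eigensolver (EDCoW itself uses power iteration, which finds the same leading eigenvector):

```python
import numpy as np

def modularity_matrix(W):
    """B_ij = w_ij - d_i d_j / (2m) for a weighted adjacency matrix W (Eq. 4.10)."""
    d = W.sum(axis=1)                      # weighted degrees d_i
    return W - np.outer(d, d) / d.sum()    # d.sum() equals 2m

def spectral_split(W):
    """One bisection step: split vertices by the signs of B's leading eigenvector."""
    B = modularity_matrix(W)
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    leading = eigvecs[:, -1]               # eigenvector of the largest eigenvalue
    pos = leading >= 0
    return np.where(pos)[0], np.where(~pos)[0]
```

For a graph made of two tightly connected word groups joined by a weak edge, the signs of the leading eigenvector recover the two groups, i.e., the two candidate events.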
4.3.1.4 Usage of EDCoW in TWIST
TWIST adapts the EDCoW algorithm for wavelet-based analysis. We use a wavelet correlation, calculated according to the EDCoW methodology, as one of two features for hashtag clustering into events. TWIST combines wavelet analysis with text analysis for better event recognition. Our choice of this particular algorithm is motivated by its robustness, accuracy, and efficiency.
4.3.2 TextRank

TextRank is a graph-based ranking model for text processing, which was applied as an unsupervised method for keyword and sentence extraction by Mihalcea and Tarau in [19]. The summarized text is represented by a graph, so that the basic idea of "voting" or "recommendation" implemented by a graph-based ranking model can be applied. Namely, TextRank computes the eigenvector centrality of nodes standing for the text units (sentences or words) that need to be ranked. The PageRank algorithm [9] is used for calculating the eigenvector centrality.
4.3.2.1 TextRank Model
Formally, let G = (V, E) be a directed graph with a set of vertices V and a set of edges E, where E is a subset of V × V. For a given vertex v_i ∈ V, let In(v_i) be the set of vertices that point to it (predecessors), and let Out(v_i) be the set of vertices that vertex v_i points to (successors). The score of a vertex v_i is defined in [9] as follows:

S(v_i) = (1 - d) + d \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|}
4 Event Discovering with TWIST
where 0 ≤ d ≤ 1 is a damping factor, which has the role of integrating into the model the probability of jumping from a given vertex to another random vertex in the graph, according to the “random surfer model.” The factor d is usually set to 0.85 [9], defining a 15% probability for jumping to a completely new page. We use the same value in our implementation of TextRank.
4.3.2.2 Text as a Graph
To enable the application of graph-based ranking algorithms to natural language texts, one must build a graph that represents the text; one that interconnects text entities with meaningful relations. TextRank uses words and sentences as text entities. However, depending on the application at hand, text units of various sizes and characteristics can be added as vertices in the graph, e.g., collocations, entire paragraphs, short documents, or others. Similarly, the application should dictate the type of relations that are used to draw connections between any two such vertices, e.g., lexical or semantic relations, contextual overlap, etc. Regardless of the type and characteristics of the elements added to the graph, TextRank consists of the following main steps:
1. Identify text units that best define the task at hand, and add them as vertices in the graph.
2. Identify relations that connect such text units, and use these relations to draw edges between vertices in the graph. Edges can be directed or undirected, weighted or unweighted.
3. Iterate the graph-based ranking algorithm until convergence.
4. Sort vertices based on their final score. Use the values attached to each vertex for ranking and selection decisions.
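The four steps above can be sketched for keyword ranking as follows. This is an illustrative Python sketch; the co-occurrence window, iteration count, and function names are our own choices, not prescribed by TextRank.

```python
def textrank_keywords(tokens, window=2, d=0.85, iters=50):
    # Step 1: text units (here, unique word tokens) become vertices.
    vertices = set(tokens)
    # Step 2: co-occurrence within a sliding window draws undirected edges.
    neighbors = {v: set() for v in vertices}
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + 1 + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    # Step 3: iterate S(v) = (1 - d) + d * sum_{u -> v} S(u) / deg(u)
    # for a fixed number of rounds (a stand-in for convergence testing).
    score = {v: 1.0 for v in vertices}
    for _ in range(iters):
        score = {v: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                      for u in neighbors[v])
                 for v in vertices}
    # Step 4: sort vertices by their final score.
    return sorted(vertices, key=score.get, reverse=True)

ranked = textrank_keywords("graph ranking model graph model text".split())
```

On this toy input, the two most connected words ("graph" and "model") end up with the highest scores, which is exactly the "voting" intuition described above.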
4.3.2.3 Usage of TextRank in TWIST
In TWIST, we apply TextRank several times, for the following purposes: extracting the most important external sources for summarization, ranking and extracting clusters of sentences, and, finally, ranking and extracting the sentences covering the most important clusters into a summary. Following the main steps of TextRank, we identify the text units, and the relations between them, that best fit the task at each stage. We describe these stages in detail later in this section. Our choice of TextRank is very natural and is motivated by the following requirements. Due to the real-time nature of Twitter and our event detection approach (we recognize unknown and unspecified events), we need an unsupervised summarization methodology. Also, because of the enormous amount of retrieved data, we need an efficient algorithm, both in terms of space and, especially, run-time. TextRank fulfills both requirements. Due to the high popularity of Twitter across the international community, there is an additional characteristic that we would like to have in our
summarizer – an ability to process texts in multiple languages. Because TextRank is known for its language-independent extractive approach, it can be easily applied to different languages. The minimal effort required for its adaptation to a different language is sentence splitting. Also, the texts must be written in UTF-8 encoding.
4.3.3 Lingo Clustering Algorithm

Lingo was designed as a web search clustering algorithm, with special attention given to ensuring that both the content and the descriptions (labels) of the resulting groups are meaningful to humans. Lingo first attempts to ensure that a human-perceivable cluster label can be created, and only then assigns documents to it. Specifically, it extracts frequent phrases, as the most informative source of human-readable topic descriptions, from the input documents. Next, by reducing the original term-document matrix using Singular Value Decomposition (SVD), Lingo discovers any existing latent structure of diverse topics in the input documents. Finally, it matches the group descriptions with the extracted topics and assigns relevant documents to them. Formally, Lingo consists of five steps, as described below.
4.3.3.1 Preprocessing
Three steps are performed: text filtering removes entities and non-letter characters except for sentence boundaries. Next, the language of each snippet is identified, and finally, appropriate stemming and stop words removal end the preprocessing phase.
4.3.3.2 Frequent Phrase Extraction
Frequent phrases are defined as recurring ordered sequences of terms appearing in the input documents. To be a candidate for a cluster label, a frequent phrase or a single term must:
1. appear in the input documents at least a certain number of times (specified by a term frequency threshold),
2. not cross sentence boundaries,
3. be a complete phrase, and
4. neither begin nor end with a stop word.
A complete phrase is a complete subsequence of the collated text of the input documents, considered as a sequence of terms. The authors define a complete subsequence as a sequence that cannot be "extended" by adding preceding or trailing terms. In other words, a complete phrase cannot be a part of another, bigger complete phrase.
4.3.3.3 Cluster Label Induction
Once frequent phrases, including single terms, are known, they are used for cluster label induction. This is done in four steps:
1. term-document matrix building,
2. abstract concept discovery,
3. phrase matching, and
4. label pruning.
The term-document matrix is constructed from single frequent terms, using their tf-idf weights. In abstract concept discovery, the SVD method is applied to the term-document matrix to find its orthogonal basis (SVD's U matrix), whose vectors supposedly represent the abstract concepts appearing in the input documents. Only the first k vectors of matrix U are used in the further phases of the algorithm. The value of k is estimated by comparing the Frobenius norms of the term-document matrix A and its k-rank approximation A_k. Let the threshold q be a percentage-expressed value that determines to what extent the k-rank approximation should retain the original information in matrix A. k is then defined as the minimum value that satisfies the condition ||A_k||_F / ||A||_F ≥ q, where ||X||_F denotes the Frobenius norm of matrix X. A larger value of q induces more cluster candidates. The choice of the optimal value for this parameter ultimately depends on user preferences, and is expressed by the Lingo control parameter Candidate Label Threshold. In TWIST, we use the default value for this threshold.
Phrase matching, where group descriptions are discovered, relies on an important observation: both abstract concepts and frequent phrases are expressed in the same vector space, the column space of the original term-document matrix A. Thus, the classic cosine similarity can be used to calculate how "close" a phrase or a single term is to an abstract concept. The matrix P of size t × (p + t), where t is the number of frequent terms and p is the number of frequent phrases, is built by treating phrases and keywords as pseudo-documents and using one of the term weighting schemes. Given the P matrix and the i-th column vector of the SVD's U matrix, a vector m_i of cosines of the angles between the i-th abstract concept vector and the phrase vectors is calculated as m_i = U_i^T P.
The phrase that corresponds to the maximum component of the vector m_i is selected as the human-readable description of the i-th abstract concept. Additionally, the value of the cosine becomes the score of the cluster label candidate. A single matrix multiplication M = U_k^T P yields the result for all pairs of abstract concepts and frequent phrases.
The final step of label induction is to prune overlapping label descriptions. To do that, the similarity between cluster labels is calculated and similar labels are pruned. Let V be a vector of cluster label candidates and their scores. A term-document matrix Z, where the cluster label candidates from V serve as documents, is calculated. After column-length normalization, a matrix of similarities between cluster labels is calculated as Z^T Z. For each row, the columns that exceed the Label Similarity Threshold (another Lingo parameter) are picked and all but the
single cluster label with the maximum score are discarded. As with other Lingo parameters, the default value for the Label Similarity Threshold was used in our system.
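Under the definitions above, the choice of k by Frobenius norms and the phrase-matching product M = U_k^T P can be sketched on toy matrices. This is an illustrative NumPy sketch, not Lingo's implementation; it assumes the columns of P are already length-normalized.

```python
import numpy as np

def choose_k(A, q=0.9):
    # Smallest k with ||A_k||_F / ||A||_F >= q; since ||A||_F^2 equals
    # the sum of squared singular values, only the singular values are needed.
    s = np.linalg.svd(A, compute_uv=False)
    ratios = np.sqrt(np.cumsum(s ** 2) / (s ** 2).sum())
    return int(np.searchsorted(ratios, q) + 1)

def label_scores(A, P, q=0.9):
    # M = U_k^T P: cosines between abstract concepts (columns of U_k)
    # and phrase vectors (columns of P, assumed unit-length).
    U, _, _ = np.linalg.svd(A)
    return U[:, :choose_k(A, q)].T @ P
```

For each row of the resulting M, the column with the maximum value identifies the phrase chosen as that concept's label, as described above.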
4.3.3.4 Cluster Content Discovery
In the cluster content discovery phase, the input documents are re-queried with all induced cluster labels. The documents are assigned to labels based on the classic Vector Space Model (VSM). Let C = Q^T A, where Q is a matrix in which each cluster label is represented as a column vector, and A is the original term-document matrix for the input documents. This way, element c_ij of matrix C indicates the strength of membership of the j-th document in the i-th cluster. A document is added to a cluster if c_ij exceeds the Snippet Assignment Threshold, yet another control parameter of the Lingo algorithm. Documents not assigned to any cluster are ultimately placed in an artificial cluster called Others.
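The content-discovery step can be sketched as follows; the threshold value below is a stand-in for Lingo's Snippet Assignment Threshold, not its actual default.

```python
import numpy as np

def assign_documents(Q, A, threshold=0.15):
    # C = Q^T A: c_ij is the strength of membership of document j
    # in cluster i.
    C = Q.T @ A
    members = [np.nonzero(C[i] > threshold)[0].tolist()
               for i in range(C.shape[0])]
    assigned = set(j for m in members for j in m)
    # Unassigned documents go to the artificial "Others" cluster.
    others = [j for j in range(A.shape[1]) if j not in assigned]
    return members, others

# Two orthogonal labels, three documents; document 2 matches nothing well.
members, others = assign_documents(np.eye(2),
                                   np.array([[1.0, 0.0, 0.1],
                                             [0.0, 1.0, 0.1]]))
```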
4.3.3.5 Final Cluster Formation
Finally, clusters are ranked by a score, calculated as C_score = label_score × ||C||, where ||C|| is the number of documents assigned to cluster C. The scoring function prefers well-described and relatively large groups over smaller, possibly noisy ones. For the time being, no cluster merging strategy or hierarchy induction is proposed for Lingo.
4.3.3.6 Usage of Lingo in TWIST
TWIST applies the Lingo algorithm for clustering sentences into themes, as one of the steps required by our summarization approach (see Sect. 4.4.2). In addition to its availability, Lingo provides useful extensions, such as cluster labels and weights, that supply important additional information to our summarization process.
4.3.4 Suffix Tree Clustering Algorithm

Suffix Tree Clustering (STC) [38] is a linear-time clustering algorithm (linear in the size of the document set) that is based on identifying phrases common to groups of documents. A phrase in this context is an ordered sequence of one or more words. STC defines a base cluster as the set of documents that share a common phrase. The STC algorithm creates overlapping clusters, i.e., a document can appear in more than one cluster. STC has three logical steps:
1. document "cleaning",
2. identifying base clusters using a suffix tree, and
3. merging these base clusters into clusters.
The following subsections describe each step in detail.
4.3.4.1 Document Cleaning
In the document “cleaning” step, the string of text representing each document is transformed using a light stemming algorithm. Sentence boundaries are marked and non-word tokens (such as numbers, HTML tags, and most punctuation) are stripped.
4.3.4.2 Identifying Base Clusters
The second step – the identification of base clusters – can be viewed as the creation of an inverted index of phrases for the document set. This is done efficiently using a data structure called a suffix tree [14]. This data structure can be constructed in time linear in the size of the document set and can be constructed incrementally as the documents are being read. Each base cluster is assigned a score that is a function of the number of documents it contains, and the number of words that make up its phrase. Stopwords – which can be identified by either using a predefined list or by their tf-idf weight – are not considered as contributing to the score of a base cluster.
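A simplified sketch of base-cluster identification follows. It uses a plain n-gram inverted index rather than a real suffix tree, and omits the scoring and stopword handling described above, but it produces the same base clusters: sets of documents sharing a common phrase.

```python
from collections import defaultdict

def base_clusters(docs, max_len=3, min_docs=2):
    # Index every word n-gram (a "phrase") by the documents it occurs in.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[tuple(words[i:i + n])].add(doc_id)
    # A base cluster is a phrase shared by at least `min_docs` documents;
    # in STC its score would grow with cluster size and phrase length.
    return {p: d for p, d in index.items() if len(d) >= min_docs}

clusters = base_clusters(["cat ate cheese", "mouse ate cheese too"])
```

Unlike a suffix tree, this index is not built incrementally in linear time; it only illustrates what the tree computes.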
4.3.4.3 Merging Base Clusters
The final step of the STC algorithm merges base clusters with a high degree of overlap in their document sets. This creates clusters that are more coherent semantically, by allowing documents to be in the same cluster even if they do not share a common phrase but rather share phrases with other documents of the cluster. This step also reduces the fragmentation of the produced clusters.
4.3.4.4 STC Usage in TWIST
TWIST applies the STC algorithm for clustering sentences into themes, as one of the steps required by our summarization approach (see Sect. 4.4.2). We used the Carrot2 implementation of the STC algorithm, as described in [39]. STC
2 http://www.carrot2.org
also provides cluster labels and weights that we use as additional information for our summarization process. Because we use full phrases derived from the suffix tree, the cluster labels are very informative and can represent the main themes describing the detected events. The shared phrases of a cluster provide an informative way of summarizing its contents to the user. Also, using those phrases instead of bags of words for identifying similarities between documents, and for the consequent cluster construction, improves the clustering quality. The STC algorithm is particularly well suited for Web document clustering, as it is fast and incremental, and it has been shown to be robust in a "noisy" domain. STC does not require a predefined number of clusters, thus allowing the documents themselves to determine the number of clusters generated.
4.4 Event Detection and Summarization with TWIST

This section describes the methodologies for event detection and for the description (or summarization) of the detected events in our system. TWIST performs two tasks in a pipeline, one after the other, as follows:
1. Event detection. TWIST combines wavelet analysis with textual analysis for better event detection and separation. The wavelet analysis is adapted from the EDCoW algorithm, with minor changes in its steps. The textual analysis is performed for measuring the lexical similarity between tweets. The lexical similarity helps to avoid the categorization of different events (hashtags) with similar wavelet structure and temporal boundaries into one cluster representing a single event.
2. Event description. The detected events in TWIST are visualized as clusters, i.e., groups of interconnected nodes standing for hashtags, in the graph representing relationships between different hashtags on Twitter. Relationships are expressed through the edge weights as an average of the wavelet correlation and the lexical similarity, and all displayed hashtags are observed in a particular time period.
Given informative hashtags and the awareness of users about the real-world events that are reported on Twitter, it is possible to "guess" which real-world events they represent. However, it is definitely impossible to learn about the storyline of an event from its hashtags alone. TWIST provides this information to the end user through a textual description of the detected events, both in terms of internal Twitter information, such as hashtags, tweets, and keywords extracted from tweets, and in terms of external information extracted from relevant external sources, such as news articles describing the detected real-world event.
4.4.1 Event Detection in TWIST

TWIST extends the EDCoW algorithm for event detection by analyzing the textual content of tweets. The latter requires text preprocessing that is performed on all collected tweets before they are stored in a database and then analyzed. The following subsections describe the text preprocessing and the EDCoW extension in detail.
4.4.1.1 Text Preprocessing
We perform the following preprocessing steps for each collected tweet: (1) tokenization (hashtags are retrieved and stored separately from regular tokens), (2) part-of-speech tagging and filtering, (3) stop-word removal, and (4) stemming of the remaining words [27]. The result of preprocessing is a collection of normalized terms (stems of the remaining words) and hashtags, linked to their tweets. Then the frequency-based statistics are calculated and also stored in our database. Namely, for each normalized term (or hashtag) t_i in a tweet T_j we calculate: tf_ij (the number of occurrences of t_i in T_j, normalized by the length of T_j), idf_i (the logarithm of the number of tweets divided by the number of tweets containing t_i), and tf-idf_ij = tf_ij × idf_i. All tweets are linked to their hashtags, which later allows us to use these statistics when operating on hashtag signals and on the detected events formed by the signals.
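The statistics described above can be sketched as follows; this is an illustrative Python sketch whose normalization follows the description in the text (tf by tweet length, idf as a log ratio of tweet counts).

```python
import math
from collections import Counter

def tfidf(tweets):
    """tweets: list of token lists; returns {(term, tweet_index): tf-idf}."""
    n = len(tweets)
    # Document frequency: in how many tweets each term occurs.
    df = Counter(t for tw in tweets for t in set(tw))
    scores = {}
    for j, tw in enumerate(tweets):
        counts = Counter(tw)
        for t, c in counts.items():
            tf = c / len(tw)              # occurrences / tweet length
            idf = math.log(n / df[t])     # log(#tweets / #tweets with t)
            scores[(t, j)] = tf * idf
    return scores
```

A term that occurs in every tweet (such as a ubiquitous hashtag) gets idf = 0 and is thus suppressed, which is the intended filtering effect.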
4.4.1.2 Text Similarity Analysis for Better Event Detection
Our system extends the third stage of the EDCoW algorithm by integrating text similarity knowledge between tweets into the graph representation. The motivation behind this idea is a possible situation where two or more unrelated events evolve at the same time, following the same burst pattern. In such a case, wavelet analysis will not distinguish between these events, and only analyzing the content of the tweets can point to the differences between them. In TWIST, the weights on graph edges are calculated as a weighted linear combination of the cross-correlation values computed during the second stage and textual similarity scores for every pair of signals. Every signal is represented by its textual "profile," compiled from the texts of all tweets belonging to it. Because all tweets are preprocessed in advance, we only need to integrate the preprocessed data of all tweets belonging to the signal's hashtag. Given a cross-correlation score cc_ij and a similarity score sim_ij between signals i and j, the weight of the edge between the signal nodes is computed as

w_{ij} = \alpha \cdot cc_{ij} + (1 - \alpha) \cdot sim_{ij}    (4.11)

where 0 ≤ α ≤ 1 is a system parameter.
4.4.2 Event Description in TWIST

After detecting events, TWIST performs an additional (fourth) step: describing the detected events by generating their textual profiles (or summaries). We create two different profiles: internal and external. An internal profile (or "Twitter profile") uses strictly internal Twitter sources, such as hashtags, keywords, and tweets, while an external profile is built from external sources, such as news articles. The summarization approach in both cases follows a strictly extractive principle, which is most appropriate for the Twitter domain, both in terms of accuracy and efficiency. We use two state-of-the-art algorithms in the summarization process, TextRank [19] and Lingo [21], for text ranking and clustering, respectively. The following subsections describe these algorithms and their adaptation and usage in our system.
4.4.2.1 Internal Profile
The internal profile is built from the most salient hashtags, keywords, and tweets. The "Twitter profile" of the detected event is retrieved by taking the tweets with the highest PageRank scores, obtained from a weighted tweet graph with nodes standing for tweets, according to the TextRank method [19]. The tweet graph is built on tweets that are filtered by length and keyword coverage, in order to reduce the graph's size and the TextRank processing time. Hashtags are considered extremely important keywords and contribute more to the coverage score. The similarity between tweets, used for weighting the edges of the graph, can be calculated either as the Jaccard similarity between the sets of the tweets' terms or as the cosine similarity between their tf-idf vectors. The keywords and hashtags are ranked by their tf-idf scores, and the top-ranked ones are extracted.
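A minimal sketch of this ranking, assuming Jaccard similarity and a plain weighted PageRank iteration; the damping factor, iteration count, and similarity threshold here are illustrative, not TWIST's defaults.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_tweets(tweets, d=0.85, iters=50, min_sim=0.1):
    # Edge weights: Jaccard similarity between the tweets' term sets;
    # weak edges below `min_sim` are dropped.
    n = len(tweets)
    w = [[jaccard(tweets[i], tweets[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Weighted PageRank over the undirected similarity graph.
    score = [1.0] * n
    for _ in range(iters):
        score = [(1 - d) + d * sum(w[j][i] * score[j] / sum(w[j])
                                   for j in range(n) if w[j][i] >= min_sim)
                 for i in range(n)]
    return sorted(range(n), key=lambda i: score[i], reverse=True)

# The middle tweet overlaps both others, so it should rank first.
order = rank_tweets([["a", "b"], ["a", "b", "c"], ["c", "d"]])
```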
4.4.2.2 External Profile
The "external profile" of the detected event is compiled from the text available in external sources, using extractive text summarization. Key sentences are selected from multiple sources and unified into a single text unit, which we call the external profile of an event. The profile is created in the following phases:
1. retrieving the relevant sources, including their collection and filtering,
2. preprocessing and ranking the text of the relevant sources, and
3. summarizing the text of the top-ranked sources.
Retrieving External Sources

The relevant sources for each detected event are obtained by collecting, analyzing, and filtering the links that appear in tweets. Sources that do not contain enough
meaningful text are filtered out of the process and no longer considered. For each remaining link, the number of its appearances in tweets is counted and stored.
Preprocessing and Ranking Retrieved Sources

We use a classic VSM as the text representation model. Namely, each source is represented by a vector of tf-idf values for its terms. We follow the standard preprocessing procedure, including tokenization, stemming, and stopword removal. Then, the external sources are ranked by their eigenvector centrality in the document graph: we build a graph with nodes standing for documents and edges standing for similarity relationships, weighted by the similarity score. The PageRank score, as a variant of eigenvector centrality, is computed, and the top-ranked sources are selected for summarization, following the TextRank approach (see Sect. 4.3.2).
Summarizing the Relevant Sources

We use the top-ranked documents for summarization in the following steps:
1. selecting theme sentences, and
2. ranking and selecting theme sentences into a summary.
Every real-world event may be described by several related themes. For example, an earthquake event may involve such characteristics as the geo-location of the earthquake, its power, its victims, other countries' involvement in humanitarian help, and more. Because we summarize many event-related documents, it is quite natural to assume the existence of several reports about the same themes, and that such reports will contain lexically similar sentences. We consider a theme to be a group of lexically similar sentences, and we retrieve all event-related themes by clustering the sentences collected from the relevant sources. We applied two clustering algorithms using the Carrot2 API: Lingo and STC (see Sects. 4.3.3 and 4.3.4 for details). Both algorithms provide a label and a score for each cluster. We use this additional data in further steps as follows. Only clusters with a score above the median value participate in summarization. Also, we consider label words as very important words that describe the main theme of the label's cluster.
Given the clusters, we select theme sentences as representatives of their clusters. We look for the sentences that are closest to the centroids of the clusters. First, a centroid, as the average of the vectors representing the cluster documents, is calculated for each cluster. Then, the distance between the centroid and the cluster's sentences is calculated, and the closest sentence is selected from each cluster. Label words get a higher weight in the vector representation (TWIST uses a configurable constant factor). After this procedure, we have one sentence for each theme describing the detected event. A summary that describes the detected event must cover all (or at least most) of its important themes.
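The centroid-based selection of a theme sentence can be sketched as follows; the boost factor below is illustrative, standing in for TWIST's configurable constant.

```python
import numpy as np

def theme_sentence(vectors, label_mask, boost=3.0):
    """vectors: sentence x term tf-idf matrix for one cluster;
    label_mask: boolean term mask marking the cluster label's words."""
    # Up-weight label terms before measuring distances to the centroid.
    V = vectors * np.where(label_mask, boost, 1.0)
    centroid = V.mean(axis=0)
    dists = np.linalg.norm(V - centroid, axis=1)
    return int(np.argmin(dists))          # index of the closest sentence
```

Applied per cluster, this yields one representative sentence per theme, which the subsequent TextRank pass then ranks into the final summary.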
Given the theme sentences, we rank them using the TextRank approach and compile a summary from the top-ranked theme sentences. As a standard, an undirected graph of sentences with lexical similarity relationships is built from the event theme sentences, and PageRank is applied for scoring the sentences. Figure 4.3 provides a flowchart of our summarization approach.

Fig. 4.3 Flowchart of the summarization approach
4.5 System Description

The TWIST system is aimed at event detection and description. TWIST supports the end user by providing the following functionality:
1. collecting tweets for a predefined period of time,
2. building wavelets for the hashtags appearing in the collected tweets, including wavelet construction, wavelet filtering, and wavelet smoothing,
3. grouping hashtags into events, using wavelet and text analyses, and
4. summarizing detected events of interest, using Twitter-internal and external information.
Each stage of the analysis is visualized. For example, the user can see the actual wavelets before and after filtering and smoothing. The user can also see the correlation matrix and the graph partitioned into events based on this matrix. At the end, the user can focus on some event of interest and get its internal and external profiles. TWIST allows the end user to configure its multiple parameters, such as the period of time for tweet collection, the relative weights of wavelet correlation and text analysis in their combination as a weighting function for the partitioned graph, different thresholds for the text similarity functions, the maximal graph size, and so on. The subsections below describe the general architecture of TWIST and each module in detail. Screenshots of the TWIST GUI windows are also provided.

Fig. 4.4 Data flow of TWIST
4.5.1 System Architecture

Our system is written in C# and uses a MySQL database. The system contains two standalone applications that form a complete solution: Tweets Crawler, which collects tweets, and Event Detector, which fetches the data from the database and detects events by analyzing the collected data. The Event Detector also contains the Event Summarizer module for summarizing the detected events. The general architecture of our system is shown in Fig. 4.4.
4.5.2 The Twitter Stream

The analyzed dataset is retrieved using the Twitter streaming API, which, at the default access level, returns a random sample of all public tweets. This access level provides a small proportion (1%) of all public tweets. The data are returned as a set of documents, one per tweet, in JavaScript Object Notation (JSON). These documents also contain additional user- and tweet-related data. Given an average of 140 million tweets sent per day on Twitter, the size of the data retrieved by the streaming API (1%) in a 24-hour time span is roughly 1,400,000 tweets. The 140-character limit of tweets gives an expected 196 MB per day, or a data stream of about 2269 bytes per second.

Fig. 4.5 Initialization of TWIST. Tweets Crawler GUI
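The stream-size figures quoted above follow from a quick back-of-the-envelope calculation:

```python
# 1% sample of roughly 140 million tweets per day, 140 characters each.
tweets_per_day = 140_000_000 * 0.01       # -> 1,400,000 tweets
bytes_per_day = tweets_per_day * 140      # -> 196,000,000 bytes (~196 MB)
stream_rate = bytes_per_day / 86_400      # bytes per second (86,400 s/day)
```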
4.5.3 Tweets Crawler Tweets Crawler uses Tweetinvi to retrieve tweets data and store them in a database. Figure 4.5 shows the GUI window for this module that actually initializes the entire system. The following parameters can be set by a user at this stage, such as time period collected tweets data belongs to and sampling time interval. All the tweets are stored with their additional features provided by the twitter streaming API. We are mainly interested in hashtags, as an explicit annotation of a tweet’s main theme and tweets content for event detection, and links they contain for event summarization.
4.5.4 Event Detector

This component detects events by operating on hashtag wavelets, following the three stages of the EDCoW algorithm enumerated in Sect. 4.3.1, integrated with the text analysis described in Sect. 4.4.1.2. The system allows a user to follow the evolution of the event detection process by visualizing the results of each stage.
Fig. 4.6 Event Detector. After initial signal construction
After the wavelets are formed from the tf-idf values of the hashtags in the first stage (signal construction), they are smoothed using the Savitzky-Golay filter [28]. This is a different filter from the one used in the EDCoW algorithm; it was chosen based on our own judgment. Figure 4.6 shows the first window the user gets after the initial signal construction. The user can see the hashtag wavelets both before and after smoothing, as Figs. 4.7 and 4.8 show. The second stage (cross-correlation computation) starts by filtering irrelevant hashtag wavelets by their autocorrelation values. TWIST gives the user the ability to see how the wavelets evolve. Figure 4.9 shows the hashtag wavelets after the autocorrelation-based filtering has been performed. The correlation and the similarity between pairs of the remaining hashtags are calculated and form a sparse correlation matrix representing a weighted graph, where rows and columns stand for vertices (hashtags) and cell values stand for the weights of the arcs between vertices. The user must specify both the type of similarity metric (Jaccard or cosine) and the weight of the similarity score in the linear combination (α in Eq. 4.11). After the modularity-based graph partitioning (third stage), the clusters of hashtag wavelets representing events are displayed to the user, as shown in Figs. 4.13 and 4.15. Figure 4.10 shows the window the user gets after the correlation is calculated, with both the filtered wavelets and the correlation matrix.
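The smoothing step described above can be sketched with SciPy's savgol_filter; the window length and polynomial order below are our own choices for illustration, as the chapter does not specify TWIST's settings.

```python
import numpy as np
from scipy.signal import savgol_filter

# A toy burst-shaped hashtag signal (e.g., tf-idf values per time interval).
signal = np.array([0., 1., 0., 5., 6., 7., 6., 5., 0., 1., 0.])

# Fit a degree-2 polynomial over each 5-sample window; this preserves the
# shape of the burst while damping the small spikes around it.
smoothed = savgol_filter(signal, window_length=5, polyorder=2)
```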
Fig. 4.7 Hashtag graph before smoothing
Fig. 4.8 Hashtag graph after smoothing
The GUI allows a user to set the hashtag minimal count threshold. For example, the system can detect events that occurred on 16.01.15 between 13:00 and 15:00 with a sampling interval of 15 min (crawler parameters), taking only hashtags with over 30 occurrences (detector parameters). Also, the user can filter hashtags and their wavelets, and focus exclusively on particular events of interest.
Fig. 4.9 Hashtag wavelets after filtering
Fig. 4.10 Event Detector. After filtering and a weighted graph construction
4.5.5 Event Summarizer

This module allows a user to get summaries, internal and external, of an event of interest. A user can choose to see the following in the internal Twitter profile of the selected event: (1) the most salient hashtags, (2) keywords, and (3) tweets. Also, a summary compiled from the relevant external sources can be provided on demand. The module allows a user to configure the following parameters for the internal
profile: the impact factor of hashtags (by default, TWIST multiplies their tf-idf scores by 3), the similarity metric for weighting edges in the sentence graph (by default, the system uses Jaccard similarity), and the size of the sentence graph (by default, the graph can contain a maximum of 2000 nodes). For summarizing the external sources, the user can set the following parameters: the clustering algorithm (Lingo or STC) and the maximal number of selected sentences. Figure 4.11 shows the window that users see, with a summary for the chosen event.
4.5.6 External Sources Retriever

To build a readable and informative summary (external profile) for each detected event, TWIST retrieves the external sources, such as news articles, that describe the real-world event. TWIST collects the links appearing in the event's tweets,
Fig. 4.11 Event Summarizer. The event’s profile
retrieves their content via AlchemyAPI,3 and provides that content to the summarizer module. This module has no user-specified parameters.
4.6 Pilot Study

We performed a pilot study over a two-day period, during which 7,549,339 tweets, published between 08/07/14 12:01 a.m. and 09/07/14 11:59 p.m. and covering 95,889 hashtags, were collected. The Football World Cup 2014 contest [33] and the Israeli Protective Edge operation (PEO) in Gaza [34] both fell within this data collection period. We stress that we collected tweets only until midnight (11:59 p.m.), so some events may have been captured only partially (for instance, the football game between Argentina and the Netherlands took place in the late hours of the evening, and, therefore, only some of the tweets about the game were collected).
Using pure wavelet similarity according to the EDCoW algorithm resulted in inaccurate event detection, where different unrelated events mistakenly fell into the same category. For example, signals related to the football game between the Netherlands and Argentina fell into the same cluster as signals hashtagged Gaza. Figure 4.12 shows the correlation matrix using wavelets only. It can be seen from the matrix that the game-related and the PEO-related events have positive correlation values. Figure 4.13 shows the clusters, representing events, that were found using hashtag wavelets only, without taking the tweet text into account. The system detected one event perfectly: the World Cup 2014. The clique contains BRA, GER, WorldCup2014, Brazil, and similar hashtags. However, the Protective Edge operation event was not detected.
Fig. 4.12 Edge weights using correlation between wavelets, without text analysis
3 http://www.alchemyapi.com/api/text-extraction
N. Vanetik et al.
Fig. 4.13 Events detected without text features
Fig. 4.14 Edge weights using text analysis
Figure 4.15 shows events detected by TWIST with the text analysis component activated. The edge weights of the partitioned graph were calculated as the average (with α = 0.5) of the correlation and similarity scores. There was no significant difference between the event clustering results obtained with the cosine and Jaccard similarity metrics. Figure 4.14 shows the matrix of edge weights computed from wavelets and cosine similarity. After filtering values below the median, the football game between the Netherlands and Argentina has zero correlation with the PEO-related events. Figure 4.15 also shows the results using cosine similarity. As can be seen, Gaza and World Cup 2014 were detected as separate, unrelated events. The Protective Edge Operation event is detected properly; it contains hashtags such as PrayForPalestine, GazaUnderAttack, and FreePalestine.
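Our reading of the combined weighting step can be sketched as follows. This is an illustration under stated assumptions, not the TWIST code: edge weights are the α-weighted average of wavelet correlation and text similarity, and weights below the median are zeroed to disconnect weakly related hashtags. The function and variable names are ours.

```python
import numpy as np

def combined_edge_weights(correlation, similarity, alpha=0.5):
    """Average wavelet correlation and text similarity, prune weak edges."""
    corr = np.asarray(correlation, dtype=float)
    sim = np.asarray(similarity, dtype=float)
    weights = alpha * corr + (1.0 - alpha) * sim
    # Zero out weights below the median of the off-diagonal entries,
    # so hashtags that co-burst but share no vocabulary are disconnected.
    off_diag = weights[~np.eye(len(weights), dtype=bool)]
    weights[weights < np.median(off_diag)] = 0.0
    return weights
```

With this scheme, a pair like the Netherlands/Argentina game and Gaza can have a high wavelet correlation but near-zero text similarity, so its combined weight falls below the median and the edge is removed.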
Fig. 4.15 Events detected with text features
Fig. 4.16 Top-ranked hashtags, keywords, and tweets for the PEO event
Figure 4.16 shows the internal profile of the Protective Edge operation event, as a collection of hashtags, keywords, and tweets. Figures 4.17 and 4.18 show sentence clustering results (clusters with their labels and weights) for the Lingo and STC clustering algorithms, respectively. Because Lingo usually produces a large number of clusters (230 were created for the PEO event), its figure shows only the 17 top-ranked clusters. STC produced
Fig. 4.17 Sentence clusters for the PEO event produced by the Lingo algorithm
Fig. 4.18 Sentence clusters for the PEO event produced by the STC algorithm
16 clusters for the same event. Only the 115 top-ranked clusters were considered for summarization with Lingo, and only 8 clusters with STC. Figures 4.19 and 4.20 show the external profiles, i.e., the summaries extracted from the external sources describing the Protective Edge operation event, using the Lingo and
Fig. 4.19 External summary of the PEO event using Lingo clustering
Fig. 4.20 External summary of the PEO event using STC clustering
Fig. 4.21 External summary of the PEO event using STC clustering and a larger factor for label words
STC algorithms for clustering sentences, respectively. We limit the summaries to eight sentences. The summary obtained with STC contains only seven sentences, because one of them was filtered out (we filter sentences composed solely of citations). Figure 4.21 shows how the summary changes when a larger factor is used for label words (a factor of 3 for the summary shown).
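The two summarizer details just mentioned, filtering citation-only sentences and boosting label words by a configurable factor, can be illustrated as follows. The scoring scheme and all names here are our assumptions; the chapter only states that label words receive a larger factor (3 in the shown summary).

```python
def is_citation_only(sentence):
    """True if the sentence is nothing but quoted material."""
    stripped = sentence.strip()
    return stripped.startswith('"') and stripped.endswith('"')

def score_sentence(sentence, term_weights, label_words, label_factor=3.0):
    """Sum term weights, boosting words that occur in cluster labels.

    term_weights maps a lowercase word to its importance (e.g. tf-idf);
    label_words is the set of words appearing in cluster labels.
    """
    score = 0.0
    for word in sentence.lower().split():
        word = word.strip('.,!?";')
        weight = term_weights.get(word, 0.0)
        if word in label_words:
            weight *= label_factor  # e.g. factor 3, as in Fig. 4.21
        score += weight
    return score
```

Raising `label_factor` pulls sentences that restate the cluster labels to the top of the ranking, which is why the summary in Fig. 4.21 differs from the one in Fig. 4.20.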
4.7 Conclusions and Future Work

In this work we present a system, called TWIST, that aims at detecting and describing events on Twitter during a pre-defined period of time. TWIST extends the EDCoW algorithm, combining wavelet analysis of hashtags with text analysis of tweets: similarity analysis between the texts of highly correlated signals is used for better event detection, and summarization techniques are applied to describe the detected events. As our pilot study showed, the proposed extensions improve both the quality of event detection and the user's experience in following the main trends on Twitter.
Unfortunately, the extractive approach to event summarization does not produce very coherent summaries. We process many different sources and usually cannot follow their chronological order. Anaphora and co-reference resolution, which have proven very useful for improving the coherence of generated text, are very time-consuming and are not appropriate for the "real-time" nature of our system. Our future work includes additional ways of retrieving and summarizing external sources describing the detected events, as well as geo-sentiment analysis and monitoring of detected events. For example, we can retrieve additional reliable external sources for summarizing detected events by crawling specific official news sites, using keywords retrieved from the Twitter/internal profile as a query. An abstractive approach to summarizing external sources can also be applied; for example, we can fuse sentences from highly ranked clusters of sentences [12].
References

1. J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, Topic detection and tracking pilot study final report, in Proceedings of the Broadcast News Transcription and Understanding Workshop (1998)
2. J. Allan, V. Lavrenko, H. Jin, First story detection in TDT is hard, in Proceedings of the Ninth International Conference on Information and Knowledge Management (ACM, 2000), pp. 374–381
3. F. Atefeh, W. Khreich, A survey of techniques for event detection in Twitter. Comput. Intell. 31(1), 132–164 (2013)
4. H. Becker, F. Chen, D. Iter, M. Naaman, L. Gravano, Automatic identification and presentation of Twitter content for planned events, in ICWSM (2011)
5. H. Becker, M. Naaman, L. Gravano, Beyond trending topics: real-world event identification on Twitter. ICWSM 11, 438–441 (2011)
6. H. Becker, M. Naaman, L. Gravano, Selecting quality Twitter content for events. ICWSM 11, 442–445 (2011)
7. E. Benson, A. Haghighi, R. Barzilay, Event discovery in social media feeds, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Association for Computational Linguistics, 2011), pp. 389–398
8. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
9. S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 1–7 (1998)
10. C. Castillo, M. Mendoza, B. Poblete, Information credibility on Twitter, in Proceedings of the 20th International Conference on World Wide Web (ACM, 2011), pp. 675–684
11. M. Cordeiro, Twitter event detection: combining wavelet analysis and topic inference summarization, in Doctoral Symposium on Informatics Engineering, DSIE (2012)
12. K. Filippova, M. Strube, Sentence fusion via dependency graph compression, in EMNLP (2008), pp. 177–185
13. H. Gu, X. Xie, Q. Lv, Y. Ruan, L. Shang, ETree: effective and efficient event modeling for real-time online social media networks, in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, vol. 1 (IEEE, 2011), pp. 300–307
14. D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Cambridge University Press, Cambridge, 1997)
15. R. Lee, K. Sumiya, Measuring geographical regularities of crowd behaviors for Twitter-based geo-social event detection, in Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Location Based Social Networks (ACM, 2010), pp. 1–10
16. R. Long, H. Wang, Y. Chen, O. Jin, Y. Yu, Towards effective event detection, tracking and summarization on microblog data, in Web-Age Information Management (Springer, 2011), pp. 652–663
17. K. Massoudi, M. Tsagkias, M. de Rijke, W. Weerkamp, Incorporating query expansion and quality indicators in searching microblog posts, in Advances in Information Retrieval (Springer, 2011), pp. 362–367
18. D. Metzler, C. Cai, E. Hovy, Structured event retrieval over microblog archives, in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, 2012), pp. 646–655
19. R. Mihalcea, P. Tarau, TextRank: bringing order into texts, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2004)
20. M.E. Newman, Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
21. S. Osinski, J. Stefanowski, D. Weiss, Lingo: search results clustering algorithm based on singular value decomposition, in Advances in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM'04 Conference (2004), pp. 359–368
22. Pear Analytics Twitter study (2009). http://www.pearanalytics.com/wp-content/uploads/2009/08/Twitter-Study-August-2009.pdf
23. S. Petrović, M. Osborne, V. Lavrenko, Streaming first story detection with application to Twitter, in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Association for Computational Linguistics, 2010), pp. 181–189
24. S. Phuvipadawat, T. Murata, Breaking news detection and tracking in Twitter, in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 3 (IEEE, 2010), pp. 120–123
25. A.M. Popescu, M. Pennacchiotti, Detecting controversial events from Twitter, in Proceedings of the 19th ACM International Conference on Information and Knowledge Management (ACM, 2010), pp. 1873–1876
26. A.M. Popescu, M. Pennacchiotti, D. Paranjpe, Extracting events and event descriptions from Twitter, in Proceedings of the 20th International Conference Companion on World Wide Web (ACM, 2011), pp. 105–106
27. M. Porter, An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
28. W.H. Press, S.A. Teukolsky, Savitzky-Golay smoothing filters. Comput. Phys. 4(6), 669–672 (1990)
29. T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes Twitter users: real-time event detection by social sensors, in Proceedings of the 19th International Conference on World Wide Web (ACM, 2010), pp. 851–860
30. J. Sankaranarayanan, H. Samet, B.E. Teitler, M.D. Lieberman, J. Sperling, TwitterStand: news in tweets, in Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM, 2009), pp. 42–51
31. R. Troncy, B. Malocha, A.T. Fialho, Linking events with media, in Proceedings of the 6th International Conference on Semantic Systems (ACM, 2010), p. 42
32. J. Weng, B.S. Lee, Event detection in Twitter. ICWSM 11, 401–408 (2011)
33. Wikipedia: 2014 FIFA World Cup (2014). https://en.wikipedia.org/wiki/2014_FIFA_World_Cup
34. Wikipedia: Protective Edge Operation (2014). http://en.wikipedia.org/wiki/2014_Israel-Gaza_conflict
35. L. Xie, H. Sundaram, M. Campbell, Event mining in multimedia streams. Proc. IEEE 96(4), 623–647 (2008)
36. Y. Yang, T. Pierce, J. Carbonell, A study of retrospective and on-line event detection, in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, 1998), pp. 28–36
37. Y. Yang, J.G. Carbonell, R.D. Brown, T. Pierce, B.T. Archibald, X. Liu, Learning approaches for detecting and tracking news events. IEEE Intell. Syst. 14(4), 32–43 (1999)
38. O. Zamir, O. Etzioni, Web document clustering: a feasibility demonstration, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98) (1998), pp. 46–54
39. O. Zamir, O. Etzioni, Grouper: a dynamic clustering interface to web search results. Comput. Netw. 31(11–16), 1361–1374 (1999)
Index
A
Assertions
  attribute grammar, 47–48
    comparison operations, 52–54
    constants, 50
    event constraints, 53, 54
    events, 51, 52
    implication, 55, 56
    sentences, 56
    signals and storage elements, 50, 51
  automatic theorem, 43
  experimental results
    benchmark set, 56–57
    limitations, 58
  generation, 39
  response evaluation, 38
  system overview, 48–49
Auto correlation, 80, 81

B
Behavioral constraints, 41, 43
Bus transactors, 59–60, 65–67

D
Design verification
  artifacts, 38
  automatic, 15
  AXI protocol, 56
  computer systems, 17
  engineers, 28
  natural language, 38–39
  NLP (see Natural language processing (NLP))
  power switch, 29
Domain specific language (DSL), 3–5, 9

E
Electronic control units (ECUs), 13, 20, 22, 26–28
Electronic design automation (EDA)
  design flows, 1
  tools, 1, 2, 10, 37
Event constraints, 53, 54
Event detection
  graph partitioning problem, 81
  retrospective event detection, 72
  specified
    general methods, 74
    geo-location, 74–75
    query-based, 74
  and summarization, 72
  in TWIST (see TWIST)
  Twitter information, 71, 73
  unspecified
    textual features-based, 75–76
    wavelet-based, 76
Event detection with clustering of wavelet-based signals (EDCoW) algorithm
  grouping signals, 80–82
  individual word signal construction, 77–79
  in TWIST, 82
  wavelet similarity, 99
  weak signals, 79–80
F
Formal representations, 17
  DSL, 4
  legal regulations, 2, 10
  natural language, 7
  semi-formal, 15

I
Integrated circuit (IC), 26–29, 37
ISO26262, 14
L
Legal regulations
  domain, 2–3
  evaluation, 9–10
  existing solution, 4–5
  problem formulation, 2–3
  proposed solution
    EBM rules, 5–6
    exploitation, 6–7
    NLP, 7–9
  requirement engineering, 1
  UML/OCL, 2
Lingo clustering algorithm
  content discovery, 86
  formation, 86
  frequent phrases, 84
  label induction, 85–86
  PEO event, 103
  preprocessing, 84
  STC, 86–88
  in TWIST, 86

M
Metadata, 32, 73

N
Natural language processing (NLP)
  assertion generation, 39
  classes of sentences, 9
  computational models
    behavioral constraints, 43
    representation, behavior, 42–43
    structural representation, 41–42
  constraints, 38
  document retrieval, 40
  EDA tools, 37
  exploitation, 5, 7–9
  generating assertions, 47–58
  hardware, 40–41
  IC, 37
  information extraction, 40
  linguistic variation in hardware descriptions, 44–46
  organization, 39
  semantic parsing, 46–47
  software design, 40–41
  transactors, 58–68
  verification artifacts, 38

O
Object constraint language (OCL), 1, 2
P
Protected power switch, 21–23
  application description, 20
  modelling of requirements
    behavioural, 29–32
    hierarchical organization, 23–25
    reuse, 32–33
    structural modelling, 25–29

R
Real-world events
  EDCoW algorithm, 77–82
  event
    description in TWIST, 90–92
    detection, 72, 88–89
  Lingo clustering algorithm, 84–88
  pilot study, 99–104
  specified event detection
    general methods, 74
    geo-location, 74–75
    query-based, 74
  summarization, 88–89
  TextRank, 82–84
  Twitter, 71
  unspecified event detection
    textual features-based systems, 75–76
    wavelet-based systems, 76
Regular expressions, 5–7, 9, 10
S
Semantic parsing, 39, 40, 46–47
Semi-automatic translation
  domain, 2–3
  evaluation, 9–10
  existing solution, 4–5
  problem formulation, 2–3
  proposed solution
    EBM rules, 5–6
    exploitation, 6–7
    NLP, 7–9
  requirement engineering, 1
  UML/OCL, 2
Semi-formalization
  natural language representation, 16–17
  requirements, 13–15
    protected power switch, 20–33
  representation
    formal, 17
    semi-formal, 17–18
  requirements representation, 15
  SysML-based approach, 16
  use, 15–16
  SysML language, 18–20
Smart high-side power switch, 21, 26, 32
Suffix tree clustering (STC)
  base clusters
    identifying, 87
    merging, 87
  clustering algorithms, 101
  document cleaning, 87
  PEO event, 102–104
  usage in TWIST, 87–88
Summarization
  artifact, 41
  event detection, 77, 88–92
  external sources, 83
  NLP, 40
  STC algorithm, 87
  TWIST, 86
System description
  architecture, 93
  event
    detector, 94–97
    summarizer, 97–98
  external sources retriever, 98–99
  Tweets Crawler, 94
  Twitter stream, 93–94
System modelling language (SysML)
  behaviour diagram, 29
  hierarchical decomposition, 24
  overview, 18–20
  representation, 27
  semi-formalization, 16

T
Text analysis, 72, 82, 93, 94, 99, 100, 104
TextRank
  as graph, 83
  model, 82–83
  in TWIST, 83–84
Transactors
  bus, 59–60
  experimental results, 67, 68
  generation, 39, 66
  information extraction
    class structure, 65
    process, 66
  system overview, 60
  transaction concepts
    reading and writing signals, 63–64
    references, 62
    sequence descriptions, 62–63
    signal definitions, 61
Tweets Crawler, 93, 94
TWIST
  architecture, 72
  event description
    external profile, 90–92
    internal profile, 90
    text preprocessing, 89
  similarity analysis, 89
Twitter
  detection categories, 72
  event detection, 72
  general methods, 74
  geo-location, 74–75
  internal profile, 90
  messages, 73
  microblogging, 71
  query-based event detection, 74
  real-life events, 71
  stream, 93–94
  TWIST, 72
  unspecified event detection
    textual features-based systems, 75–76
    wavelet-based systems, 76
Typed dependencies, 5, 7, 8
U
Unified modelling language (UML), 1–3, 16–19, 23, 41

W
Wavelet analysis, 72, 77, 82, 88, 89, 104
Wrapper parallel control (WPC) terminal, 42
Wrapper parallel input (WPI) terminal, 42
Wrapper parallel output (WPO) terminal, 42