Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
6153
Paulo Borba Ana Cavalcanti Augusto Sampaio Jim Woodcock (Eds.)
Testing Techniques in Software Engineering
Second Pernambuco Summer School on Software Engineering, PSSE 2007
Recife, Brazil, December 3-7, 2007
Revised Lectures
Volume Editors

Paulo Borba
Universidade Federal de Pernambuco, Centro de Informática
CEP 50732-970, Recife, PE, Brazil
E-mail: [email protected]

Ana Cavalcanti
University of York, Department of Computer Science
Heslington, York YO10 5DD, UK
E-mail: [email protected]

Augusto Sampaio
Universidade Federal de Pernambuco, Centro de Informática
CEP 50732-970, Recife, PE, Brazil
E-mail: [email protected]

Jim Woodcock
University of York, Department of Computer Science
Heslington, York YO10 5DD, UK
E-mail: [email protected]
Library of Congress Control Number: 2010929777
CR Subject Classification (1998): D.2.4, D.2, D.1, F.3, K.6.3, F.4.1
LNCS Sublibrary: SL 2 - Programming and Software Engineering
ISSN 0302-9743
ISBN-10 3-642-14334-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14334-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
The Pernambuco School on Software Engineering (PSSE) 2007 was the second in a series of events devoted to the study of advanced computer science and to the promotion of international scientific collaboration. The main theme in 2007 was testing. Testing is nowadays a key activity for assuring software quality. The summer school and its proceedings were intended to give a detailed tutorial introduction to the scientific basis of this activity and its state of the art.

These proceedings record the contributions from the invited lecturers. Each of the chapters is the result of a thorough revision of the initial notes provided to the participants of the school. The revision was inspired by the synergy generated by the opportunity for the lecturers to present and discuss their work among themselves and with the school's attendees. The editors have tried to produce a coherent view of the topic by harmonizing these contributions, smoothing out differences in notation and approach, and providing links between the lectures. We apologize to the authors for any errors introduced by our extensive editing.

Although the chapters are linked in several ways, each one is sufficiently self-contained to be read in isolation. Nevertheless, Chap. 1 should be read first by those interested in an introduction to testing. Chapter 1 introduces the terminology adopted in this book. It also provides an overview of the testing process, and of the types (functional, structural, and so on) and dimensions (unit, integration, and so on) of the testing activity. The main strategies employed in the central activity of test selection are also discussed. Most of the material presented in this introductory chapter is addressed in more depth in the following chapters.

Chapter 2 gives an overview of the foundations and practice of testing. It covers functional, structural, and fault-based testing, and discusses automation extensively. Evaluation and comparison are based on experimentation, with emphasis on mutation testing.

Chapter 3 discusses test-case generation and selection for reactive systems. The focus is on model-based approaches. A particular tool, Target, and a case study using mobile-phone applications are used for extensive illustration of techniques, strategies, and algorithms. Modelling is an expensive activity in this context, and this chapter also discusses the automatic generation of models from requirements.

Chapter 4 discusses the main challenges of testing software product lines. It presents several techniques and an overall process for testing, in a productive way, products that have related functionalities but specific behavior differences. This chapter considers testing and validation not only of code but of other artifacts as well.
Chapter 5 describes how to design, implement, and test software using the methodology of Parameterized Unit Testing. This is supported by the tool Pex, an automated test-input generator, which leverages dynamic symbolic execution to test whether the software under test agrees with its specification. Pex produces a small test suite with high code coverage from Parameterized Unit Tests. The basic concepts are introduced and the techniques are illustrated with some examples. Deterministic, single-threaded applications are assumed.

Chapter 6 summarizes the experience of a leading tool vendor and developer. It addresses the issues governing the design and use of both static and dynamic testing tools. It argues that a well-designed modern tool relies on an underlying mathematical theory: there is now a convergence between formal methods and principled testing.

Chapter 7 presents a generic framework for developing testing methods based on formal specifications, and its specialization to several formal approaches: Finite State Machines, Algebraic Specifications, Input-Output Transition Systems, and Transition Systems with Priorities. Assuming some testability hypothesis on the system under test, a notion of exhaustive test suite is derived from the semantics of the formal notation and from the definition of correct implementation. Then, a finite test suite can be selected from the exhaustive one via some selection hypotheses.

Chapter 8 revisits fault-based and mutation testing. It gives a foundational account of this technique using a relational model of programming based on refinement, namely, that of Hoare and He's Unifying Theories of Programming (UTP). The theory suggests and justifies novel test-generation techniques, which are also discussed. Tool support for the presented techniques is considered.

We are grateful to the members of the Organizing Committee, who worked very hard to provide an enjoyable experience for all of us. Without the support of our sponsors, PSSE 2007 could not have been a reality. Their recognition of the importance of this event for the software engineering community in Latin America is greatly appreciated. We would also like to thank all the lecturers for their invaluable technical and scientific contribution, and for their commitment to the event; the effort of all authors is greatly appreciated. Finally, we are grateful to all the participants of the school. They are the main focus of the whole event.
March 2010
Paulo Borba Ana Cavalcanti Augusto Sampaio Jim Woodcock
Organization
PSSE 2007 was organized by the Centro de Informática, Universidade Federal de Pernambuco (CIn/UFPE), Brazil, in cooperation with the University of York, UK.
Executive Committee
Paulo Borba, CIn/UFPE, Managing Director and Guest Editor
Ana Cavalcanti, University of York
Augusto Sampaio, CIn/UFPE
Jim Woodcock, University of York
Sponsoring Institutions
Formal Methods Europe
Sociedade Brasileira de Computação, Brazil
United Nations University, Macau
Universidade Federal de Pernambuco (CIn/UFPE), Brazil
University of York, UK
Acknowledgements

Auri Vincenzi, Márcio Delamaro, Erika Höhn, and José Carlos Maldonado would like to thank the Brazilian funding agencies CNPq, FAPESP, and CAPES, and the QualiPSo Project (IST-FP6-IP-034763), for their partial support of the research they report in Chap. 2.

Patricia Machado and Augusto Sampaio would like to emphasize that most of their chapter covers results achieved from a research cooperation in software testing between Motorola Inc., CIn-UFPE, and UFCG. They would like to thank the entire group for all the support, effort, criticism, and suggestions throughout this cooperation. In particular, their chapter is based on joint papers with Gustavo Cabral, Emanuela Cartaxo, Sidney Nogueira, Alexandre Mota, Dante Torres, Wilkerson Andrade, Laisa Nascimento, and Francisco Oliveira Neto.

John McGregor would like to thank Kyungsoo Im and Tacksoo Im for their work on implementations and John Hunt for the implementations from his dissertation.

Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte would like to thank their past interns and visiting researchers Thorsten Schuett, Christoph Csallner, Tao Xie, Saswat Anand, Dries Vanoverberghe, Anne Clark, Soonho Kong, Kiran Lakhotia, Katia Nepomnyashchaya, and Suresh Thummalapenta for their work and experiments to improve Pex; Nikolaj Bjørner and Leonardo de Moura for their work on the constraint solver Z3; the developers and testers of .NET components and Visual Studio within Microsoft for their advice; and all users of Pex for giving feedback.

Marie-Claude Gaudel's text is extracted from or inspired by previous articles co-authored by the author. She would particularly like to thank Pascale Le Gall for Sect. 2; Richard Lassaigne, Michel de Rougemont, and Frédéric Magniez for Sect. 3; Perry James and Grégory Lestiennes for Sects. 4 and 5; and Ana Cavalcanti for new exciting work on testing CSP refinements.

Bernhard K. Aichernig's work was carried out as part of the EU-funded research project in Framework 6: IST-33826 CREDO (Modeling and analysis of evolutionary structures for distributed services). His Theorem 12 was contributed by He Jifeng, then at UNU-IIST. Several people contributed to the implementations of the theories discussed in Sect. 6 of his chapter. The OCL test-case generator was implemented by Percy Antonio Pari Salas during his fellowship at UNU-IIST. The Spec# test-case generator is an idea of Willibald Krenn, TU Graz, and was realized in the EU project in Framework 7: ICT-216679 MOGENTES (Model-based Generation of Tests for Dependable Embedded Systems). The first LOTOS mutation-testing case study was carried out by Carlo Corrales Delgado during his fellowship at UNU-IIST. The more recent achievements in the mutation testing of protocols are the work of Martin Weiglhofer, TU Graz.
Table of Contents

Software Testing: An Overview ................................................ 1
   Patrícia Machado, Auri Vincenzi, and José Carlos Maldonado

Functional, Control and Data Flow, and Mutation Testing: Theory and Practice ... 18
   Auri Vincenzi, Márcio Delamaro, Erika Höhn, and José Carlos Maldonado

Automatic Test-Case Generation ............................................... 59
   Patrícia Machado and Augusto Sampaio

Testing a Software Product Line .............................................. 104
   John D. McGregor

Parameterized Unit Testing with Pex: Tutorial ................................ 141
   Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte

Software Tool Issues ......................................................... 203
   Michael Hennell

Software Testing Based on Formal Specification ............................... 215
   Marie-Claude Gaudel

A Systematic Introduction to Mutation Testing in Unifying Theories of Programming ... 243
   Bernhard K. Aichernig

Author Index ................................................................. 313
Software Testing: An Overview

Patrícia Machado (1), Auri Vincenzi (3), and José Carlos Maldonado (2)

(1) Universidade Federal de Campina Grande, Brazil, [email protected]
(2) Universidade de São Paulo, Brazil, [email protected]
(3) Universidade Federal de Goiás, Brazil, [email protected]
The main goal of this chapter is to introduce common terminology and concepts of software testing that are assumed as background in this book. The chapter also presents the multidimensional nature of software testing, showing its different variants and levels of application. After a brief introduction, Section 2 presents a set of basic definitions used in the remainder of this book. Section 3 gives an overview of the essential activities and documents involved in most test processes. Section 4 discusses the kinds of properties we may want to test, including functional, non-functional, and structural properties. In Section 5, we discuss the various dimensions of software testing, covering unit, integration, system, and acceptance testing. Section 6 highlights that different domains have demanded effort from the research community on tailored strategies; we discuss object-oriented, component-based, product-line, and reactive-systems testing. Test selection is a main activity of a test process, and we discuss the main selection strategies in Section 7. We conclude this introduction in Section 8 with some final considerations.
1 Introduction

In recent years the interest in and importance of software testing have grown, mainly due to the rising demand for higher software quality. Shull et al. [298] present a discussion and an interesting synthesis of the current knowledge on software defect reduction. One point they comment on concerns the software modules that contribute most to defects. They warn that, during development, virtually no modules are defect-free when implemented, and about 40% of the modules may be defect-free after their release. Therefore, as pointed out by Boehm and Basili [47], it is almost impossible to deliver a software product free of defects. Moreover, it is important to observe that the later a fault is detected, the greater the cost of its correction (see Figure 1). In 1987, Boehm [48] evaluated the relative cost of defect correction against the development phase in which the defect is detected and concluded that, from requirements to maintenance, the cost-escalation factor ranges from 1 to 100 (Figure 1(a)), where 1 means the relative cost in the requirements phase and 100 refers to the cost in the maintenance phase. More recently, Boehm and Basili [47] provided new and not so dramatic figures: the cost-escalation factor for small, noncritical software systems is more likely to be 5:1 than 100:1 (Figure 1(b)). However, even a factor
Phase                 (a) Boehm, 1987 [48]    (b) Boehm and Basili, 2001 [47]
Requirement           0.5-1                   0.5-1
Project               2.5                     1.5
Coding                5                       2.0
Unit Testing          10                      3.0
Acceptance Testing    25                      4.0
Maintenance           100                     5.0

Fig. 1. Cost-escalation factor of defect correction
of 5 can represent a high cost rate, which emphasizes the need to bring verification and validation (V&V) activities to earlier stages of development. Software development methodologies like Extreme Programming (XP) [86] and Test-Driven Development (TDD) [15] implement this practice.

Software testing is a dynamic activity in the sense that it requires a product that can be executed or simulated with a given input and can provide an observable output. Its main goal is to reveal the existence of faults in the product under testing (PUT) during its different development phases. We use the term "product" instead of "program" or "system" because the concepts presented here can be equally applied, without loss of generality, to products from the specification to the implementation level. There are other, complementary, static verification and validation activities, such as formal reviews, which should be used in conjunction with testing in order to detect a larger number of faults as early as possible (see Chapter 4 and [245], for example).

Despite its fundamental importance, software testing is known as one of the most expensive activities in the development process; it can take up to 50% of the total cost of software projects [157]. Besides its main objective of revealing faults, the data collected during testing are also important for debugging, maintenance, reliability assessment, and software process improvement. Experimental software engineering [196,299] is a recent field of study that aims to contribute to the improvement of the current state of the art and practice by providing evidence on the costs and benefits of methods, techniques, criteria, and tools in different application domains. Such evidence is relevant for the establishment of effective and efficient verification, validation, and testing strategies that combine the benefits of these different testing methods, techniques, criteria, and tools [26,300,187,229,64].
2 Basic Terminology

To decide whether the actual output obtained by exercising a PUT with a given input is correct with respect to the product specification, an "oracle" is needed; in general it is represented by a domain expert, who is able to determine whether a given input has revealed a fault or not. Once a fault is detected, the testing activity is usually interrupted and the debugging activity takes place; its objective is to identify, from the incorrect output, the point in the product where the defect is located. Once the fault (the cause) is located and corrected, the testing activity is resumed.
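In automated unit testing, the oracle is typically encoded as assertions that compare the actual output of the PUT with the expected output taken from the specification. The sketch below is our own illustration, not part of the original lecture notes; it assumes the Factorial class and the NegativeNumberException of Listing 1.1 below, and uses the JUnit 4 API.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class FactorialTest {
    @Test
    public void factorialOfFive() throws Exception {
        // The assertion plays the oracle role: the specification says 5! = 120.
        // Against the faulty implementation of Listing 1.1 this test fails,
        // revealing the fault in the condition at line 3.
        assertEquals(120L, Factorial.compute(5));
    }
}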
 1  public class Factorial {
 2    public static long compute(int x) throws NegativeNumberException {
 3      if (x < 0) { // Should be (x >= 0)
 4        long r = 1;
 5        for (int k = 2; k <= x; k++)
 6        {
 7          r = r + k; // Should be (r *= k)
 8        }
 9        return r;
10      } else {
11        throw new NegativeNumberException();
12      }
13    }
14  }

Listing 1.1. A faulty implementation of the Factorial method

The first fault, located at line 3 (it should be x >= 0), is responsible for a domain fault because, regardless of the value of x, a wrong
set of statements is executed. The second fault (supposing the first one is corrected), located at line 7 (it should be r *= k), is responsible for a computational fault since, except for negative values and for x = {0, 1, 3}, the method computes the factorial function incorrectly. We observe that this is also an example of a data-sensitive fault, since for x = 3 the fault is activated but the correct value of 3! = 6 is computed. Therefore, in order to detect the fault, a value of x different from {0, 1, 3} must be used.

Despite its considerable importance, the testing activity is subject to a series of limitations [175,283,157]. In general, the following problems are undecidable and represent such limitations.

– Correctness: there is no general-purpose algorithm to prove the correctness of a product.
– Equivalence: given two programs, deciding whether they are equivalent; or, given two paths (sequences of statements), deciding whether they compute the same function.
– Executability: given a path (sequence of statements), deciding whether there exists an input datum that executes it.
– Coincidental correctness: a product can coincidentally present a correct result for a given input datum d ∈ D because one fault masks the error of another.

All these limitations have important consequences for software testing, mainly the impossibility of fully automating all necessary testing activities. Considering product correctness, it is said that a product P with input domain D is correct with respect to its specification S if S(d) = P(d) for every datum d belonging to D, that is, if the behavior of the product is in accordance with the expected behavior (with respect to the specification) over the entire input domain. Figure 2 illustrates the input and output domains of a given product; observe that both can be infinite and, in such a case, it is not possible to execute P on its entire input domain. Given two products, P1 and P2, if P1(d) = P2(d) for every d ∈ D, it is said that P1 and P2 are equivalent.

Since there is no general-purpose algorithm to decide the correctness of a product, as discussed above, a common assumption in software testing is the existence of an oracle, which can determine, for any input datum d ∈ D, whether S(d) = P(d), within a reasonable limit of time and effort. An oracle simply decides whether output values are correct with respect to what is specified.

Example 3. As an example of equivalent products, we consider the following version of the Factorial program (Listing 1.2). We observe that the only difference in relation
Fig. 2. Input and output domains of a given program (both may be infinite)
public class Factorial {
  public static long compute(int x) throws NegativeNumberException {
    if (x >= 0) {
      long r = 1;
      for (int k = 1; k <= x; k++) {
        r *= k;
      }
      return r;
    } else {
      throw new NegativeNumberException();
    }
  }
}

Listing 1.2. A correct version of the Factorial method

  // ...
  if (s.length() > 1) {
    achar = s.charAt(1);
    int i = 1;
    while (i < s.length() - 1) {
      achar = s.charAt(i);
      if (!valid_f(achar))
        valid_id = false;
      i++;
    }
  }

  if (valid_id && (s.length() >= 1) && (s.length() < 6))
    return true;
  else
    return false;
}

Listing 1.3. validateIdentifier method
The following example is an adaptation of the one provided by Dijkstra [105] in "On the reliability of mechanisms", where he demonstrates the general impossibility of an exhaustive test and states the famous corollary: "Program testing can only be used to show the presence of bugs, but never to show their absence!"

Example 5. We consider a simple Java method which takes two arguments of primitive double type, each one 64 bits long. This method has a clearly finite input domain with 2^128 elements (2^64 * 2^64), considering all possible combinations. We also assume that we are running this method on a machine capable of performing 2^40 instructions per second. On this machine, whose processor speed is compatible with those commonly found today, exercising the method over its whole input domain would take 2^128 / 2^40 = 2^88 ≈ 10^26 seconds. Since one year corresponds to approximately 10^7 seconds, the exhaustive test would finish only after 10^26 / 10^7 = 10^19 years, clearly an infeasible deadline. Moreover, it is important to remember that there is no general-purpose algorithm that can be used to prove the correctness of a product.

Even though in general it is not possible to prove product correctness with testing, the test, when conducted in a systematic and clear-sighted way, helps to increase the confidence that the product behaves according to its specification, and also to establish some minimal characteristics of product quality. Two important questions arise in the context of software testing: "How is the selection of test data performed?" and "How may one decide when a product P has been sufficiently tested?". Testing criteria for test-suite selection and evaluation are crucial to the success of the testing activity. Such criteria provide an indication of how test cases should be selected in order to increase the chances of detecting faults or, when no faults are found, to establish a high level of confidence in correctness. A testing criterion is used to help the tester subdivide the input and output domains and provides a systematic way to select a finite number of elements to compose a test suite. The objective is to create the smallest
test suite for which the output indicates the largest set of faults. For instance, in Figure 2, the dots in the input domain correspond to a test suite selected according to a given testing criterion.

The simplest way to represent a test case is as a tuple (d, S(d)), where d ∈ D and S(d) represents the corresponding expected output for the input d according to specification S. A more detailed definition states that a test case is a 4-tuple (pre-conditions, input, expected output, execution order). We observe that this definition, resulting from the combination of the test-case definitions provided by Copeland [85] and McGregor [245], includes the former. The pre-conditions establish the constraints that must hold before the input is provided. The input is a given element from the input domain used to execute the PUT. The expected output represents what the correct output should be according to the product specification. Finally, the execution order can be classified as "cascade" or "independent". It is cascade when the order in which each test case is executed is important, that is, there is a correct order for executing the test cases because each test case assumes, as a pre-condition, the state resulting from the execution of the previous ones. On the other hand, if a test case is independent, the order in which it is executed does not matter. As a consequence of this order of execution, cascade test cases are in general shorter than independent test cases. However, once a cascade test case fails, it is likely that all subsequent cascade test cases also fail, due to the dependence between them.

Given a product P and a test suite T, we adopt the following terminology in relation to testing criteria.

– Test-case adequacy criterion: a predicate to evaluate T when testing P; and
– Test-case selection criterion: a procedure to choose test cases in order to test P.

Goodenough and Gerhart [144] define a software testing adequacy criterion as "a predicate that defines what properties of a program must be exercised to constitute a 'thorough' test". There is a strong correspondence between selection methods and adequacy criteria because, given an adequacy criterion C, there exists a selection method M_C which establishes: select T such that T is adequate according to C. Analogously, given a selection method M, there exists an adequacy criterion C_M which establishes: T is adequate if it was selected according to M. In this way, it is common to use the term "adequacy criterion" (or simply "testing criterion") to refer to selection methods [283,228].

Testing techniques and criteria have been defined to systematize the testing activity. A testing technique defines the source of information that will be used, and a testing criterion derives, from such a source of information, a set of testing requirements. Figure 3 depicts the relations between the several concepts presented so far. As can be observed, once a testing requirement is obtained, it can be used both for test-case generation and for test-case evaluation (coverage analysis). Given P, T, and a criterion C, it is said that the test suite T is C-adequate to test P if T contains test cases which, when executed against P, satisfy all the testing requirements established by C. As an example, we consider a testing criterion that requires the execution of each statement of the unit under testing. This criterion, called All-Statements, generates a list of all the unit's statements as testing requirements.
Based on such a list, the tester tries to find an input to execute a specific statement. On the other hand, the tester can be given several test cases and, in this case, needs to know whether such test cases fulfill
Fig. 3. Our view of the relations between software testing definitions
all the testing requirements demanded by the All-Statements criterion. In the first case, the tester is using the testing requirements to generate test cases. In the second case, the tester is using the testing requirements to evaluate already existing test cases. Once a test suite T satisfies all the requirements of a given testing criterion C, we say that such a test suite is adequate for that criterion, or simply that T is C-adequate.

Every time a test case is executed against a PUT, some statements of the product are executed while others are not. All the executed parts are said to be covered by the test case, and such coverage can be used as a measure of test-case quality. Based on the control flow or data flow of the PUT, different coverage measures can be defined. For instance, considering the criterion mentioned above, a given test suite T1 may cover 70% of the unit statements while a test suite T2 covers 100% of the statements (T2 is All-Statements-adequate). We observe that the concept of coverage does not apply only to source code. There may exist a functional criterion that requires specific parts of the product specification, rather than of its implementation, to be covered.

It is important to observe that, from a given test suite T that is C-adequate, it is possible in theory to obtain an infinite number of C-adequate test suites simply by including more test cases in T; that is, any test suite T' ⊇ T is also C-adequate. Obviously T' will contain test cases that are redundant with respect to the elements required by C, and this is not always desired, due to time and cost constraints on executing the complete test suite. There are also situations where we would like to find a minimum test suite Tm such that Tm is C-adequate. This process is called test-suite minimization [352]. As defined by Wong et al. [352], if T is a test suite for P, then |T| denotes the number of elements in T. If C is a testing adequacy criterion, C(T) denotes the set of testing requirements of P covered by T. A suite Tm is a minimal subset of T in terms of the number of test cases if, and only if, C(Tm) = C(T) and, for all T' ⊆ T such that C(T') = C(T), |Tm| ≤ |T'|. In more informal terms, considering a test suite T which covers a set of testing requirements demanded by C, a minimal test suite Tm corresponds to the smallest subset of T which covers the same set of testing requirements; there is no other subset of T with fewer elements than Tm covering the same set of testing requirements. As also mentioned by Wong et al. [352], the relative "cost" of a given test suite can be calculated in different ways. In the definition presented above we use the number of test cases, but the purpose of minimization is to reduce some associated cost of the test suite, for instance the computation time needed to execute it, rather than necessarily its number of elements.
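To make the idea concrete, the sketch below shows a naive greedy heuristic for test-suite reduction. It is our own illustration, not the algorithm of Wong et al. [352]; since exact minimization corresponds to a minimum set-cover problem and is computationally hard, a heuristic such as this one only approximates Tm. Test cases and testing requirements are represented as plain strings, which is of course a simplification.

import java.util.*;

public class GreedyReduction {

    // covered.get(t) is the set of testing requirements (e.g., statements) covered by test case t.
    static Set<String> reduce(Map<String, Set<String>> covered) {
        Set<String> stillUncovered = new HashSet<>();
        covered.values().forEach(stillUncovered::addAll);          // C(T)
        Set<String> reduced = new LinkedHashSet<>();
        while (!stillUncovered.isEmpty()) {
            String best = null;
            int gain = 0;
            for (Map.Entry<String, Set<String>> e : covered.entrySet()) {
                Set<String> newlyCovered = new HashSet<>(e.getValue());
                newlyCovered.retainAll(stillUncovered);
                if (newlyCovered.size() > gain) { gain = newlyCovered.size(); best = e.getKey(); }
            }
            reduced.add(best);                                      // keep the most "useful" test case
            stillUncovered.removeAll(covered.get(best));
        }
        return reduced;                                             // covers the same requirements as T
    }

    public static void main(String[] args) {
        Map<String, Set<String>> covered = new HashMap<>();
        covered.put("t1", Set.of("s1", "s2"));
        covered.put("t2", Set.of("s2", "s3", "s4"));
        covered.put("t3", Set.of("s1", "s3", "s4"));
        System.out.println(reduce(covered));                        // e.g., [t2, t1]
    }
}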
Another relevant question in this context is: given a C1-adequate test suite T, is there another criterion C2 which contributes to improving the quality of T? This question has been investigated from both theoretical and experimental perspectives. In general, we may state that the minimal properties which a testing criterion C should fulfill are as follows [228].

1. To guarantee, from the control-flow perspective, the coverage of all conditional branches.
2. To require, from the data-flow perspective, at least one use of every computational result.
3. To require a finite test suite.

Advantages and disadvantages of testing criteria can be evaluated through theoretical and experimental studies. From a theoretical point of view, the subsume relation and the complexity of the testing criteria are the most investigated aspects [339,283,175]. The subsume relation establishes a partial order between testing criteria, characterizing a hierarchy among them. It is said that a criterion C1 subsumes a criterion C2 if, for every program P, any C1-adequate test suite T1 is also C2-adequate, and there is some program P and some C2-adequate test suite T2 such that T2 is not C1-adequate. Section 3.2 of Chapter 2 presents the subsume relation for the main control- and data-flow testing criteria. Complexity is defined as the maximum number of test cases required to satisfy a criterion in the worst case. Considering the data-flow criteria, studies have shown that they have exponential complexity, which motivates experimental studies to determine their application cost in practice [228]. Some authors have also explored the efficacy of testing criteria from a theoretical point of view. They have worked on the definition of different relations between criteria to capture their capability to detect faults, since this capability cannot be expressed by the subsume relation [340,124].

From an experimental point of view, three aspects of testing criteria are commonly evaluated: cost, efficacy, and strength [341,239,266]. The cost reflects the effort required to use the criterion; it is in general measured by the number of test cases needed to satisfy the criterion. The efficacy corresponds to the capability of the criterion to detect faults. The strength refers to the probability of satisfying a given criterion C2 after having satisfied a criterion C1 [239]. An important research area known as experimental software engineering has emerged in an attempt to provide evidence of the advantages and disadvantages of methods, methodologies, techniques, and tools used during software development processes. Section 8 of Chapter 2 also provides information on experimental software engineering in the context of verification and validation activities.

Evaluation of the efficacy of test suites is commonly carried out using mutation testing. Mutation testing appeared in the 1970s at Yale University and at the Georgia Institute of Technology, strongly influenced by a classical method for digital circuit testing known as the "single fault test model" [126]. One of the first papers describing mutation testing was published in 1978 [101]. This criterion uses a set of products that differ slightly from the product P under testing, named mutants, in order to evaluate the adequacy of a test suite T. The goal is to find a set of test cases which is able to reveal the differences between
P and its mutants, making them behave differently. When a mutant is observed to behave differently from P, it is said to be "dead"; otherwise it is a "live" mutant. A live mutant must be analyzed to check whether it is equivalent to P or whether it can be killed by a new test case, thus promoting the improvement of T. Mutants are created based on mutation operators: rules that define the (syntactic) changes to be carried out in P in order to create the mutants. It is widely known that one of the problems with mutation testing is the high cost of executing a large number of mutants. Moreover, there is also the problem of deciding mutant equivalence, which is in general undecidable.
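As a small illustration (our own, using the Factorial method of Listing 1.2 and the JUnit 4 API), consider a mutant produced by replacing the relational operator in the loop condition, and the two test cases below. Executed against the mutant, the first test case leaves it alive, while the second one kills it.

// Original loop in Factorial.compute (Listing 1.2):  for (int k = 1; k <= x; k++) { r *= k; }
// Mutant (relational-operator replacement):          for (int k = 1; k <  x; k++) { r *= k; }

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class FactorialMutationTest {

    @Test
    public void leavesTheMutantAlive() throws Exception {
        // For x = 1 both the original program and the mutant return 1,
        // so this test case cannot distinguish them.
        assertEquals(1L, Factorial.compute(1));
    }

    @Test
    public void killsTheMutant() throws Exception {
        // For x = 3 the original returns 6, but the mutant stops the loop at k = 2
        // and returns 2; executed against the mutant, this test fails and kills it.
        assertEquals(6L, Factorial.compute(3));
    }
}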
3 A Generic Testing Process

A testing process involves several activities and related documents that should be produced before and during the software development process. In general, the testing process involves the following subprocesses: 1) Test Planning; 2) Test Project; 3) Test Execution; and 4) Test Record. Figure 4, adapted from [61], depicts these subprocesses, the activities to be carried out, and the related artifacts. The testing-process artifacts can be produced based on the IEEE Standard 829-1998 for Software Test Documentation [176], which provides general guidelines about the several kinds of documents to be produced during the testing process.

As Figure 4 shows, in the "Test Planning" subprocess all related testing activities are planned and documented in a test plan, which should contain information about which parts of the product will be tested, which test levels or phases will be covered (see the next section), which test techniques and criteria will be used in each test phase, the necessary resources, the operational environment, and the schedule of each task.
Fig. 4. Generic testing process (adapted from IEEE Standard 829-1998 [61])
Once the test plan is defined, it is refined during the "Test Project" subprocess, when the test cases are actually created, aiming at reaching the desired level of quality. Three documents can be generated during the test project: the "Test Project Specification" contains details about how the testing will be performed, the "Test Case Specification" document registers the test cases generated for each phase, and the "Test Procedure Specification" document details each step previously created and defined in the "Test Project Specification" document.

Once the test cases are defined, they can be run to collect the test results. This activity is performed during the "Test Execution" subprocess. For this purpose, the existence of testing tools is almost mandatory, since several of the tasks, if executed manually, are very error-prone and subject to human mistakes. Finally, the results produced during the test execution are documented in the "Test Daily Report" and, if any discrepancy with respect to the software specification is observed, the "Test Incident Report" is filled out and corrective actions can be taken. At the end, a "Test Summary Report" is produced, containing a synthesis of the test execution process.

All these testing activities have been studied for a long time, and it can be observed that the earlier they are integrated into the development process, the better. A well-known system development model which aggregates verification and validation (V&V) activities with the development process is the V-Model [279] (see Figure 5). This model suggests that at every development phase a corresponding verification and validation activity may be performed, the main benefit being the anticipation of fault detection as early as possible. It is called V-Model because it is inspired by the traditional waterfall development model, but the concept of integrating V&V activities into the development process can be equally applied to any development model. For instance, once the product specification has been completed, if such a specification is based on a formal notation or on a state-machine-based model, it is possible to
Fig. 5. Software development versus test execution: V-Model (adapted from [61,301])
use the techniques presented in Chapters 7 and 3 to automate the generation of test cases, which can be used to test the specification itself and also, later, its implementation. To systematize test-case generation, it is important to combine two or more testing criteria, aiming at selecting elements from several input subdomains and thus maximizing the chance of exposing faults in the PUT.

As mentioned before, the implementation of tools supporting the application of a testing criterion is of great importance in helping the tester reduce the cost associated with the testing activity and in making this activity less error-prone and more productive. Moreover, testing tools can be used for teaching testing concepts and for technology transfer. They also support experimental studies that compare different testing criteria, helping in the definition of incremental, effective, and low-cost testing strategies.
4 Types of Testing

Software requirements are usually classified as functional and non-functional. This naturally induces a corresponding classification of the testing techniques related to them. Alternatively, another classification has often been used in the literature: functional versus structural testing. However, this leaves out a very important group of testing techniques: the ones devoted to non-functional properties. In the sequel, we present an alternative classification that is more helpful in identifying the different goals of the strategies presented in this book.

Testing for functional properties. Strategies in this group focus on testing functional requirements, at the different phases of the development process (as presented in Section 3), that describe how the PUT should react to particular stimuli. The behaviour of the system is observed, usually by exercising a usage scenario, to decide whether the test is successful or not (whether the system behaves as expected or not). Functional testing strategies included in this group are the ones presented in this book in Chapters 3 and 4 (model-based testing), and Chapters 6, 7 and 8 (specification-based testing).

Testing for non-functional properties. This involves checking constraints under which the system should operate. Testing strategies here are often applied at system level only, since the properties are often emergent ones, such as performance, timing, safety, security, and quality standards. Chapter 4 presents techniques that can be used to evaluate these properties in the context of product-line engineering and to sort out conflicts between them based on an inspection guided by tests and architecture analysis. Testing for design properties has also been investigated in the literature, with the goal of either detecting or anticipating design faults that may lead to software failures. This is discussed in Chapter 8.

Testing for structural properties. Since testing is applied at different phases during the software development process, adequate coverage of the structure is often required. This group includes structural testing techniques that rely on either checking or covering structures from the architecture down to the code of the system. These are discussed in Chapter 2. The criteria commonly used for structural testing are also often used to evaluate the quality of a test suite in general.
5 Levels or Phases of Testing

The development process is divided into several phases, allowing the system engineer to implement the solution step by step, and so is the testing activity. The tester is thus able to concentrate on different aspects of the software and to use different testing criteria in each phase [215]. In the context of procedural software, the testing activity can be divided into four incremental phases: unit, integration, system, and acceptance testing [279]. Variations of this pattern exist for object-oriented and component-based software, as will be discussed later.

Unit testing focuses on each unit to ensure that its algorithmic aspects are correctly implemented. The aim is to identify faults related to logic and implementation in each unit. In this phase structural testing is widely used, requiring the execution of specific elements of the control structure of each unit. Mutation testing is also an alternative for unit testing; it is discussed later, in Section 7.

In this phase it is common to have to develop drivers and stubs (Figure 6). If we consider that F is the unit to be tested, the driver is responsible for coordinating its testing: it gathers the data provided by the tester, passes it to F in the form of arguments, collects the results produced by F, and shows them to the tester. A stub is a unit that replaces another unit used (called) by F during unit testing. Usually, a stub simulates the behavior of the replaced unit with minimal computation effort or data manipulation. The development of drivers and stubs may represent a high overhead for unit testing. Tools like the traditional xUnit frameworks, such as JUnit [236], may provide the test driver for the PUT, with the advantage of also providing additional facilities for automating the test execution; a small sketch of this arrangement is shown after Figure 6.
Fig. 6. Required environment for unit testing
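The sketch below is our own illustration of these roles, with entirely hypothetical names: PriceCalculator plays the part of the unit under testing F, TaxService is a unit it calls, the lambda is a stub that replaces that unit with a canned answer, and the JUnit 4 test class acts as the driver.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// The "used unit" that will be replaced by a stub during unit testing.
interface TaxService {
    double rateFor(String region);
}

// The unit under testing (F); it depends on TaxService.
class PriceCalculator {
    private final TaxService taxes;
    PriceCalculator(TaxService taxes) { this.taxes = taxes; }
    double finalPrice(double net, String region) {
        return net * (1.0 + taxes.rateFor(region));
    }
}

public class PriceCalculatorTest {   // the JUnit test class plays the driver role

    // Stub: simulates the behaviour of the used unit with minimal computation.
    private final TaxService stub = region -> 0.10;

    @Test
    public void appliesTaxRateObtainedFromCollaborator() {
        PriceCalculator f = new PriceCalculator(stub);
        assertEquals(110.0, f.finalPrice(100.0, "any-region"), 0.001);
    }
}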
Once each unit has been tested, the integration phase begins, and with it integration testing. However, is integration testing really necessary? Why should a product built from previously tested units not work adequately? The answer is that unit testing has limitations and cannot ensure that each unit functions correctly in every possible situation. For example, a unit may suffer from the adverse influence of another unit; subfunctions, when combined, may produce unexpected results; and global data structures may raise problems.

After being integrated, the software works as a whole and must be submitted to system testing. The goal is to ensure that the software and the other elements that are part of the system (hardware and databases, for instance) are adequately combined and that adequate function and performance are obtained. Functional testing has been the technique most used in this phase [279].

Acceptance testing is, in general, performed by the user, who checks whether the product meets the expectations. Functional testing is also the most widely used technique for acceptance testing.

All the previous kinds of tests are run during the software development process, and the information obtained from them is useful for other software engineering activities, such as debugging and maintenance. Any change required in the software after its release demands that some tests be rerun to make sure the changes did not introduce any side effect into the previously working functionalities. This kind of testing is called regression testing. The ideal situation is to validate the changed project by rerunning all previous test cases that execute the modified parts of the code. Several works provide guidance on how to select a subset of the previous test suite for regression testing [202,353,270,269].
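The idea of rerunning only the test cases that execute the modified parts of the code can be sketched as follows; this is our own naive illustration with hypothetical names, not one of the selective regression techniques cited above.

import java.util.*;

// Rough sketch of coverage-based regression test selection: rerun only the test cases whose
// recorded coverage from the previous version intersects the set of modified code elements.
public class RegressionSelection {

    static Set<String> testsToRerun(Map<String, Set<String>> coverage, Set<String> modified) {
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : coverage.entrySet()) {
            for (String element : entry.getValue()) {
                if (modified.contains(element)) {
                    selected.add(entry.getKey());   // this test exercised a changed element
                    break;
                }
            }
        }
        return selected;
    }
}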
6 Domain Specific Approaches

Research on software testing also focuses on domain-specific concerns. The reason is that the task of testing is rather complex, and its success usually depends on exploiting particular features of the PUT. On the one hand, the tester can usually run only very few test cases, which therefore need to be carefully selected. On the other hand, the features of each domain usually determine what is most important to be tested. Some domains that have classically been considered, and for which a number of approaches have been developed, are briefly described in the sequel. This list is not exhaustive, because the aim here is only to classify the different approaches presented in this book.

Object-oriented software testing. Approaches focus on concepts and properties related to classes and objects, such as interfaces, inheritance, polymorphism, dynamic binding, dependencies between classes and methods, and the life cycle of objects. Techniques and tools have been developed for exploring both functional and structural aspects of object-oriented software. For instance, Chapter 6 presents a tool for unit testing based on the code and on assertions to be met. Chapter 4, in turn, discusses functional issues of object-oriented testing in the context of product lines. Moreover, Chapter 7 presents fundamental concepts on testing from algebraic specifications, which have been used as the basis for a number of specification-based testing approaches for object-oriented software, as well as for interface testing in general. Finally, the testing theory presented in Chapter 8 has also been applied to test generation from OCL constraints.

Component-based software testing. In this case, approaches are usually integrated with component-based software development methods, where a component can be defined as a unit of composition that is identified by a contract specification. The concerns here are more abstract than the ones considered by object-oriented testing. Testing is aimed at checking whether contracts are satisfied, independently of the internal structure of the component. A component is committed to fulfilling its behaviour as long as the dependencies it requires can also be fulfilled by the environment where it is deployed. Therefore, both the component and the environment have to be tested. Chapter 3 briefly reviews a test-case generation strategy based on Markov chains for component testing.
Product-line software testing. Independently of the technology used, the main challenge here is how to define, select, and maintain test suites that are appropriate for each member of a family of products, starting from general test suites and taking variabilities into account. The main concerns are how the different variation mechanisms, as well as the architectural design, can guide test-case selection. Chapter 4 presents an overview of testing in product lines along with an example.

Reactive-systems testing. Reactive systems interact with their environment by accepting inputs and producing outputs. These systems may be composed of a number of concurrent processes and networked distributed services, where interruptions of a flow of execution can occur at any time. Testing approaches for these systems are usually defined at system level, where observable behaviours are the information used to decide on the success or failure of a test. The main concerns are related to synchronisation, scheduling, timing, and interruption, as well as to properties such as livelock and deadlock. Chapters 3 and 7 present fundamental concepts and test-case generation algorithms based on input-output labelled transition systems as the specification formalism. Chapter 3 also presents approaches for dealing with interruptions, particularly considering process algebras.
7 Test Selection

Test-case selection is a key activity that is executed at different stages of the software life cycle. The problem of selecting the most cost-effective test suite has been a major research subject in the software testing area since the 1970s. Strategies ranging from ad-hoc and deterministic selection to fully automated solutions have been proposed and applied in industry. They usually focus on specific criteria that guide the tester in measuring the adequacy of a test suite and also in deciding when to stop testing. When selection strategies are applied to define a representative test suite from another one, the term test-suite reduction is more commonly used [287]. Strategies for test-case selection can be classified according to the kind of criteria they address. A quick overview of them is given in the sequel.

Based on faults. Fault detection is the main goal of the tester who opts for this kind of strategy. Fault-based criteria focus on detecting certain kinds of faults that are usually defined by a fault model [41]. The selected test cases are the ones that are capable of revealing these faults. A popular technique in this category is mutation testing. Chapter 8 presents a theory of fault-based testing that uses refinement techniques and applies mutation testing at the specification level to anticipate faults that may be introduced in the design of the software. Fault-based testing, particularly mutation testing, is also covered in Chapter 2.

Based on structural coverage. These strategies are based on structural coverage criteria, such as control-flow, data-flow, and transition-based criteria, measured at code level (white-box testing) and also at the level of an abstract model of the application (model-based testing). Structural criteria were originally defined for white-box testing. Model-based testing has inherited and extended them to express aspects of the software that need to be tested, aiming not at covering code fragments but at covering the high-level structure of the specification [324]. The test-generation algorithms presented in Chapter 3
focus on transition coverage of test models, whereas Chapter 2 presents commonly applied structural coverage criteria focusing on control flow and data flow.

Based on property coverage (test purpose). Given a specification and a PUT, any testing activity is, explicitly or not, based on a satisfaction relation (often called a conformance relation): does the PUT satisfy the specification? The tests are derived from the specification on the basis of the satisfaction relation, and often on the basis of some additional knowledge of the PUT and of its operational environment, called the testability hypothesis. Property-based testing is a kind of conformance testing where the selection criterion is associated with a given property that needs to be checked in the software [225], and that is verified on the specification if it exists there. This property is often represented as a test purpose that targets testing at a particular functionality or behaviour, limiting the scope of the specification from which test cases are selected. In other words, test purposes can be used to focus test selection on specific parts of a specification. Test-selection strategies based on test purposes are presented in Chapter 3. Test purposes are often expressed in a formal notation so that automatic test selection is possible. Moreover, test purposes are often defined from test objectives, which are descriptions, at the requirements level, of what should be tested. However, they can also represent safety and security properties to be checked.

Based on controlling the size of the test suite. The main goal is to select a minimum yet representative set of test cases from a test suite. The algorithms applied are usually based on statistics, heuristics, and coverage of test requirements, such as how to choose the minimum suite that achieves a given coverage criterion [158,287]. They can be classified according to their goal as follows.

– Test-case selection: algorithms that select a subset of the original test suite to achieve a given goal. The subset may not have the same coverage capability as the original test suite [25].
– Test-suite reduction: algorithms that select a representative subset of the original test suite that provides the same coverage as the original suite [287].
– Test-case prioritisation: algorithms that schedule test cases for regression testing in an order that attempts to maximise some objective function [288].

Chapter 3 presents a test-case selection strategy based on the similarity of test cases: the idea is, given a target number of test cases, to select the most different ones. That chapter also reviews work on statistical testing, where probabilities assigned to the transitions of a Markov chain guide the random selection of the most representative test cases.

Based on formal specifications. Sentences in a formal specification may guide test-case selection by precisely pointing out behaviour that needs to be tested. From a different perspective than traditional structural coverage, the main goal here is to cover all possible behaviours that can be observed from the specification. A technique called unfolding of sentences is usually applied, such as the one presented in Chapter 7 in the context of algebraic specifications; the idea is to replace a function call by its definition. Since this may lead to an infinite number of test cases, selection hypotheses are usually considered to make it possible for a finite test suite to be selected.
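Returning to the size-control strategies above, the sketch below illustrates the general idea of similarity-based selection: it greedily picks, one at a time, the candidate test case that is least similar (Jaccard similarity over the sets of model transitions it covers) to the ones already chosen. This is our own simplified illustration, with invented names, and not the algorithm presented in Chapter 3.

import java.util.*;

public class SimilaritySelection {

    // Jaccard similarity between the transition sets covered by two test cases.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Greedily pick n test cases, always adding the candidate least similar to those already chosen.
    static List<Set<String>> select(List<Set<String>> candidates, int n) {
        List<Set<String>> chosen = new ArrayList<>();
        chosen.add(candidates.get(0));                          // arbitrary seed
        while (chosen.size() < n && chosen.size() < candidates.size()) {
            Set<String> best = null;
            double bestWorstSimilarity = Double.MAX_VALUE;
            for (Set<String> c : candidates) {
                if (chosen.contains(c)) continue;
                double worst = 0.0;                             // similarity to the closest chosen test
                for (Set<String> s : chosen) worst = Math.max(worst, jaccard(c, s));
                if (worst < bestWorstSimilarity) { bestWorstSimilarity = worst; best = c; }
            }
            chosen.add(best);
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Set<String>> tests = List.of(
            Set.of("t0->t1", "t1->t2"), Set.of("t0->t1", "t1->t3"), Set.of("t0->t4"));
        System.out.println(select(tests, 2));   // picks the seed and then the most dissimilar candidate
    }
}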
8 Final Remarks

This chapter presents basic terminology and concepts that constitute essential knowledge for understanding the remaining chapters of this book. Standard classifications as well as novel ones are presented to put into context the different aspects of the modern practice of software testing.

The scope and importance of testing have been increasingly recognised in the last decades. From a terminal activity of the development process that received little attention, software testing has evolved into an essential activity that starts at the beginning of the development process to support planning, specification, and design. It is now a way of systematically investigating the problem domain and design issues, by aiming at testability and, consequently, feasibility, as well as a way of establishing concrete goals towards the validation of the proposed solutions as early as possible in the development process. The practice of software testing has become more and more feasible and effective due to the maturity of current methods for the selection, maintenance, and evaluation of test suites, which are usually supported by tools. Therefore, current research in the area has mostly been devoted to specific domains, empirical evaluation of solutions, integration of solutions with development processes, tool automation, and technology transfer.

Besides the classical task of detecting faults, current practices of software testing, mainly the ones that bring it close to the development process, also have a great influence on the final quality of the development process, through the early detection of faults in specification and design documents during test planning and specification. This is due to the fact that the testing perspective is usually more investigative and rigorous than the development one, and it is more likely to detect lack of completeness, consistency, and conformance between different artifacts. This book presents interesting approaches that can be applied in practice to achieve these benefits.
Functional, Control and Data Flow, and Mutation Testing: Theory and Practice

Auri Vincenzi (1), Márcio Delamaro (2), Erika Höhn (2), and José Carlos Maldonado (2)

(1) Instituto de Informática, Universidade Federal de Goiás, Goiânia, Goiás, Brazil. [email protected]
(2) Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil. {delamaro,hohn,jcmaldon}@icmc.usp.br
The growth of user requests for higher software quality has motivated the definition of methods and techniques to improve the way software is developed. Several works have investigated a variety of testing criteria in an attempt to obtain a testing strategy with lower application costs and higher efficacy in detecting faults. The aim of this chapter is to present the theoretical and practical aspects related to the software testing activity. A synthesis of functional, structural, and fault-based testing techniques is presented. A comparison of the testing criteria (cost, efficacy, and strength) is also considered from the theoretical and experimental points of view. The importance of testing automation is discussed, characterizing the main efforts of the testing community in this direction. Emphasis is given to state-of-practice tools; efforts from academia are also discussed. The testing activity and its related problems are presented and illustrated through practical examples with the support of different testing tools, to provide information about software testing in terms of both theory and practice. The need for a systematic evaluation of these criteria and tools from the perspective of experimental software engineering is also discussed. After a short introduction, Section 2 briefly describes a software product named Identifier, which is used to illustrate the testing concepts discussed in this chapter. In Section 3, a synthesis of the functional, structural, and fault-based testing techniques is presented. Sections 4 to 6 provide some examples of different testing criteria, considering the three testing techniques. We comment on their complementarity and show how they can be used in an incremental testing strategy. The importance of testing automation is also discussed in these sections, characterizing the main efforts of the testing community in this direction. In Section 7 the need for a systematic evaluation of these criteria and tools from the perspective of experimental software engineering is discussed. We illustrate that section by describing an experimental evaluation, against mutation testing, of test suites generated randomly, functionally, and by Pex [263] (described in detail in Chapter 5). Finally, Section 8 presents the final considerations of this chapter.
1 Introduction

Regarding the functional techniques mentioned in Chapter 1, testing requirements are obtained from the software specification (see Figure 2 in Chapter 1). The structural
techniques use implementation features to obtain such requirements, and the fault-based technique uses information about common faults that may appear during development. It is important to notice that these testing techniques are complementary, and the question to be answered is not "Which is the one to be used?" but "How can they be used in a coordinated way, taking advantage of each other?".

Each of the above-mentioned techniques has several criteria that define specific requirements which should be satisfied by a test suite. In this way, the requirements determined by a testing criterion can be used either for test-suite evaluation or for test-suite generation. But why is a testing criterion required? The answer is that, as discussed in Chapter 1, executing the PUT with its entire input domain is not always possible or practical, because the input and output domains may be infinite or too large. Testing techniques and criteria are mechanisms available to assess testing quality. In this scenario, a testing criterion helps the tester subdivide the input and output domains and provides a systematic way to select a finite number of test cases that satisfy the criterion. The objective is to create the smallest test suite whose output indicates the largest set of faults.

In general, the application of a testing criterion without the support of a testing tool is an unproductive and error-prone activity. Moreover, the existence of a testing tool is very useful in conducting experimental studies, teaching testing concepts, and transferring technology. Regardless of the kind of software and the way it is produced, the use of software testing techniques, criteria, and supporting tools is crucial to ensure the desired level of quality and reliability.

Since 2007, a large project called QualiPSo (www.qualipso.org) has been under development. QualiPSo is a unique alliance of European, Brazilian, and Chinese industry players, governments, and academics that was created to help industries and governments fuel innovation and competitiveness with Open Source Software (OSS). In order to meet this goal, the QualiPSo consortium intends to define and implement technologies, processes, and policies that facilitate the development and deployment of OSS components with the same level of reliability traditionally offered by proprietary software. QualiPSo is the largest Open Source initiative ever funded by the European Commission, under the EU's sixth framework programme (FP6), as part of the Information Society Technologies (IST) initiative. As part of the Brazilian group working in the context of QualiPSo, we intend to cooperate by making our own tools, such as JaBUTi, presented in this chapter, available as OSS to increase their use by both OSS and non-OSS development communities [282].
2 The Illustrative Example

We use the same didactic example throughout this chapter: the Identifier program that was already mentioned in Example 4 of Chapter 1. Though simple, the example contains the necessary elements to illustrate the concepts used in this chapter. Furthermore, we also provide some exercises using a more complex program.
> cd identifier
identifier> java -cp bin identifier.Identifier
Usage: identifier.Identifier <string>

identifier> java -cp bin identifier.Identifier "abc12"
Valid

identifier> java -cp bin identifier.Identifier "cont*1"
Invalid

identifier> java -cp bin identifier.Identifier "1soma"
Invalid

identifier> java -cp bin identifier.Identifier "a123456"
Invalid
Listing 2.1. Executing identifier.Identifier
The Identifier program implements the following specification: "The program determines whether a given identifier is valid or not in a variant of Pascal, called Silly Pascal. A valid identifier must begin with a letter and contain only letters or digits. Moreover, it must be at least one and at most six characters long." Listing 2.1 shows five executions of a possible implementation of this specification in Java, shown in Listing 2.2. Observe that, with the exception of the first call, the implementation behaves according to the specification for all executions, thus judging the four identifiers correctly.

The Identifier class implemented in Listing 2.2 has four methods. It has a known number of faults, which will be used to illustrate the adequacy and effectiveness of various testing criteria. The validateIdentifier method is the most important one, since it is responsible for deciding whether a given String corresponds to a valid or an invalid identifier, returning true or false, respectively. The valid_s and valid_f methods are utility methods used to decide whether the starting character and each of the following characters, taken one by one, are valid according to the specification. Finally, the main method allows the program to be called from the command line with a single parameter, a string, printing "Valid" or "Invalid" depending on whether the program judges the given string to be a valid or an invalid identifier.
3 Testing Techniques

3.1 Functional Testing

Functional or black-box testing is so named because the software is handled as a box with unknown content: only the external side is visible. Hence the tester basically uses the specification to obtain testing requirements or test data, without any concern for the implementation [256]. A high-quality specification that matches the client's requirements is fundamental to support the application of functional testing. Examples of such criteria [279] are equivalence partition, boundary value, cause-effect graph, and the category-partition method [265].
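As an illustration of how functional test requirements translate into executable tests, the JUnit sketch below exercises the Identifier program described in Section 2 with one representative input per equivalence class suggested by the specification; the inputs and expected verdicts are taken from the specification and from Listing 2.1. This is our own sketch (assuming JUnit 4 and a test-class name of our choosing), not the equivalence-partition suite referred to later in the chapter.

    import static org.junit.Assert.*;
    import org.junit.Test;
    import identifier.Identifier;

    public class IdentifierFunctionalTest {
        private final Identifier id = new Identifier();

        @Test public void letterThenLettersOrDigits() {   // valid: starts with a letter, length between 1 and 6
            assertTrue(id.validateIdentifier("abc12"));
        }
        @Test public void startsWithDigit() {             // invalid: must begin with a letter
            assertFalse(id.validateIdentifier("1soma"));
        }
        @Test public void containsInvalidCharacter() {    // invalid: only letters and digits are allowed
            assertFalse(id.validateIdentifier("cont*1"));
        }
        @Test public void longerThanSixCharacters() {     // invalid: at most six characters
            assertFalse(id.validateIdentifier("a123456"));
        }
    }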
public boolean validateIdentifier(String s) {
    char achar;
    boolean valid_id = false;
    if (s.length() > 0) {
        achar = s.charAt(0);
        valid_id = valid_s(achar);
        if (s.length() > 1) {
            achar = s.charAt(1);
            int i = 1;
            while (i < s.length() - 1) {
                achar = s.charAt(i);
                if (!valid_f(achar))
                    valid_id = false;
                i++;
            }
        }
    }

    if (valid_id && (s.length() >= 1) && (s.length() < 6))
        return true;
    else
        return false;
}

public boolean valid_s(char ch) {
    if (((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z')))
        return true;
    else
        return false;
}

public boolean valid_f(char ch) {
    if (((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z'))
            || ((ch >= '0') && (ch <= '9')))
        return true;
    else
        return false;
}

Listing 2.2. The Identifier class (fragment): the validateIdentifier, valid_s, and valid_f methods

 4  /* 01 */  public boolean validateIdentifier(String s) {
 5  /* 01 */      char achar;
 6  /* 01 */      boolean valid_id = false;
 7  /* 01 */      if (s.length() > 0) {
 8  /* 02 */          achar = s.charAt(0);
 9  /* 02 */          valid_id = valid_s(achar);
10  /* 02 */          if (s.length() > 1) {
11  /* 03 */              achar = s.charAt(1);
12  /* 03 */              int i = 1;
13  /* 04 */              while (i < s.length() - 1) {
14  /* 05 */                  achar = s.charAt(i);
15  /* 05 */                  if (!valid_f(achar))
16  /* 06 */                      valid_id = false;
17  /* 07 */                  i++;
18                        }
19                    }
20                }
21
22  /* 08 */ /* 09 */ /* 10 */
                  if (valid_id && (s.length() >= 1) && (s.length() < 6))
23  /* 11 */          return true;
                  else
24  /* 12 */          return false;
25              }
Listing 2.4. Commented validateIdentifier method

Table 2. Testing requirements and criterion relationship based on the CFG (Figure 2)

    Element                          Testing requirement              Criterion
    Node                             6                                All-Nodes
    Edge                             (8,12)                           All-Edges
    Loop                             (4,5,6,7,4)                      Boundary-Interior
    Path                             (1,2,3,4,8,12)                   All-Paths
    Variable definition              valid_id = false                 All-Defs
    Predicative use of a variable    while (i < s.length() - 1)       All-P-Uses

We could remove the logical condition (s.length() >= 1) from the statement condition at line 22 (see Listing 2.4) with no side effect on the product functionality. However, in the present text we maintain this "redundant" logical condition for teaching purposes.

As previously mentioned, this relation between testing criteria defines the so-called subsume relation. For the control-flow-based testing criteria, the strongest criterion is All-Paths, which includes the All-Edges criterion, which, in its turn, includes All-Nodes. Later in this section we provide more details of the subsume relation.
Due to the complementary aspects of the testing techniques, the test suite resulting from the application of functional testing criteria may be used as the initial test suite for structural testing. Since, in general, such a test suite is not enough to satisfy a structural criterion, new test cases are generated and included in the test suite until the desired level of coverage is reached, hence exploring their complementarities. A problem related to the structural testing criteria is the general impossibility of automatically determining whether a given path is feasible, that is, there is no general-purpose algorithm which, given any complete path, decides whether such a path is executable and, if so, what input values cause the execution of such a path [326]. Therefore, tester intervention is required to determine both the feasible and the infeasible paths of the PUT.

Data-flow-based testing criteria. The introduction of these criteria "bridges the gap" between All-Paths and All-Edges, in an attempt to make the test more rigorous since, in general, the All-Paths criterion is not practical due to the infinite number of possible paths. According to Ural [323], data-flow-based criteria are more adequate for detecting some classes of faults, such as computational faults. Once data dependencies are identified, functional segments of the product are required to be exercised by the testing requirements.

Rapps & Weyuker proposed the use of a Definition-Use Graph (Def-Use Graph, for short), or DUG, which consists of an extension of the CFG [283]. The DUG contains information about the data flow of the PUT, characterizing associations between statements in which a value is assigned to a variable (known as a "definition") and statements in which the value of that variable is used (known as a "use"). For instance, by taking the CFG of Figure 2 and extending it with information about the variable definitions and uses present in the implementation of the validateIdentifier method (see Listing 2.4), we obtain the DUG presented in Figure 3.

More generally, the occurrence of a variable in a program can be classified as a variable definition, a variable use, or an undefined variable. Usually, the different types of occurrences of variables are defined by a data-flow model. Considering the model defined by Maldonado [228], a variable definition occurs when a value is stored in a memory location. In general, in a product implementation, the occurrence of a variable is a definition when the variable is on the left side of an assignment statement, is an input parameter, or is an output parameter passed by reference during a unit call. The occurrence of a variable is a use when it is not a definition. There are two different types of uses: computational use (c-use) and predicative use (p-use). The former directly affects the computation being performed or allows the result of a previous definition to be observed; the latter directly affects the control flow of the product implementation. We note that a c-use is associated with a node, whereas a p-use is associated with the outgoing edges of a node. A variable is considered undefined when it is not possible to access its value or when its location in memory is not defined.

If x is a variable occurring in a given program, a path c = (i, n1, ..., nm, j), m >= 0, containing no definition of x in nodes n1, ..., nm is called a definition-clear path (def-clear path) with respect to x from node i to node j and from node i to the edge (nm, j).
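To make these definitions concrete, the fragment below re-annotates an excerpt of the loop in the validateIdentifier method of Listing 2.4; the comments classifying each occurrence are ours and follow the data-flow model just described.

    int i = 1;                     // definition of i
    while (i < s.length() - 1) {   // p-use of i and of s: the predicate decides which outgoing edge is taken
        achar = s.charAt(i);       // c-uses of s and i; definition of achar
        if (!valid_f(achar))       // p-use of achar
            valid_id = false;      // definition of valid_id
        i++;                       // c-use of i immediately followed by a new definition of i
    }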
A node i has a global definition of a variable x if there is a definition of x in i and there is a def-clear path from i to a node or an edge with a c-use or a p-use of x. A c-use of x is a
Fig. 3. Def-use graph of validateIdentifier method
global c-use if there is no definition of x in the same node preceding the c-use. A data-flow association is represented by a triple x, i, j when it corresponds to a c-use, and by a triple x, i, (j, k) when it corresponds to a p-use, where x is a variable, i is a node containing a global definition of x, and j / (j, k) is a node/edge with a c-use/p-use of x.

Example 2. In order to exemplify the previous concepts, we consider the DUG of Figure 3 and the corresponding source code of the validateIdentifier method in Listing 2.4. There are three different kinds of information annotated on the DUG: a set of defined variables d assigned to each node, variables in a c-use assigned to nodes, and variables in a p-use assigned to edges. We observe that at node 1 there are definitions of the variables s and valid_id, since at lines 4 and 6, respectively, there are statements causing such definitions. At nodes 2 and 5 there are c-uses of the variables achar (line 9), s, and i (line 14). Finally, we have p-uses of variables assigned to the outgoing edges of decision nodes. For instance, node 4 is a decision node and has a p-use of the variables i and s, which means that these variables decide which edge is going to be taken, (4, 5) or (4, 8), due to the conditional statement at line 13.

To give an example of a c-use association, we consider the variable s defined at node 1. There is a c-use of this variable at node 3, represented by the association s, 1, 3. Similarly, valid_id, 1, (8, 9) and valid_id, 1, (8, 12) correspond to the p-use associations of the variable valid_id defined at node 1 and its p-uses on edges (8, 9) and (8, 12). The path (1,8,12) is a def-clear path with respect to valid_id defined at node 1 that covers the association valid_id, 1, (8, 12). On the other hand, the path (1,2,8,12) is not a def-clear path with respect to valid_id defined at node 1, since there is a redefinition of valid_id at node 2; in this way, when reaching the p-use of valid_id at edge (8, 12), such a use does not correspond to the value of valid_id defined at node 1 but to the one defined at node 2. In order to
cover the data-flow association with respect to the definition of valid_id at node 1, we have to find a test case that follows a complete path that does not pass through node 2.

The most basic data-flow-based criterion is the All-Defs criterion, which is part of the Rapps & Weyuker family of criteria [283]. Among the remaining criteria of this family, the most used and investigated is the All-Uses criterion.

– All-Defs: requires a data-flow association for each variable definition to be exercised at least once by a def-clear path with respect to a c-use or a p-use.
– All-Uses: requires all data-flow associations between a variable definition and all its subsequent uses (c-uses and p-uses) to be exercised by at least one def-clear path.

Example 3. To exercise the definition of the variable valid_id at node 6, according to the All-Defs criterion, one of the following paths can be executed: (6,7,4,8,12) or (6,7,4,8,9). However, we must bear in mind that the path (6,7,4,8,9) is infeasible, as are all complete paths that include it. Therefore, considering this example, there is only one way to satisfy the testing requirement: an executable path passing through (6,7,4,8,12); the test case (1#@, Invalid), for instance, follows such a path. In order to satisfy the All-Defs criterion, this analysis has to be done for each variable definition in the PUT. For the All-Uses criterion, with respect to the same definition, the following associations are required: valid_id, 6, (8, 9) and valid_id, 6, (8, 12). As previously discussed, the association valid_id, 6, (8, 9) is infeasible and can be dropped, while the other can be covered by the same test case. This analysis has to be carried out for all other variable definitions and their corresponding def-use associations in order to satisfy the All-Uses criterion.

The majority of the data-flow-based testing criteria that require the coverage of a certain element (path, association, and so on) demand the explicit occurrence of a variable use and do not necessarily guarantee the subsumption of the All-Edges criterion in the presence of infeasible paths, which occur frequently. With the introduction of the concept of potential use, a family of Potential-Uses testing criteria was defined [228]. What distinguishes such criteria from the ones mentioned before is that they introduce testing requirements regardless of the explicit occurrence of a use with respect to a given definition. It is enough that a use of such a variable "may exist": the existence of a def-clear path with respect to a given variable up to some node or edge characterizes a potential use, and an association is then required. Similarly to the remaining data-flow-based criteria, the Potential-Uses criteria use the DUG as the basis for deriving the set of testing requirements (potential-use associations). Actually, all that is needed is an extension of the CFG, called the Definition Graph (DEG), in which each node contains the information about its set of defined variables. Figure 4 illustrates the DEG for the validateIdentifier method.

The All-Pot-Uses criterion requires, for each variable x defined in a node i, that at least one def-clear path with respect to x from i to every node and every edge that can be reached from i be exercised.

Example 4. For instance, the potential associations s, 1, 6, achar, 3, (8, 9), and achar, 3, (8, 12) are required by the All-Pot-Uses criterion, but are not needed by
Fig. 4. Def graph of validateIdentifier method
the other data-flow-based criteria. Moreover, since each data-flow association is also a potential data-flow association, the associations required by the All-Uses criterion are a subset of the potential associations required by the All-Pot-Uses criterion. In other words, All-Pot-Uses subsumes All-Uses by definition.

The subsume relation is an important property of testing criteria and is used to evaluate them from a theoretical point of view. As discussed before, the All-Edges criterion, for instance, subsumes the All-Nodes criterion, that is, any All-Edges-adequate test suite is also necessarily All-Nodes-adequate, but the opposite does not hold. When it is not possible to establish such a subsume order between two criteria, such as the All-Defs and All-Edges criteria, the criteria are said to be incomparable [283]. It is important to observe that the Potential-Uses criteria are the only data-flow-based criteria that, even in the presence of infeasible paths, satisfy the minimum properties required of a test criterion, and no other data-flow-based testing criterion subsumes them. Figure 5 depicts both situations with respect to the subsume relation. In Figure 5(a) the relationship among the criteria does not consider the presence of infeasible paths, which, as mentioned previously, is quite common in the majority of real product implementations. On the other hand, Figure 5(b) presents the subsume relation considering the presence of infeasible paths.

As previously mentioned, one of the disadvantages of structural testing is the existence of required infeasible paths. There is also the problem of missing paths: when a given functionality is not implemented, the structural testing criteria cannot select testing requirements to test such a functionality, because there is no corresponding implementation and, therefore, no test case is required to test it. Nevertheless, such criteria rigorously establish the testing requirements to be satisfied by the test suite in terms of paths, def-use associations, or other structural elements, allowing the objective
Fig. 5. The relationship among data-flow-based testing criteria. (a) Does not consider infeasible paths; (b) considers infeasible paths.
measure of the adequacy of a test suite in testing a product. The strict definition of structural testing criteria facilitates the automation of their application.

5.2 Automation

In this section we discuss the basic use of control-flow- and data-flow-based testing criteria. In order to illustrate the use of control-flow-based testing, we use a tool named Emma (emma.sourceforge.net), which supports the use of the All-Nodes criterion. As with JUnit, there are several testing tools available, open source or not. We employ Emma because it is easier to use and allows full integration with JUnit. Moreover, it can also be integrated with IDEs via plug-ins; in the case of Eclipse, for instance, the EclEmma plug-in (www.eclemma.org) allows such an integration. Below we describe the basic functionalities of Emma version 2.0.5312 and how to use it from the command line, since our intention is to present the concepts of the tool rather than how to use it in a particular IDE. To show the complementarity between control- and data-flow-based testing criteria, we use JaBUTi (incubadora.fapesp.br/projects/jabuti), which supports the application of the All-Nodes, All-Edges, All-Uses, and All-Pot-Uses criteria for Java bytecode [330].

Emma testing tool. Emma supports the application of the All-Nodes criterion. One of its benefits is that it can be integrated with JUnit, so that the tester is able to evaluate how much of the PUT was executed by a particular JUnit test suite.
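For instance, the JUnit suite could be run under Emma directly from the command line, roughly as shown below. The command uses emmarun, Emma's instrumenting launcher described next; the option names are based on our reading of Emma's documentation, and the test-class name is the hypothetical one used earlier, so both should be adapted to the actual environment.

    # run a JUnit 4 suite under Emma and produce an HTML coverage report (illustrative only)
    java -cp emma.jar emmarun -r html -cp bin:junit.jar \
         org.junit.runner.JUnitCore identifier.IdentifierFunctionalTest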
In order to execute the PUT and collect execution-trace information, Emma uses a specialized class loader called emmarun, which is responsible for instrumenting the product before its execution. Each execution corresponds to a new test case, and the tester can generate a testing report in order to evaluate the progress of the testing activity. There are different levels of reports, as illustrated by Figures 6, 7, and 8. The report details different "levels of coverage", as explained below.

– class: the total number of classes under testing versus the number of classes executed during testing. Since we have just one class, the tool considers it executed and, therefore, we executed 100% of the classes under testing.
– method: the total number of methods of the classes under testing versus the number of methods executed during testing. In our example, Identifier has five methods and three of them were executed during this test (60%).
– block: the total number of blocks of all methods of all classes versus the number of blocks actually covered during testing. In our example, the five Identifier methods have a total of 121 blocks, and 29 of these were covered (24%).
– line: remember that a block is a set of statements; therefore, based on the computed blocks it is possible to infer the total number of lines in all methods of all classes under testing. This coverage relates the total number of lines in the PUT to the number of lines covered during testing. In our case, Identifier has 29 executable lines and 9.2 were executed during testing (32%). It is important to note that Emma works at the bytecode level and computes the number of source-code lines based on the bytecode information. Because of this conversion, it is possible to have a fraction in the number of executed lines, since a single source-code line can be broken down into several bytecode instructions and these may belong to different CFG blocks, which are not always executed together due to conditions in the product implementation.

On a more detailed analysis, we observe that the Equivalence Partition-adequate test suite used previously was able to cover all statements of four out of five methods of the Identifier class, as illustrated in Figure 7. The only method with 68% block coverage is the main method, which is natural since the tests are run by JUnit. To reach 100% coverage with respect to all methods of the Identifier class, two additional test cases are required to be executed via the main method: one calling identifier.Identifier with no parameter, in order to cover line 47, and another calling it with a valid identifier, in order to cover line 52.

JaBUTi testing tool. To illustrate the complementary aspects of the testing criteria, we use another testing tool named JaBUTi (incubadora.fapesp.br/projects/jabuti/). This tool was originally developed to support the use of the All-Nodes, All-Edges, All-Uses, and All-Pot-Uses criteria at the unit level when testing Java product implementations [330]. Since its definition in 2003, the original tool has been extended for testing products other than its original purpose. Figure 9 shows the current JaBUTi family, described below. All of the elements share the JaBUTi core, which includes the static analysis of bytecode and other basic features. They are all implemented in Java and run as
Fig. 6. Emma overall report after JUnit execution, excluding third-party packages and test suites
Fig. 7. Identifier method coverage after the execution of functional test cases
Fig. 8. main method coverage
desktop applications. With our participation in the QualiPSo Project, we aim at making such tools available as open source software and at migrating them to work in a service-oriented architecture, supported by the QualiPSo Factory, which is currently under development.
Fig. 9. JaBUTi family
– JaBUTi/MA [99] is an extension of JaBUTi which enables the use of structural testing criteria to test mobile agents while they are running in their real environment.
– JaBUTi/ME [100] applies the same concepts to the structural testing of J2ME (Java Micro Edition) applications, both during their development, via emulators, and after deployment to the target devices, such as PDAs (Personal Digital Assistants) and mobile phones.
– JaBUTi/DB [257] implements the data-flow testing criteria defined by Spoto et al. [303] specifically for applications that manipulate persistent data in the form of relational databases.
– JaBUTi/AJ [207] is a JaBUTi extension which provides support for applying structural testing to aspect-oriented products.
– JaBUTi/Integration [123] is an extension of JaBUTi which provides support for the application of structural testing at the integration level.
– JaBUTi/Web [119] is an initiative to extend the JaBUTi testing tool and its corresponding criteria to test Java-based Web applications.
– eXVantage [360] (www.research.avayalabs.com) is in fact a reduced version of JaBUTi which implements only its control-flow-based testing criteria. With this reduction, eXVantage avoids several time-consuming tasks related to the calculation and evaluation of data-flow-based testing criteria, thus improving the general performance of the tool.

The first step in using JaBUTi is the creation of a hierarchical abstraction of the program being tested, in which the tester indicates which parts of the product should really be tested and which should be ignored, that is, excluded from the program structure during the instrumentation process. Such information is stored in a testing project, which allows the testing activity to be stopped and resumed at any time. Once the testing project is created, the tester has eight structural criteria to work with (radio buttons below the main menu in Figure 10). These criteria are summarized in Table 3. By selecting a criterion, the tester visualizes information about the program concerning the selected criterion. For example, by using the All-Nodesei criterion we are able to see the source code (if available – see Figure 10), the bytecode, or the def-use graph (see Figure 10). In either case, the tester is provided with hints about which testing requirement should be covered in order to achieve higher coverage.
Fig. 10. Source code and DUG visualization for All-Nodesei criterion

Table 3. Testing criteria implemented by JaBUTi

– All-Nodesei (all nodes, regardless of exceptions): requires the execution of each node in the graph that can be executed without the occurrence of an exception.
– All-Edgesei (all edges, regardless of exceptions): requires the execution of each edge in the graph that can be executed without the occurrence of an exception.
– All-Usesei (all uses, regardless of exceptions): requires the coverage of each def-use pair that can be executed without the occurrence of an exception.
– All-Pot-Usesei (all potential-uses, regardless of exceptions): requires the coverage of each def-potential-use [228] pair that can be executed without the occurrence of an exception.
– All-Nodesed, All-Edgesed, All-Usesed, All-Pot-Usesed (the same criteria, dependent on exceptions): respectively require the coverage of nodes, edges, def-use pairs, and def-potential-use pairs that can only be executed with the occurrence of an exception.
We observe that, although this DUG has a similar layout to the DUG presented in Figure 3, the labels assigned to the nodes of this DUG are in general the offset of the first bytecode instruction in the block. For instance, node 9 in the DUG of Figure 10 represents the instructions from offset 0 (“Start PC”) to 26 (“End PC”) at bytecode level, or lines 8 to 10 at source code level (“Corresponding Source Lines”). This node 9 corresponds to node 2 in the DUG of Figure 3. Actually, the latter was manually generated by editing the DUG generated by JaBUTi. The tester can manage testing requirements, for example, by marking a requirement as infeasible. A testing requirement may be covered by the execution of a test case. This is done “outside” the tool by a test driver that instruments the PUT and then starts
Fig. 11. Summary report by method after the execution of the JUnit test suite
Fig. 12. Summary report by criterion after 100% of statement coverage
the instrumented program in the same way as emmarun does for the Emma tool. JaBUTi also allows the import and evaluation of the coverage of JUnit test suites against its criteria. For instance, Figure 11 shows one of JaBUTi's testing reports after the execution of our previous JUnit test suite.

Since we wish to obtain 100% code coverage with respect to the complete source code, two additional test cases were added to the JUnit test suite, imported, and evaluated by JaBUTi. Therefore, the test suite TAll-Nodesei is an All-Nodesei-adequate test suite made up of the following set of test cases: TAll-Nodesei = TEquivalence Partition ∪ {("", Invalid), (a1, Valid)}.

Figure 12 shows the summary report by criterion after the execution of the two additional test cases. As we can observe, despite obtaining 100% statement coverage, the All-Edgesei criterion reaches only 88% coverage, followed by All-Usesei with 85% and All-Pot-Usesei with 73%. Therefore, if there are enough resources in terms of time and cost, the tester may continue with the next criterion (All-Edgesei) and verify how its testing requirements can be used to improve the quality of the test suite. In this case, the tester needs to provide
additional test cases by forcing each boolean expression to be evaluated as true and as false at least once, and so on. By clicking on any of the highlighted decisions, JaBUTi changes the color of the chosen decision point and highlights the corresponding branches of that decision point. For instance, by clicking on the decision point in line 10, Figure 13 shows the coverage status for that decision. As we can observe, the true branch is already covered, since the statements inside the if statement are marked in white (which means "covered"). On the other hand, the statement outside the if statement appears in a different color, an indication that it is not yet covered. If we take a look at the source code and at our test suite, we can conclude that a test case is missing: an identifier (valid or invalid) with a single character, so that the if statement can evaluate to false. We can use the test case (c, Valid), for instance, to improve the current test suite and to cover such an edge.
Fig. 13. False branch not yet covered: All-Edgesei criterion
By analyzing all the other uncovered edges, we have to include two additional test cases (besides (c, Valid)) in TAll-Nodesei to make it All-Edgesei-adequate. Below we give the test suite TAll-Edgesei, an All-Edgesei-adequate test suite composed of the following set of test cases: TAll-Edgesei = TAll-Nodesei ∪ {(c, Valid), ({, Invalid), (a{b, Invalid)}.
Fig. 14. Summary report by criterion after running TAll-Edgesei
Figure 14 gives the resulting coverage obtained with respect to all remaining JaBUTi testing criteria. Observe that, after covering 100% of the All-Nodes and All-Edges criteria, the coverage concerning the data-flow-based testing criteria is below 95%. At this point, considering the strongest criterion of JaBUTi (All-Pot-Usesei), there are 34 uncovered data-flow associations, all of which are shown in Table 4. The variable names L@x are used internally by JaBUTi to designate local variables 0, 1, 2, ..., n. The real names of such variables can be identified from the bytecode if the class file is compiled with the -g parameter, which generates all debugging information. To facilitate the identification, we provide a key below the table with the corresponding source-code name of each variable.

Table 4. Uncovered data-flow associations for the All-Pot-Usesei criterion (associations marked with × are infeasible)

    01) L@0, 0, (76, 95)   ×        18) L@4, 29, (84, 95)  ×
    02) L@1, 0, (76, 95)   ×        19) L@4, 29, (76, 95)  ×
    03) L@3, 0, 93         ×        20) L@4, 29, (72, 95)
    04) L@3, 0, (84, 93)   ×        21) L@2, 49, (76, 95)  ×
    05) L@3, 0, (84, 95)   ×        22) L@3, 64, (49, 64)
    06) L@3, 0, (76, 84)   ×        23) L@3, 64, 93        ×
    07) L@3, 0, (76, 95)   ×        24) L@3, 64, (84, 93)  ×
    08) L@3, 0, (72, 76)   ×        25) L@3, 64, (84, 95)  ×
    09) L@3, 0, (9, 29)    ×        26) L@3, 64, (76, 84)  ×
    10) L@3, 0, (9, 72)    ×        27) L@3, 64, (76, 95)  ×
    11) L@2, 29, 95                 28) L@3, 64, (72, 76)  ×
    12) L@2, 29, (84, 95)  ×        29) L@4, 66, (49, 64)
    13) L@2, 29, (76, 95)  ×        30) L@4, 66, (76, 95)  ×
    14) L@2, 29, (72, 95)           31) L@4, 66, 64
    15) L@2, 29, (49, 66)  ×        32) L@2, 9, (84, 95)   ×
    16) L@2, 29, (49, 64)  ×        33) L@2, 9, (76, 95)   ×
    17) L@4, 29, 95                 34) L@3, 9, (76, 95)   ×
    Key: L@0 - this; L@1 - s; L@2 - achar; L@3 - valid_id; L@4 - i.

Since the edge (76, 95) is infeasible, all associations involving this edge are also infeasible and can be dropped. Such infeasible associations are marked in the table with the symbol "×". The associations involving the definition of L@3 at node 0 also fail to be feasible, since there is no def-clear path to cover them, or because the only way to take the edge (0, 72) is via a zero-length
identifier, which is always considered invalid when following the edge (72, 95). All other possible uses with respect to the variable L@3 defined at node 0 are infeasible. The potential associations L@2, 29, 95, L@2, 29, (72, 95), L@4, 29, 95, and L@4, 29, (72, 95) can be translated to the source-code representation as achar, 3, 12, achar, 3, (8, 12), i, 3, 12, and i, 3, (8, 12), respectively (see Listing 2.4 and Figure 4). We observe that all of them require an invalid test case with two characters in order to be covered; for instance, the test case (%%, Invalid) covers such associations. By continuing these analyses, we found one additional test case, (%%%a, Invalid), to cover the remaining feasible associations. All the others are infeasible and, therefore, after running such test cases we obtained 100% coverage with respect to all JaBUTi testing criteria. In summary, the test suite TAll-Pot-Usesei below is an All-Pot-Usesei-adequate test suite composed of the following set of test cases: TAll-Pot-Usesei = TAll-Edgesei ∪ {(%%, Invalid), (%%%a, Invalid)}.

It must be highlighted that we found an All-Pot-Usesei-adequate test suite and, even though there are at least two faults in the validateIdentifier method, this test suite was not able to detect any of them.

5.3 State of the Art

Structural testing criteria have been used mainly for unit testing, since their testing requirements are easier to compute at this level of testing. Several efforts to extend them to integration testing can be identified. Haley and Zweben proposed a testing criterion to select paths in a module that should be tested at the integration level based on its interface [152]. Linnenkugel and Müllerburg presented a family of criteria which extends traditional control- and data-flow-based unit-testing criteria to integration testing [215]. Harrold and Soffa proposed a technique to determine interprocedural def-use associations, allowing the application of data-flow-based testing at the integration level [160]. Jin and Offutt defined some criteria based on a classification of the coupling between modules [181]. Vilela, supported by the concept of potential use, extended the family of Potential-Uses criteria to integration testing [327].

Harrold and Rothermel [159] extended data-flow testing to the context of object-oriented products at the class level. The authors commented that some data-flow-based testing criteria originally proposed for testing procedural products [283,160] can be used for testing both individual methods and the interactions between methods inside the same class. However, such criteria do not consider data-flow associations whenever a user invokes sequences of methods in an arbitrary order. To solve this problem, Harrold and Rothermel proposed an approach which considers different types of data-flow interactions between classes. This approach uses traditional (procedural) data-flow testing for testing methods and method interactions inside the same class. To test methods that are visible outside the current class and can be called by other classes, a new representation, called the CCFG (class control-flow graph), was defined. From the CCFG, new testing requirements at the inter-method, intra-class, and inter-class levels can be determined [159].
Vincenzi et al. have also investigated the use of control- and data-flow-based testing criteria for testing object-oriented and component-based products [328]. Aiming at a common solution to deal with both kinds of products, they decided to carry out the static analysis of Java programs to determine the testing requirements directly at the bytecode level since, in general, when testing software components the product source code is not always available. With such an approach, regardless of the availability of the source code, it is possible to derive a set of testing requirements concerning different structural testing criteria, which can be used for the evaluation or the development of a test suite. The JaBUTi testing tool described above was developed to support such an approach, allowing the application of control- and data-flow-based criteria at the bytecode level.
6 Fault-Based Testing Technique

6.1 State of Practice

The mutation testing technique was briefly described in Chapter 1. The basic idea behind mutation testing is the competent programmer hypothesis, which states that a good programmer writes correct or close-to-correct programs. Assuming this hypothesis is valid, we can say that errors are introduced into a program through small syntactic deviations (faults) that lead its execution to incorrect behavior. In order to reveal such errors, mutation testing identifies the most common of these deviations and, by applying small changes to the PUT, encourages the tester to construct test cases that show that such modifications produce incorrect programs [2].

A second hypothesis explored by mutation testing is the coupling effect. It states that complex errors result from the composition of simple ones. Thus, it is expected – and some experimental studies [58] have confirmed this – that test suites which reveal simple faults are also able to discover complex errors. Thus, a single mutation is applied to the PUT P, that is, each mutant has a single syntactic transformation with respect to the original program. A mutant with k changes is referred to as a k-mutant. Mutants beyond 1-mutants have not been used in the literature and will therefore not be explored in this chapter.

By considering these two hypotheses, the tester should provide a PUT P and a test suite T whose adequacy is to be assessed. The PUT is executed against T and, if a failure occurs, a fault has been revealed and the test is over. If no problem is observed, P may still have hidden faults that T is unable to reveal. In this case, P is submitted to a set of "mutation operators" which transform P into P1, P2, ..., Pn, called mutants of P. Mutation operators are rules that model the most frequent faults or syntactic deviations related to a given programming language. A mutation operator defines the (single) change to be applied to the original program in order to create a mutant. It is designed for a target language and should fulfill one of the following objectives: create a simple syntactic change based on typical errors made by programmers (changing the name of a variable, for instance); or force test cases to have a desired property (covering a given branch in the program, for instance) [266].
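As an illustration (our own, based on the Java version of validateIdentifier in Listing 2.2), a mutation operator that replaces a relational operator would generate, among others, the following mutant:

    // Original statement (Listing 2.2):
    if (valid_id && (s.length() >= 1) && (s.length() < 6))
        return true;

    // Mutant: the relational operator "<" is replaced by "<=":
    if (valid_id && (s.length() >= 1) && (s.length() <= 6))
        return true;

Any test case that distinguishes this mutant from the original program, for example a valid identifier with exactly six characters, also exposes the fault in the original condition, which wrongly rejects six-character identifiers.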
Mutants are executed against the same test suite T. Dead mutants are those whose result differs from that of P on at least one test case in T. The others, that is, the ones whose results are the same as P's for every test case in T, are called live mutants. The ideal situation is to have all mutants dead, which would indicate that T is adequate for testing P (according to that set of mutants).

As already mentioned, a serious problem in this approach is the possible existence of equivalent mutants. In some cases, the syntactic change used to create the mutant does not result in a behavioral change, and for every element in the input domain P and the mutant compute the same results. A few heuristics have been proposed to automatically identify equivalent mutants, but this is not always possible [59]: the problem of deciding whether two programs compute the same function is known to be undecidable. Therefore, in general, mutation testing requires tester intervention to determine equivalent mutants, in the same way as structural testing requires the determination of infeasible paths.

After executing the mutants and – probably via tester intervention – identifying the equivalent ones, an objective measure of test-suite adequacy is provided by the mutation score:

    ms(P, T) = DM(P, T) / (M(P) − EM(P))

where

– DM(P, T): number of mutants killed by the test suite T;
– M(P): total number of mutants;
– EM(P): number of mutants equivalent to P.

The mutation score ranges from 0 to 1; the higher it is, the more adequate is the test suite. For instance, with the numbers used later in this chapter, a program with 447 mutants, 80 of them equivalent, and a test suite that kills 187 mutants yields ms = 187 / (447 − 80) ≈ 0.51. The value of DM(P, T) depends only on the tests used to execute P and the mutants, and the value of EM(P) is obtained as the tester, manually or aided by heuristics, decides that a given live mutant is in fact equivalent. Besides the existence of equivalent mutants, the cost of applying the criterion, represented mainly by the effort of executing the mutants against the test suite, is regarded as a serious obstacle to the adoption of mutation testing.

6.2 Automation

There are several initiatives on the development of testing tools intended to support the application of mutation testing criteria [102]. Some of them support the application of mutation testing to Java products, such as MuJava [221] and Jester [156], but none offers all the functionalities provided by the Proteum tool family [230]. As regards testing product implementations, PROTEUM/IM 2.0 [97], developed at ICMC-USP, Brazil, is the only mutation testing tool which supports the application of mutation testing to C programs at both the unit and integration levels. It consists of an integrated testing environment composed of Proteum [95], which supports the application of the mutation analysis criterion, and PROTEUM/IM [96], which supports the
application of the interface mutation criterion. Moreover, due to its multi-language architecture, it may also be configured to test products written in other languages.

Basically, PROTEUM/IM 2.0 offers the tester all the necessary resources for evaluating or generating test suites based on the mutation testing criterion. Based on the information provided by the tool, the tester is able to improve the quality of T until an adequate test suite is obtained. The resources provided by PROTEUM/IM 2.0 allow the execution of the following operations: defining the test cases; executing the PUT; selecting the mutation operators used to generate the mutants; generating mutants; executing mutants; analysing live mutants; and computing the mutation score. Some of these functions are completely automated by the tool (such as the execution of mutants), while others require the aid of the tester in order to be carried out (such as the identification of equivalent mutants). In addition, some features have been added to the tool in order to facilitate the execution of experimental studies. For instance, the tool allows the tester to choose to execute the mutants with all test suites, even if they have already been distinguished. With this kind of test session (called a research session), additional data can be gathered regarding the efficacy of mutation operators or the determination of strategies for test-case minimization [95].

An important point to consider when applying mutation testing criteria is the decision over which set or subset of mutation operators should be used. PROTEUM/IM 2.0 implements two different sets of mutation operators, one composed of 75 operators that perform mutations at the unit level and another composed of 33 operators for integration testing. Mutation operators for unit testing are divided into four classes: statement mutations, operator mutations, variable mutations, and constant mutations, depending on the syntactical entity manipulated by the mutation operator. It is possible to select operators according to the classes or faults to be addressed, allowing the creation of mutants to be done stepwise or even to be divided among testers working independently. Table 5 shows a few operators for each of these classes.

Table 5. Example of mutation operators for unit testing of C products

    Operator   Description
    u-SSDL     Removes a statement from the program.
    u-ORRN     Replaces a relational operator.
    u-VTWD     Replaces the reference to a scalar by its predecessor and successor.
    u-Ccsr     Replaces the reference to a scalar by a constant.
    u-SWDD     Replaces a while by a do-while.
    u-SMTC     Breaks a loop execution after two executions.
    u-OLBN     Replaces a logical operator by a bitwise operator.
    u-Cccr     Replaces a constant by another.
    u-VDTR     Forces each reference to a scalar to be: negative, positive, and zero.

The use of PROTEUM/IM 2.0 is based on "testing sessions" in which the tester can perform operations in steps: create a session, interrupt it, and resume it. For a C version of the Identifier product, a test session can be conducted using a GUI or testing scripts. The first option allows the tester to learn and explore the concepts related to mutation testing and to the tool itself. In addition, it offers a better means to visualize test cases and mutants, thus facilitating some tasks such as the identification of equivalent mutants. Conducting a test session using the graphical interface is probably easier, but less flexible than using command-line scripts. The graphical interface requires constant
intervention on the part of the tester, while the scripts allow the execution of long test sessions in batch mode. In script mode, the tester constructs a script specifying how the test should be carried out and the tool simply follows it, without requiring tester intervention. On the other hand, this mode requires an additional effort for the creation of the script and a complete knowledge of both the concepts behind mutation testing and the commands provided by PROTEUM/IM 2.0. Test scripts are very useful for performing experimental studies, which require several repetitive steps to be executed many times in order to obtain statistically significant information, and should be used by experienced users.

We now evaluate the adequacy of our previous test suites against the mutation testing criterion, considering a C version of Identifier. To perform such an evaluation, only a subset of the unit mutation operators is used. This subset (presented in Table 5) is called the "essential set of mutation operators", in the sense that a test suite that is adequate with respect to this subset should be adequate or almost adequate with respect to the entire set [21]. For our example, the complete set of mutation operators, when applied to the Identifier product, generates 1,135 mutants, while the essential set generates 447, which amounts to a reduction of 60.6%. This is an important reduction, since one of the biggest problems of mutation testing is the large number of mutants that have to be executed and analyzed for possible equivalence. In our specific case, before evaluating the adequacy of the test suites against mutation testing, we determined by hand all the equivalent mutants, 80 out of 447, as can be seen in Figure 15(a). Figure 16 shows the original product (on the left) and one of its equivalent mutants (on the right), as well as the way the tester can determine an equivalent mutant using the PROTEUM/IM 2.0 graphical interface.

By disregarding the equivalent mutants, we use 367 non-equivalent mutants during the evaluation of the adequacy of the test suites. At first, only the test suite adequate for the equivalence partition criterion was imported and evaluated against the mutation testing criterion; Figure 15(a) shows the status of the test session after the mutant execution. We observe that such a test suite killed 187 out of 367 mutants, hence yielding a mutation score of 0.509. Subsequently, we evaluated the test suites TAll-Nodesei, TAll-Edgesei, and TAll-Pot-Usesei, which produced the status reports shown in Figures 15(b), 15(c), and 15(d), respectively. There was no increment in the mutation score from the equivalence-partition test suite to the All-Nodesei suite, nor from the All-Edgesei suite to the All-Pot-Usesei suite. At the end, after all test cases adequate for the All-Pot-Usesei criterion were executed, 128 out of 367 mutants were still alive. If we interrupt the testing activity at this point, we are accepting that any of these 128 mutants could be considered "correct" with respect to our current testing activity, since there is no test case capable of distinguishing their behavior from that of the original product. During the analysis of these live mutants, we observed that 119 were alive due to missing test cases; therefore, 23 further test cases were required to kill them. The resulting mutation-adequate test suite TMutation Testing is as follows.
(a) TEquivalence Partition × Mutation Testing    (b) TAll-Nodesei × Mutation Testing
(c) TAll-Edgesei × Mutation Testing    (d) TAll-Pot-Usesei × Mutation Testing
Fig. 15. Test-session status window
TMutation Testing = TAll-Pot-Usesei ∪ { (1#,Invalid), (#-%a,Invalid), (zzzz,Valid), (aAA,Valid), (A1234,Valid), (ZZZZ,Valid), (AAAA,Valid), (aa09a,Valid), ([,Invalid), (X::,Invalid), (X18,Invalid), (X[[a,Invalid), (X{{a,Invalid), (aax#a,Invalid), (aaa,Valid), ({,Invalid),(a,Valid), (a#a,Valid), (111,Invalid), (\‘111,Invalid),(a11,Invalid), (a\‘11,Invalid), (a/a,Invalid)} After executing the mutants with these test cases, we obtain the status report shown in Figure 17. As we can be see in this figure, there are still 9 live mutants and the maximum mutation score is 0.975. However, a mutation score of 1.00 cannot be obtained for such a product since it has two remaining faults that were not detected even by an All-Pot-Usesei adequate test suite. These live mutants are known as fault-reveling. A mutant is said to be fault-revealing if for any test case t so that P(t ) = M (t ) we can conclude that P(t ) is not in accordance with the expected result, that is, the presence of a fault is revealed. The faultrevealing mutants were generated by the Cccr and ORRN mutation operator and any test case developed by us to kill any of them actually detects the presence of a fault in the original product. Moreover, it is important to observe that these mutation operators belong to the essential set so that, besides the cost reduction, the efficacy of these subset of mutants was not compromised in this case.
Fig. 16. Mutant visualization
Fig. 17. Status report after TMutation Testing execution
For instance, Listing 2.5 illustrates two such mutants (we present them in the same source code for the sake of brevity), each of them revealing a different fault. The mutations can be observed in lines 20 and 29. Listing 2.6 shows the implementation of the validateIdentifier function without such faults. Once these faults are corrected, we can apply the mutation testing criterion once more to verify whether a mutation score of 1.00 can be reached and, if so, we can incrementally apply the remaining mutation operators, depending on our time and cost constraints.

6.3 State of the Art

Experimental studies have provided evidence that mutation testing is among the most promising criteria in terms of fault detection [354,266]. However, mutation testing often imposes unacceptable demands on computing and human resources because of the large number of mutants that need to be executed and analyzed for a possible equivalence with respect to the original program. Besides the existence of equivalent mutants, the cost of applying the criterion, represented mainly by the effort of executing the mutants against the test suite, is pointed out as a serious obstacle to the adoption of mutation testing.
Listing 2.5 (the two fault-revealing mutants, shown in the same source code; the comments give the original code):

    11  int validateIdentifier(char *s) {
    12     char achar;
    13     int i, valid_id = FALSE;
    14     if (strlen(s) > 0) {
    15        achar = s[0];
    16        valid_id = valid_s(achar);
    17        if (strlen(s) > 1) {
    18           achar = s[1];
    19           i = 1;
    20           while (i < strlen(s) - 0) {   // Orig.: (i < strlen(s) - 1)
    21              achar = s[i];
    22              if (!valid_f(achar))
    23                 valid_id = FALSE;
    24              i++;
    25           }
    26        }
    27     }
    28     /* Orig.: (strlen(s) < 6) */
    29     if (valid_id && (strlen(s) >= 1) && (strlen(s) ...

Listing 2.6 (the validateIdentifier function without the two faults; fragment):

       if (strlen(s) > 0) {
          achar = s[0];
          valid_id = valid_s(achar);
          if (strlen(s) > 1) {
             achar = s[1];
             i = 1;
             while (i < strlen(s)) {
                achar = s[i];
                if (!valid_f(achar))
                   valid_id = FALSE;
                i++;
             }
          }
       }
       if (valid_id && (strlen(s) ...
An ALTS (annotated labelled transition system) is a 4-tuple ⟨Q, R, T, q0⟩, where:

– Q is a countable, non-empty set of states;
– R = A ∪ N is a countable set of labels, where A is a countable set of actions and N is a countable set of annotations;
– T ⊆ Q × R × Q is a transition relation; and
– q0 ∈ Q is an initial state.

Annotations are inserted into the LTS with specific goals: to guide the test-case generation process, by making it easier to focus on particular interruptions; to make it possible for interruption models to be plugged in without interfering with the main model; to guide test-case documentation; to make it possible for conditions to be associated with actions; and to indicate points where interruptions can be reasonably observed externally.
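To fix the intuition, the following sketch (in Python, with hypothetical names) shows one possible in-memory representation of such an annotated model and a naive enumeration of its paths; it deliberately ignores the separation between feature and interruption edges that the test-generation algorithm presented later (Figure 13) takes into account.

    from collections import namedtuple

    # A label is either an action or an annotation (steps, conditions,
    # expectedResults, begininterruption, endinterruption).
    Transition = namedtuple("Transition", ["source", "label", "target"])

    class ALTS:
        def __init__(self, transitions, initial):
            self.transitions = list(transitions)   # T, a subset of Q x R x Q
            self.initial = initial                 # q0

        def outgoing(self, state):
            return [t for t in self.transitions if t.source == state]

        def paths(self):
            """Enumerate paths from q0, never reusing an edge within a path."""
            def dfs(state, path):
                fresh = [t for t in self.outgoing(state) if t not in path]
                if not fresh:                      # leaf, or every edge already used
                    yield list(path)
                    return
                for t in fresh:
                    path.append(t)
                    yield from dfs(t.target, path)
                    path.pop()
            yield from dfs(self.initial, [])

    # A tiny fragment in the style of Figure 11 (labels are invented):
    model = ALTS(
        transitions=[
            Transition(0, "steps", 1),
            Transition(1, "Go to inbox", 2),
            Transition(2, "expectedResults", 3),
            Transition(3, "Inbox is displayed", 4),
        ],
        initial=0,
    )
    for p in model.paths():
        print([t.label for t in p])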
Fig. 11. ALTS for the Remove Message Behavior of a Message Feature
Figure 11 presents an ALTS model that represents the behavior of removing a message from the inbox. Three annotations are used in the ALTS model: steps, conditions, and expectedResults. The steps annotation is used to indicate input actions, the conditions associated with input actions are indicated using the conditions annotation, and the expected results are indicated using the expectedResults annotation.

Figure 12 shows an ALTS model that represents two features, where a feature named Incoming Message interrupts the flow of execution of the feature in Figure 11 at node 4. This is annotated by using the begininterruption and endinterruption labels. This annotation connects the LTS models of the two feature behaviors and indicates at which points an interruption can happen and what the expected behavior is whether or not the interruption occurs. The idea is that features can be specified separately and then combined at the allowed points of interruption by composing the ALTSs with the interruption annotations.

Test-generation algorithm. A feature test case consists of a path extracted from the test model in the scope of a given feature, without considering the interruptions. On the other hand, a feature-interruption test case is a path that includes an interruption. The same basic algorithm presented in the sequel is used to extract test cases, but its application is different depending on the kind of test case (feature or feature interruption).

A path can be obtained from the ALTS model, using Depth First Search (DFS), by traversing the ALTS starting from the initial state. As a general coverage criterion, all labelled transitions need to be covered, that is, all ALTS labelled transitions need to be visited at least once. Since we are considering functional testing, total coverage is a reasonable and feasible goal for feature testing, to guarantee a thorough investigation of the feature functionalities [246]. Nevertheless, this is not always feasible for feature-interruption testing, due to the possibly infinite number of combinations of interruptions that can happen at different points of different features. Therefore, test-case selection techniques must also be applied (this is discussed in Section 4).

The algorithm to generate test cases is shown in Figure 13. For the execution of the algorithm, we use three parameters: vertex is a vertex in the model, indicating the current vertex during the DFS; path is a sequence of edges from the model, indicating the path visited during the DFS; and interruptionModel is a graph G[V, E], used to distinguish the edges of the feature model from those of the feature-interruption model already connected to it.

We begin the extraction from the root (the initial node of the ALTS model), verifying whether the current vertex indicates the end of a path in the model, meaning that a test case has been extracted. In this case, it needs to be recorded. (The LTS-BT tool [70] records the extracted test cases in tables inside a file.) If the current vertex does not indicate the end of a path, then each of its descendants is visited using the DFS algorithm. To visit each descendant, the edge between the current vertex and that descendant is analysed. The search proceeds only if: (i) the edge does not belong to the current analysed path (that is, the edge has not already been "visited"), or (ii) it is an edge from the feature-interruption model (an edge with the endInterruption label). Due to these conditions, two scenarios are encountered:
Fig. 12. ALTS with Interruption
Fig. 13. Test Case Generation Algorithm
Fig. 14. Feature test case
– Neither condition (i) nor (ii) is satisfied: the search stops, recording the entire path as a test case. In this case, the recursion step of the algorithm returns to the next branch that needs to be analysed, continuing the DFS.
– Condition (i) or (ii) is satisfied: the edge between the vertex and its descendant is added to the test case and the DFS continues until it finds the end of the path, which happens when either a leaf in the graph or an edge going back to the root of the graph is found.

These constraints over the extraction, when using the DFS approach, are required to avoid an explosion of paths during test-case extraction caused by the loops in the test model. This may reduce the number of extracted test cases; but without these constraints the number of extracted paths becomes unmanageable, while most of them can be obtained by combining the extracted test cases. Also, practice has shown that these excluded paths generally add redundancy to the test suite, that is, they do not add test cases that would uncover additional faults.

To demonstrate the application of the algorithm presented above, we extracted test cases from the test model of Figure 12. Figure 14 presents a feature test case and
Fig. 15. Feature interruption test case
Figure 15 presents a feature-interruption test case. To generate feature test cases, the interruption transitions of the model are not traversed; to generate feature-interruption test cases, the search goes through them.

3.4 Process Algebra

While LTSs and Finite State Machines are the main models used to automate test generation, they are very concrete models and are often adopted as the operational semantics of more abstract process algebras like CSP [172,286], CCS [251] and LOTOS [240]. In contexts where a process algebra is adopted as the specification formalism, automated test-case generation usually involves translating a specification into an operational model to generate the test cases, which are themselves expressed in terms of the concrete model, as illustrated in the previous sections.

Here we summarise an approach for the automated guided generation of sound test cases [262], using the CSP process algebra. More generally, this approach characterises a testing theory in terms of CSP: test models, test purposes, test cases, test execution, test verdicts, soundness and an implementation relation are all defined in terms of CSP processes and refinement notions. Test-case generation, in particular, is achieved through counterexamples of refinement checking, and is mechanised using the CSP model checker FDR [285]. The approach is currently based on the trace semantics of CSP.

The Model. CSP (Communicating Sequential Processes) is a process algebra that provides a rich set of operators to describe the behavior of concurrent and distributed systems. We introduce some of the operators of CSP through the following example.
S0 = t → S9 □ t → S2
S2 = t → S0 □ (c → S6 □ b → S4)
S4 = z → S2 □ t → S4 □ t → S8
S6 = y → S7
S7 = c → S6
S8 = y → S0
S9 = a → S8
SYSTEM = S0 \ {t}

The process SYSTEM captures part of the behavior of the IOLTS presented in Section 3.2 (see Figure 5). The alphabet of a CSP process is the set of events it can communicate. In this chapter, we assume that each process alphabet is split into two disjoint sets (input and output), since the approach to generate test cases from CSP specifications is based on that of Tretmans [320] for IOLTS. Thus, given a CSP process P, its alphabet is αP = αP_i ∪ αP_o. For instance, αSYSTEM_i = {a, b, c} and αSYSTEM_o = {y, z}.

The fragment c → S6 □ b → S4 of the process S2 uses the external choice operator (□) to indicate that it can communicate c and behave like S6 (S2/⟨c⟩ = y → S7), or communicate b and behave as S4 (S2/⟨b⟩ = z → S2 □ t → S4 □ t → S8). The notation P/s indicates the behavior of the process P after performing the trace s. The process S2/⟨b, t⟩ is non-deterministic, behaving recursively as S4, or as S8; the decision is made internally, since S4 offers t in two branches of the choice. The set initials(P) contains the initial events offered by the process P. Thus, initials(S2/⟨b⟩) = {z, t}. Moreover, the special event t is used in this specification exclusively to create non-deterministic behavior using the hiding operator (\): the process S0 \ {t} behaves like S0, but all occurrences of t become internal (invisible) events.

Some additional CSP constructs are used in the rest of this section. The process Stop specifies a broken process (deadlock), and Skip a process that simply terminates successfully, communicating the termination event. The process P; Q behaves like P until it terminates successfully, when control passes to Q. The process P |[ Σ ]| Q stands for the generalised parallel composition of the processes P and Q with synchronisation set Σ: the processes P and Q have to agree (synchronise) on events that belong to Σ, and each process can evolve independently for events that are not in Σ. The parallel composition P ||| Q represents the interleaving of the processes P and Q; in this case, both processes communicate any event freely. A process used later is RUN(s) = □ e : s • e → RUN(s). The construction □ e : s • e → P represents the indexed external choice over all events e that belong to the set s which, after communicating e, behaves as P. Therefore RUN(s) recursively offers all the events in s.

Trace semantics is the simplest model for a CSP process, and is adopted in this approach to characterise a testing theory. The traces of a process P, represented by traces(P), correspond to the set of all possible sequences of visible events (traces) that P can produce. For instance, traces(S6) is the infinite set that contains the traces
⟨⟩, ⟨y⟩, ⟨y, c⟩∗ and ⟨y, c⟩∗⌢⟨y⟩, where w∗ means zero or more occurrences of w, and s1⌢s2 indicates the concatenation of sequences s1 and s2.

It is possible to compare the trace semantics of two processes by a refinement verification, which can be performed automatically with FDR. A process Q trace-refines a process P, written P ⊑τ Q, if, and only if, traces(P) ⊇ traces(Q). For instance, the process S9 refines S0 (S0 ⊑τ S9), since the traces of S9 are contained in those of S0. However, the relation S0 ⊑τ S8 does not hold, since ⟨y⟩ ∈ traces(S8) but ⟨y⟩ ∉ traces(S0).

Test-generation algorithms. This approach introduces cspioco (CSP Input-Output Conformance), the implementation relation that is the basis for the test-case generation strategy. The test hypothesis for this relation assumes that the SUT behaves like some CSP process, say SUT. In this section, we consider SUT to be a CSP process representing the implementation, and S a specification, such that αS_i ⊆ αSUT_i and αS_o ⊆ αSUT_o. Informally, if SUT cspioco S, the set of output events of the implementation, after performing any trace s of the specification, is a subset of the outputs performed by S after s. Formally,

SUT cspioco S ⇔ (∀ s : traces(S) • out(SUT, s) ⊆ out(S, s))

where out(P, s) = initials(P/s) ∩ αP_o, if s ∈ traces(P), and out(P, s) = ∅ otherwise.

The following result establishes that cspioco can be alternatively characterised in terms of process refinement: the relation SUT cspioco S holds if, and only if, the following refinement holds.

S ⊑τ (S ||| RUN(αSUT_o)) |[ αSUT ]| SUT

The intuition for this refinement expression is as follows. If we consider an input event that occurs in SUT, but not in S, then on the right-hand side of the refinement the parallel composition cannot progress through this event, so it is refused. Because refused events are ignored in the traces model, new SUT inputs are allowed by the above refinement; as a consequence, partial specifications are allowed. The objective of the interleaving with the process RUN(αSUT_o) is to prevent the right-hand side process from refusing output events that the implementation can perform but S cannot: RUN(αSUT_o) allows such outputs to be communicated. Finally, if SUT can perform such output events, then they appear in the traces of the right-hand side process, which falsifies the traces refinement. In summary, the expression on the right-hand side captures new inputs performed by SUT by generating a deadlock from the trace where the input has occurred, in such a way that anything that comes afterwards is allowed. Furthermore, it keeps in the traces all the output events of SUT for the inputs from S, therefore allowing a comparison in the traces model.

The rest of this section discusses how to obtain a set of test scenarios from the specification process, and how test cases can be mechanically generated via refinement checking using FDR, where test scenarios correspond to counterexamples of refinement verifications. The first step of the test-selection approach is to mark certain traces of the specification according to a test purpose, also specified as a CSP process. This can be directly achieved using parallel composition. Assuming that there is a test scenario that can be
selected by TP from S, the parallel composition of S with a test purpose TP (the parallel product), with synchronisation set αS, is PP^S_TP = S |[ αS ]| TP. The process TP synchronises on all events offered by S until the test purpose matches a test scenario, when TP communicates a special event mark ∈ MARKS. At this point, the process TP deadlocks, and consequently PP^S_TP deadlocks as well. This makes the parallel product produce traces ts = t⌢⟨mark⟩, with t ∈ traces(S), where ts is the test scenario to be selected. Because ts ∉ traces(S), we have that traces(S) ⊉ traces(PP^S_TP). Thus, using FDR, the shortest counterexample for the refinement S ⊑τ PP^S_TP, say ts1, can be produced. If S does not contain scenarios specified by TP, no mark event is communicated, the parallel product does not deadlock, and the refinement S ⊑τ PP^S_TP holds.

To obtain subsequent test scenarios, we use the function P that receives as input a sequence of events and generates a process whose maximum trace corresponds to the input sequence. For instance, P(⟨a, b, c⟩) yields the process a → b → c → Stop. The reason for using Stop, rather than Skip, is that Stop does not generate any visible event in the traces model, while Skip generates the termination event. The second counterexample is selected from S using the previous refinement, augmented by the process formed by the counterexample ts1 (that is, P(ts1)) as an alternative to S on the left-hand side. The second test scenario can then be generated as the counterexample to the refinement S □ P(ts1) ⊑τ PP^S_TP. As traces(S □ P(ts1)) is equivalent to traces(S) ∪ {ts1}, ts1 cannot be a counterexample of this second refinement iteration. Thus, if the refinement does not hold again, we get a different trace ts2 as a counterexample.

The iterations can be repeated until the desired test suite (with regard to the test purpose) is obtained. For example, this limit can be based on a fixed number of tests or on some coverage criteria. If the refinement checking of some iteration holds, there are no more test scenarios to select. In general, the (n + 1)th test scenario can be generated as a counterexample of the following refinement.

S □ P(ts1) □ P(ts2) □ ... □ P(tsn) ⊑τ PP^S_TP
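The iteration scheme can be visualised with the following sketch, which works over explicit finite sets of traces and uses a hypothetical helper shortest_counterexample in place of FDR's refinement check; it illustrates the selection loop only, not the actual mechanisation.

    def shortest_counterexample(lhs_traces, rhs_traces):
        """Stand-in for FDR: a shortest trace of the right-hand side that is not a
        trace of the left-hand side, or None if the traces refinement holds."""
        diff = [t for t in rhs_traces if t not in lhs_traces]
        return min(diff, key=len) if diff else None

    def select_scenarios(spec_traces, pp_traces, limit):
        """Iteratively select marked test scenarios, mimicking
        S [] P(ts1) [] ... [] P(tsn)  refined by  PP^S_TP in the traces model."""
        lhs = set(spec_traces)          # traces of the left-hand side process
        scenarios = []
        while len(scenarios) < limit:
            ts = shortest_counterexample(lhs, pp_traces)
            if ts is None:              # refinement holds: nothing left to select
                break
            scenarios.append(ts)
            lhs.add(ts)                 # corresponds to adding P(ts) as an alternative
        return scenarios

    # Toy traces (tuples of events), for illustration only:
    spec = {(), ("b",), ("b", "z"), ("a",), ("a", "y")}
    pp = spec | {("b", "z", "refuse.1"), ("a", "y", "accept.1")}
    print(select_scenarios(spec, pp, limit=5))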
In the context of our example, consider the following test purpose:

TP1 = ANY({y}, UNTIL({z}, ACCEPT(1)))
      □ ANY({z}, REFUSE(1))
      □ NOT(αSYSTEM, {y, z}, TP1)

The process TP1 is defined in terms of some primitive processes (namely, ANY, NOT, ACCEPT, REFUSE and UNTIL) that facilitate the definition of test purposes [262]. Initially, TP1 offers three choices; each one can be chosen depending on the events offered by the specification. If y is offered, it is selected, and TP1 behaves as the process UNTIL({z}, ACCEPT(1)), which selects acceptance scenarios where eventually z is offered. If z is offered, but y has not yet been offered, then TP1 behaves as REFUSE(1), which selects refusal scenarios. Finally, if the offered events differ from y and z, TP1 restarts the selection, behaving as TP1 again.
The parallel product between the specification SYSTEM and the test purpose TP1 is the process PP^SYSTEM_TP1 = SYSTEM |[ αSYSTEM ]| TP1. The set of traces of PP^SYSTEM_TP1 is similar to the set of traces of SYSTEM, except for the traces that end with the mark events accept.1 and refuse.1. For instance, the trace ⟨a, y, b, z, accept.1⟩ belongs to traces(PP^SYSTEM_TP1) but not to traces(SYSTEM). Thus, in the first iteration, the refinement verification yields ts1 = ⟨b, z, refuse.1⟩ as the first counterexample. Using ts1, we have the second selection refinement iteration, which does not hold and yields ts2 = ⟨a, y, b, z, accept.1⟩ as the next counterexample. There are infinitely many test scenarios that can be obtained from SYSTEM using TP1, since the specification is recursive and has infinitely many acceptance and refusal traces.

Besides the uniformity of expressing the entire theory in terms of a process algebra, there are other potential benefits of this approach.

– As opposed to LTSs, CSP models that represent the specification can naturally evolve in an abstract formalism. This is particularly relevant in a context where complex models are built from simpler ones that model isolated (or individual) features.
– Implementation conformance to a specification and test-case generation can be directly expressed and mechanised in terms of process refinement.
– Test purposes are specified as CSP processes, which allows them to be composed and easily combined and extended.
– The generated test cases (also represented as CSP processes) can be combined to form coarser-grain test cases, or split to give rise to finer-grain test cases.

Some previous approaches have addressed test generation in the context of CSP [274,294,73]. All these approaches focus on the formalisation of conformance relations. The approach summarised here goes beyond that, addressing guided test selection and generation using refinement checking, the implementation of a tool that supports the approach (the ATG tool), and practical use of the approach through cooperation with Motorola Inc.

A related research topic is to explore more elaborate CSP semantic models, such as failures-divergences. Such models enable the definition of more robust implementation relations that, in addition to traces, consider nondeterminism and quiescence (for example, deadlock or livelock) as observations of testing. Another research direction is to explore test-generation approaches that capture component interaction. This might further emphasise the nature of an approach entirely based on a process algebra, which offers a rich repertoire of operators to express several patterns of concurrency and communication, contrasting with strategies based on more operational models (like LTSs or finite state machines) that do not record the application architecture.
4 Test-Case Selection Strategies

In this section, we present test-case selection strategies that can be applied to guide an automatic test-case generation process. Even though test-case generation is always based on coverage criteria, mostly structural ones, the number of test cases that can be selected is usually huge. Also, as mentioned before, not all test cases are relevant
and automatically generated test suites tend to include redundant test cases. The focus here is on test-purpose selection, random selection and similarity-based selection.

4.1 Test Selection Based on Test Purpose

Test purposes describe behaviors that we wish to observe by testing a SUT. This is a widely used strategy, notably developed in the TGV tool [180] and its test-case synthesis algorithms, as mentioned in Section 3.2. Theoretical foundations of test purposes are presented in [332]; these are related to the formal testing framework proposed in [320]. In this section, we give a quick overview of fundamental concepts on test purposes and properties of test cases generated from them, and also present a tool that supports test-purpose definition and application.

Formal test purposes [332]. Testing for conformance and testing from test purposes, also called property-oriented testing, have different goals [225]. The former aims to accept or reject a given implementation. The latter aims to observe a desired behavior that is not necessarily directly related to a required behavior or to correctness. If the behavior is observed, then confidence in correctness may increase; otherwise, no definite conclusion can be reached. This strategy is usually implemented to reduce the scope of the model from which test cases are generated.

Test purposes are related to implementations that are able to exhibit them by a well-chosen set of experiments. This is defined by the relation exhibits ⊆ IMPS × TOBS, where IMPS is the universe of implementations and TOBS is the universe of test purposes. To reason about exhibition, we also need to consider a test hypothesis, by defining the reveal relation rev ⊆ MODS × TOBS, where MODS is the universe of models, so that

∀ e ∈ TOBS · SUT exhibits e ⇔ i_SUT rev e

with i_SUT ∈ MODS the model assumed for the SUT. A verdict function H_e decides whether a test purpose is exhibited by an implementation: H_e : P(OBS) → {hit, miss}. Then,

SUT hits e by t_e =def H_e(exec(t_e, SUT)) = hit

This is extended to a test suite T_e as

SUT hits e by T_e =def H_e(⋃{exec(t, SUT) | t ∈ T_e}) = hit

where exec is a procedure that represents test-case execution.

A test suite T_e that is e-complete can distinguish among all exhibiting and non-exhibiting implementations, so that SUT exhibits e if, and only if, SUT hits e by T_e. A test suite is e-exhaustive when it can detect only non-exhibiting implementations (that is, SUT exhibits e implies SUT hits e by T_e), whereas a test suite is e-sound when it can detect only exhibiting implementations (SUT exhibits e if SUT hits e by T_e). We note that there is a similarity in purpose between sound test suites and e-sound test suites, even though the implications run in opposite directions. The former can reveal the presence of faults, whereas the latter can reveal intended behavior.
Conformance and exhibition can be related. The goal is to consider test purposes in test selection so as to obtain test suites that are sound and e-complete. On the one hand, e-soundness guarantees that a hit result always implies exhibition. On the other hand, e-exhaustiveness guarantees that implementations that exhibit the purpose are not rejected. Soundness provides us with the ability to detect non-conforming implementations. Contrary to complete test suites, e-complete test suites are more feasible; for instance, an algorithm is presented in [332] for LTSs.

Finally, we observe that a test purpose may be revealed by both conforming and non-conforming implementations. An ideal situation, though not a practical one, would be to consider a test purpose e only when {i | i rev e} ⊇ {i | i passes T}, where T is a test suite, and passes relates implementations to the sets of test cases in which they pass. However, test purposes are chosen so that {i | i rev e} ∩ {i | i imp s} ≠ ∅. In this case, a test execution with a test suite T_{s,e} that is both sound and e-complete and that results in fail means non-conformity, since sound test cases do not reject conforming implementations, and e-complete test cases distinguish between all exhibiting and non-exhibiting implementations. Also, if the result is {pass, hit}, confidence in correctness is increased, as the hit provides possible evidence of conformance.

Test-purpose selection. The LTS-BT tool [70] provides a graphical interface from which test purposes can be selected to guide the generation of e-complete test suites. Test purposes are extracted from the ALTS model under consideration, so the tester does not need to write them by hand or deal with the underlying models. For each ALTS model, the LTS-BT tool loads the set of possible labelled transitions, that is, conditions, steps and expected results (see Section 3.3). The user chooses the test purpose in order to restrict the ALTS model, obtaining a filtered ALTS model as the result.

The test-purpose notation is a sequence of transitions. In this sequence, the "*" (asterisk), meaning any transition, can appear at the beginning or between transitions. The sequence must finish with either an Accept node (meaning that the user wants all test cases that comply with the test purpose) or a Reject node. Some examples of test purposes for the ALTS test model shown in Figure 11 are given below.

1. *, Go to inbox, *, Accept. This means all test cases that contain "Go to inbox". In this case, the model is not filtered.
2. *, Message Storage is full, *, Accept. This means all test cases that finish with "Message Storage is full".
3. *, Message Storage is full, *, Reject. This means all test cases that do not contain "Message Storage is full".
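As an illustration only (LTS-BT filters the ALTS model itself rather than finished test cases), the sketch below shows how such a purpose sequence could be interpreted as an accept/reject filter over test-case label sequences; all names are hypothetical.

    import re

    def matches_purpose(purpose, test_case):
        """purpose: list of labels, '*' meaning any (possibly empty) sub-sequence,
        ending with 'Accept' or 'Reject'; test_case: list of transition labels."""
        *steps, verdict = purpose
        pattern = "".join(
            ".*" if s == "*" else re.escape(s) + r"(,|$)" for s in steps
        )
        hit = re.match(pattern, ",".join(test_case) + ",") is not None
        return hit if verdict == "Accept" else not hit

    tc = ["Go to inbox", "Select message", "Message Storage is full"]
    print(matches_purpose(["*", "Go to inbox", "*", "Accept"], tc))              # True
    print(matches_purpose(["*", "Message Storage is full", "*", "Reject"], tc))  # False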
The use of test purposes is particularly interesting for interruption testing, since the number of possible combinations of interruptions can be huge and make full coverage infeasible. In practice, this kind of testing is only considered for particular interruptions at specific "critical" points of a feature. Therefore, test purposes constitute a valuable tool for model-based interruption test designers. For interruption test cases, the following test purposes can be defined for the model in Figure 12.

1. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Accept. This test purpose considers all possible behaviors of the interruption after the '"Hot Message" folder is displayed' result. In this case, the model is not filtered.
2. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Reject. This test purpose does not consider test cases with the interruption after the '"Hot Message" folder is displayed' result. As a result, we obtain the model in Figure 11.
3. *, "Hot Message" folder is displayed, Send a message from phone 2 to phone under test, *, Message is Displayed, *, Accept. This test purpose considers the possible behaviors of the interruption after the '"Hot Message" folder is displayed' result; in other words, it focuses on the interruption at a specific point and also on the particular behavior of the interruption in which the message is displayed.

4.2 Random versus Deterministic Test Selection

From usage models such as the one presented in Section 3.1, specific test cases can be selected among the possible ones that can be generated. This can be done either by random walks on the model or by deterministic choice. As mentioned before, usage models can give rise to a huge number of test cases if only structural coverage criteria are considered. However, use profiles can be incorporated into these models through probabilities associated with transitions, characterising the most and least frequently traversed paths. Random selection based on this kind of probability distribution can produce test cases that have a high probability of causing failures, contributing to operational reliability.

On the other hand, crafted, non-random deterministic test cases are also of importance. They are usually scenarios of execution that are of interest to stakeholders and whose value may not be captured by the defined use profile. They can often be used to validate the usage model and the use profile (probability distribution) being considered. However, it is important to remark that deterministic selection alone is not effective in assuring coverage: such test cases usually reflect a particular, often partial, view of the application. In fact, a combination of random and deterministic strategies is recommended in practice. In the next subsection, an alternative strategy to random selection is presented. It addresses the problem of guaranteeing the best coverage of a model by selecting the least similar test cases, with random choice applied only to decide among test cases that are similar to the same degree.

4.3 Test Selection Based on Similarity Function

The main goal of this selection strategy is to reduce the number of test cases and minimise redundancy [72]. It is inspired by the work of Jin-Cherng Lin and his colleagues [213].
The strategy is based on a target percentage of the total number of test cases that can be generated. A similarity criterion is used to achieve this percentage by eliminating similar test cases. Based on the similarity degree between each pair of test cases, the candidates for elimination are those in the pair with the biggest similarity degree: one of them is eliminated. The choice is made by observing the test case with the shortest path, that is, the smallest number of functionalities (so as to maximise the chances of keeping a good coverage of the model).

From an ALTS model, we build a similarity matrix over paths. This matrix is as follows.

– It is n × n, where n is the number of paths, and each row and column represents one path.
– Its elements are defined as a_ij = SimilarityFunction(i, j). This function, presented below, is calculated by observing the number of identical transitions nit, that is, transitions whose source and target states and labels are the same. The function also considers the average of the path lengths, avg(|i|, |j|), so that test-case lengths are balanced with respect to similarity; that is, two small test cases are not erroneously considered less similar than two large test cases that have more transitions in common but are not as similar.

SimilarityFunction(i, j) = nit / avg(|i|, |j|)
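The elimination procedure can be sketched as follows (hypothetical names; the actual implementation is part of the LTS-BT/TARGET tooling): the similarity of every pair is computed and one test case of the most similar pair is repeatedly discarded until the target percentage is reached.

    import random

    def similarity(p, q):
        """Number of identical transitions over the average path length."""
        nit = len(set(p) & set(q))
        return nit / ((len(p) + len(q)) / 2)

    def select_by_similarity(paths, coverage):
        """Keep only `coverage` (e.g. 0.5) of the paths, dropping the most similar.
        Each path is a tuple of (source, label, target) transitions."""
        keep = list(paths)
        target = max(1, round(len(paths) * coverage))
        while len(keep) > target:
            # find the most similar pair among the remaining paths
            i, j = max(
                ((a, b) for a in range(len(keep)) for b in range(a + 1, len(keep))),
                key=lambda ab: similarity(keep[ab[0]], keep[ab[1]]),
            )
            p, q = keep[i], keep[j]
            if len(p) < len(q):
                keep.remove(p)                      # drop the shorter one
            elif len(q) < len(p):
                keep.remove(q)
            else:
                keep.remove(random.choice([p, q]))  # same length: random choice
        return keep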
Fig. 16. Possible paths from an LTS model
Fig. 17. (x) Similarity Matrix, (y) Paths length, (z) Similarity Matrix after elimination
Figure 16 presents an example of this selection strategy applied to an LTS model. In the example, there are four paths. Figure 17 shows the similarity matrix and the length of the paths. Considering that the percentage-of-coverage criterion is 50%, two test cases must be eliminated. Observing the matrix in Figure 17(x), we note that (b) and (c) have the biggest similarity. As they have the same length (the path lengths can be seen in Figure 17(y)), the one to be eliminated is chosen randomly, for example (b). Figure 17(z) presents the matrix after the elimination of (b). From Figure 17(z), we have that (c) and (d) now have the biggest similarity. As |(c)| > |(d)|, (d) is eliminated. We note that the eliminated test cases are the most similar ones: (b) is very similar to (c), and (c) is very similar to (d). Of course, for this simple example, it is clearly possible to execute all test cases, but in an actual setting (with cost and time restrictions), 100% coverage may be infeasible.

In order to assess this strategy, experiments have been conducted. The results presented in the sequel are from an experiment reported in [72] with three different applications. In this experiment, similarity was compared with random choice in order to assess the percentage of transition coverage achieved by the two strategies. A path coverage of 50% was fixed and both strategies were applied 100 times (due to the random choices). The applications chosen are all reactive systems with the following profile.

1. A cohesive feature for adding contacts in a mobile phone's contact list.
2. A message application that deals with embedded items. An embedded item can be a URL, a phone number, or an e-mail address. For each embedded item, it is possible to execute some tasks.
3. The TARGET tool, an application that generates test cases automatically from use-case scenarios.

The results obtained for Application 1 are presented in Figure 18. We note that the y-axis represents the number of transitions removed, whereas the x-axis represents each of the 100 times both algorithms were executed on the test suite generated by considering the 50% path-coverage goal. For this application, similar test cases of similar length have been found; in this case, the similarity technique also involved random choices. Generally, for Application 1, similarity performed better, with only a few runs where random choice was equally or more effective.
Fig. 18. Application 1: Similarity versus Random choice with 50% path coverage
Fig. 19. Application 2: Similarity versus Random choice with 50% path coverage
Finally, for Application 3 (see Figure 20), the performance of the similarity technique was much better. In this case, the application is composed of different features with test cases of different lengths: similarity was purely applied without the need for random choice of test cases of the same size. Further experiments have been conducted to assess more precisely similarity-based selection and also to find out its most indicated contexts of application. For instance, Cartaxo el al [71] presents a more elaborate experimentation that considers both transition and fault-based coverage criteria and path coverage varies from 5% to 95%. The results shows more clearly the benefits of the similarity strategy over random selection with respect to the ability of the selected test suite to detect more faults and transitions.
Fig. 20. Application 3: Similarity versus Random choice with 50% path coverage
5 Test-Model Generation from Use-Case Specifications

Business analysts tend to adopt informal representations, based on natural-language descriptions, to record the result of requirements elicitation. While natural languages seem an appropriate notation to describe documents at the requirements level, they are also a potential cause of ambiguities and inconsistencies, as is well known from software engineering practice. Before progressing towards analysis and design activities, more precise models of the application need to be developed. Ideally, a formal model of the application should be developed to serve as a consistent reference for requirements validation and for verification of the correctness of candidate implementations, or even to generate correct implementations in a constructive way. Complementarily, formal models can be useful for the automatic generation of test cases, as discussed in the previous sections.

In order to benefit from both a natural-language description (to capture requirements) and a formal model (to serve as a basis for the development process, and particularly for test-case generation), templates are proposed in [65] to write use cases. These templates can capture functionality at several abstraction levels, ranging from the interaction between the user and the application (in a black-box style) to the detailed interaction and synchronisation of system components. The text used to fill the templates obeys a fixed grammar; we call it a Controlled Natural Language (CNL). The CNL can be considered a domain-specific language (for mobile applications), which allows us to fix some relevant verbs and complements; this is the key to allowing automatic processing of the templates and of the text inside them. Besides making it possible to verify whether the text conforms to the defined CNL grammar, this also permits the generation of a formal model from the use-case templates.

In the following sections we introduce the use-case templates in some detail and present a strategy to generate ALTS test models from use cases; finally, we consider some related work.
5.1 Use-Case Templates

As we are concerned with the generation of functional test cases, here we concentrate on user-view templates, which capture user interaction with the system through sentences used to describe user actions, system states and system responses. An example is presented in Figure 21, which describes a use case of a simple phone book application. In the domain of mobile applications, use cases are usually grouped to form a feature. Each feature contains an identification number. For instance, the use case in Figure 21 can be considered as part of Feature 11111 - My Phonebook. Although this grouping is convenient, it is not imposed by the strategy.

As we can observe in Figure 21, each use case has a number, a name and a brief description. A use case specifies different scenarios, depending on user inputs and actions. Hence, each execution flow represents a possible path that the user can perform. Execution flows are categorised as main, alternative or exception flows. The main execution flow represents the use case happy path, which is a sequence of steps where everything works as expected. In our example, the main flow captures the successful insertion of a new contact in the phone book.

An alternative execution flow represents a choice situation; during the execution of a flow (typically the main flow), it may be possible to execute other actions. If an action from an alternative flow is executed, the system continues its execution behaving according to the new path specification. Alternative flows can also begin from a step of another alternative flow; this enables reuse of the specification. In our example, the alternative flow allows the insertion of more detailed information related to the contact, in addition to a name and a phone number. The exception flows specify error scenarios caused by invalid input data or unexpected system states. Alternative and exception flows are strictly related to user choices and to system state conditions. The latter may cause the system to respond differently given the same user action. In our example, the exception flow describes the failure to include a contact due to memory overflow.

Each flow is described in terms of steps. The tuple (user action, system state, system response) is called a step. Every step is identified through an identifier, an Id. The user action describes an operation accomplished by the user; depending on the system nature, it may be as simple as pressing some button or as complex as printing a report. The system state is a condition on the current system configuration just before the user action is executed. Therefore, it can be a condition on the current application configuration (setup) or memory status. The system response is a description of the result of the operation, after the user action occurs, based on the current system state. As an example, we consider the step identified as 4M in the main flow in Figure 21. The user action concerns the confirmation of the contact creation. However, this depends on the availability of memory. If the condition holds, then the expected result is that the contact is effectively inserted in the phone book.

A user action, a system state or a system response may be related to named system requirements. This is useful for traceability purposes, both relating use cases with requirements and relating the generated test cases with requirements. When requirements change, it is possible to know which use cases might be impacted and, if it is the case, update
UC 01 - Creating a New Contact
Description: This use case describes the creation of a new contact in the contact list.

Main Flow
Description: Create a new contact. From Step: START. To Step: END.
Step 1M. User Action: Start My Phonebook application. System Response: My Phonebook application menu is displayed.
Step 2M. User Action: Select the New Contact option. System Response: The New Contact form is displayed.
Step 3M. User Action: Type the contact name and the phone number. System Response: The new contact form is filled.
Step 4M. User Action: Confirm the contact creation. [TRS_11111_101] System State: There is enough phone memory to insert a new contact. System Response: The next message is highlighted.

Alternative Flows
Description: Extended information to the contact. From Step: 3M. To Step: 4M.
Step 1A. User Action: Go to context menu and select Extended Information. System Response: The extended information form is displayed. [TRS_111166_102]
Step 2A. User Action: Fill some of the extended information fields. System Response: Some of the extended information form is filled.
Step 3A. User Action: Press the OK softkey. System Response: The phone goes back to the New Contact form. It is filled with the extended information.

Exception Flows
Description: There is not enough memory. From Step: 3M, 3A. To Step: END.
Step 1E. User Action: Confirm the contact creation. System State: There is not enough phone memory. System Response: A dialog is displayed informing that there is not enough memory. [TRS_111166_103]
Step 2E. User Action: Select the OK softkey. System Response: The phone goes back to the My Phonebook application menu.
Fig. 21. Example of a user view use case
them. Test cases related to these use cases can also be updated or regenerated (assuming an automatic approach). Furthermore, requirement information can be used to filter the test-case generation.

There are situations in which a user can choose between different paths. When this happens, it is necessary to define one flow for each path. Every execution flow has one or more starting points, or initial states, and one final state. The starting point is represented by the From Step field and the final state by the To Step field. The From Step field can assume more than one value, meaning that the flow is triggered from different source steps. When one of the From Step items is executed, the first step of the specified execution flow is executed. As an example, we consider the exception flow in Figure 21, which may start after step 3M of the main flow or after step 3A of the alternative flow. The To Step field references only one step; after the last step of an execution flow is performed, control passes to the step defined in the To Step field.

In the main flow, whenever the From Step field is defined as START, it means that this use case does not depend on any other, so it can be the starting point of the system usage, as illustrated in our example. Alternatively, the main flow From Step field may refer to steps of other use cases, meaning that it can be executed after a sequence of events has occurred in the corresponding use case. When the To Step field of any execution flow is set to END, this flow terminates successfully after its last step is executed, as illustrated by the main flow of our example. Subsequently, the user can execute another use case that has the From Step field set to START.

The From Step and To Step fields are essential to define the application navigation, which allows the generation of a formal model like an LTS [69], as discussed in previous sections. These two fields also enable the reuse of existing flows when new use cases are defined; a new scenario may start from a pre-existing step of some flow. Finally, loops can appear in the specification if direct or indirect circular references between flows are defined; this can result in a livelock situation in the case of infinite loops.

The user-view use case in Figure 21 is an example of a mobile-phone functionality. Nevertheless, this template is generic enough to permit the specification of any application, not only mobile-phone ones. The user-view use case has the main characteristics of other use-case definitions, such as UML use cases [290]. However, our template seems to offer more flexibility. The existence of execution flows starting and ending according to other execution flows makes it possible to associate use cases in a more general way than through regular UML associations such as extend, generalisation and include. References enable the reuse of parts of other use cases' execution flows and the possibility of defining loops, so use cases can collaborate to express more complex functionalities.

5.2 Generating ALTS Models

In this section, we present a strategy for translating use-case templates into an ALTS from which test cases can be generated. The general translation procedure is shown in Figure 22 and is explained below.
Fig. 22. Procedure that translates use case templates to an ALTS
– Each template of the use case, starting from the main flow, is processed sequentially and, from each step, states and transitions are created in the target ALTS according to the order of the steps defined in the template. This is controlled by the two for loops.
– currentState represents the state from which transitions are to be created for the current step. This is either: (i) the last state created, in case the From Step field is defined as START or this is the first state; or (ii) the last state of a given step (defined in the From Step field) of another template.
– From Step and To Step guide the connection of the traces created from each of the templates. For this, the Step Id label is associated with the states that are created from it (see the State constructor).
– User Action, System State and System Response become transitions that are preceded by steps, conditions and expectedResults annotations, respectively.
– States are created as new transitions need to be added. They are incrementally numbered from 0. States and transitions are created by the add operation, but states already created can be reused when connecting the traces of new templates. When possible, addToStep (To Step is different from END), addFromStep (From Step is different from START) and addNewCondition (a new condition is considered based on a user action already added) are used instead.
– Duplicated annotated transitions from the same state are also avoided. This can happen when two or more steps are possible from the same state, or when two or more conditions define expected results from a single user action.
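As a simplified illustration of this idea (not the actual procedure of Figure 22, which also handles From Step and To Step connections, state reuse and duplicate avoidance), the sketch below turns a single flow of steps into annotated transitions.

    def flow_to_alts(steps):
        """steps: list of (user_action, system_state, system_response) tuples.
        Returns a list of (source, label, target) transitions of a linear ALTS."""
        transitions = []
        state = 0
        def add(label):
            nonlocal state
            transitions.append((state, label, state + 1))
            state += 1
        for action, condition, response in steps:
            add("steps: " + action)
            if condition:                      # the conditions annotation is optional
                add("conditions: " + condition)
            add("expectedResults: " + response)
        return transitions

    # Step 1M of the main flow in Figure 21:
    main_flow = [("Start My Phonebook application.", "",
                  "My Phonebook application menu is displayed.")]
    for t in flow_to_alts(main_flow):
        print(t)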
The ALTS generated from the templates presented in Figure 21 is shown in Figure 23. We observe that, due to the lack of space, some transition labels are not completely presented. The templates are connected at state 13, which corresponds to the next step after the 3M step in the main flow, and at state 15, which corresponds to the state leading from the conditions annotation that combines the main flow and the exception flow. For the sake of simplicity, the procedure presented in Figure 22 only considers the existence of one From Step and one To Step. Therefore, From Step 3A in the exception flow is not directly handled. However, this behavior happens to be included in the ALTS in Figure 23.
Fig. 23. ALTS model generated from the Creating a New Contact Use Case
5.3 Final Considerations

Rather than building specifications in an ad hoc way, some approaches in the literature have also explored the derivation of formal specifications from requirements. ECOLE [295] is a look-ahead editor for a controlled language called PENG (Processable English), which defines a mapping between English and first-order logic in order to verify requirements consistency. A similar initiative is the ACE (Attempto Controlled English) project [128], also involved with natural-language processing for specification validation through logic analysis. The work reported in [174] establishes a mapping between English specifications and finite state machine models. In industry, companies,
such as Boeing [350], use a controlled natural language to write manuals and system specifications, improving document quality. There are also approaches that use natural language to specify system requirements and automatically generate formal specifications in an object-oriented notation [203]. Concerning the format to write requirements, use cases describe how entities (actors) cooperate by performing actions in order to achieve a particular goal. Some consensus is admitted regarding the use-case structure and writing method [84]; a use case is specified as a sequence of steps forming system-usage scenarios, and natural language is used to describe the actions taken in a step. This format makes use cases suitable to a wide audience. This section has proposed a strategy that automatically translates use cases, written in a Controlled Natural Language, into specifications in the ALTS. For obvious reasons, it is not possible to allow a full natural language as a source. Rather, a subset of English with a fixed grammar (CNL) has been adopted. The context of this work is a cooperation between CIn-UFPE and Motorola, known as Brazil Test Center, whose aim is to develop strategies to support the testing of mobile device applications. Therefore, the proposed CNL reflects this application domain. Unlike the cited approaches, which focus on translation at a single level, the work reported in [65] proposes use-case views possibly reflecting different levels of abstraction of the application specification; however, rather than adopting ALTS as the target test model, that work generates CSP test models. This is illustrated in [65] through a user and a component view. A refinement relation between these views is also explored; the use of CSP is particularly relevant in this context: its semantic models and refinement notions allow precisely capturing formal relations between user and component views. The approach is entirely supported by tools. A plug-in to Microsoft Word 2003 [307] has been implemented to allow checking adherence of the use-case specifications to the CNL grammar. Another tool has been developed to automate the translation of use cases written in CNL into CSP; FDR [285], a CSP refinement checker, is used to check refinement between user and component views.
6 The TARGET Tool

The aim of this section is to present the TARGET tool, whose purpose is to mechanise a test-case generation strategy that supports the steps presented in the previous sections. In particular, TARGET accepts as input use-case scenarios written in CNL (as addressed in Section 5) and generates test cases, also written in CNL, which include the test procedure, a description and related requirements. Moreover, the tool can generate a traceability matrix relating test cases, use cases and requirements.

The purpose of TARGET is to aid test engineers in the creation of test suites. Three major aspects distinguish TARGET from other model-based testing tools: the use of test purposes, provided by the test engineer, to restrict the number of generated test cases as needed, focusing on test cases that are more critical or relevant to a given task; algorithms for the elimination of similar test cases, reducing the test-suite size without significant impact on effectiveness; and the use of natural-language use cases as system input, which is more natural for engineers when compared to formal specification languages.
Internally, TARGET translates CNL use-case scenarios into LTSs in order to generate test cases, possibly guided by test purposes that are also translated into LTSs. Nevertheless, the use of LTSs is completely hidden from test engineers. Some of the facilities provided by TARGET to support test-case generation from use-case scenarios are as follows.

– processing of use-case templates and automatic generation of LTS test models;
– automatic purpose-based generation of test suites, with adequate coverage criteria, from LTS models;
– reduction of test suites by eliminating test cases according to a notion of similarity and user interaction;
– automatic generation of traceability matrices: requirements versus test cases, requirements versus use cases, and test cases versus use cases;
– a friendly user interface for editing test purposes and for guiding the user to generate test cases and reduce the size of test suites.

Although the tool was originally motivated by mobile-phone applications, it does not actually depend on any platform or domain. TARGET has already been used in practice for generating test cases for mobile-phone as well as desktop applications. For instance, it has been used to generate test cases to test the implementation of the tool itself.

6.1 Using the Tool

TARGET allows the user to create, open, close and refresh projects; open, rename and delete artifacts (use-case documents and test suites); import use-case documents; and generate test suites, through the menu options presented in Figure 24, which shows a project with the phonebook feature and two use cases. The left panel contains the use-cases view, in which the use cases are outlined. It is shown in a tree structure, grouping the use cases according to their feature. The bottom panel includes three views:

– The artifacts view outlines the project artifacts. It groups, in two different folders, the use-case documents and the test-suite documents. These artifacts can be edited, removed or renamed.
– The errors view lists all errors in the use-case documents. Each error is composed of the description, the resource where the error has occurred, the path of the resource, and the error location inside the resource.
– The search view displays the results of a search operation. It lists all use cases that are found according to the given query. Each result is composed of the use-case name, the use-case identification, the identification of the feature that contains the use case (feature ID) and the name of the document that contains the use case (document name).

The main panel is used to display the artifact contents. It shows use-case documents in HTML format, as the use-case scenario presented in the main panel in Figure 24.
Fig. 24. The main screen of TARGET
The major functionality of TARGET is the automatic generation of test cases. This facility provides the user with two generation features: test suites can be generated for all possible scenarios or for some specific requirements. In either case, TARGET offers two features that allow the user to select the test cases in the output test suite: defining test purposes, choosing the test-case path coverage based on similarity, or both. As explored in the previous section, a test purpose is an abstract description of a subset of the specification, allowing the user to choose behaviors to test, and consequently reducing the specification exploration. In TARGET, test purposes are incomplete sequences of steps. Figure 25 illustrates the definition of test purposes. The notation is similar to a regular-expression language. The test purpose in the figure selects all the test cases of the use case UC_02 (Searching a Contact) of the feature 11111 (My Phonebook) that include steps 1M and 2B. As another example of a test purpose, we can use the sequence 11111#UC_01#1M;*;11111#UC_01#4M to select only test cases that begin with step 1M and end with step 4M of the use case Creating a New Contact.

As we can observe in Figure 25, a test purpose is composed of a sequence of comma-separated steps. Each step can be chosen from a list of use-case steps, in the Steps field. When a step is chosen, TARGET automatically appends it to the Current Test Purpose field. The user can define one or more test purposes through the window presented in Figure 25. If more than one test purpose is defined, TARGET combines the sets of test cases generated for each purpose. As test purposes are defined, they are inserted in the Created Test Purposes field.
Fig. 25. Test purpose and path coverage selection criteria
Another selection criterion is based on similarity. The similarity level is set using the horizontal scale in Figure 25. Setting the scale to 100% means that all test cases will be included in the output test suite; this setting therefore preserves all the test cases selected by the previous selection functionalities (selection by requirements and test purposes). On the other hand, to reduce the test suite by removing the 30% most similar test cases, the user must set the scale to 70%, indicating the intention to keep the 70% most distinct test cases. Details about the similarity algorithm have been presented in the previous section.
The generated test suite files are created with a default name in the test suites folder. Each test case is presented in a template whose fields also obey the CNL rules. An example of a test case in the suite generated for the test purpose of Figure 25 is presented in Figure 26.
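Since the similarity algorithm itself is described in the previous section, the following is only a rough sketch of the reduction step, with an invented similarity measure (the fraction of steps two test cases share). The most similar pair is found repeatedly and one member of that pair is discarded until only the requested percentage of test cases remains; which member of the pair is dropped, and how ties are broken, are simplifications of ours.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimilarityReduction {

    // Invented similarity measure: fraction of shared steps (Jaccard index).
    static double similarity(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Keeps roughly the given fraction (e.g. 0.7) of the most distinct test cases.
    static List<List<String>> reduce(List<List<String>> suite, double keepFraction) {
        List<List<String>> result = new ArrayList<>(suite);
        int target = (int) Math.ceil(keepFraction * suite.size());
        while (result.size() > target && result.size() > 1) {
            int removeIndex = -1;
            double highest = -1.0;
            // Find the most similar pair and mark one member for removal.
            for (int i = 0; i < result.size(); i++) {
                for (int j = i + 1; j < result.size(); j++) {
                    double s = similarity(result.get(i), result.get(j));
                    if (s > highest) {
                        highest = s;
                        removeIndex = j;
                    }
                }
            }
            result.remove(removeIndex);
        }
        return result;
    }
}

Calling reduce(suite, 0.7) on a suite of ten test cases would keep the seven judged most distinct by this measure.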
Fig. 26. Test case generated by TARGET
The generated test-case format has the following structure:
– a Use Cases field, which lists all use cases related to the test case;
– a Requirements field, which is filled with all requirements covered by the test case;
– an Initial Conditions field, which is filled with all system states covered by the test case;
– a Procedure field, which is a sequence of user actions from the use cases (see the use case structure);
– an Expected Results field, which is the sequence of system responses for the corresponding actions (as in the use case templates).
There are additional fields (Setup, Final Conditions and Cleanup) whose contents are not currently generated by TARGET. The first one has to do with the mobile-phone configuration for the execution of the test case, the second one defines a postcondition on the system state established by the test case, and the third one resets the configuration setup for the test-case execution.
The test-suite generation also includes the automatic generation of the traceability matrices, which relate requirements, use cases and test cases. The traceability matrices are included in the output Microsoft Excel document as spreadsheets. There is one spreadsheet for each traceability matrix. TARGET provides the following traceability matrices:
– the Requirements x Use Cases matrix links each requirement to its related use cases;
– the Requirements x Test Cases matrix links each requirement to the test cases that cover it;
– the Use Case x Test Cases matrix links each use case to the related test cases. By related test cases, we mean that at least one step from the use case is contained in the test case.
These matrices are illustrated in Figures 27, 28 and 29; a small sketch of how such a matrix could be assembled is given below.
Fig. 27. Requirement x Use Case matrix
Fig. 28. Requirement x Test Case matrix
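As an illustration of how such a matrix could be assembled before it is written to a spreadsheet, the Requirements x Test Cases matrix is essentially a map from each requirement to the test cases that cover it. The types below are a hypothetical simplification of the information kept about each generated test case; they are not TARGET's actual data model.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TraceabilityMatrix {

    // Hypothetical record: a generated test case and the requirements it covers.
    record GeneratedTestCase(String id, List<String> requirements) {}

    // Builds the Requirements x Test Cases matrix as a map from each
    // requirement to the identifiers of the test cases that cover it.
    static Map<String, List<String>> requirementsToTestCases(List<GeneratedTestCase> suite) {
        Map<String, List<String>> matrix = new LinkedHashMap<>();
        for (GeneratedTestCase tc : suite) {
            for (String requirement : tc.requirements()) {
                matrix.computeIfAbsent(requirement, r -> new ArrayList<>()).add(tc.id());
            }
        }
        return matrix;
    }
}

The other two matrices can be built in the same way by iterating over the use cases associated with each test case, or over the requirements associated with each use case.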
6.2 Final Considerations

TARGET has been developed in Java, based on the Eclipse Rich Client Platform. It depends on some open-source software (Apache Lucene, Jakarta POI and AspectJ), and on the .NET Framework. The input use cases must be written in Microsoft Word 2003. A stable version of the tool has been developed. Four case studies have already been run with Motorola Brazil Test Center teams. These case studies are related to mobile-phone features. As already mentioned, the tool has been used to generate test cases to
Fig. 29. Use Case x Test Case matrix
test the implementation of TARGET itself. All case studies indicated a reduction in test-development effort. A detailed analysis of two of the case studies is presented in [261]. Interruption testing is going to be addressed in the next versions.
Using TARGET, shorter test cycles are expected due to the automation of test design. However, this requires more detailed use cases, which take longer to specify. Still, overall, experience has shown a reduction in test cycle time of around 50%. A more significant decrease in cycle time is expected for maintenance activities; maintaining the use cases and then generating updated test cases is potentially much more productive than maintaining test cases directly. Instead of having people maintain and create a large number of test cases, the idea is to concentrate efforts on properly describing a much smaller number of use cases and then have the tool generate several test cases from each use case.
The use of TARGET, like that of other test-case generation tools, potentially reduces redundancy and ambiguity, both in use cases and in generated test cases. The tool also helps to achieve better coverage, since more tests can be generated in less time and with less effort, and test selection can be achieved using test purposes. The automatic generation of test cases avoids errors in test cases, and allows test-case document inspections to focus on more significant semantic issues. TARGET is restricted to use inside Motorola and its partners.
7 Concluding Remarks

This chapter discusses automatic test-case generation in MBT processes. Test models and test-generation strategies, mostly devoted to reactive systems, are presented, including test models such as Markov chains, LTSs and process algebras. Test-model generation is also considered by introducing a strategy for generating LTSs from use-case specifications.
However, in this chapter, there is a special focus on MBT in the domain of mobile-phone applications. Even though the use of TARGET is not limited to this domain, it has been primarily developed for supporting work in this context, as a result of a research cooperation between our team and Motorola Inc. The tool has been inspired by the challenges of a real testing setting, where MBT is to be integrated with other testing processes to increase productivity and effectiveness in testing.
The construction of TARGET, along with experience from the case studies in which the tool has been applied, has confirmed and also uncovered many issues concerning the advantages of MBT and the challenges it still faces for widespread use. As main advantages, MBT indeed reduces the overall effort of test-case construction and provides adequate coverage of requirements. However, it is important to remark that test-model generation, rather than construction from scratch, has been a differential: test designers can use a controlled natural language to document requirements and do not need to be concerned about writing models. Another advantage is that test suites can be regenerated by modifying the requirements documentation and by applying test-selection strategies that better suit a particular testing goal.
As a main drawback, MBT may generate more test cases than are actually needed. For instance, test cases may be redundant. This can be handled by automatic test-selection strategies, such as the ones presented in this chapter and implemented in TARGET. Also, MBT may not generate important test cases that would have been written by an experienced test designer in the domain. This often happens due to so-called "implicit" domain requirements that are not usually explicitly modelled as requirements for an application. Both problems have been addressed by our research team so that effective solutions can be incorporated in the next versions of the TARGET tool and its processes.
MBT is currently addressed by a number of research groups worldwide. This is a promising strategy, since it can formalise the notion of abstraction that is so important in testing, along with automation and reuse. Also, MBT can make an engineering approach to functional testing possible, where systematic procedures can be undertaken for more reliability, effectiveness and productivity in testing. However, this is not a universal recipe. Finding the right context of application and levels of testing, as well as addressing obstacles such as the ones mentioned above, is determinant for success. Furthermore, even though a number of tools and theories have already been developed, the practice of MBT is still recent and immature. Lessons are still to be learned, and theoretical problems still need to be investigated. As satisfactory progress is made on these issues, effective practices will emerge.
In this chapter, we have concentrated on test-case generation for operational models, particularly from LTSs. We have also been investigating test-case generation from the CSP process algebra, as briefly addressed in Section 3.4. While for isolated features the results seem very similar, for applications that involve feature interaction a process-algebraic test model might be a promising alternative to explore, since its rich repertoire of operators allows capturing elaborate patterns of component architectures and selecting test cases confined to the interaction of a subset of the components.
Testing a Software Product Line
John D. McGregor
Clemson University, USA
The software product line approach to the development of software-intensive systems has been used by organizations to improve quality, increase productivity and reduce cycle time. These gains require different approaches to a number of the practices in the development organization, including testing. The planned variability that facilitates some of the benefits of the product line approach poses a challenge for test-related activities. This chapter provides a comprehensive view of testing at various points in the software development process and describes specific techniques for carrying out the various test-related tasks. These techniques are illustrated using a pedagogical product line developed by the Software Engineering Institute (SEI).
We first introduce the main challenges addressed in this chapter; then Section 2 gives an overview of basic software product line concepts. Section 3 explains, in a product line context, several testing concepts introduced in Chapter 1. Section 4 complements this by introducing Guided Inspections, a technique that applies the discipline of testing to the review of non-software assets typically found in software product lines. The core of the chapter is Section 5, which describes techniques that can be used to test product lines. Section 6 discusses how we can evaluate the benefits of a product line approach for testing, and Section 7 illustrates some of the presented techniques. Finally, we end with a discussion of related issues and research questions (see Section 8), and conclusions (see Section 9) about product line testing and its relation to development.
1 Introduction

Organizations are making the strategic decision to adopt a product line approach to the production of software-intensive systems. This decision is often in response to initiatives within the organization to achieve competitive advantage within their markets. The product line strategy has proven successful at helping organizations achieve aggressive goals for increasing quality and productivity and reducing cycle time. The strategy is successful, at least in part, due to its comprehensive framework that touches all aspects of product development.
Testing plays an important role in this strategic effort. In order to achieve overall goals of increased productivity and reduced cycle time, there need to be improvements to traditional testing activities. These improvements include the following:
– Closer cooperation between development and test personnel
– Increased throughput in the testing process
– Reduced resource consumption
– Additional types of testing that address product line specific faults.
There are several challenges for testing in an organization that realizes seventy-five to ninety-five percent of each product from reusable assets. Among these challenges, we have the following:
– variability: The breadth of the variability that must be accommodated in the assets, including test assets, directly impacts the resources needed for adequate testing.
– emergent behavior: As assets are selected from inventory and combined in ways not anticipated, the result can be an interaction that is a behavior not present in any one of the assets being combined. This makes it difficult to have a reusable test case that covers the interaction.
– creation of reusable assets: Test cases and test data are obvious candidates to be reused and when used as is they are easy to manage. The amount of reuse achieved in a project can be greatly increased by decomposing the test assets into finer grained pieces that are combined in a variety of ways to produce many different assets. The price of this increased reuse is increased effort for planning the creation of the assets and management of the increased number of artifacts.
– management of reusable assets: Reuse requires traceability among all of the pieces related to an asset to understand what an asset is, where it is stored and when it is appropriate for use. A configuration management system provides the traceability by explicit artifacts.
We discuss several activities that contribute to the quality of the software products that comprise the product line, thinking of them as forming a chain of quality in which quality assuring activities are applied in concert with each production step in the software development process. In addition to discussing the changes in traditional testing processes needed to accommodate the product line approach, we present a modified inspection process that greatly increases the defect finding power of traditional inspection processes. This approach to inspection applies testing techniques to the conduct of an inspection.
We use a continuing example throughout the chapter to illustrate the topics and then we summarize the example near the end of the chapter. The Arcade Game Maker product line is an example product line developed for pedagogical purposes (the complete example is available at http://www.sei.cmu.edu/productlines/ppl). A complete set of product line assets is available for this example product line. The product line consists of three games: Brickles, Pong and Bowling. The variation points in the product line include the operating system on which the games run, a choice of an analog, digital, or no scoreboard, and whether the product has a practice mode.
The software product line strategy is a business strategy that uses a specific method to achieve its goals. The material in this chapter reflects this orientation by combining technical and managerial issues. We briefly introduce a comprehensive approach to software product line development and we provide a state-of-the-practice summary. Then we describe the current issues, detail some experiences and outline research questions regarding the test-related activities in a software product line organization.
2 Software Product Lines

A software product line is a set of software-intensive systems sharing a common, managed set of features that satisfy the specific needs of a particular market segment or
mission and that are developed from a common set of core assets in a prescribed manner [83]. This definition has a number of implications for test strategies. Consider these key phrases from the definition:
– set of software-intensive systems: The product line is the set of products. The product line organization is the set of people, business processes and other resources used to build the product line. The commonalities among the products will translate into opportunities for reuse in the test artifacts. The variabilities among products will determine how much testing will be needed.
– common, managed set of features: Test artifacts should be tied to significant reusable chunks of the products, such as features. These artifacts are managed as assets in parallel to the production assets to which they correspond. This will reduce the effort needed to trace assets for reuse and maintenance purposes. A test asset is used whenever the production asset to which it is associated is used.
– specific needs of a particular market segment or mission: There is a specified domain of interest. The culture of this domain will influence the priorities of product qualities and ultimately the levels of test coverage. For example, a medical device that integrates hardware and software requires far more evidence of the absence of defects than the latest video game. Over time those who work in the medical device industry develop a different view of testing and other quality assurance activities from workers in the video game domain.
– common set of core assets: The test core assets include test plans, test infrastructures, test cases and test data. These assets are developed to accommodate the range of variability in the product line. For example, a test suite constructed for an abstract class is a core asset that is used to quickly create tests for concrete classes derived from the abstract class.
– in a prescribed manner: There is a production strategy and production method that define how products are built. The test strategy and test infrastructure must be compatible with the production strategy. A production strategy that calls for dynamic binding imposes similar constraints on the testing technology.
The software product line approach to development affects how many development tasks are carried out. Adopting the product line strategy has implications for the software engineering activities as well as the technical and organizational management activities. The Software Engineering Institute has developed a Framework for Software Product Line Practice (a service mark of Carnegie Mellon University) which defines 29 practice areas that affect the success of the product line. A list of these practices is included in the Appendix. The practices are grouped into Software Engineering, Technical Management and Organizational Management categories. These categories reflect the dual technical and business perspectives of a product line organization. A very brief description of the relationship between the Testing practice area and the other 28 practices is included in the list in the appendix.
The software product line approach seeks to achieve strategic levels of reuse. Organizations have been able to achieve these strategic levels using the product line approach, while other approaches have failed, because of the comprehensive nature of the product line approach. For example, consider the problem of creating a "reusable"
implementation of a component. The developer is left with no design context, just the general notion of the behavior the component should have, and the manager is not certain how to pay for the 50%-100% additional cost of making the component reusable instead of purpose built. In a software product line, the qualities and properties required for a product to be a member provide the context. The reusable component only has to work within that context. The manager knows that the product line organization that owns all of the products is ultimately responsible for funding the development. This chapter presents testing in the context of such a comprehensive approach.

2.1 Commonality and Variability

The products in a product line are very similar but differ from each other (otherwise they would be the same product). The points at which the products differ are referred to as variation points. Each possible implementation of a variation point is referred to as a variant. The set of products possible by taking various combinations of the variants defines the scope of the product line. Determining the appropriate scope of the product line is critical to the success of the product line in general and the testing practice specifically. The scope constrains the possible variations that must be accounted for in testing. Too broad a scope will waste testing resources if some of the products are never actually produced. Too vague a scope will make test requirements either vague or impossible to write.
Variation in a product is nothing new. Control structures allow several execution paths to be specified but only one to be taken at a time. By changing the original inputs, a different execution path is selected from the existing paths. The product line approach adds a new dimension. From one product to another, the paths that are possible change. In the first type of variation, the path taken during a specific execution changes from one execution to the next, but the control flow graph does not change just because different inputs are chosen. In the second type of variation, different control flow graphs are created when a different variant is chosen. This adds the need for an extra step in the testing process, sampling across the product space, in addition to sampling across the data space. This is manageable only because of the use of explicit, pre-determined variation points and automation. Simply taking an asset and modifying it in any way necessary to fit a product will quickly introduce the level of chaos that has caused most reuse efforts to fail.
The commonality among the products in a product line represents a very large portion of the functionality of these products and the largest opportunity to reduce the resources required. In some product lines this commonality is delivered in the form of a platform which may provide virtually identical functionality to every product in the product line. Then a selected set of features are added to the platform to define each product. In other cases, each product is a unique assembly of assets, some of which are shared with other products. These different approaches will affect our choice of test strategy.
The products that comprise a product line have much in common beyond the functional features provided to users. They typically attack similar problems, require similar technologies and are used by similar markets. The product line approach facilitates the
exploitation of the identified commonality even to the level of service manuals and training materials [92].
The commonality/variability analysis of the product line method produces a model that identifies the variation points needed in the product line architecture to support the range of requirements for the products in the product line. Failure to recognize the need for variation in an asset will require custom development and management of a separate branch of code that must be maintained until a future major release.
Variability has several implications for testing:
– Variation is identified at explicit, documented variation points: Each of these points will impose a test obligation in terms either of selecting a test configuration or test data. Analysis is required at each point to fully understand the range of variation possible and the implications for testing.
– Variation among products means variation among tests: The test software will typically have at least the same variation points as the product software. Constraints are needed to associate test variants with product variants. One solution to this is to have automated build scripts that build both the product and the tests at the same time.
– Differences in behavior between the variants and the tests should be minimal: We show later that using the same mechanisms to design the test infrastructure as are used to design the product software is usually an effective technique. Assuming the product mechanisms are chosen to achieve certain qualities, the test software should seek to achieve the same qualities.
– The specific variant to be used at a variation point is bound at a specific time: The binding time for a variation point is one specific attribute that should be carefully matched to the binding of tests. Binding the tests later than the product assets are bound is usually acceptable but not the reverse.
– The test infrastructure must support all the binding times used in the product line: Dynamically bound variants are of particular concern. For example, binding aspects [191] to the product code as it executes in the virtual machine may require special techniques to instrument test cases.
– Managing the range of variation in a product line is essential: Every variant possibility added to a variation point potentially has a combinatorial impact on the test effort. Candidate variants should be analyzed carefully to determine the value added by that variant. Test techniques that reduce the impact of added variants will be sought as well.
Commonality also has implications for testing:
– Techniques that avoid retesting of the common portions can greatly reduce the test effort.
– The common behavior is reused in several products, making an investment in test automation viable. So it is also explored in this chapter.
2.2 Planning and Structuring

Planning and structuring are two activities that are important to the success of the product line. A software product line is chartered with a specific set of goals. Test planning takes these high level goals into account when defining the goals for the testing activities. The test assets will be structured to enhance their reusability. Techniques such as inheritance hierarchies, aspect-oriented programming and template programming provide a basis for defining assets that possess specific attributes, including reusability.
Planning and structuring must be carried out incrementally. To optimize reuse, assets must be decomposed and structured to facilitate assembly by a product team. One goal is to have incremental algorithms that can be completed on individual modules and then be more rapidly completed for assembled subsystems and products using the partial results. Work in areas such as incremental model checking and component certification may provide techniques for incremental testing [334].
A product line organization defines two types of roles: core asset builder and product builder. The core asset builders create assets that span a sufficient range of variability to be usable across a number of products. The core assets include early assets such as the business case, requirements and architecture, and later assets such as the code. Test assets include plans, frameworks and code.
The core asset builder creates an attached process for each core asset. The attached process describes how to use the core asset in building a product. The attached process may be a written description, provided as a cheatsheet in Eclipse or as a guidance element in the .NET environment, or it may be a script that will drive an automated tool. The attached process adds value by reducing the time required for a product builder to learn how to use the core asset. The core asset builder's perspective is: create assets that are usable by product builders. A core asset builder's primary trade-off is between sufficient variability to maximize the reuse potential of an asset and sufficient commonality to provide substantial value in product building. The core asset developer is typically responsible for creating all parts of an asset. This may include test cases that are used to test the asset during development and can be used by product developers to sanity test the assets once they are integrated into a product.
The product builders construct products using core assets and any additional assets they must create for the unique portion of the product. The product builder provides feedback to the core asset builders about the usefulness of the core assets. This feedback includes whether the variation points were sufficient for the product. The product builder's perspective is: achieve the required qualities for their specific product as rapidly as possible. The product builder's primary trade-off is between maximizing the use of core assets in building the product and achieving the precise requirements of their specific product. In particular, product builders may need a way to select test data to focus on those ranges critical to the success of their product.
To coordinate the work of the two groups, a production strategy is created that provides a strategic overview of how products will be created from the core assets. A production method is defined to implement the strategy.
The method details how the core asset developers should build and test the core assets so that product builders can achieve their objectives.
Fig. 1. Test points
A more in-depth treatment of software product lines can be found in [83].
3 Testing Overview

As discussed throughout this book, testing is the detailed examination of an artifact guided by specific information. The examination is a search for defects. Program failures signal that the search has been successful. How hard we search depends on the consequences if a defect is released in a product. Software for life support systems will require a more thorough search than word processing software.
The purpose of this section is to give a particular perspective on testing at a high level, both recalling concepts discussed in Chapter 1 and relating them to the product line context. We discuss some of the artifacts needed to operate a test process. After that we present a perspective on the testing role and then briefly describe fault models.

3.1 Testing Artifacts

This definition of testing encompasses a number of testing activities that are dispersed along the software development life cycle. We refer to the places where these activities are located as test points. Figure 1, similar to what is illustrated in Chapter 1, shows the set of test points for a high-level view of a development process. This sequence is intended to establish the chain of quality.
As mentioned in Chapter 1, the IEEE 829 standard [176] defines a number of testing artifacts, which will be used at each of the test points in the development process. Several of these test assets are modified from their traditional form to support testing in a
product line. To better understand the necessary modifications, we now recall some of the concepts discussed in Chapter 1 and relate them to the product line context:
– test plan: A description of what testing will be done, the resources needed and a schedule for when activities will occur. Any software development project should have a high level end-to-end plan that coordinates the various specific types of tests that are applied by developers and system testers. Then individual test plans are constructed for each test that will be conducted. In a software product line organization a distinction is made between those plans developed by core asset builders that will be delivered to the product builders of every product and the plans that the product builder derives from the product line plan and uses to develop their product specific plan. The core asset builders might provide a template document or a tool that collects the needed data and then generates the required plan.
– test case: A single configuration of test elements that embodies a use scenario for the artifact under test. In a software product line, a test case may have variation points that allow it to be configured for use with multiple products.
– test data: All of the data needed to fully describe a scenario. The data is linked to specific test cases so that it can be reused when the test case is. The test data may have variation points that allow some portion of the data to be included or excluded for a particular use.
– test report: A summary of the information resulting from the test process. The report is used to communicate to the original developers, management and potentially to customers. The test report may also be used as evidence of the quantity and quality of testing in the event of litigation related to a product failure.
The primary task in testing is defining effective test cases. To a tester, that means finding a set of stimuli that, when applied to the artifact under test, exposes a defect. We discuss several techniques for test case creation but all of them rely on a fault model, as briefly discussed in Chapter 1 and explored in more detail in Chapter 8. There will be a fault model for each test point. We focus on faults related to being in a product line in Section 1.

3.2 Testing Perspective

The test activities, particularly the review of non-software assets, are often carried out by people without a traditional testing background. Unit tests are usually carried out by developers who pay more attention to creating than critiquing. A product line organization will provide some fundamental training in testing techniques and procedures but this is no substitute for the perspective of an experienced tester. In addition to training in techniques, the people who take on a testing role at any point in any process should adopt the "testing perspective". This perspective guides how they view their assigned testing activities. Each person with some responsibility for a type of testing should consider how these qualities should affect their actions:
– Systematic: Testing is a search for defects and an effective search must be systematic about where it looks. The tester must follow a well-defined process when they
are selecting test cases so that it is clear what has been tested and what has not. For example, coverage criteria are usually stated in a manner that describes the "system." The "all branches" level of test coverage means that test cases have been created for each path out of each decision point in the control flow graph.
– Objective: The tester should not make assumptions about the work to be tested. Following specific algorithms for test case selection removes any of the tester's personal feelings about what is likely to be correct or incorrect. "Bob always does good work, I do not need to test his work as thoroughly as John's," is a sure path to failure.
– Thorough: The tests should reach some level of coverage of the work being examined that is "complete" by some definition. Essentially, for some classes of defects, tests should look everywhere those defects could be located. For example, test every error path if the system must be fault tolerant.
– Skeptical: The tester should not accept any claim of correctness until it has been verified by an acceptable technique. Testing boundary conditions every time eliminates the assumption that "it is bound to work for zero."

3.3 Fault Models

A fault model is the set of known defects that can result from the development activities leading to the test point. Faults can be related to several different aspects of the development environment. For example, programs written in C and C++ are well known for null pointer errors. Object-oriented design techniques introduce the possibility of certain types of defects such as invoking the incorrect virtual method [268,10]. Faults also are the result of the development process and even the organization. For example, interface errors are more likely to occur between modules written by teams that are non-co-located. The development organization can develop a set of fault models that reflect their unique blend of process, domain and development environment. The organization can incorporate some existing models such as Chillarege's Orthogonal Defect Classification into the fault models developed for the various test points [77].
Testers use fault models to design effective and efficient test cases since the test cases are specifically designed to search for defects that are likely to be present. Developers of safety critical systems construct fault trees as part of a failure analysis. A failure analysis of this type usually starts at the point of failure and works back to find the cause. Another type of analysis is conducted as a forward search. These types of models capture specific faults. We want to capture fault types or categories of faults. The models are further divided so that each model corresponds to specific test points in the test process. A product line organization develops fault models by tracking defects and classifying these defects to provide a definition of the defect and the frequency with which it occurs. Table 1 shows some possible faults per test point related to a product line strategy. Others can be identified from the fault trees produced from safety analyses conducted on product families [94,93,218].
Table 1. Faults by test point

Test point             Example faults
Requirements           incomplete list of variations
Analysis               missing constraints on variations
Architecture Design    contradictions between variation constraints
Detailed design        failure to propagate variations from subsystem interface
Unit testing           failure to implement expected variations
Integration testing    mismatched binding times between modules
System testing         inability to achieve required configuration
3.4 Summary

Nothing that has been said so far requires that the asset under test be program source code. The traditional test points are the units of code produced by an individual developer or team, the point at which a team's work is integrated with that of other teams, and the completely assembled system. In Section 4 we present a review/inspection technique, termed Guided Inspection, that applies the testing perspective to the review of non-software assets. This technique can be applied at several of the test points shown in Figure 1. This adds a number of test points to what would usually be referred to as "testing" but it completes the chain of quality. In Section 5 we present techniques for the other test points.
In an iterative, incremental development process the end of each phase may be encountered many times, so the test points may be exercised many times during a development effort. Test implementations must be created in anticipation of this repetition.
4 Guided Inspection

Guided Inspection is a technique that applies the discipline of testing to the review of non-software assets. The review process is guided by scenarios that are, in effect, test cases. This technique is based on Active Reviews by Parnas and Weiss [273]. An inspection technique is appropriate for a chapter about testing in a product line organization because the reviews and inspections are integral to the chain of quality and because the test professionals should play a key role in ensuring that these techniques are applied effectively.

4.1 The Process

Consider a detailed design document for the computation engine in the arcade game product line. The document contains a UML model complete with OCL constraints as well as text tying the model to portions of the use case model that defines the product line. A Guided Inspection follows the steps shown in Figure 2, a screenshot from the Eclipse Process Framework Composer. The scenarios are selected from the use case model shown in Figure 3.
Fig. 2. Guided inspection description
Fig. 3. Use case diagram
For this example, we picked “The Player has begun playing Brickles. The puck is in play and the Player is moving the paddle to reflect the puck. No pucks have been lost and no bricks hit so far. The puck and paddle have just come together on this tick of the clock.” as a very simple scenario that corresponds to a test pattern discussed later.
Fig. 4. Central class diagram
The GameBoard shown in the class diagram in Figure 4 is reused, as is, in each game. It serves as a container for GamePieces. According to the scenario, the game is in motion, so the game instance is in the moving state, one of the states shown in Figure 8. The sequence diagram shown in Figure 5 shows the action as the clock sends a tick to the gameboard, which sends the tick on to the MovableSprites on the gameboard. After each tick the gameboard invokes the "check for collision" algorithm shown in Figure 6. The collision detection algorithm detects that the puck and paddle have collided and invokes the collision handling algorithm shown in Figure 7.
In the inspection session, the team reads the scenario while tracing through the diagrams to be certain that the situation described in the scenario is accurately represented in the design model, looking for problems such as missing associations among classes and missing messages between objects. The defects found are noted and in some development organizations would be written up as problem reports. Sufficient scenarios are created and traced to give evidence that the design model is complete, correct and consistent.
Coverage is measured by the portions of diagrams, such as specific classes in a class diagram, that are examined as part of a scenario. One possible set of coverage criteria, listed in order of increasing coverage, includes:
– a scenario for each end-to-end use case, including "extends" use cases
– a scenario that touches each "includes" use case
Fig. 5. Basic running algorithm
Fig. 6. Collision detection algorithm
Fig. 7. Collision handling algorithm
– a scenario that touches each variation point
– a scenario that uses each variant of each variation point.
Guided Inspection is not the only scenario-based evaluation tool that can be employed. The Architecture Tradeoff Analysis Method (ATAM) developed by the SEI also uses scenarios to evaluate the architecture of a product line [28]. Their technique looks specifically at the quality attributes, that is, the non-functional requirements, that the architecture is attempting to achieve.
The benefit of a Guided Inspection session does not stop with the many defects found during the inspections. The scenarios created during the architecture evaluation and detailed design will provide an evolution path for the scenarios to be used to create executable integration and system test cases.

4.2 Non-functional Requirements

Software product line organizations are concerned about more than just the functional requirements for products. These organizations want to address the non-functional requirements as early as possible. Non-functional requirements, sometimes called quality attributes, include characteristics such as performance, modifiability and dependability. So and others describe a technique for using scenarios to begin testing for performance early in the life of the product line [284].
Both Guided Inspection and the ATAM provide a means for investigating quality attributes. During the "create scenarios" activity, scenarios, elicited from stakeholders, describe desirable product behavior in terms of user-visible actions. The ArchE tool
Fig. 8. Simple state diagram
(available for download at http://www.sei.cmu.edu/architecture/arche.html), developed at the SEI, aids in reasoning about these attributes. It provides assistance to the architect in trading off among multiple, conflicting attributes.
5 Testing Techniques for a Product Line

The test practice in a product line organization encompasses all of the knowledge about testing necessary to operate the test activities in the organization. This includes the knowledge about the processes, technologies and models needed to define the test method. We first discuss testing as a practice since this provides a more comprehensive approach than just discussing all of the individual processes needed at the individual test points [179]. Then we use the three phase view of testing developed by Hetzel [167]—planning, construction and execution—to structure the details of the rest of the discussion.
In his keynote to the Software Product Line Testing Workshop (SPLiT), Grütter listed four challenges to product line testing [150]:
– Meeting the product developer's quality expectations for core assets
– Establishing testing as a discipline that is well-regarded by managers and developers
– Controlling the growth of variability
– Making design decisions for testability.
We incorporate a number of techniques that address these challenges.

5.1 Test Method Overview

The test practice in an organization encompasses the test methods, coordinated sets of processes, tools and models, that are used at all test points. For example, "test first development" is an integrated development and testing method that follows an agile development process model and is supported by tools such as JUnit.
The test method for a product line organization defines a number of test processes that operate independently of each other. This is true for any project but in a product
line organization these processes are distributed among the core asset and product teams and must be coordinated. Often the teams are non-co-located and communicate via a variety of mechanisms.
The test method for a project must be compatible with the development method. The test method for an agile development process is very different from the test method for a certifiable software (FAA, FDA, and others) process. The rhythm of activity must be synchronized between the test and development methods. Tasks must be clearly assigned to one method or the other. Some testing tasks such as unit testing will often be assigned to development staff. These tasks are still defined in the testing method so that expectations for defect search, including levels of coverage and fault models, can be coordinated.
There is a process defined for the conduct of testing at each of the test points. This process defines procedures for constructing test cases for the artifacts that are under test at that test point. The test method defines a comprehensive fault model and assigns responsibility for specific defect types to each test point. The product line test plan should assign responsibility for operating each of these test processes. An example model of the testing practice for a software product line organization is provided elsewhere [243].

5.2 Test Planning

Adopting the software product line approach is a strategic decision. The product line organization develops strategies for producing products and testing them. The strategies are coordinated to ensure an efficient, effective operation. One of the challenges to testing in a product line organization is achieving the same gains in productivity as product production so that testing does not prevent the organization from realizing the desired improvements in throughput. Table 2 shows frequently desired production and product qualities and corresponding testing actions.
The faster time to market is achieved by eliminating the human tester in the loop as much as possible. Short iterations from test to debug/fix will keep cycle time shorter. Production is cheaper and faster when the human is out of the loop. The product line strategy also supports having a cheaper, more domain-oriented set of product builders. These product builders could be less technical and paid less if the core assets are designed to support a more automated style of product building and testing. The quality of the products can be made better because more thorough coverage can be achieved over time. Finally, the product line core asset base supports mass customization but requires combinatorial test techniques to cover the range of variability without having the test space explode.

Table 2. Product qualities and testing actions

To achieve            use
faster                automation and iteration
cheaper               automation and differentiated work force
better                more thorough coverage
mass customization    combinatorial testing

Table 3. Organization by test point

                      Core asset builders                                  Product developers
Unit testing          Major responsibility; test to reusability level     Minor responsibility; test for specific functionality
Integration testing   Shared responsibility; n-way interactions tested    Shared responsibility; existing interactions tested
System testing        Minor responsibility; test example products         Major responsibility; test the one product

The production planning activity begins with the business goals of the product line and produces a production
plan that describes the production strategy and the production method [74]. The testing activities in the product line organization are coordinated with the product development activities by addressing both production and testing activities during production planning. The plan talks about how core assets will be created to support strategic reuse and how products will be assembled to meet the business goals of the organization. The production method is a realization of the production strategy and defines how products are created from the core assets. The core asset development method is structured to produce core assets that facilitate product building. The production method specifies the variation mechanisms to be used to implement the variation points. These mechanisms determine some of the variation mechanisms used in the test software.
Table 3 shows the division of testing responsibilities between the core asset developers and the product builders for code-based assets. In the earlier test points, shown in Figure 1, the core asset developers have the major responsibility while the product builders do incremental reviews. The product line test plan maps the information in Table 3 into a sequence of testing activities.
The IEEE standard outline for a test plan provides the entries in the leftmost column in Table 4. We have added comments for some of the items that are particularly important in a product line environment. These comments apply to the product line wide test plan that sets the context for the individual plans for each core asset and product. The plan illustrates the concurrency possible in the test activities where the core asset team is testing new core assets while products built from previous assets are being tested by the product teams.

Defining the test method. Once the production strategy has been defined, the product building and test methods can be developed. The method engineer works with testers and developers to specify the processes, models and technologies that will be used during the various test activities. For example, if a state-based language is chosen as the programming language in the production method, then testing focuses on states, and the notion of a "switch cover", a set of tests that trace all the transitions in the design state machine, becomes the unit of test coverage (a small sketch of measuring this kind of coverage is given below).
In a software product line organization all of the other 28 practices impact testing. Some of the testing-related implications of the three categories of practices are shown in Table 5. The appendix contains a more detailed analysis of the relationship of testing to each of the other practice areas.
The test method describes the test tools that will be used. We cover these more thoroughly in Section 5.3 but we give one example here. If development will be done in Java, the testers will use JUnit and its infrastructure.
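Returning to the state-machine example above, plain transition coverage can be measured by collecting the transitions exercised by a set of test traces; a full switch cover in the sense used here corresponds to a result of 1.0. The types and names below are our own illustration and do not come from any particular tool.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TransitionCoverage {

    // A transition of the design state machine, e.g. ("moving", "tick", "moving").
    record Transition(String from, String event, String to) {}

    // Fraction of the state machine's transitions exercised by the given test traces.
    static double coverage(Set<Transition> machine, List<List<Transition>> traces) {
        Set<Transition> covered = new HashSet<>();
        for (List<Transition> trace : traces) {
            covered.addAll(trace);
        }
        covered.retainAll(machine);
        return machine.isEmpty() ? 1.0 : (double) covered.size() / machine.size();
    }
}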
Table 4. IEEE test plan outline

Introduction: This is the overall plan for product line testing.
Test Items: All products possible from the core assets.
Tested Features: Product features are introduced incrementally. As a product uses a feature it is tested and included in the core asset base.
Features Not Tested (per cycle)
Testing Strategy and Approach: Separate strategies are needed for core assets and products.
Syntax
Description of Functionality
Arguments for tests
Expected Output
Specific Exclusions
Dependencies: The product line test report should detail dependencies outside the organization. Are special arrangements necessary to ensure availability over the life of the product line?
Test Case Success/Failure Criteria: Every type of test should have explicit instructions as to what must be checked to determine pass/fail.
Pass/Fail Criteria for the Complete Test Cycle
Entrance Criteria/Exit Criteria
Test Suspension Criteria and Resumption Requirements: In a product line organization a fault found in a core asset should suspend product testing and send the problem to the core asset team.
Test Deliverables/Status Communications Vehicles: Test reports are core assets that may be used as part of safety cases or other certification procedures.
Testing Tasks
Test Planning
Test Construction
Test Execution and Evaluation
Hardware and Software Requirements
Problem Determination and Correction Responsibilities: Important that this reflect the structure defined in the CONOPS.
Staffing and Training Needs/Assignments
Test Schedules
Risks and Contingencies: The resources required for testing may increase if the testability of the asset specifications is low. The test coverage may be lower than is acceptable if the test case selection strategy is not adequate. The test results may not be useful if the correct answers are not clearly specified.
Approvals
Table 5. Implications of practice area categories for personnel roles

Organizational management. Core asset development: shift traditional emphasis from backend to frontend testing. Product development: consider the impact of product sequence on development of assets, including tests.
Technical management. Core asset development: coordinate development of core assets with product development; what testing tools are delivered with the core assets? Product development: provide configuration support for tests as well as development artifacts.
Software engineering. Core asset development: design for testability. Product development: use testing to guide integration.
The test method in a data-intensive application development effort might specify that unit testers will create datapools, that is, specially formatted files of data for test cases. The test case then contains commands that retrieve data as a test case is ready to be run. This facilitates sharing test data across different test cases and different software development efforts. In this way datapools help automate the handling of large amounts of data. The method defines techniques and provides examples such as the one below. Listing 4.1 shows the setUp method in a JUnit test class that uses datapools to apply test data to configuration cases. This is a reusable chunk that could be packaged and used by any JUnit test class.
protected void setUp() {
    // Initialize the datapool factory
    IDatapoolFactory dpFactory;
    dpFactory = new Common_DatapoolFactoryImpl();
    // Load the shoppingCartDatapool datapool
    IDatapool datapool = dpFactory.load(
        new java.io.File("c:\\courses\\...\\velocityPool.datapool"), false);
    // Create an iterator to traverse the datapool
    dpIterator = dpFactory.open(datapool,
        "org.eclipse.hyades.datapool.iterator.DatapoolIteratorSequentialPrivate");
    // Initialize the datapool to traverse the first
    // equivalence class.
    dpIterator.dpInitialize(datapool, 0);
}
Listing 4.1. Datapool snippet
Design for testability. For a given component C implemented in a single-product, custom development effort, assume that the component is executed $x$ times in a specified time period. In development where multiple copies of the product are deployed, the same component will now be executed $nc \cdot x$ times in the same time period, where $nc$ is the number of copies. In product line development, the same component is executed

$$\sum_{i=1}^{n_p} (nc_i \cdot x_i)$$

times in the same time period, where $n_p$ is the number of products in which the component is used, $nc_i$ is the number of copies of a given product and $x_i$ is the number of executions for a given product.
In this scenario, assume that the probability of a defect in the component causing a failure is $P(d)$. Obviously the number of failures observed in the product line scenario will likely be greater than in the other two scenarios as long as $P(d)$ remains constant. The expected number of failures can be stated as:

$$\mathit{expectedNumFailures} = P(d) \cdot \sum_{i=1}^{n_p} (nc_i \cdot x_i)$$

In a product line, a component is used in multiple products, which may have different levels of quality attributes and different types of users. We expect the range of input data presented to a component to vary from one product context to another. So, $P(d)$ does not remain constant. If we assume that when a failure occurs in one product, its failure is known to the organization, the number of failures can be stated as:

$$\mathit{expectedNumFailures} = \sum_{i=1}^{n_p} \big(P_i(d) \cdot (nc_i \cdot x_i)\big)$$
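To illustrate the aggregate effect with some invented numbers (they are ours, not from the chapter): suppose a component is used in $n_p = 2$ products, with $nc_1 = 1000$ copies executing it $x_1 = 10$ times a day, $nc_2 = 100$ copies executing it $x_2 = 50$ times a day, $P_1(d) = 10^{-4}$ and $P_2(d) = 10^{-3}$. Then $\mathit{expectedNumFailures} = 10^{-4} \cdot (1000 \cdot 10) + 10^{-3} \cdot (100 \cdot 50) = 1 + 5 = 6$ failures per day across the product line, even though any single deployed copy fails only rarely.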
We discuss later that this aggregate decrease in testability may require additional testing. Part of developing the production method is specifying the use of techniques to ensure that assets are testable. There are two design qualities that guide these definitions:
– Observability: provide interfaces that allow the test software to observe the internal state of the artifact under test. Some languages provide a means of limiting access to a particular interface so that encapsulation and information hiding are not sacrificed. For example, declaring the observation interface to have package visibility in Java and then defining the product and test assets in the same package achieves the objective. A similar arrangement is possible with the C++ friend mechanism.
– Controllability: provide interfaces that allow the artifact under test to be placed in a particular state. Use the same techniques as for observability to preserve the integrity of the artifact.
Testability is important in a software product line organization because the presence of variability mechanisms makes it more difficult to search through an asset for defects. A level of indirection is introduced by many of the mechanisms. The product line infrastructure may provide tools that provide a test view of the assets. This test view allows test objects to observe the state of the object under test and to set the state to any desired value. A module can provide application behavior through one interface that hides its implementation while providing access to its internals through another, transparent interface. The transparent test interface may be hidden from application modules but visible to test modules. One approach is to use package level visibility and include only the module and its test module. Or, the test interface may be protected by a tool that checks, prior to compilation, for any reference to the test interface other than from a test module.
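As a minimal sketch of such a test view with package visibility in Java (the class and its methods are invented for illustration; they echo the MovableSprite concept of the arcade game example but are not its actual code), a product component can expose its application behavior publicly while granting observability and controllability only to test classes placed in the same package:

public class MovableSprite {
    private int velocityX;
    private int velocityY;

    // Application interface: visible to every product module.
    public void reverseVertical() {
        velocityY = -velocityY;
    }

    // Test view, package visibility only: a test class compiled into the same
    // package can observe and control the internal state, while ordinary
    // clients of the public interface cannot.
    int peekVelocityY() {                // observability
        return velocityY;
    }

    void forceVelocity(int vx, int vy) { // controllability
        velocityX = vx;
        velocityY = vy;
    }
}

A JUnit test class declared in the same package can call peekVelocityY and forceVelocity directly, preserving encapsulation with respect to all other modules.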
Test coverage. As discussed in Chapter 1, test coverage is a measure of how thorough the search has been with the test cases executed so far. The test plan defines test coverage goals for each test point. Setting test coverage levels is a strategic decision that affects the reputation of the organization in the long term and the quality of an individual product in the short term. It is also strategic because the coverage goals directly influence the costs of testing and the resources needed to achieve those goals.

“Better” is often one of the goals for products in a product line, where better refers to some notion of improved quality. One approach to achieving this quality is to allow the many users of the products to find defects and report them, so they can be repaired. However, in many markets letting defects reach the customer is unacceptable. An alternative is to require more thorough test coverage of the core assets compared to coverage for typical traditional system modules.

The earlier discussion on testability leads to the conclusion that traditional rules of thumb used by testers and developers about the levels of testing to which a component should be subjected will not be adequate for a software product line environment. The increased number of executions raises the likelihood that defects will be exposed in the field unless the test coverage levels for in-house testing are raised correspondingly. This is not to say that individual users will see a decline in reliability. Rather, the increased failures will be experienced as an aggregate over the product line. Help desks and bug-reporting facilities will feel the effects. If the software is part of a warranted product, the cost of repairs will be higher than anticipated. The weight of this increase in total failures may result in pressures, if not orders, to recall and fix products [244].

The extra level of complexity in the product line, the instantiation of individual products, should be systematically explored just as values are systematically chosen for parameter values. We refer to these as configuration cases because they are test cases at one level but are different from the dynamic data used during product execution. A product line coverage hierarchy might look something like this:

– select configurations so that each variant at each variation point is included in some test configuration (see Section 5.3); obey any constraints that link variants
– select variants that appear together through some type of pattern even though there is no constraint linking them
– select variant values pair-wise so that all possible pairs of variant values are tested together (a small sketch of enumerating the required pairs is given below)
– select higher-order combinations.

Other coverage definitions can be given for specific types of artifacts. Kauppinen and others define coverage levels for frameworks. The coverage levels are defined in terms of the hook and template implementations in a framework [190]. The coverage is discussed for feature-level tests in terms of an aspect-oriented development approach.

5.3 Test Construction

In a product line organization, tests are executed many times:

– as the artifact they test is iteratively refined, and
– as that artifact is reused across products.
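Returning to the pair-wise level of the coverage hierarchy above, the following sketch only enumerates the variant-value pairs that a pair-wise-adequate set of configuration cases must cover; selecting the configurations themselves (and checking constraints between variants) is left to the product line tooling. The variation points and variant values are invented for the example.

import java.util.*;

// Enumerate the variant-value pairs that pair-wise coverage must exercise.
public class PairwiseObligations {
    public static void main(String[] args) {
        // Invented variation points and variants, purely for illustration.
        Map<String, List<String>> variationPoints = new LinkedHashMap<>();
        variationPoints.put("display", Arrays.asList("monochrome", "colour"));
        variationPoints.put("locale", Arrays.asList("en", "pt", "de"));
        variationPoints.put("edition", Arrays.asList("ME", "SE"));

        List<String> points = new ArrayList<>(variationPoints.keySet());
        int pairs = 0;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                for (String a : variationPoints.get(points.get(i))) {
                    for (String b : variationPoints.get(points.get(j))) {
                        System.out.println(points.get(i) + "=" + a
                                           + " with " + points.get(j) + "=" + b);
                        pairs++;
                    }
                }
            }
        }
        System.out.println(pairs + " pairs to cover");
    }
}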
The test assets must be constructed to meet these two requirements. This includes:

– technologies to automatically execute test cases, and
– technologies to automatically build configuration cases.

The technologies and models used to construct test assets should be closely related to the technologies and models used to construct the product assets. This coordination occurs during production planning. As test software is constructed, the design of the products is considered to determine the design of the test artifacts.

There are several types of variation mechanisms. Here is one high-level classification of mechanism types [113]:

– Parameterization: mechanisms here range from simply sending different primitive data values as parameters, to sending objects as parameters, to setting values in configuration files.
– Refinement: mechanisms include inheritance in object-oriented languages and specializers that use partial evaluation.
– Composition: container architectures in which components “live” in a container that provides services.
– Arbitrary transformation: transformations based on underlying meta-models that take a generic artifact and produce an asset.

In the following we discuss factors related to these mechanisms.

Binding time. The variation mechanisms differ in many ways, one of which is the time at which definitions are bound. There is a dynamic tension between earlier binding, which is easier to instrument and verify, and later binding, which provides greater flexibility in product structure. The choice of binding time affects how we choose to test. Mechanisms that are bound during design or at compile time can often be statically checked. Mechanisms that bind later must be checked at runtime, and perhaps even over multiple runs, since their characteristics may change. In an object-oriented environment, some mechanisms are bound to classes while others are bound to objects. This generally fits the static/dynamic split, but not always. It is possible to verify the presence of statically defined objects in languages such as Java. This is a particularly important issue since these static objects are often responsible for instantiating the remainder of the system under test.

Test architecture. A software product line organization is usually sufficiently large and long-lived that the tools and processes need to be coordinated through an architecture. The test architecture controls the overall design of the test environment. Several research and experience reports point to the fact that the design of the test environment should parallel the design of the product. Cummins Engines, a Software Product Line Hall of Fame member, has experience that reinforces the notion of the product architecture and the test architecture having a similar shape: “Tests must be designed for portability by leveraging the points of variation in the software as well as in the System Architecture” [346]. In fact, at the top level they view the test environment architecture as a natural part of the product line architecture.
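One way this parallel shape can show up at the class level is to mirror a refinement-based variation mechanism in the test assets, so that the test for a variant is bound at the same time as the variant itself. The sketch below is a hedged illustration only; the class names and checks are invented and are not taken from the Cummins report or from the scoreboard example discussed later.

// Product assets: a core asset refined for a particular product.
class ScoreDisplay {
    String render(int score) { return Integer.toString(score); }
}

class LocalizedScoreDisplay extends ScoreDisplay {
    @Override String render(int score) { return "Pontos: " + score; }
}

// Test assets mirror the refinement: the test for the variant inherits
// (and can extend) the checks written for the core asset.
class ScoreDisplayTest {
    ScoreDisplay createUnderTest() { return new ScoreDisplay(); }

    boolean rendersNonEmpty() { return !createUnderTest().render(0).isEmpty(); }
}

class LocalizedScoreDisplayTest extends ScoreDisplayTest {
    @Override ScoreDisplay createUnderTest() { return new LocalizedScoreDisplay(); }

    boolean rendersLocalizedPrefix() {
        return createUnderTest().render(3).startsWith("Pontos");
    }
}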
A product line test architecture must address some specific range of variability and the accompanying variety of binding times. If the range in the product line is very large, it may be reasonable to have multiple architectures, and this usually happens between test points. The architecture for unit testing usually needs access to the internals of components and will tie into the security model of the programming language. The architecture for a GUI tester will likewise tie into the windowing model. For the very latest binding mechanisms it is necessary to tie into the execution environment, such as the JVM for a Java program.

Aspect-oriented techniques. Aspect-oriented programming is one technology used to implement variation in core assets. An aspect is a representation of a crosscutting concern that is not part of the primary decomposition. Research has shown that for code that is defined in terms of aspects, the test software should also be defined in terms of aspects [197]. In that way, the test software is bound at the same time as the product software.

We do not give a tutorial on aspect-oriented programming here, but we discuss some characteristics of aspects that make them a good variation mechanism. An aspect is a design concern that cuts across the primary decomposition of a design. Canonical examples include how output or security is handled across a product. Aspect-oriented techniques do not require any hooks to exist in the non-aspect code before the aspects are added. The aspect definition contains semantic information about where it should be inserted into the code. An aspect weaver provides that insertion either statically or dynamically, depending upon the design.

Shown below is the paint method for a DigitalScoreBoard that is used in products built on the Java Micro Edition. The Micro Edition is for small devices such as cellphones. A different paint method is used when the base code for the scoreboard is used in the Java Standard Edition. The rest of the scoreboard code is common.

package coreAssets;

import javax.microedition.lcdui.Graphics;

public aspect DigitalScoreBoardA {
    private static final int color = 255