English Pages 184 Year 2002
Adaptive Systems in Drug Design
Gisbert Schneider, Ph.D. and Sung-Sau So, Ph.D.
LANDES BIOSCIENCE
BIOTECHNOLOGY INTELLIGENCE UNIT 5
Adaptive Systems in Drug Design Gisbert Schneider, Ph.D. Beilstein Professor of Cheminformatics Institute of Organic Chemistry Johann Wolfgang Goethe-University Frankfurt, Germany
Sung-Sau So, Ph.D. F. Hoffmann-La Roche, Inc. Discovery Chemistry Nutley, New Jersey, U.S.A.
LANDES BIOSCIENCE GEORGETOWN, TEXAS U.S.A.
EUREKAH.COM AUSTIN, TEXAS U.S.A.
ADAPTIVE SYSTEMS IN DRUG DESIGN Biotechnology Intelligence Unit Eurekah.com Landes Bioscience Designed by Jesse Kelly-Landes Copyright ©2003 Eurekah.com All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Printed in the U.S.A. Please address all inquiries to the Publishers: Eurekah.com / Landes Bioscience, 810 South Church Street Georgetown, Texas, U.S.A. 78626 Phone: 512/ 863 7762; FAX: 512/ 863 0081 www.Eurekah.com www.landesbioscience.com While the authors, editors and publisher believe that drug selection and dosage and the specifications and usage of equipment and devices, as set forth in this book, are in accord with current recommendations and practice at the time of publication, they make no warranty, expressed or implied, with respect to material described in this book. In view of the ongoing research, equipment development, changes in governmental regulations and the rapid accumulation of information relating to the biomedical sciences, the reader is urged to carefully review and evaluate the information provided herein.
ISBN: 1-58706-059-0 (hard cover) 1-58706-118-X (soft cover)
Cover Illustration: M.C. Escher's "Development I" ©2000 Cordon Art B.V., Baarn, Holland. All rights reserved. With kind permission of the copyright holder (B-00/484).
Library of Congress Cataloging-in-Publication Data Schneider, Gisbert, 1965- Adaptive systems in drug design / Gisbert Schneider, Sung-Sau So. p.; cm.--(Biotechnology intelligence unit; 5) Includes bibliographical references and index. ISBN 1-58706-059-0 (hardcover) -- ISBN 1-58706-118-X (softcover) 1. Drugs--Design. [DNLM: 1. Drug Design. QV 744 S358a 2001] I. So, Sung-Sau. II. Title. III. Series. RS420.S355 2001 615´.19--dc21 2001005044
Cover Illustration: The artwork "Development I" by M.C. Escher illustrates pattern formation that can be interpreted as resulting from an adaptive process. A similar situation is found in drug design: starting with only little knowledge about function-determining molecular features—the "twilight zone"—, structural patterns begin to emerge in subsequent synthesis-and-test cycles. Drug design and development form a complex adaptive system.
CONTENTS
Foreword
Preface
Synopsis
1. A Conceptual Framework
   Drug Design is An Optimization Task
   Principles of Adaptive Evolutionary Systems
   Solution Encoding and Basic Evolutionary Operators
   Molecular Feature Extraction and Artificial Neural Networks
   Conventional Supervised Training of Neural Networks
2. Analysis of Chemical Space
   A Virtual Screening Philosophy
   Logical Inference
   Chemical Compound Libraries
   Similarity Searching
   Feature Extraction Methods
3. Modeling Structure-Activity Relationships
   The Basic Idea
   Development of QSAR Models
   Application of Adaptive Methods in QSAR
   Comparison to Classical QSAR Methods
   Outlook and Concluding Remarks
4. Prediction of Drug-Like Properties
   The “Drug-Likeness” Concept
   Physicochemical Properties
   Bioavailability
   Toxicity
   Multidimensional Lead Optimization
5. Evolutionary De Novo Design
   Current Concepts in Computer-Based Molecular Design
   Evolutionary Structure Generators
   Peptide Design by “Simulated Molecular Evolution”
   TOPAS: Fragment-Based Design of Drug-Like Molecules
   Concluding Remarks
Epilogue (Paul Wrede)
Index
AUTHORS Gisbert Schneider, Ph.D. Beilstein Professor of Cheminformatics Institute of Organic Chemistry Johann Wolfgang Goethe-University Frankfurt, Germany
Sung-Sau So, Ph.D.
F. Hoffmann-La Roche, Inc. Discovery Chemistry Nutley, New Jersey, U.S.A.
CONTRIBUTORS
Paul Wrede, Professor
CallistoGen AG and Freie Universität Berlin, Fachbereich Humanmedizin, Institut für Molekularbiologie und Bioinformatik
Berlin, Germany
Epilogue

Martin Karplus, Professor
Department of Chemistry, Harvard University
Cambridge, Massachusetts, U.S.A.
Foreword
FOREWORD
The design of drugs is concerned with finding patentable chemical compounds that cure or reduce the effect of a disease by attacking pathogenic organisms, such as certain bacteria and viruses, or by supplementing required naturally occurring compounds, such as insulin. Drug design has been the primary objective of pharmaceutical companies for a long time. What has changed over the years is not the nature of the quest but the preferred way of achieving it. We focus on the first type of drug in what follows; for example, a ligand that binds to a target protein and has desirable properties in terms of bioavailability and toxicity. Since the structure of most targets was unknown until recently, finding a drug involved primarily screening of available small molecules, such as those in the company database, to obtain a “lead”, which was then optimized by medicinal chemists. Often 1000 or more compounds were synthesized to go from millimolar to nanomolar binding constants, usually by a lengthy iterative process. SAR and QSAR came to be used as ways of systematizing the information obtained from the syntheses and improving the effectiveness of the search. A new paradigm for drug design emerged when X-ray structures of targets began to be available in the 1980s. “Rational” drug design became “the” way of finding drugs, or more precisely, ligands that bound strongly. Essentially all large drug companies (and in those days there were only “large” drug companies) hired computational chemists to design drugs. They, as well as academics, developed many computer programs to aid in the design process over the next dozen years. However, the old ways did not die out, in part because medicinal chemists often did not really believe in structure-based methods. Also, structures were available for only a small fraction of the targets involved in the wide range of diseases that were under investigation.
In the 1990s, combinatorial chemistry emerged as the paradigm for drug design and the idea was to be able to synthesize so many compounds that rational design methods were no longer necessary. Of course, even the largest combinatorial libraries are small, in a real sense. They contain only a million or so compounds, which covers a negligible fraction of chemical space, estimated to be in the range of $10^{23}$ to $10^{30}$ compounds. The perception that rational methods were on their way out was strengthened by the fact that their early promise had not been fulfilled. Recently, the situation has changed again, and “high throughput screening” with robotic techniques has emerged as the new paradigm for drug design. This brief, somewhat impressionistic, history of drug design is presented to make clear that there are fashions in this important field and that they change rather rapidly. This is due in part to the fact that the way that a new paradigm is accepted in a drug company often does not depend on its scientific merit alone. Instead, the decision on “how to design drugs” is made, usually at relatively high levels of management, by people who may not have
a full understanding of either the technology or the science involved. This has resulted repeatedly in a new technique being adopted as “the” approach, with the near exclusion of all others, at least for a while. Today a more balanced view of drug design seems to be emerging. None of the available techniques is regarded as a panacea. It is being realized that a combination of methods is the best approach. For example, a combinatorial library is more likely to yield ligand candidates if the library has been designed by rational methods for a given target of known structure. Such an open approach will be most effective only if drug companies have scientists working at the state-of-the-art level in a range of areas contributing to drug design. The present volume will play an important role in the required educational process. The authors, Drs. Gisbert Schneider and Sung-Sau So, both working at Hoffmann-La Roche, have provided a definitive work in what they call “adaptive systems in drug design”. This involves the development of algorithms that learn from available data to give deeper insights into their meaning. Both Dr. So and Dr. Schneider are experts in this area of computational drug design and have worked in it for a number of years. Thus, it is not surprising that their book is a very worthwhile contribution. Most of the volume is concerned with artificial neural networks and evolutionary (genetic) algorithms, which are now widely used in drug design. An important element of the book is that it gives a framework for a fundamental understanding of such adaptive approaches. This is supplemented by descriptions of specific programs developed by the authors. Although the book does include introductory material, it will be most useful to scientists wishing to work at the forefront of this rapidly developing area.
Training in the techniques described in the present volume is all the more important as the range of possible targets is exploding with the availability of genomic data.
Martin Karplus
Cambridge
PREFACE
During the past years we have witnessed an increasing impact of high-throughput screening technology and combinatorial chemistry on the early drug discovery process. Driven by these events and by recent advances in computer sciences, several novel algorithms have been developed to support the purely experimental drug design approaches by computational chemistry. Some of the major concepts in this challenging and rapidly expanding field of research are presented and described in this book, with a strong focus on complex adaptive systems. Special emphasis is put on neural network applications and evolutionary algorithms, in particular those systems which we have investigated in more detail during the past years in both academic and pharmaceutical research. We are convinced that rationalizing the general concept of adaptation will be outstandingly valuable for the drug discovery process, in particular for its early phases, where we are often confronted with large collections of noisy data, little knowledge about structure-activity relationships, and sometimes unpredictable system behavior. This book neither contains unique solutions nor presents the method of choice. It is meant to complement existing textbooks about computational chemistry and bioinformatics, and to present some new challenging ideas. We have compiled a collection of methods and approaches which we expect to be useful for experienced medicinal and computational chemists as well as novices to the field. Many of our friends and colleagues contributed to this volume in various ways. Constructive criticism, experimental support, encouragement, and many exciting discussions helped us put it together.
We particularly thank (in alphabetical order) Geo Adam, Alexander Alanine, Konrad Bleicher, Hans-Joachim Böhm, Odile Clément, François Diederich, Martin Ebeling, Holger Fischer, Nader Fotouhi, David Fry, Paul Gerber, Paul Gillespie, Robert Goodnow, Frank Grams, Wolfgang Guba, Jeanmarie Guenot, Eva-Maria Gutknecht, Wolfgang Haehnel, Manfred Kansy, Achim Kramer, Nicole Kratochwil, Man-Ling Lee, Ruifeng Liu, Nora McDonald, Werner Neidhart, Edward Roberts, Mark Rogers-Evans, Olivier Roche, Gérard Schmid, Petra Schneider, Andrew Smellie, Venus So, Martin Stahl, Hongmao Sun, Jeff Tilley, and Jochen Zuegge. We are especially grateful to Mark, Alex, Paul and Nora, who did an outstanding job in proofreading the final draft of this book. Cynthia Dworaczyk from Landes Bioscience is thanked for their excellent support and editorial work. We would like to express our great pleasure at the fact that one of the founders of modern computational chemistry, Prof. Martin Karplus, wrote the Foreword to this volume. We are equally grateful to Prof. Paul Wrede for the Epilogue summarizing things to come, based on his experience of pioneering work in molecular biology and bioinformatics.
What do we expect as an outcome of this book? We do hope that it will provide an entry point for interested readers, and will ignite a constructive discussion about future directions in drug design. The methods and ideas presented are also thought to contribute to the ongoing dialogue about virtual screening concepts. We wish to stress that some of the approaches highlighted herein are still at an early stage (some of them are of a purely theoretical nature), and a final assessment of their value and potential contribution to drug discovery is not possible at the present time. It is clear that the field of chemoinformatics and drug design is enormously diverse. This book can only represent a personal view emerging from our research experience in the pharmaceutical industry, which is dominated by practical considerations. Nevertheless we tried to produce a book that surveys the fundamentals and points to likely developments on many fronts. While much effort was spent in the preparation of this manuscript, we know that there must be errors or omissions. We would greatly appreciate our readers bringing them to our attention. As a final note, regardless of the appeal of methods like neural networks and evolutionary algorithms and their potential applicability, we should always seek the simplest possible solutions to drug design problems. There are many extremely useful methods available now, and it is a challenge for the computational as well as medicinal chemist to choose the appropriate approach. Choose wisely!
Gisbert Schneider
Sung-Sau So
SYNOPSIS
Chapter 1. A Conceptual Framework
This Chapter introduces the use of adaptive systems, mainly neural networks and evolutionary algorithms, in the process of the design and optimization of drug candidates. It establishes a theoretical foundation that is critical to a better understanding of the subsequent Chapters. The principles of evolutionary algorithms and their variants are discussed. This covers different encoding schemes and proliferation strategies, i.e., genetic operators. Part of this Chapter is devoted to the use of artificial neural networks in molecular design, and in particular, structure-activity/property relationships. Basic neural network training algorithms and architectures are briefly reviewed.
Chapter 2. Analysis of Chemical Space
The concept of chemical space is discussed in this Chapter. The selection of relevant chemical space is arguably the most critical task in computer-aided molecular design. Its impact on virtual screening, chemical compound library design, structure-activity relationships, and molecular similarity calculation is discussed. The Chapter covers algorithms for classical unsupervised projection methods such as principal component analysis and Sammon mapping, and further nonlinear methodologies such as encoder networks and the self-organizing map. Numerous example applications are given, including the discrimination of drug and non-drug molecules, clustering of lead molecules with a specific type of biological activity (e.g., antidepressants) in combinatorial library design, and the characterization of ligand binding sites in protein structures.
Chapter 3. Modeling Structure-Activity Relationships
The use of evolutionary algorithms and artificial neural networks in quantitative structure-activity relationships is discussed. The Chapter begins with a detailed description of the various steps involved in the development of QSAR models. These include the generation of many classes of molecular descriptors, different descriptor selection strategies, and the use of linear and nonlinear feature mapping methods. Some common procedures for validating QSAR models are discussed. Real-world examples of the application of genetic algorithms and neural networks to variable selection are treated in the second half of this Chapter. The appropriateness of neural networks for model building is exemplified by the study of Andrea and Kalayeh. The next generation QSAR tool that incorporates both genetic algorithms and neural networks is also discussed. QSAR models that are derived from artificial intelligence methods are then compared with the traditional linear regression approach. The effects of chance correlation, overfitting, and overtraining are briefly mentioned. The Chapter concludes with the application of these QSAR models in molecular design, i.e., the inverse QSAR problem.
Chapter 4. Prediction of Drug-Like Properties
The drug-likeness concept has become very fashionable and is treated in Chapter 4. The Pfizer rule represents a simple set of heuristics that can help to identify molecules that have acceptable drug-like characteristics. In addition, three sophisticated neural network-based drug/non-drug discrimination systems are compared. Some practical implications for the utility of these scoring functions as a filter for virtual library screening are considered. In the second section, drug-like properties are treated as a composite property of different physicochemical or biological attributes. The use of adaptive methods in the prediction of aqueous solubility and logP in the literature is discussed. It is argued that in silico prediction of ADMET properties is very challenging due to a lack of reliable data. Examples of neural network applications to predicting bioavailability, human intestinal absorption, drug metabolism, CNS activity, toxicity, and carcinogenicity are described in detail.
Adaptive Systems in Drug Design, by Gisbert Schneider and Sung-Sau So. ©2003 Eurekah.com.
Chapter 5. Evolutionary De Novo Design
This Chapter is focused on the utility of evolutionary methods in de novo molecular design. Some common de novo design algorithms from commercial and academic sources are described and compared. Most of these methods are based on iterative addition of atoms or molecular fragments to existing scaffolds, using either 2D/3D similarity scores or the complementary fit to the receptor as fitness function to evaluate new designs. Two evolutionary design approaches, TOPAS and PepMaker, are introduced to demonstrate some general principles. TOPAS is an evolutionary structure generator that is ideal for “lead-hopping” of small molecules to provide suitable candidates in fast-follower projects. The program PepMaker is applicable for peptide design, and was the first successful hybrid application combining evolutionary design and a neural network-based scoring function.
Gisbert Schneider Sung-Sau So
CHAPTER 1
A Conceptual Framework
“It is no longer just sufficient to synthesise and test; experiments are played out in silico with prediction, classification, and visualisation being the necessary tools of medicinal chemistry.” (L. B. Kier)1
Drug Design is An Optimization Task
Pathological biochemical processes within a cell or an organism are usually caused by molecular interaction and recognition events. One major aim of drug discovery projects is to identify bioactive molecular structures that can be used to systematically interfere with such molecular processes and positively influence and eventually cure a disease. Both significant biological activity and specificity of action are important target functions for the molecular designer. In addition, one has to consider absorption, distribution, metabolism, and excretion (ADME) as well as potential toxic properties during the drug development process to ensure that the drug candidate will have the desired pharmacokinetic and pharmacodynamic properties. Usually the initial “hit”, i.e., a bioactive structure identified in a high-throughput screen (HTS) or any other biological test system, is turned into a “lead” structure. This often includes optimization/minimization towards ADME/Tox parameters. Further optimization is carried out as the compound enters the next phases of drug discovery, namely lead optimization and development, nonclinical and finally clinical trials (Fig. 1.1). Only if all of the required conditions are met can the molecule become a trade drug. The target function in drug design clearly is multidimensional.2 Typically several rounds of optimization must be performed to eventually obtain a clinical candidate. Before describing particular optimization strategies in more detail, we shall define what we mean by an “optimization problem” (Definition 1.1) (adapted from Papadimitriou and Steiglitz).3
Definition 1.1 An optimization problem is a set I of instances of an optimization problem. (Note: Informally, in an instance we are given the “input data” and have enough information to obtain a solution). An instance of an optimization problem is a pair (F, q), where F is the domain of feasible points, and q is the quality function. The problem is to find an f ∈ F for which

$$q(f) \geq q(y) \quad \text{for all } y \in F.$$

Such a point f is called the globally optimal solution to the given instance. •
Fig. 1.1. Early drug development stages.
The virtually infinite (judged from a practical, synthesis-oriented perspective) set of chemically feasible, “druglike” molecules has been estimated to be on the order of $10^{100}$ (a “googol”) different chemical structures.4,5 This inconceivably large number gives hope that there exists an appropriate molecule for almost every type of biological activity. The task is to find such molecules, and the molecular designer is confronted with this problem. As time and resources prohibit systematic synthesis and testing of each of these structures, and thus an exhaustive search, more intelligent search strategies are needed. Random search is one workable approach to solve the task. However, due to the extremely large search space, its a priori probability of success is very low. HTS is a practical example of this random approach to drug discovery.6,7 Several technological breakthroughs during the past years have validated this strategy. By means of current advanced ultra-HTS methods it is now possible to test several hundred thousand molecular structures within a few days.2,8 Typically, the hit rate with respect to low micromolar activity in these experiments is 0.1-2%.9,10 This means that for many drug targets hits are actually being identified by HTS. Why is that? Wouldn’t one expect hardly any hits at all? One possible explanation is that the compounds tested already contain a relatively large number of “biologically preferred”, druglike structures.2 Very often the corporate compound collections (“libraries”) used in the primary HTS have been shaped by many experienced medicinal chemists over several decades, and therefore inherently represent accumulated knowledge about biochemically meaningful structural elements. In addition, many of the “new” drug targets belong to some well-studied protein families (e.g., kinases, GPCRs). In other words, corporate compound collections and “historical” collections represent activity-enriched libraries compared to a truly random set. As a consequence the actual a posteriori probability of finding active molecules by random guessing is sometimes higher than expected.
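The argument about random screening and activity-enriched libraries can be made concrete with a toy instance (F, q) in the sense of Definition 1.1. The following Python sketch is purely illustrative: the integer-coded “molecules”, the quality function `q`, the hit threshold, and the notion of an “enriched” library drawn from a preferred region are all assumptions made for this example and are not taken from any screening data. Exhaustive search is possible here only because the toy space is tiny; the point of the sketch is the difference in hit rate between a uniform and a biased random library.

```python
import random

# Toy instance (F, q). F is a small set of integer-coded "molecules";
# q is an arbitrary quality function (both are assumptions for illustration).
F = range(10_000)

def q(f: int) -> float:
    """Hypothetical quality: highest at f = 7000, falling off linearly."""
    return -abs(f - 7000)

def exhaustive_search(domain, quality):
    """Return the member of the domain with the highest quality."""
    return max(domain, key=quality)

def hit_rate(library, quality, threshold):
    """Fraction of library members whose quality reaches the threshold."""
    hits = sum(1 for f in library if quality(f) >= threshold)
    return hits / len(library)

rng = random.Random(0)   # fixed seed so the run is reproducible
threshold = -50          # a "hit" lies within 50 units of the optimum

# Uniform random library drawn from all of F ("blind" screening).
uniform_library = [rng.randrange(10_000) for _ in range(2_000)]
# "Enriched" library: drawn only from a region assumed to be promising,
# standing in for a collection shaped by prior knowledge.
enriched_library = [rng.randrange(6_000, 8_000) for _ in range(2_000)]

best = exhaustive_search(F, q)
uniform_rate = hit_rate(uniform_library, q, threshold)
enriched_rate = hit_rate(enriched_library, q, threshold)

print("global optimum:", best)
print("hit rate, uniform library: ", uniform_rate)
print("hit rate, enriched library:", enriched_rate)
```

In this toy setting the enriched library yields a clearly higher hit rate than the uniform one, simply because its members are sampled from a smaller region that still contains all hits. Real compound collections are of course not biased in such a clean way.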
HTS provides an entry point to drug discovery, often the only possible one, in the absence of knowledge about a desired bioactive structure. There are additional complementing possibilities available through virtual screening techniques, and the field of virtual screening has been extensively reviewed over the past years.4,11,12 The design of activity-enriched sets of molecules that are specifically shaped for some desired quality represents one of the major challenges in rational drug design and virtual screening. In the hit-to-lead optimization phase following the primary HTS, smaller-sized compound libraries are generated and tested, typically containing between some ten and a few hundred molecules per optimization cycle. These compound collections are more and more focused towards a desired bioactivity, which is reflected by a steadily decreasing structural diversity yet simultaneously increasing hit rates at higher activity levels. In other words, molecular diversity is an adaptive (lat. ad-aptare, “to fit to”) parameter in drug design (Fig. 1.2).13,14 The cover illustration “Development I” by M.C. Escher illustrates pattern formation that can be interpreted as resulting from an adaptive process (see book cover). Starting with only little knowledge about molecular structures exhibiting a desired biological activity (the “twilight zone”), patterns begin to emerge in subsequent synthesis-and-test cycles. Knowledge about essential, function-determining structures grows with time as a result of an adaptive variation-selection process.
Fig. 1.2. Idealized course of adaptive library diversity with increasing knowledge in a drug discovery project.
The term “adaptive system” must be defined before starting on the survey of adaptation with a focus on molecular feature extraction, pattern formation, and simulated molecular evolution. Frank15 gave a comprehensive definition that highlights the main aspects discussed here (Definition 1.2):
Definition 1.2 “An adaptive system is a population of entities that satisfy three conditions of natural selection: the entities vary, they have continuity (heritability), and they differ in their success. The entities in the population can be genes, competing computer programs, or anything else that satisfies the three conditions.”15 •
Adaptation involves the progressive modification of some entity or structure so that it performs better in its environment. “Learning” is some form of adaptation for which performance consists of solving a problem. Koza16 summarized the central elements of artificial adaptive systems:
• Structures that undergo adaptation
• Initial structures (starting solutions)
• Fitness measure that evaluates the structures
• Operations to modify the structures
• State (memory) of the system at each stage
• Method for designating a result
• Method for terminating the process
• Parameters that control the process
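The eight elements listed above can be mapped onto a very small variation-selection loop. The Python sketch below is a minimal illustration under stated assumptions, not an algorithm taken from Koza or from the programs described in later Chapters: the “structures” are plain real numbers, the fitness landscape is a hypothetical single peak at x = 3, and the names `fitness`, `evolve`, `sigma` and all parameter values are invented for the example. The spread of the population is recorded in each generation as a crude stand-in for library diversity.

```python
import random

def fitness(x: float) -> float:
    """Fitness measure: hypothetical single-peak landscape, optimum at x = 3."""
    return -(x - 3.0) ** 2

def evolve(generations: int = 40, pop_size: int = 20, n_parents: int = 5,
           sigma: float = 1.0, seed: int = 1):
    """Minimal variation-selection loop over real-valued 'structures'."""
    rng = random.Random(seed)
    # Initial structures (starting solutions): random points in [-10, 10].
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    history = []  # state (memory): (best fitness, population spread) per generation
    for _ in range(generations):  # termination: fixed number of generations
        # Selection: the fittest structures become parents and are kept.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        # Variation: offspring are mutated copies of parents (heritability).
        offspring = [rng.choice(parents) + rng.gauss(0.0, sigma)
                     for _ in range(pop_size - n_parents)]
        population = parents + offspring
        spread = max(population) - min(population)
        history.append((fitness(parents[0]), spread))
        sigma *= 0.9  # control parameter: mutation width shrinks over time
    # Designating a result: the best structure of the final population.
    return max(population, key=fitness), history

best, history = evolve()
print("best structure found:", best)
print("spread in first and last generation:", history[0][1], history[-1][1])
```

Because the parents are carried over unchanged, the best fitness recorded in `history` cannot decrease from one generation to the next. Shrinking `sigma` by a fixed factor is only one of many possible control schemes and was chosen here for brevity.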
In the context of drug design, the “entities” mentioned in Definition 1.2 are small organic molecules; successful structures or substructural elements are kept and passed on in subsequent design rounds (“heritability”); and the compounds that are synthesized definitely differ in their success (biological activity, pharmacokinetic and pharmacodynamic properties). The cyclic interplay between synthesis and testing of compound libraries can be interpreted as a simplified evolutionary variation-selection process.13,17-20 In fact, this conceptual view helps to turn the more or less blind screening by HTS (also termed “irrational” design) into a guided search, and to perform systematic focussed screening. An important feature of adaptive systems is that they use “history”, i.e., they “learn” from experience and use this knowledge for future decision making. They are able to consider several conditions or target variables at a time (multidimensional optimization), which makes them parallel processing systems. Furthermore, their behavior is characterized by relative tolerance to noise or disturbance.21,22 Dracopoulos22 described adaptive systems from a more technically oriented, system control point of view, taking into account the work of Narendra and Annaswamy21 and Luenberger23 (Definition 1.3):
Definition 1.3 “An adaptive system is a system which is provided with a means of continuously monitoring its own performance in relation to a given figure of merit or optimal condition and a means of modifying its own parameters by a closed loop action so as to approach this optimum. Adaptive systems are inherently nonlinear.”22 •
One may regard drug discovery as driven by an adaptive process that is guided by a complex, nonlinear objective function (also called “fitness function” in the following Chapters).18 It is evident that the classical experimental synthesize-and-test approach to drug design is a practicable and sometimes very successful implementation of an adaptive system: A team of scientists decides to synthesize a set of molecular structures which they believe to be active based on a hypothesis that was generated from their knowledge about the particular project. Then, these compounds are tested for their biological activity in a biochemical assay. The results provide the basis for a new hypothesis, and the next set of molecules will be conceived taking these facts into account. Along the course of the optimization the molecular diversity will drop, and the quality of the compounds will increase. This drug discovery procedure has been successful in the past, so why should we bother about any additions or modifications? There are several obvious caveats and limitations to this classical, entirely experimental approach.2,24
a. The problem of large numbers. The number of molecules that can be examined by a human is limited; a scientist cannot compete with the capacity and throughput of virtual, computer-based storage, modeling, and screening systems;
b. The problem of multidimensional optimization. Often no real parallel optimization of bioactivity (e.g., as measured in a ligand binding assay) and ADME/Tox properties is performed;
c. The cost of experiments. Screening assays, in particular when they are used in HTS, can be very expensive. This is especially true when difficult biochemical assays or protein purification schemes are employed; also, blind screening can waste time and resources if the screening library is biased towards inactive structures or no hit is found (strictly speaking, this statement is not entirely correct, as much can be learned from “negative” examples, i.e., inactive structures).
Certainly many of the above limitations can be alleviated by a good experimental design during lead optimization. This book provides several examples of adaptive virtual screening systems that can help to overcome some of these problems and complement classical experimental drug discovery and HTS efforts. It must be stressed that virtual screening is meant to support and complement the experimental approach. Real-life chemical and biological bench experiments are always needed to validate theoretical suggestions, and they are an indispensable prerequisite for any model building and formulation of a hypothesis. The aim is to enhance the lead discovery process by rationalizing the library shaping process through adaptive, “self-learning” modeling of quantitative structure-activity relationships (QSAR) or structure-property
relationships (SPR) and molecular structure generation (de novo design, combinatorial design), thereby enabling focussed screening and data mining. The main objective with such “intelligent” control is to obtain acceptable performance characteristics over a wide range of uncertainty, computational complexity, and nonlinearity with many degrees of freedom.25,26 The presence of uncertainty and nonlinear behavior in lead optimization and throughout the whole drug development process is evident. For example, the data generated by HTS and ultra-HTS assays, which often provide the knowledge base for a first semi-quantitative SAR attempt, are inherently noisy and sometimes erroneous due to limitations of assay accuracy and experimental setups. One can rarely expect highly accurate predictions from models that are grounded on very fuzzy data. The response of a linear system to small changes in its parameters (or to external stimulation) is usually in direct proportion to the stimulation (“smooth” response). For nonlinear systems, however, a small change in the parameters can produce a large qualitative difference in the output. There are many observations of such behavior in drug design. For example, if chemical substructures are considered the free parameters of a lead optimization process, the presumably “small changes” made in the molecules shown in Figure 1.3 led to significant modification of their biological activity. Exchange of the phenolic hydroxy groups of isoproterenol 1 with chlorine results in the conversion of an agonist (β-adrenergic, isoproterenol) to an antagonist (β-adrenergic blocker, dichloro-isoproterenol 2).27 The closely related chemical structures 3, 4, and 5 reveal very different biological activity. The close structural relationships of, e.g., gluco- and corticosteroids also reflect the observation that apparently small structural variations can lead to significant loss of or change in bioactivity. 
Obviously in these cases essential parts of the function-determining pharmacophore pattern were altered, resulting in qualitatively different activities. Had other non-essential atom groups been changed, added or removed, the effect on bioactivity would likely have been much less dramatic. As we do not know a priori which parts of a molecule are crucial determinants of bioactivity, we tend to believe that any “small” change of structure will only slightly affect molecular function (the active analog similarity principle). This way of thinking is often not appropriate. These examples demonstrate that there probably is no generally valid definition of a “small” structural change of a molecule. The definition and quantitative description of chemical similarity is a critical issue in medicinal chemistry.27 This definition is context-dependent, i.e., it depends on the particular structural class of the molecule to be optimized, the drug target (receptor, enzyme, macromolecule etc.) or the particular property of interest, and the assay system. Our knowledge about the underlying SAR or SPR drives the design process at a given time, and it is important to constantly reformulate the prediction model or hypothesis during the process to achieve convergence. At the start of a drug discovery project one hardly knows which substructural groups of a molecule are crucial for a desired bioactivity, simply because our knowledge base is extremely small or nonexistent. Virtual screening is intended to provide the medicinal chemist with ideas, assist in molecular feature extraction, and cope with large amounts of real structural and assay data and virtual data (predicted properties and virtual compound structures). The aim is to design a robust drug optimization process that is able to cope with uncertainty, computational complexity including multidimensional objective functions, and nonlinearity. 
To do so, a QSAR hypothesis generator (inference system), a molecular structure generator, and a test system are needed (Fig. 1.4). In a feedback loop, QSAR hypotheses are derived from a knowledge base, e.g., experimental assay results or predicted activities. Then new molecular structures are generated based on the QSAR model either by de novo design, by combinatorial assembly, or by systematically changing already existing structures. Finally, activity measurements or predictions of these new structures establish new facts, thereby augmenting the knowledge base. For an adaptive system it is important to use the error between
Fig. 1.3. Seemingly “minor” chemical differences can have a dramatic effect on biological activity (adapted from ref. 27). These examples illustrate the difficulty of defining and quantitatively describing molecular similarity.
the observed and expected behavior (e.g., biological activity) to tune the system parameters (feedback control). Otherwise no adaptation will take place. In this volume emphasis is put on two kinds of virtual adaptive systems that have proven their usefulness for drug design: evolutionary algorithms for molecular optimization, and artificial neural networks (ANN) for flexibly modeling structure-activity relationships. These techniques seem to be methods of choice especially when the receptor structure or an active site model of the drug target is unavailable, and thus more conventional computer-based molecular design approaches cannot be applied.28-30 This is the case for over 50% of the current drug targets, since the majority of drug targets belong to the family of integral membrane proteins (Fig. 1.5). To date only a few high-resolution structures of such proteins are available, but hopefully novel crystallization and structure determination methods will lead to progress in this field in the near future.31-33 Today, the most promising (and often the only possible) approach to rational drug design in these cases is ligand optimization based on small molecule structures and assay data as the only sources of information.
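The feedback loop of inference system, structure generator, and test system (Fig. 1.4) can be illustrated with a deliberately small Python sketch. Everything in it is a toy assumption made for illustration: a "molecule" is reduced to a single descriptor value in [0, 1], the hypothetical `assay` function stands in for a biochemical test, and the "hypothesis" is simply the best descriptor value found so far. It is not the QSAR machinery discussed in this book.

```python
import random

def assay(x):
    """Hypothetical test system: hidden activity, peaked at x = 0.7."""
    return -(x - 0.7) ** 2

def infer_hypothesis(knowledge):
    """Toy inference system: the best descriptor value seen so far."""
    best_x, _ = max(knowledge, key=lambda pair: pair[1])
    return best_x

def generate_structures(hypothesis, n, spread, rng):
    """Toy structure generator: propose candidates near the hypothesis."""
    return [min(1.0, max(0.0, rng.gauss(hypothesis, spread))) for _ in range(n)]

def design_cycle(n_cycles=20, n_candidates=5, spread=0.1, seed=0):
    """Run the cycle: derive hypothesis, generate, test, augment knowledge."""
    rng = random.Random(seed)
    x0 = 0.1
    knowledge = [(x0, assay(x0))]  # some initial knowledge is required
    for _ in range(n_cycles):
        hypothesis = infer_hypothesis(knowledge)
        candidates = generate_structures(hypothesis, n_candidates, spread, rng)
        # Test results establish new facts and augment the knowledge base.
        knowledge.extend((x, assay(x)) for x in candidates)
    return knowledge

knowledge_base = design_cycle()
best_x, best_activity = max(knowledge_base, key=lambda pair: pair[1])
```

Each pass through the loop feeds measured activities back into the knowledge base, so the next hypothesis depends on the outcome of the previous round. Removing that feedback (for example, by always generating candidates around the starting value) would leave a blind screen rather than an adaptive system.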
Principles of Adaptive Evolutionary Systems
A big challenge recurs in all adaptive systems, namely to balance exploration versus exploitation,34 i.e., the balance between costly exploration for improved efficiency and the comparatively cheap exploitation of already known solutions.15 Adaptive design is a process of discovery that follows cycles of broad exploration and efficient exploitation. Greater variability of
Fig. 1.4. A general drug design cycle. Any of the three main parts of the cycle (the inference system, the molecule generator, and the test system) can be real persons or systems or artificial computer-based implementations. Some initial knowledge is required to enter the cycle.
molecular structures in a library improves the chances that some of the molecules will have properties that match unpredictable challenges (which may be caused, e.g., by the nonlinear nature of a multimodal and multidimensional fitness function). In an entirely experimental drug design approach the price for broad exploration is high because many different chemical structures must be synthesized and tested. In contrast, the combinatorial chemistry approach aiming at the variation of successful schemes provides one possibility to exploit or fine-tune an already successful solution. In artificial, computer-based drug design both of these two extremes can be efficiently managed. In particular, the exploration of chemical space can be much broader by considering all kinds of virtual “drug-like” structures, and the exploitation of successful solutions by systematic variation of a structural framework can be extremely efficient following a modular building-block approach (see Chapter 5 for details). Again, a central task is to operate at the right level of structural diversity at the different stages of molecular design (Fig. 1.2).18
Fig. 1.5. Overview of current drug targets (adapted from ref. 124). More than 50% belong to protein classes (membrane proteins) for which structure determination by X-ray or NMR methods is extremely complicated and only a few solved structures exist today.
Adaptive optimization cycles pursued in drug design projects can be mimicked by evolutionary algorithms (EA), in particular evolution strategies (ES)35-37 and genetic algorithms (GA).34,38-41 These algorithms perform a local stochastic search in parameter space, e.g., a chemical space. During the search process, good “building blocks” of solutions are discovered and emphasized. The idea here is that good solutions to a problem tend to be made up of good building blocks, which are combinations of variable values that confer higher “fitness” on the solutions in which they are present. Building blocks are often referred to as “schemata” according to the original notation of Holland,34 or as “hyperplanes” according to Goldberg’s notation.38 In the past decades many families of heuristics for optimization problems have been developed, typically inspired by metaphors from physics or biology, which follow the idea of local search (sometimes these approaches are referred to as “new age algorithms”).3,42 Among the best known are simulated annealing,43 Boltzmann machines,44 and Tabu search.45,46 There are of course many additional mathematical optimization techniques known (for an overview, see ref. 3).

Why should EAs be particularly appropriate for drug design purposes? To answer this question, let us consider the following two citations, where the first one is taken from Frank,15 and the second from Globus and coworkers:47
a. “A simple genetic algorithm often performs reasonably well for a wide range of problems. However, for any specific case, specially tailored algorithms can often outperform a basic genetic algorithm. The tradeoffs are the familiar ones of general exploration versus exploitation of specific information.”48,49
b. “The key point in deciding whether or not to use genetic algorithms for a particular problem centers around the question: what is the space to be searched? If that space is well understood and contains structure that can be exploited by special-purpose search techniques, the use of genetic algorithms is generally computationally less efficient. If the space to be searched is not so well understood, and relatively unstructured, and if an effective GA representation of that space can be developed, then GAs provide a surprisingly powerful search technique for large, complex spaces.”50,51
These two excerpts address two critical points for any optimization or control system: first, the structure of the search space (parameter space), and second, the knowledge that is available about this space. If not much is known about the structure of the search space, and the fitness landscape is rugged with possible local discontinuities, then an EA mimicking natural adaptation strategies can still be a practicable optimization method.52 The molecular designer faces exactly this situation: optimization in the presence of noise and nonlinear system behavior.53 As soon as more information about the (local) search space becomes available (e.g., from a high-resolution X-ray structure of the receptor protein that can guide the molecular design process, or from sets of active and inactive molecules from which a preliminary SAR can be derived), more specialized optimization strategies can be applied that may, for example, rely on an analytical description of the error surface. There is no general best optimization or search technique. In a smooth fitness landscape advanced steepest descent or gradient search represents a reasonable choice to find the global optimum, and several variations of this technique have been described to facilitate navigation in a landscape with a few local optima.54 Typically, virtual screening is confronted with a high-dimensional search space and convoluted fitness functions that easily lead to a rugged fitness landscape. Robust approximation algorithms must be employed to find practicable (optimal or near-optimal) solutions here.55,56 Stochastic search methods sometimes provide answers that are provably close to optimal in these cases. 
In fact, several systematic investigations demonstrated that a simple stochastic search technique might be comparable or sometimes superior to sophisticated gradient techniques, depending on the structure of fitness landscape or the associated error surface.57,58 In short, EAs represent a general method for optimization in large, complex spaces. The fact that EAs are straightforward to implement and easily extended to further problems has certainly added to their widespread use, popularity, and proliferation. The authors do not intend to review the whole field of evolutionary algorithms. Rather it is intended to highlight some of their specific algorithmic features that are particularly suited for molecular design applications. For a detailed overview of EA theory, additional applications in molecular design, and related approaches like evolutionary programming (EP) and combinatorial optimization, see the literature.3,12,20,52,59,60 EAs operate by generating distributions (“populations”) of possible solutions (“offspring”) from a “parent” parameter set, subjecting every “child” to fitness determination and a subsequent selection procedure. The cycle begins again with the selected new parent(s). A schematic of this general algorithm is given in Fig. 1.6. Offspring are bred by means of two central mechanisms: mutation and recombination (crossover). EAs differ largely in the degree to which each mutation and crossover operator is used to generate offspring and the types of so-called “strategy-parameters” employed. These are additional adaptive variables which can be used to shape the offspring distributions (vide infra). The classical ES is dominated by mutation events to generate offspring—thereby adding new information to a population and tending to explore search space—whereas the classical GA favors crossover operations to breed offspring— thus promoting the exploitation of a population. 
During the past decades the individual advantages of these approaches were identified, and the current advanced implementations are usually referred to collectively as “evolutionary algorithms” to indicate this fact. Historically, Rechenberg35,61 introduced the original ES idea over 40 years ago, following ideas of Bremermann.62 It was further developed by Schwefel.36,63 The term “genetic algorithm” appeared soon after the publication of Holland’s pioneering text on Adaptation in Natural and Artificial Systems in 1975.34 It is important to keep in mind that EAs perform a local stochastic search in parameter space, in contrast to a global search. Finding a globally optimal solution to an instance of some optimization problems can be prohibitively difficult. Nevertheless, it is often possible to find a feasible solution f that is best in the sense that there is no better solution in its neighborhood N(f), i.e., it is locally optimal with respect to N. N(f) defines a set of points that are “close” in
Fig. 1.6. General scheme of an evolutionary algorithm.
some sense to the point f. (Note: If $F = \mathbb{R}^n$ for a given optimization problem with instances (F, q), the set of points within a fixed Euclidean distance provides a natural neighborhood.) The term “locally optimal” shall be defined as given by Definition 1.4 (according to ref. 3):
Definition 1.4 Given an instance (F, q) of an optimization problem and a neighborhood N, a feasible solution f ∈ F is called locally optimal with respect to N (or simply locally optimal whenever N is understood by context) if $q(f) \le q(g)$ for all $g \in N(f)$, where q is the quality or fitness function.3 •

A particular example of a one-dimensional optimization task (F, q) is sketched in Fig. 1.7a. Here, the solution space F is a subset of the real line. Obviously the function q has three local optima A, B, and C, but only B is globally optimal. This can be more formally described: Let the neighborhood N be defined by closeness in Euclidean distance for some ε > 0, $N_\varepsilon(f) = \{x : x \in F \text{ and } |x - f| \le \varepsilon\}$. Then if ε is suitably small, the points A, B, and C are locally optimal. In
Fig. 1.7. a) A one-dimensional Euclidean optimization problem. The fitness function q has three local optima in A, B, and C; b) Schematic representation of ordinary local search with a fixed neighborhood N (adapted from ref. 3).
ordinary local search, a defined neighborhood is searched for local optima (Fig. 1.7b). A strong neighborhood seems to produce local optima whose average quality is largely independent of the quality of the starting solutions, whereas weak neighborhoods seem to produce local optima whose quality is strongly correlated with that of the starts.3 The particular choice of N may critically depend on the structure of F in many combinatorial optimization problems. A particularly useful aspect of some evolutionary algorithms is that they implement variable choices of N. A further issue is that of how to search the neighborhood. The two extremes are first-improvement, in which a favorable change is accepted as soon as it is found, and steepest descent, where the entire neighborhood is searched. The great advantage of first-improvement is that local optima can generally be found faster. For further details on local search methods and variations of this scheme, see the literature.3,54 An early description of local search addressing some aspects of natural selection can be found in ref. 64. A schematic of some popular neighborhood search methods is shown in Fig. 1.8.
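The notions of a neighborhood $N_\varepsilon(f)$, of local optimality, and of first-improvement versus steepest descent can be made concrete with a short Python sketch. It is an illustration under stated assumptions only: the solution space is a hypothetical set of 21 integer points, the landscape values are invented so that three local minima exist (loosely in the spirit of Fig. 1.7a), and q is minimized.

```python
def neighborhood(f, points, eps):
    """N_eps(f): all feasible points within distance eps of f (f excluded)."""
    return [x for x in points if x != f and abs(x - f) <= eps]

def local_search(q, points, start, eps, first_improvement=True):
    """Move to a better neighbor until none exists (minimizing q)."""
    current = start
    while True:
        better = [x for x in neighborhood(current, points, eps)
                  if q(x) < q(current)]
        if not better:
            return current  # locally optimal with respect to N_eps
        # First-improvement takes the first better neighbor that is found;
        # steepest descent inspects the whole neighborhood and takes the best.
        current = better[0] if first_improvement else min(better, key=q)

# Hypothetical landscape with local minima at x = 2, 9, and 16 (global: 9).
VALUES = [5, 4, 3, 4, 5, 6, 5, 3, 1, 0, 1, 3, 5, 6, 5, 4, 2, 4, 6, 7, 8]
POINTS = list(range(len(VALUES)))

def q(x):
    return VALUES[x]

narrow_from_left = local_search(q, POINTS, start=0, eps=1)    # stops at 2
narrow_from_right = local_search(q, POINTS, start=20, eps=1)  # stops at 16
wide_from_left = local_search(q, POINTS, start=0, eps=20)     # reaches 9
```

With the narrow neighborhood (ε = 1) the result depends on the starting point, whereas the wide neighborhood (ε = 20) reaches the global optimum from either end, at the price of inspecting many more points per step.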
Solution Encoding and Basic Evolutionary Operators
The shape of the fitness surface in parameter space can be modified by choosing an appropriate encoding of the problem solutions. In the case of drug design the “solutions” are molecular structures exhibiting some desired properties or biological functions. This means that without an appropriate, context- and problem-dependent representation of molecules no systematic optimization can be performed. With an inappropriate representation, the performance of an optimization process can be even worse than pure random searching. Therefore, the implementation of any EA begins with the important encoding step. The traditional GAs employ binary encodings, whereas the classical ESs use real number encodings. It is often not trivial to find an appropriate encoding scheme so that an acceptably smooth fitness landscape results that obeys the Principle of Strong Causality18,35 (also known as the Similarity Principle in medicinal chemistry).65 This means that a step in parameter space should lead to a proportional change of fitness, which is identical to the demand for a smooth fitness landscape with respect to the neighborhood N. Actually, this task represents the greatest challenge in rational molecular design. Because developing an appropriate encoding scheme can itself be regarded as a difficult optimization problem, EAs can be applied to solve this task, as detailed in Chapter 3. For scientific problems real number encodings are common, as they generally offer a natural representation of the problem, which is also true for molecular optimization. Many drug design tasks, however, can also be regarded as combinatorial optimization tasks, and clever binary encoding schemes of molecular building blocks (e.g., amino acids, building blocks in combinatorial
Fig. 1.8. Schematic of neighborhood search methods (adapted from ref. 78).
chemistry) can be very appropriate. Usually, the particular encoded representation of a molecule (“solution representation”) is called the genotype, which determines the decoded phenotype (e.g., biological activity). An example of binary and real-valued encodings of a molecular structure is given in Fig. 1.9. In this example, five molecular properties were selected to provide the solution or parameter space: the calculated octanol/water partition coefficient (clogP), calculated molecular weight (cMW), the number of potential hydrogen-bond donors and acceptors (H-donors), the total number of nitrogen and oxygen atoms (O+N), and the polar surface area (Surfpolar). Some of these properties have been shown to be useful descriptors for predictions of the potential oral bioavailability of drug candidates.66,67 A possible genotype representation contains five “genes”, i.e., either five real-valued numbers or five three-bit words, respectively. For the binary representation three bits per gene were chosen to encode $2^3 = 8$ different intervals for each property descriptor (Table 1.1). The offspring solutions for a new generation are generated by genotype variations, mainly by mutation and crossover operations. The conventional evolution strategies use mutation as their primary operator and crossover as a secondary operator to breed offspring, whereas genetic algorithms use crossover as the primary operator and mutation as a secondary operator. In addition to mutation and crossover, the EA community has developed many additional operators for breeding offspring from one or many parents. These operators are inspired by observations in natural genetics, e.g., insertion, deletion, and death. It should be stressed that EAs do not attempt to provide realistic models of natural evolution. Rather, they mimic a Darwinian-like evolution process with the aim of solving technical optimization problems. One such application is evolutionary drug design.
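The standard binary encoding of a five-gene genotype can be sketched in a few lines of Python. The sketch assumes that each property has already been standardized to [0, 1] and is mapped to one of eight equal intervals as in Table 1.1. The function names and the standardized example values are hypothetical; the values were chosen only so that the result coincides with the 15-bit parent string used in the mutation example of the next section.

```python
def to_interval(x, bits=3):
    """Index of the interval of [0, 1] containing x (2**bits equal intervals)."""
    n = 2 ** bits
    return min(int(x * n), n - 1)  # x == 1.0 falls into the last interval

def encode(values, bits=3):
    """Concatenate one standard-binary word per standardized property value."""
    return "".join(format(to_interval(x, bits), f"0{bits}b") for x in values)

def decode(genotype, bits=3):
    """Return the interval midpoints encoded by a bitstring genotype."""
    n = 2 ** bits
    words = [genotype[i:i + bits] for i in range(0, len(genotype), bits)]
    return [(int(word, 2) + 0.5) / n for word in words]

# Hypothetical standardized values for the five property "genes".
standardized = [0.55, 0.40, 0.60, 0.15, 0.30]
genotype = encode(standardized)   # '100011100001010'
midpoints = decode(genotype)      # [0.5625, 0.4375, 0.5625, 0.1875, 0.3125]
```

Decoding returns interval midpoints rather than the original values, which makes the loss of resolution explicit: with three bits per gene, all property values within the same interval share one genotype.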
The Mutation Operator
By mutation, individual bits of a binary representation are switched from 1 to 0 or from 0 to 1, or, in the real-valued case, a value is changed by a random amount, to generate offspring vectors $\xi_{\mathrm{parent}} \to \xi_{\mathrm{child}}$. For example, the real-valued mutation $(5.19, 404, 5, 2, 59.9)_{\mathrm{parent}} \to (5.19, 615, 5, 2, 59.9)_{\mathrm{child}}$ corresponds to the binary transition $(100011100001010)_{\mathrm{parent}} \to (100100100001010)_{\mathrm{child}}$. The degree of mutation, i.e., how many positions of a representation are changed, determines the amount of new information present in a population. Therefore, mutation is essential for exploration of solution space. As we have discussed above, the appropriate degree of exploration must be allowed to change during an optimization experiment. ESs apply a simple but powerful adaptive control mechanism for mutation by the introduction of one or several so-called strategy parameters.35,68,69 The general
Fig. 1.9. Examples of binary and real number encoding of a molecular structure; a) real-valued representation, b) standard binary representation, c) binary Gray code representation.
idea is to give an accompanying variable $\sigma_i$ to each of the real-valued vectors of the candidate solutions $\xi_i$ comprising a population. Offspring solutions are generated by using Gaussian mutation of the parental values (Eq. 1.1), where the strategy variable $\sigma_i$ denotes the standard deviation (the width of the Gaussian), and g is a Gaussian-distributed random number (Eq. 1.2):

$$\xi_i^{\mathrm{child}} = \xi_i^{\mathrm{parent}} + \sigma_i \, g \qquad \text{(Eq. 1.1)}$$

where

$$g \sim \mathcal{N}(0, 1) \qquad \text{(Eq. 1.2)}$$

Variation of the standard deviation used to generate offspring distributions is one possibility to determine the size of a move in solution space and is thus sometimes referred to as adaptive (or mutative) stepsize control. The parameter σ determines the neighborhood N that is considered. In contrast to ordinary local search with a fixed width of N (Fig. 1.7b), the width of N is an additional adaptive variable in an ES. Basic ESs rely on mutation as the only genetic operator. A simple “one parent, many offspring” ES, which is termed “(1,λ) ES” according to Schwefel’s notation,36 can be formulated as follows using a pseudo-code notation (see Fig. 1.6):

    initialize parent, (Sp, σp, Qp)
    repeat for each generation until termination criterion is satisfied:
        generate λ variants of the parent, (SV, σV, QV)
        select best variant structure, (Sbest, σbest, Qbest)
        set (Sp, σp, Qp) = (Sbest, σbest, Qbest)
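The pseudo-code above can be turned into a small runnable Python sketch. Two points are assumptions made for this illustration and are not taken from the text: the step size of each variant is obtained by log-normal variation of the parental σ (one common way to implement mutative stepsize control), and the quality function is a hypothetical negative sphere function that is maximized.

```python
import math
import random

def one_comma_lambda_es(quality, parent, sigma, n_offspring=10,
                        n_generations=100, seed=0):
    """Maximize `quality` with a (1,lambda) ES and mutative stepsize control."""
    rng = random.Random(seed)
    s_p, sigma_p = list(parent), sigma
    q_p = quality(s_p)
    for _ in range(n_generations):
        variants = []
        for _ in range(n_offspring):
            # Assumption: log-normal variation of the inherited step size.
            sigma_v = sigma_p * math.exp(rng.gauss(0.0, 0.5))
            # Gaussian mutation of the parental values (Eq. 1.1).
            s_v = [x + sigma_v * rng.gauss(0.0, 1.0) for x in s_p]
            variants.append((quality(s_v), sigma_v, s_v))
        # Selection among the offspring only: the parent dies out, so the
        # best variant becomes the new parent even if it is worse.
        q_p, sigma_p, s_p = max(variants, key=lambda v: v[0])
    return s_p, sigma_p, q_p

def sphere(s):
    """Hypothetical quality function with a single optimum at the origin."""
    return -sum(x * x for x in s)

solution, final_sigma, final_quality = one_comma_lambda_es(
    sphere, parent=[5.0, 5.0], sigma=1.0)
```

Because each variant carries its own σ and the best variant passes both its solution and its σ to the next generation, step sizes that produced successful moves are inherited. No decay schedule for σ is specified anywhere in the code.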
Table 1.1. Three-bit codings of standardized property values (x in [0,1]) using a standard binary coding system and a Gray coding system

Standardized Property Value    Standard Binary Code Value    Gray Code Value
[0,0.125[                      000                           000
[0.125,0.25[                   001                           001
[0.25,0.375[                   010                           011
[0.375,0.5[                    011                           010
[0.5,0.625[                    100                           110
[0.625,0.75[                   101                           111
[0.75,0.875[                   110                           101
[0.875,1]                      111                           100
According to this scheme, selection of the best is performed among the offspring only, i.e., the parent dies out. This characteristic helps to navigate in a multimodal fitness landscape because it sometimes allows the search to escape from a local optimum on which a parent resides.57,70 As given by Equation 1.1, the offspring are approximately Gaussian-distributed around the parent. This enables a local stochastic search to be performed and guarantees that there is a finite probability of breeding very dissimilar candidate solutions (“snoopers” in search space). The width of the variant distribution is determined by the standard deviation, σ, of the bell-shaped curve reflecting the distance-to-parent probability. The straightforward (1,λ) ES tends to develop large σ values if large steps in search space are favorable, and adopts small values if, for example, fine-tuning of solutions is required. The σ value can therefore be regarded as a measure of library diversity. Small values lead to narrow distributions of variants, whereas large values result in the generation of novel structures that are very dissimilar to the parent molecule as judged by a chemical similarity or distance measure. Several such applications are described in Chapter 5. At the beginning of a virtual screening experiment using ES, σ should be large to explore chemical space for potential optima; later in the design experiment σ will have to be small to facilitate local hill-climbing. This means that σ itself is subjected to optimization and will be propagated from generation to generation, thereby enabling an adaptive stochastic search to be carried out. In contrast to related techniques like simulated annealing (SA), there exists no predefined “cooling schedule” or fixed decay function for σ. [Note: there are some new SA variants that have “adaptive re-heating”; see Chapter 3]. Its value may freely change during optimization to adapt to the actual requirements of the search. 
Large σ values will automatically evolve if large steps are favorable, small values will result when only small moves lead to success in the fitness landscape. If there is an underlying ordering of local optima this strategy can sometimes provide a simple means to perform “peak-hopping” towards the global optimum.35 Some variations of this adaptive mutation scheme found in ESs have been published.34,71 A more recent addition to the idea are “learning-rule methods”.72-75 Adaptive mutation schemes have also been reported for GAs.74,76,77 For example, in Kitano’s scheme the probability of mutation of an offspring depends on the number of mismatches (Hamming distance) between the two parents [Note: GAs need at least two parents to breed offspring, whereas ESs can operate with one parent. The most popular is the (1,λ) ES].76 Low distance results in high mutation, and vice versa. In this way the diversity of individuals in the population can be controlled. In contrast to the strategy parameter σ in ESs, however, this
scheme does not consider the history of the search, i.e., it does not “learn” from previously successful choices of σ values (stepsize control). Dynamic control of the impact of mutations on a candidate solution is relatively easily done for real-coded representations but not for standard binary encoding. One approach to overcome this limitation of standard binary encoding is to apply a Gray coding scheme (Table 1.1).78 Gray coding allows single mutations of the genotype to have smaller impact on the phenotype, because here adjacent values differ only by a single bit (neighborhood in space). This can help to fine-tune solutions and avoid unwanted jumps between candidate solutions during optimization. Alphanumeric coding is another possibility to represent molecular structures, which is especially suited for searching combinatorial chemistry spaces with a GA.79
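The difference between the two three-bit codings of Table 1.1 can be checked with a short Python sketch. It uses the standard reflected Gray code, which reproduces the Gray column of the table; the helper names are chosen for this illustration.

```python
def binary_to_gray(n):
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Invert the Gray coding by folding the shifted bits back in."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def hamming(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

# Gray words for the eight intervals, in the order of Table 1.1.
gray_words = [format(binary_to_gray(i), "03b") for i in range(8)]

# Bits to flip when moving from one interval to the adjacent one.
binary_steps = [hamming(i, i + 1) for i in range(7)]
gray_steps = [hamming(binary_to_gray(i), binary_to_gray(i + 1))
              for i in range(7)]
```

In standard binary coding the step from interval 011 to interval 100 requires all three bits to change, whereas every step between adjacent intervals in the Gray code is a single bit flip. This is the sense in which a single mutation can have a smaller impact on the phenotype under Gray coding.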
The Crossover Operator
Crossover is a convenient technique for exploitation of candidate solutions present in a population. It is the main operator for offspring generation by GAs. Consider the two parent “chromosomes” 010011011101 and 000011110100. By crossover, strings of bits are exchanged between these two chromosomes to form the offspring: 010011111101 and 000011010100. Crossover points are usually determined randomly. It must be emphasized that crossover operations, in contrast to mutation events, do not add new information to a population’s genotype pool. Therefore, a GA always implements an additional operator for mutation to ensure that innovative solutions can evolve. Without mutation a GA can only reshuffle the bit values that are already present in a given population of bitstrings. Advanced ESs (so-called “higher-order” ESs) also contain crossover between parents as an additional breeding operator that complements mutation.35 A straightforward GA could be formulated as follows (adapted from ref. 80):

    // Initialize a usually random population of individuals (genotypes)
    init population P
    // Evaluate fitness (quality) of all individuals of the population
    for all i: Qi = Quality(Pi)
    // iterate over generations
    repeat for each generation until a termination criterion is satisfied:
        // select a sub-population (parents) for breeding offspring
        Pparent = SelectParents(P)
        // generate offspring
        Poffspring = recombine Pparent
        Poffspring = mutate Poffspring
        // determine the quality of the offspring chromosomes
        for all i: Qi = Quality(Pioffspring)
        // select survivors to form the new starting population
        P = SelectSurvivors(Poffspring, Q)

In evaluating a population of strings, the GA is implicitly estimating the average fitness of all building blocks (“schemata”) of solutions that are present in the population. Their individual representation is increased or decreased according to the Schema Theorem34,38 (for a detailed discussion, see ref. 77). 
The driving force behind this process is the crossover operator. The cover illustration by M.C. Escher nicely illustrates the evolution of useful building blocks of the final solution. Simultaneous implicit evaluation of large numbers of schemata in a population is known as the implicit parallelism of genetic algorithms.34 This property of EAs, especially GAs, has proven valuable for molecular feature selection and molecular optimization, as we shall see in the following Chapters.29,81,82
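The crossover example and the GA scheme of this section can be sketched in Python as follows. Several choices are assumptions made for this illustration and are not prescribed by the text: two-point crossover (one way to reproduce the offspring strings given above), truncation selection of the better half as parents, a per-bit mutation rate, and a hypothetical quality function that simply counts the 1-bits.

```python
import random

def two_point_crossover(parent_a, parent_b, start, end):
    """Exchange the substring [start:end] between two equal-length bitstrings."""
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(str(1 - int(b)) if rng.random() < rate else b for b in bits)

def simple_ga(quality, n_bits=12, pop_size=20, n_generations=30,
              mutation_rate=0.02, seed=0):
    """Minimal GA: select parents, recombine, mutate, replace the population."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        # Assumption: the better half of the population become parents.
        parents = sorted(population, key=quality, reverse=True)[:pop_size // 2]
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.sample(parents, 2)
            start, end = sorted(rng.sample(range(n_bits + 1), 2))
            for child in two_point_crossover(a, b, start, end):
                offspring.append(mutate(child, mutation_rate, rng))
        # Survivors are taken from the offspring only.
        population = offspring[:pop_size]
    return max(population, key=quality)

# The example from the text: exchanging bits 5 to 8 of the two chromosomes.
children = two_point_crossover("010011011101", "000011110100", 4, 8)

def count_ones(bits):
    """Hypothetical quality function: number of 1-bits in the chromosome."""
    return bits.count("1")

best = simple_ga(count_ones)
```

Note that wherever both parents carry the same bit value, both children carry it too, whatever the crossover points. This is why a mutation step is applied to the offspring in the loop above.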
The Selection Operator
The effect of selection is to gradually bias the sampling procedure towards instances of schemata, and therefore candidate solutions, whose fitness is estimated to be above average.77 EAs implement parent selection mechanisms to pick those individuals of a population that are meant to generate new offspring. The selection process is guided by the fitness values, which might be calculated or experimentally determined. The (1,λ) ES uses plain “selection of the best”, i.e., only the “fittest” individual survives and becomes the parent for the next generation (elitism). In its more general form, the (µ,λ) ES, the µ fittest are selected. GAs usually implement stochastic selection methods. The most popular is roulette wheel selection, where members of the population are selected with a probability proportional to their fitness. In this way, an individual with a low fitness can become a parent for the next generation, although with a lower probability than high-fitness individuals. A major problem with such a selection operator can be premature convergence of the search if single individuals with extremely high fitness values dominate a population. In such cases rank-based selection can be advantageous. A third commonly used selection mechanism is tournament selection, where two or more members of a population are randomly picked, and the best of these is deemed a new parent.83,84

As we have seen, several strategic EA parameters must be controlled to obtain good solutions and avoid premature convergence on local optima or plateaus in the fitness landscape. These parameters also include the optimization time (number of generations), population size, the ratio between crossover and mutation, and the number of parents. There seems to be no generally best set of parameter values, although some rules of thumb have been suggested for classes of problems.35,37,85 In general, GAs have much larger population sizes (recombination dominates) than ESs (mutation dominates).
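The three stochastic selection schemes mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the linear rank weighting (worst = 1, best = n), the tournament size, and the toy population with one dominating individual are assumptions made for the example.

```python
import random

def roulette_wheel(population, fitness, rng):
    """Pick one individual with probability proportional to its (non-negative) fitness."""
    r = rng.uniform(0.0, sum(fitness))
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if r < acc:
            return individual
    return population[-1]  # guard against floating-point round-off

def rank_based(population, fitness, rng):
    """Roulette wheel on ranks (worst = 1, best = n) instead of raw fitness values."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    ranks = [0] * len(population)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return roulette_wheel(population, ranks, rng)

def tournament(population, fitness, rng, k=2):
    """Draw k distinct members at random and return the fittest of them."""
    contestants = rng.sample(range(len(population)), k)
    return population[max(contestants, key=lambda i: fitness[i])]

rng = random.Random(42)
pop = ["A", "B", "C", "D"]
fit = [1.0, 2.0, 3.0, 94.0]  # "D" dominates the population
for select in (roulette_wheel, rank_based, tournament):
    picks = [select(pop, fit, rng) for _ in range(10000)]
    print(select.__name__, {p: picks.count(p) for p in pop})
```

With these toy fitness values, roulette wheel selection picks "D" with probability 0.94, whereas the rank-based variate reduces this to 4/10, which illustrates why rank-based selection can counteract premature convergence.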
An example of the dependency between the number of generations and the number of offspring in a simple (1,λ) ES that was used to find the maximum of a multimodal fitness function (in the example a modified Rastrigin function was used, Eq. 1.3) is given in Fig. 1.10. In a situation of limited resources (in terms of CPU calls) and a rugged fitness landscape like the one depicted in Fig. 1.10a, extreme numbers of offspring seem to be disadvantageous.85 Provided this observation also holds in principle for experimental drug design, one might speculate that in a situation of limited resources (e.g., manpower, assay material, synthesis capacity) a high-throughput approach is not always appropriate.
Eq. 1.3

It is evident that a given instance of an optimization problem dictates the appropriate values. However, in drug design tasks we usually do not know the complexity of the optimization problem and the structure of search space in advance, so we must rely on empirical procedures. This also implies that re-running the EA can easily result in different solutions, especially if the underlying fitness landscape is rugged (multimodal). The results of EA search or optimization runs should therefore not be regarded as the overall best or globally optimal. They represent candidate solutions that have a higher fitness than average. EAs certainly suffer in performance comparisons with algorithms that were optimized for particular problems. Nevertheless they have proven their usefulness, mainly due to their ability to cope with challenging real-life search and optimization problems, their straightforward implementation and easy extension to additional problems. Several diverse applications of EAs in drug design are described in the following Chapters of this book.
Fig. 1.10. a) A one-dimensional multimodal fitness landscape based on a modified Rastrigin function (K = 5), b) estimated dependency between the resources that are available for optimization (CPU time) and the optimal number of offspring (λopt) to find the global optimum. A (1,λ) ES was used.
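An experiment of the kind summarized in Fig. 1.10 can be sketched as follows. Because the modified Rastrigin function of Eq. 1.3 is not reproduced here, the sketch uses the standard one-dimensional Rastrigin function as a stand-in; the fixed mutation step size `sigma`, the search interval, and the evaluation budget are likewise assumptions for illustration, so the numbers it prints are not those of the figure.

```python
import math
import random

def rastrigin(x, a=10.0):
    """Standard one-dimensional Rastrigin function (global minimum 0 at x = 0)."""
    return a + x * x - a * math.cos(2.0 * math.pi * x)

def fitness(x):
    """Fitness to be maximized: the negated Rastrigin value."""
    return -rastrigin(x)

def one_comma_lambda_es(n_offspring, budget, sigma=0.5, seed=0):
    """(1,lambda) ES under a fixed budget of fitness evaluations.

    In every generation the single parent produces `n_offspring` Gaussian
    mutants; the best mutant becomes the new parent even if it is worse than
    the old one (comma selection). Returns the best fitness seen in the run.
    """
    rng = random.Random(seed)
    parent = rng.uniform(-5.0, 5.0)
    best = fitness(parent)
    evaluations = 1
    while evaluations + n_offspring <= budget:
        offspring = [parent + rng.gauss(0.0, sigma) for _ in range(n_offspring)]
        scores = [fitness(x) for x in offspring]
        evaluations += n_offspring
        top = max(range(n_offspring), key=lambda i: scores[i])
        parent = offspring[top]
        best = max(best, scores[top])
    return best

# Same budget of 1000 fitness evaluations, different numbers of offspring
for lam in (2, 10, 50, 250):
    runs = [one_comma_lambda_es(lam, budget=1000, seed=s) for s in range(20)]
    print(lam, sum(runs) / len(runs))
```

With a fixed budget, a larger λ leaves fewer generations for the search, which is the trade-off behind the optimal number of offspring discussed in the text.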
Molecular Feature Extraction and Artificial Neural Networks
In the previous section we have briefly introduced evolutionary algorithms as one possible framework to address the problem of adaptive molecular design. In this section we focus on the question of how to calculate the quality or fitness values which drive the EA selection process. The aim of a quality function is to structure the search space of possible solutions into regions of different fitness. A coarse-grained quality function might simply lead to classes of solutions (e.g., active and inactive molecules), and a more complex function introduces gradually different quality levels to the search space. The quality of the model determines the success rate of the multi-dimensional design process.

How can we develop a good SAR model? It is apparent that no universal recipe exists; nevertheless some general rules of thumb can be given. One approach is to consider the task as a pattern recognition problem, where three main aspects must be considered: i) the data used for generation of a SAR/SPR hypothesis should be representative of the particular problem; ii) the way molecular structures are described for model generation, and its level of abstraction, must allow for a reasonable solution of the pattern recognition task; iii) the model must permit non-linear relationships to be formulated, since the interdependence between molecular properties and structural entities is generally non-linear. The first point seems to be trivial, but selection of representative data for hypothesis generation is very difficult and often hampered by a lack of data or by populations of data points characterized by abrupt decreases in density functions.53,86 Classification of molecule patterns is based on selected molecular attributes or descriptors (features), which are considered for the classification task: patterns→features→classification.
The aim is to combine pattern descriptors (“input variables”) so that a smaller number of new variables, the features, is found. There are straightforward statistical methods available that can be used to construct linear and non-linear combinations of features which have good discriminating power between classes.87 These methods are complemented by techniques originating from cognition theory, among them various artificial neural networks (ANN) and “artificial intelligence” (AI) approaches.88,89 According to this concept the feature selection and extraction process, i.e., the formulation of an SAR/SPR model, can be regarded as an adaptive process and formally divided into several steps:90
Step 1. Define a set of basic object descriptors (vocabulary); go to Step 2
Step 2. Formulate “tentative” features (hypothesis) by association of basic descriptors, focussing, and filtering; go to Step 3
Step 3. Evaluate the descriptors for their classification ability (quality or fitness of the features); go to Step 4
Step 4. If a useful solution is found then STOP, or else go to Step 2
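The four steps can be illustrated by a deliberately simple Python sketch. Everything specific in it is an assumption made for the example: descriptors are Boolean, a tentative feature is a conjunction of descriptors, the quality of a feature is its classification accuracy, and the descriptor names and the six toy objects are hypothetical.

```python
from itertools import combinations

def accuracy(feature, objects, labels):
    """Fraction of objects classified correctly when an object is assigned to
    Class A exactly if all descriptors of `feature` are present."""
    hits = sum(all(obj[d] for d in feature) == (lab == "A")
               for obj, lab in zip(objects, labels))
    return hits / len(objects)

def extract_feature(objects, labels, max_order=3, target=1.0):
    vocabulary = sorted(objects[0])                      # Step 1: basic descriptors
    best, best_acc = None, 0.0
    for order in range(1, max_order + 1):                # later loops: higher-order features
        for feature in combinations(vocabulary, order):  # Step 2: tentative feature
            acc = accuracy(feature, objects, labels)     # Step 3: evaluate its quality
            if acc > best_acc:
                best, best_acc = feature, acc
            if best_acc >= target:                       # Step 4: useful solution, stop
                return best, best_acc
    return best, best_acc

# Hypothetical toy data: three objects of Class A, three of Class B
objects = [
    {"aromatic": 1, "charged": 0, "donor": 1},
    {"aromatic": 1, "charged": 1, "donor": 1},
    {"aromatic": 1, "charged": 0, "donor": 1},
    {"aromatic": 1, "charged": 0, "donor": 0},
    {"aromatic": 0, "charged": 1, "donor": 1},
    {"aromatic": 0, "charged": 0, "donor": 0},
]
labels = ["A", "A", "A", "B", "B", "B"]
print(extract_feature(objects, labels))
# → (('aromatic', 'donor'), 1.0)
```

In this toy case no single descriptor separates the classes (the best reach an accuracy of 5/6), so the loop moves on to second-order combinations and stops at the conjunction of two descriptors.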
Usually the features proposed during the early feedback loops are relatively rigid, i.e., sets of raw descriptors and simple combinations of descriptors. During the later stages the features tend to become more flexible and include more complex (“higher-order”) combinations of descriptors. It is important that the feature extraction process is able to focus on certain descriptors and filter out other descriptors or feature elements.81 Fig. 1.11 gives an example of a “Bongard-problem” for visual pattern recognition to illustrate this rather crude scheme of adaptive feature extraction. The task is to find a feature that allows for discrimination between Class A and Class B. Six objects belong to each class. These might be six active and six inactive molecules.

Various types of artificial neural networks (ANN) are of considerable value for pharmaceutical research.91 Main tasks performed by these systems are:
• feature extraction,
• function estimation and non-linear modeling,
• classification, and
• prediction.
For many applications alternative techniques exist.92 ANN provide, however, an often more flexible approach offering unique solutions to these tasks. The paradigms offered by ANN lie somewhere between purely empirical and ab initio approaches. Neural networks
• “learn” from examples and acquire their own “knowledge”,93
• are sometimes able to find generalizing solutions,94
• provide flexible non-linear models of input/output relationships,95 and
• can cope with noisy data and are fault-tolerant.96
ANN have found a widespread use for classification tasks and function approximation in many fields of medicinal chemistry and cheminformatics (Table 1.2). For these kinds of data analysis mainly two different types of networks are employed, “supervised” neural networks (SNN) and “unsupervised” neural networks (UNN). The main applications of SNN are function approximation, classification, pattern recognition and feature extraction, and prediction tasks. These networks require a set of molecular compounds with known activities to model structure-activity relationships. In an optimization procedure, which will be described below, these known “target activities” serve as a reference for SAR modeling. This principle coined the term “supervised” networks. Correspondingly, “unsupervised” networks can be applied to classification and feature extraction tasks even without prior knowledge of molecular activities or properties. A brief introduction to UNN development is provided in Chapter 2. Hybrid systems have also been constructed and successfully applied to various pattern recognition and SAR/SPR modeling tasks.97-99 In Table 1.2 the main network types and characteristic applications in the life sciences are listed. The description of neural networks in this volume is restricted to the most widely applied in the field of molecular design. More extensive introductions to the theory of neural computation including comparisons with statistical methods are given elsewhere.87,100-103 There are many texts on neural networks covering parts of the subject
Fig. 1.11. A visual classification task. Which feature separates Class A from Class B? This pattern recognition problem is analogous to one of the classical “Bongard-problems” (adapted problem no. 22).125
not discussed here, e.g., the relation of neural network approaches to machine learning,104-106 fuzzy classifiers,96 time series analysis,107 and additional aspects of pattern recognition.108 Ripley provides an in depth treatment of the relation of neural network techniques to Bayesian statistics and Bayesian training methods.87 A commented collection of pioneering publications in neurocomputing has been compiled by Anderson and Rosenfeld.109 Two texts have become available covering both theory and applications of ANN in chemistry, QSAR, and drug design.97,110 Frequently molecular libraries and assay data must be investigated for homogeneity or “diversity”.111-113 UNN are able to automatically recognize characteristic data features and classes of data in a given set based on a rational representation of the compounds and a sensitive similarity measure. The selection of an appropriate measure of similarity is crucial for the clustering results.114-119 UNN systems can be helpful in generating an overview of the distributions of the data sets, and they are, therefore, suited for a first examination of a given data set or raw data analysis. Some types of UNN are able to form a comprehensive model of the data distribution which can be helpful for the identification and an understanding of predominant data features, e.g., molecular structures or properties that are responsible for a certain biological activity. In contrast to UNN, the structuring of a data set must be already known to apply SNN, e.g., knowledge about bioactivity. SNN are able to approximate arbitrary relationships between data points and their functional or classification values, i.e., they can be used to model all kinds of input-output relationships and classify data by establishing a quality function. 
The term “supervised” indicates that, in contrast to UNN, one or more corresponding functional values must already be known for every data point (these might be experimentally measured molecule activities or properties). Artificial neural networks consist of two elements: i) formal neurons, and ii) connections between the neurons. Neurons are arranged in layers, where at least two layers of neurons (an input layer and an output layer) are required for construction of a neural network. Formal neurons transform a numerical input to an output value, and the neuron connections represent numerical weight values. The weights and the neurons’ internal variables (termed bias or threshold values) are free variables of the system which must be determined in the so-called “training phase” of network development. Selected network types and appropriate training algorithms are discussed in the following Chapters. For details on other network architectures and ANN concepts, see the literature.87,101,103

Table 1.2. Neural network types with a high application potential in pharmaceutical research (adapted from ref. 126)

Network Type / Architecture                     Main Applications

SUPERVISED
Multilayer feed-forward (bp)                    non-linear modeling of (Q)SAR, prediction of molecule activity and structure, pattern recognition, classification, signal filtering, noise reduction, feature extraction
Recurrent networks                              sequence and time series analysis
Encoder networks (ReNDeR)                       data compression, factor analysis, feature extraction
Learning vector quantization                    auto-associative recall, data compression

UNSUPERVISED
Kohonen self-organizing map                     clustering, data compression, visualization
Hopfield networks                               auto-associative recall, optimization
Bidirectional associative memory (BAM)          pattern storage and recall (hetero-association)
Adaptive resonance theory (ART) models          clustering, pattern recognition

HYBRID
Counterpropagation networks                     function approximation, prediction, pattern recognition
Radial basis function (RBF) networks            function approximation, prediction, clustering
Adaptive fuzzy systems                          similar to ART and bp-networks

The idea of adaptation of a three-layered network structure to a given SAR/SPR problem is shown in Figure 1.12. This scheme is intended to illustrate the main idea of “GA-networks”, which will be treated in detail in Chapter 3.29,81,120,121 The main advantage of the system is an intelligent selection of neural network input variables by a genetic algorithm. Rather than feeding in all available data descriptors, a set of meaningful variables is automatically selected. The GA selects the most relevant variables, and the neural network provides a model-free non-linear mapping device.
Fig. 1.12. Neural network learning fantasy of SAR/SPR modeling (adapted from ref. 15). Formal neurons are drawn as circles, and lines between neurons represent connection weights. Input neurons receive molecular patterns and represent molecular descriptors or attributes. The hidden layer neurons and their associated connection weights process the network stimuli; in a flexible system their connectivity adapts during the learning process, thereby extracting and storing the SAR/SPR-relevant features. The values calculated by the output neurons might be molecular activities, properties, etc.
Independent of the choice of network type and architecture, the crucial parts of an analysis are appropriate data selection, description and pre-processing. Like any other analysis tool, neural networks can only be successful if a solution to a given problem can be found on the basis of the data used. Although this statement seems to be trivial, in many real-life applications it can be very difficult to collect representative data and define a comprehensive and useful data description. Sometimes techniques like PCA, data smoothing or filtering can be used prior to network application to facilitate network training (see Chapter 2). In cases where the data description is high-dimensional per se, it can be helpful to focus only on a particular sub-space (however, this requires knowledge about essential data features) or to perform a PCA step to reduce the number of dimensions to be considered without significant loss of information. Hybrid network architectures consisting of a pre-processing layer, an unsupervised layer for coarse-grained data clustering, and a supervised layer for fine analysis have already been applied successfully to a variety of tasks and seem to provide very useful techniques for (Q)SAR modeling and molecular design.

Both unsupervised and supervised networks have proven their usefulness and applicability to a number of different problems. Nevertheless, it requires significant expertise to apply them effectively and efficiently. For most potential applications of neural networks to drug design conventional techniques exist, and neural networks should be considered as complementary.122,123 However, ANN can sometimes provide a simpler, more elegant, and sometimes even superior solution to these tasks. Of special interest is a combination of evolutionary algorithms for descriptor selection and ANN for function approximation. This combination seems to provide a very useful general approach to QSAR modeling. In connection with other methods, the many types of ANN provide a flexible modular framework that can help speed up the drug development process and lead to molecules with desired structures and activities.
Conventional Supervised Training of Neural Networks
Neural networks with a multi-layered feed-forward architecture have found widespread application as function estimators and classifiers. They follow the principle of convoluting simple non-linear functions for approximation of complicated input-output relationships (Kolmogorov’s Theorem, ref. 127). Since mainly sigmoidal functions are employed in such ANN, the following brief description of supervised network training will be restricted to this network type. We limit the discussion of SNN to feed-forward networks here because of their dominating role in drug design. Other network architectures, e.g., recurrent networks for time-series and sequence analysis, are not considered. A scheme of a fully connected three-layered feed-forward network is shown in Fig. 1.13. Input layer neurons receive the pattern vector (molecular descriptor set). They are often referred to as “fan-out units” because they distribute the data vector to the next network layer without any calculation being performed. In the majority of supervised ANN, the hidden layer consists of “sigmoidal neurons”. A sigmoidal neuron computes an output value according to Equations 1.4 and 1.5:
Eq. 1.4
Eq. 1.5
Here w is the weight vector connected to the neuron, x is the neuron’s input signal, and ϑ is the neuron’s bias or threshold value. If a single sigmoidal output neuron is used, the overall function represented by the fully connected two-layered feed-forward network shown in Fig. 1.13a will be (Eq. 1.6):
Eq. 1.6
where x represents the input vector (input pattern). The network shown in Fig. 1.13b with three sigmoidal hidden units and a single sigmoidal output unit represents a more complicated function (Eq. 1.7):
Eq. 1.7
where w are the input-to-hidden weights, v are the hidden-to-output weights, ϑ are the hidden layer bias values, and ϑout is the output neuron’s bias. Even more complicated functions can be represented by adding additional layers of neurons to the network. It has been shown that at most two hidden layers with non-linear neurons are required to approximate arbitrary continuous functions.95,128,129 Depending on the application and the accuracy of the fit, the required number of layers and the number of neurons in a layer can vary. Baum and Haussler130 addressed the question of what network size can be expected to generalize from a given number of training examples. A “generalizing” solution derived from the analysis of a limited set of training data will be valid for any further data point, thus leading to perfect predictions. This is possible only by the use of data that is representative of the problem. Most solutions found by feature extraction from real-life data sets will, however, be sub-optimal in this general meaning (see Chapter 3 for a more detailed discussion of this issue). Determination of a useful architecture by testing the performance of an ANN with independent test and validation data during and after SNN training is essential. If the test performance is poor, either the network architecture must be altered, or the data used for training is inadequate, i.e., the problem might be ill-posed. Furthermore, by monitoring test performance the generalization ability can be
estimated during the training process and training can be stopped before over-fitting occurs (“forced stop”; see Chapter 3).

Fig. 1.13. Architecture of two fully-connected, feed-forward networks. Formal neurons are drawn as circles, and lines symbolize the connection weights. The flow of information through the networks is from top to bottom. The input vector (pattern vector) is five-dimensional in this example. White circles: fan-out neurons; black circles: sigmoidal neurons. a) Perceptron; b) conventional three-layered feed-forward network.

During supervised ANN training the weights and threshold values of the network are determined, which is, again, an optimization procedure. For training, a set of data vectors plus one or more functional values (“target values”) per data point are required. The optimization task is to adjust the free variables of the system in such a way that the network computes the desired output values. The goal during the training phase is to minimize the difference between the actual output and the desired output; a frequently used error function is the sum-of-squares error as given in Equation 1.8 (here, N is the number of data points (“patterns”) in the training set):
Eq. 1.8
This error function is a special case of a general error function called the Minkowski-R error (Eq. 1.9):
Eq. 1.9
where yk is the actual network output, tk is the desired (target) output vector, x is the input vector, and w is the weight vector. The summation over k will be omitted if only a single output neuron is used. With R = 2 the Minkowski-R error reduces to the sum-of-squares error function, whereas R = 1 leads to the computation of the conditional median of the data rather than the conditional mean as for R = 2. The use of an R value smaller than 2 can help reduce the sensitivity to outliers, which are frequently present in a training set, especially if experimental data are used.
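Since the equations themselves are not reproduced here, the following Python sketch may help to make the symbols concrete. It rests on assumptions that should be checked against the original equations: the net input of a neuron is taken as the weighted sum of its inputs minus the threshold ϑ, the sigmoidal function is taken to be the logistic function 1/(1 + exp(−net)), and the Minkowski-R error is taken as the sum over all patterns of |y − t|^R for a single output neuron. The weights, patterns and targets are hypothetical.

```python
import math

def sigmoid(net):
    """Logistic function (assumed form of the sigmoidal transfer function)."""
    return 1.0 / (1.0 + math.exp(-net))

def neuron(w, x, theta):
    """Sigmoidal neuron: weighted input sum minus threshold, then the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def network(x, w_hidden, theta_hidden, v, theta_out):
    """Three-layered network with one sigmoidal output unit (cf. Fig. 1.13b)."""
    hidden = [neuron(w, x, th) for w, th in zip(w_hidden, theta_hidden)]
    return neuron(v, hidden, theta_out)

def minkowski_r_error(outputs, targets, r=2.0):
    """Sum over all patterns of |y - t|**R for a single output neuron."""
    return sum(abs(y - t) ** r for y, t in zip(outputs, targets))

# Hypothetical 2-3-1 network and three training patterns
w_hidden = [[0.5, -0.3], [0.8, 0.1], [-0.6, 0.9]]  # input-to-hidden weights w
theta_hidden = [0.1, -0.2, 0.0]                    # hidden layer thresholds
v = [1.0, -1.5, 0.7]                               # hidden-to-output weights v
theta_out = 0.05                                   # output neuron threshold
patterns = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0]

outputs = [network(x, w_hidden, theta_hidden, v, theta_out) for x in patterns]
print("R = 2:", minkowski_r_error(outputs, targets, 2.0))
print("R = 1:", minkowski_r_error(outputs, targets, 1.0))

# Effect of a single outlier among the residuals
residuals = [0.1, 0.1, 2.0]
zeros = [0.0, 0.0, 0.0]
print(minkowski_r_error(residuals, zeros, 2.0), minkowski_r_error(residuals, zeros, 1.0))
```

For the residuals 0.1, 0.1 and 2.0 the outlier contributes 4.0 of 4.02 to the error for R = 2, but only 2.0 of 2.2 for R = 1, which illustrates the reduced outlier sensitivity mentioned above.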
Several optimization algorithms can be used, e.g., gradient descent techniques, simulated annealing procedures, maximum likelihood and Bayesian learning techniques, or evolutionary algorithms performing some kind of adaptive random search.87,103 Most frequently applied are gradient techniques using the “back-propagation of errors” (bp) algorithm for weight update.100 Gradient descent follows the steepest descent on the error surface (Eq. 1.10):
Eq. 1.10
For network training each weight value is incrementally updated according to Equation 1.11:
Eq. 1.11
Here γ is a constant defining the step size of each iteration along the downhill gradient direction. If the weights of a two-layered network (Fig. 1.13a) must be updated, we make these weight changes individually for each input pattern xµ in turn, and obtain the “delta rule” (also termed “Adaline rule” or “Widrow-Hoff rule”),100,131 as given in Equation 1.12:
Eq. 1.12
Using a continuously differentiable activation function for the hidden units of a multilayered network, it is an easy step to calculate the update values (“deltas”) for hidden unit weights following this bp procedure. A thorough description of this algorithm can be found in many textbooks on neural networks. Several straightforward modifications of bp have been described.101 As with all optimization methods, gradient techniques can have problems with premature convergence due to local energy barriers or plateaus of the error surface.132 This will not be a problem if the network is small enough to establish a comparatively low-dimensional error surface lacking striking local optima. However, the classical bp algorithm might easily get trapped in a local optimum when large networks must be trained. Especially for complicated networks, adaptive evolutionary optimization can be useful.68,85,99 Besides considerations concerning network architecture, considerable effort must be spent in selecting an appropriate ANN training technique.
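A pattern-by-pattern gradient descent update for the two-layered network of Fig. 1.13a can be sketched in Python as follows. The sketch is an illustration under assumptions, not a transcription of Eqs. 1.10 to 1.12: it uses the per-pattern error E = (t − y)²/2, a logistic output neuron with net input w·x − ϑ, and therefore the sigmoid form of the delta rule, Δw_i = γ (t − y) y (1 − y) x_i. The learning task (logical OR), the step size and the number of epochs are hypothetical choices.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def predict(w, theta, x):
    """Output of a single sigmoidal neuron with weights w and threshold theta."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def train_delta_rule(patterns, targets, gamma=0.5, epochs=2000, seed=0):
    """Gradient descent with one weight update per training pattern."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in patterns[0]]
    theta = 0.0
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            y = predict(w, theta, x)
            delta = (t - y) * y * (1.0 - y)  # -dE/dnet for E = (t - y)**2 / 2
            w = [wi + gamma * delta * xi for wi, xi in zip(w, x)]
            theta -= gamma * delta           # the net input depends on -theta
    return w, theta

# Logical OR is linearly separable, so a single sigmoidal neuron can learn it
patterns = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 1.0]
w, theta = train_delta_rule(patterns, targets)
for x, t in zip(patterns, targets):
    print(x, t, round(predict(w, theta, x), 3))
```

For hidden units the same update applies with `delta` replaced by the error signal propagated back from the following layer, which is the step that the bp algorithm adds.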
References
1. Kier LB. Book review: Neural Networks in QSAR and Drug Design. J Med Chem 1997; 40:2967.
2. Eglen R, Schneider G, Böhm HJ. High-throughput screening and virtual screening: Entry points to drug discovery. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:1-14.
3. Papadimitriou CH, Steiglitz K. Combinatorial Optimization: Algorithms and Complexity. Mineola, New York: Dover Publications, 1998.
4. Walters WP, Stahl MT, Murcko MA. Virtual screening: An overview. Drug Discovery Today 1998; 3:160-178.
5. Walters WP, Ajay, Murcko MA. Recognizing molecules with drug-like properties. Curr Opin Chem Biol 1999; 3:384-387.
6. Bevan P, Ryder H, Shaw I. Identifying small-molecule lead compounds: The screening approach to drug discovery. Trends Biotechnol 1995; 13:115-121.
7. Grepin C, Pernelle C. High-throughput screening. Drug Discovery Today 2000; 5:212-214.
8. MORPACE Pharma Group Ltd. From data to drugs: Strategies for benefiting from the new drug discovery technologies. Scrip Reports 1999; July.
9. Drews J. Rethinking the role of high-throughput screening in drug research. Decision Resources. Intern Biomedicine Management Partners Orbimed Advisors 1999.
10. Zhang JH, Chung TD, Oldenburg KR. Confirmation of primary active substances from high throughput screening of chemical and biological populations: A statistical approach and practical considerations. J Comb Chem 2000; 2:258-265.
11. Bures MG, Martin YC. Computational methods in molecular diversity and combinatorial chemistry. Curr Opin Chem Biol 1998; 2:376-380.
12. Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000.
13. Schneider G, Wrede P. Optimizing amino acid sequences by simulated molecular evolution. In: Jesshope C, Jossifov V, Wilhelmi W, eds. Parallel Computing and Cellular Automata. Berlin: Akademie-Verlag; Mathem Res 1993; 81:335-346.
14. Karan C, Miller BL. Dynamic diversity in drug discovery: Putting small-molecule evolution to work. Drug Discovery Today 2000; 5:67-75.
15. Frank SA. The design of natural and artificial adaptive systems. In: Rose MR, Lauder GV, eds. Adaptation. San Diego: Academic Press, 1996:451-505.
16. Koza JR. The genetic programming paradigm: Genetically breeding populations of computer programs to solve problems. In: Soucek B and the IRIS Group, eds. Dynamic, Genetic, and Chaotic Programming. New York: John Wiley and Sons, 1992:203-321.
17. Schneider G, Schuchhardt J, Wrede P. Simulated molecular evolution and artificial neural networks for sequence-oriented protein design: an evaluation. Endocytobiosis Cell Res 1995; 11:1-18.
18. Schneider G. Evolutionary molecular design in virtual fitness landscapes. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:161-186.
19. Schneider G, Lee ML, Stahl M, Schneider P. De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J Comput-Aided Mol Design 2000; 14:487-494.
20. Devillers J, ed. Genetic Algorithms in Molecular Modeling. London: Academic Press, 1996.
21. Narendra KS, Annaswamy AM. Stable Adaptive Systems. London: Prentice Hall, 1989.
22. Dracopoulos DC. Evolutionary Learning Algorithms for Neural Adaptive Control. London: Springer, 1997.
23. Luenberger DG. Introduction to Dynamic Systems. New York: John Wiley and Sons, 1979.
24. High-Throughput Screening. Drug Discovery Today 2000; 5(12) Suppl.
25. Bavarian B. Introduction to neural networks for intelligent control. IEEE Control Systems Magazine 1988; April:3-7.
26. Narendra KS, Mukhopadhyay S. Intelligent control using neural networks. IEEE Control Systems Magazine 1992; April:11-18.
27. Kubinyi H. A general view on similarity and QSAR studies. In: van de Waterbeemd H, Testa B, Folkers G, eds. Computer-Assisted Lead Finding and Optimization: Current Tools for Medicinal Chemistry. Weinheim, New York: Wiley-VCH, 1997:9-28.
28. Leach AR. Molecular Modeling: Principles and Applications. Harlow: Pearson Education, 1996.
29. So SS, Karplus M. Evolutionary optimization in quantitative structure-activity relationships: An application of genetic neural networks. J Med Chem 1996; 39:1521-1530.
30. Schneider G, Neidhart W, Adam G. Integrating virtual screening to the quest for novel membrane protein ligands. CNSA 2001; 1:99-112.
31. Rosenbusch JP. The critical role of detergents in the crystallization of membrane proteins. J Struct Biol 1990; 104:134-138.
32. Sakai H, Tsukihara T. Structures of membrane proteins determined at atomic resolution. J Biochem 1998; 124:1051-1059.
33. Heberle J, Fitter J, Sass HJ et al. Bacteriorhodopsin: The functional details of a molecular machine are being resolved. Biophys Chem 2000; 85:229-248.
34. Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
35. Rechenberg I. Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. 2nd ed. Stuttgart: Frommann-Holzboog, 1994.
36. Schwefel HP. Numerical Optimization of Computer Models. Chichester: Wiley, 1981.
37. Bäck T, Schwefel HP. An overview of evolutionary algorithms for parameter optimization. Evol Comput 1993; 1:1-23.
38. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading: Addison-Wesley, 1989.
39. Davis L, ed. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991.
40. Rawlins G, ed. Foundations of Genetic Algorithms. Los Altos: Morgan Kaufmann, 1991.
41. Forrest S. Genetic Algorithms: Principles of natural selection applied to computation. Science 1993; 261:872-878.
42. Reeves CR. Modern Heuristic Techniques for Combinatorial Problems. Oxford: Blackwell Scientific, 1993.
43. Kirkpatrick S, Gelatt Jr. CD, Vecchi MP. Optimization by simulated annealing. Science 1983; 220:671-680.
44. Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann machines. Cognitive Science 1985; 9:147-169.
45. Glover FW, Laguna M. Tabu Search. Boston: Kluwer Academic Publishers, 1997.
46. Baxter CA, Murray CW, Clark DE et al. Flexible docking using Tabu search and an empirical estimate of binding affinity. Proteins: Struct Funct Genet 1998; 33:367-382.
47. Globus A, Lawton J, Wipke T. Automatic molecular design using evolutionary techniques. Nanotechnology 1999; 10:290-299.
48. Newell A. Heuristic programming: Ill-structured problems. In: Aronofsky J, ed. Progress in Operations Research. New York: Wiley, 1969:363-414.
49. Davis L. A genetic algorithms tutorial. In: Davis L, ed. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991:1-101.
50. De Jong KA. Introduction to the second special issue on genetic algorithms. Machine Learning 1990; 5(4):351-353.
51. Forrest S, Mitchell M. What makes a problem hard for a genetic algorithm? Some anomalous results and their explanation. Machine Learning 1993; 13:285-319.
52. Kauffman SA. The Origins of Order: Self-Organization and Selection in Evolution. New York: Oxford University Press, 1993.
53. Giuliani A, Benigni R. Modeling without boundary conditions: An issue in QSAR validation. In: van de Waterbeemd H, Testa B, Folkers G, eds. Computer-Assisted Lead Finding and Optimization: Current Tools for Medicinal Chemistry. Weinheim, New York: Wiley-VCH, 1997:51-63.
54. Mathews JH. Numerical Methods for Mathematics, Science, and Engineering. Englewood Cliffs: Prentice-Hall International, 1992.
55. Levitan B, Kauffman S. Adaptive walks with noisy fitness measurements. Mol Diversity 1995; 1:53-68.
56. Schulz AS, Shmoys DB, Williamson DP. Approximation algorithms. Proc Natl Acad Sci USA 1997; 94:12734-12735.
57. Clark DE, Westhead DR. Evolutionary algorithms in computer-aided molecular design. J Comput Aided Mol Des 1996; 10:337-358.
58. Desjarlais JR, Clarke ND. Computer search algorithms in protein modification and design. Curr Opin Struct Biol 1998; 8:471-475.
59. Koza JR. Genetic Programming. Cambridge: The MIT Press, 1992.
60. Clark DE, ed. Evolutionary Algorithms in Molecular Design. Weinheim, New York: Wiley-VCH, 2000. URL: http://panizzi.shef.ac.uk/cisrg/links/ea_bib.html
61. Rechenberg I. Cybernetic solution path of an experimental problem. Royal Aircraft Establishment, Library Translation 1122. Farnborough.
62. Bremermann HJ. Optimization through evolution and recombination. In: Yovits MC, Jacobi GT, Goldstine GD, eds. Self-organizing Systems. Washington: Spartan Books, 1962:93-106.
63. Davidor Y, Schwefel HP. An introduction to adaptive optimization algorithms based on principles of natural evolution. In: Soucek B and the IRIS Group, eds. Dynamic, Genetic, and Chaotic Programming. New York: John Wiley and Sons, 1992:183-202.
64. Dunham B, Fridshal D, Fridshal R et al. Design by natural selection. IBM Res Dept RC-476, June 20, 1961.
65. Johnson MA, Maggiora GM, eds. Concepts and Applications of Molecular Similarity. New York: John Wiley, 1990.
A Conceptual Framework
29
66. Lipinski CA, Lombardo F, Dominy BW et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 1997; 23:3-25. 67. Clark DE, Pickett SD. Computational methods for the prediction of ‘drug-likeness’. Drug Discovery Today 2000; 5:49-58. 68. Schneider G, Schuchhardt J, Wrede P. Artificial neural networks and simulated molecular evolution are potential tools for sequence-oriented protein design. Computer Applic Biosci 1994; 10:635-645. 69. Saravanan N, Fogel DB, Nelson KM. A comparison of methods for self-adaptation in evolutionary algorithms. Biosystems 1995; 36:157-166. 70. Conrad M, Ebeling W, Volkenstein MV. Evolutionary thinking and the structure of fitness landscapes. Biosystems 1992; 27:125-128. 71. Schneider G, Schuchhardt J, Wrede P. Peptide design in machina: Development of artificial mitochondrial protein precursor cleavage-sites by simulated molecular evolution. Biophys J 1995; 68:434-447. 72. Davis L. Adapting operator probabilities in genetic algorithms. In: Schaffer JD, ed. Proceedings of the Third International Conference on Genetic Algorithms and Their Applications. San Mateo: Morgan Kaufmann, 1989:61-69. 73. Bäck T, Hoffmeister F, Schwefel HP. A survey of evolution strategies. In: Whitley D, ed. Proceedings of the Fourth International Conference on Genetic Algorithms. San Mateo: Morgan Kaufmann, 1991:2-9. 74. Julstrom BA. What have you done for me lately? In: Eshelman LJ, ed. Proceedings of the Sixth International Conference on Genetic Algorithms. San Mateo: Morgan Kaufmann, 1995:81-87. 75. Tuson A, Ross P. Adapting operator settings in genetic algorithms. Evol Comput 1998; 6:161-184. 76. Kitano H. Designing neural networks using genetic algorithms with graph generation systems. Complex Systems 1990; 4:461-476. 77. Mitchell M. An Introduction to Genetic Algorithms. Cambridge: The MIT Press, 1998. 78. Tuson A, Clark DE. New techniques and future directives. 
In: Clark DE, ed. Evolutionary Algorithms in Molecular Design. Weinheim, New York: Wiley-VCH, 2000:241-264. 79. Weber L, Practical approaches to evolutionary design. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:187-205. 80. Heitkoetter J, Beasley D. The hitch-hiker’s guide to evolutionary computation: A list of frequentlyasked questions (FAQ). USENET: comp.ai.genetic. Available via anonymus FTP from: rtfm.mit.edu/ pub/usenet/news.answers/ai-faq/genetic 81. So SS. Quantitative structure-activity relationships. In: Clark DE, ed. Evolutionary Algorithms in Molecular Design. New York, Weinheim, New York: Wiley-VCH, 2000:71-97. 82. Sheridan RP, SanFeliciano SG, Kearsley SK. Designing targeted libraries with genetic algorithms. J Mol Graph Model 2000; 18:320-334. 83. Goldberg DE, Korb B, Deb K. Messy genetic algorithms: Motivation, analysis, and first results. Complex Syst 1989; 3:493-530. 84. Parrill AL. Introduction to evolutionary algorithms. In: Clark DE, ed. Evolutionary Algorithms in Molecular Design. New York, Weinheim, New York: Wiley-VCH, 2000:1-13. 85. Schneider G, Schuchhardt J, Wrede P. Evolutionary optimization in multimodal search space. Biol Cybernetics 1996; 74:203-207. 86. Raudys S. How good are support vector machines? Neural Networks, 2000; 13:17-19. 87. Ripley BD. Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press, 1996. 88. Katz WT, Snell JW, Merickel MB. Artificial neural networks. Methods Enzymol 1992; 210:610-636. 89. Soucek B, The IRIS Group, eds. Dynamic, Genetic, and Chaotic Programming. New York: John Wiley and Sons, 1992. 90. Hofstadter DR. Gödel, Escher, Bach: An Eternal Golden Braid. New York: Basic Books, 1979. 91. Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 2000; 22:717-727. 92. Milne GWA. Mathematics as a basis for chemistry. 
J Chem Inf Comput Sci 1997; 37:639-644. 93. Hinton GE. How neural networks learn from experience. Sci Am 1992; 267:144-151.
30
Adaptive Systems in Drug Design
94. Hampson S. Generalization and specialization in artificial neural networks. Prog Neurobiol 1991; 37:383-431. 95. Hornik K, Stinchcombe M, White H. Multilayer feed-forward networks are universal approximators. Neural Networks 1989; 2:359-366. 96. Kosko B. Neural Networks and Fuzzy Systems—A Dynamical Systems Approach to Machine Intelligence. Englewood Cliffs: Prentice Hall International, 1992. 97. Zupan J, Gasteiger J. Neural Networks for Chemists. Weinheim: VCH, 1993. 98. Schneider G, Wrede P. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol 1998; 70:175-222. 99. Schneider G, Schuchhardt J, Wrede P. Development of simple fitness landscapes for peptides by artificial neural filter systems. Biol Cybernetics 1995; 73:245-254. 100. Rumelhart DE, McClelland JL, The PDB Research Group. Parallel Distributed Processing. Cambridge: MIT Press, 1986. 101. Hertz J, Krogh A, Palmer RG. Introduction to the Theory of Neural Computation. Redwood City: Addison-Wesley, 1991. 102. Amari SI. Mathematical methods of neurocomputing. In: Barndorff-Nielsen OE, Jensen JL, Kendall WS, eds. Networks and Chaos—Statistical and Probabilistic Aspects. London: Chapman & Hall, 1993:1-39. 103. Bishop CM. Neural networks for pattern recognition. Oxford: Clarendon Press, 1995. 104. Mitchie D, Spiegelhalter DJ, Taylor CC, eds. Machine Learning: Neural and Statistical Classification. New York: Ellis Horwood, 1994. 105. Bratko I, Muggleton S. Applications of inductive logic programming. Commun Assoc Comput Machinery 1995; 38:65-70. 106. Langley P, Simon HA. Applications of machine learning and rule induction. Commun Assoc Comput Machinery 1996; 38:54-64. 107. Weigend AS, Gershenfeld NA, eds. Time Series Prediction: Forecasting the Future and Understanding the Past. Reading: Addison-Wesley, 1993. 108. Carpenter GA, Grossberg S, eds. Pattern Recognition by Self-Organizing Neural Networks. Cambridge: The MIT Press, 1991. 109. Anderson JA, Rosenfeld, E, eds. 
Neurocomputing: Foundations of Research. Cambridge: MIT Press, 1988. 110. Devillers J, ed. Neural Networks in QSAR and Drug Design. London: Academic Press, 1996. 111. Downs GM, Willett P. Clustering of chemical structure databases for compound selection. In: van de Waterbeemd H, ed. Advanced Computer-Assisted Techniques in Drug Discovery; Weinheim: VCH, 1995:111-130. 112. Willett P, ed. Methods for the analysis of molecular diversity. In: Perspectives in Drug Discovery and Design. Vols 7/8, 1997. 113. Agrafiotis DK, Lobanov VS, Rassokhin DN et al. The measurement of molecular diversity. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:265-300. 114. Good AC, Peterson SJ, Richards WG. QSAR from similarity matrices. Technique validation and application in the comparison of different similarity evaluation methods. J Med Chem 1993; 36:2929-2937. 115. Good AC, So SS, Richards WG. Structure-activity relationships from molecular similarity matrices. J Med Chem 1993; 36:433-438. 116. Dean PM, ed. Molecular Similarity in Drug Design. London: Blackie Academic & Professional, 1995. 117. Sen K, ed. Molecular Similarities I. Topics in Current Chemistry. Vol 173. Berlin: Springer, 1995. 118. Sen K, ed. Molecular Similarities II. Topics in Current Chemistry. Vol 174. Berlin: Springer, 1995. 119. Barnard JM, Downs GM, Willett P. Descriptor-based similarity measures for screening chemical databases. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:59-80. 120. So SS, Karplus M. Three-dimensional quantitative structure-activity relationships from molecular similarity matrices and genetic neural networks. 1. Method and validations. J Med Chem 1997; 40:4347-4359.
A Conceptual Framework
31
121. So SS, Karplus M. Three-dimensional quantitative structure-activity relationships from molecular similarity matrices and genetic neural networks. 2. Applications. J Med Chem 1997; 40:4360-4371. 122. Loew GH, Villar HO, Alkorta I. Strategies for indirect computer-aided drug design. Pharm Res 1993; 10:475-486. 123. Marrone TJ, Briggs JM, McCammon JA. Structure-based drug design: computational advances. Annu Rev Pharmacol Toxicol 1997; 37:71-90. 124. Drews J, Ryser S, eds. Human Disease—From Genetic Causes to Biochemical Effects. Berlin: Blackwell Science, 1997. 125. Bongard M. Patterns Recognition. Rochelle Park: Spartan Books, 1970. 126. Sumpter BG, Getino C, Noid DW. Theory and applications of neural computing in chemical science. Annu Rev Phys Chem 1994; 45:439-481. 127. Kolmogorov AN. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl Akad Nauk SSSR 1957; 114:953-956. 128. Cybenko G. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 1989; 2:303-314. 129. Cybenko G. Continuous valued neural networks with two hidden layers are sufficient. Technical Report, Department of Computer Science, Tufts University, Medford, MA, 1988. 130. Baum EB, Haussler D. What size net gives valid generalization? Neural Computation 1989; 1:151-160. 131. Widrow B, Hoff ME. Adaptive switching circuits. In: 1960 IRE WESCON Convention Record, Part 4. New York: IRE, 1960:96-104. 132. McInerny JM, Haines KG, Biafore S, Hecht-Nielsen R. Back propagation error surfaces can have local minima. In: International Joint Conference on Neural Networks 2. Washington, 1989. New York: IEEE Press:627.
32
Adaptive Systems in Drug Design
CHAPTER 2
Analysis of Chemical Space
“The critical and age-old question remains: how should a chemist decide what to synthesize?” (W.P. Walters, M.T. Stahl, M.A. Murcko)1
A Virtual Screening Philosophy
A main goal of virtual screening is to select activity-enriched sets of molecules, or single molecules exhibiting desired activity, from the space of all synthetically accessible structures. Currently the most advanced HTS techniques allow for testing of ~10^5 compounds per day, and a typical corporate screening library contains several hundred thousand samples. Although these facts alone represent a technological revolution, the turnover numbers are still extremely small compared with the total size of chemical space.1 As a consequence, even ultra-HTS combined with fast, parallel combinatorial chemistry can only be successful if a reasonable pre-selection of molecules (or molecular building blocks) for screening is done. Otherwise this approach will more or less amount to a random search with a very small probability of success. While HTS and ultra-HTS have made significant progress in recent years, we should bear in mind that it will be very costly to screen a million compounds for activity in all the new receptor assays (estimated $0.1 to $10 per compound per screen). Even if a company has these resources, it rarely has access to a diverse one-million-compound screening library. Thus it can be advantageous to integrate VS tools into the drug discovery process to find leads with novel scaffolds, starting from competitor compounds described in the literature and/or from a proprietary, existing scaffold. Once a reliable VS process has been defined, it can save resources and limit experimental efforts by suggesting defined sets of molecules. To reliably calculate prediction values or properties, the molecules under investigation must be represented in a suitable fashion. In other words, the appropriate level of abstraction must be defined to perform rational VS.
A convenient way to do this is to employ molecular descriptors, which can be used to generate molecular encoding schemes ranging from general properties (e.g., lipophilicity, molecular weight, total charge, volume in solution) to very specific structural and pharmacophoric attributes (e.g., multi-point pharmacophores, field-based descriptors).2 Filtering tools can be constructed using a simplistic model relating the descriptors to some kind of bioactivity or molecular property. However, the selection of appropriate descriptors for a given task is not trivial, and careful statistical analysis is required. Usually, at the beginning of a medicinal chemistry project one wants to perform a rather coarse-grained sieving of compounds displaying interesting bioactivity. Several such filtering rules have been compiled based on the analysis of known drugs and bioactive molecules.1,3-8 A more detailed description of the “drug-likeness” concept will be given in Chapter 4. As the knowledge about the required molecular features grows during a project, the VS technique becomes increasingly fine-grained. An overview of a typical VS process is shown in Figure 2.1.
Fig. 2.1. Virtual screening scheme (adapted from Walters et al1).
Besides an appropriate representation of the molecules under investigation, any useful feature extraction system must be structured in such a way that meaningful analysis and pattern recognition is possible. Technical systems for information processing are intuitively considered as mimicking some aspects of human capabilities in the fields of perception and cognition. Despite great achievements in artificial intelligence research during the past decades, we are still far from understanding complex biological information processing systems in detail. This means that a feature extraction task that appears very simple to a human expert can be extremely hard or even impossible to solve for a technical system, e.g., a particular virtual screening software. To demonstrate the often fuzzy nature of features, let us consider the depictions shown in Figure 2.2. Human perception readily classifies these patterns as “tree”. We “intuitively” know the generalizing features defining the pattern class “tree”. This is, however, a difficult task for technical information processing systems because the common features of the four tree patterns are not easily rationalized and described. Useful attributes might include something like “the number of coherent areas” and “ball and stick”. Because it is very often impossible to describe the relevant features in sufficient detail, pattern recognition systems are often confronted with vaguely and incompletely described tasks.9,10 This is true for the tree-classification example, and also for molecular pattern recognition and virtual screening tasks. Some very basic feature extraction methods will be presented and discussed in this chapter; more advanced systems that are of particular interest for adaptive SAR modeling are described in Chapter 3.
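The coarse-grained descriptor filtering mentioned above can be illustrated with a minimal sketch. It assumes that descriptor values have already been computed for each molecule; the `Molecule` record, the rule names, and all numerical cut-offs are hypothetical values chosen only for illustration and are not filtering rules taken from the literature cited in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Hypothetical record holding precomputed general descriptors."""
    name: str
    mol_weight: float   # molecular weight in g/mol
    log_p: float        # lipophilicity estimate
    total_charge: int   # net formal charge


# Hypothetical cut-offs, chosen only to illustrate the filtering idea.
FILTER_RULES = {
    "mol_weight": lambda m: 150.0 <= m.mol_weight <= 500.0,
    "log_p": lambda m: -1.0 <= m.log_p <= 5.0,
    "total_charge": lambda m: abs(m.total_charge) <= 1,
}


def failed_rules(mol):
    """Return the names of all rules that the molecule violates."""
    return [name for name, rule in FILTER_RULES.items() if not rule(mol)]


def coarse_filter(library):
    """Keep only molecules that pass every rule."""
    return [m for m in library if not failed_rules(m)]


library = [
    Molecule("cmpd_1", 320.4, 2.1, 0),
    Molecule("cmpd_2", 712.9, 6.3, 0),
    Molecule("cmpd_3", 280.3, 1.0, -2),
]
selected = coarse_filter(library)
for mol in library:
    print(mol.name, failed_rules(mol) or "passes")
```

Reporting which rule failed, rather than only a pass/fail flag, makes it easier to see how strongly each descriptor drives the selection, which matters because the choice of descriptors is itself a modeling decision.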
Fig. 2.2. Graphical representations of four different types of trees. Human perception easily classifies these patterns as “tree”, a difficult task for technical information processing systems.
Logical Inference
A chemist’s decision as to which molecules to synthesize next is usually based on the available facts about a particular project, expert knowledge that was acquired over years, and to some extent on intuition. Software containing knowledge about a particular, limited, real-world problem can assist in this decision-making process (Definition 2.1).11,12 It is important to note that virtual screening systems are meant to complement the abilities of a human expert, e.g., by analyzing very large sets of data and prioritizing many different designs. Some aspects of human decision-making and reasoning can be adapted or mimicked by “intelligent” software, but many features will probably remain a domain of human perception, cognition, and intuition.
Definition 2.1
Expert systems are computer programs that help to solve problems at an expert level.12 •
Based on an appropriate representation of knowledge, logical inferences are made by man and machine through deduction, abduction, and induction mechanisms (see Fig. 1.4). The modus ponens is probably the best-known inference rule:
Given facts (axioms):
IF i = A THEN j = B
i = A
Inference (modus ponens):
j = B
Additional important inference rules are the modus tollens and the chain rule, which combines several implications:
Given facts (axioms):
IF i = A THEN j = B
IF j = B THEN k = C
Inference (chain rule):
IF i = A THEN k = C
Deductive logical programming using predicate logic is based on such rules.13,14 They can be used to derive hypotheses from true facts, i.e., they only consider the syntactic structure of
the expressions. Several applications of logical reasoning systems have been described in the context of drug design.15-17 Induction seems to be especially suited to learning from examples, and the chemical similarity approach can be regarded as founded on this concept, as illustrated by the following simplified case:
Given facts:
The pyrimidine derivative A is active in assay X
The pyrimidine derivative B is active in assay X
…
Inference (induction):
All pyrimidines are active in assay X
It must be stressed that induction is useful to derive new hypotheses but is not a legal inference in a strict sense. This simply means that the conclusion “All pyrimidines are active in assay X” can be wrong. In contrast, deduction is denoted a legal inference because if only true axioms are given, then the conclusions drawn by deduction are also true. For further details on logical reasoning and inferencing, see the literature.13,18-20 Inductive logic programming (ILP)18 represents a relatively new addition to the field of logic programming, which seems to be appropriate for SAR and SPC (structure-property correlation) modeling tasks.15 According to Plotkin,21 an inductive learning task can be described using a background theory (facts and rules), sets of positive and negative examples (e.g., active and inactive molecules), a candidate hypothesis and a partial ordering system for alternative hypotheses, where the following conditions apply:
1. The background knowledge should not be sufficient to explain all positive examples; otherwise the problem would already be solved (prior necessity).
2. The background knowledge should be consistent with all negative and positive examples (prior satisfiability).
3. The background knowledge and the hypothesis should together explain all positive examples (posterior sufficiency) and should not contradict any of the negative examples (strong posterior consistency); in the presence of noise, logical consistency is sufficient (weak posterior consistency).
4. If there are several hypotheses which fulfil conditions 1, 2, and 3, then the most general hypothesis should be selected as the result.
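The four conditions can be made concrete with a deliberately simplified, propositional sketch. It assumes that each molecule is described by a set of feature labels and that every rule has the form "IF all required features are present THEN active"; real ILP systems work with first-order clauses, so this is only a toy stand-in. All feature names and examples are hypothetical, and the number of required features is used as a crude proxy for the generality ordering.

```python
def covers(required, example):
    """Rule 'IF all required features are present THEN active' fires."""
    return required <= example


def explained(rules, example):
    """True if at least one rule in the theory predicts the example active."""
    return any(covers(rule, example) for rule in rules)


def check_conditions(background, hypothesis, positives, negatives):
    """Evaluate conditions 1-3 for one candidate hypothesis (toy model)."""
    theory = background + [hypothesis]
    return {
        # 1. background alone must not already explain all positives
        "prior_necessity": not all(explained(background, e) for e in positives),
        # 2. background must not predict any negative example as active
        "prior_satisfiability": not any(explained(background, e) for e in negatives),
        # 3a. background plus hypothesis explain every positive
        "posterior_sufficiency": all(explained(theory, e) for e in positives),
        # 3b. background plus hypothesis cover no negative
        "posterior_consistency": not any(explained(theory, e) for e in negatives),
    }


def most_general(candidates, background, positives, negatives):
    """4. Among valid hypotheses pick the one with the fewest requirements."""
    valid = [h for h in candidates
             if all(check_conditions(background, h, positives, negatives).values())]
    return min(valid, key=len) if valid else None


# Hypothetical feature sets for active (positive) and inactive (negative) molecules.
positives = [frozenset({"pyrimidine", "amide", "methyl"}),
             frozenset({"pyrimidine", "halogen", "methyl"}),
             frozenset({"quinoline", "basic_amine"})]
negatives = [frozenset({"benzene", "amide"}),
             frozenset({"benzene", "methyl"})]
background = [frozenset({"quinoline", "basic_amine"})]
candidates = [frozenset({"pyrimidine", "methyl"}),
              frozenset({"pyrimidine"}),
              frozenset({"methyl"})]

best = most_general(candidates, background, positives, negatives)
print(sorted(best))
```

In this toy data set the rule requiring only "methyl" is rejected because it also covers an inactive molecule, and of the two remaining valid rules the less specific one is returned, mirroring condition 4.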
In addition to drug design projects, ILP has found several successful applications in bioinformatical sequence analysis and protein structure prediction.22,23 The software PROMIS represents an early machine learning program written in the programming language PROLOG, which implements a generate-and-test hill-climbing beam search technique to find patterns in amino acid sequences.24 The idea is to find a sequence of classes (“rules”) that can be used to discriminate between different sets of amino acid sequences. The algorithm is a typical example of a population-based stochastic searching technique. It is related to a (µ + λ) evolution strategy (see Chapter 1): Starting from an initial set of general rules, new rules are formed by means of “generalization”, “specialization” and “extension”. Every newly formed rule is assessed by a fitness function (e.g., classification accuracy or coverage), and is either rejected or selected as a parent rule of the next optimization cycle. The beam searching idea, which can be regarded as analogous to using multiple parents in ES/GA techniques, is thought to reduce the risk of getting trapped in a local optimum. Compared to evolutionary algorithms, however, no adaptive strategy parameters were used in the original algorithm. Within the PROMIS software, Taylor’s classification of amino acids is employed to describe similarity between residues and extract patterns that can be used for sequence classification (Fig. 2.3).25
Fig. 2.3. A Venn diagram representing the class relationship of the 20 genetically coded amino acids according to Taylor.25 This grouping of residues has been successfully applied to finding generalizing patterns in amino acid sequences.
Residue classes are easily represented by PROLOG expressions in the knowledge base, allowing for the construction of generalizing patterns formed by a sequence of class identifiers (ALL, SMALL, CHARGED, HYDROPHOBIC, AROMATIC, etc.):
class (ALL, [A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y]).
class (SMALL, [A,C,D,G,N,P,S,T,V]).
class (HYDROPHOBIC, [A,C,F,G,H,I,K,L,M,T,V,W,Y]).
class (AROMATIC, [F,H,W,Y])
...
One particular such rule is listed below. It was generated by a modified version of PROMIS and gives an idea of characteristic features found in peptide substrates of the mitochondrial processing peptidase (MPP):26-28
ALL VERY_HYDROPHOBIC OR SMALL OR LYSINE POSITIVE LARGE NEUTRAL NEUTRAL ALL
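How such a class-based pattern is matched against a sequence can be sketched in a few lines of Python. This is not the PROMIS implementation; it is a minimal illustration that uses only the four residue classes spelled out above. The example pattern and the test sequence are hypothetical, because the memberships of classes such as VERY_HYDROPHOBIC, LARGE or NEUTRAL are not given in the text.

```python
# Residue classes copied from the PROLOG expressions above.
CLASSES = {
    "ALL": set("ACDEFGHIKLMNPQRSTVWY"),
    "SMALL": set("ACDGNPSTV"),
    "HYDROPHOBIC": set("ACFGHIKLMTVWY"),
    "AROMATIC": set("FHWY"),
}


def position_matches(residue, alternatives):
    """A position matches if the residue belongs to any of the OR-ed classes."""
    return any(residue in CLASSES[name] for name in alternatives)


def find_matches(sequence, pattern):
    """Return all start indices at which the whole pattern matches."""
    width = len(pattern)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if all(position_matches(sequence[start + i], alts)
               for i, alts in enumerate(pattern))
    ]


# Hypothetical pattern: one tuple of alternative classes per sequence position.
PATTERN = [("ALL",), ("AROMATIC", "SMALL"), ("HYDROPHOBIC",), ("ALL",)]

matches = find_matches("KRFLSE", PATTERN)
print(matches)
```

Each pattern position holds a tuple of class names so that alternatives of the form "X OR Y" can be expressed in the same way as in the machine-generated rule.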
In a given matching sequence, this MPP cleavage site pattern starts with the class “ALL”. Moving towards the C-terminal end of the sequence, the next position must match one of the residues described by “VERY_HYDROPHOBIC OR SMALL OR LYSINE”, the second next residue must be positively charged, and so on. Such machine-generated rules can help to find and understand function-determining patterns in amino acid sequences. This general feature extraction approach complements other pattern matching routines used in sequence analysis.29,30 Its principle of using generic molecular descriptors (here: residue classes) is very similar to establishing an SAR model for drug design by adaptive rule formation. This short introduction to some general inference mechanisms was intended to demonstrate one possible approach to how experimental observations and expert knowledge can be represented as facts and rules, and how conclusions can be drawn that might help the medicinal chemist to generate new hypotheses (Fig. 2.4). Learning by induction provides a theoretical framework for many SAR/SPC modeling tasks. As we have learned from many years of “artificial intelligence” research, it is extremely difficult (if not impossible) to develop virtual screening algorithms mimicking the medicinal chemist’s intuition. Furthermore, there is no common “gut feeling”, as different chemists have different educational backgrounds, skills and experience. Despite such limitations there is, however, substantial evidence that it is possible to support drug discovery in various ways with the help of computer-assisted library design and selection strategies. There are two specific properties of computers which make them very attractive for virtual screening applications:
1. With the help of virtual library construction, hitherto unknown parts of chemical space can easily be explored, and
2. the speed and throughput of virtual testing (fitness or quality calculations) can be far ahead of what is possible by means of “wet bench” experimental systems.
Chemical Compound Libraries
Two complementary compound sources are accessible for virtual screening: databases of known structures and de novo designs (including enumerated combinatorial libraries). Some major databases frequently employed for virtual screening experiments are listed in Table 2.1. In addition, several companies offer large libraries of both combinatorial and historical collections on a commercial basis. Usually the combinatorial collections contain 100k-500k structures, whereas commercially available historical collections rarely exceed 100k compounds. Most of the major pharmaceutical companies have compound collections in the 300k+ range. Combinatorial libraries usually provide small amounts of uncharacterized compounds for screening. Once these samples are fully characterized, e.g., by HPLC and mass spectroscopy, the data are of interest for structure–activity purposes. In most companies, these compounds are also present alongside the “historical” collection of compounds, generally derived from classical medicinal chemistry programs, most of which have very well defined chemical characteristics. Commercial compound collections that fall between these two extremes can also be purchased. Collectively, therefore, the information used to relate biological activity and chemical structure must clearly integrate all of these types of compounds. Assessment of the diversity of a compound library is often a first step in virtual screening. The most relevant approach is clearly to assess the diversity space using chemical criteria, and several algorithms are now available to do that. It is likely that after diversity analysis and extensive experimental screening of the library, at several targets and target classes, the structure-activity database will point to areas of success and failure in terms of identifying leads. Thus the library may be said to be “GPCR-modulator rich”, “kinase-inhibitor poor” etc.
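The text does not name a specific diversity algorithm. As one possible illustration, the following sketch uses the Tanimoto coefficient on binary fingerprints together with a greedy MaxMin selection, a common choice for picking a structurally diverse subset. Fingerprints are represented as sets of "on" bit indices; the five example fingerprints and the choice of the first compound as seed are assumptions made for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedy MaxMin selection: repeatedly add the compound whose nearest
    already-picked neighbour is farthest away (distance = 1 - Tanimoto)."""
    picked = [seed_index]
    while len(picked) < min(n_pick, len(fingerprints)):
        best, best_dist = None, -1.0
        for i, fp in enumerate(fingerprints):
            if i in picked:
                continue
            dist = min(1.0 - tanimoto(fp, fingerprints[j]) for j in picked)
            if dist > best_dist:  # strict '>' keeps the first candidate on ties
                best, best_dist = i, dist
        picked.append(best)
    return picked


# Hypothetical fingerprints (sets of bit positions), for illustration only.
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {1, 2, 10, 11},
    {20, 21},
]
subset = maxmin_pick(fingerprints, n_pick=3)
print(subset)
```

The quadratic cost of this naive loop is acceptable for a sketch; for libraries with hundreds of thousands of compounds a more efficient implementation would be needed.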
Table 2.1. Some major databases that are useful for virtual screening experiments (adapted from Eglen et al33)

Database      No. of Molecules   Description
ACD (a)       > 250,000          Available Chemicals Directory; catalogue of commercially available specialty and bulk chemicals from over 225 international suppliers
Beilstein (b) > 7,000,000        Covers organic chemistry from 1779
CSD (c)       > 200,000          Cambridge Structural Database; experimentally determined three-dimensional structures of small molecules
CMC (a)       > 7,000            Comprehensive Medicinal Chemistry database; structures and activities of drugs having generic names (on the market)
MDDR (a)      > 85,000           MACCS-II (MDL) Drug Data Report; structures and activity data of compounds in the early stages of drug development
MedChem (d)   > 35,000           Medicinal Chemistry database; pharmaceutical compounds
SPRESI (d)    > 3,400,000        Substances and bibliographic data abstracted from the world’s chemical literature
WDI (e)       > 50,000           World Drug Index; pharmaceutical compounds from all stages of development

(a) Molecular Design Limited, San Leandro, CA, U.S.A.
(b) Beilstein Informationssysteme GmbH, Frankfurt, Germany
(c) CSD Systems, Cambridge, U.K.
(d) Daylight Chemical Information Systems Inc., Claremont, CA, U.S.A.
(e) Derwent Information, London, U.K.

An experiment-based understanding of the screening library diversity should also reveal compounds that are “frequent hitters”, i.e., compounds that are not necessarily chemically reactive, but have structures that repeatedly bind to a range of targets via unspecific interactions or cause a false positive signal for other assay-inherent reasons. Clearly, removal of these compounds from the library is an advantage in HTS, as is an understanding of the reason for their promiscuity of interaction. A further issue relates to identifying a screening library subset, ostensibly representative of the diversity of the whole library, that is screened at all targets, usually as a priority in the screening campaign. Assessment of chemical versus operational understanding of diversity is critical in the design of the library subset. Moreover, there are advantages in screening the whole library. First, since HTS or uHTS is generally unconstrained by cost or compound usage, it is as easy to screen 250k compounds as it is to screen 25k. Second, the screening campaign increases the likelihood of finding actives, especially for difficult targets, as
well as finding multiple structurally distinct leads. Indeed, a direct comparison with the approach of screening a representative subset has been reported from Pfizer: 32 of the 39 leads found by screening the whole library were missed by the subset.31 Alternatively, Pharmacopeia have reported that receptor antagonists for the CXCR2 receptor and the human bradykinin B1 receptor were derived from the same 150k compound library, made using the same four combinatorial steps. Notably, this library was neither based on known leads in the GPCR field nor specifically targeted towards GPCRs. On the other hand, researchers at Organon reported that it is possible to rationally select various “actives” from large databases using appropriate “diversity” selection and “representativity” methods.32 The introduction of combinatorial chemistry, HTS and the presence of large compound collections have put us in the comfortable position of having a large number of hits to choose from for lead optimization, at least for certain classes of drug targets. We anticipate that while the size of the compound libraries and the number of high-throughput screens will continue to increase, leading to a larger number of hits, the number of leads actually being followed up per project will remain roughly the same. The challenge is to select the most promising candidates for further exploration, and computational techniques will play a very important role in this process. Assuming a hit rate of 0.1-1% and a compound collection size of 10^6 compounds, we have (or will have) about 1k-10k hits that are potential starting points for further work. It is important to realize that while the screening throughput has increased significantly, the throughput of a traditional chemistry lab has not.
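The numbers quoted above follow from simple arithmetic, which the short sketch below reproduces. It combines the hit-rate estimate from this paragraph with the per-compound screening cost of $0.1 to $10 quoted earlier in the chapter; the function and variable names are illustrative only.

```python
LIBRARY_SIZE = 10**6  # compounds in the screening collection


def expected_hits(library_size, hit_rate):
    """Expected number of hits for a given fractional hit rate."""
    return round(library_size * hit_rate)


def screening_cost(library_size, cost_per_compound):
    """Cost in dollars of one screen of the whole library."""
    return library_size * cost_per_compound


hits_low = expected_hits(LIBRARY_SIZE, 0.001)   # 0.1% hit rate
hits_high = expected_hits(LIBRARY_SIZE, 0.01)   # 1% hit rate
cost_low = screening_cost(LIBRARY_SIZE, 0.10)   # $0.1 per compound
cost_high = screening_cost(LIBRARY_SIZE, 10.0)  # $10 per compound

print(f"hits: {hits_low} to {hits_high}")
print(f"cost per screen: ${cost_low:,.0f} to ${cost_high:,.0f}")
```

Under these assumptions a single screen of the full collection costs between roughly one hundred thousand and ten million dollars and yields between one thousand and ten thousand hits, which is far more than a traditional chemistry lab can follow up.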
While it is true that automated and/or parallel chemistry is now routinely used, there are still many molecules that are not amenable to these more automated and high-throughput approaches. Therefore the question is: “How can subsequent lead optimization fully exploit this vast amount of information?” Computational techniques can be used to address the question in a variety of ways:33
• Many of the hits are false positives or “frequent hitters”, which means that the observed effect in the assay is not due to specific binding to a certain pocket in the molecular target. Docking techniques can be used to place all compounds from the chemical library in the binding pocket and score them. If HTS data are available, a comparison of the in silico docking results and the assay data can be used to prioritize the hits and focus the subsequent work on the more promising candidates. If no HTS data are available (e.g., when no assay amenable to HTS is possible or if no hits were obtained), then docking can be used to select compounds for biological testing. It should be mentioned that correlating docking results with HTS data will surely give a negative bias against allosteric inhibitors. The results obtained by docking procedures must therefore be analyzed with great care.
• Many of the compounds have undesirable properties such as low solubility or high lipophilicity. In silico prediction tools can be used to rank the HTS hits. While it is generally true that insufficient physicochemical properties can be remedied in the lead optimization process (e.g., large, lipophilic side chains can be removed, or a “pro-drug” can mask highly acidic groups), it may be advisable, if possible, to focus on compounds without obvious liabilities.
• Toxicity and metabolic stability are extremely important parameters in the process of evaluating a compound for further development. In the past, these parameters were only taken into account at the later stages of the drug discovery process, partly because these parameters are time-consuming to establish and partly because small modifications to a molecule are known to have dramatic effects on these parameters. However, the availability of large databases on toxicity and metabolism has now increased the chance to sensibly relate chemical structures to these effects and to develop alert systems that again can be used to prioritize the hits. There are, however, mixed views on this. While one may believe that ADMET predictions are obviously important and will become increasingly accurate in the future, it might still be reasonable to hesitate to apply such methods as a basis to prioritize HTS hits. It is likely that these hits will have little resemblance to the final clinical candidate, so these predictions may not be very relevant so early on. However, once there is a lead series, we should accelerate our effort to formulate a system-specific ADMET tool that can handle these lead structures/classes.
Combinatorial library enumeration provides a straightforward way to prepare large virtual collections of molecules that are chemically feasible. The idea is to define sets of molecular building blocks and a list of chemical reactions for virtual synthesis. Both building blocks and reactions should be close to what is tractable in the laboratory to facilitate the synthesis of selected candidates. However, the real synthons need not necessarily be employed for virtual library construction. The stock of virtual building blocks can be compiled from commercially available structures (e.g., from the ACD), fictitious structures, and from retrosynthetic fragmentation of already known molecules.33,34 Generally, the term “building block” refers to variant structural parts of a combinatorial library, where the different building blocks present in a structure are denoted by R1, R2 etc. The “scaffold” contains the invariant structural attributes of a combinatorial library, and a “linker” can be any scaffold or building block with two combinatorial attachment sites. In Fig. 2.4 some typical combinatorial library scaffolds are shown. Both natural polymers like peptides or nucleic acids and small organic molecules provide building blocks and scaffolds for virtual library construction. Limited diversity, a preference for flexible, linear structures, and usually poor pharmacokinetic properties are problematic issues tied to natural polymer libraries. In contrast, small molecule libraries can cover a large diversity space, often contain rigid molecules and “unnatural” structures, often have desired pharmacokinetic properties, and can more easily be optimized in the lead optimization phase of drug discovery (Fig. 1.1).
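The enumeration idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scaffold template, the site names R1 and R2, and the building-block strings are hypothetical, they are treated as plain text rather than validated chemical structures, and no check of synthetic feasibility is made.

```python
from itertools import product

def enumerate_library(scaffold, building_blocks):
    """Return all products obtained by filling every R-site of the scaffold.

    scaffold        : template string with placeholders such as "{R1}"
    building_blocks : dict mapping site name -> list of substituent strings
    """
    sites = sorted(building_blocks)                    # fixed order of R-sites
    pools = [building_blocks[site] for site in sites]  # one pool per site
    library = []
    for combo in product(*pools):                      # full combinatorial product
        assignment = dict(zip(sites, combo))
        library.append(scaffold.format(**assignment))
    return library

# Hypothetical amide scaffold written as a SMILES-like template (assumption).
scaffold = "{R1}C(=O)N{R2}"
blocks = {
    "R1": ["C", "CC", "c1ccccc1"],   # illustrative acyl-side substituents
    "R2": ["C", "CCO"],              # illustrative amine-side substituents
}
virtual_library = enumerate_library(scaffold, blocks)
print(len(virtual_library))   # 3 x 2 = 6 virtual products
```

The size of the virtual library is the product of the building-block pool sizes, which is why even modest pools per site quickly lead to very large collections.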
Similarity Searching
Chemical similarity searching is a straightforward practical approach to identify candidate molecules by pair-wise comparison of compounds. In its simplest form, the result of a similarity search in a compound database is a ranked list, where high-ranking structures are considered to be more similar to the query in a certain sense than low-ranking molecules. If either the query structure(s) or the database structures or both reveal a certain (desired or undesired) property or activity, some conclusions may be drawn for the molecules under investigation. Structures are compared based on a similarity value that is calculated from their molecular descriptors. There are two assumptions inherent to this idea, representing the hypothesis “if molecule A is more similar to the query molecule R than molecule B, then molecule A might more likely show some biological activity that is comparable to the activity of R”:
i. The molecular representation (descriptor) is assumed to appropriately cover those molecular attributes which are relevant for the underlying SAR/SPR.
ii. The similarity measure applied is assumed to accurately relate differences in molecular descriptions to differences in the quality function (Principle of Strong Causality).35
In the past, the analysis of assay data was primarily performed by medicinal chemists, looking at the active compounds and then deciding which hits the efforts should be focused on. First, with the increase in the number of experimentally determined hits, this approach becomes increasingly ineffective, and computational techniques are increasingly used to classify the hits and derive hypotheses. Second, one should keep in mind that it is basically impossible for a human being to also take into account the large number of inactive compounds. The development of pharmacophore hypotheses, for example, typically requires the incorporation of information on inactive compounds. By similarity searching, sets of candidate structures can be rapidly compiled from databases or virtual chemical libraries. Practical experience shows that such hypotheses are often weak, and there clearly is no cure-all recipe or generally valid hypothesis leading to success in chemical similarity searching. Nevertheless, similarity searching provides a useful concept. A practicable measure of success can be expressed by an enrichment factor, ef, giving the ratio of the fraction of active molecules in the selected subset to the fraction of actives in the total pool (database), as given by Equation 2.1. This value may be regarded as an estimate of the enrichment obtained compared to a random selection of molecules.

Fig. 2.4. Some important scaffold structures that are amenable to solid phase combinatorial synthesis.
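$$ ef = \frac{N_{\mathrm{act}}^{\mathrm{subset}} / N^{\mathrm{subset}}}{N_{\mathrm{act}}^{\mathrm{total}} / N^{\mathrm{total}}} \qquad \text{(Eq. 2.1)} $$

where $N_{\mathrm{act}}^{\mathrm{subset}}$ and $N^{\mathrm{subset}}$ are the number of active molecules and the number of all molecules in the selected subset, and $N_{\mathrm{act}}^{\mathrm{total}}$ and $N^{\mathrm{total}}$ are the corresponding numbers for the total pool.

A large number of molecular descriptors have been developed over the past decades (Definition 2.2).2 The particular selection of a molecular representation defines a chemical space, and thus the ordering of molecules within this space. The choice of descriptors influences the distribution of structures. In Fig. 2.5 two distributions of the 20 genetically encoded amino acids are shown as an example.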
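The enrichment factor described in connection with Equation 2.1 is simple to compute. The following sketch follows the verbal definition (fraction of actives in the selected subset divided by the fraction of actives in the total pool); the function name, argument names, and the numbers in the example are assumptions made for illustration only.

```python
def enrichment_factor(n_actives_selected, n_selected, n_actives_total, n_total):
    """Ratio of the active fraction in the selected subset to that in the whole pool."""
    if n_selected <= 0 or n_total <= 0 or n_actives_total <= 0:
        raise ValueError("subset size, pool size and number of actives must be positive")
    fraction_selected = n_actives_selected / n_selected
    fraction_total = n_actives_total / n_total
    return fraction_selected / fraction_total

# Hypothetical numbers: 9 actives among 12 selected compounds,
# 500 actives in a database of 100,000 molecules.
ef = enrichment_factor(9, 12, 500, 100_000)
print(round(ef, 1))   # 0.75 / 0.005 = 150.0
```

A value of ef = 1 corresponds to the result expected from a random selection; values above 1 indicate enrichment of actives in the selected subset.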
Definition 2.2 “The molecular descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” (according to Todeschini and Consonni)2 •
Fig. 2.5. Distributions of amino acids in chemical space resulting from the selection of different molecular descriptors. Amino acids are denoted in single letter code. a) Volume109 and hydrophobicity;110 b) bulkiness111 and refractivity.111
Data scaling is usually the first step of chemical similarity searching, feature extraction, hypothesis generation, and other types of virtual screening and machine learning. The most frequently applied scaling methods include scaling by range (Eq. 2.2) and scaling by standard deviation (autoscaling, Eq. 2.3). For most applications autoscaling is a method of choice, leading to data with zero mean and unit variance. In some cases, vector normalization to length one is a necessary preprocessing procedure (Eq. 2.4).
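$$ x'_{ik} = \frac{x_{ik} - \min_j(x_{jk})}{\max_j(x_{jk}) - \min_j(x_{jk})} \qquad \text{(Eq. 2.2)} $$

where i is the row index, and k is the column index of the raw data matrix X.

$$ x'_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad \bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \qquad s_k = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2} \qquad \text{(Eq. 2.3)} $$

In Equations 2.3 and 2.4 n is the number of objects (molecules). Autoscaling results in data vectors scaled to a length of $\sqrt{n-1}$.

$$ x'_{ik} = \frac{x_{ik}}{\sqrt{\sum_{j=1}^{n} x_{jk}^2}} \qquad \text{(Eq. 2.4)} $$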
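The three scaling procedures can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the small data matrix are assumptions, columns are taken to be descriptors and rows to be molecules, and constant columns (zero range or zero variance) are not handled.

```python
import math

def column(X, k):
    """Return column k of a matrix stored as a list of rows."""
    return [row[k] for row in X]

def range_scale(X):
    """Scaling by range: every column is mapped to the interval [0, 1]."""
    n_cols = len(X[0])
    lo = [min(column(X, k)) for k in range(n_cols)]
    hi = [max(column(X, k)) for k in range(n_cols)]
    return [[(row[k] - lo[k]) / (hi[k] - lo[k]) for k in range(n_cols)] for row in X]

def autoscale(X):
    """Autoscaling: centre every column, divide by its standard deviation (n - 1)."""
    n, n_cols = len(X), len(X[0])
    mean = [sum(column(X, k)) / n for k in range(n_cols)]
    std = [math.sqrt(sum((x - mean[k]) ** 2 for x in column(X, k)) / (n - 1))
           for k in range(n_cols)]
    return [[(row[k] - mean[k]) / std[k] for k in range(n_cols)] for row in X]

def normalize_columns(X):
    """Vector normalization: divide every column by its Euclidean length."""
    n_cols = len(X[0])
    length = [math.sqrt(sum(x * x for x in column(X, k))) for k in range(n_cols)]
    return [[row[k] / length[k] for k in range(n_cols)] for row in X]

# Hypothetical raw data: four molecules (rows), two descriptors (columns).
X = [[100.0, 1.0], [200.0, 3.0], [300.0, 2.0], [400.0, 6.0]]
Z = autoscale(X)
col_length = math.sqrt(sum(z[0] ** 2 for z in Z))
# Length of an autoscaled column compared with sqrt(n - 1)
print(col_length, math.sqrt(len(X) - 1))
```

The final line illustrates the statement about autoscaled data: each column vector has length sqrt(n - 1), here sqrt(3).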
Various similarity measures exist that can be used for chemical similarity searching. Very often a distance value dAB between pairs of molecules A and B (i.e., their descriptors ξA and ξB containing n elements each) forms the basis on which a similarity value is calculated. The frequently used Manhattan distance (also called Hamming distance or City-Block distance; Eq. 2.5) and the Euclidean distance (Eq. 2.6) are the first two examples (p = 1 and p = 2) of a general distance metric, the Minkowski or Lp-metric (Eq. 2.7; see Eq. 1.9).

$$ d_{AB} = \sum_{i=1}^{n} \left| \xi_{Ai} - \xi_{Bi} \right| \qquad \text{(Eq. 2.5)} $$

$$ d_{AB} = \sqrt{\sum_{i=1}^{n} \left( \xi_{Ai} - \xi_{Bi} \right)^2} \qquad \text{(Eq. 2.6)} $$

$$ d_{AB} = \left( \sum_{i=1}^{n} \left| \xi_{Ai} - \xi_{Bi} \right|^{p} \right)^{1/p} \qquad \text{(Eq. 2.7)} $$
The similarity measure based on the Minkowski distance can be used to express molecular similarity (Eq. 2.8). Completely similar or identical structures have a similarity value of sAB = 1, completely dissimilar molecules have sAB = 0.
$$ s_{AB} = 1 - \frac{d_{AB}}{d_{AB}(\max)} \qquad \text{(Eq. 2.8)} $$

where dAB(max) represents the maximal pair-wise distance found in the data set under investigation, e.g., the maximal distance between the query structure and a database compound. Many additional distance and similarity measures have found application in chemical similarity searching.36,37 The Tanimoto coefficient is probably the best-known similarity index that is applied to the comparison of bitstring representations of molecules (although its application is not restricted to dichotomous variables). The set-theoretic definition of the Tanimoto coefficient is given by Equation 2.9, where χA is the set of bits set to 1 in the bitstring vector coding for molecule A, and χB is the set of bits set to 1 in the bitstring vector coding for molecule B. The range of values of the Tanimoto similarity measure is [0,1] for dichotomous variables.

$$ T_{AB} = \frac{\left| \chi_A \cap \chi_B \right|}{\left| \chi_A \cup \chi_B \right|} \qquad \text{(Eq. 2.9)} $$
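A ranked list of the kind described above can be produced with very little code. The sketch below implements the Minkowski distance, a distance-based similarity scaled by the maximal observed distance, and the Tanimoto coefficient for bitstrings. All function names, the toy descriptor vectors, and the convention of returning 1.0 for two empty bitstrings are assumptions made for illustration.

```python
def minkowski(a, b, p=2):
    """L_p distance between two descriptor vectors (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def similarity_from_distance(d, d_max):
    """Map a distance to [0, 1]: 1 for identical vectors, 0 at the maximal distance."""
    return 1.0 if d_max == 0 else 1.0 - d / d_max

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length 0/1 bitstrings."""
    common = sum(1 for x, y in zip(bits_a, bits_b) if x and y)
    union = sum(bits_a) + sum(bits_b) - common
    return common / union if union else 1.0   # assumed convention for empty bitstrings

def rank_by_similarity(query, database, p=2):
    """Return (name, similarity) pairs sorted from most to least similar."""
    dist = {name: minkowski(query, vec, p) for name, vec in database.items()}
    d_max = max(dist.values())
    scored = [(name, similarity_from_distance(d, d_max)) for name, d in dist.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical three-element descriptors for a query and three database compounds.
query = [1.0, 0.0, 2.0]
database = {
    "cpd_1": [1.0, 0.0, 2.0],
    "cpd_2": [2.0, 1.0, 2.0],
    "cpd_3": [4.0, 4.0, 2.0],
}
ranking = rank_by_similarity(query, database, p=1)
print(ranking)
print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))   # 2 common bits, 3 bits in the union
```

Changing p or replacing the descriptor vectors changes the ranking, which is the practical reason why the choice of descriptor and similarity measure has to be revisited in every project.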
Different ranked lists are obtained from different similarity searching methods. The different results originate from different molecular descriptors and different similarity measures. The particular choice of a distance or similarity criterion and the selection of molecular descriptors are subject to a learning process in each medicinal chemistry project. Similarity-based VS of large databases and virtual libraries needs representations of the molecules that are both effective and efficient, i.e., they must be able to differentiate between molecules that are different, and they must be quick to calculate. An example of different similarity searching results obtained with the same query structure (midazolam) is given in Fig. 2.6. In this example the
Fig. 2.6. Structures retrieved by similarity searching taking Midazolam (left) as the query structure. Top line: Tanimoto/Daylight method; Bottom line: CATS method.
molecules were coded by two different descriptors and also compared by two different similarity measures. In Fig. 2.6a the common Daylight chemical fingerprints served as a molecular representation,38 and the Tanimoto coefficient (Eq. 2.9) was used for similarity searching. In Fig. 2.6b a simple topological Pharmacophore descriptor was used together with the Euclidean distance measure (Eq. 2.6).39 The idea of the particular topological Pharmacophore representation is illustrated in Fig. 2.7. Pharmacophore models are particularly useful for drug design purposes and widely applied molecular representations (Definition 2.3).40 The idea is to consider a set of generalized atom types—e.g., H-bond donors and acceptors, lipophilic and charged groups—and their constellation in space, i.e., distances between atom type centers, as a “fingerprint” of a molecule. It is hoped that this abstraction from chemical structure (“meta-description”) represents function-determining molecular features, and facilitates grouping of isofunctional compounds (Fig. 2.8). An in-depth treatment of 2D and 3D Pharmacophore modeling is beyond the scope of this book. Much research has been done in this very important and active area of virtual screening. The interested reader is referred to the literature.40,41 The usefulness of topological pharmacophores for similarity searching will be demonstrated along with a worked example in the following. Their particular advantage is that they can be quickly calculated and no 3D alignment of conformers is required.
Definition 2.3 A Pharmacophore or pharmacophoric pattern is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response. (IUPAC recommendation 1997)42 •
Fig. 2.7. Coding a chemical structure by the CATS topological atom type descriptor. A 2D molecular structure (a) is converted to the molecular graph (b), generalized atom types are assigned (c), and the frequency of all atom pairs with a distance between 1 and 10 bonds is determined (d). For five atom types (lipophilic, L; hydrogen bond donor, D; hydrogen bond acceptor, A; positively charged, P; negatively charged, N) there are 15 possible pairs, resulting in a 15 x 10 = 150-dimensional histogram representing a molecular structure. In (d) an L-A pair over nine bonds is shown.
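The counting scheme outlined in the caption of Fig. 2.7 can be sketched as follows. This is a simplified illustration and not the CATS software itself: it assumes that every atom carries at most one generalized type, that the molecular graph is supplied as an adjacency dictionary with integer atom indices, and that raw counts are stored without any scaling. The toy molecule and all names are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement

TYPES = "LDAPN"   # lipophilic, donor, acceptor, positive, negative
PAIRS = ["".join(p) for p in combinations_with_replacement(TYPES, 2)]  # 15 pairs
MAX_DIST = 10     # topological distances of 1..10 bonds are considered

def bond_distances(adjacency, start):
    """Breadth-first search: shortest path length (in bonds) from start to all atoms."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist

def atom_pair_histogram(adjacency, atom_types):
    """Count typed atom pairs per bond distance; returns a 15 x 10 = 150-element list."""
    histogram = [0] * (len(PAIRS) * MAX_DIST)
    order = {t: i for i, t in enumerate(TYPES)}
    for i in adjacency:
        if atom_types.get(i) is None:
            continue                      # untyped atoms only act as spacers
        for j, d in bond_distances(adjacency, i).items():
            if j <= i or atom_types.get(j) is None or not 1 <= d <= MAX_DIST:
                continue                  # count each unordered pair once
            pair = "".join(sorted((atom_types[i], atom_types[j]), key=order.get))
            histogram[PAIRS.index(pair) * MAX_DIST + (d - 1)] += 1
    return histogram

# Hypothetical four-atom chain 0-1-2-3 with types L, untyped, A, D.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
atom_types = {0: "L", 1: None, 2: "A", 3: "D"}
descriptor = atom_pair_histogram(adjacency, atom_types)
print(len(descriptor), sum(descriptor))   # 150 bins, 3 typed atom pairs
```

Because only shortest-path distances in the molecular graph are needed, no 3D conformers or alignments enter the calculation, which is the speed advantage mentioned in the text.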
The following example demonstrates straightforward database similarity searching using CATS. Ion channels are essential to a wide range of physiological functions such as neuronal signaling, muscle contraction, cardiac pacemaking, hormone secretion and cell proliferation.43 There is evidence that brain T-type calcium channels can modulate excitability and give rise to burst-firing in some CNS neurons. Recently, certain anti-epileptic drugs with anxiolytic properties have been reported to display significant T-channel blocking activity—thus indicating scopes of selective T-type channel blockers as novel neuropsychiatric therapeutics.44,45 Mibefradil has been described as the first T-type selective calcium channel blocking agent showing about 20-fold selectivity over L-type channels (structure 1 in Fig. 2.9).46,47 The application of this compound was however hampered by some drug-drug interaction liabilities. As a consequence, several lead finding initiatives were undertaken based on pharmacophoric features of mibefradil— aiming at a translation into novel scaffolds of lead structures with improved molecular properties and suited for high-speed chemical optimization. These new structures should then serve as starting points to develop novel brain- and heart-selective T-type calcium channel antagonists. The first goal of the lead identification approaches was thus to identify novel lead structures with affinity and selectivity for T-type calcium channels comparable to mibefradil—but with reduced structural complexity, low cytochrome-P450 interaction liability, and with molecular properties indicating scope for chemical optimization. Mibefradil served as the query structure to virtually screen the Roche corporate compound database using CATS. The twelve highest-ranking compounds, which passed certain molecular property filters, were selected and transmitted to the biological screening. 
Nine out of these twelve molecules (75%) showed significant T-type calcium channel antagonistic activity, with IC50 values comparable to the
Fig. 2.8. Schematic of the NAPAP-thrombin complex. On the left the most important interactions between the thrombin inhibitor NAPAP and the thrombin active site are shown. A simple Pharmacophore model of thrombin activity is given on the right. L: lipophilic, D: hydrogen-bond donor, A: hydrogen-bond acceptor, P: positively charged or ionizable group. Pharmacophore models can be used for similarity searching and de novo design exercises.
value of mibefradil. Some selected compounds are depicted in Fig. 2.9. The molecular architectures of the CATS hits are strikingly different from the mibefradil template—however certain common structural features are preserved, particularly in structures 2 and 3 given in Fig. 2.9. This common theme includes a central spacer or chain with an amino group—providing a positive charge at physiological pH—framed by two rather extended substructures, each containing an aromatic group. The topological length of the linker matches well with the one present in mibefradil. The CATS hit 4, which has the lowest similarity value of the hits shown, is most distinguished from mibefradil and has a carboxamido functional group in the central chain instead of amino functionality, which constitutes an (inducible) positive partial charge at this site. Compared to mibefradil, all three CATS hits are structurally less complex, with lower molecular weights (MW < 400) and lower lipophilicity values (clogP values: 4.6, 4.3 and 4.0 for compounds 2, 3 and 4; clogP = 6.1 for mibefradil 1). With respect to potential drug-drug interaction liabilities, the in silico similarity to a cytochrome-P450 binding model (Pharmacophore model) for 2, 3 and 4 are much lower than for 1 (F. Hoffmann-La Roche Ltd.; the prediction model was developed by H. Fischer, S. Poli, and M. Kansy; unpublished). The compounds offer broad scope for chemical optimization, and were further characterized in-depth (F. Hoffmann-La Roche Ltd.; W. Neidhart, T. Giller, G. Schmid, G. Adam; unpublished). In the frame of selection and characterization of analogues of hit 3, the cyclodecyl derivative 5 shown in Fig. 2.9 was found to exhibit the highest T-type calcium channel inhibitory activity of all structures of this type. Subsequent CATS similarity searching based on 5 as the template (query) structure gave rise to the tetrahydro-naphthalene structure 6 as a further hit. 
The compound is again a close analogue of mibefradil 1, and was prepared in a campaign to obtain improved analogues. This finding emphasizes the potential of iterative similarity searching to transgress the borders of chemical scaffolds—in identifying and translating molecular determinants—to obtain novel biologically active molecules.
Feature Extraction Methods
A common theme in molecular feature extraction is the transformation of raw data to a new co-ordinate system, where the axes of the new space represent “factors” or “latent variables”:
Fig. 2.9. Structure of mibefradil (1), a calcium channel blocking agent, and four selected isofunctional hits (2-5), which were retrieved by virtual database screening using the CATS software. Taking structure 5 as a query for CATS, a closely related structure (6) to mibefradil is retrieved. RTTC: recombinant T-type calcium channel, FLIPR: fluorometric imaging plate reader.
features that might help to explain the shape of the original distribution. By far the most widely applied statistical feature extraction method in drug design belongs to the class of factorial methods: principal component analysis (PCA).12,48,49 PCA performs a linear projection of data points from the high-dimensional space to a low-dimensional space. In addition to PCA, nonlinear projection methods like self-organizing maps (SOM), encoder networks, and Sammon mapping are sometimes employed in drug design projects.50 Since none of these methods requires knowledge of target values (e.g., inhibition constants, properties) or class membership (e.g., active/inactive assignments), they are termed “unsupervised”. Unsupervised procedures can be used to perform a first data analysis step, complemented by supervised methods later during the adaptive molecular design process.
Principal component analysis. PCA performs a projection of the m-dimensional data matrix X down to a d-dimensional subspace by means of the projection matrix L^T, yielding the object co-ordinates in this subspace, S (Eq. 2.10; Eq. 2.11). S is termed the score matrix with n rows (objects, molecules) and d columns (principal components). L is termed the loading matrix with d columns and m rows, and T denotes the matrix transpose.
Fig. 2.10. Principal components (PC) of a set of two-dimensional data. The original co-ordinate system is spanned by ξ1 and ξ2. The orthogonal score vectors s1 and s2 are calculated according to the criterion of maximum variance.
$$ \mathbf{X} = \mathbf{S}\,\mathbf{L}^{T} \qquad \text{(Eq. 2.10)} $$

$$ \mathbf{S} = \mathbf{X}\,\mathbf{L} \qquad \text{(Eq. 2.11)} $$

The principal components (PC) are determined on the basis of the maximal variance criterion, i.e., the first PC represents a regression line along the direction of maximal data variance, the second PC is orthogonal to the first PC and points along the direction of maximal remaining variance, and so on (Fig. 2.10). Accordingly, most of the total data variance is contained in (“explained by”) the first PCs. The loading matrix contains the regression coefficients of each PC with the original axes (here: molecular descriptors), and the new co-ordinates (PC) are linear combinations of the original variables. Loadings plots give the correlation of the original variables with selected principal components. Graphical inspection or interpretation of the (usually varimax-rotated) loadings matrix can be a helpful step towards an understanding of essential function-determining molecular features or pharmacophores (Definition 2.3). If all PCs are used in the model, 100% of the original data variance is explained. The sum of the squared loadings is also called the eigenvalue of a principal component. For data reduction or visualization, usually only the first two or three PCs are used, i.e., the PCs with the greatest eigenvalues. Further details about PCA and related approaches can be found in the literature.12,51 Fig. 2.11a gives the projection of 73 UGI reaction products in the plane spanned by the first two principal components derived from PCA of a 150-dimensional topological pharmacophore space (CATS atom pair descriptor). The UGI reaction is shown in Fig. 2.11. This score plot is a linear projection of the original high-dimensional space. There is a clear separation of active molecules (IC50 < 1 µM) and inactive molecules (IC50 > 10 µM) as judged from a thrombin inhibition assay. Based on this projection one could argue that the molecular descriptor seems to be useful to find thrombin inhibitors.
Fig. 2.11. Projections of the distribution of 73 molecules generated by the four-component UGI reaction (top). Structures were encoded by the 150-dimensional CATS topological atom pair descriptor, the plots represent projections to a two-dimensional space. Shading indicates activity in a thrombin binding assay (black: IC50 < 1µM; white: IC50 > 10µM). a) PCA projection, b) Sammon map, c) encoder network projection (see Fig. 2.12b), d) toroidal SOM containing (7 x7) neurons. The molecules were synthesized and assayed at F.Hoffmann La-Roche Ltd.112
A simple but useful algorithm for PCA is given by the NIPALS (nonlinear iterative partial least squares) technique. The following description of the algorithm is taken from Otto.12
NIPALS Algorithm for Principal Component Analysis
Step 0: Scale the raw data matrix by the mean and normalize to length one.
Step 1: Estimate the loading vector l^T.
Step 2: Compute the score vector: s = X l. Compare the new and the old score vector. If the deviations of the elements of the two vectors are within a threshold (e.g., < 10^-5) then go to Step 5, otherwise go to Step 3.
Step 3: Compute new loadings: l^T = s^T X. Normalize the loading vector to length one: l^T = l^T / (l^T l)^(1/2).
Step 4: Repeat from Step 2 if the number of iterations does not exceed a predefined threshold (e.g., 100 iterations); otherwise go to Step 5.
Step 5: Determine the matrix of residuals: E = X - s l^T. If the number of principal components is equal to the number of previously fixed or desired components then go to Step 7; otherwise continue at Step 6.
Step 6: Use the matrix of residuals E as the new X-matrix and compute additional principal components s and loadings l^T by means of Step 1.
Step 7: As a result, the matrix X is represented by a principal component model according to Eq. 2.10.
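The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the input matrix is assumed to be column-centred already (Step 0), the initial loading estimate in Step 1 is taken as a unit vector along the column with the largest sum of squares (one of several possible choices), and the toy data matrix is hypothetical.

```python
import math

def nipals(X, n_components, tol=1e-5, max_iter=100):
    """Return (scores, loadings), each a list with one vector per component."""
    E = [row[:] for row in X]                  # residual matrix, starts as X
    n, m = len(E), len(E[0])
    scores, loadings = [], []
    for _component in range(n_components):
        # Step 1: initial loading estimate (assumed choice, see lead-in)
        col_ss = [sum(E[i][k] ** 2 for i in range(n)) for k in range(m)]
        l = [0.0] * m
        l[col_ss.index(max(col_ss))] = 1.0
        s_old = None
        for _iteration in range(max_iter):     # Step 4: iteration limit
            # Step 2: score vector s = X l, then convergence check
            s = [sum(E[i][k] * l[k] for k in range(m)) for i in range(n)]
            if s_old is not None and max(abs(a - b) for a, b in zip(s, s_old)) < tol:
                break
            s_old = s
            # Step 3: new loadings l^T = s^T X, normalized to length one
            new_l = [sum(s[i] * E[i][k] for i in range(n)) for k in range(m)]
            norm = math.sqrt(sum(v * v for v in new_l))
            if norm == 0.0:                    # no variance left in the residuals
                break
            l = [v / norm for v in new_l]
        # Scores consistent with the final loading vector
        s = [sum(E[i][k] * l[k] for k in range(m)) for i in range(n)]
        # Step 5: residuals E = X - s l^T (Step 6: they become the new X)
        E = [[E[i][k] - s[i] * l[k] for k in range(m)] for i in range(n)]
        scores.append(s)
        loadings.append(l)
    return scores, loadings

# Hypothetical column-centred data: four objects, two variables.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
scores, loadings = nipals(X, n_components=2)
total = sum(x * x for row in X for x in row)
explained = [sum(v * v for v in s) / total for s in scores]
print(explained)   # roughly 0.9 and 0.1 for this toy matrix
```

Each component is extracted from the residuals of the previous one, so the loading vectors come out mutually orthogonal and the explained-variance fractions decrease from the first component onward.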
Sammon Mapping
The aim of Sammon’s algorithm is to project points from a high-dimensional (m-dimensional) space to a low-dimensional (n-dimensional) space, which usually is two-dimensional. It is conventionally applied to exploratory data analysis. The original algorithm is an iterative method based on a gradient search.52 It finds a data distribution in the n-dimensional target space so that as much as possible of the original distribution in the m-dimensional space is preserved. In this non-linear mapping (NLM), the inter-point distances between vectors in the lower-dimensional space approximate the corresponding distances in the original m-dimensional space. (Note: The idea of Kruskal’s mapping procedure is very similar to Sammon’s mapping algorithm: again the inter-point distances in the n-dimensional space approximate the corresponding distances in the m-dimensional space, but the original distances are transformed by some monotonic, increasing function.53) The basis for mapping is given by the inter-point distance matrix. Multidimensional scaling (MDS) is a related technique which is also based on a similarity or dissimilarity matrix. A thorough comparison between Sammon’s and Kruskal’s mapping and MDS can be found elsewhere.54 Sammon’s mapping is an optimization procedure starting from an initial configuration of n-dimensional vectors (e.g., randomly chosen or by taking the n columns of the m-dimensional matrix X with maximum variances). A reasonable error function E is called “Sammon’s stress” (Eq. 2.12), measuring how well a distribution of k points in the n-space matches the distribution of k points in the m-space (i.e., the difference between the distance matrix of the original vector set, d, and that of the projected vector set, δ):

$$ E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j}^{k} \frac{\left( d_{ij} - \delta_{ij} \right)^2}{d_{ij}} \qquad \text{(Eq. 2.12)} $$
An optimization algorithm is applied to decrease the stress, e.g., a steepest descent method to search for the minimum of the error function. Having found the distribution in the n-space after the t-th iteration, the new setting at time t+1 is given by Equation 2.13.
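$$ y_{pq}(t+1) = y_{pq}(t) - \eta\,\Delta_{pq}(t) \qquad \text{(Eq. 2.13)} $$

where $y_{pq}$ is the q-th co-ordinate of the p-th point in the n-space, η is the learning rate, and

$$ \Delta_{pq}(t) = \frac{\partial E(t)}{\partial y_{pq}(t)} \Bigg/ \left| \frac{\partial^2 E(t)}{\partial y_{pq}(t)^2} \right| \qquad \text{(Eq. 2.14)} $$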
Of course, several optimization methods can be used to find a minimum of the error function E. After optimization a Sammon map represents the relative distances of vectors in a high-dimensional space and is thus useful in determining the number and the shapes of clusters, as well as their relative distances. Fig. 2.11b shows the distribution of the UGI reaction products obtained by our implementation of Sammon’s mapping. In contrast to the gradient search technique originally proposed by Sammon, we have applied a (1,λ) evolution strategy to this task. Again the active molecules (black spots) are separated from the inactives (white circles). Compared to the PCA result given in Fig. 2.11a, the map shows a broader distribution of the data and allows for the identification of two clusters among the inactive compounds, which are not visible in the PCA score plot. Clearly, the nonlinear map provides greater individual detail compared to PCA. By preserving the inter-point distances of the original samples, NLM is able to represent the topology and structural relationships of a data set. Despite this appealing feature we must keep in mind that the projection does lead to some loss of information, and the resulting map can be misleading.54 The Sammon projection map complements those obtained with the SOM algorithm (vide infra), auto-associative neural networks (AANN), multilayer perceptrons (MLP, see Chapter 3) and the principal component (PC) feature extractor. Lerner demonstrated for the example of chromosome classification that Sammon’s (unsupervised) mapping is superior to classification based on the AANN and PC feature extractor and highly comparable with that based on the (supervised) MLP.55 Further thorough comparison of these and other supervised and unsupervised methods and application to a wide range of classification and feature extraction tasks must be performed to substantiate these findings. Additional information and different approaches to the NLM task can be found elsewhere.56-59
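A small sketch may help to make the stress function and its minimization concrete. Note the assumptions: the optimizer below is a plain first-order gradient descent with step halving, which is neither Sammon's original update scheme nor the (1,λ) evolution strategy used for Fig. 2.11b; all function names and the toy data are hypothetical; and the original points are assumed to be pair-wise distinct, so that no original distance is zero.

```python
import math
import random

def distances(points):
    """Matrix of Euclidean inter-point distances."""
    return [[math.dist(p, q) for q in points] for p in points]

def sammon_stress(d, delta):
    """Sammon's stress between original distances d and projected distances delta."""
    k = len(d)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    total = sum(d[i][j] for i, j in pairs)
    return sum((d[i][j] - delta[i][j]) ** 2 / d[i][j] for i, j in pairs) / total

def stress_gradient(d, Y):
    """Partial derivatives of the stress with respect to every map co-ordinate."""
    k, dim = len(Y), len(Y[0])
    delta = distances(Y)
    c = sum(d[i][j] for i in range(k) for j in range(i + 1, k))
    grad = [[0.0] * dim for _ in range(k)]
    for p in range(k):
        for j in range(k):
            if j == p or delta[p][j] == 0.0:
                continue                       # skip coincident map points
            w = (d[p][j] - delta[p][j]) / (d[p][j] * delta[p][j])
            for q in range(dim):
                grad[p][q] -= 2.0 / c * w * (Y[p][q] - Y[j][q])
    return grad

def sammon_map(X, dim=2, eta=0.5, n_iter=200, seed=0):
    """Project X to dim dimensions; a step is accepted only if the stress decreases."""
    rng = random.Random(seed)
    d = distances(X)
    Y = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in X]   # random start
    stress = sammon_stress(d, distances(Y))
    for _ in range(n_iter):
        g = stress_gradient(d, Y)
        trial = [[y - eta * gy for y, gy in zip(row, g_row)] for row, g_row in zip(Y, g)]
        trial_stress = sammon_stress(d, distances(trial))
        if trial_stress < stress:
            Y, stress = trial, trial_stress
        else:
            eta *= 0.5                         # step too large: halve the learning rate
    return Y, stress

# Hypothetical data: five points in three dimensions.
X = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
Y, final_stress = sammon_map(X)
print(final_stress)
```

Since only the inter-point distance matrix of the original data enters the stress, the same code works for any descriptor space in which a distance can be computed.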
Encoder Networks
A neural network approach to the task of data projection termed “encoder” networks or ReNDeR (Reversible Non-linear Dimension Reduction) networks is of growing interest for data visualization and nonlinear feature extraction.60 The architecture of an encoder network is illustrated in Fig. 2.12a. The network is symmetrical around a central parameter layer. The number of input and output neurons is defined by the dimension of the data vectors, and the idea of the approach is to reproduce the input patterns at the output layer (auto-association) via an internal representation which is of lower complexity than the original data. In Fig. 2.12 the parameter layer consists of only two neurons (although this layer can have an arbitrary number of neurons) for a reduction of the data to only two dimensions (“factors”). Once the
Fig. 2.12. Encoder networks for nonlinear mapping of high-dimensional data. Neurons are drawn as circles, weights are represented by lines. Input neurons (white) are fan-out units, hidden-layer units (black) have a sigmoidal or linear activity, and the output neurons (gray) are linear. a) Symmetrical network architecture attempting to reproduce the input patterns by going through a low-dimensional internal representation. Factor 1 and Factor 2 are the score values (co-ordinates) in the low-dimensional (here: two-dimensional) map. b) Conventional feed-forward network with two output neurons. The outputs represent the low-dimensional scores.
network weights are optimized by conventional supervised training techniques, the input data can be described by the output values of the neurons in the parameter layer. Two or three neurons are especially useful for graphical display. If no intermediate hidden layers are used and the neurons have a linear transfer function, the factors found by the encoder network are equivalent to principal components.61 The presence of hidden layers containing non-linear neuron activities enables the system to perform non-linear mappings of the data, quite similar to Kohonen mapping.62 This “nonlinear PCA” seems to be especially suited for mapping brain activities in cognitive science.63 An attractive feature of encoder networks is the possibility to (re)construct data of the original dimension directly from low-dimensional projections. In a two-dimensional map derived from high-dimensional QSAR data, for example, any position is directly linked to the corresponding original space. This might be useful for compound selection with a desired property (or activity) by inspection of a low-dimensional graphical display. However, this “molecular design” technique is still in its infancy. First applications, advantages and drawbacks of the method have been critically discussed and can be found elsewhere.60,64-66 The use of feed-forward neural networks for data mapping has recently been highlighted by Agrafiotis and Lobanov.54 The idea is very similar to the encoder network approach
described above, but in contrast to the network architecture shown in Figure 2.12a, their network is non-symmetrical (Fig. 2.12b). The number of output neurons determines the dimension of the projection. The network is trained in a conventional supervised manner aiming at a minimization of the mapping error E (Eq. 2.12) or similar error functions, e.g., Kruskal’s stress. We have implemented such a system, again employing the (1,λ) evolution strategy for network training (F. Hoffmann-La Roche Ltd.; O. Roche, G. Schneider; unpublished). Fig. 2.11c gives the result for our example, the projection of UGI reaction products from a 150-dimensional space to the plane. Our mapping network contained 150 input neurons, three sigmoidal hidden neurons, and two linear output neurons. The distribution looks very similar to the Sammon map shown in Figure 2.11b, thereby supporting this NLM. However, in contrast to the Sammon procedure, where the mapping function remains unknown, the trained network represents the nonlinear mapping function. A new sample can therefore also be projected to the low-dimensional display. Compared to the original encoder approach (Fig. 2.12a) the network is smaller and more suitable for optimization. A drawback of this technique and the Sammon map is that we do not know the meaning of the display axes, which is not the case for PCA. These approaches therefore nicely complement each other.
Self-Organizing Networks
Complementing the visualization techniques described in this Chapter, the self-organizing map (SOM) has proven its usefulness for drug discovery, in particular for the tasks of data classification, feature extraction, and visualization. Therefore this method will be described in some more detail. The SOM belongs to the class of unsupervised neural networks and was pioneered by Kohonen in the early 1980s.67 Among other applications, e.g., in robotics,68 it can be used to generate low-dimensional, topology-preserving projections of high-dimensional data. SOMs contain only a single layer of neurons. In contrast to the supervised, multi-layered ANN discussed in other Chapters, the neurons of an SOM do not compute an output value from incoming signals. Rather, they represent vectors of the same dimension as the input patterns (Fig. 2.13) and adopt either an “active” or an “inactive” state. For data processing the input pattern (a molecular descriptor vector) is compared to all neurons in the output layer, and the one neuron vector that is most similar to the input pattern, the so-called “winner neuron”, fires a signal, i.e., it is active. All other neurons are inactive. In this way, each pattern is assigned to exactly one neuron. The data patterns belonging to a neuron form a cluster, as they are more similar to their neuron than to any other neuron of the SOM. During the SOM training process (an optimization procedure following the principles of unsupervised Hebbian learning) the original high-dimensional space is tessellated, resulting in a certain number of data clusters. As many clusters are formed as there are neurons in the SOM. The neurons represent prototype vectors of each cluster. This process is similar to vector quantization (Fig. 2.14), and the resulting prototype vectors capture features in the input space that are unique for each data cluster. Feature analysis can be done, e.g., by comparing adjacent neurons.
Kohonen’s algorithm represents a strikingly efficient way of mapping similar patterns, given as vectors close to each other in input space, onto contiguous locations in the output space.67 This is achieved by introducing a topology to the SOM neuron layer. The simplest topology is a chain of neurons, followed by a two-dimensional grid. Topological mapping can be achieved by two simple rules:
i. Locate the best-matching neuron (winner neuron).
ii. Increase matching at this unit and its topological neighbors.
For the first rule only vector distances between the input patterns ξ and the neurons w must be calculated (Eq. 2.15). The number of comparisons needed depends linearly on the size of the self-organizing system C which can be expressed by the number of neurons, c.
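The first rule can be illustrated with a minimal sketch. It assumes Euclidean distance as the similarity measure; the function names and the toy vectors are hypothetical and chosen only for illustration. One distance is computed per neuron, so the effort grows linearly with the number of neurons.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_winner(pattern, neurons):
    """Return the index of the neuron vector closest to the input pattern.

    Exactly one distance is evaluated per neuron (linear in map size).
    """
    return min(range(len(neurons)), key=lambda c: euclidean(pattern, neurons[c]))

# Toy example: three two-dimensional neuron vectors and one input pattern
neurons = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.5]]
pattern = [0.9, 1.2]
winner = find_winner(pattern, neurons)
print(winner)  # index of the best-matching neuron
```

Other distance measures could be substituted in `euclidean` without changing the search itself.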
Adaptive Systems in Drug Design
Fig. 2.13a. Architecture of a self-organizing map (SOM). Network containing (6 x 5) = 30 neurons. Each neuron is a four-dimensional vector represented by a stack of four cubes. An input signal (pattern vector ξ) leads to a response of a single neuron (“winner-takes-all”, gray-colored). Usually the top-down view of an SOM is shown.
Fig. 2.13b. Architecture of a self-organizing map (SOM). A toroidal SOM (top-down view). The neurons in the first and the second neighborhood to the gray-shaded neuron are indicated by black lines. The star symbol is in the second neighborhood of the neuron.
\[ s(\xi) = \arg\min_{c} \lVert \xi - w_c \rVert \] (Eq. 2.15)
The second rule requires an updating procedure to adapt the vector elements of the winner neuron s and its topological neighbors (Eq. 2.16), where Ns is the topological neighborhood around the neuron s and ε = ε(d(c,s),t) is a learning rate depending on both the topological distance d(c,s) between s and the neuron c, and on the time t. In this context, time is usually measured in number of input patterns presented to the network. In many applications a Gaussian
Analysis of Chemical Space
Fig. 2.14. Principle of vector quantization. In this example, two-dimensional data vectors (pattern vectors; open arrowheads) form two distinct clusters. During the vector quantization process neuron vectors (filled arrowheads) move toward the centers of the clusters, thereby forming the cluster centroids.
neighborhood function is used. A toroidal neuron topology can be used to avoid some boundary problems inherent to a planar topology (Fig. 2.13).67,69,70
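The effect of a toroidal topology and of a Gaussian neighborhood can be sketched as follows. This is an illustration under stated assumptions: the topological distance is taken as the sum of row and column offsets on the grid, and the function names are hypothetical.

```python
import math

def grid_distance(a, b, rows, cols, toroidal=True):
    """Topological distance between grid positions a=(row, col) and b=(row, col).

    Assumption: distance is the sum of row and column offsets. On a torus
    each offset may wrap around the map edge, whichever way is shorter.
    """
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    if toroidal:
        dr = min(dr, rows - dr)
        dc = min(dc, cols - dc)
    return dr + dc

def gaussian_neighborhood(dist, sigma):
    """Weight in (0, 1] that decays with topological distance from the winner."""
    return math.exp(-(dist ** 2) / (2.0 * sigma ** 2))

# Opposite corners of a (6 x 5) map: far apart on a plane, close on a torus
print(grid_distance((0, 0), (5, 4), 6, 5, toroidal=False))
print(grid_distance((0, 0), (5, 4), 6, 5, toroidal=True))
```

On the planar map the corner neurons have fewer neighbors than neurons in the center; on the torus every neuron has the same neighborhood structure.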
\[ \Delta w_c = \varepsilon(d(c,s),t)\,(\xi - w_c) \quad \text{for all } c \in N_s \] (Eq. 2.16)
The complete SOM algorithm can be formulated as follows:
The Self-Organizing Map Algorithm
Step 1. Initialize the self-organizing map A to contain N = N1 * N2 neurons ci:
\[ A = \{c_1, c_2, \ldots, c_N\} \]
with reference vectors chosen randomly according to p(ξ) from the set of training patterns. Initialize the connection set C to form a rectangular N1 x N2 grid. Initialize the time parameter t = 0.
Step 2. Generate at random an input signal ξ according to p(ξ).
Step 3. Determine the winner neuron s according to Equation 2.15.
Step 4. Adapt each neuron r according to
\[ \Delta w_r = \varepsilon(t)\, h_{rs}\, (\xi - w_r) \]
where the Manhattan (city-block) distance d1 is used to measure the neuron-to-neuron distance on the SOM grid, and a Gaussian neighborhood around the winner neuron s is used
\[ h_{rs} = \exp\left( -\frac{d_1(r,s)^2}{2\sigma^2} \right) \]
with the standard deviation of the Gaussian:
\[ \sigma(t) = \sigma_i \left( \sigma_f / \sigma_i \right)^{t/t_{max}} \]
and
\[ \varepsilon(t) = \varepsilon_i \left( \varepsilon_f / \varepsilon_i \right)^{t/t_{max}} \]
The time-dependent calculation of σ and ε requires initial values (σi, εi) and final values (σf, εf) that must be defined prior to SOM training.
Step 5. Increase the time parameter: t = t + 1.
Step 6. If t < tmax then continue with Step 2, otherwise terminate.
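The six steps above can be sketched in a few lines of code. This is a minimal illustration, not the implementation used by the authors. It assumes Euclidean distance for the winner search, a grid distance equal to the sum of row and column offsets, and an exponential decay of ε and σ from their initial to their final values. All function names, parameter defaults, and the toy data are hypothetical. The helper `mean_quantization_error` averages the distance between each pattern and its winner neuron vector, which is one way to judge the quality of the trained map.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def winner(xi, weights):
    """Step 3: index of the neuron vector closest to the pattern xi."""
    return min(range(len(weights)), key=lambda c: distance(xi, weights[c]))

def train_som(data, n1, n2, t_max, eps=(0.5, 0.01), sigma=(3.0, 0.2), seed=0):
    """Train a planar n1 x n2 SOM; eps and sigma are (initial, final) pairs."""
    rng = random.Random(seed)
    # Step 1: rectangular grid; reference vectors copied from random patterns
    grid = [(i, j) for i in range(n1) for j in range(n2)]
    weights = [list(rng.choice(data)) for _ in grid]
    for t in range(t_max):                      # Steps 5 and 6: count t up to t_max
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac          # assumed decay schedule
        sigma_t = sigma[0] * (sigma[1] / sigma[0]) ** frac  # assumed decay schedule
        xi = rng.choice(data)                   # Step 2: random input signal
        s = winner(xi, weights)                 # Step 3: winner neuron
        for r, (i, j) in enumerate(grid):       # Step 4: adapt every neuron
            d = abs(i - grid[s][0]) + abs(j - grid[s][1])   # grid distance to winner
            h = math.exp(-d * d / (2.0 * sigma_t ** 2))     # Gaussian neighborhood
            weights[r] = [w + eps_t * h * (x - w) for w, x in zip(weights[r], xi)]
    return weights

def mean_quantization_error(data, weights):
    """Mean distance between each pattern and its winner neuron vector."""
    return sum(distance(x, weights[winner(x, weights)]) for x in data) / len(data)

# Toy data: two well-separated two-dimensional clusters (illustrative only)
rng = random.Random(1)
data = [[cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)]
        for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for _ in range(20)]
weights = train_som(data, n1=2, n2=2, t_max=2000)
print(round(mean_quantization_error(data, weights), 3))
```

The wide initial neighborhood moves all neurons together and orders the map; as σ shrinks, the update approaches a pure winner-takes-all adaptation and the neuron vectors settle near local centroids of the data.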
In Fig. 2.15 some snapshots of an SOM training process are shown. In this example a two-dimensional neuron grid adapts to a two-dimensional data distribution (in this simplified example no dimensionality reduction takes place). As a result, topologically adjacent neurons correspond to adjacent input patterns. The “winner-takes-all” SOM training algorithm forces the weight values of the network to move towards centroids of the data distribution and become a set of prototype vectors. All data points located within the “receptive field” of an output neuron will be assigned to the same cluster. The receptive fields of a fully trained unsupervised neural network (UNN) are comparable to the areas defined by Voronoi tessellation (or Dirichlet tessellation) of the input space.61,71 All data vectors that are closer to the weight vector of one neuron than to any other weight vector belong to its receptive field. The mapping error can be defined by the mean quantization error, mqe (Eq. 2.17),72 where N is the total number of molecules used for SOM training, Rc is the receptive field of a neuron c, ξ is the m-dimensional molecular descriptor, and w is the m-dimensional cluster centroid (neuron vector):
\[ mqe = \frac{1}{N} \sum_{c} \sum_{\xi \in R_c} \lVert \xi - w_c \rVert \] (Eq. 2.17)
Many variations and extensions of Kohonen’s algorithm have been published since his original paper appeared. For a recent overview, see for example the volume edited by Oja and Kaski.73 One major limitation of the original SOM algorithm is that the dimension of the output space and the number of neurons must be predefined prior to SOM training. Self-organizing networks with adapting network size and dimension provide more advanced and sometimes more adequate solutions to data mining and feature extraction.72 A disadvantage of Kohonen networks can be the comparatively long training time needed, especially if large data sets are used, since every data vector must be presented several times to the network for weight adaptation. Hybrid multi-layered UNN that can be trained extremely fast on very large data sets have already been developed.71,74,75 These systems can provide an alternative classification tool to Kohonen networks if real-time or on-line computation is required, e.g., for control and analysis of HTS results. Usually they contain more than two layers of neurons, where from layer to layer a more subtle data classification is performed, and only parts of the network are adapted during one training cycle. This reduces the training time needed, since a data vector is not compared to every weight vector as in classical Kohonen networks. Especially combinations of
Fig. 2.15. Stages of SOM adaptation. A planar (10 x 10) SOM was trained to map a two-dimensional data distribution (small black spots). The receptive fields of the final map are indicated by Voronoi tessellation in the lower left projection. A and B denote two “empty” neurons, i.e., there are no data points captured by these neurons. The simulation was performed using the SOM tutorial software written by H.S. Loos and B. Fritzke;113 Figures were adapted from the graphical www output.
supervised and unsupervised learning techniques are under steady development and represent a very active area of current research. The authors of this book are convinced that such systems will become an indispensable part of bio- and chemoinformatics in the field of drug discovery.
Several practical applications of the SOM to compound classification, drug design, and chemical similarity searching are described in the following part of this Chapter. Figure 2.11d gives a (10 x 10) SOM obtained for the data set of UGI reaction products. A striking advantage of the SOM over the other three projections shown in Figure 2.11 is the automatic classification of data, i.e., cluster detection and definition of cluster boundaries. Black neurons contain inhibitors, white neurons contain inactive compounds. The gray-shaded square (1/3) represents a mixed cluster. For subsequent similarity searching, we can use the neuron vectors representing the characteristic features of strong thrombin binders: taking the vector of neuron (2/4) as the query for searching the 33k entries of the MedChem Database (version 1997, distributed by Daylight Chemical Information Systems Inc., Irvine, CA, USA), the well-known nanomolar thrombin inhibitors PPACK and Argatroban were retrieved (Fig. 2.16). Both molecules, PPACK and Argatroban, differ in their scaffold architecture from the UGI products. This example demonstrates that the CATS topological atom type descriptor can be useful for “backbone hopping”, and that the SOM technique is a useful means to extract function-determining molecular features. Several variations of this principle have been developed and successfully applied to retrieving novel active structures from databases.76-78 Related descriptors of molecular topology have been developed and productively used in virtual screening experiments by several research groups.79-85 A fast and straightforward multiple pharmacophore approach, which builds on an ensemble of 3D hypotheses, has been published recently by Bradley and coworkers, addressing some of the limitations inherent to 2D techniques.86 Further information about pharmacophore extraction and related techniques can be found elsewhere.87-90
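The similarity search described above, in which a neuron vector serves as the query, amounts to ranking database entries by their distance to that prototype vector. The following sketch illustrates the idea under stated assumptions: Euclidean distance is used as the similarity measure, and the compound names and three-dimensional descriptor vectors are invented toy values, not CATS descriptors or entries of any real database.

```python
import math

def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, database, top_k=2):
    """Return the names of the top_k database entries closest to the query."""
    ranked = sorted(database.items(), key=lambda item: distance(query, item[1]))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical neuron (prototype) vector used as the query
query = [1.0, 0.0, 0.0]

# Hypothetical database of descriptor vectors
database = {
    "cmpd_A": [0.9, 0.1, 0.0],
    "cmpd_B": [0.1, 0.8, 0.3],
    "cmpd_C": [1.0, 0.0, 0.2],
    "cmpd_D": [0.0, 0.0, 1.0],
}

hits = rank_by_similarity(query, database)
print(hits)
```

Because the query is a prototype of a whole cluster rather than a single molecule, retrieved compounds need not share the scaffold of any training compound.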
Fig. 2.16. Structures of two potent thrombin inhibitors: PPACK (left) and Argatroban (right).
The plot shown in Fig. 2.17 was obtained by training a self-organizing map using the software tool NEUROMAP.91 It demonstrates the ability of the CATS descriptor to separate antidepressants from other drugs. All molecules were compiled from the Derwent World Drug Index (Derwent Information, London). Areas in the upper left corner of the SOM are dominated by antidepressant agents, whereas the remaining parts of the map are populated by “other” drug molecules. Again, the SOM can now be used as a coarse-grain virtual screening tool by projecting (“dropping”) new molecules onto the map, e.g., virtual combinatorial libraries or corporate database entries. To demonstrate this idea, three known antidepressants were omitted during the map training process, namely imipramine, fluoxetine, and NKP-608, which was developed by a research group at Novartis.92,93 Imipramine and fluoxetine were predicted to fall into the “antidepressants area”, and the recently identified NK1 (substance P-preferring tachykinin) receptor antagonist NKP-608 is located on an adjacent “activity island” on the map. Interestingly, NKP-608 is assigned to a cluster containing several classical antidepressants such as tianeptine, a known serotoninergic agent. It would now be interesting to test tianeptine derivatives for NK1 binding capability. Irrespective of the outcome of such assays, these virtual screening results support the applicability of the topological pharmacophore descriptor that was chosen for molecule representation in conjunction with the SOM: the three compounds given as examples in Figure 2.17 would have been identified as potential novel antidepressant agents.
Comparison of substance classes and compound libraries is a further application area of the SOM.
In the following example, this method was used to compare drugs to a compilation of “nondrugs” in “Ghose & Crippen”-space.94-96 Each molecule was coded by a 120-dimensional vector giving the counts of 120 molecular fragments defined by Ghose and coworkers.97,98 For graphical display the molecule distributions in this 120-dimensional two-point pharmacophore space were projected onto a toroidal map consisting of (15 x 15) = 225 neurons (clusters). To determine the raw classification accuracy of an SOM, the correlation coefficient, cc, according to Matthews was calculated (Eq. 2.18).99 In Equation 2.18, P is the number of positive correct predictions, N is the number of negative correct predictions, O is the number of false-positive predictions (overprediction), and U is the number of false-negative predictions (underprediction).
\[ cc = \frac{PN - OU}{\sqrt{(N+U)(N+O)(P+U)(P+O)}} \] (Eq. 2.18)
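The Matthews correlation coefficient can be computed directly from the four counts defined above. The sketch below uses the same symbols P, N, O, and U; the function name and the example counts are hypothetical and serve only as an illustration. Returning 0 when the denominator vanishes is a common convention and an assumption of this sketch.

```python
import math

def matthews_cc(P, N, O, U):
    """Matthews correlation coefficient from prediction counts.

    P: correct positive predictions, N: correct negative predictions,
    O: false positives (overprediction), U: false negatives (underprediction).
    """
    denom = math.sqrt((N + U) * (N + O) * (P + U) * (P + O))
    if denom == 0:
        return 0.0  # assumed convention for an undefined coefficient
    return (P * N - O * U) / denom

# Invented counts for illustration
print(round(matthews_cc(P=40, N=30, O=10, U=20), 3))
```

The coefficient equals 1 for a perfect classification, -1 if every prediction is wrong, and 0 for a classification that is no better than chance.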
Fig. 2.17. Virtual screening for potential antidepressants by a self-organizing map. The SOM represents a topology-preserving visualization of a high-dimensional chemical space spanned by 150 descriptors (CATS). The distribution of a set of 597 known antidepressants is indicated by gray-shading (white: only antidepressants; black: only other drugs; gray: mixed cluster). A separation of antidepressant agents and “other” drugs can be observed. The two known antidepressants imipramine and fluoxetine were predicted to fall in the “antidepressants area” [neuron (4/7) and neuron (4/9)], and the NK1 inhibitor NKP-608 is located on an “activity island” on the map [neuron (7/4)].
To see whether the descriptor is able to separate “drugs” from “nondrugs”, and thus may be used to analyze “drug-relevant” chemical space, an SOM was developed using Sadowski’s collection of drugs and nondrugs (Fig. 2.18).94 This data set was compiled from the WDI (4,998 drugs) and the Available Chemicals Directory (ACD; 4,282 nondrugs). For details about this data set and its limitations, see Chapter 4 and the original publication by Sadowski and Kubinyi.94 The SOM projection reveals a pronounced separation between drugs and nondrugs in Ghose & Crippen-space, as indicated by the light and dark areas in Fig. 2.18a. To estimate the classification ability of this SOM, a binary pattern class assignment was introduced (Fig. 2.18b): a cluster was regarded as “drug-like” (white color) if it contained more than 50% of drug molecules, otherwise it was regarded as belonging to the “nondrug-like” class (black color). This straightforward binary class assignment clearly shows a drug and a nondrug region, with 3,637 drugs (73%) and 3,155 nondrugs (75%) correctly classified. This corresponds to a Matthews correlation coefficient of cc = 0.48. From the observation of distinct
Fig. 2.18. SOM projection of a chemical space filled with 4,998 drugs and 4,282 nondrugs. The frequencies of 120 Ghose & Crippen fragments were used to encode each molecule. Each square represents a cluster of molecules (Voronoi region). Note that the (10 x 10) map forms a torus. Data sets courtesy of J. Sadowski. a) The ratio of drugs and nondrugs clustered is shown by gray-scale shading (white: pure nondrug cluster, black: pure drug cluster). b) Binary classification of the distribution shown in (a). The Matthews correlation coefficient for this classification is cc = 0.48.
preferred drug and nondrug regions we concluded that the molecular descriptor might be suited for the analysis of “drug-relevant” space. The binary prediction accuracy obtained is low compared to the much higher accuracy that can be achieved by supervised feature extraction techniques and more problem-specific molecular descriptors (see Chapters 3 and 4). It must be stressed that in this study the SOM was not intended to form a prediction system. The aim was to visualize the distributions of compound libraries in a high-dimensional space. The extension of this approach to a comparison of natural products and trade drugs and consequent virtual library design was done by Lee and Schneider.96 It was demonstrated that natural compounds provide interesting novel scaffold architectures, which can be used in combinatorial drug design approaches. However, in most cases the scaffolds will have to be modified to provide synthetic feasibility and stability and to prevent adverse pharmacokinetic effects. Taking such a natural scaffold in combination with synthetic side-chains might become a typical strategy in future drug design.100
A straightforward method for combinatorial library design using the SOM technique is illustrated in Figure 2.19. To determine the usefulness of a combinatorial library the members of the corresponding virtual library can be projected onto a map displaying the distribution of a set of reference compounds. In our example 5,726 trade drugs (from WDI) served as this reference set. The question was whether the scaffold structure shown in Figure 2.19 might provide a good starting point for the combinatorial design of drug-like structures. A virtual 40 x 40 = 1,600-member combinatorial library was built using 40 generic molecular building blocks. The CATS topological pharmacophore descriptor was used to encode these molecules. The map showing the distribution of the 5,726 trade drugs (Fig. 2.19 left) and the virtual library (Fig. 2.19 right) in CATS topological pharmacophore space reveals that apparently these two compound collections do not overlap. From this observation one may conclude that: i) either the scaffold structure does not represent a drug-like substrate, and therefore the two libraries do not significantly overlap; or ii) the virtual library complements the collection of
Fig. 2.19. SOM showing the distribution of 5,726 trade drugs (left) and a 40 x 40 = 1,600-member combinatorial library (right) in CATS topological pharmacophore space. Apparently these two compound collections do not overlap significantly.
trade drugs in such a way that a larger proportion of pharmacophore space could be accessed. Of course, these considerations are of a purely theoretical nature, and only a series of practical experiments will teach us whether this particular combinatorial library has drug-like or nondrug-like characteristics. This general approach has proven very useful in attempts to prioritize combinatorial libraries and scaffolds, and in the assessment of external vendor libraries to extend the corporate compound database.
The applicability of the SOM for mapping elements of protein structure, such as secondary structure elements or surface pockets, was demonstrated recently.101-103 Knowledge of the 3D structure of a target protein undoubtedly is a rich source of information for computer-aided drug design. Of special interest are the size and form of the active site, and the distribution of functional groups and lipophilic areas. Because the number of solved X-ray structures of proteins is rapidly increasing, and with it the amount of information available, it is desirable to address questions related to coverage of the protein structure universe, conserved patterns of functional groups, or common ligand binding motifs.104,105 It is evident that such an analysis cannot be performed by visual inspection of structural models only. Automatic procedures for analysis, prediction, and comparison of macromolecular structures, in particular potential binding sites in proteins, will be very helpful tools.106 One such implementation of a computational method developed at Roche includes four steps:103 i) automated detection of protein surface pockets; ii) generation of a property-encoded solvent accessible surface (SAS) for each pocket; iii) generation of correlation vectors of the SAS to obtain rotation- and
translation-invariant descriptors; and iv) SOM projection of these vectors onto a low-dimensional display. As a result, a two-dimensional map is obtained showing the distribution of surface cavities in a chemical property space. This method was originally applied to a set of 176 proteins from the Protein Data Bank (PDB)107 containing a catalytically active zinc ion in the active site. On the resulting SOM, the active site pockets were clearly separated from other surface cavities, with only a small degree of misclassification. A more detailed analysis revealed that the automated mapping of the active sites accurately reflects established enzyme classification. Such a projection and analysis technique can give new insight into local structural similarities between enzymes with completely different folds and functions. Furthermore, the SOM mapping technique allowed for the correct classification of surface pockets derived from proteins that were not contained in the training set. We are convinced that this and other similar techniques bear a significant potential for automated protein structure analysis and drug design.108 If possible, the analysis of macromolecular (target) features should parallel feature extraction from sets of known ligands to obtain desired novel designs.
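Why a correlation vector is invariant to rotation and translation can be seen from a small sketch. It is not the published Roche procedure, whose details are not given here. It assumes a simple autocorrelation scheme: every pair of surface points contributes the product of its two property values to a bin selected by the distance between the points. The function names, bin settings, coordinates, and property values are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two points in 3D."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_vector(points, properties, bin_width=1.0, n_bins=5):
    """Distance-binned autocorrelation of a property over surface points.

    Only pairwise distances enter, so rigid rotation or translation of the
    point set leaves the resulting vector unchanged.
    """
    vec = [0.0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            k = int(distance(points[i], points[j]) / bin_width)
            if k < n_bins:
                vec[k] += properties[i] * properties[j]
    return vec

# Four invented surface points with an invented property value each
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.5, 0.0), (0.0, 0.0, 3.5)]
properties = [1.0, 2.0, 0.5, 1.5]

# Rotate by 90 degrees about the z axis, then shift the whole pocket
moved = [(-y + 10.0, x - 3.0, z + 2.0) for x, y, z in points]

print(correlation_vector(points, properties))
print(correlation_vector(moved, properties))
```

Both calls give the same vector, so pockets can be compared without first superimposing the protein structures. Vectors of this kind have a fixed length and can serve directly as SOM input patterns.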
References
1. Walters WP, Stahl MT, Murcko MA. Virtual screening - An overview. Drug Discovery Today 1998; 3:160-178.
2. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Weinheim, New York: Wiley-VCH, 2000.
3. Lipinski CA, Lombardo F, Dominy BW et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 1997; 23:3-25.
4. Ghose AK, Viswanadhan VN, Wendoloski JJ. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quantitative characterization of known drug databases. J Comb Chem 1999; 1:55-68.
5. Teague SJ, Davis AM, Leeson PD et al. The design of leadlike combinatorial libraries. Angew Chemie Int Ed 1999; 38:3743-3748.
6. Wang J, Ramnarayan K. Toward designing drug-like libraries: A novel computational approach for prediction of drug feasibility of compounds. J Comb Chem 1999; 1:524-533.
7. Xu J, Stevenson J. Drug-like index: A new approach to measure drug-like compounds and their diversity. J Chem Inf Comput Sci 2000; 40:1177-1187.
8. Oprea TI. Property distribution of drug-related chemical databases. J Comput Aided Mol Design 2000; 14:251-264.
9. Bremermann HJ. What mathematics can and cannot do for pattern recognition. In: Grüsser OJ, Klinke R, eds. Zeichenerkennung durch biologische und technische Systeme. Berlin: Springer Verlag, 1973:31-45.
10. Lohmann R. Structure evolution in neural systems. In: Soucek B and the IRIS Group, eds. Dynamic, Genetic, and Chaotic Programming. New York: John Wiley & Sons Inc, 1992:395-411.
11. Haugeland J. Artificial Intelligence - The Very Idea. Cambridge: The MIT Press, 1989.
12. Otto M. Chemometrics - Statistics and Computer Application in Analytical Chemistry. Weinheim, New York: Wiley-VCH, 1999.
13. Clocksin WF, Mellish CS. Programming in PROLOG. 2nd ed. Berlin, Heidelberg: Springer, 1984.
14. Li D, Liu D. A Fuzzy PROLOG Database System. New York: John Wiley & Sons Inc, 1990.
15. King RD, Muggleton S, Lewis RA et al. Drug design by machine learning. Proc Natl Acad Sci USA 1992; 89:11322-11326.
16. King RD, Muggleton SH, Srinivasan A et al. Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc Natl Acad Sci USA 1996; 93:438-442.
17. King RD, Srinivasan A. The discovery of indicator variables for QSAR using inductive logic programming. J Comput Aided Mol Des 1997; 11:571-580.
18. Muggleton S. Inductive Logic Programming. London: Academic Press, 1992.
19. Morik K, Wrobel S, Kietz JU et al. Knowledge Acquisition and Machine Learning: Theory, Models, and Applications. London: Academic Press, 1993.
20. Schulze-Kremer S. Molecular Bioinformatics - Algorithms and Applications. Berlin, New York: Walter de Gruyter, 1996.
21. Plotkin GD. A note on inductive generalization. In: Meltzer B, Michie D, eds. Machine Intelligence. New York: American Elsevier, 1970:153-163.
22. King RD, Karwath A, Clare A et al. Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining. Yeast 2000; 17:283-293.
23. Turcotte M, Muggleton SH, Sternberg MJ. Automated discovery of structural signatures of protein fold and function. J Mol Biol 2001; 306:591-605.
24. King RG. A machine learning approach to the problem of predicting a protein’s secondary structure from its primary structure (PROMIS). Ph.D. Thesis, University of Strathclyde, Strathclyde, UK, 1988.
25. Taylor WR. The classification of amino acid conservation. J Theor Biol 1986; 119:205-221.
26. Schneider G, Wrede P. Signal analysis in protein targeting sequences. Protein Seq Data Anal 1993; 5:227-236.
27. Brunner M, Klaus C, Neupert W. The mitochondrial processing peptidase. In: von Heijne G, ed. Signal Peptides. Austin: RG Landes Company, 1994:73-86.
28. Schneider G, Sjöling S, Wallin E et al. Feature-extraction from endopeptidase cleavage sites in mitochondrial targeting sequences. Proteins 1998; 30:49-60.
29. Attwood T, Parry-Smith DJ. Introduction to Bioinformatics. Essex: Addison Wesley Longman Limited, 1999.
30. Durbin R, Eddy S, Krogh A et al. Biological Sequence Analysis. Cambridge: Cambridge University Press, 1998.
31. Spencer RW. High-throughput screening of historic collections: Observations on file size, biological targets, and file diversity. Biotechnol Bioeng 1998; 61:61-67.
32. Bayada DM, Hamersma H, van Geerestein VJ. Molecular diversity and representativity in chemical databases. J Chem Inf Comput Sci 1999; 39:1-10.
33. Eglen RM, Schneider G, Böhm HJ. High-throughput screening and virtual screening: entry points to drug discovery. In: Schneider G, Böhm HJ, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH:1-14.
34. Lewell XQ, Judd DB, Watson SP et al. RECAP - Retrosynthetic combinatorial analysis procedure: A powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 1998; 38:511-522.
35. Rechenberg I. Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart: Frommann-Holzboog, 1973.
36. Barnard JM, Downs GM, Willett P. Descriptor-based similarity measures for screening chemical databases. In: Schneider G, Böhm HJ, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH:59-80.
37. Willett P, Barnard JM, Downs GM. Chemical similarity searching. J Chem Inf Comput Sci 1998; 38:983-996.
38. Daylight Chemical Information Systems, Inc., 27401 Los Altos, 360 Mission Viejo, CA 92691, USA. http://www.daylight.com
39. Schneider G, Neidhart W, Giller T et al. “Scaffold-Hopping” by topological pharmacophore search: a contribution to virtual screening. Angew Chem Int Ed Engl 1999; 38:2894-2896.
40. Güner OF. Pharmacophore Perception, Development and Use in Drug Design. La Jolla: International University Line, Biotechnology Series, 2000.
41. Good AC, Mason JS, Pickett SD. Pharmacophore pattern application in virtual screening, library design and QSAR. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. New York, Weinheim: Wiley-VCH, 2000:131-159.
42. Wermuth CG, Ganellin CR, Lindberg P et al. Glossary of terms used in medicinal chemistry (IUPAC Recommendations 1997). Annu Reports Med Chem 1998; 33:385-395.
43. Doyle JL, Stubbs L. Ataxia, arrhythmia and ion-channel gene defects. Trends Genet 1998; 14:92-98.
44. Perez-Reyes EP. Molecular characterization of a novel family of low voltage-activated, T-type, calcium channels. J Bioenerg Biomembr 1998; 30:313-317.
45. Todorovic SM, Prakriya M, Nakashima YM et al. Enantioselective blockade of T-type Ca2+ current in adult rat sensory neurons by a steroid that lacks gamma-aminobutyric acid-modulatory activity. Molecular Pharmacol 1998; 54:918-927.
46. Mishra SK, Hermsmeyer K. Selective inhibition of T-type Ca2+ channels by Ro 40-5967. Circ Res 1994; 75:144-148.
47. Bernardeau A, Ertel EA. Meeting, Montpellier, 1996: Low-voltage-activated T-type Calcium Channels, Adis International, 1998:386-394.
48. Jackson JE. A User’s Guide to Principal Components. New York: John Wiley, 1991.
49. Eriksson L, Johansson E, Kettaneh-Wold N et al. Introduction to Multi- and Megavariate Data Analysis using Projection Methods (PCA & PLS). Umeå: Umetrics, 1999.
50. Devillers J, ed. Neural Networks in QSAR and Drug Design. London: Academic Press, 1996.
51. Backhaus K, Erichson B, Plinke W et al. Multivariate Analysemethoden. Berlin: Springer Verlag, 1989.
52. Sammon JW. A nonlinear mapping for data structure analysis. IEEE Trans Comput 1969; C-18:401-409.
53. Kruskal JB. Multidimensional scaling. Psychometrika 1964; 29:115-129.
54. Agrafiotis DK, Lobanov VS. Nonlinear mapping networks. J Chem Inf Comput Sci 2000; 40:1356-1362.
55. Lerner B. On pattern classification with Sammon’s nonlinear mapping. Pattern Recognition 1998; 31:371-381.
56. Chang CL, Lee RCT. A heuristic method for non-linear mapping in cluster analysis. IEEE Trans Syst Man Cybern 1973; SMC-3:197-200.
57. Lee RCY, Slagle JR, Blum H. Triangulation method for the sequential mapping of points from N-space to 2-space. IEEE Trans Comput 1977; C-27:288-292.
58. Biswas G, Jain AK, Dubes RC. Evaluation of projection algorithms. IEEE Trans Pattern Anal Machine Intell 1981; PAMI-3:701-708.
59. Mao J, Jain AK. A nonlinear projection method based on Kohonen’s topology preserving maps. IEEE Trans Neural Networks 1995; 6:296-317.
60. Livingstone DJ. Multivariate data display using neural networks. In: Devillers J, ed. Neural Networks in QSAR and Drug Design. London: Academic Press, 1996:157-176.
61. Hertz J, Krogh A, Palmer RG. Introduction to the Theory of Neural Computation. Redwood City: Addison-Wesley, 1991.
62. Salt DW, Yildiz N, Livingstone DJ et al. The use of artificial neural networks in QSAR. Pest Sci 1992; 36:161-170.
63. Friston K, Phillips J, Chawla D et al. Revealing interactions among brain systems with nonlinear PCA. Hum Brain Mapp 1999; 8:92-97.
64. Good AC, So SS, Richards WG. Structure-activity relationships from molecular similarity matrices. J Med Chem 1993; 36:433-438.
65. Good AC, Peterson SJ, Richards WG. QSAR’s from similarity matrices. Technique validation and application in the comparison of similarity evaluation methods. J Med Chem 1993; 36:2929-2937.
66. Reibnegger G, Werner-Felmayer G, Wachter H. A note on the low-dimensional display of multivariate data using neural networks. J Mol Graph 1993; 11:129-133.
67. Kohonen T. Self-Organization and Associative Memory. Heidelberg: Springer-Verlag, 1984.
68. Ritter H, Schulten K, Martinez T. Neuronale Netze - Eine Einführung in die Neuroinformatik selbstorganisierender Netzwerke. Bonn: Addison-Wesley, 1990. English edition: Neural Networks. Reading: Addison-Wesley, 1992.
69. Graepel T, Obermayer K. A stochastic self-organizing map for proximity data. Neural Comput 1998; 11:139-155.
70. Bienfait B, Gasteiger J. Checking the projection display of multivariate data with colored graphs. J Mol Graph Model 1997; 15:203-215; 254-258.
71. Preparata FP, Shamos MI. Computational Geometry: An Introduction. New York: Springer, 1985.
72. Fritzke B. Growing self-organizing networks - History, status quo, and perspectives. In: Oja E, Kaski S, eds. Kohonen Maps. Amsterdam: Elsevier Science BV, 1999:131-144.
CHAPTER 3
Modeling Structure-Activity Relationships
“All models are wrong, but some are useful.” (G.E.P. Box)
The Basic Idea
Traditionally, the design of novel drugs has essentially been a trial-and-error process despite the tremendous efforts devoted to it by pharmaceutical and academic research groups. It is estimated that only one in 5,000 compounds investigated in preclinical discovery research ever emerges as a clinical lead, and that about one in 10 drug candidates in development ever gets through the costly process of clinical trials. For each drug, the investment may be on the order of $600 million over 15 years from its first synthesis to FDA approval. In 2000, U.S. pharmaceutical companies spent more than $22 billion on research and development, which, after inflation adjustment, represents a four-fold increase over the corresponding figure some 20 years earlier.
In an attempt to counter these rapidly increasing costs associated with the discovery of new medicines, revolutionary advances in basic science and technology are reshaping the manner in which pharmaceutical research is conducted. For example, the use of DNA microarrays facilitates the identification of novel disease genes and also opens up other interesting opportunities in disease diagnosis, pharmacogenomics and toxicological research (toxicogenomics). The development of combinatorial chemistry and parallel synthesis methods has increased both the quantity and chemical diversity of potential leads against new targets. Our ability to discover useful leads has been greatly enhanced through astonishing advances in high-throughput screening (HTS) technologies. Through miniaturization and robotics, we now have the capacity to screen millions of compounds against therapeutic targets in a very short period of time. Central to this new drug discovery paradigm is the rapid explosion of computational techniques that allow us to analyze vast amounts of data, prioritize HTS hits and guide lead optimization.
The advances and applications of computational methods in drug design are beginning to have a significant impact on the prosperity of the pharmaceutical industry. Modern approaches to computer-aided molecular design fall into two general categories. The first includes structure-based methods, which utilize the three-dimensional structure of the ligand-bound receptor. Many innovative algorithms have been developed and implemented to construct de novo ligands that fit the receptor binding site in a complementary manner; some of these will be discussed in Chapter 5. The second approach includes ligand-based methods, in which the physicochemical or structural properties of ligand molecules are characterized. A classic example of this concept is a quantitative structure-activity relationship (QSAR) model, which provides a theoretical basis for lead optimization. For the past four decades the development of QSAR has had a momentous impact upon medicinal chemistry. Hansch pioneered the field by demonstrating that the biological activities of drug molecules can be correlated with a function of their physicochemical parameters:
Adaptive Systems in Drug Design, by Gisbert Schneider and Sung-Sau So. ©2003 Eurekah.com.
\[ \text{biological activity} = f(x_1, x_2, \ldots, x_n) \qquad \text{(Eq. 3.1)} \]
where f is a mathematical function and the x_i are the molecular descriptors providing information about the physicochemical or structural attributes of the molecules. The major challenges for QSAR practitioners are to find an appropriate set of molecular descriptors and a suitable function that can accurately account for the experimental data.
Development of QSAR Models
Ever since the seminal work of Hansch almost 40 years ago, QSAR research has evolved from the use of simple regression models with a few electronic or thermodynamic variables to an important discipline that is being applied to a wide range of problems.1-4 In the following sections, we outline the typical steps in the development of a QSAR model.
Descriptor Generation
The first step is the tabulation of experimental or computational physicochemical parameters which provide a description of similarities and differences of the compounds under investigation. The computation of descriptor values is generally straightforward because many commercial and academic computer-aided molecular design (CAMD) packages have been developed to handle this kind of calculation, often with great ease. However, it is more difficult to know a priori the type of descriptor which might be relevant to the biological activity of interest. In many cases, a standard set of descriptors chosen from experience may be used.5
Dimensionality of QSAR Descriptors
Molecular descriptors can vary greatly in their complexity. A simple example is a structural key descriptor, which takes the form of a binary indicator variable that encodes the presence of a certain substructure or functional feature. Other descriptors, such as HOMO and LUMO energies, require semi-empirical or quantum mechanical calculations and are therefore more time-consuming to compute. Molecular descriptors are often categorized according to their dimensionality, which refers to the structural representation from which the descriptor values are derived.6 The 1D-descriptors are generally constitutive descriptors (e.g., molecular weight). The 2D-descriptors include structural fragment fingerprints and molecular connectivity indices. It has been argued that structural key descriptors such as UNITY7 and Daylight8 implicitly account for many physicochemical parameters as well as atom types and connectivity information.6 The molecular connectivity indices, which are based on graph theory concepts, can differentiate molecules according to their size, degree of branching, shape, and flexibility. Some of the best-known topological descriptors are the Wiener index (W), the Zagreb index, the Hosoya index (Z), the Kier and Hall molecular connectivity index (χ), Kier’s shape index (κ), the molecular flexibility index (φ), and the Balaban indices (Jx and Jy).
As implied by the name, 3D-descriptors are generated from a three-dimensional representation of molecules. Some examples include molecular volume, solvent-accessible surface area, molecular interaction fields, and spatial pharmacophores. With very few exceptions9-11 the descriptor values are computed from a static conformation, which is either a standard conformation with ideal geometries generated by programs such as CORINA12 or CONCORD,13 or a conformation that is fitted against a target X-ray structure or a pharmacophore.
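To make the notion of a graph-based (2D) descriptor concrete, the following minimal sketch computes the Wiener index, taken here in its usual definition as the sum of topological distances (numbers of bonds on the shortest path) over all pairs of non-hydrogen atoms. The function name and the adjacency-list input format are our own illustrative choices and do not correspond to any particular CAMD package.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search yields the topological distance from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in dist:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hydrogen-suppressed carbon skeletons written as adjacency lists.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same molecular weight (a 1D-descriptor) but different Wiener indices, which illustrates how a connectivity index responds to branching.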
Fragment-Based Physicochemical Descriptors
In addition to intrinsic dimensionality, molecular descriptors can be classified according to their physicochemical attributes. It is recognized that the dominant factors in receptor-drug binding are based on steric, electrostatic, and hydrophobic interactions. For many years
medicinal chemists have attempted to model these principal forces of molecular recognition by using empirical physicochemical parameters, which ultimately led to the introduction of fragment constants in early QSAR studies. These descriptors are constants that account for the effect, on a congeneric series of molecules, of different substituents attached to the common core. The best-known electronic fragment constants are the Hammett σm and σp constants, denoting the electronic effect of substituents at the meta and para positions (Note: the σ constant for ortho substituents is generally unreliable because of its steric interaction with the adjacent core group). Another common pair of electronic parameters is F and R, which are the inductive and the resonance components of the σp parameter, respectively. Perhaps the most widely used fragment constant in the field of QSAR is the hydrophobicity parameter, π. It measures the difference in hydrophobic character of a given substituent relative to hydrogen. To represent the size of a substituent, molar refractivity (MR) is often used, though it has been shown that MR is not a pure bulk parameter since it also captures the molecular polarizability of substituents.3
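For orientation, the two fragment constants described above are commonly written as differences of logarithms relative to the unsubstituted (hydrogen) parent. The notation below is the standard textbook form rather than a formula taken from this chapter: P denotes an octanol-water partition coefficient and K an ionization constant of the corresponding benzoic acid, both symbols being introduced here as assumptions.

```latex
\pi_X = \log P_{\mathrm{RX}} - \log P_{\mathrm{RH}}
\qquad
\sigma_X = \log K_{\mathrm{X}} - \log K_{\mathrm{H}}
```

A positive π thus indicates a substituent that is more hydrophobic than hydrogen, and a positive σ indicates an electron-withdrawing substituent.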
Novel QSAR Descriptors
Novel descriptors continue to appear in the literature; the currently fashionable types encode combinations of steric, hydrophobic and electrostatic properties, not only for molecular fragments but for the whole molecule as well. For example, polar surface area contains information about both the electronics and the size of a molecule and is commonly used in intestinal absorption modeling (see Chapter 4).9 Electrotopological state (E-state) indices capture both molecular connectivity and the electronic character of a molecule.14,15 The GRID16 and CoMFA17 programs take advantage of molecular interaction fields by using different probe types (steric, electrostatic or lipophilic) in a 3D lattice environment. Other variants of the molecular field type, such as the molecular similarity-based CoMSIA,18 have also been reported in the literature. Most of the 3D-descriptors require a pre-aligned set of molecules. In cases where the exact molecular alignment is not obvious, one may consider the use of spatially invariant 3D descriptors (i.e., the descriptor values depend on conformation but not on spatial orientation). A few innovative descriptors have been introduced for this purpose, including autocorrelation vectors of molecular surface properties, molecular surface-weighted holistic invariant molecular (MS-WHIM) descriptors,19-22 and molecular quadrupolar moments.23 Another interesting descriptor, EVA,24,25 is based on normal coordinate frequency analysis and has recently been validated on a number of standard QSAR benchmark data sets.26-28 Burden’s eigenvalue index29 (originally developed for chemical indexing and later developed to become the BCUT metric30 for molecular diversity analysis) has also been found useful as a QSAR descriptor in emerging applications.31,32
Amino Acid Descriptors
Peptides are often used as probes of a binding site due to their ease of synthesis and also to the prevalence of endogenous peptide ligands in nature. Consequently, significant effort has been expended on the development of robust parameters specifically designed to represent amino acids in QSAR applications. Examples are principal properties (z-scales), which are derived from a principal component analysis of 29 experimental and theoretical physicochemical parameters for the 20 naturally occurring amino acids (see Chapter 5),33-36 and the isotropic surface area (ISA) and electronic charge index.37 These latter indices are derived from 3D conformers of the side-chain units and are therefore more readily parameterized for unnatural amino acids. Both sets of parameters have been found to be useful for exploring peptide structure-activity relationships.
Feature Selection
Having generated a set of molecular descriptors, the next step is to apply a statistical or pattern recognition method to correlate these descriptors with the observed biological activities (see also the introduction to feature extraction given in Chapter 2). Partly due to the ease with which a great variety of theoretical descriptors may be generated, QSAR researchers are often confronted with high-dimensional data sets; i.e., the task in such a situation is to solve an underdetermined problem for which there are more variables (descriptors) than objects (compounds). The situation is even more complicated than it appears, because the underlying physicochemical attributes of the molecules that are correlated with their biological activities are often unknown, so that a priori feature selection is not feasible in most cases. Thus, the selection of the best variables for a QSAR model can be very challenging. To reduce the risk of chance correlation and overfitting of data, the entire data set is usually preprocessed using a filter to remove descriptors with either small variance or no unique information.38,39 A feature selection routine then operates on the reduced data set and identifies which of the descriptors have the most prominent influence on the activity to build a model. There are two major advantages of feature selection. First, it can help to define a model that can be interpreted. Second, the reduced model is often more predictive, partly because of the better signal-to-noise ratio which is a consequence of pruning the non-informative inputs. In the past, variable or feature selection was made by a human expert who relied on experience and scientific intuition, or by a correlation analysis of the data set, or by application of statistical methods such as forward selection or backward elimination. However, when the dimensionality of the data is high and the interrelations between variables are convoluted, human judgment can be unreliable.
Also, a simple forward or backward stepping algorithm fails to take into account information that involves the combined effect of several features, so that the optimal solution is not necessarily obtained. This problem has been summarized by Kubinyi: “Selection of variables is time-consuming, difficult and, despite many different statistical criteria for the evaluation of the resulting models, a highly subjective and ambiguous procedure”.40 This suggests the need for a method which is applicable to complex multivariate data, easy to use and, of course, supplies a good solution to the problem. Recent developments in computer science have allowed the creation of intelligent algorithms capable of finding the best, or nearly the best, solutions for such a combinatorial optimization problem (“complex adaptive systems”). A number of fully automated, objective feature selection methods have been introduced. In this section, we will review some of the most common selection strategies used in current QSAR applications.
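The preprocessing filter mentioned above (removal of descriptors with small variance or no unique information) can be sketched in a few lines. This is only one plausible reading of such a filter: the two thresholds, the use of the pairwise Pearson correlation as a redundancy test, and the toy descriptor table are all assumptions made for illustration.

```python
from statistics import mean, pvariance

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def prefilter(columns, min_variance=1e-8, max_abs_r=0.95):
    """Return the names of descriptors that survive both filters.

    columns maps a descriptor name to its values, one value per compound.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # (nearly) constant column carries no information
        if any(abs(pearson_r(values, columns[k])) > max_abs_r for k in kept):
            continue  # almost a copy of a descriptor that was already kept
        kept.append(name)
    return kept

# Invented toy values for five hypothetical compounds.
descriptors = {
    "mol_weight": [180.2, 194.2, 151.2, 206.3, 230.3],
    "mw_scaled": [1.802, 1.942, 1.512, 2.063, 2.303],  # rescaled duplicate
    "has_core": [1, 1, 1, 1, 1],                       # constant indicator
    "logp": [1.2, -0.1, 0.5, 3.5, 3.2],
}
print(prefilter(descriptors))  # ['mol_weight', 'logp']
```

Note that the outcome of this greedy filter depends on the order in which descriptors are visited; of two highly correlated descriptors, the one encountered first is retained.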
Forward Stepping and Backward Elimination
One of the simplest feature selection methods is a stepwise procedure. In forward stepping selection, new descriptors are added to the model one at a time until no more significant variables can be found. Most often, statistical significance is judged by the F-statistic. In backward elimination, the model begins with a full set of descriptors and less informative descriptors are pruned systematically. Both techniques are available as standard routines in many statistical or molecular modeling packages and are very fast to execute. The major shortcoming of this stepwise approach is that the algorithm fails to take into account information that involves any coupled (correlated, parallel) effect among multiple descriptors. Specifically, it is possible that a descriptor is eliminated in an earlier round because it appears to be redundant at that stage, although it could later become the most relevant descriptor once others have been eliminated. This method is very sensitive to multiple local minima and often finds a non-optimal solution, a problem which is related to the coupled descriptor effect just described.
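A forward stepping procedure of the kind described above might be sketched as follows. The sketch fits ordinary least squares models through the normal equations and admits, at each step, the descriptor with the largest partial F-statistic. The F-to-enter threshold of 4.0, the helper names, and the synthetic data are assumptions for illustration only; a statistical package would normally compare F against a tabulated critical value.

```python
def ols_rss(x_cols, y):
    """Residual sum of squares of a least-squares fit with intercept.

    Solves the normal equations by Gaussian elimination; this simple
    version fails (division by zero) if the columns are exactly collinear.
    """
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]
    p = len(cols)
    a = [[sum(ci[k] * cj[k] for k in range(n)) for cj in cols] for ci in cols]
    c = [sum(ci[k] * y[k] for k in range(n)) for ci in cols]
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[piv] = a[piv], a[i]
        c[i], c[piv] = c[piv], c[i]
        for r in range(i + 1, p):
            f = a[r][i] / a[i][i]
            for k in range(i, p):
                a[r][k] -= f * a[i][k]
            c[r] -= f * c[i]
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(a[i][k] * b[k] for k in range(i + 1, p))) / a[i][i]
    return sum((y[k] - sum(b[j] * cols[j][k] for j in range(p))) ** 2
               for k in range(n))

def forward_select(columns, y, f_to_enter=4.0):
    """Add one descriptor at a time while the best partial F is large enough."""
    n = len(y)
    selected = []
    current_rss = ols_rss([], y)
    while len(selected) < len(columns):
        dof = n - (len(selected) + 1) - 1  # residual d.o.f. of enlarged model
        if dof <= 0:
            break
        trials = []
        for name in columns:
            if name not in selected:
                new_rss = ols_rss([columns[s] for s in selected] + [columns[name]], y)
                f_stat = (current_rss - new_rss) / (new_rss / dof)
                trials.append((f_stat, name, new_rss))
        f_stat, name, new_rss = max(trials, key=lambda t: t[0])
        if f_stat < f_to_enter:
            break
        selected.append(name)
        current_rss = new_rss
    return selected

# Synthetic data: y depends on x1 and x2 only; x3 is irrelevant.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, -1, 1, 1, -1, -1, 1]
x3 = [1, 1, -1, -1, -1, -1, 1, 1]
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
y = [2 * a + 3 * b + e for a, b, e in zip(x1, x2, noise)]
columns = {"x1": x1, "x2": x2, "x3": x3}
print(forward_select(columns, y))  # ['x1', 'x2']
```

Because each step only looks one descriptor ahead, the sketch shares the weakness discussed above: a pair of descriptors that is informative only in combination may never be admitted.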
Neural Network Pruning
The vast majority of ANN applications in QSAR concern the mapping of descriptor values to biological activities (i.e., parameter estimation of the model). Wikel and Dow pioneered the (unconventional) use of ANN in variable selection for QSAR studies.41 Their approach is considered a “magnitude-based” method, in which the sensitivities of all input variables are computed and the less sensitive variables (as reflected by the smaller magnitudes of their associated weights) are pruned. Other researchers have proposed “error-based” pruning methods, in which the sensitivity of each input is probed by the corresponding change of the training error when its associated weights are eliminated.42
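The idea behind magnitude-based pruning can be illustrated with a small sketch that ranks the inputs of an already trained network by the summed absolute values of their outgoing weights. This is a generic illustration and not a reproduction of the published procedure; the saliency measure, the function names, and the weight values are assumptions. In practice the inputs would need to be scaled comparably and the network retrained after pruning.

```python
def input_saliency(input_to_hidden):
    """Sum of absolute weights leaving each input node.

    input_to_hidden[i][j] is the weight from input i to hidden node j.
    """
    return [sum(abs(w) for w in row) for row in input_to_hidden]

def prune_inputs(names, input_to_hidden, keep):
    """Return the `keep` input names with the largest saliency."""
    saliency = input_saliency(input_to_hidden)
    ranked = sorted(zip(saliency, names), reverse=True)
    return [name for _, name in ranked[:keep]]

# Invented weights of a trained network with 3 inputs and 3 hidden nodes.
weights = [
    [0.9, -1.2, 0.4],     # descriptor "a"
    [0.05, 0.02, -0.01],  # descriptor "b"
    [-0.7, 0.3, 1.1],     # descriptor "c"
]
print(prune_inputs(["a", "b", "c"], weights, keep=2))  # ['a', 'c']
```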
Simulated Annealing
Simulated annealing (SA) is a popular optimization method that is modeled on a physical annealing process.43 The principle of this method is similar to Monte Carlo simulations, except for the use of a cooling schedule for the temperature parameter. As the Boltzmann-type probability factor changes with the lowering of the temperature over the course of the simulation, solutions of lower quality are more readily accepted at the beginning of the optimization than during later stages. The result of SA often depends on the particular cooling schedule used during optimization. The most commonly used schedule is a geometric cooling scheme, where the temperature is decreased by a constant factor (typically in the range 0.90 to 0.99) at each annealing step. Other advanced cooling schedules have been proposed that help enhance the exploration of configuration space. They include the methods of geometric re-heating and adaptive cooling. The major advantages of SA are that the algorithm is easily understood, straightforward to implement, very robust, and generally applicable to almost all types of combinatorial optimization problems. In most cases, one can find good quality solutions to the problem at hand. However, as for all stochastic optimization methods, simulations that are initialized with different random seeds or cooling schedules can lead to very different outcomes. Compared to the forward stepping/backward elimination procedure, this algorithm requires substantially more computing resources.
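A minimal sketch of simulated annealing applied to descriptor selection is given below. A candidate solution is a binary mask (descriptor in or out), a move flips one position, and the temperature follows the geometric cooling scheme described above. The starting temperature, cooling factor, number of steps, and the toy scoring function (which stands in for the cross-validated error of a real model) are all assumptions.

```python
import math
import random

def anneal(score, n_bits, t_start=1.0, cooling=0.95, n_steps=500, seed=0):
    """Minimise score(mask) over binary masks using geometric cooling."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in range(n_bits)]
    current = score(mask)
    best_mask, best = mask[:], current
    t = t_start
    for _ in range(n_steps):
        trial = mask[:]
        i = rng.randrange(n_bits)
        trial[i] = not trial[i]  # move: switch one descriptor in or out
        delta = score(trial) - current
        # Metropolis criterion: improvements are always accepted, worse
        # solutions with a Boltzmann-type probability that shrinks with t.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            mask, current = trial, current + delta
            if current < best:
                best_mask, best = mask[:], current
        t *= cooling  # geometric cooling step
    return best_mask, best

# Toy objective: number of positions that differ from a hidden "ideal" mask.
target = [True, False, True, True, False, False, True, False, False, True]

def toy_score(mask):
    return sum(a != b for a, b in zip(mask, target))

best_mask, best = anneal(toy_score, len(target))
```

Running the sketch with different values of seed or cooling is a simple way to observe the sensitivity to random initialization and cooling schedule noted above.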
Genetic Algorithm
The genetic algorithm (GA) idea mimics Nature to solve complex combinatorial optimization problems. GA is a stochastic optimization method that simulates evolution using a simplistic model of molecular genetics.44 The key aspect of a GA is that it investigates many solutions simultaneously, each of which explores different regions of configuration space. The details of this algorithm have been discussed in Chapter 1 and are not repeated here.
Tabu Search
Tabu search is an iterative procedure for solving combinatorial optimization problems. Compared to either SA or GA, reports of Tabu search in computational molecular design have been relatively sparse in the literature. The basic concept of Tabu search is to explore the search space of all feasible solutions by a sequence of intelligent moves through the incorporation of adaptive memory.45 To ensure that new regions of parameter space will be investigated, new moves in Tabu search are not chosen randomly. Instead, some moves that have been previously visited are classified as forbidden (hence Tabu or Taboo), or are severely penalized, to avoid cycling or becoming trapped in a local minimum. Central to this paradigm is Tabu list management, which concerns the maintenance and update of the moves that are considered forbidden within a particular iteration. In the context of descriptor selection, the Tabu memory may contain a list of individual descriptors or combinations thereof which have been shown to be uninteresting during the previous rounds of model building.46
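The following sketch shows one very simple form of Tabu list management for descriptor selection: at every iteration the best single-descriptor flip that is not on the Tabu list is taken, even if it worsens the score, and the flipped position then stays forbidden for a fixed number of iterations. The tenure, the iteration count, and the toy scoring function are assumptions, and more elaborate memory structures are certainly possible.

```python
from collections import deque

def tabu_search(score, n_bits, tenure=3, n_iter=30):
    """Minimise score(mask) by best-admissible single flips with a Tabu list."""
    mask = [False] * n_bits
    best_mask, best = mask[:], score(mask)
    tabu = deque(maxlen=tenure)  # recently flipped positions are forbidden
    for _ in range(n_iter):
        candidates = []
        for i in range(n_bits):
            if i in tabu:
                continue
            trial = mask[:]
            trial[i] = not trial[i]
            candidates.append((score(trial), i, trial))
        if not candidates:
            break  # only happens if the tenure covers every position
        # Take the best admissible move, even when it is uphill.
        value, i, mask = min(candidates, key=lambda c: c[0])
        tabu.append(i)
        if value < best:
            best_mask, best = mask[:], value
    return best_mask, best

# Toy objective: number of positions that differ from a hidden "ideal" mask.
target = [False, True, True, False, True, False, False, True]

def toy_score(mask):
    return sum(a != b for a, b in zip(mask, target))

best_mask, best = tabu_search(toy_score, len(target))
print(best)  # 0
```

Unlike simulated annealing, this sketch contains no random element, so repeated runs from the same starting mask follow the same trajectory.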
Exhaustive Enumeration
Although GAs and Tabu searches explore many possible solutions, there is no guarantee that the best solution will ever emerge from the simulations. In the presence of complex non-linear system behavior, the exhaustive enumeration of all possible sets of descriptors is the only method which ensures the discovery of the globally optimal solution, although such a brute-force approach is often practically impossible. This is due to the exponential increase in the number of descriptor combinations that can be formed from a given number of descriptors (Note: The number of all possible non-empty subsets is 2^N − 1, where N is the number of descriptors considered). For a data set that contains 50 descriptors, the number of possibilities is already greater than 10^15.
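The combinatorial growth can be checked directly. The short sketch below counts the non-empty subsets for a given number of descriptors and enumerates them explicitly for a small case; the function names are our own.

```python
from itertools import combinations

def n_subsets(n_descriptors):
    """Number of non-empty descriptor subsets, 2^N - 1."""
    return 2 ** n_descriptors - 1

def all_subsets(names):
    """Yield every non-empty subset; only practical for small N."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)

print(n_subsets(50))                   # 1125899906842623
print(len(list(all_subsets("abcd"))))  # 15
```

Even under the optimistic assumption that one million models could be built and evaluated per second, enumerating all subsets of 50 descriptors would take roughly 36 years.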
Other Feature Selection Methods
Many feature selection methods have appeared in the literature and it is beyond the scope of this text to provide a comprehensive review. Here, other interesting approaches are discussed briefly. GOLPE (Generating Optimal Linear PLS Estimations) is a variable selection procedure to obtain the best predictive PLS (vide infra) models. In this approach, combinations of variables derived from a fractional factorial design strategy are used to run several PLS analyses, and the variables that contribute significantly to prediction accuracy are selected, while all others are discarded.47,48 Livingstone and coworkers recently developed a novel method called Unsupervised Forward Selection (UFS) for eliminating redundant variables.39 UFS begins with the two least correlated descriptors in the data set and iteratively adds more descriptors based on an orthogonality criterion. Variables whose squared multiple correlation coefficients with those already chosen exceed a user-defined threshold are rejected. The resulting selection of descriptors has low redundancy and multicollinearity.
Model Construction
Having selected relevant features, the final stage of QSAR model building is executed by a feature mapping procedure, also referred to as the parameter estimation problem. The goal is to formulate a mathematical relationship and to estimate the model parameters. The quality of the parameter set is usually judged by comparing the result of the fitted data to observed data.49 Quite often, feature selection and parameter estimation are performed simultaneously to produce a QSAR model.
Linear Methods
Multiple linear regression (MLR), or ordinary least squares (OLS), was the traditional method for QSAR applications in the past. The major advantage of this method is its computational simplicity, offering the possibility to easily interpret the resulting equation. However, this method becomes inapplicable as soon as the number of input variables equals or exceeds the number of observed objects. As a rule of thumb, the ratio of objects to variables should be at least five for MLR analysis; otherwise there is a correspondingly large risk of chance correlation.50 A common way to reduce the number of inputs to MLR without explicit feature selection is through feature extraction by means of principal component regression (PCR). In this procedure, the complete set of input descriptors is transformed to its orthogonal principal components, relatively few of which may suffice to capture the essential variance of the original data. The principal components are then used as the input to a regression analysis. Another very powerful multivariate statistical method for application to an underdetermined data set is partial least squares (PLS).51 Briefly, PLS attempts to identify a few latent structures, or linear combinations of descriptors, that best correlate with the observables. Cross-validation is employed to avoid overfitting of data. Unlike MLR, there is no restriction in PLS on the ratio
between data objects and variables, and the PLS method can analyze several response variables simultaneously. In addition, PLS can deal with strongly collinear input data and tolerates some missing data values.
Non-Linear Methods
Traditionally, non-linear correlations in the data are dealt with explicitly by a predetermined functional transformation before entering an MLR. Unfortunately, the introduction of non-linear or cross-product terms in a regression equation often requires knowledge which is not available a priori. Moreover, it adds to the complexity of the problem and often leads to insignificant improvement in the resulting QSAR. To overcome this deficiency of linear regression, there is an increasing interest in techniques that are intrinsically non-linear. Some of them are mapping methods that attempt to preserve the original Euclidean distance matrix when high-dimensional data are projected to lower (typically two or three) dimensions. Examples are non-linear mapping (NLM), the self-organizing map (SOM),52 and ReNDeR-type neural networks53 (see discussion in Chapter 2). However, although such maps do offer visual cues to structure-activity relationships, they rarely provide a quantitative relationship between structural descriptors and activity.
At the present time, artificial neural networks (ANN) are probably the most widely used non-linear methods in chemometric and QSAR applications (see Chapter 1). ANN are computer-based simulations which contain some elements that exist in living nervous systems. What makes these networks powerful is their potential for performance improvement over time as they acquire knowledge about a problem, and their ability to handle fuzzy real-world data. With the presence of hidden layers, neural networks are able to implicitly perform non-linear mapping of the physicochemical parameters to the biological activities of the compounds. During the training phase, a network builds an internal feature model from data patterns presented to its input. New similar patterns will then be recognized; the network has the ability for generalization and, more importantly, it is able to make quantitative predictions for queries of a similar nature.
Another emerging non-linear method is genetic programming (GP), whose initial use was to evolve computer programs.54 Recently, Kell and coworkers have published an exciting adaptation of GP to analyze mass spectral data.55 It is noteworthy that in GP the evolutionary algorithm is responsible not only for the selection of descriptors, but also for parameter estimation; i.e., the discovery of the appropriate mathematical transformation relating the descriptors and the response function. The functional tree implementation suggested by Kell operates on simple mathematical expressions that are readily manipulated by genetic operators. Ultimately, these simple mathematical functions are combined, leading to non-linear multivariate regression. An advantage of GP over ANN is that the evolved trees are more interpretable and therefore provide valuable insights into the logic of the decision-making process.56 Another notable difference between GP and conventional ANN is that the GP tree can be evolved to arbitrary complexity in order to solve a problem, although the use of evolutionary neural networks does allow for adaptation of the neural network architecture (such as the number of hidden nodes) during training.57 In both cases, the QSAR practitioner should be aware of the risk of data overfitting.
Recently, Tropsha and coworkers published a novel non-linear QSAR approach adapted from the k-nearest-neighbor principle (kNN-QSAR).58,59 Briefly, the activity of an unknown compound is predicted as the average activity of the k most similar compounds as measured by their Euclidean distances in multidimensional descriptor space. A simulated annealing procedure may be applied to the selection of an optimal set of descriptors based on predictive statistical parameters. This method is extremely fast and is also generally applicable to large data sets that may contain many structurally diverse compounds.
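The prediction step of a k-nearest-neighbor model as described above (average activity of the k closest compounds in descriptor space) can be sketched in a few lines. This is a plain unweighted average on invented data; it does not attempt to reproduce the published kNN-QSAR implementation, and the descriptors are assumed to have been scaled beforehand so that Euclidean distances are meaningful.

```python
import math

def knn_predict(query, train_x, train_y, k=3):
    """Average activity of the k training compounds nearest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(query, pair[0]))
    nearest = ranked[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

# Invented two-descriptor training set with activities.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 8.0, 9.0]

print(knn_predict((0.2, 0.1), train_x, train_y, k=3))  # 2.0
print(knn_predict((5.5, 5.0), train_x, train_y, k=2))  # 8.5
```

Since no parameters are fitted, the cost of a prediction is dominated by the distance calculations, which is consistent with the speed noted above.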
Model Validation
Model validation is a critical but often neglected component of QSAR development. In a recent review,60 Kövesdi and coworkers state that “[...] In many respects, a proper validation process is more important than a proper training. It is all too easy to get a very small error on the training set, due to the enormous fitting ability of the neural network, and then one may erroneously conclude the network would perform excellently”. The first benchmark of a QSAR model is usually to determine the accuracy of the fit to the training data (“re-classification”), most commonly reported as the residual root-mean-square (rms) error or the Pearson correlation coefficient (see the next section for definitions). However, because QSAR models are often used for activity prediction of compounds not yet synthesized, the more important statistical measures are those giving an indication of their prediction accuracy.
The most popular procedure for estimation of the prediction accuracy is cross-validation, which includes techniques such as jackknife, leave-one-out (LOO), leave-group-out (LGO) and bootstrap analysis. The first group of methods is based on data splitting, where the original data set is randomly divided into two subsets. The first is a set of training compounds used for exploration and model building, and the second is the so-called “validation set” for prediction and model validation. The leave-one-out procedure systematically removes one data point at a time from the training set, and on the basis of this reduced data set constructs a model that is subsequently used to predict the removed sample. This procedure is repeated for all data points, so that a complete set of predicted values can be obtained. It has been argued that the LOO procedure tends to overestimate the model “predictivity” and that resulting QSAR models are “over-optimistic”.61 It is worth noting that LOO cross-validation is often confused with jackknifing.
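The leave-one-out procedure just described can be sketched generically: each compound is predicted by a model built from all the others, and the predictions are summarized by a cross-validated coefficient, written here as q2 = 1 - PRESS/SS, where PRESS is the sum of squared prediction errors and SS the sum of squared deviations of the observed values from their mean. The symbol q2, the function names, the toy nearest-neighbor model, and the data are assumptions introduced for illustration.

```python
def loo_predictions(x, y, fit_predict):
    """Predict every compound from a model built on all other compounds."""
    preds = []
    for i in range(len(y)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(fit_predict(train_x, train_y, x[i]))
    return preds

def q2(y, preds):
    """Cross-validated coefficient 1 - PRESS / SS."""
    mean_y = sum(y) / len(y)
    press = sum((obs - p) ** 2 for obs, p in zip(y, preds))
    total = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - press / total

def nearest_neighbour(train_x, train_y, query):
    # Toy model: copy the activity of the closest training compound.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_y[i]

# Invented one-descriptor data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

preds = loo_predictions(x, y, nearest_neighbour)
print(preds)         # [2.0, 1.0, 2.0, 3.0, 4.0]
print(q2(y, preds))  # 0.5
```

The toy model reproduces its own training data exactly, yet its leave-one-out q2 is only 0.5, which illustrates why the fit to the training set alone says little about prediction accuracy.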
Technically, jackknifing is used to estimate the bias of a statistic. A typical application of jackknifing is to compute the statistical parameters of interest for each subset of data, and to compare the average of these subset statistics with the one obtained from the entire sample in order to estimate the bias of the latter. In LOO, the main focus is on estimating the generalization error from the predictions of the left-out samples.62 As an alternative to LOO, an LGO procedure can be applied, which typically sets aside between 5% and 20% of the entire data set as a validation subset. In the literature, this procedure is also known as “k-fold cross-validation”, indicating that the entire data set is divided into k groups of approximately equal size. An added bonus of an LGO procedure is a vast reduction in computing resources relative to a standard LOO cross-validation.

Bootstrapping represents another type of re-sampling method that is distinct from data splitting. It is a statistical simulation method which generates sample distributions from the original data set.63 The concept of bootstrapping is founded on the premise that the sample represents an estimate of the entire population, and that statistical inference can be drawn from a large number of pseudo-samples to estimate the bias, standard error, and confidence intervals of the parameter of interest. The pseudo- (or bootstrap-) samples are created from the original data set by sampling with replacement, so that some objects may appear in multiple instances. The usual point of contention about the bootstrap procedure concerns the minimal number of samplings required for computing reliable statistics. An empirical rule given by Davison and Hinkley suggests that the number of bootstrap samples should be at least 40 times the number of sample objects.64

Another popular means of statistical validation is through a randomization test.
In this procedure, the output values (typically biological responses) of the compounds are shuffled randomly, and the scrambled data set is re-examined by the QSAR method against real (unscrambled) input descriptors to determine the correlation and predictivity of the resulting “model”. The entire procedure is repeated multiple times (typically 20-50 models) on many differently scrambled data sets. If there remains a strong correlation between the selected
Modeling Structure-Activity Relationships
descriptors and the randomized response variables, then the significance of the proposed QSAR model is regarded as suspect.

Finally, the most stringent validation of a QSAR model is through the prediction of new test compounds. It is important that the compounds in an external test set are not used in any manner during the model building process (e.g., for optimizing network parameters or determining a stopping point for neural network training). Otherwise, the bias introduced from the test set compromises the validation process.
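The data-splitting and randomization procedures described above can be sketched in a few lines. The following is a minimal illustration, not the procedure of any cited study: it assumes a multiple linear regression as the QSAR method, synthetic descriptors and activities, and hypothetical function names (`fit_mlr`, `cross_validated_q2`, `y_randomization`). Setting k equal to the number of compounds gives LOO; smaller k gives LGO (k-fold) cross-validation.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def cross_validated_q2(X, y, k, rng):
    """k-fold (leave-group-out) cross-validation; k == len(y) is leave-one-out."""
    n = len(y)
    order = rng.permutation(n)
    y_pred = np.empty(n)
    for held_out in np.array_split(order, k):
        train = np.setdiff1d(order, held_out)
        coef = fit_mlr(X[train], y[train])          # model never sees held_out
        y_pred[held_out] = predict_mlr(coef, X[held_out])
    press = np.sum((y - y_pred) ** 2)               # predictive residual sum of squares
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials, k, rng):
    """Re-run the cross-validation on scrambled responses against real descriptors."""
    return np.array([cross_validated_q2(X, rng.permutation(y), k, rng)
                     for _ in range(n_trials)])

# Synthetic example (assumed data): 40 "compounds", 3 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.2 * rng.normal(size=40)

q2_loo = cross_validated_q2(X, y, k=len(y), rng=rng)
q2_lgo = cross_validated_q2(X, y, k=5, rng=rng)
q2_scrambled = y_randomization(X, y, n_trials=25, k=len(y), rng=rng)
print(f"LOO q2 = {q2_loo:.3f}, 5-fold q2 = {q2_lgo:.3f}")
print(f"scrambled q2: mean = {q2_scrambled.mean():.3f}, max = {q2_scrambled.max():.3f}")
```

A model whose cross-validated statistic is not clearly separated from the scrambled distribution would be regarded as suspect in the sense described above.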
Model Quality

A variety of statistical parameters have been reported in the QSAR literature to reflect the quality of the model. These measures indicate how well the model fits existing data, i.e., they measure the explained variance of the target parameter y in the biological data. Some of the most common measures are listed below. The first is the Pearson product-moment correlation coefficient (r), which measures the strength of the linear relationship between two variables (Eq. 3.2). If two variables are perfectly linearly related with positive slope, then r = 1. However, the Pearson correlation coefficient can be highly influenced by outliers or a skewed data distribution. Under such circumstances, that is, when the assumption of normality is unreasonable or outliers are present, the Spearman rank correlation coefficient rS (Eq. 3.3) is a robust alternative to r. The Spearman rank correlation coefficient is calculated from the ranks of scores, not the scores themselves. In other words, it is a measure of the strength of the linear relationship between population ranks, and is therefore less sensitive to the presence of outliers in the data. Furthermore, the Spearman rank correlation measures the monotonicity of the relationship between two random variables; if one variable increases perfectly monotonically with the other, then rS = 1. It is noteworthy that if rS is noticeably greater than r, a transformation of the data might lead to a stronger linear relationship. For classification, Matthews’ correlation coefficient (c) is a popular statistical parameter to denote the quality of fit (Eq. 2.18). It accounts not only for correct predictions, i.e., true positives (P) and true negatives (N), but also for incorrect predictions, i.e., false positives or overpredictions (O) and false negatives or underpredictions (U). Like the other types of correlation coefficients, Matthews’ correlation coefficient ranges from 1 (perfect correlation) to -1 (perfect anti-correlation).
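The contrast between the three coefficients can be illustrated numerically. The sketch below uses the standard textbook definitions, which are assumed here to correspond to the forms referenced in the text; the function names and the toy data are hypothetical. For a perfectly monotonic but non-linear relationship, rS equals 1 while r falls well below 1, which is the situation in which a transformation of the data may help.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def ranks(v):
    """Ranks 1..n of the values in v (this simple version assumes no ties)."""
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman_rs(x, y):
    """Spearman rank correlation: the Pearson coefficient of the ranks."""
    return pearson_r(ranks(x), ranks(y))

def matthews_c(P, N, O, U):
    """Matthews coefficient from true positives P, true negatives N,
    overpredictions O (false positives) and underpredictions U (false negatives)."""
    denom = np.sqrt(float((N + U) * (N + O) * (P + U) * (P + O)))
    return (P * N - O * U) / denom if denom > 0 else 0.0

# Monotonic but strongly non-linear toy relationship.
x = np.arange(1, 11)
y = np.exp(x)
r_value = pearson_r(x, y)
rs_value = spearman_rs(x, y)
rs_log = pearson_r(x, np.log(y))   # a log transform restores linearity

c_perfect = matthews_c(P=10, N=10, O=0, U=0)
c_inverse = matthews_c(P=0, N=0, O=10, U=10)
print(f"r = {r_value:.3f}, rS = {rs_value:.3f}, r after log transform = {rs_log:.3f}")
```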
Eq. 3.2
Eq. 3.3
Other commonly used goodness-of-fit measures are the residual standard deviation (s) (Eq. 3.4) and the root-mean-square difference (rmsd) (Eq. 3.5) between the calculated and observed values:
Eq. 3.4
Eq. 3.5
where N is the number of data objects and k is the number of terms used in the model. For these quantities, smaller values indicate a better fit of the data. For cross-validation, PRESS (Eq. 3.6) and q2 (Eq. 3.7) have been suggested to provide good estimates of the real prediction error of a model:
Eq. 3.6
Eq. 3.7
It should be noted that, contrary to what its name suggests, q2 can take a negative value. Generally speaking, a q2 value of 0.5–0.6 is regarded as the minimum acceptance criterion for a reliable QSAR model. The following characteristics are generally regarded as traits of a robust and high-quality QSAR model:
1. All descriptors used in the model are significant, and specifically, none of the descriptors should be found to account for single peculiarities (e.g., unique compound-descriptor association).
2. There should be no leverage or outlier compounds in the training set, otherwise the statistical parameters reported by the model may not be meaningful.
3. The cross-validation performance should be significantly better than that of randomized tests but not very different from that of the training set and external test predictions.
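For reference, the measures cited in this section as Eqs. 3.2 to 3.7 are commonly written as shown below. These are standard textbook forms and the exact notation printed in the original may differ. The symbols are assumptions introduced for this sketch: x_i and y_i are paired values with means \bar{x} and \bar{y}, d_i is the difference between the ranks of x_i and y_i (the closed form for r_S holds when there are no ties), y_i^{obs} and y_i^{calc} are observed and fitted values, and y_i^{pred} is the value predicted for object i when it is left out during cross-validation. The N - k - 1 degrees of freedom in s assume that k counts the terms other than the intercept.

```latex
\begin{align}
r &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{3.2} \\
r_S &= 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} \tag{3.3} \\
s &= \sqrt{\frac{\sum_{i=1}^{N} \left( y_i^{calc} - y_i^{obs} \right)^2}{N - k - 1}} \tag{3.4} \\
\mathrm{rmsd} &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i^{calc} - y_i^{obs} \right)^2} \tag{3.5} \\
\mathrm{PRESS} &= \sum_{i=1}^{N} \left( y_i^{pred} - y_i^{obs} \right)^2 \tag{3.6} \\
q^2 &= 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{N} \left( y_i^{obs} - \bar{y}^{obs} \right)^2} \tag{3.7}
\end{align}
```

Written this way, it is apparent why q2 can be negative: whenever PRESS exceeds the total sum of squares about the mean, the model predicts worse than simply using the mean activity for every compound.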
Application of Adaptive Methods in QSAR

Variable Selection

The crux of data reduction is to select a subset of features retaining the maximal information content of the data.65 Feature extraction, exemplified by principal component analysis and discriminant analysis, transforms a set of features into a reduced representation that captures the major variance of data. Feature selection, on the other hand, attempts to identify a small subset of features from all available features. The following is a brief account of how artificial intelligence methods have been applied to feature selection in QSAR applications.
Artificial Neural Networks

The earliest application of ANN for variable selection in a QSAR application was reported by Wikel and Dow.41 After training a neural network using all descriptors, those inputs having large weights between the input and the hidden nodes were selected for further analysis. The “magnitude-based” algorithm was tested on the widely studied Selwood data set,66 which comprises a series of 31 antifilarial antimycin analogs, each parameterized by
53 physicochemical descriptors (Note: Livingstone, a co-author of the Selwood paper, has recently expressed the opinion that this particular data set should not be regarded as a “standard” but rather a “difficult” data set due to poor data distribution).39 In their pioneering work, Wikel and Dow employed a color map to indicate the magnitude of the weight values. This led to the identification of a set of three relevant descriptors. Used in a multiple linear regression, they produced a QSAR model that was marginally better (r = 0.77, rcv = 0.68) than the three-descriptor model originally published by Selwood (r = 0.74, rcv = 0.67).66 Despite this encouraging result, it was argued that this descriptor selection scheme seemed somewhat subjective. Specifically, an overtrained network (vide infra), which is characterized by large weight values for nearly all of the descriptors used, can result in poor discrimination between the relevant and irrelevant descriptors. Thus, it is important to adopt a robust early stopping criterion during neural network training in order to achieve correct pruning of unnecessary input descriptors.41,42

Recently, Tetko and coworkers benchmarked five neural network-based descriptor pruning methods in a series of three studies.42,67,68 Two magnitude-based and three error-based methods were examined. The magnitude-based methods are similar in principle to the Wikel and Dow method, in which the pruning follows direct analysis of the magnitudes of the outgoing weights. The error-based methods, on the other hand, detect the sensitivity of an input by monitoring the change in the network error when the neuron weights associated with that input are eliminated.
Unlike the magnitude-based sensitivity method, the error-based pruning method assumes that the training error is at a minimum, so if an early stopping criterion is applied the assumption is no longer justified.42 Overall, the authors concluded that no significant advantage of one method over the others was evident from the data sets they analyzed. In general, error-based sensitivity methods are computationally more demanding, particularly those which require higher derivatives of the error function (e.g., the optimal brain damage69 or optimal brain surgeon70 algorithms). All algorithms give similar results for both linear and non-linear simulated sets of data (artificial structured data sets) and are capable of identifying the least sensitive input descriptors. In all cases, the predictivity of the ANN can be improved by the removal of redundant input descriptors. The five pruning algorithms were also tested with three real QSAR data sets, with the conclusion that the behavior of the different pruning methods seems to deviate more significantly in real QSAR modeling. However, the authors stated that it is very difficult to determine which pruning method would be universally applicable since the efficacy of each method is most likely data-dependent.
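The difference between the two families of pruning criteria can be made concrete with a toy network. The sketch below is an illustration only and does not reproduce any of the five benchmarked algorithms: it assumes a single hidden layer with tanh units, hand-set (not trained) weights, and hypothetical function names. The magnitude-based score sums the absolute input-to-hidden weights of each descriptor; the error-based score measures the increase in mean squared error when those weights are set to zero. The targets are taken to be the network's own outputs, so that the starting error is exactly at its minimum, as the error-based methods assume.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One-hidden-layer network: tanh hidden units, linear output."""
    return np.tanh(X @ W1 + b1) @ W2 + b2

def magnitude_saliency(W1):
    """Magnitude-based score: summed |weight| leaving each input node."""
    return np.abs(W1).sum(axis=1)

def error_saliency(X, y, W1, b1, W2, b2):
    """Error-based score: rise in MSE when an input's weights are eliminated."""
    base = np.mean((forward(X, W1, b1, W2, b2) - y) ** 2)
    scores = np.empty(W1.shape[0])
    for i in range(W1.shape[0]):
        W1_pruned = W1.copy()
        W1_pruned[i, :] = 0.0                     # disconnect descriptor i
        pruned = forward(X, W1_pruned, b1, W2, b2)
        scores[i] = np.mean((pruned - y) ** 2) - base
    return scores

# Hand-set weights (assumed): 4 descriptors, 3 hidden nodes, 1 output.
W1 = np.array([[2.00, 1.50, 1.00],
               [0.50, 0.40, 0.30],
               [0.10, 0.20, 0.10],
               [0.01, 0.01, 0.01]])
b1 = np.zeros(3)
W2 = np.array([1.0, 0.8, 0.5])
b2 = 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = forward(X, W1, b1, W2, b2)                    # error is exactly zero at the start

mag = magnitude_saliency(W1)
err = error_saliency(X, y, W1, b1, W2, b2)
mag_ranking = [int(i) for i in np.argsort(-mag)]
print("magnitude-based ranking:", mag_ranking)
print("error-based scores:", np.round(err, 4))
```

In this contrived case both criteria agree that the last descriptor is the first candidate for pruning; on real data the two rankings need not coincide, which is consistent with the data-dependent behavior reported above.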
Genetic Algorithm

The earliest application of GAs to descriptor selection in chemometric and QSAR applications was reported by Leardi and coworkers in 1992.71 Their initial test was performed on an artificial data set containing 16 objects and 11 descriptors. The simplicity of this test set allowed them to compare the result of a GA solution with that of an exhaustive enumeration of all possible subset selections. They observed that, on average, the globally optimal solution was found by the GA within one-quarter of the time required by the exhaustive search. This was in contrast with the result obtained by the stepwise regression method, which found a solution that was ranked only sixth overall in the list generated by the exhaustive search. After the initial validation, the GA-MLR method was applied to a real chemometric data set consisting of 41 samples and 69 descriptors. For this example, the stepwise regression approach yielded a 12-descriptor model with a cross-validated variance of 83%. The top models from five independent GA simulations yielded models with cross-validated variances of 89%, 85%, 81%, 91%, and 84%, demonstrating the stochastic nature of GA optimization. It is therefore appropriate to perform multiple runs on the same input data when a GA is employed for feature selection. Moreover, it is quite possible that the simplistic implementation of GA used in this exercise failed
to escape from local optima. More advanced evolutionary algorithms, such as the “ring” or “island” models of parallel GAs, which partition the population into sub-populations and allow for a “migration” operator between sub-populations, may lead to improved convergence.56,72

Rogers and Hopfinger proposed a new GA-based method, termed genetic function approximation (GFA), for descriptor selection. A conventional GA, which contains crossover and mutation operators, is coupled with MLR for parameter estimation. There are two principal enhancements in the GFA approach. The first is the introduction of a few non-linear basis functions such as splines and quadratic functions. The second is the use of the lack-of-fit (LOF) error measure, which penalizes large numbers of descriptors, as the fitness criterion to safeguard against overfitting. With their GFA algorithm, Rogers and Hopfinger discovered a number of linear QSAR models for the Selwood antimycin data set which were significantly better than those obtained by Selwood and by Wikel and Dow.41,66 Interestingly, there seems to be little overlap among the three studies, other than the use of clogP, the sole descriptor encoding hydrophobicity. Similar to the finding by Leardi et al., the top 20 GFA models have cross-validated r values ranging from 0.81 to 0.85, again supporting the notion that many independent QSAR models can provide useful activity correlation on the same data. The use of multiple statistical models in the context of consensus scoring was also suggested by the authors, who observed that averaging the predictions of many top-rated models can lead to better predictions than any individual model.

At about the same time, two research groups published another GA variant, also using the Selwood data set as a benchmark.
The algorithm was termed MUSEUM (Mutation and Selection Uncover Models) by Kubinyi and Evolutionary Programming (EP) by Luke.40,73 The major difference between this algorithm and GFA is the absence of a crossover operator and its reliance solely on point mutation to generate new solutions. Independently, both groups discovered other excellent three-descriptor combinations that might have been found by GFA but were probably destroyed during the evolution because GFA did not employ elitism to preserve the best solutions for the next generation.

Very recently, Waller and Bradley proposed a novel variable selection technique called FRED (Fast Random Elimination of Descriptors), which contains elements from both evolutionary programming and Tabu search.46 In contrast to the other common genetic and evolutionary algorithms, the complete solutions (i.e., the descriptor combinations) are not propagated to the next generation; rather, only those descriptors are retained which contribute positively to the genetic makeup of the fittest solutions. Descriptors that appear to be less useful are kept in a Tabu list, and are subsequently eliminated if they are not found to be beneficial during later iterations. Application of the FRED algorithm to the Selwood data yielded a final population of three-descriptor combinations that can be represented by 13 different input variables. This analysis was consistent with the results of the previously published methods. In particular, the selection of descriptors shared much similarity with the sets of descriptors chosen in the top GFA, MUSEUM, or EP models. The authors argued that it would be more difficult for poorer descriptors to be masked by some exceptionally good combination of descriptors and subsequently proliferate to the next generations, because only potentially good (single) descriptors are passed to subsequent generations.
It should be emphasized that the result of the FRED algorithm is to prune a potential list of descriptors by eliminating the less relevant descriptors; at the end of the calculation the best solutions are not necessarily guaranteed by the algorithm. Accordingly, one interesting utility of the FRED algorithm would be to treat it as a pre-filter for redundant descriptors so that an exhaustive enumeration could be applied. All of the above variants of GAs are used in conjunction with MLR. A natural extension is to replace MLR with PLS, which is often regarded as a modern alternative and has also played a critical role in the development of the CoMFA methodology. Interestingly, relatively few researchers have investigated methods of variable selection for PLS analysis in the past (other
than filtering out the variables with insignificant variance). One explanation might be that PLS has a high tolerance towards noisy data, and any number of input variables may be used. This attitude has changed somewhat over the past few years, as more people have begun to recognize the benefits of feature selection, and the use of hybrid approaches such as GA-PLS or GOLPE-PLS has become increasingly popular. Some examples of the application of GA-PLS include the QSAR studies performed by Funatsu and coworkers.74-78 In a QSAR study of 35 dihydropyridine derivatives,74 these researchers discovered that the cross-validation statistics (q2 = 0.69) of the GA-PLS model based on only six descriptors were superior to those of the full PLS model using all 12 descriptors (q2 = 0.62). Furthermore, elimination of the less relevant descriptors makes the QSAR model more interpretable, and the selected descriptors were consistent with an earlier analysis by Gaudio et al., who had performed an extensive investigation on the same set of compounds.79 The usefulness of variable selection in PLS analysis was further demonstrated in a subsequent QSAR study of 57 benzodiazepines.75 Two GA-PLS models, based on 10 and 13 descriptors, yielded essentially identical q2 values (0.84), and were again significantly better than the model derived from a PLS analysis using all 42 descriptors (q2 = 0.71). The apparent improvement in predictivity was verified by an external validation. Using D-optimal design, the data set was partitioned into a training set of 42 compounds and a test set of 15 compounds. The r2 values of the test predictions for the two GA-PLS models were 0.70 and 0.74, respectively, which compares favorably to the result of the full PLS model (r2 = 0.59). Overall, the results from this and other research groups underscore the value of descriptor selection in the context of QSAR modeling.80
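The common loop behind the GA-based selection schemes discussed in this section (a population of descriptor subsets, a regression model as the fitness evaluator, crossover, point mutation, and elitism) can be sketched as follows. This is a generic illustration under stated assumptions, not an implementation of GFA, MUSEUM, EP, FRED or GA-PLS: the fitness is the LOO q2 of an MLR model rather than the LOF measure, the data are synthetic, and all names and parameter values (`ga_select`, population size, mutation rate) are hypothetical.

```python
import numpy as np

def loo_q2(X, y):
    """LOO q2 of an MLR model; LOO residuals come from the hat-matrix identity
    e_loo = e / (1 - h_ii), so no explicit refitting is needed."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))
    return 1.0 - np.sum(loo_residuals ** 2) / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, n_select=3, pop_size=30, n_generations=40,
              mutation_rate=0.3, seed=0):
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    cache = {}

    def fitness(subset):
        if subset not in cache:
            cache[subset] = loo_q2(X[:, list(subset)], y)
        return cache[subset]

    def as_subset(indices):
        return tuple(sorted(int(i) for i in indices))

    population = [as_subset(rng.choice(n_desc, n_select, replace=False))
                  for _ in range(pop_size)]
    history = []
    for _ in range(n_generations):
        ranked = sorted(set(population), key=fitness, reverse=True)
        elite = ranked[0]
        history.append(fitness(elite))
        parents = ranked[:max(2, len(ranked) // 2)]
        children = [elite]                          # elitism: best subset survives
        while len(children) < pop_size:
            a, b = rng.integers(len(parents), size=2)
            pool = sorted(set(parents[a]) | set(parents[b]))
            child = set(as_subset(rng.choice(pool, n_select, replace=False)))  # crossover
            if rng.random() < mutation_rate:        # point mutation: swap one descriptor
                child.discard(int(rng.choice(sorted(child))))
                unused = [d for d in range(n_desc) if d not in child]
                child.add(int(rng.choice(unused)))
            children.append(as_subset(child))
        population = children
    best = max(set(population), key=fitness)
    return best, fitness(best), history

# Synthetic data (assumed): activity depends on descriptors 0, 3 and 7 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 3] + 1.0 * X[:, 7] + 0.1 * rng.normal(size=60)

best, best_q2, history = ga_select(X, y)
print("selected descriptors:", best, "LOO q2 = %.3f" % best_q2)
```

Because the search is stochastic, different seeds may return different top-ranked subsets on real data, which is why multiple independent runs are recommended above. Dropping the crossover step leaves a mutation-only scheme in the spirit of the MUSEUM and EP variants.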
Parameter Estimation

Artificial Neural Networks

The key strength of a neural network is its ability to allow for flexible mapping of the selected features by implicitly manipulating their functional dependence. Unlike multiple linear regression analysis, ANNs handle both linear and non-linear relationships without adding complexity to the model. This capability partly offsets the longer computing time required by a neural network simulation because it avoids the need for separate examination of each possible non-linearity.81 In addition, it has been suggested that neural networks are parsimonious approximators of mathematical functions;82 that is to say, an ANN tends to fit the required function with fewer parameters than other methods, and is particularly effective if non-linearity is evident in the data set. Thus, this type of approach seems to be exceptionally well suited for the study of quantitative structure-activity relationships.

The first applications of neural networks in the area of QSAR were published by Aoyama, Suzuki, and Ichikawa in 1990 with the promise that “the effective application of such neural networks may bring forth a breakthrough in the current state of QSAR analysis”.83,84 In their initial applications, neural networks were used to perform tasks that were previously accomplished by multiple linear regression analysis. Three data sets were examined: a set of mitomycin analogues with anticarcinogenic activity, a series of antihypertensive arylacryloylpiperazines, and a large series of benzodiazepines used as tranquilizers. In these studies, substituent fragment descriptors, together with a few structural indicator variables, were used to encode molecular structures. In all cases the neural networks were able to deduce QSAR models that were superior to MLR fits.
However, the use of an excessive number of connecting weights (in one example, 420 weights were used to fit 60 compounds) seemed questionable,85 partly because this contradicted a previously established guideline for MLR: the ratio of compounds versus parameters should be at least five.50 In addition, the authors included both linear and squared terms of molecular descriptors in the analysis, which seems unnecessary since an ANN ought to be able to uncover the appropriate functional transform of each descriptor.
Table 3.1. Descriptors used by Andrea and Kalayeh86

Descriptor   Terms
πx           Hydrophobic substituent parameter at substituent position x
MRx          Molar refractivity parameter at substituent position x
Σσ3,4        Sum of Hammett σ values of the 3 and 4 substituents
I1           Indicator variable to denote different assay conditions: = 1 for DHFR from Walker 256 leukemia tumor; = 0 for DHFR from L1210 leukemia tumor
I2           = 1 for compounds with a non-hydrogen substituent at R2; = 0 otherwise
I3           = 1 for compounds with R3 or R4 = Ph, CHPh, CONHPh, or C=CHCONHPh
I4           = 1 for compounds with a C6H4SO2OC6H4X group
I5           = 1 for compounds with R3 or R4 = (CH2)nPh (n=1,2,4,6), or (CH2)4OPh
I6           = 1 for compounds with bridges of the type CH2NHCONHC6H4X, CH2C(=O)N(R)C6H4X, or (CH2)3C(=O)N(R)C6H4X (R = H, Me)
The initial promise of the first ANN applications to the field of QSAR, as well as some obvious limitations, motivated many subsequent investigations aiming to gain a better understanding of this novel tool.85 These were exemplified by the outstanding work of Andrea and Kalayeh, who performed a comprehensive QSAR investigation of a large data set of dihydrofolate reductase (DHFR) inhibitors.86 This data set had been previously analyzed by Silipo and Hansch using MLR,87 and contains 256 compounds that were characterized using seven substituent descriptors augmented by six indicator variables (Table 3.1). It is noteworthy that the indicator variables had been introduced by Silipo and Hansch to capture certain commonalities and structural features that could not be easily explained using standard fragment descriptors. It should be recognized that the net effect of some combinations of indicator variables and substituent parameters is to encode non-linear effects. For example, one of the terms in a regression equation may be a fragment-based descriptor (e.g., MR) that reflects how activity generally increases with the size of a given substituent. But at the same time, there may be an indicator variable present in the equation that penalizes the presence of an excessively large group at the same position (i.e., the binding pocket has a finite size). Thus, the net effect is that the relationship between substituent size and bioactivity is non-linear. With the exception of I1, which accounts for possible differences in DHFR active sites or assay conditions, all indicator variables were related to substituent positions already encoded by other fragment parameters. In the published MLR model, the indicator variables explained a significant amount of variance, as well as many outliers in the data set.
For this data set, it was found that the r2 value of MLR decreased from 0.77 to 0.49, and the number of outliers (defined by the authors as those compounds with an absolute prediction error greater than 0.8) in the model increased from 20 to 61 when the indicator variables I2 to I6 were excluded from the analysis. However, because indicator variables provide little or no insight into the physicochemical factors that govern biological activity, their utility in de novo design of new analogues is limited. In this regard, it is encouraging to observe that it was not necessary to utilize these indicator variables in the ANN model. In fact, using only seven substituent descriptors and the non-structural I1, the ANN model yielded an r2 value of 0.85 and only 12 outliers. Thus, the neural network seemed to circumvent the need for indicator variables and was able to extract relevant information directly from the various hydrophobic, steric, and electronic parameters. In addition, Andrea and Kalayeh conducted a cross-validation experiment on a subset of 132 compounds (DHFR from Walker 256 leukemia tumors, i.e., compounds with I1 = 1) and obtained an r2cv value of
Fig. 3.1. Number of published reports on application of neural networks in the field of QSAR.
0.79 for ANN. This result once again compared very favorably with the corresponding statistics from MLR, which yielded r2cv values of only 0.64 and 0.30 for models with and without the use of indicator variables, respectively. In addition, in contrast to the first ANN applications of Aoyama and coworkers, Andrea and Kalayeh clearly demonstrated that a neural network implicitly handles higher-order effects and showed that it was not necessary to include non-linear transformations of the descriptors as inputs to the network.

Andrea and Kalayeh also presented the first example of ANN overfitting in the area of QSAR, demonstrating that the training error typically decreases with the number of hidden nodes, whereas the test set error initially decreases but later increases when an excessive number of hidden nodes is deployed. Furthermore, they used test set statistics as a criterion to select an optimal neural network architecture (Note: For this reason their test set should not be regarded as an external set in the true sense because it was involved during model building). They also proposed a parameter, ρ, which is the ratio of the number of data points to the number of network weights, to help define the optimal network architecture. Though it was later shown that ρ by itself may not be sufficient to minimize the risk of overfitting,88 the general principles that were elucidated in this work are still valid and have probably saved many researchers from the perils of flawed QSAR models, i.e., models that yield outstanding performance for the training set but have no predictivity for new compounds.

Following the publications of Aoyama and co-workers83,84 and Andrea and Kalayeh86 in the early 1990s, the use of ANN in the area of structure-activity relationships has flourished.
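The ratio ρ is straightforward to evaluate for a candidate architecture. The sketch below assumes a fully connected network with one hidden layer and counts bias terms as adjustable weights; whether biases were included in the published counts is not stated in the text, so this convention, the function names, and the five-hidden-node example are assumptions made for illustration.

```python
def count_weights(n_inputs, n_hidden, n_outputs=1, use_bias=True):
    """Adjustable weights in a fully connected one-hidden-layer network."""
    bias = 1 if use_bias else 0
    return (n_inputs + bias) * n_hidden + (n_hidden + bias) * n_outputs

def rho(n_data_points, n_weights):
    """Ratio of data points to network weights."""
    return n_data_points / n_weights

# The case criticized in the text: 420 weights fitted to 60 compounds.
rho_overparameterized = rho(60, 420)

# A hypothetical 8-input network (seven substituent descriptors plus I1)
# with five hidden nodes, applied to 256 compounds.
n_weights = count_weights(n_inputs=8, n_hidden=5)
rho_example = rho(256, n_weights)

print(f"rho = {rho_overparameterized:.2f} for 420 weights and 60 compounds")
print(f"rho = {rho_example:.2f} for {n_weights} weights and 256 compounds")
```

A value of ρ well below 1, as in the first case, means there are several adjustable weights per compound, which is the situation in which a near-perfect training fit says little about predictivity.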
Figure 3.1 shows a histogram of the number of publications related to the application of ANN in QSAR analysis according to a bibliographic search.89 These reports include many correlative structure-activity studies using standard descriptors,88,90-114 or some more novel descriptors such as topological indices,88,115,116 molecular similarity matrices,117-121 quantum chemistry descriptors,122-125 autocorrelation vectors of surface properties,19,126 hydrogen bonding and lipophilicity properties,127,128 and thermodynamic properties based on ligand-receptor
interactions.110 More recently, a number of novel ANN applications have reached beyond the premise of structure correlation with in vitro or in vivo potency, and have ventured to solve more challenging problems such as the prediction of pharmacokinetic properties,129-132 bioavailability,133 toxicity,106,125,134-141 carcinogenicity,142,143 and mechanism of action,144 or even the formulation of a discrimination function to distinguish drug-like from non-drug-like compounds145-147 (see Chapter 4). In addition to these quantitative studies, ANNs have also been employed as a visualization tool to reveal qualitative trends in SAR analyses.53,117,118,144,148 The widespread use of ANN in the area of molecular design has stimulated the continuous influx of novel neural network technologies to the field, including the introduction of Bayesian neural networks,145,149-151 cascade-correlation learning architecture,67,152 evolutionary neural networks,57 polynomial neural networks153 and intercommunication architecture.154 This cross-fertilization between artificial intelligence and pharmaceutical research is likely to continue as more robust ANN toolkits become commercially or freely available.

Many excellent technical reviews have been written on the application of ANN in QSAR, and interested readers are encouraged to refer to them.60,81,82,85,155-164 It is possible to summarize them in a set of general guidelines for the effective use of neural networks in QSAR modeling (and statistical modeling in general). First, the law of parsimony calls for the use of a small neural network, if possible. The number of adjustable parameters should be small with respect to the number of data points used for model construction, otherwise poor predictive behavior may result due to data overfitting. Neural network modeling can also benefit from the use of a large data set, which can facilitate location of a generalized solution from the underlying correlation in the data.
In addition, the training patterns must be representative of the data to be predicted. This will lead to a more realistic predictive performance on the external test set with respect to the training result. It is advantageous to make use of efficient training algorithms; for example, those that make use of second derivatives of the error function for the weight update, which can give better convergence characteristics. Finally the input descriptors must obviously be relevant for the data modeling process. The golden rule of “garbage in, garbage out” can never be over-emphasized.165
Hybrid Methods

GA-NN

The natural evolution of the next generation of QSAR methods is to apply artificial intelligence methods in both descriptor selection and parameter estimation. An example of such a hybrid approach was proposed by So and Karplus, who combined GA with ANN for QSAR analysis.119,120,166,167 This method, called genetic neural network (GNN), was first applied to an examination of the Selwood data set.166 The major aim of this work was to use a GA to select a suitable set of descriptors for use in the development of a QSAR. The effectiveness of the GA was demonstrated by its ability to select an optimal set of descriptors, as compared to exhaustive enumeration, in the GNN models. It appears that the improvement of the GNN QSAR over other published models (Table 3.2) is due to the selection of non-linear descriptors which the ANN is able to assimilate. In their next study,167 the core GNN simulator was improved by replacing the problematic steepest descent training algorithm with a more robust scaled conjugate gradient optimizer,168 leading to substantial performance gains in both convergence and speed of computation. To provide an extended test of the enhanced GNN simulator, it was applied to a set of 57 benzodiazepines which had been previously studied by Maddalena and Johnston using a backward elimination descriptor selection strategy and neural network training.99 It was found that the GNN protocol discovered a number of 6-descriptor QSAR models that are superior to the best (and arguably more complex) models reported by Maddalena and Johnston.
Table 3.2. Comparison of linear regression and neural network QSAR models of the Selwood data set;166 all models are based on three molecular descriptors

Model              Type             rtrn   rcv
Selwood66          Regression       0.74   0.67
Wikel41            Regression       0.77   0.68
GFA181             Regression       0.85   0.80
EP73 / MUSEUM40    Regression       0.85   0.80
GNN166             Neural network   0.92   0.87
After appreciable success with standard fragment-based 2D descriptors, So and Karplus extended the use of GNN to the analysis of a molecular similarity matrix to derive 3D QSAR models.119,120 Molecular similarity is a measure based on the similarity between the physical or structural attributes of a set of molecules.169 This type of descriptor differs from conventional substituent parameters (e.g., π, σ, and MR) in the sense that it does not encode physicochemical properties which are specific for molecular recognition. The similarity index is derived from numerical integration and normalization of the field values, and represents a global measure of the resemblance between a pair of molecules based on their spatial and/or electrostatic attributes. Thus, instead of a correlation between substituent properties and activities, a similarity-based QSAR method establishes an association between global properties and activity variation among a series of lead molecules. The implicit assumption is that globally similar compounds have similar activities.170 Figure 3.2 is a schematic diagram showing the different stages in the construction of a SMGNN (Similarity Matrix Genetic Neural network) QSAR model. The initial validation was performed on a corticosteroid-binding globulin steroid data set,119 which had been extensively studied in the past by many novel 3D-QSAR methods.17,19,20,22,117,119,171-176 The first SMGNN application focused mainly on method validation, in particular the sensitivity and effect of parameters related to: (a) electrostatic potential calculations (type of atomic charges; truncation scheme for electrostatic potential, and dielectric constant); (b) similarity index (Carbó,177 Hodgkin,178 linear and exponential formulae179); (c) grid parameters (spacing, size, and location); and (d) number of similarity descriptors in QSAR model. 
The results of the sensitivity studies demonstrated that the SMGNN QSAR model was very robust with respect to variation in most of the user-defined electrostatic parameters. Because the various similarity indices are highly correlated, the choice of index also had a negligible effect on the quality of the QSAR. The grid-related settings likewise had relatively little impact on the overall result. The key parameter seems to be the number of descriptors used in the model; it is important to have enough descriptors to characterize the data set but not so many that overfitting can arise. Overall, the SMGNN model is superior to those obtained from the PLS and GA-MLR methods, and also compares favorably with the results from other established 3D-QSAR methods. This approach was further validated using eight different data sets, with impressive results.120 The biological activities and physicochemical properties of a broad range of chemical classes were successfully correlated. One of the shortcomings of the SMGNN method is that interpretation of the QSAR model is difficult because the similarity index is not related to physicochemical attributes of the molecules. However, it is remarkable that the SMGNN QSAR model was consistent with all known SAR for CBG-steroid binding and, therefore, seems to handle the physical attributes leading to optimal binding in an implicit manner.
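The Carbó and Hodgkin similarity indices mentioned above can be illustrated with a minimal sketch. It assumes that each molecule's field (steric or electrostatic) has already been sampled on a common grid, so that the integrals reduce to sums over grid points; the function names and the three toy fields are hypothetical and are not taken from the SMGNN implementation.

```python
import math

def _overlap(u, v):
    """Discrete stand-in for the overlap integral of two fields on one grid."""
    return sum(a * b for a, b in zip(u, v))

def carbo_index(p_a, p_b):
    """Carbo index: overlap divided by the geometric mean of the self-overlaps."""
    return _overlap(p_a, p_b) / math.sqrt(_overlap(p_a, p_a) * _overlap(p_b, p_b))

def hodgkin_index(p_a, p_b):
    """Hodgkin index: overlap divided by the arithmetic mean of the self-overlaps."""
    return 2.0 * _overlap(p_a, p_b) / (_overlap(p_a, p_a) + _overlap(p_b, p_b))

def similarity_matrix(fields, index=carbo_index):
    """Pairwise similarity matrix; each row can serve as a descriptor vector."""
    n = len(fields)
    matrix = [[1.0] * n for _ in range(n)]  # self-similarity is 1 for both indices
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = index(fields[i], fields[j])
    return matrix

# Hypothetical field values on a three-point grid.
field_a = [1.0, 2.0, 3.0]
field_b = [2.0, 4.0, 6.0]  # same shape as field_a, twice the magnitude
field_c = [3.0, 2.0, 1.0]
matrix = similarity_matrix([field_a, field_b, field_c])
```

The Carbó index responds only to the shape of the field, whereas the Hodgkin index also penalizes a difference in magnitude, as the pair field_a and field_b shows. On realistic molecular fields the indices are nevertheless highly correlated, in line with the sensitivity study described above.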
Fig. 3.2. Schematic diagram for the construction of SMGNN 3D-QSAR models. Reprinted with permission from: So S-S, Karplus M. J Med Chem 1997; 40:4347-4359. ©1997 American Chemical Society.
Modeling Structure-Activity Relationships
Recently, Borowski and coworkers implemented and extended the SMGNN methodology to evaluate a set of 5-HT2A receptor antagonists in a 3D QSAR study.121 The data set included 26 2- and 4-(4-methylpiperazino)pyrimidines, as well as clozapine, which was used as a reference compound. Due to molecular symmetry, the pyrimidines can have multiple mappings to the clozapine reference structure. Five alternative alignment schemes were suggested by the authors, and the q2 values of the models from each alignment were compared with the values derived from 30 randomly chosen alignment sets, which served as a baseline to test statistical significance. The alignment set with a particularly high predictivity was assumed to contain the correct superimposition of the bioactive conformations of these molecules. An interesting finding was that, although it was recognized that the piperazine nitrogen ought to be protonated upon binding to the 5-HT2A receptor, setting an explicit positive charge on the ligand was detrimental to the performance of the SMGNN QSAR model. This is because the charge has a pronounced effect in the electrostatic calculation, rendering the similarity indices less discriminating. One suggestion from the authors is to consider only the neutral, deprotonated form during the similarity calculation. The best steric and electrostatic SMGNN models both contain five descriptors, and yield q2 values of 0.96 and 0.93, respectively. Both models are significantly better than random models with scrambled output values, which return q2 values of 0.28 ± 0.16 and 0.29 ± 0.14, respectively. In summary, the results of this independent study strongly support the use of the SMGNN methodology in 3D QSAR studies.
The research group of Peter Jurs is also very active in the development and application of GA-NN type hybrid methods in QSAR and QSPR studies.138 In their procedure, the full data set is usually divided into three parts.
The majority (70-80%) of the compounds belong to the training set (tset), and the remainder are usually evenly divided to give a cross-validation set (cvset) to guide model development, and an external prediction set (pset) to validate the newly developed QSAR models. Jurs has defined three types of statistical models derived from their multivariate data analysis:
• Type I model is a linear regression model whose selection of descriptors is based on a stochastic combinatorial optimization routine (e.g., SA or GA);
• Type II model is a non-linear ANN model that directly adopts the descriptors used in the Type I model;
• Type III model is a fully non-linear ANN model developed in conjunction with SA or GA descriptor selection.
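The three-way partition described above can be sketched as follows. This is an illustration only: the text does not say how compounds are assigned to the three sets, so random assignment is assumed here, and the function name, the seed and the 100-compound example are hypothetical.

```python
import random

def split_tset_cvset_pset(compounds, tset_fraction=0.8, seed=0):
    """Randomly partition compounds into training, cross-validation and prediction sets.

    The training set receives `tset_fraction` of the compounds; the remainder is
    divided as evenly as possible between cvset and pset.
    """
    rng = random.Random(seed)
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_tset = round(tset_fraction * len(shuffled))
    remainder = shuffled[n_tset:]
    n_cvset = (len(remainder) + 1) // 2  # cvset takes the extra compound if odd
    return shuffled[:n_tset], remainder[:n_cvset], remainder[n_cvset:]

# Hypothetical data set of 100 compound identifiers.
tset, cvset, pset = split_tset_cvset_pset(range(100), tset_fraction=0.8)
```

The pset plays no part in descriptor selection or network training; it is held back so that the final model can be judged on compounds it has never seen.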
The quality of a model is based on the following fitness (or cost) function (Eq. 3.8):38
$$\mathrm{cost} = \mathrm{rms}_{\mathrm{tset}} + 0.4\,\left|\mathrm{rms}_{\mathrm{tset}} - \mathrm{rms}_{\mathrm{cvset}}\right| \qquad \text{(Eq. 3.8)}$$
where rms_tset and rms_cvset are the rms errors of the training and cross-validation sets, and the coefficient of 0.4 was determined empirically to yield models with enhanced external predictivity. In a recent application this GA-NN method was used to study the QSAR of 157 compounds with inhibitory activity against acyl-CoA:cholesterol O-acyltransferase (ACAT), a biological target implicated in the reduction of triglyceride and cholesterol levels.38 Twenty-seven compounds were removed from the initial data set due to high experimental uncertainty of their IC50 values, and the remaining compounds were partitioned to obtain tset, cvset and pset with 106, 11, and 13 compounds, respectively. A large number of descriptors were generated using their in-house automated data analysis and pattern recognition toolkit (ADAPT) software package, and were pruned according to minimal-redundancy and variance criteria. The best Type I model is a nine-descriptor MLR with an rms error of 0.42 log units for the tset and 0.43 log units for the pset. Using the same set of descriptors, they generated a Type II ANN-based model that has significantly lower rms errors of 0.36 and 0.34 log units for the tset and pset. Finally, to take full
advantage of the non-linear modeling capability of ANN, they conducted a more comprehensive search using a combined GA-NN simulation. The top Type III model employed eight descriptors and yielded an rms error of 0.27 for both tset and pset. Four of the eight descriptors used in the Type III model are identical to those selected by the linear Type I model. It is suggested that the descriptors unique to the Type III model provide relevant information that is non-linear in nature. The general applicability of the GA-NN approach in QSAR was further verified in another study where a large set of sodium ion-proton antiporter inhibitors were investigated.180 Following the established procedure, Kauffman and Jurs divided the 113 benzoylguanidine derivatives into a 91-member tset, an 11-member cvset, and an 11-member pset. Using an SA feature selection algorithm, they searched for predictive models containing from 3 to 10 descriptors. The optimal Type I linear regression model used 5 descriptors and yielded an rms error of 0.47 and r2 = 0.46 for the tset. The predictive performance for the pset was, however, rather poor (rms = 0.55; r2 = 0.01), indicating a general deficiency of this linear model. The replacement of MLR by ANN in functional mapping led to moderate improvement. The corresponding Type II model reported an rms error of 0.36 and r2 of 0.68 for the tset, and significantly lower prediction errors for the pset compounds (rms = 0.42 and r2 = 0.44). The greatest increase in accuracy was seen in the construction of the Type III model, where the rms error of the tset dropped to 0.28 and r2 increased to 0.81. The corresponding pset statistics were 0.38 and 0.44, respectively. The authors also explored the consensus scoring concept proposed by Rogers and Hopfinger,181 and examined the effect of prediction averaging using a committee of five ANNs.
They confirmed that the composite predictions were more reliable than those from individual predictors, largely because they make better use of the available information. The rms error of the prediction set for the consensus model was 0.30 compared to an error of 0.38 ± 0.09 from the five separate trials. This result is also consistent with an earlier GNN study on the Selwood data set, stating that averaging of the outputs of the top-ranking GNN models led to marginally better cross-validation statistics compared to the individual models.166
One major drawback of the QSAR model derived from the ADAPT descriptors concerns the difficulty of designing novel analogues with desirable bioactivity. For example, the five descriptors selected by the GA in the non-linear Type III model were MDE-14, which is a topological descriptor encoding the distance (edges) between all primary and quaternary carbon atom pairs; GEOM-2, the second major geometric moment of the molecule; DIPO-0, the dipole moment; PNSA-3, a combined descriptor with atomic charge weighted partial negative surface area; and RNCS-1, the negatively charged surface area. Because these are whole-molecule descriptors, even a seemingly small substituent modification (e.g., changing methyl to hydroxy) can sometimes lead to significant changes of the entire set of descriptor values. Thus, even when an optimal set of descriptor values is known, it can still be a challenging task to engineer a molecule that fulfils the necessary conditions. One brute-force solution is to enumerate a massive virtual library and deploy the QSAR model as a filtering tool. Another possibility is to perform iterative structure optimization using the predicted activity as the cost function. The latter approach is the basis of many de novo design programs. For example, the EAinventor package provides an interface between a structure optimizer, with some embedded synthetic intelligence, and a user-supplied scoring function.
This is a powerful combination which creates a synergy between synthetic consideration and targeted potency (see Chapter 5). In the conclusion of a recent review article on neural networks,85 Manallack and Livingstone wrote: “We feel that the combination of GAs and neural networks is the future for the [QSAR] method, which may also mean that these methods are not limited to simple structure-property relationships, but can extend to database searching, pharmacokinetic prediction, toxicity prediction, etc.”
Novel applications utilizing hybrid GA-NN approaches are beginning to appear in the literature, and will be discussed in detail in Chapter 4.
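Two ideas from the Jurs-group studies discussed above lend themselves to a short sketch: the cost function of Eq. 3.8 and the averaging of predictions over a committee of networks. The cost function is written here in the form cost = rms_tset + 0.4 · |rms_tset − rms_cvset|, which is an assumption of this sketch; the function names and the numerical values are hypothetical.

```python
import math

def rms_error(predicted, observed):
    """Root-mean-square error between predicted and observed activities."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def cost(rms_tset, rms_cvset, penalty=0.4):
    """Assumed form of Eq. 3.8: training error plus a penalty on the tset/cvset gap."""
    return rms_tset + penalty * abs(rms_tset - rms_cvset)

def committee_prediction(member_predictions):
    """Average the predictions of several models, compound by compound."""
    n_members = len(member_predictions)
    return [sum(column) / n_members for column in zip(*member_predictions)]

# Hypothetical observed activities and predictions from three committee members.
observed = [6.0, 7.0, 8.0]
members = [
    [6.4, 7.2, 7.6],
    [5.8, 6.6, 8.4],
    [6.1, 7.3, 8.1],
]
consensus = committee_prediction(members)
consensus_rms = rms_error(consensus, observed)
mean_member_rms = sum(rms_error(m, observed) for m in members) / len(members)
```

The penalty term favors models whose training and cross-validation errors are close, which discourages fits that succeed only on the training set. For any set of members, the rms error of the averaged prediction cannot exceed the mean of the individual rms errors, which is one reason committee predictions tend to be more reliable.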
Comparison to Classical QSAR Methods
Chance Correlation, Overfitting, and Overtraining
In the examples presented in the previous section, we have discussed the utility of GA for the selection of descriptors to be used in combination with multivariate statistical methods in QSAR applications. Although variable selection is appropriate for the typical size of a data set in conventional QSAR studies (i.e., 20-200 descriptors), it may still carry a great risk of chance correlation, particularly if the initial pool of descriptors exceeds the number of data objects.40 The ratio of the number of descriptors to the number of objects used in model building can be a useful parameter indicating the likelihood of chance correlation. As a general guideline, it has been suggested that a ratio greater than five indicates that GA-optimized descriptor selection may produce unreliable statistical models.80 Obviously, other factors that are related to signal-to-noise, redundancies, and collinearity in the data can also be critical. To further reduce this risk, it is also recommended that randomization tests be performed as an integral part of standard validation procedures in any application that involves descriptor selection. In addition, it may be beneficial to implement an early stopping point during GA evolution in order to prevent overfitting of data. Based on empirical observation, the fitness of the population usually increases very rapidly during the early phase, and then the improvement slowly levels off. The reason for this behavior is that the useful information in the data is usually modeled quite rapidly during the initial stage. Later, the GA begins to fit the noise or idiosyncrasies of the data to the model, sometimes using additional parameters.
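The randomization test recommended above can be sketched as follows. To keep the example short, descriptor selection is represented by a deliberately simple stand-in, namely picking the single descriptor with the highest r2; a GA-based selection would take its place in practice. All names and the noise data are hypothetical.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between one descriptor and the activity."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def best_descriptor_fit(descriptors, activity):
    """Stand-in for descriptor selection: the best single-descriptor r2 in the pool."""
    return max(r_squared(column, activity) for column in descriptors)

def randomization_test(descriptors, activity, n_trials=200, seed=0):
    """Repeat the selection on scrambled activities and compare with the real fit."""
    rng = random.Random(seed)
    real_fit = best_descriptor_fit(descriptors, activity)
    scrambled_fits = []
    for _ in range(n_trials):
        scrambled = list(activity)
        rng.shuffle(scrambled)  # destroys any true structure-activity relationship
        scrambled_fits.append(best_descriptor_fit(descriptors, scrambled))
    # Fraction of scrambled runs that fit at least as well as the real data.
    p_value = (1 + sum(f >= real_fit for f in scrambled_fits)) / (n_trials + 1)
    return real_fit, scrambled_fits, p_value

# Hypothetical data: 15 compounds and a pool of 60 pure-noise descriptors.
rng = random.Random(1)
activity = [rng.gauss(6.0, 1.0) for _ in range(15)]
noise_pool = [[rng.gauss(0.0, 1.0) for _ in range(15)] for _ in range(60)]
real_fit, scrambled_fits, p_value = randomization_test(noise_pool, activity)
```

With far more descriptors than compounds, the selected descriptor can correlate noticeably with the activity even though the pool contains no signal at all. Comparing real_fit with the distribution of scrambled_fits is what reveals whether an apparently good model is distinguishable from chance.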
To determine an optimal stopping point for GA optimization, Leardi suggested a criterion that is based on the difference in the statistical fit between the real and the randomized data set.80 In this scheme, the evolution cycle that corresponds to the maximum difference between the two sets of statistics is considered an optimal termination point. Related to this concept, another intriguing idea is to combine the statistics gathered in both real and randomized training, yielding a composite cost function that may be used to evaluate individual solutions during the course of GA optimization (Dr. Andrew Smellie, personal communication). It is also known that the use of GA can sometimes produce solutions containing non-essential descriptors hidden within a subset of useful descriptors.46 A useful means to eliminate these irrelevant descriptors from the GA selection is through a hybridization operator, which periodically examines the entire population and discards the non-contributing descriptors using a backward elimination procedure.80 This idea originated from the observation that forward selection in the GA-selected subset can greatly reduce the number of irrelevant inputs.182 It has been demonstrated that ANN often produces superior QSAR models compared to models derived by the more traditional approach of multiple linear regression. The key strength of the neural network is that, with the presence of hidden layers, neural networks are able to perform non-linear mapping of the molecular descriptors to the biological activities of the compounds. The quality of the fit to the training data can be particularly impressive for networks that have many adjustable weights. Under such circumstances, the neural network simply memorizes the entire data set and behaves effectively as a look-up table. Thus, it is doubtful that the network would be able to extract a relevant correlation of the input patterns and give a meaningful assessment of other unknown examples. 
This phenomenon is known as overfitting of data: a neural network may reproduce the training data set almost perfectly yet make erroneous predictions for unseen test objects. It is fair to point out that the purpose of QSAR is to understand the forces governing the activity of a particular class of
compounds, and to assist drug design, and that a look-up table will therefore not aid medicinal chemists in the design of new drugs. What is needed is a system that is able to provide reasonable predictions for compounds that are previously unknown. The use of ANN with an excessive number of network parameters should therefore be avoided. There are two advantages of adopting networks with relatively few processing nodes. First, the efficiency of each node increases and, consequently, the time of the computer simulation is reduced significantly. Second, and probably more importantly, the network can generalize the input patterns better, and this often results in superior predictive power. However, caution is again needed to ensure that the network is not overconstrained, since a neural network with too few free parameters may not be able to learn the relevant relationships in the data. Such an analysis will collapse during training and again no reliable predictions can be obtained. Thus, it is important to find an optimal network topology to deliver a balance between these two extreme situations. While the numbers of nodes in the input and output layers in a neural network are typically pre-determined by the nature of the data set, the users can control the number of hidden units (and consequently the number of adjustable weights) in the network. It has been suggested that a parameter, ρ, can help to determine an optimal setting for the number of hidden units.86 The definition of ρ is the ratio of the number of data points in the training set to the number of adjustable network parameters. The number of network variables is simply the sum of the number of connections and the number of biases in the network. A three-layered backpropagation network with I input units, H hidden units and O output units will have H × (I + O) connections and H + O biases. The total number of adjustable parameters is therefore H × (I + O) + H + O.
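The parameter count and the ρ ratio defined above can be written down directly. The sketch below also lists the hidden-layer sizes for which ρ falls inside a chosen window; the function names are hypothetical, and the example of 100 training compounds with five input descriptors and one output is made up for illustration.

```python
def n_parameters(n_input, n_hidden, n_output):
    """Adjustable parameters of a three-layer network: connections plus biases."""
    connections = n_hidden * (n_input + n_output)
    biases = n_hidden + n_output
    return connections + biases

def rho(n_training, n_input, n_hidden, n_output):
    """Ratio of training data points to adjustable network parameters."""
    return n_training / n_parameters(n_input, n_hidden, n_output)

def hidden_units_within(n_training, n_input, n_output, low, high, max_hidden=50):
    """Hidden-layer sizes whose rho lies strictly between low and high."""
    return [
        h for h in range(1, max_hidden + 1)
        if low < rho(n_training, n_input, h, n_output) < high
    ]

# Hypothetical case: 100 training compounds, 5 descriptors, 1 output,
# using the 1.8 < rho < 2.2 window quoted in the text.
candidates = hidden_units_within(100, 5, 1, low=1.8, high=2.2)
```

For this example only a network with seven hidden units (50 parameters, ρ = 2.0) falls inside the window; six hidden units give ρ ≈ 2.33 and eight give ρ ≈ 1.75.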
The range 1.8 < ρ < 2.2 has been suggested by Andrea and Kalayeh as a guideline for acceptable ρ values.86 It is claimed that for ρ > 3, the network will have difficulty generalizing from the data. The concept of the ρ ratio has made a significant impact upon the design of neural network architecture in many subsequent QSAR studies.85 It is now possible to make a reasonable initial choice for the number of hidden nodes. Nevertheless, the suggested range of 1.8 < ρ < 2.2 is perhaps empirical, and is also expected to be case-dependent. For example, some redundancies may already exist in the training patterns, so that the effective number of data points is in fact smaller than anticipated. On a related note, there is another rough guideline that allows the user to choose the number of hidden nodes independently from the number of data points. It is the so-called geometric pyramid rule, which states that the number of nodes in each layer follows a pyramid shape, decreasing progressively from input layer to output layer in a geometric ratio.60 That is to say, a good starting estimate of the number of hidden nodes will be the geometric mean of the numbers of input and output nodes in the network. Overtraining of a neural network is related to the problem of overfitting of data. While overfitting is often regarded as a problem with excessive neural network parameters, overtraining refers to a prolonged training process. Both can significantly influence the quality of a model. Interestingly, the profile of training error as a function of the epoch cycle is very similar to the situation with the GA-optimization previously discussed. During the initial phase of ANN training, the neural network will quickly establish some crude rules that lead to a rapid minimization of the error. 
In the mid-phase of the training process, the neural network will begin to learn fine structure and other peculiarities which may in part be due to simple memorization.85 Correspondingly, the rate of decrease of the training error will slow down significantly. An obvious solution to this problem is to stop training before the final convergence of rms error so that the neural network has optimal predictive capability (forced stop).183 This can be achieved through the use of a disjoint validation (or control) data set to monitor the course of neural network training process. The training is halted when the predictive performance of the control set begins to deteriorate.
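The forced stop described above can be sketched as a loop that monitors the error of the control set after every epoch. The sketch is independent of any particular network implementation: run_epoch and control_error are hypothetical callables supplied by the user, and the patience parameter (how many epochs without improvement are tolerated) is an assumption of this sketch rather than part of the published procedure.

```python
def train_with_forced_stop(run_epoch, control_error, max_epochs=1000, patience=10):
    """Train until the control-set error has not improved for `patience` epochs.

    run_epoch()      performs one pass of weight updates on the training set.
    control_error()  returns the current error on the disjoint control set.
    Returns the epoch with the lowest control-set error and that error.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()
        error = control_error()
        if error < best_error:
            best_error, best_epoch = error, epoch  # a real run would save the weights here
        elif epoch - best_epoch >= patience:
            break  # predictive performance has been deteriorating for a while
    return best_epoch, best_error

class SimulatedTraining:
    """Toy stand-in for a network: control error falls until epoch 30, then rises."""
    def __init__(self):
        self.epoch = 0
    def run_epoch(self):
        self.epoch += 1
    def control_error(self):
        return 0.3 + 0.001 * (self.epoch - 30) ** 2

sim = SimulatedTraining()
best_epoch, best_error = train_with_forced_stop(sim.run_epoch, sim.control_error)
```

In the simulated run the control-set error is lowest at epoch 30, and training is abandoned ten epochs later. The weights stored at the best epoch, not those at the final epoch, would then be used for prediction.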
Functional Dependencies
One of the major criticisms of the application of ANN in QSAR research is that neural networks lack transparency, and are often very difficult to interpret. This contrasts with an MLR equation, where the influence of each molecular descriptor on the biological activity is immediately obvious from the coefficients. To improve the interpretability of the ANN model, a technique known as “functional dependence monitoring” has been introduced (see also Chapter 5, in particular Figure 5.6 and the corresponding text passages). Usually, all but one of the input parameters are kept constant, and the one remaining input descriptor is varied between the minimum and the maximum of its known range (but sometimes extrapolation is allowed). A prediction of biological activity is made for each of these sets of descriptor values, and the resulting plot shows the functional dependence of the biological activity on that descriptor. This procedure is repeated for all input descriptors. It is hoped that the identification of the functional dependence will assist medicinal chemists in the design of more useful analogues. To demonstrate the utility of the approach, this monitoring scheme was applied to a set of dihydrofolate reductase (DHFR) inhibitors which were extensively studied by Hansch and coworkers.184 In Hansch’s analysis of a set of 68 2,4-diamino-(5-substituted benzyl)pyrimidine analogues, a correlation equation was formulated (Eq. 3.9):
Eq. 3.9
The corresponding functional dependence plots were made after successful training of a neural network on the same set of compounds using MR’5, MR’3, MR4, and π’3 as inputs (Figure 3.3). In this calculation, the non-variable input descriptors were pegged at a quarter of their maximum ranges. The corresponding plots from the regression model are also shown in Figure 3.3. It is evident that the neural network result is consistent with the regression equation. Both the neural network and the regression analysis suggest that the biological activity is linearly dependent on MR’5 and MR’3, and a parabolic dependence is found for MR4. It is clear from Equation 3.9 that the functional dependence on π’3 is highly non-linear; remarkably, the neural network came up with a very similar plot using this monitoring scheme. Building a regression equation as complex as Equation 3.9 cannot be inspired by a flash of brilliance; it requires a laborious development phase. In regression analysis, the inclusion of higher-order and cross-product terms is often made on a trial-and-error basis, and the staggering diversity of such terms often makes this task very difficult. Despite this shortcoming, it is fair to point out that with a careful design and the inclusion of appropriate non-linear transformations of descriptors, a multivariate regression model can achieve results that are comparable in quality to a well-trained ANN. To date, the most successful studies in this area are reported by Lucic and coworkers, who have demonstrated that the use of nonlinear multivariate regressions can sometimes outperform ANN models for a number of benchmark QSAR and QSPR data sets.185,186 This success was achieved through a very efficient descriptor selection routine, using an enlarged descriptor set containing squared and cross-product terms. In neural networks non-linearity is implicitly handled.
The descriptor monitoring scheme permits a crude analysis of the functional form and highlights those molecular descriptors that possibly play an important role for biological activity. Furthermore this technique can be used to identify whether and how non-linear terms contribute to the QSAR analysis.
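The monitoring scheme lends itself to a compact sketch: one input is scanned across its range while the others are held at fixed baseline values, and the model is queried at each point. The predict argument stands for any trained model; the toy model below is hypothetical (it is not Equation 3.9), chosen only because it is linear in its first input and parabolic in its second.

```python
def functional_dependence(predict, baseline, index, low, high, n_points=11):
    """Scan one input descriptor from low to high while all others stay at baseline.

    Returns a list of (descriptor value, predicted activity) pairs.
    """
    curve = []
    for k in range(n_points):
        value = low + (high - low) * k / (n_points - 1)
        inputs = list(baseline)
        inputs[index] = value  # only the monitored descriptor changes
        curve.append((value, predict(inputs)))
    return curve

def toy_model(x):
    """Hypothetical stand-in for a trained network."""
    return 6.0 + 0.9 * x[0] + 0.8 * x[1] - 0.2 * x[1] ** 2

baseline = [0.5, 1.0]  # hypothetical fixed values for the non-varied inputs
linear_curve = functional_dependence(toy_model, baseline, index=0, low=0.0, high=4.0, n_points=9)
parabolic_curve = functional_dependence(toy_model, baseline, index=1, low=0.0, high=4.0, n_points=9)
optimum_value = max(parabolic_curve, key=lambda point: point[1])[0]
```

Plotting each curve reproduces the kind of panel shown in Figure 3.3. Because the other inputs are frozen, the scan shows a one-dimensional slice through the model, and its shape may change if a different baseline is chosen.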
Fig. 3.3A. Biological activity as a function of the individual physicochemical parameter for a neural network model (solid line) and multivariate regression model of Equation 3.9 (dashed line).
The Inverse QSAR Problem
Many QSAR studies using neural networks have been published, some of them reporting resounding success. Such ANN applications often conclude with the discovery of a statistically significant correlation, or sometimes with validation using an external test set. From a practical
Fig. 3.3B.
standpoint, the more important aspect concerns molecular design, and it would therefore be interesting to use these models to predict the required structural features leading to the generation of novel bioactive analogs.
Fig. 3.4. Core structure of benzodiazepines and the substituent numbering scheme.
The reason that the so-called Inverse QSAR Problem appears to be tricky is that the functional form connecting the input descriptors with biological activity is not a simple mathematical relationship. This contrasts with regression analysis, where a linear equation is well-defined and optimal values for certain molecular properties can be readily identified. For example, optimal values of MR4 and π’3 in Equation 3.9 are known to be 1.85 and 0.73. One remedy for a neural network model is to monitor the functional dependence of individual descriptors and guess what range of values would be optimal. So and Karplus attempted to identify more potent benzodiazepine (BZ) ligands based on this approach.167 They applied the genetic neural network (GNN) methodology to a series of 57 benzodiazepines that had been studied by Maddalena and Johnston.99 The core structure of BZ is shown in Figure 3.4, together with the conventional numbering scheme for its substituents. These BZs contain six variable substituent positions, though the data were extensive for only positions R7, R1, and R2'. Each substituent was parameterized by seven physicochemical descriptors: dipole moment (µ), lipophilicity (π), molar refractivity (MR), polar constant (F), resonance constant (R), and the Hammett meta (σm) and para (σp) constants. Using the GNN methodology, three top-ranking 6-descriptor neural network models were developed: Model I used π7, F7, MR1, σm2, π6, and MR8; Model II used MR1, π7, σm7, σm2', MR6', and σm8; and Model III used MR1, π7, σm7, σm2', σp2', and σp6'. These QSAR models yielded r2 values of 0.91 and q2 values in the range 0.86-0.87. In this study the authors also attempted to predict new compounds by a minimum perturbation approach in which one substitution position was focused on at a time.
In theory, an exhaustive screening of all six physicochemical parameter spaces, or a more elaborate multivariate experimental design method, can be applied to searching for improved BZ analogues, but such procedures were not attempted. Instead, the most active compound in the training set served as the template, which was then subjected to small structural modifications. To minimize the likelihood of generating compounds that had poor steric fit with the receptor, new substituents at a given position were restricted to those that were not significantly greater in bulk than the largest known substituent from a compound of at least moderate activity. The influence of each of the six parameters on the biological activity was monitored by a modified functional dependence monitoring procedure. In this implementation, all but one of the inputs were fixed at the parameter values that corresponded to the substituents of the template, and the one remaining input descriptor was varied between the
Fig. 3.5A. Predicted pIC50 values as a function of the descriptors that have been chosen by GNN. The minimum and maximum values of each descriptor are: µ (-4.13 to 1.53), π (-1.34 to 1.98), MR (0.92 to 19.62), F (-0.07 to 0.67), R (-0.68 to 0.20), σm (-0.16 to 0.71), and σp (-0.66 to 0.78). The scale is linear throughout the range for each descriptor. Reprinted with permission from: So S-S, Karplus M. J Med Chem 1996; 39:5246-5256. © 1996 American Chemical Society.
minimum and the maximum of its known range. The resulting functional dependence plots are shown in Figure 3.5. Based on these plots, it was concluded that the most potent compounds would have a value of MR1 within the range of 1 to 3, thus limiting the substitution option to hydrogen. The relatively flat dependence of the MR8 and σm8 curves indicated that modification at this substituent position was unlikely to lead to significant change in potency and, therefore, additional synthetic effort concerning regiospecific substitution at this position might not be justified. The plot for the 2'-position suggested that increasing the σm and σp parameters would enhance the predicted receptor affinity. Thus, the template, which has a Cl atom in the 2'-position, was replaced with a substituent that had greater σm and σp values than those of Cl. Three relatively simple substituents, CN, NO2 and SO2F, were identified. Examination of Figure 3.5 also suggests that the activity might be improved by decreasing the descriptor values (π6', MR6', and σp6') relative to hydrogen at the 6'-position. To this end, substituents such as NH2 and F were considered. Due to the symmetry of the 2'- and 6'-position, the three new 2'-substituents (CN, NO2, and SO2F) which were considered earlier were probed for the 6'-position as well. Finally, the dependence plot for position 7 suggested that an increase in both lipophilicity and polar effect would also increase activity. However, finding suitable molecular fragments with such characteristics is non-trivial, since the two properties are naturally anti-correlated, although CH2CF3, SO2F, and OCF3 are feasible candidates. On the basis of this analysis, a number of compounds containing these favorable combinations were suggested, most of them possessing predicted activities considerably higher than that of the template compound. 
This design exercise shows that through careful analysis it is possible to utilize neural network QSAR models to design compounds that are predicted to be more potent. Although it is relatively easy to implement the minimum perturbation method outlined above, it should be recognized that this stepwise approach is unlikely to yield compounds that correspond to the optimal combination of input descriptor values. This is because descriptors from each position are treated independently so that the effect of inter-descriptor coupling
Fig. 3.5B.
amongst substituents is not taken into account. To overcome this limitation, ANN may, for example, serve as a fitness function in evolutionary molecular design. This approach was pioneered by Schneider and Wrede for the example of peptide de novo design, and later extended to arbitrary small molecules.187 For further details, see Chapter 5. An alternative design philosophy is to apply a combinatorial optimization technique to perform an extensive search on descriptor space and determine optimal combinations of parameter values, and then map these values to appropriate sets of functional groups. This idea was explored by Burden and coworkers, who applied a GA to search for novel DHFR inhibitors that have maximal predicted potency according to an ANN QSAR model.103 Using a data set of 256 diaminodihydrotriazines, they established a 5-8-1 neural network model that accurately predicted the pIC50 values against two different tumor cell lines. The five descriptors used were π3, π4, MR3 and MR4, the hydrophobicity substituent parameters and molar refractivities at the 3- and 4-positions, respectively, and Σσ3,4, the sum of the Hammett parameter at these two positions. Upon completion of neural network training, they utilized a commercial GA package to probe the activity surface. Three different strategies were suggested to conduct these searches. The first was to constrain the search strictly within descriptor ranges defined by the training set, and the second allowed for a +/-10% extrapolation for each descriptor. Both can be regarded as conservative measures. In the third search, the parameter range explored by the GA was bound by substituent values that are chemically reasonable (i.e., -1 < σ < 1.5, -2 < π < 6, 0 < MR < 100). The results obtained for the L1210 cell lines are shown in Table 3.3. The optimal values determined by the GA search were π3 = 5.34, MR3 = 32.2, π4 = -1.88, MR4 = 15.3, and Σσ3,4 = -0.91. 
The chemical groups that have parameter values closest to the optima were identified, based on the tabulated values of substituent parameters,188 and a few novel analogues were proposed. As expected, these compounds were predicted by the model to have high potency (pIC50 = 8.89-9.35), which is considerably greater than that of the most potent compound in the training set (pIC50 = 8.37). Although the four hydrophobicity and molar refractivity parameters are within the scope of the training compounds, it should be noted that the descriptor values of Σσ3,4 were outside the range of the training set, so that the increase in pIC50 value should be treated with appropriate caution.
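The two stages of this design strategy, a GA search of descriptor space followed by mapping of the optimum onto real substituents, can be sketched as follows. The sketch makes several assumptions that are not taken from the study: a simple real-coded GA replaces the commercial package, a hypothetical quadratic function stands in for the trained network, the ±10% extrapolation is taken as 10% of each descriptor's training range, and substituents are ranked by unscaled Euclidean distance. The bounds are the chemically reasonable π and MR ranges quoted in the text, and the π4 and MR4 values are those listed in Table 3.3.

```python
import random

def extrapolated_bounds(train_min, train_max, fraction=0.10):
    """Widen each training-set range by a fraction of its width (assumed definition)."""
    return [(lo - fraction * (hi - lo), hi + fraction * (hi - lo))
            for lo, hi in zip(train_min, train_max)]

def ga_maximise(predict, bounds, pop_size=40, generations=60, mutation=0.1, seed=0):
    """Minimal real-coded GA: truncation selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=predict, reverse=True)
        history.append(predict(population[0]))
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [v + rng.gauss(0.0, mutation * (hi - lo))
                     for v, (lo, hi) in zip(child, bounds)]
            # Clip so the search never leaves the permitted descriptor ranges.
            children.append([min(max(v, lo), hi) for v, (lo, hi) in zip(child, bounds)])
        population = parents + children
    best = max(population, key=predict)
    history.append(predict(best))
    return best, history

def nearest_substituent(target, table):
    """Substituent whose tabulated descriptor values lie closest to the target."""
    return min(table, key=lambda name: sum((t - v) ** 2 for t, v in zip(target, table[name])))

def surrogate(x):
    """Hypothetical stand-in for the trained network (inputs: pi, MR)."""
    return 9.0 - (x[0] + 1.0) ** 2 - 0.01 * (x[1] - 20.0) ** 2

bounds = [(-2.0, 6.0), (0.0, 100.0)]  # chemically reasonable pi and MR ranges
best, history = ga_maximise(surrogate, bounds)

# (pi4, MR4) values for four of the R4 groups listed in Table 3.3.
r4_table = {
    "NHSO2CH3": (-1.18, 18.2),
    "CH2CH2CO2H": (-0.29, 16.5),
    "Cyclopropyl": (1.14, 13.5),
    "N-propyl": (1.55, 15.0),
}
closest = nearest_substituent((-1.88, 15.3), r4_table)
```

Because the best individuals are carried over unchanged, the best predicted activity never decreases from one generation to the next. In practice the descriptors would be scaled before distances are computed, since π and MR span very different ranges, and the candidates would be re-scored with the QSAR model itself rather than ranked by distance alone.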
Fig. 3.5C.
It is fair to say that reports showing real practical applications of GA and ANN in pharmaceutical design are still relatively sparse. However, several practical applications are discussed in Chapter 5. One outstanding recent example is reported by Schaper and coworkers, who have developed neural network systems that can distinguish the affinity for 5-HT1A and α1-adrenergic receptors using a data set of 32 aryl-piperazines.114 The pKi values of the training set range from 5.3 - 8.7 and 4.8 - 8.3 for 5-HT1A and α1-adrenergic receptors, respectively. However, no significant selectivity is observed in any compound in the training set. Using SAR information derived from the neural network systems, three new analogs were specifically designed to validate the predictions. These compounds were synthesized and their experimental and predicted pKi values against the two receptors are listed in Table 3.4. The discovery of a potent and highly selective 5-HT1A ligand is evidently a great triumph in QSAR research and computer-aided molecular design in general. There is great optimism that the use of adaptive systems in lead optimization will continue to flourish in the future.
Outlook and Concluding Remarks
Because QSAR methods, in principle, require only ligand information, they will remain the best computational probes in cases where little or no information is available about the 3D structure of the therapeutic target. In our opinion, the next major advance in QSAR research will come from innovative techniques that provide the desperately needed increase in data handling capacity. Just a decade ago, scientists routinely screened only tens of compounds each year, the typical size of a “classical” QSAR data set. Now, however, new technologies, including combinatorial chemistry, robotic high-throughput screening and miniaturization, make it possible to screen tens of millions of compounds a year. Such a vast volume of experimental data demands a corresponding expansion of the capacity of computational analysis. Obtaining correlations from data sets of this order of magnitude is not a trivial task, although the problem becomes more tractable when practical issues are taken into consideration. Due to the sometimes fuzzy nature of HTS data, new analysis methods must cope with a higher degree of noise. In particular, precise numerical modeling or prediction of biological activity, a characteristic of traditional QSAR approaches, may no longer be
necessary or even appropriate. It is to be expected that exciting new developments in QSAR methods will continue to emerge. At the same time, one must realize that there is no guarantee of success even with the most spectacular technological advances at hand. Ultimately, the impact of any computational tool will depend critically on its implementation and integration in the drug discovery process, and on the readiness of medicinal chemists to consider “computer-generated designs” for their work. It is our conviction that medicinal and computational chemists both have responsibility to optimize the balance between the exploration and exploitation of a hypothesis. On this note, we would like to conclude this Chapter with a comment by Hugo Kubinyi:48 “There is a fundamental controversy between statisticians and research scientists, i.e., between theoreticians and practitioners, as to how statistical methods should be applied with or without all the necessary precautions to avoid chance correlations and other pitfalls. Statisticians insist that only good practice guarantees a significance of the results. However, this is most often not the primary goal of a QSAR worker. Proper series design is sometimes impossible owing to the complexity of synthesis or to a lack of availability of appropriate starting materials, problems that are often underestimated by theoreticians. A medicinal chemist is interested in a quick and automated generation of different hypotheses from the available data, to continue his research with the least effort and maximum information, even if it is fuzzy and seemingly worthless information. Predictions can be made from different regression models and the next experiments can be planned on a more rational basis to differentiate between alternative hypotheses.”

Table 3.3. Optimal descriptor values and substituents identified by a GA search103

Cpd.  R3            π3    MR3   R4                π4     MR4   Σσ3,4  Predicted pIC50
I     GA values     5.34  32.2  GA values         -1.88  15.3  -0.91  9.79
II    CH2Si(C2H5)3  3.26  43.5  NHSO2CH3          -1.18  18.2  -0.42  9.35
III                             OCH2CON(CH2CH2)O  -1.39  34.9  -0.43  9.27
IV                              CH2CH2CO2H        -0.29  16.5  -0.28  9.14
V                               Cyclopropyl        1.14  13.5  -0.42  9.10
VI                              N-propyl           1.55  15.0  -0.34  8.89
References
1. Hansch C, Fujita T. ρ−σ−π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 1964; 86:1616-1626.
2. Kubinyi H, ed. QSAR: Hansch Analysis and related approaches. In: Mannhold R, Krogsgaard-Larsen P, Timmerman H, eds. Methods and Principles in Medicinal Chemistry. Weinheim: VCH, 1993.
3. van de Waterbeemd H, ed. Chemometric methods in molecular design. In: Mannhold R, Krogsgaard-Larsen P, Timmerman H, eds. Methods and Principles in Medicinal Chemistry. Weinheim: VCH, 1995.
4. van de Waterbeemd H, ed. Advanced computer-assisted techniques in drug discovery. In: Mannhold R, Krogsgaard-Larsen P, Timmerman H, eds. Methods and Principles in Medicinal Chemistry. Weinheim: VCH, 1995.
5. Labute P. A widely applicable set of descriptors. J Mol Graph Model 2000; 18:464-477.
6. Bajorath J. Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening. J Chem Inf Comput Sci 2001; 41:233-245.
7. UNITY. Tripos, Inc., St. Louis, MO.
Table 3.4. Experimental and predicted pKi values against 5-HT1A and α1-adrenergic receptors for the three designed compounds according to ANN QSAR models114
Compound No.
33 34 35
α1
5-HT1A
Expt pKi
Pred pKi
Expt pKi
Pred pKi
7.6 8.5 9.2
7.4 8.8 8.8
5 Molecular weight > 500 Number of hydrogen donor groups > 5 Number of hydrogen acceptor groups >10
The beauty of this rule lies in its simplicity. Because all parameters can be easily computed, the Pfizer Rule (or its variants) has become the most widely applied filter in virtual library design today. However, it should be stressed that compliance with the rule does not necessarily make a molecule drug-like. In fact, the Pfizer Rule by itself appears to be a rather ineffective discriminator between drugs and nondrugs. Frimurer et al showed that using the above criteria, only 66% of the compounds in the MDL Drug Data Report (MDDR) database, which contains compounds with demonstrated biological activities, were classified as drug-like, whereas 75% of the supposedly nondrug-like compounds from the Available Chemical Directory (ACD) were in fact regarded as drug-like.5 In other words, if the primary objective is to isolate drugs from nondrugs in the broadest sense, the Pfizer Rule fares no better than making close to random assignments. Obviously, a more complex set of logical rules is required to recognize molecules with drug-like properties. Independently, two research groups investigated the use of artificial neural networks to develop virtual screening tools that can distinguish between drug-like and nondrug-like molecules. The results of their work were published in two back-to-back articles in the Journal of Medicinal Chemistry in 1998.6,7 The first paper was a contribution from Ajay, Walters, and Murcko at Vertex Pharmaceuticals.6 They selected a set of approximately 5000 compounds from the Comprehensive Medicinal Chemistry (CMC) database to serve as a surrogate for drug-like molecules. They also chose a similar number of “drug-size” compounds from the ACD to represent molecules that were nondrug-like. Seven simple 1D descriptors were generated to encode each molecule: molecular weight, number of hydrogen-bond donors, number of hydrogen-bond acceptors, number of rotatable bonds, ²κα (the degree of branching of a molecule), aromatic density, and logP.
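Because the rule consists of four simple threshold tests, it can be written as a very small filter. The following Python sketch assumes that the four properties have already been computed by some external tool; the `Properties` container, the function names and the `max_violations` tolerance are hypothetical choices of ours. The molecular weight, donor and acceptor limits are those listed above, and the logP limit of five is the one quoted in the logP section of this Chapter.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Precomputed properties of one molecule (hypothetical container)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def rule_of_5_violations(p):
    """Count how many of the four rule criteria are exceeded."""
    checks = [
        p.mol_weight > 500,   # molecular weight > 500
        p.logp > 5,           # logP > 5
        p.h_donors > 5,       # hydrogen-bond donors > 5
        p.h_acceptors > 10,   # hydrogen-bond acceptors > 10
    ]
    return sum(checks)


def passes_filter(p, max_violations=0):
    """Accept a molecule if it exceeds at most `max_violations` criteria.

    The tolerance is an assumption of this sketch; library designers
    choose it according to how strict the filter should be.
    """
    return rule_of_5_violations(p) <= max_violations
```

For example, `passes_filter(Properties(350.0, 2.1, 2, 5))` accepts a small, moderately lipophilic molecule, whereas a compound with a molecular weight of 650, a logP of 6.2 and 12 acceptors exceeds three criteria and is rejected.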
To augment these 1D features, Ajay and coworkers also considered a second set of 2D descriptors. These were the 166 binary ISIS keys, which encode the presence or absence of certain substructural features in a given molecule. A Bayesian neural network (BNN) was trained on a subset of 7,000 compounds, which comprised approximately equal numbers of compounds from the CMC and ACD sets. The trained neural network was then applied to the remaining CMC and ACD compounds that were outside the training set. As an external validation they also tested their network on a large collection of compounds from the MDDR database, which was assumed to contain mostly drug-like candidates. The accuracy of classification for the test predictions using different combinations of 1D and 2D descriptor sets is summarized in Table 4.1. Neural network models using seven 1D descriptors alone classified about 83% of the CMC compounds as drugs, and about 73% of the ACD set as nondrugs. The majority (~65%) of the MDDR compounds were predicted to be drug-like, which was in accordance with general
Prediction of Drug-Like Properties
Table 4.1. Average drug-likeness prediction performance of a Bayesian neural network with five hidden nodes on 10 independent test sets6

Descriptors       CMC / %   ACD / %   MDDR Drug-Like / %
7 1D              81-84     71-75     61-68
166 ISIS          77-79     81-83     83-84
7 1D + 166 ISIS   89-91     88-89     77-79
7 1D + 71 ISIS    88-90     87-88     77-80
expectation. When 2D ISIS descriptors were utilized the classification accuracy for the ACD (82%) and the MDDR (83%) compounds improved significantly, though this was at the expense of inferior prediction for the CMC set (78%). The combined use of 1D and 2D descriptors yielded the best prediction overall. The classification accuracy of both CMC and ACD approached 90% and, in addition, about 78% of the MDDR compounds were classified as drug-like. Furthermore, the Vertex team was able to extract the most informative descriptors and suggested that all seven 1D and only 71 out of 166 ISIS descriptors provided relevant information to the neural network. It was demonstrated that the prediction accuracy of a neural network using this reduced set of 78 descriptors was essentially identical to the full model. Finally, to demonstrate the utility of this drug-likeness filter, the researchers conducted a series of simulated library design experiments and concluded that their system could dramatically increase the probability of picking drug-like molecules from a large pool of mostly nondrug-like entities.
Another neural network-based drug-likeness scoring scheme was reported by Sadowski and Kubinyi from BASF.7 They selected 5,000 compounds each from the World Drug Index (WDI) and ACD, to serve as their databases of drug-like and nondrug-like compounds. The choice of molecular descriptors in their application was based on Ghose and Crippen atom types,8 which have been successfully used in the prediction of other physicochemical properties such as logP. In this study, each molecule was represented by the actual count for each of the 120 atom types found. The full set of descriptors was pruned to a smaller subset of 92 that were populated in at least 20 training molecules, a procedure designed to safeguard against the neural network learning single peculiarities. Their neural network, a 92-5-1 feedforward model, classified 77% of the WDI and 83% of the ACD compounds correctly.
Application of the neural network to the complete WDI and ACD databases (containing > 200,000 compounds) yielded similar classification accuracy. It was noteworthy that, in spite of this apparently good predictivity, Sadowski and Kubinyi did not advocate the use of such a scoring scheme to evaluate single compounds because they believed that there was still considerable risk of misclassifying molecules on an individual basis. Instead, they believed that it would be more appropriate to apply this as a filter to weed out designs with very low predicted scores. Recently, Frimurer and coworkers from Novo Nordisk extended these earlier works by attempting to create a drug-likeness classifier that uses a neural network trained with a larger set of data.5 Again, MDDR and the ACD were used as the sources of drug-like and nondruglike entities. The MDDR compounds were partitioned into two sets. The first set represents ~ 4,500 compounds that have progressed to at least Phase I of clinical trials (i.e., they should be somewhat drug-like), and the second was a larger collection of ~ 68,500 molecules that have the status label of “Biological Testing” (i.e., lead-like). To decrease the redundancy of the data sets, a diversity filter was applied to the data set so that any MDDR compounds that had a Tanimoto coefficient (based on ISIS fingerprints) of greater than 0.85 amongst themselves
were removed. This procedure discarded about 100 compounds from the drug-like MDDR set and 8,500 compounds from the lead-like set. To reinforce the nondrug-like nature of the ACD set, any compounds similar to the 4,400 MDDR drug-like set (Tanimoto coefficient greater than 0.85) were eliminated, leaving 90,000 ACD compounds for data analysis. After removing the redundant entries, the 4,400 MDDR drug-like set was partitioned into 3,000 training compounds and 1,400 test compounds, and the ACD compounds into a 60,000-member training set and a 30,000-member test set. The 60,000 lead-like MDDR compounds were not utilized in any way during model construction, but were used only as external validation data. Each compound was represented by three molecular descriptors (number of atoms, number of heavy atoms, and total charge) in addition to 77 CONCORD atom-type descriptors encoding the frequency of occurrence (normalized by the entire data set) of particular atom types. Empirically, it was concluded that the optimal neural network configuration contained 200 hidden nodes, based on the quality of test set predictions. This neural network gave training and test Matthews correlation coefficients of 0.65 and 0.63, respectively (Eq. 2.18).9 Using a threshold value of 0.5 as the criterion to distinguish drug from nondrug, the neural network correctly classified 98% of the ACD compounds but only 63% of the MDDR drug-like set. By lowering the prediction threshold, an increasing number of MDDR drug-like compounds would be correctly identified, at the expense of more false positives for the ACD set. They claimed that a threshold value of 0.15 (anything above that was classified as a drug) was an optimal cutoff, providing the best discrimination between the two data sets. At this threshold value, 88% of both the MDDR drug-like set and the ACD set were correctly classified. In addition, 75% of the MDDR lead-like molecules were also predicted as drug-like.
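The interplay between the prediction threshold and the Matthews correlation coefficient can be made concrete with a short Python sketch. It uses the standard definition of the coefficient in terms of true and false positives and negatives, which we assume corresponds to Eq. 2.18; the function names are ours, and scores and labels are supplied by the caller.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores above `threshold` mean 'drug'."""
    tp = fp = tn = fn = 0
    for score, is_drug in zip(scores, labels):
        predicted_drug = score > threshold
        if predicted_drug and is_drug:
            tp += 1
        elif predicted_drug and not is_drug:
            fp += 1
        elif not predicted_drug and not is_drug:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denominator
```

Evaluating `confusion_counts` over a series of thresholds and passing the result to `matthews_cc` shows directly how lowering the cutoff trades false negatives for false positives.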
The decrease in percentage from the drug-like to the lead-like set was not unexpected given that there may still be some intrinsic differences between the two classes of compounds. Finally, Frimurer and coworkers probed for the most informative descriptors that allowed for discrimination between drugs and nondrugs by setting each of the 80 descriptors systematically to a constant value (zero was used in this case) and monitoring the variation in training error of each sub-system. They argued that the removal of an important descriptor from the input would lead to a substantial increase in training error. Fifteen key descriptors were identified by this method; they were aromatic X-H, non-aromatic X-H, C-H, C=O, sp2 conjugated C, =N, non-aromatic N, N=C, non-aromatic O, sp2 O, sp2 P, F, Cl, number of atoms, and total charge. The performance of their neural network was commendable, even with this vastly reduced set of descriptors. Using a prediction threshold of 0.15, 82% of the MDDR and 87% of the ACD compounds were correctly classified. An interesting aspect of the drug-likeness scoring function that was briefly discussed in the Frimurer publication concerns the setting of the threshold value. For example, if the purpose of the scoring function is to limit the number of false positive predictions, then a higher cutoff value should be used for the threshold. Table 4.2 gives the percentages of ACD and MDDR compounds that are correctly classified using different cutoff values. As reported, a model with a higher cutoff value contributes fewer false positives (i.e., nondrugs that were predicted as drug-like), although this comes at the expense of worse MDDR classification.
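The descriptor-probing procedure described above, in which each input is set to zero in turn and the resulting change in training error is recorded, can be sketched as follows. The logistic function merely stands in for a trained network, and the mean squared error is our assumed error measure; neither is taken from the Frimurer study.

```python
import math


def make_logistic_model(weights, bias):
    """Return a stand-in 'trained model': a logistic function of the inputs."""
    def model(row):
        z = bias + sum(w * v for w, v in zip(weights, row))
        return 1.0 / (1.0 + math.exp(-z))
    return model


def mean_squared_error(model, rows, targets):
    """Average squared difference between model output and target."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def knockout_ranking(model, rows, targets):
    """Rank descriptors by the error increase caused by zeroing each one.

    Returns (error_increase, descriptor_index) pairs, largest increase first.
    """
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for j in range(len(rows[0])):
        # Copy the data with descriptor j replaced by the constant zero.
        knocked_out = [row[:j] + [0.0] + row[j + 1:] for row in rows]
        error = mean_squared_error(model, knocked_out, targets)
        increases.append((error - baseline, j))
    return sorted(increases, reverse=True)
```

Descriptors at the top of the ranking are those the model relies on most; inputs whose removal leaves the error essentially unchanged are candidates for pruning.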
It is important to keep in mind that the cutoff value should be set depending on whether false positives or false negatives are more harmful for the intended application.10 In a typical virtual screening application, we usually like to first identify and then remove molecules that are predicted to be nondrug-like from a large compound library. Let us assume x is the percentage of the compounds that are actually drugs in the library, and that PD is the probability that a drug is correctly identified as drug-like, and pN is the probability that a nondrug is correctly identified as nondrug-like. To gauge the performance of the drug-likeness scoring function one would compute what percentage of the compounds that were flagged as drug-like were actually drugs. This quantity, denoted henceforth as drug fraction, is given by Equation 4.1:
Table 4.2. Percentage of ACD and MDDR compounds that are correctly predicted with their corresponding threshold values in the drug-likeness classifier of Frimurer et al5

Cutoff   % ACD Correctly Predicted   % MDDR Correctly Predicted
0.05     72                          95
0.15     88                          88
0.35     95                          74
0.50     98                          63
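The drug fraction follows directly from the definitions of x, PD and pN given above: the compounds flagged as drug-like are the correctly identified drugs plus the misclassified nondrugs. The Python sketch below is our reading of those definitions, with x expressed as a percentage and PD and pN approximated by the MDDR and ACD percentages of Table 4.2; the function and variable names are ours.

```python
# Cutoff -> (% ACD correctly predicted, % MDDR correctly predicted), Table 4.2.
TABLE_4_2 = {0.05: (72, 95), 0.15: (88, 88), 0.35: (95, 74), 0.50: (98, 63)}


def drug_fraction(x_percent, p_drug, p_nondrug):
    """Fraction of compounds flagged as drug-like that really are drugs.

    x_percent: percentage of true drugs in the library (0-100).
    p_drug:    probability that a drug is recognized as drug-like (PD).
    p_nondrug: probability that a nondrug is recognized as nondrug-like (pN).
    """
    true_positives = x_percent * p_drug
    false_positives = (100.0 - x_percent) * (1.0 - p_nondrug)
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def fractions_for_library(x_percent):
    """Drug fraction at each cutoff of Table 4.2 for a given library."""
    return {
        cutoff: drug_fraction(x_percent, mddr / 100.0, acd / 100.0)
        for cutoff, (acd, mddr) in TABLE_4_2.items()
    }
```

Calling `fractions_for_library(1.0)`, for instance, evaluates all four cutoffs for a library in which only 1% of the compounds are drugs, the regime in which the choice of threshold matters most.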
\[ \text{drug fraction} = \frac{x \, P_D}{x \, P_D + (100 - x)(1 - p_N)} \] (Eq. 4.1)
If we assume that for a given threshold value, PD and pN take the values of %MDDR and %ACD that are correctly classified, we can plot how drug fraction varies with x, the percentage of drugs in the complete library. Figure 4.1 shows the hypothetical curves for each threshold value listed in Table 4.2. In all cases the drug scoring function gives substantial enrichment of drugs after the initial filtering. This is particularly true in situations where the fraction of actual drug molecules in the library is very small, a phenomenon that is perhaps reminiscent of reality. Based on statistics reported by Frimurer et al, the reduction of false positives is, in fact, the key to this kind of virtual screening application, and therefore a high threshold value should be set. Thus, although on a percentage basis a 0.15 threshold seems the most discriminating (88% of both ACD and MDDR), the premise under which virtual screening is applied calls for more rigorous removal of false positives, even at the expense of a loss of true positives. Finding a generally applicable scoring function to predict the drug-likeness of a molecule will remain one of the most sought-after goals for pharmaceutical researchers in the coming years. The tools that exist today can discriminate between molecules that come from presumably drug-like (e.g., MDDR, CMC, WDI) or nondrug-like (e.g., ACD) databases. In our opinion, the majority of the MDDR and CMC compounds should be regarded as lead-like and not, strictly speaking, drug-like. Ideally, the drug-like training set should contain only drugs that have passed all safety hurdles. We also believe that the nondrug set should consist of molecules that have close resemblance to marketed drugs (i.e., at least somewhat lead-like) but were abandoned during pre-clinical or clinical development.
We anticipate that the analysis will benefit from the more rigorous definition of drug and nondrug because the intrinsic difference—presumably owing to their pharmacokinetics or toxicological characteristics—between them will be amplified. In a recent review article, Walters, Ajay, and Murcko wrote:11 “[What] we may witness in coming years might be attempts to predict the various properties that contribute to a drug’s success, rather than the more complex problem of ‘druglikeness’ itself. These might include oral absorption, blood-brain barrier penetration, toxicity, metabolism, aqueous solubility, logP, pKa, half-life, and plasma protein binding. Some of these properties are themselves rather complex and are likely to be extremely difficult to model, but in our view it should be possible for the majority of properties to be predicted with better-than-random accuracy.”
This divide-and-conquer approach to drug-likeness scoring also brings better interpretability to the result. The potential liability of a drug candidate becomes more transparent, and
Fig. 4.1. Graphs showing how drug fraction varies with the percentage of drugs in the library (see text).
an appropriate remedy can be sought out accordingly. In the following sections of this Chapter we will discuss the role played by adaptive modeling and artificial intelligence methods in the prediction of individual properties that contribute to the overall drug-likeness of a molecule.
Physicochemical Properties
An implicit statement of the Pfizer Rule is that a drug must have a “balanced” hydrophilic-lipophilic character. Two physicochemical parameters have the most profound influence on the drug-like properties of a molecule: (i) aqueous solubility, which is critical to drug delivery; and (ii) hydrophobicity, which plays a key role in drug absorption, transport and distribution.
Aqueous Solubility
A rapidly advancing area of modern pharmaceutical research is the prediction of the aqueous solubility of complex drug-sized compounds from their molecular structures. The ability to design novel entities with sufficient aqueous solubility can bring many benefits to both preclinical research and clinical development. For example, accurate activity measurements can be obtained only if the substance is sufficiently soluble, that is, above the detection limits of the assay. Otherwise, a potentially good SAR can be obscured by apparent poor activity due to insufficient solubility rather than inadequate potency. Finding a ligand with adequate solubility is also a key factor that determines the success of macromolecular structure determination. In X-ray crystallography, the formation of crystals appears to be very sensitive to the solubility of ligands. Most biostructural NMR experiments require ligands dissolved at a relatively high concentration in a buffer. At a more downstream level in drug development, the solubility of a drug candidate has perhaps the most profound effect on absorption. Although pro-drug strategies or special methods in pharmaceutical formulation can help to increase oral absorption, the solubility largely dictates the route of drug administration and, quite often, the fate of the drug candidates.
The aqueous solubility of a substance is often expressed as the logarithm of its molar solubility (mol/L), or logS. It is suggested that solubility is determined by three major thermodynamic components that describe the solubilization process.4 The first is the crystal packing energy, which measures the strength of a solid lattice. The second is the cavitation energy, which accounts for the loss of hydrogen bonds between the structured water upon the formation of a cavity to host the solute. The third is the solvation energy, which gauges the interaction energy between the solute and the water molecules. To account for these effects, a number of experimental and theoretical descriptors have been introduced to solubility models in past years. They include melting points,12-14 cohesive interaction indices,15 solvatochromic parameters,16,17 shape, electronic and topological descriptors,18-23 and mobile order parameters.24 Most of this work has been summarized in an excellent review by Lipinski et al,4 and will not be discussed here. In this Section, we will focus on some of the most recent developments involving the use of neural networks to correlate a set of physicochemical or topological descriptors with experimental solubility.
The earliest neural network-based solubility model in the literature was reported by Bodor and coworkers.18 Fifty-six molecular descriptors, which mostly accounted for geometric (e.g., surface, volume, and ovality), electronic (e.g., dipole moment, partial charges on various atom types) and structural (e.g., alkenes, aliphatic amines, number of N-H bonds) properties, were generated from the AMPAC optimized structures of 311 compounds. Empirically, Bodor et al determined that 17 out of the 56 descriptors seemed most relevant for solubility and the resulting 17-18-1 neural network yielded a standard deviation of error of 0.23, which was superior to the corresponding regression model (0.30), based on identical descriptors.
In spite of such success, we think that there are two major deficiencies in this neural network model. First, the use of 18 nodes in the hidden layer may be excessive for this application, given that there are only ~ 300 training examples. Second, some of the 17 input descriptors are, in our opinion, redundant. For example, the inclusion of functional transforms of a descriptor (e.g., QN2 and QN4 are functions of QN) might be unnecessary because a neural network should be able to handle such mapping implicitly. To overcome such limitations PCA and smaller networks could be applied.
The research group of Jurs at Pennsylvania State University has investigated many QSPR/QSAR models for a wide range of physical or biological properties based on molecular structures.21-23,25-31 Recently, they published two solubility studies using their in-house ADAPT (Automated Data Analysis and Pattern recognition Toolkit) routine and neural network modeling.22,23 Briefly, each molecule was entered into the ADAPT system as a sketch and the three-dimensional structure was optimized using MOPAC with the PM3 Hamiltonian. In addition to topological indices, many geometric and electronic descriptors, including solvent-accessible surface area and volume, moments of inertia, shadow area projections, gravitational indices, and charged partial surface area (CPSA) were computed. To reduce the descriptor set they applied a genetic algorithm and simulated annealing techniques to select a subset of descriptors that yielded optimal predictivity of a ‘validation’ set (here, a small set of test molecules that was typically 10% of the training set). In the first study,22 application of the ADAPT procedure to 123 organic compounds led to the selection of nine descriptors for solubility correlation. The rms errors of the regression and the 9-3-1 neural network models were 0.277 and 0.217 log units, respectively.
In the next study,23 the same methodology was applied to a much larger data set containing 332 compounds, whose solubility spanned a range of over 14 log units. The best model reported in this study was a 9-6-1 neural network yielding a rms error of 0.39 log units for the training compounds. It is noteworthy that there was no correspondence between any of the current descriptors to the set that was selected by their previous model. A possible explanation is that the ADAPT descriptors may be highly inter-correlated and therefore the majority of the descriptors are interchangeable in the model with no apparent loss in predictivity.
Perhaps the most comprehensive neural network studies of solubility were performed by Huuskonen et al.32-34 In their first study,32 system-specific ANN models were developed to predict solubility for three different drug classes, which comprised 28 steroids, 31 barbituric acid derivatives, and 24 heterocyclic reverse transcriptase (RT) inhibitors. The experimental logS of these compounds ranged from -5 to -2. For each class of compounds, the initial list of descriptors contained ~ 30 molecular connectivity indices, shape indices, and E-state indices. Five representative subgroups of descriptors were established, based on the clustering of their pairwise Pearson correlation coefficients. A set of five parameters was then selected, one from each subgroup, as inputs to a 5-3-1 ANN for correlation analysis. Several five-descriptor combinations were tried, and those that gave the best fit of training data were further investigated. To minimize overtraining, an early stopping strategy was applied, so that the training of the neural network stopped when the leave-one-out cross-validation statistics began to deteriorate. The final models yielded q2 values of 0.80, 0.86, and 0.72 for the steroid, barbiturate, and RT inhibitor classes, respectively. Overall, the standard error of prediction was approximately 0.3 to 0.4 log units. Since each ANN was optimized with respect to a specific compound class, it was not surprising that application of solubility models derived from a particular class to other classes of compounds yielded unsatisfactory results. It was more surprising, however, that the effort to derive a universal solubility model applicable to all three classes of compounds also proved unsuccessful (Note: They could, in theory, obtain reasonable predictivity for the combined set if an indicator variable was introduced to specify each compound class. However, this would obviously defeat the purpose of a generally applicable model).
One possible explanation is that the combined data set (83 compounds in total) contained compounds segregated in distinct chemical spaces and it would be difficult to find a set of common descriptors that could accurately account for the behavior of each group of compounds. In their next study,33 Huuskonen et al collated experimental solubilities of 211 drugs and related analogs from the literature. This set of compounds spanned approximately six log units (from -5.6 to 0.6), which was almost twice the range of their previous study. Thirty-one molecular descriptors, which included 21 E-state indices, 7 molecular connectivity indices, number of hydrogen donors, number of hydrogen acceptors, and an aromaticity indicator, were used initially in model building. The number of descriptors was later pruned to a subset of 23 by probing the contribution of each individual parameter. The final ANN model had a 23-5-1 configuration, and yielded r2 = 0.90, and s = 0.46 for the 160-member training set, and r2 = 0.86, s = 0.53 for the remaining 51 test compounds. Besides these descriptive statistical parameters, the authors also published the individual predictions for these compounds. Because all 24 RT inhibitors from the previous study were part of the data set, this allowed us to investigate the relative merit of a system-specific solubility model versus a generally applicable model. Of the 24 compounds, 20 were selected for the training set, and 4 were used as test compounds. Figure 4.2(a) shows the predicted versus observed aqueous solubilities for the RT inhibitors in their previous system-specific model, and Figure 4.2(b) is the corresponding plot for the predicted solubilities from the general purpose model. It is clear that, although the predictions of most RT inhibitors were within the correct solubility range (logS ~ –2 to –5), a comparison of individual predictions for this class of compounds reveals very weak correlation (r2 = 0.16; s = 0.73).
This result contrasted sharply with the very good predictivity (r2 = 0.73; s = 0.41) when the RT inhibitors had been considered on their own.32 This supports the notion that a system-specific solubility predictor is more accurate than a general one, though the former obviously has only limited scope. Thus, one must choose an appropriate prediction tool depending on the nature of the intended application. For instance, if the emphasis is on a single class of compounds we should consider the construction of a specialist model (provided there are sufficient experimental data for the series) or recalibrate the general model by seeding its training set with compounds of interest.
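A comparison of this kind, in which a general model is scored on a single compound class, requires only the observed and predicted logS values and the indices of the class members. The Python sketch below takes r2 to be the squared Pearson correlation and s to be the root-mean-square residual; both definitions are assumptions on our part, since the exact formulas used in the cited studies are not given here.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # correlation is undefined for constant data
    return cov * cov / (var_o * var_p)


def standard_error(observed, predicted):
    """Root-mean-square residual (our assumed definition of s)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def subset_statistics(observed, predicted, members):
    """Return (r2, s) for the compounds whose indices are in `members`."""
    obs = [observed[i] for i in members]
    pred = [predicted[i] for i in members]
    return r_squared(obs, pred), standard_error(obs, pred)
```

Running `subset_statistics` once with the predictions of a class-specific model and once with those of a general model, for the same list of member indices, gives the two pairs of statistics to be compared.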
Fig. 4.2. Predicted versus experimental solubility for the 24 RT inhibitors using (a) a system-specific solubility model and (b) a general model.
In his most recent study,34 Huuskonen attempted to improve the accuracy and generality of his aqueous solubility model by considering a large, diverse collection of ~ 1,300 compounds. The logS values for these compounds ranged from –11.6 to +1.6, which essentially covers the range of solubilities that can be reliably measured. The full data set was partitioned into a randomly chosen training set of 884 compounds and a test set of 413. Starting from 55 molecular connectivity, structural, and E-state descriptors, he applied an MLR stepwise backward elimination strategy to reduce the set to 30 descriptors. For the training data, this equation yielded r2 = 0.89, s = 0.67, and r2cv = 0.88, scv = 0.71. The statistical parameters for the 413-member test set were essentially identical to those of leave-one-out cross-validation, thereby indicating the generally robust nature of this model. He applied ANN modeling to the same set of parameters in order to determine whether the prediction could be further improved via nonlinear dependencies. Using a 30-12-1 ANN, he obtained r2 = 0.94, s = 0.47 for the training set, and r2 = 0.92, s = 0.60 for the test set, which were both significantly better than the MLR model. The general applicability of the MLR and ANN models was further verified by application to a set of 21 compounds suggested by Yalkowsky,35 which has since become a benchmark for novel methods. The r2 and s values for the MLR model are 0.83 and 0.88, and for the ANN, 0.91 and 0.63, in good agreement with their respective cross-validated and external test statistics. Both results were also significantly better than those derived from their previous model constructed using 160 training compounds (r2 = 0.68, s = 1.25).
This indicated that a large and structurally diverse set of compounds was required to train a model capable of giving reasonable solubility predictions for structures of pharmaceutical and environmental interest, such as the set of compounds under consideration in this study.
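The r2 and s statistics quoted throughout this section can be computed from paired experimental and predicted values. The following is a minimal sketch, assuming that r2 is the squared Pearson correlation coefficient and that s is the root-mean-square error; the exact definitions used in the cited studies (for example, a degrees-of-freedom correction for s) may differ. The function names and the logS values are invented for illustration.

```python
import math


def squared_correlation(observed, predicted):
    """Squared Pearson correlation coefficient between two value lists."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov * cov / (var_obs * var_pred)


def rms_error(observed, predicted):
    """Root-mean-square difference between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Invented logS values, for illustration only (not data from the study).
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [-0.5, -1.5, -2.5, -3.5]  # every prediction is 0.5 log units too high

print(squared_correlation(experimental, predicted))
print(rms_error(experimental, predicted))
```

In this example the predictions are perfectly correlated with experiment but systematically offset, so the correlation statistic alone would hide the error that s reveals. This is one reason both values are usually reported together.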
logP
The n-octanol/water partition coefficient of a chemical is the ratio of its concentration in n-octanol to that in aqueous medium at equilibrium. The logarithm of this coefficient, logP, is perhaps the best-known descriptor in classical QSAR studies. The usefulness of this property is related to its correlation with the hydrophobicity of organic substances, which plays a key role in the modulation of many ADME processes. Specifically, drug-membrane interactions, drug transport, biotransformation, distribution, accumulation, and protein and receptor binding are all related to drug hydrophobicity.36 The significance of logP is also captured by the Rule-of-5,4 which states that a molecule will likely be poorly absorbed if its logP value exceeds five. Other researchers also established links between logP and blood-brain barrier (BBB) penetration, a critical component in the realization of activity on the central nervous system (CNS).10,37-42 For CNS-active compounds, usually a logP around 4-5 is required.
One of the earliest attempts to derive logP values by computational means was the f-constant method proposed by Rekker.43 Later, Leo and Hansch made a significant advance to this fragment-based approach that ultimately led to the successful development of the widely popular ClogP program.44 In summary, they assumed an additive nature of hydrophobicity values from different molecular fragments, whose parameter values were calibrated by statistical analysis of a large experimental database. To estimate the logP value of a novel molecule, the chemical structure is first decomposed into smaller fragments that can be recognized by the program. The logP value of the molecule is simply the incremental sum of parameter values from the composite fragments and, in some cases, additional correction factors. The main advantage of a fragment-based method is that it tends to be very accurate. However, this approach suffers from two major problems.
The first is that the molecular decomposition process is often very tricky. The second, and the more serious, concerns missing parameter values when a given structure cannot be decomposed into fragments for which values are available. Thus, it has become more fashionable to treat the molecule in its entirety, and to correlate its logP value with descriptors that are easy to calculate. Most published reports follow this scheme and
are based on the use of MLR or ANN on some combination of electronic and steric properties. For example, molecular descriptors such as atomic charges, hydrogen bond effects, molecular volumes or surface areas have been considered in this role. Schaper and Samitier proposed a logP method based on an ANN to determine the lipophilicity of unionized organic substances by recognition of structural features.45 Molecules were encoded by a connection table, where indicator variables were used to denote the presence or absence of specific atoms or bonds in different molecular positions. Eight different atom types (C, N, O, S, F, Cl, Br, and I) and four different bond types (single, double, triple, and resonant) were represented in their implementation. For compounds with up to 10 non-hydrogen atoms, a full description of the molecule required 260 variables (10 x 8 indicator variables for atoms and 45 x 4 for bonds). After preliminary analysis of their data set, which comprised 268 training and 50 test compounds, 147 non-zero descriptors were retained. They experimented with three different hidden layer configurations (2, 3, and 4 hidden nodes) and suggested that an ANN with three hidden layer neurons was the optimal choice based on the prediction accuracy of the test set. The 147-3-1 NN yielded a Pearson correlation coefficient (rtrn) of 0.98 and a standard deviation (strn) of 0.25 between observed and calculated logP values for the training compounds. It is interesting to note that, despite the use of a large number of adjustable parameters (448), this particular NN showed little evidence of overfitting: the test set correlation coefficient (rtst) was 0.88 and the standard deviation (stst) was 0.66. The authors suggested that with an increase in the rho (ρ) ratio of data objects to adjustable parameters (through either an increase in data objects or a reduction in non-critical indicator variables), the predictivity of this type of NN system would further increase.
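The bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes a fully connected network with one bias term per hidden and output neuron (an assumption that reproduces the quoted 448 parameters) and takes ρ to be the number of training compounds divided by the number of adjustable parameters. The function names are illustrative and do not come from the original study.

```python
def n_indicator_descriptors(max_atoms, n_atom_types=8, n_bond_types=4):
    """Indicator variables needed by a connection-table encoding."""
    atom_vars = max_atoms * n_atom_types
    atom_pairs = max_atoms * (max_atoms - 1) // 2  # possible bonded pairs
    return atom_vars + atom_pairs * n_bond_types


def n_adjustable_parameters(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected three-layer network."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def rho_ratio(n_data, n_parameters):
    """Assumed definition: data objects per adjustable parameter."""
    return n_data / n_parameters


print(n_indicator_descriptors(10))           # full description, 10 heavy atoms
print(n_adjustable_parameters(147, 3))       # the 147-3-1 network
print(rho_ratio(268, n_adjustable_parameters(147, 3)))
```

With 268 training compounds and 448 parameters, there is less than one data object per adjustable parameter, which is why the absence of obvious overfitting is noteworthy.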
The major shortcoming of this approach is that the molecular representation is based on connection matrix/indicator variables. Their study was limited to compounds containing no more than 10 non-hydrogen atoms. With a connection matrix, the number of input descriptors to the ANN increases quadratically with the maximum number of allowed atoms (NMaxAtom) in the data set. For example, using the current scheme of 8 atom-types and 4 bond-types, the total number of descriptors is calculated by:
$$N_{\mathrm{desc}} = 8\,N_{\mathrm{MaxAtom}} + 4\cdot\frac{N_{\mathrm{MaxAtom}}\,(N_{\mathrm{MaxAtom}}-1)}{2} \qquad \text{(Eq. 4.2)}$$
If we were to apply this method to drug-size molecules, which contain on average 20-25 non-hydrogen atoms, then the ANN would need to deal with approximately 1,000 indicator variables. Introduction of new atom-types, such as phosphorus, would add further complexity to the molecular description. This raises the question: are all these descriptors necessary to produce a sound logP model? The answer is, most probably, no. We speculate that, because molecular connectivity descriptors have no physical meaning, a large number of them is required to depict or correlate physicochemical properties. If physically meaningful descriptors are used, then one may obtain a more direct relationship from fewer predictors. In a recent study, Breindl, Beck, and Clark applied semi-empirical methods to obtain a small set of quantum chemical descriptors to correlate the logP values for 105 organic molecules.46 They used the CONCORD program to convert 2D connectivity into standard 3D structures,47 whose geometries were further refined by energy minimization using SYBYL.48 The structures were then optimized using VAMP,49 a semi-empirical program. The input descriptors, which included both electrostatic and shape properties of the molecules, were derived from AM1 and PM3 calculations. Using MLR analysis, they derived a 10-term equation that gave an rtrn value of 0.94 and an rcv of 0.87. The choice of descriptors for this MLR model was further analyzed using ANN. With a 12-4-1 back-propagation network, they improved the fitting of the training set to rtrn = 0.96 and rcv = 0.93 and, furthermore, the neural network also seemed to perform consistently well on 18 test set molecules. Finally, this approach was
validated with a larger data set of 1085 compounds, for which 980 molecules were used as the training set and 105 were held back for testing. The best performance was obtained with a 16-25-1 network, which yielded rtrn = 0.97 and rcv = 0.93 with the AM1 parameters, and a slightly worse result (rtrn = 0.94, rcv = 0.91, strn = 0.45) for the PM3 set. Again, the validity of the neural network result was confirmed by accurate test set predictions, which yielded impressive statistical parameters of rtst = 0.95 and stst = 0.53 for the AM1 result and, again, slightly worse values for the PM3 set (rtst = 0.91; stst = 0.67). The deficiency of the PM3 set was further analyzed, and it was concluded that there was a systematic problem with the estimation of logP values for those compounds with large alkyl chains. They reasoned that the large error was due to the uncertainty of appropriate conformations from gas phase geometries under their setup. By systematically varying the values of one input descriptor while keeping the others fixed, they concluded that the logP values were predominantly influenced by three descriptors, namely polarizability, balance u, and charge OSUM. Furthermore, a direct linear dependence between logP and polarizability was observed. On the other hand, the effects of the balance parameter and OSUM were shown to be highly non-linear with respect to logP. Overall, it seems that reliable logP models can be obtained using a few quantum chemical parameters, although the time-consuming nature of the calculation makes the approach less attractive for analysis of large virtual libraries. To address some of the limitations of the older QSPR approaches, Huuskonen and coworkers proposed the use of atom-type electrotopological state (E-state) indices for logP correlation.36 The E-state indices were first introduced by Kier and Hall,50,51 and have been validated in many QSAR and QSPR applications.
They capture both the electronic and topological characteristics surrounding an atomic center as well as its neighboring environment. In the implementation of E-state descriptors by Huuskonen et al, several new atom-types corresponding to contributions from amino, hydroxyl, and carbonyl groups in different bonding environments were introduced. This level of detail seems particularly relevant for the purpose of hydrophobicity modeling. For instance, it is known that an aromatic amino group is generally less basic than its aliphatic counterpart, which makes the former less likely to ionize and presumably more hydrophobic. The use of the extended parameter set was justified by a significant improvement in the cross-validated statistics for the 1,754 training compounds. An MLR model using 34 basic E-state descriptors yielded a q2 value of 0.81 and an RMScv of 0.64, whereas with 41 extended parameters the corresponding values were 0.85 and 0.55. Huuskonen et al also applied an ANN to model higher-order nonlinearity between the input descriptors and logP. The final model, which had a 39-5-1 architecture, gave a q2 value of 0.90 and an RMScv of 0.46 for leave-one-out cross-validation. Further validation on three independent test sets yielded a similar RMS error (0.41), thereby confirming the consistency of the predictions. The logP predictions of this new method were compared to those derived from commercial programs. It was found that this method was as reliable as, or better than, the established methods even for the most complex structures. In our opinion, the approach of Huuskonen and coworkers represents a method of choice for fast logP estimation, particularly for applications where both speed and accuracy are critical. Because the algorithm does not depend on the identification of suitable basis fragments, the method is generally applicable.
Unlike methods that utilize quantum chemical descriptors, the calculation is genuinely high-throughput because E-state indices can be computed directly from SMILES line notation without costly structure optimization. Furthermore, this hydrophobicity model, which was developed using ~ 40 descriptors, can account for most, if not all, molecules of pharmaceutical interest. In contrast, a connectivity table representation may require on the order of thousands of input values, which also increases the risk of chance correlation. The major limitation of the Huuskonen hydrophobicity method is the difficulty of chemical interpretation. This is in part due to the topological nature of the molecular description and in
part to the use of nonlinear neural networks for property correlation. In particular, it is hard to isolate the individual contributions of the constituent functional groups to the overall hydrophobicity or, conversely, to design modifications that will lead to a desirable property profile (i.e., the inverse QSPR problem). Another important issue that has not been addressed concerns the treatment of ionizable compounds, which may adopt distinct protonation states under different solvent environments (e.g., water and 1-octanol). Currently, this phenomenon is either ignored or assumed to be handled implicitly. Together with the inverse QSPR problem, the correct handling of such molecules will be the major question that needs to be answered by the next-generation logP prediction systems.
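For contrast with the network models, the fragment-additive scheme described earlier in this section keeps every group contribution explicit, and it also exposes the missing-parameter problem directly. The sketch below is only an illustration of that idea: the fragment names and constants are invented and are not Rekker or ClogP values, and the decomposition step (the hard part in practice) is assumed to have been done already.

```python
# Hypothetical fragment constants, invented for illustration only.
FRAGMENT_VALUES = {"CH3": 0.9, "CH2": 0.5, "OH": -1.5, "C6H5": 1.9}


def additive_logp(fragments, corrections=()):
    """Sum fragment constants and optional correction factors.

    Raises KeyError when a fragment has no tabulated value, which is the
    'missing parameter' failure mode of fragment-based methods.
    """
    missing = [f for f in fragments if f not in FRAGMENT_VALUES]
    if missing:
        raise KeyError("no fragment value for: " + ", ".join(missing))
    return sum(FRAGMENT_VALUES[f] for f in fragments) + sum(corrections)


# A molecule already decomposed into recognized fragments.
print(additive_logp(["CH3", "CH2", "OH"]))

# An unrecognized fragment stops the calculation instead of guessing.
try:
    additive_logp(["CH3", "NO2"])
except KeyError as err:
    print("cannot estimate:", err)
```

Each term in the sum can be read off as the contribution of one group, which is exactly the interpretability that a topological descriptor set combined with a nonlinear network gives up.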
Bioavailability
Bioavailability is the percentage of a drug dose which proceeds, in an unaltered form, from the site of administration to the central circulation. By definition, a drug that is administered intravenously has 100% bioavailability. By comparing systemic drug levels achieved after intravenous injection with those achieved by other drug delivery routes, an absolute bioavailability can be measured. Because oral administration is, for several reasons, the preferred route for drug delivery, a major challenge for biopharmaceutical research is to achieve high oral bioavailability. Several factors contribute to reduction of oral bioavailability. First, drug molecules may bind to other substances present in the gastrointestinal tract, such as food constituents. The extent of reduction may vary significantly with an individual's diet. Second, the drug may be poorly absorbed due to unfavorable physicochemical properties, such as those outlined in the Pfizer Rule. Third, the drug may be metabolized as it passes through the gut wall or, more commonly, by the liver during first-pass metabolism. Due to the complexity of the different processes affecting oral bioavailability, as well as the scarcity of data, the development of a generally applicable quantitative structure-bioavailability relationship (QSBR) has proven to be a formidable task. The most extensive QSBR study to date was reported by Yoshida and Topliss,52 who correlated the oral bioavailability of 232 structurally diverse drugs with their physicochemical and structural attributes. Specifically, they introduced a new parameter ∆logD, which is the difference between the logarithm of the distribution coefficient of the neutral form at pH = 6.5 (intestine) versus pH = 7.4 (blood) for an ionizable species. The purpose of this descriptor was to account for the apparent higher bioavailability observed for many acidic compounds.
They also included 15 descriptors to encode structural motifs with well-known metabolic transformations, and thereby to account for the reduction of bioavailability due to the first pass effect. Using these descriptors and a method termed ORMUCS (ordered multicategorical classification method using the simplex technique), they achieved an overall classification rate of 71% (97% within one class) when the compounds were separated into four classes according to bioavailability. Furthermore, 60% (95% within one class) of the 40 independent test compounds were also correctly classified using this linear QSAR equation. The result of this study indicates that it might be feasible to obtain reasonable estimates of oral bioavailability from molecular structures when physically and biologically meaningful descriptors are employed. In the following section, we will give a brief review of how neural network methods have been applied to the modeling of absorption and metabolism processes.
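The two classification figures quoted above (exact and "within one class") only make sense for ordered classes. A minimal sketch of how such rates can be computed is given below. It assumes that the four bioavailability classes are coded as integers 1 to 4 in rank order; the function name and the example labels are invented for illustration.

```python
def classification_rates(true_classes, predicted_classes):
    """Return (exact rate, within-one-class rate) for ordered integer classes."""
    pairs = list(zip(true_classes, predicted_classes))
    exact = sum(1 for t, p in pairs if t == p)
    within_one = sum(1 for t, p in pairs if abs(t - p) <= 1)
    return exact / len(pairs), within_one / len(pairs)


# Invented labels for four compounds (1 = lowest, 4 = highest bioavailability).
true_labels = [1, 2, 3, 4]
predicted_labels = [1, 3, 3, 2]

exact_rate, within_one_rate = classification_rates(true_labels, predicted_labels)
print(exact_rate, within_one_rate)
```

The within-one-class rate is always at least as large as the exact rate, because every exact hit also counts as a near hit. A large gap between the two, as in the study, indicates that most errors are neighboring-class confusions rather than gross misassignments.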
Human Intestinal Absorption
The major hurdle in the development of a robust absorption model, as for other models, is very often the lack of reliable experimental data. Experimental percent human intestinal absorption (%HIA) data generally have large variability and are usually skewed to either very low or very high values, with only a few compounds in the intermediate range. Jurs and coworkers collated a data set of 86 compounds with measured %HIA from the literature.31 The data were divided into three groups: a training set of 67 compounds, a validation set of 9 compounds,
and an external prediction set of 10 compounds. Using their in-house ADAPT program, 162 real-value descriptors were generated that encoded the topological, electronic and geometric characteristics for every structure. In addition, 566 binary descriptors were added to the set to indicate the presence of certain substructural fragments. Two approaches were applied to prune this initial set of 728 descriptors to a smaller pool of 127. First, descriptors that had variance less than a user-defined minimum threshold were removed to limit the extent of single example peculiarities in the data set. Second, a correlation analysis was performed to discard potentially redundant descriptors. Application of a GA-NN type hybrid system to this data set yielded a six-descriptor QSAR model. The mean absolute error was 6.7 %HIA units for the training set, 15.4 %HIA units for the validation set, and 11 %HIA units for the external prediction set. The six descriptors that were selected by the GA could elucidate the mechanism of intestinal absorption via passive transport, which is controlled by diffusion through lipid and aqueous media. Three descriptors are related to hydrogen bonding capability, which reflects the lipophilic and lipophobic characteristics of the molecule. The fourth descriptor is the number of single bonds, which can be regarded as a measure of structural flexibility. The other two descriptors represent geometric properties providing information about the molecular size. This set of descriptors, in our opinion, shares a certain similarity with the ones that define the Pfizer Rule. However, it is fair to point out that the great popularity of the Pfizer Rule amongst medicinal chemists is, in the words of Lipinski et al,4 ‘because the calculated parameters are very readily visualized structurally and are presented in a pattern recognition format’.
In contrast, the use of more complex 3D descriptors and neural network modeling may enhance prediction accuracy, although probably at the expense of diminished practical acceptance. Overall, the result of this initial attempt to model absorption is encouraging and more work in this area is assured. Because in vivo data are generally more variable and more expensive to obtain, there will be strong emphasis on correlating oral absorption with in vitro permeability obtained from model systems such as Caco-2 or immobilized artificial membranes. In addition, future absorption models may have a molecular recognition component, which will handle compounds that are substrates for biological transporters.
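The two-stage descriptor pruning described in this section (a variance filter followed by a correlation filter) can be sketched in a few lines. This is not the ADAPT implementation: the thresholds, the greedy first-come order in which correlated descriptors are discarded, and all names below are assumptions made for illustration.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                     * sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm > 0 else 0.0


def prune_descriptors(columns, min_variance, max_abs_corr):
    """Drop low-variance columns, then columns correlated with a kept one."""
    kept = []
    for name, values in columns.items():
        if variance(values) < min_variance:
            continue  # nearly constant: carries little information
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # redundant with a descriptor already retained
        kept.append(name)
    return kept


# Invented descriptor columns for four compounds.
descriptors = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],   # almost a multiple of "a"
    "c": [1.0, 1.0, 1.0, 1.0],   # constant
    "d": [4.0, 1.0, 3.0, 2.0],
}
print(prune_descriptors(descriptors, min_variance=0.01, max_abs_corr=0.9))
```

The reduced pool is what a subsequent selection step, such as the genetic algorithm mentioned above, would search. Pruning first keeps that search space manageable.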
Drug Metabolism
Drug metabolism refers to the enzymatic biotransformations to which drug molecules are subjected in the body. This is an important defensive mechanism of our bodies against potential toxins, which are generally lipophilic and are converted to more soluble derivatives that can be excreted more readily. Most drug metabolism processes occur in the liver, where degradation of drugs is catalyzed by a class of enzymes called hepatic microsomal enzymes. This constitutes the first pass effect, which can limit a drug’s systemic oral bioavailability. In the past, relatively few researchers paid special attention to drug clearance until a lead molecule had advanced nearly to the stage of clinical candidate selection. More recently this attitude has changed as the need for pharmacokinetic data for correct dose calibration has been recognized. Thus, there is considerable interest in the development of in vitro or in vivo physiological models to predict hepatic metabolic clearance during the lead optimization stage. Lavé and coworkers at Roche made an attempt to predict human pharmacokinetic data from in vitro and in vivo metabolic data.53 They collated experimental data for 22 structurally diverse literature and in-house compounds. The in vitro metabolic data were derived from the metabolic stability of the substances in hepatocytes isolated from rats, dogs, and humans, and the in vivo pharmacokinetic data were measured after intravenous administration for the same species. All in vitro data, as well as the in vivo data for rats and dogs, were used in combination to predict the human in vivo data. Their statistical analysis included multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS) regression, and artificial neural networks (ANN). The results of their study are summarized in
Table 4.3. The major conclusion from this study is that the strongest predictors of human in vivo data were human and rat hepatocyte data; the in vivo clearance data from either rats or dogs did not significantly contribute to any statistical model. One possible explanation is that the results from in vivo experiments are generally more variable and are therefore noisier when used as predictors. It is also clear that all statistical methods (MLR, PCR, PLS and ANN) appeared to work satisfactorily for this data set; in fact, from a statistical viewpoint the results are practically identical. It is interesting to note that the non-linear mapping capability of a neural network was not required in this case, probably because of the already strong linear correlation between the human in vivo data and both the human hepatocyte data (r = 0.88) and the rat hepatocyte data (r = 0.81). Overall, despite the limitation of modest data set size, the results of this study provide further support for early in vitro screening of drug candidates, because satisfactory human pharmacokinetic data can be predicted through mathematical modeling of these less expensive parameters. It is also fair to point out that the accuracy of their model comes at a price: one must first synthesize a compound and determine the appropriate biological parameters before a prediction can be made. To overcome this problem, some researchers prefer to focus on theoretical descriptors that can be computed from molecular structure. Recently, Quiñones et al tried to correlate drug half-life values with physicochemical or topological descriptors, which were derived from a series of 30 structurally diverse antihistamines.54 These descriptors were used as input values to an ANN, which was trained against the experimental half-life of a drug, that is, the time it takes for one-half of a standard dose to be eliminated from the body.
Initially, they tried to formulate a model that made use of seven physicochemical descriptors: logP, pKa, molecular weight, molar refractivity, molar volume, parachor, and polarizability. However, it did not lead to a statistically significant model. They then investigated the possibility of using the CODES descriptors, which capture the character of each individual atom as well as its neighboring chemical environment. In their study, they picked four CODES descriptors that corresponded to a common chemical substructure present in all 30 antihistamines. Two neural network configurations, one with five hidden nodes and another with six, were tested. The results from cross-validated predictions of their model were very encouraging, and they were mostly consistent with the range of experimental half-life values (Fig. 4.3). A test set of five other antihistamines was used to evaluate the two ANN models. Again, there was good agreement between the experimental and calculated half-life values, indicating the general robustness of their models, at least within the domain of biogenic amines. Both approaches described above have their strengths and limitations. From a virtual screening perspective, the approach of Quiñones et al is more attractive since their model does not rely on any experimental parameters. However, the association between the four CODES descriptors and metabolism is unclear, and the current model is relevant only to a specific class of compounds that share a common substructure. In this regard, it is our opinion that the set of structural descriptors used by Yoshida and Topliss52 is particularly informative because the descriptors represent some well-characterized metabolic liabilities. Nevertheless, drug metabolism is immensely complex and biotransformations are catalyzed by many enzymes, some of which may still be unknown to us.
As different enzymes have different substrate specificity according to their structural requirement, it will be a challenging task to formulate a simple theoretical model that is generally applicable to many diverse chemical classes. On the other hand, the method proposed by Lavé is likely to be more general because their approach relies on experimental parameters and is thus less dependent on the metabolic pathways involved.53 It is conceivable that we will see a process that is somewhat of a hybrid of the two in the future. A panel of lead compounds with similar structures could be synthesized and tested in vitro, which is generally less expensive and time-consuming than in vivo animal testing. These in vitro data will serve as calibration data for correlation with a set of relevant theoretical descriptors that are directly obtained from molecular structures. Because of the strong relationship between in vitro and in vivo data, the resulting QSAR model could be used to predict human pharmacokinetic clearance for compounds that are within the scope of the original lead class. Further research to establish the relationship between in vitro assay data from tissue cultures of major metabolic sites (e.g., liver, kidneys, lungs, gastrointestinal tract) and in vivo data appears to be justified.

Table 4.3. Accuracy of the statistical models for human in vivo clearance prediction53

Statistical Model    No. Terms(b)    r2      q2
MLR                  5               0.84    0.74
MLR                  2               0.84    0.79
PCR                  2               0.85    0.79
PLS                  2               0.86    0.77
PLS                  1               0.83    0.79
PLS                  1               0.83    0.79
NN_linear            5               0.86    0.79
NN_sigmoidal         3               0.88    0.77
NN_sigmoidal         2               0.88    0.77

(a) Descriptors used: r_h = rat hepatocyte; d_h = dog hepatocyte; h_h = human hepatocyte; r_a = rat animal data; d_a = dog animal data. [The per-model descriptor marks (•) of the original table could not be assigned to rows and are not reproduced here.]
(b) MLR = number of descriptors; PCR = number of principal components; PLS = no. of components; ANN = no. of descriptors
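Table 4.3 reports both a fitted r2 and a cross-validated q2 for each model. As a minimal sketch of how a leave-one-out q2 is obtained, the code below refits a one-descriptor linear model with each compound held out in turn and accumulates the squared prediction errors (PRESS). It assumes the common definition q2 = 1 − PRESS/SS, where SS is the sum of squared deviations of the observed values from their mean; the data are invented and the single-descriptor model is a simplification, not the models of the cited study.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(x, y):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict it.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss


# Invented values standing in for an in vitro predictor and an in vivo response.
in_vitro = [1.0, 2.0, 3.0, 4.0, 5.0]
in_vivo = [2.1, 3.9, 6.2, 7.8, 10.1]
print(loo_q2(in_vitro, in_vivo))
```

Because each prediction is made by a model that never saw the held-out compound, q2 is normally lower than r2 for the same model, as is the case for every row of Table 4.3.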
Fig. 4.3. Calculated half-life values from a neural network versus experimental values. The cross-validated predictions for the 30 training compounds are shown as open circles; the predictions for the 5 test set compounds are shown as filled squares. The experimental values for some compounds were reported as a range and are plotted accordingly. The diagonal line represents a perfect correlation between experimental and calculated half-life values.
CNS Activity
The nervous system of higher organisms is divided into a central system (CNS) that comprises the brain and spinal cord, and a peripheral system (PNS) that embodies the remaining nervous tissues in the body. The CNS coordinates the activities of our bodily functions, including the collection of information from the environment by means of receptors, the integration of such signals, the storage of information in terms of memory, and the generation of adaptive patterns of behavior. Many factors, such as infectious diseases, hormonal disorders and neurological degenerative disorders, can disrupt the balance of this extremely complex system, leading to the manifestation of CNS-related diseases. These include depression, anxiety, sleep disorders, eating disorders, meningitis, and Alzheimer’s and Parkinson’s diseases. The prevalence of such diseases in the modern world is reflected in part by the continuous growth of the market for CNS drugs, which is now the third highest selling therapeutic category behind cardiovascular and metabolic products, and is predicted to reach over $60 billion worldwide by 2002.55 These drugs, which have the brain as the site of action, must cross the barrier between brain capillaries and brain tissue (the blood-brain barrier, or BBB). This barrier helps to protect the brain from sudden chemical changes and allows only a tiny fraction of a dose of most drugs to penetrate to cerebrospinal fluid and enter the brain. Knowledge of the extent of drug penetration through the BBB is of significant importance in drug discovery, not only for new CNS drugs, but also for peripherally acting drugs whose exposure to the brain should be limited in order to minimize the potential risk of CNS-related side-effects. It is believed that there are certain physicochemical characteristics common to molecules that are capable of BBB penetration, whose extent is often quantified by logBB, the logarithm of the ratio of the steady-state concentration of drug in brain to that in blood. Some of these attributes include size, lipophilicity, hydrogen-bonding propensity, charge, and conformation. It was about 20 years ago that Levin reported a study describing a strong relationship between rat brain penetration and molecular weight for drugs that have MW less than 400.37 In a later study, Young et al observed that the logBB could be related to the difference between the experimental logP(octanol/water) and logP(cyclohexane/water) values for a set of histamine H2 antagonists.56 This provided a rationale to improve blood-brain penetration of new designs by reduction of the overall hydrogen bonding propensity. The earliest correlative logBB study that involved theoretical descriptors was that of Kansy and van de Waterbeemd, who reported a two-descriptor MLR model using polar surface area (PSA) and molecular volume from a small set of compounds.38 Although their model seemed to work well for the 20 compounds within the training set, it was evident that predictions for other compounds were rather unreliable, presumably due to erroneous extrapolation.57 To overcome this problem, Abraham and coworkers examined a larger data set of 65 compounds and formulated QSAR models based on excess molar refraction, molecular volume, polarizability, and hydrogen-bonding parameters, as well as the experimental logP value.39 Later, Lombardo et al performed semiempirical calculations on a subset of 57 compounds selected from the Abraham training set, and derived a free energy solvation parameter that correlated well with the logBB values.58
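The definition of logBB given above translates directly into code. The short sketch below assumes a base-10 logarithm and concentrations expressed in the same units; the function name and the example concentrations are illustrative only.

```python
import math


def log_bb(conc_brain, conc_blood):
    """logBB: log10 of the steady-state brain/blood concentration ratio."""
    if conc_brain <= 0 or conc_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_brain / conc_blood)


# Invented steady-state concentrations (same units for both compartments).
print(log_bb(10.0, 1.0))   # ten-fold enrichment in brain: positive logBB
print(log_bb(1.0, 10.0))   # ten-fold exclusion from brain: negative logBB
```

On this scale, a data set that spans 3 log units covers a 1,000-fold range of brain/blood ratios, which puts a standard error of a few tenths of a log unit into perspective.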
Norinder and coworkers developed PLS models of logBB using a set of MolSurf parameters, which provide information on physicochemical properties including lipophilicity, polarity, polarizability, and hydrogen bonding.40 More recently, Luco applied the PLS method using topological and constitutive (e.g., element counts, sum of nitrogen atoms, and indicator variables of individual atoms or molecular fragments) descriptors to correlate logBB.59 In the past two years, several research groups revisited the use of PSA and logP in attempts to create models that are easy to interpret and also generally applicable. These include Clark’s MLR models,42 which are two-descriptor models based on PSA and logP values computed using different methods; the Österberg PLS model, which considered logP and a simple count of hydrogen bond donor and acceptor atoms;60 and the Feher MLR model,10 which utilized logP, polar surface area, and the number of solvent accessible hydrogen bond acceptors. Most recently, Keserü and Molnár reported a significant correlation between logBB and solvation free energy derived from generalized Born/surface area (GB/SA) continuum calculations. This established an efficient means to predict CNS penetration in terms of thermodynamic properties, whose utility had been limited previously due to high computational cost.61
The statistical parameters reported by the various studies discussed above are shown in Table 4.4. The following general comments can be made on these studies:
a. most models were developed from an analysis of a core set of ~50 structures introduced by Young et al and Abraham et al;56,39
b. the various linear models (either MLR or PLS) report r2 values in the range of 0.7 to 0.9, and standard errors of ~0.3 to 0.4 log units. The accuracy of the models is acceptable given that most data sets have logBB values that span over 3 log units;
c. the descriptors used can be categorized into the following classes: hydrophilic (PSA and its variants, hydrogen bond propensities), hydrophobic (either calculated or measured logP values), solvation free energy (which arguably characterizes both the hydrophilic and hydrophobic properties of the molecule), or topological indices (which encode, perhaps indirectly, the above physicochemical properties);
d. with the exception of the study of Keserü and Molnár, few models have been validated extensively on a sufficiently large test set, probably due to scarcity of reliable data.
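The r2 and s statistics quoted for the MLR models can be illustrated with a minimal sketch of a two-descriptor least-squares fit. The data below are synthetic: the descriptor ranges, the coefficients and the noise level are arbitrary assumptions chosen only to make the example run, and they do not reproduce any of the published models. The function name fit_mlr is likewise hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk.

    Returns (coefficients, r2, s), where s is the standard error of the
    estimate with n - k - 1 degrees of freedom.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)                  # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    s = (ss_res / (len(y) - A.shape[1])) ** 0.5
    return coef, r2, s

# Synthetic illustration only: descriptor values and coefficients are arbitrary.
rng = np.random.default_rng(0)
psa = rng.uniform(10.0, 140.0, size=50)            # stand-in for polar surface area
logp = rng.uniform(-1.0, 5.0, size=50)             # stand-in for calculated logP
logbb = -0.01 * psa + 0.2 * logp + rng.normal(0.0, 0.3, size=50)

coef, r2, s = fit_mlr(np.column_stack([psa, logp]), logbb)
print(f"intercept={coef[0]:.2f}, b_PSA={coef[1]:.3f}, b_logP={coef[2]:.2f}")
print(f"r2={r2:.2f}, s={s:.2f}")
```

Because s is computed on the training compounds only, it says nothing about extrapolation to new structures, which is the weakness noted in comment d.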
The results from the linear models indicate that it is feasible to estimate the blood-brain penetration of a candidate using physicochemical parameters computed from the molecular structure of a drug. The major drawback of the above models is that they were developed using limited data, and therefore their general applicability may be questionable. A logical next step was therefore to increase the diversity of the training set, with the advantage that a larger data set could also safeguard to some degree against model overfitting. This was the strategy followed by Ajay and coworkers at Vertex,41 who developed a Bayesian neural network (BNN) to predict drug BBB penetration using the knowledge acquired from a large number (65,000) of supposedly CNS-active and -inactive molecules. To construct this data set, they selected compounds from the CMC and the MDDR databases, based on therapeutic indication. In their initial classification, compounds that were within the following activity classes were defined as CNS active: anxiolytic, antipsychotic, neuronal injury inhibitor, neuroleptic, neurotropic, antidepressant, non-opioid analgesic, anticonvulsant, antimigraine, cerebral antiischemic, opioid analgesic, antiparkinsonian, sedative, hypnotic, central stimulant, antagonist to narcotics, centrally acting agent, nootropic agent, neurologic agent and epileptic. Compounds that did not fall into the above categories were considered to be CNS inactive, an assumption that was later shown to be invalid. Based on this classification scheme, there were over 15,000 CNS active molecules and over 50,000 inactive ones. To minimize the risk of chance correlation, they elected to start with only a few molecular descriptors. The seven one-dimensional descriptors adopted in their earlier drug-likeness prediction system6 were also used in this work. These were molecular weight (MW), number of hydrogen bond donors
Table 4.4. Summary of representative linear logBB models that have appeared in the literature

LogBB Model      N    r2     s      RMSE   Model: Descriptors (a)
Kansy38          20   0.70   0.45          MLR: PSA, Mol_vol
Abraham I39      57   0.91   0.20          MLR: R2, π2H, Σα2H, Σβ2H, Vx
Abraham II39     49   0.90   0.20          MLR: logPoct, Σα2H, Σβ2H
Lombardo58       55   0.67   0.41          LR: ∆Gw0
Norinder I40     28   0.86   0.31          PLS: MolSurf parameters
Norinder II40    56   0.78   0.31          PLS: MolSurf parameters
Luco59           58   0.85   0.32          PLS: topological, constitutional
Kelder62         45   0.84   0.35          LR: dPSA
Clark I42        55   0.79   0.37          MLR: PSA, ClogP
Clark II42       55   0.77   0.37          MLR: PSA, MlogP
Österberg I60    69   0.76          0.38   PLS: #HBAo, #HBAn, #HBD, logP
Österberg II60   45   0.72          0.49   PLS: #HBAo, #HBAn, #HBD, logP
Feher10          61   0.73          0.42   MLR: nacc,solv, logP, Apol
Keserü61         55   0.72          -      LR: Gsolv

(a) Molecular descriptors: polar surface area (PSA, Apol), dynamic polar surface area (dPSA), excess molar refraction (R2), dipolarity/polarisability (π2H), hydrogen-bond donor acidity (Σα2H), hydrogen-bond acceptor basicity (Σβ2H), characteristic volume of McGowan (Vx), experimental logP (logPoct), free energy of solvation in water (∆Gw0, Gsolv), calculated logP (ClogP, MlogP, logP), no. of hydrogen-bond accepting oxygen and nitrogen atoms (#HBAo, #HBAn), no. of hydrogen-bond donors (#HBD), and no. of hydrogen-bond acceptors in aqueous medium (nacc,solv).
(Don), number of hydrogen bond acceptors (Acc), number of rotatable bonds (Rot), 2κα (which indicates the degree of branching of a molecule), aromatic density (AR), and MlogP. The authors believed that this set of descriptors was related to the physical attributes that correlate with BBB penetration, thereby allowing the neural network to discriminate between CNS active and inactive compounds. Using a BNN with just the seven physicochemical descriptors, they achieved a prediction accuracy of 75% on active compounds and 65% on inactive ones. Further, they analyzed the false positives among the supposedly inactive CMC compounds, and discovered that a significant portion of them actually had no information in the activity class (i.e., their inactivity labeling might be somewhat dubious). Interestingly, the Vertex team found that most of the remaining false positives belonged to the following categories: tranquilizer, antivertigo, anorexic, narcotic antagonist, serotonin antagonist, anti-anxiety, sleep enhancer, sigma opioid antagonist, antiemetic, antinauseant, antispasmodic, and anticholinergic. Thus, it is evident that there were significant omissions of therapeutic indication in the initial CNS activity definition, and furthermore, that their BNN made sound generalizations that led to the correct identification of other known CNS agents. Additional validation of their method on a database of 275 compounds revealed that prediction accuracies of 93% and 72% were achieved for the CNS active and inactive compounds, respectively. The BNN method also ranked the relative importance of the seven descriptors in their CNS model, namely:
Acc > AR ~ Don ~ 2κα > MW ~ MlogP > Rot
Ajay and coworkers concluded that CNS activity was negatively correlated with MW, 2κα, Rot, and Acc, and positively correlated with AR, Don, and MlogP, a result that was consistent with known attributes of CNS drugs. They found that the addition of 166 2D ISIS keys to the seven 1D descriptors yielded significant improvement, which confirmed their earlier drug-likeness prediction result.6 Using the combined 1D and 2D descriptors, the BNN yielded a prediction accuracy of 81% on the active compounds and 78% on the inactive ones. The utility of this BNN as a filter to design a virtual library against CNS targets was subsequently demonstrated. As for any filter designed to handle large compound collections, the principal consideration was the throughput of the calculation. With their in-house implementation, they achieved a throughput of almost 1 million compounds per day on a single processor (195 MHz R10000). The CNS activity filter was tested on a large virtual library, consisting of about 1 million molecules constructed with 100 drug-like scaffolds63 combined with the 300 most common side chains. Two types of filters were applied to prune this library. The first was substructure-based, to exclude compounds containing reactive functional groups; the second was property-based, to discard molecules with undesirable physicochemical properties, including high MW, high MlogP, and, in the example case, low predicted CNS activity. From the remaining compounds, they identified several classes of molecules that have favorable BBB penetration properties and are also particularly amenable to combinatorial chemical library synthesis. As a result, such libraries are considered as privileged compound classes to address CNS targets.
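The two-stage pruning described above (a substructure filter followed by a property filter) can be sketched as follows. This is only an illustration of the control flow: the Compound fields, the function name and all cut-off values are hypothetical assumptions, since the text does not state the thresholds used by the Vertex group, and the predicted CNS score stands in for the output of a trained classifier such as the BNN.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mw: float                  # molecular weight
    mlogp: float               # calculated logP
    has_reactive_group: bool   # result of a substructure search (assumed precomputed)
    p_cns: float               # predicted CNS activity score in [0, 1] (assumed precomputed)

def passes_filters(c, max_mw=500.0, max_mlogp=5.0, min_p_cns=0.5):
    """Return True if the compound survives both filters.

    All three cut-offs are hypothetical defaults chosen for illustration.
    """
    # Stage 1: substructure-based filter, rejects reactive functional groups.
    if c.has_reactive_group:
        return False
    # Stage 2: property-based filter on MW, MlogP and predicted CNS activity.
    return c.mw <= max_mw and c.mlogp <= max_mlogp and c.p_cns >= min_p_cns

library = [
    Compound("A", mw=320.0, mlogp=2.8, has_reactive_group=False, p_cns=0.9),
    Compound("B", mw=310.0, mlogp=2.5, has_reactive_group=True, p_cns=0.9),
    Compound("C", mw=640.0, mlogp=3.1, has_reactive_group=False, p_cns=0.8),
    Compound("D", mw=290.0, mlogp=1.9, has_reactive_group=False, p_cns=0.2),
]
kept = [c.name for c in library if passes_filters(c)]
print(kept)
```

Running the cheap substructure test first means the more expensive property prediction is skipped for compounds that would be discarded anyway, which matters when throughput is the principal consideration.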
Toxicity
No substance is free of possible harmful effects. Of the tens of thousands of current commercial chemical products, only perhaps hundreds have been extensively characterized and evaluated for their safety and potential toxicity.64,65 There is strong evidence implicating pesticides and industrial byproducts in links to numerous health problems, including birth defects, cancer, digestive disorders, mutagenicity, tumorigenicity, chronic headaches, fatigue, and irritability. The effect of widespread use of toxic substances on the environment and public health can be devastating. For example, each year about 10,000 U.S. tons of mercury are mined for industrial use, half of which is lost to the environment. The most notorious episode of methylmercury poisoning in history occurred in the 1950s at Minamata Bay in Japan, where the mercury discharged by a chemical factory was ingested and accumulated by fish, and ultimately by the people in the community. Many people developed a severe neurological disorder that later became known as Minamata disease. Despite tremendous effort to restore Minamata Bay, it was not until 1997, 50 years later, that the bay was declared mercury-free again. Some biologically persistent chemicals are introduced to the environment in the form of insecticides and pesticides. One of the best known compounds in this category is DDT (dichlorodiphenyltrichloroethane), which was used to protect crops from insect pests and is also credited with the marked decline of insect-vectored human diseases such as malaria and yellow fever in Asia, Africa, and South America. However, there is now strong evidence that DDT contamination has contributed to the marked decline of many animal species via bioaccumulation through food chains. Other chemical agents that can cause great harm even in small quantities are, ironically, medicines whose function is to alleviate the state of disease. For example, anticancer drugs are often highly toxic, because they can disrupt the cellular metabolism of normal tissue as well as that of tumors. Two government agencies are responsible for regulating the release of potentially hazardous substances in the United States. The Environmental Protection Agency (EPA) seeks to control pollution caused by pesticides, toxic substances, noise, and radiation. The U.S. Food and Drug Administration (FDA) oversees the safety of drugs, foods, cosmetics, and medical devices. The FDA issues regulations
that make the drug review process more stringent, requiring that new drugs be proven effective as well as safe. Because of the potential economic impact of environmental and health effects, finding reliable means to assess chemical toxicity is of enormous interest to both the pharmaceutical and agricultural industries. A large-scale ecological assessment of toxicity for a new agricultural chemical is very expensive, and often not feasible. Likewise, traditional in vivo toxicity screening involves animal testing, which is slow and costly, and therefore unacceptable for mass screening of many potential drug candidates. A remedy to this severe problem is the establishment of standardized in vitro or in vivo tests on model systems relevant to safety assessment. To a large extent, it is the increasing availability of experimental data that has facilitated the ongoing development of computational toxicology, also known as in silico toxicology, ComTox, or e-Tox.66 The major goal of this emerging technology is to analyze toxicological data and create SAR models that can be used to provide toxicity predictions on the basis of molecular structures alone. The work in this field to date can be categorized into two main approaches. The first is an expert system that is based on a set of rules derived from prior knowledge of similar chemical classes or substructures. Depending on whether the rules are inferred by human experts or are extracted by artificial intelligence algorithms, the system is referred to either as a human expert system or an artificial expert system (see Chapter 2).67 When a query structure is presented for assessment, the rules associated with the structure are identified from the knowledge base to invoke a decision, often together with a possible mechanism of toxic action. The commercial programs DEREK68 and ONCOLOGIC69 represent the most advanced systems in this category. The major criticism of a rule-based system is that it tends to give false positive predictions.67 The second is a statistical approach that uses correlative algorithms to determine a quantitative structure-toxicity relationship (QSTR) from a large heterogeneous source of chemical structures with harmonized assay data. Two well-known toxicity prediction programs, TOPKAT (Oxford Molecular Inc.) and MCASE (MultiCASE Inc.),70,71 are based on this method. Briefly, TOPKAT relies on physicochemical descriptors, (size-corrected) E-state indices, and topological shape indices to characterize the physical attributes of a molecule. MCASE, on the other hand, reduces a molecule to its constituent fragments (from 2 to 10 atoms in size), and treats them as fundamental descriptors. The fragments, or “biophores”, that are associated with most of the toxic chemicals in the database are identified, and a potential toxicity is predicted by summing up the individual contributions. Recent QSTR developments in computational pharmacotoxicology, including case studies of aquatic toxicity, mutagenicity, and carcinogenicity, are discussed in the following sections.
Aquatic Toxicity
Aquatic toxicity is one of the key toxicological indicators commonly used to assess the potential risk posed to human and environmental health by chemical substances. A number of marine and freshwater organisms, such as Pimephales promelas, Tetrahymena pyriformis, Daphnia magna, Daphnia pulex and Ceriodaphnia dubia, have become ecotoxicity systems of choice because of their fast growth rate under simple and inexpensive culture conditions. More significantly, the establishment of standard testing protocols for these species makes comparisons of inter-laboratory results meaningful. The most comprehensive resource for aquatic toxicity is the Aquatic Toxicity Information Retrieval (AQUIRE) Database maintained by the EPA, which contains data on ~3,000 organisms and ~6,000 environmental chemicals, extracted from over 7,000 publications. These abundant experimental data provide a foundation for some of the multivariate QSTR modeling work described here. Basak and coworkers proposed a new approach called “hierarchical QSAR” for predicting the acute aquatic toxicity (LC50) of a set of benzene derivatives.72 The data set included benzene and 68 substituted benzene derivatives containing chloro, bromo, nitro, methyl,
methoxy, hydroxy, or amino substituents. The toxicity tests for these compounds were conducted against Pimephales promelas (fathead minnow), with pLC50 values ranging from 3.04 to 6.37. Ninety-five descriptors belonging to four major categories were computed to characterize each molecule:
• 35 topostructural indices (TSI), which encode information on the molecular graph without any information on the chemical nature of the atoms or bonds;
• 51 topochemical indices (TCI), which were derived from the molecular graph weighted by relevant chemical or physical atom type properties;
• 3 geometric indices (3D), which carried three-dimensional information on the geometry of the molecule;
• 6 quantum chemical parameters (QCI), which were calculated using the MOPAC program (HOMO, LUMO, heat of formation, etc.).
To reduce the number of model parameters, variables from the TSI and TCI categories were clustered based on their inter-correlation. The index that was most correlated with the cluster was automatically selected, along with other poorly correlated (r2 < 0.7) indices that helped to explain data variance. The clustering procedure eliminated most of the descriptors from the two groups, and only five TSI and nine TCI descriptors were retained. All nine descriptors in the 3D and QCI categories were kept because they were relatively few in number and, in addition, seemed to be poorly correlated amongst themselves. As a result of this preprocessing, a group of 23 molecular descriptors was used for further statistical analysis. In the next step, the authors followed an incremental approach, hierarchical QSAR, to build a linear model based on the reduced descriptor set. First, an exhaustive enumeration of the five TSI descriptors yielded a four-parameter linear regression model with a cross-validated r2 value (rcv2) of 0.37. It was apparent that using the TSI parameters alone could not produce a satisfactory model. Next, they added the nine TCI descriptors to the four selected TSI features from the first model, and again performed an exhaustive search to examine all combinations of linear models. This led to an improved four-descriptor model, yielding rcv2 = 0.75. Repeating this procedure to include the 3D and QCI descriptors resulted in a four-descriptor and a seven-descriptor model that gave rcv2 values of 0.76 and 0.83, respectively. For comparison, they performed variable clustering on all 95 descriptors, and selected seven descriptors to build another linear model. It was noteworthy that this model gave essentially the same predictivity (rcv2 = 0.83) as the (different) selection based on the hierarchical procedure (see discussion below). Basak and coworkers also explored nonlinear models using neural networks.
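The tiered search for the linear models can be sketched in a few lines. This is a toy illustration, not the procedure of Basak et al: it assumes leave-one-out cross-validation (the text does not say which scheme was used), the function names loo_q2 and best_subset are hypothetical, and the data are a small noise-free factorial design in which columns 0 and 1 play the role of the first descriptor class and columns 2 and 3 the role of the second.

```python
from itertools import combinations, product
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated r2 of a linear model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                   # drop compound i
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2         # squared prediction error
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

def best_subset(X, y, candidates, size):
    """Exhaustively score every descriptor subset of the given size."""
    scored = [(loo_q2(X[:, list(cols)], y), cols)
              for cols in combinations(candidates, size)]
    q2, cols = max(scored)
    return cols, q2

# Toy data: 16 "compounds", 4 mutually orthogonal "descriptors".
X = np.array(list(product([-1.0, 1.0], repeat=4)))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3]                  # depends on one descriptor per class

# Tier 1: search the first descriptor class only.
tier1, q2_tier1 = best_subset(X, y, candidates=[0, 1], size=1)
# Tier 2: carry the tier-1 selection forward and add the second class.
tier2, q2_tier2 = best_subset(X, y, candidates=list(tier1) + [2, 3], size=2)
print(tier1, round(float(q2_tier1), 2))
print(tier2, round(float(q2_tier2), 2))
```

The cost of the exhaustive step grows combinatorially with the number of candidates, which is why each tier only searches the descriptors retained so far plus one new class.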
Due to high computational expense, instead of an exhaustive enumeration at every tier level, they decided to use all descriptors in each of the four categories at each hierarchy. The cross-validation results for the four models are shown in Table 4.5. The trend for improvement of predictive performance paralleled that of the regression model. Inclusion of the TCI descriptors appeared to yield the biggest improvement in rcv2. The neural network result also demonstrated that a smaller subset of descriptors can yield results similar to those of the full 95-descriptor ANN model. Perhaps the most surprising result was that the linear regression model consistently outperformed the ANN in this study, at least in terms of cross-validation statistics. We think that one possible reason was an overfitting of data by the neural network that hampered its ability to generalize from the training patterns. One of the ANN models reported by Basak et al had a configuration of 95-15-1. This means that over 1400 adjustable parameters (95 × 15 + 15 × 1 = 1,440 connection weights, before any bias terms are counted) were available to map a data set that was modest in size.69 Consequently, the predictions were likely a result of memorization rather than generalization, leading to a deterioration of predictive performance. This example clearly demonstrates the usefulness of different models to address the same problem. Based on the regression studies, the two seven-descriptor models yielded essentially identical statistics. This raises the question: what is the real merit of the hierarchical approach versus
Table 4.5. Statistical results of the hierarchical QSAR models reported by Basak et al72

                       Artificial Neural Network     Multiple Linear Regression
Model (a)              # Desc   rcv2    s            # Desc   rcv2    s
TSI                    5        0.3     0.63         4        0.37    0.63
TSI + TCI              14       0.62    0.47         4        0.75    0.39
TSI + TCI + 3D         17       0.66    0.44         4        0.76    0.38
TSI + TCI + 3D + QC    23       0.77    0.36         7        0.83    0.34
All 95 indices         95       0.76    0.37         7        0.83    0.34

(a) Descriptor classes: topostructural indices (TSI); topochemical indices (TCI); geometric indices (3D); and quantum chemical parameters (QC).
the traditional “kitchen sink” approach? From a computational point of view, the latter approach is probably simpler to execute. We believe that the biggest benefit of the hierarchical approach is to probe the dependence of activity prediction on a particular descriptor class. In their study, Basak et al demonstrated that the use of TCI descriptors would substantially increase the efficacy of both the neural network and the regression model, and this piece of information would not have been easily deciphered from the single cluster approach. However, to make this unequivocal, one should perform all combinations of cluster-based QSAR models. In addition, there is the issue of a possible order-dependence in the build-up of hierarchical layers. It is likely that if one starts with TCI as the first model, the final descriptor choice will differ from the reported set. Recently, Niculescu et al published an in-depth study of the modeling of chemical toxicity to Tetrahymena pyriformis using molecular fragments and a probabilistic neural network (PNN).65 They obtained the 48-hour IGC50 sublethal toxicity data for T. pyriformis for 825 compounds from the TerraTox database.73 This group of compounds covered a range of chemical classes, many of which have known mechanisms of toxic action. They included narcotics, oxidative phosphorylation uncouplers, acetylcholinesterase inhibitors, respiratory inhibitors, and reactive electrophiles.74 Instead of using physicochemical descriptors such as logP, Niculescu et al preferred functional group descriptors that represented certain molecular fragments. Each structural descriptor encoded the number of occurrences of specific molecular substructural/atomic features, and a total of 30 descriptors were used. They reasoned that there were two major advantages of this molecular representation. First, substructural descriptors were considerably cheaper to calculate than most physicochemical or other 3D descriptors, and were also not subject to common errors in their generation (e.g., missing fragments or atom-types from logP calculations, or uncertainty of the 3D conformation). Second, they argued that the use of substructure-based descriptors would lead to a more general QSAR model. The authors suggested that one of the factors that contributed to their superior results relative to other toxicity prediction systems (such as ASTER, CNN, ECOSAR, OASIS, and TOPKAT) was that their model did not rely on logP as an independent variable. The 825-compound data set was partitioned into a larger set of 750 compounds, which were used to derive quantitative models, and a smaller set of 75 compounds that were later used for model validation. They performed a five-way leave-group-out cross-validation experiment (i.e., 20% of the data were left out at a time) on the larger data set, where 150 randomly selected compounds were left out in turn. Five sub-models, m1 to m5, were constructed using each data partition from the same input descriptors. They achieved very good performance for the five
sets of training compounds (r2 from 0.93 to 0.95) and also very respectable test performance (r2 from 0.80 to 0.86). Further, the m1-m5 models, together with m0, which was trained on the entire set of 750 compounds, were combined to produce a series of linear correction models. M1 simply accounted for the unexplained variance of the m0 model. M2 and M3 were derived from geometric regression to force linear correction, and M4 was a multivariate regression to fit all results:
Since all 750 compounds played a role in the building of the M1-M4 models, their true predictivity could only be evaluated with the 75 compounds that had been left out in the initial partition. Figure 4.4 gives a summary of the data partition scheme. Judging from the results obtained for these external compounds, it is encouraging to see that all four models gave consistent predictions, with r2 ranging from 0.88 to 0.89. They concluded that it was viable to utilize in silico prescreens to evaluate chemical toxicity towards a well-characterized endpoint. To demonstrate the generality of this approach to toxicity prediction, Niculescu et al applied the same technique to an analysis of the acute aquatic toxicity of 700 highly structurally diverse chemicals to Daphnia magna, another model organism with widespread use in ecotoxicological screening.75 Using the same set of fragment descriptors as in their previous work,65 they obtained five sub-models (m1-m5), with reported r2 values in the range 0.88 to 0.9 for the training compounds, and about 0.5 to 0.72 for the test objects. Again, they constructed four linearly corrected models M1-M4 that combined the characteristics of the individual sub-models. Consistent with the previous work, these corrected models yielded very similar predictivity, though the M4 model, which was based on a linear combination of m1-m5, appeared to give the best overall result. Application of this model to 76 external compounds yielded a standard deviation error of 0.7 log units, which is very impressive, considering that the measured activity spanned approximately 9 orders of magnitude. In contrast, the ECOSAR system, the aquatic toxicity prediction program developed by the EPA, yielded a standard deviation error of 1.4 log units for these compounds. So it seems that, although ECOSAR has information on over 150 SARs for more than 50 chemical classes, the PNN models do cover a wider scope for general application.
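The data partitioning used for the T. pyriformis study (825 compounds split into 750 for model building and 75 held out, with the 750 further divided into five groups of 150 for leave-group-out cross-validation) can be sketched with index arrays. Two assumptions are made here: the external set is drawn at random, which the text does not state, and the final combination step is shown as an ordinary least-squares fit on the sub-model predictions, which is only one plausible reading of "a linear combination of m1-m5". The function names are hypothetical.

```python
import numpy as np

def partition(n_total=825, n_external=75, n_groups=5, seed=0):
    """Split compound indices into a training set, an external test set,
    and leave-group-out folds over the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    external, training = idx[:n_external], idx[n_external:]
    groups = np.array_split(training, n_groups)
    folds = []
    for k, held_out in enumerate(groups):
        # Sub-model k is fitted on the other groups and tested on group k.
        fit_on = np.concatenate([g for j, g in enumerate(groups) if j != k])
        folds.append((fit_on, held_out))
    return training, external, folds

def combine_linear(sub_model_preds, y):
    """Least-squares weights (intercept first) for combining sub-model predictions."""
    P = np.column_stack([np.ones(len(y)), sub_model_preds])
    weights, *_ = np.linalg.lstsq(P, y, rcond=None)
    return weights

training, external, folds = partition()
print(len(training), len(external), [(len(f), len(h)) for f, h in folds])
```

Because every training compound contributes to the combined model, only the external indices give an unbiased estimate of predictivity, which is the point made about the 75 held-out compounds.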
Carcinogenicity
Carcinogenicity is a toxicity endpoint that concerns the ability of a chemical to produce cancer in animals or humans. The extent of carcinogenicity of a substance is indicated qualitatively through the following categories: “Clear Evidence” or “Some Evidence” for positive results, “Equivocal Evidence” for uncertain findings, and “No Evidence” in the absence of observable effects. Every year the National Toxicology Program publishes a list of carcinogens or potential carcinogens in the Annual Report on Carcinogens. An approach for the prediction of the carcinogenic activity of polycyclic aromatic compounds using calculated semi-empirical parameters was described by Barone et al.76 Central to
Fig. 4.4. Data partitioning scheme of Niculescu et al.65 The full data set is split into a 750-compound training set and a 75-compound external test set. Six sub-models are derived using the full training set (m0) and five leave-group-out cross-validation experiments (m1 - m5). These sub-models are combined using linear corrections, yielding four final QSAR models (M1 - M4). The final QSAR models are then validated using the external test set. The gray color coding represents the different groups of compounds that are used for the model building purpose (i.e., external test set and cross-validation).
this work was the use of electronic indices, which were first applied to a set of 26 non-methylated polycyclic aromatic hydrocarbons (PAH). They reasoned that the carcinogenicity of these compounds is related to the local density of state over the ring that contains the highest bond order and to the energy gap (∆H) between HOMO and HOMO-1, the energy level right below HOMO. Six descriptors were considered:
• HOMO, the energy of the highest occupied molecular orbital
• HOMO-1, the energy level right below HOMO
• ∆H, the difference between the energy of HOMO and HOMO-1
• CH, the HOMO contribution to the local density of state
• CH-1, the HOMO-1 contribution to the local density of state
• ηH, CH - CH-1
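Two of the six descriptors are simple differences of the other four, and the classification rule that Barone et al built on them (ηH > 0 and ∆H > 0.408 eV, as stated in the text) is a pair of threshold tests. A minimal sketch follows; the function names and the numerical inputs are hypothetical, and the orbital energies would in practice come from a semi-empirical calculation.

```python
def eim_descriptors(homo, homo_1, c_h, c_h_1):
    """Derive the two difference descriptors from the four primary ones.

    homo, homo_1 : energies of HOMO and HOMO-1 (eV)
    c_h, c_h_1   : their contributions to the local density of state
    """
    delta_h = homo - homo_1     # energy gap between HOMO and HOMO-1
    eta_h = c_h - c_h_1         # difference of the two contributions
    return delta_h, eta_h

def eim_predicts_carcinogen(delta_h, eta_h, delta_threshold=0.408):
    """Threshold rule quoted in the text: both conditions must hold."""
    return eta_h > 0.0 and delta_h > delta_threshold

# Hypothetical input values, for illustration only.
delta_h, eta_h = eim_descriptors(homo=-7.0, homo_1=-7.5, c_h=0.3, c_h_1=0.1)
print(eim_predicts_carcinogen(delta_h, eta_h))
```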
A simple rule was formulated in this study: if ηH > 0 eV and ∆H > 0.408 eV, the molecule is likely to be a carcinogen. The use of the electronic index methodology (EIM) was extended in a more recent study, where 81 non-methylated and methylated PAH were analyzed through the use of principal component analysis (PCA) and neural network methods.77 For a set of 26 non-methylated compounds, PCA of the six thermodynamic
parameters yielded two principal components that accounted for 42.6% and 35.2% of the total variance. For the 46 methylated compounds, the first two principal components captured 81.5% and 15.9% of the variance, respectively. They characterized two clusters of active compounds and a region of inactive compounds from a two-dimensional principal component score plot, for which only three carcinogenic molecules were incorrectly classified. The use of an ANN yielded the best result, both in the separate non-methylated and methylated series and in the combined set (Table 4.6). They concluded that the EIM, which explored energy separation between frontier molecular orbitals, offered a relevant set of descriptors that could be used for accurate modeling of carcinogenic activity of this class of compounds. In another study, Bahler et al constructed a rodent carcinogenicity model using a number of machine learning paradigms, including decision trees, neural networks, and Bayesian classifiers.78 They analyzed a set of 904 rodent bioassay experiments, of which 468 were defined as either clear evidence or some evidence (referred to as positive), 276 as no evidence (referred to as negative), and 160 as equivocal cases that were eliminated from their analysis. They used 258 different attributes to characterize each of the subjects. Some of these attributes were theoretical descriptors, but the majority were observations from histopathological examination. These included 20 physicochemical parameters, 21 substructural alerts, 209 subchronic histopathology indicators, 4 attributes for maximally tolerated dose, sex and species exposed, route of administration and the Salmonella mutagenesis test result. A ten-way leave-group-out cross-validation was used to assess the predictivity of a neural network model. Using all attributes, the training accuracy converged to 90% whilst the test set accuracy reached about 80% before the performance deteriorated as the neural network began to overtrain.
Later, they applied a feature selection routine, termed single hidden unit method,79 to partition the descriptors into relevant and irrelevant features. They discovered that the removal of irrelevant features greatly improved the predictive accuracy for the test set (89%). In addition, the model successfully classified 392 of 452 (87%) of the positive cases, and 272 out of 292 (93%) of the negative cases. More significantly, they were able to extract knowledge embedded in the trained neural network and established a set of classification rules. It seemed that the conversion of a neural network system to such a rule-based system led to only minimal loss of prediction accuracy. For the same set of compounds, the rule-based system correctly identified 389 cases of true positives and 256 cases of true negatives, which corresponded to an overall accuracy of 87%. Because heuristic systems offer superior interpretability, while a neural network is generally more predictive, there is great optimism that the two methods can work synergistically in toxicological prediction. In the past ten years we have witnessed significant development in computational toxicology.67 In contrast to the experimental determination of physicochemical properties (e.g., logP, solubility), the toxicology endpoint has much greater variability, and is often assessed qualitatively (toxic/non-toxic, clear/some/equivocal/no evidence of carcinogenicity) rather than quantitatively. Due to this intrinsic uncertainty, as well as the fact that most, if not all, toxicity models are derived from limited data, we must address the predictions with appropriate caution. Further refinement in reliability estimation, such as the concept of optimum prediction space (i.e., the portion of chemical space within which the model is applicable) advocated in TOPKAT, should be a priority in future development. 
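The percentages quoted above for the Bahler et al neural network and its rule-based counterpart follow directly from the counts given in the text. The sketch below recomputes them; it assumes that the rule-based counts (389 and 256) refer to the same denominators (452 positive and 292 negative cases) as the neural network counts, which is how "for the same set of compounds" is read here. The function name is hypothetical.

```python
def classification_summary(tp, n_pos, tn, n_neg):
    """Per-class and overall accuracy from correctly classified counts."""
    sensitivity = tp / n_pos                    # fraction of positives recovered
    specificity = tn / n_neg                    # fraction of negatives recovered
    accuracy = (tp + tn) / (n_pos + n_neg)      # overall fraction correct
    return sensitivity, specificity, accuracy

# Counts as quoted in the text.
nn = classification_summary(tp=392, n_pos=452, tn=272, n_neg=292)
rules = classification_summary(tp=389, n_pos=452, tn=256, n_neg=292)

print([round(100 * v) for v in nn])      # [87, 93, 89]
print([round(100 * v) for v in rules])   # [86, 88, 87]
```

The drop from 89% to 87% overall comes almost entirely from the negative class, where the extracted rules recover 256 rather than 272 cases.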
The major problem in toxicology modeling certainly is the lack of consistent data with uniform assay evaluation criteria. Not surprisingly, most of the larger data sets that have been compiled to date concern aquatic toxicity from simple model organisms, which are useful indicators for risk assessment in environmental safety. For the purpose of drug discovery, however, it would be desirable to have the predictions extrapolated from mammalian toxicity data, but traditional in vivo toxicity screening involves animal testing, which is costly and time-consuming, and is impractical for mass screening of a large library. With the advent of molecular biology, in particular innovative proteomics
Table 4.6. Percentage of correct classification for the three different methodologies for 32 non-methylated and 46 methylated polycyclic aromatic hydrocarbons. The number of correct classifications in each instance is shown in parentheses77

Method   Non-Methylated   Methylated    All
EIM      84.4% (27)       73.9% (34)    78.2% (61)
PCA      84.4% (27)       78.3% (36)    80.8% (63)
NN       93.8% (30)       78.3% (36)    84.6% (66)
technology, one should soon be able to devise rapid in vitro toxicity methods that can gauge the effects of chemicals on human cellular proteins. An early fruit of such multidisciplinary research has appeared in the realm of toxicogenomics, offering hope for the prediction of the mechanisms of toxic action through DNA microarray analysis. Besides technological advances, another possible way to accelerate the compilation of toxicology data could be through a virtual laboratory,80 where consortium members could exchange proprietary data in order to minimize unnecessary duplication of work.

In regard to methodology development, the next major advance will probably come from the combined use of expert systems and correlative methods. We must emphasize that expert systems and neural network technologies should not be regarded as competitors, but rather as ideal complements. In particular, a neural network method can achieve an impressive predictive accuracy even though the method itself is naïve, in the sense that it neither relies on any fundamental theory nor provides any clue as to how it arrived at its answer. While the ability to generalize is the key to the extraordinary power of neural networks, the lack of theory behind the predictions does, to a certain extent, impede their use, because scientists are generally less enthusiastic about relying on results produced by a “black-box” approach that offers little or no qualitative explanation. Thus, it would be desirable to have an expert system, whether human or artificial, which could provide the logical reasoning complementary to the predictions made by the network. In this regard, the work by Bahler et al has given us a first glimpse of the cross-fertilization of artificial intelligence research in the form of a future “expert neural network”.78
Multidimensional Lead Optimization

It is argued that the advent of combinatorial chemistry and parallel synthesis methods has caused the development of lead compounds that are less drug-like (i.e., higher molecular weights, higher lipophilicity, and lower solubility).4,34,81,82 Because there is seemingly unlimited diversity in the chemical entities that can be synthesized, and the vast majority of these molecules appear to be pharmaceutically uninteresting, it is of great practical importance to reliably eliminate poor designs early in the drug discovery process. Filters such as the Pfizer Rule (Rule-of-5)4,82 and the REOS system (“Rapid Elimination Of Swill”)11 were developed as a means to prioritize compounds for synthesis.

We have also witnessed a change of paradigm in pre-clinical drug discovery. In the not-so-distant past, researchers optimized leads almost entirely with respect to potency. There is now an increasing emphasis on properties such as bioavailability and toxicity in parallel with potency improvement during lead optimization. This strategy, which is referred to as multidimensional lead optimization, is depicted in Figure 4.5. For clarity, only some of the key factors are shown inside the middle box, although additional factors such as patent coverage, synthetic feasibility, ease of scale-up and formulation, and chemical stability can also be taken into consideration.

Fig. 4.5. Schematic illustration of a multidimensional lead optimization strategy.

Consistent with this strategy, computational chemistry is becoming increasingly important in addressing these tasks. Traditionally, computer-aided drug discovery has focussed on ligand design by using structure-based or QSAR methods. Today, significant resources are dedicated to the development and application of in silico ADME and toxicity models to predict physicochemical and biological properties that are relevant to in vivo efficacy and therapeutic drug safety.83 While it is still too early to judge the impact of in silico prediction methods on the drug discovery process, we are confident that the continuous integration of medicinal chemistry, high-throughput screening, pharmacology, toxicology, computer modeling and information management will greatly enhance our ability and efficiency in discovering novel medicines.
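To make the filtering idea concrete, here is a minimal sketch of a Rule-of-5 style check used alongside potency. The thresholds are the commonly cited Lipinski cut-offs (molecular weight above 500, logP above 5, more than 5 hydrogen-bond donors, more than 10 acceptors). The `Candidate` record, the example values, and the convention of tolerating at most one violation are illustrative assumptions and are not taken from this chapter.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mol_weight: float   # g/mol
    log_p: float        # calculated octanol/water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors
    pic50: float        # potency, higher is better (hypothetical values)

def rule_of_five_violations(c: Candidate) -> int:
    """Count how many of the four Rule-of-5 criteria are exceeded."""
    return sum([
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def prioritize(candidates, max_violations=1):
    """Keep candidates within the violation limit, most potent first."""
    kept = [c for c in candidates if rule_of_five_violations(c) <= max_violations]
    return sorted(kept, key=lambda c: c.pic50, reverse=True)

# Hypothetical example data, for illustration only.
library = [
    Candidate("A", 342.4, 2.1, 2, 5, 6.8),
    Candidate("B", 612.8, 6.3, 3, 9, 8.9),   # most potent, but two violations
    Candidate("C", 455.5, 4.7, 1, 7, 7.5),
]
ranked = [c.name for c in prioritize(library)]
print(ranked)
```

In this toy library the most potent compound is removed by the property filter, which is the kind of trade-off that multidimensional optimization is meant to expose early.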
References

1. Verlinde CLMJ, Hol WGJ. Structure-based drug design: progress, results and challenges. Structure 1994; 2:577-587.
2. Navia MA, Chaturvedi PR. Design principles for orally bioavailable drugs. Drug Discov Today 1996; 1:179-189.
3. Chan OH, Stewart BH. Physicochemical and drug-delivery considerations for oral drug bioavailability. Drug Discov Today 1996; 1:461-473.
4. Lipinski CA, Lombardo F, Dominy BW et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Delivery Rev 1997; 23:3-25.
5. Frimurer TM, Bywater R, Nærum L et al. Improving the odds in discriminating “drug-like” from “non drug-like” compounds. J Chem Inf Comput Sci 2000; 40:1315-1324.
6. Ajay, Walters P, Murcko MA. Can we learn to distinguish between “drug-like” and “nondrug-like” molecules? J Med Chem 1998; 41:3314-3324.
7. Sadowski J, Kubinyi H. A scoring scheme for discriminating between drugs and nondrugs. J Med Chem 1998; 41:3325-3329.
8. Viswanadhan VN, Ghose AK, Revankar GR et al. Atomic physicochemical parameters for three dimensional structure directed quantitative structure-activity relationships. 4. Additional parameters for hydrophobic and dispersive interactions and their applications for an automated superposition of certain naturally occurring nucleoside antibiotics. J Chem Inf Comput Sci 1989; 29:163-172.
9. Matthews BW. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975; 405:442-451.
10. Feher M, Sourial E, Schmidt JM. A simple model for the prediction of blood-brain partitioning. International J Pharmaceutics 2000; 201:239-247.
11. Walters WP, Ajay, Murcko MA. Recognizing molecules with drug-like properties. Curr Opin Chem Biol 1999; 3:384-387.
12. Abramowitz R, Yalkowsky SH. Estimation of aqueous solubility and melting point of PCB congeners. Chemosphere 1990; 21:1221-1229.
13. Suzuki T. Development of an automatic estimation system for both the partition coefficient and aqueous solubility. J Comput-Aided Mol Des 1991; 5:149-166.
14. Kamlet MJ. Linear solvation energy relationships: an improved equation for correlation and prediction of aqueous solubilities of aromatic solutes including polycyclic aromatic hydrocarbons and polychlorinated biphenyls. Prog Phys Org Chem 1993; 19:293-317.
15. Jorgensen WL, Duffy EM. Prediction of drug solubility from Monte Carlo simulations. Bioorg Med Chem Lett 2000; 10:1155-1158.
16. Abraham MH, McGowan JC. The use of characteristic volumes to measure cavity terms in reversed phase liquid chromatography. Chromatographia 1987; 23:243-246.
17. Abraham MH. Scales of solute hydrogen-bonding: their construction and application to physicochemical and biochemical processes. Chem Soc Rev 1993; 22:73-83.
18. Bodor N, Harget A, Huang N-J. Neural network studies. 1. Estimation of the aqueous solubility of organic compounds. J Am Chem Soc 1991; 113:9480-9483.
19. Patil GS. Correlation of aqueous solubility and octanol-water partition coefficient based on molecular structure. Chemosphere 1991; 22:723-738.
20. Patil GS. Prediction of aqueous solubility and octanol-water partition coefficient for pesticides based on their molecular structure. J Hazard Mater 1994; 36:35-43.
21. Nelson TM, Jurs PC. Prediction of aqueous solubility of organic compounds. J Chem Inf Comput Sci 1994; 34:601-609.
22. Sutter JM, Jurs PC. Prediction of aqueous solubility for a diverse set of heteroatom-containing organic compounds using a quantitative structure-activity relationship. J Chem Inf Comput Sci 1996; 36:100-107.
23. Mitchell BE, Jurs PC. Prediction of aqueous solubility of organic compounds from molecular structure. J Chem Inf Comput Sci 1998; 38:489-496.
24. Ruelle P, Kesselring UW. Prediction of the aqueous solubility of proton-acceptor oxygen-containing compounds by the mobile order solubility model. J Chem Soc, Faraday Trans 1997; 93:2049-2052.
25. Egolf LM, Wessel MD, Jurs PC. Prediction of boiling points and critical temperatures of industrially important organic compounds from molecular structure. J Chem Inf Comput Sci 1994; 34:947-956.
26. Xu L, Ball JW, Dixon SL et al. Quantitative structure-activity relationships for toxicity of phenols using regression analysis and computational neural networks. Environ Toxicol Chem 1994; 13:841-851.
27. Sutter JM, Dixon SL, Jurs PC. Automated descriptor selection for quantitative structure-activity relationships using generalized simulated annealing. J Chem Inf Comput Sci 1995; 35:77-84.
28. Wessel MD, Jurs PC. Prediction of normal boiling points for a diverse set of industrially important organic compounds from molecular structure. J Chem Inf Comput Sci 1995; 35:841-850.
29. Mitchell BE, Jurs PC. Prediction of autoignition temperature of organic compounds from molecular structure. J Chem Inf Comput Sci 1997; 37:538-547.
30. Engelhardt HL, Jurs PC. Prediction of supercritical carbon dioxide solubility of organic compounds from molecular structure. J Chem Inf Comput Sci 1997; 37:478-484.
31. Wessel MD, Jurs PC, Tolan JW et al. Prediction of human intestinal absorption of drug compounds from molecular structure. J Chem Inf Comput Sci 1998; 38:726-735.
32. Huuskonen J, Salo M, Taskinen J. Neural network modeling for estimation of the aqueous solubility of structurally related drugs. J Pharm Sci 1997; 86:450-454.
33. Huuskonen J, Salo M, Taskinen J. Aqueous solubility prediction of drugs based on molecular topology and neural network modeling. J Chem Inf Comput Sci 1998; 38:450-456.
34. Huuskonen J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J Chem Inf Comput Sci 2000; 40:773-777.
35. Yalkowsky SH, Banerjee S. Aqueous solubility. Methods of estimation for organic compounds. New York: Marcel Dekker, 1992.
36. Huuskonen JJ, Livingstone DJ, Tetko IV. Neural network modeling for estimation of partition coefficient based on atom-type electrotopological state indices. J Chem Inf Comput Sci 2000; 40:947-955.
37. Levin VA. Relationship of octanol/water partition coefficient and molecular weight to rat brain capillary permeability. J Med Chem 1980; 23:682-684.
38. Kansy M, van de Waterbeemd H. Hydrogen bonding capacity and brain penetration. Chimia 1992; 46:299-303.
39. Abraham MH, Chadha HS, Mitchell RC. Hydrogen bonding. 33. Factors that influence the distribution of solutes between blood and brain. J Pharm Sci 1994; 83:1257-1268.
40. Norinder U, Sjöberg P, Österberg T. Theoretical calculation and prediction of brain-blood partitioning of organic solutes using MolSurf parameterization and PLS statistics. J Pharm Sci 1998; 87:952-959.
41. Ajay, Bemis GW, Murcko MA. Designing libraries with CNS activity. J Med Chem 1999; 42:4942-4951.
42. Clark DE. Rapid calculation of polar molecular surface area and its application to the prediction of transport phenomena. 2. Prediction of blood-brain barrier penetration. J Pharm Sci 1999; 88:815-821.
43. Rekker RE. The Hydrophobic Fragment Constant. Amsterdam: Elsevier, 1976.
44. Leo AJ, Jow PY, Silipo C et al. Calculation of hydrophobic constant (logP) from π and F constants. J Med Chem 1975; 18:865-868.
45. Schaper K-J, Samitier MLR. Calculation of octanol/water partition coefficients (logP) using artificial neural networks and connection matrices. Quant Struct-Act Relat 1997; 16:224-230.
46. Breindl A, Beck B, Clark T. Prediction of the n-octanol/water partition coefficient, logP, using a combination of semiempirical MO-calculations and a neural network. J Mol Model 1997; 3:142-155.
47. CONCORD. University of Texas, Austin, TX.
48. SYBYL. Tripos, Inc., St Louis, MO.
49. VAMP. Oxford Molecular Group, Oxford, UK.
50. Kier LB, Hall LH. An electrotopological state index for atoms in molecules. Pharm Res 1990; 7:801-807.
51. Hall LH, Kier LB. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence shell information. J Chem Inf Comput Sci 1995; 35:1039-1045.
52. Yoshida F, Topliss JG. QSAR model for drug human oral bioavailability. J Med Chem 2000; 43:2575-2585.
53. Schneider G, Coassolo P, Lavé T. Combining in vitro and in vivo pharmacokinetic data for prediction of hepatic drug clearance in humans by artificial neural networks and multivariate statistical techniques. J Med Chem 1999; 42:5072-5076.
54. Quiñones C, Caceres J, Stud M et al. Prediction of drug half-life values of antihistamines based on the CODES/neural network model. Quant Struct-Act Relat 2000; 19:448-454.
55. http://www.pjbpubs.com/scriprep/bs1024.htm.
56. Young RC, Mitchell RC, Brown TH et al. Development of a new physicochemical model for brain penetration and its application to the design of centrally acting H2 receptor histamine antagonist. J Med Chem 1988; 31:656-671.
57. Calder JA, Ganellin CR. Predicting the brain penetrating capability of histaminergic compounds. Drug Des Discov 1994; 11:259-268.
58. Lombardo F, Blake JF, Curatolo WJ. Computation of brain-blood partitioning of organic solutes via free energy calculations. J Med Chem 1996; 39:4750-4755.
59. Luco JM. Prediction of the brain-blood distribution of a large set of drugs from structurally derived descriptors using partial least-squares (PLS) modeling. J Chem Inf Comput Sci 1999; 39:396-404.
60. Österberg T, Norinder U. Prediction of polar surface area and drug transport processes using simple parameters and PLS statistics. J Chem Inf Comput Sci 2000; 40:1408-1411.
61. Keserü GM, Molnár L. High-throughput prediction of blood-brain partitioning: A thermodynamic approach. J Chem Inf Comput Sci 2001; 41:120-128.
62. Kelder J, Grootenhuis PD, Bayada DM et al. Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs. Pharm Res 1999; 16:1514-1519.
63. Bemis GW, Murcko MA. The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996; 39:2887-2893.
64. Young SS, Profeta SJ, Unwalla RJ et al. Exploratory analysis of chemical structure, bacterial mutagenicity and rodent tumorigenicity. Chemo Intell Lab Sys 1997; 37:115-124.
65. Niculescu SP, Kaiser KLE, Schultz TW. Modeling the toxicity of chemicals to Tetrahymena pyriformis using molecular fragment descriptors and probabilistic neural networks. Arch Environ Contam Toxicol 2000; 39:289-298.
66. Matthews EJ, Benz RD, Contrera JF. Use of toxicological information in drug design. J Mol Graph Model 2000; 18:605-615.
67. http://www.netsci.org/Science/Special/feature05.html.
68. Sanderson DM, Earnshaw CG. Computer prediction of possible toxic action from chemical structure: the DEREK system. Human Exptl Toxicol 1991; 10:261-273.
69. Woo Y, Lai DY, Argus MF et al. Development of structure-activity relationship rules for predicting carcinogenic potential of chemicals. Toxicol Lett 1995; 79:219-228.
70. Klopman G. Artificial intelligence approach to structure-activity studies: Computer automated structure evaluation method of biological activity of organic molecules. J Am Chem Soc 1984; 106:7315-7321.
71. Matthews EJ, Contrera JF. A new highly specific method for predicting the carcinogenic potential of pharmaceuticals in rodents using enhanced MCASE QSAR-ES software. Reg Toxicol Pharmacol 1998; 28:242-264.
72. Basak SC, Grunwald GD, Gute BD et al. Use of statistical and neural net approaches in predicting toxicity of chemicals. J Chem Inf Comput Sci 2000; 40:885-890.
73. TerraTox 2000. TerraBase Inc., Burlington, Ontario, Canada.
74. Russom CL, Bradbury SP, Broderius SJ et al. Predicting modes of toxic action from chemical structure: acute toxicity in the fathead minnow (Pimephales promelas). Environ Toxicol Chem 1997; 16:948-967.
75. Kaiser KLE, Niculescu SP. Modeling acute toxicity of chemicals to Daphnia magna: A probabilistic neural network approach. Environ Toxicol Chem 2001; 20:420-431.
76. Barone PMVB, Camilo A, Jr., Galvão DS. Theoretical approach to identify carcinogenic activity of polycyclic aromatic hydrocarbons. Phys Rev Lett 1996; 77:1186-1189.
77. Vendrame R, Braga RS, Takahata Y et al. Structure-activity relationship studies of carcinogenic activity of polycyclic aromatic hydrocarbons using calculated molecular descriptors with principal component analysis and neural network. J Chem Inf Comput Sci 1999; 39:1094-1104.
78. Bahler D, Stone B, Wellington C et al. Symbolic, neural, and Bayesian machine learning models for predicting carcinogenicity of chemical compounds. J Chem Inf Comput Sci 2000; 40:906-914.
79. Stone B. Feature selection and rule extraction for neural networks in the domain of predictive toxicology. Department of Computer Science, North Carolina State University, 1999.
80. Vedani A, Dobler M. Multi-dimensional QSAR in drug research. Predicting binding affinities, toxicity and pharmacokinetic parameters. Prog Drug Res 2000; 55:105-135.
81. Fecik RA, Frank KE, Gentry EJ et al. The search for oral drug bioavailability. Med Res Rev 1998; 18:149-185.
82. Lipinski CA. Drug-like properties and the causes of poor solubility and poor permeability. J Pharmacol Toxicol Methods 2000; 44:235-249.
83. Ekins S, Waller CL, Swaan PW et al. Progress in predicting human ADME parameters in silico. J Pharmacol Toxicol Methods 2000; 44:251-272.
CHAPTER 5
Evolutionary De Novo Design

“GAs have been shown to be capable of describing extremely complex behaviour in a range of application domains, including those of molecular recognition and design.” (P. Willett)1
Current Concepts in Computer-Based Molecular Design
In the previous Chapters we have addressed some issues related to adaptive optimization methods and fitness calculation in the context of drug design tasks, in particular evolutionary algorithms and artificial neural networks. To close the design cycle depicted in Figure 1.4 we still have to define the molecule generator. This will be the main focus of this Chapter. We will highlight only selected approaches, which we have chosen either because they illustrate a general principle or because we have particular experience with these methods. Again the focus will be on evolutionary techniques. Generally, current computer-based molecular design approaches may be regarded as being guided by two major strategies:

1. Structure-based design relying on a 3D receptor model of the ligand-binding pocket
2. Ligand-based design starting from the knowledge of one or several known actives without taking the 3D receptor structure into account.
The majority of the current structure-based design tools are based on a computer model of a binding site and require a scoring function that computes an estimate of the binding affinity of a molecule—e.g., a potential inhibitor—in a given conformation (also called a pose) within the binding pocket. In contrast, ligand-based tools usually build on a scoring function that implements some sort of similarity principle rather than estimating binding affinity in a receptor-ligand docking experiment. Of course, both approaches complement each other and can be combined—depending on how much biostructure information is available at the beginning or becomes available during a project. New structures can for example be docked into or grown within a binding pocket (provided a receptor structure is available) or compared to a known active reference molecule. De novo design attempts to generate novel molecules matching a given binding pattern (Pharmacophore), i.e., the spatial arrangement of relevant receptor-ligand interaction points. Among the most prominent software solutions for structure-based de novo design are the packages LUDI,2 BUILDER3 and CAVEAT.4 These algorithms identify potential ligand-receptor interaction or attachment points in the receptor binding pocket and construct novel molecular entities by combinatorial or sequential assembly of atoms and molecular fragments (Fig. 5.1). 
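A ligand-based scoring function of the kind described above can be as simple as a similarity coefficient between feature sets. The sketch below uses the Tanimoto coefficient on sets of substructure keys. The feature names and the two example designs are made up for illustration; in practice the features would come from a fingerprinting or pharmacophore-typing tool.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto coefficient: shared features / features present in either molecule."""
    if not features_a and not features_b:
        return 0.0
    shared = len(features_a & features_b)
    return shared / len(features_a | features_b)

# Hypothetical substructure keys for a known active (reference) and two designs.
reference = {"amidine", "aromatic_ring", "amide", "carboxylate"}
design_1 = {"amidine", "aromatic_ring", "amide", "sulfonamide"}
design_2 = {"aromatic_ring", "ester"}

scores = {
    name: tanimoto(reference, features)
    for name, features in (("design_1", design_1), ("design_2", design_2))
}
best = max(scores, key=scores.get)
print(scores, best)
```

Such a score needs no receptor structure at all, which is why it is cheap enough to serve as the fitness function in an evolutionary search.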
Fig. 5.1. Two strategies for structure-based molecule assembly from fragments (adapted from M. Stahl).33 The solid line represents a ligand-binding pocket on the surface of a protein. a) “Fragment placing and linking”; b) sequential growth technique.

As mentioned above, the compatibility (quality, fitness) of novel structures or an individual molecular fragment in a given position is often estimated by empirical scoring functions.5 Although fast combinatorial docking procedures clearly proved their applicability to de novo design,6 one of the major problems still to be solved is the accurate prediction of binding energies.7,8 This problem has been approached in many different ways, e.g., by force-field based methods,9-17 techniques based on the Poisson-Boltzmann equation,18-21 potentials of mean force,22-27 free energy perturbation,28 and linear response approximations.29,30 In this
context it is common to differentiate between empirical and knowledge-based scoring functions. The term “empirical scoring function” stresses that these quality functions approximate the free energy of binding, ∆Gbinding, as a sum of weighted interactions that are described by simple geometrical functions, fi, of the ligand and receptor co-ordinates r (Eq.5.1).8 Most empirical scoring functions are calibrated with a set of experimental binding affinities obtained from protein-ligand complexes, i.e., the weights (coefficients) ∆Gi are determined by regression techniques in a supervised fashion. Such functions usually consider individual contributions from hydrogen bonds, ionic interactions, hydrophobic interactions, and binding entropy. As with many empirical approaches the difficulty with empirical scoring arises from inconsistent calibration data.
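As a rough illustration of the weighted-sum form (Eq. 5.1), the sketch below scores a pose from a list of hydrogen-bond distances, a buried lipophilic contact area and a count of rotatable bonds. The functional form of `hbond_term`, the weights in `WEIGHTS`, the constant offset and all input values are hypothetical placeholders chosen only to show the structure. A calibrated function would obtain its weights by regression against measured binding affinities.

```python
def hbond_term(distance, ideal=1.9, tolerance=0.2, cutoff=0.6):
    """Geometric function f_i: 1 within `tolerance` of the ideal distance,
    falling linearly to 0 at a deviation of `cutoff` (all in Angstrom)."""
    deviation = abs(distance - ideal)
    if deviation <= tolerance:
        return 1.0
    if deviation >= cutoff:
        return 0.0
    return (cutoff - deviation) / (cutoff - tolerance)

# Hypothetical weights (energy per unit of each term); not fitted to any data.
WEIGHTS = {"constant": 5.0, "hbond": -4.0, "lipophilic": -0.15, "rotor": 1.5}

def empirical_score(hbond_distances, lipophilic_area, n_rotatable_bonds):
    """Estimated binding free energy as a sum of weighted terms; lower is better."""
    f_hbond = sum(hbond_term(d) for d in hbond_distances)
    return (WEIGHTS["constant"]
            + WEIGHTS["hbond"] * f_hbond
            + WEIGHTS["lipophilic"] * lipophilic_area
            + WEIGHTS["rotor"] * n_rotatable_bonds)

score = empirical_score(hbond_distances=[1.9, 2.3, 3.0],
                        lipophilic_area=120.0, n_rotatable_bonds=4)
print(round(score, 2))
```

Because every term is a cheap geometric function, a score of this type can be evaluated for each candidate pose during docking, which is what makes it usable as a fitness function.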
$$\Delta G_{\text{binding}} = \sum_i \Delta G_i \, f_i(\mathbf{r}) \qquad \text{(Eq. 5.1)}$$

Knowledge-based scoring functions have their foundation in the inverse formulation of the Boltzmann law, computing an energy function that is also referred to as a “potential of mean force” (PMF). The inverse Boltzmann technique can be applied to derive sets of atom-pair potentials (energy functions) favoring preferred contacts and penalizing repulsive interactions. The various approaches differ in the sets of protein-ligand complexes used to obtain these potentials, the form of the energy function, the definition of protein and ligand atom types, the definition of reference states, distance cutoffs, and several additional parameters. Scoring functions are a very active and rapidly changing research field. Thorough treatments of historical and current concepts and achievements can be found in the literature.7,31-33

A complementary approach to starting from a receptor structure is to build upon a Pharmacophore hypothesis that was derived from a known bioactive molecule or ligand.34 Based on a Pharmacophore model, alternative molecular architectures can be virtually assembled mimicking the Pharmacophore pattern present in the original template structure (for review,
see Chapters 2 and 3). This methodology and related tactics represent workable approaches to ligand de novo design when a high-resolution receptor structure is not available, which is especially the case for many membrane-bound neuroreceptors in central nervous system research, including the large group of G-protein-coupled receptors.

Irrespective of the availability of a receptor model and the choice of the fitness function, there are two alternative approaches to assembling new molecular structures, namely atom-based methods and fragment-based methods. Furthermore, the assembly process itself may be used to categorize the different molecular design concepts. One may differentiate between incremental-growth and construct-and-score techniques. The first method starts from a small molecular fragment and sequentially adds and modifies parts (atoms, fragments) to obtain the final design (Fig. 5.1). At each step the intermediate solution is scored and evaluated. The alternative is to first build a complete molecule and then perform a single scoring step for the virtual product. Table 5.1 contains a compilation of selected software tools, which implement different procedures and are commonly used in molecular design studies. Several additional algorithms have been proposed during the last decade; most of them are unnamed and only mentioned in the context of the respective publication. Textbooks and several recent review articles provide in-depth treatments of this field of computational chemistry, with a historical focus on structure-based approaches.33,35-40

While atom-based techniques build up a molecule atom by atom, fragment-based design methods use sets of pre-defined molecular building blocks that are connected by a virtual synthesis scheme. This approach can have several advantages, particularly in combination with evolutionary algorithms:

• Fragments can be regarded as molecular “modules”. Both whole molecules and fragments are easily encoded in a “genotype” representation, e.g., a molecular “chromosome”;
• the definition of a fragment is variable; it may mean large molecular building blocks (e.g., synthons of combinatorial or parallel chemistry) as well as small fragments like functional groups or even single atoms. In this view, atom-based design is a special case of fragment definition;
• the chance of designing a synthetically feasible and “drug-like” structure will be high if physically accessible reaction educts (synthons) or retro-synthetically obtained sets of fragments are used in virtual synthesis; and
• the size of the search space is greatly reduced by the use of the fragment concept compared to purely atom-based techniques.
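The encoding idea in the list above can be sketched in a few lines. Here a molecule is represented as a tuple of indices into a fragment library (the “chromosome”), and a mutation swaps one fragment for another at the same position. The fragment names, the fixed three-position layout and the mutation operator are hypothetical, and no chemistry (valence, synthetic accessibility) is checked.

```python
import random

# Hypothetical building-block library: one list of fragments per variable position.
FRAGMENTS = [
    ["benzamidine", "aminopyridine", "guanidine"],      # position 0: polar head group
    ["amide_linker", "urea_linker", "sulfonamide"],     # position 1: linker
    ["phenyl", "cyclohexyl", "naphthyl", "isobutyl"],   # position 2: lipophilic group
]

def decode(chromosome):
    """Translate a tuple of fragment indices into a readable fragment sequence."""
    return "-".join(FRAGMENTS[pos][idx] for pos, idx in enumerate(chromosome))

def mutate(chromosome, rng):
    """Replace the fragment at one randomly chosen position by a different one."""
    pos = rng.randrange(len(chromosome))
    choices = [i for i in range(len(FRAGMENTS[pos])) if i != chromosome[pos]]
    child = list(chromosome)
    child[pos] = rng.choice(choices)
    return tuple(child)

def search_space_size():
    """Number of distinct molecules this library can encode."""
    size = 1
    for options in FRAGMENTS:
        size *= len(options)
    return size

rng = random.Random(0)
parent = (0, 1, 0)
child = mutate(parent, rng)
print(decode(parent), "->", decode(child), "| search space:", search_space_size())
```

With only ten fragments the library encodes 36 molecules, each of which is a chemically plausible combination by construction. An atom-based generator would have to search a far larger space to reach the same structures.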
Despite their appeal and ease of implementation fragment-based techniques have some limitations. Most important, they are often restricted to relatively coarse-grained designs, because fine-tuning of structures can be hampered due to a restricted fragment set—especially during the final optimization cycles. A chemically meaningful selection of fragments for the design process is crucial for success. Very often the usefulness of a particular fragment set depends on the design task. It can be very beneficial to use different sets for the design of GPCR modulators or kinase inhibitors, for example. Furthermore, during the first virtual design cycle—while a coarse-grained search is performed—the fragment sets can differ from the sets used during the later stages. In the ideal case the fragment sets should adapt in a similar way to the output structures. Only recently this idea of adaptive fragment definitions has been incorporated into evolutionary de novo design algorithms (G. Schneider, K. Bleicher, J. Zuegge et al, unpublished).
Table 5.1. Selected examples of de novo design algorithms

Method | Ref. | Concept / Comment
BUILDER | 3 | Recombination of docked molecules, combinatorial search
CAVEAT | 4 | Database search for fragments fitting onto two bonds
CONCERTS | 132 | Fragment-based, stochastic search
DLD | 10 | Atom-based, structure sampling by simulated annealing
GENSTAR | 133 | Atom-based, grows molecules in situ based on an enzyme contact model
GROUPBUILD | 134 | Fragment-based, sequential growth, combinatorial search
GROWMOL | 135 | Fragment-based, sequential growth, stochastic search
HOOK | 136 | Linker database search for fragments placed by MCSS
LEGEND | 137 | Atom-based, stochastic search
LUDI | 2 | Fragment-based, combinatorial search
MCDNLG | 138 | Atom-based, stochastic search
MCSS | 139 | Fragment-based, stochastic sampling of probes
MOLMAKER | 140 | Graph-theoretical technique for 3D molecular design
NEWLEAD | 141 | Fragment-based, builds on 3D pharmacophore-models
PRO-LIGAND | 45,46 | Fragment-based, combinatorial search (GA)
PROSA | 142 | Fragment-based, peptide design, stochastic sampling
PRO-SELECT | 143 | Combinatorial fragment-based, in situ design using a scaffold-linker approach
SME | 74,121 | Fragment-based, peptide design, evolutionary search
SMOG | 144 | Fragment-based, sequential growth, stochastic search
SPLICE | 145 | Recombination of ligands retrieved by a 3D database search
SPROUT | 146,147 | Fragment-based, sequential growth, combinatorial search
TOPAS | 59 | Fragment-based, evolutionary search

Evolutionary Structure Generators

The idea of using evolutionary algorithms for de novo design is not new, and several extensive reviews and compilations of articles treating this topic are available.1,41-44 The first examples of EA-based design programs were published around 1990 (Table 5.2). Initially these
approaches focused on biopolymers like peptides and small RNA molecules using 2D similarity scores as fitness function, but most of the more recent developments concentrate on the design of small molecules as potential drug candidates. The aim of de novo design within a receptor binding pocket (in situ) is to generate molecules that satisfy as many of the potential or known interaction sites as possible without violating the steric constraints of the binding pocket. Characteristic programs for this task are PRO-LIGAND,45,46 LeapFrog,47 ChemicalGenesis,48 and the early work of Blaney and coworkers.49 Typical evolutionary techniques for the de novo design of molecules to fit 2D constraints include the algorithms developed by Venkatasubramanian et al,50-52 Nachbar,53 Globus et al,54 Devillers et al,55 Weber et al,56,57 Douguet et al,58 and Schneider and coworkers.59,60 Usually such algorithms treat molecules as a linear string, a tree structure, or a molecular graph. The SMILES representation of molecules is frequently applied.61 In 3D design the molecule is usually manipulated directly. It is also possible to use fragment-based structure generators like TOPAS59 (vide infra for a detailed description of this algorithm) in combination with heuristic 3D conformer builders like CORINA, CONCORD, or CONFORT,62-65 and feed the designs into fast docking programs (e.g., FlexX,66,67 DOCK,68,69 or GLIDE70) for fitness determination. The program CONJURE was developed by researchers at Vertex Pharmaceuticals and represents an early version of this approach.71
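For the linear-string encodings mentioned above, a typical variation operator is one-point crossover. The sketch below applies it to two peptide sequences, where every recombined string is again a valid peptide. The sequences are arbitrary examples and the function is our own illustration, not code from any of the cited programs. Note that the same operator applied naively to SMILES strings can produce syntactically invalid molecules, so SMILES-based methods need additional validity checks.

```python
import random

def one_point_crossover(parent_a: str, parent_b: str, rng: random.Random):
    """Exchange the tails of two equal-length strings at a random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    cut = rng.randrange(1, len(parent_a))   # cut strictly inside the string
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

# Arbitrary example sequences in one-letter amino acid code.
PARENT_A = "GSKDRLEA"
PARENT_B = "AVLPWTNQ"

rng = random.Random(42)
child_1, child_2 = one_point_crossover(PARENT_A, PARENT_B, rng)
print(child_1, child_2)
```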
Table 5.2. Selected evolutionary de novo design methods (adapted from Gillet44)

Method / Program | Ref. | Structure Encoding | Fitness Function
PepMaker | 72,74 | Linear string (2D), peptides | Similarity measure based on an amino acid mutation matrix
SME | 74,121 | Linear string (2D), peptides | Neural network QSAR
Venkatasubramanian et al | 50-52 | Linear string (2D) | Comparison of calculated physicochemical properties with target values
Nachbar | 53 | Tree structure (2D) | QSAR and penalties for undesired chemical features
Globus et al | 54 | Molecular graph (2D) | Atom-pairs based similarity measure
Devillers et al | 55 | Linear string of fragments (2D) | Neural network
Weber et al | 56,57 | SMILES string (2D); chemical class restrictions | Experimental binding constant
Blaney et al | 49 | SMILES string (2D) | Receptor site constraints; binding energy; similarity measure; penalty terms
LEA | 58 | SMILES string (2D) | QSAR, property ranges
TOPAS | 59 | SMILES string (2D); fragment-based | Similarity measure based on topological pharmacophores; penalties for undesired chemical features
ChemicalGenesis | 48 | 3D conformers | Comparison of calculated properties with receptor site and Pharmacophore constraints
LeapFrog | 47 | 3D conformers | Receptor site constraints and CoMFA model
PRO-LIGAND | 45,46 | 3D conformers | Number of receptor site interaction point hits; property penalties
CONJURE | 71 | 3D conformers; fragment-based | Empirical scoring function
One such construct-and-score technique was explored at Roche for the design of novel serine protease inhibitors (F. Hoffmann-La Roche Ltd.; K. Bleicher, G. Schneider, M. Stahl; unpublished). Structures were assembled by TOPAS and each new molecule was docked into the binding pocket of the target. The FlexX docking score was used as a measure of fitness.66 In Fig. 5.2a, a set of 50 docked structures is shown, forming the offspring of one of the last generations of an evolutionary optimization run. In this example, the serine protease Factor VIIa served as the target enzyme. Many of the known residue structures (mainly benzamidine derivatives) for hydrogen-bonding to the aspartic acid Asp189 at the bottom of the S1 pocket have evolved, and the designed molecules reveal a preference for lipophilic moieties potentially binding to an adjacent lipophilic pocket. It must be noted that the docking step for fitness
Fig. 5.2. Evolutionary de novo design using a combination of a 2D structure generator (TOPAS) and a docking technique (FlexX). a) Docked conformations of one generation of designs (50 molecules) within the Factor VIIa binding pocket; b) one particular designed structure which received a high fitness value (docking score). It was synthesized but turned out to be inactive probably due to insufficient solubility (courtesy of F. Hoffmann-La Roche Ltd.; K. Bleicher, G. Schneider, M. Stahl; unpublished)
estimation comes at a price, i.e., significantly longer computing time than simple similarity searching. In this particular study we used a population size of 50 individuals with non-parallel program execution; as a consequence, one generation took 45 minutes of computation time. Despite this drawback, evolutionary 3D design methods can be of considerable value if a high-resolution structure or model of the receptor pocket is available. The success of the approach largely depends on the accuracy of the scoring function involved. Sometimes molecules are grown which receive a high docking score (high fitness) but turn out to be inactive. One example is shown in Fig. 5.2b. This urea derivative possesses poor aqueous solubility and, as a consequence, its predicted tight binding could not be confirmed in an enzyme inhibition test. This example reminds us not to restrict fitness measures to a single quality (e.g., the predicted binding energy) but to consider several drug-like properties in parallel. One of the most critical physicochemical properties is the aqueous solubility (see Chapter 4). Drug design is a multi-dimensional optimization task.
In the remaining part of this Chapter we will restrict the discussion of evolutionary structure generators to evolution strategy-based systems (see Chapter 1 for details about evolution strategies). First, the special case of peptide design will be presented; then this reduced approach will be generalized to fragment-based small molecule design, taking the TOPAS approach as an example. It must be emphasized that there are many ways in which an EA can be implemented,1 and the approaches discussed in this Chapter are intended only as an entry point that demonstrates some general principles.
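The remark above about not restricting fitness to a single quality can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Candidate` fields, the solubility threshold `min_log_s`, the penalty `weight`, and the two example molecules are hypothetical and are not taken from the Roche study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    docking_score: float  # hypothetical scale: more negative = better predicted binding
    log_s: float          # hypothetical predicted log10 aqueous solubility (mol/L)


def fitness(c: Candidate, min_log_s: float = -4.0, weight: float = 5.0) -> float:
    """Higher is better: binding term minus a penalty for poor solubility."""
    binding_term = -c.docking_score
    # Only the shortfall below the (assumed) solubility threshold is penalized.
    shortfall = max(0.0, min_log_s - c.log_s)
    return binding_term - weight * shortfall


candidates = [
    Candidate("tight_but_insoluble", docking_score=-35.0, log_s=-7.0),
    Candidate("moderate_and_soluble", docking_score=-28.0, log_s=-3.0),
]
ranked = sorted(candidates, key=fitness, reverse=True)
print([c.name for c in ranked])
```

With these made-up numbers the poorly soluble molecule loses its top rank although its docking score is better, which is the behavior one would want from a multi-property fitness function. The weighting of the terms is a design choice that has to be tuned for each project.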
Peptide Design by “Simulated Molecular Evolution”
Cell biology, genomics, and proteomics provide three important pillars for rational drug design, and computer algorithms combined with sophisticated combinatorial chemistry provide a useful toolbox for lead identification. Progress in cell biology has led to deep insights
into important cellular processes like differentiation, division, and adaptation. Details of the underlying regulation and control mechanisms have been elucidated, and many signal molecules, in particular peptides which function as endogenous ligands to membrane receptors, have been identified. Protein-protein and peptide-protein interactions also provide a molecular basis for information transfer between cellular compartments. Many diseases are caused by aberrant protein-protein interactions. A possible therapeutic strategy is to design peptides specifically blocking these undesired interactions. PepMaker was one of the first fully automated computer algorithms for designing such blocking peptides.72 It was successfully applied to the design of several bioactive peptides.73,74 The appeal of the PepMaker method described here lies in the design of novel peptides starting from just a single known active peptide, the “seed peptide”. The PepMaker system allows exploration of the neighborhood of the seed peptide in amino acid sequence space. It is expected that several functional peptides with similar or even improved activity can be found in the vicinity of the seed peptide.73,74 This idea will now be explained in more detail because it provides an insight into more advanced “simulated molecular evolution” techniques.
Compared to small molecule design, peptide design is a simple task because of the inherent modular architecture of amino acid sequences, their easy synthetic accessibility, and the restricted size of sequence space. Peptide design can be viewed as a special combinatorial chemistry approach with a restricted set of building blocks (e.g., the 20 standard amino acid residues) and amide bond formation as the only coupling reaction. A typical task is to identify a linear sequence of amino acid residues exhibiting a desired biological activity, e.g., binding affinity to a target molecule.
In the simplest case there are only two parameters subject to optimization: i) the number of sequence positions (peptide length), and ii) the type of residue present at each sequence position. The typical experimental approach is to perform large-scale random screening, which has become feasible due to recent advances in peptide synthesis and activity detection techniques.75-77 Blind screening is essential if no information about function-determining residue patterns is available. This is particularly true when conventional structure-based molecular modeling cannot be performed due to the lack of high-resolution receptor structures. If, however, a template peptide (“seed”) or other information is already known that can be used to limit the search space, it is worthwhile following some kind of rational design.74,78-80 The PepMaker algorithm generates variant peptides stemming from sequence space regions around the seed structure, thereby approximating a unimodal bell-shaped distribution. It is assumed that molecules with an improved function can be identified among the peptides located close to the seed peptide in sequence space. This assumption is supported by a number of observations:81-88
• In natural evolutionary processes, large alterations of a protein may occur within a generation, but these extremely different mutants rarely survive.
• Most observed mutations leading to a slightly improved function are single-site substitutions keeping the vast majority of the sequence unchanged.
• Conservative replacements tend to prefer substitutions of amino acids which are similar in their intrinsic physicochemical properties.
The algorithm generates a bell-shaped distribution of variants for construction of a biased peptide library, which is thought to approximately reflect some of these aspects of natural protein evolution. In addition to incremental optimization by small steps, large sequence alterations can also lead to improved function. This might be the case if, for example, several optima exist in sequence space.82,89,90 This idea consequently follows the evolution strategy approach discussed in Chapter 1. The width of the distribution, σ, may be altered to generate more or less focused libraries. As a result, the “diversity” of the libraries can be expected to increase from small to large σ values.
σ functions as a strategy parameter in the evolutionary design principle (see Chapter 1). To quantify “diversity”, the Shannon uncertainty measure can be applied (Eq. 5.2).91,92 A similar entropy measure has been shown to be useful for assessment of the molecular diversity of large collections of small organic molecules and descriptor analysis.93-95 Details on the concept of entropy and information theory can be found in the literature.96-98 Here we will give a brief introduction to the general concept. The Shannon entropy of a sequence alignment block has been proposed as a measure of the randomness of the residue distribution at each aligned position.92,99,100 Various interpretations of the meaning of “entropy” or “information content” are possible, including treatment as a chemical diversity measure or the degree of feature conservation. If $P_i(x_k)$ gives the frequency of the symbol $x_k$ from the alphabet $x_1, x_2, x_3, \ldots, x_A$ at the alignment position $i$, the Shannon entropy $H_i$ at this position is defined by
$$H_i = -\sum_{k=1}^{A} P_i(x_k)\,\log_2 P_i(x_k) \qquad \text{(Eq. 5.2)}$$
In this definition $P_i(x_k)\log_2 P_i(x_k)$ is taken to be zero if $P_i(x_k) = 0$. The unit of the Shannon entropy is “bit” because the base of the logarithm in the formula is two. The Shannon information $R_i$ at the position $i$ in the sequence alignment block is defined as the difference of two entropies:
$$R_i = H_{\text{background}} - H_i \qquad \text{(Eq. 5.3)}$$
where $H_{\text{background}}$ corresponds to the average sequence entropy. $H_{\text{background}}$ is maximal when the symbols of the alphabet are evenly distributed:
$$P(x_1) = P(x_2) = \dots = P(x_A) = \frac{1}{A} \qquad \text{(Eq. 5.4)}$$
In this case the formula for $H_{\text{background}}$ simplifies to
$$H_{\text{background}} = \log_2 A \qquad \text{(Eq. 5.5)}$$
with all calculated values for $R_i$ equal to or larger than zero. Therefore, an even background distribution of all residues is usually assumed when information theory is applied to sequence alignment blocks. This means that the information content at a given position is high when the distribution of the symbols is far from random. As a result, calculating the information content for each position in a sequence alignment can be used to spot conserved symbols in the block. However, the naïve assumption of evenly distributed symbols in the background might lead to false interpretations if the background distribution is highly biased towards certain symbols. Unfortunately, when $H_{\text{background}}$ is calculated using the true background distribution, other problems in the interpretation of $R_i$ might occur. It could be the case that $R_i$ is calculated to be zero, although the frequency of each symbol differs in $H_{\text{background}}$ and $H_i$, simply because both distributions are equally far from a totally random distribution. This is because $R_i$ only
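tells us about the whole distribution of symbols, without comparing the frequency of each specific symbol directly. As a possible solution to this problem one can calculate the relative entropy $H_i(P_i \,\|\, P_{\text{background}})$, also known as the Kullback-Leibler “distance”, defined by
$$H_i(P_i \,\|\, P_{\text{background}}) = \sum_{k=1}^{A} P_i(x_k)\,\log_2 \frac{P_i(x_k)}{P_{\text{background}}(x_k)} \qquad \text{(Eq. 5.6)}$$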
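The three quantities introduced in this digression (Shannon entropy, Shannon information, and relative entropy) can be computed for a single alignment position in a few lines. The following is a minimal sketch, not the H-BloX implementation; the example column `LLLLIIVV` is hypothetical, and the relative entropy function assumes that the background frequency is non-zero wherever the observed frequency is non-zero.

```python
import math
from collections import Counter


def frequencies(column, alphabet):
    """Relative frequency of every alphabet symbol in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return {x: counts.get(x, 0) / n for x in alphabet}


def shannon_entropy(p):
    """H = -sum p*log2(p) in bits; terms with p = 0 contribute zero."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def shannon_information(p, background):
    """R = H_background - H."""
    return shannon_entropy(background) - shannon_entropy(p)


def relative_entropy(p, background):
    """Kullback-Leibler 'distance' sum p*log2(p/q) in bits.

    Assumes background[x] > 0 wherever p[x] > 0.
    """
    return sum(v * math.log2(v / background[x]) for x, v in p.items() if v > 0)


alphabet = "ACDEFGHIKLMNPQRSTVWY"
uniform = {x: 1 / len(alphabet) for x in alphabet}

column = "LLLLIIVV"  # hypothetical residues observed at one aligned position
p = frequencies(column, alphabet)
h = shannon_entropy(p)                 # L: 0.5, I: 0.25, V: 0.25
r = shannon_information(p, uniform)
kl = relative_entropy(p, uniform)
print(h, r, kl)
```

For an even background distribution the last two values coincide, since the relative entropy then reduces to log2 A minus the entropy of the column.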
The relative entropy equals the Shannon information for an even background distribution, but differs otherwise. $H_i(P_i \,\|\, P_{\text{background}})$ is always equal to or greater than zero. It vanishes only if every single symbol has the same frequency in both the background distribution and the sequence block position. Contrary to the Shannon entropy, relative entropy is not a “state function”, and although it is often useful to think of relative entropy as a distance between two probability distributions, it is not symmetric and is not a correct mathematical distance measure.92 For potential applications of Equation 5.4, see e.g., the textbooks of Durbin and coworkers and Baldi and Brunak.98,100 The H-BloX software provides an easy-to-use web interface to entropy calculation and diversity assessment and may be downloaded as HTML/JavaScript, or accessed online at the URL: http://www.modlab.de.95 The H-BloX analysis is not restricted to protein, DNA or RNA sequences. It may be applied to arbitrary chemical libraries, provided a sensible alignment of structures can be accomplished. As long as the molecular structure can be described by a limited set of building blocks (which is often straightforward for combinatorial libraries), a corresponding sequence representation may be used as the input data for H-BloX.
This short digression to entropy and library diversity was intended to provide a better understanding of the two main problems in evolutionary structure generation and library design. These issues are critical because the diversity of molecular building blocks determines the degree of library focusing (Fig. 1.2):
1. An appropriate metric in chemical space (e.g., sequence space in the case of biopolymer design) must be defined. Both the step-size of an evolutionary walk in chemical space and the degree of library diversity will be measured on the basis of this metric.
2. A procedure must be available that allows the calculation of mutation rates for pair-wise exchange and modification of molecular fragments or building blocks (e.g., amino acid monomers for peptide design). Similarity between building blocks will be defined based on the metric in chemical space.
To illustrate the idea of a mutation operator, an example of a mutation event is shown in Fig. 5.3. For reasons of clarity, the mutation of a single amino acid residue is illustrated. However, a similar scheme is appropriate for arbitrarily defined combinatorial building blocks or whole compounds. In the example shown, the amino acid tyrosine (Y) represents the parent structure, and the remaining 19 genetically encoded amino acids provide the stock of structures. With decreasing probability the parent is substituted by more distant residues. In the example, distance between two residues is defined by their Euclidean distance, di,j, in a primitive two-dimensional space spanned by a hydrophobicity and a volume axis (Fig. 2.5a). This amino acid similarity measure proved to be useful in several design exercises.74,101 As indicated by the dashed lines in Fig. 5.3, based on this model a mutation of tyrosine to phenylalanine (Y→F) or leucine (Y→L) is very likely to occur, whereas the tyrosine-arginine transition (Y→R) is extremely rare. Another set of transition probabilities would result from a different σ value or a changed ordering of the residues. In the PepMaker model the residue transition probability P(i→j) is given by Equation 5.7.72,102
Fig. 5.3. Mutation probability of the amino acid residue tyrosine (Y). The ordering of the 20 natural amino acids is based on a physicochemical distance. In the example there is a high probability of the Y→L transition, and a marginal probability of the Y→R mutation.
Eq. 5.7
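To make the idea of distance-dependent transition probabilities concrete, the sketch below assumes a Gaussian weighting of the form exp(−d²/2σ²), normalized over all 20 residues, with distances scaled by the largest entry of Table 5.3. Both the functional form and the scaling are assumptions made for illustration and should not be read as the exact form of Equation 5.7. The distances from isoleucine are read from Table 5.3; the function names are hypothetical.

```python
import math
import random

# Distances from isoleucine (I) to all residues, read from Table 5.3.
DIST_FROM_I = {
    "A": 2.8, "C": 3.7, "D": 3.8, "E": 4.0, "F": 1.4, "G": 3.7, "H": 2.9,
    "I": 0.0, "K": 3.4, "L": 1.3, "M": 2.5, "N": 3.5, "P": 4.5, "Q": 2.8,
    "R": 3.4, "S": 2.7, "T": 2.0, "V": 0.8, "W": 3.0, "Y": 2.6,
}
D_MAX = 5.5  # largest entry in Table 5.3 (C-P); used as an assumed scaling factor


def transition_probabilities(dist, sigma, d_max=D_MAX):
    """Assumed Gaussian weighting of scaled distances, normalized to sum to 1."""
    weights = {
        j: math.exp(-((d / d_max) ** 2) / (2.0 * sigma ** 2))
        for j, d in dist.items()
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def mutate(dist, sigma, rng):
    """Draw one replacement residue according to the transition probabilities."""
    probs = transition_probabilities(dist, sigma)
    residues = list(probs)
    return rng.choices(residues, weights=[probs[r] for r in residues], k=1)[0]


narrow = transition_probabilities(DIST_FROM_I, sigma=0.1)
broad = transition_probabilities(DIST_FROM_I, sigma=0.5)
rng = random.Random(0)
variant_residue = mutate(DIST_FROM_I, 0.1, rng)
print(round(narrow["I"], 3), round(broad["I"], 3), variant_residue)
```

In this sketch the parent residue itself takes part in the draw (distance zero), so a small σ mostly leaves the position unchanged and a larger σ spreads probability towards more distant residues.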
A great variety of substitution matrices have been suggested to measure distance between pairs of amino acid sequences, and it is not a trivial task to select the most appropriate for similarity searching or sequence design (for details, see the literature).103-105 Fig. 5.4 provides an example of the effect caused by an appropriate mutation matrix. In the example given the
Fig. 5.4. a) Idealized decrease of bioactivity with increasing entropy (diversity) of a compound library. b) Distribution of bioactive peptides (antibody binding) in ten different libraries consisting of 100 peptides each. All libraries were generated from the identical seed structure. Different library entropy was obtained by variation of σ. Peptide activity is plotted versus library entropy (Htotal). Peptide synthesis and testing was performed by Dr. A. Kramer.
mutation matrix given in Table 5.3 was used. Ten small peptide libraries were designed, synthesized and tested (A. Kramer, G. Schneider, J. Schneider-Mergener, P. Wrede; unpublished). All libraries were generated from the same seed peptide, but a different σ-value was used for each library. The total entropy (diversity) of a library was computed by Equation 5.8, where N = 11. Low σ-values led to low total entropy, and large σ-values produced many different variant peptides which is reflected by large total library entropy. In this particular case the experimentally determined distribution of active molecules reflects the theoretically expected distribution (Fig. 5.4). For practical applications of the PepMaker algorithm the width of the variant distribution should be set to σ < 0.2 to obtain focusing libraries containing a sufficient number of active peptides for subsequent SAR modeling. However, this value can only be a rough estimate. It must be kept in mind that an optimal setting depends on the shape of the underlying fitness landscape, the particular amino acid exchange matrix chosen, and other parameters.
Eq. 5.8
To demonstrate how to derive a seed peptide and apply the PepMaker system, we will now provide a prospective example taking HRV-14 infection as a model system for protein-protein interactions. A causative agent of the common cold is human rhinovirus 14 (HRV-14), which in its first step of infection interacts with intercellular adhesion molecule 1 (ICAM-1), a cell surface receptor. HRV consists of a viral (+)-strand RNA which is packed into an icosahedral protein shell (capsid). The capsid (300 Å diameter) is composed of 60 identical protein complexes (protomers), each consisting of four distinct viral proteins (VP1-VP4). A three-dimensional structure of HRV-14 was determined by X-ray crystallographic analysis to a resolution of 3.0 Å.106 Based on this structure and electron microscopic investigations, Rossmann and coworkers formulated the canyon hypothesis.107,108 The 25 Å deep canyon is a depression at
Table 5.3. Amino acid distance matrix based on physicochemical properties78,121
     A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
A    0
C  3.7    0
D  2.8  2.8    0
E  2.2  3.9  1.9    0
F  3.0  2.7  3.1  3.6    0
G  3.6  3.5  3.6  4.8  3.9    0
H  3.2  2.4  2.1  2.9  1.9  3.7    0
I  2.8  3.7  3.8  4.0  1.4  3.7  2.9    0
K  3.5  4.9  3.3  3.3  3.4  4.5  2.6  3.4    0
L  1.7  3.9  3.6  3.2  2.1  4.0  3.2  1.3  3.3    0
M  2.3  2.5  2.7  2.6  1.6  4.3  2.0  2.5  3.6  2.2    0
N  3.3  2.5  1.8  3.3  2.8  2.5  1.6  3.5  2.9  3.7  3.0    0
P  4.9  5.5  4.1  4.8  4.5  5.2  5.0  4.5  5.2  4.8  5.1  4.8    0
Q  2.5  3.4  1.9  2.2  2.3  3.7  1.4  2.8  1.7  2.6  2.3  1.9  4.4    0
R  4.1  4.6  3.5  3.9  3.2  4.4  2.3  3.4  1.1  3.6  3.7  2.7  5.3  1.9    0
S  2.6  3.1  2.4  3.4  2.7  1.7  2.4  2.7  2.9  2.9  3.1  1.5  4.4  2.1  2.9    0
T  2.6  3.3  2.7  3.5  2.2  2.2  2.2  2.0  2.5  2.4  2.9  1.8  4.3  1.8  2.5  0.8    0
V  2.5  3.7  3.8  4.0  1.9  3.2  3.0  0.8  3.3  1.3  2.6  3.3  4.9  2.7  3.3  2.4  1.7    0
W  4.2  2.9  3.2  3.9  1.7  4.9  2.1  3.0  4.0  3.6  2.3  3.2  4.5  2.8  3.6  3.7  3.3  3.5    0
Y  4.4  3.4  3.5  4.6  2.0  3.9  2.2  2.6  3.3  3.6  3.3  2.6  4.4  2.7  2.6  2.8  2.3  2.9  2.1    0
the surface of each protomer encircling each of the 12 vertices. The accessible surface of the canyon consists of parts of the capsid proteins VP1 and VP3. It is postulated that the canyon provides the recognition site for the virus-specific receptor ICAM-1. This hypothesis is supported by the effect of point mutations located in the viral canyon on virus adsorption on the host cell.109 Further support stems from experiments with antibodies directed against ICAM-1 and their corresponding anti-idiotypic antibodies:110-112 anti-idiotypic antibodies do not inhibit HRV-14 adsorption, as antibodies are too big to bind to the canyon. The receptor ICAM-1 is expressed at the surface of human endothelial and some epithelial cells, monocytes and lymphocyte sub-populations, and is involved in immunological and inflammatory processes.113 The natural ligand of ICAM-1 is the lymphocyte-function-associated antigen-1 (LFA-1), which functions in cell-to-cell adhesion.114 ICAM-1 consists of five immunoglobulin-like extracellular domains and belongs to the immunoglobulin superfamily. It is the specific receptor of human rhinoviruses of the major group, which comprises 90 of 120 HRV serotypes.115-117 Investigations of point mutations in ICAM-1 and binding studies with HRV-14 revealed that mainly domain 1 of ICAM-1 is involved in HRV-14 binding.118-120 The first hypothetical domain structure was based on sequence alignment, prediction of the secondary structure, and comparison with the tertiary structure of IgG.118 Some years ago, an X-ray structure of domains 1 and 2 at 2.1 Å resolution became available (Fig. 5.5).120 The sequence of an ICAM-1-derived seed peptide was selected on the basis of the X-ray model of domain 1 in combination with binding studies employing ICAM-1 mutants and HRV-14.118-120 This ICAM-1-derived peptide represents a continuous sequence of 9 amino acid residues from a loop in positions 43 to 51 of domain 1 (Fig. 5.5). This surface-exposed stretch of residues serves as a candidate for constructing an ICAM-1-derived peptide. The
Fig. 5.5. Backbone structure of domain 1 of human ICAM-1 (PDB code: 1iam).120 The location of the seed peptide is shown in dark color.
peptide might be used to block the HRV-14 canyon and thereby prevent recognition of ICAM-1. This could be tested by inhibition of virus adsorption at ICAM-1. The following nonapeptide, therefore, may serve as the seed peptide for the PepMaker approach: LLPGNNRKV. Table 5.4 shows two small peptide libraries generated by PepMaker. Library 1 was generated with σ = 0.05, library 2 with σ = 0.5. Depending on the biological activity of the seed peptide, i.e., activity in cell protection against HRV-14 infection (which has not been investigated until now), we assume that there are several peptides in the vicinity of the seed peptide revealing comparable or possibly even higher cell-protective activity. Especially in cases where a large amount of peptide is required for a biological test system, this approach might be helpful to select a few promising peptide candidates for activity testing. However, at least one peptide with the desired biological function must already be known.
Peptide de novo design was the first successful case of evolutionary design employing a neural network for the fitness function.102,121,122 In Fig. 5.6, fitness landscapes are shown which were generated by a neural network system that was trained on the prediction of eubacterial signal peptidase I cleavage sites in amino acid sequences.121 A set of known peptidase substrates served as the training data for feature extraction by ANN.121,123 It turned out that a chemical space spanned by the amino acid properties “hydrophobicity” and “volume” was suited for this particular application. The fitness functions for the three selected sequence positions shown are smooth and separate the set of natural amino acid residues into “low-fitness” and “high-fitness” candidates. In a series of in machina design experiments the alanine residue was selected as best-suited in position -1 (numbered relative to the signal peptidase cleavage site) (Fig. 5.6a), tryptophan in position -2 (Fig. 5.6b), and glycine in position -3 (Fig. 5.6c). Due to the continuous nature of the fitness landscapes, evolutionary search for idealized substrates was straightforward. The design run converged after only 52 optimization steps, changing the initial parent sequence FICLTMGYIC into the functional enzyme substrate FFFFGWYGWA*RE (the asterisk denotes the signal peptidase I cleavage site). Its biological activity as a substrate is comparable to wild-type sequences, which was shown by an in vivo protein secretion assay and subsequent mass-spectrometric sequence and fragment analysis.73 The X-ray structure of the catalytic domain of signal peptidase I from Escherichia coli was published after these ligand-based design experiments were completed.124,125 It is evident from the structure of the active site that the model peptide excellently complements the structural and electrostatic features within the enzyme.
Table 5.4. Peptide libraries generated with PepMaker. The seed peptide was LLPGNNRKV. d: distance to the seed peptide
       Library 1            Library 2
No.    Sequence     d       Sequence     d
1      LLPGNNRKV    0       LRPGQHFFA    6.86
2      LLPGNNRRI    1.34    WLITYVYVL    8.65
3      LLPGSNRKV    1.46    VPVKQHKKV    8.70
4      ILPGNNRKI    1.46    QWMTYYYKM    8.81
5      LLPGSNRKI    1.57    QMEGNGGVH    8.96
6      VVPGNNRKV    1.91    VMMDDENEH    9.33
7      LIPSNNRKV    2.17    DGGTQDVTL    9.36
8      LLPSSNRKI    2.38    SKDNMHEEQ    9.37
9      IVPSNNRKV    2.55    TPLYSDHRN    9.41
10     ILPGNSKRI    2.62    SWCCHFNRS    9.42
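The difference in diversity between the two libraries of Table 5.4 can be quantified with the position-wise Shannon entropy introduced earlier. The sketch below assumes that the total library entropy is the sum of the entropies of all sequence positions; this is one plausible reading of “total entropy” and is labeled here as an assumption rather than as the definition used for Equation 5.8. The function names are hypothetical.

```python
import math
from collections import Counter

# The two libraries of Table 5.4 (sigma = 0.05 and sigma = 0.5).
LIBRARY_1 = ["LLPGNNRKV", "LLPGNNRRI", "LLPGSNRKV", "ILPGNNRKI", "LLPGSNRKI",
             "VVPGNNRKV", "LIPSNNRKV", "LLPSSNRKI", "IVPSNNRKV", "ILPGNSKRI"]
LIBRARY_2 = ["LRPGQHFFA", "WLITYVYVL", "VPVKQHKKV", "QWMTYYYKM", "QMEGNGGVH",
             "VMMDDENEH", "DGGTQDVTL", "SKDNMHEEQ", "TPLYSDHRN", "SWCCHFNRS"]


def position_entropy(residues):
    """Shannon entropy (bits) of the residues found at one sequence position."""
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())


def total_entropy(library):
    """Assumed total entropy: sum of the position-wise entropies."""
    # zip(*library) yields one tuple of residues per sequence position.
    return sum(position_entropy(column) for column in zip(*library))


h1 = total_entropy(LIBRARY_1)
h2 = total_entropy(LIBRARY_2)
print(round(h1, 2), round(h2, 2))
```

The focused library yields the smaller value because several positions (for example the proline in position 3) are fully conserved and contribute no entropy.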
In a further application of evolutionary search guided by neural networks, antigen-mimicking peptides were developed “from first principles”.74 This design approach included a round of bench experiments for data generation and subsequent computer-assisted evolutionary optimization. The five-step procedure represents a special version of the design cycle shown in Fig. 1.4:
1. Identification of a single compound with some desired activity, e.g., by expert knowledge, database or random screening, combinatorial libraries, or phage display;
2. Generation of a focusing library taking the compound obtained in step 1 as a “seed structure”. A limited set of variants is generated approximately Gaussian-distributed in some physicochemical space around the “seed peptide”;
3. Synthesis and testing of the new variants for their bioactivity;
4. Training of artificial neural networks providing heuristic (Q)SAR based on the activities measured in step 3;
5. Computer-based evolutionary search for highly active compounds taking the network models as the fitness function.
A novel peptide was identified that fully prevented the positive chronotropic effect of anti-β1-adrenoceptor auto-antibodies from the serum of patients with idiopathic dilated cardiomyopathy (DCM).74 In an in vitro assay the designed active peptide showed more significant effects compared to the natural epitope. The idea was to test whether it is feasible to derive artificial epitope sequences that might be used as potential immuno-therapeutical agents following the design strategy described above. The model peptide GWFGGADWHA exhibits an activity comparable to its natural counterpart (ARRCYNDPKC) but has a significantly different residue sequence. Selection of such antibody-specific “artificial antigens” may be regarded as complementary to natural clonal B-cell selection leading to the production of specific antibodies. The peptide-antibody interaction investigated can be considered as a model of specific peptide-protein interactions. These results demonstrate that computer-based evolutionary searches can generate novel peptides with substantial biological activity.
TOPAS: Fragment-Based Design of Drug-Like Molecules
The software tool TOPAS (TOPology Assigning System) provides an example of a fragment-based molecular structure generator and optimizer based on an evolution strategy.59 Its
Fig. 5.6. Artificial fitness landscapes generated by a three-layered feed-forward neural network. The system was trained to predict the usefulness of individual amino acid residues in potential signal peptidase I substrates, based on the hydrophobicity and volume of the side-chains. (a) substrate position -1; (b) substrate position -2; (c) substrate position -3. The neural network response to varying hydrophobicity and volume values is plotted on the fitness axis. This was achieved by keeping all but the substrate position under investigation fixed.121
basic idea is similar to the GA-based software LEA conceived by Douguet and coworkers.58 In both programs SMILES representations of molecules are varied by genetic operators. The SMILES strings are assembled from a compilation of molecular building blocks. In the case of TOPAS, these were generated by retro-synthetic fragmentation of the Derwent World Drug Index (WDI version of 1997; as distributed by Daylight Chemical Information Systems Inc., Irvine, CA, USA); in LEA, the fragment libraries contain a diverse collection of selected building blocks. The idea behind the TOPAS fragment set is that re-assembly of such drug-derived building blocks by a limited set of chemical reactions might lead to chemically feasible novel structures, from both the medicinal chemistry and the synthesis planning perspective. To compile a database of drug-like building blocks for evolutionary de novo design by TOPAS, all 36,000 structures contained in the WDI which had an entry related to “mechanism” or “activity” were subjected to retro-synthetic fragmentation. The reactions are listed in Table 5.5. This approach is identical to the original RECAP procedure developed by Hann and coworkers.126 In total, 24,563 distinct building blocks were generated (“stock of structures”). Of course, there are many other ways to create fragment sets, and we found it useful to have several such collections available for different design tasks. For example, if the task was to design a potential GPCR modulating agent, then a fragment set generated from known GPCR modulators would be a reasonable choice.
TOPAS is based on a (1,λ) evolution strategy (see Chapter 1). Starting from an arbitrary point in search space, a set of λ variants is generated, satisfying a bell-shaped distribution centered on the chemical space co-ordinates of the parent structure. This means that most of the variants are very similar to their parent, and with increasing distance in chemical space the number of offspring decreases. In the original implementation of TOPAS, fitness was defined as pair-wise similarity between the template (reference structure) and the offspring. Two different concepts were realized to measure similarity: i) 2D structural similarity as defined by
Table 5.5. Simple retro-synthetic fragmentation schemes126 and numbers of fragments obtained from WDI. Arrows indicate cleavage sites. Columns: Bond Type; Fragmentation Type; No. of Fragments.
the Tanimoto index on Daylight’s 2D fingerprints (Eq. 2.9), and ii) 2D topological pharmacophore similarity (see Chapter 2). Tanimoto similarity varies between zero and one, where the value of 1 indicates structural identity. Topological pharmacophore similarity values vary between zero (indicating identical pharmacophore distribution in the two molecules) and positive values indicating increasing pharmacophore dissimilarity. Thus, optimal fitness values are 1 for the Tanimoto measure and 0 for the pharmacophore similarity measure. Additional penalty terms such as a modified “rule of 5” and a topological shape filter were added to the fitness function to avoid undesired structures (Note: one particular advantage of the TOPAS approach is that arbitrary quality and penalty functions can be included to compute fitness).
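As a reminder of how the first similarity measure behaves, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents a fingerprint as a Python set of “on” bit positions; the two example fingerprints are hypothetical and are not Daylight fingerprints of any real molecule.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A and B| / |A or B| on sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(bits_a & bits_b) / union


template = {1, 4, 7, 9, 15, 22}   # hypothetical on-bits of the reference structure
offspring = {1, 4, 9, 15, 30}     # hypothetical on-bits of one variant

similarity = tanimoto(template, offspring)  # 4 shared bits, 7 bits in the union
print(round(similarity, 3))
```

A value of 1 is obtained only when both bit sets are identical, which matches the statement that 1 is the optimal fitness for this measure.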
Variant structures are derived from the parent molecule, SP, in a four-step process, following the algorithm outlined in Chapter 1:
1. Exhaustive retro-synthetic fragmentation of SP;
2. Random selection of one of the generated fragments;
3. Substitution of this fragment by the one from the stock of building blocks having a pair-wise similarity index that is close to a Gaussian-distributed random number;
4. Virtual synthesis to assemble the novel chemical structure.
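Steps 2 and 3 of this scheme can be sketched in a few lines if a molecule is reduced to a list of fragment labels. This is a simplified illustration and not the TOPAS code: fragmentation (step 1) and virtual synthesis (step 4) are not modeled, the fragment names and similarity values in `STOCK_SIMILARITY` are hypothetical, and the way the Gaussian random number is turned into a target similarity is an assumption.

```python
import random

# Hypothetical stock: for each fragment, the similarity (1 = identical) of a few
# candidate replacements. Names and values are illustrative only.
STOCK_SIMILARITY = {
    "benzamidine": {"amidinopyridine": 0.80, "benzylamine": 0.60, "cyclohexyl": 0.20},
    "piperidine": {"pyrrolidine": 0.85, "morpholine": 0.70, "phenyl": 0.30},
    "naphthylsulfonyl": {"tosyl": 0.75, "mesyl": 0.40, "acetyl": 0.15},
}


def pick_replacement(candidates, sigma, rng):
    """Step 3: choose the stock fragment whose similarity is closest to a
    target drawn as 1 - |N(0, sigma)| (small sigma favors close analogs)."""
    target = max(0.0, 1.0 - abs(rng.gauss(0.0, sigma)))
    return min(candidates, key=lambda frag: abs(candidates[frag] - target))


def mutate(parent, sigma, rng):
    """Steps 2 and 3: replace one randomly selected fragment of the parent."""
    position = rng.randrange(len(parent))
    child = list(parent)
    child[position] = pick_replacement(STOCK_SIMILARITY[parent[position]], sigma, rng)
    return child


rng = random.Random(1)
parent = ["benzamidine", "naphthylsulfonyl", "piperidine"]
offspring = [mutate(parent, sigma=0.3, rng=rng) for _ in range(5)]
for child in offspring:
    print(child)
```

Each offspring differs from the parent in exactly one fragment; increasing `sigma` makes substitutions by less similar fragments more frequent.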
To demonstrate step 1, the thrombin inhibitor NAPAP was subjected to fragmentation by TOPAS. Reaction scheme 1 (amide bond cleavage) was applied twice, and reaction 11 (sulfonamide bond cleavage) occurred once, resulting in four fragments (Fig. 5.7). Depending on the similarity measure selected and the width of the variant distribution, offspring is generated, e.g., by subjecting the benzamidine residue to mutation (Fig. 5.8). The other three fragments remain unchanged. For fitness calculation, each of the new structures is compared to the template, and the most similar one will become the parent of the next generation. This mutation strategy offers the following advantages:
• An adaptive stochastic search is performed in chemical space;
• The type of molecules that are virtually generated is not restricted to a predefined combinatorial class (e.g., peptides, Ugi-reaction products);
• Novel structures are assembled from drug-derived building blocks using a set of “simple” chemical reactions;
• A large diversity of molecular fragments can be explored.
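The selection scheme described above, in which the best of λ offspring becomes the next parent, can be sketched on a toy problem. In this illustration bit sets stand in for molecules, the Tanimoto coefficient to a template serves as fitness, and each offspring carries its own step size σ that is inherited by the next generation if that offspring is selected. The bit-flip mutation, the log-normal step-size update, and all parameter values are assumptions chosen for the sketch; they are not the TOPAS operators.

```python
import math
import random


def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def make_offspring(parent, sigma, n_bits, rng):
    """Flip a number of bits that tends to grow with sigma (at least one)."""
    n_flips = min(n_bits, max(1, round(abs(rng.gauss(0.0, sigma)))))
    child = set(parent)
    for bit in rng.sample(range(n_bits), n_flips):
        child ^= {bit}  # toggle this bit
    return child


def one_comma_lambda(template, n_bits=64, lam=100, generations=30, seed=0):
    rng = random.Random(seed)
    parent = {b for b in range(n_bits) if rng.random() < 0.5}  # random start
    sigma = 8.0
    history = []
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            # Each offspring gets its own, log-normally varied step size.
            child_sigma = sigma * math.exp(rng.gauss(0.0, 0.3))
            child = make_offspring(parent, child_sigma, n_bits, rng)
            offspring.append((tanimoto(child, template), child_sigma, child))
        # Comma selection: the old parent is discarded, the best offspring
        # (together with its step size) becomes the new parent.
        fitness, sigma, parent = max(offspring, key=lambda t: t[0])
        history.append((fitness, sigma))
    return parent, history


template = set(range(0, 64, 2))  # hypothetical template fingerprint
best, history = one_comma_lambda(template)
for generation, (fitness, sigma) in enumerate(history, start=1):
    print(generation, round(fitness, 2), round(sigma, 2))
```

Printing the history shows how fitness and step size develop over the generations of such a run; because selection is restricted to the offspring, fitness is not guaranteed to increase monotonically.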
An example of a TOPAS design experiment aiming at the generation of a NAPAP-like structure is shown in Figure 5.9. The Tanimoto index was used as the fitness measure. Initially, a random structure was generated from the stock of building blocks (“parent” of the first generation). Its Tanimoto similarity to NAPAP was 0.31, reflecting a great dissimilarity, as expected. In each of the following generations 100 variants were systematically generated by TOPAS, and the best of each generation was selected as the parent for the subsequent generation. Following this scheme, novel molecules were assembled which exhibited a significantly increased fitness (Fig. 5.9). After only 12 optimization cycles the process converged at a high fitness level (approximately 0.86), and the standard deviation, σ, of the variant distributions around the parent structures decreased. The course of σ indicates that comparably broad distributions were generated at first (large diversity); after some generations, however, a peak in the fitness landscape was climbed (restricted diversity). The parent structures of each generation are shown in Figure 5.10. The resulting final design shares a significant set of substructure elements with the NAPAP template. Key features for thrombin binding evolved: the benzamidine group forming hydrogen bonds with Asp189 at the bottom of the thrombin S1 pocket, a sulfonamide interacting with the backbone carbonyl of Gly216, and the lipophilic para-tolyl and piperidine rings filling a large lipophilic pocket of the thrombin active site cleft. Automated docking by means of FlexX essentially reproduced the NAPAP binding mode.59 (Note: molecular docking and subsequent scoring of the docked solutions was not used in the selection of the final pool of solutions).
This de novo design experiment demonstrated that the algorithm can be used for a fast guided search in a very large chemical space, ending up with rational proposals for novel molecular structures that are similar to a given template.

The previous design example was based on similarity searching alone. In this last section we illustrate a possible interplay of similarity searching, de novo design, and molecular modeling. Here the aim was to design a novel potassium channel (Kv1.5) blocking agent, a so-called “fast follower”, taking a known inhibitor as a starting point. Inhibitors of voltage-dependent potassium channels induce a decrease in potassium ion movement across the plasma membrane. These ion channels have multiple biological functions. In cardiac cells a decreased potassium flux leads to the prolongation of the action
Fig. 5.7. Fragmentation of the thrombin inhibitor NAPAP. Application of the fragmentation scheme given in Table 5.5 leads to four fragments.
potential. Increasing myocardial refractoriness by prolonging the action potential can be useful for the treatment of cardiac arrhythmia. Blocking potassium channels and depolarizing the resting membrane potential has been shown to regulate a variety of biological processes, such as T-cell activation under immune-reactive conditions. Inherited disorders of voltage-gated ion channels are a recently recognized etiology of epilepsy in the developing and mature central nervous system,127 and quite a number of neurodegenerative diseases are known to be associated with or caused by potassium channelopathies.128

The root of this fast-follower design experiment was the structure (a) shown in Fig. 5.11, which was identified as a potent Kv1.5 blocking agent by Castle and coworkers at Icagen Inc.129 In our electrophysiological studies this compound had an IC50 of 0.1 µM.60 The first hurdle was to identify a novel molecular scaffold that may serve as a lead structure candidate (step 1 in Fig. 5.11). TOPAS was used for this purpose. New structures were generated through assembly or modification of the TOPAS building blocks. A naphthylsulfonamide motif appeared in many of the designs. It was expected that these molecules would be chemically feasible and have some drug-like properties, because the fragments were originally obtained from known bioactive molecules. In fact, the structure (b) shown in Fig. 5.11 has Kv1.5 blocking potential (IC50 = 7 µM). To further optimize structure (b), a pharmacophore matching routine was applied to align it to the original template (a) (step 2 in Fig. 5.11). The modeling program MOLOC was used for this purpose.130 In this study it was found that removal of a methoxy group present in structure (b) might be beneficial to activity. This hypothesis was supported by electrophysiological studies yielding an IC50 of 1 µM for structure (c). The final optimization process (step 3 in Fig. 5.11) was a CATS similarity search in a virtual combinatorial library.
Sulfonyl chlorides available from the Roche corporate compound collection were virtually assembled to the free amino functionality of the molecular core of structure (c). A topological pharmacophore similarity to the original template (a), as implemented in CATS, was determined for each member of the combinatorial library. This procedure led to molecule (d), yielding an IC50 below 1 µM, which is of the same order of magnitude as that of the template (a). The new structure would now be ready to enter a medicinal chemistry project. This example shows that evolutionary de novo design algorithms are able to generate novel bioactive classes of compounds. The cyclic interplay between computational design and human reasoning (hypothesis generation), chemical synthesis (structure generation), and biological testing (quality assessment) represents a prototype of Adaptive Drug Design. This and similar strategies will surely provide a basis in the future for drug discovery and lead generation.
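The idea of ranking library members by a topological pharmacophore similarity to a template can be sketched as follows. This is a simplified stand-in in the spirit of such descriptors and not the CATS implementation: the molecule encoding (a list of per-atom pharmacophore labels plus a bond list), the labels "D", "A" and "L", the distance cutoff, and the Euclidean comparison are all assumptions made here for illustration, and the virtual assembly of the library itself is not modelled.

```python
from collections import Counter, deque
from itertools import combinations
import math

def topological_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) via breadth-first search."""
    neighbours = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nxt in neighbours[atom]:
                if nxt not in seen:
                    seen[nxt] = seen[atom] + 1
                    queue.append(nxt)
        dist[start] = seen
    return dist

def pair_descriptor(types, bonds, max_dist=9):
    """Count pharmacophore-type pairs for each topological distance.

    types[i] is the (hypothetical) label of atom i, or None if untyped.
    """
    dist = topological_distances(len(types), bonds)
    counts = Counter()
    for i, j in combinations(range(len(types)), 2):
        if types[i] is None or types[j] is None:
            continue
        d = dist[i].get(j)  # None if the atoms are not connected
        if d is not None and d <= max_dist:
            counts[(tuple(sorted((types[i], types[j]))), d)] += 1
    return counts

def descriptor_distance(c1, c2):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((c1[k] - c2[k]) ** 2 for k in set(c1) | set(c2)))

def rank_library(template, library):
    """Sort library members (name -> (types, bonds)) by distance to template."""
    ref = pair_descriptor(*template)
    return sorted((descriptor_distance(ref, pair_descriptor(*mol)), name)
                  for name, mol in library.items())

if __name__ == "__main__":
    chain5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
    template = (["D", None, None, "A", "L"], chain5)
    library = {
        "same": (["D", None, None, "A", "L"], chain5),
        "short": (["D", None, "A"], [(0, 1), (1, 2)]),
    }
    for score, name in rank_library(template, library):
        print(name, score)
```

Because only label pairs and bond-count distances enter the descriptor, two molecules with different scaffolds but a similar arrangement of pharmacophore points receive similar vectors, which is the property that makes this kind of descriptor useful for finding structurally novel analogues.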
Fig. 5.8. Mutation of a molecular building block in TOPAS. In this case benzamidine is the parent structure, and with decreasing probability this fragment will be replaced by other amines (see Fig. 5.3). For the given width of the probability function, the transition from benzamidine to the guanidinium group is very unlikely. d is the topological pharmacophore distance (CATS) to the parent fragment.
Concluding Remarks
Evolutionary algorithms have undoubtedly proven their usefulness for molecular de novo design. Their basic idea is to perform an adaptive stochastic search based on a guided trial-and-error procedure. It must be emphasized that they do not represent the optimal solution to drug design, but may serve as a general optimization strategy, thereby complementing more specific approaches. Their great appeal is the intuitively comprehensible basic algorithm, which integrates well with experimental drug design cycles. Furthermore, they can easily be implemented in molecular design software. EAs excel in situations where the fitness function and the search space are both multidimensional and multimodal. On the other hand, compared to some other design methods EAs can be slow, may suffer from premature convergence, and may lead to sub-optimal solutions. The authors are convinced that with the continuously increasing speed of computers and in combination with specifically tailored chemistry, in particular advanced parallel medicinal chemistry, EAs will have a growing impact on the future drug discovery process and enrich the medicinal chemists’ arsenal of structures with novel molecules. The speed of chemical synthesis can be a rate-limiting step in the optimization cycle if conventional routes of de novo synthesis are followed. Parallel medicinal chemistry concepts might provide a solution to this problem, as they represent a smart integration of computational design and combinatorial synthesis. In the book Hidden Order: How Adaptation Builds Complexity, John H. Holland presented a list of seven basic properties and mechanisms that are common to all complex adaptive
Fig. 5.9. Course of the fitness (Tanimoto similarity to NAPAP) and of the width σ of the offspring distribution (“diversity”) during a TOPAS design experiment (cf. Fig. 5.10).
systems:131 aggregation, non-linearity, flows, diversity, tagging, internal models, and building blocks. Some of these basics have been discussed in the present volume, and simplified computer models have been presented that may serve as a starting point for the implementation of more advanced adaptive systems for drug design. The task of formulating a theory for these systems is difficult, especially because the behaviour of a whole complex adaptive system is more than a simple sum of the behaviours of its parts; complex adaptive systems abound in non-linearities. We must face the fact that the drug design process is inherently non-linear, and the different ways of looking at it lead to different emphases and different models. John H. Holland wrote:131 “Adaptive agents come in startling variety, and their strategies are correspondingly diverse, so we need a language powerful enough to define the feasible strategies for these agents. [..] And we must provide well-defined evolutionary procedures that enable agents to acquire learned anticipations and innovations.” It is evident that current drug design models are far from perfect; it is also evident that it will be extremely difficult—if not impossible—to formulate a single theory that directly guides the experiment. Selection guided by taste and experience is crucial, and an adaptive drug design process involves a perpetual interplay between theory and experiment.
References
1. Willett P. Genetic algorithms in molecular recognition and design. Trends Biotechnol 1995; 13:516-521.
2. Böhm HJ. The computer program LUDI: a new method for the de novo design of enzyme inhibitors. J Comput Aided Mol Des 1992; 6:61-78.
3. Roe DC, Kuntz ID. BUILDER v.2: Improving the chemistry of a de novo design strategy. J Comput Aided Mol Des 1995; 9:269-282.
4. Lauri G, Bartlett PA. CAVEAT: A program to facilitate the design of organic molecules. J Comput Aided Mol Des 1994; 8:51-66.
Fig. 5.10. Evolution of a potential thrombin inhibitor by TOPAS. Twelve subsequent parent structures of an evolutionary design experiment are shown (Generation 1 to 12). NAPAP served as the template structure, and the Tanimoto index was used as fitness measure.

5. Böhm HJ. Prediction of binding constants of protein ligands: a fast method for the prioritization of hits obtained from de novo design or 3D database search programs. J Comput Aided Mol Des 1998; 12:309-323.
6. Böhm HJ, Banner DW, Weber L. Combinatorial docking and combinatorial chemistry: Design of potent non-peptide thrombin inhibitors. J Comput Aided Mol Des 1999; 13:51-56.
7. Stahl M, Böhm HJ. Development of filter functions for protein-ligand docking. J Mol Graph Model 1998; 16:121-132.
8. Böhm HJ, Stahl M. Rapid empirical scoring functions in virtual screening applications. Med Chem Res 1999:445-462.
9. Goodsell DS, Olson AJ. Automated docking of substrates to proteins by simulated annealing. Proteins 1990; 8:195-202.
10. Miranker A, Karplus M. An automated method for dynamic ligand design. Proteins 1995; 23:472-490.
11. Meng EC, Shoichet BK, Kuntz ID. Automated docking with grid-based energy evaluation. J Comput Chem 1992; 13:505-524.
12. Holloway MK, Wai JM, Halgren TA et al. A priori prediction of activity for HIV-1 protease inhibitors employing energy minimization in the active site. J Med Chem 1995; 38:305-317.
13. Luty BA, Wassermann ZR, Stouten PFW et al. Molecular mechanics/grid method for the evaluation of ligand-receptor interactions. J Comput Chem 1995; 16:454-464.
14. Grootenhuis PDJ, van Galen PJM. Correlation of binding affinities with non-bonded interaction energies of thrombin-inhibitor complexes. Acta Cryst 1995; D51:560-566.
15. Viswanadhan VN, Reddy MR, Wlodawer A et al. An approach to rapid estimation of relative binding affinities of enzyme inhibitors: Application to peptidomimetic inhibitors of the human immunodeficiency virus type 1 protease. J Med Chem 1996; 39:705-712.
Fig. 5.11. Adaptive molecular design. The task was to find a “fast follower” to the known potassium channel inhibitor (a). Structures (b), (c), and (d) were developed using virtual screening techniques. The circled numbers specify different methods: 1, evolutionary de novo design by TOPAS; 2, pharmacophore modelling using MOLOC; 3, CATS similarity searching in a virtual combinatorial compound library.

16. Vieth M, Hirst JD, Kolinski A et al. Assessing energy functions for flexible docking. J Comput Chem 1998; 19:1612-1622.
17. Shoichet BK, Leach AR, Kuntz ID. Ligand solvation in molecular docking. Proteins 1999; 34:4-16.
18. Honig B, Nicholls A. Classical electrostatics in biology and chemistry. Science 1995; 268:1144-1149.
19. Zhang T, Koshland DE Jr. Computational method for relative binding energies of enzyme-substrate complexes. Protein Sci 1996; 5:348-356.
20. Schapira M, Totrov M, Abagyan R. Prediction of the binding energy for small molecules, peptides and proteins. J Mol Recognit 1999; 12:177-190.
21. Majeux N, Scarsi M, Apostolakis J et al. Exhaustive docking of molecular fragments with electrostatic solvation. Proteins 1999; 37:88-105.
22. Verkhivker G, Appelt K, Freer ST et al. Empirical free energy calculations of ligand-protein crystallographic complexes. I. Knowledge-based ligand-protein interaction potentials applied to the prediction of human immunodeficiency virus 1 protease binding affinity. Protein Eng 1995; 8:677-691.
23. Wallqvist A, Jernigan RL, Covell DG. A preference-based free-energy parameterization of enzyme-inhibitor binding. Applications to HIV-1-protease inhibitor design. Protein Sci 1995; 4:1881-1903.
24. DeWitte RS, Shakhnovich EI. SMoG: De novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence. J Am Chem Soc 1996; 118:11733-11744.
25. Mitchell JBO, Laskowski RA, Alex A et al. BLEEP: Potential of mean force describing protein-ligand interactions: I. Generating potential. J Comput Chem 1999; 20:1165-1177.
26. Muegge I, Martin YC. A general and fast scoring function for protein-ligand interactions: a simplified potential approach. J Med Chem 1999; 42:791-804.
27. Muegge I, Martin YC, Hajduk PJ et al. Evaluation of PMF scoring in docking weak ligands to the FK506 binding protein. J Med Chem 1999; 42:2498-2503.
28. Kollman PA. Advances and continuing challenges in achieving realistic and predictive simulations of the properties of organic and biological molecules. Acc Chem Res 1996; 29:461-469.
29. Aqvist J, Medina C, Samuelsson JE. A new method for predicting binding affinity in computer-aided drug design. Protein Eng 1994; 7:385-391.
30. Hansson T, Marelius J, Aqvist J. Ligand binding affinity prediction by linear interaction energy methods. J Comput Aided Mol Des 1998; 12:27-35.
31. Rarey M, Kramer B, Lengauer T. Docking of hydrophobic ligands with interaction-based matching algorithms. Bioinformatics 1999; 15:243-250.
32. Gohlke H, Hendlich M, Klebe G. Knowledge-based scoring function to predict protein-ligand interactions. J Mol Biol 2000; 295:337-356.
33. Stahl M. Structure-based library design. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:229-264.
34. Good A, Mason JS, Pickett SD. Pharmacophore pattern application in virtual screening, library design and QSAR. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim, New York: Wiley-VCH, 2000:131-159.
35. Müller K, ed. De Novo Design. Leiden: Escom, 1995.
36. Böhm HJ. Computational tools for structure-based ligand design. Prog Biophys Mol Biol 1996; 66:197-210.
37. Kirkpatrick DL, Watson S, Ulhaq S. Structure-based drug design: Combinatorial chemistry and molecular modeling. Comb Chem High Throughput Screen 1999; 2:211-221.
38. Gane PJ, Dean PM. Recent advances in structure-based rational drug design. Curr Opin Struct Biol 2000; 10:401-404.
39. Klebe G. Recent developments in structure-based drug design. J Mol Med 2000; 78:269-281.
40. Böhm HJ, Stahl M. Structure-based library design: molecular modelling merges with combinatorial chemistry. Curr Opin Chem Biol 2000; 4:283-286.
41. Devillers J, ed. Genetic Algorithms in Molecular Modeling. New York: Academic Press, 1996.
42. Clark DE, ed. Evolutionary Algorithms in Molecular Design. Weinheim: Wiley-VCH, 2000.
43. De Julian-Ortiz JV. Virtual Darwinian drug design: QSAR inverse problem, virtual combinatorial chemistry, and computational screening. Comb Chem High Throughput Screen 2001; 4:295-310.
44. Gillet VJ. De novo molecular design. In: Clark DE, ed. Evolutionary Algorithms in Molecular Design. Weinheim: Wiley-VCH, 2000:49-69.
45. Clark DE, Frenkel D, Levy SA et al. PRO_LIGAND: An approach to de novo molecular design. 1. Application to the design of organic molecules. J Comput Aided Mol Des 1995; 9:13-32.
46. Frenkel D, Clark DE, Li J et al. PRO_LIGAND: An approach to de novo molecular design. 4. Application to the design of peptides. J Comput Aided Mol Des 1995; 9:213-225.
47. LeapFrog is available from TRIPOS Inc., 1699 South Hanley Road, Suite 303, St. Louis, MO 63144, USA.
48. Glen RC, Payne AW. A genetic algorithm for the automated generation of molecules within constraints. J Comput Aided Mol Des 1995; 9:181-202.
49. Blaney JM, Dixon JS, Weininger D. Molecular Graphics Society Meeting on Binding Sites: Characterising and Satisfying Steric and Chemical Restraints. York, UK, March 1993; Weininger D, WO095/01606.
50. Venkatasubramanian V, Chan K, Caruthers JM. Computer-aided molecular design using genetic algorithms. Computers Chem Eng 1995; 18:833-844.
51. Venkatasubramanian V, Sundaram A, Chan K et al. Computer-aided molecular design using neural networks and genetic algorithms. In: Devillers J, ed. Genetic Algorithms in Molecular Modeling. New York: Academic Press, 1996:271-302.
52. Venkatasubramanian V, Chan K, Caruthers JM. Evolutionary design of molecules with desired properties using a genetic algorithm. J Chem Inf Comput Sci 1998; 38:1177-1191.
53. Nachbar RB. Molecular evolution: A hierarchical representation for chemical topology and its automated manipulation. In: Proceedings of the Third Annual Genetic Programming Conference. Madison: University of Wisconsin, 22-25 July 1998:246-253.
54. Globus A, Lawton J, Wipke T. Automatic molecular design using evolutionary techniques. Nanotechnology 1999; 10:290-299.
55. Devillers J, Putavy C. Designing biodegradable molecules from the combined use of a backpropagation neural network and a genetic algorithm. In: Devillers J, ed. Genetic Algorithms in Molecular Modeling. New York: Academic Press, 1996:303-314.
56. Weber L, Wallbaum S, Broger C et al. Optimization of the biological activity of combinatorial compound libraries by a genetic algorithm. Angew Chemie Int Ed Engl 1995; 34:2280-2282.
57. Illgen K, Enderle T, Broger C et al. Simulated molecular evolution in a full combinatorial library. Chem Biol 2000; 7:433-441.
58. Douguet D, Thoreau E, Grassy G. A genetic algorithm for the automated generation of small organic molecules: drug design using an evolutionary algorithm. J Comput Aided Mol Des 2000; 14:449-466.
59. Schneider G, Lee ML, Stahl M et al. De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J Comput Aided Mol Des 2000; 14:487-494.
60. Schneider G, Clement-Chomienne O, Hilfiger L et al. Virtual screening for bioactive molecules by evolutionary de novo design. Angew Chem Int Ed Engl 2000; 39:4130-4133.
61. Weininger DJ. SMILES: A chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988; 28:31-36.
62. Gasteiger J, Rudolph C, Sadowski J. Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp Method 1990; 3:537-547.
63. Sadowski J, Gasteiger J. From atoms and bonds to three-dimensional atomic coordinates: automatic model builders. Chem Reviews 1993; 93:2567-2581.
64. Rusinko A III, Skell JM, Balducci R et al. Concord, a program for the rapid generation of high quality approximate 3-dimensional molecular structures. The University of Texas at Austin and Tripos Associates, St. Louis, MO USA, 1988.
65. Pearlman RS. Rapid generation of high quality approximate 3D molecular structures. Chem Des Aut News 1987; 2:1-6.
66. Rarey M, Kramer B, Lengauer T et al. A fast flexible docking method using an incremental construction algorithm. J Mol Biol 1996; 261:470-489.
67. Kramer B, Rarey M, Lengauer T. Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins 1999; 37:228-241.
68. Shoichet BK, Kuntz ID. Protein docking and complementarity. J Mol Biol 1991; 221:327-346.
69. Gschwend DA, Good AC, Kuntz ID. Molecular docking towards drug discovery. J Mol Recognit 1996; 9:175-186.
70. Eldridge MD, Murray CW, Auton TR et al. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J Comput Aided Mol Des 1997; 11:425-445.
71. Walters WP, Stahl MT, Murcko MA. Virtual screening: An overview. Drug Discovery Today 1998; 3:160-178.
72. Schneider G, Grunert HP, Schuchhardt J et al. A peptide selection scheme for systematic evolutionary design and construction of synthetic peptide libraries. Min Invas Med 1995; 6:106-115.
73. Wrede P, Landt O, Klages S et al. Peptide design aided by neural networks: biological activity of artificial signal peptidase I cleavage sites. Biochemistry 1998; 37:3588-3593.
74. Schneider G, Schrödl W, Wallukat G et al. Peptide design by artificial neural networks and computer-based evolutionary search. Proc Natl Acad Sci USA 1998; 95:12179-12184.
75. Gausepohl H, Boulin C, Kraft M et al. Automated multiple peptide synthesis. Pept Res 1992; 5:315-320.
76. Kramer A, Schneider-Mergener J. Synthesis and screening of peptide libraries on cellulose membrane supports. Methods Mol Biol 1998; 87:25-39.
77. Kramer A, Keitel T, Winkler K et al. Molecular basis for the binding promiscuity of an anti-p24 (HIV-1) monoclonal antibody. Cell 1997; 91:799-809.
78. Schneider G, Wrede P. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys J 1994; 66:335-344.
79. Huang P, Kim S, Loew G. Development of a common 3D pharmacophore for delta-opioid recognition from peptides and non-peptides using a novel computer program. J Comput Aided Mol Des 1997; 11:21-28.
80. Mee RP, Auton TR, Morgan PJ. Design of active analogues of a 15-residue peptide using D-optimal design, QSAR and a combinatorial search algorithm. J Pept Res 1997; 49:89-102.
81. Dayhoff MO, Eck RV. A model of evolutionary change in proteins. In: Dayhoff MO, ed. Atlas of Protein Sequence and Structure. Washington DC: National Biomed Res Found, 1968:345.
82. Eigen M, Winkler-Oswatitsch R, Dress A. Statistical geometry in sequence space. A method of quantitative comparative sequence analysis. Proc Natl Acad Sci USA 1988; 85:5913-5917.
83. Eigen M, McCaskill JS, Schuster P. The molecular quasi-species. Adv Chem Phys 1989; 75:149-263.
84. Grantham R. Amino acid difference formula to help explain protein evolution. Science 1974; 185:862-864.
85. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge: University Press, 1983.
86. Miyata T, Miyazawa S, Yasunaga T. Two types of amino acid substitutions in protein evolution. J Mol Evol 1979; 12:219-236.
87. Rao JKM. New scoring matrix for amino acid residue exchange based on residue characteristic physical parameters. Int J Peptide Protein Res 1987; 29:276-279.
88. Schuster P, Stadler PF. Landscapes: Complex optimization problems and biopolymer structures. Comput Chem 1994; 18:295-324.
89. Fontana W, Stadler PF, Bornberg-Bauer EG et al. RNA folding and combinatory landscapes. Phys Rev 1993; E 47:2083-2099.
90. Kauffman SA. The Origin of Order: Self-Organization and Selection in Evolution. New York/Oxford: Oxford University Press, 1993.
91. Shannon CE. A mathematical theory of communication. Bell System Tech J 1948; 27:379, 623.
92. Schneider TD. Measuring molecular information. J Theor Biol 1999; 201:87-92.
93. Schneider G, Gutknecht EM, Kansy M et al. Diversity assessment tools: Proposed strategy for implementation and impact on the lead discovery process. Roche Progress Report 1998; unpublished.
94. Godden JW, Bajorath J. Shannon entropy: A novel concept in molecular descriptor and diversity analysis. J Mol Graph Model 2000; 18:73-76.
95. Zuegge J, Ebeling M, Schneider G. H-BloX: Visualizing alignment block entropies. J Mol Graph Model 2001; 19:303-305.
96. Ash RB. Information Theory. Mineola: Dover, 1965, reprinted 1990.
97. Ebeling W, Engel A, Feistel R. Physik der Evolutionsprozesse. Berlin: Akademie-Verlag, 1990.
98. Baldi P, Brunak S. Bioinformatics: The Machine Learning Approach. Cambridge: MIT Press, 1998.
99. Schneider TD, Stephens RM. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res 1990; 18:6097-6100.
100. Durbin R, Eddy S, Krogh A et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press, 1998.
101. Schneider G, Wrede P. Artificial neural networks for computer-aided molecular design. Prog Biophys Mol Biol 1998; 70:175-222.
102. Schneider G, Schuchhardt J, Wrede P. Peptide design in machina: development of artificial mitochondrial protein precursor cleavage-sites by simulated molecular evolution. Biophys J 1995; 68:434-447.
103. Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 1991; 219:555-565.
104. Henikoff S, Henikoff JG. Performance evaluation of amino acid substitution matrices. Proteins 1993; 17:49-61.
105. Trinquier G, Sanejouand YH. Which effective property of amino acids is best preserved by the genetic code? Protein Engineering 1998; 11:153-169.
106. Rossmann MG, Vriend G et al. Structure of a human common cold virus and functional relationship to other picornaviruses. Nature 1985; 317:145.
107. Rossmann MG, Palmenberg AC. Conservation of the putative receptor attachment site in picornaviruses. Virology 1988; 164:373-382.
108. Olson NH, Kolatkar PR, Oliveira MA et al. Structure of a human rhinovirus complexed with its receptor molecule. Proc Natl Acad Sci USA 1993; 90:507-511.
109. Colonno RJ, Condra JH, Mizutani S et al. Evidence for the direct involvement of the rhinovirus canyon in receptor binding. Proc Natl Acad Sci USA 1988; 85:5449-5453.
110. Rossmann MG, Rueckert RR. What does the molecular structure of viruses tell us about viral functions? Microbiol Sci 1987; 4:206-214.
111. McClintock PR, Prabhakar BS, Notkins AL. Anti-idiotypic antibodies to monoclonal antibodies that neutralize Coxsackie Virus B4 do not recognize viral receptors. Virology 1986; 150:352-360.
112. Cromwell RL. Cellular receptors in virus infections. Am Soc Microbiol News 1987; 53:422.
113. Dustin ML, Rothlein R, Bhan AK et al. Induction by IL-1 and interferon, tissue distribution, biochemistry, and function of a natural adherence molecule (ICAM-1). J Immunol 1986; 137:245-254.
114. Kishimoto TK, Larson RS, Corbi AL et al. The leukocyte integrins. Adv Immunol 1989; 46:146-182.
115. Colonno RJ, Callahan PL, Long WJ. Isolation of a monoclonal antibody that blocks attachment of the major group of human rhinoviruses. J Virol 1986; 57:7-12.
116. Staunton DE, Merluzzi VJ, Rothlein R et al. A cell adhesion molecule, ICAM-1, is the major surface receptor for rhinoviruses. Cell 1989; 56:849-853.
117. Greve JM, Davis G, Meyer AM et al. The major human rhinovirus receptor is ICAM-1. Cell 1989; 56:839-847.
118. Staunton DE, Dustin ML, Erickson HP et al. The arrangement of the immunoglobulin-like domains of ICAM-1 and the binding sites for LFA-1 and rhinovirus. Cell 1990; 61:243-254.
119. Register RB, Uncapher CR, Naylor AM et al. Human-murine chimeras of ICAM-1 identify amino acid residues critical for rhinovirus and antibody binding. J Virol 1991; 65:6589-6596.
120. Bella J, Kolatkar PR, Marlor CW et al. The structure of the two amino-terminal domains of human ICAM-1 suggests how it functions as a rhinovirus receptor and as an LFA-1 integrin ligand. Proc Natl Acad Sci USA 1998; 95:4140-4145.
121. Schneider G, Wrede P. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys J 1994; 66:335-344.
122. Schneider G, Schuchhardt J, Wrede P. Development of simple fitness landscapes for peptides by artificial neural filter systems. Biol Cybern 1995; 73:245-254.
123. Schneider G, Wrede P. Development of artificial neural filters for pattern recognition in protein sequences. J Mol Evol 1993; 36:586-595.
124. Paetzel M, Dalbey RE, Strynadka NC. Crystal structure of a bacterial signal peptidase in complex with a beta-lactam inhibitor. Nature 1998; 396:186-190.
125. Carlos JL, Paetzel M, Brubaker G et al. The role of the membrane-spanning domain of type I signal peptidases in substrate cleavage site selection. J Biol Chem 2000; 275:38813-38822.
126. Lewell XQ, Judd DB, Watson SP et al. RECAP: Retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 1998; 38:511-522.
127. Steinlein OK, Noebels JL. Ion channels and epilepsy in man and mouse. Curr Opin Genet Dev 2000; 10:286-291.
128. Grillner S. Bridging the gap: From ion channels to networks and behaviour. Curr Opin Neurobiol 1999; 9:663-669.
129. Castle NA, Hollinshead SP, Hughes PF et al. Icagen Inc, Eli Lilly and Company; Int Pat Appl 1998, WO98/04521.
130. Gerber PR, Müller K. MAB, a generally applicable molecular force field for structure modelling in medicinal chemistry. J Comput Aided Mol Des 1995; 9:251-268.
131. Holland JH. Hidden Order: How Adaptation Builds Complexity. Reading: Perseus Books, 1995.
132. Pearlman DA, Murcko MA. CONCERTS: Dynamic connection of fragments as an approach to de novo ligand design. J Med Chem 1996; 39:1651-1663.
133. Rotstein SH, Murcko MA. GenStar: A method for de novo drug design. J Comput Aided Mol Des 1993; 7:23-43.
134. Rotstein SH, Murcko MA. GroupBuild: A fragment-based method for de novo drug design. J Med Chem 1993; 36:1700-1710.
135. Bohacek RS, McMartin C. Multiple highly diverse structures complementary to enzyme binding sites: Results of extensive application of a de novo design method incorporating combinatorial growth. J Am Chem Soc 1994; 116:5560-5571.
136. Eisen MB, Wiley DC, Karplus M et al. HOOK: A program for finding novel molecular architectures that satisfy the chemical and steric requirements of a macromolecule binding site. Proteins 1994; 19:199-221.
137. Nishibata Y, Itai A. Automatic creation of drug candidate structures based on receptor structure. Starting point for artificial lead generation. Tetrahedron 1991; 41:8985-8990.
138. Gehlhaar DK, Moerder KE, Zichi D et al. De novo design of enzyme inhibitors by Monte Carlo ligand generation. J Med Chem 1995; 38:466-472.
139. Miranker A, Karplus M. Functionality maps of binding sites: A multiple copy simultaneous search method. Proteins 1991; 11:29-34.
140. Clark DE, Firth MA, Murray CW. MOLMAKER: De novo generation of 3D databases for use in drug design. J Chem Inf Comput Sci 1996; 36:137-145.
141. Tschinke V, Cohen NC. The NEWLEAD program: A new method for the design of candidate structures from pharmacophoric hypotheses. J Med Chem 1993; 36:3863-3870.
142. Schneider G, Todt T, Wrede P. De novo design of peptides and proteins: Machine-generated sequences by the PROSA program. Comput Appl Biosci 1994; 10:75-77.
143. Murray CW, Clark DE, Auton TR et al. PRO_SELECT: Combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology. J Comput Aided Mol Des 1997; 11:193-207.
144. DeWitte RS, Ishchenko AV, Shakhnovich EI. SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 2. Case studies in molecular design. J Am Chem Soc 1997; 119:4608-4617.
145. Ho CM, Marshall GR. FOUNDATION: a program to retrieve all possible structures containing a user-defined minimum number of matching query elements from three-dimensional databases. J Comput Aided Mol Des 1993; 7:623-647.
146. Gillet V, Johnson AP, Mata P et al. SPROUT: A program for structure generation. J Comput Aided Mol Des 1993; 7:127-153.
147. Gillet VJ, Newell W, Mata P et al. SPROUT: Recent developments in the de novo design of molecules. J Chem Inf Comput Sci 1994; 34:207-217.
EPILOGUE
Adaptation is a Consequence of Organic Nature
Paul Wrede
“A hen is only an egg's way of making another egg.” (S. Butler)1
Adaptation is a universally observed phenomenon at the macroscopic as well as the microscopic and nanoscopic levels in Nature. A deep understanding of adaptation processes may support the rational design of new molecules in medicinal chemistry. Until now the search for novel drugs has relied to a large extent on trial and error. It is evident that an increasing number of drugs will be needed to treat very severe diseases. For example, during recent decades several infectious diseases have been identified which have dramatically affected human life on whole continents. Furthermore, adaptation and resistance of organisms to frequently prescribed medicines, in particular anti-infective agents, mean that significant effort and activity are required to find and develop new medicines. We find ourselves in a paradoxical situation: on the one hand we have understood some concepts of adaptation processes on the molecular level, which is one basic approach to the design of new drugs, while on the other hand natural adaptation processes often lead to resistance of pathogenic organisms, which thereby escape the efficacy of these drugs. Adaptation processes of microorganisms are often observed on short time scales. Increasing resistance to many antibiotics causes great public health problems. Inheritable drug resistance in pathogenic microorganisms will be a great challenge in future drug discovery and development.

Adaptation is the product of iterative optimization processes leading to organisms that are suited to their environment. There are always several possible solutions to this optimization task. For example, Nature has found a surprising number of different ways of moving through the air: the flight of birds, and the diverse principles evolved by insects, which can stand still in the air or even move in all three dimensions. Adaptation is also an important permanent process on the molecular level inside the cell. Obviously molecules like proteins must recognize each other as well as certain other molecules.
Optimization of molecular recognition processes is a consequence of an evolutionary adaptation process. Understanding these processes is an ultimate goal of research and will have considerable implications for all drug design technologies. A first understanding of the molecular principle of drug resistance was obtained from studies on antibiotic-resistant mutants. Sequencing the target proteins of mutant variants very often revealed a single amino acid substitution. A recent example of drug resistance has been described for a camptothecin-resistant prostate cancer cell line. Camptothecin is an inhibitor of human topoisomerase I. Sequencing of the mutated topoisomerase I revealed a single-site substitution from arginine to histidine at residue position 364 (Fig. 1).2 Even a single-site mutation can impair drug recognition by the protein. Besides the mutation in one allele, the
Adaptive Systems in Drug Design, by Gisbert Schneider and Sung-Sau So. ©2003 Eurekah.com.
Fig. 1. Resistance to the cancer drug camptothecin (CPT) is caused by a single point mutation in the target protein topoisomerase I (topI).2 Part of the amino acid sequence and the corresponding DNA sequence (Swiss-Prot entry P11387) is given, with the point mutation at position 364 from R (arginine; upper row, wild type) to H (histidine; lower row, mutant variant). The resistance of topI/R364H is probably due to the loss of a critical H-bond between R364 and CPT. No wild-type topI RNA or genomic DNA was detected in the mutated cell line. Selection is obviously directed against expression of the wild-type topI gene, since its product is poisonous in the presence of CPT. Resistance to CPT results from a two-step genetic process: first, mutation of one of the alleles, leading to a catalytically active and CPT-resistant topI; and second, silencing or loss of the wild-type, CPT-sensitive topI allele.
other wild-type allele must be turned off. Although resistance arises from a two-step genetic process, the time needed to convert a wild-type organism into a meaningful mutant variant is astonishingly short. Adaptations in organic Nature present an endless number of solutions to a correspondingly endless number of problems, while in biotechnology the opposite is the case: a great number of problems face a still very small number of ideal solutions. Our current knowledge explains natural adaptation processes as a consequence of the efficacy of evolutionary strategies with their many different concepts. The process of adaptation has proved to be a very useful concept for finding reliable solutions to very complex optimization tasks. The authors of this volume present a broad spectrum of applications of evolutionary algorithms and adaptation processes to find new molecules exhibiting a desired biological activity, e.g., binding to a drug target. The task sounds trivial, but such a problem cannot be solved analytically because of its complex, non-linear character. Chemical space has been estimated to contain the incredibly large number of 10^60 to 10^100 chemically feasible structures.3 How can we expect to find compounds with a sufficient therapeutic effect and excellent pharmacodynamic and pharmacokinetic properties in this huge chemical space? Here evolutionary algorithms and adaptation processes provide an entry point. Having a strong concept is one side of the coin; the other is being able to formulate in mathematical terms the procedure for finding new valuable molecules. Some answers to these questions are given in the present book. A simple concept of evolutionary strategies is based on the principle of mutation and selection.
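The mutation-selection principle can be illustrated with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the procedure used by the authors: the fitness function `toy_fitness` (number of positions matching a hidden target sequence), the elitist "plus" selection scheme, and all parameter values are hypothetical choices made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def toy_fitness(peptide, target):
    """Hypothetical fitness: count of positions identical to a hidden target."""
    return sum(a == b for a, b in zip(peptide, target))


def mutate(peptide, rate, rng):
    """Exploration: each residue is replaced by a random one with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in peptide
    )


def evolve(target, n_offspring=50, rate=0.1, generations=200, seed=0):
    """Mutation-selection loop with one parent and `n_offspring` variants per cycle."""
    rng = random.Random(seed)
    parent = "".join(rng.choice(AMINO_ACIDS) for _ in target)
    history = [toy_fitness(parent, target)]
    for _ in range(generations):
        offspring = [mutate(parent, rate, rng) for _ in range(n_offspring)]
        # Exploitation: the fittest of offspring and parent becomes the new parent
        # (assumed elitist scheme, so the best fitness can never decrease).
        parent = max(offspring + [parent], key=lambda p: toy_fitness(p, target))
        history.append(toy_fitness(parent, target))
        if history[-1] == len(target):
            break  # stop once the toy optimum has been reached
    return parent, history


if __name__ == "__main__":
    best, history = evolve("MKTAYIAKQR")
    print(best, history[0], history[-1])
```

In a real design cycle the toy fitness would be replaced by a bioscreening result or by a trained mathematical model; only the alternation of random variation and selection is meant to carry over.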
Mutation and selection are two opposing processes: exploration of the parameter space by a mutation operator, and the contrary process of exploitation, in which selection determines the solutions that are carried forward.4 An impressive example is the early embryonic development of vertebrates. The first stages of their embryonic development go through a very similar morphology. Species-relevant features take shape later (Fig. 2).5 This observation led E. Haeckel to formulate the biogenetic dogma: “Ontogenesis is a short recapitulation of phylogenesis”. The
Epilogue
Fig. 2. Embryo development in three stages. The top row shows an early stage of embryo development. In the bottom row, fish, salamander, tortoise, hen, pig, cow, rabbit, and human (from left to right) can clearly be recognized. Adaptation processes are very similar at the beginning of embryonic development. In the latest stage, the adaptation of the organisms to their future environment is already manifest. Memory is a decisive condition for adaptive processes. Astonishingly, this memory reaches back even over phylogenetic time scales. Several stages of fish development are very similar to those of humans. These observations led E. Haeckel to formulate the biogenetic dogma. Figure taken from reference 5.
common morphology of the early stages demonstrates the principle of maintaining reliable structures and the continuity of organismal development. There is no reason to believe that the major biochemical and cytological processes have changed since the Precambrian period, six hundred million years ago. In molecular peptide design, a straightforward model of exploration and exploitation in the form of a simplistic mutation-selection scheme has led to new biologically active molecules.6-9 The operator driving the selection process can be based either on a set of bioscreening experiments or on a reliable mathematical model. Neural networks are universal function approximators that serve this purpose, and they have proven their value in adaptive molecular design cycles. Many details about this topic are given in this book. Natural evolution is not based on mutation and selection alone. An additional mechanism is symbiogenesis, which explains the high complexity of the eukaryotic cell.10 Symbiogenesis stands for the combination of two independent cellular systems into a new cellular entity with much better conversion and exploitation of energy, resulting in significantly better chances of survival. Mutually incompatible reactions are now performed in separate compartments within the cell. The mosaic cell originated from several symbiogenetic events: mitochondria improve the energy conversion by respiration; chloroplasts are responsible for photosynthesis;
and the cytoplasm is the space for fermentation.11 The first three quarters of the evolutionary time scale were used to develop an efficient eukaryotic cellular system. It is obvious that several very complex adaptation processes occurred. For example, coordinating and controlling the interacting metabolic pathways of the host cell and the symbiont was a great achievement of adaptation. Another example is the regulation of protein transport to the different cellular compartments. Several different types of signal sequences were added to the proteins to determine their correct localization.12 Since organisms from different taxa are combined during symbiogenesis, this phenomenon is called Inter-Taxonic Combination (ITC).13 Even today, higher organisms develop only if two cells fuse, the sperm cell with the oocyte. Only the fertilized oocyte is able to develop into a complex multicellular organism. Does modeling of adaptation processes help in virtual screening and molecular design algorithms? The answer is definitely yes. Evolutionary algorithms in conjunction with non-linear fitness estimators perform surprisingly well in this field. Since symbiogenesis has proved to be an important factor in the rapid evolution of organisms, the question arises whether a formal simulation of symbiogenetic processes might improve in machina drug design. We are convinced that it would be worthwhile taking such an approach into account. Another aspect of evolutionary adaptation is horizontal gene transfer.14 There is significant evidence that a whole set of genes (e.g., those coding for vertebrate eye development) has been transferred from one genus to another.15 Animals have developed more than 30 variants of eyes. Originally scientists thought that these eyes developed independently, but today we know that the eyes of insects and humans have a common origin. In all these cases the developmental cascade is coded by the same genetic control element.
It is very likely that horizontal gene transfer is the cause of this phenomenon. Homology between organisms and within organisms appears to point towards a rather small number of common patterns, recurrently deployed in different species. Smart recombination operators implementing a simple model of horizontal gene transfer might add to the exploitation power of genetic algorithms. Four conceptual evolutionary principles have been formulated based on observation and experiment:
1. Neo-Darwinistic theory of evolution including symbiogenesis
2. Three laws of genetics by Mendel
3. Weismann’s concept of the continuous germ plasm
4. Population genetics
Of these four principles, the first two give reasons for diversity and adaptation, while Weismann’s concept describes the thin thread of life that connects us with the earliest history and with the successes of historical adaptations that are still useful today.1 Population genetics describes quantitatively how genetic information content changes as a function of different parameters, allowing macroscopic predictions. The evolution of living organisms has been described as an adaptation process on a rugged and constantly changing multi-dimensional fitness landscape.16,17 Although it is much easier to escape from a local minimum in a multi-dimensional fitness landscape than in a one-dimensional model, it remains a source of great wonder why so many “intelligent” and diverse adaptations occur. We have only a little understanding of natural adaptation processes, but nevertheless the application of Darwin’s concept of evolution to technical and medical problems has provided us with a valuable optimization tool.4,18,19 Since the early nineties, computer simulations of adaptation processes have led to the successful design of new biologically active molecules.9,20-22 There are impressive examples of the computer-assisted design of small molecules by evolutionary algorithms. In organic Nature, peptides are the most common known substances for the control and regulation of molecular processes within the cell. The question arises how to mimic peptide function by small organic
molecules, which might be more suitable for drug development. One approach is to investigate the rules guiding peptide binding to protein receptors. We have performed several such statistical analyses, leading to a basic understanding of the essential properties required for specific peptide binding. The information obtained has been used to find characteristic pharmacophoric patterns in small molecules (Wrede et al., unpublished). In other words, the molecule designer tries to find an appropriate overlap of two different chemical spaces: the peptide world and the chemical space populated by small organic molecules. A great challenge is to determine in advance whether such an overlap exists for a given problem; if there is no overlap, then the whole optimization endeavor is bound to fail. Finding new molecules with a sufficient binding constant to a given protein is relatively simple compared to the more complex task of multi-parameter optimization, which includes pharmacokinetic properties such as absorption, distribution, metabolism, and excretion as well as toxicity (ADMET). From just the objective mentioned here, it is obvious that modeling the underlying fitness landscape is very difficult. The question arises whether modeling such a complex fitness landscape is possible at all. Nevertheless, finding “activity islands” in the multi-parameter search space can sometimes be achieved by the application of evolutionary algorithms. Evolutionary strategies provide a concept of how order is created out of chaos. Two principles perpetually work against each other: random mutation and guided selection. But even inorganic Nature is able to build up ordered structures of impressive complexity, e.g., sand dunes, clouds, snowflakes, mountains, or river systems.1,17 Organic Nature not only mirrors the well-ordered complexity of organisms and their populations; it can also easily slip out of balance and into chaos.
One possible explanation for adaptation in all living organisms is a continuous optimization process amid permanent changes of properties in a highly complex fitness landscape. When discussing evolutionary adaptation, we should keep in mind that 99% of all the organisms that have appeared in evolutionary history have died out and only a small fraction has survived. In the context of finding the desired small molecules with a set of appropriate properties, the concept of deterministic chaos can deliver a new form of molecular descriptor. Fractal descriptors can be one means of describing molecules and predicting their behavior in many respects.1,17 Since the time of R. A. Fisher, mathematical approaches have made biology a quantitative, exact science.23 Predictions have become feasible in many cases, including problems like population behavior starting from given premises, or the folding of proteins. Mathematics provides us with a powerful language and with concepts of the real world, including living Nature. Including all special axioms, there are just over 3,000 single disciplines.24 According to Bourbaki, the house of mathematics is built on three fundamental species of structures, namely the ordered, algebraic, and topological structures. The mathematical apparatus is already large enough to describe optimization and adaptation processes for biopolymers and chemical compounds. In the future, mathematics will provide molecular biology and biochemistry with more and more exact and reliable models. These models will be essential for efficient molecular design, virtual screening, and drug discovery.25 Learning is a result of adequate adaptation processes. There is insufficient space here to give a detailed description of the multitudinous adaptation processes responsible for learning.
But one impressive example of the brain’s exciting ability to adapt to new signals is given here: deaf patients with a severe inner ear disease were treated by implantation of an artificial inner ear (cochlea). When they woke up from anesthesia they were extremely irritated by “strange noises”, but after one year several of these patients were able to make phone calls, understanding speech without lip-reading. An investigation of this phenomenon led to the discovery of the general principle of neuroplasticity.26 An artificial cochlea includes a microphone outside the body that receives sounds and tones from the environment. The output of the microphone is amplified electronically and split into different frequency ranges, and the signal of each
frequency range is converted to electrical impulses. Tiny wires are connected to many nerves of the inner ear, which lead to the brain. The first stimulation is perceived as rumbling noise without any meaning. Of course, this is not a surprise, since the signals from the artificial ear are completely different from those the brain was used to receiving from the natural ear. After the implantation, neither the temporal nor the spatial arrangement of the impulses reaching the brain corresponds to the previous order of signals. On average, the brain needs about one year to learn to adapt to the new signals and decipher their meaning.27 The brain permanently undergoes adaptation, which is especially true for the forebrain and gives us the great pleasure of permanent learning and optimization. Having highlighted some specific aspects of adaptation, we face the question: where do we go from here? In the future, the design of new molecules with very diverse pharmacological applications will be driven by computer-based procedures, at least during the early stages of drug discovery. How can the desired properties of molecules be predicted? The whole problem can be reduced to a prediction of molecular recognition events. There are several possible approaches to this task. Molecules must be described appropriately, and many useful examples are given in this book. Deciphering the information contained in the many receptor-ligand interaction patterns will certainly lead to a better design of small drug-like molecules.28 The small molecule should carry not only information about tight binding but also about its pharmacokinetic and pharmacodynamic properties. In the future, prediction of ADMET properties will be possible from metabolic network simulations. The symphony of all types of interacting agents within a cell can also be considered as a metabolic network map, and the adaptation process is controlled by selection according to criteria which are mostly still unknown.
Probably the number of reactions needed to yield a certain product plays an important role, or efficient feedback reactions for sensitive regulation may be the decisive factor. Models of such a network will provide useful fitness functions for optimizing small inhibitor molecules. A prerequisite will be knowledge of all expressed enzyme molecules of a cell. Here the complete DNA sequence of the human genome and detailed analysis of the proteome will supply the necessary information. A large class of membrane-bound receptors is G-protein coupled (GPCRs). Many GPCRs have a specific binding site on their extracellular side. After binding of a functional modulator, a reaction cascade is initiated via the cytoplasmic portion. Here adaptation is directed to exploit common reaction chains for the amplification of weak signals. These processes have to be mimicked by mathematical models to provide us with reliable predictions of GPCR modulator function, especially for agonistic agents. Many empirical studies have to be performed in order to understand the specificity of receptor-ligand interactions, not only for GPCRs but also for many other targets such as ion channels or endocytotic receptors.29 A challenging goal is the anticipation of cellular adaptation resulting in drug resistance and impaired drug action. Many such adaptation processes have to be rationalized in the near future, not only for proteins but also for nucleic acid molecules, lipids and polysaccharides. Understanding adaptive processes will be a key to future success in drug design.
References
1. Flake GW. The Computational Beauty of Nature. Cambridge: MIT Press, 1998.
2. Urasaki Y, Laco GS, Pourquier P et al. Characterization of a novel topoisomerase I mutation from a camptothecin-resistant human prostate cancer cell line. Cancer Res 2001; 61:1964-1969.
3. Eglen RM, Schneider G, Böhm HJ. High-throughput screening and virtual screening: Entry points to drug discovery. In: Böhm HJ, Schneider G, eds. Virtual Screening for Bioactive Molecules. Weinheim: Wiley-VCH, 2000:1-13.
4. Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
5. Haeckel E. Anthropogenie oder Entwicklungsgeschichte des Menschen. Leipzig: Verlag Wilhelm Engelmann, 1874.
6. Schneider G, Wrede P. Development of artificial neural filters for pattern recognition in protein sequences. J Mol Evol 1993; 36:586-595.
7. Schneider G, Wrede P. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: De novo design of an idealized leader peptidase cleavage site. Biophys J 1994; 66:335-344.
8. Wrede P, Schneider G, eds. Concepts in Protein Engineering and Design. Berlin: Walter de Gruyter, 1994.
9. Wrede P, Landt O, Klages S et al. Peptide design aided by neural networks: Biological activity of artificial signal peptidase cleavage sites. Biochemistry 1998; 37:3588-3593.
10. Frank SA. The design of natural and artificial adaptive systems. In: Rose MR, Lauder GV, eds. Adaptation. San Diego: Academic Press, 1996:451-505.
11. Gray M. Evolution of organellar genomes. Curr Opin Genet Dev 1999; 9:678-687.
12. Bauer MF, Hofmann S, Neupert W et al. Protein translocation into mitochondria: The role of TIM complexes. Trends Cell Biol 2000; 10:25-31.
13. Sitte P. Intertaxonic combination: Introducing and defining a new term in symbiogenesis. In: Sato S, Ishida M, Ishikawa H, eds. Endocytobiology V. Tübingen: Tübingen University Press, 1992.
14. Jain R, Rivera MC, Lake JA. Horizontal gene transfer among genomes: The complexity hypothesis. Proc Natl Acad Sci USA 1999; 96:3801-3806.
15. Halder G, Callaerts P, Gehring WJ. The induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila. Science 1995; 267:1788-1792.
16. Gordon R. Evolution escapes rugged fitness landscapes by gene or genome doubling: The blessing of higher dimensionality. Comput Chem 1994; 18:325-331.
17. Kauffman SA. The Origins of Order. Oxford, New York: Oxford University Press, 1993.
18. Rechenberg I. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart: Frommann-Holzboog, 1973.
19. Rose MR, Lauder GV. Adaptation. San Diego: Academic Press, 1996.
20. Schneider G, Schrödl W, Wallukat G et al. Peptide design by artificial neural networks and computer based evolutionary search. Proc Natl Acad Sci USA 1998; 95:12179-12184.
21. Dobson CM, Karplus M. The fundamentals of protein folding: Bringing together theory and experiment. Curr Opin Struct Biol 1999; 9:92-101.
22. Böhm HJ, Stahl M. Structure-based library design: Molecular modelling merges with combinatorial chemistry. Curr Opin Chem Biol 2000; 4:283-286.
23. Fisher RA. The Genetical Theory of Natural Selection. Oxford: Clarendon Press, 1930.
24. Bourbaki N. The architecture of mathematics. Amer Mathemat Monthly 1950; 57:221-232.
25. Schneider G, Wrede P. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol 1998; 70:175-222.
26. Merzenich MM, Sameshima K. Cortical plasticity and memory. Curr Opin Neurobiol 1993; 3:187-196.
27. Merzenich MM, Jenkins WM, Johnston P et al. Temporal processing deficits of language-learning impaired children ameliorated by training. Science 1996; 271:77-81.
28. Eddershaw PJ, Beresford AP, Bayliss MK. ADME/PK as part of a rational approach to drug discovery. Drug Discovery Today 2000; 5:409-414.
29. Marchese A, George SR, Kolakowski LF et al. Novel GPCRs and their endogenous ligands: Expanding the boundaries of physiology and pharmacology. Trends Pharmacol Sci 1999; 20:370-375.
Index
A
Absorption 3, 69, 105, 106, 109, 110, 117, 118, 133, 167
ADMET 40, 167, 168
Antibody 147, 150
B
Bioavailability 82, 105, 117, 118, 131
Blood-brain barrier (BBB) 109, 114, 120-124
C
Chance correlation 70, 72, 87, 88, 116, 122
Chemical space 9, 10, 15, 32, 37, 41, 42, 58-60, 130, 144, 148, 152, 153, 164, 167
Classification task 19, 21
Coding scheme 16
Combinatorial chemistry 9, 13, 16, 32, 39, 67, 95, 131, 142
Combinatorial design 7, 61
Compound library 37, 39, 108, 147, 156
Correlation coefficient 58, 59, 60, 74, 75, 108, 115
D
Database 37, 39-41, 43, 45, 46, 57, 59, 61, 87, 106, 114, 123, 125, 127, 138, 152
De novo design 7, 47, 66, 80, 86, 94, 136-141, 148, 152, 153, 155, 156
Distance matrix 59, 73, 146
Distribution 3, 15, 21, 48, 52, 56, 58-61, 63, 65, 66, 75, 77, 105, 106, 110, 114, 117, 142-144, 146, 147, 152, 155, 167
Diversity 4, 6, 9, 15, 16, 21, 37-40, 67, 69, 89, 107, 122, 131, 143, 144, 146, 147, 152, 153, 155, 166
Docking 39, 136, 139-141, 153
Drug discovery 3, 4, 6, 7, 32, 37, 39, 40, 53, 57, 67, 96, 120, 130-132, 155, 163, 167, 168
Drug-likeness 105-110, 123, 124
E
Encoder network 48, 52
Enrichment factor 41
Entropy 137, 143, 144, 146, 147
Evolution strategy (ES) 10, 11, 15, 16, 18, 19, 35, 52, 135, 142, 143, 150, 152
Excretion 105, 106, 167
Expert system 125, 131
F
Feature extraction 5, 7, 19, 20, 22, 24, 32, 33, 37, 42, 46, 47, 52, 53, 57, 60, 62, 70, 72, 76, 148
Feed-forward network 24, 25, 51
Functional dependence 79, 89, 92, 93
G
G-Protein Coupled Receptor (GPCR) 37, 39, 138, 152, 168
Genetic algorithm (GA) 10, 11, 16, 17, 22, 35, 71, 77-79, 82, 83, 85-88, 94, 95, 97, 111, 118, 138, 150
H
High-throughput screening (HTS) 3, 4, 6, 7, 32, 38, 39, 40, 57, 67, 96, 105, 132
Hit rate 4, 39
I
Inductive logic programming (ILP) 35
Inverse QSAR 92
L
Library design 37, 61, 106, 107, 144
Logical inference 33
M
Metabolism 40, 105, 106, 109, 117-119, 124, 167
Molecular descriptor 24, 42, 48, 53, 56, 60, 89
Molecular design 39, 47, 52, 67, 68, 71, 82, 91, 94, 95, 105, 136, 138, 155, 156, 165-167
Molecular property 32, 45
Molecular similarity 8, 43, 69, 82, 83
Multidimensional lead optimization 131, 133
N
Neural network (ANN) 8, 19-26, 30, 52, 53, 71, 73-77, 79, 80-83, 85-89, 91-96, 98, 106-108, 111, 112, 114-123, 126, 127, 130, 131, 141, 148, 151, 168
O
Optimization 3, 4, 6-8, 10-20, 22, 25-27, 35, 39, 40, 45, 46, 50, 51, 53, 54, 66, 67, 70, 71, 77, 85-88, 94, 95, 105, 116, 118, 131, 133, 136, 138, 141-143, 149, 153-155, 163, 164, 166-168
Overfitting 70, 72, 73, 78, 81-83, 87, 88, 115, 122, 126
Overtraining 87, 88, 112
P
Parameter estimation 71-73, 78, 82
Partial least squares (PLS) 48, 72, 73, 78, 79, 83, 98, 118, 119, 121, 122, 134
Partition coefficient (logP) 13, 100, 106, 107, 109, 114-117, 119, 121, 122, 127, 130, 134
Pattern recognition 19-22, 32, 33, 70, 85, 111, 118
Peptide design 138, 142, 144, 165
Pfizer rule / Rule of five 106, 110, 117, 118, 131
Pharmacodynamic (PD) 3, 6, 105, 108, 109, 168
Pharmacokinetic (PK) 3, 6, 40, 61, 82, 87, 102, 118-120, 168, 169
Pharmacophore 7, 41, 44, 46-48, 58-61, 68, 136, 137, 141, 152, 154-156
Polar surface area (PSA) 13, 69, 121, 122
Principal Component Analysis (PCA) 23, 47, 48, 52, 53, 111, 130
Q
Quantitative structure-activity relationship (QSAR) 7, 21, 23, 52, 67-83, 85-90, 92-96, 105, 111, 114, 116-118, 120, 121, 125-128, 132, 141
Quantitative structure-property relationship (QSPR) 85, 89, 111, 116, 117
R
Rational drug design 4, 8, 142
Rule of five / Pfizer rule 106, 110, 117, 118, 131
S
Sammon map 48, 51, 53
Scoring function 86, 108, 109, 136, 137, 141
Self-organizing map (SOM) 22, 47, 48, 52-62, 73
Sequence space 142-144, 160
Similarity searching 40-47, 57, 63, 141, 146, 153, 156
Solubility 39, 62, 109-112, 114, 130, 131, 140, 141
Statistical validation 74
Stochastic search 10, 11, 15, 138, 152, 155
Structure generator 7, 140, 150
Structure-based design 136
Supervised training 24, 52
T
Tanimoto index 152, 157
Thrombin inhibitor 47, 152, 157
Topological distance 54
Toxicity 39, 40, 82, 87, 105, 109, 124-128, 130-133, 167
V
Variable selection 71, 72, 76, 78, 79, 87
Virtual screening 4, 6, 7, 11, 15, 32-34, 37, 39, 42, 44, 58, 59, 106, 108, 109, 119, 156, 166, 167