Mathematical Chemistry and Chemoinformatics: Structure Generation, Elucidation and Quantitative Structure-Property Relationships 9783110254075, 9783110300079

More than 20 years of experience in molecular structure generation, from conceptualization through to applications Innov


English Pages 520 [521] Year 2013


Table of contents:
Preface
Contents
List of figures
List of tables
List of symbols
Introduction and outline
1 Basics of graphs and molecular graphs
1.1 Graphs
1.1.1 Labeled graphs
1.1.2 Unlabeled graphs
1.2 Molecular graphs, constitutional isomers
1.2.1 Atom states in organic chemistry
1.2.2 Constitutional isomers
1.2.3 The existence of molecular graphs
1.3 Group actions on molecular graphs
1.3.1 Counting unlabeled structures
1.3.2 Counting by weight
1.3.3 Constructive methods
1.3.4 Generating samples
2 Advanced properties of molecular graphs
2.1 Substructures
2.1.1 Graph-theoretical elements
2.1.2 Subgraphs and their embeddings
2.2 Molecular substructures
2.2.1 Ambiguous molecular graphs
2.2.2 Substructure restrictions
2.3 Chemical reactions
2.4 Mesomerism
2.5 Existing chemical compounds
2.6 Molecular descriptors
2.6.1 Arithmetical descriptors
2.6.2 Topological descriptors
2.6.3 Geometrical descriptors
3 Chirality
3.1 Orientation and chirality
3.1.1 Barycentric placement of molecules in space
3.1.2 Symmetry operations, the point group
3.1.3 Chirality and handedness
3.2 Permutational isomers
3.2.1 Counting permutational isomers
3.2.2 Permutational isomers by content
3.2.3 Enumeration by symmetry
3.2.4 Constructive aspects
4 Stereoisomers
4.1 Basic stereochemistry
4.1.1 Symmetry, the orientational automorphism group
4.1.2 Partial orientation functions (POFs)
4.1.3 Generation of abstract POFs
4.1.4 Tests for chemical realizability
4.2 Radon partitions
4.3 Binary Grassmann–Plücker relations
4.4 Chemical conformation and cyclohexane
4.5 Perspectives
5 Molecular structure generation
5.1 Formula-based structure generation
5.1.1 Orderly generation of simple graphs
5.1.2 Introducing constraints
5.1.3 Variations and refinements
5.1.4 From simple graphs to multigraphs
5.1.5 Applying the Homomorphism Principle
5.1.6 Orderly generation
5.1.7 Beyond orderly generation
5.2 Constrained generation and fuzzy formulas
5.2.1 Restrictions for a molecular formula
5.2.2 Structural restrictions
5.3 Reaction-based structure generation
5.3.1 Libraries of permutational isomers
5.3.2 Attaching substituents to a central molecule
5.3.3 Generation using the network principle
5.3.4 Generation of MS fragments
5.3.5 Construction using the network principle
5.3.6 Combinatorial libraries
5.3.7 Ugi's seven component reaction
5.4 Generic structural formulas
5.4.1 A simple generic structural formula
5.4.2 Patents in chemistry
5.5 Canonizing molecular graphs
5.5.1 Initial classification
5.5.2 Iterative refinement
5.5.3 Labeling by backtracking
5.5.4 Pruning the backtrack tree
5.5.5 Profiting from symmetry
5.6 Data structures for molecular graphs
6 Supervised statistical learning
6.1 Variables and predicting functions
6.1.1 Regression and classification
6.1.2 Validation of the predicting function
6.1.3 Preprocessing of data
6.1.4 Selection of variables
6.2 Models for predicting functions
6.2.1 Linear models
6.2.2 Neural networks
6.2.3 Support vector machines
6.2.4 Decision trees
6.2.5 Nearest neighbors
7 Quantitative structure–property relationships
7.1 Optimization of experiments in combinatorial chemistry
7.2 The use of molecular descriptors
7.2.1 Arithmetical, topological, and geometrical descriptors
7.2.2 Substructure counts
7.3 Mathematical composition of QSPRs
7.4 Case studies of QSPRs obtained by linear modeling
7.4.1 Linear modeling using topological indices
7.4.2 Linear modeling using substructure counts
7.4.3 Linear modeling using TI and SC
7.4.4 Further descriptors and regression methods
7.4.5 Prediction
7.5 Case studies with separate learning and test sets
7.5.1 Preprocessing of structures
7.5.2 Choice of descriptors
7.5.3 Linear modeling by best subset selection
7.5.4 Linear modeling by stepwise subset selection
7.5.5 Linear modeling using principal component regression
7.6 A case study of QSARs with discrete values
7.6.1 Choice and redundancy of descriptors
7.6.2 Regression
7.6.3 Multi-classification
7.6.4 Binary classification
7.6.5 Prediction
7.7 Outlook: Unsupervised learning and diversity considerations
8 Molecular structure elucidation
8.1 Spectroscopic methods
8.2 Automated molecular structure elucidation
8.3 Basics of mass spectrometry
8.3.1 Mode of operation of an EI mass spectrometer
8.3.2 Problems in EI mass spectrometry
8.3.3 Mass spectra and isotope distributions
8.3.4 Database of elucidated mass spectra
8.4 Ranking functions for mass spectra
8.4.1 Ranking of molecular formulas
8.4.2 Ranking of structural formulas
8.5 Classification of mass spectra
8.5.1 MS descriptors
8.5.2 MS classifiers
8.5.3 Search for substructures amenable to MS classification
8.6 Automated structure elucidation via MS
8.6.1 Example methyl n-pentanoate
8.6.2 Example ethyl 3-hydroxyphenylacetate
8.7 High resolution MS
8.7.1 Exact isotope masses
8.7.2 Molecular formulas of identical exact mass
8.7.3 Mass differences between molecular formulas
8.7.4 Molecular formulas from exact molecular masses
8.8 High resolution MS/MS
8.8.1 Generating molecular formulas
8.8.2 Calculating MS match values
8.8.3 Calculating MS/MS match values
8.8.4 Verifying MS/MS match values experimentally
8.8.5 Scope, limitations and outlook for HR–MS
9 Case studies of CASE
9.1 CASE with MOLGEN–MS
9.1.1 Example for a single spectrum
9.1.2 Multiple spectra
9.2 Calculated properties to improve CASE
9.2.1 Mass spectrum prediction
9.2.2 Retention properties
9.2.3 Partitioning properties
9.2.4 Steric energy
9.2.5 Filtering candidates by calculated properties
9.2.6 Consensus scoring
9.3 Examples of CASE at work
9.3.1 Blue rayon unknown 1
9.3.2 Blue rayon unknown 2
9.3.3 Diclofenac transformation product
9.4 CASE conclusions and outlook
9.4.1 GC–EI–MS
9.4.2 CASE with high accuracy data
A Lists of molecular descriptors
A.1 Arithmetical descriptors
A.2 Topological descriptors
A.3 Geometrical descriptors
B Substructures for MS classifiers
B.1 Alkyls
B.2 Aromatics
B.3 Bonds
B.4 Elements
B.5 Functional groups
B.6 Rings
C Molecular formulas by mass and ion type
D Isomers by mass and molecular formula
Bibliography
List of abbreviations
Index

Adalbert Kerber, Reinhard Laue, Markus Meringer, Christoph Rücker, Emma Schymanski

Mathematical Chemistry and Chemoinformatics | Structure Generation, Elucidation and Quantitative Structure – Property Relationships

Authors

Prof. Dr. Adalbert Kerber, Schloßhof Birken 21, 95447 Bayreuth, Germany
Prof. Dr. Reinhard Laue, University of Bayreuth, Institute for Computer Sciences, Universitätsstraße 30, 95447 Bayreuth, Germany
Dr. Markus Meringer, German Aerospace Center (DLR), Earth Observation Center (EOC), Münchner Str. 20, 82234 Weßling, Germany
PD Dr. Christoph Rücker, Leuphana University Lüneburg, Institute of Sustainable and Environmental Chemistry, Scharnhorststr. 1, 21335 Lüneburg, Germany
Dr. Emma Schymanski, Eawag – Swiss Federal Institute of Aquatic Science and Technology, Überlandstraße 133, 8600 Dübendorf, Switzerland

ISBN 978-3-11-030007-9
e-ISBN 978-3-11-025407-5

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.dnb.de.

© 2014 Walter de Gruyter GmbH, Berlin/Boston
Typesetting: le-tex publishing services GmbH, Leipzig
Printing and binding: Hubert & Co. GmbH & Co. KG, Göttingen
Printed on acid-free paper
Printed in Germany
www.degruyter.com

Preface

In this book, we describe, extend and apply methods of computer chemistry and chemoinformatics, suitable for molecular structure generation, structure elucidation, combinatorial chemistry, QSPRs, the generation of chemical patent libraries and so on. The tools come from discrete mathematics (graph theory, constructive combinatorics), stochastics (explorative data analysis, supervised and unsupervised learning), computer science (data structures, algorithms) and chemistry (combinatorial chemistry, molecular structure elucidation).

[Figure: diagram with the labels Discrete Mathematics, Computer Sciences, Statistics, Chemistry and Applications.]

The book evolved from research on constructive combinatorics at the University of Bayreuth, guided by A. Kerber and R. Laue, and based on the use of finite group actions. The combinatorial structures in the focus of this research are in particular codes, designs, groups and graphs. In the present case the emphasis is, of course, on molecular graphs, i.e. multigraphs where the nodes are colored by atom symbols and atom states. They form the model of molecules used in the generator MOLGEN. For this purpose, new methods had to be developed, see e.g. [146, 147, 174] and the present bibliography. The book is in a sense a summary of research projects (DFG Ke201/16–1, Ke201/19–1, BMBF 03KE7BA1–4, 03CO318C) which led to implementations of the software packages MOLGEN (several versions) for molecular structure generation, MOLGEN–MS and MOLGEN–MS/MS for molecular structure elucidation using mass spectroscopy, MOLGEN–COMB for combinatorial chemistry and MOLGEN–QSPR to support the search for quantitative structure–property relationships. We are indebted to the Deutsche Forschungsgemeinschaft (DFG) and the Bundesministerium für Bildung und Forschung (BMBF) for this long-term support, which not only made the development of MOLGEN possible but also had an impact on the general theoretical research. Several theses originated directly from these projects, e.g. [17, 23, 24, 32, 75, 76, 94, 95, 96, 97, 102, 202]. The aim of this research was to complement the well-known powerful counting methods with constructions. While counting gives the number of objects without listing any of them, the structures themselves are essential in chemistry.

Historically, the widely known project DENDRAL started in the early 1970s in the US. It can be considered as a precursor of the MOLGEN project in Bayreuth that refined the approach and added theoretical as well as practical material. An early version of the MOLGEN structure generator was awarded the German–Austrian University Software Prize for Chemistry (Deutsch-Österreichischer Hochschul-Software-Preis für Chemie) in 1993.

The book is based on the dissertation of M. Meringer [202] and also contains the main results of the dissertation of R. Gugisch [102]. C. Rücker used the mathematical tools detailed herein to develop software to find quantitative structure–property relationships (MOLGEN–QSPR) and software for teaching and studying a few facets of organic chemistry, isomerism and in particular stereoisomerism (UNIMOLIS). Finally, E. Schymanski used several MOLGEN products during her dissertation [283] at the Helmholtz Centre for Environmental Research (UFZ, Leipzig, Germany) to integrate analytical and computational methods to identify unknown toxicants isolated during effect-directed analysis. As usual in mathematics, the order of the book’s authors is alphabetical and as such does not reflect the merits of individual authors.

The book’s PDF version can be used as an interactive e-book, which can be accessed via http://www.degruyter.com/view/product/185915?format=G. The exercises contain illustrative examples that can be evaluated using software packages such as MOLGEN–ONLINE, SYMMETRICA, MAGMA and others, directly via the respective homepages.

The book is written for the users of such software, as they need to know what is really meant when we speak of the generation of molecular graphs, of substructures, of a goodlist of prescribed substructures, of overlapping substructures, of non-overlapping substructures, of closed substructures, of substructure counts, of molecular descriptors, and so on. Otherwise, users will not be able to achieve the full potential of the software. It is also meant to provide documentation of the mathematical basics required for the designers of software for computational chemistry or chemoinformatics.


We emphasize in particular the following aspects:
– The basic mathematical concepts for representation and evaluation of molecular structures: molecular graphs, substructures, restrictions, reactions, structure generation, molecular descriptors and the statistical learning methods that play a central role in the applications.
– The most important results and facts, in particular the extensions of the MOLGEN class library: reaction-based generation of structures, QSPR studies using different kinds of molecular descriptors, various methods for prediction, ranking and classification of mass spectra, relations between spectra and properties and CASE using electron impact (EI) mass spectrometry.
– Perspectives and suggestions for further research: new approaches to the interpretation and verification of mass spectra, to stereoisomer and conformer generation, normal forms for patents in chemistry and CASE using high resolution mass spectrometry.

We would like to thank D. Moser, whose diploma thesis [214] was the starting point for our research, a quarter of a century ago. It contained the first version of the generator MOLGEN, showing that it is possible to make an efficient molecular generator available to the scientific community. The contributions of R. Grund to orderly generation of molecular graphs, and those of R. Hohberger (evaluation of 2D and 3D placements of molecular graphs) are very useful. Together with T. Wieland and C. Benecke they were responsible for the implementation of MOLGEN up to version 3.5. Thanks are also due to T. Grüner for the development of constrained construction strategies for molecular graphs which are used in MOLGEN 4 and MOLGEN–MS, and his orderly generation of double cosets for the generation of combinatorial libraries for MOLGEN–COMB. We gratefully acknowledge J. Braun’s implementation of molecular descriptors and aromaticity detection, which are important modules of MOLGEN–QSPR, as well as the work of R. Gugisch in developing MOLGEN 5.0.

Moreover, we should like to emphasize the enormous transfer of know-how in chemoinformatics we received from many people, in particular from K. Varmuza (Vienna University of Technology) and W. Werther (University of Vienna). Thanks are also due to R. Neudert (Chemical Concepts) and S. Stein (NIST) for the MS databases, W. Brack (UFZ Leipzig) for the excellent supervision and support during E. Schymanski’s thesis, as well as many other colleagues from UFZ Leipzig, finally to S. Reinker (NIBR Basel), who provided first measurements from newest generation tandem mass spectrometers with ultrahigh mass resolution for the evaluation of MOLGEN–MS/MS. Our interest in mathematical chemistry was stimulated in particular by A. Dreiding, A. Dress, M. E. Elyashberg, H. Gerlach, I. Gutman, W. Hässelbarth, O. E. Polansky, E. Ruch, S. Tratch, I. Ugi, N. Zefirov and by the foundation of MATCH in 1975.

Bayreuth, München, Lüneburg, Zürich, June 4, 2013
A. Kerber, R. Laue, M. Meringer, C. Rücker, E. Schymanski


List of figures

Fig. 1 Seven pairs of C6H6 stereoisomers.
Fig. 2 Two pairs of C6H6 conformers.

Fig. 2.1 Constitutional isomers of C6H6 in Beilstein.
Fig. 2.2 Steric energy of the constitutional isomers C6H6.
Fig. 2.3 Van der Waals volumes of the constitutional isomers C6H6.

Fig. 3.1 The 22 permutational isomers of Seveso dioxin.

Fig. 4.1 Three chiral compounds whose stereoisomerism cannot be described in terms of stereocenters, stereogenic double bonds, or rotatable single bonds.
Fig. 4.2 Example conformations.
Fig. 4.3 Stereoisomers generated for structure 4.2a.
Fig. 4.4 Conformers generated for structure 4.2b.
Fig. 4.5 Kinds of quadruples considered in tests.
Fig. 4.6 Stereoisomers and conformers generated for structure 4.2c.
Fig. 4.7 Radon partitions.
Fig. 4.8 Two chemically forbidden minimal Radon partitions.
Fig. 4.9 A non-chemical arrangement of atoms for structure b, Figure 4.2.
Fig. 4.10 The chair form of cyclohexane with numbered atoms.
Fig. 4.11 Two forbidden minimal Radon partitions.
Fig. 4.12 Four conformations found by constrained optimization.

Fig. 5.1 Backtrack tree for labeled generation of simple graphs on three nodes.
Fig. 5.2 Backtrack tree for unlabeled generation of simple graphs on three nodes.
Fig. 5.3 Backtrack tree for orderly generation of simple graphs on three nodes.
Fig. 5.4 Adjacency matrix with block structure.
Fig. 5.5 Structures of the 20 proteinogenic amino acids.
Fig. 5.6 Scheme of the seven component reaction.
Fig. 5.7 Alkyl groups with 1–6 C atoms.
Fig. 5.8 Compounds 1–4 with arbitrary initial vertex numbering.
Fig. 5.9 The backtrack tree for structure 3, Figure 5.8.
Fig. 5.10 The backtrack tree for cubane.

Fig. 6.1 Examples of strong and weak correlations.
Fig. 6.2 Scheme of a neural network, one hidden layer and bias neurons.
Fig. 6.3 Support vector classification, where the classes can be separated.
Fig. 6.4 Scheme of a decision tree.

Fig. 7.1 Schematic workflow for the prediction of property values for a virtual combinatorial library.
Fig. 7.2 Atoms from E11 represented as spheres of van der Waals radii.
Fig. 7.3 3D arrangement of the amino acid methionine as vdW spheres.
Fig. 7.4 Real library of decanes and their boiling points.
Fig. 7.5 Substructures containing 2–6 edges.
Fig. 7.6 Statistics for the best LMs for BPs of decanes, containing 1–20 descriptors.
Fig. 7.7 Standard errors for the best BP models containing 1–20 descriptors.
Fig. 7.8 F values for the best BP models containing 1–20 descriptors.
Fig. 7.9 Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 3 TIs.
Fig. 7.10 Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 4 SCs.
Fig. 7.11 Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 3 descriptors (TI and SC).
Fig. 7.12 Scatterplot of calculated vs. experimental BPs of decanes for the best LM model, using 7 descriptors (TI and SC).
Fig. 7.13 Purely virtual library of 25 decanes with predicted BPs.
Fig. 7.14 Scatterplot of $R^2_{TS}$ vs. $R^2_{LS}$ for the best models (with respect to LS) of $n = 1, \ldots, 5$ descriptors.
Fig. 7.15 Scatterplot of calculated PD vs. experimental PD for the model marked by an arrow in Figure 7.14.
Fig. 7.16 Scatterplot of $R^2_{TS}$ vs. $R^2_{LS}$ for the best linear models (with respect to $R^2_{TS}$) after 50-fold stepwise subset selection.
Fig. 7.17 Scatterplot of calculated PD vs. experimental PD for the model marked by an arrow in Figure 7.16.
Fig. 7.18 $R^2_{LS}$, $R^2_{TS}$, $R^2_{CV}$ on learning set, by number of descriptors.
Fig. 7.19 $R^2_{CV}$ vs. $R^2_{LS}$ for best linear models with respect to $R^2_{LS}$.
Fig. 7.20 $R^2_{LS}$ and $R^2_{TS}$ for LM, determined by PCR.
Fig. 7.21 Substituents R1 and R2 in the real library of quinolones.
Fig. 7.22 Regression tree for MIC using five descriptors.
Fig. 7.23 Multiclassification tree for MIC using seven descriptors.
Fig. 7.24 Binary classification tree for MIC using three descriptors.
Fig. 7.25 Dendrogram for clustering the virtual library of decanes.

Fig. 8.1 Automatic structure elucidation workflow.
Fig. 8.2 Example EI mass spectrum of methyl n-pentanoate.
Fig. 8.3 Mode of operation of an EI mass spectrometer.
Fig. 8.4 Scheme of structure elucidation via MS.
Fig. 8.5 Natural isotope distributions of the elements in E11.
Fig. 8.6 Molecule mass distribution in the MS–structure data set, E11.
Fig. 8.7 Molecule mass distribution in the MS–structure data set, E4.
Fig. 8.8 Match values of the molecular formula candidates of mass 116.
Fig. 8.9 Histogram of the match values of correct molecular formulas.
Fig. 8.10 Distribution of the match values of correct molecular formulas.
Fig. 8.11 Histogram of RRP, correct molecular formulas, 100 spectra.
Fig. 8.12 Ranking of correct molecular formulas, 100 compounds.
Fig. 8.13 Histogram of RRP for correct molecular formulas, 100 spectra.
Fig. 8.14 Ranking position of correct molecular formulas.
Fig. 8.15 MS reactions of methyl n-pentanoate.
Fig. 8.16 Fragment ions of methyl n-pentanoate.
Fig. 8.17 Comparison of experimental spectrum and explained intensities.
Fig. 8.18 Ranking of C6H12O2 isomers by spectral match.
Fig. 8.19 Histogram of match values of C6H12O2 constitutional isomers.
Fig. 8.20 Distribution of match values of C6H12O2 constitutional isomers.
Fig. 8.21 Histogram of match values for a sample of 1000 mass spectra.
Fig. 8.22 Match values of correct candidates for 1000 mass spectra.
Fig. 8.23 Histogram of RRP for structural formulas of 100 mass spectra.
Fig. 8.24 Ranking position of correct candidate and no. of candidates.
Fig. 8.25 Workflow for prediction of structural properties by spectrum.
Fig. 8.26 Classification tree for methyl ester.
Fig. 8.27 Complexity of classification trees.
Fig. 8.28 Mean misclassification rates for learning set and test set by CT.
Fig. 8.29 Mean misclassification rates for test set, two classes, CT.
Fig. 8.30 Mean misclassification rates, LDA, learning set and test set.
Fig. 8.31 Mean misclassification rate, test set, the two classes separately.
Fig. 8.32 Misclassification rates for the test set, CT and LM.
Fig. 8.33 Misclassification rates, various methods, selection by CT.
Fig. 8.34 Misclassification rates of test set, selection by MLR.
Fig. 8.35 Match values of structure candidates for Example 8.2.
Fig. 8.36 Ranking of candidates by match values for Example 8.2.
Fig. 8.37 Mass spectrum of ethyl 3-hydroxyphenylacetate.
Fig. 8.38 Minima, maxima, and arithmetic means of mass differences.
Fig. 8.39 Relative frequencies of MMD for molecular formulas from B𝑐E4.
Fig. 8.40 Minima, maxima, and arithmetic means of mass differences.
Fig. 8.41 Relative frequencies of MMD for molecular formulas from B𝑐E11.
Fig. 8.42 Box plot of counts of candidates for a sample.
Fig. 8.43 Plot of counts of candidates for a sample.
Fig. 8.44 Box plot of counts of candidates for a sample.
Fig. 8.45 Plot of counts of molecular formula candidates for a sample of compounds consisting of elements from E11.
Fig. 8.46 Simplified flowchart for calculating MS and MS/MS match values.
Fig. 8.47 Bar chart of mean RRP for different match values.
Fig. 8.48 MS/MS and MS of CHAPS, together with the calculated isotopic distribution of CHAPS (blue shading).
Fig. 8.49 Plot of molecular formula candidates for CHAPS.
Fig. 8.50 MS/MS and MS of cyclosporin C, together with the calculated isotopic distribution of cyclosporin C (blue shading).
Fig. 8.51 Plot of the 225 molecular formula candidates for cyclosporin C.

Fig. 9.1 Unknown spectrum, RT = 10.9 min, $\log K_{OW}$ = 4.37 − 4.85.
Fig. 9.2 All structures generated for C4H2Cl4 with MOLGEN–MS.
Fig. 9.3 Semilog plot of the number of structures generated in runs 1–3.
Fig. 9.4 Chemically realistic and sterically hindered molecules.
Fig. 9.5 Exclusion strategy to identify unknown compounds with MOLGEN–MS and calculated properties.
Fig. 9.6 The 29 isomers of C12H10O2 with NIST EI–MS spectra.
Fig. 9.7 Error margins for experimental and estimated KRI and LRI values.
Fig. 9.8 The remaining structures for CASE with structure 15.
Fig. 9.9 Workflow for consensus scoring for CASE with MOLGEN–MS and various calculated properties.
Fig. 9.10 MOLGEN–MS and NIST substructure classifiers for Unknown 1.
Fig. 9.11 Top four candidates from MOLGEN–MS using the consensus scoring approach for Unknown 1.
Fig. 9.12 MOLGEN–MS and NIST substructure classifiers for Unknown 2.
Fig. 9.13 Top four candidates from MOLGEN–MS using the consensus scoring approach for Unknown 2.
Fig. 9.14 Mass spectrum of the unknown TP and the closest NIST match.
Fig. 9.15 The ‘goodlist’ substructure and the resulting candidate structures.
Fig. 9.16 Mass spectrum of the reisolated unknown TP and the synthesized standard CPAB.

List of tables

Table 1.1 | Numbers of unlabeled 𝑚-multigraphs. | 25
Table 1.2 | Upper bounds for numbers of unlabeled connected 𝑚-multigraphs. | 25
Table 1.3 | Some admissible atom states for the elements in E11. | 28

Table 4.1 | Partial chirotopes generated for cyclohexane. | 161
Table 4.2 | Number of partial chirotopes generated. | 161
Table 4.3 | CPU times of the evaluation of some partial chirotopes generated. | 162

Table 5.1 | Reactants for structures defined by 𝐺𝑆. | 201
Table 5.2 | Reactants and reaction schemes for structures defined by 𝐺𝑆. | 201
Table 5.3 | Initial classification and iterative refinement for a pymetrozine analog. | 209
Table 5.4 | Labeling by backtracking for N-benzyl-o-toluidine. | 210
Table 5.5 | Pruning the backtrack tree for 1-azabicyclo[4.3.2]undecane. | 211
Table 5.6 | Profiting from symmetry for cubane. | 215

Table 7.1 | Mean atomic mass, van der Waals radius and van der Waals density of the elements of E11. | 244
Table 7.2 | Calculated van der Waals volumes of small organic molecules. | 249
Table 7.3 | Counts of substructures. | 253
Table 7.4 | Values of topological indices for the real library of 50 decanes. | 256
Table 7.5 | Values of some topological indices for the real library of 50 decanes (continued). | 257
Table 7.6 | Part of the correlation matrix for boiling points and topological indices of decanes. | 258
Table 7.7 | Statistics of best linear models for BPs of decanes, containing one to 18 topological indices. | 260
Table 7.8 | Statistics of the best linear models for BPs of decanes, containing 1–20 substructure counts. | 264
Table 7.9 | Statistics for the best linear models containing 𝑛 descriptors (out of 18 TIs and 19 SCs) for the BPs of decanes. | 268
Table 7.10 | Best 𝑛-subsets of descriptors for BP models containing TIs, SCs and both types of descriptors. | 269
Table 7.11 | 𝑅2 of best models for the BPs of decanes obtained by various methods. | 270
Table 7.12 | Atomic profile of the real library of propyl acrylates. | 272
Table 7.13 | Distribution of some properties within the real library of propyl acrylates. | 273
Table 7.14 | Part of the correlation matrix for PD and descriptors for the real library of propyl acrylates. | 274
Table 7.15 | Coefficients of determination 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 of the best five PD models containing 𝑛 descriptors. | 276
Table 7.16 | Best subsets of 𝑛 descriptors for PD models, denoted by their 𝑅2𝐿𝑆 | 278
Table 7.17 | Descriptors contained in the 25 best (with respect to 𝑅2𝑇𝑆) models obtained by 50-fold stepwise subset selection. | 281
Table 7.18 | Experimental MIC values (in 𝜇g/mL) for the real library of quinolones. | 285
Table 7.19 | 𝑅2 for modeling MIC, regression and four 5-subsets of descriptors. | 288
Table 7.20 | Distribution of 51 quinolones into activity classes by experiment and by calculation using the CT of Figure 7.23. | 289
Table 7.21 | 2×2 tables for binary classification of ABA for various classification methods and descriptor sets. | 291
Table 7.22 | 𝑇𝐶𝐸 and 𝑇𝐶𝐸𝐶𝑉 for binary classification of ABA for various classification methods and descriptor sets. | 292

Table 8.1 | Peaks from the mass spectrum of Figure 8.2. | 307
Table 8.2 | Natural isotope distributions of the elements in E11. | 308
Table 8.3 | Distribution of elements in the MS–structure data set for E11. | 311
Table 8.4 | Atom count and mass distributions of the MS–structure data set. | 312
Table 8.5 | Match value for C6H12O2 and the spectrum from Example 8.2. | 320
Table 8.6 | Ranking of formulas of mass 116 for Example 8.2. | 321
Table 8.7 | Quantiles 𝑞𝑝 for match values of correct formulas, various 𝑝. | 324
Table 8.8 | Match value for methyl n-pentanoate, spectrum Ex. 8.2. | 329
Table 8.9 | Quantiles 𝑞𝑝 for structure match values, various 𝑝. | 336
Table 8.10 | Node details of the classification tree for methyl ester. | 344
Table 8.11 | Descriptor values of the spectrum from Example 8.2. | 345
Table 8.12 | Misclassification rates of MS classifiers (CT), 77 properties. | 347
Table 8.13 | Misclassification rates of MS classifiers (LDA), 77 properties. | 350
Table 8.14 | Misclassification rates, various methods, descr. selection by CT. | 352
Table 8.15 | Misclassification rates, various methods, selection by MLR. | 352
Table 8.16 | Misclassification rates of MS classifiers, SVM, 77 properties. | 354
Table 8.17 | Misclassification rates for well-classifiable substructures. | 356
Table 8.18 | Exact isotope masses and distributions for elements of E11. | 364
Table 8.19 | Numbers of candidates, sample of 1000 compounds, E4 and E11. | 369
Table 8.20 | Overview of the samples of the MS/MS study. | 379
Table 8.21 | Overview of the spectra of the MS/MS study. | 381
Table 8.22 | Numbers of candidates and ranking results for various MS match values. | 382
Table 8.23 | Ranking results for MS/MS match values. | 382
Table 8.24 | Ranking results for combined match values. | 383
Table 8.25 | Molecular formula candidates for sinapinic acid. | 384
Table 8.26 | Calculation of the MS/MS MV for sinapinic acid. | 385
Table 8.27 | Molecular formula candidates for peptide MRFA. | 386
Table 8.28 | Molecular formula candidates for CHAPS. | 387
Table 8.29 | Molecular formula candidates for cyclosporin C. | 387

Table 9.1 | Match values and relative ranking positions for different fragmenters. | 398
Table 9.2 | NIST match results for Unknown 1. | 408
Table 9.3 | NIST match results for Unknown 2. | 410

Table C.1 | Molecular formulas for nominal masses 1–100, elements in E4 | 444
Table C.2 | Molecular formulas for nominal masses 1–100, elements in E11 | 444
Table C.3 | Molecular formulas for nominal masses > 100, elements in E4 | 445
Table C.4 | Molecular formulas for nominal masses > 100, elements in E11 | 446

List of symbols

𝛾 | a mapping, mostly a multigraph | 15
𝛾𝑏 | the bond graph corresponding to the multigraph 𝛾 | 15
ℕ | the set {0, 1, 2, . . .} of natural numbers | 15
∅ | the empty set | 15
𝑛 | the set {0, . . . , 𝑛 − 1} or the cardinality of this set, if 𝑛 > 0, and ∅ or its cardinality 0 if 𝑛 = 0 | 15
(𝑛2) | the set of 2-element subsets {𝑖, 𝑗}, 𝑖 ≠ 𝑗, of 𝑛, or its cardinality 𝑛(𝑛 − 1)/2 | 15
𝑌𝑋 | the set {𝛾 | 𝛾 : 𝑋 → 𝑌} of all mappings from 𝑋 to 𝑌 | 15
G𝑚,𝑛 | the set of all labeled 𝑚-multigraphs on 𝑛 nodes | 16
G𝑐𝑚,𝑛 | the connected labeled 𝑚-multigraphs on 𝑛 nodes | 16
M𝛾 | the matrix of bond multiplicities in 𝛾 | 16
𝛾𝑖𝑗 | the multiplicity of the bond connecting nodes 𝑖 and 𝑗 | 16
M𝛾𝑏 | the bond matrix of 𝛾 | 16
𝑣(𝛾)𝑖 | the valence of node 𝑖 in the multigraph 𝛾 | 16
𝑏(𝛾)𝑖 | the number of bonds incident with node 𝑖 in 𝛾 | 16
𝑣(𝛾) | the sequence of valences of the nodes in 𝛾 | 16
𝑏(𝛾) | the sequence of valences of the nodes in 𝛾𝑏 | 16
𝐺 | a group | 19
1𝐺 | the identity element in 𝐺 | 19
𝑔−1 | the inverse of 𝑔 ∈ 𝐺 | 19
ℝ | the set of real numbers | 19
𝑆𝑋 | the symmetric group on the set 𝑋 | 19
𝐺𝑋 | an action of the group 𝐺 on the set 𝑋 | 21
𝐺(𝑥) | the orbit of 𝑥 under the action 𝐺𝑋 | 21
𝐺\\𝑋 | the set of all orbits of 𝐺𝑋 | 21
|𝑋| | the order (or cardinality) of the set 𝑋 | 21
T(𝐺\\𝑋) | the set of all transversals of 𝐺\\𝑋 | 21
𝛾̄ | the orbit of 𝛾 ∈ 𝑌𝑋, a symmetry class of mappings | 24
𝑍 | an atom state | 26
𝑣𝑍 | the valence of 𝑍 | 26
𝑝𝑍 | the number of free electron pairs of 𝑍 | 26
𝑞𝑍 | the charge of 𝑍 | 26
𝑟𝑍 | existence of an unpaired electron in 𝑍 | 26
ℤ | the set {0, ±1, ±2, . . .} of integers | 26
𝔹 | the Boolean algebra {𝑓𝑎𝑙𝑠𝑒, 𝑡𝑟𝑢𝑒} or {0, 1} | 26
Z𝑋 | a set of admissible atom states for 𝑋 | 26
𝑋 | a chemical element | 27
E4 | the set of elements {H, C, N, O} | 27

E11 | the set {H, C, N, O, F, Si, P, S, Cl, Br, I} | 27
𝑇𝐸𝑋 | the atom number of 𝑋 | 27
𝑉𝐸𝑋 | the number of valence electrons of 𝑋 | 27
𝑣𝑋 | the standard valence of 𝑋 | 27
E | a set of chemical elements | 29
ZE | a set of admissible atom states for the elements in E | 29
𝑀 = (𝜀, 𝜁, 𝛾) | a labeled molecular graph | 29
𝛾∗ | the H-suppressed graph obtained from 𝛾 | 30
𝑀∗ | the H-suppressed molecular graph obtained from 𝑀 | 30
M𝑛 | the set of the molecular graphs with 𝑛 atoms | 30
M | the set of molecular graphs | 30
M𝑐𝑛 | the set of connected molecular graphs in M𝑛 | 30
M𝑐 | the set of all connected molecular graphs | 30
𝑀̄ | the equivalence class of 𝑀 or an unlabeled graph | 30
M̄ | the set of unlabeled molecular graphs | 30
M̄𝑛 | the set of unlabeled molecular graphs on 𝑛 atoms | 30
M̄𝑐 | the set of connected unlabeled molecular graphs | 31
𝛽 | a molecular formula | 32
𝛽𝑀 | the molecular formula of a molecular graph 𝑀 | 32
M̄𝑐𝛽 | the set of constitutional isomers corresponding to 𝛽 | 32
B𝑐E | the set of molecular formulas of connected molecular graphs over E | 34
𝑣𝑋,𝑖 | the possible valence 𝑖 for 𝑋 | 35
DBE(𝛽) | the double bond equivalent of 𝛽 | 35
𝛽′ | the empirical formula corresponding to 𝛽 | 35
gcd 𝛺 | the greatest common divisor of the numbers in 𝛺 | 35
𝑆𝑛 | the symmetric group on the set 𝑛 | 36
𝑋𝑔 | the set of fixed points of 𝑔 ∈ 𝐺 on 𝑋 | 37
𝐶𝐺(𝑔) | the conjugacy class of 𝑔 ∈ 𝐺 | 38
C | a transversal of the conjugacy classes of 𝐺 | 38
⟨𝑔⟩ | the subgroup of 𝐺, generated by 𝑔 ∈ 𝐺 | 38
𝑐(𝜋) | the number of cyclic factors of the permutation 𝜋 | 40
[𝜋(0) . . . 𝜋(𝑛 − 1)] | the list notation of the permutation 𝜋 | 41
𝛼(𝜋) | the cycle partition of 𝜋 ∈ 𝑆𝑛 | 42
⊢ | ‘is a number partition of’ | 42
𝑎(𝜋) | the cycle type of 𝜋 ∈ 𝑆𝑛 | 42
⊢ ⊣ | ‘is a cycle type of’ | 42
lcm | ‘least common multiple’ | 43
ℚ[𝑌] | the ring of multivariate polynomials with rational coefficients and indeterminates 𝑦𝑖 ∈ 𝑌 | 45
𝑤(𝛾) | the weight of 𝛾 ∈ 𝑌𝑋 | 45
𝑐𝑜𝑛(𝛾) | the content of 𝛾 ∈ 𝑌𝑋 | 45

⟨𝑔⟩\\𝑖𝑋 | the set of orbits of length 𝑖 of ⟨𝑔⟩ on 𝑋 | 45
𝐶(𝐺, 𝑋) | the cycle index polynomial of 𝐺𝑋 | 45
𝐴 ≤ 𝐺 | 𝐴 is a subgroup of 𝐺 | 47
𝐴𝑔 | the right coset of 𝐴 in 𝐺 containing 𝑔 | 48
𝑔𝐴 | the left coset of 𝐴 in 𝐺 containing 𝑔 | 48
𝐺/𝐴 | the set of left cosets of 𝐴 in 𝐺 | 48
𝐴𝑔𝐵 | the (𝐴, 𝐵)-double coset in 𝐺 containing 𝑔 | 48
𝐴\𝐺/𝐵 | the set of (𝐴, 𝐵)-double cosets in 𝐺 | 48
≈ | ‘is an action similar to’ | 49
𝑝(−) | a probability distribution | 53
len𝛾((𝑖0, ..., 𝑖𝑘)) | the length of the walk (𝑖0, ..., 𝑖𝑘) in 𝛾 | 56
𝐵(𝛾) | the set of bonds in the bond graph of 𝛾 | 56
𝑚𝑤𝑐(𝑙)(𝛾) | the number of walks of length 𝑙 in 𝛾 | 56
𝛾𝑏𝑖𝑗(𝑙) | an entry of the 𝑙-th power M𝑙𝛾𝑏 of M𝛾𝑏 | 57
girth𝛾 | the girth of 𝛾 | 58
dist𝛾(𝑖, 𝑗) | the distance of the nodes 𝑖, 𝑗 in 𝛾 | 59
Conn(𝛾) | the set of connected components of 𝛾 | 59
𝛾′ ⊆ 𝛾 | 𝛾′ is a subgraph of 𝛾 | 59
𝛾′ ⊆𝑐 𝛾 | 𝛾′ is a closed subgraph of 𝛾 | 59
𝛾′ ⊆𝑖 𝛾 | 𝛾′ is an induced subgraph of 𝛾 | 59
𝛾|𝑁 | the subgraph induced on 𝑁 in 𝛾 | 59
𝜙 | an embedding of graphs | 60
𝑛𝑁 inj | the set of injective mappings from 𝑁 to 𝑛 | 60
𝛾′ ⊆𝜙 𝛾 | 𝛾′ is a subgraph of 𝛾 with respect to 𝜙 | 60
𝛾′ ⊆𝑐𝜙 𝛾 | 𝛾′ is a closed subgraph of 𝛾 with respect to 𝜙 | 60
𝛾′ ⊆𝑖𝜙 𝛾 | 𝛾′ is an induced subgraph of 𝛾 with respect to 𝜙 | 60
Emb ⊆ (𝛾′, 𝛾) | the embeddings of 𝛾′ into 𝛾 as subgraph | 60
Emb ⊆𝑐 (𝛾′, 𝛾) | the embeddings of 𝛾′ into 𝛾 as closed subgraph | 60
Emb ⊆𝑖 (𝛾′, 𝛾) | the embeddings of 𝛾′ into 𝛾 as induced subgraph | 60
𝛾̄′ ⊆ 𝛾̄ | 𝛾̄′ is a subgraph of 𝛾̄ | 61
𝛾̄′ ⊆𝑐 𝛾̄ | 𝛾̄′ is a closed subgraph of 𝛾̄ | 61
𝛾̄′ ⊆𝑖 𝛾̄ | 𝛾̄′ is an induced subgraph of 𝛾̄ | 61
Aut(𝛾′) | the automorphism group, i.e. the group of all relabelings of 𝛾′ | 61
𝐴𝑀𝐺 | an ambiguous molecular graph | 62
AMG𝑛 | the set of ambiguous molecular graphs with 𝑛 atoms | 62
P(𝛺) | the power set of 𝛺 | 62
P⋆(𝛺) | the power set of 𝛺 without the empty set | 62
𝑛 | the set {1, . . . , 𝑛} | 62
A | stands for an atom in an ambiguous molecular graph | 63
Q | stands for a hetero atom in an 𝐴𝑀𝐺 | 63

𝐴𝑀𝐺 ⊆𝜙 𝑀 | 𝐴𝑀𝐺 is an ambiguous molecular subgraph of 𝑀 with respect to 𝜙 | 64
𝐴𝑀𝐺 ⊆𝑖𝜙 𝑀 | 𝐴𝑀𝐺 is an induced ambiguous molecular subgraph of 𝑀 with respect to 𝜙 | 64
𝑆𝑅 | a substructure restriction | 64
SR𝑘 | the set of substructure restrictions on 𝑘 atoms | 64
[𝑎, 𝑏] | the interval {𝑎, 𝑎 + 1, ..., 𝑏} of natural numbers | 64
𝑆𝑅Dist {𝑖,𝑗},[𝑎,𝑏] | the substructure restriction distance | 64
𝜂 | a hybridization | 65
𝑆𝑅Hybrid {𝑖𝑗 |𝑗∈ℎ},𝜂 | a substructure restriction hybridization | 65
𝑆 | a molecular substructure | 65
S𝑘 | a set of molecular substructures on 𝑘 atoms | 65
𝑆 ⊆𝜙 𝑀 | 𝑆 is a molecular substructure of 𝑀 w.r.t. 𝜙 | 65
𝑆 ⊆𝑖𝜙 𝑀 | 𝑆 is an induced molecular substructure of 𝑀 w.r.t. 𝜙 | 65
Emb ⊆ (𝑆, 𝑀) | the embeddings of 𝑆 as molecular substructure | 65
Emb ⊆𝑖 (𝑆, 𝑀) | the embeddings of 𝑆 as induced mol. substructure | 65
𝐶 | a chemical reaction | 66
CR𝑛 | the set of chemical reactions on 𝑛 atoms | 66
Δ𝐶 | a change-of-reaction-graph | 66
Δ𝜁 | a change-of-states-graph | 66
Δ𝛾 | a change-of-bonds-graph | 66
ΔZ | the set of changes of states | 66
G[−3,3],𝑛 | the set of changes of bonds | 66
ΔCR𝑛 | the set of changes of reactions on 𝑛 atoms | 66
𝑎 ∨̇ 𝑏 | means ‘either or’ for two Boolean expressions | 66
Δ𝐶 ∘ 𝑀 | the application of Δ𝐶 to 𝑀 | 66
Cen(𝐶) | the center of reaction of 𝐶 | 67
RCG(𝐶) | the reaction-center graph of 𝐶 | 67
𝑅 | a reaction scheme | 68
R𝑘 | the set of reaction schemes on 𝑘 atoms | 68
𝑅 ∘𝜙 𝑀 | the application of 𝑅 to 𝑀 w.r.t. 𝜙 | 68
Prod𝑅(𝑀) | the set of product graphs arising from an application of 𝑅 to 𝑀 | 68
𝜉 | 3D placement of a molecule | 72
ℝ3 | the three-dimensional real vector space | 72
SE | the steric energy | 73
M𝑛 | the set of the labeled molecular graphs with 𝑛 atoms | 77
𝐷̄ | a molecular descriptor | 77
𝐷 | a descriptor | 77
𝐴 | the number of atoms in a molecular graph | 78
𝑀𝑊 | the molecular weight | 78

𝑚𝑒𝑎𝑛𝐴𝑊 | the mean atomic weight | 78
𝑁𝑋 | the number of 𝑋 atoms | 79
𝑟𝑒𝑙. 𝑁𝑋 | the relative number of 𝑋 atoms | 79
𝑐ℎ𝑎𝑟𝑔𝑒 | the total charge | 79
𝑟𝑎𝑑 | the number of radical sites | 79
𝐵 | the number of bonds in a molecular graph | 79
𝑛−, 𝑛=, 𝑛# | the number of single, double, triple bonds | 79
𝐶 | the cyclomatic number | 79
𝐿(𝛾) | the multiset of lines of 𝛾 | 80
𝐵(𝛾) | the set of bonds in 𝛾 | 80
𝑣(𝛾̄) | the partition of valences of 𝛾̄ | 80
𝑏(𝛾̄) | the partition of bond degrees of 𝛾̄ | 80
⊴ | the dominance order for number partitions | 81
𝛼𝑖′ | a column length in a Young diagram | 82
𝜇𝑘 | the no. of nodes connected with 𝑖 by a 𝑘-fold bond | 84
hyb𝛾(𝑖) | the hybridization of 𝑖 in 𝛾 | 84
𝑡𝑤𝑐 | the total walk count in the bond graph | 84
𝑚𝑤𝑐(𝑙) | the count of walks of prescribed length 𝑙 | 84
𝑃, 𝑃(𝑙) | the count of paths, of paths of length 𝑙 | 84
rings, rings(𝑙) | the count of rings, of rings of size 𝑙 | 84
D𝛾 | the distance matrix of 𝛾 | 84
𝑀1, 𝑀2 | Zagreb indices | 85
Ãut(𝛾) | the conjugacy class of the automorphism group of 𝛾 | 86
𝐺1 | the gravitational index (pairs, 3D-dist.) | 89
𝐺2 | the gravitational index (bonds, 3D-dist.) | 89
𝑉𝑣𝑑𝑤 | the van der Waals volume | 89
𝜌𝑣𝑑𝑤 | the density by van der Waals volume | 89
𝑉𝑠𝑣𝑑𝑤 | the standardized van der Waals volume | 89
𝑉𝑐𝑢𝑏 | the enclosing cuboid | 89
𝑆𝑣𝑑𝑤 | the van der Waals surface | 89
𝑆𝐴𝑆𝐻2𝑂 | the solvent-accessible surface (H2O) | 89
𝑆𝐴𝑆𝐻 | the solvent-accessible surface (H) | 89
𝐷3𝐷 | the geometrical diameter | 89
𝑉𝑠𝑝ℎ𝑒𝑟𝑒 | the enclosing sphere | 89
x, y, . . . | vectors in ℝ3 | 93
𝑥𝑖 | the 𝑖th component of the vector x ∈ ℝ3 | 93
x⋅y | the scalar product of the vectors x and y | 94
𝑑(x, y) | the Euclidean distance of the vectors x and y | 94
O3 | the group of linear isometries of ℝ3, or of 3-dimensional orthogonal matrices over ℝ | 98
SO3 | the subgroup of proper rotations in O3 | 98
P | the point group of a molecule | 98

𝐸 | the symmetry element of the identity operation, in Schoenflies notation | 99
𝜎 | the symmetry element of a reflection, in Schoenflies notation | 99
𝐶𝑛 | the symmetry element of an 𝑛-fold proper rotation, in Schoenflies notation | 99
𝑖 | the symmetry element of an inversion, in Schoenflies notation | 99
𝐸̂, 𝜎̂, 𝐶̂𝑛, 𝑆̂𝑛, 𝑖̂ | corresponding symmetry operations | 100
𝑇 | the subgroup of proper rotations in 𝑇𝑑 | 102
sign | a sign map | 104
𝐺sign | the kernel of the sign map | 104
𝛿 | a distribution of substituents over a skeleton | 107
𝜌 | an element of the point group | 107
𝐷(𝜌) | the matrix representing the linear mapping 𝜌 | 107
𝑌𝑐ℎ | the subset of chiral types of substituents in 𝑌 | 108
𝑌𝑎𝑐 | the subset of achiral types of substituents in 𝑌 | 108
Δ(P × P) | the diagonal subgroup of P × P | 108
𝑐𝑜(𝜌̄) | the number of cycles of odd length in 𝜌̄ | 109
𝑐𝑒(𝜌̄) | the number of cycles of even length in 𝜌̄ | 109
𝑌𝑟𝑋 | the set of mappings with racemic content | 113
𝑈̃ | the class of subgroups conjugate to 𝑈 | 118
𝐺\\𝑈̃𝑋 | the stratum of 𝑈 on 𝑋, with respect to 𝐺𝑋 | 118
𝑖𝐻 | the number of 𝐻-invariant elements of 𝑋 | 119
𝑠𝐻 | the number of elements with stabilizer 𝐻 | 119
𝜁(−, −) | the zeta function on a lattice of subgroups | 119
𝜁(𝐺) | the zeta matrix of 𝐺 | 119
𝜇(−, −) | the Möbius function on a lattice of subgroups | 120
𝜇(𝐺) | the Möbius matrix of 𝐺 | 120
𝑜𝐾̃ | the number of orbits with stabilizer class 𝐾̃ | 120
M(𝐺) | the table of marks of the group 𝐺 | 120
B(𝐺) | the Burnside matrix of the group 𝐺 | 121
𝑌𝑐𝑋 | the set of distributions of content 𝑐 | 127
𝑉 | the volume determinant function | 135
det | the determinant function | 135
𝜒 | the orientation function | 135
sign | the sign function | 135
𝐴𝑛 | the alternating group on 𝑛 | 138
𝑃𝑂𝐹 | partial orientation function | 142
rep<(𝐺\\𝑋) | the transversal of minimal orbit representatives | 165
I(ℕ) | the set of intervals of natural numbers | 181
𝐵 | a fuzzy molecular formula | 181

B𝐵 | the set of molecular formulas compatible with 𝐵 | 181
L | a set of molecular graphs | 191
𝑚𝑛≤ | the set of weakly increasing mappings from 𝑛 to 𝑚 | 191
depthR | the depth for a reaction scheme | 193
cha(𝑀) | the sum of the charges of the atoms in 𝑀 | 193
depthL | the depth for reactants | 195
size(𝑀) | the number of atoms of 𝑀 | 195
𝐺𝑆 | generic structural formula | 200
𝑋𝑗 | an independent variable (prediction variable) | 221
𝑌 | a dependent variable (target variable) | 221
𝑥𝑖𝑗 | the value of prediction variable 𝑋𝑗 for observation 𝑖 | 221
𝑦𝑖 | the value of target variable 𝑌 for observation 𝑖 | 221
X | an 𝑚 × 𝑛-matrix with entries 𝑥𝑖𝑗 | 221
x𝑖 | the 𝑖-th row vector of X | 221
Y | the 𝑚 × 1-matrix with entries 𝑦𝑖 | 221
𝑅𝑆𝑆 | residual sum of squares | 222
C | a finite set | 222
𝐿 | a cost function | 223
ℝ+0 | the set of nonnegative real numbers | 223
𝛿 | Kronecker’s delta function | 223
𝑇𝐶𝐸 | total classification error | 223
𝐶𝐸(𝑘) | classification error for class 𝑘 | 223
𝑅 | multiple correlation coefficient | 224
𝑦̄ | the arithmetic mean of the values 𝑦𝑖 of 𝑦 | 224
𝑅2 | coefficient of determination (of a regression) | 224
𝑆 | the standard error (of a regression) | 224
𝑑 | the number of degrees of freedom | 225
𝐹 | the empirical 𝐹 value (of a regression) | 225
𝑀𝐶𝐸 | mean classification error | 225
𝑀𝐶𝐸(𝑘) | the MCE for class 𝑘 | 225
𝐿𝑆 | learning set | 225
𝑇𝑆 | test set | 225
𝐴 ∪̇ 𝐵 | the disjoint union of two sets 𝐴 and 𝐵 | 225
𝑓𝐿𝑆 | the prediction function for the learning set | 225
𝑅𝑆𝑆𝑇𝑆 | 𝑅𝑆𝑆 for the test set | 225
𝑅2𝑇𝑆 | the coefficient of determination for the test set | 225
𝑇𝐶𝐸𝑇𝑆 | 𝑇𝐶𝐸 for the test set | 226
𝑀𝐶𝐸𝑇𝑆 | 𝑀𝐶𝐸 for the test set | 226
𝐶𝐸(𝑘)𝑇𝑆 | the error of classification for class 𝑘 in the test set | 226
𝑀𝐶𝐸(𝑘)𝑇𝑆 | 𝑀𝐶𝐸 for class 𝑘 in the test set | 226
𝐴\𝐵 | the set theoretic difference of two sets 𝐴 and 𝐵 | 226
𝑅𝑆𝑆𝑘𝐶𝑉 | RSS for 𝑘-fold cross-validation | 226

𝑇𝐶𝐸𝑘𝐶𝑉 | TCE for 𝑘-fold cross-validation | 226
𝑅𝑆𝑆𝐶𝑉 | RSS for LOOCV | 226
𝑅2𝐶𝑉 | coefficient of determination for LOOCV | 227
𝑆𝐶𝑉 | standard error for LOOCV | 227
𝑇𝐶𝐸𝐶𝑉 | TCE for LOOCV | 227
𝑀𝐶𝐸𝐶𝑉 | MCE for LOOCV | 227
‖.‖2 | the Euclidean norm | 228
𝑅(𝑋, 𝑍) | the correlation coefficient of 𝑋 and 𝑍 | 228
𝐹𝑅(𝑋, 𝑌) | the Fisher Ratio of 𝑋 and 𝑌 | 230
diag(𝑎𝑖) | the diagonal matrix with entries 𝑎𝑖 | 233
𝑁𝑘(x) | the set of 𝑘 nearest neighbors of x | 238
Da | the unified atomic mass unit or dalton | 244
deg𝑑𝛾(𝑖) | the distance degree of node 𝑖 in 𝛾 | 245
deg𝑣𝑀(𝑖) | the valence degree of atom 𝑖 in 𝑀 | 246
𝐻𝐶𝑀(𝑖) | the no. of H-atoms in 𝑀 being neighbors of atom 𝑖 | 246
Å | Ångström | 247
𝛹 | a QSPR function | 251
𝑚ℎ𝑟𝑅2 | the mean highest random 𝑅2 | 259
ln | the natural logarithm | 274
𝐼 | a mass spectrum, isotope distribution | 306
𝑚̂ | maximum 𝑚/𝑧 with intensity > 0 | 306
𝑚̌ | minimum 𝑚/𝑧 with intensity > 0 | 306
𝑚̃ | 𝑚/𝑧 of maximum intensity, basemass | 306
𝑃 | a peak of a mass spectrum | 306
𝑃̃ | a peak of highest intensity, base peak | 306
𝑃̂ | a peak of highest mass | 306
𝑚𝑋 | an isotope of mass 𝑚 of 𝑋 | 307
𝐼𝑋 | the natural isotope distribution of 𝑋 | 307
𝑚̂𝑋 | the highest isotope mass of 𝑋 | 307
𝑚̃𝑋 | the nominal mass of 𝑋 | 307
𝑚̌𝑋 | the smallest isotope mass of 𝑋 | 307
I | the set of isotope distributions | 310
𝐼𝛽 | the theoretical isotope distribution of 𝛽 | 310
𝑚𝛽 | the nominal mass of 𝛽 | 310
𝑚̃𝛽 | the mass with highest intensity of 𝛽 | 310
𝑚̂𝛽 | the highest mass of 𝛽 | 310
𝑚̌𝛽 | the lowest mass of 𝛽 | 310
𝐾 | a candidate molecular or structural formula | 315
MV(𝐼, 𝐾) | the match value for 𝐾 with respect to 𝐼 | 315
𝐾𝑇 | the true candidate for a molecular/structural formula | 315
𝐾𝐹 | a false candidate for the molecular/structural formula | 316
𝑞𝑝 | a 𝑝-quantile | 323

𝑅𝑅𝑃0 | relative ranking position (w.r.t. better candidates) | 324
𝐵𝐶 | the number of better candidates (w.r.t. 𝐾𝑇) | 338
𝑊𝐶 | the number of worse candidates (w.r.t. 𝐾𝑇) | 338
𝑇𝐶 | the total number of candidates | 338
𝑅𝑅𝑃1 | relative ranking position (w.r.t. worse candidates) | 338
𝑅𝑅𝑃 | relative ranking position (mean) | 338
𝛷 | an MS-classifier | 338
𝑚̄𝑋 | the average atom mass of 𝑋 | 363
E11 \ E4 | the set difference of E11 and E4 | 372
𝑁𝐷𝑃 | normalized dot product | 375
𝑁𝑆𝐴𝐸 | normalized sum of absolute errors | 375
𝑁𝑆𝑆𝐸 | normalized sum of squared errors | 375
||x||1 | 𝐿1-norm of x | 376
𝐴𝑅𝑃 | absolute ranking position | 378
𝐸𝐶 | number of equal candidates (w.r.t. 𝐾𝑇) | 378
𝐾OW | the octanol–water partition coefficient | 393
𝑅𝐼𝑥 | retention index of compound 𝑥 | 399

Introduction and outline

Molecules are not easy to handle in silico because of constitutional isomerism, mesomerism, tautomerism, stereoisomerism, chirality and other phenomena. Roughly speaking, this means that molecules are not easily described unambiguously. Precise models are required, since computers can obey orders very well but cannot read our minds. It is not sufficient to describe a molecule by its molecular formula or by a list of covalent bonds between the atoms alone. Even the constitution is not always sufficient, since stereochemistry is sometimes needed, for example when pharmaceutical properties are considered.

The basic problems

Molecules are entities consisting of a set of atoms that are held together by interactions between these atoms. Thus, the first step towards a mathematical model of a molecule is the arithmetical description using a molecular formula, e.g. C6H6, which describes a set of atoms, six carbon and six hydrogen atoms, that are able to form certain molecules. This formula does not suffice to identify a unique compound. The famous benzene ring is only one of altogether 217 mathematically possible interaction models consisting of six carbon atoms (with valence 4) and six hydrogen atoms (with valence 1), and several of these models correspond to stable molecules with various properties. Alexander von Humboldt (1769–1859) already noted this ambiguity in vol. I of his book [136], published in 1797. We quote from the footnote on pages 127/128:

Drei Körper a, b und c können aus gleichen Quantitäten Sauerstoff, Wasserstoff, Kohlenstoff, Stickstoff und Metall zusammengesetzt und in ihrer Natur doch unendlich verschieden seyn.
[Three bodies a, b and c may be composed of equal quantities of oxygen, hydrogen, carbon, nitrogen and metal and yet be infinitely different in their nature.]

In other words, Humboldt claimed that chemical compounds (‘Körper’) may contain the same quantities of oxygen, hydrogen, carbon, nitrogen and metals, but differ widely in their nature. Moreover, he wrote in the same footnote that different ‘Umhüllungen’ (surroundings) of the constituents were responsible for this phenomenon, and he even used the word ‘Bindung’ (bond) for this. He mentioned that the chemical knowledge of his days did not yet provide an explanation:

Was ich Umhüllung nenne, mag sich also wohl auf den allgemeinen Begriff der Bindung reduciren; unsere chemischen Kenntnisse sind aber noch nicht vervollkommnet genug, um aus dem, was wir von den Affinitäten und dem Ineinanderwirken der Stoffe wissen, jene Erscheinungen erklären zu können.
[What I call ‘Umhüllung’ may thus well reduce to the general concept of bonding; but our chemical knowledge is not yet perfected enough to explain those phenomena from what we know of the affinities and the interactions of substances.]

A quarter of a century later, in 1824 and 1827, F. Wöhler and J. von Liebig (whom Humboldt had recommended for a professorship at Gießen when Liebig was 21 years old) found two compounds with the same molecular formula CHNO but different properties [70]. Further cases were discovered, proving Humboldt’s prediction to be true, and in 1830 J. J. Berzelius recognized this as a general phenomenon and called it isomerism, apparently without knowing Humboldt’s claim.

The existence of isomerism means that higher precision is needed in distinguishing compounds. The next level of accuracy is the topological level. The corresponding model of organic molecules is a graph theoretic interaction model, expressing a molecule in terms of a structural formula or a molecular graph that indicates the interactions between the atoms. In the case of the benzene ring, the graph theoretic description is

[Structural formula of the benzene ring: six C atoms in a ring, each bearing one H atom.]

In mathematical terms, this is a connected multigraph with nodes colored by atom names, consisting of six nodes of valence 4 representing carbon atoms, and six nodes of valence 1 representing hydrogen atoms. The bonds, called covalent bonds, express pairwise interactions between atoms; their multiplicities (single or double in the case of the benzene ring) express the strength of the respective interaction. In reality the situation is more complicated, especially where aromaticity is present. We will ignore this for the moment and revisit it later.

Roughly speaking, we may say that a molecular graph corresponding to a molecular formula is a connected and colored multigraph. Its nodes are colored with atom names and atom states, according to the molecular formula, and the bonds between certain nodes indicate pairwise interactions between these atoms. Molecule generators such as MOLGEN produce a total of 217 molecular graphs when given the chemical formula C6H6 and the default values 4 and 1 for the valences of carbon and hydrogen. To see them, visit the MOLGEN homepage at http://www.molgen.de, press ‘MOLGEN–ONLINE’, click on Example1, click 2D, enter C6H6, and have a look at some of the resulting molecular graphs. Which one is the benzene ring?
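As a minimal sketch of this description, the benzene multigraph can be written down as a table of bond multiplicities and checked against the prescribed valences. The node numbering (carbons 0–5, hydrogens 6–11), the dictionary layout and the helper names are illustrative assumptions and say nothing about how MOLGEN stores molecular graphs internally.

```python
from collections import Counter, deque

# Node colors: carbons are nodes 0-5, hydrogens are nodes 6-11 (arbitrary labeling).
atoms = ["C"] * 6 + ["H"] * 6
VALENCE = {"C": 4, "H": 1}  # default valences quoted in the text

# Bond multiplicities of one Kekule structure of benzene.
bonds = {}
for i in range(6):
    bonds[frozenset((i, (i + 1) % 6))] = 2 if i % 2 == 0 else 1  # ring bonds
    bonds[frozenset((i, i + 6))] = 1                              # C-H bonds


def valence(node):
    """Sum of the multiplicities of all bonds incident with the node."""
    return sum(m for pair, m in bonds.items() if node in pair)


def is_connected():
    """Breadth-first search over the bond graph, ignoring multiplicities."""
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for pair in bonds:
            if u in pair:
                (v,) = pair - {u}
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return len(seen) == len(atoms)


def molecular_formula():
    """Molecular formula read off from the node colors."""
    counts = Counter(atoms)
    return "".join(f"{x}{counts[x]}" for x in sorted(counts))


valences_ok = all(valence(i) == VALENCE[a] for i, a in enumerate(atoms))
print(molecular_formula(), is_connected(), valences_ok)
```

Renumbering the twelve nodes changes the dictionary but not the molecule, which is exactly the labeling problem taken up later in this introduction.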


Thus, there are 217 mathematically possible connectivity or constitutional isomers or constitutions that have the molecular formula C6H6. Among these are, for example, exactly six isomers of the form (CH)6. Along with benzene we have

[Structural formulas of the five further (CH)6 isomers.]

which are called, from left to right: 3,3′-bicyclopropenyl, Dewar benzene, benzvalene, prismane and tetracyclo[2.2.0.0^2,5.0^3,6]hexane.

The next level of precision is the geometrical level. Energy models allow placements of connected atoms in 3D space. Go to the homepage of MOLGEN again, via http://www.molgen.de, press ‘MOLGEN–ONLINE’, click on Example1, click 3D, submit C6H6, and have a look at some of the 3D placements. They can be moved in space with the cursor. Which one is prismane?

The application of an energy model shows, for example, that fewer than 70 of the 217 C6H6 structures are reasonable in the sense that 3D models containing common bond lengths, bond angles etc. can be built. For seven of these structures, two distinctly different 3D realizations are possible rather than a single one, see Figure 1. This phenomenon is called stereoisomerism. In five of these seven cases the two stereoisomers are mirror images of each other and are thus enantiomorphic. This phenomenon of non-identical mirror images is called chirality. In the remaining two cases each stereoisomer is its own mirror image, i.e. it is achiral. The difference between two stereoisomers of this kind is in the geometrical arrangement around a rigid part of the molecule (double bond, ring).

Furthermore, for each of the stereoisomers in the second-last line of Figure 1, two alternative geometric arrangements arise from rotation about a single bond, as shown in Figure 2. Stereoisomers of this kind, called conformers, are often not distinguished, since they usually interconvert under normal conditions and therefore cannot be isolated. 3D placements are usually obtained by an optimization algorithm, and so they are local energy minima, which are difficult to classify. Recently, alternative discrete methods have been developed that on further elaboration may allow the construction of all stereoisomers in many cases [102], see Chapter 4.
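The count of six (CH)6 constitutions quoted above can be checked by brute force on the H-suppressed graphs: each carbon keeps three valences for bonds to other carbons, so one looks for connected loop-free multigraphs on six nodes in which every node has degree 3, counted up to renumbering. The sketch below is a naive enumeration written for this one case; the function names and the orbit-marking strategy are illustrative assumptions and not the generator described in the book.

```python
from itertools import combinations, permutations

N = 6        # six CH units; the hydrogen atoms are suppressed
FREE = 3     # valences left on each carbon after bonding its hydrogen
PAIRS = list(combinations(range(N), 2))
INDEX = {pair: k for k, pair in enumerate(PAIRS)}
PERMS = list(permutations(range(N)))


def labeled_graphs(k=0, deg=None, mult=None):
    """Yield all multiplicity vectors (one entry per node pair) in which
    every node ends up with exactly FREE bonds, by simple backtracking."""
    if deg is None:
        deg, mult = [0] * N, [0] * len(PAIRS)
    if k == len(PAIRS):
        if all(d == FREE for d in deg):
            yield tuple(mult)
        return
    i, j = PAIRS[k]
    for m in range(FREE + 1):
        if deg[i] + m > FREE or deg[j] + m > FREE:
            break
        deg[i] += m
        deg[j] += m
        mult[k] = m
        yield from labeled_graphs(k + 1, deg, mult)
        deg[i] -= m
        deg[j] -= m
    mult[k] = 0


def is_connected(g):
    """Depth-first search over node pairs with nonzero multiplicity."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(N):
            if v not in seen and g[INDEX[tuple(sorted((u, v)))]] > 0:
                seen.add(v)
                stack.append(v)
    return len(seen) == N


def relabel(g, perm):
    """Apply a renumbering of the nodes to a multiplicity vector."""
    h = [0] * len(PAIRS)
    for (i, j), m in zip(PAIRS, g):
        h[INDEX[tuple(sorted((perm[i], perm[j])))]] = m
    return tuple(h)


def orbit_representatives(graphs):
    """Keep one labeled graph per class of graphs equal up to renumbering."""
    seen, reps = set(), []
    for g in graphs:
        if g not in seen:
            reps.append(g)
            seen.update(relabel(g, perm) for perm in PERMS)
    return reps


ch6 = orbit_representatives(g for g in labeled_graphs() if is_connected(g))
# Number of classes, and the number of double bonds in each representative.
print(len(ch6), sorted(g.count(2) for g in ch6))
```

Under these assumptions the script should report six classes, in agreement with the text; the number of double bonds per class separates benzene (three) from prismane (none), for instance. Marking whole orbits works only because the example is tiny.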
Fig. 1. Seven pairs of C6H6 stereoisomers.

Fig. 2. Two pairs of C6H6 conformers.

Taking into account the phenomena of isomerism and chirality, we have to deal with the following main problems:
– On the arithmetical level, we need to deduce all the possible molecular formulas for a given set of atoms with prescribed valences.
– On the topological level, we would like to construct all the molecular graphs that correspond to a given molecular formula, the connectivity or constitutional isomers.
– On the geometrical level, we need to construct all the stereoisomers that belong to a given connectivity isomer (with or without distinguishing conformers).
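For the arithmetical level, a rough sketch is to list all element counts over {H, C, N, O} with a given nominal mass and to discard those that fail simple necessary conditions for a connected, loop-free multigraph. The sketch assumes the standard valences 4, 1, 3 and 2, integer nominal masses, and the usual double bond equivalent 1 + Σ n_X (v_X − 2)/2. The filter is necessary but not sufficient, and the book's own existence criterion may be stated differently.

```python
MASS = {"C": 12, "H": 1, "N": 14, "O": 16}   # nominal masses
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2}   # standard valences (assumed)


def dbe(counts):
    """Double bond equivalent, or None if a necessary condition fails."""
    present = [x for x, n in counts.items() if n > 0]
    total = sum(VALENCE[x] * counts[x] for x in present)
    n_atoms = sum(counts.values())
    largest = max(VALENCE[x] for x in present)
    excess = total - 2 * (n_atoms - 1)       # valences beyond a spanning tree
    # Reject: too few valences to connect, odd valence sum, or one atom
    # needing more bonds than all other atoms can offer.
    if excess < 0 or excess % 2 or largest > total - largest:
        return None
    return excess // 2


def name(counts):
    """Formula string in the order C, H, N, O; counts of 1 are omitted."""
    return "".join(f"{x}{n if n > 1 else ''}" for x, n in counts.items() if n > 0)


def formulas(mass):
    """Map each surviving formula of the given nominal mass to its DBE."""
    found = {}
    for c in range(mass // MASS["C"] + 1):
        for n in range((mass - MASS["C"] * c) // MASS["N"] + 1):
            rest = mass - MASS["C"] * c - MASS["N"] * n
            for o in range(rest // MASS["O"] + 1):
                counts = {"C": c, "H": rest - MASS["O"] * o, "N": n, "O": o}
                value = dbe(counts)
                if value is not None:
                    found[name(counts)] = value
    return found


candidates = formulas(78)
print(len(candidates), candidates["C6H6"])
```

Calling `formulas(78)` returns C6H6 among other candidates of nominal mass 78; whether each surviving formula really admits a molecular graph still has to be decided on the topological level.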

There are, of course, intermediate steps. For example, the formula C2H5OH is more than a molecular formula, since it says that the associated molecule contains a hydroxyl group −O−H. We shall discuss some of the intermediate steps later.

A preliminary step towards a solution of these problems would be to count particular isomers. The notion of enumeration covers both counting and constructing. Chemical structure enumeration has been studied by mathematicians, computer scientists and chemists for quite a long time. Given a molecular formula plus (optionally) a list of structural constraints, the typical questions are:
– How many isomers exist?
– What are they?
And, especially if that cannot be answered completely:
– How can we obtain a sample of these isomers?

In this book, we shall describe algorithms for solving these problems. The techniques are based on the representation of chemical compounds as molecular graphs, i.e. they are mainly applied to constitutional isomers.

The major problem is that in silico molecular graphs have to be represented as labeled structures, i.e. the nodes of the graphs are numbered, while in chemical compounds the atoms are not labeled. The mathematical concept for this problem is to consider labeled graphs that arise from one another by renumbering the nodes as equivalent. In order to obtain these equivalence classes, we describe them as orbits of labeled molecular graphs under the action of a symmetric group. We shall count the number of these orbits, the number of these orbits with given content and the number of orbits of a given symmetry type. The final aim is the efficient construction of a complete system of representatives of these orbits that is free of duplicates, so that we also have to solve the isomorphism problem, i.e. we need a description in a canonical form. The result is a molecular generator that constructs a complete and redundancy-free set of structural formulas for a given molecular formula. A similar approach can be used for the construction of permutational isomers; stereoisomers will also be considered.

According to our introductory questions, we shall distinguish several steps in our approach to the isomerism problem: counting, constructing and sampling isomers. While counting only delivers the number of isomers, the other two are constructive methods. Enumeration is typically exhaustive and non-redundant, whereas sampling typically lacks these characteristics. However, sampling methods are sometimes better suited to solve real-world problems, since ‘small’ molecular formulas can already have ‘astronomic’ numbers of connectivity isomers. This leads to the consideration of probabilistic methods that can be used for generating isomers uniformly at random.

There is a wide range of applications where these techniques are helpful or even essential. The main methods and their applications are:
– Counting techniques deliver pure chemical information; they can help to estimate or even determine sizes of chemical databases or compound libraries that can be obtained in combinatorial chemistry experiments.
– Constructive methods are essential to structure elucidation systems. They are used to generate structures that fulfill structural restrictions obtained from chemical analysis in a pre-generation step.
– Combinatorial libraries that contain candidate structures for virtual screening and Quantitative Structure–Property Relationships (QSPRs) can be produced and used along with structure enumeration or sampling as rudimentary approaches towards inverse QSPRs. De novo design algorithms often have their roots in conventional structure generation.
– Non-quantitative aspects of reaction network generation are also based on methods similar to those used for isomer enumeration.
– Patent libraries produced in a canonical way can be compared easily and checked for overlaps.

Examples 2, 3, 4 and 5 in http://molgen.de/?src=documents/molgenonline show how such libraries can be generated. Try to obtain your own favorite molecular library.

It is our particular aim to describe methods that solve these problems and to show how generators such as MOLGEN can help in all of these situations. In order to fulfill this purpose, we first have to describe the mathematical model used. It is basically a graph theoretical model and so we provide a detailed description of molecular graphs first.


Outline

Mathematical models are indispensable in organic chemistry. The aim of synthesis is the production of new compounds via known or new reactions, while analysis looks for compounds and their properties (chemical, biological, medicinal etc.). Once a compound is discovered, another important task is the elucidation of its molecular structure, using spectroscopic methods. Both are motivated by the search for chemical agents with prescribed features. Another modern technique for the same purpose is combinatorial chemistry, which uses a given set of chemical building blocks to form many of their combinations, which are then screened for their biological or pharmaceutical properties. The resulting compounds form a molecular library. The screening can be automated and parallelized; nevertheless, the cost and time needed require exact planning and automatic evaluation of results.

The optimization of experiments in silico – usually done in advance – raises many questions for mathematical modeling: the use of algebraic and combinatorial algorithms for the simulation and evaluation of the possible output of experiments in combinatorial chemistry, the construction of the occurring molecular graphs, the use of graph invariants (molecular descriptors), and the use of statistical learning in order to evaluate the possible result. The most involved steps that we shall describe in detail are:
– Molecular structure generation. Combinatorial chemistry requires the generation of virtual molecular libraries, usually defined by given reactants and reactions. For this reason we shall describe algorithms for reaction-based structure generation.
– Molecular structure elucidation. Computer-aided structure elucidation (CASE) uses algorithms that construct all mathematically possible structural formulas for a given molecular formula and optional structural restrictions (often obtained from a spectrum). This has to be performed efficiently and without redundancy (i.e. no duplicates allowed). Virtual spectra can be calculated for generated structures and compared with the experimental spectrum to rank the generated structure candidates. The corresponding algorithms that we need for such a formula-based structure generation will be described.
– Canonization of molecular structures. Often two or more seemingly different molecular graphs represent one and the same chemical compound. In particular, the atoms in a molecule can be numbered in various ways, which may lead to problems in compound identification. To avoid such problems, structural formulas have to be generated in a canonized data structure, so that two libraries are easily compared to detect overlaps.
– Quantitative Structure–Property Relationships. These can be obtained from a library of real molecules that have already been synthesized and screened, contained within the virtual library. Molecular descriptors and statistical learning methods are tools to establish a QSPR from the data of a real library. The QSPR is then used to predict property values for the ‘unknown’ members of a virtual library.

The emphasis in this book lies in combining these methods to solve various problems in organic chemistry, divided into several chapters. A brief description of the contents of the chapters is given below.

1 Basics of graphs and molecular graphs

This introductory chapter is devoted to building models in organic chemistry. Chemical compounds are described by molecular graphs, multigraphs, where the nodes represent the atoms of the molecule, the bonds between atoms are visualized by lines and the multiplicity indicates the strength of the interaction. Each node is colored by the element symbol of the atom, together with an atom state. The state is a quadruple, consisting of the valence, the number of free electron pairs, the charge, and information about the existence of an unpaired electron. A unique relationship between chemical compounds and colored multigraphs is obtained by identifying structural formulas of chemical compounds with equivalence classes of colored multigraphs. These equivalence classes can be characterized as orbits of a symmetric group, and this opens a door to counting, constructing and generating structural formulas, i.e. constitutional isomers corresponding to a given molecular formula and (optional) further conditions. The basic mathematical definitions, results, tools and methods are described in this chapter.

2 Advanced properties of molecular graphs

In this chapter, the graphical model used for the description of molecules is extended to describe chemical reactions. Reaction schemes, together with changes in atom states or bond multiplicities caused by the chemical reaction, form a suitable syntax for the graph-theoretical description of chemical reactions and the corresponding computer simulation. Extensions of the molecular model to describe mesomerism and account for geometrical aspects are discussed, along with the existence of compounds and the levels of abstraction of the present model. Molecular descriptors are introduced, for use in later chapters. The embedding of molecules into 3D space using force fields is mentioned briefly.


3 Chirality

In this chapter we enter the geometrical level. A particularly interesting geometrical aspect of molecules is chirality, which requires the description of further methods, aspects and difficulties concerning 3D placements of molecules in space. We describe the enumeration of permutational isomers in detail, extending the description of Pólya’s methods for counting multigraphs. Constructive aspects are mentioned, existence problems are discussed and a method for the computation of isomer numbers is demonstrated.

4 Stereoisomers

This chapter describes the implementation of the discrete mathematical techniques of A. Dreiding and A. Dress by R. Gugisch, which allows the evaluation of stereoisomers. It uses the notion of an oriented matroid in particular, while the basic approach is the evaluation of an orientation function. The chemical realizability of an orientation function and the corresponding realization function itself are still open problems.

5 Molecular structure generation

Chapter 5 introduces and describes the generation of molecular graphs with given structural properties in an efficient, redundancy-free and canonic way. Two approaches are considered: formula-based molecular generation to generate all structural formulas for a given (optionally fuzzy) molecular formula and reaction-based molecular generation to generate all products for given reactions and reactants. The concepts of orderly generation and target oriented generation (for inconsistent restrictions) are introduced for cases such as the generation of combinatorial libraries and of patent libraries. For the attachment of different ligands to a central molecule, the methods of T. Wieland [337] can be applied, using double cosets and orderly generation. However, successive use of several reactions is often not equivalent to attaching ligands to a central molecule. Ring closure, rearrangement and decomposition may lead to various reaction networks. A general construction algorithm that runs through a reaction network and solves this problem by numbering the generated products canonically is derived. The final sections contain information on a canonizer that obtains the molecules in a canonic form, which is essential for the generation and comparison of molecular libraries for possible overlap (patents etc.). We also describe the data structures used in MOLGEN.

6 Supervised statistical learning

In Chapter 6 we describe the basic principles of supervised statistical learning and show how it can be used in computer chemistry when a causal connection between structure and property is not known, or can only be calculated with extremely high effort. Such problems occur quite often in combinatorial chemistry as well as in molecular structure elucidation.

In supervised statistical learning we train a predicting function using known examples, and look for a predicting function fitting the known cases. The quality of the predicting function can be checked by resubstitution, by using a test sample, or via cross-validation. It may make sense to use centering, range scaling or autoscaling before the learning process takes place. In order to avoid overfitting, it is important to restrict the number of predictors. For variable selection, one can use correlation analysis or either complete or stepwise searches for proper subsets of the set of predictors.

In the last few decades, several methods for the training of various types of predicting functions [117] were developed using inferential statistics. Most important are linear models, artificial neural networks, support vector machines, classification and regression trees and the method of k nearest neighbors.

These methods complete the set of mathematical tools that we shall use in the subsequent application-oriented part of this book. They allow the development of mathematical models for the prediction of certain features of chemical compounds.

7 Quantitative Structure–Property Relationships

Techniques of combinatorial chemistry have become more and more important in the search for active compounds, but require careful planning and preliminary computational simulation, where possible. This chapter describes what is necessary to build Quantitative Structure–Property Relationships (QSPRs). First the calculation and application of molecular descriptors is covered, which involves mapping molecular structures via graph theoretical invariants onto real numbers. The determination of predicting functions is covered next, using supervised statistical learning methods based on experimental results for the real library (QSPR), as well as applying the predicting function to the virtual library for a prognosis to prepare a directed synthesis. Example QSPR studies are then presented, including the boiling points of decanes, the physical density of propyl acrylates and the search for a biological/pharmaceutical property responsible for the anti-mycobacterial activity of quinolones.


8 Molecular structure elucidation

This chapter covers structure elucidation with mass spectrometry (MS) according to the three main steps interpretation, generation and selection. First, we describe how to interpret the MS to determine the molecular mass (or an interval of possible masses) in order to calculate the possible molecular formulas. Further interpretation of the MS is then described to deduce appropriate structural restrictions (or classifiers) for generation of the corresponding structural formulas. Different match values are derived to determine the best matching molecular formula and structural formula candidates by comparing calculated and experimental spectra, and the quality of these ranking functions is evaluated statistically. The generation of fragments in silico to try to explain the experimental spectrum is introduced in order to rank the structural candidates according to the spectral match. Next, the use of MS classifiers to extract structure properties is described, introducing MS descriptors. Different classification methods were used for 77 binary molecular descriptors of a given structural property. Following this, the systematic search for new structural properties is presented as a potential further development of MS classifiers. Two examples show the connection between the three steps interpretation, generation and verification. The final sections in this chapter investigate the application of these methods to high resolution mass spectrometry. As the exact mass is available, HR-MS has a much reduced candidate space for the molecular formula, which can be reduced further with exact mass fragments acquired using tandem-MS or MS/MS.

9 Case studies of CASE

In this chapter, CASE is applied to several contaminants in water samples, and the incorporation of calculated properties (QSPRs) into the molecular structure elucidation system introduced in the preceding chapter is discussed. First, the combination of classifiers from above, together with additional substructure information from the NIST database, is investigated to restrict structure generation. The incorporation of calculated properties to improve CASE is then explored, including other methods of mass spectral prediction, retention behavior, partitioning behavior and finally steric energy. We then investigate incorporating these via a filtering or exclusion strategy, then with a consensus scoring approach. Three successful examples of CASE at work on unknown environmental contaminants are then presented to demonstrate CASE in practice. This chapter concludes with an outlook on CASE via MS for GC-EI-MS and then CASE via high resolution MS and MS/MS data.

10 Appendix

In the appendix we give a list of the molecular descriptors included in MOLGEN–QSPR. We describe and show the molecular substructures used in MOLGEN–MS, grouped into the following categories: alkyl groups, aromatics, bonds, elements, functional groups and ring structures. In addition, the interested reader can find tables of molecular formulas by mass and ion type, as well as isomers, by formula and mass. These were obtained using MOLGEN. A list of references is found at the end, along with a list of abbreviations and a subject index.

1 Basics of graphs and molecular graphs

This introductory chapter contains basic definitions and facts about molecular graphs, or, in other words, about molecules in silico, i.e. models of molecules developed for computer chemistry and chemoinformatics. The aim is a description of the molecular model implemented in the various versions of the software packages MOLGEN, MOLGEN–QSPR and MOLGEN–MS for
– computer generation of molecular structures, for example connectivity isomers corresponding to a molecular formula,
– generation of molecular libraries, simulating an experiment of combinatorial chemistry,
– evaluation of such libraries with respect to desired properties of compounds,
– molecular structure elucidation, identification of a compound, in particular based on its mass spectrum,
– the generation of molecular libraries from Markush formulas in a canonical form, in order to detect overlap,
and so on. This is to be done scrupulously, since the basic notion of the model is decisive for proper software use, for success of a method and for further implementations. Moreover, we shall introduce the basic mathematical tools.

1.1 Graphs

Central to this book is the concept of a molecular graph. Here, the concepts of ‘graph’ and ‘molecule’ merge, as we use a graph-like interaction model for molecules to visualize pairwise interactions between certain atoms. The atoms are represented by the nodes of the graph and the interactions by covalent bonds (single, double or triple bonds, indicated by the corresponding number of one, two or three lines). Here is an example, a graph model for cyanic acid:

N≡C−O−H    (1.1)

It contains two single bonds and one triple bond, and the graph model of this compound is the multigraph that shows the multiplicities of the bonds to visualize the respective types of interactions:

•≡•−•−•    (1.2)

If we restrict attention to the interactions, neglecting their types, we obtain the underlying simple graph (which is also termed a bond graph) that indicates the location of the covalent bonds:

•−•−•−•    (1.3)

Thus, the molecular graph (1.1), multigraph (1.2) and simple graph (1.3) represent the molecule with decreasing complexity. Although the atom symbols are present in (1.1), the atom states are omitted.

In this book, we do not strictly distinguish between terms from graph theory and those from chemistry. Thus, there are bonds as well as lines in a graph just as in a molecular graph. We decided to use the chemical term ‘bond’ with its connotation of a possible multiple bond since a corresponding term seems to be missing in graph theory.

A major problem is that a computer can handle only labeled structures, whereas we are dealing with unlabeled structures in chemistry, such as the molecular graph (1.1), the multigraph (1.2) and the simple graph (1.3) shown above. We note the following facts and introduce a few terms for the constituents of the graph:
– The simple graph (1.3) consists of four nodes and three bonds that join three (of the total six) pairs of nodes.
– (1.2) shows that two of the bonds are single bonds, the remaining bond is a triple bond. The multiplicities of these bonds are expressed by a corresponding number of lines. In other words, the three unconnected node pairs are connected with multiplicity 0.
– (1.1) is obtained from (1.2) by coloring the nodes with element symbols. Later we shall add a further ‘color’, the atom state.

In all three cases, the four nodes are not numbered or labeled, as we shall say. Hence the above graphs are unlabeled graphs on four nodes. However, they cannot be entered into a computer ‘as is’, but require labeling.

1.1.1 Labeled graphs

In order to describe the graphs on the level of a computer’s capabilities we have to label (or number) the nodes. Using the labels 0, 1, 2 and 3, we obtain a labeled multigraph from (1.2) which may look like the following, depending on the labeling:

0≡1−2−3    (1.4)

This labeled graph is easily entered into a computer by storing each pair of labels of nodes together with the multiplicity of the bond in between, including 0 if there is no bond. By ‘pairs’ we mean unordered pairs or sets of two different nodes. To be exact, for unordered pairs $\{i, j\}$ we always have $\{i, j\} = \{j, i\}$. In contrast, ordered pairs are denoted as $(i, j)$ and in this case we have $(i, j) \neq (j, i)$ if $i \neq j$. The pairs of nodes and the bond multiplicities of example (1.4) are as follows:

\[ (\{0,1\}, 3),\ (\{0,2\}, 0),\ (\{0,3\}, 0),\ (\{1,2\}, 1),\ (\{1,3\}, 0),\ (\{2,3\}, 1). \]


In mathematical terms, we describe the labeled multigraph by a mapping $\gamma$ that assigns the bond multiplicity to each pair of nodes:

\[ \gamma\colon\ \{0,1\} \mapsto 3,\ \{0,2\} \mapsto 0,\ \{0,3\} \mapsto 0,\ \{1,2\} \mapsto 1,\ \{1,3\} \mapsto 0,\ \{2,3\} \mapsto 1. \]

Obviously, $\gamma$ describes the multigraph unambiguously and is acceptable for a computer. This shows how multigraphs can be defined as mappings $\gamma$. The corresponding bond graph will be denoted by $\gamma_b$; it is the mapping

\[ \gamma_b\colon\ \{0,1\} \mapsto 1,\ \{0,2\} \mapsto 0,\ \{0,3\} \mapsto 0,\ \{1,2\} \mapsto 1,\ \{1,3\} \mapsto 0,\ \{2,3\} \mapsto 1. \]
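To illustrate how such a mapping might be held in a program, here is a minimal Python sketch. It is not taken from MOLGEN; the names `gamma`, `gamma_b` and `bond_graph` and the choice of a dictionary keyed by `frozenset` are assumptions made for this illustration only.

```python
from itertools import combinations

def bond_graph(gamma):
    """Return the bond graph: multiplicity 1 wherever gamma is positive, else 0."""
    return {pair: 1 if mult > 0 else 0 for pair, mult in gamma.items()}

# Labeled multigraph (1.4): every unordered node pair is mapped to a multiplicity.
n = 4
gamma = {frozenset(p): 0 for p in combinations(range(n), 2)}
gamma[frozenset({0, 1})] = 3   # triple bond between nodes 0 and 1
gamma[frozenset({1, 2})] = 1   # single bond between nodes 1 and 2
gamma[frozenset({2, 3})] = 1   # single bond between nodes 2 and 3

gamma_b = bond_graph(gamma)
```

Using `frozenset` keys reflects that the pairs are unordered: `gamma[frozenset({1, 0})]` and `gamma[frozenset({0, 1})]` address the same entry.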

In order to prepare the exact definition, we need to introduce some notation. For simplicity, we use the fact that the natural number $n$, an element of the set $\mathbb{N} = \{0, 1, 2, \ldots\}$ of all natural numbers, can be understood in two ways. First, $n$ can be understood as a set: the natural number 0 is defined to be the empty set, $0 = \emptyset$, and a nonzero natural number $n$ is defined recursively as the set $n = \{0, \ldots, n-1\}$. Secondly, $n$ can stand for the order of the set $\{0, \ldots, n-1\}$ if it is nonzero, and if it is zero, we understand 0 as the order of the empty set. Using the interpretation of $n$ as a set we introduce

\[ \binom{n}{2} = \{\{i, j\} \mid i, j \in n,\ i \neq j\}, \]

the set of all 2-element subsets (or unordered pairs of elements) in the set $n$, reading $\binom{n}{2}$ as ‘$n$ choose 2’ since it arises from the set $n$ by choosing two elements, in all possible ways. Moreover, we shall use the standard notation

\[ Y^X = \{\gamma \mid \gamma\colon X \to Y\} \]

for the set of all mappings from $X$ to $Y$. Here $\gamma\colon X \to Y$ means that $\gamma$ is defined to be a mapping from the set $X$ to the set $Y$, while $x \mapsto y$ means that $\gamma$ maps $x \in X$ onto $y \in Y$, or that $\gamma$ associates $\gamma(x) = y$ with $x$. This will be interpreted here as: $\gamma$ replaces $x$ by $y$. The element $y$ is called the image of $x$ under $\gamma$, while $x$ is an inverse image of $y$. We are now in a position to introduce labeled multigraphs:

1.1 Definition (Labeled $m$-multigraphs on $n$ nodes)
– For natural numbers $m > 0$, the set of mappings

\[ m^{\binom{n}{2}} = \{0, 1, \ldots, m-1\}^{\binom{n}{2}} = \left\{ \gamma \,\middle|\, \gamma\colon \binom{n}{2} \to \{0, 1, \ldots, m-1\} \right\} \]

is the set of all labeled $m$-multigraphs on $n$ nodes. The bond multiplicities $\gamma(\{i, j\})$ are contained in $m$, i.e. bounded by $m - 1$. If we want to illustrate the multigraph $\gamma$, we use the labels $i \in n$ of the nodes and express the bond multiplicities by a number of lines: $\gamma(\{i, j\})$ denotes the number of lines, the multiplicity of the bond between (nodes) $i$ and $j$. If the multiplicity is 0, we speak of a non-bond, if it is 1, 2, 3, . . ., we call it a single, double, or triple bond, etc.
– We indicate this set of $m$-multigraphs on $n$ nodes as follows:

\[ \mathcal{G}_{m,n} = \{0, 1, \ldots, m-1\}^{\binom{n}{2}} = m^{\binom{n}{2}}. \]

The subset of the connected $m$-multigraphs

\[ \mathcal{G}^c_{m,n} = \{\gamma \in \mathcal{G}_{m,n} \mid \gamma \text{ is connected}\} \]

consists of the $m$-multigraphs where one can reach every other node from the starting node by walking along the bonds. A graph corresponding to a molecule (compound), a molecule graph, is usually contained in $\mathcal{G}^c_{4,n}$, where $n$ is the number of atoms. The elements $\gamma$ of

\[ \mathcal{G}_{2,n} = \{\gamma \mid \gamma(\{i, j\}) \in \{0, 1\}\} \]



are called simple graphs. An important simple graph is the bond graph $\gamma_b$ that was mentioned above.
– In order to describe $m$-multigraphs and their bond graphs we introduce two $n \times n$-matrices (with diagonal entries 0, as in Example 1.2), one for the multigraph, containing the multiplicities,

\[ M_\gamma = (\gamma_{ij})_{i,j \in n}, \quad \gamma_{ij} = \gamma(\{i, j\}), \]

and, for the corresponding bond graph,

\[ M_{\gamma_b} = (\gamma_{b\,ij})_{i,j \in n} \quad\text{with}\quad \gamma_{b\,ij} = \gamma_b(\{i, j\}) = \begin{cases} 1 & \text{if } \gamma(\{i, j\}) > 0, \\ 0 & \text{otherwise.} \end{cases} \]

It shows the location of the bonds in $\gamma$. The latter, the matrix of multiplicities of the bond graph $\gamma_b$, is called the bond matrix of $\gamma$; the former is called the matrix of bond multiplicities of $\gamma$. We note that these matrices are symmetric, since $\{i, j\} = \{j, i\}$, and therefore

\[ \gamma_{ij} = \gamma(\{i, j\}) = \gamma(\{j, i\}) = \gamma_{ji}, \quad \gamma_{b\,ij} = \gamma_b(\{i, j\}) = \gamma_b(\{j, i\}) = \gamma_{b\,ji}. \]

Although there is a lot of redundancy, it is better to use this symmetric matrix instead of its upper or lower half for technical reasons. The matrix of multiplicities describes the multigraph. It is often called the adjacency matrix of the multigraph. The bond matrix contains information restricted to the existence of bonds; in the molecular case it describes which atoms interact. Since we consider two graphs together, the multigraph and its bond graph, we avoid the term adjacency matrix since it is ambiguous in this case.
– Finally we mention the following sums of multiplicities, the row (or column) sums of these matrices:

\[ v(\gamma)_i = \sum_{j \in n,\ j \neq i} \gamma(\{i, j\}). \]

Imitating the corresponding notion of chemistry, we call it the valence of $i$. The other sum is

\[ b(\gamma)_i = v(\gamma_b)_i = \sum_{j \in n,\ j \neq i} \gamma_b(\{i, j\}), \]

the number of bonds incident with $i$, the bond degree of $i$. The sequences

\[ v(\gamma) = (v(\gamma)_0, \ldots, v(\gamma)_{n-1}) \quad\text{and}\quad b(\gamma) = (b(\gamma)_0, \ldots, b(\gamma)_{n-1}) \]

are called the sequence of valences and the sequence of bond degrees, respectively. In a molecular graph $v(\gamma)_i$ is the number of electrons of atom $i$ that interact with electrons of other atoms, while $b(\gamma)_i$ means the number of atoms bonded to atom $i$ or interacting with $i$.
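The sets of labeled multigraphs and their connected members can be explored by brute force for small $n$. The following Python sketch is an illustration only and assumes the dictionary representation of $\gamma$ by unordered pairs; the function names `all_multigraphs` and `is_connected` are hypothetical. It enumerates all labeled $m$-multigraphs on $n$ nodes, of which there are $m^{n(n-1)/2}$, and tests connectedness by a depth-first search along the bonds.

```python
from itertools import combinations, product

def all_multigraphs(m, n):
    """Yield every labeled m-multigraph on n nodes as a dict: pair -> multiplicity."""
    pairs = [frozenset(p) for p in combinations(range(n), 2)]
    for values in product(range(m), repeat=len(pairs)):
        yield dict(zip(pairs, values))

def is_connected(gamma, n):
    """Depth-first search from node 0 along pairs with positive multiplicity."""
    if n == 0:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j != i and j not in seen and gamma[frozenset({i, j})] > 0:
                seen.add(j)
                stack.append(j)
    return len(seen) == n

graphs = list(all_multigraphs(2, 4))       # labeled simple graphs on 4 nodes
num_labeled = len(graphs)                  # equals 2 ** 6
num_connected = sum(is_connected(g, 4) for g in graphs)
```

Such an exhaustive enumeration is feasible only for very small $m$ and $n$, since the number of labeled multigraphs grows exponentially in the number of node pairs.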


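1.2 Example (The multigraph (1.2)) The matrix of the labeled 4-multigraph $\gamma$ and the matrix of the labeled bond graph $\gamma_b$ corresponding to

$\gamma$:  0≡1−2−3

are

\[ M_\gamma = \begin{pmatrix} 0 & 3 & 0 & 0 \\ 3 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad\text{and}\quad M_{\gamma_b} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \]

since

$\gamma_b$:  0−1−2−3.

The corresponding sequences of valences and of bond degrees are $v(\gamma) = (3, 4, 2, 1)$ and $b(\gamma) = (1, 2, 2, 1)$.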
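The matrices and sequences of this example can be recomputed mechanically. The following Python sketch assumes the dictionary representation of $\gamma$ by unordered pairs; the helper name `multiplicity_matrix` is hypothetical and the diagonal is set to 0 by convention.

```python
def multiplicity_matrix(gamma, n):
    """Symmetric n x n matrix of bond multiplicities with a zero diagonal."""
    return [[0 if i == j else gamma[frozenset({i, j})] for j in range(n)]
            for i in range(n)]

# The labeled 4-multigraph of Example 1.2.
gamma = {frozenset({0, 1}): 3, frozenset({0, 2}): 0, frozenset({0, 3}): 0,
         frozenset({1, 2}): 1, frozenset({1, 3}): 0, frozenset({2, 3}): 1}

M = multiplicity_matrix(gamma, 4)
M_b = [[1 if entry > 0 else 0 for entry in row] for row in M]   # bond matrix
valences = tuple(sum(row) for row in M)         # row sums of M
bond_degrees = tuple(sum(row) for row in M_b)   # row sums of the bond matrix
```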

1.1.2 Unlabeled graphs

Eventually we are interested in unlabeled $m$-multigraphs, i.e. we consider $m$-multigraphs ‘up to relabeling’. In mathematical terms, we consider all the labeled multigraphs that arise from each other by relabeling as equivalent. Hence we are dealing with equivalence classes (see Definition 1.3) of labeled multigraphs, and these give the unlabeled graphs by delabeling, i.e. erasing the labels. The resulting multigraphs – one from each equivalence class – are what we are really interested in. For example, there are eleven unlabeled simple graphs on 4 nodes (i.e. unlabeled 2-multigraphs on 4 nodes):

[Figure: the eleven unlabeled simple graphs on 4 nodes]

The six graphs in the upper right are connected, the others are disconnected. The corresponding eleven equivalence classes of labeled simple graphs are of order 1, 6, 3, 12, 3, 6, 1, 12, 4, 12, 4 (from left to right, and from top to bottom); these orders add up to $2^6 = 64$, the number of labeled simple graphs on 4 nodes. The unlabeled graphs shown are obtained by picking one labeled graph from each equivalence class and replacing the labels by dots. For example, the equivalence class of the graph in the lowest row of graphs is clearly

[Figure: the four labeled graphs on the nodes 0, 1, 2, 3 that form this equivalence class]
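The numbers quoted for the simple graphs on 4 nodes can be checked by a brute-force computation. The following Python sketch is only an illustration and not the orderly generation method developed later in the book; the names `relabel` and `canonical` are hypothetical. It groups the 64 labeled simple graphs on 4 nodes into classes by trying all 24 relabelings and taking the lexicographically smallest edge list as a naive canonical form.

```python
from itertools import combinations, permutations, product

n = 4
pairs = list(combinations(range(n), 2))   # the 6 unordered node pairs as sorted tuples

def relabel(edges, perm):
    """Apply the relabeling i -> perm[i] to a set of edges."""
    return frozenset(tuple(sorted((perm[i], perm[j]))) for i, j in edges)

def canonical(edges):
    """Smallest relabeled edge list: a brute-force canonical form of the class."""
    return min(tuple(sorted(relabel(edges, p))) for p in permutations(range(n)))

classes = {}
for bits in product((0, 1), repeat=len(pairs)):
    edges = frozenset(p for p, b in zip(pairs, bits) if b)
    classes.setdefault(canonical(edges), []).append(edges)

class_sizes = sorted(len(members) for members in classes.values())
```

Trying all $n!$ relabelings is affordable only for tiny $n$; it merely serves to make the notion of ‘equal up to relabeling’ concrete.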

Equivalence is similar to symmetry. The number of classes of equivalent carbon atoms in a molecule is the number of carbon signals in its $^{13}$C NMR spectrum, and two carbon atoms are equivalent if one arises from the other by a suitable symmetry operation. More or less the same argument is used here. The corresponding mathematical notion is introduced in

1.3 Definition (Equivalence relation, equivalence class) Consider a nonempty set $S$. Relations and, in particular, equivalence relations on this set are defined as follows:
– A relation on $S$ is a set $R$ of ordered pairs $(s, t)$ of elements $s, t \in S$. Instead of ‘$(s, t)$ is an element of relation $R$’, we write $(s, t) \in R$ or $sRt$, for short.
– A relation $R$ on $S$ is an equivalence relation if the following holds:
  ∘ $R$ is reflexive, i.e. $R$ contains all the pairs $(s, s)$, $s \in S$.
  ∘ $R$ is symmetric, i.e. $(s, t) \in R$ implies $(t, s) \in R$.
  ∘ $R$ is transitive, i.e. $(s, t), (t, u) \in R$ imply $(s, u) \in R$.
  An equivalence relation $R$ decomposes the set $S$ into pairwise disjoint and nonempty subsets, the equivalence classes, as is easy to check using the symmetry and transitivity of $R$.
– The equivalence class of $s \in S$ with respect to $R$ is the set $\{t \in S \mid (s, t) \in R\}$.

Thus, the equivalence classes form a set-partition of $S$ in the following sense:

1.4 Definition (Set-partition) A sequence of subsets $S_i \subseteq S$, for indices $i$ contained in an index set $I$, forms a set-partition of $S$ if the $S_i$ are not empty, pairwise disjoint, and their union is $S$:

\[ S_i \neq \emptyset \quad\text{and}\quad S_i \cap S_j = \emptyset \ \text{ if } i \neq j, \quad\text{while}\quad \bigcup_{i \in I} S_i = S. \]
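As a small illustration of Definitions 1.3 and 1.4, the following Python sketch groups the elements of a finite set into equivalence classes and checks the three conditions of a set-partition. The function names and the example relation (congruence modulo 3 on the numbers 0 to 9) are assumptions chosen for this illustration.

```python
def equivalence_classes(elements, related):
    """Group elements into classes; 'related' must be an equivalence relation."""
    classes = []
    for s in elements:
        for cls in classes:
            if related(s, cls[0]):   # comparing with one representative suffices
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes

def is_set_partition(classes, elements):
    """Check: nonempty parts, pairwise disjoint, union equal to the whole set."""
    parts = [set(c) for c in classes]
    union = set().union(*parts)
    nonempty = all(parts)
    disjoint = sum(len(p) for p in parts) == len(union)
    return nonempty and disjoint and union == set(elements)

S = range(10)
classes = equivalence_classes(S, lambda s, t: (s - t) % 3 == 0)
```

Comparing each element with a single representative of a class is justified only because the relation is symmetric and transitive.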

Our main aim is a complete system of representatives of the equivalence classes, a transversal, a complete collection of essentially different, i.e. pairwise inequivalent structures. The method of choice is the use of groups and of group actions which we describe next. The equivalence classes will turn out to be orbits of a suitable group, and this fact will allow us to count these classes, to construct representatives and even to generate representatives that are distributed uniformly at random over the equivalence classes, i.e. to generate samples. To begin we recall the notion of a group:


1.5 Definition (Group) Consider a set $G$ and a composition procedure for elements $g$ and $g'$ of $G$. The composition is a mapping from $G \times G$ to $G$ that maps each pair of elements $(g, g') \in G \times G$ to an element of $G$, denoted by $g \cdot g'$,

\[ \cdot\,\colon G \times G \to G,\quad (g, g') \mapsto g \cdot g' \]

(or $gg'$ for simplicity). For convenience of notation and for some analogy with true numerical multiplication, we use for the composition the symbol ‘⋅’ and even the term multiplication. The pair $(G, \cdot)$ consisting of the set $G$ and such a composition procedure is called a (multiplicative) group, symbolized simply by $G$, if the following is true:
– Like a numerical multiplication, this multiplication may be successively applied several times and is required to be associative: $g(g'g'') = (gg')g''$, for $g, g', g'' \in G$.
– Moreover, we require that there exists an element $e \in G$ which is a left unit, i.e. $eg = g$ for all $g \in G$.
– Besides this, each group element $g$ must possess a left inverse $g' \in G$ with respect to $e$, $g'g = e$.
It is easy to check that a left unit is also a right unit, and that it is uniquely determined. It will therefore be denoted by 1 or, more explicitly, by $1_G$ and called the identity element of $G$. A left inverse is also a right inverse, and it is uniquely determined, but it, of course, depends on $g$. We indicate it by $g^{-1}$.

Two simple examples of groups are $(\mathbb{R}, +)$, the set of real numbers with the addition of real numbers as composition, and $(\mathbb{R} \setminus \{0\}, \cdot)$, the set of nonzero real numbers together with numerical multiplication. The most important example is the symmetric group since it describes, for example, the relabelings of nodes in graphs.

1.6 Example (The symmetric group) This important group is based on the set of all bijective mappings on a set $X$.
– A mapping $f\colon X \to Y$ is bijective if and only if it is both a surjective mapping (i.e. for each $y \in Y$ there is an $x \in X$ such that $f(x) = y$) and an injective mapping (which means that there is only one such $x$, i.e. $f(x) = f(x')$ implies $x = x'$). Thus, a bijective mapping $f\colon X \to X$ (a bijection on $X$) is an exchange of the elements of $X$; $x$ is exchanged or replaced by $x' = f(x)$. If the elements in $X$ are numbered, this amounts to a renumbering, $x = x_i$ is replaced by $x' = x_j$, or, as we might say, the number or the label $i$ is replaced by $j$. Such a bijection on $X$ is called a permutation of $X$; it is symbolized by a lowercase Greek letter, for example, by $\pi$.
– Consider the set $S_X$ of all permutations of a given set $X$ of objects $x$,

\[ S_X = \{\pi \mid \pi\colon X \to X \text{ bijective}\}. \]

We need a composition procedure linking two permutations. As such we use the successive application of permutations, which for simplicity we again call multiplication, using the symbol ‘⋅’. The successive application of two permutations $\pi$ and $\rho$, $\rho$ first, followed by $\pi$, is defined as

\[ (\pi \cdot \rho)(x) = \pi(\rho(x)). \]

– To check whether this pair $(S_X, \cdot)$ is a (multiplicative) group we have to verify the three conditions stated above: The composition of mappings is associative, $(\pi \cdot \rho) \cdot \sigma = \pi \cdot (\rho \cdot \sigma)$ is true, since in both cases we have to apply $\sigma$ first, then $\rho$, then $\pi$, obtaining that

\[ ((\pi \cdot \rho) \cdot \sigma)(x) = (\pi \cdot \rho)(\sigma(x)) = \pi(\rho(\sigma(x))) \]

and

\[ (\pi \cdot (\rho \cdot \sigma))(x) = \pi((\rho \cdot \sigma)(x)) = \pi(\rho(\sigma(x))), \]

which is the same. The identity mapping on $X$ is clearly bijective, and the inverse of a bijection is bijective, too. Thus, $(S_X, \cdot)$ is in fact a group; it is called the symmetric group on $X$.

Further important groups are certainly the symmetry groups of molecules. Here is an easy case:

1.7 Example (A symmetry group of a molecule) Consider the naphthalene molecule drawn here with its C atoms arbitrarily numbered (double bonds and H atoms suppressed).

[Figure: carbon skeleton of naphthalene with the C atoms numbered 0 to 9]

This molecule allows symmetry operations (reflections, rotations, etc.). We here consider the corresponding permutations of the labels of the C atoms. There is a ‘vertical reflection’, relabeling the atoms as follows, if we write original labels in the upper row, new labels in the lower row:

\[ \pi_0 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 5 & 4 & 3 & 2 & 1 & 0 & 9 & 8 & 7 & 6 \end{pmatrix}, \]

and another one, the ‘horizontal reflection’, relabeling the atoms as follows:
$$\pi_1 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 0 & 9 & 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 \end{pmatrix},$$

as well as the inversion, the ‘reflection through the center’,
$$\pi_2 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 5 & 6 & 7 & 8 & 9 & 0 & 1 & 2 & 3 & 4 \end{pmatrix}.$$

These three relabelings may be considered as symmetry operations in their own right, or the last may be understood as a combination of the first two, $\pi_2 = \pi_0 \pi_1$. Since we are going to construct a group $G$, the identity element also is included, relabeling each atom as itself:
$$\pi_3 = 1_G = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \end{pmatrix}.$$

For these four permutations, the inverse permutations should also be included, which turns out to be the case already since these permutations are self-inverse. Further combinations of the four symmetry operations do not result in anything new. So these four permutations together with multiplication (successive operation) form a group, the molecule’s symmetry group. The symmetry group, made of particular permutations, is a subgroup of the corresponding symmetric group $S_{10}$ on the set $10 = \{0, \ldots, 9\}$.

The chemist perceives no fewer than seven nontrivial symmetry operations in naphthalene (point group $D_{2h}$ or $mmm$); the corresponding symmetry elements are three mutually perpendicular mirror planes, three mutually perpendicular twofold rotation axes, and the center of inversion. In addition, the trivial symmetry operation, onefold rotation, is always present. Each of these, however, yields one of the above four permutations. The notion of point group of a molecule will be introduced later.

The next step is the introduction of group actions. We describe what groups can do for us. In particular they can decompose sets into equivalence classes, for example the sets $\mathcal{G}_{m,n}$, by collecting the $m$-multigraphs into classes of graphs that are equal ‘up to relabeling’.

1.8 Definition (Group actions, orbits, transversals) Consider a multiplicatively written group $G$ (short for $(G, \cdot)$) and a nonempty set $X$.
– A mapping
$$G \times X \to X : (g, x) \mapsto gx$$
that associates with each pair $(g, x)$ an element of $X$, denoted by $gx$, is called an action of $G$ on $X$ if the following conditions are satisfied:
$$(gg')x = g(g'x) \quad\text{and}\quad 1_G x = x,$$
for all $g, g' \in G$ and $x \in X$. We abbreviate this situation by ${}_G X$.
– If both $G$ and $X$ are finite, we call it a finite action.
– $G(x) = \{gx \in X \mid g \in G\}$ is the orbit of $x$ under the action ${}_G X$, and
$$G \backslash\!\backslash X = \{G(x) \mid x \in X\}$$
indicates the set of all orbits of ${}_G X$.
– A subset $T \subseteq X$ is called a transversal of the set of all orbits of ${}_G X$ if, for each $G(x) \in G \backslash\!\backslash X$, $|G(x) \cap T| = 1$. Thus, transversals $T$ are the subsets of $X$ that contain exactly one element of each orbit. The set of all transversals of ${}_G X$ will be indicated as $\mathcal{T}(G \backslash\!\backslash X)$.
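The notions of Definition 1.8 can be tried out on the four naphthalene permutations of Example 1.7. The following is a minimal sketch, not the authors' implementation: a permutation is assumed to be stored as a tuple whose entry at position i is the new label of atom i, and the names `compose`, `is_closed`, `orbits` and `transversal` are illustrative only.

```python
from itertools import product

# Lower rows of the two-row notation in Example 1.7:
# entry i of each tuple is the new label of atom i.
PI0 = (5, 4, 3, 2, 1, 0, 9, 8, 7, 6)  # 'vertical reflection'
PI1 = (0, 9, 8, 7, 6, 5, 4, 3, 2, 1)  # 'horizontal reflection'
PI2 = (5, 6, 7, 8, 9, 0, 1, 2, 3, 4)  # inversion
IDENTITY = tuple(range(10))
GROUP = {PI0, PI1, PI2, IDENTITY}


def compose(pi, rho):
    """Return pi . rho, i.e. apply rho first, then pi."""
    return tuple(pi[rho[x]] for x in range(len(rho)))


def is_closed(group):
    """Check that products of group elements stay inside the set."""
    return all(compose(g, h) in group for g, h in product(group, repeat=2))


def orbits(group, points):
    """Orbits G(x) = {gx | g in G}; valid because `group` is a full group."""
    seen, result = set(), []
    for x in points:
        if x not in seen:
            orbit = frozenset(g[x] for g in group)
            seen |= orbit
            result.append(orbit)
    return result


def transversal(orbit_list):
    """Pick the smallest label of each orbit as its representative."""
    return sorted(min(orbit) for orbit in orbit_list)


if __name__ == "__main__":
    print(is_closed(GROUP))
    print([sorted(o) for o in orbits(GROUP, range(10))])
    print(transversal(orbits(GROUP, range(10))))
```

Choosing the smallest label as representative is one of many possible conventions; any other choice of one element per orbit is an equally valid transversal.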

In our naphthalene example, the four permutations contained in the symmetry group relabel atom 1 as 4, 9, 6, 1. The same holds for the atoms 4, 9 and 6; they all are relabeled as 4, 9, 6 or 1. Thus, there is the orbit $\{1, 4, 6, 9\}$ of C atoms. Atoms 2 and 5 similarly give rise to orbits $\{2, 3, 7, 8\}$ and $\{0, 5\}$, respectively. Hence, the set of orbits is
$$G \backslash\!\backslash 10 = \{\{0, 5\}, \{1, 4, 6, 9\}, \{2, 3, 7, 8\}\}$$
in the naphthalene case. A transversal is, for example, $\{0, 1, 2\}$, that is, there are exactly three equivalence classes of C atoms in naphthalene, and atoms 0, 1, 2 are representatives of the different classes. In formal terms, $\{0, 1, 2\} \in \mathcal{T}(G \backslash\!\backslash 10)$.

A chemist may object that the same information could be obtained simply by inspecting the structure of naphthalene. The point here is that we arrived at conclusions without inspection of the structural formula, based solely on given permutations. A computer is unable to inspect a molecule, but it is instead able to calculate orbits from permutations.

In a molecule there are not only atoms, but bonds (pairs of atoms bonded to each other) and non-bonds (pairs of atoms not bonded to each other). There are various kinds of bonds and non-bonds in naphthalene and we would like to find equivalence classes of these as well. Obviously, equivalence of bonds or of non-bonds is governed somehow by the molecule’s symmetry group, though the latter primarily acts on the atoms. Thus, we ask the following question: What happens if a known group acts on a set of objects other than its original objects, but closely related to the latter in a known manner? This problem will be treated in Section 1.2, devoted to enumeration, construction and generation of multigraphs, and in Section 3.2 covering enumeration and construction of permutational isomers.

1.9 Remark (The most important facts about orbits) Consider an action ${}_G X$ of a group $G$ on a set $X$. Its orbits $G(x)$ have the following properties:
– Two orbits are either equal or disjoint: For all $x, x' \in X$ we have
$$G(x) \cap G(x') \neq \emptyset \iff G(x) = G(x').$$
– $X$ is the disjoint union of the orbits: For each $T \in \mathcal{T}(G \backslash\!\backslash X)$ we have
$$X = \dot{\bigcup_{x \in T}}\, G(x),$$
or, $G \backslash\!\backslash X$ is a set-partition of $X$. (Note that ‘disjoint’ is indicated by the dot above the sign $\cup$ for union of sets.)

It is easy to see that, conversely, for each equivalence relation we can find a group action that has the given equivalence classes as orbits. We need the following notions to justify this claim:

– Consider a group $(G, \cdot)$. A subset $H$ of $G$ is called a subgroup if and only if $(H, \cdot)$ is a group. In order to show that a subset $H$ is in fact a subgroup, it suffices to check that $H$ is not the empty set, $H \neq \emptyset$, and that for two elements $h, h' \in H$ the product $h \cdot h'^{-1}$ is also contained in $H$. If $H$ is finite, it suffices to check that $h \cdot h' \in H$ is true.
– If we are given a set-partition of a finite set $X$ into $k$ pairwise disjoint and nonempty subsets $X_i$, $i \in k$, then we can form the subgroups $S_{X_i}$ of $S_X$, consisting of the $\pi_i \in S_X$ with the following properties:
$$\text{for all } x \in X_i : \pi_i(x) \in X_i, \quad\text{while for } x \notin X_i : \pi_i(x) = x.$$
– The product of these subgroups is a group again:
$$\prod_{i \in k} S_{X_i} = \{\pi_0 \cdots \pi_{k-1} \mid \text{for all } i \in k : \pi_i \in S_{X_i}\}.$$
– It is easy to check that this is a subgroup of $S_X$, and that the $X_i$ are its orbits:
$$\Bigl(\prod_{i \in k} S_{X_i}\Bigr) \backslash\!\backslash X = \{X_0, \ldots, X_{k-1}\}.$$

Thus, the orbits of this group clearly are the $X_i$. We can therefore replace equivalence relations on finite sets by sets of orbits of finite group actions, or, for short, by finite group actions.

The advantage of this seemingly more complicated interpretation of equivalence relations is that the theory of finite group actions contains several methods that allow us to count orbits and to construct transversals. Hence, replacing an equivalence relation on a finite set by an action of a suitable finite group on this set opens an approach to count equivalence classes and construct transversals, which is exactly what we need.

We are now in a position to define unlabeled graphs in terms of group actions. For this purpose, we introduce an important action obtained from a given action ${}_G X$. G. Pólya introduced this approach in the seminal paper [234] ([235] contains an English translation). His aim was to count permutational isomers, which means the essentially different distributions of admissible substituents over a molecular skeleton, where ‘essentially different’ means with respect to the symmetry group of the skeleton. We shall describe this in all detail in Chapter 3, where we show that the same approach allows the construction of corresponding molecular graphs, but this needs further notions.

1.10 Definition (Symmetry classes of mappings, unlabeled graphs) Assume two nonempty finite sets $X$, $Y$ and an action ${}_G X$ of a group $G$ on $X$. An important example is given in Definition 1.1, where $X = \binom{n}{2}$ means the set of pairs of nodes of a graph while $Y = m = \{0, \ldots, m - 1\}$ means the set of admissible bond multiplicities, and $G$ is the group $S_n$ of all relabelings of the nodes. The reader may also think of $X$ as the set of substitutable positions of a molecular skeleton, $Y$ being a given set of admissible substituents, and $G$ denoting the symmetry group of the skeleton.

– The given action of $G$ on $X$ yields (‘induces’) the following action of $G$ on the set of mappings $Y^X = \{\gamma \mid \gamma : X \to Y\}$, defined as
$$G \times Y^X \to Y^X : (g, \gamma) \mapsto g\gamma, \quad\text{where } (g\gamma)(x) = \gamma(g^{-1}x).$$

To understand this definition, the reader may think of $x$ as a node pair in a multigraph $g\gamma$ obtained by renumbering (using $g$) a given multigraph $\gamma$. The question is ‘what is the bond multiplicity $(g\gamma)(x)$ of node pair $x$ in multigraph $g\gamma$?’, and the answer is ‘it is the bond multiplicity of that node pair $g^{-1}x$ in the original multigraph $\gamma$ from which pair $x$ arose, $\gamma(g^{-1}x)$’. This is true since the only change to the multigraph was renumbering.
– The orbits $\bar{\gamma} = G(\gamma)$, so that
$$G \backslash\!\backslash Y^X = \{\bar{\gamma} \mid \gamma \in Y^X\},$$
are called the symmetry classes of mappings in $Y^X$ with respect to ${}_G X$.

An important particular case of a symmetry class of mappings is an unlabeled $m$-multigraph on $n$ nodes. It can be identified with a symmetry class of mappings in the following way:
– Let $n = \{0, \ldots, n-1\}$ denote the set of labels of the nodes and consider the symmetric group $S_n$, the group of all relabelings. Thus, $S_n$ acts on the set of labels:
$$S_n \times n \to n : (\pi, i) \mapsto \pi i.$$
– The action of $S_n$ on $n$ yields an action on the set of pairs of labels of nodes:
$$S_n \times \binom{n}{2} \to \binom{n}{2} : (\pi, \{i, j\}) \mapsto \{\pi i, \pi j\}.$$
– In addition, the action of $S_n$ on the set of pairs of labels gives an action on the set of labeled graphs (Definition 1.1) $Y^X = \mathcal{G}_{m,n} = m^{\binom{n}{2}}$ as just defined:
$$S_n \times \mathcal{G}_{m,n} \to \mathcal{G}_{m,n} : (\pi, \gamma) \mapsto \pi\gamma, \quad\text{where } (\pi\gamma)(\{i, j\}) = \gamma(\{\pi^{-1} i, \pi^{-1} j\}).$$
– The set of symmetry classes
$$S_n \backslash\!\backslash \mathcal{G}_{m,n} = S_n \backslash\!\backslash m^{\binom{n}{2}}$$
can be identified with the set of unlabeled $m$-multigraphs on $n$ nodes, since the orbit $S_n(\gamma)$ of the labeled graph $\gamma$ consists of the graphs that can be obtained from $\gamma$ by relabeling. So we may say that the orbit $\bar{\gamma} = S_n(\gamma)$ ‘is $\gamma$ up to relabeling’. Thus,
$$S_n \backslash\!\backslash \mathcal{G}_{m,n} = S_n \backslash\!\backslash m^{\binom{n}{2}} = \Bigl\{\bar{\gamma} = S_n(\gamma) \Bigm| \gamma \in m^{\binom{n}{2}}\Bigr\}$$
is the formal definition of the set of unlabeled $m$-multigraphs on $n$ nodes.
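For very small $n$, the orbits of Definition 1.10 can be enumerated directly from the formula $(\pi\gamma)(\{i, j\}) = \gamma(\{\pi^{-1} i, \pi^{-1} j\})$. The sketch below is a brute-force illustration under stated assumptions, not one of the counting tools of the later sections: a labeled graph is stored as a tuple of multiplicities indexed by the node pairs in lexicographic order, and each orbit is represented by its lexicographically smallest member. The function name is hypothetical.

```python
from itertools import combinations, permutations, product


def count_unlabeled_multigraphs(m, n):
    """Count the orbits of S_n on the labeled m-multigraphs on n nodes."""
    pairs = list(combinations(range(n), 2))
    index = {pair: k for k, pair in enumerate(pairs)}

    # For every permutation pi, precompute which entry of gamma supplies
    # the multiplicity of pair {i, j} in the relabeled graph pi.gamma.
    lookups = []
    for pi in permutations(range(n)):
        inverse = [0] * n
        for i, image in enumerate(pi):
            inverse[image] = i
        lookups.append(
            [index[tuple(sorted((inverse[i], inverse[j])))] for i, j in pairs]
        )

    # The smallest relabeled copy serves as representative of the orbit.
    representatives = set()
    for gamma in product(range(m), repeat=len(pairs)):
        representatives.add(
            min(tuple(gamma[k] for k in lookup) for lookup in lookups)
        )
    return len(representatives)


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, [count_unlabeled_multigraphs(m, n) for m in range(1, 4)])
```

The running time grows like $n! \cdot m^{\binom{n}{2}}$, so this approach is only practical for a handful of nodes.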

We conclude this section with a table of the numbers $|S_n \backslash\!\backslash \mathcal{G}_{m,n}|$ of unlabeled $m$-multigraphs on $n$ nodes, for the first few $m$ and $n$. Below we shall describe the mathematical tools for their calculation.

1.11 Exercise Check the entries for $1 \le n \le 5$ in the column of $m = 2$ of Table 1.1, i.e. evaluate the numbers of simple graphs with $1 \le n \le 5$ nodes, using the fact that not all these graphs are connected.

Table 1.1. Numbers of unlabeled $m$-multigraphs.
$$\begin{array}{l|rrrrr}
 & m=1 & m=2 & m=3 & m=4 & m=5 \\ \hline
n=1 & 1 & 1 & 1 & 1 & 1 \\
n=2 & 1 & 2 & 3 & 4 & 5 \\
n=3 & 1 & 4 & 10 & 20 & 35 \\
n=4 & 1 & 11 & 66 & 276 & 900 \\
n=5 & 1 & 34 & 792 & 10{,}688 & 90{,}005 \\
n=6 & 1 & 156 & 25{,}506 & 1{,}601{,}952 & 43{,}571{,}400
\end{array}$$

Table 1.1, quoted from [146], shows entries rapidly increasing with $m$ and $n$. The second-last column contains the numbers $|S_n \backslash\!\backslash \mathcal{G}_{4,n}|$, coarse upper bounds for the numbers of distinct multigraphs underlying organic molecule structures of $1 \le n \le 6$ atoms. These bounds can be improved since graphs of organic molecules are usually connected 4-multigraphs. For example, there are 156 simple graphs on 6 nodes, but only 112 of these are connected.

The number of connected unlabeled 4-multigraphs on $n$ nodes is not easy to obtain, but obviously satisfies the inequality
$$|S_n \backslash\!\backslash \mathcal{G}^c_{4,n}| \le |S_n \backslash\!\backslash \mathcal{G}_{4,n}| - |S_n \backslash\!\backslash \mathcal{G}_{4,n-1}|,$$
since we obtain a subset of disconnected 4-multigraphs on $n$ nodes by adding to each 4-multigraph on $n - 1$ nodes a single isolated node. Thus, Table 1.2 contains upper bounds of numbers of connected unlabeled $m$-multigraphs.

Table 1.2. Upper bounds for numbers of unlabeled connected $m$-multigraphs.
$$\begin{array}{l|rrrrr}
 & m=1 & m=2 & m=3 & m=4 & m=5 \\ \hline
n=1 & 1 & 1 & 1 & 1 & 1 \\
n=2 & 0 & 1 & 2 & 3 & 4 \\
n=3 & 0 & 2 & 7 & 16 & 30 \\
n=4 & 0 & 7 & 56 & 256 & 865 \\
n=5 & 0 & 23 & 726 & 10{,}412 & 89{,}105 \\
n=6 & 0 & 122 & 24{,}714 & 1{,}591{,}264 & 43{,}481{,}395
\end{array}$$
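The rows $n \ge 2$ of Table 1.2 follow from Table 1.1 by subtracting consecutive rows, as the inequality above states. A short sketch of this arithmetic, with the entries of Table 1.1 typed in by hand and a hypothetical function name:

```python
# Rows of Table 1.1: n -> counts for m = 1, ..., 5.
TABLE_1_1 = {
    1: [1, 1, 1, 1, 1],
    2: [1, 2, 3, 4, 5],
    3: [1, 4, 10, 20, 35],
    4: [1, 11, 66, 276, 900],
    5: [1, 34, 792, 10688, 90005],
    6: [1, 156, 25506, 1601952, 43571400],
}


def connected_upper_bounds(table):
    """Row n minus row n-1, for every n whose predecessor row is available."""
    bounds = {}
    for n in sorted(table):
        if n - 1 in table:
            bounds[n] = [a - b for a, b in zip(table[n], table[n - 1])]
    return bounds


if __name__ == "__main__":
    for n, row in connected_upper_bounds(TABLE_1_1).items():
        print(n, row)
```

The row $n = 1$ is not produced by the subtraction; the single-node graph is connected, which gives the entries 1 in Table 1.2.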

1.2 Molecular graphs, constitutional isomers

The model used in this book for a chemical compound (molecule) is an unlabeled connected colored 4-multigraph, i.e. nodes are colored as described below, and bond multiplicities are 0, 1, 2, or 3.

1.2.1 Atom states in organic chemistry

The labeled molecular graph underlying a molecule with $n$ atoms is a connected 4-multigraph on $n$ nodes, $\gamma \in \mathcal{G}^c_{4,n}$, with the following additional properties: The color of a node consists of the symbol of the corresponding chemical element and an atom state. We recall from chemistry what that means.
– A chemical element is uniquely determined by its atomic number, which gives the number of protons in the nucleus and at the same time (for an uncharged atom) its total number of electrons.
– Some of an atom’s electrons are able to interact with electrons from other atoms; these are the valence electrons. Their number varies from element to element.
– An interaction of two atoms is called a covalent bond and is indicated in the molecular graph by lines. The number of lines is the bond multiplicity. A bond of multiplicity 1, visualized by one line, is a single bond and indicates that the two atoms connected by the bond share two electrons. A double bond, represented in the drawing by two lines, has multiplicity 2 and indicates that four electrons are shared, etc. Other forms of interaction will be discussed in Section 2.4.
– Valence electrons that do not belong to a covalent bond form free electron pairs. A single valence electron is called an unpaired electron.
– For a bonded atom, the count of its electrons (engaged in covalent bonds, in free electron pairs and unpaired) may differ from the number of valence electrons of a free atom of the same element. The difference, if any, is the atom’s charge.

Using these notions we define:

1.12 Definition (Atom state) An atom state is a quadruple $Z = (v_Z, p_Z, q_Z, r_Z)$, where
– $v_Z \in \mathbb{N}$ denotes the valence of the atom, i.e. the number of lines incident with it, which is the sum of multiplicities of incident bonds,
– $p_Z \in \mathbb{N}$ means the number of free electron pairs,
– $q_Z \in \mathbb{Z}$ indicates the charge, and
– $r_Z \in \mathbb{B} = \{0, 1\} = \{\mathit{false}, \mathit{true}\}$ indicates whether an unpaired electron is present or not. If $r_Z = 1$, the atom is a radical site.
An atom state is called ground state if $q_Z = 0$ and $r_Z = 0$.

An atom’s valence is the sum of the multiplicities of its covalent bonds. In terms of the molecular graph drawing, it is the number of lines ending in a node. For example, the usual valences of H, O, N, and C atoms in their ground states are 1, 2, 3, and 4, respectively. There are chemical elements that have more than one possible valence, i.e. they can have more than one ground state. For example, phosphorus can have a valence of 3 or 5, while sulfur can have a valence of 2, 4, or 6. Such variations correspond to variations in the number of free electron pairs.

If we leave the ground state, more valence states are possible for the same element. Mathematically speaking, we associate a set $\mathcal{Z}_X$ of admissible atom states to each chemical element $X$. The set depends on the particular situation to which the model will be applied.

1.13 Example (Organic chemistry) Four of the most common elements in organic chemistry are hydrogen, carbon, nitrogen and oxygen. These can be referred to as $\mathcal{E}_4 = \{\mathrm{H, C, N, O}\}$. Further elements also play a role in organic chemistry. The elements fluorine, silicon, phosphorus, sulfur, chlorine, bromine and iodine are collected in $\mathcal{E}_{11} = \{\mathrm{H, C, N, O, F, Si, P, S, Cl, Br, I}\}$. Table 1.3, taken from [334] with slight modifications, contains the atomic number $TE_X$, the number of valence electrons $VE_X$, as well as a list of most atom states $Z$ relevant in organic mass spectroscopy for the elements $X \in \mathcal{E}_{11}$. The characteristics of the atom state $Z$ of an atom of element $X$ satisfy the following equation:
$$v_Z + 2p_Z + q_Z + r_Z = VE_X.$$
Thus, we may skip one of these items when storing atom states [148].

The set of admissible atom states $\mathcal{Z}_X$ for element $X$ depends on the underlying chemistry chosen in a particular situation. The hierarchic classification of chemistries described in [61] can be expressed in terms of allowed atom states:

1.14 Definition (RC, CSC, IC, MC)
– Under the headline restricted chemistry (RC) we collect all compounds whose atoms do not carry any charge or unpaired electron, and obey (except hydrogen) the octet rule,
$$q_Z = 0, \quad r_Z = 0, \quad 2v_Z + 2p_Z = 8.$$
The atom states admissible in RC are given in Table 1.3 in the rows containing a ‘×’ in column RC. In RC it is in particular possible to associate a unique valence with each element $X \in \mathcal{E}_{11}$, the standard valence $v_X$.
– If we suspend the octet rule, we enter closed shell chemistry (CSC). For atom state $Z$ we still have $q_Z = 0$ and $r_Z = 0$. The corresponding states are marked by a ‘×’ in column CSC in Table 1.3.
– If we also drop the requirement $q_Z = 0$ and $r_Z = 0$, we speak of integral chemistry (IC). This is the realm of model building presented in this book. At this level of chemistry, all multiplicities of covalent bonds are integers.
– However, there are phenomena in chemistry not compatible with this restriction (see Section 2.4). Mesomerism and multicentric bonds need a further generalization. We collect all these more general situations under the notion of multicenter chemistry (MC).

Summarizing we obtain the following inclusions:
$$\mathrm{RC} \subset \mathrm{CSC} \subset \mathrm{IC} \subset \mathrm{MC}.$$
Note the limits of Table 1.3, given by organic mass spectroscopy. In other fields of organic chemistry, various reactive intermediates are considered containing atoms in further states, e.g. for C:

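Table 1.3. Some admissible atom states for the elements in $\mathcal{E}_{11}$.
$$\begin{array}{l|rrrr|cc}
X\ (TE_X, VE_X) & v_Z & p_Z & q_Z & r_Z & \text{RC} & \text{CSC} \\ \hline
\mathrm{H}\ (1, 1) & 1 & 0 & 0 & 0 & \times & \times \\
 & 0 & 0 & 1 & 0 & & \\
 & 0 & 0 & 0 & 1 & & \\ \hline
\mathrm{C}\ (6, 4) & 4 & 0 & 0 & 0 & \times & \times \\
 & 3 & 0 & 1 & 0 & & \\
 & 3 & 0 & 0 & 1 & & \\
 & 2 & 0 & 1 & 1 & & \\ \hline
\mathrm{N}\ (7, 5) & 5 & 0 & 0 & 0 & & \times \\
 & 4 & 0 & 1 & 0 & & \\
 & 3 & 1 & 0 & 0 & \times & \times \\
 & 3 & 0 & 1 & 1 & & \\
 & 2 & 1 & 0 & 1 & & \\ \hline
\mathrm{O}\ (8, 6) & 3 & 1 & 1 & 0 & & \\
 & 2 & 2 & 0 & 0 & \times & \times \\
 & 2 & 1 & 1 & 1 & & \\
 & 1 & 2 & 0 & 1 & & \\ \hline
\mathrm{F}\ (9, 7) & 2 & 2 & 1 & 0 & & \\
 & 1 & 3 & 0 & 0 & \times & \times \\
 & 1 & 2 & 1 & 1 & & \\ \hline
\mathrm{Si}\ (14, 4) & 4 & 0 & 0 & 0 & \times & \times \\
 & 3 & 0 & 1 & 0 & & \\
 & 3 & 0 & 0 & 1 & & \\
 & 2 & 0 & 1 & 1 & & \\ \hline
\mathrm{P}\ (15, 5) & 5 & 0 & 0 & 0 & & \times \\
 & 4 & 0 & 1 & 0 & & \\
 & 4 & 0 & 0 & 1 & & \\
 & 3 & 1 & 0 & 0 & \times & \times \\
 & 3 & 0 & 1 & 1 & & \\
 & 2 & 1 & 0 & 1 & & \\ \hline
\mathrm{S}\ (16, 6) & 6 & 0 & 0 & 0 & & \times \\
 & 5 & 0 & 1 & 0 & & \\
 & 5 & 0 & 0 & 1 & & \\
 & 4 & 1 & 0 & 0 & & \times \\
 & 4 & 0 & 1 & 1 & & \\
 & 3 & 1 & 1 & 0 & & \\
 & 3 & 1 & 0 & 1 & & \\
 & 2 & 2 & 0 & 0 & \times & \times \\
 & 2 & 1 & 1 & 1 & & \\
 & 1 & 2 & 0 & 1 & & \\ \hline
\mathrm{Cl}\ (17, 7) & 2 & 2 & 1 & 0 & & \\
 & 1 & 3 & 0 & 0 & \times & \times \\
 & 1 & 2 & 1 & 1 & & \\ \hline
\mathrm{Br}\ (35, 7) & 2 & 2 & 1 & 0 & & \\
 & 1 & 3 & 0 & 0 & \times & \times \\
 & 1 & 2 & 1 & 1 & & \\ \hline
\mathrm{I}\ (53, 7) & 2 & 2 & 1 & 0 & & \\
 & 1 & 3 & 0 & 0 & \times & \times \\
 & 1 & 2 & 1 & 1 & &
\end{array}$$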

– Carbanions, comprising a negatively charged C atom with three bonds, C in state $(3, 1, -1, 0)$. Here is an example, fulminic acid:
O = N⊕ = C⊖ − H
– Carbenes, containing a 2-valent C atom (drawn as −C−), C in the state $(2, 1, 0, 0)$.
– Isonitriles (very important in Ugi’s multicomponent reactions, see Chapter 5) are characterized by the functional group R − NC, whose carbon atom may formally be described as carbanion-like or carbene-like:
R − N⊕ ≡ C⊖ ←→ R − N = C
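The electron balance $v_Z + 2p_Z + q_Z + r_Z = VE_X$ and the conditions of Definition 1.14 lend themselves to a mechanical check. The sketch below is illustrative only: the names are hypothetical, and the function applies the conditions for RC, CSC and IC literally, returning the smallest of the three classes that contains a state. Whether a state is actually admitted in an application is decided by the chosen sets $\mathcal{Z}_X$ (for instance Table 1.3), not by this function.

```python
from typing import NamedTuple

# Numbers of valence electrons VE_X for the elements of E_11 (Table 1.3).
VALENCE_ELECTRONS = {
    "H": 1, "C": 4, "N": 5, "O": 6, "F": 7, "Si": 4,
    "P": 5, "S": 6, "Cl": 7, "Br": 7, "I": 7,
}


class AtomState(NamedTuple):
    v: int  # valence
    p: int  # number of free electron pairs
    q: int  # charge
    r: int  # 1 if an unpaired electron is present, else 0


def is_balanced(element, state):
    """Check v + 2p + q + r = VE_X."""
    total = state.v + 2 * state.p + state.q + state.r
    return total == VALENCE_ELECTRONS[element]


def chemistry_level(element, state):
    """Smallest class among RC, CSC, IC whose conditions the state meets."""
    if not is_balanced(element, state):
        raise ValueError("electron balance violated")
    if state.q != 0 or state.r != 0:
        return "IC"
    # Octet rule, with hydrogen exempted as in Definition 1.14.
    if element == "H" or 2 * state.v + 2 * state.p == 8:
        return "RC"
    return "CSC"


if __name__ == "__main__":
    print(chemistry_level("C", AtomState(4, 0, 0, 0)))
    print(chemistry_level("P", AtomState(5, 0, 0, 0)))
    print(chemistry_level("C", AtomState(3, 1, -1, 0)))  # carbanion carbon
```

Because of the balance equation, any one of the four components could be dropped from `AtomState` and recomputed on demand, which is the storage saving mentioned in Example 1.13.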

1.2.2 Constitutional isomers

The molecule generator MOLGEN, up to version 3.5 [19], generates compounds on the level of RC or CSC (if we enter particular valences, e.g. 5 for P, or 4 or 6 for S), while from version 4.0 onwards [148] it is possible to go to level IC. Molecules (compounds) generated are unlabeled molecular graphs, i.e. equivalence classes of labeled molecular graphs that we introduce as follows:

1.15 Definition (Labeled molecular graphs of $n$ atoms in $\mathcal{E}$) Let $n$ denote the number of atoms in the molecular graphs to be defined. Consider a set $\mathcal{E}$ of chemical elements, and denote by
$$\mathcal{Z}_{\mathcal{E}} = \bigcup_{X \in \mathcal{E}} \mathcal{Z}_X$$
the set of admissible atom states. A labeled molecular graph $M$ on $n$ atoms in $\mathcal{E}$ with atom states contained in $\mathcal{Z}_{\mathcal{E}}$ is a triple $M = (\varepsilon, \zeta, \gamma)$, where
– $\varepsilon = (\varepsilon(0), \ldots, \varepsilon(n - 1))$ is a sequence of length $n$ of (labeled) element symbols $\varepsilon(i) \in \mathcal{E}$, for short: $\varepsilon \in \mathcal{E}^n$,
– $\zeta = (\zeta(0), \ldots, \zeta(n - 1))$ is a sequence of length $n$ of admissible atom states $\zeta(i) \in \mathcal{Z}_{\varepsilon(i)}$, $\zeta \in \mathcal{Z}^n_{\mathcal{E}}$, and
– $\gamma$ is a labeled 4-multigraph, for short: $\gamma \in \mathcal{G}_{4,n}$, for which we require that it has the correct valences $v(\gamma)_i = v_{\zeta(i)}$. This means that the sum of multiplicities of the bonds incident with node $i$ equals the valence prescribed by $\zeta(i)$.
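A triple $(\varepsilon, \zeta, \gamma)$ in the sense of Definition 1.15 maps naturally onto a small record type. The sketch below is one possible encoding, assumed for illustration and not taken from MOLGEN: atom states are stored as tuples $(v, p, q, r)$, and $\gamma$ as a dictionary from unordered node pairs to nonzero multiplicities, absent pairs having multiplicity 0. Cyanic acid, N≡C−O−H, serves as a test molecule.

```python
from dataclasses import dataclass


@dataclass
class LabeledMolecularGraph:
    elements: tuple  # epsilon: element symbol of atom i
    states: tuple    # zeta: atom state (v, p, q, r) of atom i
    bonds: dict      # gamma: frozenset({i, j}) -> multiplicity 1, 2 or 3

    def valence(self, i):
        """Sum of the multiplicities of the bonds incident with node i."""
        return sum(mult for pair, mult in self.bonds.items() if i in pair)

    def has_correct_valences(self):
        """Check v(gamma)_i = v_zeta(i) for every atom i."""
        return all(
            self.valence(i) == state[0] for i, state in enumerate(self.states)
        )


CYANIC_ACID = LabeledMolecularGraph(
    elements=("N", "C", "O", "H"),
    states=((3, 1, 0, 0), (4, 0, 0, 0), (2, 2, 0, 0), (1, 0, 0, 0)),
    bonds={frozenset({0, 1}): 3, frozenset({1, 2}): 1, frozenset({2, 3}): 1},
)

if __name__ == "__main__":
    print([CYANIC_ACID.valence(i) for i in range(4)])
    print(CYANIC_ACID.has_correct_valences())
```

Storing only the nonzero multiplicities keeps the representation sparse, which matters because molecular graphs have far fewer bonds than node pairs.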

Since molecules are often considered in a reduced, H-suppressed form [107], we extend the data introduced in Definition 1.1 as follows.

1.16 Definition (H-suppressed molecular graphs) H-suppressed means that we suppress in the molecular graph $\gamma$ the nodes representing H atoms, together with the adjacent bonds, obtaining
– the H-suppressed graph $\gamma^*$, from which we deduce the H-suppressed molecular graph
$$M^* = (\varepsilon^*, \zeta^*, \gamma^*),$$
where $\varepsilon^*$ means the corresponding reduced mapping while $\zeta^*$ indicates the reduced set of admissible atom states, obtained by restricting attention to the non-H atoms.
– Analogously we obtain, besides the $n \times n$-matrices $\mathbf{M}_\gamma$ of multiplicities and the bond matrix $\mathbf{M}^b_\gamma$, the matrices that describe the H-suppressed molecule
$$\mathbf{M}_{\gamma^*} \quad\text{and}\quad \mathbf{M}^b_{\gamma^*}$$
and the sums of entries
$$v(\gamma^*)_i = \sum_{j \in n} \gamma^*(\{i, j\}) \quad\text{and}\quad b(\gamma^*)_i = \sum_{j \in n} \gamma^*(\{i, j\})^b.$$
They form the sequence $v(\gamma^*)$ of valences and the sequence $b(\gamma^*)$ of bond degrees of the H-suppressed molecule.

By $\mathcal{M}_n$ we denote the set of labeled molecular graphs on $n$ atoms, by
$$\mathcal{M} = \bigcup_{n > 0} \mathcal{M}_n$$
the set of all molecular graphs. We call $(\varepsilon, \zeta, \gamma)$ a connected molecular graph or a molecule graph if $\gamma$ is connected. Moreover, we indicate by $\mathcal{M}^c_n$ the set of connected molecular graphs on $n$ atoms, obtaining the set of all connected molecular graphs
$$\mathcal{M}^c = \bigcup_{n > 0} \mathcal{M}^c_n.$$

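We are now in a position to introduce the most important object in our book, a ‘molecule in silico’:

1.17 Definition (Unlabeled molecular graphs with $n$ atoms in $\mathcal{E}$) Assume $n > 0$, a set $\mathcal{E}$ of chemical elements and $\mathcal{Z}_{\mathcal{E}} = \bigcup_{X \in \mathcal{E}} \mathcal{Z}_X$, a set of admissible states of the elements in $\mathcal{E}$. Then
– the mapping
$$S_n \times (\mathcal{E}^n \times \mathcal{Z}^n_{\mathcal{E}} \times \mathcal{G}_{4,n}) \to \mathcal{E}^n \times \mathcal{Z}^n_{\mathcal{E}} \times \mathcal{G}_{4,n} : (\pi, (\varepsilon, \zeta, \gamma)) \mapsto \pi(\varepsilon, \zeta, \gamma),$$
with $\pi(\varepsilon, \zeta, \gamma) = (\pi\varepsilon, \pi\zeta, \pi\gamma)$ and, for $i, j \in n$, $i \neq j$,
$$\pi\varepsilon(i) = \varepsilon(\pi^{-1} i), \quad \pi\zeta(i) = \zeta(\pi^{-1} i), \quad \pi\gamma(\{i, j\}) = \gamma(\{\pi^{-1} i, \pi^{-1} j\}),$$
defines an action of $S_n$ on $\mathcal{E}^n \times \mathcal{Z}^n_{\mathcal{E}} \times \mathcal{G}_{4,n}$ and therefore also on $\mathcal{M}_n$.
– Two molecular graphs $M, M' \in \mathcal{M}_n$ are equivalent molecular graphs or isomorphic molecular graphs if they belong to the same orbit under this action. The elements of $\bar{\mathcal{M}}_n = S_n \backslash\!\backslash \mathcal{M}_n$, i.e. the orbits, are called equivalence classes of molecular graphs. The class of $M$ will be indicated as $\bar{M}$, and the set of all these classes is denoted by
$$\bar{\mathcal{M}} = \bigcup_{n > 0} \bar{\mathcal{M}}_n, \quad\text{where } \bar{\mathcal{M}}_n = \{\bar{M} \mid M \in \mathcal{M}_n\}.$$
– If $\gamma \in \mathcal{G}^c_{4,n}$, then the orbit $S_n((\varepsilon, \zeta, \gamma)) \in S_n \backslash\!\backslash \mathcal{M}^c_n$ of the labeled molecular graph $(\varepsilon, \zeta, \gamma)$ is called a connected unlabeled molecular graph. Equivalence classes of connected labeled molecular graphs can be identified with constitutional formulas of chemical compounds. Exceptions are exotic compounds such as catenanes and rotaxanes, whose structural formulas correspond to disconnected graphs. These are not dealt with in this book. The notation is
$$\bar{\mathcal{M}}^c = \bigcup_{n > 0} \bar{\mathcal{M}}^c_n.$$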
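For small molecules, the action of $S_n$ in Definition 1.17 can be applied literally to decide whether two labeled molecular graphs lie in the same orbit. The following sketch rests on assumptions made here for illustration: a molecule is a plain triple `(eps, zeta, gamma)` with `gamma` a dictionary from `frozenset({i, j})` to multiplicity, and the lexicographically smallest relabeled copy is used as orbit representative. Trying all $n!$ permutations is only feasible for very few atoms; practical generators rely on far more refined canonization methods.

```python
from itertools import permutations


def relabel(pi, molecule):
    """Apply pi (pi[i] is the new label of atom i) to (eps, zeta, gamma)."""
    eps, zeta, gamma = molecule
    n = len(eps)
    new_eps, new_zeta = [None] * n, [None] * n
    for i in range(n):
        # (pi eps)(pi i) = eps(i), which is (pi eps)(j) = eps(pi^-1 j).
        new_eps[pi[i]] = eps[i]
        new_zeta[pi[i]] = zeta[i]
    new_gamma = {
        frozenset(pi[i] for i in pair): mult for pair, mult in gamma.items()
    }
    return tuple(new_eps), tuple(new_zeta), new_gamma


def canonical_form(molecule):
    """Smallest member of the S_n-orbit, written as comparable tuples."""
    forms = []
    for pi in permutations(range(len(molecule[0]))):
        eps, zeta, gamma = relabel(pi, molecule)
        bonds = tuple(sorted((tuple(sorted(p)), m) for p, m in gamma.items()))
        forms.append((eps, zeta, bonds))
    return min(forms)


def are_isomorphic(a, b):
    return len(a[0]) == len(b[0]) and canonical_form(a) == canonical_form(b)


STATES = {"N": (3, 1, 0, 0), "C": (4, 0, 0, 0), "O": (2, 2, 0, 0), "H": (1, 0, 0, 0)}


def chain(symbols, multiplicities):
    """Helper: unbranched chain of atoms with the given bond multiplicities."""
    gamma = {frozenset({i, i + 1}): m for i, m in enumerate(multiplicities)}
    return tuple(symbols), tuple(STATES[s] for s in symbols), gamma


CYANIC = chain("NCOH", (3, 1, 1))           # N≡C−O−H
CYANIC_REVERSED = chain("HOCN", (1, 1, 3))  # same molecule, other labeling
ISOCYANIC = chain("OCNH", (2, 2, 1))        # O=C=N−H

if __name__ == "__main__":
    print(are_isomorphic(CYANIC, CYANIC_REVERSED))
    print(are_isomorphic(CYANIC, ISOCYANIC))
```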

The usual drawing of such an unlabeled molecular graph is obtained by erasing the labels of $\gamma$ and replacing label $i$ by the element symbol $\varepsilon(i)$ if the remaining part $\zeta(i)$ of the color is unambiguous. If an atom state $\zeta(i)$ is not the default value, then we sometimes add at least part of it. Here is an example:

[Figure: the same molecule drawn twice, on the left with all element symbols, H atoms and charges written out, on the right in the usual abbreviated form.]

The nodes carry element symbols; charges, free electron pairs (bars) and unpaired electrons (dots), if any, are added next to element symbols (left). For a better overview, H atoms attached to C atoms, symbols ‘C’, and free electron pairs are usually not written. However, symbols of hetero atoms (atoms other than C or H), of C atoms in unusual atom states, and of H atoms attached to such atoms are always written (right).

More explicitly, a molecule graph is an unlabeled graph where the nodes are colored by an element symbol, together with an admissible atom state. For example, the molecule of cyanic acid is modeled by the following unlabeled, colored graph:
N (3,1,0,0) ≡ C (4,0,0,0) − O (2,2,0,0) − H (1,0,0,0).
The usual chemical notation is |N ≡ C − O − H. It is stored in its labeled form in the computer, for example, as
0: N (3,1,0,0) ≡ 1: C (4,0,0,0) − 2: O (2,2,0,0) − 3: H (1,0,0,0).

Now we introduce the notions of molecular formula, structural formula and constitutional isomer. To be acceptable for a computer, these terms need a precise mathematical definition. They are of central importance in the following discussions and in the applications:

1.18 Definition (Molecular formula, constitutional isomer) Assume a set $\mathcal{E}$ of chemical elements. The molecular formula of a molecule consists of a set of chemical elements, e.g. $\mathcal{E}_4 = \{\mathrm{H, C, N, O}\}$, together with their occurrence numbers in the molecule, for example 3, 0, 1, 0, resulting in the formula $\mathrm{NH_3}$ in the usual notation, where numbers ‘1’ are left out. In mathematical terms this can be defined as follows:
– A molecular formula is a mapping $\beta$ from a given set of elements $\mathcal{E}$ into the set of natural numbers $\mathbb{N}$, which associates with $X \in \mathcal{E}$ its occurrence number $\beta(X) \in \mathbb{N}$ in the molecule, $\beta \in \mathbb{N}^{\mathcal{E}}$.
– $\mathcal{E}$ may vary, depending on the chemical context considered, but it clearly has to comprise all elements occurring in $M$. Correspondingly, the molecular formula of the molecular graph $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}_n$ is $\beta_M \in \mathbb{N}^{\mathcal{E}}$ with its values
$$\beta_M(X) = |\{i \in n \mid \varepsilon(i) = X\}|,$$
the number of occurrences of the elements $X$ in the molecule $M$.
– The set of structural formulas corresponding to molecules with formula $\beta$ is the following set of connected multigraphs with the same molecular formula:
$$\bar{\mathcal{M}}^c_\beta = \{\bar{M} \mid M \in \mathcal{M}^c, \beta_M = \beta\}.$$
This set is the set of constitutional isomers (or connectivity isomers or simply isomers) of molecular formula $\beta$.
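The mapping $\beta_M$ of Definition 1.18 is a simple count over the element sequence $\varepsilon$. A minimal sketch, with a hypothetical function name and with $\mathcal{E}_4$ as the default element set:

```python
from collections import Counter


def molecular_formula(elements, element_set=("H", "C", "N", "O")):
    """beta_M: map each X in the element set to its number of occurrences."""
    counts = Counter(elements)
    unknown = set(counts) - set(element_set)
    if unknown:
        raise ValueError(f"elements outside the element set: {sorted(unknown)}")
    return {x: counts.get(x, 0) for x in element_set}


if __name__ == "__main__":
    # Ammonia: occurrence numbers 3, 0, 1, 0 over (H, C, N, O).
    print(molecular_formula(("N", "H", "H", "H")))
    # Two different labeled graphs can share one molecular formula.
    print(molecular_formula("NCOH") == molecular_formula("OCNH"))
```

Since $\beta_M$ ignores $\zeta$ and $\gamma$, all constitutional isomers of a formula are mapped to the same dictionary.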

Another example is the following constitutional isomer of cyanic acid, isocyanic acid:
O (2,2,0,0) = C (4,0,0,0) = N (3,1,0,0) − H (1,0,0,0).
In chemical notation: O = C = N − H.

We are now in a position to collect, for example, all the three constitutional isomers with molecular formula HCNO that were already mentioned. Here are the chemical notations of fulminic acid, cyanic acid and isocyanic acid:
O = N⊕ = C⊖ − H,
|N ≡ C − O − H,
O = C = N − H,
while the molecular graphs are
O (2,2,0,0) = N⊕ (4,0,1,0) = C⊖ (3,1,−1,0) − H (1,0,0,0),
N (3,1,0,0) ≡ C (4,0,0,0) − O (2,2,0,0) − H (1,0,0,0),
O (2,2,0,0) = C (4,0,0,0) = N (3,1,0,0) − H (1,0,0,0).

1.19 Exercise Use MOLGEN–ONLINE, via http://www.molgen.de, to evaluate the number of constitutional isomers of HCNO without prescribing valences for the atoms involved. Thereby, standard valences will be used, and consequently fulminic acid will not be generated.

1.20 Exercise As MOLGEN–ONLINE is based on MOLGEN 5.0, you can prescribe valences different from the standard ones. Evaluate the constitutional isomers of HCNO with chemically less usual valences, e.g. with carbon of valence 3 and nitrogen of valence 4. Examples 4 and 5 online show how valences can be prescribed. Submit C[val = 3]N[val = 4]OH, and fulminic acid, among others, will be generated.

The molecular formula of a compound carries information only about the numbers of atoms of each element contained in a molecule. For example, the molecular formula $\mathrm{C_2H_6O}$ describes molecules made of two C atoms, six H atoms, and a single O atom. Usually C is listed first, followed by H, if any, and other element symbols in alphabetic order. This is termed the Hill system [128].

1.21 Example (Numbers of constitutional isomers) Here are a few example molecular formulas of molecular mass 78 (that of benzene), consisting of the elements in $\mathcal{E}_4$, as well as the numbers of constitutional isomers possible. The numbers are those obtained for RC (see Definition 1.14).
$$\begin{array}{c|l|r}
m & \beta & |\bar{\mathcal{M}}^c_\beta| \\ \hline
78 & \mathrm{CH_2O_4} & 6 \\
 & \mathrm{CH_6N_2O_2} & 28 \\
 & \mathrm{C_2H_6O_3} & 10 \\
 & \mathrm{C_4H_2N_2} & 465 \\
 & \mathrm{C_5H_2O} & 151 \\
 & \mathrm{C_6H_6} & 217
\end{array}$$
Check the entries of this table using MOLGEN–ONLINE. You may be surprised if you obtain 462 instead of 465 for the fourth entry. This is because the online version of MOLGEN eliminates aromatic duplicates.

Many further examples given in Appendix D show the rapid increase of the numbers of constitutional isomers with increasing molecule size. They also show how only a small fraction of mathematically possible molecules are listed as known compounds in standard compound databases. For example, $\mathrm{C_6H_6N_4O}$ has mass 150 and yields 151,838,122 mathematically possible constitutional isomers. The Beilstein database BS0302PR [195] (cf. Section 2.5) contains just 273 of these, and the NIST database of Mass Spectra [224] contains only 11.
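The masses quoted in Example 1.21 and the Hill ordering can be reproduced with a few lines. The sketch makes two assumptions that go beyond the text: masses are nominal integer masses (H 1, C 12, N 14, O 16), and for formulas without carbon all symbols, H included, are sorted alphabetically, which is how the Hill system treats carbon-free compounds. The function names are illustrative.

```python
# Nominal (integer) masses, assumed here for the elements of E_4.
NOMINAL_MASS = {"H": 1, "C": 12, "N": 14, "O": 16}


def nominal_mass(beta):
    """Sum of nominal atomic masses weighted by the occurrence numbers."""
    return sum(NOMINAL_MASS[x] * k for x, k in beta.items())


def hill_notation(beta):
    """Format beta in Hill order; occurrence numbers 1 are left out."""
    present = [x for x, k in beta.items() if k > 0]
    if "C" in present:
        rest = sorted(x for x in present if x not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in present else []) + rest
    else:
        order = sorted(present)
    return "".join(x + (str(beta[x]) if beta[x] > 1 else "") for x in order)


EXAMPLE_1_21 = [
    {"C": 1, "H": 2, "O": 4},
    {"C": 1, "H": 6, "N": 2, "O": 2},
    {"C": 2, "H": 6, "O": 3},
    {"C": 4, "H": 2, "N": 2},
    {"C": 5, "H": 2, "O": 1},
    {"C": 6, "H": 6},
]

if __name__ == "__main__":
    for beta in EXAMPLE_1_21:
        print(hill_notation(beta), nominal_mass(beta))
```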

MOLGEN constructs the 217 $\mathrm{C_6H_6}$ constitutional isomers in fractions of a second on a standard PC, while the $\approx 1.52 \cdot 10^8$ isomers of $\mathrm{C_6H_6N_4O}$ require less than 5 minutes, depending on CPU speed.

1.22 Exercise Use MOLGEN–ONLINE at http://www.molgen.de to evaluate the number of constitutional isomers of formula $\mathrm{C_6H_6}$ that do not contain any triple bonds. Do the same for the number of isomers that contain neither triple nor double bonds.

We used the standard valences $v_X$ for RC. These numbers were also used in [194] and [237] in order to derive an expression for the plausibility of a molecular formula in molecular structure elucidation.

1.2.3 The existence of molecular graphs

Later on we need existence criteria for connected molecular graphs. R. Grund [94, 95] deduced the following important existence theorem formulated in terms of valences:

1.23 Theorem (Existence of molecular graphs in RC) For $\beta \in \mathbb{N}^{\mathcal{E}}$ there exists at least one molecular graph $M$ with $\beta_M = \beta$ if and only if
(Gr1) the sum $\sum_{X \in \mathcal{E}} v_X \beta(X)$ of all valences is an even number, since each line contributes two valences to this sum.
(Gr2) $\sum_{X \in \mathcal{E}} v_X \beta(X) - 2 \max\{v_X \mid X \in \mathcal{E}, \beta(X) > 0\} \ge 0$, since there is at least one atom of valence $\max\{v_X \mid X \in \mathcal{E}, \beta(X) > 0\}$. The bonds of this atom are made from a total of $2 \max\{v_X \mid X \in \mathcal{E}, \beta(X) > 0\}$ valences, and thus there should be at least that many valences in a molecular graph.
For the existence of at least one connected molecular graph we also need
(Con) $\sum_{X \in \mathcal{E}} v_X \beta(X) - 2 \sum_{X \in \mathcal{E}} \beta(X) + 2 \ge 0$. This condition is necessary, since a connected molecular graph of $k$ atoms containing the fewest lines is a (straight or branched) chain containing single bonds only. It comprises $k - 1$ single bonds, and each additional atom to be attached requires an additional (single) bond. Hence, the total number of lines should be equal to or higher than the number of atoms minus one.

The given arguments show that the three conditions are necessary for the existence of a connected molecular graph. The proof that they are also sufficient is more important and complicated; we refer to Grund [95].

We denote the set of molecular formulas satisfying (Gr1), (Gr2) and (Con) by $\mathcal{B}^c_{\mathcal{E}}$, where we only allow the standard valence and ground state for each element, i.e. formulas are restricted to those from RC. Thus, we use notation $\beta$ for a molecular formula and calligraphic $\mathcal{B}$ for the set of molecular formulas on $\mathcal{E}$ corresponding to at least one molecular graph, both being variations of capital letter B, in remembrance of the German term ‘Bruttoformel’ formerly used for molecular formula.
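Conditions (Gr1), (Gr2) and (Con) of Theorem 1.23 are cheap to evaluate. The sketch below assumes the RC standard valences that can be read off Table 1.3 and represents a molecular formula as a dictionary from element symbols to occurrence numbers; the function names are hypothetical, and rejecting the empty formula is a convention chosen here, not part of the theorem.

```python
# Standard valences v_X in RC for the elements of E_11 (from Table 1.3).
STANDARD_VALENCE = {
    "H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Si": 4,
    "P": 3, "S": 2, "Cl": 1, "Br": 1, "I": 1,
}


def valence_sum(beta):
    return sum(STANDARD_VALENCE[x] * k for x, k in beta.items())


def molecular_graph_exists(beta):
    """Conditions (Gr1) and (Gr2)."""
    present = [x for x, k in beta.items() if k > 0]
    if not present:
        return False  # convention of this sketch: no atoms, no graph
    total = valence_sum(beta)
    largest = max(STANDARD_VALENCE[x] for x in present)
    return total % 2 == 0 and total - 2 * largest >= 0


def connected_molecular_graph_exists(beta):
    """Conditions (Gr1), (Gr2) and (Con)."""
    atoms = sum(beta.values())
    return molecular_graph_exists(beta) and valence_sum(beta) - 2 * atoms + 2 >= 0


if __name__ == "__main__":
    print(connected_molecular_graph_exists({"C": 6, "H": 6}))
    print(molecular_graph_exists({"C": 1, "H": 2}))
    print(molecular_graph_exists({"H": 4}), connected_molecular_graph_exists({"H": 4}))
```

The formula H4 illustrates the role of (Con): it passes (Gr1) and (Gr2), realized by two separate H2 molecules, but admits no connected molecular graph.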

1.3 Group actions on molecular graphs

| 35

Tests for the existence of molecular graphs become more involved if we admit various valences 𝑣_{𝑋,𝑖}, 𝑖 ∈ 𝑛_𝑋 = {0, . . . , 𝑛_𝑋 − 1}, where 𝑛_𝑋 − 1 is the maximal allowed valence of 𝑋. Now we introduce 𝜇_𝑋(𝑖), the number of atoms of element 𝑋 of valence 𝑣_{𝑋,𝑖} in a molecule with molecular formula 𝛽, obtaining the following refinement of 1.23:

1.24 Remark (Existence of molecular graphs, various valences) There exists at least one molecular graph with molecular formula 𝛽 if sequences 𝜇_𝑋 = (𝜇_𝑋(𝑖))_{𝑖∈𝑛_𝑋}, ∑_𝑖 𝜇_𝑋(𝑖) = 𝛽_𝑋, 𝑋 ∈ E, exist such that
(Gr1′) ∑_{𝑋∈E} ∑_{𝑖∈𝑛_𝑋} 𝑣_{𝑋,𝑖} 𝜇_𝑋(𝑖) is even and
(Gr2′) ∑_{𝑋∈E} ∑_{𝑖∈𝑛_𝑋} 𝑣_{𝑋,𝑖} 𝜇_𝑋(𝑖) − 2 max{𝑣_{𝑋,𝑖} | 𝑋 ∈ E, 𝑖 ∈ 𝑛_𝑋, 𝜇_𝑋(𝑖) > 0} ≥ 0.
For the existence of a connected molecular graph we need in addition that the following holds:
(Con′) ∑_{𝑋∈E} ∑_{𝑖∈𝑛_𝑋} 𝑣_{𝑋,𝑖} 𝜇_𝑋(𝑖) − 2 ∑_{𝑋∈E} 𝛽(𝑋) + 2 ≥ 0.

In the literature we find in this context the notion of double bond equivalent (DBE). The DBE of molecular formula 𝛽 is

\[ \mathrm{DBE}(\beta) = \frac{1}{2}\Bigl(2 + \sum_{X\in\mathcal{E}} \beta(X)\,(v_X - 2)\Bigr). \]

It is the number by which the number of lines exceeds the minimum required by (Con), i.e. the sum of number of cycles, number of double bonds, and twice the number of triple bonds in a molecule. Conditions (Gr1) and (Con) are usually formulated using DBE. For example, conditions (Con) and (Con′) are the same as DBE(𝛽) ≥ 0.

There is another expression corresponding to a given molecular formula. It contains information on the relative numbers of element atoms in a molecule:

1.25 Definition (Empirical formula) Suppose a set E of chemical elements and a molecular formula 𝛽 ∈ ℕ^E, 𝛽 ≠ 0. The empirical formula associated with 𝛽 is 𝛽′ ∈ ℕ^E, where

\[ \beta'(X) = \frac{\beta(X)}{\gcd \beta(\mathcal{E})} \]

for 𝑋 ∈ E. gcd 𝛽(E) means the greatest common divisor of the occurrence numbers 𝛽(𝑋), 𝑋 ∈ E.

For example, the empirical formula of C6 H6 over E4 is 𝛽󸀠 = (1, 1, 0, 0), the same as that of acetylene, C2 H2 , or of cyclooctatetraene, C8 H8 , if we order the elements of E4 in the sequence (H, C, N, O).
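The DBE and the empirical formula are both one-line computations on the occurrence numbers 𝛽(𝑋). A small sketch, again assuming standard valences for E4 and the element order (H, C, N, O) used in the example above; `dbe` and `empirical_formula` are illustrative names, not taken from the text.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Standard valences and element order for E4 (assumptions of this sketch).
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}
ORDER = ("H", "C", "N", "O")


def dbe(formula, valence=VALENCE):
    """DBE(beta) = (2 + sum of beta(X) * (v_X - 2)) / 2, kept exact as a Fraction."""
    return Fraction(2 + sum(k * (valence[x] - 2) for x, k in formula.items()), 2)


def empirical_formula(formula, order=ORDER):
    """Divide all occurrence numbers by their greatest common divisor."""
    counts = [formula.get(x, 0) for x in order]
    divisor = reduce(gcd, counts)
    if divisor == 0:
        raise ValueError("the molecular formula must be nonzero")
    return tuple(c // divisor for c in counts)
```

A negative or non-integral DBE signals that (Con) or (Gr1) fails; the function returns a `Fraction` so that this case stays visible instead of being rounded away.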

1.3 Group actions on molecular graphs

As a preliminary step towards construction, we introduce mathematical methods that allow the evaluation of the total number of 𝑚-multigraphs and also the number of 𝑚-multigraphs with given bond multiplicities. These numbers allow a rough estimate of the total number of molecules with given number of atoms or with given bond multiplicities. Later we shall also discuss tools for their construction and for the generation of samples. The mathematical methods used for group actions have been around since G. Pólya [234, 235] and are quite general and applicable also to the enumeration of permutational isomers, as we shall see later. First, we recall the basic definitions (cf. 1.10):
– Let 𝑋 and 𝑌 be two nonempty finite sets, consider the set 𝑌^𝑋 of all mappings 𝛾 from 𝑋 to 𝑌 and assume an action of 𝐺 on 𝑋. The given action of 𝐺 on 𝑋 yields the following action of 𝐺 on 𝑌^𝑋:

\[ G \times Y^X \to Y^X : (g,\gamma) \mapsto g\gamma, \quad\text{where } (g\gamma)(x) = \gamma(g^{-1}x). \]

The orbits 𝛾̄ = 𝐺(𝛾) were called the symmetry classes of mappings in 𝑌^𝑋 with respect to the action of 𝐺 on 𝑋. Important particular cases of symmetry classes of mappings are the unlabeled 𝑚-multigraphs on 𝑛 nodes that we identified with the following symmetry classes of mappings: Let 𝑛 = {0, . . . , 𝑛 − 1} denote the set of labels of the nodes and consider the symmetric group 𝑆_𝑛, the group of all relabelings. 𝑆_𝑛 acts on the set of labels and also on the set $\binom{n}{2}$ of pairs of labels of nodes in the following way:

\[ S_n \times \binom{n}{2} \to \binom{n}{2} : (\pi,\{i,j\}) \mapsto \{\pi i,\pi j\}. \]

This finally induces an action on the set of labeled 𝑚-multigraphs $Y^X = \mathcal{G}_{m,n} = m^{\binom{n}{2}}$:

\[ S_n \times \mathcal{G}_{m,n} \to \mathcal{G}_{m,n} : (\pi,\gamma) \mapsto \pi\gamma, \quad\text{where } (\pi\gamma)(\{i,j\}) = \gamma(\{\pi^{-1}i,\pi^{-1}j\}). \]

The set of symmetry classes

\[ S_n \backslash\backslash \mathcal{G}_{m,n} = S_n \backslash\backslash m^{\binom{n}{2}} \]

obtained gives the set of unlabeled 𝑚-multigraphs on 𝑛 nodes by taking a transversal and erasing the labels in each of the elements. Hence, it is our aim to count symmetry classes and to obtain a transversal.
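For very small 𝑛 the action of 𝑆_𝑛 on labeled 𝑚-multigraphs can be carried out literally, which gives both the number of symmetry classes and a transversal. The sketch below is a brute-force illustration under the assumption that a labeled multigraph is stored as a tuple of bond multiplicities indexed by the pairs {𝑖, 𝑗}; it picks the lexicographically smallest element of each orbit as representative, which is a choice made here and not prescribed by the text.

```python
from itertools import combinations, permutations, product


def unlabeled_multigraphs(m, n):
    """Return a transversal of the orbits of S_n on labeled m-multigraphs on n nodes.

    A labeled multigraph gamma is a tuple of multiplicities, one per pair {i, j}.
    """
    pairs = list(combinations(range(n), 2))
    index = {p: k for k, p in enumerate(pairs)}
    perms = list(permutations(range(n)))
    transversal = set()
    for gamma in product(range(m), repeat=len(pairs)):
        orbit = []
        for pi in perms:
            image = [0] * len(pairs)
            for k, (i, j) in enumerate(pairs):
                # (pi gamma)({pi i, pi j}) = gamma({i, j})
                a, b = sorted((pi[i], pi[j]))
                image[index[(a, b)]] = gamma[k]
            orbit.append(tuple(image))
        transversal.add(min(orbit))  # smallest orbit element as representative
    return sorted(transversal)
```

The cost grows like 𝑛! ⋅ 𝑚^(𝑛 choose 2), so this is only usable as a cross-check for the counting formulas of Subsection 1.3.1.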

1.3.1 Counting unlabeled structures First, we need a formula for the number of orbits. The following result originated from A. Cauchy and G. Frobenius (19th century) and is sometimes erroneously attributed to Burnside [220, 345] who in fact proved an even stronger result (see below).


1.26 Lemma (Cauchy–Frobenius, on the number of orbits) Consider a finite action 𝐺 × 𝑋 → 𝑋 : (𝑔, 𝑥) ↦ 𝑔𝑥, call an 𝑥 ∈ 𝑋 a fixed point of 𝑔 if 𝑔𝑥 = 𝑥, and denote the set of all these fixed points of 𝑔 by 𝑋_𝑔 = {𝑥 ∈ 𝑋 | 𝑔𝑥 = 𝑥}. Then, the number of orbits of 𝐺 on 𝑋 is the average number of fixed points:

\[ |G \backslash\backslash X| = \frac{1}{|G|} \sum_{g\in G} |X_g|. \]

1.27 Example (Naphthalene, cont.) In Example 1.7, the sets of fixed points of the four permutations of naphthalene are

𝑋_{𝜋_0} = ∅, 𝑋_{𝜋_1} = {0, 5}, 𝑋_{𝜋_2} = ∅, 𝑋_{𝜋_3} = {0, 1, 2, . . . , 9},

so that we obtain, according to Lemma 1.26,

\[ |G \backslash\backslash X| = \tfrac{1}{4}(0 + 2 + 0 + 10) = 3, \]

which is the correct number of orbits of C atoms in naphthalene, see the discussion following Definition 1.8. If instead of the four permutations we consider the eight symmetry elements of the naphthalene molecule, then

\[ |G \backslash\backslash X| = \tfrac{1}{8}(10 + 0 + 0 + 2 + 2 + 0 + 0 + 10) = 3 \]

again.

The Lemma of Cauchy–Frobenius is basic and very important. It holds since there is an interesting connection between the orbit 𝐺(𝑥) ⊆ 𝑋 and the stabilizer 𝐺_𝑥 = {𝑔 ∈ 𝐺 | 𝑔𝑥 = 𝑥} ⊆ 𝐺 of 𝑥 ∈ 𝑋, which is a subgroup of 𝐺. In fact, the following mapping is a bijection:

\[ G(x) \to \{ gG_x \mid g \in G \} : gx \mapsto gG_x = \{ gg' \mid g' \in G_x \}. \]

This is easily seen as follows:

\[ gx = g^{*}x \iff g^{-1}g^{*}x = x \iff g^{-1}g^{*} \in G_x \iff gG_x = g^{*}G_x. \]

Thus, the length |𝐺(𝑥)| of the orbit is equal to the index |𝐺|/|𝐺_𝑥| of the stabilizer. Using this, we can derive the number of orbits:

\[ \sum_{g\in G} |X_g| = \sum_{g\in G}\,\sum_{x:\,gx=x} 1 = \sum_{x\in X}\,\sum_{g:\,gx=x} 1 = \sum_{x\in X} |G_x| = \sum_{x\in X} \frac{|G|}{|G(x)|} = |G| \cdot |G \backslash\backslash X|. \]
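The lemma and the orbit-stabilizer relation behind it can be verified numerically on any small permutation group. The sketch below uses list notation for permutations and a hypothetical commutative group of order four generated by the transpositions (01) and (23) on five points; it is not the naphthalene group of Example 1.7, whose permutations are not reproduced here.

```python
def compose(p, q):
    """Return the permutation x -> p(q(x)) in list notation."""
    return tuple(p[q[x]] for x in range(len(p)))


def generate_group(generators):
    """Close a set of generators under multiplication (finite case)."""
    identity = tuple(range(len(generators[0])))
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def orbit_of(group, x):
    return frozenset(g[x] for g in group)


def stabilizer(group, x):
    return {g for g in group if g[x] == x}


def orbits(group, n):
    return {orbit_of(group, x) for x in range(n)}


def orbit_count_cauchy_frobenius(group):
    """Average number of fixed points over all group elements."""
    total = sum(sum(1 for x in range(len(g)) if g[x] == x) for g in group)
    return total // len(group)
```

With `G = generate_group([(1, 0, 2, 3, 4), (0, 1, 3, 2, 4)])` the fixed point numbers are 5, 3, 3 and 1, so the average is 3, matching the three orbits {0, 1}, {2, 3} and {4}.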

Further details of this remarkable connection between orbits and stabilizers will be given below in the Fundamental Lemma 1.37.

An alternative expression for the number of orbits is more economical if the acting group is not commutative. In this case it does not require a sum over all group elements. The argument is as follows:
– Consider the following mapping: 𝐺 × 𝐺 → 𝐺 : (ℎ, 𝑔) ↦ ℎ𝑔ℎ^{−1}. It is easy to verify that this is an action of 𝐺 on itself, called the conjugation action. By definition, the orbit of 𝑔 is {ℎ𝑔ℎ^{−1} | ℎ ∈ 𝐺}. This subset of 𝐺 arises from 𝑔 via conjugation; it is therefore called the conjugacy class of 𝑔 in 𝐺 and denoted by 𝐶_𝐺(𝑔). Summarizing, we obtain 𝐶_𝐺(𝑔) = {ℎ𝑔ℎ^{−1} | ℎ ∈ 𝐺}. Being orbits, the different conjugacy classes are disjoint and form a set-partition of 𝐺. Of course, if 𝐺 is commutative, then ℎ𝑔ℎ^{−1} = 𝑔, so that 𝐶_𝐺(𝑔) = {𝑔} and this partition of 𝐺 is trivial. Moreover, the number of fixed points is constant on each conjugacy class:

\[ |X_g| = |X_{hgh^{-1}}|. \]

The reason is that one can easily construct a bijection between the two sets of fixed points of 𝑔 and of ℎ𝑔ℎ^{−1} by mapping 𝑥 ∈ 𝑋_𝑔 onto ℎ𝑥. Hence, if C denotes a transversal of the different conjugacy classes, we obtain the following simplified expression for the number of orbits,

\[ |G \backslash\backslash X| = \frac{1}{|G|} \sum_{g\in\mathcal{C}} |C_G(g)| \cdot |X_g|. \tag{1.5} \]

This formula does not help in the commutative case since conjugacy classes in commutative groups consist of a single element, so that C is the whole group. For example, the group 𝐺 of four symmetry operations of naphthalene is commutative. However, the formula is very useful in the noncommutative case, supposing we know the orders |𝐶_𝐺(𝑔)| and a transversal C. This holds for symmetric groups 𝑆_𝑛 as soon as 𝑛 ≥ 3. Below we shall describe the conjugacy classes of 𝑆_𝑛, give the conjugacy classes of the symmetric group 𝑆_4 as an example and show how (1.5) can be applied for counting unlabeled 𝑚-multigraphs on four nodes.

Since 𝛾 ∈ 𝑌^𝑋 is a fixed point if and only if 𝛾(𝑥) = 𝛾(𝑔𝑥) = 𝛾(𝑔^2𝑥) = . . . , i.e. if and only if 𝛾 is constant on the orbits of the group ⟨𝑔⟩ = {1_𝐺, 𝑔, 𝑔^2, . . .} generated by 𝑔, we obtain the following helpful result when applying the above to the symmetry classes of mappings:


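1.28 Remark (The number of unlabeled 𝑚-multigraphs on 𝑛 nodes)
– The number of symmetry classes of mappings is

\[ |G \backslash\backslash Y^X| = \frac{1}{|G|} \sum_{g\in G} |Y|^{|\langle g\rangle \backslash\backslash X|} = \frac{1}{|G|} \sum_{g\in\mathcal{C}} |C_G(g)| \cdot |Y|^{|\langle g\rangle \backslash\backslash X|}. \]

– Correspondingly, we obtain for the number of unlabeled 𝑚-multigraphs:

\[ |S_n \backslash\backslash \mathcal{G}_{m,n}| = \frac{1}{n!} \sum_{\pi\in S_n} m^{|\langle\pi\rangle \backslash\backslash \binom{n}{2}|} = \frac{1}{n!} \sum_{\pi\in\mathcal{C}} |C_{S_n}(\pi)| \cdot m^{|\langle\pi\rangle \backslash\backslash \binom{n}{2}|}. \]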
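The first form of the formula in Remark 1.28 can be evaluated directly by running over all of 𝑆_𝑛 and counting, for each 𝜋, the orbits of ⟨𝜋⟩ on the pairs of nodes. A minimal sketch, with permutations in list notation; the function names are chosen here for illustration.

```python
from itertools import combinations, permutations
from math import factorial


def cycles_on_pairs(pi):
    """Number of orbits of <pi> on the 2-element subsets of {0, ..., n-1}."""
    seen, count = set(), 0
    for pair in combinations(range(len(pi)), 2):
        if pair in seen:
            continue
        count += 1
        q = pair
        while q not in seen:  # walk once around the cycle containing this pair
            seen.add(q)
            q = tuple(sorted((pi[q[0]], pi[q[1]])))
    return count


def count_unlabeled_multigraphs(m, n):
    """Sum m^(number of pair orbits) over all of S_n and divide by n!."""
    total = sum(m ** cycles_on_pairs(pi) for pi in permutations(range(n)))
    return total // factorial(n)
```

This sums over all 𝑛! permutations and therefore does not yet exploit conjugacy classes; it only avoids enumerating the 𝑚^(𝑛 choose 2) labeled graphs themselves.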

In order to apply these results we briefly discuss the conjugacy classes of the symmetric group 𝑆_𝑛. The aims are an explicit description of the conjugacy class of 𝜋 ∈ 𝑆_𝑛 and of $|\langle\pi\rangle \backslash\backslash \binom{n}{2}|$, both in terms of the ‘cycle structure’ of 𝜋.
– A permutation 𝜋 ∈ 𝑆_𝑛 is written down in full detail by putting the images 𝜋𝑖 in a row under the 𝑖 ∈ 𝑛, say

\[ \pi = \begin{pmatrix} 0 & \dots & n-1 \\ \pi 0 & \dots & \pi(n-1) \end{pmatrix}. \]

This will be abbreviated by

\[ \pi = \binom{i}{\pi i}. \]

Hence, for example, 𝑆_3 consists of the following elements:

\[ \begin{pmatrix} 0\,1\,2 \\ 0\,1\,2 \end{pmatrix},\ \begin{pmatrix} 0\,1\,2 \\ 1\,0\,2 \end{pmatrix},\ \begin{pmatrix} 0\,1\,2 \\ 2\,1\,0 \end{pmatrix},\ \begin{pmatrix} 0\,1\,2 \\ 0\,2\,1 \end{pmatrix},\ \begin{pmatrix} 0\,1\,2 \\ 1\,2\,0 \end{pmatrix},\ \begin{pmatrix} 0\,1\,2 \\ 2\,0\,1 \end{pmatrix}. \]

The points forming the first row need not be written in their natural order, e.g.

\[ \begin{pmatrix} 0\,1\,2 \\ 1\,2\,0 \end{pmatrix} = \begin{pmatrix} 1\,0\,2 \\ 2\,1\,0 \end{pmatrix}. \]

Keeping this in mind, we call a permutation 𝜋 ∈ 𝑆_𝑛 a cyclic permutation or a cycle if and only if it can be written in the form

\[ \begin{pmatrix} i_0 & i_1 & \dots & i_{r-2} & i_{r-1} & i_r & \dots & i_{n-1} \\ i_1 & i_2 & \dots & i_{r-1} & i_0 & i_r & \dots & i_{n-1} \end{pmatrix}, \]

where 𝑟 > 0. In order to emphasize 𝑟, the number of points that are cyclically permuted, 𝑖_0 ↦ 𝑖_1 ↦ . . . ↦ 𝑖_{𝑟−1} ↦ 𝑖_0, we also call it an 𝑟-cycle, neglecting for the moment 𝑖_𝑟 up to 𝑖_{𝑛−1}, the fixed points. In fact, we shall not hesitate to call both this cyclic shift (considered as a permutation of the set {𝑖_0, . . . , 𝑖_{𝑟−1}}) as well as this cyclic shift together with the fixed points (thus considered as a permutation of the set {𝑖_0, . . . , 𝑖_{𝑛−1}}) an 𝑟-cycle, supposing the set of symbols is clear.

– We note that in this case the orbits of the subgroup ⟨𝜋⟩ generated by this permutation are the following subsets of 𝑛:

⟨𝜋⟩\\𝑛 = {{𝑖_0, . . . , 𝑖_{𝑟−1}}, {𝑖_𝑟}, . . . , {𝑖_{𝑛−1}}}.

We therefore abbreviate this cycle by (𝑖_0, . . . , 𝑖_{𝑟−1})(𝑖_𝑟) . . . (𝑖_{𝑛−1}), where the points that are cyclically permuted are put together in parentheses. For example

\[ \begin{pmatrix} 0\,1\,2 \\ 0\,2\,1 \end{pmatrix} = (1,2)(0). \]

Unless confusion can arise, i.e. if 𝑛 ≤ 10, commas separating the points may be omitted, (1, 2)(0) = (12)(0), and for given 𝑛, 1-cycles may be left out if it is clear which 𝑛 is meant,

\[ \begin{pmatrix} 0\,1\,2 \\ 0\,2\,1 \end{pmatrix} = (1,2)(0) = (12). \]

Hence we can briefly write 𝜋 = (𝑖_0 . . . 𝑖_{𝑟−1}) for the 𝑟-cycle introduced above. This cycle 𝜋 can also be expressed in terms of 𝑖_0 alone:

𝜋 = (𝑖_0 𝜋𝑖_0 . . . 𝜋^{𝑟−1}𝑖_0).

By denoting the identity element of 𝑆_𝑛 with 1 = 1_{𝑆_𝑛} = (0) ⋅ ⋅ ⋅ (𝑛 − 1) and using the abbreviations introduced above, we obtain

𝑆_3 = {1, (01), (02), (12), (012), (021)}.





The notation for a cyclic permutation is not uniquely determined, since

(𝑖_0 . . . 𝑖_{𝑟−1}) = (𝑖_1 . . . 𝑖_{𝑟−1} 𝑖_0) = . . . = (𝑖_{𝑟−1} 𝑖_0 . . . 𝑖_{𝑟−2}),

but the convention is that one usually starts with the smallest entry. 2-cycles are called transpositions. The order of a cycle (𝑖_0 . . . 𝑖_{𝑟−1}), i.e. the order of the group ⟨(𝑖_0 . . . 𝑖_{𝑟−1})⟩ generated by this cycle, is equal to its length:

|⟨(𝑖_0 . . . 𝑖_{𝑟−1})⟩| = 𝑟.

Two cycles 𝜋 and 𝜌 are called disjoint if the two sets of points which are not fixed by 𝜋 and 𝜌 are disjoint sets. Note that, for example, 1 = (0)(1)(2) and (012), a 1-cycle and a 3-cycle, are disjoint cycles since the sets of symbols that are not fixed are the empty set ∅ and {0, 1, 2}. Disjoint cycles 𝜋 and 𝜌 commute, 𝜋𝜌 = 𝜌𝜋. Each permutation of a finite set can be written as a product of pairwise different disjoint cycles, e.g.

\[ \begin{pmatrix} 0\,1\,2\,3\,4\,5\,6\,7 \\ 6\,4\,1\,5\,2\,3\,0\,7 \end{pmatrix} = (06)(142)(53)(7). \]

The disjoint cyclic factors ≠ 1 of 𝜋 ∈ 𝑆_𝑛 are uniquely determined by 𝜋 and therefore we call these factors together with the fixed point cycles of 𝜋 the cyclic factors of 𝜋. Let 𝑐(𝜋) denote the number of these cyclic factors of 𝜋 (including 1-cycles), let 𝑙_𝜈 be their lengths, 𝜈 ∈ 𝑐(𝜋) = {0, . . . , 𝑐(𝜋) − 1}, and choose for each 𝜈 an element 𝑗_𝜈 of the 𝜈-th cyclic factor. Then

\[ \pi = \prod_{\nu\in c(\pi)} (j_\nu\ \pi j_\nu\ \dots\ \pi^{l_\nu-1} j_\nu). \tag{1.6} \]


This notation becomes unique if we choose for the 𝑗_𝜈 the smallest elements in their cycles and number the cycles so that 𝑗_0 < 𝑗_1 < . . ., in formal terms:

for all 𝑚 ∈ ℕ : 𝑗_𝜈 ≤ 𝜋^𝑚 𝑗_𝜈, and for all 𝜈 < 𝑐(𝜋) − 1 : 𝑗_𝜈 < 𝑗_{𝜈+1}.

If this holds, then (1.6) is called the standard cycle notation for 𝜋. We note in passing that the sets {𝑗_𝜈, 𝜋𝑗_𝜈, . . . , 𝜋^{𝑙_𝜈−1}𝑗_𝜈} of points which are cyclically permuted by 𝜋 are just the orbits of the group ⟨𝜋⟩ generated by 𝜋:

⟨𝜋⟩\\𝑛 = {{𝑗_𝜈, 𝜋𝑗_𝜈, . . . , 𝜋^{𝑙_𝜈−1}𝑗_𝜈} | 𝜈 ∈ 𝑐(𝜋)}.

For an example we recall from above that

\[ \begin{pmatrix} 0\,1\,2\,3\,4\,5\,6\,7 \\ 6\,4\,1\,5\,2\,3\,0\,7 \end{pmatrix} = (06)(142)(53)(7), \]

the standard cycle notation of which is

\[ \begin{pmatrix} 0\,1\,2\,3\,4\,5\,6\,7 \\ 6\,4\,1\,5\,2\,3\,0\,7 \end{pmatrix} = (06)(142)(35)(7). \]

The set of orbits of the group generated by this permutation turns out to be

⟨(06)(142)(35)(7)⟩\\{0, . . . , 7} = {{0, 6}, {1, 2, 4}, {3, 5}, {7}}.

– Permutations are usually entered in a computer using the list notation, that is, up to commas, the second row of the notation introduced above. For example

\[ \pi = \begin{pmatrix} 0\,1\,2\,3\,4\,5\,6\,7 \\ 6\,4\,1\,5\,2\,3\,0\,7 \end{pmatrix} = [6, 4, 1, 5, 2, 3, 0, 7] = [\pi(0), \dots, \pi(7)]. \]
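Converting list notation into standard cycle notation amounts to following each point around its cycle, always starting from the smallest point not yet visited. A short sketch; the helper names `standard_cycles` and `format_cycles` are introduced here and the string format assumes 𝑛 ≤ 10 so that commas can be dropped.

```python
def standard_cycles(perm):
    """Cyclic factors of a permutation in list notation, in standard order.

    Each cycle starts with its smallest element and the cycles are sorted
    by these starting elements, since points are scanned in increasing order.
    """
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        cycles.append(tuple(cycle))
    return cycles


def format_cycles(cycles):
    """Write the cycles without commas, e.g. (06)(142)(35)(7)."""
    return "".join("(" + "".join(str(x) for x in c) + ")" for c in cycles)
```

The point sets of the returned cycles are exactly the orbits of ⟨𝜋⟩ on 𝑛, so the same routine also yields ⟨𝜋⟩\\𝑛.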



Having described the elements of 𝑆_𝑛, we show which of them are in the same conjugacy class, i.e. in the same orbit of the group 𝑆_𝑛 on the set 𝑆_𝑛 under the conjugation action

\[ S_n \times S_n \to S_n : (\rho,\pi) \mapsto \rho\pi\rho^{-1}. \]

In order to do this, we first note how 𝜌𝜋𝜌^{−1} is obtained from 𝜋:

\[ \rho\pi\rho^{-1} = \binom{i}{\rho i}\binom{i}{\pi i}\binom{\rho i}{i} = \binom{\rho i}{\rho(\pi i)}. \]

Thus, in terms of cyclic factors of 𝜋 = ⋅ ⋅ ⋅ (. . . 𝑖 𝜋(𝑖) . . .) ⋅ ⋅ ⋅ , 𝜌𝜋𝜌^{−1} arises by simply applying the mapping 𝜌 to the points in the cycles of 𝜋:

𝜌𝜋𝜌^{−1} = ⋅ ⋅ ⋅ (. . . 𝜌𝑖 𝜌(𝜋𝑖) . . .) ⋅ ⋅ ⋅ .

This equation shows that the lengths of the cyclic factors of 𝜋 are the same as those of 𝜌𝜋𝜌^{−1}. It is easy to see that, conversely, for any two elements 𝜋, 𝜎 ∈ 𝑆_𝑛 with the same lengths of cyclic factors there exists a 𝜌 ∈ 𝑆_𝑛 such that 𝜌𝜋𝜌^{−1} = 𝜎. Hence the lengths 𝑙_𝜈 of the cyclic factors of 𝜋 characterize its conjugacy class.

– Given a 𝜋 ∈ 𝑆_𝑛 in standard cycle notation, we order the lengths 𝛼_𝑖(𝜋), 𝑖 ∈ 𝑐(𝜋), of its cyclic factors in decreasing order, so that 𝛼_0(𝜋) ≥ 𝛼_1(𝜋) ≥ . . . , obtaining the number partition

\[ \alpha(\pi) = (\alpha_0(\pi), \alpha_1(\pi), \dots, \alpha_{c(\pi)-1}(\pi)) \]

of 𝑛, the sum of the elements of 𝛼(𝜋), for short: 𝛼(𝜋) ⊢ 𝑛, which we call the cycle partition of 𝜋, where 𝛼(𝜋) ⊢ 𝑛 is read as ‘𝛼(𝜋) is a partition of the number 𝑛’. The corresponding 𝑛-tuple

\[ a(\pi) = (a_1(\pi), \dots, a_n(\pi)), \quad\text{where } a_i(\pi) = |\{ j \mid \alpha_j(\pi) = i \}|, \]

consisting of the occurrence numbers 𝑎_𝑖(𝜋) of the parts of length 𝑖 in 𝛼(𝜋), is the type of the cycle partition and called the cycle type of 𝜋. Correspondingly we call an 𝑛-tuple 𝑎 = (𝑎_1, . . . , 𝑎_𝑛) a cycle type of 𝑛 if and only if each 𝑎_𝑖 ∈ ℕ, and ∑ 𝑖⋅𝑎_𝑖 = 𝑛. This will be abbreviated by 𝑎 ⊢⊣ 𝑛. Read 𝑎 ⊢⊣ 𝑛 as ‘𝑎 is a cycle type of the number 𝑛’. Using the 𝑎_𝑖 we can abbreviate the cycle partition 𝛼(𝜋) as

\[ \alpha(\pi) = (n^{a_n(\pi)}, \dots, 1^{a_1(\pi)}), \]

where entries with exponent 0 are omitted and the exponent 1 is not written. For example,

𝜋 = (06)(142)(35)(7) has 𝛼(𝜋) = (3, 2, 2, 1) ⊢ 8, 𝑎(𝜋) = (1, 2, 1) ⊢⊣ 8.
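Cycle partition and cycle type are both derived from the list of cycle lengths. The following sketch computes them for a permutation in list notation; unlike the abbreviated example above, `cycle_type` returns the full 𝑛-tuple including trailing zeros.

```python
from collections import Counter


def cycle_lengths(perm):
    """Lengths of the cyclic factors (including 1-cycles) of perm."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        if length:
            lengths.append(length)
    return lengths


def cycle_partition(perm):
    """alpha(pi): the cycle lengths in decreasing order."""
    return tuple(sorted(cycle_lengths(perm), reverse=True))


def cycle_type(perm):
    """a(pi) = (a_1, ..., a_n), where a_i is the number of i-cycles."""
    counts = Counter(cycle_lengths(perm))
    return tuple(counts[i] for i in range(1, len(perm) + 1))
```

For 𝜋 = [6, 4, 1, 5, 2, 3, 0, 7] this gives the partition (3, 2, 2, 1) and the type (1, 2, 1, 0, 0, 0, 0, 0).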

The conjugacy class of 𝜋 ∈ 𝑆𝑛 will be denoted by 𝐶𝑆𝑛 (𝜋), so that we obtain the following descriptions and properties of conjugacy classes of a symmetric group 𝑆𝑛 acting on 𝑛 that we need to count the number of unlabeled 𝑚-multigraphs on 𝑛 economically:

1.29 Remark (The conjugacy classes of symmetric groups) We recall that two elements 𝜋 and 𝜎 of 𝑆_𝑛 are conjugate if and only if they have the same cycle partition, or, in other words, if and only if they are of the same cycle type:

𝐶_{𝑆_𝑛}(𝜋) = 𝐶_{𝑆_𝑛}(𝜎) ⇐⇒ 𝛼(𝜋) = 𝛼(𝜎) ⇐⇒ 𝑎(𝜋) = 𝑎(𝜎).

Moreover, as 𝜋 is of the same cycle type as 𝜋^{−1}, each permutation and its inverse are conjugates in the symmetric group: 𝐶_{𝑆_𝑛}(𝜋) = 𝐶_{𝑆_𝑛}(𝜋^{−1}).


The order of a conjugacy class is the index of the stabilizer, and hence, since 𝑛! = 1 ⋅ 2 ⋅ ⋅ ⋅ 𝑛 is the order of the symmetric group 𝑆_𝑛,

\[ |C_{S_n}(\pi)| = \frac{n!}{\prod_i i^{a_i(\pi)}\, a_i(\pi)!}. \]

Since the cyclic factors of 𝜋 commute, the order of the group ⟨𝜋⟩ generated by 𝜋 is the least common multiple of the lengths of the cyclic factors:

|⟨𝜋⟩| = lcm{𝛼_𝑖(𝜋) | 𝑖 ∈ 𝑐(𝜋)} = lcm{𝑖 | 𝑎_𝑖(𝜋) > 0}.

Finally, each partition 𝛼 ⊢ 𝑛 occurs as the cycle partition of a permutation 𝜋 in 𝑆_𝑛, and hence for the cycle types 𝑎 ⊢⊣ 𝑛, too.

1.30 Example (Numbers of unlabeled 𝑚-multigraphs on 4 nodes)
– These graphs are orbits of the symmetric group 𝑆_4 which acts on the set 4 = {0, 1, 2, 3}. Its conjugacy classes correspond to the different cycle partitions or cycle types,

𝐶_{𝑆_4}(1) = {1},
𝐶_{𝑆_4}((01)) = {(01), (02), (03), (12), (13), (23)},
𝐶_{𝑆_4}((01)(23)) = {(01)(23), (02)(13), (03)(12)},
𝐶_{𝑆_4}((012)) = {(012), (021), (013), (031), (023), (032), (123), (132)},
𝐶_{𝑆_4}((0123)) = {(0123), (0132), (0213), (0231), (0312), (0321)}.

We list a transversal of the conjugacy classes together with the corresponding cycle partition, the cycle type and the order of the class considered:



representative 𝜋    partition 𝛼(𝜋)    type 𝑎(𝜋)       order |𝐶_{𝑆_4}(𝜋)|
1                   (1^4)             (4)             1
(01)                (2, 1^2)          (2, 1)          6
(01)(23)            (2^2)             (0, 2)          3
(012)               (3, 1)            (1, 0, 1)       8
(0123)              (4)               (0, 0, 0, 1)    6
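The class orders in this table follow from the formula 𝑛!/∏ 𝑖^{𝑎_𝑖} 𝑎_𝑖! and can be recomputed from the cycle types alone. A small sketch; the dictionary `S4_TYPES` simply restates the table with trailing zeros filled in, and the function names are illustrative.

```python
from functools import reduce
from math import factorial, gcd


def class_order(a):
    """Order of the conjugacy class of cycle type a = (a_1, ..., a_n) in S_n."""
    n = sum(i * a_i for i, a_i in enumerate(a, start=1))
    denominator = 1
    for i, a_i in enumerate(a, start=1):
        denominator *= i ** a_i * factorial(a_i)
    return factorial(n) // denominator


def generated_group_order(a):
    """Order of <pi> for pi of cycle type a: the lcm of the occurring lengths."""
    lengths = [i for i, a_i in enumerate(a, start=1) if a_i > 0]
    return reduce(lambda x, y: x * y // gcd(x, y), lengths, 1)


# Cycle types of the five class representatives of S_4, written as 4-tuples.
S4_TYPES = {
    "1": (4, 0, 0, 0),
    "(01)": (2, 1, 0, 0),
    "(01)(23)": (0, 2, 0, 0),
    "(012)": (1, 0, 1, 0),
    "(0123)": (0, 0, 0, 1),
}
```

Summing `class_order` over all cycle types of 𝑛 must give 𝑛!, which is a convenient consistency check for such tables.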

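According to Remark 1.28, the number of unlabeled 𝑚-multigraphs on 4 is

\[ |S_4 \backslash\backslash \mathcal{G}_{m,4}| = \frac{1}{4!} \sum_{\pi\in S_4} m^{|\langle\pi\rangle \backslash\backslash \binom{4}{2}|} = \frac{1}{4!} \sum_{\pi\in\mathcal{C}} |C_{S_4}(\pi)| \cdot m^{|\langle\pi\rangle \backslash\backslash \binom{4}{2}|}. \]

Using the characterization of conjugacy classes of symmetric groups by cycle types 𝑎, and the known orders of these classes, we obtain:

\[ |S_4 \backslash\backslash \mathcal{G}_{m,4}| = \frac{1}{4!} \sum_{a \,\vdash\dashv\, 4} \frac{4!}{\prod_i i^{a_i}\, a_i!} \cdot m^{|\langle\pi\rangle \backslash\backslash \binom{4}{2}|}. \]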

Next, we need to evaluate the number of orbits $|\langle\pi\rangle \backslash\backslash \binom{4}{2}|$ for the elements of a transversal of the conjugacy classes of 𝑆_4. Below is a table for 𝑛 = 4, containing representatives of the conjugacy classes of 𝑆_4, the orders of the conjugacy classes and the numbers of orbits of the representatives on the set of six pairs of nodes:

representative 𝜋    |𝐶_{𝑆_4}(𝜋)|    number of orbits of ⟨𝜋⟩ on $\binom{4}{2}$
1                   1               6
(01)                6               4
(01)(23)            3               4
(012)               8               2
(0123)              6               2

The number of unlabeled 𝑚-multigraphs on 4 nodes is thus

\[ |S_4 \backslash\backslash \mathcal{G}_{m,4}| = \frac{1}{24}\bigl(1 \cdot m^6 + 6 \cdot m^4 + 3 \cdot m^4 + 8 \cdot m^2 + 6 \cdot m^2\bigr). \]

By setting 𝑚 = 2 we obtain the number of simple graphs on 4 nodes:

\[ |S_4 \backslash\backslash \mathcal{G}_{2,4}| = \frac{1}{24}\bigl(1 \cdot 2^6 + 6 \cdot 2^4 + 8 \cdot 2^2 + 3 \cdot 2^4 + 6 \cdot 2^2\bigr) = 11, \]

while 𝑚 = 4 yields

\[ |S_4 \backslash\backslash \mathcal{G}_{4,4}| = \frac{1}{24}\bigl(1 \cdot 4^6 + 6 \cdot 4^4 + 3 \cdot 4^4 + 8 \cdot 4^2 + 6 \cdot 4^2\bigr) = 276. \]

We came across the numbers 11 and 276 already in a table at the end of Section 1.1. Several of these graphs are not connected; the connected ones can be colored and extended, giving molecular graphs. For example in the 4-multigraph

[Figure: a 4-multigraph on four nodes]

we replaced the nodes by element symbols as follows,

[Figure: the same graph with its nodes labeled N, C, O, H]

obtaining the structural formula of cyanic acid (see drawing (1.1)).
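The calculation of Example 1.30 generalizes to any 𝑛 by summing over cycle partitions instead of permutations, following (1.5). The sketch below relies on a standard fact that is not spelled out in this form in the text above: a cycle of length 𝑙 contributes ⌊𝑙/2⌋ orbits on the pairs inside it, and two different cycles of lengths 𝑟 and 𝑠 contribute gcd(𝑟, 𝑠) orbits on the pairs between them. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import factorial, gcd


def partitions(n, largest=None):
    """Yield all partitions of n as decreasing tuples of positive parts."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield (part,) + rest


def class_order(alpha):
    """n! / prod(i^a_i * a_i!) for the cycle partition alpha of n."""
    denominator = 1
    for i, a_i in Counter(alpha).items():
        denominator *= i ** a_i * factorial(a_i)
    return factorial(sum(alpha)) // denominator


def pair_orbits(alpha):
    """Number of orbits on pairs of nodes for a permutation with cycle partition alpha."""
    within = sum(length // 2 for length in alpha)
    between = sum(gcd(r, s) for r, s in combinations(alpha, 2))
    return within + between


def count_by_cycle_types(m, n):
    """Number of unlabeled m-multigraphs on n nodes via conjugacy classes."""
    total = sum(class_order(alpha) * m ** pair_orbits(alpha) for alpha in partitions(n))
    return total // factorial(n)
```

For 𝑛 = 4 the partitions (1,1,1,1), (2,1,1), (2,2), (3,1) and (4) give 6, 4, 4, 2 and 2 orbits, in agreement with the table above.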

1.3.2 Counting by weight

Since we wish to obtain the number of unlabeled 𝑚-multigraphs with prescribed bond multiplicities, we need to refine the thoughts from above. This can be done using Pólya’s Theorem [234, 235], which allows us to count symmetry classes of mappings by content. It yields a generating function, i.e. a polynomial with the desired numbers of symmetry classes of mappings as coefficients of its monomial summands. For example, the generating function for the numbers of simple graphs on 4 nodes turns out to be

\[ y_0^6 + y_0^5 y_1 + 2 y_0^4 y_1^2 + 3 y_0^3 y_1^3 + 2 y_0^2 y_1^4 + y_0 y_1^5 + y_1^6. \]

The coefficient 3 of its monomial summand 3𝑦_0^3𝑦_1^3 indicates that there are exactly three unlabeled simple graphs of content (3, 3), which means three bonds of multiplicity 0 (non-bonds) and three bonds of multiplicity 1 (single bonds). We shall show how this generating function can be obtained.


1.31 Remark (The number of symmetry classes by content)
– First, we number the elements of 𝑌, setting 𝑌 = {𝑦_0, . . . , 𝑦_{𝑚−1}}, and we consider the 𝑦_𝑖 ∈ 𝑌 as indeterminates, i.e. we form the ring ℚ[𝑌] = ℚ[𝑦_0, . . . , 𝑦_{𝑚−1}] of multivariate polynomials with rational coefficients in the indeterminates 𝑦_𝑖, 𝑖 ∈ 𝑚. (It suffices to consider polynomials with rational coefficients since all we need is that we can divide by integers.) The monomial

\[ w(\gamma) = \prod_{x\in X} \gamma(x) = \prod_{i\in m} y_i^{|\gamma^{-1}(y_i)|} \in \mathbb{Q}[Y] \]

is called the weight of 𝛾 ∈ 𝑌^𝑋. The sequence of its exponents,

\[ con(\gamma) = (|\gamma^{-1}(y_0)|, \dots, |\gamma^{-1}(y_{m-1})|) = (c_0, \dots, c_{m-1}), \]

is called the content of 𝛾.
– We denote the set of orbits of length 𝑖 of ⟨𝑔⟩ on 𝑋 by ⟨𝑔⟩\\_𝑖 𝑋, recalling that the number of these orbits is the number of 𝑖-cycles of the permutation 𝑔̄ induced by 𝑔 ∈ 𝐺 on the set 𝑋,

\[ |\langle g\rangle \backslash\backslash_i X| = a_i(\bar g). \]

According to Pólya, the desired number of symmetry classes 𝛾̄ of mappings 𝛾 ∈ 𝑌^𝑋 of content 𝑐 = (𝑐_0, . . . , 𝑐_{𝑚−1}) is the coefficient of the monomial ∏_{𝑖∈𝑚} 𝑦_𝑖^{𝑐_𝑖} in the polynomial

\[ \frac{1}{|G|} \sum_{g\in G} \prod_{i=1}^{|X|} \bigl(y_0^i + \dots + y_{m-1}^i\bigr)^{|\langle g\rangle \backslash\backslash_i X|} = \frac{1}{|G|} \sum_{g\in\mathcal{C}} |C_G(g)| \cdot \prod_{i=1}^{|X|} \bigl(y_0^i + \dots + y_{m-1}^i\bigr)^{a_i(\bar g)}. \]

– This generating function is easily obtained from another polynomial, called cycle index polynomial. It displays, in the exponents of its summands, the cycle structure of the permutation 𝑔̄ induced by 𝑔 ∈ 𝐺 on 𝑋:

\[ C(G, X) = \frac{1}{|G|} \sum_{g\in G} \prod_{1\le i\le |X|} z_i^{a_i(\bar g)} = \frac{1}{|G|} \sum_{g\in\mathcal{C}} |C_G(g)| \cdot \prod_{1\le i\le |X|} z_i^{a_i(\bar g)}. \]

Thus, the generating function for the enumeration of symmetry classes of mappings is obtained from the cycle index polynomial by simply replacing the indeterminate 𝑧_𝑖 by the ‘power sum symmetric function’ ∑_𝑗 𝑦_𝑗^𝑖 (a polynomial is called symmetric if it is invariant under permutations of the indeterminates), using |⟨𝑔⟩\\_𝑖 𝑋| = 𝑎_𝑖(𝑔̄):

\[ \frac{1}{|G|} \sum_{g\in\mathcal{C}} |C_G(g)| \cdot \prod_{i=1}^{|X|} \Bigl( \sum_{y_j\in Y} y_j^i \Bigr)^{a_i(\bar g)} = C(G, X)\big|_{z_i = \sum_j y_j^i}. \]

We call the resulting polynomial the group reduction function.
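The substitution 𝑧_𝑖 ↦ ∑_𝑗 𝑦_𝑗^𝑖 can be carried out with elementary polynomial arithmetic. The sketch below represents a polynomial in 𝑦_0, . . . , 𝑦_{𝑚−1} as a mapping from exponent tuples to coefficients and sums over all group elements rather than over a transversal of conjugacy classes; the representation and all function names are choices made here for illustration.

```python
from collections import Counter
from itertools import combinations, permutations


def cycle_lengths(perm):
    """Cycle lengths of a permutation in list notation."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        if length:
            lengths.append(length)
    return lengths


def poly_mul(p, q):
    """Multiply two polynomials stored as {exponent tuple: coefficient}."""
    out = Counter()
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            out[tuple(a + b for a, b in zip(e1, e2))] += c1 * c2
    return out


def power_sum(i, m):
    """The polynomial y_0^i + ... + y_{m-1}^i."""
    return Counter({tuple(i if k == j else 0 for k in range(m)): 1 for j in range(m)})


def content_generating_function(group, m):
    """Cycle index of the group with z_i replaced by the i-th power sum."""
    total = Counter()
    for g in group:
        term = Counter({(0,) * m: 1})
        for length in cycle_lengths(g):
            term = poly_mul(term, power_sum(length, m))
        total.update(term)  # Counter.update adds coefficients
    return {e: c // len(group) for e, c in total.items()}


def induced_on_pairs(pi):
    """Permutation induced by pi on the 2-element subsets, in list notation."""
    pairs = list(combinations(range(len(pi)), 2))
    index = {p: k for k, p in enumerate(pairs)}
    return tuple(index[tuple(sorted((pi[i], pi[j])))] for i, j in pairs)
```

Applied to `[induced_on_pairs(pi) for pi in permutations(range(4))]` with 𝑚 = 2, the coefficients at the contents (6, 0), (5, 1), . . . , (0, 6) are 1, 1, 2, 3, 2, 1, 1, i.e. the generating function for simple graphs on 4 nodes quoted at the beginning of this subsection.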


1.32 Exercise Evaluate the cycle index 𝐶(𝐺, 𝑋) of the symmetry group of naphthalene (see Example 1.7) and replace the 𝑖-th indeterminate 𝑧_𝑖 by the polynomial 𝑦_0^𝑖 + 𝑦_1^𝑖. Derive the corresponding numbers of symmetry classes of the symmetry group on the set of mappings 𝑌^𝑋 = 2^{10} by weight.

1.33 Exercise Click on http://symmetrica.uni-bayreuth.de, go to ‘operations of finite groups’, and evaluate the cycle index of the symmetry group of the aromatic benzene ring, by entering generators, regarding the following facts: This group on 6 points can be generated by a reflection and a cyclic permutation of the 6 points. Take into account that 6 points are usually numbered from 1 to 6 in SYMMETRICA, so that the symmetry group in question is generated, for example, by the permutations

𝜋 = [2, 3, 4, 5, 6, 1] and 𝜌 = [6, 5, 4, 3, 2, 1].

Check your result by also evaluating the cycle index of the dihedral group on 6 points.

1.34 Exercise How many molecular graphs can be obtained by replacing the six hydrogen atoms of benzene by either H or Cl ? How many of them are of weight (4, 2)?

The application to multigraphs is immediate:

1.35 Example (Unlabeled 𝑚-multigraphs by weight or content) The generating function for the numbers of unlabeled 𝑚-multigraphs by content is the polynomial

\[ C\Bigl(S_n, \binom{n}{2}\Bigr)\Big|_{z_i = \sum_{j\in m} y_j^i}. \]

In order to apply it to 𝑛 = 4 we need to refine the table of orbit numbers given above by separating the orbits according to their lengths, obtaining

repr. 𝜋 of class    class order    cycle type of 𝜋̄
1                   1              (6, 0, 0, 0)
(01)                6              (2, 2, 0, 0)
(012)               8              (0, 0, 2, 0)
(01)(23)            3              (2, 2, 0, 0)
(0123)              6              (0, 1, 0, 1)

Thus, the generating function is

\[ \frac{1}{24}\Bigl( (y_0 + \dots + y_{m-1})^6 + 9\,(y_0 + \dots + y_{m-1})^2 (y_0^2 + \dots + y_{m-1}^2)^2 + 8\,(y_0^3 + \dots + y_{m-1}^3)^2 + 6\,(y_0^2 + \dots + y_{m-1}^2)(y_0^4 + \dots + y_{m-1}^4) \Bigr). \]

For 𝑚 = 2 this amounts to the polynomial mentioned above:

\[ y_0^6 + y_0^5 y_1 + 2 y_0^4 y_1^2 + 3 y_0^3 y_1^3 + 2 y_0^2 y_1^4 + y_0 y_1^5 + y_1^6. \]


It shows that there is exactly one simple graph containing no bonds, exactly one with one bond, two with two bonds, three with three bonds, two with four bonds, one with five bonds, and one with six bonds, in accordance with the drawing of these graphs that we have seen already (just below Example 1.2). The sum of all coefficients is the total number 11 of simple graphs on 4 nodes.

The cycle type 𝑎(𝜋̄) of the permutation 𝜋̄, induced by 𝜋 ∈ 𝑆_𝑛 on $\binom{n}{2}$, expressed in terms of the cycle type 𝑎(𝜋) = (𝑎_1(𝜋), . . . , 𝑎_𝑛(𝜋)), is known, so that we can easily obtain tables like the above one: If 𝑖 is odd and lcm(𝑟, 𝑠) means the least common multiple of 𝑟 and 𝑠, then

\[ a_i(\bar\pi) = \frac{a_i(\pi)\,\bigl(i\,a_i(\pi) - 1\bigr)}{2} + a_{2i}(\pi) + \sum_{r\,[\ldots]} a_r(\pi)\, a_s(\pi) \gcd(r, s). \]

[…]

[…] 0, assume a set of chemical elements E and Z_E = ⋃_{𝑋∈E} Z_𝑋, the set of admissible atom states of the elements in E. A triple

\[ AMG = (E, Z, \Gamma) \in \mathcal{P}^{\star}(\mathcal{E})^{n} \times \mathcal{P}^{\star}(\mathcal{Z}_{\mathcal{E}})^{n} \times \mathcal{P}(3)^{\binom{n}{2}} = \mathcal{AMG}_n \]

is called ambiguous molecular graph (AMG). P means the power set, i.e. P(E) is the set of all the subsets of E, including the empty set ∅ ⊆ E. P⋆ denotes the power set without the empty set ∅ and 3 stands for the set {1, 2, 3}. The set of bonds of 𝐴𝑀𝐺 is

\[ B(\Gamma) = \Bigl\{ \{i, j\} \in \binom{n}{2} \;\Big|\; \Gamma(\{i, j\}) \neq \emptyset \Bigr\}. \]


Although the implementation of the methods so far may seem obvious, efficiency has not yet been mentioned. The distribution of elements and atom states of ambiguous molecules could be calculated easily using a list of all the admissible sequences of elements and atom states (𝐸(𝑖), 𝑍(𝑖)). However, this would turn out to be highly inefficient as soon as 𝐸(𝑖) = E and 𝑍(𝑖) = Z_E need to be represented. This problem was solved by introducing an abstract base class atom type with the function

𝑐𝑜𝑚𝑝𝑎𝑡𝑖𝑏𝑙𝑒 : E × Z_E → {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒},

that is called during a substructure search or structure generation, in order to check that the following holds for a node 𝑗 of a molecular graph (𝜀, 𝜁, 𝛾) ∈ M: 𝜀(𝑗) ∈ 𝐸(𝑖) and 𝜁(𝑗) ∈ 𝑍(𝑖). The following atom types are implemented in MOLGEN:
– Atom type standard represents exactly one element and one atom state; it covers the cases when |𝐸(𝑖)| = |𝑍(𝑖)| = 1.
– Atom type multi can represent a field of tuples of elements and atom states. This atom type can be used in general situations.
– Atom type any is used when 𝐸(𝑖) = E and 𝑍(𝑖) = Z_E. For this atom type the function 𝑐𝑜𝑚𝑝𝑎𝑡𝑖𝑏𝑙𝑒 always yields the result true.

Besides the higher efficiency, the main advantage of this technique is that it can be extended easily. For example, we might need an atom type element that can represent all the possible atom states of that element. For this purpose we only need to define the new atom type including the appropriate function compatible, which returns true if the element of the atom equals the element defined by the atom type. This saves changes in the substructure search and in the algorithms for substructure generation. In particular, it allows us to deal with fragmentation reactions in mass spectrometry (see Subsection 8.4.2).
– Atom type MS distinguishes one of the following sets of elements for the membership of 𝜀(𝑗):
  – all elements,
  – all heavy elements, i.e. the elements except H,
  – all heavy elements except C,
  – all elements with free electron pairs (N, O, P, S, halogens).
  𝜁(𝑗) is checked for the existence of a positive charge or a radical position whenever compatible is called.

A node of an ambiguous molecular graph is colored by an element symbol or by an atom state, or by symbols that represent various alternatives. Typical symbols are A for any atom and Q for a hetero atom. Further examples can be found in Subsection 8.4.2. Alternatives for bonds will be encoded as follows:

[Figure: bond symbols encoding the sets of admissible bond multiplicities {1, 2}, {1, 3}, {2, 3} and {1, 2, 3}]
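The atom type design described above maps naturally onto an abstract base class with one method. The following is a sketch of that idea in Python, not MOLGEN's actual implementation: the class names, and the representation of an atom state as a tuple (valence, free electron pairs, charge, radical), are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class AtomType(ABC):
    """Abstract base class: decides whether an (element, state) pair is admissible."""

    @abstractmethod
    def compatible(self, element, state):
        ...


class Standard(AtomType):
    """Exactly one element and one atom state."""

    def __init__(self, element, state):
        self.element, self.state = element, state

    def compatible(self, element, state):
        return (element, state) == (self.element, self.state)


class Multi(AtomType):
    """An explicit collection of admissible (element, state) pairs."""

    def __init__(self, pairs):
        self.pairs = frozenset(pairs)

    def compatible(self, element, state):
        return (element, state) in self.pairs


class AnyAtom(AtomType):
    """Every element in every state is admissible."""

    def compatible(self, element, state):
        return True


class Element(AtomType):
    """One element in any of its atom states (the extension discussed above)."""

    def __init__(self, element):
        self.element = element

    def compatible(self, element, state):
        return element == self.element
```

A search routine only ever calls `compatible`, so adding a further atom type such as the MS type requires a new subclass and no change to the search itself.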

2.13 Definition (Ambiguous molecular subgraph) Consider natural numbers 𝑘, 𝑛 with 0 < 𝑘 ≤ 𝑛, a set E of chemical elements together with Z_E = ⋃_{𝑋∈E} Z_𝑋, and assume an 𝑀 ∈ M_𝑛. An ambiguous molecular graph 𝐴𝑀𝐺 = (𝐸, 𝑍, 𝛤) ∈ AMG_𝑘 is called ambiguous molecular subgraph of 𝑀, if an injective mapping 𝜙 ∈ 𝑛^𝑘_inj exists such that
– for every label 𝑖 ∈ 𝑘 we have 𝜀(𝜙(𝑖)) ∈ 𝐸(𝑖) and 𝜁(𝜙(𝑖)) ∈ 𝑍(𝑖),
– and for each bond {𝑖, 𝑗} ∈ 𝐵(𝛤) the following is true: 𝛾({𝜙(𝑖), 𝜙(𝑗)}) ∈ 𝛤({𝑖, 𝑗}).

𝜙 is called embedding of 𝐴𝑀𝐺 in 𝑀 as ambiguous molecular subgraph and we write 𝐴𝑀𝐺 ⊆_𝜙 𝑀. If, in addition,
– for each edge outside of 𝛤, i.e. {𝑖, 𝑗} ∈ $\binom{k}{2}$ \ 𝐵(𝛤), 𝛾({𝜙(𝑖), 𝜙(𝑗)}) = 0,
then 𝐴𝑀𝐺 is an induced ambiguous molecular subgraph of 𝑀. 𝜙 is an embedding of 𝐴𝑀𝐺 in 𝑀 as induced ambiguous molecular subgraph and we indicate this by writing 𝐴𝑀𝐺 ⊆^𝑖_𝜙 𝑀.

2.2.2 Substructure restrictions

Often we have to handle structural properties that cannot be expressed solely by ambiguous molecular graphs. For example, two atoms of an ambiguous graph may require a prescribed distance in the embedding, or have to lie on a ring of given length, or similar. For this purpose we have to equip ambiguous graphs with substructure restrictions.

2.14 Definition (Substructure restriction) Assume 0 < 𝑘 ∈ ℕ. A substructure restriction is a mapping

\[ SR : \bigcup_{n} \bigl(\mathcal{M}_n \times n^{k}_{\mathrm{inj}}\bigr) \to \{true, false\}. \]

SR_𝑘 denotes the set of substructure restrictions on 𝑘 atoms.

Like atom type, the substructure restriction is implemented as an abstract base class. At present, the following types of such restrictions are available:
– Substructure restriction distance: For two atoms 𝑖, 𝑗 ∈ 𝑘 an interval [𝑎, 𝑏] ⊂ ℕ⋆ of positive natural numbers is given that restricts the distance of the two atoms in the embedding 𝜙 of the ambiguous molecular graph into 𝑀.

\[ SR^{\mathrm{Dist}}_{\{i,j\},[a,b]} : (M, \phi) \mapsto \begin{cases} true & \text{if } \mathrm{dist}_M(\phi(i), \phi(j)) \in [a, b], \\ false & \text{otherwise.} \end{cases} \]

– Substructure restriction hybridization: For a nonempty subset {𝑖_𝑗 | 𝑗 ∈ ℎ} ⊆ 𝑘 of atoms in an ambiguous molecular graph one introduces a hybridization 𝜂 for the atoms embedded via 𝜙 into the molecular graph 𝑀.

\[ SR^{\mathrm{Hybrid}}_{\{i_j \mid j\in h\},\eta} : (M, \phi) \mapsto \begin{cases} true & \text{if for all } j \in h : \mathrm{hyb}_M(\phi(i_j)) = \eta, \\ false & \text{otherwise.} \end{cases} \]

– Substructure restriction neighborhood: For a nonempty subset of the atoms in an ambiguous molecular graph, a distance between the given substructure and chosen atoms can be defined as a prescribed interval, after embedding into 𝑀.
– Substructure restriction ring: For a nonempty subset of atoms of the ambiguous graph we may prescribe an interval of possible ring lengths such that the chosen atoms have to lie, or must not lie, on a ring of size contained in the given interval (after embedding).

2.15 Definition (Molecular substructure) Assume 0 < 𝑘 ∈ ℕ and ℎ ∈ ℕ. A molecular substructure 𝑆 = (𝐴𝑀𝐺, {𝑆𝑅𝑖 | 𝑖 ∈ ℎ}) ∈ AMG𝑘 × P(SR𝑘 ) = S𝑘 is a pair consisting of an ambiguous molecular graph and a set of substructure restrictions.

2.16 Definition (Embeddings) Suppose that 0 < 𝑘, 𝑛 ∈ ℕ, 𝑘 ≤ 𝑛, 𝑆 = (𝐴𝑀𝐺, {𝑆𝑅_𝑖 | 𝑖 ∈ ℎ}) ∈ S_𝑘 a molecular substructure and 𝑀 ∈ M_𝑛 a molecular graph. An injective mapping 𝜙 ∈ 𝑛^𝑘_inj is called embedding of 𝑆 into 𝑀 as molecular substructure if
– 𝐴𝑀𝐺 ⊆_𝜙 𝑀 and
– for all 𝑖 ∈ ℎ : 𝑆𝑅_𝑖(𝑀, 𝜙) = true.
In this case we write 𝑆 ⊆_𝜙 𝑀 and call 𝑆 a molecular substructure of 𝑀. If, in addition,
– 𝐴𝑀𝐺 ⊆^𝑖_𝜙 𝑀,
then 𝜙 is an embedding of 𝑆 into 𝑀 as induced molecular substructure. We write 𝑆 ⊆^𝑖_𝜙 𝑀 and call 𝑆 an induced molecular substructure of 𝑀. We denote the respective sets of embeddings by

Emb_⊆(𝑆, 𝑀) = {𝜙 ∈ 𝑛^𝑘_inj | 𝑆 ⊆_𝜙 𝑀},
Emb_{⊆𝑖}(𝑆, 𝑀) = {𝜙 ∈ 𝑛^𝑘_inj | 𝑆 ⊆^𝑖_𝜙 𝑀}.

An algorithm that searches embeddings of a given substructure in a molecular graph is called a substructure search. Such an algorithm is quite important in several areas of computational chemistry. For example, [7, 19, 20, 95, 96, 148, 212] use lists of substructures (called goodlist and badlist) as input, to construct only those molecular graphs that contain (goodlist) or do not contain (badlist) the given substructures. An algorithmic description of substructure search is contained in [96]. The result of such a search can be considered as a binary molecular descriptor (see Subsection 7.2.2). In [322], a vector of binary descriptors is used to introduce a notion of molecular graph similarity. In Section 2.3 we describe how reactions and reaction schemes can be defined and simulated using molecular substructures.
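Definitions 2.13 to 2.16 can be turned into a naive substructure search that simply tries every injective mapping 𝜙. The sketch below is such a brute-force illustration and not the algorithm of [96]; it ignores atom states, stores a molecular graph as a dictionary with an element list and a bond dictionary, and models a substructure restriction as a callable (𝑀, 𝜙) ↦ true/false. All of these representations are assumptions made here.

```python
from collections import deque
from itertools import permutations


def bond(M, i, j):
    """Bond multiplicity gamma({i, j}); 0 if the atoms are not bonded."""
    return M["bonds"].get(frozenset((i, j)), 0)


def is_embedding(amg, M, phi, induced=False):
    """Check AMG <=_phi M (or the induced variant) for AMG = (E, Gamma)."""
    E, Gamma = amg
    k = len(E)
    if any(M["elements"][phi[i]] not in E[i] for i in range(k)):
        return False
    for i in range(k):
        for j in range(i + 1, k):
            allowed = Gamma.get(frozenset((i, j)), set())
            actual = bond(M, phi[i], phi[j])
            if allowed:
                if actual not in allowed:
                    return False
            elif induced and actual != 0:
                return False
    return True


def distance(M, a, b):
    """Topological distance between atoms a and b, or None if not connected."""
    dist, queue = {a: 0}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in range(len(M["elements"])):
            if v not in dist and bond(M, u, v) > 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def distance_restriction(i, j, low, high):
    """Substructure restriction: dist_M(phi(i), phi(j)) must lie in [low, high]."""
    def restriction(M, phi):
        d = distance(M, phi[i], phi[j])
        return d is not None and low <= d <= high
    return restriction


def embeddings(substructure, M, induced=False):
    """All injective phi embedding S = (AMG, restrictions) into M."""
    amg, restrictions = substructure
    n, k = len(M["elements"]), len(amg[0])
    return [phi for phi in permutations(range(n), k)
            if is_embedding(amg, M, phi, induced)
            and all(sr(M, phi) for sr in restrictions)]
```

As a usage example, take the heavy-atom skeleton of acetic acid, `M = {"elements": ["C", "C", "O", "O"], "bonds": {frozenset((0, 1)): 1, frozenset((1, 2)): 2, frozenset((1, 3)): 1}}`. A C/O pair joined by a single or double bond embeds at (1, 2) and (1, 3), whereas an unbonded C/O pair at distance exactly 2 embeds at (0, 2) and (0, 3).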


2.3 Chemical reactions

The basic notions are introduced in

2.17 Definition (Chemical reaction, reactant and product graph) Assume 𝑛 > 0, a set E of chemical elements and Z_E = ⋃_{𝑋∈E} Z_𝑋, the set of admissible atom states of the elements in E. Then:
– An ordered pair

𝐶 = (𝑀, 𝑀′) ∈ M_𝑛 × M_𝑛

of molecular graphs 𝑀 = (𝜀, 𝜁, 𝛾) and 𝑀′ = (𝜀′, 𝜁′, 𝛾′) is called a chemical reaction, if 𝜀 = 𝜀′. 𝑀 is called the reactant graph and 𝑀′ the product graph. The elements of the set Conn(𝑀), i.e. the connected components of 𝑀, are educts or reactants, those of Conn(𝑀′) = Conn(𝛾′) are the products. The set

CR_𝑛 = {(𝑀, 𝑀′) | (𝑀, 𝑀′) ∈ M_𝑛 × M_𝑛, 𝜀 = 𝜀′}

denotes the set of chemical reactions on 𝑛 atoms in the following.
– If |Conn(𝑀)| = 1, then we call 𝐶 a one component reaction. If in addition |Conn(𝑀′)| = 1, then 𝐶 is called a rearrangement, while in the case of |Conn(𝑀′)| > 1 we speak of a decomposition reaction. A reaction with |Conn(𝑀)| = 2 is called a two component reaction. Reactions with |Conn(𝑀)| ≥ 2 are called synthesis reactions.
– Changes in the atom states or bonds, arising from the chemical reaction 𝐶, are of particular interest. We introduce the change of reaction graph

Δ𝐶 = (Δ𝜁, Δ𝛾) ∈ ΔZ^𝑛 × G_{[−3,3],𝑛} = ΔCR_𝑛,

where, for 𝑖 ∈ 𝑛,

Δ𝜁(𝑖) = (Δ𝑣_𝑖, Δ𝑝_𝑖, Δ𝑞_𝑖, Δ𝑟_𝑖) ∈ ℤ × ℤ × ℤ × 𝔹 = ΔZ

describes the change of state of atom 𝑖. In this situation
  ∘ Δ𝑣_𝑖 = 𝑣_{𝜁′(𝑖)} − 𝑣_{𝜁(𝑖)} describes the change of valence,
  ∘ Δ𝑝_𝑖 = 𝑝_{𝜁′(𝑖)} − 𝑝_{𝜁(𝑖)} gives the change in the number of free electron pairs,
  ∘ Δ𝑞_𝑖 = 𝑞_{𝜁′(𝑖)} − 𝑞_{𝜁(𝑖)} indicates the change of charge and
  ∘ Δ𝑟_𝑖 = 𝑟_{𝜁′(𝑖)} ∨̇ 𝑟_{𝜁(𝑖)} means the change of the radical character of atom 𝑖, where ‘∨̇’ stands for ‘either or’.
Δ𝜁 is the change of states for 𝐶. Furthermore, the change of bonds by a chemical reaction,

\[ \Delta\gamma \in \mathcal{G}_{[-3,3],n} = [-3, 3]^{\binom{n}{2}}, \]

for 𝑖, 𝑗 ∈ 𝑛, 𝑖 ≠ 𝑗, with

Δ𝛾({𝑖, 𝑗}) = 𝛾′({𝑖, 𝑗}) − 𝛾({𝑖, 𝑗}) ∈ [−3, 3],

gives the change of bonds between atoms 𝑖 and 𝑗. Δ𝛾 is the change of bonds graph of 𝐶.
– The chemical reaction 𝐶 is described completely by the reactant graph 𝑀 and change of reaction graph Δ𝐶. Therefore, we call the quintuple (𝜀, 𝜁, 𝛾, Δ𝜁, Δ𝛾) the reaction graph. Written in the notation introduced in Definition 2.17, this gives: 𝑀′ = Δ𝐶 ∘ 𝑀, where

Δ𝐶 ∘ 𝑀 = (Δ𝜁, Δ𝛾) ∘ (𝜀, 𝜁, 𝛾) = (𝜀, Δ𝜁 ∘ 𝜁, Δ𝛾 ∘ 𝛾)

and, for 𝑖, 𝑗 ∈ 𝑛, 𝑖 ≠ 𝑗,

(Δ𝜁 ∘ 𝜁)(𝑖) = Δ𝜁(𝑖) ∘ 𝜁(𝑖),
(Δ𝛾 ∘ 𝛾)({𝑖, 𝑗}) = 𝛾({𝑖, 𝑗}) + Δ𝛾({𝑖, 𝑗}).

2.3 Chemical reactions

| 67

The change of states is Δ𝜁(𝑖) ∘ 𝜁(𝑖) = (𝑣𝜁(𝑖) + Δ𝑣𝑖 , 𝑝𝜁(𝑖) + Δ𝑝𝑖 , 𝑞𝜁(𝑖) + Δ𝑞𝑖 , 𝑟𝜁(𝑖) ∨̇ Δ𝑟𝑖 ). –

The center of reaction Cen(𝐶) of 𝐶 = ((𝜀, 𝜁, 𝛾), (𝜀, 𝜁󸀠 , 𝛾󸀠 )) ∈ CR𝑛 is the following set of atoms: {𝑖 ∈ 𝑛 | 𝜁(𝑖) ≠ 𝜁󸀠 (𝑖)



or there exists a 𝑗 with 𝛾({𝑖, 𝑗}) ≠ 𝛾󸀠 ({𝑖, 𝑗})}.

Hence, Cen(𝐶) consists of the atoms whose state is changed by the reaction, or whose bonds are altered. It can be described also by the reactant graph, the center of reaction and the change of atom states and bonds of the center atoms. The subgraph of the reaction graph, induced by the center, RCG(𝐶) = (𝜀|Cen(𝐶) , 𝜁|Cen(𝐶) , 𝛾|Cen(𝐶) , Δ𝜁|Cen(𝐶) , Δ𝛾|Cen(𝐶) ) is called the reaction center graph of 𝐶.
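The composition $M' = \Delta C \circ M$ and the center of reaction from Definition 2.17 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: atom states are plain tuples (v, p, q, r), bond multiplicities are stored in a dictionary keyed by two-element frozensets, and the names `apply_change` and `reaction_center` are hypothetical. The toy data describe a carbon skeleton only (hydrogens omitted), so no valence check is attempted.

```python
from itertools import combinations

NEUTRAL = (0, 0, 0, False)  # "no change" entry of Delta zeta


def apply_delta_state(state, delta):
    """Delta zeta(i) o zeta(i): add v, p, q componentwise, XOR the radical flag."""
    v, p, q, r = state
    dv, dp, dq, dr = delta
    return (v + dv, p + dp, q + dq, r != dr)


def apply_change(zeta, gamma, d_zeta, d_gamma):
    """Return (zeta', gamma') of M' = Delta C o M.

    gamma and d_gamma map frozenset({i, j}) to a (change of) multiplicity;
    a missing key means 0.
    """
    new_zeta = [apply_delta_state(s, d) for s, d in zip(zeta, d_zeta)]
    new_gamma = {}
    for pair in combinations(range(len(zeta)), 2):
        e = frozenset(pair)
        m = gamma.get(e, 0) + d_gamma.get(e, 0)
        if m != 0:
            new_gamma[e] = m
    return new_zeta, new_gamma


def reaction_center(zeta, gamma, new_zeta, new_gamma):
    """Atoms whose state changes or whose bonds are altered."""
    center = {i for i, (s, t) in enumerate(zip(zeta, new_zeta)) if s != t}
    for e in set(gamma) | set(new_gamma):
        if gamma.get(e, 0) != new_gamma.get(e, 0):
            center |= e
    return center


# Toy skeleton: diene 0=1-2=3, dienophile 4=5, substituent atom 6 on atom 0.
C_STATE = (4, 0, 0, False)
zeta = [C_STATE] * 7
gamma = {frozenset(e): m for e, m in
         {(0, 1): 2, (1, 2): 1, (2, 3): 2, (4, 5): 2, (0, 6): 1}.items()}
d_zeta = [NEUTRAL] * 7
d_gamma = {frozenset(e): d for e, d in
           {(0, 1): -1, (1, 2): 1, (2, 3): -1,
            (3, 4): 1, (4, 5): -1, (0, 5): 1}.items()}

new_zeta, new_gamma = apply_change(zeta, gamma, d_zeta, d_gamma)
center = reaction_center(zeta, gamma, new_zeta, new_gamma)
```

In this toy case the six ring-forming atoms make up the center, while the substituent atom 6 stays outside it.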

2.18 Example (The Diels–Alder reaction) An example Diels–Alder reaction, according to [303], proceeds as shown here. The reactants or educts are shown to the left, the product is on the right:

[Reaction diagram not reproduced: two reactants and one product, containing O, S and Cl atoms.]

The corresponding reaction graph (left) and reaction center graph (right) are as follows:

[Diagrams of the reaction graph and the reaction center graph not reproduced.]

Bonds that are formed are indicated by small dots, bonds that are broken are indicated by crosses.

Many reactions in chemistry follow either the same or very similar reaction paths. This manifests itself in 'similar' reaction center graphs for these reactions. This 'similarity' can be used to describe a data structure that represents 'similar' reaction center graphs and thus allows the construction of the underlying chemical reaction from the reactant graph, yielding the reaction center.


2.19 Definition (Reaction scheme) Assume that 0 < 𝑘 ∈ ℕ. A reaction scheme is a triple 𝑅 = (𝑆, Δ𝜁, Δ𝛾) ∈ S𝑘 × ΔZ𝑘 × G[−3,3],𝑘 = R𝑘 consisting of the reaction substructure 𝑆, the change of states of a reaction scheme Δ𝜁, and the change of bonds graph of a reaction scheme Δ𝛾. Depending on the number of connected components of the AMG underlying 𝑆, we call 𝑅 a one component reaction scheme, a two component reaction scheme, and so on.

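This definition differs from [337]. The advantage is that it allows us to handle decomposition and rearrangement reactions, as well as syntheses involving more than two components.

The application of a reaction scheme $R = (S, \Delta\zeta, \Delta\gamma) \in \mathrm{R}_k$ to a molecular graph $M = (\varepsilon, \zeta, \gamma) \in \mathrm{M}_n$ is done in two steps. First we look for an embedding of the reaction substructure $S$ in $M$ as molecular substructure. If we find such an embedding $\phi \in \mathrm{Emb}_{\subseteq_i}(S, M)$, then we apply a change of states and a change of bonds graph in the following way: $\phi$ induces a mapping
$$-^{\phi} : \Delta\mathrm{CR}_k \to \Delta\mathrm{CR}_n : (\Delta\zeta, \Delta\gamma) \mapsto (\Delta\zeta, \Delta\gamma)^{\phi} = (\Delta\zeta^{\phi}, \Delta\gamma^{\phi}),$$
where, for $i \in n$,
$$\Delta\zeta^{\phi}(i) = \begin{cases} \Delta\zeta(\phi^{-1}(i)) & \text{if } i \in \phi(k), \\ (0, 0, 0, \mathit{false}) & \text{otherwise,} \end{cases}$$
and, for $i, j \in n$, $i \neq j$,
$$\Delta\gamma^{\phi}(\{i,j\}) = \begin{cases} \Delta\gamma(\{\phi^{-1}(i), \phi^{-1}(j)\}) & \text{if } i, j \in \phi(k), \\ 0 & \text{otherwise.} \end{cases}$$
We are now in a position to define the application of $R$ to $M$ with respect to $\phi$ via $(\Delta\zeta, \Delta\gamma)^{\phi}$ as follows:
$$R \circ_{\phi} M = (\Delta\zeta, \Delta\gamma) \circ_{\phi} M = (\Delta\zeta, \Delta\gamma)^{\phi} \circ M.$$
This, however, does not guarantee that $R \circ_{\phi} M$ is a molecular graph.

2.20 Definition (The set of product graphs) Assume $0 < k, n \in \mathbb{N}$, $k \leq n$, $R = (S, \Delta\zeta, \Delta\gamma) \in \mathrm{R}_k$ a reaction scheme and $M \in \mathrm{M}_n$ a molecular graph. The set of the product graphs arising from an application of $R$ to $M$ is
$$\mathrm{Prod}_R(M) = \{R \circ_{\phi} M \in \mathrm{M}_n \mid \phi \in \mathrm{Emb}_{\subseteq_i}(S, M)\}.$$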
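The two steps above (lifting the changes along an embedding, then keeping only results that are molecular graphs) might be sketched as follows. The sketch rests on several assumptions of ours: the substructure search itself is not implemented, so candidate maps are supplied by hand as lists `phi` with `phi[a]` the molecule atom assigned to scheme atom `a`; the validity test is reduced to "multiplicities between 1 and 3 and row sums equal to the valences"; and all function names are hypothetical. The second candidate map below is not a genuine embedding and is included only to exercise the filter.

```python
NEUTRAL = (0, 0, 0, False)


def apply_scheme(zeta, gamma, d_zeta, d_gamma, phi):
    """Return (zeta', gamma') of R o_phi M, or None if it is no molecular graph."""
    n = len(zeta)
    # Lift Delta zeta: atoms outside phi(k) receive the neutral change.
    lifted = [NEUTRAL] * n
    for a, i in enumerate(phi):
        lifted[i] = d_zeta[a]
    new_zeta = [(v + dv, p + dp, q + dq, r != dr)
                for (v, p, q, r), (dv, dp, dq, dr) in zip(zeta, lifted)]
    # Lift Delta gamma and add it to gamma.
    new_gamma = dict(gamma)
    for e, d in d_gamma.items():
        a, b = tuple(e)
        f = frozenset((phi[a], phi[b]))
        new_gamma[f] = new_gamma.get(f, 0) + d
    new_gamma = {e: m for e, m in new_gamma.items() if m != 0}
    # Simplified membership test for M_n (an assumption of this sketch).
    if any(m < 0 or m > 3 for m in new_gamma.values()):
        return None
    for i in range(n):
        if sum(m for e, m in new_gamma.items() if i in e) != new_zeta[i][0]:
            return None
    return new_zeta, new_gamma


def products(zeta, gamma, d_zeta, d_gamma, candidate_maps):
    """Collect all valid results R o_phi M over the supplied maps."""
    result = []
    for phi in candidate_maps:
        product = apply_scheme(zeta, gamma, d_zeta, d_gamma, phi)
        if product is not None:
            result.append(product)
    return result


# Molecule: ethene (C0=C1 with H2..H5) plus H2 (H6-H7).
C, H = (4, 0, 0, False), (1, 0, 0, False)
zeta = [C, C, H, H, H, H, H, H]
gamma = {frozenset(e): m for e, m in
         {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1, (6, 7): 1}.items()}
# Scheme atoms: 0, 1 (doubly bonded) and 2, 3 (singly bonded).
d_zeta = [NEUTRAL] * 4
d_gamma = {frozenset(e): d for e, d in
           {(0, 1): -1, (2, 3): -1, (0, 2): 1, (1, 3): 1}.items()}

prods = products(zeta, gamma, d_zeta, d_gamma, [[0, 1, 6, 7], [0, 1, 2, 6]])
```

Only the first map yields a molecular graph (ethane); the second would require a negative multiplicity and is discarded.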


Our model [153] allows a qualitative simulation of chemical reactions, which means the determination of all possible products for given reactants and reaction schemes. Another important aspect is the quantitative course of chemical reactions, i.e. estimating the amounts of the various products. This is more difficult to model and the models are often only applicable to small classes of compounds. For example, an attempt was made to quantitatively predict the reactions in a mass spectrometer in [111] and [281], using machine learning methods. [130] extends the possibilities of reaction prediction with the software package EROS 7. However, the exact prediction of reactivities is a problem that can only be addressed by experimental measurements or quantum chemical calculations, at high expense. Interesting new approaches (e.g. those described in [21, 22]) attempt to perform these energy calculations using strongly simplified models.

2.4 Mesomerism

There are phenomena such as mesomerism or resonance that cannot be described in terms of our graph theoretical model of a molecule. The reason is that it is no longer possible to associate a unique and integral multiplicity to a covalent bond in aromatic structures. Instead, we introduce several resonance structures in order to represent aromatic compounds. Such structures are not generally isomorphic in the sense of Definition 1.17.

2.21 Example (Derivatives of benzene) A very simple example of an aromatic compound is benzene:

[Structure diagram of benzene not reproduced.]

The two dichlorobenzenes shown below are chemically identical, but are not isomorphic if we use a multigraph model that does not consider aromaticity.

[Structure diagrams of the two dichlorobenzenes not reproduced.]

Thus, 'aromatic duplicates' are possible when running a molecule generator. Such duplicates should be avoided to keep search and answer spaces as small as possible. Thus, the duplicate structures must be detected during structure generation and removed. In addition, it is important to describe chemical compounds as precisely as possible, e.g. during the search for QSPRs. In fact, there are molecular descriptors that rely on aromaticity. The mathematical aspects of detecting and filtering mesomeric doublets were considered in [76], but the problem lies in finding a suitable graph model for the phenomenological concept of aromaticity. MOLGEN3.5 already contains a filter for aromatic duplicates, but this only detects 6-membered aromatic rings containing C atoms. More recent versions contain an algorithm for filtering aromatic systems of variable sizes that may also contain hetero atoms and/or charges (see Section 9.8 in [97]).

2.22 Definition (Aromatic, delocalized electrons) Consider $M \in \mathrm{M}$. A ring $W$ of length $\mathrm{len}_M(W) \geq 3$ is an aromatic ring if
i) each multiple bond in $W$ is incident with exactly two single bonds in $W$, and
ii) each single bond in $W$ is incident either with
   a) exactly two multiple bonds in $W$, or
   b) exactly one multiple bond in $W$ and an atom with a lone pair of electrons on the other side, or
   c) exactly one multiple bond in $W$ and a charged atom on the other side, and
iii) the number of cyclically delocalized $\pi$ electrons on $W$ is $4k + 2$, with a suitable $k \in \mathbb{N}$. This is known as the Hückel Rule [132, 133, 134] and
   a) each multiple bond and
   b) every atom with a lone pair of electrons
   contributes 2 electrons to the complete set of electrons.
A bond contained in an aromatic ring is an aromatic bond.
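The three conditions of Definition 2.22 can be tested on a single ring as in the following sketch. The encoding is an assumption of ours: the ring is given by a cyclic list `bonds`, where `bonds[i]` is the multiplicity of the bond between ring atoms `i` and `i + 1` (indices modulo the ring length), and by two boolean lists for lone pairs and charges. We also interpret condition iii) b) so that only those lone-pair atoms count that are used in condition ii) b); this reading is ours, and $k = 0$ is admitted in $4k + 2$.

```python
def is_aromatic_ring(bonds, lone_pair, charged):
    """Check conditions i) to iii) of Definition 2.22 for one ring.

    bonds[i] is the multiplicity (>= 1) of the bond joining ring atoms
    i and (i + 1) % n; lone_pair[i] and charged[i] describe ring atom i.
    """
    n = len(bonds)
    if n < 3:
        return False
    electrons = 0
    donors = set()  # lone-pair atoms used via condition ii) b)
    for i, m in enumerate(bonds):
        left, right = bonds[i - 1], bonds[(i + 1) % n]
        if m >= 2:
            # i) a multiple bond needs two single neighbours in the ring
            if left != 1 or right != 1:
                return False
            electrons += 2
            continue
        multi_left, multi_right = left >= 2, right >= 2
        if multi_left and multi_right:
            continue  # ii) a)
        if not (multi_left or multi_right):
            return False  # single bond without any multiple neighbour
        # The atom "on the other side" is the end not touching the multiple bond.
        atom = (i + 1) % n if multi_left else i
        if lone_pair[atom]:
            donors.add(atom)  # ii) b)
        elif not charged[atom]:
            return False  # neither ii) b) nor ii) c)
    electrons += 2 * len(donors)
    return electrons % 4 == 2  # iii) Hueckel rule


no = [False] * 7
benzene = is_aromatic_ring([2, 1, 2, 1, 2, 1], no[:6], no[:6])
furan = is_aromatic_ring([1, 2, 1, 2, 1],
                         [True, False, False, False, False], no[:5])
cyclobutadiene = is_aromatic_ring([2, 1, 2, 1], no[:4], no[:4])
tropylium = is_aromatic_ring([2, 1, 2, 1, 2, 1, 1], no,
                             [False] * 6 + [True])
cyclopentadiene = is_aromatic_ring([1, 2, 1, 2, 1], no[:5], no[:5])
```

Under these assumptions benzene, furan and the tropylium ion pass the test, while cyclobutadiene (4 electrons) and neutral cyclopentadiene (no lone pair or charge on the saturated atom) do not.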

2.23 Example (Naphthalene cont.) According to Definition 2.22 naphthalene possesses 11 aromatic bonds:

[Structure diagrams of three resonance structures of naphthalene not reproduced.]

The structure on the left contains two aromatic rings of length 6, while the others both contain one ring of length 6 and one of length 10. The latter structures are isomorphic in the sense of Definition 1.17. After marking the aromatic bonds we see that in fact all three structures are isomorphic. Using this knowledge, repeat the exercise from above:

2.24 Exercise Go to MOLGEN–ONLINE via http://www.molgen.de and enter the molecular formula C10H8 of naphthalene. Enter further restrictions, e.g. the cyclomatic number (the number of bonds minus the number of atoms plus one, see Example 2.31), ring sizes or numbers of aromatic bonds (bonds in aromatic rings, see Definition 2.22). Note the different numbers of connectivity isomers and try to refine the conditions to obtain a single connectivity isomer, naphthalene, in the final run of MOLGEN.


2.25 Algorithm (Identification of aromatic bonds) We can identify aromatic bonds in $M$ as follows:
i) Run through all walks $W$ in $M$ that fulfil conditions i) and ii) of Definition 2.22, using a depth-first method.
ii) If $W$ is a ring that satisfies condition iii) of Definition 2.22, then mark all the bonds in $W$ as aromatic.
iii) Remove atom charges that may be present according to condition ii) c) in Definition 2.22, since a charge is no longer localized on a particular atom in an aromatic system.

Once aromatic bonds are identified, they need to be encoded. One way to do this is to allow multiplicity to be expressed as rational numbers. For example, we might encode the multiplicities of the bonds in a benzene ring by $\frac{3}{2}$. Already in the case of naphthalene, however, this yields a valence of $\frac{9}{2} \notin \mathbb{N}$ for the two central C atoms. Thus, for each aromatic system, calculation of a particular set of bond multiplicities would be required. For simplicity, aromatic bonds are described using a non-numerical bond type 'aromatic' in MOLGEN.

Another problem arises when charges are involved. An example is the tropylium ion

[Structure diagram of the tropylium ion not reproduced.]

After marking the aromatic bonds we have to erase the positive charge of the C atom, to be consistent, although the structure remains positively charged. But the fact that this charge is shared by the complete aromatic system cannot be expressed in our model yet.

An alternative model uses multi-hypergraphs: Assume a natural number $m \geq 2$ as an upper bound for the number of electrons in an aromatic system. A chemical compound with $n$ atoms can be represented by a multi-hypergraph
$$\varphi \in m^{\mathrm{P}^{\star}(n)},$$
recalling that $\mathrm{P}^{\star}(n)$ denotes the set of nonempty subsets of $n$. For a subset of nodes $N \in \mathrm{P}^{\star}(n)$, the value $\varphi(N)$ is interpreted as the number of electrons shared by the atoms in $N$. As mentioned in Section 1.2, we need to introduce a distribution function $\varepsilon \in \mathrm{E}^n$ for the elements in the molecule, $\varepsilon = (\varepsilon(0), \ldots, \varepsilon(n-1))$, a sequence of length $n$ of (labeled) element symbols $\varepsilon(i) \in \mathrm{E}$. Since $\varphi$ also encodes the free electron pairs and single electrons, we can evaluate the charge, using the number of valence electrons of the element considered, so that we can avoid a distribution function for atom states. Using this idea, $\varphi \in m^{\mathrm{P}^{\star}(n)}$ can be evaluated from $M = (\varepsilon, \zeta, \gamma) \in \mathrm{M}_n$ as follows:
$$\varphi(\{i\}) = 2 p_{\zeta(i)} + r_{\zeta(i)},$$
$$\varphi(\{i,j\}) = \begin{cases} 2 & \text{if } i \text{ and } j \text{ are aromatically bound,} \\ 2\gamma(\{i,j\}) & \text{otherwise,} \end{cases}$$
$$\varphi(\{i_0, \ldots, i_{k-1}\}) = \begin{cases} 0 & \text{if } \{i_0, \ldots, i_{k-1}\} \text{ is not an aromatic system,} \\ \text{the number of electrons that } i_0, \ldots, i_{k-1} \text{ give to the aromatic system} & \text{otherwise,} \end{cases}$$
where $i \in n$, $\{i,j\} \subseteq n$ and $\{i_0, \ldots, i_{k-1}\} \subseteq n$ with $k \geq 3$. In order to apply the graph theoretical notion of Section 1.1 to this model, we have to consider the projection $\frac{1}{2}\varphi|_{\binom{n}{2}}$ to the two-element subsets of $\mathrm{P}^{\star}(n)$.

A similar model for the representation of molecules is described in [165] and [166]. In [16], $\sigma$ and $\pi$ electron systems are distinguished. Among other things, the representation of delocalized $\pi$ electrons is intended to enable the description of aromatic systems. A comprehensive representation of compounds with non-covalent bonds can be found in [92], which also considers configurations, i.e. it can be applied to stereochemistry.

2.5 Molecular graphs and existence of compounds

Until now we have neglected that molecules should also be considered as objects in 3D space. We can account for this by calculating 3D coordinates for $M \in \mathrm{M}_n$, using a mapping $\xi \in (\mathbb{R}^3)^n$ that associates a point in space $\mathbb{R}^3$ with each atom. Such a 3D placement of $M$ can be obtained using, for example, the MM2 force field [5]. The user starts from a more or less arbitrary placement of atoms in space. Then bond and non-bonded distances, bond angles and torsion angles are introduced, together with an energy function, and an optimization process such as a modified Gauß–Newton method is applied. The result is, in many cases, a plausible 3D placement of the molecule. This arrangement may, however, be a local minimum and as such may be chemically irrelevant. It certainly depends on the initial coordinates. These can be chosen using various strategies:
i) random placement,
ii) retaining plausible 2D coordinates and randomizing the third coordinate.
Alternative i) makes sense if several runs of the optimization are intended, for example, in order to analyze a conformation [17, 23]. The probability of obtaining an unrealistic 3D placement is rather high. Alternative ii) is used in MOLGEN. The initial 2D placement is determined using an algorithm [293] that evaluates a structure starting from the system of its rings, distributed over a rectangular part of the plane.

[Figure 2.1: structure diagrams of the 29 isomers not reproduced. Steric energies (SE, in kcal/mol) of isomers 1 to 29, in the order shown: 43.298, 43.819, 44.289, 47.358, 48.222, 48.313, 48.383, 49.140, 50.916, 50.931, 51.539, 52.169, 53.003, 68.791, 79.335, 103.073, 116.170, 118.654, 149.494, 151.925, 189.271, 220.191, 222.097, 252.305, 253.675, 312.426, 328.932, 329.048, 332.724.]

Fig. 2.1. Constitutional isomers C6H6 found in the Beilstein database, together with their steric energy values.

The diversity of existing chemical compounds is astonishingly high, but these seem to cover only a tiny fraction of mathematically possible molecular graphs, see [152]. This is most likely due to the fact that most mathematically possible structures represent energetically unstable compounds. We can quantify this fraction using certain assumptions. The Beilstein database BS0302PR [195] contains many known compounds in organic chemistry, including naturally occurring and synthetic compounds. At the time of this research (August 2003) it contained 8,711,107 entries. We extracted all connected structures consisting of elements in E4 and of mass ≤ 150 Da, to obtain 174,290 compounds. Isotopically labeled compounds, charged compounds, radicals, and compounds containing atoms in non-standard valencies (not 𝑣𝑋) were removed. Only one representative from a set of stereoisomers was considered. Finally, following canonical numbering and elimination of aromatic doublets, 103,040 structures remained. These were sorted according to their molecular formulas and are given in Appendix D, along with the number of constitutional isomers. Both the mathematically possible number of isomers and the number of isomers listed in the Beilstein database are reported. Looking at the numbers, it is no wonder that only a marginal percentage is listed (yet). For example, only 29 of the 217 mathematically possible constitutional isomers of benzene C6H6 are listed in Beilstein (cf. Figure 2.1). A possible explanation is the chemical instability of many mathematically possible connectivity isomers. This argument can be quantified by evaluating the steric energy (SE).

[Figure 2.2: plot not reproduced. Vertical axis: steric energy [kcal/mol], 0 to 700; horizontal axis: isomer, 0 to 200; legend: present in Beilstein, absent in Beilstein.]

Fig. 2.2. Steric energy of the constitutional isomers C6H6.

2.26 Example (The constitutional isomers of benzene) Starting from the 217 mathematically possible constitutional isomers of C6H6 (see Figure 2.1), we applied an energy optimization with ten repetitions, starting from different initial placements, using MOLGEN–QSPR.
– The lowest steric energy value obtained using a force field similar to MM2 was visualized for each C6H6 isomer and is shown in Figure 2.2. Energy values for existing compounds are shown in black, the grey lines indicate energy values of nonexisting isomers.
– Existing isomers tend to have lower energy values. The structures with the lowest energy values that are not contained in the Beilstein database are 1,2,3,4-hexatetraene (47.12 kcal/mol, 1) and 3-methylpenta-1,2-dien-4-yne (48.15 kcal/mol, 2). These isomers were prepared for the first time recently [191].

[Structure diagrams of isomers 1 and 2 not reproduced.]

– The highest-energy C6H6 isomers in Beilstein are bi(cycloprop-1-en-1-yl) (332.72 kcal/mol), prismane (329.05 kcal/mol) and bi(cycloprop-2-en-1-yl) (328.93 kcal/mol).
– Not surprisingly, the C6H6 isomer with maximal steric energy (688.76 kcal/mol) corresponds to the nonplanar graph $K_{3,3}$. This structure is shown as 3, followed by the isomers of next-highest steric energy, the tetrahedrenes 4 and 5:

[Structure diagrams of isomers 3, 4 and 5 not reproduced.]

– A closer look at the structures and their energy values shows that the 15 acyclic C6H6 isomers have the 15 smallest energy values. The smallest energy value among the cyclic isomers belongs to benzene (68.79 kcal/mol).

Starting from a 3D placement, we can also calculate an approximate volume, the van der Waals volume. Each atom is replaced by a sphere of van der Waals radius centered at the atom's coordinates. Details for calculating the volume of intersecting spheres follow in Example 7.4. These volumes are shown in Figure 2.3. Again, existing isomers are represented as black lines. Most of these isomers are in the higher volume range. The isomer with maximal volume but not found in Beilstein is 3-methyl-3-vinylcyclopropyne (95.039 Å³):

[Structure diagram of 3-methyl-3-vinylcyclopropyne not reproduced.]

The smallest volumes belong to the nonplanar $K_{3,3}$, isomer 3 (80.957 Å³) and prismane (83.022 Å³).

[Figure 2.3: plot not reproduced. Vertical axis: van der Waals volume [Å³], 80 to 100; horizontal axis: isomer, 0 to 200; legend: present in Beilstein, absent in Beilstein.]

Fig. 2.3. Van der Waals volumes of the constitutional isomers C6H6.

Existence or nonexistence of a compound is certainly an important topic of chemistry, but depends on the state of the art of chemical synthesis and on the survey of natural occurrence. In Chapters 7 and 8 we shall consider other properties of compounds, including less variable properties. We shall try to deduce experimental properties from the chemical structure and vice versa. For this purpose it is crucial to generate molecular structures with prescribed structural properties in silico, i.e. to generate virtual molecular libraries. The most important tools for this task will be described in Chapter 5.
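The van der Waals volume mentioned above, i.e. the volume of a union of atom-centered spheres, can be approximated without any formula for intersecting spheres. The following sketch uses plain Monte Carlo sampling, which is our own substitute and not the method of Example 7.4; the function name `vdw_volume`, the sample size and the fixed seed are assumptions, and the test spheres are unit spheres rather than real atoms.

```python
import math
import random


def vdw_volume(centers, radii, samples=100_000, seed=0):
    """Monte Carlo estimate of the volume of a union of spheres.

    centers: list of (x, y, z) tuples, radii: list of sphere radii.
    Points are sampled uniformly in the bounding box of all spheres.
    """
    rng = random.Random(seed)
    lo = [min(c[k] - r for c, r in zip(centers, radii)) for k in range(3)]
    hi = [max(c[k] + r for c, r in zip(centers, radii)) for k in range(3)]
    box = math.prod(h - l for l, h in zip(lo, hi))
    hits = 0
    for _ in range(samples):
        p = [rng.uniform(l, h) for l, h in zip(lo, hi)]
        if any(math.dist(p, c) <= r for c, r in zip(centers, radii)):
            hits += 1
    return box * hits / samples


one_sphere = vdw_volume([(0.0, 0.0, 0.0)], [1.0])
# Two unit spheres at distance 1: the exact union volume is
# 2 * (4/3) * pi - pi * (4r + d) * (2r - d)**2 / 12 with r = d = 1.
two_spheres = vdw_volume([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
exact_two = 2 * 4 / 3 * math.pi - math.pi * 5 / 12
```

For a real molecule one would pass the optimized coordinates together with tabulated van der Waals radii; the accuracy improves only with the square root of the number of samples.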

2.6 Molecular descriptors

One of the main aims of computer simulation in chemistry is the prediction of physical, chemical, biological or pharmaceutical properties of chemical compounds using molecular descriptors. We distinguish various kinds of such invariants of molecule graphs; here are a few obvious ones:
– Arithmetical descriptors are, for example, the numbers of atoms of specific elements, as given in the molecular formula, or the molecular mass.
– Topological descriptors are all kinds of graph invariants, e.g. numbers of single, double or triple bonds, numbers of walks of prescribed length, sizes of cycles, eigenvalues of the bond matrix or of the matrix of multiplicities, et cetera.
– Geometrical descriptors depend on the 3D shape of a molecule. They can be obtained from an embedding of the molecule in space. Examples are van der Waals volume and surface, solvent-accessible surface areas, and so on.

In order to define a general notion of molecular descriptor, we first briefly recall the basic definition of molecular graph.
– Consider a set $\mathrm{E}$ of chemical elements, and assume a set $\mathrm{Z}_\mathrm{E}$ of admissible atom states for the elements in $\mathrm{E}$, $\mathrm{Z}_\mathrm{E} = \bigcup_{X \in \mathrm{E}} \mathrm{Z}_X$. A labeled molecular graph on $n$ atoms in $\mathrm{E}$ with atom states contained in $\mathrm{Z}_\mathrm{E}$ is a triple $M = (\varepsilon, \zeta, \gamma)$, where
  ∘ $\varepsilon = (\varepsilon(0), \ldots, \varepsilon(n-1))$ is a sequence of length $n$ of (labeled) atoms $\varepsilon(i) \in \mathrm{E}$, for short: $\varepsilon \in \mathrm{E}^n$,
  ∘ $\zeta = (\zeta(0), \ldots, \zeta(n-1))$ is a sequence of length $n$ of admissible atom states $\zeta(i) \in \mathrm{Z}_{\varepsilon(i)}$, for short: $\zeta \in \mathrm{Z}_\mathrm{E}^n$, and
  ∘ $\gamma$ is a labeled 4-multigraph, $\gamma \in \mathrm{G}_{4,n}$, with $v(\gamma)_i = v_{\zeta(i)}$, which means that node $i$ of the graph has the valence prescribed by the corresponding element atom state.
– The corresponding unlabeled molecular graphs with $n$ atoms in $\mathrm{E}$ are the orbits of the action
$$S_n \times (\mathrm{E}^n \times \mathrm{Z}_\mathrm{E}^n \times \mathrm{G}_{4,n}) \to \mathrm{E}^n \times \mathrm{Z}_\mathrm{E}^n \times \mathrm{G}_{4,n} : (\pi, (\varepsilon, \zeta, \gamma)) \mapsto \pi(\varepsilon, \zeta, \gamma),$$
with $\pi(\varepsilon, \zeta, \gamma) = (\pi\varepsilon, \pi\zeta, \pi\gamma)$ and, for $i, j \in n$, $i \neq j$,
$$\pi\varepsilon(i) = \varepsilon(\pi^{-1} i), \qquad \pi\zeta(i) = \zeta(\pi^{-1} i), \qquad \pi\gamma(\{i,j\}) = \gamma(\{\pi^{-1} i, \pi^{-1} j\}).$$
The orbit $S_n(M)$ of $M$ is denoted by $\bar{M}$.
– We denote the set of the labeled molecular graphs with $n$ atoms by $\mathrm{M}_n$ and use this notation to introduce the following sets:
$$\mathrm{M} = \bigcup_{n>0} \mathrm{M}_n, \qquad \bar{\mathrm{M}}_n = S_n \backslash\!\backslash \mathrm{M}_n, \qquad \bar{\mathrm{M}} = \bigcup_{n>0} \bar{\mathrm{M}}_n.$$
– Restricting attention to connected graphs we obtain the corresponding sets of molecule graphs $\mathrm{M}^c_n$ and
$$\mathrm{M}^c = \bigcup_{n>0} \mathrm{M}^c_n, \qquad \bar{\mathrm{M}}^c_n = S_n \backslash\!\backslash \mathrm{M}^c_n, \qquad \bar{\mathrm{M}}^c = \bigcup_{n>0} \bar{\mathrm{M}}^c_n.$$

2.27 Definition (Molecular descriptor)
– A molecular descriptor is a mapping $\bar{D}$ on the set $\bar{\mathrm{M}}^c$ of molecule graphs. Descriptors can be obtained from mappings $D$, defined on $\mathrm{M}^c$, that are constant on the orbits, $D(M) = D(\pi M)$, so that we can set
$$\bar{D}(\bar{M}) = D(M).$$
– It is important to notice that we can restrict attention to the H-suppressed molecule, as $D(M) = D(\pi M)$ implies $D(M^*) = D(\pi^* M^*)$, for each relabeling $\pi^*$ of the non-H atoms. This justifies the usual procedure to evaluate a descriptor on $M^*$, not on $M$. As molecular descriptors are sometimes evaluated using $M$ or alternatively by restricting attention to $M^*$ [107], two different descriptors $\bar{D}$ and $\bar{D}^*$ arise from $D$ via
$$\bar{D}(\bar{M}) = D(M) \quad \text{and} \quad \bar{D}^*(\bar{M}) = D(M^*).$$

The values of descriptors are often real numbers, $D : \mathrm{M}^c \to \mathbb{R}$, or, more generally, $D(M)$ is a sequence of real numbers. $M = (\varepsilon, \zeta, \gamma)$ yields the following relatively simple descriptors directly:
– $\varepsilon$ itself, a sequence consisting of element symbols of all atoms in the molecule, is not a descriptor, since it depends on the labeling, but it gives the numbers of atoms of specific elements in the molecule, which are independent of the labeling and thus can serve as molecular descriptors.
– $\zeta$, the sequence of atom states, contains the sequence of valences, which is also not a descriptor. After reordering the valences in decreasing order, this yields a sequence that does not depend on the labeling, i.e. it forms another molecular descriptor.
– $\gamma$, the graph, can be tested for the number of bonds of given multiplicity, of walks of prescribed length, of paths, of cycles, and so on. These numbers do not depend on the labeling, so they are molecular descriptors and will be discussed in detail below.
– The same applies to $\varepsilon^*$, $\zeta^*$ and $M^*$.
– Embedding a molecule in space yields further descriptors. Examples include the chirality of a molecule in 3D space, the symmetry group, the van der Waals volume, etc. As embeddings are local energy minima obtained via optimization, they do not describe ideal tetrahedra or other ideal geometric shapes but are often somewhat deformed, such that calculation of descriptors of embeddings still has many unsolved problems.
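The difference between a labeling-dependent sequence and a descriptor in the sense of Definition 2.27 can be checked numerically. The sketch below is an illustration under our own conventions: atom states are tuples (v, p, q, r), the permutation is a list `pi` with `pi[i]` the new label of atom `i`, and the function names are hypothetical. It relabels formaldehyde and compares the element counts and the decreasingly ordered valences before and after.

```python
from collections import Counter


def element_counts(eps):
    """Numbers of atoms per element: independent of the labeling."""
    return Counter(eps)


def valence_partition(zeta):
    """Valences in decreasing order: independent of the labeling."""
    return tuple(sorted((state[0] for state in zeta), reverse=True))


def relabel(eps, zeta, gamma, pi):
    """Apply the permutation pi, so that (pi eps)(pi[i]) = eps(i) and so on."""
    n = len(eps)
    new_eps, new_zeta = [None] * n, [None] * n
    for i in range(n):
        new_eps[pi[i]] = eps[i]
        new_zeta[pi[i]] = zeta[i]
    new_gamma = {frozenset(pi[i] for i in e): m for e, m in gamma.items()}
    return new_eps, new_zeta, new_gamma


# Formaldehyde H2C=O with states (valence, free pairs, charge, radical).
eps = ["C", "O", "H", "H"]
zeta = [(4, 0, 0, False), (2, 2, 0, False), (1, 0, 0, False), (1, 0, 0, False)]
gamma = {frozenset((0, 1)): 2, frozenset((0, 2)): 1, frozenset((0, 3)): 1}

pi = [2, 0, 3, 1]
eps2, zeta2, gamma2 = relabel(eps, zeta, gamma, pi)

sequence_changed = eps2 != eps
counts_equal = element_counts(eps2) == element_counts(eps)
partition_equal = valence_partition(zeta2) == valence_partition(zeta)
```

The sequence of element symbols changes under the relabeling, whereas the counts and the ordered valences stay the same, which is exactly the orbit invariance required of a descriptor.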

The standard reference for molecular descriptors is the encyclopedic book [304]. Descriptors can be split into three main groups: arithmetical, topological and geometrical descriptors.

2.6.1 Arithmetical descriptors

2.28 Definition (Arithmetical descriptor)
– A molecular descriptor $\bar{D}$ is called an arithmetical descriptor if the value $D(M)$ on $M = (\varepsilon, \zeta, \gamma)$ does not depend on $\gamma$ but solely on $\varepsilon$ and/or $\zeta$. If this holds, then $D^*(M^*)$ depends on $\varepsilon^*$ and $\zeta^*$ alone, so that $\bar{D}^*$ is also an arithmetical descriptor. In formal terms, $D$ is arithmetical if and only if any two molecular graphs $M = (\varepsilon, \zeta, \gamma)$ and $M' = (\varepsilon, \zeta, \gamma')$ satisfy $D(M) = D(M')$.
– If $D(M)$ depends only on $\varepsilon$, and therefore $D(M^*)$ only on $\varepsilon^*$, we call $\bar{D}$ as well as $\bar{D}^*$ a purely arithmetical descriptor.

Some descriptors that can be obtained easily using MOLGEN–QSPR are given in the example below. The complete list can be found in Appendix A and in the User Guide, available from http://www.molgen.de/documents/molgenqspr_handbuch.pdf or in references [34, 262]. Some of them will be used in Chapters 6 and 7, where we describe applications of the theory covered here.

2.29 Example (Arithmetical descriptors in MOLGEN)
– Basic arithmetical descriptors arise from the following mappings:
  $A$: the number of atoms in $M$, in $M^*$,
  $MW$: the molecular weight of $M$, of $M^*$,
  $\mathit{meanAW}$: the mean atomic weight in $M$, in $M^*$.
– For elements $X \in \{\mathrm{H}, \mathrm{C}, \mathrm{O}, \mathrm{N}, \mathrm{S}, \mathrm{F}, \mathrm{Cl}, \mathrm{Br}, \mathrm{I}, \mathrm{P}\}$ we have
  $N_X$: the number of $X$ atoms in $M$, in $M^*$,
  $\mathit{rel.}\,N_X$: the relative number of $X$ atoms, $\frac{N_X}{A}$, in $M$.
These are purely arithmetical descriptors. Arithmetical descriptors that are not purely arithmetical, since they depend on $\zeta$, come from the mappings
– $\mathit{charge}$: the total charge,
  $\mathit{rad}$: the number of radical sites,
and various other expressions obtainable from $\zeta$, together with $\varepsilon$, if necessary.
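The arithmetical descriptors of Example 2.29 need only $\varepsilon$ and $\zeta$, as the following sketch shows. It is not the MOLGEN–QSPR implementation: the function name, the dictionary layout and the small table of rounded standard atomic weights are assumptions made for illustration, and atom states are again tuples (v, p, q, r).

```python
# Rounded standard atomic weights for a few elements (illustrative table).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def arithmetical_descriptors(eps, zeta, h_suppressed=False):
    """Evaluate A, MW, meanAW, N_X, rel. N_X, charge and rad.

    With h_suppressed=True the hydrogen atoms are dropped first,
    which corresponds to evaluating on M* instead of M.
    """
    atoms = [(x, s) for x, s in zip(eps, zeta)
             if not (h_suppressed and x == "H")]
    a = len(atoms)
    mw = sum(ATOMIC_WEIGHT[x] for x, _ in atoms)
    counts = {}
    for x, _ in atoms:
        counts[x] = counts.get(x, 0) + 1
    return {
        "A": a,
        "MW": mw,
        "meanAW": mw / a,
        "N": counts,
        "relN": {x: c / a for x, c in counts.items()},
        "charge": sum(s[2] for _, s in atoms),       # depends on zeta
        "rad": sum(1 for _, s in atoms if s[3]),     # depends on zeta
    }


# Formaldehyde H2C=O.
eps = ["C", "O", "H", "H"]
zeta = [(4, 0, 0, False), (2, 2, 0, False), (1, 0, 0, False), (1, 0, 0, False)]

full = arithmetical_descriptors(eps, zeta)
suppressed = arithmetical_descriptors(eps, zeta, h_suppressed=True)
```

Note that the bond graph $\gamma$ is not even an argument of the function, which mirrors the defining property of an arithmetical descriptor.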

2.6.2 Topological descriptors

2.30 Definition (Topological descriptor, topological index)
– A molecular descriptor $\bar{D}$ is called a topological descriptor if the value $D(M)$ on $M = (\varepsilon, \zeta, \gamma)$ depends on $\gamma$. The same applies to $\bar{D}^*$ and $\gamma^*$.
– Topological descriptors which are real numbers are often called topological indices.
– A topological descriptor $\bar{D}$ is purely topological if and only if its value can be obtained from $\gamma$ alone. In formal terms, this means that any two molecular graphs $M = (\varepsilon, \zeta, \gamma)$ and $M' = (\varepsilon', \zeta', \gamma)$ satisfy $D(M) = D(M')$. In this case we may also write $D(\gamma)$ instead of $D(M)$. This also holds for $\bar{D}^*$ and $\gamma^*$.

2.31 Example (Basic topological indices) The following list contains functions that lead to very basic topological descriptors. Further topological indices will be mentioned below as they require additional notions.
– $B$: the number of bonds,
  $n_{-}$, $n_{=}$, $n_{\#}$: the number of single, double, triple bonds,
  $C$: the cyclomatic number, recalling that $C(M) = B(M) - A(M) + 1$, since $|\mathrm{Conn}(\gamma)| = 1$ for a connected graph.
The cyclomatic number of a simple and not necessarily connected graph $\gamma$ with $n$ nodes is $|B(\gamma)| - n + |\mathrm{Conn}(\gamma)|$. The values of $B$ and $C$ are easily obtained from the bond matrix $M_{\gamma^b}$,
$$B(M) = \frac{1}{2} \sum_{i,j} \gamma^b_{ij},$$
and $A(M) = n$, the number of rows and of columns of this matrix. Thus $B$, $n_{-}$, $n_{=}$, $n_{\#}$ and $C$ are purely topological descriptors.

Together with the matrix of bond multiplicities $M_\gamma$ and the bond matrix $M_{\gamma^b}$ of $\gamma$, we introduced the following row sums of entries of these matrices:
$$v(\gamma)_i = \sum_{j \in n} \gamma_{ij} \quad \text{and} \quad b(\gamma)_i = v(\gamma^b)_i = \sum_{j \in n} \gamma^b_{ij},$$
the valence (the number of lines incident with $i$) and the bond degree of $i$ (the number of bonds incident with $i$). They form the sequences
$$v(\gamma) = (v(\gamma)_0, \ldots, v(\gamma)_{n-1}) \quad \text{and} \quad b(\gamma) = v(\gamma^b) = (b(\gamma)_0, \ldots, b(\gamma)_{n-1}),$$
the sequence of valences and the sequence of bond degrees, respectively, see Definition 1.1. They are not (yet) molecular descriptors since they depend on the labeling, but we can easily obtain topological descriptors by simply reordering the entries, which results in the following number partitions:

2.32 Definition (Partition of valences, partition of bond degrees) Reordering the sequences of valences and of bond degrees, respectively, in a decreasing order, we obtain partitions of the numbers
$$2 \cdot |L(\gamma)| = \sum_{i} v(\gamma)_i = \sum_{i,j \in n} \gamma_{ij},$$
i.e. twice the number $|L(\gamma)|$ of lines. Here $L(\gamma)$ denotes the multiset of lines of $\gamma$. Another partition is
$$2 \cdot |B(\gamma)| = \sum_{i} b(\gamma)_i = \sum_{i,j \in n} \gamma^b_{ij},$$
twice the number $|B(\gamma)|$ of bonds. Thus, by reordering the sequences $v(\gamma)$ and $b(\gamma)$ in decreasing order, we obtain the partition of valences
$$v(\bar{\gamma}) = (v(\bar{\gamma})_0, \ldots, v(\bar{\gamma})_{n-1}) \vdash 2 \cdot |L(\gamma)|$$
and the partition of bond degrees
$$b(\bar{\gamma}) = (b(\bar{\gamma})_0, \ldots, b(\bar{\gamma})_{n-1}) \vdash 2 \cdot |B(\gamma)|,$$
where
$$v(\bar{\gamma})_0 \geq \ldots \geq v(\bar{\gamma})_{n-1} \quad \text{and} \quad b(\bar{\gamma})_0 \geq \ldots \geq b(\bar{\gamma})_{n-1}.$$
These sequences do not depend on the labeling of the graph; we indicate this by using $\bar{\gamma}$ instead of $\gamma$. Hence, they are topological descriptors. Moreover, these partitions are purely topological descriptors.

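Our standard example, the labeled 4-multigraph $\gamma$ on the nodes 0, 1, 2, 3

[Diagram of $\gamma$ not reproduced.]

yields
$$M_\gamma = \begin{pmatrix} 0 & 3 & 0 & 0 \\ 3 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{and} \quad M_{\gamma^b} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}.$$
Correspondingly, the partitions of valences and of bond degrees are
$$v(\bar{\gamma}) = (4, 3, 2, 1) \vdash 10 \quad \text{and} \quad b(\bar{\gamma}) = (2, 2, 1, 1) \vdash 6.$$
Thus, we can introduce the following purely topological descriptors:
– $v(\bar{\gamma}) \vdash 2 \cdot |L(\gamma)|$: the partition of valences,
– $b(\bar{\gamma}) \vdash 2 \cdot |B(\gamma)|$: the partition of bond degrees.
The partition $b(\bar{\gamma})$ can be used to rank molecule graphs according to branching, using a partial order on the set of number partitions of $n$.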
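The basic topological indices and the two partitions can be read off the matrix of bond multiplicities directly. The following sketch recomputes them for the standard example; the function names and the dictionary of results are our own choices, not notation from the text.

```python
# Matrix of bond multiplicities of the standard example.
M_GAMMA = [[0, 3, 0, 0],
           [3, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]


def bond_matrix(m):
    """Replace every positive multiplicity by 1."""
    return [[1 if x > 0 else 0 for x in row] for row in m]


def count_components(m):
    """Number of connected components, found by depth-first search."""
    n, seen, components = len(m), set(), 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            i = stack.pop()
            for j in range(n):
                if m[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
    return components


def topological_descriptors(m):
    n = len(m)
    mb = bond_matrix(m)
    b = sum(map(sum, mb)) // 2  # B = (1/2) * sum of bond matrix entries
    by_mult = {k: sum(1 for i in range(n) for j in range(i + 1, n)
                      if m[i][j] == k) for k in (1, 2, 3)}
    return {
        "B": b,
        "n_single": by_mult[1],
        "n_double": by_mult[2],
        "n_triple": by_mult[3],
        "C": b - n + count_components(m),  # cyclomatic number
        "valence_partition": tuple(sorted(map(sum, m), reverse=True)),
        "bond_degree_partition": tuple(sorted(map(sum, mb), reverse=True)),
    }


result = topological_descriptors(M_GAMMA)
```

For the standard example this gives $B = 3$, cyclomatic number 0 and the partitions $(4, 3, 2, 1)$ and $(2, 2, 1, 1)$ stated above.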


2.33 Definition (Partial order, poset) A relation 𝑅 on a set 𝑆 is called a partial order if it is – reflexive, i.e. 𝑠𝑅𝑠, for all 𝑠 ∈ 𝑆, – transitive, i.e. 𝑠𝑅𝑡 together with 𝑡𝑅𝑢 imply that 𝑠𝑅𝑢, and – antisymmetric, i.e. 𝑠𝑅𝑡 together with 𝑡𝑅𝑠 implies 𝑠 = 𝑡. 𝑆 together with the partial order 𝑅 is called a poset (partially ordered set).

Our main example is the following partial order on the set of number partitions of $n$. It occurs in several applications in this book, since it allows us to rank partitions of bond degrees. For example, we shall derive results on the chirality of permutational isomers using the partial order to be introduced next.

2.34 Definition (Dominance order on the number partitions of $n$)
– For two number partitions $\alpha \vdash n$ and $\beta \vdash n$ we define
$$\alpha \trianglelefteq \beta \iff \text{for all } j: \sum_{i=0}^{j} \alpha_i \leq \sum_{i=0}^{j} \beta_i.$$
If this is the case, we say $\alpha$ is dominated by $\beta$. This inequality on all the partial sums defines in fact a partial order, the dominance order on the set of number partitions of $n$. This was used successfully by Ruch and Gutman [260, 261] to rank molecular graphs according to branching, calling the partition of bond degrees $b(\bar{\gamma})$ the branching extent of the graph. We shall also describe its use in connection with chirality and permutational isomers [114].
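The comparison of partial sums in Definition 2.34 is easy to make concrete. In the sketch below, partitions are tuples of positive integers in decreasing order and the shorter one is padded with zeros; the function name `dominated_by` is our own.

```python
from itertools import zip_longest


def dominated_by(alpha, beta):
    """Return True if alpha is dominated by beta (both partitions of the same n)."""
    if sum(alpha) != sum(beta):
        raise ValueError("both partitions must have the same sum")
    partial_alpha = partial_beta = 0
    for a, b in zip_longest(alpha, beta, fillvalue=0):
        partial_alpha += a
        partial_beta += b
        if partial_alpha > partial_beta:
            return False
    return True


below = dominated_by((3, 2, 1), (4, 1, 1))
# Two partitions of 6 that are incomparable in both directions.
incomparable = (not dominated_by((4, 1, 1), (3, 3))
                and not dominated_by((3, 3), (4, 1, 1)))
```

The pair $(4, 1, 1)$ and $(3, 3)$ shows that the dominance order is only a partial order: neither sequence of partial sums, $4, 5, 6$ and $3, 6, 6$, stays below the other.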

The smallest natural number where the dominance order is not a total order of its partitions is $n = 6$. Below is the Hasse diagram [206] of the poset of number partitions of $n = 6$, together with the dominance order. It is written here level by level from top to bottom; each partition is joined to the partitions on the levels directly above and below it, and two partitions on the same level are incomparable:

(6)
(5, 1)
(4, 2)
(4, 1, 1)   (3, 3)
(3, 2, 1)
(3, 1, 1, 1)   (2, 2, 2)
(2, 2, 1, 1)
(2, 1, 1, 1, 1)
(1, 1, 1, 1, 1, 1)

An important result of Ruch and Gutman [261] characterizes the partitions that occur as partitions of bond degrees, the graphical partitions. In order to formulate this result, we represent a partition $\alpha \vdash 2 \cdot |B(\gamma)|$ by its Young diagram $[\alpha]$ that consists of $2 \cdot |B(\gamma)|$ symbols '×', forming rows of lengths $\alpha_0, \alpha_1, \ldots$ and columns of lengths $\alpha'_0, \alpha'_1, \ldots$. For example, the partition $(3, 2, 1) \vdash 6$ is represented by

⊗ × ×
× ⊗
×

We note that this Young diagram has a main diagonal, consisting of two symbols '⊗'. The length of the main diagonal in a Young diagram $[\alpha]$ is denoted by $d(\alpha)$; it is the length of the diagonal of the maximal square contained in the upper left hand corner of $[\alpha]$. Using this length we can formulate Remark 2.35, a characterization of the maximal partitions that are graphical, given by Ruch and Gutman. It shows which partitions occur as bond degree partitions of molecular graphs.

2.35 Remark (Bond degree partitions) Maximal (with respect to dominance order) graphical partitions $\alpha$ of even natural numbers $2 \cdot |B(\gamma)|$ are the partitions that satisfy the following equations for partial sums of differences of column lengths $\alpha'_i$ and row lengths $\alpha_i$ in the corresponding Young diagram:
$$\text{For all } 1 \leq j \leq d(\alpha) \text{ we have } \sum_{i \in j} (\alpha'_i - \alpha_i) = j.$$
Moreover, all the other partitions $\beta \vdash 2 \cdot |B(\gamma)|$ that lie below such a maximal graphical partition are also graphical.

In the case of $2 \cdot |B(\gamma)| = 6$, the maximal graphical partitions are $(3, 1, 1, 1)$ and $(2, 2, 2)$. According to Ruch and Gutman, the graphical partitions in the above poset are the partitions marked in the following picture:

[Figure: the Hasse diagram shown above, with the graphical partitions underlined. These are the maximal graphical partitions (3,1,1,1) and (2,2,2) together with the partitions below them: (2,2,1,1), (2,1,1,1,1) and (1,1,1,1,1,1).]
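Remark 2.35 can be tried out on small cases. The sketch below assumes that the summation index 𝑖 ∈ 𝑗 runs over {0, . . . , 𝑗 − 1} (rows and columns are counted from 0) and obtains all graphical partitions as those dominated by a maximal one. All function names are illustrative.

```python
def partitions(n, largest=None):
    """Yield all number partitions of n as weakly decreasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def dominated_by(alpha, beta):
    """True if every partial sum of alpha is at most that of beta."""
    total_a = total_b = 0
    for k in range(max(len(alpha), len(beta))):
        total_a += alpha[k] if k < len(alpha) else 0
        total_b += beta[k] if k < len(beta) else 0
        if total_a > total_b:
            return False
    return True


def conjugate(alpha):
    """Column lengths of the Young diagram of alpha."""
    return tuple(sum(1 for part in alpha if part > i) for i in range(alpha[0]))


def diagonal_length(alpha):
    """d(alpha): number of rows i (counted from 0) with alpha_i > i."""
    return sum(1 for i, part in enumerate(alpha) if part > i)


def is_maximal_graphical(alpha):
    """Condition of Remark 2.35 on partial sums of column minus row lengths."""
    columns = conjugate(alpha)
    return all(
        sum(columns[i] - alpha[i] for i in range(j)) == j
        for j in range(1, diagonal_length(alpha) + 1)
    )


def graphical_partitions(two_m):
    """Return (maximal graphical partitions, all graphical partitions) of two_m."""
    parts = list(partitions(two_m))
    maximal = [a for a in parts if is_maximal_graphical(a)]
    graphical = [a for a in parts if any(dominated_by(a, top) for top in maximal)]
    return maximal, graphical


if __name__ == "__main__":
    print(graphical_partitions(6))
```

For 2 ⋅ |𝐵(𝛾)| = 6 this reproduces the two maximal partitions and the five graphical partitions named in the text.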

Three of these graphical partitions give connected graphs, and therefore they occur as bond degree partitions of molecular graphs:

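[Figure: the three connected graphs with three bonds, realizing the bond degree partitions (3,1,1,1), (2,2,2) and (2,2,1,1).]

The remaining two graphs are disconnected:

[Figure: the two disconnected graphs with three bonds, realizing the bond degree partitions (2,1,1,1,1) and (1,1,1,1,1,1).]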

Trivially, a connected graph of 𝑛 vertices contains at least 𝑛 − 1 bonds. Thus, the degree partition of a simple graph is the degree partition of a connected simple graph if and only if |𝐵(𝛾)| ≥ 𝑛 − 1. Summarizing, we obtain that a number partition 𝛼 ⊢ 2𝑚 is the sequence of bond degrees of a molecule with 𝑛 atoms if and only if 𝛼0 ≤ 4 and 𝑚 ≥ 𝛼′0 − 1. The last item shows that the characterization of bond degree sequences of molecules does not refer to the dominance order. This important partial order was mentioned since we shall use it later in connection with chiral permutational isomers.

Using a different approach, based on a characterization of graphical partitions by Erdős and Gallai, M. Schocker [278] characterized graphical partitions in the following way: 𝛼 = (𝛼0, . . . , 𝛼𝑛−1) ⊢ 2𝑚 is graphical if and only if

\[ \sum_{\nu \in k} \alpha_\nu + \sum_{\nu \in i} \alpha_\nu \le 2m + k(i - 1) \quad \text{for } 1 \le k \le d(\alpha),\ d(\alpha) \le i \le n. \]
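As a cross-check, graphical partitions can also be recognized with the classical Erdős–Gallai inequalities, on which Schocker's characterization is based. The sketch below implements the classical test, not Schocker's variant, and combines it with the two conditions for molecules quoted above (largest part at most 4, at least 𝑛 − 1 bonds). The function names are illustrative, and requiring graphicality inside `is_molecular_degree_sequence` is an assumption made here.

```python
def is_graphical(alpha):
    """Classical Erdős–Gallai test for a weakly decreasing sequence of positive degrees."""
    if sum(alpha) % 2 != 0:
        return False
    for k in range(1, len(alpha) + 1):
        left = sum(alpha[:k])
        right = k * (k - 1) + sum(min(degree, k) for degree in alpha[k:])
        if left > right:
            return False
    return True


def is_molecular_degree_sequence(alpha):
    """Graphical, largest degree at most 4, and enough bonds for a connected graph."""
    n = len(alpha)        # number of atoms, i.e. number of parts of alpha
    m = sum(alpha) // 2   # number of bonds
    return is_graphical(alpha) and alpha[0] <= 4 and m >= n - 1


if __name__ == "__main__":
    for alpha in [(3, 1, 1, 1), (2, 2, 2), (2, 2, 1, 1), (2, 1, 1, 1, 1), (3, 2, 1)]:
        print(alpha, is_graphical(alpha), is_molecular_degree_sequence(alpha))
```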

M. Schocker even obtained a characterization of graphical partitions with prescribed cyclomatic number 𝐶 (Corollary 2.5 in [278]):

2.36 Remark (Degree sequences of graphs by cyclomatic number) A partition 𝛼 ⊢ 2𝑚 of an even number
– is the degree sequence of a tree (𝐶 = 0) with 𝑛 nodes if and only if 𝑛 = 𝑚 + 1,
– is the degree sequence of a simple graph with 𝐶 = 1 if and only if 𝑛 = 𝑚, 𝛼0 ≤ 𝑚 − 1 and 𝛼0 + 𝛼1 ≤ 𝑚 + 1,
– is the degree sequence of a simple graph with 𝐶 = 2 if and only if 𝑛 = 𝑚 − 1, 𝛼0 ≤ 𝑚 − 2, 𝛼0 + 𝛼1 ≤ 𝑚 + 1 and 𝛼0 + 𝛼1 + 𝛼2 ≤ 𝑚 + 3.

The following concept, expressed in terms of bond multiplicities, is a generalization of a chemical notion.


2.37 Definition (Occurrence of bond multiplicities, hybridization) Consider a labeled multigraph 𝛾 ∈ G𝑚,𝑛 and a node 𝑖 of 𝛾, assuming that it is not an isolated node, i.e. 𝑣(𝛾)𝑖 > 0, or, equivalently, 𝑏(𝛾)𝑖 > 0. We denote the number of nodes connected to 𝑖 by a 𝑘-fold bond by

\[ \mu_k = |\{ j \in n \mid \gamma(\{i, j\}) = k \}|, \quad \text{for } k = 1, \ldots, m - 1. \]

The distribution of these multiplicities of bonds meeting in 𝑖 is denoted by hyb𝛾(𝑖) = (𝜇1, ..., 𝜇𝑚−1) and is called the hybridization.

In fact this definition differs from the chemical definition of hybridization. Thus, an sp-hybridized carbon atom (valence 4) may have either the (mathematical) hybridization (1,0,1) or (0,2,0).

The possible hybridizations of a node of valence 4 are below:

[Figure: a node of valence 4 drawn with each of its four possible hybridizations: (4, 0, 0), (2, 1, 0), (1, 0, 1) and (0, 2, 0).]
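The four hybridizations are the solutions of 𝜇1 + 2𝜇2 + 3𝜇3 = 4 in non-negative integers. A minimal sketch, assuming a multigraph is given as a symmetric matrix of bond multiplicities; the names `hyb` and `hybridizations` are chosen here for illustration.

```python
from itertools import product


def hyb(bond_matrix, i, m=4):
    """Hybridization of node i: (mu_1, ..., mu_{m-1}), mu_k = number of k-fold bonds at i."""
    return tuple(
        sum(1 for multiplicity in bond_matrix[i] if multiplicity == k)
        for k in range(1, m)
    )


def hybridizations(valence, max_multiplicity=3):
    """All vectors (mu_1, ..., mu_max) whose bond multiplicities add up to the valence."""
    result = [
        mu
        for mu in product(range(valence + 1), repeat=max_multiplicity)
        if sum(k * count for k, count in enumerate(mu, start=1)) == valence
    ]
    return sorted(result, reverse=True)


if __name__ == "__main__":
    print(hybridizations(4))
    # Carbon dioxide O=C=O with the carbon atom as node 0.
    carbon_dioxide = [[0, 2, 2], [2, 0, 0], [2, 0, 0]]
    print(hyb(carbon_dioxide, 0))
```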

In Section 2.1 we introduced the notions of walk, cycle and ring, defined graphical and molecular substructures as well as molecular descriptors (e.g. walk counts) and several kinds of embeddings. These notions allow the introduction of further topological descriptors, which are then applied in Chapter 7. The corresponding mappings on labeled graphs are the following counting functions:
– 𝑡𝑤𝑐: the total walk count in the bond graph, and 𝑚𝑤𝑐(𝑙): the count of walks of prescribed length 𝑙.

We saw in Equation (2.2) that walk counts are obtained by summing the entries in suitable powers of the bond matrix of 𝛾 or of 𝛾∗. Further, there are the purely topological indices such as path counts and ring counts, denoted as follows:
– 𝑃 and 𝑃(𝑙): the count of paths and of paths of length 𝑙,
– rings and rings(𝑙): the count of rings and of rings of size 𝑙.

The number of independent cycles, the cyclomatic number, was already mentioned. Another interesting descriptor is the girth of 𝛾, introduced in Definition 2.4, the minimal length of a cycle in 𝛾, if any (set to ∞ by default):
– girth𝛾: the girth of the graph.

Further topological descriptors come from the distance matrix, containing the distances between nodes, introduced in Definition 2.5, D𝛾 = (dist𝛾(𝑖, 𝑗)).


The first topological index (historically) is the Wiener index, developed by H. Wiener [339], which was used to model the boiling points of alkanes and was based on the following function:
– 𝑊: the sum of all distances.

Its value is, expressed in terms of the symmetric matrix of distances,

\[ W(M) = \frac{1}{2} \sum_{i,j \in n} \mathrm{dist}_\gamma(i, j). \]
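The Wiener index of a small skeleton can be computed directly from this formula. The sketch below assumes a connected, hydrogen-suppressed graph given as a list of bonds and obtains the distances by breadth-first search; the function names are illustrative.

```python
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first search distances from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return distance


def wiener_index(n, bonds):
    """Half the sum of all entries of the distance matrix of a connected graph."""
    adjacency = {i: [] for i in range(n)}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    total = sum(sum(distances_from(adjacency, i).values()) for i in range(n))
    return total // 2


if __name__ == "__main__":
    n_butane = [(0, 1), (1, 2), (2, 3)]    # carbon chain C-C-C-C
    isobutane = [(0, 1), (0, 2), (0, 3)]   # central carbon with three methyl groups
    print(wiener_index(4, n_butane), wiener_index(4, isobutane))
```

The branched isomer has the smaller value, which is the kind of difference Wiener related to boiling points.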

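The Wiener index gives rise to the topological indices 𝑊̄ and 𝑊̄∗ with their values

\[ \bar{W}(\bar{M}) = W(M) = \frac{1}{2} \sum_{i,j \in n} \mathrm{dist}_\gamma(i, j) \]

and

\[ \bar{W}^*(\bar{M}) = W(M^*) = \frac{1}{2} \sum_{i,j \in n} \mathrm{dist}_{\gamma^*}(i, j). \]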

More recent indices are the Zagreb indices, based on the mappings 𝑀1 and 𝑀2, using squares or products of bond degrees,

\[ M_1(M) = \sum_{i \in n} b(\gamma)_i^2 \quad \text{and} \quad M_2(M) = \sum_{\{i,j\} \in B(\gamma)} b(\gamma)_i \cdot b(\gamma)_j. \]

There are also modified Zagreb indices, which are introduced below. The Randić indices of various orders also use bond degrees. Some other descriptors also carry the names of their inventors, for example, Balaban, Basak, Gordon–Scantlebury, Harary, Hosoya, Kier, Kier and Hall, Platt, Schultz, and others. They are listed in the Appendix, see Section A.

An interesting problem is the question of intercorrelations between different descriptors. Unfortunately there is not much work done in this direction, but statistical methods – using a molecular library, evaluating the descriptors on its compounds and looking for correlations – sometimes suggest such connections between descriptors, in which case one may try to prove that the correlation is in fact true (see e.g. [32, 35, 129]). An example is the close connection between the second Zagreb index and the count of molecular walks of length 3, suggested by Table 7.4 in Chapter 7, where we give descriptor values for 50 decanes. This connection does not seem to be widely known; for example, such a relation is not mentioned in a recent review on Zagreb indices [222]. The table suggests that Zagreb index 𝑀2 might just be half the molecular walk count 𝑚𝑤𝑐(3), while 𝑀1 is in fact equal to 𝑚𝑤𝑐(2). This is true in general, and it follows directly from the laws of matrix multiplication:

2.38 Remark (The Zagreb indices 𝑀1 and 𝑀2 count walks) For 𝛾 ∈ M𝑛, walk counts and Zagreb indices, the following holds:

\[ mwc^{(2)} = M_1 \quad \text{and} \quad mwc^{(3)} = 2 \cdot M_2. \]
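Remark 2.38 is easy to confirm numerically. The sketch below assumes a simple graph given as a list of bonds, counts walks by summing the entries of powers of the bond matrix, and compares the result with the Zagreb indices; all names are illustrative.

```python
def bond_matrix(n, bonds):
    """Symmetric 0/1 matrix of a simple graph on n nodes."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def walk_count(matrix, length):
    """Number of walks of the given length: sum of all entries of the matrix power."""
    power = matrix
    for _ in range(length - 1):
        power = mat_mul(power, matrix)
    return sum(sum(row) for row in power)


def zagreb_indices(n, bonds):
    """First and second Zagreb index computed from the bond degrees."""
    degree = [sum(row) for row in bond_matrix(n, bonds)]
    m1 = sum(d * d for d in degree)
    m2 = sum(degree[i] * degree[j] for i, j in bonds)
    return m1, m2


if __name__ == "__main__":
    # Carbon skeleton of 2-methylbutane: chain 0-1-2-3 with a methyl group 4 at node 1.
    bonds = [(0, 1), (1, 2), (2, 3), (1, 4)]
    matrix = bond_matrix(5, bonds)
    print(zagreb_indices(5, bonds), walk_count(matrix, 2), walk_count(matrix, 3))
```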

Consider the bond matrix M𝛾𝑏 = (𝛾𝑏𝑖𝑗) of 𝛾, its square $\mathbf{M}_{\gamma b}^2 = (\gamma b_{ij}^{(2)})$ and its cube $\mathbf{M}_{\gamma b}^3 = (\gamma b_{ij}^{(3)})$. Since the number of walks of length 𝑙 is the sum of the entries of $\mathbf{M}_{\gamma b}^l$, we obtain:

\[
\begin{aligned}
mwc^{(2)}(M) &= \sum_{i,j \in n} \gamma b_{ij}^{(2)} \\
&= \sum_{i,j,k \in n} \gamma b_{ik}\, \gamma b_{kj} \\
&= \sum_{k \in n} \underbrace{\sum_{i \in n} \gamma b_{ik}}_{= b(\gamma)_k} \; \underbrace{\sum_{j \in n} \gamma b_{kj}}_{= b(\gamma)_k} \\
&= \sum_{k \in n} b(\gamma)_k^2 \\
&= M_1(M).
\end{aligned}
\]

The same applies to the count of walks of length 3. An alternative and visual proof is given in [35].

An important mathematical structure is certainly the automorphism group, the group of all relabelings that keep a labeled graph fixed: Aut(𝛾) = {𝜋 ∈ 𝑆𝑛 | 𝜋𝛾 = 𝛾}. For example, the automorphism group of the graph

[Figure: a labeled graph on the nodes 0, 1, 2, 3]

is Aut(𝛾) = {1, (23)}, which shows that not the group ‘as it is’ but its conjugacy class of subgroups, which means ‘the group up to relabeling of the graph’,

\[ \widetilde{\mathrm{Aut}}(\gamma) = \{ \pi\, \mathrm{Aut}(\gamma)\, \pi^{-1} \mid \pi \in S_n \} \]

is a topological descriptor:
– Ãut(𝛾): the conjugacy class of the automorphism group of 𝛾.

This is important to count molecular substructures, as we have seen, and it allows, for example, the evaluation of an approximate number of carbon signals in a 13C NMR spectrum. This number is only an approximation since the number of signals may be lower or higher than expected in a real spectrum. For example, 1-chloro-2-methylprop-1-ene exhibits four rather than three carbon signals; its geometrical symmetry is lower than its topological symmetry.
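The count of three topologically distinct carbon atoms in 1-chloro-2-methylprop-1-ene can be reproduced by brute force. The sketch below assumes a hydrogen-suppressed molecular graph with the atom numbering shown in the code (this numbering is an illustration, not taken from the text) and tests every permutation, which is only feasible for very small molecules.

```python
from itertools import permutations


def automorphisms(elements, bonds):
    """All node permutations that preserve element symbols and bond multiplicities."""
    n = len(elements)
    reference = {frozenset(pair): mult for pair, mult in bonds.items()}
    group = []
    for perm in permutations(range(n)):
        if any(elements[perm[i]] != elements[i] for i in range(n)):
            continue
        image = {frozenset((perm[i], perm[j])): mult for (i, j), mult in bonds.items()}
        if image == reference:
            group.append(perm)
    return group


def orbits(group, n):
    """Orbits of the group on the node set {0, ..., n-1}."""
    seen, result = set(), []
    for i in range(n):
        if i not in seen:
            orbit = sorted({perm[i] for perm in group})
            seen.update(orbit)
            result.append(orbit)
    return result


if __name__ == "__main__":
    # Cl(0)-C(1)=C(2)(-C(3))(-C(4)), hydrogen atoms suppressed.
    elements = ["Cl", "C", "C", "C", "C"]
    bonds = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (2, 4): 1}
    group = automorphisms(elements, bonds)
    carbon_orbits = [o for o in orbits(group, 5) if elements[o[0]] == "C"]
    print(group, carbon_orbits)
```

The two methyl carbons fall into one orbit, giving three topological classes of carbon atoms, one fewer than the four signals observed.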


2.6.3 Geometrical descriptors

There are many molecular properties that depend on the 3D shape of a molecule (see Section 7.5). Descriptors that take this information into account are called geometrical descriptors or geometrical indices. To calculate geometrical indices, the atoms of the molecular graph 𝑀 ∈ M𝑛 first need 3D coordinates $\xi \in (\mathbb{R}^3)^n$. Various methods are available for this task [271, 272] and the efficient empirical force field method implemented in MOLGEN–QSPR is described here. It allows the user to obtain local energy minima of placements quickly, using literature values of ideal bond lengths and angles. It is based on a mechanical model that minimizes an empirical energy function by adding up the energy contributions of all forces acting on the molecule. The function is called an empirical energy function since it was derived experimentally, not theoretically. The implementation is a simple version similar to Allinger’s MM2 force field [5]. The function consists of the following elements:

– The first aspect taken into account is the length of each bond. The average length of covalent bonds in a molecule can be determined sufficiently exactly by spectroscopic measurements. The deviation from this ideal value appears quadratically in the potential:

\[ E_s = 143.88 \cdot \frac{k_s}{2} \left( |\mathbf{x}_i - \mathbf{x}_j| - l_{ij}^{tab} \right)^2. \]

Here $\mathbf{x}_i$ and $\mathbf{x}_j$ are the vectors of bonded atoms 𝑖 and 𝑗, $l_{ij}^{tab}$ is the ‘ideal’ length of a bond of the respective type, and $k_s$ is a constant that depends on the two atom types and the bond degree. For example, a C−C single bond of two sp3-hybridized carbon atoms has $k_s = 4.4$ and $l_{ij}^{tab} = 1.523$ Å.

– Another important influence comes from each bond angle. Again the deviation from a spectroscopically derived ‘ideal’ value is considered:

\[ E_b = 0.043828 \cdot \frac{k_b}{2} \left( \alpha_{ijk} - \alpha_{ijk}^{tab} \right)^2. \]

The constants have an analogous meaning. For a C−C−C arrangement, where the central C is sp3-hybridized, there is $k_b = 0.45$ and $\alpha_{ijk}^{tab} = 109.470^\circ$.

– Moreover, the torsion angle of four serially bonded atoms contributes to the potential. Here the first three terms of a Fourier series are used,

\[ E_\omega = \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega). \]

𝜔 is the torsion angle and $V_1$, $V_2$ and $V_3$ are constants depending on atom types. For a C−C−C−C sequence $V_1 = 0.2$, $V_2 = 0.27$ and $V_3 = 0.093$. The torsion potentials are calculated and added for all sequences of four atoms.

– Atoms that are not covalently bonded also interact. The contribution of their van der Waals interaction to the potential is:

\[ E_{vdw} = \sqrt{E_i E_j} \cdot \begin{cases} 290{,}000 \exp\left( -\dfrac{12.5}{p_{ij}} \right) - 2.25\, p_{ij}^6 & \text{if } p_{ij} \le 3.311, \\[1ex] 336.176\, p_{ij}^2 & \text{otherwise.} \end{cases} \]

Here $E_i$ is another constant (an atom’s ‘hardness’) and $p_{ij}$ is the ratio of the sum of the van der Waals radii and the interatomic distance,

\[ p_{ij} = \frac{R_i + R_j}{|\mathbf{x}_i - \mathbf{x}_j|}. \]

An sp3-hybridized carbon atom has $R_i = 1.9$ Å and for an interaction of two such atoms $\sqrt{E_i \cdot E_j} = 0.044$.

The total potential function is therefore the following:

\[ E = \sum_{\text{bond } i,j} (E_s)_{ij} + \sum_{\text{angle } i,j,k} (E_b)_{ijk} + \sum_{\text{tor. angle } i,j,k,l} (E_\omega)_{ijkl} + \sum_{i,j \in n} (E_{vdw})_{ij}. \]
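The four contributions translate directly into small functions. The following is a minimal sketch, not the MOLGEN–QSPR implementation: the default parameters are the sp3 carbon values quoted above, angles are assumed to be given in degrees (as the tabulated ideal angle is), and the function names are chosen here for illustration.

```python
import math


def stretch_energy(length, k_s=4.4, l_tab=1.523):
    """Bond stretching term E_s for a bond of the given length."""
    return 143.88 * k_s / 2.0 * (length - l_tab) ** 2


def bend_energy(angle_deg, k_b=0.45, angle_tab_deg=109.47):
    """Angle bending term E_b for a bond angle given in degrees."""
    return 0.043828 * k_b / 2.0 * (angle_deg - angle_tab_deg) ** 2


def torsion_energy(omega_deg, v1=0.2, v2=0.27, v3=0.093):
    """Torsion term E_omega: first three terms of a Fourier series."""
    omega = math.radians(omega_deg)
    return (
        v1 / 2.0 * (1.0 + math.cos(omega))
        + v2 / 2.0 * (1.0 - math.cos(2.0 * omega))
        + v3 / 2.0 * (1.0 + math.cos(3.0 * omega))
    )


def vdw_energy(distance, r_i=1.9, r_j=1.9, hardness=0.044):
    """Van der Waals term; hardness stands for sqrt(E_i * E_j)."""
    p = (r_i + r_j) / distance
    if p <= 3.311:
        return hardness * (290000.0 * math.exp(-12.5 / p) - 2.25 * p ** 6)
    return hardness * 336.176 * p ** 2


if __name__ == "__main__":
    print(stretch_energy(1.60), bend_energy(112.0), torsion_energy(60.0), vdw_energy(3.8))
```

The two branches of the van der Waals term agree closely at 𝑝 = 3.311, so the switch to the quadratic branch at very short distances does not introduce a jump in the energy.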

Further interactions such as electromagnetic ones are not taken into account. In the last term, the addition is over all pairs of atoms at least three bonds apart. For details see [5, 38].

A necessary condition for a conformer is that the empirical potential has a local minimum. So a numerical method is used in order to find such a minimum. There are several numerical methods available for the minimization of a sufficiently smooth, non-linear function. The simplest method is to vary the coordinates in the direction of the negative gradient, i.e. the steepest descent. There is, however, no guarantee to find the minimum by this method after a finite number of steps. The Newton method therefore considers also the matrix of second derivatives, where the matrix has to be computed completely. Using the empirical potential would reduce the speed dramatically. Thus, we chose the method of conjugate gradients to solve the minimization problem.

This method was originally developed to solve linear equation systems 𝐴x = b. It starts from the observation that the exact solution x is a minimum of

\[ F(\mathbf{z}) = \frac{1}{2} \mathbf{z}^T A \mathbf{z} - \mathbf{b}^T \mathbf{z} + \frac{1}{2} \mathbf{b}^T A^{-1} \mathbf{b}, \]

i.e. 𝐹(x) = min(𝐹(z)) = 0. While the method of steepest descent minimizes in one dimension only, here in step x𝑘 → x𝑘+1, a (𝑘 + 1)-dimensional minimization is carried out:

\[ F(\mathbf{x}_{k+1}) = \min_{\mu_0, \ldots, \mu_k} F(\mathbf{x}_k + \mu_0 \mathbf{r}_0 + \ldots + \mu_k \mathbf{r}_k), \quad \text{where } \mathbf{r}_i = \mathbf{b} - A \mathbf{x}_i \text{ for } i \le k. \]

The application to the minimization of a quadratic function such as

\[ f(\mathbf{x}) = f(\mathbf{h}) + \frac{1}{2} (\mathbf{x} - \mathbf{h})^T A (\mathbf{x} - \mathbf{h}) \]

is now clear, since 𝐴x = b must hold with b = 𝐴h in order to make the gradient ∇𝑓 vanish. In the solution of the equations, the directions p0, p1, . . . are calculated so that p𝑘+1 is a linear combination of ∇𝑓(x𝑘+1) and the previous directions p0, . . . , p𝑘, and $\mathbf{p}_i^T A \mathbf{p}_j = 0$ for 𝑖 ≠ 𝑗. Fortunately, many of the coefficients vanish, so that only

\[ \mathbf{p}_{k+1} = -\nabla f(\mathbf{x}_{k+1}) + \beta_k \mathbf{p}_k \]

remains, where

\[ \beta_k = \frac{\nabla f(\mathbf{x}_{k+1})^2}{\nabla f(\mathbf{x}_k)^2}. \]

So the algorithm for the minimization reads:

2.39 Algorithm (Placing molecules in space, conjugate gradients)
i) Choose a random start vector x0 ∈ ℝ3𝑛, set g0 = ∇𝑓(x0) and p0 = −g0.
ii) If g𝑘 = 0: Stop. Else:
iii) Determine x𝑘+1 by linear minimization along the search direction p𝑘 from

\[ f(\mathbf{x}_{k+1}) = \min_{\lambda \ge 0} f(\mathbf{x}_k + \lambda \mathbf{p}_k). \]

Set g𝑘+1 = ∇𝑓(x𝑘+1),

\[ \beta_k = \frac{(\mathbf{g}_{k+1})^2}{(\mathbf{g}_k)^2} \quad \text{and} \quad \mathbf{p}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{p}_k. \]

iv) Go to ii)
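The steps of the algorithm can be followed on a quadratic test function. The sketch below is an illustration under stated assumptions, not the MOLGEN–QSPR code: it minimizes 𝑓(x) = ½ xᵀ𝐴x − bᵀx for a symmetric positive definite matrix 𝐴, for which the gradient is 𝐴x − b and the line minimization in step iii) has a closed form. For the empirical potential a numerical line search would be needed instead. The stop test uses a small tolerance in place of g𝑘 = 0.

```python
def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, vector):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [dot(row, vector) for row in matrix]


def conjugate_gradients(matrix, b, x0, tol=1e-12):
    """Minimize 0.5 x^T A x - b^T x; returns the minimizer and the number of steps."""
    x = list(x0)
    g = [ax - bi for ax, bi in zip(mat_vec(matrix, x), b)]   # gradient A x - b
    p = [-gi for gi in g]                                    # first direction: steepest descent
    steps = 0
    while dot(g, g) > tol:                                   # step ii) with a tolerance
        ap = mat_vec(matrix, p)
        lam = -dot(g, p) / dot(p, ap)                        # exact line minimization
        x = [xi + lam * pi for xi, pi in zip(x, p)]
        g_next = [gi + lam * api for gi, api in zip(g, ap)]  # gradient at the new point
        beta = dot(g_next, g_next) / dot(g, g)
        p = [-gn + beta * pi for gn, pi in zip(g_next, p)]
        g = g_next
        steps += 1
    return x, steps


if __name__ == "__main__":
    solution, steps = conjugate_gradients([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [2.0, 1.0])
    print(solution, steps)
```

For this two-variable example the minimum (1/11, 7/11) is reached after two steps, in line with the finite termination property of the method for quadratic functions.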

If any minimum exists, this method provides it in at most 𝑛 steps. However, the computed extremum is only a local one, i.e. the resulting conformations may differ if we use different initial values.

Once the 3D placement is calculated, we can evaluate some more descriptors, such as:
– 𝐺1: gravitational index (pairs, 3D-dist.)
– 𝐺2: gravitational index (bonds, 3D-dist.)

They are defined as follows: 𝐺1(𝑀) = ∑ 𝑖 …

… max{𝑏′ ∈ 𝐵(𝛾)}}. An easy check shows that 𝑃 has the required properties, and so we can apply orderly generation for a construction of G2,𝑛, which is the canonic transversal of the orbits of the identity group: G2,𝑛 = rep<({1}\\G2,𝑛). The following example is a small case described in detail to show how the orderly generation of labeled graphs works:

5.4 Example (Labeled generation of simple graphs on three nodes) Let us have a brief look at the small case of 𝑛 = 3 nodes. The partition that we use consists of the sets 𝛺𝑖, 0 ≤ 𝑖 ≤ $\binom{3}{2}$ = 3, of the labeled graphs on 3 nodes consisting of 𝑖 bonds. We recall that a graph 𝛾 ∈ G2,𝑛 can be identified with the set 𝐵(𝛾) of its bonds, as 𝑛 is given. The orderly generation of G2,𝑛 runs as follows:
– 𝛺0 consists of the empty graph, let us denote it by 𝛾0, 𝛺0 = {𝛾0 = (0, 0, 0)}. Being the only element in 𝛺0, it forms the canonic transversal of 𝛺0, and therefore (since the identity group is the acting group) this set consists of the empty graph only, 𝛺0 = {𝛾0} = rep<({1}\\𝛺0).
– We have to apply 𝑃 to the elements of this canonic transversal, which means adding a bond, one of the bonds {0, 1}, {0, 2} and {1, 2}. The three graphs obtained, consisting of just one bond only, form the canonic transversal of 𝛺1, therefore 𝛺1 = {{{0, 1}}, {{0, 2}}, {{1, 2}}} = rep<({1}\\𝛺1).
– And so on with 𝛺2 and 𝛺3.

Figure 5.1 shows the way bonds are inserted during the application of Algorithm 5.1.

The next example is the orderly generation of unlabeled graphs on 𝑛 nodes. Describing this merely involves replacing the group {1} with the symmetric group 𝑆𝑛.

Orderly generation of unlabeled simple graphs

We want to generate the canonic transversal rep<(𝑆𝑛\\G2,𝑛) since it yields the unlabeled graphs by simply erasing the labels. Recall that a labeled graph 𝛾 on 𝑛 nodes is called canonical if it is minimal in its orbit. In mathematical terms: ∀ 𝜋 ∈ 𝑆𝑛 : 𝛾 ≤ 𝜋𝛾.

5.5 Example (Unlabeled generation of simple graphs on three nodes) Continuing from Example 5.4, we use the same partition of the set of labeled graphs consisting of 𝛺𝑖 and apply 𝑃. As in the present nontrivial case the orbits are not necessarily one-element sets, we now have to find the minimal elements in the sets 𝑃(𝑥). The starting point is again 𝛺0, the empty graph, which yields an element of the desired canonic transversal, rep<(𝑆3\\G2,3) = {𝛾0, . . .}.

Fig. 5.1. Backtrack tree for labeled generation of simple graphs on three nodes.

Next we have to find the minimal element in the set of labeled graphs obtained by inserting a suitable bond according to the definition of 𝑃: 𝑃(𝛾0) = {{{0, 1}}, {{0, 2}}, {{1, 2}}}. The minimal element of 𝑃(𝛾0) is {{0, 1}}, and we obtain an additional element of the transversal that we want to generate, rep<(𝑆3\\G2,3) = {𝛾0, {{0, 1}}, . . .}. Now we have to find the minimal element of 𝑃({{0, 1}}) = {{{0, 1}, {0, 2}}, {{0, 1}, {1, 2}}}, which is 𝛾 = {{0, 1}, {0, 2}}, obtaining rep<(𝑆3\\G2,3) = {𝛾0, {{0, 1}}, {{0, 1}, {0, 2}}, . . .}. As 𝑃({{0, 1}, {0, 2}}) = {{{0, 1}, {0, 2}, {1, 2}}}, we finally obtain the desired canonical transversal: rep<(𝑆3\\G2,3) = {𝛾0, {{0, 1}}, {{0, 1}, {0, 2}}, {{0, 1}, {0, 2}, {1, 2}}}. Figure 5.2 shows the backtrack tree for this application of Algorithm 5.1.

Fig. 5.2. Backtrack tree for unlabeled generation of simple graphs on three nodes.

However, this algorithm has to check all of the $2^{n(n-1)/2}$ labeled graphs on 𝑛 nodes for canonicity, i.e. for minimality. The main finding of Read [246] and Faradzev [72, 73] was that every minimal orbit representative with 𝑞 bonds has a minimal subgraph with 𝑞 − 1 bonds. Thus, non-minimal intermediates need not be considered for further augmentation. This was formulated in [109]:

5.6 Corollary If 𝛾 ∈ rep<(𝑆𝑛\\G2,𝑛) and 𝛾0 ∈ G2,𝑛 with 𝐵(𝛾0) ⊂ 𝐵(𝛾) and 𝛾0 < 𝛾, then 𝛾0 ∈ rep<(𝑆𝑛\\G2,𝑛).

Using this knowledge, the algorithm given by Read can be improved to:

5.7 Algorithm (Orderly Generation (𝛾))
(1) if 𝛾 ∉ rep<(𝑆𝑛\\G2,𝑛) then return
(2) Output 𝛾
(3) for each 𝛾′ ∈ 𝑃(𝛾) (in increasing order) do
(4)     Orderly Generation (𝛾′)
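Algorithm 5.7 is short enough to be written out in full. The sketch below makes several assumptions that are not fixed by the text at this point: a graph is stored as the sorted tuple of its bonds, graphs with equally many bonds are compared lexicographically as such tuples, 𝑃(𝛾) adds one bond that is larger than all bonds of 𝛾, and canonicity is tested naively by running through the whole acting group. Passing the identity group gives the labeled generation of Example 5.4, passing 𝑆𝑛 the unlabeled generation.

```python
from itertools import combinations, permutations


def identity_group(n):
    """The trivial group {1} acting on n nodes."""
    return [tuple(range(n))]


def symmetric_group(n):
    """All permutations of the n nodes."""
    return list(permutations(range(n)))


def relabel(graph, perm):
    """Apply a node permutation to a graph stored as a sorted tuple of bonds."""
    return tuple(sorted(tuple(sorted((perm[i], perm[j]))) for i, j in graph))


def is_canonical(graph, group):
    """A graph is canonical if no relabeling by the group yields a smaller graph."""
    return all(graph <= relabel(graph, perm) for perm in group)


def augmentations(graph, pairs):
    """P(graph): add one bond that comes after the largest bond already present."""
    start = pairs.index(graph[-1]) + 1 if graph else 0
    return [graph + (bond,) for bond in pairs[start:]]


def orderly_generation(n, group):
    """Canonical orbit representatives of simple graphs on n nodes, in generation order."""
    pairs = list(combinations(range(n), 2))   # bonds in lexicographic order
    found = []

    def recurse(graph):
        if not is_canonical(graph, group):    # step (1): prune non-minimal graphs
            return
        found.append(graph)                   # step (2): output
        for successor in augmentations(graph, pairs):   # steps (3) and (4)
            recurse(successor)

    recurse(())
    return found


if __name__ == "__main__":
    print(orderly_generation(3, symmetric_group(3)))
    print(len(orderly_generation(3, identity_group(3))))
    print(len(orderly_generation(4, symmetric_group(4))))
```

For three nodes the output lists the empty graph, one bond, the path and the triangle, as in Example 5.5.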

Fig. 5.3. Backtrack tree for orderly generation of simple graphs on three nodes.

The most expensive step is the test for minimality in row (1) of Algorithm 5.7. Although the naive method of running through all the permutations in 𝑆𝑛 can be improved using algebro-combinatorial methods [94], step (1) remains expensive. In addition, one run through the minimality test yields, as a byproduct, the group of automorphisms of the tested graph in the form of a Sims chain [296].

5.8 Example (Orderly generation of graphs on three nodes) Figure 5.3 shows the backtrack tree for Algorithm 5.7 applied to graphs on 𝑛 = 3 nodes. Comparing with Figure 5.2 we see that one canonicity test is saved using the latter algorithm: graph {{0, 2}} is recognized as non-minimal, and its augmentation {{0, 2}, {1, 2}} is not considered. Of course, Algorithm 5.7 results in far higher savings for increasing 𝑛.

5.1.2 Introducing constraints

Typically one is not interested in generating all graphs, but just certain subsets, often denoted as classes of graphs. Such a class of graphs is characterized by one or more constraints or restrictions (see e.g. [47]). In mathematical terms a constraint is a mapping 𝑅 from the set of graphs on 𝑛 nodes onto the set of Boolean values {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒} that is symmetry invariant (i.e. a constraint on the set of unlabeled graphs as well), ∀ 𝜋 ∈ 𝑆𝑛 : 𝑅(𝛾) = 𝑅(𝜋𝛾). A graph 𝛾 fulfills 𝑅 if 𝑅(𝛾) = 𝑡𝑟𝑢𝑒. Otherwise 𝛾 violates the constraint.

A constraint 𝑅 is called monotonic or consistent with augmentation (addition of further bonds) if the violation of 𝑅 by a graph 𝛾 implies that every augmentation 𝛾′ of 𝛾 violates 𝑅:

\[ R(\gamma) = \mathit{false} \;\wedge\; \gamma \subseteq \gamma' \implies R(\gamma') = \mathit{false}. \]

Examples of monotonic constraints are: an upper bound for the number of bonds, a minimal ring size or graph-theoretical planarity. On the other hand, the presence or absence of a certain closed subgraph or a maximum ring size are examples of non-monotonic constraints.

Monotonic constraints can be incorporated into generating algorithms in a way that accelerates structure generation. Such restrictions can be checked after the insertion of each new bond, and help to prune the backtrack tree. Non-monotonic constraints are more problematic. Testing these constraints is only useful after a graph is completed. Completeness itself is also described by constraints. As for generating constitutional isomers, completeness is typically defined by a given sequence of valences.

5.9 Algorithm (Orderly generation with constraints (𝛾))
(1) if 𝛾 ∉ rep<(𝑆𝑛\\G2,𝑛) then return
(2) if 𝛾 violates any monotonic constraint then return
(3) if 𝛾 fulfills all monotonic constraints then Output 𝛾
(4) for each 𝛾′ ∈ 𝑃(𝛾) (in increasing order) do
(5)     Orderly generation with constraints (𝛾′)
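The effect of pruning with a monotonic constraint can be seen in a small variant of the orderly generator. The sketch below uses the same assumed graph representation as before (sorted tuple of bonds, lexicographic comparison, naive canonicity test). As an example it takes 'contains no triangle', i.e. a minimal ring size of 4, as the monotonic constraint, and connectivity as a non-monotonic constraint that is only checked at output time. Checking non-monotonic constraints at output is an assumption made here for illustration; step (3) of the algorithm as printed mentions monotonic constraints only.

```python
from itertools import combinations, permutations


def relabel(graph, perm):
    """Apply a node permutation to a graph stored as a sorted tuple of bonds."""
    return tuple(sorted(tuple(sorted((perm[i], perm[j]))) for i, j in graph))


def is_canonical(graph, group):
    """Naive canonicity test: no relabeling yields a smaller graph."""
    return all(graph <= relabel(graph, perm) for perm in group)


def has_triangle(graph):
    """True if three mutually bonded nodes exist."""
    bonds = set(graph)
    nodes = sorted({v for bond in graph for v in bond})
    return any({(a, b), (a, c), (b, c)} <= bonds for a, b, c in combinations(nodes, 3))


def is_connected(graph, n):
    """True if all n nodes are reachable from node 0."""
    reached, frontier = {0}, [0]
    while frontier:
        u = frontier.pop()
        for i, j in graph:
            for a, b in ((i, j), (j, i)):
                if a == u and b not in reached:
                    reached.add(b)
                    frontier.append(b)
    return len(reached) == n


def constrained_generation(n, monotonic, non_monotonic):
    """Orderly generation that prunes on monotonic and filters on non-monotonic constraints."""
    pairs = list(combinations(range(n), 2))
    group = list(permutations(range(n)))
    found = []

    def recurse(graph):
        if not is_canonical(graph, group):
            return
        if not all(check(graph) for check in monotonic):
            return                            # no augmentation can repair a violation
        if all(check(graph) for check in non_monotonic):
            found.append(graph)
        start = pairs.index(graph[-1]) + 1 if graph else 0
        for bond in pairs[start:]:
            recurse(graph + (bond,))

    recurse(())
    return found


if __name__ == "__main__":
    result = constrained_generation(
        4,
        monotonic=[lambda g: not has_triangle(g)],
        non_monotonic=[lambda g: is_connected(g, 4)],
    )
    print(result)
```

On four nodes this leaves the star, the path and the four-membered ring.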

5.1.3 Variations and refinements

There are several variations and refinements possible that might, depending on the type of constraints, lead to a considerable gain in speed.
– Testing completeness is typically cheaper than testing for other constraints such as presence and absence of substructures. Thus these more expensive non-monotonic constraints should be tested after completeness has been confirmed.
– Testing monotonic constraints is often cheaper than testing canonicity. Thus it can be useful to process step (2) of Algorithm 5.9 before step (1).

In general the sequence of tests is affected by two strategies:
– Process cheap tests first, i.e. tests consuming the least computation time.
– Process selective tests first, i.e. tests eliminating the most intermediates.

Tests that fulfill both criteria should be processed first, while those that fulfill neither of them should be executed last. However, one has to find a trade-off for expensive tests that are very selective and cheap tests with low selectivity.

Going back to Algorithm 5.9, step (1) is often replaced by a cheaper criterion that only tests a necessary condition for canonicity, so-called semi-canonicity. Without going into details, this criterion only checks for transpositions 𝜏 whether 𝛾 ≤ 𝜏𝛾. For a more detailed description see [94, 95]. The full canonicity test is delayed until the graph is completed. If some candidate solution then turns out not to be canonical, a learning criterion provides a necessary condition for the canonicity of the lexicographic successors. The earliest extension step is determined where non-minimality could have been detected in the generation procedure. Applying this criterion will further prune the backtrack tree. Details on this criterion can be found in [94, 95].

Orderly generation was successfully applied in the construction of complete catalogs of various discrete structures, among them were configurations of double cosets [96], regular graphs [200, 201] and molecular graphs [95]. Orderly generation turned out to be very efficient in the generation of all connectivity isomers for a given molecular formula, see Subsection 8.4.2 and Appendix D. The use of monotonic restrictions played a decisive role in these examples. Chemical structure generation tasks, however, usually involve many non-monotonic restrictions. Moreover, we have to obtain small search spaces. Thus, the isomorphism problem turns out to be a minor problem here. Quite often construction of the labeled structures is more difficult. Approaches well adapted to this kind of problem, given by J. Biegholdt in [24] and by R. Grund in [94, 95], are described in the following subsection.

5.1.4 From simple graphs to multigraphs

We have learned the principles of orderly generation of labeled and unlabeled simple graphs. Using this method we will certainly find the simple graph

[Figure: a simple graph on four nodes]

but there is still a lot to do in order to find, for example, the following two molecular graphs, which both result from the simple graph above:

[Figure: two molecular graphs whose atoms are labeled N (3,1,0,0), C (4,0,0,0), O (2,2,0,0), H (1,0,0,0) and O (2,2,0,0), C (4,0,0,0), N (3,1,0,0), H (1,0,0,0), respectively]

5.1.5 Applying the Homomorphism Principle

An interesting step towards orderly generation of unlabeled 𝑚-multigraphs is the use of the Homomorphism Principle (see Theorem 3.32) as shown in [24]. The point is that the following 𝜃𝑚, which maps 𝑚-multigraphs onto (𝑚 − 1)-multigraphs,

\[ \theta_m : m^{\binom{n}{2}} \to (m - 1)^{\binom{n}{2}}, \]

defined by

\[ \theta_m(\gamma)(\{i, j\}) = \begin{cases} \gamma(\{i, j\}) & \text{if } \gamma(\{i, j\}) \in m - 1, \\ m - 2 & \text{otherwise,} \end{cases} \]

is compatible with the action of the symmetric group, 𝜋𝜃𝑚(𝛾) = 𝜃𝑚(𝜋𝛾), so that the Homomorphism Principle can be applied. This requires two steps, starting from the set of 2-multigraphs, the set of simple graphs on 𝑛 nodes, and leads to the set of 4-multigraphs that contains, e.g. the molecular graphs on 𝑛 atoms:

4-multigraphs --𝜃4--> 3-multigraphs --𝜃3--> 2-multigraphs

Starting from a transversal, for example from the canonic transversal $\mathrm{rep}_<(S_n \backslash\backslash 2^{\binom{n}{2}})$ of the set of labeled 2-multigraphs, obtained via orderly generation, we construct a transversal of the 𝑆𝑛-orbits on the set of labeled 3-multigraphs by applying the Homomorphism Principle in the following way: For each $\gamma \in \mathrm{rep}_<(S_n \backslash\backslash 2^{\binom{n}{2}})$ we evaluate the stabilizer (𝑆𝑛)𝛾, the inverse image $\theta_3^{-1}(\gamma)$ of that graph and a transversal 𝑇𝛾 of the set of orbits $(S_n)_\gamma \backslash\backslash \theta_3^{-1}(\gamma)$ of the stabilizer on the inverse image. This shows, for example, why applying the Homomorphism Principle works: the acting group 𝑆𝑛 is replaced by its smaller subgroup (𝑆𝑛)𝛾. The desired transversal of the 3-multigraphs is now obtained by forming the union of these transversals

\[ \bigcup_{\gamma} T_\gamma \]

over all $\gamma \in \mathrm{rep}_<(S_n \backslash\backslash 2^{\binom{n}{2}})$ and evaluating the minimal elements in the respective orbits. Starting from this transversal of the set of 3-multigraphs, we can obtain a transversal of the set of 4-multigraphs by applying the Homomorphism Principle again.

5.10 Example (The unlabeled 3-multigraphs on 3 nodes) In this example we start from a transversal of the simple graphs on three nodes. The pairs of nodes are {0, 1}, {0, 2} and {1, 2}, in lexicographical order: {0, 1} < {0, 2} < {1, 2}. The graphs on three nodes are abbreviated as sequences of their values (the bond multiplicities) on these pairs in the following way:

𝛾 = (𝛾({0, 1}), 𝛾({0, 2}), 𝛾({1, 2})).

The symmetric group 𝑆3 is transitive on the three pairs of nodes, so each orbit is characterized by a content. The canonic transversal $\mathrm{rep}_<(S_n \backslash\backslash 2^{\binom{n}{2}})$ of these labeled graphs is therefore

{𝛾(0) = (0, 0, 0) < 𝛾(1) = (0, 0, 1) < 𝛾(2) = (0, 1, 1) < 𝛾(3) = (1, 1, 1)}.

The inverse images $\theta_3^{-1}(\gamma^{(i)})$ of its elements are easy to calculate since two elements 𝛾, 𝛾′ in the inverse image of (0, 1, 1) can only differ on the pairs {𝑗, 𝑘} where 𝛾(𝑖) has the value 𝑚 − 1 = 1. Thus, for example,

$\theta_3^{-1}(\gamma^{(2)}) = \theta_3^{-1}((0, 1, 1))$ = {(0, 2, 2), (0, 1, 2), (0, 2, 1), (0, 1, 1)}.

The stabilizer of 𝛾(2) is clearly (𝑆3)𝛾(2) = {1, (01)}, since exchanging the nodes 0 and 1 exchanges the pairs {0, 2} and {1, 2}, and the set of its orbits is

{{(0, 2, 2)}, {(0, 1, 2), (0, 2, 1)}, {(0, 1, 1)}}.

We obtain the following set as part of the desired transversal of the orbits of labeled 3-multigraphs: {(0, 2, 2), (0, 1, 2), (0, 1, 1)}. Similarly we find

$\theta_3^{-1}(\gamma^{(0)})$ = {(0, 0, 0)}, (𝑆3)𝛾(0) = 𝑆3,

which adds (0, 0, 0) to the transversal, while

$\theta_3^{-1}(\gamma^{(1)})$ = {(0, 0, 1), (0, 0, 2)}, (𝑆3)𝛾(1) = {1, (12)}

contributes the representatives (0, 0, 1) and (0, 0, 2). Finally we see that

$\theta_3^{-1}(\gamma^{(3)})$ = {(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1), (2, 2, 2)}

while the stabilizer is obviously (𝑆3)𝛾(3) = 𝑆3. Hence we obtain the following representatives: (1, 1, 1), (1, 1, 2), (1, 2, 2) and (2, 2, 2). The union of these sets of representatives is the desired transversal, consisting of the following ten labeled graphs:

(0, 2, 2), (0, 1, 2), (0, 1, 1), (0, 0, 0), (0, 0, 1), (0, 0, 2), (1, 1, 1), (1, 1, 2), (1, 2, 2), (2, 2, 2).

From these labeled 3-multigraphs we obtain the set of unlabeled 3-multigraphs by simply erasing the labels:

[Figure: the ten unlabeled 3-multigraphs on three nodes]
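The steps of Example 5.10 can be replayed mechanically. The sketch below is an illustration for three nodes only: graphs are triples of bond multiplicities in the order {0,1}, {0,2}, {1,2} as in the example, 𝜃3 maps every multiplicity 2 to 1, and each stabilizer orbit is represented by its lexicographically smallest element. The function names are chosen here and are not taken from the text.

```python
from itertools import permutations, product

PAIRS = [(0, 1), (0, 2), (1, 2)]
S3 = list(permutations(range(3)))


def act(perm, graph):
    """Relabel the nodes: the multiplicity of {i, j} moves to {perm[i], perm[j]}."""
    image = [0] * len(PAIRS)
    for (i, j), multiplicity in zip(PAIRS, graph):
        image[PAIRS.index(tuple(sorted((perm[i], perm[j]))))] = multiplicity
    return tuple(image)


def theta3(graph):
    """Map a 3-multigraph to a simple graph by reducing double bonds to single bonds."""
    return tuple(min(multiplicity, 1) for multiplicity in graph)


def preimage(simple_graph):
    """All 3-multigraphs that theta3 maps onto the given simple graph."""
    choices = [(0,) if multiplicity == 0 else (1, 2) for multiplicity in simple_graph]
    return list(product(*choices))


def transversal_of_3_multigraphs():
    """Orbit representatives of S3 on 3-multigraphs via the Homomorphism Principle."""
    simple_reps = sorted({min(act(p, g) for p in S3) for g in product((0, 1), repeat=3)})
    representatives = []
    for gamma in simple_reps:
        stabilizer = [p for p in S3 if act(p, gamma) == gamma]
        orbits = {frozenset(act(p, g) for p in stabilizer) for g in preimage(gamma)}
        representatives.extend(sorted(min(orbit) for orbit in orbits))
    return representatives


if __name__ == "__main__":
    print(transversal_of_3_multigraphs())
```

Only the stabilizer of each simple graph acts on its inverse image, so the full group 𝑆3 is never applied to the 27 labeled 3-multigraphs directly.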

5.1.6 Orderly generation

Molecular graphs are unlabeled multigraphs where the nodes are colored by atom states. The construction of molecular graphs in MOLGEN 3.5 follows the methods of R. Grund [95]. This yields all matrices of bond multiplicities that respect the prescribed valences. An empty matrix 𝐴 is filled successively and all possible distributions of bond multiplicities are generated.

It is convenient to use the lexicographical order on the matrix of bond multiplicities as a construction sequence, reading it row by row from top to bottom. Due to the symmetry of this matrix, it is sufficient to consider only the upper triangular matrix, which is often called connectivity stack, if it is read as one row. Maximal objects are selected as canonical orbit representatives. This definition of canonicity is backward compatible in the following sense: If it is restricted to simple graphs, a minimal simple graph as defined in Subsection 5.1.1 has the maximum connectivity stack in its orbit and vice versa.

The assignment of atom states to rows and columns of the bond matrix introduces a block structure as depicted in Figure 5.4. Each block belongs to one of the 𝑡 different atom types; 𝜆𝑟 equals the number of atoms of a state 𝑟. One advantage of this block structure is that it is no longer necessary to check all 𝑛! permutations of the full symmetric group 𝑆𝑛 during the canonicity test. Only the $\prod_{i=1}^{t} \lambda_i!$ permutations that respect the block structure have to be considered. This reduces the computational costs for canonicity testing immensely.

[Figure: the adjacency matrix 𝐴 divided into blocks 𝐴(1), 𝐴(2), . . . , 𝐴(𝑟), . . . , 𝐴(𝑡) of sizes 𝜆1, 𝜆2, . . . , 𝜆𝑟, . . . , 𝜆𝑡]

Fig. 5.4. Adjacency matrix with block structure as used in Algorithm 5.11.

Algorithm 5.11, taken from [95], shows how the structure generator behind MOLGEN 3.5 [19, 20] fills the bond matrix. The filling of the matrix blocks (steps (3) and (4)) is iterated with canonicity testing for matrix blocks (step (5)). Only permutations from the automorphism group 𝐴𝑢𝑡(𝑟−1) of blocks 1, ..., 𝑟 − 1 calculated earlier have to be taken into account for the canonicity testing of block 𝑟.

5.11 Algorithm (Orderly enumeration in MOLGEN)
(1) Start: set 𝑟 = 0 and goto (3).
(2) Stop criterion: if 𝑟 = 0 stop; else goto (4).
(3) Maximum filling: fill block 𝐴(𝑟) (depending on 𝐴(1), ..., 𝐴(𝑟−1)) in lexicographically maximal manner so that 𝐴(𝑟) fulfills the desired matrix properties (regarding atom states and consistent constraints). If no such filling exists then set 𝑟 = 𝑟 − 1 and goto (2); else goto (5).
(4) Next smallest filling: fill block 𝐴(𝑟) (depending on 𝐴(1), ..., 𝐴(𝑟−1)) in lexicographically next smallest manner so that 𝐴(𝑟) fulfills the desired matrix properties (regarding atom states and consistent constraints). If no such filling exists then set 𝑟 = 𝑟 − 1 and goto (2); else goto (5).
(5) Test canonicity: if ∀𝜋 ∈ 𝐴𝑢𝑡(𝑟−1)(𝐴) : 𝐴(𝑟) ≥ 𝐴(𝑟)𝜋, then
        if 𝑟 = 𝑡 (canonical matrix complete) then
            (a) if constraints are fulfilled then output 𝐴
            (b) goto (4)
        else determine 𝐴𝑢𝑡(𝑟)(𝐴), set 𝑟 = 𝑟 + 1 and goto (3).
    else goto (4).

This algorithm uses two subroutines, the filling of a matrix block and the canonicity test of a matrix block. The filling of a matrix block is called in two different situations: In step (3) block 𝐴(𝑟) is initially filled in maximal manner. When step (4) is called, block 𝐴(𝑟) had already been filled, and now the next smallest filling is produced. Due to their huge technical overhead, these subroutines will not be described in detail here. The reader is referred to the original publication [95]. However, this book comprises the principles of these subroutines. Canonical labeling is described in Section 5.5. Filling a matrix block is done in lexicographically descending order, which is similar to constructing labeled graphs as introduced at the beginning of this subsection. Here is a small example, taken from [95]:

5.12 Example (The isomers of C2H2O) The prescribed vector of valences is (4, 4, 2, 1, 1) and the empty matrix that we need to fill,

    . . . . .
    . . . . .
    . . . . .
    . . . . .
    . . . . .

consists of three different blocks, as 𝜆1 = 𝜆3 = 2 and 𝜆2 = 1 (unlike the MOLGEN generators we assume here that H atoms are treated explicitly). Row and column sums are 4, 4, 2, 1 and 1, according to the prescribed sequence of valences. There are ten possible fillings and the following eight fillings of 𝐴 describe connected labeled 4-multigraphs:

    0 3 1 0 0
    3 0 0 1 0
    1 0 0 0 1
    0 1 0 0 0
    0 0 1 0 0
describing the multigraph 4 – 2 – 0 ≡ 1 – 3

    0 3 1 0 0
    3 0 0 0 1
    1 0 0 1 0
    0 0 1 0 0
    0 1 0 0 0
describing the multigraph 3 – 2 – 0 ≡ 1 – 4

    0 3 0 1 0
    3 0 1 0 0
    0 1 0 0 1
    1 0 0 0 0
    0 0 1 0 0
describing the multigraph 4 – 2 – 1 ≡ 0 – 3

    0 3 0 0 1
    3 0 1 0 0
    0 1 0 1 0
    0 0 1 0 0
    1 0 0 0 0
describing the multigraph 3 – 2 – 1 ≡ 0 – 4

    0 2 2 0 0
    2 0 0 1 1
    2 0 0 0 0
    0 1 0 0 0
    0 1 0 0 0
describing the multigraph 2 = 0 = 1, with nodes 3 and 4 both attached to node 1

    0 2 0 1 1
    2 0 2 0 0
    0 2 0 0 0
    1 0 0 0 0
    1 0 0 0 0
describing the multigraph 2 = 1 = 0, with nodes 3 and 4 both attached to node 0

    0 2 1 1 0
    2 0 1 0 1
    1 1 0 0 0
    1 0 0 0 0
    0 1 0 0 0
describing the multigraph with the three-membered ring 0 = 1 – 2 – 0 and the bonds 3 – 0 and 4 – 1

    0 2 1 0 1
    2 0 1 1 0
    1 1 0 0 0
    0 1 0 0 0
    1 0 0 0 0
describing the multigraph with the three-membered ring 0 = 1 – 2 – 0 and the bonds 4 – 0 and 3 – 1

Two steps remain:
– The labels 0 and 1 of the nodes have to be replaced by C, while label 2 has to be replaced by O and labels 3 and 4 by H.
– It is less trivial to pick the canonical matrices as representatives of the orbits. The corresponding canonicity test is very important for the reduction of complexity, i.e. for the efficiency of the program. For further details we refer the reader to [95].
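The example can be retraced with a brute-force sketch. This is not the block-wise procedure of Algorithm 5.11 and not Grund's implementation: it fills the whole upper triangle by backtracking and tests canonicity by comparing the connectivity stack with that of every block-respecting relabeling. All function names are invented for this illustration, and the multiplicity cap of 4 is an assumption chosen so that the ten fillings quoted in the example are reproduced.

```python
from itertools import permutations, product


def fillings(valences, max_mult):
    """All symmetric zero-diagonal matrices with the given row sums,
    in lexicographically descending order of the upper triangle."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    free = list(valences)            # valence still unused at each node
    found = []

    def fill(k):
        if k == len(pairs):
            if not any(free):        # every valence must be saturated
                found.append(tuple(tuple(row) for row in matrix))
            return
        i, j = pairs[k]
        for m in range(min(max_mult, free[i], free[j]), -1, -1):
            matrix[i][j] = matrix[j][i] = m
            free[i] -= m
            free[j] -= m
            fill(k + 1)
            free[i] += m
            free[j] += m
        matrix[i][j] = matrix[j][i] = 0

    fill(0)
    return found


def is_connected(matrix):
    seen, todo = {0}, [0]
    while todo:
        i = todo.pop()
        for j, m in enumerate(matrix[i]):
            if m and j not in seen:
                seen.add(j)
                todo.append(j)
    return len(seen) == len(matrix)


def stack(matrix):
    """Upper triangle read row by row (the connectivity stack)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def block_permutations(block_sizes):
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(range(start, start + size))
        start += size
    for choice in product(*(permutations(b) for b in blocks)):
        yield tuple(node for part in choice for node in part)


def is_canonical(matrix, block_sizes):
    """True if no block-respecting relabeling has a larger connectivity stack."""
    n = len(matrix)
    own = stack(matrix)
    for p in block_permutations(block_sizes):
        image = [[matrix[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if stack(image) > own:
            return False
    return True


all_fillings = fillings((4, 4, 2, 1, 1), max_mult=4)
connected = [a for a in all_fillings if is_connected(a)]
canonical = [a for a in connected if is_canonical(a, (2, 1, 2))]
print(len(all_fillings), len(connected), len(canonical))
```

Under these assumptions the sketch prints 10 8 3: ten fillings, eight of them connected, and three canonical representatives among the connected ones, one for each of the three shapes listed in the example.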

5.1.7 Beyond orderly generation

Of course, other principles can be combined with orderly generation. For instance, MOLGEN 3.5 allows the definition of macroatoms. These are substructures that are treated as a special atom type during orderly generation and are expanded whenever a canonical matrix is complete. Double coset representatives are used to avoid isomorphic duplicates. This principle is already known from the construction of permutational isomers and from the treatment of superatoms during tree generation in the DENDRAL (short for DENDRitic ALgorithm) generator. In mathematics, this method of joining partial structures without producing isomorphic duplicates is known as the gluing lemma [147, 174].

The references [147, 174] also describe the Homomorphism Principle, introduced above. A homomorphism is a simplification of a structure, which maps isomorphic objects onto isomorphic simplified ones. The simplification from molecular graphs to multigraphs by removing element symbols, or from multigraphs to simple graphs by forgetting bond multiplicities, are examples of homomorphisms. Indeed, the DENDRAL strategy already relied on these simplification steps, without deducing the general principle. In [101], the approach of using homomorphisms to simplify generation was taken to the extreme by constructing graphs with a prescribed degree sequence from regular graphs as the simplest graphs. It turned out that for huge numbers of nodes 𝑛 such a generator is much faster than orderly generation alone. However, the generator that was accelerated by homomorphisms was not able to keep up with ordinary orderly generation for small 𝑛 that still allowed the generation of full lists of graphs.

Another variation of orderly generation is also worth mentioning: McKay’s enumeration by canonical construction path [192] restricts extensions to those structures where the new edges are taken from a certain orbit of the automorphism group.

Speed plays an important role in structure enumeration, but only a few theoretical results about the computational complexity are known. Goldberg’s work [91] proves that the results in orderly enumeration can be computed with polynomial delay, and a paper of Luks [188] shows that isomorphism testing of molecular graphs can be done in polynomial time.

Another approach, named constrained generation [175], uses the fact that isomer generators in structure elucidation typically yield small numbers of solutions, see Section 5.2. For this reason, the ability to generate labeled structures that fulfill long lists of constraints becomes more important than efficient isomorphism avoidance. This generator has no fixed sequence of filling the bond matrix. Instead, a heuristic method has to decide which alternative makes best use of the actual constraints. The only guarantee necessary is that each isomorphism type is constructed at least once. The canonical representations are then stored in a hash table. If a representation is new, it is added to the table and written to the output; otherwise it is a duplicate and will not be used further. Although this ignores the advantages of sophisticated methods such as orderly generation, the Gluing Lemma and the Homomorphism Principle and may look like a step backwards, this approach, implemented in MOLGEN 4.0 [148], currently appears to be the best suited solution for this particular application in structure elucidation. This approach is used in chemical and pharmaceutical companies (where results typically are not disclosed to the public domain), as well as in public research institutions (see for example [285]).

Of course, not all generation algorithms and implementations can be discussed in detail here. At least the most popular ones, such as CHEMICS [80], ASSEMBLE [7, 170], as well as [68, 211, 212], are worth being cited. Volume 27 of the journal MATCH is completely devoted to molecular structure generation. Faulon’s review [74] also contains a large section about this topic, as does [204].

5.2 Constrained generation and fuzzy formulas

In contrast to orderly generation, constrained generation does not use a fixed order of generation; rather, it is controlled by the constraints. The strategy for adding bonds is chosen to reduce the effort needed for backtracking. The heuristics required were described in [96, 97]. Since orderly generation has been abandoned, we have to keep all the generated structures in an associative memory, for example using hash tables. A description of the canonizer can be found in [33, 175]; details are given in Section 5.5. A brief description of the software package MOLGEN 4.0, which uses both orderly and constrained generation, can be found in [148].

Omitting the details of the heuristics applied in T. Grüner’s constrained generation, we briefly describe the input format for the latest implementation of a structure generator using the principle of constrained generation, MOLGEN 4.1. The generation process is oriented towards an efficient use of the given constraints and thus has no fixed sequence of steps.

As the generation is formula-based, the molecular formula is required. A generalization also allows input of a fuzzy molecular formula, i.e. with intervals for multiplicities:

5.13 Definition (Fuzzy molecular formula) Let E denote a set of chemical elements and I(ℕ) = {[𝑎, 𝑏] | 𝑎, 𝑏 ∈ ℕ, 𝑎 ≤ 𝑏} ∪ {[𝑎, ∞[ | 𝑎 ∈ ℕ} the set of intervals of natural numbers. A fuzzy molecular formula is a mapping $B \in I(\mathbb{N})^{E}$. The set of all molecular formulas compatible with 𝐵 is

$$\mathcal{B}_B = \{\beta \in \mathbb{N}^{E} \mid \text{for all } X \in E : \beta(X) \in B(X)\},$$

which means that 𝛽(𝑋), the occurrence number of element 𝑋, is contained in the interval 𝐵(𝑋) prescribed by 𝐵. We call 𝐵 finite if $\mathcal{B}_B$ is finite.

In order to keep the input as flexible as possible, we allow the input of several fuzzy formulas, each of them accompanied by its own restrictions.

5.2.1 Restrictions for a molecular formula

The following types of restrictions are possible at present:
– Restricting the number of atoms: We can enter an interval of natural numbers that restricts the number of atoms occurring in the molecule.
– Restricting the number of heteroatoms: The user can enter an interval that restricts the total number of heteroatoms.

Even if these restrictions are not finite (i.e. an interval of the form [𝑎, ∞[ occurs, consisting of the natural numbers 𝑖 ≥ 𝑎), their combination may lead to a finite set of candidates. In MOLGEN–MS an additional restriction is implemented.
– Restricting the molecular mass: The user may enter an interval for the sum of atomic masses in a molecule, either integers or exact atomic masses. Details are given in Sections 8.4 and 8.7.
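How an unbounded fuzzy formula becomes finite once the number of atoms is restricted can be illustrated by a short sketch. It is not the MOLGEN input format: the encoding of intervals as pairs, with `None` standing for an open upper end, and the function name `compatible_formulas` are assumptions made for this example.

```python
def compatible_formulas(fuzzy, atom_range):
    """Yield every molecular formula compatible with `fuzzy` whose total
    number of atoms lies in the finite interval `atom_range`.

    fuzzy maps an element symbol to (low, high); high=None encodes [low, inf[.
    """
    elements = sorted(fuzzy)
    min_atoms, max_atoms = atom_range

    def extend(index, counts, used):
        if index == len(elements):
            if used >= min_atoms:
                yield dict(zip(elements, counts))
            return
        low, high = fuzzy[elements[index]]
        room = max_atoms - used              # atoms that may still be added
        top = room if high is None else min(high, room)
        for count in range(low, top + 1):
            yield from extend(index + 1, counts + [count], used + count)

    yield from extend(0, [], 0)


# Illustrative input: 1-2 C atoms, any number of H atoms, at most one O atom,
# restricted to molecules with three or four atoms in total.
fuzzy = {"C": (1, 2), "H": (0, None), "O": (0, 1)}
formulas = list(compatible_formulas(fuzzy, (3, 4)))
for formula in formulas:
    print(formula)
```

Although the interval for H is unbounded, the restriction on the number of atoms leaves only eight candidate formulas in this example.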

These restrictions influence the presence of elements in the molecular formula, while the input of MOLGEN 4.1 offers further formula restrictions that influence the distribution of atom states.
– Relaxing connectedness: Disconnected molecular graphs may or may not be allowed. The implementation uses the fact that the degree partition of a simple graph is the degree partition of a connected simple graph if and only if |𝐵(𝛾)| ≥ 𝑛 − 1 (see the items following Remark 2.35).
– Restricting bonds: Prescribe an interval containing the sum of bond multiplicities.
– Restricting the double bond equivalent, DBE: Prescribe an interval for DBE, based on the prescribed admissible atom states.
– Restricting the charge: Restrict the charge by giving an interval of natural numbers.
– Restricting radicals: Restrict the number of atoms bearing an unpaired electron with an interval of natural numbers.
– Restricting atom states: Restrict the number of occurrences of atom states by an interval.

We should also mention another restriction that has a strong structural influence:
– Restricting hydrogen distributions: Prescribe the number of atoms of given elements bearing a given number of H atoms.

5.2.2 Structural restrictions

MOLGEN 4.1 offers four types of structural restrictions: a filter for aromatic duplicates, a symmetry filter, and the restriction types macro and substructure. It should be noted that there is a serious problem with aromaticity: The graph model of molecules seems to need an extension to hypergraphs. This, however, increases the complexity of generation enormously.
– Structural restriction aromaticity: Aromatic bonds are identified and aromatic duplicates suppressed.
– Structural restriction symmetry: The number of carbon signals in a ¹³C NMR spectrum can be prescribed, assuming sufficiently high spectrometer resolution. In this case the generator outputs only the structures with the prescribed number of classes of symmetry equivalent carbon atoms, based on topological symmetry.
– Structural restriction macro: Macros are nonoverlapping molecular subgraphs prescribed for the generated structures and represent given molecular substructures. For example, we may prescribe a benzene ring. In order to simplify the generation, it appears first as a single node, a ‘superatom’, in the generated graphs and is then extended to the full substructure before output.
– Structural restriction substructure: This restriction consists of one or more substructure entries, together with an interval that fixes the occurrence number of each entry. If we want to forbid a substructure, we prescribe its occurrence number to be within the interval [0, 0].

The substructure itself is either a molecular substructure, a subunit or a ring. Similar to molecular substructures, ring substructures and subunits can be equipped with substructure restrictions as introduced in Definition 2.14.
– Molecular substructures are discussed in Section 2.2.
– A subunit is specified by a molecular formula. A molecular graph 𝑀 contains such a subunit with a given occurrence number if this number of connected molecular subgraphs with the specified formula is contained in 𝑀.
– A ring substructure is prescribed by an interval of allowed ring lengths. 𝑀 contains the ring substructure with a given occurrence number if this number of rings with admissible length is contained in 𝑀.

5.3 Reaction-based structure generation

In Section 2.3, chemical reactions were represented by graphs and simulated virtually using reaction schemes. This model can also be used in structure generation. A special case, where substitutions are applied to a molecular skeleton, was discussed in Subsection 3.2.4, but the case of an achiral skeleton and possibly chiral substituents was postponed. Here, we start with this case, giving an algebraic method for the construction of such permutational isomers.

5.3.1 Libraries of permutational isomers

A permutational isomer arises from a molecular skeleton by reactions with substituents at positions of the skeleton that are open for substitution. There is a very efficient procedure that reduces the generation of such isomers to the generation of double cosets [99, 337]. It was described in Section 1.2, but not yet for the general case of an achiral skeleton and possibly chiral substituents. We shall generalize it now to the arbitrary case, assuming again that the set 𝑌 of substituents contains a chiral substituent 𝑦 as well as the mirror image 𝑦′. Some results with this case, from W. Hässelbarth, were mentioned at the end of Subsection 3.2.4. The procedure is demonstrated here using the example considered in Chapter 3 with the aim of counting the permutational isomers.

5.14 Example (Libraries of amidations of a xanthene) Recall the skeleton used in Chapter 3, xanthenetetracarboxylic acid chloride,

[Structure: xanthene skeleton bearing four acid chloride groups C(=O)Cl.]

that we assumed to be planar. The combinatorial libraries consist of special amides, where the 4 substituents are the proteinogenic amino acids. The reactions can be described as follows: The acid chloride functional group (C(=O)Cl) reacts with an amino (NH2) group, producing an amide.

[Reaction scheme: Z–C(=O)Cl + H2N–CH(R)–C(=O)OH → Z–C(=O)–NH–CH(R)–C(=O)OH + HCl.]

In one exception, for proline, the reacting group is an imino (NH) group. To also include this reaction, only one hydrogen atom is included in the reaction scheme:

[Reaction scheme: the acid chloride substructure C(=O)Cl and the substructure H–N–C(H)–C(=O)OH, with the bonds to be formed and to be broken marked.]

Bonds to be formed in the reaction are marked with small dots ‘∙’, bonds to be broken by crosses ‘×’. In a synthesis reaction the reaction substructure is made of two or more components. In our example these are

[Substructures: the acid chloride group C(=O)Cl and the 𝛼-imino acid group HN–CH–C(=O)OH.]

Initially, the reactants are searched for these substructures. The acid chloride substructure is found four times in the skeleton of 𝑀; these are the ‘positions open for substitution’. The 𝛼-imino acid substructure is found in each amino acid exactly once. Taking into account the numbering (labeling) of the positions open for substitution in the skeleton of 𝑀, there are $|Y|^4$ possibilities to assign |𝑌| different amino acids to these four positions. Some of these possibilities may be equivalent due to the point group P of 𝑀, which means that the numbering in 𝑀 should be neglected. We discussed this several times in the preceding chapters and sections. Moreover, we described a constructive method using double cosets. Applied to the present problem, it reads as follows.

𝑋 = {0, 1, 2, 3} = 4 was the set of (labels of) the positions open for substitution, the set of numbers of the functional groups (C(=O)Cl), while 𝑌 indicated a set of amino acids. We assumed that $Y = Y_{ac} \cup Y_{ch}$, where $Y_{ac}$, the set of achiral substituents, is either empty or consists of glycine only, while $Y_{ch}$, the set of chiral substituents, was supposed to consist of pairs {𝑦, 𝑦′} of enantiomorphic amino acids. We used a simplified version of the skeleton, assumed a barycentric placement in space, emphasized the positions open for substitution with dots and introduced Cartesian coordinates:

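[Figure: simplified skeleton with the four positions open for substitution marked by dots, drawn together with the coordinate axes e0, e1 and e2.]

The point group P was shown to be the set of four linear isometries 𝜌𝑖, 𝑖 ∈ 4, linear mappings of 3D space, represented by the following matrices (with respect to the chosen basis):

$$P = \left\{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right\},$$

while the subgroup of proper rotations consists of the two isometries with determinant 1:

$$R = \left\{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right\}.$$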

Fig. 5.5. Structures of the 20 proteinogenic amino acids. [Figure: structural formulas of alanine C3H7NO2, arginine C6H14N4O2, asparagine C4H8N2O3, aspartic acid C4H7NO4, cysteine C3H7NO2S, glutamine C5H10N2O3, glutamic acid C5H9NO4, glycine C2H5NO2, histidine C6H9N3O2, isoleucine C6H13NO2, leucine C6H13NO2, lysine C6H14N2O2, methionine C5H11NO2S, phenylalanine C9H11NO2, proline C5H9NO2, serine C3H7NO3, threonine C4H9NO3, tryptophan C11H12N2O2, tyrosine C9H11NO3 and valine C5H11NO2.]

After numbering the positions open for substitution in the following way

[Figure: simplified skeleton with the positions open for substitution numbered 1 and 2 (upper row) and 0 and 3 (lower row).]

we obtained the permutations of these positions induced by the symmetry operations 𝜌: $\bar\rho_0 = (0)(1)(2)(3)$, $\bar\rho_1 = (0)(1)(2)(3)$, $\bar\rho_2 = (03)(12)$, $\bar\rho_3 = (03)(12)$. The crucial point was that the point group acts on both the skeleton and the set of substituents, such that consideration of the induced permutation does not suffice. For example, $\bar\rho_1$ is the identity permutation induced by the reflection in the plane of the molecule. It is induced by an improper rotation, which we cannot see by inspecting $\bar\rho_1$; we have to consider the isometry 𝜌1 instead. An improper rotation maps each of the substituents 𝑦 onto its mirror image 𝑦′. The corresponding permutation of the substituents was denoted by $\hat\rho$:

$$\hat\rho\, y = \begin{cases} y' & \text{if } \rho \in P \setminus R \text{ and } y \in Y_{ch}, \\ y & \text{otherwise.} \end{cases}$$

The size of the combinatorial library consisting of all the amidations was expressed in terms of the cycle structures of the permutations induced by the elements of the point group in the following form:

$$\frac{1}{|P|} \sum_{\rho \in R} |Y|^{c(\bar\rho)} + \frac{1}{|P|} \sum_{\rho \in P \setminus R} |Y_{ac}|^{c_o(\bar\rho)} \cdot |Y|^{c_e(\bar\rho)}.$$

The smallest cases of amidations, allowing glycine as the achiral substituent and further pairs of enantiomorphic amino acids, turned out to be combinatorial libraries of size 25 if 3 different substituents are admitted, 169 if we allow 5, and 579,121 if we allow each of the 20 proteinogenic amino acids together with their enantiomorphs, i.e. if |𝑌| = 39 is the number of admitted substituents. The algebraic approach to counting the corresponding numbers of permutational isomers arising from amidations of xanthenetetracarboxylic acid was the following one, which we can also use in order to describe a method for the construction of the library. This is a quite general approach to counting and constructing symmetry classes of mappings when groups act on the domain and the range of the mappings considered:
– Like in Pólya’s approach we assume finite sets 𝑋 and 𝑌. Moreover, we suppose that a finite group 𝐺 acts on 𝑋, i.e. we are given an action $_G X$. In addition we consider another finite group 𝐻 together with an action $_H Y$, obtaining an action of the direct product of 𝐻 and 𝐺 on the set of mappings $Y^X$. Following de Bruijn’s notation, we call this group the power group and denote it by $H^G$. The action that we have in mind is

$$H^G \times Y^X \to Y^X : ((h, g), f) \mapsto h f g^{-1}, \quad \text{where } (h f g^{-1})(x) = h f(g^{-1} x) \text{ for } x \in X.$$
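The library sizes quoted above can be checked by a small sketch that counts the orbits in two ways: by brute force, applying every element of the point group to every distribution, and by the counting formula. The encoding is an assumption made for this illustration: each group element is stored as its induced permutation of the four positions together with a flag for improper rotations, and the substituent labels are invented.

```python
from itertools import product

SITES = range(4)
# (induced permutation of the positions, True if the isometry is improper)
POINT_GROUP = [
    ((0, 1, 2, 3), False),   # identity
    ((0, 1, 2, 3), True),    # reflection in the plane of the molecule
    ((3, 2, 1, 0), True),    # second reflection, permutation (03)(12)
    ((3, 2, 1, 0), False),   # proper rotation, permutation (03)(12)
]


def substituents(pairs):
    """Glycine 'G' plus `pairs` enantiomorphic pairs; returns the mirror map."""
    mirror = {"G": "G"}
    for i in range(pairs):
        mirror[f"y{i}"] = f"y{i}'"
        mirror[f"y{i}'"] = f"y{i}"
    return mirror


def act(element, f, mirror):
    """Image of the distribution f under one element of the point group."""
    perm, improper = element
    inverse = [perm.index(x) for x in SITES]
    moved = tuple(f[inverse[x]] for x in SITES)
    return tuple(mirror[y] for y in moved) if improper else moved


def count_orbits(mirror):
    """Brute force: one representative (the minimum) per orbit."""
    representatives = set()
    for f in product(sorted(mirror), repeat=len(SITES)):
        representatives.add(min(act(g, f, mirror) for g in POINT_GROUP))
    return len(representatives)


def cycle_lengths(perm):
    seen, lengths = set(), []
    for start in range(len(perm)):
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        if length:
            lengths.append(length)
    return lengths


def count_by_formula(n_all, n_achiral):
    total = 0
    for perm, improper in POINT_GROUP:
        cycles = cycle_lengths(perm)
        if improper:
            odd = sum(1 for c in cycles if c % 2)
            total += n_achiral ** odd * n_all ** (len(cycles) - odd)
        else:
            total += n_all ** len(cycles)
    return total // len(POINT_GROUP)


print(count_orbits(substituents(1)), count_by_formula(3, 1))
print(count_orbits(substituents(2)), count_by_formula(5, 1))
print(count_by_formula(39, 1))
```

Both methods give 25 for three substituents and 169 for five, and the formula yields 579121 for |𝑌| = 39, in agreement with the numbers in the text.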

An important case is the power group

$$H^G = P^P = \{(\rho, \sigma) \mid \rho, \sigma \in P\},$$

arising from the action of a point group on a molecular skeleton and on a set of admissible substituents. Its elements are the pairs of symmetry operations, i.e. of mappings of the 3D space. Pólya’s approach for the enumeration of permutational isomers with achiral substituents uses the power group $H^G = E^G$, where 𝐸 = {1} is the group consisting of the identity operation only. The direct product $H^G = P^P$ of the point group with itself contains the group

$$\Delta(P \times P) = \{(\rho, \rho) \mid \rho \in P\},$$

that we already used for counting the amidations. But this time we are after a transversal of the orbits on the set of distributions, since each transversal of the set of orbits

$$\Delta(P \times P) \backslash\!\backslash Y^X$$

yields the desired set of different permutational isomers, the amidation products.

This allows the application of the double coset method. The Fundamental Theorem suggests breaking this problem into pieces as follows: Decompose the set of orbits of Δ(P × P) into disjoint subsets of orbits, such that the union of each subset is one orbit of a suitable bigger group $H^G$ containing the permutation group induced by Δ(P × P). Then we have, for each distribution $\delta \in Y^X$, according to the Fundamental Theorem, the bijection

$$\Delta(P \times P) \backslash\!\backslash H^G(\delta) \to \Delta(P \times P) \backslash H^G / (H^G)_\delta \quad \text{defined by} \quad \Delta(P \times P)(\pi\delta) \mapsto \Delta(P \times P)\, \pi\, (H^G)_\delta.$$

This implies that a transversal of the set of double cosets yields a transversal of $\Delta(P \times P) \backslash\!\backslash H^G(\delta)$. In this way we decomposed the problem of evaluating a transversal of $\Delta(P \times P) \backslash\!\backslash Y^X$ in one step into usually much smaller steps considering the orbits in the subsets $H^G(\delta)$, for suitable distributions 𝛿. As in the construction of the multigraphs and isomers of Seveso dioxin, we use the decomposition of the distributions 𝛿 into distributions with the same content

$$c = con(\delta) = (\ldots, |\delta^{-1}(y)|, \ldots).$$

The set of distributions of content 𝑐 will be denoted by $Y^X_c$ again, and it is clear that $Y^X_c = S_X(\delta)$ if 𝑐𝑜𝑛(𝛿) = 𝑐.

But we cannot set $H^G = E^{S_X}$, since $H^G$ has to contain Δ(P × P), i.e. we have to include the transposition of enantiomorphs

$$\tau : Y \to Y : y \mapsto \begin{cases} y & \text{if } y \in Y_{ac}, \\ y' & \text{if } y \in Y_{ch}. \end{cases}$$

For this reason we take the direct product of the symmetric group $S_X$ and the group ⟨𝜏⟩ for the power group $H^G$:

$$H^G = \langle\tau\rangle^{S_X}.$$

The Fundamental Theorem gives a bijection between the orbits of Δ(P × P) on the orbit

$$\langle\tau\rangle^{S_X}(\delta) \tag{5.1}$$

and the set of double cosets $\Delta(P \times P) \backslash \langle\tau\rangle^{S_X} / (\langle\tau\rangle^{S_X})_\delta$, obtaining that

$$\Delta(P \times P) \backslash\!\backslash \langle\tau\rangle^{S_X}(\delta) \quad \text{is bijective to} \quad \Delta(P \times P) \backslash \langle\tau\rangle^{S_X} / (\langle\tau\rangle^{S_X})_\delta. \tag{5.2}$$

It remains to check which distributions are contained in (5.1). As $S_X$ keeps the content while 𝜏 maps a substituent $y \in Y_{ch}$ onto its enantiomorph $y' \in Y_{ch}$, we introduce the content 𝑐′ ‘enantiomorph’ to 𝑐. If 𝑐 = (. . . , 𝑐(𝑦), . . .), we define 𝑐′ = (. . . , 𝑐′(𝑦), . . .) by putting 𝑐′(𝑦) = 𝑐(𝑦′) and find that

$$\langle\tau\rangle^{S_X}(\delta) = Y^X_c \cup Y^X_{c'},$$

if 𝛿 is of content 𝑐. Abbreviating this by

$$Y^X_{c,c'} = Y^X_c \cup Y^X_{c'},$$

we see that the problem of constructing the desired transversal of the orbits of Δ(P × P) on the set of distributions is reduced to the evaluation of transversals of the orbits on the usually much smaller sets

$$\Delta(P \times P) \backslash\!\backslash Y^X_{c,c'}.$$

5.3.2 Attaching substituents to a central molecule

The next example is similar to the construction of a library of permutational isomers but it is more general. Instead of a molecular skeleton, which is a fixed arrangement of points in 3D space, together with its point group, we now consider a central molecule that is not assumed to be rigid, so that the point group P of the skeleton has to be replaced by the automorphism group Aut(𝑀) of the molecular graph 𝑀 of the central molecule. Moreover, there is no prescribed set of points open to substitution, so that we need to consider all the possible embeddings of reactive substructures. This is a typical situation in combinatorial chemistry, and it is described in detail in [337].

Assume that we are given a central molecule 𝑀 and substituents 𝑀𝑖, 𝑖 ∈ 𝑎, together with a reaction scheme 𝑅 = (𝑆, Δ𝜁, Δ𝛾). Recall what this means:
– 𝑆 is a molecular substructure, see Definition 2.11,
– Δ𝜁 means a change-of-states-graph, while Δ𝛾 is a change-of-bonds-graph, see Definition 2.17.

𝑅 is supposed to be a two component synthesis, i.e. the AMG, an ambiguous molecular subgraph, a triple

$$AMG = (E, Z, \Gamma) \in P^\star(E)^{n} \times P^\star(Z_E)^{n} \times P(3)^{\binom{n}{2}} = AMG_n$$

(Definition 2.12) underlying 𝑆 has two connectivity components 𝐴 and 𝐵. Assume that there are 𝑘 different nonoverlapping embeddings 𝜙𝑗, 𝑗 ∈ 𝑘, of 𝐴 into 𝑀. The atoms in 𝑀 that are images under the 𝜙𝑗 form 𝑘 reactive sites in the central molecule, open for reaction with a substituent. Moreover, we suppose that there is exactly one embedding of 𝐵 into each 𝑀𝑖. In the corresponding reaction, the substituents react with the reactive sites of the central molecule in different ways. For 𝑘 = 4 we sketch this situation in the following way:

[Figure: the central molecule 𝑀 with the four substituents $M_{i_0}$, $M_{i_1}$, $M_{i_2}$ and $M_{i_3}$ attached to it.]

This is very similar to the enumeration of permutational isomers: The different products obtained from reactions between reactive sites and the 𝑀𝑖 are orbits of mappings under the symmetry group Aut(𝑀) of 𝑀. This group acts on the reactive sites, inducing a subgroup $G \le S_k$. The different products form the orbits in $G \backslash\!\backslash a^k$. Examples of this construction problem will follow in Subsection 5.3.6. First, we describe another principle.

5.3.3 Generation using the network principle

In the following we describe how successive applications of reaction schemes can be used to generate molecular libraries. Libraries are of particular importance in structure elucidation and combinatorial chemistry.

Most chemical processes can be represented by chemical reaction networks. Such a network is a bipartite, directed graph. This means that the set of nodes is partitioned into two disjoint sets, a set of chemical compounds (the reactants and the products) and a set of reactions. Nodes representing chemical compounds are labeled by molecular graphs, others by reaction schemes. The edges are either directed from a reactant to a reaction, or from a reaction towards a product. A reaction scheme may occur several times as the label of a node, but a molecular graph occurs only once as the label of a node.

We do not construct total reaction networks; we shall only use them as a basic concept for the generation of corresponding molecular graphs. We run through the network, starting from the basic compound, using a breadth-first strategy.

We have to generalize several notions from Section 2.3. We had introduced the product graphs obtained from a molecular graph 𝑀 ∈ M by applying a reaction scheme 𝑅 = (𝑆, Δ𝜁, Δ𝛾):

$$\mathrm{Prod}_R(M) = \{R \circ_\phi M \in \mathcal{M}_n \mid \phi \in \mathrm{Emb}_{\subseteq_i}(S, M)\}.$$

We extend this definition to sets of molecular graphs and sets of reaction schemes. We start from a set $\mathcal{L} = \{M_i \mid i \in l\} \subseteq \mathcal{M}_c$ of connected molecular graphs. In order to determine the set of products that can be obtained by applying 𝑅 to L, we have to examine the reaction type of 𝑅 first. This can be expressed in terms of the connected components of the AMG underlying $S = (AMG, \{SR_i \mid i \in h\})$:

$$\mathrm{Conn}(R) = \mathrm{Conn}(S) = \mathrm{Conn}(AMG).$$

Up to |Conn(𝑅)| reactants may participate in the reaction represented by 𝑅. In order to take this into account we have to construct, before applying 𝑅, corresponding sums of combinations with repetition of reactants contained in L. Combinations with repetition of 𝑛 elements in an 𝑚-element set are bijective to weakly increasing mappings contained in the set

$$m^n_\le = \{f \in m^n \mid \forall\, i \in n - 1 : f(i) \le f(i + 1)\}.$$

Using this notation we can define the product graphs obtained by applying 𝑅 to L as

$$\mathrm{Prod}_R(\mathcal{L}) = \bigcup_{k \in |\mathrm{Conn}(R)|,\ f \in l^k_\le} \mathrm{Prod}_R\Bigl(\bigoplus_{i \in k} M_{f(i)}\Bigr),$$

and for a set R consisting of reaction schemes

$$\mathrm{Prod}_{\mathcal{R}}(\mathcal{L}) = \bigcup_{R \in \mathcal{R}} \mathrm{Prod}_R(\mathcal{L}).$$

Finally we have to decompose the product graphs into connected components and to eliminate the isomorphic duplicates. For this reason we introduce the set of connected labeled graphs for an arbitrary set L of molecular graphs:

$$\mathrm{Conn}(\mathcal{L}) = \bigcup_{M \in \mathcal{L}} \mathrm{Conn}(M).$$

Moreover, we assume a transversal 𝑇 of this set L of labeled connected molecular graphs, together with a function

$$\kappa : \mathcal{L} \to T : M \mapsto \bar{M} \in T$$

that associates with 𝑀 ∈ L the representative $\bar{M}$ of its orbit contained in the given transversal 𝑇. This yields the set of molecular graphs

$$\kappa(\mathcal{L}) = \{\kappa(M) = \bar{M} \mid M \in \mathcal{L}\}.$$

$\bar{M}$ is the canonical form of 𝑀. Canonization was mentioned in Subsection 5.1.1 already and will be discussed in more detail in Section 5.5. This completes the set of tools needed for the generation of a molecular library using the network principle. The construction of the molecular library resulting from the reactants in L and the set of reaction schemes R is described in the following algorithm:

5.15 Algorithm (MolLib(L, R))
(1) L0 ← 𝜅(L), 𝑘 ← 0
(2) while L𝑘 ≠ ∅ do
(3)   𝑘 ← 𝑘 + 1
(4)   L𝑘 ← 𝜅(Conn(ProdR(⋃𝑖∈𝑘 L𝑖))) \ ⋃𝑖∈𝑘 L𝑖
(5)   𝑂𝑢𝑡𝑝𝑢𝑡(L𝑘)
(6) end

In row (1) the reactants are brought into a canonical form, duplicates are eliminated and the canonically labeled structures are assigned to L0. Row (4) is central to Algorithm 5.15. It yields the new structures L𝑘 from the partial libraries L𝑖, 𝑖 ∈ 𝑘, generated earlier. The process of generation is finished as soon as no further structures are generated. This is checked by row (2). In the following we shall modify the algorithm so that it can be applied to our particular problems.
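The control flow of Algorithm 5.15 can be sketched independently of any chemistry. In the following toy version, which is only an illustration and not the MOLGEN implementation, ‘molecules’ are strings, the canonical form is the sorted string, and the single two-component ‘reaction scheme’ joins two strings as long as the result has at most three letters. The canonizer, the decomposition into components and the reaction schemes are passed in as functions; all of these names are assumptions made for this example.

```python
from itertools import combinations_with_replacement


def mol_lib(reactants, schemes, canon, components):
    """Generate the library layer by layer until no new structure appears.

    schemes is a list of pairs (arity, react): `arity` is the maximal number
    of reactants and `react` maps a tuple of reactants to a set of products.
    """
    layers = [{canon(m) for m in reactants}]               # row (1)
    while layers[-1]:                                       # row (2)
        known = set().union(*layers)
        pool = sorted(known)
        produced = set()
        for arity, react in schemes:
            for size in range(1, arity + 1):
                # combinations with repetition of the structures known so far
                for combo in combinations_with_replacement(pool, size):
                    for product_graph in react(combo):
                        for part in components(product_graph):
                            produced.add(canon(part))
        layers.append(produced - known)                     # row (4)
    return layers[:-1]                                      # drop the empty last layer


def canon(word):
    return "".join(sorted(word))


def components(word):
    return [word]                    # toy molecules are always connected


def join(combo):
    """Toy two-component scheme: join two words if the result is short enough."""
    if len(combo) == 2 and len(combo[0]) + len(combo[1]) <= 3:
        return {combo[0] + combo[1]}
    return set()


layers = mol_lib({"a", "b"}, [(2, join)], canon, components)
for k, layer in enumerate(layers):
    print(k, sorted(layer))
```

Starting from the two reactants, the sketch produces three new structures in the first round and four in the second, and stops after the third round, in which nothing new arises.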

5.3 Reaction-based structure generation

| 193

5.3.4 Generation of MS fragments Our motivation to implement a structure generator based on the network principle was the necessity to generate fragments that can occur in a mass spectrometer. Here are the features that we need (see Chapter 8): i) The set of reactants consists of a single reactant: L = {𝑀}. ii) All reactions are one component reactions. iii) The set of reaction schemes is partitioned into two subsets, the ionization schemes and the fragmentation schemes: R = R𝐼 ∪̇ R𝐹 . iv) In the first step we apply an ionization to 𝑀, obtaining a posively charged particle and (optionally) a neutral one. v) Positively charged particles only are relevant for further reactions. vi) After ionization, arbitrarily many fragmentations may follow. Steps i) and ii) need not be modified. In order to meet the other conditions, we intro­ duce the following extensions: – We associate a depth with each reaction scheme. In order to remain as flexible as possible, we specifiy intervals: {[1, 1] if 𝑅 ∈ R𝐼 , depthR : R → I(ℕ⋆ ) : 𝑅 󳨃→ { [2, ∞[ otherwise. { –

We can meet conditions iii), iv) and vi) with this association. Instead of Conn() we introduce Conn+ (L) = {𝑀 ∈ Conn(L) | cha(𝑀) = 1}, for the decomposition and selection of connected components of the product graphs. cha(𝑀) is the sum of atom charges in 𝑀.

Since only unimolecular reactions occur, we can restrict attention to row (4) of Algorithm 5.15 while restricting the product formation to L𝑘−1. Otherwise, ProdR(⋃𝑖∈𝑘 L𝑖) would produce duplicates only. The modified algorithm contains the depths of the reaction schemes as an additional argument:

5.16 Algorithm (MolLibMS(L, R, depthR()))
(1) L0 ← 𝜅(L), 𝑘 ← 0
(2) while L𝑘 ≠ ∅ do
(3)   𝑘 ← 𝑘 + 1
(4)   R′ ← {𝑅 ∈ R | 𝑘 ∈ depthR(𝑅)}
(5)   L𝑘 ← 𝜅(Conn+(ProdR′(L𝑘−1))) \ ⋃𝑖∈𝑘 L𝑖
(6)   Output(L𝑘)
(7) end
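The level-wise loop of Algorithm 5.16 can be sketched in a few lines of Python. This is only an illustration under strong assumptions: molecules are modelled as hypothetical `(name, charge)` tuples, reaction schemes as plain functions returning product components, and `canon` and `charge` are stand-ins for 𝜅 and cha that are passed in by the caller. None of these names comes from MOLGEN.

```python
import math

def mol_lib_ms(reactants, schemes, depth, canon, charge):
    """Level-wise generation in the spirit of Algorithm 5.16 (toy sketch)."""
    levels = [{canon(m) for m in reactants}]            # row (1)
    seen = set(levels[0])
    k = 0
    while levels[k]:                                    # row (2)
        k += 1                                          # row (3)
        # row (4): schemes whose depth interval contains k
        active = [r for r in schemes if depth[r][0] <= k <= depth[r][1]]
        # row (5): products of the previous level, singly charged only
        new = {canon(p)
               for m in levels[k - 1]
               for r in active
               for p in r(m)
               if charge(p) == 1}
        new -= seen                                     # drop known structures
        seen |= new
        levels.append(new)                              # row (6): "output"
    return levels

# Hypothetical toy chemistry: a molecule is (string, charge).
def ionize(m):
    name, q = m
    return [(name, q + 1)]

def fragment(m):
    # Cut the string anywhere; the left part keeps the charge.
    name, q = m
    out = []
    for i in range(1, len(name)):
        out.append((name[:i], q))
        out.append((name[i:], 0))
    return out

depth = {ionize: (1, 1), fragment: (2, math.inf)}
levels = mol_lib_ms([("ABC", 0)], [ionize, fragment], depth,
                    canon=lambda m: m, charge=lambda m: m[1])
```

In this toy run the ionization is applied only at depth 1 and the fragmentations from depth 2 onwards, so the generation stops as soon as a level contributes no new charged fragment.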

5.3.5 Construction using the network principle

Finally we want to discuss whether structure generation based on the network principle (Subsection 5.3.3) can be used in practice. Consider the amidations of cubane-tetracarboxylic acid chloride.

[Structure: cubane-tetracarboxylic acid chloride, drawn with four C(=O)Cl groups; drawing not reproduced]

Surprisingly, we obtain 13,035 solutions for 20 amino acids (without specifying multiplicities and using the topological automorphism group). A glance at one of the structures generated reveals what has happened:

Since we had to remove the second hydrogen atom of a reacting amino group from our reaction scheme, amino acid residues that are already attached are able to react once more with another active site of the central molecule, forming a ring, as demonstrated in the structure above for a cysteine (to the left of the cubane substructure). We do not wish to discuss here whether this reaction makes sense from a chemical point of view. Mathematically, we can exclude such intramolecular reactions by imposing a distance restriction on our reaction scheme. Thus, we request an infinite distance between the carbon atom of the acid chloride substructure and the nitrogen atom of the imino group substructure:

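[Reaction scheme (drawing not reproduced): acid chloride substructure C(=O)Cl and N−H substructure, with the restriction dist = ∞ between the acid chloride C atom and the N atom]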


In this manner we make sure that these two atoms cannot form a bond unless they are situated in different molecules before the reaction. After applying this modification, the network generator produces the 8855 compounds expected.
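The expected number 8855 can be checked with a short Cauchy–Frobenius (Burnside) count. The sketch below assumes that the topological automorphism group acts on the four substitution sites as the full symmetric group S4; this is our assumption for the illustration, chosen because it reproduces the number quoted above. The function names are hypothetical.

```python
from itertools import permutations
from math import comb

def cycle_count(perm):
    """Number of cycles of a permutation given as a tuple of images."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = perm[j]
    return cycles

def orbit_count(n_sites, n_substituents):
    """Orbits of the full symmetric group on substituent assignments."""
    group = list(permutations(range(n_sites)))
    fixed = sum(n_substituents ** cycle_count(g) for g in group)
    return fixed // len(group)

n_amides = orbit_count(4, 20)   # 20 amino acids on 4 equivalent sites
```

Under this assumption the count equals the number of multisets of size 4 drawn from 20 substituents, `comb(23, 4)`.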

5.3.6 Combinatorial libraries

The generation of combinatorial libraries plays an important role in modern synthetic chemistry. However, it is advisable to have a look at the possible outcome of these combinatorial experiments in advance to optimize the procedure.

In Subsection 5.3.2 we covered the special case when different substituents are to be attached to a single central molecule. A similar case was introduced in Example 3.13, the amidation of polyvalent acid chlorides by amino acids. We counted the sizes of corresponding combinatorial libraries, and in Subsection 5.3.1 we described an algebraic method using double cosets for the construction of this library. This time we should like to describe construction of these libraries using the network principle, which is advisable, for example, if rings might arise or if different central molecules may be used.

First we note the following peculiarities concerning the generation of combinatorial libraries:
(i) The set of reactants is partitioned into two subsets, the central molecules and the substituents: L = L𝐶 ∪̇ L𝐿.
(ii) A central molecule can be used only once and at the beginning of a reaction.
(iii) Each product contains at least one of the central molecules.
(iv) The reactions are either one or two component reactions.
(v) Reactions between two intermediate products are neglected.
(vi) Byproducts such as H2O, HCl, CO2 are neglected.

In order to meet these requests, we extend our methods as follows:
– We associate a depth with each reactant, which says how far it can be used. In order to keep this flexible, we use intervals for the depth:

$$\mathrm{depth}_{\mathcal{L}} : \mathcal{L} \to I(\mathbb{N}), \quad \text{where} \quad \mathrm{depth}_{\mathcal{L}}(M) = \begin{cases} [0, 0] & \text{if } M \in \mathcal{L}_C, \\ [1, \infty[ & \text{otherwise.} \end{cases}$$

  This allows us to fulfill requirements (i)–(iii).
– In order to cope with (vi), we define the following set that selects the largest connected components of the product graphs:

$$\mathrm{Conn}_{\geq}(\mathcal{L}) = \bigcup_{M \in \mathcal{L}} \mathrm{Conn}_{\geq}(M), \quad \text{where} \quad \mathrm{Conn}_{\geq}(M) = \{ M' \in \mathrm{Conn}(M) \mid \mathrm{size}(M') \geq \tfrac{1}{2}\,\mathrm{size}(M) \},$$

  with size(𝑀) denoting the number of atoms in 𝑀.
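The selection Conn≥ can be illustrated on a plain adjacency structure. The following sketch is not taken from the generator itself: it assumes that a product graph is given as a dictionary mapping each atom label to the set of its neighbors, and `connected_components` and `conn_ge` are hypothetical helper names.

```python
def connected_components(adj):
    """Connected components of an undirected graph {atom: set_of_neighbors}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def conn_ge(adj):
    """Components holding at least half of all atoms of the product graph."""
    n = len(adj)
    return [c for c in connected_components(adj) if 2 * len(c) >= n]

# Toy product graph: a five-atom chain plus a two-atom byproduct (e.g. HCl).
product_graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4},
                 "H": {"Cl"}, "Cl": {"H"}}
kept = conn_ge(product_graph)
```

Here the small byproduct is discarded because it holds fewer than half of the seven atoms, which is the effect requirement (vi) asks for.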

We add depthL() and Conn∗() as additional arguments. In the generation of combinatorial libraries, we usually choose Conn∗() = Conn≥(). Requirement (v) is used in row (6), where (vi) is applied as well.

5.17 Algorithm (MolLibCC(L, R, depthR(), depthL(), Conn∗()))
(1) L0 ← 𝜅({𝑀 ∈ L | 0 ∈ depthL(𝑀)}), 𝑘 ← 0
(2) while L𝑘 ≠ ∅ do
(3)   𝑘 ← 𝑘 + 1
(4)   L′ ← {𝑀 ∈ L | 𝑘 ∈ depthL(𝑀)}
(5)   R′ ← {𝑅 ∈ R | 𝑘 ∈ depthR(𝑅)}
(6)   L𝑘 ← 𝜅(Conn∗(ProdR′(L𝑘−1 ∪ L′))) \ ⋃𝑖∈𝑘 L𝑖
(7)   Output(L𝑘)
(8) end

Additional features can also be useful for the generation of molecular libraries, such as:
– Outputting only the final products when the intermediates are not of interest.
– Allowing reactants or reaction schemes to occur with prescribed multiplicities.

Requirements of the latter type will be applied in the next section. Considering multiple reactants and/or reaction schemes means accounting for the fact that different paths through the reaction network may lead to the same product, and that the multiplicities may differ between such paths.

5.3.7 Ugi’s seven component reaction

Many combinatorial libraries are not as easily described as that in Subsection 5.3.1. Often the reactants belong to various classes of compounds and react with each other via complex mechanisms. As an example, consider Ugi’s seven component reaction [310]. The reactants are a central building block, pyridine-2,6-dicarboxylic acid

[Structure: pyridine-2,6-dicarboxylic acid; drawing not reproduced]

and sets of four isocyanides R1−N+≡C−,

[Structures of the four isocyanides not reproduced]

four aldehydes R2−CH=O,

[Structures of the four aldehydes not reproduced]

and four primary amines R3−NH2.

[Structures of the four primary amines not reproduced]

The functional groups characterizing the sets of reactants are highlighted in grey. A stepwise procedure to construct the library described by Ugi’s seven component reaction is given in [337]. Figure 5.6 shows how the pyridine-carboxylic acid is extended successively by the amine, aldehyde and isocyanide building blocks.

[Fig. 5.6. Scheme of the seven component reaction. Drawing not reproduced.]

The reaction schemes used are

[Reaction schemes not reproduced]

where the bonds to be broken are marked with crosses ‘×’ and the bonds to be formed with dots ‘∙’, as before. Any changes in charge are symbolized by ‘⊕’ and ‘⊖’. The atoms of the newly introduced building block that find themselves in the reaction center are highlighted in grey.

If the aromaticity of the pyridine is ignored, there are $4^2 = 16$ intermediates after the first step, $16 \cdot 4^2 = 256$ intermediates after the second step and $256 \cdot 4^2 = 4096$ products after the third step. Finally, eliminating the aromatic duplicates leads to 2080 members in the virtual library.

Alternatively, the aromaticity of the central molecule can be considered from the beginning. If this is the case, there are 4 symmetrical intermediates after the first step (containing two identical R3) plus $\binom{4}{2} = 6$ asymmetrical intermediates (with two different R3). After the second step, there are $4 \cdot 4 = 16$ symmetrical intermediates and

$$6 \cdot 4 + 4 \cdot \binom{4}{2} + 6 \cdot \binom{4}{2} \cdot 2 = 120$$

asymmetric intermediates. In detail, these are asymmetric in R3 and symmetric in R2, symmetric in R3 and asymmetric in R2, and asymmetric both in R3 and R2. After the third step there are $16 \cdot 4 = 64$ products symmetric in R1, R2 and R3, as well as

$$120 \cdot 4 + 16 \cdot \binom{4}{2} + 120 \cdot \binom{4}{2} \cdot 2 = 2016$$

asymmetric products. Altogether this makes 2080 members of the virtual library, as above. In Section 8.3 of [337], ‘Counting and Screening a library’, it is shown how such a situation of multiple addition to a central molecule can be formulated as a problem of constructing symmetry classes, and how it may be generalized to arbitrary skeletons.
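The numbers 4 + 6, 16 + 120 and 64 + 2016 can be reproduced by brute force. The sketch below rests on a simplifying assumption of ours: each product is modelled as an unordered pair of "arms" attached to the symmetric central molecule, and an arm after step 𝑠 is a tuple of 𝑠 building-block indices. The function name `count_after_steps` is hypothetical.

```python
from itertools import combinations_with_replacement, product

def count_after_steps(n_blocks, n_steps):
    """Return (symmetric, asymmetric) counts after each step."""
    counts = []
    for s in range(1, n_steps + 1):
        arms = list(product(range(n_blocks), repeat=s))      # one arm
        pairs = list(combinations_with_replacement(arms, 2)) # unordered pairs
        symmetric = sum(1 for a, b in pairs if a == b)
        counts.append((symmetric, len(pairs) - symmetric))
    return counts

counts = count_after_steps(4, 3)
library_size = sum(counts[-1])
```

With four building blocks per class, the last entry gives the 64 symmetric and 2016 asymmetric products, and their sum is the library size.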

5.4 Generic structural formulas

Often the search spaces of combinatorial chemistry are described by generic structural formulas [11], and this is true in particular for patent libraries in chemistry [329]. There are only very few structure generators that allow the use of generic structural formulas to their full potential. Moreover, there is not yet a standardized and comprehensive format for the representation of generic structural formulas.

5.4.1 A simple generic structural formula

We present a simple example of a generic structural formula and construct the search space that is obtained by combining the molecular formula and reaction-based structure generation. In generic structural formulas the substituents are denoted as Ri, 𝑖 being the index. The following generic structural formula 𝐺𝑆

[Generic structural formula 𝐺𝑆 (drawing not reproduced): skeleton bearing OH, R1, R2, R3, Cl and (CH2)m]
R1 : CH3 or C2H5 (variation of substituents)
R2 : Alkyl (variation of homology)
R3 : NH2 (variation of position)
m : 1–3 (variation of chain length)

was introduced in [12] to demonstrate the possible variations:
– Variation of substituents: There are different possibilities for substituent R1.
– Variation of homology: For R2 we can choose any element of a specified homologous series. Alkyl groups are all structures with a molecular formula of the form CnH2n+1, 𝑛 ∈ ℕ⋆.
– Variation of position: Substituent R3 can be in one of several positions. In the structural formula above this is illustrated with a bond of R3 placed between two connected C atoms. The position that is not taken by the substituent is occupied by a H atom.
– Variation of chain length: Up to three chain links of the form CH2 may be added to one substituent.

In reality, generic structural formulas are often much more complicated. In particular the complexity can increase considerably by combination and recursion of the principles described above. For example, variation of substituents and positions can occur, while the various substituents Ri can themselves be described by generic structural formulas. Variations resulting from homology especially can result in the possibility of an infinite structure space. In our example, we assume that only alkyl groups with 1–6 C atoms are allowed. Such restrictions are certainly both reasonable and necessary. To evaluate the space of all these structures we use the following central molecule 𝑀:

[Central molecule 𝑀 (drawing not reproduced): skeleton bearing OH, R1, R2, Cl and two R3]

and set depthL(𝑀) = [0, 0] to use the generation by network principle. The Ri will be considered as atoms with valence 1 and element symbol R1, R2, R3, respectively.


The other reactants and reaction schemes, together with their depths and multiplicities, are listed in Tables 5.1 and 5.2. Z can be considered as an atom and it denotes the place where the central molecule is to be substituted.

Table 5.1. Reactants for generation of the space of structures defined by the generic structural formula of Subsection 5.4.1.

Variation of   substituents      homology                 position       chain length
Reactants      Z−CH3, Z−C2H5     Z−CnH2n+1, 1 ≤ n ≤ 6     Z−H, Z−NH2     CH2
                                 (33 isomers)
Depth          [1, 1]            [2, 2]                   [3, 4]         [5, 6]
Mult           [0, ∞[            [0, ∞[                   [1, 1]         [0, ∞[

Table 5.2. Reaction schemes for generation of the space of structures defined by the generic structural formula of Subsection 5.4.1.

Variation of   substituents      homology          position          chain length
Scheme         R1, Z             R2, Z             R3, Z             C−Cl, CH2, dist = ∞
               (drawings of the reaction schemes not reproduced; only the atom labels are listed)
Depth          [1, 1]            [2, 2]            [3, 4]            [5, 6]
Mult           [1, 1]            [1, 1]            [2, 2]            [0, 2]

For the R2 substituents, we first have to generate all alkyl groups of the form Z−Cn H2n+1 , 1 ≤ 𝑛 ≤ 6, i.e. constitutional isomers with the molecular formula Cn H2n+1 Z. A molecular formula-based generation of these isomers yields 33 connectivity isomers (Figure 5.7), that are distributed in the following way:

𝑛        1   2   3   4   5   6    Total
Number   1   1   2   4   8   17   33
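The distribution 1, 1, 2, 4, 8, 17 can be reproduced by a simple recursive count. The sketch below is not the formula-based generator used in the text. It only counts, under the usual model of an alkyl group as a carbon atom attached to Z and carrying three branches (each an alkyl group or a hydrogen atom), the multisets of three branches. The function name `alkyl_counts` is hypothetical.

```python
from math import comb

def alkyl_counts(n_max):
    """Numbers of constitutional isomers Z-CnH2n+1 for n = 1..n_max."""
    a = [1]                      # a[0] = 1: the "empty" branch, a hydrogen atom
    for n in range(1, n_max + 1):
        s = n - 1                # carbons left for the three branches
        total = 0
        for i in range(s // 3 + 1):
            for j in range(i, (s - i) // 2 + 1):
                k = s - i - j    # branch sizes i <= j <= k
                if i == j == k:
                    total += comb(a[i] + 2, 3)
                elif i == j:
                    total += comb(a[i] + 1, 2) * a[k]
                elif j == k:
                    total += a[i] * comb(a[j] + 1, 2)
                else:
                    total += a[i] * a[j] * a[k]
        a.append(total)
    return a[1:]

numbers = alkyl_counts(6)
```

Summing the six values gives the 33 reactants listed for the variation of homology.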

The variation of chain length occurs by inserting up to two CH2 groups via breaking the C−Cl bond. In order to exclude intramolecular reactions, the substructure reaction is combined with a distance restriction (see Tables 5.1 and 5.2). This ensures we only obtain the embeddings that put C−Cl and CH2 into different reactants. In this way we obtain a space of 396 structures. In the next subsection we shall examine this space with respect to overlaps with a second library of molecules also derived from a generic structural formula.

[Fig. 5.7. Alkyl groups with 1–6 C atoms. Drawings of the 33 groups Z−CnH2n+1, numbered 1–33, not reproduced.]

5.4.2 Patents in chemistry

The generation of generic molecular formulas can be used to evaluate patent libraries. Patents in chemistry are usually based on a generic formula, a Markush formula (cf. [329], Chapters 12–23). In Subsection 5.4.1 we introduced the following example:

[Generic structural formula (drawing not reproduced): skeleton bearing OH, R1, R2, R3, Cl and (CH2)m]
R1 : CH3 or C2H5
R2 : Alkyl (1–6 C atoms)
R3 : NH2
𝑚 : 1–3

and we found that this formula covers a set L1 of altogether 396 structural formulas. We should like to compare it with the set of structures L2 defined by

[Generic structural formula (drawing not reproduced): skeleton bearing R1, R2, R3, R4 and R5]
R1 : CH3, C2H5, OH
R2 : Alkyl (1–6 C atoms)
R3 : OH, OCH3, OC2H5, CH3, C2H5
R4 : OH, CH2Cl, NH2
R5 : H, CH3, C2H5, NH2

The reason is that a basic and difficult problem relating to patents in chemistry is to check the overlap of several patent libraries, i.e. for possible patent violations. These examples were discussed in [149] to show a simple but informative case.

It is crucial that any solution of this problem requires the structures in the libraries L𝑖, 𝑖 = 1, 2, to be obtained in canonical form, so that two such libraries can be compared easily with respect to overlap L1 ∩ L2. A canonizer for molecular structures is described in Section 5.5. Using MOLGEN–COMB and applying the methods described above, we obtain |L2| = 5939, although there are altogether 3 ⋅ 33 ⋅ 5 ⋅ 3 ⋅ 4 = 5940 combinations of substituents. Due to symmetry of the benzene skeleton, the compounds with

R1 = OH, R2 = C2H5, R3 = CH3, R4 = OH, R5 = H

and

R1 = OH, R2 = CH3, R3 = C2H5, R4 = OH, R5 = H

are identical, as is easily found by the program. Moreover, since the files of these libraries are in canonical form we get immediately |L1 ∩ L2| = 4 and find the overlap:

[Structures of the four overlapping compounds not reproduced]
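Why canonical forms make duplicate detection and overlap checks trivial can be seen in a toy model. The sketch below is a deliberately simplified stand-in for a real canonizer. It assumes a six-membered ring described only by the substituents at its six positions, with ring positions invented for the illustration, and it takes the lexicographically smallest rotation or reflection as the canonical form. `canon_ring` is a hypothetical name.

```python
def canon_ring(substituents):
    """Smallest tuple among all rotations and reflections of a ring pattern."""
    n = len(substituents)
    variants = []
    for seq in (tuple(substituents), tuple(reversed(substituents))):
        for r in range(n):
            variants.append(seq[r:] + seq[:r])
    return min(variants)

# Two substituent assignments that differ only by a mirror symmetry of the ring.
a = ("OH", "C2H5", "CH3", "OH", "H", "H")
b = ("OH", "CH3", "C2H5", "OH", "H", "H")
c = ("NH2", "Cl", "H", "H", "H", "OH")

library_1 = {canon_ring(x) for x in (a, c)}
library_2 = {canon_ring(x) for x in (b,)}
overlap = library_1 & library_2      # plain set intersection
```

Once every structure is stored in canonical form, counting distinct members and intersecting two libraries reduce to ordinary set operations.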

Certainly, complete generation of all structures is not the only way to solve the problem. A review of different methods can be found in [13], where in particular the evaluation of molecular descriptors for large libraries specified by generic structural formulas is described. In any case canonization of data and normal forms play a central role. Hence it is time to consider this problem and to describe a canonizer.

5.5 Canonizing molecular graphs

A chemical compound should be unambiguously identifiable by a unique label. For decades, traditional chemical nomenclature describing the structure served this purpose more or less well. However, with compounds under study becoming more and more complex, chemical names also became ever more complex. As a result, many chemical names now are lengthy, difficult to pronounce and unwieldy. Chemical names were thus superseded in chemists’ everyday life by drawings of the structures, which are considered by many the natural language of molecular science.

On the other hand, a structure can be drawn in various ways, such that there is no one-to-one correspondence between a compound and a particular drawing. Further, the atoms in a drawing may be numbered in many ways (𝑛! = 𝑛 ⋅ (𝑛 − 1) ⋯ 2 ⋅ 1 numberings for a compound containing 𝑛 atoms), so that the computer representations derived (connection tables, bond matrices), although unambiguous, are not unique. This was discussed above in detail and leads to the introduction of unlabeled structures.

For some time registry numbers seemed to be a solution to the problem, at least for the bench chemist and the layman, in that a new registry number is attributed to a compound when it is first registered by Chemical Abstracts Service (CAS–RN) or Beilstein (BRN). This number then serves as the compound’s unique ID. This procedure leaves the agency with the problem of comparing a seemingly new compound to all those already present in the database. As a further principal limitation, an RN is not available for an unpublished compound. Furthermore, registry numbers can also be given to mixtures; thus one compound may also have several RNs.

Nowadays, in the computer age and the time of combinatorial chemistry, when chemical companies and even individuals establish their own databases of real or virtual compounds and reactions, the problem of identifying compounds has become more urgent than ever. The problem can be described as the problem of canonization, that is, to attribute a standard representation to a compound using a set of rules, i.e. a unique character string easily comparable by computer or manually to the corresponding strings of other compounds. This is equivalent to producing a unique numbering of the atoms in a molecule, a canonical numbering. Any molecule generator software such as MOLGEN [20, 338] or SMOG [211] contains a canonizer out of necessity, to avoid redundant generation.


Many canonization methods have been proposed. For earlier procedures described in the chemical literature see the paper by Jochum and Gasteiger and references cited therein [143]. Randić considered a bond matrix as canonical if the minimum binary number resulted when the rows of its upper half were concatenated [241]. Hendrickson instead used the maximum number obtained from the upper half matrix [122]. Kvasnicka and Pospichal preferred the maximum number obtained from the lower half matrix [170, 171]. Although such an extremality requirement obviously leads to a unique numbering, this is not necessary. Rather, the goal may be achieved using one of many procedures, provided it is well-defined and leaves no room for arbitrariness. Other canonization procedures have also been developed [37, 330].

An often used but inferior method to discriminate molecules is to use graph invariants, which are numbers obtained from a structure in some well-defined way. Similarly, the atoms in a molecule may often be distinguished using vertex-in-graph invariants. The most important procedure of this kind probably is the Morgan algorithm, in which the atoms in a molecule are distinguished by their extended connectivities, and the numbers are obtained by repeated summation of the connectivity values over all neighbors of a particular atom. This method still seems to be the basis of the Chemical Abstracts registry system [213]. An improved version was proposed by Balaban, Mekenyan and Bonchev [10]. The Weiningers published a method largely based on graph invariants to obtain a unique form of SMILES notation [331]. Though graph invariant-based methods sometimes work surprisingly well [268, 270], all graph invariants are degenerate, i.e. there are non-isomorphic graphs (non-identical molecules) with the same numerical value of a particular graph invariant or even identical values for a combination of several graph invariants. This problem is occasionally ignored [244] even today. The real merit of graph invariants in the present context is that they often allow the comparison of two compounds without the need for a rigorous isomorphism test. Similarly, vertex-in-graph invariants, though sometimes identical for nonequivalent nodes, often allow easy comparison of graph nodes, which renders the ensuing rigorous canonization far less difficult.

The extraordinary value of a canonizer became apparent to us again when we found that even among simple graphs of no more than 8 nodes, there are some that cannot be differentiated using the highly discriminant combination of Balaban’s index J and distance matrix eigenvalues. For the MOLGEN canonizer it was no problem at all to resolve these degeneracies [265].

Recently, the International Union of Pure and Applied Chemistry (IUPAC) has recognized the need for a canonization procedure available to every chemist and is investing a major effort into the corresponding identifier and supporting software, the IUPAC Chemical Identifier (InChI) [121]. The latest software as well as publications may be downloaded from http://www.iupac.org/inchi/.
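The repeated summation of connectivity values used in the Morgan algorithm, mentioned above, can be sketched as follows. This is a common textbook formulation and not the registry system's implementation: the starting values (vertex degrees) and the stopping rule (stop when the number of distinct values no longer increases) are assumptions of this sketch, and `morgan_ec` is a hypothetical name.

```python
def morgan_ec(adj):
    """Extended connectivities for a graph {atom: set_of_neighbors}."""
    ec = {v: len(nb) for v, nb in adj.items()}      # start with the degrees
    n_classes = len(set(ec.values()))
    while True:
        # Sum the current values over all neighbors of each atom.
        new = {v: sum(ec[u] for u in adj[v]) for v in adj}
        k = len(set(new.values()))
        if k <= n_classes:          # no further discrimination gained
            return ec
        ec, n_classes = new, k

# H-suppressed n-pentane: a path of five carbon atoms.
pentane = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
ec = morgan_ec(pentane)
```

In this example the two terminal atoms and the two atoms next to them keep equal values, as expected for symmetry-equivalent atoms. Equal values do not prove equivalence in general, which is the degeneracy discussed in the text.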

In this section we describe the MOLGEN Chemical Identifier MOLGEN–CID [33]. The following text is reprinted with permission from reference [33], copyright 2004 American Chemical Society.

If a (molecular or non-molecular) graph is uploaded to MOLGEN–CID a canonical numbering is performed which results in a unique and unambiguous character string and a molfile that describe the canonized structure. Two molfiles uploaded separately are both canonized, and the resulting character strings are compared automatically, resulting in the answer ‘identical’ or ‘non-identical’.

By default, MOLGEN–CID works on H-suppressed graphs. Information on bond multiplicities is used by MOLGEN–CID from the beginning, while in the InChI representation (see www.inchi-trust.org), multiple bonds are removed before the canonization process begins. Heavy atoms are given in the order of their canonical numbering, each atom is followed by a list of bonds indicating their type (s = single, d = double, t = triple, a = aromatic) to its neighbors, which are identified by their canonical numbers. The string is easily reconverted to the structure, even manually. For example, the canonical strings for benzyl alcohol and anisole (methoxybenzene) are

Os8Cs8a3a4Ca5Ca6Ca7Ca7CC and Os2s8Ca3a4Ca5Ca6Ca7Ca7CC,

respectively. The benzyl alcohol string translates: There is an oxygen atom (number 1) that is joined to atom number 8 with a single bond. Atom number 2 is carbon and shares a single bond with atom 8 and aromatic bonds with atoms 3 and 4. Atom number 3 is carbon and shares an aromatic bond with atom 5, and so on.
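The reconversion of such a string into a bond list can be sketched with two regular expressions. The parser below is based solely on the verbal description and the two example strings above, not on the MOLGEN–CID software. In particular, its rule for telling a two-letter element symbol from a bond letter (a lowercase letter directly followed by a digit is read as a bond) is our guess, and `parse_cid` is a hypothetical name.

```python
import re

# Element symbol: capital letter, optionally a lowercase letter that is
# NOT followed by a digit (otherwise it is taken as a bond letter).
TOKEN = re.compile(r"([A-Z](?:[a-z](?!\d))?)((?:[sdta]\d+)*)")
BOND = re.compile(r"([sdta])(\d+)")

def parse_cid(string):
    """Return (atoms, bonds); bonds are (atom_no, neighbor_no, type)."""
    atoms, bonds, pos = [], [], 0
    while pos < len(string):
        m = TOKEN.match(string, pos)
        if m is None:
            raise ValueError(f"cannot parse at position {pos}")
        atoms.append(m.group(1))
        number = len(atoms)                 # canonical number of this atom
        for kind, neighbor in BOND.findall(m.group(2)):
            bonds.append((number, int(neighbor), kind))
        pos = m.end()
    return atoms, bonds

atoms, bonds = parse_cid("Os8Cs8a3a4Ca5Ca6Ca7Ca7CC")        # benzyl alcohol
atoms2, bonds2 = parse_cid("Os2s8Ca3a4Ca5Ca6Ca7Ca7CC")      # anisole
```

For benzyl alcohol this yields eight heavy atoms and eight bonds, six of them aromatic, in agreement with the verbal translation given above.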

5.5.1 Initial classification (step 1)

As in many other canonization procedures, our method starts by partitioning the graph nodes into classes according to some node-in-graph invariants. The purpose of this step is to restrict the numberings to be considered from 𝑛! to 𝑛0! ⋅ 𝑛1! ⋯ 𝑛𝑘−1!, where 𝑛0, 𝑛1, . . . , 𝑛𝑘−1 are the cardinalities of the vertex classes, so that 𝑛0 + 𝑛1 + . . . + 𝑛𝑘−1 = 𝑛. The criteria used for initial classification are non-numerical and numerical properties of the nodes, which are obtained easily. They are ordered as follows:
i) Nature of an atom (C, N, O, . . . ). All atoms with a higher atomic number in the periodic system have priority over (i.e. will get lower canonical numbers than) all atoms with a lower atomic number.
ii) Atom attributes such as an atomic mass other than default (isotope), a charge other than zero, an unpaired electron (free radical), or a valence other than default (e.g. the default valence for carbon is four, including bonds to hydrogen atoms).
iii) Ring or chain nature. Atoms in rings have priority over chain atoms.
iv) For chain atoms, their skeleton/non-skeleton property. A chain connecting two rings is considered part of the molecular skeleton, in contrast to a side chain which is not. An atom in a skeleton chain has priority over an atom in a side chain.
v) The number of aromatic, triple, double and single bonds (not counting those to hydrogen) in which an atom is engaged, in that order. For example, a carbon atom engaged in three aromatic bonds has priority over one with two aromatic bonds; a carbon atom in a triple bond has priority over a central allenic C atom, which in turn has priority over a carbon engaged in one double bond, and a C atom with four single bonds to non-hydrogen atoms has priority over those with three, two, or one such bonds, respectively.

[Fig. 5.8. Example compounds 1–4, with arbitrary initial vertex numbering. Drawings not reproduced.]
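An ordered classification of this kind amounts to sorting the atoms by a composite key and grouping equal keys. The sketch below is a reduced illustration, not the MOLGEN–CID code: it uses only criteria i), iii) and v), omits ii) and iv), and assumes each atom is given as a hypothetical tuple `(label, atomic_number, in_ring, (aromatic, triple, double, single))`.

```python
from itertools import groupby
from math import factorial, prod

def initial_classes(atoms):
    """Ordered partition of atom labels by a reduced set of criteria."""
    def key(atom):
        label, z, in_ring, bonds = atom
        # Smaller key = higher priority: heavier element first, ring before
        # chain, then more aromatic/triple/double/single bonds first.
        return (-z, not in_ring, tuple(-b for b in bonds))
    ordered = sorted(atoms, key=key)
    return [[a[0] for a in grp] for _, grp in groupby(ordered, key=key)]

def numberings(classes):
    """Number of numberings compatible with the classes: n0! * n1! * ..."""
    return prod(factorial(len(c)) for c in classes)

# H-suppressed 2-propanol with an arbitrary initial numbering.
atoms = [
    (1, 6, False, (0, 0, 0, 1)),   # CH3
    (2, 6, False, (0, 0, 0, 3)),   # CH
    (3, 6, False, (0, 0, 0, 1)),   # CH3
    (4, 8, False, (0, 0, 0, 1)),   # O
]
classes = initial_classes(atoms)
```

For this small example the classification already leaves only 2 of the 4! = 24 numberings, the two that exchange the equivalent methyl carbons.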

5.5.2 Iterative refinement (step 2)

The initial classification is iteratively refined according to each unique atom’s neighbors, as far as a neighbor is already ‘unique’ (forms a class for itself) [17, 193]. Each unique atom in turn is used to split non-unique classes, and each atom that becomes unique joins the queue to be used itself.

By using steps 1 and 2, a discrete partition is often obtained, particularly for molecular graphs.

5.18 Example (Canonical numbering of a pymetrozine analog) Consider the H-suppressed graph 1 from Figure 5.8 with an arbitrary initial vertex numbering. The results of the following are given in Table 5.3. The initial partition is

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Successive application of items in step 1 yields a progressively finer classification:
i) separates the atoms according to elements, which yields
   7 | 2 3 6 9 16 | 1 4 5 8 10 11 12 13 14 15,
ii) is of no use in this example,
iii) says that ring atoms have priority, this leads to
   7 | 2 3 6 16 | 9 | 1 4 5 11 12 13 14 15 | 8 10,
iv) implies that an atom in a skeleton chain has priority over an atom in a side chain, and so we find
   7 | 2 3 6 16 | 9 | 1 4 5 11 12 13 14 15 | 10 | 8,
v) means a separation according to the number of aromatic, triple, double and single bonds (not counting those to hydrogen) in which an atom is engaged. This yields the desired initial classification
   7 | 16 | 3 | 6 | 2 | 9 | 11 | 12 13 14 15 | 1 4 | 5 | 10 | 8.

Step 2. Iterative refinement according to each unique atom’s immediate neighbors:
– The first unique atom, oxygen atom 7, has carbon atom 1 as its immediate neighbor, but not atom 4. For this reason atoms 1 and 4 are separated into different classes, and we find
   7 | 16 | 3 | 6 | 2 | 9 | 11 | 12 13 14 15 | 1 | 4 | 5 | 10 | 8.
– Attempted refinements by unique atoms 9, 10, 8 in turn (according to their position in the queue) do not lead to any separation.
– The next unique atom, nitrogen atom 16, has atom 15 as its immediate neighbor, but not atoms 12–14, so we obtain
   7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 13 14 | 1 | 4 | 5 | 10 | 8.
– The only remaining non-unique atoms, atoms 12, 13, 14 still in the same class, cannot be split by neighborhood to unique atoms 3, 6, 2. However, the next unique atom, carbon atom 11, has atom 12 as its immediate neighbor, but not atoms 13 and 14, yielding:
   7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 | 13 14 | 1 | 4 | 5 | 10 | 8.
– Unique atoms 5, 1, and 4 do not separate atoms 13 and 14. Finally we can separate 13 from 14 according to the neighborship to unique atom 15, and so we find
   7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 | 14 | 13 | 1 | 4 | 5 | 10 | 8.

Having obtained this partition into single atoms, we relabel atoms in that molecule using the permutation

$$\begin{pmatrix} 7 & 16 & 3 & 6 & 2 & 9 & 11 & 15 & 12 & 14 & 13 & 1 & 4 & 5 & 10 & 8 \\ 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 \end{pmatrix},$$

thus obtaining the canonical numbering of atoms in this pymetrozine analog.
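The queue-driven refinement used in this example can be sketched compactly. The code below is a minimal reading of step 2, not the MOLGEN–CID implementation: it assumes an ordered partition given as a list of lists and a graph given as `{atom: set_of_neighbors}`, and it places the neighbors of the refining atom before the non-neighbors, which is our choice for the illustration. `refine` is a hypothetical name.

```python
from collections import deque

def refine(partition, adj):
    """Split non-unique classes by adjacency to unique atoms, in queue order."""
    partition = [list(c) for c in partition]
    queue = deque(c[0] for c in partition if len(c) == 1)
    while queue:
        u = queue.popleft()
        refined = []
        for cls in partition:
            if len(cls) == 1:
                refined.append(cls)
                continue
            near = [v for v in cls if v in adj[u]]       # neighbors of u
            far = [v for v in cls if v not in adj[u]]    # all others
            for part in (near, far):
                if part:
                    refined.append(part)
                    if len(part) == 1:
                        queue.append(part[0])            # newly unique atom
        partition = refined
    return partition

# H-suppressed 1-propanol: O(1)-C(2)-C(3)-C(4), initial classes by
# element and number of bonds.
propanol = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
result = refine([[1], [2, 3], [4]], propanol)
```

Here the unique oxygen atom separates its neighbor, atom 2, from atom 3, and a discrete partition is reached without any backtracking.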

Table 5.3. Initial classification and iterative refinement for the pymetrozine analog, Figure 5.8, structure 1. Atoms that become unique are printed in bold.

Initial numbers                        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
partitioned by criterion i)            7 | 2 3 6 9 16 | 1 4 5 8 10 11 12 13 14 15
criterion iii)                         7 | 2 3 6 16 | 9 | 1 4 5 11 12 13 14 15 | 8 10
criterion iv)                          7 | 2 3 6 16 | 9 | 1 4 5 11 12 13 14 15 | 10 | 8
criterion v)                           7 | 16 | 3 | 6 | 2 | 9 | 11 | 12 13 14 15 | 1 4 | 5 | 10 | 8
initial classification refined by 7    7 | 16 | 3 | 6 | 2 | 9 | 11 | 12 13 14 15 | 1 | 4 | 5 | 10 | 8
refined by 16                          7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 13 14 | 1 | 4 | 5 | 10 | 8
refined by 11                          7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 | 13 14 | 1 | 4 | 5 | 10 | 8
refined by 15                          7 | 16 | 3 | 6 | 2 | 9 | 11 | 15 | 12 | 14 | 13 | 1 | 4 | 5 | 10 | 8

5.5.3 Labeling by backtracking (step 3)

Classification by iterative refinement (steps 1 and 2) can be made even more powerful by slight variations in the procedure. The use of further vertex-in-graph invariants is an obvious option. Instead of the immediate neighborship used in step 2, relations of longer distance or even of all distances could be used for better discrimination. This would initially require the construction and evaluation of the graph’s distance matrix. Further, the restriction of unique atoms only to be used for refinement could be alleviated. This alternative, however, was often not advantageous [17, 18]. We decided not to exploit iterative classification to its limits since backtracking is needed anyway for cases of symmetry.

If a discrete partition is not yet achieved, either for insufficient resolving power of steps 1–2, or for symmetry equivalence of certain nodes, discrete partitions (numberings) that do not contradict the initial classification are generated by a backtracking procedure. The first class of lowest cardinality > 1 is chosen [164], and in backtracking a vertex, it is artificially marked ‘preferred’ and is made the root of a branch. By distinguishing a particular node, other sets of nodes may become distinguishable, so that again a finer partition is obtained by iterative classification. Step 3 is repeated recursively until a discrete partition is achieved (a depth-first search) by backtracking, marking an atom, and iterative refinement applied in turn. Backtracking ensures that (in principle) each eligible atom is marked and treated at each branching point at some time in the process, so that, in fact, there is no arbitrariness. The refinement resulting from artificially marking a vertex usually reduces the number of alternatives on the next backtrack level, thus preventing exponential growth of the backtrack tree in most cases. The canonical numbering is the smallest of all numberings belonging to the leaves of the backtrack tree.

Table 5.4. Labeling by backtracking for N-benzyl-o-toluidine, structure 2 in Figure 5.8. Atoms that become unique are printed in bold.

Initial numbers     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
step 1              7 1 6 9 2 3 4 5 10 11 12 13 14 8 15
step 2              7 6 9 1 5 10 14 2 4 3 11 12 13 8 15
btl1, 10 marked     7 6 9 1 5 10 14 2 4 3 11 12 13 8 15
refined by 10       7 6 9 1 5 10 14 2 4 3 11 12 13 8 15
refined by 14       7 6 9 1 5 10 14 2 4 3 11 13 12 8 15     *1
btl1, 14 marked     7 6 9 1 5 14 10 2 4 3 11 12 13 8 15     backtrack
refined by 14       7 6 9 1 5 14 10 2 4 3 13 11 12 8 15
refined by 10       7 6 9 1 5 14 10 2 4 3 13 11 12 8 15     *2 a
candidate 1 kept    7 6 9 1 5 10 14 2 4 3 11 13 12 8 15

5.19 Example (Canonical numbering of an unsubstituted phenyl residue) An unsubstituted phenyl residue is a typical case of both symmetry (two ortho and two meta atoms) and insufficient resolution of steps 1–2 (meta vs. para position). In N-benzyl-o-toluidine, structure 2 from Figure 5.8, two unresolved classes remain after steps 1 and 2, one containing atoms 10 and 14, the other atoms 11–13 (arbitrary numbering given in Figure 5.8). These results are shown in Table 5.4.

The two-member class is chosen, and atom 10 is preliminarily marked on backtrack level 1 (btl1). Atom 14 becomes unique as a result, and refinement by 10 and then 14 leads to a discrete partition, candidate 1 (*1) for canonical numbering. Backtracking and alternative marking of 14 followed by refinement results in another discrete partition (*2) which leads to the same matrix of bond multiplicities as the first (an automorphism, the symmetry of the phenyl residue, marked by ‘a’ in Table 5.4). Therefore the first candidate is kept and used for assigning canonical numbers, as shown.
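How two different numberings can lead to the same matrix of bond multiplicities is easy to demonstrate by exhaustive search on a small graph. The sketch below deliberately replaces the backtracking with refinement by a plain enumeration of all numberings that respect a given partition, and it picks the candidate with the largest lower-half code, one possible extremality choice assumed here for illustration. The toluene-like example and all function names are hypothetical.

```python
from itertools import permutations, product

def lower_half_code(order, bond):
    """Rows of the lower half of the bond matrix, concatenated."""
    return tuple(bond.get(frozenset((order[i], order[j])), 0)
                 for i in range(len(order)) for j in range(i))

def canonical_numbering(partition, bond):
    """Best class-respecting numbering and the number of equal candidates."""
    best_code, best_orders = None, []
    for choice in product(*(permutations(c) for c in partition)):
        order = [a for cls in choice for a in cls]
        code = lower_half_code(order, bond)
        if best_code is None or code > best_code:
            best_code, best_orders = code, [order]
        elif code == best_code:          # same matrix: an automorphism
            best_orders.append(order)
    return best_orders[0], len(best_orders)

# Toluene-like skeleton: ring atoms 1..6, substituent atom 7 on atom 1.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]
bond = {frozenset(e): 1 for e in edges}
partition = [[1], [2, 6], [3, 5], [4], [7]]   # ortho and meta pairs unresolved
order, n_equal = canonical_numbering(partition, bond)
```

Of the four candidate numberings, two give the maximal code. They differ by the mirror symmetry of the ring, so keeping the first one loses nothing.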

5.5.4 Pruning the backtrack tree

It is of decisive importance to devise a procedure that does not construct all possible numberings but instead prunes as many branches of the backtrack tree as possible. In the procedure described here, this goal is achieved by combining two features. First, an extremality criterion is used to compare candidate matrices of bond multiplicities, maximizing the number obtained from concatenation of the rows in the lower half of the matrix. This choice has the advantage that when entries in a certain row of the matrix are changed, the rows further up are not affected, i.e. the first digits of the number to be maximized are not changed. Secondly, to compare the matrices of bond multiplicities the atoms are renumbered in the order in which they become unique. Therefore, if a partial numbering results in a concatenated number smaller than the current favorite with respect to its first 𝑖 digits, then any permutation of the remaining labels is unnecessary since it cannot change the first 𝑖 digits, i.e. the backtrack tree is pruned at once.

Table 5.5. Pruning the backtrack tree for 1-azabicyclo[4.3.2]undecane, structure 3 from Figure 5.8. Atoms that become unique are printed in bold.

Initial numbers    1  2  3  4  5  6  7  8  9 10 11
step 1             1  4  2  3  5  6  7  8  9 10 11
refined by 1       1  4  2  7 11  3  5  6  8  9 10
refined by 4       1  4  2  7 11  3  5  8  6  9 10
btl1, 2 marked     1  4  2  7 11  3  5  8  6  9 10
refined by 2       1  4  2  7 11  3  5  8  6  9 10
btl2, 7 marked     1  4  2  7 11  3  5  8  6  9 10
refined by 7       1  4  2  7 11  3  5  8  6  9 10
refined by 11      1  4  2  7 11  3  5  8  6 10  9
refined by 6       1  4  2  7 11  3  5  8  6 10  9   *1
btl2, 11 marked    1  4  2 11  7  3  5  8  6  9 10   backtrack
refined by 11      1  4  2 11  7  3  5  8 10  6  9
refined by 7       1  4  2 11  7  3  5  8 10  6  9
refined by 6       1  4  2 11  7  3  5  8 10  6  9   *2
btl1, 7 marked     1  4  7  2 11  3  5  8  6  9 10   backtrack
refined by 7       1  4  7  2 11  3  5  8  6  ?  ?   pruned
btl1, 11 marked    1  4 11  2  7  3  5  8  6  9 10   backtrack
refined by 11      1  4 11  2  7  3  5  8 10  ?  ?   pruned
candidate 2 kept   1  4  2 11  7  3  5  8 10  6  9

5.20 Example (1-azabicyclo[4.3.2]undecane) The hypothetical 1-azabicyclo[4.3.2]undecane (structure 3, Figure 5.8) has no symmetry. The process is given step by step in Table 5.5. Using criteria i) and v), atoms 1 and 4 become unique, respectively, and all other atoms are in one class. Refinement by 1 and then by 4 allows some splitting but does not result in another unique atom. Therefore in the first class of lowest cardinality (2, 7, 11) atom 2 is marked artificially to become unique (backtrack level 1, btl1). Refinement by 2 results in atom 3 becoming unique. Refinement by 3 has no effect. Therefore now (backtrack level 2, btl2) in the first class of lowest cardinality (7, 11) atom 7 is marked unique, such that atom 11 also becomes unique, and by refinement by 7 and then by 11, atoms 6, 10, and 9 also become unique. Refinement by 6 leads to the first discrete partition (candidate 1). In Table 5.5, in each line the atom(s) that become unique is (are) printed in bold. Renumbering in the order of becoming unique gives the following mapping

initial numbering   1  4  2  3  7 11  6 10  9  5  8
renumbered          1  2  3  4  5  6  7  8  9 10 11

corresponding to the following lower half of the matrix of bond multiplicities. Note that in our example 𝛾 = 𝛾𝑏 and so the matrix of bond multiplicities is the same as the bond matrix:

      1  2  3  4  5  6  7  8  9 10 11
 1
 2    0
 3    1  0
 4    0  1  1
 5    1  0  0  0
 6    1  0  0  0  0
 7    0  0  0  0  1  0
 8    0  0  0  0  0  1  0
 9    0  0  0  0  0  0  0  1
10    0  1  0  0  0  0  1  0  0
11    0  1  0  0  0  0  0  0  1  0

Now after backtracking to btl2 atom 11 is marked, whereby atom 7 also becomes unique. Refinement by 11 and then by 7 results in atoms 10, 6, and 9 becoming unique in this order. Thus now the partial renumbering scheme is

initial numbering   1  4  2  3 11  7 10  6  9
renumbered          1  2  3  4  5  6  7  8  9

corresponding to the following partial matrix of bond multiplicities:

      1  2  3  4  5  6  7  8  9
 1
 2    0
 3    1  0
 4    0  1  1
 5    1  0  0  0
 6    1  0  0  0  0
 7    0  0  0  0  1  0
 8    0  0  0  0  0  1  0
 9    0  0  0  0  0  0  1  0

Here it is evident, looking at the matrix element (9, 7) '1', that the next discrete partition to be found, candidate 2, will be better (in the sense of our extremality criterion) than candidate 1, whose element (9, 7) is '0' while all preceding entries agree. In fact, while refinement by 10 has no effect, refinement by 6 results in candidate 2, with the following renumbering scheme and lower half of the matrix of bond multiplicities:

initial numbering   1  4  2  3 11  7 10  6  9  5  8
renumbered          1  2  3  4  5  6  7  8  9 10 11

      1  2  3  4  5  6  7  8  9 10 11
 1
 2    0
 3    1  0
 4    0  1  1
 5    1  0  0  0
 6    1  0  0  0  0
 7    0  0  0  0  1  0
 8    0  0  0  0  0  1  0
 9    0  0  0  0  0  0  1  0
10    0  1  0  0  0  0  0  1  0
11    0  1  0  0  0  0  0  0  1  0

This renumbering scheme is kept as the currently best one. Thereby btl2 is exhausted (Table 5.5). After backtracking to btl1, atom 7 is marked, and refinement by 7 results in atom 6 becoming unique, so that now the current partial renumbering scheme and partial matrix of bond multiplicities are as follows:

initial numbering   1  4  7  6
renumbered          1  2  3  4

      1  2  3  4
 1
 2    0
 3    1  0
 4    0  0  1

Here matrix element (4, 2) '0' determines that all discrete partitions to be derived from this partial numbering will be worse than candidate 2, whose element (4, 2) is '1'. Therefore this part of the backtrack tree can be pruned immediately. In exactly the same manner the last alternative at btl1, marking atom 11 with atom 10 also becoming unique after refinement by 11, is found to be worse than candidate 2. Figure 5.9 shows the backtrack tree corresponding to this example; the pruned parts of the tree are drawn as dashed lines. Candidate 2 is thus the basis of the canonical numbering finally obtained as in the previous examples and shown at the bottom of Table 5.5.
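The pruning decisions of Example 5.20 can be replayed with a short Python sketch. The function name `compare_partial` is hypothetical; the matrices are the lower halves printed in the example, stored row by row starting with row 2. The sketch only illustrates the comparison of leading digits, not the bookkeeping of the actual program.

```python
def compare_partial(partial, best):
    """Compare the rows of a partial lower half with the current best candidate.

    Returns ('better' | 'worse' | 'undecided', position of the first difference).
    """
    for r, (prow, brow) in enumerate(zip(partial, best), start=2):
        for c, (p, b) in enumerate(zip(prow, brow), start=1):
            if p != b:
                return ("better" if p > b else "worse"), (r, c)
    return "undecided", None


common = [[0], [1, 0], [0, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0, 0],
          [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0]]           # rows 2-8
candidate1 = common + [[0, 0, 0, 0, 0, 0, 0, 1],
                       [0, 1, 0, 0, 0, 0, 1, 0, 0],
                       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]]
candidate2 = common + [[0, 0, 0, 0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0, 0, 0, 1, 0],
                       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]]

partial_btl2 = candidate2[:8]              # rows 2-9, after marking atom 11 on btl2
partial_btl1 = [[0], [1, 0], [0, 0, 1]]    # rows 2-4, after marking atom 7 on btl1

print(compare_partial(partial_btl2, candidate1))   # ('better', (9, 7))
print(compare_partial(partial_btl1, candidate2))   # ('worse', (4, 2))
```

A result of 'worse' means the whole subtree below the partial numbering can be discarded, since later labels only affect rows further down.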

Fig. 5.9. The backtrack tree for structure 3, Figure 5.8. Parts of the tree that are pruned are drawn in dashed lines.

5.5.5 Profiting from symmetry

Large parts of the search tree can be pruned in cases of higher symmetry. If two labelings result in the same bond matrix at different positions in the tree, then a symmetry (automorphism) has been found. The information on automorphisms that accumulates in the process finally defines the complete automorphism group of the graph or molecule. This is stored in the form of a set of generators (a Sims chain [93, 142, 296]). This information is used to prune parts of the backtrack tree found to be equivalent to other parts already considered.

5.21 Example (Cubane) The cubane molecule (see structure 4, Figure 5.8) is highly symmetrical. The process is given step by step in Table 5.6. Steps 1 and 2 do not achieve any splitting. In step 3, marking atoms 1, 2, and 4 in btl1, btl2, and btl3, respectively, soon results in candidate 1 (*1 in Table 5.6 and Figure 5.10). At this stage the renumbering scheme (renumbering atoms in the order of their becoming unique) and the lower half of the matrix of bond multiplicities are

initial numbering   1  2  4  5  3  6  8  7
renumbered          1  2  3  4  5  6  7  8

and

      1  2  3  4  5  6  7  8
 1
 2    1
 3    1  0
 4    1  0  0
 5    0  1  1  0
 6    0  1  0  1  0
 7    0  0  1  1  0  0
 8    0  0  0  0  1  1  1

Table 5.6. Profiting from symmetry for cubane, structure 4, Figure 5.8. Atoms that become unique are printed in bold.

Initial numbers         1  2  3  4  5  6  7  8
steps 1 and 2           1  2  3  4  5  6  7  8
btl1, 1 marked          1  2  3  4  5  6  7  8
refined by 1            1  2  4  5  3  6  7  8
btl2, 2 marked          1  2  4  5  3  6  7  8
refined by 2            1  2  4  5  3  6  7  8
btl3, 4 marked          1  2  4  5  3  6  7  8
refined by 4            1  2  4  5  3  6  8  7   *1
btl3, 5 marked          1  2  5  4  3  6  7  8   backtrack
refined by 5            1  2  5  4  6  3  8  7   *2 a
btl2, 4 marked          1  4  2  5  3  6  7  8   backtrack
refined by 4            1  4  2  5  3  8  6  7
btl3, 2 marked          1  4  2  5  3  8  6  7
refined by 2            1  4  2  5  3  8  6  7   *3 a
btl2, 5 marked          1  5  2  4  3  6  7  8   backtrack
refined by 5            1  5  2  4  6  8  3  7
btl3, 2 marked          1  5  2  4  6  8  3  7
refined by 2            1  5  2  4  6  8  3  7   *4 a
btl1, 2 marked          2  1  3  4  5  6  7  8   backtrack
refined by 2            2  1  3  6  4  5  7  8
btl2, 1 marked          2  1  3  6  4  5  7  8
refined by 1            2  1  3  6  4  5  7  8
btl3, 3 marked          2  1  3  6  4  5  7  8
refined by 3            2  1  3  6  4  5  7  8   *5 a
btl1, 3 marked etc.     3  1  2  4  5  6  7  8   backtrack
candidate 1 kept        1  2  4  5  3  6  8  7

Fig. 5.10. The backtrack tree for cubane, structure 4, Figure 5.8. The branches for vertex 3–8 past btl1 are as for vertex 2.

By backtracking, marking atom 5 on btl3 and refining by 5, candidate 2 is found and the renumbering scheme is now:

initial numbering   1  2  5  4  6  3  8  7
renumbered          1  2  3  4  5  6  7  8

This results in the same bond matrix as before, so an automorphism has been found, the leftmost 'a' in Figure 5.10.

Backtracking, marking atoms 4 and 2 on btl2 and btl3, respectively, leads to candidate 3, which again produces the same bond matrix as candidate 1 (second 'a' in Table 5.6 and Figure 5.10). This automorphism derived from atom 4 marked on btl2 means that there must be another automorphism to be found as a branch originating in that node of the backtrack tree, just as there are two automorphic leaves below atom 2 on btl2. Therefore that whole branch of the tree can be pruned.

Backtracking and marking atom 5 on btl2 results in candidate 4, again automorphic to candidate 1, and as before a branch is now pruned.

Backtracking to btl1, marking atom 2, etc. finds candidate 5, again automorphic to candidate 1 (fourth 'a' in Table 5.6 and Figure 5.10). It follows that all branches originating in atom 2 on btl1 must be equivalent to all branches originating from atom 1 on btl1; they are therefore pruned. The same holds true for atoms 3–8 on btl1. Thus candidate 1 remains as the best candidate until the end of the procedure.

Note that only the renumbering schemes, not the matrices, are stored, as is evident from the above examples. This results in a rather low memory requirement. A detailed description of the canonization procedure was given in [17, 18].
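How an automorphism is read off from two candidates with identical bond matrices can be sketched as follows. The bond list of cubane in the initial numbering is not printed in the text; here it is derived from the renumbering scheme and bond matrix of candidate 1, and the helper name `lower_half` is hypothetical.

```python
# Cubane bonds in the initial numbering (derived from candidate 1 and its bond matrix).
CUBANE_BONDS = {(1, 2), (1, 4), (1, 5), (2, 3), (2, 6), (3, 4),
                (3, 7), (4, 8), (5, 6), (5, 8), (6, 7), (7, 8)}


def lower_half(bonds, order):
    """Lower half of the bond matrix after renumbering: order[k] becomes atom k + 1."""
    def bonded(a, b):
        return (min(a, b), max(a, b)) in bonds
    return [[int(bonded(order[i], order[j])) for j in range(i)]
            for i in range(1, len(order))]


candidate1 = [1, 2, 4, 5, 3, 6, 8, 7]   # initial numbers in the order of becoming unique
candidate2 = [1, 2, 5, 4, 6, 3, 8, 7]

# Equal matrices: the map candidate1[k] -> candidate2[k] is an automorphism.
same_matrix = lower_half(CUBANE_BONDS, candidate1) == lower_half(CUBANE_BONDS, candidate2)
automorphism = dict(zip(candidate1, candidate2))

# Direct check: every bond is mapped onto a bond.
image = {tuple(sorted((automorphism[a], automorphism[b]))) for a, b in CUBANE_BONDS}
print(same_matrix, image == CUBANE_BONDS)
print(automorphism)
```

The resulting permutation fixes atoms 1, 2, 7 and 8 and exchanges 4 with 5 and 3 with 6, i.e. a reflection of the cube. Storing only the two renumbering schemes is enough to recover it, in line with the remark on memory requirements.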

Scope and Limitations

At present MOLGEN–CID only treats covalently bonded compounds, which may consist of one or several components (connected or disconnected undirected graphs). Stereoisomerism is not considered yet. MOLGEN–CID is not restricted to molecular graphs; in particular, bond degrees (generalized valences) are not bounded by 4. There are applications even in organic chemistry where undirected graphs containing nodes of degree > 4 have to be canonized, see e.g. the so-called macroatoms of high degree in the former DENDRAL project and in MOLGEN 3.5.

Tests

For test purposes, the nodes of many graphs were routinely renumbered randomly five times, and in all cases all five renumbered graphs resulted in the same canonical numbering. Databases such as the NIST Mass Spectral Library (107,216 organic compounds, 5,943 duplicates or stereoisomeric pairs detected in the 1998 version) or the Maybridge Combinatorial Chemistry Database (MayDec02CCeus, 13,410 compounds with 19 duplicates) were processed by MOLGEN–CID. All the pairs of hard-to-distinguish molecules or graphs appearing in references [270] and [265] were correctly detected as non-identical by MOLGEN–CID. In addition, different drawings of the same graph (e.g. 2 non-trivial cases in [265]) were correctly identified.

Molecular graphs often contain nodes and bonds that can be differentiated easily (e.g. heteroatoms or multiple bonds). Most molecular graphs contain rather few bonds or cycles compared with the number of atoms, and thus most molecular graphs are planar graphs [263]. This means that molecular graphs are quite easy to handle for canonization, symmetry perception and isomorphism test algorithms. In the words of Read and Corneil [247]: '... most graphs present no great problem even to badly designed algorithms. The test of a graph isomorphism algorithm is how it behaves under 'worst case' conditions, i.e. how it handles the really recalcitrant graphs ...'.

Samples of such really recalcitrant mathematical graphs were compiled by Weisfeiler [332] and Mathon [190] to challenge such algorithms. These are graphs without multiple bonds or special nodes, containing high bond degrees, many regularities (e.g. all nodes with the same bond degree) and with high or seemingly high symmetry. These graphs were used for a further test of MOLGEN–CID. The 20 Mathon graphs contain between 25 and 50 nodes of degrees up to 16, among them 14 regular graphs, e.g. there are 8 regular graphs with 29 nodes of degree 14. The 39 Weisfeiler graphs are all regular: they are made of 10–28 nodes of degree 3–12, e.g. there are 15 regular graphs with 25 nodes of degree 12. These 59 graphs were canonized by MOLGEN–CID within 14.0 sec on an Athlon XP 1600 PC, 1.4 GHz. No duplicate entries were found by MOLGEN–CID in either the Mathon or the Weisfeiler sample. However, two duplicates were correctly found when comparing both sets of graphs [328]: Mathon's graph 𝐴1 25 is identical to Weisfeiler's graph 251210, and Mathon's 𝐵1 25 is identical to Weisfeiler's 25123 [268]. Further, Weisfeiler's graphs 1662 and 1661 were found isomorphic to Shrikhande's graph [295] and its twin, respectively. These latter two graphs are depicted in [265]. As another test, the 1812 fullerenes C60 were canonized by MOLGEN–CID within 72 sec on an Athlon XP 1600 PC, 1.4 GHz; all were distinguished correctly.

Surprisingly, the identity of such graphs can successfully be tested (although much more slowly) by application of traditional chemical nomenclature. Thus Weisfeiler's graph 1561 (15 nodes of degree 6) is hentriacontacyclo[7.6.0.0^{1,3}.0^{1,5}.0^{1,6}.0^{2,7}.0^{2,8}.0^{2,11}.0^{2,12}.0^{3,10}.0^{3,13}.0^{3,14}.0^{4,7}.0^{4,10}.0^{4,12}.0^{4,15}.0^{5,8}.0^{5,12}.0^{5,14}.0^{6,10}.0^{6,11}.0^{6,13}.0^{7,13}.0^{7,15}.0^{8,10}.0^{8,14}.0^{9,12}.0^{9,13}.0^{9,15}.0^{11,14}.0^{11,15}]pentadecane. The uniqueness of these so-called von Baeyer names is ensured by requirements such as
– the main ring shall contain as many carbon atoms as possible,
– the main bridge shall be as large as possible,
– the main ring shall be divided as symmetrically as possible by the main bridge,
– the superscripts locating the other bridges shall be as small as possible ...
(see e.g. rule A-32, IUPAC, Nomenclature of Organic Chemistry, Sections A,B,C,D,E,F,H, Pergamon Press, Oxford, 1979). The above name was composed by the program POLCYC, modified for vertex degrees higher than four, see [267].

An application in combinatorial chemistry

5.22 Remark (Test of the subset relation of real and virtual library) In several projects of cooperation with industry, we learnt that an experiment in combinatorial chemistry (cf. Section 7.1) usually starts from data already present in some database. In this situation it makes sense to check by computational methods whether the existing real library is a subset of the virtual library. For this purpose, initially aromatic bonds are identified in both libraries. Then all molecular graphs are canonized. A possible technique then is to consider both libraries as sequences of molecular graphs. The structures should be entered in a computer in a format as compressed as possible. The next section shows the linear representation of molecular structures used in the recent MOLGEN version. Each single structure is a character string, and these strings are then ordered lexicographically. The ordered strings may be tested for duplicates and subset relations with linear effort. Likewise, intersection and difference sets of the two libraries are obtained.
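The linear-effort comparison of two lexicographically ordered libraries can be sketched as a merge. This is a minimal sketch: the function names and the placeholder strings are hypothetical and stand for canonical character strings of molecular graphs.

```python
def is_subset_sorted(real, virtual):
    """Is every string of the sorted list `real` contained in the sorted list `virtual`?"""
    j = 0
    for s in real:
        while j < len(virtual) and virtual[j] < s:
            j += 1                      # skip virtual structures smaller than s
        if j == len(virtual) or virtual[j] != s:
            return False
    return True


def difference_sorted(real, virtual):
    """Strings of sorted `real` that do not occur in sorted `virtual` (one pass)."""
    missing, j = [], 0
    for s in real:
        while j < len(virtual) and virtual[j] < s:
            j += 1
        if j == len(virtual) or virtual[j] != s:
            missing.append(s)
    return missing


# Placeholder canonical strings, already ordered lexicographically.
virtual_library = sorted(["canon-A", "canon-B", "canon-C", "canon-D"])
real_library = sorted(["canon-B", "canon-D"])
print(is_subset_sorted(real_library, virtual_library))          # True
print(difference_sorted(["canon-B", "canon-X"], virtual_library))  # ['canon-X']
```

Both functions advance each index at most once over its list, so after sorting the effort is linear in the combined length of the two libraries.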

5.6 Data structures for molecular graphs

The data structure used for molecular graphs depends on the purpose and on the problem to be solved. For example, if an efficient formula-based structure generation plays a central role, an optimal random access to the bonds is important, and so the matrix of multiplicities M𝛾 will be used. However, this method has rather high memory requirements.

In other situations, e.g. a substructure search, fast sequential access will be favorable and only the neighborhood list is needed. A neighborhood list keeps a list of all adjacent atoms for each atom, up to three labels as well as the associated information about atoms and bonds.

Alternatively, the storage space is important where fast access to atoms and bonds may be irrelevant. This holds, for example, when performing structure generation with canonical numbering. In this case the matrix of multiplicities still plays an important role but each covalent bond only needs to be mentioned once.

In the following we describe the data format used in MOLGEN for a compact storage of molecular graphs. It uses the fact that H atoms can be represented either explicitly or implicitly. Moreover, it allows the storage of 2D and 3D placements of molecules.
1 Byte: Specification of amount and form of the data stored or to be stored
    Bit 0: 𝑏𝑁𝑎𝑚𝑒 (1, if a name is to be stored)
    Bit 1: 𝑏𝐶𝑜𝑜2𝐷 (1, if 2D coordinates are to be stored)
    Bit 2: 𝑏𝐶𝑜𝑜3𝐷 (1, if 3D coordinates are to be stored)
    Bit 3: 𝑏𝐸𝑥𝑝𝑙𝐻 (1, if H atoms will be mentioned explicitly)
    Bit 4–7: still free (for further extensions)
if 𝑏𝑁𝑎𝑚𝑒 = 1
    4 Byte: Number 𝑘 of characters in the name (unsigned int)
    𝑘 Byte: Name of the molecular graph
end
if 𝑏𝐸𝑥𝑝𝑙𝐻 = 1
    2 Byte: Number 𝑛 of atoms (unsigned short)
else
    2 Byte: Number 𝑛 of (non H) atoms (unsigned short)
end
for each (non H) atom 𝑖
    1 Byte: Atomic number −1 (unsigned char)
    1 Byte: Mass difference to the most frequent isotope (signed char)
    1 Byte: Charge (signed char)
    1 Byte: Radical position (unsigned char)
    if 𝑏𝐶𝑜𝑜2𝐷 = 1
        4 Byte: 1. coordinate in 2D placement (float)
        4 Byte: 2. coordinate in 2D placement (float)
    end
    if 𝑏𝐶𝑜𝑜3𝐷 = 1
        4 Byte: 1. coordinate in 3D placement (float)
        4 Byte: 2. coordinate in 3D placement (float)
        4 Byte: 3. coordinate in 3D placement (float)
    end
    if 𝑏𝐸𝑥𝑝𝑙𝐻 = 0
        1 Byte: Number of H atoms adjacent to atom 𝑖 (unsigned char)
    end
    1 Byte: Number of (non H) atoms 𝑗 adjacent to atom 𝑖 with 𝑗 > 𝑖 (unsigned char)
    for each (non H) atom 𝑗 > 𝑖 adjacent to atom 𝑖
        𝑠 Byte: 𝑗, where 𝑠 = 1 if 𝑛 ≤ 256, 𝑠 = 2 otherwise (unsigned char resp. unsigned short)
        1 Byte: Multiplicity of the bond between 𝑖 and 𝑗, 4 for aromatic bonds (unsigned char)

For a molecular graph with 𝑛 ≤ 256 atoms and 𝑏 bonds in which all atoms, including H atoms, are stored explicitly, we need
1 + 2 + 𝑛 ⋅ (4 + 1) + 𝑏 ⋅ 2 = 3 + 5𝑛 + 2𝑏
bytes, neglecting name and coordinates. If there are ℎ H atoms among the 𝑛 atoms, each attached by one bond, and these are stored implicitly, then only 𝑛 − ℎ atoms and 𝑏 − ℎ bonds are listed, each atom requiring one additional byte for its number of H atoms, so that
1 + 2 + (𝑛 − ℎ) ⋅ (4 + 1 + 1) + (𝑏 − ℎ) ⋅ 2 = 3 + 6𝑛 − 8ℎ + 2𝑏
bytes are needed. As quite a large proportion of the atoms in chemical compounds can be H atoms, the implicit storage of H atoms is preferable. The explicit storage of H atoms is, however, unavoidable in certain cases, e.g. H2, H+, H∙. Explicit storage is also used to store coordinates.

Certainly this data structure is not optimal with respect to compression. In particular, encoding of atoms could reduce the size further, for example by introducing a list of atom states to allow the use of a pointer to the corresponding list element.

The methods of discrete mathematics introduced in this chapter were sufficient for describing chemical compounds as discrete structures. However, once the relationships between properties and structures have to be modelled, non-discrete methods are required. Methods from supervised statistical learning theory and machine learning are particularly useful and thus some of these will be introduced in the next chapter.
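The compact storage format of Section 5.6 can be sketched with Python's `struct` module for the simplest case (no name, no coordinates, implicit H atoms). Several details are assumptions not fixed by the description above: little-endian byte order, 0-based atom indices, and the function name `pack_molecule` with its argument layout.

```python
import struct


def pack_molecule(atoms, bonds):
    """Serialize a molecular graph with implicit H atoms.

    atoms: list of (atomic_number, number_of_H_atoms) for the non-H atoms
    bonds: dict {(i, j): multiplicity} with i < j (0-based indices, an assumption)
    """
    n = len(atoms)
    out = bytearray()
    out += struct.pack("<B", 0)              # flag byte: bName = bCoo2D = bCoo3D = bExplH = 0
    out += struct.pack("<H", n)              # number of (non H) atoms
    index_format = "<B" if n <= 256 else "<H"
    for i, (atomic_number, h_count) in enumerate(atoms):
        # atomic number - 1, mass difference, charge, radical position
        out += struct.pack("<BbbB", atomic_number - 1, 0, 0, 0)
        out += struct.pack("<B", h_count)    # H atoms adjacent to atom i
        higher = sorted((j, mult) for (a, j), mult in bonds.items() if a == i)
        out += struct.pack("<B", len(higher))
        for j, mult in higher:               # each bond is mentioned once, at its smaller index
            out += struct.pack(index_format, j) + struct.pack("<B", mult)
    return bytes(out)


# Ethanol, heavy atoms C-C-O with 3, 2 and 1 implicit H atoms.
ethanol = pack_molecule([(6, 3), (6, 2), (8, 1)], {(0, 1): 1, (1, 2): 1})
print(len(ethanol))   # 25 = 3 + 6*3 + 2*2
```

With 3 stored atoms and 2 stored bonds the length is 3 + 6 ⋅ 3 + 2 ⋅ 2 = 25 bytes, which agrees with the count for implicit storage: for ethanol 𝑛 = 9, ℎ = 6 and 𝑏 = 8 give 3 + 54 − 48 + 16 = 25.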

6 Supervised statistical learning

A central problem in computational chemistry is to find empirical relationships between structures of organic compounds and their experimental physicochemical, biological or pharmaceutical properties. This is necessary whenever the functional dependence of a property on molecular structure is not known, or its calculation requires extraordinary effort. In particular, one may be interested in:
– predicting properties from a molecular structure, or rather
– deriving a molecular structure from known properties.

The first problem is known as application of quantitative structure-property relationships (QSPRs). An analogous term, quantitative structure-activity relationship (QSAR), is used if a biological activity, rather than a physicochemical property, is to be modeled. Inverse QSPR/QSAR is the search for compounds exhibiting prescribed property/activity values. Molecular structure elucidation aims at identifying an unknown compound, i.e. deducing its molecular structure from measured physicochemical properties, most often spectra.

The aim of QSPR/QSAR work is to develop mathematical models based on known cases to allow predictions for unknown cases. Furthermore, such models may lead to a better understanding of the often complex causal dependence of a property on structure. Important mathematical tools in the search for QSPRs are the statistical methods of supervised learning. Application of such methods requires a sufficiently large database of the appropriate structure-property pairs.

6.1 Variables and predicting functions

Supervised learning starts with a set of 𝑚 observations, 𝑛 independent variables 𝑋𝑗 and one dependent variable 𝑌. The terms predictors for 𝑋𝑗 and target variable for 𝑌 are perhaps more informative. Each observation 𝑖 ∈ 𝑚 gives values 𝑥𝑖𝑗 for 𝑋𝑗 and 𝑦𝑖 for 𝑌. Let us imagine these values as matrices: X = (x𝑖) = (𝑥𝑖𝑗) is an 𝑚 × 𝑛-matrix whose rows are x𝑖, Y = (𝑦𝑖) is an 𝑚 × 1-matrix. Usually, in both statistics and linear algebra, rows of an 𝑚 × 𝑛-matrix are indexed by 𝑖 = 1, ..., 𝑚 and columns by 𝑗 = 1, ..., 𝑛. For conformity with the earlier chapters of this book we choose here instead indices 𝑖 ∈ 𝑚 = {0, ..., 𝑚 − 1} and 𝑗 ∈ 𝑛 = {0, ..., 𝑛 − 1}.

For our purposes, the predictors are assumed to be continuous, i.e. they have real values. The target variable is either continuous or discrete. The first aim of supervised learning is to find a fitting (and hopefully predictive) function 𝑓, called a predicting function, whose values 𝑓(x𝑖) are in as close as reasonable agreement with the target values 𝑦𝑖 for all observations 𝑖 ∈ 𝑚. A predicting function, despite its name, is not predictive per se, as is seen from its method of construction. A better name therefore may be fitting function. We nevertheless use the traditional expression. In the next paragraphs we shall explain what this means for various types of target variables and predicting functions.

6.1 Example In molecular structure elucidation the observations are pairs of spectra and compounds. Predictors used are spectral predictors, functions that map spectra onto real numbers. The target variable is, for example, a binary molecular descriptor of a structural property 𝑆𝑃, equal to 1 if a compound has property 𝑆𝑃, and equal to 0 otherwise. The search is for a function able to predict, for a given spectrum, whether or not the corresponding compound has property 𝑆𝑃. We will calculate such predicting functions in Section 8.5.

6.2 Example In searching for QSPRs/QSARs the observations are pairs of compounds and values of an experimental property. Predictors used are molecular descriptors, quantities that map the topological (graph theoretic) or geometrical structure of a compound onto real numbers. The target variable is an experimentally determined physicochemical property or biological activity. The search is for a function able to predict the property/activity value for a given compound. We will calculate such predicting functions in Section 7.3.

6.1.1 Regression and classification

Regression

If the target variable is continuous, a method leading to a predicting function for 𝑌
\[ f : \mathbb{R}^n \to \mathbb{R}, \quad \mathbf{x} \mapsto f(\mathbf{x}) \]
is called regression. If there is only one predictor, it is a simple regression; if there is more than one predictor, it is a multiple regression. In any case, 𝑓 is chosen to minimize the absolute values of the residuals 𝑦𝑖 − 𝑓(x𝑖), the differences between the target values and the corresponding values of 𝑓. Usually 𝑓 is selected as the function minimizing the sum of squared residuals, also called the residual sum of squares (RSS),
\[ RSS = \sum_{i \in m} \bigl( y_i - f(\mathbf{x}_i) \bigr)^2 . \]

Classification

If the target variable is discrete and assumes values from a finite set C, we search for a predicting function
\[ f : \mathbb{R}^n \to C . \]
This is called classification. The predicted values 𝑓(x𝑖) should agree with the known class variable value for as many observations as possible. To express this quantitatively, we introduce a cost function
\[ L : C \times C \to \mathbb{R}_0^+ . \]
The function value 𝐿(𝑘, 𝑙) indicates how to penalize a misclassification of an observation from class 𝑘 as class 𝑙. Of course, the cost of a correct classification should be 𝐿(𝑘, 𝑘) = 0. Unless stated otherwise, in this work we will use the zero-one cost function 𝐿(𝑘, 𝑙) = 1 − 𝛿(𝑘, 𝑙), where
\[ \delta(k, l) = \begin{cases} 1 & \text{if } k = l, \\ 0 & \text{else} \end{cases} \]
is the Kronecker delta function. Often it may be helpful or even necessary to adjust the misclassification cost to the problem at hand. After the cost function 𝐿 is defined, 𝑓 is to be determined such that the total classification error (TCE)
\[ TCE = \sum_{i \in m} L\bigl( y_i, f(\mathbf{x}_i) \bigr) \]
is minimized. TCE may also be written as the sum
\[ TCE = \sum_{k \in C} CE(k) \]
of the classification errors for all classes:
\[ CE(k) = \sum_{i \in \Omega_k} L\bigl( k, f(\mathbf{x}_i) \bigr) \quad \text{with } \Omega_k = \{ i \in m \mid y_i = k \},\ k \in C . \]
Using the zero-one cost function, TCE is just the number of misclassifications.

Classification via regression

A binary classification problem, i.e. a classification problem with |C| = 2 classes, can be treated as a regression problem for the new target variable 𝑌̃ that assumes values
\[ \tilde{y}_i = \begin{cases} 1 & \text{if } y_i = 1, \\ -1 & \text{otherwise,} \end{cases} \]
for the observations 𝑖 ∈ 𝑚. Then a predicting function 𝑓̃ for 𝑌̃ is calculated by regression and is called the discriminant function. In the case of the zero-one cost function, the predicting function 𝑓 for the classification problem can be determined from 𝑓̃ as
\[ f(x) = \begin{cases} 1 & \text{if } \tilde{f}(x) \geq 0, \\ 0 & \text{otherwise.} \end{cases} \]
In the case of |C| > 2 classes, new target variables 𝑌̃𝑘, 𝑘 ∈ C, are introduced with values
\[ \tilde{y}_{ik} = \begin{cases} 1 & \text{if } y_i = k, \\ -1 & \text{otherwise,} \end{cases} \]
for 𝑖 ∈ 𝑚, 𝑘 ∈ C. We obtain predicting functions 𝑓̃𝑘 for 𝑌̃𝑘 by regression. The predicting function 𝑓 for the |C|-class problem is obtained as
\[ f(x) = \operatorname*{argmax}_{k \in C} \tilde{f}_k(x), \]
i.e. for 𝑥 ∈ ℝ^𝑛, the class 𝑘 ∈ C is returned for which 𝑓̃𝑘(𝑥) is maximal. If this cannot be determined unambiguously, a random decision is made.
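The quantities RSS and TCE and the idea of classification via regression can be illustrated by a small Python sketch with a single predictor. The function names and the toy data are hypothetical; the regression is an ordinary least-squares line, which is only one possible choice of predicting function.

```python
def fit_line(x, y):
    """Least-squares line through the points (x_i, y_i); returns the predicting function."""
    m = len(x)
    x_mean, y_mean = sum(x) / m, sum(y) / m
    slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
             / sum((a - x_mean) ** 2 for a in x))
    return lambda t: y_mean + slope * (t - x_mean)


def rss(f, x, y):
    """Residual sum of squares of f on the observations."""
    return sum((b - f(a)) ** 2 for a, b in zip(x, y))


def tce(f, x, y, cost=lambda k, l: 0 if k == l else 1):
    """Total classification error; the default is the zero-one cost function."""
    return sum(cost(b, f(a)) for a, b in zip(x, y))


# Toy binary problem: one predictor, classes 0 and 1.
x = [0.1, 0.4, 0.5, 1.2, 1.5, 1.9]
y = [0, 0, 0, 1, 1, 1]

# Regression on the recoded target (+1 / -1) yields the discriminant function.
f_tilde = fit_line(x, [1 if b == 1 else -1 for b in y])


def classify(t):
    return 1 if f_tilde(t) >= 0 else 0


print(tce(classify, x, y))   # 0
```

On these toy data the discriminant function changes sign at the mean of the predictor values, which separates the two classes, so the number of misclassifications is zero.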

6.1.2 Validation of the predicting function

There are many possible alternatives for calculating a predicting function. The number and type of the predictors and parameters may vary, while there are also different types of predicting functions, and algorithms to perform the calculations. Not surprisingly, several different predicting functions can be used for the same problem. Therefore, we need criteria to assess the quality of a predicting function and to select the highest-quality function.

Resubstitution

It is important to assess how well the predicting function fits the observed values of the dependent variable. Useful statistics are 𝑅𝑆𝑆 in the case of a regression and 𝑇𝐶𝐸 for classification problems. The calculation of these quantities is a resubstitution, since the values used to obtain the predicting function are the same as the values used to validate the function. A typical statistic for validation of a predicting function 𝑓 in a regression is the multiple correlation coefficient
\[ R = \sqrt{ 1 - \frac{RSS}{\sum_i (y_i - \bar{y})^2} } , \]
where \( \bar{y} = \frac{1}{m} \sum_i y_i \) is the arithmetic mean of the 𝑦𝑖 values. If the predicting function completely agrees with the target variable values, then 𝑅 = 1. For the trivial predicting function 𝑓 ≡ 𝑦̄, 𝑅 = 0. Often 𝑅², the squared correlation coefficient, is given, the so-called coefficient of determination.

The disadvantage of 𝑅² is that it is not useful for comparing predicting functions that contain unequal numbers of predictors, since it does not decrease with an increasing number of predictors. Therefore the standard error of a regression is introduced as
\[ S = \sqrt{ \frac{RSS}{m - d} } . \]
The complexity of the predicting function is taken into account via 𝑑, the number of its degrees of freedom. We shall discuss this in Section 6.2 for various types of predicting functions. A good predicting function has a small 𝑆 value. Another statistic often used for regression models is the empirical 𝐹 value, which is defined as
\[ F = \frac{R^2}{1 - R^2} \cdot \frac{m - d}{d - 1} \]
and is used to test the significance of a regression ([250], pp. 598–599). A good model has a high 𝐹 value.

For classifications, the mean classification error (MCE)
\[ MCE = \frac{1}{m}\, TCE \]
is often used to assess a predicting function. In the present work we use the zero-one cost function, which means that 𝑀𝐶𝐸 can also be called the misclassification rate. The misclassification rate is zero if all classifications are correct, and 1 if all classifications are erroneous. The distribution of correct and erroneous predictions over the observed classes 𝑘 can be described by
\[ MCE(k) = |\Omega_k|^{-1}\, CE(k) , \]
where 𝛺𝑘 is the index set of the observed class 𝑘. 𝛺𝑘 must not be empty.
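The resubstitution statistics can be computed directly from the definitions, as in the following sketch. The function names are hypothetical, the fitted values are made-up toy numbers, and the value 𝑑 = 2 used in the example (intercept and slope of a simple linear regression) is an assumption, since the degrees of freedom are only defined in Section 6.2.

```python
from math import sqrt


def regression_statistics(y, y_fit, d):
    """RSS, R^2, standard error S and empirical F value for fitted values y_fit."""
    m = len(y)
    y_mean = sum(y) / m
    rss = sum((a - b) ** 2 for a, b in zip(y, y_fit))
    tss = sum((a - y_mean) ** 2 for a in y)
    r2 = 1 - rss / tss
    return {
        "RSS": rss,
        "R2": r2,
        "S": sqrt(rss / (m - d)),
        "F": r2 / (1 - r2) * (m - d) / (d - 1),
    }


def misclassification_rate(y, y_fit):
    """MCE under the zero-one cost function."""
    return sum(a != b for a, b in zip(y, y_fit)) / len(y)


stats = regression_statistics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], d=2)
print({name: round(value, 4) for name, value in stats.items()})
print(misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.25
```

For the toy values 𝑅𝑆𝑆 = 0.10 and the sum of squared deviations from the mean is 5, so 𝑅² = 0.98, 𝑆 ≈ 0.2236 and 𝐹 ≈ 98.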

Test sets

The predictive ability of a predicting function is as important as its fit. To quantify this, the set of observations is first partitioned randomly into a learning set 𝐿𝑆 and a test set 𝑇𝑆, the test sample:
\[ m = LS \mathbin{\dot\cup} TS . \]
There is no widely accepted convention for the ratio of |𝐿𝑆| and |𝑇𝑆|. In this work we will use 𝐿𝑆 and 𝑇𝑆 of equal cardinality (i.e. both sets contain the same number of observations). Examples will be seen in Sections 7.5 and 8.5.

The predicting function 𝑓𝐿𝑆 is derived using only the observations from 𝐿𝑆 (i.e. 'learning' from the observations). The predictive ability is then assessed using the sum of the squared residuals of the test set:
\[ RSS_{TS} = \sum_{i \in TS} \bigl( y_i - f_{LS}(\mathbf{x}_i) \bigr)^2 . \]
Correspondingly, we define
\[ R^2_{TS} = 1 - \frac{RSS_{TS}}{\sum_{i \in TS} (y_i - \bar{y}_{TS})^2} , \quad \text{where } \bar{y}_{TS} = \frac{1}{|TS|} \sum_{i \in TS} y_i . \]
Analogously, the total classification error and the mean classification error for the test set are defined for discrete cases as
\[ TCE_{TS} = \sum_{i \in TS} L\bigl( y_i, f_{LS}(\mathbf{x}_i) \bigr) , \qquad MCE_{TS} = \frac{1}{|TS|}\, TCE_{TS} , \]
as well as classification error and misclassification rate for a single class in the test set:
\[ CE_{TS}(k) = \sum_{i \in TS_k} L\bigl( k, f_{LS}(\mathbf{x}_i) \bigr) , \qquad MCE_{TS}(k) = \frac{1}{|TS_k|}\, CE_{TS}(k) , \]
where 𝑇𝑆𝑘 = {𝑖 ∈ 𝑇𝑆 | 𝑦𝑖 = 𝑘} is the nonempty set of indices in the test set with the target variable of value 𝑘.

A sufficient number of observations is an obvious prerequisite for a test set. In QSPR/QSAR work, however, often only a few observations are available, so that the calculation of the predicting function suffers from shortage of data. In this situation it is senseless to waste valuable observations for testing only. This problem can be solved by using all observations for both learning and testing.
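The test-set procedure can be sketched as follows. The function names, the fixed random seed and the toy data are assumptions for illustration; the learning step uses a least-squares line as a stand-in for whatever predicting function is actually trained.

```python
import random


def split_learning_test(m, seed=0):
    """Random partition of the indices 0..m-1 into LS and TS of (nearly) equal size."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    half = m // 2
    return sorted(indices[:half]), sorted(indices[half:])


def fit_line(x, y):
    """Least-squares line; stands in for the learning method."""
    m = len(x)
    x_mean, y_mean = sum(x) / m, sum(y) / m
    slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
             / sum((a - x_mean) ** 2 for a in x))
    return lambda t: y_mean + slope * (t - x_mean)


def r2_test_set(f_ls, x, y, ts):
    """R^2_TS: residuals and mean are taken over the test set only."""
    y_mean_ts = sum(y[i] for i in ts) / len(ts)
    rss_ts = sum((y[i] - f_ls(x[i])) ** 2 for i in ts)
    return 1 - rss_ts / sum((y[i] - y_mean_ts) ** 2 for i in ts)


x = [float(i) for i in range(10)]
y = [2.0 * t + 1.0 for t in x]                 # noise-free toy data
ls, ts = split_learning_test(len(x))
f_ls = fit_line([x[i] for i in ls], [y[i] for i in ls])   # learn on LS only
print(len(ls), len(ts), round(r2_test_set(f_ls, x, y, ts), 6))
```

Since the toy data lie exactly on a line, the function learned on 𝐿𝑆 also predicts 𝑇𝑆 without error; with real data 𝑅² on the test set is typically lower than the resubstitution value.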

Cross-validation

Let $k \le m$. In $k$-fold cross-validation (CV), $m$ is randomly partitioned into $k$ subsets of nearly equal cardinality:

\[ m = \dot{\bigcup}_{l \in k} T_l . \]

For each $l \in k$ a predicting function $f_l$ is trained based on the observations from $m \setminus T_l$. The observations from $T_l$ excluded from learning are then predicted by $f_l$, thus testing the regression or classification for predictivity. Cross-validation can be used for comparing various subsets of predictors, types of predictive functions, or parameters for learning techniques (see examples in Sections 7.4 and 7.6). In the regression case

\[ RSS_{kCV} = \sum_{l \in k} \sum_{i \in T_l} \left( y_i - f_l(\mathbf{x}_i) \right)^2 \]

is calculated, in the classification case

\[ TCE_{kCV} = \sum_{l \in k} \sum_{i \in T_l} L\left( y_i, f_l(\mathbf{x}_i) \right) . \]

For $k < m$ these statistics depend on the random partition of $m$. If $k = m$, the method is called leave-one-out cross-validation (LOOCV). A predicting function $f_i$ is calculated for each $i \in m$, using the observations from $m \setminus \{i\} = \{0, \ldots, i-1, i+1, \ldots, m-1\}$ exclusively for learning. The formula for the sum of squared residuals is simplified to

\[ RSS_{CV} = \sum_{i \in m} \left( y_i - f_i(\mathbf{x}_i) \right)^2 . \]

The coefficient of determination and the standard error for LOOCV are defined correspondingly:

\[ R^2_{CV} = 1 - \frac{RSS_{CV}}{\sum_i (y_i - \bar{y})^2}, \qquad S_{CV} = \sqrt{\frac{RSS_{CV}}{m - d}} . \]

(An alternative and more reasonable definition would be

\[ R^2_{CV} = 1 - \frac{RSS_{CV}}{\sum_i (y_i - \bar{y}_i)^2} \quad \text{with} \quad \bar{y}_i = \frac{1}{m-1} \sum_{j \ne i} y_j , \]

since it leads to $R^2_{CV} = 0$ for trivial $f_i \equiv \bar{y}_i$. However, to maintain consistency with the literature we do not favor this definition.)

In the discrete case total and mean classification error for LOOCV are given by

\[ TCE_{CV} = \sum_{i \in m} L\left( y_i, f_i(\mathbf{x}_i) \right), \qquad MCE_{CV} = \frac{1}{m} \, TCE_{CV} . \]

There are further variations of the cross-validation principle. For instance, predicting functions $f_T$ may be trained excluding $T$ for a constant $k > 1$ and for all $k$-subsets $T \subset m$, followed by predictions for the observations from $T$. In the present work, however, we will restrict ourselves to LOOCV.
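The cross-validation statistics can be sketched as follows. The example is an illustration under stated assumptions: the learner is the trivial predicting function that always returns the mean of the learning targets, the data are hypothetical, and the function names are not taken from the text. With this trivial learner, the standard $R^2_{CV}$ becomes negative, while the alternative definition mentioned in parentheses gives exactly 0.

```python
import random


def k_fold_partition(m, k, seed=0):
    """Randomly partition range(m) into k subsets of nearly equal cardinality."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    return [indices[l::k] for l in range(k)]


def rss_cross_validation(xs, ys, fit, folds):
    """Sum of squared residuals; each fold is predicted by a function trained without it."""
    rss = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(ys)) if i not in held_out]
        f_l = fit([xs[i] for i in train], [ys[i] for i in train])
        rss += sum((ys[i] - f_l(xs[i])) ** 2 for i in test)
    return rss


def fit_mean(xs, ys):
    """Trivial predicting function: always return the mean of the learning targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


ys = [1.0, 2.0, 4.0, 7.0, 11.0]      # hypothetical target values
xs = [0.0] * len(ys)                 # predictors are ignored by fit_mean
m = len(ys)

loo_folds = [[i] for i in range(m)]  # k = m: leave-one-out
rss_cv = rss_cross_validation(xs, ys, fit_mean, loo_folds)

# Standard definition: reference is the overall mean
y_bar = sum(ys) / m
r2_cv = 1.0 - rss_cv / sum((y - y_bar) ** 2 for y in ys)

# Alternative definition: reference is the mean of the other m - 1 observations
loo_means = [(sum(ys) - y) / (m - 1) for y in ys]
r2_cv_alt = 1.0 - rss_cv / sum((y - mu) ** 2 for y, mu in zip(ys, loo_means))
```

For these five values `rss_cv` is 103.125, so the standard definition yields $R^2_{CV} = -0.5625$, whereas the alternative definition returns 0 for the trivial model.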

6.1.3 Preprocessing of data

Some learning methods require variables that fulfill particular conditions, while others will merely perform better following a data preprocessing step. A listing of preprocessing methods is found e.g. in [231]. Some examples include the removal of variables with only a constant value and retaining only one independent variable where two or more independent variables agree on each observation.

Linear transformations

There are several preliminary linear transformations for a dependent or independent continuous variable $Z$ with values $z_i$, $i \in m$:

– Centering shifts all values by the arithmetic mean:

\[ z_i^* = z_i - \bar{z} . \]

The centered data $z_i^*$ then have mean = 0.

– After range scaling, the values of a variable span the interval $[0, 1]$:

\[ z_i^* = \frac{z_i - \check{z}}{\hat{z} - \check{z}}, \qquad \text{where } \check{z} = \min_{i \in m} z_i \text{ and } \hat{z} = \max_{i \in m} z_i . \]

Range scaled data have $\min_{i \in m} z_i^* = 0$ and $\max_{i \in m} z_i^* = 1$.

– Data preprocessed by autoscaling will assume mean = 0 and variance = 1:

\[ z_i^* = \frac{z_i - \bar{z}}{s}, \qquad \text{with } s = \sqrt{\frac{\sum_{i \in m} (z_i - \bar{z})^2}{m - 1}}, \]

where $s$ is the standard deviation of variable $Z$. For the length of the vector $\mathbf{z}^* = (z_i^*)$ of autoscaled data we then have

\[ \| \mathbf{z}^* \|_2 = \sqrt{m - 1}, \qquad \text{where } \| \mathbf{z}^* \|_2 = \sqrt{\sum_{i \in m} (z_i^*)^2} \]

is the Euclidean norm of $\mathbf{z}^*$.
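The three linear transformations can be written down directly. The following sketch uses hypothetical values and assumes a non-constant variable (otherwise range scaling and autoscaling would divide by zero); the function names are illustrative.

```python
import math


def center(z):
    """Shift all values by the arithmetic mean."""
    mean = sum(z) / len(z)
    return [v - mean for v in z]


def range_scale(z):
    """Map the values linearly onto the interval [0, 1]."""
    lo, hi = min(z), max(z)
    return [(v - lo) / (hi - lo) for v in z]


def autoscale(z):
    """Center and divide by the standard deviation (denominator m - 1)."""
    m = len(z)
    mean = sum(z) / m
    s = math.sqrt(sum((v - mean) ** 2 for v in z) / (m - 1))
    return [(v - mean) / s for v in z]


z = [2.0, 4.0, 4.0, 5.0, 10.0]          # hypothetical values of a variable Z
z_centered = center(z)
z_ranged = range_scale(z)
z_auto = autoscale(z)

# Euclidean norm of the autoscaled vector; equals sqrt(m - 1) up to rounding
norm_auto = math.sqrt(sum(v * v for v in z_auto))
```

Here $m = 5$, so the norm of the autoscaled vector is $\sqrt{4} = 2$, in agreement with the identity stated above.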

Nonlinear transformations and base extensions

Depending on the distribution of values it may make sense to perform nonlinear transformations on the data, such as $n$-th root or logarithm. Nonlinearly transformed independent variables may be used as additional predictors. Further, new variables may be obtained by applying arithmetic operators to pairs or larger subsets of predictors $X_j$, $j \in n$. This is called a base extension. Quadratic base extensions are often used, where the squares $X_j^2$, $j \in n$, and products $X_k X_l$, $k \ne l$, $k, l \in n$, are used as predictors along with $X_j$.

6.1.4 Selection of variables

There are several reasons to keep the complexity of a predicting function as low as possible. The danger of overfitting is high [118] for complicated functions, i.e. the predicting function will fit the learning set too well and random effects will be integrated into the model. As a consequence the model will not be useful for prediction. Furthermore, a simple model adds more to the understanding of the real dependences. The complexity of a predicting function depends on, among other things, the number of independent variables used. It is difficult to decide a priori how many, and which, predictors are suitable for a given problem. We will explore this in the following text.

Correlation analysis

A measure of the linear interrelation of two non-constant variables $X$ and $Z$ is the correlation coefficient

\[
R(X, Z) = \frac{\sum_i (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (z_i - \bar{z})^2}}
= \frac{\sum_i x_i z_i - m \bar{x} \bar{z}}{\sqrt{\left( \sum_i x_i^2 - m \bar{x}^2 \right) \left( \sum_i z_i^2 - m \bar{z}^2 \right)}}
= \frac{m \sum_i x_i z_i - \sum_i x_i \sum_i z_i}{\sqrt{\left( m \sum_i x_i^2 - \left( \sum_i x_i \right)^2 \right) \left( m \sum_i z_i^2 - \left( \sum_i z_i \right)^2 \right)}} \in [-1, 1].
\]

If $|R(X, Z)| = 1$, $X$ and $Z$ are completely correlated; if $|R(X, Z)| = 0$, they are uncorrelated. In a plot of $X$ vs. $Z$, the data points scatter more or less along a straight line, as long as $R(X, Z) \ne 0$. Figure 6.1 shows examples of correlations of various strength together with their correlation coefficients, taken from topological indices and boiling points of a real library of decanes, described in Section 7.4.

[Figure 6.1: four scatter plots, each titled with the correlation coefficient of the plotted pair of variables: J vs. OA1, IC2 vs. CIC2, BP vs. IC2, and BP vs. 2Fv.]

Fig. 6.1. Examples of strong and weak correlations.
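As a small numerical check of the correlation coefficient, the following sketch implements the first and the third form of the formula and applies them to hypothetical data (not the decane library of Figure 6.1). Both functions assume non-constant variables.

```python
import math


def correlation(x, z):
    """Correlation coefficient from deviations about the means (first form)."""
    m = len(x)
    x_bar = sum(x) / m
    z_bar = sum(z) / m
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    sxx = sum((a - x_bar) ** 2 for a in x)
    szz = sum((b - z_bar) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)


def correlation_from_sums(x, z):
    """Same coefficient computed from raw sums only (third form)."""
    m = len(x)
    sx, sz = sum(x), sum(z)
    sxz = sum(a * b for a, b in zip(x, z))
    sxx = sum(a * a for a in x)
    szz = sum(b * b for b in z)
    return (m * sxz - sx * sz) / math.sqrt((m * sxx - sx ** 2) * (m * szz - sz ** 2))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
z_affine = [3.0 * v + 2.0 for v in x]      # Z = aX + b with a > 0
z_reversed = [-v for v in x]               # Z = aX + b with a < 0

r_affine = correlation(x, z_affine)
r_reversed = correlation(x, z_reversed)
r_uncorrelated = correlation([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```

The affine cases give $R = 1$ and $R = -1$, and the symmetric third data set gives $R = 0$ even though $Z$ is a (nonlinear) function of $X$, which illustrates that $R$ measures linear interrelation only.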

If $X$ and $Z$ are completely correlated, then there is a representation $Z = aX + b$ with $a \ne 0$. There is a term from linear algebra to describe this situation: affine dependence. Complete correlation is an equivalence relation on the set of variables.

If a single predictor $X$ is chosen from a larger set of $X_j$, $j \in n$, to form a regression for the target variable $Y$, then the $X$ with the highest absolute correlation coefficient with $Y$ should be selected. This is the optimal choice particularly in linear regression (see Subsection 6.2.1), since the coefficient of determination $R^2$ in a simple linear regression just equals $R(X, Y)^2$.

The situation is more complex in multiple and/or nonlinear regression. For these cases, predictors with maximal correlation coefficients to $Y$ do not necessarily form best subsets, especially if the predictors are strongly intercorrelated. Highly intercorrelated predictors are therefore often eliminated.

Fisher ratios

When dealing with binary classification problems using a single descriptor, Fisher ratios (FR) can be used to select the best predictor [60, 315, 324]. Such ratios are defined as

\[ FR(X, Y) = \frac{\mu_0 - \mu_1}{\sigma_0 + \sigma_1}, \]

where $\mu_k$ are the arithmetic means and $\sigma_k$ are the variances of $x_i$ within the classes explained by $Y$, i.e. with $\Omega_k = \{ i \in m \mid y_i = k \}$, $k \in \mathcal{C}$,

\[ \mu_k = |\Omega_k|^{-1} \sum_{i \in \Omega_k} x_i \qquad \text{and} \qquad \sigma_k = (|\Omega_k| - 1)^{-1} \sum_{i \in \Omega_k} (x_i - \mu_k)^2 . \]

Again, interdependences among predictors are not taken into account. Thus, a full search through all available predictors must be performed to select the best predicting function comprising more than one predictor.
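A minimal sketch of the Fisher ratio for a binary target is given below, using the ratio of mean difference to summed class variances as written above. Ranking the descriptors by the absolute value of the ratio is an assumption made here for illustration; the data and names are hypothetical.

```python
def fisher_ratio(x, y):
    """FR(X, Y) = (mu_0 - mu_1) / (sigma_0 + sigma_1), sigma_k being class variances."""
    classes = {0: [], 1: []}
    for xi, yi in zip(x, y):
        classes[yi].append(xi)
    mu, sigma = {}, {}
    for k, values in classes.items():
        mu[k] = sum(values) / len(values)
        # variance within class k, denominator |Omega_k| - 1
        sigma[k] = sum((v - mu[k]) ** 2 for v in values) / (len(values) - 1)
    return (mu[0] - mu[1]) / (sigma[0] + sigma[1])


y = [0, 0, 0, 1, 1, 1]                        # hypothetical binary target
descriptors = {
    "separating": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "overlapping": [1.0, 5.0, 9.0, 2.0, 6.0, 10.0],
}

ratios = {name: fisher_ratio(x, y) for name, x in descriptors.items()}
# Assumption: the descriptor with the largest |FR| is considered the best predictor
best = max(ratios, key=lambda name: abs(ratios[name]))
```

The well-separated descriptor has class means 2 and 8 with variance 1 in each class, giving a ratio of $-3$, while the overlapping one gives $-1/32$.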

Best subset selection

We are particularly interested in finding, for a given $k \le m$, the $k$-subsets of variables that achieve best predicting functions (best subset selection, BSS). The trivial solution of this problem is to search all $k$-subsets, determine a predicting function for each, and then select the best of these. However, this requires a high computational effort. There are linear algebra techniques that can be used to minimize the effort in the case of linear regression [82]. Nevertheless, often it is impossible to search all subsets in reasonable time.

Stepwise subset selection

In chemistry applications, an unlimited number of variables are available in principle. Stepwise procedures allow a deeper coverage of the variables. In the simple stepwise procedure (see, e.g. [299], pp. 174 ff), the predictor $X_{i_1}$ that leads to the best 1-variable predicting function is determined first. In the second step, 2-subsets that contain $X_{i_1}$ are used in predicting functions. Say the best such function contains predictors $X_{i_1}$ and $X_{i_2}$. In the $j$-th step, all $j$-subsets containing $X_{i_1}, \ldots, X_{i_{j-1}}$ are searched. Thus, only $\sum_{i \in k} (n - i)$ predicting functions are calculated, compared with $\binom{n}{k}$ functions in the case of a full $k$-subset search. Unfortunately, this procedure does not necessarily lead to the best predicting function comprising $k$ predictors, as can be shown for simple examples.

In some cases it helps to enlarge by one variable at each step using not only the best predictor subset from the previous step, but all $l > 1$ best subsets. Although this $l$-fold stepwise procedure is still not guaranteed to find the very best $k$-subset, it does find better models than the simple stepwise procedure for most cases with higher $l$.
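The difference between a full $k$-subset search and the simple stepwise procedure can be sketched with a generic quality function. In the example below, `score` stands for any measure of a predicting function built on a subset of predictors (for instance $R^2$); the score table is hypothetical and was chosen so that the stepwise search misses the best pair.

```python
from itertools import combinations
from math import comb


def stepwise_selection(n, k, score):
    """Simple forward stepwise search over predictor indices 0..n-1.

    `score` maps a tuple of predictor indices to a quality value (higher is better).
    Returns the selected indices and the number of predicting functions evaluated.
    """
    selected = []
    evaluations = 0
    for _ in range(k):
        best_j, best_score = None, None
        for j in range(n):
            if j in selected:
                continue
            s = score(tuple(selected + [j]))
            evaluations += 1
            if best_score is None or s > best_score:
                best_j, best_score = j, s
        selected.append(best_j)
    return selected, evaluations


# Hypothetical scores: predictor 0 is the best single predictor,
# but the pair {1, 2} is the best 2-subset.
SCORES = {
    frozenset({0}): 0.6, frozenset({1}): 0.5, frozenset({2}): 0.5,
    frozenset({0, 1}): 0.7, frozenset({0, 2}): 0.7, frozenset({1, 2}): 0.9,
}


def toy_score(subset):
    return SCORES[frozenset(subset)]


selected, evaluations = stepwise_selection(3, 2, toy_score)
best_subset = max(combinations(range(3), 2), key=toy_score)

# Number of predicting functions for n = 20 and k = 4
stepwise_cost = sum(20 - i for i in range(4))
full_search_cost = comb(20, 4)
```

In this toy case the stepwise search returns the subset {0, 1}, whereas the full search finds {1, 2}. For $n = 20$ and $k = 4$ the stepwise procedure evaluates 74 predicting functions compared with 4845 for the full search.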

Selection bias

Another problem associated with variable selection is selection bias [118, 184, 305]. Consider any real data set, i.e. a vector of observations (the target variable) together with several vectors of descriptors. Even if none of the descriptors has any real (causal) correlation with the target variable, there will be descriptors or combinations of descriptors that fit the target data more or less well, simply by chance. The fit will be better the fewer the observations, the more descriptors in the model, and the larger the descriptor pool. This is true both for real correlations and for chance correlations. A good descriptor selection procedure (e.g. BSS or $l$-fold stepwise search) will select precisely the models of highest $R^2$, irrespective of whether they are real or chance. Thus, it remains to determine whether a particular model is real or chance. Although this cannot be determined, we can determine instead whether the model describes the data significantly better than pure chance. The procedure is to obtain the best models by selecting descriptors for data sets consisting of random numbers in exactly the same manner as for the real data, and to compare such models’ $R^2$ values to $R^2$ of the original model [118, 266, 305]. Unfortunately, this is a tedious task, since many random data sets have to be treated for each real data set. For multilinear regression or multiple linear regression (MLR), this procedure is implemented in an add-on to MOLGEN–QSPR called RandomQSPR. The results obtained in this manner will be shown in the following chapter. A similar but less powerful test that is often used is $y$-randomization. Random experiments are, of course, not restricted to MLR, but should be mandatory in all cases involving automatic descriptor selection.

6.2 Models for predicting functions

6.2.1 Linear models

Linear models (LM) are based on predicting functions $f$ that weight the predictors linearly:

\[ f(\mathbf{x}) = \sum_{j \in n} a_j x_j + b , \]

where $n$ is the number of predictors appearing in the function. The number of degrees of freedom is $d = n + 1$, resulting from the number of freely adjustable parameters $a_j$ and $b$. With $\mathbf{c} = (a_0, \ldots, a_{n-1}, b)^T \in \mathbb{R}^{(n+1) \times 1}$ and $\mathbf{z} = (x_0, \ldots, x_{n-1} \mid 1)$, we may write $f$ as a matrix multiplication

\[ f(\mathbf{x}) = \mathbf{z} \mathbf{c} . \]

Using methods of linear algebra and multidimensional analysis, $\mathbf{c}$ can be determined such that residuals are as small as possible. This is multiple linear regression, MLR.

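Least squares regression

For $m$ observations we want to determine parameters $a_j$ and $b$ such as to minimize

\[ RSS(\mathbf{c}) = \sum_{i \in m} (y_i - \mathbf{z}_i \mathbf{c})^2 , \]

where $\mathbf{z}_i$ are the row vectors of $\mathbf{Z} = (x_{ij} \mid 1) \in \mathbb{R}^{m \times (n+1)}$. $RSS(\mathbf{c})$ may also be written as the product

\[ RSS(\mathbf{c}) = (\mathbf{y} - \mathbf{Z}\mathbf{c})^T (\mathbf{y} - \mathbf{Z}\mathbf{c}) . \]

Differentiation with respect to $\mathbf{c}$ gives

\[ \frac{\partial RSS}{\partial \mathbf{c}} (\mathbf{c}) = -2 \mathbf{Z}^T (\mathbf{y} - \mathbf{Z}\mathbf{c}) . \]

If $\mathbf{Z}$ is of full rank, the Hesse matrix

\[ \frac{\partial^2 RSS}{\partial \mathbf{c} \, \partial \mathbf{c}^T} (\mathbf{c}) = 2 \mathbf{Z}^T \mathbf{Z} \]

is positive definite, and $RSS$ has a global minimum if

\[ \mathbf{Z}^T (\mathbf{y} - \mathbf{Z}\mathbf{c}) = \mathbf{0} . \]

Now $\mathbf{c}$ can be calculated as

\[ \mathbf{c} = (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T \mathbf{y} , \]

by finding the inverse matrix $(\mathbf{Z}^T \mathbf{Z})^{-1}$ via Cholesky decomposition. In the following we use the numerically more stable QR decomposition of $\mathbf{Z}$, where $\mathbf{Z} = \mathbf{Q}\mathbf{R}$ is written as product of an orthonormal matrix $\mathbf{Q}$ and an upper triangular matrix $\mathbf{R}$. The equation $\mathbf{y} - \mathbf{Z}\mathbf{c} = \mathbf{0}$ is then solved in the least-squares sense by initially calculating $\mathbf{Q}^{-1} \mathbf{y} = \mathbf{Q}^T \mathbf{y}$ and then determining $\mathbf{c}$ by taking advantage of the triangular structure of $\mathbf{R}$:

\[ \mathbf{R}\mathbf{c} = \mathbf{Q}^T \mathbf{y} . \]

This manner of minimizing $RSS$ is called ordinary least squares (OLS) regression. In the following chapters it plays an important role in determining the predicting functions.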
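The QR route to the OLS parameters can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation used in the book: it builds a thin QR factorization by modified Gram–Schmidt (production code would normally rely on Householder reflections from a numerical library), assumes that $\mathbf{Z}$ has full column rank, and uses hypothetical data.

```python
import math


def qr_least_squares(Z, y):
    """Solve min ||y - Zc||_2 via a thin QR factorization and back substitution."""
    m, p = len(Z), len(Z[0])
    columns = [[Z[i][j] for i in range(m)] for j in range(p)]
    Q = []                                   # orthonormal columns
    R = [[0.0] * p for _ in range(p)]        # upper triangular factor
    for j in range(p):
        v = columns[j][:]
        for k in range(j):
            # modified Gram-Schmidt: project the current remainder onto q_k
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(t * t for t in v))   # zero only if Z is rank deficient
        Q.append([t / R[j][j] for t in v])
    # right-hand side Q^T y
    rhs = [sum(Q[j][i] * y[i] for i in range(m)) for j in range(p)]
    # back substitution for R c = Q^T y
    c = [0.0] * p
    for j in reversed(range(p)):
        c[j] = (rhs[j] - sum(R[j][k] * c[k] for k in range(j + 1, p))) / R[j][j]
    return c


# Hypothetical observations: rows of Z are (x_i | 1), so c = (a, b)
Z = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0.0, 1.0, 1.0]
c = qr_least_squares(Z, y)
```

For these three points the least-squares line has slope $a = 0.5$ and intercept $b = 1/6$, the same values that the normal equations $\mathbf{c} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{y}$ give.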


Principal component regression

In principal component regression (PCR), singular value decomposition (SVD) is used to represent $\mathbf{Z}$ with the product

\[ \mathbf{Z} = \mathbf{U} \mathbf{S} \mathbf{V}^T \]

of two orthonormal matrices $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{(n+1) \times (n+1)}$ and a diagonal matrix $\mathbf{S} \in \mathbb{R}^{m \times (n+1)}$. The diagonal elements $s_0 \ge \ldots \ge s_{r-1} > 0 = \ldots = 0$ are the singular values, and $r$ designates the rank of $\mathbf{Z}$. If $\mathbf{Z}$ is of full rank, then we can calculate the parameters of the predicting function as

\[ \mathbf{c} = \mathbf{V} \, \mathrm{diag}(1/s_i) \, \mathbf{U}^T \mathbf{y} \]

and they are identical to those from OLS regression. In contrast to QR decomposition, SVD is possible even if $\mathbf{Z}$ is not of full rank. The major aim of PCR is to use the singular values that result in best predictivity (see example in Section 7.5).
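The agreement with OLS and the role of the singular values can be made explicit in a short derivation. The restriction to a subset $P$ of singular values written below is one common reading of PCR and is stated here as an assumption; the symbols $\mathbf{u}_i$, $\mathbf{v}_i$ (columns of $\mathbf{U}$ and $\mathbf{V}$), $P$ and $\mathbf{c}_P$ are introduced for illustration only. For $\mathbf{Z}$ of full rank,

\[
\mathbf{Z}^T \mathbf{Z} = \mathbf{V} \mathbf{S}^T \mathbf{S} \mathbf{V}^T
\quad \Longrightarrow \quad
(\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T = \mathbf{V} (\mathbf{S}^T \mathbf{S})^{-1} \mathbf{S}^T \mathbf{U}^T = \mathbf{V} \, \mathrm{diag}(1/s_i) \, \mathbf{U}^T ,
\]

so that the SVD expression coincides with the OLS solution. Written as a sum over singular values, and keeping only the indices in a chosen subset $P \subseteq \{0, \ldots, r-1\}$,

\[
\mathbf{c}_P = \sum_{i \in P} \frac{\mathbf{u}_i^T \mathbf{y}}{s_i} \, \mathbf{v}_i .
\]

With $P = \{0, \ldots, r-1\}$ and $r = n + 1$ this reproduces the OLS parameters; smaller subsets $P$ give the candidate models among which the one with the best predictivity would be chosen.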

Linear classification procedures

In classification using linear procedures, the class borders are described as hyperplanes in an $n$-dimensional data space. These hyperplanes are obtained either as in Subsection 6.1.1 by classification via regression, or by linear discriminant analysis (LDA). Both methods are described in detail in Chapter 4 of [117]. In the present work we will mostly use binary classification (see Section 7.6 and Subsection 8.5.2).

6.2.2 Neural networks

Often the dependence of a target variable on predictor variables cannot be described sufficiently using a linear predicting function. Nonlinear predicting functions include artificial neural networks (ANN). In the present work we will use feedforward networks containing one hidden layer, using linear starting weights. Figure 6.2 shows the architecture of such a network. Between the input layer and the hidden layer each hidden neuron (HN) $k \in h$ is weighted linearly,

\[ g_k(\mathbf{x}) = \mathbf{a}_k \mathbf{x}^T + b_k , \]

where $b_k$ is the weight of the bias neuron belonging to the hidden neurons. An activating function

\[ a(z) = (1 + e^{-z})^{-1} \]

is applied in the hidden layer. Between the hidden layer and the output layer another linear weighting is performed with $\alpha_k$, $k \in h$, and bias $\beta$. A feedforward network with a layer $h$ of hidden neurons and linear starting weights is a realization of the following model function:

\[ f(\mathbf{x}) = \sum_{k \in h} \frac{\alpha_k}{1 + e^{-(\mathbf{a}_k \mathbf{x}^T + b_k)}} + \beta . \]

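[Figure 6.2: network diagram. The input layer holds the prediction variables $X_0, X_1, \ldots, X_{n-1}$ and a bias neuron (1); weights $a_{kj}$, $b_k$ lead to the hidden layer, which has its own bias neuron (1); weights $\alpha_k$, $\beta$ lead to the output layer with the target variable $Y$.]

Fig. 6.2. Scheme of a neural network with one hidden layer and bias neurons.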

The number of degrees of freedom $d = (n + 2)h + 1$ is the number of freely adjustable parameters $a_{kj}$, $b_k$, $\alpha_k$ and $\beta$.

Various methods of nonlinear optimization are available to determine these parameters (e.g. Levenberg–Marquardt, Gauss–Newton, backpropagation, steepest descent). In the present work, the implementation contained in the statistics package R was used, which is based on Newton optimization. Importantly, both target variable and predictors were range scaled. Usually, one starts with random numbers for the parameters, which has the disadvantage that the networks are not reproducible. However, as a rule, random starting parameters provide a better predicting function than prefixed parameters, e.g. zero. Details on the training of neural networks are found in Chapter 11 of [117]; a plethora of other types of neural networks is introduced in [350] together with applications in chemistry.

In many situations neural networks provide good predicting functions. Nevertheless, they have two (further) disadvantages: First, the high degree of interdependence and the corresponding complexity of the predicting function do not allow an interpretation or deeper insight. Second, there is no assessment of the predicting function’s optimality, which is unsatisfying from a mathematical point of view. As a rule, the algorithm used for training of an ANN terminates in a local minimum. A mathematically more convincing approach will be introduced in the next section.
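The model function of the one-hidden-layer network and its parameter count can be evaluated with a few lines of code. The sketch below covers only the forward evaluation; training is not shown, and the weights are hypothetical values chosen so that the result is easy to verify by hand.

```python
import math


def logistic(z):
    """Activating function a(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def network_output(x, a, b, alpha, beta):
    """f(x) = sum_k alpha_k * logistic(a_k . x + b_k) + beta."""
    total = beta
    for a_k, b_k, alpha_k in zip(a, b, alpha):
        g_k = sum(w * xi for w, xi in zip(a_k, x)) + b_k   # linear weighting
        total += alpha_k * logistic(g_k)
    return total


def degrees_of_freedom(n, h):
    """Number of freely adjustable parameters a_kj, b_k, alpha_k and beta."""
    return (n + 2) * h + 1


# Hypothetical network with n = 2 predictors and h = 2 hidden neurons
a = [[0.0, 0.0], [0.0, 0.0]]     # weights a_kj
b = [0.0, 0.0]                   # hidden-layer bias weights b_k
alpha = [2.0, 4.0]               # output weights alpha_k
beta = 1.0                       # output bias

f_value = network_output([1.0, 2.0], a, b, alpha, beta)
d = degrees_of_freedom(2, 2)
```

With all hidden weights set to zero each hidden neuron outputs 0.5, so $f(\mathbf{x}) = 2 \cdot 0.5 + 4 \cdot 0.5 + 1 = 4$. The parameter count is $4 + 2 + 2 + 1 = 9 = (n + 2)h + 1$.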

6.2.3 Support vector machines

The notion of support vector machines (SVM) was conceived by V. Vapnik in the mid-1990s [51, 313, 314]. Initially, SVMs were designed to solve binary classification problems by searching for an optimal separating hyperplane between two classes (see Figure 6.3). The separating hyperplane is chosen such that it is surrounded by a margin of maximal depth between the data points from either class that are closest to each other. Data points lying on the margin borders are called support vectors; they are marked by ‘SV’ in Figure 6.3.

[Figure 6.3: two classes of data points divided by a separating hyperplane with its margin; the support vectors on the margin borders are labeled ‘SV’.]

Fig. 6.3. Support vector classification, where the classes can be separated.

If the two classes are not linearly separable, two additional strategies within the SVM can be used. First, data points on the wrong side of the margin are allowed, but their influence is minimized. Second, by means of a base extension (see Subsection 6.1.3), data points can be mapped into a space of higher dimension, where they may be separated more easily.

These strategies can be written and solved as a quadratic optimization problem with linear inequalities as additional constraints (see e.g. [117]). In the present work we use the implementation libsvm [43] that is also available in the R package e1071 [207]. For a binary classification, the predicting function is of the form

\[ f(\mathbf{x}) = \begin{cases} 1 & \text{if } \tilde{f}(\mathbf{x}) \ge 0, \\ 0 & \text{otherwise,} \end{cases} \]

where

\[ \tilde{f}(\mathbf{x}) = \sum_{i \in m} \tilde{y}_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \qquad \text{and} \qquad \tilde{y}_i = \begin{cases} 1 & \text{if } y_i = 1, \\ -1 & \text{otherwise.} \end{cases} \]

$\alpha_i$ is not equal to zero only for support vectors $\mathbf{x}_i$. The kernel function $K$ is a realization of the base extension mentioned above. In the implementation used here [43] the following kernel functions are available:

\[
\begin{array}{ll}
\text{Linear kernel:} & K(\mathbf{x}, \mathbf{x}') = \mathbf{x}' \mathbf{x}^T , \\
\text{Polynomial of degree } d\text{:} & K(\mathbf{x}, \mathbf{x}') = (\gamma \mathbf{x}' \mathbf{x}^T + c)^d , \\
\text{Radial base function:} & K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \| \mathbf{x}' - \mathbf{x} \|_2^2) , \\
\text{Sigmoid function:} & K(\mathbf{x}, \mathbf{x}') = \tanh(\gamma \mathbf{x}' \mathbf{x}^T + c) .
\end{array}
\]

$\gamma$, $c$ and $d$ are freely adjustable. As a rule, we set $\gamma = 1$; best values of $\gamma$ and $d$ are determined by CV or using a test set.

In the recent past SVM have been increasingly used to solve problems in computational chemistry. In a comparison of SVM and ANN for classification of pharmaceutically inactive or active compounds, SVM consistently yielded smaller classification errors [41]. For the classification of mass spectra (see Subsection 8.5.2), SVM with a radial kernel proved to be the best predicting functions.

The concept of SVM may be transferred to continuous target variables. For regressions, the predicting function has the form

\[ f(\mathbf{x}) = \sum_{i \in m} a_i K(\mathbf{x}_i, \mathbf{x}) + b . \]

Often SVMs have a large number of support vectors, with complex predicting functions that barely promote the understanding of causal dependences. In the following section we will learn about predicting functions that are easier to interpret.
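The kernel functions and the binary decision rule can be sketched as follows. The kernels are written in the forms given in libsvm's documentation; the support vectors, coefficients and bias are hypothetical values chosen by hand and are not the result of solving the quadratic optimization problem.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, x_prime):
    return dot(x_prime, x)


def polynomial_kernel(x, x_prime, gamma=1.0, c=0.0, d=3):
    return (gamma * dot(x_prime, x) + c) ** d


def rbf_kernel(x, x_prime, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x_prime, x)))


def sigmoid_kernel(x, x_prime, gamma=1.0, c=0.0):
    return math.tanh(gamma * dot(x_prime, x) + c)


def svm_classify(x, support, kernel, b):
    """Binary decision: 1 if sum_i y~_i alpha_i K(x_i, x) + b >= 0, else 0.

    `support` lists only the support vectors as (x_i, y_tilde_i, alpha_i),
    since alpha_i vanishes for all other observations.
    """
    f_tilde = sum(y_t * alpha * kernel(x_i, x) for x_i, y_t, alpha in support) + b
    return 1 if f_tilde >= 0 else 0


# Hypothetical support vectors (x_i, y~_i, alpha_i) and bias
support = [([1.0, 1.0], 1, 1.0), ([-1.0, -1.0], -1, 1.0)]
b = 0.0

class_a = svm_classify([2.0, 0.0], support, linear_kernel, b)
class_b = svm_classify([-1.0, -3.0], support, linear_kernel, b)
```

For the point (2, 0) the sum is $2 + 2 = 4 \ge 0$, giving class 1; for (-1, -3) it is $-4 - 4 = -8$, giving class 0. Swapping `linear_kernel` for `rbf_kernel` changes the shape of the class border without changing the decision rule.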

6.2.4 Decision trees

Another procedure to obtain predicting functions is to recursively partition $\mathbb{R}^n$ into hyperrectangles. The resulting predicting functions may be represented as decision trees. Depending on the type of target variable these are called classification or regression trees. Classification and regression trees (CART) are often easier to interpret.

A decision tree is a binary rooted tree, i.e. there is one initial node $V_0$, the root, and each node (other than a leaf) has exactly two successors. The leaves are also called terminal nodes; all remaining nodes are internal nodes. Internal nodes $V_k$ bear decision rules of the form $X_{j_k} < a_k$, terminal nodes bear function values $\hat{y}_k$. The node numbering is such that an internal node $V_k$ has successors $V_{2k+1}$ and $V_{2k+2}$, see Figure 6.4.

It is relatively easy to apply the decision tree predicting function, so that even large trees are evaluated quickly. Starting at the root, the internal nodes are visited according to their decision rules: If the decision rule at $V_k$ is fulfilled, then node $V_{2k+1}$ is visited next, otherwise node $V_{2k+2}$. If a terminal node is reached, its function value is returned, otherwise the next decision rule is processed.

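[Figure 6.4: binary tree. The root $V_0$ carries the rule $X_{j_0} < a_0$ with successors $V_1$ (true) and $V_2$ (false); a general internal node $V_k$ carries the rule $X_{j_k} < a_k$ with successors $V_{2k+1}$ (true) and $V_{2k+2}$ (false).]

Fig. 6.4. Scheme of a decision tree.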
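The evaluation of a decision tree with the node numbering $V_k \to V_{2k+1}, V_{2k+2}$ can be sketched as below. Storing the nodes in a dictionary keyed by $k$ is a representation chosen here for illustration; the tree itself is hypothetical.

```python
def predict(tree, x):
    """Evaluate a decision tree stored as {k: ("split", j, a) or ("leaf", value)}."""
    k = 0                                        # start at the root V_0
    while True:
        node = tree[k]
        if node[0] == "leaf":
            return node[1]                       # terminal node: return its function value
        _, j, a = node
        k = 2 * k + 1 if x[j] < a else 2 * k + 2  # rule fulfilled: V_{2k+1}, else V_{2k+2}


# Hypothetical regression tree with two decision rules
tree = {
    0: ("split", 0, 5.0),      # V_0: X_0 < 5
    1: ("leaf", 1.0),          # V_1
    2: ("split", 1, 2.0),      # V_2: X_1 < 2
    5: ("leaf", 2.0),          # V_5 = V_{2*2+1}
    6: ("leaf", 3.0),          # V_6 = V_{2*2+2}
}

y_a = predict(tree, [3.0, 0.0])
y_b = predict(tree, [7.0, 1.0])
y_c = predict(tree, [7.0, 4.0])
```

The three inputs end in the terminal nodes $V_1$, $V_5$ and $V_6$ and return 1.0, 2.0 and 3.0, respectively; the number of rules processed is at most the depth of the tree.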

In constructing a decision tree, the learning set is successively partitioned into two disjoint subsets. Partitioning is done according to a binary decision rule $X_j < a$. Let $\Omega_k$ be the index set of the observations represented by $V_k$. Further,

\[ \Omega_k(j, a) = \{ i \in \Omega_k \mid x_{ij} < a \} \qquad \text{and} \qquad \Omega_k'(j, a) = \{ i \in \Omega_k \mid x_{ij} \ge a \} \]

are the index sets resulting from splitting $\Omega_k$ by limit $a$ of $X_j$. Now, $j$ and $a$ are determined such that the property values within successor nodes $V_{2k+1}$ and $V_{2k+2}$ are as homogeneous as possible. For a regression tree this means

\[ \sum_{i \in \Omega_k(j, a)} (y_i - \mu)^2 + \sum_{i \in \Omega_k'(j, a)} (y_i - \mu')^2 \]

should be minimal. Here,

\[ \mu = |\Omega_k(j, a)|^{-1} \sum_{i \in \Omega_k(j, a)} y_i \qquad \text{and} \qquad \mu' = |\Omega_k'(j, a)|^{-1} \sum_{i \in \Omega_k'(j, a)} y_i \]

are the means of property values within $\Omega_k(j, a)$ and $\Omega_k'(j, a)$, respectively. Finally, optimal $j$ and $a$ define the decision rule at $V_k$, and the construction is continued at $V_{2k+1}$ and $V_{2k+2}$ with $\Omega_{2k+1} = \Omega_k(j, a)$ and $\Omega_{2k+2} = \Omega_k'(j, a)$. If $V_{2k+1}$ or $V_{2k+2}$ is a terminal node, then $\hat{y}_{2k+1} = \mu$ or $\hat{y}_{2k+2} = \mu'$ is also the corresponding function value.

Thus, the variable selection is performed while constructing the decision tree, and therefore constructing a decision tree may be understood as a method of variable selection. Since each choice depends on a local condition, the selection is extremely fast and can be performed in reasonable time even for very large sets of potential predictors.

Further details such as stopping criteria for growing decision trees, strategies for pruning decision trees and for constructing classification trees are described in [36]. An example of a classification tree to detect the presence of bromine in a mass spectrum is shown on pages 205–215. We will consider similar problems in Chapter 8, using an implementation by B. D. Ripley ([251], Chapter 7) via an interface to the statistics software R.

CART has not been used often to solve problems in chemistry. One of the few exceptions is [327], where the target variable is the pharmaceutical activity of compounds. An interesting further development of regression trees is the software CUBIST [239], which combines recursive partitioning with linear regression. The predicting functions are regression trees (RT) with linear models (LM) at terminal nodes. In [40] the aqueous solubility of compounds is modeled using this method.
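The search for the decision rule at a single node of a regression tree can be sketched as an exhaustive scan over predictors and limits. Taking the midpoints between consecutive distinct values as candidate limits $a$ is an assumption made here (any limit between two neighboring values produces the same split); the data are hypothetical.

```python
def best_split(X, y, omega):
    """Return (j, a, cost) minimizing the summed squared deviations of both successors.

    X is a list of observation rows, y the target values and omega the index set
    of the observations represented by the current node. Returns None if no
    predictor takes more than one value within omega.
    """
    def squared_deviation(indices):
        mu = sum(y[i] for i in indices) / len(indices)
        return sum((y[i] - mu) ** 2 for i in indices)

    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in omega})
        for lo, hi in zip(values, values[1:]):
            a = (lo + hi) / 2.0                              # candidate limit
            left = [i for i in omega if X[i][j] < a]         # Omega_k(j, a)
            right = [i for i in omega if X[i][j] >= a]       # Omega_k'(j, a)
            cost = squared_deviation(left) + squared_deviation(right)
            if best is None or cost < best[2]:
                best = (j, a, cost)
    return best


# Hypothetical learning set: the target depends on predictor 0 only
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 10.0], [4.0, 20.0]]
y = [1.0, 1.0, 5.0, 5.0]
split = best_split(X, y, omega=[0, 1, 2, 3])
```

The scan selects predictor 0 with limit 2.5, which separates the target values perfectly (cost 0), whereas the only possible split on predictor 1 leaves a cost of 16. Repeating the search on the two resulting index sets grows the tree.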

6.2.5 Nearest neighbors

In previous sections, model functions were determined arithmetically by regression or classification procedures. The method of $k$ nearest neighbors (KNN) is a ‘model-free’ learning procedure. The data are not structured, nor is a model fitted during determination of the predicting function. Rather, the data from the learning set themselves are used for prediction. KNN procedures are particularly suitable if the learning set data are only moderately structured, but the method does not really assist in improving the understanding of actual dependences.

Let $1 \le k \le m$. In order to determine a function value $f(\mathbf{x})$ for $\mathbf{x} \in \mathbb{R}^n$, first the distances between $\mathbf{x}$ and the $\mathbf{x}_i$, $i \in m$, are calculated and listed in increasing order:

\[ \| \mathbf{x} - \mathbf{x}_{i_0} \| \le \ldots \le \| \mathbf{x} - \mathbf{x}_{i_{k-1}} \| \le \ldots \]

where the predictors should be autoscaled. $\| . \|$ may be any norm on $\mathbb{R}^n$. In the present work we shall use the Euclidean norm $\| . \|_2$. In the next step, the set of the $k$ nearest neighbors of $\mathbf{x}$ is determined:

\[ N_k(\mathbf{x}) = \{ i_l \mid l \in k \} . \]

If this is not possible unambiguously, the decision is made randomly. In KNN regression, the value of the predicting function is calculated as the arithmetic mean of the target values of the $k$ nearest neighbors of $\mathbf{x}$:

\[ f(\mathbf{x}) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x})} y_i . \]

In KNN classification, the class to be predicted is selected by majority vote among the $k$ nearest neighbors of $\mathbf{x}$:

\[ f(\mathbf{x}) = \operatorname*{argmax}_{c \in \mathcal{C}} \sum_{i \in N_k(\mathbf{x})} \delta(c, y_i) . \]
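KNN regression and classification can be sketched in a few lines. Two simplifications are assumed here: ties in the distance ordering are resolved by observation index (via a stable sort) rather than randomly, and the hypothetical predictors are used as given instead of being autoscaled first.

```python
import math
from collections import Counter


def k_nearest(x, X, k):
    """Indices of the k observations closest to x in the Euclidean norm."""
    order = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))
    return order[:k]


def knn_regress(x, X, y, k):
    """Arithmetic mean of the target values of the k nearest neighbors."""
    return sum(y[i] for i in k_nearest(x, X, k)) / k


def knn_classify(x, X, y, k):
    """Majority vote among the classes of the k nearest neighbors."""
    votes = Counter(y[i] for i in k_nearest(x, X, k))
    return votes.most_common(1)[0][0]


# Hypothetical learning set with a single predictor
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y_values = [1.0, 2.0, 3.0, 10.0, 12.0]     # continuous target
y_classes = [0, 0, 0, 1, 1]                # binary target

neighbors = k_nearest([1.2], X, 3)
y_hat = knn_regress([1.2], X, y_values, 3)
c_hat = knn_classify([9.0], X, y_classes, 3)
```

For $\mathbf{x} = (1.2)$ the three nearest neighbors are the observations 1, 2 and 0, so the regression value is $(2 + 3 + 1)/3 = 2$. For $\mathbf{x} = (9)$ the neighbors 3, 4 and 2 vote 1, 1 and 0, so class 1 is predicted; the odd $k$ rules out a tie in the binary case.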

Again, if class determination is not possible unambiguously, a random decision is made. An alternative to a random decision is, e.g., a weighted method that favors the nearest neighbor in a tie decision. In the present work we will use KNN classification in the binary case only, see Section 7.6. In this case random decisions are avoided by allowing odd numbers of $k$ only. The most suitable $k$ can be determined by CV or by using a test sample.

While we focussed on the methods of supervised statistical learning in this chapter, the next chapter builds on this with a specific focus on applying these to quantitative structure-property relationships (QSPRs).

7 Quantitative structure–property relationships

Traditional chemical synthesis strives for a single pure compound and is often motivated by some expectation about the desired physicochemical property or biological activity of the target compound. A new manner of chemical synthesis, combinatorial chemistry, developed from automation and recent advances in information technology. In combinatorial chemistry, a plethora of well-defined compounds can be synthesized in a single procedure, which can then be used to search for new materials or active compounds. Technically, this is achieved e.g. by synthesis robots that are able to disperse reagents rapidly over hundreds of miniaturized reactors contained on a microtiter plate.

If a few reactants of types $1, \ldots, n$ are combined in many possible combinations, a combinatorial library of compounds is the result. After synthesis, the product library is searched or screened for the desired property. Often, screening also is performed automatically with high throughput rates, in which case we speak of high throughput screening (HTS).

7.1 Motivation: Optimization of experiments in combinatorial chemistry

Often, synthesis of a complete combinatorial library in all its variability is expensive, undesirable or even impossible. In Subsection 5.3.6 we considered examples where between 10 and 20 reactants resulted in several thousand different compounds. Thus, often only a subset of the virtual library is synthesized, termed the real library. The virtual library consists of all compounds available when using all possible combinations. Figure 7.1 describes this situation and shows how to find candidates that potentially possess a desired property with higher probability in a virtual library. Subproblems of a mathematical or arithmetic nature are highlighted in grey. These are
– generation of the virtual library (Subsection 5.3.6), which, depending on the situation, may comprise
  ∘ determination of a real sub-library of high diversity (Section 7.7), or
  ∘ a check for subset relationship between a real and a virtual library (Remark 5.22),
– calculation of molecular descriptors (Section 2.6),
– determination of predicting functions via methods of statistical learning, cf. Chapter 6,
– application of a predicting function.

[Figure 7.1: workflow diagram. Real library (structures and properties): the structural formulas are converted by molecular descriptors into predictors, and the property values form the target variable; predictors and target variable enter statistical learning (regression, classification), which yields the predicting function. Virtual library (structures only): the structural formulas are converted by molecular descriptors into predictors, to which the predicting function is applied, giving the predicted property values.]

Fig. 7.1. Schematic workflow for the prediction of property values for a virtual combinatorial library.

Initially, the structures of the real library are mapped onto real numbers using molecular descriptors. A vector of real numbers of equal length is obtained for each structure. Further, an experimental value of the property of interest is associated with each structure. Descriptor and property values are the input for statistical methods of supervised learning. These result in a predicting function, a function that is determined to fit the experimental property values reasonably well (see Chapter 6). The predicting function requires a vector of descriptor values as input and returns a predicted property value as output. The complete process of descriptor calculation, the search for and finally application of a predicting function is known as QSPR/QSAR research.

The virtual library is constructed using a structure generator. Of course, descriptors can be calculated for each structure in this library. Therefore the predicting function is able to predict property values for each member of the virtual library. This is called virtual screening. Structures predicted to exhibit desired property values are candidates for targeted synthesis.

7.2 The use of molecular descriptors

As mentioned at the beginning of this chapter, methods of statistical learning will be used to obtain predicting functions for (experimentally measurable) properties of compounds. Such methods, however, are unable to process molecular graphs directly. Rather, they expect a vector of real numbers for each observation as input. Molecular graphs are mapped onto real numbers using molecular descriptors. Molecular descriptor values can be calculated in MOLGEN–QSPR using an implementation by J. Braun [32] that is essentially oriented along the encyclopedic book [304] by R. Todeschini and V. Consonni.

We recall from Section 2.6 that a molecular descriptor $\bar{D}$ comes from a mapping $D$ on the set of labeled molecular graphs that is invariant under relabeling of a graph’s nodes. In formal terms, if $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}_n$, we require that

\[ D(M) = D(\pi M), \quad \text{for each relabeling } \pi \in S_n . \]

This gives rise to two molecular descriptors $\bar{D}$ and $\bar{D}^*$ with the values

\[ \bar{D}(\bar{M}) = D(M) \qquad \text{and} \qquad \bar{D}^*(\bar{M}) = D(M^*) . \]

There are many possibilities to define particular molecular descriptors; the values $\bar{D}(\bar{M})$ are mostly real numbers, but they can also be sequences of real numbers or something completely different, for example, the conjugacy class $\widetilde{\mathrm{Aut}}(\gamma)$ of the automorphism group $\mathrm{Aut}(\gamma)$ of the molecular graph. There are various categories of molecular descriptors, which depend on the information used to form the descriptor, e.g. arithmetical, topological, and geometrical descriptors. If the descriptor is topological, i.e. if it depends on $\gamma$ and not only on $\varepsilon$ and/or $\zeta$, and if its values are real numbers, it is called a topological index. For the sake of clarity we spoke of purely arithmetical descriptors $\bar{D}$ that are independent of $\gamma$ and of purely topological descriptors, which depend only on $\gamma$. Thus, for example, a purely arithmetical descriptor is constant on sets of constitutional isomers.

Following a refresher on descriptors below, we will introduce a particularly flexible type of molecular descriptors in Subsection 7.2.2, the substructure counts. These are defined by a molecular substructure and return simply how many times a substructure is contained in a molecular graph (i.e. the occurrence) as the descriptor value.

7.2.1 Arithmetical, topological, and geometrical descriptors

As $\bar{D}^*$ neglects hydrogen atoms, it is helpful to introduce the following notation for the non-H atoms in a given labeled molecular graph $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}_n$,

\[ \Omega = \{ i \in n \mid \varepsilon(i) \ne \mathrm{H} \} . \]

$\gamma|_\Omega$ is the restriction of $\gamma$ to $\Omega$, and $(\varepsilon|_\Omega, \zeta|_\Omega, \gamma|_\Omega)$ is the H-suppressed molecular graph. Thus,

\[ M^* = (\varepsilon^*, \zeta^*, \gamma^*) = (\varepsilon|_\Omega, \zeta|_\Omega, \gamma|_\Omega) . \]

Moreover, we recall that we consider the multigraph $\gamma$, the bond graph $\gamma^b$, a simple graph obtained by reducing the multiplicities to 1, as well as the graph $\gamma^*$ of the H-suppressed molecule, and also its bond graph $\gamma^{*b}$. All in all, the evaluation of descriptor values makes use of the following data:
– The element distribution $\varepsilon$, the distribution $\zeta$ of the atomic states, as well as the graph $\gamma$. They form the molecular graph $M = (\varepsilon, \zeta, \gamma)$. The restrictions to $\Omega$ give the H-suppressed molecular graph $M^* = (\varepsilon^*, \zeta^*, \gamma^*)$.
– The corresponding graphs $\gamma$, $\gamma^b$, $\gamma^*$ and $\gamma^{*b}$ are described by the matrices of multiplicities $\mathbf{M}_\gamma = (\gamma_{ij})$ and $\mathbf{M}_{\gamma^b}$, $\mathbf{M}_{\gamma^*}$ and $\mathbf{M}_{\gamma^{*b}}$, respectively. From these we can derive the sequences of valences and of bond degrees

\[ v(\gamma), \quad b(\gamma) = v(\gamma^b) = b(\gamma^b), \quad v(\gamma^*), \quad b(\gamma^*) = v(\gamma^{*b}) . \]

– Moreover, there are the distance matrices $\mathbf{D}_\gamma = (\mathrm{dist}_\gamma(i, j))$ and $\mathbf{D}_{\gamma^b}$, $\mathbf{D}_{\gamma^*}$, and $\mathbf{D}_{\gamma^{*b}}$.

7.1 Example (Arithmetical descriptors) We recall a few basic arithmetical descriptors:
– Let $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}_n$ be a molecular graph. The purely arithmetic indices $\bar{A}$ and $\bar{A}^*$ are the counts of all atoms or of the non-H atoms, obtained from the function $A$ with its values

$$A(M) = n, \qquad A(M^*) = |\Omega|.$$

There is, of course, also the count of atoms of element $X$, using

$$N_X(M) = \beta_M(X), \qquad N_X(M^*) = \beta_{M^*}(X),$$

and the evaluation of the molecular weights

$$MW(M) = \sum_{X \in \mathcal{E}} \bar{m}_X \, \beta_M(X), \qquad MW(M^*) = \sum_{X \neq \mathrm{H}} \bar{m}_X \, \beta_{M^*}(X).$$

Here $\bar{m}_X$ is the mean atomic mass of element $X$ (see Definition 8.22). Table 7.1 shows mean atomic masses (in daltons; 1 Da is defined as 1/12 of the mass of a $^{12}\mathrm{C}$ atom) of the $X \in \mathcal{E}_{11}$.

Table 7.1. Mean atomic mass, van der Waals radius and van der Waals density of the elements of $\mathcal{E}_{11}$.

| $X$ | $\bar{m}_X$ [Da] | $r_X$ [Å] | $\rho_{vdw}$ [Da/Å³] |
|-----|------------------|-----------|----------------------|
| H   | 1.0079   | 1.20 | 0.139 |
| C   | 12.0107  | 1.70 | 0.584 |
| N   | 14.0067  | 1.55 | 0.898 |
| O   | 15.9994  | 1.52 | 1.088 |
| F   | 18.9984  | 1.47 | 1.428 |
| Si  | 28.0855  | 2.10 | 0.724 |
| P   | 30.9738  | 1.80 | 1.268 |
| S   | 32.0660  | 1.80 | 1.313 |
| Cl  | 35.4527  | 1.75 | 1.579 |
| Br  | 79.9040  | 1.85 | 3.013 |
| I   | 126.9045 | 1.98 | 3.903 |

These descriptors are purely arithmetical, which is not the case for the descriptors arising from the following mappings:
– The numbers of bonds

$$B(M) = \frac{1}{2} \sum_i b(\gamma)_i, \qquad B(M^*) = \frac{1}{2} \sum_i b(\gamma^*)_i,$$

and the cyclomatic numbers

$$C(M) = B(M) - A(M) + 1, \qquad C(M^*) = B(M^*) - A(M^*) + 1,$$

which assume that the graph $\gamma$ (and therefore also $\gamma^*$) is connected.

These descriptors are not purely arithmetical: they may have different values for compounds of the same molecular formula (and of the same state distribution). They are nevertheless considered arithmetic descriptors since they are based merely on numbers of bonds.
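The arithmetical descriptors of Example 7.1 can be illustrated with a minimal Python sketch. The data layout (a list of element symbols plus a dictionary of bonds with multiplicities) and the function name `arithmetic_descriptors` are assumptions made for this illustration only; they are not the MOLGEN–QSPR implementation. The masses are a subset of Table 7.1.

```python
from collections import Counter

# Mean atomic masses in Da, copied from Table 7.1 (subset, for illustration).
MEAN_MASS = {"H": 1.0079, "C": 12.0107, "N": 14.0067, "O": 15.9994}


def arithmetic_descriptors(elements, bonds):
    """Hypothetical helper: elements is a list of element symbols,
    bonds maps atom-index pairs (i, j) to bond multiplicities."""
    heavy = {i for i, x in enumerate(elements) if x != "H"}
    # Bonds of the H-suppressed graph: both end points are non-H atoms.
    heavy_bonds = [b for b in bonds if b[0] in heavy and b[1] in heavy]
    a, a_star = len(elements), len(heavy)
    # B counts edges of the bond graph, so multiplicities are ignored.
    b, b_star = len(bonds), len(heavy_bonds)
    beta = Counter(elements)  # element counts beta_M(X)
    return {
        "A": a, "A*": a_star,
        "N": dict(beta),
        "MW": sum(MEAN_MASS[x] * k for x, k in beta.items()),
        "MW*": sum(MEAN_MASS[x] * k for x, k in beta.items() if x != "H"),
        "B": b, "B*": b_star,
        # Cyclomatic numbers; valid for connected graphs only.
        "C": b - a + 1, "C*": b_star - a_star + 1,
    }


# Ethylene, C2H4: one double bond and four C-H bonds.
ethylene = arithmetic_descriptors(
    ["C", "C", "H", "H", "H", "H"],
    {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1},
)
print(ethylene)
```

For ethylene the sketch yields $A = 6$, $A^* = 2$, $B = 5$ and $C = 0$, as expected for an acyclic molecule.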


We recall a few topological descriptors from Chapter 2:

7.2 Example (Basic topological indices) A topological index (TI) is based on element and state distribution and on atom neighborhood information. The best known topological indices can be classified as purely topological, since they exclusively use the multigraphs $\gamma$ underlying $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}$ or $\gamma^*$ underlying $M^* = (\varepsilon^*, \zeta^*, \gamma^*)$.
– The first topological index was developed by H. Wiener [339]. He used the function

$$W(M) = \frac{1}{2} \sum_{i,j \in \Omega} \operatorname{dist}_\gamma(i, j),$$

obtaining topological indices $\bar{W}$ and $\bar{W}^*$, named after him, for modeling the boiling points of alkanes (see Section 7.4). Graph theoretically, the Wiener index can be considered a measure of branching. This is justified by $W$ assuming, within a class of graphs of the same number of nodes and edges, maximal (minimal) values for minimally (maximally) branched graphs. Minimally branched graphs are chains (left), maximally branched are stars (right):

[Figure: a chain (left) and a star (right); drawing not reproduced.]

– The values of Zagreb indices [106] are sums of squares or products of node degrees, and we have seen that this means essentially walk counts:

$$M_1(M) = \sum_{i \in \Omega} (b(\gamma)_i)^2 = mwc^{(2)}(\gamma^b),$$

$$M_2(M) = \sum_{\{i,j\} \in B(\gamma)} b(\gamma)_i \cdot b(\gamma)_j = \frac{1}{2}\, mwc^{(3)}(\gamma^b),$$

and, correspondingly, $M_1(M^*)$ and $M_2(M^*)$.
– Values of Randić indices [158, 240] of order $k$ are calculated as

$${}^0\chi(M) = \sum_{i \in \Omega} (b(\gamma)_i)^{-\frac{1}{2}}$$

for $k = 0$, and by

$${}^k\chi(M) = \sum_{\substack{(i_0, \ldots, i_k) \\ \text{path in } \gamma}} \; \prod_{j=0}^{k} (b(\gamma)_{i_j})^{-\frac{1}{2}}$$

for $k > 0$. Similarly we obtain ${}^0\chi(M^*)$ and ${}^k\chi(M^*)$.
– The distance degree of node $i$ in $\gamma$ is defined as

$$\deg^d_\gamma(i) = \sum_{j \in n} \operatorname{dist}_\gamma(i, j)$$

and is used in the definition of the Balaban index [8, 9], obtained from

$$J(M) = \frac{B(M)}{C(M) + 1} \sum_{i,j \in n} \left( \deg^d_\gamma(i) \cdot \deg^d_\gamma(j) \right)^{-\frac{1}{2}},$$

and similarly for $J(M^*)$.
– The Schultz index [279, 280] uses

$$MTI(M) = \sum_{i,j,k} \gamma_{ik} \left( \gamma_{kj} + \operatorname{dist}_\gamma(k, j) \right).$$

– The molecular walk count of length $l$ (in $M$ or in $M^*$) was mentioned several times. It adds all entries of the $l$-th power of the bond matrix of $M$ or of $M^*$. These indices were introduced by C. and G. Rücker [264]; they describe the complexity [269] of a (molecular) graph. Walk counts are mostly evaluated for the H-suppressed molecule. The total walk count sums molecular walk counts of all lengths $l$:

$$twc(M^*) = \sum_{l \in |\Omega|} mwc^{(l)}(M^*).$$

– The maximal eigenvalue of $\mathbf{M}_{\gamma^{*b}}$ can also be considered a molecular descriptor.
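Several of the purely topological indices above can be evaluated directly from the adjacency matrix of the H-suppressed bond graph. The following is a minimal sketch using NumPy; the function names `distance_matrix` and `basic_indices` are hypothetical, the input is assumed to be a connected simple graph, and the order-1 Randić index counts each bond once. It is an illustration under these assumptions, not the MOLGEN–QSPR code.

```python
import numpy as np


def distance_matrix(adj):
    """All-pairs shortest path lengths of a connected simple graph
    (Floyd-Warshall on the 0/1 adjacency matrix)."""
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(len(adj)):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist


def basic_indices(adj):
    adj = np.asarray(adj, dtype=int)
    n = len(adj)
    deg = adj.sum(axis=1).astype(float)  # bond degrees b(gamma)_i
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i, j]]

    def mwc(length):
        # Molecular walk count: sum of all entries of the l-th matrix power.
        return int(np.linalg.matrix_power(adj, length).sum())

    return {
        "W": distance_matrix(adj).sum() / 2,              # Wiener index
        "M1": int((deg ** 2).sum()),                      # first Zagreb index
        "M2": int(sum(deg[i] * deg[j] for i, j in edges)),
        "chi0": float((deg ** -0.5).sum()),               # Randic, order 0
        "chi1": float(sum((deg[i] * deg[j]) ** -0.5 for i, j in edges)),
        "mwc2": mwc(2), "mwc3": mwc(3), "mwc4": mwc(4),
        "lambda1": float(np.linalg.eigvalsh(adj).max()),  # maximal eigenvalue
    }


# Carbon skeleton of n-decane: a chain of ten nodes.
chain = np.zeros((10, 10), dtype=int)
for i in range(9):
    chain[i, i + 1] = chain[i + 1, i] = 1
print(basic_indices(chain))
```

For the ten-node chain the sketch gives $W = 165$, $M_1 = mwc^{(2)} = 34$, $M_2 = 32$ and $mwc^{(3)} = 64 = 2 M_2$, which illustrates the identities between Zagreb indices and walk counts stated above.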

There are, of course, also topological indices that are not purely topological, since the chemical elements of nodes are considered.

7.3 Example (Valence degrees, Kier & Hall and Basak indices)
– For a molecular graph $M = (\varepsilon, \zeta, \gamma) \in \mathcal{M}$ the valence degree of node $i$ is defined as

$$\deg^v_M(i) = \frac{VE_{\varepsilon(i)} - HC_M(i)}{TE_{\varepsilon(i)} - VE_{\varepsilon(i)} - 1}.$$

Here $HC_M(i)$ is the number of hydrogen atoms attached to atom $i$, and we recall from Chapter 1 that $VE_{\varepsilon(i)}$ denotes the number of valence electrons, while $TE_{\varepsilon(i)}$ means the total number of electrons of atom $\varepsilon(i)$. The valence degrees of nodes enter the definitions of Kier & Hall indices [156, 157, 158]. In analogy to Randić indices of order $m$, in Kier & Hall indices contributions from all paths of length $m$ are summed, where node valence degrees (instead of node degrees) enter the contributions:

$${}^0\chi^v(M) = \sum_{i \in \Omega} \left( \deg^v_M(i) \right)^{-\frac{1}{2}},$$

$${}^m\chi^v(M) = \sum_{\substack{(v_0, \ldots, v_m) \\ \text{path in } \gamma}} \; \prod_{i=0}^{m} \left( \deg^v_M(v_i) \right)^{-\frac{1}{2}}.$$

Similarly for $M^*$.
– Basak's information theoretical indices are another class of topological indices that also account for the elements [14, 15]. To calculate these, all atoms have to be classified according to the chemical element and the type of bonds and atoms up to a distance $r$. Let $k_r$ be the number of classes and $n_{ri}$ the number of atoms in class $i$. Then the following mappings can be defined,

$$IC_r(M) = -\sum_{i \in k_r} \frac{n_{ri}}{n} \log_2 \frac{n_{ri}}{n},$$

$$CIC_r(M) = \log_2 n - IC_r(M)$$

and

$$SIC_r(M) = (\log_2 n)^{-1} \, IC_r(M),$$

functions that yield Basak's information content, complementary information content and structural information content of order $r$, respectively.

Tables 7.4 and 7.5 in Section 7.4 show calculated values of some TIs described here, for the compounds from Figure 7.4. In principle, there are no limits to a researcher in inventing new topological indices. The wide variety of available indices can be helpful in modeling physicochemical properties or biological activities, as we will see in Section 7.3.

For some applications it suffices to use descriptors that describe the topology of structures only (see e.g. Section 7.4). However, there are many molecular properties that depend on the 3D shape of the molecules (see Section 7.5). Descriptors that take this information into account are called geometrical descriptors or geometrical indices. To allow calculation of geometrical descriptors, the atoms of the molecular graph $M \in \mathcal{M}_n$ have to be given 3D coordinates $\xi \in (\mathbb{R}^3)^n$ (see Subsection 2.6.3). Various methods are available for this task [271, 272]; we use an empirical force field method similar to [5].

Of course, values of geometrical descriptors should be invariant with respect to translation and rotation. Therefore, the first step of geometrical index calculation often involves centering and orienting the molecule about the coordinate origin and along the main axes. Many geometrical descriptors describe geometrical quantities such as volumes, surfaces (see e.g. [48]), or diameters of a molecule.

7.4 Example (van der Waals volume, radius and density) As an example of a geometrical index we consider the van der Waals (vdW) volume $V_{vdw}$. Initially, each element $X$ is associated with an atom radius, the vdW radius $r_X$. Table 7.1 contains the vdW radii (in Ångström, Å) and the vdW density (weight/volume) for elements from $\mathcal{E}_{11}$. In Figure 7.2, atoms of various elements are depicted as spheres of vdW radii.
The vdW volume of a molecule is the total volume occupied by its atom spheres of radius $r_{\varepsilon(i)}$ and center $\xi(i) \in \mathbb{R}^3$. Figure 7.3 shows a molecule with atoms represented by corresponding spheres. In principle, two procedures to calculate a vdW volume seem promising:
– The geometric approach tries to calculate the volume of an arrangement of atom spheres exactly. The inclusion–exclusion principle can be applied: First, the volumes of all spheres are added, then each intersection of two spheres is subtracted, each intersection of three spheres is added, and so on. However, the calculation of the intersection of two spheres is already rather complicated [252]. Intersections of up to four spheres can be calculated exactly [89], whereas intersections of higher order cannot, as is mathematically proven [48]. Of course, as an approximation, only intersections up to e.g. third order may be considered. There is another promising approach that is inspired by the methods of numerical integration.
– In a procedure of discretization, a cuboid circumscribing a molecule is constructed initially. This is then decomposed into cubes that are as small as possible but still equal in size. From the position of the center of a cube it is possible to decide whether it is inside or outside the molecule as described by the vdW spheres. The molecular vdW volume then is, as a first approximation, the sum of volumes of all cubes with the center inside.

Fig. 7.2. Atoms from $\mathcal{E}_{11}$ represented as spheres of van der Waals radii.

Fig. 7.3. A 3D arrangement of the amino acid methionine represented as space filling model, with atoms shown as spheres of van der Waals radii.


Table 7.2. Calculated van der Waals volumes of small organic molecules, from [262] and [144].

| Compound | $V_{vdw}$ [Å³], [262] | $V_{vdw}$ [Å³], [144] |
|----------|-----------------------|-----------------------|
| methane      | 29.764  | 29.327  |
| ethane       | 47.645  | 47.210  |
| ethylene     | 41.062  | 40.135  |
| acetylene    | 37.942  | 37.315  |
| benzene      | 84.174  | 87.182  |
| naphthalene  | 126.950 | 131.564 |
| cyclohexane  | 106.833 | 106.654 |
| chloroethane | 56.402  | 55.690  |
| toluene      | 101.625 | 105.377 |
| bromobenzene | 95.538  | 98.771  |
| ethylbenzene | 119.450 | 123.317 |

The latter procedure suffers from strongly increasing effort as the cubes become smaller, in particular if the molecule of interest is large. A considerable improvement was achieved by J. Braun [32]. Initially, a coarse equidistant grid is introduced in the circumscribing cuboid, with a grid distance of the vdW diameter of a hydrogen atom (2.4 Å). Then each cube is related to the atom spheres that intersect it. Cubes that do not intersect any sphere are ignored. The remaining cubes, occupied by one or a few atom spheres each, are treated as described above: We introduce a finer grid (unit length 0.01 Å) and add the volumes of all cubes whose center is inside the molecule.

The first column of Table 7.2 shows the molecular vdW volumes (in Å³) of some small organic compounds calculated in this manner. The second column gives, for comparison, the molecular vdW volumes of the same compounds in the same conformation, calculated using the CODESSA software (see [144] for details). The two-step procedure is of decisive advantage with respect to precision and reach. In Section 7.5 the vdW volume will be used to model the physical density of compounds.

The descriptors introduced above are only a tiny fraction of those implemented in MOLGEN–QSPR. A complete list is found in Appendix A and detailed specifications are in [34, 262].
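The plain discretization procedure (a single uniform grid, without the coarse pre-selection step described above) can be sketched as follows. The function name `vdw_volume`, the default grid spacing of 0.1 Å and the input layout are assumptions chosen to keep the example small; the radii are a subset of Table 7.1. This is an illustration of the idea, not the two-step implementation from [32].

```python
import numpy as np

# vdW radii in Angstrom, copied from Table 7.1 (subset, for illustration).
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def vdw_volume(elements, coords, step=0.1):
    """Estimate the vdW volume by counting grid cubes whose center lies
    inside at least one atom sphere (hypothetical helper)."""
    coords = np.asarray(coords, dtype=float)
    radii = np.array([VDW_RADIUS[x] for x in elements])
    # Circumscribing cuboid of all atom spheres.
    lo = (coords - radii[:, None]).min(axis=0)
    hi = (coords + radii[:, None]).max(axis=0)
    # Cube centers along each axis.
    axes = [np.arange(a + step / 2, b, step) for a, b in zip(lo, hi)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    inside = np.zeros(len(centers), dtype=bool)
    for center, radius in zip(coords, radii):
        inside |= ((centers - center) ** 2).sum(axis=1) <= radius ** 2
    return inside.sum() * step ** 3


# A single carbon sphere: the estimate should be close to 4/3 * pi * r^3.
print(vdw_volume(["C"], [[0.0, 0.0, 0.0]]))
```

The number of cubes grows with the inverse cube of the grid spacing, which is the effort problem noted above and the motivation for restricting the fine grid to coarse cubes that actually intersect a sphere.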

7.2.2 Substructure counts

Existence/non-existence of a certain substructure $S$ and substructure counts are other types of molecular descriptors. The value of such a descriptor is calculated as

$$\mathit{Func}^b_S(M) = \begin{cases} 1 & \text{if } S \subseteq M, \\ 0 & \text{else,} \end{cases}$$

and

$$S(M) = |\mathrm{Emb}_{\subseteq}(S, M)|.$$

In the first case this is a binary molecular descriptor. The software ToSim written by K. Varmuza [290] uses a vector of binary molecular descriptors to check molecular graphs for similarity. In the following we will use substructure counts (SC) due to their higher information content.

Substructure relations and counting of embeddings

There are various possibilities to define descriptors based on substructures. Thus, the substructure relation '$\subseteq$' may be replaced by the partial structure relation '$\subseteq_i$' (see Definition 2.7). Usually, embeddings differing only in the mapping of H atoms are counted only once. More generally, all alternatives described in Example 2.8 could be used to obtain substructure counts. From a chemical point of view, it makes more sense to take into account the symmetry of substructure $S$ but not that of molecular graph $M$. In our further investigations we will prefer this variant. H atoms will be ignored.

Selection of substructures

In using substructure-based descriptors, it is important to select the substructures to be considered carefully. Usually, for example in [290], a fixed set of substructures is given. This procedure is of limited use if compounds do not differ sufficiently with respect to these substructures, or if there are other substructures more suitable to express the structural differences in the compounds in question. In such a situation, user-defined substructures are helpful. This option is provided in the software MOLGEN–QSPR that resulted from this work. However, a prerequisite is good knowledge of the library, which requires some time and effort. This effort can be minimized using a computer.

Algorithm 7.5 finds, for a given library $\mathcal{L}$, all substructures and their counts. Contained therein is an associative storage $Map$ that maps substructures onto vectors of natural numbers designed to count multiplicities.

7.5 Algorithm SubstrCounts($\mathcal{L}$)
(1) for each $M_i \in \mathcal{L}$
(2)     for each $S \subseteq M_i$
(3)         $S \leftarrow \kappa(S)$
(4)         $Map[S][i] \leftarrow Map[S][i] + 1$
(5)     end
(6) end


Line (1) runs through the whole library of molecular graphs $M_i$. In line (2) the size of substructures is limited by setting a lower and an upper limit for the number of edges. In line (3) $S$ is canonically numbered, and in line (4) the count of $S$ in $M_i$ is incremented. If a substructure is encountered for the first time, it is inserted into $Map$ and associated with a vector of zeroes whose length is the size of $\mathcal{L}$. At the end, $Map[S][i]$ contains the count of $S$ in $M_i$.

7.6 Example (Substructures of decanes) Figure 7.4 shows a real library of 50 decanes that will be used in Section 7.4 to find a QSPR model for the boiling points of decanes. Decanes are isomers of molecular formula C10H22. A total of 75 constitutions exist with this formula. In Figure 7.5 all substructures of two through six bonds contained in the real library are shown. Table 7.3 gives the substructure counts, where columns correspond to substructures and rows correspond to compounds, according to the numbering given in the figures. Counts for substructures that contain no bond or one bond have been omitted from the table and figure, as these have constant counts throughout the library, ten and nine, respectively.

In contrast to predefined substructure vectors, this dynamic procedure is a natural way of substructure selection. For homogeneous libraries such as this one, predefined substructure vectors designed for general chemistry are of little use, since the entries will differ in few components if at all. Nevertheless, static substructure vectors have their advantages since exactly those substructures can be included that are, according to prior knowledge, relevant for a given problem.
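Algorithm 7.5 can be sketched in Python for the special case of acyclic H-suppressed graphs such as alkane skeletons. Everything specific here is an assumption made for illustration: graphs are given as edge lists, substructures are enumerated as connected edge subsets, and the canonizer $\kappa$ is replaced by a simple rooted-tree string encoding (minimized over all roots) that only works for trees. The names `connected_edge_subsets`, `canon_tree` and `substr_counts` are hypothetical.

```python
from collections import defaultdict


def connected_edge_subsets(edges, lo, hi):
    """Yield every connected subgraph with lo..hi edges, as a frozenset of
    edges, by growing edge sets one adjacent edge at a time."""
    edges = [frozenset(e) for e in edges]
    frontier = {frozenset([e]) for e in edges}
    size = 1
    while frontier and size <= hi:
        if size >= lo:
            yield from frontier
        grown = set()
        for sub in frontier:
            nodes = set().union(*sub)
            for e in edges:
                if e not in sub and e & nodes:
                    grown.add(sub | {e})
        frontier = grown
        size += 1


def canon_tree(sub):
    """Canonical string of a tree: smallest parenthesis encoding over all
    choices of root (stand-in for kappa; valid for acyclic graphs only)."""
    adj = defaultdict(set)
    for e in sub:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    def encode(v, parent):
        children = sorted(encode(w, v) for w in adj[v] if w != parent)
        return "(" + "".join(children) + ")"

    return min(encode(root, None) for root in list(adj))


def substr_counts(library, lo=2, hi=6):
    """Map each canonical substructure to its vector of counts."""
    table = {}
    for i, edges in enumerate(library):          # line (1)
        for sub in connected_edge_subsets(edges, lo, hi):  # line (2)
            key = canon_tree(sub)                 # line (3)
            table.setdefault(key, [0] * len(library))[i] += 1  # line (4)
    return table


# Carbon skeletons: unbranched ten-carbon chain and a chain with one branch.
n_decane = [(i, i + 1) for i in range(9)]
branched = [(i, i + 1) for i in range(8)] + [(1, 9)]
counts = substr_counts([n_decane, branched])
print(len(counts), "distinct substructures found")
```

Counting edge subsets rather than labeled embeddings means that each occurrence is counted once regardless of the symmetry of the substructure. For the unbranched chain this gives 8 occurrences of the two-bond path and 7 of the three-bond path.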

7.3 Mathematical composition of QSPRs

Consider a real library of $m$ compounds together with an experimentally determined property $Y$. In the following we assume that $Y$ has real values. The compounds in the library are represented by their molecular graphs. Thus, our QSPR investigation starts with tuples

$$(M_i, y_i) \in \mathcal{M} \times \mathbb{R}, \quad i \in m.$$

We are looking for a function

$$\Psi : \mathcal{M} \to \mathbb{R}$$

to describe our QSPR mathematically. In Chapter 6 and Section 7.1 we described how to arrive at $\Psi$.

Fig. 7.4. Real library of decanes, with boiling points (°C) to the top left of each structure. [Structure drawings not reproduced. Boiling points by compound number:]

1: 136.0, 2: 145.0, 3: 146.0, 4: 147.0, 5: 147.6, 6: 147.7, 7: 148.5, 8: 148.7, 9: 149.7, 10: 151.5,
11: 152.5, 12: 152.8, 13: 153.7, 14: 154.0, 15: 154.5, 16: 154.5, 17: 155.5, 18: 155.5, 19: 156.0, 20: 157.0,
21: 157.5, 22: 157.8, 23: 158.3, 24: 158.8, 25: 158.8, 26: 159.0, 27: 159.0, 28: 159.5, 29: 159.5, 30: 159.8,
31: 160.0, 32: 160.0, 33: 160.1, 34: 160.6, 35: 160.7, 36: 162.0, 37: 162.4, 38: 162.5, 39: 163.5, 40: 163.8,
41: 164.5, 42: 165.1, 43: 165.7, 44: 166.0, 45: 166.0, 46: 167.0, 47: 167.7, 48: 168.4, 49: 170.9, 50: 174.0.


Table 7.3. Counts of the substructures from Figure 7.5 in the library of Figure 7.4. The table is listed here column by column: each of the 20 lines of numbers below belongs to one substructure (1 to 20, in this order) and gives its counts in compounds 1 to 50, in order.

14 12 11 12 13 12 12 13 14 12 10 12 13 11 11 12 11 13 12 10 11 12 10 10 10 11 15 10 13 14 10 10 12 11 10 9 10 13 9 11 10 9 9 9 11 9 9 14 14 8

7 8 8 9 9 8 7 10 9 9 8 9 11 7 10 10 9 12 12 8 9 11 10 10 9 11 12 8 13 13 9 7 11 9 12 9 10 14 9 11 9 8 8 9 13 7 8 15 15 7

8 5 3 5 6 5 5 6 8 5 2 5 6 4 3 5 3 6 4 2 4 5 2 2 2 4 9 2 6 8 2 2 5 4 2 1 2 6 1 4 2 1 1 1 4 1 1 8 8 0

2 1 0 1 1 1 1 1 2 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 0 1 2 0 1 2 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 2 2 0

6 5 4 6 8 5 4 11 9 7 3 7 13 3 7 8 6 14 12 3 6 10 5 6 4 9 18 3 16 21 4 2 12 6 8 3 6 19 3 9 5 2 2 3 12 1 2 24 24 0

6 6 9 9 8 9 6 7 10 10 8 6 8 6 8 8 6 8 8 7 8 9 8 9 8 9 9 6 8 6 6 6 6 6 10 9 8 7 8 7 6 7 7 7 9 6 6 7 6 6

0 1 1 3 2 1 0 3 2 2 1 2 4 0 4 3 2 8 8 1 2 6 4 5 2 6 9 1 10 6 2 0 4 2 10 3 4 11 3 6 2 1 1 3 12 0 1 14 12 0

9 7 4 6 6 4 5 6 3 3 5 7 4 5 5 5 6 3 4 6 5 4 6 6 5 4 0 5 2 3 6 5 5 5 4 6 5 2 6 5 5 6 5 5 3 5 5 0 1 5

2 1 0 1 1 1 1 2 3 2 0 2 3 1 0 2 0 2 0 0 2 3 0 0 0 3 4 0 3 7 0 0 3 2 0 0 0 4 0 3 0 0 0 0 4 0 0 8 8 0

6 4 8 9 10 10 4 7 18 12 5 4 10 3 6 9 3 10 8 3 6 9 4 4 5 6 18 2 10 6 2 2 4 3 4 3 4 7 2 3 2 2 2 1 3 1 1 6 6 0

0 0 0 0 1 0 0 3 0 0 0 0 3 0 1 0 1 4 3 0 0 0 0 1 0 0 6 0 4 9 0 0 3 0 1 0 1 6 0 0 1 0 0 0 0 0 0 9 9 0

0 0 0 0 0 0 0 0 1 1 0 1 2 0 0 1 0 0 0 0 1 3 0 0 0 3 0 0 2 3 0 0 2 1 0 0 0 4 0 3 0 0 0 0 6 0 0 7 6 0

0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 1 6 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 6 6 0

18 9 2 6 9 3 4 9 3 1 2 9 4 3 4 4 5 3 4 4 3 2 3 2 2 0 0 2 1 3 4 2 4 3 0 0 2 1 1 3 2 2 1 1 0 1 1 0 0 0

0 0 0 0 1 0 0 3 0 0 0 0 2 0 2 0 1 10 8 0 0 0 0 2 0 0 18 0 11 6 0 0 2 0 4 0 2 11 0 0 1 0 0 0 0 0 0 12 12 0

0 0 2 3 3 3 0 1 9 3 1 0 3 0 1 3 0 3 2 0 0 3 1 0 1 0 9 0 3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

0 1 4 6 6 4 0 6 6 6 3 2 8 0 6 7 2 6 8 2 4 8 6 6 5 8 0 1 6 6 2 0 4 2 8 6 6 6 4 4 2 2 2 2 6 0 1 0 4 0

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 0 1 0 0 0 0 0 0 0 0 4 1 0 0 1 1 0 0 0 1 4 0 0 3 0 0

2 1 0 2 2 2 1 1 6 3 0 1 2 1 0 2 0 2 0 0 2 2 0 0 0 2 6 0 2 2 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 2 2 0

0 3 4 0 0 3 6 0 0 2 3 2 0 4 2 1 4 0 0 3 2 0 2 1 3 1 0 5 0 0 4 4 2 4 0 3 2 0 3 2 4 3 4 4 0 4 4 0 0 4

Fig. 7.5. Substructures containing 2–6 bonds found in the library of Figure 7.4. [Structure drawings not reproduced. Number of carbon atoms by substructure number: 1: C3; 2–3: C4; 4–6: C5; 7–11: C6; 12–20: C7.]

Typically $\Psi$ is composed of several mappings to be performed successively:
– First, molecular graphs are mapped onto vectors of real numbers by means of molecular descriptors $D_i$, $i \in n$,

$$\mathbf{D} : \mathcal{M} \to \mathbb{R}^n : M \mapsto (D_i(M))_{i \in n}.$$

– Second, it may be necessary or helpful to transform the descriptor values, using a mapping

$$\tau = (\tau_i)_{i \in n} : \mathbb{R}^n \to \mathbb{R}^n.$$

– Third, a predicting function $f : \mathbb{R}^n \to \mathbb{R}$ is obtained in a statistical learning procedure, then applied to new cases.
– Last, if in the learning procedure the target variable also was subjected to a transformation $\sigma$, it has to be re-transformed by $\sigma^{-1}$.

In summary, we can write the QSPR model as the composition

$$\Psi = \sigma^{-1} \circ f \circ \tau \circ \mathbf{D}.$$

If $Y$ is discrete, assuming values from a finite set $C$, then $f$ and $\Psi$ also have co-domain $C$, and

$$\Psi = f \circ \tau \circ \mathbf{D}.$$
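The composition $\Psi = \sigma^{-1} \circ f \circ \tau \circ \mathbf{D}$ translates directly into a higher-order function. The sketch below is purely illustrative: the function name `compose_qspr`, the dictionary standing in for a molecular graph, and the toy descriptors, transformations and coefficients are all made up for this example.

```python
import math


def compose_qspr(descriptors, tau, f, sigma_inv=lambda y: y):
    """Build Psi from its parts: descriptor functions D_i, componentwise
    transformations tau_i, a predicting function f, and the inverse of the
    target transformation sigma (identity if the target was not transformed)."""
    def psi(molecule):
        values = [d(molecule) for d in descriptors]              # D
        transformed = [t(v) for t, v in zip(tau, values)]        # tau
        return sigma_inv(f(transformed))                         # sigma^-1 after f
    return psi


# Toy example: a "molecule" is a dict of precomputed quantities.
psi = compose_qspr(
    descriptors=[lambda m: m["n_carbon"], lambda m: m["wiener"]],
    tau=[lambda v: v, math.log],             # log-transform second descriptor
    f=lambda x: 1.0 + 2.0 * x[0] + 3.0 * x[1],  # made-up linear model
    sigma_inv=math.exp,                      # target was modeled on a log scale
)
print(psi({"n_carbon": 2, "wiener": 1}))
```

Keeping the four stages separate makes it easy to refit $f$ alone while reusing the same descriptor and transformation steps for new compounds.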

7.4 Case studies of QSPRs obtained by linear modeling

Figure 7.4 shows a real library of 50 decanes, along with their boiling points (BP). Data were taken from the Beilstein database (see Section 2.5); the boiling points are given in °C. In the following, we want to find QSPR models for this physical property for this class of compounds.

7.4.1 Linear modeling using topological indices

We start our investigation with 30 topological indices contained in MOLGEN–QSPR (see Appendix A):

$$W,\ M_1,\ M_2,\ {}^0\chi,\ {}^1\chi,\ {}^2\chi,\ {}^0\chi^v,\ {}^1\chi^v,\ {}^2\chi^v,\ {}^3\chi^v,\ J,\ MTI,\ twc,\ mwc^{(2)},\ mwc^{(3)},\ mwc^{(4)},\ mwc^{(5)},\ mwc^{(6)},\ mwc^{(7)},\ mwc^{(8)},\ \lambda_1^A,\ IC_0,\ CIC_0,\ SIC_0,\ IC_1,\ CIC_1,\ SIC_1,\ IC_2,\ CIC_2,\ SIC_2.$$

First we reduce the corresponding set of descriptors since there are various dependencies:
– Since there are only single bonds and no heteroatoms in decanes, ${}^k\chi$ and ${}^k\chi^v$ are identical. Therefore we exclude ${}^0\chi$, ${}^1\chi$ and ${}^2\chi$. By definition, the molecular formula of all decanes is constant, C10H22, so $IC_0$, $CIC_0$, $SIC_0$ are likewise constant and therefore excluded from consideration.
– Tables 7.4 and 7.5 contain values of the remaining 24 indices for the 50 decanes from Figure 7.4. Inspection reveals that for each compound the values of $M_1$ and $mwc^{(2)}$ agree (which is generally true, as we have seen), $mwc^{(2)} = M_1$, and also $mwc^{(3)} = 2 \cdot M_2$. Thus, $M_1$ and $M_2$ likewise can be eliminated from our list of descriptors.
– Now we consider Table 7.6, which contains the first few columns of the correlation matrix of the remaining indices, with the signs of correlation coefficients omitted. The first column shows the absolute values of the correlation coefficients between BP and the descriptors in decreasing order. The remaining columns give absolute values of correlation coefficients between two descriptors. For a linear regression it does not make sense to use completely correlated predictors. A glance at the table reveals another few pairwise affine interdependences among the descriptors. Thus, the pairs from $\{IC_1, CIC_1, SIC_1\}$ are completely correlated, which results from all decanes having the same number of atoms. Since every decane has $n = 32$ atoms and $\log_2 32 = 5$, for the molecular graphs of decanes we have $CIC_1(M) = 5 - IC_1(M)$ and $SIC_1(M) = \frac{1}{5} IC_1(M)$. $\{IC_2, CIC_2, SIC_2\}$ are also completely correlated. Therefore, we exclude $CIC_1$, $SIC_1$, $CIC_2$ and $SIC_2$.

The remaining 18 descriptors are obtained from the following functions:

$$W,\ {}^0\chi^v,\ {}^1\chi^v,\ {}^2\chi^v,\ {}^3\chi^v,\ J,\ MTI,\ twc,\ mwc^{(2)},\ mwc^{(3)},\ mwc^{(4)},\ mwc^{(5)},\ mwc^{(6)},\ mwc^{(7)},\ mwc^{(8)},\ \lambda_1^A,\ IC_1,\ IC_2.$$
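The screening for constant and completely correlated descriptors can be automated. The following sketch, using NumPy, keeps the first descriptor of every group of perfectly correlated columns and drops constant ones; the function name `prune_descriptors`, the tolerance and the five-compound excerpt are assumptions made for illustration ($IC_1$ and $W$ values are taken from the first five compounds in Tables 7.5 and 7.4, the constant column is a placeholder).

```python
import numpy as np


def prune_descriptors(X, names, tol=1e-9):
    """Return the names of descriptors that are neither constant nor
    completely (|r| = 1) correlated with an earlier retained descriptor."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.max() - col.min() <= tol:
            continue  # constant descriptor, e.g. IC_0 for isomers
        if any(abs(abs(np.corrcoef(col, X[:, k])[0, 1]) - 1.0) <= tol
               for k in keep):
            continue  # affine function of a descriptor already kept
        keep.append(j)
    return [names[j] for j in keep]


ic1 = np.array([1.3245, 1.4227, 1.3602, 1.4227, 1.3870])
wiener = np.array([127.0, 134.0, 135.0, 126.0, 124.0])
X = np.column_stack([np.full(5, 0.9), ic1, 5.0 - ic1, ic1 / 5.0, wiener])
print(prune_descriptors(X, ["const", "IC1", "CIC1", "SIC1", "W"]))
```

The order of the columns decides which member of a correlated group survives, so placing the preferred descriptor first is a simple way to control the outcome.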

Table 7.4. Values of topological indices for the real library of 50 decanes. The table is listed here column by column: each line gives one index for compounds 1 to 50, in order.

$W$: 127 134 135 126 124 131 139 123 119 127 142 131 120 146 130 126 136 118 121 143 134 122 133 131 138 126 111 146 116 115 141 151 127 138 125 138 135 115 141 129 143 149 150 145 121 158 153 110 111 165

$M_1$: 46 42 40 42 44 42 42 44 46 42 38 42 44 40 40 42 40 44 42 38 40 42 38 38 38 40 48 38 44 46 38 38 42 40 38 36 38 44 36 40 38 36 36 36 40 36 36 46 46 34

$M_2$: 44 41 39 42 44 41 40 45 46 42 37 42 46 38 41 43 40 47 45 37 40 44 39 39 38 42 51 37 48 50 38 36 44 40 41 36 39 49 36 42 38 35 35 36 44 34 35 52 52 32

${}^0\chi^v$: 8.4142 8.1987 8.1463 8.1987 8.3618 8.1987 8.1987 8.3618 8.4142 8.1987 7.9831 8.1987 8.3618 8.0355 8.1463 8.1987 8.1463 8.3618 8.3094 7.9831 8.0355 8.1987 7.9831 7.9831 7.9831 8.0355 8.5774 7.9831 8.3618 8.4142 7.9831 7.9831 8.1987 8.0355 7.9831 7.8200 7.9831 8.3618 7.8200 8.0355 7.9831 7.8200 7.8200 7.8200 8.0355 7.8200 7.8200 8.4142 8.4142 7.6569

${}^1\chi^v$: 4.2071 4.4545 4.5197 4.4925 4.3272 4.4545 4.4165 4.3372 4.2678 4.4772 4.6639 4.4772 4.3599 4.5607 4.5746 4.5152 4.5366 4.3921 4.4641 4.6639 4.6213 4.5378 4.7399 4.7187 4.7019 4.6820 4.1547 4.6639 4.4147 4.3107 4.7019 4.6259 4.5040 4.6213 4.7948 4.8461 4.7187 4.4248 4.8461 4.6820 4.6807 4.8081 4.8081 4.8461 4.7426 4.7701 4.8081 4.3713 4.3713 4.9142

${}^2\chi^v$: 5.6213 4.6128 4.3643 4.4473 4.9861 4.6586 4.8467 4.8966 5.2552 4.5122 3.8769 4.4503 4.7413 4.3713 3.9924 4.2353 4.1925 4.5402 4.2063 3.8650 4.0178 4.1157 3.4316 3.5814 3.6430 3.6642 5.4537 3.8382 4.3748 4.8839 3.6042 4.0722 4.2468 3.9749 3.1532 3.2321 3.5319 4.2854 3.2052 3.6213 3.7171 3.3896 3.3896 3.1783 3.3107 3.5967 3.3628 4.5178 4.4749 3.1213

${}^3\chi^v$: 1.6250 2.0841 1.7475 2.0557 2.0724 1.7423 1.7083 2.3034 1.9660 1.8876 1.9243 2.3556 2.4973 1.7803 2.4585 2.5551 2.3374 2.8635 2.9325 2.0183 2.1339 2.6082 2.5873 2.2617 2.2831 2.5607 2.5981 2.1753 3.1439 2.9053 2.5461 1.8129 2.7376 2.4142 2.7642 2.0908 2.4594 3.3705 2.2402 2.8410 2.4011 2.1010 2.0820 2.3706 3.0303 1.8850 2.2474 3.3713 3.5999 1.9571

$J$: 3.5630 3.3555 3.3374 3.6308 3.6842 3.4695 3.2055 3.7348 3.8876 3.6256 3.1600 3.4647 3.8656 3.0438 3.5027 3.6419 3.3014 3.9418 3.8140 3.1244 3.4175 3.8026 3.4123 3.4999 3.2686 3.6903 4.2311 3.0333 4.0341 4.1018 3.1682 2.9095 3.6334 3.2770 3.6982 3.2951 3.3759 4.0893 3.2055 3.5755 3.1296 2.9984 2.9680 3.0869 3.8748 2.7732 2.8862 4.3283 4.2818 2.6476

$MTI$: 464 488 490 456 450 476 508 446 432 460 516 476 434 534 470 456 494 426 436 520 486 440 480 472 500 454 402 532 418 416 512 552 460 502 448 498 488 414 510 466 520 542 546 526 434 578 558 396 400 604

$twc$: 19248 15138 12930 17334 19018 16146 13874 20498 23048 17946 11114 16602 22234 12390 14984 18280 13242 23206 19426 10786 15664 20028 13028 13848 12020 18298 29658 10236 24610 29160 11298 9316 19738 14774 15866 10950 13386 26106 10570 17588 11616 9330 9194 10052 20526 7896 8788 31916 31632 6500

$mwc^{(2)}$: 46 42 40 42 44 42 42 44 46 42 38 42 44 40 40 42 40 44 42 38 40 42 38 38 38 40 48 38 44 46 38 38 42 40 38 36 38 44 36 40 38 36 36 36 40 36 36 46 46 34

$mwc^{(3)}$: 88 82 78 84 88 82 80 90 92 84 74 84 92 76 82 86 80 94 90 74 80 88 78 78 76 84 102 74 96 100 76 72 88 80 82 72 78 98 72 84 76 70 70 72 88 68 70 104 104 64

Table 7.5. Values of some topological indices for the real library of 50 decanes (continued). The table is listed here column by column: each line gives one index for compounds 1 to 50, in order.

$mwc^{(4)}$: 218 188 174 198 210 194 184 212 234 200 158 192 218 170 180 200 172 222 202 156 182 206 166 168 162 192 258 154 226 242 158 150 200 178 178 150 166 228 148 188 158 142 142 146 200 136 140 252 250 122

$mwc^{(5)}$: 432 376 342 402 430 382 356 450 472 404 312 396 470 328 376 416 354 484 442 310 372 438 346 354 328 410 558 304 502 552 322 288 436 364 386 306 348 522 302 404 324 282 280 296 444 260 276 586 584 232

$mwc^{(6)}$: 1040 854 764 942 1012 908 818 1040 1198 968 668 896 1102 738 822 968 754 1138 986 650 852 1024 736 760 702 942 1404 632 1180 1310 668 596 986 816 838 642 742 1208 624 908 674 574 572 604 1010 520 554 1402 1388 444

$mwc^{(7)}$: 2114 1728 1506 1926 2098 1794 1590 2250 2422 1962 1328 1874 2402 1436 1730 2020 1566 2494 2170 1306 1756 2186 1538 1614 1426 2018 3042 1252 2626 3038 1368 1154 2174 1680 1818 1314 1568 2778 1280 1962 1394 1150 1134 1230 2246 1000 1098 3286 3266 848

$mwc^{(8)}$: 4978 3900 3366 4494 4894 4272 3660 5144 6140 4710 2844 4214 5608 3242 3770 4704 3326 5854 4826 2724 4030 5106 3270 3456 3056 4642 7650 2602 6174 7156 2834 2374 4916 3784 3946 2760 3342 6424 2648 4418 2904 2344 2324 2516 5110 2000 2208 7826 7734 1626

$\lambda_1^A$: 2.1987 2.1474 2.1010 2.1889 2.2047 2.1753 2.1289 2.2361 2.2646 2.2089 2.0698 2.1813 2.2616 2.1192 2.1455 2.2082 2.1067 2.2711 2.2143 2.0529 2.1823 2.2361 2.1085 2.1358 2.0886 2.2216 2.3344 2.0314 2.2882 2.3433 2.0615 2.0000 2.2410 2.1679 2.1701 2.0743 2.1268 2.3073 2.0642 2.2120 2.0886 2.0285 2.0237 2.0491 2.2504 1.9696 2.0066 2.3649 2.3623 1.9190

$IC_1$: 1.3245 1.4227 1.3602 1.4227 1.3870 1.4227 1.4227 1.3870 1.3245 1.4227 1.3716 1.4227 1.3870 1.3213 1.3602 1.4227 1.3602 1.3870 1.1995 1.3716 1.3213 1.4227 1.3716 1.3716 1.3716 1.3213 1.2575 1.3716 1.3870 1.3245 1.3716 1.3716 1.4227 1.3213 1.3716 1.3009 1.3716 1.3870 1.3009 1.3213 1.3716 1.3009 1.3009 1.3009 1.3213 1.3009 1.3009 1.3245 1.3245 1.1216

$CIC_1$: 3.6755 3.5773 3.6398 3.5773 3.6130 3.5773 3.5773 3.6130 3.6755 3.5773 3.6284 3.5773 3.6130 3.6787 3.6398 3.5773 3.6398 3.6130 3.8005 3.6284 3.6787 3.5773 3.6284 3.6284 3.6284 3.6787 3.7425 3.6284 3.6130 3.6755 3.6284 3.6284 3.5773 3.6787 3.6284 3.6991 3.6284 3.6130 3.6991 3.6787 3.6284 3.6991 3.6991 3.6991 3.6787 3.6991 3.6991 3.6755 3.6755 3.8784

$SIC_1$: 0.26489 0.28455 0.27205 0.28455 0.27739 0.28455 0.28455 0.27739 0.26489 0.28455 0.27433 0.28455 0.27739 0.26427 0.27205 0.28455 0.27205 0.27739 0.23989 0.27433 0.26427 0.28455 0.27433 0.27433 0.27433 0.26427 0.25151 0.27433 0.27739 0.26489 0.27433 0.27433 0.28455 0.26427 0.27433 0.26017 0.27433 0.27739 0.26017 0.26427 0.27433 0.26017 0.26017 0.26017 0.26427 0.26017 0.26017 0.26489 0.26489 0.22433

$IC_2$: 1.7947 2.5354 2.2823 2.4104 2.2322 2.5354 2.4729 2.2322 2.0416 2.5590 2.6945 2.4965 2.2169 2.3204 2.4576 2.5590 2.3448 2.3183 1.7947 2.5460 2.3675 2.4965 2.5460 2.5931 2.6556 2.3439 1.5704 2.5695 2.3183 2.0416 2.5306 2.4056 2.5590 2.4300 2.2806 2.3183 2.5306 2.3183 2.4280 2.4064 2.6320 2.5141 2.5141 2.3655 2.2189 2.4516 2.4516 2.0294 1.9669 1.9056

$CIC_2$: 3.2053 2.4646 2.7177 2.5896 2.7678 2.4646 2.5271 2.7678 2.9584 2.4410 2.3055 2.5035 2.7831 2.6796 2.5424 2.4410 2.6552 2.6817 3.2053 2.4540 2.6325 2.5035 2.4540 2.4069 2.3444 2.6561 3.4296 2.4305 2.6817 2.9584 2.4694 2.5944 2.4410 2.5700 2.7194 2.6817 2.4694 2.6817 2.5720 2.5936 2.3680 2.4859 2.4859 2.6345 2.7811 2.5484 2.5484 2.9706 3.0331 3.0944

$SIC_2$: 0.35895 0.50707 0.45645 0.48207 0.44645 0.50707 0.49457 0.44645 0.40832 0.51179 0.53891 0.49929 0.44338 0.46407 0.49151 0.51179 0.46895 0.46367 0.35895 0.50919 0.47351 0.49929 0.50919 0.51863 0.53113 0.46879 0.31407 0.51391 0.46367 0.40832 0.50613 0.48113 0.51179 0.48601 0.45613 0.46367 0.50613 0.46367 0.48560 0.48129 0.52641 0.50282 0.50282 0.47310 0.44379 0.49032 0.49032 0.40588 0.39338 0.38113

Table 7.6. Part of the correlation matrix for boiling points and topological indices of decanes.

|  | $BP$ | ${}^2\chi^v$ | ${}^1\chi^v$ | $IC_1$ | $CIC_1$ | $SIC_1$ | ${}^0\chi^v$ | ${}^3\chi^v$ | $mwc^{(2)}$ | $mwc^{(4)}$ | $W$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $BP$ | 1.000 | 0.679 | 0.587 | 0.513 | 0.513 | 0.513 | 0.485 | 0.478 | 0.447 | 0.290 | 0.254 |
| ${}^2\chi^v$ | 0.679 | 1.000 | 0.975 | 0.297 | 0.297 | 0.297 | 0.892 | 0.054 | 0.896 | 0.768 | 0.586 |
| ${}^1\chi^v$ | 0.587 | 0.975 | 1.000 | 0.302 | 0.302 | 0.302 | 0.970 | 0.163 | 0.964 | 0.876 | 0.732 |
| $IC_1$ | 0.513 | 0.297 | 0.302 | 1.000 | 1.000 | 1.000 | 0.310 | 0.042 | 0.272 | 0.222 | 0.283 |
| $CIC_1$ | 0.513 | 0.297 | 0.302 | 1.000 | 1.000 | 1.000 | 0.310 | 0.042 | 0.272 | 0.222 | 0.283 |
| $SIC_1$ | 0.513 | 0.297 | 0.302 | 1.000 | 1.000 | 1.000 | 0.310 | 0.042 | 0.272 | 0.222 | 0.283 |
| ${}^0\chi^v$ | 0.485 | 0.892 | 0.970 | 0.310 | 0.310 | 0.310 | 1.000 | 0.371 | 0.986 | 0.951 | 0.867 |
| ${}^3\chi^v$ | 0.478 | 0.054 | 0.163 | 0.042 | 0.042 | 0.042 | 0.371 | 1.000 | 0.368 | 0.539 | 0.641 |
| $mwc^{(2)}$ | 0.447 | 0.896 | 0.964 | 0.272 | 0.272 | 0.272 | 0.986 | 0.368 | 1.000 | 0.970 | 0.862 |
| $mwc^{(4)}$ | 0.290 | 0.768 | 0.876 | 0.222 | 0.222 | 0.222 | 0.951 | 0.539 | 0.970 | 1.000 | 0.943 |
| $W$ | 0.254 | 0.586 | 0.732 | 0.283 | 0.283 | 0.283 | 0.867 | 0.641 | 0.862 | 0.943 | 1.000 |
| $MTI$ | 0.237 | 0.558 | 0.708 | 0.281 | 0.281 | 0.281 | 0.850 | 0.654 | 0.844 | 0.931 | 0.999 |
| $mwc^{(6)}$ | 0.202 | 0.710 | 0.831 | 0.180 | 0.180 | 0.180 | 0.921 | 0.602 | 0.945 | 0.995 | 0.948 |
| $\lambda_1^A$ | 0.196 | 0.628 | 0.762 | 0.245 | 0.245 | 0.245 | 0.875 | 0.644 | 0.898 | 0.969 | 0.969 |
| $mwc^{(3)}$ | 0.175 | 0.680 | 0.818 | 0.195 | 0.195 | 0.195 | 0.922 | 0.665 | 0.932 | 0.986 | 0.954 |
| $mwc^{(5)}$ | 0.142 | 0.655 | 0.794 | 0.172 | 0.172 | 0.172 | 0.902 | 0.675 | 0.919 | 0.983 | 0.955 |
| $J$ | 0.141 | 0.553 | 0.707 | 0.195 | 0.195 | 0.195 | 0.848 | 0.696 | 0.853 | 0.949 | 0.990 |
| $mwc^{(8)}$ | 0.140 | 0.675 | 0.803 | 0.146 | 0.146 | 0.146 | 0.899 | 0.635 | 0.926 | 0.987 | 0.940 |
| $mwc^{(7)}$ | 0.105 | 0.637 | 0.777 | 0.144 | 0.144 | 0.144 | 0.885 | 0.684 | 0.908 | 0.977 | 0.945 |
| $twc$ | 0.097 | 0.642 | 0.779 | 0.131 | 0.131 | 0.131 | 0.883 | 0.674 | 0.909 | 0.978 | 0.937 |
| $IC_2$ | 0.002 | 0.459 | 0.500 | 0.594 | 0.594 | 0.594 | 0.511 | 0.260 | 0.540 | 0.551 | 0.435 |
| $CIC_2$ | 0.002 | 0.459 | 0.500 | 0.594 | 0.594 | 0.594 | 0.511 | 0.260 | 0.540 | 0.551 | 0.435 |
| $SIC_2$ | 0.002 | 0.459 | 0.500 | 0.594 | 0.594 | 0.594 | 0.511 | 0.260 | 0.540 | 0.551 | 0.435 |

Using all 18 remaining indices we obtain a linear model with $R^2 = 0.97439$ and $R^2_{CV} = 0.94191$. In order to avoid overfitting, we look for models containing fewer descriptors. As a rule of thumb, an additional degree of freedom, i.e. an additional descriptor, may be justified for every 5 additional observations. The problem of sensibly restricting the number of descriptors is difficult, in particular if there is no test set (see e.g. [305]).

For $n = 1, \ldots, 5$ we run through all $n$-subsets of the 18 topological indices and record the model of highest $R^2$. For each $n$, the descriptors $X_j$, $j \in n$, used as predictors are given, followed by the predicting function $f$ obtained using OLS regression. Further, the predicting functions are given in terms of autoscaled predictors $X^*_j$, which makes the influence of a particular descriptor on a model more visible.

$n = 1$ descriptor function: ${}^2\chi^v$,
$f = -8.0356X_0 + 190.74 = -5.0362X^*_0 + 157.85.$

$n = 2$ descriptor functions: $mwc^{(4)}$, $mwc^{(8)}$,
$f = -1.2961X_0 + 0.026540X_1 + 287.83 = -42.917X^*_0 + 41.312X^*_1 + 157.85.$

$n = 3$ descriptor functions: ${}^3\chi^v$, $twc$, $mwc^{(5)}$,
$f = 16.793X_0 + 0.0085894X_1 - 0.69764X_2 + 246.86 = 7.7409X^*_0 + 53.768X^*_1 - 59.883X^*_2 + 157.85.$

$n = 4$ descriptor functions: ${}^3\chi^v$, $mwc^{(6)}$, $mwc^{(7)}$, $mwc^{(8)}$,
$f = 10.930X_0 - 0.32884X_1 - 0.042581X_2 + 0.064274X_3 + 229.69 = 5.0382X^*_0 - 79.236X^*_1 - 25.319X^*_2 + 100.05X^*_3 + 157.85.$

$n = 5$ descriptor functions: $W$, ${}^3\chi^v$, $twc$, $mwc^{(4)}$, $mwc^{(8)}$,
$f = 0.44512X_0 + 9.7937X_1 - 0.0038957X_2 - 0.95038X_3 + 0.03649X_4 + 164.25 = 5.6464X^*_0 + 4.5145X^*_1 - 24.386X^*_2 - 31.468X^*_3 + 56.794X^*_4 + 157.85.$
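The exhaustive search over all $n$-subsets and the rewriting of a predicting function in autoscaled form can be sketched as follows. This is a minimal illustration using NumPy, not the MOLGEN–QSPR implementation; the function names `ols_fit`, `best_subset` and `autoscaled_form` are hypothetical, and the use of the sample standard deviation for autoscaling is an assumption.

```python
from itertools import combinations

import numpy as np


def ols_fit(X, y):
    """Fit y ~ X @ b + a by ordinary least squares; return (b, a, R^2)."""
    A = np.column_stack([X, np.ones(len(y))])   # last column carries the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef[:-1], coef[-1], r2


def best_subset(X, y, n):
    """Try every n-subset of the columns of X; return (columns, R^2) of the best fit."""
    best_cols, best_r2 = None, -np.inf
    for cols in combinations(range(X.shape[1]), n):
        r2 = ols_fit(X[:, list(cols)], y)[2]
        if r2 > best_r2:
            best_cols, best_r2 = cols, r2
    return best_cols, best_r2


def autoscaled_form(b, a, X):
    """Rewrite f = X @ b + a in terms of X* = (X - mean) / std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)      # assumption: sample standard deviation
    # For an OLS fit with intercept, the new constant equals the mean of y.
    return b * std, a + b @ mean
```

With 18 descriptors and $n = 5$ the loop visits 8568 subsets, which is still cheap. The constant term of the autoscaled form equals the mean of the observed property, which is why every autoscaled function above ends in the same value 157.85.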

Randomization experiments were performed to compare our models with random results. Thus, 50 random numbers were defined as a pseudo-$y$ vector, as well as 18 vectors of 50 random numbers acting as pseudodescriptors. By searching all subsets, the best fitting combination of $n$ pseudodescriptors was obtained. For a constant $n$, 100 such runs were performed, each on a fresh set of random numbers. The highest random $R^2$ from each run was recorded. The mean highest random $R^2$ ($mhrR^2$) and its standard deviation ($stdev$) were calculated and are as follows:

| $n$ | $mhrR^2$ | $stdev$ |
|---|---|---|
| 1 | 0.09505 | 0.04182 |
| 2 | 0.16055 | 0.06590 |
| 3 | 0.20423 | 0.07320 |
| 4 | 0.24440 | 0.07545 |
| 5 | 0.26857 | 0.07296 |
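A randomization experiment of this kind could be sketched as below. The text does not say how the random numbers were drawn, so uniform random numbers are an assumption here, and all function names are hypothetical. The helper `sigma_distance` expresses how far a real $R^2$ lies above $mhrR^2$, in units of $stdev$.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def highest_random_r2(n_obs, n_pool, n, rng):
    """One run: best R^2 over all n-subsets of n_pool random pseudodescriptors."""
    y = rng.random(n_obs)                 # pseudo-y vector (uniform: an assumption)
    X = rng.random((n_obs, n_pool))       # pseudodescriptors
    return max(r_squared(X[:, list(c)], y)
               for c in combinations(range(n_pool), n))


def mhr_r2(n_obs, n_pool, n, runs=100, seed=0):
    """Mean and standard deviation of the highest random R^2 over several runs."""
    rng = np.random.default_rng(seed)
    best = np.array([highest_random_r2(n_obs, n_pool, n, rng) for _ in range(runs)])
    return best.mean(), best.std(ddof=1)


def sigma_distance(r2, mhr, stdev):
    """Distance of a model's R^2 from mhrR^2 in standard deviations."""
    return (r2 - mhr) / stdev
```

For example, `sigma_distance(0.46101, 0.09505, 0.04182)` uses the $R^2$ of the best one-descriptor TI model and the first row of the table above.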

Thus, for our best models, the difference between $R^2$ and $mhrR^2$ is between 8.7 and 11.1 standard deviations, which means that the original models fit the data far better than the random models, and it is thus extremely unlikely that our models are based on chance correlations.

Table 7.7 shows the statistics $R^2$, $R^2_{CV}$, $S$, $S_{CV}$, and $F$, as well as the differences between statistics obtained by resubstitution and by LOOCV, for the best LM containing $n = 1, \ldots, 18$ topological indices. For $R^2$, $R^2_{CV}$ and $F$ the maximal values are marked by †, for the other columns the minimal values are marked by †. In Figures 7.6–7.8, $R^2$, $S$, and $F$ values from resubstitution are symbolized by open triangles; triangles highlighted in grey are values obtained by CV. Note that in Figures 7.6 and 7.7 the $y$ axes are scaled logarithmically. Obviously, in both figures the CV values get worse with increasing model complexity after passing through an optimum. This hints towards overfitting in the more complex models.

Table 7.7. Statistics of best linear models for BPs of decanes, containing one to 18 topological indices.

| $n$ | $R^2$ | $R^2_{CV}$ | $R^2 - R^2_{CV}$ | $S$ | $S_{CV}$ | $S_{CV} - S$ | $F$ |
|---|---|---|---|---|---|---|---|
| 1 | 0.46101 | 0.40131 | 0.059698 | 5.5019 | 5.7986 | 0.29669 | 41.06 |
| 2 | 0.89336 | 0.87999 | 0.013366 | 2.4732 | 2.6236 | 0.15042 | 196.87 |
| 3 | 0.93721 | 0.92689 | 0.010325 | 1.9183 | 2.0700 | 0.15172 | 228.87† |
| 4 | 0.95011 | 0.94126 | 0.008856† | 1.7287 | 1.8759 | 0.14718† | 214.27 |
| 5 | 0.95814 | 0.94709 | 0.011048 | 1.6015 | 1.8005 | 0.19896 | 201.42 |
| 6 | 0.96339 | 0.95022 | 0.013176 | 1.5149 | 1.7666† | 0.25173 | 188.62 |
| 7 | 0.96450 | 0.95043 | 0.014074 | 1.5095 | 1.7838 | 0.27431 | 163.02 |
| 8 | 0.96520 | 0.94761 | 0.017590 | 1.5127 | 1.8561 | 0.34331 | 142.13 |
| 9 | 0.96686 | 0.94794 | 0.018922 | 1.4944 | 1.8731 | 0.37868 | 129.67 |
| 10 | 0.97045 | 0.95468 | 0.015764 | 1.4292 | 1.7699 | 0.34062 | 128.07 |
| 11 | 0.97151 | 0.95542 | 0.016090 | 1.4216 | 1.7783 | 0.35671 | 117.81 |
| 12 | 0.97275 | 0.95591† | 0.016840 | 1.4092 | 1.7924 | 0.38323 | 110.05 |
| 13 | 0.97304 | 0.95294 | 0.020097 | 1.4209 | 1.8772 | 0.45629 | 99.94 |
| 14 | 0.97424 | 0.95061 | 0.023625 | 1.4087† | 1.9504 | 0.54171 | 94.53 |
| 15 | 0.97426 | 0.94917 | 0.025088 | 1.4287 | 2.0075 | 0.57889 | 85.79 |
| 16 | 0.97438 | 0.94563 | 0.028750 | 1.4468 | 2.1075 | 0.66078 | 78.43 |
| 17 | 0.97439† | 0.94191 | 0.032484 | 1.4688 | 2.2122 | 0.74343 | 71.63 |
| 18 | 0.97439† | 0.94191 | 0.032484 | 1.4923 | 2.2476 | 0.75532 | 65.53 |

Fig. 7.6. Statistics for the best LMs for BPs of decanes, containing 1–20 descriptors. [Plot: coefficient of determination $R^2$ vs. number of descriptors $n$; series: only TI, only SC, TI and SC, each from resubstitution and from CV.]

$R^2$ always increases with increasing $n$. Thus, to select a model on the basis of $R^2$, perhaps the rate of increase in $R^2$ could be used. $R^2_{CV}$ has its maximum value for $n = 12$. However, 12 descriptors are too many to describe 50 observations and some overfitting is expected in this situation. For the same reason the model exhibiting the smallest $S$ (14 descriptors) cannot be recommended. The model with $n = 6$ descriptors seems a reasonable choice, supported by its minimal $S_{CV}$. In [222] the difference $S_{CV} - S$ is considered a measure of stability of a QSPR. This reasoning would favor the model with 4 descriptors, supported by its likewise minimal difference $R^2 - R^2_{CV}$. Even the model containing 3 descriptors shows rather good statistics, e.g. a maximal $F$ value. In Figure 7.9, BPs calculated by this model (open circles) are plotted against experimental BPs, while those calculated using LOOCV are shown as closed circles. The good agreement between these two confirms the model's consistency. Moreover, this very simple model containing no more than three descriptors fits the experimental observations very well.

In the end, the choice of a criterion for model selection is left to the user. In Chapter 6 of [208] further criteria are discussed. Furthermore, for each $n$ the second-best models could be included in the set of candidates, or a systematic search for models exhibiting maximal $R^2_{CV}$ or minimal $S_{CV}$ could be performed, resulting, of course, in a higher effort.
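The resubstitution and leave-one-out statistics compared here can be computed together, as in the following sketch. The definitions used are assumptions, since they are not restated in this section: $S = \sqrt{SSE/(N-n-1)}$, $S_{CV} = \sqrt{PRESS/(N-n-1)}$, $R^2_{CV} = 1 - PRESS/SST$ and $F = \frac{(SST-SSE)/n}{SSE/(N-n-1)}$. They do reproduce the $n = 1$ row of Table 7.7 from one another. The function name `regression_statistics` is hypothetical.

```python
import numpy as np


def regression_statistics(X, y):
    """Resubstitution and leave-one-out statistics of an OLS model with intercept."""
    N, n = X.shape
    A = np.column_stack([X, np.ones(N)])
    H = A @ np.linalg.pinv(A)                 # hat matrix: y_hat = H @ y
    resid = y - H @ y
    # Leave-one-out residuals follow from the ordinary ones without refitting.
    loo_resid = resid / (1.0 - np.diag(H))
    sst = np.sum((y - y.mean()) ** 2)
    sse = resid @ resid
    press = loo_resid @ loo_resid
    dof = N - n - 1
    return {
        "R2": 1.0 - sse / sst,
        "R2_CV": 1.0 - press / sst,
        "S": np.sqrt(sse / dof),
        "S_CV": np.sqrt(press / dof),
        "F": ((sst - sse) / n) / (sse / dof),
    }
```

Because every leave-one-out residual is at least as large in magnitude as the corresponding ordinary residual, $S_{CV} \geq S$ and $R^2_{CV} \leq R^2$ always hold under these definitions; only the size of the gap carries information about stability.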

7.4.2 Linear modeling using substructure counts

Here we want to investigate another aspect of QSPR searching, i.e. the type of descriptors used. The substructures of 2–6 bonds occurring in the real library and their counts were recorded, see Example 7.6. Among the resulting 20 SC none are completely correlated for our compound sample. As we did above for TIs, we now calculate linear models of highest $R^2$ containing $n = 1, \ldots, 20$ SC descriptors. These are listed here for $n = 1, \ldots, 5$. The numbering of SCs is consistent with Figure 7.5. The $X^*_j$ are again autoscaled descriptor values.

$n = 1$ descriptor function: $SC_{14}$,
$f = -1.7205X_0 + 163.08 = -5.5175X^*_0 + 157.85.$

$n = 2$ descriptor functions: $SC_1$, $SC_5$,
$f = -6.3403X_0 + 1.6051X_1 + 216.85 = -10.703X^*_0 + 9.3142X^*_1 + 157.85.$

$n = 3$ descriptor functions: $SC_1$, $SC_9$, $SC_{15}$,
$f = -6.3067X_0 + 3.1880X_1 + 0.97551X_2 + 221.80 = -10.646X^*_0 + 6.5053X^*_1 + 4.1075X^*_2 + 157.85.$

$n = 4$ descriptor functions: $SC_1$, $SC_5$, $SC_6$, $SC_{19}$,
$f = -9.1531X_0 + 1.8270X_1 - 2.0494X_2 + 3.8403X_3 + 258.03 = -15.451X^*_0 + 10.602X^*_1 - 2.6580X^*_2 + 5.2283X^*_3 + 157.85.$

$n = 5$ descriptor functions: $SC_1$, $SC_5$, $SC_6$, $SC_{16}$, $SC_{19}$,
$f = -9.2522X_0 + 1.8724X_1 - 2.3131X_2 + 0.56818X_3 + 3.2492X_4 + 260.76 = -15.618X^*_0 + 10.866X^*_1 - 2.9999X^*_2 + 1.1390X^*_3 + 4.4235X^*_4 + 157.85.$

Fig. 7.7. Standard errors for the best BP models containing 1–20 descriptors. [Plot: standard error vs. number of descriptors; series: only TI, only SC, TI and SC, each from resubstitution and from CV.]

Fig. 7.8. F values for the best BP models containing 1–20 descriptors. [Plot: empirical $F$ vs. number of descriptors; series: only TI, only SC, TI and SC.]

Fig. 7.9. Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 3 TIs. [Plot: boiling point, calculated vs. boiling point, experimental; series: resubstitution, cross-validation.]

Again, random experiments were performed, this time using a pool of 20 random pseudodescriptors, obtaining best selections of $n = 1, \ldots, 5$ pseudodescriptors. The results are as follows:

| $n$ | $mhrR^2$ | $stdev$ |
|---|---|---|
| 1 | 0.10334 | 0.04447 |
| 2 | 0.17235 | 0.05730 |
| 3 | 0.20946 | 0.05504 |
| 4 | 0.25411 | 0.06564 |
| 5 | 0.28759 | 0.08269 |

For each of our best models, the difference between $R^2$ and $mhrR^2$ is between 8.1 and 12.3 standard deviations, which means that the original models fit the data far better than the random models, and it is extremely unlikely that the original models are based on chance correlations.

Table 7.8 shows statistics for assessing the models. For identical $n$, $R^2$ values are higher for the SC-based models than for the TI-based models, except for $n = 2$ or 3. In Figures 7.6–7.8 the results for the SC-based models are plotted, symbolized by inverted triangles (▽). As before, results of CV are highlighted in grey. For selection of a model based on SC, some of the arguments mentioned above recommend the four-descriptor model, such as its minimal differences $R^2 - R^2_{CV}$ and $S_{CV} - S$, and its maximal $F$. Figure 7.10 is a scatterplot of the calculated vs. experimental boiling points for this model, again containing values from both resubstitution and LOOCV.

Table 7.8. Statistics of the best linear models for BPs of decanes, containing 1–20 substructure counts.

| $n$ | $R^2$ | $R^2_{CV}$ | $R^2 - R^2_{CV}$ | $S$ | $S_{CV}$ | $S_{CV} - S$ | $F$ |
|---|---|---|---|---|---|---|---|
| 1 | 0.55334 | 0.51266 | 0.040681 | 5.0085 | 5.2316 | 0.22312 | 59.47 |
| 2 | 0.78507 | 0.74739 | 0.037677 | 3.5111 | 3.8064 | 0.29533 | 85.84 |
| 3 | 0.88372 | 0.86106 | 0.022654 | 2.6105 | 2.8535 | 0.24298 | 116.53 |
| 4 | 0.95621 | 0.94538 | 0.010825 | 1.6197 | 1.8089 | 0.18915 | 245.64 |
| 5 | 0.96185 | 0.94958 | 0.012277 | 1.5288 | 1.7577 | 0.22887 | 221.88 |
| 6 | 0.96669 | 0.95539 | 0.011298 | 1.4450 | 1.6723 | 0.22722 | 208.00 |
| 7 | 0.96869 | 0.95637 | 0.012318 | 1.4176 | 1.6733 | 0.25579 | 185.65 |
| 8 | 0.97133 | 0.96009 | 0.011242 | 1.3729 | 1.6199 | 0.24699 | 173.66 |
| 9 | 0.97167 | 0.95772 | 0.013947 | 1.3819 | 1.6880 | 0.30619 | 152.42 |
| 10 | 0.97202 | 0.95444 | 0.017577 | 1.3907 | 1.7746 | 0.38383 | 135.48 |
| 11 | 0.97566 | 0.95558 | 0.020084 | 1.3140 | 1.7752 | 0.46121 | 138.48 |
| 12 | 0.97587 | 0.95267 | 0.023197 | 1.3259 | 1.8569 | 0.53100 | 124.69 |
| 13 | 0.97627 | 0.95329 | 0.022978 | 1.3330 | 1.8701 | 0.53718 | 113.94 |
| 14 | 0.97676 | 0.94543 | 0.031325 | 1.3380 | 2.0501 | 0.71210 | 105.05 |
| 15 | 0.97711 | 0.94390 | 0.033215 | 1.3471 | 2.1090 | 0.76199 | 96.78 |
| 16 | 0.97790 | 0.93361 | 0.044297 | 1.3435 | 2.3289 | 0.98536 | 91.28 |
| 17 | 0.98026 | 0.94166 | 0.038607 | 1.2894 | 2.2170 | 0.92754 | 93.49 |
| 18 | 0.98028 | 0.93778 | 0.042497 | 1.3095 | 2.3260 | 1.01650 | 85.61 |
| 19 | 0.98030 | 0.93399 | 0.046304 | 1.3306 | 2.4354 | 1.10480 | 78.56 |
| 20 | 0.98030 | 0.93399 | 0.046304 | 1.3534 | 2.4771 | 1.12370 | 72.14 |

Fig. 7.10. Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 4 SCs. [Plot: boiling point, calculated vs. boiling point, experimental; series: resubstitution, cross-validation.]

7.4.3 Linear modeling using both topological indices and substructure counts

Finally, we will use the 18 topological indices and the 20 substructure counts to calculate the best linear models. Initially, we calculate the correlation matrix. One pair of completely correlated descriptors is found: for all $M$ in our real library (and for all decanes) the following is true:

$SC_1(M) = \tfrac{1}{2}\, mwc^{(2)}(M) - 9.$

Therefore we neglect $SC_1$. The linear model containing all the remaining 37 descriptors has $R^2 = 0.98756$. The considerably lower $R^2_{CV} = 0.88667$ clearly hints towards overfitting. The best linear models with $1, \ldots, 5$ descriptors are the following:

$n = 1$ descriptor function: $SC_{14}$,
$f = -1.7205X_0 + 163.08 = -5.5175X^*_0 + 157.85.$

$n = 2$ descriptor functions: $mwc^{(4)}$, $mwc^{(8)}$,
$f = -1.2961X_0 + 0.026540X_1 + 287.83 = -42.917X^*_0 + 41.312X^*_1 + 157.85.$

$n = 3$ descriptor functions: $W$, $SC_2$, $SC_{19}$,
$f = 1.1519X_0 + 6.3865X_1 + 2.1586X_2 - 58.907 = 14.612X^*_0 + 13.349X^*_1 + 2.9387X^*_2 + 157.85.$

$n = 4$ descriptor functions: $W$, ${}^2\chi^v$, $SC_5$, $SC_{19}$,
$f = 0.67431X_0 - 11.032X_1 + 1.7445X_2 + 2.1229X_3 + 98.079 = 8.5535X^*_0 - 6.9138X^*_1 + 10.123X^*_2 + 2.8901X^*_3 + 157.85.$

$n = 5$ descriptor functions: $W$, ${}^3\chi^v$, $MTI$, $SC_5$, $SC_{19}$,
$f = 11.320X_0 + 6.6378X_1 - 2.6968X_2 + 1.6403X_3 + 2.7967X_4 + 77.497 = 143.60X^*_0 + 3.0598X^*_1 - 129.06X^*_2 + 9.5189X^*_3 + 3.8075X^*_4 + 157.85.$
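The complete correlation between $SC_1$ and $mwc^{(2)}$ noted at the start of this subsection can be checked on individual carbon skeletons. The sketch below rests on two assumptions that are not restated in this section: that $mwc^{(2)}$ is the total number of walks of length 2 in the hydrogen-suppressed graph (which equals the sum of squared vertex degrees), and that $SC_1$ counts the substructure made of two adjacent bonds. All function and variable names are hypothetical.

```python
from math import comb


def vertex_degrees(n_atoms, bonds):
    """Degree of every atom in a graph given as a list of bonds (atom pairs)."""
    deg = [0] * n_atoms
    for a, b in bonds:
        deg[a] += 1
        deg[b] += 1
    return deg


def mwc2(n_atoms, bonds):
    """Number of walks of length 2 = sum of squared vertex degrees."""
    return sum(d * d for d in vertex_degrees(n_atoms, bonds))


def two_bond_paths(n_atoms, bonds):
    """Number of subgraphs consisting of two adjacent bonds."""
    return sum(comb(d, 2) for d in vertex_degrees(n_atoms, bonds))


# Carbon skeletons of three decanes (atoms numbered from 0, 9 bonds each).
n_decane = [(i, i + 1) for i in range(9)]
methylnonane_2 = [(i, i + 1) for i in range(8)] + [(1, 9)]
tetramethylhexane_2233 = [(i, i + 1) for i in range(5)] + [(1, 6), (1, 7), (2, 8), (2, 9)]
```

Under these assumptions the identity follows from $\sum_i \binom{d_i}{2} = \tfrac{1}{2}\sum_i d_i^2 - \tfrac{1}{2}\sum_i d_i$, where the degree sum of any decane skeleton is twice its 9 bonds, i.e. 18.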

Random experiments to select a combination of $n = 1, \ldots, 5$ out of 37 pseudodescriptors to fit 50 pseudoobservations resulted in

| $n$ | $mhrR^2$ | $stdev$ |
|---|---|---|
| 1 | 0.11276 | 0.04090 |
| 2 | 0.20298 | 0.05518 |
| 3 | 0.27160 | 0.06694 |
| 4 | 0.31647 | 0.06609 |
| 5 | 0.36898 | 0.06610 |

The $mhrR^2$ values here are somewhat higher than before, illustrating the effect of more descriptors to select from (37 instead of 18 or 20 descriptors as before). Nevertheless, for our best models, the difference between $R^2$ and $mhrR^2$ is between 9.0 and 12.5 standard deviations, which means that the original models fit the data far better than the random models, and it is thus extremely unlikely that our models are based on chance correlations.

Starting from $n = 3$ descriptors, the models with the highest $R^2$ contain both TIs and SCs. Thus, for $n \geq 3$, models combining both kinds of descriptors show higher $R^2$ than those restricted to one or the other kind of descriptors. This is also seen in Figure 7.6, where results for 'mixed' models are shown as open circles, and $R^2_{CV}$ are shown as open circles highlighted in grey. Obviously, the mixed models are also of advantage with respect to $R^2_{CV}$, as they are more consistent.

Table 7.9 presents the statistics of the best mixed models for $n = 1, \ldots, 10$. We note the minimal difference $S_{CV} - S$ and maximal $F$ for $n = 3$, the minimal difference $R^2 - R^2_{CV}$ for $n = 5$, and the minimal $S$ for $n = 7$. Scatterplots of calculated vs. experimental BP are shown in Figure 7.11 for the best 3-descriptor model and in Figure 7.12 for the best 7-descriptor model.

Fig. 7.11. Scatterplot of calculated vs. experimental BPs of decanes for the best model containing 3 descriptors (TI and SC). [Plot: boiling point, calculated vs. boiling point, experimental; series: resubstitution, cross-validation.]

Fig. 7.12. Scatterplot of calculated vs. experimental BPs of decanes for the best LM model, using 7 descriptors (TI and SC). [Plot: boiling point, calculated vs. boiling point, experimental; series: resubstitution, cross-validation.]

Table 7.9. Statistics for the best linear models containing $n$ descriptors (out of 18 TIs and 19 SCs) for the BPs of decanes.

| $n$ | $R^2$ | $R^2_{CV}$ | $R^2 - R^2_{CV}$ | $S$ | $S_{CV}$ | $S_{CV} - S$ | $F$ |
|---|---|---|---|---|---|---|---|
| 1 | 0.55334 | 0.51266 | 0.040681 | 5.0085 | 5.2316 | 0.22312 | 59.47 |
| 2 | 0.89336 | 0.87999 | 0.013366 | 2.4732 | 2.6236 | 0.15042 | 196.87 |
| 3 | 0.95302 | 0.94538 | 0.007637 | 1.6594 | 1.7891 | 0.12978 | 311.02 |
| 4 | 0.96133 | 0.95135 | 0.009978 | 1.5220 | 1.7072 | 0.18511 | 279.67 |
| 5 | 0.96734 | 0.95785 | 0.009491 | 1.4145 | 1.6069 | 0.19245 | 260.68 |
| 6 | 0.96932 | 0.95936 | 0.009961 | 1.3868 | 1.5961 | 0.20934 | 226.45 |
| 7 | 0.97097 | 0.96045 | 0.010512 | 1.3651 | 1.5932 | 0.22808 | 200.66 |
| 8 | 0.97230 | 0.95746 | 0.014840 | 1.3496 | 1.6724 | 0.32288 | 179.89 |
| 9 | 0.97395 | 0.96062 | 0.013332 | 1.3250 | 1.6292 | 0.30415 | 166.16 |
| 10 | 0.97626 | 0.96129 | 0.014965 | 1.2810 | 1.6357 | 0.35465 | 160.37 |

Summary and interpretation

In our example, boiling points of decanes, substructure counts enabled us to construct good QSPR models and are an alternative to topological indices. The models containing both types of descriptors performed even better. In Table 7.10 those descriptors that appear in the best LM with $n = 1, \ldots, 10$ descriptors, separately for models containing TIs, SCs, and both types, are marked by crosses. Both among TIs and among SCs there are some descriptors that rarely appear in best models, while others appear frequently. Thus, $SC_{19}$ is contained in each best SC model of $n \geq 4$ and in each best TI–SC model of $n \geq 3$. Nevertheless, the weight of $SC_{19}$ is rather low, as seen in the predicting functions written in terms of autoscaled predictors. $SC_{19}$ therefore seems to be an important 'correcting term'. Another important SC for BP modeling is $SC_5$. Among TIs, the molecular walk counts, ${}^3\chi^v$ and $W$ seem to be influential. On the other hand, $J$, $\lambda_1^A$ and both remaining information theoretical indices $IC_1$ and $IC_2$ seem unimportant. The latter fact is remarkable since $IC_1$ has the third-highest correlation coefficient with BP, among all the TIs considered.

7.4.4 Further descriptors and regression methods

Let us check whether even better linear models are obtained by adding geometrical descriptors to our descriptor pool. Upon addition of the 35 geometrical indices from Appendix A.3 and calculation of best linear models containing $n = 1, \ldots, 5$ descriptors, a geometrical descriptor enters the model only once:

$n = 4$ descriptor functions: $W$, $SC_2$, $SC_{19}$, $ssSHWD3$,
$f = 1.1830X_0 - 6.3133X_1 + 2.3076X_2 + 0.23098X_3 + 70.914 = 15.006X^*_0 - 13.196X^*_1 + 3.1416X^*_2 + 0.88281X^*_3 + 157.85.$

Statistics here are $R^2 = 0.96358$, $R^2_{CV} = 0.95549$, $S = 1.4772$, $S_{CV} = 1.63299$ and $F = 297.61$.


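Table 7.10. Best $n$-subsets of descriptors for BP models containing TIs, SCs and both types of descriptors. The first ten columns refer to $n$ (TI or SC only), the last ten columns to $n$ (TI and SC).

| Descriptor | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $W$ |   |   |   |   | × |   | × | × |   |   |   |   | × | × | × | × |   |   |   |   |
| ${}^0\chi^v$ |   |   |   |   |   | × | × |   | × | × |   |   |   |   |   |   |   | × | × | × |
| ${}^1\chi^v$ |   |   |   |   |   |   |   |   | × | × |   |   |   |   |   |   |   | × | × | × |
| ${}^2\chi^v$ | × |   |   |   |   |   |   | × |   | × |   |   |   | × |   |   |   | × | × | × |
| ${}^3\chi^v$ |   |   | × | × | × |   |   |   | × |   |   |   |   |   | × | × |   |   |   |   |
| $J$ |   |   |   |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |
| $MTI$ |   |   |   |   |   | × |   | × |   |   |   |   |   |   | × | × | × |   |   |   |
| $twc$ |   |   | × |   | × | × | × |   | × | × |   |   |   |   |   |   |   |   |   |   |
| $mwc^{(2)}$ |   |   |   |   |   |   |   |   | × | × |   |   |   |   |   |   |   |   |   |   |
| $mwc^{(3)}$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | × |   |   |   |
| $mwc^{(4)}$ |   | × |   |   | × |   |   | × | × | × |   | × |   |   |   |   | × |   |   | × |
| $mwc^{(5)}$ |   |   | × |   |   | × | × |   |   | × |   |   |   |   |   |   |   |   |   |   |
| $mwc^{(6)}$ |   |   |   | × |   | × | × | × | × | × |   |   |   |   |   |   | × |   |   |   |
| $mwc^{(7)}$ |   |   |   | × |   |   |   | × | × |   |   |   |   |   |   |   |   |   |   |   |
| $mwc^{(8)}$ |   | × |   | × | × | × | × | × | × | × |   | × |   |   |   |   |   |   |   |   |
| $\lambda_1^A$ |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $IC_1$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $IC_2$ |   |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_1$ |   | × | × | × | × | × | × | × | × | × |   |   |   |   |   |   |   |   |   |   |
| $SC_2$ |   |   |   |   |   |   |   | × | × | × |   |   | × |   |   |   |   |   |   |   |
| $SC_3$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_4$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | × |
| $SC_5$ |   | × |   | × | × | × | × | × | × | × |   |   |   | × | × | × |   |   |   | × |
| $SC_6$ |   |   |   | × | × | × | × | × | × | × |   |   |   |   |   |   |   | × | × |   |
| $SC_7$ |   |   |   |   |   |   |   | × | × | × |   |   |   |   |   |   | × |   | × |   |
| $SC_8$ |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_9$ |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_{10}$ |   |   |   |   |   |   |   | × | × | × |   |   |   |   |   |   |   | × | × | × |
| $SC_{11}$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_{12}$ |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |   |
| $SC_{13}$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | × | × | × |
| $SC_{14}$ | × |   |   |   |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |
| $SC_{15}$ |   |   | × |   |   |   | × | × | × | × |   |   |   |   |   |   | × |   |   |   |
| $SC_{16}$ |   |   |   |   | × |   | × |   | × | × |   |   |   |   |   |   |   | × | × | × |
| $SC_{17}$ |   |   |   |   |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |
| $SC_{18}$ |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   | × |   |   |   |   |
| $SC_{19}$ |   |   |   | × | × | × | × | × | × | × |   |   | × | × | × | × | × | × | × | × |
| $SC_{20}$ |   |   |   |   |   | × |   |   |   |   |   |   |   |   |   |   |   |   |   |   |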

Table 7.11. $R^2$ of best models for the BPs of decanes obtained by various methods.

| Method | $n=1$ | $n=2$ | $n=3$ | $n=4$ | $n=5$ |
|---|---|---|---|---|---|
| MLR | 0.55334 | 0.89336 | 0.95302 | 0.96133 | 0.96734 |
| ANN, 1HN | 0.57729 | 0.30807 | 0.95443 | 0.96074 | 0.85732 |
| ANN, 2HN | 0.57958 | 0.89365 | 0.95443 | 0.96148 | 0.96838 |
| ANN, 3HN | 0.57981 | 0.88842 | 0.95380 | 0.96126 | 0.96632 |
| SVM, lin | 0.55046 | 0.62414 | 0.94951 | 0.95835 | 0.83026 |
| SVM, pol | 0.56943 | 0.71620 | 0.94885 | 0.95292 | 0.84941 |
| SVM, rad | 0.53903 | 0.59737 | 0.89956 | 0.89266 | 0.82629 |

Likewise, other regression methods barely provide better models. Table 7.11 shows $R^2$ values of models resulting from neural networks containing 1–3 hidden neurons and from SVM with linear, polynomial (degree = 2), and radial kernels. To ensure reproducibility of the ANNs, these were trained using starting weights 0, resulting in very low $R^2$ in some cases. This improves as soon as random starting weights are applied. Data in Table 7.11 are not quite comparable to those obtained earlier for LM, since the descriptor subsets that had turned out best for LM were used, rather than systematically searching for the best subsets again.

The algorithm for growing regression trees uses its own logic for selection of predictors (see Subsection 6.2.4). For this example a regression tree of 9 terminal nodes was obtained that used ${}^1\chi^v$, ${}^3\chi^v$, $SC_6$, $SC_{14}$, $SC_{16}$ and had $R^2 = 0.84386$. Due to their limited co-domain, RT are more successful in cases where other methods cannot find useful correlations (see Section 7.6).

Thus, linear regression with best subset search seems to be well-suited for modeling BPs of decanes.

7.4.5 Prediction

There are 75 constitutional isomers altogether for $\mathrm{C_{10}H_{22}}$. We generated this virtual library and then removed the 50 isomers contained in the real library using canonical numbering. For the remaining 25 isomers either a BP is not known, or the compound itself is unknown according to the Beilstein database. We selected the best TI–SC 3-descriptor model for predicting the corresponding BPs. These are shown, together with the structures, in Figure 7.13, in the order of increasing predicted BPs.

Fig. 7.13. Purely virtual library of 25 decanes with predicted BPs. [Structure drawings 1–25 with predicted BPs, in this order: 151.25, 152.50, 152.92, 154.18, 155.86, 156.19, 156.67, 156.86, 157.53, 158.26, 158.79, 159.3, 159.41, 159.94, 161.09, 161.57, 161.72, 162.58, 162.77, 163.35, 164.36, 165.61, 165.90, 167.43, 168.21.]

7.5 Case studies with separate learning and test sets

A fundamental physical property of a compound is its physical density (PD). This quantity is defined as the ratio of mass and volume at 20 °C and normal pressure. In the following we calculate QSPR models of PD of propyl acrylates. We start with 166 propyl acrylates together with their PDs, found in the Beilstein database. Propyl acrylates have the following substructure in common:

[Structure diagram of the propyl acrylate substructure, an ester with two O atoms.]

Our structure search result included many compounds in which the C=C double bond is part of an aromatic system. Although strictly speaking these are not acrylates, we kept them in our QSPR study. Five compounds had to be excluded for densities measured at temperatures other than 20 °C or for other dubious data; one organotin compound likewise was removed due to its unusual composition. Finally there were 160 compounds in the real library.

7.5.1 Preprocessing of structures

In this example we demonstrate validation of QSPR models using a test set. The real library was randomly partitioned into learning set and test set of 80 compounds each. The structures had to be subjected to some preprocessing:

– H atoms: In the SD file exported by Beilstein, structures are coded without H atoms. H atoms are, however, required for calculating some indices and for calculating 3D molecular models. H atoms were thus added according to the valences of all non-H atoms. (The MDL SD file format is a popular exchange format for molecular structures, see [53].)

Table 7.12. Atomic profile of the real library of propyl acrylates. Count

H

C

N

O

F

Si

P

1

0

0

2

0

0

31

0

0

3

2

53

0

0

3

0

4

0

0

0

22

7

0

0

43

0

5

0

0

0

11

6–10

12

60

0

11–15

53

64

16–20

46

23

21–25

21

26–30 ≥ 31 ∑

– –

S

Cl

Br

SB

DB

TB

AB

14

7

13

4

0

33

1

0

0

0

5

1

0

56

0

0

0

0

0

1

0

0

46

0

0

0

0

0

1

0

0

18

0

0

0

0

0

0

1

0

10

3

0

8

31

0

0

0

0

0

0

70

5

0

48

0

0

0

0

0

0

0

0

50

0

0

12

0

0

0

0

0

0

0

0

25

0

0

1

10

0

0

0

0

0

0

0

0

3

0

0

0

19

2

0

0

0

0

0

0

0

0

2

0

0

0

9

1

0

0

0

0

0

0

0

0

0

0

0

0

160

160

33

160

7

3

14

7

21

5

160

160

1

69

– Aromaticity: Some descriptors take aromaticity into account. For correct calculation, the aromatic bonds have to be marked as such according to Definition 2.22.

– 3D placement: Since we planned to include geometrical descriptors, we had to calculate 3D placements according to Section 2.5. In order to avoid unrealistic placements such as penetration of a ring system by a bond, each structure was optimized several times, with each run starting from a fresh set of random atom coordinates. The 3D structure with the lowest energy after optimization was selected. Using this procedure the occurrence of unrealistic geometries is decreased drastically.

To save space, the real library is not depicted here in detail. An overview is given in Table 7.12, which shows element composition and the numbers of single, double, triple and aromatic bonds (SB, DB, TB, AB), a representation which we call atomic profile. The library contains compounds made of ten chemical elements. All compounds contain oxygen according to the given substructure; other heteroatoms are less abundant. 69 compounds are aromatic.

Table 7.13 contains the distributions of experimental density and of descriptors for the size of molecules, i.e. number of atoms ($A$), molecular weight ($MW$), and van der Waals volume ($V_{vdw}$). Minimum, maximum, first and third quartile, median and mean are given for each quantity, for the library as a whole and for LS and TS separately. LS and TS are similar according to this information. Important predictors will be constructed from $A$, $MW$ and $V_{vdw}$.
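A random partition into learning and test set, followed by the kind of six-number summary used to compare the two sets, could look like the sketch below. It is an illustration only: the function names are hypothetical, and the quartile convention (linear interpolation between order statistics, NumPy's default) is an assumption about how the quartiles were computed.

```python
import numpy as np


def random_split(n_items, n_learning, seed=0):
    """Randomly partition the indices 0..n_items-1 into learning and test set."""
    order = np.random.default_rng(seed).permutation(n_items)
    return np.sort(order[:n_learning]), np.sort(order[n_learning:])


def six_number_summary(values):
    """Minimum, first quartile, median, mean, third quartile and maximum."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])   # linear interpolation
    return {"min": v.min(), "q1": q1, "median": median,
            "mean": v.mean(), "q3": q3, "max": v.max()}
```

Calling `random_split(160, 80)` yields two disjoint index sets of 80 compounds each; applying `six_number_summary` to a property restricted to each set gives one row per set, which makes a badly unbalanced split easy to spot before any model is fitted.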


Table 7.13. Distribution of some properties within the real library of propyl acrylates.

| Property | Set | Min. | 1. Quart. | Median | Mean | 3. Quart. | Max. |
|---|---|---|---|---|---|---|---|
| $PD$ [g/cm³] | all | 0.873 | 1.010 | 1.066 | 1.081 | 1.116 | 1.534 |
| | LS | 0.873 | 1.009 | 1.049 | 1.080 | 1.116 | 1.534 |
| | TS | 0.883 | 1.016 | 1.073 | 1.081 | 1.114 | 1.453 |
| $A$ | all | 18.00 | 27.00 | 33.00 | 35.74 | 43.00 | 82.00 |
| | LS | 18.00 | 24.75 | 34.00 | 35.75 | 43.25 | 70.00 |
| | TS | 20.00 | 28.00 | 33.00 | 35.73 | 41.00 | 82.00 |
| $MW$ (incl. H) [Da] | all | 114.1 | 190.2 | 238.8 | 251.8 | 300.4 | 580.7 |
| | LS | 114.1 | 183.3 | 242.7 | 249.5 | 305.3 | 454.6 |
| | TS | 128.2 | 190.2 | 233.7 | 254.2 | 295.7 | 580.7 |
| $V_{vdw}$ [Å³] | all | 121.3 | 183.5 | 220.6 | 242.6 | 280.5 | 535.5 |
| | LS | 121.3 | 176.1 | 230.2 | 241.9 | 282.0 | 462.1 |
| | TS | 138.9 | 191.1 | 218.6 | 243.3 | 279.4 | 535.5 |

7.5.2 Choice of descriptors

In contrast to the example in the previous section we now have compounds with several molecular formulas in our real library. It therefore makes sense to use arithmetical descriptors. In MOLGEN–QSPR there are 48 arithmetical descriptors available (see Appendix A.1). Since there are no compounds containing iodine, no free radicals and no charged species in the library, $N_I$, rel. $N_I$, $rad$ and $charge$ are useless. Further, there is only one compound containing a triple bond. Therefore the descriptors $n\#$ and rel. $n\#$ are also excluded. Fluorine, bromine, and sulfur-containing compounds are scarce, so that an even distribution over LS and TS is questionable. Therefore, we also exclude $N_F$, rel. $N_F$, $N_S$, rel. $N_S$, $N_{Br}$, rel. $N_{Br}$. Thus, 35 arithmetical indices are retained.

For topological indices, we start with the 30 indices listed in the beginning of Section 7.4 and again exclude $M_1$ and $M_2$ for the reasons given there. Among the remaining 28 TIs there is no pairwise complete correlation within our library. The function $twc$ should be used with caution: its values increase exponentially with increasing number of bonds and with increasing topological density, i.e. the quotient of bond count and atom count of a molecular graph. In our real library the highest $twc$ values are

69,959,869,638,977,272
2,924,196,666,599,052
130,752,536,580,380

The highest $twc$ value is an order of magnitude higher than the next highest, and so on. This may cause fatal consequences for linear (and nonlinear) models: if there are no compounds with highest $twc$ in the LS, completely unrealistic predictions for such compounds will result. This problem is solved by replacing $twc$ by its natural logarithm $\ln(twc)$.

The physical density certainly will depend on the 3D shape of single molecules in some way unknown to us. Therefore we include the 35 geometrical indices from Appendix A.3 in our QSPR study. Thus, altogether we have a pool of 98 descriptors.

Table 7.14 shows a part of the correlation matrix for PD and the molecular descriptors, calculated for the complete compound sample; signs are omitted. Not surprisingly, the highest correlation with PD is exhibited by the van der Waals density $\rho_{vdw}$, the ratio of mass and volume of a single molecule. Almost as highly correlated with PD is the mean atomic mass mean $AW$ (incl. H), the ratio of molecular mass and atom count. This quantity is far more easily calculated than $\rho_{vdw}$, while both quantities are strongly correlated ($r = 0.980$). The consequences of this correlation will be seen in the construction of QSPR models. The other geometrical indices have rather low correlation with PD.

Table 7.14. Part of the correlation matrix for PD and descriptors for the real library of propyl acrylates.

| | $PD$ | $\rho_{vdw}$ | mean $AW$ (incl. H) | $IC_0$ | rel. $N_H$ | $IC_1$ | mean $AW$ | $SIC_0$ | $SIC_1$ | $IC_2$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $PD$ | 1.000 | 0.937 | 0.934 | 0.801 | 0.787 | 0.784 | 0.772 | 0.634 | 0.620 | 0.514 |
| $\rho_{vdw}$ | 0.937 | 1.000 | 0.980 | 0.825 | 0.715 | 0.706 | 0.902 | 0.722 | 0.634 | 0.365 |
| mean $AW$ (incl. H) | 0.934 | 0.980 | 1.000 | 0.849 | 0.808 | 0.756 | 0.847 | 0.787 | 0.722 | 0.394 |
| $IC_0$ | 0.801 | 0.825 | 0.849 | 1.000 | 0.778 | 0.851 | 0.628 | 0.823 | 0.715 | 0.446 |
| rel. $N_H$ | 0.787 | 0.715 | 0.808 | 0.778 | 1.000 | 0.847 | 0.376 | 0.695 | 0.748 | 0.531 |
| $IC_1$ | 0.784 | 0.706 | 0.756 | 0.851 | 0.847 | 1.000 | 0.432 | 0.689 | 0.796 | 0.725 |
| mean $AW$ | 0.772 | 0.902 | 0.847 | 0.628 | 0.376 | 0.432 | 1.000 | 0.611 | 0.469 | 0.161 |
| $SIC_0$ | 0.634 | 0.722 | 0.787 | 0.823 | 0.695 | 0.689 | 0.611 | 1.000 | 0.921 | 0.271 |
| $SIC_1$ | 0.620 | 0.634 | 0.722 | 0.715 | 0.748 | 0.796 | 0.469 | 0.921 | 1.000 | 0.465 |
| $IC_2$ | 0.514 | 0.365 | 0.394 | 0.446 | 0.531 | 0.725 | 0.161 | 0.271 | 0.465 | 1.000 |
| $N_{Cl}$ | 0.498 | 0.601 | 0.629 | 0.615 | 0.491 | 0.445 | 0.504 | 0.486 | 0.363 | 0.065 |
| rel. $N_{Cl}$ | 0.496 | 0.607 | 0.641 | 0.626 | 0.501 | 0.456 | 0.516 | 0.534 | 0.412 | 0.088 |
| $G_2$ | 0.484 | 0.410 | 0.343 | 0.400 | 0.291 | 0.421 | 0.278 | 0.120 | 0.136 | 0.389 |
| $G_1$ | 0.477 | 0.419 | 0.357 | 0.409 | 0.309 | 0.404 | 0.278 | 0.117 | 0.152 | 0.335 |
| rel. $N_{Br}$ | 0.456 | 0.548 | 0.478 | 0.112 | 0.031 | 0.016 | 0.755 | 0.217 | 0.147 | 0.036 |
| $N_{Br}$ | 0.455 | 0.542 | 0.471 | 0.107 | 0.028 | 0.013 | 0.747 | 0.201 | 0.132 | 0.042 |
| $CIC_1$ | 0.430 | 0.492 | 0.583 | 0.528 | 0.576 | 0.562 | 0.405 | 0.889 | 0.942 | 0.253 |
| $G_2$ (incl. H) | 0.429 | 0.351 | 0.280 | 0.345 | 0.230 | 0.368 | 0.233 | 0.185 | 0.201 | 0.365 |
| $G_1$ (incl. H) | 0.423 | 0.360 | 0.293 | 0.352 | 0.246 | 0.350 | 0.234 | 0.183 | 0.216 | 0.316 |
| rel. $N_O$ | 0.413 | 0.344 | 0.337 | 0.433 | 0.380 | 0.373 | 0.187 | 0.408 | 0.374 | 0.084 |
| $MW$ | 0.378 | 0.288 | 0.226 | 0.284 | 0.212 | 0.292 | 0.162 | 0.263 | 0.291 | 0.310 |
| $SIC_2$ | 0.374 | 0.390 | 0.478 | 0.401 | 0.503 | 0.529 | 0.317 | 0.756 | 0.873 | 0.508 |
| $\lambda_1^A$ | 0.344 | 0.214 | 0.220 | 0.279 | 0.399 | 0.508 | 0.017 | 0.090 | 0.056 | 0.473 |
| $MW$ (incl. H) | 0.332 | 0.241 | 0.175 | 0.239 | 0.162 | 0.249 | 0.127 | 0.310 | 0.337 | 0.289 |
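The effect of replacing $twc$ by $\ln(twc)$ in this subsection can be illustrated on the three highest $twc$ values quoted from the real library. This is only a small numerical sketch; the variable names are hypothetical.

```python
import math

# The three highest twc values quoted from the real library.
twc_values = [
    69_959_869_638_977_272,
    2_924_196_666_599_052,
    130_752_536_580_380,
]

# Ratios between consecutive values: each step spans more than a factor of ten.
ratios = [a / b for a, b in zip(twc_values, twc_values[1:])]

# After the logarithmic transformation the same compounds differ by only a few
# units, so a single extreme compound no longer dominates a linear model.
ln_twc = [math.log(v) for v in twc_values]
spread = max(ln_twc) - min(ln_twc)
```

On the raw scale the largest value exceeds the third largest by a factor of several hundred, whereas the corresponding $\ln(twc)$ values differ by less than 7.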

7.5 Case studies with separate learning and test sets


7.5.3 Linear modeling by best subset selection

The linear models of highest 𝑅2𝐿𝑆 (for the learning set) containing 𝑛 = 1, . . . , 5 descriptors are the following:

𝑛 = 1 descriptor function: 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.89996𝑋0 + 0.14597,
𝑅2𝐿𝑆 = 0.87968, 𝑆𝐿𝑆 = 0.047187, 𝐹𝐿𝑆 = 570.26.

𝑛 = 2 descriptor functions: 𝐼𝐶2 , 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.12217𝑋0 + 0.81813𝑋1 − 0.21835,
𝑅2𝐿𝑆 = 0.91688, 𝑆𝐿𝑆 = 0.039474, 𝐹𝐿𝑆 = 424.68.

𝑛 = 3 descriptor functions: 𝑁𝑂 , 𝑟𝑒𝑙. 𝑛 − (𝑖𝑛𝑐𝑙. 𝐻), 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.015912𝑋0 − 0.30250𝑋1 + 0.84571𝑋2 + 0.39659,
𝑅2𝐿𝑆 = 0.93813, 𝑆𝐿𝑆 = 0.034280, 𝐹𝐿𝑆 = 384.12.

𝑛 = 4 descriptor functions: 𝐴, 𝑟𝑒𝑙. 𝑁𝑂 , ⁰𝜒, 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.064792𝑋0 + 0.67865𝑋1 − 0.084221𝑋2 + 0.89515𝑋3 + 0.062349,
𝑅2𝐿𝑆 = 0.95060, 𝑆𝐿𝑆 = 0.030836, 𝐹𝐿𝑆 = 360.76.

𝑛 = 5 descriptor functions: ¹𝜒, ⁰𝜒𝑣 , 𝐶𝐼𝐶0 , 𝐼𝐶1 , 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.034734𝑋0 − 0.041353𝑋1 + 0.14269𝑋2 + 0.12314𝑋3 + 1.0380𝑋4 − 0.66073,
𝑅2𝐿𝑆 = 0.95481, 𝑆𝐿𝑆 = 0.029690, 𝐹𝐿𝑆 = 312.70.

Random experiments were performed to select best combinations of 𝑛 = 1, . . . , 5 out of 98 pseudodescriptors for 80 pseudoobservations. The results are as follows:

𝑛    𝑚ℎ𝑟𝑅2     𝑠𝑡𝑑𝑒𝑣
1    0.09500   0.02617
2    0.17331   0.04582
3    0.23008   0.04224
4    0.28082   0.04407
5    0.32892   0.05110
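The random experiments summarized above can be imitated with a short sketch. It is only an illustration under stated assumptions: pseudodescriptors and pseudoobservations are taken to be independent standard normal numbers (the distribution actually used is not given here), and the search is restricted to 𝑛 = 1 and 𝑛 = 2, where 𝑅2 has a closed form in terms of correlation coefficients. The names `highest_random_r2` and `mhr_r2` are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def standardize(values):
    """Center a vector and scale it to unit length, so dot products are correlations."""
    m = sum(values) / len(values)
    centered = [v - m for v in values]
    norm = sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def highest_random_r2(n_desc, n_obs, rng):
    """Highest R^2 among all 1- and all 2-descriptor OLS fits to one random data set."""
    # Assumption: standard normal pseudoobservations and pseudodescriptors.
    y = standardize([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
    xs = [standardize([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
          for _ in range(n_desc)]
    ry = [dot(x, y) for x in xs]
    best1 = max(r * r for r in ry)  # one descriptor: R^2 equals r^2
    best2 = 0.0
    for i, j in combinations(range(n_desc), 2):
        rij = dot(xs[i], xs[j])
        # closed-form R^2 of an OLS fit with two descriptors and an intercept
        r2 = (ry[i] ** 2 + ry[j] ** 2 - 2.0 * ry[i] * ry[j] * rij) / (1.0 - rij ** 2)
        best2 = max(best2, r2)
    return best1, best2


def mhr_r2(n_desc=98, n_obs=80, n_experiments=100, seed=0):
    """(mean, stdev) of the highest random R^2, first for n = 1, then for n = 2."""
    rng = random.Random(seed)
    runs = [highest_random_r2(n_desc, n_obs, rng) for _ in range(n_experiments)]
    return [(mean(col), stdev(col)) for col in zip(*runs)]
```

Calling `mhr_r2()` returns two (mean, standard deviation) pairs that play the role of the first two table rows; the exact numbers depend on the random seed and on the assumed distribution.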

Despite the large descriptor pool, 𝑚ℎ𝑟𝑅2 values here are somewhat lower than in the previous example, illustrating the (positive) effect of 80 instead of 50 observations to be fitted. For our best models, the difference between 𝑅2 and 𝑚ℎ𝑟𝑅2 is between 12.2 and 30 standard deviations, so the original models fit the data far better than the random models, and it is extremely unlikely that our models are based on chance correlations.

Table 7.15 compares 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 for the 5 best models of 𝑛 = 1, . . . , 5 descriptors. For most models, 𝑅2𝐿𝑆 is higher than 𝑅2𝑇𝑆 , as expected. Further, the best models with respect to 𝑅2𝐿𝑆 do not, as a rule, exhibit the best values of 𝑅2𝑇𝑆 . These data are visualized

7 Quantitative structure–property relationships

Table 7.15. Coefficients of determination 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 of the best five PD models containing 𝑛 descriptors.

Model  Set   𝑛=5       𝑛=4       𝑛=3       𝑛=2       𝑛=1
1      LS    0.95481   0.95059   0.93813   0.91688   0.87968
       TS    0.92018   0.92645   0.93244   0.89716   0.87341
2      LS    0.95474   0.94903   0.93770   0.91524   0.87278
       TS    0.92266   0.91832   0.89164   0.91582   0.87117
3      LS    0.95440   0.94753   0.93724   0.91415   0.60970
       TS    0.93862   0.93797   0.92980   0.88833   0.57425
4      LS    0.95397   0.94716   0.93720   0.91241   0.59306
       TS    0.94236   0.91611   0.92598   0.90536   0.69813
5      LS    0.95374   0.94664   0.93712   0.91082   0.57784
       TS    0.94436   0.90873   0.93826   0.90041   0.67580

Fig. 7.14. Scatterplot of 𝑅2𝑇𝑆 vs. 𝑅2𝐿𝑆 for the best models (with respect to LS) of 𝑛 = 1, . . . , 5 descriptors. [Axes: coefficient of determination, learning set (horizontal) and coefficient of determination, test set (vertical); separate symbols for 𝑛 = 1, . . . , 5.]

in Figure 7.14. (For 𝑛 = 1 there are only two models within the frame of Figure 7.14.) Among the QSPRs considered, the highest 𝑅2𝑇𝑆 is found for the 5-descriptor model:

𝑛 = 5 descriptor functions: 𝐴, 𝑟𝑒𝑙. 𝑁𝑂 , 𝑊, ⁰𝜒, 𝜌𝑣𝑑𝑤 ,
𝑓 = 0.065886𝑋0 + 0.68347𝑋1 − 0.000039413𝑋2 − 0.079592𝑋3 + 0.88018𝑋4 + 0.028970,
𝑅2𝐿𝑆 = 0.95374, 𝑆𝐿𝑆 = 0.030041, 𝐹𝐿𝑆 = 305.10.

Fig. 7.15. Scatterplot of calculated PD vs. experimental PD for the model marked by an arrow in Figure 7.14. [Axes: physical density, experimental (horizontal) and physical density, calculated (vertical), both from 0.9 to 1.5; separate symbols for learning set and test set.]

A scatterplot of calculated vs. experimental PD for this model is shown in Figure 7.15. The descriptors contained in the five best models (with respect to the LS) of 𝑛 = 1, . . . , 5 descriptors are shown in Table 7.16. Each column represents a model; the models are ordered by decreasing 𝑅2𝐿𝑆 . Descriptors are represented by rows; a cross at position (𝑖, 𝑗) means that descriptor 𝑖 is contained in model 𝑗. The table again demonstrates the high correlation of PD, 𝜌𝑣𝑑𝑤 and 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻): either 𝜌𝑣𝑑𝑤 or 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻) is contained in all models of 𝑛 ≥ 2 descriptors, as well as in the first two 1-descriptor models, but never both.

7.5.4 Linear modeling by stepwise subset selection

It is to be expected that models of higher 𝑅2𝑇𝑆 can be found using more than five descriptors. Unfortunately, the large descriptor pool limits our ability to search for best subsets. Therefore we use the method of stepwise subset selection here, described in Subsection 6.1.4.

We generate models of 𝑛 = 1, . . . , 20 descriptors. In each step, the 50 descriptor subsets producing the best linear models (with respect to 𝑅2𝐿𝑆 ) are selected. These are enlarged by one descriptor in all possible ways, again the 50 best models are selected, and so on. The final result is 50 models each of 𝑛 = 1, . . . , 20 descriptors. Although these do not necessarily contain the very best linear model for each particular 𝑛, this technique yields good models with reasonable effort.
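The selection scheme just described can be sketched as follows. This is a minimal illustration, not the implementation used for the results in this section: `stepwise_selection` and `r_squared` are hypothetical names, descriptors are passed as a list of columns, and OLS is solved through the normal equations in plain Python.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def r_squared(columns, y):
    """R^2 of the OLS fit (with intercept); `columns` and `y` must be centered."""
    k = len(columns)
    # augmented normal equations (X^T X | X^T y)
    a = [[dot(columns[i], columns[j]) for j in range(k)] + [dot(columns[i], y)]
         for i in range(k)]
    xty = [row[k] for row in a]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[piv][col]) < 1e-12:
            return -1.0  # linearly dependent descriptors: rank this subset last
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    coef = [a[i][k] / a[i][i] for i in range(k)]
    return dot(coef, xty) / dot(y, y)  # explained / total sum of squares


def stepwise_selection(X, y, max_n, width=50):
    """Return {n: [(R2, subset), ...]} keeping at most `width` subsets per n."""
    cols = [center(c) for c in X]
    yc = center(y)
    beam, result = [()], {}
    for n in range(1, max_n + 1):
        # enlarge every retained subset by one descriptor in all possible ways
        candidates = {tuple(sorted(s + (j,)))
                      for s in beam for j in range(len(cols)) if j not in s}
        scored = sorted(((r_squared([cols[j] for j in s], yc), s)
                         for s in candidates), reverse=True)
        result[n] = scored[:width]
        beam = [s for _, s in result[n]]
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(6)]  # toy data
    y = [2 * X[0][i] - 3 * X[3][i] + 0.01 * rng.gauss(0, 1) for i in range(30)]
    for n, models in stepwise_selection(X, y, max_n=3, width=3).items():
        print(n, models[0])
```

With `width=50` and `max_n=20` this corresponds to the 50-fold stepwise procedure of the text; the cost grows only linearly in 𝑛, in contrast to a complete search over all subsets.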

Table 7.16. Best subsets of 𝑛 descriptors for PD models, denoted by their 𝑅2𝐿𝑆 .

Columns (five best models for each 𝑛, denoted by 𝑅2𝐿𝑆 ):
𝑛=5:  0.95481  0.95474  0.95440  0.95397  0.95374
𝑛=4:  0.95059  0.94903  0.94753  0.94716  0.94664
𝑛=3:  0.93813  0.93770  0.93724  0.93720  0.93712
𝑛=2:  0.91688  0.91524  0.91415  0.91241  0.91082
𝑛=1:  0.87968  0.87278  0.60970  0.59306  0.57784

Rows (descriptors): 𝜌𝑣𝑑𝑤 , ¹𝜒, ⁰𝜒𝑣 , 𝐶𝐼𝐶0 , 𝐼𝐶1 , 𝑆𝐼𝐶1 , 𝑟𝑒𝑙. 𝑁𝑂 , 𝐵, 𝑙𝑛(𝑡𝑤𝑐), 𝜆𝐴1 , ⁰𝜒, 𝐴, 𝑀𝑇𝐼, 𝑊, 𝑁𝑂 , 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻), 𝐶, 𝑟𝑒𝑙. 𝑁𝐶𝑙 , 𝑁𝐶𝑙 , 𝐽, 𝑟𝑒𝑙. 𝑛 − (𝑖𝑛𝑐𝑙. 𝐻), 𝑟𝑒𝑙. 𝑛 𝑎𝑟𝑜𝑚. (𝑖𝑛𝑐𝑙. 𝐻), 𝐻𝐵𝐴, 𝐼𝐶2 , 𝑚𝑒𝑎𝑛 𝐴𝑊, 𝑆𝐻𝐷𝑊1 , 𝐼𝐶0 , 𝑟𝑒𝑙. 𝑁𝐻 .

[Table body: a cross (×) marks each descriptor contained in a model.]

Figure 7.16 is a scatterplot of 𝑅2𝑇𝑆 vs. 𝑅2𝐿𝑆 for these models. The number of descriptors in a model is indicated by the color of the corresponding symbol. Models of a single descriptor are black, those of 20 descriptors are white, all others are shown in corresponding greyscale. Models containing many descriptors clearly show overfitting: their 𝑅2𝑇𝑆 are significantly lower than their 𝑅2𝐿𝑆 .

In Figure 7.18, for each 𝑛 = 1, . . . , 20, the best model with respect to 𝑅2𝐿𝑆 was selected, and its 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 are shown. Similarly, the best model with respect to 𝑅2𝑇𝑆 was selected, and both its 𝑅2𝑇𝑆 and 𝑅2𝐿𝑆 are shown. In both cases, 𝑅2𝐿𝑆 increases monotonically with increasing 𝑛, whereas 𝑅2𝑇𝑆 has a maximum at 7 or 8 descriptors, remains high to about 13 descriptors and decreases afterwards.

For the prediction of PD our choice is the model with the highest 𝑅2𝑇𝑆 , an 8-descriptor model:

𝑛 = 8 descriptor functions: 𝑁𝐶 , 𝑁𝑂 , 𝑟𝑒𝑙. 𝑁𝐶𝑙 , 𝐶, 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻), 𝑚𝑤𝑐(3) , 𝐶𝐼𝐶0 , 𝐼𝐶1 ,
𝑓 = −0.0071639𝑋0 + 0.018563𝑋1 − 0.78530𝑋2 + 0.047740𝑋3 + 0.11221𝑋4 − 0.00060864𝑋5 + 0.12967𝑋6 + 0.072563𝑋7 − 0.27686,
𝑅2𝐿𝑆 = 0.96127, 𝑆𝐿𝑆 = 0.028062, 𝐹𝐿𝑆 = 220.25.

100 random experiments were performed to select best combinations of 8 out of 98 pseudodescriptors to fit 80 pseudoobservations. As in the procedure used for the real data, the 50-fold stepwise selection procedure was applied instead of infeasible complete searches. The result is 𝑚ℎ𝑟𝑅2 = 0.45266, 𝑠𝑡𝑑𝑒𝑣 = 0.04723. Note the rather high 𝑚ℎ𝑟𝑅2 value, which results from selecting several descriptors from a large descriptor pool. Nevertheless, the difference between 𝑅2 and 𝑚ℎ𝑟𝑅2 for our best model is 10.8 standard deviations. So the original model fits the data far better than the random models, and it is extremely unlikely that it is based on chance correlations. Notably, 𝑅2𝑇𝑆 = 0.96174 > 𝑅2𝐿𝑆 = 0.96127 for this model, although one would expect the opposite.

Figure 7.17 shows experimental and calculated PD values obtained by this model. Symbols for observations from LS are white, those for TS are grey.

Table 7.17 shows the descriptors contained in the 25 models of highest 𝑅2𝑇𝑆 . Among these, the only geometric indices present are 𝑆𝐻𝐷𝑊5 and 𝑆𝐴𝑆𝐻2𝑂 . Descriptor 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻) is used in all of these models, while 𝜌𝑣𝑑𝑤 is not, because of the high correlation of these two quantities. Several arithmetical indices play a prominent role, in particular counts and relative counts of atoms. That the cyclomatic number 𝐶 proves to be an important descriptor for the PD is also understandable: a more cyclic molecule will have more of its atoms overlapping, so that its van der Waals volume will be lower and its van der Waals density higher.

We saw that the PD of propyl acrylates is modeled well using arithmetical and topological descriptors with OLS as the predicting method. Thus, computer-intensive

Fig. 7.16. Scatterplot of 𝑅2𝑇𝑆 vs. 𝑅2𝐿𝑆 for the best linear models (with respect to 𝑅2𝑇𝑆 ) after 50-fold stepwise subset selection. [Axes: coefficient of determination, learning set (horizontal) and coefficient of determination, test set (vertical); separate symbols for 𝑛 = 1, . . . , 20.]

Fig. 7.17. Scatterplot of calculated PD vs. experimental PD for the model marked by an arrow in Figure 7.16. [Axes: physical density, experimental (horizontal) and physical density, calculated (vertical), both from 0.9 to 1.5; separate symbols for learning set and test set.]

Table 7.17. Descriptors contained in the 25 best (with respect to 𝑅2𝑇𝑆 ) models obtained by 50-fold stepwise subset selection.

Model   𝑅2𝑇𝑆      Number of descriptors
1       0.96174   8
2       0.96069   8
3       0.96004   8
4       0.95997   8
5       0.95949   13
6       0.95949   13
7       0.95939   8
8       0.95910   9
9       0.95906   8
10      0.95905   12
11      0.95905   12
12      0.95896   8
13      0.95896   8
14      0.95896   8
15      0.95895   8
16      0.95893   8
17      0.95892   9
18      0.95891   13
19      0.95883   8
20      0.95881   8
21      0.95863   9
22      0.95854   8
23      0.95818   8
24      0.95816   7
25      0.95815   8

Rows (descriptors): 𝑚𝑒𝑎𝑛 𝐴𝑊 (𝑖𝑛𝑐𝑙. 𝐻), 𝑟𝑒𝑙. 𝑁𝐶𝑙 , 𝑁𝑂 , 𝐶, 𝑚𝑤𝑐(3) , 𝐶𝐼𝐶0 , 𝐼𝐶1 , 𝑁𝐶 , 𝑚𝑤𝑐(6) , 𝑀𝑊 (𝑖𝑛𝑐𝑙. 𝐻), 𝑚𝑤𝑐(5) , 𝑀𝑇𝐼, 𝑊, 𝑟𝑒𝑙. 𝑛 𝑎𝑟𝑜𝑚𝑎𝑡𝑖𝑐, 𝐼𝐶0 , ¹𝜒𝑣 , ⁰𝜒, 𝑚𝑤𝑐(8) , ³𝜒𝑣 , 𝐶𝐼𝐶1 , 𝑆𝐼𝐶1 , 𝑟𝑒𝑙. 𝑛−, 𝑆𝐻𝐷𝑊5 , 𝑟𝑒𝑙. 𝑁𝑂 , 𝑀𝑊, 𝐻𝐵𝐷, 𝑟𝑒𝑙. 𝑛 − (𝑖𝑛𝑐𝑙. 𝐻), 𝑆𝐴𝑆𝐻2𝑂 , 𝑛−.

[Table body: a cross (×) marks each descriptor contained in a model.]

Fig. 7.18. 𝑅2𝐿𝑆 , 𝑅2𝑇𝑆 , and 𝑅2𝐶𝑉 on the learning set, for best models, depending on the number 𝑛 of descriptors. White symbols represent the best models with respect to 𝑅2𝐿𝑆 , grey symbols the best models with respect to 𝑅2𝑇𝑆 . [Axes: descriptors (horizontal, 1 to 20) and coefficient of determination (vertical). Legend: best model based on LS, with corresponding values for TS and CV; best model based on TS, with corresponding values for LS and CV.]

3D placements and calculation of geometric descriptors are not required, improving the applicability of the method.

Using a test set, it was shown that such a model's predictive ability is severely compromised by the presence of too many descriptors (overfitting). An interesting question now is whether the phenomenon of overfitting could be detected by cross-validation instead of by using a test set. For the models obtained by stepwise subset selection we therefore performed a LOOCV on the learning set. The results for the best models (with respect to 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 ) were added to Figure 7.18; the results for all models of 𝑛 ≥ 3 are shown in Figure 7.19. From the figure we see that 𝑅2𝐶𝑉 does not decrease as visibly for high 𝑛 as does 𝑅2𝑇𝑆 . Thus, a good CV result is not an indication of a model's predictive ability; it is merely an indication of data consistency, and due caution should be used if models or descriptor subsets are selected on the basis of CV.
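A leave-one-out cross-validation of an OLS model can be sketched as below. The sketch assumes the common definition 𝑅2𝐶𝑉 = 1 − PRESS/TSS, with TSS taken around the mean of the full learning set; the definition used elsewhere in this book may differ in detail. All function names are hypothetical, and observations are passed as rows of descriptor values.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_fit(rows, y):
    """OLS coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def total_ss(y):
    ybar = sum(y) / len(y)
    return sum((t - ybar) ** 2 for t in y)


def r2_learning(rows, y):
    coef = ols_fit(rows, y)
    rss = sum((t - predict(coef, r)) ** 2 for r, t in zip(rows, y))
    return 1.0 - rss / total_ss(y)


def r2_loocv(rows, y):
    press = 0.0
    for i in range(len(y)):
        # refit without observation i, then predict the left-out observation
        coef = ols_fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, rows[i])) ** 2
    return 1.0 - press / total_ss(y)


if __name__ == "__main__":
    rng = random.Random(3)
    rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(25)]  # toy data
    y = [1.5 * r[0] + rng.gauss(0, 0.5) for r in rows]
    print(r2_learning(rows, y), r2_loocv(rows, y))
```

For OLS the cross-validated value can never exceed the learning-set value, since each leave-one-out residual is at least as large in magnitude as the corresponding fitted residual.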

7.5.5 Linear modeling using principal component regression

Using autoscaled descriptors and property values we performed a PCR. Figure 7.20 shows 𝑅2 for LS and TS depending on the number of principal components. The best model with respect to TS uses 21 PCs and has 𝑅2𝑇𝑆 = 0.95111 and 𝑅2𝐿𝑆 = 0.96428. Its predictive ability is lower than that of the models built previously using OLS and stepwise subset selection. The complexity of the models cannot be compared directly due to the different types of methods. However, in using the PCR model for prediction, all 98 descriptor values enter the calculations and thus have to be calculated for each compound. This additional effort will favor the 8-descriptor OLS model in any case.

Fig. 7.19. Scatterplot of 𝑅2𝐶𝑉 vs. 𝑅2𝐿𝑆 for the best linear models (with respect to 𝑅2𝐿𝑆 ) after 50-fold stepwise subset selection. [Axes: coefficient of determination, learning set (horizontal) and coefficient of determination, cross-validation (vertical); separate symbols for 𝑛 = 3, . . . , 20.]

Fig. 7.20. 𝑅2𝐿𝑆 and 𝑅2𝑇𝑆 for LM, determined by PCR, depending on the number of principal components used. [Axes: principal components (horizontal, 10 to 80) and coefficient of determination (vertical); separate symbols for learning set and test set.]

7.6 A case study of QSARs with discrete values

In the following example we want to derive quantitative structure–activity relationships for a biological property, the anti-mycobacterial activity (ABA) of a class of quinolones. The compounds in our study are described by the generic structure

[Structural formula: generic quinolone with substituent R1 on the ring nitrogen and substituent R2.]

where substituents R1 and R2 are as shown in Figure 7.21 in the upper row (R1 ) and lower three rows (R2 ). The atom designated as 'Z' in the figure is the atom of the central molecule to which R is attached.

ABA is measured by the minimal concentration that inhibits Mycobacterium fortuitum, the minimal inhibitory concentration (MIC). Table 7.18 gives experimental MIC values (in 𝜇g/mL) of 𝑚 = 51 quinolones [248]. The gaps in the table are obviously cases for prediction by a QSAR model. We will construct such models using various methods.

7.6.1 Choice and redundancy of descriptors

We started with the 48 arithmetical and 170 topological indices from Appendix A. Of these 218 descriptors, 25 are removed as they are constant within our library:

𝑁𝑂 , 𝑁𝑁 , 𝑁𝑆 , 𝑟𝑒𝑙. 𝑁𝑆 , 𝑁𝐶𝑙 , 𝑟𝑒𝑙. 𝑁𝐶𝑙 , 𝑁𝐵𝑟 , 𝑟𝑒𝑙. 𝑁𝐵𝑟 , 𝑁𝐼 , 𝑟𝑒𝑙. 𝑁𝐼 , 𝑁𝑃 , 𝑟𝑒𝑙. 𝑁𝑃 , 𝑛=, 𝑛#, 𝑟𝑒𝑙. 𝑛#, 𝑟𝑒𝑙. 𝑛#(𝑖𝑛𝑐𝑙. 𝐻), 𝑐ℎ𝑎𝑟𝑔𝑒, 𝑟𝑎𝑑, 𝐻𝐵𝐴, 7 rings, 8 rings, ≥9 rings, 𝑇3 , 𝑐𝑜𝑛. 𝑐𝑜𝑚𝑝., 𝑔𝑡 𝑝𝑙𝑎𝑛𝑎𝑟.

We searched the remaining 193 indices for pairwise complete correlations. Complete correlation is an equivalence relation. In the following, all 19 equivalence classes with more than one element are listed. Most of these complete correlations are due to the special nature of the real library. Complete correlations that are generally true are written as '≃'.

Fig. 7.21. Substituents R1 (upper row) and R2 (lower three rows) in the real library of quinolones. [Structure drawings; 'Z' marks the attachment atom. Molecular formulas of the substituents R1: 1 C2H5, 2 C3H7, 3 C4H9, 4 C3H5, 5 C4H7, 6 C6H3F2. Molecular formulas of the substituents R2: 1 C4H9N2, 2 C5H11N2, 3 C5H11N2, 4 C6H13N2, 5 C6H13N2, 6 C6H13N2, 7 C7H15N2, 8 C8H17N2, 9 C8H17N2, 10 C4H9N2, 11 C6H13N2, 12 C7H15N2, 13 C6H13N2, 14 C7H15N2, 15 C8H17N2.]

Table 7.18. Experimental MIC values (in 𝜇g/mL) for the real library of quinolones. Substituent R1

Substituent R2

1

2

3

4

5

6

1

0.50

1.00

0.03

0.06

0.50

0.25

2

0.06

0.25

3

1.00

0.50

0.03

0.06

0.25

0.13

4

0.25

1.00

0.13

0.06

0.50

0.50

0.13

0.25

5 6

0.13 0.13

0.50

0.06

0.06

0.25

7

0.25

0.50

8

1.00

2.00

9

1.00

10

1.00

1.00

0.06

11

1.00 1.00

0.13

2.00

0.13

0.03

12 13

0.03 0.13

2.00

2.00

0.13

0.25

14

0.50

15

0.25

0.50


𝑟𝑒𝑙. 𝑁𝑂 ∼ 𝑟𝑒𝑙. 𝑁𝑁 , 𝑁𝐹 ∼ 𝑛𝑎𝑟𝑜𝑚𝑎 , 𝐵 ∼ 𝑙𝑜𝑐. 𝐵, 𝐶 ≃ rings, 𝑀1 ≃ 𝑚𝑤𝑐(2) , 𝑀2 ≃ 𝑚𝑤𝑐(3) ,
⁰𝜒 ∼ ⁰𝜒𝑠 , ¹𝜒 ∼ ¹𝜒𝑠 , ²𝜒 ∼ ²𝜒𝑠 , ³𝜒𝑠 ∼ ³𝜒𝑝 , ³𝜒𝑠 (𝑐𝑙𝑢𝑠𝑡𝑒𝑟) ∼ ³𝜒𝑐 , ³𝜒𝑣 ∼ ³𝜒𝑝𝑣 ,
𝐹 ≃ 𝑁𝐺𝑆 ≃ ²𝑃𝑎𝑐𝑦𝑐 ≃ ²𝑃, ⁷𝑃𝑎𝑐𝑦𝑐 ∼ ⁷𝑃, ⁸𝑃𝑎𝑐𝑦𝑐 ∼ ⁸𝑃, ≥⁹𝑃𝑎𝑐𝑦𝑐 ∼ ≥⁹𝑃,
3 𝑟𝑖𝑛𝑔𝑠 ∼ ³𝜒𝑐ℎ ∼ ³𝜒𝑣𝑐ℎ , ⁶𝜒𝑐 ∼ ⁶𝜒𝑐𝑣 .

We used only the first entry from each of these equivalence classes; the other 22 were excluded from further investigation. Thus 171 nonconstant, pairwise incompletely correlated indices remain. The indices 𝑡𝑤𝑐 and 𝑡𝑤𝑐𝑢𝑛𝑠𝑎𝑡 were replaced, as before in Section 7.5, with their natural logarithms.

7.6.2 Regression

In this example, various methods of supervised learning are demonstrated. First we considered ABA as a continuous variable, represented by MIC, and obtained predicting functions via regression. As in the sections above, we determined best linear models using OLS and BSS. The best 5-descriptor model obtained in this manner was:

𝑛 = 5 desc. functions: 𝑋0 = ²𝜅𝛼 , 𝑋1 = ⁶𝑃𝑎𝑐𝑦𝑐 , 𝑋2 = 𝑐ℎ. 𝐽4 , 𝑋3 = ⁶𝜒𝑝 , 𝑋4 = ⁴𝜒𝑐 ,
𝑓 = 1.7782𝑋0 + 0.26966𝑋1 + 142.04𝑋2 − 8.9882𝑋3 − 5.8961𝑋4 − 11.528,
𝑅2 = 0.64597, 𝑆 = 0.34676, 𝐹 = 16.421.

100 random experiments were performed to select best combinations of 5 out of 171 pseudodescriptors to fit 51 pseudoobservations. The result was 𝑚ℎ𝑟𝑅2 = 0.56013, 𝑠𝑡𝑑𝑒𝑣 = 0.04772. Note the high 𝑚ℎ𝑟𝑅2 value, which results from selecting a 5-descriptor combination from a large descriptor pool. Here, the difference between 𝑅2 for the original 5-descriptor model and 𝑚ℎ𝑟𝑅2 was no larger than 1.8 standard deviations. This means that the model does not fit the data convincingly better than random, and it cannot be excluded that the model is based on chance correlations.

It seems that linear models of a quality comparable to that for BP and PD models above cannot be constructed for MIC. One reason is that MIC has only seven different values in the real library, with occurrence numbers as follows:

MIC         0.03   0.06   0.13   0.25   0.50   1.00   2.00
Frequency   4      7      9      9      9      9      4
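The comparison with random models used here and in Section 7.5 is a simple quotient: the excess of 𝑅2 over 𝑚ℎ𝑟𝑅2, expressed in units of the standard deviation of the random experiments. The following lines merely recompute the values quoted in the text from the reported numbers; the function name `excess_in_stdevs` is hypothetical.

```python
def excess_in_stdevs(r2, mhr_r2, stdev):
    """Distance of a model's R^2 from the mean highest random R^2, in stdevs."""
    return (r2 - mhr_r2) / stdev


# (R^2 of the real model, mhrR2, stdev) as reported in Sections 7.5 and 7.6
CASES = {
    "PD, n = 1": (0.87968, 0.09500, 0.02617),
    "PD, n = 2": (0.91688, 0.17331, 0.04582),
    "PD, n = 3": (0.93813, 0.23008, 0.04224),
    "PD, n = 4": (0.95060, 0.28082, 0.04407),
    "PD, n = 5": (0.95481, 0.32892, 0.05110),
    "PD, n = 8 (stepwise)": (0.96127, 0.45266, 0.04723),
    "MIC, n = 5": (0.64597, 0.56013, 0.04772),
}

if __name__ == "__main__":
    for label, (r2, mhr, sd) in CASES.items():
        print(f"{label}: {excess_in_stdevs(r2, mhr, sd):.1f} standard deviations")
```

The PD models lie between roughly 11 and 30 standard deviations above the random level, whereas the MIC model stays below 2.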

We used regression trees next. Using standard parameters (mincut = 5, minsize = 10, mindev = 0.01) R constructed a regression tree (Figure 7.22) of seven terminal nodes and five descriptors:

Fig. 7.22. Regression tree for MIC using five descriptors. [Tree diagram. Split conditions: 𝑋3 < 4.5 (root), 𝑋2 < 2.53543, 𝑋0 < 26.5, 𝑋2 < 2.53379, 𝑋4 < 8.01285, 𝑋1 < 9.2755. Values at the seven terminal nodes: 0.90000, 1.80000, 0.65000, 0.28220, 0.43220, 0.06923, 0.27600.]

𝑛 = 5 desc. functions: 𝑋0 = 𝐵, 𝑋1 = 1 𝜒𝑣 , 𝑋2 = 𝜆𝐴1 , 𝑋3 = 𝐹𝑅𝐵, 𝑋4 = 5 𝜒 𝑝𝑐 , 𝑅2 = 0.81748.

Thus, the fit to the experimental data here was appreciably better than in the best linear model containing the same number of descriptors.

We then used the two descriptor sets LM and RT, as obtained for the best 5-descriptor linear model and the regression tree, respectively, to train neural networks and support vector machines. Table 7.19 contains 𝑅2 values for SVM with linear, polynomial (degree = 2) and radial kernel, as well as for ANN with one, two, and three HN. To ensure reproducibility of the ANNs, starting weights were set to 0. If starting weights are initialized with random numbers, better models are generally obtained. In this manner, a model with 2 HN and 𝑅2 = 0.75380 was found for the RT descriptor set. The best model obtained was an ANN with 3 HN and 𝑅2 = 0.87415 using descriptor set RT:

𝑛 = 5 desc. functions: 𝑋0 = 𝐵, 𝑋1 = ¹𝜒𝑣 , 𝑋2 = 𝜆𝐴1 , 𝑋3 = 𝐹𝑅𝐵, 𝑋4 = ⁵𝜒𝑝𝑐 ,
𝑓∗ = −4.40/(1 + exp(−1.88𝑋∗0 + 1.10𝑋∗1 − 7.41𝑋∗2 − 18.5𝑋∗3 − 3.81𝑋∗4 + 19.4))
   + 4.43/(1 + exp(−2.23𝑋∗0 − 1.38𝑋∗1 + 1.19𝑋∗2 − 17.8𝑋∗3 + 0.709𝑋∗4 + 11.1))
   + 0.309/(1 + exp(−6.65𝑋∗0 + 3.88𝑋∗1 + 21.2𝑋∗2 − 3.51𝑋∗3 + 5.40𝑋∗4 − 14.7))
   + 0.0176,
𝑅2 = 0.87415, 𝑆 = 0.25753, 𝐹 = 9.5924,

where 𝑋∗𝑗 , 𝑗 = 0, . . . , 4, are range-scaled descriptor values, and 𝑓∗ provides the range-scaled activity value. To obtain the values of MIC itself, re-transformation is required.

The two descriptor sets considered above were those that gave the best models in linear modeling or in a regression tree. We then tested another two descriptor sets that were selected according to correlation coefficients with MIC. Descriptor set BCC contains the five descriptors showing the highest absolute correlation coefficients with MIC: 𝜆𝐴1 (−0.402), 𝑅 (0.369), 𝐹𝑅𝐵 (0.346), ⁴𝜒𝑣𝑐 (−0.313), ⁴𝜒𝑐 (−0.310). The results are included in Table 7.19.

Table 7.19. 𝑅2 for modeling MIC using various regression methods and four 5-subsets of descriptors.

Regression      Set of descriptors
method          𝐿𝑀        𝑅𝑇        𝐵𝐶𝐶       𝐻𝐶𝐶
MLR             0.64597   0.39654   0.42980   0.43927
RT              0.39654   0.81748   0.75118   0.65189
ANN, 1HN        0.49117   0.45177   0.43712   0.56780
ANN, 2HN        0.71535   0.45995   0.43712   0.56904
ANN, 3HN        0.71284   0.47453   0.43712   0.56903
SVM, lin        0.39005   0.37659   0.38318   0.32952
SVM, pol        0.37598   0.50894   0.48433   0.60133
SVM, rad        0.38718   0.55660   0.54426   0.59496

As is the case here, the 𝑛 descriptors with the highest absolute correlation coefficients to the target variable often form a significantly worse descriptor set in MLR than the set obtained by BSS. A reason for that may be strong intercorrelation of descriptors, as we saw in Section 7.5. In order to account for this, we calculated the following reference value for a descriptor subset 𝛺:

$$|\Omega|^{-1} \sum_{i \in \Omega} |R(X_i, Y)| \;-\; \left|\binom{\Omega}{2}\right|^{-1} \sum_{\{i,j\} \subset \Omega} |R(X_i, X_j)|,$$

where 𝑅(𝑋, 𝑍) is the correlation coefficient of 𝑋 and 𝑍. The 5-subset of highest reference value (0.18520) is

𝛷, 4 rings, 5 rings, 𝜆𝐴1 , ⁶𝜒𝑐 ,

which we call HCC. The results for HCC are also included in Table 7.19. With two exceptions, models containing HCC provide higher 𝑅2 than models containing BCC. In particular, HCC proves itself best among the four descriptor sets in the calculations of SVMs with polynomial or radial kernel.

However, low 𝑅2 values seem to discredit the regression for this example, seemingly due to the strange distribution of MIC values as mentioned above. Therefore we tried classification methods next.
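The reference value is the mean absolute correlation of the descriptors in 𝛺 with the target minus the mean absolute correlation among all pairs of descriptors in 𝛺. A minimal sketch of this computation follows; `reference_value` is a hypothetical name, descriptors are passed as columns of values, and at least two descriptors are required.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reference_value(descriptors, y):
    """Mean |R(X_i, Y)| minus mean |R(X_i, X_j)| over all pairs in the subset."""
    relevance = sum(abs(pearson(x, y)) for x in descriptors) / len(descriptors)
    pairs = list(combinations(descriptors, 2))
    redundancy = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
    return relevance - redundancy


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]          # toy target values
    x1 = [1.0, 2.0, 3.0, 4.0]         # perfectly correlated with y
    x2 = [1.0, -1.0, -1.0, 1.0]       # uncorrelated with x1 and y
    print(reference_value([x1, x2], y))   # relevance 0.5, redundancy 0
    print(reference_value([x1, x1], y))   # relevance 1, redundancy 1
```

A subset of highly intercorrelated descriptors is thus penalized even if each member correlates well with the target.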

7.6.3 Multi-classification

In our sample of 51 compounds, MIC assumes no more than seven different values. Thus, the common MIC values can be considered as classes, and we then formulate the QSAR search as a classification problem with seven classes. Figure 7.23 shows a classification tree for MIC.

Fig. 7.23. Multiclassification tree for MIC using seven descriptors. [Tree diagram. Split conditions: 𝑋0 < 5.53472 (root), 𝑋6 < 0.034339, 𝑋2 < 2.59751, 𝑋5 < 0.0588894, 𝑋6 < 0.0331024, 𝑋3 < 4.3917, 𝑋4 < 5.96945, 𝑋1 < 0.144619. Classes at the nine terminal nodes: 1, 1, 0.06, 0.03, 2, 0.5, 0.5, 0.25, 0.13.]

Table 7.20. Distribution of 51 quinolones into activity classes by experiment and by calculation using the CT of Figure 7.23.

Experimental   Calculated class
class          0.03   0.06   0.13   0.25   0.50   1.00   2.00
0.03           4      0      0      0      0      0      0
0.06           1      4      0      0      2      0      0
0.13           0      0      4      2      1      2      0
0.25           0      0      1      5      1      0      2
0.50           0      1      0      0      6      2      0
1.00           0      0      0      0      0      9      0
2.00           0      0      0      0      0      1      3

𝑛 = 7 descriptors: 𝑋0 = 𝑚 𝑀2 , 𝑋1 = 𝑇𝐼𝐶1 , 𝑋2 = 𝐶𝐼𝐶1 , 𝑋3 = 𝐼𝐶2 , 𝑋4 = 𝑀𝑆𝐷, 𝑋5 = 𝑐ℎ. 𝐽5 , 𝑋6 = 𝑐ℎ. 𝐽6 ,
𝑀𝐶𝐸 = 16/51 = 0.31373.

Table 7.20 gives the distribution of 51 compounds into classes by experiment and by calculation. Based on the classes calculated by the CT, an 𝑅2 can be calculated. Its value is 0.34811, understandably significantly worse than that of the RTs determined in the previous examples.

In predicting biological or pharmaceutical activities, one is often interested only in an active/inactive discrimination. Thus, binary classification was performed next.
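Both numbers quoted above follow directly from the counts of Table 7.20, read row by row (experimental class) and column by column (calculated class). The short sketch below recomputes them; the function names are hypothetical, and 𝑅2 is taken as 1 − RSS/TSS with the calculated class value used as the predicted MIC.

```python
CLASSES = [0.03, 0.06, 0.13, 0.25, 0.50, 1.00, 2.00]

# COUNTS[i][j]: compounds of experimental class i assigned to calculated class j
COUNTS = [
    [4, 0, 0, 0, 0, 0, 0],
    [1, 4, 0, 0, 2, 0, 0],
    [0, 0, 4, 2, 1, 2, 0],
    [0, 0, 1, 5, 1, 0, 2],
    [0, 1, 0, 0, 6, 2, 0],
    [0, 0, 0, 0, 0, 9, 0],
    [0, 0, 0, 0, 0, 1, 3],
]


def mce(counts):
    """Misclassification error: share of compounds off the diagonal."""
    total = sum(map(sum, counts))
    correct = sum(counts[i][i] for i in range(len(counts)))
    return (total - correct) / total


def r2_from_counts(classes, counts):
    """R^2 = 1 - RSS/TSS, using calculated class values as predictions."""
    total = sum(map(sum, counts))
    ybar = sum(c * sum(row) for c, row in zip(classes, counts)) / total
    tss = sum(sum(row) * (c - ybar) ** 2 for c, row in zip(classes, counts))
    rss = sum(n * (exp - calc) ** 2
              for exp, row in zip(classes, counts)
              for calc, n in zip(classes, row))
    return 1.0 - rss / tss


if __name__ == "__main__":
    print(f"MCE = {mce(COUNTS):.5f}, R2 = {r2_from_counts(CLASSES, COUNTS):.5f}")
```

The row sums reproduce the class frequencies 4, 7, 9, 9, 9, 9, 4 listed in Subsection 7.6.2, which serves as a consistency check on the table.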

7.6.4 Binary classification

An anti-mycobacterial agent often used is Ciprofloxacin:

[Structural formula of Ciprofloxacin.]

It is contained in our real library and its MIC is 0.06. Following [273] we considered all compounds that have MIC ≤ 0.06 as active. Our QSAR search is now a binary classification problem. We solved it using various methods of descriptor selection and classification methods. Growing classification trees, we obtained a 3-descriptor CT (Figure 7.24):

𝑛 = 3 descriptors: 𝑋0 = 𝑚 𝑀2 , 𝑋1 = ³𝜒𝑠 , 𝑋2 = 𝑐ℎ. 𝐽6 ,
𝑀𝐶𝐸 = 2/51 + 0/51 = 2/51 = 0.039216, 𝑀𝐶𝐸𝐶𝑉 = 10/51 = 0.19608.

MCE is composed of two kinds of errors:
– type I errors, true (or active) compounds classified as false (or inactive),
– type II errors, false (or inactive) compounds classified as true (or active).

Table 7.21 shows 2×2 tables for the various sets of descriptors and classification methods. The top left table is for descriptor selection by CT and classification by CT. Table 7.22 gives 𝑇𝐶𝐸 and (in parentheses) 𝑇𝐶𝐸𝐶𝑉 as obtained by LOOCV for the various sets of descriptors and classification methods.

We modeled ABA via classification by regression next (see Subsection 6.1.1). A property value of 1 was attributed to 'active' compounds (MIC ≤ 0.06), and −1 was assigned to 'inactive' compounds (MIC > 0.06). If the predicting function returns

Fig. 7.24. Binary classification tree for MIC using three descriptors. [Tree diagram. Split conditions: 𝑋0 < 5.68056 (root), 𝑋2 < 0.034339, 𝑋1 < 0.224583. Classes at the four terminal nodes: F, F, F, T.]

Table 7.21. 2×2 tables for binary classification of ABA for various classification methods and descriptor sets.

                    Descriptors (calculated class F, T)
Class.     Exp.     CT        LM0       LM1       LM2       FR        [273]
method     class    F    T    F    T    F    T    F    T    F    T    F    T
CT         F        40   0    40   0    39   1    37   3    39   1    37   3
           T        2    9    3    8    2    9    2    9    6    5    1    10
MLR        F        39   2    40   0    40   0    40   0    39   1    40   0
           T        4    7    0    11   0    11   0    11   9    2    9    2
LDA        F        38   2    38   2    40   0    38   2    38   2    39   1
           T        4    7    0    11   0    11   0    11   6    5    6    5
QDA        F        38   2    38   2    40   0    39   1    33   7    35   5
           T        1    10   0    11   0    11   0    11   0    11   1    10
KNN        F        40   0    40   0    39   1    37   3    38   2    40   0
           T        5    6    5    6    3    8    1    10   4    7    0    11
ANN, 1HN   F        39   5    40   0    40   0    40   0    39   1    40   0
           T        0    11   0    11   0    11   0    11   1    10   11   0
ANN, 2HN   F        39   1    40   0    40   0    40   0    39   1    34   6
           T        0    11   0    11   0    11   0    11   1    10   0    11
ANN, 3HN   F        39   1    40   0    40   0    40   0    39   1    34   6
           T        0    11   0    11   0    11   0    11   1    10   0    11
SVM, lin   F        39   1    39   1    40   0    40   0    37   3    37   3
           T        4    7    0    11   0    11   0    11   4    7    4    7
SVM, pol   F        40   0    40   0    40   0    40   0    39   1    35   5
           T        4    7    0    11   0    11   0    11   2    9    2    9
SVM, rad   F        40   0    40   0    40   0    40   0    39   1    35   5
           T        2    9    1    10   0    11   0    11   5    6    2    9

values > 0, the compound is classified as active, otherwise it is inactive. We obtained the best linear models by OLS regression with 𝑛 = 1, 2, 3 descriptors; for assessment we used MCE rather than 𝑅𝑆𝑆 or 𝑅2 . We obtained the models:

𝑛 = 1 descriptor function: 𝑐ℎ. 𝐽1 ,
𝑓 ̃ = 9.7362𝑋0 − 2.9152,
𝑀𝐶𝐸 = 8/51 + 0/51 = 8/51 = 0.15686, 𝑀𝐶𝐸𝐶𝑉 = 11/51 = 0.21569.

Table 7.22. 𝑇𝐶𝐸 and 𝑇𝐶𝐸𝐶𝑉 (in parentheses) for binary classification of ABA for various classification methods and descriptor sets.

Class.       Descriptor set
method       CT        LM0      LM1      LM2      FR        [273]
CT           2 (10)    3 (8)    3 (12)   5 (12)   7 (10)    4 (8)
MLR          6 (7)     0 (4)    0 (4)    0 (4)    10 (10)   9 (10)
LDA          6 (6)     2 (3)    0 (3)    2 (4)    8 (12)    7 (9)
QDA          3 (5)     2 (4)    0 (2)    1 (3)    7 (9)     6 (8)
KNN          5 (5)     5 (5)    4 (5)    4 (5)    6 (8)     0 (8)
ANN, 1HN     5 (6)     0 (2)    0 (1)    0 (1)    2 (2)     11 (7)
ANN, 2HN     1 (3)     0 (1)    0 (0)    0 (1)    2 (2)     6 (1)
ANN, 3HN     1 (4)     0 (0)    0 (0)    0 (0)    2 (2)     6 (3)
SVM, lin     5 (5)     1 (3)    0 (0)    0 (2)    7 (7)     7 (7)
SVM, pol     4 (5)     0 (1)    0 (0)    0 (1)    3 (3)     7 (8)
SVM, rad     2 (4)     1 (2)    0 (0)    0 (1)    6 (7)     7 (7)

𝑛 = 1 descriptor function: ⁴𝜒𝑣𝑐 ,
𝑓 ̃ = 5.1999𝑋0 − 0.72345,
𝑀𝐶𝐸 = 6/51 + 2/51 = 8/51 = 0.15686, 𝑀𝐶𝐸𝐶𝑉 = 10/51 = 0.19608.

𝑛 = 2 descriptor functions: ³𝜅, ²𝜅𝛼 ,
𝑓 ̃ = 2.3886𝑋0 − 1.9619𝑋1 + 3.2971,
𝑀𝐶𝐸 = 1/51 + 1/51 = 2/51 = 0.039216, 𝑀𝐶𝐸𝐶𝑉 = 2/51 = 0.039216.

𝑛 = 2 descriptors: 𝜆𝐴1 , 𝜒𝑇 ,
𝑓 ̃ = 65.570𝑋0 + 7623.5𝑋1 − 167.42,
𝑀𝐶𝐸 = 2/51 + 0/51 = 2/51 = 0.039216, 𝑀𝐶𝐸𝐶𝑉 = 6/51 = 0.11765.

The low 𝑀𝐶𝐸𝐶𝑉 of the first 2-descriptor model is remarkable. A complete separation of the two classes is achieved by the following three models using three descriptors each:

𝑛 = 3 descriptors: ¹𝜒𝑣 , 𝐶𝐼𝐶2 , 𝜆𝐴1 ,
𝑓 ̃ = −0.86763𝑋0 + 0.95011𝑋1 + 55.068𝑋2 − 133.16,
𝑀𝐶𝐸 = 0, 𝑀𝐶𝐸𝐶𝑉 = 4/51 = 0.078431.

𝑛 = 3 descriptors: ²𝜅𝛼 , ⁸𝑃𝑎𝑐𝑦𝑐 , ⁶𝑃,
𝑓 ̃ = −0.89054𝑋0 − 0.042156𝑋1 + 0.095820𝑋2 − 1.4373,
𝑀𝐶𝐸 = 0, 𝑀𝐶𝐸𝐶𝑉 = 4/51 = 0.078431.

𝑛 = 3 descriptors: 𝐼𝐶2 , 𝑀𝑆𝐷, 𝜆𝐴1 ,
𝑓 ̃ = −0.87302𝑋0 − 1.0691𝑋1 + 44.867𝑋2 − 104.19,
𝑀𝐶𝐸 = 0, 𝑀𝐶𝐸𝐶𝑉 = 4/51 = 0.078431.
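Classification by regression, as used for the models above, reduces to taking the sign of the predicting function and counting the two kinds of errors. The following sketch illustrates this with a purely hypothetical 2-descriptor model and made-up descriptor values; none of the numbers are taken from the quinolone library.

```python
def linear_model(coefficients, intercept):
    """Return the predicting function f(x) = sum(c_i * x_i) + intercept."""
    def f(x):
        return sum(c * v for c, v in zip(coefficients, x)) + intercept
    return f


def error_counts(model, descriptors, active):
    """Count type I errors (active classified as inactive) and type II errors."""
    type_1 = type_2 = 0
    for x, is_active in zip(descriptors, active):
        predicted_active = model(x) > 0  # values > 0 mean 'active'
        if is_active and not predicted_active:
            type_1 += 1
        elif predicted_active and not is_active:
            type_2 += 1
    return type_1, type_2


def mce(model, descriptors, active):
    """Misclassification error: (type I + type II errors) / number of compounds."""
    return sum(error_counts(model, descriptors, active)) / len(active)


if __name__ == "__main__":
    model = linear_model([2.0, -1.0], -0.5)            # hypothetical coefficients
    xs = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.2), (0.1, 0.0)]  # hypothetical values
    active = [True, False, False, True]
    print(error_counts(model, xs, active), mce(model, xs, active))
```

Reporting the two error counts separately, as in the sums 𝑎/51 + 𝑏/51 above, shows whether a model errs mainly by missing active compounds or by flagging inactive ones.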


We call these three models 𝐿𝑀0 , 𝐿𝑀1 and 𝐿𝑀2 . In Tables 7.21 and 7.22 these expressions also designate the corresponding descriptor sets. The results for classification by MLR are found in these tables in the rows designated 'MLR'. Further, the descriptors exhibiting highest Fisher ratios,

²𝜅𝛼 (1.32722), ²𝜅 (1.2909), 𝛷𝛼̄ (1.1641),

were used for the calculation of predicting functions (column FR in Tables 7.21 and 7.22). Finally, the descriptor set used in [273],

𝑀1 , 𝑀2 , 𝜉𝑐 ,

was also included.

Along with CT and LDA we tested the classification methods quadratic discriminant analysis (QDA), KNN, ANN with one, two, and three HN, as well as SVM with linear, polynomial (degree = 2), and radial kernel. The descriptor values were autoscaled for all these methods. The number of nearest neighbors to be considered, 𝑘, was determined via LOOCV for the various descriptor sets as follows:

Set of descriptors        CT    LM0   LM1   LM2   FR    [273]
Number 𝑘 of neighbors     15    7     5     5     9     1

These values of $k$ are those leading to the smallest $TCE_{CV}$; the results based on them are included in Tables 7.21 and 7.22. Since $k = 1$ for the descriptors from [273], a complete separation of classes is achieved trivially in this case, because every compound of the training set is then its own nearest neighbor. This must be kept in mind when reading Table 7.21: it does not allow a conclusion on the suitability of this descriptor set for KNN classification.

For the ANN classification, 10 random starting weight distributions were tested in each run, and the one leading to the smallest $TCE_{CV}$ was selected. No ANN with 1 HN separating the classes was found for the descriptors from [273]. A better result for this case was found using range-scaled descriptor values ($TCE = 3$, $TCE_{CV} = 6$).

When training SVM classifiers, there are several parameters to vary, and these sometimes influence the fitting and predictive ability of SVMs significantly. We tested the parameters cost $= 2^{i}$, $i \in \{-1, ..., 7\}$, and gamma $= 2^{j}$, $j \in \{-3, ..., 3\}$, in the R implementation, and selected the parameter combination resulting in the smallest $TCE_{CV}$.

A glance at Table 7.22 shows that the descriptor sets from the best LM also result in good ANN and SVM. The descriptors with the best Fisher ratios are not well suited for classification. The reason may again be their high intercorrelation: $R({}^{2}\kappa_{\alpha}, {}^{2}\kappa) = 0.99140$, $R({}^{2}\kappa_{\alpha},$ 𝛷𝛼̄$) = 0.97880$ and $R({}^{2}\kappa,$ 𝛷𝛼̄$) = 0.98505$.
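The selection of $k$ by leave-one-out cross-validation can be sketched in a few lines of Python. This is a minimal illustration, not the R workflow used for the tables: the one-dimensional data set `DATA`, the function names and the tie-breaking rule (prefer the smaller $k$) are assumptions made here. The sketch also shows why $k = 1$ separates the training set trivially: without leaving a point out, each point is its own nearest neighbor.

```python
import math

# Hypothetical toy data: (descriptor vector, class label 0 or 1).
DATA = [
    ((0.0,), 0), ((1.0,), 0), ((2.0,), 0), ((6.4,), 0),
    ((5.0,), 1), ((6.0,), 1), ((7.0,), 1), ((8.2,), 1),
]

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean metric)."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(label for _, label in neighbors)  # number of class-1 neighbors
    return 1 if 2 * votes > len(neighbors) else 0

def loocv_errors(data, k):
    """Count misclassifications when each point is predicted from all others."""
    errors = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # leave observation i out
        if knn_predict(rest, x, k) != y:
            errors += 1
    return errors

def resubstitution_errors(data, k):
    """Count misclassifications when the training points predict themselves."""
    return sum(knn_predict(data, x, k) != y for x, y in data)

def best_k(data, candidates):
    """Choose the k with the fewest LOOCV errors; ties go to the smaller k."""
    return min(candidates, key=lambda k: (loocv_errors(data, k), k))

if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, loocv_errors(DATA, k), resubstitution_errors(DATA, k))
    print("selected k:", best_k(DATA, [1, 3, 5]))
```

On this toy set the cross-validated error distinguishes the candidate values of $k$, whereas the training error for $k = 1$ is zero by construction and therefore carries no information.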

7.6.5 Prediction

Here, we used linear models $LM_0$, $LM_1$, and $LM_2$ to predict ABA for a virtual library of quinolones, i.e. we want to fill the holes in Table 7.18. The 6 substituents from the upper row of Figure 7.21 were used for R1 and the remaining 15 substituents were used for R2. The resulting virtual library consists of $6 \cdot 15 = 90$ structures. After removing the 51 compounds contained in the real library, 39 structures remain in the purely virtual library. The predictions resulting from our three models are remarkably consistent: $LM_0$ and $LM_2$ agree in predicting the following structures to be active:

[Structural formulas of compounds A and B]

The first of these is also the only structure in the purely virtual library predicted to be active by $LM_1$. A thorough literature search for these two compounds resulted in the following: Compound A (R1 = t-Bu, R2 = N-methylpiperazin-1-yl) was already synthesized and published in 1987/1988 in a patent application (US 916757, EP 0266576) as an antibacterial agent. It shows activity against several bacteria, but apparently was not tested against Mycobacterium fortuitum. Compound B (R1 = t-Bu, R2 = 3-aminomethyl-3-methylpyrrolidin-1-yl) seems to be still unknown. However, a very similar compound, differing in only one additional bond (R1 = t-Bu, R2 = 1-amino-5-azaspiro[2.4]hept-5-yl), was published in a 1991/1992 patent application (KR 9125884, EP 550016) as an antibacterial agent.

Inspection of Table 7.18 reveals the following:
i) In each of the six completely filled rows, the MIC value for R1 = substituent 3 = t-Bu is either the smallest (4 times) or second-smallest (2 times). Thus, t-Bu is the most promising R1 among those considered.
ii) In the only completely filled column (R1 = substituent 4 = cyclopropyl), many substituents have small MIC values (0.03 or 0.06), including R2 = substituent 2 = N-methylpiperazin-1-yl and R2 = substituent 11 = 3-aminomethyl-3-methylpyrrolidin-1-yl.

Thus, compounds A and B are among the most promising candidates. This statement was derived from simple reasoning based on the data given. This is another example of a simple truth: The methods of statistical learning presented here are formalized versions of what the human mind is able to perform by itself in simple cases.

7.7 Outlook: Unsupervised learning and diversity considerations

We have not yet discussed the choice of the real library. In the examples treated above, structures and properties were taken from databases or from the literature. Ideally, the choice of a real library should be a basis for successful optimization of an experiment in combinatorial chemistry. The real library then should contain structures of as high diversity as possible.

Mathematical tools for choosing (sub)libraries of high diversity are methods of unsupervised statistical learning. In contrast to supervised learning, there is no dependent variable in unsupervised learning. The aim of unsupervised learning is to structure and classify the observations according to similarities in the values of independent variables.

In our applications in combinatorial chemistry, independent variables are molecular descriptors, and their values may be taken either from the building blocks or from the compounds in the library themselves. Important methods of unsupervised learning are principal component analysis (PCA) and cluster analysis.

The method for generation of all substructures and their counts in a given library (described in Subsection 7.2.2) seems to be particularly suited to obtain independent variables, since these are determined in a canonical manner rather than by the subjective choice of the user. In Example 7.6 we obtained all twenty substructures of 2–6 bonds together with their counts in a real library of decanes. Application of the same algorithm to the complete virtual library of all 75 decanes does not result in further such substructures.

We used the autoscaled substructure counts for a hierarchical cluster analysis. Figure 7.25 shows the cluster analysis result in the form of a dendrogram. The leaves are labelled according to the numbering in Figure 7.4 (R01–R50) and Figure 7.13 (V01–V25). In a hierarchical cluster analysis, observations close to each other are combined successively. The distance of two observations is determined by a metric on the space of predictors $X_i$, $i \in n$. In our example we used the Euclidean metric on $\mathbb{R}^{20}$. The dendrogram visualizes the order of combining the compounds.

If it is desired to select $m$ structures for synthesis and screening, we could partition the structures into $m$ clusters according to the dendrogram and select one structure from each cluster. In Figure 7.25 this is shown by a vertical grey line that is arranged to define twenty clusters. If one compound is taken from each cluster, the resulting sublibrary will be of high diversity.

The main objective of QSPRs and QSARs, as covered in this chapter, is to start from the molecular structure and arrive at a predicted property value. However, the opposite also plays an important role in chemoinformatics, namely starting from a given property (or properties) and trying to predict the molecular structure. The next chapter deals with such problems, i.e. trying to derive the structure from the information recorded about a structure in the form of a mass spectrum.

[Dendrogram; leaves labelled R01–R50 and V01–V25]

Fig. 7.25. Dendrogram for clustering the virtual library of decanes.
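The diversity selection described above (autoscale the substructure counts, cluster hierarchically with the Euclidean metric, cut into $m$ clusters, pick one compound per cluster) can be sketched as follows. The text does not state which linkage rule was used, so average linkage is an assumption here, as are the toy counts, the compound labels and all function names.

```python
import math
from statistics import mean, stdev

# Hypothetical substructure counts for six compounds (two substructures each).
NAMES = ["R01", "R02", "V01", "V02", "R03", "V03"]
COUNTS = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]

def autoscale(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds))
        for row in rows
    ]

def average_linkage(points, a, b):
    """Mean Euclidean distance between the members of clusters a and b."""
    return mean(math.dist(points[i], points[j]) for i in a for j in b)

def cluster(points, m):
    """Merge the two closest clusters repeatedly until m clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > m:
        pairs = [
            (average_linkage(points, clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def select_representatives(names, clusters):
    """Pick the first member of each cluster as its representative."""
    return [names[members[0]] for members in clusters]

if __name__ == "__main__":
    found = cluster(autoscale(COUNTS), 3)
    print(found)
    print(select_representatives(NAMES, found))
```

Taking the first member of each cluster is the simplest possible choice; a member closest to the cluster centre would be an equally defensible representative.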

8 Molecular structure elucidation

A major part of work in a chemical laboratory, both in research and industry, deals with analytical challenges. Important problems are separation of mixtures into pure compounds, and elucidation of their structures. Often, determination of a compound’s constitution is the foremost aim, though a true identification of a compound should also include determination of its stereochemistry, i.e. its configuration and possibly conformation. We group this together under molecular structure elucidation.

The structural formula is an extremely useful piece of information, e.g. in quality control of a chemical synthesis or in the search for pollutants in environmental analysis. Often the structure is sufficient information to predict physicochemical properties or biological activities based on QSPR/QSAR models (see Chapter 7). Of course, knowledge of the structure is necessary to unambiguously register a compound and to process all relevant information.

Input for a structure elucidation problem is the chemist’s prior knowledge of the compound, as well as results of experiments performed on the analyte. Prior knowledge may be a synthesis path or more generally the source of the compound. Among experimental measurements, spectroscopic methods are generally the most informative.

8.1 Spectroscopic methods

There are many methods of this kind in the arsenal of chemical analytics. Most important are
– Nuclear magnetic resonance (NMR) spectroscopy,
– Infrared (IR) and ultraviolet (UV) spectroscopy,
– Atomic absorption and atomic emission spectroscopy (AAS, AES),
– Mass spectrometry (MS).

With the exception of MS, these methods are based on changes of the energy states of the analyte’s molecules or atoms, and detection of the energy absorbed or emitted. In NMR spectroscopy, certain nuclei are excited in a strong and homogeneous magnetic field. Transitions between energy levels of nuclei are detected and recorded in the form of chemical shifts and spin-spin couplings. These allow detailed conclusions on the molecular environment of the atoms observed. IR and UV spectroscopy detect the analyte’s light absorbance as a function of wavelength. Certain substructures can be recognized due to their characteristic absorbances. AAS and AES allow the presence of certain chemical elements in the analyte to be detected and quantified. We will go into depth with regard to MS in the following sections.

Overviews of the various spectroscopic methods are found in [26, 123, 341]. In [238], simple examples of easy structure elucidation problems are treated with guidance for manual or interactive computer-aided strategies. For us it is of foremost interest to see which kind of structural information is provided by spectroscopic methods, and how the information is derived from experimental data and used for structure elucidation.

8.2 The principle of automated molecular structure elucidation

For more than three decades now scientists have been striving for automation of structure elucidation. The development of automated structure elucidation was accelerated by increasingly powerful computer hardware and software. In particular, chemical structures and their properties were digitized and collected in databases. There are two fundamentally different methods of automated structure elucidation,
– database-based structure elucidation,
– de novo structure elucidation.

Nowadays there are relatively large databases containing pairs of structures and spectra for each spectroscopic method (see e.g. Chapter 9 of [329]). Given an experimental spectrum of an unknown, such a database allows one to search for similar spectra, to rank database spectra according to similarity to the experimental spectrum, and to return an ordered list of corresponding structures as structure candidates. Algorithms for spectrum comparison are now sufficiently developed to return the correct structure at the top position of such a hit list with high probability, provided it is contained in the database (see e.g. [301]).

This proviso is, however, the most serious problem of database-based structure elucidation. Even the largest databases contain only tens of thousands or a few hundred thousand spectra to date. Thus, in the MS database NIST11 [226], used in parts of the present work, there are 243,893 EI-MS spectra belonging to 212,961 compounds. The database Beilstein used in Section 2.5 contains 8,711,107 known organic compounds, while the number of mathematically possible constitutions, even for a single molecular formula of low mass (about 150 Da), can exceed 100 million (see Appendix D). For example, the smallest (with respect to molecular mass) molecular formula associated with more than 100 million mathematically possible constitutions is C8H6N2O (146 Da, 109,240,025 isomers).
In the context of combinatorial chemistry, as well as in natural products chemistry, it is quite possible that the analyte of interest is not contained in spectrum databases.

In recent years, the use of structural (or compound) databases for structure elucidation has increased due to the evolution of web-based services such as PubChem [218] and ChemSpider [259], with approximately 26 million entries each. These databases do not generally contain spectra (there are some exceptions) and as such only provide information about compounds that have been documented to exist. Although this is a smaller subset of possible structures for a given molecular formula than generating all mathematically possible structures, the same principles apply to determining the ‘correct one’ as for generated structures, without the guarantee that the correct structure is actually present. When searching for new active compounds, transformation products, metabolites or new materials, compounds of interest are by definition not found in databases. Thus, applications of automated structure elucidation could lie in metabolomics, molecular diagnostics, medicine, forensics and environmental chemistry. Examples are the identification of biomarkers within body fluids [145, 348], or toxic compounds in environmental samples [31].

Because of the high-throughput synthesis and screening methods available, the quest for automated structure elucidation via MS is now more urgent than ever. In [138, 139, 277] high resolution screening (HRS) is described, a method that isolates pure compounds from a mixture in a single procedure by high performance liquid chromatography (HPLC), tests the compounds for a biological activity, and records the mass spectrum of each active compound. Thus, several thousands of compounds can be analyzed within a few hours, whereas manual processing of resulting data may take several days or even months for each active compound. Automatic procedures for structure elucidation are urgently needed to overcome this bottleneck.

The idea of de novo structure elucidation is to find the correct structure without searching databases. A prominent starting point is the well-known DENDRAL system [183], the development of which began in the mid-1960s. DENDRAL was developed for the automated structure elucidation of organic compounds by MS, after separation by gas chromatography (GC). DENDRAL is described in many computer science books as the first expert system. Moreover, it can be considered as one of the roots of chemoinformatics.
Interestingly, even NASA was among the sponsors of this pioneering project, with the ambitious intention to supply future Mars missions with such software, to enable analysis and interpretation of MS samples onboard a pilotless spacecraft and to transmit only identified structural formulas back to Earth instead of huge GC/MS data sets.

Since then a number of expert systems relying on a combination of various spectroscopic methods have been developed (RASTR [69], X–PERT [67, 68], StrEluc [66], SESAMI [46], CHEMICS [80, 81], SpecSolv [340], EXPEC [186, 187], and so on). Essentially, these systems follow the same strategy, independently of the spectroscopic methods used. The principle is shown in Figure 8.1. It may roughly be divided into three subproblems, all of which require some mathematical modeling:
– Spectrum interpretation is extraction of structural properties from spectroscopic data. Methods of pattern recognition and of supervised statistical learning are used (Section 8.5).
– In a second step, all structures that have the structural properties extracted above are constructed using molecular structure generation (Section 5.1).
– Virtual spectra are then calculated for the structure candidates using spectrum simulation. These are then compared to experimental data. Based on such comparisons, good structure candidates are ranked and selected (Section 8.4). We summarize simulation, comparison, ranking, and selection under the term structure verification.

[Flow chart: Experimental data, spectra → Spectra interpretation → Structural properties → Structure generation → Structural formulas → Spectra simulation → Virtual spectra → Comparison, ranking, selection → Ranking of structural formulas; two feedback arrows]

Fig. 8.1. Automatic structure elucidation workflow.
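The comparison and ranking step of this workflow can be illustrated with a small Python sketch. The similarity measure used here, the cosine of the angle between two intensity vectors, is a common choice picked for illustration and is not claimed to be the match value developed later in the chapter. The spectra, the candidate names and the function names are hypothetical.

```python
import math

# Hypothetical spectra as dictionaries {integer m/z value: intensity}.
EXPERIMENTAL = {43: 0.40, 74: 1.00, 87: 0.25}
VIRTUAL = {
    "candidate_1": {43: 0.50, 74: 1.00, 87: 0.20},
    "candidate_2": {43: 1.00, 57: 0.60, 74: 0.10},
}

def cosine_similarity(spec_a, spec_b):
    """Cosine of the angle between two spectra; 1 means identical peak pattern."""
    masses = set(spec_a) | set(spec_b)
    dot = sum(spec_a.get(m, 0.0) * spec_b.get(m, 0.0) for m in masses)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)

def rank_candidates(experimental, virtual_spectra):
    """Return (similarity, name) pairs, best match first."""
    scored = [
        (cosine_similarity(experimental, spectrum), name)
        for name, spectrum in virtual_spectra.items()
    ]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    for score, name in rank_candidates(EXPERIMENTAL, VIRTUAL):
        print(f"{name}: {score:.3f}")
```

A selection step would then keep, for example, all candidates above a chosen similarity threshold; the feedback arrows of the workflow correspond to relaxing or tightening such parameters when the resulting candidate list is empty or too large.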


In DENDRAL’s terminology these three steps were simply called plan, generate and test. In the best case, the structure with a virtual spectrum most similar to the experimental spectrum is the correct solution. Often, however, one is left with either a huge or an empty structure space after structure generation or verification. In these cases, parameters of interpretation or selection have to be modified. Further, it should be possible to use prior knowledge (e.g. the synthesis or source of the analyte) in this process.

While structure generation works using rigorous mathematical principles, a chemical knowledge base is required for interpretation and verification. This may again be a database of spectra. For example, SpecInfo [178, 219] first searches for similar spectra in its database, and then extracts substructures common to the corresponding structures as input for structure generation. Other procedures [324, 333, 335] train spectral classifiers by means of database spectra. Once such classifiers are at hand, a particular structure elucidation problem is solved without recourse to database spectra, since goodlist and badlist entries for structure generation are obtained directly from classifier output [321, 325]. The combination of both methods, using substructures from the large NIST database [226] and classifier output [155], reduced the number of candidates during structure generation dramatically [285].

In [298] and [75] it is described how substructures that are either present or absent are determined from IR spectra by trying to explain the peaks of the experimental spectrum using substructures. Substructures known to produce characteristic IR absorptions are of most use here. Such experience was initially deduced from spectral databases and is tabulated in the form of intervals [123, 237].

While UV, IR and AAS/AES may provide some supporting information for computer-assisted structure elucidation (CASE), these spectra generally contain insufficient information to attempt CASE on them in isolation. At this stage, NMR and MS are the techniques best suited to attempting CASE using one analytical technique alone; of course, the results of CASE can be improved by incorporating as much data as possible.

For structure verification, a database of known spectrum–structure pairs is indispensable. Statistical models for NMR peak prediction are trained on available data [151, 196, 197, 198]. In Section 8.4 we will see how database spectra are used to select structure candidates via MS.

8.3 Basics of mass spectrometry

The name mass spectrometry already shows that it is different from other spectroscopic methods. In contrast to NMR and IR spectroscopy, MS observes ions, i.e. charged particles derived from the analyte, rather than the molecules of the analyte themselves. The occurrence and types of the various ions observed depend on type and energy of ionization. The recent interest in MS is due to
– its potential to work within a largely automated environment of synthesis and screening, which in turn stems from
– its high sensitivity, which means that mass spectra can be obtained from extremely low amounts of compound and in very short time,
– its high selectivity, i.e. its potential to provide complete structure information.

Fig. 8.2. Example EI mass spectrum of methyl n-pentanoate.

Unfortunately, this complete information is not easily accessible, and we will discuss this in detail in this chapter. In the following, we concentrate on low resolution 70 eV electron impact (LR–EI) mass spectrometry. For this particular method, the largest databases of reference spectra exist, and spectra of the same compound taken on different instruments are generally reproducible and comparable. Figure 8.2 is a mass spectrum of methyl n-pentanoate taken by this method. The $x$-axis gives the mass to charge ratio ($m/z$), while a peak’s height (intensity) on the $y$-axis is proportional to an ion’s abundance. Although intensity is often reported between 0 and 100 or 1000 in mass spectral databases, we will use a 0 to 1 scale for mathematical simplicity.

8.3.1 Mode of operation of an EI mass spectrometer

Figure 8.3 is a schematic drawing of the mode of operation of an EI mass spectrometer. In ionization, a molecule (a) of the analyte under electron impact (b) loses an outer-sphere electron, forming the positively charged molecule ion (c), usually a cation radical. The ion radical fragments into an ion (d) and a neutral radical (e), possibly in several concurrent modes of decay. Fragments arising directly from the molecule ion are called primary ions or primary particles, and depending on stability these may undergo subsequent fragmentation into secondary ions and secondary neutral particles. Such fragmentation may happen in ions of any order. Rearrangement reactions result in even more complex mixtures of ions in the ionization chamber.

Fig. 8.3. Mode of operation of an EI mass spectrometer.

While neutral particles are removed by a vacuum pump, the ions (f) are accelerated by a voltage (g) towards a magnetic field (h). The strength of the deflection of ions within the magnetic field (according to the Lorentz force) depends on the mass to charge ratio of the ions. As the charge is constant (+1) for all ions, heavy ions are deflected less than light ones. Thus, ions are detected at different ‘places’ in the detector, giving rise to different peaks in a spectrum, where each peak has an intensity proportional to the amount of ions of a particular $m/z$.
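The mass dependence of the deflection can be made explicit with the standard textbook calculation for a magnetic sector, which is not part of the original text. The symbols $U$ (accelerating voltage), $B$ (magnetic flux density), $q$ (ion charge), $v$ (ion speed after acceleration) and $r$ (radius of the circular path in the field) are introduced here for illustration only, and the initial kinetic energy of the ions is neglected.

```latex
\begin{align*}
qU &= \tfrac{1}{2}\, m v^{2}
   && \text{(kinetic energy gained from the accelerating voltage)}\\
q v B &= \frac{m v^{2}}{r}
   && \text{(Lorentz force acting as centripetal force)}\\
\Longrightarrow\quad r &= \frac{1}{B}\sqrt{\frac{2\, m\, U}{q}},
   \qquad \frac{m}{q} = \frac{B^{2} r^{2}}{2U}
\end{align*}
```

For fixed $U$, $B$ and $q$ the radius grows with $\sqrt{m}$, so a heavier ion follows a flatter arc, which is the sense in which heavy ions are deflected less than light ones.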

8.3.2 Problems in EI mass spectrometry

Automatic structure elucidation based on EI–MS is a three-level task, where the
– molecular mass,
– molecular formula, and
– molecular structure
of the analyte need to be determined. Unfortunately, determination of the molecular mass from a mass spectrum is already problematic in many cases: The molecular ion may be highly unstable, decaying at once at the relatively high energies used in EI–MS, resulting in the absence of a molecular ion peak. This happens for approximately 14–30% of all compounds contained in MS databases (see e.g. [180] and Subsection 8.3.4). Even worse, there is no deterministic criterion to decide whether or not the peak at highest mass represents the molecule ion.

Nevertheless, there are some ways to deduce the molecular mass from the spectrum. A heuristic procedure for obtaining molecular mass candidates is described in [183]. Mun et al. [216] show that the correct molecular mass was ranked first in 91% of cases, and first or second in 95% of cases, for unknown spectra. Methods such as [217] that predict the parity of the molecular mass provide additional information. Recently, Hufsky et al. [135] introduced a method for calculating the exact molecular mass from high resolution EI–MS spectra via fragmentation trees, applicable only when the molecular ion peak is actually present in the spectrum. For 50 test compounds, their method determined the correct molecular ion for 44 compounds, while the algorithm from Scott [289], implemented in the NIST Database [226], estimated the correct nominal mass for 45 compounds. Only one compound was not predicted in top place by either one of the two (very different) approaches. An alternative to high resolution EI–MS measurements (still relatively rare at this stage) is to use soft ionization (SI) to determine the molecular mass, where the energy of the ionizing electron beam is reduced and often not sufficient for bond cleavage, resulting in less fragmentation and correspondingly a more intense molecule ion peak. There are other issues with soft ionization, discussed further in Section 8.7.
Figure 8.4 shows our workflow for structure elucidation via MS, following the plan, generate, test strategy used in DENDRAL (Section 8.2). The focus is on determination of the molecular formula and structure. For interpretation we use MS classifiers, which provide information on both element composition and structure (see Appendix B). We use classifiers described by K. Varmuza and W. Werther [324, 333, 335] and develop new classifiers based on different classification methods (Subsection 8.5.2) and new structural properties (Subsection 8.5.3).

All molecular formulas corresponding to a given molecular mass can be calculated. The generation of all structures corresponding to a given molecular formula was treated in Section 5.1.

At first sight, MS provides information on masses and abundances in the ion mixture obtained from the analyte. Isotope compositions and theoretical isotope distributions link these primary pieces of information. This connection is discussed in Subsection 8.3.3. Ideas from [98, 168, 291] will be developed further to calculate match values for molecular formula candidates using mass, intensities and isotope patterns. Such quantities are used for ranking and selection of molecular formula candidates (Subsection 8.4.1).
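The statement that all molecular formulas for a given molecular mass can be calculated is illustrated by the following sketch, which enumerates every combination of C, H, N and O atoms whose nominal masses (12, 1, 14, 16) add up to a given integer mass. It is a plain exhaustive search written for illustration; the function name and the restriction to four elements are assumptions, and no chemical plausibility filter (such as a test for the existence of a molecular graph) is applied.

```python
# Nominal masses of the most abundant isotopes (cf. the lowest isotope masses
# of C, H, N and O in the natural isotope distributions).
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}

def formulas_for_mass(mass, elements=("C", "N", "O", "H")):
    """Return all element-count dictionaries whose nominal masses sum to mass."""
    results = []

    def extend(index, remaining, counts):
        element = elements[index]
        unit = NOMINAL_MASS[element]
        if index == len(elements) - 1:
            # The last element has to account for the remaining mass exactly.
            if remaining % unit == 0:
                results.append({**counts, element: remaining // unit})
            return
        for count in range(remaining // unit + 1):
            extend(index + 1, remaining - count * unit, {**counts, element: count})

    extend(0, mass, {})
    return results

if __name__ == "__main__":
    for formula in formulas_for_mass(16):
        print(formula)
```

For mass 16 the search returns four combinations, among them CH4 and a single O atom, but also the chemically meaningless H16. This is why candidate formulas have to be tested and ranked before structure generation.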

[Flow chart: Mass spectrum → Determine molecular masses (plan, test) → Selected molecular masses → Generate molecular formulas (plan, test) → Selected molecular formulas → Generate structural formulas (plan, test) → Selected structural formulas]

Fig. 8.4. Scheme of structure elucidation via MS.

If structure candidates are given, the calculation of match values can be improved using knowledge on MS fragmentation reactions to generate possible fragment ions. Structure comparison values are used for ranking and selection of structure candidates (Subsection 8.4.2).

Finally we combine these single steps to demonstrate automated structure elucidation via MS for two examples (Section 8.6). Given the known misclassification rates of MS classifiers, the large size of structure spaces, and the deficiencies of candidate selection, an expert system based exclusively on low resolution EI–MS cannot, at present, work sufficiently reliably for practical use in an automatic mode. The incorporation of additional information into this automated workflow increases the success rate of automated CASE via MS and this is discussed in greater detail in Chapter 9.

Fortunately, some of the weaknesses may be overcome by using better hardware. Thus, higher mass resolution helps in molecular formula determination (Sections 8.7 and 8.8). Additional spectroscopic analyses (e.g. AAS, AES) provide additional information for element composition determination if available, e.g. the empirical formula (cf. Definition 1.25). Under such circumstances the molecular formula determination is unambiguous in many cases.

8.3.3 Mass spectra and isotope distributions

First, we need some mathematical formulations that help us to model mass spectra and the related entities.

8.1 Definition (low resolution mass spectrum) A low resolution mass spectrum $I$ is a mapping

$$I : \mathbb{N} \to \mathbb{R}_0^{+} : m \mapsto I(m)$$

from the set of natural numbers into the set of non-negative real numbers. This mapping relates each integer $m/z$ value $m$ with its intensity $I(m)$. There exists a maximum $m/z$ value $\hat{m}$ with $I(\hat{m}) > 0$, i.e.

$$I(\hat{m}) > 0 \quad\text{and}\quad I(m) = 0 \ \text{for all } m > \hat{m}.$$

Analogously, a minimal $m/z$ value $\check{m}$ with $I(\check{m}) > 0$ can be assigned.

Furthermore, a spectrum is typically normalized to a certain maximum intensity. To simplify mathematical expressions we will normalize the spectrum to a maximum intensity 1: There exists an $\tilde{m}$ such that $I(\tilde{m}) = 1$ and $I(m) \le 1$ for all $m \ne \tilde{m}$. $\tilde{m}$ is typically determined uniquely and called the spectrum’s base mass. If two or more peaks have maximal intensity, this points to overamplification in the spectrum. Pairs $P = (m, I(m))$ with $I(m) \ne 0$ are called peaks of the spectrum; $m$ is the mass, $I(m)$ the intensity of peak $P$. $\tilde{P} = (\tilde{m}, I(\tilde{m}))$ is called the base peak, $\hat{P} = (\hat{m}, I(\hat{m}))$ is the peak of highest mass.
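These notions translate directly into a small data structure. The sketch below stores a low resolution spectrum as a dictionary from integer $m/z$ values to intensities, normalizes it to maximum intensity 1 and extracts the base peak and the peaks of highest and lowest mass. The function names are assumptions, and the raw intensities are a handful of peaks of the methyl n-pentanoate spectrum written on a 0 to 1000 scale for illustration.

```python
# A few peaks on a 0-1000 intensity scale (illustrative excerpt).
RAW = {29: 246, 43: 445, 74: 1000, 87: 250, 101: 13, 102: 0}

def normalize(raw):
    """Rescale so that the largest intensity is 1; keep only real peaks (I(m) > 0)."""
    top = max(raw.values())
    return {m: value / top for m, value in raw.items() if value > 0}

def base_peak(spectrum):
    """Peak of maximal intensity."""
    m = max(spectrum, key=spectrum.get)
    return m, spectrum[m]

def highest_mass_peak(spectrum):
    """Peak with the largest m/z value that has non-zero intensity."""
    m = max(spectrum)
    return m, spectrum[m]

def lowest_mass_peak(spectrum):
    """Peak with the smallest m/z value that has non-zero intensity."""
    m = min(spectrum)
    return m, spectrum[m]

if __name__ == "__main__":
    spectrum = normalize(RAW)
    print("base peak:", base_peak(spectrum))
    print("highest mass peak:", highest_mass_peak(spectrum))
    print("lowest mass peak:", lowest_mass_peak(spectrum))
```

Masses that are absent from the dictionary are understood to have intensity 0, which matches the requirement that $I(m) = 0$ for all $m$ above the highest mass.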

Table 8.1. Peaks from the mass spectrum of Figure 8.2.

  m    I(m)      m    I(m)      m    I(m)      m    I(m)      m    I(m)
 26   0.031     33   0.005     44   0.032     57   0.300     83   0.009
 27   0.172     38   0.006     45   0.024     58   0.009     84   0.003
 28   0.395     39   0.094     51   0.005     59   0.204     85   0.284
 29   0.246     40   0.011     53   0.013     60   0.006     86   0.015
 30   0.005     41   0.246     54   0.006     73   0.015     87   0.250
 31   0.028     42   0.066     55   0.136     74   1.000     88   0.017
 32   0.005     43   0.445     56   0.055     75   0.028    101   0.013

8.2 Example (Peaks) Table 8.1 shows the peaks in the MS of Figure 8.2. $\tilde{P} = (74, 1.000)$ is the base peak of the spectrum, $\hat{P} = (101, 0.013)$ is the peak of highest mass.

In this manner we can describe not only experimental spectra, but also theoretical isotope distributions and calculated spectra. The atoms of a chemical element $X \in \mathrm{E}$ are not necessarily all of the same mass. The mass of an atom is essentially the mass of its nucleus, which is composed of two types of elementary particles of unit mass, protons and neutrons. In the atoms of a given element $X$, the number of positively charged protons is fixed, while the number of uncharged neutrons may vary. Such atoms of different mass are isotopes of $X$ (notation for the isotope of mass $m$: ${}^{m}X$, e.g. ${}^{13}$C). Natural isotope distributions are known and almost constant, such that mass spectrometry provides information on the elemental composition of an unknown compound through isotope patterns.

8.3 Definition (Natural isotope distribution) Let $X \in \mathrm{E}$ be a chemical element. The natural isotope distribution of $X$ is a mapping $I_X : m \mapsto I_X(m)$ that associates an intensity with each mass $m$ depending on $X$, such that
– there is a highest isotope mass $\hat{m}_X$ with $I_X(\hat{m}_X) > 0$ such that $m > \hat{m}_X$ implies $I_X(m) = 0$,
– there is a unique nominal mass $\tilde{m}_X$ for which the following holds: For all $m \ne \tilde{m}_X$, $I_X(m) < I_X(\tilde{m}_X)$,
– there is a lowest isotope mass $\check{m}_X = \min\{m \mid I_X(m) \ne 0\}$.
We postulate a standardization of the sum of intensities to 1:
– $\sum_m I_X(m) = 1$.

The latter condition will play an important role in Section 8.6. Table 8.2 contains natural isotope distributions for the elements $X$ of $\mathrm{E}_{11}$ (source: [288]). For these elements the lowest isotope mass equals the nominal mass: $\check{m}_X = \tilde{m}_X$. For masses $m$ not mentioned in the table, $I_X(m) = 0$. The elements of $\mathrm{E}_{11}$ can be partitioned into three classes according to their isotope distributions (see also [194]):
– Class 0: Highest isotope mass and nominal element mass coincide: $\hat{m}_X = \tilde{m}_X$.
– Class 1: Highest isotope mass and nominal element mass differ by one unit: $\hat{m}_X = \tilde{m}_X + 1$.
– Class 2: Highest isotope mass and nominal element mass differ by two units: $\hat{m}_X = \tilde{m}_X + 2$.

Table 8.2. Natural isotope distributions of the elements in $\mathrm{E}_{11}$.

$X$    $\check{m}_X$   $\hat{m}_X$   $I_X(\check{m}_X)$   $I_X(\check{m}_X+1)$   $I_X(\check{m}_X+2)$
H      1       1       1         0         0
C      12      13      0.989     0.011     0
N      14      15      0.9963    0.0037    0
O      16      18      0.9976    0.0004    0.0020
F      19      19      1         0         0
Si     28      30      0.9223    0.0467    0.0310
P      31      31      1         0         0
S      32      34      0.9504    0.0075    0.0421
Cl     35      37      0.7577    0         0.2423
Br     79      81      0.5069    0         0.4931
I      127     127     1         0         0

Figure 8.5 illustrates the isotope distributions for elements $X$ of $\mathrm{E}_{11}$. The background row gives $I_X(\tilde{m}_X)$, the middle row $I_X(\tilde{m}_X + 1)$, and the foreground row $I_X(\tilde{m}_X + 2)$. Unfortunately, the intensity resolution of real mass spectra is as limited as the optical resolution of this figure: The isotope peaks of N and O are barely visible.

The isotopes of an element have almost identical chemical properties, in particular their bonding and reaction behaviour, and isotope distributions in a compound sample generally reflect the natural isotope distributions in routine MS measurements. The molecules of a chemical compound that differ in molecular mass due to varying isotope composition or location are called isotopomers. The relative abundances of molecular masses are determined by the natural isotope distributions of constituent elements and, of course, by the multiplicity of an element in the molecular formula.

8.4 Example (Isotopomers of Cl$_2$) There are Cl$_2$ molecules of mass 70 (${}^{35}$Cl$_2$), 72 (${}^{35}$Cl${}^{37}$Cl), and 74 (${}^{37}$Cl$_2$). Their relative abundances are:

$$\mathrm{prob}({}^{35}\mathrm{Cl}_2) = \mathrm{prob}({}^{35}\mathrm{Cl}) \cdot \mathrm{prob}({}^{35}\mathrm{Cl}) = 0.7577 \cdot 0.7577,$$
$$\mathrm{prob}({}^{35}\mathrm{Cl}\,{}^{37}\mathrm{Cl}) = \mathrm{prob}({}^{35}\mathrm{Cl}) \cdot \mathrm{prob}({}^{37}\mathrm{Cl}) \cdot 2 = 0.7577 \cdot 0.2423 \cdot 2,$$
$$\mathrm{prob}({}^{37}\mathrm{Cl}_2) = \mathrm{prob}({}^{37}\mathrm{Cl}) \cdot \mathrm{prob}({}^{37}\mathrm{Cl}) = 0.2423 \cdot 0.2423.$$

The factor 2 in the second line accounts for the two positions that the heavier isotope can occupy in the molecule.

8.3 Basics of mass spectrometry

Fig. 8.5. Natural isotope distributions of the elements in $E_{11}$ (relative abundance, 0.0 to 1.0, for H, C, N, O, F, Si, P, S, Cl, Br and I).

The distribution of molecular masses is (after rounding)

\[
I_{\mathrm{Cl}_2}(m) =
\begin{cases}
0.57410929 & \text{for } m = 70,\\
0.36718142 & \text{for } m = 72,\\
0.05870929 & \text{for } m = 74,\\
0 & \text{otherwise.}
\end{cases}
\]

This approach is generalizable to any molecular formula. To do so, we need a general definition:

8.5 Definition (Isotope distribution) An isotope distribution is a mapping $I : m \mapsto I(m)$ with a highest mass $\hat m$, i.e. $I(\hat m) > 0$ while $I(m) = 0$ for all $m > \hat m$, and which is normalized, $\sum_m I(m) = 1$.

The natural isotope distributions of elements are examples of isotope distributions, as is the distribution of isotopic masses for a given molecular formula as presented in Section 8.4. In a molecule made of more than one atom, the isotope distributions of the individual atoms combine to produce an isotope distribution characteristic of the molecule's composition. Such distributions can be calculated using the following operation:


8.6 Definition (Convolution product of isotope distributions) On the set $\mathcal{I}$ of all isotope distributions, there is a 'multiplication': For $I_1, I_2 \in \mathcal{I}$ we can introduce the following convolution product:

\[
(I_1 \cdot I_2)(m) = \sum_{i=0}^{m} I_1(i)\, I_2(m - i).
\]

This composition is also called folding. It is associative, which means that $I_1 \cdot (I_2 \cdot I_3) = (I_1 \cdot I_2) \cdot I_3$, and the highest mass $\hat m_{12}$ of $I_1 \cdot I_2$ is $\hat m_{12} = \hat m_1 + \hat m_2$.
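The convolution product can be tried out on the Cl2 example with a few lines of Python. This is a minimal sketch, assuming an isotope distribution is stored as a dictionary from integer mass to relative abundance; the function name `convolve` is hypothetical and not taken from the text.

```python
from collections import defaultdict


def convolve(dist_a, dist_b):
    """Convolution product (folding) of two isotope distributions.

    Each distribution is a dict mapping an integer mass to its abundance.
    """
    result = defaultdict(float)
    for mass_a, prob_a in dist_a.items():
        for mass_b, prob_b in dist_b.items():
            # Every pair of isotopes contributes to the sum of their masses.
            result[mass_a + mass_b] += prob_a * prob_b
    return dict(result)


# Natural isotope distribution of chlorine, taken from Table 8.2.
CL = {35: 0.7577, 37: 0.2423}

cl2 = convolve(CL, CL)
for mass in sorted(cl2):
    print(mass, round(cl2[mass], 8))
```

The three printed abundances can be compared with the values given for masses 70, 72 and 74 in Example 8.4, and the largest key of `cl2` is the sum of the highest masses of the two factors, as stated in Definition 8.6.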

8.7 Definition (Theoretical isotope distribution) Let $E = \{X_i \mid i \in n\}$ be a set of chemical elements, and $\beta$ a molecular formula. Then the theoretical isotope distribution of $\beta$ is defined as the convolution product

\[
I_\beta = \prod_{X \in E} (I_X)^{\beta(X)}.
\]

In order to calculate theoretical isotope distributions of molecular formulas it is helpful to note the following result:

8.8 Remark (Recursive method to calculate theoretical isotope distributions) Let $\beta \in \mathbb{N}^E$ be a molecular formula. For the theoretical isotope distribution of $\beta$ either of the following is true:

i) If there exists exactly one $X_l \in E$ with $\beta(X_l) = 1$ and for all $i \neq l$ we have $\beta(X_i) = 0$, then $I_\beta = I_{X_l}$, or

ii) If there exists at least one $X_l \in E$ with $\beta(X_l) \geq 1$ and if we define $\beta - X_l$ by

\[
(\beta - X_l)(X) =
\begin{cases}
\beta(X_l) - 1 & \text{if } X = X_l,\\
\beta(X) & \text{otherwise,}
\end{cases}
\]

then the following holds: $I_\beta = I_{\beta - X_l} \cdot I_{X_l}$.

8.9 Definition (Masses of molecules) Let $\beta \in \mathbb{N}^E$ be a molecular formula. Then we have

– $m_\beta = \sum_{X \in E} \tilde m_X\, \beta(X)$ the nominal mass,
– $\tilde m_\beta = \min\{m \mid \forall\, m' : I_\beta(m) \geq I_\beta(m')\}$ the mass of highest abundance,
– $\hat m_\beta = \max\{m \mid I_\beta(m) > 0\}$ the highest isotopomer mass and
– $\check m_\beta = \min\{m \mid I_\beta(m) > 0\}$ the lowest isotopomer mass of $\beta$.

Thus, for the highest and the lowest isotopomer mass of $\beta$ we have

\[
\hat m_\beta = \sum_{X \in E} \hat m_X\, \beta(X) \quad\text{and}\quad \check m_\beta = \sum_{X \in E} \check m_X\, \beta(X),
\]

respectively.

Remember that the nominal mass of 𝛽 is the sum of nominal masses of its atoms, where the nominal mass of an element atom is the isotope mass of maximal abundance.


The mass of maximal intensity of $\beta$ is not obtained as easily. Rather, it is taken from the theoretical isotope distribution, which is calculated by folding (see above). In particular, the nominal mass is not necessarily equal to the mass of highest abundance, as demonstrated by the following example:

8.10 Example (Isotopomers of Br2) We have $m_\beta = 2 \cdot 79 = 158$ and

\[
I_\beta(m) =
\begin{cases}
0.25694761 & \text{for } m = 158,\\
0.49990478 & \text{for } m = 160,\\
0.24314761 & \text{for } m = 162,\\
0 & \text{otherwise,}
\end{cases}
\]

that is, $\tilde m_\beta = 160 \neq m_\beta$.
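The recursion of Remark 8.8 and the four masses of Definition 8.9 can be sketched in Python as follows. The sketch assumes that a molecular formula is a dictionary from element symbol to atom count and that the nominal element mass is the isotope mass of maximal abundance. All function names are hypothetical, and the starting value `{0: 1.0}` is only a programming device that acts as the neutral element of the folding.

```python
from collections import defaultdict

# Natural isotope distributions from Table 8.2 (only the elements used here).
ISOTOPES = {
    "H": {1: 1.0},
    "C": {12: 0.989, 13: 0.011},
    "O": {16: 0.9976, 17: 0.0004, 18: 0.0020},
    "Br": {79: 0.5069, 81: 0.4931},
}


def convolve(dist_a, dist_b):
    """Convolution product of two isotope distributions (Definition 8.6)."""
    result = defaultdict(float)
    for mass_a, prob_a in dist_a.items():
        for mass_b, prob_b in dist_b.items():
            result[mass_a + mass_b] += prob_a * prob_b
    return dict(result)


def isotope_distribution(formula):
    """Fold in one element distribution per atom (Remark 8.8, case ii)."""
    dist = {0: 1.0}  # neutral element of the convolution product
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element])
    return dist


def nominal_element_mass(element):
    """Isotope mass of maximal abundance of an element."""
    isotopes = ISOTOPES[element]
    return max(isotopes, key=isotopes.get)


def masses(formula):
    """Nominal, most abundant, highest and lowest mass (Definition 8.9)."""
    dist = isotope_distribution(formula)
    peak = max(dist.values())
    return {
        "nominal": sum(nominal_element_mass(x) * n for x, n in formula.items()),
        "most_abundant": min(m for m, p in dist.items() if p == peak),
        "highest": max(m for m, p in dist.items() if p > 0),
        "lowest": min(m for m, p in dist.items() if p > 0),
    }


print(masses({"Br": 2}))
```

For Br2 the nominal mass and the mass of highest abundance differ, in line with Example 8.10.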

8.3.4 Database of elucidated mass spectra

A database of elucidated spectra is indispensable both for quality assessment of ranking functions and for calculation of MS classifiers. Here, we use spectra and structures from the NIST MS library [224]. This 1998 version of NIST contains 107,888 spectra of 107,812 structures. Spectra and structures are two separate files, linked by numerical identifiers. Before further use, the data were subjected to very rigorous consistency checks to ensure that only coherent spectrum–structure pairs were included: The spectrum file needed to contain the molecular formulas and names of compounds as well as peak lists and identifiers, while the structure file needed to contain identifiers and names as well as structures. Molecular formulas were obtained from structures, and data were excluded if discrepancies arose between the molecular formulas or names for the same identifier. Only structure–spectrum pairs that contained the elements in $E_{11}$ were considered, while structures that were charged, contained unpaired electrons or were isotopically labeled were excluded. 86,052 structure–spectrum pairs remained, 60,761 of these containing only elements in $E_4$.

Table 8.3. Distribution of elements in the MS–structure data set for $E_{11}$.

Number   H        C        N        O        F      Si     P      S      Cl     Br     I
1        176      124      17,989   17,936   1391   3583   1659   2224   4472   2573   590
2        316      369      11,833   19,725   494    1960   250    290    1933   692    118
3        510      645      3544     12,057   1484   731    62     30     590    85     14
4        874      1266     2356     8485     295    404    10     3      332    62     7
5        1000     1758     980      4070     465    162    1      0      122    11     0
6–10     12,218   24,173   626      6043     1035   223    2      0      231    12     0
11–15    19,143   25,310   4        534      245    4      0      0      2      0      0
16–20    19,678   16,350   0        103      116    0      0      0      0      0      0
21–25    10,949   8412     0        6        32     0      0      0      0      0      0
26–30    7731     4606     0        0        6      0      0      0      0      0      0
≥ 31     13,022   2971     0        0        2      0      0      0      0      0      0
∑        85,617   85,984   37,332   68,959   5565   7067   1984   2547   7682   3435   729

An overview of this data set is given in several tables and figures. Table 8.3 shows the distribution of elements, Table 8.4 gives statistics of atom counts (including H atoms) and nominal masses of the compounds in our data set, separately for the complete set and for the subset of compounds made of elements from $E_4$. The histograms in Figures 8.6 and 8.7 show the distributions of masses in both data sets.

Table 8.4. Atom count and mass distributions of the MS–structure data set.

               E          Min.   1. Quart.   Median   Mean     3. Quart.   Max.
Atom count     $E_{11}$   2      25          34       38.67    47          212
Atom count     $E_4$      2      26          35       39.15    47          212
Nominal mass   $E_{11}$   2      178         242      267.26   330         1014
Nominal mass   $E_4$      2      168         226      252.02   312         1014

We then examined the spectra for the presence of molecular ions. 73,845 out of the 86,052 spectra have a peak at the respective nominal mass. The molecular ion is missing in 12,207 spectra (14.186%), which is considerably lower than other estimates, according to which this peak is absent in approximately 30% [180] or even 42% (for only 1426 spectra, see [135]) of EI–MS spectra.

Next, we investigated the difference between the nominal molecular mass $m_M$ and the mass $\tilde m_{max}$ of the base peak of the peak cluster at highest mass, where a peak cluster is defined as a sequence of peaks with masses differing by at most 2. The differences are distributed as follows:

$m_M - \tilde m_{max}$   0        1      15     -2     -1     18     31     43     2      29
Frequency                66,539   4084   2782   1876   1201   970    542    433    387    383
Rel. frequency (%)       77.32    4.75   3.23   2.18   1.40   1.13   0.63   0.50   0.45   0.45

The negative entries above result when the first and second isotope peaks are more intense than the lowest isotopomer peak. The remaining 6855 spectrum–structure pairs are distributed over another 360 mass differences. Knowledge of this mass difference distribution is useful in finding candidates for the molecular mass. On the other hand, these statistics also show that determination of the molecular mass by alternative MS methods such as SI–MS is of real advantage for automated structure elucidation.

Fig. 8.6. Molecule mass distribution in the MS–structure data set for $E_{11}$ (histogram of frequency against nominal mass).

Fig. 8.7. Molecule mass distribution in the MS–structure data set for $E_4$ (histogram of frequency against nominal mass).

Looking only at structure–spectrum pairs from $E_4$, the molecular ion is missing in 7232 spectra (11.902%). The mass differences $m_M - \tilde m_{max}$ are distributed as follows:

$m_M - \tilde m_{max}$   0        1      15     -1     18     31     2      43     60     29
Frequency                49,358   3060   1161   907    901    481    327    309    225    217
Rel. frequency (%)       81.23    5.04   1.91   1.49   1.48   0.79   0.54   0.51   0.37   0.36

Notably, the entry for '−2' is missing here, as none of the elements of $E_4$ has a large second isotope peak (unlike, e.g., Cl). The remaining 3815 structure–spectrum pairs are distributed over another 315 mass differences.

8.4 Ranking functions for mass spectra

As mentioned above, a mixture of positively charged fragment ions is measured in a mass spectrometer instead of the analyte itself. The ions are separated according to their mass; for each integer mass, the intensity of the detected signal is proportional to the abundance of ions of that mass. Each ion present in the ion mixture produces signals according to the theoretical isotope distribution of its corresponding elements. The mass spectrum thus is a linear combination of theoretical isotope distributions with positive coefficients.

Note that in the following section we use the term molecular formula to refer (in general) to the neutral species, and formula or chemical formula to refer to a 'molecular' formula of a fragment ion. For structural formulas, we will use structure as an alternative term.

8.11 Definition (Linear combinations of spectra) Let $\beta_i \in \mathbb{N}^E$, $i \in n$, be chemical formulas and $a_i \in \mathbb{R}^+_0$. Then the linear combination of the $I_{\beta_i}$ with coefficients $a_i$ is the mapping

\[
\sum_{i \in n} a_i I_{\beta_i} : \mathbb{N}^\star \to \mathbb{R}^+_0 : m \mapsto \sum_{i \in n} a_i I_{\beta_i}(m).
\]

How does a mass spectrum arise? Let $\beta_i$, $i \in n$, be the formulas of the fragment ions $M_{ij}$, $i \in n$, $j \in n_i$, present in the ion mixture in a mass spectrometer, where $\beta_{M_{ij}} = \beta_i$. Further we assume that the fragment ions $M_{ij}$ have abundance $a_{ij}$. Then

\[
I = \sum_{i \in n} \sum_{j \in n_i} a_{ij} I_{\beta_i}. \tag{8.1}
\]


Calculating the abundances $a_{ij}$ from the structures of the fragment ions $M_{ij}$ is not possible at present. Attempts to approximate these values using statistical learning methods such as neural networks were successful for limited compound classes only, while many programs capable of performing such predictions are no longer available (e.g. MASSIS [44, 45, 71], MASSIMO [84, 85]). MS Fragmenter from ACD Labs [2] claims to predict fragment intensities, but no details are published.

Furthermore, the mass resolution of even recent accurate mass spectrometers is insufficient to identify molecular formulas unambiguously from their theoretical isotope distributions. Kind and Fiehn [159] discuss this in detail, including the influence of various combinations of mass and relative isotopic abundance accuracy on formula determination.

These two facts will be taken into consideration here by treating the abundances as unknowns and the measurement error as the target variable of an optimization problem. We thus write Equation (8.1) as

\[
I = \sum_{i \in n} \Bigl( I_{\beta_i} \sum_{j \in n_i} a_{ij} \Bigr). \tag{8.2}
\]

The sums $a_i = \sum_{j \in n_i} a_{ij}$, $i \in n$, of the fragment ion abundances are always positive, so we can reformulate Equation (8.2) into

\[
I = \sum_{i \in n} a_i I_{\beta_i} \quad\text{with } a_i > 0,\ i \in n. \tag{8.3}
\]

We then propose:

8.12 Proposition Let $I$ be a mass spectrum taken with infinite precision and $\{\beta_i \mid i \in n\}$ the set of all molecular formulas of fragment ions contributing to the mass spectrum. Then there exist $x_i \in \mathbb{R}^+_0$ for $i \in n$ such that

\[
I = \sum_{i \in n} x_i I_{\beta_i}.
\]

Match values for molecular formulas and structural formulas

The above proposition can be used to determine both a molecular formula and a structural formula from a mass spectrum $I$. We will use Proposition 8.12 to calculate a match value (MV) for a given candidate molecular or structural formula $K$ that measures the plausibility that $K$ explains spectrum $I$. Ideally, such a compatibility match value should fulfill several requirements. It should lie between 0 and 1,

(R) $\mathrm{MV}(I, K) \in [0, 1]$,

and be high if $I$ is well explained by $K$ and low if it is poorly explained. Further requirements for such a ranking function are evident: The match value should be unity for the correct (true) candidate $K_T$,

(T) $\mathrm{MV}(I, K_T) = 1$,

while match values should be lower for wrong (false) candidates $K_F$ than for the correct candidate,

(F) $\mathrm{MV}(I, K_F) < \mathrm{MV}(I, K_T)$.

If there was a match value that fulfilled the above conditions, the verification step of our structure elucidation problem would be solved. Unfortunately, there is no ranking function for real mass spectra that fulfills the latter two requirements in general. However, if we assume an exact recording of mass spectra intensities, we can at least define a ranking function for molecular formulas that satisfies requirements (R) and (T).

Let $\beta_i$, $i \in n$, be the chemical formulas of fragment ions of candidate $K$ for spectrum $I$ according to Proposition 8.12. Then

\[
\min_{x \geq 0} \sum_m \Bigl( I(m) - \sum_{i \in n} x_i I_{\beta_i}(m) \Bigr) = 0, \tag{8.4}
\]

if $K = K_T$ is the correct candidate. If any intensity cannot be explained by the theoretical isotope distributions $I_{\beta_i}$, then the left-hand side of Equation (8.4) is positive. It makes sense to weight a large unexplained intensity more than several small deviations of the same total extent. This may be achieved, for example, by squaring the differences. We obtain

\[
\min_{x \geq 0} \sum_m \Bigl( I(m) - \sum_{i \in n} x_i I_{\beta_i}(m) \Bigr)^2 \in \Bigl[ 0, \sum_m (I(m))^2 \Bigr]. \tag{8.5}
\]

By normalizing the above we can define a ranking function that satisfies requirement (R), and, for mass spectra measured at infinite precision, also (T):

\[
\mathrm{MV}(I, K) = 1 - \sqrt{ \Bigl( \sum_m I(m)^2 \Bigr)^{-1} \min_{x \geq 0} \sum_m \Bigl( I(m) - \sum_{i \in n} x_i I_{\beta_i}(m) \Bigr)^2 } \tag{8.6}
\]

Here, the square root results in a more even distribution of match values, since the minimum from Equation (8.5) usually is closer to zero than to the upper interval limit. In addition, it justifies the expression 'explained fraction of total intensity' for our ranking function.

Two important pieces of information provided by low resolution mass spectra contribute inherently to this MV,

– presence (or absence) of a peak at a particular mass,
– agreement (or lack thereof) with theoretical isotope distributions.

This is all the information necessary for calculation of the molecular formula.

In the following, we cover all subproblems relevant in calculation of this match value. First we determine the set of possible fragment chemical formulas $\beta_i$ of a molecular formula candidate or a structural formula candidate. In the case of molecular formulas, all $\beta' \subseteq \beta$ are possible initially. In order to keep the dimension of the optimization problem small, only those $\beta'$ that can actually contribute to peak explanation should be used. A first step is to restrict calculations to formulas whose nominal mass $m_{\beta'}$ is at a peak position of the spectrum $I$ under examination, that is

\[
\{\beta_i \mid i \in n\} = \{\beta' \subseteq \beta \mid I(m_{\beta'}) > 0\}.
\]

We now need a method to generate such formulas.
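The restriction to sub-formulas with a peak at their nominal mass can be illustrated with a brute-force enumeration in Python. This is only a sketch under simple assumptions: formulas are dictionaries from element symbol to atom count, the peak list is a hypothetical set of integer masses, and no further existence criteria are applied to the fragments. The function names are not taken from the text.

```python
from itertools import product

# Nominal element masses for the elements used in this sketch.
NOMINAL_MASS = {"C": 12, "H": 1, "O": 16}


def nominal_mass(formula):
    """Sum of nominal element masses, weighted by atom counts."""
    return sum(NOMINAL_MASS[x] * n for x, n in formula.items())


def candidate_fragments(formula, peak_masses):
    """All non-empty sub-formulas whose nominal mass is a peak position."""
    elements = list(formula)
    ranges = [range(formula[x] + 1) for x in elements]
    hits = []
    for counts in product(*ranges):
        sub = {x: n for x, n in zip(elements, counts) if n > 0}
        if sub and nominal_mass(sub) in peak_masses:
            hits.append(sub)
    return hits


# Hypothetical peak positions, chosen only for illustration.
peaks = {15, 28, 29}
fragments = candidate_fragments({"C": 2, "H": 6, "O": 1}, peaks)
for fragment in fragments:
    print(nominal_mass(fragment), fragment)
```

Brute force is adequate for small formulas only; the backtracking generator described next avoids visiting sub-formulas that are already too heavy.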

Generation of molecular formulas

Candidates $\beta \in \mathbb{N}^E$ for the formula with a given nominal mass $m$ have to satisfy the Diophantine equation

\[
\sum_{X \in E} \tilde m_X\, \beta(X) = m.
\]

We solve this equation using a backtracking algorithm. Often formulas within a given mass interval are in demand. It should be possible to define upper and lower bounds for atom counts of certain elements, so that prior information on element composition can be taken into account. The following algorithm generates all molecular formulas $\beta$ whose nominal mass is within $[m_{min}, m_{max}]$ and that are compatible with the soft molecular formula $B$ with $B(X_i) = [\beta_{min}(X_i), \beta_{max}(X_i)]$, where $E = \{X_i \mid i \in n\}$:

8.13 Algorithm GenMolForm($\beta_{min}$, $\beta_{max}$, $m_{min}$, $m_{max}$)

(1) begin()
(2) while EndFlag = false do
(3)     Output($\beta$)
(4)     next()
(5) end

Function: begin()

(1) EndFlag ← false
(2) $\beta$ ← $\beta_{min}$
(3) if $m_\beta < m_{min}$ ∨ $m_{max} < m_\beta$
(4)     next()
(5) end

Function: next()

(1) do step()
(2) while ¬($m_{min} \leq m_\beta \leq m_{max}$) ∧ EndFlag = false

Function: step()

(1) for each $i \in n$ do
(2)     if $\beta(X_i) < \beta_{max}(X_i)$ ∧ $m_\beta + m_{X_i} \leq m_{max}$
(3)         $\beta(X_i)$ ← $\beta(X_i) + 1$
(4)         return
(5)     else
(6)         $\beta(X_i)$ ← $\beta_{min}(X_i)$
(7)     end
(8) end
(9) EndFlag ← true
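A compact Python rendering of Algorithm 8.13 might look as follows. It is a sketch rather than the authors' implementation: the functions begin(), next() and step() are merged into one generator, formulas and bounds are assumed to be dictionaries keyed by element symbol, and the parameter `element_mass` (nominal element masses, assumed positive) is an addition needed to compute $m_\beta$.

```python
def gen_mol_form(beta_min, beta_max, m_min, m_max, element_mass):
    """Yield all formulas within the atom-count bounds whose nominal mass
    lies in [m_min, m_max]."""
    elements = list(element_mass)
    beta = dict(beta_min)

    def mass():
        return sum(element_mass[x] * beta[x] for x in elements)

    def step():
        """Advance the counter; return False once all formulas were visited."""
        for x in elements:
            if beta[x] < beta_max[x] and mass() + element_mass[x] <= m_max:
                beta[x] += 1
                return True
            # This element cannot be increased: reset it and carry over.
            beta[x] = beta_min[x]
        return False

    alive = True
    while alive:
        if m_min <= mass() <= m_max:
            yield dict(beta)
        alive = step()


masses = {"H": 1, "C": 12, "O": 16}
lower = {"H": 0, "C": 0, "O": 0}
upper = {"H": 6, "C": 2, "O": 1}
found = list(gen_mol_form(lower, upper, 30, 30, masses))
print(found)
```

The test `mass() + element_mass[x] <= m_max` is the pruning step of step(): once adding a single atom of an element exceeds the upper mass bound, no higher count of that element needs to be tried for the current counts of the remaining elements.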

Appendix C contains tables of molecular formula counts for various nominal masses between 1 and 1000. Appendix D lists all molecular formulas from B𝑐E4 of masses between 1 and 150 containing at least one C atom. As mentioned before, when calculating match values it makes sense to include only formulas whose nominal masses in fact appear in the mass spectrum.

Molecular formulas and ion types

There are unusually high-energy conditions inside a mass spectrometer and, as a consequence, ions may contain atoms in very variable states, such that the existence criteria (Gr1) and (Gr2) from Theorem 1.23 are no longer valid. In particular, positively charged ions that do not satisfy (Gr1) can exist. We introduce the terms odd electron ions (OEI) for ions with an unpaired electron and even electron ions (EEI) for ions with no unpaired electrons. OEI satisfy (Gr1), EEI do not. We will maintain conditions (Gr2) and (Con) also for ions, although a few possible formulas of fragment ions will be excluded as a result. This is justified, however, by the significant reduction in counts of possible molecular formulas (see Appendix C).

Solution of the quadratic optimization problem

Calculating match values using Equation (8.6) means solving an optimization problem of the form

\[
\min \sum_{j \in p} \Bigl( c_j - \sum_{i \in n} x_i d_{ij} \Bigr)^2 \quad\text{subject to } x_i \geq 0,\ i \in n.
\]

This is a least squares (LS) problem with constraints; it is solved here by the algorithm NLPQL [274].
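For experimentation, the constrained least squares problem and the match value of Equation (8.6) can be sketched in plain Python. Note that the solver below is a simple cyclic coordinate descent with clipping at zero, swapped in here for illustration; it is not NLPQL. Spectra and isotope distributions are assumed to be dictionaries from integer mass to intensity, and all function names are hypothetical.

```python
import math


def nnls_coordinate_descent(columns, target, sweeps=200):
    """Minimise sum_j (target[j] - sum_i x[i] * columns[i][j])**2 for x >= 0.

    Returns the coefficients and the remaining sum of squared residuals.
    """
    x = [0.0] * len(columns)
    residual = list(target)  # target minus the current linear combination
    norms = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        for i, col in enumerate(columns):
            if norms[i] == 0.0:
                continue
            # Best value of x[i] with all other coefficients fixed, clipped at 0.
            gain = sum(r * v for r, v in zip(residual, col)) / norms[i]
            new_xi = max(0.0, x[i] + gain)
            delta = new_xi - x[i]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                x[i] = new_xi
    return x, sum(r * r for r in residual)


def match_value(spectrum, fragment_dists):
    """Match value in the sense of Equation (8.6)."""
    masses = sorted(set(spectrum).union(*fragment_dists))
    target = [spectrum.get(m, 0.0) for m in masses]
    columns = [[d.get(m, 0.0) for m in masses] for d in fragment_dists]
    _, rest = nnls_coordinate_descent(columns, target)
    total = sum(v * v for v in target)
    return 1.0 - math.sqrt(rest / total)


# Isotope distributions of Cl, Cl2 and Br2 (values from Table 8.2 and
# Examples 8.4 and 8.10).
CL = {35: 0.7577, 37: 0.2423}
CL2 = {70: 0.57410929, 72: 0.36718142, 74: 0.05870929}
BR2 = {158: 0.25694761, 160: 0.49990478, 162: 0.24314761}

# Synthetic, noise-free spectrum: 2.0 * I_Cl2 + 0.5 * I_Cl.
spectrum = {m: 2.0 * v for m, v in CL2.items()}
spectrum.update({m: 0.5 * v for m, v in CL.items()})

print(match_value(spectrum, [CL, CL2]))  # fragments explain the spectrum
print(match_value(spectrum, [BR2]))      # fragments explain nothing
```

With the fragment set that generated the synthetic spectrum the match value is essentially 1, whereas a fragment set without any peak in common with the spectrum leaves the full intensity unexplained and yields 0.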

8.4.1 Ranking of molecular formulas

Calculation of match values for molecular formulas

For a given mass spectrum $I$ and molecular formula $\beta$ let $\beta_i \subseteq \beta$, $i \in n$, be all molecular formulas that satisfy (Gr2) and (Con) of Theorem 1.23, and additionally fulfill

(Frag) $I(m_{\beta_i}) > 0$,

i.e. a peak must be present at the nominal mass of $\beta_i$. We then calculate the match value of $\beta$ with respect to $I$ as

\[
\mathrm{MV}(I, \beta) = 1 - \sqrt{ \Bigl( \sum_m I(m)^2 \Bigr)^{-1} \min_{x \geq 0} \sum_m \Bigl( I(m) - \sum_{i \in n} x_i I_{\beta_i}(m) \Bigr)^2 } \tag{8.7}
\]

8.14 Example (Match value of $\beta$ = C6H12O2 for spectrum $I$ from Ex. 8.2) Table 8.5 lists the molecular formulas $\beta_i \subseteq \beta$ that satisfy (Frag), (Gr2), and (Con). The last column gives the value of $x_i$ as solution of Equation (8.7). The match value obtained is $\mathrm{MV}(I, \beta) = 0.9942863$, indicating a good match for the formula C6H12O2. The match value is not equal to 1 (which would be expected for the correct candidate) as the peak at mass 75 has too low an intensity to be an isotope peak of C3H6O2. The theoretical isotope distribution of C3H6O2 indicates that a peak of intensity 0.034 should be expected at mass 75, whereas $I(75) = 0.028$. The other fragment chemical formulas for mass 74, C6H2 and C4H10O, would give rise to an even wider deviation. Unfortunately, measured isotope peak intensities often deviate from theoretical values, seriously affecting the automatic verification of molecular formulas. This is most pronounced for low intensity peaks in EI–MS.

How can this match value help us in determining the molecular formula? We generate all molecular formulas $\beta$ ∈ B𝑐E11 with mass $m_\beta = 116$, calculate their match values with respect to $I$, and list them in decreasing order. This is called a ranking. There are 1451 molecular formulas with mass 116 altogether, 220 of which are from B𝑐E11. 23 of these have a higher match value than C6H12O2. If we admit only the elements of $E_4$, then there are 162 possible molecular formulas of mass 116, out of which 24 are from B𝑐E4, and C6H12O2 is ranked 9th among these.

Table 8.5. Match value for C6H12O2 and the spectrum from Example 8.2.

$\beta_i$   $m_{\beta_i}$   $x_i$      $\beta_i$   $m_{\beta_i}$   $x_i$      $\beta_i$   $m_{\beta_i}$   $x_i$
C2H2     26    0.0317     C2H5O    45    0.0000     C4H9O    73    0.0000
C2H3     27    0.1751     C4H3     51    0.0052     C6H2     74    0.0000
C2H4     28    0.3999     C3HO     53    0.0068     C3H6O2   74    1.0374
C2H5     29    0.2426     C4H5     53    0.0067     C4H10O   74    0.0000
CH2O     30    0.0000     C3H2O    54    0.0057     C6H3     75    0.0000
C2H6     30    0.0000     C4H6     54    0.0000     C3H7O2   75    0.0000
CH3O     31    0.0283     C3H3O    55    0.0000     C4H3O2   83    0.0000
CH4O     32    0.0047     C4H7     55    0.1419     C5H7O    83    0.0000
O2       32    0.0000     C3H4O    56    0.0507     C6H11    83    0.0096
HO2      33    0.0049     C4H8     56    0.0000     C4H4O2   84    0.0000
C3H2     38    0.0062     C2O2     56    0.0000     C5H8O    84    0.0025
C3H3     39    0.0970     C2HO2    57    0.3064     C6H12    84    0.0000
C3H4     40    0.0081     C3H5O    57    0.0000     C4H5O2   85    0.0889
C2O      40    0.0000     C4H9     57    0.0000     C5H9O    85    0.2111
C2HO     41    0.0093     C2H2O2   58    0.0000     C4H6O2   86    0.0000
C3H5     41    0.2447     C3H6O    58    0.0000     C5H10O   86    0.0000
C2H2O    42    0.0593     C4H10    58    0.0021     C4H7O2   87    0.0000
C3H6     42    0.0000     C2H3O2   59    0.0739     C5H11O   87    0.2637
C2H3O    43    0.0000     C3H7O    59    0.1354     C4H8O2   88    0.0031
C3H7     43    0.4585     C2H4O2   60    0.0000     C5H12O   88    0.0000
C2H4O    44    0.0000     C3H8O    60    0.0000     C6O      88    0.0000
C3H8     44    0.0177     C5       60    0.0000     C5H9O2   101   0.0138
CO2      44    0.0000     C6H      73    0.0160
CHO2     45    0.0236     C3H5O2   73    0.0000

Table 8.6 contains the top 40 molecular formulas $\beta$ ∈ B𝑐E11 together with their match values. Three empirically derived filters are given in [124] to exclude molecular formulas that rarely occur in nature. After applying these filters, 153 molecular formulas with elements from $E_{11}$ remain, among which C6H12O2 is at position 19. Restriction to the elements of $E_4$ promotes C6H12O2 to rank 5 of 9 molecular formulas. Figure 8.8 provides an overview of the match values of all 220 molecular formulas. Those from $E_4$ are shown in black, those passing the Heuerding–Clerc filters [124] are marked with a dot. The correct molecular formula is marked with an arrow.

Table 8.6. Ranking of molecular formulas of mass 116 for the spectrum of Example 8.2, formulas passing the filters are marked with ×.

Rank   $\beta$        $\mathrm{MV}(I, \beta)$   Filter     Rank   $\beta$        $\mathrm{MV}(I, \beta)$   Filter
1      C3H5N2OP    0.9995987   ?          21     C2H5N4P     0.9947663   ?
2      C3H8N4O     0.9994732   ?          22     C2H8N6      0.9947546   ?
3      C4H5N2OF    0.9990449   ?          23     C3H5OSiP    0.9945391   ?
4      C4H5O2P     0.9981975   ×          24     C6H12O2     0.9942863   ×
5      C4H8N2O2    0.9978946   ×          25     C4H8O2Si    0.9941528   ×
6      C5H12N2O    0.9976136   ×          26     C4H5OFSi    0.9940334   ×
7      C3H8N2OSi   0.9974305   ×          27     C6H9OF      0.9940301   ×
8      C3H4N2O3    0.9966711              28     C4H12N2Si   0.9898372   ×
9      C4H9N2P     0.9962826   ×          29     C6H16N2     0.9898041   ×
10     C3H5N4F     0.9962825   ×          30     C7H16O      0.9867311   ×
11     C4H12N4     0.9962643   ×          31     C5H12OSi    0.9867233   ×
12     C5H9OP      0.9962090   ×          32     C3H2N2FP    0.9786070   ×
13     C3H5N2FSi   0.9961477   ×          33     C5H6FP      0.9786049   ×
14     C5H9N2F     0.9960570   ×          34     C4H9SiP     0.9785747   ×
15     C2H4N4O2    0.9958909              35     C4H2N2F2    0.9782367   ×
16     C5H8O3      0.9958316   ×          36     C6H13P      0.9782361   ×
17     C5H5O2F     0.9952805   ×          37     C6H6F2      0.9782056   ×
18     C2H4N2O2Si  0.9949604   ×          38     C3HN2O2F    0.9768266   ×
19     C2H5N2SiP   0.9948218   ×          39     C2HN2O2P    0.9765410
20     C2H8N4Si    0.9948070   ×          40     C2HN4OF     0.9760385

Three of the six formulas at ranks 1–3 and 21–23 carry a × in the source; the assignment of these three marks to individual rows (shown as ?) cannot be recovered from the source layout.

Detailed inspection of ranking lists reveals that the top positions are usually occupied by molecular formulas containing several elements. This is because the number of possible fragment formulas is higher for these candidates than for those made of few elements. In particular, molecular formulas made of several monoisotopic elements are ranked highly, as the experimental peak clusters can be well fitted using theoretical isotope distributions with few or no isotope peaks.

Now that we are able to calculate match values for molecular formulas with respect to a mass spectrum and to establish ranking lists, there is another issue: How many candidates in a hit list should be considered if the correct candidate is to be included with a predefined probability?

Selection of relevant candidates from a ranking

Let us consider the distribution of match values of correct candidates for the molecular formula with respect to the respective spectrum. For a random sample of $n = 1000$ spectra $I_i$ we calculate the match values of the correct molecular formula $\beta_i$. Figure 8.9 shows the distribution of match values $x_i = \mathrm{MV}(I_i, \beta_i)$ in the form of a histogram. For $p = \frac{1}{100}, \frac{2}{100}, \ldots, \frac{99}{100}$ we determine the so-called $p$-quantiles of $(x_i)_{i \in n}$. The number $q_p \in \mathbb{R}$ is a $p$-quantile of $(x_i)_{i \in n}$ if

\[
\frac{1}{n} \bigl| \{ i \in n \mid x_i \leq q_p \} \bigr| \geq p \quad\text{and}\quad \frac{1}{n} \bigl| \{ i \in n \mid x_i \geq q_p \} \bigr| \geq 1 - p.
\]

Fig. 8.8. Match values of the molecular formula candidates of mass 116 (match value against molecular formula candidates; legend: Heuerding–Clerc criterion; elements C, H, N, O only).

Fig. 8.9. Histogram of the match values of correct molecular formulas for a sample of 1000 mass spectra (frequency and relative frequency against match value).

Fig. 8.10. Distribution of the match values of correct molecular formulas for a sample of 1000 mass spectra (match value and quantile against pair of spectrum and molecular formula).

Figure 8.10 illustrates the determination of $p$-quantiles, while Table 8.7 contains $p$-quantiles for various $p$. The quantiles can be used in the following way: Let us consider a spectrum $I_j$ inside the sample, i.e. $j \in n$. If we want to make a selection of candidate formulas $(\beta_i)_{i \in \Omega}$ that contains the true candidate with a certain probability $\geq p$, it is sufficient to choose all candidates $\beta_i$ with $\mathrm{MV}(I_j, \beta_i) \geq q_{1-p}$.

The quantiles assist us in deciding on the acceptance (or rejection) of a molecular formula candidate, given a prescribed probability. If, for a spectrum $I$ from our data set and a prescribed probability $p$, all molecular formulas $\beta$ of correct mass $m = m_\beta$ are generated and all $\beta$ with match value $\mathrm{MV}(I, \beta) \geq q_{1-p}$ are selected, then the correct molecular formula is contained in our selection with probability $p$.

Table 8.7. Quantiles $q_p$ for match values of correct molecular formulas, for various probabilities $p$.

$p$     $q_p$         $p$     $q_p$         $p$     $q_p$
0.01   0.6056967    0.10   0.9313282    0.91   0.9970228
0.02   0.7323542    0.20   0.9596450    0.92   0.9974521
0.03   0.8171856    0.30   0.9720852    0.93   0.9975946
0.04   0.8519153    0.40   0.9801180    0.94   0.9977639
0.05   0.8775297    0.50   0.9843779    0.95   0.9979752
0.06   0.8969463    0.60   0.9878099    0.96   0.9982111
0.07   0.9111062    0.70   0.9913571    0.97   0.9984834
0.08   0.9205837    0.80   0.9941656    0.98   0.9987761
0.09   0.9255477    0.90   0.9969016    0.99   0.9992254

8.15 Example We select 100 spectra from our data set with compounds composed of the elements within $E_4$, and 100 spectra of compounds containing elements of $E_{11}$. For each spectrum, we generate all molecular formulas of correct molecular mass, calculate MVs and perform a ranking. To compare the quality of rankings for different examples, we relate the ranking position of the correct candidate to the total number of candidates: The relative ranking position (RRP)

\[
RRP_0 = \frac{\text{position of the correct candidate} - 1}{\text{total number of candidates} - 1}
\]

equals zero if the correct candidate is ranked first, and 1 if it is ranked last. RRP increases linearly with falling ranking position of the correct candidate, and, sensibly, is not defined for a one-element candidate set. To keep the number of candidates manageable, we included only compounds with molecular mass ≤ 200. The following table summarizes the results.

E          Min.     1. Quart.   Median   Mean     3. Quart.   Max.
$E_4$      0.0000   0.0632      0.1962   0.2463   0.4206      0.7273
$E_{11}$   0.0000   0.0098      0.0685   0.1037   0.1443      0.8022

The first column contains the RRP minima, which show that the correct candidate is ranked first in at least one case for both sets of spectra ($E_4$ and $E_{11}$). The first quartile is synonymous with the 25% quantile. In at least one quarter of cases RRP is less than 0.0632 ($E_4$) or 0.0098 ($E_{11}$). The median is the 50% quantile, the third quartile is the 75% quantile. The column Mean contains the arithmetic mean. Figure 8.11 shows the distribution of RRP for the elements of $E_4$ as a histogram. In 35% of cases, the RRP is less than 0.1.

A corresponding diagram for the elements of $E_{11}$ is shown as Figure 8.13. The rankings for this set are better than for the $E_4$ examples. One reason is that spectra of compounds containing elements with distinct isotope patterns (Si, S, Cl, Br) can be recognised easily. On the other hand, incorrect molecular formula candidates formed of isotope-containing elements obtain visibly worse match values and are automatically ranked lower. The scatter plots in Figures 8.12 and 8.14 show the ranking position of the correct candidate and the number of candidates at probability 0.9 for the respective data sets. Points above the diagonal represent cases where the correct molecular formula is excluded incorrectly, while points on and below the diagonal represent cases where the correct molecular formula is included in the set of selected candidates.

Fig. 8.11. Histogram of RRP for correct molecular formulas, for 100 mass spectra of compounds of $E_4$ (frequency against relative ranking position).

Fig. 8.12. Ranking position of correct molecular formulas and number of candidates proposed at probability 0.9, for 100 compounds of $E_4$ (ranking position against number of candidates at reliability 0.9).

Fig. 8.13. Histogram of RRP for correct molecular formulas, for 100 mass spectra of compounds of $E_{11}$ (frequency against relative ranking position).

Fig. 8.14. Ranking position of correct molecular formulas and number of candidates proposed at probability 0.9, for 100 compounds of $E_{11}$ (ranking position against number of candidates at reliability 0.9, logarithmic axes).

Finally we examine in how many cases the correct molecular formula is among the proposed candidates. The following table shows these numbers for various probabilities $p$:

E          $p = 0.99$   $p = 0.95$   $p = 0.90$   $p = 0.75$   $p = 0.50$
$E_4$      99           98           93           85           66
$E_{11}$   98           96           87           76           55
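The quantile-based candidate selection and the relative ranking position used in this example can be sketched in Python as follows. The sketch assumes that match values are plain floating-point numbers; the function names and the small sample are hypothetical and serve only to illustrate the definitions.

```python
import math


def lower_p_quantile(values, p):
    """Return one valid p-quantile: the sample value of rank ceil(n * p)."""
    ordered = sorted(values)
    # Small tolerance so that products such as 5 * 0.2 are not rounded up.
    rank = max(1, math.ceil(len(ordered) * p - 1e-12))
    return ordered[rank - 1]


def is_p_quantile(q, values, p):
    """Check the two defining inequalities of a p-quantile."""
    n = len(values)
    at_most = sum(1 for v in values if v <= q)
    at_least = sum(1 for v in values if v >= q)
    return at_most >= p * n - 1e-12 and at_least >= (1 - p) * n - 1e-12


def select_candidates(match_values, threshold):
    """Keep all candidates whose match value reaches the threshold q_(1-p)."""
    return [name for name, mv in match_values.items() if mv >= threshold]


def relative_ranking_position(position, total):
    """RRP: 0 if the correct candidate is first, 1 if it is last."""
    if total < 2:
        raise ValueError("RRP is undefined for a single candidate")
    return (position - 1) / (total - 1)


# Hypothetical match values of correct candidates for five spectra.
sample = [0.2, 0.5, 0.9, 0.95, 0.99]
q = lower_p_quantile(sample, 0.2)
print(q, is_p_quantile(q, sample, 0.2))

# Selection at probability p = 0.9 uses the threshold q_0.10 from Table 8.7.
ranking = {"C3H8N4O": 0.9994732, "C6H12O2": 0.9942863, "C7H16O": 0.9867311}
print(select_candidates(ranking, 0.9313282))

# C6H12O2 at position 24 of 220 candidates (Example 8.14).
print(relative_ranking_position(24, 220))
```

For the spectrum of Example 8.2, position 24 among 220 candidates corresponds to an RRP of 23/219, i.e. roughly 0.105.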

8.4.2 Ranking of structural formulas

Calculation of match values for structural formulas

As mentioned earlier, fragmentation in a mass spectrometer generally follows known reaction schemes. We will use this knowledge during calculation of a match value for structure candidate $M$ with respect to experimental mass spectrum $I$. The calculation is similar to that for molecular formula candidates. However, in the case of structural formulas we can considerably restrict the set of possible fragment molecular formulas $\beta_i$. We will consider only such molecular formulas for which fragments exist that are derived from $M$ by successive ionization and fragmentation reactions. To simulate these we go back to the previous work in Section 2.3.

In Subsection 2.2.1 we introduced several atom types to assist in the definition of reaction schemes in mass spectrometry. We use these atom types to represent various sets of elements:

A: all elements
C: carbon
H: hydrogen
Y: all non-H elements
Z: all elements with free electron pairs (N, O, P, S, halogens)

Reactions in a mass spectrometer are divided into three classes: ionization, fragmentation, and rearrangement reactions. Ionization in the mass spectrometer converts a molecule without charge or unpaired electron into a positively charged ion with an unpaired electron, a cation radical. Ionization steps take place exclusively at the beginning of a fragmentation path. We use these ionization reactions:

– 𝑛 ionization
– 𝜋 ionization
– 𝜎 ionization

[Reaction schemes for 𝑛, 𝜋 and 𝜎 ionization, drawn with the atom types Z and C.]

Alternative bond multiplicities are coded as dashed lines (e.g. 𝜋 ionization can occur at double or triple bonds). Following one of the ionization reactions, various fragmentation reactions may occur. Neutral particles resulting from a reaction are not detected and are irrelevant for further fragmentation, whereas positive ions can be detected and can also undergo further fragmentation. We consider these fragmentation reactions and rearrangements:

– 𝛼 cleavage
– 𝜎 cleavage
– H rearrangements

[Reaction schemes for 𝛼 cleavage, 𝜎 cleavage and H rearrangements, drawn with the atom types A, H, Y and Z.]
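The atom types listed earlier act as wildcards in the reaction schemes. As a minimal sketch of that idea, the following function tests whether an element symbol is covered by an atom type. The function name `matches` and the explicit halogen set are assumptions made for illustration; they are not taken from the software described in the text.

```python
# Elements with free electron pairs (type Z). The halogen list F, Cl, Br, I
# is an illustrative assumption; the text only says "halogens".
LONE_PAIR_ELEMENTS = {"N", "O", "P", "S", "F", "Cl", "Br", "I"}


def matches(atom_type: str, element: str) -> bool:
    """Return True if `element` belongs to the set denoted by `atom_type`."""
    if atom_type == "A":   # all elements
        return True
    if atom_type == "C":   # carbon only
        return element == "C"
    if atom_type == "H":   # hydrogen only
        return element == "H"
    if atom_type == "Y":   # all non-H elements
        return element != "H"
    if atom_type == "Z":   # elements with free electron pairs
        return element in LONE_PAIR_ELEMENTS
    raise ValueError(f"unknown atom type: {atom_type}")


if __name__ == "__main__":
    for atom_type in "ACHYZ":
        print(atom_type, [e for e in ("C", "H", "O", "Cl") if matches(atom_type, e)])
```

A reaction scheme can then be checked against a molecular graph by testing each atom of a candidate embedding with this predicate.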

For match value calculation, those molecular formulas $\beta_i$, $i \in n$, are determined for which $I(\tilde m_{\beta_i}) > 0$, and then

$$\mathrm{MV}(I, M) = 1 - \sqrt{\Big(\sum_m I(m)^2\Big)^{-1} \min_{x \ge 0} \sum_m \Big(I(m) - \sum_{i \in n} x_i I_{\beta_i}(m)\Big)^2}$$

is calculated as before for molecular formula candidates. The set of reactions given above is not meant to be complete but rather forms a minimal system to describe what happens in a mass spectrometer. In the following, we show how they can be used to calculate match values and rank structure candidates.

8.16 Example Figure 8.15 shows possible MS reactions for methyl n-pentanoate. These include 𝑛 ionization (n-I), 4-atom H shifts, and 𝛼 cleavage, producing fragment ions of masses 116, 115, 87, 85, 74 and 43. H shifts by 5 atoms and fragment ions triggered by 𝜎 ionization are omitted from Figure 8.15 for the sake of clarity. H shifts by 5 atoms do not produce any ions relevant for spectrum explanation. Figure 8.2 shows the mass spectrum 𝐼 of methyl n-pentanoate. Comparing the spectrum and the masses resulting from 𝑛 ionization we see that some peaks are not explained, namely those at mass 57, 55, 41, 39, 29, 28, and 27. Some of these are explained by 𝜎 ionization, which results in ions of mass 101, 87, 73, 59, 57, 43, 29, and 15.


Fig. 8.15. MS reactions of methyl n-pentanoate.

Table 8.8. Calculation of the match value for methyl n-pentanoate from Example 8.2.

𝛽𝑖        𝑚̃𝛽𝑖   𝑥𝑖         𝛽𝑖        𝑚̃𝛽𝑖   𝑥𝑖
C2H5      29     0.2515     C3H5O2    73     0.0156
C2H3O     43     0.0000     C3H6O2    74     1.0379
C3H7      43     0.4606     C5H9O     85     0.3008
CHO2      45     0.0242     C5H10O    86     0.0000
C4H9      57     0.3134     C4H7O2    87     0.2619
C2H3O2    59     0.2093     C5H9O2    101    0.0138
C2H4O2    60     0.0013

Fig. 8.16. Fragment ions of methyl n-pentanoate. Fragment molecular formulas shown, by mass: CH3 (𝑚 = 15), C2H5 (29), C2H3O (43), C3H7 (43), CHO2 (45), C4H9 (57), C2H3O2 (59), C2H4O2 (60), C3H5O2 (73), C3H6O2 (74), C5H9O (85), C5H10O (86), C4H7O2 (87), C5H9O2 (101), C6H11O2 (115), C6H12O2 (116).

In Figure 8.16, we show all fragments resulting from methyl n-pentanoate using the above reaction schemes. The structures are shown in increasing order of mass. The mass is written on the right in each headline. No fragments are present for the masses 27, 28, 39, 41 or 55.

Fig. 8.17. Comparison of experimental mass spectrum and explained intensities. Panels from top to bottom: experimental spectrum; explained part of the experimental spectrum; difference between experimental spectrum and explained part. Horizontal axis: m/z.

Fig. 8.18. Ranking of C6H12O2 isomers by match to spectrum from Example 8.2. Match values at positions 1 to 24: 0.744806, 0.744228, 0.721902, 0.721902, 0.703889, 0.703605, 0.703605, 0.702666, 0.702642, 0.695252, 0.686349, 0.636257, 0.60556, 0.605465, 0.605373, 0.605298, 0.605019, 0.60495, 0.604914, 0.59458, 0.594556, 0.583546, 0.583215, 0.579914.

The next step in calculation of match values is the determination of fragment molecular formulas. Table 8.8 lists all molecular formulas $\beta_i$ whose mass of highest intensity $\tilde m_{\beta_i}$ coincides with a peak in the spectrum, i.e. $I(\tilde m_{\beta_i}) > 0$. The solution $x_i$ of the optimization problem is given in the $x_i$ columns. We obtain a match value MV(𝐼, 𝑀) = 0.6052978. These values may be used to plot the fraction of explained intensity in spectral form, by calculating $I' = \sum_i x_i I_{\beta_i}$. Figure 8.17 shows the experimental spectrum 𝐼 at the top, the explained intensity $I'$ in the middle and the absolute difference $|I - I'|$ at the bottom.

Next, we examine whether the match value is able to rank structure candidates according to their relevance for the experimental spectrum. We generate all constitutional isomers of molecular formula C6H12O2, 1313 candidates in total. If these are ordered by decreasing match values, the correct candidate methyl n-pentanoate is at position 16. Figure 8.18 shows the 24 highest-ranked structure candidates and their match values. The top 13 positions are occupied by cyclic structures, although the ratio of cyclic and acyclic structures for C6H12O2 is rather balanced (641 acyclic, 672 cyclic structures). If the acyclic nature could be determined somehow from the spectrum, then the correct candidate would be at position 2. In Section 8.5 we will try to find criteria for these structural properties empirically.
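The match value rests on a non-negative least squares fit of fragment patterns to the measured spectrum. The following is a minimal sketch of that calculation in pure Python, using coordinate descent as the solver. The solver choice, the function names and the toy patterns are assumptions made for illustration; the text does not say which optimization method the authors used, and real patterns $I_{\beta_i}$ would be the isotope distributions of the fragment formulas.

```python
from math import sqrt


def nnls_coordinate_descent(patterns, spectrum, masses, sweeps=500):
    """Minimise sum_m (I(m) - sum_i x_i * I_beta_i(m))^2 subject to x >= 0."""
    x = [0.0] * len(patterns)
    # residual(m) = I(m) - sum_i x_i * I_beta_i(m); equals I(m) while x = 0
    residual = {m: spectrum.get(m, 0.0) for m in masses}
    for _ in range(sweeps):
        for i, pattern in enumerate(patterns):
            norm = sum(v * v for v in pattern.values())
            if norm == 0.0:
                continue
            # exact minimiser along coordinate i, clipped at zero
            step = sum(v * residual[m] for m, v in pattern.items()) / norm
            new_xi = max(0.0, x[i] + step)
            delta = new_xi - x[i]
            if delta != 0.0:
                for m, v in pattern.items():
                    residual[m] -= delta * v
                x[i] = new_xi
    return x, residual


def match_value(patterns, spectrum):
    """Return (MV, x): MV = 1 - sqrt(unexplained squared intensity / total)."""
    masses = set(spectrum)
    for pattern in patterns:
        masses |= set(pattern)
    x, residual = nnls_coordinate_descent(patterns, spectrum, masses)
    total = sum(v * v for v in spectrum.values())
    unexplained = sum(r * r for r in residual.values())
    return 1.0 - sqrt(unexplained / total), x


if __name__ == "__main__":
    # Hypothetical spectrum and two hypothetical fragment patterns.
    toy_spectrum = {74: 100.0, 75: 4.0, 43: 50.0, 41: 30.0}
    toy_patterns = [{74: 1.0, 75: 0.04}, {43: 1.0, 44: 0.03}]
    mv, coefficients = match_value(toy_patterns, toy_spectrum)
    print(f"MV = {mv:.4f}, x = {coefficients}")
```

The residual returned by the solver is the difference $I - I'$, so the explained spectrum can be obtained by subtracting it from the measured intensities. In the toy data the peak at mass 41 has no predicted fragment and therefore lowers the match value.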

Fig. 8.19. Histogram of match values of C6H12O2 constitutional isomers. Axes: match value (horizontal), frequency (vertical).

Fig. 8.20. Distribution of match values of C6H12O2 constitutional isomers. Axes: structure candidate (horizontal), match value (vertical). The correct candidate is marked separately from the false candidates.

Figures 8.19 and 8.20 show a histogram and match value distribution for all constitutional isomers of C6H12O2. We see that for this example our match value is able to exclude a large fraction of candidates effectively. Candidates could be selected according to the match value distribution, e.g. by considering all isomers with a match value below 0.3 irrelevant. However, there is more information in the MS, and in the following we describe another strategy that draws on experience gained from a database of elucidated spectra.

Selection of relevant structure candidates

As for molecular formulas, we calculate the match values for structures $M_i$ from a sample of 𝑛 = 1000 randomly selected spectra $I_i$. Figures 8.21 and 8.22 show a histogram and the distribution of match values $x_i = \mathrm{MV}(I_i, M_i)$. As expected, the match values are considerably smaller than for molecular formulas. There are a few possible explanations for this. Some spectra are dominated by one or a few very intense peaks that are not explained using the standard fragmentation rules. If no fragment is predicted for a very intense peak, the match value is very low. Alternatively, the database spectra may be of low quality or the structures given may be incorrect. As earlier, we determined the 𝑝-quantiles of $(x_i)_{i \in n}$ for 𝑝 = 1/100, 2/100, . . . , 99/100, shown in Table 8.9.

Fig. 8.21. Histogram of match values of correct candidates for a sample of 1000 mass spectra. Axes: match value (horizontal), frequency and relative frequency (vertical).

Fig. 8.22. Distribution of match values of correct candidates for a sample of 1000 mass spectra. Axes: pair of spectrum and structural formula (horizontal), match value and quantile (vertical).

Table 8.9. Quantiles 𝑞𝑝 for structure match values for various probabilities 𝑝.

𝑝      𝑞𝑝          𝑝      𝑞𝑝          𝑝      𝑞𝑝
0.01   0.0045723   0.10   0.0777678   0.91   0.8142182
0.02   0.0142139   0.20   0.1777486   0.92   0.8224115
0.03   0.0182292   0.30   0.2680536   0.93   0.8381855
0.04   0.0288057   0.40   0.3435098   0.94   0.8462994
0.05   0.0348364   0.50   0.4405285   0.95   0.8664180
0.06   0.0464190   0.60   0.5335016   0.96   0.8845612
0.07   0.0545087   0.70   0.6163113   0.97   0.8975282
0.08   0.0615875   0.80   0.7099822   0.98   0.9104603
0.09   0.0679465   0.90   0.8073853   0.99   0.9326578

8.17 Example Here, we test the procedure for ranking and candidate selection on a larger data set of 100 randomly selected mass spectra, where the corresponding molecular mass is at most 200 and where no more than 10,000 constitutional isomers are possible for the molecular formula. We generate all constitutional isomers, calculate their match values, and rank them. The following table and Figure 8.23 summarize the results for RRP:

Min.      1. Quart.   Median    Mean      3. Quart.   Max.
0.00000   0.07438     0.19210   0.29910   0.50000     1.00000

Fig. 8.23. Histogram of RRP for structural formulas of 100 mass spectra. Axes: relative ranking position (horizontal), frequency (vertical).

Fig. 8.24. Ranking position of the correct candidate and number of structure candidates at probability 0.9. Axes: number of candidates at reliability 0.9 (horizontal), ranking position (vertical), both on logarithmic scales from 1 to 10,000.

The number of selected candidates at probability 0.9 is shown in Figure 8.24 along with the ranking position of the correct candidate. Points above the diagonal represent cases in which the correct candidate is not selected. For other probabilities 𝑝, the following table shows the number of cases in which the correct candidate is in the selected set of candidates.

𝑝 = 0.99   𝑝 = 0.95   𝑝 = 0.90   𝑝 = 0.75   𝑝 = 0.50
99         96         91         75         54

In this example, we considered only cases for which the number of possible isomers is at most 10,000. Such cases are rather exceptional (see Appendix D) and even for small molecular masses there are molecular formulas with considerably more isomers. Molecular formulas with several billion isomers exist already for molecular masses of 200. Even extremely efficient structure generating algorithms are unable to generate all isomers in reasonable time for such cases, let alone store the results. Unfortunately, a molecular mass of 200 is towards the lower limit of typical analytes for MS (see Figures 8.6 and 8.7). Therefore it is extremely important to restrict a structure space prior to structure generation. Section 8.5 is dedicated to this problem.

8.18 Remark (A better relative ranking position) When performing the calculations above, we became aware that our definition of $RRP_0$ overestimates the success of our ranking efforts in certain cases. Thus, in [155] we redefined the RRP as follows: Let 𝐵𝐶 denote the number of better candidates, i.e. candidates with a higher MV than the true candidate, 𝑊𝐶 the number of worse candidates, and 𝑇𝐶 the (total) number of candidates. There are two possibilities to define a relative ranking position:

$$RRP_0 = \frac{BC}{TC - 1} \quad\text{and}\quad RRP_1 = 1 - \frac{WC}{TC - 1}.$$

Of course $RRP_0$ and $RRP_1$ are defined only if at least two candidates exist. Note that this definition of $RRP_0$ still equals the one of Example 8.15. In the case of false candidates with the same MV as the true structure, $RRP_0$ and $RRP_1$ will differ. In order to take such situations into account, we finally define the relative ranking position as the mean of $RRP_0$ and $RRP_1$:

$$RRP = \frac{1}{2}\Big(1 + \frac{BC - WC}{TC - 1}\Big).$$

For instance, if all candidates have the same MV, then $RRP_0 = 0$, $RRP_1 = 1$, and $RRP = 0.5$. However, calculations of [155] showed that there are no substantial deviations in the overall results for large data sets.
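The three ranking measures of this remark can be computed directly from a list of match values. The sketch below is an illustration only: the function name and the convention of passing the index of the true candidate are assumptions, not part of the cited work.

```python
def relative_ranking_positions(match_values, true_index):
    """Return (RRP0, RRP1, RRP) for the candidate at `true_index`."""
    true_mv = match_values[true_index]
    others = [mv for i, mv in enumerate(match_values) if i != true_index]
    if not others:
        raise ValueError("RRP is defined only for at least two candidates")
    bc = sum(mv > true_mv for mv in others)   # better candidates
    wc = sum(mv < true_mv for mv in others)   # worse candidates
    tc = len(match_values)                    # total number of candidates
    rrp0 = bc / (tc - 1)
    rrp1 = 1 - wc / (tc - 1)
    rrp = 0.5 * (1 + (bc - wc) / (tc - 1))    # mean of RRP0 and RRP1
    return rrp0, rrp1, rrp


if __name__ == "__main__":
    # All candidates tie: RRP0 is optimistic, RRP1 pessimistic, RRP in between.
    print(relative_ranking_positions([0.5, 0.5, 0.5], 0))
```

Ties with the true candidate count neither as better nor as worse, which is why the two one-sided measures diverge in that case.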

8.5 Classification of mass spectra

For the statistical considerations in the previous section we restricted ourselves to structure spaces of no more than 10,000 constitutions for a given molecular formula. In practical applications, however, such cases will be the exception rather than the rule (see Appendix D). Thus, it should be possible to determine structural properties (SP) of the analyte prior to structure generation, so that these can be used to restrict the number of generated structures. MS classifiers provide an opportunity to extract information on present or absent SP from mass spectra. An MS classifier $\Phi_{SP}$ for the binary structural property 𝑆𝑃 is a mapping

$$\Phi_{SP} : \mathcal{I} \to \mathbb{B} : I \mapsto b,$$

attributing a class 𝑏 ∈ 𝔹 to a mass spectrum 𝐼. As a rule, 𝑆𝑃 is defined by a molecular substructure and is true if a molecular graph contains this substructure, and false otherwise. More generally, 𝑆𝑃 may be any binary molecular descriptor.

Figure 8.25 shows the procedure to calculate and apply an MS classifier. Note that this principle is not restricted to mass spectrometry. For example, in [232] the construction of IR classifiers is described using the same scheme. A prerequisite for construction of a spectrum classifier is a database of elucidated spectra containing a sufficient number of structures with and without property 𝑆𝑃. The presence of 𝑆𝑃 is the target variable for a statistical learning program for classification.

Fig. 8.25. Workflow for prediction of structural properties by spectrum classification. Training: the structural formulas of elucidated spectra yield a binary molecular descriptor (the target variable), and the spectra yield spectral descriptors (the predictors); statistical learning (classification) produces a predicting function. Application: spectral descriptors of an unknown spectrum serve as predictors, and applying the predicting function gives a prediction for the binary molecular descriptor.

In the case of mass spectra, it would be tempting to use peak intensities as predictors. However, intensities themselves are not linked strongly to structural properties. Instead, MS descriptors are more appropriate to model MS–structure relationships. Classification yields a prediction for the spectrum of an unknown that determines whether 𝑆𝑃 is to be considered as prescribed or forbidden in the further course of structure elucidation.

8.5.1 MS descriptors

Analogous to the construction of quantitative structure–property relationships, mass spectra are mapped onto real numbers by MS descriptors. The values obtained allow us to find relationships between mass spectra and structural properties.

8.19 Definition (MS descriptor) An MS descriptor is a mapping $D : I \mapsto D(I)$ that associates a real number 𝐷(𝐼) with a spectrum 𝐼.

In the literature [324, 335], such a mapping is also called an MS feature or MS invariant. For consistency of nomenclature, we use the term MS descriptor here.

8.20 Example (MS descriptors) We introduce a few important MS descriptors in the following examples.

– Ion series descriptors sum intensities with a mass difference of 14 units, the mass of a CH2 group. For 1 ≤ 𝑟 ≤ 14 the modulo 14 descriptors are defined by

$$\mathrm{MD14}_r(I) = \sum_i I'(14i + r),$$

where only masses above 38 are considered [335]:

$$I'(m) = \begin{cases} I(m) & \text{if } m \ge 39, \\ 0 & \text{else.} \end{cases}$$

– Autocorrelation descriptors describe mass differences in a spectrum. They are defined as follows for mass differences 𝑑 > 0:

$$\mathrm{AUCO}_d(I) = \frac{\sum_m I(m) I(m+d)}{\sum_m I(m)^2}$$

and, restricted to the lower ($\mathrm{ACLH}_d(I)$) or upper ($\mathrm{ACUH}_d(I)$) half of the spectrum,

$$\mathrm{ACLH}_d(I) = \frac{\sum_{m \le \hat m/2} I(m) I(m+d)}{\sum_m I(m)^2}, \qquad \mathrm{ACUH}_d(I) = \frac{\sum_{m \ge \hat m/2} I(m) I(m+d)}{\sum_m I(m)^2}.$$

– Logarithmic intensity ratios are calculated for pairs of masses of mass difference 𝑑 > 0:

$$\mathrm{LIQN}_{m,d}(I) = \ln \frac{I'(m)}{I'(m+d)},$$

where intensities smaller than 0.01 are augmented:

$$I'(m) = \begin{cases} I(m) & \text{if } I(m) \ge 0.01, \\ 0.01 & \text{otherwise.} \end{cases}$$

– Spectra type descriptors describe the shape of an MS, for example peak distribution or symmetry. The centroid of a mass spectrum is defined as CENT(𝐼) =

105 ∑ 𝑚𝐼(𝑚). 𝑚̂ 𝑚

The spectrum’s symmetry with respect to mass 𝑚 is measured by the symmetry function

$$\mathrm{sym}_I(m) = \sum_{d=0}^{\hat m - m} I(m - d) I(m + d).$$

The smallest mass at which the symmetry function has its maximum value is used in

$$\mathrm{SYMX}(I) = \frac{1}{\hat m} \min\{m' \mid \forall m : \mathrm{sym}_I(m') \ge \mathrm{sym}_I(m)\}.$$

Further MS descriptors are defined using the base peak:

$$\mathrm{MBAS}(I) = 100 \cdot \frac{\tilde m}{\hat m}, \qquad \mathrm{BASE}(I) = 100 \cdot \frac{I(\tilde m)}{\sum_m I(m)}.$$

The proportion of small fragments is described by

$$\mathrm{DUST}(I) = 100 \cdot \frac{\sum_{m=1}^{78} I(m)}{\sum_{m=1}^{\hat m} I(m)},$$

the proportion of even-mass peaks by

$$\mathrm{EVEN}(I) = 100 \cdot \frac{\sum_i I(2i)}{\sum_m I(m)}.$$

Another descriptor, PN10, gives the number of important peaks. These are peaks of intensity higher than 10% of the base peak intensity. If such a peak is found at mass 𝑚, then peaks of 𝑚 + 1 and 𝑚 + 2 are not counted, since these are probably isotope peaks.

A plethora of other descriptors was tested during the development of the software MSclass [324] for the classification of mass spectra. MSclass finally contained 160 classifiers using 32 descriptors for 431 combinations of parameters in total. We will compare several methods of classification for structural properties in the next two sections, including those treated earlier as well as new ones.
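Several of the descriptors defined in Example 8.20 can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the MSclass implementation: a spectrum is represented as a dictionary from integer mass to intensity, all peaks are assumed to lie between mass 1 and the upper mass limit, and PN10 is read as a scan from low to high mass that skips the two masses following each counted peak.

```python
from math import log


def md14(spectrum, r):
    """Ion series descriptor: sum of I(m) over m >= 39 with m = 14*i + r."""
    return sum(i for m, i in spectrum.items() if m >= 39 and m % 14 == r % 14)


def auco(spectrum, d):
    """Autocorrelation descriptor for mass difference d > 0."""
    numerator = sum(i * spectrum.get(m + d, 0.0) for m, i in spectrum.items())
    return numerator / sum(i * i for i in spectrum.values())


def liqn(spectrum, m, d, floor=0.01):
    """Logarithmic intensity ratio; intensities below `floor` are raised to it."""
    def raised(mass):
        return max(spectrum.get(mass, 0.0), floor)
    return log(raised(m) / raised(m + d))


def base(spectrum):
    """Base peak intensity as a percentage of the total intensity."""
    return 100.0 * max(spectrum.values()) / sum(spectrum.values())


def dust(spectrum):
    """Percentage of intensity at masses 1..78 (assumes no peaks above the mass limit)."""
    small = sum(i for m, i in spectrum.items() if 1 <= m <= 78)
    return 100.0 * small / sum(spectrum.values())


def even(spectrum):
    """Percentage of intensity at even masses."""
    return 100.0 * sum(i for m, i in spectrum.items() if m % 2 == 0) / sum(spectrum.values())


def pn10(spectrum):
    """Number of peaks above 10% of the base peak, skipping m+1 and m+2 after a hit."""
    threshold = 0.1 * max(spectrum.values())
    count, skip_until = 0, 0
    for m in sorted(spectrum):
        if m <= skip_until:
            continue
        if spectrum[m] > threshold:
            count += 1
            skip_until = m + 2
    return count


if __name__ == "__main__":
    toy = {41: 30.0, 43: 50.0, 57: 20.0, 74: 100.0, 75: 4.0, 85: 10.0}  # hypothetical
    print(md14(toy, 1), round(auco(toy, 14), 4), round(liqn(toy, 74, 1), 4), pn10(toy))
```

Descriptors that depend on the upper mass limit in a non-trivial way (ACLH, ACUH, SYMX, MBAS) are left out here because that limit is defined elsewhere in the book.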

8.5.2 MS classifiers

A library of mass spectra with the corresponding compounds represented as molecular graphs is needed for the development of MS classifiers. To construct an MS classifier $\Phi_{SP}$ for structural property 𝑆𝑃, $m_T$ spectrum–structure pairs with this property are selected along with $m_F$ pairs without this property. The starting point is thus a set of tuples

$$(I_i, y_i) \in \mathcal{I} \times \mathbb{B}, \qquad i \in m = m_T + m_F,$$

where $y_i = \mathit{true}$ if 𝑆𝑃 is found in the structure belonging to $I_i$, and $y_i = \mathit{false}$ otherwise. A function $\Phi_{SP} : \mathcal{I} \to \mathbb{B}$ is sought that describes our MS–structure relationship mathematically. The determination of $\Phi_{SP}$ is described at the beginning of Section 8.5 and in Chapter 6. Typically, $\Phi_{SP}$ consists of several successive mappings:

– First, mass spectra are mapped onto real numbers by MS descriptors $\mathbf{D} = (D_i)_{i \in n}$:

$$\mathbf{D} : \mathcal{I} \to \mathbb{R}^n : I \mapsto (D_i(I))_{i \in n}.$$

– Where necessary or helpful for training the predicting function, the descriptor values are then transformed:

$$\tau = (\tau_i)_{i \in n} : \mathbb{R}^n \to \mathbb{R}^n.$$

– Finally, the predicting function $f : \mathbb{R}^n \to \mathbb{B}$, obtained by a statistical learning method, is applied.

In summary, an MS classifier can be written as the composition $\Phi_{SP} = f \circ \tau \circ \mathbf{D}$. In a previous approach [335], LDA, KNN, ANN and soft independent modeling of class analogy (SIMCA) were tested and compared, and ANN and LDA proved to be preferable. In the following we shall calculate classifiers via CART and LDA, and then compare them with those obtained by SVM and ANN.
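The composition of descriptor mapping, transformation and predicting function can be written as a small higher-order function. This is a sketch only: `make_classifier`, the single toy descriptor and the threshold rule are hypothetical and stand in for descriptors and a predicting function obtained by statistical learning.

```python
def make_classifier(descriptors, transforms, predict):
    """Build Phi = f o tau o D from its three parts.

    descriptors: functions spectrum -> float (the mapping D)
    transforms:  functions float -> float, one per descriptor (tau)
    predict:     function list[float] -> bool (the predicting function f)
    """
    def classifier(spectrum):
        values = [d(spectrum) for d in descriptors]
        transformed = [t(v) for t, v in zip(transforms, values)]
        return predict(transformed)
    return classifier


def share_at_74(spectrum):
    """Hypothetical descriptor: percentage of total intensity at mass 74."""
    return 100.0 * spectrum.get(74, 0.0) / sum(spectrum.values())


# Hypothetical predicting function: a single threshold on the first descriptor.
toy_classifier = make_classifier(
    descriptors=[share_at_74],
    transforms=[lambda v: v],          # identity transformation
    predict=lambda v: v[0] > 20.0,
)

if __name__ == "__main__":
    print(toy_classifier({74: 100.0, 43: 50.0}), toy_classifier({43: 50.0, 57: 20.0}))
```

Keeping the three stages separate allows the same descriptor values to be reused when different learning methods are compared.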

Classification using decision trees

A base set of 86,052 spectrum–structure pairs from the NIST MS library (Subsection 8.3.4) was scanned for several structural properties contained in Appendix B. For a total of 77 properties there were at least 300 structures with and at least another 300 structures without the given property. Disjoint learning and test sets were selected randomly, each comprising 150 structures with and 150 without the property. For each spectrum selected in this manner, 445 MS descriptors were calculated according to [335]:

– $\mathrm{MD14}_r$, 𝑟 = 1, . . . , 14,
– $\mathrm{AUCO}_d$, $\mathrm{ACLH}_d$, $\mathrm{ACUH}_d$, 𝑑 = 1, . . . , 50,
– $\mathrm{LIQN}_{m,d}$, 𝑚 = 39, . . . , 175, 𝑑 = 1, 2,
– CENT, SYMX, MBAS, BASE, DUST, EVEN, PN10.
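The count of 445 descriptors follows from the parameter ranges in the list above: 14 + 3 · 50 + 137 · 2 + 7 = 445. A short enumeration confirms this; the naming scheme of the strings is an illustrative assumption.

```python
def descriptor_names():
    """Enumerate the descriptor set listed above; the name format is illustrative."""
    names = [f"MD14_{r}" for r in range(1, 15)]                       # 14
    for d in range(1, 51):                                            # 3 * 50
        names += [f"AUCO_{d}", f"ACLH_{d}", f"ACUH_{d}"]
    names += [f"LIQN_{m},{d}" for m in range(39, 176) for d in (1, 2)]  # 137 * 2
    names += ["CENT", "SYMX", "MBAS", "BASE", "DUST", "EVEN", "PN10"]   # 7
    return names


if __name__ == "__main__":
    print(len(descriptor_names()))
```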


These are the potential predictors for our classification method. The target variable is the class membership: true for 𝑆𝑃 present, false otherwise.

8.21 Example (Classification tree for methyl ester) Initially, we construct a classification tree to recognize the substructure of methyl ester (see Appendix B.5). We use the standard parameters from the statistics language R interface (mincut = 5, minsize = 10, mindev = 0.01). The resulting classification tree (Figure 8.26) uses 15 descriptors:

$X_0 = \mathrm{MD14}_1$,        $X_1 = \mathrm{ACLH}_3$,        $X_2 = \mathrm{ACUH}_3$,
$X_3 = \mathrm{ACLH}_{26}$,     $X_4 = \mathrm{AUCO}_{29}$,     $X_5 = \mathrm{AUCO}_{31}$,
$X_6 = \mathrm{AUCO}_{32}$,     $X_7 = \mathrm{ACUH}_{32}$,     $X_8 = \mathrm{ACLH}_{39}$,
$X_9 = \mathrm{ACUH}_{46}$,     $X_{10} = \mathrm{LIQN}_{51,1}$,  $X_{11} = \mathrm{LIQN}_{58,1}$,
$X_{12} = \mathrm{LIQN}_{59,2}$,  $X_{13} = \mathrm{LIQN}_{74,2}$,  $X_{14} = \mathrm{LIQN}_{99,1}$.

The tree consists of 16 internal and 17 terminal nodes. Nodes are numbered according to Subsection 6.2.4. Table 8.10 shows the association of internal nodes $V_i$ with decision rules $X_{j_i} < a_i$, as well as the values returned by terminal nodes. Additionally, columns $m_i$, $m_{Ti}$, $m_{Fi}$ give the total number of observations processed at a node, and the number of observations classified into true and false classes, respectively. The misclassification rate for the learning set can be taken from this table: For type I error (true observations misclassified as false) we find 1 + 1 + 2 + 1 + 2 + 1 + 3 = 11 observations, for type II error 1 + 3 = 4 observations. The misclassification rate thus is $MCE_{LS} = \frac{11}{300} + \frac{4}{300} = 0.05$. As is to be expected, the misclassification rate for the test set, $MCE_{TS} = \frac{35}{300} + \frac{38}{300} = \frac{77}{300} = 0.25667$, is significantly higher.

V0: X11 [...] 0.5 (i.e. on par with randomly generated match values for all candidates);
– The average RRP was worse for spectra with few possible structures;
– Including library spectra for Mass Frontier improved the match values, but had a slight adverse effect on the RRP, while the calculation time increased dramatically (e.g. from minutes to hours);
– The best RRP (Mass Frontier 3 step) is only slightly better than that for MOLGEN–MS.

Table 9.1. Average and correct match values and relative ranking positions for different in silico fragmenters averaged over the 100 (or 27) spectra from Subsection 8.4.2.

Fragmenter                     # Spectra   Avg. MV   Correct MV   RRP
Mass Frontier 3 step           100         0.273     0.462        0.269
Mass Frontier 5 step           100         0.396     0.558        0.353
MetFrag Tree Depth 2           100         0.401     0.496        0.412
MetFrag Tree Depth 3           100         0.739     0.748        0.507
MOLGEN–MS                      100         0.246     0.431        0.273
ACD 3 step                     27          0.767     0.813        0.520
ACD 5 step                     27          0.808     0.833        0.535
Mass Frontier 3 step library   27          0.443     0.511        0.389
Mass Frontier 5 step library   27          0.531     0.616        0.382

Both ACD MS Fragmenter and MetFrag also have alternative scoring systems. The ACD Assignment Quality Index (AQI) resulted in slightly higher scores than the match value for the ACD results, but little change to the RRP. The MetFrag score weights the fragments by mass and intensity, as well as accounting for bond dissociation energy. While the ‘score’ was often lower than the match value, the RRP of the score was also lower than the MetFrag results using match values, indicating that the score had a better ranking power than the match value alone (e.g. for MetFrag Tree Depth 2, the average RRP is 0.359 using the score, compared with 0.412 with match values). Full details are in [284].

Despite seeing a clear relationship between the number of fragments predicted for a structure and the resulting match value, we were unable to successfully deduce a new match value that improved the ranking of candidates by incorporating this information into the match value itself. Furthermore, we stress that these results were calculated without any spectral ‘interpretation’, e.g. using the classifiers introduced in Section 8.5 and implemented in MOLGEN–MS. These results clearly showed that none of the available in silico fragmenters evaluated outperforms MOLGEN–MS dramatically in terms of candidate ranking, although many have much better match values due to the much larger number of fragments predicted. Thus, at this stage we are not able to dramatically improve the ranking of candidate structures using the mass spectral information alone and instead have to complement this ranking with other information available from the analytical investigation.

9.2.2 Retention properties

Retention indices (RI) have been used for many years in GC–MS to confirm matching database spectra by standardising the chromatographic retention times with a set of measured reference compounds. Two main indices exist, the Kováts RI (KRI) and the Lee RI (LRI), with the former based on C6–C36 alkanes and the latter on two- to five-ring polycyclic aromatic hydrocarbons (PAHs). The generic equation (see e.g. [258]) is as follows:

$$RI_x = 100 \cdot \Big(n + \frac{T_x - T_n}{T_{n+1} - T_n}\Big)$$

where $RI_x$ is the retention index of compound 𝑥 with retention time $T_x$, the retention times $T_n$ and $T_{n+1}$ are selected to bracket $T_x$, and 𝑛 refers to the number of C atoms for KRI or the number of PAH rings for LRI.

Experimental matches of KRI (with the same analytical conditions) are generally quite good, e.g. error margins of ±3.4 were seen in [287]. However, predictions of KRI incur much larger errors, up to ±382 with the group contribution theory implemented within the NIST database [300]. If we take the worst case of ±382 and add the experimental error, we come up with [284]:

$$KRI_{range} = KRI \pm 385.4$$

Similarly for LRI, the error margins for experimental data with the same analytical conditions are very good (e.g. ±0.53 [287]). Eckel and Kind [62] found that correlating the LRI with boiling point (𝐵𝑃) could be used to eliminate compounds outside the range (𝐿𝑅𝐼 − 10) to (𝐿𝑅𝐼 + 50) with 95% likelihood. As 𝐵𝑃 predictions are more widely applicable than RI calculations, this can be converted into an equation to eliminate structural candidates outside a given 𝐵𝑃 range where the appropriate LRI standards have been measured. If for example EPI Suite(TM) [311] is used to calculate the 𝐵𝑃, the equation in [62] can be modified to include the error margins in the LRI and the predicted 𝐵𝑃 [287]:

$$(LRI - 31) \le BP\ (^\circ\mathrm{C}) \le (LRI + 71)$$

Although this calculation results in a very large range, we will see in Subsection 9.2.5 that this can still be useful for eliminating candidates in CASE investigations.
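The retention index formula and the two elimination windows above translate into a few lines of code. The sketch below is illustrative: the function names, the representation of standards as (𝑛, retention time) pairs and the example retention times are assumptions.

```python
def retention_index(t_x, standards):
    """RI_x = 100 * (n + (T_x - T_n) / (T_{n+1} - T_n)).

    standards: (n, retention time) pairs, where n is the number of C atoms
    (KRI) or of PAH rings (LRI) of each reference compound.
    """
    ordered = sorted(standards, key=lambda s: s[1])
    for (n, t_n), (n_next, t_next) in zip(ordered, ordered[1:]):
        if t_n <= t_x <= t_next:
            if n_next != n + 1:
                raise ValueError("bracketing standards must differ by one unit")
            return 100.0 * (n + (t_x - t_n) / (t_next - t_n))
    raise ValueError("retention time is not bracketed by the standards")


def kri_window(kri, margin=385.4):
    """Acceptance window for a predicted KRI, using the worst-case margin."""
    return kri - margin, kri + margin


def bp_consistent_with_lri(bp_celsius, lri):
    """Check (LRI - 31) <= BP (deg C) <= (LRI + 71)."""
    return lri - 31 <= bp_celsius <= lri + 71


if __name__ == "__main__":
    alkanes = [(10, 8.0), (11, 10.0), (12, 12.5)]   # hypothetical retention times
    print(retention_index(9.0, alkanes), bp_consistent_with_lri(300.0, 290.0))
```

A candidate whose predicted boiling point falls outside the window would be eliminated in the filtering step described in Subsection 9.2.5.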

9.2.3 Partitioning properties

Partition coefficients are an alternative to retention indices to eliminate structures that do not fit the measured properties. Retention times in high performance liquid chromatography (HPLC) with certain columns can be related to compound partitioning, commonly represented by the logarithm of the octanol–water partition coefficient, $\log K_{OW}$. Although this does not hold for GC systems, EDA investigations often involve fractionation with reversed-phase HPLC (RP–HPLC) prior to analysis with GC. If the chromatographic stationary phase contains long hydrocarbon chains, such as C18, a relationship can be established with standard compounds using the following:

$$\log K_{OW} = A + B \log k', \qquad k' = \frac{t_R - t_0}{t_0},$$

where $t_R$ is the retention time of a compound and $t_0$ the retention time of an unretained compound (e.g. thiourea). The parameters 𝐴 and 𝐵 are determined by linear regression with several standard compounds. Once these are determined, the $\log K_{OW}$ range can be calculated for each eluted fraction in EDA. Thus, when a given HPLC fraction is measured with GC–MS, this range can be used to eliminate candidates with very different partitioning behavior. Similar to the 𝐵𝑃 restrictions described above, $\log K_{OW}$ can be calculated with EPI Suite(TM) [311].

The RP–HPLC – $\log K_{OW}$ relationship holds for $\log K_{OW}$ between 0 and 6 [230]. The prediction accuracy of EPI Suite’s Kowwin is reported such that 96.5% of all predicted values fall within ±1 of the experimental values used to set up the prediction [189]. Thus, the $\log K_{OW}$ range of the fraction, extended by ±1, is a good ‘rule of thumb’ to use for eliminating structures from consideration in EDA studies, or any other investigation using a C18 column. The deviations can be greater than ±1 log unit for compounds with many functional groups or for those outside the range [0, 6], and in these cases the range for elimination should be extended. However, this often results in such a large range of values that the predicted $\log K_{OW}$ values are no longer a great help to eliminate structural candidates.
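The calibration described above is an ordinary least squares fit of $\log K_{OW}$ against $\log k'$. The following sketch shows the fit and the resulting elimination window for one HPLC fraction. It assumes a base-10 logarithm (the text writes only "log"), and the function names and calibration data are hypothetical.

```python
from math import log10


def capacity_factor(t_r, t_0):
    """k' = (t_R - t_0) / t_0."""
    return (t_r - t_0) / t_0


def fit_calibration(standards, t_0):
    """Least squares fit of log Kow = A + B * log k'.

    standards: (retention time, known log Kow) pairs of the standard compounds.
    """
    xs = [log10(capacity_factor(t_r, t_0)) for t_r, _ in standards]
    ys = [log_kow for _, log_kow in standards]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


def fraction_window(t_start, t_end, a, b, t_0, margin=1.0):
    """log Kow range of a fraction eluting from t_start to t_end, widened by +/- margin."""
    low = a + b * log10(capacity_factor(t_start, t_0))
    high = a + b * log10(capacity_factor(t_end, t_0))
    return low - margin, high + margin


if __name__ == "__main__":
    calibration = [(2.0, 1.0), (11.0, 3.0), (101.0, 5.0)]   # hypothetical standards
    a, b = fit_calibration(calibration, t_0=1.0)
    print(a, b, fraction_window(2.0, 11.0, a, b, t_0=1.0))
```

Candidates whose predicted $\log K_{OW}$ lies outside the returned window would be eliminated for that fraction.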

9.2.4 Steric energy

Structure generation can sometimes result in a large number of mathematically possible but chemically implausible structures. Although these structures can usually be excluded via substructural restrictions (goodlist, badlist), this becomes increasingly difficult for larger structures with many rings or unsaturated bonds. A few such structures which arise from CASE with substructure information are shown in Figure 9.4. Although the ‘less likely’ structures shown in Figure 9.4 can be eliminated with ring-size restrictions (e.g. allowing only 5-membered rings and above), such restrictions can eliminate real molecules that can occur in the environment, for example those containing cyclopropane or epoxide in the structure. In [284, 287], the use of a steric energy restriction during candidate selection for CASE was investigated.

Fig. 9.4. A chemically realistic molecule (left) and some highly strained isomers generated with the same settings.

First, the steric energies for the 1000 molecules used in Subsection 8.4.1 were calculated using MOLGEN–QSPR. An upper limit of 429.0 kcal/mol [287] was thus established, which included 90% of these molecules. Adding an additional 698 molecules (generally smaller) from the MMFF94 Validation Suite [42, 110] lowered the limit to 388.5 kcal/mol [284]. For most cases, the simple limit of 388.5 kcal/mol [284] derived from the 1698 molecule dataset will be sufficient to eliminate most highly strained structures. Similar limits can be determined for alternative programs quite easily as the dataset is available. However, there are some cases where this limit is still too high (i.e. overly strained compounds are still included) and a more comprehensive calculation using the difference between ‘existing’ and ‘generated’ molecules to establish lower limits is given in [284].

The use of steric energy, along with the other information for candidate elimination, is shown in Subsection 9.2.5.

9.2.5 Filtering candidates by calculated properties

In this subsection, we look at how the calculated properties introduced in the sections above can be used to filter or eliminate structure candidates. This strategy was explored in [287] and is summarized below. First the general strategy is presented, followed by a theoretical example using several compounds of the formula C12H10O2.

Filtering strategy

The number of calculated properties described in the previous sections calls for a logical approach if structure elucidation incorporating so many criteria is to be efficient and reproducible. One approach is systematic filtering. This is best split into two basic parts: structure generation (i.e. how to constrain the number of candidates using substructures and when to add additional constraints) and structure elimination (i.e. the best order in which to eliminate candidates using calculated properties).

The structure generation part involves constraining generation using the default classifier settings in MOLGEN–MS (i.e. the 95% probability) and applying the same probability limits to classifiers from the NIST database. If too many structures remain with these classifiers (we used > 10,000 as a benchmark), additional classifiers or restrictions should be added to reduce the number of structures. Such restrictions include the addition of substructure classifiers with a lower probability (< 95%) or the addition of bond and ring size restrictions. Experience showed that starting with a large number of structures made the selection of only one or a few possible candidates extremely unlikely. Nevertheless, it is not always possible to keep the candidate list below 10,000 structures, and although this may seem like a large number, the number of structures generated for e.g. C12H10O2 without restrictions is > 1,500,000,000! However, while bond and ring-size restrictions are useful for reducing candidate numbers for unsaturated compounds (often dramatically), it is not always advisable to use these restrictions when the information does not come from the mass spectrum itself, as the correct candidate could be eliminated using such measures. Some common environmental contaminants or transformation products (TP) are known to contain 3-membered rings, for instance cyclopropane rings (irgarol, ChemSpider ID 82701 [259]) and epoxides (carbamazepine-10,11-epoxide, ChemSpider ID 2458).

[Figure 9.5: flowchart. Recoverable panel labels: Structure Generation; Default Substructure Classifiers (95 %, MOLGEN–MS, NIST); Additional Classifiers or Restrictions ( ... 429.0 kcal/mol; ChemBio3D > 213.24 kcal/mol); Spectral Match Value.]

Fig. 9.5. Exclusion strategy to identify unknown compounds with MOLGEN–MS and calculated properties. Reprinted with permission from [287]. Copyright (2011) American Chemical Society.

Following generation of candidates using the classifier restrictions, candidates can then be eliminated or filtered using a systematic exclusion procedure. First, experimental information such as the BP/LRI index and the log 𝐾OW was used to exclude candidates outside the given ranges, followed by the steric energy to exclude energetically unlikely candidates. The spectral match value calculated with MOLGEN–MS was applied last, since otherwise candidates whose spectra were dominated by one fragment peak not predicted by MOLGEN–MS had very low match values and were excluded too early. The complete strategy is shown in Figure 9.5. In the next paragraphs we discuss how the parts, as well as the whole, contributed toward identification of the correct candidate.
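The order of the elimination steps described above can be sketched in a few lines of Python. This is only an illustration of the sequence (BP/LRI, then log 𝐾OW, then steric energy, then match value), not the implementation used in [287]; the `Candidate` class, the function `exclusion_filter`, the five toy candidates and all thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    boiling_point: float  # predicted BP in deg C (made-up value)
    log_kow: float        # predicted log Kow (made-up value)
    steric_energy: float  # steric energy in kcal/mol (made-up value)
    match_value: float    # spectral match value between 0 and 1 (made-up value)


def exclusion_filter(candidates, bp_range, log_kow_range, max_energy, min_match):
    """Apply the filters in the order given in the text and record
    how many candidates survive each stage."""
    stages = [
        ("BP/LRI", lambda c: bp_range[0] <= c.boiling_point <= bp_range[1]),
        ("log Kow", lambda c: log_kow_range[0] <= c.log_kow <= log_kow_range[1]),
        ("steric energy", lambda c: c.steric_energy <= max_energy),
        ("match value", lambda c: c.match_value >= min_match),  # applied last
    ]
    survivors = list(candidates)
    counts = [("generated", len(survivors))]
    for label, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        counts.append((label, len(survivors)))
    return survivors, counts


# Hypothetical candidates; each of B to E fails exactly one criterion.
candidates = [
    Candidate("A", 250.0, 2.9, 40.0, 0.80),
    Candidate("B", 410.0, 3.0, 35.0, 0.95),   # outside the BP range
    Candidate("C", 260.0, 5.5, 30.0, 0.90),   # outside the log Kow range
    Candidate("D", 240.0, 3.1, 500.0, 0.85),  # steric energy too high
    Candidate("E", 255.0, 3.2, 45.0, 0.10),   # match value too low
]
survivors, counts = exclusion_filter(
    candidates,
    bp_range=(200.0, 300.0),
    log_kow_range=(2.5, 3.5),
    max_energy=100.0,
    min_match=0.5,
)
for label, remaining in counts:
    print(f"{label:>14}: {remaining}")
```

Recording the count after every stage makes it easy to see which criterion removes the most candidates, and where a correct candidate would be lost if one prediction were wrong.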

Example with C12H10O2

Here, the use of the exclusion strategy is demonstrated using a theoretical example of several compounds with the same molecular formula. The formula C12H10O2 was chosen as there were several EI–MS spectra in the NIST [225] database that covered compounds with a wide range of properties, as well as isomers differing in substituent positions, which are very difficult to distinguish using MS information alone. In total, 29 spectra with the formula C12H10O2 were in the NIST database; the structures are given in Figure 9.6. The NIST spectra and CAS numbers are given in [283] and in the Supporting Information of [287] (which is available for download for free).

Fig. 9.6. The 29 isomers of C12H10O2 with NIST EI–MS spectra.

Of the 29 C12H10O2 compounds in the NIST database, 19 were available for purchase. These 19 were measured with a standard GC–MS experiment including the appropriate standards for calculating both the LRI and KRI, but in the end only 15 of the 19 compounds could be measured sufficiently well with that GC–MS program. The error margins for the experimentally determined indices were very small, as expected: ±3.4 (KRI) and ±0.53 (LRI) [287]. These error margins contrasted strongly with the error margins of the estimated KRI and LRI ranges presented in Subsection 9.2.2 and shown in Figure 9.7.

In Figure 9.7, the greyscale represents the estimated values and the associated errors from the NIST database (KRI) and the BP/LRI correlation (calculated starting from a predicted BP, i.e. with two error margins) for LRI. The measured values are represented by hollow diamonds for KRI and hollow circles for LRI, while the crosses represent literature values given in the NIST database. The black ranges for the 15 measured compounds in the LRI results show the estimated BP range for these compounds using the (𝐿𝑅𝐼 − 10) and (𝐿𝑅𝐼 + 50) rule from Eckel and Kind [62]. The 𝑦-axis to the left applies to the KRI values at the top of the figure, while the 𝑦-axis to the right applies to the LRI values shown in the lower part of the figure.

[Figure 9.7: plot of KRI (left 𝑦-axis, 0 to 2500) and LRI (right 𝑦-axis, 0 to 1000) against structure number (1 to 29).]

Fig. 9.7. Comparison of error margins for experimental and estimated KRI and LRI values. Details in the text. Data presented in [283, 287].

The experimental values could be used to distinguish most of the compounds from the other candidates, although some values were very close (e.g. compounds 21 and 28, with KRI = 1729.2 and 1732.1, respectively). As seen in the figure, the predicted KRI values in the NIST database are not particularly useful for separating the candidates. Despite the huge ranges in the predictions, Figure 9.7 shows that the measured KRI can fall outside the predicted KRI range, as seen for structure 5. From the perspective of candidate elimination, the ‘best case scenario’ was structure 3, where still only five candidates were excluded from consideration. The situation was very similar for LRI, and here the predicted and measured values also deviated quite dramatically, e.g. for structures 1 and 19, although the measurements still remained within the predicted range. Remarkably, despite the errors involved, these values were still of some use in candidate selection, as is shown below.

Before moving on any further, we first consider the effectiveness of the mass spectral classifiers in reducing the number of candidates for selection. Of the 29 compounds, 15 were already reduced to the final, small (< 35) number of candidates using 95% probability classifiers alone. No additional classifiers were necessary (or possible) to reduce this number further, nor did calculated properties separate these candidates further. This means that over half of these compounds required only the default substructure information to constrain structure generation adequately. In a further 4 cases, the reduction in candidates to their final number (again below 35) was achieved by using an additional classifier or other restrictions, and no further elimination occurred with calculated properties. One further compound experienced only a slight additional reduction in candidate structures as a result of calculated properties. Thus, 19 (or even 20) of the 29 compounds could be reduced to a reasonable number of candidates with the use of mass spectral classifiers alone. This result is quite remarkable for a molecular formula with well over 1 billion possible structures.

In another 4 cases, the calculated properties were instrumental in reducing the final candidate numbers after exploiting the mass spectral classifiers. In these cases, which had a high number of possible candidates, the calculated properties were generally able to reduce the candidate numbers by an order of magnitude. The remaining 5 cases not yet discussed included those where candidate numbers could not be reduced sufficiently (2 cases) or where the correct structure was absent (3 cases, where removal of the incorrect classifier led to the generation of > 10,000 possible isomers). Clearly, mass spectral classifiers are essential in making CASE work for real examples, although the correct structure may be excluded due to an incorrect classifier. However, it is also clear that calculated properties can provide important support in some cases.

Structure 15, 2-naphthyl acetate, is a good example. Using default classifiers (95%), 3902 structures were generated for consideration. Inclusion of the BP/LRI relationship excluded more than two-thirds of these structures, leaving 1061. The partitioning behavior, represented by the log 𝐾OW, eliminated further structures, with 858 remaining. Combining two programs for steric energy calculation, MOLGEN–QSPR and ChemBio3D, removed several hundred additional candidates from consideration, leaving only 338 structures from the original 3902: an order of magnitude reduction with the calculated properties alone. In this case, no clear further restrictions could be made with the match value.
Since this is an aromatic compound with only one main substituent, the mass spectrum is dominated by one main peak and is thus not able to distinguish candidates readily. A remarkable difference in the results for this structure (structure 15) was evident when an additional classifier, C(=O)O at 94% precision, was added. With this classifier information, only 36 candidates were generated, and this was reduced to 20 candidates with the BP/LRI relationship. No further reduction in the candidates was possible after this. The remaining 20 candidates (represented as 4 Markush structures covering all 20 isomers) are shown in Figure 9.8.

Fig. 9.8. The remaining structures for CASE with structure 15, as Markush representations.

Although we gained many orders of magnitude in candidate reduction overall, some weaknesses remained with this exclusion strategy. Of greatest concern are inaccuracies, both in the classifiers and the calculated properties. As is evident from Figure 9.7, the predicted ranges can be very large, yet still do not always include the measured value of the correct molecule (e.g. structure 5). Filtering candidates could therefore result in elimination of the correct molecule too early in the process. As an alternative, a consensus scoring approach was developed in [284] and is covered in the next section.
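The stepwise reduction quoted for structure 15 can be checked with simple arithmetic. The short Python sketch below uses only the candidate counts given in the text (3902, 1061, 858 and 338); the function name `reduction_report` is a hypothetical helper introduced for illustration.

```python
# Candidate counts for structure 15 after each step, as quoted in the text.
stages = [
    ("default classifiers (95%)", 3902),
    ("BP/LRI relationship", 1061),
    ("log Kow", 858),
    ("steric energy", 338),
]


def reduction_report(stages):
    """Return (label, remaining, fraction removed in this step,
    overall reduction factor relative to the first stage)."""
    start = stages[0][1]
    previous = start
    rows = []
    for label, remaining in stages:
        removed_fraction = 1 - remaining / previous
        rows.append((label, remaining, removed_fraction, start / remaining))
        previous = remaining
    return rows


rows = reduction_report(stages)
for label, remaining, removed, factor in rows:
    print(f"{label:<28} {remaining:>5}  removed {removed:6.1%}  overall factor {factor:5.1f}")
```

With these counts, the BP/LRI step removes about 73% of the candidates and the overall reduction from 3902 to 338 is a factor of roughly 11.5, consistent with the order of magnitude stated above.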

9.2.6 Consensus scoring

Another aspect added to the toolbox of ‘calculated properties to improve CASE’ was the idea of consensus scoring. It became increasingly clear when applying the approach described in Figure 9.5 that using hard cut-off criteria for elimination was not always appropriate, as the early exclusion of the correct candidate due to an incorrect prediction would lead the whole process astray. Thus, in [284], a consensus score was used instead of the filtering in [287] to prioritize candidates that met the most criteria (spectral match, partitioning behavior, retention indices and energy) rather than eliminating the compounds that failed to meet at least one criterion. This was taken one step further, since multiple programs or calculations could be used to determine each ‘criterion’ and then averaged, to smooth out differences between calculations. Although this strategy yielded a relatively complex equation that needs to be adjusted for each analysis and the available calculations, the benefit is the avoidance of strict elimination, instead allowing the user to balance all data. A general equation is as follows, modified from [284]:

\[
CS = \frac{1}{5}\left( MV + S + \frac{\sum C_{\log K_{OW}}}{n_{C_{\log K_{OW}}}} + \frac{\sum C_{RI}}{n_{C_{RI}}} + \frac{\sum C_{E}}{n_{C_{E}}} \right),
\]

where \(MV\) and \(S\) refer to the MOLGEN–MS match value and MetFrag score, respectively; \(C_{\log K_{OW}}\) is 1 if the candidate’s predicted log 𝐾OW is within ±1 of the log 𝐾OW of the fraction and 0 if it is outside this range; \(n_{C_{\log K_{OW}}}\) is the number of predicted log 𝐾OW values used; \(C_{RI}\) is 1 if the candidate is within the boundaries of the given retention index criterion and 0 if it is outside this boundary; \(n_{C_{RI}}\) is the number of RI criteria used; \(C_{E}\) is 1 if the conformational energy is below the 90th percentile of a given energy calculation and 0 if it is above this value, while \(n_{C_{E}}\) is the number of energy calculations used.

This equation has obvious weaknesses, especially with regard to the use of hard 0/1 boundaries. However, we will show in Section 9.3 that it moves us away from candidate ‘filtering’ per se and towards a single score that integrates multiple criteria. An expansion to an equation with softer criteria than ‘pass-fail’ would certainly be of interest in future applications. The workflow of CASE with consensus scoring, incorporating the different criteria, is given in Figure 9.9; some results are given in Section 9.3.
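As a minimal sketch of how the consensus score could be evaluated, the Python function below implements the equation term by term. It assumes that the match value and MetFrag score are already scaled to the interval [0, 1] and that the 0/1 flags for each criterion have been determined beforehand; the function names and the example numbers are hypothetical and are not taken from [284].

```python
def mean_flag(flags):
    """Average of 0/1 flags for one criterion (sum of C divided by n_C)."""
    if not flags:
        raise ValueError("each criterion needs at least one calculation")
    return sum(flags) / len(flags)


def consensus_score(match_value, metfrag_score, log_kow_flags, ri_flags, energy_flags):
    """CS = (MV + S + mean(C_logKow) + mean(C_RI) + mean(C_E)) / 5."""
    return (
        match_value
        + metfrag_score
        + mean_flag(log_kow_flags)
        + mean_flag(ri_flags)
        + mean_flag(energy_flags)
    ) / 5


def within(value, centre, tolerance):
    """Return 1 if value lies within centre +/- tolerance, else 0."""
    return 1 if abs(value - centre) <= tolerance else 0


# Hypothetical candidate: three log Kow predictions compared with a
# fraction log Kow of 2.0 (tolerance +/- 1), two RI criteria, three energies.
log_kow_flags = [within(p, 2.0, 1.0) for p in (2.1, 2.6, 3.4)]  # -> [1, 1, 0]
score = consensus_score(
    match_value=0.6,
    metfrag_score=0.8,
    log_kow_flags=log_kow_flags,
    ri_flags=[1, 0],
    energy_flags=[1, 1, 1],
)
print(round(score, 3))
```

A candidate that satisfies every criterion with a perfect match value and MetFrag score obtains a score of 1, while a single failed prediction lowers the score only fractionally instead of removing the candidate.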

[Figure 9.9: workflow diagram. Recoverable panel labels: NIST and MOLGEN–MS; Substructure Information (Present, Absent); Formula(s) (C7HNO3, C8H5NO2); Structure Generation; Spectral Match (MOLGEN–MS, MetFrag); Retention Match (LRI/BP Correlation, NIST KRI); Partitioning (EPI Suite, CDK AlogP, CDK XlogP); Steric Energy (MOLGEN–QSPR, Obenergy, MOPAC); Consensus Candidate Selection.]

Fig. 9.9. Workflow for consensus scoring for CASE with MOLGEN–MS and various calculated properties. Adapted with permission from [284]. Copyright (2012) American Chemical Society.

9.3 Examples of CASE at work

In this section we will work through three real-life case studies of CASE. The first two were identified using the consensus scoring approach, the third through prior knowledge of the experimental design. All three examples came from EDA studies. The first two were from EDA of a blue rayon extract, which is a passive sampler designed to capture polar, planar mutagenic compounds in water matrices. These examples were published in [284]; a summary is given here. The spectra detailed below were the only unknowns of significance found in these investigations; most of the polar, planar mutagens sampled could only be detected with LC–MS techniques, not GC–MS as used here. The last example came from an EDA of diclofenac exposed to sunlight in the presence of algae. While it was possible that many transformation products were present, the example explored here was the only consistent signal of sufficient intensity found in this experiment. The results of the full investigation were published in [282]; only the results concerning CASE are presented here.

9.3.1 Blue rayon unknown 1

The first unknown compound was found at 19.2 minutes in a GC–MS run of a mutagenic subfraction of the blue rayon extract. There were two very similar spectra in the NIST database, with similar probabilities. The NIST match results are summarized in Table 9.2 and show the similarity of the spectra. Further, LRI and KRI standards were measured with the unknown spectrum. The LRI of 224.7 calculated for the unknown resulted in a BP inclusion range of 193.7–295.7 °C for structure candidates. The calculated KRI of 1320 resulted in an inclusion range of 934.6–1705 units, taking the 95% confidence interval for the NIST KRI prediction. Taking a lower error margin in this case would actually result in exclusion of all candidates from consideration. As a result, identification based on the database alone was not clear from the first review of the results. While the standard procedure to confirm the candidates would be to purchase one (or both) of the top compounds for comparison of MS and RT, we instead explored the use of CASE via MS to support the identification of a candidate in this case.

Running the unknown spectrum through MOLGEN–MS as well as through the NIST classifiers resulted in the following information (95% precision) for formula calculation:

– DBE = 6–8, C ≥ 7, H ≥ 0, O = 2–4

With this classifier information, three possible formulas were generated by MOLGEN–MS: C9H8O2, C8H4O3 and C7O4, of which the last was quite unlikely.

Table 9.2. NIST match results for Unknown 1 (row 1), phthalic anhydride (2), phthalic acid (3), phthalamic acid (4) and monoethyl phthalate (5).

Row   Spectrum: 𝑚/𝑧 (%)                          Match    MW    KRI
1     74(12) 76(57) 104(100) 148(20)                             1320 (calc.)
2     50(43) 74(20) 76(89) 104(100) 148(34)      44.6%    148   1443 ± 382
3     50(38) 74(19) 76(77) 104(100) 148(22)      41.2%    166   1620 ± 220
4     17(19) 50(41) 76(86) 104(100) 148(16)      8.4%     165   1673 ± 382
5     50(43) 74(18) 76(81) 104(100) 148(14)      4.6%     194   1629 ± 382
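The three formulas follow from the classifier constraints by simple enumeration, which can be illustrated with a brute-force Python sketch. This is not the formula generator of MOLGEN–MS; it assumes a nominal mass of 148 (the highest peak of the unknown in Table 9.2), only the elements C, H and O, and integer double bond equivalents calculated as DBE = C − H/2 + 1. The function names are hypothetical.

```python
def dbe(c, h):
    """Double bond equivalents for a CcHhOo formula (oxygen does not contribute)."""
    return c - h / 2 + 1


def format_formula(c, h, o):
    """Write a formula such as C8H4O3, omitting elements with a count of zero."""
    parts = [("C", c), ("H", h), ("O", o)]
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in parts if n > 0)


def candidate_formulas(nominal_mass, c_min, o_range, dbe_range):
    """Enumerate CHO formulas of the given nominal mass that satisfy
    C >= c_min, o_range[0] <= O <= o_range[1], H >= 0 and an integer
    DBE within dbe_range."""
    hits = []
    for c in range(c_min, nominal_mass // 12 + 1):
        for o in range(o_range[0], o_range[1] + 1):
            h = nominal_mass - 12 * c - 16 * o  # remaining mass is hydrogen
            if h < 0:
                continue
            d = dbe(c, h)
            if d == int(d) and dbe_range[0] <= d <= dbe_range[1]:
                hits.append((format_formula(c, h, o), int(d)))
    return hits


formulas = candidate_formulas(148, c_min=7, o_range=(2, 4), dbe_range=(6, 8))
for formula, d in formulas:
    print(formula, "DBE =", d)
```

Under these assumptions the enumeration returns exactly the three formulas named in the text: C7O4 (DBE 8), C8H4O3 (DBE 7) and C9H8O2 (DBE 6).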

[Figure 9.10: substructure drawings in two panels, ‘Present’ and ‘Absent’.]

Fig. 9.10. Substructure classifiers from MOLGEN–MS and NIST for Unknown 1. Adapted with permission from [284]. Copyright (2012) American Chemical Society.

[Figure 9.11: structure drawings of four candidates.]

Fig. 9.11. Top four candidates from MOLGEN–MS using the consensus scoring approach for Unknown 1. Adapted with permission from [284]. Copyright (2012) American Chemical Society.

The substructure classifier information (relevant to these formulas) is shown in Figure 9.10 and is very specific, which allows a good reduction in the number of possible structure candidates. Indeed, accepting all three formulas and this substructure information into the structure generation step of MOLGEN–MS resulted in only 137 structures, compared with 4,161,969 possible structures generated for the formula C8H4O3 without any restrictions. The 137 structures were processed using the consensus scoring approach introduced previously. The consensus scores for the candidates ranged from 0.295 to 0.913, i.e. from barely satisfying any of the multiple criteria through to satisfying all criteria. The top 4 candidates are shown in Figure 9.11, in order of score from left (highest) to right.

The consensus scores for these top 4 candidates were (from left to right) 0.913, 0.785, 0.783 and 0.753. The top match is a clear winner and this is visually confirmed when looking at the structures in the figure; the left-most structure is a ‘normal’ compound, whereas the others are exceedingly strained. The two middle structures, although highly unlikely, had higher match values and MetFrag scores than the other two candidates, which explained their presence in the top 4 despite the extreme strain. Thus, while 71 candidates had higher match values and 12 had higher MetFrag scores, the use of calculated properties in this case was instrumental in ranking the structure to the left at the top, as it was the only structure that satisfied all of the additional criteria (i.e. steric energy, LRI, KRI).

An astute reader may have noted that the next-closest NIST match was at a higher molecular weight than the 148 used for the formula and structure candidates presented here, and also fell outside the measured KRI range. This added to the evidence for the tentative identification. The mutagenic subfraction, from which this unknown spectrum resulted, was also measured using LC–MS/MS. The mass of 𝑚/𝑧 [𝑀 + 𝐻]+ = 149.0229 was detected at 3.04 minutes, corresponding to the protonated mass of Unknown 1, although with insufficient intensity for an MS/MS spectrum. This peak was considered sufficient additional evidence to purchase the standard of the left-most structure in Figure 9.11, phthalic anhydride. Measurement of this standard in both GC–MS and LC–MS/MS resulted in a confirmation of this structure as the correct compound. Unfortunately this structure was not able to explain any of the mutagenicity of the sample. Thus, although we have a successful structure identification via CASE, the cause of the toxicity remained unidentified.

9.3.2 Blue rayon unknown 2

The second example here was similar to the first in many ways (indeed, one oxygen is replaced with an NH group). However, there are some differences in this example that warranted a more detailed look. This unknown eluted at 25.4 minutes in the chromatogram from the same mutagenic subfraction of the blue rayon extract. The results of the NIST library search are summarized in Table 9.3.

Again, the result of the NIST search was not 100% clear. Although the top spectrum had a higher probability, its KRI was further away from the experimental KRI than that of the second candidate. In addition, the spectra were very similar. Thus, we also pursued CASE for this example, as it had a small molecular weight and was likely to be amenable to a successful CASE. The calculated LRI was 251.0; thus candidates should have a predicted BP range of 220.0–322.0 °C, whereas the KRI of 1472 gave an inclusion range of 1086.6–1857.4. The substructure information retrieved from NIST and MOLGEN–MS resulted in the following formula information:

– DBE = 7–8, C ≥ 7, H ≥ 0, O ≥ 2, N ≥ 1, S ≥ 0

Table 9.3. NIST match results for Unknown 2 (row 1), phthalimide (2), 𝑜-cyanobenzoic acid (3), hydroxymethylphthalimide (4), N-(2-acetamidoethylthio)phthalimide (5), phthalimidomethyl 3-methoxybenzoate (6).

Row   Spectrum: 𝑚/𝑧 (%)                          Match    MW    KRI
1     76(47) 103(24) 104(52) 147(100)                            1472 (exp)
2     50(20) 76(55) 103(37) 104(74) 147(100)     61.4%    147   1381 ± 382
3     50(46) 76(99) 103(37) 104(74) 147(100)     22.4%    147   1438 ± 382
4     50(38) 76(100) 103(29) 104(59) 147(69)     6.22%    177   1781 ± 382
5     50(59) 76(100) 103(35) 104(75) 147(92)     4.3%     264   2488 ± 382
6     50(56) 76(100) 103(40) 104(66) 147(83)     4.1%     311   2667 ± 382
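The inclusion ranges quoted for both blue rayon unknowns can be reproduced with fixed offsets around the measured indices. In the Python sketch below, the offsets (−31 and +71 °C around the LRI for the BP range, ±385.4 units around the KRI) are simply read off the numbers quoted in the text for the two unknowns; how these margins were derived is not stated here, so they should be treated as assumptions, and the function names are hypothetical.

```python
def bp_inclusion_range(lri, lower_offset=31.0, upper_offset=71.0):
    """BP inclusion range (deg C) for a measured LRI.
    The offsets are back-calculated from the two worked examples (assumption)."""
    return (round(lri - lower_offset, 1), round(lri + upper_offset, 1))


def kri_inclusion_range(kri, half_width=385.4):
    """KRI inclusion range for a measured KRI.
    The half-width is back-calculated from the worked examples (assumption)."""
    return (round(kri - half_width, 1), round(kri + half_width, 1))


def in_range(value, bounds):
    """True if value lies within the closed interval bounds = (low, high)."""
    return bounds[0] <= value <= bounds[1]


# Unknown 1: LRI 224.7, KRI 1320; Unknown 2: LRI 251.0, KRI 1472.
print(bp_inclusion_range(224.7), kri_inclusion_range(1320))
print(bp_inclusion_range(251.0), kri_inclusion_range(1472))

# Predicted KRI values of the top two NIST matches for Unknown 2 (Table 9.3).
kri_bounds = kri_inclusion_range(1472)
print(in_range(1381, kri_bounds), in_range(1438, kri_bounds))
```

For Unknown 1 the upper KRI bound evaluates to 1705.4, which appears in the text rounded to 1705. Both of the top two NIST matches for Unknown 2 have predicted KRI values inside the inclusion range, which illustrates why the retention index alone did not discriminate between them.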

[Figure 9.12: substructure drawings in two panels, ‘Present’ and ‘Absent’.]

Fig. 9.12. Substructure classifiers from MOLGEN–MS and NIST for Unknown 2. Adapted with permission from [284]. Copyright (2012) American Chemical Society.

[Figure 9.13: structure drawings of four candidates.]

Fig. 9.13. Top four candidates from MOLGEN–MS using the consensus scoring approach for Unknown 2. Adapted with permission from [284]. Copyright (2012) American Chemical Society.

Two possible formulas were generated within MOLGEN–MS: C7HNO3 and C8H5NO2. The substructural information (95% probability) pertinent to these formulas is shown in Figure 9.12. The number of structures resulting from the two formulas and the substructure information shown was 561, all with the formula C8H5NO2. Although this was considerably more than for Unknown 1, again only 1 candidate was clearly on top using the consensus scoring approach. The top 4 candidates shown in Figure 9.13 confirm that the top NIST database match also fits the spectral interpretation information provided by NIST and MOLGEN–MS.

The consensus scores for these candidates were, from left to right, 0.848, 0.744, 0.744 and 0.723. Only 60–62% of the experimental spectrum was predicted by MOLGEN–MS, whereas the MetFrag scores ranged (for these 4 candidates) between 0.695 and 1. Again, energy was the main factor separating the top candidate from the remaining candidates. The consensus score of the top candidate was noticeably lower than for Unknown 1. While 15 candidates were within the predicted BP/LRI range given, these 15 structures all had very high energies. The top candidate in Figure 9.13 was not within this range, which shows that despite the large errors associated with these predictions, it is possible (as the uncertainty suggests) for the correct candidate to lie outside the predicted ranges. This is additional evidence in favor of a scoring system similar to the consensus approach, rather than hard filtering/exclusion.

The confirmation of the top candidate here was similar to Unknown 1 above. A peak corresponding to the 𝑚/𝑧 [𝑀 − 𝐻]− = 146.0249 (0.7 ppm error) of Unknown 2 was detected with LC–MS/MS analysis of the same fraction, with negative ionisation, at a retention time of 4.80 minutes. The standard compound of the top candidate, phthalimide, was purchased and measured with both GC–MS and LC–MS. A peak at 25.464 minutes was detected, with KRI = 1474 and LRI = 251.3, all corresponding well with Unknown 2. Furthermore, a NIST library search gave match and reverse match values of 947 and 949, respectively, between the unknown and standard spectra, which indicated a very good match. Similarly, the LC–MS retention time also matched, with the 𝑚/𝑧 [𝑀 − 𝐻]− = 146.0254 (4 ppm error) detected at 4.85 minutes. Thus, phthalimide was confirmed as the identity of Unknown 2. As with Unknown 1, however, phthalimide showed no activity in the Ames test for mutagenicity. Thus, although this was another successful example of CASE, the toxicity confirmation for the EDA remained incomplete.

9.3.3 Diclofenac transformation product

The final complete example of CASE we present here is the study of diclofenac photo-transformation products. Previous research had shown that TPs toxic to algae were formed in the environment; the investigations in [282] aimed to identify these toxic TPs. Flasks containing diclofenac and green algae were exposed to sunlight in order to reproduce the transformation processes. The green algae acted as the biotest, while fractionation was used to reduce the sample complexity (followed by more biotests) to determine which were the toxic components. Surprisingly, in initial experiments only one fraction showed enhanced toxicity, with one peak of interest present.

The mass spectrum retrieved from the experiments was unfortunately of quite low quality. The spectrum is shown in Figure 9.14, together with the closest NIST match, diphenyl carbamic chloride. Although the spectra overlapped well in some places, other peak groups did not match, e.g. the 139, 202 and 204 peaks were absent in the NIST spectrum, while the peaks at 77 and 119 were missing in the unknown. This was also reflected in the match probability of 48.1%. It was clear, therefore, that this was a perfect example where CASE should be applied.

[Figure 9.14: two mass spectra, panels (a) and (b); axes: Abundance (%) against 𝑚/𝑧 (amu).]

Fig. 9.14. (a) Mass spectrum of the unknown TP of diclofenac. (b) Mass spectrum of closest NIST match, diphenyl carbamic chloride. Source: [283].

The substructure information retrieved from NIST and MOLGEN–MS indicated the presence of more than one aromatic ring, one chlorine and 0–2 oxygen atoms. Absent substructures (95% probability) also provided significant additional information here. These included: Ar−O, CH2/3, ether, OH, NCH3, NH(CH2), C−O, C(=O)O and NH2. This information alone reduced the number of candidates from over 1 billion possible structures to 36 candidates. This is an impressive reduction in structure numbers; however, 36 candidates were still too many to consider. Instead of the approach taken for the two examples above, the experimental information was used here to provide more effective restrictions than a wide range of calculated properties.

The parent product (diclofenac) was known and no other chemicals were present at the beginning of the experiment. Furthermore, extensive investigations of TPs of diclofenac have been undertaken previously, including a study from Agüera et al. [4] which presented the structures of many TPs. A detailed look at these structures revealed many structural similarities between the TPs, making it possible to define a ‘goodlist’ entry for the structure generation, shown in Figure 9.15 (a). Structure generation with this goodlist entry yielded only two candidates, also shown in Figure 9.15, (b) and (c). Both these candidates had a predicted log 𝐾OW of 3.65 (with EPI Suite [311]), which was within the calculated fraction range of [3.4, 3.7]. Diclofenac, the parent compound, has two chlorines on one aromatic ring and an acetic acid group on the other ring. This made it far more likely that the compound shown in Figure 9.15 (b) was the correct candidate, rather than that in Figure 9.15 (c), which has the Cl and aldehyde group on the same aromatic ring.

[Figure 9.15: structure drawings in three panels, (a), (b) and (c).]

Fig. 9.15. (a) The ‘goodlist’ substructure from TPs of diclofenac. (b) and (c) The resulting two candidate structures generated. Source: [283].

At this stage, the information was conclusive enough to justify synthesis of the reference standard for the top candidate, Figure 9.15 (b), known as 2-[(2-chlorophenyl)amino]benzaldehyde or CPAB. Meanwhile, the EDA was repeated with larger volumes to retrieve a better quality spectrum of the unknown compound and to confirm that this was indeed the compound responsible for the toxicity. The spectrum of the unknown retrieved from the second EDA is shown in Figure 9.16 (a), while the spectrum of the synthesized standard of the top candidate is shown in Figure 9.16 (b). Both the reisolated unknown and the standard showed the same peak groups as in the original unknown spectrum (see Figure 9.14 (a)) and furthermore showed an excellent match to each other, with a NIST match value of 989 (out of 1000; almost a perfect match), as well as KRI values of 1981.0 and 1980.8 for the unknown and standard, respectively. Additional experiments using the reference compound also confirmed that CPAB was responsible for the observed toxicity [282]. Thus, CPAB was confirmed as the structure of the TP responsible for the enhanced toxicity in the transformed diclofenac samples, as far as can be achieved with mass spectrometry alone. Further details on the toxicological confirmation and the complete EDA can be found in Schulze et al. [282].

[Figure 9.16: two mass spectra, panels (a) and (b); axes: Abundance (%) against 𝑚/𝑧 (amu). Panel (b) includes a structure drawing.]

Fig. 9.16. (a) Mass spectrum of unknown transformation product of diclofenac, reisolated in a second EDA study. (b) Mass spectrum of the synthesized standard, CPAB. Source: [283].

9.4 CASE conclusions and outlook

9.4.1 GC–EI–MS

As we saw in Section 9.3, the building blocks of CASE via MS are present for GC–MS and have been applied successfully to identify or confirm the identity of environmental contaminants in real samples. The use of additional substructure classifiers from the NIST database as well as calculated properties such as RIs, partition coefficients and steric energies provided vital information for candidate generation and selection and enhanced the chances for a successful CASE dramatically. However, there is a long way to go before CASE via EI–MS becomes viable for daily use with a success rate acceptable for routine application, not just research and development. The examples that were successful here were quite small molecules with detailed classifier information (Subsections 9.3.1 and 9.3.2) or with very detailed information about the parent compound (Subsection 9.3.3). In some ways, the success of CASE via EI–MS depended more on the availability of detailed substructure classifiers to reduce the number of structures, rather than the actual size of the molecule, per se. It is clear, however, that the larger the molecule (i.e. the greater the number of atoms), the more difficult de novo CASE will be due to the dramatic increase in the number of possible compounds.

A number of areas in the CASE via EI–MS workflow could still be improved to increase the chances of a successful result. These include:

– MS classification
  ∘ development of further descriptors specially adapted to particular classification problems,
  ∘ development of descriptors capable of accounting for further information such as molecular mass or exact fragment masses,
  ∘ testing of further methods for descriptor selection,
  ∘ testing of further classification methods such as support vector machines (SVM, see Subsection 6.2.3), as well as parameter optimization for such methods,
  ∘ using further structural properties even if classifiers developed are of low predictivity. This may be compensated by
  ∘ filtering classification results while considering logical implications among single structural properties [323],
  ∘ incorporation of the maximum common substructure approach (e.g. [126]).
– structure generation
  ∘ direct processing of aromatic substructures,
  ∘ improved planarity restrictions for aromatic compounds
– MS verification
  ∘ testing of further ranking functions, e.g. from [291],
  ∘ testing of several parameters for ranking functions, such as a lower bound for the DBEs of fragment molecular formulas in calculation of molecular formula match values, or use of various sets of fragmenting reactions in the calculation of structure match values,
  ∘ development and testing of additional criteria for the plausibility of molecular and structural formulas,
  ∘ consideration of aspects of reaction dynamics,
  ∘ consideration of the energy needed to form a fragment (e.g. likelihood of occurrence), similar to that implemented in MetFrag [344] and (not yet implemented to the best of our knowledge) the absence of a peak for a fragment that would be expected to form preferentially for a given structure under the relevant conditions,
  ∘ new ideas to address the issue of ‘favourite structures’ for in silico fragmentation. Molecules with more fragmentation possibilities currently outperform those with fewer fragmentation possibilities in ranking functions; some way to adjust for this will improve ranking performance considerably.

Even if all the above-mentioned points were addressed, combining all features necessary for a successful CASE via EI–MS into an automated, ‘one button’ de novo system is still difficult. While the performance of MOLGEN–MS was improved dramatically using the additional properties introduced in this chapter, this already required the use of several additional programs from different sources, including NIST, EPI Suite, MOLGEN–QSPR and OpenBabel. Even if these could be combined into one platform, successful CASE investigations are still heavily reliant on expert knowledge of the experimental conditions. Building this into a workflow to suit all eventualities is the most challenging aspect of all. The problem of structure confirmation also still remains. Candidates obtained via structure generation (and even many obtained via compound database searching of e.g. PubChem or ChemSpider) often cannot be purchased or can only be synthesized at a very high cost. An expansion of the compound database searching and ranking via fragmentation provided in MetFrag with the additional criteria discussed in this chapter could help users identify those compounds that may be available for purchase more easily and thus increase success rates in confirming the identity of the top candidate.


9.4.2 CASE with high accuracy data

As we have seen in Sections 8.7 and 8.8, the building blocks for CASE with high accuracy data are not yet completely in place for this rapidly-growing field. The current databases are relatively small and as a result, reliable MS substructure classifiers are missing. Concepts such as the maximum common substructure approach (e.g. [126]) or obtaining detailed structural information from the fragment identity (as the formula and thus the actual structure of a fragment are often more obvious than from low accuracy data) may be useful but are not yet implemented for routine use. In soft ionisation techniques, the collision energy has much more influence on the actual process, as seen in the much-improved RRP values for MetFrag with high accuracy data. The lack of comparable spectra (although many spectra are now available, these are measured under many differing conditions and are not completely comparable) means that there is still insufficient data to start building up classifiers as performed for EI–MS as described in Section 8.5. However, contribution of mass spectra to open databases (e.g. MassBank [131] and METLIN [297]) is increasing rapidly and it is possible that work on retrieving substructure information from accurate mass spectra with associated probabilities (as performed here for EI–MS) can start soon. The ‘fragmentation tree’ approach (e.g. [242]) is another promising alternative for higher level MS data (e.g. MSⁿ data, not just tandem MS or MS²); however this is also not yet fully automated for practical use and requires further development. Mass spectrometry will rarely deliver as much structural information as alternative techniques such as NMR and it is thus unrealistic to expect that one could achieve the success rates of CASE via NMR with MS, even with high accuracy MSⁿ data. However, there is still plenty of room for improvement and the development of new methods for accurate data will be a field worth following over the next 10 years.

A Lists of molecular descriptors

This appendix contains lists of the molecular descriptors available in MOLGEN–QSPR: arithmetical, topological, and geometrical descriptors. Some were introduced in this book; the specifications of others can be found in [34, 262].

A.1 Arithmetical descriptors

Arithmetical descriptors $\bar{D}$ and $\bar{D}^*$, defined by $\bar{D}(\bar{M}) = D(M)$ and $\bar{D}^*(\bar{M}) = D(M^*)$, available in MOLGEN, are based on the following mappings $D$:

$D$                                        $D(M)$
$A$                                        number of atoms
$N_X$                                      number of X atoms, X = C, O, N, S, F, Cl, Br, I, P
rel. $N_X$                                 relative number of X atoms, X = C, …, P
$B$                                        number of bonds
loc. $B$                                   number of localized bonding electron pairs
$n_-$, $n_=$, $n_\#$                       number of single, double, triple bonds
rel. $n_-$, rel. $n_=$, rel. $n_\#$        relative number of single, double, triple bonds
$n_{aroma}$                                number of aromatic bonds
rel. $n_{aroma}$                           relative number of aromatic bonds
$C$                                        cyclomatic number
$MW$                                       molecular weight
mean $AW$                                  mean atomic weight
charge                                     total charge
rad                                        number of radical sites
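A few of these arithmetical descriptors can be evaluated from a molecular formula alone. The following is a minimal sketch, not MOLGEN's implementation: the function names are hypothetical, the atomic weights are rounded standard values, and "relative number of X atoms" is assumed here to mean $N_X / A$.

```python
import re

# Rounded standard atomic weights for H and the elements listed for N_X.
ATOMIC_WEIGHT = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "F": 18.998,
    "P": 30.974, "S": 32.06, "Cl": 35.45, "Br": 79.904, "I": 126.904,
}

def parse_formula(formula):
    """Turn a formula string such as 'C2H6O' into {'C': 2, 'H': 6, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def arithmetical_descriptors(formula, element="C"):
    """Return A, N_X, rel. N_X (assumed to be N_X / A), MW and mean AW."""
    counts = parse_formula(formula)
    n_atoms = sum(counts.values())
    mol_weight = sum(ATOMIC_WEIGHT[e] * n for e, n in counts.items())
    n_x = counts.get(element, 0)
    return {
        "A": n_atoms,
        "N_X": n_x,
        "rel_N_X": n_x / n_atoms,
        "MW": mol_weight,
        "meanAW": mol_weight / n_atoms,
    }

ethanol = arithmetical_descriptors("C2H6O")
```

Descriptors that depend on the bonds (for example $B$ or the cyclomatic number) need the molecular graph and are not covered by this formula-only sketch.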

A.2 Topological descriptors

Now we list a set of functions, defined on molecular graphs $M$ and depending on the underlying multigraphs $\gamma$. We give the names of the corresponding topological descriptors or indices, as they are usually called. We note that in MOLGEN they are mostly evaluated on the H–suppressed molecular graph $M^*$ of the molecule.

$W$                              Wiener index
$M_1$, $M_2$                     1st and 2nd Zagreb index
$^mM_1$, $^mM_2$                 1st and 2nd modified Zagreb index
$^k\chi$                         Randić indices of order k = 0, 1, 2
$^k\chi^s$                       solvation connectivity indices of order k = 0, 1, 2, 3
$^3\chi^s_C$                     solvation connectivity index for clusters
$^k\chi^v$                       Kier and Hall indices of order k = 0, 1, 2, 3
$^k\kappa$                       Kier shape indices of order 1, 2 and 3
$\Phi_{\bar{\alpha}}$            Kier molecular flexibility index non–alpha–modified
$^k\kappa_\alpha$                Kier alpha–modified shape indices of order 1, 2 and 3
$\Phi$                           Kier molecular flexibility index
$F$                              Platt number
$N_{GS}$                         Gordon–Scantlebury index
$J$                              Balaban index
$J_{unsat}$                      unsaturated Balaban index
$MTI$                            Schultz molecular topological index
$MTI'$                           MTI′ index
$H$                              Harary number
$twc$                            total walk count
$mwc^{(l)}$                      molecular walk counts of length l = 2, …, 8
$twc_{unsat}$                    unsaturated total walk count
$mwc^{(l)}_{unsat}$              unsaturated molecular walk counts, l = 2, …, 8
$G_1$ (topol.)                   gravitational index (pairs, topol. dist.)
$G_2$ (topol.)                   gravitational index (bonds, topol. dist.)
$Z$                              Hosoya Z index
$IC_0$                           Basak information content of order 0
$TIC_0$                          Basak total information content of order 0
$CIC_0$                          Basak complementary information content of order 0
$N * CIC_0$                      total complementary information content of order 0
$SIC_0$                          Basak structural information content of order 0
$N * SIC_0$                      total structural information content of order 0
$BIC_0$                          bonding information content of order 0
$N * BIC_0$                      total bonding information content of order 0
$IC_1$                           Basak information content of order 1
$TIC_1$                          Basak total information content of order 1
$CIC_1$                          Basak complementary information content of order 1
$N * CIC_1$                      total complementary information content of order 1
$SIC_1$                          Basak structural information content of order 1
$N * SIC_1$                      total structural information content of order 1
$BIC_1$                          bonding information content of order 1
$N * BIC_1$                      total bonding information content of order 1
$IC_2$                           Basak information content of order 2
$TIC_2$                          Basak total information content of order 2
$CIC_2$                          Basak complementary information content of order 2
$N * CIC_2$                      total complementary information content of order 2
$SIC_2$                          Basak structural information content of order 2
$N * SIC_2$                      total structural information content of order 2
$BIC_2$                          bonding information content of order 2
$N * BIC_2$                      total bonding information content of order 2
$MSD$                            mean square distance index
$w$                              detour index
$w_{diag}$                       detour index (incl. half main diagonal)
$P_{acyc}$                       total acyclic path count
$^lP_{acyc}$                     molecular acyclic path count for length l = 2, …, 8
$^{\geq 9}P_{acyc}$              molecular acyclic path count of length 9 and higher
$P$                              total path count
$^lP$                            molecular path count for length l = 2, …, 8
$^{\geq 9}P$                     molecular path count of length 9 and higher
rings                            total ring count
$^l$rings                        molecular ring count for length l = 3, …, 8
$^{\geq 9}$rings                 molecular ring count of length 9 and higher
ch. $G_k$                        topological charge indices of order k = 1, …, 8
ch. $J_k$                        mean topological charge indices of order k = 1, …, 8
ch. $J$                          global topological charge index
$D$                              topological diameter
$\xi^c$                          eccentric connectivity index
$\lambda_1^A$                    principal eigenvalue of A
$SCA1$                           sum of coefficients of principal eigenvector of A
$SCA2$                           mean coefficient of principal eigenvector of A
$SCA3$                           log of sum of coeff. of principal eigenvector of A
$\lambda_1^D$                    principal eigenvalue of D
$\chi_T$                         total χ index
$T_m$                            number of methyl groups
$T_3$                            number of pairs of methyl groups at distance 3
$FRB$                            freely rotatable bonds
$SZD$                            Szeged index
$SZD_p$                          hyper-Szeged index
$^l\chi_p$                       connectivity index $^l\chi$ path, l = 3, …, 6
$^l\chi_c$                       connectivity index $^l\chi$ cluster, l = 3, …, 6
$^l\chi_{pc}$                    connectivity index $^l\chi$ path–cluster, l = 3, …, 6
$^l\chi_{ch}$                    connectivity index $^l\chi$ chain, l = 3, …, 6
$^l\chi^v_p$                     valence connectivity index $^l\chi^v$ path, l = 3, …, 6
$^l\chi^v_c$                     valence connectivity index $^l\chi^v$ cluster, l = 3, …, 6
$^l\chi^v_{pc}$                  valence connectivity index $^l\chi^v$ path–cluster, l = 4, …, 6
$^l\chi^v_{ch}$                  valence connectivity index $^l\chi^v$ chain, l = 3, …, 6
sym                              size of topological symmetry group
$R$                              topological radius
con. comp.                       number of connected components
gt planar                        graph–theoretical planarity
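To make the first entries of this list concrete, the sketch below evaluates the Wiener index and the first and second Zagreb index on two small H–suppressed graphs. It uses the usual textbook definitions (sum of topological distances over all unordered vertex pairs; sum of squared degrees; sum of degree products over edges) on simple graphs, so bond multiplicities are ignored. The adjacency-list representation and all function names are assumptions made for illustration only.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search: topological distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of distances over all unordered vertex pairs (connected graph)."""
    total = sum(d for u in adj for d in distances_from(adj, u).values())
    return total // 2  # every pair was counted twice

def zagreb_indices(adj):
    """First and second Zagreb index from vertex degrees."""
    deg = {u: len(neighbours) for u, neighbours in adj.items()}
    m1 = sum(d * d for d in deg.values())
    m2 = sum(deg[u] * deg[v] for u in adj for v in adj[u] if u < v)
    return m1, m2

# H-suppressed graphs of the two C4H10 isomers (vertices are carbon atoms).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), zagreb_indices(n_butane))    # 10 (10, 8)
print(wiener_index(isobutane), zagreb_indices(isobutane))  # 9 (12, 9)
```

The two isomers share every arithmetical descriptor but already differ in these three indices, which is the reason topological descriptors are evaluated on the graph and not on the formula.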

A.3 Geometrical descriptors

$G_1$                                      gravitational index (pairs, 3D–dist.)
$G_2$                                      gravitational index (bonds, 3D–dist.)
$I_A$, $I_B$, $I_C$                        moments of inertia A, B, C
st. energy                                 steric energy
SHDW1, SHDW2, SHDW3                        XY, XZ and YZ shadow
SHDW4, SHDW5, SHDW6                        standardized XY, XZ, YZ shadow
SHDW1/SHDW2                                XY/XZ shadow
SHDW1/SHDW3                                XY/YZ shadow
SHDW2/SHDW3                                XZ/YZ shadow
ssSHDW1, …, ssSHDW3                        size sorted shadows 1, 2, 3
ssSHDW4, …, ssSHDW6                        size sorted standardized shadows 1, 2, 3
ssSHDW1/SHDW2                              size sorted shadow 1/2
ssSHDW1/SHDW3                              size sorted shadow 1/3
ssSHDW2/SHDW3                              size sorted shadow 2/3
$V_{vdw}$                                  van der Waals volume
$\rho_{vdw}$                               density by van der Waals volume
$V^s_{vdw}$                                standardized van der Waals volume
$V_{cub}$                                  enclosing cuboid
$S_{vdw}$                                  van der Waals surface
$SAS_{H_2O}$                               solvent–accessible surface (H2O)
$SAS_H$                                    solvent–accessible surface (H)
$D3D$                                      geometrical diameter
$V_{sphere}$                               enclosing sphere
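As an illustration of the first two geometrical descriptors, the sketch below evaluates a gravitational index from atomic masses and 3D coordinates. It assumes the commonly used form, the sum of $m_i m_j / r_{ij}^2$ over all atom pairs ($G_1$) or over bonded pairs only ($G_2$); the exact definition used in MOLGEN–QSPR is specified in the references cited above and may differ in detail. The function name and its arguments are hypothetical.

```python
from itertools import combinations
from math import dist

def gravitational_index(masses, coords, pairs=None):
    """Sum of m_i * m_j / r_ij**2 over the given atom pairs.

    With pairs=None all unordered atom pairs are used (G1-like);
    passing the list of bonds restricts the sum to bonded pairs (G2-like).
    """
    if pairs is None:
        pairs = combinations(range(len(masses)), 2)
    return sum(
        masses[i] * masses[j] / dist(coords[i], coords[j]) ** 2
        for i, j in pairs
    )

# Toy geometry: three unit masses on a line, one length unit apart.
masses = [1.0, 1.0, 1.0]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bonds = [(0, 1), (1, 2)]

g_pairs = gravitational_index(masses, coords)         # all pairs
g_bonds = gravitational_index(masses, coords, bonds)  # bonded pairs only
```

Replacing the 3D distances by topological distances gives the analogous descriptors $G_1$ (topol.) and $G_2$ (topol.) listed in Section A.2, under the same assumption about the functional form.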

B Substructures for MS classifiers

We describe the structural properties (SP) that MSclass [318] uses in order to classify mass spectra. MSclass comprises classifiers for altogether 85 different SP, and for each of these properties there are up to 4 classifiers available. The SP are identified by a long name, consisting of up to 10 characters. In addition there is a description and a graphic given for each SP. The 85 SP are collected in 6 categories:
– Alkyls (13 SP),
– Aromatics (40 SP),
– Bonds (2 SP),
– Elements (10 SP),
– Functional groups (19 SP),
– Rings (1 SP).

The SP are listed in alphabetical order. The information agrees with the descriptions in the handbook of classifiers [317] for MSclass and is approved by the authors of the handbook.

In order to use the results of classification for a generator such as MOLGEN, the SP have to be described by restrictions that can be used by the generator. For this purpose the structural information on the arithmetical and the topological level had to be encoded in a format that can be understood by MOLGEN–MS. On the arithmetical level the description uses in particular intervals for the occurrence numbers of atoms; on the topological level substructures can be logically combined and used by MOLGEN as structural restrictions.

In the following we collect for each SP the arithmetical restrictions (AR) and the structural restrictions (SR) according to presence (class 1) and absence (class 0) of the SP, in accordance with the present state of MOLGEN–MS. We do not give an explicit graphical description of the substructures considered, since in most cases it can be easily obtained from the SP. In the cases when the encoding by structural properties is not obvious, we add remarks on the methods used.

The information comprised in an SP cannot always be used completely by MOLGEN. Some of the SP are not precise enough for automatic use. Such SP are marked by †.

The SP were partially used in order to develop new classifiers (Subsection 8.5.2). SP that were not considered this way are marked by ‡. They were either not useful or not manageable during structure generation, or there were no helpful databases for the purpose of application in learning or test sets (Subsection 8.3.4).


SPs that could not be used completely in MOLGEN (†), and others that could not be consulted in a classification (‡, Subsection 8.5.2), are distributed over the categories as follows:

Category            SP    †    ‡
Alkyls              13    2    0
Aromatics           40    9    7
Bonds                2    0    0
Elements            10    0    1
Functional groups   19    0    0
Rings                1    0    0
∑                   85   11    8
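The arithmetical restrictions are simple lower bounds on element counts and on the number of double bond equivalents (DBE), so they can be tested against a candidate molecular formula before any structure is generated. The following sketch checks the AR 1 restriction listed for ar-CHO in Section B.2 (C ≥ 7, H ≥ 1, O ≥ 1, DBE ≥ 5). It is not MOLGEN–MS code: the function names are hypothetical, and the DBE is computed with the standard valence-based formula, which is an assumption about the convention used here.

```python
def dbe(counts):
    """Double bond equivalents from element counts.

    Assumes standard valences: C, Si tetravalent; N, P trivalent;
    H and halogens monovalent; divalent O and S do not contribute.
    """
    tetravalent = counts.get("C", 0) + counts.get("Si", 0)
    trivalent = counts.get("N", 0) + counts.get("P", 0)
    monovalent = sum(counts.get(e, 0) for e in ("H", "F", "Cl", "Br", "I"))
    return (2 * tetravalent + 2 + trivalent - monovalent) / 2

def satisfies_ar(counts, min_counts, min_dbe=0):
    """True if every element lower bound and the DBE lower bound hold."""
    enough_atoms = all(counts.get(e, 0) >= n for e, n in min_counts.items())
    return enough_atoms and dbe(counts) >= min_dbe

# AR 1 of ar-CHO as listed in Section B.2.
AR1_AR_CHO = ({"C": 7, "H": 1, "O": 1}, 5)

benzaldehyde = {"C": 7, "H": 6, "O": 1}    # DBE = 5
benzyl_alcohol = {"C": 7, "H": 8, "O": 1}  # DBE = 4
toluene = {"C": 7, "H": 8}                 # no oxygen

for formula in (benzaldehyde, benzyl_alcohol, toluene):
    print(formula, satisfies_ar(formula, *AR1_AR_CHO))
```

Only the first formula passes. A check of this kind can discard a formula, but passing it does not establish that the substructure is present; that is the role of the structural restrictions.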

B.1 Alkyls

C quart ch: C quaternary (4 chain-bonds to carbon atoms) ¹
AR 1: C ≥ 5
AR 0: –
SR 1: at least 1 fragment alkylCquartch
SR 0: no fragments alkylCquartch

C4 H9: C04 H09 ²
C4H9−
AR 1: C ≥ 4, H ≥ 9
AR 0: –
SR 1: at least 1 subunit C4H9
SR 0: no subunits C4H9

C5 H11: C05 H11
C5H11−
AR 1: C ≥ 5, H ≥ 11
AR 0: –
SR 1: at least 1 subunit C5H11
SR 0: no subunits C5H11

1 Acyclic bonds (∧) are described using substructure restrictions ring.
2 Alkyl groups with a given molecular formula are described by corresponding restrictions for the formula.

C5 H11∗: C05 H11 or other alkyl †, ³
C5H11−
AR 1: C ≥ 1, H ≥ 3
AR 0: –
SR 1: at least 1 subunit CH3
SR 0: no subunits CH3

C6 H13: C06 H13
C6H13−
AR 1: C ≥ 6, H ≥ 13
AR 0: –
SR 1: at least 1 subunit C6H13
SR 0: no subunits C6H13

C6 H13 n: C06 H13 (n-)
CH3(CH2)5−
AR 1: C ≥ 6, H ≥ 13
AR 0: –
SR 1: at least 1 fragment alkylC6H13(n-)
SR 0: no fragments alkylC6H13(n-)

C7 H15: C07 H15
C7H15−
AR 1: C ≥ 7, H ≥ 15
AR 0: –
SR 1: at least 1 subunit C7H15
SR 0: no subunits C7H15

C8 H17: C08 H17
C8H17−
AR 1: C ≥ 8, H ≥ 17
AR 0: –
SR 1: at least 1 subunit C8H17
SR 0: no subunits C8H17

C9 H19: C09 H19
C9H19−
AR 1: C ≥ 9, H ≥ 19
AR 0: –
SR 1: at least 1 subunit C9H19
SR 0: no subunits C9H19

3 This definition of an SP is not precise. For a general alkyl group we can only prescribe a CH3 group.

C10 H21: C10 H21
C10H21−
AR 1: C ≥ 10, H ≥ 21
AR 0: –
SR 1: at least 1 subunit C10H21
SR 0: no subunits C10H21

C11 H23: C11 H23 †, ⁴
C≥11H23−
AR 1: C ≥ 11, H ≥ 23
AR 0: –
SR 1: –
SR 0: –

hydr carb: hydrocarbon
CxHy
AR 1: C ≥ 1, H ≥ 1, no hetero atoms
AR 0: at least 1 hetero atom
SR 1: –
SR 0: –

(CH3)3-C: tertiary butyl
AR 1: C ≥ 4, H ≥ 9
AR 0: –
SR 1: at least 1 fragment alkyl(CH3)3-C
SR 0: no fragments alkyl(CH3)3-C

B.2 Aromatics

ar-CHO: aldehyde aryl-CH=O
AR 1: C ≥ 7, H ≥ 1, O ≥ 1, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-CHO
SR 0: no fragments aromaar-CHO

4 This definition means that C11H23 or a bigger alkyl residue is present. This condition cannot be described in the present syntax of MOLGEN. For the determination of classifiers in Subsection 8.5.2, this condition was replaced with C11H23.

ar-CO,N2: aryl – -C-O or -C=O or -N=N ⁵
AR 1: C ≥ 6, DBE ≥ 4
AR 0: –
SR 1: 1–2 terms of
  at least 1 fragment aromaar-COsd,
  at least 1 fragment aromaar-N2
SR 0: exactly 2 terms of
  no fragments aromaar-COsd,
  no fragments aromaar-N2

ar-CH2: aryl – -CH2 or -CH3
AR 1: C ≥ 7, H ≥ 2, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-CH2
SR 0: no fragments aromaar-CH2

ar-N,NHN: aryl – -N= or -NH-N ⁶
AR 1: C ≥ 6, N ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–2 terms of
  at least 1 fragment aromaar-Nsp2,
  at least 1 fragment aromaar-NHN
SR 0: exactly 2 terms of
  no fragments aromaar-Nsp2,
  no fragments aromaar-NHN

ar-C ch: aryl – C (chain-bond)
AR 1: C ≥ 7, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-Cch
SR 0: no fragments aromaar-Cch

5 The alternatives −C−O and −C=O are considered as one MMG, using several possible bonds. The third alternative is realized by an additional substructure entry in a structural restriction substructure.
6 The substructure −N= is described using a substructure restriction hybridization.

ar-C r: aryl – C (ring bond) ⁷
AR 1: C ≥ 7, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-Cr
SR 0: no fragments aromaar-Cr

ar-CO: aryl – C=O
AR 1: C ≥ 7, O ≥ 1, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-CO
SR 0: no fragments aromaar-CO

ar-CH: aryl – CH
AR 1: C ≥ 7, H ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-CH
SR 0: no fragments aromaar-CH

ar-CH2CH2: aryl – CH2-CH2
AR 1: C ≥ 8, H ≥ 4, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-CH2CH2
SR 0: no fragments aromaar-CH2CH2

ar-Cl: aryl – Cl
AR 1: C ≥ 6, Cl ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-Cl
SR 0: no fragments aromaar-Cl

7 Cyclic bonds (∘) as well as acyclic bonds are described by a substructure restriction ring.

ar-CO-CH2: aryl – CO-CH2
AR 1: C ≥ 8, H ≥ 2, O ≥ 1, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-CO-CH2
SR 0: no fragments aromaar-CO-CH2

ar-COO: aryl – COO (benzoic acid/ester)
AR 1: C ≥ 7, O ≥ 2, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-COO
SR 0: no fragments aromaar-COO

ar-F: aryl – F
AR 1: C ≥ 6, F ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-F
SR 0: no fragments aromaar-F

ar-N: aryl – N
AR 1: C ≥ 6, N ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-N
SR 0: no fragments aromaar-N

ar-N ch: aryl – N (chain-bond)
AR 1: C ≥ 6, N ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-Nch
SR 0: no fragments aromaar-Nch

ar-N r: aryl – N (ring-bond)
AR 1: C ≥ 6, N ≥ 1, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-Nr
SR 0: no fragments aromaar-Nr

ar-O: aryl – O
AR 1: C ≥ 6, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-O
SR 0: no fragments aromaar-O

ar-O-CH2: aryl – O-CH2
AR 1: C ≥ 7, H ≥ 2, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-O-CH2
SR 0: no fragments aromaar-O-CH2

ar-O-CH3: aryl – O-CH3 (methoxy)
AR 1: C ≥ 7, H ≥ 3, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-O-CH3
SR 0: no fragments aromaar-O-CH3

ar-S r: aryl – S (S in ring)
AR 1: C ≥ 6, S ≥ 1, DBE ≥ 5
AR 0: –
SR 1: at least 1 fragment aromaar-Sr
SR 0: no fragments aromaar-Sr

biphenyl: biphenyl
AR 1: C ≥ 12, DBE ≥ 8
AR 0: –
SR 1: at least 1 fragment aromabiphenyl
SR 0: no fragments aromabiphenyl

C6H4-Br: C6H4 - Br (o,m,p substituted) ⁸
AR 1: C ≥ 6, H ≥ 4, Br ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromaC6H4-Bra,
  at least 1 fragment aromaC6H4-Brb,
  at least 1 fragment aromaC6H4-Brc,
  at least 1 fragment aromaC6H4-Brd,
  at least 1 fragment aromaC6H4-Bre
SR 0: exactly 5 terms of
  no fragments aromaC6H4-Bra,
  no fragments aromaC6H4-Brb,
  no fragments aromaC6H4-Brc,
  no fragments aromaC6H4-Brd,
  no fragments aromaC6H4-Bre

8 Alternatives for the position of a substituent are achieved by using several substructure entries in a structural restriction substructure, marked with a–e.
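The "1–5 terms of" and "exactly 5 terms of" constructions combine several substructure conditions logically. Read literally, SR 1 holds when between one and five of the listed terms are true (at least one alternative occurs), and SR 0 holds when all five "no fragments" terms are true (none occurs). The sketch below encodes this reading; it is an interpretation, not MOLGEN–MS syntax, and `fragment_counts` stands for a hypothetical mapping from fragment names to the number of embeddings found in a candidate structure.

```python
def terms_true(terms):
    """Number of terms in a list of Boolean conditions that hold."""
    return sum(bool(t) for t in terms)

def sr_present(fragment_counts, fragments):
    """SR 1: '1-k terms of at least 1 fragment ...' for k listed fragments."""
    terms = [fragment_counts.get(name, 0) >= 1 for name in fragments]
    return 1 <= terms_true(terms) <= len(fragments)

def sr_absent(fragment_counts, fragments):
    """SR 0: 'exactly k terms of no fragments ...' for k listed fragments."""
    terms = [fragment_counts.get(name, 0) == 0 for name in fragments]
    return terms_true(terms) == len(fragments)

# The five positional alternatives a-e of the SP C6H4-Br.
C6H4_BR = [f"aromaC6H4-Br{suffix}" for suffix in "abcde"]

with_bromoaryl = {"aromaC6H4-Brb": 1}  # one alternative embeds once
without_bromoaryl = {}                 # no alternative embeds

print(sr_present(with_bromoaryl, C6H4_BR), sr_absent(with_bromoaryl, C6H4_BR))
print(sr_present(without_bromoaryl, C6H4_BR), sr_absent(without_bromoaryl, C6H4_BR))
```

Under this reading SR 1 and SR 0 are complementary, which matches their use for class 1 (presence) and class 0 (absence) of an SP.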

C6H4-SO2: C6H4 - SO2 (o,m,p substituted) ‡, ⁹
AR 1: C ≥ 6, H ≥ 4, O ≥ 2, S ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromaC6H4-SO2a,
  at least 1 fragment aromaC6H4-SO2b,
  at least 1 fragment aromaC6H4-SO2c,
  at least 1 fragment aromaC6H4-SO2d,
  at least 1 fragment aromaC6H4-SO2e
SR 0: exactly 5 terms of
  no fragments aromaC6H4-SO2a,
  no fragments aromaC6H4-SO2b,
  no fragments aromaC6H4-SO2c,
  no fragments aromaC6H4-SO2d,
  no fragments aromaC6H4-SO2e

C6H4 omp: C6H4 di-substituted (o,m,p) benzene ring
AR 1: C ≥ 6, H ≥ 4, DBE ≥ 4
AR 0: –
SR 1: 1–4 terms of
  at least 1 fragment aromaC6H4ompa,
  at least 1 fragment aromaC6H4ompb,
  at least 1 fragment aromaC6H4ompc,
  at least 1 fragment aromaC6H4ompd
SR 0: exactly 4 terms of
  no fragments aromaC6H4ompa,
  no fragments aromaC6H4ompb,
  no fragments aromaC6H4ompc,
  no fragments aromaC6H4ompd

9 For this SP we did not find enough spectra in our database to obtain classifiers (Subsection 8.5.2).

ph-C: C6H5 - C
AR 1: C ≥ 7, H ≥ 5, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaph-C
SR 0: no fragments aromaph-C

ph-CH2-O: C6H5 - CH2 - O
AR 1: C ≥ 7, H ≥ 7, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaph-CH2-O
SR 0: no fragments aromaph-CH2-O

ph: C6H5- (phenyl)
AR 1: C ≥ 6, H ≥ 5, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaph
SR 0: no fragments aromaph

benz-O: CH2 - C6H4 - O - (o,m,p)
AR 1: C ≥ 7, H ≥ 6, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromabenz-Oa,
  at least 1 fragment aromabenz-Ob,
  at least 1 fragment aromabenz-Oc,
  at least 1 fragment aromabenz-Od,
  at least 1 fragment aromabenz-Oe
SR 0: exactly 5 terms of
  no fragments aromabenz-Oa,
  no fragments aromabenz-Ob,
  no fragments aromabenz-Oc,
  no fragments aromabenz-Od,
  no fragments aromabenz-Oe

ar cond: condensed rings †,‡
condensed aromatic rings
AR 1: C ≥ 7, DBE ≥ 6
AR 0: –
SR 1: –
SR 0: –

ar-COOCH2*: ester C6H4-COO-CH2- (and subst. at o,m,p)
AR 1: C ≥ 8, H ≥ 6, O ≥ 2, DBE ≥ 5
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromaar-COOCH2a,
  at least 1 fragment aromaar-COOCH2b,
  at least 1 fragment aromaar-COOCH2c,
  at least 1 fragment aromaar-COOCH2d,
  at least 1 fragment aromaar-COOCH2e
SR 0: exactly 5 terms of
  no fragments aromaar-COOCH2a,
  no fragments aromaar-COOCH2b,
  no fragments aromaar-COOCH2c,
  no fragments aromaar-COOCH2d,
  no fragments aromaar-COOCH2e

ar het: hetero-aromatic †,‡
AR 1: C ≥ 3, DBE ≥ 3
AR 0: –
SR 1: –
SR 0: –

ar poly: more than 1 aromatic ring (any type) †,‡
more than one aromatic ring
AR 1: C ≥ 7, DBE ≥ 6
AR 0: –
SR 1: –
SR 0: –

naph: naphthalene ring system
AR 1: C ≥ 10, DBE ≥ 7
AR 0: –
SR 1: 1–2 terms of
  at least 1 fragment aromanapha,
  at least 1 fragment aromanaphb
SR 0: exactly 2 terms of
  no fragments aromanapha,
  no fragments aromanaphb

non ar: non aromatic †,‡
no aromatic ring
AR 1: –
AR 0: C ≥ 3, DBE ≥ 3
SR 1: –
SR 0: –

phen-1-OH: phenol (1 OH), alkyl-subst. †, ¹⁰
AR 1: C ≥ 6, H ≥ 1, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-OH
SR 0: no fragments aromaar-OH

phen: phenol (1-3 OH), alkyl-subst. †, ¹¹
AR 1: C ≥ 6, H ≥ 1, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: at least 1 fragment aromaar-OH
SR 0: no fragments aromaar-OH

10 This definition of an SP is not precise. The optional substitution by an alkyl residue cannot be used.
11 This definition of an SP is not precise. The optional substitution by two additional OH groups and an alkyl residue cannot be incorporated.

phen-2-OH: phenol (2 OH), alkyl-subst. †
AR 1: C ≥ 6, H ≥ 2, O ≥ 2, DBE ≥ 4
AR 0: –
SR 1: 1–4 terms of
  at least 1 fragment aromaphen-2-OHa,
  at least 1 fragment aromaphen-2-OHb,
  at least 1 fragment aromaphen-2-OHc,
  at least 1 fragment aromaphen-2-OHd
SR 0: exactly 4 terms of
  no fragments aromaphen-2-OHa,
  no fragments aromaphen-2-OHb,
  no fragments aromaphen-2-OHc,
  no fragments aromaphen-2-OHd

phen1-Cl1: phenol - Cl (1 OH, 1 Cl), alkyl-subst. †,‡
AR 1: C ≥ 6, H ≥ 1, Cl ≥ 1, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromaphen1-Cl1a,
  at least 1 fragment aromaphen1-Cl1b,
  at least 1 fragment aromaphen1-Cl1c,
  at least 1 fragment aromaphen1-Cl1d,
  at least 1 fragment aromaphen1-Cl1e
SR 0: exactly 5 terms of
  no fragments aromaphen1-Cl1a,
  no fragments aromaphen1-Cl1b,
  no fragments aromaphen1-Cl1c,
  no fragments aromaphen1-Cl1d,
  no fragments aromaphen1-Cl1e

phen-Cl: phenol - Cl (1-3 OH, 1-3 Cl), alkyl-subst. †,‡
AR 1: C ≥ 6, H ≥ 1, Cl ≥ 1, O ≥ 1, DBE ≥ 4
AR 0: –
SR 1: 1–5 terms of
  at least 1 fragment aromaphen1-Cl1a,
  at least 1 fragment aromaphen1-Cl1b,
  at least 1 fragment aromaphen1-Cl1c,
  at least 1 fragment aromaphen1-Cl1d,
  at least 1 fragment aromaphen1-Cl1e
SR 0: exactly 5 terms of
  no fragments aromaphen1-Cl1a,
  no fragments aromaphen1-Cl1b,
  no fragments aromaphen1-Cl1c,
  no fragments aromaphen1-Cl1d,
  no fragments aromaphen1-Cl1e

CO-C6H3-O: tri-subst. benzene ring: -C=O or -C-O or -OH or -N
AR 1: C ≥ 7, H ≥ 3, O ≥ 2, DBE ≥ 4
AR 0: –
SR 1: 1–3 terms of
  at least 1 fragment aromaar-COsd,
  at least 1 fragment aromaar-OH,
  at least 1 fragment aromaar-Nsp
SR 0: exactly 3 terms of
  no fragments aromaar-COsd,
  no fragments aromaar-OH,
  no fragments aromaar-Nsp

B.3 Bonds

r>C=CC=C: isobutylidene
AR 1: C ≥ 4, H ≥ 6, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment bond(CH3)2C=C
SR 0: no fragments bond(CH3)2C=C

B.4 Elements

B: boron (any number) ‡
Bx
AR 1: B ≥ 1
AR 0: B = 0
SR 1: –
SR 0: –

Br: bromine (any number)
Brx
AR 1: Br ≥ 1
AR 0: Br = 0
SR 1: –
SR 0: –

Cl: chlorine (any number)
Clx
AR 1: Cl ≥ 1
AR 0: Cl = 0
SR 1: –
SR 0: –

N: nitrogen (any number)
Nx
AR 1: N ≥ 1
AR 0: N = 0
SR 1: –
SR 0: –

N 2: nitrogen: 2 atoms
N2
AR 1: N = 2
AR 0: N ≠ 2
SR 1: –
SR 0: –

P: phosphorus (any number)
Px
AR 1: P ≥ 1
AR 0: P = 0
SR 1: –
SR 0: –

Si: silicon (any number)
Six
AR 1: Si ≥ 1
AR 0: Si = 0
SR 1: –
SR 0: –

Si 1: silicon: 1 atom
Si1
AR 1: Si = 1
AR 0: Si ≠ 1
SR 1: –
SR 0: –

Si ≥ 2: silicon: ≥ 2 atoms
Si≥2
AR 1: Si ≥ 2
AR 0: Si ≤ 1
SR 1: –
SR 0: –

S: sulfur (any number)
Sx
AR 1: S ≥ 1
AR 0: S = 0
SR 1: –
SR 0: –

B.5 Functional groups

CH3-COO: acetoxy CH3-COO
AR 1: C ≥ 2, H ≥ 3, O ≥ 2, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcCH3-COO
SR 0: no fragments funcCH3-COO

CH3-CO: acetyl CH3-CO
AR 1: C ≥ 2, H ≥ 3, O ≥ 1, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcCH3-CO
SR 0: no fragments funcCH3-CO

alc tert: alcohol tertiary (no ester)
AR 1: C ≥ 4, H ≥ 1, O ≥ 1
AR 0: –
SR 1: at least 1 fragment funcalctert
SR 0: no fragments funcalctert

am tert: amine tertiary (no amide)
AR 1: C ≥ 3, N ≥ 1
AR 0: –
SR 1: at least 1 fragment funcamtert
SR 0: no fragments funcamtert

n-C4H9-O: butyl-oxy n-C4H9 - O
AR 1: C ≥ 4, H ≥ 9, O ≥ 1
AR 0: –
SR 1: at least 1 fragment funcn-C4H9-O
SR 0: no fragments funcn-C4H9-O

C2H5-CO: C2H5 - CO
AR 1: C ≥ 3, H ≥ 5, O ≥ 1, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcC2H5-CO
SR 0: no fragments funcC2H5-CO

CF3: CF3 trifluoromethyl
AR 1: C ≥ 1, F ≥ 3
AR 0: –
SR 1: at least 1 fragment funcCF3
SR 0: no fragments funcCF3

CF3-CO: CF3 - CO
AR 1: C ≥ 2, F ≥ 3, O ≥ 1, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcCF3-CO
SR 0: no fragments funcCF3-CO

NH-CH2-CH2: CH2 - CH2 - NH
AR 1: C ≥ 2, H ≥ 5, N ≥ 1
AR 0: –
SR 1: at least 1 fragment funcNH-CH2-CH2
SR 0: no fragments funcNH-CH2-CH2

CH3-O-CH2: CH3 - O - CH2
AR 1: C ≥ 2, H ≥ 5, O ≥ 1
AR 0: –
SR 1: at least 1 fragment funcCH3-O-CH2
SR 0: no fragments funcCH3-O-CH2

N(CH3)2: dimethyl-amine -N(CH3)2
AR 1: C ≥ 2, H ≥ 6, N ≥ 1
AR 0: –
SR 1: at least 1 fragment funcN(CH3)2
SR 0: no fragments funcN(CH3)2

CH3-COOCH: ester of acetic acid CH3COO-CH2
AR 1: C ≥ 3, H ≥ 4, O ≥ 2, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcCH3-COOCH
SR 0: no fragments funcCH3-COOCH

et-est: ester: ethyl
AR 1: C ≥ 3, H ≥ 5, O ≥ 2, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcet-est
SR 0: no fragments funcet-est

me-est: ester: methyl
AR 1: C ≥ 2, H ≥ 3, O ≥ 2, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment funcme-est
SR 0: no fragments funcme-est

C2H5-O: ethoxy
AR 1: C ≥ 2, H ≥ 5, O ≥ 1
AR 0: –
SR 1: at least 1 fragment funcC2H5-O
SR 0: no fragments funcC2H5-O

(CH2)6-CO: ketone (CH2)6 - CO
AR 1: C ≥ 7, H ≥ 12, O ≥ 1, DBE ≥ 1
AR 0: –
SR 1: at least 1 fragment func(CH2)6-CO
SR 0: no fragments func(CH2)6-CO

NO: nitrogen-oxygen bond
AR 1: N ≥ 1, O ≥ 1
AR 0: –
SR 1: at least 1 fragment funcNO
SR 0: no fragments funcNO

S-CH2: S - CH2
AR 1: C ≥ 1, H ≥ 2, S ≥ 1
AR 0: –
SR 1: at least 1 fragment funcS-CH2
SR 0: no fragments funcS-CH2

(CH3)3 Si: trimethylsilyl
AR 1: C ≥ 3, H ≥ 9, Si ≥ 1
AR 0: –
SR 1: at least 1 fragment func(CH3)3Si
SR 0: no fragments func(CH3)3Si

B.6 Rings

r 5+6: 5-ring and 6-ring condensed ¹²
AR 1: DBE ≥ 2
AR 0: –
SR 1: at least 1 fragment ring r 5+6
SR 0: no fragments ring r 5+6

12 The condensed rings are represented as MMGs using an arbitrary atom type and alternatives for the bonds.

C Molecular formulas by mass and ion type

Tables C.1–C.4 show, in the second column, the total number of molecular formulas for integer masses without further restrictions. The column ‘Ions’ contains the number of molecular formulas that fulfill conditions (Gr2) and (Con) in Theorem 1.23. The following column (OEI, odd-electron ions) contains the number of molecular formulas that additionally satisfy criterion (Gr1). These formulas can occur as molecular ions in a mass spectrum. The last column (EEI, even-electron ions) gives the number of formulas that fulfill (Gr2) and (Con), but not (Gr1).
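The ‘Total number’ column can be reproduced by counting the non-negative integer solutions of a linear equation in the nominal element masses. The sketch below does this with a standard dynamic-programming count; it assumes that E4 consists of C, H, N and O with nominal masses 12, 1, 14 and 16 (an assumption consistent with the formulas listed in Appendix D), and the function name is hypothetical. The conditions (Gr1), (Gr2) and (Con) behind the other columns are not implemented here.

```python
# Nominal (integer) masses; E4 = {C, H, N, O} is assumed here.
NOMINAL_MASS_E4 = {"C": 12, "H": 1, "N": 14, "O": 16}

def count_formulas(mass, element_masses):
    """Number of molecular formulas (multisets of atoms) with integer mass `mass`.

    Classic 'coin change' recurrence: elements are processed one at a time,
    so each formula is counted once regardless of atom order.
    """
    ways = [0] * (mass + 1)
    ways[0] = 1  # the empty formula has mass 0
    for m in element_masses.values():
        for total in range(m, mass + 1):
            ways[total] += ways[total - m]
    return ways[mass]

for mass in (12, 16, 28, 100):
    print(mass, count_formulas(mass, NOMINAL_MASS_E4))
```

For mass 16, for example, the four formulas counted are O, NH2, CH4 and H16, which shows that this column really is free of any chemical restriction.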

Table C.1. Number of molecular formulas for nominal masses 1–100, elements in E4.

Mass  Total number  Ions  OEI  EEI
   1             1     0    0    0
   2             1     1    1    0
   3             1     0    0    0
   4             1     0    0    0
   5             1     0    0    0
   6             1     0    0    0
   7             1     0    0    0
   8             1     0    0    0
   9             1     0    0    0
  10             1     0    0    0
  11             1     0    0    0
  12             2     0    0    0
  13             2     0    0    0
  14             3     0    0    0
  15             3     0    0    0
  16             4     1    1    0
  17             4     1    1    0
  18             4     1    1    0
  19             4     0    0    0
  20             4     0    0    0
  21             4     0    0    0
  22             4     0    0    0
  23             4     0    0    0
  24             5     1    1    0
  25             5     1    0    1
  26             6     1    1    0
  27             6     2    1    1
  28             8     3    2    1
  29             8     3    1    2
  30             9     4    3    1
  31             9     4    2    2
  32            10     4    3    1
  33            10     2    1    1
  34            10     1    1    0
  35            10     0    0    0
  36            11     1    1    0
  37            11     1    0    1
  38            12     2    1    1
  39            12     2    1    1
  40            14     4    3    1
  41            14     4    1    3
  42            16     6    3    3
  43            16     6    3    3
  44            18     8    5    3
  45            18     7    3    4
  46            19     7    4    3
  47            19     5    3    2
  48            21     5    4    1
  49            21     3    1    2
  50            22     3    2    1
  51            22     2    1    1
  52            24     4    3    1
  53            24     4    1    3
  54            26     6    3    3
  55            26     6    3    3
  56            29     9    6    3
  57            29     9    3    6
  58            31    11    6    5
  59            31    10    5    5
  60            34    12    8    4
  61            34    10    4    6
  62            36    10    6    4
  63            36     7    4    3
  64            39     8    6    2
  65            39     6    2    4
  66            41     7    4    3
  67            41     6    3    3
  68            44     9    6    3
  69            44     9    3    6
  70            47    12    6    6
  71            47    12    6    6
  72            51    16   10    6
  73            51    15    6    9
  74            54    17    9    8
  75            54    15    8    7
  76            58    17   11    6
  77            58    14    6    8
  78            61    14    8    6
  79            61    11    6    5
  80            65    13    9    4
  81            65    11    4    7
  82            68    13    7    6
  83            68    12    6    6
  84            73    17   11    6
  85            73    17    6   11
  86            77    21   11   10
  87            77    20   10   10
  88            82    24   15    9
  89            82    22    9   13
  90            86    24   13   11
  91            86    21   11   10
  92            91    23   15    8
  93            91    19    8   11
  94            95    20   11    9
  95            95    17    9    8
  96           101    21   14    7
  97           101    19    7   12
  98           106    23   12   11
  99           106    22   11   11
 100           112    28   17   11

Table C.2. Number of molecular formulas for nominal masses 1–100, elements in E11.

Mass  Total number  Ions  OEI  EEI
   1             1     0    0    0
   2             1     1    1    0
   3             1     0    0    0
   4             1     0    0    0
   5             1     0    0    0
   6             1     0    0    0
   7             1     0    0    0
   8             1     0    0    0
   9             1     0    0    0
  10             1     0    0    0
  11             1     0    0    0
  12             2     0    0    0
  13             2     0    0    0
  14             3     0    0    0
  15             3     0    0    0
  16             4     1    1    0
  17             4     1    1    0
  18             4     1    1    0
  19             5     0    0    0
  20             5     1    1    0
  21             5     0    0    0
  22             5     0    0    0
  23             5     0    0    0
  24             6     1    1    0
  25             6     1    0    1
  26             7     1    1    0
  27             7     2    1    1
  28            10     3    2    1
  29            10     3    1    2
  30            11     4    3    1
  31            13     4    2    2
  32            15     5    4    1
  33            16     2    1    1
  34            16     4    4    0
  35            18     1    1    0
  36            19     3    3    0
  37            19     1    0    1
  38            21     3    2    1
  39            21     2    1    1
  40            24     5    4    1
  41            24     5    1    4
  42            27     7    4    3
  43            29     9    4    5
  44            33    12    8    4
  45            35    13    6    7
  46            37    15    9    6
  47            42    14    7    7
  48            45    16   12    4
  49            47    11    5    6
  50            50    10    8    2
  51            53     5    3    2
  52            57    10    9    1
  53            57     6    2    4
  54            62    10    6    4
  55            64    10    4    6
  56            71    17   13    4
  57            74    19    6   13
  58            78    23   13   10
  59            85    28   10   18
  60            92    34   25    9
  61            97    36   14   22
  62           103    40   26   14
  63           112    35   14   21
  64           120    41   31   10
  65           123    31   13   18
  66           132    31   22    9
  67           138    21    9   12
  68           147    28   23    5
  69           151    21    8   13
  70           162    32   19   13
  71           170    34   14   20
  72           180    43   30   13
  73           189    49   20   29
  74           199    58   32   26
  75           213    69   26   43
  76           227    79   53   26
  77           236    79   35   44
  78           251    85   53   32
  79           266    80   32   48
  80           282    87   62   25
  81           291    71   31   40
  82           309    72   49   23
  83           323    62   25   37
  84           341    71   53   18
  85           354    63   26   37
  86           372    78   48   30
  87           391    85   32   53
  88           412   103   71   32
  89           430   112   46   66
  90           452   130   77   53
  91           477   145   53   92
  92           504   161  108   53
  93           523   159   67   92
  94           553   174  113   61
  95           581   166   65  101
  96           610   172  121   51
  97           632   151   63   88
  98           667   161  108   53
  99           696   145   59   86
 100           727   158  110   48

Table C.3. Number of molecular formulas for nominal masses > 100, elements in E4.

Mass  Total number    Ions    OEI    EEI
 150           313      73     38     35
 200           677     151     87     64
 250          1244     270    138    132
 300          2068     448    248    200
 350          3188     676    344    332
 400          4657     985    533    452
 450          6515    1371    695    676
 500          8815    1843    983    860
 600        14,916    3102   1639   1463
 700        23,332    4824   2530   2294
 800        34,433    7089   3697   3392
 900        48,591    9977   5180   4797
1000        66,180  13,552   7011   6541

Table C.4. Number of molecular formulas for nominal masses > 100, elements in E11.

Mass   Total number     Ions           OEI            EEI
150    5299             1259           764            495
200    26,263           6383           3797           2586
250    101,339          24,162         14,140         10,022
300    327,411          76,144         43,861         32,283
350    925,843          211,769        120,387        91,382
400    2,357,940        533,418        299,361        234,057
450    5,518,977        1,238,647      688,903        549,744
500    12,045,750       2,685,131      1,481,067      1,204,064
600    48,507,196       10,684,233     5,821,545      4,862,688
700    163,873,929      35,750,618     19,293,812     16,456,806
800    483,527,540      104,698,838    56,072,098     48,626,740
900    1,280,954,355    275,682,153    146,725,060    128,957,093
1000   3,107,850,498    665,461,540    352,344,362    313,117,178
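In every row of Tables C.3 and C.4 as printed, the Ions entry equals the sum of the OEI and EEI entries. The following minimal sketch checks this for a few rows copied from the two tables and reports the share of OEI among all ions. The names `TABLE_C3_ROWS`, `TABLE_C4_ROWS`, `inconsistent_masses` and `oei_share` are illustrative and do not come from the book.

```python
# Rows copied from Table C.3 (elements in E4) and Table C.4 (elements in E11).
# Each entry maps a nominal mass to (total number, ions, OEI, EEI).
TABLE_C3_ROWS = {
    150: (313, 73, 38, 35),
    500: (8815, 1843, 983, 860),
    1000: (66180, 13552, 7011, 6541),
}
TABLE_C4_ROWS = {
    150: (5299, 1259, 764, 495),
    500: (12045750, 2685131, 1481067, 1204064),
    1000: (3107850498, 665461540, 352344362, 313117178),
}


def inconsistent_masses(rows):
    """Return the masses whose Ions entry differs from OEI + EEI."""
    return [
        mass
        for mass, (_total, ions, oei, eei) in rows.items()
        if ions != oei + eei
    ]


def oei_share(rows):
    """Return, for each mass, the fraction of ions that are counted as OEI."""
    return {mass: oei / ions for mass, (_total, ions, oei, _eei) in rows.items()}


if __name__ == "__main__":
    print(inconsistent_masses(TABLE_C3_ROWS))
    print(inconsistent_masses(TABLE_C4_ROWS))
    print(oei_share(TABLE_C3_ROWS))
```

The same two functions can be applied to the full tables by adding the remaining rows to the dictionaries.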

D Isomers by mass and molecular formula

The following table lists the numbers |M̄ 𝑐𝛽 | of constitutional isomers for a given mass 𝑚 and molecular formula 𝛽 with elements in E4 and containing at least one C atom. Column 𝐵𝑆 gives the number of such isomers contained in the Beilstein database (Section 2.5), while column 𝑀𝑆 refers to the MS structure database introduced in Subsection 8.3.4. The database counts are snapshots and may have changed in the meantime. Nevertheless, they are of interest because they show the enormous difference between the mathematically possible numbers of structures and the numbers of compounds that have been shown to exist, including those with corresponding mass spectra. For more detail see Section 2.5 and [152].

𝑚     𝛽          |M̄ 𝑐𝛽 |    𝐵𝑆     𝑀𝑆
16    CH4        1          1      1
24    C2         0          0      0
26    C2H2       1          1      1
27    CHN        1          1      1
28    C2H4       1          1      1
29    CH3N       1          1      0
30    CH2O       1          1      1
30    C2H6       1          1      1
31    CH5N       1          1      1
32    CH4O       1          1      1
36    C3         1          0      0
38    C3H2       2          1      0
39    C2HN       2          0      0
40    CN2        1          0      0
40    C2O        1          0      0
40    C3H4       3          3      3
41    C2H3N      5          5      1
42    CH2N2      4          4      0
42    C2H2O      3          3      1
42    C3H6       2          2      2
43    CHNO       3          3      0
43    C2H5N      4          4      1
44    CO2        1          1      1
44    CH4N2      4          4      0
44    C2H4O      3          3      2
44    C3H8       1          1      1
45    CH3NO      5          5      2
45    C2H7N      2          2      2
46    CH2O2      2          2      1
46    CH6N2      2          2      1
46    C2H6O      2          2      2
47    CH5NO      3          3      1
48    CH4O2      2          2      0
48    C4         3          0      0
50    C4H2       7          1      1
51    C3HN       7          1      1
52    C2N2       5          1      1
52    C3O        2          0      0
52    C4H4       11         7      1
53    C3H3N      19         6      1
54    C2H2N2     19         4      0
54    C3H2O      9          3      0
54    C4H6       9          9      7
55    CHN3       6          1      0
55    C2HNO      11         1      0
55    C3H5N      21         13     2
56    CN2O       4          1      0
56    C2O2       3          1      0
56    C2H4N2     27         9      0
56    C3H4O      13         13     2
56    C4H8       5          5      4
57    CH3N3      13         0      0
57    C2H3NO     26         6      2
57    C3H7N      12         12     6
58    CH2N2O     18         4      0
58    C2H2O2     9          3      1
58    C2H6N2     18         10     1
58    C3H6O      9          9      6
58    C4H10      2          2      2
59    CHNO2      8          2      0
59    CH5N3      11         4      1
59    C2H5NO     22         10     3
59    C3H9N      4          4      4
60    CO3        1          1      0

𝑚

61 62

63 64

65 66 67 68

69 70

71

72

73

74

𝛽 CH4 N2 O C2 H4 O2 C2 H8 N2 C3 H8 O C5 CH3 NO2 CH7 N3 C2 H7 NO CH2 O3 CH6 N2 O C2 H6 O2 C5 H2 CH5 NO2 C4 HN CH4 O3 C3 N2 C4 O C5 H4 C4 H3 N C3 H2 N2 C4 H2 O C5 H6 C2 HN3 C3 HNO C4 H5 N CN4 C2 N2 O C3 O2 C3 H4 N2 C4 H4 O C5 H8 C2 H3 N3 C3 H3 NO C4 H7 N CH2 N4 C2 H2 N2 O C3 H2 O2 C3 H6 N2 C4 H6 O C5 H10 CHN3 O C2 HNO2 C2 H5 N3 C3 H5 NO C4 H9 N CN2 O2 CH4 N4 C2 O3 C2 H4 N2 O C3 H4 O2 C3 H8 N2 C4 H8 O C5 H12 C6 CH3 N3 O C2 H3 NO2 C2 H7 N3 C3 H7 NO C4 H11 N CH2 N2 O2 CH6 N4

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

21 10 6 3 6 15 4 8 4 8 5 21 8 27 3 14 7 40 87 86 36 40 34 46 116 6 20 7 155 62 26 99 136 85 31 114 34 136 55 10 34 40 110 154 35 12 47 5 177 52 62 26 3 19 86 99 58 84 8 65 29

7 9 5 3 0 5 1 7 2 2 5 0 1 0 3 0 0 8 7 8 2 20 3 1 12 0 2 1 19 19 25 10 13 30 4 8 5 23 34 10 2 4 4 24 32 0 1 1 7 15 22 26 3 0 2 10 7 34 8 1 2

2 3 4 3 0 1 0 3 0 0 2 0 0 0 0 0 0 0 0 1 0 4 0 0 5 0 0 1 5 2 16 2 4 6 1 2 1 2 15 10 0 0 0 6 10 0 0 0 0 4 3 17 3 0 0 0 1 8 7 0 1

𝑚

75

76

77

78

79

80

81 82

83

84

𝛽

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

C2 H2 O3 C2 H6 N2 O C3 H6 O2 C3 H10 N2 C4 H10 O C6 H2 CHNO3 CH5 N3 O C2 H5 NO2 C2 H9 N3 C3 H9 NO C5 HN CO4 CH4 N2 O2 CH8 N4 C2 H4 O3 C2 H8 N2 O C3 H8 O2 C4 N2 C5 O C6 H4 CH3 NO3 CH7 N3 O C2 H7 NO2 C5 H3 N CH2 O4 CH6 N2 O2 C2 H6 O3 C4 H2 N2 C5 H2 O C6 H6 CH5 NO3 C3 HN3 C4 HNO C5 H5 N CH4 O4 C2 N4 C3 N2 O C4 O2 C4 H4 N2 C5 H4 O C6 H8 C3 H3 N3 C4 H3 NO C5 H7 N C2 H2 N4 C3 H2 N2 O C4 H2 O2 C4 H6 N2 C5 H6 O C6 H10 CHN5 C2 HN3 O C3 HNO2 C3 H5 N3 C4 H5 NO C5 H9 N CN4 O C2 N2 O2 C2 H4 N4 C3 O3

20 115 34 14 7 85 18 71 84 14 21 112 2 75 8 22 31 11 64 21 185 34 21 28 437 6 28 10 465 151 217 17 194 216 685 5 42 88 28 1005 318 159 706 775 593 272 703 163 1058 337 77 42 256 202 969 1069 313 32 76 512 16

2 20 21 11 7 1 1 4 18 2 18 1 0 5 1 9 9 11 1 0 6 1 0 7 3 0 1 8 2 2 29 2 1 1 17 1 1 1 0 12 14 71 6 12 33 3 5 4 62 63 69 1 1 1 24 47 73 0 1 23 0

1 5 7 6 7 0 0 1 5 0 6 0 0 1 0 1 1 5 1 0 0 0 0 1 0 0 0 0 2 0 5 0 0 0 2 0 0 0 0 5 0 18 2 0 9 1 0 0 9 9 33 0 0 0 5 2 12 0 0 6 0

𝑚

85

86

87

88

89

90

𝛽

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

C3 H4 N2 O C4 H4 O2 C4 H8 N2 C5 H8 O C6 H12 C7 CH3 N5 C2 H3 N3 O C3 H3 NO2 C3 H7 N3 C4 H7 NO C5 H11 N CH2 N4 O C2 H2 N2 O2 C2 H6 N4 C3 H2 O3 C3 H6 N2 O C4 H6 O2 C4 H10 N2 C5 H10 O C6 H14 C7 H2 CHN3 O2 CH5 N5 C2 HNO3 C2 H5 N3 O C3 H5 NO2 C3 H9 N3 C4 H9 NO C5 H13 N C6 HN CN2 O3 CH4 N4 O C2 O4 C2 H4 N2 O2 C2 H8 N4 C3 H4 O3 C3 H8 N2 O C4 H8 O2 C4 H12 N2 C5 N2 C5 H12 O C6 O C7 H4 CH3 N3 O2 CH7 N5 C2 H3 NO3 C2 H7 N3 O C3 H7 NO2 C3 H11 N3 C4 H11 NO C6 H3 N CH2 N2 O3 CH6 N4 O C2 H2 O4 C2 H6 N2 O2 C2 H10 N4 C3 H6 O3 C3 H10 N2 O C4 H10 O2 C5 H2 N2

1371 301 633 205 25 50 131 826 641 681 764 100 227 506 439 98 1194 263 218 74 5 356 137 145 110 935 732 259 299 17 540 29 361 10 807 189 152 527 122 38 271 14 85 920 369 73 288 481 391 45 56 2447 173 225 41 521 37 102 102 28 2652

29 29 59 110 25 0 7 19 16 13 65 69 5 10 4 4 32 61 61 74 5 0 1 0 0 5 42 16 85 17 0 0 2 2 15 6 15 40 59 28 0 14 0 3 1 1 6 13 45 5 42 1 0 1 2 21 2 20 17 26 1

4 5 10 31 22 0 2 0 1 0 14 15 0 0 1 1 3 15 8 44 5 0 0 0 0 0 1 1 15 16 0 0 0 0 3 0 4 4 23 17 0 14 0 0 0 0 0 0 9 0 11 0 0 1 1 2 0 8 3 12 0

𝑚

91

92

93

94

95

96

97

𝛽 C6 H2 O C7 H6 CHNO4 CH5 N3 O2 CH9 N5 C2 H5 NO3 C2 H9 N3 O C3 H9 NO2 C4 HN3 C5 HNO C6 H5 N CO5 CH4 N2 O3 CH8 N4 O C2 H4 O4 C2 H8 N2 O2 C3 N4 C3 H8 O3 C4 N2 O C5 O2 C5 H4 N2 C6 H4 O C7 H8 CH3 NO4 CH7 N3 O2 C2 H7 NO3 C4 H3 N3 C5 H3 NO C6 H7 N CH2 O5 CH6 N2 O3 C2 H6 O4 C3 H2 N4 C4 H2 N2 O C5 H2 O2 C5 H6 N2 C6 H6 O C7 H10 CH5 NO4 C2 HN5 C3 HN3 O C4 HNO2 C4 H5 N3 C5 H5 NO C6 H9 N CN6 CH4 O5 C2 N4 O C3 N2 O2 C3 H4 N4 C4 O3 C4 H4 N2 O C5 H4 O2 C5 H8 N2 C6 H8 O C7 H12 C8 C2 H3 N5 C3 H3 N3 O C4 H3 NO2 C4 H7 N3

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

738 1230 34 306 15 246 101 90 1224 1111 4394 2 207 52 48 132 235 28 475 98 6763 1823 1031 68 86 76 5245 4738 4378 9 73 20 2165 4628 812 8341 2237 575 33 425 1861 1127 8528 7687 2732 35 6 280 412 5016 72 10,770 1821 5984 1623 222 204 1630 7341 4332 7301

0 17 0 2 0 9 0 20 2 0 22 0 2 1 5 1 1 15 1 1 14 15 81 0 0 1 15 8 69 1 0 5 9 11 1 39 65 183 0 2 5 0 31 43 72 1 1 0 0 10 0 40 22 106 217 153 0 1 10 18 65

0 1 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 1 0 0 4 1 13 0 0 0 4 1 7 0 0 0 0 0 0 15 3 27 0 0 0 0 8 5 9 0 0 0 0 0 0 8 6 17 22 59 0 0 1 1 5

𝑚

98

99

100

101

102

𝛽 C5 H7 NO C6 H11 N CH2 N6 C2 H2 N4 O C3 H2 N2 O2 C3 H6 N4 C4 H2 O3 C4 H6 N2 O C5 H6 O2 C5 H10 N2 C6 H10 O C7 H14 C8 H2 CHN5 O C2 HN3 O2 C2 H5 N5 C3 HNO3 C3 H5 N3 O C4 H5 NO2 C4 H9 N3 C5 H9 NO C6 H13 N C7 HN CN4 O2 CH4 N6 C2 N2 O3 C2 H4 N4 O C3 O4 C3 H4 N2 O2 C3 H8 N4 C4 H4 O3 C4 H8 N2 O C5 H8 O2 C5 H12 N2 C6 N2 C6 H12 O C7 O C7 H16 C8 H4 CH3 N5 O C2 H3 N3 O2 C2 H7 N5 C3 H3 NO3 C3 H7 N3 O C4 H7 NO2 C4 H11 N3 C5 H11 NO C6 H15 N C7 H3 N CH2 N4 O2 CH6 N6 C2 H2 N2 O3 C2 H6 N4 O C3 H2 O4 C3 H6 N2 O2 C3 H10 N4 C4 H6 O3 C4 H10 N2 O C5 H10 O2 C5 H14 N2 C6 H2 N2

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

6637 1111 270 2489 3734 5328 551 11,514 1938 2668 747 56 1804 352 1258 2274 677 10,363 6102 3654 3390 284 2879 139 528 215 5039 36 7547 3080 1073 6754 1168 716 1448 211 356 9 5308 1206 4315 1567 2279 7227 4331 1055 1015 39 15,052 1097 447 1661 4358 246 6618 978 949 2201 400 97 16,977

141 154 1 1 4 38 5 93 112 128 349 55 1 0 1 16 1 58 59 39 184 143 1 0 3 0 18 0 29 8 29 103 206 116 1 188 0 9 0 1 9 2 10 16 127 26 182 39 0 0 0 3 7 3 52 11 68 90 181 58 2

12 16 0 0 0 7 1 13 21 14 83 38 0 0 0 3 0 1 6 0 27 24 0 0 0 0 1 0 2 0 5 4 42 12 0 79 0 9 0 0 0 0 0 1 10 0 33 18 0 0 0 0 1 1 5 0 7 10 41 15 0

𝑚

103

104

105

106

107

𝛽 C6 H14 O C7 H2 O C8 H6 CHN3 O3 CH5 N5 O C2 HNO4 C2 H5 N3 O2 C2 H9 N5 C3 H5 NO3 C3 H9 N3 O C4 H9 NO2 C4 H13 N3 C5 HN3 C5 H13 NO C6 HNO C7 H5 N CN2 O4 CH4 N4 O2 CH8 N6 C2 O5 C2 H4 N2 O3 C2 H8 N4 O C3 H4 O4 C3 H8 N2 O2 C3 H12 N4 C4 N4 C4 H8 O3 C4 H12 N2 O C5 N2 O C5 H12 O2 C6 O2 C6 H4 N2 C7 H4 O C8 H8 CH3 N3 O3 CH7 N5 O C2 H3 NO4 C2 H7 N3 O2 C2 H11 N5 C3 H7 NO3 C3 H11 N3 O C4 H11 NO2 C5 H3 N3 C6 H3 NO C7 H7 N CH2 N2 O4 CH6 N4 O2 CH10 N6 C2 H2 O5 C2 H6 N2 O3 C2 H10 N4 O C3 H6 O4 C3 H10 N2 O2 C4 H2 N4 C4 H10 O3 C5 H2 N2 O C6 H2 O2 C6 H6 N2 C7 H6 O C8 H10 CHNO5

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

32 3971 7982 421 1370 260 4978 560 2644 2630 1640 146 8172 149 6340 30,478 61 1819 183 14 2763 1818 401 2852 143 1616 425 333 2693 69 459 49,516 11,332 7437 1198 674 720 2529 88 1391 419 284 40,910 31,325 34,152 404 1131 32 74 1778 315 263 525 18,307 88 32,187 4636 69,352 15,804 4679 58

32 0 20 0 0 0 10 3 19 28 102 13 1 93 0 11 0 1 1 0 8 1 12 58 1 1 74 31 0 58 0 19 6 102 0 0 3 4 0 30 0 36 9 0 42 0 0 0 2 5 0 14 6 3 31 0 2 67 34 249 0

32 0 1 0 0 0 3 0 0 0 21 1 0 13 0 3 0 0 0 0 0 0 1 5 0 0 17 1 0 25 0 3 0 4 0 0 0 0 0 4 0 4 1 0 4 0 0 0 0 0 0 1 0 0 5 0 0 11 6 20 0

𝑚

108

109

110

111

𝛽 CH5 N3 O3 CH9 N5 O C2 H5 NO4 C2 H9 N3 O2 C3 HN5 C3 H9 NO3 C4 HN3 O C5 HNO2 C5 H5 N3 C6 H5 NO C7 H9 N CO6 CH4 N2 O4 CH8 N4 O2 C2 N6 C2 H4 O5 C2 H8 N2 O3 C3 N4 O C3 H8 O4 C4 N2 O2 C4 H4 N4 C5 O3 C5 H4 N2 O C6 H4 O2 C6 H8 N2 C7 H8 O C8 H12 C9 CH3 NO5 CH7 N3 O3 C2 H7 NO4 C3 H3 N5 C4 H3 N3 O C5 H3 NO2 C5 H7 N3 C6 H7 NO C7 H11 N CH2 O6 CH6 N2 O4 C2 H2 N6 C2 H6 O5 C3 H2 N4 O C4 H2 N2 O2 C4 H6 N4 C5 H2 O3 C5 H6 N2 O C6 H6 O2 C6 H10 N2 C7 H10 O C8 H14 C9 H2 CHN7 CH5 NO5 C2 HN5 O C3 HN3 O2 C3 H5 N5 C4 HNO3 C4 H5 N3 O C5 H5 NO2 C5 H9 N3 C6 H9 NO

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

1003 128 620 508 3969 302 14,015 6776 76,376 58,218 24,314 3 492 254 389 88 428 2218 71 2628 49,423 327 87,055 12,098 57,411 13,177 2082 832 121 275 186 18,307 65,056 30,807 76,138 58,265 11,673 12 174 3761 35 24,928 28,722 61,793 3292 109,134 15,066 30,600 7166 654 10,064 343 58 4418 10,828 30,527 4429 108,769 51,235 46,125 35,759

0 0 2 0 0 4 0 0 38 32 134 0 0 0 0 0 0 0 10 1 37 0 33 16 137 194 426 0 0 0 0 14 15 8 78 151 184 0 0 2 1 4 3 49 1 145 104 218 590 303 0 0 0 0 1 11 0 94 73 100 356

0 0 0 0 0 0 0 0 2 4 18 0 0 0 0 0 0 0 0 0 4 0 0 1 30 15 76 0 0 0 0 0 0 0 8 21 12 0 0 0 0 0 0 5 0 27 9 17 50 84 0 0 0 0 0 0 0 6 8 5 14

𝑚

112

113

114

115

𝛽 C7 H13 N C8 HN CN6 O CH4 O6 C2 N4 O2 C2 H4 N6 C3 N2 O3 C3 H4 N4 O C4 O4 C4 H4 N2 O2 C4 H8 N4 C5 H4 O3 C5 H8 N2 O C6 H8 O2 C6 H12 N2 C7 N2 C7 H12 O C8 O C8 H16 C9 H4 CH3 N7 C2 H3 N5 O C3 H3 N3 O2 C3 H7 N5 C4 H3 NO3 C4 H7 N3 O C5 H7 NO2 C5 H11 N3 C6 H11 NO C7 H15 N C8 H3 N CH2 N6 O C2 H2 N4 O2 C2 H6 N6 C3 H2 N2 O3 C3 H6 N4 O C4 H2 O4 C4 H6 N2 O2 C4 H10 N4 C5 H6 O3 C5 H10 N2 O C6 H10 O2 C6 H14 N2 C7 H2 N2 C7 H14 O C8 H2 O C8 H18 C9 H6 CHN5 O2 CH5 N7 C2 HN3 O3 C2 H5 N5 O C3 HNO4 C3 H5 N3 O2 C3 H9 N5 C4 H5 NO3 C4 H9 N3 O C5 H9 NO2 C5 H13 N3 C6 HN3 C6 H13 NO

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

3809 17,198 280 9 1451 8982 1463 60,869 194 69,482 43,697 7744 77,843 10,893 10,706 8260 2589 1804 139 33,860 1372 18,494 45,304 26,040 18,082 93,323 44,336 17,608 13,982 801 102,012 2711 14,394 9588 14,739 65,627 1635 75,211 18,611 8397 33,689 4869 2338 117,942 596 24,021 18 56,437 1868 1941 4628 26,619 1909 65,434 12,629 26,063 45,798 22,259 4054 59,406 3345

292 0 0 0 0 3 1 34 1 74 68 36 204 389 254 0 782 0 115 2 0 5 27 31 16 118 178 42 398 236 1 0 6 4 3 42 8 101 21 129 200 597 197 0 397 0 18 1 0 0 0 4 0 34 5 42 40 308 44 0 356

30 0 0 0 0 0 0 1 0 7 5 7 14 43 26 0 96 0 79 0 0 0 1 2 0 2 13 0 26 31 0 0 0 0 1 2 2 9 0 10 6 91 23 0 88 0 18 0 0 0 0 0 0 1 0 2 2 12 2 0 44

𝑚

116

117

118

119

𝛽 C7 HNO C7 H17 N C8 H5 N CN4 O3 CH4 N6 O C2 N2 O4 C2 H4 N4 O2 C2 H8 N6 C3 O5 C3 H4 N2 O3 C3 H8 N4 O C4 H4 O4 C4 H8 N2 O2 C4 H12 N4 C5 N4 C5 H8 O3 C5 H12 N2 O C6 N2 O C6 H12 O2 C6 H16 N2 C7 O2 C7 H4 N2 C7 H16 O C8 H4 O C9 H8 CH3 N5 O2 CH7 N7 C2 H3 N3 O3 C2 H7 N5 O C3 H3 NO4 C3 H7 N3 O2 C3 H11 N5 C4 H7 NO3 C4 H11 N3 O C5 H11 NO2 C5 H15 N3 C6 H3 N3 C6 H15 NO C7 H3 NO C8 H7 N CH2 N4 O3 CH6 N6 O C2 H2 N2 O4 C2 H6 N4 O2 C2 H10 N6 C3 H2 O5 C3 H6 N2 O3 C3 H10 N4 O C4 H6 O4 C4 H10 N2 O2 C4 H14 N4 C5 H2 N4 C5 H10 O3 C5 H14 N2 O C6 H2 N2 O C6 H14 O2 C7 H2 O2 C7 H6 N2 C8 H6 O C9 H10 CHN3 O4

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

39,727 89 229,260 448 5678 549 30,346 5431 68 31,006 37,506 3328 43,731 4618 11,556 4986 8585 17,171 1313 260 2254 388,019 72 77,431 57,771 6821 1317 16,849 18,299 6789 45,626 3429 18,469 12,676 6418 453 338,610 398 223,890 284,065 3978 4880 4667 26,505 1646 547 27,415 11,507 2958 13,864 542 159,874 1656 1041 240,339 179 28,770 606,589 120,427 40,139 1092

0 73 2 0 0 0 11 4 0 13 15 19 134 17 1 232 170 0 472 105 1 9 67 6 39 0 0 2 3 3 32 2 67 41 233 16 7 156 2 54 0 0 1 9 0 2 27 5 36 117 5 5 197 37 4 129 0 74 24 157 0

0 18 0 0 0 0 0 0 0 1 0 3 9 1 0 16 14 0 61 22 0 0 41 0 7 0 0 0 0 0 2 0 6 0 16 1 3 14 0 10 0 0 0 2 0 0 1 0 6 3 0 1 23 1 0 37 0 13 3 17 0

𝑚

120

121

122

𝛽 CH5 N5 O2 CH9 N7 C2 HNO5 C2 H5 N3 O3 C2 H9 N5 O C3 H5 NO4 C3 H9 N3 O2 C3 H13 N5 C4 HN5 C4 H9 NO3 C4 H13 N3 O C5 HN3 O C5 H13 NO2 C6 HNO2 C6 H5 N3 C7 H5 NO C8 H9 N CN2 O5 CH4 N4 O3 CH8 N6 O C2 O6 C2 H4 N2 O4 C2 H8 N4 O2 C2 H12 N6 C3 N6 C3 H4 O5 C3 H8 N2 O3 C3 H12 N4 O C4 N4 O C4 H8 O4 C4 H12 N2 O2 C5 N2 O2 C5 H4 N4 C5 H12 O3 C6 O3 C6 H4 N2 O C7 H4 O2 C7 H8 N2 C8 H8 O C9 H12 C10 CH3 N3 O4 CH7 N5 O2 CH11 N7 C2 H3 NO5 C2 H7 N3 O3 C2 H11 N5 O C3 H7 NO4 C3 H11 N3 O2 C4 H3 N5 C4 H11 NO3 C5 H3 N3 O C6 H3 NO2 C6 H7 N3 C7 H7 NO C8 H11 N CH2 N2 O5 CH6 N4 O3 CH10 N6 O C2 H2 O6 C2 H6 N2 O4

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

7934 452 543 19,834 6353 8026 16,276 422 37,895 6836 1605 110,452 874 44,446 710,961 467,617 225,296 115 6850 1959 23 8018 10,925 223 3697 915 11,666 1549 18,638 1310 1981 17,438 493,258 258 1796 738,283 86,246 563,966 112,484 19,983 4330 3274 3892 66 1585 10,035 926 4197 2479 202,151 1116 590,494 233,143 801,769 528,227 122,819 823 4295 313 127 5187

0 0 0 6 0 9 8 0 1 82 1 2 68 0 74 40 103 0 0 0 0 7 0 0 0 3 12 0 2 48 10 0 52 73 0 43 8 131 144 415 0 0 0 0 0 0 0 2 0 13 10 16 6 76 124 218 0 0 0 2 0

0 0 0 0 0 0 0 0 0 8 0 0 7 0 17 8 11 0 0 0 0 0 0 0 0 1 1 0 0 2 0 0 5 10 0 1 0 14 14 40 0 0 0 0 0 1 0 0 0 3 1 0 1 11 12 36 0 0 0 0 0

𝑚

123

124

125

𝛽

|M̄ 𝑐𝛽 |

C2 H10 N4 O2 1824 C3 H2 N6 46,786 C3 H6 O5 608 C3 H10 N2 O3 2051 C4 H2 N4 O 247,932 C4 H10 O4 255 C5 H2 N2 O2 229,717 C5 H6 N4 704,153 C6 H2 O3 21,641 C6 H6 N2 O 1,053,290 C7 H6 O2 122,391 C7 H10 N2 344,434 C8 H10 O 69,669 C9 H14 7244 C10 H2 64,352 CHNO6 92 CH5 N3 O4 2791 CH9 N5 O2 718 C2 HN7 5370 C2 H5 NO5 1385 C2 H9 N3 O3 1951 C3 HN5 O 50,129 C3 H9 NO4 877 C4 HN3 O2 94,422 C4 H5 N5 388,316 C5 HNO3 30,775 C5 H5 N3 O 1,133,182 C6 H5 NO2 444,584 C6 H9 N3 561,140 C7 H9 NO 372,937 C8 H13 N 47,323 C9 HN 113,702 CN8 251 CO7 3 CH4 N2 O5 1040 CH8 N4 O3 942 C2 N6 O 3866 C2 H4 O6 157 C2 H8 N2 O4 1222 C3 N4 O2 13,654 C3 H4 N6 131,957 C3 H8 O5 154 C4 N2 O3 10,835 C4 H4 N4 O 702,522 C5 O4 1015 C5 H4 N2 O2 645,384 C5 H8 N4 579,834 C6 H4 O3 59,327 C6 H8 N2 O 871,629 C7 H8 O2 102,139 C7 H12 N2 143,857 C8 N2 54,288 C8 H12 O 29,797 C9 O 10,064 C9 H16 1902 C10 H4 241,297 CH3 NO6 201 CH7 N3 O4 759 C2 H3 N7 25,873 C2 H7 NO5 405 C3 H3 N5 O 247,655

𝐵𝑆

𝑀𝑆

0 2 1 1 0 21 4 52 1 100 69 273 553 579 1 0 0 0 0 0 0 0 0 0 26 0 52 67 182 397 335 1 0 0 0 0 0 0 0 0 12 2 0 13 0 26 92 27 334 318 295 1 1257 0 431 2 0 0 0 0 0

0 0 0 0 0 1 0 1 0 10 7 31 40 47 0 0 0 0 0 0 0 0 0 0 0 0 2 5 8 35 16 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 4 1 18 23 14 0 78 0 90 0 0 0 0 0 0

𝑚

126

127

128

𝛽 C4 H3 N3 O2 C4 H7 N5 C5 H3 NO3 C5 H7 N3 O C6 H7 NO2 C6 H11 N3 C7 H11 NO C8 H15 N C9 H3 N CH2 N8 CH2 O7 CH6 N2 O5 C2 H2 N6 O C2 H6 O6 C3 H2 N4 O2 C3 H6 N6 C4 H2 N2 O3 C4 H6 N4 O C5 H2 O4 C5 H6 N2 O2 C5 H10 N4 C6 H6 O3 C6 H10 N2 O C7 H10 O2 C7 H14 N2 C8 H2 N2 C8 H14 O C9 H2 O C9 H18 C10 H6 CHN7 O CH5 NO6 C2 HN5 O2 C2 H5 N7 C3 HN3 O3 C3 H5 N5 O C4 HNO4 C4 H5 N3 O2 C4 H9 N5 C5 H5 NO3 C5 H9 N3 O C6 H9 NO2 C6 H13 N3 C7 HN3 C7 H13 NO C8 HNO C8 H17 N C9 H5 N CN6 O2 CH4 N8 CH4 O7 C2 N4 O3 C2 H4 N6 O C3 N2 O4 C3 H4 N4 O2 C3 H8 N6 C4 O5 C4 H4 N2 O3 C4 H8 N4 O C5 H4 O4 C5 H8 N2 O2

|M̄ 𝑐𝛽 | 463,377 388,792 148,046 1,137,301 448,029 258,612 174,763 12,770 755,600 2596 16 364 46,231 61 168,157 167,486 131,318 893,672 11,291 821,421 300,547 75,331 456,982 54,641 40,953 896,748 8796 160,114 338 439,373 3781 97 27,707 44,090 46,748 424,976 14,493 793,933 231,598 252,373 682,547 272,736 78,864 465,296 54,700 273,106 2258 1,863,935 1554 6392 11 5575 117,168 4415 426,683 117,563 429 330,347 630,529 27,721 585,130

𝐵𝑆

𝑀𝑆

6 0 35 3 6 0 208 6 280 10 126 10 573 18 499 26 0 0 0 0 0 0 0 0 0 0 1 0 0 0 15 1 3 0 121 8 0 0 199 16 83 3 140 18 331 10 909 51 296 20 1 0 1347 113 0 0 165 62 16 1 0 0 0 0 0 0 3 0 0 0 54 1 0 0 124 8 34 2 82 2 158 3 448 16 57 0 0 0 706 45 0 0 343 39 12 0 0 0 0 0 0 0 0 0 0 0 0 0 35 1 3 0 0 0 45 4 39 1 25 0 207 14

𝑚

129

130

131

𝛽 C5 H12 N4 C6 N4 C6 H8 O3 C6 H12 N2 O C7 N2 O C7 H12 O2 C7 H16 N2 C8 O2 C8 H4 N2 C8 H16 O C9 H4 O C9 H20 C10 H8 CH3 N7 O C2 H3 N5 O2 C2 H7 N7 C3 H3 N3 O3 C3 H7 N5 O C4 H3 NO4 C4 H7 N3 O2 C4 H11 N5 C5 H7 NO3 C5 H11 N3 O C6 H11 NO2 C6 H15 N3 C7 H3 N3 C7 H15 NO C8 H3 NO C8 H19 N C9 H7 N CH2 N6 O2 CH6 N8 C2 H2 N4 O3 C2 H6 N6 O C3 H2 N2 O4 C3 H6 N4 O2 C3 H10 N6 C4 H2 O5 C4 H6 N2 O3 C4 H10 N4 O C5 H6 O4 C5 H10 N2 O2 C5 H14 N4 C6 H2 N4 C6 H10 O3 C6 H14 N2 O C7 H2 N2 O C7 H14 O2 C7 H18 N2 C8 H2 O2 C8 H6 N2 C8 H18 O C9 H6 O C10 H10 CHN5 O3 CH5 N7 O C2 HN3 O4 C2 H5 N5 O2 C2 H9 N7 C3 HNO5 C3 H5 N3 O3

|M̄ 𝑐𝛽 | 99,803 92,041 54,343 154,666 118,895 19,154 7436 13,163 3,272,676 1684 575,884 35 488,125 16,587 123,010 37,610 206,392 364,469 62,473 685,212 85,111 219,604 254,221 104,235 14,947 2,978,179 10,777 1,729,030 211 2,521,767 16,785 6863 61,544 127,460 48,402 466,623 48,674 4228 362,688 263,477 30,434 249,379 20,046 1,475,564 23,838 31,984 1,921,208 4177 688 197,786 5,625,815 171 985,744 369,067 7424 24,294 14,217 181,597 17,891 4718 305,195

𝐵𝑆

𝑀𝑆

21 0 1 1 379 23 289 16 0 0 1138 95 256 21 0 0 6 3 637 125 1 0 35 35 32 5 0 0 1 0 2 0 26 1 5 0 12 0 72 1 6 2 158 6 58 2 600 28 69 4 7 3 556 45 0 0 117 26 29 5 0 0 0 0 1 0 3 0 5 0 19 1 2 0 2 0 49 2 22 0 67 5 198 9 17 1 8 1 580 45 239 9 0 0 726 65 111 14 0 0 51 8 130 58 17 2 176 21 0 0 0 0 0 0 0 0 0 0 0 0 10 1

𝑚

132

133

𝛽 C3 H9 N5 O C4 H5 NO4 C4 H9 N3 O2 C4 H13 N5 C5 HN5 C5 H9 NO3 C5 H13 N3 O C6 HN3 O C6 H13 NO2 C6 H17 N3 C7 HNO2 C7 H5 N3 C7 H17 NO C8 H5 NO C9 H9 N CN4 O4 CH4 N6 O2 CH8 N8 C2 N2 O5 C2 H4 N4 O3 C2 H8 N6 O C3 O6 C3 H4 N2 O4 C3 H8 N4 O2 C3 H12 N6 C4 N6 C4 H4 O5 C4 H8 N2 O3 C4 H12 N4 O C5 N4 O C5 H8 O4 C5 H12 N2 O2 C5 H16 N4 C6 N2 O2 C6 H4 N4 C6 H12 O3 C6 H16 N2 O C7 O3 C7 H4 N2 O C7 H16 O2 C8 H4 O2 C8 H8 N2 C9 H8 O C10 H12 C11 CH3 N5 O3 CH7 N7 O C2 H3 N3 O4 C2 H7 N5 O2 C2 H11 N7 C3 H3 NO5 C3 H7 N3 O3 C3 H11 N5 O C4 H7 NO4 C4 H11 N3 O2 C4 H15 N5 C5 H3 N5 C5 H11 NO3 C5 H15 N3 O C6 H3 N3 O C6 H15 NO2

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

174,553 92,108 333,234 18,381 371,403 109,126 56,067 922,258 23,946 1395 316,272 6,927,201 1068 3,999,703 2,190,926 1257 36,687 3831 1221 134,993 71,735 130 105,410 266,035 11,469 38,349 8952 210,267 62,968 161,998 18,023 61,555 1933 127,485 5,094,755 6171 3218 10,604 6,599,812 463 666,395 5,767,073 1,013,745 201,578 25,227 28,677 16,557 54,494 125,466 4665 17,677 213,775 45,932 65,500 89,999 1854 2,249,646 30,610 5825 5,572,831 2659

2 20 70 1 0 188 72 0 417 29 0 28 216 11 142 0 0 0 0 1 1 0 3 20 1 1 7 60 4 0 122 175 3 0 25 424 51 0 13 241 6 229 117 379 0 0 0 0 4 0 1 16 2 33 10 0 3 188 4 6 89

0 0 3 0 0 14 0 0 22 1 0 3 10 4 36 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 9 5 0 0 2 39 1 0 1 29 1 13 17 44 0 0 0 0 0 0 0 0 0 2 0 0 0 13 0 0 6

𝑚

134

135

136

𝛽 C7 H3 NO2 C7 H7 N3 C8 H7 NO C9 H11 N CH2 N4 O4 CH6 N6 O2 CH10 N8 C2 H2 N2 O5 C2 H6 N4 O3 C2 H10 N6 O C3 H2 O6 C3 H6 N2 O4 C3 H10 N4 O2 C3 H14 N6 C4 H2 N6 C4 H6 O5 C4 H10 N2 O3 C4 H14 N4 O C5 H2 N4 O C5 H10 O4 C5 H14 N2 O2 C6 H2 N2 O2 C6 H6 N4 C6 H14 O3 C7 H2 O3 C7 H6 N2 O C8 H6 O2 C8 H10 N2 C9 H10 O C10 H14 C11 H2 CHN3 O5 CH5 N5 O3 CH9 N7 O C2 HNO6 C2 H5 N3 O4 C2 H9 N5 O2 C2 H13 N7 C3 HN7 C3 H5 NO5 C3 H9 N3 O3 C3 H13 N5 O C4 HN5 O C4 H9 NO4 C4 H13 N3 O2 C5 HN3 O2 C5 H5 N5 C5 H13 NO3 C6 HNO3 C6 H5 N3 O C7 H5 NO2 C7 H9 N3 C8 H9 NO C9 H13 N C10 HN CN2 O6 CH4 N4 O4 CH8 N6 O2 CH12 N8 C2 N8 C2 O7

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

1,882,049 8,666,101 5,005,355 1,323,028 12,052 32,046 1121 11,406 119,536 21,136 1115 94,380 80,120 1230 574,379 8070 65,347 6877 2,504,259 5841 7079 1,942,993 8,123,295 772 153,809 10,504,307 1,055,605 3,928,605 697,708 81,909 455,822 2487 34,223 5557 1039 65,614 43,012 540 75,194 21,341 75,329 5343 556,979 23,900 10,903 848,498 4,864,651 3949 229,417 12,015,117 4,032,639 6,807,596 3,955,938 577,485 828,373 199 21,546 12,792 142 4124 31

4 188 150 252 0 0 0 0 0 0 0 11 0 0 2 17 34 0 2 120 13 1 154 145 0 184 63 286 368 639 0 0 0 0 0 1 0 0 0 1 2 0 1 10 0 0 93 13 0 105 68 126 405 374 0 0 1 0 0 0 0

0 16 21 19 0 0 0 0 0 0 0 0 0 0 0 3 1 0 0 6 0 0 16 11 0 11 10 28 47 68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 6 6 1 36 48 0 0 0 0 0 0 0

𝑚

137

138

𝛽 C2 H4 N2 O5 C2 H8 N4 O3 C2 H12 N6 O C3 N6 O C3 H4 O6 C3 H8 N2 O4 C3 H12 N4 O2 C4 N4 O2 C4 H4 N6 C4 H8 O5 C4 H12 N2 O3 C5 N2 O3 C5 H4 N4 O C5 H12 O4 C6 O4 C6 H4 N2 O2 C6 H8 N4 C7 H4 O3 C7 H8 N2 O C8 H8 O2 C8 H12 N2 C9 N2 C9 H12 O C10 O C10 H16 C11 H4 CH3 N3 O5 CH7 N5 O3 CH11 N7 O C2 H3 NO6 C2 H7 N3 O4 C2 H11 N5 O2 C3 H3 N7 C3 H7 NO5 C3 H11 N3 O3 C4 H3 N5 O C4 H11 NO4 C5 H3 N3 O2 C5 H7 N5 C6 H3 NO3 C6 H7 N3 O C7 H7 NO2 C7 H11 N3 C8 H11 NO C9 H15 N C10 H3 N CH2 N2 O6 CH6 N4 O4 CH10 N6 O2 C2 H2 N8 C2 H2 O7 C2 H6 N2 O5 C2 H10 N4 O3 C3 H2 N6 O C3 H6 O6 C3 H10 N2 O4 C4 H2 N4 O2 C4 H6 N6 C4 H10 O5 C5 H2 N2 O3 C5 H6 N4 O

|M̄ 𝑐𝛽 | 20,331 49,011 2659 47,721 1945 39,896 10,407 132,427 1,845,453 3528 8937 83,751 8,047,925 869 6361 6,190,115 7,553,343 481,262 9,795,506 989,647 1,874,516 391,470 338,761 64,352 24,938 1,885,531 7849 16,846 763 3198 33,266 6080 422,513 11,146 11,117 3,155,630 3760 4,773,841 5,534,563 1,268,434 13,675,413 4,598,367 3,609,741 2,123,287 184,124 6,090,422 1553 13,680 1995 53,906 202 13,267 7961 685,244 1294 6847 1,910,769 2,681,554 655 1,197,634 11,689,522

𝐵𝑆

𝑀𝑆

0 0 0 0 0 0 0 0 2 0 1 1 0 0 1 0 34 0 8 0 2 0 0 0 103 6 35 3 1 0 53 2 101 3 8 1 288 26 329 37 455 41 0 0 1020 97 0 0 932 135 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 37 3 2 0 16 0 39 1 6 0 144 5 250 23 222 1 734 44 435 19 1 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 2 0 0 0 0 0 16 1 10 0 10 0 5 0 85 0

𝑚

139

140

𝛽

|M̄ 𝑐𝛽 |

C6 H2 O4 84,414 C6 H6 N2 O2 8,978,366 C6 H10 N4 4,533,685 C7 H6 O3 696,019 C7 H10 N2 O 5,926,666 C8 H10 O2 607,376 C8 H14 N2 637,380 C9 H2 N2 7,380,691 C9 H14 O 118,215 C10 H2 O 1,175,685 C10 H18 5568 C11 H6 3,717,018 CHNO7 137 CHN9 3140 CH5 N3 O5 6836 CH9 N5 O3 3058 C2 HN7 O 71,821 C2 H5 NO6 2839 C2 H9 N3 O4 6367 C3 HN5 O2 363,438 C3 H5 N7 833,476 C3 H9 NO5 2270 C4 HN3 O3 466,716 C4 H5 N5 O 6,228,822 C5 HNO4 114,952 C5 H5 N3 O2 9,390,618 C5 H9 N5 3,841,244 C6 H5 NO3 2,480,437 C6 H9 N3 O 9,536,191 C7 H9 NO2 3,237,132 C7 H13 N3 1,327,095 C8 HN3 3,928,846 C8 H13 NO 795,607 C9 HNO 2,047,874 C9 H17 N 41,989 C10 H5 N 16,335,064 CN8 O 2681 CO8 4 CH4 N2 O6 2021 CH8 N4 O4 2990 C2 N6 O2 25,376 C2 H4 N8 156,863 C2 H4 O7 256 C2 H8 N2 O5 3077 C3 N4 O3 62,223 C3 H4 N6 O 2,021,309 C3 H8 O6 324 C4 N2 O4 37,712 C4 H4 N4 O2 5,615,022 C4 H8 N6 2,207,858 C5 O5 2711 C5 H4 N2 O3 3,489,419 C5 H8 N4 O 9,647,405 C6 H4 O4 240,785 C6 H8 N2 O2 7,451,672 C6 H12 N4 1,828,942 C7 N4 785,412 C7 H8 O3 582,423 C7 H12 N2 O 2,424,077 C8 N2 O 901,869 C8 H12 O2 254,468

𝐵𝑆

𝑀𝑆

2 0 143 14 129 4 103 13 487 25 876 81 399 13 0 0 2032 96 0 0 633 113 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 14 0 0 0 58 0 51 0 50 5 306 6 551 17 176 9 0 0 922 25 0 0 623 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 27 1 1 0 44 2 192 3 24 3 405 14 78 2 0 0 336 31 469 7 0 0 1881 106

𝑚

141

142

143

𝛽 C8 H16 N2 C9 O2 C9 H4 N2 C9 H16 O C10 H4 O C10 H20 C11 H8 CH3 NO7 CH3 N9 CH7 N3 O5 C2 H3 N7 O C2 H7 NO6 C3 H3 N5 O2 C3 H7 N7 C4 H3 N3 O3 C4 H7 N5 O C5 H3 NO4 C5 H7 N3 O2 C5 H11 N5 C6 H7 NO3 C6 H11 N3 O C7 H11 NO2 C7 H15 N3 C8 H3 N3 C8 H15 NO C9 H3 NO C9 H19 N C10 H7 N CH2 N8 O CH2 O8 CH6 N2 O6 C2 H2 N6 O2 C2 H6 N8 C2 H6 O7 C3 H2 N4 O3 C3 H6 N6 O C4 H2 N2 O4 C4 H6 N4 O2 C4 H10 N6 C5 H2 O5 C5 H6 N2 O3 C5 H10 N4 O C6 H6 O4 C6 H10 N2 O2 C6 H14 N4 C7 H2 N4 C7 H10 O3 C7 H14 N2 O C8 H2 N2 O C8 H14 O2 C8 H18 N2 C9 H2 O2 C9 H6 N2 C9 H18 O C10 H6 O C10 H22 C11 H10 CHN7 O2 CH5 NO7 CH5 N9 C2 HN5 O3

|M̄ 𝑐𝛽 | 151,696 84,548 29,566,078 29,172 4,654,419 852 4,442,438 314 15,658 1859 372,568 824 1,892,347 840,842 2,408,635 6,291,833 581,752 9,510,665 1,727,027 2,522,498 4,328,819 1,495,599 333,757 27,869,664 205,672 14,390,891 6355 23,895,548 33,796 20 717 332,923 201,872 98 830,461 2,611,977 493,119 7,260,203 1,126,112 33,662 4,513,867 4,951,073 310,776 3,874,178 492,658 14,352,119 308,660 666,580 16,462,667 72,534 23,437 1,481,754 55,296,968 4745 8,671,508 75 3,614,427 25,160 153 27,189 128,212

𝐵𝑆

𝑀𝑆

398 22 0 0 0 0 1885 107 0 0 252 89 29 2 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 99 1 0 0 240 11 48 0 211 6 191 4 784 26 63 1 3 0 1039 39 0 0 440 29 25 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 19 0 6 1 99 3 6 0 2 1 133 4 57 0 134 13 378 14 31 2 2 1 854 27 400 15 0 0 1915 128 308 14 0 0 50 6 800 78 12 0 75 37 110 6 0 0 0 0 0 0 0 0

𝑚

144

145

𝛽 C2 H5 N7 O C3 HN3 O4 C3 H5 N5 O2 C3 H9 N7 C4 HNO5 C4 H5 N3 O3 C4 H9 N5 O C5 H5 NO4 C5 H9 N3 O2 C5 H13 N5 C6 HN5 C6 H9 NO3 C6 H13 N3 O C7 HN3 O C7 H13 NO2 C7 H17 N3 C8 HNO2 C8 H5 N3 C8 H17 NO C9 H5 NO C9 H21 N C10 H9 N CN6 O3 CH4 N8 O CH4 O8 C2 N4 O4 C2 H4 N6 O2 C2 H8 N8 C3 N2 O5 C3 H4 N4 O3 C3 H8 N6 O C4 O6 C4 H4 N2 O4 C4 H8 N4 O2 C4 H12 N6 C5 N6 C5 H4 O5 C5 H8 N2 O3 C5 H12 N4 O C6 N4 O C6 H8 O4 C6 H12 N2 O2 C6 H16 N4 C7 N2 O2 C7 H4 N4 C7 H12 O3 C7 H16 N2 O C8 O3 C8 H4 N2 O C8 H16 O2 C8 H20 N2 C9 H4 O2 C9 H8 N2 C9 H20 O C10 H8 O C11 H12 C12 CH7 N9 C2 H3 N5 O3 C2 H7 N7 O CH3 N7 O2

|M̄ 𝑐𝛽 | 655,227 165,781 3,330,976 496,568 41,232 4,229,478 3,729,876 1,016,168 5,690,451 504,522 3,795,824 1,530,269 1,284,174 8,175,534 455,946 53,310 2,439,749 70,969,521 34,156 36,456,956 507 22,467,086 6470 88,451 15 18,233 877,705 141,375 11,649 2,185,015 1,835,635 932 1,284,622 5,135,621 361,836 407,003 85,857 3,223,855 1,607,685 1,489,672 224,720 1,285,303 82,510 1,000,798 54,741,129 105,625 114,992 71,079 62,428,843 13,190 1856 5,541,857 61,680,587 405 9,693,195 2,135,717 171,886 23,255 599,924 563,976 117,237

𝐵𝑆

𝑀𝑆

1 0 14 2 0 48 4 38 124 7 3 360 88 0 872 80 0 50 706 17 142 126 0 0 0 0 2 2 0 6 2 1 15 43 6 0 12 109 21 1 259 301 24 0 24 1064 262 0 35 1062 144 2 171 175 84 324 0 0 0 1 0

0 0 2 0 0 0 0 1 1 0 0 15 4 0 38 1 0 4 25 0 11 20 0 0 0 0 0 0 0 0 0 0 3 2 0 0 0 7 0 0 21 12 3 0 1 49 5 0 1 91 17 0 47 48 8 22 0 0 0 0 0

𝑚

146

147

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

C3 H3 N3 O4 768,517 C3 H7 N5 O2 2,882,173 C3 H11 N7 178,007 C4 H3 NO5 186,967 C4 H7 N3 O3 3,682,739 C4 H11 N5 O 1,346,659 C5 H7 NO4 891,604 C5 H11 N3 O2 2,089,121 C5 H15 N5 89,592 C6 H3 N5 25,665,747 C6 H11 NO3 575,709 C6 H15 N3 O 233,221 C7 H3 N3 O 55,071,818 C7 H15 NO2 86,195 C7 H19 N3 4238 C8 H3 NO2 16,228,009 C8 H7 N3 97,109,499 C8 H19 NO 2876 C9 H7 NO 49,865,161 C10 H11 N 14,778,466 CH2 N6 O3 76,720 CH6 N8 O 97,234 C2 H2 N4 O4 216,893 C2 H6 N6 O2 971,399 C2 H10 N8 57,508 C3 H2 N2 O5 137,656 C3 H6 N4 O3 2,429,018 C3 H10 N6 O 749,873 C4 H2 O6 9986 C4 H6 N2 O4 1,432,731 C4 H10 N4 O2 2,125,930 C4 H14 N6 68,990 C5 H2 N6 7,055,345 C5 H6 O5 95,870 C5 H10 N2 O3 1,360,645 C5 H14 N4 O 311,390 C6 H2 N4 O 26,123,593 C6 H10 O4 97,394 C6 H14 N2 O2 257,122 C6 H18 N4 6742 C7 H2 N2 O2 17,388,955 C7 H6 N4 96,024,197 C7 H14 O3 22,151 C7 H18 N2 O 9780 C8 H2 O3 1,187,784 C8 H6 N2 O 109,240,025 C8 H18 O2 1225 C9 H6 O2 9,660,231 C9 H10 N2 46,024,195 C10 H10 O 7,288,733 C11 H14 950,064 C12 H2 3,571,212 CHN5 O4 24,429 CH5 N7 O2 176,798 CH9 N9 10,912 C2 HN3 O5 37,974 C2 H5 N5 O3 908,888 C2 H9 N7 O 265,965 C3 HNO6 10,555 C3 H5 N3 O4 1,164,356 C3 H9 N5 O2 1,374,370

0 5 0 1 43 3 63 104 1 8 394 72 4 554 29 1 144 249 148 303 0 0 0 1 0 0 10 0 1 22 33 0 1 28 153 6 3 345 249 7 0 94 672 52 2 177 334 45 411 421 450 1 0 0 0 0 0 1 0 2 4

0 0 0 0 0 0 0 5 0 1 22 0 0 20 4 0 16 10 21 32 0 0 0 0 0 0 1 0 0 0 1 0 0 2 9 0 0 25 3 2 0 10 36 2 0 14 28 4 22 34 52 0 0 0 0 0 0 0 0 0 0

𝛽

𝑚

148

149

|M̄ 𝑐𝛽 |

𝐵𝑆

𝑀𝑆

C3 H13 N7 36,852 C4 HN7 1,022,466 C4 H5 NO5 282,310 C4 H9 N3 O3 1,783,871 C4 H13 N5 O 281,769 C5 HN5 O 6,233,092 C5 H9 NO4 440,821 C5 H13 N3 O2 448,538 C5 H17 N5 7578 C6 HN3 O2 7,967,364 C6 H5 N5 61,444,491 C6 H13 NO3 128,380 C6 H17 N3 O 20,368 C7 HNO3 1,833,789 C7 H5 N3 O 131,358,449 C7 H17 NO2 8000 C8 H5 NO2 38,484,571 C8 H9 N3 83,983,472 C9 H9 NO 43,311,373 C10 H13 N 7,122,614 C11 HN 6,593,791 CN4 O5 3062 CH4 N6 O3 174,687 CH8 N8 O 54,187 C2 N2 O6 2518 C2 H4 N4 O4 493,211 C2 H8 N6 O2 547,544 C2 H12 N8 13,101 C3 N8 60,490 C3 O7 217 C3 H4 N2 O5 310,668 C3 H8 N4 O3 1,387,392 C3 H12 N6 O 171,753 C4 N6 O 582,583 C4 H4 O6 21,975 C4 H8 N2 O4 831,647 C4 H12 N4 O2 497,254 C4 H16 N6 6153 C5 N4 O2 1,310,920 C5 H4 N6 25,346,382 C5 H8 O5 56,687 C5 H12 N2 O3 328,357 C5 H16 N4 O 28,301 C6 N2 O3 693,682 C6 H4 N4 O 93,583,559 C6 H12 O4 24,562 C6 H16 N2 O2 24,545 C7 O4 42,867 C7 H4 N2 O2 61,817,403 C7 H8 N4 98,791,068 C7 H16 O3 2275 C8 H4 O3 4,161,969 C8 H8 N2 O 112,562,582 C9 H8 O2 9,990,575 C9 H12 N2 24,399,762 C10 N2 3,115,390 C10 H12 O 3,916,111 C11 O 455,822 C11 H16 323,512 C12 H4 16,079,924 CH3 N5 O4 99,306

0 0 9 41 0 0 105 12 0 1 37 290 7 0 91 102 75 235 429 445 1 0 0 0 0 0 2 0 0 0 2 0 0 0 3 45 0 0 0 9 52 44 0 0 38 267 27 0 27 220 192 7 391 255 397 1 892 0 682 0 0

0 0 0 0 0 0 7 0 0 1 2 12 0 0 9 5 8 11 50 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 4 1 0 0 5 16 2 0 0 15 15 2 16 25 20 0 90 0 57 0 0

𝛽

𝑚

150

𝛽

|M̄ 𝑐𝛽 |

CH7 N7 O2 121,630 CH11 N9 2766 C2 H3 N3 O5 152,977 C2 H7 N5 O3 633,408 C2 H11 N7 O 67,609 C3 H3 NO6 41,580 C3 H7 N3 O4 822,099 C3 H11 N5 O2 355,574 C3 H15 N7 3483 C4 H3 N7 6,505,400 C4 H7 NO5 202,072 C4 H11 N3 O3 473,871 C4 H15 N5 O 26,983 C5 H3 N5 O 39,760,215 C5 H11 NO4 121,350 C5 H15 N3 O2 44,621 C6 H3 N3 O2 50,459,744 C6 H7 N5 77,737,459 C6 H15 NO3 13,539 C7 H3 NO3 11,449,751 C7 H7 N3 O 166,085,562 C8 H7 NO2 48,687,255 C8 H11 N3 49,755,227 C9 H11 NO 25,895,621 C10 H15 N 2,569,697 C11 H3 N 53,109,027 CH2 N4 O5 31,784 CH6 N6 O3 155,356 CH10 N8 O 15,509 C2 H2 N2 O6 25,361 C2 H6 N4 O4 443,749 C2 H10 N6 O2 159,347 C2 H14 N8 1341 C3 H2 N8 966,328 C3 H2 O7 2113 C3 H6 N2 O5 282,171 C3 H10 N4 O3 413,244 C3 H14 N6 O 17,528 C4 H2 N6 O 9,630,475 C4 H6 O6 20,050 C4 H10 N2 O4 255,379 C4 H14 N4 O2 52,358 C5 H2 N4 O2 21,728,759 C5 H6 N6 41,205,407 C5 H10 O5 18,092 C5 H14 N2 O3 36,231 C6 H2 N2 O3 11,366,726 C6 H6 N4 O 151,838,122 C6 H14 O4 2922 C7 H2 O4 674,033 C7 H6 N2 O2 100,082,479 C7 H10 N4 66,583,863 C8 H6 O3 6,717,404 C8 H10 N2 O 76,307,072 C9 H10 O2 6,843,602 C9 H14 N2 9,459,132 C10 H2 N2 65,563,828 C10 H14 O 1,548,361 C11 H2 O 9,414,509 C11 H18 84,051 C12 H6 34,030,905

𝐵𝑆

𝑀𝑆

0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 10 0 1 0 0 0 5 0 52 0 0 0 2 0 129 10 23 2 3 0 222 7 265 13 182 5 724 45 558 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 1 0 0 0 0 0 0 88 8 34 3 0 0 0 0 273 11 61 5 0 0 153 3 105 1 90 7 542 38 667 71 568 29 0 0 1938 150 0 0 762 49 12 0
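The table above groups molecular formulas by their nominal mass 𝑚. A minimal sketch of that grouping follows. It assumes that E4 consists of C, H, N and O (inferred from the formulas listed in the table) and uses the integer masses C = 12, H = 1, N = 14, O = 16. The names `nominal_mass` and `group_by_mass` are illustrative and are not taken from the book.

```python
import re
from collections import defaultdict

# Nominal (integer) masses of the four elements occurring in the table.
# Assumption: E4 = {C, H, N, O}.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}

# One element symbol followed by an optional count, e.g. "C2", "H6", "O".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def nominal_mass(formula):
    """Return the nominal mass of a formula such as 'C2H6O' or 'C2 H6 O'."""
    compact = formula.replace(" ", "")
    if not _FORMULA.fullmatch(compact):
        raise ValueError(f"not a molecular formula: {formula!r}")
    mass = 0
    for element, count in _TOKEN.findall(compact):
        if element not in NOMINAL_MASS:
            raise ValueError(f"element outside E4: {element!r}")
        # A missing count means a single atom of that element.
        mass += NOMINAL_MASS[element] * int(count or 1)
    return mass


def group_by_mass(formulas):
    """Group formulas by nominal mass, keeping the input order per mass."""
    groups = defaultdict(list)
    for formula in formulas:
        groups[nominal_mass(formula)].append(formula)
    return dict(groups)


if __name__ == "__main__":
    print(nominal_mass("C2 H6 O"))
    print(group_by_mass(["CH2O", "C2H6", "CH5N", "CH4O"]))
```

Both spellings of a formula are accepted, the compact one (`C2H6O`) and the space-separated one that appears in the flattened table rows (`C2 H6 O`).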

Bibliography

[1] H. Abe, H. Hayasaka, Y. Miyashita, and S. Sasaki. Generation of stereoisomeric structures using topological information alone. J. Chem. Inf. Comput. Sci., 24:216–219, 1984.
[2] Advanced Chemistry Development, Inc. (ACD), Toronto, Canada. MS Manager, Version 11.01, 2007.
[3] D. Agrafiotis, A. Gibbs, F. Zhu, S. Izrailev, and E. Martin. Conformational sampling of bioactive molecules: A comparative study. J. Chem. Inf. Model., 47:1067–1086, 2007.
[4] A. Agueera, L. Estrada, I. Ferrer, E. Thurman, S. Malato, and A. Fernandez-Alba. Application of time–of–flight mass spectrometry to the analysis of phototransformation products of diclofenac in water under natural sunlight. J. Mass Spectrom., 40(7):908–915, 2005.
[5] N. Allinger. MM2. A hydrocarbon force field utilizing 𝑣1 and 𝑣2 torsional terms. J. Am. Chem. Soc., 99:8127–8134, 1977.
[6] S. Ashton, R. Gallagher, J. Warrander, N. Loftus, I. Hirano, S. Yamaguchi, N. Mukai, and Y. Inohana. Isotope modeling routines applied to empirical formula prediction using high mass accuracy MS𝑛 data. In 23rd LC/MS Montreux Symposium, Montreux, Switzerland, Nov 8–10 2006.
[7] M. Badertscher, A. Korytko, K.-P. Schulz, M. Madison, M. Munk, P. Portman, M. Junghans, P. Fontana, and E. Pretsch. Assemble 2.0: A structure generator. Chemom. Intell. Lab. Syst., 51:73–79, 2000.
[8] A. Balaban. Highly discriminating distance–based topological index. Chem. Phys. Lett., 89:399–404, 1982.
[9] A. Balaban. Topological indices based on topological distances in molecular graphs. Pure Appl. Chem., 55:199–206, 1983.
[10] A. Balaban, O. Mekenyan, and D. Bonchev. Unique description of chemical structures based on hierarchically ordered extended connectivities (HOC procedures). J. Comput. Chem., 6:538–551, 1985.
[11] J. Barnard and G. Downs. Computer representation and manipulation of combinatorial libraries. Perspect. Drug Discov. Design., 7/8:13–30, 1997.
[12] J. Barnard and G. Downs. Use of Markush structure techniques to avoid enumeration in diversity analysis of large combinatorial libraries. http://www.daylight.com/meetings/mug97/Barnard/970227JB.html, 1997.
[13] J. Barnard, G. Downs, and A. v. Scholley-Pfab. Use of Markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries. J. Mol. Graph. Model., 18:452–463, 2000.
[14] S. Basak. Use of molecular complexity indices in predictive pharmacology and toxicology: A QSAR approach. Med. Sci. Res., 15:605–609, 1987.
[15] S. Basak. Information theoretic indices of neighborhood complexity and their applications. In J. Devillers and A. Balaban, editors, Topological Indices and Related Descriptors in QSAR and QSPR, chapter 12. Gordon and Breach, Amsterdam, 1999.
[16] S. Bauerschmidt. Repräsentation von Molekülstrukturen zur computergestützten Behandlung chemischer Reaktionen. PhD thesis, Universität Erlangen–Nürnberg, 1997.
[17] C. Benecke. Objektorientierte Darstellung und Algorithmen zur Klassifizierung endlicher bewerteter Strukturen. PhD thesis, Universität Bayreuth, 1997.
[18] C. Benecke. Objektorientierte Darstellung und Algorithmen zur Klassifizierung endlicher bewerteter Strukturen. MATCH Commun. Math. Comput. Chem., 37:7–156, 1998.

[19] C. Benecke, R. Grund, R. Hohberger, A. Kerber, R. Laue, and T. Wieland. MOLGEN+, a generator of connectivity isomers and stereoisomers for molecular structure elucidation. Anal. Chim. Acta, 314:141–147, 1995.
[20] C. Benecke, T. Grüner, A. Kerber, R. Laue, and T. Wieland. Molecular structure generation with MOLGEN, new features and future developments. Fresenius J. Anal. Chem., 359:23–32, 1997.
[21] G. Benkö. A toy model of chemical reaction networks. Master’s thesis, Universität Wien, 2002.
[22] G. Benkö, C. Flamm, and P. Stadler. A graph–based toy model of chemistry. J. Chem. Inf. Comput. Sci., 43:1085–1093, 2003.
[23] J. Biebl. Computerunterstützte Clusteranalyse mit fuzzy Clusteralgorithmen und Anwendungen auf chemische Moleküle. Master’s thesis, Universität Bayreuth, 1999.
[24] J. Biegholdt. Computerunterstützte Berechnung von Multigraphen mittels Homomorphieprinzip. Master’s thesis, Universität Bayreuth, 1995.
[25] A. Björner, M. Las Vergnas, B. Sturmfels, N. White, and G. Ziegler. Oriented Matroids. Cambridge University Press, Cambridge, UK, 1993.
[26] J. Böcker. Spektroskopie. Vogel Buchverlag, Würzburg, 1997.
[27] S. Böcker and F. Rasche. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinf., 24:i49–55, 2008.
[28] T. Bohme Leite, D. Gomes, M. Miteva, J. Chomilier, B. Villoutreix, and P. Tufféry. Frog: A free online drug 3d conformation generator. Nucl. Acids Res., 35:W568–W572, 2007.
[29] J. Bokowski. Computational Oriented Matroids. Cambridge University Press, Cambridge, UK, 2006.
[30] J. Boström. Reproducing the conformations of protein–bound ligands: A critical evaluation of several popular conformational searching tools. J. Comput.–Aided Mol. Des., 15:1137–1152, 2001.
[31] W. Brack, M. Schmitt-Jansen, M. Machala, R. Brix, D. Barceló, E. Schymanski, G. Streck, and T. Schulze. How to confirm identified toxicants in effect–directed analysis. Anal. Bioanal. Chem., 390:1959–1973, 2008.
[32] J. Braun. Topologische Indizes und ihre computerunterstützte Anwendung in der Chemie. Master’s thesis, Universität Bayreuth, 1999.
[33] J. Braun, R. Gugisch, A. Kerber, R. Laue, M. Meringer, and C. Rücker. MOLGEN–CID, a canonizer for molecules and graphs accessible through the internet. J. Chem. Inf. Comput. Sci., 44:542–548, 2004.
[34] J. Braun, A. Kerber, R. Laue, M. Meringer, and C. Rücker. MOLGEN–QSPR User Guide. Online under http://molgen.de, click Molgen QSPR and there pdf–download, 2009.
[35] J. Braun, A. Kerber, M. Meringer, and C. Rücker. Similarity of molecular descriptors: The equivalence of Zagreb indices and walk counts. MATCH Commun. Math. Comput. Chem., 54:163–176, 2005.
[36] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Monterey, CA, 1984.
[37] L. Broadbelt, S. Stark, and M. Klein. Computer generated reaction modelling: Decomposition and encoding algorithms for determining species uniqueness. Comput. Chem. Engng., 20:113–129, 1996.
[38] U. Burkert and N. L. Allinger. Molecular Mechanics. ACS Monograph 177, American Chemical Society, Washington, D.C., 1982.
[39] W. Burnside. Theory of Groups of Finite Order. Cambridge University Press, 1897, 2nd ed. 1911. Reprint Dover Publications 1955.
[40] D. Butina and J. Gola. Modeling aqueous solubility. J. Chem. Inf. Comput. Sci., 43:837–841, 2003.

[41] E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J. Chem. Inf. Comput. Sci., 43:1882–1889, 2003.
[42] CCL. MMFF94 validation Suite. Computational Chemistry List, Ltd., Columbus, Ohio, USA, http://www.ccl.net/cca/data/MMFF94/ (accessed 20/09/2012), 1999.
[43] C.-C. Chang and C.-J. Lin. LIBSVM — a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf, 2002.
[44] H. Chen, B. Fan, M. Petitjean, A. Panaye, J.-P. Doucet, F. Li, H. Xia, and S. Yuan. MASSIS: A mass spectrum simulation system 2. procedures and performance. Eur. J. Mass Spectrom., 9(5):445–457, 2003.
[45] H. Chen, B. Fan, H. Xia, M. Petitjean, S. Yuan, A. Panaye, and J.-P. Doucet. MASSIS: A mass spectrum simulation system 1. principle and method. Eur. J. Mass Spectrom., 9(3):175–186, 2003.
[46] B. Christie and M. Munk. The role of two–dimensional nuclear magnetic resonance spectroscopy in computer–enhanced structure elucidation. J. Am. Chem. Soc., 113:3750–3757, 1991.
[47] C. Colbourn and R. Read. Orderly algorithms for generating restricted classes of graphs. J. Graph Theory, 3:187–195, 1979.
[48] M. Connolly. Molecular surface and volume. In P. v. R. Schleyer, editor, Encyclopedia of Computational Chemistry, pages 1698–1703. Wiley, Chichester, 1998.
[49] M. Contreras, J. Alvarez, D. Guajardo, and R. Rozas. Algorithm for exhaustive and nonredundant organic stereoisomer generation. J. Chem. Inf. Model., 46:2288–2298, 2006.
[50] M. Contreras, J. Alvarez, M. Riveros, G. Arias, and R. Rozas. Exhaustive generation of organic isomers. 6. Stereoisomers having isolated and spiro cycles and new extended N_tuples. J. Chem. Inf. Comput. Sci., 41:964–977, 2001.
[51] C. Cortes and V. Vapnik. Support–vector networks. Machine Learning, 20:273–297, 1995.
[52] G. Crippen and T. Havel. Distance Geometry and Molecular Conformation. Research Studies Press, Taunton, 1988.
[53] A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. Gushurst, D. L. Grier, B. A. Leland, and J. Laufer. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci., 32(3):244–255, 1992.
[54] P. D’Arcy and W. Mallard. AMDIS User Guide. National Institute of Standards and Technology (NIST), US Department of Commerce, USA, 2004.
[55] N. de Bruijn. Pólya’s theory of counting. In E. Beckenbach, editor, Applied Combinatorial Mathematics, pages 144–184. Wiley, New York, 1964.
[56] N. de Bruijn. Pólya’s Abzähl-Theorie: Muster für Graphen und chemische Verbindungen. In Selecta Math. III, pages 1–26. Springer, 1971.
[57] A. Dreiding and K. Wirth. The multiplex. a classification of finite ordered point sets in oriented d–dimensional spaces. MATCH Commun. Math. Comput. Chem., 8:341–352, 1980.
[58] A. Dress. Chirotopes and oriented matroids. Bayreuther Mathematische Schriften, 21:14–68, 1986.
[59] A. Dress, A. Dreiding, and H. Haegi. Classification of mobile molecules by category theory. Studies in Physical and Theoretical Chemistry, 23:39–58, 1983.
[60] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[61] J. Dugundji and I. Ugi. An algebraic model of constitutional chemistry as a basis for chemical computer programs. Top. Curr. Chem., 39:19–64, 1973.
[62] W. Eckel and T. Kind. Use of boiling point – Lee retention index correlation for rapid review of gas chromatography – mass spectrometry data. Anal. Chim. Acta, 494(1–2):235–243, 2003.

[63] J. Eckhoff. Helly, Radon, and Carathéodory type theorems. In P. Gruber and J. Wills, editors, Handbook of Convex Geometry, pages 389–448. North–Holland, Amsterdam, 1993.
[64] E. Eliel. Infelicitous stereochemical nomenclature. Chirality, 9:428–430, 1997.
[65] E. Eliel and S. Wilen. Stereochemistry of Organic Compounds. Wiley, New York, 1994.
[66] M. Elyashberg, K. Blinov, and E. Martirosian. A new approach to computer–aided molecular structure elucidation: The expert system Structure Elucidator. Lab. Autom. Inf. Man., 34:15–30, 1999.
[67] M. Elyashberg, Y. Karasev, E. Martirosian, H. Thiele, and H. Somberg. Expert systems as a tool for the molecular structure elucidation by spectral methods. Strategies of solution to the problems. Anal. Chim. Acta, 348:443–463, 1997.
[68] M. Elyashberg, E. Martirosian, Y. Karasev, H. Thiele, and H. Somberg. X–PERT: A user friendly expert system for molecular structure elucidation by spectral methods. Anal. Chim. Acta, 337:265–286, 1997.
[69] M. Elyashberg, V. Serov, E. Martirosyan, L. Zlatina, Y. Karasev, V. Koldashov, and Y. Yampolskiy. An expert system for molecular structure elucidation based on spectral data. J. Mol. Struct. (Theochem), 230:191–203, 1991.
[70] S. Esteban. Liebig–Wöhler controversy and the concept of isomerism. J. Chem. Educ., 85:1201–1203, 2008.
[71] B. Fan, H. Chen, M. Petitjean, A. Panaye, J.-P. Doucet, H. Xia, and S. Yuan. New strategy of mass spectrum simulation based on reduced and concentrated knowledge databases. Spectrosc. Lett., 38(2):145–170, 2005.
[72] I. Faradzhev. Constructive enumeration of combinatorial objects. In Problèmes Combinatoires et Théorie des Graphes, volume 260, pages 131–135, 1978. Colloq. Internat. CNRS, Orsay 1976.
[73] I. Faradzhev. Generation of nonisomorphic graphs with a given degree sequence. In Algorithmic Studies in Combinatorics, pages 11–19. NAUKA, Moscow, 1978.
[74] J.-L. Faulon, D. Visco, and D. Roe. Enumerating molecules, volume 21 of Reviews in Computational Chemistry, pages 209–286. Wiley–VCH, 2005.
[75] D. Fischer. Verwendung graphentheoretischer Netzwerkalgorithmen bei Zuordnungsproblemen, insbesondere bei der automatischen Analyse von Spektren. Master’s thesis, Universität Bayreuth, 1996.
[76] M. Fischer. Algorithmen zur Erkennung von Aromatizitäts– und Tautomerie–Überlagerungen in organischen Strukturen. Master’s thesis, Universität Bayreuth, 1996.
[77] P. Floersheim. From descriptions of molecules to their symmetry. PhD thesis, Universität Zürich, 1987.
[78] S. Fujita. Symmetry and Combinatorial Enumeration in Chemistry. Springer, Berlin, 1991.
[79] S. Fujita. Computer–Oriented Representation of Organic Reactions. Yoshioka Shoten Publishing Company, Kyoto, 2001.
[80] K. Funatsu, N. Miyabayashi, and S. Sasaki. Further development of structure generation in the automated structure elucidation system CHEMICS. J. Chem. Inf. Comput. Sci., 28:18–28, 1988.
[81] K. Funatsu and S. Sasaki. Recent advances in the automated structure elucidation system, CHEMICS. Utilization of two–dimensional NMR spectral information and development of peripheral functions for examination of candidates. J. Chem. Inf. Comput. Sci., 36:190–204, 1996.
[82] G. Furnival and R. Wilson, Jr. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.
[83] A. Fürst, J. Clerc, and E. Pretsch. A computer program for the computation of the molecular formula. Chemom. Intell. Lab. Syst., 5:329–334, 1989.

[84] J. Gasteiger, W. Hanebeck, and K.-P. Schulz. Prediction of mass spectra from structural information. J. Chem. Inf. Comput. Sci., 32:264–271, 1992.
[85] J. Gasteiger, W. Hanebeck, K.-P. Schulz, S. Bauerschmidt, and R. Höllering. Automatic analysis and simulation of mass spectra. In C. Wilkins, editor, Computer–Enhanced Analytical Spectroscopy, volume 4, pages 97–133. Kluwer Academic Publishers, 1993.
[86] M. Gerdts. SQP v1.1 — a fortran77–program package solving nonlinear restricted optimization problems. http://www.math.uni-hamburg.de/home/gerdts/SOFTWARE/sqp.txt, 2004.
[87] H. Gerlach. Handedness and chirality. MATCH Commun. Math. Comput. Chem., 61:5–10, 2009.
[88] M. Gerlich and S. Neumann. MetFusion: integration of compound identification strategies. J. Mass Spectrom., 48:291–298, 2013.
[89] K. Gibson and H. Scheraga. Exact calculation of the volume and the surface area of fused hard sphere molecules with unequal atomic radii. Mol. Phys., 62:1247–1265, 1987.
[90] J. Ginter, J. Fox, H. Shackman, and R. Classon. Empirical formula prediction using MS and MSⁿ spectra and isotope modeling. Chromatography Online.com, 2007.
[91] L. Goldberg. Efficient algorithms for listing unlabeled graphs. J. Algorithms, 13:128–143, 1992.
[92] B. Gruber. Eine lineare algebraische Repräsentation für Objekte der Syntheseplanung. Software–Entwicklung in der Chemie, 9:99–111, 1995.
[93] R. Grund. Symmetrieklassen von Abbildungen und die Konstruktion von diskreten Strukturen. Bayreuther Mathematische Schriften, 31:19–54, 1990.
[94] R. Grund. Konstruktion schlichter Graphen mit gegebener Gradpartition. Bayreuther Mathematische Schriften, 44:73–104, 1993.
[95] R. Grund. Konstruktion molekularer Graphen mit gegebenen Hybridisierungen und überlappungsfreien Fragmenten. Bayreuther Mathematische Schriften, 49:1–113, 1995.
[96] T. Grüner. Strategien zur Konstruktion diskreter Strukturen. PhD thesis, Universität Bayreuth, 1998.
[97] T. Grüner. Strategien zur Konstruktion diskreter Strukturen und ihre Anwendung auf molekulare Graphen. MATCH Commun. Math. Comput. Chem., 39:39–126, 1999.
[98] T. Grüner, A. Kerber, R. Laue, M. Liepelt, M. Meringer, K. Varmuza, and W. Werther. Determination of sum formulas from mass spectra by recognition of superposed isotope patterns. MATCH Commun. Math. Comput. Chem., 37:163–177, 1998.
[99] T. Grüner, A. Kerber, R. Laue, and M. Meringer. Mathematics for Combinatorial Chemistry, volume II of Scientific Computing in Chemical Engineering, pages 74–81. Springer, 1999.
[100] T. Grüner, A. Kerber, R. Laue, M. Meringer, K. Varmuza, and W. Werther. MASSMOL. MATCH Commun. Math. Comput. Chem., 38:173–180, 1998.
[101] T. Grüner, R. Laue, and M. Meringer. Algorithms for Group Actions: Homomorphism Principle and Orderly Generation Applied to Graphs, volume 28 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 113–122. American Mathematical Society, 1996.
[102] R. Gugisch. Konstruktion von Isomorphieklassen orientierter Matroide. Bayreuther Mathematische Schriften, 72:1–129, 2005.
[103] R. Gugisch. A construction of isomorphism classes of oriented matroids. In M. Klin, G. Jones, A. Jurisić, M. Muzychuk, and I. Ponomarenko, editors, Algorithmic Algebraic Combinatorics and Gröbner Bases, pages 229–249. Springer, 2009.
[104] R. Gugisch, A. Kerber, R. Laue, M. Meringer, and C. Rücker. History and progress of the generation of structural formulae in chemistry and its applications. MATCH Commun. Math. Comput. Chem., 58:239–280, 2007.

[105] R. Gugisch and C. Rücker. Unified generation of conformations, conformers, and stereoisomers: A discrete mathematics–based approach. MATCH Commun. Math. Comput. Chem., 61:117–148, 2009.
[106] I. Gutman, B. Ruščić, N. Trinajstić, and C. Wilcox, Jr. Graph theory and molecular orbitals. XII. Acyclic polyenes. J. Chem. Phys., 62:3399–3405, 1975.
[107] I. Gutman, D. Vidović, and L. Popović. Graph representation of organic molecules, Cayley’s plerograms vs. his kenograms. J. Chem. Soc., Faraday Trans., 94:857–860, 1998.
[108] E. Haberberger. Isomorphieklassifikation von Inzidenzstrukturen mit dem Lattice-Climbing-Verfahren am Beispiel von t-Designs. Bayreuther Mathematische Schriften, 68:1–78, 2004.
[109] R. Hager, A. Kerber, R. Laue, D. Moser, and W. Weber. Construction of orbit representatives. Bayreuther Mathematische Schriften, 35:157–169, 1991.
[110] T. Halgren. MMFF VII. Characterization of MMFF94, MMFF94s and other widely available force fields for conformational energies and for intermolecular–interaction energies and geometries. J. Comput. Chem., 20(7):730–748, 1999.
[111] W. Hanebeck. Simulation und Rekonstruktion von Reaktionen im Massenspektrometer. PhD thesis, Technische Universität München, 1991.
[112] W. Hässelbarth. A note on Pólya enumeration theory. Theor. Chim. Acta, 66:91–110, 1984.
[113] W. Hässelbarth. Substitution symmetry. Theor. Chim. Acta, 67:339–367, 1985.
[114] W. Hässelbarth and A. Kerber. Chirality and permutational isomers. MATCH Commun. Math. Comput. Chem., 61:149–216, 2009.
[115] W. Hässelbarth and E. Ruch. Permutational isomers with chiral ligands. Isr. J. Chem, 15:112–115, 1977.
[116] W. Hässelbarth, E. Ruch, D. J. Klein, and T. H. Seligman. Bilateral classes. J. Math. Phys., 21:951–953, 1980.
[117] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
[118] D. Hawkins. The problem of overfitting. J. Chem. Inf. Comput. Sci., 44:1–12, 2004.
[119] P. Hazebroek and L. Oosterhoff. The isomers of cyclohexane. Disc. Faraday Soc., 10:87–93, 1951.
[120] M. Heinonen, A. Rantanen, T. Mielikäinen, J. Kokkonen, J. Kiuru, R. Ketola, and J. Rousu. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Comm. Mass Spec., 22:3043–3052, 2008.
[121] S. Heller and A. McNaught. The IUPAC International Chemical Identifier (InChI). Chemistry International 31 No. 1, 2009.
[122] J. Hendrickson and A. Toczko. Unique numbering and cataloguing of molecular structures. J. Chem. Inf. Comput. Sci., 23:171–177, 1983.
[123] M. Hesse, H. Meier, and B. Zeeh. Spectroscopic Methods in Organic Chemistry. Thieme, Stuttgart, 1997.
[124] S. Heuerding and J. Clerc. Simple tools for the computer–aided interpretation of mass spectra. Chemom. Intell. Lab. Syst., 20:57–69, 1993.
[125] HighChem Ltd. Mass Frontier. HighChem Ltd./Thermo Fisher Scientific, Bratislava, Slovakia, 2012.
[126] C. Hildebrandt, S. Wolf, and S. Neumann. Database supported candidate search for metabolite identification. J. Integr. Bioinf., 8(2):157, 2011.
[127] D. Hill, T. Kertesz, D. Fontaine, R. Friedman, and D. Grant. Mass spectral metabonomics beyond elemental formula: chemical database querying by matching experimental with computational fragmentation spectra. Anal. Chem., 80:5574–5582, 2008.

[128] E. Hill. On a system of indexing chemical literature; adopted by the classification division of the U. S. Patent Office. J. Am. Chem. Soc., 22(8):478–494, 1900.
[129] B. Hollas. On the Redundancy of Topological Indices. PhD thesis, Universität Ulm, 2005.
[130] R. Höllering. Simulation von Massenspektren und Entwicklung eines Systems zur Reaktionsvorhersage. PhD thesis, Universität Erlangen–Nürnberg, 1998.
[131] H. Horai, M. Arita, S. Kanaya, Y. Nihei, T. Ikeda, K. Suwa, Y. Ojima, K. Tanaka, S. Tanaka, K. Aoshima, Y. Oda, Y. Kakazu, M. Kusano, T. Tohge, F. Matsuda, Y. Sawada, M. Y. Hirai, H. Nakanishi, K. Ikeda, N. Akimoto, T. Maoka, H. Takahashi, T. Ara, N. Sakurai, H. Suzuki, D. Shibata, S. Neumann, T. Iida, K. Tanaka, K. Funatsu, F. Matsuura, T. Soga, R. Taguchi, K. Saito, and T. Nishioka. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom., 45:703–714, 2010.
[132] E. Hückel. Quantentheoretische Beiträge zum Benzolproblem. Zeitschrift für Physik, 70(3–4):204–286, 1931.
[133] E. Hückel. Quantentheoretische Beiträge zum Benzolproblem. Zeitschrift für Physik, 72(5–6):310–337, 1931.
[134] E. Hückel. Quantentheoretische Beiträge zum Problem der aromatischen und ungesättigten Verbindungen. III. Zeitschrift für Physik, 76(9–10):628–648, 1932.
[135] F. Hufsky, M. Rempt, F. Rasche, G. Pohnert, and S. Böcker. De novo analysis of electron impact mass spectra using fragmentation trees. Anal. Chim. Acta, 739:67–76, 2012.
[136] A. v. Humboldt. Versuche über die gereizte Muskel- und Nervenfaser, nebst Vermutungen über den chemischen Prozeß in der Tier- und Pflanzenwelt. Rottmann, Leipzig, 1797.
[137] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. J. Comput. Graph. Stat., 5:299–314, 1996.
[138] H. Irth. Continuous–flow systems for ligand binding and enzyme inhibition assays based on mass spectrometry, pages 185–216. Mass Spectrometry in Medicinal Chemistry. Wiley, 2007.
[139] H. Irth, S. Long, and T. Schenk. High–resolution screening in an expanded chemical space. Current Drug Discovery, 19–23, 2004.
[140] J. Ivanov, S. Karabunarliev, and O. Mekenyan. 3DGEN: A system for exhaustive 3d molecular design proceeding from molecular topology. J. Chem. Inf. Comput. Sci., 34:234–243, 1994.
[141] R. Jaritz. Zur mathematischen Modellierung von Raummodellen chemischer Moleküle. MATCH Commun. Math. Comput. Chem., 37:179–193, 1998.
[142] M. Jerrum. A compact representation for permutation groups. J. Algorithms, 7:60–78, 1986.
[143] C. Jochum and J. Gasteiger. Canonical numbering and constitutional symmetry. J. Chem. Inf. Comput. Sci., 17:113–117, 1977.
[144] A. Katritzky, V. Lobanov, and M. Karelson. CODESSA: Reference Manual, Version 2. University of Florida, 1994.
[145] D. Kell. Metabolomic biomarkers: search, discovery and validation. Expert Rev. Mol. Diagn., 7:329–333, 2007.
[146] A. Kerber. Algebraic Combinatorics via Finite Group Actions. BI-Wissenschaftsverlag, Mannheim, 1990, 2nd edition under the new title Applied Finite Group Actions, Springer, New York, 1999.
[147] A. Kerber and R. Laue. Group actions, double cosets, and homomorphisms: unifying concepts for the constructive theory of discrete structures. Acta Appl. Math., 52:63–90, 1998.
[148] A. Kerber, R. Laue, T. Grüner, and M. Meringer. MOLGEN 4.0. MATCH Commun. Math. Comput. Chem., 37:205–208, 1998.
[149] A. Kerber, R. Laue, and M. Meringer. An application of the structure generator MOLGEN to patents in chemistry. MATCH Commun. Math. Comput. Chem., 47:169–172, 2003.
[150] A. Kerber, R. Laue, M. Meringer, and C. Rücker. Molecules in silico: The generation of structural formulae and its applications. J. Comput. Chem. Jpn., 3:85–96, 2004.

[151] A. Kerber, R. Laue, M. Meringer, and C. Rücker. MOLGEN–QSPR, a software package for the search of quantitative structure property relationships. MATCH Commun. Math. Comput. Chem., 51:187–204, 2004.
[152] A. Kerber, R. Laue, M. Meringer, and C. Rücker. Molecules in silico: Potential versus known organic compounds. MATCH Commun. Math. Comput. Chem., 54:301–312, 2005.
[153] A. Kerber, R. Laue, M. Meringer, and C. Rücker. Molecules in Silico: A Graph Description of Chemical Reactions. J. Chem. Inf. Model., 47:805–817, 2007.
[154] A. Kerber, R. Laue, M. Meringer, and K. Varmuza. MOLGEN–MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation. Adv. Mass Spectrom., 15:939–940, 2001.
[155] A. Kerber, M. Meringer, and C. Rücker. CASE via MS: Ranking structure candidates by mass spectra. Croat. Chem. Acta, 79:449–464, 2006.
[156] L. Kier and L. Hall. The nature of structure–activity relationships and their relation to molecular connectivity. Eur. J. Med. Chem., 12:307–312, 1977.
[157] L. Kier and L. Hall. Molecular Connectivity in Structure–Activity Analysis. Research Studies Press, Chichester, 1986.
[158] L. Kier, W. Murray, M. Randić, and L. Hall. Molecular connectivity v: Connectivity series applied to density. J. Pharm. Sci., 65:1226–1230, 1976.
[159] T. Kind and O. Fiehn. Metabolomic database annotations via query of elemental compositions: mass accuracy is insufficient even at less than 1 ppm. BMC Bioinf., 7:234, 2006.
[160] T. Kind and O. Fiehn. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinf., 8:105, 2007.
[161] T. Kind and O. Fiehn. Advances in structure elucidation of small molecules using mass spectrometry. Bioanalytical Reviews, 2(1–4):23–60, 2010.
[162] J. Kirchmair, G. Wolber, C. Laggner, and T. Langer. Comparative performance assessment of the conformational model generators Omega and Catalyst: A large–scale survey on the retrieval of protein–bound ligand conformations. J. Chem. Inf. Model., 46:1848–1861, 2006.
[163] M. Klin, S. Tratch, and N. Zefirov. 2D-configurations and clique–cyclic orientations of the graphs L(K_p). Rep. Mol. Theory, 1:149–163, 1990.
[164] D. Knuth. Estimating the efficiency of backtrack programs. Mathematics of Computation, 29:121–136, 1975.
[165] E. Konstantinova and V. Skorobogatov. Molecular hypergraphs: The new Representation of nonclassical molecular structures with polycentric delocalized bonds. J. Chem. Inf. Comput. Sci., 35:472–478, 1995.
[166] E. Konstantinova and V. Skorobogatov. Application of hypergraph theory in chemistry. Discrete Math., 235:365–383, 2001.
[167] M. Krauss, H. Singer, and J. Hollender. LC–high resolution MS in environmental analysis: from target screening to the identification of unknowns. Anal. Bioanal. Chem., 397:943–951, 2010.
[168] K. Kumar, A. Menon, and P. Sastry. Computer–assisted determination of elemental composition of fragments in mass spectra. Rapid Commun. Mass Spectrom., 6:585–591, 1992.
[169] V. Kuz’min. The structure of chiral molecules: An analysis of the concept of configuration and mechanisms of stereoisomerization. Russ. J. Phys. Chem., 63:936–941, 1994.
[170] V. Kvasnička and J. Pospíchal. Canonical indexing and constructive enumeration of molecular graphs. J. Chem. Inf. Comput. Sci., 30:99–105, 1990.
[171] V. Kvasnička and J. Pospíchal. An improved method of constructive enumeration of graphs. J. Math. Chem., 9:181–196, 1992.
[172] J. Kwiatkowski. Computer–aided identification of isotopic patterns in low resolution mass spectra. Org. Mass. Spectrom., 13:513–517, 1978.

[173] R. Laue. Computing double coset representatives for the generation of solvable groups. In Proceedings EUROCAM’82, Marseille 1982, pages 65–70. Springer LN in Computer Science 144, 1982.
[174] R. Laue. Construction of combinatorial objects — a tutorial. Bayreuther Mathematische Schriften, 43:53–96, 1993.
[175] R. Laue, T. Grüner, M. Meringer, and A. Kerber. Constrained Generation of Molecular Graphs, volume 69 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 319–332. American Mathematical Society, 2005.
[176] R. Laue. Solving isomorphism problems for t-designs. In W. Wallis, editor, Designs 2002, pages 277–300. Kluwer Academic Publishers, 2003.
[177] A. Leach. A survey of methods for searching the conformational space of small and medium–sized molecules. In K. Lipkowitz and D. Boyd, editors, Reviews in Computational Chemistry, volume 2, pages 1–55. VCH Publishers, New York, 1991.
[178] K. Lebedev and D. Cabrol-Bass. New computer aided methods for revealing structural features of unknown compounds using low resolution mass spectra. J. Chem. Inf. Comput. Sci., 38:410–419, 1998.
[179] J. Lederberg. Rapid calculation of molecular formulas from mass values. J. Chem. Educ., 49:613, 1972.
[180] S. Lehotay, K. Mastovska, A. Amirav, A. Fialkov, P. Martos, A. de Kok, and A. Fernández-Alba. Identification and confirmation of chemical residues in food by chromatography-mass spectrometry and other techniques. Trends Anal. Chem., 27(11):1070–1090, 2008.
[181] J. Li, T. Ehlers, J. Sutter, S. Varma-O’Brien, and J. Kirchmair. CAESAR: A new conformer generation algorithm based on recursive buildup and local rotational symmetry consideration. J. Chem. Inf. Model., 47:1923–1932, 2007.
[182] L. Li, J. Kresh, N. Karabacak, J. Cobb, J. Agar, and P. Hong. A hierarchical algorithm for calculating the isotopic fine structure of molecules. J. Am. Soc. Mass Spectrom., 19:1867–1874, 2008.
[183] R. Lindsay, B. Buchanan, E. Feigenbaum, and J. Lederberg. Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project. McGraw–Hill, New York, 1980.
[184] D. Livingstone and D. Salt. Judging the significance of multiple linear regression models. J. Med. Chem., 48:661–663, 2005.
[185] M. Loos, M. Ruff, and H. Singer. enviMass version 1.0 target screening software, Eawag, Dübendorf, Switzerland. http://www.eawag.ch/forschung/uchem/software/ (accessed 04/07/2012), 2011.
[186] H.-J. Luinge. EXSPEC: A Knowledge–Based System for Structure Analysis of Organic Molecules from Combined Spectral Data. PhD thesis, Universiteit Utrecht, 1989.
[187] H.-J. Luinge and J. van der Maas. Artificial intelligence for the interpretation of combined spectral data. Design and development of a spectrum interpreter. Anal. Chim. Acta, 223:135–147, 1989.
[188] E. Luks. Isomorphism of graphs of bounded valence can be tested in polynomial time. J. Computer Syst. Sci., 25:42–65, 1982.
[189] R. Mannhold, G. Poda, C. Ostermann, and I. Tetko. Calculation of molecular lipophilicity: state–of–the–art and comparison of log P methods on more than 96,000 compounds. J. Pharm. Sci., 98(3):861–893, 2009.
[190] R. Mathon. Sample graphs for isomorphism testing. In Proc. 9th S.–E. Conf. Combinatorics, Graph Theory and Computing, Congressus Numerantium 21, pages 499–517, 1978.
[191] H. Maurer and H. Hopf. The preparation of the last remaining acyclic isomers of benzene. Eur. J. Org. Chem., pages 2702–2707, 2005.
[192] B. McKay. Isomorph–free exhaustive generation. J. Algorithms, 26:306–324, 1998.

[193] B. McKay. Computing automorphisms and canonical labelling of graphs. In Proc. Intern. Conf. on Combinatorial Theory, Lecture Notes in Mathematics No. 686, pages 223–232. Springer, Berlin, 1977.
[194] F. McLafferty and F. Turecek. Interpretation of Mass Spectra. University Science Books, Mill Valley, CA, 4th edition, 1993.
[195] MDL Information Systems. CrossFire Commander Server, version 6.0, 2003.
[196] J. Meiler and M. Meringer. Ranking MOLGEN structure proposals by ¹³C NMR chemical shift prediction with ANALYZE. MATCH Commun. Math. Comput. Chem., 45:85–108, 2002.
[197] J. Meiler, E. Sanli, J. Junker, R. Meusinger, T. Lindel, M. Will, W. Maier, and M. Köck. Validation of structural proposals by substructure analysis and ¹³C NMR chemical shift prediction. J. Chem. Inf. Comput. Sci., 42:241–248, 2002.
[198] J. Meiler and M. Will. Automated structure elucidation of organic molecules from ¹³C NMR spectra using genetic algorithms and neural networks. J. Chem. Inf. Comput. Sci., 41:1535–1546, 2001.
[199] C. Meinert, E. Schymanski, E. Küster, R. Kühne, G. Schüürmann, and W. Brack. Application of preparative capillary gas chromatography (pcGC), automated structure generation and mutagenicity prediction to improve effect–directed analysis of genotoxicants in a contaminated groundwater. Environ. Sci. Pollut. Res., 17(4):885–897, 2010.
[200] M. Meringer. Erzeugung regulärer Graphen. Master’s thesis, Universität Bayreuth, 1996.
[201] M. Meringer. Fast generation of regular graphs and construction of cages. J. Graph Theory, 30:137–146, 1999.
[202] M. Meringer. Mathematische Modelle für die kombinatorische Chemie und die molekulare Strukturaufklärung. PhD thesis, Universität Bayreuth, 2004. Published by Logos–Verlag, Berlin.
[203] M. Meringer. MOLGEN–MS/MS Software User Manual. Available from www.molgen.de, 2009.
[204] M. Meringer. Structure enumeration and sampling. In J.-L. Faulon and A. Bender, editors, Handbook of Chemoinformatics Algorithms, chapter 8, pages 233–267. CRC/Chapman & Hall, Boca Raton, FL, 2010.
[205] M. Meringer, S. Reinker, J. Zhang, and A. Muller. MS/MS data improves automated determination of molecular formulas by mass spectrometry. MATCH Commun. Math. Comput. Chem., 65:259–290, 2011.
[206] R. Merrifield and H. Simmons. Topological methods in chemistry. Wiley–Interscience, New York, 1989.
[207] D. Meyer. Support vector machines. R News, 1/3:23–26, 2001.
[208] A. Miller. Subset Selection in Regression. Chapman and Hall, London, 1990.
[209] K. Mislow and J. Siegel. Stereoisomerism and local chirality. J. Am. Chem. Soc., 106:3319–3328, 1984.
[210] M. Mnëv. The universality theorems on the oriented matroid stratification of the space of real matrices. In P. Gritzmann and B. Sturmfels, editors, Applied Geometry and Discrete Mathematics – The Victor Klee Festschrift, volume 6 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 237–243. American Mathematical Society, Providence, RI, 1991.
[211] M. Molchanova and N. Zefirov. Irredundant generation of isomeric molecular structures with some known fragments. J. Chem. Inf. Comput. Sci., 38:8–22, 1998.
[212] S. Molodtsov. The generation of molecular graphs with obligatory, forbidden and desirable fragments. MATCH Commun. Math. Comput. Chem., 37:157–162, 1998.
[213] H. Morgan. The generation of a unique machine description for chemical structures — a technique developed at Chemical Abstracts Service. J. Chem. Doc., 5:107–113, 1965.

Bibliography

| 469

[214] D. Moser. Die computerunterstützte Konstruktion molekularer Graphen. Master’s thesis, Universität Bayreuth, 1987. [215] G. Moss. Basic terminology of stereochemistry (IUPAC recommendations 1996). Pure Appl. Chem., 68:2193–2222, 1996. [216] I. Mun, R. Venkataraghavan, and F. McLafferty. Computer prediction of molecular weights from mass spectra. Anal. Chem., 53:179–182, 1981. [217] I. Mun, R. Venkataraghavan, and F. McLafferty. Molecular weight parity predicted from the parity of mass spectral peaks. Org. Mass Spectrom., 16:82–84, 1981. [218] National Center for Biotechnology Information (NCBI). PubChem. http://pubchem.ncbi.nlm.nih.gov, 2012. [219] R. Neudert and M. Penk. Enhanced structure elucidation. J. Chem. Inf. Comput. Sci., 36:244–248, 1996. [220] P. M. Neumann. A lemma that is not Burnside’s. Math. Scientist, 4:133–141, 1979. [221] S. Neumann and S. Böcker. Computational mass spectrometry for metabolomics: Identifi­ cation of metabolites and small molecules. Anal. Bioanal. Chem., 398(7–8):2779–2788, 2010. [222] S. Nikolić, G. Kovačević, A. Miličević, and N. Trinajstić. The Zagreb indices 30 years after. Croat. Chem. Acta, 76:113–124, 2003. [223] NIST. Automated Mass Spectral Deconvolution and Identification System (AMDIS) 2005 Ver­ sion. National Institute of Standards and Technology (NIST), US Department of Commerce, USA, 2005. [224] NIST/EPA/NIH, National Institute of Standards and Technology (NIST), U.S. Department of Commerce, Gaithersburg, MD, USA. NIST Mass Spectral Library ’98 Version, 1998. [225] NIST/EPA/NIH, National Institute of Standards and Technology (NIST), U.S. Department of Commerce, Gaithersburg, MD, USA. NIST Mass Spectral Library ’05 Version, 2005. [226] NIST/EPA/NIH, National Institute of Standards and Technology (NIST), U.S. Department of Commerce, Gaithersburg, MD, USA. NIST Mass Spectral Library ’11 Version, 2011. [227] J. Nourse. 
The configuration symmetry group and its application to stereoisomer generation, specification, and enumeration. J. Am. Chem. Soc., 101:1210–1216, 1979. [228] J. Nourse, R. Carhart, D. Smith, and C. Djerassi. Exhaustive generation of stereoisomers for structure elucidation. J. Am. Chem. Soc., 101:1216–1223, 1979. [229] J. Nourse, D. Smith, R. Carhart, and C. Djerassi. Computer–assisted elucidation of molecular structure with stereochemistry. J. Am. Chem. Soc., 102:6289–6295, 1980. [230] OECD. Guideline for the Testing of Chemicals 117. Partition Coefficient (n-octanol/water) High Performance Liquid Chromatography (HPLC) Method. Organisation for Economic Co-op­ eration and Development (OECD), Paris, France, 1989. [231] M. Otto. Chemometrie: Statistik und Computereinsatz in der Analytik. VCH, Weinheim, 1997. [232] P. Penchev, G. Andreev, and K. Varmuza. Automatic classification of infrared spectra using a set of improved expert–based features. Anal. Chim. Acta, 388:145–159, 1999. [233] T. Pluskal, T. Uehara, and M. Yanagida. Highly accurate chemical formula prediction tool uti­ lizing high–resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem., 84:4396–4403, 2012. [234] G. Pólya. Kombinatorische Anzahlbestimmungen für Gruppen, Graphen und chemische Verbindungen. Acta Math., 68:145–254, 1937. [235] G. Pólya and R. Read. Combinatorial Enumeration of Groups, Graphs, and Chemical Com­ pounds. Springer, New York, 1987. [236] V. Prelog and G. Helmchen. Basic principles of the CIP–system and proposals for a revision. Angew. Chem. Int. Ed. Engl., 21:567–583, 1982. [237] E. Pretsch, P. Bühlmann, C. Affolter, and M. Badertscher. Spektroskopische Daten zur Strukturaufklärung organischer Verbindungen. Springer, Berlin, 2001.

470 | Bibliography [238] E. Pretsch and J. Clerc. Spectra Interpretation of Organic Compounds. VCH, Weinheim, 1997. [239] J. Quinlan. Learning with Continuous Classes. In A. Adams and L. Sterling, editors, Proceed­ ings of the 5th Australian Joint Conference on Artificial Intelligence AI ’92, pages 343–348, 1992. [240] M. Randić. On characterization of molecular branching. J. Am. Chem. Soc., 97:6609–6615, 1975. [241] M. Randić, G. Brissey, and L. Wilkins. Computer perception of topological symmetry via cano­ nical numbering of atoms. J. Chem. Inf. Comput. Sci., 21:52–59, 1981. [242] F. Rasche, K. Scheubert, F. Hufsky, T. Zichner, M. Kai, A. Svatos, and S. Böcker. Identifying the unknowns by aligning fragmentation trees. Anal. Chem., 84(7):3417–3426, 2012. [243] F. Rasche, A. Svatoš, R. Maddula, C. Böttcher, and S. Böcker. Computing fragmentation trees from tandem mass spectrometry data. Anal. Chem., 83:1243–1251, 2011. [244] A. Ratkiewicz and T. Truong. Application of chemical graph theory for automated mechanism generation. J. Chem. Inf. Comput. Sci., 43:36–44, 2003. [245] M. Razinger, K. Balasubramanian, M. Perdih, and M. Munk. Stereoisomer generation in com­ puter–enhanced structure elucidation. J. Chem. Inf. Comput. Sci., 33:812–825, 1993. [246] R. Read. Everyone a winner. Ann. Discr. Math., 2:107–120, 1978. [247] R. Read and D. Corneil. The graph isomorphism disease. J. Graph Theory, 1:339–363, 1977. [248] T. Renau, J. Sanchez, J. Gage, J. Dever, M. Shapiro, S. Gracheck, and J. Domagala. Struc­ ture–activity relationships of the quinolone antibacterials against mycobacteria: Effect of structural changes at N–1 and C–7. J. Med. Chem., 39:729–735, 1996. [249] J. Richter-Gebert. Two interesting oriented matroids. Doc. Math. J. DMV, 1:137–148, 1996. [250] H. Rinne. Taschenbuch der Statistik. Verlag Harry Deutsch, Frankfurt, 2003. [251] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996. [252] A. Roberts and M. 
Knackstedt. Structure–property correlations in model composite materials. Phys. Rev. E, 54:2313–2328, 1996. [253] A. Rockwood and P. Haimi. Efficient calculation of accurate masses of isotopic peaks. J. Am. Soc. Mass. Spectr., 17:415–419, 2006. [254] A. Rockwood and S. Van Orden. Ultrahigh–speed calculation of isotope distributions. Anal. Chem., 68:2027–2030, 1996. [255] A. Rockwood, S. Van Orden, and R. Smith. Rapid calculation of isotope distributions. Anal. Chem., 67:2699–2704, 1995. [256] M. Rojas-Chertó, P. Kasper, E. Willighagen, R. Vreeken, T. Hankemeier, and T. Reijmers. Ele­ mental composition determination based on MS𝑛 . Bioinformatics, 27(17):2376–2383, 2011. [257] M. Rojas-Chertó, J. Peironcely, P. Kasper, J. van der Hooft, R. H. de Vos, R. Vreeken, T. Hanke­ meier, and T. Reijmers. Metabolite identification using automated comparison of high–reso­ lution multistage mass spectral trees. Anal. Chem., 84(13):5524–5534, 2012. [258] C. Rostad and W. Pereira. Kovats and Lee retention indexes determined by gas chromatog­ raphy/mass spectrometry for organic compounds of environmental interest. Journal of High Resolution Chromatography and Chromatography Communications, 9(6):328–334, 1986. [259] Royal Society of Chemistry (RSC). ChemSpider. http://www.chemspider.com, 2012. [260] E. Ruch. The diagram lattice as structural principle. Theor. Chim. Acta, 38:167–183, 1975. [261] E. Ruch and I. Gutman. The branching extent of graphs. J. Comb. Inf. Syst. Sci., 4:285–295, 1979. [262] C. Rücker, J. Braun, A. Kerber, and R. Laue. The Molecular Descriptors Computed with MOLGEN. online under http://molgen.de/documents/molgenqspr-descriptors/MOLGEN_ Descriptors.pdf, 2003.

Bibliography

| 471

[263] C. Rücker and M. Meringer. How many organic compounds are graph–theoretically nonpla­ nar? MATCH Commun. Math. Comput. Chem., 45:153–172, 2002. [264] C. Rücker and G. Rücker. Counts of all walks as atomic and molecular Descriptors. J. Chem. Inf. Comput. Sci., 33:683–695, 1993. [265] C. Rücker, G. Rücker, and M. Meringer. Exploring the limits of graph–invariant– and spec­ trum–based discrimination of (sub)structures. J. Chem. Inf. Comput. Sci., 42:640–650, 2002. [266] C. Rücker, G. Rücker, and M. Meringer. y-Randomization and its variants in QSPR/QSAR. J. Chem. Inf. Model., 47:2345–2357, 2007. [267] G. Rücker and C. Rücker. Nomenclature of organic polycycles out of the computer – How to escape the jungle of the secondary bridges. Chimia, 44:116–120, 1990. [268] G. Rücker and C. Rücker. On using the adjacency matrix power method for perception of symmetry and for isomorphism testing of highly intricate graphs. J. Chem. Inf. Comput. Sci., 31:123–126, 1991. [269] G. Rücker and C. Rücker. Walk counts, labyrinthicity, and complexity of acyclic and cyclic graphs and molecules. J. Chem. Inf. Comput. Sci., 40:99–106, 2000. [270] G. Rücker and C. Rücker. On finding nonisomorphic connected subgraphs and distinct mole­ cular substructures. J. Chem. Inf. Comput. Sci., 41:314–320, 2001. [271] J. Sadowski and J. Gasteiger. From atoms and bonds to three–dimensional atomic coordi­ nates: Automatic model builders. Chem. Rev., 93:2567–2581, 1993. [272] J. Sadowski, J. Gasteiger, and G. Klebe. Comparison of automatic three–dimensional model builders using 639 x–ray structures. J. Chem. Inf. Comput. Sci., 34:1000–1008, 1994. [273] S. Sardana and A. Madan. Application of graph theory: Relationship of antimycobacterial activity of quinolone derivatives with eccentric connectivity index and Zagreb group parame­ ters. MATCH Commun. Math. Comput. Chem., 45:35–53, 2002. [274] K. Schittkowski. NLPQL; a FORTRAN subroutine solving constrained nonlinear programming problems. Ann. Operat. 
Res., 5:485–500, 1985. [275] B. Schmalz. The t-designs with prescribed automorphism group, new simple 6-designs. J. Comb. Designs, 1:125–170, 1993. [276] B. Schmalz. Verwendung von Untergruppenleitern zur Bestimmung von Doppelnebenklassen. Bayreuther Mathematische Schriften, 31:109–143, 1993. [277] U. Schobel, M. Frenay, D. Van Elswijk, J. McAndrews, K. Long, L. Olson, S. Bobzin, and H. Irth. High resolution screening of plant natural product extracts for estrogen receptor 𝛼 and 𝛽 binding activity using an online HPLC–MS biochemical detection system. J. Biomol. Screen., 6:291–303, 2001. [278] M. Schocker. On degree sequences of graphs with given cyclomatic number. Publ. Inst. Math. (Beograd) (N.S.), 69(83):34–40, 2001. [279] H. Schultz. Topological organic chemistry. 1. Graph theory and topological indices of alkanes. J. Chem. Inf. Comput. Sci., 29:227–228, 1989. [280] H. Schultz and T. Schultz. Topological organic chemistry. 6. Graph theory and molecular topo­ logical indices of cycloalkanes. J. Chem. Inf. Comput. Sci., 33:240–244, 1993. [281] K.-P. Schulz. Computergestützte Untersuchungen über Zusammenhänge zwischen Struktur und Massenspektrum. PhD thesis, Technische Universität München, 1992. [282] T. Schulze, S. Weiss, E. Schymanski, P. von der Ohe, M. Schmitt-Jansen, R. Altenburger, G. Streck, and W. Brack. Identification of a photo-transformation product of diclofenac us­ ing effect-directed analysis. Environ. Pollut., 158(5):1461–1466, 2010. [283] E. Schymanski. Integrated Analytical and Computer Tools for Toxicant Identification in Effec­ t–Directed Analysis. Doctoral Thesis, Faculty for Chemistry and Physics, Technical University Bergakadamie Freiberg, Freiberg, Germany, 2011.

472 | Bibliography [284] E. Schymanski, C. Gallampois, M. Krauss, M. Meringer, S. Neumann, T. Schulze, S. Wolf, and W. Brack. Consensus structure elucidation combining GC/EI-MS, structure generation, and calculated properties. Anal. Chem., 84:3287–3295, 2012. [285] E. Schymanski, C. Meinert, M. Meringer, and W. Brack. The use of MS classifiers and structure generation to assist in the identification of unknowns in effect–directed analysis. Anal. Chim. Acta, 615:136–147, 2008. [286] E. Schymanski, M. Meringer, and W. Brack. Matching structures to mass spectra using frag­ mentation patterns: Are the results as good as they look? Anal. Chem., 81:3608–3617, 2009. [287] E. Schymanski, M. Meringer, and W. Brack. Automated Strategies to Identify Compounds on the Basis of GC/EI-MS and Calculated Properties. Anal. Chem., 83:903–912, 2011. [288] Scientific Instrument Services, Inc. (SIS). Exact Masses and Isotopic Abundances, Alphabeti­ cal Listing. http://www.sisweb.com/referenc/source/exactmaa.htm (accessed 03/07/2012), 2010. [289] D. Scott. Rapid and accurate method for estimating molecular weights of organic compounds from low–resolution mass spectra. Chemometr. Intell. Lab. Syst., 16(3):193–202, 1992. [290] H. Scsibrany and K. Varmuza. ToSiM: PC–software for the investigation of topological similar­ ities in molecules. Software–Entwicklung in der Chemie, 8:235–249, 1994. [291] B. Seebas and E. Pretsch. Automated compatibility tests of the molecular formulas or struc­ tures of organic compounds with their mass spectra. J. Chem. Inf. Comput. Sci., 39:713–717, 1999. [292] J. Senior. Partitions and their representative graphs. Am. J. Math., 73:663–689, 1951. [293] C. Shelley. Heuristic approach for displaying chemical structures. J. Chem. Inf. Comput. Sci., 23:61–65, 1983. [294] P. Shor. Stretchability of pseudolines is NP–hard. In P. Gritzmann and B. 
Sturmfels, editors, Applied Geometry and Discrete Mathematics – The Victor Klee Festschrift, volume 6 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 531–554. American Mathematical Society, Providence, RI, 1991. [295] S. Shrikhande. On a characterization of the triangular association scheme. Ann. Math. Statist., 30:39–47, 1959. [296] C. Sims. Computation with permutation groups. In S. Petrick, editor, Proceedings of the Sec­ ond Symposium on Symbolic and Algebraic Manipulation, pages 23–28, New York, 1971. [297] C. Smith, G. O’Maille, E. Want, C. Qin, S. A. Trauger, T. Brandon, D. Custodio, R. Abagyan, and G. Siuzdak. METLIN: a metabolite mass spectral database. Therapeutic Drug Monitoring, 27:747–751, 2005. [298] M. Sonntag. Eine Anwendung des Algorithmus von Ford und Fulkerson bei der Interpretation von Infrarotspektren. MATCH Commun. Math. Comput. Chem., 30:37–51, 1994. [299] J. Stapleton. Linear Statistical Models. Wiley, New York, 1995. [300] S. Stein, V. Babushok, R. Brown, and P. J. Linstrom. Estimation of Kovats retention indices using group contributions. J. Chem. Inf. Model., 47(3):975–980, 2007. [301] S. Stein and D. Scott. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom., 5:859–866, 1994. [302] M. Stravs, E. Schymanski, H. Singer, and J. Hollender. Automatic recalibration and processing of tandem MS spectra using formula annotation. J. Mass Spectrom., 48:89–99, 2013. [303] O. Temkin, A. Zeigarnik, and D. Bonchev. Chemical Reaction Networks: A Graph–Theoretical Approach. CRC Press, Boca Raton, FL, 1996. [304] R. Todeschini and V. Consonni. Handbook of Molecular Descriptors. Wiley–VCH, Weinheim, 2000, 2nd edition under the new title Molecular Descriptors for Chemoinformatics, 2009. [305] J. Topliss and R. Edwards. Chance factors in studies of quantitative structure–activity rela­ tionships. J. Med. Chem., 22:1238–1244, 1979.

Bibliography

|

473

[306] S. Trach. Mathematical models in stereochemistry. i. Combinatorial characteristics of compo­ sition, connection, and configuration of organic molecules. Russ. J. Org. Chem., 31:1189–1217, 1995. [307] S. Tratch, M. Molchanova, and N. Zefirov. A unified approach to characterization of molecu­ lar composition, connectivity and configuration: Symmetry, chirality, and generation prob­ lems for the corresponding combinatorial objects. MATCH Commun. Math. Comput. Chem., 61:217–266, 2009. [308] S. Tratch and N. Zefirov. Combinatorial models and algorithms in chemistry. The ladder of combinatorial objects and its application to the formalization of structural problems of or­ ganic chemistry. In N. Stepanov, editor, Principles of Symmetry and Systemology in Chemis­ try, pages 54–86. Moscow State University Publishers, Moscow, 1987. [309] S. Tratch and N. Zefirov. Algebraic chirality criteria and their application to chirality classifica­ tion in rigid molecular systems. J. Chem. Inf. Comput. Sci., 36:448–464, 1996. [310] I. Ugi, A. Dömling, B. Gruber, C. Heilingbrunner, C. Heiß, and W. Hörl. Formale Unterstützung bei Multikomponentenreaktionen: Automatisierung der Synthesechemie. Software–Entwick­ lung in der Chemie, 9:113–128, 1995. [311] USEPA. Estimation Program Interface (EPI) Suite (TM). United States Environmental Protection Agency (USEPA), USA. http://www.epa.gov/oppt/exposure/pubs/episuite.htm, 2011. [312] M. Vainio and M. Johnson. Generating conformer ensembles using a multiobjective genetic algorithm. J. Chem. Inf. Model., 47:2462–2474, 2007. [313] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995. [314] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. [315] K. Varmuza. Pattern Recognition in Chemistry. Springer, Berlin, 1980. [316] K. Varmuza. Automated recognition of isotope peak patterns in mass spectra. Fresenius J. Anal. Chem., 322:170–174, 1985. [317] K. Varmuza et al. MSclass. 
Software for Chemical–Structure–Structure–Oriented Classification of Mass Spectra. Classifier Guide. Applied ChemoMetrics, Technische Universität Wien, 1996. [318] K. Varmuza et al. MSclass. Software for Chemical–Structure–Structure–Oriented Classification of Mass Spectra. Reference and User Guide. Applied ChemoMetrics, Technische Universität Wien, 1996. [319] K. Varmuza, P. He, and K.-T. Fang. Boosting applied to classification of mass spectral data. J. Data Sci., 1:391–404, 2003. [320] K. Varmuza, U. Jordis, and G. Wolf. Database mining for heterocycles: Are structures of small heterocycles generated by a computer program present in databases? http://www.ch.ic.ac. uk/ectoc/echet96/papers/014/, 1996. [321] K. Varmuza, P. Penchev, F. Stancl, and W. Werther. Systematic structure elucidation of organic compounds by mass spectra classification. J. Mol. Struct., 408/409:91–96, 1997. [322] K. Varmuza and H. Scsibrany. Cluster analysis of chemical structures based on binary mole­ cular descriptors and principal component analysis. Software–Entwicklung in der Chemie, 9:81–90, 1995. [323] K. Varmuza and H. Scsibrany. Substructure isomorphism matrix. J. Chem. Inf. Comput. Sci., 40:308–313, 2000. [324] K. Varmuza and W. Werther. Mass spectral classifiers for supporting systematic structure elucidation. J. Chem. Inf. Comput. Sci., 36:323–333, 1996. [325] K. Varmuza, W. Werther, F. Stancl, A. Kerber, and R. Laue. Computer–assisted structure elu­ cidation of organic compounds, based on mass spectra classification and exhaustive isomer generation. Software–Entwicklung in der Chemie, 10:303–314, 1996. [326] L. G. Wade, Jr. Precision in stereochemical terminology. J. Chem. Educ., 83:1793–1794, 2006.

474 | Bibliography [327] M. Wagener and V. van Geerestein. Potential drugs and nondrugs: Prediction and identifi­ cation of important structural features. J. Chem. Inf. Comput. Sci., 40:280–292, 2000. [328] C. Walter. Adjacency matrices. SIAM J. Alg. Disc. Methods, 7:18–29, 1986. [329] W. Warr and C. Suhr. Chemical Information Management. VCH, Weinheim, 1992. [330] V. Warth, F. Battin-Leclerc, R. Fournet, P. A. Glaude, G. M. Côme, and G. Scacchi. Compu­ ter based generation of reaction mechanisms for gas–phase oxidation. Comput. & Chem., 24:541–560, 2000. [331] D. Weininger, A. Weininger, and J. Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci., 29:97–101, 1989. [332] B. Weisfeiler. On Construction and Identification of Graphs. Lecture Notes in Mathematics No. 558. Springer, Berlin, 1976. [333] W. Werther. Einsatz von Methoden der explorativen Datenanalyse zur Interpretation und Klas­ sifikation von Massenspektren. PhD thesis, Technische Universität Wien, 1993. [334] W. Werther. Versuch einer Systematik der Reaktionsmöglichkeiten in der Elektronenstoß–­ Massenspektrometrie (EI–MS). Unpublished, 1996. [335] W. Werther, H. Lohninger, F. Stancl, and K. Varmuza. Classification of mass spectra: A com­ parison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst., 22:63–76, 1994. [336] T. Wieland. Erzeugung, Abzählung und Konstruktion von Stereoisomeren. MATCH Commun. Math. Comput. Chem., 31:153–203, 1994. [337] T. Wieland. Konstruktionsalgorithmen bei molekularen Graphen und deren Anwendung. MATCH Commun. Math. Comput. Chem., 36:7–157, 1997. [338] T. Wieland, A. Kerber, and R. Laue. Principles of the generation of constitutional and configu­ rational isomers. J. Chem. Inf. Comput. Sci., 36:413–419, 1996. [339] H. Wiener. Structural determination of paraffin boiling points. J. Am. Chem. Soc., 69:17–20, 1947. [340] M. Will, W. Fachinger, and J. Richert. 
Fully automated structure elucidation — a spectro­ scopist’s dream comes true. J. Chem. Inf. Comput. Sci., 36:221–227, 1996. [341] D. Williams and I. Fleming. Spectroscopic Methods in Organic Chemistry. McGraw–Hill, Lon­ don, 1989. [342] W. Wipke and T. Dyott. Simulation and evaluation of chemical synthesis. Computer represen­ tation and manipulation of stereochemistry. J. Am. Chem. Soc., 96:4825–4834, 1974. [343] W. Wipke and T. Dyott. Stereochemically unique naming algorithm. J. Am. Chem. Soc., 96:4834–4842, 1974. [344] S. Wolf, S. Schmidt, M. Müller-Hannemann, and S. Neumann. In silico fragmentation for com­ puter assisted identification of metabolite mass spectra. BMC Bioinformatics, 11:148, 2010. [345] E. M. Wright. Burnside’s lemma: A historical note. J. Comb. Theory, Ser. B, 30:89–90, 1981. [346] Y. Xu, J.-F. Heilier, G. Madalinski, E. Genin, E. Ezan, J.-C. Tabet, and C. Junot. Evaluation of accurate mass and relative isotopic abundance measurements in the LTQ-Orbitrap mass spec­ trometer for further metabolomics database building. Anal. Chem., 82:5490–5501, 2010. [347] N. Zefirov and S. Tratch. Some notes on Randić–Razinger’s approach to characterization of molecular shapes. J. Chem. Inf. Comput. Sci., 37:900–912, 1997. [348] X. Zhang, D. Wei, Y. Yap, L. Li, S. Guo, and F. Chen. Mass spectrometry-based “omics” tech­ nologies in cancer diagnostics. Mass Spectrom. Rev., 26:403–431, 2007. [349] L. Zlatina and M. Elyashberg. Generation of stereoisomers and their spatial models cor­ responding to the given molecular structure. MATCH Commun. Math. Comput. Chem., 27:191–207, 1992. [350] J. Zupan and J. Gasteiger. Neural Networks for Chemists. VCH, Weinheim, 1993.

List of abbreviations

AAS       Atomic Absorption Spectroscopy
AB        Aromatic Bond
ABA       Anti-mycoBacterial Activity
AES       Atomic Emission Spectroscopy
AMDIS     Automated Mass Spectral Deconvolution and Identification System
ANN       Artificial Neural Network
ARP       Absolute Ranking Position
BP        Boiling Point
BRN       Beilstein Registry Number
BSS       Best Subset Selection
CART      Classification and Regression Trees
CAS–RN    Chemical Abstracts Service Registry Number
CASE      Computer-Aided Structure Elucidation
CSC       Closed Shell Chemistry
CT        Classification Tree
CV        Cross-Validation
DB        Double Bond
DBE       Double Bond Equivalent
DENDRAL   DENDRitic ALgorithm
EDA       Effect-Directed Analysis
EEI       Even Electron Ion
EI        Electron Impact
ESI       ElectroSpray Ionization
FR        Fisher Ratio
FT–ICR    Fourier Transform Ion Cyclotron Resonance (MS)
FWHM      Full Width at Half Maximum
GC        Gas Chromatography
HN        Hidden Neuron
HPLC      High Performance Liquid Chromatography
HRS       High Resolution Screening
HTS       High Throughput Screening
IC        Integral Chemistry
InChI     IUPAC Chemical Identifier
IR        InfraRed (spectroscopy)
IUPAC     International Union of Pure and Applied Chemistry
KNN       k Nearest Neighbors
KRI       Kováts Retention Index
LC        Liquid Chromatography
LDA       Linear Discriminant Analysis
LM        Linear Model
LOOCV     Leave-One-Out Cross-Validation
LR        Low Resolution
LRI       Lee Retention Index
LS        Least Squares
MC        Multicenter Chemistry
MCE       Mean Classification Error
MIC       Minimal Inhibitory Concentration
MLR       Multiple Linear Regression
MMD       Minimal Mass Difference
MS        Mass Spectrometry, Mass Spectrum
MV        Match Value
NMR       Nuclear Magnetic Resonance (spectroscopy)
OEI       Odd Electron Ion
OLS       Ordinary Least Squares
ON        OrthoNormal(-basis)
PAH       Polycyclic Aromatic Hydrocarbon
PCA       Principal Component Analysis
PCR       Principal Component Regression
PD        Physical Density
POF       Partial Orientation Function
QDA       Quadratic Discriminant Analysis
QSAR      Quantitative Structure–Activity Relationship
QSPR      Quantitative Structure–Property Relationship
RBF       Radial Base Function
RC        Restricted Chemistry
RI        Retention Index
RIA       Relative Isotopic Abundance
RP–HPLC   Reversed-Phase HPLC
RRP       Relative Ranking Position
RSS       Residual Sum of Squares
RT        Regression Tree
SB        Single Bond
SC        Substructure Count
SE        Steric Energy
SI        Soft Ionization
SIMCA     Soft Independent Modeling of Class Analogy
SP        Structural Property
SVD       Singular Value Decomposition
SVM       Support Vector Machine
TB        Triple Bond
TCE       Total Classification Error
TI        Topological Index
TP        Transformation Product
UV        UltraViolet (spectroscopy)
vdW       van der Waals (radius, volume)

Index 2D placement 72 3D placement 72 3D space 91 A absolute – errors 375 abstract – basis class 63 – minimal Radon partition 152 access – random 219 – sequential 219 achiral 92, 104 – skeleton 107 action – conjugation 38 – finite 21 – group 21 – similar 118 adjacency – matrix 16 admissible atom state 26 affine – dependence 229 algorithm – Dixon–Wilf 53 alternating – group 138 – mapping 137 ambiguous molecular graph 62 analysis – cluster 295 – linear discriminant 233 – principal component 295 – quadratic discriminant 293 Ångström 247 antisymmetric 81 arithmetical – description 1 – descriptor 78 – level 3 aromatic – bond 70 – ring 70

aromaticity 70 – restriction of 182 artificial neural network 233 associative 19 asymmetric 118 atom – number 26 – state 14, 26 – type 63 atom mass – average 363 atom states – restriction of 182 atom type – any 63 – element 63 – MS 63 – multi 63 – standard 63 atomic – profile 272 atomic absorption spectroscopy 297 atomic emission spectroscopy 297 atoms – hetero 31 autocorrelation descriptor 340 automorphism – orientational 141 automorphism group 86 – geometric 141 – orientational 141 autoscaling 227 average atom mass 363 B badlist 65 – permanent 358 barycenter 94 barycentric – placement 94 Basak’s information content 247 base mass 306 basis – ordered 102 – orthonormal 96

478 | Index basis class – abstract 63 bias 233 bias neuron 233 bijection 19 bijective 19 – mapping 19 binary – classification 223 – Grassmann–Plücker relations 154 – molecular descriptor 250 binary rooted tree 236 bipartite – graph 191 bond – aromatic 70 – covalent 13, 26 – degree 16 – double 15, 26 – graph 13 – matrix 16 – multiplicity 13, 26 – single 14, 15, 26 – triple 14, 15 bond degrees – partition of 80 – sequence of 16, 80 bond multiplicities – matrix of 16 bonds – change of 66 – restriction of 182 bonds graph – change of 68 branching 80, 245 – extent 81 C canonic – transversal 165 canonical – form 192 canonicity – test 179 Cauchy–Frobenius – Lemma of 36 center of – inversion 98 – reaction 67

centering 227 central – molecule 190 centroid 341 chain – Sims 171 chain length – variation of 200 change of – bonds 66 – bonds graph 68 – reaction graph 66 – state 66 – states 68 charge 26 – restriction of 182 chemical – element 26 – formula 314 – reaction 66 chemistry – closed shell 27 – combinatorial 240 – integral 27 – multicenter 27 – restricted 27 chiral 92, 104 – object 105 – skeleton 107 chirality 3, 91, 104 chirotope 152, 158 Cholesky decomposition 232 chromatography – gas 299 – liquid 299 class – conjugacy 38 – equivalence 18 – stabilizer 118 classification 223 – binary 223 classification error – mean 225 – total 223 classification tree 236 classifier – mass spectra 338 clockwise – rotation 103

Index | 479

closed – subgraph 59 – walk 56 closed shell – chemistry 27 cluster – analysis 295 coefficient of determination 224 color 26 coloring 14 column – length 82 combinatorial – chemistry 240 – library 240 comparison – of spectra 299 comparison values – for structures 306 compatible – molecular formula 181 complementary – information content 247 completely – correlated 229 components – connected 59 composition 19 configuration 132 conformation 132 conformer 3, 132 conjugacy – class 38 conjugacy class of – subgroups 118 conjugation – action 38 connected 17 – components 59 – graph 58, 59 – molecular graph 30 – nodes 58 connected component – trivial 59 connectedness 59 connectivity – isomer 32 – restriction of 182 – stack 176

constitution 132 constitutional – isomer 32 constrained – generation 164 content 44, 45 – racemic 113 continuous – variable 221 convex hull 150 convolution – product 310 correlated – completely 229 correlation coefficient – multiple 224 coset – double 48 – left 48 – right 48 cost function 223 – zero-one 223 count – substructure 250 covalent – bond 13, 26 cross-validation 226 cycle 39, 56 – girth 58 – index 45 – length of 40 – notation – standard 41 – order of 40 – partition 42 – type 42 cyclic – factors 40 – permutation 39 cyclically – permuted 39 cyclomatic – number 79, 84 D dalton 244 DBE – restriction of 182 decanes 251

480 | Index decision tree 236 decomposition – reaction 66 degree – bond 16 – valence 246 degrees of freedom 225 delabeling 17 density – physical 270 – topological 273 dependence – affine 229 dependent – variable 221 depth – of a reaction scheme 193 – of reactant 195 description – arithmetical 1 descriptor – arithmetical 78 – autocorrelation 340 – geometrical 87, 247 – ion series 340 – molecular 77, 242 – MS 340 – purely arithmetical 78 – spectra type 341 – topological 79 determinant – volume 134 diagonal – subgroup 108 diagram – Young 81 direct – product 48 directed – graph 191 directional – sense 103 disconnected 17 discrete – partition 207 – variable 221 disjoint cycles 40 distance 59 – degree 245

– Euclidean 94 – matrix 84 – substructure restriction 64 Dixon–Wilf – algorithm 53 dominance – order 81 dot – product 94, 375 dot product – normalized 375 double – bond 15, 26 – coset 48 – transposition 138 double bond equivalent 35 down – step 129 E educt 191 educts 66 electron 26 – unpaired 26 element – chemical 26 – heavy 63 – identity 19 – symmetry 21, 99 elements – enantiomorphic 105 – self-enantiomorphic 105 – symmetry 98 embedding 64 – as molecular substructure 65 – of graphs 60 empirical – formula 35 – 𝐹 value 225 enantiomer 104 enantiomeric 104 enantiomorph 92, 104 enantiomorphic 104 – elements 105 – orbit 105 – pair 105 energy – steric 73

Index | 481

enumeration by – symmetry 118 equivalence – class 18 – relation 18 equivalence class – of molecular graphs 30 equivalent – molecular graphs 30 errors – absolute 375 – squared 375 Euclidean – distance 94 – metric 94 – norm 228 – space 91 even – permutation 138 even electron – ion 318 extent – branching 81 F feature – MS 340 feedforward network 233 finite – action 21 – fuzzy formula 181 Fisher ratio 230 fixed – point 37 folding 310 form – canonical 192 – H-suppressed 30 formula 314 – chemical 314 – empirical 35 – generic 164 – Markush 202 – molecular 32 – structural 32 formula-based – generation 164 – structure generation 7

fragmentation 303 – schemes 193 free electron – pairs 26 function – generating 44, 114 – Möbius 120 – orientation 135, 136 – symmetry 341 – zeta 119 Fundamental – Lemma 48 fuzzy – molecular formula 181 fuzzy formula – finite 181 𝐹 value – empirical 225 G game – ladder 129 gas – chromatography 299 generating – function 44, 114 generation – constrained 164 – formula-based 164 – orderly 141 – reaction-based 164 generic – formula 164 geometric – automorphism group 141 geometrical – descriptor 87, 247 – index 87 – level 3, 5 girth 58, 84 – cycle 58 gluing – lemma 179 goodlist 65 graph – bipartite 191 – bond 13 – connected 58, 59 – directed 191

– labeled 14 – model 13 – molecular 13, 14 – molecule 16, 30, 77 – product 66 – reactant 66 – reaction 66 – simple 13, 14, 16 – unlabeled 14 graphical – partitions 81 Grassmann–Plücker relations – binary 154 ground – state 26 group 19 – action 21 – alternating 138 – orthogonal 95, 98 – point 21, 98, 99 – power 187 – reduction function 45 – rotation 107 – symmetric 20, 36 – symmetry 21 H handedness 92 heavy – element 63 hetero – atoms 31 hidden layer 233 hidden neuron 233 high throughput – screening 240 highest – isotope mass 307 – isotopomer mass 310 highest mass 306, 309 highest random – mean 259 highly resolved – isotope masses 363 Hill – system 33 hit list 298 homology – variation of 200

homomorphism 104 – principle 131 H-suppressed – form 30 – molecular graph 30 hybridization 84 – substructure restriction 65 hydrogen distribution – restriction of 182 hyperplane – separating 235 I identity – element 19 image 15 – inverse 15 improper – rotation 98 improper axis of – rotation 98 independent – variable 221 index 48 – cycle 45 – geometrical 87 – retention 398 – topological 242, 245 – Wiener 85, 245 indices – Randić 85, 245 – topological 79 – Zagreb 245 induced – molecular substructure 65 – subgraph 59 inequality – triangle 94 information content – Basak’s 247 – complementary 247 – structural 247 infrared spectroscopy 297 injective – mapping 19 integral – chemistry 27 intensity 303

intensity ratio – logarithmic 340 interaction – model 13 interpretation – of spectrum 299 invariant – MS 340 – symmetry 172 inverse – image 15 – left 19 – QSAR 221 – QSPR 221 inversion – center of 98 ion – even electron 318 – odd electron 318 ion series descriptor 340 ionization 302 – schemes 193 isomer 32 – connectivity 32 – constitutional 32 – permutational 108 – stereo- 5 isometry 91 – linear 95 isomorphic – molecular graphs 30 – renumbering 140 isotope 307 isotope distribution 309 – natural 307 – theoretical 310 isotope mass – highest 307 – lowest 307 isotope masses – highly resolved 363 isotopic abundance – relative 385 isotopomer 308 isotopomer mass – highest 310 – lowest 310

K kernel function 236 Kronecker delta function 223 L labeled – graph 14 – 𝑚-multigraph 15 – molecular graph 29, 76 – multigraph 14 ladder – game 129 left – coset 48 – inverse 19 – unit 19 left-handed 103 – orientation 103 – screw 103 Lemma – Fundamental 48 lemma – gluing 179 Lemma of – Cauchy–Frobenius 36 length 48 – column 82 – of a cycle 40 – of a walk 56 – row 82 level – arithmetical 3 – geometrical 3, 5 – topological 2, 3 lexicographical – order 166, 167 library – combinatorial 240 – molecular 192 – patent 202 – real 240 – virtual 240 line 13 linear – isometry 95 – mapping 96 linear discriminant – analysis 233

liquid – chromatography 299 list – neighborhood 219 – notation 41, 167 Lorentz force 303 lowest – isotope mass 307 – isotopomer mass 310 M macro 182 macroatom 179 macros – restriction of 182 map – sign 104 mapping – bijective 19 – injective 19 – linear 96 – surjective 19 mappings – symmetry class of 24, 36 margin borders 235 Markush – formula 202 mass 306 – monoisotopic 374 – nominal 307, 310 mass of highest abundance 310 mass spectra – classifier 338 mass spectrometry 297 mass to charge ratio 303 match value 315 match values – for molecular formulas 304 matrix – adjacency 16 – bond 16 – distance 84 – Möbius 120 – orthogonal 97 – transpose 97 – unit 97 – zeta 119 matrix of – bond multiplicities 16

mean – classification error 225 – highest random 259 median 324 mesomerism 69 metabolomics 373 metric – Euclidean 94 – space 94 – vector space 94 minimal Radon partition – abstract 152 minimality – test 171 mirror – plane 99 mirror plane 98 misclassification rate 225 𝑚-multigraph – labeled 15 – unlabeled 17, 24 model – graph 13 – interaction 13 Möbius – function 120 – matrix 120 molecular – descriptor 77, 242 – formula 32 – graph 13, 14 – library 192 – substructure 65 – walk count 246 molecular descriptor – binary 250 molecular formula – compatible 181 – fuzzy 181 – of a molecular graph 32 molecular graph – connected 30 – H-suppressed 30 – labeled 29, 76 molecular graphs – equivalent 30 – isomorphic 30 molecular mass – restriction of 181

molecular substructure – induced 65 molecule – central 190 – graph 16, 30, 77 – self-enantiomorph 104 – self-enantiomorphic 104 molecule ion 302 monoisotopic – mass 374 monotonic – restriction 172 MS – descriptor 340 MS feature 340 MS invariant 340 MS/MS 372 multicenter – chemistry 27 multigraph 13, 14 – labeled 14 multi-hypergraphs 71 multiplication 19 multiplicative – weight 114 multiplicity – bond 13, 26 – of subgraphs 61 multiset 80 Mycobacterium fortuitum 284 N natural – isotope distribution 307 – number 15 neighborhood – list 219 – substructure restriction 65 network – reaction 191 neutron 307 node 13 nodes – connected 58 nominal – mass 307, 310 non-bond 15 non-supervised – statistical learning 295

norm – Euclidean 228 normalized – dot product 375 – sum of absolute errors 375 – sum of squared errors 375 notation – list 41, 167 – Schoenflies 99, 100 nuclear magnetic resonance 297 number – atom 26 – cyclomatic 79, 84 – natural 15 – partition 42 number of atoms – restriction of 181 number of heteroatoms – restriction of 181 O object – chiral 105 observation 221 octanol–water – partition coefficient 393 octet – rule 27 odd – permutation 138 odd electron – ion 318 ON-basis 96 one component – reaction 66 open – walk 56 operation – symmetry 20, 98, 99 opposite – orientation 102 orbit 21 – enantiomorphic 105 – self-enantiomorphic 105 order 15 – dominance 81 – lexicographical 166, 167 – of a cycle 40 – partial 81

ordered – basis 102 – pair 14 – totally 165 orderly – generation 141 orientation 102 – function 135, 136 – left-handed 103 – opposite 102 – right-handed 103 – same 102 orientation function – partial 142 orientation of – tetrahedron 136 orientational – automorphism 141 – automorphism group 141 orthogonal – group 95, 98 – matrix 97 orthogonal group – special 98 orthonormal – basis 96 overfitting 228 P pair – enantiomorphic 105 – ordered 14 – unordered 14 pairs – free electron 26 partial – order 81 – orientation function 142 partition – cycle 42 – discrete 207 – number 42 – Radon 150 partition coefficient – octanol–water 393 partition of – bond degrees 80 – valences 80

partitioning – recursive 236 partitions – graphical 81 patent – library 202 – violations 203 path 56 peak 306 peak cluster 312 permanent – badlist 358 permutation 19 – cyclic 39 – even 138 – odd 138 – sign of 138 permutational – isomer 108 permuted – cyclically 39 physical – density 270 placement – 2D 72 – 3D 72 – barycentric 94 plane – mirror 99 – reflection 99 point – fixed 37 – group 21, 98, 99 Pólya’s – Theorem 106 – theorem 44 poset 81 position – substitutable 107 – variation of 200 power – group 187 predicting function 221 predictive ability 225 predictor 221 principal component – analysis 295 principle – homomorphism 131

product 191 – convolution 310 – direct 48 – dot 94, 375 – graph 66 – scalar 94, 375 products 66 profile – atomic 272 proper – rotation 98 proper axis of – rotation 98 property – structural 338 proton 26, 307 purely – topological 79 purely arithmetical – descriptor 78 Q QR decomposition 232 QSAR – inverse 221 QSPR – inverse 221 quadratic discriminant – analysis 293 quantiles 321 R 𝑟-cycle 39 racemic – content 113 radical – site 26 radicals – restriction of 182 radius – van der Waals 247 Radon – partition 150 Randić – indices 85, 245 random – access 219 range scaling 227

ranking – of molecular formula candidates 304 – of spectra 299 – of structural formulas 306 ranking function 315 reactant – graph 66 reactants 66 reaction – center of 67 – chemical 66 – decomposition 66 – graph 66 – network 191 – one component 66 – scheme 68 – substructure 68 – synthesis 66 – two component 66 – type 191 reaction center graph 67 reaction graph – change of 66 reaction scheme – depth of 193 – one component 68 – two component 68 reaction-based – generation 164 – structure generation 7 reactive – site 190 real – library 240 realizable 159 rearrangement 66 recursive – partitioning 236 reduction function – group 45 reflection – plane 99 reflexive 18, 81 regression 222 regression tree 236 relation 18 – equivalence 18 relative – isotopic abundance 385

renumbering – isomorphic 140 resonance – structure 69 restricted – chemistry 27 restriction – monotonic 172 – substructure 64 restriction of – aromaticity 182 – atom states 182 – bonds 182 – charge 182 – connectivity 182 – DBE 182 – hydrogen distribution 182 – macros 182 – molecular mass 181 – number of atoms 181 – number of heteroatoms 181 – radicals 182 – substructures 183 – symmetry 182 resubstitution 224 retention – index 398 – time 398 right – coset 48 right-handed 103 – orientation 103 – screw 103 ring 56 – aromatic 70 – substructure restriction 65 rotation – clockwise 103 – group 107 – improper 98 – improper axis of 98 – proper 98 – proper axis of 98 row – length 82 rule – octet 27

S same – orientation 102 sample – test 225 scalar 93 – product 94, 375 scheme – reaction 68 schemes – fragmentation 193 – ionization 193 Schoenflies – notation 99, 100 screening – high throughput 240 – virtual 242 screw – left-handed 103 – right-handed 103 search – substructure 65 selection – of molecular formula candidates 304 – of structural formulas 306 self-enantiomorph 92 – molecule 104 self-enantiomorphic – elements 105 – molecule 104 – orbit 105 sense – directional 103 separating – hyperplane 235 sequence of – bond degrees 16, 80 – valences 16, 80 sequential – access 219 set-partition 18 sign – map 104 sign of – permutation 138 similar – action 49, 118 simple – graph 13, 14, 16

Sims – chain 171 single – bond 14, 15, 26 singular values 233 site – radical 26 – reactive 190 skeleton 107 – achiral 107 – chiral 107 space – 3D 91 – Euclidean 91 – metric 94 special – orthogonal group 98 spectra – comparison of 299 – ranking of 299 spectra type – descriptor 341 spectrum – interpretation 299 spectrum comparison 298 spectrum simulation 299 squared – errors 375 stabilizer 37 – class 118 stack – connectivity 176 standard – cycle notation 41 – valence 27 standard deviation 228 standard error 224 state – atom 14, 26 – change of 66 – ground 26 states – change of 68 statistical learning – non-supervised 295 step – down 129 stereoisomer 5 stereoisomers 132

steric – energy 73 stratum 119 structural – formula 32 – information content 247 – property 338 structure 314 – resonance 69 structure generation – formula-based 7 – reaction-based 7 structure verification 299 subgraph 59 – ambiguous molecular 64 – closed 59 – induced ambiguous molecular 64 subgroup 23 – diagonal 108 subgroups – conjugacy class of 118 substituent 107 substituents – variation of 200 substitutable – position 107 substructure 62, 182 – count 250 – molecular 65 – reaction 68 – restriction 64 – search 65 substructure restriction – distance 64 – hybridization 65 – neighborhood 65 – ring 65 substructures – restriction of 183 sum of absolute errors – normalized 375 sum of squared errors – normalized 375 support vector machine 234 support vectors 235 surjective – mapping 19 symmetric 18 – group 20, 36

symmetry – element 21, 99 – elements 98 – enumeration by 118 – function 341 – group 21 – invariant 172 – operation 20, 98, 99 – restriction of 182 – type 119 symmetry class – of mappings 24, 36 synthesis – reaction 66 system – Hill 33 T tabloids 50 tandem–MS 363, 372 target variable 221 test – canonicity 179 – minimality 171 – sample 225 tetrahedron 134 – orientation of 136 theorem – Pólya’s 44 theoretical – isotope distribution 310 time – retention 398 topological – density 273 – descriptor 79 – index 242, 245 – indices 79 – level 2, 3 – purely 79 total – classification error 223 – walk count 246 totally – ordered 165 transitive 18, 81 transpose – matrix 97

transposition 40 – double 138 transversal 18, 21 – canonic 165 tree 56 triangle – inequality 94 triple – bond 14, 15 trivial – connected component 59 two component reaction 66 type – atom 63 – reaction 191 – symmetry 119 type I errors 290 type II errors 290 U ultraviolet spectroscopy 297 uniform 159 unit – left 19 – matrix 97 unit vector 96 unlabeled – graph 14 – 𝑚-multigraph 17, 24 unordered – pair 14 unpaired – electron 26 V valence 16, 26, 79 – degree 246 – electron 26 – standard 27 valences – partition of 80 – sequence of 16, 80 van der Waals – radius 247 – volume 75 variable – continuous 221 – dependent 221

– discrete 221 – independent 221 variation of – chain length 200 – homology 200 – position 200 – substituents 200 vector 93 vector space – metric 94 violations – patent 203 virtual – library 240 – screening 242 volume – determinant 134 – van der Waals 75 W walk 56 – closed 56 – open 56

walk count – molecular 246 – total 246 weight 45, 114 – multiplicative 114 Wiener – index 85, 245

Y Young – diagram 81

Z Zagreb – indices 245 zero-one cost function 223 zeta – function 119 – matrix 119