International Tables for Crystallography, Vol.G: Definition and Exchange of Crystallographic Data [1 ed.] 9781402031380, 1402031386

International Tables for Crystallography Volume G, Definition and exchange of crystallographic data, describes the stand


English Pages 598 Year 2005


INTERNATIONAL TABLES FOR CRYSTALLOGRAPHY

International Tables for Crystallography

Volume A: Space-Group Symmetry
Editor Theo Hahn
First Edition 1983, Fifth Edition 2002, Corrected Reprint 2005

Volume A1: Symmetry Relations between Space Groups
Editors Hans Wondratschek and Ulrich Müller
First Edition 2004

Volume B: Reciprocal Space
Editor U. Shmueli
First Edition 1993, Second Edition 2001

Volume C: Mathematical, Physical and Chemical Tables
Editor E. Prince
First Edition 1992, Third Edition 2004

Volume D: Physical Properties of Crystals
Editor A. Authier
First Edition 2003

Volume E: Subperiodic Groups
Editors V. Kopský and D. B. Litvin
First Edition 2002

Volume F: Crystallography of Biological Macromolecules
Editors Michael G. Rossmann and Eddy Arnold
First Edition 2001

Volume G: Definition and Exchange of Crystallographic Data
Editors Sydney Hall and Brian McMahon
First Edition 2005

INTERNATIONAL TABLES FOR CRYSTALLOGRAPHY

Volume G DEFINITION AND EXCHANGE OF CRYSTALLOGRAPHIC DATA

Edited by SYDNEY HALL AND BRIAN MCMAHON

First edition

Published for

THE INTERNATIONAL UNION OF CRYSTALLOGRAPHY by

SPRINGER 2005

A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 1-4020-3138-6 (acid-free paper)
ISSN 1574-8707

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
Sold and distributed in North, Central and South America by Springer, 101 Philip Drive, Norwell, MA 02061, USA.
In all other countries, sold and distributed by Springer, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.

Technical Editor: N. J. Ashcroft

© International Union of Crystallography 2005

Short extracts may be reproduced without formality, provided that the source is acknowledged, but substantial portions may not be reproduced by any process without written permission from the International Union of Crystallography.

Printed in Denmark by P. J. Schmidt A/S

Associate editors

H. M. Berman
F. C. Bernstein
H. J. Bernstein
P. E. Bourne
I. D. Brown
P. M. D. Fitzgerald
R. W. Grosse-Kunstleve
J. R. Hester
N. Spadaccini
B. H. Toby
J. D. Westbrook

Contributing authors

F. H. Allen, Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, England. [2.4, 4.1, 4.8]

J. M. Barnard, BCI Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, England. [2.4, 4.8]

H. M. Berman, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA. [2.6, 3.6, 4.5, 5.5]

H. J. Bernstein, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA. [2.2.7, 2.3, 3.7, 4.6, 5.1, 5.4, 5.6]

P. E. Bourne, Research Collaboratory for Structural Bioinformatics, San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA. [3.6, 4.5]

I. D. Brown, Brockhouse Institute for Materials Research, McMaster University, Hamilton, Ontario, Canada L8S 4M1. [2.2.7, 3.5, 3.8, 4.1, 4.7]

A. P. F. Cook, BCI Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, England. [2.4, 2.5, 4.8]

P. J. Ellis, Stanford Linear Accelerator Center, 2575 Sand Hill Road, Menlo Park, CA 94025, USA. [5.6]

Z. Feng, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA. [5.5]

P. M. D. Fitzgerald, Merck Research Laboratories, Rahway, New Jersey, USA. [3.2, 3.6, 4.5]

S. R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia. [1.1, 2.1, 2.2, 2.4, 2.5, 2.6, 3.2, 4.1, 4.8, 4.9, 4.10, 5.2, 5.4]

A. P. Hammersley, ESRF/EMBL Grenoble, 6 rue Jules Horowitz, France. [2.3, 4.6]

K. Henrick, EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, England. [Appendix 3.6.2]

M. A. Hoyland, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England. [5.7]

B. McMahon, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England. [1.1, 2.2.7, 3.1, 3.2, 3.6, 4.5, 5.2, 5.3, 5.7]

G. Madariaga, Departamento de Física de la Materia Condensada, Facultad de Ciencia y Tecnología, Universidad del País Vasco, Apartado 644, 48080 Bilbao, Spain. [3.4, 4.3]

P. R. Mallinson, Department of Chemistry, University of Glasgow, Glasgow G12 8QQ, Scotland. [3.5, 4.4]

N. Spadaccini, School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia. [2.1, 2.2.7, 5.2]

P. R. Strickland, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England. [5.7]

B. H. Toby, NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, Maryland 20899-8562, USA. [3.3, 4.2]

K. D. Watenpaugh, retired; formerly Structural, Analytical and Medicinal Chemistry, Pharmacia Corporation, Kalamazoo, Michigan, USA. [3.6, 4.5]

J. D. Westbrook, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA. [2.2, 2.6, 3.6, 4.5, 4.6, 4.10, 5.5]

E. L. Ulrich, Department of Biochemistry, University of Wisconsin Madison, 433 Babcock Drive, Madison, WI 53706-1544, USA. [Appendix 3.6.2]

H. Yang, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA. [5.5]


Contents

Preface (S. R. Hall and B. McMahon)

PART 1. HISTORICAL INTRODUCTION

1.1. Genesis of the Crystallographic Information File (S. R. Hall and B. McMahon)
1.1.1. Prologue
1.1.2. Past approaches to data exchange
1.1.3. Card-image formats
1.1.4. The Standard Crystallographic File Structure (SCFS)
1.1.5. The impact of networking on crystallography
1.1.6. The Working Party on Crystallographic Information (WPCI)
1.1.7. The Crystallographic Information File
1.1.8. Diversification: the Molecular Information File and dictionary definition language
1.1.9. The macromolecular Crystallographic Information File
1.1.10. The Crystallographic Binary File
1.1.11. Other extension dictionaries
1.1.12. The broader context: CIF and XML
References

PART 2. CONCEPTS AND SPECIFICATIONS

2.1. Specification of the STAR File (S. R. Hall and N. Spadaccini)
2.1.1. Introduction
2.1.2. Universal data language concepts
2.1.3. The syntax of the STAR File
Appendix 2.1.1. Backus–Naur form of the STAR syntax and grammar
References

2.2. Specification of the Crystallographic Information File (CIF) (S. R. Hall and J. D. Westbrook)
2.2.1. Introduction
2.2.2. Terminology
2.2.3. The syntax of a CIF
2.2.4. Portability and archival issues
2.2.5. Common semantic features
2.2.6. CIF metadata and dictionary compliance
2.2.7. Formal specification of the Crystallographic Information File (S. R. Hall, N. Spadaccini, I. D. Brown, H. J. Bernstein, J. D. Westbrook and B. McMahon)
References

2.3. Specification of the Crystallographic Binary File (CBF/imgCIF) (H. J. Bernstein and A. P. Hammersley)
2.3.1. Introduction
2.3.2. CBF and imgCIF
2.3.3. Overview of the format
2.3.4. A complex example
2.3.5. imgCIF encodings
Appendix 2.3.1. Deprecated CBF conventions
References

2.4. Specification of the Molecular Information File (MIF) (F. H. Allen, J. M. Barnard, A. P. F. Cook and S. R. Hall)
2.4.1. Introduction
2.4.2. Historical background
2.4.3. MIF objectives
2.4.4. MIF concepts and syntax
2.4.5. Atoms, bonds and molecular representations
2.4.6. Bonding conventions
2.4.7. Structural templates
2.4.8. Stereochemistry and geometry at stereogenic centres
2.4.9. MIF query applications
2.4.10. Conclusion
References

2.5. Specification of the core CIF dictionary definition language (DDL1) (S. R. Hall and A. P. F. Cook)
2.5.1. Introduction
2.5.2. The organization of a CIF dictionary
2.5.3. Definition attributes
2.5.4. DDL versions
2.5.5. The structure of DDL1 definitions
2.5.6. DDL1 attribute descriptions
References

2.6. Specification of a relational dictionary definition language (DDL2) (J. D. Westbrook, H. M. Berman and S. R. Hall)
2.6.1. Introduction
2.6.2. The DDL2 presentation
2.6.3. Overview of the elements of DDL2
2.6.4. DDL2 organization
2.6.5. DDL2 dictionary applications
2.6.6. Detailed DDL2 specifications
References

PART 3. CIF DATA DEFINITION AND CLASSIFICATION

3.1. General considerations when defining a CIF data item (B. McMahon)
3.1.1. Introduction
3.1.2. Informal definition procedures
3.1.3. Formal definition process
3.1.4. Choice of data model
3.1.5. Constructing a DDL1 dictionary
3.1.6. Constructing a DDL2 dictionary
3.1.7. Composing new data definitions
3.1.8. Management of multiple dictionaries
3.1.9. Composite dictionaries
3.1.10. Public CIF dictionaries
References

3.2. Classification and use of core data (S. R. Hall, P. M. D. Fitzgerald and B. McMahon)
3.2.1. Introduction
3.2.2. Experimental measurements
3.2.3. Analysis
3.2.4. Atomicity, chemistry and structure
3.2.5. Publication
3.2.6. File metadata
Appendix 3.2.1. Category structure of the core CIF dictionary
References

3.3. Classification and use of powder diffraction data (B. H. Toby)
3.3.1. Introduction
3.3.2. Dictionary design considerations
3.3.3. pdCIF dictionary sections
3.3.4. Experimental measurements
3.3.5. Analysis
3.3.6. Atomicity, chemistry and structure
3.3.7. File metadata
3.3.8. pdCIF for storing unprocessed measurements
3.3.9. Use of pdCIF for Rietveld refinement results
3.3.10. Other pdCIF applications
Appendix 3.3.1. Category structure of the powder CIF dictionary
References

3.4. Classification and use of modulated and composite structures data (G. Madariaga)
3.4.1. Introduction
3.4.2. Dictionary design considerations
3.4.3. Arrangement of the dictionary
3.4.4. Use of the msCIF dictionary
Appendix 3.4.1. Category structure of the msCIF dictionary
References

3.5. Classification and use of electron density data (P. R. Mallinson and I. D. Brown)
3.5.1. Introduction
3.5.2. Dictionary design considerations
3.5.3. Classification of data definitions
3.5.4. Development of the dictionary and supporting software
References

3.6. Classification and use of macromolecular data (P. M. D. Fitzgerald, J. D. Westbrook, P. E. Bourne, B. McMahon, K. D. Watenpaugh and H. M. Berman)
3.6.1. Introduction
3.6.2. Considerations underlying the design of the dictionary
3.6.3. Overview of the mmCIF data model
3.6.4. Content of the macromolecular CIF dictionary
3.6.5. Experimental measurements
3.6.6. Analysis
3.6.7. Atomicity, chemistry and structure
3.6.8. Publication
3.6.9. File metadata
Appendix 3.6.1. Category structure of the mmCIF dictionary
Appendix 3.6.2. The Protein Data Bank exchange data dictionary (J. D. Westbrook, K. Henrick, E. L. Ulrich and H. M. Berman)
References

3.7. Classification and use of image data (H. J. Bernstein) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

199

3.7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

199

3.7.2. Binary image data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

199

viii

CONTENTS 3.7.3. Axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

201

3.7.4. The diffraction experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

202

Appendix 3.7.1. Category structure of the CBF/imgCIF dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

205

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

205

3.8. Classification and use of symmetry data (I. D. Brown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

206

3.8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

206

3.8.2. Dictionary design considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

206

3.8.3. Arrangement of the dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

207

3.8.4. Future developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

208

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

208

PART 4. DATA DICTIONARIES

4.1. Core dictionary (coreCIF) (S. R. Hall, F. H. Allen and I. D. Brown)
4.2. Powder dictionary (pdCIF) (B. H. Toby)
4.3. Modulated and composite structures dictionary (msCIF) (G. Madariaga)
4.4. Electron density dictionary (rhoCIF) (P. R. Mallinson)
4.5. Macromolecular dictionary (mmCIF) (P. M. D. Fitzgerald, J. D. Westbrook, P. E. Bourne, B. McMahon, K. D. Watenpaugh and H. M. Berman)
4.6. Image dictionary (imgCIF) (A. P. Hammersley, H. J. Bernstein and J. D. Westbrook)
4.7. Symmetry dictionary (symCIF) (I. D. Brown)
4.8. Molecular Information File dictionary (MIF) (F. H. Allen, J. M. Barnard, A. P. F. Cook and S. R. Hall)
4.9. DDL1 dictionary (S. R. Hall)
4.10. DDL2 dictionary (J. D. Westbrook and S. R. Hall)

PART 5. APPLICATIONS

5.1. General considerations in programming CIF applications (H. J. Bernstein)
5.1.1. Introduction
5.1.2. Background
5.1.3. Strategies in designing a CIF-aware application
5.1.4. Conclusion
References

5.2. STAR File utilities (N. Spadaccini, S. R. Hall and B. McMahon)
5.2.1. Introduction
5.2.2. Data instances and context
5.2.3. Star_Base: a general-purpose data extractor for STAR Files
5.2.4. Editing STAR Files with Star.vim
5.2.5. Browser-based viewing with StarMarkUp
5.2.6. Object-oriented STAR programming
References

5.3. Syntactic utilities for CIF (B. McMahon)
5.3.1. Introduction
5.3.2. Syntax checker
5.3.3. Editors with graphical user interfaces
5.3.4. Data-name validation
5.3.5. File transformation software
5.3.6. Libraries for scripting languages
5.3.7. Rapid development tools
5.3.8. Tools for mmCIF
5.3.9. Concluding remarks
References

5.4. CIFtbx: Fortran tools for manipulating CIFs (H. J. Bernstein and S. R. Hall)
5.4.1. Introduction
5.4.2. An overview of the library
5.4.3. Initialization commands
5.4.4. Read commands
5.4.5. Write commands
5.4.6. Variables
5.4.7. Name aliases
5.4.8. Implementation of the tools
5.4.9. How to read CIF data
5.4.10. How to write a CIF
5.4.11. Error-message glossary
5.4.12. Internals and programming style
5.4.13. Distribution
References

5.5. The use of mmCIF architecture for PDB data management (J. D. Westbrook, H. Yang, Z. Feng and H. M. Berman)
5.5.1. Introduction
5.5.2. Representing macromolecular structure data
5.5.3. Integrated data-processing system: overview
5.5.4. Access
References

5.6. CBFlib: an ANSI C library for manipulating image data (P. J. Ellis and H. J. Bernstein)
5.6.1. Introduction
5.6.2. CBFlib function descriptions
5.6.3. Compression schemes
5.6.4. Sample templates
5.6.5. Example programs
References

5.7. Small-molecule crystal structure publication using CIF (P. R. Strickland, M. A. Hoyland and B. McMahon)
5.7.1. Introduction
5.7.2. Case study: the fully automated reporting of small-unit-cell crystal structures
5.7.3. CIF and other journals
Appendix 5.7.1. Request list for Acta Crystallographica Section C
Appendix 5.7.2. Data validation using checkcif
References

Subject index

Index of categories and data names

Preface
By S. R. Hall and B. McMahon

Crystallography is a data-intensive discipline. This is true of the measurement and the analytical aspects of crystallographic studies, and so the orderly acquisition and retention of data is of great importance. There is also a need for computational tools that facilitate efficient data-handling processes. In this data-rich environment, International Tables for Crystallography provides supporting sources of reference data and guidelines for analysing, interpreting and modelling the data involved in a wide variety of studies.

This volume in the International Tables series, Definition and exchange of crystallographic data, deals with the precise definition of data items used in crystallography. It focuses on a particular data representation, the Crystallographic Information File (CIF), developed over the last 15 years for the computerized interchange of data important to structural crystallography, and for submissions to crystallographic publications and databases.

The need for the volume stems from the IUCr’s adoption in 1990 of CIF as the standard for exchanging crystallographic data across the entire community, and for the submission of research articles to Acta Crystallographica and other journals. The data items described in this volume appear as standard tags in CIF. Each data item has an associated definition, with attributes that characterize its allowed values and relationships with other defined items. The definitions provide a comprehensive machine-readable description of the information associated with a structural study, from area-detector counts through to the derived molecular geometry, and allow a full annotation suitable for publication as a primary research article.

The volume is arranged in five parts. Part 1 introduces the need and rationale for a standard interchange format, and the IUCr’s commissioning and development of CIF in response to that need. Part 2 presents the formal specifications of the file formats involved in the CIF standard, and the related crystallographic binary file (CBF) and molecular information file (MIF) standards. Details are also given of the dictionary definition language (DDL) used to specify the names of data items and their attributes in a machine-readable way. The definitions and attributes of data items in different areas of crystallography are listed in separate dictionary files. Part 3 describes the structure and content of each of the public data dictionaries currently available. It provides commentaries on the dictionaries and guidance on best practice in their use for information interchange in several fields of crystallography. The dictionaries themselves are structured ASCII files that may be used directly by software parsers and validators. Part 4 presents the contents of the dictionaries in a text format that is easier for humans to read. Part 5 describes software applications and libraries that use CIF and related interchange standards.

The data definitions in this volume can be read and used from the printed page, but their principal use is in a computer-software environment. The volume therefore describes the large number of computer programs and libraries contributed by the crystallographic community for handling data files in the CIF format. Many of these programs use the dictionary files directly to validate and exchange data items. Software by its very nature is under continual development, and individual software implementations become obsolete with disturbing rapidity. Nevertheless, we have felt it important and useful to devote considerable space to the software aspects of CIF data exchange. The programs described in the volume illustrate various basic considerations and approaches to data exchange and provide a rich gallery of tools suitable for different applications. The volume also includes a CD-ROM containing many of these software packages, as well as machine-readable versions of the data dictionaries themselves, and many links to related web resources.

This volume therefore contains material that is relevant to both crystallographers and information-technology specialists. The data dictionaries that form the core of the volume provide the knowledge base for automated validation of data submitted for publication and for deposition in databases (tasks in which most structure-determination scientists are involved). Part 5 will be of particular interest to the non-specialist user of CIF, in that it contains practical advice on stand-alone application programs, the publication of crystal structure reports in primary research journals and the deposition of biological macromolecular structures in the Protein Data Bank. Software authors will benefit from the discussions on designing and implementing CIF-aware programs and from the detailed descriptions of a number of available libraries.

The data-representation and knowledge-management techniques used routinely in crystallography today were developed well in advance of recent approaches to information dissemination for the World Wide Web through open protocols. Nevertheless, CIF processes fit well with the enormous contemporary effort to build the so-called semantic web, in which information and knowledge are transported and interpreted by computer. The success of such knowledge management depends on well structured data vocabularies, data models and data structures appropriate for different subject areas. The term used to denote such semantically rich formulations is ‘ontology’. We believe that the formal description of crystallographic data and information presented in this volume is an exemplary ontology in this sense, and will be a useful model for workers in related fields.

The volume has taken several years to complete. It covers subject matter that is relatively new and still changing, and indeed much of the material has been undergoing revision even as it was being compiled. Thanks to the determined effort by all involved in its production, this volume provides a comprehensive and up-to-date compendium of the current state of crystallographic data definition. This first edition also serves the role of providing the historical and scientific context for future developments in a rapidly evolving field. It is likely that new editions will be needed before long to stay abreast of such developments.

Since it began, the CIF project has involved the participation and collaboration of many colleagues across the breadth of crystallographic practice. In the production of this volume, we wish first to thank all the authors of the individual chapters for their major contributions. Each chapter was independently and anonymously peer-reviewed, and the reviewers’ comments were forwarded to the authors for response. This has added to the time and effort involved on everyone’s part, but we believe that the improved presentation and exposition of the detailed and often complex material will be of significant benefit to readers and users. We are grateful to the reviewers for their dedication and enthusiasm during this process.

We particularly appreciate the help of many colleagues who have contributed to the CIF project generally and who have made valuable contributions to this volume. They are recognized as Associate Editors in the list of contributors. In addition to the individuals mentioned by name in individual chapters, we thank Doug du Boulay, Eldon Ulrich, Frank Allen, Bill Clegg, Mark Spackman, Victor Streltsov, Bob Sweet, Howard Flack, Peter Murray-Rust and John Bollinger for thought-provoking discussions and contributions. We thank Philip Coppens for initially supporting the proposal for this volume, Theo Hahn and Hartmut Fuess, the previous and current Chairmen of the Commission on International Tables for Crystallography, and the other members of the Commission who welcomed this volume to the series. Thanks also go to Timo Vaalsta of the University of Western Australia for technical help with the CD-ROM. Finally, we are grateful to Nicola Ashcroft of the IUCr editorial office for her careful and insightful technical editing of this work.

International Tables for Crystallography (2006). Vol. G, Chapter 1.1, pp. 2–10.

1.1. Genesis of the Crystallographic Information File B Y S. R. H ALL

AND

necessarily occur as the approach evolves to meet the changing demands of an evolving science. Readers should therefore be aware of the need to consult the IUCr website (http://www. iucr.org/iucr-top/cif) for the latest versions of, or successors to, the data dictionaries and the software packages described in this volume. However, the basic concepts have already been shown to be remarkably effective and durable. This volume should therefore provide an invaluable reference for those working with CIF and related universal file approaches. The success of any data-exchange approach depends on its efficiency and flexibility. It must cope with the increasing volume and complexity of data generated by the computing ‘information explosion’. This growth challenges conventional criteria for measuring exchange and storage efficiency based on high datacompression factors. Today’s fast, cheap magnetic and chip technologies make bulk volume a secondary consideration compared with extensibility and portability of data-management processes. Most importantly, improvements in computing technology continue to generate new approaches to harnessing semantic information contained within data collections and to promoting new strategies for knowledge management. The basis for an efficient information-exchange process is mutually agreed rules for the supplier and the receiver, i.e. the establishment of an exchange protocol. This protocol needs to be established at several levels. At the first level, there must be predetermined ways that data (i.e. numbers, characters or text) are arranged in the storage medium. These are the organizational rules that define the syntax or the format of data. There must also be a clear understanding of the meaning of individual data items so that they can be correctly identified, accessed and reused by others. 
At an even higher level, a protocol may also provide rules for expressing the relationships between the data, as this can lead to automatic processes for validating and applying the data values. These higher levels provide the semantic knowledge needed for the rigorous identification and validation of transmitted and stored data. One may consider the analogy of reading this paragraph in English. To do this we must first be able to recognize the individual words, comprehend their meaning (if necessary with the use of an English dictionary) and understand all of this within the context of sentence construction. As with data, the arrangement of component words is based on a predetermined grammatical syntax, and their individual meaning (as defined in a dictionary or elsewhere), coupled with their contextual function (as nouns, verbs, adjectives etc.), leads to the full comprehension of a word sequence as semantic information.

1.1.1. Prologue Progress in science depends crucially on the ability to find and share theories, observations and the results of experiments. The efficient exchange of information within and across scientific disciplines is therefore of fundamental importance. The rapid growth in the use of computers and networks over the past half century and in the use of the World Wide Web over the past decade have brought remarkable improvements in communications among scientists. Such communications are most effective when there is a common language. The English language has become the standard means of expression of ideas and theories; increasingly, the exchange of data requires the computer equivalent of such a lingua franca. It was with considerable foresight that the International Union of Crystallography (IUCr) in 1990 adopted a data-handling approach based on universal file concepts. At that time this was considered to be a radical idea. The approach adopted by the IUCr is known as the Crystallographic Information File (CIF). This volume of International Tables for Crystallography describes the CIF approach, the associated definition of CIF data items within dictionaries, and handling procedures, applications and software. In this opening chapter, we give a historical perspective on the reasons why the CIF approach was adopted and how, over the past decade, CIF applications have evolved. CIF is the most fully developed and mature of the various universal file approaches available today. It combines flexibility and simplicity of expression with a lean syntax. It has an unsurpassed ability to express ‘hard’ scientific data unambiguously using extensive dictionaries (ontologies) of relevant terms. It has proved to be remarkably well suited to the publication and archiving of smallunit-cell crystallographic structures. What was a radical idea in 1990 has today become the dominant mode of expression of scientific data in this domain. 
The CIF data model provided the key to the internal restructuring of data managed by the Protein Data Bank in its transition from an archive to a database. The CIF approach is being tested in an increasing number of domains. In some cases, it may well become as successful as it has been for small-molecule crystallography. In other cases, the syntax will be unsuitable, but yet the conceptual discipline of agreed ontologies will still be required. Here, the experience of developing the CIF dictionaries may be carried across into different file formats and modes of expression. Nowadays, informatics is a rapidly evolving field, in which everything is obsolete almost as soon as it is created. Yet there is a responsibility on today’s scientists to preserve data and pass them on to the next generation. CIF was developed not only as a dataexchange mechanism, but also as an archival format, and considerable care has been taken over the past decade and more to keep it a stable and smoothly evolving approach. Some points of detail have been modified or superseded in practice. Other changes will

1.1.2. Past approaches to data exchange The crystallographic community, along with many other scientific disciplines, has long adhered to the philosophy that experimental data and results should be routinely archived to facilitate long-term knowledge retention and access. An early approach to this, recommended by IUCr and other journals, was for authors to deposit data as hard copy (i.e. ink on paper) with the British Library Lending Division. Retaining good records is fundamental to reproducing scientific results. However, the sheer volume of diffraction data needed to repeat a crystallographic study precludes these from publication, and has led in the past to relatively ad hoc procedures for depositing supplementary data in local or centralized archives. Typically in the past, only the crystal and structure model parameters were published in the refereed paper and the underpinning diffraction information had to be archived elsewhere. Because the archived data were usually stored as paper in various unregulated formats, considerable information about the experiment and structure-refinement parameters was never retained. Moreover, the archiving of supplementary data via postal services was very slow and labour-intensive; equally, the recovery of deposited data was difficult, with information supplied as either a photocopy of the original deposition or an image taken from a microfiche. Prior to 1970, when fewer than 9000 structures were deposited with the Cambridge Crystallographic Data Centre, data sets were still small enough to make these deposition and retrieval approaches feasible, albeit tedious. Even so, records show that very few archived data sets were ever retrieved for later use.

The rationale of data storage changed radically in the 1980s. The increasing role of computers, automatic diffractometers and phase-solving direct methods in crystallographic studies led to a rapid acceleration in the number and size of structures determined and published. This was the period when fast minicomputers became affordable for laboratories, and the consequent demand for the electronic storage and exchange of information grew exponentially. Typical data-archival practices changed from using paper to magnetic tapes, as these now became the least expensive and most efficient means of storing data.

Affiliations: SYDNEY R. HALL, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia; BRIAN MCMAHON, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England.

Copyright © 2006 International Union of Crystallography

1.1.3. Card-image formats Although the interchange of scientific information depends implicitly on an agreed data format, it remains independent of whether the transmission medium is paper tape, punched card, magnetic tape, computer chip or the Internet. Crystallography has employed countless data-exchange approaches and formats over the past 60 years. Prior to the advent of computers, the standard approach involved the exchange of typed tables of coordinates and structure factors with descriptive headers. In the 1950s and 1960s, as computers became the dominant generators of data, the transfer of data between laboratories was still relatively uncommon. When it was necessary, the Hollerith card formats of commonly used programs, such as ORFLS (Busing et al., 1962) and XRAY (Stewart, 1963), usually sufficed. Even when magnetic tape drives became common and were standardized (mainly to the 1/2-inch 2400-foot reel), the 80-column ‘card-image’ formats of these programs remained the most popular data exchange and deposition approach. As the storage and transporting of electronic data became easier and cheaper, structural information was increasingly deposited directly in databases such as the Cambridge Structural Database (CSD; Allen, 2002) and the Protein Data Bank (PDB; Bernstein et al., 1977). The CSD and PDB simplified these depositions by using standard layouts such as the ASER, BCCAB and PDB formats. Both the PDB and CSD used, and indeed still use as a backup deposition mode, fixed formats with 80-character records and identifier codes. Examples of these format styles are shown in Figs. 1.1.3.1 and 1.1.3.2. The card-image approach, involving a rigid preordained syntax, survived for more than two decades because it was simple, and the suite of data types used to describe crystal structures remained relatively static.

HEADER    PLANT SEED PROTEIN   30-APR-81   1CRN                        1CRND   1
COMPND    CRAMBIN                                                      1CRN    4
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED                  1CRN    5
AUTHOR    W.A.HENDRICKSON,M.M.TEETER                                   1CRN    6
REVDAT   5   16-APR-87 1CRND   1   HEADER                              1CRND   2
REVDAT   4   04-MAR-85 1CRNC   1   REMARK                              1CRNC   1
REVDAT   3   30-SEP-83 1CRNB   1   REVDAT                              1CRNB   1
REVDAT   2   03-DEC-81 1CRNA   1   SHEET                               1CRNB   2
REVDAT   1   28-JUL-81 1CRN    0                                       1CRNB   3
REMARK   1                                                             1CRN    7
REMARK   1 REFERENCE 1                                                 1CRNC   2
REMARK   1  AUTH   M.M.TEETER                                          1CRNC   3
REMARK   1  TITL   WATER STRUCTURE OF A HYDROPHOBIC PROTEIN AT ATOMIC  1CRNC   4
REMARK   1  TITL 2 RESOLUTION. PENTAGON RINGS OF WATER MOLECULES IN    1CRNC   5
REMARK   1  TITL 3 CRYSTALS OF CRAMBIN                                 1CRNC   6
REMARK   1  REF    PROC.NAT.ACAD.SCI.USA  V. 81  6014 1984             1CRNC   7
REMARK   1  REFN   ASTM PNASA6  US ISSN 0027-8424  040                 1CRNC   8
REMARK   1 REFERENCE 2                                                 1CRNC   9
REMARK   1  AUTH   W.A.HENDRICKSON,M.M.TEETER                          1CRN    9
REMARK   1  TITL   STRUCTURE OF THE HYDROPHOBIC PROTEIN CRAMBIN        1CRN   10
REMARK   1  TITL 2 DETERMINED DIRECTLY FROM THE ANOMALOUS SCATTERING   1CRN   11
REMARK   1  TITL 3 OF SULPHUR                                          1CRN   12
REMARK   1  REF    NATURE  V. 290  107 1981                            1CRN   13
REMARK   1  REFN   ASTM NATUAS  UK ISSN 0028-0836  006                 1CRN   14
REMARK   1 REFERENCE 3                                                 1CRNC  10
REMARK   1  AUTH   M.M.TEETER,W.A.HENDRICKSON                          1CRN   16
REMARK   1  TITL   HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN   1CRN   17
REMARK   1  TITL 2 CRAMBIN                                             1CRN   18
REMARK   1  REF    J.MOL.BIOL.  V. 127  219 1979                       1CRN   19
REMARK   1  REFN   ASTM JMOBAK  UK ISSN 0022-2836  070                 1CRN   20
SEQRES   1  46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE    1CRN   51
SEQRES   2  46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS    1CRN   52
SEQRES   3  46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR    1CRN   53
SEQRES   4  46  CYS PRO GLY ASP TYR ALA ASN                            1CRN   54
HELIX    1 H1  ILE    7  PRO   19  1 3/10 CONFORMATION RES 17,19       1CRN   55
HELIX    2 H2  GLU   23  THR   30  1 DISTORTED 3/10 AT RES 30          1CRN   56
SHEET    1 S1  2 THR    1  CYS    4  0                                 1CRNA   4
SHEET    2 S1  2 CYS   32  ILE   35 -1                                 1CRN   58
TURN     1 T1  PRO   41  TYR   44                                      1CRN   59
SSBOND   1 CYS    3  CYS   40                                          1CRN   60
SSBOND   2 CYS    4  CYS   32                                          1CRN   61
SSBOND   3 CYS   16  CYS   26                                          1CRN   62
CRYST1   40.960   18.650   22.520  90.00  90.77  90.00 P 21  2         1CRN   63
ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79     1CRN   70
ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80     1CRN   71
ATOM      3  C   THR     1      15.685  12.755   5.133  1.00  9.19     1CRN   72
ATOM      4  O   THR     1      15.268  13.825   5.594  1.00  9.85     1CRN   73
ATOM      5  CB  THR     1      18.170  12.703   5.337  1.00 13.02     1CRN   74
ATOM      6  OG1 THR     1      19.334  12.829   4.463  1.00 15.06     1CRN   75
ATOM      7  CG2 THR     1      18.150  11.546   6.304  1.00 14.23     1CRN   76
ATOM      8  N   THR     2      15.115  11.555   5.265  1.00  7.81     1CRN   77
ATOM      9  CA  THR     2      13.856  11.469   6.066  1.00  8.31     1CRN   78
ATOM     10  C   THR     2      14.164  10.785   7.379  1.00  5.80     1CRN   79
CONECT   20   19  282                                                  1CRN  398
CONECT   26   25  229                                                  1CRN  399
CONECT  116  115  188                                                  1CRN  400
CONECT  188  116  187                                                  1CRN  401
CONECT  229   26  228                                                  1CRN  402
CONECT  282   20  281                                                  1CRN  403
END                                                                    1CRN  405

Fig. 1.1.3.1. An abbreviated example of a PDB format file.
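The rigidity of a card-image format can be illustrated with a short Python sketch. It assumes the column positions of the present-day PDB ATOM record, which may differ in detail from the 1981 entry shown in the figure; the names ATOM_FIELDS, format_atom_record and parse_atom_record are hypothetical and are not part of any PDB software.

```python
# Column map (1-based, inclusive) following the present-day PDB ATOM layout.
# Assumption: the 1981 entry in Fig. 1.1.3.1 may differ from this in detail.
ATOM_FIELDS = {
    "serial":    (7, 11, int),
    "name":      (13, 16, str),
    "res_name":  (18, 20, str),
    "res_seq":   (23, 26, int),
    "x":         (31, 38, float),
    "y":         (39, 46, float),
    "z":         (47, 54, float),
    "occupancy": (55, 60, float),
    "b_iso":     (61, 66, float),
}


def format_atom_record(serial, name, res_name, res_seq, x, y, z, occupancy, b_iso):
    """Write one ATOM record as an 80-column card image."""
    line = (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s}  {res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_iso:6.2f}")
    return line.ljust(80)


def parse_atom_record(line):
    """Recover typed fields purely from their column positions."""
    record = line.ljust(80)
    if record[0:6].strip() != "ATOM":
        raise ValueError("not an ATOM record")
    return {name: convert(record[start - 1:end].strip())
            for name, (start, end, convert) in ATOM_FIELDS.items()}


card = format_atom_record(2, "CA", "THR", 1, 16.967, 12.784, 4.338, 1.00, 10.80)
atom = parse_atom_record(card)
print(atom["name"], atom["x"], atom["y"], atom["z"], atom["b_iso"])
```

Every value is identified only by where it sits on the line, so reader and writer must agree on the column map in advance, and a new data type cannot be added without redefining the record.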

1.1.4. The Standard Crystallographic File Structure (SCFS) By the 1980s, the many different fixed formats used to exchange data electronically had become a significant complication for journals and databases. Because of this, the IUCr Commissions for Crystallographic Data and Computing formed a joint Working Party which was asked to recommend a standard format for the exchange and retention of crystallographic data. They proposed a partially fixed format in which key words on each line identified blocks of data containing items in a specific order. This format was the Standard Crystallographic File Structure (Brown, 1988). An example of an SCFS file is shown in Fig. 1.1.4.1. The effectiveness of the SCFS format approach was curtailed because its release coincided with the arrival of powerful minicomputers, such as the VAX780, in crystallographic laboratories. This led to a period of enormous change in crystallographic computing, in which new data types and file formats proliferated. It was also a time when automatic diffractometers became standard equipment in laboratories and the development of new crystallographic software packages flourished. The fixed-format design of the SCFS was unable to adapt easily to these continually changing data requirements, and this eventually led to a proliferation of SCFS versions.

?BAWGEL
#ADATE 820705
#COMPND bis(Benzene)-chromium bromide
#FORMUL C12 H12 Cr1 1+,Br1 1-
#AUTHOR A.L.Spek,A.J.M.Duisenberg
#JRNL 189,10,1531,1981
#CREF msdb 14.74.003 nbsid 532193 batch 53 cdvol 6
#CLASS 1/74
#SYSCAT sys O cat 3
#CONN El= Cr 2 Br 14 V= 1 2 Ch= + 2 Ch= - 14
  Res= Plot= 1 B= 5 1-3 1-4 3-6 4-7 5-8 5-9 6-10 7-10 8-11 9-12 11-13 12-13 B= 9 1-2-5
  Res= Plot= 1 14
#DIAGRAM 469 151 374 202 469 248 554 103 272 248 556 299 640 152 186 296
  272 151 640 250 101 247 185 100 100 149 100 50 0 0 0 0 0 0 0 0
#CELL a 9.753(6) b 9.316(3) c 11.941(8) z 4 cent 1 sg Fmmm
#SYMM x,y,z x,1/2+y,1/2+z 1/2+x,y,1/2+z 1/2+x,1/2+y,z
  -x,y,z -x,1/2+y,1/2+z 1/2-x,y,1/2+z 1/2-x,1/2+y,z
  x,-y,z x,1/2-y,1/2+z 1/2+x,-y,1/2+z 1/2+x,1/2-y,z
  -x,-y,z -x,1/2-y,1/2+z 1/2-x,-y,1/2+z 1/2-x,1/2-y,z
#DENSITY dx 1.764
#UNIS int 3 sigcc 3
#RFACT R= 0.0540.
#RADIUS C 0.68 H 0.23 Br 1.21 Cr 1.35
#TOLER 0.40
#ATOM Cr1 0.0 0.0 0.0
  Br1 0.0 0.0 0.50000
  C1 0.06900 0.12800 0.13400
  C2 0.13900 0.0 0.13400
  H1 0.09300 0.20400 0.12500
  H2 0.19800 0.0 0.13000
#BOND Cr1 C1 2.100 Cr1 C2 2.090 C1 C1* 1.340 C1 C2 1.370 C1 H1 0.760 C2 H2 0.580
#MDATE 901205
#END

TITLE *p6122

00001 00002 00003 SG NAME 00004 LATT NP 00005 SYST HEXAGONAL 00006 BRAV HEXAGONAL 00007 HALL p_61_2_(0_0_-1) 00008 HERM p_61_2_2 00009 *EOS 00010 00011 SYMMETRY R11 2 3 T1 R21 2 3 T2 R31 2 3 T3 00012 SYOP 1 0 0 .0000000 0 1 0 .0000000 0 0 1 .0000000 1 00013 SYOP -1 0 0 .0000000 0-1 0 .0000000 0 0 1 .5000000 2 00014 SYOP 0-1 0 .0000000 -1 0 0 .0000000 0 0-1 .8333330 3 00015 SYOP 0 1 0 .0000000 1 0 0 .0000000 0 0-1 .3333330 4 00016 SYOP 1-1 0 .0000000 0-1 0 .0000000 0 0-1 .0000000 5 00017 SYOP -1 1 0 .0000000 0 1 0 .0000000 0 0-1 .5000000 6 00018 SYOP 1 0 0 .0000000 1-1 0 .0000000 0 0-1 .1666670 7 00019 SYOP -1 0 0 .0000000 -1 1 0 .0000000 0 0-1 .6666670 8 00020 SYOP 0-1 0 .0000000 1-1 0 .0000000 0 0 1 .3333330 9 00021 SYOP 0 1 0 .0000000 -1 1 0 .0000000 0 0 1 .8333330 10 00022 SYOP 1-1 0 .0000000 1 0 0 .0000000 0 0 1 .1666670 11 00023 SYOP -1 1 0 .0000000 -1 0 0 .0000000 0 0 1 .6666670 12 00024 *EOS 00025 00026 FORMULA EL NUM 00036 FORL s .5000o .5000c 1.0000 00037 *EOS 00038 00039 CONDITIONS 00040 CELLPAREX .7107 566.00 00041 INT PAREX .7107 566.00 .147 .681 92 00042 HKL PARE 0 2 0 4 0 12 00043 EQIVPARE 92 525 00044 *EOS 00045 00046 ATOMS NAME X U11 Y U22 Z U33 U U12 P U13 U23 MUL AT DT 00052 UALL .03500 00053 ATCO s .20140 .79860 .91667 1.00000 6 s 200054 ATCE s .00040 .00040 .00000 .00000 00055 UIJ s .04100 .04100 .01000 .02500 -.00400 -.00400 00056 UIJE s .00800 .00800 .00700 .00700 .00500 .00500 00057 ATCO o .50100 .50100 .66667 1.00000 6 o 200058 ATCE o .00300 .00300 .00000 .00000 00059 UIJ o .08900 .08900 .09000 .06300 .00900 -.00900 00060 UIJE o .01800 .01800 .02000 .01900 .00800 .00800 00061 ATCO c 1 .49200 .09700 .03780 1.00000 12 c 200062 ATCE c 1 .00300 .00300 .00110 .00000 00063 UIJ c 1 .03170 .03170 .03170 .01585 .00000 .00000 00064 UIJE c 1 .00000 .00000 .00000 .00000 .00000 .00000 00065 *EOS 00066 00067 CELL DIMENSIONS A B C ALPHA BETA GAMMA Z 00068 CELLPARE 8.5300 8.5300 20.3700 90.0000 90.0000 120.0000 12.000069 ERRSPARE .0100 .0100 .0100 .0100 
.0100 .0100 00070 VOL PARE 1283.571 3.0775 .5595 00071 PHYSPARE 566.0000 00072 *EOS 00073 00074 END 00081

Fig. 1.1.3.2. An example of a CSD BCCAB format file.

CIFIO

05-Mar87

p6122

Fig. 1.1.4.1. An abbreviated example of a Standard Crystallographic File Structure (SCFS) format file.

1.1.5. The impact of networking on crystallography The growth in power of individual minicomputers inevitably helped the development of computational techniques in crystallography. Yet perhaps a more profound development was networking – the ability to exchange electronic data directly between computers. The laborious procedures for transferring information by manual keystroke or exchange of card decks and magnetic tapes were replaced by error-free programmatic procedures. Initially, data could flow easily between computers in the same laboratory; then colleagues could exchange data between scientific departments on the same campus; and before long experimental results, programs and general communications were flowing freely across national and international networks. During the 1960s, networking was ad hoc and proprietary, and rarely extended effectively outside the laboratory. By the 1970s, however, a few standard networking protocols were becoming established. These included uucp, which promoted the growth of dial-up networking between university campuses, and TCP/IP, the transport protocol underlying the ARPANET, that would eventually give rise to the dominant Internet with which we are familiar today. The potential for improving the practice of crystallography through the ease of communications afforded by computer networks was very clear. However, the technology was still costly and required much effort and expertise to implement. Even towards the end of the decade, a meeting of protein crystallographers concluded (Freer & Stewart, 1979) that

The possibility and usefulness of establishing a computer network for communication among crystallographic laboratories was discussed. The implications for rapid updating and the ease with which programs and data could be transfered among the groups was clearly recognized by all present; however, immediate implementation of a network was not deemed practical by a majority of the participants.

By the mid-1980s, the establishment of a global computer network was well under way. There was still some diversity of transmission protocols on an international scale: uucp, BITNET and X.25 Coloured Book protocols were still competing with TCP/IP, so that communication between different networks had to be managed through gateways. Nevertheless, there was sufficient standardization that it was feasible to communicate with colleagues world-wide by e-mail, to transfer files by ftp and to log in to remote computers by telnet. E-mail, in particular, allowed for the rapid transmission of ASCII text in an arbitrary format. In many respects, this established a goal for other exchange formats to achieve. The establishment of anonymous ftp sites permitted the free exchange of software and data to any user; no special privileges on the host computer were needed. Such availability of electronic information fitted particularly well with the scientific ethic of open exchange of information.


By the early 1990s, TCP/IP and the Internet dominated international networking. The practices of open exchange of information were developed through a number of initiatives. Gopher (Anklesaria et al., 1993) provided a general mechanism to access material categorized and published from a computerized information store. WAIS (Kahle, 1991), a wide-area information server application designed to service queries conforming to the Z39.50 information retrieval protocol (ANSI/NISO, 1995), provided an effective distributed search engine. The rapid proliferation of new techniques for searching and retrieving information from the Internet was capped in the mid-1990s by the rapid growth in sites implementing hypertext servers (Berners-Lee, 1989). The World Wide Web had become a reality. The increasing access to global network facilities during the 1980s led to a growing interest among crystallographers in submitting manuscripts to journals electronically, especially for small-molecule structure studies. The Australian delegation at the 1987 General Assembly of the XIVth IUCr Congress in Perth proposed that IUCr journals (specifically Acta Crystallographica) should be able to accept manuscripts submitted electronically. It was argued that this would reduce effort on the part of the authors and the journal office in preparation and transcription of manuscripts, and as a consequence reduce costs and transcription errors and simplify data-validation approaches. The acceptance of this General Assembly resolution led to the creation of a Working Party on Crystallographic Information (WPCI), which had as its mandate the investigation of possible approaches to enable the electronic submission of crystallographic research publications.

1.1.6. The Working Party on Crystallographic Information (WPCI)

The WPCI first convened at the 1988 ECM11 conference in Vienna. In the discussions leading up to this meeting, it was widely appreciated that electronic submissions to journals and databases involved data types (e.g. manuscript texts, graphical diagrams, the full suite of crystallographic data) that were beyond those accommodated within the SCFS format promoted by the IUCr Data and Computing Commissions. Consequently, it was suggested at the Vienna meeting that a general and extensible universal file approach, similar to the recently developed Self-defining Text Archive and Retrieval (STAR) File format (Hall, 1991; Hall & Spadaccini, 1994), might also be suitable for crystallographic data applications. At this meeting, it was decided that a WPCI working group, led by Syd Hall, should investigate the development of a universal file protocol that would be suitable for crystallographic data needs. Other universal formats existed, such as ASN.1 (ISO, 2002), which was used for data communications, JCAMP-DX (McDonald & Wilks, 1988), which was used for archiving infrared spectra, and the Standard Molecular Data (SMD) format (Barnard, 1990), which was used for the global exchange of chemical structure data. These were considered relatively inefficient for expressing the repetitive data lists commonly used in crystallography. The working group eventually proposed a Crystallographic Information File (CIF) format which had a syntax similar to, but simpler than, the STAR File. Of particular importance because of the rapid changes taking place with data types, the CIF approach provided a very flexible and extensible file structure in which any type of text or numerical data could be arranged in any order. The typical data structure of a CIF is illustrated in Fig. 1.1.6.1, using the same data as presented in the PDB file of Fig. 1.1.3.1. Similarly, Fig. 1.1.6.2 shows the data in the BCCAB file of Fig. 1.1.3.2 in CIF format.

data_crambin
_entry.id                1CRN

_audit.creation_date     1993-04-21
_audit.creation_method   'manual editing of PDB entry'
_audit.update_record
; 1993-04-21  Original PDB entry history recorded here for completeness.
  30-apr-81  deposition.
  28-jul-81  1crn 0
  03-dec-81  correct residue number on strand 1 of sheet s1.
  30-sep-83  insert revdat records
  04-mar-85  insert new publication as reference and renumber
  16-apr-87  change deposition date from 31-apr-81 to 30-apr-81.
;

loop_
_struct.entry_id
_struct.title
  1CRN  'Crambin from Abyssinian cabbage (Crambe abyssinica) seed'

loop_
_citation.id
_citation.year
_citation.journal_abbrev
_citation.journal_volume
_citation.page_first
_citation.journal_id_ASTM
_citation.journal_id_ISSN
_citation.title
  primary  1984  Biochemistry  23  6796  ?  0006-2960
; Raman spectroscopy of homologous plant toxins: crambin and alpha 1- and
  beta-purothionin secondary structures, disulfide conformation, and
  tyrosine environment
;
  1  1984  Proc.Nat.Acad.Sci.USA  81  6014  pnasa6  0027-8424
; Water structure of a hydrophobic protein at atomic resolution.
  Pentagon rings of water molecules in crystals of crambin
;
  2  1981  Nature  290  107  natuas  0028-0836
; Structure of the hydrophobic protein crambin determined directly from
  the anomalous scattering of sulphur
;

loop_
_citation_author.citation_id
_citation_author.name
  primary  'Williams, R.W.'
  primary  'Teeter, M.M.'
  1        'Teeter, M.M.'
  2        'Hendrickson, W.A.'
  2        'Teeter, M.M.'

loop_
_entity.id
_entity.type
_entity.details
  1  polymer      'Protein chain: *'
  2  non-polymer  'het group EOH'

loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
  1  1 THR   1  2 THR   1  3 CYS   1  4 CYS   1  5 PRO
  1  6 SER   1  7 ILE   1  8 VAL   1  9 ALA   1 10 ARG
  1 11 SER   1 12 ASN   1 13 PHE   1 14 ASN   1 15 VAL
#............................................sequence data omitted for brevity

_cell.length_a     40.960
_cell.length_b     18.650
_cell.length_c     22.520
_cell.angle_alpha  90.00
_cell.angle_beta   90.77
_cell.angle_gamma  90.00

_symmetry.space_group_name_H-M  'P 1 21 1'

loop_
_atom_type.symbol
_atom_type.description
_atom_type.number_in_cell
  C  carbon    404
  N  nitrogen  112
  O  oxygen    128
  S  sulfur     12
  H  hydrogen    ?

loop_
_atom_site.label_seq_id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.label_entity_id
_atom_site.id
  1  N  N    THR  *  1  .  17.047  14.099  3.625  1.00  13.79  1  1
  1  C  CA   THR  *  1  .  16.967  12.784  4.338  1.00  10.80  1  2
  1  C  C    THR  *  1  .  15.685  12.755  5.133  1.00   9.19  1  3
  1  O  O    THR  *  1  .  15.268  13.825  5.594  1.00   9.85  1  4
  1  C  CB   THR  *  1  .  18.170  12.703  5.337  1.00  13.02  1  5
  1  O  OG1  THR  *  1  .  19.334  12.829  4.463  1.00  15.06  1  6
  1  C  CG2  THR  *  1  .  18.150  11.546  6.304  1.00  14.23  1  7
#............................................atom-site data omitted for brevity

Fig. 1.1.6.1. Example 1 of a CIF (using the same data as shown in Fig. 1.1.3.1).
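Because every value in a CIF is identified by a data name rather than by its position on a line, a reader needs little more than a tokenizer. The Python sketch below is a simplified illustration and not the full syntax described in Chapter 2.2: it handles data_ headers, name-value pairs, loop_ lists, quoted strings, comments and semicolon-delimited text fields, and it ignores save frames and anything following a closing semicolon on the same line. The function names tokenize, classify and parse_cif are hypothetical.

```python
def classify(word):
    """Label an unquoted token as a data name, a keyword or a value."""
    lower = word.lower()
    if word.startswith("_"):
        return "tag"
    if lower == "loop_" or lower.startswith("data_"):
        return "keyword"
    return "value"


def tokenize(text):
    """Yield (kind, token) pairs from CIF text."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):                  # multi-line text field
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            yield ("value", "\n".join(body).strip())
            i += 1                                # skip the closing ';' line
            continue
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch.isspace():
                pos += 1
            elif ch == "#":                       # comment runs to end of line
                break
            elif ch in "'\"":                     # quoted string
                end = pos + 1
                while end < len(line):
                    closes = line[end] == ch and (
                        end + 1 == len(line) or line[end + 1].isspace())
                    if closes:
                        break
                    end += 1
                yield ("value", line[pos + 1:end])
                pos = end + 1
            else:                                 # bare word
                end = pos
                while end < len(line) and not line[end].isspace():
                    end += 1
                yield (classify(line[pos:end]), line[pos:end])
                pos = end
        i += 1


def parse_cif(text):
    """Return {block name: {data name: value, or list of values for loops}}."""
    tokens = list(tokenize(text))
    blocks, current, i = {}, None, 0
    while i < len(tokens):
        kind, word = tokens[i]
        if kind == "keyword" and word.lower().startswith("data_"):
            current = blocks.setdefault(word[5:], {})
            i += 1
        elif kind == "keyword":                   # loop_
            i += 1
            tags, values = [], []
            while i < len(tokens) and tokens[i][0] == "tag":
                tags.append(tokens[i][1])
                i += 1
            while i < len(tokens) and tokens[i][0] == "value":
                values.append(tokens[i][1])
                i += 1
            for column, tag in enumerate(tags):
                current[tag] = values[column::len(tags)]
        elif kind == "tag":
            current[word] = tokens[i + 1][1]
            i += 2
        else:
            raise ValueError(f"unexpected value {word!r}")
    return blocks


EXAMPLE = """\
data_BAWGEL
_chemical_name_systematic 'bis(Benzene)-chromium bromide'
_cell_length_a 9.735(6)
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Cr1 0.0     0.0     0.0
C1  0.06900 0.12800 0.13400   # rows follow the order of the names above
"""
block = parse_cif(EXAMPLE)["BAWGEL"]
print(block["_cell_length_a"], block["_atom_site_label"])
```

The single-valued items could be moved below the loop, or the order of the names inside the loop changed, without affecting the result, which is the flexibility that the fixed formats lacked.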

data_BAWGEL
_audit_creation_date      93-05-24
_audit_creation_method    manual_conversion_of_ccdc_file
_audit_update_record
; 82-07-05  CCDC entry created from journal data
            A.L.Spek,A.J.M.Duisenberg (1981) 189,10,1531
  93-05-21  Received file from Owen Johnson, CCDC.
  93-05-24  Initial conversion of the file to CIF/MIF format.
;
_chemical_name_systematic  'bis(Benzene)-chromium bromide'
_chemical_formula_moiety   'C12 H12 Cr1 1+,Br1 1-'

_cell_length_a           9.735(6)
_cell_length_b           9.316(3)
_cell_length_c           11.941(8)
_cell_angle_alpha        90
_cell_angle_beta         90
_cell_angle_gamma        90
_cell_formula_units_Z    4

_symmetry_space_group_name_H-M  Fmmm

loop_
_symmetry_equiv_pos_as_xyz
  x,y,z     x,1/2+y,1/2+z     1/2+x,y,1/2+z     1/2+x,1/2+y,z
  -x,y,z    -x,1/2+y,1/2+z    1/2-x,y,1/2+z     1/2-x,1/2+y,z
  x,-y,z    x,1/2-y,1/2+z     1/2+x,-y,1/2+z    1/2+x,1/2-y,z
  -x,-y,z   -x,1/2-y,1/2+z    1/2-x,-y,1/2+z    1/2-x,1/2-y,z

loop_
_atom_type_symbol
_atom_type_radius_bond
  C 0.68   H 0.23   Br 1.21   Cr 1.35

_exptl_crystal_density_meas   1.764
_refine_ls_R_factor_obs       0.0540
_reflns_observed_criterion    3sigma(I)

loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
  Cr1  0.0      0.0      0.0
  Br1  0.0      0.0      0.50000
  C1   0.06900  0.12800  0.13400
  C2   0.13900  0.0      0.13400
  H1   0.09300  0.20400  0.12500
  H2   0.19800  0.0      0.13000

loop_
_geom_bond_atom_site_label_1
_geom_bond_atom_site_label_2
_geom_bond_distance
_geom_bond_site_symmetry_1
_geom_bond_site_symmetry_2
  Cr1  C1  2.100  .  .
  Cr1  C2  2.090  .  .
  C1   C1  1.340  .  5_555
  C1   C2  1.370  .  .
  C1   H1  0.760  .  .
  C2   H2  0.580  .  .

Fig. 1.1.6.2. Example 2 of a CIF (using the same data as shown in Fig. 1.1.3.2).

1.1.7. The Crystallographic Information File As outlined in Section 1.1.6, the working group commissioned by the WPCI set out to establish an exchange protocol suitable for submitting crystallographic data to journals and databases, and this resulted in the development of the CIF syntax. At the same time, the group was also asked to form a list of those data items considered to be essential in a manuscript submitted to Acta Crystallographica. The data items originally recommended are listed in Fig. 1.1.7.1. The syntax of a CIF (a detailed description is given in Chapter 2.2) was intentionally a simple subset of the STAR File syntax (see Chapter 2.1 for details). This simplification was considered important for its easy implementation in existing crystallographic software packages – clearly a primary goal for any format that was to be widely available for submitting data to journals and databases. A compilation of data names referring to specific quantities or concepts in a crystal-structure determination was drawn up. This compilation included the items already identified as necessary for publication and many more besides. As a list of standard tags intended for unambiguous use, the collection was known from the outset as a dictionary of data names.

# The following CIF data names encompass the IUCr Journals Commission
# requirements for the reporting of a small-molecule structure in Acta
# Crystallographica, Section C
_chemical_compound_source
_chemical_formula_sum
_chemical_formula_moiety
_chemical_formula_weight
_symmetry_cell_setting
_symmetry_space_group_name_H-M
_cell_length_a
_cell_length_b
_cell_length_c
_cell_angle_alpha
_cell_angle_beta
_cell_angle_gamma
_cell_volume
_cell_formula_units_Z
_exptl_crystal_density_diffrn
_exptl_crystal_density_meas
_exptl_crystal_density_method
_diffrn_radiation_type
_diffrn_radiation_wavelength
_cell_measurement_reflns_used
_cell_measurement_theta_min
_cell_measurement_theta_max
_exptl_absorpt_coefficient_mu
_cell_measurement_temperature
_exptl_crystal_description
_exptl_crystal_colour
_exptl_crystal_size_max
_exptl_crystal_size_mid
_exptl_crystal_size_min
_exptl_crystal_size_rad
_diffrn_measurement_device
_diffrn_measurement_method
_exptl_absorpt_correction_type
_exptl_absorpt_correction_T_min
_exptl_absorpt_correction_T_max
_diffrn_reflns_number
_reflns_number_total
_reflns_number_observed
_reflns_observed_criterion
_diffrn_reflns_av_R_equivalents
_diffrn_reflns_theta_max
_diffrn_reflns_limit_h_min
_diffrn_reflns_limit_h_max
_diffrn_reflns_limit_k_min
_diffrn_reflns_limit_k_max
_diffrn_reflns_limit_l_min
_diffrn_reflns_limit_l_max
_diffrn_standards_number
_diffrn_standards_interval_count
_diffrn_standards_interval_time
_diffrn_standards_decay_%
_refine_ls_structure_factor_coef
_refine_ls_R_factor_obs
_refine_ls_wR_factor_obs
_refine_ls_goodness_of_fit_obs
_refine_ls_number_reflns
_refine_ls_number_parameters
_refine_ls_hydrogen_treatment
_refine_ls_weighting_scheme
_refine_ls_shift/esd_max
_refine_diff_density_max
_refine_diff_density_min
_refine_ls_extinction_method
_refine_ls_extinction_coef
_atom_type_scat_source
_refine_ls_abs_structure_details
_computing_data_collection
_computing_cell_refinement
_computing_data_reduction
_computing_structure_solution
_computing_structure_refinement
_computing_molecular_graphics
_computing_publication_material
_publ_section_experimental

Fig. 1.1.7.1. Initial set of data items considered to be essential in a structure report submitted to Acta Crystallographica Section C.

The WPCI proposed the CIF format as a standard exchange protocol at the open meetings of the IUCr Commissions on Crystallographic Data and Computing at the 1990 XVth IUCr Congress in Bordeaux. The proposal was accepted and the CIF format was subsequently adopted by the IUCr as the preferred format for data exchange (Hall et al., 1991). The administration of the CIF standard, including the approval of new data items, is the responsibility of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS). This committee plays a central role in the coordination of CIF activities, such as the creation of new dictionaries for defining crystallographic data items and the updating of data definitions in existing dictionaries. Chapter 3.1 describes relevant aspects of its role in the commissioning and maintenance cycle. Information about COMCIFS activities and other CIF developments may be obtained from http://www.iucr.org/iucr-top/cif/.

##############################################################################
#
#                      DDL Data Name Descriptions
#                      --------------------------
#
# _compliance          The dictionary version in which the item is defined.
#
# _definition          The description of the item.
#
# _enumeration         A permissible value for an item. The value 'unknown'
#                      signals that the item can have any value.
#
# _enumeration_default The default value for an item if it is not specified
#                      explicitly. 'unknown' means the default is not known.
#
# _enumeration_detail  The description of a permissible value for an item.
#                      Note that the code '.' normally signals a null
#                      or 'not applicable' condition.
#
# _enumeration_range   The range of values for a numerical item. The
#                      construction is 'min:max'. If 'max' is omitted then the
#                      item can have any value greater than or equal to 'min'.
#
# _esd                 Signals whether an estimated standard deviation is
#                      expected to be appended (enclosed within brackets)
#                      to a numerical item. May be 'yes' or 'no'.
#
# _esd_default         The default value for the esd of a numerical item
#                      if a value is not appended.
#
# _example             An example of the item.
#
# _example_detail      A description of the example.
#
# _list                Signals whether an item is expected to occur in a looped
#                      list. Possible values: 'yes','no' or 'both'.
#
# _list_identifier     Identifies a data item that MUST appear in the list
#                      containing the currently defined data item.
#
# _name                The data name of the item defined.
#
# _type                The data type 'numb' or 'char' (latter includes 'text').
#
# _units_extension     The data-name extension code used to specify the units
#                      of a numerical item.
#
# _units_description   A description of the units.
#
# _units_conversion    The method of converting the item into a value based
#                      on the default units. Each conversion number is
#                      preceded by an operator code *, /, +, or - which
#                      indicates how the conversion number is applied.
#
# _update_history      A record of the changes to this file.
#
#-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof-eof

1.1.8. Diversification: the Molecular Information File and dictionary definition language While the primary thrust of these activities was the development of an exchange mechanism for crystal-structure reports, there was also interest in enriching the description of the chemical properties and behaviour of the compounds under study. Some work was therefore done to broaden the descriptions of bond order that were present in rudimentary form in the core CIF dictionary and to develop more detailed two-dimensional graphical representations of chemical molecules. The result of this work was the Molecular Information File (MIF), described in Chapter 2.4. As with CIF, the specific data items required for MIF were defined in a dictionary. This work has extended the STAR File approach into chemistry. It is envisaged that later modules could describe spectroscopic data, reaction schemes and much more. Particular requirements of chemical structural databases are the need to query for generic structures, and the need to allow for the labelling and comparison of libraries of substructural components. Both requirements can be met by features of the STAR File, but they are features omitted from CIF. In practice, therefore, data files in MIF format cannot be readily accessed by most crystallographic applications, and the format is at present little used by crystallographers. An important outcome of the work on MIF was the recognition that attributes of the data items needed for a particular application can be recorded using the same formalism as the data files themselves. This gave rise to the idea of a dictionary definition language (DDL), a set of tags for describing the names and attributes of data items. The dictionaries (the collections of data names for CIF and MIF applications) could then be constructed as STAR Files themselves, with the immediate result that software written to parse data files could equally easily parse the associated dictionaries. 
Now it became feasible to build into applications the ability to validate data by dynamically reading and interpreting the properties associated with a data tag in an accompanying dictionary. The idea of a DDL was proposed by Tony Cook during early discussions on MIF and was adopted while the original CIF paper (Hall et al., 1991) was in the press. The original core CIF dictionary was therefore produced with an early version of a DDL that was never fully documented (Fig. 1.1.8.1). Building on early experience with the core dictionary and the technical evolution of MIF, Hall & Cook (1995) worked through several revisions before publishing DDL version 1.4, the stable version described in Chapter 2.5 of this volume. Because of the circumstances in which it was developed, this dictionary definition language is able to accommodate both the flat-file quasi-relational structure of CIF and the more hierarchical multiple-looped data model of MIF.

Fig. 1.1.8.1. Informal DDL used in the initial version of the core CIF dictionary (now superseded by DDL1.4).

The group’s original short-term goal was to fulfil the IUCr mandate of defining data names for an adequate description of a macromolecular crystallographic experiment and its results. Longer-term goals were also determined: to provide sufficient data names so that the experimental section of a structure paper could be written automatically and to facilitate the development of tools so that computer programs could easily interface with CIF data files. A number of informal and formal meetings were held to describe the progress of this project and to solicit community feedback. An important meeting took place at the University of York in April 1993. The attendees included the mmCIF working group, structural biologists and computer scientists. Vigorous discussion arose on whether the formal structure of the dictionary implemented in the then-current dictionary definition language (DDL1) could deal with the complexity of macromolecular data sets. There were criticisms that the data typing was not strong enough and that there were no formal links expressing relationships between data items. A working group was formed to address these issues, resulting in a second workshop in Tarrytown, New York, in October 1993. The discussions at this second meeting focused on the development of software tools and the requirements of an enhanced DDL. Such a DDL was proposed during a third workshop at the Free University of Brussels in October 1994. This new DDL (DDL2; Chapter 2.6) was designed by John Westbrook and sought to address the various problems identified at the preceding workshops, while retaining compatibility with existing CIF data files.

1.1.9. The macromolecular Crystallographic Information File The original goal of CIF was the creation of an archive format for the description and results of experiments in small-molecule and inorganic crystallography. The data names and their definitions were embodied in the core dictionary, so called because most of the terms in it were considered common to any crystallographic application. In 1990, the IUCr formed a working group, chaired by Paula Fitzgerald, to expand this dictionary to meet the additional requirements of macromolecular crystallography. The resulting expanded dictionary was to be known as the macromolecular CIF (mmCIF) dictionary.


The macromolecular dictionary was built using DDL2 and was presented as a draft to the community at the American Crystallographic Association meeting in Montreal in July 1995. The draft was subsequently posted on a website and community comment solicited via an e-mail discussion list. This provoked lively discussions, leading to continuous correction and updating of the dictionary over an extended period of time. Software for parsing the dictionary and managing mmCIF data sets was developed and was also presented on the website. In January 1997, the mmCIF dictionary was completed and submitted to COMCIFS for review. In June 1997, version 1.0 was approved by COMCIFS and released (Bourne et al., 1997; Fitzgerald et al., 1996). A workshop was held at Rutgers University in October 1997, at which tutorials were presented to demonstrate the use of the various tools that had been developed. More than 100 new definitions have been incorporated in version 2 of the mmCIF dictionary, presented in Chapter 4.5.

At first sight, the distinction between CIF and the crystallographic binary file may therefore seem trivial. However, in allowing binary data, the CBF requires greater care in defining the structure and packing by octets of the data, and loses some of the portability of CIF. It also precludes the use of many simple text-based file-manipulation tools. On the other hand, the importing of a stable and well developed tagged information format means that developers do not need to write novel parsers and compatibility with CIFs is easily attained. Indeed, the original idea of a fully compliant image CIF has been retained. By ASCII-encoding the image data in a crystallographic binary file, a fully compliant CIF, known as imgCIF, may be simply generated. One could consider an imgCIF as an archival version and a CBF as a working version of the same information set.
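The trade-off between binary storage and ASCII encoding can be made concrete with a few lines of code. The sketch below uses base64 purely as one familiar example of an ASCII encoding; it makes no claim about which encodings a particular imgCIF writer uses, and the random bytes merely stand in for detector output.

```python
import base64
import os

# Stand-in for raw detector output: 3000 bytes of arbitrary binary data.
raw = os.urandom(3000)

# Base64 is used here only as one example of an ASCII encoding.
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw)
assert encoded.isascii()                 # every byte is a 7-bit ASCII character
assert base64.b64decode(encoded) == raw  # the information content is preserved
print(f"{len(raw)} bytes -> {len(encoded)} ASCII characters ({overhead:.2f}x)")
```

The encoded form is lossless but about one third larger, and it must be decoded before use, which illustrates the size and processing overheads weighed against portability.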

1.1.11. Other extension dictionaries Since the introduction of the major CIF dictionaries, several other compendia of data names suitable for describing different applications or disciplines within crystallography have been developed. Four of these are described in the current volume. The powder CIF dictionary (pdCIF; Chapter 3.3) is a supplement to the core dictionary addressing the needs of powder diffractionists. The structural model derived from powder work is familiar to single-crystal small-molecule or inorganic scientists. However, the powder CIF effort had the additional goals of documenting and archiving experimental data. It was always intended that powder CIF be used for communication of completed studies and for data exchange between laboratories. This is frequently done at shared diffraction facilities such as neutron and synchrotron sources. The powder dictionary was written with data from conventional X-ray diffractometers and from synchrotron, continuous-wavelength neutron, time-of-flight neutron and energy-dispersive X-ray instruments in mind. The modulated-structures dictionary (msCIF; Chapter 3.4) is also considered as a supplement to the core dictionary and is designed to permit the description of incommensurately modulated crystal structures. The project was sponsored by the IUCr Commission on Aperiodic Crystals and was developed in parallel with a standard for the reporting of such structures in the literature (Chapuis et al., 1997). A small dictionary of terms for reporting accurate electron densities in crystals (rhoCIF; Chapter 3.5) has recently been published as a further supplement to the core dictionary. It has been developed under the sponsorship of the IUCr Commission on Charge, Spin and Momentum Densities. The symmetry dictionary (symCIF; Chapter 3.8) was developed under the direct sponsorship of COMCIFS with the objective of producing a rigorous set of definitions suitable for the description of crystallographic symmetry. 
Following its publication, several data names from the symCIF dictionary were incorporated in the latest version of the core dictionary to replace the original informal definitions relating to symmetry. The symCIF dictionary contains most of the data names that would be needed to tabulate space-group-symmetry relationships in the manner of International Tables for Crystallography Volume A (2002). It is intended to expand the dictionary to include group–subgroup relations in a later version. Other dictionaries are also under development, often under the supervision of one of the Commissions of the International Union of Crystallography.

1.1.10. The Crystallographic Binary File In 1995, Andy Hammersley approached the IUCr with a proposal to develop an exchange and archival mechanism for image data using the CIF formalism. There was an increasing need to record and exchange two-dimensional images from a growing collection of area detectors and image plates from several manufacturers, each using different and proprietary data-storage formats. The project was encouraged by the IUCr and was developed under a variety of working names. In October 1997, a workshop at Brookhaven National Laboratory, New York, was convened to discuss progress and to coordinate the development of software suitable for this new format. At the workshop, it became apparent that there was broad consensus to adopt and further develop the working set of data names that had been devised by the working group on this project; it was further decided that the relationships between these data names were best handled by the DDL2 formalism used for mmCIF. While image plates were not the sole preserve of macromolecular crystallography, it was felt that there would be maximum synergy with that community, and that the proposed imgCIF dictionary was most naturally viewed as an extension or companion to the mmCIF dictionary. The adoption of DDL2 would not preclude the generation of DDL1 analogue files if they were found to be necessary in certain applications, but to date such a need has not arisen. However, on one point the workshop was insistent. For efficient handling of large volumes of image data on the necessary timescales within a large synchrotron research facility, the raw data must be in binary format. 
It was argued that although ASCII encoding could preserve the information content of a data stream in a fully CIF-compliant format, the consequent overheads in increased file size and data-processing time were unacceptable in environments with a very high throughput of such data sets, even given the high performance of modern computers. Consequently, the Crystallographic Binary File (CBF) was born as an extension to CIF (Chapter 2.3). The CBF may contain binary data, and therefore cannot be considered a CIF (Chapter 2.2). However, except for the representation of the image data, the file retains all the other features of CIF. Information about the experimental apparatus, duration, environmental conditions and operating parameters, together with descriptions of the pixel characteristics of individual frames, are all provided in ASCII character fields tagged by data names that are themselves fully defined in a DDL2-based dictionary (Chapter 3.7).


Mention should also be made of the use of STAR Files by the BioMagResBank group at the University of Wisconsin to record NMR structures. This work (Ulrich et al., 1998) endeavours to be complementary to the mmCIF descriptions of structures in the Protein Data Bank.

(CML; Murray-Rust & Rzepa, 1999, 2001), IUPAC project groups are mapping out areas of the science for which suitable DTDs and schemas may be constructed that tag relevant chemical content. In this context, the CIF dictionaries described and annotated at length in this volume provide a sound basis for a machine-parsable ontology for crystallographic data. Given the orderly classification and relationships between tags in a CIF data set, format transformations using XML tags are not at all difficult to achieve. Chapters 5.3 and 5.5 describe tools for converting between mmCIF and XML formats for interchange within the biological structure community. In the future, we expect to see the growth of XML environments where the full content of the CIF dictionaries is imported for the purpose of tagging crystallographic data embedded in more general documents and data sets. There have already been examples of ontology development using CIF dictionaries; for example, the Object Management Group has developed software classes for middleware in biological computing applications that are modelled on the mmCIF dictionary (Greer, 2000). The use of interactive STAR ontologies is described by Spadaccini et al. (2000). It is possible that as interdisciplinary knowledge-management software systems develop, the exchange of crystallographic data will occur through XML files or other formats. Nevertheless, CIF will remain an efficient mechanism for developing detailed data models within crystallography. We can confidently say that, whatever the actual transport format, the intellectual content of the dictionaries in this volume and their subsequent extensions and revisions will continue to underpin the definition and exchange of crystallographic data.
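The remark that transformations between CIF tags and XML are not difficult can be illustrated with a minimal round trip. The element and attribute names used here (`datablock`, `item`, `name`) are hypothetical choices for this sketch and do not correspond to the schemas produced by the tools of Chapters 5.3 and 5.5; the example also ignores looped data.

```python
import xml.etree.ElementTree as ET


def items_to_xml(block_name, items):
    """Wrap tag/value pairs in XML elements; the element names are hypothetical."""
    root = ET.Element("datablock", name=block_name)
    for tag, value in items.items():
        # Keep the data name as an attribute so it need not be a legal XML name.
        element = ET.SubElement(root, "item", name=tag)
        element.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_items(xml_text):
    """Recover the tag/value pairs from the XML produced above."""
    root = ET.fromstring(xml_text)
    return {element.get("name"): element.text for element in root.findall("item")}
```

Because each value is already identified by its data name, the mapping needs no knowledge of what the items mean; that knowledge stays in the dictionary.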

1.1.12. The broader context: CIF and XML In the light of more recent data-exchange developments, it will be surprising to newcomers to CIF that more use is not made in crystallography of the extensible markup language XML (W3C, 2001). However, the development of CIF predates XML, and the CIF format can be easily translated to and from suitable XML representations. Most current crystallographic software imports and exports data in CIF format and the use of XML only becomes important in applications that cross the boundary of crystallography and involve interoperability with other scientific domains. At one time, the antecedent of XML, standard generalized markup language (SGML; ISO, 1986), was considered as a candidate for a crystallographic exchange mechanism. SGML is a highly flexible and extensible system for specifying markup languages, but is extremely general. In the late 1980s, successful SGML implementations stretched the capacity of affordable computers and little accompanying software was available. SGML was at that time a suitable data and document tagging mechanism for large-scale publishers, but was far from appropriate for smaller-scale applications. XML was introduced during the 1990s as a specific SGML markup with a concrete syntax and simplifications that resulted in a lightweight, manageable language, for which robust parsers, editors and other programs could be written and implemented on desktop computers. The consequence has been a very rapid adoption of XML across many disciplines. Parallels may be drawn with the decision to implement CIF as a subset of the more general STAR File. XML provides the ability to mark up a document or data set with embedded tags. Such tags may indicate a particular typographic representation. More usefully, however, they can reflect the nature or purpose of the information to which they refer. 
The design of such useful and well structured content tagging (in XML and other formalisms) is referred to in terms of constructing a subject or domain ontology. This term is rather poorly defined, but broadly covers the construction for a specific topic area of a controlled vocabulary of terms, the elaboration of relationships between those terms, and rules or constraints governing the use of the terms. In XML and SGML, document-type definitions (DTDs) and schemas exist as external specifications of the markup tags permitted in a document, their relationships and any optional attributes they might possess. If carefully designed, these have the potential to act as ontologies. A simplification that XML offers over SGML is the ability to construct documents that do not need to conform to a particular schema. This makes it rather easier to develop software for generating and transmitting XML files. However, if diverse applications are to make use of the information content in an XML file, there must be some general way to exchange information about the meaning of the embedded markup, and in practice DTDs or schemas are essential for interoperability between software applications from different sources. Recent initiatives in chemistry, under the aegis of the International Union of Pure and Applied Chemistry (IUPAC), suggest an active interest in the development of a machine-parsable ontology for chemistry. Building on an existing XML representation of chemical information known as Chemical Markup Language

References Allen, F. H. (2002). The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380–388. Anklesaria, F., McCahill, M., Lindner, P., Johnson, D., Torrey, D. & Alberti, B. (1993). The Internet gopher protocol (a distributed document search and retrieval protocol). RFC 1436. Network Working Group. http://www.ietf.org/rfc/rfc1436.txt. ANSI/NISO (1995). Information retrieval (Z39.50): Application service definition and protocol specification. Z39.50-1995. http://lcweb.loc.gov/z3950/agency/1995doce.html. Barnard, J. M. (1990). Draft specification for revised version of the Standard Molecular Data (SMD) format. J. Chem. Inf. Comput. Sci. 30, 81–96. Berners-Lee, T. (1989). Information management: a proposal. Internal report. Geneva: CERN. http://www.w3.org/History/1989/proposalmsw.html. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542. Bourne, P., Berman, H. M., McMahon, B., Watenpaugh, K. D., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). Macromolecular Crystallographic Information File. Methods Enzymol. 277, 571–590. Brown, I. D. (1988). Standard Crystallographic File Structure-87. Acta Cryst. A44, 232. Busing, W. R., Martin, K. O. & Levy, H. A. (1962). ORFLS. Report ORNL-TM-305. Oak Ridge National Laboratory, Tennessee, USA. Chapuis, G., Farkas-Jahnke, M., Pérez-Mato, J. M., Senechal, M., Steurer, W., Janot, C., Pandey, D. & Yamamoto, A. (1997). Checklist for the description of incommensurate modulated crystal structures. Report of the International Union of Crystallography Commission on Aperiodic Crystals. Acta Cryst. A53, 95–100. Fitzgerald, P. M. D., Berman, H., Bourne, P., McMahon, B., Watenpaugh, K. & Westbrook, J. (1996). The mmCIF dictionary: community review and final approval. Acta Cryst. 
A52 (Suppl.), C575. Freer, S. T. & Stewart, J. (1979). Computer programming for protein crystallographic applications, University of California at San Diego, 28-29 November 1978. J. Appl. Cryst. 12, 426–427.


Kahle, B. (1991). An information system for corporate users: wide area information servers. Thinking Machines technical report TMC-199. See also Online, 15, 56–60. McDonald, R. S. & Wilks, P. A. (1988). JCAMP-DX: a standard form for exchange of infrared spectra in computer readable form. Appl. Spectrosc. 42, 151–162. Murray-Rust, P. & Rzepa, H. S. (1999). Chemical markup language and XML. Part I. Basic principles. J. Chem. Inf. Comput. Sci. 39, 928–942. Murray-Rust, P. & Rzepa, H. S. (2001). Chemical markup, XML and the world-wide web. Part II: Information objects and the CMLDOM. J. Chem. Inf. Comput. Sci. 41, 1113–1123. Spadaccini, N., Hall, S. R. & Castleden, I. R. (2000). Relational expressions in STAR File dictionaries. J. Chem. Inf. Comput. Sci. 40, 1289–1301. Stewart, J. (1963). XRAY63 Crystal Structure Calculations System. Report TR-64-6 (NSG-398). Computer Science Center, University of Maryland, USA, and Research Computer Center, University of Washington, USA. Ulrich, E. L. et al. (1998). XVIIth Intl Conf. Magn. Res. Biol. Systems. Tokyo, Japan. W3C (2001). Extensible Markup Language (XML). http://www.w3c.org/XML/.

Greer, D. S. (2000). Macromolecular structure RFP response, revised submission. http://openmms.sdsc.edu/OpenMMS-1.5.1 Std/ openmms/docs/specs/lifesci 00-11-01.pdf. Hall, S. R. (1991). The STAR file: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333. Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685. Hall, S. R. & Cook, A. P. F. (1995). STAR dictionary definition language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819–825. Hall, S. R. & Spadaccini, N. (1994). The STAR File: detailed specifications. J. Chem. Inf. Comput. Sci. 34, 505–508. International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn. Dordrecht: Kluwer Academic Publishers. ISO (1986). ISO 8879. Information processing – Text and office systems – Standard Generalized Markup Language (SGML). Geneva: International Organization for Standardization. ISO (2002). ISO/IEC 8824-1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization.



International Tables for Crystallography (2006). Vol. G, Chapter 2.1, pp. 13–19.

2.1. Specification of the STAR File

By S. R. Hall

and

(i.e. portable); (c) be both machine-parsable and human-understandable; (d) adapt to future data evolution (i.e. be extensible and robust); and (e) facilitate data structures of any complexity (i.e. have rich syntax capabilities). The extent to which these objectives can be met determines the universality of a data-storage approach. Several of these properties are difficult to achieve simultaneously, and the compromises that have been adopted have largely determined the success of universal data languages in different fields. The STAR File is a universal data language applicable to all scientific disciplines, and the Crystallographic Information File described in Chapter 2.2 of this volume is a specific instance of the STAR format that has been adopted by the crystallographic community.

2.1.1. Introduction A human language, in all its forms, is spoken, written and comprehended according to grammatical principles that evolve continuously with the need to express efficiently new ideas and experiences. In this chapter we will describe how similar principles have been developed to construct, describe and understand scientific data, such as numbers and codified text. The efficient and flexible expression of data may be achieved using grammatical rules similar to those of spoken languages, though the precise and unambiguous rendition and communication of data must preclude those nuances, subtleties and individual interpretation so important to the spoken and graphical world of poetry, literature and art. It is important to record these aspects of human endeavour, but they are distinctly different from the objectives of science, where information must be expressed with maximum precision and efficiency. Understanding any language involves two fundamental steps – identification of the individual elements of a language sequence, and comprehension of the sequence structure. In a spoken language these steps normally comprise the simultaneous recognition of individual words and the understanding of their grammatical context. Such a simple decoding process belies the enormous potential for complexity in spoken languages. Nevertheless, most ‘word sequences’ are understood in this way and a similar approach may be used with non-textual data. As discussed in Section 1.1.3, until quite recently most approaches to storing scientific data electronically were based on fixed-format structures. These are simple to construct and easy to comprehend provided the data layout is fixed and widely understood. That is, items, as well as lists of items, are written in a fixed sequential order that is mutually agreed on by those writing and reading the data. 
However, the preordained nature of a fixed format, which intentionally prevents changes to the data structure in a file, also poses a serious limitation for many scientific applications. This is because the nature of data used in scientific disciplines, such as in crystallography, evolves continuously and requires recording processes that are extensible and adaptable to change. A lack of ready extensibility in the expression of data is particularly problematical for long-term archiving. For example, the recovery of information stored in a fixed format may be impossible if the layout details are lost or altered with time. Less rigid fixed-format variants, using keywords to identify groups of data, can improve the flexibility of data recording but they often preclude the introduction of new kinds of data. In the 1980s it was widely recognized across the sciences that more general, extensible and expressive approaches were needed for recording, transmitting and archiving electronic data. This led to the development of free-format file structures. These file structures are the basis for universal file formats intended to: (a) store all kinds of data; (b) be independent of computer hardware

2.1.2. Universal data language concepts As discussed above, the basic precepts of a universal data language parallel those of spoken languages in that a syntax and a grammar are used to describe and arrange (i.e. present) information, and hence define its comprehension. In particular, because data values (as numbers, symbols or text) are not often self-descriptive (e.g. a temperature value as a number alone cannot be distinguished from a number representing an atomic weight) interpretation depends critically on appropriate cues to the meaning and organization of data. A cue, in this context, has traditionally been prior knowledge of a fixed format, but can be explicit descriptions of data items, or even embedded codes identifying the relationship between data items. Cues represent the contextual thread of a data language, in the same way that grammatical rules provide words in natural-language sentences with meaning, and convert computer languages into executable code. This section will consider the concepts of data and languages that are important in understanding the syntax and purpose of a STAR File. The first requirement of a language for electronic data transmission is that it be computer-interpretable, i.e. parsable. One should be able to read, interpret, manipulate and validate data entirely by machine. The other essential attributes of a universal file, as listed in Section 2.1.1, are flexibility, extensibility and applicability to all kinds of data. In modern-day science, data items change and expand continually and it is therefore paramount that a data language can accommodate new kinds of data seamlessly. It is the inflexibility and restrictive scope of fixed formats that limit their usefulness to crystallography for purposes other than local and temporary data retention. 2.1.2.1. Data models Various approaches exist for meeting the objectives for a universal file listed in Section 2.1.1. 
The task is complicated by the fact that properties such as generality and simplicity, or accessibility and flexibility, are in many respects technically incongruent, and, depending on the nature of the data to be represented, particular compromises may be taken for efficiency or expediency. In other words, data languages, like spoken languages, are not equally efficient at expressing concepts, and the syntax of a language may need to be tailored to the target applications.

Affiliations: Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia; Nick Spadaccini, School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia.

Copyright © 2006 International Union of Crystallography

N. Spadaccini


2. CONCEPTS AND SPECIFICATIONS This was one of the issues confronting the Working Party on Crystallographic Information formed by the IUCr in 1988 (Section 1.1.6) to decide on the most appropriate universal file data language for crystallography from those under development (McCarthy, 1990). It is interesting historically to note that one of these was HGML – not the web markup language we know today, but the Human Genome Mapping Library language. Another language considered by the working party was ASN.1 (ISO, 2002) used by the National Institute of Standards and Technology and several US Government departments. It is an accepted ANSI and ISO standard for data communication and is supported by software, such as NIST’s OSI Toolkit. ASN.1 possesses a rich set of language constructs suited to representing complex data, but suffers from data identifiers that are encoded and not human-readable, and a syntax that is verbose (particularly for repetitive data items such as those common in crystallography). These characteristics mean that a typical Protein Data Bank (PDB) file expressed in ASN.1 notation increases in size by up to a factor of 5. This was hardly an attraction in the 1980s when storage media were very expensive. Moreover, the ASN.1 syntax is not particularly intuitive, and is difficult to read and to construct. In contrast, the STAR File proposed at the first WPCI meeting had a relatively simple syntax, was human-readable and provided a concise structure for repetitive data. It also proved suitable for constructing electronic dictionaries, as will be discussed in later chapters. However, its serious and well recognized weakness in 1988 was that any recording approach using a simple syntax to encode complex data must involve sophisticated parsing software, and at that time only the prototype software (QUASAR; Hall & Sievers, 1990) was available. 
It was therefore not a straightforward decision for the WPCI to decide to recommend the STAR File syntax as a more appropriate data language for crystallographic applications. It was this decision that led to the development of the CIF approaches described in this volume.

2.1.3.1. Text string A text string is defined as any of the following. (a) A sequence of non-white-space characters on a single line excluding a leading underscore (ASCII 95). Examples: 5.324 light-blue

(b) A sequence of characters on a single line bounded by a leading digraph (white space followed by a single-quote character, ASCII 39) and a trailing digraph (a single-quote character followed by white space). Examples: 'light blue' 'classed as "unknown"' 'Patrick O'Connor'

Note that the use of the single-quote character in a text string that is bounded by single-quote characters is not precluded unless it is immediately followed by white space. The leading and trailing digraphs serve to delimit the string and do not form part of the data. In the above example the value associated with the text field 'light blue' is light blue.

(c) A sequence of characters on a single line bounded by a leading digraph (white space followed by a double-quote character, ASCII 34) and a trailing digraph (a double-quote character followed by white space). As with single quotes, a double-quote character within the string is not precluded unless it is immediately followed by white space.

… | '(' | ')' | '*' | '+' | ',' | … | '0' | '1' | '2' | '3' | '4' | … | '8' | '9' | ':' | … | '?' | '@' | 'A' | 'B' | … | 'F' | 'G' | 'H' | 'I' | 'J' | … | 'N' | 'O' | 'P' | 'Q' | 'R' | … | 'V' | 'W' | 'X' | 'Y' | 'Z' | … | 'a' | 'b' | 'c' | 'd' | 'e' | … | 'i' | 'j' | 'k' | 'l' | 'm' | … | 'q' | 'r' | 's' | 't' | 'u' | … | 'y' | 'z' | '{' | '|' | '}' | …

We define a 'comment' to be initiated with a blank or a line terminator and the character #, followed by any sequence of characters (which include blanks). The only characters not allowed are those in the line-terminator production, and hence these characters terminate a comment. Note the requirement of a leading blank or line terminator is dropped if the # character is the first character in the file. In production form, a comment is one or more of these initiating characters, then '#', then zero or more further characters.

We accept as white space all elements in the above three productions. White spaces are the lexemes able to delimit the lexical tokens. Note that a comment is a legitimate white space because it must end with a line terminator, and hence delimits tokens.

… ::= { { …* } | { …* } | …+ }+

Data values of type I data are immediately preceded by a …. Data values of type II data are immediately preceded by a ….
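The single-line text-string rules of Section 2.1.3.1 and the comment rule given above can be sketched together as a small tokenizer. This is a minimal illustration, not the STAR grammar: it handles one line at a time, treats double-quote delimiters in the same way as single-quote ones, and does not distinguish data names from data values. The function name `tokenize` is invented for this sketch.

```python
def tokenize(line):
    """Split one line into values, dropping comments.

    A quoted string opens with a quote at the start of a token and closes at
    the first matching quote that is followed by white space or end of line.
    """
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch == "#":
            break  # comment: this branch is reached only at the start of a token
        elif ch in "'\"":
            j = i + 1
            while j < n and not (line[j] == ch and (j + 1 == n or line[j + 1].isspace())):
                j += 1
            if j == n:
                raise ValueError("unterminated quoted string")
            tokens.append(line[i + 1:j])  # delimiters are not part of the value
            i = j + 1
        else:
            j = i
            while j < n and not line[j].isspace():
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens
```

Because a closing quote counts only when white space follows it, the apostrophe in 'Patrick O'Connor' stays inside the value, and a # inside an unquoted token such as a#b does not start a comment.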


References

::= * *

Allen, F. H., Barnard, J. M., Cook, A. P. F. & Hall, S. R. (1995). The Molecular Information File (MIF): core specifications of a new standard format for chemical data. J. Chem. Inf. Comput. Sci. 35, 412–427. Hall, S. R. (1991). The STAR File: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333. Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685. Hall, S. R. & Cook, A. P. F. (1995). STAR dictionary definition language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819–825. Hall, S. R. & Sievers, R. (1990). QUASAR: CIF syntax checking and file manipulation tool. ftp://ftp.iucr.org/pub/quasar. Hall, S. R. & Spadaccini, N. (1994). The STAR File: detailed specifications. J. Chem. Inf. Comput. Sci. 34, 505–508. ISO (2002). ISO/IEC 8824-1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization. McCarthy, J. L. (1990). Data interchange standards for biotechnology: issues and alternatives. National Center for Biotechnical Information Report. NIH/DOE. McLennon, B. J. (1983). Principles of programming languages, design, evaluation and implementation. New York: Holt, Rinehart and Winston. Spadaccini, N., Hall, S. R. & Castleden, I. R. (2000). Relational expressions in STAR File dictionaries. J. Chem. Inf. Comput. Sci. 40, 1289–1301. Westbrook, J. D. & Hall, S. R. (1995). A Dictionary Description Language for Macromolecular Structure. http://ndbserver.rutgers.edu/mmcif/ddl.

The string bounded by square brackets can consist of any character including and , and excluding the characters [ and ] unless they are escaped or are balanced. ::= { | | |

’\\’ ’[’ | ’\\’ ’]’

}*

The development of the STAR File, and particularly its application to crystallographic data, involved many people. The major contributions are acknowledged in the references below and other chapters in this volume. We name here only contributors who are not listed elsewhere in STAR-related publications. Particular thanks are due to Richard Goddard, Ted Maslen, Andre Authier, Philip Coppens, Jim King, Mike Dacombe, Peter Strickland, Mike Hoyland and George Sheldrick for their interest, support and commitment during the development of the STAR File concepts and its applications to crystallography.


International Tables for Crystallography (2006). Vol. G, Chapter 2.2, pp. 20–36.

2.2. Specification of the Crystallographic Information File (CIF)

By S. R. Hall and J. D. Westbrook

With Section 2.2.7 by S. R. Hall, N. Spadaccini, I. D. Brown, H. J. Bernstein, J. D. Westbrook and B. McMahon

2.2.1. Introduction

The term ‘Crystallographic Information File’ (CIF) refers to data and dictionary files conforming to the conventions adopted by the IUCr in 1990 and revised by the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS). The CIF format is intended to meet the needs of a wide range of scientific applications within, and without, the discipline of crystallography. Parts 2 and 3 of this volume provide the full specification of the contents of CIF across the different crystallographic applications. The files used in these applications must conform to the same rules of syntax, and share certain properties and conventions in the way that information is presented. It is these common features that are discussed in this chapter.

The CIF family of applications uses a proper subset of the STAR File syntax described in Chapter 2.1. The STAR File grammar provides a very general approach to storing and accessing data values through the use of an associated data name, or tag. A CIF search tool, such as Star_Base (Chapter 5.2), can readily access a single data value, or set of values, using this tag without prior knowledge of the order of the file contents. It can also provide details of the context of the data within the file structure. Context, in this sense, is a fully annotated indication of the file structure in which the retrieved value was located. That is, whether it was located in a global declaration, a named save frame, or a looped list. In every case the data ‘value’ is simply a character string and the STAR File protocol itself imposes absolutely no meaning on that string. This leaves the interpretation of the value string (e.g. whether it is numerical or text) to the conventions of the applications used to read and write the STAR File.

The CIF approach to the permissive STAR File syntax is restrictive. In the first place, the earliest version of the CIF syntax (Hall et al., 1991) did not adopt some of the grammatical (or syntactical) constructs available to the STAR File in order to facilitate existing crystallographic software approaches. This was in anticipation of likely short-term developments, and to encourage a rapid take-up of the CIF approach. For these reasons it was considered appropriate to adopt only data-block partitioning and a single, rather than multiple, level of looped lists. Save frames were adopted into the CIF later but only in CIF dictionary files written using the DDL2 dictionary definition language (see Chapter 2.6). It is relevant to point out, however, that the full STAR File syntax has been adopted for the storage of NMR experimental and structure data (Ulrich et al., 1998).

In 2002, a COMCIFS review of the design and implementation of CIF led to a revised syntax specification, which was published in February 2003. This revised specification is reproduced in full in Section 2.2.7. It is important to note that after more than a decade of CIF usage, this revision contains few substantial changes to the design choices of the original version (Hall et al., 1991). There have been some modest extensions to the lengths of data names and text lines, and a number of clarifications are introduced. In addition, various privileged labels used for STAR File constructs (e.g. global blocks, save frames and nested loops, as described in Chapter 2.1) have now been explicitly reserved (i.e. excluded from appearing in unquoted form in an existing CIF). This will allow the clean upward migration of future CIF syntax versions to the more complex data structures permitted in a STAR File, when and if these are later required by the community.

The remainder of this chapter is structured as follows. First, there is a brief description of CIF terminology (Section 2.2.2). This is followed by the syntax rules, corresponding to a subset of the STAR syntax, used by CIF data files (Section 2.2.3). The portability and archival issues that programmers must be aware of in applying CIF data in different computing environments are described in Section 2.2.4. They are also detailed in the formal specifications given at the end of the chapter. Section 2.2.5 describes the conventions regarding data typing and embedded semantics that are common to all CIF applications, and Section 2.2.6 outlines future possible ways of introducing metadata which would enable files to be linked to each other, and which would establish the nature of CIF contents within a more general framework of information storage systems. The final section of the chapter, Section 2.2.7, reproduces in full the formal specification documents approved by COMCIFS.

2.2.2. Terminology

A summary of the basic terminology used throughout this chapter and the volume follows. A more extensive description of this terminology is given as formal specifications in Section 2.2.7.1.2.

(i) A CIF is a file conforming to the specification presented in this chapter. The term includes both data files containing information on a structural experiment or its results (or similar scientific content) and the dictionary files that provide descriptions of the data identifiers used in such data files.

(ii) A data name or tag is an identifier (a string of characters beginning with an underscore character) of the content of an associated data value.

(iii) A data value is a string of characters representing a particular item of information. It may represent a single numerical value; a letter, word or phrase; extended discursive text; or in principle any coherent unit of data such as an image, audio clip or virtual-reality object.

(iv) A data item is a specific piece of information defined by a data name and its associated data value.

(v) A data block is the highest-level component of a CIF, containing data items or (in the case of dictionary files only) save frames. A data block is identified by a data-block header, which is an isolated character string (that is, bounded by white space and not forming part of a data value) beginning with the case-insensitive reserved characters data_. A block code is the variable part of a data-block header, e.g. the string foo in the header data_foo.

(vi) A looped list of data is a set of data items represented as a table or matrix of values. The data names are assembled immediately following the word loop_, each separated by white space, and the associated data values are then listed in strict rotation. The table of values is assembled in row-major order; that is, the first occurrence of each of the data items is assembled in sequence, then the second occurrence of each item, and so forth. In a CIF, looped lists may not be nested.

Affiliations: Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia; John D. Westbrook, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA; Nick Spadaccini, School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia; I. David Brown, Brockhouse Institute for Materials Research, McMaster University, Hamilton, Ontario, Canada L8S 4M1; Herbert J. Bernstein, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA; Brian McMahon, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England.

Copyright © 2006 International Union of Crystallography
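The terms data block and looped list can be pictured as a small in-memory model. The following is only a minimal Python sketch under stated assumptions: the class names `DataBlock` and `Loop` and the method `rows` are hypothetical and are not part of any CIF specification or library. It shows how values listed in strict rotation might be assembled into rows in row-major order; a value list whose length is not a multiple of the number of data names cannot form a table and is rejected.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """A looped list: data names plus values listed in strict rotation."""
    names: list[str]
    values: list[str]

    def rows(self) -> list[dict[str, str]]:
        """Assemble the flat value list into rows (row-major order)."""
        width = len(self.names)
        if width == 0 or len(self.values) % width != 0:
            raise ValueError("value count is not a multiple of the name count")
        return [
            dict(zip(self.names, self.values[i:i + width]))
            for i in range(0, len(self.values), width)
        ]


@dataclass
class DataBlock:
    """Highest-level component of a CIF, identified by its block code."""
    code: str
    items: dict[str, str] = field(default_factory=dict)  # data name -> data value
    loops: list[Loop] = field(default_factory=list)


# Illustrative content only, using values of the kind found in a small-molecule CIF.
block = DataBlock(code="foo")
block.items["_cell_length_a"] = "7.4730(11)"
block.loops.append(Loop(
    names=["_atom_site_label", "_atom_site_fract_x"],
    values=["S4", "0.32163(7)", "S11", "0.39642(7)"],
))
```

Every value is kept as a character string here; interpreting a string as a number or as text is a separate step left to the application.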

2.2.3. The syntax of a CIF

The essential syntax rules for a CIF data file are discussed alongside an example (Fig. 2.2.3.1), which is an extract from a file used to exemplify the reporting of a small-molecule crystal structure to Acta Crystallographica Section C. The following discussion is tutorial in nature and is intended to give an overview of the syntactic features of CIF to the general reader. The special use of save frames in dictionary files is not discussed in this summary. Software developers will find the full specification at the end of the chapter. If there are any real or apparent discrepancies between the two treatments, the full specification is to be taken as definitive.

A CIF contains only ASCII characters, organized as lines of text. Tokens (the discrete components of the file) are separated by white space; layout is not significant. Thus, in the list of atom-site coordinates in Fig. 2.2.3.1, the hydrogen-atom entries are cosmetically aligned in columns, but the non-aligned entries for the other atoms are equally valid. Indeed, there is no requirement that each cluster of looped data values be confined to a separate row; contrast the cosmetic ordering of the atom-sites loop with the loop of symmetry-equivalent positions, where entries run on the same or following lines indiscriminately.

A comment is a token introduced by a hash character # and extending to the end of the line. Comments are considered to have no portable information content and may freely be discarded by a parser. However, revision 1.1 of the CIF specification introduces a recommendation that a CIF begin with a comment taking the form

    #\#CIF_1.1

where the 1.1 is a version identifier of the reference CIF specification. This is primarily for the benefit of general file-handling software on current operating systems (e.g. graphical file managers that associate software applications with files of specific type), and its presence or absence does not guarantee the integrity of the file with respect to any particular revision of the CIF specification.

The first non-comment token of a CIF must be a data-block header, which is a character string that does not include white space and begins with the case-insensitive characters data_. The file may be partitioned into multiple data blocks by the insertion of further data-block headers. Data-block headers are case-insensitive (that is, two headers differing only in whether corresponding letter characters are upper or lower case are considered identical). Within a single data file identical data-block headers are not permitted.

    data_99107abs

    # Chemical data
    _chemical_name_systematic
    ;
     3-Benzo[b]thien-2-yl-5,6-dihydro-1,4,2-oxathiazine 4-oxide
    ;
    _chemical_formula_moiety         "C11 H9 N O2 S2"
    _chemical_formula_weight         251.31

    # Crystal data
    _symmetry_cell_setting           orthorhombic
    _symmetry_space_group_name_H-M   'P 21 21 21'

    loop_
    _symmetry_equiv_pos_as_xyz
     'x, y, z' 'x+1/2, -y+1/2, -z'
     '-x, y+1/2, -z+1/2'
     '-x+1/2, -y, z+1/2'

    _cell_length_a                   7.4730(11)
    _cell_length_b                   8.2860(11)
    _cell_length_c                   17.527(2)
    _cell_angle_alpha                90.00
    _cell_angle_beta                 90.00
    _cell_angle_gamma                90.00

    # Atomic coordinates and displacement parameters
    loop_
    _atom_site_label
    _atom_site_type_symbol
    _atom_site_fract_x
    _atom_site_fract_y
    _atom_site_fract_z
    _atom_site_U_iso_or_equiv
    S4 S 0.32163(7) 0.45232(6) 0.52011(3) 0.04532(13)
    S11 S 0.39642(7) 0.67998(6) 0.29598(2) 0.04215(12)
    O1 O -0.00302(17) 0.67538(16) 0.47124(8) 0.0470(3)
    O4 O 0.2601(2) 0.28588(16) 0.50279(10) 0.0700(5)
    N2 N 0.14371(19) 0.66863(19) 0.42309(9) 0.0402(3)
    C3 C 0.2776(2) 0.57587(19) 0.43683(9) 0.0332(3)
    C5 C 0.1497(3) 0.5457(3) 0.57608(11) 0.0498(5)
    C6 C -0.0171(3) 0.5529(2) 0.52899(12) 0.0460(4)
    C12 C 0.4215(2) 0.57488(19) 0.38139(9) 0.0344(3)
    C13 C 0.5830(2) 0.4995(2) 0.38737(10) 0.0386(4)
    C13A C 0.6925(2) 0.5229(2) 0.32123(10) 0.0399(4)
    C14 C 0.8631(3) 0.4608(3) 0.30561(13) 0.0532(5)
    C15 C 0.9423(3) 0.4948(3) 0.23709(15) 0.0644(7)
    C16 C 0.8563(3) 0.5917(3) 0.18349(14) 0.0667(7)
    C17 C 0.6901(3) 0.6568(3) 0.19729(12) 0.0546(5)
    C17A C 0.6090(3) 0.6204(2) 0.26670(10) 0.0396(4)
    H5A   H   0.1284   0.4834   0.6221   0.060
    H5B   H   0.1861   0.6537   0.5908   0.060
    H6A   H  -0.0374   0.4490   0.5050   0.055
    H6B   H  -0.1186   0.5762   0.5617   0.055
    H13   H   0.6182   0.4397   0.4297   0.046
    H14   H   0.9218   0.3972   0.3414   0.064
    H15   H   1.0548   0.4527   0.2262   0.077
    H16   H   0.9127   0.6130   0.1373   0.080
    H17   H   0.6340   0.7227   0.1616   0.066

Fig. 2.2.3.1. Typical small-molecule CIF.
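The rules for the version comment and for data-block headers lend themselves to a simple check. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the input has already been split into unquoted, non-comment tokens by some earlier step (so that a quoted data value beginning with data_ can never reach it).

```python
def has_version_comment(first_line: str) -> bool:
    """True if the line starts with the recommended '#\\#CIF_' version comment."""
    return first_line.startswith("#\\#CIF_")


def block_codes(tokens: list[str]) -> list[str]:
    """Return the block codes in file order.

    `tokens` is assumed to hold only unquoted, non-comment tokens.
    Raises ValueError if the first token is not a data-block header or if
    two headers are identical when compared case-insensitively.
    """
    if not tokens or not tokens[0].lower().startswith("data_"):
        raise ValueError("first non-comment token must be a data-block header")
    seen: set[str] = set()
    codes: list[str] = []
    for token in tokens:
        if token.lower().startswith("data_"):
            code = token[len("data_"):]
            key = code.lower()          # headers are case-insensitive
            if key in seen:
                raise ValueError(f"duplicate data-block header: {token}")
            seen.add(key)
            codes.append(code)
    return codes
```

As the text notes, the presence of the version comment says nothing about the integrity of the file, so a reader would normally treat `has_version_comment` as a hint and not as validation.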

Data names are character strings that begin with an underscore character _ and do not contain white-space characters. Data names serve to index data values and are case-insensitive. Where a data name indexes a single data value, that value follows the data name separated by white space. Where a data name indexes a set of data values (conceptually a vector or table column), the relevant data items are preceded by the case-insensitive string loop_ separated by white space. The examples of Fig. 2.2.3.1 show the use of loop_ to specify a vector or one-dimensional list of values (the symmetry-equivalent positions) and a tabular or matrix list (the atom-site positions).
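Because data names act as case-insensitive indexes to values, a reader might store them under a normalized key. The following sketch assumes a hypothetical helper class `TagIndex`; it is not taken from any existing CIF library, and it holds a single value as a one-element list so that single and looped values can be treated alike.

```python
class TagIndex:
    """Case-insensitive map from data names to lists of values (hypothetical)."""

    def __init__(self) -> None:
        self._values: dict[str, list[str]] = {}

    def add(self, name: str, values: list[str]) -> None:
        # A data name begins with an underscore and contains no white space.
        if not name.startswith("_") or any(ch.isspace() for ch in name):
            raise ValueError(f"not a valid data name: {name!r}")
        self._values[name.lower()] = list(values)

    def get(self, name: str) -> list[str]:
        """Look up the values for a data name, ignoring letter case."""
        return self._values[name.lower()]


index = TagIndex()
index.add("_symmetry_cell_setting", ["orthorhombic"])
index.add("_atom_site_label", ["S4", "S11", "O1"])
```

Only the key is case-folded; the values themselves are stored unchanged.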


2. CONCEPTS AND SPECIFICATIONS

Again it should be emphasized that the rows and columns of the table are identified by parsing each value and referring back to the sequence in the header of its identifying tag, and not by a conventional layout of items. It is usual for CIF writers to lay out two-dimensional data arrays as well formatted tables for ease of inspection in text editors or other viewers, but CIF readers must never rely on layout alone to identify a tabulated value. A corollary of this is that the number of values in a looped list must be an exact integer multiple of the number of data names declared in the loop header.

Within a single data block the same data name may not be repeated. Thus if a data item may have multiple values, these items must be collected together within a looped list, the data name itself being given once only in the loop header.

A data value is a string of characters. CIF distinguishes between numerical and character data in a broad sense (see Section 2.2.5.2 below). Numerical values may not contain white space (and indeed are constrained to a limited character set and ordering, essentially encompassing a small range of ways in which numbers are characteristically represented in printed form). Because tokens are separated by white space, character data that include white-space characters must be quoted. If the data value does not extend beyond the end of a line of text, it may be quoted by matching single-quote (apostrophe, ') or double-quote (quotation mark, ") characters. If the data value does extend beyond the end of a line of text, then paired semicolon characters ; as the first character of a line may be used as delimiters. The closing semicolon is the first character of a line immediately following the close of the data-value string. Fig. 2.2.3.1 shows examples of all three types of delimited character values that include white space.

Character strings that begin with certain other characters must also be quoted. These leading characters are those which introduce tokens with special roles in a STAR File (such as underscore _ at the start of a data name, hash # at the start of a comment and dollar $ identifying a save-frame reference pointer). Likewise, the STAR File reserved words loop_, stop_ and global_ must be quoted if they represent data values, as must any character string beginning with data_ or save_.

Lines of text are restricted to 2048 characters in length and data names are restricted to 75 characters in length. These are increases over the original values of 80 and 32 characters, respectively.
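The quoting rules above can be summarized as a decision made by a CIF writer for each character value. The sketch below is a simplified illustration, not the definitive rule set of Section 2.2.7: the function name `format_value` is hypothetical, and the treatment of values that begin with a quote character or semicolon, or that contain both kinds of quote, is a conservative assumption added here.

```python
RESERVED_WORDS = {"loop_", "stop_", "global_"}
RESERVED_PREFIXES = ("data_", "save_")
# '_', '#' and '$' are named in the text; quote marks and ';' are added
# here as a cautious assumption.
SPECIAL_LEADING = ("_", "#", "$", "'", '"', ";")


def format_value(value: str) -> str:
    """Return `value` as it might be written in a CIF (sketch only)."""
    if "\n" in value or ("'" in value and '"' in value):
        # Multi-line (or awkwardly quoted) text goes into a semicolon field.
        if "\n;" in value:
            raise ValueError("a line starting with ';' would end the field early")
        return f"\n;{value}\n;"
    lowered = value.lower()
    needs_quotes = (
        value == ""
        or any(ch.isspace() for ch in value)
        or value.startswith(SPECIAL_LEADING)
        or lowered in RESERVED_WORDS
        or lowered.startswith(RESERVED_PREFIXES)
    )
    if not needs_quotes:
        return value
    # Pick whichever quote character does not occur in the value.
    quote = '"' if "'" in value else "'"
    return f"{quote}{value}{quote}"
```

For example, `format_value("P 21 21 21")` gives a single-quoted string, while a bare word such as `orthorhombic` is returned unchanged. The leading newline of the semicolon field ensures that the opening semicolon is the first character of a line.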

2.2.4. Portability and archival issues

The CIF format is designed to be independent of operating system (OS) and programming language. Nevertheless, variations in the way that each OS specifies and handles character sets mean that care must be taken to ensure that CIF software is portable across different computer platforms. There are also constraints on the application of these specifications in order to maintain compatibility between archival systems. These issues are discussed briefly here. More details are given in the formal CIF specification (see Section 2.2.7). In general, compatibility and portability considerations for different OSs are of little importance to users of CIFs, but they need to be well understood by software developers.

2.2.4.1. Character set

The characters permitted in a CIF are in effect the printable characters in the ASCII character set. However, a CIF may also be constructed and manipulated using alternative single-character byte mappings such as EBCDIC, and multi-byte or wide character encodings such as Unicode, provided there is a direct mapping to the permitted ASCII characters. Accented characters, characters in non-Latin alphabets and mathematical or special typographic symbols may not appear as single characters in a CIF, even if a host OS permits such representations.

White space (used to separate CIF tokens and within comments or quoted character-string values) is most portably represented by the printable space character (decimal value 32 in the ASCII character set). In an ASCII environment, white space may also be indicated by the control characters denoted HT (horizontal tab, ASCII decimal 9), LF (line feed, ASCII decimal 10) and CR (carriage return, ASCII decimal 13). To ease problems of translation between character encodings, the characters VT (vertical tab, ASCII decimal 11) and FF (form feed, ASCII decimal 12) are explicitly excluded from the CIF character set; this is a restriction that is not in the general STAR File specification (Chapter 2.1).

2.2.4.2. Line terminators

Given that the STAR File is built on the premise of a line-oriented text file, it is difficult in practice to provide a complete and portable description of how to identify the start or end of a line of text.

The difficulty arises for two reasons. First, some OSs or programming languages are record-oriented; that is, the OS is able to keep track of a region of memory associated with a specific record. It is usually appropriate to associate each such record with a line of text, but the ‘boundaries’ between records are managed by low-level OS utilities and are not amenable to a character-oriented discussion. Such systems, where records of fixed length are maintained, may also give rise to ambiguities in the interpretation of padding to the record boundary following a last printable character – is such padding to be discarded or treated as white space? It is for this reason that the elision of trailing white space in a line is permitted (but not encouraged) in the full CIF syntax specification [Section 2.2.7.1.4(17)].

The second complication arises because current popular OSs support several different character-based line terminators. Historically, applications developed under a specific OS have made general use of system libraries to handle text files, so that the conventions built into the system libraries have in effect become standard representations of line terminators for all applications built on that OS. For a long time this created no great problem, since files were transferred between OSs through applications software that could be tuned to perform the necessary line-terminator translations in transit. The best-known such application is undoubtedly the ‘text’ or ‘ascii’ transmission mode of the typical ftp (file transfer protocol) client. Increasingly, however, a common network-mounted file system may be shared between applications running under different OSs and the same file may present itself as ‘valid’ to one user but not to another because of differences in what the applications consider a line terminator.

The problem of handling OS-dependent line termination is by no means unique to STAR File or CIF applications; any application that manipulates line-oriented text files must accommodate this difficulty. The specification notes the practice of designing applications that treat equivalently as a line terminator the characters LF (line feed or newline), CR (carriage return) or the combination CR followed by LF, since these are the dominant conventions under the prevailing Unix, MacOS and DOS/Windows OSs of the present day. While this will be a sensible design decision for many CIF-reading applications, software authors must be aware that the CIF specification aims for portability and archivability through a more general understanding of what constitutes a line of text.
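For a reader working in an ASCII-compatible environment, the character-set and line-terminator conventions of this section, together with the 2048-character line limit of revision 1.1, might be checked along the following lines. This is a hedged sketch: the function names are hypothetical, the input is assumed to be an already decoded Python string, and record-oriented systems (where line boundaries are not characters at all) are outside its scope.

```python
import re

# Printable ASCII (decimal 32-126) plus HT (9), LF (10) and CR (13).
# VT (11) and FF (12) are deliberately absent from the set.
ALLOWED = set(range(32, 127)) | {9, 10, 13}
MAX_LINE = 2048  # line-length limit of the version 1.1 specification


def illegal_characters(text: str) -> list[tuple[int, str]]:
    """Return (offset, character) pairs for characters outside the CIF set."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) not in ALLOWED]


def split_lines(text: str) -> list[str]:
    """Split on CR LF, LF or CR, treating all three as equivalent terminators."""
    lines = re.split(r"\r\n|\n|\r", text)
    if lines and lines[-1] == "":
        lines.pop()  # a final terminator does not start a new line
    return lines


def overlong_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines longer than MAX_LINE characters."""
    return [n for n, line in enumerate(split_lines(text), start=1)
            if len(line) > MAX_LINE]
```

Treating the three terminators alike is the common practice that the specification notes, not a complete definition of a line of text, so a portable application may need a more general strategy.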


2.2.4.3. Line lengths

The STAR File does not restrict the lengths of text lines. The original CIF specification introduced an 80-character limit to facilitate programming in Fortran and transmission of CIFs by email. Contemporary practices have enabled the line-length limit in the version 1.1 specification to be extended to 2048 characters. While the new limit in effect mandates the use of a 2048-character input buffer in compliant CIF-reading software, there is no obligation on CIF writers to generate output lines of this length.

2.2.4.4. Lengths of data names and block codes

CIF imposes another length restriction that is not integral to the STAR File syntax: data names, data-block codes and frame codes may not exceed 75 characters in length. This is an increase over the 32-character limit of the original specification, although this extension had already been approved by COMCIFS to coincide with the release of version 2 of the core dictionary (see Chapter 3.2). There is no fundamental technical reason underlying this restriction; it permits assignment of a fixed-length buffer for recording such data names within a software application, but perhaps more significantly, it encourages a measure of conciseness in the creation of data names based on hierarchical component terms.

2.2.4.5. Case sensitivity

Following the general STAR File approach, the special tokens loop_, save_, the reserved word global_, data-block codes and save-frame codes, and data names are all case-insensitive. The case of any characters within data values must, however, be respected.

2.2.5. Common semantic features

As mentioned in Section 2.2.1, the STAR File structure allows retrieval of indexed data values without prior knowledge of the location of a data item within the file. Consequently there is no significance to the order of data items within a data block. However, molecular-structure applications will typically need to do more than retrieve an arbitrary string value. There is a need to identify the nature of individual data items (achieved portably through the definitions of standard data names in dictionary files), but also to be able to process the extracted data according to whether it is numerical or textual in nature, and possibly also to parse and extract more granular information from the entire data field that has been retrieved.

Different CIF applications have a measure of freedom to define many of the details of the content of data fields and the ways in which they may be processed – in effect, to define their semantic content. However, there are a number of conventions that are common to all CIF applications and this should be recognized in software applicable to a range of dictionaries. These are discussed in some detail in the formal specification document in Section 2.2.7.4; in this section some introductory comments and additional explanations are given.

2.2.5.1. Data-name semantics

It is a fundamental principle of the STAR File approach that a data name is simply an arbitrary string acting as an index to a required value or set of values. It is equally legitimate to store the value of a crystal cell volume either as

    _bdouiGFG78=z    1085.3(3)

or as

    _cell_volume    1085.3(3)

provided that the users of the file have some way of discovering that the cell volume is indeed indexed by the tag _bdouiGFG78=z or _cell_volume, as appropriate. However, it is conventional in CIF applications to define (in public dictionary files) data names that imply by their construction the meaning of the data that they index. Chapter 3.1 discusses the principles that are recommended for constructing data names and defining them in public dictionaries, and for utilizing private data names that will not conflict with those in the public domain.

Careful construction of data names according to the principles of Chapter 3.1 results in a text file that is intelligible to a scientist browsing it in a text editor without access to the associated dictionary definition files. In many ways this is useful; it allows the CIF to be viewed and understood without specialized software tools, and it safeguards some understanding of the content if the associated dictionaries cannot be found. On the other hand, there is a danger that well intentioned users may gratuitously invent data names that are similar to those in public use. It is therefore important for determining the correct semantic content of the values tagged by individual data names to make maximum possible disciplined use of the registry of public dictionaries, the registry of private data-name prefixes, and the facilities for constructing and disseminating private dictionaries discussed in Chapter 3.1.

2.2.5.2. Data typing

In the STAR File grammar, all data values are represented as character strings. CIF applications may define data types, and in the macromolecular (mmCIF) dictionary (see Chapter 3.6) a range of types has been assigned corresponding to certain contemporary computer data-storage practices (e.g. single characters, case-insensitive single characters, integers, floating-point numbers and even dates). This dynamic type assignment is supported by the relational dictionary definition language (DDL2; see Chapter 2.6) used for the mmCIF dictionary and is not available for all CIF applications. However, a more restricted set of four primary or base data types is common to all CIF applications.

The type numb encompasses all data values that are interpretable as numeric values. It includes without distinction integers and non-integer reals, and the values may be expressed if desired in scientific notation. At this revision of the specification it does not include imaginary numbers. All numeric representations are understood to be in the number base 10. It is, however, a complex type in that the standard uncertainty in a measured physical value may be carried along as part of the value. This is denoted by a trailing integer in parentheses, representing the integer multiple of the uncertainty in the last place of decimals in the numeric representation. That is, a value of ‘1085.3(3)’ corresponds to a measurement of 1085.3 with a standard uncertainty of 0.3. Likewise, the value 34.5(12) indicates a standard uncertainty of 1.2 in the measured value.

Care should be taken in the placement of the parentheses when a number is expressed in scientific notation. The second example above may also be presented as 3.45E1(12); that is, the standard uncertainty is applied to the mantissa and not the exponent of the value. Note that existing DDL2 applications itemize standard uncertainties as separate data items. Nevertheless, since the DDL2 dictionary includes the attribute _item_type_conditions.code with an allowed value of ‘esd’, future conformant DDL2 parsers might be expected to handle the parenthesized standard uncertainty representation.
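The parenthesized standard-uncertainty notation of the numb type can be decoded mechanically. The following Python sketch is one possible reading of the convention, offered under stated assumptions: the function name `parse_numb` and the regular expression are hypothetical, and the pattern follows the placement used in the example 3.45E1(12), where the parentheses come after the exponent but the uncertainty scales the last decimal place of the mantissa. `Decimal` is used so that the number of decimal places is preserved exactly.

```python
import re
from decimal import Decimal
from typing import Optional

NUMB = re.compile(
    r"""^(?P<mantissa>[+-]?(?:\d+\.?\d*|\.\d+))   # digits with optional point
        (?:[eE](?P<exp>[+-]?\d+))?                # optional exponent
        (?:\((?P<su>\d+)\))?$                     # optional s.u. in parentheses
    """,
    re.VERBOSE,
)


def parse_numb(text: str) -> tuple[Decimal, Optional[Decimal]]:
    """Return (value, standard uncertainty or None) for a CIF numeric string."""
    match = NUMB.match(text)
    if match is None:
        raise ValueError(f"not interpretable as a number: {text!r}")
    mantissa = Decimal(match["mantissa"])
    scale = Decimal(10) ** int(match["exp"] or 0)
    value = mantissa * scale
    if match["su"] is None:
        return value, None
    # Decimal places in the mantissa, e.g. 1 for '34.5' and 2 for '3.45'.
    places = -mantissa.as_tuple().exponent
    su = Decimal(match["su"]) * Decimal(10) ** (-places) * scale
    return value, su
```

With this reading, `parse_numb("34.5(12)")` and `parse_numb("3.45E1(12)")` both describe a value of 34.5 with a standard uncertainty of 1.2, in line with the examples in the text.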

2. CONCEPTS AND SPECIFICATIONS The preferred behaviour of a CIF application is to determine the type of a data value by looking up the corresponding dictionary definition. However, some CIF-reading software may not be designed with the ability to parse dictionaries; and indeed any CIF reader may encounter data names that are not defined in a public or accompanying dictionary. It is therefore appropriate to adopt a strategy of interpreting as a number any data value that looks like one, i.e. adopts any of the permitted ways to represent a numeric value. Therefore, in the absence of a specific counterindication (from a dictionary definition), the data value in the following example may be taken as the numeric (integer) value 1: _unknown_data_name

presence of a ‘.’ value for the item _geom_bond_site_symmetry_1 is predetermined as the default value 1 555 (as per the dictionary definition). Note that, in this instance, it is also equivalent to ‘no additional symmetry’ or ‘inapplicable’.

2.2.5.3. Extended data typing: content type and encoding The initial implementation of CIF assumed that most character strings would represent identifiers or terse descriptions or comments, and that the correct behaviour of the majority of CIF applications would be simply to store these in computer memory or retrieve them verbatim. Only a few data values were foreseen as having extended content that might need special handling. For example, the complete text of a manuscript was envisaged as being included in the field _publ_manuscript_processed. The handling of this field (its extraction and typesetting) would be left to unspecified external agents, although some clue as to the provenance of the contents of that field (and thus their appropriate handling) would be given by _publ_manuscript_creation. However, the evolution of CIF applications has required that some element of typographic markup be permitted in a growing number of data values, and future applications may be envisaged in which graphical images, virtual-reality models, spreadsheet tables or other complex objects are embedded as the values of specific data items. Since it will not be possible to write general-purpose CIF applications capable of handling all such embedded content, techniques will need to be developed for transferring each such field to a specialized but separate content handler. In the meantime, the rather ad hoc conventions for introducing typographic markup available at present are described in Sections 2.2.7.4.13– 17. It is hoped that in the future different types of such markup may be permitted so long as the data values affected can be tagged with an indication of their content type that allows the appropriate content handlers to be invoked. It has also been necessary to allow native binary objects to be incorporated as CIF data values. This was done to support the storage of the large arrays of image data obtained from area detectors. 
Since the CIF character set is based on printable ASCII characters only, encodings including compression have been developed to permit interconversion between ASCII and binary representations of such data (see Chapter 2.3). Nowadays, arbitrary embedded objects may be transported in web pages via the http protocol (Fielding et al., 1999) or as attachments to email messages structured according to the MIME protocols (e.g. Freed & Borenstein, 1996). Identification of encoding techniques and hooks to invoke suitable handlers are carried in the relevant Content-Type and Content-Encoding http or MIME headers. It is suggested that this may form the basis of suitable tagging of content types and encoding for future CIF development. A candidate for a CIF-specific encoding protocol is the special convention introduced with CIF version 1.1 to interconvert long lines of text between the new and old length limits (Section 2.2.7.4.11). This is an encoding in the sense that it is a device designed to retain any semantic content implicit in textual layout, while conforming to slightly different rules of syntax. It is designed to enable CIFs written to the longer line-length specification to be transformed so that they can still be handled by older software. Since the object of the exercise is to manage legacy applications, it is likely that the interconversion will be done through external applications, or filters, designed specifically for the purpose. Such a conversion filter is conceptually the same as a filter to convert a binary file into an ASCII base-64 encoding, for example.

1

On the other hand, if _unknown_data_name were explicitly defined in a dictionary with a data type of ‘char’, then the value should be stored as the literal character 1. This is a subtle point, perhaps of interest only to software authors. Nevertheless, the consistent behaviour of CIF applications will depend on correct implementation of this behaviour.

The data type char covers single characters or extended character strings. Since CIF tokens are separated by white space, any character string that includes white-space characters (including line-terminating characters) must be delimited by one or other of a set of special characters used for this purpose. The detailed rules for quoting such strings are given in Section 2.2.7.1.4 and comprise the standard CIF syntax rules for this case. No semantic distinction is made in general between short character strings and text strings that extend over several lines, described in the specification document as ‘text fields’, although again particular CIF applications may choose to impose distinctions. Note that numbers within a quoted string or a text block (bounded by semicolons in column 1) are not interpreted as type ‘numb’ but as type ‘char’.

The data type uchar was introduced explicitly at revision 1.1 of the CIF specification, and is intended to formalize the description and automated handling of certain strings in CIFs that are case-insensitive (such as data names and data-block headers).

The data type null is a special type that has two uses. It is applied to items for which no definite value may be stored in computer memory. As such it is a formal device for allowing the introduction of data names into dictionary files that do not represent data values permissible within a data file instance. The usual example is that of the special data names introduced in DDL1 dictionaries (such as the core dictionary) to discuss categories.
The more important use of the null data type is its application to the meta characters ‘?’ (query) and ‘.’ (full point) that may occur as values associated with any data name and therefore have no specific type. (Arguably, for this case ‘any’ might be a better type descriptor than ‘null’.) The substitution of the query character ‘?’ in place of a data value is an explicit signal that an expected value is missing from a CIF. This ‘missing-value signal’ may be used instead of omitting an item (i.e. its tag and value) entirely from the file, and serves as a reminder that the item would normally be present. The substitution of the full-point character ‘.’ in place of a CIF data value serves two similar, but not identical, purposes. If it is used in looped lists of data it is normally a signal that a value in a particular packet (i.e. a value in the row of the table) is ‘inapplicable’ or ‘inappropriate’. In some CIF applications involving access to a data dictionary it is used to signal that the default value of the item is defined in its definition in the dictionary. Consequently, the interpretation of this signal is an application-specific matter and its use must be determined according to the application. For example, in a CIF submitted for publication in Acta Crystallographica the


2.2. SPECIFICATION OF THE CRYSTALLOGRAPHIC INFORMATION FILE (CIF) 2.2.7.1.2. Definition of terms (2) The following terms are used in the CIF specification documents with the specific meanings indicated here. (2.1) A CIF is a file conforming to the specification herein stated, containing either information on a crystallographic experiment or its results (or similar scientific content), or descriptions of the data identifiers in such a file. (2.2) A data file is understood to convey information relating to a crystallographic experiment. (2.3) A dictionary file is understood to contain information about the data items in one or more data files as identified by their data names. (2.4) A data name is a case-insensitive identifier (a string of characters beginning with an underscore character) of the content of an associated data value. (2.5) A data value is a string of characters representing a particular item of information. It may represent a single numerical value; a letter, word or phrase; extended discursive text; or in principle any coherent unit of data such as an image, audio clip or virtual-reality object. (2.6) A data item is a specific piece of information defined by a data name and an associated data value. (2.7) A tag is understood in this document to be a synonym for data name. (2.8) A data block is the highest-level component of a CIF, containing data items or save frames. A data block is identified by a data-block header, which is an isolated character string (that is, bounded by white space and not forming part of a data value) beginning with the case-insensitive reserved characters data_. (2.9) A block code is the variable part of a data-block header, e.g. the string foo in the header data_foo. 
(2.10) A save frame is a partitioned collection of data items within a data block, started by a save-frame header, which is an isolated character string beginning with the case-insensitive reserved characters save_, and terminated with an isolated character string containing only the case-insensitive reserved characters save_. (2.11) A frame code is the variable part of a save-frame header, e.g. the string foo in the header save_foo.

2.2.6. CIF metadata and dictionary compliance The development of several CIF dictionaries and fields of application has rapidly progressed beyond the specific purpose of describing a small-molecule or inorganic crystal structure for which CIF was devised. With these, a variety of application-specific metadata approaches have evolved to characterize the role of a particular CIF within a family of possible applications. These approaches use data definitions in dictionaries in which enumerated codes identify the file relationships. The mmCIF dictionary (see Chapter 3.6) allows informal identification of ‘external reference files’ which act as libraries of standard molecular geometry. The pdCIF dictionary (see Chapter 3.3) specifies identifiers that may be included within data blocks of external files containing calibration results. It is the responsibility of the file users to manage a lookup table or database between the referenced identifiers and the location of the files to which they pertain. Two categories of data items currently exist in the core dictionary to allow a file to indicate its relationship to CIF dictionaries and other data files. (Equivalent categories are also present in the mmCIF dictionary.) AUDIT_CONFORM is a category of data names identifying the dictionaries that hold definitions of the data names in the current CIF. Particularly where the referenced dictionaries include any of the various public dictionaries described in Part 3 of this volume, this serves to establish the discipline within the broad fields of crystallography, structural biology and structural chemistry to which the data are most relevant. The category AUDIT_LINK allows an informal textual description of the relationship between the data blocks within the current file. It is ‘informal’ in the sense that the relevant data items are free-text in nature. 
It would surely be useful to have a catalogue of more specific designations to allow automated software to track such relationships as the separate reference and modulated structures in an incommensurate compound, or the multiple trial refinements of a protein structure. The challenge is to determine and classify such standard relationships between data blocks. In the future it is hoped that a common approach to metadata will be developed to enable all CIF instantiations to be uniquely identified and interrelated. Development of standard descriptions of the relationships between structural entities of this sort (reference geometries, calibration results, partial refinements, modulated superposed structures etc.) will be an important stage in the formalization of complete CIF metadata, and will become an important step towards categorization of data entities needed for interoperability between different file formats and across a wide range of scientific disciplines.

2.2.7.1.3. File syntax (3) The syntax of CIF is a proper subset of the syntax of STAR Files as described by Hall (1991) and Hall & Spadaccini (1994). The general structure is described below in Section 2.2.7.1.4 and a number of subsections list specific restrictions to the STAR syntax that are in force within CIF. A formal language grammar using computer-science notation is included as Section 2.2.7.2.

2.2.7. Formal specification of the Crystallographic Information File

2.2.7.1.4. General features (4) A CIF consists of data names (tags) and associated values organized into data blocks. A data block may contain data items (associated data names and data values) and/or it may contain save frames. (5) Save frames may only be used in dictionary files. Implementation note: At a purely syntactic level there is no way to distinguish between dictionary and data files. (It is also to be noted that not all dictionary files contain save frames.) A fully validating parser must therefore be able to detect the start and termination of save frames, the uniqueness of the frame code within a data block and the uniqueness of data names within a frame code. It is, however, legitimate for an application-based parser designed to handle only the contents of data files to consider the presence of a save frame as an error.

Version 1.1 specification B Y S. R. H ALL , N. S PADACCINI , I. D. B ROWN , H. J. B ERNSTEIN , J. D. W ESTBROOK AND B. M C M AHON This section presents the documents File syntax (Sections 2.2.7.1– 3) and Common semantic features (Section 2.2.7.4) that together comprise the formal CIF specification as approved by COMCIFS. 2.2.7.1. Syntax 2.2.7.1.1. Introduction (1) This document describes the full syntax of the Crystallographic Information File (CIF).


2. CONCEPTS AND SPECIFICATIONS

(6) A data block begins with the reserved case-insensitive string data_ followed immediately by the name of the data block, forming a data-block header. A save frame has a similar structure to a data block, but may not itself contain further save frames. A save frame begins with the reserved case-insensitive string save_ followed immediately by the name of the save frame, forming a save-frame header. Unlike a data block, a save frame also has a marker for the end of the frame in the form of a repetition of the reserved case-insensitive word save_, this time without the name of the frame. Save frames may not nest. Within a single CIF, no two data blocks may have the same name; within a single data block no two save frames may have the same name, although a save frame may have the same name as a data block in the same CIF.

(7) A given data name (tag) [see (2.4) and (2.7)] may appear no more than once in a given data block or save frame. A tag may be followed by a single value, or a list of one or more tags may be marked by the preceding reserved case-insensitive word loop_ as the headings of the columns of a table of values. White space is used to separate a data-block or save-frame header from the contents of the data block or save frame, and to separate tags, values and the reserved word loop_. Data items (tags along with their associated values) that are not presented in a table of values may be relocated along with their values within the same data block or save frame without changing the meaning of the data block or save frame. Complete tables of values (the table column headings along with all columns of data) may be relocated within the same data block or save frame without changing the meaning of the data block or save frame. Within a table of values, each tag may be relocated along with its associated column of values within the same table of values without changing the meaning of the table of values. In general, each row of a table of values may also be relocated within the same table of values without changing the meaning of the table of values. Combining tables of values or breaking up tables of values would change the meanings, and is likely to violate the rules for constructing such tables of values.

(8) The case-insensitive word global_, used in STAR Files to introduce a group of data values with a scope extending to the end of the file, is an additional reserved word in CIF (that is, it may not be used as the unquoted value of any data item).

(9) If a data value (2.5) contains white space or begins with a character string reserved for a special purpose, it must be delimited by one of several sets of special character strings (the choice of which is constrained if the data value contains characters interpretable as marking a new line of text according to the discussion in the following paragraphs). Such a data value will be indicated by the term non-simple data value.

(10) A simple data value (i.e. one which does not contain white space or begin with a special character string) may optionally be delimited by any of the same set of delimiting character strings, except for data values that are to be interpreted as numbers.

(11) The special character strings in this context are listed in the following table. The term ‘non-simple data values’ in this table refers to data values beginning with these special character strings.

Character or string                        Role

_                                          identifies data name
#                                          identifies comment
$                                          identifies save-frame pointer
'                                          delimits non-simple data values
"                                          delimits non-simple data values
[                                          reserved opening delimiter for non-simple data values [see (19)]
]                                          reserved closing delimiter for non-simple data values [see (19)]
; (at the beginning of a line of text)     delimits non-simple data values
data_                                      identifies data-block header
save_                                      identifies save-frame header or terminator

In addition, the following case-insensitive reserved words may not occur as unquoted data values.

Reserved word                              Role

loop_                                      identifies looped list of data
stop_                                      reserved STAR word terminating nested loops or loop headers
global_                                    reserved as a STAR global-block header
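The rules in paragraphs (9)–(11) and the two tables above can be sketched from the point of view of software that writes character values. This is an illustrative sketch only: the function names is_simple and delimit are hypothetical, the treatment of ? and . as needing delimiters is an assumption based on their use as null signals, and the caller is assumed to have decided already that the value is of type char rather than a number.

```python
import re

# Characters and strings that may not begin an unquoted value (tables above).
SPECIAL_START = ("_", "#", "$", "'", '"', "[", "]", ";")
RESERVED_PREFIXES = ("data_", "save_")
RESERVED_WORDS = ("loop_", "stop_", "global_")


def is_simple(value):
    """Return True if a character value may be written without delimiters."""
    if value == "" or any(ch in value for ch in " \t\r\n"):
        return False                      # empty, or contains white space
    if value in ("?", "."):
        return False                      # assumption: would read back as a null signal
    if value.startswith(SPECIAL_START):
        return False
    lowered = value.lower()               # reserved words are case-insensitive
    return not (lowered.startswith(RESERVED_PREFIXES) or lowered in RESERVED_WORDS)


def delimit(value):
    """Return value as it could be written in a CIF (hypothetical helper)."""
    if is_simple(value):
        return value
    if "\n" not in value and "\r" not in value:
        for q in ("'", '"'):
            # A delimiter character inside the string is acceptable
            # unless it is followed by white space.
            if re.search(re.escape(q) + r"[ \t]", value) is None:
                return q + value + q
    lines = re.split(r"\r\n|\r|\n", value)
    if any(line.startswith(";") for line in lines[1:]):
        raise ValueError("a text field may not contain a line starting with ';'")
    # The opening semicolon must itself be placed at the start of a line.
    return ";" + "\n".join(lines) + "\n;"
```

For instance, delimit("C12") leaves the value unquoted, while delimit("a dog's life") wraps it in single quotes because the embedded apostrophe is not followed by white space.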

(12) The complete syntactic description of a numeric data value is included in Section 2.2.7.3(57) under the production (i.e. rule for constructing a part of the language) <Numeric>.

(13) Comment: The base CIF specification distinguishes between character and numeric values [see Section 2.2.7.4(15)]. Particular CIF applications may make more finely grained distinctions within these types. The paragraphs immediately above have the corollary that a data value such as 12 that appears within a CIF may be quoted (e.g. '12') if and only if it is to be interpreted and stored in computer memory as a character string and not a numeric value. For example '12' might legitimately appear as a label for an atomic site, where another alphabetic or alphanumeric string such as 'C12' is also acceptable; but it may not legitimately be used to represent an integer quantity twelve.

(14) Matching single- or double-quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.

(15) Comment: Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item

_example 'a dog's life'

is legal; the data value is a dog's life.

(16) Comment: Note that constructs such as

'an embedded \' quote'

do not behave as in the case of many current programming languages, i.e. the backslash character in this context does not escape the special meaning of the delimiter character. A backslash preceding the apostrophe or double-quote characters does, however, have special meaning in the context of accented characters (Section 2.2.7.4.15) provided there is no white space immediately following the apostrophe or double-quote character.

(17) The special sequence of end of line followed immediately by a semicolon in column one (denoted ‘<eol>;’) may also be used as a delimiter at the beginning and end of a character string comprising a data value. The complete bounded string is called a text field and may be used to convey multi-line values. The end of line associated with the closing semicolon does not form part of the data value. Within a multi-line text field, leading white space within text lines must be retained as part of the data value; trailing white space on a line may however be elided.

(18) Comment: A text field delimited by the <eol>; digraph may not include a semicolon at the start of a line of text as part of its value.
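The behaviour described in paragraphs (14)–(16) can be illustrated with a short scanner for quote-delimited strings. This is a minimal sketch, not a complete CIF tokenizer: the function name read_quoted is hypothetical, the input is assumed to be a single line with its line terminator already removed, and only space and tab are treated as white space within the line.

```python
from typing import Tuple


def read_quoted(line: str, start: int = 0) -> Tuple[str, int]:
    """Read a quote-delimited value beginning at line[start].

    Returns the value and the index just past the closing delimiter.
    The closing delimiter is the first matching quote character that is
    followed by white space or by the end of the line; a backslash does
    not escape the delimiter.
    """
    quote = line[start]
    if quote not in ("'", '"'):
        raise ValueError("not positioned at a quote character")
    i = start + 1
    while i < len(line):
        at_end = i + 1 == len(line)
        if line[i] == quote and (at_end or line[i + 1] in " \t"):
            return line[start + 1:i], i + 1
        i += 1
    # Quoted strings may not extend over more than one line.
    raise ValueError("unterminated quoted string")


value, nxt = read_quoted("'a dog's life'")
print(value)   # a dog's life
```

Applied to the construct in paragraph (16), the same function stops at the apostrophe after the backslash, because that apostrophe is followed by a space; the remaining characters would then be read as further tokens.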


(19) Matching square-bracket characters, ‘[’ and ‘]’, are reserved for possible future introduction as delimiters of multi-line data values. At this revision of the CIF specification, a data value may not begin with an unquoted left square-bracket character ‘[’. (While not strictly necessary, the right square-bracket character ‘]’ is restricted in the same way in recognition of its reserved use as a closing delimiter.)

(20) Comment: For example, the data value foo may be expressed equivalently as an unquoted string foo, as a quoted string 'foo' or as a text field

;foo
;

By contrast, the value of the text field

; foo
  bar
;

is

 foo<eol>  bar

(where <eol> represents an end of line); the embedded space characters are significant.

(21) A comment in a CIF begins with an unquoted character ‘#’ and extends to the end of the current line.

2.2.7.1.5. Character set

(22) Characters within a CIF are restricted to certain printable or white-space characters. Specifically, these are the ones located in the ASCII character set at decimal positions 09 (HT or horizontal tab), 10 (LF or line feed), 13 (CR or carriage return) and the letters, numerals and punctuation marks at positions 32–126. Comment: The ASCII characters at decimal positions 11 (VT or vertical tab) and 12 (FF or form feed), often included in library implementations as white-space characters, are explicitly excluded from the CIF character set at this revision.

(23) Comment: The reference to the ASCII character set is specifically to identify characters in an established and widely available standard. It is understood that CIFs may be constructed and maintained on computer platforms that implement other character-set encodings. However, for maximum portability only the characters identified in the section above may be used. Other printable characters, even if available in an accessible character set such as Unicode, must be indicated by some encoding mechanism using only the permitted characters. At this revision, only the encoding convention detailed in Section 2.2.7.4(30)–(37) is recognized for this purpose.

2.2.7.1.6. White space

(24) Any of the white-space characters listed in paragraph (22) (i.e. HT, LF, CR) and the visible space character SP (position number 32 in the ASCII encoding) may be used interchangeably to separate tokens, with the exception that the semicolon characters delimiting multi-line text fields must be preceded by the white-space character or characters understood as indicating an end of line (see next paragraph).

2.2.7.1.7. End-of-line conventions

(25) The way in which a line is terminated is operating-system dependent. The STAR File specification does not address different operating-system conventions for encoding the end of a line of text in a text file. For a file generated and read in the same machine environment, this is rarely a problem, but increasingly applications on a network host may access files on different hosts through protocols designed to present a unified view of a file system. In practice, for current common operating systems many applications may regard the ASCII characters LF or CR or the sequence CR LF as signalling an end of line, inasmuch as these represent the end-of-line conventions supported under the common operating systems Unix, MacOS or DOS/Windows. On platforms with record-oriented operating systems, applications must understand and implement the appropriate end-of-line convention. Care must be taken when transferring such files to other operating systems to insert the appropriate end-of-line characters for the target operating system. A more complete discussion is given in (42) below.

2.2.7.1.8. Case sensitivity

(26) Data names, block and frame codes, and reserved words are case-insensitive. The case of any characters within data values must be respected.

2.2.7.1.9. Implementation restrictions

(27) Certain allowed features of STAR File syntax have been expressly excluded or restricted from the CIF implementation.

2.2.7.1.9.1. Maximum line length and character set

(28) Lines of text may not exceed 2048 characters in length. This count excludes the character or characters used by the operating system to mark the line termination. The ASCII characters decimal 11 (VT) and 12 (FF) are excluded from the allowed character set [see paragraph (22)].

2.2.7.1.9.2. Maximum data-name, block-code and frame-code lengths

(29) Data names may not exceed 75 characters in length.

(30) Data-block codes and save-frame codes may not exceed 75 characters in length (and therefore data-block headers and save-frame headers may not exceed 80 characters in length).

2.2.7.1.9.3. Single-level loop constructs

(31) Only a single level of looping is permitted.

2.2.7.1.9.4. Non-expansion of save-frame references

(32) Save frames are permitted in CIFs, but expressly for the purpose of encapsulating data-name definitions within data dictionaries. No reference to these save frames is envisaged, and the save-frame reference code permitted in STAR is not used. This means that unquoted character strings commencing with the $ character may not be interpreted as save-frame codes in CIF. Use of such unquoted character strings is reserved to guard against subsequent relaxation of this constraint.

2.2.7.1.9.5. Exclusion of global blocks

(33) In the full STAR specification, blocks of data headed by the special case-insensitive word global_ are permitted before normal data blocks. They contain data names and associated values which are inherited in subsequent data blocks; the scope of a value extends from its point of declaration in a global block to the end of the file. Because rearrangements of the order of data blocks and concatenation of data blocks from different files are commonplace operations in many CIF applications, and because of the difficulty in properly tracking and implementing values implied by global blocks, use of the global_ feature of STAR is expressly forbidden at this revision. To guard against its future introduction, the special case-insensitive word global_ remains reserved in CIF.
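The character-set restriction of paragraph (22), the end-of-line conventions of paragraph (25) and the line-length limit of paragraph (28) lend themselves to a simple pre-parse check. The following is a rough sketch under stated assumptions: the names split_lines and check_text are hypothetical, the text is assumed to have been decoded from ASCII already, and the limits on data-name and block-code lengths are not checked because that requires tokenizing the file.

```python
import re

MAX_LINE_LENGTH = 2048
# HT (09) plus the printable characters at positions 32-126;
# LF and CR are consumed as line terminators by split_lines().
ALLOWED_LINE = re.compile(r"[\t\x20-\x7e]*")


def split_lines(text):
    """Split on CR LF, CR or LF, the conventions named in paragraph (25)."""
    return re.split(r"\r\n|\r|\n", text)


def check_text(text):
    """Return a list of (line_number, message) pairs for rule violations."""
    problems = []
    for number, line in enumerate(split_lines(text), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append((number, "line exceeds 2048 characters"))
        if ALLOWED_LINE.fullmatch(line) is None:
            problems.append((number, "character outside the CIF character set"))
    return problems


print(check_text("data_x\r\n_tag 1\r\n"))   # []
```

The line terminator is excluded from the length count simply because split_lines() removes it before the length is measured.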


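Table 2.2.7.1. A formal grammar for CIF

Each row gives the syntactic unit, its syntax, and (in square brackets) whether it is case sensitive.

(a) Basic structure of a CIF.

<CIF> ::= <Comments>? <WhiteSpace>? { <DataBlock> { <WhiteSpace> <DataBlock> }* { <WhiteSpace> }? }?   [yes]
<DataBlock> ::= <DataBlockHeading> { <WhiteSpace> { <DataItems> | <SaveFrame> } }*   [yes]
<DataBlockHeading> ::= <DATA_> { <NonBlankChar> }+   [no]
<SaveFrame> ::= <SaveFrameHeading> { <WhiteSpace> <DataItems> }+ <WhiteSpace> <SAVE_>   [yes]
<SaveFrameHeading> ::= <SAVE_> { <NonBlankChar> }+   [no]
<DataItems> ::= <Tag> <WhiteSpace> <Value> | <LoopHeader> <LoopBody>   [yes]
<LoopHeader> ::= <LOOP_> { <WhiteSpace> <Tag> }+   [no]
<LoopBody> ::= <Value> { <WhiteSpace> <Value> }*   [yes]

(b) Reserved words.

<DATA_> ::= {'D'|'d'} {'A'|'a'} {'T'|'t'} {'A'|'a'} '_'   [no]
<LOOP_> ::= {'L'|'l'} {'O'|'o'} {'O'|'o'} {'P'|'p'} '_'   [no]
<GLOBAL_> ::= {'G'|'g'} {'L'|'l'} {'O'|'o'} {'B'|'b'} {'A'|'a'} {'L'|'l'} '_'   [no]
<SAVE_> ::= {'S'|'s'} {'A'|'a'} {'V'|'v'} {'E'|'e'} '_'   [no]
<STOP_> ::= {'S'|'s'} {'T'|'t'} {'O'|'o'} {'P'|'p'} '_'   [no]

(c) Tags and values.

<Tag> ::= '_' { <NonBlankChar> }+   [no]
<Value> ::= { '.' | '?' | <Numeric> | <CharString> | <TextField> }   [yes]

(d) Numeric values.

<Numeric> ::= { <Number> | <Number> '(' <UnsignedInteger> ')' }   [no]
<Number> ::= { <Integer> | <Float> }   [no]
<Integer> ::= { '+' | '-' }? <UnsignedInteger>   [no]
<Float> ::= { <Integer> <Exponent> | { { '+' | '-' }? { { <Digit> }* '.' <UnsignedInteger> | { <Digit> }+ '.' } { <Exponent> }? } }   [no]
<Exponent> ::= { { 'e' | 'E' } | { 'e' | 'E' } { '+' | '-' } } <UnsignedInteger>   [no]
<UnsignedInteger> ::= { <Digit> }+   [no]
<Digit> ::= { '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' }   [no]

(e) Character strings and text fields.

<CharString> ::= <UnquotedString> | <SingleQuotedString> | <DoubleQuotedString>   [yes]
<eol><UnquotedString> ::= <eol> <OrdinaryChar> { <NonBlankChar> }*   [yes]
<noteol><UnquotedString> ::= <noteol> { <OrdinaryChar> | ';' } { <NonBlankChar> }*   [yes]
<SingleQuotedString><WhiteSpace> ::= <single_quote> { <AnyPrintChar> }* <single_quote> <WhiteSpace>   [yes]
<DoubleQuotedString><WhiteSpace> ::= <double_quote> { <AnyPrintChar> }* <double_quote> <WhiteSpace>   [yes]
<TextField> ::= { <SemiColonTextField> }   [yes]
<eol><SemiColonTextField> ::= <eol> ';' { { <AnyPrintChar> }* <eol> { { <TextLeadChar> { <AnyPrintChar> }* }? <eol> }* } ';'   [yes]

(f) White space and comments.

<WhiteSpace> ::= { <SP> | <HT> | <eol> | <TokenizedComments> }+   [yes]
<Comments> ::= { '#' { <AnyPrintChar> }* <eol> }+   [yes]
<TokenizedComments> ::= { <SP> | <HT> | <eol> }+ <Comments>   [yes]

(g) Character sets.

<OrdinaryChar> ::= { '!'|'%'|'&'|'('|')'|'*'|'+'|','|'-'|'.'|'/'|'0'|'1'|'2'|'3'|'4'|'5'|
    '6'|'7'|'8'|'9'|':'|'<'|'='|'>'|'?'|'@'|'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|
    'I'|'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'|
    '\'|'^'|'`'|'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|'n'|'o'|
    'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'|'{'|'|'|'}'|'~' }   [yes]
<NonBlankChar> ::= <OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | ';' | '[' | ']'   [yes]
<TextLeadChar> ::= <OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> | '[' | ']'   [yes]
<AnyPrintChar> ::= <OrdinaryChar> | <double_quote> | '#' | '$' | <single_quote> | '_' | <SP> | <HT> | ';' | '[' | ']'   [yes]

2.2.7.1.10. Version identification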

(34) As an archival file format, the CIF specification is expected to change infrequently. Revised specifications will be issued to accompany each substantial modification. A CIF may be considered compliant against the most recent version for which in practice it satisfies all syntactic and content rules as detailed in the formal specification document. However, to signal the version against which compliance was claimed at the time of creation, or to signal the file type and version to applications (such as operating-system utilities), it is recommended that a CIF begin with a structured comment that identifies the version of CIF used. For CIFs compliant with the current specification, the first 11 bytes of the file should be the string

#\#CIF_1.1

immediately followed by one of the white-space characters permitted in paragraph (22).

2.2.7.2. A formal grammar for CIF

2.2.7.2.1. Summary

(35) The rows of Table 2.2.7.1 are called ‘productions’. Productions are rules for constructing sentences in a language. They are written in terms of ‘terminal symbols’ and ‘non-terminal symbols’. ‘Terminal symbols’ are what actually appear in a language. For example, 'poodle' might be given as a string of terminal symbols in some language discussing dogs. Non-terminal symbols are the higher-level constructs of the language, e.g. sentences, clauses, etc. For example <dog> might be given as a non-terminal symbol in some language discussing dogs. Productions may be used to infer rules for parsing the language. For example,

<dog> ::= { 'poodle' | 'terrier' | 'bulldog' | 'greyhound' }

might be given as a rule telling us what names of types of dogs we are allowed to write in this language. In this table, terminal symbols (i.e. terminal character strings) are enclosed in single quotes. To avoid confusion, the terminal symbol consisting of a single quote (i.e. an apostrophe) is indicated by <single_quote> and the terminal symbol consisting of a double quote is indicated by <double_quote>. The printable space character is indicated by <SP>, the horizontal tab character by <HT> and the end of a line by <eol>. To allow for the occurrence of a semicolon as the initial character of an unquoted character string, provided it is not the first character in a line of text, the special symbol <noteol> is used below to indicate any character that is not interpretable as a line terminator. The cases of context sensitivity involving the beginning of text fields and the ends of quoted strings are discussed below, but they are most commonly resolved in a lexical scan.

(36) Productions can be used to produce documents, or equivalently to check a document to see if it is valid in this grammar. The angle brackets delimit names for the syntactic units (the ‘nonterminal symbols’) being defined. The curly braces enclose alternatives separated by vertical bars and/or followed by a plus sign for ‘one or more’, an asterisk for ‘zero or more’ or a question mark for ‘zero or one’.

(37) In most cases, each production has a single non-terminal symbol in the syntactic unit being defined. However, in some cases, both the syntactic unit and the syntax begin or end with some common symbol. This indicates that a specific context is required in order for the rule to be applied. This is done because the initial semicolon of a semicolon-delimited text field only has meaning at the beginning of a line, and quoted strings may contain their initial quoting character provided the embedded quoting character is not immediately followed by white space. This ‘context-sensitive’ notation is unusual in defining computer languages (although very common in the full specifications of many computer and non-computer languages). This context-sensitive notation greatly simplifies the definitions and is simple to implement. The formal definitions are elaborated below.

(38) In the present revision, the production for <TextField> is a trivial equivalence to <SemiColonTextField>. The redundancy is retained to permit possible future extensions to text fields, in particular the possible introduction of a bracket-delimited text value.

2.2.7.2.2. Explanation of the formal syntax

Comment: Readers not familiar with the conventions used in describing language grammars may wish to consult various lecture notes on the subject available on the web, e.g. Bernstein (2002).

(39) In creating a parser for CIF, the normal process is to first perform a ‘lexical scan’ to identify ‘tokens’ in the CIF. A ‘token’ is a grammatical unit, such as a special character, or a tag or a value, or some major grammatical subunit. In the course of a lexical scan, the input stream is reduced to manageable pieces, so that


2. CONCEPTS AND SPECIFICATIONS the rest of the parsing may be done more efficiently. The convention followed in this document is to mark the ‘non-terminal’ tokens that are built up out of actual strings of characters or which do not have an immediate representation as printable characters by angle brackets, , and to indicate the tokens that are actual strings of characters as quoted strings of characters. (40) The precise division between a lexical scan and a full parse is a matter of convenience. A suggested division is presented. Before getting to that point, however, there are some highly machine-dependent matters that need to be resolved. There must be a clear understanding of the character set to be used, and of how files and lines begin and end. The character set will be specified in terms of printable characters and a few control characters from the 7-bit ASCII character set. In addition, we will need some means of specifying the end of a line. (41) The character set in CIF is restricted to the ASCII control characters (horizontal tab, position 09 in the ASCII character set), (newline, position 10 in the ASCII character set, also named ) and (carriage return, position 13 in the ASCII character set), and the printable characters in positions 32–126 of the ASCII character set. These are the characters permitted by STAR with the exception of VT (vertical tab, position 11 in the ASCII character set) and FF (form feed, position 12 in the ASCII character set). In general it is poor practice to use characters that are not common to all national variants of the ISO character set. On systems or in programming languages that do not ‘work in ASCII’, the characters themselves may have different numeric values and in some cases there is no access to all the control characters. (42) The token stands for the system-dependent end of line. Implementation note: CIF implementations may follow common HTML and XML practice in handling :

CIF processing such processing does not result in an ever-growing ‘tail’ of empty lines at the end of a CIF document. This discussion is not meant to imply that a parser for a system that uses one of these line-termination conventions must recognize a CIF written using another of these line-termination conventions. This discussion is not meant to imply that parsers on systems that use other line-termination conventions and/or non-ASCII character sets need to handle these ASCII control characters. In processing a valid CIF document, it is always sufficient that a parser be able to recognize the line-termination conventions of text files local to its system environment, and that it be able to recognize the local translations of and the printable characters used to construct a CIF. However, when circumstances permit, if a parser is able to recognize ‘alien’ line terminations, it is permissible for the parser to accept and process the CIF in that form without treating it as an error. In writing CIF documents, the software that emits lines should follow the text-file line-termination conventions of the target system for which it is writing the CIF documents, and not mix conventions from multiple systems. In transmitting a CIF document from system to system, software should be used that causes the document to conform to the line-termination conventions of the target system. In most cases this objective can best be achieved by using ‘text’ or ‘ascii’ transmission modes, rather than ‘binary’ or ‘image’ transmission modes. (43) In order to write the grammar, we need a way to refer to the single-quote characters which we use both to quote within the syntax and to quote within a CIF. To avoid system-dependent confusion, we define the following special tokens:

‘[On many modern systems,] lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA). To simplify the tasks of applications, the characters passed to an application . . . must be as if the . . . [parser] normalized all line breaks in external parsed entities . . . on input, before parsing, [e.g.] by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.’
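The normalization quoted above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function names `normalize_line_endings` and `split_cif_lines` are hypothetical, and the handling of a trailing control-D or control-Z and of an unterminated last line follows the cautions given elsewhere in this implementation note.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Translate CR LF pairs and lone CR characters to a single LF."""
    return re.sub(r"\r\n|\r", "\n", text)


def split_cif_lines(text: str) -> list[str]:
    """Return the lines of a document without their line terminators.

    Assumption for this sketch: a trailing control-D (#x4) or control-Z
    (#x1A) is treated as an end-of-text marker and discarded.
    """
    normalized = normalize_line_endings(text).rstrip("\x04\x1a")
    lines = normalized.split("\n")
    # A properly terminated last line leaves one empty trailing element.
    # Dropping it makes terminated and unterminated last lines look the
    # same to the caller, and avoids a growing tail of empty lines.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

A caller that needs to distinguish blank lines inside the document from the end of the stream can rely on interior empty strings being preserved; only the final terminator is absorbed.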

Token

Meaning



‘ ’, the printable space character the horizontal-tab character on the system the machine-dependent end of line the complement of the above; any character that does not indicate the machine-dependent end of line the apostrophe, ’ the double-quote character, "



(From the XML specification http://www.w3.org/TR/2000/REC-xml-20001006.)

Because Unix systems use \n (the ASCII LF control character, or #xA), MS Windows systems use \r\n (the ASCII CR control character, or #xD, followed by the ASCII LF control character, or #xA) and classic MacOS systems use \r, a parser which covers a wide range of systems in a reasonable manner could be constructed using a pseudo-production for such as

(44) There are CIF specifications not definable directly in a context-free Backus–Naur form (BNF). Restrictions in record and data-name lengths, and the parsing of text fields and quoted character strings are best handled in the initial lexical scan. A pure BNF can then be used to parse the tokenized input stream.

2.2.7.3. Lexical tokens

(45) We define a ‘comment’ to be initiated with the character #. This can be followed by any sequence of characters (which include or ). The only characters not allowed are those in the production , which terminates a comment. A comment is recognized only at the beginning of a line or after blanks, i.e. only after space, tab or . For this reason we define both comments and ‘tokenized comments’. No portion of the essential machine-readable content within a CIF is conveyed by the comments. Comments are for the convenience of human readers of CIFs and may be freely introduced or removed. Note however the optional structured comment sanctioned in paragraph (34) above, which has the purpose of indicating the file type and revision level to general-purpose file-handling software.

::= { | | }

provided the supporting infrastructure (such as the lexer) deals with the necessary minor adjustment to ensure that each end of line is recognized and that all end-of-line control characters are filtered out from the portions of the text stream that are to be processed by other productions. One case to handle with care is the end-ofdocument case. It is not uncommon to encounter a last line in a document that is not terminated by any of the above-mentioned control characters. Instead, it may be terminated by the end of the character stream or by a special end-of-text-document control character [e.g. #x4 (control-D) or #x1A (control-Z)]. A CIF parser should normalize such unterminated terminal lines to appear to an application as if they had been properly terminated. On the other hand, care should also be taken so that in multiple generations of

::= { ’#’ {}* }+ ::= { || }+


(46) We accept as white space all appropriate combinations of spaces, tabs, ends of lines and comments, as well as the beginning of the file. White space comprises the characters that can delimit the lexical tokens.

Formally we express this with context-sensitive productions. In practice, it requires a one-character look-ahead to decide to continue the scan if the opening quote is encountered, but the following character is not space, tab or end of line. When processing a semicolon-delimited text field, the column position has to be remembered to decide whether a semicolon should be recognized. For a semicolon-delimited text string, failure to provide trailing white space is an error. The on the left-hand side must evaluate to the same string instance on the right-hand side and the parse must terminate on the first valid match reading left to right.

::= {|||}+

(47) Non-blank characters are composed of all the characters in our set, excluding and and characters. ::= ||’#’| ’$’||’_’|’;’|’[’|’]’

(48) AnyPrintChar characters are composed of all the characters in our set excluding characters.

::= {}*

::= {}*

::= {} ::= ’;’ { {}* {{ {}*}? }* } ’;’ ::= ’[’ { |}* ’]’

::= ||’#’| ’$’||’_’|||’;’|’[’|’]’

(49) We define a ‘line of text’ to be a line contained within a semicolon-bounded text field. Hence the first character cannot be a semicolon; it may be followed by any number of characters from the set and terminated with a line-termination character. We define the characters in as those in except for the semicolon. ::= ||’#’| ’$’||’_’|||’[’|’]’

(57) Tags and values are appropriate lexical tokens. The special values of ‘.’ and ‘?’ represent data that are inapplicable or unknown, respectively. (i) No string that matches the production for is accepted as a non-quoted string. (ii) No string that matches the production for is accepted as a non-quoted string. (iii) No string in which the initial five characters match the production for is accepted as a non-quoted string. (iv) No string in which the initial five characters match the production for is accepted as a non-quoted string. (v) No string that matches the production for is accepted as a non-quoted string. Unquoted strings are described by a pair of productions to permit the initial letter of an unquoted string to be a semicolon so long as that does not occur at the beginning of a line. The parser is required to evaluate to the same string instance on both sides of the production.

(50) Ordinary characters are all those printable characters that can initiate a non-quoted character string. These exclude the special characters ", #, $, ’, [, ] and _, and in some cases ;. ::= ’!’|’%’|’&’|’(’|’)’|’*’|’+’|’,’|’-’|’.’|’/’|’0’ |’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9’|’:’|’’|’?’|’@’|’A’|’B’|’C’|’D’|’E’|’F’|’G’|’H’|’I’ |’J’|’K’|’L’|’M’|’N’|’O’|’P’|’Q’|’R’|’S’|’T’|’U’ |’V’|’W’|’X’|’Y’|’Z’|’\’|’ˆ’|’‘’|’a’|’b’|’c’|’d’ |’e’|’f’|’g’|’h’|’i’|’j’|’k’|’l’|’m’|’n’|’o’|’p’ |’q’|’r’|’s’|’t’|’u’|’v’|’w’|’x’|’y’|’z’|’{’|’|’ |’}’|’˜’

(51) The reserved word data_ (in a case-insensitive form). ::= {’d’|’D’} {’a’|’A’} {’t’|’T’} {’a’|’A’} ’_’

(52) The reserved word loop_ (in a case-insensitive form). ::= {’l’|’L’} {’o’|’O’} {’o’|’O’} {’p’|’P’} ’_’



::= ’_’{ }+ ::= { ’.’|’?’|| | } ::= { | ’(’ ’)’ }

::= { | } ::= { ’+’|’-’ }? ::= { {’e’|’E’ }|{’e’|’E’ } { ’+’|’- ’ } }

::= { }+

::= {’0’|’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9’}

::= { | { {’+’|’-’} ? { {}* ’.’ } | {}+ ’.’ } } {} ? } } ::= ||

::= {}* ::= {|’;’} {}*

(53) The reserved word save_ (in a case-insensitive form). ::= {’s’|’S’} {’a’|’A’} {’v’|’V’} {’e’|’E’} ’_’

(54) The reserved word stop_ (in a case-insensitive form). ::= {’s’|’S’} {’t’|’T’} {’o’|’O’} {’p’|’P’} ’_’

(55) The reserved word global_ (in a case-insensitive form). This is actually a reserved word of STAR, but we define it here so that it may be explicitly excluded as an unquoted string. We do this so that any possible future adoption of STAR features will not invalidate existing CIFs. ::= {’g’|’G’} {’l’|’L’} {’o’|’O’} {’b’|’B’} {’a’|’A’} {’l’|’L’} ’_’

(56) Quoted strings need to be recognized in the lexical scan, because their definition is context-sensitive. A string quoted by single quotes may contain a single quote as long as it is not followed by white space. A string quoted by double quotes may contain a double quote as long as it is not followed by white space.
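The context-sensitive rule for quoted strings can be illustrated with a short scanner. The sketch below is an assumption-laden example rather than a reference implementation: the name `scan_quoted` is hypothetical, the input is a single line with its line terminator already removed, and the end of the line is treated as white space for the purpose of closing the string.

```python
WHITESPACE = " \t"


def scan_quoted(line: str, start: int) -> tuple[str, int]:
    """Scan a quoted string whose opening quote is at line[start].

    Returns the enclosed value and the index just past the closing quote.
    A quote character closes the string only when it is followed by
    white space or by the end of the line (one-character look-ahead).
    """
    quote = line[start]
    if quote not in ("'", '"'):
        raise ValueError("scan must begin at a quote character")
    i = start + 1
    while i < len(line):
        if line[i] == quote and (i + 1 == len(line) or line[i + 1] in WHITESPACE):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("quoted string is not terminated on this line")
```

For example, scanning `'a dog's life' next` from position 0 yields the value `a dog's life`, because the embedded apostrophe is followed by a letter and so does not close the string.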


2.2.7.3.1. CIF grammar

(58) A CIF may be an empty file, or it may contain only comments or white space, or it may contain one or more data blocks. Comments before the first block are acceptable, and there must be white space between blocks.

(c) a taxonomy, or classification scheme relating the specified objects; (d) descriptions of the attributes and relationships of individual and related objects. In the CIF framework, the objects of discourse are described in so-called data dictionary files that provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to service different domains of crystallography and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in that dictionary. However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.

::= ? ? { { }* { }? }?

(59) For a data block, there must be a data heading and zero or more data items or save frames. ::= { { | } }*

(60) A data-block heading consists of the five characters data_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters.

2.2.7.4.2. Definition of terms (2) The definitions of Section 2.2.7.1.2 also hold for this part of the specification.

::= { }+

2.2.7.4.3. Semantics of data items

2.2.7.4. Common semantic features

(3) While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs. (4) At the current revision, two conventions (known as dictionary definition languages or DDLs) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook & Hall, 1995), and they differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items. (5) While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.

2.2.7.4.1. Introduction (1) The Crystallographic Information File (CIF) standard is an extensible mechanism for the archival and interchange of information in crystallography and related structural sciences. Ultimately CIF seeks to establish an ontology for machine-readable crystallographic information – that is, a collection of statements providing the relations between concepts and the logical rules for reasoning about them. Essential components in the development of such an ontology are: (a) the basic rules of grammar and syntax, described in Sections 2.2.7.1 to 2.2.7.3; (b) a vocabulary of the tags or data names specifying particular objects;

2.2.7.4.4. Data-name semantics (6) Strictly, data names should be considered as void of semantic content – they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary. (7) However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit

(61) For a save frame, there must be a save-frame heading, some data items and then the reserved word save_. ::= { }+

(62) A save-frame heading consists of the five characters save_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters. ::= { }+

(63) Data come in two forms: (i) A data-name tag separated from its associated value by a . (ii) Looped data. The number of values in the body must be a multiple of the number of tags in the header.
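The constraint on looped data in (ii) is easy to check once the tags and values of a loop have been tokenized. The following Python sketch assumes the tokens are already available as two lists; the function name `loop_rows` and the row-as-dictionary representation are choices made for this illustration only.

```python
def loop_rows(tags: list[str], values: list[str]) -> list[dict[str, str]]:
    """Group the values of a loop body into rows keyed by the header tags.

    Raises ValueError when the loop has no tags or no values, or when the
    number of values is not a multiple of the number of tags.
    """
    if not tags:
        raise ValueError("a loop header needs at least one tag")
    if not values:
        raise ValueError("a loop body needs at least one value")
    width = len(tags)
    if len(values) % width != 0:
        raise ValueError(
            f"{len(values)} values cannot fill rows of {width} tags"
        )
    return [
        dict(zip(tags, values[start:start + width]))
        for start in range(0, len(values), width)
    ]
```

Values are filled row by row in the order in which they appear in the file, so the first `len(tags)` values form the first row.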

::= | ::= { }+

::= { }*


cell, but it should be emphasized from a computer-programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.

(8) Comment: In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.

default units, except for the ångström, conform to the SI Standard adopted by the IUCr.

This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the core. All data values must be expressed in the single unit assigned in the associated dictionary. A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic (ftp://ftp.iucr.org/cifdics/cif_compat.dic) may be used for the purpose. No other use should be made of this dictionary.

2.2.7.4.7. Data-value semantics

(14) The STAR syntax permits retrieval of data by simply requesting a specific data name within a specific data block. Prior knowledge about data type (e.g. text or numbers), whether the item is looped or whether the item exists in the file at all is unnecessary. However, applications in general need to know data type, valid ranges of values and relationships between data items, and a program designer needs to know the purpose of the data item (i.e. what physical quantity or internal book-keeping function it represents). While such semantic information may be defined informally for local data items (ones not intended for exchange between different users or software applications), formal descriptions of the semantics associated with data values are catalogued in data dictionary files. Currently two formalisms (dictionary definition languages) for describing data-value attributes are supported; full specifications of these formalisms (known as DDL1 and DDL2) are provided in Chapters 2.5 and 2.6.

2.2.7.4.5. Name space

(9) The intention of the maintainers of public CIF dictionaries is to formulate a single authoritative set of data names for each CIF dialect (i.e. DDL1 and DDL2), thus facilitating the reliable archive and interchange of crystallographic data. However, it is also permissible for users to introduce local data names into a CIF. Two mechanisms exist to reduce the danger of collision of data names that are not incorporated into public dictionaries.

(10) The character string [local] (including the literal bracket characters) is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.

(11) Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at http://www.iucr.org/iucr-top/cif/spec/reserved.html. A reserved prefix, e.g. foo, must be used in the following ways:
(i) If the data file contains items defined in a DDL1 dictionary, the local data names assigned under the reserved prefix must contain it as their first component, e.g. _foo_atom_site_my_item.
(ii) If the data file contains items defined in a DDL2 dictionary, then the reserved prefix must be:
(a) the first component of data names in a category defined for local use, e.g. _foo_my_category.my_item.
(b) the first component following the period character in a data name describing a new item in a category already defined in a public dictionary, e.g. _atom_site.foo_my_item.

(12) There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may not themselves contain underscore characters.
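Because a reserved prefix has no syntactic marker, a validator has to compare the relevant components of each data name with the registry, as paragraph (12) describes. The Python sketch below shows one way this might be done. The function names are hypothetical, the registry is assumed to have been loaded already into a set of strings, and the case-insensitive comparison is an assumption of this example.

```python
import re
from typing import Optional

LOCAL_MARKER = "[local]"


def is_strictly_local(data_name: str) -> bool:
    """True if the name contains the reserved string [local]."""
    return LOCAL_MARKER in data_name


def candidate_components(data_name: str) -> list[str]:
    """Return the components in which a reserved prefix may appear.

    These are the first component of the name (DDL1 names and DDL2 local
    categories) and, for names containing a period, the first component
    after the period (new items in a public DDL2 category).
    """
    body = data_name.lstrip("_")
    candidates = [re.split(r"[_.]", body, maxsplit=1)[0]]
    if "." in body:
        item_part = body.split(".", 1)[1]
        candidates.append(item_part.split("_", 1)[0])
    return candidates


def reserved_prefix_of(data_name: str, registry: set[str]) -> Optional[str]:
    """Return the registered prefix used by data_name, or None."""
    known = {prefix.lower() for prefix in registry}
    for component in candidate_components(data_name):
        if component.lower() in known:
            return component
    return None
```

With a registry containing only `foo`, the three example names given in (11) are all recognized as using the prefix `foo`, while `_atom_site.fract_x` is not.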

2.2.7.4.7.1. Data typing (15) Four base data types are supported in CIF. These are: (i) numb: a value interpretable as a decimal base number and supplied as an integer, a floating-point number or in scientific notation; (ii) char: a value to be interpreted as character or text data (where the value contains white-space characters, it must be quoted); (iii) uchar: a value to be interpreted as character or text data but in a case-insensitive manner (i.e. the values FOO and foo are to be taken as identical); (iv) null: a special data type associated with items for which no definite value may be stored in computer memory. It is the type associated with the special character literal values ? (query mark) and . (full point), which may appear as values for any data item within a data file (see Section 2.2.7.4.8 below). It is also the type assigned to items defined in dictionary files that may not occur in data files. (16) Comment: Many applications distinguish between multiline text fields and character-string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be ‘char’ or ‘uchar’. Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character-string delimiters in Section 2.2.7.1.4(20) are predicated on an approach that handles all subtypes of character or text data equivalently. (17) Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string inter-

2.2.7.4.6. Note on handling of units

(13) The published specification for CIF version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall et al., 1991, p. 657):

. . . Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The


2. CONCEPTS AND SPECIFICATIONS pretable as a number should be taken to represent an item of type ‘numb’. However, an explicit dictionary declaration of type will override such an assumption.
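The fallback rule for values without dictionary attributes can be sketched as follows. This is an illustrative reading of paragraphs (15) and (17), not a normative algorithm: the function name `infer_base_type`, the `quoted` flag and the regular expression used to decide whether a string is interpretable as a number are all assumptions of this example.

```python
import re
from typing import Optional

# Integer, floating-point or scientific notation, optionally followed by
# a standard uncertainty in parentheses (an assumption of this sketch).
_NUMERIC = re.compile(
    r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?$"
)


def infer_base_type(raw: str, quoted: bool = False,
                    declared: Optional[str] = None) -> str:
    """Return 'null', 'numb', 'char' or the dictionary-declared type.

    The unquoted literals ? and . are always of type null. Otherwise an
    explicit dictionary declaration overrides any inference from the
    appearance of the value.
    """
    if not quoted and raw in ("?", "."):
        return "null"
    if declared is not None:
        return declared
    if _NUMERIC.match(raw):
        return "numb"
    return "char"
```

A dictionary-aware application would pass the declared type whenever one is available, so the inference from the form of the string is used only as a last resort.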

these markup conventions, but the published examples suggest that they may be used in any character field in a CIF data file except as prohibited by a dictionary directive. It is intended that the next CIF version specification shall formally declare where such markup may be used.

2.2.7.4.7.2. Subtyping (18) The base data types detailed in the previous section are very general and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case. (19) DDL1 dictionaries. Values of type ‘numb’ may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute _type_conditions

2.2.7.4.11. Handling of long lines (26) The restriction in line length within CIF requires techniques to handle without semantic loss the content of lines of text exceeding the limit (2048 characters in this revision, 80 characters in the initial CIF specification). The line-folding protocol defined here provides a general mechanism for wrapping lines of text within CIFs to any extent within the overall line-length limit. A specific application where this would be useful is the conversion of lines longer than 80 characters to the CIF version 1.0 limit. This 80-character limit is used in the examples below for illustrative purposes. These techniques are applied only to the contents of text fields and to comments. In order to permit such folding, a special semantics is defined for use of the backslash. It is important to understand that this does not change the syntax of CIF version 1.0. All existing CIFs conforming to the CIF version 1.0 specification can be viewed as having exactly the same semantics as they now have. Use of these transformational semantics is optional, but recommended. In order to avoid confusion between CIFs that have undergone these transformations and those that have not, the special comment beginning with a hash mark immediately followed by a backslash (#\) as the last non-blank characters on a line is reserved to mark the beginning of comments created by folding long-line comments, and the special text field beginning with the sequence line termination, semicolon, backslash (;\) as the only non-blank characters on a line is reserved to mark the beginning of text fields created by folding long-line text fields. The backslash character is used to fold long lines in character strings and comments. Consider a comment which extends beyond column 80. 
In order to provide a comment with the same meaning which can be fitted into 80-character lines, prefix the comment with the special comment consisting of a hash mark followed by a backslash (#\) and the line terminator. Then on new lines take appropriate fragments of the original comment, beginning each fragment with a hash mark and ending all but the last fragment with a backslash. In doing this conversion, check for an original line that ends with a backslash followed only by blanks or tabs. To preserve that backslash in the conversion, add another backslash after it. If the next lexical token (not counting blanks or tabs) is another comment, to avoid fusing this comment with the next comment, be sure to insert a line with just a hash mark. Similarly, for a character string that extends beyond column 80, (i) first convert it to be a text field delimited by line termination– semicolon (;) sequences, (ii) then change the initial line termination–semicolon (;) sequence to line termination–semicolon–backslash–line termination (;\), (iii) and break all subsequent lines that do not fit within 80 columns with a trailing backslash. In the course of doing the translation, (a) check for any original text lines that end with a backslash followed only by blanks or tabs; (b) to preserve that backslash in the conversion, add another backslash after it, and then an empty line. (More formally, the line folding should be done separately and directly on single-line non-semicolon-delimited character strings

esd

(or _type_conditions su, a synonym introduced to DDL1 in 2005). For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).

(20) DDL2 dictionaries. DDL2 provides a number of tags that may be used in a dictionary file to specify subtypes for data items defined by that dictionary alone. Examples of the subtypes specified for the macromolecular CIF dictionary are:

code         identifying code strings or single words
ucode        identifying code strings or single words (case-insensitive)
uchar1       single-character codes (case-insensitive)
uchar3       three-character codes (case-insensitive)
line         character strings forming a single line of text
uline        character strings forming a single line of text (case-insensitive)
text         multi-line text
int          integers
float        floating-point real numbers
yyyy-mm-dd   dates
symop        symmetry operations
any          any type permitted
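The standard-uncertainty notation used for DDL1 values of type ‘numb’, e.g. 34.5(12) or 3.45E1(12), can be decoded as in the following Python sketch. The function name `parse_numb` and the use of `decimal.Decimal` to avoid binary rounding are choices made for this example. The pattern accepts the digits in parentheses after the complete number, including any exponent, and scales them to the last decimal places of the mantissa.

```python
import re
from decimal import Decimal
from typing import Optional

_NUMB = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+))"   # mantissa
    r"(?:[eE]([+-]?\d+))?"           # optional exponent
    r"(?:\((\d+)\))?$"               # optional standard uncertainty
)


def parse_numb(text: str) -> tuple[Decimal, Optional[Decimal]]:
    """Return (value, standard uncertainty or None) for a numb string."""
    match = _NUMB.match(text)
    if match is None:
        raise ValueError(f"not interpretable as a number: {text!r}")
    mantissa, exponent, su_digits = match.groups()
    power = int(exponent) if exponent else 0
    value = Decimal(mantissa).scaleb(power)
    if su_digits is None:
        return value, None
    # The uncertainty applies to the final digits of the mantissa.
    decimals = len(mantissa.partition(".")[2])
    su = Decimal(su_digits).scaleb(power - decimals)
    return value, su
```

Both forms quoted in paragraph (19) decode to the value 34.5 with a standard uncertainty of 1.2.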

2.2.7.4.8. Special generic values

(21) The unquoted character literals ? (query mark) and . (full point) are special and are valid expressions for any data type.

(22) The value ? means that the actual value of a requested data item is unknown.

(23) The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.

2.2.7.4.9. Embedded data semantics

(24) The attributes of data items defined in CIF dictionaries serve to direct crystallographic applications in the retrieval, storage and validation of relevant data. In principle, a CIF might include as data items suitably encoded fields representing data suitable for manipulation by text processing, image, spreadsheet, database or other applications. It would be useful to have a formal mechanism allowing a CIF to invoke appropriate content handlers for such data fields; this is under investigation for the next CIF version specification.

2.2.7.4.10. CIF conventions for special characters in text

(25) The one existing example of embedded semantics is the text character markup introduced in the CIF version 1.0 specification and summarized in paragraphs (30)–(37) below. The specification is silent on which fields should be interpreted according to


to allow for recognition of the fact that no terminal line termination is intended – see below.)

In order to understand this scheme, suppose the CIF fragment (1) below were considered to have long lines. It could be transformed into (2) as follows:

(1) Initial CIF

#######################################################
### CIF submission form for Rietveld refinements ###
### Version 14 December 1998 ###
#######################################################
data_znvodata
_chemical_name_systematic
; zinc dihydroxide divanadate dihydrate
;
_chemical_formula_moiety 'H2 O9 V2 Zn3, 2(H2 O)'
_chemical_formula_sum 'H6 O11 V2 Zn3'
_chemical_formula_weight 480.05

(2) Transformed CIF

#\
###########################\
###########################
### CIF submission form for Rietveld refinements ###
### Version 14 December 1998 ###
#######################################################
data_znvodata
_chemical_name_systematic
;\
zinc dihydroxide divan\
adate dihydrate
;
_chemical_formula_moiety
;\
H2 O9 V2 Zn3, 2(H2 O)\
;
_chemical_formula_sum 'H6 O11 V2 Zn3'
_chemical_formula_weight 480.05

Since there are very few, if any, CIFs that contain text fields and comments beginning this way, in most cases it is reasonable to adopt the policy of doing this processing unless it is disabled. Here is another example of folding. The following three text fields would be equivalent:

;C:\foldername\filename
;

;\
C:\foldername\filename
;

and

;\
C:\foldername\file\
name
;

but the following example would be a two-line value where the first line had the value C:\foldername\file\ and the second had the value name:

;
C:\foldername\file\
name
;

Note that backslashes should not be used to fold lines outside of comments and text fields. That would introduce extraneous characters into the CIF and violate the basic syntax rules. In any case, such action is not necessary.

2.2.7.4.12. Dictionary compliance

(27) Dictionary files containing the definitions and attribute sets for the data items contained in a CIF should be identified within the CIF by some or all of the data items

_audit_conform_dict_name _audit_conform_dict_version _audit_conform_dict_location

In making the transformation from the backslash-folded form to long lines, it is very important to strip trailing blanks before attempting to recognize a backslash as the last character. When reassembling text-field lines, no reassembly should be done except in text fields that begin with the special sequence described above, line termination–semicolon–backslash–line termination, (;\), so that text fields that happen to contain backslashes but which were not created by folding long lines are not changed. It is also important to remove the trailing backslashes when reassembling long lines. The final line termination– semicolon sequence of a text field takes priority over the reassembly process and ends it, but a trailing backslash on the last line of a text field very nicely conveys the information that no trailing line termination is intended to be included within the character string. Similarly, when reassembling long-line comments, the reassembly begins with a comment of the form hash–backslash– line termination. The initial hash mark is retained and then a forward scan is made through line terminations and blanks for the next comment, from which the initial hash mark is stripped and then the contents of the comment are appended. If that comment ends with a backslash, the trailing backslash is stripped and the process repeats. Note that the process will be ended by intervening tags, values, data blocks or other non-white-space information, and that the process will not start at all without the special hash– backslash–line termination comment.
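The reassembly of a folded text field described above can be sketched in Python as follows. This is a minimal illustration covering text fields only (folded comments are not handled). The function name `unfold_text_field` is hypothetical, and the input is assumed to be the list of lines of the text field with the delimiting semicolons removed, so that the first element is whatever follows the opening semicolon on its line.

```python
def unfold_text_field(lines: list[str]) -> list[str]:
    """Reassemble the long lines of a backslash-folded text field.

    The field is treated as folded only if the text after the opening
    semicolon is a single backslash (ignoring trailing blanks and tabs);
    any other field is returned unchanged.
    """
    if not lines or lines[0].rstrip(" \t") != "\\":
        return list(lines)
    result: list[str] = []
    current = ""
    pending = False  # True while a folded line awaits its continuation
    for line in lines[1:]:
        # Strip trailing blanks before looking for the backslash.
        stripped = line.rstrip(" \t")
        if stripped.endswith("\\"):
            current += stripped[:-1]  # drop the folding backslash
            pending = True
        else:
            result.append(current + line)
            current = ""
            pending = False
    if pending:
        # A trailing backslash on the last line: the closing semicolon
        # ends reassembly, and no further text is appended.
        result.append(current)
    return result
```

A line that originally ended in a backslash is written by the folding procedure with a second backslash and a following empty line, so the loop above restores it without any special case: only the final backslash is removed and the empty continuation adds nothing.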

corresponding to DDL1 dictionaries or _audit_conform.dict_name _audit_conform.dict_version _audit_conform.dict_location

for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform against the core CIF dictionary. (28) The _audit_conform data items may be looped in cases where more than one dictionary is used to define the items in a CIF and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL. (29) A detailed protocol exists for locating, merging and overlaying multiple dictionary files (McMahon et al., 2000) (see Section 3.1.9). 2.2.7.4.13. CIF markup conventions (30) If permitted by the relevant dictionary and if no other indication is present, the contents of a text or character field are assumed to be interpretable as text in English or some other human language. Certain special codes are used to indicate special characters or accented letters not available in the ASCII character set, as listed below.


2.2.7.4.14. Greek letters

(31) In general, a Greek letter is represented by the corresponding letter of the Latin alphabet, prefixed by a backslash character. The complete set is:

α   A   \a   \A   alpha
β   B   \b   \B   beta
χ   X   \c   \C   chi
δ   ∆   \d   \D   delta
ε   E   \e   \E   epsilon
ϕ   Φ   \f   \F   phi
γ   Γ   \g   \G   gamma
η   H   \h   \H   eta
ι   I   \i   \I   iota
κ   K   \k   \K   kappa
λ   Λ   \l   \L   lambda
µ   M   \m   \M   mu
ν   N   \n   \N   nu
o   O   \o   \O   omicron
π   Π   \p   \P   pi
θ   Θ   \q   \Q   theta
ρ   R   \r   \R   rho
σ   Σ   \s   \S   sigma
τ   T   \t   \T   tau
υ   U   \u   \U   upsilon
ω   Ω   \w   \W   omega
ξ   Ξ   \x   \X   xi
ψ   Ψ   \y   \Y   psi
ζ   Z   \z   \Z   zeta

\% degree (◦ ) -dash --single bond \\db double bond \\tb triple bond \\ddb delocalized double bond \\sim ∼ (Note: ˜ is the code for subscript) \\simeq ≃ \\infty ∞

\" \‘ \. \; \, \(

= > < → ←

2.2.7.4.17. Typographic style codes (36) The codes indicated above are designed to refer to special characters not expressible within the CIF character set, and the initial specification did not permit markup for typographic style such as italic or bold-face type. However, in some cases the ability to indicate type style is useful, and in addition to the codes above HTML-like conventions are allowed of surrounding text by to indicate the beginning and end of italic, and by to indicate the beginning and end of bold-face type. (37) If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.

(32) Accents should be indicated by using the following codes before the letter to be modified (i.e. use \’e for an acute e):

acute (´e) overbar or macron (¯a) tilde (˜n) circumflex (ˆa) hacek or caron (ˇo) Hungarian umlaut or double accented (˝o)

× ± ∓

Note that \\db, \\tb and \\ddb should always be followed by a space, e.g. C=C is denoted by C\\db C.

2.2.7.4.15. Accented letters

\’ \= \˜ \ˆ \< \>

\\times +-+ \\square \\neq \\rangle \\langle \\rightarrow \\leftarrow

umlaut (¨u) grave (`a) overdot (˙o) ogonek (u) ˛ cedilla (c¸) breve (˘o)

References Bernstein, H. J. (2002). Some comments on parsing for computer programming languages. http://www.bernstein-plus-sons.com/ TMM/Parsing. Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. & Berners-Lee, T. (1999). Hypertext Transfer Protocol – HTTP/1.1. RFC 2616. Network Working Group. http://www.w3.org/Protocols/ rfc2616/rfc2616.html. Freed, N. & Borenstein, N. (1996). Multipurpose Internet Mail Extensions (MIME) Part two: media types. RFC 2046. Network Working Group. http://www.ietf.org/rfc/rfc2046.txt. Hall, S. R. (1991). The STAR File: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333. Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685. Hall, S. R. & Cook, A. P. F. (1995). STAR dictionary definition language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819–825. Hall, S. R. & Spadaccini, N. (1994). The STAR File: detailed specifications. J. Chem. Inf. Comput. Sci. 34, 505–508. McMahon, B., Westbrook, J. D. & Bernstein, H. J. (2000). Report of the COMCIFS Working Group on Dictionary Maintenance. http://www.iucr.org/iucr-top/cif/spec/dictionaries/maintenance.html. Ulrich, E. L. et al. (1998). XVIIth Intl Conf. Magn. Res. Biol. Systems. Tokyo, Japan. Westbrook, J. D. & Hall, S. R. (1995). A dictionary description language for macromolecular structure. http://ndbserver.rutgers.edu/mmcif/ddl/.

These codes will always be followed by an alphabetic character. 2.2.7.4.16. Other characters (33) Other special alphabetic characters should be indicated as follows:

\%a \/o

a-ring (˚a) o-slash (ø)

\?i \/l

dotless i (ı) Polish l (ł)

\&s \/d

German ‘ss’ (ß) barred d ( d)

Capital letters may also be used in these codes, so an a˚ ngstr¨om ˚ may be given as \%A. symbol (A) (34) Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:

superscripts subscripts

Cspˆ3ˆ U˜eq˜

for for

Csp3 Ueq

The closing symbol is essential to return to normal text. (35) Some other codes are accepted by convention. These are:
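As an illustration of the markup conventions of Sections 2.2.7.4.14 and 2.2.7.4.16, the following Python sketch converts the Greek-letter codes and a few of the double-backslash codes to Unicode characters. The function name `decode_markup` and the choice of which codes to include are assumptions made for the example; accents, superscripts, subscripts and bond codes are deliberately left untouched.

```python
import re

# Greek-letter codes: backslash plus the corresponding Latin letter.
_CODES = "abcdefghiklmnopqrstuwxyz"
_LOWER = "αβχδεφγηικλμνοπθρστυωξψζ"
_UPPER = "ΑΒΧΔΕΦΓΗΙΚΛΜΝΟΠΘΡΣΤΥΩΞΨΖ"
GREEK = dict(zip(_CODES, _LOWER))
GREEK.update(zip(_CODES.upper(), _UPPER))

# A subset of the double-backslash codes (not the complete list).
SYMBOLS = {
    r"\\times": "×",
    r"\\infty": "∞",
    r"\\simeq": "≃",
    r"\\sim": "∼",
    r"\\neq": "≠",
    r"\\rightarrow": "→",
    r"\\leftarrow": "←",
}

# Longest codes first, so that \\simeq is not mistaken for \\sim.
_SYMBOL_RE = re.compile(
    "|".join(re.escape(code) for code in sorted(SYMBOLS, key=len, reverse=True))
)
# A single backslash (not preceded by another) followed by one letter.
_GREEK_RE = re.compile(r"(?<!\\)\\([A-Za-z])")


def decode_markup(text):
    """Replace the supported CIF markup codes in `text` (sketch only)."""
    text = _SYMBOL_RE.sub(lambda m: SYMBOLS[m.group(0)], text)
    return _GREEK_RE.sub(lambda m: GREEK.get(m.group(1), m.group(0)), text)
```

Codes that the sketch does not recognize, such as \\db or \j, are passed through unchanged rather than guessed at.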


International Tables for Crystallography (2006). Vol. G, Chapter 2.3, pp. 37–43.

2.3. Specification of the Crystallographic Binary File (CBF/imgCIF)

By H. J. Bernstein and A. P. Hammersley

Affiliations: Herbert J. Bernstein, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA; Andrew P. Hammersley, ESRF/EMBL Grenoble, 6 rue Jules Horowitz, France.

Copyright © 2006 International Union of Crystallography

2.3.1. Introduction

The Crystallographic Binary File (CBF) format is a complementary format to the Crystallographic Information File (CIF) (Hall et al., 1991) supporting efficient storage of large quantities of experimental data in a self-describing binary format. The image-supporting Crystallographic Information File (imgCIF) is an extension to CIF to assist in ASCII debugging and archiving of CBF files and to allow for convenient and standardized inclusion of images, such as maps, diagrams and molecular drawings, into CIFs for publication. The binary CBF format is useful for handling large images within laboratories and for interchange among collaborating groups. For smaller blocks of binary data, either format should be suitable. The ASCII imgCIF format is appropriate for interchange of smaller images and for long-term archiving.

CBF is designed to support efficient storage of raw experimental data (images) from area detectors with no loss of information, unlike some existing formats intended for this purpose. The format enables very efficient reading and writing of raw data, and encourages economical use of disk space. It may be coded easily and is portable across platforms. It is also flexible and extensible so that new data structures can be added without affecting the present definitions.

These goals are achieved by a simple file format, combining a CIF-like file header with compressed binary information. The file header consists of ASCII text giving information about the binary data as CIF tag–value pairs and tables. Each binary image is presented as a text-field value, either as raw octets of binary data in a CBF data set, or as an ASCII-based encoding of the same binary information in a true ASCII imgCIF data set. The ASCII-based encoded format uses e-mail MIME (Multipurpose Internet Mail Extensions) conventions to encode the binary data (Freed & Borenstein, 1996a,b,c; Freed et al., 1996; Moore, 1996).

The present version of the format tries to deal only with simple Cartesian data. These are essentially the ‘raw’ diffraction data that typically are stored in commercial formats or individual formats internal to particular institutes. Other forms of binary image data could be accommodated. It is hoped that CBF will replace individual laboratory or institute formats for ‘home-built’ detector systems, will be used as an inter-program data-exchange format, and will be offered as an output choice by commercial detector manufacturers specializing in X-ray and other detector systems.

In this chapter we discuss the basic framework within which binary data and images are stored. The categories and data items that are used to describe beam and equipment axes, rastering methodologies, and image compression techniques are described in Chapter 3.7. The CBF/imgCIF dictionary is given in Chapter 4.6. An application programming interface (API) for the manipulation of image data is described in Chapter 5.6.

2.3.2. CBF and imgCIF

CBF and imgCIF are two aspects of the same format. Since CIFs are pure ASCII text files, it was necessary to define a separate binary format to allow the combination of pseudo-ASCII sections and binary data sections. In the binary-file CBF format, the ASCII sections conform closely to the CIF standard but must use operating-system-independent ‘line separators’. In order to facilitate interchange of files, an API that writes CBF files should use \r\n (carriage return, line feed) for the line separator. Use of this line separator allows the ASCII sections to be viewed with standard system utilities (e.g. ‘more’, ‘pg’) on a very wide range of operating systems (e.g. Unix, MacOS and Windows). However, an API that reads CBF format must accept any of the following three alternative line terminators as the end of an ASCII line: \r, \n or \r\n.

As for all CIF data sets, an imgCIF file conforms to the normal text file-writing conventions of the system on which it is written.

imgCIF is also one of the two names of the CIF dictionary (see Chapter 4.6) that contains the terms specific to describing image data in both CBF and imgCIF data sets. Thus a CBF or imgCIF data set uses data names from the CBF/imgCIF dictionary and other CIF dictionaries.

The general structure of a CBF or imgCIF data set is shown in Example 2.3.2.1. After a special comment to identify the file type (a so-called ‘magic number’) and any other initial comments, the data set begins with a ‘data_blockname’ which gives the name of the data block. Tags and values that describe the image data and how they were collected come next. For efficiency in processing, it is recommended that all the descriptive tags come before the actual image data. This recommendation is a requirement for the binary CBF format. It is optional for the ASCII imgCIF format. The image data are given as the value of the tag _array_data.data. The image data are given in a text field, using MIME conventions to describe the encoding.

2.3.2.1. A simple example

Before describing the format in full, we start by showing a simple but important and complete use of the format: that of storing a single detector image in a file together with a small amount of auxiliary information. This is intended to be a useful example that can be understood without reference to the full definitions. It also serves as an introduction or overview of the format definition. This example uses CIF DDL2-based dictionary items (see Chapter 2.6).

Example 2.3.2.2 relates to an image of 768 × 512 pixels stored as 16-bit unsigned integers, in little-endian byte order (this is the native byte ordering on a PC). The pixel sizes are 100.5 × 99.5 µm. The example will be presented and discussed in three sections. The circled numerals (e.g. ①) are included to allow us to comment on portions of the example. They are not part of the CBF/imgCIF format.

The line marked by ①, starting with a hash character (#), is a CIF and CBF comment line. As a first line, the pattern of three hashes followed by ‘CBF’ helps to identify the data set as a CBF. It is a so-called ‘magic number’. The text ###CBF: VERSION must be present as the very first line of every CBF file. Following

‘VERSION’ is the number of the corresponding version of the CBF/imgCIF extension dictionary and supporting documentation. Comment lines and white space (blanks and newlines) may appear anywhere outside the binary sections.

In an imgCIF data set, the descriptive tags and values may be presented in any convenient order, e.g. the data could come first and the parameters necessary to interpret the data could come later. This order-independent convention holds for an imgCIF file, but for a CBF all the tags and values describing binary data (i.e. all the tags other than those in the ARRAY_DATA category) should be presented before the binary data, in the form of a header. This does not mean that there cannot be more useful information after the binary data. There could be another full header and more blocks of binary data. In the interest of efficiency in processing a CBF, the parameters that relate to a particular block of binary data must appear earlier in the CBF than the block itself.

The header begins at the line marked with ②. The data_ token is the CIF token for identifying a data block. The name of the data block, image_1, follows immediately without any intervening white space. The name of the data block is arbitrary. Within a data block any given tag may be presented only once, either directly with a value following immediately, or as one of the column headings for the rows of a table. To reuse the same tag one must start a new data block.

Example 2.3.2.1. General structure of a CBF or imgCIF data set. The critical values that define an image are marked with .

    ###CBF: VERSION 1.0
    data_blockname
    # cif tags and values describing the image, e.g.
    loop_
    _array_intensities.array_id
    _array_intensities.binary_id
    _array_intensities.linearity
    _array_intensities.undefined_value
    _array_intensities.overload
    image_1 1 linear 0 65535

    # the image data are given as the value of the tag
    # _array_data.data, usually in a loop_ at the end of
    # the data block. The first two values identify the
    # structure of the image and assure a unique
    # identifier if there are multiple images of the
    # same structure.

    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.data
    image_1 1

    # The image itself begins and ends as a CIF
    # text field, within which MIME conventions
    # are used to describe the encoding of the image
    ;
    --CIF-BINARY-FORMAT-SECTION--
    Content-Type: application/octet-stream;
         conversions="x-CBF_PACKED"
    Content-Transfer-Encoding: BINARY
    X-Binary-Size: 374578
    X-Binary-ID: 1
    X-Binary-Element-Type: "unsigned 16-bit integer"
    Content-MD5: jGmkxkrpnizOetd9T/Np4NufAmA==

    START_OF_BIN
    *************9************** ...
    [This is where the raw binary data would be – we can’t print them here]
    --CIF-BINARY-FORMAT-SECTION----
    ;

Example 2.3.2.2. A single image.

    ###CBF: VERSION 1.0                                  ①
    data_image_1                                         ②

    _entry.id                                'image_1'   ③
    _chemical.entry_id                       'image_1'
    _chemical.name_common                    'Protein X'

    # Experimental details                               ④
    _exptl_crystal.id                        'CX-1A'
    _exptl_crystal.colour                    'pale yellow'

    _diffrn.id                               DS1
    _diffrn.crystal_id                       'CX-1A'

    _diffrn_measurement.diffrn_id            DS1
    _diffrn_measurement.method               Oscillation

    _diffrn_radiation_wavelength.id          L1
    _diffrn_radiation_wavelength.wavelength  0.7653
    _diffrn_radiation_wavelength.wt          1.0

    _diffrn_radiation.diffrn_id              DS1
    _diffrn_radiation.wavelength_id          L1

    _diffrn_source.diffrn_id                 DS1
    _diffrn_source.source                    synchrotron
    _diffrn_source.type                      'ESRF BM-14'

    _diffrn_detector.diffrn_id               DS1
    _diffrn_detector.id                      ESRFCCD1
    _diffrn_detector.detector                CCD
    _diffrn_detector.type                    'ESRF Be XRII/CCD'

    _diffrn_detector_element.id              1
    _diffrn_detector_element.detector_id     ESRFCCD1

    _diffrn_frame_data.id                    F1
    _diffrn_frame_data.detector_element_id   1
    _diffrn_frame_data.array_id              'image_1'
    _diffrn_frame_data.binary_id             1

Information about the image begins at the line marked with ③. In the following lines, the apostrophes enclose strings that contain a space. Values that contain white space, or that could be confused with a CIF token, must always be quoted. A double quote (") could have been used. There is a third way to quote a string, with the string \r\n;, i.e. with a semicolon at the beginning of a line, which allows multi-line strings to be presented. We shall use this form of text quotation for the binary data.

The experimental details begin at the line marked with ④. Many more data items could be defined, but here we are giving an example of one useful minimal (but not mandatory) set. See the imgCIF dictionary in Chapter 4.6 and the classification of image data in Chapter 3.7 for a discussion of which items are mandatory.

After describing the parameters of the experiment, we describe the organization of the image data (Example 2.3.2.3). Note that we have changed from listing a value directly with each tag to a tabular format, using the CIF loop_ token. The *.array_id tags identify data items belonging to the same array. Here we have chosen the name image_1, but another name could have been used, as long as it is used consistently. The *.index tags refer to the dimension being defined, and the *.dimension column defines the number of elements in that dimension. The *.precedence tag defines the precedence of rastering of the data. In this case, the first dimension is the faster changing dimension. The *.direction column tells us the direction in which the data raster runs within a dimension. Here the data raster runs from the minimum element towards the maximum element (‘increasing’) in the first dimension, and from the maximum element towards the minimum element in the second dimension. This is the default rastering order.

Example 2.3.2.3. Organization of image data in a CBF/imgCIF.

    # Define image storage mechanism
    loop_
    _array_structure.id
    _array_structure.encoding_type
    _array_structure.compression_type
    _array_structure.byte_order
    image_1 "unsigned 16-bit integer" none little_endian

    loop_
    _array_intensities.array_id
    _array_intensities.binary_id
    _array_intensities.linearity
    _array_intensities.undefined_value
    _array_intensities.overload
    image_1 1 linear 0 65535

    loop_
    _array_structure_list.array_id
    _array_structure_list.index
    _array_structure_list.dimension
    _array_structure_list.precedence
    _array_structure_list.direction
    image_1 1 768 1 increasing
    image_1 2 512 2 decreasing

    loop_
    _array_element_size.array_id
    _array_element_size.index
    _array_element_size.size
    image_1 1 100.5e-6
    image_1 2 99.5e-6

We have given the abstract ordering of the data. The physical view of data is described in detail in the CBF/imgCIF dictionary (Chapters 3.7 and 4.6). In general, the physical sense of the image is from the sample to the detector.

The storage of the binary data is now fully defined. Further data items could be defined, but we are ready to present the image data (Example 2.3.2.4). This is done with the ARRAY_DATA category. The actual binary data will come just a little further down, as the essential part of the value of _array_data.data, which begins as semicolon-quoted text.

Example 2.3.2.4. Representation of the binary data.

    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.data
    image_1 1
    ;
    --CIF-BINARY-FORMAT-SECTION--
    Content-Type: application/octet-stream;
         conversions="x-CBF_PACKED"
    Content-Transfer-Encoding: BINARY
    X-Binary-Size: 374578
    X-Binary-ID: 1
    X-Binary-Element-Type: "unsigned 16-bit integer"
    Content-MD5: jGmkxkrpnizOetd9T/Np4NufAmA==

    START_OF_BIN
    ************9*************** ...
    [This is where the raw binary data would be – we can’t print them here]
    --CIF-BINARY-FORMAT-SECTION----
    ;

The line immediately after the line with the semicolon is a MIME boundary marker. As with all MIME boundary markers, it begins with ‘--’ (two hyphens). The next few lines are MIME headers, describing some useful information we will need in order to process the binary section. MIME headers can appear in different orders and can be very confusing (look at the raw contents of an e-mail message with attachments), but there are only a few headers that have to be understood to process a CBF.

The ‘Content-Type’ header serves to describe the nature of the following data sufficiently to allow an association with an appropriate agent or mechanism for presenting the data to a user. It may be any of the discrete types permitted in RFC 2045 (Freed & Borenstein, 1996b), but, unless the binary data conform to an existing standard format (e.g. TIFF or JPEG), the description ‘application/octet-stream’ is recommended. If the octet stream has been compressed, the compression should be specified by the parameter conversions="x-CBF_PACKED" or by specifying one of the other compression types allowed as described in Chapter 5.6.

The ‘Content-Transfer-Encoding’ header describes any encoding scheme applied to the data, most commonly to transform it to an ASCII-only representation. For a CBF the value should be ‘BINARY’. We consider the other values used for imgCIF below.

The ‘X-Binary-Size’ header specifies the size of the binary data in octets. Calculation of the size where compression is used is described in Section 2.3.3.3. The ‘X-Binary-Element-Type’ header specifies the type of binary data in the octets, using the same descriptive phrases as in _array_structure.encoding_type (the default value is ‘unsigned 32-bit integer’). The other MIME headers in the example provide an identifier and a content checksum.

The MIME header items are followed by an empty line and then by a special sequence (marked here as START_OF_BIN), consisting of the single characters Ctrl-L, Ctrl-Z, Ctrl-D and a single binary flag character of hexadecimal value D5 (213 decimal). The binary data follow immediately after this flag character. The reasons for choosing this sequence are discussed in Section 2.3.3.3.

After the last octet (i.e. byte) of the binary data, there is a special trailer \r\n--CIF-BINARY-FORMAT-SECTION----\r\n;. This repeats the initial boundary marker with an extra -- at the end (a MIME convention for the last boundary marker), followed by a closing semicolon quote for a text section. This is essential in an imgCIF, and we include it in a CBF for consistency.
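The layout of a CBF binary text field described in this section (boundary marker, MIME headers, empty line, the four-octet START_OF_BIN sequence, the binary octets and the closing boundary) can be sketched in Python as below. This is a hedged illustration, not the CBFlib implementation: the function name `read_binary_section` is hypothetical, and the sketch assumes that ‘X-Binary-Size’ counts only the octets that follow the four-octet sequence and that the data are not decompressed here.

```python
import re

BOUNDARY = b"--CIF-BINARY-FORMAT-SECTION--"
START_OF_BIN = b"\x0c\x1a\x04\xd5"        # Ctrl-L, Ctrl-Z, Ctrl-D, 0xD5
SEPARATOR = re.compile(rb"\r\n|\r|\n")    # a reader accepts all three forms


def read_binary_section(field):
    """Split one CBF binary text field into (headers, octets)."""
    pos = field.index(BOUNDARY) + len(BOUNDARY)
    match = SEPARATOR.match(field, pos)
    if match is None:
        raise ValueError("boundary marker is not followed by a line separator")
    pos = match.end()

    headers = {}
    last = None
    while True:
        match = SEPARATOR.search(field, pos)
        if match is None:
            raise ValueError("MIME headers are not terminated by an empty line")
        line = field[pos:match.start()].decode("ascii")
        pos = match.end()
        if not line.strip():
            break                          # empty line: end of the headers
        if line[0] in " \t" and last is not None:
            headers[last] += " " + line.strip()   # continuation line
        else:
            name, _, value = line.partition(":")
            last = name.strip().lower()
            headers[last] = value.strip()

    if field[pos:pos + 4] != START_OF_BIN:
        raise ValueError("START_OF_BIN sequence not found")
    pos += 4

    size = int(headers["x-binary-size"])   # assumption: excludes the 4 octets
    octets = field[pos:pos + size]
    trailer = re.compile(rb"(?:\r\n|\r|\n)" + re.escape(BOUNDARY) + rb"--")
    if len(octets) != size or trailer.match(field, pos + size) is None:
        raise ValueError("closing boundary marker not found after the data")
    return headers, octets
```

Because the binary octets are taken by length rather than by scanning for a terminator, octets that happen to look like line separators inside the data do no harm.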

2.3.3. Overview of the format

This section describes the major elements of the CBF format.

(1) CBF is a binary file, containing self-describing array data, e.g. one or more images, and auxiliary data, e.g. describing the experiment.

(2) Apart from the handling of line terminators, the way binary data are presented and more liberal rules for ordering information, an ASCII imgCIF file is the same as a CBF binary file.

(3) A CBF consists of pseudo-ASCII text header sections consisting of ‘lines’ of no more than 80 ASCII characters separated by ‘line separators’, which are the pair of ASCII characters carriage return (\r, ASCII 13) and line feed (\n, ASCII 10), followed by zero, one or more binary sections presented as ‘binary strings’. The file returns to the pseudo-ASCII format after each string, allowing additional binary strings to appear later in the file after additional headers.

(4) An imgCIF consists of ASCII lines of no more than 80 characters using the normal line-termination conventions of the current system (e.g. \n in UNIX) with MIME-encoded binary strings at any appropriate point in the file. (For both CBF and imgCIF, the limitation of 80 characters per line will be increased to 2048 as CIF 1.1 is adopted.)

(5) The very start of the file has an identification item (magic number). This item also describes the CBF version or level (see Section 2.3.3.1 below).

(6) In the CBF binary format, the descriptive tags and values may be viewed as a ‘header’ section for the binary data section. The start of a header section is delimited by the usual CIF data_ token (see Section 2.3.3.2 below).

(7) The header information must contain sufficient data names to fully describe the binary data sections.

(8) The binary data are presented as ‘binary strings’, which in CBF are specially formatted binary data within a semicolon-delimited text string. In imgCIF files, the binary data are MIME-encoded within true ASCII text fields.

(9) White space may be used within the pseudo-ASCII sections prior to the ‘start of binary section’ identifier to align the start-binary data sections to word or block boundaries. Similar use may be made of unused bytes within binary sections. However, no blank lines should be introduced among the MIME headers, since that would terminate processing of those headers and start the scan for binary data. In general, no guarantee is made of block or word alignment in a CBF of unknown origin.

(10) The end of the file need not be explicitly indicated, but including a comment of the form ###_END_OF_CBF (including the carriage return, line feed pair) can help in debugging.

(11) For the most efficient processing of a CBF file, all binary data described in a single data block should appear as the last information in that data block. However, since binary strings can be parsed anywhere within the context of a CBF or imgCIF file, it is recommended that processing software for CBF accept such strings in any order, and it is mandatory that processing software for imgCIF accept such strings in any order. The binary identifier values used within a given data block section, and hence the binary data, must be unique for any given *.array_id, and it would be best to make them globally unique. However, a different data block may reuse binary identifier values. (This allows concatenation of files without renumbering the binary identifiers, and provides a certain level of localization of data within the file to avoid programs having to search potentially huge files for missing binary sections.)

(12) The recommended file extension for a CBF is ‘cbf’. This allows users to recognize file types easily and gives programs a chance to ‘know’ the file type without having to prompt the user. However, a program should check for at least the file identifier to ensure that the file type is indeed CBF.

(13) The recommended file extensions for imgCIF are ‘icf’ or ‘cif’.

(14) CBF format files are binary files, so when ftp is used to transfer files between different computer systems ‘binary’ or ‘image’ mode transfer should be selected.

(15) imgCIF files are ASCII files, so when ftp is used to transfer files between different computer systems ‘ascii’ transfer should be selected.
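Items (3) and (4) above constrain the pseudo-ASCII sections to ASCII characters and to lines of limited length. A small Python sketch of such a check is given below; the function name `check_header_lines` and the form of the returned report are assumptions made for illustration. The line limit is a parameter so that either 80 or 2048 characters can be applied.

```python
import re

# A reader accepts \r\n, \r or \n as the end of a pseudo-ASCII line.
LINE_SEPARATOR = re.compile(rb"\r\n|\r|\n")


def check_header_lines(header, max_len=80):
    """Return (line_number, problem) pairs for one header section."""
    problems = []
    for number, line in enumerate(LINE_SEPARATOR.split(header), start=1):
        if len(line) > max_len:
            problems.append((number, "longer than %d characters" % max_len))
        if any(octet > 127 for octet in line):
            problems.append((number, "non-ASCII octet"))
    return problems
```

The check is meant to be applied to header sections only; binary sections must be skipped first, since their octets are unconstrained.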

2.3.3.1. Details of the magic number

The magic number identifier is ###CBF: VERSION. This must always be present so that a program can easily identify a CBF by simply inputting the first 15 characters. [The space is a blank (ASCII 32) and not a tab. All identifier characters must be upper case.] The first hash means that this line within the CIF is a comment line. Three hashes mean that this is a line describing the binary file layout for CBF. (All CBF internal identifiers start with three hashes, and all others must immediately follow a ‘line separator’.) No white space may precede the first hash sign.

Following the file identifier is the version number of the file; e.g. the full line might appear as ###CBF: VERSION 1.0. The version number must be separated from the file identifier characters by white space, e.g. a blank (ASCII 32). The version number is defined as a major version number and minor version number separated by a decimal point. A change in the major version may mean that a program for the previous version cannot input the new version as some major change has occurred to CBF. A change in the minor version may also mean incompatibility if the CBF has been written using some new feature. For example, a new form of linearity scaling may be specified; this would be considered a minor version change. A file with the new feature would not be readable by a program supporting only an older version of the format.

2.3.3.2. Details of the header section

The start of a header section is delimited by the usual CIF data_ token. Optionally, a header identifier, ###_START_OF_HEADER, may be used before the data_ token, followed by the carriage return, line feed pair, as an aid in debugging, but it is not required. (Naturally, another carriage return, line feed pair should immediately precede this and all other CBF identifiers, with the exception of the CBF file identifier at the very start of the file.)

A header section, including the identification items which delimit it, uses only ASCII characters and is divided into ‘lines’. The ‘line separator’ symbols \r\n (carriage return, line feed) are the same regardless of the operating system on which the file is written. This is an important difference from CIF, but must be so, as the file contains binary data and so cannot be translated from one operating system to another in the way that ASCII text files can. While a properly functioning CBF API should write the full \r\n line separator, it should recognize any of three sequences \r, \n, \r\n as valid line separators so that hand-edited headers will not be rejected.

The header section in a CBF obeys all CIF rules (Chapter 2.2) with the exception of the line separators, i.e.:

(i) ‘Lines’ are a maximum of 80 characters long. (This will be increased to 2048 in line with the CIF version 1.1 specification.)

(ii) All data names (tags) start with an underscore character ‘_’.

(iii) The hash symbol ‘#’ (outside a quoted character string) means that all text up to the line separator is a comment.

(iv) White space outside character strings is not significant.

(v) Data names are case-insensitive.

(vi) The data item follows the data-name separator and may be of one of two types: character string (char) or number (numb). The type is specified for each data name.

(vii) Character strings may be delimited with single or double quotes, or blocks of text may be delimited by semicolons occurring as the first character on a line.

(viii) The loop_ mechanism allows a data name to have multiple values. Immediately following the loop_, one or more data names are listed without their values, as column headings. Then one or more rows of values are given.

(ix) Any CIF data name may occur within the header section.

The tokens data_ and loop_ have special meaning in CIF and should not be used except in their indicated places. The tokens save_, stop_ and global_ also have special meaning in CIF’s parent language, STAR, and should also not be used. A single header section may contain one or more data_ blocks.
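The magic-number rules of Section 2.3.3.1 can be illustrated with a short Python sketch that checks the first 15 characters and extracts the major and minor version numbers. The function name `parse_cbf_version` is hypothetical, and the decision to raise `ValueError` for anything that does not match is an assumption of this example rather than a requirement of the format.

```python
import re

MAGIC = b"###CBF: VERSION"   # exactly 15 characters, upper case, one blank


def parse_cbf_version(data):
    """Return (major, minor) from the first line of a CBF (sketch only).

    `data` is the beginning of the file as bytes; no white space may
    precede the first hash sign.
    """
    if not data.startswith(MAGIC):
        raise ValueError("not a CBF: file identifier missing")
    # Accept \r\n, \r or \n as the end of the first line.
    first_line = re.split(rb"\r\n|\r|\n", data, maxsplit=1)[0]
    rest = first_line[len(MAGIC):].decode("ascii")
    # White space must separate the identifier from major.minor.
    match = re.fullmatch(r"[ \t]+(\d+)\.(\d+)[ \t]*", rest)
    if match is None:
        raise ValueError("malformed CBF version number")
    return int(match.group(1)), int(match.group(2))
```

A reading program might compare the returned pair with the versions it supports and refuse, or warn about, files written with a newer major or minor version.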

2.3. SPECIFICATION OF THE CRYSTALLOGRAPHIC BINARY FILE (CBF/imgCIF) In general, if the value given for ‘Content-Transfer-Encoding’ is one of the real encodings BASE64, QUOTED-PRINTABLE, X-BASE8, X-BASE10 or X-BASE16, this file is an imgCIF. More details on the imgCIF encodings are given in Section 2.3.5 below. For either a CBF or an imgCIF, the optional ‘Content-MD5’ header provides a sophisticated check [the ‘RSA Data Security, Inc. MD5 Message-Digest Algorithm’ (Rivest, 1992)] on the integrity of the binary data. In a CBF, the raw binary data begin after an empty line terminating the MIME headers and after the START_OF_BIN identifier. START_OF_BIN contains bytes to separate the ASCII lines from the binary data, bytes to try to stop the listing of the header, bytes which define the binary identifier, which should match the *.binary_id defined in the header, and bytes which define the length of the binary section.

2.3.3.3. Details of binary sections All binary sections are contained within semicolon-delimited text fields, which may be thought of as ‘binary strings’. Each such binary string must be given as a value of an appropriate CIF tag (usually _array_data.data), either directly or within a loop. Before getting to the actual binary data, there are some preliminaries to allow a smooth transition from the conventions of CIF to those of raw streams of ‘octets’ (8-bit bytes). Within the binary-data text string, the conventions developed for transmitting e-mail messages including binary attachments are followed. There is secondary ASCII header information, formatted as MIME headers [see RFCs 2045–2049 by Freed & Borenstein (1996a,b,c), Freed et al. (1996) and Moore (1996)]. The boundary marker for the beginning of all this is the special string --CIF-BINARY-FORMAT-SECTION-- at the beginning of a line. The initial ‘--’ says that this is a MIME boundary. We cannot put ‘###’ in front of it and conform to MIME conventions. Immediately after the boundary marker are MIME headers, describing some useful information we will need to process the binary section. Only a few headers with a narrow range of values have to be understood to process a CBF (as opposed to an imgCIF, for which the headers can be more varied). The ‘Content-Type’ header may be any of the discrete types permitted in RFC 2045 (Freed & Borenstein, 1996b); ‘application/octet-stream’ is recommended. If an octet stream was compressed, the compression should be specified by the parameter conversions="x-CBF_PACKED" or by the parameter conversions="x-CBF_CANONICAL". Other compression schemes may be added at a future date. Details of the compression schemes are given in Chapter 5.6. The ‘Content-Transfer-Encoding’ header should be BINARY for a CBF. For an ASCII imgCIF file, the ‘Content-Transfer-Encoding’ header may be BASE64, QUOTED-PRINTABLE, X-BASE8, X-BASE10 or X-BASE16. 
The BASE64 encoding is a reasonably efficient encoding of octets (approximately eight bits for every six bits of data) and is recommended for imgCIF files. The octal, decimal and hexadecimal transfer encodings are for convenience in debugging and are not recommended for archiving and data interchange. The ‘X-Binary-Size’ header specifies the size of the binary data in octets. If compression was used, this size is the size after compression, including any book-keeping fields. In general, no portion of the binary header is included in the calculation of the size. The ‘X-Binary-Element-Type’ header specifies the type of binary data in the octets, using the same descriptive phrases as in _array_structure.encoding_type. The full list of valid types is:

unsigned 8-bit integer
signed 8-bit integer
unsigned 16-bit integer
signed 16-bit integer
unsigned 32-bit integer
signed 32-bit integer
signed 32-bit real IEEE
signed 64-bit real IEEE
signed 32-bit complex IEEE

The default value is unsigned 32-bit integer. See IEEE (1985) for details on the IEEE formats.

The MIME header items are followed by an empty line and then by a special sequence marked here as START_OF_BIN, consisting of Ctrl-L, Ctrl-Z, Ctrl-D and a single binary flag character of hexadecimal value D5 (213 decimal). The binary data follow immediately after this flag character. The control (Ctrl-...) characters are inserted as a convenience to inhibit accidental listing or printing of large amounts of binary data on character-oriented output devices such as terminals or printers.

Octet            Hexadecimal   Decimal       Purpose
1                0C            12 (Ctrl-L)   End the current page
2                1A            26 (Ctrl-Z)   Stop listings in MS-DOS
3                04            04 (Ctrl-D)   Stop listings in Unix
4                D5            213           Binary section begins
5...5 + n − 1                                Binary data (n octets)

Only bytes 5...5 + n − 1 are encoded for an imgCIF file using the indicated Content-Transfer-Encoding. The binary characters serve specific purposes:
(i) The Ctrl-L will terminate the current page in listings in most operating systems.
(ii) The Ctrl-Z will stop the listing of the file in MS-DOS-type operating systems.
(iii) The Ctrl-D will stop the listing of the file in Unix-type operating systems.
(iv) The unsigned byte value 213 (decimal) is binary 11010101 (octal 325, hexadecimal D5). This has the eighth bit set, so it can be used to detect the error of seven-bit rather than eight-bit transmission. It is also asymmetric, providing one last check for reversal of bit order within bytes.

The carriage return, line feed pair before the START_OF_BIN and other lines can also be used to check that the file has not been corrupted (e.g. by being sent by ftp in ASCII mode). The ‘line separator’ immediately precedes the ‘start of binary identifier’, but blank spaces may be added prior to the preceding ‘line separator’ if desired (e.g. to force word or block alignment).

The binary data do not have to completely fill the bytes defined by the byte-length value, but clearly cannot be greater than this value (except when the value zero has been stored, which means that the size is unknown, and no other headers follow). The values of any unused bytes are undefined. The end of binary section identifier is placed exactly at the byte following the full binary section as defined by the length value. The end of binary section identifier consists of the carriage return/line feed pair followed by

--CIF-BINARY-FORMAT-SECTION----
;

with each of these lines followed by the carriage return/line feed pair. This brings us back into a normal CIF environment. The first ‘line separator’ separates the binary data from the pseudo-ASCII line.

The end-of-binary-section identifier is in a sense redundant, since the binary-data-length value tells a program how many bytes to jump over to the end of the binary data. However, this redundancy has been deliberately added for error checking, and for possible file recovery in the case of a corrupted file. This identifier must be present at the end of every block of binary data.
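The framing rules above can be illustrated with a short Python sketch. It is not taken from any CBF library: the function name `read_binary_section` is hypothetical, the end identifier is assumed to be the boundary string followed by two further hyphens and then a line holding a single semicolon, and an `X-Binary-Size` header is assumed to be present and non-zero. The sketch locates the four START_OF_BIN octets, skips the stated number of octets, and uses the end-of-binary-section identifier as the redundancy check described in the text.

```python
# Minimal sketch, assuming CR/LF line ends and a non-zero X-Binary-Size header.
START_OF_BIN = bytes([0x0C, 0x1A, 0x04, 0xD5])   # Ctrl-L, Ctrl-Z, Ctrl-D, flag 213
BOUNDARY = b"--CIF-BINARY-FORMAT-SECTION--"
# Assumed byte sequence of the end identifier (see the lead-in paragraph).
END_MARKER = b"\r\n--CIF-BINARY-FORMAT-SECTION----\r\n;\r\n"


def read_binary_section(buf: bytes, start: int = 0):
    """Return (headers, data, next_offset) for the first section at or after start."""
    begin = buf.index(BOUNDARY, start)
    flag = buf.index(START_OF_BIN, begin)

    # Everything between the boundary marker and START_OF_BIN is ASCII header text.
    header_text = buf[begin + len(BOUNDARY):flag].decode("ascii")
    headers, last = {}, None
    for line in header_text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            headers[last] += " " + line.strip()      # folded continuation line
        elif ":" in line:
            name, value = line.split(":", 1)
            last = name.strip().lower()
            headers[last] = value.strip()

    size = int(headers["x-binary-size"])             # size after any compression
    data_start = flag + len(START_OF_BIN)
    data_end = data_start + size

    # Redundancy check: the end identifier must sit exactly after the data.
    if buf[data_end:data_end + len(END_MARKER)] != END_MARKER:
        raise ValueError("end-of-binary-section identifier not found where expected")
    return headers, buf[data_start:data_end], data_end + len(END_MARKER)
```

A reader that meets a size of zero, or no size header at all, would instead have to search forward for the end identifier; that case is left out of the sketch.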

2.3.4. A complex example

In Example 2.3.4.1 only the ARRAY_DATA category CIF identifiers are shown. The other CIF data items are not shown. This shows a possible structuring of a more complicated example. There are two header sections; the first contains two data blocks and defines three binary sections. CIF comment lines, starting with a hash mark ‘#’, are used to clarify the structure.

Example 2.3.4.1. A multiple-block CBF with several images.

###CBF: VERSION 1.0
# CBF file written by cbflib v0.6

# A comment cannot appear before the file identifier,
# but can appear anywhere else, except within the
# binary sections.

# Here the first data block starts
data_xxx

### ... various CIF tags and values here
### but none that define array data items

# The "data_" identifier finishes the first data
# block and starts the second
data_yyy

### ... various CIF tags and values here including
### ones that define array data items

loop_
_array_data.array_id
_array_data.binary_id
_array_data.data

image_1 1
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BINARY
X-Binary-Size: 3745758
X-Binary-ID: 1
X-Binary-Element-Type: "signed 32-bit integer"
Content-MD5: 1zsJjWPfol2GYl2V+QSXrw==

START_OF_BIN
^PP^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ...
[This is where the raw binary data would be – we can’t print them here]

--CIF-BINARY-FORMAT-SECTION----
;

# Following the "end of binary" identifier the file is
# pseudo-ASCII again, so comments are valid up to the
# next "start of binary" identifier. Note that we have
# increased the binary ID by one.

image_1 2
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BINARY
X-Binary-Size: 3745758
X-Binary-ID: 2
X-Binary-Element-Type: "signed 32-bit integer"
Content-MD5: xR5kxiOetd9T/Nr5vMfAmA==

START_OF_BIN
^PP^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ...
[This is where the raw binary data would be – we can’t print them here]

--CIF-BINARY-FORMAT-SECTION----
;

# Third binary section; note that we have a new
# array ID.

image_2 3
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BINARY
X-Binary-ID: 3
Content-MD5: yS5kxiOetd9T/NrqTLfAmA==

START_OF_BIN
*************’9*****‘********* ...
[This is where the raw binary data would be – we can’t print them here]

--CIF-BINARY-FORMAT-SECTION----
;

# Second header section
data_zzz

### ... various CIF tags and values here
### including ones that define array
### data items

# Since we only have one block left, we won’t
# use a loop
_array_data.array_id   image
_array_data.binary_id  1
_array_data.data

# Note that we can put a comment here
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BINARY
X-Binary-ID: 1
Content-MD5: fooxiOetd9T/serNufAmA==

START_OF_BIN
*************’9*****‘********* ...
[This is where the raw binary data would be – we can’t print them here]

--CIF-BINARY-FORMAT-SECTION----
;
###_END_OF_CBF

2.3.5. imgCIF encodings

For an imgCIF, there are several alternative encodings for binary image data as ASCII text. Each binary image may use a different encoding in the same imgCIF data set or even in the same data block. The choice of encoding is specified in the ‘Content-Transfer-Encoding’ MIME header.

If the Transfer Encoding is X-BASE8, X-BASE10 or X-BASE16, the data are presented as octal, decimal or hexadecimal data, respectively, organized into lines or words. Each word is created by composing octets of data in fixed groups of 2, 3, 4, 6 or 8 octets, either in the order ...4321 (‘big-endian’) or 1234... (‘little-endian’). If there are fewer than the specified number of octets to fill the last word, then the missing octets are presented as ‘==’ for each missing octet. Exactly two equal signs are used for each missing octet even for octal and decimal encoding. The format of lines is

rnd xxxxxx xxxxxx xxxxxx

where r is H, O or D for hexadecimal, octal or decimal, n is the number of octets per word, and d is ‘<’ or ‘>’ for the ‘...4321’ and ‘1234...’ octet orderings, respectively. The ‘==’ padding for the last word should be on the appropriate side to correspond to the missing octets, e.g.

H4< FFFFFFFF FFFFFFFF 07FFFFFF ====0000

or

H3> FF0700 00====

For these hexadecimal, octal and decimal formats only, comments beginning with ‘#’ are permitted to improve readability.

BASE64 encoding follows MIME conventions. Octets are in groups of three, c1, c2, c3. The resulting 24 bits are reorganized in the following way (where we use the C operators ≫, ≪, & and | to denote, respectively, a right shift, a left shift, bit-wise intersection and bit-wise union). Four six-bit quantities are specified, starting with the high-order six bits (c1 ≫ 2) of the first octet, then the low-order two bits of the first octet followed by the high-order four bits of the second octet ((c1 & 3) ≪ 4 | (c2 ≫ 4)), then the bottom four bits of the second octet followed by the high-order two bits of the last octet ((c2 & 15) ≪ 2 | (c3 ≫ 6)), then the bottom six bits of the last octet (c3 & 63). Each of these four quantities is translated into an ASCII character using the mapping

          1         2         3         4
01234567890123456789012345678901234567890123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx

5         6
01234567890123
yz0123456789+/

Short groups of octets are padded on the right with one ‘=’ if c3 is missing, and with ‘==’ if both c2 and c3 are missing.

QUOTED-PRINTABLE encoding also follows MIME conventions, copying octets without translation if their ASCII values are 32...38, 42, 48...57, 59...60, 62, 64...126 and the octet is not a ‘;’ in column 1. All other characters are translated to ‘=nn’, where nn is the hexadecimal encoding of the octet. All lines are ‘wrapped’ with a terminating ‘=’ (i.e. the MIME conventions for an implicit line terminator are never used).

Appendix 2.3.1 Deprecated CBF conventions

There was an earlier, now deprecated, CBF format in which the compression type was given as eight bytes of binary header. In this case, the eight bytes used for the compression type are subtracted from the size, so that the same size will be reported if the compression type is supplied in the MIME header. Use of the MIME header is the recommended way to supply the compression type.

These earlier versions of the specification also included three eight-byte words of information in binary that replicated information now available in the MIME header:

5...12     Binary section identifier (see _array_data.binary_id), 64-bit, little-endian
13...20    The size (n) of the binary section in octets (i.e. the offset from octet 29 to the first byte following the data)
21...28    Compression type:
             CBF_NONE        0x0040 (64)
             CBF_CANONICAL   0x0050 (80)
             CBF_PACKED      0x0060 (96)
             ...

The three eight-byte words were followed by binary data. These words are not included when a MIME header is provided.

We are grateful to Frances C. Bernstein, Sydney R. Hall, Brian McMahon and Nick Spadaccini for their helpful comments and suggestions.

References

Freed, N. & Borenstein, N. (1996a). Multipurpose Internet Mail Extensions (MIME) part five: Conformance criteria and examples. RFC 2049. Network Working Group. http://www.ietf.org/rfc/rfc2049.txt.
Freed, N. & Borenstein, N. (1996b). Multipurpose Internet Mail Extensions (MIME) part one: Format of Internet message bodies. RFC 2045. Network Working Group. http://www.ietf.org/rfc/rfc2045.txt.
Freed, N. & Borenstein, N. (1996c). Multipurpose Internet Mail Extensions (MIME) part two: Media types. RFC 2046. Network Working Group. http://www.ietf.org/rfc/rfc2046.txt.
Freed, N., Klensin, J. & Postel, J. (1996). Multipurpose Internet Mail Extensions (MIME) part four: Registration procedures. RFC 2048. Network Working Group. http://www.ietf.org/rfc/rfc2048.txt.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685.
IEEE (1985). IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754–1985. New York: The Institute of Electrical and Electronics Engineers, Inc.
Moore, K. (1996). MIME (Multipurpose Internet Mail Extensions) part three: Message header extensions for non-ASCII text. RFC 2047. Network Working Group. http://www.ietf.org/rfc/rfc2047.txt.
Rivest, R. (1992). The MD5 message-digest algorithm. RFC 1321. Network Working Group. http://www.ietf.org/rfc/rfc1321.txt.
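As an illustration of the BASE64 regrouping described in Section 2.3.5, the following Python sketch applies the four shift-and-mask expressions to one group of octets and pads short groups with ‘=’. The function names `encode_group` and `encode` are hypothetical, and the sketch emits a single unbroken string rather than the wrapped lines an imgCIF writer would produce; Python's standard `base64` module implements the same MIME mapping and is a convenient cross-check.

```python
# Minimal sketch of the BASE64 bit regrouping; not an imgCIF writer.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_group(octets: bytes) -> str:
    """Encode one group of one to three octets as four BASE64 characters."""
    if not 1 <= len(octets) <= 3:
        raise ValueError("a group holds one to three octets")
    c1, c2, c3 = (list(octets) + [0, 0])[:3]     # missing octets count as zero
    sextets = [
        c1 >> 2,                     # high-order six bits of c1
        (c1 & 3) << 4 | (c2 >> 4),   # low two bits of c1, high four bits of c2
        (c2 & 15) << 2 | (c3 >> 6),  # low four bits of c2, high two bits of c3
        c3 & 63,                     # low six bits of c3
    ]
    chars = [ALPHABET[s] for s in sextets]
    # One '=' if c3 is missing, '==' if both c2 and c3 are missing.
    for k in range(len(octets) + 1, 4):
        chars[k] = "="
    return "".join(chars)


def encode(data: bytes) -> str:
    """Encode a byte string group by group (no line wrapping)."""
    return "".join(encode_group(data[i:i + 3]) for i in range(0, len(data), 3))
```

Each group of three octets becomes four characters, which is the ratio of eight transmitted bits per six bits of data quoted for this encoding.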

International Tables for Crystallography (2006). Vol. G, Chapter 2.4, pp. 44–52.

2.4. Specification of the Molecular Information File (MIF)

By F. H. Allen, J. M. Barnard, A. P. F. Cook and S. R. Hall

Affiliations: Frank H. Allen, Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, England; John M. Barnard and Anthony P. F. Cook, BCI Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, England; Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia.

Copyright © 2006 International Union of Crystallography

2.4.1. Introduction

This volume is primarily concerned with methods for the exchange of crystallographic information, such as experimental conditions and measurements, computational procedures and results, and the geometrical description of three-dimensional (3D) chemical structures. Such information, now available for some 400 000 compounds (Allen & Glusker, 2002) is, of course, vitally important in chemistry and in many other branches of science. However, it must be appreciated that two-dimensional (2D) chemical structural diagrams are available for over eight million compounds, and are fundamental components of the language of chemistry at all levels. Two-dimensional graphical representations indicate atomic connectivities, formal bond types and residual atomic charges, and provide the universal formalism through which chemists communicate with each other on a daily basis and document their results.

In common with scientists in other disciplines, chemists were early users of computer technology. They solved their information needs through the creation of major databases of chemical compounds and the development of methods for searching these databases for complete structures or substructural fragments. From these data, software can compute the 3D structures and properties of molecules, and the resulting molecular images can be displayed and manipulated. Consequently, 2D chemical diagrams are at the heart of many computerized documentation systems, and are the basis for computational chemistry applications that form part of the routine armoury of the modern chemist.

Computationally, the 2D diagram is treated as a mathematical graph (Harary, 1972). The nodes of the graph represent atoms and the edges of the graph represent bonds. Each of these primary components can have additional attributes: element type, valency, charge etc. in the case of the atomic nodes, and bond type, cyclicity indicators etc. in the case of the bonded edges. Within this formalism, 2D display coordinates and 3D crystallographic or computed coordinates are additional atom attributes, while interatomic distances can further qualify the bonded edges. Using the concepts of graph theory, it is then possible to write algorithms for the analysis of chemical graphs, e.g. for the detection of chemical rings and ring systems (e.g. Wippke & Dyott, 1975), for the analysis of functional groups and their relationships etc. Most importantly, procedures have also been developed for the matching of complete chemical graphs (full graph isomorphism), and for the location of chemical substructures within a complete chemical graph (sub-graph isomorphism) (e.g. Feldmann et al., 1977). In this way it is possible to achieve graphical substructure searches of very large collections of 2D chemical diagrams.

As with early crystallographic data exchange, structural chemistry applications use their own specialized formats for input, manipulation and output. The ready exchange of chemical data is often inhibited by specific data formats and by the enormous variation in methods used to represent 2D structures, stereochemical descriptors and certain 3D structural attributes. These are computational ‘bottlenecks’ that detract from an effective use of the large financial and intellectual investment in proprietary software and database systems. They have also contributed to the major need for in-house format conversion software, which must be continually upgraded and maintained to accommodate developmental changes within imported systems.

The need for a universal interchange format for chemical information became apparent in the late 1980s, at almost exactly the same time as crystallographers recognized a similar need. Data standards in structural chemistry involve many international organizations and individuals, and consequently a number of proposals for exchanging data were initially put forward. From these the Standard Molecular Data (SMD) format (Bebak et al., 1989; Barnard, 1990) emerged as the leading contender. In the early 1990s, discussions between the SMD and CIF developers led to a re-expression of the SMD data items within the Self-defining Text Archive and Retrieval (STAR) File syntax (Hall, 1991; Hall & Spadaccini, 1994).

This chapter describes the initial core data definitions of a universal exchange format for chemistry, the Molecular Information File (MIF: Allen et al., 1995), that arose from this coalescence of concepts and ideas. MIF is a complementary approach to CIF (Hall et al., 1991). Because SMD was fundamental to the development of MIF, we begin this chapter with a brief history of this project.

2.4.2. Historical background

The Standard Molecular Data (SMD) format was initially developed by a group of European pharmaceutical companies in the mid-1980s. Draft documents were made available from 1987 and the specification was published (Bebak et al., 1989). A meeting in Frankfurt in 1988 established a series of technical working groups under the auspices of the Chemical Structure Association (CSA) to examine the format specifications in detail and to make recommendations for any revision. As a result, a draft form of a revised format, described as SMD Version 5.0, was published in February 1990 (Barnard, 1990). A document describing the core format, i.e. those data items regarded as essential in any exchange file, was prepared by one of us (JMB) for consideration by Subcommittee E49.51 of the American Society for Testing and Materials (ASTM). In December 1993, the ASTM subcommittee E49.51 approved a standard specification for the content (i.e. recommended data items) of computerized chemical structural files (ASTM, 1994), although the subcommittee did not publish any proposals for a format specification.

Later, the Chemical Abstracts Service (CAS) circulated a draft proposal for a connection-table-based exchange format for chemical substances and queries. It used some ideas that are similar to the 1990 SMD proposal and is expressed within the framework of the Abstract Syntax Notation 1 (ISO, 2002a,b). MDL Information Systems Inc. has also published a description of their proprietary formats (Dalby et al., 1992) and a number of other software systems now provide interfaces to these formats.

During this period, the IUCr Working Party on Crystallographic Information had commissioned one of us (SRH) to coordinate the development of a universal file to replace the existing fixed-format Standard Crystallographic File Structure (SCFS: Brown, 1988). As documented in Chapter 1.1, the CIF approach was adopted as the international standard in 1990 and published by Hall et al. (1991). Although the small-molecule CIF is able to store a representation of 2D chemical topology, its data definitions do not meet all the needs of the chemical community.

In 1991, the IUCr became interested in further extending CIF into the chemical arena and discussions took place between representatives of the CIF project and of the SMD Technical Working Group. These meetings decided that an integration of the SMD format and the STAR syntax was desirable because it provided a number of advantages over the existing SMD specifications (Barnard & Cook, 1992). In particular, SMD/STAR provides for a clearer separation of the data structure and the data content roles, together with more flexible data extensibility in future versions. In addition, automated data validation of STAR/SMD files is possible using electronic data dictionaries. In a wider context, there were obvious opportunities for integrating with other applications of the STAR File.

2.4.3. MIF objectives

Molecular information embraces a broad spectrum of data related to chemical and molecular structure. It includes both individual and linked data items, inter alia spectroscopic measurements, thermochemical data, electrochemical properties, crystal structure information and so on. These items represent the data descriptors of molecular chemistry and it is intended that all of these will eventually be accommodated in the MIF approach. However, the initial MIF implementation (Allen et al., 1995), summarized in this chapter, treated only the important core information: the data items needed to specify the connectivity and stereochemistry of molecules and their 2D and 3D spatial representations. The MIF data items needed for more extensive applications must, in the future, involve the collaborative efforts of informatics and database experts from chemical industry and academia.

A dictionary of the initial MIF core data items described in this paper is given in Chapter 4.8. This is the abbreviated text version of the definition attributes contained in the electronic dictionary file. The core MIF data items provide descriptors for representing the 2D connectivity of a molecule or substructure, the conventions for relative or absolute stereochemical relationships, and the coordinates and conventions used for the generation of 2D and 3D graphical depictions. These data items apply to complete molecules, or to substructures with incomplete or variable attributes. As a consequence they are well suited for query definitions in substructure search systems, a feature that will be discussed later in this chapter.

2.4.4. MIF concepts and syntax

The syntax of the Molecular Information File is based on that of the STAR File (Hall, 1991; Hall & Spadaccini, 1994). A MIF is an ASCII text file that can be read or amended using a standard text editor, and that can be processed computationally without conversion to another format. The organization and expression of MIF data is summarized in Table 2.4.4.1. Each file consists of a series of data blocks and each block consists of a series of individual data items. There may be any number of items within a block and any number of blocks within a file. A data block represents a logical grouping of data items and, in most MIF applications, a data block will usually specify a complete chemical entity, i.e. a fully defined molecule or a query substructure.

Table 2.4.4.1. Brief overview of the MIF syntax

A text string is a string of characters bounded by white space, single or double quotes, or semicolons in column 1.
A data name is a text string bounded by white space starting with an underline.
A data value is a text string not starting with underline, preceded by an identifying data name.
A list is a sequence of data names, preceded by ‘loop_’ and followed by a list of data values.
A save frame is a collection of data within a data block, preceded by ‘save_framecode’ and closed with ‘save_’.
A data block is a collection of data, preceded by ‘data_blockcode’.
A global block is a collection of data, preceded by global_, that is common to all subsequent data blocks.
A file may contain any number of data blocks or global blocks.
A data name must be unique within a data block.

The MIF syntax, unlike that of a CIF, places no restrictions on line lengths or nested loop levels. For a detailed understanding of the differences between a MIF and a CIF, the reader should compare this chapter with Chapter 2.2 or refer to the published details of the STAR syntax (Hall & Spadaccini, 1994), the specification of the CIF core data items (Hall et al., 1991) and the Dictionary Definition Language (Hall & Cook, 1995) used to define data items in the electronic version of a STAR dictionary. CIF data, described by over a thousand items in the current dictionaries (see Part 4), encompass the fields of crystallographic structure and diffraction techniques, and these data items could readily be incorporated into a MIF. It should be noted, however, that currently the reverse is not possible because the current CIF syntax does not support nested loops or save frames.

2.4.4.1. Data identification

The fundamental principle that underpins MIF is exactly as for CIF: every data item is represented by a unique data tag followed by its associated data value. These combinations are referred to as tag–value pairs or tuples. Data names must start with an underscore (i.e. underline) character and data values may be any type of string, ranging from a single character to many lines of text. Here are some simple examples of MIF data items:

_atom_mass_number   79
_atom_type          Se
_display_colour     blue_medium

The complete list of MIF core data items is given in Chapter 4.8.

2.4.4.2. Looped lists

Repetitive data are stored in a MIF as lists of values, as they are in a CIF. Each list is prefaced by a loop_ statement and a sequence of data names that identify the data values that follow in ‘packets’ of equal length. The values in each packet match the order and number of the data names. Any number of packets may appear in a looped list.

Atom and bond properties are typical of the information to appear in a looped list. The atoms and bonds of thiabutyrolactone in MIF format are shown in Fig. 2.4.4.1. The description of each data item in this example is given in Chapter 4.8, although the meanings are clear from the self-descriptive data names. The number of data values in each list is an exact multiple of the number of data names at the start of each loop structure. Looped lists are terminated by the next list or by any other data name, data block or end of file. Comments may be included in a MIF and are preceded by a # character, as illustrated in Fig. 2.4.4.1.

Hierarchical data may require the use of nested loop structures (see the _display_* loop in Fig. 2.4.4.2). Note that the packet for _display_id of 7 has two sets of _display_conn_ values giving

connections to atom sites 1 and 4 (the other connections to site 7 appear in the next two packets). Data items that appear in looped lists are identified in the MIF dictionary (see Chapter 4.8) as having the attribute _list set to either ‘yes’ or ‘both’. Other relationships between looped data items are also specified in the dictionary.

# atom attributes list
loop_
_atom_id
_atom_type
_atom_attach_h
 1  C  0
 2  S  0
 3  C  2
 4  C  2
 5  C  2
 6  O  0

# bond attributes list
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
 1  2  S
 2  3  S
 3  4  S
 4  5  S
 5  1  S
 1  6  D

Fig. 2.4.4.1. MIF coding of atom and bond properties for thiabutyrolactone.

2.4.4.3. Save frames

Save frames are employed in a MIF to encapsulate grouped data for efficient cross-referencing. If a set of data needs to appear repeatedly in a data application, it is efficient to place this data into an addressable save frame. Molecular fragments, such as amino-acid units, are a case in point. A save frame is bounded by the statement save_framecode and terminated by a save_ statement. It can be referenced within the parent data block using the value $framecode where the framecode matches the string in the save_framecode. Note that all data names must be unique within a save frame, but the same data names may appear in other save frames or in the parent data block. Save frames may not contain other save frames but save-frame references ($framecode) may appear in other save frames.

Save frames can be used in a MIF for many purposes. A simple application, the storage of alternative 3D conformational representations describing cyclohexane, is illustrated in Fig. 2.4.4.3. Within the STAR syntax, save-frame references ($framecode) may occur before or after the save-frame definition within any data block. MIF preserves this basic STAR syntax. Save frames are particularly useful for defining commonly referenced structural templates and examples of this facility are discussed and illustrated (Figs. 2.4.7.1 and 2.4.7.2) in Section 2.4.7.

2.4.4.4. Data blocks

A data block is a sequence of unique data items or save frames. It is opened with a data_blockcode statement and closed by another data-block statement or a global_ statement (see below). The blockcode string identifies the block within the file. Examples of data blocks are shown in Figs. 2.4.4.2, 2.4.4.3 and 2.4.6.1. Each data block in a file must have a unique blockcode.

data_bromocamphor

loop_
_atom_id
_atom_label
_atom_type
_atom_attach_h
_atom_coord_x
_atom_coord_y
_atom_coord_z
  1  C1   C   0  4.69027  2.57756   2.10705
  2  C2   C   0  3.61112  2.44777   3.16754
  3  C3   C   1  4.16317  2.26258   0.69475
  4  C4   C   1  5.39943  1.87018  -0.13167
  5  C5   C   2  6.48959  2.15784   0.79703
  6  C6   C   2  6.57842  3.69178   1.11990
  7  C7   C   0  5.27609  3.93542   1.94650
  8  C8   C   3  5.93386  1.67891   2.14924
  9  C9   C   3  5.57194  0.17837   2.14326
 10  C10  C   3  6.85298  2.00528   3.35651
 11  O2   O   0  3.03484  2.36201   0.29318
 12  BR3  Br  0  5.41337  2.68686  -1.87368

loop_
_bond_id_1
_bond_id_2
_bond_type_mif
 1  2  S     2  3  S     3  4  S     4  5  S
 5  6  S     6  1  S     1  7  S     7  8  S
 7  9  S     1 10  S     2 11  D     3 12  S
 4  7  S

_display_define_scale      50
_display_define_span_x    500
_display_define_span_y    500

loop_
_display_id
_display_object
_display_symbol
_display_colour
_display_size
_display_coord_x
_display_coord_y
  loop_
  _display_conn_id
  _display_conn_symbol
  _display_conn_colour
  1  .     .   .       .   251  195    2  1b  black  stop_
  2  .     .   .       .   334  244    3  1b  black  stop_
  3  .     .   .       .   334  339    4  1b  black  stop_
  4  .     .   .       .   251  387    5  1b  black  stop_
  5  .     .   .       .   168  339    6  1b  black  stop_
  6  .     .   .       .   168  244    1  1b  black  stop_
  7  .     .   .       .   217  292    1  1b  black
                                       4  1b  black  stop_
  8  .     .   .       .   191  426    7  1b  black  stop_
  9  .     .   .       .   100  292    7  1b  black  stop_
 10  .     .   .       .   251  100    1  1b  black  stop_
 11  text  O   blue    10  401  206    2  2b  black  stop_
 12  text  Br  yellow  10  417  387    1  1b  black  stop_

Fig. 2.4.4.2. MIF coding of atom properties (including 3D coordinates), bond properties and display information for (+)-3-bromocamphor.
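The packet rule for looped lists in Section 2.4.4.2 can be sketched in Python as follows. This is an illustrative reader, not a STAR or MIF parser: the name `read_loops` is hypothetical, and the sketch assumes un-nested loops whose values are bare white-space-delimited strings (no quoted strings, semicolon-bounded text or stop_ terminators). It strips # comments, collects the data names after each loop_, reads values until the next list, data name, block or end of file, and checks that the value count is an exact multiple of the number of data names.

```python
# Minimal sketch: un-nested loop_ lists with unquoted values only.
THIABUTYROLACTONE = """
# atom attributes list
loop_ _atom_id _atom_type _atom_attach_h
1 C 0   2 S 0   3 C 2
4 C 2   5 C 2   6 O 0
# bond attributes list
loop_ _bond_id_1 _bond_id_2 _bond_type_mif
1 2 S   2 3 S   3 4 S
4 5 S   5 1 S   1 6 D
"""


def read_loops(text):
    """Return a list of (data_names, packets), one entry per loop_ in text."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())   # drop '#' comments

    def ends_list(tok):
        # A list ends at the next list, data name, block or save-frame keyword.
        return (tok in ("loop_", "global_") or tok.startswith("_")
                or tok.startswith("data_") or tok.startswith("save_"))

    loops, i = [], 0
    while i < len(tokens):
        if tokens[i] != "loop_":
            i += 1
            continue
        i += 1
        names = []
        while i < len(tokens) and tokens[i].startswith("_"):
            names.append(tokens[i])
            i += 1
        values = []
        while i < len(tokens) and not ends_list(tokens[i]):
            values.append(tokens[i])
            i += 1
        if not names or len(values) % len(names) != 0:
            raise ValueError("value count is not a multiple of the data-name count")
        width = len(names)
        packets = [dict(zip(names, values[k:k + width]))
                   for k in range(0, len(values), width)]
        loops.append((names, packets))
    return loops
```

Applied to `THIABUTYROLACTONE`, which restates the data of Fig. 2.4.4.1, the function returns one list of six atom packets and one list of six bond packets, each packet keyed by its data names.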

... referenced in subsequent data blocks [this statement corrects an error in Hall & Spadaccini (1994)]. Examples of global data are shown in Figs. 2.4.7.1 and 2.4.7.2, in which a variety of frequently referenced structural units are encapsulated within save frames specified in global blocks.

data_cyclohexane
_molecule_name_common    cyclohexane

loop_
_atom_id
_atom_type
_atom_attach_h
 1  C  2
 2  C  2
 3  C  2
 4  C  2
 5  C  2
 6  C  2

loop_
_bond_id_1
_bond_id_2
_bond_type_mif
 1  2  S
 2  3  S
 3  4  S
 4  5  S
 5  6  S
 6  1  S

loop_
_reference_conformation
 $chair  $boat  $twisted_boat

save_chair
loop_
_atom_id
_atom_coord_x
_atom_coord_y
_atom_coord_z
 1   1.579   0.159   0.263
 2   0.756   0.507  -0.986
 3   0.825   0.493   1.541
 4  -0.549  -0.131   1.590
 5  -1.377   0.222   0.347
 6  -0.626  -0.158  -0.937
save_

save_boat
loop_
_atom_id
_atom_coord_x
_atom_coord_y
_atom_coord_z
 1   1.657  -0.426   0.356
 2   1.031   0.133  -0.927
 3   0.960   0.133   1.602
 4  -0.568  -0.040   1.558
 5  -1.051  -0.738   0.279
 6  -0.499  -0.028  -0.964
save_

save_twisted_boat
loop_
_atom_id
_atom_coord_x
_atom_coord_y
_atom_coord_z
 1   0.933   0.922   0.971
 2   1.186   0.220  -0.368
 3  -0.119   0.161   1.796
 4  -1.135  -0.581   0.911
 5  -1.371   0.181  -0.397
 6  -0.083   0.236  -1.238
save_

Fig. 2.4.4.3. Atom and bond properties for cyclohexane, together with 3D coordinate representations of three alternative conformations: chair, boat and twisted boat.

2.4.5. Atoms, bonds and molecular representations

The MIF dictionary (see Chapter 4.8) contains definitions of the principal data items needed to specify molecular connectivity and spatial representations. These definitions are grouped according to purpose or, as referred to in the DDL dictionary language (Hall & Cook, 1995), by category. Categories are formally specified in the MIF dictionary using the data attribute _category but they may also be identified from the data-name construction ‘___’. Note that data items appearing in the same looped list must belong to the same category.

The values of some data items are restricted, by definition in the MIF dictionary, to standard codes or states. For example, the item _bond_type_mif can only have values S, D, T or O as in its dictionary definitions:

S: single (two-electron) bond;
D: double (four-electron) bond;
T: triple (six-electron) bond;
O: other (e.g. coordination) bond.

The MIF dictionary plays the important additional role of validating and standardizing data values. This is illustrated with the data item _display_colour, which identifies the colours of ‘atom’ and ‘bond’ graphical objects. The colour codes or states for this item are specified in its dictionary definitions as a set of permitted red/green/blue (RGB) ratios, and no other colours may be used in a MIF. This has the technical advantage of making colour states searchable for chemical applications.

Fig. 2.4.4.2 shows MIF data for the molecule (+)-3-bromocamphor. The ‘atom’ list contains the items _atom_id, _atom_type and _atom_attach_h, which identify the chemical properties of the atoms, plus the items _atom_coord_x, *_y and *_z, which specify the 3D molecular structure in Cartesian coordinates [these are taken from diffraction results (Allen & Rogers, 1970)]. The item _atom_label is also used with any graphical depiction of the 3D model. The ‘bond’ loop in this example uses the simple _bond_type_mif conventions described above. The data names needed to depict stereochemistry are discussed with examples (Figs. 2.4.8.1, 2.4.8.2 and 2.4.8.3) in Section 2.4.8.

The MIF approach to representing 2D chemical structure separates the specification of chemical atom and bond properties. This provides additional flexibility in the description of the graphical objects, such as atomic nodes and bonded connections. The MIF data required to generate a 2D chemical diagram are shown in Fig. 2.4.4.2. The diagram generated from this data will be in a display area of 500 × 500 coordinate units at a scale of 50 units per cm (the 2D chemical diagram shown in Fig. 2.4.4.2 is not to this scale). The default origin (the bottom left corner of the display area) can be specified with the item _display_define_origin.

The data used to depict a 2D structure form a two-level loop with the ‘atomic’ graphical objects at level 1 and the ‘bond’ graphical objects at level 2. The item _display_object has the values ‘.’ (null or no object), ‘text’ (an element or number string) or ‘icon’. The size and colour of the atom site are specified with _display_size and _display_colour. The bonds connected to each atom site are specified as a sequence of _display_conn_id numbers (in loop level 2). These numbers must match one of the

2.4.4.5. Global blocks A global block is similar to a data block except that it is opened with a global_ statement and contains data that are common or ‘default’ to all subsequent data blocks in a file. Global data items remain active until re-specified in a subsequent data block or global block. In some applications it may be efficient to place data that are common to all data blocks within a global block. In particular, save frames may be defined within global blocks and then refer-


2. CONCEPTS AND SPECIFICATIONS _display_id numbers at level 1. The connection object is specified with a _display_conn_symbol code, which must be a standard value in the dictionary definition, as is the colour of the icon if specified by _display_conn_colour.

2.4.6. Bonding conventions Chemical information systems use a variety of conventions to specify attributes such as aromaticity, bond-order alternation, tautomerism etc. These system-dependent conventions decide the values that are permitted for quantities such as bond order, electronic charge and hydrogen-atom count. Most systems also provide for redundancy between chemical attributes. For example, the valency, the number of connected non-hydrogen atoms, the number of terminal hydrogen atoms and the bond types associated with a given atom are clearly related. Operational systems make use of these relationships to perform internal checks and to provide flexibility in substructure search processes. The MIF data definitions provide for three bonding conventions. These are the data items _bond_type_mif, _bond_type_casreg3 and _bond_type_ccdc. The ‘mif’ convention defines only single, double, triple and other bonds, while the ‘casreg3’ convention (Mockus & Stobaugh, 1980) extends these to include aromaticity in terms of ‘ring alternating normalized bonds’ and tautomerism via a ‘tautomer normalized bond’. The ‘ccdc’ convention is that employed in the Cambridge Structural Database System (Allen et al., 1991; Allen, 2002) to categorize bond types encountered in both organic and metal-organic molecules. An important advantage of the MIF approach is that a molecule can be represented using all three bonding conventions within the same data block. An example of alternative bonding conventions encoded for toluene is shown in Fig. 2.4.6.1.

data_toluene
_molecule_name_common   toluene
loop_
_atom_id
_atom_type
_atom_attach_h
1 C 0   2 C 1   3 C 1   4 C 1
5 C 1   6 C 1   7 C 3
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
_bond_type_casreg3
_bond_type_ccdc
1 2 D A A   2 3 S A A
3 4 D A A   4 5 S A A
5 6 D A A   1 6 S A A
1 7 S S S

Fig. 2.4.6.1. Three alternative bonding conventions for toluene stored in the same MIF data block.
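The redundancy between chemical attributes mentioned in Section 2.4.6 can be sketched as a simple internal check. The bond and hydrogen values below follow the toluene data of Fig. 2.4.6.1; the bond-order mapping for the 'mif' codes and the function name are assumptions made for this illustration only.

```python
# Bond rows as (id_1, id_2, mif, casreg3, ccdc), following Fig. 2.4.6.1.
TOLUENE_BONDS = [
    (1, 2, "D", "A", "A"), (2, 3, "S", "A", "A"),
    (3, 4, "D", "A", "A"), (4, 5, "S", "A", "A"),
    (5, 6, "D", "A", "A"), (1, 6, "S", "A", "A"),
    (1, 7, "S", "S", "S"),
]
# _atom_attach_h for atoms 1 to 7 (all carbon).
ATTACHED_H = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 3}
# Assumed bond orders for the 'mif' codes; "O" (other) has no fixed order.
MIF_ORDER = {"S": 1, "D": 2, "T": 3}


def valences(bonds, attached_h):
    """Sum attached hydrogens and 'mif' bond orders for every atom."""
    totals = dict(attached_h)
    for id_1, id_2, mif, _casreg3, _ccdc in bonds:
        totals[id_1] += MIF_ORDER[mif]
        totals[id_2] += MIF_ORDER[mif]
    return totals


print(valences(TOLUENE_BONDS, ATTACHED_H))
```

Every carbon atom should come out with a total of 4. The aromatic code A of the other two conventions carries no integer order, so this particular check applies only to the 'mif' column.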

2.4.7. Structural templates
In many chemical information systems, it is standard practice to build complete 2D molecular representations through the use of a library of commonly referenced structural templates, e.g. ligands, functional groups, amino-acid units etc. In a MIF, molecular templates can be encapsulated as save frames, either within a data block for a specific molecule, or within a global block that is accessible to many data blocks.
A simple application of a MIF template is shown in Fig. 2.4.7.1, where a 4-methylcyclohexyl ligand is used to encode the molecule tris(methylcyclohexyl)phosphine. In this example a molecular fragment is constructed in the save frame mechex, where the ‘atom’ sites and ‘bond’ connections appear in _atom_* and _bond_* loops. The molecule (2-methylcyclohexyl)(3-methylcyclohexyl)(4-methylcyclohexyl)phosphine is encoded by referencing the template fragment as the save frame $mechex. In the ‘atom’ loop, the item _atom_environment identifies the components of the target molecule as an ‘atom’ or ‘frag’ (fragment). If the component is a fragment, the items _atom_frag_key and _atom_frag_id are used to specify the frame code and the ID of the attached atom in the fragment, respectively. In the ‘bond’ loop, the connections from the atom P(1) to the template are encoded simply in terms of the _atom_id values. The necessary redefinition of the hydrogen and non-hydrogen counts of the template atoms is accomplished using the _atom_attach_h and _atom_attach_nh items, respectively. The external values override any values that are contained in, or derived from, the data in the template.
The same approach is used to construct the dipeptide alanylserine in Fig. 2.4.7.2. This employs the template peptide units described by the atoms and bonds in the save frames $alanyl and $seryl. The complete dipeptide is specified in its ‘atom’ list as the template peptides (identified by their save-frame names) and an additional carboxylate O atom. Note that only the atom sites affected by molecule formation are identified explicitly in this list, which gives the values of _atom_attach_nh, _atom_attach_h and _atom_charge for the modified sites in the zwitterionic form of alanylserine.

2.4.8. Stereochemistry and geometry at stereogenic centres
The Cahn–Ingold–Prelog (CIP) notation (Cahn et al., 1966; Prelog & Helmchen, 1982) is available in the MIF definitions to specify the stereochemistry of a molecule. The CIP notation is restricted to tetrahedral atomic centres and to olefinic-type stereogenic bonds, and the CIP approach is unsuitable for describing molecules with partially known stereochemistry, molecules containing more complex geometries or substructural queries. The MIF data items representing stereochemical quantities are as follows:
_define_stereo_relationship
_atom_cip
_bond_cip
_stereo_atom_id
_stereo_bond_id_1
_stereo_bond_id_2
_stereo_geometry
_stereo_vertex_id

The CIP stereochemical designators (R, S, E, Z, r, s, e, z etc.) are specified with the MIF data items _atom_cip and _bond_cip. The MIF atom-property data for the molecule (+)-3-bromocamphor are shown in Fig. 2.4.8.1. In this the absolute configuration is expressed as the atom CIP values R, R and S for nodes 1, 3 and 4. The period in this example is used to indicate a null field.


2.4. SPECIFICATION OF THE MOLECULAR INFORMATION FILE (MIF)

global_
save_mechex
loop_
_atom_id
_atom_type
_atom_attach_h
1 C 2   2 C 2   3 C 2   4 C 1
5 C 2   6 C 2   7 C 3
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
1 2 S   2 3 S   3 4 S   4 5 S
5 6 S   1 6 S   4 7 S
save_

data_tris(methylcyclohexyl)phosphine
_molecule_name_common
; (2-methylcyclohexyl)(3-methylcyclohexyl)(4-methylcyclohexyl)phosphine
;
loop_
_atom_id
_atom_environment
_atom_frag_key
_atom_frag_id
_atom_type
_atom_attach_nh
_atom_attach_h
1 atom . . P 3 0
2 frag $mechex 3 C 3 1
3 frag $mechex 2 C 3 1
4 frag $mechex 1 C 3 1
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
1 2 S   1 3 S   1 4 S

Fig. 2.4.7.1. MIF representation of (2-methylcyclohexyl)(3-methylcyclohexyl)(4-methylcyclohexyl)phosphine using a single global save frame that encapsulates the structure of methylcyclohexane, together with ‘external’ referencing of save-frame atoms in _atom_ and _bond_ loops.

global_
save_alanyl
loop_
_atom_id
_atom_type
_atom_attach_h
1 N 2   2 C 1   3 C 0
4 O 0   5 C 3
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
1 2 S   2 3 S   3 4 D
2 5 S
save_
save_seryl
loop_
_atom_id
_atom_type
_atom_attach_h
1 N 2   2 C 1   3 C 0
4 O 0   5 C 2   6 O 1
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
1 2 S   2 3 S   3 4 D
2 5 S   5 6 S
save_

data_alanylserine
_molecule_name_common   alanylserine
loop_
_atom_id
_atom_environment
_atom_frag_key
_atom_frag_id
_atom_type
_atom_attach_nh
_atom_attach_h
_atom_charge
1 frag $alanyl 1 N 1 3 +1
2 frag $alanyl 3 C 3 0 0
3 frag $seryl 1 N 2 1 0
4 frag $seryl 3 C 3 0 0
5 atom . . O 1 0 -1
loop_
_bond_id_1
_bond_id_2
_bond_type_mif
2 3 S   4 5 S

Fig. 2.4.7.2. MIF representation of the dipeptide alanylserine constructed using alanyl and seryl templates encapsulated in global save frames.
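The template mechanism of Section 2.4.7 can be sketched as an expansion step that copies the save-frame fragment once per ‘frag’ entry and lets the external values override those of the attachment atom. The data follow the mechex example of Fig. 2.4.7.1; the function name and the (target id, fragment id) labelling of expanded atoms are hypothetical choices for this illustration, not MIF conventions.

```python
# Template fragment: atom id -> (element, attached hydrogens), and its bonds.
MECHEX_ATOMS = {1: ("C", 2), 2: ("C", 2), 3: ("C", 2), 4: ("C", 1),
                5: ("C", 2), 6: ("C", 2), 7: ("C", 3)}
MECHEX_BONDS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6), (4, 7)]

# Target 'atom' loop: id -> (environment, frag id, element, attached hydrogens).
TARGET_ATOMS = {
    1: ("atom", None, "P", 0),
    2: ("frag", 3, "C", 1),
    3: ("frag", 2, "C", 1),
    4: ("frag", 1, "C", 1),
}
TARGET_BONDS = [(1, 2), (1, 3), (1, 4)]


def expand(target_atoms, target_bonds, frag_atoms, frag_bonds):
    """Expand fragment references into explicit atoms and bonds."""
    atoms, bonds, attach_point = {}, [], {}
    for atom_id, (environment, frag_id, element, attach_h) in target_atoms.items():
        if environment == "atom":
            attach_point[atom_id] = (atom_id, None)
        else:
            # Copy the whole template under this target atom id.
            for frag_atom_id, properties in frag_atoms.items():
                atoms[(atom_id, frag_atom_id)] = properties
            bonds.extend(((atom_id, a), (atom_id, b)) for a, b in frag_bonds)
            attach_point[atom_id] = (atom_id, frag_id)
        # External values override those held in the template.
        atoms[attach_point[atom_id]] = (element, attach_h)
    bonds.extend((attach_point[a], attach_point[b]) for a, b in target_bonds)
    return atoms, bonds


atoms, bonds = expand(TARGET_ATOMS, TARGET_BONDS, MECHEX_ATOMS, MECHEX_BONDS)
carbons = sum(1 for element, _ in atoms.values() if element == "C")
hydrogens = sum(attach_h for _, attach_h in atoms.values())
print(len(atoms), len(bonds), carbons, hydrogens)
```

The counts correspond to one phosphorus atom plus three copies of the seven-carbon template, with each attachment carbon losing one hydrogen through the override.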


data_bromocamphor_2
_molecule_name_common   (+)-3-bromocamphor
_molecule_name_iupac
; 3R-bromo-1R,7,7-trimethyl-4S-bicyclo[2.2.1]heptan-2-one
;
loop_
_atom_id
_atom_type
_atom_attach_h
_atom_cip
1 C 0 R
2 C 0 .
3 C 1 R
4 C 1 S
5 C 2 .
6 C 2 .
7 C 0 .
8 C 3 .
9 C 3 .
10 C 3 .
11 O 0 .
12 Br 0 .

Fig. 2.4.8.1. CIP stereochemical descriptors for (+)-3-bromocamphor.

Fig. 2.4.8.2. Archetypal coordination geometries used in stereochemical definition of the MIF data item _stereo_geometry.

The stereogenic centre of a stereo group in a molecule has a relationship within that group that is specified by _define_stereo_relationship. Descriptions of the standard codes for _define_stereo_relationship are as follows.
absolute: The configuration of all stereogenic centres is exactly as described. This represents an enantiomerically pure compound with a known absolute configuration.
relative: The configuration of the stereogenic centres is only relative and the mirror reflection of the centres will also describe the same molecule. Only the configuration described in the MIF, or its mirror image, will be present in the molecule. This represents an enantiomerically pure compound with the described relative configuration.
racemic: The configuration of the stereogenic centres is only relative and the mirror reflection of the centres will also describe the same molecule. Both this configuration and its mirror image are present in a 1:1 ratio. This represents a racemic mixture of the molecule with the described relative configuration.
absolute_excess: The configuration of the stereogenic centres describes the absolute configuration of the excess component of a mixture of this configuration and its mirror reflection. This describes an enantiomeric excess in which the excess component has the described absolute configuration.
relative_excess: The configuration of the stereogenic centres is only relative. A mixture of this configuration and its mirror image is present, with one or other of the components in excess. This describes an enantiomeric excess mixture.
unknown: The configurational relationship between the stereogenic centres is not known.
The geometry of each stereogenic centre is described individually in terms of a prototype geometrical model. The basic principles of this approach have been described elsewhere (Barnard et al., 1990). The eight geometries currently defined for the MIF data item _stereo_geometry are given in Fig. 2.4.8.2. They include

the organic stereogenic geometries (the tetrahedron, the rectangular description of olefin-related compounds and the anti-rectangle used to describe allene-related systems) and the common archetypal metal coordination geometries (square planar, tetrahedral, trigonal bipyramidal, square pyramidal, octahedral and cubic). This list is non-exclusive and can be extended as required in later versions of the MIF dictionary. The vertex site of the geometrical model must be occupied by either an atom, an explicit or implicit hydrogen atom, or by an explicitly declared electron pair. In each case, there exist permutations of the enumerated vertices that, if applied, do not change the meaning of the description of the relevant stereo element. Thus, the MIF does not define a canonical ordering for citing geometric vertices and the comparison of two geometries requires the use of the permutation operators. These permutations are also indicated in Fig. 2.4.8.2. For each stereogenic centre (defined by a _stereo_atom_id, or by _stereo_bond_id_1 and *_2), the atom sites forming the stereochemical element specified by a _stereo_geometry code are stored as a sequence of _stereo_vertex_id values. An example of the specification of absolute stereochemistry, including the ordered enumeration of the tetrahedral vertices for the four stereogenic centres, is given in Fig. 2.4.8.3. In this example, the null symbol (a period) is used to indicate an implicit hydrogen atom or an unshared electron pair. 2.4.9. MIF query applications A MIF is suitable for interrogating databases because data items are permitted to have a single value, or a ‘sequence’ of alternative values. This latter option is designated by the dictionary attribute _type_conditions which, for MIF applications, is set to


data_query_example _molecule_name_common "conjugated (thio)keto query substructure" loop_ _atom_id _atom_type _atom_attach_all _atom_attach_h 1 C 4 1:3 4 C 3 .

data_menthyl_p_toluenesulfinate _molecule_name_common

menthyl-p-toluenesulfinate

_molecule_name_iupac "(1R,2S,5R)-(-)-menthyl (S)-p-toluenesulfinate" loop_ _atom_id _atom_type _atom_attach_h _atom_cip 1 C 2 . 2 C 2 5 C 2 . 6 C 1 9 C 1 . 10 C 3 13 C 0 . 14 C 1 17 C 1 . 18 C 1 loop_ _bond_id_1 _bond_id_2 _bond_type_mif 1 2 S 2 3 S 3 6 7 S 4 8 S 3 8 12 S 12 13 S 12 14 15 D 15 16 S 16

. R . . .

3 7 11 15 19

4 9 19 17

C 1 C 3 C 3 C 1 usp

S 4 5 S 9 10 S 12 20 D 17 18

_define_stereo_relationship loop_ _stereo_atom_id _stereo_geometry loop_ _stereo_vertex_id 6 tetrahedron 3 tetrahedron 4 tetrahedron 12 tetrahedron

S 4 C 1 . 8 O 0 . 12 S 0 . 16 C 1 0 . 20 O 0

S 5 6 S 9 11 D 13 14 S 13 18

loop_ _bond_id_1 _bond_id_2 _bond_type_ccdc 1 2 S 3 4 D,A,E

R . S . .

. 9 5 8

2

3

C O,S

3 O 1 O

S

3

2

5

C

3 .

D

Fig. 2.4.9.1. Query substructure for conjugated ketones or thioketones. Atom C1 is sp3 hybridized (total number of attached hydrogen and non-hydrogen atoms = 4) and carries at least one hydrogen atom. Bond C3—C4 may be localized double (D), aromatic (A) or delocalized double (E) in CCDC conventions.

graphical 2D search interfaces or be readable directly by a variety of 2D substructure search programs.
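The two ‘sequenced data’ constructs of Section 2.4.9, a comma-separated list of alternatives and a v1:v2 range, can be sketched as a small matcher. This is an illustration under stated assumptions: ranges are taken to be numeric and inclusive, the null value ‘.’ is not treated specially, and the function name is hypothetical.

```python
def matches(value_string, value):
    """Test a value against alternatives (v1,v2,v3) and ranges (v1:v2)."""
    for part in value_string.split(","):
        part = part.strip()
        if ":" in part:
            low, high = part.split(":", 1)
            try:
                if float(low) <= float(value) <= float(high):
                    return True
            except ValueError:
                continue  # a non-numeric value cannot lie in a numeric range
        elif part == str(value):
            return True
    return False


# Values taken from the query of Fig. 2.4.9.1.
print(matches("1:3", 2))       # hydrogen count allowed on atom 1
print(matches("O,S", "S"))     # element alternatives
print(matches("D,A,E", "T"))   # bond type outside the permitted set
```

A fuller implementation would also check each candidate against the _enumeration and _enumeration_range attributes of the item, as the text requires.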

S S S D

2.4.10. Conclusion
The present proliferation of formats for chemical applications tends to inhibit and complicate the exchange and use of chemical data. Many widely used chemical formats have a finite half-life because they are inflexible and not readily extensible. Others offer universality [e.g. Abstract Syntax Notation 1 (ISO, 2002a,b); Dalby et al. (1992); see http://www.daylight.com/smiles/] but lack visual simplicity, generality or machine readability. The Molecular Information File approach, by contrast, has these properties but needs significantly more development to be a viable exchange approach for mainstream chemistry. The MIF dictionary enables chemical data items to be defined at high precision and this offers real benefits for the creation of a domain ontology in this field.
Herein we have outlined the basic MIF approach and provided definitions for an initial core of data items that are fundamental for the representation of 2D and 3D chemical structures and 2D substructures. These core data items cover most of the basic chemical data-exchange requirements of molecular modelling and database applications, but are clearly only a first step towards the level of chemical data exchange needed. Future MIF developments in applications software and, particularly, in data definitions are expected to encompass other aspects of chemistry. These developments will need the collaborative involvement and support of subject specialists from both academia and industry.

absolute

7 5 1 . 2 4 8 . 3 19 20 13

2 5

stop_ stop_ stop_ stop_

Fig. 2.4.8.3. Stereochemical data for menthyl-p-toluenesulfinate.

‘sequenced data’ (via the code seq). This permits a value string to contain alternative ‘values’ satisfying the following constructs: (a) the value string v1, v2, v3 signals that a data item must have the value v1 or v2 or v3, and (b) the value string v1:v2 signals that a data item must have a value in the range v1 to v2. Combinations of these constructions are permitted. All values must comply with the requirements defined by the attributes _enumeration and _enumeration_range. An example of a substructural query in a MIF is shown in Fig. 2.4.9.1 for a conjugated ketone or thioketone fragment. Points of permitted variability of atom properties occur at atom 1, an sp3 carbon atom that must have at least one attached hydrogen atom, and at atom 5, which can be S or O. The conjugated multiple C—C bond (3–4) is defined to be either localized double, delocalized double or aromatic using CCDC bonding conventions. Query coding of this type should be readily generated from most

References
Allen, F. H. (2002). The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380–388.
Allen, F. H., Barnard, J. M., Cook, A. P. F. & Hall, S. R. (1995). The Molecular Information File (MIF): core specifications of a new standard format for chemical data. J. Chem. Inf. Comput. Sci. 35, 412–427.
Allen, F. H., Davies, J. E., Galloy, J. J., Johnson, O., Kennard, O., Macrae, C. F., Mitchell, E. M., Mitchell, G. F., Smith, J. M. & Watson, D. G. (1991). The development of versions 3 and 4 of the Cambridge Structural Database System. J. Chem. Inf. Comput. Sci. 31, 187–204.
Allen, F. H. & Glusker, J. P. (2002). Preface. Acta Cryst. B58, Part 3.
Allen, F. H. & Rogers, D. (1970). X-ray studies of terpenoid derivatives, Part III. A re-determination of the crystal structure of (+)-3-bromocamphor: the absolute configuration of (+)-camphor. J. Chem. Soc. B, pp. 632–636.
ASTM (1994). Standard specification for the content of computerized chemical structural information files or data sets. ASTM Standard E 1586-93. American Society for Testing and Materials, Philadelphia, PA, USA.
Barnard, J. M. (1990). Draft specification for revised version of the Standard Molecular Data (SMD) format. J. Chem. Inf. Comput. Sci. 30, 81–96.
Barnard, J. M. & Cook, A. P. F. (1992). The Molecular Information File (MIF): a standard format for molecular information. Report. Chemical Structure Association, London, England.
Barnard, J. M., Cook, A. P. F. & Rohde, B. (1990). Beyond the structure diagram, edited by D. Bawden & E. Mitchell, pp. 29–41. Chichester: Ellis Horwood.
Bebak, H., Buse, C., Donner, W. T., Hoever, P., Jacob, H., Klaus, H., Pesch, J., Roemelt, J., Schilling, P., Woost, B. & Zirz, C. (1989). The Standard Molecular Data format (SMD format) as an integration tool in computer chemistry. J. Chem. Inf. Comput. Sci. 29, 1–5.
Brown, I. D. (1988). Standard Crystallographic File Structure-87. Acta Cryst. A44, 232.
Cahn, R. S., Ingold, C. K. & Prelog, V. (1966). Specification of molecular chirality. Angew. Chem. Intl Ed. Engl. 5, 385–415.
Dalby, A., Nourse, J. G., Hounshell, W. D., Gushurst, A. K. I., Grier, D. L., Leland, B. A. & Laufer, J. (1992). Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 32, 244–255.
Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A. & Koch, B. (1977). An interactive substructure search system. J. Chem. Inf. Comput. Sci. 17, 157–163.
Hall, S. R. (1991). The STAR File: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685.
Hall, S. R. & Cook, A. P. F. (1995). STAR Dictionary Definition Language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819–825.
Hall, S. R. & Spadaccini, N. (1994). The STAR File: detailed specifications. J. Chem. Inf. Comput. Sci. 34, 505–508.
Harary, F. (1972). Graph theory, 3rd ed. London: Addison-Wesley.
ISO (2002a). ISO/IEC 8824-1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization.
ISO (2002b). ISO/IEC 8825-1. ASN.1 encoding rules. Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER). Geneva: International Organization for Standardization.
Mockus, J. & Stobaugh, R. E. (1980). The Chemical Abstracts Service Chemical Registry System. VII. Tautomerism and alternating bonds. J. Chem. Inf. Comput. Sci. 20, 18–22.
Prelog, V. & Helmchen, G. (1982). Basic principles of the CIP-system and proposals for a revision. Angew. Chem. Intl Ed. Engl. 21, 567–583.
Wippke, W. T. & Dyott, T. M. (1975). Use of ring assemblies in ring perception algorithm. J. Chem. Inf. Comput. Sci. 15, 140–147.


International Tables for Crystallography (2006). Vol. G, Chapter 2.5, pp. 53–60.

2.5. Specification of the core CIF dictionary definition language (DDL1)
By S. R. Hall and A. P. F. Cook

Affiliations: Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia; Anthony P. F. Cook, BCI Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, England.

Copyright © 2006 International Union of Crystallography

2.5.1. Introduction
The CIF approach to data representation described in Chapter 2.2 is based on the STAR File universal data language (Hall, 1991; Hall & Spadaccini, 1994) detailed in Chapter 2.1. An important advantage of the CIF approach is the self-identification of data items through the use of tag–value tuples. This syntax removes the need for preordained data ordering in a file or stream of data and enables appropriate parsing tools to automate access independently of the data source. In this chapter, we will show that the CIF syntax also provides a higher level of abstraction for managing data storage and exchange – that of defining the meaning of data items (i.e. their properties and characteristics) as attribute descriptors linked to the identifying data tag. Each attribute descriptor specifies a particular characteristic of a data item, and a collection of these attributes can be used to provide a unique definition of the data item. Moreover, placing the definitions of a selected set of data items into a CIF-like file provides an electronic dictionary for a particular subject area. In the modern parlance of knowledge management and the semantic web, such a data dictionary represents a domain ontology.
In most respects, data dictionaries serve a role similar to spoken-word dictionaries and as such are an important adjunct to the CIF data-management approach by providing semantic information that is necessary for automatic validation and compliance procedures. That is, prior lexical knowledge of the nature of individual items ensures that each item in a CIF can be read and interpreted correctly via the unique tag that is the link to descriptions in a data dictionary file. Because the descriptions in the dictionary are machine-parsable, the semantic information they contain forms an integral part of a data-handling process. In other words, machine-interpretable semantic knowledge embedded in data dictionaries leads directly to the automatic validation and manipulation of the relevant items stored in any CIF.

2.5.1.1. The concept of a dictionary definition language (DDL)
The structure or arrangement of data in a CIF is well understood and predictable because the CIF syntax may be specified succinctly (see Chapter 2.2 for CIF syntax expressed using extended Backus–Naur form). In contrast, the meaning of individual data values in a file is only known if the nature of these items is understood. For CIF data this critical link between the value and the meaning of an item is achieved using an electronic ‘data dictionary’ in which the definitions of relevant data items are catalogued according to their data name and expressed as attribute values, one set of attributes per item. The dictionary definitions describe the characteristics of each item, such as data type, enumeration states and relationships between data. The more precise the definitions, the higher the level of semantic knowledge of defined items and the better the efficiency achievable in their exchange. To a large degree, the precision achievable hinges on the attributes selected for use in dictionary definitions, these being the semantic vocabulary of a dictionary. Within this context, attribute descriptors constitute a dictionary definition language (DDL). Definitions of the attributes described in this chapter are provided in Chapter 4.9.
The main purpose of this chapter is to describe the DDL attributes used to construct the core, powder, modulated structures and electron density CIF dictionaries detailed in Chapters 3.2–3.5 and 4.1–4.4. We shall see that each item definition in these dictionaries is constructed separately by appropriate choice from the attribute descriptors available, and that a sequence of definition blocks (one block per item) constitutes a CIF dictionary file.
The organization of attributes and definition blocks in a dictionary file need not be related to the syntax of a data CIF, but in practice there are significant advantages if they are. Firstly, a common syntax for data and dictionary files enables the same software to be used to parse both. Secondly, and of equal importance, data descriptions and dictionaries need continual updating and additions, and the CIF syntax provides a high level of extensibility and flexibility, whereas most other formats do not. Finally, the use of a consistent syntax permits the dictionary attribute descriptors themselves to be described in their own DDL dictionary file.
While the basic functionality and flexibility of a dictionary is governed by the CIF syntax, the precision of the data definitions contained within it is determined entirely by the scope and number of the attribute descriptors representing the DDL. Indeed, the ability of a DDL to permit the simple and seamless evolution of data definitions and the scope of the DDL to precisely define data items both play absolutely pivotal roles via the supporting dictionaries in determining the power of the CIF data-exchange process.

2.5.2. The organization of a CIF dictionary
The precision and efficiency of a data definition language are directly related to the scope of the attribute vocabulary. In other words, the lexical richness of the DDL depends on the number and the specificity of the available language attributes. The breadth of these attributes, in terms of the number of separate data characteristics that can be specified, largely controls the precision of data definitions. However, it is the functionality of attributes that determines the information richness and enables higher-level definition complexity. For example, the attributes that define child–parent relationships between data and key pointers to items in list packets are essential to understanding the data hierarchy and to its validation.
Attributes provide the semantic tools of a dictionary. The choice and scope of attributes in the DDL are governed by both semantic and technical considerations. Attributes need to have a clear purpose to facilitate easy definition and comprehension, and their routine application in automatic validation processes.
A CIF dictionary is much more than a list of unrelated data definitions. Each definition conforms to the CIF syntax, which requires each data block in the dictionary to be unique. However, the functionality of a dictionary involves elements of both relational and object-oriented processes. For example, attributes in one definition may refer to another definition via _list_link_parent or _list_link_child attributes, so as to indicate the dependency


of data items in lists. In this way the DDL, and consequently the dictionaries constructed from the DDL, invoke aspects of relational and object-oriented database paradigms. It is therefore useful to summarize these briefly here.
A relational database model (Kim, 1990) presents data as tables with a hierarchy of links that define the dependencies among tables. These explicit relationships enable certain properties of data to be shared, and, for related data values, to be derived. The structure of data links in a relational database is usually defined separately from the component data. This is an important strength of this approach. However, when data types and dependencies change continually, static relationships are inappropriate, and there is a need for non-relational extensions.
The object-oriented database model (Gray et al., 1992) allows data items and tables to be defined without static data dependencies. A data item may be considered as a self-contained ‘object’ and its relationships to other objects handled by ‘methods’ or ‘actions’ defined within the objects. A database may have a base of statically defined explicit relationships with a dynamic layer of relationships provided by presenting some (or all) items as objects. Objects have well defined attributes, some of which may involve relationships with other data items, but objects need not have preordained links imposed by the static database structure.
The attributes and the functionality of CIF dictionaries incorporate aspects of both the basic relational and the object-oriented model. These provide the flexibility and extensibility associated with object-oriented data, as well as the relational links important to data validation and, ultimately, data derivation.
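The relational links discussed in Section 2.5.2 can be sketched as a check that every value of a child item resolves to a value of its parent item. This is only an illustration: the attribute name _list_link_parent comes from the text, but the two data names, the in-memory layout of definitions and data, and the function name are assumptions chosen for the example.

```python
# Hypothetical dictionary fragment: data name -> attribute values.
DEFINITIONS = {
    "_atom_site_label": {},
    "_geom_bond_atom_site_label_1": {"_list_link_parent": "_atom_site_label"},
}

# Hypothetical looped data: data name -> column of values.
DATA = {
    "_atom_site_label": ["C1", "C2", "O1"],
    "_geom_bond_atom_site_label_1": ["C1", "C2", "N1"],
}


def unresolved_links(definitions, data):
    """List (child name, value) pairs with no matching parent value."""
    problems = []
    for name, attributes in definitions.items():
        parent = attributes.get("_list_link_parent")
        if parent is None or name not in data:
            continue
        known = set(data.get(parent, []))
        problems.extend((name, value) for value in data[name] if value not in known)
    return problems


print(unresolved_links(DEFINITIONS, DATA))
```

Because the link is read from the definition and not hard-coded, the same routine serves any pair of items that a dictionary declares as parent and child.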

Table 2.5.4.1. Comparison of DDL1 and DDL2 variants

DDL1 | DDL2
Data names identify the category of data as __ | Data names identify the category as _.
Definitions are declared as data blocks with data_ | Definitions are declared starting with save_ and ending with save_
An irreducible set of items is declared within one definition e.g. Miller indices h, k, l | All items are defined in separate frames related by _item_dependent.name
Items that appear in lists are identified with the attribute list_ | List and non-list data items are not distinguished
List dependencies are declared within each definition e.g. _list_reference | Dependencies are declared in a category definition e.g. _category_key.name
 | Identifies subcategories of data within category groupings e.g. matrix
 | Provides aliases to equivalent names, including those in DDL1 dictionaries

reciprocal-space data items; namely, its value represents a key or list pointer (i.e. a unique key to a row of items in a list table) to other data items in the list associated with this vector. This means that data forming a ‘reflection’ list are inaccessible if these indices are absent, or invalid if there is more than one occurrence of the same triplet in the list. Such interdependency and relational information is very important to the application of data, and needs to be specified in a dictionary to enable unambiguous access and validation. Other types of data dependencies will be described in Section 2.5.6.
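The key role of the index triplet described above can be sketched as a list check: a row without all of its indices is inaccessible, and a repeated triplet makes the list invalid. The sketch assumes the vector in question is the Miller-index triplet h, k, l held in the core items _refln_index_h, _refln_index_k and _refln_index_l; the row layout and function name are hypothetical.

```python
KEY_NAMES = ("_refln_index_h", "_refln_index_k", "_refln_index_l")


def check_reflection_keys(rows):
    """Return (row number, problem) pairs for missing or repeated keys."""
    seen, problems = set(), []
    for number, row in enumerate(rows, start=1):
        if any(name not in row for name in KEY_NAMES):
            problems.append((number, "missing index"))
            continue
        key = tuple(row[name] for name in KEY_NAMES)
        if key in seen:
            problems.append((number, "duplicate key"))
        seen.add(key)
    return problems


# Hypothetical reflection list; the F values are placeholders.
rows = [
    {"_refln_index_h": 1, "_refln_index_k": 0, "_refln_index_l": 0, "_refln_F_meas": 12.3},
    {"_refln_index_h": 1, "_refln_index_k": 0, "_refln_index_l": 0, "_refln_F_meas": 12.1},
    {"_refln_index_h": 0, "_refln_index_k": 1, "_refln_F_meas": 4.5},
]
print(check_reflection_keys(rows))
```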

2.5.3. Definition attributes

Efficient data exchange depends implicitly on the prior knowledge of the data. For CIF data, this knowledge is specified in a data dictionary using definition attributes. A unique set of attribute values exists for each kind of data item, be it numerical, textual or symbolic, because these characteristics represent its identity and function. This is illustrated below with two simple examples.

2.5.4. DDL versions

The capacity of a DDL to precisely define data items depends implicitly on the scope of the available attributes. It is quite possible, therefore, that a completely new data property cannot be specified using an existing DDL. In a field where data types evolve rapidly, the currently used dictionary language may be inadequate for the precise specification of an item. It is inevitable that future data items will exceed the capacity of existing dictionary attributes and methods to describe new data properties, and the dictionaries must evolve much in the same way that spoken languages continually adapt to changing modes of expression.

The first DDL used in crystallography (in 1990) was developed to compile a dictionary of ‘core structural’ crystallographic data items (Hall et al., 1991). These data items were intended for use when submitting a manuscript to the journal Acta Crystallographica (and still are). The ‘core’ DDL is known commonly as DDL1 (Hall & Cook, 1995). Several years later, the definition of macromolecular crystallographic data items needed hierarchical descriptors for the different levels of structural entities, and an ‘mmCIF’ DDL, known as DDL2 (Westbrook & Hall, 1995), was developed. DDL2 was used to build the mmCIF dictionary (Bourne et al., 1997). DDL1 is described in this chapter and DDL2 is described in Chapter 2.6.

The DDL1 attributes have been used to construct the crystallographic core (fundamental structural), pd (powder diffraction), ms (modulated and composite structures) and rho (electron density) dictionaries. These dictionaries are discussed in Chapters

2.5.3.1. Example 1: attributes of temperature

Every numerical data item has distinct properties. Consider the number 20 as a measure of temperature in degrees. To understand this number it is essential to know its measurement units. If these are degrees Celsius, one knows the item is in the temperature class, degrees Celsius sub-class, and that a lower enumeration boundary of any value can be specified at −273.15. Such a constraint can be used in data validation. More to the point, without any knowledge of both the class and subclass, a numerical value has no meaning. The number 100 is unusable unless one knows what it is a measure of (e.g. temperature or intensity) and, equally, unless one knows what the units are (e.g. degrees Celsius, Kelvin or Fahrenheit; or electrons or volts).

2.5.3.2. Example 2: attributes of Miller indices

Knowing the inter-dependency of one data item on another plays a major role in the understanding and validation of data. If a triplet of numbers 5, 3, 0 is identified as Miller indices h, k, l, one immediately appreciates the significance of the index triplet as a vector in reciprocal crystal space. This stipulates that the three numbers form a single non-scalar data item in which the indices are nonassociative (e.g. 3, 5, 0 is not equivalent to 5, 3, 0) and irreducible (e.g. the index 3 alone has no meaning). As a reciprocal-space vector, the triplet has other properties if it is part of a list of other


2.5. SPECIFICATION OF THE CORE CIF DICTIONARY DEFINITION LANGUAGE (DDL1)

data_atom_site_Cartn_
    loop_ _name                 '_atom_site_Cartn_x'
                                '_atom_site_Cartn_y'
                                '_atom_site_Cartn_z'
    _category                   atom_site
    _type                       numb
    _type_conditions            esd
    _list                       yes
    _list_reference             '_atom_site_label'
    _units                      A
    _units_detail               'angstroms'
    _definition
;   The atom-site coordinates in angstroms specified according to a
    set of orthogonal Cartesian axes related to the cell axes as
    specified by the _atom_sites_Cartn_transform_axes description.
;

data_cell_formula_units_Z
    _name                       '_cell_formula_units_Z'
    _category                   cell
    _type                       numb
    _enumeration_range          1:
    _definition
;   The number of the formula units in the unit cell as specified
    by _chemical_formula_structural, _chemical_formula_moiety or
    _chemical_formula_sum.
;

Fig. 2.5.5.1. DDL1 definition with a few attributes.
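The way an attribute such as _enumeration_range can drive validation may be sketched as follows. This is a minimal illustration, not a DDL1 implementation: the function names are hypothetical, and the sketch assumes that a range is written as 'min:max' with either bound optionally omitted (as in the value 1: above) and that the bounds are inclusive.

```python
def parse_enumeration_range(text):
    """Split a range string such as '1:' or '-273.15:' into (lower, upper).

    A bound that is omitted is returned as None (assumed to mean 'unbounded').
    """
    lower, separator, upper = text.partition(":")
    if not separator:
        raise ValueError(f"no ':' separator in range {text!r}")
    lower_bound = float(lower) if lower.strip() else None
    upper_bound = float(upper) if upper.strip() else None
    return lower_bound, upper_bound


def in_range(value, text):
    """Return True if value lies within the (assumed inclusive) range."""
    lower_bound, upper_bound = parse_enumeration_range(text)
    if lower_bound is not None and value < lower_bound:
        return False
    if upper_bound is not None and value > upper_bound:
        return False
    return True
```

With these assumptions, a value of 4 for _cell_formula_units_Z passes the check against 1:, whereas a value of 0 does not; the same helper would reject a Celsius temperature below −273.15 if the range were declared as -273.15:.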

data_on_this_dictionary
    _dictionary_name            cif_example.dic
    _dictionary_version         0.0
    _dictionary_update          1999-03-15
    _dictionary_history
;   1999-03-11 Created as a dictionary example
    1999-03-15 Further simplifications
;

data_parameter_ABC
    _name                       '_parameter_ABC'

# FF0700 00====

For these hexadecimal, octal and decimal formats only, comments beginning with ‘#’ are permitted to improve readability. BASE64 encoding follows MIME conventions. Octets are in groups of three: c1, c2, c3. The resulting 24 bits are broken into four six-bit quantities, starting with the high-order six bits (c1 ≫ 2) of the first octet, then the low-order two bits of the first octet followed by the high-order four bits of the second octet [(c1 & 3) ≪ 4 | (c2 ≫ 4)], then the bottom four bits of the second octet followed by the high-order two bits of the last octet [(c2 & 15) ≪ 2 | (c3 ≫ 6)], then the bottom six bits of the last octet (c3 & 63). Each of these four quantities is translated into an ASCII character using the mapping

Category group(s): inclusive_group array_data_group Category key(s): _array_intensities.array_id _array_intensities.binary_id

Example 1.

loop_
_array_intensities.array_id
_array_intensities.linearity
_array_intensities.gain
_array_intensities.overload
_array_intensities.undefined_value
image_1 linear 1.2 655535 0

Values  0 to 39:  ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmn
Values 40 to 63:  opqrstuvwxyz0123456789+/

*_array_intensities.array_id

(code)

This item is a pointer to _array_structure.id in the ARRAY_STRUCTURE category.

(code)

This item is a pointer to _array_structure.id in the ARRAY_STRUCTURE category. [array_intensities]

(*)_array_intensities.binary_id (int)

This item is a pointer to _array_data.binary_id in the ARRAY_DATA category. [array_intensities]

with short groups of octets padded on the right with one ‘=’ if c3 is missing, and with ‘==’ if both c2 and c3 are missing.

QUOTED-PRINTABLE encoding also follows MIME conventions, copying octets without translation if their ASCII values are 32...38, 42, 48...57, 59, 60, 62, 64...126 and the octet is not a ‘;’ in column 1. All other characters are translated to =nn, where nn is the hexadecimal encoding of the octet. All lines are ‘wrapped’ with a terminating = (i.e. the MIME conventions for an implicit line terminator are never used).
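The BASE64 scheme described in this section (three octets c1, c2, c3 split into four six-bit quantities, with ‘=’ padding for short groups) can be sketched in a few lines. The function name encode_base64 is hypothetical and the sketch is illustrative only: it does not wrap lines or add any of the surrounding imgCIF binary-section framing, and the standard-library base64 module is used merely as a cross-check.

```python
import base64

# Mapping of the 64 six-bit values to ASCII characters, as tabulated in the text.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_base64(data: bytes) -> str:
    """Encode octets in groups of three using the bit operations given in the text."""
    pieces = []
    for start in range(0, len(data), 3):
        group = data[start:start + 3]
        c1 = group[0]
        c2 = group[1] if len(group) > 1 else 0
        c3 = group[2] if len(group) > 2 else 0
        sextets = [
            c1 >> 2,                       # high-order six bits of c1
            ((c1 & 3) << 4) | (c2 >> 4),   # low two bits of c1, high four of c2
            ((c2 & 15) << 2) | (c3 >> 6),  # low four bits of c2, high two of c3
            c3 & 63,                       # low six bits of c3
        ]
        chars = [ALPHABET[value] for value in sextets]
        if len(group) == 1:
            chars[2:] = "=="               # both c2 and c3 missing
        elif len(group) == 2:
            chars[3:] = "="                # only c3 missing
        pieces.append("".join(chars))
    return "".join(pieces)


def matches_stdlib(data: bytes) -> bool:
    """Cross-check the sketch against Python's own base64 encoder."""
    return encode_base64(data) == base64.b64encode(data).decode("ascii")
```

For example, the three octets of ‘Man’ encode to ‘TWFu’, while the single octet ‘M’ encodes to ‘TQ==’.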

*_array_intensities.gain

(float)

Detector ‘gain’. The factor by which linearized intensity count values should be divided to produce true photon counts. The permitted range is [0.0, ∞). Related item: _array_intensities.gain_esd (associated value). [array_intensities]

[array_data]


4. DATA DICTIONARIES

ARRAY INTENSITIES

*_array_intensities.gain_esd

(float)

ARRAY STRUCTURE

The estimated standard deviation in detector ‘gain’.

Data items in the ARRAY_STRUCTURE category record the organization and encoding of array data in the ARRAY_DATA category.

The permitted range is [0.0, ∞). Related item: _array_intensities.gain (associated esd).

Category group(s): inclusive_group array_data_group Category key(s): _array_structure.id

[array_intensities]

*_array_intensities.linearity

(code)

The intensity linearity scaling method used to convert from the raw intensity to the stored element value. ‘linear’ is linear. ‘offset’ means that the value defined by _array_intensities.offset should be added to each element value. ‘scaling’ means that the value defined by _array_intensities.scaling should be multiplied with each element value. ‘scaling offset’ is the combination of the two previous cases, with the scale factor applied before the offset value. ‘sqrt scaled’ means that the square root of raw intensities multiplied by _array_intensities.scaling is calculated and stored, perhaps rounded to the nearest integer. Thus, linearization involves dividing the stored values by _array_intensities.scaling and squaring the result. ‘logarithmic scaled’ means that the logarithm base 10 of raw intensities multiplied by _array_intensities.scaling is calculated and stored, perhaps rounded to the nearest integer. Thus, linearization involves dividing the stored values by _array_intensities.scaling and calculating 10 to the power of this number. ‘raw’ means that the data are a set of raw values straight from the detector.
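The linearization rules listed for _array_intensities.linearity can be summarized in a short sketch. It follows the wording of the definition above literally; the function names linearize and photon_counts are hypothetical, ‘raw’ is handled like ‘linear’ only as a convenience (the text cautions that raw counts need not be linear in the number of photons), and underscores are accepted in place of spaces in the code values as an assumption about how they may be written in a file.

```python
def linearize(stored, linearity, scaling=1.0, offset=0.0):
    """Convert a stored element value to a linearized intensity.

    `scaling` and `offset` stand for _array_intensities.scaling and
    _array_intensities.offset respectively.
    """
    method = linearity.replace("_", " ")
    if method in ("linear", "raw"):
        return float(stored)
    if method == "offset":
        return stored + offset
    if method == "scaling":
        return stored * scaling
    if method == "scaling offset":
        # Scale factor applied before the offset value.
        return stored * scaling + offset
    if method == "sqrt scaled":
        # Divide by the scaling value, then square the result.
        return (stored / scaling) ** 2
    if method == "logarithmic scaled":
        # Divide by the scaling value, then raise 10 to that power.
        return 10.0 ** (stored / scaling)
    raise ValueError(f"unknown linearity code: {linearity!r}")


def photon_counts(linearized, gain):
    """Divide a linearized intensity by the detector gain to estimate photon counts."""
    return linearized / gain
```

A stored value of 10 with ‘sqrt scaled’ and a scaling value of 2 therefore linearizes to 25; dividing a linearized value by _array_intensities.gain then gives the estimated photon count.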

Example 1.

loop_
_array_structure.id
_array_structure.encoding_type
_array_structure.compression_type
_array_structure.byte_order
image_1 "unsigned 16-bit integer"

scaling

scaling offset sqrt scaled

raw

packed

canonical

Data are stored in normal format as defined by _array_structure.encoding_type and _array_structure.byte_order.

Using the ‘packed’ compression scheme, a CCP4-style packing (International Tables for Crystallography Volume G, Section 5.6.3.2)

Using the ‘canonical’ compression scheme (International Tables for Crystallography Volume G, Section 5.6.3.1)

Where no value is given, the assumed value is ‘none’.

[array_structure]

*_array_structure.encoding_type

(uline)

Data encoding of a single element of array data. In several cases, the IEEE format is referenced. See IEEE Standard 754-1985 (IEEE, 1985). Reference: IEEE (1985). IEEE Standard for Binary FloatingPoint Arithmetic. ANSI/IEEE Std 754-1985. New York: Institute of Electrical and Electronics Engineers. The data value must be one of the following:

’unsigned 8-bit integer’ ’signed 8-bit integer’ ’unsigned 16-bit integer’ ’signed 16-bit integer’ ’unsigned 32-bit integer’ ’signed 32-bit integer’ ’signed 32-bit real IEEE’ ’signed 64-bit real IEEE’ ’signed 32-bit complex IEEE’

[array_intensities]

(float)

The saturation intensity level for this data array.

[array_structure]

*_array_structure.id

(code)

The value of _array_structure.id must uniquely identify each item of array data.

Multiplicative scaling value to be applied to array data in the manner described by item _array_intensities.linearity.

The following item(s) have an equivalent role in their respective categories:

[array_intensities]

_array_intensities.undefined_value

The last byte in the byte stream of the bytes which make up an integer value is the most significant byte of an integer.

none

(float)

(float)

little endian

The data value must be one of the following:

Offset value to add to array element values in the manner described by the item _array_intensities.linearity.

_array_intensities.scaling

The first byte in the byte stream of the bytes which make up an integer value is the most significant byte of an integer.

_array_structure.compression_type (code) Type of data-compression method used to compress the array data.

[array_intensities]

[array_intensities]

big endian

[array_structure]

The array consists of raw values to which no corrections have been applied. While the handling of the data is similar to that given for ‘linear’ data with no offset, the meaning of the data differs in that the number of incident photons is not necessarily linearly related to the number of counts reported. This value is intended for use either in calibration experiments or to allow for handling more complex data-fitting algorithms than are allowed for by this data item.

_array_intensities.overload

(code)

The data value must be one of the following:

The square root of raw intensities multiplied by _array_intensities.scaling is calculated and stored, perhaps rounded to the nearest integer. Thus, linearization involves dividing the stored values by _array_intensities.scaling and squaring the result.

_array_intensities.offset

little_endian

The order of bytes for integer values which require more than 1 byte. (IBM PCs and compatibles, and Dec VAXs use low-byte-first ordered integers, whereas Hewlett Packard 700 series, Sun-4 and Silicon Graphics use high-byte-first ordered integers. Dec Alphas can produce/use either depending on a compiler switch.)
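The effect of _array_structure.byte_order on multi-byte integers can be shown with a brief sketch. The function name decode_unsigned is hypothetical; the code values ‘little_endian’ and ‘big_endian’ are taken from this definition, and the default width of two bytes corresponds to the ‘unsigned 16-bit integer’ encoding type.

```python
def decode_unsigned(raw: bytes, byte_order: str, width: int = 2):
    """Decode a byte stream into unsigned integers of `width` bytes each."""
    orders = {"little_endian": "little", "big_endian": "big"}
    if byte_order not in orders:
        raise ValueError(f"unknown byte order: {byte_order!r}")
    if len(raw) % width:
        raise ValueError("byte stream length is not a multiple of the element width")
    return [
        int.from_bytes(raw[start:start + width], orders[byte_order])
        for start in range(0, len(raw), width)
    ]
```

The same four bytes 01 02 FF 00 (hexadecimal) decode to 513 and 255 when read low byte first, but to 258 and 65280 when read high byte first, which is why the byte order must be recorded alongside the data.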

Linear.

The value defined by _array_intensities.offset should be added to each element value.

The value defined by _array_intensities.scaling should be multiplied with each element value.

The combination of scaling and offset with the scale factor applied before the offset value.

logarithmic scaled The logarithm base 10 of raw intensities multiplied by _array_intensities.scaling is calculated and stored, perhaps rounded to the nearest integer. Thus, linearization involves dividing the stored values by _array_intensities.scaling and calculating 10 to the power of this number.

none

*_array_structure.byte_order

The data value must be one of the following:

linear offset

cif img.dic

_array_data.array_id ,

(float)

_array_structure_list.array_id ,

A value to be substituted for undefined values in the data array.

_array_intensities.array_id , _diffrn_data_frame.array_id .

[array_intensities]


[array_structure]

4.6. IMAGE DICTIONARY (imgCIF)


ARRAY STRUCTURE LIST AXIS

ARRAY STRUCTURE LIST

ARRAY STRUCTURE LIST AXIS

Data items in the ARRAY_STRUCTURE_LIST category record the size and organization of each array dimension. The relationship to physical axes may be given.

Data items in the ARRAY_STRUCTURE_LIST_AXIS category describe the physical settings of sets of axes for the centres of pixels that correspond to data points described in the ARRAY_STRUCTURE_LIST category. In the simplest cases, the physical increments of a single axis correspond to the increments of a single array index. More complex organizations, e.g. spiral scans, may require coupled motions along multiple axes. Note that a spiral scan uses two coupled axes: one for the angular direction and one for the radial direction. This differs from a cylindrical scan for which the two axes are not coupled into one set.

Category group(s): inclusive_group array_data_group Category key(s): _array_structure_list.array_id _array_structure_list.index

Example 1 – an image array of 1300 × 1200 elements. The raster order of the image is left to right (increasing) in the first dimension and bottom to top (decreasing) in the second dimension.

loop_
_array_structure_list.array_id
_array_structure_list.index
_array_structure_list.dimension
_array_structure_list.precedence
_array_structure_list.direction
_array_structure_list.axis_set_id
image_1 1 1300 1 increasing ELEMENT_X
image_1 2 1200 2 decreasing ELEMENT_Y

Category group(s): inclusive_group array_data_group
Category key(s): _array_structure_list_axis.axis_set_id _array_structure_list_axis.axis_id

_array_structure_list_axis.angle (float)

The setting of the specified axis in degrees for the first data point of the array index with the corresponding value of _array_structure_list.axis_set_id. If the index is specified as ‘increasing’, this will be the centre of the pixel with index value 1. If the index is specified as ‘decreasing’, this will be the centre of the pixel with maximum index value.

*_array_structure_list.array_id

(code)

This item is a pointer to _array_structure.id in the ARRAY_STRUCTURE category.

Where no value is given, the assumed value is ‘0.0’.

[array_structure_list]

[array_structure_list_axis]

_array_structure_list_axis.angle_increment (float)

*_array_structure_list.axis_set_id

The pixel-centre-to-pixel-centre increment in the angular setting of the specified axis in degrees. This is not meaningful in the case of ‘constant velocity’ spiral scans and should not be specified for this case. See _array_structure_list_axis.angular_pitch.

(code)

This is a descriptor for the physical axis or set of axes corresponding to an array index. This data item is related to the axes of the detector itself given in DIFFRN_DETECTOR_AXIS, but usually differs in that the axes in this category are the axes of the coordinate system of reported data points, while the axes in DIFFRN_DETECTOR_AXIS are the physical axes of the detector describing the ‘poise’ of the detector as an overall physical object. If there is only one axis in the set, the identifier of that axis should be used as the identifier of the set.

Where no value is given, the assumed value is ‘0.0’. [array_structure_list_axis]

_array_structure_list_axis.angular_pitch

The following item(s) have an equivalent role in their respective categories:

_array_structure_list_axis.axis_set_id .

[array_structure_list]

*_array_structure_list.dimension

(int)

The number of elements stored in the array structure in this dimension. The permitted range is [1, ∞).

Where no value is given, the assumed value is ‘0.0’. [array_structure_list_axis]

*_array_structure_list_axis.axis_id

[array_structure_list]

*_array_structure_list.direction

(int)

The data value must be one of the following:

[array_structure_list_axis]

increasing   Indicates the index changes from 1 to the maximum dimension
decreasing   Indicates the index changes from the maximum dimension to 1

[array_structure_list]

(*)_array_structure_list_axis.axis_set_id (code)

The value of this data item is the identifier of the set of axes for which axis settings are being specified. Multiple axes may be specified for the same value of _array_structure_list_axis.axis_set_id. This item is a pointer to _array_structure_list.axis_set_id in the ARRAY_STRUCTURE_LIST category. If this item is not specified, it defaults to the corresponding axis identifier.

*_array_structure_list.index

(int)

Identifies the one-based index of the row or column in the array structure.

[array_structure_list_axis]

The following item(s) have an equivalent role in their respective categories: _array_element_size.index . The permitted range is [1, ∞). [array_structure_list]

*_array_structure_list.precedence

_array_structure_list_axis.displacement

(float)

The setting of the specified axis in millimetres for the first data point of the array index with the corresponding value of _array_structure_list.axis_set_id. If the index is specified as ‘increasing’, this will be the centre of the pixel with index value 1. If the index is specified as ‘decreasing’, this will be the centre of the pixel with maximum index value.

(int)

Identifies the rank order in which this array index changes with respect to other array indices. The precedence of 1 indicates the index which changes fastest. The permitted range is [1, ∞).
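How dimension, precedence and direction together fix the position of an element in the stored stream can be sketched as follows. This is an illustration under assumptions rather than a definitive reading of the dictionary: the function name element_indices and the dictionary-based row layout are hypothetical, stream positions are counted from zero, and a ‘decreasing’ index is taken to start at its maximum value, consistent with the direction definitions in this category.

```python
def element_indices(position, dims):
    """Map a zero-based stream position to one-based array index values.

    `dims` is a list of dicts, one per ARRAY_STRUCTURE_LIST row, with the
    keys 'index', 'dimension', 'precedence' and 'direction'.
    Returns a dict mapping each index number to its value for that element.
    """
    result = {}
    remaining = position
    # Precedence 1 is the fastest-changing index, so peel it off first.
    for row in sorted(dims, key=lambda r: r["precedence"]):
        step = remaining % row["dimension"]
        remaining //= row["dimension"]
        if row["direction"] == "increasing":
            result[row["index"]] = 1 + step
        else:  # 'decreasing': the index runs from the maximum down to 1
            result[row["index"]] = row["dimension"] - step
    if remaining:
        raise IndexError("position lies beyond the end of the array")
    return result
```

For the 1300 × 1200 image of Example 1, the first stored element would then have index values 1 and 1200, and the element at stream position 1300 (the start of the second raster line) would have index values 1 and 1199.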

(code)

The value of this data item is the identifier of one of the axes in the set of axes for which settings are being specified. Multiple axes may be specified for the same value of _array_structure_list_axis.axis_set_id. This item is a pointer to _axis.id in the AXIS category.

Identifies the direction in which this array index changes. increasing

(float)

The pixel-centre-to-pixel-centre distance for a one-step change in the setting of the specified axis in millimetres. This is meaningful only for ‘constant velocity’ spiral scans or for uncoupled angular scans at a constant radius (cylindrical scans) and should not be specified for cases in which the angle between pixels (rather than the distance between pixels) is uniform. See _array_structure_list_axis.angle_increment.

Where no value is given, the assumed value is ‘0.0’.

[array_structure_list]

[array_structure_list_axis]


_array_structure_list_axis.displacement_increment

When specifying detector axes, the axis is given to the beam centre. The location of the beam centre on the detector should be given in the DIFFRN_DETECTOR category in distortion-corrected millimetres from the (0, 0) corner of the detector. It should be noted that many different origins arise in the definition of an experiment. In particular, as noted above, it is necessary to specify the location of the beam centre on the detector in terms of the origin of the detector, which is, of course, not coincident with the centre of the sample.

(float)

The pixel-centre-to-pixel-centre increment for the displacement setting of the specified axis in millimetres. Where no value is given, the assumed value is ‘0.0’. [array_structure_list_axis]

_array_structure_list_axis.radial_pitch

(float)

The radial distance from one ‘cylinder’ of pixels to the next in millimetres. If the scan is a ‘constant velocity’ scan with differing angular displacements between pixels, the value of this item may differ significantly from the value of _array_structure_list_axis.displacement_increment.

Reference: Leslie, A. G. W. & Powell, H. (2004). MOSFLM v6.11. MRC Laboratory of Molecular Biology, Hills Road, Cambridge, England. http://www.CCP4.ac.uk/dist/xwindows/Mosflm/.

Where no value is given, the assumed value is ‘0.0’. [array_structure_list_axis]

Category group(s): inclusive_group axis_group diffrn_group Category key(s): _axis.id _axis.equipment

AXIS Data items in the AXIS category record the information required to describe the various goniometer, detector, source and other axes needed to specify a data collection. The location of each axis is specified by two vectors: the axis itself, given as a unit vector, and an offset to the base of the unit vector. These vectors are referenced to a right-handed laboratory coordinate system with its origin in the sample or specimen:

Example 1. This example shows the axis specification of the axes of a kappa-geometry goniometer [see Stout, G. H. & Jensen, L. H. (1989). X-ray structure determination. A practical guide, 2nd ed. p. 134. New York: Wiley Interscience]. There are three axes specified, and no offsets. The outermost axis, ω, is pointed along the X axis. The next innermost axis, κ, is at a 50° angle to the X axis, pointed away from the source. The innermost axis, ϕ, aligns with the X axis when ω and ϕ are at their zero points. If Tω, Tκ and Tϕ are the transformation matrices derived from the axis settings, the complete transformation would be x′ = Tω Tκ Tϕ x.

loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1]
_axis.vector[2]
_axis.vector[3]
omega rotation goniometer .      1       0  0
kappa rotation goniometer omega  -.64279 0  -.76604
phi   rotation goniometer kappa  1       0  0
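The composite transformation x′ = Tω Tκ Tϕ x of Example 1 can be sketched numerically. The sketch below is illustrative only: the helper names are hypothetical, each matrix is assumed to be a right-handed rotation by the axis setting (in degrees) about the axis vector given for its zero position, built with the standard Rodrigues rotation formula, and the matrices are assumed to act on column vectors.

```python
import math


def rotation_matrix(axis, angle_deg):
    """Right-handed rotation by angle_deg about the direction `axis` (Rodrigues formula)."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    angle = math.radians(angle_deg)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]


def matmul(a, b):
    """Product of two 3 x 3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(matrix, vector):
    """Apply a 3 x 3 matrix to a three-component vector."""
    return tuple(sum(matrix[i][k] * vector[k] for k in range(3)) for i in range(3))


def kappa_transform(omega, kappa, phi):
    """Composite matrix T_omega T_kappa T_phi for the axis vectors of Example 1."""
    t_omega = rotation_matrix((1.0, 0.0, 0.0), omega)
    t_kappa = rotation_matrix((-0.64279, 0.0, -0.76604), kappa)
    t_phi = rotation_matrix((1.0, 0.0, 0.0), phi)
    return matmul(t_omega, matmul(t_kappa, t_phi))
```

With κ and ϕ at zero, a setting of ω = 90° carries the laboratory Y direction (0, 1, 0) into the Z direction (0, 0, 1), as expected for a right-handed rotation about X.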

Example 2. This example shows the axis specification of the axes of a detector, source and gravity. The order has been changed as a reminder that the ordering of presentation of tokens is not significant. The centre of rotation of the detector has been taken to be 68 mm in the direction away from the source.

loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1]
_axis.vector[2]
_axis.vector[3]
_axis.offset[1]
_axis.offset[2]
_axis.offset[3]
source   .           source   .         0  0  1  .  .  .
gravity  .           gravity  .         0  -1 0  .  .  .
tranz    translation detector rotz      0  0  1  0  0  -68
twotheta rotation    detector .         1  0  0  .  .  .
roty     rotation    detector twotheta  0  1  0  0  0  -68
rotz     rotation    detector roty      0  0  1  0  0  -68

Axis 1 (X): The X axis is aligned to the mechanical axis pointing from the sample or specimen along the principal axis of the goniometer.
Axis 2 (Y): The Y axis completes an orthogonal right-handed system defined by the X axis and the Z axis (see below).
Axis 3 (Z): The Z axis is derived from the source axis which goes from the sample to the source. The Z axis is the component of the source axis in the direction of the source orthogonal to the X axis in the plane defined by the X axis and the source axis.

These axes are based on the goniometer, not on the orientation of the detector, gravity etc. The vectors necessary to specify all other axes are given by sets of three components in the order (X, Y, Z). If the axis involved is a rotation axis, it is right-handed, i.e. as one views the object to be rotated from the origin (the tail) of the unit vector, the rotation is clockwise. If a translation axis is specified, the direction of the unit vector specifies the sense of positive translation.

Note: This choice of coordinate system is similar to but significantly different from the choice in MOSFLM (Leslie & Powell, 2004). In MOSFLM, X is along the X-ray beam (the CBF/imgCIF Z axis) and Z is along the rotation axis.

_axis.depends_on

The value of _axis.depends_on specifies the next outermost axis upon which this axis depends. This item is a pointer to _axis.id in the same category. [axis]

_axis.equipment (ucode) The value of _axis.equipment specifies the type of equipment using the axis: ‘goniometer’, ‘detector’, ‘gravity’, ‘source’ or ‘general’. The data value must be one of the following:

All rotations are given in degrees and all translations are given in millimetres. Axes may be dependent on one another. The X axis is the only goniometer axis the direction of which is strictly connected to the hardware. All other axes are specified by the positions they would assume when the axes upon which they depend are at their zero points.

goniometer   equipment used to orient or position samples
detector     equipment used to detect reflections
general      equipment used for general purposes
gravity      axis specifying the downward direction
source       axis specifying the direction sample to source

Where no value is given, the assumed value is ‘general’.


[axis]

*_axis.id

(code)

DIFFRN DATA FRAME

The value of _axis.id must uniquely identify each axis relevant to the experiment. Note that multiple pieces of equipment may share the same axis (e.g. a 2θ arm), so the category key for AXIS also includes the equipment.

Data items in the DIFFRN_DATA_FRAME category record the details about each frame of data. The items in this category were previously in a DIFFRN_FRAME_DATA category, which is now deprecated. The items from the old category are provided as aliases but should not be used for new work.

The following item(s) have an equivalent role in their respective categories:

_axis.depends_on,

Category group(s): inclusive_group array_data_group Category key(s): _diffrn_data_frame.id _diffrn_data_frame.detector_element_id

_array_structure_list_axis.axis_id , _diffrn_detector_axis.axis_id , _diffrn_measurement_axis.axis_id , _diffrn_scan_axis.axis_id , _diffrn_scan_frame_axis.axis_id .

Example 1 – a frame containing data from four frame elements.

[axis]

Each frame element has a common array configuration ‘array 1’ described in ARRAY STRUCTURE and related categories. The data for each detector element are stored in four groups of binary data in the ARRAY DATA category, linked by the array id and binary id.

_axis.offset[1] (float) The [1] element of the three-element vector used to specify the offset to the base of a rotation or translation axis. The vector is specified in millimetres. Where no value is given, the assumed value is ‘0.0’.

loop_
_diffrn_data_frame.id
_diffrn_data_frame.detector_element_id
_diffrn_data_frame.array_id
_diffrn_data_frame.binary_id
frame_1 d1_ccd_1 array_1 1
frame_1 d1_ccd_2 array_1 2
frame_1 d1_ccd_3 array_1 3
frame_1 d1_ccd_4 array_1 4

[axis]

_axis.offset[2] (float) The [2] element of the three-element vector used to specify the offset to the base of a rotation or translation axis. The vector is specified in millimetres. Where no value is given, the assumed value is ‘0.0’.

DIFFRN DETECTOR

*_diffrn_data_frame.array_id

(code)

_diffrn_frame_data.array_id (cif img.dic 1.0)

This item is a pointer to _array_structure.id in the ARRAY_STRUCTURE category.

[axis]

[diffrn_data_frame]

_axis.offset[3]

(float)

(*)_diffrn_data_frame.binary_id

The [3] element of the three-element vector used to specify the offset to the base of a rotation or translation axis. The vector is specified in millimetres. Where no value is given, the assumed value is ‘0.0’.

[axis]

(int)

_diffrn_frame_data.binary_id (cif img.dic 1.0)

This item is a pointer to _array_data.binary_id in the ARRAY_DATA category.

[diffrn_data_frame]

*_diffrn_data_frame.detector_element_id

(code)

_diffrn_frame_data.detector_element_id (cif img.dic 1.0)

_axis.type

This item is a pointer to _diffrn_detector_element.id in the DIFFRN_DETECTOR_ELEMENT category.

(ucode)

The value of _axis.type specifies the type of axis: ‘rotation’ or ‘translation’ (or ‘general’ when the type is not relevant, as for gravity).

[diffrn_data_frame]

*_diffrn_data_frame.id

The data value must be one of the following:

rotation      right-handed axis of rotation
translation   translation in the direction of the axis
general       axis for which the type is not relevant

The value of _diffrn_data_frame.id must uniquely identify each complete frame of data.

Where no value is given, the assumed value is ‘general’.

(code)

_diffrn_frame_data.id (cif img.dic 1.0)

The following item(s) have an equivalent role in their respective categories:

_diffrn_refln.frame_id , [axis]

_diffrn_scan.frame_id_start, _diffrn_scan.frame_id_end , _diffrn_scan_frame.frame_id ,

_axis.vector[1]

_diffrn_scan_frame_axis.frame_id .

(float)

The [1] element of the three-element vector used to specify the direction of a rotation or translation axis. The vector should be normalized to be a unit vector and is dimensionless. Where no value is given, the assumed value is ‘0.0’.

_axis.vector[2]

DIFFRN DETECTOR Data items in the DIFFRN_DETECTOR category describe the detector used to measure the scattered radiation, including any analyser and post-sample collimation.

[axis]

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_detector.diffrn_id _diffrn_detector.id

(float)

The [2] element of the three-element vector used to specify the direction of a rotation or translation axis. The vector should be normalized to be a unit vector and is dimensionless. Where no value is given, the assumed value is ‘0.0’.

Example 1 – based on PDB entry 5HVP and laboratory records for the structure corresponding to PDB entry 5HVP.

[axis]

_diffrn_detector.diffrn_id    'd1'
_diffrn_detector.detector     'multiwire'
_diffrn_detector.type         'Siemens'

_axis.vector[3] (float)

The [3] element of the three-element vector used to specify the direction of a rotation or translation axis. The vector should be normalized to be a unit vector and is dimensionless. Where no value is given, the assumed value is ‘0.0’.

[diffrn_data_frame]

_diffrn_detector.details

(text)

_diffrn_detector_details (cif core.dic 2.0.1)

A description of special aspects of the radiation detector. Example: ‘slow mode’.

[axis]


[diffrn_detector]

_diffrn_detector.detector

(text)

DIFFRN DETECTOR ELEMENT

_diffrn_radiation_detector (cifdic.c91 1.0) _diffrn_detector (cif core.dic 2.0)

Data items in the DIFFRN_DETECTOR_ELEMENT category record the details about spatial layout and other characteristics of each element of a detector which may have multiple elements. In most cases, giving more detailed information in ARRAY_STRUCTURE_LIST and ARRAY_STRUCTURE_LIST_AXIS is preferable to simply providing the centre of the detector element.

The general class of the radiation detector. Examples: ‘photographic film’, ‘scintillation counter’, ‘CCD plate’, ‘BF~3~ counter’.

[diffrn_detector]

*_diffrn_detector.diffrn_id

(code)

This data item is a pointer to _diffrn.id in the DIFFRN category. The value of _diffrn.id uniquely defines a set of diffraction data.

_diffrn_detector.dtime

Category group(s): inclusive_group array_data_group Category key(s): _diffrn_detector_element.id _diffrn_detector_element.detector_id

(float)

_diffrn_radiation_detector_dtime (cifdic.c91 1.0) _diffrn_detector_dtime (cif core.dic 2.0)

Example 1 Detector d1 is composed of four CCD detector elements, each 200 by 200 mm, arranged in a square, in the pattern

The deadtime in microseconds of the detector(s) used to measure the diffraction intensities. The permitted range is [0.0, ∞).

1

[diffrn_detector]

(*)_diffrn_detector.id

3

The value of _diffrn_detector.id must uniquely identify each detector used to collect each diffraction data set. If the value of _diffrn_detector.id is not given, it is implicitly equal to the value of _diffrn_detector.diffrn_id.

loop_ _diffrn_detector_element.detector_id _diffrn_detector_element.id _diffrn_detector_element.center[1] _diffrn_detector_element.center[2] d1 d1_ccd_1 201.5 -1.5 d1 d1_ccd_2 -1.8 -1.5 d1 d1_ccd_3 201.6 201.4 d1 d1_ccd_4 -1.7 201.5

[diffrn_detector]

_diffrn_detector.number_of_axes (int) The value of _diffrn_detector.number_of_axes gives the number of axes of the positioner for the detector identified by _diffrn_detector.id. The word ‘positioner’ is a general term used in instrumentation design for devices that are used to change the positions of portions of apparatus by linear translation, rotation or combinations of such motions. Axes which are used to provide a coordinate system for the face of an area detetctor should not be counted for this data item. The description of each axis should be provided by entries in DIFFRN_DETECTOR_AXIS.

_diffrn_detector_element.center[1] (float) The value of _diffrn_detector_element.center[1] is the X component of the distortion-corrected beam centre in millimetres from the (0, 0) (lower-left) corner of the detector element viewed from the sample side. The X and Y axes are the laboratory coordinate system coordinates defined in the AXIS category measured when all positioning axes for the detector are at their zero settings. If the resulting X or Y axis is then orthogonal to the detector, the Z axis is used instead of the orthogonal axis.

[diffrn_detector]

_diffrn_detector.type

4

Note that the beam centre is slightly displaced from each of the detector elements, just beyond the lower right corner of 1, the lower left corner of 2, the upper right corner of 3 and the upper left corner of 4.

The following item(s) have an equivalent role in their respective categories:

The permitted range is [1, ∞).

2 *

(code)

_diffrn_detector_axis.detector_id .

cif img.dic

(text)

_diffrn_detector_type (cif core.dic 2.0.1)

The make, model or name of the detector device used. [diffrn_detector]

Where no value is given, the assumed value is ‘0.0’.

[diffrn_detector_element]

DIFFRN DETECTOR AXIS Data items in the DIFFRN_DETECTOR_AXIS category associate axes with detectors.

_diffrn_detector_element.center[2] (float) The value of _diffrn_detector_element.center[2] is the Y component of the distortion-corrected beam centre in millimetres from the (0, 0) (lower-left) corner of the detector element viewed from the sample side. The X and Y axes are the laboratory coordinate system coordinates defined in the AXIS category measured when all positioning axes for the detector are at their zero settings. If the resulting X or Y axis is then orthogonal to the detector, the Z axis is used instead of the orthogonal axis.

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_detector_axis.detector_id _diffrn_detector_axis.axis_id

*_diffrn_detector_axis.axis_id

(code)

This data item is a pointer to _axis.id in the AXIS category.

Where no value is given, the assumed value is ‘0.0’.

[diffrn_detector_axis]

*_diffrn_detector_axis.detector_id

[diffrn_detector_element]

(code)

_diffrn_detector_axis.id (cif img.dic 1.0)

This data item is a pointer to _diffrn_detector.id in the DIFFRN_DETECTOR category. This item was previously named _diffrn_detector_axis.id, which is now a deprecated name. The old name is provided as an alias but should not be used for new work.

*_diffrn_detector_element.detector_id

(code)

This item is a pointer to _diffrn_detector.id in the DIFFRN_ DETECTOR category. [diffrn_detector_element]

[diffrn_detector_axis]

*_diffrn_detector_axis.id

(code)

*_diffrn_detector_element.id

(code)

This data item is a pointer to _diffrn_detector.id in the DIFFRN_DETECTOR category. Deprecated: do not use.

The value of _diffrn_detector_element.id must uniquely identify each element of a detector.

[diffrn_detector_axis]

[diffrn_detector_element]
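The beam-centre convention of _diffrn_detector_element.center[1] and _diffrn_detector_element.center[2] can be illustrated with the four CCD elements of Example 1 of the DIFFRN_DETECTOR_ELEMENT category. This is a sketch under stated assumptions: the function name `beam_position` is hypothetical, and the 200 by 200 mm element size is taken from the example text rather than from any data item.

```python
# Element size in millimetres, from the example description (assumed square).
ELEMENT_SIZE_MM = (200.0, 200.0)

# (element id, center[1], center[2]) rows copied from the example loop.
ELEMENTS = [
    ("d1_ccd_1", 201.5, -1.5),
    ("d1_ccd_2", -1.8, -1.5),
    ("d1_ccd_3", 201.6, 201.4),
    ("d1_ccd_4", -1.7, 201.5),
]


def beam_position(cx, cy, size=ELEMENT_SIZE_MM):
    """Locate the beam centre relative to one detector element.

    cx and cy are measured from the element's lower-left corner as seen
    from the sample, so a value below 0 or above the element size means
    the beam centre falls off the element on that side.
    """
    horiz = "left" if cx < 0 else "right" if cx > size[0] else "inside"
    vert = "lower" if cy < 0 else "upper" if cy > size[1] else "inside"
    return vert, horiz


positions = {name: beam_position(cx, cy) for name, cx, cy in ELEMENTS}
```

For the example values the beam centre lies beyond the lower right corner of element 1, the lower left of 2, the upper right of 3 and the upper left of 4, in agreement with the note accompanying the example.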


4.6. IMAGE DICTIONARY (imgCIF)

cif img.dic

DIFFRN MEASUREMENT

_diffrn_measurement.details

DIFFRN FRAME DATA

(text)

_diffrn_measurement_details (cif core.dic 2.0.1)

Data items in the DIFFRN_FRAME_DATA category record the details about each frame of data. The items in this category are now in the DIFFRN_DATA_FRAME category. The items in the DIFFRN_FRAME_DATA category are now deprecated. The items from this category are provided as aliases in version 1.0 of the dictionary but should not be used for new work. The items from the old category are provided in this dictionary for completeness but should not be used or cited. To avoid confusion, the example has been removed and the redundant parent–child links to other categories have been removed.

A description of special aspects of the intensity measurement. Example: ; 440 frames, 0.20 degrees, 150 sec, detector distance 12 cm, detector angle 22.5 degrees ; [diffrn_measurement]

(*)_diffrn_measurement.device

(text)

_diffrn_measurement_device (cif core.dic 2.0.1)

The general class of goniometer or device used to support and orient the specimen. If the value of _diffrn_measurement.device is not given, it is implicitly equal to the value of _diffrn_measurement.diffrn_id. Either _diffrn_measurement.device or _diffrn_measurement.id may be used to link to other categories. If the experimental setup admits multiple devices, then _diffrn_measurement.id is used to provide a unique link.

Category group(s): inclusive_group array_data_group Category key(s): _diffrn_frame_data.id _diffrn_frame_data.detector_element_id

THE DIFFRN_FRAME_DATA category is deprecated and should not be used. # EXAMPLE REMOVED #

The following item(s) have an equivalent role in their respective categories:

_diffrn_measurement_axis.measurement_device .

*_diffrn_frame_data.array_id

Examples: ‘3-circle camera’, ‘4-circle camera’, ‘kappa-geometry camera’, ‘oscillation camera’, ‘precession camera’. [diffrn_measurement]

(code)

This item is a pointer to _array_structure.id in the ARRAY_STRUCTURE category. Deprecated: do not use.

[diffrn_frame_data]

(*)_diffrn_frame_data.binary_id

(int)

This item is a pointer to _array_data.binary_id in the ARRAY_STRUCTURE category. Deprecated: do not use.

[diffrn_frame_data]

_diffrn_measurement.device_details

(text)

_diffrn_measurement_device_details (cif core.dic 2.0.1)

A description of special aspects of the device used to measure the diffraction intensities.

Example: ; commercial goniometer modified locally to allow for 90\% \t arc ; [diffrn_measurement]

*_diffrn_frame_data.detector_element_id

(code)

_diffrn_measurement.device_type

This item is a pointer to _diffrn_detector_element.id in the DIFFRN_DETECTOR_ELEMENT category. Deprecated: do not use.

(text)

_diffrn_measurement_device_type (cif core.dic 2.0.1)

[diffrn_frame_data]

The make, model or name of the measurement device (goniometer) used.

(code)

Examples: ‘Supper model q’, ‘Huber model r’, ‘Enraf-Nonius model s’, ‘home-made’. [diffrn_measurement]

*_diffrn_frame_data.id

The value of _diffrn_frame_data.id must uniquely identify each complete frame of data. Deprecated: do not use.

*_diffrn_measurement.diffrn_id

[diffrn_frame_data]

This data item is a pointer to _diffrn.id in the DIFFRN category.

(*)_diffrn_measurement.id

(code)

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_measurement.device _diffrn_measurement.diffrn_id _diffrn_measurement.id

The value of _diffrn_measurement.id must uniquely identify the set of mechanical characteristics of the device used to orient and/or position the sample used during the collection of each diffraction data set. If the value of _diffrn_measurement.id is not given, it is implicitly equal to the value of _diffrn_measurement.diffrn_id. Either _diffrn_measurement.device or _diffrn_measurement.id may be used to link to other categories. If the experimental setup admits multiple devices, then _diffrn_measurement.id is used to provide a unique link.
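The implicit-value rule stated for _diffrn_measurement.id (and, in the same form, for _diffrn_detector.id) can be sketched as a small lookup. This is illustrative only: the function `resolve_id` and the dictionary-of-strings row representation are assumptions, as is the choice to treat the CIF placeholders `.` and `?` as "not given".

```python
def resolve_id(row, id_name, fallback_name):
    """Return the value of id_name, falling back to fallback_name.

    row maps data names to string values for one category row.
    Treating '.' and '?' as absent is an assumption of this sketch.
    """
    value = row.get(id_name)
    if value is None or value in (".", "?"):
        return row[fallback_name]
    return value


# Row with no explicit id: the id is taken from diffrn_id.
implicit_row = {
    "_diffrn_measurement.diffrn_id": "d1",
    "_diffrn_measurement.device": "3-circle camera",
}

# Row with an explicit id, as in the DIFFRN_SCAN Example 2 data.
explicit_row = {
    "_diffrn_measurement.diffrn_id": "P6MB",
    "_diffrn_measurement.id": "GONIOMETER",
}

implicit_id = resolve_id(
    implicit_row, "_diffrn_measurement.id", "_diffrn_measurement.diffrn_id"
)
explicit_id = resolve_id(
    explicit_row, "_diffrn_measurement.id", "_diffrn_measurement.diffrn_id"
)
```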

Example 1 – based on PDB entry 5HVP and laboratory records for the structure corresponding to PDB entry 5HVP.

_diffrn_measurement_axis.measurement_id .

DIFFRN MEASUREMENT Data items in the DIFFRN_MEASUREMENT category record details about the device used to orient and/or position the crystal during data measurement and the manner in which the diffraction data were measured.

The following item(s) have an equivalent role in their respective categories:

_diffrn_measurement.diffrn_id      'd1'
_diffrn_measurement.device         '3-circle camera'
_diffrn_measurement.device_type    'Supper model x'
_diffrn_measurement.device_details 'none'
_diffrn_measurement.method         'omega scan'
_diffrn_measurement.details
; 440 frames, 0.20 degrees, 150 sec, detector distance 12 cm,
  detector angle 22.5 degrees
;

_diffrn_measurement.method

[diffrn_measurement]

(text)

_diffrn_measurement_method (cif core.dic 2.0.1)

Method used to measure intensities. Example: ‘profile data from theta/2theta (\q/2\q) scans’. [diffrn_measurement]

_diffrn_measurement.number_of_axes (int) The value of _diffrn_measurement.number_of_axes gives the number of axes of the positioner for the goniometer or other sample orientation or positioning device identified by _diffrn_measurement.id. The description of the axes should be provided by entries in DIFFRN_MEASUREMENT_AXIS.

Example 2 – based on data set TOZ of Willis, Beckwith & Tozer [Acta Cryst. (1991), C47, 2276–2277].

_diffrn_measurement.diffrn_id   's1'
_diffrn_measurement.device_type 'Philips PW1100/20 diffractometer'
_diffrn_measurement.method      'theta/2theta (\q/2\q)'

The permitted range is [1, ∞).


[diffrn_measurement]


_diffrn_measurement.specimen_support

_diffrn_radiation.div_x_source

(text)

The physical device used to support the crystal during data collection. Examples: ‘glass capillary’, ‘quartz capillary’, ‘fiber’, ‘metal loop’. [diffrn_measurement]

DIFFRN MEASUREMENT AXIS Data items in the DIFFRN_MEASUREMENT_AXIS category associate axes with goniometers.

[diffrn_radiation]

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_measurement_axis.measurement_device _diffrn_measurement_axis.measurement_id _diffrn_measurement_axis.axis_id

_diffrn_radiation.div_x_y_source (float) Beam crossfire correlation in degrees squared between the crossfire laboratory X axis component and the crossfire laboratory Y axis component (see AXIS category). This is a characteristic of the X-ray beam as it illuminates the sample (or specimen) after all monochromation and collimation. This is the mean of the products of the deviations of the direction of each photon in the XZ plane times the deviations of the direction of the same photon in the YZ plane around the mean source beam direction. This will be zero for uncorrelated crossfire. Note that for some synchrotrons this value is specified in milliradians squared, in which case a conversion is needed. To convert a value in milliradians squared to a value in degrees squared, multiply by 0.180² and divide by π².

(code)

This data item is a pointer to _axis.id in the AXIS category. [diffrn_measurement_axis]

*_diffrn_measurement_axis.id

(code)

This data item is a pointer to _diffrn_measurement.id in the DIFFRN_MEASUREMENT category. Deprecated: do not use. [diffrn_measurement_axis]

(*)_diffrn_measurement_axis.measurement_device (text)

This data item is a pointer to _diffrn_measurement.device in the DIFFRN_MEASUREMENT category.

Where no value is given, the assumed value is ‘0.0’.

[diffrn_measurement_axis]

(*)_diffrn_measurement_axis.measurement_id

[diffrn_measurement_axis]

Where no value is given, the assumed value is ‘0.0’.

DIFFRN RADIATION Data items in the DIFFRN_RADIATION category describe the radiation used for measuring diffraction intensities, its collimation and monochromatization before the sample. Post-sample treatment of the beam is described by data items in the DIFFRN_DETECTOR category.

_diffrn_radiation.filter_edge

(float)

Absorption edge in ångströms of the radiation filter used. The permitted range is [0.0, ∞).

_diffrn_radiation.inhomogeneity

[diffrn_radiation]

(float)

_diffrn_radiation_inhomogeneity (cif core.dic 2.0.1)

Example 1 – based on PDB entry 5HVP and laboratory records for the structure corresponding to PDB entry 5HVP.

Half-width in millimetres of the incident beam in the direction perpendicular to the diffraction plane.

'set1' '0.3 mm double pinhole' 'graphite' 'Cu K\a' 1

The permitted range is [0.0, ∞).

_diffrn_radiation.monochromator

[diffrn_radiation]

(text)

_diffrn_radiation_monochromator (cif core.dic 2.0.1)

Example 2 – based on data set TOZ of Willis, Beckwith & Tozer [Acta Cryst. (1991), C47, 2276–2277].

The method used to obtain monochromatic radiation. If a monochromator crystal is used, the material and the indices of the Bragg reflection are specified.

1 'Cu K\a' 'graphite'

Examples: ‘Zr filter’, ‘Ge 220’, ‘none’, ‘equatorial mounted graphite’. [diffrn_radiation]

(text)

_diffrn_radiation_collimation (cif core.dic 2.0.1)

_diffrn_radiation.polarisn_norm

The collimation or focusing applied to the radiation.

(float)

_diffrn_radiation_polarisn_norm (cif core.dic 2.0.1)

Examples: ‘0.3 mm double-pinhole’, ‘0.5 mm’, ‘focusing mirrors’. [diffrn_radiation]

*_diffrn_radiation.diffrn_id

[diffrn_radiation]

_diffrn_radiation_filter_edge (cif core.dic 2.0.1)

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_radiation.diffrn_id

_diffrn_radiation.collimation

(float)

Beam crossfire in degrees parallel to the laboratory Y axis (see AXIS category). This is a characteristic of the X-ray beam as it illuminates the sample (or specimen) after all monochromation and collimation. This is the standard uncertainty (e.s.d.) of the directions of photons in the YZ plane around the mean source beam direction. Note that for some synchrotrons this value is specified in milliradians, in which case a conversion is needed. To convert a value in milliradians to a value in degrees, multiply by 0.180 and divide by π.

_diffrn_measurement_axis.id (cif img.dic 1.0)

_diffrn_radiation.wavelength_id _diffrn_radiation.type _diffrn_radiation.monochromator

[diffrn_radiation]

_diffrn_radiation.div_y_source

(code)

This data item is a pointer to _diffrn_measurement.id in the DIFFRN_MEASUREMENT category. This item was previously named _diffrn_measurement_axis.id, which is now a deprecated name. The old name is provided as an alias but should not be used for new work.

_diffrn_radiation.diffrn_id _diffrn_radiation.collimation _diffrn_radiation.monochromator _diffrn_radiation.type _diffrn_radiation.wavelength_id

(float)

Beam crossfire in degrees parallel to the laboratory X axis (see AXIS category). This is a characteristic of the X-ray beam as it illuminates the sample (or specimen) after all monochromation and collimation. This is the standard uncertainty (e.s.d.) of the directions of photons in the XZ plane around the mean source beam direction. Note that for some synchrotrons this value is specified in milliradians, in which case a conversion is needed. To convert a value in milliradians to a value in degrees, multiply by 0.180 and divide by π.
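The unit conversions quoted for the crossfire items (_diffrn_radiation.div_x_source, _diffrn_radiation.div_y_source and _diffrn_radiation.div_x_y_source) follow from 1 mrad = 0.001 × 180/π degrees. A minimal sketch; the function names are illustrative and not defined by the dictionary.

```python
import math

# One milliradian expressed in degrees: 0.001 * 180 / pi = 0.180 / pi.
MRAD_TO_DEG = 0.180 / math.pi


def mrad_to_deg(value_mrad):
    """Convert a crossfire given in milliradians to degrees."""
    return value_mrad * MRAD_TO_DEG


def mrad2_to_deg2(value_mrad2):
    """Convert a crossfire correlation in milliradians squared to degrees squared."""
    return value_mrad2 * MRAD_TO_DEG ** 2


# Hypothetical beamline figure of 1.4 mrad, used only to exercise the function.
example_deg = mrad_to_deg(1.4)
```

The squared factor applies only to _diffrn_radiation.div_x_y_source, which is a product of two angular deviations.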

_diffrn_measurement_specimen_support (cif core.dic 2.0.1)

*_diffrn_measurement_axis.axis_id


The angle in degrees, as viewed from the specimen, between the perpendicular component of the polarization and the diffraction plane. See _diffrn_radiation_polarisn_ratio.

(code)

This data item is a pointer to _diffrn.id in the DIFFRN category.

The permitted range is [−90.0, 90.0].


[diffrn_radiation]


_diffrn_radiation.polarisn_ratio

(float)

*_diffrn_radiation.wavelength_id

_diffrn_radiation_polarisn_ratio (cif core.dic 2.0.1)

(code)

This data item is a pointer to _diffrn_radiation_wavelength.id in the DIFFRN_RADIATION_WAVELENGTH category.

Polarization ratio of the diffraction beam incident on the crystal. This is the ratio of the perpendicularly polarized to the parallel polarized component of the radiation. The perpendicular component forms an angle of _diffrn_radiation.polarisn_norm to the normal to the diffraction plane of the sample (i.e. the plane containing the incident and reflected beams). The permitted range is [0.0, ∞).

DIFFRN SCAN

[diffrn_radiation]

_diffrn_radiation.xray_symbol

(line)

_diffrn_radiation_xray_symbol (cif core.dic 2.0.1)

The IUPAC symbol for the X-ray wavelength for the probe radiation.

[diffrn_radiation]

The data value must be one of the following:

K-L~3~     Kα1 in older Siegbahn notation
K-L~2~     Kα2 in older Siegbahn notation
K-M~3~     Kβ in older Siegbahn notation
K-L~2,3~   use where K-L~3~ and K-L~2~ are not resolved

[diffrn_radiation]

_diffrn_radiation.polarizn_source_norm

(float)

The angle in degrees, as viewed from the specimen, between the normal to the polarization plane and the laboratory Y axis as defined in the AXIS category. Note that this is the angle of polarization of the source photons, either directly from a synchrotron beamline or from a monochromator. This differs from the value of _diffrn_radiation.polarisn_norm in that _diffrn_radiation.polarisn_norm refers to polarization relative to the diffraction plane rather than to the laboratory axis system. In the case of an unpolarized beam, or a beam with true circular polarization, in which no single plane of polarization can be determined, the plane should be taken as the XZ plane and the angle as 0. See _diffrn_radiation.polarizn_source_ratio.

DIFFRN REFLN This category redefinition has been added to extend the key of the standard DIFFRN_REFLN category. Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_refln.frame_id

The permitted range is [−90.0, 90.0]. Where no value is given, the assumed value is ‘0.0’. [diffrn_radiation]

*_diffrn_refln.frame_id

_diffrn_radiation.polarizn_source_ratio

(float)

$(I_p - I_n)/(I_p + I_n)$, where $I_p$ is the intensity (amplitude squared) of the electric vector in the plane of polarization and $I_n$ is the intensity (amplitude squared) of the electric vector in the plane of the normal to the plane of polarization. In the case of an unpolarized beam, or a beam with true circular polarization, in which no single plane of polarization can be determined, the plane is to be taken as the XZ plane and the normal is parallel to the Y axis. Thus, if there was complete polarization in the plane of polarization, the value of _diffrn_radiation.polarizn_source_ratio would be 1, and for an unpolarized beam _diffrn_radiation.polarizn_source_ratio would have a value of 0. If the X axis has been chosen to lie in the plane of polarization, this definition will agree with the definition of ‘MONOCHROMATOR’ in the Denzo glossary, and values of near 1 should be expected for a bending-magnet source. However, if the X axis were perpendicular to the polarization plane (not a common choice), then the Denzo value would be the negative of _diffrn_radiation.polarizn_source_ratio. [See http://www.hkl-xray.com for information on Denzo, and Otwinowski & Minor (1997).] This differs both in the choice of ratio and choice of orientation from _diffrn_radiation.polarisn_ratio, which, unlike _diffrn_radiation.polarizn_source_ratio, is unbounded.

Reference: Otwinowski, Z. & Minor, W. (1997). Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326.

The permitted range is [−1.0, 1.0].
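The definition of _diffrn_radiation.polarizn_source_ratio can be written out directly. This is a sketch, not a prescribed implementation: the function name and argument names are illustrative, and the two intensities are assumed to be non-negative with a positive sum.

```python
def polarizn_source_ratio(i_p, i_n):
    """Return (I_p - I_n) / (I_p + I_n).

    i_p: intensity of the electric vector in the plane of polarization.
    i_n: intensity of the electric vector in the plane of the normal to it.
    The result always lies in [-1.0, 1.0] for non-negative intensities.
    """
    if i_p < 0 or i_n < 0:
        raise ValueError("intensities must be non-negative")
    total = i_p + i_n
    if total == 0:
        raise ValueError("total intensity must be positive")
    return (i_p - i_n) / total


fully_polarized = polarizn_source_ratio(1.0, 0.0)   # complete polarization
unpolarized = polarizn_source_ratio(0.5, 0.5)       # equal components
mostly_polarized = polarizn_source_ratio(0.9, 0.1)  # 90% in the polarization plane
```

Exchanging the two arguments changes the sign of the result, which corresponds to the sign reversal described for the case where the X axis is perpendicular to the polarization plane.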

_diffrn_radiation.probe

[diffrn_refln]

DIFFRN SCAN Data items in the DIFFRN_SCAN category describe the parameters of one or more scans, relating axis positions to frames. Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_scan.id

Example 1 – derived from a suggestion by R. M. Sweet. The vector of each axis is not given here, because it is provided in the AXIS category. By making _diffrn_scan_axis.scan_id and _diffrn_scan_axis.axis_id keys of the DIFFRN_SCAN_AXIS category, an arbitrary number of scanning and fixed axes can be specified for a scan. In this example, three rotation axes and one translation axis at nonzero values are specified, with one axis stepping. There is no reason why more axes could not have been specified to step. Range information has been specified, but note that it can be calculated from the number of frames and the increment, so the data item _diffrn_scan_axis.angle_range could be dropped. Both the sweep data and the data for a single frame are specified. Note that the information on how the axes are stepped is given twice, once in terms of the overall averages in the value of _diffrn_scan.integration_time and the values for DIFFRN_SCAN_AXIS, and precisely for the given frame in the value for _diffrn_scan_frame.integration_time and the values for DIFFRN_SCAN_FRAME_AXIS. If dose-related adjustments are made to scan times and nonlinear stepping is done, these values may differ. Therefore, in interpreting the data for a particular frame it is important to use the frame-specific data.

_diffrn_scan.id
_diffrn_scan.date_start
_diffrn_scan.date_end
_diffrn_scan.integration_time
_diffrn_scan.frame_id_start
_diffrn_scan.frame_id_end
_diffrn_scan.frames

[diffrn_radiation]

(line)

_diffrn_radiation_probe (cif core.dic 2.0.1)

Name of the type of radiation used. It is strongly recommended that this be given so that the probe radiation is clearly specified.

_diffrn_radiation.type

1 '2001-11-18T03:26:42' '2001-11-18T03:36:45' 3.0 mad_L2_000 mad_L2_200 201

loop_
_diffrn_scan_axis.scan_id
_diffrn_scan_axis.axis_id
_diffrn_scan_axis.angle_start
_diffrn_scan_axis.angle_range
_diffrn_scan_axis.angle_increment
_diffrn_scan_axis.displacement_start
_diffrn_scan_axis.displacement_range
_diffrn_scan_axis.displacement_increment
1 omega 200.0 20.0 0.1   .   .   .
1 kappa -40.0  0.0 0.0   .   .   .
1 phi   127.5  0.0 0.0   .   .   .
1 tranz     .    .   . 2.3 0.0 0.0

The data value must be one of the following:

x-ray neutron electron gamma

(code)

This item is a pointer to _diffrn_data_frame.id in the DIFFRN_DATA_FRAME category.

[diffrn_radiation]

(line)

_diffrn_radiation_type (cif core.dic 2.0.1)

The nature of the radiation. This is typically a description of the X-ray wavelength in Siegbahn notation. Examples: ‘CuK\a’, ‘Cu K\a~1~’, ‘Cu K-L~2,3~’, ‘white-beam’. [diffrn_radiation]


_diffrn_scan_frame.scan_id
_diffrn_scan_frame.date
_diffrn_scan_frame.integration_time
_diffrn_scan_frame.frame_id
_diffrn_scan_frame.frame_number

MAR345-SN26 DETECTOR_Z MAR345-SN26 DETECTOR_PITCH

1 '2001-11-18T03:27:33' 3.0 mad_L2_018 18

# category DIFFRN_DETECTOR_ELEMENT loop_ _diffrn_detector_element.id _diffrn_detector_element.detector_id ELEMENT1 MAR345-SN26

loop_
_diffrn_scan_frame_axis.frame_id
_diffrn_scan_frame_axis.axis_id
_diffrn_scan_frame_axis.angle
_diffrn_scan_frame_axis.angle_increment
_diffrn_scan_frame_axis.displacement
_diffrn_scan_frame_axis.displacement_increment
mad_L2_018 omega 201.8 0.1   .   .
mad_L2_018 kappa -40.0 0.0   .   .
mad_L2_018 phi   127.5 0.0   .   .
mad_L2_018 tranz     .   . 2.3 0.0
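The relationships among the values in Example 1 of the DIFFRN_SCAN category can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the dictionary text does not state: that the angle range runs from the start of the first frame to the start of the last frame, and that frames are counted from zero as in the frame identifiers mad_L2_000 to mad_L2_200. Both readings are chosen because they reproduce the printed values. The variable and function names are illustrative.

```python
import math
from datetime import datetime

# Values copied from the DIFFRN_SCAN and DIFFRN_SCAN_AXIS data of Example 1.
date_start = datetime.fromisoformat("2001-11-18T03:26:42")
date_end = datetime.fromisoformat("2001-11-18T03:36:45")
frames = 201
integration_time = 3.0      # seconds per frame (overall average)
angle_start = 200.0         # omega start, degrees
angle_increment = 0.1       # omega step per frame, degrees

# Elapsed wall-clock time of the scan in seconds.
elapsed_s = (date_end - date_start).total_seconds()

# Total exposure if every frame used the average integration time.
total_exposure_s = frames * integration_time

# Range from the first frame's start to the last frame's start (assumption).
angle_range = (frames - 1) * angle_increment


def frame_angle(frame_index):
    """Starting omega angle of a frame, counting frames from zero (assumption)."""
    return angle_start + frame_index * angle_increment


times_agree = math.isclose(elapsed_s, total_exposure_s)
```

In this example the elapsed time equals the summed exposure, but in general readout overheads or dose-related adjustments would make the two differ, which is why the frame-specific values should be preferred when interpreting a particular frame.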

# category DIFFRN_DATA_FRAME loop_ _diffrn_data_frame.id _diffrn_data_frame.detector_element_id _diffrn_data_frame.array_id _diffrn_data_frame.binary_id FRAME1 ELEMENT1 ARRAY1 1 # category DIFFRN_MEASUREMENT loop_ _diffrn_measurement.diffrn_id _diffrn_measurement.id _diffrn_measurement.number_of_axes _diffrn_measurement.method P6MB GONIOMETER 3 rotation

Example 2 – a more extensive example (R. M. Sweet, P. J. Ellis & H. J. Bernstein). A detector is placed 240 mm along the Z axis from the goniometer. This leads to a choice: either the axes of the detector are defined at the origin, and then a Z setting of −240 is entered, or the axes are defined with the necessary Z offset. In this case, the setting is used and the offset is left as zero. This axis is called DETECTOR_Z. The axis for positioning the detector in the Y direction depends on the detector Z axis. This axis is called DETECTOR_Y. The axis for positioning the detector in the X direction depends on the detector Y axis (and therefore on the detector Z axis). This axis is called DETECTOR_X. This detector may be rotated around the Y axis. This rotation axis depends on the three translation axes. It is called DETECTOR_PITCH. A coordinate system is defined on the face of the detector in terms of 2300 pixels of 0.150 mm in each direction. The ELEMENT_X axis is used to index the first array index of the data array and the ELEMENT_Y axis is used to index the second array index. Because the pixels are 0.150 × 0.150 mm, the centre of the first pixel is at (0.075, 0.075) in this coordinate system.
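The pixel-centre arithmetic in this description generalizes to any array index. A minimal sketch, assuming zero-based indices and a uniform pixel pitch; the function name `pixel_centre` is illustrative, and the default arguments are the displacement and displacement increment given for ELEMENT_X and ELEMENT_Y in the example.

```python
PIXELS_PER_AXIS = 2300   # array dimension along each axis, from the example
PIXEL_PITCH_MM = 0.150   # pixel size along each axis, in millimetres


def pixel_centre(index, displacement=0.075, increment=PIXEL_PITCH_MM):
    """Position in mm of the centre of pixel `index` along one element axis.

    index is zero-based (an assumption of this sketch); displacement is the
    position of the first pixel centre and increment the step between centres.
    """
    if not 0 <= index < PIXELS_PER_AXIS:
        raise IndexError("pixel index outside the array")
    return displacement + index * increment


first_centre = (pixel_centre(0), pixel_centre(0))
last_centre = pixel_centre(PIXELS_PER_AXIS - 1)
active_width_mm = PIXELS_PER_AXIS * PIXEL_PITCH_MM
```

With these values the last pixel centre lies half a pixel inside the far edge of a face roughly 345 mm wide.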

# category DIFFRN_MEASUREMENT_AXIS loop_ _diffrn_measurement_axis.measurement_id _diffrn_measurement_axis.axis_id GONIOMETER GONIOMETER_PHI GONIOMETER GONIOMETER_KAPPA GONIOMETER GONIOMETER_OMEGA # category DIFFRN_SCAN loop_ _diffrn_scan.id _diffrn_scan.frame_id_start _diffrn_scan.frame_id_end _diffrn_scan.frames SCAN1 FRAME1 FRAME1 1

###CBF: VERSION 1.1 data_image_1 # category DIFFRN _diffrn.id P6MB _diffrn.crystal_id P6MB_CRYSTAL7

# category DIFFRN_SCAN_AXIS loop_ _diffrn_scan_axis.scan_id _diffrn_scan_axis.axis_id _diffrn_scan_axis.angle_start _diffrn_scan_axis.angle_range _diffrn_scan_axis.angle_increment _diffrn_scan_axis.displacement_start _diffrn_scan_axis.displacement_range _diffrn_scan_axis.displacement_increment SCAN1 GONIOMETER_OMEGA 12.0 1.0 1.0 0.0 0.0 0.0 SCAN1 GONIOMETER_KAPPA 23.3 0.0 0.0 0.0 0.0 0.0 SCAN1 GONIOMETER_PHI -165.8 0.0 0.0 0.0 0.0 0.0 SCAN1 DETECTOR_Z 0.0 0.0 0.0 -240.0 0.0 0.0 SCAN1 DETECTOR_Y 0.0 0.0 0.0 0.6 0.0 0.0 SCAN1 DETECTOR_X 0.0 0.0 0.0 -0.5 0.0 0.0 SCAN1 DETECTOR_PITCH 0.0 0.0 0.0 0.0 0.0 0.0

# category DIFFRN_SOURCE loop_ _diffrn_source.diffrn_id _diffrn_source.source _diffrn_source.type P6MB synchrotron ’SSRL beamline 9-1’ # category DIFFRN_RADIATION loop_ _diffrn_radiation.diffrn_id _diffrn_radiation.wavelength_id _diffrn_radiation.monochromator _diffrn_radiation.polarizn_source_ratio _diffrn_radiation.polarizn_source_norm _diffrn_radiation.div_x_source _diffrn_radiation.div_y_source _diffrn_radiation.div_x_y_source P6MB WAVELENGTH1 ’Si 111’ 0.8 0.0 0.08 0.01 0.00

# category DIFFRN_SCAN_FRAME loop_ _diffrn_scan_frame.frame_id _diffrn_scan_frame.frame_number _diffrn_scan_frame.integration_time _diffrn_scan_frame.scan_id _diffrn_scan_frame.date FRAME1 1 20.0 SCAN1 1997-12-04T10:23:48

# category DIFFRN_RADIATION_WAVELENGTH loop_ _diffrn_radiation_wavelength.id _diffrn_radiation_wavelength.wavelength _diffrn_radiation_wavelength.wt WAVELENGTH1 0.98 1.0 # category DIFFRN_DETECTOR loop_ _diffrn_detector.diffrn_id _diffrn_detector.id _diffrn_detector.type _diffrn_detector.number_of_axes P6MB MAR345-SN26 ’MAR 345’ 4

# category DIFFRN_SCAN_FRAME_AXIS loop_ _diffrn_scan_frame_axis.frame_id _diffrn_scan_frame_axis.axis_id _diffrn_scan_frame_axis.angle _diffrn_scan_frame_axis.displacement FRAME1 GONIOMETER_OMEGA 12.0 0.0 FRAME1 GONIOMETER_KAPPA 23.3 0.0 FRAME1 GONIOMETER_PHI -165.8 0.0 FRAME1 DETECTOR_Z 0.0 -240.0 FRAME1 DETECTOR_Y 0.0 0.6 FRAME1 DETECTOR_X 0.0 -0.5 FRAME1 DETECTOR_PITCH 0.0 0.0

# category DIFFRN_DETECTOR_AXIS loop_ _diffrn_detector_axis.detector_id _diffrn_detector_axis.axis_id MAR345-SN26 DETECTOR_X MAR345-SN26 DETECTOR_Y


# category AXIS
loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1] _axis.vector[2] _axis.vector[3]
_axis.offset[1] _axis.offset[2] _axis.offset[3]
GONIOMETER_OMEGA rotation    goniometer .                1       0  0       .       .       . .
GONIOMETER_KAPPA rotation    goniometer GONIOMETER_OMEGA 0.64279 0  0.76604 .       .       .
GONIOMETER_PHI   rotation    goniometer GONIOMETER_KAPPA 1       0  0       .       .       .
SOURCE           general     source     .                0       0  1       .       .       .
GRAVITY          general     gravity    .                0       -1 0       .       .       .
DETECTOR_Z       translation detector   .                0       0  1       0       0       0
DETECTOR_Y       translation detector   DETECTOR_Z       0       1  0       0       0       0
DETECTOR_X       translation detector   DETECTOR_Y       1       0  0       0       0       0
DETECTOR_PITCH   rotation    detector   DETECTOR_X       0       1  0       0       0       0
ELEMENT_X        translation detector   DETECTOR_PITCH   1       0  0       172.43  -172.43 0
ELEMENT_Y        translation detector   ELEMENT_X        0       1  0       0       0       0


X-Binary-Size: 3801324 X-Binary-ID: 1 X-Binary-Element-Type: "signed 32-bit integer" Content-MD5: 07lZFvF+aOcW85IN7usl8A== AABRAAAAAAAAAAAAAAAAAAAA ...AAZBQSr1sKNBOeOe9HITdMdDUnbq7bg ... 8REo6TtBrxJ1vKqAvx9YDMD6...r/tgssjMIJMXATDsZobL90AEXc4KigE --CIF-BINARY-FORMAT-SECTION---;

Example 3 – Example 2 revised for a spiral scan (R. M. Sweet, P. J. Ellis & H. J. Bernstein). A detector is placed 240 mm along the Z axis from the goniometer, as in Example 2 above, but in this example the image plate is scanned in a spiral pattern from the outside edge in. The axis for positioning the detector in the Y direction depends on the detector Z axis. This axis is called DETECTOR_Y. The axis for positioning the detector in the X direction depends on the detector Y axis (and therefore on the detector Z axis). This axis is called DETECTOR_X. This detector may be rotated around the Y axis. This rotation axis depends on the three translation axes. It is called DETECTOR_PITCH. A coordinate system is defined on the face of the detector in terms of a coupled rotation axis and radial scan axis to form a spiral scan. The rotation axis is called ELEMENT_ROT and the radial axis is called ELEMENT_RAD. A 150 µm radial pitch and a 75 µm ‘constant velocity’ angular pitch are assumed. Indexing is carried out first on the rotation axis and the radial axis is made to be dependent on it. The two axes are coupled to form an axis set ELEMENT_SPIRAL.

# category ARRAY_STRUCTURE_LIST loop_ _array_structure_list.array_id _array_structure_list.index _array_structure_list.dimension _array_structure_list.precedence _array_structure_list.direction _array_structure_list.axis_set_id ARRAY1 1 2300 1 increasing ELEMENT_X ARRAY1 2 2300 2 increasing ELEMENT_Y

###CBF: VERSION 1.1 data_image_1 # category DIFFRN _diffrn.id P6MB _diffrn.crystal_id P6MB_CRYSTAL7 # category DIFFRN_SOURCE loop_ _diffrn_source.diffrn_id _diffrn_source.source _diffrn_source.type P6MB synchrotron ’SSRL beamline 9-1’

# category ARRAY_STRUCTURE_LIST_AXIS loop_ _array_structure_list_axis.axis_set_id _array_structure_list_axis.axis_id _array_structure_list_axis.displacement _array_structure_list_axis.displacement_increment ELEMENT_X ELEMENT_X 0.075 0.150 ELEMENT_Y ELEMENT_Y 0.075 0.150

# category DIFFRN_RADIATION loop_ _diffrn_radiation.diffrn_id _diffrn_radiation.wavelength_id _diffrn_radiation.monochromator _diffrn_radiation.polarizn_source_ratio _diffrn_radiation.polarizn_source_norm _diffrn_radiation.div_x_source _diffrn_radiation.div_y_source _diffrn_radiation.div_x_y_source P6MB WAVELENGTH1 ’Si 111’ 0.8 0.0 0.08 0.01 0.00

# category ARRAY_ELEMENT_SIZE loop_ _array_element_size.array_id _array_element_size.index _array_element_size.size ARRAY1 1 150e-6 ARRAY1 2 150e-6 # category ARRAY_INTENSITIES loop_ _array_intensities.array_id _array_intensities.binary_id _array_intensities.linearity _array_intensities.gain _array_intensities.gain_esd _array_intensities.overload _array_intensities.undefined_value ARRAY1 1 linear 1.15 0.2 240000 0

# category DIFFRN_RADIATION_WAVELENGTH loop_ _diffrn_radiation_wavelength.id _diffrn_radiation_wavelength.wavelength _diffrn_radiation_wavelength.wt WAVELENGTH1 0.98 1.0 # category DIFFRN_DETECTOR loop_ _diffrn_detector.diffrn_id _diffrn_detector.id _diffrn_detector.type _diffrn_detector.number_of_axes P6MB MAR345-SN26 ’MAR 345’ 4

# category ARRAY_STRUCTURE loop_ _array_structure.id _array_structure.encoding_type _array_structure.compression_type _array_structure.byte_order ARRAY1 "signed 32-bit integer" packed little_endian

# category DIFFRN_DETECTOR_AXIS loop_ _diffrn_detector_axis.detector_id _diffrn_detector_axis.axis_id MAR345-SN26 DETECTOR_X MAR345-SN26 DETECTOR_Y MAR345-SN26 DETECTOR_Z MAR345-SN26 DETECTOR_PITCH

# category ARRAY_DATA
loop_
_array_data.array_id
_array_data.binary_id
_array_data.data
ARRAY1 1
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
     conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BASE64

# category DIFFRN_DETECTOR_ELEMENT loop_ _diffrn_detector_element.id _diffrn_detector_element.detector_id ELEMENT1 MAR345-SN26


DIFFRN SCAN

4. DATA DICTIONARIES

# category DIFFRN_DATA_FRAME
loop_
_diffrn_data_frame.id
_diffrn_data_frame.detector_element_id
_diffrn_data_frame.array_id
_diffrn_data_frame.binary_id
FRAME1 ELEMENT1 ARRAY1 1

cif_img.dic

# category AXIS
loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1]
_axis.vector[2]
_axis.vector[3]
_axis.offset[1]
_axis.offset[2]
_axis.offset[3]
GONIOMETER_OMEGA rotation goniometer . 1 0 0 . . .
GONIOMETER_KAPPA rotation goniometer GONIOMETER_OMEGA 0.64279 0 0.76604 . . .
GONIOMETER_PHI rotation goniometer GONIOMETER_KAPPA 1 0 0 . . .
SOURCE general source . 0 0 1 . . .
GRAVITY general gravity . 0 -1 0 . . .
DETECTOR_Z translation detector . 0 0 1 0 0 0
DETECTOR_Y translation detector DETECTOR_Z 0 1 0 0 0 0
DETECTOR_X translation detector DETECTOR_Y 1 0 0 0 0 0
DETECTOR_PITCH rotation detector DETECTOR_X 0 1 0 0 0 0
ELEMENT_ROT translation detector DETECTOR_PITCH 0 0 1 0 0 0
ELEMENT_RAD translation detector ELEMENT_ROT 0 1 0 0 0 0

# category DIFFRN_MEASUREMENT
loop_
_diffrn_measurement.diffrn_id
_diffrn_measurement.id
_diffrn_measurement.number_of_axes
_diffrn_measurement.method
P6MB GONIOMETER 3 rotation

# category ARRAY_STRUCTURE_LIST
loop_
_array_structure_list.array_id
_array_structure_list.index
_array_structure_list.dimension
_array_structure_list.precedence
_array_structure_list.direction
_array_structure_list.axis_set_id
ARRAY1 1 8309900 1 increasing ELEMENT_SPIRAL

# category DIFFRN_MEASUREMENT_AXIS
loop_
_diffrn_measurement_axis.measurement_id
_diffrn_measurement_axis.axis_id
GONIOMETER GONIOMETER_PHI
GONIOMETER GONIOMETER_KAPPA
GONIOMETER GONIOMETER_OMEGA

# category ARRAY_STRUCTURE_LIST_AXIS
loop_
_array_structure_list_axis.axis_set_id
_array_structure_list_axis.axis_id
_array_structure_list_axis.angle
_array_structure_list_axis.displacement
_array_structure_list_axis.angular_pitch
_array_structure_list_axis.radial_pitch
ELEMENT_SPIRAL ELEMENT_ROT 0 . 0.075 .
ELEMENT_SPIRAL ELEMENT_RAD . 172.5 . -0.150

# category DIFFRN_SCAN
loop_
_diffrn_scan.id
_diffrn_scan.frame_id_start
_diffrn_scan.frame_id_end
_diffrn_scan.frames
SCAN1 FRAME1 FRAME1 1

# category ARRAY_ELEMENT_SIZE
# The actual pixels are 0.075 by 0.150 mm.
# We give the coarser dimension here.
loop_
_array_element_size.array_id
_array_element_size.index
_array_element_size.size
ARRAY1 1 150e-6

# category DIFFRN_SCAN_AXIS
loop_
_diffrn_scan_axis.scan_id
_diffrn_scan_axis.axis_id
_diffrn_scan_axis.angle_start
_diffrn_scan_axis.angle_range
_diffrn_scan_axis.angle_increment
_diffrn_scan_axis.displacement_start
_diffrn_scan_axis.displacement_range
_diffrn_scan_axis.displacement_increment
SCAN1 GONIOMETER_OMEGA 12.0 1.0 1.0 0.0 0.0 0.0
SCAN1 GONIOMETER_KAPPA 23.3 0.0 0.0 0.0 0.0 0.0
SCAN1 GONIOMETER_PHI -165.8 0.0 0.0 0.0 0.0 0.0
SCAN1 DETECTOR_Z 0.0 0.0 0.0 -240.0 0.0 0.0
SCAN1 DETECTOR_Y 0.0 0.0 0.0 0.6 0.0 0.0
SCAN1 DETECTOR_X 0.0 0.0 0.0 -0.5 0.0 0.0
SCAN1 DETECTOR_PITCH 0.0 0.0 0.0 0.0 0.0 0.0

# category ARRAY_INTENSITIES
loop_
_array_intensities.array_id
_array_intensities.binary_id
_array_intensities.linearity
_array_intensities.gain
_array_intensities.gain_esd
_array_intensities.overload
_array_intensities.undefined_value
ARRAY1 1 linear 1.15 0.2 240000 0

# category DIFFRN_SCAN_FRAME
loop_
_diffrn_scan_frame.frame_id
_diffrn_scan_frame.frame_number
_diffrn_scan_frame.integration_time
_diffrn_scan_frame.scan_id
_diffrn_scan_frame.date
FRAME1 1 20.0 SCAN1 1997-12-04T10:23:48

# category ARRAY_STRUCTURE
loop_
_array_structure.id
_array_structure.encoding_type
_array_structure.compression_type
_array_structure.byte_order
ARRAY1 "signed 32-bit integer" packed little_endian

# category DIFFRN_SCAN_FRAME_AXIS
loop_
_diffrn_scan_frame_axis.frame_id
_diffrn_scan_frame_axis.axis_id
_diffrn_scan_frame_axis.angle
_diffrn_scan_frame_axis.displacement
FRAME1 GONIOMETER_OMEGA 12.0 0.0
FRAME1 GONIOMETER_KAPPA 23.3 0.0
FRAME1 GONIOMETER_PHI -165.8 0.0
FRAME1 DETECTOR_Z 0.0 -240.0
FRAME1 DETECTOR_Y 0.0 0.6
FRAME1 DETECTOR_X 0.0 -0.5
FRAME1 DETECTOR_PITCH 0.0 0.0

# category ARRAY_DATA
loop_
_array_data.array_id
_array_data.binary_id
_array_data.data
ARRAY1 1
;
--CIF-BINARY-FORMAT-SECTION--
Content-Type: application/octet-stream;
    conversions="x-CBF_PACKED"
Content-Transfer-Encoding: BASE64
X-Binary-Size: 3801324
X-Binary-ID: 1
X-Binary-Element-Type: "signed 32-bit integer"
Content-MD5: 07lZFvF+aOcW85IN7usl8A==

AABRAAAAAAAAAAAAAAAAAAAA
...AAZBQSr1sKNBOeOe9HITdMdDUnbq7bg
...
8REo6TtBrxJ1vKqAvx9YDMD6...r/tgssjMIJMXATDsZobL90AEXc4KigE
--CIF-BINARY-FORMAT-SECTION----
;
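The looped categories in the example above are whitespace-delimited tables: a run of data names followed by values that fill the columns row by row. A minimal sketch in Python of reading one such loop is given here. It assumes a simple loop with no semicolon-delimited text fields and no binary sections, and `parse_loop` is a hypothetical helper written for this illustration, not part of any CIF library.

```python
import re

# A token is a single-quoted string, a double-quoted string, or a bare word.
_TOKEN = re.compile(r"'([^']*)'" r'|"([^"]*)"' r"|(\S+)")


def parse_loop(text):
    """Parse one simple CIF loop_ into a list of {data_name: value} rows.

    Assumption: no text fields (';' blocks) and no trailing comments.
    """
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        for match in _TOKEN.finditer(line):
            token = match.group(match.lastindex)
            quoted = match.lastindex != 3
            if token == "loop_" and not quoted:
                continue
            if token.startswith("_") and not quoted and not values:
                names.append(token)   # still reading the list of data names
            else:
                values.append(token)  # everything after the names is data
    if not names or len(values) % len(names):
        raise ValueError("value count is not a multiple of the name count")
    n = len(names)
    return [dict(zip(names, values[i:i + n])) for i in range(0, len(values), n)]


example = """
# category DIFFRN_SCAN_FRAME_AXIS
loop_
_diffrn_scan_frame_axis.frame_id
_diffrn_scan_frame_axis.axis_id
_diffrn_scan_frame_axis.angle
_diffrn_scan_frame_axis.displacement
FRAME1 GONIOMETER_OMEGA 12.0 0.0
FRAME1 DETECTOR_Z 0.0 -240.0
"""
rows = parse_loop(example)
```

Values are returned as strings; converting them to numbers is left to the caller because the type of each item is defined in the dictionary, not in the data file.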


4.6. IMAGE DICTIONARY (imgCIF)

_diffrn_scan.date_end

DIFFRN SCAN AXIS

_diffrn_scan_axis.angle_range

(yyyy-mm-dd)

The date and time of the end of the scan. Note that this may be an estimate generated during the scan, before the precise time of the end of the scan is known.

(float)

The range from the starting position for the specified axis in degrees. Where no value is given, the assumed value is ‘0.0’.

[diffrn_scan_axis]

[diffrn_scan]

_diffrn_scan.date_start

_diffrn_scan_axis.angle_rstrt_incr (float) The increment after each step for the specified axis in degrees. In general, this will agree with _diffrn_scan_frame_axis.angle_rstrt_incr. The sum of the values of _diffrn_scan_frame_axis.angle, _diffrn_scan_frame_axis.angle_increment and _diffrn_scan_frame_axis.angle_rstrt_incr is the angular setting of the axis at the start of the integration time for the next frame relative to a given frame and should equal _diffrn_scan_frame_axis.angle for this next frame. If the individual frame values vary, then the value of _diffrn_scan_axis.angle_rstrt_incr will be representative of the ensemble of values of _diffrn_scan_frame_axis.angle_rstrt_incr (e.g. the mean).

(yyyy-mm-dd)

The date and time of the start of the scan. [diffrn_scan]

*_diffrn_scan.frame_id_end

(code)

The value of this data item is the identifier of the last frame in the scan. This item is a pointer to _diffrn_data_frame.id in the DIFFRN_DATA_FRAME category. [diffrn_scan]

*_diffrn_scan.frame_id_start

(code)

Where no value is given, the assumed value is ‘0.0’.

The value of this data item is the identifier of the first frame in the scan. This item is a pointer to _diffrn_data_frame.id in the DIFFRN_DATA_FRAME category.

_diffrn_scan_axis.angle_start Where no value is given, the assumed value is ‘0.0’.

The permitted range is [1, ∞).

[diffrn_scan]

*_diffrn_scan.id

(code)

(float)

The starting position for the specified axis in degrees.

[diffrn_scan]

_diffrn_scan.frames (int) The value of this data item is the number of frames in the scan.

[diffrn_scan_axis]

[diffrn_scan_axis]

*_diffrn_scan_axis.axis_id

(code)

The value of this data item is the identifier of one of the axes for the scan for which settings are being specified. Multiple axes may be specified for the same value of _diffrn_scan.id. This item is a pointer to _axis.id in the AXIS category.

The value of _diffrn_scan.id uniquely identifies each scan. The identifier is used to tie together all the information about the scan.

[diffrn_scan_axis]

The following item(s) have an equivalent role in their respective categories:

_diffrn_scan_axis.displacement_increment

_diffrn_scan_axis.scan_id, _diffrn_scan_frame.scan_id.

_diffrn_scan.integration_time (float) Approximate average time in seconds to integrate each step of the scan. The precise time for integration of each particular step must be provided in _diffrn_scan_frame.integration_time, even if all steps have the same integration time. The permitted range is [0.0, ∞).

(float)

The increment for each step for the specified axis in millimetres. In general, this will agree with _diffrn_scan_frame_axis.displacement_increment. The sum of the values of _diffrn_scan_frame_axis.displacement and _diffrn_scan_frame_axis.displacement_increment is the displacement setting of the axis at the end of the integration time for a given frame. If the individual frame values vary, then the value of _diffrn_scan_axis.displacement_increment will be representative of the ensemble of values of _diffrn_scan_frame_axis.displacement_increment (e.g. the mean).

[diffrn_scan]

[diffrn_scan]

Where no value is given, the assumed value is ‘0.0’.

[diffrn_scan_axis]

DIFFRN SCAN AXIS _diffrn_scan_axis.displacement_range (float) The range from the starting position for the specified axis in millimetres.

Data items in the DIFFRN_SCAN_AXIS category describe the settings of axes for particular scans. Unspecified axes are assumed to be at their zero points.

Where no value is given, the assumed value is ‘0.0’.

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_scan_axis.scan_id _diffrn_scan_axis.axis_id

_diffrn_scan_axis.displacement_rstrt_incr

(float)

The increment for each step for the specified axis in millimetres. In general, this will agree with _diffrn_scan_frame_axis.displacement_rstrt_incr. The sum of the values of _diffrn_scan_frame_axis.displacement, _diffrn_scan_frame_axis.displacement_increment and _diffrn_scan_frame_axis.displacement_rstrt_incr is the displacement setting of the axis at the start of the integration time for the next frame relative to a given frame and should equal _diffrn_scan_frame_axis.displacement for this next frame. If the individual frame values vary, then the value of _diffrn_scan_axis.displacement_rstrt_incr will be representative of the ensemble of values of _diffrn_scan_frame_axis.displacement_rstrt_incr (e.g. the mean).

_diffrn_scan_axis.angle_increment (float) The increment for each step for the specified axis in degrees. In general, this will agree with _diffrn_scan_frame_axis.angle_increment. The sum of the values of _diffrn_scan_frame_axis.angle and _diffrn_scan_frame_axis.angle_increment is the angular setting of the axis at the end of the integration time for a given frame. If the individual frame values vary, then the value of _diffrn_scan_axis.angle_increment will be representative of the ensemble of values of _diffrn_scan_frame_axis.angle_increment (e.g. the mean). Where no value is given, the assumed value is ‘0.0’.
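The relation between the per-frame values and the scan-level value can be illustrated with a short Python sketch. It assumes that the mean is the chosen representative value, which the definition offers only as an example. The names `FrameAxis`, `end_of_frame_angle` and `representative_increment` are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FrameAxis:
    # Defaults mirror the dictionary: an omitted value is taken as 0.0.
    angle: float = 0.0            # _diffrn_scan_frame_axis.angle (degrees)
    angle_increment: float = 0.0  # _diffrn_scan_frame_axis.angle_increment


def end_of_frame_angle(frame):
    """Angular setting at the end of the integration time for one frame."""
    return frame.angle + frame.angle_increment


def representative_increment(frames):
    """One possible scan-level _diffrn_scan_axis.angle_increment: the mean."""
    return mean(f.angle_increment for f in frames)


frames = [FrameAxis(12.0, 1.0), FrameAxis(13.0, 1.0), FrameAxis(14.0, 1.5)]
scan_increment = representative_increment(frames)
```

The same arithmetic applies to the displacement items, with millimetres in place of degrees.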

[diffrn_scan_axis]

[diffrn_scan_axis]

Where no value is given, the assumed value is ‘0.0’.


[diffrn_scan_axis]


_diffrn_scan_axis.displacement_start The starting position for the specified axis in millimetres.

_diffrn_scan_frame_axis.angle

(float)

Where no value is given, the assumed value is ‘0.0’.

*_diffrn_scan_axis.scan_id

(float)

The setting of the specified axis in degrees for this frame. This is the setting at the start of the integration time.

[diffrn_scan_axis]

Where no value is given, the assumed value is ‘0.0’.


[diffrn_scan_frame_axis]

(code)

_diffrn_scan_frame_axis.angle_increment

The value of this data item is the identifier of the scan for which axis settings are being specified. Multiple axes may be specified for the same value of _diffrn_scan.id. This item is a pointer to _diffrn_scan.id in the DIFFRN_SCAN category. [diffrn_scan_axis]

Where no value is given, the assumed value is ‘0.0’.

DIFFRN SCAN FRAME

[diffrn_scan_frame_axis]

_diffrn_scan_frame_axis.angle_rstrt_incr

Data items in the DIFFRN_SCAN_FRAME category describe the relationships of particular frames to scans.

(float)

The increment after this frame for the angular setting of the specified axis in degrees. The sum of the values of _diffrn_scan_frame_axis.angle, _diffrn_scan_frame_axis.angle_increment and _diffrn_scan_frame_axis.angle_rstrt_incr is the angular setting of the axis at the start of the integration time for the next frame and should equal _diffrn_scan_frame_axis.angle for this next frame.
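The chaining rule stated above lends itself to a simple consistency check across consecutive frames. The following Python sketch assumes that the frames are supplied in scan order as `(angle, angle_increment, angle_rstrt_incr)` tuples. The function names and the tolerance are assumptions made for this illustration.

```python
import math


def next_frame_angle(angle, angle_increment=0.0, angle_rstrt_incr=0.0):
    """Expected angular setting at the start of the next frame (degrees)."""
    return angle + angle_increment + angle_rstrt_incr


def frames_are_consistent(frames, tol=1e-6):
    """Check that each frame's start angle follows from the previous frame.

    frames: list of (angle, angle_increment, angle_rstrt_incr) in scan order.
    """
    return all(
        math.isclose(next_frame_angle(*current), following[0], abs_tol=tol)
        for current, following in zip(frames, frames[1:])
    )


good = [(12.0, 1.0, 0.0), (13.0, 1.0, 0.5), (14.5, 1.0, 0.0)]
bad = [(12.0, 1.0, 0.0), (13.5, 1.0, 0.0)]
```

A failed check does not by itself show which value is wrong; it only flags that the three items of one frame and the start angle of the next frame disagree.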

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_scan_frame.scan_id _diffrn_scan_frame.frame_id

_diffrn_scan_frame.date

(float)

The increment for this frame for the angular setting of the specified axis in degrees. The sum of the values of _diffrn_scan_frame_axis.angle and _diffrn_scan_frame_axis.angle_increment is the angular setting of the axis at the end of the integration time for this frame.

Where no value is given, the assumed value is ‘0.0’.

(yyyy-mm-dd)

[diffrn_scan_frame_axis]

The date and time of the start of the frame being scanned. [diffrn_scan_frame]

*_diffrn_scan_frame.frame_id

*_diffrn_scan_frame_axis.axis_id

(code)

The value of this data item is the identifier of one of the axes for the frame for which settings are being specified. Multiple axes may be specified for the same value of _diffrn_scan_frame.frame_id. This item is a pointer to _axis.id in the AXIS category.

(code)

The value of this data item is the identifier of the frame being examined. This item is a pointer to _diffrn_data_frame.id in the DIFFRN_DATA_FRAME category.

[diffrn_scan_frame_axis]

[diffrn_scan_frame]

_diffrn_scan_frame.frame_number

_diffrn_scan_frame_axis.displacement (float) The setting of the specified axis in millimetres for this frame. This is the setting at the start of the integration time.

(int)

The value of this data item is the number of the frame within the scan, starting with 1. It is not necessarily the same as the value of _diffrn_scan_frame.frame_id, but it may be. The permitted range is [0, ∞).

Where no value is given, the assumed value is ‘0.0’.

[diffrn_scan_frame_axis]

[diffrn_scan_frame]

_diffrn_scan_frame_axis.displacement_increment (float)

*_diffrn_scan_frame.integration_time

(float)

The increment for this frame for the displacement setting of the specified axis in millimetres. The sum of the values of _diffrn_scan_frame_axis.displacement and _diffrn_scan_frame_axis.displacement_increment is the displacement setting of the axis at the end of the integration time for this frame.

The time in seconds to integrate this step of the scan. This should be the precise time of integration of each particular frame. The value of this data item should be given explicitly for each frame and not inferred from the value of _diffrn_scan.integration_time. The permitted range is [0.0, ∞).

Where no value is given, the assumed value is ‘0.0’.

[diffrn_scan_frame_axis]

[diffrn_scan_frame]

*_diffrn_scan_frame.scan_id

_diffrn_scan_frame_axis.displacement_rstrt_incr

(code)

(float)

The value of _diffrn_scan_frame.scan_id identifies the scan containing this frame. This item is a pointer to _diffrn_scan.id in the DIFFRN_SCAN category.

The increment for this frame for the displacement setting of the specified axis in millimetres. The sum of the values of _diffrn_scan_frame_axis.displacement, _diffrn_scan_frame_axis.displacement_increment and _diffrn_scan_frame_axis.displacement_rstrt_incr is the displacement setting of the axis at the start of the integration time for the next frame and should equal _diffrn_scan_frame_axis.displacement for this next frame.

[diffrn_scan_frame]

DIFFRN SCAN FRAME AXIS Data items in the DIFFRN_SCAN_FRAME_AXIS category describe the settings of axes for particular frames. Unspecified axes are assumed to be at their zero points. If, for any given frame, nonzero values apply for any of the data items in this category, those values should be given explicitly in this category and not simply inferred from values in DIFFRN_SCAN_AXIS.

Where no value is given, the assumed value is ‘0.0’.

[diffrn_scan_frame_axis]

*_diffrn_scan_frame_axis.frame_id

(code)

The value of this data item is the identifier of the frame for which axis settings are being specified. Multiple axes may be specified for the same value of _diffrn_scan_frame.frame_id. This item is a pointer to _diffrn_data_frame.id in the DIFFRN_DATA_FRAME category.

Category group(s): inclusive_group diffrn_group Category key(s): _diffrn_scan_frame_axis.frame_id _diffrn_scan_frame_axis.axis_id

[diffrn_scan_frame_axis]


International Tables for Crystallography (2006). Vol. G, Chapter 4.7, pp. 459–466.

4.7. Symmetry dictionary (symCIF)

By I. D. Brown

This is version 1.0.1 of the symmetry CIF dictionary (symCIF), which gives a more complete description of symmetry than was included in the original core CIF dictionary. A detailed commentary on the philosophy behind the dictionary and its use may be found in Chapter 3.8.

oP oS oI oF tP tI hP hR cP cI cF Example: ‘aP’ (triclinic (anorthic) primitive lattice).

[space_group]

SPACE GROUP _space_group.IT_coordinate_system_code

Contains all the data items that refer to the space group as a whole, such as its name, Laue group etc. It may be looped, for example in a list of space groups and their properties. Space-group types are identified by their number as listed in International Tables for Crystallography Volume A, or by their Schoenflies symbol. Specific settings of the space groups can be identified by their Hall symbol, by specifying their symmetry operations or generators, or by giving the transformation that relates the specific setting to the reference setting based on International Tables Volume A and stored in this dictionary. The commonly used Hermann–Mauguin symbol determines the space-group type uniquely but several different Hermann–Mauguin symbols may refer to the same space-group type. A Hermann–Mauguin symbol contains information on the choice of the basis, but not on the choice of origin.

The data value must be one of the following:

b1 b2 b3 -b1 -b2 -b3 c1 c2 c3 -c1 -c2 -c3 a1 a2 a3 -a1 -a2 -a3 abc ba-c cab -cba bca a-cb 1abc 1ba-c 1cab 1-cba 1bca 1a-cb 2abc 2ba-c 2cab 2-cba 2bca 2a-cb 1 2 h r

Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers. Mandatory category. Category key(s): _space_group.id

Example 1 – description of the C2/c space group, No. 15 in International Tables for Crystallography Volume A.

_space_group.id                   1
_space_group.name_H-M_ref         'C 2/c'
_space_group.name_Schoenflies     C2h.6
_space_group.IT_number            15
_space_group.name_Hall            '-C 2yc'
_space_group.Bravais_type         mS
_space_group.Laue_class           2/m
_space_group.crystal_system       monoclinic
_space_group.centring_type        C
_space_group.Patterson_name_H-M   'C 2/m'

_space_group.Bravais_type

(char)

The symbol denoting the lattice type (Bravais type) to which the translational subgroup (vector lattice) of the space group belongs. It consists of a lower-case letter indicating the crystal system followed by an upper-case letter indicating the lattice centring. The setting-independent symbol mS replaces the setting-dependent symbols mB and mC, and the setting-independent symbol oS replaces the setting-dependent symbols oA, oB and oC. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed., p. 15. Dordrecht: Kluwer Academic Publishers. The data value must be one of the following:

aP mP mS

Affiliation: I. David Brown, Brockhouse Institute for Materials Research, McMaster University, Hamilton, Ontario, Canada L8S 4M1.

Copyright © 2006 International Union of Crystallography

(char)

A qualifier taken from the enumeration list identifying which setting in International Tables for Crystallography Volume A (2002) (IT) is used. See IT Table 4.3.2.1, Section 2.2.16, Table 2.2.16.1, Section 2.2.16.1 and Fig. 2.2.6.4. This item is not computer-interpretable and cannot be used to define the coordinate system. Use _space_group.transform_* instead.

Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

The enumeration values have the following meanings (a bar over an axis is written as a preceding negative sign):

b1      monoclinic unique axis b, cell choice 1, abc
b2      monoclinic unique axis b, cell choice 2, abc
b3      monoclinic unique axis b, cell choice 3, abc
-b1     monoclinic unique axis b, cell choice 1, c-ba
-b2     monoclinic unique axis b, cell choice 2, c-ba
-b3     monoclinic unique axis b, cell choice 3, c-ba
c1      monoclinic unique axis c, cell choice 1, abc
c2      monoclinic unique axis c, cell choice 2, abc
c3      monoclinic unique axis c, cell choice 3, abc
-c1     monoclinic unique axis c, cell choice 1, ba-c
-c2     monoclinic unique axis c, cell choice 2, ba-c
-c3     monoclinic unique axis c, cell choice 3, ba-c
a1      monoclinic unique axis a, cell choice 1, abc
a2      monoclinic unique axis a, cell choice 2, abc
a3      monoclinic unique axis a, cell choice 3, abc
-a1     monoclinic unique axis a, cell choice 1, -acb
-a2     monoclinic unique axis a, cell choice 2, -acb
-a3     monoclinic unique axis a, cell choice 3, -acb
abc     orthorhombic
ba-c    orthorhombic
cab     orthorhombic
-cba    orthorhombic
bca     orthorhombic
a-cb    orthorhombic
1abc    orthorhombic origin choice 1
1ba-c   orthorhombic origin choice 1
1cab    orthorhombic origin choice 1
1-cba   orthorhombic origin choice 1
1bca    orthorhombic origin choice 1
1a-cb   orthorhombic origin choice 1
2abc    orthorhombic origin choice 2
2ba-c   orthorhombic origin choice 2
2cab    orthorhombic origin choice 2
2-cba   orthorhombic origin choice 2
2bca    orthorhombic origin choice 2
2a-cb   orthorhombic origin choice 2
1       tetragonal or cubic origin choice 1
2       tetragonal or cubic origin choice 2
h       trigonal using hexagonal axes
r       trigonal using rhombohedral axes

[space_group]


_space_group.IT_number

(numb)

cif_sym.dic

*_space_group.id

(char)

_symmetry_Int_Tables_number (cif_core.dic 1.0)

This is an identifier needed if _space_group.* items are looped.

The number as assigned in International Tables for Crystallography Volume A, specifying the proper affine class (i.e. the orientation-preserving affine class) of space groups (crystallographic space-group type) to which the space group belongs. This number defines the space-group type but not the coordinate system in which it is expressed. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

The following item(s) have an equivalent role in their respective categories:

The permitted range is [1, 230].

_space_group_symop.sg_id, _space_group_Wyckoff.sg_id.

_space_group.name_H-M_alt

_space_group.name_H-M_alt allows for an alternative Hermann–Mauguin symbol to be given. The way in which this item is used is determined by the user and should be described in the item _space_group.name_H-M_alt_description. It may, for example, be used to give one of the extended Hermann–Mauguin symbols given in Table 4.3.2.1 of International Tables for Crystallography Volume A (2002) or a full Hermann–Mauguin symbol for an unconventional setting. Each component of the space-group name is separated by a space or an underscore character. The use of a space is strongly recommended. The underscore is only retained because it was used in older CIFs. It should not be used in new CIFs. Subscripts should appear without special symbols. Bars should be given as negative signs before the numbers to which they apply. The commonly used Hermann–Mauguin symbol determines the space-group type uniquely, but a given space-group type may be described by more than one Hermann–Mauguin symbol. The space-group type is best described using _space_group.IT_number or _space_group.name_Schoenflies. The Hermann–Mauguin symbol may contain information on the choice of basis but does not contain information on the choice of origin. To define the setting uniquely, use _space_group.name_Hall, list the symmetry operations or generators, or give the transformation that relates the setting to the reference setting defined in this dictionary under _space_group.reference_setting. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

(char)

The Hermann–Mauguin symbol of the geometric crystal class of the point group of the space group where a centre of inversion is added if not already present. The data value must be one of the following:

-1 2/m 4/m -3 6/m m-3

mmm 4/mmm -3m 6/mmm m-3m

[space_group]

_space_group.Patterson_name_H-M (char) The Hermann–Mauguin symbol of the type of that centrosymmetric symmorphic space group to which the Patterson function belongs; see Table 2.2.5.1 in International Tables for Crystallography Volume A (2002). A space separates each symbol referring to different axes. Underscores may replace the spaces, but this use is discouraged. Subscripts should appear without special symbols. Bars should be given as negative signs before the number to which they apply. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed., Table 2.2.5.1. Dordrecht: Kluwer Academic Publishers. Examples: ‘P -1’, ‘P 2/m’, ‘C 2/m’, ‘P m m m’, ‘C m m m’, ‘I m m m’, ‘F m m m’, ‘P 4/m’, ‘I 4/m’, ‘P 4/m m m’, ‘I 4/m m m’, ‘P -3’, ‘R -3’, ‘P -3 m 1’, ‘R -3 m’, ‘P -3 1 m’, ‘P 6/m’, ‘P 6/m m m’, ‘P m -3’, ‘I m -3’, ‘F m -3’, ‘P m -3 m’, ‘I m -3 m’, ‘F m -3 m’. [space_group]
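The formatting rules for Hermann–Mauguin symbols (spaces between components, underscores tolerated only as a legacy form) suggest normalizing a symbol before comparing it with a list of values. A minimal Python sketch follows. The set of symbols is copied from the examples quoted in the definition above, and `normalize_hm` and `is_listed_patterson` are hypothetical helper names, not part of any CIF toolkit.

```python
# Patterson symbols exactly as quoted in the examples of the definition.
PATTERSON_EXAMPLES = {
    "P -1", "P 2/m", "C 2/m", "P m m m", "C m m m", "I m m m", "F m m m",
    "P 4/m", "I 4/m", "P 4/m m m", "I 4/m m m", "P -3", "R -3", "P -3 m 1",
    "R -3 m", "P -3 1 m", "P 6/m", "P 6/m m m", "P m -3", "I m -3",
    "F m -3", "P m -3 m", "I m -3 m", "F m -3 m",
}


def normalize_hm(symbol):
    """Replace legacy underscores by spaces and collapse repeated blanks."""
    return " ".join(symbol.replace("_", " ").split())


def is_listed_patterson(symbol):
    """True if the normalized symbol is one of the quoted example values."""
    return normalize_hm(symbol) in PATTERSON_EXAMPLES
```

For instance, `normalize_hm("P_-3_m_1")` yields `P -3 m 1`, so a symbol written in the older underscore style compares equal to the recommended form.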

Related items: _space_group.name_H-M_ref (alternate), _space_group.name_H-M_full (alternate).

Example:
;
loop_
_space_group.name_H-M_alt
_space_group.name_H-M_alt_description
'C m c m(b n n)'   'Extended Hermann-Mauguin symbol'
'C 2/c 2/m 21/m'   'Full unconventional Hermann-Mauguin symbol'
'A m a m'          'Hermann-Mauguin symbol corresponding to setting used'
;
(Three examples for space group No. 63.)

[space_group]

_space_group.centring_type (char) Symbol for the lattice centring. This symbol may be dependent on the coordinate system chosen. The data value must be one of the following:

P A B C F I R Rrev H

(char)

_symmetry_space_group_name_H-M (cif_core.dic 1.0)

[space_group]

_space_group.Laue_class

[space_group]

P      primitive, no centring
A      A-face centred (0, 1/2, 1/2)
B      B-face centred (1/2, 0, 1/2)
C      C-face centred (1/2, 1/2, 0)
F      all faces centred (0, 1/2, 1/2), (1/2, 0, 1/2), (1/2, 1/2, 0)
I      body centred (1/2, 1/2, 1/2)
R      rhombohedral obverse centred (2/3, 1/3, 1/3), (1/3, 2/3, 2/3)
Rrev   rhombohedral reverse centred (1/3, 2/3, 1/3), (2/3, 1/3, 2/3)
H      hexagonal centred (2/3, 1/3, 0), (1/3, 2/3, 0)
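The centring codes and their translation vectors map directly onto a lookup table. The Python sketch below uses exact fractions and derives the number of lattice points per cell and the centring-related copies of a position. The names `CENTRING_VECTORS`, `lattice_points_per_cell` and `centred_positions` are assumptions of this illustration.

```python
from fractions import Fraction as F

# Centring translations for each code, as listed in the enumeration.
CENTRING_VECTORS = {
    "P": [],
    "A": [(0, F(1, 2), F(1, 2))],
    "B": [(F(1, 2), 0, F(1, 2))],
    "C": [(F(1, 2), F(1, 2), 0)],
    "F": [(0, F(1, 2), F(1, 2)), (F(1, 2), 0, F(1, 2)), (F(1, 2), F(1, 2), 0)],
    "I": [(F(1, 2), F(1, 2), F(1, 2))],
    "R": [(F(2, 3), F(1, 3), F(1, 3)), (F(1, 3), F(2, 3), F(2, 3))],
    "Rrev": [(F(1, 3), F(2, 3), F(1, 3)), (F(2, 3), F(1, 3), F(2, 3))],
    "H": [(F(2, 3), F(1, 3), 0), (F(1, 3), F(2, 3), 0)],
}


def lattice_points_per_cell(code):
    """The origin plus one lattice point per centring translation."""
    return 1 + len(CENTRING_VECTORS[code])


def centred_positions(xyz, code):
    """Positions related to xyz by the centring, reduced into one cell."""
    x, y, z = (F(v) for v in xyz)
    shifts = [(0, 0, 0)] + CENTRING_VECTORS[code]
    return [((x + tx) % 1, (y + ty) % 1, (z + tz) % 1) for tx, ty, tz in shifts]
```

Exact fractions avoid the rounding that would otherwise appear for thirds in the R and H centrings.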

_space_group.name_H-M_alt_description

(char)

A free-text description of the code appearing in _space_group.name_H-M_alt. [space_group]

[space_group]

_space_group.crystal_system

_space_group.name_H-M_full

(char)

(char)

_symmetry_cell_setting (cif_core.dic 1.0)

_symmetry.space_group_name_H-M (cif_mm.dic 1.0.0)

The name of the system of geometric crystal classes of space groups (crystal system) to which the space group belongs. Note that crystals with the hR lattice type belong to the trigonal system.

The full international Hermann–Mauguin space-group symbol as defined in Section 2.2.3 and given as the second item of the second line of each of the space-group tables of Part 7 of International Tables for Crystallography Volume A (2002). Each component of the space-group name is separated by a space or an underscore character. The use of a space is strongly recommended. The underscore is only retained because it was used in old CIFs. It should not be used in new CIFs. Subscripts should appear without special symbols. Bars should be given as negative signs before the numbers to which they apply. The commonly used Hermann–Mauguin symbol determines the space-group type uniquely but a given

The data value must be one of the following:

triclinic monoclinic orthorhombic tetragonal trigonal hexagonal cubic

[space_group]
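Because _space_group.IT_number and _space_group.crystal_system describe the same space-group type, one can be cross-checked against the other. The Python sketch below uses the conventional grouping of the 230 space-group numbers by crystal system. These number ranges are not spelled out in the dictionary text and are an assumption of this illustration, as is the function name `crystal_system`.

```python
# Highest space-group number belonging to each crystal system
# (conventional ranges; assumed here, not quoted from the dictionary).
_SYSTEM_UPPER_BOUNDS = [
    (2, "triclinic"),
    (15, "monoclinic"),
    (74, "orthorhombic"),
    (142, "tetragonal"),
    (167, "trigonal"),
    (194, "hexagonal"),
    (230, "cubic"),
]


def crystal_system(it_number):
    """Return the crystal system name for a space-group number 1..230."""
    if not 1 <= it_number <= 230:
        raise ValueError("space-group number must lie in [1, 230]")
    for upper, name in _SYSTEM_UPPER_BOUNDS:
        if it_number <= upper:
            return name
```

For space group No. 15 (C2/c) the function returns `monoclinic`, in agreement with Example 1 of the SPACE_GROUP category. Rhombohedral groups such as No. 146 are reported as `trigonal`, as the definition requires for the hR lattice type.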


space-group type may be described by more than one Hermann–Mauguin symbol. The space-group type is best described using _space_group.IT_number or _space_group.name_Schoenflies. The full international Hermann–Mauguin symbol contains information about the choice of basis for monoclinic and orthorhombic space groups but does not give information about the choice of origin. To define the setting uniquely use _space_group.name_Hall, list the symmetry operations or generators, or give the transformation relating the setting used to the reference setting defined in this dictionary under _space_group.reference_setting. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

Related items: _space_group.name_H-M_ref (alternate), _space_group.name_H-M_alt (alternate).

Example: ‘P 21/n 21/m 21/a’ (full symbol for Pnma).

[space_group]

_space_group.name_H-M_ref

(char)

The short international Hermann–Mauguin space-group symbol as defined in Section 2.2.3 and given as the first item of each space-group table in Part 7 of International Tables for Crystallography Volume A (2002). Each component of the space-group name is separated by a space or an underscore character. The use of a space is strongly recommended. The underscore is only retained because it was used in old CIFs. It should not be used in new CIFs. Subscripts should appear without special symbols. Bars should be given as negative signs before the numbers to which they apply. The short international Hermann–Mauguin symbol determines the space-group type uniquely. However, the space-group type is better described using _space_group.IT_number or _space_group.name_Schoenflies. The short international Hermann–Mauguin symbol contains no information on the choice of basis or origin. To define the setting uniquely use _space_group.name_Hall, list the symmetry operations or generators, or give the transformation that relates the setting to the reference setting defined in this dictionary under _space_group.reference_setting. _space_group.name_H-M_alt may be used to give the Hermann–Mauguin symbol corresponding to the setting used. In the enumeration list below, each possible value is identified by space-group number and Schoenflies symbol. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers. Related items: _space_group.name_H-M_full (alternate), _space_group.name_H-M_alt (alternate). The data value must be one of the following:

'P 1'           1  C1.1      'P -1'          2  Ci.1
'P 2'           3  C2.1      'P 21'          4  C2.2
'C 2'           5  C2.3      'P m'           6  Cs.1
'P c'           7  Cs.2      'C m'           8  Cs.3
'C c'           9  Cs.4      'P 2/m'        10  C2h.1
'P 21/m'       11  C2h.2     'C 2/m'        12  C2h.3
'P 2/c'        13  C2h.4     'P 21/c'       14  C2h.5
'C 2/c'        15  C2h.6     'P 2 2 2'      16  D2.1
'P 2 2 21'     17  D2.2      'P 21 21 2'    18  D2.3
'P 21 21 21'   19  D2.4      'C 2 2 21'     20  D2.5
'C 2 2 2'      21  D2.6      'F 2 2 2'      22  D2.7
'I 2 2 2'      23  D2.8      'I 21 21 21'   24  D2.9
'P m m 2'      25  C2v.1     'P m c 21'     26  C2v.2
'P c c 2'      27  C2v.3     'P m a 2'      28  C2v.4
'P c a 21'     29  C2v.5     'P n c 2'      30  C2v.6
'P m n 21'     31  C2v.7     'P b a 2'      32  C2v.8
'P n a 21'     33  C2v.9     'P n n 2'      34  C2v.10
'C m m 2'      35  C2v.11    'C m c 21'     36  C2v.12
'C c c 2'      37  C2v.13    'A m m 2'      38  C2v.14
'A e m 2'      39  C2v.15    'A m a 2'      40  C2v.16
'A e a 2'      41  C2v.17    'F m m 2'      42  C2v.18
'F d d 2'      43  C2v.19    'I m m 2'      44  C2v.20
'I b a 2'      45  C2v.21    'I m a 2'      46  C2v.22
'P m m m'      47  D2h.1     'P n n n'      48  D2h.2
'P c c m'      49  D2h.3     'P b a n'      50  D2h.4
'P m m a'      51  D2h.5     'P n n a'      52  D2h.6
'P m n a'      53  D2h.7     'P c c a'      54  D2h.8
'P b a m'      55  D2h.9     'P c c n'      56  D2h.10
'P b c m'      57  D2h.11    'P n n m'      58  D2h.12
'P m m n'      59  D2h.13    'P b c n'      60  D2h.14
'P b c a'      61  D2h.15    'P n m a'      62  D2h.16
'C m c m'      63  D2h.17    'C m c e'      64  D2h.18
'C m m m'      65  D2h.19    'C c c m'      66  D2h.20
'C m m e'      67  D2h.21    'C c c e'      68  D2h.22
'F m m m'      69  D2h.23    'F d d d'      70  D2h.24
'I m m m'      71  D2h.25    'I b a m'      72  D2h.26
'I b c a'      73  D2h.27    'I m m a'      74  D2h.28
'P 4'          75  C4.1      'P 41'         76  C4.2
'P 42'         77  C4.3      'P 43'         78  C4.4
'I 4'          79  C4.5      'I 41'         80  C4.6
'P -4'         81  S4.1      'I -4'         82  S4.2
'P 4/m'        83  C4h.1     'P 42/m'       84  C4h.2
'P 4/n'        85  C4h.3     'P 42/n'       86  C4h.4
'I 4/m'        87  C4h.5     'I 41/a'       88  C4h.6
'P 4 2 2'      89  D4.1      'P 4 21 2'     90  D4.2
'P 41 2 2'     91  D4.3      'P 41 21 2'    92  D4.4
'P 42 2 2'     93  D4.5      'P 42 21 2'    94  D4.6
'P 43 2 2'     95  D4.7      'P 43 21 2'    96  D4.8
'I 4 2 2'      97  D4.9      'I 41 2 2'     98  D4.10
'P 4 m m'      99  C4v.1     'P 4 b m'     100  C4v.2
'P 42 c m'    101  C4v.3     'P 42 n m'    102  C4v.4
'P 4 c c'     103  C4v.5     'P 4 n c'     104  C4v.6
'P 42 m c'    105  C4v.7     'P 42 b c'    106  C4v.8
'I 4 m m'     107  C4v.9     'I 4 c m'     108  C4v.10
'I 41 m d'    109  C4v.11    'I 41 c d'    110  C4v.12
'P -4 2 m'    111  D2d.1     'P -4 2 c'    112  D2d.2
'P -4 21 m'   113  D2d.3     'P -4 21 c'   114  D2d.4
'P -4 m 2'    115  D2d.5     'P -4 c 2'    116  D2d.6
'P -4 b 2'    117  D2d.7     'P -4 n 2'    118  D2d.8
'I -4 m 2'    119  D2d.9     'I -4 c 2'    120  D2d.10
'I -4 2 m'    121  D2d.11    'I -4 2 d'    122  D2d.12
'P 4/m m m'   123  D4h.1     'P 4/m c c'   124  D4h.2
'P 4/n b m'   125  D4h.3     'P 4/n n c'   126  D4h.4
'P 4/m b m'   127  D4h.5     'P 4/m n c'   128  D4h.6
'P 4/n m m'   129  D4h.7     'P 4/n c c'   130  D4h.8
'P 42/m m c'  131  D4h.9     'P 42/m c m'  132  D4h.10
'P 42/n b c'  133  D4h.11    'P 42/n n m'  134  D4h.12
'P 42/m b c'  135  D4h.13    'P 42/m n m'  136  D4h.14
'P 42/n m c'  137  D4h.15    'P 42/n c m'  138  D4h.16
'I 4/m m m'   139  D4h.17    'I 4/m c m'   140  D4h.18
'I 41/a m d'  141  D4h.19    'I 41/a c d'  142  D4h.20
'P 3'         143  C3.1      'P 31'        144  C3.2
'P 32'        145  C3.3      'R 3'         146  C3.4
'P -3'        147  C3i.1     'R -3'        148  C3i.2
'P 3 1 2'     149  D3.1      'P 3 2 1'     150  D3.2
'P 31 1 2'    151  D3.3      'P 31 2 1'    152  D3.4
'P 32 1 2'    153  D3.5      'P 32 2 1'    154  D3.6
'R 3 2'       155  D3.7      'P 3 m 1'     156  C3v.1
'P 3 1 m'     157  C3v.2     'P 3 c 1'     158  C3v.3
'P 3 1 c'     159  C3v.4     'R 3 m'       160  C3v.5
'R 3 c'       161  C3v.6     'P -3 1 m'    162  D3d.1
'P -3 1 c'    163  D3d.2     'P -3 m 1'    164  D3d.3
'P -3 c 1'    165  D3d.4     'R -3 m'      166  D3d.5
'R -3 c'      167  D3d.6     'P 6'         168  C6.1
'P 61'        169  C6.2      'P 65'        170  C6.3
'P 62'        171  C6.4      'P 64'        172  C6.5
'P 63'        173  C6.6      'P -6'        174  C3h.1
'P 6/m'       175  C6h.1     'P 63/m'      176  C6h.2
'P 6 2 2'     177  D6.1      'P 61 2 2'    178  D6.2
'P 65 2 2'    179  D6.3      'P 62 2 2'    180  D6.4
'P 64 2 2'    181  D6.5      'P 63 2 2'    182  D6.6
'P 6 m m'     183  C6v.1     'P 6 c c'     184  C6v.2
'P 63 c m'    185  C6v.3     'P 63 m c'    186  C6v.4
'P -6 m 2'    187  D3h.1     'P -6 c 2'    188  D3h.2
'P -6 2 m'    189  D3h.3     'P -6 2 c'    190  D3h.4
'P 6/m m m'   191  D6h.1     'P 6/m c c'   192  D6h.2
'P 63/m c m'  193  D6h.3     'P 63/m m c'  194  D6h.4
'P 2 3'       195  T.1       'F 2 3'       196  T.2
'I 2 3'       197  T.3       'P 21 3'      198  T.4
'I 21 3'      199  T.5       'P m -3'      200  Th.1
'P n -3'      201  Th.2      'F m -3'      202  Th.3
'F d -3'      203  Th.4      'I m -3'      204  Th.5
'P a -3'      205  Th.6      'I a -3'      206  Th.7
'P 4 3 2'     207  O.1       'P 42 3 2'    208  O.2
'F 4 3 2'     209  O.3       'F 41 3 2'    210  O.4
'I 4 3 2'     211  O.5       'P 43 3 2'    212  O.6
'P 41 3 2'    213  O.7       'I 41 3 2'    214  O.8
'P -4 3 m'    215  Td.1      'F -4 3 m'    216  Td.2
'I -4 3 m'    217  Td.3      'P -4 3 n'    218  Td.4
'F -4 3 c'    219  Td.5      'I -4 3 d'    220  Td.6
'P m -3 m'    221  Oh.1      'P n -3 n'    222  Oh.2
'P m -3 n'    223  Oh.3      'P n -3 m'    224  Oh.4
'F m -3 m'    225  Oh.5      'F m -3 c'    226  Oh.6
'F d -3 m'    227  Oh.7      'F d -3 c'    228  Oh.8
'I m -3 m'    229  Oh.9      'I a -3 d'    230  Oh.10

type) to which the space group belongs. This symbol defines the space-group type independently of the coordinate system in which the space group is expressed. The symbol is given with a period, ‘.’, separating the Schoenflies point group and the superscript. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers. The data value must be one of the following:

C1.1 Cs.3 C2h.6 D2.7 C2v.5 C2v.12 C2v.19 D2h.4 D2h.11 D2h.18 D2h.25 C4.4 C4h.3 D4.4 C4v.1 C4v.8 D2d.3 D2d.10 D4h.5 D4h.12 D4h.19 C3i.2 D3.7 D3d.1 C6.2 C6h.2 C6v.1 D3h.4 T.3 Th.5 O.5 Td.4 Oh.5

T T4 Th1 Th3 Th5 Th7 O2 O4 O6 O8 Td2 Td4 Td6 O2h O4h O6h O8h O10 h

Ci.1 Cs.4 D2.1 D2.8 C2v.6 C2v.13 C2v.20 D2h.5 D2h.12 D2h.19 D2h.26 C4.5 C4h.4 D4.5 C4v.2 C4v.9 D2d.4 D2d.11 D4h.6 D4h.13 D4h.20 D3.1 C3v.1 D3d.2 C6.3 D6.1 C6v.2 D6h.1 T.4 Th.6 O.6 Td.5 Oh.6

C2.1 C2h.1 D2.2 D2.9 C2v.7 C2v.14 C2v.21 D2h.6 D2h.13 D2h.20 D2h.27 C4.6 C4h.5 D4.6 C4v.3 C4v.10 D2d.5 D2d.12 D4h.7 D4h.14 C3.1 D3.2 C3v.2 D3d.3 C6.4 D6.2 C6v.3 D6h.2 T.5 Th.7 O.7 Td.6 Oh.7

C2.2 C2h.2 D2.3 C2v.1 C2v.8 C2v.15 C2v.22 D2h.7 D2h.14 D2h.21 D2h.28 S4.1 C4h.6 D4.7 C4v.4 C4v.11 D2d.6 D4h.1 D4h.8 D4h.15 C3.2 D3.3 C3v.3 D3d.4 C6.5 D6.3 C6v.4 D6h.3 Th.1 O.1 O.8 Oh.1 Oh.8

C2.3 C2h.3 D2.4 C2v.2 C2v.9 C2v.16 D2h.1 D2h.8 D2h.15 D2h.22 C4.1 S4.2 D4.1 D4.8 C4v.5 C4v.12 D2d.7 D4h.2 D4h.9 D4h.16 C3.3 D3.4 C3v.4 D3d.5 C6.6 D6.4 D3h.1 D6h.4 Th.2 O.2 Td.1 Oh.2 Oh.9

Example: ‘C2h.5’ (Schoenflies symbol for space group No. 14).

Cs.1 C2h.4 D2.5 C2v.3 C2v.10 C2v.17 D2h.2 D2h.9 D2h.16 D2h.23 C4.2 C4h.1 D4.2 D4.9 C4v.6 D2d.1 D2d.8 D4h.3 D4h.10 D4h.17 C3.4 D3.5 C3v.5 D3d.6 C3h.1 D6.5 D3h.2 T.1 Th.3 O.3 Td.2 Oh.3 Oh.10

Cs.2 C2h.5 D2.6 C2v.4 C2v.11 C2v.18 D2h.3 D2h.10 D2h.17 D2h.24 C4.3 C4h.2 D4.3 D4.10 C4v.7 D2d.2 D2d.9 D4h.4 D4h.11 D4h.18 C3i.1 D3.6 C3v.6 C6.1 C6h.1 D6.6 D3h.3 T.2 Th.4 O.4 Td.3 Oh.4

[space_group]

cif_sym.dic

Examples: ‘P 21/c’, ‘P m n a’, ‘P -1’, ‘F m -3 m’, ‘P 63/m m m’. [space_group]

_space_group.point_group_H-M (char) The Hermann–Mauguin symbol denoting the geometric crystal class of space groups to which the space group belongs, and the geometric crystal class of point groups to which the point group of the space group belongs.

Examples: ‘-4’, ‘4/m’. [space_group]

_space_group.name_Hall (char)

_symmetry_space_group_name_Hall (cif_core.dic 1.0)

Space-group symbol defined by Hall. _space_group.name_Hall uniquely defines the space group and its reference to a particular coordinate system. Each component of the space-group name is separated by a space or an underscore character. The use of a space is strongly recommended. The underscore is only retained because it was used in old CIFs. It should not be used in new CIFs. References: Hall, S. R. (1981). Acta Cryst. A37, 517–525; erratum (1981), A37, 921. International Tables for Crystallography (2001). Volume B, Reciprocal space, edited by U. Shmueli, 2nd ed., Appendix 1.4.2. Dordrecht: Kluwer Academic Publishers.
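The separator rule above (space preferred, underscore tolerated only in old CIFs) can be sketched as a small normalization step. This is a minimal illustration, not part of the dictionary; the function name `normalize_hall_symbol` is hypothetical.

```python
def normalize_hall_symbol(symbol: str) -> str:
    """Return a Hall symbol with its components separated by single spaces.

    Old CIFs may separate components with '_'; new CIFs should use a space.
    """
    # Treat underscores as separators, then collapse any run of white space.
    return " ".join(symbol.replace("_", " ").split())
```

A reader of legacy files could apply this before comparing a value of _space_group.name_Hall against a table of Hall symbols.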

Examples: ‘P 2c -2ac’ (equivalent to Pca2₁), ‘-I 4bd 2ab 3’ (equivalent to Ia3̄d). [space_group]

_space_group.name_Schoenflies (char) The Schoenflies symbol as listed in International Tables for Crystallography Volume A denoting the proper affine class (i.e. orientation-preserving affine class) of space groups (space-group

4.7. SYMMETRY DICTIONARY (symCIF)

_space_group.reference_setting (char) The reference setting of a given space group is the setting chosen by the International Union of Crystallography as a unique setting to which other settings can be referred using the transformation matrix column pair given in _space_group.transform_Pp_abc and _space_group.transform_Qq_xyz. The settings are given in the enumeration list in the form ‘_space_group.IT_number: _space_group.name_Hall’. The space-group number defines the space-group type and the Hall symbol specifies the symmetry generators referred to the reference coordinate system. The 230 reference settings chosen are identical to the settings listed in International Tables for Crystallography Volume A (2002). For the space groups where more than one setting is given in International Tables, the following choices have been made. For monoclinic space groups: unique axis b and cell choice 1. For space groups with two origins: origin choice 2 (origin at inversion centre, indicated by adding :2 to the Hermann–Mauguin symbol in

the enumeration list). For rhombohedral space groups: hexagonal axes (indicated by adding :h to the Hermann–Mauguin symbol in the enumeration list). Based on the symmetry table of R. W. Grosse-Kunstleve, ETH, Zurich. The enumeration list may be extracted from the dictionary and stored as a separate CIF that can be referred to as required. In the enumeration list below, each reference setting is identified by Schoenflies symbol and by the Hermann–Mauguin symbol, augmented by :2 or :h suffixes as described above. References: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers. Grosse-Kunstleve, R. W. (2001). Xtal System of Crystallographic Programs, System Documentation. http://xtal.crystal.uwa.edu.au/man/xtal3.7228.html (or follow links to Docs→Space-Group Symbols from http://xtal.sourceforge.net).
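Because each reference setting is written as ‘_space_group.IT_number: _space_group.name_Hall’, a value can be split into its two parts mechanically. The sketch below assumes that the number never contains a colon and lies between 1 and 230; the names `ReferenceSetting` and `parse_reference_setting` are hypothetical.

```python
from typing import NamedTuple


class ReferenceSetting(NamedTuple):
    it_number: int  # space-group number, 1 to 230
    hall: str       # Hall symbol of the reference setting


def parse_reference_setting(value: str) -> ReferenceSetting:
    """Split a value such as '014:-P 2ybc' into number and Hall symbol."""
    number, sep, hall = value.partition(":")
    if not sep or not number.strip().isdigit():
        raise ValueError(f"not of the form 'NNN:Hall symbol': {value!r}")
    it_number = int(number)
    if not 1 <= it_number <= 230:
        raise ValueError(f"space-group number out of range: {it_number}")
    return ReferenceSetting(it_number, hall.strip())
```

Splitting at the first colon only keeps the Hall symbol intact even when it carries an origin shift in parentheses.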

The data value must be one of the following:

’001:P 1’ ’002:-P 1’ ’003:P 2y’ ’004:P 2yb’ ’005:C 2y’ ’006:P -2y’ ’007:P -2yc’ ’008:C -2y’ ’009:C -2yc’ ’010:-P 2y’ ’011:-P 2yb’ ’012:-C 2y’ ’013:-P 2yc’ ’014:-P 2ybc’ ’015:-C 2yc’ ’016:P 2 2’ ’017:P 2c 2’ ’018:P 2 2ab’ ’019:P 2ac 2ab’ ’020:C 2c 2’ ’021:C 2 2’ ’022:F 2 2’ ’023:I 2 2’ ’024:I 2b 2c’ ’025:P 2 -2’ ’026:P 2c -2’ ’027:P 2 -2c’ ’028:P 2 -2a’ ’029:P 2c -2ac’ ’030:P 2 -2bc’ ’031:P 2ac -2’ ’032:P 2 -2ab’ ’033:P 2c -2n’ ’034:P 2 -2n’ ’035:C 2 -2’ ’036:C 2c -2’ ’037:C 2 -2c’ ’038:A 2 -2’ ’039:A 2 -2b’ ’040:A 2 -2a’ ’041:A 2 -2ab’ ’042:F 2 -2’ ’043:F 2 -2d’ ’044:I 2 -2’ ’045:I 2 -2c’ ’046:I 2 -2a’ ’047:-P 2 2’

’048:-P 2ab 2bc’ ’049:-P 2 2c’ ’050:-P 2ab 2b’ ’051:-P 2a 2a’ ’052:-P 2a 2bc’ ’053:-P 2ac 2’ ’054:-P 2a 2ac’ ’055:-P 2 2ab’ ’056:-P 2ab 2ac’ ’057:-P 2c 2b’ ’058:-P 2 2n’ ’059:-P 2ab 2a’ ’060:-P 2n 2ab’ ’061:-P 2ac 2ab’ ’062:-P 2ac 2n’ ’063:-C 2c 2’ ’064:-C 2ac 2’ ’065:-C 2 2’ ’066:-C 2 2c’ ’067:-C 2a 2’ ’068:-C 2a 2ac’ ’069:-F 2 2’ ’070:-F 2uv 2vw’ ’071:-I 2 2’ ’072:-I 2 2c’ ’073:-I 2b 2c’ ’074:-I 2b 2’ ’075:P 4’ ’076:P 4w’ ’077:P 4c’ ’078:P 4cw’ ’079:I 4’ ’080:I 4bw’ ’081:P -4’ ’082:I -4’ ’083:-P 4’ ’084:-P 4c’ ’085:-P 4a’ ’086:-P 4bc’ ’087:-I 4’ ’088:-I 4ad’ ’089:P 4 2’ ’090:P 4ab 2ab’ ’091:P 4w 2c’ ’092:P 4abw 2nw’ ’093:P 4c 2’ ’094:P 4n 2n’ ’095:P 4cw 2c’ ’096:P 4nw 2abw’ ’097:I 4 2’ ’098:I 4bw 2bw’ ’099:P 4 -2’ ’100:P 4 -2ab’ ’101:P 4c -2c’ ’102:P 4n -2n’ ’103:P 4 -2c’ ’104:P 4 -2n’ ’105:P 4c -2’

C11 Ci1 C21 C22 C23 Cs1 Cs2 Cs3 Cs4 1 C2h 2 C2h 3 C2h 4 C2h 5 C2h 6 C2h D12 D22 D32 D42 D52 D62 D72 D82 D92 1 C2v 2 C2v 3 C2v 4 C2v 5 C2v 6 C2v 7 C2v 8 C2v 9 C2v 10 C2v 11 C2v 12 C2v 13 C2v 14 C2v 15 C2v 16 C2v 17 C2v 18 C2v 19 C2v 20 C2v 21 C2v 22 C2v D12h

P1 P 1¯ P121 P 1 21 1 C121 P1m1 P1c1 C1m1 C1c1 P 1 2/m 1 P 1 21 /m 1 C 1 2/m 1 P 1 2/c 1 P 1 21 /c 1 C 1 2/c 1 P222 P 2 2 21 P 21 21 2 P 21 21 21 C 2 2 21 C222 F 222 I 222 I 21 21 21 Pmm2 P m c 21 Pcc2 Pma2 P c a 21 Pnc2 P m n 21 Pba2 P n a 21 Pnn2 Cmm2 C m c 21 Ccc2 Amm2 Aem2 Ama2 Aea2 F mm2 Fdd2 I mm2 I ba2 I ma2 Pmmm


SPACE GROUP D22h D32h D42h D52h D62h D72h D82h D92h D10 2h D11 2h D12 2h D13 2h D14 2h D15 2h D16 2h D17 2h D18 2h D19 2h D20 2h D21 2h D22 2h D23 2h D24 2h D25 2h D26 2h D27 2h D28 2h C41 C42 C43 C44 C45 C46 S41 S42 1 C4h 2 C4h 3 C4h 4 C4h 5 C4h 6 C4h D14 D24 D34 D44 D54 D64 D74 D84 D94 D10 4 1 C4v 2 C4v 3 C4v 4 C4v 5 C4v 6 C4v 7 C4v

P n n n:2 Pccm P b a n:2 Pmma Pnna Pmna Pcca Pbam Pccn Pbcm Pnnm P m m n:2 Pbcn Pbca Pnma Cmcm Cmce Cmmm Cccm Cmme C c c e:2 F mmm F d d d:2 I mmm I bam I bca I mma P4 P 41 P 42 P 43 I4 I 41 P 4¯ I 4¯ P 4/m P 42 /m P 4/n:2 P 42 /n:2 I 4/m I 41 /a:2 P422 P 4 21 2 P 41 2 2 P 41 21 2 P 42 2 2 P 42 21 2 P 43 2 2 P 43 21 2 I422 I 41 2 2 P4mm P4bm P 42 c m P 42 n m P4cc P4nc P 42 m c


SPACE GROUP ’106:P 4c -2ab’ ’107:I 4 -2’ ’108:I 4 -2c’ ’109:I 4bw -2’ ’110:I 4bw -2c’ ’111:P -4 2’ ’112:P -4 2c’ ’113:P -4 2ab’ ’114:P -4 2n’ ’115:P -4 -2’ ’116:P -4 -2c’ ’117:P -4 -2ab’ ’118:P -4 -2n’ ’119:I -4 -2’ ’120:I -4 -2c’ ’121:I -4 2’ ’122:I -4 2bw’ ’123:-P 4 2’ ’124:-P 4 2c’ ’125:-P 4a 2b’ ’126:-P 4a 2bc’ ’127:-P 4 2ab’ ’128:-P 4 2n’ ’129:-P 4a 2a’ ’130:-P 4a 2ac’ ’131:-P 4c 2’ ’132:-P 4c 2c’ ’133:-P 4ac 2b’ ’134:-P 4ac 2bc’ ’135:-P 4c 2ab’ ’136:-P 4n 2n’ ’137:-P 4ac 2a’ ’138:-P 4ac 2ac’ ’139:-I 4 2’ ’140:-I 4 2c’ ’141:-I 4bd 2’ ’142:-I 4bd 2c’ ’143:P 3’ ’144:P 31’ ’145:P 32’ ’146:R 3’ ’147:-P 3’ ’148:-R 3’ ’149:P 3 2’ ’150:P 3 2"’ ’151:P 31 2 (0 0 4)’ ’152:P 31 2"’ ’153:P 32 2 (0 0 2)’ ’154:P 32 2"’ ’155:R 3 2"’ ’156:P 3 -2"’ ’157:P 3 -2’ ’158:P 3 -2"c’ ’159:P 3 -2c’ ’160:R 3 -2"’ ’161:R 3 -2"c’ ’162:-P 3 2’ ’163:-P 3 2c’

8 C4v 9 C4v 10 C4v 11 C4v 12 C4v D12d D22d D32d D42d D52d D62d D72d D82d D92d D10 2d D11 2d D12 2d D14h D24h D34h D44h D54h D64h D74h D84h D94h D10 4h D11 4h D12 4h D13 4h D14 4h D15 4h D16 4h D17 4h D18 4h D19 4h D20 4h C31 C32 C33 C34 C3i1 C3i2 D13 D23 D33 D43 D53 D63 D73 1 C3v 2 C3v 3 C3v 4 C3v 5 C3v 6 C3v D13d D23d

’164:-P 3 2"’ ’165:-P 3 2"c’ ’166:-R 3 2"’ ’167:-R 3 2"c’ ’168:P 6’ ’169:P 61’ ’170:P 65’ ’171:P 62’ ’172:P 64’ ’173:P 6c’ ’174:P -6’ ’175:-P 6’ ’176:-P 6c’ ’177:P 6 2’ ’178:P 61 2 (0 0 5)’ ’179:P 65 2 (0 0 1)’ ’180:P 62 2 (0 0 4)’ ’181:P 64 2 (0 0 2)’ ’182:P 6c 2c’ ’183:P 6 -2’ ’184:P 6 -2c’ ’185:P 6c -2’ ’186:P 6c -2c’ ’187:P -6 2’ ’188:P -6c 2’ ’189:P -6 -2’ ’190:P -6c -2c’ ’191:-P 6 2’ ’192:-P 6 2c’ ’193:-P 6c 2’ ’194:-P 6c 2c’ ’195:P 2 2 3’ ’196:F 2 2 3’ ’197:I 2 2 3’ ’198:P 2ac 2ab 3’ ’199:I 2b 2c 3’ ’200:-P 2 2 3’ ’201:-P 2ab 2bc 3’ ’202:-F 2 2 3’ ’203:-F 2uv 2vw 3’ ’204:-I 2 2 3’ ’205:-P 2ac 2ab 3’ ’206:-I 2b 2c 3’ ’207:P 4 2 3’ ’208:P 4n 2 3’ ’209:F 4 2 3’ ’210:F 4d 2 3’ ’211:I 4 2 3’ ’212:P 4acd 2ab 3’ ’213:P 4bd 2ab 3’ ’214:I 4bd 2c 3’ ’215:P -4 2 3’ ’216:F -4 2 3’ ’217:I -4 2 3’ ’218:P -4n 2 3’ ’219:F -4a 2 3’ ’220:I -4bd 2c 3’ ’221:-P 4 2 3’

P 42 b c I 4mm I 4cm I 41 m d I 41 c d P 4¯ 2 m P 4¯ 2 c P 4¯ 21 m P 4¯ 21 c P 4¯ m 2 P 4¯ c 2 P 4¯ b 2 P 4¯ n 2 I 4¯ m 2 I 4¯ c 2 I 4¯ 2 m I 4¯ 2 d P 4/m m m P 4/m c c P 4/n b m:2 P 4/n n c:2 P 4/m b m P 4/m n c P 4/n m m:2 P 4/n c c:2 P 42 /m m c P 42 /m c m P 42 /n b c:2 P 42 /n n m:2 P 42 /m b c P 42 /m n m P 42 /n m c:2 P 42 /n c m:2 I 4/m m m I 4/m c m I 41 /a m d:2 I 41 /a c d:2 P3 P 31 P 32 R 3:h P 3¯ ¯ R 3:h P312 P321 P 31 1 2 P 31 2 1 P 32 1 2 P 32 2 1 R 3 2:h P3m1 P31m P3c1 P31c R 3 m:h R 3 c:h P 3¯ 1 m P 3¯ 1 c


cif sym.dic D33d D43d D53d D63d C61 C62 C63 C64 C65 C66 1 C3h 1 C6h 2 C6h D16 D26 D36 D46 D56 D66 1 C6v 2 C6v 3 C6v 4 C6v D13h D23h D33h D43h D16h D26h D36h D46h 1 T T2 T3 T4 T5 Th1 Th2 Th3 Th4 Th5 Th6 Th7 O1 O2 O3 O4 O5 O6 O7 O8 Td1 Td2 Td3 Td4 Td5 Td6 O1h

P 3¯ m 1 P 3¯ c 1 R 3¯ m:h R 3¯ c:h P6 P 61 P 65 P 62 P 64 P 63 P 6¯ P 6/m P 63 /m P622 P 61 2 2 P 65 2 2 P 62 2 2 P 64 2 2 P 63 2 2 P6mm P6cc P 63 c m P 63 m c P 6¯ m 2 P 6¯ c 2 P 6¯ 2 m P 6¯ 2 c P 6/m m m P 6/m c c P 63 /m c m P 63 /m m c P23 F 23 I 23 P 21 3 I 21 3 P m 3¯ ¯ P n 3:2 F m 3¯ ¯ F d 3:2 ¯ I m3 P a 3¯ I a 3¯ P432 P 42 3 2 F 432 F 41 3 2 I 432 P 43 3 2 P 41 3 2 I 41 3 2 P 4¯ 3 m F 4¯ 3 m I 4¯ 3 m P 4¯ 3 n F 4¯ 3 c I 4¯ 3 d P m 3¯ m


’222:-P 4a 2bc 3’ ’223:-P 4n 2 3’ ’224:-P 4bc 2bc 3’ ’225:-F 4 2 3’ ’226:-F 4a 2 3’ ’227:-F 4vw 2vw 3’ ’228:-F 4ud 2vw 3’ ’229:-I 4 2 3’ ’230:-I 4bd 2c 3’

O2h O3h O4h O5h O6h O7h O8h O9h O10 h

P n 3¯ n:2 P m 3¯ n P n 3¯ m:2 F m 3¯ m F m 3¯ c F d 3¯ m:2 F d 3¯ c:2 I m 3¯ m I a 3¯ d

[space_group]

_space_group.transform_Pp_abc

This item specifies the transformation (P, p) of the basis vectors from the setting used in the CIF (a, b, c) to the reference setting given in _space_group.reference_setting (a′, b′, c′). The value is given in Jones–Faithful notation corresponding to the rotational matrix P combined with the origin shift vector p in the expression

(a′, b′, c′) = (a, b, c)P + p.

P is a post-multiplication matrix of a row (a, b, c) of column vectors. It is related to the inverse transformation (Q, q) by

P = Q⁻¹, p = −Pq = −(Q⁻¹)q.

These transformations are applied as follows:

atomic coordinates (x′, y′, z′) = Q(x, y, z) + q,
Miller indices (h′, k′, l′) = (h, k, l)P,
symmetry operations W′ = (Q, q)W(P, p),
basis vectors (a′, b′, c′) = (a, b, c)P + p.

This item is given as a character string involving the characters a, b and c with commas separating the expressions for the a′, b′ and c′ vectors. The numeric values may be given as integers, fractions or real numbers. Multiplication is implicit, division must be explicit. White space within the string is optional.

Examples: ‘-b+c, a+c, -a+b+c’ (R3:r to R3:h), ‘a-1/4, b-1/4, c-1/4’ (Pnnn:1 to Pnnn:2), ‘b-1/2, c-1/2, a-1/2’ (Bbab:1 to Ccca:2). [space_group]

SPACE_GROUP_SYMOP

Contains information about the symmetry operations of the space group.

Category key(s): _space_group_symop.id

Example 1 – the symmetry operations for the space group P2₁/c.

loop_
_space_group_symop.id
_space_group_symop.operation_xyz
_space_group_symop.operation_description
1  x,y,z           'identity mapping'
2  -x,-y,-z        'inversion'
3  -x,1/2+y,1/2-z  '2-fold screw rotation with axis in (0,y,1/4)'
4  x,1/2-y,1/2+z   'c glide reflection through the plane (x,1/4,z)'

_space_group_symop.generator_xyz (char)

A parsable string giving one of the symmetry generators of the space group in algebraic form. If W is a matrix representation of the rotational part of the generator defined by the positions and signs of x, y and z, and w is a column of translations defined by the fractions, an equivalent position x′ is generated from a given position x by

x′ = Wx + w.

When a list of symmetry generators is given, it is assumed that the complete list of symmetry operations of the space group (including the identity operation) can be generated through repeated multiplication of the generators, that is, (W₃, w₃) is an operation of the space group if (W₂, w₂) and (W₁, w₁) [where (W₁, w₁) is applied first] are either operations or generators and

W₃ = W₂ × W₁, w₃ = W₂ × w₁ + w₂.

Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

Related item: _space_group_symop.operation_xyz (alternate). Where no value is given, the assumed value is ‘x,y,z’.

Example: ‘x,1/2-y,1/2+z’ (c glide reflection through the plane (x, 1/4, z) chosen as one of the generators of the space group). [space_group_symop]

*_space_group_symop.id (char)

_symmetry_equiv_pos_site_id (cif_core.dic 1.0) _symmetry_equiv.id (cif_mm.dic 1.0)

An arbitrary identifier that uniquely labels each symmetry operation in the list. [space_group_symop]

_space_group_symop.operation_description (char) An optional text description of a particular symmetry operation of the space group.

[space_group_symop]

_space_group.transform_Qq_xyz

This item specifies the transformation (Q, q) of the atomic coordinates from the setting used in the CIF [(x, y, z) referred to the basis vectors (a, b, c)] to the reference setting given in _space_group.reference_setting [(x′, y′, z′) referred to the basis vectors (a′, b′, c′)]. The value given in Jones–Faithful notation corresponds to the rotational matrix Q combined with the origin shift vector q in the expression

(x′, y′, z′) = Q(x, y, z) + q.

Q is a pre-multiplication matrix of the column vector (x, y, z). It is related to the inverse transformation (P, p) by

P = Q⁻¹, p = −Pq = −(Q⁻¹)q,

where the P and Q transformations are applied as follows: atomic coordinates (x′ , y′ , z′ ) = Q(x, y, z) + q, Miller indices (h′ , k′ , l ′ ) = (h, k, l)P, symmetry operations W ′ = (Q, q)W (P, p), basis vectors (a′ , b′ , c′ ) = (a, b, c)P + p. This item is given as a character string involving the characters x, y and z with commas separating the expressions for the x′ , y′ and z′ components. The numeric values may be given as integers, fractions or real numbers. Multiplication is implicit, division must be explicit. White space within the string is optional.
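The relation between the two transformation pairs can be sketched numerically. The code below assumes that the matrix P and column p have already been extracted from the _space_group.transform_Pp_abc string (parsing is not shown), and uses Q = P⁻¹ together with q = −Qp, which follows from p = −(Q⁻¹)q. All function names are hypothetical.

```python
from fractions import Fraction


def invert_3x3(m):
    """Invert a 3x3 matrix of Fractions using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("matrix is singular")
    adjugate = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[entry / det for entry in row] for row in adjugate]


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a column vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def inverse_transformation(P, p):
    """Given (P, p), return (Q, q) with Q = P^-1 and q = -Q p."""
    P = [[Fraction(x) for x in row] for row in P]
    p = [Fraction(x) for x in p]
    Q = invert_3x3(P)
    q = [-x for x in mat_vec(Q, p)]
    return Q, q


def transform_coordinates(Q, q, xyz):
    """Apply (x', y', z') = Q(x, y, z) + q."""
    return [a + b for a, b in zip(mat_vec(Q, xyz), q)]
```

For the pure origin shift of the Pnnn:1 to Pnnn:2 case (P the identity, p = (−1/4, −1/4, −1/4)), this gives q = (1/4, 1/4, 1/4), in agreement with the pair of example strings quoted for the two items. Exact fractions are used so that values such as 1/3 survive the inversion without rounding.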

Examples: ‘-x/3+2y/3-z/3, -2x/3+y/3+z/3, x/3+y/3+z/3’ (R3:r to R3:h), ‘x+1/4,y+1/4,z+1/4’ (Pnnn:1 to Pnnn:2), ‘z+1/2,x+1/2,y+1/2’ (Bbab:1 to Ccca:2). [space_group]

_space_group_symop.operation_xyz (char)

_symmetry_equiv_pos_as_xyz (cif_core.dic 1.0)

A parsable string giving one of the symmetry operations of the space group in algebraic form. If W is a matrix representation of the rotational part of the symmetry operation defined by the positions and signs of x, y and z, and w is a column of translations defined by the fractions, an equivalent position x′ is generated from a given position x by

x′ = Wx + w.


When a list of symmetry operations is given, it is assumed that the list contains all the operations of the space group (including the identity operation) as given by the representatives of the general position in International Tables for Crystallography Volume A. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.
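The algebraic strings can be turned into (W, w) pairs, applied to a position, and multiplied together following the rule W₃ = W₂W₁, w₃ = W₂w₁ + w₂ given for generators. The sketch below is illustrative only: it assumes coefficients of ±1 on x, y and z, translations written as integers, fractions or decimals, and it reduces translations modulo 1 (that is, operations are compared modulo lattice translations). The function names are hypothetical.

```python
import re
from fractions import Fraction

TERM = re.compile(r"([+-]?)([xyz]|\d+(?:/\d+|\.\d+)?)")


def parse_xyz(op):
    """Parse a string such as 'x,1/2-y,1/2+z' into (W, w)."""
    W, w = [], []
    for expr in op.replace(" ", "").lower().split(","):
        row, shift = [0, 0, 0], Fraction(0)
        # Each term is a signed variable (x, y or z) or a signed number.
        for sign, term in TERM.findall(expr):
            factor = -1 if sign == "-" else 1
            if term in ("x", "y", "z"):
                row["xyz".index(term)] += factor
            else:
                shift += factor * Fraction(term)
        W.append(row)
        w.append(shift % 1)
    return W, w


def apply_op(op, xyz):
    """Return x' = W x + w for one position."""
    W, w = op
    return [sum(W[r][k] * xyz[k] for k in range(3)) + w[r] for r in range(3)]


def compose(op2, op1):
    """Return the product in which op1 is applied first."""
    (W2, w2), (W1, w1) = op2, op1
    W3 = [[sum(W2[r][k] * W1[k][c] for k in range(3)) for c in range(3)]
          for r in range(3)]
    w3 = [(sum(W2[r][k] * w1[k] for k in range(3)) + w2[r]) % 1
          for r in range(3)]
    return W3, w3


def generate_group(generators):
    """Close a set of generators under multiplication, modulo translations."""
    identity = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [Fraction(0)] * 3)
    ops, frontier = [identity], [identity]
    while frontier:
        found = []
        for known in frontier:
            for gen in generators:
                product = compose(gen, known)
                if product not in ops and product not in found:
                    found.append(product)
        ops.extend(found)
        frontier = found
    return ops
```

With the generators ‘-x,-y,-z’ and ‘-x,1/2+y,1/2-z’, `generate_group` returns the four operations listed in the P2₁/c example of this category. For centred lattices the centring translations would have to be supplied as additional generators.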


Related item: _space_group_symop.generator_xyz (alternate). Where no value is given, the assumed value is ‘x,y,z’.

Example: ‘x,1/2-y,1/2+z’ (c glide reflection through the plane (x, 1/4, z)). [space_group_symop]

_space_group_symop.sg_id (numb) A child of _space_group.id allowing the symmetry operation to be identified with a particular space group.

[space_group_symop]

SPACE_GROUP_WYCKOFF

Contains information about Wyckoff positions of a space group. Only one site can be given for each special position but the remainder can be generated by applying the symmetry operations stored in _space_group_symop.operation_xyz.

Category key(s): _space_group_Wyckoff.id

Example 1 – this example is taken from the space group Fd3̄c (No. 228, origin choice 2). For brevity only a selection of special positions is listed. The coordinates of only one site per special position can be given in this item, but the coordinates of the other sites can be generated using the symmetry operations given in the SPACE_GROUP_SYMOP category.

loop_
_space_group_Wyckoff.id
_space_group_Wyckoff.multiplicity
_space_group_Wyckoff.letter
_space_group_Wyckoff.site_symmetry
_space_group_Wyckoff.coords_xyz
1  192  h  1    x,y,z
2   96  g  ..2  1/4,y,-y
3   96  f  2..  x,1/8,1/8
4   32  b  .32  1/4,1/4,1/4

_space_group_Wyckoff.coords_xyz (char) Coordinates of one site of a Wyckoff position expressed in terms of its fractional coordinates (x, y, z) in the unit cell. To generate the coordinates of all sites of this Wyckoff position, it is necessary to apply to these coordinates the symmetry operations stored in _space_group_symop.operation_xyz.

Where no value is given, the assumed value is ‘x,y,z’. Example: ‘x,1/2,0’ (coordinates of Wyckoff site with 2.. symmetry). [space_group_Wyckoff]

*_space_group_Wyckoff.id (char)

An arbitrary identifier that is unique to a particular Wyckoff position. [space_group_Wyckoff]

_space_group_Wyckoff.letter (char) The Wyckoff letter associated with this position, as given in International Tables for Crystallography Volume A. The enumeration value \a corresponds to the Greek letter ‘α’ used in International Tables. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

The data value must be one of the following:

a b c d e f g h i j k l m n o p q r s t u v w x y z \a [space_group_Wyckoff]

_space_group_Wyckoff.multiplicity (numb)

The multiplicity of this Wyckoff position as given in International Tables Volume A. It is the number of equivalent sites per conventional unit cell. Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

The permitted range is [1, ∞).

[space_group_Wyckoff]

_space_group_Wyckoff.sg_id (char) A child of _space_group.id allowing the Wyckoff position to be identified with a particular space group.

[space_group_Wyckoff]

_space_group_Wyckoff.site_symmetry (char)

The subgroup of the space group that leaves the point fixed. It is isomorphic to a subgroup of the point group of the space group. The site-symmetry symbol indicates the symmetry in the symmetry direction determined by the Hermann–Mauguin symbol of the space group (see International Tables for Crystallography Volume A, Section 2.2.12). Reference: International Tables for Crystallography (2002). Volume A, Space-group symmetry, edited by Th. Hahn, 5th ed. Dordrecht: Kluwer Academic Publishers.

Examples: ‘2.22’ (position 2b in space group No. 94, P4₂2₁2), ‘42.2’ (position 6b in space group No. 222, Pn3̄n), ‘2..’ (site symmetry for the Wyckoff position 96f in space group No. 228, Fd3̄c; the site-symmetry group is isomorphic to the point group 2 with the twofold axis along one of the ⟨100⟩ directions).


[space_group_Wyckoff]
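The statement that the remaining sites of a Wyckoff position follow from the stored symmetry operations can be illustrated with a short sketch. It assumes the operations are already available as (W, w) pairs, uses the four operations of the P2₁/c example of the SPACE_GROUP_SYMOP category rather than the 192 operations of the Fd3̄c example, and reduces every image into the unit cell. The names `P21C_OPS` and `wyckoff_sites` are hypothetical.

```python
from fractions import Fraction

ZERO, HALF = Fraction(0), Fraction(1, 2)

# The four operations of P2_1/c as (W, w) pairs.
P21C_OPS = [
    ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [ZERO, ZERO, ZERO]),     # x,y,z
    ([[-1, 0, 0], [0, -1, 0], [0, 0, -1]], [ZERO, ZERO, ZERO]),  # -x,-y,-z
    ([[-1, 0, 0], [0, 1, 0], [0, 0, -1]], [ZERO, HALF, HALF]),   # -x,1/2+y,1/2-z
    ([[1, 0, 0], [0, -1, 0], [0, 0, 1]], [ZERO, HALF, HALF]),    # x,1/2-y,1/2+z
]


def wyckoff_sites(site, operations):
    """Return the distinct images of `site`, each reduced into the unit cell."""
    sites = []
    for W, w in operations:
        image = tuple(
            (sum(W[r][k] * site[k] for k in range(3)) + w[r]) % 1
            for r in range(3)
        )
        if image not in sites:
            sites.append(image)
    return sites
```

Under these assumptions the number of distinct images equals the multiplicity: the special position at the origin yields two sites, whereas a general position yields four. For a centred cell the list of operations would also need to include the centring translations.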


International Tables for Crystallography (2006). Vol. G, Chapter 4.8, pp. 467–470.

4.8. Molecular Information File dictionary (MIF)

By F. H. Allen, J. M. Barnard, A. P. F. Cook and S. R. Hall

This is version 1.2.3 of the molecular information file dictionary (MIF). MIF complements the crystallographic information file with data items describing two-dimensional chemical structure diagrams and bond types in a similar STAR format. Unlike CIF, MIF permits nested looping of data items and data encapsulation in save frames. The MIF format is described in Chapter 2.4.

_atom_attach_all _atom_attach_ring _atom_attach_nh _atom_attach_h

Appears in list as essential element of loop structure. May match child data name(s): _bond_id_1, _bond_id_2, _stereo_vertex_id. [atom]

(numb)

_atom_label

_atom_charge

[atom]

(numb)

_atom_coord_x _atom_coord_y _atom_coord_z

[atom]

(numb)

(char)

_atom_frag_id

(numb) [atom]

_atom_frag_key (char) Name of save frame containing atom data on molecular fragment. Appears in list containing _atom_id.

[atom]

Affiliations: Frank H. Allen, Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, England; John M. Barnard and Anthony P. F. Cook, BCI Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, England; Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia.

Copyright © 2006 International Union of Crystallography

_atom_radical_count

(numb)

The data value must be one of the following: H 1 He 2 Li 3 Be N 7 O 8 F 9 Ne Al 13 Si 14 P 15 S K 19 Ca 20 Sc 21 Ti Mn 25 Fe 26 Co 27 Ni Ga 31 Ge 32 As 33 Se Rb 37 Sr 38 Y 39 Zr Tc 43 Ru 44 Rh 45 Pd In 49 Sn 50 Sb 51 Te Cs 55 Ba 56 La 57 Ce Pm 61 Sm 62 Eu 63 Gd Ho 67 Er 68 Tm 69 Yb Ta 73 W 74 Re 75 Os Au 79 Hg 80 Tl 81 Pb At 85 Rn 86 Fr 87 Ra

[atom]

ID of atom site within the molecular fragment. Appears in list containing _atom_id.

[atom]

[atom]

[atom]

_atom_type (char) Specifies the type of atom site identified by _atom_id. Hydrogen atoms will usually not be assigned to sites except where they are connected to more than one other atom, or where it is necessary to reference them, for example, in describing stereochemistry.

a single atom site atom site in molecular fragment

Where no value is given, the assumed value is ‘atom’.

Appears in list containing _atom_id. The permitted range is 1 → ∞.

Appears in list containing _atom_id. The permitted range is 1 → ∞.

[atom]

Appears in list containing _atom_id. The data value must be one of the following:

atom frag

(numb)

_atom_spin_multiplicity (numb) The spin multiplicity applies only if the _atom_radical_count value is nonzero, and is derived from the formula 2S + 1, where S is the absolute value of the sum of the electron spins (+1/2 or −1/2) on the atom. Thus, a biradical atom may have the values 1 or 3 [2(1/2 − 1/2) + 1 or 2(1/2 + 1/2) + 1] to represent a singlet or a triplet biradical.
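The 2S + 1 rule can be checked with a few lines of code. This is a minimal sketch in which the unpaired electron spins are supplied explicitly as +1/2 or −1/2; the function name `spin_multiplicity` is hypothetical.

```python
from fractions import Fraction


def spin_multiplicity(spins):
    """Return 2S + 1, where S is the absolute value of the summed spins."""
    total_spin = abs(sum(Fraction(s) for s in spins))
    return int(2 * total_spin + 1)
```

Two opposed spins give 1 (singlet biradical) and two parallel spins give 3 (triplet biradical), matching the values quoted in the definition.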

[atom]

_atom_environment Environment of the atom site.

_atom_mass_number

Appears in list containing _atom_id. The permitted range is 0 → ∞.

Specifies the Cartesian coordinates for the atom with an arbitrary origin and arbitrary orthogonal axes. Appears in list containing _atom_id.

[atom]

Specifies the formal radical state of the atom site identified by _atom_id. This is the number of unpaired electrons associated with the atom, e.g. the value of 1 specifies one singly occupied orbital on the atom that is not involved in bonding.

_atom_cip (char) Specifies the Cahn–Ingold–Prelog designation for the atom. The designators are by Prelog & Helmchen (1982). Reference: Prelog, V. & Helmchen, G. (1982). Angew. Chem. Int. Ed. Engl. 21, 567–583. Appears in list containing _atom_id.

Appears in list containing _atom_id.

Specifies the mass number or the isotopic state of the atom. The default is the ‘most abundant naturally occurring’ value.

Specifies the formal electronic charge on the atom for the different atomic representation conventions. The convention for charge is specified by _define_bonding_convention. Appears in list containing _atom_id. The permitted range is −99 → 99.

(char)

The code providing unique identification of the atom site for display purposes. See also _atom_type.

all sites, all sites forming rings, all sites excluding hydrogens and unshared electron pairs, hydrogen-atom sites.

Appears in list containing _atom_id. The permitted range is 0 → ∞.

S. R. H ALL

_atom_id (char) This specifies a unique code for an ‘atom site’ in a molecule or fragment. A designated atom site may be occupied by a ‘dummy’ atom (see _atom_type). A special syntax exists for this code which permits a template molecular fragment to be referred to as an ‘atom site’, and for the atom within that template to be identified. The syntax is m > n where m is the code identifying the template fragment (contained within a save frame) and n is the code matching an _atom_id value stored within the save frame.
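The m > n syntax can be separated into its template and atom parts as sketched below. The handling of white space around the ‘>’ character is an assumption, and the names `AtomSiteRef` and `parse_atom_id` are hypothetical.

```python
from typing import NamedTuple, Optional


class AtomSiteRef(NamedTuple):
    fragment: Optional[str]  # code m of the template fragment, if any
    atom: str                # code n, or the plain atom-site code


def parse_atom_id(code: str) -> AtomSiteRef:
    """Split an _atom_id value of the form 'm > n' into fragment and atom codes."""
    if ">" in code:
        fragment, atom = (part.strip() for part in code.split(">", 1))
        return AtomSiteRef(fragment, atom)
    return AtomSiteRef(None, code.strip())
```

A plain code is returned with no fragment, so the same function can be used for both ordinary atom sites and sites inside a template save frame.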

The number of atom sites considered to be attached (i.e. chemically bonded) to this site. The extensions are: all ring nh h

AND


4 10 16 22 28 34 40 46 52 58 64 70 76 82 88

B Na Cl V Cu Br Nb Ag I Pr Tb Lu Ir Bi Ac

5 11 17 23 29 35 41 47 53 59 65 71 77 83 89

C Mg Ar Cr Zn Kr Mo Cd Xe Nd Dy Hf Pt Po Th

6 12 18 24 30 36 42 48 54 60 66 72 78 84 90


ATOM Pa 91 U Bk 97 Cf Lr 103 usp

92 Np 93 Pu 94 Am 98 Es 99 Fm 100 Md ‘unshared electron pair’ dum

E P

95 Cm 96 101 No 102 ‘dummy atom’

Appears in list containing _atom_id.

_atom_valency

[bond] [atom]

_bond_type_mif

(numb)

Specifies the valency state of the atom site identified by _atom_id. Appears in list containing _atom_id. The permitted range is −99 → 99.

_bond_environment Specifies the connection environment of the bond.

(char)

Code indicating the nature of the bond according to the MIF bonding convention. Aromatic and normalized tautomeric bonds cannot be shown. The convention is specified with _define_bonding_convention.

[atom]

_bond_cip (char) Specifies the Cahn–Ingold–Prelog designation for the bond. The designators are by Prelog & Helmchen (1982). Reference: Prelog, V. & Helmchen, G. (1982). Angew. Chem. Int. Ed. Engl. 21, 567–583. Appears in list containing _bond_id_1, _bond_id_2.

mif core.dic

equivalent (delocalized double) bond π bond (metal–ligand π interaction)

Appears in list containing _bond_id_1, _bond_id_2. Related items: _bond_type_casreg3 (convention), _bond_type_ccdc (convention). The data value must be one of the following:

[bond]

S D T O

(char)

single (two-electron) bond double (four-electron) bond triple (six-electron) bond other (e.g. coordination) bond

Appears in list containing _bond_id_1, _bond_id_2. The data value must be one of the following:

ring chain

[bond]

in a ring in a chain

_define_bonding_convention (char) Specifies the convention used for the values of the items _atom_charge, _atom_radical_count, _atom_spin_multiplicity and _atom_valency.

[bond]

_bond_id_1 _bond_id_2

(char)

The data value must be one of the following:

Specify the atom-site codes of ‘chemically connected’ sites. The atom ID codes may appear in either order except for asymmetric bonds. Each ‘bond’ pair may be represented only once. The special syntax for representing atom sites within molecular templates is described in the _atom_id definition.

casreg3 Chemical Abstracts Service Registry III ccdc Cambridge Crystallographic Data Centre mif basic MIF

Appears in list as essential element of loop structure. Must match parent data name _atom_id. [bond]

_define_stereo_relationship (char) Defines the enantiomorphic relationship of the stereo geometry in the data cell (data block or save frame). The descriptions of independently defined stereo regions, whose centres all have the same inter-centre relationship, may be grouped together. For each such group, the inter-centre relationship is specified by _define_stereo_relationship. Where there is only one such group, all the relevant data may be included in the data block or save frame for the molecule as a whole. Where there are several different such groups, each should be shown in a separate save frame and _reference_stereo_group should be used to reference the different save frames. For each stereo group, a two-level loop structure will be used to define the stereocentres it contains. Each stereogenic atom site is defined by _stereo_atom_id and _stereo_geometry at the first loop level, and by _stereo_vertex_id at the second. Each stereogenic bond is defined by _stereo_bond_id_1, _stereo_bond_id_2 and _stereo_geometry at the first level, and by _stereo_vertex_id at the second level.

_bond_type_casreg3 (char) Code indicating the nature of the bond according to the Chemical Abstracts Service Registry III bonding convention. The convention is specified with _define_bonding_convention. Reference: Mockus, J. & Stobaugh, R. E. (1980). J. Chem. Inf. Comput. Sci. 20, 18–22. Appears in list containing _bond_id_1, _bond_id_2. Related items: _bond_type_mif (convention), _bond_type_ccdc (convention). The data value must be one of the following:

S D T A U

[define]

Where no value is given, the assumed value is ‘mif’.

single exact bond double exact bond triple exact bond ring alternating normalized bond tautomer normalized bond [bond]

_bond_type_ccdc (char) Code indicating the nature of the bond according to the Cambridge Crystallographic Data Centre (CCDC) bonding convention. The convention is specified with _define_bonding_convention. Reference: Allen, F. H., Davies, J. E., Galloy, J. J., Johnson, O., Kennard, O., Macrae, C. F., Mitchell, E. M., Mitchell, G. F., Smith, J. M. & Watson, D. G. (1991). J. Chem. Inf. Comput. Sci. 31, 187–204.

The data value must be one of the following:

absolute relative unknown racemic absolute excess relative excess

configuration is as shown configuration is relative configuration is unknown two stereoisomers: equal mix of D and L two stereoisomers; excess as shown two stereoisomers; excess unknown

Where no value is given, the assumed value is ‘unknown’.

[define]

Appears in list containing _bond_id_1, _bond_id_2. Related items:

_display_colour

_bond_type_casreg3 (convention), _bond_type_mif (convention). The data value must be one of the following:

S D T Q A C

(char)

The colour code specifying the colour of the object identified by _display_symbol. The permitted colour codes are stored with the RGB ratios as a separate validation file ‘mif core colours.val’.

single (two-electron) bond or σ bond to metal double (four-electron) bond triple (six-electron) bond quadruple (eight-electron, metal–metal) bond alternating normalized ring bond (aromatic) catena-forming bond in crystal structure

The data value must be one of the following (colour code followed by its RGB ratio):

black             000:000:000      white          255:255:255
white snow        255:250:250      white smoke    245:245:245
white ivory       255:255:240      white azure    240:255:255
white lavender    230:230:250      white rose     255:228:225
grey              192:192:192      grey light     211:211:211

4.8. MOLECULAR INFORMATION FILE DICTIONARY (MIF)

grey slate                112:128:144      grey slate dark          047:079:079
grey slate light          119:136:153      blue                     000:000:255
blue light                176:224:230      blue medium              000:000:205
blue dark                 025:025:112      blue navy                000:000:128
blue slate dark           072:061:139      blue slate               106:090:205
blue royal                065:105:225      blue sky                 135:206:235
blue sky deep             000:191:255      blue sky light           135:206:250
blue steel                070:130:180      turquoise                064:224:208
turquoise light           175:238:238      turquoise dark           000:206:209
turquoise medium          072:209:204      cyan                     000:255:255
cyan light                224:255:255      green                    000:255:000
green light               152:251:152      green dark               000:100:000
green sea                 046:139:087      green sea dark           143:188:143
green sea medium          060:179:113      green sea light          032:178:170
green spring              000:255:127      green lawn               124:252:000
green aquamarine          127:255:212      green chartreuse         127:255:000
green yellow              173:255:047      green lime               050:205:050
green forest              034:139:034      green olive              107:142:035
green khaki               240:230:140      yellow                   255:255:000
yellow light              255:255:224      yellow gold              255:215:000
yellow goldenrod          218:165:032      yellow goldenrod pale    238:232:170
yellow goldenrod light    238:221:130      yellow goldenrod dark    184:134:011
yellow green              154:205:050      brown                    165:042:042
brown rosy                188:143:143      brown indian red         205:092:092
brown saddle              139:069:019      brown sienna             160:082:045
brown peru                205:133:063      brown burlywood          222:184:135
brown beige               245:245:220      brown wheat              245:222:179
brown sandy               244:164:096      brown tan                210:180:140
brown chocolate           210:105:030      salmon                   250:128:114
salmon light              255:160:122      salmon dark              233:150:122
orange                    255:165:000      orange dark              255:140:000
red                       255:000:000      red coral                255:127:080
red tomato                255:099:071      red orange               255:069:000
red violet                219:112:147      red maroon               176:048:096
pink                      255:192:203      pink light               255:182:193
pink deep                 255:020:147      pink hot                 255:105:180
violet                    238:130:238      violet red medium        199:021:133
violet red                208:032:144      violet magenta           255:000:255
violet plum               221:160:221      violet orchid            218:112:214
violet dark               148:000:211      violet blue              138:043:226
violet purple             160:032:240      violet thistle           216:191:216
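As an illustration of how an application might consume these colour codes, the following minimal Python sketch splits a colon-separated RGB ratio into its three components. The function names `parse_rgb` and `to_hex` and the small `COLOURS` mapping are hypothetical and are not part of the MIF dictionary; the 0 to 255 range check is an assumption based only on the values listed above.

```python
def parse_rgb(code):
    """Split a colon-separated RGB ratio such as '255:250:250' into integers."""
    parts = code.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated fields, got {code!r}")
    rgb = tuple(int(p) for p in parts)
    # Assumption: every component lies in 0-255, as all listed values do.
    if any(not 0 <= c <= 255 for c in rgb):
        raise ValueError(f"component out of range 0-255 in {code!r}")
    return rgb


# A few entries taken from the list above (illustrative subset only).
COLOURS = {
    "black": "000:000:000",
    "white snow": "255:250:250",
    "grey light": "211:211:211",
}


def to_hex(name):
    """Return a '#rrggbb' string for a colour code in COLOURS."""
    return "#{:02x}{:02x}{:02x}".format(*parse_rgb(COLOURS[name]))
```

A display program could call `to_hex("white snow")` to obtain a value suitable for a graphics toolkit that expects hexadecimal colours.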

Appears in list at level 2 containing _display_conn_id. Where no value is given, the assumed value is ‘black’. [display_conn]

_display_conn_id

(numb)

The identifying number of a display object which is connected to another display object in level 1 of the loop packet which is designated as _display_id. The ID number appearing as the _display_conn_id must appear elsewhere in the loop structure as a value for _display_id. Appears in list at level 2 as essential element of loop structure. Must match parent data name _display_id. [display_conn]

_display_conn_symbol (char) The symbol code for the bond connection to the displayed objects identified by _display_id and _display_conn_id. The symbol codes and descriptions are stored in the validation file ‘mif core bonds.val’. The data value must be one of the following:

.c -c 1c 2c 3c .b -b 1b 2b 3b >b data value).

[item_enumeration]

[item_range]

*_item_enumeration.value

(any)

A permissible value, character or number for the defined item.

_item_range.minimum

[item_enumeration]

(any)

Minimum permissible value of a data item or the lower bound of a permissible range (minimum value < data value). [item_range]

ITEM EXAMPLES Attributes for describing examples of applications of the data item. Category group(s): ddl_group item_group Category key(s): _item_examples.name _item_examples.case

_item_examples.case

ITEM RELATED Attributes which specify recognized relationships between data items. Category group(s): ddl_group item_group Category key(s): _item_related.name _item_related.related_name _item_related.function_code

(text)

An example application of the defined data item. [item_examples]


4. DATA DICTIONARIES

*_item_related.function_code (code)

The code for the type of relationship between the item identified by _item_related.name and the defined item.

‘alternate’ indicates that the item identified in _item_related.related_name is an alternative expression in terms of its application and attributes to the item in this definition.

‘alternate_exclusive’ indicates that the item identified in _item_related.related_name is an alternative expression in terms of its application and attributes to the item in this definition. Only one of the alternative forms may be specified.

‘convention’ indicates that the item identified in _item_related.related_name differs from the defined item only in terms of a convention in its expression.

‘conversion_constant’ indicates that the item identified in _item_related.related_name differs from the defined item only by a known constant.

‘conversion_arbitrary’ indicates that the item identified in _item_related.related_name differs from the defined item only by an arbitrary constant.

‘replaces’ indicates that the defined item replaces the item identified in _item_related.related_name.

‘replacedby’ indicates that the defined item is replaced by the item identified in _item_related.related_name.

‘associated_value’ indicates that the item identified in _item_related.related_name is meaningful when associated with the defined item.

‘associated_esd’ indicates that the item identified in _item_related.related_name is the estimated standard deviation of the defined item.

‘associated_error’ indicates that the item identified in _item_related.related_name is the estimated error of the defined item.

ddl core

*_item_structure_list.code

(code)

The name of the matrix/vector structure declaration. The following item(s) have an equivalent role in their respective categories:

_item_structure.code.

[item_structure_list]

*_item_structure_list.dimension

(int)

Identifies the length of this row or column of the structure. The permitted range is [1, ∞).

[item_structure_list]

*_item_structure_list.index

(int)

Identifies the one-based index of a row or column of the structure. The permitted range is [1, ∞).

[item_structure_list]

ITEM SUB CATEGORY This category assigns data items to subcategories. Category group(s): sub_category_group item_group Category key(s): _item_sub_category.id _item_sub_category.name

ITEM TYPE Attributes for specifying the data-type code for each data item. Category group(s): ddl_group item_group Category key(s): _item_type.name

The data value must be one of the following:

alternate               alternate form of the item
alternate_exclusive     mutually exclusive alternate form of the item
convention              depends on defined convention
conversion_constant     related by a known conversion factor
conversion_arbitrary    related by an arbitrary conversion factor
replaces                a replacement definition
replacedby              an obsolete definition
associated_value        a meaningful value when related to the item
associated_esd          an estimated standard deviation of the item
associated_error        an estimated error of the item

ITEM TYPE CONDITIONS Attributes for specifying additional conditions associated with the data type of the item. Category group(s): ddl_group item_group compliance_group Category key(s): _item_type_conditions.name

*_item_type_conditions.code

(code)

Codes defining conditions on the _item_type.code specification. ‘esd’ permits a number string to contain an appended standard deviation number enclosed within parentheses, e.g. 4.37(5). ‘seq’ permits data to be declared as a sequence of values separated by a comma or a colon . The sequence v1 , v2 , v3 . . . signals that v1 , v2 , v3 etc. are alternative values for the data item. The sequence v1 :v2 signals that v1 and v2 are the boundary values of a continuous range of values. This mechanism was used to specify permitted ranges of an item in previous DDL versions. Combinations of alternate and range sequences are permitted.
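The ‘esd’ and ‘seq’ conditions can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `parse_esd` and `parse_seq` are hypothetical, exponent notation is not handled, and the esd is assumed to apply to the last quoted digits of the value (the usual crystallographic reading, so that 4.37(5) carries an esd of 0.05).

```python
import re
from decimal import Decimal

# A number optionally followed by an integer esd in parentheses, e.g. 4.37(5).
_ESD = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:\((\d+)\))?$")


def parse_esd(text):
    """Return (value, esd) as Decimals; esd is None when none is appended."""
    match = _ESD.match(text)
    if match is None:
        raise ValueError(f"not a number with optional esd: {text!r}")
    value = Decimal(match.group(1))
    if match.group(2) is None:
        return value, None
    # Assumption: the esd digits are scaled to the last decimal place of the value.
    esd = Decimal(match.group(2)).scaleb(value.as_tuple().exponent)
    return value, esd


def parse_seq(text):
    """Split a 'seq' string into alternatives and (low, high) range tuples."""
    result = []
    for part in text.split(","):
        if ":" in part:
            low, high = part.split(":")
            result.append((low.strip(), high.strip()))  # continuous range
        else:
            result.append(part.strip())                 # single alternative
    return result
```

Using `Decimal` rather than `float` keeps the number of quoted digits, which is what determines the scale of the esd.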

[item_related]

ITEM STRUCTURE This category holds the association between data items and named vector/matrix declarations. Category group(s): ddl_group item_group Category key(s): _item_structure.name

The data value must be one of the following:

none    no extra conditions apply to this data item
esd     numbers may have esd values appended within parentheses
seq     data may be declared as a comma- or colon-separated sequence

[item_type_conditions]

*_item_structure.organization (code)

Identifies whether the structure is defined in column- or row-major order. Only the unique elements of symmetric matrices are specified. The data value must be one of the following:

columnwise    column-major order
rowwise       row-major order
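The difference between the two organization codes can be made concrete with a short Python sketch that unpacks a flat list of values into a full rectangular matrix. The function name `unpack_matrix` and its arguments are hypothetical, and the special case of symmetric matrices, for which only the unique elements are stored, is deliberately not handled here.

```python
def unpack_matrix(flat, nrows, ncols, organization):
    """Rebuild a list-of-rows matrix from values stored in the given order."""
    if len(flat) != nrows * ncols:
        raise ValueError("number of values does not match the dimensions")
    if organization == "rowwise":
        # Row-major: element (i, j) is stored at position i * ncols + j.
        return [[flat[i * ncols + j] for j in range(ncols)] for i in range(nrows)]
    if organization == "columnwise":
        # Column-major: element (i, j) is stored at position j * nrows + i.
        return [[flat[j * nrows + i] for j in range(ncols)] for i in range(nrows)]
    raise ValueError(f"unknown organization code: {organization!r}")
```

The same six stored values therefore yield different matrices depending on the declared organization, which is why the code must accompany the data.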

ITEM TYPE LIST Attributes which define each type code. Category group(s): ddl_group item_group Category key(s): _item_type_list.code

ITEM STRUCTURE LIST This category holds a description for each structure type. Category group(s): ddl_group item_group Category key(s): _item_structure_list.code _item_structure_list.index

*_item_type_list.code

(code)

The codes specifying the nature of the data value. The following item(s) have an equivalent role in their respective categories:

_item_type.code.


[item_type_list]

4.10. DDL2 DICTIONARY

_item_type_list.construct (text)

When a data value can be defined as a pre-determined sequence of characters, optional characters or data names (for which the definition is also available), it is specified as a construction. The rules of construction conform to the regular expression (REGEX) specifications detailed in IEEE (1991). Resolved data names for which _item_type_list.construct specifications exist are replaced by these constructions, otherwise the data-name string is not replaced. Reference: IEEE (1991). IEEE Standard for Information Technology – Portable Operating System Interface (POSIX) – Part 2: Shell and Utilities, Vol. 1, IEEE Standard 1003.2-1992. New York: The Institute of Electrical Engineers. Example: ‘ year- month- day’ (typical construction for date).

ITEM UNITS LIST

Attributes which describe the physical units of measurement in which data items may be expressed. Category group(s): ddl_group item_group Category key(s): _item_units_list.code

*_item_units_list.code The following item(s) have an equivalent role in their respective categories:

_item_units.code,

[item_type_list]

_item_units_conversion.from_code, _item_units_conversion.to_code.

_item_type_list.detail

(code)

The code specifying the name of the unit of measurement.

[item_units_list]

(text)

An optional description of the data type.

_item_units_list.detail

(text)

A description of the unit of measurement.

[item_type_list]

[item_units_list]

*_item_type_list.primitive_code

(code)

The codes specifying the primitive type of the data value. The data value must be one of the following:

numb     numerically interpretable string
char     character or text string (case-sensitive)
uchar    character or text string (case-insensitive)
null     for dictionary purposes only

METHOD LIST

Attributes specifying the list of methods applicable to data items, subcategories and categories. Category group(s): ddl_group item_group category_group Category key(s): _method_list.id

[item_type_list]

ITEM UNITS Specifies the physical units in which data items are expressed.

*_method_list.code

(code)

A code that describes the function of the method.

Category group(s): ddl_group item_group Category key(s): _item_units.name

Examples: ‘calculation’ (method to calculate the item), ‘verification’ (method to verify the data item), ‘cast’ (method to provide cast conversion), ‘addition’ (method to define item + item), ‘division’ (method to define item / item), ‘multiplication’ (method to define item × item), ‘equivalence’ (method to define item = item), ‘other’ (miscellaneous method). [method_list]

ITEM UNITS CONVERSION

_method_list.detail

Category group(s): ddl_group item_group Category key(s): _item_units_conversion.from_code _item_units_conversion.to_code

(text)

Description of application method in _method_list.id.

Conversion factors between the various units of measure defined in the ITEM_UNITS_LIST category.

[method_list]

*_method_list.id

(idname)

Unique identifier for each method listed. The following item(s) have an equivalent role in their respective categories:

_item_methods.method_id ,

*_item_units_conversion.factor

(any)

_category_methods.method_id ,

The arithmetic operation required to convert between the unit systems: to_code = from_code operator factor.

_sub_category_methods.method_id , _datablock_methods.method_id .

[method_list]

[item_units_conversion]

*_method_list.inline

_item_units_conversion.from_code

The unit system on which the conversion operation is applied to produce the unit system specified in _item_units_conversion.to_code: to_code = from_code operator factor.

*_item_units_conversion.operator

(text)

In-line text of a method associated with the data item. [method_list]

*_method_list.language

(code)

Language in which the method is expressed.

(code)

Examples: ‘BNF’, ‘C’, ‘C++’, ‘FORTRAN’, ‘LISP’, ‘PASCAL’, ‘PERL’, ‘TCL’, ‘OTHER’. [method_list]

The arithmetic operator required to convert between the unit systems: to_code = from_code operator factor. The data value must be one of the following:

+    addition
-    subtraction
*    multiplication
/    division

SUB CATEGORY The purpose of a subcategory is to define an association between data items within a category and optionally to provide a method to validate the collection of items. For example, the subcategory named ‘cartesian’ might be applied to the data items for the coordinates x, y and z.

[item_units_conversion]

_item_units_conversion.to_code

The unit system produced after an operation is applied to the unit system specified by _item_units_conversion.from_code: to_code = from_code operator factor.
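The relation between from_code, operator, factor and to_code lends itself to a table-driven implementation. The Python sketch below is one possible reading: the `CONVERSIONS` table, the unit codes in it and the function name `convert_units` are hypothetical and chosen only for illustration; a real application would populate the table from the ITEM_UNITS_CONVERSION category of the dictionary in use.

```python
import operator

# The four permitted operator symbols mapped to Python functions.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# Hypothetical rows: (from_code, to_code) -> (operator, factor).
CONVERSIONS = {
    ("angstroms", "picometres"): ("*", 100.0),
    ("picometres", "angstroms"): ("/", 100.0),
    ("kelvins", "celsius"): ("-", 273.15),
}


def convert_units(value, from_code, to_code):
    """Apply the stored operator and factor to a value given in from_code units."""
    if from_code == to_code:
        return value
    symbol, factor = CONVERSIONS[(from_code, to_code)]
    return OPERATORS[symbol](value, factor)
```

Keeping the operator as data rather than code means that new unit pairs can be supported by adding rows, without changing the conversion routine.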

Category group(s): ddl_group sub_category_group Category key(s): _sub_category.id


*_sub_category.description (text)

Description of the subcategory. [sub_category]

*_sub_category.id (idname)

The identity of the subcategory. The following item(s) have an equivalent role in their respective categories:

_sub_category_examples.id, _sub_category_methods.sub_category_id, _item_sub_category.id.

[sub_category]

SUB CATEGORY EXAMPLES

Example applications and descriptions of data items in this subcategory.

Category group(s): ddl_group sub_category_group
Category key(s): _sub_category_examples.id _sub_category_examples.case

*_sub_category_examples.case (text)

An example involving items in this subcategory. [sub_category_examples]

_sub_category_examples.detail (text)

A description of an example given in _sub_category_examples.case. [sub_category_examples]

SUB CATEGORY METHODS

Attributes specifying the association between subcategories and methods.

Category group(s): ddl_group sub_category_group
Category key(s): _sub_category_methods.method_id _sub_category_methods.sub_category_id


International Tables for Crystallography (2006). Vol. G, Chapter 5.1, pp. 481–487.

5.1. General considerations in programming CIF applications

By H. J. Bernstein

5.1.1. Introduction

There are many ways to create new ‘CIF-aware’ applications and to adapt existing applications to make them CIF-aware. This chapter reviews general considerations in programming CIF-aware applications, ranging from leaving an application CIF-unaware and relying on external filter utilities to do the job, through engineering an existing application to directly read and write CIFs, to writing a new CIF-aware application from scratch.

The adaptation of applications to CIF does not happen in isolation. There are many other data representations and metadata frameworks relevant to crystallography. In Chapter 1.1, the CIF format was placed in the historical context of the development of data representation and metadata languages. In this chapter, we deal with that context from the perspective of software design.

The major issues in making an application CIF-aware are:

(1) Are CIFs to be read?
    (i) Are these CIFs produced externally?
    (ii) Does the organization of information required by the application conform to the organization of information specified in the relevant dictionaries?
    (iii) Is maximal performance in reading important?
(2) Are CIFs to be written?
    (i) Are these CIFs to be used externally?
    (ii) Does the organization of information used by the application internally conform to the organization of information specified by the relevant dictionaries?
    (iii) Is maximal performance in writing important?

Reading a CIF is a much more complex task than writing a CIF. Two equally valid CIF presentations of exactly the same information may be ordered differently. An application that reads a CIF must be prepared to accept tags and columns of tables in any order. The same order independence means that an application may write out tags and columns of tables in any convenient order, simplifying the design of CIF write logic.

When CIFs are only to be used internally, it is tempting to adjust the format to fit the application, e.g. by imposing order dependence. However, caution is advised if there is any possibility that such an application might eventually have to deal with CIFs coming from or going to other groups. In designing the CIF interface for an application, it is prudent to assume that the read logic will eventually have to deal with externally produced CIFs and that CIFs produced by the write logic will eventually be processed by software at other sites.

If performance is not a major issue, then it can be easy to make an application CIF-aware simply by the use of external filter programs. However, when performance is an issue, it becomes necessary to integrate the CIF reading and writing logic with the application. This can be done with reasonable efficiency by the use of existing CIF-aware libraries, but such libraries can impose a cost by their use of private internal data structures to hold the information from a CIF. Integrated design from scratch may be needed for maximal performance.

Creating or adapting software for CIF is an example of creating or adapting software for an agreed format, a format to be adhered to in the creation of multiple applications. There are different levels of ‘agreement’. The agreement may apply to the representation of data, the representation of information about data (‘metadata’) or both (making a ‘data framework’). The agreement may loosely specify the style of presentation of information or may specify details of presentation with great precision, or anything in between. The effort one needs to make in adapting an application to an agreed format depends to a large part on the level of agreement.

If the agreed format is sufficiently detailed, one can comply strictly with the agreed format as a standard. If one’s goals are the widest possible interchange of data and the longest possible survival time of data in archives, it is important to achieve strict adherence to the agreed format as a standard. If one’s goals are shorter-term and do not raise issues of interchange with independent groups, use of an agreed format as a checklist or style guide may help to avoid redundant effort. Even when one has longer-term goals, system requirements may not mesh with the agreed format and the development of new formats may be needed. CIF, as an agreed format, involves specification of metadata for a specific data format and of ontologies of domain-specific definitions that could be used not only in CIF but also in other formats. The use of an agreed format as a base can help to avoid redundant effort in this case as well.

Within the domain of small-molecule crystallography, CIF has achieved its powerful impact on crystallographic publication and archiving by being used as a strict standard both for the data format and for definitions of terms. Within other domains or for certain applications other approaches may be appropriate. For example, it has proven productive to make use of CIF data names within XML (Bray et al., 1998) formatted documents (Bernstein & Bernstein, 2002).
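The order independence discussed in this introduction can be illustrated with a small Python sketch. It is a deliberately reduced reader and rests on several assumptions: it handles only whitespace-separated, unquoted values, it ignores multi-line text fields, save frames and multiple data blocks, and the function name `read_simple_cif` is hypothetical. Tags are folded to lower case because CIF tags are case-insensitive.

```python
def read_simple_cif(text):
    """Read unquoted tag-value pairs and loop_ tables from a tiny CIF subset.

    Returns (items, loops): items maps each tag to its value, and loops is a
    list of tables, each table being a list of {tag: value} row dictionaries.
    """
    tokens = [tok
              for line in text.splitlines()
              if not line.lstrip().startswith("#")   # skip comment lines
              for tok in line.split()]

    def is_boundary(tok):
        low = tok.lower()
        return tok.startswith("_") or low == "loop_" or low.startswith("data_")

    items, loops = {}, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower().startswith("data_"):
            i += 1                                   # data block header
        elif tok.lower() == "loop_":
            i += 1
            tags = []
            while i < len(tokens) and tokens[i].startswith("_"):
                tags.append(tokens[i].lower())
                i += 1
            values = []
            while i < len(tokens) and not is_boundary(tokens[i]):
                values.append(tokens[i])
                i += 1
            if not tags or len(values) % len(tags):
                raise ValueError("loop_ values do not fill whole rows")
            # Each row is keyed by tag, so the column order no longer matters.
            loops.append([dict(zip(tags, values[k:k + len(tags)]))
                          for k in range(0, len(values), len(tags))])
        elif tok.startswith("_"):
            if i + 1 >= len(tokens):
                raise ValueError(f"tag {tok!r} has no value")
            items[tok.lower()] = tokens[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected token {tok!r}")
    return items, loops
```

Because everything is stored by tag name, two files that present the same items and the same loop columns in different orders produce identical results, and the application can then fetch values in whatever order its own processing requires.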

Affiliation: Herbert J. Bernstein, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA.

Copyright © 2006 International Union of Crystallography

5.1.2. Background

There have been many efforts at creating agreed formats for data to be used in crystallography (see Chapter 1.1). We need to consider how software has been created to make use of such formats, especially software to make use of CIF. Agreement on formats evolved from the earliest efforts at collaboration among research groups. Within crystallography, recognition of the need to use data formats as standards and to adapt applications to agreed formats, rather than to adapt formats to the caprices of particular applications or diffractometers or graphics engines, began in the late 1960s and early 1970s with the establishment of computerized data resources for the chemical and crystallographic community and the increasing availability of computer networks (Lykos, 1975).

We will discuss three early data-resource efforts: the Cambridge Crystallographic Data Centre Structural Database File (CSD) (Allen et al., 1973), the Brookhaven National Laboratory Protein Data Bank (PDB) (Bernstein et al., 1977) and the NIH/EPA Chemical Information System (CIS) (Heller et al., 1977). The differences and similarities among application development efforts related to these resources illustrate some of the issues that now face software developers working with CIF: conformance to agreed formats versus deviations from standards to improve performance, as well as cross-platform portability.

The Cambridge Crystallographic Data Centre was established in 1965 ‘to compile a database containing comprehensive information on small-molecule crystal structures, i.e. organics and metallo-organic compounds containing up to 500 non-H atoms, the structures of which had been determined by X-ray or neutron diffraction’ (Allen, 2002). The Protein Data Bank was established at Brookhaven National Laboratory in 1971 as an archive of macromolecular structural information. The NIH/EPA Chemical Information System was established in 1975 as a confederation of databases including mass spectroscopy, NMR and the data from the CSD.

The three resources, CSD, PDB and CIS, took different approaches to applications development. The CSD was an integrated software system centred on a database. Both the software and the database were distributed on magnetic tape for users to use on their local computers. The developers of the software had to be concerned with portability of the software across the multiple computer systems used by crystallographers, but retained control of the design of the retrieval software and a core suite of applications. The PDB was an archive, rather than a database. Some software and the data were distributed on magnetic tape, but the application development model was what would now be called ‘open’, with users and software developers taking the data and the PDB format specification and creating software that would do useful things with PDB entries. The CIS was a remotely accessed confederation of databases on a central computer. The developers of software for the CIS did not have to be concerned with cross-platform portability, or with changes in syntax or semantics of data files impacting on external software developers.

Developers of software for the CSD and the PDB had to be concerned with strict compliance with the rules for the respective data formats, albeit on somewhat different timescales. Developers of software for the centralized CIS database could negotiate for immediate changes in the data format to improve performance of the relevant application.

The CSD had agreed internal formats (Cambridge Structural Database, 1978). However, as noted in Chapter 1.1, there were many different formats in use for small-molecule crystallography and related fields. One may conjecture that one of many causes for such divergence was the CCDC practice of acquiring much of its data from journals, after differences among data formats had been masked by the publication process. The transition from this Tower of Babel to CIF is described in Chapter 1.1, and that history will not be repeated here, but it is important to note that an application writer working in the domain of small-molecule crystallography still has to be aware of a wide variety of formats in addition to CIF.

In the beginning, the PDB went through a relatively rapid format change and then achieved a stable format for more than two decades. The PDB differed from the CSD in depending on user deposition of data prior to publication. The better a user conformed to PDB data-format conventions, the more efficiently could the data move from deposition to release. The initial standard PDB format (PDB, 1974) was derived from the format used in a popular refinement program of the day (Diamond, 1971) and used 132-character records identified by the character strings in the first six columns. Starting in 1976, the PDB spent more than a year (PDB, 1976a,b, 1977) converting to an 80-column format, extensions of which are still in use to this day. Many external programs were developed using this 80-column format and it has become a major de facto standard for macromolecular software applications.
Most application packages producing crystallographic macromolecular structures made a gradual transition from having output options for producing ‘Diamond format’ to having output options for producing PDB format. Macromolecular applications working with other disciplines shared the small-molecule applications’ penchant for multiple formats.

The CIS, working in a completely closed, central service environment, had little direct impact on the formats to be used for applications. The CIS would acquire data from existing archives and databases and meld them into its master database. It would deliver its data as text on a CRT. Much of the impact of CIS data formats was to be restricted to its own internal application development.

Most of the formats resulting from these early efforts were fixed-field, fixed-order formats. The result was that adapting an application to a data format was simple if the processing flow of the application conformed to the fixed order of the data format. Frequently, the data flow did conform. When the processing flow did not conform, it was necessary to create internal data structures or temporary files to allow the unfortunately timed arrival of data to be time-shifted until it was needed. In general, the heaviest burden was imposed on applications that needed to write data conforming to one of the agreed formats.

As the complexity of such time-shifting processes increased, it became clear that the cleanest solution was to base an application on an internal database and to populate the database as the data were processed. When data were to be written by an application, the data could be extracted from the database in whatever order was required. In the 1970s and early 1980s, such a procedure was a serious burden to place on an application. With limited memory and processor speeds, there was a strong argument for adapting agreed formats to the ‘natural’ processing flow, reducing or avoiding the need for an internal database.
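The time-shifting idea described above, in which records are held in an internal store as they arrive and written out later in the order a fixed-order format demands, can be sketched as follows. This is only an illustration under stated assumptions: the records are taken to be identified by the character string in their first six columns, as in the fixed-field formats mentioned earlier, and the function names `buffer_records` and `write_in_order` and the sample record names are hypothetical.

```python
from collections import defaultdict


def buffer_records(lines):
    """Store each record under the identifier found in its first six columns."""
    store = defaultdict(list)
    for line in lines:
        record_name = line[:6].strip()
        store[record_name].append(line.rstrip("\n"))
    return store


def write_in_order(store, order):
    """Emit the buffered records in the order required by the output format."""
    output = []
    for record_name in order:
        output.extend(store.get(record_name, []))
    return output
```

For example, coordinate records that happen to be produced before the header can be buffered and then written after it with `write_in_order(store, ["HEADER", "ATOM"])`. The in-memory dictionary plays the role of the internal database; for large data sets a temporary file or a real database would take its place.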
As the speed and size of computers have changed and as programming language and operatingsystem support for dynamic allocation of resources has improved, the need to have agreed formats driven by applications has become less pressing. We need to understand three major thrusts in data representation: the development of markup languages, of data-representation frameworks and of database application support. Modern applications can benefit from all three. 5.1.2.1. Markup languages A markup language allows the raw text of a document to be annotated with interleaved ‘markup’ specifying layout information for the bracketed text. For document processing, the implicit assumption of the use of an internal database became formalized with the gradual adoption of agreed markup languages in the late 1980s and early 1990s [e.g. TEX (Knuth, 1986), SGML (ISO, 1986), RTF (Andrews, 1987), HTML (Berners-Lee, 1989)]. When used in this manner, such a language has the implicit ordering assumption of reading forward in the document. However, with modern demands for multidimensional layout and document reflow, applications managing such documents achieve the best performance and flexibility when they store the entire marked-up document in an internal data structure that allows random access to all the information. 5.1.2.2. Data-representation frameworks A data-representation framework provides the concepts for managing data and data about the management of data (‘metadata’). Such frameworks may be based on programming languages or markup languages or built from scratch. They provide a mechanism for representing data (e.g. as data sets, graphs or trees)

and a mechanism for representing metadata (e.g. as dictionaries or schemas). Four are of particular importance in crystallography: CIF, ASN.1, HDF and XML.

As noted in Chapter 1.1, CIF was created to rationalize the publication process for small molecules. It combines a very simple tag–value data representation with a dictionary definition language (DDL) and well populated dictionaries. CIF is table-oriented, naturally row-based, has case-insensitive tags and allows two levels of nesting. CIF is order-independent and uses its dictionaries both to define the meanings of its tags and to parameterize its tags. It is interesting to note that, even though CIF is defined as order-independent, it effectively fills the role of an order-dependent markup language in the publication process. We will discuss this issue later in this chapter.

Abstract Syntax Notation One (ASN.1) (Dubuisson, 2000; ISO, 2002) was developed to provide a data framework for data communications, where great precision in the bit-by-bit layout of data to be seen by very different systems is needed. Although targeted for communications software, ASN.1 is suitable for any application requiring precise control of data structures and, as such, primarily supports the metadata of an application, rather than the data. ASN.1 can be compiled directly to C code. The resulting C code then supports the data of the application. ASN.1 notation found application in NCBI’s macromolecular modelling database (Ohkawa et al., 1995). ASN.1 has case-sensitive tags and allows case-insensitive variants. It manages order-dependent data structures in a mixed order-dependent/order-independent environment.
HDF (NCSA, 1993) is ‘a machine-independent, self-describing, extendible file format for sharing scientific data in a heterogeneous computing environment, accompanied by a convenient, standardized, public domain I/O library and a comprehensive collection of high quality data manipulation and analysis interfaces and tools’ (http://ssdoo.gsfc.nasa.gov/nost/formats/hdf.html). HDF was adopted by the Neutron and X-ray Data Format (NeXus) effort (Klosowski et al., 1997). HDF allows the building of a complete data framework, representing both data and metadata. Two parallel threads of software development, focused on the management and exchange of raw data from area detectors, began in the mid-1990s: the Crystallographic Binary File (CBF) (Hammersley, 1997) and NeXus. The volumes of data involved were daunting and efficiency of storage was important. Therefore both proposed formats assumed a binary format. CBF was based on a combination of CIF-like ASCII headers with compressed binary images. NeXus was based on HDF. The first API for CBF was produced by Paul Ellis in 1998. CBF rapidly evolved into CBF/imgCIF with a complete DDL2 dictionary and a fully CIF-compliant API (Chapter 5.6). As of mid-2004, NeXus was still evolving (see http://www.nexus.anl.gov/). XML is a simplified form of SGML, drawing on years of development of tools for SGML and HTML. XML is tree-oriented with case-sensitive entity names. It allows unlimited nesting and is order-dependent. Metadata are managed as a ‘document type definition’ (DTD), which provides minimal syntactic information, or as schemas, which allow for more detail and are more consistent with database conventions. In fields close to crystallography, the first effort at adopting XML was the chemical markup language (CML) (Murray-Rust & Rzepa, 1999). CML is intentionally imprecise in its ontology to allow for flexibility in development. 
The CSD and PDB have released their own XML representations (http://www.ccdc.cam.ac.uk/support/prods_doc/relibase/Sysdocn1119.html; http://www.rcsb.org/pdb/newsletter/2003q2/xml_beta.html).

CIF input data → cif2pdb → PDB-aware application → pdb2cif → CIF output data

Fig. 5.1.3.1. Example of using filters to make a PDB-aware application CIF-aware.

It may seem from this discussion that the application designer faces an unmanageable variety of data frameworks in an unstable, evolving environment. To some extent this is true. Fortunately, however, there are signs of convergence on CIF dictionary-based ontologies and the use of transliterated CIFs. This means that an application adapted to CIF should be relatively easy to adapt to other data frameworks.

5.1.3. Strategies in designing a CIF-aware application

There are multiple strategies to consider when designing a CIF-aware application. One can use external filters. One can use existing CIF-aware libraries. One can write CIF-aware code from scratch.

5.1.3.1. Working with filter utilities

One solution to making an existing application aware of a new data format is to leave the application unchanged and change the data instead. For almost all crystallographic formats other than CIF, the Swiss-army knife of conversion utilities is Babel (Walters & Stahl, 1994). Babel includes conversions to and from PDB format. Therefore, by the use of cif2pdb (Bernstein & Bernstein, 1996) and pdb2cif (Bernstein et al., 1998) combined with Babel, many macromolecular applications can be made CIF-aware without changing their code (see Figs. 5.1.3.1 and 5.1.3.2). If the need is to extract mmCIF data from the output of a major application, the PDB provides PDB_EXTRACT (http://sw-tools.pdb.org/apps/PDB_EXTRACT/).

Creating a filter program to go from almost any small-molecule format to core CIF is easy. In many cases one need only insert the appropriate ‘loop_’ headers. Creating a filter to go from CIF to a particular small-molecule format can be more challenging, because a CIF may have its data in any order. This can be resolved by use of QUASAR (Hall & Sievers, 1993) or cif2cif (Bernstein, 1997), which accept request lists specifying the order in which data are to be presented (see Fig. 5.1.3.3).

There is a significant and growing number of filter programs available.
Several of them [QUASAR, cif2cif, ciftex (ftp://ftp.iucr.org/pub/ciftex.tar.Z) (to convert from CIF to TeX) and ZINC (Stampf, 1994) (to unroll CIFs for use by Unix utilities)] are discussed in Chapter 5.3. In addition there are CIF2SX by Louis J. Farrugia (http://www.chem.gla.ac.uk/~louis/software/utils/), to convert from CIF to SHELXL format, and DIFRAC (Flack et al., 1992) to translate many diffractometer output formats to CIF. The program cif2xml (Bernstein & Bernstein, 2002) translates from CIF to XML and CML. The PDB


[Diagram: CIF input data and CIF output data are connected to a CIF-aware API, which is linked to an internal data structure and to the application.]

Fig. 5.1.3.4. Typical dataflow of a C-based CIF API.

of the Unix application programming interface and many languages have viable interfaces to C and/or C++. Therefore it is often feasible to consider use of C, C++ or Objective-C libraries, even for Fortran applications.

Star_Base (Spadaccini & Hall, 1994; Chapter 5.2) is a program for extracting data from STAR Files. It is written in ANSI C and includes the code needed to parse a STAR File. OOSTAR (Chang & Bourne, 1998; Chapter 5.2) is an Objective-C package that includes another parser for STAR Files (http://www.sdsc.edu/pb/cif/OOSTAR.html). CIFLIB (Westbrook et al., 1997) provides a CIF-specific API. CIFPARSE (Tosic & Westbrook, 1998) is another C-based library for CIF. CBFlib (Chapter 5.6) is an ANSI C API for both CIF and CBF/imgCIF files. The CifSieve package (Hester & Okamura, 1998) provides specialized code generation for retrieval of particular data items in either C or Fortran (see Chapter 5.3 for more details). The package cciflib (Keller, 1996) (http://www.ccp4.ac.uk/dist/html/mmcifformat.html) is used by the CCP4 program suite to support mmCIF in both C and Fortran applications. If an application in Fortran is to be converted with a purely Fortran-based library, the package CIFtbx (Hall, 1993; Hall & Bernstein, 1996) is a solution. See Chapter 5.4 for more details.

The common interface provided in C-based applications is for the library to buffer the entire CIF file into an internal data structure (usually a tree), essentially creating a memory-resident database (see Fig. 5.1.3.4). This preload greatly reduces any demands on the application to deal with the order-independence of CIF, at the expense of what can be a very high demand for memory. The problem of excessive memory demand is dealt with in CBFlib by keeping large text fields on disk, with only pointers to them in memory. In some libraries, validation of tags against dictionaries is handled by the API. In others it is the responsibility of the application programmer.
While the former approach helps to catch errors early, the latter, ‘lightweight’ approach is more popular when fast performance is required.

The most commonly used versions of Fortran do not include dynamic memory management. In order to preload an arbitrary CIF, one needs to use one of the C-based libraries. Alternatively, a pure Fortran application can transfer CIFs being read to a disk-based random access file. CIFtbx does this each time it opens a CIF. The user never works directly with the original CIF data set. This provides a clean and simple interface for reading, but slows all read access to CIFs. In Fortran, compromises are often necessary, with critical tables handled in memory rather than on disk, but this may force changes in dimensions and then recompilation when dictionaries or data sets become larger than anticipated.
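The preloading strategy described above can be illustrated with a short Python sketch. It is not the interface of any library named in this chapter: the function names `tokenize` and `load_cif` are hypothetical, and the sketch assumes a much simplified CIF syntax (no semicolon-delimited text fields, no save frames, simplified quoting, no dictionary validation). It shows only how buffering a whole CIF into a memory-resident structure removes any dependence on the order in which items were written.

```python
import re

# One token per match: a quoted string, a comment, or a run of non-blank characters.
_TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(#[^\n]*)|(\S+)""")


def tokenize(text):
    """Yield (quoted, text) pairs, skipping comments."""
    for match in _TOKEN.finditer(text):
        single, double, comment, bare = match.groups()
        if comment is not None:
            continue
        if bare is not None:
            yield (False, bare)
        else:
            yield (True, single if single is not None else double)


def _is_tag(token):
    quoted, text = token
    return not quoted and text.startswith("_")


def _is_value(token):
    quoted, text = token
    if quoted:
        return True
    low = text.lower()
    return not (text.startswith("_") or low.startswith("data_") or low == "loop_")


def load_cif(text):
    """Return {block name: {lower-case tag: value, or list of values for a loop}}."""
    tokens = list(tokenize(text))
    blocks, current, i = {}, None, 0
    while i < len(tokens):
        quoted, tok = tokens[i]
        low = tok.lower()
        if not quoted and low.startswith("data_"):
            current = blocks.setdefault(tok[5:], {})
            i += 1
            continue
        if current is None:
            raise ValueError("data item found before any data_ header")
        if not quoted and low == "loop_":
            i += 1
            tags = []
            while i < len(tokens) and _is_tag(tokens[i]):
                tags.append(tokens[i][1].lower())
                i += 1
            values = []
            while i < len(tokens) and _is_value(tokens[i]):
                values.append(tokens[i][1])
                i += 1
            if not tags or len(values) % len(tags):
                raise ValueError("loop values do not fill whole rows")
            for k, tag in enumerate(tags):
                current[tag] = values[k::len(tags)]  # one column per tag
        elif _is_tag(tokens[i]) and i + 1 < len(tokens) and _is_value(tokens[i + 1]):
            current[low] = tokens[i + 1][1]
            i += 2
        else:
            raise ValueError(f"unexpected token {tok!r}")
    return blocks
```

Two files that present the same items in a different order, or with tags in a different case, load into equal structures, so the application can request items in whatever order suits its own processing flow. The price, as noted above, is that the whole file is held in memory.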

CIF input data → cif2pdb → Babel → General application → Babel → pdb2cif → CIF output data

Fig. 5.1.3.2. Example of using filters to make a general application CIF-aware.

provides CIFTr by Zukang Feng and John Westbrook (http://sw-tools.pdb.org/apps/CIFTr/) to translate from the extended mmCIF format described in Appendix 3.6.2 to PDB format and MAXIT (http://sw-tools.pdb.org/apps/MAXIT/), a more general package that includes conversion capabilities. See also Chapter 5.5 for an extended discussion of the handling of mmCIF in the PDB software environment.

5.1.3.2. Using existing CIF libraries and APIs

Another approach to making an existing application CIF-aware or to designing a new CIF-aware application is to make use of one (or more) of the existing CIF libraries and application programming interfaces (APIs). Because the data involved need not be reprocessed, code that uses a library directly is often faster than equivalent code working with filter programs. The code within an application can be tuned to the internal data structures and coding conventions of the application. The approach to internal design depends on the language, data structures and operating environment of the application. A few years ago, the precise details of language version and operating system would have been major stumbling blocks to conversion. Today, however, almost every platform supports a variation

[Diagram: unordered CIF input data and a request list listing tags in order → QUASAR or cif2cif → ordered CIF input data → order-sensitive application or filter]

5.1.3.3. Creating a CIF-aware application from scratch

The primary disadvantage of using an existing CIF library or API in building an application is that there can be a loss of performance or a demand for more resources than may be needed. The common practice followed by most libraries of building and

Fig. 5.1.3.3. Using QUASAR or cif2cif to reorder CIF data for an order-dependent application or filter.
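The request-list idea behind Fig. 5.1.3.3 can be sketched in a few lines of Python. This is an illustration only and does not reproduce the behaviour or options of QUASAR or cif2cif: the function name `reorder` is hypothetical, only single-valued items are handled, and the choice to emit the CIF ‘unknown’ placeholder `?` for a requested tag that is absent is an assumption of the sketch.

```python
def reorder(items, request_list):
    """Return CIF tag-value lines in the order given by request_list.

    items        -- dict mapping tags to single string values, in any order
    request_list -- tags in the order required by the downstream application
    """
    # CIF tags are case-insensitive, so look them up in lower case.
    lookup = {tag.lower(): value for tag, value in items.items()}
    lines = []
    for tag in request_list:
        value = lookup.get(tag.lower(), "?")  # assumed: '?' marks a missing item
        if value == "" or any(ch.isspace() for ch in value):
            value = "'" + value + "'"  # quote values containing white space
        lines.append(f"{tag} {value}")
    return lines
```

An order-sensitive application can then read the returned lines strictly in sequence, whatever order the original CIF used.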


cbf:            datablock
                  { cbf_failnez (cbf_find_parent (&($$), $1, CBF_ROOT)) }
                ;

cbfstart:         { $$ = ((void **) context) [1]; }
                ;

datablockstart: cbfstart
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_DATABLOCK, NULL)) }
                | cbf datablockname
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_DATABLOCK, $2)) }
                ;

datablock:      datablockstart
                  { $$ = $1; }
                | assignment
                  { cbf_failnez (cbf_find_parent (&($$), $1, CBF_DATABLOCK)) }
                | loopassignment
                  { cbf_failnez (cbf_find_parent (&($$), $1, CBF_DATABLOCK)) }
                ;

category:       datablock categoryname
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_CATEGORY, $2)) }
                ;

column:         category columnname
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_COLUMN, $2)) }
                | datablock itemname
                  { cbf_failnez (cbf_make_new_child (&($$), $1, CBF_CATEGORY, NULL))
                    cbf_failnez (cbf_make_child (&($$), $$, CBF_COLUMN, $2)) }
                ;

assignment:     column value
                  { $$ = $1;
                    cbf_failnez (cbf_set_columnrow ($$, 0, $2, 1)) }
                ;

loopstart:      datablock loop
                  { cbf_failnez (cbf_make_node (&($$), CBF_LINK, NULL, NULL))
                    cbf_failnez (cbf_set_link ($$, $1)) }
                ;

loopcategory:   loopstart categoryname
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_CATEGORY, $2))
                    cbf_failnez (cbf_set_link ($1, $$))
                    $$ = $1; }
                | loopcolumn categoryname
                  { cbf_failnez (cbf_find_parent (&($$), $1, CBF_DATABLOCK))
                    cbf_failnez (cbf_make_child (&($$), $$, CBF_CATEGORY, $2))
                    cbf_failnez (cbf_set_link ($1, $$))
                    $$ = $1; }
                ;

loopcolumn:     loopstart itemname
                  { cbf_failnez (cbf_make_new_child (&($$), $1, CBF_CATEGORY, NULL))
                    cbf_failnez (cbf_make_child (&($$), $$, CBF_COLUMN, $2))
                    cbf_failnez (cbf_set_link ($1, $$))
                    cbf_failnez (cbf_add_link ($1, $$))
                    $$ = $1; }
                | loopcolumn itemname
                  { cbf_failnez (cbf_find_parent (&($$), $1, CBF_DATABLOCK))
                    cbf_failnez (cbf_make_child (&($$), $$, CBF_CATEGORY, NULL))
                    cbf_failnez (cbf_make_child (&($$), $$, CBF_COLUMN, $2))
                    cbf_failnez (cbf_set_link ($1, $$))
                    cbf_failnez (cbf_add_link ($1, $$))
                    $$ = $1; }
                | loopcategory columnname
                  { cbf_failnez (cbf_make_child (&($$), $1, CBF_COLUMN, $2))
                    cbf_failnez (cbf_set_link ($1, $$))
                    cbf_failnez (cbf_add_link ($1, $$))
                    $$ = $1; }
                ;

loopassignment: loopcolumn value
                  { $$ = $1;
                    cbf_failnez (cbf_shift_link ($$))
                    cbf_failnez (cbf_add_columnrow ($$, $2)) }
                | loopassignment value
                  { $$ = $1;
                    cbf_failnez (cbf_shift_link ($$))
                    cbf_failnez (cbf_add_columnrow ($$, $2)) }
                ;

loop:           LOOP
                ;

datablockname:  DATA      { $$ = $1; }
                ;

categoryname:   CATEGORY  { $$ = $1; }
                ;

columnname:     COLUMN    { $$ = $1; }
                ;

itemname:       ITEM      { $$ = $1; }
                ;

value:          STRING    { $$ = $1; }
                | WORD    { $$ = $1; }
                | BINARY  { $$ = $1; }
                ;

Fig. 5.1.3.5. Example of bison data defining a CIF parser (taken from CBFlib).

preloading an internal data structure that holds the entire CIF may not be the optimal choice for a given application. When reading a CIF it is difficult to avoid the need for extra data structures to resolve the issue of CIF order independence. However, when writing data to a CIF, it may be sufficient simply to write the necessary tags and values from the internal data structures of an application, rather than buffering them through a special CIF data structure. It is tempting to apply the same reasoning to the reading of CIF and create a fixed ordering in which data are to be processed, so that no intermediate data structure will be needed to buffer a CIF.


Unless the application designer can be certain that externally produced CIFs will never be presented to the application, or will be filtered through a reordering filter such as QUASAR or cif2cif, working with CIFs in an order-dependent mode is a mistake. Because of the importance of being able to accept CIFs written by any other application, which may have written its data in a totally different order from that expected, it is a good idea to make use of one of the existing libraries or APIs if possible, unless there is some pressing need to do things differently.

If a fresh design is needed, e.g. to achieve maximal performance in a time-critical application, it will be necessary to create a CIF parser to translate CIF documents into information in the internal data structures of the application. In doing this, the syntax specification of the CIF language given in Chapter 2.2 should be adhered to precisely. This result is most easily achieved if the code that does the parsing is generated as automatically as possible from the grammar of the language.

Current ‘industrial’ practice in creating parsers is based on use of commonly available tools for lexical scanning of tokens and parsing of grammars based on lex (Lesk & Schmidt, 1975) and yacc (Johnson, 1975). Two accessible descendants of these programs are flex (by V. Paxson et al.) and bison (by R. Corbett et al.). See Fig. 5.1.3.5 for an example of bison data in building a CIF parser. Both flex and bison are available from the GNU project at http://www.gnu.org. Neither flex nor bison is used directly by the final application. Each may be used to create code that becomes part of the application. For example, both are used by CifSieve to generate the code it produces. There is an important division of labour between flex and bison; flex is used to produce a lexical scanner, i.e. code that converts a string of characters into a sequence of ‘tokens’.
In CIF, the important tokens are such things as tags and values and reserved words such as loop_. Once tokens have been identified, responsibility passes to the code generated by bison to interpret them. In practice, because of the complexities of context-sensitive management of white space to separate tokens and the small number of distinct token types, flex is not always used to generate the lexical scanner for a CIF parser. Instead, a hand-coded lexer might be used.

The parser generated by bison uses a token-based grammar and actions to be performed as tokens are recognized. There are two major alternatives to consider in the design: event-driven interaction with the application or building of a complete data structure to hold a representation of the CIF before interaction with the application. The advantage of the event-driven approach is that a full extra data structure does not have to be populated in order to access a few data items. The advantage of building a complete representation of the CIF is that the application does not have to be prepared for tags to appear in an arbitrary order.
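The event-driven alternative, combined with a hand-coded lexer, can be sketched as follows. The Python fragment is an illustration under stated assumptions and is not generated by flex or bison: the names `lex` and `parse_events` and the handler signature are hypothetical, and the token rules are simplified (no semicolon-delimited text fields, no save frames, simplified quoting).

```python
import re

_TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(#[^\n]*)|(\S+)""")


def lex(text):
    """Hand-coded lexer: yield (kind, text), kind being DATA, LOOP, TAG or VALUE."""
    for match in _TOKEN.finditer(text):
        single, double, comment, bare = match.groups()
        if comment is not None:
            continue
        if bare is None:
            yield ("VALUE", single if single is not None else double)
        elif bare.lower().startswith("data_"):
            yield ("DATA", bare[5:])
        elif bare.lower() == "loop_":
            yield ("LOOP", bare)
        elif bare.startswith("_"):
            yield ("TAG", bare.lower())
        else:
            yield ("VALUE", bare)


def parse_events(text, handler):
    """Call handler(block, tag, value) once per value, in file order."""
    block = None
    loop_tags = []     # tags of the loop currently being read
    in_header = False  # True while loop tags are still being collected
    column = 0         # index of the loop tag that the next value belongs to
    pending = None     # tag of a single item still waiting for its value
    for kind, tok in lex(text):
        if kind == "DATA":
            block, loop_tags, in_header, pending = tok, [], False, None
        elif kind == "LOOP":
            loop_tags, in_header, column, pending = [], True, 0, None
        elif kind == "TAG":
            if pending is not None:
                raise ValueError(f"no value given for {pending}")
            if in_header:
                loop_tags.append(tok)
            else:
                loop_tags, pending = [], tok
        elif pending is not None:
            handler(block, pending, tok)
            pending = None
        elif loop_tags:
            in_header = False
            handler(block, loop_tags[column], tok)
            column = (column + 1) % len(loop_tags)
        else:
            raise ValueError(f"value {tok!r} has no tag")


# Usage: pick out two items without building a full representation of the file.
wanted = {"_cell_length_a", "_atom_site_label"}
found = {}


def collect(block, tag, value):
    if tag in wanted:
        found.setdefault(tag, []).append(value)


parse_events(
    "data_demo\n_cell_length_a 5.4\n_cell_length_b 6.1\n"
    "loop_\n_atom_site_label\n_atom_site_occupancy\nC1 1.0\nO1 0.5\n",
    collect,
)
```

Only the requested items are stored, which is the saving the event-driven design offers. The handler must, however, accept the items in whatever order the file presents them.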

We are grateful to Frances C. Bernstein for her helpful comments and suggestions.

References

Allen, F. H. (2002). The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380–388.
Allen, F. H., Kennard, O., Motherwell, W. D. S., Town, W. G. & Watson, D. G. (1973). Cambridge Crystallographic Data Centre. II. Structural Data File. J. Chem. Doc. 13, 119–123.
Andrews, N. (1987). Rich Text Format standard makes transferring text easier. Microsoft Syst. J. 2, 63–67.
Berners-Lee, T. (1989). Information management: a proposal. Internal Report. Geneva: CERN. http://www.w3.org/History/1989/proposal-msw.html.
Bernstein, F. C. & Bernstein, H. J. (1996). Translating mmCIF data into PDB entries. Acta Cryst. A52 (Suppl.), C-576.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.
Bernstein, H. J. (1997). cif2cif – CIF copy program. Bernstein + Sons, Bellport, NY, USA. Included in http://www.bernstein-plus-sons.com/software/ciftbx.
Bernstein, H. J. & Bernstein, F. C. (2002). YAXDF and the interaction between CIF and XML. Acta Cryst. A58 (Suppl.), C257.
Bernstein, H. J., Bernstein, F. C. & Bourne, P. E. (1998). CIF applications. VIII. pdb2cif: translating PDB entries into mmCIF format. J. Appl. Cryst. 31, 282–295. Software available from http://www.bernstein-plus-sons.com/software/pdb2cif.
Bray, T., Paoli, J. & Sperberg-McQueen, C. (1998). Extensible Markup Language (XML). W3C recommendation 10-February-1998. http://www.w3.org/TR/1998/REC-xml-19980210.
Cambridge Structural Database (1978). Cambridge Crystallographic Database User Manual. Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, England.
Chang, W. & Bourne, P. E. (1998). CIF applications. IX. A new approach for representing and manipulating STAR files. J. Appl. Cryst. 31, 505–509.
Diamond, R. (1971). A real-space refinement procedure for proteins. Acta Cryst. A27, 436–452.
Dubuisson, O. (2000). ASN.1 – communication between heterogeneous systems. San Francisco, CA: Morgan Kaufmann. (Translated from the French by P. Fouquart.)
Flack, H. D., Blanc, E. & Schwarzenbach, D. (1992). DIFRAC, single-crystal diffractometer output-conversion software. J. Appl. Cryst. 25, 455–459.
Hall, S. R. (1993). CIF applications. IV. CIFtbx: a tool box for manipulating CIFs. J. Appl. Cryst. 26, 482–494.
Hall, S. R. & Bernstein, H. J. (1996). CIF applications. V. CIFtbx2: extended tool box for manipulating CIFs. J. Appl. Cryst. 29, 598–603.
Hall, S. R. & Sievers, R. (1993). CIF applications. I. QUASAR: for extracting data from a CIF. J. Appl. Cryst. 26, 469–473.
Hammersley, A. P. (1997). FIT2D: an introduction and overview. ESRF Internal Report ESRF97HA02T. Grenoble: ESRF.
Heller, S. R., Milne, G. W. A. & Feldmann, R. J. (1977). A computer-based chemical information system. Science, 195, 253–259.
Hester, J. R. & Okamura, F. P. (1998). CIF applications. X. Automatic construction of CIF input functions: CifSieve. J. Appl. Cryst. 31, 965–968.
ISO (1986). ISO 8879. Information processing – Text and office systems – Standard Generalized Markup Language (SGML). Geneva: International Organization for Standardization.
ISO (2002). ISO/IEC 8824-1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization.
Johnson, S. C. (1975). YACC: Yet Another Compiler-Compiler. Bell Laboratories Computing Science Technical Report No. 32. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer’s Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.)
Keller, P. A. (1996). A mmCIF toolbox for CCP4 applications. Acta Cryst. A52 (Suppl.), C-576.

5.1.4. Conclusion

Making CIF-aware applications is a demanding, but manageable, task. A software developer has the choice of using external filters, using existing libraries and APIs, or of building CIF infrastructure from scratch. The last choice presents an opportunity to tune the handling of CIFs to the needs of the application, but also presents the risk of creating code that does not conform to CIF specifications. One can never know for certain how a new application may be used in the future. If there is any doubt that an application built from scratch will conform to CIF specifications, prudence dictates that one should use filter programs or well tested libraries and APIs in preference to cutting corners in building an application from scratch.


Klosowski, P., Koennecke, M., Tischler, J. Z. & Osborn, R. (1997). NeXus: a common format for the exchange of neutron and synchrotron data. Physica B Condens. Matter, B241–243, 151–153.
Knuth, D. E. (1986). The TeXbook. Computers and typesetting, Vol. A. Reading, MA: Addison-Wesley.
Lesk, M. E. & Schmidt, E. (1975). Lex – a lexical analyzer generator. Bell Laboratories Computing Science Technical Report No. 39. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer’s Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.)
Lykos, P. (1975). Editor. Computer networking and chemistry. ACS Symposium Series, Vol. 19. Washington DC: American Chemical Society.
Murray-Rust, P. & Rzepa, H. (1999). Chemical markup, XML and the WWW, Part I: Basic principles. J. Chem. Inf. Comput. Sci. 39, 928–942.
NCSA (1993). NCSA HDF: specification and developer’s guide. Version 3.2. University of Illinois at Urbana-Champaign, USA.
Ohkawa, H., Ostell, J. & Bryant, S. (1995). MMDB: an ASN.1 specification for macromolecular structure. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, 16–19 July 1995, pp. 259–267. Menlo Park, CA: American Association for Artificial Intelligence.
PDB (1974). PDB Newsletter 1. Brookhaven National Laboratory, USA.
PDB (1976a). PDB Newsletter 2. Brookhaven National Laboratory, USA.
PDB (1976b). PDB Newsletter 3. Brookhaven National Laboratory, USA.
PDB (1977). PDB Newsletter 4. Brookhaven National Laboratory, USA.
Spadaccini, N. & Hall, S. R. (1994). Star_Base: accessing STAR File data. J. Chem. Inf. Comput. Sci. 34, 509–516.
Stampf, D. R. (1994). ZINC – galvanizing CIF to work with UNIX. Manual. Protein Data Bank, Brookhaven National Laboratory, USA.
Tosic, O. & Westbrook, J. D. (1998). CIFPARSE: A library of access tools for mmCIF. Reference guide. Version 3.1. Nucleic Acid Database Project, Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, USA. http://sw-tools.pdb.org/apps/CIFPARSE/cifparse/cifparse.html.
Walters, P. & Stahl, M. (1994). BABEL reference manual. Version 1.06. Dolata Research Group, Department of Chemistry, University of Arizona, USA.
Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). CIF applications. VI. CIFLIB: an application program interface to CIF dictionaries and data files. J. Appl. Cryst. 30, 79–83.


International Tables for Crystallography (2006). Vol. G, Chapter 5.2, pp. 488–498.

5.2. STAR File utilities

By N. Spadaccini, S. R. Hall and B. McMahon

Affiliations: N. Spadaccini, School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia; Sydney R. Hall, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia; Brian McMahon, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England.

Copyright © 2006 International Union of Crystallography

5.2.1. Introduction

The STAR File, described in Chapter 2.1, has a simple format intended to allow the flexible and extensible representation of data without regard to specific data models. In crystallography and related disciplines, the restricted format chosen for the Crystallographic Information File (CIF, Chapter 2.2) and Crystallographic Binary File (CBF, Chapter 2.3) lends itself to rather flat data models. In particular, the relationships between data items enforced through DDL2 dictionaries in applications such as mmCIF (Chapter 3.6) are essentially equivalent to the data structures and relationships of a relational database. Of course, properly normalized relational tables can represent a hierarchy of structure, although this may not be an efficient representation. There are other applications, such as the molecular information file (MIF, Chapter 2.4), that make use of additional features of the STAR File, such as multiple-level loop structures, global variable scoping and data-instance encapsulation in save frames. These applications may more efficiently represent certain hierarchical or object-oriented data models.

While particular applications require software tools tailored to their specific purposes, it is helpful to have programs or libraries capable of manipulating arbitrary STAR File data, relying solely on the syntax rules and format of the STAR File and taking no account of the semantic content of the included data. In this chapter, the stand-alone program Star_Base is described in detail. This program uses a local query language to demonstrate the ability to retrieve or re-order data with their associated context. There is also a brief review of Star.vim and StarMarkUp, applications for editing and browsing STAR Files. The chapter concludes by reviewing a number of object classes and libraries for a variety of STAR and generalized CIF applications: prototypical approaches OOSTAR and CIF++, CIFOBJ and starlib used by major macromolecular data repositories, and the document-object model package StarDOM.

5.2.2. Data instances and context

In a STAR File, a data item consists of a value, which is a simple ASCII character string, and an associated identifier or data name which precedes the value, and is invariably an ASCII character string beginning with an underscore character and not including any white-space character, such as _date or _chemical_formula_sum. (The detailed and formal syntax rules for STAR Files are given in Chapter 2.1.)

5.2.2.1. Single and multiple values

A data item may have a single value, in which case the data name may immediately precede the data value, separated only by white space, e.g.

    _chapter_title    'STAR File utilities'

Alternatively, a data item may occur multiple times, in a vector or a list. In such a case, the data identifiers appear in a loop header and the values follow in the order of presentation in the loop header. For the simple example of a tabular array, the loop header plays the role of column header, e.g.

    loop_
        _chapter_number
        _chapter_title
        5.2    'STAR File utilities'
        5.3    'Syntactic utilities for CIF'

Here the instances of the data item identified by the data name _chapter_number have two values, 5.2 and 5.3. Likewise the instances of the data item identified by _chapter_title have two values. Note an important point: the example has been chosen to suggest to the reader a tabular relationship between the two data items, and in many STAR File applications such a relationship is intended and perhaps formalized through an external dictionary defining the relationships between these data names. However, the existence of such a relationship is not mandated by the STAR File syntax. It is legitimate for a generic STAR application to extract a single data item from such an aggregated loop without making any supposition about its relationship with other data items in the same loop. (It should be emphasized that in practice such physical juxtaposition of data items will almost invariably represent a real relationship, and that most application-specific programming will depend on this fact; but it is not an essential component of STAR in its most abstract form.) It is also axiomatic that the ordering of the multiple values within a list structure has no intrinsic significance in the STAR paradigm. (Again, specific applications may override this by enforcing an ordering, but this is not fundamental to STAR.)

5.2.2.2. Loop packets and context within lists

Where multiple data names are declared in a loop header, STAR does however enforce the notion of a ‘loop packet’. The loop packet is the data structure including all individual data values at a particular iteration through the loop. Hence, in the simple example above, 5.2 and 'STAR File utilities' comprise the tuple of values in a single loop packet. For the single level of loop considered so far, the loop packet plays the role of a table row. For nested loops, the situation is more complex. Consider Fig. 5.2.2.1, which is an example of quantum chemistry basis sets for hydrogen and lithium. (The examples in this chapter are derived from various test applications, and do not represent specific adopted exchange protocols in the selected subject areas.) For each element, a list of basis sets is presented, each containing a set of parameters and a table of functional values. At the outermost level of looping in this example, a loop packet comprises all the data associated with an individual atom type, for example hydrogen. At the next inner level of looping, a loop packet corresponds to an individual basis set (including its embedded table of

data_Gaussian
loop_
    _basis_set_atomic_name
    _basis_set_atomic_symbol
    _basis_set_atomic_number
    _basis_set_atomic_mass
    loop_
        _basis_set_contraction_scheme
        _basis_set_funct_per_contraction
        _basis_set_primary_reference
        _basis_set_source_exponent
        _basis_set_source_coefficient
        _basis_set_atomic_energy
        loop_
            _basis_set_function_exponent
            _basis_set_function_coefficient

hydrogen H 1 1.0079
# -------
    (2)->[2] 1: PKC1.1.1 R44 . -0.485813
        1.3324838E+01 1.0
        2.0152720E-01 1.0
        stop_
    (2)->[2] 1: PKC1.2.1 R33 . -0.485813
        1.3326990E+01 1.0
        2.0154600E-01 1.0
        stop_
    (2)->[1] 2 PKC1.14.1 R24 R24 -0.485813
        1.3324800E-01 2.7440850E-01
        2.0152870E-01 8.2122540E-01
        stop_
    (3)->[2] 2:1 PKC1.23.1 R75 R75 -0.496979
        4.5018000E+00 1.5628500E-01
        6.8144400E-01 9.0469100E-01
        1.5139800E-01 1.0000000E+01
        stop_
    stop_

lithium Li 3 6.94
# ------
    (4)->[4] 1: PKC3.1.1 R44 . -7.376895
        3.4856175E+01 1.0
        5.1764114E+00 1.0
        1.0514394E+00 1.0
        4.7192775E-02 1.0
        stop_
    (9,4)->[3,2] 7:2:1,3:1 PKC3.9.1 R2 R98 -7.431735
        921.271  0.001367
        138.730  0.010425
        31.9415  0.049859
        9.35329  0.160701
        3.15789  0.344604
        1.15685  0.425197
        0.44462  0.169468
        0.44462  -0.222311
        0.076663 1.116477
        0.028643 1.0
        1.488    0.038770
        0.2667   0.236257
        0.07201  0.830448
        0.02370  1.0
        stop_
    (4,3)->[3,2] 4:2:1,2:1 PKC3.30.1 R77 R77 -7.419509
        1.09353E+02 1.90277E-02
        1.64228E+01 1.30276E-01
        3.59415E+00 4.39082E-01
        9.05297E-01 5.57314E-01
        5.40205E-01 2.85645E-02 5.40205E-01 2.85645E-02 -2.63127E-01 1.02255E-01 1.00000E+00 1.61546E-01 1.02255E-01 1.00000E+00
        stop_
    stop_

data_Gaussian
loop_ loop_ loop_
    _basis_set_function_exponent
    1.3324838E+01 2.0152720E-01 stop_
    1.3326990E+01 2.0154600E-01 stop_
    1.3324800E-01 2.0152870E-01 stop_
    4.5018000E+00 6.8144400E-01 1.5139800E-01 stop_
    stop_
    3.4856175E+01 5.1764114E+00 1.0514394E+00 4.7192775E-02 stop_
    921.271 138.730 31.9415 9.35329 3.15789 1.15685 0.44462 0.44462 0.076663 0.028643 1.488 0.2667 0.07201 0.02370 stop_
    1.09353E+02 1.64228E+01 3.59415E+00 9.05297E-01 5.40205E-01 1.02255E-01 2.85645E-02 5.40205E-01 1.02255E-01 2.85645E-02 stop_
    stop_
stop_ stop_

Fig. 5.2.2.2. Retrieval from the example file in Fig. 5.2.2.1 of the value of _basis_set_function_exponent with associated context.

without any other a priori information regarding the data model. The context is most easily expressed by listing the output values in STAR File format. Fig. 5.2.2.2 is an output listing of the requested values for this example, where the context is expressed as the innermost of three nested loop levels and distinct packets at this level are indicated. It will be seen also that by tracing the disposition of stop_ words the embedding within higher-level loop packets can also be inferred.

5.2.2.3. Context in data sets Another indicator of context in the previous example is the datablock header, which was reproduced in the output of Fig. 5.2.2.2. The STAR File allows data instances in three types of location: in a data block, in a save frame or in a global block. The usual way to partition a STAR File is by data blocks; each such block represents a data set in which a data name (associated with a single or multiple values) may be declared once only. Data blocks may include save frames. A save frame is an encapsulated subsidiary data set, effectively insulated from the contents of the surrounding data block, in which data items may occur that have the same names as items in the parent data block. Indeed, ‘parent’ is potentially a misleading term, since no relationship is implied between the data within a save frame and those in the data block in which the save frame occurs. A reference to a save frame may, however, occur as a data value within the data block where the save frame is specified. Recall from Section 2.1.3.6 that save frames within a data block are uniquely identified by the framecode header. Global blocks may also occur in a STAR File, preceding or interspersed between data blocks. For each data item defined within a global block, that definition is inherited by each succeeding data block that does not contain an internal definition of a data item with the same name. If there is a definition of a data item with the same name within a data block, that internal definition overrides the global definition within that data block. The situation is then re-evaluated in the next data block. If that data block does not contain an internal definition, the global definition holds.

1.14339E+00

9.15663E-01

Fig. 5.2.2.1. Example quantum chemistry basis set functions in STAR File format.

coefficients). At the innermost loop level, a loop packet is simply a row within a table of exponents and coefficients of the basis set function. If one were to treat this example file as a database of indeterminate structure and query the values associated with one of the data names, for example, _basis_set_function_exponent, one would retrieve a series of strings 1.3324838E+01, 2.0152720E-01 etc. However, the value strings in themselves are insufficient to allow the reconstitution of any data structure in the file. One also needs an expression of the levels within the nested loop structure at which the values were located, and an indication that they were associated with different packets of information at those various levels. This additional information about the context of each value is sufficient to determine its position within the data structure
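The inheritance rule for global blocks stated in Section 5.2.2.3 can be illustrated with a short Python sketch. It is only an illustration of value scoping under stated assumptions: the function name resolve_blocks, the (kind, name, items) tuple layout and the extra block data_3 are hypothetical and are not part of STAR or of any STAR software. It is also assumed here that a later global block overrides an earlier one for the data blocks that follow it.

```python
def resolve_blocks(blocks):
    """Return the effective items of each data block.

    blocks: list of (kind, name, items) tuples in file order, where kind is
    'global' or 'data' and items maps data names to value strings.
    (Hypothetical in-memory layout, chosen only for this sketch.)
    """
    inherited = {}   # global definitions in scope so far
    resolved = {}
    for kind, name, items in blocks:
        if kind == "global":
            # Assumption: a later global definition replaces an earlier one.
            inherited.update(items)
        else:
            effective = dict(inherited)   # start from the inherited values
            effective.update(items)       # internal definitions override them
            resolved[name] = effective
    return resolved


star_file = [
    ("global", None, {"_example": "foo"}),
    ("data", "data_1", {}),                    # inherits the global value
    ("data", "data_2", {"_example": "bar"}),   # overrides it locally
    ("data", "data_3", {}),                    # hypothetical: global value holds again
]
resolved = resolve_blocks(star_file)
print(resolved)
```

Note that flattening the file in this way deliberately discards the distinction between inherited and locally declared values. A generic tool that must preserve context would keep the global block itself in its output rather than return interpolated values of this kind.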


5. APPLICATIONS

  data_reaction

  save_methyl
    loop_ _atom_identity_node _atom_identity_symbol
      1 C
    loop_ _attached_hydrogen_node _attached_hydrogen_count
      1 3
  save_

  save_ethyl
    loop_ _atom_identity_node _atom_identity_symbol
      1 C
      2 C
    loop_ _attached_hydrogen_node _attached_hydrogen_count
      1 3
      2 3
  save_

  save_R1
    loop_ _variable_alternative_number _variable_identifier_symbol _variable_node
      1 $methyl 1
      2 $ethyl  1
  save_

  save_carboxylic_acid
    loop_ _atom_identity_node _atom_identity_symbol
      1 $R1
      2 C
      3 O
      4 O
    loop_ _attached_hydrogen_node _attached_hydrogen_count
      2 0
      3 0
      4 1
  save_

  loop_ _reaction_component_number _reaction_component_symbol _reaction_component_type
    1 $carboxylic_acid reactant

Fig. 5.2.2.3. Example STAR data structure where save frames encapsulate related data sets. See text for details.

The scope of data values is well defined (see Section 2.1.3.9). Only data expressed in a global block have values that are inherited in later portions of the STAR File. Data values in data blocks or save frames are restricted in scope to the current data block or save frame, respectively. A consequence of these rules of scope and encapsulation is that a full description of the context of a STAR data value must also reflect any values carried through as global data or by dereferencing associated save frames. The results are not always intuitive.

Consider Fig. 5.2.2.3, which represents a partial description of a chemical reaction where one of the reactants is expressed as a generic structure described by the save frame save_R1. However, the generic structure in this case is restricted to a small number of alkyl groups, each described in its own save frame. Setting aside this prior knowledge, we see that a request for _atom_identity_symbol must return not only the data values in their embedded save frames, but also the save frames in their entirety and the higher-order data values that reference the matching save frames. It is only in this way that we can guarantee that the value can be used by any application.

Fig. 5.2.2.4 demonstrates the full context of the returned requested data values. Notice that the requested item occurs (among other places) in the save frame save_carboxylic_acid and this instance of the item is presented solely in the context of the save header and closure strings (it is shown in italics in Fig. 5.2.2.4). However, one of the values extracted from this location is the save-frame reference pointer $R1 that identifies save_R1, and the complete contents of this save frame are presented (because the data structure represented by the save frame is itself one of the values of the requested data item). Further de-referencing of the save-frame pointers within save_R1 results in the extraction also of the complete save frames save_methyl and save_ethyl. In this example it is coincidental that there are instances of the requested data item (_atom_identity_symbol) within these returned save frames as well.

Notice, however, that establishing the full context of the returned data demands also that data values referencing the save_carboxylic_acid frame be presented. In this example, the value of _reaction_component_symbol at the outermost level of the data block is returned, a result that may at first seem surprising. It is only in this way that one can be sure that an arbitrary application will have access to the full semantic information carried by the data item.

  data_reaction

  save_methyl
    loop_ _atom_identity_node _atom_identity_symbol
      1 C
    loop_ _attached_hydrogen_node _attached_hydrogen_count
      1 3
  save_

  save_ethyl
    loop_ _atom_identity_node _atom_identity_symbol
      1 C
      2 C
    loop_ _attached_hydrogen_node _attached_hydrogen_count
      1 3
      2 3
  save_

  save_R1
    loop_ _variable_alternative_number _variable_identifier_symbol _variable_node
      1 $methyl 1
      2 $ethyl  1
  save_

  save_carboxylic_acid
    loop_ _atom_identity_symbol
      $R1 C O O
  save_

  loop_ _reaction_component_symbol
    $carboxylic_acid

Fig. 5.2.2.4. Context for the requested values of _atom_identity_symbol in the preceding example. See text for details.

Data values declared in global blocks should be presented in the same spirit of supplying the complete context in which the value was instantiated, and not simply the value in isolation. For example, given the trivial STAR File

  global_
    _example   foo
  data_1
  data_2
    _example   bar

a request for _example should return the identical file, not the interpolated result

  data_1
    _example   foo
  data_2
    _example   bar

despite the latter’s equivalence purely in terms of the noncontextual values returned.

5.2.3. Star Base: a general-purpose data extractor for STAR Files

The stand-alone application Star Base (Spadaccini & Hall, 1994) provides a facility for performing database-style queries on arbitrary STAR Files. It is generic in nature and makes no assumptions about the nature or organization of the data in a STAR File. It may indeed be used as an application-specific database tool if the user has prior knowledge of the relationships between included data items. However, by faithfully returning context as well as value in the way outlined in Section 5.2.2, it can be applied to any STAR File even without such prior knowledge.

In the examples so far, one or more data items have been requested by name. Star Base extends the type of requests that can be made through its own query language. This gives it much of the power of a database query language such as SQL. Three types of query are supported, known as data, conditional and branching requests.

A data request is a straightforward generalization of the request by data name. Individual data items may be requested by name, as may individual data blocks or save frames. Wild carding is permitted to generalize the requests. More details are given in Section 5.2.3.2.

A conditional request involves one or more conditions; only data items satisfying the conditions are returned. More details are given in Section 5.2.3.3.

A branching request applies similar conditions to establish the context in which matching data items occur within the file, but may also apply scoping rules to select among the available contexts. Only data items matching both the conditions imposed on their values and the requested scope are returned. It is the existence of such branching conditions that gives Star Base the ability to select data matching the specific requirements of overlying data models. Again, however, it is emphasized that the program itself operates without any semantic awareness of the significance of the data that is implied within the overlaid data model. More details on branching requests are given in Section 5.2.3.4.

5.2.3.1. Program features

Star Base is a fully functional STAR File parser and may be used to test the syntactic validity of an input STAR File. It may be used to write an input STAR File directly to the output stream, while validating the structural integrity as the contents are parsed. The input format and comments are discarded on output. Given a valid input file, Star Base guarantees to write output in fully compliant STAR format.

If a data name is supplied as a request item, Star Base will return the single or multiple values associated with that data name and their associated context according to the principles of Section 5.2.2, i.e. all loop structures, data-block headers and global headers will be returned, and save frames will be expanded as required to accommodate de-referencing of frame codes as returned values.

Where multiple data items are requested, Star Base will write their occurrences to its output stream in the order they were requested, not in the order of appearance in the input file. This may disturb data relationships that are implicit in the ordering or association of values in the input file, but it is the responsibility of the user to track and retain such associations where they are an essential part of an application-level data model. As emphasized before, a generic STAR tool will make no assumptions about data models and will simply return values and contexts as requested. To illustrate the effect of this, consider a request for the following data items from the example file of Fig. 5.2.2.1:

  _basis_set_atomic_name
  _basis_set_atomic_symbol
  _basis_set_contraction_scheme

Star Base will return the following result:

  data_Gaussian
  loop_
    _basis_set_atomic_name
    _basis_set_atomic_symbol
    loop_
      _basis_set_contraction_scheme
    stop_
  hydrogen H
    (2)->[2] (2)->[2] (2)->[1] (3)->[2] stop_
  lithium Li
    (4)->[4] (9,4)->[3,2] (4,3)->[3,2] stop_

However, if the same items are requested in a different order,

  _basis_set_atomic_name
  _basis_set_contraction_scheme
  _basis_set_atomic_symbol

the result is structured differently:

  data_Gaussian
  loop_
    _basis_set_atomic_name
    loop_
      _basis_set_contraction_scheme
    stop_
    _basis_set_atomic_symbol
  hydrogen
    (2)->[2] (2)->[2] (2)->[1] (3)->[2] stop_
    H
  lithium
    (4)->[4] (9,4)->[3,2] (4,3)->[3,2] stop_
    Li

5.2.3.2. The Star Base data request

A data request is the simplest type of query used to extract single items from a file. It may be formed from any of the following string types:
(i) a name string, e.g. _atom_identity_symbol;
(ii) a block string, e.g. data_Gaussian;
(iii) a frame string, e.g. save_methyl.

In accordance with the principles set out earlier in this chapter, data requests satisfy the following rules:
(i) Requested data items are returned with their associated context (i.e. including the headers of any containing data blocks, save frames and loop structures).
(ii) A request for a data block returns all preceding global blocks (since the data block will contain by inheritance all values in the global blocks).
(iii) A request for a save frame also returns the header of the data block encompassing the save frame. All frame-pointer codes are resolved so that if a requested save frame contains pointer codes to other save frames, these are also returned.
(iv) A request for global_ returns all global blocks, together with all data-block headers in their scope.
(v) The request need not be specified explicitly. Two wild-card characters are permitted. An asterisk (*) represents any sequence

of characters and a question mark (?) represents any single character.
(vi) A request for looped data returns the items in the order requested, making any necessary adjustments to the structuring of nested loops to preserve the original context.
(vii) A request for data within a save frame returns those items plus the associated context.
(viii) If a requested data item includes a save-frame pointer as a value, the referenced save frame is returned intact. All other pointers contained within the returned data are resolved.
(ix) A request for a data item in a global data block will also return the data-block headers within the scope of the global block.
(x) The scope of a data request is the entire input file. Control of the search scope is only possible within branching requests.

5.2.3.3. The Star Base conditional request

While a data request allows retrieval of data items according to name, conditional requests allow retrieval of data items by value. The general form of a conditional request may be characterized as <data request> <operator> <text string>, where <data request> is any data request as defined in the preceding section, <operator> is any of the test operators defined below, and <text string> is a string pattern against which values of data items retrieved by the data request are matched according to the operator specified.

Conditional requests may be combined by the set operators &, | and ! to provide logical AND, OR and NOT tests. Table 5.2.3.1 lists the allowed constructions for a conditional request. A bare data request is considered a degenerate case of a conditional request.

The construction & allows for the conjunction of conditionals. All data (including context) are returned from the intersection of the sets of data that individually satisfy the conditions, provided that the intersection is a non-empty set. It is important to note that the conjunction of conditionals based on different data names is the empty set.

The construction | allows for the disjunction of conditionals. All data (including context) are returned from the union of the sets of data that individually satisfy the conditions, provided that the union is a non-empty set.

The construction ! allows for the negation or complement of conditionals. All data (including context) are returned from the universal set of data that do not satisfy the conditions of the conditional request. The universal set is defined as the input file.

Table 5.2.3.2 lists the permitted value-matching operators when a retrieved data value is compared with a target text string in the basic test described above. (If the <text string> contains white-space characters, it must be quoted with matching single or double quotes. The test is performed on the value of the text string, i.e. the complete text string including white-space characters but omitting the surrounding quote characters.)

Two classes of operators are defined. Text operators may be used to test for string equality, substring containment or greater and lesser values (where the ‘greater’ and ‘lesser’ values for text strings are based on the ASCII character set ordering sequence). These tests are valid for any STAR application.

Numerical operators permit comparison of the numerical values implied by the returned data-value strings. Recall from Chapter 2.1 that data values in STAR are specified only as character strings. Casting to different types may be performed by specific applications, but is not defined for arbitrary STAR applications. Nevertheless, Star Base recognizes that a majority of STAR applications will in fact specify numeric types, and therefore allows for numerical comparisons based on interpretations of certain value strings according to the conventions adopted by CIF for the numb data type (Section 2.2.7.4.7.1). Such values may be given as integers, real numbers or in scientific notation.

Table 5.2.3.1. Permitted constructions for a Star Base conditional request

  Set operator   Logical test
  &              AND (conjunction of conditionals)
  |              OR (disjunction of conditionals)
  !              NOT (negation of a conditional)

Table 5.2.3.2. Value-matching operators in Star Base conditional requests

Requests are of the form <data request> <operator> <text string>. The second column describes the relationship that data identified by the <data request> must satisfy against the <text string> in order to be returned as part of the result set.

  Operator   Relationship

  Text comparison operators:
  ~=         Is identically equal to
  ?=         Includes as a substring
  ~<         Is less than (in ASCII order)
  ~>         Is greater than (in ASCII order)
  ~!=        Is not identically equal to
  ?!=        Does not include as a substring
  ~<=        Is not greater than (in ASCII order)
  ~>=        Is not less than (in ASCII order)

  Numerical comparison operators:
  =          Is equal to
  <          Is less than
  >          Is greater than
  !=         Is not equal to
  <=         Is not greater than
  >=         Is not less than
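The two classes of value-matching operator can be sketched in Python as follows. This is a minimal illustration and not the Star Base implementation: the names as_number and matches are hypothetical, the operators are written with the ASCII tilde, the numeric pattern is a simplified reading of the CIF numb conventions (integer, real or scientific notation, with an optional standard uncertainty in parentheses), and treating a non-numeric string under a numerical operator as a failed match is an assumption.

```python
import operator
import re

# Assumption: simplified pattern for numb-style strings such as 3, 1.0079,
# 4.5018E+00 or 1.0079(2); the parenthesized uncertainty is ignored.
_NUMB = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\(\d+\))?$")


def as_number(value):
    """Return the float implied by a value string, or None if not numeric."""
    m = _NUMB.match(value)
    return float(m.group(1)) if m else None


# Text operators compare the strings themselves; Python orders strings by
# code point, which coincides with ASCII order for ASCII strings.
TEXT_OPS = {
    "~=": operator.eq,
    "?=": lambda value, target: target in value,
    "~<": operator.lt,
    "~>": operator.gt,
    "~!=": operator.ne,
    "?!=": lambda value, target: target not in value,
    "~<=": operator.le,
    "~>=": operator.ge,
}

# Numerical operators compare the numbers implied by the strings.
NUM_OPS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
}


def matches(value, op, target):
    """Test a retrieved value string against a target string."""
    if op in TEXT_OPS:
        return TEXT_OPS[op](value, target)
    a, b = as_number(value), as_number(target)
    if a is None or b is None:
        return False  # assumption: non-numeric strings never match numerically
    return NUM_OPS[op](a, b)


print(matches("(3)->[2]", "?=", "(3)"))   # substring test on text
print(matches("10", "~<", "9"))           # text order
print(matches("10", "<", "9"))            # numerical order
```

The last two calls show why the distinction matters: the same pair of strings can compare differently as text and as numbers.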

5.2.3.4. The Star Base branching request

Both conditional and data requests will retrieve matching data items wherever they may be found in the input file; the scope of the query in both cases is the entire file. The top-level query type supported by Star Base, the branching request, allows selection of sub-requests based on the results of prior tests, and also allows the narrowing or expansion of the scope of a request. The effect is to permit extensive control over the selection of data matching complex conditions. It is this which gives Star Base the power of a database query language. Note again that the user will in general need prior knowledge of the arrangement of data items within a STAR File in order to compose meaningful requests; Star Base is agnostic about the organization and structure of the contents of a data file and will simply return exactly those data items and their context that match the specified conditions.

A branching request takes the following form:

  if <condition> <branch request> [ else <branch request> ] [ unknown <branch request> ] endif

The <condition> has exactly the same form as a conditional request, but does not return data to the calling process. It returns only a logical value that is used to determine which branch to evaluate. This logical return value may be TRUE if the condition is satisfied, UNKNOWN if the condition is not satisfied because there was no occurrence of a requested data name within the current scope of the query, or FALSE if the condition is not met otherwise.

The branching request must have a condition and a branch request that is made if the condition returns TRUE. The other possible branches are optional. The unknown branch, if present, is executed when the condition returns a value of UNKNOWN. If there is no unknown branch, then the default truth value is set as FALSE (i.e. a return value of UNKNOWN is treated as equivalent to FALSE) and the else branch is executed if present. It is possible to override this behaviour by using the special operator assume true before a condition. This operator forces the default truth value to TRUE when a condition returns an UNKNOWN value. It is a useful shorthand when the same branch request is applied against a condition that is either TRUE or UNKNOWN. The syntax is assume true (<condition>).

A <branch request> (that is, the set of actual individual requests within a branching request construct) has three possible forms:
(i) a conditional request,
(ii) a branching request,
(iii) scope <scope> <branch request> endscope.
Note carefully the different contexts in which branch requests and nested branching requests may occur.

scope <scope> specifies the range of data to be searched in the input file. The effect of the setting is closed by the endscope statement. The permitted values of <scope> are:
(i) data item restricts the branch request to the data items in the condition,
(ii) loop packet restricts the branch request to the contents of the loop packet in which data match the condition,
(iii) loop structure restricts the branch request to the loop structure in which data match the condition,
(iv) save frame restricts the branch request to the contents of the save frame in which data match the condition,
(v) data block restricts the branch request to the contents of the data block in which data match the condition,
(vi) file specifies that the branch request applies to the contents of the file containing data matching the condition (the default setting).
The default scope is invoked when a scope is not specified; in such a case the scope of the branch request is the same as that of the condition.

Fig. 5.2.3.1 demonstrates the construction of a branching request that restricts the scope of the query. The two requests in this figure are applied to the STAR File example of Fig. 5.2.2.1. In Fig. 5.2.3.1(a), the query is targeted to retrieve all data items in a loop packet where the value of _basis_set_contraction_scheme includes the substring (3), provided that the value of _basis_set_atomic_name identically matches the string value hydrogen in the next outer nested loop packet. The data relevant to the contraction scheme labelled (3)->[2] are returned. Note how the wildcard data request _* retrieves the data items from the next outer loop structure in which the requested data lie.

In the example of Fig. 5.2.3.1(c), the request for an unknown data name cannot be matched within the input file, and the unknown branch of the request is executed. In this case, the secondary request is more specific (only data names including the substring contraction are matched) and hence only a few items from the second-level loop are returned.

(a)
  if_ _basis_set_atomic_name = hydrogen
    scope_loop_packet_
      if_ _basis_set_contraction_scheme ?= (3)
        scope_loop_packet_
          _*
        endscope_
      endif_
    endscope_
  endif_

(b)
  data_Gaussian
  loop_
    _basis_set_atomic_name
    _basis_set_atomic_symbol
    _basis_set_atomic_number
    _basis_set_atomic_mass
    loop_
      _basis_set_contraction_scheme
      _basis_set_funct_per_contraction
      _basis_set_primary_reference
      _basis_set_source_exponent
      _basis_set_source_coefficient
      _basis_set_comments_index
      _basis_set_atomic_energy
      loop_
        _basis_set_function_exponent
        _basis_set_function_coefficient
      stop_
    stop_
  hydrogen H 1 1.0079
    (3)->[2] 2:1 PKC1.23.1 R75 R75 C13,C19 -0.496979
      4.5018000E+00 1.5628500E-01
      6.8144400E-01 9.0469100E-01
      1.5139800E-01 1.0000000E+01 stop_
    stop_

(c)
  if_ _basis_set_atomic_name = hydrogen
    scope_loop_packet_
      if_ _basis_set_contraction_xxxxxx ?= (3)
        scope_loop_packet_
          _*
        endscope_
      unknown_
        _*contraction*
      endif_
    endscope_
  endif_

(d)
  data_Gaussian
  loop_
    loop_
      _basis_set_contraction_scheme
      _basis_set_funct_per_contraction
    stop_
      (2)->[2] 1:
      (2)->[2] 1:
      (2)->[1] 2
      (3)->[2] 2:1
    stop_

Fig. 5.2.3.1. Examples of branching requests and the results returned by Star Base from the example file of Fig. 5.2.2.1. (a) A query designed to extract a data structure relevant to one contraction scheme and one atom type. (b) The results of that request. (c) A similar request, but with a branch followed when the condition cannot be matched against a requested data item in the current scope, and (d) the resulting output. See text for discussion.

Scopes can be expanded or contracted. If the example of Fig. 5.2.3.1(a) were to be modified by replacing the innermost scope_loop_packet_ declaration with scope_loop_structure_, the query would proceed by testing for the existence of a contraction scheme value including the string (3) in the loop packet relevant to the hydrogen results (as before). Finding that this condition was satisfied, the result returned would be all data names in the encompassing loop structure; in this particular example, the complete loop contents would be returned.

Note that Star Base faithfully returns context even in the processing of complex branching requests. Therefore if, for example, a save-frame pointer is returned as a data value following the processing of a request, the associated save-frame contents will be returned in full so that they are referenced in the returned STAR data structure.
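The three-valued logic that drives a branching request (Section 5.2.3.4) can be sketched in Python. This is an illustration under stated assumptions rather than Star Base's own code: the names Truth, evaluate and select_branch are hypothetical, the current scope is modelled as a plain dictionary from data names to lists of value strings, and letting an explicit unknown branch take precedence over assume true is an assumption, since the text does not say how the two interact.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "TRUE"
    FALSE = "FALSE"
    UNKNOWN = "UNKNOWN"


def evaluate(scope_items, name, predicate):
    """Evaluate a condition on the data items visible in the current scope.

    scope_items: dict mapping data names to lists of value strings
    (hypothetical layout). Returns UNKNOWN when the data name does not
    occur in scope, TRUE when some value satisfies the predicate, and
    FALSE otherwise.
    """
    if name not in scope_items:
        return Truth.UNKNOWN
    if any(predicate(value) for value in scope_items[name]):
        return Truth.TRUE
    return Truth.FALSE


def select_branch(truth, branches, assume_true=False):
    """Pick the branch request to run, or None if no branch applies.

    branches: dict with the mandatory key 'if' and optional keys
    'else' and 'unknown'.
    """
    if truth is Truth.TRUE:
        return branches["if"]
    if truth is Truth.UNKNOWN:
        if "unknown" in branches:
            return branches["unknown"]   # assumption: takes precedence
        if assume_true:
            return branches["if"]        # UNKNOWN treated as TRUE
    # FALSE, or UNKNOWN defaulting to FALSE
    return branches.get("else")


scope = {"_basis_set_contraction_scheme": ["(2)->[2]", "(3)->[2]"]}
has_three = lambda value: "(3)" in value

found = evaluate(scope, "_basis_set_contraction_scheme", has_three)
missing = evaluate(scope, "_basis_set_contraction_xxxxxx", has_three)
print(found, missing)
print(select_branch(missing, {"if": "_*", "unknown": "_*contraction*"}))
```

The second condition mirrors the request of Fig. 5.2.3.1(c): the misspelled data name is absent from the scope, so the condition is UNKNOWN and the unknown branch is chosen.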

quoted strings, semicolon-delimited text and frame codes. The language for defining a syntax in vim is very simple and very powerful. The constructs allow the user to define the syntax precisely enough so that the system does not match patterns within other patterns, unless directed to. There are three types of syntax items: keyword, match and region. keyword can only contain keyword characters, no other syntax items. It will only match with a complete word (there are no keyword characters before or after the match). The rule for the STAR syntax is

5.2.3.5. Implementation issues Star Base is implemented in the C programming language, and exploits Gnu’s flex and bison compiler-compiler system to generate a lexer and parser for the STAR File and a separate lexer and parser for the Star Base query language. The STAR File parser builds an in-memory representation (much like most programming-language compilers) of the file contents, and differs from similar applications that are based on a single pass over a stream (like SAX for XML applications). While a system like CIFtbx retains a block copy of the STAR File in memory, the initial Star Base processing removes all comments and formatting, and stores the meaningful tokens in a binary tree representation. For each STAR File container (global block, data block, save frame, loop or data item) there is a C structure defined. For each of these there are additional structures defined that hold sequences of containers. The nodes of this tree are populated with these structures. Each leaf of the tree is the data item consisting of the data name and its associated value. A binary tree of the global-block sequences is built in reverse order (that is, in an order reverse to that in which they appear in the file), making it simple to identify the global values in scope for a specific data block. It will be recalled that the STAR File semantics require a backward scan through the file to pick up the global blocks in scope. The binary search algorithm employed is the classic tsearch of Knuth (1973), which is part of the standard C libraries. Given modern computer systems, the implementation is extremely fast and efficient. There are no files in existence whose size would test the limits of Star Base. The use of a binary tree simplifies the process by which a legitimate STAR File is returned as output by Star Base and also how the scope over which the conditionals operate can be controlled by the user. 
The program stores references to the data nodes of the tree it needs to extract when outputting. Since the location in the original data tree is always stored, the program is easily able to reconstruct the correct structure of the file by walking the tree, identifying the nodes that need to be output in addition to the data. Star Base is by default the ‘gold standard’ for testing other applications for correctness with respect to the syntax and semantics of the STAR File. It can be said that the output of Star Base is not optimal, since it is yet another STAR File and one which is devoid of the original comments and formatting. However Star Base is in essence an API for STAR File applications, rather than a stand-alone program (although it is often used in that way). Star Base was the platform from which BioMagResBank’s starlib (Section 5.2.6.4) was developed.

syn keyword strKeyword

global_ save_ loop_ stop_

match is a match with a single regular expression (regexp) pattern. The rule for matching on a save-frame code is syn match strFramecode

"$[ˆ \t]\+"

region starts at a match of the ‘start’ regexp pattern and ends with a match with the ‘end’ regexp pattern. Any other text can appear in between. There can be several ‘start’ and ‘end’ patterns in the one definition. A ‘skip’ regexp pattern can be used to avoid matching the ‘end’ pattern. There are a number of character offset parameters that allow the user to redefine the start and end of the matched text given the pattern that matches the regular expression. Quite separately, one can define the region for highlighting, which can be different from the matched text. The rule that matches a double-quoted string is syn region strString start=+ˆ"+ start=+\s\+"+ms=e end=+" +me=e-1 end=+"\t+me=e-1 end=+"$+ contains=strSpecial " skip=+\\\\\|\\"+

In this rule, the beginning of the pattern is a double quote that is either the first character on the line or that has one or more whitespace characters before it. The beginning of the matched string (ms) is the end of the matched pattern (e). That is, the matched string begins at the quote. The end pattern is a double quote followed by a single space, a tab or an end of record. The end of the matched string (me) is one character less than the end of the matched pattern (e). That is, the trailing character after the closing double quote is not considered part of the matched string. By default the characters between location ms and me are highlighted. This too can be controlled, and by including hs = ms + 1 and he = me − 1 the highlighted text would not include the delimiting double quotes. As these rules are based on regular expressions, there is no possibility of using them to validate the STAR File structure. However, problems in the structure are often identifiable by unexpected or irregular highlighted text [a fact often used in graphical CIF editors to help the user locate visually errors in syntax (see e.g. Section 5.3.3.1.4)]. 5.2.5. Browser-based viewing with StarMarkUp StarMarkUp is a Tcl/Tk program that takes any STAR File as input and outputs the contents as HTML. The output is a faithful copy of the input, and there is no reformatting or deletion of content. During transformation, the contents can be cross-referenced against any other STAR File using HTML anchors. This feature is particularly useful when marking up a data file, since the data names contained within can be hyperlinked to their definition in their dictionary. Furthermore, the definitions contained within the dictionary can be hyperlinked to the DDL dictionary. StarMarkUp makes no presumptions about the version of DDL employed, the preferred dictionary structure or the specific application

5.2.4. Editing STAR Files with Star.vim

The vim editor supports syntax highlighting for a wide variety of languages through syntax definition rules. The definition rules for STAR Files are simple and small in number. The entire syntax is defined by 19 rules that include the regular expressions for the STAR File keywords, data names, numbers, single- and double-


5.2. STAR FILE UTILITIES 5.2.6. Object-oriented STAR programming The STAR File syntax is very simple but flexible, and suggests a number of well defined data structures. ‘Scalar’ values (i.e. single text strings identified by specific data names) are obvious, as are ‘vectors’ (multiple string values associated with a data name in a loop_ declaration). ‘Matrices’ are constructed by associating identical-length vector items in the same loop_ structure. Note that these data structures are not true matrices, but arrays or tables of column-addressable vectors. Column addressing arises from the need to associate each set of values with a distinct data name in the loop header declaration. Through the nested loop construct in STAR, vector or matrix elements are permitted within vectors or matrices. Save frames provide addressable data structures containing one or more scalar, vector or matrix components; and data blocks contain similar data structures with the inclusion of embedded save frames. This hierarchy of structure allows many other data models to be mapped onto the STAR syntax. For example, STAR does not in itself provide addressable rows in matrix structures, but such a concept may be achieved in application-specific ways. For example, the relational database structure of mmCIF and other DDL2 applications ensures that table rows can be identified by specific key values. Loops of multiple values associated with a single data name are similar to lists or one-dimensional arrays as defined in many computer languages. Simple associative arrays, or ‘hashes’, such as are used to great effect in Perl and Python, can be modelled by designating the first value of a two-item loop structure as the ‘key’ item, by analogy with the relational database example above (in such an application the key value of course needs to be unique). 
The hierarchy of nested loops can be used to model certain types of complex structures such as might be defined in a C-language ‘struct’, for example. There is a natural possibility of representing particular well defined data structures as ‘objects’ and handling STAR Files by object-oriented programming techniques. Where applications associate properties of specific data items with external reference descriptions, as in the DDL dictionaries, the dictionary entries can be considered to express types, relationships and even methods associated with the object classes. However, the versatility of the STAR syntax means that there are many ways of designing STAR objects and their relationships, and it is likely that the most useful efforts will be in designing applications for specific purposes or subject areas. Below we describe a number of prototypes and implementations of object-oriented STAR programming. Some are rather generic in outlook; others are informed by the crystallographic viewpoint embodied in CIF.
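The data structures described above can be made concrete with a short sketch. The following Python fragment is illustrative only: ordinary Python containers stand in for STAR structures, the data names are borrowed from the CIF core dictionary purely as examples, and the helper names column, rows and loop_to_hash are hypothetical rather than part of any STAR library.

```python
# Illustrative only: plain Python containers standing in for STAR structures.

# A 'scalar': one string value identified by a data name.
scalars = {"_cell_length_a": "7.514(3)"}

# A 'matrix' in the STAR sense: equal-length column vectors that share one
# loop_ header. Columns are addressed by data name; rows have no name.
loop = {
    "_atom_site_label": ["O1", "C2", "C5"],
    "_atom_site_type_symbol": ["O", "C", "C"],
}


def column(loop, name):
    """Return the 'vector' of values associated with one looped data name."""
    return loop[name]


def rows(loop):
    """Return each loop packet (table row) as a tuple, in header order."""
    return list(zip(*loop.values()))


def loop_to_hash(loop, key_name, value_name):
    """Treat a two-item loop as an associative array keyed on key_name.

    The key column must hold unique values, as noted in the text.
    """
    keys = loop[key_name]
    if len(set(keys)) != len(keys):
        raise ValueError("key values are not unique: " + key_name)
    return dict(zip(keys, loop[value_name]))


symbols = loop_to_hash(loop, "_atom_site_label", "_atom_site_type_symbol")
```

Row addressing is absent from this model, as it is from STAR itself; an application that needs it must nominate a key column, which is what loop_to_hash does for the two-column case.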

Fig. 5.2.5.1. Hyperlinking markup generated by StarMarkUp. The example is based on an extended relational dictionary definition language, StarDDL.

of the STAR File. The program understands only the rules of a legitimate STAR File. StarMarkUp provides a number of hyperlinking facilities. During markup, it automatically hyperlinks frame-code values to the internal save-frame block to which they point. As part of the same process, StarMarkUp inserts anchors in all data and save-frame blocks. It does this in anticipation that there may be the need to hyperlink to these anchors from another STAR File. The most obvious application for this is in the marking up of the DDL dictionary. The list of anchors generated from this process can be passed on to the discipline dictionary during its markup phase (Fig. 5.2.5.1). In this way, the tags used in the discipline dictionary to define the data names can be made to point back to their entry in the DDL dictionary. At the same time that the discipline dictionary is being marked up, a list of its anchors is being generated (each anchor being to a data-item definition). This list is used when marking up an instance of the discipline STAR data file to hyperlink each data name back to its definition in the discipline dictionary. StarMarkUp comprises a hand-crafted tokenizer employing a character buffer and one-character look-ahead to identify accepted tokens. StarMarkUp does not build an internal representation of the file, but functions as a streaming parser. The GetToken() function returns a structure consisting of the token type (an enumerated set) and the token value (the lexeme associated with that token). In most cases, the lexeme is marked up and injected into the output stream. If the token is associated with a STAR File block, the parser will recursively call a MarkUpBlock() function. A recursive descent parser makes it very easy to treat all STAR File blocks (global_, data_ and save_) in identical fashion. StarMarkUp is implemented in Tcl/Tk for novelty and not because of any particular superior qualities of the language and its API. 
It is, however, fast and sufficiently flexible, and extensions to the program can be rapidly implemented and tested.
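The token-by-token markup strategy described above can be sketched in a few lines. The following Python fragment is not StarMarkUp itself (which is written in Tcl/Tk). The function names get_tokens and mark_up, the token-type labels and the URL scheme used for the anchors are assumptions made for illustration, and semicolon-delimited text fields and the recursive handling of blocks are omitted for brevity.

```python
import html

KEYWORD_PREFIXES = ("data_", "save_", "global_", "loop_", "stop_")


def get_tokens(text):
    """Yield (token_type, lexeme) pairs, scanning one character at a time."""
    i, n = 0, len(text)
    while i < n:
        start, ch = i, text[i]
        if ch.isspace():
            while i < n and text[i].isspace():
                i += 1
            yield "space", text[start:i]
        elif ch == "#":
            while i < n and text[i] != "\n":
                i += 1
            yield "comment", text[start:i]
        elif ch in "'\"":
            # A quote closes the string only when it is followed by white
            # space or the end of input: one character of look-ahead.
            i += 1
            while i < n and text[i] != "\n":
                closes = text[i] == ch and (i + 1 == n or text[i + 1].isspace())
                i += 1
                if closes:
                    break
            yield "string", text[start:i]
        else:
            while i < n and not text[i].isspace():
                i += 1
            word = text[start:i]
            if word.startswith("_"):
                yield "name", word
            elif word.lower().startswith(KEYWORD_PREFIXES):
                yield "keyword", word
            else:
                yield "value", word


def mark_up(text, dictionary_url):
    """Stream HTML pieces: every lexeme is copied, data names become links."""
    for kind, lexeme in get_tokens(text):
        escaped = html.escape(lexeme)
        if kind == "name":
            # Hypothetical anchor scheme: the data name is the URL fragment.
            yield '<a href="%s#%s">%s</a>' % (dictionary_url, escaped, escaped)
        else:
            yield escaped
```

Because white space and comments are emitted as tokens rather than discarded, joining the output of mark_up and stripping the tags reproduces the input exactly, which mirrors the requirement that the output be a faithful copy of the input.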

5.2.6.1. OOSTAR

The OOSTAR approach (Chang & Bourne, 1998) developed and tested an Objective-C (Pinson & Richard, 1991) toolset for STAR representation. The developers selected Objective-C over the more common C++ language because of its perceived advantages of (i) loose data typing, (ii) run-time type checking and (iii) message-passing ability. These were all seen to aid flexible software design, an asset for applications built on the very flexible structure of the STAR syntax itself. Chang & Bourne constructed a set of classes built on the basic objects item and value. Their model builds upwards through a hierarchy of classes describing relationships between objects. The ItemAssoc class contains data members corresponding to the data item itself, its included value or set of values, and the data name that is used to reference the item; the class also has pointers to


5. APPLICATIONS

previous and next elements to allow iteration. The StarAssoc class contains methods for manipulating STAR objects through setting and retrieval of data names and assignment of values. DataBlock is the class containing all the lower-order items and associations, and one or more such DataBlock classes comprise the StarFile class. The OOSTAR approach also provides Dictionary and DictionaryElement classes so that STAR applications that do describe element attributes in external dictionary files can make use of such information.

Some sample applications were built with the class libraries of the OOSTAR toolset, and they are described in Chang & Bourne (1998). They include: simple converters of STAR data files to HTML pages with hyperlinks to associated dictionary entries; a query tool for retrieving the data values for a specific item specified by name within a specific data block; and a query mechanism that retrieves a set of items from a STAR loop structure and represents their values in an array, i.e. flattening if necessary any nested loops into a purely tabular presentation. These query tools are different from Star Base primarily in that they do not return the context in which the data are found in the original file; their interpretation depends on a detailed a priori understanding of the target file structure.

This approach is rather different from the canonical description of STAR given in this volume, and was never developed into full-blown applications (in part because the developers recognized that the Java language would provide a preferable platform for further development). Nevertheless, it provides interesting ideas for the developer considering building object-oriented applications for STAR Files.

5.2.6.3. CIFOBJ The CIFOBJ class library (Schirripa & Westbrook, 1996) was developed to provide an object view of the mmCIF dictionary and to complement the relational CIFLIB class library (Westbrook et al., 1997) for handling mmCIF data. As such, it is very much tuned to crystallographic applications and it handles only the subset of STAR features used in CIF dictionaries. This does, however, include some limited handling of save frames, and is therefore a little more complex than applications interested only in CIF data input/output processing. CIFOBJ has two components. The first builds a persistent store of objects of types item, subcategory, category and dictionary. Each such object is a container for all relevant attributes permitted for that object type. The object store is populated from the mmCIF dictionary by the CIFOBJ loader class (using methods provided by CIFLIB). This loader class assembles the dictionary objects and passes them to an object-storage manager. The second component of the CIFOBJ class library provides the methods necessary for building dictionary objects from the persistent store and for passing attribute strings to procedures concerned with establishing the integrity of data values. The main reason for discussing this implementation here is to indicate the rapid growth in complexity needed to impose an application-specific object view on even a relatively simple STAR data structure. In generic prototype STAR projects such as OOSTAR, half a dozen or so classes and associated methods suffice to represent the highest-level abstract concepts implied in STAR constructions. In CIFOBJ, however, dozens of methods are associated with dictionary access. Dictionary-driven validation of an mmCIF numerical data value can involve: (i) retrieval from the dictionary of the extended type declaration associated with the data item; (ii) validation of the basic type (i.e. 
that it is indeed numeric) against the primitive data types supported by the dictionary; (iii) validation of the extended type by regular-expression matching of the string representation of the value against the allowed patterns stored in the dictionary; (iv) retrieval of any existing range constraints specified in the dictionary and comparison with the data value; (v) location and evaluation of any associated standard uncertainty; (vi) identification of the units in which the physical quantity is expressed; (vii) if necessary, conversion of the units according to the conversion tables stored in the dictionary. More complexity arises from the relationships between data items expressed through dictionary attributes such as name aliasing, parent–child dependencies and category membership. Nevertheless, the complexity of these relationships is an indication of the richness of the metadata available through the dictionary approach, and the availability of well defined object representations simplifies the construction of well designed large-scale application frameworks, such as underpin the Protein Data Bank (Berman, Battistuz et al., 2002) and Nucleic Acid Database (Berman, Westbrook et al., 2002).
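The first five validation steps listed above can be illustrated with a minimal sketch. This is not CIFOBJ code: the DICTIONARY mapping is a hand-written stand-in for entries that a real application would load from the mmCIF dictionary, the function names split_su and validate are hypothetical, the regular expression deliberately ignores exponent notation, and the unit-handling steps (vi) and (vii) are left out.

```python
import re

# Hypothetical stand-in for definitions loaded from a dictionary.
DICTIONARY = {
    "_cell.length_a": {
        "pattern": r"-?(\d+\.?\d*|\.\d+)(\(\d+\))?",
        "minimum": 0.0,
        "maximum": None,
    },
}

NUMBER_WITH_SU = re.compile(r"(-?(?:\d+\.?\d*|\.\d+))(?:\((\d+)\))?$")


def split_su(text):
    """Split '7.514(3)' into (7.514, 0.003); the s.u. is None if absent."""
    match = NUMBER_WITH_SU.match(text)
    if match is None:
        raise ValueError("not a number: %r" % text)
    mantissa, su_digits = match.groups()
    value = float(mantissa)
    if su_digits is None:
        return value, None
    # The s.u. digits apply to the last decimal places of the value.
    decimals = len(mantissa.partition(".")[2])
    return value, int(su_digits) / 10 ** decimals


def validate(name, text, dictionary=DICTIONARY):
    """Return a list of problems found for one data value (empty if none)."""
    entry = dictionary.get(name)                      # step (i)
    if entry is None:
        return ["no definition for " + name]
    if re.fullmatch(entry["pattern"], text) is None:  # steps (ii) and (iii)
        return ["value %r does not match the declared type" % text]
    value, _su = split_su(text)                       # step (v)
    problems = []
    low, high = entry["minimum"], entry["maximum"]    # step (iv)
    if low is not None and value < low:
        problems.append("value below minimum %s" % low)
    if high is not None and value > high:
        problems.append("value above maximum %s" % high)
    return problems
```

Even this reduced version needs a dictionary lookup, two regular expressions and a range test for a single value, which gives some feeling for why a full dictionary-driven library grows to dozens of methods.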

5.2.6.2. CIF++

CIF++ was a small library of classes designed by Peter Murray-Rust during the time that DDL was being developed as a language for representing the properties and attributes of STAR data items (e.g. Murray-Rust, 1993). The purpose of this project was to demonstrate the design of data classes that represented real-world objects such as molecules and crystal cells. At a time when the structure of CIF was under intense discussion, many of the classes were potentially extensible to STAR features that were a superset of those found in CIF. While these classes were never developed into fully functional applications, they demonstrated a potentially fruitful approach to model representation and gave rise to the idea of including methods in STAR dictionary definitions, thus allowing STAR applications to dynamically associate algorithmic relationships and operations with data objects by parsing the methods description in the dictionary definitions. This approach is currently being developed into a relational expression language for STAR called dREL (Spadaccini et al., 2000).

The ideas behind CIF++ were further developed in a class library for molecular representation called Democritos, and have informed the representation of molecular and crystal structure in Chemical Markup Language (CML) and in the structured document browser Jumbo (Murray-Rust, 1998). The CIF++ classes have been refactored into Java using the W3C document object model (W3C, 2004) and other DOM-like models. They are now based on an XML schema which is the abstraction of the formal DDL1 specification (Chapter 2.5). These are available as part of the Jumbo distribution at http://cml.sf.net. The design involves an interface that would allow the C++ classes to be recreated from the Java.

5.2.6.4. starlib BioMagResBank (BMRB) is a repository for NMR spectroscopy data on proteins, peptides and nucleic acids at the University of Wisconsin – Madison (Ulrich et al., 1989). For some time, NMR data sets have been exchanged within this environment using STAR Files; an NMRStar data dictionary to define the data names used for tagging NMR data is under development


Table 5.2.6.1. Object classes for manipulating STAR data in starlib

(BioMagResBank, 2004). The starlib class library was developed at BMRB for handling NMRStar files, but its initial application to such files independently of the prototype data dictionary means that it is applicable to any STAR File. It does not provide a relational database paradigm (although this is a long-term goal). However, it does provide objects and methods suitable for searching and manipulating STAR data. Table 5.2.6.1 lists the top-level classes used in starlib. ASTnode is a formal base class, providing the types and methods that can be used in other derived classes. StarFileNode is the root parent of all other objects contained in an in-memory representation of a STAR File; in practice it contains a single StarListNode, which is the list of all items contained in the file. BlockNode is a class which contains a partition of the STAR File: the class handles both data blocks and global blocks. Data-block names are stored in instances of the HeadingNode object, which also holds save-frame identification codes and is therefore useful for accessing named portions of the file. DataNode is a virtual class representing the types of data objects handled by the library (accessed directly as DataItemNode, DataLoopNode and SaveFrameNode). Looped data items are handled by a number of objects. DataLoopNameListNode is a list of lists of names in a loop. The first list of names is the list of names for the outermost loop, the second list of names is the list of names for the next nesting level and so on. LoopNameListNode is a list of tag names representing one single nesting level of a loop’s definition. LoopTableNode is a table of rows in a DataLoopNode (not itemized in Table 5.2.6.1; it is an object representing a list of tag names and their associated values, a particular case of DataNode). starlib views a loop in a STAR file as a table of values, with each iteration of the loop being a row of the table. 
Each row of the table can have another table under it (another nesting level), but such tables are the same structure as the outermost one. Thus LoopTableNode stores a table at some arbitrary nesting level in the loop. A simple singly nested loop will have only one loop table node, but a multiply nested loop will have a whole tree of loop tables. LoopRowNode is a single row of values in a loop. DataNameNode holds the name of a tag/value pair or a loop tag name. DataValueNode is the type that holds a single string value from the STAR file and the delimiter type that is used to quote it. DataListNode and SaveFrameListNode store lists of data within higher-order data objects or save frames, and are internal classes rarely invoked directly by a programmer.

A number of observations may be made regarding this approach. Firstly, the objects can be mapped with reasonable fidelity to the high-level Backus–Naur form representation of STAR (Chapter 2.1). Secondly, it is computationally convenient to abstract common features into parent classes, so that, for example, individual data items, looped data and save frames are represented as child objects of the DataNode object, and not themselves as first-generation children of the base class. Thirdly, the handling of nested loops may be achieved in different ways; starlib has chosen a particular view that is perhaps well suited to relational data models.

As expected within a programming toolkit, starlib offers a large number of methods for retrieving STAR data values, adding new data items, extending or re-ordering list structures, and performing structural transformations of the in-memory data representation. Unlike the stand-alone Star Base application, it does not guarantee that output data will be in a STAR-conformant format; and the programmer is left with the responsibility of validating transformed data at a low level.
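The view of a nested loop as a tree of tables can be sketched as follows. The class names LoopTable and LoopRow deliberately echo LoopTableNode and LoopRowNode, but this is a simplified illustration and not the starlib API; in particular, each table here carries its own list of names, and the data names in the usage example are invented. The flatten function shows the kind of conversion to a purely tabular form mentioned in Section 5.2.6.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoopRow:
    """One row of values; may own a table at the next nesting level."""
    values: List[str]
    inner: Optional["LoopTable"] = None


@dataclass
class LoopTable:
    """A table of rows at one nesting level of a loop."""
    names: List[str]
    rows: List[LoopRow] = field(default_factory=list)


def flatten(table, prefix=()):
    """Expand a tree of loop tables into purely tabular rows.

    Each outer row is repeated once for every row of the table beneath it.
    """
    flat = []
    for row in table.rows:
        current = prefix + tuple(row.values)
        if row.inner is None:
            flat.append(current)
        else:
            flat.extend(flatten(row.inner, current))
    return flat


# Hypothetical two-level loop: atoms, each with a list of bonded partners.
atoms = LoopTable(["_atom_label"], [
    LoopRow(["O1"], LoopTable(["_bond_partner"],
                              [LoopRow(["C2"]), LoopRow(["C5"])])),
    LoopRow(["C2"], LoopTable(["_bond_partner"], [LoopRow(["O1"])])),
])
table = flatten(atoms)
```

A singly nested loop corresponds to a LoopTable whose rows all have inner set to None, so the same structure serves every nesting depth.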

Class                   Description
ASTnode                 The base class from which all other classes are derived
StarFileNode            The STAR File object
StarListNode            List of items contained in the STAR File
BlockNode               A data or global block
HeadingNode             Labels for major STAR File components
DataNode                General class for data objects
DataLoopNameListNode    List of lists of names in a loop
LoopNameListNode        List of tag names representing one nesting loop level
LoopTableNode           Table of rows in a loop
LoopRowNode             Single row of values in a loop
DataNameNode            A data name
DataValueNode           A single string value
DataListNode            List of data within a higher-order data object
SaveFrameListNode       List of data items allowed in a save frame

Nevertheless, this is a substantial and important library which, as with CIFOBJ, has played an important role in the functioning of a major public data repository. Development of the class libraries continues, with a Java version now available. 5.2.6.5. StarDOM A convenience of well designed object representations is that effective transformation between different data representations may be possible. The StarDOM package (Linge et al., 1999) demonstrates a transformation from STAR Files to an XML representation, where the tree structure of a STAR File as interpreted in the starlib view above is mapped to a document object model (DOM; W3C, 2004). This approach is similar to Jumbo, mentioned above in Section 5.2.6.2. A demonstration of StarDOM is the transformation of the complete set of NMR data files at BioMagResBank to XML. The resultant files can then be interrogated using the XQL query language (Robie et al., 1998). In this example implementation, the target XML document type definition (DTD) includes a small number of XML elements matching the STAR objects global and data block, save frame, list, data item, data name and data value. Particular data names are recorded as values of the element. The authors of the StarDOM package are considering an extension in which named data items map directly to separate XML elements; the goal is to develop an NMR-specific DTD that is isomorphous to the emerging NMRStar data dictionary.
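A generic mapping of this kind, in which data names are recorded as element content rather than as element names, might look like the following sketch. It is not StarDOM code: the element names star_file, data_block, data_item, data_name and data_value are assumptions loosely modelled on the list of STAR objects given above, the input is a simple nested dictionary rather than a parsed STAR File, and Python's ElementTree path syntax is used in place of XQL.

```python
import xml.etree.ElementTree as ET


def star_to_xml(blocks):
    """Map {block code: {data name: value}} to a generic XML tree."""
    root = ET.Element("star_file")
    for code, items in blocks.items():
        block = ET.SubElement(root, "data_block", name=code)
        for name, value in items.items():
            item = ET.SubElement(block, "data_item")
            # The data name is stored as text, not as an element name.
            ET.SubElement(item, "data_name").text = name
            ET.SubElement(item, "data_value").text = value
    return root


doc = star_to_xml({"example": {"_entry_title": "test entry"}})
xml_text = ET.tostring(doc, encoding="unicode")

# The tree can then be queried by path, e.g. for every recorded data name.
names = [e.text for e in doc.findall("./data_block/data_item/data_name")]
```

The alternative design mentioned in the text, in which each named data item maps to its own XML element, would replace the generic data_item element with one whose name is derived from the data name, at the cost of needing a subject-specific document type definition.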

References

Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., Feng, Z., Gilliland, G. L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J. D. & Zardecki, C. (2002). The Protein Data Bank. Acta Cryst. D58, 899–907.
Berman, H. M., Westbrook, J., Feng, Z., Iype, L., Schneider, B. & Zardecki, C. (2002). The Nucleic Acid Database. Acta Cryst. D58, 889–898.
BioMagResBank (2004). NMR-STAR Dictionary. Version 3.0. (Under development.) http://www.bmrb.wisc.edu/dictionary/3.0html/.
Chang, W. & Bourne, P. E. (1998). CIF applications. IX. A new approach for representing and manipulating STAR files. J. Appl. Cryst. 31, 505–509.
Knuth, D. E. (1973). The art of computer programming. Vol. 3, Sorting and searching, pp. 422–447. Reading, MA: Addison-Wesley.
Linge, J. P., Nilges, M. & Ehrlich, L. (1999). StarDOM: from STAR format to XML. J. Biomol. NMR, 15, 169–172.
Murray-Rust, P. (1993). Analysis of the DDL/dictionary parsing problem. Proc. First Macromolecular Crystallographic Information File (CIF) Tools Workshop, ch. 12. New York: Columbia University.
Murray-Rust, P. (1998). The globalization of crystallographic knowledge. Acta Cryst. D54, 1065–1070.
Pinson, L. J. & Richard, R. S. (1991). Objective-C object-oriented programming techniques. Reading, MA: Addison-Wesley.
Robie, J., Lapp, J. & Schach, D. (1998). XML Query Language (XQL). http://www.w3.org/TandS/QL/QL98/pp/xql.html.
Schirripa, S. & Westbrook, J. D. (1996). CIFOBJ. A class library of mmCIF access tools. Version 3.01 reference guide. Technical Report NDB-269. Rutgers University, New Brunswick, New Jersey, USA.
Spadaccini, N. & Hall, S. R. (1994). Star Base: accessing STAR File data. J. Chem. Inf. Comput. Sci. 34, 509–516.
Spadaccini, N., Hall, S. R. & Castleden, I. R. (2000). Relational expressions in STAR File dictionaries. J. Chem. Inf. Comput. Sci. 40, 1289–1301.
Ulrich, E. L., Markley, J. L. & Kyogoku, Y. (1989). Creation of a nuclear magnetic resonance data repository and literature database. Protein Seq. Data Anal. 2, 23–37.
W3C (2004). Document Object Model (DOM). http://www.w3.org/DOM/.
Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). CIF applications. VI. CIFLIB: an application program interface to CIF dictionaries and data files. J. Appl. Cryst. 30, 79–83.


International Tables for Crystallography (2006). Vol. G, Chapter 5.3, pp. 499–525.

5.3. Syntactic utilities for CIF

By B. McMahon

computer programmers. However, the use of ASCII character sets, deliberately expressive data names and simple layout conventions both permit and encourage users to edit the files with general text editors that cannot guarantee to retain syntactic integrity. Consequently, there is a definite use for a simple program that can check whether a file conforms to the specified syntax. It is worth mentioning that programmable text editors such as emacs may be supplied with rules that can check syntax as a file is edited. A simple rule set (known as a mode file) has been developed (Winn, 1998) to indicate the different components of a CIF, as a first step towards a syntax-checking emacs mode. The Star.vim utility of Section 5.2.4 provides a similar functionality for editing in the vim environment, although it is not capable of validation directly; nevertheless, the appearance of unexpected or irregular highlighted text can draw the user’s attention to syntactic problems, a feature that is also useful in more extended editors such as enCIFer (Section 5.3.3.1).

5.3.1. Introduction Since the introduction of the Crystallographic Information File (CIF), the crystallographic community has produced a wide variety of tools and applications to handle CIFs. Many changes have been made to existing programs to input and output CIF data sets, and occasionally changes may have been made to internal crystallographic calculations to provide a better fit to the view of the data expressed by the standard CIF dictionaries. However, for most crystallographers with an involvement in programming, there is an understandable tendency to invest the minimum amount of effort needed to accommodate the new format. Their primary interest is in the understanding and discovery of the underlying physical model of a crystal structure. This chapter reviews several general-purpose tools that have been developed for CIF to check, edit, extract or manipulate arbitrary data items, with little in the way of crystallographic computation. They are of interest to the end user who wishes to visualize a structure in three dimensions or submit an article to a journal but who does not want to be concerned about the details of CIF. They also include several utilities that are helpful for manipulating the contents of CIFs without the need to write a large and complex program. The programmer with an interest in writing complete and robust CIF applications should look at the comprehensive libraries described in Chapters 5.4 to 5.6. Many of the programs described in this chapter operate purely at the syntactic level; they require no knowledge of the scientific meaning of the data items being manipulated. Others have some bearing on the semantics of the file contents, either explicitly through information about data types and interrelationships carried in external CIF dictionaries, or implicitly through the user’s choice and deliberate manipulation of items based on an understanding of what they signify. 
Nevertheless, most utilities described here are characterized by an ability to handle CIFs of any content and provenance. The best example of a program able to handle arbitrary CIFs at a purely syntactic level is Star Base (Spadaccini & Hall, 1994), described in Chapter 5.2. It should be noted that not all the programs described here are fully compliant with the specification of Chapter 2.2, and others have implementation restrictions or known bugs. Many, especially the older programs, are no longer actively supported and need to be handled with care. However, they are included here for the record, and because they may provide useful ideas and suggestions to future developers in an area that can still accommodate a wider range of tools for different uses.

5.3.2.1. vcif

A simple syntax checker for CIF is the program vcif (McMahon, 1998), which scans a text file and outputs informative messages about apparent errors. While conservative CIF parsing software will quit upon finding an error, vcif will attempt to read to the end of the file and list all clearly distinguished errors. However, its interpretation of errors depends on a close adherence to the CIF syntax specification and makes no assumption about the intended purpose of the character strings it reads. In consequence, a single logical error such as failing to terminate a multiple-line text string may cause the program to report many other apparent errors as it proceeds out of phase through the rest of the file.

5.3.2.1.1. How to use vcif

The program may be run under Unix or DOS by typing

    vcif filename

where filename is the name of the file to test. If filename is given as the hyphen character -, the program will read standard input. Standard input will also be read if no file name is supplied; this allows the program to be used in a pipeline of commands. A number of options may be supplied to the program to modify its behaviour. Without these options (i.e. invoked as above) a brief but informative message is written to the standard output channel for each occurrence of what the program perceives to be a syntax error. For example, for the incorrect sample file of Fig. 5.3.2.1(a), the output is listed in Fig. 5.3.2.1(b). Note that the sequence number of the line in which the error occurs is printed. The summary error message is output on a single line (longer lines have been wrapped and indented in Fig. 5.3.2.1 for legibility). Where the type of error necessarily affects only a single line, the program can recover and correctly identify errors on subsequent lines. Where possible, unexpected character strings are printed to help the user to identify the error. No attempt is made to assign any meaning to the data names or the data values in the

5.3.2. Syntax checker

A CIF must conform to a subset of the syntax rules of a general STAR File (Chapter 2.1), but with the additional restrictions and conventions described in Chapter 2.2. The syntax is rather simple, and robust subroutines to create CIFs may easily be written by

Affiliation: Brian McMahon, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England.

Copyright © 2006 International Union of Crystallography


    # Sample CIF with syntax errors

    _date              'Monday 12 April 1999

    _cell_length_a     7.514 (3)
    _cell_length_b     9.467 (2)

    loop_
    _geom_bond_atom_site_label_1
    _geom_bond_atom_site_label_2
    _geom_bond_distance
    O1 C2 1.342(4)
    O1 C5 1.439 (3)

    _example_comment
    ; The purpose of this example is to indicate how vcif
      describes some of the syntax errors it finds.

(a)

    ERROR: No data block code before dataname at line 3
    ERROR: Single-quoted character string does not terminate at line 4
    ERROR: Unexpected string ((3)) at line 5
    ERROR: Unexpected string ((2)) at line 6
    ERROR: Number of loop elements not multiple of packetsize at line 15
    ERROR: Text field at end of file does not terminate

(b)

Fig. 5.3.2.1. (a) An example CIF with a number of syntax errors and (b) the report of the errors produced by vcif.

file. Hence the same logical error (the detachment of a standard uncertainty in parentheses from its parent value) is indicated variously as an unexpected text string or as an extraneous loop item, depending on where it occurs in the file. Indeed, in the case of the incorrect number of loop elements, the program makes no attempt to identify which data value or values in the loop might be in error: it simply counts the number of values in a loop and complains when this is not a multiple of the number of data names declared in the loop header.

    ERROR: No data block code before dataname at line 3
    >>> "_date"
    *** A data block MUST begin with a data_something declaration.
    ERROR: Single-quoted character string does not terminate at line 4
    >>> "_cell_length_a"
    *** The indicated line appears to contain some word or words
        introduced by a single quote, but not terminated with a matching
        single quote.
    ERROR: Unexpected string ((3)) at line 5
    *** There is an unexpected word or number as indicated. This may be
        because a loop is intended but the loop_ keyword has been missed
        out; or a phrase with several words is not enclosed in matching
        delimiting quote marks; or a text field (extending over several
        lines) is not properly closed with a final semicolon; or a data_
        block header has not yet been seen.
    ERROR: Unexpected string ((2)) at line 6
    >>> "_cell_length_b"
    ERROR: Number of loop elements not multiple of packetsize at line 15
    >>> "_example_comment"
    *** A loop_ header defines a list of datanames. The values following
        this header are assigned in sequence with the datanames in the
        header, so each packet of information (or row in the table of
        values defined by the loop structure) must have the same number
        of values as there are datanames declared in the loop header.
        Common reasons for this error include: omission of a value where
        the associated data are absent (insert . or ? as placeholders);
        numeric values where the standard uncertainty (or e.s.d) has come
        adrift from its associated value (e.g. 10.925 (2)); multi-word
        phrases or text entries that are not properly delimited with
        quote marks or initial semicolons.
    ERROR: Text field at end of file does not terminate
    >>> ""
    *** Is the CIF complete?

Fig. 5.3.2.2. Verbose error listing from vcif when run with the ‘-v’ option on the example of Fig. 5.3.2.1.
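The count-only loop test described above is simple enough to sketch directly. The following Python fragment is not taken from vcif (which is a C program); the function name check_loop is hypothetical, and the message it returns only imitates the wording shown in Fig. 5.3.2.1(b), with the counts added in parentheses for illustration.

```python
def check_loop(names, values):
    """Count-only test: report a mismatch without locating the bad value."""
    if not names:
        raise ValueError("a loop header must declare at least one data name")
    if len(values) % len(names):
        return ("Number of loop elements not multiple of packetsize "
                "(%d values for %d data names)" % (len(values), len(names)))
    return None


names = ["_geom_bond_atom_site_label_1",
         "_geom_bond_atom_site_label_2",
         "_geom_bond_distance"]

# White-space-separated tokens from the two loop rows of Fig. 5.3.2.1(a):
# the detached '(3)' is counted as a seventh value.
values = "O1 C2 1.342(4) O1 C5 1.439 (3)".split()
message = check_loop(names, values)
```

Seven values cannot be divided into packets of three, so a message is returned; nothing in the test indicates that the detached standard uncertainty is the culprit, which is exactly the limitation noted in the text.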

5.3.2.1.2. Options to vcif

A number of options may be supplied as command-line arguments to modify the output from vcif. A more complete account is given of each error on its first occurrence when the program is invoked with the ‘-v’ option. The output listing explains in more detail what the breach of syntax is and sometimes suggests how misunderstandings of the file structure result in such breaches (Fig. 5.3.2.2). Each error message is prefaced by the word ‘ERROR’ (or occasionally another phrase such as ‘WARNING’ or ‘STAR ERROR’). Three chevrons preface a printout of the beginning of the troublesome line. Then an expanded description of the error is given, prefaced by three asterisks, on the first occurrence of each distinct error. In this mode, only the first 20 errors are listed (the assumption is that this mode is best suited to novices, who should identify and correct each error in turn and would not want to be swamped by large numbers of error messages arising from a single error). More errors may be reported by using the ‘-e’ command-line option. The quiet option (vcif -q) outputs no error messages but instead returns to the calling environment an integer giving the total number of errors found. This option allows scripts or external programs to use vcif as a silent test of whether a file has any syntax errors.

A related option, vcif -b, counts errors and returns the result as an integer to the calling environment, as in the previous case, but additionally outputs a list of all the data-block codes in the file. While adding nothing to the syntax-checking function of the program, this provides a useful small utility for simply listing data-block names. Although intended for use with the restricted STAR File syntax permitted for CIF (Chapter 2.2), vcif may also be used with the ‘-s’ option to check the syntax of CIF dictionary files, which may include save frames. The program does not, however, handle nested loop structures. The program will flag as an error any line longer than 80 characters (the original limit in the CIF version 1.0 specification; see Chapter 2.2), but this behaviour may be overridden with the ‘-l’ option. If this option is used, only lines longer than the specified number of characters will be reported and the reports of such lines will be prefaced with the word ‘WARNING’. Likewise, the ‘-w’ option may be used to override the CIF version 1.0 restriction of data names and data-block codes to 32 characters. Other options allow the program to write extensive debugging information to a user-specified file, indicating its internal state upon processing each token of input, and to list either a brief summary of how it may be used or its current version number.
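
The use of the quiet option from a script might be sketched as follows. The sketch assumes that an executable named `vcif` is available on the search path; the wrapper function `count_syntax_errors` is hypothetical. Because many operating systems truncate the exit status of a process to a small integer, it is safest to rely only on the distinction between zero and non-zero.

```python
import subprocess
import sys


def count_syntax_errors(cif_path, command=("vcif", "-q")):
    """Run the checker quietly and return its exit status (the error count)."""
    completed = subprocess.run([*command, cif_path],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL)
    return completed.returncode


if __name__ == "__main__":
    # Report on each file named on the command line.
    for path in sys.argv[1:]:
        status = count_syntax_errors(path)
        print(path, "no syntax errors" if status == 0 else "syntax errors found")
```

The `command` parameter is exposed so that another checker, or a stand-in used for testing, can be substituted without changing the calling code.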


5.3.2.1.3. Limitations of vcif

Because the program is testing certain properties of character strings within logical lines of a file, it stores a line at a time for further internal processing. If a line contains a null character (an ASCII character with integer value zero), this will be taken as the termination of the string currently being processed, according to the normal conventions in the C programming language for marking the end of a text string. In this case, subsequent error messages may not reflect the real problem. The null character, of course, is not allowed in a CIF. vcif also interprets syntax rules literally, so a misplaced semicolon might mean that a large section of the file is regarded as a text field and too many or too few error messages are generated. This can make a correct interpretation of the causative errors difficult for a novice user.
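
Both limitations can be anticipated by a simple preliminary scan of the file. The following is a minimal sketch, not part of vcif: the function `precheck` is hypothetical, and the semicolon test is only a heuristic based on the rule that a text field is opened and closed by a semicolon in the first column of a line.

```python
def precheck(raw: bytes):
    """Return a list of (line_number, message) for two easily missed problems."""
    problems = []
    semicolon_lines = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if b"\x00" in line:
            problems.append((number, "null character in line"))
        if line.startswith(b";"):
            semicolon_lines.append(number)
    # Text-field delimiters should occur in pairs; an odd count suggests
    # that a field has not been terminated somewhere in the file.
    if len(semicolon_lines) % 2:
        problems.append((semicolon_lines[-1],
                         "odd number of text-field delimiters; "
                         "a field may be unterminated"))
    return problems


print(precheck(b"data_x\n_publ_section_comment\n;\nSome text\n"))
```

An even count does not prove that the delimiters are correctly placed, so a scan of this kind supplements rather than replaces a full syntax check.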

5.3.3. Editors with graphical user interfaces

A useful class of editing tool is the graphical editor, where different types of access can be provided through icons, windows or frames, menus and other graphical representations. The availability of standard instructions through drop-down menus makes such tools particularly suitable for users who are not expert on the fine details of the file format. The ability within the program to restrict access to particular regions of the file makes it easier to modify the contents of a CIF without breaking the syntax rules. A small but growing number of such editors are becoming available, such as those described here.

Fig. 5.3.3.1. The enCIFer graphical user interface.

5.3.3.1. enCIFer

The program enCIFer (Allen et al., 2004) has been developed as a graphical utility designed to indicate clearly to a novice user where errors are present in a CIF, to permit interactive editing and revalidation of the file, and to allow visualization of three-dimensional structures described in the file. In its early releases, it was targeted at the community of small-molecule crystallographers interested in publishing structures or depositing them directly in a structure database. Version 1.0 depended on a compiled version of the CIF core dictionary, but subsequent versions allow external CIF dictionaries to be imported. At the time of publication (2005), development is concentrating on support for DDL1 dictionaries. Given its target user base, the purpose of the program is to permit the following operations within single- or multi-block CIFs:
(i) Location and reporting of syntax and/or format violations using the current CIF dictionary.
(ii) Correction of these syntax and/or format violations.
(iii) Editing of existing individual data items or looped data items.
(iv) Addition of new individual data items or looped data items.
(v) Addition of some standard additional information via two data-entry utilities prompting the user for required input (‘wizards’): the publication wizard, for entering the basic bibliographic information required by most journals and databases that accept CIFs for publication or deposition; and the chemical and crystal data wizard, for entering chemical and physical property information in a CIF for publication in a journal or deposition in a database.
(vi) Visualization of the structure(s) in the CIF.
In all cases where data are edited or added, enCIFer can be used to check the format integrity of the amended file.

5.3.3.1.1. The main graphical window

Fig. 5.3.3.1 is an example of the use of enCIFer to read and modify a CIF. The figure shows the components of the main window after a file has been opened. Beneath the standard toolbar that provides access to operating-system utilities and to the main functions of the program itself is a task bar (here split over two lines) providing rapid access to a subset of the program’s features. Under this are two large panes. The pane on the right is the editing window, where the content of the CIF is displayed and may be modified. The left-hand pane is a user-selectable view by category of the data names stored in the CIF dictionary against which the file is to be validated. At the bottom are two smaller panes. The one on the right logs the session activities and displays informational messages. The left-hand pane lists errors and warning notices generated by the validation system. Errors are labelled by line number, and selection of a specific message (by a mouse double-click) scrolls the content of the main text-editing window to that line number. Tabs in the middle of the display allow the user to switch rapidly between the editing mode and a visualization of the three-dimensional structures described in the CIF. These components are described more fully below, followed by a description of the other windows that may be created by a user: the help viewer, the loop editor and the data-entry wizards.

5.3.3.1.2. The interface toolbar

This toolbar provides menus labelled ‘File’, ‘Edit’, ‘Search’, ‘Tools’ and ‘Help’ that provide the expected functionality of graphical interfaces: the ability to open, close and save files, store a list of recently accessed files, spawn help and other windows, allow searching for strings within the document, allow the user to modify aspects of the behaviour or the look and feel of the program, and provide entry points for specific modes of operation. The most useful of these utilities can also be accessed from icons on the task bar. They are discussed in more detail in the following section. This main menu is structured in a way familiar to users of popular applications designed for the Microsoft Windows operating system, although the enCIFer program runs on a variety of different operating systems and machine hardware platforms. Nevertheless, the use of a common menu style makes the initial use of the program much easier for novice users and allows the program to be effectively used without detailed study of its documentation.

5.3.3.1.3. The task bar

The task bar allows rapid one-click access to the standard operations of creating a new document, opening, saving or printing the contents of the current file, copying, cutting and pasting text, searching for specific text within the document, and undoing or redoing previous edits. Two buttons allow insertion of complete text files. One allows the user to select any file from local or network-mounted file systems. The other imports a specific file (the location of which may be specified by the user through the ‘Preferences...’ selection of the main ‘Edit’ menu). While this specific file may contain anything, it is intended to be a template CIF that a user will tailor to meet their own requirements. The default provided with the software is a standard template distributed by the IUCr for use in submitting articles to Acta Crystallographica. In either case, the file is imported at the current editing location and is not subject to validation upon input; the user must manually revalidate the file after import.

An icon on the task bar allows the user to run a validation procedure. This icon will be dimmed (indicating that the validation procedure may not be run) unless the user has modified the contents of the CIF. Other icons on the task bar behave in the same manner, allowing the procedures with which they are associated to be executed only under appropriate circumstances. Thus, for example, the looped list editor is not invoked unless the user clicks within the reserved word loop_ in a list header. Similarly, the ‘help’ icon in the task bar is dimmed unless the user has selected a data name in the CIF; when this is done, the icon is activated and clicking on it launches a help window containing the CIF dictionary definition of the data item.

The task bar also contains a drop-down menu listing all the data-block names in the current file. When the user selects one of the data-block names, the edit cursor is positioned at the head of the matching data block in the edit window. This is a rapid and efficient way of navigating within large and complex files. The other buttons provided on the task bar allow the user to: reduce or increase the font size in the editing window; create a new looped list within the loop-editing window; invoke the publication and data-entry wizards; and hide or reveal the dictionary browse window pane. Users may modify the appearance of the task bar to retain or conceal subsets of these icons, depending on which they find most useful.

5.3.3.1.4. The main edit pane

The main edit pane is a text-editing area where the user may directly modify the content of a CIF. Colours and font styles are used to indicate different syntactic elements. The details of the colours and styles may be modified to suit the user. For the novice user, this is perhaps the most immediately helpful feature offered by this program. When a trailing semicolon is inadvertently lost from an extended text field, typical sequential parsers may interpret succeeding tokens as part of the quoted text and produce misleading error reports. Within the enCIFer edit window, all such text is marked up in a specific colour (green by default) so that the fault is much more obvious to the human eye and its source much easier to locate.

Two other typographic cues are used to help the user to trace errors, or to ensure that certain text has been input correctly. Subscripts and superscripts are represented in a smaller typeface (and in a different colour) so that missing delimiter characters are again obvious to the eye. Secondly, some special characters in the conventional CIF encoding (such as Greek letters) are displayed in an appropriate symbol font when the file is first loaded, so that for example the input string \a is rendered as \α. Note that the backslash character is retained, and that the symbol character is not generated as new text is input or edited. This scheme therefore has some potential for confusion, but is nevertheless helpful in checking that less obvious special codes have been entered correctly. The user is free to enter arbitrary text in this pane, possibly breaking CIF syntax rules in the process. Only when the revalidation process is manually invoked will the file be rescanned and any errors reported.

5.3.3.1.5. The dictionary browse pane

The upper left-hand pane in Fig. 5.3.3.1 illustrates the dictionary browser, an optional graphical view of the contents of the CIF dictionary against which the file is being validated. (The presence or absence of this pane is toggled from an icon in the task bar.) Box icons represent the contents of categories, and the tree of category containers may be expanded or collapsed as desired to show individual items within categories. A dictionary view is generated for each separate data block in the CIF. Within the dictionary view of an individual data block, those data items present in the data block are shown in bold; other items defined in the dictionary but absent from the current data block appear in a lighter colour. Within the dictionary browse pane, a user may select (with a click of the appropriate mouse button) a menu of three options which depend on whether the data name is present or absent in the data block. If present, one option positions the cursor in the editing window at the location of the selected data item. If the item is absent from the data block, the user is given the option to paste the data name into the editing window at the current insertion point. The other options (in both cases) are to copy the data name to the clipboard or to open the help window with the CIF dictionary definition of the selected item.

5.3.3.1.6. The error notification pane and logging area

The lower left-hand pane of Fig. 5.3.3.1 illustrates typical error notices generated by the parser when the validation process is invoked. At present, the classification of the severity of errors is guided by the editorial requirements of databases and journals, and does not necessarily match the formal errors dictated by the CIF specification. It is likely that this will change in future releases as validation is driven increasingly by the dictionaries rather than by hard-coded subroutines. A convenient feature is that double-clicking on the line number in the error report relocates the cursor to that line in the editing pane. At present, error messages are listed by line only; they are not grouped by data block. The user has a small number of options to control error notification. The choice of the maximum number of consecutive error lines to permit before error checking is abandoned is a useful way, especially for novices, to reduce the amount of output generated by severe syntax errors and to focus on repairing individual errors. The user may also specify a file that contains a set of CIF data names which are considered mandatory components of a particular file. Absence of any of these items from the current data block is flagged as an error. The program log in the lower right-hand part of the program window records the history of the user’s interactions with the file during the current editing session. Information is written to the status bar (the lower margin of the window) to indicate the location by line and column number of the editing cursor.

5.3.3.1.7. The loop editor

The program has a useful spreadsheet-style editor for looped lists (Fig. 5.3.3.2). A particular benefit of this style of display is that the spreadsheet cells are arranged in a rectangular grid, so that visual scans can often detect deviations from a pattern of values within a column, thus making it easy to identify placement errors where values have been omitted or inadvertently conjoined. Such errors are not always obvious by direct visual inspection of a CIF, where the layout of a looped list need not follow any regular pattern. The buttons to add or delete columns allow for the straightforward addition or deletion of data items from the loop. If the user selects the ‘New Column’ button, a small pop-up window helpfully provides a view of the associated dictionary (in the same hierarchical category-based tree view of the dictionary browser pane) to help the user select the required new data name. The ‘Insert Cell’ and ‘Delete Cells’ buttons are convenient tools for the realignment of rows and columns where values have been omitted or misplaced. The loop editor is invoked from one of two buttons in the task bar, allowing either the creation of a new looped list or the modification of an existing one. As with the application as a whole, there is no dynamic validation of input; the new list must be saved and the entire CIF then manually revalidated.

Fig. 5.3.3.2. The enCIFer loop editor.

5.3.3.1.8. The publication and chemical and crystal data wizards

The user may invoke data-entry ‘wizards’, subordinate programs that prompt for particular data items useful for the publication of a crystal structure report or for the deposition of a crystal structure in a database. This is the kind of information that might be requested in the Notes for authors for a journal, and it is helpful if the information is routinely requested from inexperienced authors during normal use of the software. The data-entry tools are known as ‘wizards’ because they will utilize information already in the file. Hence, as shown in Fig. 5.3.3.3, details of an article’s contact author are retrieved from the CIF and used to seed a list of contributing authors. As the address for each author is entered, the program makes each new address available as a stored record for easier input of additional information. Fig. 5.3.3.4 demonstrates the same approach to encouraging authors to supplement information already in the CIF with related chemical (or crystal) data not usually provided by the CIF generators embedded in crystallographic structure determination programs.

Fig. 5.3.3.4. The enCIFer chemical and crystal data wizard.
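
The benefit of the rectangular grid used by the loop editor can be illustrated with a minimal sketch. This is not enCIFer code: the functions `to_rows`, `looks_numeric` and `odd_cells` are hypothetical, and the rule applied (flagging a cell whose numeric or non-numeric character differs from the majority of its column) is only a crude stand-in for the visual scan made by a user.

```python
def to_rows(n_columns, values):
    """Split a flat list of loop values into rows of n_columns cells."""
    return [values[i:i + n_columns] for i in range(0, len(values), n_columns)]


def looks_numeric(token):
    """True for tokens such as 0.1234 or 0.1234(5)."""
    try:
        float(token.split("(")[0])
        return True
    except ValueError:
        return False


def odd_cells(rows):
    """Yield (row, column) of cells whose kind differs from the column majority."""
    if not rows:
        return
    for col in range(len(rows[0])):
        kinds = [looks_numeric(row[col]) for row in rows if col < len(row)]
        majority = kinds.count(True) * 2 >= len(kinds)
        for r, row in enumerate(rows):
            if col < len(row) and looks_numeric(row[col]) != majority:
                yield (r, col)


# Three columns (label, x, y); the y value of the second atom has been omitted,
# so every later value is displaced by one cell.
values = ["C1", "0.10", "0.20", "C2", "0.30", "C3", "0.40", "0.50"]
rows = to_rows(3, values)
for row in rows:
    print(row)
print(list(odd_cells(rows)))
```

Laying the values out row by row makes the displacement visible at once, which is the effect that the spreadsheet display achieves for the human reader.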

Fig. 5.3.3.3. The enCIFer publication data wizard. Information about the title and authors of an article to be submitted for publication is requested through a sequence of linked dialogue boxes.

5.3.3.1.9. The visualization window

A final useful feature of enCIFer is its ability to visualize the three-dimensional structure of molecules described in the data blocks of a CIF. Fig. 5.3.3.5 demonstrates crystal packing with a space-filling molecular representation, and the drop-down menu indicates some of the options available to modify the appearance of the graphics. The molecular-graphics library used by the program is part of the larger database interface software package developed at the Cambridge Crystallographic Data Centre. In the present version, the visualizer is run only upon initial parsing of the input CIF, and therefore does not provide an ability to track visually the molecular changes associated with direct modification of the contents of the file.

Fig. 5.3.3.5. Visualization of a molecular and crystal three-dimensional structure with enCIFer.

5.3.3.2. CIFEDIT

The CIFEDIT program (Toby, 2003) is written in Tcl/Tk (Ousterhout, 1994) and provides an application for viewing and editing CIFs. The code is written in such a way that it can be embedded into larger programs to provide a CIF-editing interface within larger application suites. The current version of the program is able to validate CIFs against both DDL1 and DDL2 dictionaries, although the DDL2 validation is currently less complete than for DDL1. For example, numeric values are checked against permitted enumeration ranges only for DDL1. Dictionaries are accessed through index files, each of which contains Tcl data structures that point to the location of the definitions in the dictionary file itself and store information such as units and enumeration ranges that can be used for data validation. A utility provided with the program allows a user to generate new index files when new versions of the dictionaries become available. It is intended that dictionary indexing will be incorporated within the main application in the next program release, so that interactive dictionary selection will be possible.

When a CIF is opened, the contents are parsed and validated against one or more user-selected dictionaries. Errors are displayed in a pop-up window and may be written to a file or viewed within the application. The main program window displays the contents of the CIF in two primary panes (Fig. 5.3.3.6). In the left-hand pane, a tree structure shows the data blocks in the file and the data names present in each block. The data blocks may be expanded or collapsed by the user, to present an overview or a detailed view of the data structure of the file. Underneath the icon representing the data block, non-looped data items are listed alphabetically. The figure demonstrates how a single value may be selected in the left-hand pane (_cell_length_a) and displayed in the main window. Physical units for the selected quantity are extracted from the corresponding dictionary definition and presented alongside the numeric value. The dictionary definition may also be displayed in a separate pop-up window using the ‘Show CIF Definitions’ button.
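
The idea of a dictionary index, which records where each definition lies in the dictionary file together with the units and enumeration range needed for validation, might be sketched as follows. CIFEDIT itself stores its indexes as Tcl data structures; this Python analogue, including the function name `build_index` and the layout of the index entries, is purely illustrative. It assumes DDL1-style definitions that each begin with a data_ block header and give _name, _units and _enumeration_range as simple values, and it does not treat looped names properly.

```python
import re


def build_index(dictionary_text):
    """Map each defined data name to its offset and a few validation hints."""
    index = {}
    # Each definition is assumed to start with a data_ block header.
    headers = list(re.finditer(r"^data_\S+", dictionary_text, flags=re.M))
    for i, header in enumerate(headers):
        end = (headers[i + 1].start() if i + 1 < len(headers)
               else len(dictionary_text))
        block = dictionary_text[header.start():end]
        name = re.search(r"^\s*_name\s+'?(_[^'\s]+)'?", block, flags=re.M)
        if name is None:
            continue  # no _name found in this block: skip it
        units = re.search(r"^\s*_units\s+(\S+)", block, flags=re.M)
        rng = re.search(r"^\s*_enumeration_range\s+(\S+)", block, flags=re.M)
        index[name.group(1)] = {
            "offset": header.start(),  # where the definition begins
            "units": units.group(1) if units else None,
            "range": rng.group(1) if rng else None,
        }
    return index


sample = """data_cell_length_a
    _name      '_cell_length_a'
    _units     A
    _enumeration_range  0.0:
data_cell_volume
    _name      '_cell_volume'
    _units     A^3^
"""
for data_name, entry in build_index(sample).items():
    print(data_name, entry)
```

With such an index, an editor can display the units beside a value or test a number against its permitted range without re-reading the whole dictionary, and can seek directly to the stored offset when the full definition is requested.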

Fig. 5.3.3.7. Row-based loop editing with CIFEDIT; here loop 2 (comprising the ATOM_SITE category) has been selected by the user; the editing cursor begins at row 1 of the loop.

The program may be run in two modes: a ‘browse’ mode, where the selected value is displayed in the main pane, but may not be altered; and an ‘edit’ mode (as in the example) where the value appears in an editable text widget. Data loops in the CIF are displayed after the alphabetical list of non-looped items. The loops are numbered sequentially from zero and an indication of the loop category is given in parentheses in the tree-view window. The loop ‘branches’ of the tree may be expanded or collapsed as the user wishes. Loops may be viewed and edited in two ways: by row or by column. If the user selects the loop title node in the hierarchical view pane, the loop is presented by row, starting in sequence at row 1 (Fig. 5.3.3.7). Other rows may be selected by using the address box in the lower-right-hand part of the window. Alternatively, if the user selects an individual data name within the loop representation in the hierarchical view, all instances of that data item within the loop are displayed in the main pane. (In practice the number of values shown is constrained to a maximum number that the user may choose, so that the application does not run out of memory if there are very large loops.) For items with a restricted set of permitted values in the dictionary, the editing function allows the user to select only one of the permitted options via a drop-down menu.

While the application is intended to be used in this structured and itemized mode, there is an option to open the entire CIF in a text-editing window if there are errors that cannot be handled in the normal mode. This is not recommended, but is occasionally convenient. While this free-text editing mode is in operation, the ability to modify the file through the structured editing pane is suspended to avoid conflicting changes. After any change has been made, the user may revalidate the file. This is strongly recommended after making changes in the free-text editing mode.

5.3.3.3. HICCuP

The program HICCuP (Edgington, 1997) was an early graphical utility developed at the Cambridge Crystallographic Data Centre for interactive editing and validation of a CIF. It is no longer supported, having been replaced by enCIFer (Section 5.3.3.1). Nevertheless, it contained some interesting features and is of potential interest to developers using multiple-platform scripting languages. It was implemented in the Python language (van Rossum, 1991) and required that Tcl/Tk (Ousterhout, 1994) also be available on the host computer. The name of the program is an acronym for ‘High-Integrity CIF Checking using Python’.

Fig. 5.3.3.6. The use of CIFEDIT to display and alter the contents of a CIF; here a non-looped data item is shown.

504

HICCuP was designed to allow users of the Cambridge Structural Database (Allen, 2002) to check structures intended for deposition in the database and therefore included a range of additional content checks specific to this purpose. These could, however, be disabled by the user.

5.3.3.3.1. Interactive use of the program

5.3.3.3.1.1. The control window

Because HICCuP was designed as an interactive tool, upon invocation it presented to the user a control window from which CIFs could be selected for analysis and in which summary results of the program’s operations were logged. Fig. 5.3.3.8 shows an example of the control window after a single CIF has been loaded. In the large frame below the file-entry field are listed the data blocks found by the program. The names are highlighted in various colours according to the highest level of severity of errors found within the corresponding data block. Because the utility was designed for processing large amounts of CIF data for structural databases, it was considered useful to supply a compact visual indicator of the progress of the program through a large file. This takes the form of a grid of rectangular cells, one column for each data block present. Each column contains three cells, which monitor the performance of checks on the file syntax, conformance against a CIF dictionary, and other checks specific to the requirements of the Cambridge Crystallographic Data Centre. As each data block was checked, the corresponding cells were coloured according to the types of error found. Different colours were used to indicate: no errors; structure errors in the initial syntax tests; dictionary errors; or a deviation from certain conventions used by journals and databases in naming data blocks. The large frame at the bottom of the control window provides a text summary of the same information, listing the number of errors found. Check boxes and an ‘Options...’ button allowed some configurability of checks by the user.

Fig. 5.3.3.9. HICCuP edit window and error description.

Fig. 5.3.3.8. Control window of the HICCuP application.

5.3.3.3.1.2. The report frame and edit window

The user could get more details of the reported errors by clicking on the name of the data block of interest in the control window. The text of the CIF would appear in a new window positioned at the point where the program had detected the first error, and a terse statement of the type of error, with a longer explanation of its nature and possible cause, would be given. In the example of Fig. 5.3.3.9, the program has detected that there is a missing text delimiter (a semicolon character), and positions the text in the upper frame at the likely location of the error. The program has attempted to localize the region where the error may have occurred. Because a text field might contain arbitrary contents, including extracts of CIF content, it is impossible to be sure on purely syntactic grounds of the nature of the error. Nonetheless, some heuristic rules serve to identify the author’s likely intent in the majority of cases. So, in this example, the user may scan the file contents in the vicinity of the line highlighted by the program and find the error within a few lines (in this example an incorrectly terminated _publ_author_footnote entry beginning ‘Current address:’). For this example, the more literal vcif error analysis provides only the message

ERROR: Text field at end of file does not terminate

The upper frame in this window is an editable window, so that the user could modify the text and revalidate the current data block. Only when a satisfactorily ‘clean’ data block was obtained were the changes saved, and the modified data block written back into the original file.

5.3.3.3.1.3. Dictionary browsing

An additional useful feature of the program was its interactive link to a CIF dictionary file (Fig. 5.3.3.10). The browser window contains the definition section of the dictionary referring to the selected data name and hyperlinks to definitions of other data names referred to. Additionally, there is a small text-entry box allowing a specific definition to be retrieved and an ‘Index’ button to list all available definitions.

Fig. 5.3.3.10. HICCuP dictionary browser window.

5.3.3.3.2. Options

As already mentioned, the user could modify the detailed mode of operation of the program. Any or all of the ‘initial’, ‘dictionary’ or ‘other’ checks could be disabled. The ‘dictionary’ checks could be modified by the user through the ‘Options’ button of the main control window. The CIF dictionary for validation could be specified; the dictionary itself had to be translated from a source file in DDL format to a Python data structure. The types of dictionary-based validation supported by the program were:
(i) List Status (checking whether a data value should be included in a looped list),
(ii) Limited Enumeration Options (checking that a data value is one of the permitted codes where such a constraint exists),
(iii) Incorrect Enumeration Case [a special case of (ii), where a data value matches a permitted code except for incorrect alphanumeric case],
(iv) Enumeration Range (the data value falls outside the range permitted),
(v) Value Type (numb or char) (the data value has the wrong type),
(vi) List Link Parent (a data item is present within the data block, but its mandated parent item is not; for example, the data item _atom_site_aniso_label should not be present without its parent data item _atom_site_label),
(vii) List Reference (the required data name used to reference the loop in which the current data name appears is missing),
(viii) Esd Allowable (a data value appears to have a standard uncertainty value where one is not expected).
The user could also supply the program with a list of data names that do not appear in the validation dictionary but for which no warning message should be raised.
The program normally flagged such nonstandard data names as possible errors and suggested the possible form of a standard data name that might have been intended. This was useful in catching misspellings of additional data items entered by hand. The program could also be run in a batch mode when the objective was to work through a large volume of CIF data and identify the data blocks that require attention. This mode of operation is particularly useful in databases or publishing houses. In this mode, input is from a named file or from the standard input channel; output is written to standard output or redirected to a results file. The operation of the program may be controlled by the application of various command-line flags.
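
Some of the dictionary-based checks described in this section, together with the suggestion of a standard data name for an unrecognized one, can be sketched in a few lines of Python. This is not the HICCuP source code: the function names, the returned messages and the similarity cutoff are assumptions made for illustration, and the permitted codes in the example are hypothetical.

```python
import difflib


def check_enumeration(value, permitted):
    """Classify a value against a list of permitted codes."""
    if value in permitted:
        return "ok"
    if value.lower() in (code.lower() for code in permitted):
        return "incorrect enumeration case"
    return "not a permitted code"


def check_range(value, low=None, high=None):
    """True if a numeric value lies within the optional bounds."""
    number = float(value.split("(")[0])  # drop a trailing s.u. such as (5)
    return (low is None or number >= low) and (high is None or number <= high)


def suggest_names(name, known_names, how_many=3):
    """Suggest standard data names close to an unrecognized one."""
    return difflib.get_close_matches(name, known_names, n=how_many, cutoff=0.8)


# Hypothetical permitted codes and a small set of known data names.
print(check_enumeration("Calc", ["calc", "d"]))
print(check_range("0.1234(5)", low=0.0, high=1.0))
known = ["_cell_length_a", "_cell_length_b", "_cell_volume"]
print(suggest_names("_cell_lenght_a", known))
```

Here the standard-library function `difflib.get_close_matches` stands in for whatever matching rule the original program applied; any measure of string similarity would serve the same purpose of catching misspellings in data names entered by hand.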

Fig. 5.3.3.11. A category view in the beCIF editor of a CIF with navigation by tabs.

5.3.3.4. Platform-specific editors

As well as the tools described earlier in this section, which are designed to run under a variety of common operating systems, there are some applications restricted to users of particular types of computer. Here we mention two that run in the popular Microsoft Windows environment on personal computers.

structures. It provides a rather different view of the contents of a CIF from the applications discussed above through an interface that will be familiar to users of Microsoft Windows applications. When the application is opened, the user is prompted to provide the location of a CIF dictionary (at any one time, only a single dictionary file may be loaded). This dictionary is loaded into memory and used to validate CIFs upon input. As a data file is read, discrepancies from the types and value ranges permitted by the dictionary are listed in an information window. The file contents are presented in a number of panels, one per dictionary category, between which the user may navigate by selecting the tab with the desired category name (Fig. 5.3.3.11). At the highest level, tabs allow the user to choose the data block of interest. Buttons are provided to delete a data block entirely, to rename it or to create a new data block. Within each data block, the user may add new categories. Again, to help the novice user, when the button ‘New Category’ is selected, a list of only those categories described in the current dictionary but absent from the current data block is presented to the user. Each category present in the data file is accessed through its own tabbed display panel. Where the category contains non-looped data items, values may be edited within individual text widgets; data items may be removed by selecting the adjacent check box; or new data items may be added by selecting the ‘New Data Item’ button to create a dialogue box offering a choice of the remaining data items in the dictionary category. Against each data item a button provides access to a pop-up window containing the relevant dictionary definition. For a category with looped data, the contents are displayed in a spreadsheet-style representation, with columns headed by the matching data name and rows numbered for convenience (Fig. 5.3.3.12). 
The changes requested to the CIF are only effected when the user selects the ‘Save CIF’ button. Unlike many other of the CIF editors previously discussed, this program does not make any effort to retain the initial ordering of the input data, nor does it preserve comments. The edited CIF may therefore be superficially very different from the input file; however, the only significant differences in content will be those introduced through use of the editing functions within the application.

5.3.3.4.1. beCIF The Windows program beCIF (Brown et al., 2004) is still in prototype. It is a DDL1-dictionary-driven CIF manipulation tool that does not require detailed knowledge of CIF or dictionary

Fig. 5.3.3.12. Representation by the beCIF editor of looped data within a category (here ATOM_SITE) in spreadsheet style.

5.3.3.4.2. printCIF for Word

The tools described so far emphasize the data content of a CIF. printCIF for Word (Westrip, 2004), on the other hand, was commissioned to help prospective authors of structure reports in the IUCr journals to visualize and prepare for publication complete papers submitted in CIF format. Chapter 5.7 describes the workflow and processing of such submissions. A brief description of the use of the printCIF software from an author’s viewpoint is given here.

This application also differs from others discussed in this chapter in that it is rather specific to a particular program environment, being written as Visual Basic macros embedded in a Microsoft Word template document. Efforts are under way to provide versions that can run with other word processors. Nevertheless, Word is currently sufficiently widespread that the utility is likely to be of use to a large community.

Typically the author begins by double-clicking on the icon associated with the printcif.dot template file. The initial macros are loaded and the author is prompted to provide the location of a CIF. As the CIF is imported into the application, the data items that will be used in the publication are extracted and converted into a rich-text format (RTF) representation. For extended text fields, this RTF content may be edited directly in the word-processing environment; this makes it easy for authors to compose and edit continuous text in a familiar way. Numeric and brief textual data items from the CIF are processed and presented in read-only fields in the manner in which they will appear in the journal, often as entries in a table or as a list of brief experimental details. These fields may not be edited within the RTF representation; if it is necessary to change them, the author must modify the data value in the CIF itself. To assist the author, the contents of the CIF are opened in a text-editor window alongside the formatted representation. The CIF and RTF representations are linked; if the author selects text in the RTF window, the corresponding CIF data item is highlighted within the text-editor window (Fig. 5.3.3.13).
The advantages to the author of editing in RTF format are that existing text may be cut and pasted from other applications, and formatting features, such as subscript or superscript text, Greek letters and other special symbols, may be entered through the word processor’s menu-driven interface, rather than by use of the rather unmemorable ASCII codings used in CIF. The major disadvantage is that two versions of the file, both editable, are accessible at the same time; care must therefore be taken to ensure that conflicting changes are not made, and that the author is aware of which version is currently the master. The function ‘Update CIF using RTF’ (in the toolbar of the CIF editing window) will reimport into the CIF all the editable content from the RTF window, replacing any existing data items.

Fig. 5.3.3.13. The dual RTF/CIF editing windows in the printCIF for Word application. In this example, the author has selected the word ‘Monoclinic’ in the read-only table of crystal data; the corresponding CIF data item _symmetry_cell_setting is highlighted in the CIF window, where it may be edited.

The complementary function, ‘Build preprint’, creates a fresh copy of the preprint representation of the document in RTF format. A number of options are available to modify the preprint that is generated (for example, by printing a complete list of the geometry included in the CIF rather than just the items flagged for publication; or listing the atomic coordinate data). The general style is that of Acta Crystallographica Section C and Section E; nevertheless, the application may be useful to users who do not intend to submit to these journals but who wish to produce an attractive representation of the content of their CIFs. Utilities are provided to create tables in the RTF environment suitable for embedding in the CIF, to browse the contents of the CIF core dictionary and to validate the syntax of the CIF. The application is not dictionary-driven, however, and does not carry out detailed consistency checks. It is therefore best considered as an aid to publication, to be used alongside data-centric editors and validation tools such as enCIFer. A particularly useful self-documenting feature of printCIF for Word is that the User Guide is automatically opened when the application is started, before a CIF is loaded.

5.3.4. Data-name validation

In a CIF, a data name (a character token beginning with an underscore character, _ ) is an essential handle on an item of data within a data block. Equipped only with knowledge of the data names appearing in a CIF, a user may extract, reorder or query the information content of the file. Such manipulations require no prior knowledge of the semantic content of the data. However, for most practical applications it is important to know the meaning attached to data names, and CIF dictionaries provide the mechanism for associating a data name with its intended meaning for an application. It is therefore valuable to be able to check whether data names in a CIF match those defined in a dictionary file. It is also valuable to check the consistency of the data names listed in the dictionary file itself; since this will be used by external applications to validate data names, it is essential that it be internally consistent. Hence there is a real need for a utility to validate data names – effectively a CIF spelling checker.
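The idea of a ‘CIF spelling checker’ can be illustrated with a minimal sketch. It assumes the simple CIF rule that a data name is a white-space-delimited token beginning with an underscore, treats names as case-insensitive, and takes the dictionary as a plain list of names; the function name check_data_names and the sample data are hypothetical and are not part of any program described in this chapter. A real tool would also have to skip tokens inside quoted strings and text fields, which this sketch does not attempt.

```python
def check_data_names(text, known_names):
    """Return a sorted list of (name, [line numbers]) for every
    underscore-initial token in text that is not in known_names."""
    known = {name.lower() for name in known_names}
    unmatched = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Simplification: split on white space only; quoted strings and
        # semicolon-delimited text fields are not treated specially.
        for token in line.split():
            if token.startswith("_"):
                name = token.lower()
                if name not in known:
                    unmatched.setdefault(name, []).append(lineno)
    return sorted(unmatched.items())


# Hypothetical input with one misspelled data name.
sample = """data_test
_cell_length_a 10.0
_cell_lenght_b 11.0
_cell_length_c 12.0
"""
known = ["_cell_length_a", "_cell_length_b", "_cell_length_c"]

report = check_data_names(sample, known)
for name, line_numbers in report:
    print(name, line_numbers)
```

The report is sorted alphabetically and carries line numbers, so that a misspelled name can be located quickly in the input file.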

5.3.4.1. CYCLOPS

The program CYCLOPS (Hall, 1993; Bernstein & Hall, 1998) was written specifically to address the problem of validating CIF data names. Its use extends beyond simply identifying data names in a CIF data file and checking that they are defined in a dictionary. Any ASCII file may be input, allowing for the checking of CIF data names in any text documents or program source. The program was originally written in Fortran as an aid to ensuring that the original core CIF dictionary was free from data-name errors; subsequently it was extended to be able to read multiple dictionaries in DDL1 and DDL2 formats, and to resolve data-name aliases across multiple dictionaries. The extended version was written with the library routines of the CIFtbx toolkit (Hall & Bernstein, 1996) described in Chapter 5.4 and is distributed as an example application with CIFtbx. The description below refers to this extended version, also known as CYCLOPS2.

5.3.4.1.1. Operation

The program determines the dictionary (or list of dictionaries) against which to validate the input text file (see below for the method of passing such information to the program). It opens each dictionary in turn and stores all data names defined in the dictionaries. Where the same name is defined in multiple dictionaries, the behaviour is determined by a command-line switch. The text file is then input and parsed for candidate data names. Because the program is designed to check potential data names embedded in ordinary text files, it is not sufficient to apply the CIF parsing rule of a white-space-delimited character string beginning with an underscore character. Instead, character strings are sought that begin with an underscore optionally preceded by white space or one of the characters ,.([{$file };

When the text file has been processed, a validation report file is output containing the alphabetically sorted list of unmatched names and line numbers, followed by the sorted list of names from all dictionaries that are used within the text. If requested, this is followed by the sorted list of names from all dictionaries that are not used within the text in the file. If a data name has an alias defined in the dictionaries, a warning about the existence of the alias is given. If more than one dictionary has been used, the source dictionary is identified for each data name. An example of the output from CYCLOPS is shown in Fig. 5.3.4.1.

5.3.4.1.2. Invocation of the program

CYCLOPS is generally invoked from a command line that specifies the input and output file names and the dictionary files against which to validate the input. However, because the program is portable across a wide range of operating systems, there is substantial flexibility in the way in which it may be invoked. Under a Unix-like operating system, the program may typically be called with a command such as

    cyclops -i infile -o outfile -d dictfile

where infile is the name of the input file for validation, outfile is the file to which the detailed output of the program is written and dictfile is a dictionary file. A more complete set of options available in a Unix-like operating environment is

Table 5.3.6.1 summarizes the object methods provided by the package. The store method allows a DataBlock object to be serialized and written to hard disk for long-term storage. The public Perl module Storable is used.

The get_item_data method returns the data values for a named data item. It is used frequently in the example program of Fig. 5.3.6.1; for example, at lines 28–30 the array of categories in the input dictionary is assembled by retrieving the value associated with the data item _category.id as the component save frames are scanned in sequence. Note that for applications within dictionaries, the method takes a -save=>$saveblock parameter to allow the extraction of items from specific save frames. For manipulations of data files, this parameter is omitted. In lines 17–20, the method is called without this parameter because the name and version of the dictionary are expected to be found in the outer part of the file, not within any save frame.

get_keys allows a user to display the structure of a CIF or dictionary file and can be used to analyse the content of an unknown input file. When written to a terminal, the string that is returned by this method appears as a tabulation of the items present at the different levels in the data structure hierarchy, each level in the hierarchy being indicated by the amount of indentation (Fig. 5.3.6.2).

The get_items and get_categories methods are largely self-explanatory. The items or categories in the currently active DataBlock object are returned in array context. insert_item and insert_category are the complements of these methods, designed to allow the insertion of new items or categories. Where appropriate (i.e. in dictionary applications), the save frame into which the insertion is to be made can be specified. The remaining method, set_item_data, is called to set the data of item $item to an array of data referenced by $dataref:

    $data_obj->set_item_data( -item=>$item, -dataref=>$dataref );

As usual, an optional parameter -save=>$save may be included for dictionary applications where a save frame needs to be identified; the value of the variable $save is the save-frame name. Note that the current version of the module does not support the creation and manipulation of data loops, although the get_item_data method will correctly retrieve arrays of data values from a looped list.

There are five methods available to set or retrieve attributes of a DataBlock object, namely: file_name for the name of the file in which the DataBlock object was found; title for the title of the DataBlock object (i.e. the name of the CIF data block with the leading data_ string omitted); type for the type of data contained – ‘data’ for a DataBlock object but ‘dictionary’ for an object in the STAR::Dictionary subclass; and starting_line and ending_line for the start and end line numbers in the file where the data block is located. The method get_attributes returns a string containing a descriptive list of attributes of the DataBlock object.

Table 5.3.6.1. Object methods provided by the STAR::DataBlock Perl module

    Method            Description
    store             Saves a DataBlock object to disk
    get_item_data     Returns all the data for a specified item
    get_keys          Returns a string with a hierarchically formatted list of hash keys
                      (data blocks, save blocks, categories and items) found in the data
                      structure of the DataBlock object
    get_items         Returns an array with all the items present in the DataBlock
    get_categories    Returns an array with all the categories present in the DataBlock
    insert_category   Inserts a category into a data structure
    insert_item       Inserts an item into a data structure
    set_item_data     Sets the data content of an item according to a supplied array

     1  #!/usr/local/bin/perl
     2
     3  use STAR::Parser;
     4  use STAR::DataBlock;
     5  use STAR::Dictionary;
     6
     7  my $file = $ARGV[0];
     8
     9  @dicts = STAR::Parser->parse(-file=>$file, -dict=>1);
    10
    11  exit unless ( $#dicts >= 0 );   # exit if nothing there
    12
    13  print_document_header;
    14
    15  foreach $dictionary (@dicts) {   # usually one
    16    # Get the dictionary name and version
    17    $dictname = ($dictionary->get_item_data(
    18                   -item=>"_dictionary.title"))[0];
    19    $dictversion = ($dictionary->get_item_data(
    20                   -item=>"_dictionary.version"))[0];
    21
    22    print "Dictionary $dictname, version $dictversion\n";
    23
    24    # Get list of save-frame names, sorted alphabetically
    25    @saveblocks = sort $dictionary->get_save_blocks;
    26
    27    foreach $saveblock (@saveblocks) {   # For each save frame
    28      @categories = $dictionary->get_item_data(
    29                     -save=>$saveblock, -item=>"_category.id");
    30      if ( $#categories == 0 ) {         # category save frame
    31        format_category($saveblock);     # format the category
    32
    33        $category = $categories[0] ;
    34        foreach $block (@saveblocks) {   # re-traverse blocks
    35          @itemcategoryids = $dictionary->get_item_data(
    36                             -save=>$block,
    37                             -item=>"_item.category_id");
    38          @itemname = $dictionary->get_item_data(
    39                             -save=>$block,
    40                             -item=>"_item.name");
    41
    42          if ( $#itemcategoryids == 0) {   # category pointer
    43            format_item($block)
    44              if $itemcategoryids[0] =~ /^$category$/i;
    45          } elsif ( $#itemname == 0 &&
    46                    $itemname[0] =~ /^_$category\./i ) {
    47            format_item($block);
    48          }
    49
    50        } # end foreach of items matching current category
    51      }
    52    } # end foreach of save frames in the dictionary
    53  } # end foreach loop per included dictionary
    54
    55  print_document_footer;
    56  exit;
    57

Fig. 5.3.6.1. Skeleton version of an application to format a CIF dictionary for publication. Only the main program fragment is shown. Line numbering is provided for referencing in the text. format_category, format_item and the print_document_* commands are calls to external subroutines not included in this extract.

5.3.6.1.4. STAR::Checker

This module implements a set of checks on a data block against a dictionary object and returns a value of ‘1’ if the check was successful, ‘0’ otherwise. The check tests a specific set of criteria: (i) Are all items in the DataBlock object defined in the dictionary? (ii) Are mandatory items present in the data block? (iii) Are dependent items present in the data block? (iv) Are parent items present? (v) Do the item values conform to item type definitions in the dictionary? Obviously, these criteria will not be appropriate for all purposes, and are in any case fully developed only for DDL2 dictionaries. An optional parameter -options=>'1' may be set to write a list of specific problems to the standard error output channel.


    data  save
    block block categ. item
    ----------------------------
    cif_img.dic
                _category_group_list
                      _category_group_list.description
                      _category_group_list.id
                      _category_group_list.parent_id
                _dictionary
                      _dictionary.datablock_id
                      _dictionary.title
                      _dictionary.version
                _dictionary_history
                      _dictionary_history.revision
                      _dictionary_history.update
                      _dictionary_history.version
                _item_type_list
                      _item_type_list.code
                      _item_type_list.construct
                      _item_type_list.detail
                      _item_type_list.primitive_code
                _item_units_conversion
                      _item_units_conversion.factor
                      _item_units_conversion.from_code
                      _item_units_conversion.operator
                      _item_units_conversion.to_code
                _item_units_list
                      _item_units_list.code
                      _item_units_list.detail
          ARRAY_DATA
                _category
                      _category.description
                      _category.id
                      _category.mandatory_code
                _category_examples
                      _category_examples.case
                      _category_examples.detail
                _category_group
                      _category_group.id
                _category_key
                      _category_key.name

Fig. 5.3.6.2. Structure of the imgCIF dictionary (Chapter 4.6) as described by the get_keys method of the STAR::DataBlock module. Only the high-order file structure and the contents of the first category are included in this extract.

5.3.6.1.5. STAR::Writer and STAR::Filter

Two other modules are supplied by this package. STAR::Writer is a prototype module that can write STAR::DataBlock objects out as files in different formats; currently only the write_cif method exists to output a conformant CIF. STAR::Filter is an interactive module that prompts the user to select or reject individual categories from a STAR::Dictionary object when building a subset of the larger dictionary.

5.3.6.2. PyCifRW: CIF reading and writing in Python

PyCifRW (Hester, 2002) is a simple CIF input/output utility written in Python. It does not validate content against dictionaries, but it does provide a robust parser that has been extensively tested against various test files containing subtle syntactic features and against the collection of over 18 000 macromolecular CIFs available from the Protein Data Bank. The parser was implemented using the Yapps2 parser generator (Patel, 2002) and is based on the draft Backus–Naur form (BNF) developed during a community exercise to review the CIF specification of Chapter 2.2.

As with the Perl library discussed above, PyCifRW presents an object-oriented set of classes and methods. Two classes are provided, CifFile and CifBlock. A CifFile object provides an associative array of CifBlock objects, accessed by data-block name. The methods available for the CifFile type are: ReadCif(filename), which initializes or reinitializes a file to contain the CIF contents; GetBlocks(), which returns a list of the data-block names in the file; NewBlock(blockname, [block contents]), which adds a new data block to the file object; and WriteOut(comment), which returns the contents of the current file as a CIF-conformant string, with an optional comment at the beginning. For the NewBlock method, the optional block contents must be a valid CifBlock object, as described below. The NewBlock method returns the name of the new block created, which will not be the requested blockname if a data block of the same name already exists (this conforms to the STAR and CIF requirement that data-block names must be unique within a file).

A CifBlock object represents the contents of a data block. The methods available to retrieve or manipulate the contents are: GetCifItem(itemname), which will return the value of the data item with data name given by itemname (and which can be a single value or an array of looped values); AddCifItem(data), which adds data to the current block, where data represents either a data name and an associated single value, or, for the case of looped data, a tuple containing an array of data names and an array of arrays of associated data values; RemoveCifItem(dataname), to remove the specified data item from the current block; and GetLoop(dataname), which returns a list of all data items occurring in the loop containing the data name provided. If dataname does not represent a looped data item, an error is returned.

The GetLoop method is important for the proper handling of looped data, and care is taken to handle loops robustly and efficiently. Items that are initially looped together are kept in the same loop structure. Both CifFile and CifBlock objects act as Python mapping objects, which has the advantage that the value of a data item can be read or changed using an intuitive square-bracket notation. For example, a program can retrieve the value of a data item named _my_data_item_name in block ‘myblockname’ of a previously opened file cf using the following syntax:

    value = cf["myblockname"]["_my_data_item_name"]

The returned value is either a single item, or a Python array of values if the data item occurs in a loop. Values are set in an analogous way and other common operations with mapping objects are also implemented.

5.3.7. Rapid development tools

The programs described so far in this chapter tend to fulfil a single purpose. Each program addresses a single clearly defined task and is of benefit to a user who has no need or desire to write a customized program. Where there is a need to write a new application, libraries of subroutines are available to the full-time programmer, such as those described in Chapters 5.4 to 5.6. However, in between these extremes, there are a large number of cases for which there is a need to combine the functionality of a number of existing programs without incurring the overhead of writing a new integrated application. This section describes utilities that assist the development of new applications from existing software.


5.3.7.1. ZINC: an interface to CIF for standard Unix tools

Unix and derivative operating systems provide a convenient environment for rapid prototyping of programs acting on textual data. The convenience arises from the facility to chain applications together in a ‘pipeline’, where the output from an application may be passed directly to the input channel of another application without needing intermediate disk files for storage; and from the very rich set of utilities supplied with the typical Unix command shell, which permit files to be concatenated, split, compared, searched, stream-edited and otherwise transformed. Many users are familiar with these utilities and can rapidly develop prototype or short-lived applications of great power by chaining them together as required.

There is a temptation to use such techniques to manipulate CIFs, which as ASCII files are well suited to this. However, there are some features of the CIF syntax that are at variance with the conventional Unix idiom of storing data tags and their associated values within a basic unit of a line (i.e. a sequence of characters terminated by an end-of-line character code). Although CIFs are built from ASCII-character-populated lines, data values may be placed on a different line from that containing the parent data name (this is almost always the case for looped lists).

5.3.7.1.1. Description of the ZINC format

The ZINC format (Stampf, 1994) was developed to transform CIF data into an isomorphous format suitable for manipulation by standard Unix utilities. The manner in which Unix textual utilities work suggests that an appropriate working format is one in which every individual CIF data value is available within a single-line textual record. The record must also contain information about the context of the value, conveyed through: the name of the data block in which the data value occurs; its associated data name (from which the meaning of the data value is inferred); and for recurrent data (i.e. values in a looped list) an indication of the list in which the data value occurs and a counter of its current occurrence in that list. In practice, each record is structured as a data line containing five TAB-separated fields in the order

    blockcode    name    index    value    list-id

where blockcode is the name of the CIF data block (the leading data_ string is omitted); name is the data name; index is a zero-based index of the number of occurrences of the data name within a loop (with a null value if the data occur outside a loop); value is the data value itself; text strings extending over several lines are collapsed into a single line with the replacement of the end-of-line character by the sequence \n; and list-id is a list identifier, stored as the data name within the list that sorts earliest (because it is a purely syntactic transformation, the utility does not consult a dictionary file for the correct _list_reference identifying token).

Comments in the CIF are also stored, to permit regeneration of the original file by an inverse transformation; and because it is often convenient to read such interpolations, especially in interactive activities with CIFs of which the user has no prior knowledge. Comments are stored with a value of ‘(’ in the data-name field. They are also numbered (starting from zero) in the index field of the ZINC record.

Most details of the ZINC transformation are illustrated by the simple example in CIF format shown in Fig. 5.3.7.1(a). This file transformed to ZINC format would appear on a display terminal as illustrated in Fig. 5.3.7.1(b). However, the spacing is deceptive, since typical display terminals convert TAB characters to a variable number of spaces. A more accurate (albeit less legible) representation of the output is given in Fig. 5.3.7.1(c). Note the following points: the data-block name is initially null; each comment line is numbered in sequence from zero; the index field is null for data names that are not within looped lists.

(a)

    # A simple CIF
    data_object
    # description of a simple
    # polygon
    _name
    ;
    triangle
    ;
    loop_ _x _y
    0.0 0.0
    1.0 0.0
    0.0 1.0
    _num_sides 3

(b)

            (       0       # A simple CIF
    object  (       1       # description of a simple
    object  (       2       # polygon
    object  _name                   ;\ntriangle\n;
    object  _x      0       0.0     _x
    object  _y      0       0.0     _x
    object  _x      1       1.0     _x
    object  _y      1       0.0     _x
    object  _x      2       0.0     _x
    object  _y      2       1.0     _x
    object  _num_sides              3

(c) [The same records as in (b), with each TAB character shown as a visible graphical symbol.]

Fig. 5.3.7.1. ZINC transformation of CIF. (a) is a sample file in CIF format. (b) shows the output to a display terminal when this file is transformed by cifZinc. (c) The same output as (b), but with TAB characters represented by special graphical characters.

5.3.7.1.2. ZINC-based utilities

The purpose of ZINC is to form an intermediate stage in pipeline processes involving Unix tools, and therefore CIF data in ZINC format have only a transitory existence. The ZINC distribution package provides the complementary tools cifZinc and zincCif required to interconvert formats; and in addition a few sample applications are provided as examples for Unix programmers.

5.3.7.1.2.1. cifZinc

cifZinc takes a CIF name as a command-line argument or the CIF itself from standard input and produces a ZINC-format file on standard output. It has one option, ‘-c’, which removes comments (which arguably have no place in a CIF).
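The record layout described for the ZINC format can be sketched in a few lines of Python. This is an illustration only, not the cifZinc program: it assumes the CIF has already been parsed into a simple list of entries (a hypothetical structure in which each entry is either a single item or a loop), it ignores comments, and the names escape and zinc_records are invented for the example.

```python
def escape(value):
    """Collapse a multi-line value onto one line by writing each
    end-of-line character as the two-character sequence \\n."""
    return value.replace("\n", "\\n")


def zinc_records(blockcode, entries):
    """Build five-field, TAB-separated records in the order
    blockcode, name, index, value, list-id.

    entries is a list of ("item", name, value) or
    ("loop", [names], [rows]) tuples (an assumed input structure).
    """
    records = []
    for entry in entries:
        if entry[0] == "item":
            _, name, value = entry
            # Outside a loop the index and list-id fields are null.
            records.append((blockcode, name, "", escape(value), ""))
        else:
            _, names, rows = entry
            list_id = min(names)  # the data name that sorts earliest
            for index, row in enumerate(rows):  # zero-based counter
                for name, value in zip(names, row):
                    records.append(
                        (blockcode, name, str(index), escape(value), list_id)
                    )
    return ["\t".join(record) for record in records]


# The triangle example, with the text field given as its raw characters.
triangle = [
    ("item", "_name", ";\ntriangle\n;"),
    ("loop", ["_x", "_y"], [("0.0", "0.0"), ("1.0", "0.0"), ("0.0", "1.0")]),
    ("item", "_num_sides", "3"),
]

lines = zinc_records("object", triangle)
for line in lines:
    print(line)
```

Because every value ends up on one line together with its full context, the records can be passed directly to line-oriented tools for searching, sorting or comparison.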


5.3.7.1.2.2. zincCif

zincCif is a Perl script that takes a ZINC-format file (again from standard input or as a name on the command line) and pretty prints the corresponding CIF to standard output. Often, the pipeline

    cifZinc a.cif | zincCif > b.cif

produces a more attractive CIF than the original.

5.3.7.1.2.3. zincGrep

The shell script zincGrep is the utility most requested by those seeing a CIF for the first time. It allows a regular-expression search of a ZINC-format file (or a CIF specified on the command line, which is converted to a ZINC-format file first) and reports the block name, data name, index and value. For example, if the file describing a triangle in the last section were called simple.cif, the command

    zincGrep _name simple.cif

would produce

    object  _name           ;\ntriangle\n;

5.3.7.1.2.4. cifdiff

cifdiff is a C-shell script that takes two CIFs and lists the differences between them. Unlike the standard Unix utility diff, which compares files line-by-line, cifdiff can determine differences that are independent of reordering and white-space padding. This script takes each CIF, converts it to a ZINC-format file, then sorts it, first based on the data-block name, then (keeping the loops together) on the data name. It then removes the last field (which is not part of the CIF) and stores the remainder in temporary files. It then runs the standard diff program against these reordered temporary files. This is remarkably effective both in finding any differences and in providing the context (it names the block and data name as well as the value) needed to understand the differences. Fig. 5.3.7.2 illustrates two CIFs that are very different in the presentation of their contents, but have only a small difference of substance in their content. Fig. 5.3.7.3 indicates the output from cifdiff that identifies the changed data.

(a)

    data_sample
    _title 'scatter graph'
    _description
    ; A collection of x, y coordinates of points drawn in specified colours
    ;
    loop_ _x _y _colour
    0 0 red
    1 1 red
    2 4 red
    3 9 orange
    4 16 orange
    5 25 orange
    _status complete

(b)

    data_sample
    _status complete
    loop_ _y _x _colour
    0 0 red
    1 1 red
    2 2 red
    9 3 orange
    16 4 orange
    25 5 orange
    _title 'scatter graph'
    _description
    ; A collection of x, y coordinates of points drawn in specified colours
    ;

Fig. 5.3.7.2. Two example files in CIF format differing greatly in layout but little in content: (a) sample1.cif, (b) sample2.cif.
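The strategy described for cifdiff (convert to ZINC records, sort, drop the list identifier, then run a line-based comparison) can be sketched as follows. This is an assumed reconstruction in Python rather than the C-shell script itself: the records are supplied as tuples of the five ZINC fields, the exact sort key used by cifdiff is not given in the text and is a guess, and the helper names loop, normalise and zinc_diff are hypothetical. Python's standard difflib module stands in for the Unix diff program.

```python
import difflib


def loop(block, names, rows):
    """Build ZINC tuples (block, name, index, value, list-id) for a loop."""
    list_id = min(names)  # the data name that sorts earliest
    return [(block, name, str(i), value, list_id)
            for i, row in enumerate(rows)
            for name, value in zip(names, row)]


def normalise(records):
    """Sort by block, then by loop (list-id) or data name, then by data
    name and index; finally drop the list-id field."""
    def key(record):
        block, name, index, _value, list_id = record
        return (block, list_id or name, name, int(index) if index else -1)
    return ["\t".join(record[:4]) for record in sorted(records, key=key)]


def zinc_diff(first, second):
    """Return only the lines that differ between two record sets."""
    delta = difflib.ndiff(normalise(first), normalise(second))
    return [line for line in delta if line.startswith(("- ", "+ "))]


# Two reduced record sets: same content in a different order, except
# that one value of _y has changed from 4 to 2.
first = ([("sample", "_status", "", "complete", "")]
         + loop("sample", ["_x", "_y"], [("2", "4"), ("3", "9")]))
second = (loop("sample", ["_y", "_x"], [("2", "2"), ("9", "3")])
          + [("sample", "_status", "", "complete", "")])

changes = zinc_diff(first, second)
for line in changes:
    print(line)
```

Only the changed value is reported, together with its block name, data name and index, even though the two inputs list their items and loop columns in a different order.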

    % cifdiff sample1.cif sample2.cif
    18c18
    < sample _y 2 4
    ---
    > sample _y 2 2

Fig. 5.3.7.3. Output from cifdiff comparing files sample1.cif and sample2.cif of Fig. 5.3.7.2. The entries in each line comprise the data-block name, the variable name, the zero-based index of the occurrence of the value in a looped list and the value. Hence it is the third value of _y in the loop that has changed.

5.3.7.1.2.5. zb

zb is a small (less than 200 lines) Tcl/Tk program (Ousterhout, 1994) that provides a simple graphical front end to a ZINC-format file or CIF allowing the user to browse through the contents. Multiple files can be viewed simultaneously, as can multiple data blocks, on any X terminal. zb recognizes command-line argument file names in the form *.cif as being in CIF format and converts them to ZINC format automatically.

5.3.7.1.2.6. zincNl

zincNl is a Perl script that takes a ZINC file and creates a Fortran-compatible namelist file allowing for easy access to any CIF by Fortran programs without the need for extensive I/O libraries or reprogramming. As with zb above, it will automatically convert a CIF to a ZINC-format file if it needs to. For a more substantial tool providing CIF input functions in Fortran and C, see the discussion below of CifSieve (Section 5.3.7.2).

5.3.7.1.2.7. zincSubset

zincSubset is another C-shell script which is very short but very useful. It allows a user to generate a custom subset of any ZINC-format file (or CIF) simply by listing the desired data blocks and data names. The script has two file arguments, the first of which specifies a file with regular expressions that specify what is to be included in the subset, and the second of which is the ZINC-format file itself (or standard input). It allows two options: ‘-c’ to remove comments and ‘-v’ to invert the sense of the search. It converts the CIF into a ZINC-format file, uses the Unix grep program to search through the ZINC-format file for patterns that appear in the regular-expression file and pretty prints the result. For example, the command

    zincSubset defs cif_core.dic > cifdic.defs

will produce a subset of the core CIF dictionary that contains only the names and definitions when the file named ‘defs’ contains two lines with the TAB-surrounded word ‘_name’ and the TAB-surrounded word ‘_definition’.

All the tools listed in this section operate by design in concert with each other, providing the opportunity for generating increasingly complex tools. For example, to generate the Fortran namelist input file with only certain data items, a pipeline of zincSubset and zincNl will suffice. As more tools are developed, the range of applications will increase many-fold.


5.3.7.2. CifSieve: automatic construction of CIF input functions

Among the utilities described in the ZINC package above was a tool to generate Fortran namelist files. It is a common requirement of applications developers that they should be able swiftly to convert existing programs to read CIF data. While libraries such as CIFtbx (Chapter 5.4) and CIFLIB (Westbrook et al., 1997) offer very powerful functions for building CIF applications, it can be time-consuming to integrate them with existing software. It is a goal of CifSieve (Hester & Okamura, 1998) to enable the rapid creation of new CIF-conversant software by using a CIF dictionary as a template for input data structures. The CifSieve program runs on Unix systems with installed versions of the software utilities and programming languages bison or yacc, flex, Perl (Wall et al., 2000) and C.

5.3.7.2.1. Overview of the process

The data names in a CIF are defined in a dictionary written in DDL1 or DDL2 formalism. Therefore, information about the data type and array structure of data variables is already to hand for a software author wishing to determine how to read CIF data into a program’s data structures. The CifSieve process requires that the programmer augment the relevant CIF dictionary by adding to a copy of the definition of desired items a new attribute, named _variable_name, that passes to the application program the name of the associated program variable. A program BuildSiv then reads the augmented dictionary and produces a subroutine capable of reading a CIF and transferring the data items tagged in the augmented dictionary to internal variable storage. The associated data structure is presented in an ancillary file which must be linked to the application program.

CifSieve can produce input subroutines and header or include files for C and Fortran language programs. For C applications, the input subroutine is called cifsiv_ and is invoked as cifsiv_(CIF, block), where CIF is the name of the input CIF and block is the name of the data block from which data should be read. The data structure is declared in a header file cifvars.h which must be included in subroutines that manipulate the data input from the CIF. For Fortran applications, the input subroutine is also called cifsiv_, but takes an additional argument, blockbeg, which is the address of the common block containing the input variable names, declared in the include file forcif.inc.

data_atom_site_aniso_label
    _name                   '_atom_site_aniso_label'
    _category               atom_site
    _type                   char
    _variable_name          mylabel[50]
    _list                   yes

data_atom_site_aniso_U_
    loop_ _name             '_atom_site_aniso_U_11'
                            '_atom_site_aniso_U_12'
                            '_atom_site_aniso_U_13'
                            '_atom_site_aniso_U_22'
                            '_atom_site_aniso_U_23'
                            '_atom_site_aniso_U_33'
    _category               atom_site
    _variable_name          atsiteu[1000]
    _type                   numb

data_reflns_number_
    loop_ _name             '_reflns_number_total'
                            '_reflns_number_observed'
    _category               reflns
    _type                   numb
    _enumeration_range      0:
    _variable_name          reftot

data_refine_ls_extinction_method
    _variable_name          extmet
    _name                   '_refine_ls_extinction_method'
    _category               refine
    _type                   char
    _enumeration_default    'Zachariasen'

Fig. 5.3.7.4. Extracts from an augmented DDL1 dictionary (version 1.0 of the core CIF dictionary). The additional _variable_name entry is shown in italics.
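The selection step that BuildSiv performs on an augmented dictionary, namely picking out the definition blocks that carry a _variable_name attribute, can be sketched in Python. This is a minimal illustration under stated assumptions and not the CifSieve parser (which is built with bison or yacc and flex): the tokenizer handles only whitespace-separated tokens and single-quoted strings, and the function name `tagged_items` and the dictionary layout it returns are hypothetical.

```python
import re

# Hypothetical sketch, not the BuildSiv parser: a single-quoted string or a
# run of non-blank characters counts as one token.
TOKEN = re.compile(r"'([^']*)'|(\S+)")


def tagged_items(dictionary_text):
    """Return the definition blocks that carry a _variable_name attribute."""
    blocks, current, tag = [], None, None
    for quoted, bare in TOKEN.findall(dictionary_text):
        if bare.startswith("data_"):          # start of a definition block
            current, tag = {"block": bare}, None
            blocks.append(current)
        elif bare == "loop_":                 # looped _name values follow
            continue
        elif bare.startswith("_") and current is not None:
            tag = bare                        # an unquoted attribute name
            current.setdefault(tag, [])
        elif current is not None and tag is not None:
            current[tag].append(quoted or bare)
    selected = []
    for block in blocks:
        if "_variable_name" in block:
            selected.append({
                "names": block.get("_name", []),
                "type": (block.get("_type") or [None])[0],
                "variable": block["_variable_name"][0],
            })
    return selected


SAMPLE = """
data_atom_site_aniso_U_
    loop_ _name  '_atom_site_aniso_U_11' '_atom_site_aniso_U_12'
    _category atom_site
    _variable_name atsiteu[1000]
    _type numb

data_cell_length_
    loop_ _name '_cell_length_a' '_cell_length_b'
    _type numb

data_refine_ls_extinction_method
    _variable_name extmet
    _name '_refine_ls_extinction_method'
    _type char
"""

for item in tagged_items(SAMPLE):
    print(item["variable"], item["type"], item["names"])
```

The untagged data_cell_length_ block is skipped, which mirrors the rule that definitions of items not to be read by the application are left unchanged in the dictionary.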

5.3.7.2.2. The augmented DDL dictionary

Fig. 5.3.7.4 is an example of the annotations necessary to flag the data names that refer to data items desired to be input from a CIF. The current implementation requires that a copy of the DDL dictionary relevant for the CIF be physically edited to include the new _variable_name attribute. The inclusion of such a new attribute will not affect the use of the CIF dictionary for other purposes and by other software. The definition blocks of data items that are not to be read by the application should be left unchanged. The value assigned to the _variable_name attribute is the name of the variable declared in the application program for storing the input data item. If the items to be input are part of an array (i.e. they exist in the CIF as a looped list), the variable name should be supplied as a dimensioned array variable, e.g. atsiteu[1000] in the example of Fig. 5.3.7.4. The same attribute (_variable_name) may be inserted in DDL1 or DDL2 dictionaries. Separate parsers are supplied for use with either format.

When BuildSiv is invoked, the parser reads the augmented dictionary and identifies the data items required by the target input subroutine by the presence of a _variable_name attribute in the definition block. The definition is read and the relevant values of the type (DDL attribute _type), item name (_name) and variable name are output in a simple tag–value format and in a standard order. For DDL2 dictionaries, values of _item_aliases.alias_name and _item_linked.parent_name, if present, are also output. The DDL parser thus transforms and simplifies the dictionary contents.

Where the item-name attribute occurs inside a loop (i.e. several data names occur in a single definition block in the dictionary), the variable name for that particular definition block will be given an extra array dimension by CifSieve, equal to the number of names in the loop. When a name from this loop is found in a CIF, the value will be read into the respective array location. If an _item_aliases.alias_name attribute is present (DDL2), the alias will also be recognized in CIF input files. If this attribute occurs together with looped item names in the domain dictionary, an attempt is made to determine the parent _item.name in the loop to which this _item_aliases.alias_name refers. This is done within the BuildSiv program by examining _item_linked.parent_name entries within the same definition block.

Data typing is simplified; the _item_type.code values of DDL2 dictionaries are collapsed onto primitive ‘numb’ or ‘char’ types. Values of type numb are declared and stored as type double (C) or REAL*8 (Fortran), while values of type char are stored as character arrays char[84] (C) or CHARACTER*84 (Fortran). In consequence, multiple lines of text cannot be retrieved with this version of CifSieve. Note in particular that values declared as of type ‘int’ in DDL2 dictionaries will be stored as double-precision real.


/* These declarations have been automatically generated by the
   cif file input/output function generator. This file should be
   included in any routines that call these functions */
typedef char cifstring[84];   /* to avoid array complications later */
#define MYLABELMAX 50
#define ATSITEUMAX 1000
typedef double atsiteutype [6];
#ifdef CIFVARDEC
cifstring errormes;                  /* an error message */
int errornum;                        /* an error number */
cifstring mylabel[50];               /*data_atom_site_aniso_label*/
atsiteutype atsiteu[1000];           /*data_atom_site_aniso_U_*/
atsiteutype atsiteuesd[1000];
cifstring extmet;                    /*data_refine_ls_extinction_method*/
double reftot [2];                   /*data_reflns_number_*/
double reftotesd [2];
#else
extern cifstring errormes;           /* an error message */
extern int errornum;                 /* an error number */
extern cifstring mylabel[50];        /*data_atom_site_aniso_label*/
extern atsiteutype atsiteu[1000];    /*data_atom_site_aniso_U_*/
extern atsiteutype atsiteuesd[1000];
extern cifstring extmet;             /*data_refine_ls_extinction_method*/
extern double reftot [2];            /*data_reflns_number_*/
extern double reftotesd [2];
#endif
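The declarations in this header follow a regular pattern: ‘numb’ items become double, ‘char’ items become the 84-character cifstring type, looped item names add an array dimension equal to the number of names, and each numeric variable is accompanied by a companion array whose name ends in esd. The Python sketch below reproduces that pattern for illustration only. It is not CifSieve code; the function name `declarations` is hypothetical, and the sketch writes the extra dimension directly (double atsiteu[1000][6]) rather than through a typedef as the generated header does, which is an assumed simplification with the same storage layout.

```python
import re

# Hypothetical sketch of the declaration pattern seen in the generated header.
# cifstring stands for the header's 'typedef char cifstring[84]'.
C_TYPES = {"numb": "double", "char": "cifstring"}


def declarations(variable, ddl_type, n_names=1):
    """Return C declaration lines for one _variable_name entry.

    variable : name as written in the dictionary, e.g. 'atsiteu[1000]'
    ddl_type : primitive type, 'numb' or 'char'
    n_names  : number of data names looped in the definition block
    """
    match = re.fullmatch(r"(\w+)(\[\d+\])?", variable)
    base, dim = match.group(1), match.group(2) or ""
    extra = f"[{n_names}]" if n_names > 1 else ""   # extra array dimension
    ctype = C_TYPES[ddl_type]
    lines = [f"{ctype} {base}{dim}{extra};"]
    if ddl_type == "numb":                          # companion esd storage
        lines.append(f"{ctype} {base}esd{dim}{extra};")
    return lines


for args in [("mylabel[50]", "char"), ("atsiteu[1000]", "numb", 6),
             ("extmet", "char"), ("reftot", "numb", 2)]:
    print("\n".join(declarations(*args)))
```

Applied to the four entries of Fig. 5.3.7.4, the sketch yields one declaration per character item and a value/esd pair per numeric item, matching the shape of the header above.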

/* A simple example application of the automatically generated cifsiv_ function */
#include <stdio.h>
#include "cifvars.h"
int main(int argc, char *argv[])
{
  int i;
  char filename[80];
  char block[80];
  printf("Please enter CIF file name: ");
  scanf("%79s", filename);
  printf("Please enter data block name ");
  printf("(without data_ prepended): ");
  scanf("%79s", block);
  errornum = 0;
  cifsiv_(filename,block);
  if(errornum != 0)   /* an error, we have problems */
  {
    printf("An error occurred in reading the CIF:\n");
    printf("%s",errormes);
  }
  for(i=0;iextmet.

5.3.8. Tools for mmCIF

The complex relationships between the components of a macromolecular structure at various levels of detail are richly described by the data names in the mmCIF dictionary, but their number and complexity demand more heavyweight tools for proper handling. Input/output for small-molecule or inorganic structures can

5.3.8.2. OpenMMS

Object classes represent the first stage in abstracting related data components. By building structured software modules that can manage the small-scale interactions between data components, the programmer can write more succinct code to handle the interactions between much higher-level data constructs. An API then permits third parties to handle the larger-scale objects without any need to know the internal workings of the class library. The next logical step is to present a standard set of ‘objects’ representing complete logical entities to any programmer for ‘plug-and-play’ incorporation into new applications.


The Life Sciences Research domain task force of the Object Management Group (OMG, 2001) is concerned with the development of standards for data exchange in biomolecular sciences, and in 2002 approved a macromolecular structure Corba specification. Corba (the common object request broker architecture) is a middleware architecture intended to serve just this purpose of providing access to standard objects representing discrete logical entities suitable for programmatic manipulation. Corba promotes interoperability across networked applications by separating entirely the API from the implementation of the underlying data objects. For applications such as the macromolecular structures database hosted by the Protein Data Bank, the attraction of networked interoperability is that information can be accessed through distributed and federated databases, and can be delivered on demand to any compatible software. A Corba application comprises an interface definition language (IDL) and an API that together define access to a data structure that encapsulates the abstract representation of the objects and relationships relevant to a particular area of knowledge. In general terms, this data structure may be described as an ‘ontology’ (Westbrook & Bourne, 2000). The ontology adopted for macromolecular structure (MMS) data was based on the mmCIF dictionary following a submission by the Research Collaboratory for Structural Bioinformatics to a Request for Proposal (Greer, 2000).

Fig. 5.3.8.1. The OpenMMS metamodel and data flow.

5.3.8.2.1. The OpenMMS toolkit

In practice, the ontology was developed in a ‘metamodel’ that combined the definitions and relationships between data items specified in the mmCIF dictionary with a generic metamodel framework. The metamodel extracts the information in the mmCIF dictionary but maintains it in a representation that is independent of the mmCIF STAR or any other file format. The standard building block of the metamodel is an Entry object, modelling a single macromolecular structure. From a suitable metamodel, it then becomes relatively straightforward to generate alternative expressions of the information to suit different access requirements. The OpenMMS toolkit (Greer et al., 2002) was built using Java source code to generate a Corba interface, an SQL schema for relational database loading and an XML representation of macromolecular data sets (Fig. 5.3.8.1).

The toolkit contains an mmCIF parsing module capable of direct access to the underlying data archive of mmCIF data files. This is important, because the data files represent a common reference for all the derived representations. Any errors or discrepancies between the expressed forms of the Corba, XML or SQL representations are resolved against the standard mmCIF reference form.

The relational database supporting an SQL-92 compatible interface provides an appropriate API for many applications, particularly ones that require extensive string searches. The close relationship between the mmCIF data model and relational database models has already been described earlier in this volume (Chapter 2.6). Advantages of the SQL interface are that it provides rapid access direct to the binary data storage representation and that individual components of a data set may be efficiently retrieved without the need to search sequentially through an entire entry. This efficiency of access and the ability to retrieve individual MMS data elements from a remote server is best realized through the Corba interface, the primary purpose of which is indeed to facilitate such high-performance access.

The bulk exchange of data is addressed through the generation of XML files. XML is a simple, powerful and widely used standard for interchanging data, and its use for transporting macromolecular data obviates the need for target applications to build their own STAR parsers. However, the use of markup tags around every individual data element does make the files much larger than their mmCIF progenitors. This is not an insurmountable problem in large-scale application environments, but it can undermine the effectiveness of XML as a representation mechanism in such applications as web browsers. A possible approach to this could be to define different, less verbose, XML representations and populate these on demand from a database store, either by SQL or XML queries. This is not an approach that the current OpenMMS toolkit supports directly.

Fig. 5.3.8.2 is an extract from an XML data file generated from the PDB structure 1xy2. The XML uses a reserved name space PDBx conforming to the schema http://deposit.pdb.org/pdbML/pdbx-v0.905.xsd. Data tags map cleanly to the corresponding data names in the mmCIF dictionary formed by concatenating the XML element name with its parent category name. For example, the cell-length entry 27.080 included in the cell container element can be directly translated to the corresponding mmCIF data item _cell.length_a 27.080. CIF data loops are represented by repeated instances of the XML tag representing the corresponding CIF data name (for example, the multiple author elements are equivalent to a CIF loop_ _audit_author.name construct). Nonstandard items with a pdbx prefix refer to private data names in the PDB extension dictionary (Appendix 3.6.2).

5.3.8.3. mmLib: a Python toolkit for bioinformatics applications

While the libraries developed for use within the Protein Data Bank provide powerful functionality, their very size and complexity make them inappropriate for some applications. Indeed, considerable effort may be needed to compile the C++ code on nonstandard platforms. The mmLib toolkit (Painter & Merritt, 2004)


addresses this by supplying a library of object-oriented routines implemented in Python (van Rossum, 1991) that are designed to integrate with existing or new applications in an easy way. The objective of mmLib is to build a support platform to handle the increasingly rich data about macromolecular structure available to structural biologists. Not only do applications need to be able to handle atomic positions and build appropriate three-dimensional structure representations, but links to and integration with information on sequence, homologous structures, and biochemical, genetic and medical form and function are also demanded from individual program systems. Since many of these data are available from external databases in a variety of formats, mmLib will not be restricted to the handling of files in a single format. Its initial release provides support for mmCIF, for the PDB format files that historically have been used for representation of macromolecular structures (Westbrook & Fitzgerald, 2003) and for the MTZ format used by the CCP4 program suite (Collaborative Computational Project, Number 4, 1994).

[XML element tags not preserved; element values only]
27.080 9.060 22.980 90.00 102.06 90.00 4
Crystal structure analysis of deamino-oxytocin: conformational flexibility and receptor binding. Science 232 633 636 1986 SCIEAS US 0036-8075 0038
SHELX SHELX-76
polymer man OXYTOCIN 978.189 1
water nat water 18.015 7

Fig. 5.3.8.2. Sample XML output from the OpenMMS XML generator. Lines have been omitted or wrapped to fit the present column width.

Table 5.3.8.1 lists the main modules in the current release. mmLib.mmCIF and mmLib.PDB are read/write parsers for mmCIF and PDB format files, respectively, which handle file input and output in these formats, and provide support for inspection or modification of such file formats. They are typically used in conjunction with the mmLib.FileLoader component to populate the mmLib.Structure internal representation of the macromolecular structure. The high-level abstraction of such functionality allows for very succinct programmatic constructs. Fig. 5.3.8.3 illustrates this with a program snippet that (apart from the necessary system calls for file management) achieves the conversion of an mmCIF input file to a PDB format representation. This is sufficiently robust and lightweight to act as an input filter to software already designed for handling PDB format files.

Table 5.3.8.1. The modules provided by the mmLib toolkit

Module                          Description
mmLib.mmCIF                     mmCIF parser
mmLib.PDB                       PDB format parser
mmLib.Library                   Base chemical library
mmLib.Extensions.CCP4Library    Data retrieval from CCP4 monomer library
mmLib.Elements                  Chemical data for elements
mmLib.AminoAcids                Chemical data for amino acids
mmLib.NucleicAcids              Chemical data for nucleic acids
mmLib.Structure                 Macromolecular structure model
mmLib.GLViewer                  OpenGL visualizer

import mmLib
from mmLib.FileLoader import LoadStructure, SaveStructure

struct = LoadStructure(
    fil = cif,
    format = "CIF",
    build_properties = ("no_bonds",) )
SaveStructure(
    fil = pdb,
    structure = struct,
    format = "PDB")

Fig. 5.3.8.3. A snippet of code illustrating mmCIF/PDB file format conversion with the mmLib toolkit.

mmLib.Structure is the internal representation of a molecular structure and is implemented as an object hierarchy with four basic object classes: Structure, Chain, Fragment and Atom. The Fragment class has subclasses AminoAcidResidue and NucleicAcidResidue.

In order to build a complete representation of a structure, the toolkit may need to load data from an input mmCIF or PDB format file, and also from standard data sets of properties of individual monomers and chemical elements; these standard libraries of chemical properties are provided by the mmLib.Library module. The core mmLib source includes a limited library of such chemical properties (accessible through the subclasses mmLib.Elements, mmLib.AminoAcids and mmLib.NucleicAcids)

and also provides support for the extensive CCP4 monomer library through the mmLib.Extensions.CCP4Library. The naming of this class expresses the intention that other standard data sources should be made accessible in the same way. The CCP4 monomer library is in fact included with the software as a directory tree of small files in mmCIF format, which are loaded into the Structure object through the normal use of the toolkit’s mmCIF parser.

mmLib.GLViewer is a module provided to support visualization programs using the OpenGL graphics environment. Although it does not by itself provide a stand-alone viewer, it can be incorporated into many common graphics application building environments. A molecular viewer, mmView, is provided with the distribution as an example of an application using the GTK graphical user interface, a popular toolkit in Linux.

5.3.9. Concluding remarks

CIF is a domain-specific format that cannot attract the number of programmers that generic formats such as XML do. In spite of this, there is an impressive collection of programs available to support activities at many levels, from the single-line shell script needed to search for some desired content in a collection of CIFs, to the industrial-scale activities of major databases and publishing houses. As many examples as possible of the programs discussed in this chapter have been collected on the CD-ROM accompanying this volume and on the IUCr web site (http://www.iucr.org/iucr-top/cif/software). It is hoped that the contributions described here will inspire future generations of programmers to contribute to a growing and increasingly robust software collection to make the use of CIFs ever easier and more fruitful.

I am immensely grateful for the assistance, cooperation and involvement of the community of software authors who have contributed to this chapter in one way or another, and to all the programmers and developers who have been active through the cif-developers discussion list of the IUCr (http://www.iucr.org/iucr-top/lists/cif-developers) and in private discussions.

References

Allen, F. H. (2002). The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380–388.
Allen, F. H., Johnson, O., Shields, G. P., Smith, B. R. & Towler, M. (2004). CIF applications. XV. enCIFer: a program for viewing, editing and visualizing CIFs. J. Appl. Cryst. 37, 335–338.
Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., Feng, Z., Gilliland, G. L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J. D. & Zardecki, C. (2002). The Protein Data Bank. Acta Cryst. D58, 899–907.
Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R. & Schneider, B. (1992). The Nucleic Acid Database: a comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 63, 751–759.
Bernstein, H. J. (1998). cif2cif. CIF copy program. http://www.iucr.org/iucr-top/cif/software/ciftbx/cif2cif.src/.
Bernstein, H. J. & Hall, S. R. (1998). CIF applications. VII. CYCLOPS2: extending the validation of CIF data names. J. Appl. Cryst. 31, 278–281.
Bluhm, W. (2000). STAR (CIF) parser. http://pdb.sdsc.edu/STAR/index.html.
Brown, I. D., Zabobonin, A. & Holt, B. (2004). beCIF. Browser and editor for CIF. Private communication.
Collaborative Computational Project, Number 4 (1994). The CCP4 suite: programs for protein crystallography. Acta Cryst. D50, 760–763.
Edgington, P. R. (1997). HICCuP: High-Integrity CIF Checking using Python. Cambridge: Cambridge Crystallographic Data Centre.
Greer, D. S. (2000). Macromolecular structure RFP response. Revised submission. http://openmms.sdsc.edu/OpenMMS-1.5.1 Std/openmms/docs/specs/lifesci 00-11-01.pdf.
Greer, D. S., Westbrook, J. D. & Bourne, P. E. (2002). An ontology driven architecture for derived representations of macromolecular structure. Bioinformatics, 18, 1280–1281.
Hall, S. R. (1993). CIF applications. III. CYCLOPS: for validating CIF data names. J. Appl. Cryst. 26, 480–481.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685.
Hall, S. R. & Bernstein, H. J. (1996). CIF applications. V. CIFtbx2: extended tool box for manipulating CIFs. J. Appl. Cryst. 29, 598–603.
Hall, S. R. & Sievers, R. (1993). CIF applications. I. QUASAR: for extracting data from a CIF. J. Appl. Cryst. 26, 469–473.
Hester, J. R. (2002). PyCifRW. A general CIF reading/writing utility in Python. http://www.ansto.gov.au/natfac/ANBF/CIF/.
Hester, J. R. & Okamura, F. P. (1998). CIF applications. X. Automatic construction of CIF input functions: CifSieve. J. Appl. Cryst. 31, 965–968.
Knuth, D. E. (1986). The TeXbook. Computers and Typesetting, Vol. A. Reading, MA: Addison-Wesley.
McMahon, B. (1993). ciftex: translation utility from CIF to TeX. ftp://ftp.iucr.org/pub/ciftex.tar.Z.
McMahon, B. (1998). vcif: a utility to validate the syntax of a Crystallographic Information File. http://www.iucr.org/iucr-top/cif/software/vcif/index.html.
OMG (2001). Life Sciences Research Domain Task Force. http://www.omg.org/homepages/lsr/.
Ousterhout, J. K. (1994). Tcl and the Tk toolkit. Reading, MA: Addison-Wesley.
Painter, J. & Merritt, E. A. (2004). mmLib Python toolkit for manipulating annotated structural models of biological macromolecules. J. Appl. Cryst. 37, 174–178.
Patel, A. J. (2002). Yapps: Yet Another Python Parser System. http://theory.stanford.edu/~amitp/yapps/.
Rossum, G. van (1991). Python programming language. http://www.python.org.
Spadaccini, N. & Hall, S. R. (1994). Star Base: accessing STAR File data. J. Chem. Inf. Comput. Sci. 34, 509–516.
Stampf, D. R. (1994). ZINC: galvanizing CIF to work with UNIX. Brookhaven: Protein Data Bank.
Toby, B. H. (2003). CIF applications. XIII. CIFEDIT, a program for viewing and editing CIFs. J. Appl. Cryst. 36, 1288–1289.
Tosic, O. & Westbrook, J. D. (2000). CIFParse. A library of access tools for mmCIF. Reference guide. http://pdb.rutgers.edu/mmcif/CIFPARSE-OBJ/cifparse/index.html.
Wall, L., Schwartz, R. L., Christiansen, T. & Orwant, J. (2000). Programming Perl, 3rd ed. Sebastopol, CA, USA: O’Reilly & Associates, Inc.
Westbrook, J. D. & Bourne, P. E. (2000). STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics, 16, 159–168.
Westbrook, J. & Fitzgerald, P. (2003). The PDB format, mmCIF formats and other data formats. Structural bioinformatics, edited by P. E. Bourne & H. Weissig, pp. 161–179. Hoboken, NJ: John Wiley & Sons, Inc.
Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). CIF applications. VI. CIFLIB: an application program interface to CIF dictionaries and data files. J. Appl. Cryst. 30, 79–83.
Westrip, S. P. (2004). printCIF for Word. http://www.iucr.org/iucr-top/cif/software/printCIFforWord/index.html.
Winn, M. (1998). cif.el: an Emacs mode for CIF. Daresbury Laboratory, Warrington, England.


International Tables for Crystallography (2006). Vol. G, Chapter 5.4, pp. 526–538.

5.4. CIFtbx: Fortran tools for manipulating CIFs B Y H. J. B ERNSTEIN

AND

S. R. H ALL

The arguments in CIFtbx commands have been kept to a minimum. Most of the parameter setting is handled automatically by reading and setting variables held in common blocks supplied as the file ciftbx.cmn. The type declarations for all the commands are also provided in the file ciftbx.cmn, and the programmer must ‘include’ this file in each application program, function or subroutine invoking CIFtbx commands. The flexibility of the CIF syntax can present some challenges to an author of applications reading or writing CIF data. This is because the information in a CIF may be in any order, have data names as either upper or lower case, and have an arbitrary spacing between data items. For example, one may extract the cell parameters from the front of a CIF and place them at the end, change all the data names from lower case to upper case, and introduce a blank line between each data name and its value, and yet the data (value) content of the output CIF will be identical to that input. CIFtbx provides the application writer with the tools to handle such presentation details seamlessly without altering the basic information content. Most importantly, CIFtbx allows applications to be ‘objectoriented’, in that data items are simply requested by name without prior knowledge of the file structure. It also allows for more advanced data processing in which data items are parsed sequentially, and typed and validated via the dictionary. This enables items to be read independently of the names, and the data typing is automatically determined and returned. In this way, where needed, applications can go beyond the position-independent context of a CIF. The main purpose of CIFtbx is to manipulate CIF data. However, there is much in common between CIF and the Extensible Markup Language XML (Bray et al., 1998), and facilities have been added to CIFtbx to facilitate writing output in XML as well as CIF format. 
CIFtbx provides four basic kinds of facilities for programmers: (i) commands to initialize later handling; (ii) commands to read CIF data; (iii) commands to write CIF data; (iv) variables for monitor and control signals. These commands are described in detail below.

5.4.1. Introduction CIFtbx is a function library for programmers developing CIF applications. It is written in Fortran and is intended for use with Fortran programs. The first version was released in 1993 (Hall, 1993b) and was extended (Hall & Bernstein, 1996) to accommodate subsequent CIF applications and DDL changes. The CIFtbx library is for novice and expert programmers of CIF applications. It has been used to develop CIF manipulation programs such as CYCLOPS (Bernstein & Hall, 1998), CIFIO (Hall, 1993a), cif 2cif (Bernstein, 1997), pdb2cif (Bernstein et al., 1998) and cif 2pdb (Bernstein & Bernstein, 1996). Programmers writing in C, C++ and mixed Fortran–C should consider alternative approaches, as discussed in Chapter 5.1 or in the work on CCP4 (Keller, 1996). The description of library functions below assumes familiarity with the STAR, CIF and DDL syntax described in Part 2. A complete Primer and reference manual for CIFtbx is provided on the CD-ROM accompanying this volume. Fortran is a very general and powerful language, and many compilers allow programming in a wide variety of styles. However, there is a traditional Fortran programming style that ensures portability to a wide variety of platforms. CIFtbx conforms to this style and has been ported to many platforms. The internals of CIFtbx and the style chosen are discussed at the end of this chapter and in more detail in the Primer. 5.4.2. An overview of the library The CIFtbx library is made up of functions, subroutines and variables that can be added to application programs as ‘commands’ to read and write CIF data. They may also be used to automatically validate incoming and outgoing CIF data. The self-checking aspects of some functions ensure that data are syntactically correct and, when used with DDL dictionaries, that individual items conform to their formal definitions. The CIFtbx commands are invoked in user software as standard Fortran function or subroutine calls. 
For example, to open the dictionary file ‘core.dic’ one uses the logical function dict_ as follows. FN = dict_(’core.dic’,’valid’)

The argument ’core.dic’ is the local file identifier for the relevant dictionary. The argument ’valid’ signals that checking should be done against the data definitions in this dictionary. The local logical variable FN is returned as .true. if dict_ opens the file core.dic correctly; otherwise the function is returned as .false.. Some CIFtbx commands are issued as subroutine calls. For example, to clear the internal data tables the programmer inserts the command

5.4.3. Initialization commands Initialization commands are applied at the start of a program to set global conditions for processing CIF data. There are only two commands of this type.

call purge_

logical function init_ (devcif, devout, devdir, deverr) integer devcif, devout, devdir, deverr logical function dict_ (fname, checks) character fname*(*), checks*(*)

Affiliations: H ERBERT J. B ERNSTEIN, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA; S YDNEY R. H ALL , School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia.

Copyright © 2006 International Union of Crystallography

5.4. CIFTBX: FORTRAN TOOLS FOR MANIPULATING CIFS

init_ is an optional command that specifies the device number assignments for the input CIF devcif, the output CIF devout, an internal scratch file devdir and the file containing error messages deverr. The internal scratch file devdir is used to hold a copy of the input CIF as a direct-access file (i.e. for random access to parts of the CIF). init_ is a logical function that is always returned with a value of .true.. The default device numbers for these files are 1, 2, 3 and 6.

dict_ is an optional command for opening a dictionary fname and initiating various optional data checks, checks. The choices of checks to perform are given by a string of blank-separated five-character ‘check codes’, such as valid or dtype, which turn on checking for the validity of tags or types of values, respectively. dict_ is a logical function which is returned as .true. if the named dictionary was opened and if the check codes are recognizable.
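The two initialization commands can be combined at the top of an application. The following is a minimal sketch only: it assumes that ciftbx.cmn is on the include path, that the CIFtbx library is linked, and that a dictionary is available locally under the assumed name cif_core.dic. The program name cifini is hypothetical.

```fortran
      program cifini
C     Sketch only: assumes ciftbx.cmn can be included and the
C     CIFtbx library is linked. 'cif_core.dic' is an assumed
C     local file name for the core dictionary.
      include 'ciftbx.cmn'
      logical f1
C     Optional call: restate the default device numbers
C     (input CIF, output CIF, scratch file, error messages)
      f1 = init_(1, 2, 3, 6)
C     Request two checks with blank-separated check codes
      if (.not.dict_('cif_core.dic', 'valid dtype')) then
         write(6,'(a)') ' Dictionary not opened or bad check code'
      endif
      end
```

Because dict_ reports both a missing dictionary and an unrecognized check code through the same .false. return, the message in this sketch deliberately covers both cases.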

5.4.4. Read commands

These commands are used to read data from an existing CIF. Since CIF data are order-independent, most applications would work from a known list of data names (tags) and extract the desired values from the CIF in the order specified. However, some applications need to browse a CIF in the order of presentation. In CIFtbx, a blank name has the meaning of the next name in the file.

      logical function ocif_ (fname)
      character fname*(*)

      logical function data_ (name)
      character name*(*)

      logical function bkmrk_ (mark)
      integer mark

      logical function find_ (name, type, strg)
      character name*(*), type*(*), strg*(*)

      logical function test_ (name)
      character name*(*)

      logical function name_ (name)
      character name*(*)

      logical function numb_ (name, numb, sdev)
      character name*(*)
      real numb, sdev

      logical function numd_ (name, numb, sdev)
      character name*(*)
      double precision numb, sdev

      logical function char_ (name, strg)
      character name*(*), strg*(*)

      logical function cmnt_ (strg)
      character strg*(*)

      subroutine purge_

ocif_ requests the named CIF fname to be opened. The logical function is returned as .true. if the CIF can be opened.

data_ specifies the data block name containing the data to be read from the CIF. The logical function is returned as .true. if the data block is found.

bkmrk_ is a bookmark function that saves or restores the current position in the CIF so that data can be accessed nonsequentially if need be. The logical function is returned as .true. if there is space to store the current position or if the restored bookmark number is valid.

find_ finds the requested item in the current data block. The logical function is returned as .true. if the item is found.

test_ provides the data attributes of a data item in the current data block. The logical function is returned as .true. if the item is found. The data attributes are returned in the common-block variables list_, type_, dictype_, diccat_ and dicname_.

name_ identifies the next data name in the current data block. The logical function is returned as .true. if another data name exists in the data block and .false. if the end of the data block is reached. The name is returned in the function argument, name.

numb_ returns the number numb and its standard uncertainty sdev (if appended) of a named data item name. The logical function is returned as .true. if the item is present and is a number. If the item is either absent or cannot be recognized as a valid number, the function is returned as .false. and the original numeric argument values are not changed.

numd_ returns the number numb and its standard uncertainty sdev (if appended) as double-precision variables of a named data item name. The logical function is returned as .true. if the item is present and is a number. If the item is either absent or cannot be recognized as a valid number, the function is returned as .false. and the original numeric argument values are not changed.

char_ returns character or text strings, strg, of the named data item name. The logical function is returned as .true. if the item is present. If text lines are being read, this function is called repeatedly until the logical variable text_ is .false..

cmnt_ returns the next comment, strg, in the current data block. The logical function is returned as .true. if a comment is present. The initial comment character ‘#’ is not included in the returned string and a completely blank line is treated as a comment.

purge_ closes all attached data files and clears all tables and pointers. This is a subroutine call.

5.4.5. Write commands

The following commands are available for writing data to a new CIF.

      logical function pfile_ (fname)
      character fname*(*)

      logical function pdata_ (name)
      character name*(*)

      logical function ploop_ (name)
      character name*(*)

      logical function pnumb_ (name, numb, sdev)
      character name*(*)
      real numb, sdev

      logical function pnumd_ (name, numb, sdev)
      character name*(*)
      double precision numb, sdev

      logical function pchar_ (name, string)
      character name*(*), string*(*)

      logical function pcmnt_ (string)
      character string*(*)

      logical function ptext_ (name, string)
      character name*(*), string*(*)

      logical function prefx_ (strg, lstrg)
      character strg*(*)
      integer lstrg

      subroutine close_

pfile_ creates a new file with the specified file name fname. The logical function is returned as .true. if the file is opened. The value will be .false. if the file already exists.

pdata_ puts the string data_name from the argument name into the output CIF. The logical function is returned as .true. if the block is created. The value will be .false. if the block name already exists. This command inserts the string save_name instead of the data-block name if the variable saveo_ is set to .true.. If the prior block was a save frame, the necessary terminal save_ is written for that block before the new block is started.

ploop_ puts the specified data name name into the output CIF. On the first invocation of this command for a given loop, a loop_ string is placed before the data name. The logical function is returned as .true. if the name passes any requested dictionary validation checks. Once a series of data names for a loop_ header has been declared by calls to this function, all calls to pchar_, ptext_, pnumb_ or pnumd_ for the associated data values must be made with blank data names or the loop_ will be terminated. (At the very least, the first character of these data names must be blank.)

pchar_ puts the specified data name name and character string string into the output CIF. If the data name is blank, only the character string is put. The logical function is returned as .true. if the data name passes any requested dictionary validation checks.

pnumb_ puts the specified data name name, single-precision number numb and an appended standard uncertainty sdev into the output CIF. The logical function is returned as .true. if the data name passes any requested dictionary validation checks.

pnumd_ puts the specified data name name, double-precision number numb and an appended standard uncertainty sdev into the output CIF. The logical function is returned as .true. if the data name passes any requested dictionary validation checks.

ptext_ puts the specified data name name and text string string into the output CIF. The data name will only be inserted on the first invocation of a sequence. The logical function is returned as .true. if the data name passes any requested dictionary validation checks. This command must be invoked repeatedly until the text is finished. The terminal semicolon character ‘;’ is placed in the output CIF when the next call to pchar_, pnumb_ or pnumd_ is made, or if a call is made to ptext_ for a different data name.

pcmnt_ puts the specified comment string string into the output CIF. The logical function is always returned as .true.. The comment character ‘#’ should not be included in the string. A blank comment is presented as a blank line without the leading ‘#’. The string char(0)//char(0) can be used to produce an empty comment with the leading ‘#’.

prefx_ prefixes the specified string strg of length lstrg to subsequent lines of the output CIF. The total line length is still limited to the value given by the variable line_ (the default is 80 characters). This function is useful when embedding a CIF into another text document, such as a PDB REMARK. The logical function is always returned as .true..

close_ closes the output CIF only. This command must be used if pfile_ is used. This is a subroutine call.

5. APPLICATIONS

Table 5.4.6.1. CIFtbx variables

General monitor   General control   Input monitor   Output control
file_             alias_            bloc_           aliaso_
longf_            append_           decp_           align_
precn_            line_             diccat_         esdlim_
recn_             nblank_           dicname_        globo_
tbxver_           recbeg_           dictype_        nblanko_
                  recend_           dicver_         pdecp_
                  tabx_             glob_           plzero_
                                    list_           pposdec_
                                    long_           pposend_
                                    loop_           pposnam_
                                    lzero_          pposval_
                                    posdec_         pquote_
                                    posend_         ptabx_
                                    posnam_         saveo_
                                    posval_         tabl_
                                    quote_          xmlout_
                                    save_           xmlong_
                                    strg_
                                    tagname_
                                    text_
                                    type_

5.4.6. Variables

The CIFtbx library also contains a large number of variables declared in the common blocks in the file ciftbx.cmn that provide signals to the programmer on various aspects of the data reading and writing processes. These variables are described below in four broad categories, as shown in Table 5.4.6.1: general monitor variables, general control variables, input monitor variables and output control variables. Note that for all but special applications only the basic variables list_, loop_, strg_, text_ and type_ are usually used. These variables supplement the argument lists of the various commands, providing essential status information.

5.4.6.1. General monitor variables

These variables are returned by CIFtbx and provide information about the general status of processing.

file_: character string containing the file name of the current input file.

longf_: integer variable containing the length of the file name in file_.

precn_: integer variable containing the line number (starting from 1) of the last line written to the output CIF.

recn_: integer variable containing the line number (starting from 1) of the last line read from the input CIF.

tbxver_: character*32 variable that is the CIFtbx version and date in the form 'CIFtbx version N.N.N DD MMM YYYY' (some older versions of CIFtbx use a two-digit year and have a comma after the version number).

5.4.6.2. General control variables

These variables control CIFtbx commands. The user may accept the default values or may store new values into these variables to change the behaviour of the commands.

alias_: logical variable to control the use of data-name aliases for input items. If set to .true., aliases from the input dictionary may be used (see Section 5.4.7). The default is .true..

append_: logical variable to control reuse of the direct-access file. If set to .true., it will cause each call to ocif_ to append the information found to the current CIF. The default is .false..

line_: integer variable to set the input/output line limit for processing a CIF. The default value is 80 characters. This limit counts the visible printable characters of the line, not the system-dependent line terminators.

nblank_: logical variable to control the treatment of input blank strings. If set to .true., char_ or test_ will return the type as ‘null’ rather than ‘char’ when encountering a quoted blank.

recbeg_: integer variable to give the record number of the first record to be used. May be changed by the user to restrict access to a CIF.

recend_: integer variable to give the record number of the last record to be used. May be changed by the user to restrict access to a CIF.

tabx_: logical variable is set to .true. for tab stops to be expanded to blanks during the reading of a CIF. The default is .true..

5.4.6.3. Input monitor variables

These variables are returned by CIFtbx tools and are used to decide on subsequent actions in the program. The lengths of the character strings that hold data names and block names are controlled by the parameter NUMCHAR in the common-block declarations.

bloc_: character string containing the current data-block name.

decp_: logical variable is .true. if a decimal point is present in the input numeric value.

diccat_: character string containing the category name specified in the attached dictionaries.

dicname_: character string containing the root alias data name (see Section 5.4.7) specified in the attached dictionaries or, after a call to dict_, the name of the dictionary.

dictype_: character string containing the data-type code specified in the attached dictionaries. These types may be more specific (e.g. ‘float’ or ‘int’) than the types given by the variable type_ (e.g. ‘numb’).

dicver_: character string containing the version of a dictionary after a call to dict_.

glob_: logical variable is .true. if the current data block is a global block. The application is responsible for managing the relationship of global data to other data blocks.

list_: integer variable containing the sequence number of the current looped list. This value may be used by the application to identify variables that are in different lists or that are not in a list (a zero value).

long_: integer variable containing the length of the data string in strg_.

loop_: logical variable is .true. if another loop packet is present in the current looped list.

lzero_: logical variable is .true. if the input numeric value is of the form [sign]0.nnnn rather than [sign].nnnn.

posdec_: integer variable containing the column number (position along the line, counting from 1 at the left) of the decimal point for the last number read.

posend_: integer variable containing the column number (position along the line, counting from 1 at the left) of the last character for the last string or number read.

posnam_: integer variable containing the starting column (position along the line, counting from 1 at the left) of the last name or comment read.

posval_: integer variable containing the starting column (position along the line, counting from 1 at the left) of the last data value read.

quote_: character variable giving the quotation symbol found delimiting the last string read.

save_: logical variable is .true. if the current data block is a save frame, otherwise .false..

strg_: character variable containing the data name or string representing the data value last retrieved.

tagname_: character variable containing the data name of the current data item as it was found in the CIF. May differ from dicname_ because of aliasing.

text_: logical variable is .true. if another text line is present in the current input text block.

type_: character variable containing the data-type code of the current input data item. This will be one of the four-character strings ‘null’ (for missing data, the period or the question mark), ‘numb’ (for numeric data), ‘char’ (for most character data) or ‘text’ (for semicolon-delimited multi-line character data). For most purposes the type ‘text’ is a subtype of the type ‘char’, not a distinct data type. CIFtbx permits multi-line text fields to be used whenever character strings are expected.
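The numeric input variables above describe the form of values such as 5.959(1), in which the parenthesized digits are a standard uncertainty applying to the last decimal places of the value. The following self-contained sketch shows one way such a string can be split. It is not CIFtbx code: the routine getsu and the program suparse are hypothetical names, exponents are not handled, and the only behaviour borrowed from the text is that the numeric arguments are left unchanged when the string is not a number.

```fortran
      program suparse
C     Hypothetical sketch, not CIFtbx code: split a CIF numeric
C     string such as 5.959(1) into a value and its s.u.
      character*20 strg
      double precision numb, sdev
      logical ok, getsu
      numb = 0.0d0
      sdev = 0.0d0
      strg = '5.959(1)'
      ok = getsu(strg, numb, sdev)
      if (ok) write(6,'(a,f9.4,a,f9.4)') ' value',numb,' s.u.',sdev
      strg = 'orthorhombic'
      ok = getsu(strg, numb, sdev)
      if (.not.ok) write(6,'(a)') ' not a number'
      end

      logical function getsu(strg, numb, sdev)
C     Returns .false. and leaves numb and sdev untouched when
C     strg is not a number, mirroring the behaviour described
C     for numb_ and numd_.
      character strg*(*)
      double precision numb, sdev, v, s
      integer ip, iq, id, nd, ios
      getsu = .false.
      ip = index(strg, '(')
      iq = index(strg, ')')
      if (ip.eq.0) ip = len_trim(strg) + 1
      if (ip.le.1) return
      read(strg(1:ip-1), *, iostat=ios) v
      if (ios.ne.0) return
      s = 0.0d0
      if (iq.gt.ip+1) then
         read(strg(ip+1:iq-1), *, iostat=ios) s
         if (ios.ne.0) return
C        Scale the s.u. digits by the decimal places of the value
         id = index(strg(1:ip-1), '.')
         nd = 0
         if (id.gt.0) nd = ip - 1 - id
         s = s * 10.0d0**(-nd)
      endif
      numb = v
      sdev = s
      getsu = .true.
      return
      end
```

For the string 5.959(1) the sketch yields the value 5.959 with a standard uncertainty of 0.001; the character string orthorhombic is rejected. The len_trim intrinsic assumes a Fortran 90 or later compiler.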

5.4.6.4. Output control variables

These variables are specified to control the processing by CIFtbx commands that write CIFs.

aliaso_: logical variable to control the use of data-name aliases for output items. If set to .true., preferred synonyms from the input dictionary may be output (see Section 5.4.7). The default is .false..

align_: logical variable to control the column alignment of data values in loop_ lists output to a CIF. The default is .true..

esdlim_: integer variable to set the upper limit of appended standard uncertainty (e.s.d.) integers output by pnumb_. The default value is 19, which limits standard uncertainties to the range 2–19.

globo_: logical variable which if set to .true. will cause the output data block from pdata_ to be written as a global block.

nblanko_: logical variable controls the treatment of output blank strings. If set to .true., output quoted blank strings will be converted to an unquoted period (i.e. to a data item of type null). Recall that CIFtbx treats an unquoted period or question mark as being of type null.

pdecp_: logical variable controls the treatment of output decimal numbers. If set to .true., a decimal point will be inserted into numbers output by pnumb_ or pnumd_. If set to .false., a decimal point will be output only when needed. The default is .false..

plzero_: logical variable controls the treatment of leading zeros in output decimal numbers. If set to .true., a zero will be inserted before a leading decimal point. The default is .false..

pposdec_: integer variable to set the column number (position along the line, counting from 1 at the left) of the decimal point for the next number to be output.

pposend_: integer variable to set the position of the ending column for the next number or character string to be output. Used to pad with zeros or blanks.

pposnam_: integer variable to set the starting column of the next name or comment to be output.

pposval_: integer variable to set the position of the starting column of the next data value to be output.

pquote_: character variable containing the quotation symbol to be used for the next string written.

saveo_: logical variable is set to .true. for pdata_ to output a save frame, otherwise a data block is output.

ptabx_: logical variable is set to .true. for tab stops to be expanded to blanks during the creation of a CIF. The default is .true..

tabl_: logical variable is set to .true. for tab stops to be used in the alignment of output data. The default is .true..

xmlout_: logical variable is set to .true. to change the output style to XML conventions. Note that this is not a CML (Murray-Rust & Rzepa, 1999) output, but a literal translation from the input CIF. The default is .false..

xmlong_: logical variable is set to .true. to change the style of XML output if xmlout_ is .true.. When .true. (the default), XML tag names are the full CIF tag names with the leading underscore, _, removed. When .false., an attempt is made to strip the leading category name as well.

5.4.7. Name aliases

CIF dictionaries written in DDL2 permit data names to be aliased or equivalenced to other data names. This serves two purposes. First, it allows for the different data-name structures used in DDL1 and DDL2 dictionaries, and, second, it links equivalent data names within the DDL2 dictionary. Aliasing also allows the use of synonyms appropriate to the application.

CIFtbx is capable of handling aliased data names transparently so that both the input CIF and the application software can use any of the equivalent aliased names. In addition, an output CIF may be written with the data names specified in the CIFtbx functions or with names that have been automatically converted to preferred dictionary names. If more than one dictionary is loaded, the first aliases have priority. We call the preferred dictionary name the ‘root alias’.

The default behaviour of CIFtbx is to accept all combinations of aliases and to produce output CIFs with the exact names specified in the user calls. The interpretation of aliased data names is modified by setting the logical variables alias_ and aliaso_. When alias_ is set to .false., the automatic recognition and translation of aliases stops. When aliaso_ is set to .true., the automatic conversion of user-supplied names to dictionary-preferred alias names in writing data to output CIFs is enabled.

The preferred alias name is stored in the variable dicname_ following any invocation of a getting function, such as numb_ or test_. If alias_ is set to .false., dicname_ will correspond to the called name. The variable tagname_ is always set to the actual name used in an input CIF.

For example, the data name _atom_site_anisotrop.u[1][1] in the DDL2 mmCIF dictionary is aliased to the data name _atom_site_aniso_U_11 in the DDL1 core CIF dictionary. In the example application of Fig. 5.4.7.1, showing the CIFtbx function test_ run with both the mmCIF and core dictionaries loaded, the specified data name _atom_site_aniso_U_11 is used to inquire as to the names used in an input CIF.

      read(8,'(a)',end=400) name
      f1 = test_(name)
      write(6,'(2(3x,a32))') name,dicname_
      name=dicname_
      f1 = test_(name)
      write(6,'(2(3x,a32))') name,tagname_

Fig. 5.4.7.1. Example application accessing aliased data names.

The execution of this code results in the following printout.

   _atom_site_aniso_U_11              _atom_site_anisotrop.u[1][1]
   _atom_site_anisotrop.u[1][1]       _atom_site_aniso_U_11

5.4.8. Implementation of the tools

Implementation of the CIFtbx tools is straightforward. The supplied source files in all versions are: ciftbx.f, ciftbx.sys (used in ciftbx.f) and ciftbx.cmn (used in local applications). More recent versions of CIFtbx (version 2.4 and later) require certain additional source files: ciftbx.cmf, ciftbx.cmv, hash_funcs.f and clearfp.f (or clearfp_sun.f). The common file ciftbx.cmn must be ‘include’d into any local routines that use CIFtbx tools.

The library in ciftbx.f may be invoked by either (i) compiling it and linking the resulting object file, either as an object library or with explicit references in the application linking sequence (versions 2.4 and later require hash_funcs.o as an additional object file); or (ii) including ciftbx.f in the local application and compiling and linking it with the local program. Approach (i) is more efficient, but for some applications approach (ii) may be simpler.

5.4.9. How to read CIF data

The CIFtbx approach to reading CIF data is illustrated using a simple example program CIF_IN (Fig. 5.4.9.1), which reads the file test.cif (Fig. 5.4.9.2) and tests the input data items against the dictionary file cif_core.dic. The resulting output is shown in Fig. 5.4.9.3.

The program CIF_IN may be divided into the following steps, each tagged with the relevant reference letter in the comment records of the listing shown in Fig. 5.4.9.1.

A: Define the local variables. The CIFtbx variables are added with the line include 'ciftbx.cmn'.

B: Assign device numbers to the files using the command init_. The device number 1 refers to the input CIF, 3 to the scratch file and 6 (stdout) to the error-message file. The device number 2 refers to an output CIF, if we were to choose to write one.

C: Open a specific dictionary file named cif_core.dic with the command dict_. The code valid signals that the input data items are to be validated against the dictionary. In this application, dict_ is invoked in an IF statement that tests whether the command is successful.

D: Open the CIF test.cif with the command ocif_ and test that the file is opened.

E: Invoke the data_ command, containing a blank block code, to ‘open’ the next data block. The block name encountered is placed in the variable bloc_, which in this application is printed.

F: Read the cell-length values and their standard uncertainties with the numb_ command, and print these out. Test whether all of the requested data items are found.

G: The char_ function is used to read a single character string.

H: The name_ function is used to get the data name of the next data item encountered.

I: This sequence illustrates how text lines are read. The char_ function is used to read each line and the text_ variable is tested to see whether another text line exists in this data item.

J: This sequence illustrates how a looped list of items is read. Individual items are read using char_ or numb_ functions and the existence of another packet of items is tested with the variable loop_.

The resulting printout is shown in Fig. 5.4.9.3. In this figure, note the following:

(i) The first six lines of the printout are output by CIFtbx routines, not by the program CIF_IN. They occur when the data_ command is executed and data items in the block mumbo_jumbo are read from the CIF and checked against the dictionary file. Note that this is when the CIFtbx routines store the pointers and attributes of all items in the data block. All subsequent commands use these pointers to access the data.

(ii) The ‘####’ string in front of _cell_length_a in the input CIF makes this line a comment and makes it inaccessible to CIF_IN.

(iii) Data items may be read from a CIF in any order but looped items must normally be in the same list. If one needs to access looped items in different lists simultaneously, the bkmrk_ command is used to preserve CIFtbx loop pointers.

      PROGRAM CIF_IN
C
C....A.. Define the data variables
      include   'ciftbx.cmn'
      logical   f1,f2,f3
      character*32 name
      character*80 line
      real      cela,celb,celc,siga,sigb,sigc
      real      x,y,z,u,numb,sdev
      data      cela,celb,celc,siga,sigb,sigc /6*0.0/
C....B.. Assign the CIFtbx files
      f1 = init_( 1, 2, 3, 6 )
C....C.. Request dictionary validation check
      if(dict_('cif_core.dic','valid')) goto 100
      write(6,'(/a/)') ' Requested Core dictionary not present'
C....D.. Open the CIF to be accessed
100   name='test.cif'
      if(ocif_(name)) goto 120
      write(6,'(a///)') ' >>>>>>>>> CIF cannot be opened'
      stop
C....E.. Assign the data block to be accessed
120   if(.not.data_(' ')) goto 200
      write(6,'(/a,a/)') ' Access items in data block ',bloc_
C....F.. Extract some cell dimensions; test all is OK
      f1 = numb_('_cell_length_a', cela, siga)
      f2 = numb_('_cell_length_b', celb, sigb)
      f3 = numb_('_cell_length_c', celc, sigc)
      if(.not.(f1.and.f2.and.f3))
     *  write(6,'(a)') ' Cell lengths missing!'
      write(6,'(a,6f10.4)') ' Cell ',cela,celb,celc,siga,sigb,sigc
C....G.. Extract space group notation (expected char string)
      f1 = char_('_symmetry_cell_setting', name)
      write(6,'(a,a/)') ' Cell setting ',name(1:long_)
C....H.. Get the next name in the CIF and print it out
      f1 = name_(name)
      write(6,'(a,a/)') ' Next data name in CIF is ',name
C....I.. List the audit record (possible text line sequence)
      write(6,'(a)') ' Audit record'
140   f1 = char_('_audit_update_record', line)
      write(6,'(a)') line
      if(text_) goto 140
C....J.. Extract atom site data in a loop
      write(6,'(/a)') ' Atom sites'
160   f1 = char_('_atom_site_label', name)
      f2 = numb_('_atom_site_fract_x', x, sx)
      f2 = numb_('_atom_site_fract_y', y, sy)
      f2 = numb_('_atom_site_fract_z', z, sz)
      f3 = numb_('_atom_site_U_iso_or_equiv', u, su)
      write(6,'(1x,a4,4f8.4)') name,x,y,z,u
      if(loop_) goto 160
C
      goto 120
200   continue
      end

Fig. 5.4.9.1. Sample program CIF_IN. See text for explanation.

5.4.10. How to write a CIF

Writing a CIF usually is simpler than reading an existing one. An example of a CIF-writing program is shown in Fig. 5.4.10.1. This example is intentionally trivial. The created CIF test.new is shown in Fig. 5.4.10.2. Note that command dict_ causes all output items to be checked against the dictionary cif_core.dic. Unknown names are flagged in the output CIF with the comment ‘#< not in dictionary’. This applies to both looped and single data items.

A more complex example of writing a CIF is given in the program cif2cif available with the CIFtbx release. A similar program that reads a CIF and writes an XML file is cif2xml, also available with the CIFtbx release.
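The appended standard uncertainties in Fig. 5.4.10.2, such as 1759(13) and 1.00(10), are consistent with the esdlim_ rule of Section 5.4.6.4, which keeps the s.u. integer in the range 2–19. The following self-contained sketch illustrates that rule. It is an assumption-laden illustration, not the CIFtbx implementation: the routine fmtsu and the program suform are hypothetical, a positive s.u. is assumed, and values whose s.u. already exceeds the limit are only flagged.

```fortran
      program suform
C     Hypothetical sketch, not CIFtbx code: append a standard
C     uncertainty so that the s.u. integer is at least 2 and,
C     for small s.u. values, no larger than esdlim.
      character*24 out
      call fmtsu(1759.0d0, 13.0d0, 19, out)
      write(6,'(1x,a)') out
      call fmtsu(19.737d0, 0.003d0, 19, out)
      write(6,'(1x,a)') out
      call fmtsu(1.0d0, 0.1d0, 19, out)
      write(6,'(1x,a)') out
      end

      subroutine fmtsu(numb, sdev, esdlim, out)
      double precision numb, sdev
      integer esdlim, nd, isu
      character out*(*), fmt*8, vbuf*20, sbuf*8
C     Add decimal places until the rounded s.u. integer
C     reaches at least 2 (sdev > 0 is assumed)
      nd = 0
      isu = nint(sdev)
   10 if (isu.lt.2 .and. nd.lt.9) then
         nd = nd + 1
         isu = nint(sdev * 10.0d0**nd)
         goto 10
      endif
C     An s.u. above esdlim would need fewer significant
C     digits; that case is only flagged in this sketch
      if (isu.gt.esdlim) write(6,'(a)') ' s.u. above esdlim'
C     Write the value with nd decimal places, left-justified
      write(fmt,'(a,i1,a)') '(f20.', nd, ')'
      write(vbuf,fmt) numb
      vbuf = adjustl(vbuf)
C     With no decimal places, drop the trailing point
      if (nd.eq.0) vbuf(index(vbuf,'.'):) = ' '
      write(sbuf,'(i8)') isu
      out = trim(vbuf)//'('//trim(adjustl(sbuf))//')'
      return
      end
```

Run as written, the three calls give 1759(13), 19.737(3) and 1.00(10), the forms that appear in Fig. 5.4.10.2. The adjustl and trim intrinsics assume a Fortran 90 or later compiler.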

5.4.11. Error-message glossary

The CIFtbx routines will generate explicit error messages in the printout or in the created CIF if requested to do so (e.g. during dictionary checks). If data processing cannot continue (i.e. fatal errors), an appropriate error message is placed in the printout and execution terminates. However, the default approach is to remain mute and for error detection to be monitored by the application program via the CIFtbx functions returning .true. or .false. values that tell the application program whether the command was performed correctly. This places the primary responsibility for error checking on the application software. The importance of this approach is that it enables the local application to respond to runtime problems in a controlled way and to take corrective action if it is possible. However, some types of processing errors, such as exceeding the dimensions of critical CIFtbx arrays, do require appropriate messages to be issued and for execution to cease.
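Because CIFtbx is mute by default, the application has to inspect the logical value returned by each command. The following sketch shows one possible pattern for a single item, using only commands and monitor variables described in this chapter. The subroutine getvol is hypothetical, and the sketch assumes that ciftbx.cmn can be included, that the library is linked and that a data block has already been opened with ocif_ and data_.

```fortran
      subroutine getvol(vol, svol, ok)
C     Hypothetical helper: fetch _cell_volume and report a
C     failure instead of silently using stale values.
      include 'ciftbx.cmn'
      double precision vol, svol
      logical ok
      ok = numd_('_cell_volume', vol, svol)
      if (.not.ok) then
C        vol and svol are unchanged here; say where we were
         write(6,'(a,a,a,a,a,i6)') ' _cell_volume not usable in ',
     *     file_(1:longf_), ' data_', bloc_, ' near line', recn_
      endif
      return
      end
```

Returning the flag to the caller, rather than stopping inside the helper, keeps the decision about corrective action in the application, as the text recommends.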

data_mumbo_jumbo
_audit_creation_date            91-03-20
_audit_creation_method          from_xtal_archive_file_using_CIFIO
_audit_update_record
; 91-04-09 text and data added by Tony Willis.
  91-04-15 rec'd by co-editor with diagram as manuscript HL7
;
_dummy_test                     "rubbish to see what dict_ says"
_chemical_name_systematic
 trans-3-Benzoyl-2-(tert-butyl)-4-(iso-butyl)-1,3-oxazolidin-5-one
_chemical_formula_moiety        'C18 H25 N O3'
_chemical_formula_weight        303.40
_chemical_melting_point         ?

####_cell_length_a              5.959(1)
_cell_length_b                  14.956(1)
_cell_length_c                  19.737(3)
_cell_measurement_theta_min     25
_cell_measurement_theta_max     31
_symmetry_cell_setting          orthorhombic

loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_thermal_displace_type
_atom_site_calc_flag
 s   .20200 .79800 .91667 .030(3) Uij  ?
 o   .49800 .49800 .66667 .02520  Uiso ?
 c1  .48800 .09600 .03800 .03170  Uiso ?

loop_
_blat1
_blat2
 1 2 3 4 5 6 a b c d 7 8 9 0

Fig. 5.4.9.2. Example CIF read by the sample program CIF_IN shown in Fig. 5.4.9.1.

 CIFtbx warning: test.cif data_mumbo_jumbo line: 8
 Data name _dummy_test not in dictionary!
 CIFtbx warning: test.cif data_mumbo_jumbo line: 35
 Data name _blat1 not in dictionary!
 CIFtbx warning: test.cif data_mumbo_jumbo line: 35
 Data name _blat2 not in dictionary!

 Access items in data block mumbo_jumbo

 Cell dimension(s) missing!
 Cell     0.0000   14.9560   19.7370    0.0000    0.0010    0.0030
 Cell setting orthorhombic

 Next data name in CIF is _atom_type_symbol

 Audit record
 91-04-09 text and data added by Tony Willis.
 91-04-15 rec'd by co-editor with diagram as manuscript HL7

 Atom sites
 s     0.2020  0.7980  0.9167  0.0300
 o     0.4980  0.4980  0.6667  0.0252
 c1    0.4880  0.0960  0.0380  0.0317

Fig. 5.4.9.3. Printout from the example program CIF_IN run on the test file of Fig. 5.4.9.2.

CIFtbx error messages are in four parts: ‘warning’ or ‘error’ header line, the name of the file being processed, the current data block or save frame, and the line number. Another line contains the text of the message.

5.4.11.1. Fatal errors: array bounds

The following fatal messages are issued if the CIFtbx array bounds are exceeded. Operation terminates immediately. Array bounds can be adjusted by changing the PARAMETER values in ciftbx.sys. If the value of MAXBUF needs to be changed, the file ciftbx.cmv must also be updated.

    Input line_ value > MAXBUF
    Number of categories > NUMBLOCK
    Number of data names > NUMBLOCK
    Cifdic names > NUMDICT
    Dictionary category names > NUMDICT
    Items per loop packet > NUMITEM
    Number of loop_s > NUMLOOP

C....... Open a new CIF
400   if(pfile_('test.new')) goto 450
      write(6,'(//a/)') ' Output CIF by this name exists already!'
      goto 500
C....... Request dictionary validation check
450   if(dict_('cif_core.dic','valid')) goto 460
      write(6,'(/a/)') ' Requested Core dictionary not present'
C....... Insert a data block code
460   f1 = pdata_('whoops_a_daisy')
C....... Enter various single data items to show how
      f1 = pchar_('_audit_creation_method','using CIFtbx')
      f1 = pchar_('_audit_creation_extra2',"Terry O'Connell")
      f1 = pchar_('_audit_creation_extra3','Terry O"Connell')
      f1 = ptext_('_audit_creation_record',' Text data may be ')
      f1 = ptext_('_audit_creation_record',' entered like this')
      f1 = ptext_('_audit_creation_record',' or in a loop.')
      f1 = pnumb_('_cell_measurement_temperature', 293., 0.)
      f1 = pnumb_('_cell_volume', 1759.0, 13.)
      f1 = pnumb_('_cell_length_b', 8.75353553524313,0.)
      f1 = pnumb_('_cell_length_c', 19.737, .003)
C....... Enter some looped data
      f1 = ploop_('_atom_type_symbol')
      f1 = ploop_('_atom_type_oxidation_number')
      f1 = ploop_('_atom_type_number_in_cell')
      do 470 i=1,3
      f1 = pchar_(' ',alpha(1:i))
      f1 = pnumb_(' ',float(i),float(i)*0.1)
470   f1 = pnumb_(' ',float(i)*8.64523,0.)
C....... Do it again but as contiguous data with text data
      f1 = ploop_('_atom_site_label')
      f1 = ploop_('_atom_site_occupancy')
      f1 = ploop_('_some_silly_text')
      do 480 i=1,2
      f1 = pchar_(' ',alpha(1:i))
      f1 = pnumb_(' ',float(i),float(i)*0.1)
480   f1 = ptext_(' ',' Hi Ho the diddly oh!')
500   call close_

Fig. 5.4.10.1. Sample program to create a CIF.

data_whoops_a_daisy
_audit_creation_method         'using CIFtbx'
_audit_creation_extra2         'Terry O'Connell'   #< not in dictionary
_audit_creation_extra3         'Terry O"Connell'   #< not in dictionary
_audit_creation_record
;Text data may be
 entered like this
 or in a loop.
;
_cell_measurement_temperature  293
_cell_volume                   1759(13)
_cell_length_b                 8.75354
_cell_length_c                 19.737(3)

loop_
_atom_type_symbol
_atom_type_oxidation_number
_atom_type_number_in_cell
a    1.00(10)  8.64523
ab   2.0(2)    17.2905
abc  3.0(3)    25.9357

loop_
_atom_site_label
_atom_site_occupancy
_some_silly_text               #< not in dictionary
a    1.00(10)
;Hi Ho the diddly oh!
;
ab   2.0(2)
;Hi Ho the diddly oh!
;

Fig. 5.4.10.2. Sample CIF created by the example program of Fig. 5.4.10.1.
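The numeric values in Fig. 5.4.10.2 follow the usual CIF convention of appending the standard uncertainty in parentheses, scaled to the last digits shown. As a rough illustration only, the Python sketch below reproduces several of those values. It is not CIFtbx code: the function name format_su and its rounding rule (keep the integer uncertainty between 2 and a limit of 19, the value to which esdlim_ is reset) are assumptions chosen to match the figure, and uncertainties too large to fit the rule are simply written unscaled.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_su(value, su, esdlim=19):
    """Format a value with its standard uncertainty, e.g. 19.737(3).

    Hypothetical helper: the rule used here is an assumption made for
    illustration, not the algorithm implemented in CIFtbx.
    """
    if su <= 0:
        # No uncertainty supplied: fall back to six significant digits.
        return f"{value:.6g}"
    v, s = Decimal(str(value)), Decimal(str(su))
    places = 0
    for trial in range(0, 10):
        scaled = int(s.scaleb(trial).quantize(Decimal(1), rounding=ROUND_HALF_UP))
        if 2 <= scaled <= esdlim:
            places = trial
            break
    # Integer uncertainty and value rounded to the same number of places.
    digits = int(s.scaleb(places).quantize(Decimal(1), rounding=ROUND_HALF_UP))
    rounded = v.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)
    return f"{rounded}({digits})"


for value, su in [(1759.0, 13.0), (19.737, 0.003), (1.0, 0.1), (2.0, 0.2), (293.0, 0.0)]:
    print(format_su(value, su))
# prints 1759(13), 19.737(3), 1.00(10), 2.0(2) and 293, one per line
```

Decimal arithmetic is used so that uncertainties such as 0.1 scale exactly; binary floating point would make the threshold test fragile.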


However, the message

More than MAXBOOK bookmarks requested

is not ‘fatal’, in the sense that the function bkmrk_ returns .false. to permit appropriate action before termination. This is effectively a fatal error for which recompilation with a larger value of MAXBOOK is necessary. However, this is usually the result of a logic error in the application, and the error has been made non-fatal to allow the programmer to insert debugging code, if desired. The application should clean up and exit promptly.

5.4.11.2. Fatal errors: data sequence, syntax and file construction

Dict_ must precede ocif_

Dictionary files must be loaded before an input CIF is opened because some checking occurs during the CIF loading process.

Illegal tag/value construction

Data name (i.e. a ‘tag’) and data values are not matched (outside a looped list). This usually means that a data name immediately follows another data name, or a data value was found without a preceding data name. The most likely cause of this error is the failure to provide ‘.’ or ‘?’ for missing or unknown data values or a failure to declare a loop_ when one was intended.

Item miscount in loop

Within a looped list the total number of data values must be an exact multiple of the number of data names in the loop_ header.

Prior save-frame not terminated
Save-frame terminator found out of context

Save frames must start with save_framecode and end with save_. These messages will be issued if this does not occur.

Syntax construction error

Within a data block or save frame the number of data values does not match the number of data names (ignoring loop structures). This message should occur only if there is an internal logic error in the library. Normally the program will terminate on Item miscount in loop first.

Unexpected end of data

When processing multi-line text the end of the CIF is encountered before the terminal semicolon.

5.4.11.3. Fatal errors: invalid arguments

The following messages are generated by calls with invalid arguments.

Call to find_ with invalid arguments
Internal error in putnum

5.4.11.4. Warnings: input errors

Category first implicitly defined in cif

The category code in the DDL2 data name is not matched by an explicit definition in the dictionary. This may be intentional but usually indicates a typographical error in the CIF or the dictionary.

Data name not in dictionary!

The data item name was used in the CIF but could not be found in the dictionary.

Data block header missing

No data_ or global_ was found when expected.

Duplicate data item

Two or more identical data names have been detected in a data block or save frame.

Exponent overflow in numeric input
Exponent underflow in numeric input

The numeric value being processed has an exponent that cannot be processed on this machine. If the string involved is not intended as a number, then surrounding it with quotes may resolve the problem.

Heterogeneous categories in loop vs

Looped lists should not contain data items belonging to different categories. This error occurs if the category of a new data item fails to match the category of a prior data item. A special category (none) is used to denote item names for which no category has been declared. Warnings are not issued on this level for a loop for which all data items have no declared category.

Input line length exceeds line_

Non-blank characters were found beyond the value given by the variable line_. The default value for line_ is 80 (optionally increased to 2048 in CIFtbx 2.7 and later for CIF 1.1 compliance). The extra characters in column positions line_ + 1 through MAXBUF will be processed but the input file may need to be reformatted for use with other CIF-handling programs.

Missing loop_ items set as DUMMY

A looped list of output items was truncated with an incomplete loop packet (i.e. the number of items did not match the number of loop_ data names). The missing values were set to the character string ‘DUMMY’.

Numb type violated

The data item has been processed with an explicit dictionary type numb, but with a non-numeric value. Note that the values ‘?’ or ‘.’ will not generate this message.

Quoted string not closed

Character values may be enclosed by bounding quotes. The strict definition of a ‘quoted-string’ value is that it must start with a wq digraph and end with a qw digraph, where w is a white-space character (blank or tab) and q is a single or double quote, and the same type of quote mark is used in the terminal digraph as was used in the initial digraph. This message is issued if these conditions are not met.

5.4.11.5. Warnings: output errors

Converted pchar_ output to text for

An attempt was made to write a string with pchar_ instead of ptext_, but the string contains a combination of characters for which ptext_ must be used.

ESD less than precision of machine
Overflow of esd
Underflow of esd

A call to pnumb_ or numb_ was made with values of the number and standard uncertainty (e.s.d.) which cannot be presented properly on this machine. A bounding value of 0 or 99999 is used for the e.s.d.

Invalid value of esdlim_ reset to 19

In processing numeric output, a value of esdlim_ less than 9 or greater than 99999 was found. esdlim_ is then set to 19.


Missing loop_ name set as _DUMMY
Missing loop_ items set as DUMMY

In processing a loop_, a dummy string has been inserted for a missing header or value.

Output CIF line longer than line_

In outputting a line, the data exceed the limit specified in line_. This occurs only if a single data name or a value exceeds this limit.

Out-of-sequence call to end text block

The termination of a text block has been invoked before a text block has been started. This can only occur with irregular use of the CIFtbx routines rather than the standard interface routines.

Output prefix may force line overflow

A prefix string placed in prefx_ exceeds line_ less the allowed length of tags.

Prefix string truncated

A prefix string specified to prefx_ is longer than the maximum line length allowed. The prefix string is truncated and processing continues.

5.4.11.6. Warnings: dictionary checks

Aliases and names in different loops; only using first alias

If a DDL2 dictionary contains a loop of alias declarations, the corresponding data-name declarations are expected to be in the same loop. Only the first alias name is used.

Attempt to redefine category for item
Attempt to redefine type for item

If a DDL2 dictionary contains a category or type for a data item that conflicts with an earlier declaration, the first is used.

Categories and names in different loops
Types and names in different loops

If a DDL2 dictionary contains a loop of category or type declarations, the corresponding data-name declarations are expected to be in the same loop. Only the first category name or type is used.

Category id does not match block name

In a DDL2 dictionary, the save-frame code is expected to start with the category name. If a category name within the frame is not within a loop, it is checked against that in the frame code and a warning is issued if these do not match.

Conflicting definition of alias

A DDL2 dictionary contains a new declaration of a data-name alias which is in conflict with a previous alias definition. The first alias declaration is used.

Duplicate definition of same alias

A DDL2 dictionary contains a new declaration of an alias for a data name which duplicates a previously defined alias pair.

Item name does not match category name

If category checking is enabled and the category assigned to an item name does not match the initial characters of the item name, this message is issued. This may indicate a typographical error or a deprecated item in the dictionary.

Item type not recognised

The DDL2 dictionary type codes are translated to the DDL type codes ‘numb’, ‘char’ and ‘text’. If an unrecognized type code is found no translation occurs.

Multiple DDL category definitions
Multiple DDL name definitions
Multiple DDL type definitions
Multiple DDL related item definitions
Multiple DDL related item functions

DDL1 and DDL2 declarations for categories, data names, data types and related items are used in the same data block or save frame.

Multiple categories for one name
Multiple types for one name

A dictionary contains a loop of category or type definitions and an unlooped declaration of a single data name. The first category or type definition is used.

No category defined in block and name does not match

A DDL2 dictionary contains no category for the defined data item and it was not possible to derive an implicit category from the block name. This usually indicates a typographical error in the dictionary.

No category specified for name

A dictionary contains categories and category checking is enabled but no category is defined for the named data item.

No name defined in block
No name in the block matches the block name

These messages are issued if a dictionary save frame or data block contains no name definition or if all the names defined fail to match the block name.

No type specified for name

A type code is missing from a dictionary and type checking was requested in the dict_ invocation.

One alias, looped names, linking to first

A DDL2 dictionary may contain a list of data names and a single alias outside this loop. In this case, the correct name to which to link the alias must be derived implicitly. If the save-frame code matches the first name in the loop no warning is issued, because the use of the block name was probably the intended result, but if no such match is found this warning is issued.

5.4.12. Internals and programming style

CIFtbx is programmed in a highly portable Fortran programming style. However, on some older systems, some adaptation may be necessary to allow compilation. Implementors should be aware of the extensive use of variables in common blocks to transmit information and control execution (programming by side-effects), the use of the INCLUDE statement, the use of the ENDDO statement, the names of routines used internally by the package, the use of names longer than six characters and the use of names including the underscore character.

Some aspects of the internal organization of the library to deal with characteristics of CIFs are worth noting. CIFtbx copies an input CIF to a direct-access (i.e. random-access) file, but writes an output CIF directly. All data names are converted to lower case to deal with the case-insensitive nature of CIF. A hierarchy of parsing routines is used to deal with processing white space.

The major issues of programming style and internals are summarized here. See the Primer on the CD-ROM for more information.
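The lower-case conversion of data names mentioned in the introduction to Section 5.4.12 can be pictured with a short sketch. The Python fragment below is illustrative only and is not CIFtbx code; the class NameTable and its methods are hypothetical. It keys a table on the lower-cased data name while keeping the spelling that was found in the file.

```python
class NameTable:
    """Case-insensitive table of CIF data names (hypothetical helper)."""

    def __init__(self):
        # lower-cased name -> (original spelling, value)
        self._items = {}

    def store(self, name, value):
        self._items[name.lower()] = (name, value)

    def find(self, name):
        """Return (original spelling, value), or None if the name is absent."""
        return self._items.get(name.lower())


table = NameTable()
table.store("_Cell_Length_A", "60.440")
print(table.find("_cell_length_a"))   # ('_Cell_Length_A', '60.440')
```

Comparisons are made on the lower-cased key, so a request in any mixture of cases finds the item, and the original spelling remains available to the application.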


5.4.12.1. Programming style

A traditional Fortran style of programming is used in CIFtbx. Common blocks are declared to report and control the state of the processing. This allows argument lists to be kept short and avoids the need to create complex data structure types, but introduces extensive ‘programming by side-effects’. In order to reduce the impact of this approach on users, two different views of the common blocks are provided. The declarations in ciftbx.cmn are needed by all users. The more extensive declarations in ciftbx.sys, which include the same common declarations as are found in ciftbx.cmn and additional declarations used internally within CIFtbx, are provided for use in maintaining the library. Caution is needed in making internal modifications to the library to maintain the desired relationships among the actions of various routines and the states of variables declared in the common blocks.

Statements are written in the first 72 columns of a line, reserving columns one through five for statement labels and using column six for continuation. Approaches that would require the use of C libraries or non-portable Fortran extensions are avoided. For this reason, all the internal service routines are written in Fortran, all memory needed is preallocated with DIMENSION statements and a direct-access file is used to hold the working copy of a CIF.

5.4.12.2. Memory management

Since CIFtbx does static memory allocation with DIMENSION statements, it is sometimes necessary to adjust the array dimensions chosen to suit a particular application. It may also be necessary to increase the storage allocated for individual tags to allow for unusually long ones. The sizes of most arrays and strings used in CIFtbx that might require adjustment are controlled by PARAMETER statements in the files ciftbx.sys and ciftbx.cmv (the variable-declaration portion of ciftbx.cmn). The parameters are shown in Table 5.4.12.1.

These values can result in CIFtbx requiring more than a megabyte of memory. On smaller machines working with a small dictionary and simple CIFs, considerable space can be saved by reducing the values of NUMDICT and NUMBLOCK. On the other hand, an application working with several layered dictionaries and large and complex CIFs with many data items and many loops in a data block might require a version of CIFtbx with larger values of NUMDICT, NUMBLOCK and, perhaps, of NUMLOOP. The variables NUMPAGE and NUMCPP control the amount of memory to be used to buffer the direct-access file and the size of the data transfers to and from that file. Smaller values will reduce the demand for memory at the expense of slower execution.

Table 5.4.12.1. Parameter statements in CIFtbx

NUMCHAR    Maximum number of characters in data names (default 48)
MAXBUF     Maximum number of characters in a line (default 200, increased to 4096 in CIFtbx 2.7 and later)
NUMPAGE    Number of memory resident pages (default 10)
NUMCPP     Number of characters per page (default 16 384)
NUMDICT    Number of entries in dictionary tables (default 3200)
NUMHASH    Number of hash table entries (a modest prime, default 53)
NUMBLOCK   Number of entries in data-block tables (default 500)
NUMLOOP    Number of loops in a data block (default 50)
NUMITEM    Number of items in a loop (default 50)
MAXTAB     Maximum number of tabs in output CIF line (default 10)
MAXBOOK    Maximum number of simultaneous bookmarks (default 1000)

5.4.12.3. Use of INCLUDE

The INCLUDE statement allows the statements in the specified file to be treated as if they were being included in a program in place of the INCLUDE statement itself. This simplifies the maintenance of common-block declarations and is an important tool in keeping code well organized. In CIFtbx, the INCLUDE statement is used to bring the statements in the files ciftbx.cmn and ciftbx.sys into programs where they are needed, and to simplify ciftbx.cmn and ciftbx.sys by using INCLUDEs of the files ciftbx.cmv and ciftbx.cmf. The file ciftbx.cmv contains the definitions of the essential CIFtbx data structures as common blocks, for inclusion in both ciftbx.cmn for user applications and in ciftbx.sys for the CIFtbx library routines themselves. Most compilers handle the INCLUDE statement, but, if necessary, a user may replace any or all of the INCLUDE statements with the contents of the indicated file. For example, the only non-comments in ciftbx.cmn are

      include 'ciftbx.cmv'
      include 'ciftbx.cmf'

This means that the file ciftbx.cmn could be replaced by a concatenation of the two files ciftbx.cmv and ciftbx.cmf.

5.4.12.4. Use of ENDDO

CIFtbx makes some use of the ENDDO statement (as well as nested IF, THEN, ELSE, ENDIF constructs) to improve readability of the source code. Most compilers accept the ENDDO statement, but if conversion is needed then constructs of the form

      do index = istart, iend, incr
      ...
      enddo

should be changed to

      do nnn index = istart, iend, incr
      ...
  nnn continue

where nnn is a unique statement number, not used elsewhere in the same routine.

5.4.12.5. Names of internal routines

The following routines are used internally by later versions of CIFtbx. If these names are needed for other routines, then changes in the library will be needed to avoid conflicts.

(a) Variable initialization:

      block data

Critical CIFtbx variables are initialized with data statements in a block data routine.

(b) Control of floating-point exceptions:

      subroutine clearfp

If a system requires special handling of floating-point exceptions, the necessary calls should be added to this subroutine.

(c) Message processing:

      subroutine err (mess)
      character mess*(*)
      subroutine warn (mess)
      character mess*(*)
      subroutine cifmsg (flag, mess)
      character mess*(*), flag*(*)


Error and warning messages are processed through these three routines.

(d) Internal service routines:

      subroutine dcheck (name, type, flag, tflag)
      logical flag, tflag
      character name*(*), type*4
      subroutine eotext
      subroutine eoloop
      subroutine excat (sfname, bcname, lbcname)
      character*(*) sfname, bcname
      integer lbcname
      subroutine getitm (name)
      character name*(*)
      subroutine getstr
      subroutine getlin (flag)
      character flag*4
      subroutine putstr (string)
      character string*(*)

These routines are used internally by the library. The subroutine dcheck validates names against dictionaries. The subroutines eotext and eoloop are used to ensure termination of loops and text strings. The subroutines getitm, getstr and getlin extract items, strings and lines from the input CIF. The subroutine putstr writes strings to the output CIF.

(e) Numeric routines:

      subroutine ctonum
      subroutine putnum (numb, sdev, prec)
      double precision numb, sdev, prec

The routine ctonum converts a string to a number and its standard uncertainty. The subroutine putnum converts a number and standard uncertainty to an output string.

(f) String manipulation:

      subroutine detab
      integer function lastnb (str)
      character str*(*)
      character*(MAXBUF) function locase (name)
      character name*(*)

The subroutine detab converts tabs to blanks. The function lastnb finds the column position of the last non-blank character in a string. The function locase converts a string to lower case.

(g) Hash-table processing:

      subroutine hash_find (name, name_list, chain_list, list_length,
            num_list, hash_table, hash_length, ifind)
      character name*(*), name_list(list_length)
      integer hash_length, chain_list(list_length),
            hash_table(hash_length), ifind
      subroutine hash_store (name, name_list, chain_list, list_length,
            num_list, hash_table, hash_length, ifind)
      character name*(*), name_list(list_length)
      integer hash_length, chain_list(list_length),
            hash_table(hash_length), ifind
      integer function hash_value (name, hash_length)
      character name*(*)
      integer hash_length

These routines are used to manipulate the internal hash tables used by the library.

5.4.12.6. Use of the underscore character

All the externally accessible CIFtbx commands and variables terminate with the underscore character. This works well on most systems, but can cause occasional problems, because traditional Fortran does not include the underscore in the character set and some operating systems reserve the underscore as a system flag, for example to distinguish C-language library routines from those written in Fortran. If conversion is needed, and the local compiler allows long variable and subroutine names, then the simplest approach would be to make a local variant of CIFtbx in which every occurrence of underscore in a function, subroutine or variable name is changed to a distinctive character pattern (e.g. ‘CIF’ or ‘qq’), but caution is needed, since there are many character strings used in the library that include the underscore. For example, in changing the variable loop_ to loopCIF, it would be a mistake to change the statement

      if(strg_(1:5).eq.'loop_') type_='loop'

to

      if(strg_(1:5).eq.'loopCIF') type_='loop'

5.4.12.7. Names longer than six characters

CIFtbx uses some function, subroutine and variable names longer than six characters to improve readability, but, in most cases, consistent truncation of all uses of a name to six characters will not cause any problems.

5.4.12.8. File management

CIFtbx allows the user to read from one CIF while writing to another. The input CIF is first copied to a direct-access file to allow random access to desired portions of the input CIF. Since CIF allows data items to be presented in any order, the alternatives to the use of a direct-access file would have been to create memory-resident data structures for the entire CIF or to track position and make multiple search passes through the file as data items are requested.

When programming for personal and laboratory computers with limited memory and which may lack virtual memory capabilities, assuming the availability of enough memory for large CIFs would greatly restrict the applications within which CIFtbx could be used. However, the disk accesses involved in using a direct-access file slow execution. When working on larger computers, execution speed can be increased at the expense of memory by increasing the number of memory-resident pages (see the parameter NUMPAGE above). If the number of pages times the number of characters per page (NUMCPP) is large enough to hold the entire CIF, the application will run much faster.

Direct reading of the input CIF, making multiple passes when data items are requested in a different order to that in which they are presented in the CIF, is only practical when the number of out-of-order requests is small and the applications will not need to be used as a filter, perhaps reading the output of another program ‘on-the-fly’. Since we cannot predict the range of applications and CIFs for which CIFtbx will be used, and direct reading could become impossibly slow, CIFtbx uses a direct-access file.

The processing of an output CIF is simpler than reading a CIF. The application determines the order in which the writing is to be done. No sorting is normally needed. Therefore CIFtbx writes an output CIF directly.
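The trade-off described under File management (Section 5.4.12.8), between memory-resident pages and accesses to the direct-access file, can be sketched as a small page cache. The Python sketch below is an illustration under stated assumptions: the class PagedFile is hypothetical, num_page and num_cpp merely stand in for NUMPAGE and NUMCPP, and the least-recently-used replacement policy is an assumption, since the text does not say which policy CIFtbx uses.

```python
import os
import tempfile
from collections import OrderedDict


class PagedFile:
    """Read characters from a file through a small cache of fixed-size pages."""

    def __init__(self, path, num_page=10, num_cpp=16384):
        self._fh = open(path, "rb")
        self.num_page = num_page      # pages kept in memory
        self.num_cpp = num_cpp        # characters (bytes) per page
        self._pages = OrderedDict()   # page index -> bytes, oldest first
        self.disk_reads = 0           # counts transfers from the file

    def _page(self, index):
        if index in self._pages:
            self._pages.move_to_end(index)      # mark as recently used
            return self._pages[index]
        self._fh.seek(index * self.num_cpp)
        data = self._fh.read(self.num_cpp)
        self.disk_reads += 1
        self._pages[index] = data
        if len(self._pages) > self.num_page:
            self._pages.popitem(last=False)     # evict least recently used
        return data

    def char_at(self, offset):
        page = self._page(offset // self.num_cpp)
        pos = offset % self.num_cpp
        return page[pos:pos + 1].decode("ascii")

    def close(self):
        self._fh.close()


# Demonstration on a throwaway file, with deliberately tiny pages.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"data_demo\n_cell_length_a 60.440\n")
paged = PagedFile(tmp.name, num_page=2, num_cpp=8)
first = paged.char_at(0)      # loads page 0
later = paged.char_at(20)     # loads page 2
again = paged.char_at(1)      # page 0 is still resident, no file access
print(first, later, again, paged.disk_reads)   # d t a 2
paged.close()
os.remove(tmp.name)
```

When num_page times num_cpp covers the whole file, every page is read at most once, which corresponds to the fast case described in the text.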


5.4.12.9. Case sensitivity

A CIF may contain data names in upper, lower or a mixture of cases. Internally, CIFtbx does all its name comparisons in lower case, using the function locase (see above) to convert. Good style, however, dictates the use of certain case combinations in certain names. Therefore CIFtbx does this lower-case conversion as needed, preserving the original case for whatever use may be desired. An application needing maximum speed and which does not need to preserve the cases in the original CIF might consider doing the case conversion once and removing the use of locase.

5.4.12.10. Management of white space

CIF does not care about white space. One blank or tab is equivalent to many blanks or tabs or empty lines in separating data names from values and values from one another. The internal routine getstr extracts the next white-space-delimited string, using getlin to deliver input lines from the direct-access file as required. Since Fortran does not provide dynamic memory allocation, this approach presents a problem with multi-line text fields. Rather than allocate a large fixed space that might not hold still larger text fields, the library delivers those strings one line at a time. As with case sensitivity, CIFtbx does white-space scanning repeatedly, keeping the original presentation (including tabs) available should an application need access to it. The author of an application needing maximum speed, not needing the original presentation and wishing to conserve disk space might wish to modify the operation of CIFtbx to remove all comments and compress all separating white space to single blanks or line terminators in an initial sweep.

5.4.13. Distribution

Version 2.6.4 and an early release of version 3 of CIFtbx are included on the accompanying CD-ROM. As later versions are developed they will be available from the IUCr (http://www.iucr.org/iucr-top/cif) and authors’ (http://www.bernstein-plus-sons.com/software/ciftbx) web sites. The release kit is a compressed C-shell archive ciftbx.cshar.Z or a compressed shell archive ciftbx.shar.Z. Only one is needed. The uncompressed files ciftbx.cshar or ciftbx.shar are needed for implementation.

We are grateful to Frances C. Bernstein for her helpful comments and suggestions.

References

Bernstein, F. C. & Bernstein, H. J. (1996). Translating mmCIF data into PDB entries. Acta Cryst. A52 (Suppl.), C576. Software available at http://www.bernstein-plus-sons.com/software/cif2pdb.
Bernstein, H. J. (1997). cif2cif. CIF copy program. Bernstein + Sons, Bellport, NY, USA. Included in http://www.bernstein-plus-sons.com/software/ciftbx.
Bernstein, H. J., Bernstein, F. C. & Bourne, P. E. (1998). CIF applications. VIII. pdb2cif: translating PDB entries into mmCIF format. J. Appl. Cryst. 31, 282–295. Software available at http://www.bernstein-plus-sons.com/software/pdb2cif.
Bernstein, H. J. & Hall, S. R. (1998). CIF applications. VII. CYCLOPS2: extending the validation of CIF data names. J. Appl. Cryst. 31, 278–281. Software available at http://www.bernstein-plus-sons.com/software/ciftbx/cyclops.src.
Bray, T., Paoli, J. & Sperberg-McQueen, C. M. (1998). Extensible Markup Language (XML). W3C recommendation 10-February-1998. http://www.w3.org/TR/1998/REC-xml-19980210.
Hall, S. R. (1993a). CIF applications. II. CIFIO: for CIF input/output in the Xtal system. J. Appl. Cryst. 26, 474–479.
Hall, S. R. (1993b). CIF applications. IV. CIFtbx: a tool box for manipulating CIFs. J. Appl. Cryst. 26, 482–494.
Hall, S. R. & Bernstein, H. J. (1996). CIF applications. V. CIFtbx2: extended tool box for manipulating CIFs. J. Appl. Cryst. 29, 598–603.
Keller, P. A. (1996). A mmCIF toolbox for CCP4 applications. Acta Cryst. A52 (Suppl.), C576.
Murray-Rust, P. & Rzepa, H. (1999). Chemical markup, XML and the Worldwide Web. 1. Basic principles. J. Chem. Inf. Comput. Sci. 39, 928–942.


International Tables for Crystallography (2006). Vol. G, Chapter 5.5, pp. 539–543.

5.5. The use of mmCIF architecture for PDB data management

By J. D. Westbrook, H. Yang, Z. Feng and H. M. Berman

5.5.1. Introduction CRYST1 ATOM ATOM ATOM ATOM ATOM

The Protein Data Bank (PDB) is an archive for macromolecular structures (Bernstein et al., 1977; Berman et al., 2000) and a major component of a global resource for macromolecular structural science (Berman et al., 2003). The scale of its data handling operations is large, and depends on the effective exploitation of the latest developments in the science and technology of informatics. A significant component of its data storage and retrieval strategy is the management of structural data in mmCIF format with appropriate extensions. Over its 30-year history, the PDB archive has grown from seven entries in 1973 to a collection of over 30 000 structures as of May 2005. The growth in the size of the archive has been accompanied by increases in both data content and in the structural complexity of individual entries. As the PDB has grown, there has been a significant broadening of its user community. In response to this change, the role of the PDB has expanded from being simply a provider of structure data files to providing a key information resource for the structural biology community. Looking forward, an acceleration in the growth of the PDB archive is anticipated owing to developments in high-throughput structural determination methodologies and worldwide structural genomics efforts. To support the continued growth and evolution of the PDB archive, a framework is required that supports automation and scalability, and that can adapt to changes in both data content and delivery technology. At the core of the PDB informatics infrastructure is an ontology of data definitions which electronically encode domain information in the form of precise definitions, examples and controlled vocabularies. In addition to domain information, data definitions also encode information such as data type, data relationships, range restrictions and presentation units. 
The software-accessible PDB exchange data dictionary (Appendix 3.6.2) is the key part of the PDB informatics infrastructure. The exchange dictionary is an extension of the macromolecular Crystallographic Information File (mmCIF) data dictionary (Bourne et al., 1997). The dictionary provides the foundation for software tools which exchange and validate data, create and load databases, translate data formats, and serve application program interfaces. The components of the informatics infrastructure developed by the PDB are being used to build a data pipeline to support high-throughput structure determination.

60.440 ASP A ASP A ASP A ASP A ASP A

1 1 1 1 1

56.630 90.00 119.05 23.482 -0.621 24.897 -0.728 25.573 0.515 24.918 1.359 24.976 -0.729

90.00 C 1 2 1 -1.419 1.00 35.27 -1.885 1.00 32.46 -1.339 1.00 28.22 -0.744 1.00 29.11 -3.427 1.00 38.24

4 N C C O C

Fig. 5.5.2.1. Excerpt of records from a PDB data file.

While this PDB format has in general been adequate for representing coordinate data, it has proved less satisfactory for the description of related information such as chemical and biological features and experimental methodology. To provide a more rigorous data encoding that includes all of this related information, the Protein Data Bank has in recent years adopted a comprehensive ontology of structure and experiment based on the content of the mmCIF data dictionary.

5.5.2.1. PDB format For the past 30 years, the PDB has served as the single central repository for macromolecular structure data. The data format used to store archival entries in the PDB is a column-oriented data format resembling many data formats developed to accommodate the limitations of paper punched-card technology (see Chapter 1.1). An example of the data format is shown in Fig. 5.5.2.1. Many of the data records in this format are prefixed with a record tag (e.g. CRYST1, ATOM) followed by individual items of data. The specifications for the records in this data format are described informally by Callaway et al. (1996). In addition to the labelled records as in Fig. 5.5.2.1, many data records in the PDB format are presented as unstructured or only semi-structured remark records.

5.5.2.2. Ontology representation of macromolecular structure data In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) assumed the management responsibilities for the PDB. One important outcome was the change in the underlying data representation used to process PDB data. The PDB now collects and processes data using a data representation based on a comprehensive ontology of macromolecular structure and experiment: the PDB exchange data dictionary. This representation is an extension of the mmCIF data dictionary, now the standard data representation for experimentally determined three-dimensional macromolecular structures. The dictionary and data files based on this data ontology (Westbrook & Bourne, 2000) are expressed using Self-defining Text Archival and Retrieval (STAR) syntax (Chapter 2.1). Although the mmCIF dictionary was developed within the crystallographic community, the metadata model employed by mmCIF is quite general and has been adopted by other application domains including NMR, molecular modelling and molecular recognition (dictionaries are available at http://mmcif.pdb.org/). Within the crystallographic community, metadata dictionaries have also been

5.5.2. Representing macromolecular structure data Macromolecular structure data have historically been represented in a simple record-oriented format developed by the PDB; this format has been widely used in structural and computational biology.

Affiliations: John D. Westbrook, Huanwang Yang, Zukang Feng and Helen M. Berman, Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA.

Copyright © 2006 International Union of Crystallography


5. APPLICATIONS

developed for other types of diffraction experiments, electron-microscopy data and for the general description of image data. The metadata concepts and tools that have been developed to support mmCIF are sufficiently general that they may be applied to the description of data in virtually any application.

The demands of structural genomics projects have driven the development of extensions to capture an increased level of experimental detail. These are available at http://mmcif.pdb.org/. Extensions have also been introduced to describe NMR, cryo-electron microscopy and all aspects of protein production. The ability to rapidly add extensions and incorporate these into the PDB data-processing system is an important feature for supporting the rapidly evolving technologies associated with high-throughput structure determinations.

The mmCIF metadata architecture is built from three levels as illustrated in Fig. 5.5.2.2 (see also Chapter 2.6). Individual data files are described at the top level (e.g. Fig. 5.5.2.2a). The contents of these data files are defined by a data dictionary (e.g. Fig. 5.5.2.2b) in the next lower level (see Chapters 3.6 and 4.5). The attributes used in this data dictionary to build data definitions are in turn defined in the dictionary description language (DDL) (e.g. Fig. 5.5.2.2c) in the lowest level (see Chapters 2.6 and 4.10).

The major syntactical constructs used by mmCIF are illustrated in the data file example of Fig. 5.5.2.2(a). Each data item or group of data items is preceded by an identifying keyword. Groups of related data items are organized into data categories. Two categories, CELL and ENTITY_POLY_SEQ, are shown in the example. CELL contains an individual instance describing a single set of crystallographic cell constants. ENTITY_POLY_SEQ contains a loop_ (i.e. table) of instances describing a polymer residue sequence. Essentially all mmCIF data are described as a set of tabular data structures.
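The grouping of data items into categories is visible in the data names themselves: a name such as _cell.length_a carries the category (cell) before the full stop and the attribute (length_a) after it. The following is a minimal sketch of splitting a name on that convention; split_item_name is a hypothetical helper written for this illustration and is not part of any RCSB PDB tool.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): split an mmCIF data name of
 * the form "_category.attribute". The category part is copied into cat
 * (capacity cat_size) and a pointer to the attribute part inside name
 * is returned. NULL is returned if name does not have that form or the
 * category does not fit into cat. */
const char *split_item_name(const char *name, char *cat, size_t cat_size)
{
    const char *dot;
    size_t len;

    if (name == NULL || name[0] != '_')
        return NULL;
    dot = strchr(name, '.');
    if (dot == NULL || dot == name + 1 || dot[1] == '\0')
        return NULL;                  /* no category or no attribute */
    len = (size_t)(dot - (name + 1));
    if (len >= cat_size)
        return NULL;                  /* category would be truncated */
    memcpy(cat, name + 1, len);
    cat[len] = '\0';
    return dot + 1;
}
```

Names without the full stop, such as the core-CIF alias _cell_length_a quoted in Fig. 5.5.2.2(b), are rejected by this sketch because they do not expose the category explicitly.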
Each mmCIF data item is defined in a data dictionary. Data definitions are given between save-frame delimiters (i.e. save_); apart from this, the data definitions share the same simple syntax as used in data files. An example definition for a crystallographic cell constant is shown in Fig. 5.5.2.2(b). Many features of the cell constant are described in this definition, including data type, range restrictions, units of expression, dependent quantities, related definitions, necessity and related precision estimate. Although not shown in this example, dictionary definitions can also include parent–child relationships that have important consequences in maintaining data consistency. The attributes of each data definition are defined in the DDL dictionary. Fig. 5.5.2.2(c) shows example DDL definitions describing data types. DDL definitions have the same syntax as definitions used in the data dictionary. Because the attributes of the DDL are also used in DDL definitions, this metadata architecture is described as self-defining. The RCSB PDB distributes parsing tools that support all three levels of this metadata architecture (http://sw-tools.pdb.org/). The CIFPARSE OBJ package (Tosic & Westbrook, 2000) provides high-level methods to read, write, validate and manage data from data files, dictionaries and DDLs. Data files can be validated relative to an input data dictionary, and dictionary files can be validated relative to an input DDL. CIFPARSE OBJ stores information in a collection of table objects. Access methods are provided to search and manipulate the table objects. A companion package, CIFOBJ (Schirripa & Westbrook, 1996), provides an alternative representation of dictionary and DDL data. CIFOBJ organizes dictionary information into a collection of category and item-level objects. Access methods are provided for all dictionary attributes.

    _cell.entry_id       W1QQQ
    _cell.length_a       129.230
    _cell.length_b       60.440
    _cell.length_c       56.630
    _cell.angle_alpha    90.00
    _cell.angle_beta     119.05
    _cell.angle_gamma    90.00
    _cell.Z_PDB          4

    loop_
    _entity_poly_seq.entity_id
    _entity_poly_seq.num
    _entity_poly_seq.mon_id
      1  1 ASP      1  2 ILE
      1  3 VAL      1  4 LEU
      1  5 THR      1  6 GLN
      1  7 SER      1  8 PRO
      1  9 ALA      1 10 SER

(a)

    save__cell.length_a
        _item_description.description
    ;   Unit-cell length a corresponding to the structure reported.
    ;
        _item.name                      '_cell.length_a'
        _item.category_id               cell
        _item.mandatory_code            no
        _item_aliases.alias_name        '_cell_length_a'
        _item_aliases.dictionary        cif_core.dic
        _item_aliases.version           2.0.1
        loop_
        _item_dependent.dependent_name
                                        '_cell.length_b'
                                        '_cell.length_c'
        loop_
        _item_range.maximum
        _item_range.minimum
                                        .     0.0
                                        0.0   0.0
        _item_related.related_name      '_cell.length_a_esd'
        _item_related.function_code     associated_esd
        _item_sub_category.id           cell_length
        _item_type.code                 float
        _item_type_conditions.code      esd
        _item_units.code                angstroms
    save_

(b)

    save_ITEM_TYPE_LIST
        _category.description
    ;   Attributes which define each type code.
    ;
        _category.id                    item_type_list
        _category.mandatory_code        no
        _category_key.name              '_item_type_list.code'
        loop_
        _category_group.id
                                        'ddl_group'
                                        'item_group'
    save_

    save__item_type_list.code
        _item_description.description
    ;   The codes specifying the nature of the data value.
    ;
        loop_
        _item.name
        _item.category_id
        _item.mandatory_code
            '_item_type_list.code'   item_type_list   yes
            '_item_type.code'        item_type        yes
        _item_type.code                 code
        _item_linked.child_name         '_item_type.code'
        _item_linked.parent_name        '_item_type_list.code'
    save_

(c)

Fig. 5.5.2.2. Files at different levels of the mmCIF metadata architecture. (a) mmCIF data file excerpt. (b) Example mmCIF data dictionary definition. (c) Example DDL dictionary attribute definition.


5.5. THE USE OF mmCIF ARCHITECTURE FOR PDB DATA MANAGEMENT

depositing data and that of the interface used for annotating the data. The depositor's interface focuses only on data collection and provides the simplest possible presentation of the information to be input. The annotator's interface exposes the detail of all possible data items as well as the full functionality of the supporting data-processing software and database system.

Although the ADIT system was originally developed to support the centralized data deposition and annotation of macromolecular structure data, it is not limited to these particular applications. Because the architecture of the ADIT system derives the full scope of information to be processed from a data dictionary, the system can transparently provide data input and processing functionality for any content domain. This feature has been exploited in building a data-input tool for the BioSync project (Kuller et al., 2002). The ADIT system can also be configured in workstation mode to provide single-user data collection and processing functionality. This version of the ADIT system as well as the supporting mmCIF parsing and data-management tools are currently distributed by the RCSB PDB under an open-source licence (http://sw-tools.pdb.org/apps/ADIT).

5.5.2.3. Supporting other data formats and data delivery methods

One of the greatest benefits of a dictionary-based informatics infrastructure is the flexibility that it provides in supporting alternative data formats and delivery methods. Because the data and all of their defining attributes are electronically encoded, translation between data and dictionary formats can be achieved using lightweight software filters without loss of any information.

XML provides a particularly good example of the ease with which data can be converted to and from the mmCIF format. XML translations of mmCIF data files are currently provided on the RCSB PDB beta ftp site (ftp://beta.rcsb.org/pub/pdb/uniformity/data/XML/). These XML files use mmCIF dictionary data-item names as XML tags. These files were created by a translation tool (http://sw-tools.pdb.org/apps/MMCIF-XMLUTIL/) that translates mmCIF data files to XML in compliance with an XML schema. The XML schema is similarly software-translated from the PDB exchange data dictionary.

Other delivery methods such as Corba (http://www.omg.org/cgi-bin/doc?lifesci/00-02-02) do not require a data format, as data are exchanged using an application program interface (API). A Corba API for macromolecular structure (Greer et al., 2002) based on the content of the mmCIF data dictionary has been approved by the Object Management Group (OMG). Software tools supporting this Corba API (OpenMMS, http://openmms.sdsc.edu, and FILM, http://sw-tools.pdb.org/apps/FILM) take full advantage of the data dictionary in building the interface definitions and supporting server on which the API is based (see also Section 5.3.8.2).

5.5.3.1. ADIT: functional description The basic functions of the ADIT deposition system are shown in Fig. 5.5.3.2. Users interact with the ADIT system through a web server. The CGI components of the ADIT system (that is, functional software components interacting with web input data through the Common Gateway Interface protocol) dynamically build the HTML that provides the system user interface. These CGI components are currently implemented as compiled binaries from C++ source code. User data can be provided in the form of data files or as keyboard input. Input files can be accepted in a variety of formats. ADIT uses a collection of format filters to convert input data to the data specification defined in a persistent data dictionary. Data in the form of data files are typically loaded first. Any input data that are not included in uploaded files can be keyed in by the user. ADIT builds a set of HTML forms for each category of data to be input. At any point during an input session, a user may choose to view or deposit the input data. Users who are depositing data may also use the data-validation services through the ADIT interface. Comprehensive data ontologies like the PDB exchange dictionary contain vast numbers of data definitions. A data-input application may only need to access a small fraction of these definitions at any point. To address the problem of selecting only the relevant set of input data items from a data dictionary ADIT uses a view database. In addition to defining the scope of the data items to be edited by the ADIT application, an ADIT data view also stores

5.5.3. Integrated data-processing system: overview The RCSB PDB data-processing system has been designed to take full advantage of the features of the mmCIF metadata framework. The AutoDep Input Tool (ADIT) is an integrated data-processing system developed to support deposition, data processing and annotation of three-dimensional macromolecular structure data. This system, which is outlined in Fig. 5.5.3.1, accepts experimental and structural data from a user for deposition. Data are input in the form of data files or through a web-based form interface. The input data can be validated in a very basic sense for syntax compliance and internal consistency. Other computational validation can also be applied, including checking the input structure data against a variety of community standard geometrical criteria and comparing the input experimental data with the derived structure model. The suite of validation software used within ADIT is distributed separately (http://sw-tools.pdb.org/apps/VAL/). All of this validation information is returned to the user as a collection of HTML reports. In addition to providing data-validation reports, ADIT also encodes data in archival data files and loads data into a relational database. The loading of data into the relational database is aided by an expert annotator. The ADIT system customizes its behaviour according to the user’s requirements. One important distinction is between the behaviour of the interface provided for

Fig. 5.5.3.2. Schematic diagram of ADIT editing, format translation and validation functions.

Fig. 5.5.3.1. Functional diagram of the ADIT system.


Fig. 5.5.3.4. Schematic diagram of ADIT database loading functions.

database instances. The relation of the database loader to the central components of the ADIT system is shown in Fig. 5.5.3.4. Schemata are defined in a metadata repository that is accessed by the loader application. In the simplest case, a schema can be constructed that is modelled directly from the data dictionary. Since the data model underlying the dictionary description language used to build ADIT data dictionaries is essentially relational, mapping a data dictionary specification to a relational schema is straightforward. In other cases, a mapping is required between the target schema and the data dictionary specification. This mapping is encoded in the schema metadata repository. The database loader uses this mapping information to extract items from data files and translate these data into a form that can be loaded into the target database schema. The definition of the mapping operation can include: selection operations with equijoin constraints (e.g. the value of _entity.type where _entity.id = 1), aggregation (e.g. count, sum, average), collapse (e.g. vector to string), type conversions and existence tests. Schema definitions are converted by the database loader into SQL instructions that create the defined tables and indices. Loadable data are produced either as SQL insert/update instructions or in the more efficient table copy formats used by popular database engines (i.e. DB2, Sybase, Oracle and MySQL). Loadable data can also be produced in XML.
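The simplest case described above, in which a schema is modelled directly on the data dictionary, can be sketched as one table per category and one column per data item. The code below is only an illustration under that assumption: the structure column_def, the function category_to_sql and the choice of SQL types are hypothetical and do not reproduce the mmCIF Loader or its schema metadata format.

```c
#include <stdio.h>

/* One dictionary item mapped to one table column (illustration only). */
struct column_def {
    const char *name;      /* attribute part of the item name, e.g. "length_a" */
    const char *sql_type;  /* SQL type assumed for the dictionary type code */
};

/* Write "CREATE TABLE <category> (<column> <type>, ...);" into buf.
 * Returns buf on success or NULL if the statement does not fit. */
char *category_to_sql(char *buf, size_t size, const char *category,
                      const struct column_def *cols, size_t ncols)
{
    size_t used;
    int n = snprintf(buf, size, "CREATE TABLE %s (", category);

    if (n < 0 || (size_t)n >= size)
        return NULL;
    used = (size_t)n;
    for (size_t i = 0; i < ncols; i++) {
        n = snprintf(buf + used, size - used, "%s%s %s",
                     i ? ", " : "", cols[i].name, cols[i].sql_type);
        if (n < 0 || (size_t)n >= size - used)
            return NULL;
        used += (size_t)n;
    }
    n = snprintf(buf + used, size - used, ");");
    if (n < 0 || (size_t)n >= size - used)
        return NULL;
    return buf;
}
```

A mapped schema of the kind described in the text would add a translation step between the dictionary items and the column list; that step is deliberately left out of this sketch.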

Fig. 5.5.3.3. Example ADIT data-input screen.

presentation details that are used in building the HTML input forms. An important use of the data view is to provide a simple and intuitive presentation of information for novice users which disguises the complex details of a data dictionary. Fig. 5.5.3.3 shows an example ADIT editing screen for the crystallographic unit cell. The data dictionary category containing this information is named CELL, and the length of the first cell axis is defined in the dictionary as _cell.length_a (Fig. 5.5.2.2b). In this case, the data view has substituted Unit Cell and Length a for the dictionary data names. Although this example is simple, some dictionary data names are as long as 75 characters, and in these instances the ability to display a simpler name is essential. Precise dictionary definitions and examples obtained from the data dictionary are accessible from the ADIT interface through buttons next to each data item.

ADIT makes full use of the dictionary specification in data-input operations. Data items defined to assume only specific values have pulldown menus or selection boxes. Data type and range restrictions are checked when data are input and diagnostics are displayed to the user if errors are detected.

For performance reasons, the data dictionary is converted from its tabular text structure to an object representation using CIFOBJ. The class supporting the object representation provides efficient access functions to all of the data dictionary attributes. A dictionary loader is used to check the consistency of the data dictionary and to load the object representation from the text form of the data dictionary. Any dictionary that complies with the dictionary description language (DDL2) can be loaded and used by ADIT. All ADIT software components gain their knowledge of the input data from the data dictionary and any associated data views. Consequently, ADIT can be tailored for use in virtually any data-input and data-processing application.
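The range check mentioned above can be illustrated with the two _item_range rows shown for _cell.length_a in Fig. 5.5.2.2(b). The sketch assumes the usual DDL2 reading of such rows: a row with different bounds admits values strictly between them, a row with equal bounds admits exactly that value, and a bound written as '.' is unbounded. The names item_range, value_in_ranges and cell_length_ranges are hypothetical; this is not the ADIT implementation.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* One row of an _item_range loop. A bound given as '.' in the
 * dictionary is represented here by an infinity. */
struct item_range {
    double minimum;
    double maximum;
};

/* Assumed convention: minimum < maximum admits values strictly between
 * the bounds; minimum == maximum admits exactly that boundary value.
 * A value is accepted if any row admits it. */
bool value_in_ranges(double v, const struct item_range *rows, size_t nrows)
{
    for (size_t i = 0; i < nrows; i++) {
        if (rows[i].minimum == rows[i].maximum) {
            if (v == rows[i].minimum)
                return true;
        } else if (v > rows[i].minimum && v < rows[i].maximum) {
            return true;
        }
    }
    return false;
}

/* The rows of Fig. 5.5.2.2(b), which lists maximum before minimum:
 * (., 0.0) and (0.0, 0.0). Together they admit any value >= 0. */
static const struct item_range cell_length_ranges[] = {
    { 0.0, INFINITY },
    { 0.0, 0.0 },
};
```

Under these assumptions a cell length of 129.230 or 0.0 passes the check and a negative length is rejected, at which point an input tool would display a diagnostic.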

5.5.3.3. Building a structure-determination data pipeline

One goal of high-throughput structural genomics is the automatic capture of all the details of each step in the process of structure determination. Fig. 5.5.3.5 shows a simplified structure-determination data pipeline. The essential details of each pipeline step are extracted and later assembled to make a data file for PDB deposition. The RCSB PDB data-processing infrastructure has been developed in anticipation of a data pipeline in which automated deposition would be the terminal step. The dictionary technology and software tools developed by the RCSB PDB to process and manage mmCIF data can be reused to provide the data-handling operations required to build the pipeline.

Dictionary definitions have been carefully developed to describe the details of each step in the structure-determination pipeline. These data items are typically accessible in electronic form after each program step. The information is either exported directly in mmCIF format or is printed in a program output file. To deal with the latter case, a utility program, PDB EXTRACT (http://sw-tools.pdb.org/apps/PDB EXTRACT), has been developed to parse program output files and extract key data values. In either case, the results of this incremental extraction of data from each program step must be merged to build a complete mmCIF

5.5.3.2. Generalized database support In addition to the data editing and processing functions, ADIT also supports a versatile database loader (mmCIF Loader; http://sw-tools.pdb.org/apps/MMCIF-LOADER) that builds database schemata and extracts the processed data required to load

Fig. 5.5.3.5. Schematic diagram of a structure-determination data pipeline.

data file ready for deposition. The PDB EXTRACT program also carries out this merging operation.

Some steps in the structure-determination pipeline may not be driven by software. For instance, the details of protein production may be held in laboratory databases or within laboratory notebooks. A version of ADIT with a data view including all of the structural genomics data extensions has been created for entering these data. This ADIT tool can also be used to validate and check the completeness of the final data file.

5.5.4. Access

All of the software tools and libraries described in this chapter are distributed with full source under an open-source licence. Applications are also distributed in binary form for Intel/Linux, Sun/Solaris, SGI/IRIX and Dec Alpha platforms. Several are also included on the CD-ROM accompanying this volume.

The RCSB/PDB is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology. RCSB/PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).

References

Berman, H. M., Henrick, K. & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Struct. Biol. 10, 980.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.
Bourne, P. E., Berman, H. M., McMahon, B., Watenpaugh, K. D., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). Macromolecular Crystallographic Information File. Methods Enzymol. 277, 571–590.
Callaway, J., Cummings, M., Deroski, B., Esposito, P., Forman, A., Langdon, P., Libeson, M., McCarthy, J., Sikora, J., Xue, D., Abola, E., Bernstein, F., Manning, N., Shea, R., Stampf, D. & Sussman, J. (1996). Protein Data Bank contents guide: Atomic coordinate entry format description. Brookhaven National Laboratory, New York, USA. Available from http://www.rcsb.org/pdb/docs/format/pdbguide2.2/part 0.html.
Greer, D. S., Westbrook, J. D. & Bourne, P. E. (2002). An ontology driven architecture for derived representations of macromolecular structure. Bioinformatics, 18, 1280–1281.
Kuller, A., Fleri, W., Bluhm, W. F., Smith, J. L., Westbrook, J. & Bourne, P. E. (2002). A biologist’s guide to synchrotron facilities: the BioSync web resource. Trends Biochem. Sci. 27, 213–215.
Schirripa, S. & Westbrook, J. D. (1996). CIFOBJ. A class library of mmCIF access tools. Reference guide. http://sw-tools.pdb.org/apps/CIFOBJ/cifobj/index.html.
Tosic, O. & Westbrook, J. D. (2000). CIFParse. A library of access tools for mmCIF. Reference guide. http://sw-tools.pdb.org/apps/CIFPARSEOBJ/cifparse/index.html.
Westbrook, J. & Bourne, P. E. (2000). STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics, 16, 159–168.

International Tables for Crystallography (2006). Vol. G, Chapter 5.6, pp. 544–556.

5.6. CBFlib: an ANSI C library for manipulating image data

By P. J. Ellis and H. J. Bernstein

5.6.1. Introduction

CBFlib is a library of ANSI C functions providing a simple mechanism for accessing crystallographic binary files (CBFs) and image-supporting CIF (imgCIF) files (see Chapters 2.3, 3.7 and 4.6). The CBFlib application programming interface (API) consists of a set of low-level functions and a set of high-level functions. The low-level functions are loosely based on the CIFPARSE (Tosic & Westbrook, 2000) API for mmCIF files. As in CIFPARSE, the low-level functions in CBFlib do not perform any semantic integrity checks and simply create, read, modify and write files conforming to the CIF syntax, with additional functionality for working with binary sections. These basic functions are independent of the details of the CBF/imgCIF dictionary and can be used to manage any CIF data set. In contrast, the high-level functions are based on the CBF/imgCIF dictionary and facilitate writing or reading commonly used entries to or from CBF and imgCIF data files.

External to a program, a CBF/imgCIF data set ‘lives’ in a file. Internally, when managed by CBFlib, a CBF or imgCIF data set has a simple tree structure pointed to by a ‘handle’ (Fig. 5.6.1.1). At the highest level are named data blocks. Each data block may contain a number of named categories. Within each category, the actual data entries are stored in tabular form with named columns and numbered rows. The numbers of rows in different columns of a given category are constrained by the software to be the same.

Table 5.6.1.1. Return values from CBFlib functions

Error code            Hexadecimal value   Meaning
CBF_FORMAT            1                   Invalid file format
CBF_ALLOC             2                   Memory allocation failed
CBF_ARGUMENT          4                   Invalid function argument
CBF_ASCII             8                   Value is ASCII (not binary)
CBF_BINARY            10                  Value is binary (not ASCII)
CBF_BITCOUNT          20                  Expected number of bits does not match the actual number written
CBF_ENDOFDATA         40                  End of the data reached before end of array
CBF_FILECLOSE         80                  File close error
CBF_FILEOPEN          100                 File open error
CBF_FILEREAD          200                 File read error
CBF_FILESEEK          400                 File seek error
CBF_FILETELL          800                 File tell error
CBF_FILEWRITE         1000                File write error
CBF_IDENTICAL         2000                A data block with the new name already exists
CBF_NOTFOUND          4000                Data block, category, column or row does not exist
CBF_OVERFLOW          8000                Number read cannot fit into the destination argument; destination has been set to the nearest value
CBF_UNDEFINED         10000               Requested number not defined
CBF_NOTIMPLEMENTED    20000               Requested functionality is not yet implemented
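Each value in Table 5.6.1.1 is a distinct power of two, so a returned integer can be tested one bit at a time. The sketch below turns a return value into a readable list of code names. It is self-contained and transcribes the values from the table; a program linked against CBFlib would use the CBF_* constants from cbf.h instead. The function describe_error is hypothetical, and the idea that several codes may be combined in a single return value is an assumption suggested by the bit layout, not something stated in the text.

```c
#include <stddef.h>
#include <stdio.h>

/* Values transcribed from Table 5.6.1.1 for illustration only. */
struct code_name {
    int code;
    const char *name;
};

static const struct code_name code_names[] = {
    { 0x00001, "CBF_FORMAT" },     { 0x00002, "CBF_ALLOC" },
    { 0x00004, "CBF_ARGUMENT" },   { 0x00008, "CBF_ASCII" },
    { 0x00010, "CBF_BINARY" },     { 0x00020, "CBF_BITCOUNT" },
    { 0x00040, "CBF_ENDOFDATA" },  { 0x00080, "CBF_FILECLOSE" },
    { 0x00100, "CBF_FILEOPEN" },   { 0x00200, "CBF_FILEREAD" },
    { 0x00400, "CBF_FILESEEK" },   { 0x00800, "CBF_FILETELL" },
    { 0x01000, "CBF_FILEWRITE" },  { 0x02000, "CBF_IDENTICAL" },
    { 0x04000, "CBF_NOTFOUND" },   { 0x08000, "CBF_OVERFLOW" },
    { 0x10000, "CBF_UNDEFINED" },  { 0x20000, "CBF_NOTIMPLEMENTED" },
};

/* Hypothetical helper: write the names of all codes set in err into
 * buf, separated by '|', or "success" when err is 0. Returns buf, or
 * NULL if the text does not fit. */
char *describe_error(char *buf, size_t size, int err)
{
    size_t used = 0;
    int n;

    if (size == 0)
        return NULL;
    buf[0] = '\0';
    if (err == 0) {
        n = snprintf(buf, size, "success");
        return (n < 0 || (size_t)n >= size) ? NULL : buf;
    }
    for (size_t i = 0; i < sizeof code_names / sizeof code_names[0]; i++) {
        if (err & code_names[i].code) {
            n = snprintf(buf + used, size - used, "%s%s",
                         used ? "|" : "", code_names[i].name);
            if (n < 0 || (size_t)n >= size - used)
                return NULL;
            used += (size_t)n;
        }
    }
    return buf;
}
```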

CBFlib provides functions to create a corresponding data structure in memory; to copy a data set from an external file to the data structure or from the data structure to an external file; to navigate the tree; to scan, add and remove data blocks within data sets, categories (tables) within data blocks, and rows or columns within categories; to read or modify data entries; and finally to delete the structure from memory. As is common in C programming, all functions return an integer equal to 0 for success or an error code for failure. The CBFlib error codes are given in Table 5.6.1.1.

CBFlib is thread-safe, re-entrant and able to operate on multiple data sets at a time. This means that the library maintains no static data and that the object to be operated on must be passed to each function. In CBFlib, this is accomplished by referring to each data set in memory with a unique handle of type cbf_handle. The handle maintains a link to the data structure in memory as well as the current location on the tree (data block, category, column and row). Before reading or creating a data set, the handle is created by calling the cbf_make_handle function. When the data set is no longer required, the resources associated with it can be freed using cbf_free_handle. Most functions in the library expect a handle as the first argument.

CBF binary data files and imgCIF ASCII data files may have one or more large images or other data sections as values for CIF tags. The focus of CBFlib is to handle large data sections efficiently. The basic flow of an application reading CBF/imgCIF data with the low-level CBFlib functions is shown in Fig. 5.6.1.2.

    Handle
      |
    Data set
      |-- Data block
      |     |-- Category
      |     |     |-- Column
      |     |     |     |-- Data row 0
      |     |     |     |-- Data row 1
      |     |     |     ...
      |     |     |-- Column ...
      |     |-- Category ...
      |-- Data block ...

Fig. 5.6.1.1. Schematic data structure of a CBF or imgCIF data set accessed through a handle.

Affiliations: Paul J. Ellis, Stanford Linear Accelerator Center, 2575 Sand Hill Road, Menlo Park, CA 94025, USA; Herbert J. Bernstein, Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA.
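The handle-based design described in Section 5.6.1 (no static data in the library, the current location kept with each data set) can be illustrated with a deliberately tiny stand-in. Everything prefixed demo_ or DEMO_ below is hypothetical: the real cbf_handle is provided by cbf.h, its layout is not shown in this chapter, and the toy data set here holds only a row count and a row cursor.

```c
#include <stdlib.h>

/* Return values modelled on Table 5.6.1.1 (illustration only). */
enum { DEMO_OK = 0, DEMO_ALLOC = 0x2, DEMO_ARGUMENT = 0x4, DEMO_NOTFOUND = 0x4000 };

struct demo_data_set {
    size_t nrows;  /* rows in the single table of this toy data set */
    size_t row;    /* current row: the cursor lives with the data set */
};
typedef struct demo_data_set *demo_handle;

int demo_make_handle(demo_handle *handle, size_t nrows)
{
    if (handle == NULL)
        return DEMO_ARGUMENT;
    *handle = calloc(1, sizeof **handle);
    if (*handle == NULL)
        return DEMO_ALLOC;
    (*handle)->nrows = nrows;
    return DEMO_OK;
}

int demo_select_row(demo_handle handle, size_t row)
{
    if (handle == NULL)
        return DEMO_ARGUMENT;
    if (row >= handle->nrows)
        return DEMO_NOTFOUND;
    handle->row = row;
    return DEMO_OK;
}

int demo_row_number(demo_handle handle, size_t *row)
{
    if (handle == NULL)
        return DEMO_ARGUMENT;
    if (row != NULL)          /* a NULL result pointer is tolerated */
        *row = handle->row;
    return DEMO_OK;
}

int demo_free_handle(demo_handle handle)
{
    free(handle);
    return DEMO_OK;
}

/* Two handles carry independent cursors; returns 0 if that holds. */
int demo_two_handles(void)
{
    demo_handle a = NULL, b = NULL;
    size_t ra = 0, rb = 0;
    int err = demo_make_handle(&a, 10) | demo_make_handle(&b, 10);

    if (!err) err = demo_select_row(a, 3) | demo_select_row(b, 7);
    if (!err) err = demo_row_number(a, &ra) | demo_row_number(b, &rb);
    demo_free_handle(a);
    demo_free_handle(b);
    return err ? err : !(ra == 3 && rb == 7);
}
```

Because every function receives the handle explicitly and nothing is stored in file-scope variables, two data sets can be navigated side by side without interfering with each other, which is the property the text attributes to CBFlib.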

    Start
      -> Create handle: cbf_make_handle
      -> Load data structures: cbf_read_file
      -> Select datablock: cbf_select_datablock
      -> Select category: cbf_select_category
      -> Select row: cbf_select_row
      -> Select column: cbf_select_column
      -> Get value: cbf_get_value or cbf_get_integerarray
    (the selection steps are repeated for each column, row, category and data block)

Fig. 5.6.1.2. Flow chart for a typical application reading CBF/imgCIF data.

The general approach to reading CBF/imgCIF data with CBFlib is to create an empty data structure with cbf_make_handle, load the data structures with cbf_read_file and then use nested loops to work through data blocks, categories, rows and columns in turn to extract values. Conceptually, all data values are held in the memory-resident data structures. In practice, however, only pointers to text fields with image data are held in memory. The data themselves remain on disk until explicitly referenced.

The basic flow of an application writing CBF/imgCIF data with the low-level CBFlib functions is shown in Fig. 5.6.1.3. The general approach to writing CBF/imgCIF data with CBFlib is to create empty data structures with cbf_make_handle and load the data structures with nested loops, working through data blocks, categories, rows and columns in turn, to store values. The major difference from the nested loops used for reading is that empty columns are created before data are stored into the data structures row by row. Alternatively, the data could be stored column by column. Finally, the fully loaded memory data structures are written out with cbf_write_file. As with reading, text fields with image data are actually held on disk.

    Start
      -> Create handle: cbf_make_handle
      -> Create datablock: cbf_new_datablock
      -> Create category: cbf_new_category
      -> Create columns: cbf_new_column
      -> Select row: cbf_new_row
      -> Select column: cbf_select_column
      -> Set value: cbf_set_value or cbf_set_integerarray
    (the creation and selection steps are repeated for each column, row, category and data block)
      -> Write out data structures: cbf_write_file

Fig. 5.6.1.3. Flow chart for an application writing CBF/imgCIF data.

5.6.2. CBFlib function descriptions

All CBFlib functions have two common characteristics: (i) they return an integer equal to 0 for success or an error code for failure; (ii) any pointer argument for the result of an operation can be safely set to NULL. The error codes are given in Table 5.6.1.1.

CBFlib provides two low-level functions to create or destroy the structure used to hold a data set:

    cbf_make_handle
    cbf_free_handle

There are two functions to copy a data set from or into a file:

    cbf_read_file
    cbf_write_file

The data structures ‘behind’ the handles retain pointers to current locations. This facilitates scanning through a CIF or CBF by data blocks, categories, rows and columns. The term ‘rewind’ refers to setting the internal pointer for the type of item specified so that the first such item is pointed to. In general, CIF does not permit duplication of the names of data blocks or category names. In practice, however, duplications do occur. CBFlib provides ‘force’ variants of some functions to allow creation of duplicate names. In CBFlib, the term ‘set’ refers to changing the name of the currently specified item. The term ‘reset’ refers to emptying a data block or category without deleting it. The term ‘remove’ refers to deleting a data block, category, column or row. The terms ‘select’ and ‘next’ refer to finding the designated item by number, while the term ‘find’ refers to finding the designated item by name.

CBFlib provides the following functions to manage data blocks and categories:

    cbf_set_datablockname
    cbf_{new | force_new | reset | remove | rewind | select | next | find}{_datablock | _category}
    cbf_reset_datablocks
    cbf_count{_datablocks | _categories}
    cbf_{datablock | category}_name

The following functions manage columns and rows:

    cbf_{new | remove | rewind | next | find | count | select}{_column | _row}
    cbf_column_name
    cbf_row_number
    cbf_{insert | delete}_row
    cbf_find_nextrow

The following functions are provided to manage data values:

    cbf_{get | set}{_value | _integervalue | _doublevalue | _integerarray}
    cbf_get_integerarrayparameters

Two macro definitions are provided to facilitate the handling of errors:

    cbf_failnez
    cbf_onfailnez

CBFlib also provides higher-level routines to simplify the management of complex CBF/imgCIF data sets:

    cbf_read_template
    cbf_{get | set}{_diffrn_id | _crystal_id | _wavelength | _polarization | _divergence | _gain | _overload | _integration_time | _time | _date | _image | _axis_setting}
    cbf_count_elements
    cbf_get_element_id
    cbf_set_current_time
    cbf_get_image_size
    cbf_{construct | free}{_goniometer | _detector}
    cbf_get_rotation_{axis | range}
    cbf_rotate_vector
    cbf_get_reciprocal
    cbf_get_{beam_center | detector_distance | detector_normal | pixel_coordinates | pixel_normals | pixel_area}

5.6.2.1. Low-level CBFlib functions

The prototypes for low-level CBFlib functions are defined in the header file cbf.h, which should be included in any program that uses CBFlib. As noted previously, every function returns an integer equal to 0 to indicate success or an error code on failure (Table 5.6.1.1).

The arguments to CBFlib functions are based on a view of a CBF/imgCIF data set as a tree (Fig. 5.6.1.1). The root of the tree is the data set and is identified by a handle that points to the data structures representing that tree. The main branches of the tree are the data blocks, identified by name or by number. Within each data block, the tree branches into categories, each of which branches into columns. Categories and columns also are identified by name or by number. Within each column is an array of values, the rows of which are identified by number. The current data block, category, column and row are stored in the data structures of a data set.

The following function descriptions include the formal parameters. When a ‘*’ appears before a formal parameter, it is a pointer to the relevant value, rather than the actual value. The formal parameters for the low-level CBFlib functions are given in Table 5.6.2.1.

Table 5.6.2.1. Formal parameters for low-level CBFlib functions

array            Untyped array, typically holding a pointer to an image
binary_id        Integer identifier of a binary section
categories       Integer used for a count of categories
category         Integer ordinal of a category, counting from 0
categoryname     Character string; the name of a category
ciforcbf         Integer; selects the format in which the binary sections are written (CIF/CBF)
column           Integer ordinal of a column, counting from 0
columnname       Character string; the name of a column
columns          Integer count of columns in a category
compression      Integer designating the compression method used
datablock        Integer ordinal of a data block, counting from 0
datablockname    Character string; the name of a data block
datablocks       Integer count of data blocks in a CBF/imgCIF data set
elements         Number of elements in the array
elements_read    Pointer to the destination number of elements actually read
elsigned         Set to nonzero if the destination array elements are signed
elsize           Size in bytes of each array element
elunsigned       Pointer to an integer; set to 1 if the elements can be read as unsigned integers
encoding         Integer; selects the type of encoding used for binary sections and the type of line termination in imgCIF files
file             File descriptor
handle           CBF handle
headers          Integer; controls/selects the type of header in CBF binary sections and message digest generation
maxelement       Integer; largest element
minelement       Integer; smallest element
number           Integer or double value
readable         Integer; if nonzero: this file is random-access and readable, and can be used as a buffer
row              Integer; row ordinal
rows             Integer; row count
value            Integer or double value

Before working with a CBF (or CIF), it is necessary to create a handle. When work with the CBF is completed, the handle and associated data structures should be released:

    int cbf_make_handle (cbf_handle *handle);
    int cbf_free_handle (cbf_handle handle);

Normally, processing cannot continue if a handle is not created. Typical code to create a handle is:

    #include "cbf.h"
    cbf_handle cif;

if ( cbf_make_handle (&cif) ) { fprintf(stderr, "Failed to create handle for input_cif\n"); exit(1); }
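The tree view of a data set described above can be sketched with a few plain C structures. This is only an illustrative model, not CBFlib's internal representation: every `demo_` name is hypothetical, and the sample values are made up. It shows how a value is addressed by the ordinals of its data block, category, column and row, each counted from 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in types: CBFlib's real data structures are not shown
   in the text and will differ from this simplified model. */
typedef struct {
    const char *name;             /* column name */
    const char *const *values;    /* one value per row */
    size_t rows;
} demo_column;

typedef struct {
    const char *name;
    const demo_column *columns;
    size_t ncolumns;
} demo_category;

typedef struct {
    const char *name;
    const demo_category *categories;
    size_t ncategories;
} demo_datablock;

typedef struct {
    const demo_datablock *datablocks;
    size_t ndatablocks;
} demo_dataset;

/* Walk the tree by ordinals (all counted from 0).
   Returns NULL if any ordinal is out of range. */
const char *demo_value_at(const demo_dataset *set, size_t datablock,
                          size_t category, size_t column, size_t row)
{
    const demo_datablock *db;
    const demo_category *cat;
    const demo_column *col;

    assert(set != NULL);
    if (datablock >= set->ndatablocks) return NULL;
    db = &set->datablocks[datablock];
    if (category >= db->ncategories) return NULL;
    cat = &db->categories[category];
    if (column >= cat->ncolumns) return NULL;
    col = &cat->columns[column];
    if (row >= col->rows) return NULL;
    return col->values[row];
}

/* A made-up data set: one data block, one category, two columns, two rows. */
const demo_dataset *demo_sample(void)
{
    static const char *const ids[] = { "WL1", "WL2" };
    static const char *const wavelengths[] = { "1.5418", "0.7107" };
    static const demo_column columns[] = {
        { "id", ids, 2 },
        { "wavelength", wavelengths, 2 }
    };
    static const demo_category categories[] = {
        { "diffrn_radiation_wavelength", columns, 2 }
    };
    static const demo_datablock blocks[] = {
        { "image_1", categories, 1 }
    };
    static const demo_dataset set = { blocks, 1 };
    return &set;
}
```

In this model, `demo_value_at(demo_sample(), 0, 0, 1, 1)` walks from the root to the second row of the second column. CBFlib itself keeps a current position inside the handle instead of taking all four ordinals on every call.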

Table 5.6.2.2. Values for headers in cbf_read_file

    MSG_DIGEST
        Check that the digest of the binary section matches any header value. If the digests do not match, the call will return CBF_FORMAT. The evaluation and comparison is delayed (a ‘lazy’ evaluation) to ensure maximal processing efficiency. If an immediate evaluation is desired, see MSG_DIGESTNOW below.
    MSG_DIGESTNOW
        Check that the digest of the binary section matches any header value. If the digests do not match, the call will return CBF_FORMAT. This evaluation and comparison is performed during initial parsing of the section to ensure timely error reporting at the expense of processing efficiency. If a more efficient delayed (‘lazy’) evaluation is desired, see MSG_DIGEST above.
    MSG_NODIGEST
        Do not check the digest (default).

Table 5.6.2.3. Values for headers in cbf_write_file
Values may be combined bit-wise.

    MIME_HEADERS      Use MIME-type headers (default)
    MIME_NOHEADERS    Use simple ASCII headers
    MSG_DIGEST        Generate message digests for binary data validation
    MSG_NODIGEST      Do not generate message digests (default)
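Table 5.6.2.3 notes that the header values may be combined bit-wise. The sketch below illustrates what that means in C: flags are joined with the bitwise OR operator (`|`) and tested with bitwise AND (`&`). The numeric values and all `DEMO_`/`demo_` names are assumptions made for this illustration; the real constants are defined in cbf.h and may have different values.

```c
#include <assert.h>

/* Stand-in flag values chosen for this sketch only; the real constants
   come from cbf.h and may be numbered differently. */
enum {
    DEMO_MIME_HEADERS   = 0x01,
    DEMO_MIME_NOHEADERS = 0x02,
    DEMO_MSG_DIGEST     = 0x04,
    DEMO_MSG_NODIGEST   = 0x08
};

/* Combine a header style with a digest option using bitwise OR. */
int demo_make_headers(int header_style, int digest_option)
{
    return header_style | digest_option;
}

/* Test a single flag with bitwise AND; returns 1 if set, 0 otherwise. */
int demo_has_flag(int headers, int flag)
{
    assert(flag != 0);
    return (headers & flag) != 0;
}
```

Note that the logical operator `||` would not work here: it collapses any nonzero operands to the single value 1 and loses the individual flags.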

Once a handle has been created, the data structures can be loaded with all the information held in a CBF file:

    int cbf_read_file (cbf_handle handle, FILE *file, int headers);

Conceptually, all data values are associated with the handle at the cbf_read_file call. In practice, however, only the non-binary data are actually stored in memory. To work with potentially large binary sections most efficiently, these are skipped until explicitly referenced. For this reason, file must be a random-access file opened in binary mode [fopen (..., "rb")] and must not be closed by the calling program. CBFlib will call fclose when the file is no longer required.

The headers parameter controls the handling of any message digests embedded in the binary sections (Table 5.6.2.2). A headers value of MSG_DIGEST will cause the code to compare the digest of the binary section with any header message digest value. To maximize processing efficiency, this comparison will be delayed until the binary section is actually read into memory or copied (a ‘lazy’ evaluation). If immediate evaluation is required, use MSG_DIGESTNOW. In either case, if the digests do not match, the function in which the evaluation is taking place will return the error CBF_FORMAT. To ignore any digests, use the headers value MSG_NODIGEST.

The cbf_write_file call writes out the data associated with a CBF handle:

    int cbf_write_file (cbf_handle handle, FILE *file, int readable, int ciforcbf, int headers, int encoding);

This call has several options controlling whether binary sections are written unencoded (CBF) or encoded in ASCII to conform to the CIF syntax (imgCIF), the type of headers in the binary sections, and the type of ASCII encoding and line termination used. The acceptable values for ciforcbf are CIF for ASCII-encoded binary sections or CBF for unencoded binary sections. The headers parameter (Table 5.6.2.3) can take the value MIME_HEADERS to select MIME-type binary section headers or MIME_NOHEADERS for simple ASCII headers. The value MSG_DIGEST will generate digests for validation of the binary data and the value MSG_NODIGEST will skip digest evaluation. The header and digest flags may be combined using the bitwise OR operator (|). Similarly, there are several combinable flags for the parameter encoding (Table 5.6.2.4). ENC_BASE64 selects BASE64 encoding, ENC_QP selects quoted-printable encoding, and ENC_BASE8, ENC_BASE10 and ENC_BASE16 select octal, decimal and hexadecimal, respectively. ENC_FORWARD maps bytes to words forward (1234) for BASE8, BASE10 or BASE16 encoding and ENC_BACKWARD maps bytes to words backward (4321). Finally, ENC_CRTERM terminates lines with carriage return (CR) and ENC_LFTERM terminates lines with line feed (LF) (thus ENC_CRTERM|ENC_LFTERM will use CR LF).

CBFlib maintains temporary storage on disk as necessary for files to be written, so that file does not have to be random-access. However, if it is random-access and readable, resources can be conserved by setting readable nonzero.

The remaining low-level functions are involved in navigating the tree structure, creating and deleting data blocks, categories and table columns and rows, and retrieving or modifying data values. The navigation functions are:

    int cbf_find_datablock (cbf_handle handle, const char *datablockname);
    int cbf_find_category (cbf_handle handle, const char *categoryname);
    int cbf_find_column (cbf_handle handle, const char *columnname);
    int cbf_find_row (cbf_handle handle, const char *value);
    int cbf_find_nextrow (cbf_handle handle, const char *value);
    int cbf_select_datablock (cbf_handle handle, unsigned int datablock);
    int cbf_select_category (cbf_handle handle, unsigned int category);
    int cbf_select_column (cbf_handle handle, unsigned int column);
    int cbf_select_row (cbf_handle handle, unsigned int row);
    int cbf_rewind_datablock (cbf_handle handle);
    int cbf_rewind_category (cbf_handle handle);
    int cbf_rewind_column (cbf_handle handle);
    int cbf_rewind_row (cbf_handle handle);
    int cbf_next_datablock (cbf_handle handle);
    int cbf_next_category (cbf_handle handle);
    int cbf_next_column (cbf_handle handle);
    int cbf_next_row (cbf_handle handle);

The function cbf_find_datablock selects the first data block with name datablockname as the current data block. Similarly, cbf_find_category selects the category within the current data block with name categoryname and cbf_find_column selects the corresponding column within the current category. The function cbf_find_row differs slightly in that it selects the first row in the current column with the corresponding value and cbf_find_nextrow selects the row with the corresponding value following the current row. Note that selecting a new data block makes the current category, column and row undefined and that selecting a new category similarly makes the column and row undefined. In contrast, repositioning by column does not change the current row and repositioning by row does not change the current column. The remaining functions navigate on the basis of the order of the data blocks, categories, columns and rows. Thus, cbf_select_datablock selects data-block number datablock, counting from 0, cbf_rewind_datablock selects the first data

block and cbf_next_datablock selects the data block following the current data block. All of these functions return CBF_NOTFOUND if the requested object does not exist.

The ‘count’ functions evaluate the number of data blocks in the data set, the number of categories in the current data block and the number of columns or rows in the current category:

    int cbf_count_datablocks (cbf_handle handle, unsigned int *datablocks);
    int cbf_count_categories (cbf_handle handle, unsigned int *categories);
    int cbf_count_columns (cbf_handle handle, unsigned int *columns);
    int cbf_count_rows (cbf_handle handle, unsigned int *rows);

Table 5.6.2.4. Values for encodings in cbf_write_file
Values may be combined bit-wise.

    ENC_BASE64      Use BASE64 encoding (default)
    ENC_QP          Use quoted-printable encoding
    ENC_BASE8       Use BASE8 (octal) encoding
    ENC_BASE10      Use BASE10 (decimal) encoding
    ENC_BASE16      Use BASE16 (hexadecimal) encoding
    ENC_FORWARD     For BASE8, BASE10 or BASE16 encoding, map bytes to words forward (1234) (default on little-endian machines)
    ENC_BACKWARD    For BASE8, BASE10 or BASE16 encoding, map bytes to words backward (4321) (default on big-endian machines)
    ENC_CRTERM      Terminate lines with CR
    ENC_LFTERM      Terminate lines with LF (default)

The ‘name’ functions retrieve (or, for cbf_set_datablockname, set) the current data block, category or column names:

    int cbf_datablock_name (cbf_handle handle, const char **datablockname);
    int cbf_set_datablockname (cbf_handle handle, const char *datablockname);
    int cbf_category_name (cbf_handle handle, const char **categoryname);
    int cbf_column_name (cbf_handle handle, const char **columnname);

As rows do not have names, the corresponding function is:

    int cbf_row_number (cbf_handle handle, unsigned int *row);

To create new entities within the tree, CBFlib provides the functions:

    int cbf_new_datablock (cbf_handle handle, const char *datablockname);
    int cbf_new_category (cbf_handle handle, const char *categoryname);
    int cbf_new_column (cbf_handle handle, const char *columnname);
    int cbf_new_row (cbf_handle handle);
    int cbf_insert_row (cbf_handle handle, unsigned int row);
    int cbf_force_new_datablock (cbf_handle handle, const char *datablockname);
    int cbf_force_new_category (cbf_handle handle, const char *categoryname);

The ‘new’ functions add a new data block within the data set, a new category in the current data block, or a new column or row within the current category, and make it the current data block, category, column or row, respectively. If the data block, category or column already exists, then the function simply makes it the current data block, category or column. The function cbf_new_row adds the row to the end of the current category. cbf_insert_row inserts a row before the row with ordinal row, counting from 0. The newly inserted row takes the ordinal row, and the row that originally had that ordinal, together with all rows with higher ordinals, is pushed downwards.

In general, CIF does not permit duplication of the names of data blocks or categories. In practice, however, duplications do occur. CBFlib provides ‘force’ variants to allow creation of duplicate data-block and category names. Because, in this case, the program analysing the resulting file can only distinguish the duplicates by ordinal, these variants are not recommended for general use.

The following functions are used to remove entities from the tree:

    int cbf_remove_datablock (cbf_handle handle);
    int cbf_remove_category (cbf_handle handle);
    int cbf_remove_column (cbf_handle handle);
    int cbf_remove_row (cbf_handle handle);
    int cbf_delete_row (cbf_handle handle, unsigned int row);
    int cbf_reset_datablocks (cbf_handle handle);
    int cbf_reset_datablock (cbf_handle handle);
    int cbf_reset_category (cbf_handle handle);

The basic ‘remove’ functions delete the current data block, category, column or row. Note that removing a data block makes the current data block, category, column and row undefined; removing a category makes the current category, column and row undefined. Removing a column makes the current column undefined, but leaves the current row intact, and removing a row leaves the current column intact. The function cbf_delete_row is similar to cbf_remove_row except that it removes the specified row in the current category. If the current row is not the deleted row, then it will remain valid. All the categories in all data blocks, all the categories in the current data block or all the entries in the current category may be removed using the ‘reset’ functions.

When a column and row within a category have been selected, the entry value may be examined or modified:

    int cbf_get_value (cbf_handle handle, const char **value);
    int cbf_set_value (cbf_handle handle, const char *value);
    int cbf_get_integervalue (cbf_handle handle, int *number);
    int cbf_set_integervalue (cbf_handle handle, int number);
    int cbf_get_doublevalue (cbf_handle handle, double *number);
    int cbf_set_doublevalue (cbf_handle handle, const char *format, double number);
    int cbf_get_integerarrayparameters (cbf_handle handle, unsigned int *compression, int *binary_id, size_t *elsize, int *elsigned, int *elunsigned, size_t *elements, int *minelement, int *maxelement);
    int cbf_get_integerarray (cbf_handle handle, int *binary_id, void *array, size_t elsize, int elsigned, size_t elements, size_t *elements_read);
    int cbf_set_integerarray (cbf_handle handle, unsigned int compression, int binary_id, void *array, size_t elsize, int elsigned, size_t elements);

A value within a CBF/imgCIF data set may be a simple character string, an integer or real number, or an array of integers. The functions cbf_get_value and cbf_set_value provide the basic functionality for normal CIF values, retrieving and modifying the

current entry as a string. The functions cbf_get_integervalue and cbf_get_doublevalue interpret the retrieved string as an integer or real value and the functions cbf_set_integervalue and cbf_set_doublevalue convert the number argument into a string before setting the entry.

The functions for working with binary sections are more complicated as they must take into account compression, array size and the variety of different integer types available on different systems: signed/unsigned and various sizes. The function cbf_get_integerarrayparameters retrieves the parameters of the current binary entry. The compression argument is set to the compression type used (Table 5.6.2.5). At present, this may take one of three values: CBF_CANONICAL, for canonical-code compression (see Section 5.6.3.1 below); CBF_PACKED, for CCP4-style packing (see Section 5.6.3.2 below); or CBF_NONE, for no compression. [Note: CBF_NONE is by far the slowest scheme of the three and uses much more disk space. It is intended for routine use with small arrays only. With large arrays (like images) it should be used only for debugging.] The binary_id value is a unique integer identifier for each binary section, elsize is the size in bytes of the array entries, elsigned and elunsigned are nonzero if the array can be read as signed or unsigned, respectively, elements is the number of entries in the array, and minelement and maxelement are the lowest and highest elements. If a destination argument is too small to hold a value, it will be set to the nearest value and the function will return CBF_OVERFLOW. If the current entry is not binary, cbf_get_integerarrayparameters will return CBF_ASCII.

Table 5.6.2.5. Values for the parameter compression in cbf_get_integerarrayparameters and cbf_set_integerarray

    CBF_CANONICAL    Canonical-code compression (Section 5.6.3.1)
    CBF_PACKED       CCP4-style packing (Section 5.6.3.2)
    CBF_NONE         No compression

cbf_get_integerarray reads the current binary entry into an integer array. The parameter array points to an array of elements interpreted as integers. Each element in the array is signed if elsigned is nonzero and unsigned otherwise, and each element occupies elsize bytes. The argument elements_read is set to the number of elements actually obtained. If the binary section does not contain sufficient entries to fill the array, the function returns CBF_ENDOFDATA. As before, the function will return CBF_OVERFLOW on overflow and CBF_ASCII if the entry is not binary.

cbf_set_integerarray sets the current binary or ASCII entry to the binary value of an integer array. As before, the acceptable values for compression are CBF_PACKED, CBF_CANONICAL and CBF_NONE. Each binary section should be given a unique integer identifier binary_id.

Two macros are provided to facilitate processing and propagation of error returns: one to return from the current function immediately and one to execute a given command first:

    #define cbf_failnez(f) \
      {int err; err = (f); if (err) return err; }
    #define cbf_onfailnez(f,c) \
      {int err; err = (f); if (err) {{c; }return err; }}

If the symbol CBFDEBUG is defined, alternative definitions that print out the error number as given in Table 5.6.1.1 are used:

    #define cbf_failnez(x) \
      {int err; err = (x); \
       if (err) { fprintf (stderr, \
         "\nCBFlib error %d in \"x\"\n", \
         err); return err; }}
    #define cbf_onfailnez(x,c) \
      {int err; err = (x); \
       if (err) { fprintf (stderr, \
         "\nCBFlib error %d in \"x\"\n", \
         err); \
         { c; } return err; }}

5.6.2.2. High-level CBFlib functions

The high-level CBFlib functions provide a level of abstraction above the CIF file structure and their prototypes are defined in the header file cbf_simple.h. Most of these functions simply use the low-level routines to navigate the CBF/imgCIF structure and read and modify data entries, and consequently expect a cbf_handle argument. There are also, however, additional sets of functions used to analyse the geometry of the goniometer and detector. These functions use additional handles of type cbf_goniometer and cbf_detector, respectively. All functions return the same error codes as the low-level functions do. The function return values are given in Table 5.6.1.1. The formal parameters for the high-level CBFlib functions are given in Table 5.6.2.6.

5.6.2.3. General high-level functions

The general high-level functions use the low-level routines to accomplish common tasks with a single call. The first of these is used to facilitate the preparation of the complex CBF/imgCIF header structure:

    int cbf_read_template (cbf_handle handle, FILE *file);

cbf_read_template simply reads the CBF/imgCIF file file into the data structure associated with the given handle and selects the first data block. It is typically used to read a template (an imgCIF file populated with data entries, but without any binary sections) into which experimental information can then be inserted. Template files are discussed further in Section 5.6.4 below.

The value of _diffrn_radiation_wavelength.wavelength can be retrieved or set. The functions

    int cbf_get_wavelength (cbf_handle handle, double *wavelength);
    int cbf_set_wavelength (cbf_handle handle, double wavelength);

operate on the categories DIFFRN_RADIATION and DIFFRN_RADIATION_WAVELENGTH. The wavelength is found indirectly. The value of _diffrn_radiation.wavelength_id is retrieved and used to find a matching row in the DIFFRN_RADIATION_WAVELENGTH category, from which the value of _diffrn_radiation_wavelength.wavelength is obtained.

The value of the ratio of the intensities of the polarization components _diffrn_radiation.polarizn_source_ratio and the value of the angle _diffrn_radiation.polarizn_source_norm between the normal to the polarization plane and the laboratory Y axis can be retrieved or set. The functions

    int cbf_get_polarization (cbf_handle handle, double *polarizn_source_ratio, double *polarizn_source_norm);
    int cbf_set_polarization (cbf_handle handle, double polarizn_source_ratio, double polarizn_source_norm);

operate on the DIFFRN_RADIATION category.
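Because every low-level and high-level call returns an error code, the cbf_failnez and cbf_onfailnez macros introduced with the low-level functions are the usual way to chain calls. The following sketch shows how they propagate a nonzero return. The macro bodies are the non-debug definitions quoted in this section; `demo_step` and `demo_pipeline` are hypothetical stand-ins for real CBFlib calls, so that the example compiles without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Non-debug macro definitions as quoted in the text. */
#define cbf_failnez(f) \
  {int err; err = (f); if (err) return err; }
#define cbf_onfailnez(f,c) \
  {int err; err = (f); if (err) {{c; }return err; }}

/* Hypothetical stand-in for a CBFlib call: returns the code it is given
   (0 means success, anything else is an error code). */
int demo_step(int code)
{
    return code;
}

/* Run two steps. The first always succeeds; the second returns second_code.
   On failure of the second step, the cleanup command runs before the
   enclosing function returns the error code. */
int demo_pipeline(int second_code, int *cleanup_ran)
{
    assert(cleanup_ran != NULL);
    *cleanup_ran = 0;

    cbf_failnez(demo_step(0));
    cbf_onfailnez(demo_step(second_code), *cleanup_ran = 1);

    return 0;   /* reached only if every step returned 0 */
}
```

Both macros contain a return statement, so they can only be used inside functions that themselves return an int error code. The second argument of cbf_onfailnez is a natural place for releasing resources, for example freeing a handle.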

Table 5.6.2.6. Formal parameters for high-level CBFlib functions

    array                    Pointer to image array data
    axis_id                  Axis ID
    center1                  Displacement along the slow axis
    center2                  Displacement along the fast axis
    compression              Compression type
    coordinate1              x component
    coordinate2              y component
    coordinate3              z component
    crystal_id               ASCII crystal ID
    day                      Timestamp day (1–31)
    detector                 Detector handle
    diffrn_id                ASCII diffraction ID
    distance                 Distance
    div_x_source             Value of _diffrn_radiation.div_x_source
    div_x_y_source           Value of _diffrn_radiation.div_x_y_source
    div_y_source             Value of _diffrn_radiation.div_y_source
    element_id               ASCII element ID
    element_number           Detector element counting from 0
    elements                 Count of elements
    elsigned                 Set to nonzero if the destination array elements are signed
    elsize                   Size in bytes of each destination array element
    file                     File descriptor
    gain                     Detector gain in counts per photon
    gain_esd                 Gain e.s.d. value
    goniometer               Goniometer handle
    handle                   CBF handle
    hour                     Timestamp hour (0–23)
    increment                Increment value
    index1                   Slow index
    index2                   Fast index
    initial1                 x component of the initial vector
    initial2                 y component of the initial vector
    initial3                 z component of the initial vector
    minute                   Timestamp minute (0–59)
    month                    Timestamp month (1–12)
    ndim1                    Slow array dimension
    ndim2                    Fast array dimension
    normal1                  x component of the normal vector
    normal2                  y component of the normal vector
    normal3                  z component of the normal vector
    overload                 Overload value
    polarizn_source_norm     Polarization normal
    polarizn_source_ratio    Polarization ratio
    precision                Timestamp precision in seconds
    projected_area           Apparent area in mm²
    ratio                    Goniometer setting (0 = beginning of exposure, 1 = end)
    real1                    x component of the real-space vector
    real2                    y component of the real-space vector
    real3                    z component of the real-space vector
    reciprocal1              x component of the reciprocal-space vector
    reciprocal2              y component of the reciprocal-space vector
    reciprocal3              z component of the reciprocal-space vector
    reserved                 Unused; any value other than 0 is invalid
    second                   Timestamp second (0–60.0)
    start                    Start value
    time                     Timestamp in seconds since 1 January 1970 or integration time in seconds
    timezone                 Time zone difference from universal time in minutes or CBF_NOTIMEZONE
    vector1                  x component of the rotation axis
    vector2                  y component of the rotation axis
    vector3                  z component of the rotation axis
    wavelength               Wavelength in Å
    year                     Pointer to the destination timestamp year

The values of the divergence parameters, represented by the data names _diffrn_radiation.div_x_source, *.div_y_source and *.div_x_y_source, can be retrieved or set. The functions

    int cbf_get_divergence (cbf_handle handle, double *div_x_source, double *div_y_source, double *div_x_y_source);
    int cbf_set_divergence (cbf_handle handle, double div_x_source, double div_y_source, double div_x_y_source);

operate on the DIFFRN_RADIATION category.

The values of _diffrn.id and _diffrn.crystal_id can be retrieved or set:

    int cbf_get_diffrn_id (cbf_handle handle, const char **diffrn_id);
    int cbf_set_diffrn_id (cbf_handle handle, const char *diffrn_id);
    int cbf_get_crystal_id (cbf_handle handle, const char **crystal_id);
    int cbf_set_crystal_id (cbf_handle handle, const char *crystal_id);

Changing _diffrn.id also modifies the corresponding *.diffrn_id entries in the DIFFRN_SOURCE, DIFFRN_RADIATION, DIFFRN_DETECTOR and DIFFRN_MEASUREMENT categories.

The starting value and increment of an axis may be retrieved or set:

    int cbf_get_axis_setting (cbf_handle handle, unsigned int reserved, const char *axis_id, double *start, double *increment);
    int cbf_set_axis_setting (cbf_handle handle, unsigned int reserved, const char *axis_id, double start, double increment);

The cbf_set_axis_setting call is used during the creation of a CBF/imgCIF file to store the goniometer settings and rotation. The cbf_get_axis_setting call is not generally useful when interpreting a file, as there are no standard identifiers and the arrangement of the experimental axes is not consistent. Much more useful are the goniometer geometry functions described below.

The number of detector elements can be retrieved:

    int cbf_count_elements (cbf_handle handle, unsigned int *elements);

This is the number of rows in the DIFFRN_DETECTOR_ELEMENT category. For each element, counting from 0, the detector identifier (the *.detector_id entry) can be retrieved and the gain and overload values in the ARRAY_INTENSITIES category retrieved or set:

    int cbf_get_element_id (cbf_handle handle, unsigned int element_number, const char **element_id);
    int cbf_get_gain (cbf_handle handle, unsigned int element_number, double *gain, double *gain_esd);
    int cbf_set_gain (cbf_handle handle, unsigned int element_number, double gain, double gain_esd);
    int cbf_get_overload (cbf_handle handle, unsigned int element_number, double *overload);
    int cbf_set_overload (cbf_handle handle, unsigned int element_number, double overload);

For each element, counting from 0, the values of the parameters of the detector can be retrieved and some can be set. The value of _diffrn_detector_element.id is retrieved as element_id. The value of _diffrn_data_frame.array_id can be retrieved as array_name. The values of _array_intensities.gain and _array_intensities.gain_esd are retrieved as gain

and gain_esd. The value of _array_intensities.overload can be retrieved or set as overload. The value of _diffrn_scan_frame.integration_time can be retrieved or set as integration_time.

Timestamp calls operate on the DATE entry in the DIFFRN_SCAN_FRAME category:

    int cbf_get_timestamp (cbf_handle handle, unsigned int reserved, double *time, int *timezone);
    int cbf_set_timestamp (cbf_handle handle, unsigned int reserved, double time, int timezone, double precision);
    int cbf_get_datestamp (cbf_handle handle, unsigned int reserved, int *year, int *month, int *day, int *hour, int *minute, double *second, int *timezone);
    int cbf_set_datestamp (cbf_handle handle, unsigned int reserved, int year, int month, int day, int hour, int minute, double second, int timezone, double precision);
    int cbf_set_current_timestamp (cbf_handle handle, unsigned int reserved, int timezone);

cbf_get_timestamp and cbf_set_timestamp measure time in seconds since 1 January 1970. cbf_get_datestamp and cbf_set_datestamp work in terms of individual year, month, day, hour, minute and second. The optional collection time zone, timezone, is the difference from universal time in minutes; precision is the fraction, in seconds, to which the time will be recorded. cbf_set_current_timestamp sets the collection timestamp from the current time, to the nearest second.

Also in the DIFFRN_SCAN_FRAME category is the integration time of the image:

    int cbf_get_integration_time (cbf_handle handle, unsigned int reserved, double *time);
    int cbf_set_integration_time (cbf_handle handle, unsigned int reserved, double time);

Finally, these functions include routines for working with binary images:

    int cbf_get_image_size (cbf_handle handle, unsigned int reserved, unsigned int element_number, size_t *ndim1, size_t *ndim2);
    int cbf_get_image (cbf_handle handle, unsigned int reserved, unsigned int element_number, void *array, size_t elsize, int elsign, size_t ndim1, size_t ndim2);
    int cbf_set_image (cbf_handle handle, unsigned int reserved, unsigned int element_number, unsigned int compression, void *array, size_t elsize, int elsign, size_t ndim1, size_t ndim2);

cbf_get_image_size retrieves the dimensions of detector element element_number from the ARRAY_STRUCTURE_LIST category, setting ndim1 and ndim2 to the slow and fast array dimensions, respectively. These dimensions can be used to allocate memory before calling cbf_get_image. cbf_get_image reads the image data from detector element element_number into a signed or unsigned integer array of size ndim1 * ndim2 and cbf_set_image associates image data with a detector element. As in the description of the integer array functions, the compression argument can currently take one of three values: CBF_CANONICAL, for canonical-code compression (see Section 5.6.3.1); CBF_PACKED, for CCP4-style packing (see Section 5.6.3.2); or CBF_NONE, for no compression.

5.6.2.4. Goniometer geometry functions

A CBF/imgCIF file includes a geometric description of the goniometer used to orient the sample during the experiment. Practical use of this information, however, is not trivial as it involves combining data from several categories and analysing in three dimensions the nested axes in which the description is framed (see Section 3.7.3 for a discussion of the axis system). CBFlib provides six functions to facilitate this task:

    int cbf_construct_goniometer (cbf_handle handle, cbf_goniometer *goniometer);
    int cbf_free_goniometer (cbf_goniometer goniometer);
    int cbf_get_rotation_axis (cbf_goniometer goniometer, unsigned int reserved, double *vector1, double *vector2, double *vector3);
    int cbf_get_rotation_range (cbf_goniometer goniometer, unsigned int reserved, double *start, double *increment);
    int cbf_rotate_vector (cbf_goniometer goniometer, unsigned int reserved, double ratio, double initial1, double initial2, double initial3, double *final1, double *final2, double *final3);
    int cbf_get_reciprocal (cbf_goniometer goniometer, unsigned int reserved, double ratio, double wavelength, double real1, double real2, double real3, double *reciprocal1, double *reciprocal2, double *reciprocal3);

cbf_construct_goniometer uses the data in the categories DIFFRN_MEASUREMENT, DIFFRN_MEASUREMENT_AXIS, DIFFRN_SCAN_FRAME_AXIS, AXIS and DIFFRN_SCAN_AXIS to construct a geometric representation of the goniometer and initializes the cbf_goniometer handle, goniometer. cbf_free_goniometer frees the goniometer structure. cbf_get_rotation_axis and cbf_get_rotation_range get the normalized rotation vector, and the starting value and increment of the first rotating axis of the goniometer, respectively. The cbf_rotate_vector call applies the goniometer axis rotation to the given initial vector, with the ratio value specifying the goniometer setting from 0.0 at the beginning of the exposure to 1.0 at the end, irrespective of the actual rotation range. Finally, cbf_get_reciprocal transforms the given real-space vector (real1, real2, real3) to the corresponding reciprocal-space vector (reciprocal1, reciprocal2, reciprocal3). As before, the transform corresponds to the goniometer initial position with a ratio of 0.0 and the goniometer final position with a ratio of 1.0.

5.6.2.5. Detector geometry functions

In a similar manner, a CBF/imgCIF file includes a description of the surface of each detector and the arrangement of the pixels in space. CBFlib provides eight functions for analysing this description:

    int cbf_construct_detector (cbf_handle handle, cbf_detector *detector, unsigned int element_number);
    int cbf_free_detector (cbf_detector detector);
    int cbf_get_beam_center (cbf_detector detector, double *index1, double *index2, double *center1, double *center2);
    int cbf_get_detector_distance (cbf_detector detector, double *distance);

    int cbf_get_detector_normal (cbf_detector detector, double *normal1, double *normal2, double *normal3);
    int cbf_get_pixel_coordinates (cbf_detector detector, double index1, double index2, double *coordinate1, double *coordinate2, double *coordinate3);
    int cbf_get_pixel_normal (cbf_detector detector, double index1, double index2, double *normal1, double *normal2, double *normal3);
    int cbf_get_pixel_area (cbf_detector detector, double index1, double index2, double *area, double *projected_area);

cbf_construct_detector uses data from the categories DIFFRN, DIFFRN_DETECTOR, DIFFRN_DETECTOR_ELEMENT, DIFFRN_DETECTOR_AXIS, AXIS, ARRAY_STRUCTURE_LIST and ARRAY_STRUCTURE_LIST_AXIS to construct a geometric representation of detector element element_number and initializes the cbf_detector handle, detector. cbf_free_detector frees the detector structure; cbf_get_beam_center calculates the location at which the beam intersects the detector surface, either in terms of the pixel indices (index1, index2) along the slow and fast detector axes, respectively, or the displacement in millimetres along the slow and fast axes (center1, center2); cbf_get_detector_distance and cbf_get_detector_normal calculate the distance of the sample from the plane of the detector surface and the normal vector of the detector at pixel (0, 0), respectively; cbf_get_pixel_coordinates, cbf_get_pixel_normal and cbf_get_pixel_area calculate the coordinates, normal vector, and area and apparent area as viewed from the sample position of the pixel with the given indices, respectively.

Table 5.6.3.1. Structure of compressed data using the canonical-code scheme

    Byte                                          Value
    1 to 8                                        Number of elements (64-bit little-endian number)
    9 to 16                                       Minimum element
    17 to 24                                      Maximum element
    25 to 32                                      (Reserved for future use)
    33                                            Number of bits directly coded, n
    34                                            Maximum number of bits encoded, maxbits
    35 to 35 + 2^n − 1                            Number of bits in each direct code
    35 + 2^n                                      Number of bits in the stop code
    35 + 2^n + 1 to 35 + 2^n + maxbits − n        Number of bits in each indirect code
    35 + 2^n + maxbits − n + 1 ...                Coded data

Table 5.6.3.2. Structure of compressed data using the CCP4-style scheme

    Byte         Value
    1 to 8       Number of elements (64-bit little-endian number)
    9 to 16      Minimum element (currently unused)
    17 to 24     Maximum element (currently unused)
    25 to 32     (Reserved for future use)
    33 ...       Coded data

    Value in bits 3 to 5    Number of bits in each error
    0                       0
    1                       4
    2                       5
    3                       6
    4                       7
    5                       8
    6                       16
    7                       65

5.6.3. Compression schemes

Two schemes for lossless compression of integer arrays (such as images) have been implemented in this version of CBFlib: (i) an entropy-encoding scheme using canonical coding; (ii) a CCP4-style packing scheme. Both encode the difference (or error) between the current element in the array and the prior element. Parameters required for more sophisticated predictors have been included in the compression functions and will be used in a future version of the library. For the canonical-code scheme, the structure of the compressed data is described in Table 5.6.3.1.
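The difference predictor shared by both schemes can be sketched in a few lines of C. This is an illustrative sketch only, not CBFlib code: the function names delta_encode and delta_decode are hypothetical, and the predecessor of the first element is assumed to be zero, which the text does not state.

```c
#include <stddef.h>

/* Replace each element by its difference ("error") from the prior element.
 * Assumption: the element before the first one is taken to be 0. */
void delta_encode(const int *data, size_t n, int *errors)
{
    int prior = 0;
    for (size_t i = 0; i < n; i++) {
        errors[i] = data[i] - prior;
        prior = data[i];
    }
}

/* Invert delta_encode by accumulating the errors. */
void delta_decode(const int *errors, size_t n, int *data)
{
    int prior = 0;
    for (size_t i = 0; i < n; i++) {
        prior += errors[i];
        data[i] = prior;
    }
}
```

For smoothly varying image data most errors are small, which is what makes the subsequent entropy coding or bit packing worthwhile.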

5.6.3.2. CCP4-style compression

The CCP4-style compression writes the errors in blocks. Each block begins with a 6-bit code. The number of errors in the block is $2^n$, where n is the value in bits 0 to 2. Bits 3 to 5 encode the number of bits in each error. The data structure is summarized in Table 5.6.3.2.
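As a sketch of how the 6-bit block code might be unpacked, the following C fragment splits the code into its two 3-bit fields and looks up the error width in the values listed in Table 5.6.3.2. The type and function names are hypothetical, and the sketch assumes that bit 0 is the least-significant bit of the code; it is not taken from the CBFlib source.

```c
/* Error widths indexed by the value in bits 3 to 5 (Table 5.6.3.2). */
static const unsigned ccp4_error_bits[8] = {0, 4, 5, 6, 7, 8, 16, 65};

typedef struct {
    unsigned n_errors;        /* number of errors in the block, 2^n */
    unsigned bits_per_error;  /* number of bits used for each error */
} ccp4_block_header;

/* Decode a 6-bit block code. Assumption: bit 0 is least significant. */
ccp4_block_header ccp4_decode_header(unsigned code)
{
    ccp4_block_header h;
    code &= 0x3Fu;                                /* keep the 6 code bits */
    h.n_errors = 1u << (code & 0x7u);             /* bits 0 to 2 give n   */
    h.bits_per_error = ccp4_error_bits[(code >> 3) & 0x7u];
    return h;
}
```

For example, a code with n = 2 in bits 0 to 2 and the value 3 in bits 3 to 5 describes a block of four errors stored in 6 bits each.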

5.6.3.1. Canonical-code compression

The canonical-code compression scheme encodes errors in two ways: directly or indirectly. Errors are coded directly using a symbol corresponding to the error value. Errors are coded indirectly using a symbol for the number of bits in the (signed) error, followed by the error itself. At the start of the compression, CBFlib constructs a table containing a set of symbols, one for each of the $2^n$ direct codes from $-2^{n-1}$ to $2^{n-1} - 1$, one for a stop code and one for each of the maxbits − n indirect codes, where n is chosen at compression time and maxbits is the maximum number of bits in an error. CBFlib then assigns to each symbol a bit code, using a shorter bit code for the more common symbols and a longer bit code for the less common symbols. The bit-code lengths are calculated using a Huffman-type algorithm and the actual bit codes are constructed using the canonical-code algorithm described by Moffat et al. (1997).

5.6.4. Sample templates

The construction of CBF/imgCIF files can be simplified using templates. A template is itself an imgCIF file populated with data entries but without any binary sections. This file is normally associated with a CBF handle using the cbf_read_template call and provides a framework into which images and other experiment-specific data may be entered. Fig. 5.6.4.1 is a sample template for an ADSC Quantum 4 detector (ADSC, 1997) with a κ-geometry diffractometer at Stanford Synchrotron Radiation Laboratory (SSRL) beamline 1-5. The template for a MAR345 image plate detector (MAR Research, 1997) is almost identical. The major differences are in the size of the array (2300 × 2300 versus 2304 × 2304), the parameters for the CCD elements and the geometry of the elements. Therefore a few of the values in the AXIS, ARRAY_STRUCTURE_LIST, ARRAY_STRUCTURE_LIST_AXIS and ARRAY_INTENSITIES categories are different, as listed in Fig. 5.6.4.2.
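For the canonical-code scheme of Section 5.6.3.1, the size of the symbol table and the resulting header layout of Table 5.6.3.1 follow directly from n and maxbits. The C sketch below works this out; the function names are hypothetical and the code illustrates only the bookkeeping, not the Huffman-type length calculation or the bit-code assignment.

```c
#include <stdbool.h>

/* Number of symbols: 2^n direct codes, one stop code and
 * (maxbits - n) indirect codes. Assumes n <= maxbits. */
unsigned long canonical_symbol_count(unsigned n, unsigned maxbits)
{
    return (1ul << n) + 1ul + (unsigned long)(maxbits - n);
}

/* 1-based byte position at which the coded data start (Table 5.6.3.1):
 * 34 fixed header bytes, one code-length byte per symbol, then data. */
unsigned long canonical_data_offset(unsigned n, unsigned maxbits)
{
    return 34ul + canonical_symbol_count(n, maxbits) + 1ul;
}

/* True if an error lies in the directly coded range
 * -2^(n-1) ... 2^(n-1) - 1. Assumes n >= 1. */
bool canonical_is_direct(long error, unsigned n)
{
    long half = 1L << (n - 1);
    return error >= -half && error <= half - 1;
}
```

With the purely illustrative choice n = 8 and maxbits = 32, the table holds 256 + 1 + 24 = 281 symbols and the coded data begin at byte 35 + $2^8$ + 32 − 8 + 1 = 316.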

###CBF: VERSION 1.1

data_image_1

# category DIFFRN
loop_
_diffrn.id
_diffrn.crystal_id
DIFFRN_ID DIFFRN_CRYSTAL_ID

# category DIFFRN_SOURCE
loop_
_diffrn_source.diffrn_id
_diffrn_source.source
_diffrn_source.current
_diffrn_source.type
DIFFRN_ID synchrotron 100.0 'SSRL beamline 1-5'

# category DIFFRN_RADIATION
loop_
_diffrn_radiation.diffrn_id
_diffrn_radiation.wavelength_id
_diffrn_radiation.probe
_diffrn_radiation.monochromator
_diffrn_radiation.polarizn_source_ratio
_diffrn_radiation.polarizn_source_norm
_diffrn_radiation.div_x_source
_diffrn_radiation.div_y_source
_diffrn_radiation.div_x_y_source
_diffrn_radiation.collimation
DIFFRN_ID WAVELENGTH1 x-ray 'Si 111' 0.8 0.0 0.08 0.01 0.00 '0.20 mm x 0.20 mm'

# category DIFFRN_RADIATION_WAVELENGTH
loop_
_diffrn_radiation_wavelength.id
_diffrn_radiation_wavelength.wavelength
_diffrn_radiation_wavelength.wt
WAVELENGTH1 0.98 1.0

# category DIFFRN_DETECTOR
loop_
_diffrn_detector.diffrn_id
_diffrn_detector.id
_diffrn_detector.type
_diffrn_detector.details
_diffrn_detector.number_of_axes
DIFFRN_ID ADSCQ4 'ADSC QUANTUM4' 'slow mode' 4

# category DIFFRN_DETECTOR_AXIS
loop_
_diffrn_detector_axis.detector_id
_diffrn_detector_axis.axis_id
ADSCQ4 DETECTOR_X
ADSCQ4 DETECTOR_Y
ADSCQ4 DETECTOR_Z
ADSCQ4 DETECTOR_PITCH

# category DIFFRN_DETECTOR_ELEMENT
loop_
_diffrn_detector_element.id
_diffrn_detector_element.detector_id
ELEMENT1 ADSCQ4

# category DIFFRN_DATA_FRAME
loop_
_diffrn_data_frame.id
_diffrn_data_frame.detector_element_id
_diffrn_data_frame.array_id
_diffrn_data_frame.binary_id
FRAME1 ELEMENT1 ARRAY1 1

# category DIFFRN_MEASUREMENT
loop_
_diffrn_measurement.diffrn_id
_diffrn_measurement.id
_diffrn_measurement.number_of_axes
_diffrn_measurement.method
_diffrn_measurement.details
DIFFRN_ID GONIOMETER 3 rotation
;
i0=1.000 i1=1.000 i2=1.000 ib=1.000 beamstop=20 mm 0% attenuation
;

# category DIFFRN_MEASUREMENT_AXIS
loop_
_diffrn_measurement_axis.measurement_id
_diffrn_measurement_axis.axis_id
GONIOMETER GONIOMETER_PHI
GONIOMETER GONIOMETER_KAPPA
GONIOMETER GONIOMETER_OMEGA

# category DIFFRN_SCAN
loop_
_diffrn_scan.id
_diffrn_scan.frame_id_start
_diffrn_scan.frame_id_end
_diffrn_scan.frames
SCAN1 FRAME1 FRAME1 1

# category DIFFRN_SCAN_AXIS
loop_
_diffrn_scan_axis.scan_id
_diffrn_scan_axis.axis_id
_diffrn_scan_axis.angle_start
_diffrn_scan_axis.angle_range
_diffrn_scan_axis.angle_increment
_diffrn_scan_axis.displacement_start
_diffrn_scan_axis.displacement_range
_diffrn_scan_axis.displacement_increment
SCAN1 GONIOMETER_OMEGA 0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 GONIOMETER_KAPPA 0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 GONIOMETER_PHI   0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 DETECTOR_Z       0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 DETECTOR_Y       0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 DETECTOR_X       0.0 0.0 0.0 0.0 0.0 0.0
SCAN1 DETECTOR_PITCH   0.0 0.0 0.0 0.0 0.0 0.0

# category DIFFRN_SCAN_FRAME
loop_
_diffrn_scan_frame.frame_id
_diffrn_scan_frame.frame_number
_diffrn_scan_frame.integration_time
_diffrn_scan_frame.scan_id
_diffrn_scan_frame.date
FRAME1 1 0.0 SCAN1 1997-12-04T10:23:48

# category DIFFRN_SCAN_FRAME_AXIS
loop_
_diffrn_scan_frame_axis.frame_id
_diffrn_scan_frame_axis.axis_id
_diffrn_scan_frame_axis.angle
_diffrn_scan_frame_axis.displacement
FRAME1 GONIOMETER_OMEGA 0.0 0.0
FRAME1 GONIOMETER_KAPPA 0.0 0.0
FRAME1 GONIOMETER_PHI   0.0 0.0
FRAME1 DETECTOR_Z       0.0 0.0
FRAME1 DETECTOR_Y       0.0 0.0
FRAME1 DETECTOR_X       0.0 0.0
FRAME1 DETECTOR_PITCH   0.0 0.0

Fig. 5.6.4.1. Template imgCIF for use with an ADSC Quantum 4 detector.

# category AXIS
loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1]
_axis.vector[2]
_axis.vector[3]
_axis.offset[1]
_axis.offset[2]
_axis.offset[3]
GONIOMETER_OMEGA rotation    goniometer . 1 0 0 . . .
GONIOMETER_KAPPA rotation    goniometer GONIOMETER_OMEGA 0.64279 0 0.76604 . . .
GONIOMETER_PHI   rotation    goniometer GONIOMETER_KAPPA 1 0 0 . . .
SOURCE           general     source     . 0 0 1 . . .
GRAVITY          general     gravity    . 0 -1 0 . . .
DETECTOR_Z       translation detector   . 0 0 -1 0 0 0
DETECTOR_Y       translation detector   DETECTOR_Z 0 1 0 0 0 0
DETECTOR_X       translation detector   DETECTOR_Y 1 0 0 0 0 0
DETECTOR_PITCH   rotation    detector   DETECTOR_X 0 1 0 0 0 0
ELEMENT_X        translation detector   DETECTOR_PITCH 1 0 0 -94.0032 94.0032 0
ELEMENT_Y        translation detector   ELEMENT_X 0 1 0 0 0 0

# category ARRAY_STRUCTURE_LIST
loop_
_array_structure_list.array_id
_array_structure_list.index
_array_structure_list.dimension
_array_structure_list.precedence
_array_structure_list.direction
_array_structure_list.axis_set_id
ARRAY1 1 2304 1 increasing ELEMENT_X
ARRAY1 2 2304 2 increasing ELEMENT_Y

# category ARRAY_STRUCTURE_LIST_AXIS
loop_
_array_structure_list_axis.axis_set_id
_array_structure_list_axis.axis_id
_array_structure_list_axis.displacement
_array_structure_list_axis.displacement_increment
ELEMENT_X ELEMENT_X 0.0408 0.0816
ELEMENT_Y ELEMENT_Y -0.0408 -0.0816

# category ARRAY_INTENSITIES
loop_
_array_intensities.array_id
_array_intensities.binary_id
_array_intensities.linearity
_array_intensities.gain
_array_intensities.gain_esd
_array_intensities.overload
_array_intensities.undefined_value
ARRAY1 1 linear 0.23 0.03 65000 0

# category ARRAY_STRUCTURE
loop_
_array_structure.id
_array_structure.encoding_type
_array_structure.compression_type
_array_structure.byte_order
ARRAY1 "signed 32-bit integer" packed little_endian

# category ARRAY_DATA
loop_
_array_data.array_id
_array_data.binary_id
_array_data.data
ARRAY1 1 ?

Fig. 5.6.4.1. (cont.)

loop_
_axis.id
_axis.type
_axis.equipment
_axis.depends_on
_axis.vector[1]
_axis.vector[2]
_axis.vector[3]
_axis.offset[1]
_axis.offset[2]
_axis.offset[3]
GONIOMETER_OMEGA rotation    goniometer . 1 0 0 . . .
GONIOMETER_KAPPA rotation    goniometer GONIOMETER_OMEGA 0.64279 0 0.76604 . . .
GONIOMETER_PHI   rotation    goniometer GONIOMETER_KAPPA 1 0 0 . . .
SOURCE           general     source     . 0 0 1 . . .
GRAVITY          general     gravity    . 0 -1 0 . . .
DETECTOR_Z       translation detector   . 0 0 -1 0 0 0
DETECTOR_Y       translation detector   DETECTOR_Z 0 1 0 0 0 0
DETECTOR_X       translation detector   DETECTOR_Y 1 0 0 0 0 0
DETECTOR_PITCH   rotation    detector   DETECTOR_X 0 1 0 0 0 0
ELEMENT_X        translation detector   DETECTOR_PITCH 1 0 0 -172.5 172.5 0
ELEMENT_Y        translation detector   ELEMENT_X 0 1 0 0 0 0

loop_
_array_structure_list.array_id
_array_structure_list.index
_array_structure_list.dimension
_array_structure_list.precedence
_array_structure_list.direction
_array_structure_list.axis_set_id
ARRAY1 1 2300 1 increasing ELEMENT_X
ARRAY1 2 2300 2 increasing ELEMENT_Y

loop_
_array_structure_list_axis.axis_set_id
_array_structure_list_axis.axis_id
_array_structure_list_axis.displacement
_array_structure_list_axis.displacement_increment
ELEMENT_X ELEMENT_X 0.075 0.150
ELEMENT_Y ELEMENT_Y -0.075 -0.150

loop_
_array_intensities.array_id
_array_intensities.binary_id
_array_intensities.linearity
_array_intensities.gain
_array_intensities.gain_esd
_array_intensities.overload
_array_intensities.undefined_value
ARRAY1 1 linear 1.15 0.2 240000 0

Fig. 5.6.4.2. Part of the template file for a MAR345 detector. Values that differ from those in Fig. 5.6.4.1 are underlined.
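The geometric values in the two templates are related in a simple way, which the following C sketch makes explicit. It assumes (this is an interpretation of the ARRAY_STRUCTURE_LIST_AXIS items, not something stated in this section) that displacement gives the position in millimetres of the centre of the first pixel along an axis, that displacement_increment is the pixel pitch, and that pixels are counted from zero. The function names are hypothetical.

```c
/* Position (mm) of the centre of pixel `index` along one array axis.
 * Assumption: position = displacement + index * displacement_increment,
 * with index counted from 0. */
double pixel_displacement(double displacement, double increment,
                          unsigned index)
{
    return displacement + increment * (double)index;
}

/* Total length (mm) covered by `dimension` pixels of the given pitch. */
double array_extent(unsigned dimension, double increment)
{
    double pitch = increment < 0.0 ? -increment : increment;
    return (double)dimension * pitch;
}
```

Under these assumptions the ADSC Quantum 4 array spans 2304 × 0.0816 = 188.0064 mm and the MAR345 array spans 2300 × 0.150 = 345 mm, in each case twice the magnitude of the ELEMENT_X offset (94.0032 and 172.5 mm), which is consistent with the offset placing the array origin at a corner of a detector centred on the beam axis.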

5.6.5. Example programs

The CBFlib API comes with example programs and sample templates. The example program convert_image reads either of the templates described in Section 5.6.4, template_adscquantum4_2304x2304.cbf and template_mar345_2300x2300.cbf, and uses the higher-level CBFlib routines to convert an image file to a CBF. The example programs makecbf and img2cif read an image file from a MAR300, MAR345 or ADSC CCD detector and then use CBFlib to convert that image to CBF format (makecbf) or to either imgCIF or CBF format (img2cif). makecbf writes the CBF-format image to disk, reads it in again, and then compares it with the original. img2cif just writes the desired file without a verification pass. makecbf works only from files on disk, so that random input/output can be used. img2cif includes code to process files from stdin and to stdout. The example program cif2cbf copies an ASCII CIF to a binary CBF/imgCIF file.

5.6.5.1. convert_image

The program convert_image takes two arguments, imagefile and cbffile. These are the primary input and output. The detector type is extracted from the image file, converted to lower case and used to construct the name of a template CBF to use for the copy. The template file name is of the form template_name_columnsxrows. The program comes with img.c and img.h to read image files. It uses the img_... routines to extract data and metadata from the image and uses the higher-level CBFlib routines to populate the template. Fig. 5.6.5.1 is a portion of the code that reads in the template and inserts the wavelength, the distance to the detector, and the oscillation start and range.

/* Read and modify the template */
cbf_failnez (cbf_make_handle (&cbf))
in = fopen (template_name, "rb");
if (!in) exit (4);
cbf_failnez (cbf_read_template (cbf, in))

/* Wavelength */
wavelength = img_get_number (img, "WAVELENGTH");
if (wavelength)
  cbf_failnez (cbf_set_wavelength (cbf, wavelength))

/* Distance */
distance = img_get_number (img, "DISTANCE");
cbf_failnez (cbf_set_axis_setting (cbf, 0, "DETECTOR_Z", distance, 0))

/* Oscillation start and range */
axis = img_get_field (img, "OSCILLATION AXIS");
if (!axis) axis = "PHI";
osc_start = img_get_number (img, axis);
osc_range = img_get_number (img, "OSCILLATION RANGE");
cbf_failnez (cbf_set_axis_setting (cbf, 0, "GONIOMETER_PHI",
                                   osc_start, osc_range))

Fig. 5.6.5.1. An extract of program code for the conversion of image data to CBF format.

5.6.5.2. makecbf

makecbf is a good example of how to use many of the lower-level CBFlib functions. An example MAR345 image can be found at http://smb.slac.stanford.edu/~ellis/ and on the CD-ROM accompanying this volume. To run makecbf with the example image, type:

./bin/makecbf example.mar2300 test.cbf

The typical code fragment from makecbf.c shown in Fig. 5.6.5.2 creates the DIFFRN_FRAME_DATA category.

/* Make the _diffrn_frame_data category */
cbf_failnez (cbf_new_category (cbf, "diffrn_frame_data"))  /* create the category */
cbf_failnez (cbf_new_column (cbf, "id"))                   /* create the column   */
cbf_failnez (cbf_set_value (cbf, "frame_1"))               /* add data            */
cbf_failnez (cbf_new_column (cbf, "detector_element_id"))  /* create next column  */
cbf_failnez (cbf_set_integervalue (cbf, 1))                /* add data            */
cbf_failnez (cbf_new_column (cbf, "detector_id"))          /* create the column   */
cbf_failnez (cbf_set_value (cbf, detector_id))             /* add data            */
cbf_failnez (cbf_new_column (cbf, "array_id"))             /* create next column  */
cbf_failnez (cbf_set_value (cbf, "image_1"))               /* add data            */
cbf_failnez (cbf_new_column (cbf, "binary_id"))            /* create next column  */
cbf_failnez (cbf_set_integervalue (cbf, 1))                /* add data            */

Fig. 5.6.5.2. Code fragment to illustrate the creation of the DIFFRN_FRAME_DATA category.

The program img2cif is an extended version of makecbf that allows the user to choose the details of the format of the output CBF. img2cif has the following command-line interface:

img2cif [-i input_image] [-o output_cif]
        [-c {p[acked]|c[anonical]|n[one]}]
        [-m {h[eaders]|n[oheaders]}]
        [-d {d[igest]|n[odigest]}]
        [-e {b[ase64]|q[uoted-printable]|d[ecimal]|h[exadecimal]|o[ctal]|n[one]}]
        [-b {f[orward]|b[ackwards]}]
        [input_image] [output_cif]

-i takes the name of the input image file in MAR300, MAR345 or ADSC CCD detector format. If no input_image file is specified or is given as ‘-’, an image is copied from stdin to a temporary file.
-o takes the name of the output CIF (if BASE64 or quoted-printable encoding is used) or CBF (if no encoding is used). If no output_cif is specified or is given as ‘-’, the output is written to stdout.
-c specifies the compression scheme (packed, canonical or none; the default is packed).
-m selects MIME (Freed &

Borenstein, 1996) headers within binary data value text fields. The default is headers for CIFs, no headers for CBFs.
-d specifies the use of digests. The default is the MD5 digest (Rivest, 1992) when MIME headers are selected.
-e specifies one of the standard MIME encodings: BASE64 (the default) or quoted-printable; or a nonstandard decimal, hexadecimal or octal encoding for an ASCII CIF; or ‘none’ for a binary CBF.
-b specifies the direction of mapping of bytes into words for decimal, hexadecimal or octal output, marked by ‘>’ for forwards or ‘<’ for backwards [...]

[...] Si present yes
3 ALERT level A = In general: serious problem
0 ALERT level B = Potentially serious problem
3 ALERT level C = Check and explain
1 ALERT level G = General alerts; check

5 ALERT type 1 CIF construction/syntax error, inconsistent or missing data
0 ALERT type 2 Indicator that the structure model may be wrong or deficient
0 ALERT type 3 Indicator that the structure quality may be low
2 ALERT type 4 Improvement, methodology, query or suggestion

(a)

Publication of your CIF

You should always attempt to resolve as many as possible of the alerts in all categories. Often the minor alerts point to easily fixed oversights, errors and omissions in your CIF or refinement strategy, so attention to these fine details can be worthwhile. In order to resolve some of the more serious problems it may be necessary to carry out additional measurements or structure refinements. However, the nature of your study may justify the reported deviations from the submission requirements of the journal and these should be commented upon in the discussion or experimental section of a paper - after all, they might represent an interesting feature.

If level A alerts remain, which you believe to be justified deviations, and you intend to submit this CIF for publication in Acta Crystallographica Section C or Section E, you should additionally insert an explanation in your CIF using the Validation Reply Form (VRF) below. Your explanation will be considered as part of the review process.

If you intend to submit to another section of Acta Crystallographica or Journal of Applied Crystallography or Journal of Synchrotron Radiation, you should make sure that at least a basic structural check is run on the final version of your CIF prior to submission.

# start Validation Reply Form
_vrf_PLAT035_99107abs
;
PROBLEM: No _chemical_absolute_configuration info given .       ?
RESPONSE: ...
;
_vrf_PLAT761_99107abs
;
PROBLEM: CIF Contains no X-H Bonds ......................       ?
RESPONSE: ...
;
_vrf_PLAT762_99107abs
;
PROBLEM: CIF Contains no X-Y-H or H-Y-H Angles ..........       ?
RESPONSE: ...
;
# end Validation Reply Form

If you wish to submit your CIF for publication in Acta Crystallographica Section C or E, you should upload your CIF via the web. If your CIF is to form part of a submission to another IUCr journal, you will be asked, either during electronic submission or by the Co-editor handling your paper, to upload your CIF via our web site.

(b)

Fig. 5.7.2.3. Extracts from a checkcif report for a ‘publication check’ on a CIF to be submitted to an IUCr journal. (a) Alerts of various levels of severity are listed. (b) The journal policy on the handling of alerts is summarized and a validation reply form listing the A alerts is supplied for the author to fill in.

5.7.2.6. Automated data validation: checkcif

The current service for checking structural data submitted to IUCr journals is known as checkcif and is available at http://journals.iucr.org/services/cif/checkcif.html. Versions of this service have been made available to other publishers for some time. In 2003, a general service was introduced at http://checkcif.iucr.org to provide structural checks on CIF data sets destined for publication in non-IUCr journals or database deposition, or indeed to allow authors to assess the quality of their structure determinations whether they wish to publish them or not.

The tests carried out by checkcif include:
(i) a simple file syntax check: essential in the early days of manual CIF construction, but of less importance now as syntax-preserving editing programs have become more widespread;
(ii) tests for the self-consistency of mutually dependent data items present in the CIF;
(iii) a large collection of analytic tests on structural chemistry and molecular geometry based on the program PLATON (Spek, 2003).

The checks carried out at the time of publication (2005) are listed in Appendix 5.7.2 and on the CD-ROM accompanying this volume. The current list is available from http://journals.iucr.org/services/cif/datavalidation.html.

Although the results from checkcif provide valuable indications of possible inconsistencies or data errors, an article for publication is not accepted or rejected on the basis of the checkcif report alone. The report is always read by a reviewer as part of a considered critical appraisal of the article. Sometimes, particular data values are so far from the expected values that some response is required from the author to explain them. The unusual values may be a consequence of poor experimental conditions that the author was unable to improve, or of poor crystal quality; they may indicate an uncertainty in part of the structure determination that the author considers acceptable, particularly if the purpose of the study is to concentrate on a different part of the structure; or they may genuinely indicate novel chemical features. Whatever the case, anomalous values usually need to be discussed by the author and the reviewer or editor, and often need to be commented on in the article.

For Acta Cryst. C and E, checkcif generates in CIF format a list of the tests that have highlighted unusual values in the author’s CIF (called ‘A alerts’), together with a text field for each of these tests in which the author may justify or discuss the apparently anomalous results (see Fig. 5.7.2.3). Together these comprise a ‘validation reply form’. The author can complete this form and paste it into the final version of the CIF submitted for publication. The editor handling the paper can then read the comments in the validation reply form and decide whether to accept the paper for publication. The submission system will automatically return to the author any CIF which generates an A alert but does not contain a completed validation reply form.

Every article published in Acta Cryst. E has as part of its supplementary material a summary of the checkcif report for the structure described in it. This summary includes any validation reply that the author has supplied. It also includes selected numerical data items identified by the journal editors as characterizing the overall quality and completeness of the structure determination.

The characterization of the ‘quality’ of a structure is a contentious issue. For journals, where there is active selection of articles for publication, it can be difficult to assign criteria for assessing the quality of the structure determination without these being seen as judging the quality or worth of the scientific work giving rise to the result. Thus journals rely upon the experience and discernment of referees to identify structures ‘worth’ publishing. However, in a comprehensive collection of structural data sets, such as in a public structural database, it might be possible to identify particular data items that could be used for weighting individual data sets when the database is being ‘mined’ for particular patterns or characteristic values. It will be interesting to see whether a consensus emerges on what items would be suitable. It is clear that reliance on a single indicator will not be appropriate for sophisticated studies. The old idea that a structure could be classed as ‘good’ or ‘bad’ on the basis of its final residual R factor alone has long been abandoned, but it may be possible to stipulate criteria for a set of interrelated data items and use these to filter specific information from a database.
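The entries of the validation reply form in Fig. 5.7.2.3(b) have a regular layout that is easy to generate or check programmatically. The C sketch below formats one such entry. It is an illustration only: the function name vrf_format is hypothetical, and the pattern _vrf_<test>_<block> for the data name (with the trailing part taken to identify the data block) is an assumption inferred from the figure rather than a documented rule.

```c
#include <stdio.h>
#include <stddef.h>

/* Write one validation-reply-form entry into `out`.
 * Assumed layout (inferred from Fig. 5.7.2.3):
 *   _vrf_<test>_<block>
 *   ;
 *   PROBLEM: <problem>
 *   RESPONSE: <response>
 *   ;
 * Returns the value of snprintf: the length that was (or would have
 * been) written, or a negative value on error. */
int vrf_format(char *out, size_t size,
               const char *test, const char *block,
               const char *problem, const char *response)
{
    return snprintf(out, size,
                    "_vrf_%s_%s\n;\nPROBLEM: %s\nRESPONSE: %s\n;\n",
                    test, block, problem, response);
}
```

A caller would check that the returned length is smaller than the buffer size before pasting the entry into the CIF, since a truncated text field would lose its closing semicolon.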

5.7.2.7. Submission and review

When an author has previewed and checked the contents of the CIF and has made the changes suggested by a careful study of the preprint and the checkcif report, the article may finally be submitted to Acta Cryst. C or E by file upload over the web. Other files completing or supporting the submission are also transferred to the editorial office at this time. These include structure-factor or powder profile listings for each structure, figures and chemical diagrams, and sometimes other supplementary documents. Structure-factor listings are supplied in CIF format. Figures may be in one of a number of standard graphics file formats, and at the moment have to be uploaded as separate files. Future extensions to CIF, perhaps following the imgCIF approach, may allow all the items needed to submit an article, including figures, to be prepared as a single file.

When all the files have arrived at the editorial office, a review document is generated that can be sent to the referees. This document contains: the text and tables of the article that will appear in the final publication, but laid out in a more open style suitable for annotation by hand; tables of atomic positions and geometry (containing all the data in the CIF, not just the subset that has been selected for displaying in the published article); certain fields from the CIF that are not normally printed but which may contain details of the way in which the experiment was carried out (these fields might have been completed manually or by the software controlling the experiment); the figures and other supplementary documents; and a print-out of the report from a final checkcif cycle, including a displacement-ellipsoid plot of the molecule in a minimal-overlap least-squares plane view. This composite document provides the information that a referee will typically want to consider in a compact and convenient form. Because the CIF is so highly structured, producing this review document is in most cases entirely automatic. The complete CIF as submitted by the author and the experimental data are also made available to the reviewer. If revisions are requested, authors may upload modified files. The generation of revised versions of an article is also largely automatic.

5.7.2.8. Publication

When the final version of a CIF for Acta Cryst. C or E is approved, the article is ready for publication. Once more, the data fields required for the published article are extracted from the CIF and sorted. If the author has asked for additional items to be printed by using _publ_manuscript_incl_extra_item, these also are extracted. The result is transformed to a file suitable for processing by typesetting software. For Acta Cryst. C this was originally a TeX file; now a further transformation generates an SGML file that conforms to the document type definition (DTD) common to all IUCr journals. This allows not only typesetting and printing, but also the generation of the HTML for the navigable online version of the article, and the extraction of metadata for building online tables of contents and for supplying to bibliographic databases.

The conventional published article then appears in a monthly issue. Each article is still similar in style to the type of structure report published in journals for decades, although tables of atomic positions and geometric data are not usually displayed now, since these data are so readily available from the online article.

The online version of the journal, however, presents a much more information-rich version of the article. Each article is generally available in the form of a PDF file, suitable for downloading and offline printing. There is also an HTML version of the same text, and this version has rich internal links that make it easy to scroll back and forth through the article, jump to specific sections and see figures in low-resolution thumbnail or high-resolution views. The reference list contains links to the articles that are cited. There may also be links to related records in chemical or crystal structure databases. The reader may also download the experimental data and any supplementary documents associated with the article. As mentioned above, for Acta Cryst. E a summary of the check report is also available.

Finally, the structural data may be downloaded directly in CIF format. The CIF is presented in two ways. If a reader follows one link in a web browser, the file is interpreted simply as a text file and appears as a simple listing in the browser window, from which it may be printed or saved to disk. However, if the reader follows the other link, the CIF is transmitted to the browser with a header declaring its MIME type (Freed & Borenstein, 1996) as ‘chemical/x-cif’. This is one of several MIME types registered for particular presentations of chemistry-related content by Rzepa & Murray-Rust (1998). The reader may then configure a web browser to respond in a specific way to content tagged with this MIME type; typically a helper application such as a molecular visualizer [e.g. Mercury (Bruno et al., 2002)] will be launched that allows three-dimensional visualization and manipulation of the molecular or crystal structure.

When an article has been published in Acta Cryst. C or E, the CIF is transferred to the relevant public structural databases. Thus, the transcription errors that used to cause so many problems for data harvesters are completely avoided and one of the initial goals of the CIF project is achieved: uncorrupted data transfer from diffractometer, through publication, to a final repository.

Because Acta Cryst. C and E handle almost exclusively the publication of structure reports, the editorial workflow based on CIF lends itself to a very high level of automation and the journals are produced efficiently and on short timescales. Routine refereeing of structures is made very easy by the provision of checking reports, and the universal use of e-mail and web file transfer means that production times can be very fast.

5.7.3. CIF and other journals

Not every journal will be able to benefit to the same extent from the handling of CIFs. For many journals, structure reports will be secondary to the main purpose of most articles, and CIF data will more usually be deposited as supplementary or supporting documents, while only a summary (if anything) of the structure will be reported in the article body. Nevertheless, the ability to extract data from CIFs automatically and the ability of much crystallographic software to read CIFs mean that even journals that do not specialize in crystallography can provide a production stream that includes careful checking of crystal structure data. The IUCr continues to develop checkcif as a service which can be used by other publishers to enhance their checking of crystal structures, and there is considerable interest in this approach. All journals publishing the results of crystal structure determinations may easily collect the supporting data in CIF format and transfer the files to public databases, improving the accuracy and efficiency of the database-building procedures.

data_I
_symmetry_cell_setting           orthorhombic
_symmetry_space_group_name_H-M   'P 21 21 21'
_symmetry_space_group_name_Hall  'P 2ac 2ab'
_symmetry_Int_Tables_number      19

loop_
_symmetry_equiv_pos_as_xyz
'x, y, z'
'-x+1/2, -y, z+1/2'
'-x, y+1/2, -z+1/2'
'x+1/2, -y+1/2, -z'

5.7.3.1. Including CIF data in an article

For journals other than those specializing in full-scale structure reports, including CIF data in tables or reports of structures within general articles is rather more problematic. The translation of CIF data into XML seems to be a promising route to explore, as journals and reference volumes are increasingly being typeset from XML files. Traditionally, publishing has emphasized content markup that leads to a particular typographic representation. Modern trends are towards markup that tags the content by purpose, with the representation directed by external ‘style files’.

Consider Fig. 5.7.3.1, which shows the typeset representation of a set of data items in a CIF for a structural paper. First, it can be seen that several CIF data items are omitted from the printed representation, such as the International Tables space-group number and the Hall symbol for the space group. For compactness, the printed data value does not have a legend or annotation if the meaning of an item is clear from the context; thus, the crystal system and Hermann–Mauguin space-group symbol are printed without any accompanying text. The journal may also omit information that is implicit given other data; thus the cell angles are not printed for an orthorhombic cell. On the other hand, units, which are implicit in the definition of a CIF data item, are printed. Related items are grouped together in a single expression, as in the case of the θ range or the crystal dimensions. In some cases, numerical values have been rounded to meet the journal’s policy. All of these transformations are matters of style, but it can be seen that they are not always trivial mappings to single data names. The style files determining the transformation from a detailed explicit data tabulation in the initial CIF may need to implement complex logical tests to suit the requirements of the journal.

Fig. 5.7.3.2 shows the same extract in TeX, the markup and typesetting language that was used for several years to produce Acta Cryst. C. It can be seen from this extract that the actual markup maps very closely to the initial CIF. All the cell parameters, including the cell angles, are present in the source file. The expansion of the macros (e.g. \cellalpha) executes the logic required to determine whether the value is to be printed and generates the additional text surrounding the value. Each data name is mapped to a distinct macro (even if the macros themselves have identical or near-identical internal structure), which preserves the semantic labelling of the original CIF. These macros are maintained in a separate file referenced and executed by every invocation of the typesetting program.

In contrast, Fig. 5.7.3.3 shows part of the SGML now used to typeset Acta Cryst. C and to generate HTML versions of the articles online. It is immediately seen that the markup emphasizes typographic style and positioning, and there is no explicit labelling by semantic element. Additional labelling is now found in the document structure; the individual items are marked up as ‘list items’ (<li>), but the arrangement of this list into a tabular form is a feature of the typesetting engine, not the SGML.

It is clear that the TeX macros provide a representation of the contents of the CIF that could easily be converted back to the initial

_chemical_formula_moiety           'C10 H8 Br N S'
_diffrn_radiation_type             MoK\a
_exptl_absorpt_coefficient_mu      4.280

_cell_length_a                     5.7339(7)
_cell_length_b                     14.8229(15)
_cell_length_c                     23.469(2)
_cell_angle_alpha                  90.0
_cell_angle_beta                   90.0
_cell_angle_gamma                  90.0
_cell_volume                       1994.7(4)
_cell_formula_units_Z              8
_cell_measurement_temperature      296(2)
_cell_measurement_reflns_used      58
_cell_measurement_theta_min        4.806
_cell_measurement_theta_max        11.635

_exptl_crystal_description         plate
_exptl_crystal_colour              colourless
_exptl_crystal_size_max            0.40
_exptl_crystal_size_mid            0.32
_exptl_crystal_size_min            0.04
_exptl_crystal_density_meas        ?
_exptl_crystal_density_diffrn      1.693
_exptl_crystal_density_method      'not measured'
_exptl_crystal_F_000               1008

(a)

Crystal data
C₁₀H₈BrNS
Mr = 254.14
Orthorhombic, P2₁2₁2₁
a = 5.7339 (7) Å
b = 14.8229 (15) Å
c = 23.469 (2) Å
V = 1994.7 (4) Å³
Z = 8
Dx = 1.693 Mg m⁻³
Mo Kα radiation
Cell parameters from 58 reflections
θ = 4.8–11.6°
µ = 4.28 mm⁻¹
T = 296 (2) K
Plate, colourless
0.40 × 0.32 × 0.04 mm

(b)

Fig. 5.7.3.1. Typesetting of structural data. The contents of the CIF (a) are transformed into a typeset representation (b) that omits, annotates or reorders the incoming data according to context and the style rules of the journal.
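The style rules described for Fig. 5.7.3.1 can be sketched in a few lines of Python. The example below is a hypothetical illustration, not the journal's style file: the function names, the set of crystal systems treated as right-angled, and the rounding precisions are assumptions chosen to reproduce part of panel (b). It shows that the mapping from data names to printed text involves omission (Hall symbol, space-group number, implied cell angles), added units, grouping (θ range, crystal dimensions) and rounding.

```python
# Subset of the CIF items shown in Fig. 5.7.3.1(a).
ITEMS = {
    "_symmetry_cell_setting": "orthorhombic",
    "_symmetry_space_group_name_H-M": "P 21 21 21",
    "_symmetry_space_group_name_Hall": "P 2ac 2ab",
    "_symmetry_Int_Tables_number": "19",
    "_cell_length_a": "5.7339(7)",
    "_cell_length_b": "14.8229(15)",
    "_cell_length_c": "23.469(2)",
    "_cell_angle_alpha": "90.0",
    "_cell_angle_beta": "90.0",
    "_cell_angle_gamma": "90.0",
    "_cell_volume": "1994.7(4)",
    "_cell_formula_units_Z": "8",
    "_cell_measurement_theta_min": "4.806",
    "_cell_measurement_theta_max": "11.635",
    "_exptl_absorpt_coefficient_mu": "4.280",
    "_exptl_crystal_size_max": "0.40",
    "_exptl_crystal_size_mid": "0.32",
    "_exptl_crystal_size_min": "0.04",
}

# Assumption: systems whose cell angles are all fixed at 90 degrees by symmetry.
RIGHT_ANGLED = {"orthorhombic", "tetragonal", "cubic"}


def spaced_su(value):
    """Print '5.7339(7)' as '5.7339 (7)', following the typeset form."""
    return value.replace("(", " (")


def typeset_crystal_data(items):
    """Return printable lines for part of the crystal data (illustrative rules)."""
    system = items["_symmetry_cell_setting"]
    # Crystal system and H-M symbol carry no legend. The Hall symbol and the
    # space-group number are never consulted, so they are never printed.
    # Subscript typesetting of the H-M symbol is not attempted here.
    lines = [f"{system.capitalize()}, {items['_symmetry_space_group_name_H-M']}"]
    for axis in "abc":
        lines.append(f"{axis} = {spaced_su(items['_cell_length_' + axis])} Å")
    # Simplification: all three angles are printed for any other system.
    if system not in RIGHT_ANGLED:
        for symbol, name in (("α", "alpha"), ("β", "beta"), ("γ", "gamma")):
            lines.append(f"{symbol} = {spaced_su(items['_cell_angle_' + name])}°")
    lines.append(f"V = {spaced_su(items['_cell_volume'])} Å³")
    lines.append(f"Z = {items['_cell_formula_units_Z']}")
    # Related items are merged into a single expression and rounded.
    tmin = float(items["_cell_measurement_theta_min"])
    tmax = float(items["_cell_measurement_theta_max"])
    lines.append(f"θ = {tmin:.1f}–{tmax:.1f}°")
    lines.append(f"µ = {float(items['_exptl_absorpt_coefficient_mu']):.2f} mm⁻¹")
    sizes = [items["_exptl_crystal_size_" + s] for s in ("max", "mid", "min")]
    lines.append(" × ".join(sizes) + " mm")
    return lines


print("\n".join(typeset_crystal_data(ITEMS)))
```

Keeping the rules in one function separate from the data mirrors the role of an external style file: a different journal policy would change the function, not the CIF.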

input CIF. At present, such bidirectional translation is not possible from the SGML file. Clearly, therefore, a mapping to SGML that preserved semantic markup would be preferable. It is most likely that suitable bidirectional translations would be based on XML.

5.7.3.2. CIF and XML

XML is a restricted form (application profile) of SGML that is well suited to the generation of online browsable content. Mature style transformation mechanisms for XML exist and others are under active development.
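To make the idea of a bidirectional, semantically labelled translation concrete, the following Python sketch writes a set of CIF data items to XML and reads them back without loss. The element and attribute names (cif_block, item, name) are hypothetical and do not correspond to any published schema; only single-valued items are handled, and looped data would need additional structure.

```python
import xml.etree.ElementTree as ET

# A few items from the CIF extract of Fig. 5.7.3.1.
ITEMS = {
    "_chemical_formula_moiety": "C10 H8 Br N S",
    "_diffrn_radiation_type": "MoK\\a",
    "_cell_length_a": "5.7339(7)",
    "_exptl_crystal_density_meas": "?",
}


def cif_items_to_xml(block_name, items):
    """Serialize data items to XML, keeping each CIF data name as a label."""
    root = ET.Element("cif_block", {"name": block_name})
    for data_name, value in items.items():
        item = ET.SubElement(root, "item", {"name": data_name})
        item.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_cif_items(xml_text):
    """Recover the block name and the data items from the XML form."""
    root = ET.fromstring(xml_text)
    items = {item.get("name"): item.text or "" for item in root.findall("item")}
    return root.get("name"), items


xml_text = cif_items_to_xml("I", ITEMS)
print(xml_text)
print(xml_to_cif_items(xml_text) == ("I", ITEMS))
```

Because every value remains attached to its data name, a style transformation can still decide how each item is presented, while the original tabulation stays recoverable.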


\structno
\noindent{\nineit Crystal data}\nobreak\par
\vskip2pt\begindoublecolumns\twocoltrue\defaultfont
\raggedright
\everypar={\global\parindent=0pt\hangindent=1em \hangafter=1 }\noindent