

Artificial Intelligence to Assist Mathematical Reasoning

Samantha Koretsky, Rapporteur
Board on Mathematical Sciences and Analytics
Division on Engineering and Physical Sciences

Proceedings of a Workshop

Copyright National Academy of Sciences. All rights reserved.


NATIONAL ACADEMIES PRESS
500 Fifth Street, NW
Washington, DC 20001

This activity was supported by a contract between the National Academy of Sciences and the National Science Foundation. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.

International Standard Book Number-13: 978-0-309-71025-1
International Standard Book Number-10: 0-309-71025-1
Digital Object Identifier: https://doi.org/10.17226/27241

This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.

Copyright 2023 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved. Printed in the United States of America.

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Artificial Intelligence to Assist Mathematical Reasoning: Proceedings of a Workshop. Washington, DC: The National Academies Press. https://doi.org/10.17226/27241.


The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.

The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.

The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.

The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.

Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.


Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.

Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.

Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release.

For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.


PLANNING COMMITTEE ON ARTIFICIAL INTELLIGENCE TO ASSIST MATHEMATICAL REASONING: A WORKSHOP

PETROS KOUMOUTSAKOS (NAE), Harvard University, Chair
JORDAN ELLENBERG, University of Wisconsin–Madison
MELVIN GREER, Intel Corporation
BRENDAN HASSETT, Brown University
YANN LeCUN (NAS/NAE), Meta Platforms, Inc.; New York University
HEATHER MACBETH, Fordham University
TALIA RINGER, University of Illinois at Urbana-Champaign
KAVITHA SRINIVAS, IBM Research
TERENCE TAO (NAS), University of California, Los Angeles

Staff
SAMANTHA KORETSKY, Research Assistant, Workshop Director
PADMA LIM, College Student Hire (until August 4, 2023)
BLAKE REICHMUTH, Associate Program Officer (as of June 5, 2023)
BRITTANY SEGUNDO, Program Officer
MICHELLE SCHWALBE, Director, Board on Mathematical Sciences and Analytics and National Materials and Manufacturing Board
ERIK SVEDBERG, Scholar Consultant
LINDA CASOLA, Writing Consultant


BOARD ON MATHEMATICAL SCIENCES AND ANALYTICS

MARK L. GREEN, University of California, Los Angeles, Co-Chair
KAREN E. WILLCOX (NAE), The University of Texas at Austin, Co-Chair
HÉLÈNE BARCELO, Mathematical Sciences Research Institute
BONNIE BERGER (NAS), Massachusetts Institute of Technology
RUSSEL E. CAFLISCH (NAS), New York University
DAVID S. CHU, Institute for Defense Analyses
DUANE COOPER, Morehouse College
JAMES H. CURRY, University of Colorado Boulder
RONALD D. FRICKER, JR., Virginia Tech
JULIE IVY, North Carolina State University
LYDIA E. KAVRAKI (NAM), Rice University
TAMARA G. KOLDA (NAE), Sandia National Laboratories
PETROS KOUMOUTSAKOS (NAE), Harvard University
RACHEL KUSKE, Georgia Institute of Technology
YANN LeCUN (NAS/NAE), Meta Platforms, Inc.; New York University
JILL C. PIPHER, Brown University
YORAM SINGER, WorldQuant
TATIANA TORO, University of Washington
JUDY WALKER, University of Nebraska–Lincoln
LANCE A. WALLER, Emory University

Staff
MICHELLE SCHWALBE, Director
BRITTANY SEGUNDO, Program Officer
BLAKE REICHMUTH, Associate Program Officer (as of June 5, 2023)
SAMANTHA KORETSKY, Research Assistant
PADMA LIM, College Student Hire (until August 4, 2023)
JOSEPH PALMER, Senior Program Assistant
HEATHER LOZOWSKI, Senior Finance Business Partner


Reviewers

This Proceedings of a Workshop was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published proceedings as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the charge. The review comments and draft manuscript remain confidential to protect the integrity of the process.

We thank the following individuals for their review of this proceedings:

JEREMY AVIGAD, Carnegie Mellon University
HEATHER MACBETH, Fordham University
RYAN MURPHY, National Academies of Sciences, Engineering, and Medicine
KAIYU YANG, California Institute of Technology

Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the content of the proceedings nor did they see the final draft before its release. The review of this proceedings was overseen by JAMES CROWLEY, Society for Industrial and Applied Mathematics (Retired). He was responsible for making certain that an independent examination of this proceedings was carried out in accordance with standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the rapporteur and the National Academies.


Contents

1 INTRODUCTION
   Workshop Overview
   Organization of This Workshop Proceedings
   Reference

2 OVERVIEW AND GRAND VISION
   The Quest for Automated Reasoning
   What Can the Mathematician Expect from Deep Learning?
   Discussion
   References

3 CASE STUDIES
   Case Studies: Artificial Intelligence to Assist Mathematical Reasoning
   Case Studies: Proof Verification and Checking
   References

4 CURRENT CHALLENGES AND BARRIERS
   Data for Artificial Intelligence and Mathematics: Challenges and New Frontiers
   Building an Interdisciplinary Community
   The Role of Intuition and Mathematical Practice
   Concentration of Machine Learning Capabilities and Open-Source Options
   Mathematical Foundations of Machine Learning
   Challenges and Barriers Panel
   References

5 TECHNICAL ADVANCES REQUIRED TO EXPAND ARTIFICIAL INTELLIGENCE FOR MATHEMATICAL REASONING
   Research Advances in Computer Science
   Research Advances in the Mathematical Sciences
   References

6 ROLES FOR STAKEHOLDERS

7 CONCLUSION
   Reference

APPENDIXES
A Workshop Agenda
B Biographical Information for Planning Committee Members


1 Introduction

Artificial intelligence (AI) is useful in various research disciplines, and its applicability to mathematical research has drawn growing interest in recent years. For example, Davies and colleagues (2021) used AI to help produce two meaningful contributions to mathematics: one in representation theory, producing a proof for a conjecture that had remained unsolved for over 40 years, and one in topology, connecting the algebraic and geometric structure of knots in a new way. These contributions are just two of many that demonstrate that AI has the potential to aid new mathematical discoveries.

Particularly as the amount of data available grows beyond what any person can study, AI can be useful in its power to identify patterns in data. Davies and colleagues (2021) explained that machine learning (ML), a type of AI, can be a powerful tool in refining relationships between properties. Traditionally, this part of the mathematical research process requires the mathematician’s intuition; ML can help guide intuition by quickly testing whether a relationship may be worth exploring further. More generally, ML can assist high-level idea generation in mathematics.

Additionally, symbolic AI, which is another type of AI distinct from ML, can be useful to mathematics research as well. Symbolic automated reasoning tools such as first-order theorem provers embody a complementary approach to using AI to assist mathematical reasoning, and they can be especially useful in formalizing mathematics—essentially digitizing mathematics in a computer proof assistant—and can ultimately open up new avenues for mathematical research.


Mathematical reasoning is a central aspect of human intelligence that plays an important role in knowledge discovery. In the past few decades, the mathematics and computer science communities have contributed to research on how AI may assist mathematical reasoning, whether that is through the use of ML for guiding idea generation or finding counterexamples to conjectures, the use of symbolic AI for formalizing mathematics, or other ways entirely. Recent technical advances have led to a surge of interest in this domain from the mathematical and computer science communities, as well as from the AI community.

WORKSHOP OVERVIEW

Sponsored by the National Science Foundation (NSF), the National Academies of Sciences, Engineering, and Medicine’s Board on Mathematical Sciences and Analytics convened a 3-day public virtual workshop on June 12–14, 2023, to bring together stakeholders to discuss the state of the art and current challenges and opportunities to advance research in using AI for mathematical reasoning (see Box 1-1 for the workshop’s statement of task).

BOX 1-1
Statement of Task

A National Academies of Sciences, Engineering, and Medicine–appointed ad hoc committee will plan and organize a workshop to explore opportunities to advance AI to assist mathematical reasoning. This workshop will bring together academic, industry, and government stakeholders to discuss the following topics:

• State of the art of using AI for mathematical reasoning, including case studies in particular domains.
• Opportunities to advance research in AI for mathematical reasoning and potential impacts from doing so, and technical advances required to expand this initiative.
• Current challenges and barriers to the use of AI for mathematical reasoning.
• Roles for stakeholders in advancing AI for mathematical reasoning.

In addressing these topics, the workshop will bring together domain experts from mathematics, statistics, computer science, data science, and other relevant fields; highlight emerging research opportunities; and explore approaches to strengthen coordination and collaboration among the interdisciplinary research communities. One or more rapporteurs who are not members of the committee will be designated to prepare a workshop proceedings.


At the beginning of the workshop, David Manderscheid, director of the Division of Mathematical Sciences (DMS) at NSF, and Dilma Da Silva, director of the Division on Computing and Communication Foundations (CCF) at NSF, gave brief remarks on NSF’s interest in AI to assist mathematical reasoning. Manderscheid explained that AI is a priority at NSF, and the application of AI to mathematics in particular has recently garnered attention. He identified DMS’s interest in both research and education and expressed that AI has the potential to transform the way mathematics research is conducted. Da Silva emphasized CCF’s interest in formal methods and verification of software and hardware as well as AI, and she noted a connection between elements of the workshop and NSF’s larger vision.

Invited workshop speakers included mathematicians and computer scientists, among them researchers working in areas such as interactive theorem proving, automated reasoning, formal methods, type theory, and AI and ML. The workshop began with an overview and discussion of case studies, leading to sessions later in Day 1 and throughout Day 2 that explored various challenges and barriers to the adoption of AI for mathematical reasoning. Speakers presenting on Day 3 discussed technical advances required to widen the adoption and roles for stakeholders to advance AI for mathematical reasoning.

ORGANIZATION OF THIS WORKSHOP PROCEEDINGS

Chapter 2 presents an overview of the history, state of the art, and potential futures for using AI to assist mathematical reasoning. Chapter 3 highlights case studies of AI assisting mathematical reasoning as well as case studies in proof verification and checking. Chapter 4 details several challenges and barriers to the adoption of AI for mathematical reasoning, including both institutional and technical challenges. 
Chapter 5 describes past and potential future research advances in proof verification and AI for mathematical reasoning, particularly exploring synergies between the two, and Chapter 6 details a panel discussion on roles for stakeholders in supporting the advancement of using AI for mathematical reasoning. Finally, Chapter 7 offers key themes underlying the workshop presentations and discussions.

This proceedings was prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The workshop planning committee’s role was limited to organizing and convening the workshop (see Appendix A for the workshop agenda and Appendix B for biographical sketches of the workshop planning committee members). The views expressed in this proceedings are those of the individual workshop participants and do not necessarily represent the views of


the participants as a whole, the planning committee, or the National Academies.

REFERENCE

Davies, A., P. Veličković, L. Buesing, S. Blackwell, D. Zheng, N. Tomašev, R. Tanburn, et al. 2021. “Advancing Mathematics by Guiding Human Intuition with AI.” Nature 600:70–74. https://doi.org/10.1038/s41586-021-04086-x.


2 Overview and Grand Vision

Moshe Vardi, Rice University, and Geordie Williamson, University of Sydney, presented an overview of and vision for using artificial intelligence (AI) to assist mathematical reasoning and participated in a joint discussion session moderated by workshop planning committee member Brendan Hassett, Brown University.

THE QUEST FOR AUTOMATED REASONING

Vardi provided a brief history of mathematical proof and reasoning, beginning in ancient Greece and leading up to automated reasoning in the present day. He described the ancient Greek concept of the deductive proof, an airtight argument, as marking the beginning of mathematical reasoning. Aristotle’s concept of the syllogism, reasoning that follows from form alone, followed closely behind. The ancient Greeks also introduced the concept of the paradox, in which self-reference is a key feature.

Jumping to the 17th century, Vardi explained Gottfried Leibniz’s dream of a universal mathematical language—a lingua characteristica universalis—that could express all human knowledge, allowing calculational rules to be carried out by machines. In 1847, George Boole took a step toward this mathematical language with the creation of Boolean logic, which allows one to treat Aristotelian syllogisms algebraically: a logical “AND” can be viewed as a product, and an “OR” as a sum, for example.
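Boole's algebraic reading of logic can be made concrete in a few lines (an illustrative sketch, not presented at the workshop): over the values {0, 1}, AND is literally a product, and OR is a sum corrected so that 1 + 1 stays at 1.

```python
# Boolean logic as algebra over {0, 1}: Boole's 1847 observation in miniature.
# AND behaves like a product; OR like a sum, with the product subtracted off
# so that 1 OR 1 stays within {0, 1}.

def AND(x, y):
    return x * y          # any factor of 0 gives 0; 1*1 = 1

def OR(x, y):
    return x + y - x * y  # corrected sum: 1 + 1 - 1 = 1

# Every row of the classical truth tables falls out of the arithmetic.
for x in (0, 1):
    for y in (0, 1):
        print(f"x={x} y={y}  AND={AND(x, y)}  OR={OR(x, y)}")
```

Under this reading, an Aristotelian syllogism becomes an algebraic identity to be checked rather than an argument to be judged.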


Vardi noted that mathematical advances in the 19th century, particularly Georg Cantor’s proof that there exist infinitely many distinct infinities, caused mathematicians to consider more carefully what a rigorous proof is. In pondering this question, Gottlob Frege developed logicism, the idea that logic is foundational for mathematics. Logic as a language for mathematics allows one to, for instance, formalize Aristotelian syllogisms. However, in 1902, Bertrand Russell pointed out a paradox in Frege’s logical framework, sparking a foundational crisis in mathematics dubbed the Grundlagenkrise. David Hilbert developed Hilbertian formalism in the 1920s in response to this crisis, Vardi continued, which requires mathematics to be consistent, complete, and decidable. Hilbert emphasized that proofs must be studied and treated as mathematical objects. Kurt Gödel’s incompleteness theorems, proved shortly thereafter, show that a consistent system strong enough to express number theory cannot prove its own consistency, demonstrating that Hilbert’s program was impossible. To finish his historical account, Vardi presented Stephen Cook and Robert Reckhow’s 1979 definition that a proof system is an algorithm, and a rigorous proof is one that can be checked.

Vardi then turned to the Boolean satisfiability problem (SAT), a problem for which researchers have tried to automate solving. During the so-called SAT revolution in the mid-1990s, key tools were developed—including GRASP in 1996 and Chaff in 2001—and many modern SAT solvers use an algorithm called conflict-driven clause learning. Computers can now solve SAT problems with millions of variables (see Figure 2-1). As another example, Vardi presented the Pythagorean triple problem, which asks whether the positive integers can be colored red and blue so that no Pythagorean triple (a set of numbers x, y, and z satisfying x² + y² = z²) consists of all red or all blue members.
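To make the coloring question concrete, it can be checked by brute force for very small ranges (an illustrative sketch, not code presented at the workshop; the actual proof instead encoded the range 1..7825 as a SAT instance, the smallest range for which no valid coloring exists):

```python
from itertools import combinations, product

def pythagorean_triples(n):
    """All triples (x, y, z) with x < y < z <= n and x^2 + y^2 == z^2."""
    return [(x, y, z)
            for x, y, z in combinations(range(1, n + 1), 3)
            if x * x + y * y == z * z]

def colorable(n):
    """Brute force: does some red/blue coloring of 1..n avoid a
    monochromatic Pythagorean triple? (Feasible only for tiny n.)"""
    triples = pythagorean_triples(n)
    for coloring in product((0, 1), repeat=n):
        if all(len({coloring[x - 1], coloring[y - 1], coloring[z - 1]}) > 1
               for x, y, z in triples):
            return True
    return False

print(pythagorean_triples(13))  # [(3, 4, 5), (5, 12, 13), (6, 8, 10)]
print(colorable(13))            # True: small ranges are easily 2-colorable
```

The brute-force loop is exponential in n, which is exactly why the published proof relied on a SAT solver with cube-and-conquer splitting rather than enumeration.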
By reducing the problem to SAT, Heule, Kullmann, and Marek proved the impossibility of this coloring (Heule et al. 2016). Their proof was developed and checked algorithmically.

In closing, Vardi remarked that the quest for automated reasoning is at the heart of mathematics and computer science. Huge progress has been made in Boolean reasoning, automated deduction, and proof assistants, as well as in the formal verification of programs and in formalized mathematics. He concluded his presentation with an open question about the future roles of deep learning (DL) and large language models (LLMs) for automated reasoning.

WHAT CAN THE MATHEMATICIAN EXPECT FROM DEEP LEARNING?

Williamson focused on how one can use DL to guide mathematical proof or reasoning at a high level. Citing Turing’s 1948 work Intelligent

[Figure 2-1 omitted: a bar chart, “Some Experience with SAT Solving,” showing the speed-up (log scale) of a 2012 SAT solver over earlier solvers.]

FIGURE 2-1 The improvement in Boolean satisfiability problem (SAT) solvers between 2000 and 2012. SOURCES: Created by Sanjit A. Seshia, presented to the workshop by Moshe Vardi, Rice University, June 12, 2023.

Machinery, one of the first papers to suggest the possibility of AI, he noted that mathematics has served as a litmus test for AI from its very beginnings. Reasoning, with which modern AI still struggles, is central to the mathematical process, making mathematics an especially interesting and important area in which to test the power of AI. Williamson also remarked that AI may change the fields of mathematics and computer science dramatically. Keeping this in mind, he noted that mathematicians and computer scientists play a vital role in the conversation on steering the future of AI deliberately and carefully; he pointed to discussions in the 2022 Fields Medal Symposium that delve further into this topic.1 He emphasized that mathematical understanding is guided by a wide variety of sources, including intuition (e.g., geometric or physical), proof (whether informal or formal), notation, analogy, computation, and heuristics.

1 For more information on the 2022 Fields Medal Symposium, see http://www.fields.utoronto.ca/activities/22-23/fieldsmedalsym, accessed June 20, 2023.

To give a “crash course in DL,” Williamson explained that one may think of a neural network (NN) as a type of function approximator. For example, one can imagine that a function could take as input the pixel


values of an image of a tiger and output the probability that the image is of a tiger; as another example, a function could take as input an incomplete sentence and output a probability distribution on the next word. To create this function approximation with an NN, one can combine linear algebra with nonlinearities, applying gradient descent to move from an enormous space of possible functions toward a target function. Williamson presented an example using TensorFlow Playground2 that trained a function to classify orange versus blue nodes on a spiral. He emphasized that the process is stochastic, as well as not necessarily monotonically improving. Additionally, in general, he said that DL works best when (1) the input dimension is high, (2) the function is on the unit cube, and (3) coordinates have low symbolic content.

Providing advice on DL to the interested mathematician, Williamson expressed that NNs are difficult to interpret, which is especially important to note as understanding is vital to the mathematician. One should also expect engineering subtleties; the choice of architecture is important, and it can be unclear whether the implementation or the idea is at fault when encountering issues. Additionally, one should beware the “hype machine” that can produce misconceptions about the power of DL. Finally, he suggested that mathematicians work with DL experts; interdisciplinary work is challenging but worthwhile.

Williamson then presented an example of a problem in representation theory. A pair of permutations x, y can be associated with two objects, a Bruhat graph and a Kazhdan-Lusztig polynomial. Using DL, Williamson and his team approached the unsolved combinatorial invariance conjecture stating that there exists a relationship between these two objects. They trained a graph NN to predict the coefficients of Kazhdan-Lusztig polynomials, and it achieved high accuracy after just a few days of training. 
The difficulty was distilling understanding out of the model. In studying the model’s behavior, the team noticed a hidden labeling of certain edges of the Bruhat graphs. Looking at edges deemed important by the model led to the discovery of hypercube-like structures within the graphs; with further work, the team eventually devised a formula proving a special case of the combinatorial invariance conjecture, opening the door to new work in combinatorics. Williamson concluded his presentation by citing two works that delve further into the topics he discussed (see Davies et al. 2021; Williamson 2023).
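The function-approximator picture sketched above can be stripped to a toy training loop (an illustrative sketch, not workshop code): gradient descent nudging the parameters of a candidate function toward a target, here a linear target so the toy stays transparent. Real deep learning composes many such maps with nonlinearities in between, but the loop has the same shape.

```python
# Toy function approximation by gradient descent: the core loop behind
# deep learning, stripped down to one trainable function f(x) = w*x + b.

def train(samples, lr=0.1, steps=2000):
    w, b = 0.0, 0.0                      # start far from the target function
    n = len(samples)
    for _ in range(steps):
        # gradients of the mean squared error, averaged over the data
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w                 # step downhill on the loss surface
        b -= lr * grad_b
    return w, b

# target function: y = 3x + 2, observed on a handful of points in [0, 1]
data = [(x / 10, 3 * (x / 10) + 2) for x in range(11)]
w, b = train(data)
print(round(w, 3), round(b, 3))  # converges toward (3.0, 2.0)
```

The stochasticity and non-monotone progress Williamson noted enter once the data are sampled in mini-batches and the function class becomes a deep network rather than a line.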

2 The TensorFlow Playground website is https://playground.tensorflow.org, accessed June 20, 2023.


DISCUSSION

Hassett moderated a discussion among the session’s speakers. He asked for the speakers’ thoughts on the role of mathematical intuition and DL’s relationship to it. Williamson remarked that for mathematicians, understanding comes from several axes, and he indicated that DL can provide another axis to enhance intuition. Vardi expressed appreciation for Williamson’s constructive approach, noting the fearmongering that often occurs around AI, and seconded Williamson’s thoughts on DL as another tool for mathematicians.

Hassett then asked what machine learning (ML) may do for mathematicians’ understanding of theorems. With the caveat that the future is unpredictable, Vardi stated that these tools may provide support in areas where mathematicians lack intuition, and studying the large computational proofs that ML produces can lead to new insights. Currently, in many cases, NNs can solve a problem, but it is a mystery how: the algorithms are black boxes. Williamson noted that researchers generally have held the position that NNs will remain black boxes, but recent work has pointed toward the future possibility of distilling human comprehension from them.

Hassett inquired about the kinds of mathematical problems that will be susceptible to NN techniques. Because few data points exist, Williamson cautioned against being restricted by notions of what might or might not work. That said, these techniques seem to work better on problems with greater noise stability, but both Williamson and Vardi emphasized that they have consistently been surprised by the abilities of AI.

Noting that Williamson and Vardi have both conducted interdisciplinary research, Hassett posed a question about the language gaps that exist between mathematicians and ML researchers and how these gaps can be bridged. Vardi highlighted that between communities, even basic vocabulary is not necessarily shared, and one must always question assumptions and be flexible in thinking. 
Williamson added that interdisciplinary work takes time. It took a year, he approximated, to find a common language with his DeepMind research collaborators. Additionally, different research communities can have very different goals. Vardi suggested being open to changing how one views the world in order to truly work with another person.

Next, Hassett wondered whether the development of DL techniques changes the types of questions that researchers ask and choose to explore. Vardi answered that he does not see that change happening yet. At the moment, these tools are still being built. However, he continued, if these techniques become working tools for mathematicians, they will certainly influence how researchers think about mathematics. The belief in one’s ability to solve a problem is a key aspect of selecting a problem to study;


changes in tools will change methods of attack and thus the problems researchers choose to pursue. Williamson agreed and added that DL itself is built on rich mathematical foundations, and the problems that arise from DL research will also influence mathematics. Finally, Hassett asked Williamson and Vardi for their advice for a young person looking to enter this area of research. Vardi expressed that the PhD path is best for a person who finds it intellectually thrilling to be at the interface between the known and unknown. Williamson emphasized that people should take time to understand and follow the things they find interesting.

3 Case Studies

In the workshop session following the overview and grand vision for using artificial intelligence (AI) to assist mathematical reasoning, several speakers explored case studies in both mathematics and computer science.

CASE STUDIES: ARTIFICIAL INTELLIGENCE TO ASSIST MATHEMATICAL REASONING

Yann LeCun, Meta Platforms, Inc., moderated the session on case studies in AI to assist mathematical reasoning; he introduced speakers François Charton, Meta AI, and Adam Wagner, Worcester Polytechnic Institute.

Reasoning with Transformers

Charton presented the goal of solving mathematical problems with a transformer—a type of neural network (NN) model used in large language models (LLMs) like ChatGPT that trains on large amounts of data but cannot necessarily reason by itself. He described mathematical problem-solving with transformers as a translation task. The transformer should be trained on generated sets of problems and solutions in order to translate the problem, representable as a mathematical statement, into the solution, another mathematical statement. For example, one would want the transformer to be able to translate the input 7 + 9 into the output 16,

or to translate the input of a specific polynomial into the output of the polynomial’s roots. Charton presented the problem of finding the greatest common divisor (GCD) of two integers. The integers 10 and 32 have a GCD of 2. This can be encoded as sequences of base 10 digits, so the model would train on an input of +, 1, 0, +, 3, 2, with the matching output +, 2. He emphasized that the transformer does not reason mathematically; it learns purely by training on examples. He then presented several instances in which transformers effectively learned to solve mathematical problems by computing the symbolic integral of a function (Lample and Charton 2019) or calculating the eigenvalues of a matrix, for example (Charton 2021). Although the model was shown to be effective at solving some problems, questions remain as to whether it had learned the underlying mathematics. Charton presented a few scenarios in which the model could not give a correct answer, regardless of the size of the transformer (i.e., the number of parameters) or the amount of data used in training. Returning to the GCD problem, he showed results for a model trained on randomly generated pairs of integers and their GCDs, encoded in base 30. Achieving 85 percent accuracy after training on 300,000 examples and 94.6 percent accuracy after 450 million examples, the model appears to be effective at finding the GCD. However, when encoding the training data in other bases, the model did not perform as well, with its accuracy depending on the base. Studying the model more closely, Charton found that it was predicting the GCD correctly for specific cases and consistently failing for others. He presented ongoing work using a different way of training the model, in which training data are logarithmically distributed, that seems to be more effective in finding the GCD for pairs of integers in all bases. Charton showed another example in which a model learned to decompose symmetric matrices (Charton 2022).
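Charton’s digit-sequence encoding can be made concrete with a short sketch. The function below mirrors his worked example (input +, 1, 0, +, 3, 2; output +, 2); the token names and layout here are illustrative and not necessarily the exact scheme used in his experiments.

```python
import math

def encode(n, base=10):
    # Encode an integer as a sign token followed by its digits in a given base.
    # The token vocabulary is illustrative, not Charton's exact scheme.
    digits = []
    m = abs(n)
    while True:
        digits.append(str(m % base))
        m //= base
        if m == 0:
            break
    return ["+" if n >= 0 else "-"] + digits[::-1]

def gcd_example(a, b, base=10):
    # A training pair: the input sequence encodes both operands,
    # and the target sequence encodes their greatest common divisor.
    return encode(a, base) + encode(b, base), encode(math.gcd(a, b), base)

inp, out = gcd_example(10, 32)
# inp == ["+", "1", "0", "+", "3", "2"], out == ["+", "2"]
```

Re-encoding the same pairs in another base (for example, base 30, as in the reported experiments) changes the token sequences but not the underlying function, which is what makes the base-dependent accuracies Charton observed so striking.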
The spectral theorem provides a simple way for humans to solve this problem using a few of the matrix’s properties, such as its eigenvalues. One can ask whether the model, which is training on examples and is unaware of the underlying mathematics, learns this theorem. After training the model to 92 percent accuracy, Charton tested it on 100,000 matrices and found that though the model predicted the correct decomposition in 92 percent of the cases, it predicted the matrices’ eigenvalues correctly in 99.99 percent of cases. Even when the model failed, it respected certain properties, which suggests that some of the underlying mathematics was learned. Furthermore, when he altered the experiment and stopped training the model after it reached 70 percent accuracy, it still predicted eigenvalues correctly in 99.6 percent of test cases. Finally, Charton presented an example in which a model was trained to predict eigenvalues of matrices. After training on only Wigner matrices, the model performed very well when tested on Wigner matrices

but could not predict eigenvalues of non-Wigner matrices. Training on matrices with Gaussian or Laplace-distributed eigenvalues, however, the model was able to generalize and predict eigenvalues for different kinds of matrices. He underscored that training data matter. To conclude, Charton reiterated a few key points: (1) Test sets are not sufficient; a model can do well on test sets but perform catastrophically on other data. (2) It is important to perform failure case analysis, studying problems when a model fails. (3) Transformers are, amazingly, able to learn the mathematics underlying a problem just by training on examples.

Finding Counterexamples via Reinforcement Learning

Wagner remarked that reinforcement learning (RL) algorithms have achieved superhuman levels of success in games such as Go and chess, starting with only the rules of the game as input. He asked whether this same idea could be applied to mathematics by framing a mathematical problem as a game and giving it to an NN. He examined several examples of the same type of RL algorithm seeking counterexamples to open conjectures. Wagner’s first example centered on some of his own work surrounding a conjecture stating that for any graph, the sum of its largest eigenvalue and its matching number is greater than or equal to some function of its number of vertices. The smallest counterexample found for this conjecture has 600 vertices. In general, to use RL to find a counterexample to a conjecture, one determines two things: how to phrase the conjecture as a game, and what the reward function should be. He explained that in this case the conjecture can be framed easily as a game: the algorithm is given edges one at a time and can choose whether to accept or reject the next edge. The reward function can be designed to minimize the sum of the graph’s largest eigenvalue and matching number. Wagner then showed results in which the algorithm, over time, was able to find a counterexample.
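Wagner’s reward setup can be illustrated with a toy scoring function. The sketch below is illustrative only (the names are hypothetical, and the brute-force matching computation is viable only for tiny graphs): the agent maximizes the negated sum of the largest adjacency eigenvalue and the matching number, so any graph whose sum drops below the conjectured bound would be a counterexample.

```python
import itertools
import numpy as np

def largest_eigenvalue(adj):
    # The adjacency matrix is symmetric, so eigvalsh applies;
    # it returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(adj)[-1])

def matching_number(edges):
    # Brute force: size of the largest set of pairwise disjoint edges.
    for k in range(len(edges), 0, -1):
        for combo in itertools.combinations(edges, k):
            endpoints = [v for e in combo for v in e]
            if len(endpoints) == len(set(endpoints)):
                return k
    return 0

def reward(edges, n):
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    # The agent maximizes this reward, i.e., minimizes the sum; a graph
    # whose sum undercuts the conjectured lower bound refutes the conjecture.
    return -(largest_eigenvalue(adj) + matching_number(edges))
```

For the path on three vertices, for instance, the largest eigenvalue is √2 and the matching number is 1, giving a reward of −(√2 + 1) ≈ −2.41.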
He acknowledged that this was a “dream scenario,” with both an obvious way to phrase the conjecture as a game and an obvious choice of reward function. Wagner shared five more complex examples with unique challenges in which the RL algorithm still successfully refuted an open conjecture. He presented an open conjecture similar to that of the first example, stating that for any graph, the sum of two certain parameters is positive (Aouchiche and Hansen 2016). The game can be phrased the same way, and the reward function can minimize the sum of the two parameters, as before. This time, the best graph the algorithm could find was not quite a counterexample, with the two parameters summing to approximately 0.4. However, the mathematician can easily adjust the graph produced by the

algorithm to find a true counterexample. Thus, he stated that even though the RL algorithm could not find a counterexample on its own, it provided enough insight for a human to solve the problem easily. Wagner presented a third example related to matrices (Brualdi and Cao 2020), demonstrating the algorithm’s applicability to other types of problems besides graphs. He then gave two more examples where the choice of reward function was less obvious than in the three previous examples. The first was a conjecture on trees (Collins 1989), and the second a question about the existence of two graphs with a special relationship. Through the latter case especially, Wagner explained that sometimes with more complicated conjectures, though a reward function can be devised, RL is not necessarily the best approach for the problem. Wagner’s final example studied a conjecture posed by Paul Erdős in 1962, which differed from his earlier examples in that any counterexample must be an infinite sequence of graphs. Wagner and his colleague Gwenaël Joret found a way to produce a counterexample with RL by using a standard “trick” in graph theory: constructing a finite graph and then taking a limit to infinity. Wagner remarked that this example demonstrates the power of these methods; if the conjecture had been made today, it would not have remained unsolved for 27 years. However, the example also reveals a significant drawback: although their algorithm found a counterexample and solved the conjecture, the graph it produced is complicated and messy, providing no understanding beyond the falsity of the conjecture. Thomason (1989) offers far more insight into the problem and solution. The understanding that usually arises from human over machine proofs is important to keep in mind, Wagner stressed.

Discussion

Incorporating questions submitted by workshop participants, LeCun asked each speaker a few questions about their presentations.
He asked about Wagner’s specific methodology, and Wagner explained some of the subtleties of his RL method (versus other kinds of machine learning methods) as well as the difficulties of benchmarking. LeCun drew the participants’ attention to the community project Nevergrad (https://facebookresearch.github.io/nevergrad, accessed June 29, 2023), a large PyTorch library of gradient-free optimization methods that can be useful as a baseline before studying more sophisticated methods. Turning to Charton, LeCun inquired about the utility of finding the GCD via transformers, given that efficient algorithms already exist to calculate the GCD deterministically. Charton explained that transformers are

useful in this case because it is a simple, easily solved problem. Evaluating the effectiveness of transformers to find the GCD helps guide intuition and define the use cases for which AI may (and may not) be helpful. LeCun also asked about Charton’s current work in which training data for the GCD model are logarithmically distributed. Charton described the motivation of curriculum learning; the logarithmic distribution ensures that the model trains on a greater number of simpler cases, imitating how humans first learn simple problems before learning more difficult ones. Delving into more general topics, LeCun posed a question about how combining generative methods like LLMs with computational and mathematical tools like WolframAlpha might be useful in the context of mathematical reasoning. Wagner answered that since LLMs cannot reason on their own, real solutions will likely come from adjusting their architectures to enable reasoning and planning. However, Charton expressed skepticism, noting that this kind of solution may work for simple arithmetic and mathematical word problems but may not be effective for more complex physics problems that require a mixture of several tools at once, especially considering the large amounts of data necessary for training. LeCun asked how researchers could combine the power of LLMs, which cannot reason, with the planning abilities that have been developed in other spaces like AI for game playing. He explained the concept of a Monte Carlo tree search, which is crucial to AI for many games. Games like Go can be represented in a tree with branches for every possible move; mathematical proving can be represented similarly. Because the branches grow exponentially, he continued, success hinges on the ability to identify which branches are most promising and should thus be explored. 
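The branch-selection step LeCun describes is typically driven by a bandit rule such as UCB1, the core of Monte Carlo tree search. The sketch below is generic and illustrative (the field names and exploration constant are arbitrary choices, not any particular proof-search system): it favors branches with high average reward while still forcing unexplored branches to be tried.

```python
import math

def ucb_select(children, total_visits, c=1.4):
    # UCB1: mean reward (exploitation) plus a bonus that shrinks as a
    # branch accumulates visits (exploration).
    def score(child):
        if child["visits"] == 0:
            return float("inf")  # unexplored branches are tried first
        mean = child["wins"] / child["visits"]
        bonus = c * math.sqrt(math.log(total_visits) / child["visits"])
        return mean + bonus
    return max(children, key=score)

children = [
    {"name": "branch_a", "wins": 6, "visits": 10},
    {"name": "branch_b", "wins": 2, "visits": 3},
    {"name": "branch_c", "wins": 0, "visits": 0},
]
best = ucb_select(children, total_visits=13)  # picks the unexplored branch_c
```

In a proof-search setting, a “win” might correspond to a completed subproof; the exponential branching LeCun mentions is tamed exactly because this rule concentrates visits on the most promising branches.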
Charton reiterated that to apply this idea to mathematical proof, one of the main barriers lies in obtaining and effectively using data to train a model to predict efficient next steps. Continuing to explore the question of how to develop an ability in AI to reason and rely on mental models like humans do, Charton emphasized the importance of definitions. It could be useful to add definitions to AI—mirroring the notions mathematicians have on the differences among theorems, lemmas, corollaries, etc.—as well as to build real-world elements into training—mirroring how children easily intuit mathematical concepts from real-world elements. He stressed the value of concept-building and representation at a base level. Wagner described his process of proving mathematical theorems, indicating that both a global overview of the proof and a local picture of the current step are important. Such an architecture would be crucial to advancing AI for mathematical proof. Drawing on Wagner’s description, LeCun remarked on the top-down, hierarchical approach that humans take to complete a proof and wondered whether it would ever be possible for AI to plan hierarchically.

Charton commented that hierarchical planning might be less necessary than many believe. Human processes use hierarchical planning, but NNs are much better at working in high-dimensional spaces than humans. Researchers may be influenced by the human limitations that necessitate hierarchical planning, he explained, and NNs may not need that kind of architecture to be effective.

CASE STUDIES: PROOF VERIFICATION AND CHECKING

Speakers Thierry Coquand, University of Gothenburg; Johan Commelin, University of Freiburg; and Greg Morrisett, Cornell University, presented case studies in proof verification and checking. The session was moderated by Talia Ringer, University of Illinois at Urbana-Champaign.

Interactive Proof Assistants

Coquand provided a general introduction to interactive proof assistants, discussing notions of correctness and the motivation for developing and using the tools, followed by a demonstration of a proof assistant in action. He first explained why mathematical correctness can be checked with computers. For knowledge outside of mathematics, one must rely on an external authority, but within mathematics, an objective and purely internal way to ensure correctness exists. Mathematical correctness is verified using the form, not the content, of the argument. He qualified that intuitions and social factors are still crucial to mathematics, but ultimately, an argument is accepted based on the formal rules. Before the advent of computers, absolute precision in mathematics was considered unfeasible. However, mathematicians can now write complex proofs precisely using interactive proof assistants, which he defined as “software that helps mathematicians [and] computer scientists in writing mathematical arguments.” Proof assistants themselves can be trusted to be correct because their kernels are small enough to be verifiable by humans.
To explore motivations for using proof assistants, Coquand pointed out that in both computer science and mathematics, some proofs are so long and complex that it is virtually impossible to check their correctness by hand with full confidence. He gave the Feit-Thompson theorem as an example from mathematics; its proof was 250 pages long (Feit and Thompson 1963). For some applications in computer science, it is essential that all details be correct, making proof assistants useful, he said. seL4 (https://sel4.systems, accessed July 10, 2023) and CompCert (https://compcert.org, accessed July 10, 2023) are two examples of software that were verified using

interactive proof assistants. seL4 is an operating system microkernel, and CompCert is a compiler; having proof of correctness lends great value to each. Additional motivation for using proof assistants arises from the way these tools push researchers to consider how to express formal rules for mathematics, which can lead to new insights; for example, recent work revealed new logical insight about the notion of equality—a very basic mathematical notion—and unexpected connections between equality and the field of homotopy theory. Proof assistants can provide a new avenue for mathematicians to organize their ideas and to improve their understanding of complex proofs, which could ultimately lead to producing better proofs. Coquand explained that the development of proofs has a strong analogy with the development of programs. They have the same issues of modularity, notations, choice of names, and maintenance, and most developments in both areas happen through large team collaborations. Proof assistants also have the advantage of being interactive, Coquand said. He cited an analogy between proof assistants and video games: “The main advantage of a proof assistant is that it gives the students immediate feedback. Incorrect proofs are rejected right away. Such rejections are an insult and challenge for motivated students. They will persist in their struggle to convince the machine” (Nipkow 2012). Coquand demonstrated the use of an interactive proof assistant by playing Kevin Buzzard and Mohammad Pedramfar’s Natural Number Game (https://www.ma.imperial.ac.uk/~buzzard/xena/natural_number_game, accessed June 27, 2023), which uses the Lean theorem prover. He noted its similarity to a video game in its level design and walked the workshop participants through the interactions with the computer for one level. He remarked that AI has become exceptional at playing some games, so the “game” of proving mathematical theorems could be a natural area to follow.
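In the spirit of a Natural Number Game level, an interactive proof is a short dialogue with the system: each tactic transforms the goal state until nothing is left to prove. The snippet below is an illustrative Lean 4 example, not taken from the game itself (which ships its own curated lemmas and tactics):

```lean
-- Goal: a + b + 0 = a + b.
-- Rewriting with Nat.add_zero (n + 0 = n) reduces the goal to
-- a + b = a + b, which Lean closes by reflexivity.
example (a b : Nat) : a + b + 0 = a + b := by
  rw [Nat.add_zero]
```

After entering the `rw` tactic, the proof assistant immediately updates the displayed goal state; an incorrect tactic would be rejected on the spot, which is exactly the video-game-style feedback loop Coquand described.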
Interactive Theorem Proving

Commelin stated that interactive theorem provers (ITPs) form a rigorous and precise platform that can serve several functions. He broke down interactive theorem proving into three pieces: (1) formal verification, in which proofs are mechanically verified using a small set of rules, as Coquand discussed; (2) proof automation, in which the computer assists verification through neural or symbolic methods; and (3) libraries, which are necessary to underpin these large mathematical projects. He remarked that libraries are often overlooked but can be a great tool for knowledge

management, providing a platform to search, organize, and preserve mathematical knowledge. Referencing the Liquid Tensor Experiment (https://github.com/leanprover-community/lean-liquid, accessed June 27, 2023), in which users formally verified a complex proof written by Dustin Clausen and Peter Scholze, Commelin showed workshop participants an example using the Lean theorem prover. The theorem prover contains an input pane where code is written, and a section with the goal state showing the current assumptions in place and what needs to be proved. The code imports prerequisites from the mathlib library (https://github.com/leanprover-community/mathlib), asserts some global assumptions, and then states the theorem. Within the theorem statement are tactics—sets of instructions for the proof assistant—that build the proof. Enumerating some desired features of ITPs, Commelin described the perspectives of readers, authors, and maintainers. For readers, he said that formal proofs should be readable, with good notation and documentation, as well as explorable and interactive. For example, readers want to be able to identify with ease the main idea of a proof or assumptions needed in order to apply the result. He pointed workshop participants toward a tool developed by Patrick Massot and Kyle Miller (tinyurl.com/LeanIpam, accessed June 27, 2023) with interactive elements that allow users to hover over statements and see the current Lean proof state in ordinary language, among other functionalities. For authors, Commelin asserted that formal proofs should be flexible (in terms of foundations, input, and output), powerful, and expressive. Authors want a large library of preliminary results and automation that is fast, effective, and integrated into the workflow. The ideal scenario would allow an author to sketch the main idea of a proof and leave the rest to automation. However, he noted that most of these desired features do not yet exist. Finally, for maintainers of libraries, he said that formal proofs should be robust, deterministic, and fast.
Maintainers desire tools to refactor large libraries and automation to fix broken proofs, as well as tools like linters and tests that can guarantee quality of both the contributions and the library. He pointed out the tension among what readers, authors, and maintainers value in an ITP. Turning to a discussion of the state of the art of automation and ITPs, Commelin gave a few examples of ways automation is or could be applied to ITPs. He named decision problems, explaining that some fragments of mathematics, such as Euclidean geometry problems, are decidable and can be solved by certain algorithms. Another opportunity

to use automation for ITPs comes from computer algebra. Connections between computer algebra systems and ITPs are emerging, but because a systematic protocol or framework linking the two does not yet exist, Commelin highlighted this as an area for further research. Satisfiability problem (SAT) and satisfiability modulo theories (SMT, which is a generalization of SAT) solvers can also be connected to ITPs. One drawback is that SAT/SMT solvers can only apply to a specific type of mathematics, and it is not clear how to generalize; another is that proofs from these solvers can be enormous, and it is nontrivial to reconstruct them in ITPs. Finally, neural AI may be applied to ITPs. He cited an example in GPT-f producing a small contribution to mathlib and expressed that promising developments are occurring in the area. However, some disadvantages to using neural AI tools in this way are that there is not yet integrated tooling with ITPs; they provide slow, hit-or-miss responses; and they depend on enormous computational power beyond that of consumer hardware, which raises sociological and ecological questions. In closing, Commelin explored ways that AI and ITPs could enhance each other in the future. He imagined that neural and symbolic automation could help ITPs with proof discovery, explanation, and repair, as well as library design. Conversely, ITP libraries could provide large sets of training data. Intermediate goal states, proof terms, and tactic scripts could be useful for the AI that is often only trained on polished end results. He encouraged AI researchers to explore Lean 4 in particular because it exposes an enormous amount of internal structure. Researchers could begin to fine-tune LLMs in a feedback loop with ITPs, teaching the models error recovery in a process similar to how humans learn.
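The decision procedures Commelin mentioned can be illustrated in miniature: propositional satisfiability is decidable by exhaustive search, the (hopelessly inefficient but correct) baseline that real SAT solvers improve on. The CNF encoding below is a common convention, and the function name is illustrative, not tied to any particular ITP integration.

```python
import itertools

def satisfiable(clauses, n_vars):
    # A CNF formula is a list of clauses; each clause is a list of literals,
    # where literal k means variable k and -k means its negation.
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True  # this assignment satisfies every clause
    return False

# (x1 or x2) and (not x1) is satisfiable (take x1 = False, x2 = True);
# adding the clause (not x2) makes it unsatisfiable.
```

A production solver replaces the exhaustive loop with conflict-driven clause learning, and, as noted above, the certificates such solvers emit can be enormous, which is what makes reconstructing them inside an ITP nontrivial.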
Commelin concluded his presentation with an open challenge to create an AI that selects “beautiful” or “important” results from mathlib, suggesting that this is intimately connected to proof discovery and the ways AI can provide meaningful assistance in mathematical research.

Mathematics Versus Computer Science Proofs

Offering a perspective from computer science, Morrisett expressed that one of the main challenges today in software security is ensuring that there are no bugs that can be exploited. The hope in software verification is that proof assistants can check correctness, safety, and security in the massive amounts of code that cannot be checked properly by humans. Presenting a case from several years ago, Morrisett described an instance in which Google aimed to provide a service allowing users to run native machine code in the Chrome web browser. Intending to sandbox the code (i.e., prevent the code from reading or writing outside a given data segment) to ensure integrity and security, Google developed a

software fault isolation tool called Native Client. The compiler was modified to generate self-policing code that would ensure the code was following the sandbox security policy, and a checker was built to ensure that the machine code was truly self-policing. The checker was roughly 600 lines of code—much simpler, smaller, and more easily verifiable than the compiler. However, when Google ran a competition, numerous teams found bugs in the checker’s code. At this point, Morrisett and some of his students decided to build a formally verified checker for the software fault isolation tool. Morrisett noted that constructing code and its proof of correctness together is often easier than verifying code after the fact. With this in mind, he and his students built a completely new checker that had about 80 lines of code. With this shorter length, they were able to prove the checker’s correctness. However, he remarked that to prove correctness completely in a case like this, one needs a model for how to run the code on a machine. Because machines like the Intel x86 have thousands of instructions, which are often ill-specified, much of the project’s effort went toward building and validating this model to reason about the execution of the checker’s code. He noted that this exemplifies a difference between proofs in computer science and mathematics—in computer science, proofs can be straightforward but extremely tedious, with many cases. Thus, automation played a key role in this project. Furthermore, the main challenge often lies not in the proof itself but in modeling the environment connecting to the real world. He remarked that the speed of automation is vital for engaging with the proof in a meaningful way. As another example, Morrisett presented the Defense Advanced Research Projects Agency’s (DARPA’s) High-Assurance Cyber Military Systems program that sought to develop a secure software stack for drones. 
Researchers built a clean slate software stack that included an seL4 software system and some additional features. They proved systemwide properties including that the system was memory-safe and ignored malformed and nonauthenticated messages. Although this program was a great success, Morrisett enumerated some challenges that remain: (1) The seL4 kernel is simple but still took 20 person-years to prove its correctness; (2) DARPA did not meet all safety and security requirements, just those deemed most important; and (3) models need to be faithful to the environment, and the software running on the drone needs to be verified. Morrisett stressed that in software security, the value in formalization comes from the machine’s ability to audit proofs with an enormous amount of cases. Additionally, unlike in mathematics, software verification requires constructing and validating models, and constructing decision procedures to scale. He concluded with a “wish list,” pointing to areas where future work may be helpful. For example, a co-pilot for proof

assistants would be useful, including support for synthesis and for summarizing and navigating large proofs. He remarked that he finds himself reproving small data structures and algorithms where a library could be used, but at the moment, finding an existing library is more work than reconstructing a proof himself. Tools for generating real-world models and testing infrastructure to validate the models would also be useful.

Discussion

Ringer asked Commelin how one verifies, in practice, that the formal statement of a definition or theorem in a library corresponds to the English statement. Commelin replied that for a small proof, one would check by hand that each definition matched the English statement. However, for the Liquid Tensor Experiment, this was impossible because of the recursive nature of definitions; the researchers instead used a form of abductive reasoning and provided many examples of formalized statements corresponding to English statements, establishing reasonable confidence. He referenced a blog post that expounds on this topic (see Topaz 2022). Noting that all three presenters have substantial experience writing formal proofs, Ringer asked about the most difficult part of writing a formal proof development, and how better automation could help. Commelin named designing the global structure and hierarchy of the project overall, particularly for large experiments like the Liquid Tensor Experiment. Proof repair is another difficult part of these projects due to its tediousness, and he envisioned LLMs being useful in that area. Morrisett described writing definitions as very difficult, which Commelin and Coquand both seconded. Ringer followed up by inquiring about the most useful qualities for practical proof automation. Commelin emphasized that integration into workflows and user experience design are crucial, and Morrisett expressed frustration with the lack of integration between decision procedures.
Commelin noted that in Lean, writing tactics to combine other tactics is relatively easy, allowing for smoother integration and making Lean more appealing as a proof assistant.

REFERENCES

Aouchiche, M., and P. Hansen. 2016. "Proximity, Remoteness and Distance Eigenvalues of a Graph." Discrete Applied Mathematics 213:17–25.

Brualdi, R.A., and L. Cao. 2020. "Pattern-Avoiding (0, 1)-Matrices." arXiv preprint arXiv:2005.00379.

Charton, F. 2021. "Linear Algebra with Transformers." arXiv preprint arXiv:2112.01898.

Charton, F. 2022. "What Is My Math Transformer Doing?—Three Results on Interpretability and Generalization." arXiv preprint arXiv:2211.00170.

Copyright National Academy of Sciences. All rights reserved.

Collins, K.L. 1989. "On a Conjecture of Graham and Lovász About Distance Matrices." Discrete Applied Mathematics 25(1–2):27–35.

Feit, W., and J. Thompson. 1963. "Solvability of Groups of Odd Order." Pacific Journal of Mathematics 13(3):775–1029. http://dx.doi.org/10.2140/pjm.1963.13.775.

Lample, G., and F. Charton. 2019. "Deep Learning for Symbolic Mathematics." arXiv preprint arXiv:1912.01412.

Nipkow, T. 2012. "Teaching Semantics with a Proof Assistant: No More LSD Trip Proofs." Pp. 24–38 in Verification, Model Checking, and Abstract Interpretation: 13th International Conference, VMCAI 2012, Philadelphia, PA, USA, January 22–24, 2012, Proceedings 13. Berlin, Germany: Springer Berlin Heidelberg.

Thomason, A. 1989. "A Disproof of a Conjecture of Erdős in Ramsey Theory." Journal of the London Mathematical Society 2(2):246–255.

Topaz, A. 2022. "Definitions in the Liquid Tensor Experiment." Lean Community Blog. October 14. https://leanprover-community.github.io/blog/posts/lte-examples.

4 Current Challenges and Barriers

The workshop's second session focused on challenges and barriers to the adoption of artificial intelligence (AI) for mathematical reasoning.

DATA FOR ARTIFICIAL INTELLIGENCE AND MATHEMATICS: CHALLENGES AND NEW FRONTIERS

Sean Welleck, University of Washington, discussed the development of datasets specific to the mathematical sciences. He focused on the intersection of language models (LMs) and mathematics, noting a recent example of the large language model (LLM) Minerva's success in producing a correct solution to a question from Poland's National Math Exam (Lewkowycz et al. 2022). Welleck delineated a spectrum of formality for mathematical tasks, ranging from formal theorem proving to free-form conversation, which is more informal. He began explaining the usefulness of LMs by exploring their potential in the formal area surrounding proof assistants. He explained that LMs are often trained on massive amounts of general data, but small amounts of expert data can still be impactful. miniF2F is a common benchmark for evaluating LMs for formal theorem proving (Zheng et al. 2021).1 It consists of 488 problem statements, drawn mainly from mathematics competitions, written in four different proof assistants. Relative to the size of general web data, this number of problem statements is small, but he asserted that miniF2F has been an effective way to measure and drive progress in neural theorem proving.

1 The GitHub repository for miniF2F can be found at https://github.com/facebookresearch/miniF2F, accessed June 29, 2023.

Welleck next provided an overview of draft, sketch, and prove (DSP), a new method that employed miniF2F in its development (Jiang et al. 2022). It uses an LM to translate an informal proof into a sketch that guides a formal prover. From a data perspective, the project involved two components: (1) extending miniF2F's set of formal problem statements to include the corresponding informal LaTeX statements and (2) writing 20 different examples demonstrating the translation from an informal proof to a sketch.

One challenge of working with these expertly informed, small datasets is ensuring that problems from all kinds of domains are represented. ProofNet is another recent benchmark that draws from textbooks and aims to cover undergraduate-level mathematics (Azerbayev et al. 2023). Welleck noted that these kinds of projects are valuable in broadening the scope of data, and he added that interdisciplinary collaboration is crucial for expanding this work.

Welleck stated that evaluating correctness for informal mathematical tasks is more difficult. This issue is vital for LMs, which generally do not guarantee correctness, as he demonstrated with an example of ChatGPT incorrectly answering and reasoning about an algebraic inequality problem. Expert evaluation, or having experts interact with these systems, can be useful in these cases. For example, Welleck presented NaturalProver, an LM trained on theorems and partially completed proofs in order to generate a next step in a given proof (Welleck et al. 2022). He and his team asked experts to evaluate proof steps based on correctness and usefulness and to identify errors, resulting in a dataset of annotated proof steps.
This project revealed that the models were sometimes capable of providing correct and useful next steps, and it exposed the models' areas of weakness in derivations and long proofs. Other groups are continuing this kind of work (see, e.g., Collins et al. 2023).

Welleck described another way to build datasets: transforming problems that are difficult to evaluate into easier problems. For instance, LILA unified problem-solving datasets and transformed problems to be executable as Python programs (Mishra et al. 2022). He summarized that moving forward, it may be useful to continue exploring the interactions of LMs with humans and other computational tools, and to use these interactions and resulting feedback to expand the models' coverage and reliability.

Finally, Welleck remarked that many LM-based tools either depend on closed data or are closed themselves. This can be a barrier to understanding data and ultimately to developing these models. Math-LM is an ongoing, open project with EleutherAI working to build a large corpus, model suite, and evaluation harness for AI and mathematics.2 He stressed that open research and open data are vital to enabling new science and understanding.

Discussion

Kavitha Srinivas, IBM Research, moderated a discussion with Welleck. Srinivas began by clarifying that the success of the DSP method in training on small amounts of data partly hinged on leveraging the LM in combination with symbolic tools. She then asked about the issue of generalization, arising from both the scarcity and the particulars of the distribution of data for mathematical problems. Welleck responded that external tools could be helpful in augmenting LMs. For example, for proofs involving multiplication, a model may need to train on extensive amounts of data to learn how to multiply reliably, but multiplication could instead be offloaded to an external tool when applying the model.

Referencing the LILA project, Srinivas wondered how incorporating external tools by having a system output a Python program compares to the Toolformer approach. Welleck specified that the LILA project focuses on the dataset, collecting data into a Python form, whereas Toolformer trains a model to automatically use different tools, like calling Python functions or a retrieval mechanism.

Srinivas and Welleck discussed the interaction between LMs and interactive theorem provers. Welleck expressed that LMs excel in translating between informal and formal expressions, whereas a traditional tool might have more difficulty generating natural language. Conversely, mapping expressions so that they are checkable by proof assistants offloads the verification portion of the task, for which LMs are not well equipped.

Given that proof strategies differ from one another in different areas, Srinivas asked if it is more effective for a dataset to be focused only on a particular area, like either algebra or calculus. Welleck expressed that both general and targeted datasets are useful.
Datasets with very high coverage, however, can make it difficult to identify where a model is failing. Expanding on the issue of coverage, he specified that datasets in his current work aim to cover areas including elementary-level mathematical word problems, informal theorem proving, formal theorem proving, and math-related co-generation tasks.
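The tool-offloading idea raised in this discussion can be sketched in a few lines. The sketch below is purely illustrative and not drawn from any system presented at the workshop: the `mul(a, b)` tool-call syntax and both function names are invented for this example. The point is only that exact arithmetic is delegated to a tool rather than trusted to a model's generated text.

```python
# Illustrative sketch of offloading arithmetic from a language model to an
# external tool. The tool-call syntax and function names are invented here.
import re

def call_tool(expression: str) -> int:
    """Exactly evaluate a multiplication request such as 'mul(123, 456)'."""
    match = re.fullmatch(r"mul\((\d+),\s*(\d+)\)", expression)
    if match is None:
        raise ValueError(f"unsupported tool call: {expression!r}")
    return int(match.group(1)) * int(match.group(2))

def resolve_tool_calls(model_output: str) -> str:
    """Replace each embedded tool call in a model draft with its exact value."""
    return re.sub(r"mul\(\d+,\s*\d+\)",
                  lambda m: str(call_tool(m.group(0))),
                  model_output)

# A hypothetical model draft that defers the multiplication to the tool.
draft = "The product is mul(1234, 5678), so the bound holds."
print(resolve_tool_calls(draft))
# The product is 7006652, so the bound holds.
```

The model never needs to learn reliable multi-digit multiplication; it only needs to learn when to emit a tool call.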

2 The GitHub repository for the Math-LM project can be found at https://github.com/EleutherAI/math-lm, accessed June 29, 2023.

BUILDING AN INTERDISCIPLINARY COMMUNITY

The session on building an interdisciplinary community highlighted speakers Jeremy Avigad, Carnegie Mellon University, and Alhussein Fawzi, Google DeepMind. Their presentations were followed by a discussion moderated by Heather Macbeth, Fordham University.

Machine Learning, Formal Methods, and Mathematics

Avigad outlined formal methods in mathematics and the benefits and challenges of collaboration between mathematics and computer science. Before introducing formal methods, Avigad distinguished between two fundamentally different approaches in AI: symbolic methods and neural methods. Symbolic methods are precise and exact, based on logic-based representations, and have explicit rules of inference; neural methods like machine learning (ML) are probabilistic and approximate, based on distributed representations, and require large quantities of data. He stressed that mathematicians need both approaches. Neural methods can reveal patterns and connections, while symbolic methods can help generate precise answers.

Avigad then distinguished between mathematical and empirical knowledge as two fundamentally different types of knowledge. Echoing the distinction between symbolic and neural methods, mathematical knowledge is precise and exact, while empirical knowledge is imprecise and more probabilistic. The field of mathematics provides a precise way to discuss imprecise knowledge. For example, a mathematical model can be built from empirical data, and though the model may only be an approximation, it can be used to reason precisely about evidence and implications, and this reasoning can be useful for considering consequences of actions, deliberating, and planning.
Avigad described formal methods as symbolic AI methods, defining them as "a body of logic-based methods used in computer science to write specifications for hardware, software, protocols, and so on, and verify that artifacts meet their specifications." Originating in computer science, these methods can be useful in mathematics as well, namely for writing mathematical statements, definitions, and theorems, and then proving theorems and verifying the proofs' correctness.

Proof assistants, which are built on formal methods in mathematics, allow users to write mathematical proofs in such a way that they can be processed, verified, shared, and searched mechanically. He displayed an example of the Lean theorem prover, likening it to a word processor for mathematics. It can interactively identify mistakes as one makes them, and it also has access to a library of mathematical knowledge. This technology can be used to verify proofs, correct mistakes, gain insight, refactor proofs and libraries, communicate, collaborate, and teach, among many other uses. Avigad referred workshop participants to several resources that delve further into how this technology can help mathematicians.3

He asserted that applying ML and neural methods to mathematics is a newer frontier, and recent work has used ML in conjunction with conventional interactive proof assistants. DSP, as discussed by Welleck, is one example that combines neural and symbolic methods (Jiang et al. 2022).

Turning to the topic of collaboration, Avigad remarked that digital technologies can bring together the two disparate fields of mathematics and computer science. In general, mathematicians enjoy solving hard problems, finding deep patterns and connections, and developing abstractions, whereas computer scientists enjoy implementing complex systems, finding optimizations, and making systems more reliable and robust. He underscored that these communities need one another. Digital technologies like proof assistants provide new platforms for researchers to cooperate and collaborate. The Liquid Tensor Experiment exemplifies this collaboration: formalization was kept in a shared online repository; participants followed an informal blueprint with links to the repository; participants communicated through the chat software Zulip; and a proof assistant ensured that pieces fit together, so each participant could work locally on individual subproblems. Another example of collaboration is the port of the large formal library mathlib, which required millions of lines of formal proof to be translated from Lean 3 to Lean 4. Mario Carneiro wrote an automatic translator for some of this work, but user intervention was still necessary to repair some translations manually and sometimes to rewrite lines entirely. Like the Liquid Tensor Experiment, the port of mathlib has online instructions to guide volunteers, who stay in constant communication over Zulip, as well as dashboards to monitor progress.
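As a small, purely illustrative example of this interaction (not one shown at the workshop), the following Lean 4 fragment states and proves a simple identity. The proof term is checked mechanically, and an incorrect lemma name or a logical gap would be flagged as soon as it is typed, much like a spell-checker in a word processor.

```lean
-- Illustrative Lean 4 fragment: the checker verifies that the proof
-- term really establishes the stated theorem.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A later statement can reuse the library and earlier results;
-- `rw` rewrites the goal using the theorem proved above.
example (a b c : Nat) : (a + b) + c = (b + a) + c := by
  rw [add_comm_example a b]
```

Because every step is verified, a large development can be split among many contributors while the proof assistant guarantees that the pieces still fit together.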
Avigad enumerated a few institutional challenges to advancing AI technology for mathematical research. Neither industrial research that is driven by corporate interests nor academic research that is focused on specialization is perfectly aligned with developing such technologies. Academia in particular is governed by traditional methods of assessment. Both mathematicians and computer scientists invest significant time and energy into developing these technologies before they are recognized by traditional reward mechanisms. Avigad therefore advocated for new incentive structures for activities like ongoing development and maintenance of systems, libraries, or web pages and collaboration platforms.

Avigad concluded that advances in AI for mathematics will require input from distinct communities, including computer scientists working in formal methods, computer scientists working in ML, and mathematicians working to apply AI to mathematics. Progress will come from collaboration, and advances in technology both necessitate and provide new platforms for collaboration. He suggested that mechanisms to assess new kinds of mathematical contributions are required for advancement in the field, and better institutional support should be built for collaborative, cross-disciplinary work.

3 Avigad's resources include his preprint, J. Avigad, "Mathematics and the Formal Turn," https://www.andrew.cmu.edu/user/avigad/Papers/formal_turn.pdf, as well as presentations found at Isaac Newton Institute, 2017, "Big Conjectures," video, https://www.newton.ac.uk/seminar/21474; Leanprover Community, 2020, "Sébastien Gouëzel: On a Mathematician's Attempts to Formalize His Own Research in Proof Assistants," video, https://www.youtube.com/watch?v=sVRC1kuAR7Q; Institut des Hautes Études Scientifiques, 2022, "Patrick Massot: Why Explain Mathematics to Computers?," video, https://www.youtube.com/watch?v=1iqlhJ1-T3A; International Mathematical Union, 2022, "Kevin Buzzard: The Rise of Formalism in Mathematics," video, https://www.youtube.com/watch?v=SEID4XYFN7o; The Fields Institute, 2022, "Abstract Formalities," video, http://www.fields.utoronto.ca/talks/Abstract-Formalities; University of California, Los Angeles, "Liquid Tensor Experiment," video, http://www.ipam.ucla.edu/abstract/?tid=19428; and "Algorithm and Abstraction in Formal Mathematics," video, https://www.ipam.ucla.edu/abstract/?tid=17900, all accessed August 2, 2023.

Challenges and Ingredients for Solving Mathematical Problems with Artificial Intelligence

Fawzi discussed using AI to solve mathematical problems from his perspective as an ML researcher. Recent breakthroughs in image classification, captioning, and generation; speech recognition; and game playing have all relied on ML. He highlighted the challenges of applying AI to mathematics specifically and suggested some best practices for collaboration to this end.

The first challenge Fawzi presented was the lack of training data in mathematics. ML models rely on large amounts of data, but data in mathematics are limited. There are few formalized theorems that a model could use for training compared to the vast number of images available for training an image classifier, for example.
Synthetic data can be used instead, but this approach produces its own challenges: the distribution of synthetic data may be quite different from the target, and the model will not be able to generalize. He suggested, however, that reinforcement learning could be useful in this situation.

A second challenge arises from the different goals of ML research and mathematics research. For example, ML research generally focuses on average performance across many data points, while mathematics research is usually concerned with finding the correct result in one instance (e.g., proving one theorem). A third challenge Fawzi identified surrounds the tension between performance and interpretability. Mathematicians are interested in interpretable results and in understanding how solutions are found, not just in the solutions themselves. However, most major ML breakthroughs have been driven by optimizing performance, and he noted that it is difficult to interpret the decisions made by ML models.

Fawzi enumerated several best practices for collaboration, beginning with an explanation of the difference between partnership and collaboration. A partnership between a mathematician and an ML researcher occurs when the mathematician formulates a problem, the ML researcher runs the experiment and returns the results, and the process repeats. Instead of such a partnership, he advocated for collaboration, in which the ML researcher and the mathematician each somewhat understand the other's research and role, and they work together from the beginning. Collaboration is particularly important when using ML for mathematical problems because ML models require a degree of tailoring, and the ML researcher needs an understanding of the mathematical domain to refine the model. He stressed the value of breaking the communication barrier to facilitate effective collaboration. Time taken to understand the other party's domain, especially at the beginning of a project, is time well spent.

The second best practice Fawzi identified was to establish metrics for progress early and to set up a leaderboard, which he deemed necessary for iterating over ML models. Metrics allow ML researchers to track and guide progress, and models cannot improve without this roadmap to define improvement. A leaderboard, especially one that connects directly to the experiments and updates automatically, encourages team members to test ideas and score them quickly, leading to more rapid improvement.

Fawzi's final best practice was to build on existing work. He explained that researchers could build on top of theoretical frameworks, successful existing ML models, or traditional computational approaches. ML researchers tend to work from scratch, but he asserted that building on existing knowledge is a better practice.
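The metric-plus-leaderboard practice described above can be sketched minimally as follows. Everything in this sketch is a toy stand-in invented for illustration: the benchmark, the two "solvers," and the pass-rate metric are not real systems from the workshop.

```python
# Toy sketch of the "metrics first" practice: score candidate solvers on a
# fixed benchmark so every idea can be compared on the same number.

def pass_rate(solver, benchmark) -> float:
    """Fraction of benchmark problems the solver answers correctly."""
    solved = sum(1 for problem, answer in benchmark if solver(problem) == answer)
    return solved / len(benchmark)

# Toy benchmark of (problem, expected answer) pairs: add two numbers.
benchmark = [((2, 3), 5), ((10, 7), 17), ((0, 0), 0), ((4, 4), 8)]

baseline = lambda p: p[0] + p[1]               # a correct reference solver
buggy = lambda p: p[0] + p[1] + (p[0] == 0)    # fails when the first term is 0

# A leaderboard is just the solvers sorted by the shared metric.
leaderboard = sorted(
    [("baseline", pass_rate(baseline, benchmark)),
     ("buggy", pass_rate(buggy, benchmark))],
    key=lambda entry: entry[1],
    reverse=True,
)
for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
# baseline: 1.00
# buggy: 0.75
```

In practice the benchmark would be held fixed and the leaderboard regenerated automatically after each experiment, so that every new idea is scored against the same yardstick.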
Setting a project on solid foundations by establishing good collaboration, choosing useful metrics, and identifying an existing base of work to begin from can lead to greater success.

Discussion

Macbeth led a discussion on collaboration between mathematics and computer science and how the domains may bolster one another. Noticing that both speakers' presentations mentioned differences between mathematicians and computer scientists, Macbeth prompted them to discuss further the cultural differences they perceived. Avigad and Fawzi each affirmed the other's depiction of the fields. Avigad reiterated that mathematicians tend to focus more on developing theory while computer scientists are interested in efficiently solving practical problems, but he pointed out that the two approaches are not disjoint. There can be a deep respect for the different skills that each field offers. Fawzi agreed and highlighted the engineering work that is fundamental to ML.

Workshop participants posed several questions about benchmarks in each of the speakers' work. Fawzi emphasized that metrics are crucial to the process of developing ML models, and benchmarks should be determined carefully at the beginning of a project. He cautioned that it is easy to choose ill-fitting benchmarks without prior knowledge of the domain; collaboration between ML and domain experts is therefore key to choosing useful benchmarks. Avigad explained that benchmarks are one area in which ML and mathematics differ; ML generally relies on benchmarks, while in mathematical inquiry, the questions may not yet be clear or new questions may continually arise. Evaluating progress in the two domains can be quite different, which makes collaborating and combining approaches more useful than relying on only one domain or the other. He also described his work on ProofNet, which gathered problems from undergraduate mathematics textbooks to serve as benchmarks for automatic theorem proving (Azerbayev et al. 2023).

A few questions sparked discussion on how using AI for mathematics may benefit AI more broadly. Avigad speculated that in many cases, practical applications of AI could benefit from mathematical explanations and justification. For example, ChatGPT could explain how to design a bridge, but obtaining mathematical specifications and proof of safety before executing the design would be reassuring. Fawzi added that since interpretability is critical to mathematics, developments in ML for mathematics may advance work on interpretability of ML models overall.
THE ROLE OF INTUITION AND MATHEMATICAL PRACTICE

A session on the role of intuition and mathematical practice highlighted speakers Ursula Martin, University of Oxford, and Stanislas Dehaene, Collège de France. Petros Koumoutsakos, Harvard University, moderated the discussion that followed their presentations.

How Does Mathematics Happen, and What Are the Lessons for Artificial Intelligence and Mathematics?

Martin provided an overview of how social scientists understand the practice of mathematics. She defined the mathematical space as the theorems, conjectures, and accredited mathematical knowledge, together with the infrastructure that supports the practice of mathematics, which comprises the workforce (i.e., researchers, educators, students, and users), employers, publishers, and funding agencies. She observed that mathematical papers are published in essentially the same model that has been used since the 17th century, though the rate of publication has grown in the past few decades, especially in China and India.

To elucidate "where mathematics comes from," Martin offered the example of Fermat's Last Theorem, conjectured in 1637 and proved by Andrew Wiles and Richard Taylor in 1995. She relayed pieces of an interview with Wiles about his process of developing the proof (Wiles 1997). She highlighted how the process, rather than containing singular moments of genius or sudden inspiration, required extensive trial and error. For months or years, researchers do all kinds of work, such as trying computations and studying literature, that is not included in the final paper. Martin stressed that breakthroughs that may happen near the end of this process rely on the long, arduous work beforehand, with the brain working in the background toward a broad understanding of a problem. The advancement of mathematics also relies on researchers' ability to understand and build on existing literature. Furthermore, she asserted that collaboration is crucial; it allows for the co-creation of ideas and the sharing of feedback, and it develops and reinforces the collective memory and culture of mathematics.

Martin described the Polymath Project, which demonstrated a new, collaborative way of developing mathematics instead of employing the traditional publication model. Started by Timothy Gowers in 2009, the Polymath Project invites people to collaborate online to solve a mathematical problem (Gowers 2009). Martin displayed some blog comments from the project to demonstrate how participants communicated: mathematicians engaged in brief exchanges and checked one another's work or presented counterexamples.
She emphasized that one crucial key to success was that leading mathematicians set communication guidelines for the project, such as a rule to be polite and constructive. The first problem ultimately solved by the collaboration, a proof of the density Hales-Jewett theorem, was published in mathematical papers under the pseudonym D.H.J. Polymath.

Martin identified some challenges that the Polymath Project brought forward surrounding (1) the role of the leader, (2) pace and visibility, (3) credit and attribution, and (4) the makeup of participants. To support this type of collaboration, the leader has the demanding job of both following the mathematical conversation and guiding social interactions, including mediating disagreements. Pace and visibility are difficult to balance because the project should encourage new participants to join, even though doing so can be intimidating. Credit and attribution are a clear challenge to the traditional structure of mathematics research that incentivizes individual activity. Finally, online collaboration has the potential to increase the accessibility and diversity of participants, but it does not automatically do so.

Martin presented an analysis of one of the Polymath Project's problems that identified the kinds of contributions participants made throughout the process (Pease and Martin 2012). Only 14 percent of contributions were proof steps; the other 86 percent comprised conjectures, concepts, examples, and other contributions. She remarked that much current work on AI to assist mathematical reasoning studies the use of AI not just for proof but also for conjecture and discovery of concepts, reflecting Pease and Martin's findings.

In closing, Martin speculated that AI may deliver an abundance of mathematics and urged the community to plan for this future. She summarized that traditional mathematical papers present and organize mathematical knowledge; are accredited by proof; are refereed and published; give notions of credit, attribution, and accountability; and are embedded in and guided by the mathematical ecosystem. All of these features are important, she continued, and should be preserved as mathematics evolves. She concluded that mathematics is not just about writing proofs; acknowledging that mathematics lives in a broader context shaped by human factors will better allow stakeholders to take initiative and collaborate to create new mathematical practices, for example, using AI to assist mathematical reasoning.

Can Neural Networks Mimic Human Mathematical Intuitions? Challenges from Cognitive Neuroscience and Brain Imaging

Dehaene provided a psychological and neuroscientific perspective on mathematical understanding and AI. He discussed how well artificial neural networks (NNs) mimic human cognition and learning.
Current NNs successfully model the early stages of vision (e.g., reading) and some aspects of language processing, but they do not perform nearly as well as the human brain at many tasks related to mathematical learning and reasoning. For example, the human brain is able to learn from a very small number of examples, and it forms explicit representations better than NNs do. He focused on how the human brain, unlike NNs, learns from a combination of approximate intuitions and precise, language-like symbolic expressions.

Dehaene explained that even very young children possess the mathematical intuition that untrained artificial NNs lack. Infants in the first year of life react to violations of the laws of arithmetic and probability. The capacity for geometry is similarly ingrained, which he supported by showing examples of art from the Lascaux cave in the form of geometrical drawings. Such numerical and geometrical symbols can be found in all human cultures.

Dehaene asserted that humans possess a unique ability to understand abstractions, presenting a study in which humans predicted the next item in a geometrical sequence (Amalric et al. 2017). Subjects in the study could make correct predictions based on very little information because of geometric intuition. This intuition is a kind of language of thought that recognizes geometric rotations and symmetries and can combine these understandings into a formula that gives the next item in the sequence. This language of mathematics allows one to compress spatial sequences in working memory, thereby recognizing complex sequences that have spatial regularity (Al Roumi et al. 2021; Wang et al. 2019).

Another example compared the abilities of humans and baboons to identify an outlier in a group of quadrilaterals (Sablé-Meyer et al. 2021). Humans varying in age, education, and culture could easily identify a parallelogram in a group of rectangles, for example, but struggled more with groups of irregular shapes. Baboons did not show this geometric regularity effect; they performed equally across various quadrilaterals, regardless of regularity (i.e., regularity in a group did not lead to improved performance as in humans). Dehaene speculated that the intuition of numbers and shapes has the same evolutionary foundations for humans and baboons, but in humans, this ability is tied to a language of thought or mental construction that non-humans do not possess. In the study, a convolutional NN could capture baboon behavior but generally not human behavior; a symbolic model fit human behavior much better. Sablé-Meyer and colleagues believe that humans rely on both an intuitive processing of shapes, like other primates, and a unique symbolic model.

Dehaene turned to a discussion of current AI, presenting a few examples to demonstrate its lack of ability in processing geometrical shapes.
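The compression idea behind this "language of thought" can be caricatured in a few lines of code. The sketch below is invented for this text and is not a model from the presentation: a sequence of positions on an octagon is cheap to describe when its successive steps follow one rule, and expensive when every step differs.

```python
# Toy sketch of sequence compression in a symbolic "language of thought":
# a regular spatial sequence collapses to one instruction, an irregular
# one does not. Invented for illustration; not a model from the workshop.

def step_program_length(sequence) -> int:
    """Cost of describing the sequence by its successive steps (mod 8),
    where a run of identical steps costs one instruction."""
    steps = [(b - a) % 8 for a, b in zip(sequence, sequence[1:])]
    cost, previous = 0, None
    for s in steps:
        if s != previous:       # each new kind of step needs a new instruction
            cost += 1
            previous = s
    return cost

regular = [0, 1, 2, 3, 4, 5, 6, 7]     # "rotate by +1," repeated
irregular = [0, 3, 1, 6, 2, 7, 5, 4]   # no single rule fits

print(step_program_length(regular))    # 1
print(step_program_length(irregular))  # 7
```

Under this toy measure the regular sequence compresses to a single instruction, mirroring the finding that sequences with spatial regularity are far easier for humans to hold in working memory.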
He displayed a picture of several eggs in a refrigerator door and stated that an NN classified it as an image of ping pong balls. It is clear to humans that the eggs depicted are not ping pong balls because they do not have a spherical shape, but the AI was not attuned to the precise geometrical properties that make a sphere. Another example showed an image generator that was unable to create a picture of "a green triangle to the left of a blue circle," instead offering images with a blue triangle within a green triangle or a green triangle within a blue circle, for example (Figure 4-1).

Discussion

Koumoutsakos moderated a discussion between Martin and Dehaene on mathematical intuition, especially spatial intuition, and its relationship to AI. Koumoutsakos shared a workshop participant's question on

Copyright National Academy of Sciences. All rights reserved.

Artificial Intelligence to Assist Mathematical Reasoning: Proceedings of a Workshop


ARTIFICIAL INTELLIGENCE TO ASSIST MATHEMATICAL REASONING

FIGURE 4-1  Images generated in response to the prompt “a green triangle to the left of a blue circle” made by the DALL-E 2 artificial intelligence system. SOURCES: Stanislas Dehaene, Collège de France, presentation to the workshop. Generated with DALL-E 2 with thanks to Théo Desbordes.

the ability of AI not just to verify existing proofs but also to discover new mathematics. Martin and Dehaene both asserted that AI could certainly aid in discovery and conjecture, and indeed it already has. Martin highlighted that the current mathematical knowledge base is so large that it is easy to imagine new mathematics being generated from what already exists. A few questions prompted Dehaene to delve further into the limits of current AI and mathematical reasoning. He explained that NNs are inherently limited by the way data are encoded. This approximate nature may


work for the meanings of words, but the limitations become apparent when one considers the meaning of a square, for example, which has a precise definition that is not captured by an approximation. He observed that other technology exists with the ability to make exact symbolic calculations far beyond human ability. The issue is that this technology is not currently connected to language models, barring a few experiments. The human brain bridges between language and mathematical reasoning, allowing one to go back and forth and use both complementarily; he suggested that perhaps a way to connect these different technologies would allow AI to mimic the brain adequately, mathematical reasoning included.

One workshop participant asked the speakers whether mathematical intuition could be programmed into AI. Martin observed that intuition is a "slippery" term, as one person's intuition is not necessarily the same as another's. She suspected that because the definition of intuition will continually evolve, it is difficult to determine whether mathematical intuition can be programmed into AI. Dehaene wondered whether it is necessary for AI to have mathematical intuition the way humans do. Human brains are limited to the architecture produced by evolution, but AI is free from that constraint and need not mimic the brain's architecture. He suspected that the use of other kinds of architectures may allow AI to work differently and perhaps reach far beyond human ability, comparing this idea to the way an airplane flies quite differently from a bird.

Koumoutsakos asked how researchers could discover such architectures, which would differ radically from human intelligence. Offering AlphaGo Zero as an example, Dehaene remarked that starting from very little and allowing the machine to make discoveries on its own may be promising. Finally, a question about the power of AI prompted Martin to note that in addition to AI training on existing data, AI also trains society, in a sense.
As output from AI grows, she continued, people learn to expect certain text in particular forms (e.g., people expect text in children's books to be simpler than that in books for adults). She wondered how and to what extent the output from AI conditions people's use of language.

CONCENTRATION OF MACHINE LEARNING CAPABILITIES AND OPEN-SOURCE OPTIONS

Stella Biderman, Booz Allen Hamilton and EleutherAI, presented in the next session on the concentration of ML capabilities within a few institutions and the potential for open-source practice. She focused on large-scale generative AI technologies, explaining that her work is primarily on training very large, open-source language models such as GPT-Neo-2.7B, GPT-NeoX-20B, and BLOOM, as well as other kinds of


models like OpenFold, which is an open-source replication of DeepMind's AlphaFold2 that predicts protein structures. Biderman explored the centralization of ML capabilities, specifying that the models themselves as well as funding, political influence and power, and attention from media and researchers are all centralized.

Breaking into LLM research is difficult for several reasons. First, training an LLM is expensive; computing costs alone for a model can range from $1 million to $10 million, and it can cost 10 times more for an organization to enter the field. Second, training a very large LLM requires specialized knowledge that most AI researchers do not have. Biderman estimated that most AI researchers have never trained a model that requires over four graphical processing units (GPUs), and these very large models require hundreds of GPUs. Finally, the GPUs themselves are difficult to obtain. A small number of companies control the supply of cutting-edge GPUs, and most groups cannot get access to them. For example, a waiting list has existed for the H100, NVIDIA's most recent GPU, since before it was released. Each of these factors reinforces centralization of ML capabilities.

Biderman stated that centralization influences what researchers can access. A spectrum of levels of access to large-scale AI technologies exists, ranging from open source to privately held. Open source is the most permissive (e.g., T5, GPT-NeoX-20B, Pythia, and Falcon), followed by limited use, which typically constrains commercial use (e.g., OPT, BLOOM, and LLaMA). Application programming interface (API)-only is more restrictive and often offers partial use of a model for a price (e.g., GPT-3, GPT-4, Bard, and Bing Assistant). Privately held is the most restrictive level (e.g., PaLM and Megatron-DeepSpeed NLG). Most powerful models are privately held or were originally privately held and later released as API-only.
Biderman asserted that research conducted with privately held models creates a bottleneck for AI research for the larger scientific community, because research with these models does not have practical applications for most people due to lack of access to the models. Even open-source models are highly concentrated within a few companies, Biderman recognized. Eight companies to date have trained models with over 13 billion parameters, seven of which are for-profit companies. Thus, the release of open-source LLMs as a way to democratize and increase access to these technologies is still influenced by the needs of a small number of organizations.

Biderman enumerated several reasons why the concentration of ML capabilities can be considered a significant issue. Ethics—a belief that scientific research and tools should be publicly released—and finances—the costs of tools built on commercial APIs—may both affect a researcher's understanding of the problem. This concentration also creates limitations


in applications of LLMs; the small number of individuals creating and training these models may lack diversity and cause disparate impact based on their priorities. For example, almost all state-of-the-art language models that are trained on a specific language use English texts, and almost all others use Chinese texts. It is extremely difficult for someone who does not speak English or Chinese to access a usable model of this kind. In the field of AI for mathematical reasoning, plenty of exciting research is being done—for example, Baldur (First et al. 2023) and Minerva (Lewkowycz et al. 2022)—but Biderman said there is no reason to believe that these models will ever be accessible to mathematicians except the few researchers working on these projects.

Recounting the story of OpenFold and DeepMind's AlphaFold2, Biderman described how open-source work can influence the larger research landscape. As open-source capabilities approached the state of the art, DeepMind decided to license AlphaFold2 for non-commercial use months before OpenFold's release. Biologists and chemists who could not previously access the model were finally able to use it for scientific research. In closing, Biderman asserted that a thriving open-source ecosystem for LLMs exists now, and she shared several resources for the different steps of developing LLMs (Figure 4-2).

Discussion

Terence Tao, University of California, Los Angeles, moderated the discussion session, beginning with a question about priorities for large open-source AI efforts. Biderman replied that building an interested community is essential for all open-source work, both at the beginning and throughout the project. Citing her work with EleutherAI, she described how hundreds of users participate daily, submitting code and discussing AI. The community's assistance and support are critical for maintenance

FIGURE 4-2 Various open-source resources for developing large language models: datasets (C4, The Pile, RedPajama); training (Megatron-DeepSpeed, GPT-NeoX, MetaSeq); finetuning (Transformers Library, PEFT, trlX); evaluation (LM Evaluation Harness, BIG Bench); and deployment (HuggingFace Hub, AutoGPT, DeepSpeed Inference). SOURCE: Stella Biderman, presentation to the workshop.


and keeping up with the rapidly changing field; however, she noted that managing volunteers is a challenge in and of itself.

Tao asked what cultural changes are needed to encourage collaboration in the open-source community and how the projects can avoid being co-opted by other interests. Biderman stressed that a huge amount of collaboration already exists in the open-source AI community, sharing tools, advice, and research projects. The communication breakdown often occurs between those engaged in open-source research efforts and for-profit organizations that are less interested in open-source work. She invited a broader range of people to join open-source AI efforts, emphasizing that much of the infrastructure and work underpinning large-scale AI systems is standard software engineering and data science—one does not need to be an AI expert to contribute.

A workshop participant asked whether there are LLMs that have been trained on mathematical material and are widely available to mathematicians. Biderman responded that it is common to train LLMs on mathematical material; for example, standard LLMs use data from arXiv and PubMed. In particular, she named Goat,4 which is a fine-tuned version of a model called LLaMA that focuses on correct arithmetic computations. Another example is Meta's Galactica,5 which trained exclusively on scientific corpora and was advertised as an assistant for writing scientific texts and answering scientific questions.

In a final question, Tao inquired about the role of the U.S. government in encouraging open-source AI research, and Biderman asserted that it would be most helpful for the government to create and release a dataset that the large-scale AI community could use for training. The community faces a "legally awkward situation" in which large organizations train LLMs on anything that can be downloaded from the Internet and are comfortable that they will not be punished because this is practiced so widely.
For smaller organizations, especially those without many lawyers, the legal uncertainty provides a barrier to doing any work that involves transparency about data sourcing. No publicly available dataset exists that is large enough to train an LLM and is known to be fully license compliant. She added that another role for the government could be to supply additional GPUs, along with support for researchers on how to train on large numbers of GPUs.

4 The GitHub repository for Goat is https://github.com/liutiedong/goat, accessed July 24, 2023.
5 The Hugging Face page for Galactica is https://huggingface.co/facebook/galactica-1.3b, accessed July 24, 2023.


MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING

Tao moderated the next session on the mathematical foundations of ML, featuring Morgane Austern, Harvard University, and Rebecca Willett, University of Chicago.

Mathematical Foundations of Machine Learning: Uncertainty Quantification and Some Mysteries

Observing that ML algorithms are increasingly used for decision making in critical settings, such as self-driving cars and personalized medicine, and that errors can have severe consequences, Austern stated that building trustworthy AI is essential. A first step in this direction can be made by reliably quantifying uncertainty. Uncertainty quantification (UQ) in ML is the process of quantifying the uncertainty of an ML algorithm's prediction. Closely related is the idea of calibration, in which ML researchers seek to quantify how well the machine estimates the uncertainty of its own predictions.

Austern explored UQ and risk estimation in ML by focusing on classification problems, which are a type of supervised learning problem. In these problems, data are in pairs of observations and labels—for example, data may be pairs of images and the type of animal depicted—and the goal is to train an algorithm to label new observations. After training the algorithm, the first question one can ask regarding risk estimation is how well the model performs on unseen data (i.e., data it was not trained on). Risk can be measured by the probability that the algorithm will misclassify a new observation. She explained that a naïve way to estimate risk could be to study how many examples in the training data were misclassified; however, this systematically underestimates risk. Instead, one of the most common methods used to estimate risk in traditional ML is the K-fold cross-validation method, which splits data into some number of sections K. Each unique section is withheld, and the model is trained on data from all other sections and then tested on the withheld section.
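The K-fold procedure can be sketched in a few lines of Python; the harness and the toy majority-class classifier below are illustrative, not code from the presentation.

```python
import random

def kfold_risk(data, labels, fit, predict, k=5, seed=0):
    """Estimate misclassification risk with K-fold cross-validation.

    Each of the k sections is withheld once; a model is trained on the
    rest and tested on the withheld section.  The risk estimate is the
    overall fraction of misclassified held-out examples.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal sections
    errors = 0
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        model = fit([data[i] for i in tr], [labels[i] for i in tr])
        errors += sum(predict(model, data[i]) != labels[i] for i in fold)
    return errors / len(data)

# Toy classifier (hypothetical): always predict the majority training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

risk = kfold_risk([0] * 20, [1] * 15 + [0] * 5, fit_majority, predict_majority, k=5)
```

In this run, every fold's training set still has a majority of 1-labels, so the estimated risk equals the base rate of the minority class, 0.25; setting k equal to the number of observations yields the leave-one-out variant.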
Risk is estimated by averaging the number of misclassified examples in all K iterations of the process. A data-efficient way to perform this method is by using K equal to the number of observations (Austern and Zhou 2020).

Turning to NNs, Austern noted that modern deep learning (DL) algorithms can achieve extremely high accuracy, but since none can be perfect, calibration is important. She defined a well-calibrated model as one that faithfully reflects its own uncertainty. NN models produce confidence scores as part of their training, but they are notoriously poorly calibrated and overconfident, necessitating more reliable, external quantification of their uncertainty. One technique is to use prediction sets: a model can

Copyright National Academy of Sciences. All rights reserved.

Artificial Intelligence to Assist Mathematical Reasoning: Proceedings of a Workshop

40

ARTIFICIAL INTELLIGENCE TO ASSIST MATHEMATICAL REASONING

return a set of potential labels for an observation, rather than just one, with very high confidence that the correct label is in the set. This can be done even for poorly calibrated or black box algorithms through a method called conformal inference. Calibration is also an important issue for LLMs, Austern stated, one that garners special interest because of their infamous tendency to "hallucinate." Because many of these models are black boxes, confidence is measured through repetitive prompting, which is a kind of sampling. She noted that work in this area is still experimental.

Austern next discussed theoretical understanding of deep NNs. She posited that perhaps mathematical formulas or theorems could be used for UQ, though the theoretical behavior of deep NNs is not well understood. She focused on one "mystery" in particular: generalization. Classical learning theory favors simpler solutions; if a model is too complicated, it tends to overfit the data and perform worse at prediction. DL models, however, follow a different pattern. As model complexity increases, error decreases until a certain threshold at which performance worsens. In this regime, more data hurt the model, unlike in classical learning theory (in which more data always help). Increasing model complexity even more, past this threshold, results in decreasing error once again. Beyond this specific regime, DL models can have billions of parameters and still generalize well. To explain this behavior—called double descent—and these models' ability to generalize, researchers have developed a theory called implicit regularization, drawing on ideas from various fields including free probability, statistical mechanics, physics, and information theory.

In addition to DL in general, Austern described graph representation learning, which is used in ML for scientific discovery, where data often have specific structures, which makes UQ difficult.
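Split conformal prediction, one standard form of the conformal inference described above, can be sketched as follows; the calibration scores and candidate labels here are made up for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: cal_scores[i] is the nonconformity
    score of the *true* label on held-out calibration example i (for a
    classifier, e.g., 1 minus the predicted probability of that label).
    Returns a threshold such that a fresh example's true-label score
    falls below it with probability at least 1 - alpha."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile
    return sorted(cal_scores)[min(rank, n) - 1]

def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is small enough."""
    return {label for label, s in label_scores.items() if s <= threshold}

# 99 calibration scores and one new example with per-label scores.
threshold = conformal_threshold([0.01 * i for i in range(1, 100)], alpha=0.1)
labels = prediction_set({"cat": 0.20, "dog": 0.95, "fish": 0.40}, threshold)
```

The coverage guarantee needs no assumption about how well calibrated the underlying model is, which is what makes the method attractive for black box NNs; the cost is that a poorly calibrated model yields larger, less informative prediction sets.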
She emphasized that this is another area where theoretical foundations are not well understood.

Austern summarized that modern ML algorithms are so poorly understood because researchers lack the mathematical tools needed to study them. Classical learning theory relies on a few powerful, universal probability theorems that do not extend to modern ML methods. Instead, new mathematics will be needed. She concluded by framing the study of theoretical foundations of ML as an opportunity for all mathematicians—whether from ergodic theory, probability theory, functional analysis, optimization, algebra, or another area—to contribute to better understanding of ML algorithms.

Mathematical Foundations of Machine Learning

To highlight the importance of theory, Willett began by stating that investing in ML without understanding mathematical foundations is like


investing in health care without understanding biology. She asserted that understanding foundations will allow for faster, cheaper, more accurate, and more reliable systems.

Willett first shared success stories of mathematical foundations improving ML practice. She explained that ML often requires training data to set parameters of a model (e.g., NN weights), and parameters are updated in an iterative approach based on an optimization algorithm that minimizes errors, or loss. AdaGrad is an innovation that accelerates training and has strong theoretical foundations. Adam, one of the most popular optimization methods for training ML algorithms, is based on the foundation of AdaGrad. Another example is distributed optimization, in which large-scale ML is distributed across multiple machines. She remarked that in early instances of distributed optimization, naïve methods resulted in slower computation with the use of more machines. Researchers studying the mathematical and theoretical foundations discovered a bottleneck in the communication necessary to keep the different machines in sync. Understanding optimization theory led the community to develop approaches, like Hogwild!, that were theoretically grounded methods for asynchronous distributed optimization, eliminating most of the communication bottleneck and accelerating computation.

She noted that data privacy also provides an example of the success of mathematical foundations for practice. The classical approach of aggregating data is a form of privacy that is easy to break, but the more recent innovation of differential privacy, grounded in mathematical foundations, uses random perturbations to better protect privacy.
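The core idea of AdaGrad can be sketched as follows: each coordinate gets its own step size, which shrinks with that coordinate's accumulated squared gradients. The quadratic objective below is a made-up example, purely for illustration.

```python
import math

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update.  accum holds the running sum of squared
    gradients per coordinate; coordinates with a history of large
    gradients automatically take smaller steps."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    w = [wi - lr * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, accum)]
    return w, accum

# Minimize f(w) = w0**2 + 10 * w1**2, whose gradient is (2*w0, 20*w1).
w, accum = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    grad = [2 * w[0], 20 * w[1]]
    w, accum = adagrad_step(w, grad, accum)
```

Despite the factor-of-10 difference in curvature between the two coordinates, the per-coordinate normalization drives both toward the minimum at a similar rate; Adam builds on this idea, adding momentum and an exponentially decaying average of the squared gradients.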
Willett referenced Austern's mention of conformal inference as a method for UQ of ML algorithms with theoretical guarantees, and she shared a final example from medical image reconstruction, in which knowledge from inverse problem theory aided the creation of new network architectures that can reconstruct CT images from raw data (i.e., sinograms) both more accurately and with less training data.

Willett then addressed some open questions and opportunities in the field of ML's theoretical foundations. Returning to the basic definition of an NN, she explained that NNs are essentially functions that have some input data and produce an output, depending on learned weights or parameters. One interpretation of training a network is that it is searching over different functions to find the one that fits best. Therefore, she said that thinking about NNs as functions allows researchers to better consider questions such as (1) how much and what data are required, (2) which models are appropriate and how large they should be, (3) how robustness can be promoted, and (4) whether models will work in new settings.

Other questions for the ML field include those of efficiency and environmental impact, Willett noted. Training and deploying large-scale AI


systems are associated with a large carbon footprint, and water is sometimes required to cool computational systems. However, mathematical foundations may help optimize architectures for efficiency. Concerns about AI perpetuating systematic bias also raise questions for the field, she continued, and investing in mathematical foundations can contribute to the design of enforceable and meaningful regulations and certifications of ML systems.

Willett observed that ML is changing the way scientific research is approached, and it will continue to influence hypothesis generation, experimental design, simulation, and data analysis. She cautioned that using AI tools without understanding mathematical foundations may lead to results that cannot necessarily be trusted—citing concerns of a reproducibility crisis in science for this very reason (Knight 2022)—but the combination of ML and scientific research still carries enormous potential.

She described four ways that ML and scientific research interact. First, ML could help uncover governing physical laws given observations of a system (see Brunton et al. 2016; Lemos et al. 2022; Udrescu and Tegmark 2020). Despite promising research, significant mathematical challenges remain—such as generalizing to higher dimensions or dealing with sparse or noisy data—and developing mathematical foundations could be especially impactful in this area. Second, ML can be used to better design experiments, simulations, and sensors, and she shared the example of self-driving laboratories that adaptively design and conduct experiments. Third, ML can assist in jointly leveraging physical models and experimental or observational data, as in physics-informed ML. Since work in such areas still lacks mathematical foundations, leaving many open problems, developing foundations can solidify this work.
Finally, Willett pointed to how advancing the frontiers of ML can benefit broader science because of the overlap between fields; learning from simulations requires understanding data assimilation and active learning, for example, and studying these benefits both ML research and other sciences. She asserted that investing in the mathematical foundations of ML has the potential to be impactful across different applications and domains.

Discussion

Tao began the discussion by requesting advice for young researchers wishing to contribute to the mathematical foundations of ML. Willett indicated that beyond understanding the basics of linear algebra, statistics, and probability that are fundamental to ML, researchers will need training specific to the kind of work they would like to do. Some areas might require knowledge of partial differential equations and scientific computation, while others might be grounded in information theory or signal


processing. Austern agreed that a strong understanding of the basics should be developed first, and she also suggested that new contributions may come from mathematical areas that have not previously interacted much with ML, such as ergodic theory.

Noting that some areas of theoretical computer science have been driven by major mathematical conjectures, such as P = NP, Tao asked whether ML has comparable mathematical conjectures. Willett observed that ML research has often been driven by empirical observations instead of any mathematical conjectures as concrete as P = NP. Austern agreed, adding that ML is characterized more by the multitude of aspects that are not yet understood rather than by one large question.

Referencing Austern's explanation that for deep NNs, a regime exists in which more data engender worse performance, Tao inquired as to whether the training protocol should be reconsidered, perhaps starting with smaller datasets instead of using all data simultaneously. Austern specified that the behavior she described occurs in a critical setting and expressed wariness toward changing practices before a full mathematical understanding is developed. Willett shared that a subfield of ML called curriculum learning seeks to train models with easy examples first and then gradually increase complexity. Compelling work is being done, but many interesting open mathematical questions remain, such as what makes an example easy or difficult, or how the process can be made rigorous. She described this as yet another area in ML where researchers are seeking to understand mathematical foundations.

CHALLENGES AND BARRIERS PANEL

The session on challenges and barriers to the adoption of AI for mathematical reasoning concluded with a panel discussion between earlier speakers Jeremy Avigad, Stella Biderman, and Ursula Martin, along with Carlo Angiuli, Carnegie Mellon University. Tao served as the moderator.
Tao began by asking each panelist to discuss the main challenges for the community using AI to assist mathematical reasoning. Explaining that the community comprises few tenured professors and many more postdoctoral, graduate, and undergraduate students, Avigad identified the main challenge as determining how to create space for the younger people to succeed. No standard path exists yet for young researchers in this area, so the main challenge is institutional. Martin seconded this sentiment and noted that external funding is a particularly powerful motivator for academic research. She also raised the challenge of integrating new mathematical tools into the research pipeline, and she suggested a potential solution in learning from other disciplines that have adapted to new technologies. Sharing her perspective as an AI researcher,


Biderman stated that AI researchers generally do not know what would be helpful to mathematicians, which creates a barrier to technological advancements. Angiuli added that a future challenge to grapple with, beyond the collaboration of mathematicians and AI researchers, is how to create tools that mathematicians can use without direct, constant collaboration with AI researchers—essentially, how AI can become part of mathematics itself.

The speakers continued to discuss the institutional challenges facing the field of AI for mathematical reasoning. Angiuli observed that currently, work in mathematical formalization can generally only be published in computer science venues and not mathematics journals. This work, which lies at the intersection of fields, is not yet widely recognized as mathematics; interdisciplinary work often faces this struggle. Biderman turned the discussion toward the incentive structure in mathematics, emphasizing the importance of the metrics that departments consider when evaluating researchers. She pointed out that AI researchers easily amass thousands of citations, but that milestone may not be of any importance in mathematics. Avigad confirmed that mathematics departments value the specific journals in which a person publishes and not the number of citations, and Biderman responded by emphasizing how different the metrics are between mathematics and computer science. Martin asserted that changing this incentive structure may require a top-down approach from high-level leadership and noted the political dimension to the issue. Tao summarized that there seems to be a promising trend in which the definition of mathematics is broadening, and hopefully the culture will continue to change.

Tao shared a workshop participant's question about the drastic shift of researchers from public to private institutions, especially corporations, due to the large sums of money required to train AI models.
Acknowledging the gravity of the issue, Biderman said that this shift is a matter of expertise in addition to money, because usually only select corporations (and not universities) have the ability to teach new researchers how to train large models. Avigad agreed with the concern, citing the university as the primary home of mathematics, where people have the freedom to explore ideas without focusing on applications. He stressed the importance of ensuring that universities have the necessary resources for mathematicians. Angiuli asked Biderman whether the financial issue is a matter of funding in industry versus academia. Biderman answered that the issue is about the political willingness to spend rather than about a lack of funding; winning grants for large-scale academic AI research is difficult.

In response to another workshop participant's question about where funding for interactive theorem proving can be found, Biderman named Google in addition to Microsoft, as well as the Defense Advanced


Research Projects Agency. She reiterated that finding funding for interactive theorem proving can be challenging.

One workshop participant voiced concern that graduate students in pure mathematics usually have a limited coding background compared to computer science students and asked for suggestions for such students to engage with AI tools. Avigad countered that young mathematicians are becoming increasingly skilled in coding as computers are built into ways of thinking, and he also suggested that anyone interested in formalization can find endless resources and support online—for example, starting with the Lean community.6 Angiuli, Biderman, and Martin all agreed that learning to code is important for mathematics students and should be pursued. Biderman indicated that students could start with smaller AI models that require less expertise, but above all she urged mathematics students to take introductory computer science courses and learn Python.

Workshop planning committee member Talia Ringer, University of Illinois at Urbana-Champaign, expressed that collaborating with AI researchers can be intimidating because of how large and powerful the community is, and one often ends up needing to work on their terms. Ringer wondered about ways to address this power dynamic. Stressing that this is a difficult question, Martin observed that any given collaboration will be with individuals and not the entire community. Collaboration is social; finding the right collaborator can take a long time, she continued, and one can learn from collaborating with the wrong people. Avigad added that crossing disciplinary boundaries is one of the main challenges in this work. Research includes a large social component, and communication between mathematicians and computer scientists can be awkward because of the differences in communities: different mindsets, expectations, and ways of thinking and talking. Martin agreed, explaining that this dynamic is why collaboration takes time.
6 The Lean theorem prover website is https://leanprover.github.io, and the Lean community website is https://leanprover-community.github.io, both accessed July 26, 2023.

Biderman added that many AI researchers do not know exactly what mathematicians do, or what distinguishes mathematical thinking and proof, so events like this workshop help break that barrier by spreading awareness and promoting connections for smoother collaboration. She also pointed out that much of the large-scale AI research that is not conducted at large technology companies occurs on Discord, between “random people on the Internet.” These communities do high-quality work out of passion and interest, and she asserted that many places exist to find collaborators who are not from large, powerful companies. Stating that the world is currently in the century of computer science, workshop planning committee chair Petros Koumoutsakos asked about


reasons for computer scientists to engage with mathematics. Noting that AI research is concerned with a variety of applications, Biderman observed that computer scientists often engage with other domains, and many researchers find the application of AI to mathematics inherently interesting. Avigad emphasized the benefits of any opportunities for the fields to come together, and Angiuli added that verifying software is a natural place for the domains to meet. Biderman also noted the historical overlap between AI research and theoretical computer science, and she indicated that collaboration between AI and mathematics could build on those experiences.

Another question sparked discussion of the ethical issues involved in researching the interplay of AI and mathematical reasoning. Martin highlighted potential ethical issues of bias in the training of models, and Biderman highlighted similar issues in model development. Biderman stated that compared to many other applications of AI, fewer opportunities for ethical issues exist in AI for mathematics. AI technologies for identifying people in videos, for example, have been used by oppressive regimes to surveil their citizens or in the United States to aid police arrests. AI for pure mathematics does not carry these kinds of dangers of accidentally contributing to oppressive systems, she continued. However, Angiuli observed that because the mathematics community is made up of people, issues could still arise from the interface between the abstract mathematics and the people who lead and are part of the community. Biderman qualified that the area is not devoid of ethical issues but that the moral hazard is lower than in fields such as natural language processing, which commonly face certain classes of ethical issues. Sharing a related anecdote, Tao mentioned G.H. Hardy, who took pride in working in number theory because of its lack of potentially dangerous applications and was then horrified when it became fundamental to cryptography, which has many military applications. Angiuli pointed out that plenty of funding for software and hardware verification research comes from the Department of Defense—for example, to verify unmanned drones.

The session concluded with a discussion of how progress in using AI for mathematical reasoning is made and measured. Avigad asserted that mathematicians will value the use of AI on any traditionally recognized mathematical problem; the question is which mathematical problems AI will succeed in solving, and finding the answer will require trial and error. Biderman noted a symmetry: researchers can use mathematics to advance AI and use AI to advance mathematics; the two are closely intertwined and grow together.







5 Technical Advances Required to Expand Artificial Intelligence for Mathematical Reasoning

On the final day of the workshop, speakers from computer science and mathematical backgrounds discussed the technical research advances needed to expand the use of artificial intelligence (AI) for mathematical reasoning.

RESEARCH ADVANCES IN COMPUTER SCIENCE

The session on research advances in computer science highlighted speakers Brigitte Pientka, McGill University; Aleksandar Nanevski, IMDEA Software Institute; and Avraham Shinnar, IBM Research. Talia Ringer, University of Illinois at Urbana-Champaign, moderated the discussion.

Principles of Programming and Proof Language

Giving an overview of the field of principles of programming languages, Pientka discussed what AI can do for mathematical reasoning from the perspective of that field's community. She described the research landscape as a combination of work on programming languages based on logical principles; proofs about programs and programming languages themselves, traditionally done on paper but increasingly carried out with proof assistants; and the implementation and evaluation of programming and proof environments. The idea that proofs can be thought of as programs is foundational to the field.
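As a minimal illustration of a machine-checked "proof about a program" (our sketch, not an example from the talk), the following Lean proof shows that appending lists is associative; Lean's standard library already provides this fact as `List.append_assoc`, so the inductive proof is spelled out purely for exposition:

```lean
-- A tiny "proof about a program": list append is associative.
-- (Illustrative; the library lemma List.append_assoc already exists.)
theorem append_assoc' (xs ys zs : List Nat) :
    (xs ++ ys) ++ zs = xs ++ (ys ++ zs) := by
  induction xs with
  | nil => rfl                -- ([] ++ ys) ++ zs reduces to ys ++ zs
  | cons x xs ih => simp [ih] -- peel off x, then apply the hypothesis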



Pientka asserted that the trust and reliability of programs are a central motivation for this research and become increasingly important now that large language models (LLMs) can generate code. Compilers, used to make code runnable by translating it into low-level machine instructions, are a key piece of software infrastructure whose bugs are notoriously difficult to find and fix. In the 1970s, a grand challenge posed by Tony Hoare asked the community to verify a compiler. The challenge was finally met in 2009 with CompCert,1 a fully verified compiler for C, using the Coq proof assistant. She emphasized this success as a testament to the value and power of proof assistants.

The language in which software is implemented is also crucial to reliability, Pientka continued. The language itself can be flawed, which makes it important to study language designs themselves. Java exemplifies this well. It was one of the first languages with a formal, rigorous mathematical definition, and it claimed to be type-safe, meaning that a well-typed program will execute without issue. However, she explained that over the decades since its creation, researchers have investigated and debated that claim (see Amin and Tate 2016; Drossopoulou and Eisenbach 1997; Rompf and Amin 2016; Saraswat 1997).

Highlighting the differences between proofs in programming language theory and in mathematics, Pientka remarked that computer scientists are often proving the same kinds of theorems—such as type safety, bisimulation, soundness of a transformation, or consistency—for different languages. The difficulty lies in the fact that the language itself is a moving target and can be quite large. She illustrated these differences with a metaphor of buildings: mathematics is like a skyscraper, where proof builds on earlier floors, stacking up to high levels. In programming languages, the ground is like an earthquake zone, and work goes into establishing the stability of smaller houses, ensuring they do not collapse.
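The type-safety claim Pientka mentioned is conventionally decomposed into two theorems, progress and preservation (a standard formulation in the literature, not taken from the talk):

```latex
\begin{aligned}
&\textbf{Progress:} && \vdash e : \tau \;\Longrightarrow\; e \text{ is a value or } \exists e'.\; e \to e' \\
&\textbf{Preservation:} && \vdash e : \tau \,\wedge\, e \to e' \;\Longrightarrow\; \vdash e' : \tau
\end{aligned}
```

Together these imply that a well-typed program never gets stuck: evaluation either produces a value or can always take another step, and each step preserves the program's type.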
Pientka explained that programming language proofs are difficult to write on paper because of the level of detail one must consider. Proof assistants can help but come with their own challenges: extensive work must go into building basic infrastructure; there are many boilerplate proofs; it is easy to get lost in technical, low-level details; and progress is time-consuming and highly dependent on experience. For example, she presented an article that developed a theory using Coq and noted that out of 550 lemmas, approximately 400 were “tedious ‘infrastructure’ lemmas” (Rossberg et al. 2014), which is a common experience.

Returning to the idea of proofs as programs, Pientka noted the synergy between several paired concepts in programming and proofs, including code and proof generation; code and proof macros; program and proof synthesis;
1 The CompCert website is https://compcert.org, accessed July 10, 2023.



and incomplete programs and incomplete proofs. Abstractions are also crucial to both programming and proofs. Consequently, the two domains grow together, with advances in one applying to the other.

Pointing toward potential areas of future work, Pientka commented on the usefulness of collections of challenge problems. She noted the influence of POPLMark, which helped popularize proof assistants in the programming language community and served as a point of comparison across different proof assistants, tools, and libraries (Aydemir et al. 2005). There are also clear areas for improvement in interactive and automatic proof development, such as eliminating boilerplate proofs and gaining an overall better understanding of automation.

To conclude, Pientka summarized two main lessons. First, abstractions are key to the interplay between humans and proof assistants, and developing theoretical foundations that provide these abstractions is just as important as building new tools and engineering. Second, proof automation has high potential for programming language proofs. Because the goal of a proof is often clear while the language is a moving target, she underscored that one should, in principle, be able to port proof structures to new languages, making programming language theory a useful testbed for proof automation.

Structuring the Proofs of Concurrent Programs

Nanevski presented on how and why to structure proofs, using the theory of concurrent programs as a running example. He explained that structuring is about making proofs scale, an idea connected closely to structured programming. Foundational to computer science, structured programming holds that how programs are written matters (e.g., removing a command can improve clarity, quality, and development time). Structured proving and verification of programs is application specific, but he noted that a common thread exists across applications.
Reiterating Pientka's earlier statement, he described this common thread: programming and proving are essentially the same. For example, the Curry-Howard isomorphism formally establishes a correspondence between proofs and programs. The Curry-Howard isomorphism is foundational for type theory, the study of program structures and of types as their specifications. Type theory, in turn, is foundational for proof assistants (e.g., Coq, Lean). However, he continued, although the Curry-Howard isomorphism is tied fundamentally to a state-free programming language (i.e., it is sufficient for writing proofs in mathematics, writing proofs about programming languages, and verifying state-free programs), a question remains about how to verify programs that fall outside the state-free paradigm.
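The Curry-Howard correspondence can be seen concretely in a proof assistant. As a minimal illustration (ours, not from the talk), the following Lean term is simultaneously a program that swaps the components of a pair and a proof that conjunction is commutative:

```lean
-- Read as a program: swap the two components of a pair.
-- Read as a proof: from "A and B", conclude "B and A".
theorem and_swap (A B : Prop) : A ∧ B → B ∧ A :=
  fun ⟨ha, hb⟩ => ⟨hb, ha⟩
```

The same syntax serves both readings: the function type `→` is implication, and the pair type `∧` is conjunction.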




To begin to address this question, Nanevski explained how type theory can be applied more broadly to shared-memory concurrency programming models. The earliest proposal for a program logic on which a type theory for shared-memory concurrency could be built came from Owicki and Gries (1976). Nanevski detailed Owicki and Gries's path to designing a Hoare logic and, in particular, their methods for verifying the compositionality of a program, and he observed two key ideas: (1) programs needed to be annotated with an additional variable state (i.e., the ghost state variable), and (2) resource invariants needed to be introduced. With these two added steps, verification became possible. He noted that by introducing new state, new code, and new programs for the sole purpose of designing a proof, Owicki and Gries blurred the distinction between programs and proofs.

Nanevski remarked that although this work was influential, several issues emerged. First, proofs depend on thread topology, but a two-thread subproof cannot be reused because it uses a different resource invariant—in other words, the program state leaks into the resource invariant, leading to false proof obligations, and one proves the same thing over and over again. This case demonstrates a failure of information hiding and of code/proof reuse. Second, without knowing a program's number of threads, it is not possible to annotate that program. This reflects a failure of data abstraction.

Nanevski explained that a question then arises about how to disassociate proofs from the number and topology of threads, and he asserted that types are helpful in addressing it. For example, one takes the judgment from the Owicki-Gries specification and turns it into a dependent type. He emphasized that in type theory, programs that differ only in the names of variables are the same program, but an additional abstraction is needed to make the types equal.
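A standard textbook instance of this ghost-state pattern (our illustration, not taken from the talk) is the parallel-increment example: each thread i tracks its own contribution in a ghost variable g_i, and the resource invariant ties the ghost variables to the shared counter x:

```latex
\begin{aligned}
&\text{Resource invariant:} && I \;\equiv\; \bigl( x = g_1 + g_2 \bigr) \\
&\text{Per-thread spec:}    && \{\, g_i = 0 \,\}\;\; x := x + 1 \;\;\{\, g_i = 1 \,\} \\
&\text{Composition:}        && \{\, x = 0 \,\}\;\; \mathit{inc}_1 \parallel \mathit{inc}_2 \;\;\{\, x = 2 \,\}
\end{aligned}
```

Because each thread's specification mentions only its own ghost variable and the invariant mentions only their sum, the per-thread proofs no longer depend on the number or arrangement of the other threads.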
He noted that this pattern of abstraction is so pervasive that a new type theory emerged as a default for the Hoare triple type. He said that each thread and Hoare type should have two local variables. These variables have different values in different threads, but the sum of the variables is the same in all threads, which means that the resource invariant has been disassociated from the number and the topology of threads. However, these local variables cannot be independent, and their relationship must be captured to generalize separation logic, which has been reformulated as a type theory. He added that proof reuse can be restored with parallel composition, and data abstraction can be restored by writing an incrementation program and hiding the thread topology with the types. He underscored that such approaches scale up to more sophisticated specifications and structures such as stacks, queues, and locks.

Nanevski concluded by summarizing a few keys to scaling mechanized verification. Proofs scale by improving their behavior in terms of


information hiding and reuse, especially for proofs of program correctness. Determining the right abstractions is essential but difficult. To find these abstractions, he suggested using type theory and generally experimenting with proofs; AI also may be helpful in this area. He noted that although correctness is the most commonly mentioned reason for proof mechanization, a second benefit is that, by facilitating such experimentation, mechanization provides a way to gain insight into the structure of a problem. Finally, he reiterated the sentiment that advances in proofs are intimately tied to advances in programming, and the two will progress together.

Artificial Intelligence for Mathematics: Usability for Interactive Theorem Proving

Shinnar discussed current tools and automation for interactive theorem proving and ways to improve their usability. Machine learning (ML) can help, he imagined, in both the foreground and background of the proof assistant. In the foreground—that is, at the step in a proof where the user is currently working—ML tools could make immediate suggestions (such as suggestions of tactics or tactic arguments). In the background, slower but more powerful ML tools could work on the statements that the user previously marked with a placeholder proof (called an “admit” or “sorry” statement) and moved past. Such statements are typically numerous; researchers often sketch out a high-level proof plan, temporarily assuming the truth of many statements that appear in the plan (rather than immediately working to prove them) so that their focus can remain on the plausibility of the plan as a whole. Another background task for ML tools could be suggesting improved names for lemmas.

Shinnar called attention to the importance of the user experience offered by ML tools.
People writing formal proofs, like people writing other kinds of programs, typically do this work in an application called an integrated development environment (IDE), which is designed to help them work efficiently. For widely used programming languages such as Python, IDE tools offer a programmer ML support in a carefully designed way that does not interrupt workflow, including autocomplete tools like GitHub Copilot that quickly recompute “foreground” suggestions with each keystroke, as well as slower tools for “background” tasks that store suggestions unobtrusively for the user to consult later. For formal theorem proving languages, existing IDE tools for ML support do not yet offer this degree of user-friendliness. They have largely been written by academic ML researchers, whose incentives for publication are to maximize the number of tasks their ML tools can eventually solve, not to develop a good trade-off between success percentage and speed. Shinnar emphasized that formal proof developments are built


in composable, or modular, ways. For example, Coquelicot and Flocq are two libraries for Coq that themselves depend on the Coq standard library, and Shinnar's library FormalML depends on both of these. Individual developers use FormalML and may contribute their code to that repository when it is finished. All of these pieces work together, but they are developed and maintained by different people. He advocated for ML models to be built modularly in the same way, with initial training on basic libraries and fine-tuning on higher-level libraries. One barrier to adopting this approach is that a system of composed models will not perform as well as a model trained entirely on the full library ecosystem for the test suite. The compiler community, for example, has accepted this kind of trade-off because a composed system can scale to large programs and work for people who do not have access to supercomputers. Shinnar urged that the same should happen for ML models. However, he observed that this is also an institutional challenge: the community needs to incentivize these systems, even if they do not perform as well, to achieve models that are usable, scalable, and work for everyone.

Discussion

Ringer moderated the discussion, and the speakers began by considering the importance of abstractions. Shinnar remarked that because proofs often look similar, it is faster to copy and tweak previous proofs than to identify the perfectly fitting abstraction. Using ML to abstract a lemma from the core parts of a proof would be useful for proving and could provide insight into the proof. Pientka added that it would be useful to have methods to abstract conceptual ideas beyond the level of tactics, such as which clauses and rules are used for certain kinds of proofs.

Ringer asked whether mathematicians truly need these kinds of logics and type theories and, if so, how the speakers would convince them to invest in these foundations.
Nanevski indicated that mechanization, for example, is a way to understand the structure of a problem and can lead to new mathematics. Pientka agreed, stating that formalizing theory in a proof assistant forces one to consider what underlies the theory. Furthermore, mathematical proofs can suffer the same kinds of “bugs” that computer programs do, and proof assistants lend certainty to the proofs. Shinnar noted that mathematicians also have the experience of writing the same proof over and over, naming integration as an area where “the standard argument” is often referenced.

Ringer posed another question about priorities when developing new type-theoretic foundations. Pientka stressed that the theory should be informed by practice. Metaprogramming, for example, is gaining traction


but does not have a clear foundation, so there is a need for new theory that can support proofs about properties of metaprograms. Nanevski added that theory should be informed not just by the kinds of programs being written but also by the support needed to build proofs. Shinnar cautioned that the core type theory and kernel should stay as small as possible, for trustworthiness, but should support the layering of abstractions. Rather than making the core logic expressive, he continued, researchers should build expressive logic on top of it and ensure that the core system can support it.

A workshop participant's question about research on proof environments directed toward machine use rather than human use prompted a discussion of the essential purpose of interactive proof assistants. Nanevski cited Edsger Dijkstra's famous statement that humans should write for humans and have machines adapt, rather than vice versa, and Shinnar agreed that the fundamental need is for humans and computers to work together. Nanevski and Shinnar noted that the ability to build layers of abstraction simplifies a given problem, which is good for both the machine and the human. Along this line, Pientka observed that optimizing environments for human use and for machine use are not necessarily exclusive.

Observing that some researchers frame proof assistants as tools that are more difficult to use but provide greater certainty than working on paper, Ringer asked the speakers whether they had examples in which writing a proof was easier with a proof assistant than on paper. Shinnar asserted that writing proofs in proof assistants can absolutely be easier than on paper. Programming language proofs often raise numerous cases, and the computer can enumerate which cases exist instead of one having to think deeply about what might arise. Additionally, a major part of theorem proving is discovering that some of one's earlier definitions were wrong and returning to tweak them.
On paper, it is unclear what part of the proof needs to be discarded and what can be reused; proof assistants take care of those details and provide much greater confidence. Shinnar and Nanevski both stated that they essentially only use proof assistants rather than paper because they allow freedom to explore, although Pientka said she still writes many proofs on paper.

In response to a workshop participant's question requesting recommendations for introductory books or papers for mathematicians to learn more about proof assistants, the speakers shared resources. Nanevski offered “Programs and Proofs,” a collection of lecture notes from a summer school led by Ilya Sergey.2 Pientka stated that something like POPLMark for mathematicians would be incredibly valuable, recalling her earlier remarks that it was a small enough challenge to provide an entry point
2 These notes can be found at https://ilyasergey.net/pnp, accessed July 2, 2023.


for exploration and even fun. Nanevski echoed her sentiment and urged the creation of learning ramps with simpler concepts that build up to the use of libraries like mathlib and offer an easier playground for exploration. Ringer summarized some other resources, such as Jeremy Avigad and Patrick Massot's book Mathematics in Lean3 and Kevin Buzzard's blog, the Xena project4; Ringer also noted the existence of games for Lean and cubical type theory.

Concluding the discussion, Ringer asked a final question about which problems the speakers are most excited. Nanevski named compositional type theory or compositional cryptography, and Pientka named developing type theory for concurrent process calculi. Shinnar mentioned work that the ML community could do to help mathematicians with formalization.

RESEARCH ADVANCES IN THE MATHEMATICAL SCIENCES

Speakers Javier Gómez-Serrano, Brown University; Adam Topaz, University of Alberta; and Alex Kontorovich, Rutgers University, presented research advances in the mathematical sciences and participated in a discussion moderated by Brendan Hassett, Brown University.

Artificial Intelligence, Computer-Assisted Proofs, and Mathematics

Gómez-Serrano discussed recent advances in the interplay among AI, computer-assisted proofs, and mathematics. He asserted that although AI and mathematics have a relatively short history together, they likely have a long future with many opportunities for collaboration between the disciplines. He underscored that training mathematicians in ML will be vital.

Gómez-Serrano's presentation centered on the three-dimensional (3D) incompressible Euler and Navier-Stokes equations from fluid dynamics. A famous open problem asks whether there exist smooth initial data with finite energy that lead to a singularity in finite time, called a blow-up solution.
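For reference (our addition; the notation is standard rather than taken from the talk), the 3D incompressible Euler equations for a velocity field u(x, t) and pressure p read

```latex
\partial_t u + (u \cdot \nabla)\, u = -\nabla p,
\qquad
\nabla \cdot u = 0,
\qquad
u(\cdot, 0) = u_0 ,
```

and the blow-up question asks whether some smooth, finite-energy initial datum u_0 yields a solution whose derivatives become infinite in finite time. Adding the viscous term ν∆u to the right-hand side of the first equation gives the Navier-Stokes equations.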
Gómez-Serrano recounted the history of the problem from the perspective of the numerical analysis and simulations community, explaining that for a long time, there was no consensus on whether the answer was yes or no. A breakthrough by Luo and Hou (2014) found an axisymmetric blow-up away from the axis. They considered a fluid inside a cylinder
3 The book can be accessed through the GitHub repository found at https://github.com/leanprover-community/mathematics_in_lean, accessed July 2, 2023.
4 The Xena project blog can be found at https://xenaproject.wordpress.com, accessed July 2, 2023.


spinning in one direction in the top half and in the other direction in the bottom half, which generates secondary currents. They showed numerically that these currents lead to a finite-time singularity on the boundary in the middle of the cylinder. Their results also suggested a self-similar blow-up—a solution that is similar to itself as one zooms in with an appropriate scaling (Luo and Hou 2014).

Gómez-Serrano's team searched for this suggested self-similar solution using ML. Using a technique called physics-informed neural networks, developed by Raissi and colleagues (2019), the team found a self-similar solution numerically (Wang et al. 2023). He emphasized that the technique is fast and versatile and, even more significantly, that it can discover new solutions that no one has been able to compute using traditional methods. He advocated for increasing the use of ML to complement traditional methods in mathematics.

Gómez-Serrano discussed his team's strategy for future work, which aims to prove singularity formation for the 3D Euler and Navier-Stokes equations without boundary. He noted that the project involves collaboration with an industry partner, which provides access to more extensive computational resources. He concluded by expressing excitement about future work, underlining that extensive recent progress has opened new avenues of research involving computer-assisted proofs and ML, and the interaction of both with mathematics.

Formal Mathematics and Artificial Intelligence

Topaz presented his experiences formalizing research-level mathematics and the ability of formal methods to act as a bridge between AI and mathematics. He distinguished library building from research mathematics, asserting that formal libraries capture a segment of mathematical knowledge in a formal system and are meant to be used in other formalization endeavors.
They are large and include many contributors—for example, Lean’s mathlib includes over 1 million lines of code; has over 300 contributors; and covers a wide range of mathematics including algebra, topology, category theory, analysis, probability, geometry, combinatorics, dynamics, data structures, and logic. Conversely, projects for research mathematics work to formalize cutting-edge theory or results. They are more focused with fewer contributors and more direct collaboration, and they employ formal libraries—the Liquid Tensor Experiment is one such example. Despite these differences in library building and research mathematics, he noted that these two areas of formal mathematics share the features of being collaborative, asynchronous, and open. Topaz illustrated the workflow of these collaborations, particularly noting the use of “sorry” placeholders for most intermediate statements’

Copyright National Academy of Sciences. All rights reserved.

proofs, which allows high-level proof structure to be planned more quickly, as previously described in Shinnar's presentation. These sorry statements can be broken down into smaller components, and this process is done recursively until the target statements are tractable and can be handled directly. He highlighted how individual contributors can focus on individual targets that they feel most comfortable with, and the proof assistant ensures consistency, which enables smooth collaboration.

Topaz imagined ways in which AI, with its current capabilities, could help these projects. The first was an "AI collaborator" that would iterate through sorry statements and attempt to solve each one, which Shinnar also discussed. Topaz emphasized that the AI can be useful even if it is not particularly powerful. If it succeeds with a sorry statement, it helps; if it does not, the researchers will fill the sorry statement in later as they do in current workflows. A few transformer-based tools already exist, including GPT-f, HyperTree, and Sagredo. He urged researchers to optimize usefulness by focusing only on filling in proofs for now, since having AI frame definitions or theorem statements, for example, is quite challenging. A second way AI could assist is in user experience, specifically in searching for lemmas or integrating more functionality into the editor. Like Shinnar, Topaz noted that ML tools are most useful when they are integrated into the editor. He acknowledged a few barriers including cost, the need to maintain infrastructure, and the cultural separation between open-source projects and for-profit industry organizations.

Topaz then discussed the synergies between mathematics and AI research. He noted that mathematics research can be split crudely into problem-solving and theory building. Formalizing mathematics involves attempting to understand the problem-solving process itself, and aspiring to find the right definitions and abstractions.
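The "sorry"-placeholder workflow described above can be pictured with a small Lean sketch (a toy statement invented here for illustration, not an example from the workshop): the top-level theorem is committed first, and each intermediate goal is left as a placeholder for a collaborator, or an AI tool, to fill in later while the proof assistant keeps the whole development consistent.

```lean
-- Toy illustration of the "sorry" workflow: the high-level proof
-- structure is planned and committed first, with `sorry` standing in
-- for each intermediate goal; contributors fill these in recursively.
theorem add_comm_sketch (m n : Nat) : m + n = n + m := by
  induction n with
  | zero => sorry        -- remaining goal: m + 0 = 0 + m
  | succ k ih => sorry   -- remaining goal: m + (k + 1) = (k + 1) + m, given ih
```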
He observed parallels in AI research, where researchers are striving to augment the reasoning capabilities of LLMs with novel prompting techniques and often find challenges in having AI produce useful definitions. He concluded that between mathematics formalization and AI research, advances in one area can support the other, and he advocated for increased collaboration between the two.

Artificial Intelligence to Assist Mathematical Reasoning

Kontorovich presented the path to developing AI to assist mathematical reasoning, offering several "conjectures" about the realities and future of the field. Working backward, he began with potential outcomes, imagining an instance in which AI solves the Riemann hypothesis. AI could provide a proof that is completely incomprehensible or perhaps give a perfectly comprehensible, beautiful proof. Furthermore, AI could


develop its own definitions and conjectures, supplying new mathematics by finding patterns in reasoning that are outside the current realm of mathematical thinking. Thinking more modestly, he predicted that AI likely will assist in mathematical reasoning. The current workflow of research mathematicians is to start with some idea; break it into smaller theorems, propositions, and lemmas; try computations using standard or non-standard techniques; and repeat. He observed that many steps within the process intuitively seem like they could be automated.

If AI is to assist the mathematical research process, Kontorovich continued, it will need to train on large datasets. He conjectured that LLMs trained solely on natural language will not be able to reliably reason mathematically at a professional level. He considered that almost all humans learn language automatically, but contributing to advanced mathematics requires years of training. Furthermore, unlike language, the underlying arguments of mathematics are purely deterministic. He noted that many researchers at Google and OpenAI are working to disprove his speculation, claiming that if AI had enough training data, parameters, transformers, etc., it would be able to solve professional-level mathematics problems. However, he posited the following question: Even if AI could solve professional-level mathematics problems just by training on language and give a result in natural language, how would one certify its correctness?

This question led Kontorovich to his second conjecture: the path to AI assisting mathematical reasoning will be through an adversarial process, likely involving interactive theorem provers. He imagined a process involving feedback between an LLM and a theorem prover, with the LLM suggesting possible directions and the theorem prover certifying the mathematical correctness of steps. Large datasets will be necessary to inform this feedback loop.
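A minimal sketch of the feedback loop described above, with an untrusted generator standing in for the LLM and a trusted checker standing in for the theorem prover. Everything here is hypothetical and purely illustrative (the "theorem" is just reaching a target number from 1 by certified moves), but it shows the division of labor: the proposer may suggest anything, and only steps the checker certifies enter the proof.

```python
# Hypothetical sketch of a generate-and-verify loop: an untrusted
# proposer (standing in for an LLM) suggests next steps, and a trusted
# checker (standing in for a theorem prover) certifies each one. The
# "theorem" here is just reaching a target number from 1 via verified
# moves; none of this models a real prover.

def checker(state, step):
    """Trusted kernel: certify a proposed step or reject it."""
    if step == "double":
        return state * 2
    if step == "add3":
        return state + 3
    return None  # unsound suggestion, rejected

def generator(state, target):
    """Untrusted proposer: may suggest bad steps; only certified ones count."""
    yield "guess"  # a bogus suggestion the checker will reject
    yield "double" if state * 2 <= target else "add3"

def prove(target, max_steps=50):
    """Search for a certified sequence of steps from 1 to target."""
    state, proof = 1, []
    for _ in range(max_steps):
        if state == target:
            return proof  # every step in the proof was certified
        for step in generator(state, target):
            new_state = checker(state, step)
            if new_state is not None and new_state <= target:
                state, proof = new_state, proof + [step]
                break
        else:
            return None  # proposer is stuck
    return None

print(prove(11))
```

Scaling this pattern up is exactly where the data question bites: training a useful proposer requires a large supply of (goal state, certified next step) pairs.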
At the moment, all formalized libraries combined do not have nearly enough data for this training. He also observed that open questions remain on the specifics of training, such as whether to train models on lines of code, pairs of goal states and next proof lines, or pairs of natural language lines and formal lines. Kontorovich asserted that researchers will need to produce orders of magnitude more lines of formalized professional-level mathematics before AI can assist mathematics in this way. He offered a few reasons to be optimistic, noting that the pace of formalization has been increasing in recent years; proof assistants are able to support increasingly complex statements and proofs; and formalized proofs can sometimes be made to look exactly like human proofs. However, he underscored that a significant barrier to AI assisting mathematics remains: Humans alone will never be able to formalize enough mathematics to provide sufficient data for training. Autoformalization, in which AI formalizes problems with solutions already known to humans, is a promising area that could


help overcome this barrier. As a potential path forward, he suggested that mathematicians dedicate resources to improving human- and autoformalization (Figure 5-1), which may eventually lead to enough training data for AI systems to meaningfully and reliably assist mathematical research.

Discussion

Hassett moderated a discussion among the three speakers. He asked how a collaborative product like a large formal mathematics library interacts with traditional incentive structures in mathematics, given the difficulty of ascribing individual credit in such collaborative projects, and further inquired as to how the mathematical community can adapt. Topaz commented that this issue needs to be addressed in order for formalized mathematics to become mainstream. Kontorovich explained that reward mechanisms for the difficult, high-level, technical work of formalizing mathematics do not yet exist but should. He suggested that the interactive theorem proving community meet the mathematics community where it is. Formal mathematics libraries such as Lean's mathlib, Coq's Mathematical Components, Isabelle's Archive of Formal Proofs, and the Mizar Mathematical Library are each already a kind of journal, with maintainers who are effectively editors and a completely open, public trail. The theorem proving community could start a mathematics journal associated with formal mathematics library development, where significant contributions are associated with a paper that could, for example, include

[Figure 5-1 is a flowchart. Its nodes read: "Human formalization"; "Autoformalization: AI learns to prove things we already know, verifiably"; "Lean"; "AI Assists in Theorem Proving"; "AI makes its own Definitions and Conjectures"; "AI solves the Riemann Hypothesis"; under the heading "AI to Assist Mathematical Reasoning."]

FIGURE 5-1  Flowchart of a potential future of artificial intelligence (AI) assisting mathematical reasoning. SOURCE: Alex Kontorovich, presentation to the workshop.


meta-discussions about the work, ultimately allowing the work to be recognized by standard mathematical structures. He noted that mathematics formalization work is already included in computer science proceedings, but pure mathematicians do not have a broad awareness of the value in these proceedings. In addition to supporting Kontorovich's suggestion, Gómez-Serrano urged the traditional mathematics community to build awareness of the publishing landscape in computer science.

Hassett asked another question about how LLMs might influence the speakers' current work. Topaz recalled the demonstration he gave with sorry statements and described how AI could be useful; he added that as AI becomes a useful assistant, mathematicians will learn to write proofs in a way that is more amenable to AI assistance. Gómez-Serrano noted that one of the most significant contributions of LLMs is awareness—the attention on LLMs gives mathematicians an opportunity to educate the public.

Observing that breakthroughs in mathematics are sometimes driven by ideas cutting across different disciplines, Hassett wondered whether this process is uniquely human or if it could be supported by AI. Kontorovich replied that statistical models excel at finding patterns, and that ability could certainly provide a resource for connecting patterns across disciplines. Topaz and Gómez-Serrano agreed, and Gómez-Serrano highlighted how AI could have particular value in exploration.

Another question sparked discussion on how general methodology could be applied to numerous specific problems. Gómez-Serrano explained that the discovery phase in the work he presented was not tied to the particular problem he and his team were aiming to solve. The AI did not need much a priori knowledge about the physics to be surprisingly effective, but he stressed that work remains to understand why AI works this way.
Hassett next asked about the extent to which AI can drive generalizations, which underlie the essence of mathematics. Topaz replied that one of the strong points of formalizing mathematics is that it forces one to find the right abstraction. Kontorovich added that pedagogically, since it is often easier to formalize general statements than specialized ones, formalization pushes one to understand the big picture. Topaz agreed, rearticulating that formalization drives extraction of ideas.

Hassett asked about the role of analogy as a driver for discovery and conjecture. Topaz emphasized that it is one of the main drivers of progress in mathematics. He speculated that analogy is related to the question of how AI can produce constructions and definitions, because in mathematics, definitions are often motivated by analogy. Inviting another perspective, the speakers consulted Ringer, who gave a brief explanation of how neural tools perform poorly at identifying analogies and about current architectural limitations for this ability that may be resolved in the future.


Hassett observed that some areas of mathematics, such as low-dimensional topology, are mainly driven by visualization and geometric insight, and he wondered how that could translate to formalization. Gómez-Serrano remarked that formalization in this area could resolve controversy about the correctness of proof by picture. Referencing proof widgets in Lean, Topaz asserted that the infrastructure for visualization while working on formalization does exist, which is promising for incorporating visualization into formalization. Kontorovich mentioned work on autoformalization that suggested AI could simply intake the visuals themselves from mathematical textbooks (see Szegedy 2020).

REFERENCES

Amin, N., and R. Tate. 2016. "Java and Scala's Type Systems Are Unsound: The Existential Crisis of Null Pointers." ACM SIGPLAN Notices 51(10):838–848.

Aydemir, B.E., A. Bohannon, M. Fairbairn, J.N. Foster, B.C. Pierce, P. Sewell, D. Vytiniotis, G. Washburn, S. Weirich, and S. Zdancewic. 2005. "Mechanized Metatheory for the Masses: The POPLMark Challenge." Pp. 50–65 in Theorem Proving in Higher Order Logics: 18th International Conference, TPHOLs 2005, Oxford, UK, August 22–25, 2005. Proceedings 18. Berlin, Germany: Springer Berlin Heidelberg.

Drossopoulou, S., and S. Eisenbach. 1997. "Java Is Type Safe—Probably." Pp. 389–418 in ECOOP'97—Object-Oriented Programming: 11th European Conference, Jyväskylä, Finland, June 9–13, 1997. Proceedings 11. Berlin, Germany: Springer Berlin Heidelberg.

Luo, G., and T.Y. Hou. 2014. "Potentially Singular Solutions of the 3D Axisymmetric Euler Equations." Proceedings of the National Academy of Sciences 111(36):12968–12973.

Owicki, S., and D. Gries. 1976. "An Axiomatic Proof Technique for Parallel Programs I." Acta Informatica 6(4):319–340.

Raissi, M., P. Perdikaris, and G.E. Karniadakis. 2019. "Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations." Journal of Computational Physics 378:686–707.

Rompf, T., and N. Amin. 2016. "Type Soundness for Dependent Object Types (DOT)." Pp. 624–641 in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications.

Rossberg, A., C. Russo, and D. Dreyer. 2014. "F-ing Modules." Journal of Functional Programming 24(5):529–607.

Saraswat, V. 1997. "Java Is Not Type-Safe." https://www.cis.upenn.edu/~bcpierce/courses/629/papers/Saraswat-javabug.html.

Szegedy, C. 2020. "A Promising Path Towards Autoformalization and General Artificial Intelligence." Pp. 3–20 in Intelligent Computer Mathematics: 13th International Conference, CICM 2020, Bertinoro, Italy, July 26–31, 2020. Proceedings 13. Springer International Publishing.

Wang, Y., C.-Y. Lai, J. Gómez-Serrano, and T. Buckmaster. 2023. "Asymptotic Self-Similar Blow-Up Profile for Three-Dimensional Axisymmetric Euler Equations Using Neural Networks." Physical Review Letters 130(24):244002.


6 Roles for Stakeholders

The workshop's final session was a panel discussion on the perspectives of mathematical organizations. Moderated by Heather Macbeth, Fordham University, the panel included Gunnar Carlsson, American Mathematical Society (AMS); Brendan Hassett, Institute for Computational and Experimental Research in Mathematics (ICERM); Dimitri Shlyakhtenko, Institute for Pure and Applied Mathematics; and Suzanne Weekes, Society for Industrial and Applied Mathematics (SIAM). Each panelist's organization is unique, but they share the goal of supporting mathematical research and the exchange of mathematical knowledge.

Noting that each organization has an existing framework for interdisciplinary work, Macbeth asked about specific examples of interdisciplinary collaboration. Weekes remarked that SIAM is affiliated with various organizations like the Computing Research Association, the American Automatic Control Council, the Association for Computing Machinery, the American Statistical Association, the Mathematical Association of America, and AMS. She noted that SIAM often runs sessions at conferences of other societies or organizations, and vice versa, and the applied nature of SIAM lends itself to natural collaboration. Carlsson added that AMS often supports workshops of an interdisciplinary nature held by other institutes and encourages mathematicians to attend such workshops.

In response to a question on the challenges to building interdisciplinary connections, Hassett observed that people with different backgrounds engage differently in interdisciplinary programs. Oftentimes, mathematicians can spend weeks away from their home institution immersing themselves in


another community, but many people working in application spaces like biology or computer science, for example, can only be involved for a short time. With this in mind, ICERM develops interdisciplinary programs that include both short, concentrated periods for targeted application areas and longer periods for mathematicians. He summarized that it is important to design programs intentionally by considering these cultural differences and ensuring that everyone can engage meaningfully, even if engagement differs between people. Shlyakhtenko added that in interdisciplinary work, different groups of people often have not only different opinions on a problem but also different conceptions of the problem itself. Translation is a key issue. He advocated for highlighting the benefits of interdisciplinary work to encourage collaboration: There is great convergence between disciplines, but a lack of interdisciplinary work leads to a duplication of efforts.

Macbeth asked the panelists to reflect on the workshop by commenting on what artificial intelligence (AI) might be able to do for mathematics. Shlyakhtenko noted that language models have been used for translation for several years and wondered whether they could be used for the similar "translation" problem that researchers encounter in interdisciplinary work—working with unfamiliar terminology. He also mentioned that mathematics has invested centuries of notation and education into ensuring readability to scientists and others not fully trained in mathematics, and he advocated for prioritizing this accessibility as mathematics progresses and perhaps becomes formalized. Carlsson remarked that mathematical reasoning includes several aspects beyond theorems and proofs, such as experimentation or the notion of idealized models, which may not be quantitatively exact but can reveal insights qualitatively.
He encouraged mathematicians to take the initiative to be more involved with AI and indicated that their involvement could improve understanding and explainability of AI. Building on the idea of experimentation in mathematics, Weekes expressed excitement toward AI’s ability to find patterns in massive data, reveal new insights, and inform mathematics. Hassett commented that theorem proving and machine learning projects require more complex collaboration than traditional mathematics. Collaboration is growing, and he suggested that mathematical institutes could support this collaboration by connecting mathematicians and researchers from different backgrounds. Weekes added that the community will need to evolve to appreciate work in these new paradigms. Shlyakhtenko observed that large-scale collaboration is already common and growing in other areas, such as gravitational wave physics. He urged the mathematics community to move away from traditional reward metrics toward more holistic recognition and suggested that professional societies could advocate for this shift.


Macbeth remarked that professional societies provide resources for mathematicians and mathematics departments and inquired about what new shared resources and frameworks might be necessary. Referencing MathSciNet, a specialized database for mathematics publications, Carlsson wondered whether an "AISciNet" might be possible. He also suggested that mathematics may need faster, updated publication models. Current mathematical publication processes are slow compared to many other sciences, and they create a barrier to mathematicians engaging with AI. Hassett described how ICERM is hosting a conference¹ in July that invites researchers to submit manuscripts, allowing for quick dissemination to a wide audience. He indicated that some areas of mathematics, especially those driven by data and algorithms, could move to new publication models by learning from computer science. Macbeth suggested that computerized proofs may also aid in accelerating the review process.

Macbeth transmitted an audience question, asking for the panelists' thoughts on how the younger generation can learn to interact with AI technologies and how these technologies can be integrated into mathematics education. Shlyakhtenko observed that, in general, younger people are naturally engaging more with AI. For mathematics specifically, he cautioned that any new technologies that become common in research (e.g., proof assistants) should also be integrated into educational curricula early—at the undergraduate or even high school level—to ensure that mathematics stays relatively accessible. Hassett remarked that some areas in pure mathematics can rely on the same traditional curricula that have not changed for decades. Many people can therefore teach those courses, but curricula involving new technology will require greater pedagogical engagement and training for teachers.
He urged mathematicians to reflect on how flexibility can be added to curricula so that students can be exposed to technologies such as proof assistants. Macbeth concluded that overall, mathematicians need new approaches in addition to—not in replacement of—traditional approaches to accommodate these developing technologies.

1 Information about the conference can be found at https://icerm.brown.edu/events/sc-23-lucant, accessed August 2, 2023.


7 Conclusion

Jordan Ellenberg, University of Wisconsin–Madison, provided concluding remarks to wrap up the 3-day workshop. He emphasized how multifarious the field of artificial intelligence (AI) to assist mathematical reasoning is, citing the presentations on topics ranging from reinforcement learning to generate counterexamples to conjectures, to formal language to codify mathematical reasoning in machine-processable forms. AI may even assist aspects of mathematics outside of proof, such as conjecture forming. Advancement in this area will be a two-way street, he said, requiring communication between groups of researchers who speak and think differently.

Ellenberg pointed out that translation is an important theme in many senses. It is vital for communication between researchers. In addition, when one receives output from a machine, the information conveyed should be understandable, but even further, one aims to understand the workings of the language itself. He stressed that although this may feel new to mathematicians, in reality this has always been the experience of mathematicians. Mathematicians first encounter strange new phenomena; then the phenomena become familiar; and finally, they become legible, with true understanding of the underlying structures. He shared André Weil's thoughts about the analogies between number fields and function fields: "These texts are the only source of knowledge about the languages in which they are written. In each column, we understand only fragments" (Weil 1960). Ellenberg underscored that translation between fields is an old idea.


Next, Ellenberg characterized AI as being able to support rapid and eccentric exploration. This exploration is considered rapid because of how quickly the field moves and eccentric because it works differently from how humans alone might operate. He emphasized that this eccentricity is what provides value.

Raising the theme of understanding that appeared throughout the workshop, Ellenberg pondered what it means to understand. He observed that "artificial intelligence" is now a common term, but "artificial understanding" is not—understanding is something associated with humans. And mathematics is primarily motivated by growing, enlarging, and deepening human understanding. The goal of formalizing mathematical reasoning is not to eliminate the need for informal reasoning but to aid understanding. Interpretability in machine learning (ML) is a related issue, and mathematics can be a testbed for interpretability, he said, in part because mathematics is such an established discipline that has consensus on what counts as insight. Having this consensus can aid the study of what interpretability in ML means.

Ellenberg cited the simile from Rebecca Willett, University of Chicago, that investing in ML without understanding mathematical foundations is like investing in healthcare without understanding biology. He considered it an apt statement precisely because societies do invest in healthcare without understanding biology—there is so much still not understood in biology—and the two areas work in an iterative interplay. Applied work will leap ahead while biology seeks to understand why certain treatments are effective, and foundational biology guides applied work in directions more likely to be productive. The two support one another, as AI research and mathematics do.

Delving into how invigorating collaboration can be, Ellenberg urged all pure mathematicians to collaborate with others whose goals are aligned but not completely the same.
The charm of collaboration is that unique perspectives are brought together. He recalled a statement in the workshop overview by Moshe Vardi, Rice University, that if one is not authentically open to changing one’s mind when speaking with another person, it is not really a conversation. Ellenberg suggested that while this standard is a high bar, it can be aspirational for research communities and inspire better collaboration. Ellenberg expressed his belief in historical incrementalism, imagining that the most likely future resembles the past. Machines have been assisting mathematics for a long time by recategorizing certain tasks as computation instead of mathematics, freeing humans to explore further. Everything being done right now is new, but in another sense it is not; “it is new in a way that rhymes with the past,” he said. Summarizing the key themes of the workshop—translation, exploration, and


understanding—Ellenberg shared that despite the barriers discussed, these three actions could allow AI to progress from assisting mathematics to collaborating with mathematics.

REFERENCE

Weil, A. 1960. "De la Métaphysique aux Mathématiques." Sciences 2:52–56.


Appendixes


A Workshop Agenda

JUNE 12, 2023 (all times EDT)

10:00 a.m.  Welcome
    Petros Koumoutsakos, Harvard University

    Motivation from the National Science Foundation
    David Manderscheid, Director, Division of Mathematical Sciences
    Dilma Da Silva, Director, Division on Computing and Communication Foundations

Session I—State of the Art of Using Artificial Intelligence to Assist Mathematical Reasoning

10:20 a.m.  Overview and Grand Vision
    Moshe Vardi, Rice University
    Geordie Williamson, University of Sydney
    Moderated by Brendan Hassett, Brown University

11:30 a.m.  Case Studies: Artificial Intelligence to Assist Mathematical Reasoning
    François Charton, Facebook AI Research
    Adam Wagner, Worcester Polytechnic Institute
    Moderated by Yann LeCun, Meta


12:45 p.m.  Break

1:00 p.m.  Case Studies: Proof Verification and Checking
    Thierry Coquand, University of Gothenburg
    Johan Commelin, University of Freiburg
    Greg Morrisett, Cornell University
    Moderated by Talia Ringer, University of Illinois at Urbana-Champaign

Session II—Current Challenges and Barriers to the Use of Artificial Intelligence for Mathematical Reasoning

2:20 p.m.  Development of Datasets Specific to the Mathematical Sciences
    Sean Welleck, University of Washington
    Moderated by Kavitha Srinivas, IBM Research

2:55 p.m.  Wrap Up
    Petros Koumoutsakos, Harvard University

3:00 p.m.  Adjourn Day 1

JUNE 13, 2023 (all times EDT)

10:00 a.m.  Welcome
    Petros Koumoutsakos, Harvard University

10:10 a.m.  Building an Interdisciplinary Community
    Jeremy Avigad, Carnegie Mellon University
    Alhussein Fawzi, Google DeepMind
    Moderated by Heather Macbeth, Fordham University

11:10 a.m.  The Role of Intuition and Mathematical Practice
    Ursula Martin, University of Oxford
    Stanislas Dehaene, Collège de France
    Moderated by Petros Koumoutsakos, Harvard University

12:10 p.m.  Break

12:30 p.m.  Concentration of Machine Learning Capabilities and Open-Source Options
    Stella Biderman, Booz Allen Hamilton and EleutherAI
    Moderated by Terry Tao, University of California, Los Angeles


1:00 p.m.  Mathematical Foundations of Machine Learning
    Morgane Austern, Harvard University
    Rebecca Willett, University of Chicago
    Moderated by Terry Tao, University of California, Los Angeles

2:00 p.m.  Challenges and Barriers Panel
    Carlo Angiuli, Carnegie Mellon University
    Jeremy Avigad, Carnegie Mellon University
    Stella Biderman, Booz Allen Hamilton and EleutherAI
    Ursula Martin, University of Oxford
    Moderated by Terry Tao, University of California, Los Angeles

2:55 p.m.  Wrap Up
    Petros Koumoutsakos, Harvard University

3:00 p.m.  Adjourn Day 2

JUNE 14, 2023 (all times EDT)

10:00 a.m.  Welcome
    Petros Koumoutsakos, Harvard University

Session III—Technical Advances Required to Expand This Initiative

10:10 a.m.  Research Advances in Computer Science
    Brigitte Pientka, McGill University
    Aleksandar Nanevski, IMDEA Software Institute
    Avraham Shinnar, IBM Research
    Moderated by Talia Ringer, University of Illinois at Urbana-Champaign

11:40 a.m.  Research Advances in the Mathematical Sciences
    Javier Gómez-Serrano, Brown University
    Alex Kontorovich, Rutgers University
    Adam Topaz, University of Alberta
    Moderated by Brendan Hassett, Brown University

1:10 p.m.  Break


Session IV—Roles for Stakeholders in Advancing Artificial Intelligence for Mathematical Reasoning

1:30 p.m. Perspectives of Mathematical Organizations
Gunnar Carlsson, American Mathematical Society
Brendan Hassett, Institute for Computational and Experimental Research in Mathematics
Dima Shlyakhtenko, Institute for Pure and Applied Mathematics
Suzanne Weekes, Society for Industrial and Applied Mathematics
Moderated by Heather Macbeth, Fordham University

2:30 p.m. Concluding Remarks
Jordan Ellenberg, University of Wisconsin–Madison

2:50 p.m. Wrap Up
Petros Koumoutsakos, Harvard University

3:00 p.m. Adjourn Workshop




Appendix B: Biographical Information for Planning Committee Members

PETROS KOUMOUTSAKOS, Chair, is the Herbert S. Winokur Professor of Engineering and Applied Sciences and Area Chair of Applied Mathematics at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS). He studied naval architecture (diploma from the National Technical University of Athens, MEng from the University of Michigan) and aeronautics and applied mathematics (PhD from the California Institute of Technology [Caltech]). Koumoutsakos conducted postdoctoral studies at the Center for Parallel Computing at Caltech and at the Center for Turbulence Research at Stanford University and NASA Ames Research Center. He served as the chair of computational science at ETH Zurich (1997–2020) and has held visiting fellow positions at Caltech, the University of Tokyo, the Massachusetts Institute of Technology (MIT), and the Radcliffe Institute for Advanced Study at Harvard University, and he is a Distinguished Affiliated Professor at the Technical University of Munich. Koumoutsakos is an elected fellow of the American Society of Mechanical Engineers (ASME), the American Physical Society (APS), the Society for Industrial and Applied Mathematics (SIAM), and the Collegium Helveticum. He is a recipient of the Advanced Investigator Award of the European Research Council and the Association for Computing Machinery (ACM) Gordon Bell Prize in supercomputing. He is an elected member of the National Academy of Engineering (NAE).

JORDAN ELLENBERG is the John D. MacArthur Professor of Mathematics at the University of Wisconsin–Madison. His research centers on

Copyright National Academy of Sciences. All rights reserved.

Artificial Intelligence to Assist Mathematical Reasoning: Proceedings of a Workshop

76

ARTIFICIAL INTELLIGENCE TO ASSIST MATHEMATICAL REASONING

number theory and arithmetic geometry. He is a Discovery Fellow at the Wisconsin Institute for Discovery, where he is part of the Machine Learning group and the Institute for Foundations of Data Science. Ellenberg is a member of the Science Board of IPAM. He has been writing about math for a general audience for over 15 years; his work has appeared in The New York Times, The Wall Street Journal, The Washington Post, Wired, The Believer, and The Boston Globe, and he is the author of the "Do the Math" column in Slate. Ellenberg's Wired feature story on compressed sensing appeared in the Best Writing on Mathematics 2011 anthology. His novel The Grasshopper King was a finalist for the 2004 New York Public Library Young Lions Fiction Award. Ellenberg's 2014 book, How Not to Be Wrong, was a New York Times and Sunday Times (London) bestseller.

MELVIN GREER is chief data scientist for the Americas at Intel Corporation. He is a technical expert in the application of advanced mathematics, artificial intelligence, machine learning (ML), blockchain, zero-trust models, and neuromorphic computing. His systems and software engineering experience has resulted in patented inventions in cloud computing, synthetic biology, and Internet of Things (IoT) biosensors for edge analytics. He has authored 5 books and published over 400 research papers, and his research has been cited over 4,000 times worldwide. He is a member of the American Association for the Advancement of Science (AAAS) and has served for 8 years on the National Academies of Sciences, Engineering, and Medicine's Government-University-Industry Research Roundtable. Greer has been appointed senior advisor and fellow at the FBI Information Technology and Data Division, where he accelerates the FBI mission via data analytics and advanced data science techniques. He has been recognized by LinkedIn as a Top 10 Voice in Data Science and Analytics and is the recipient of the BDPA Lifetime Achievement Award.
Greer also received the WashingtonExec inaugural Pinnacle Award as Artificial Intelligence Executive of the Year. He has been awarded the Black Engineer of the Year Awards (BEYA) Technologist of the Year Award and has been inducted into the 2023 BEYA Hall of Fame, which recognizes his outstanding technical contributions. Greer is also adjunct faculty for the Advanced Academic Programs at Johns Hopkins University, where he teaches the master of science course "Practical Applications of Artificial Intelligence." He has also been appointed senior advisor at the University of California, Berkeley, Goldman School of Public Policy, where he develops and accelerates the adoption of public policy for emerging and advanced technologies.

BRENDAN HASSETT joined the Brown University faculty in 2015 as a professor of mathematics. He assumed the directorship of the Institute for


Computational and Experimental Research in Mathematics in July 2016. His research focus is algebraic geometry—the study of geometric objects defined as solutions to polynomial equations. He has written 70 research papers and has authored or co-edited 8 books. Hassett was the chair of the mathematics department at Rice University from 2009 to 2014. His work has been recognized with a Sloan Research Fellowship, a National Science Foundation (NSF) CAREER award, and the Charles W. Duncan Award for Outstanding Faculty at Rice University. He is a fellow of the American Mathematical Society and the AAAS. Hassett received his PhD from Harvard University in 1996 and then spent 4 years at the University of Chicago as a Dickson Instructor and NSF postdoctoral fellow.

YANN LeCUN is the vice president and chief artificial intelligence (AI) scientist at Meta and a Silver Professor at New York University (NYU), affiliated with the Courant Institute of Mathematical Sciences and the Center for Data Science. He was the founding director of FAIR and of the NYU Center for Data Science. LeCun received an engineering diploma from ESIEE (Paris) and a PhD from Sorbonne Université. After a postdoctoral fellowship in Toronto, he joined AT&T Bell Labs in 1988 and AT&T Labs in 1996 as head of image processing research. LeCun joined NYU as a professor in 2003 and Meta/Facebook in 2013. His interests include AI, ML, computer perception, robotics, and computational neuroscience. He is the recipient of the 2018 ACM Turing Award (with Geoffrey Hinton and Yoshua Bengio) for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing," and he is a member of the National Academy of Sciences (NAS), the NAE, and the French Académie des Sciences.

HEATHER MACBETH is an assistant professor of mathematics at Fordham University. She was previously a postdoctoral fellow at MIT and at the ENS (Paris).
Macbeth is a pure mathematician specializing in the partial differential equations arising in differential geometry. Since 2020 she has also worked on the formalization of these subjects in interactive proof assistants. She holds a PhD in mathematics from Princeton University and a BSc (Hons) in mathematics from the University of Auckland.

TALIA RINGER is an assistant professor at the University of Illinois at Urbana-Champaign. Their main interest is in making program verification using interactive theorem provers more accessible through better proof engineering tools and practices, especially when it comes to maintaining proofs as programs change over time. Ringer's vision is a future of verification that is accessible to all programmers, not just to experts. They received their PhD from the University of Washington in June 2021,


where they were an NSF Graduate Research Fellowship Program fellow. Prior to graduate school, they earned their bachelor's in mathematics and computer science from the University of Maryland, then worked at Amazon as a software engineer for 3 years. Ringer recently visited Google AI, working on ML for formal proofs. They are the founder and previous chair of the SIGPLAN-M international long-term mentoring program, the founder and president of the Computing Connections Fellowship, and a contributor to the Coq interactive theorem prover.

KAVITHA SRINIVAS is a senior research scientist at IBM Research and previously served as chief technical officer and co-founder of RivetLabs; she has over 20 years of experience in research (2001–2023). She has worked extensively in the areas of the Semantic Web, ontology reasoning, graph databases, and, more recently, code generation for automated ML. Srinivas has received several outstanding technical achievement awards at IBM Research; has published in major conferences such as ACM's SIGMOD (Special Interest Group on Management of Data), the Association for the Advancement of Artificial Intelligence, the International Joint Conference on Artificial Intelligence, the International Conference on Very Large Data Bases, and the International Semantic Web Conference; and has served on organizational and program committees of conferences such as the International Conference on Extending Database Technology, the International Joint Conference on Artificial Intelligence, and the World Wide Web Conference. She was trained as a cognitive psychologist and was an associate professor at Boston College before switching fields.

TERENCE TAO is a professor of mathematics at the University of California, Los Angeles, where he has taught since 1996. Tao has authored and contributed to several books, and his numerous articles have appeared in such publications as the Annals of Mathematics, Acta Mathematica, and the American Journal of Mathematics.
He received the Fields Medal in 2006 and is a member of the NAS. Tao received an MSc (1992) from Flinders University of South Australia and a PhD (1996) from Princeton University. His areas of research include harmonic analysis, number theory, and combinatorics.