ISSN 0002-9920 (print); ISSN 1088-9477 (online)
Notices of the American Mathematical Society
September 2023, Volume 70, Number 8
The cover design is based on imagery from “Aperiodic Tilings, Order, and Randomness,” page 1179.
2023 AMS Education Mini-Conference: September 28, 2023, 9:00 a.m.–5:00 p.m. ET. www.ams.org/coeminiconference
Image credits: subjug (chalkboard) and Hanna Plonsak (sticky notes) / iStock / Getty Images Plus via Getty Images
September 2023 Cover Credit: The image used in the cover design appears in “Aperiodic Tilings, Order, and Randomness,” p. 1179, and is courtesy of Rodrigo Treviño.
FEATURES
Aperiodic Tilings, Order, and Randomness............................... 1179
Rodrigo Treviño
An Analytical and Geometric Perspective on Adversarial Robustness.......... 1193
Nicolás García Trillos and Matt Jacobs
Machine Learning and Invariant Theory.................................. 1205
Ben Blum-Smith and Soledad Villar
Persistence Over Posets................................................ 1214
Woojin Kim and Facundo Mémoli

A Word from... Stephan Ramon Garcia..................... 1176

Early Career: Hispanic Heritage Month..................... 1228
An Interview with Federico Ardila....................... 1228
Anthony Bonato
Advice from Our Advisor: Jesús A. De Loera / Consejo de Nuestro Asesor: Jesús A. De Loera......... 1231
Jamie Haddock and Ruriko Yoshida
An Excerpt from Testimonios: Stories of Latinx and Hispanic Mathematicians................................ 1234
Jessica M. Deshler
An Excerpt from Testimonios: Stories of Latinx and Hispanic Mathematicians................................ 1236
Angel Ramón Pineda Fortín
Master's Students: This One is for You................ 1239
Mariana Smit Vega Garcia
Dear Early Career.................................................. 1240

Memorial Tribute: Emma Previato............................ 1243
Ron Donagi, Malcolm R. Adams, James Glazebrook, Lisa Jeffrey, Shigeki Matsutani, David Mumford, Volodya Roubtsov, and Ben Thompson
Memorial Tribute: Remembrances of Derek William Robinson, June 25, 1935–August 31, 2021................................ 1252
Editors: Michael Barnsley, Bruno Nachtergaele, and Barry Simon
Contributors: Alain Connes, David Evans, Giovanni Gallavotti, Sheldon Glashow, Arthur Jaffe, Palle Jorgensen, Aki Kishimoto, Elliott Lieb, Heide Narnhofer, David Ruelle, Mary Beth Ruskai, Adam Sikora, and A. F. M. ter Elst

Education: Staying the Course: Examining College Students' Paths to Calculus | Executive Summary................................................... 1268
Marcelo Almora Rios and Pamela Burdman

Book Review: Mexican Mathematicians in the World: Trends and Recent Contributions.................1274
Reviewed by Adolfo Arroyo-Rabasa and José Simental

Bookshelf................................................................... 1278
AMS Bookshelf.......................................................... 1279

Communication: Graduate Assistantships in Developing Countries (GRAID)............................... 1281
Jose Maria Balmaceda, C. Herbert Clemens, Ingrid Daubechies, Angel R. Pineda, Galina Rusu, and Michel Waldschmidt

Communication: The Mathematical Congress of the Americas is Coming to the US in 2025............. 1285
Robert Stephen Cantrell and Susan Friedlander

Communication: Challenges for Deaf Students in Mathematics Graduate School............................. 1289
Gerardo R. Chacón and Christopher Hayes

AMS Communication: Urheim Named AMS Congressional Fellow for 2023–2024...................... 1338
Elaine Beebe

News: AMS Updates.................................................. 1342
News: Mathematics People....................................... 1344

Classified Advertising................................................ 1346
New Books Offered by the AMS............................... 1349
Meetings & Conferences of the AMS........................ 1359

FROM THE AMS SECRETARY
Calls for Nominations & Applications....................... 1294
Claytor-Gilmer Fellowship................................... 1294
Joan and Joseph Birman Fellowship for Women Scholars................................................... 1294
Centennial Research Fellowship.......................... 1294
2024 MOS–AMS Fulkerson Prize........................ 1295
Report of the Treasurer (2022)................................. 1296
2023 AMS Election..................................................... 1303
Election Information............................................ 1303
Current Bylaws...................................................... 1305
Proposed Bylaws Amendments............................ 1310
Nominations for President.................................. 1314
Candidate Biographies......................................... 1317
2024 AMS Election..................................................... 1335
Call for Suggestions.............................................. 1335
Nominations by Petition..................................... 1336

INVITATIONS FROM THE AMS
2023 AMS Education Mini-Conference................................ inside front cover
Graduate School Fair................................................. 1178
Position Available: Executive Director..................... 1242
Contribute to AMS Open Math Notes..................... 1251
Organize a 2025 Mathematics Research Community................................................ 1280
Apply for the Stefan Bergman Fellowship............... 1302
Math Variety Show: Seeking Performers.................. 1376
AMS Career Fair at JMM 2024............................ inside back cover
Help your Students Discover the Simplicity, Beauty, and Power of Calculus The Six Pillars of Calculus: Biology & Business Editions
Lorenzo Sadun, University of Texas at Austin, TX. The Six Pillars of Calculus: Biology Edition and The Six Pillars of Calculus: Business Edition are conceptual and practical introductions to differential and integral calculus for use in a one- or two-semester course. By boiling calculus down to six commonsense ideas, both texts invite students to make calculus an integral part of how they view the world.
Pure and Applied Undergraduate Texts, Volume 56; 2023; approximately 355 pages; Softcover; ISBN: 978-1-4704-6995-5; List US$99; AMS members US$79.20; MAA members US$89.10; Order code AMSTEXT/56
Pure and Applied Undergraduate Texts, Volume 60; 2023; approximately 364 pages; Softcover; ISBN: 978-1-4704-6996-2; List US$99; AMS members US$79.20; MAA members US$89.10; Order code AMSTEXT/60
Learn more or request an exam copy at bookstore.ams.org/amstext-56 or bookstore.ams.org/amstext-60 Background image credit: Liudmila Chernetska / iStock / Getty Images Plus via Getty Images
Notices of the American Mathematical Society
EDITOR IN CHIEF
Erica Flapan

ASSOCIATE EDITORS
Daniela De Silva, Boris Hasselblatt (ex officio), Benjamin Jaye, Richard A. Levine, Reza Malek-Madani, William McCallum, Chikako Mese, Antonio Montalbán, Han-Bom Moon, Asamoah Nkwanta, Emily Olson, Emilie Purvine, Scott Sheffield, Krystal Taylor, Laura Turner

ASSISTANT TO THE EDITOR IN CHIEF
Masahiro Yamada

CONSULTANTS
Jesús De Loera, Bryna Kra, Hee Oh, Ken Ono, Kenneth A. Ribet, Francis Su, Bianca Viray

MANAGING EDITOR
Meaghan Healy

CONTRIBUTING WRITER
Elaine Beebe

COMPOSITION, DESIGN, and EDITING
Brian Bartling, John F. Brady, Nora Culik, Craig Dujon, Mary-Milam Granberry, Anna Hattoy, Lori Nero, Dan Normand, John C. Paul, Courtney Rose-Price, Miriam Schaerf, Mike Southern, Peter Sykes

SUBSCRIPTION INFORMATION
Individual subscription prices for Volume 70 (2023) are as follows: nonmember, US$742; member, US$445.20. (The subscription price for members is included in the annual dues.) For information on institutional pricing, please visit https://www.ams.org/publications/journals/subscriberinfo. Subscription renewals are subject to late fees. Add US$6.50 for delivery within the United States; US$24 for surface delivery outside the United States. See www.ams.org/journal-faq for more journal subscription information.

PERMISSIONS
All requests to reprint Notices articles should be sent to: [email protected].

ADVERTISING
Notices publishes situations wanted and classified advertising, and display advertising for publishers and academic or scientific organizations. Advertising requests, materials, and/or questions should be sent to:
[email protected] (classified ads)
[email protected] (display ads)

SUBMISSIONS
The editor-in-chief should be contacted about articles for consideration after potential authors have reviewed the “For Authors” page at www.ams.org/noticesauthors.
The managing editor should be contacted for additions to our news sections and for any questions or corrections. Contact the managing editor at: [email protected].
Letters to the editor should be sent to: [email protected].
To make suggestions for additions to other sections, and for full contact information, see www.ams.org/noticescontact.
Supported by the AMS membership, most of this publication, including the opportunity to post comments, is freely available electronically through the AMS website, the Society’s resource for delivering electronic products and services. Use the URL www.ams.org/notices to access the Notices on the website. The online version of the Notices is the version of record, so it may occasionally differ slightly from the print version.
The print version is a privilege of Membership. Graduate students at member institutions can opt to receive the print magazine by updating their individual member profiles at www.ams.org/member-directory. For questions regarding updating your profile, please call 800-321-4267. For back issues see www.ams.org/backvols. Note: Single issues of the Notices are not available after one calendar year.
The American Mathematical Society is committed to promoting and facilitating equity, diversity and inclusion throughout the mathematical sciences. For its own long-term prosperity as well as that of the public at large, our discipline must connect with and appropriately incorporate all sectors of society. We reaffirm the pledge in the AMS Mission Statement to “advance the status of the profession of mathematics, encouraging and facilitating full participation of all individuals,” and urge all members to conduct their professional activities with this goal in mind. (as adopted by the April 2019 Council)
[Notices of the American Mathematical Society (ISSN 0002-9920) is published monthly except bimonthly in June/July by the American Mathematical Society at 201 Charles Street, Providence, RI 02904-2213 USA, GST No. 12189 2046 RT****. Periodicals postage paid at Providence, RI, and additional mailing offices. POSTMASTER: Send address change notices to Notices of the American Mathematical Society, PO Box 6248, Providence, RI 02904-6248 USA.] Publication here of the Society’s street address and the other bracketed information is a technical requirement of the US Postal Service.
© Copyright 2023 by the American Mathematical Society. All rights reserved. Printed in the United States of America. The paper used in this journal is acid-free and falls within the guidelines established to ensure permanence and durability. Opinions expressed in signed Notices articles are those of the authors and do not necessarily reflect opinions of the editors or policies of the American Mathematical Society.
A WORD FROM... Stephan Ramon Garcia
The opinions expressed here are not necessarily those of the Notices or the AMS.
Photo courtesy of Gizem Karaali.
For many years now, the Notices has celebrated National Hispanic Heritage Month (September 15–October 15) with a special issue dedicated to the contributions and achievements of Hispanic scholars of the mathematical sciences. I have been fortunate enough to pen the “A Word from...” for these issues since 2019, despite having rotated off the Notices editorial board a few years ago. As I look back on the last several Hispanic Heritage issues, I am struck by the changing content of these short introductions. The 2019 “A Word from...” was hopeful and informative, explaining the purpose of the celebratory month. In contrast, I wrote the 2020 note in April during the Covid-19 lockdowns, then later edited it to acknowledge the dramatic protests triggered by the murder of George Floyd. The year 2021 saw us deep into the pandemic, and I addressed the radical changes in our personal and professional lives that emerged. In my introduction to the 2022 issue, written in April of last year, I wrote: “[l]et us hope, perhaps unrealistically, that the virus will finally be tamed and that the wanton bloodshed in Ukraine will come to a peaceful end.” To a large extent, we have emerged from the Covid-19 pandemic, although the world will never be the same. Sadly, the war in Ukraine still rages on with no end in sight.

Stephan Ramon Garcia is the W.M. Keck Distinguished Service Professor and Chair of the Department of Mathematics and Statistics at Pomona College. He served as an associate editor of the Notices from 2019–2021. His email address is [email protected].
For permission to reprint this article, please contact: [email protected].
DOI: https://doi.org/10.1090/noti2764
As I write these words now, stunning new developments in artificial intelligence are changing the world around us. Demagoguery and grandstanding mark the politics of our age, and autocrats and fascists continue to make stunning gains. None of us can foresee the exact state of the world the next Notices Hispanic Heritage issues will inhabit. Here is one thing that is certain: this special issue has a host of contents that should excite, intrigue, and advise mathematicians from all walks of life.

First of all, we have four feature articles which highlight the research of Hispanic authors:
• Rodrigo Treviño introduces us to the wonderful world of aperiodic tilings.
• Nicolás García Trillos and Matt Jacobs discuss adversarial robustness in neural networks.
• Ben Blum-Smith and Soledad Villar explain the role of invariant theory in machine learning.
• Woojin Kim and Facundo Mémoli explain persistence over posets in topological data analysis.

Besides the feature articles, this issue is packed with a variety of compelling pieces. Our Early Career section contains words of wisdom from several esteemed colleagues. First is Anthony Bonato’s wide-ranging interview with Federico Ardila. Next up, Jamie Haddock and Ruriko Yoshida share the advice they received from Jesús de Loera, their advisor. We also get two excerpts (by Jessica M. Deshler and Angel Ramón Pineda Fortín) from the wonderful edited volume Testimonios: Stories of Latinx and Hispanic Mathematicians. Finally, Mariana Smit Vega Garcia gives many useful tips for master’s students in the mathematical sciences.

But wait, there’s more! Marcelo Almora Rios and Pamela Burdman dive into college students’ paths to calculus. Our book review, written by Adolfo Arroyo-Rabasa and José Simental, takes a close look at the landmark volume Mexican Mathematicians in the World: Trends and Recent Contributions, edited by Fernando Galaz García, Cecilia González-Tokman, and Juan Carlos Pardo Millán.
The Communications section contains some remarkable gems that should not be overlooked. First off, Gerardo R. Chacón, Christopher Hayes, and James Nickerson inform us about challenges for deaf students in mathematics graduate school. We also learn about Graduate Assistantships in Developing Countries (GRAID) in an informative piece written by Jose Maria Balmaceda, C. Herbert Clemens, Ingrid Daubechies, Angel R. Pineda, Galina Rusu, and Michel Waldschmidt. Finally, Robert Stephen Cantrell and Susan Friedlander describe the upcoming Mathematical Congress of the Americas, which is coming to the United States in 2025. As you can see, the 2023 Hispanic Heritage Month issue of the Notices is packed with important and exciting material. Read and enjoy!
It’s almost time to celebrate!
Monday, November 27, 2023
Join us on Monday, November 27, 2023 as we celebrate members via “AMS Day,” a day of special offers on AMS publications, membership, and much more! Stay tuned on social media and membership communications for details about this exciting day.
Spread the word about #AMSDay today!
Meet & recruit students at the 2024 Graduate School Fair!

Held on Friday, January 5, at the Moscone North/South Convention Center.
• Connect with over 300 talented student attendees
• Research-oriented special undergraduate programs
• Graduate mathematical science programs from across the United States
Free for JMM registered students. Schools pay a table fee to showcase their programs. Learn more at:
www.ams.org/gradfair
Not attending the JMM? Register for the 2023 Fall Graduate School Fair, an online event scheduled for October 18.
Please check the Joint Mathematics Meetings registration site for updated dates and times.
Aperiodic Tilings, Order, and Randomness
Rodrigo Treviño

Rodrigo Treviño is an assistant professor at the University of Maryland. His email address is [email protected]. For permission to reprint this article, please contact: [email protected]. DOI: https://doi.org/10.1090/noti2759
¹I am using the translation by E. J. Aiton, A. M. Duncan, and J. V. Field published in 1997 by the American Philosophical Society.

In 1619, Kepler published his Harmonices Mundi (Harmonies of the World), a volume of five books discussing mathematics, physics, astrology, and music. In Book II, Kepler sets out to understand the “Congruences of regular figures,” that is, how different types of regular polygons can be placed next to one another and tile parts of the plane. In Proposition XIX, he states:¹

There are six ways in which the plane can be filled around a point by figures of two kinds: in two ways using five angles, in one way using four angles, and in three ways using three angles.

After discussing how the angles of equilateral triangles and squares can come together, he proceeds to consider angles which come up in regular pentagons (emphasis mine, see Figure 1):

if we now come to the pentagon angles we may take two of them, because together they come to more than two right angles; and a decagon angle fits into the space they leave. The decagon is encircled by ten pentagons, but this pattern cannot be continued in its pure form. See the inner part of diagram Z. Here we must consider the star pentagon, since we can fit together three pentagon angles and one point of a star, because the re-entrant angle of the star takes one pentagon angle while, no less, the gap left by fitting together three pentagon angles
takes the point of the star. See the outer part of the same diagram Z. However, this pattern cannot be continued indefinitely, for the domain it builds up is unsociable and when it has added to its size a little it builds fortifications. You may see a different arrangement of these two forms marked with the letters Aa. If you really wish to continue the pattern, certain irregularities must be admitted, two decagons must be combined, two sides being removed from each of them. As the pattern is continued outwards five-cornered forms appear repeatedly . . . as it progresses this five-cornered pattern continually introduces something new. The structure is very elaborate and intricate. See the diagram marked Aa.

And so Kepler recorded how difficult it is to tile the plane with shapes derived from pentagonal angles.

Figure 1. The pentagons which stumped Kepler and their “unsociable” tendencies.

The modern theory of aperiodic tilings in dimensions greater than one started in the second half of the twentieth century. In the early sixties, motivated by decision problems in logic, Wang investigated whether one could tile the plane with a finite collection of squares with decorated sides, where two tiles could be placed in an edge-sharing way if and only if the decorations on the shared sides were the same. He conjectured that if a tiling of the plane could be found using a given finite set of edge-decorated squares, then a periodic tiling could also be found, that is, a tiling invariant under a nontrivial element of the group of Euclidean translations. Any such element is called a period. Wang’s student, Berger, soon disproved the conjecture, giving a set of 20,426 decorated squares—now known as Wang tiles—which tile the plane but only do so without any periods; that is, they tile the plane aperiodically. This was the first explicit finite set of tiles which gave an aperiodic tiling of the plane. Berger’s result marked the beginning of the modern study of aperiodic tilings of the plane, and his work was closely followed by R. Robinson, who soon constructed a set of 52 Wang tiles—obtained from rotations and reflections of a set of seven tiles—which tiled the plane aperiodically, in addition to other sets of tiles which tile the plane aperiodically.²

²And so the quest began to tile the plane aperiodically with the smallest number of tile types. Depending on the operations allowed (e.g., translations, rotations, or reflections) there are different results. The recent papers of Greenfeld and Tao give good historical accounts of these types of questions, which will not be the focus of this article. See for example [GT22, §1] and references therein.

The 1970s was a critical period for the construction of various types of aperiodic tilings of the plane. In the early 1970s Penrose independently came up with the construction of an aperiodic tiling now known as the Penrose tiling, shown partly in Figure 2. His own account in 1974 in the Bulletin of the Institute of Mathematics and its Applications of how he arrived at the pattern goes like this:

I had often doodled by fitting together limited configurations of pentagons and similar shapes but I had never found a good rule for continuing such patterns indefinitely. However, recently I wanted to design something interesting for someone who was in hospital to look at and I realized that there was a certain definite rule whereby one could continue such a pattern to arbitrary size. The pattern
never repeats itself—in the sense that there is no period parallelogram. Nevertheless, the rule that forces this pattern is a purely local one.

Figure 2. A patch of the first aperiodic tiling discovered by Penrose. Compare with Figure 1.

Figure 3. Two influential early texts on tilings flanked by 3D-printed tiles.

And thus Penrose tamed the pentagons which tormented Kepler.³ In fact, comparing Figures 1 and 2 one sees how Penrose managed to coherently fill in the figures which had stumped Kepler in his drawing. Later, Penrose modified his construction so that there are only two types of tiles—“kites” and “darts”—which appear in the tiling in finitely many orientations. This construction graced the cover of Scientific American in 1977 and was described by Martin Gardner in that same issue (see Figure 3). Gardner mentioned in his article that he did not write about the construction sooner because Penrose was waiting to be granted a patent for his construction. The approval of the patent (US patent 4,133,152) led to one of the strangest cases of mathematical litigation: in 1997, Kimberly-Clark introduced a new toilet paper product with an interesting new pattern (see Figure 4). Penrose thought it infringed on his patent, and so he took the company to court, at which point his lawyer made the following statement: “When it comes to the population of Great Britain being invited by a multinational to wipe their bottoms on what appears to be the work of a Knight of the Realm without his permission, then a last stand must be taken.”

Figure 4. Roll of toilet paper allegedly violating the law.

³There is another way to extend Kepler’s configuration Aa in Figure 1 into a tiling of the plane, but this tiling is a periodic tiling. See [GS16, §2.5] for details.
Figure 5. In Penrose’s US patent, he discusses variations which can be achieved through matching rules; for example, the tiles can be decorated as birds (these are sometimes called Penrose chickens).

Figure 6. Outcome of the diffraction experiment from the Shechtman–Blech–Gratias–Cahn 1984 article, with the unexpected 10-fold symmetry.
The study of tilings received an unexpected boost in the 1980s from chemistry. In early 1982, D. Shechtman discovered a metallic solid with unbelievable properties: it diffracted electrons like a crystal, but it did so in a way that its diffraction spectrum had an icosahedral group of symmetries, which was incompatible with a lattice molecular structure. That is, in some sense, it behaved like a metal whose atomic structure should be periodic, but due to the crystallographic restriction theorem this alloy could not have such a molecular structure. Figure 6 shows some of the diffraction patterns from Shechtman’s discovery. What is remarkable is that—at once—it exhibits icosahedral symmetry and there are bright spots in the picture of the diffraction experiment. This means that the internal structure does not have a periodic structure but it experiences long range order (this will be revisited more precisely below). At the time, many scientists did not believe Shechtman’s discovery because the dogma at that point was that all well-ordered metal alloys had to have a periodic structure, and this discovery challenged that dogma. This incredulity made it hard for him to publish his findings; he eventually did so in 1984. Slightly before this discovery, Mackay independently showed that the Penrose tiling diffracted, which led him to hypothesize that materials with nonperiodic but
well-ordered molecular structures were theoretically possible. In 1984, Levine and Steinhardt made the explicit connection that Penrose-like structures were actually models for the material discovered by Shechtman, coining the term quasicrystal to describe such structures. Shechtman would go on to win the 2011 Nobel Prize in chemistry for his discovery. Thus quasicrystals are materials whose atomic structures are not periodic but which share many properties with metals whose atomic structures are periodic (crystals). This augmented the raison d’être of aperiodic tilings in higher dimensions, as they were suddenly found to be physically relevant. This catalyzed the development of the modern theory of aperiodic tilings in any dimension, and my goal here is to give a brief introduction to it. I note that the theory of one-dimensional tilings is older than that of tilings of the plane. The reason is that there is only one type of connected tile in one-dimensional tilings—a line interval—and so some of the geometric complications which arise in higher dimensions are not present for one-dimensional tilings (although there are geometric considerations when studying one-dimensional tilings). In its earliest days, the study of aperiodic tilings in one dimension manifested itself in the study of aperiodic sequences. The earliest such sequence is the so-called Prouhet–Thue–Morse sequence, which showed up implicitly in the work of Prouhet in 1851, was made explicit and studied by Thue in 1906, and was made widely known by
Morse in 1921 when he (re)discovered the sequence and used it in the study of geodesics on surfaces of negative curvature. See [AS99] for a full history of this sequence. Morse would go on to cofound—along with Hedlund—the field of symbolic dynamics in 1938 in their seminal paper of the same name, where they introduce the study of symbolic sequences as dynamical objects and derive properties of the Prouhet–Thue–Morse sequence. There are many excellent references on different aspects of the theory of aperiodic tilings [GS16, Sad08, BG13, KLS15, AA20] and I will reference them when needed. The Tilings Encyclopedia (https://tilings.math.uni-bielefeld.de) is also a great resource to see examples of many constructions.
1. Basic Definitions

A tiling 𝒯 of ℝ𝑑 is a countable union of sets of the form (𝑡, 𝑗)—called the tiles—where 𝑡 ⊂ ℝ𝑑 is the support of the tile and 𝑗 belongs to a countable set 𝐶 of labels or colors, such that
$$\mathbb{R}^d = \bigcup_i t_i,$$
where 𝑡𝑖 is the support of a tile in the tiling and the union is over the set of tiles in the tiling, and where, if for some 𝑚 ≠ 𝑛, 𝑡𝑚 ∩ 𝑡𝑛 ≠ ∅, then 𝑡𝑚 ∩ 𝑡𝑛 ⊂ 𝜕𝑡𝑚 ∪ 𝜕𝑡𝑛. There are all sorts of assumptions that can be made about the tiles, but a standard one is the following: the supports 𝑡 of the tiles are connected compact sets with nonempty interior, and the set of labels 𝐶 is finite. Throughout this article this will be the standing assumption unless otherwise noted, and I will often refer to the supports of the tiles as tiles when it is not ambiguous. The reason labels/colors are needed is that we get extra structure from considering them: for example, all Wang tiles have supports which are squares, but the decorations on their edges make them different as tiles, and the interesting properties of these tilings come from the labels. I will mostly focus on tilings of Euclidean spaces, but note that it is easy to define tilings of other spaces and/or groups in analogous ways.

There is a somewhat dual point of view to tilings. A Delone set Λ ⊂ ℝ𝑑 is a countable subset which is uniformly discrete: there exists an 𝑟 > 0 such that |𝑥 − 𝑦| > 𝑟 for all distinct 𝑥, 𝑦 ∈ Λ, and relatively dense: there exists an 𝑅 > 0 such that 𝐵𝑅(𝑥) ∩ Λ ≠ ∅ for any 𝑥 ∈ ℝ𝑑, where 𝐵𝑅(𝑥) is the open ball of radius 𝑅 around 𝑥. The rough duality between tilings and Delone sets comes from the fact that if Λ is a Delone set then both the Delaunay triangulation of ℝ𝑑 given by this point set as well as the Voronoi tessellation defined by this point set give a tiling of ℝ𝑑. Likewise, if 𝒯 is a tiling then the set consisting of the center of mass of each tile gives a Delone set, assuming the tiles of 𝒯 are nice enough. In what follows, properties of
tilings will be discussed, and the corresponding properties for Delone sets are analogously defined. A patch 𝑃 of a tiling 𝒯 is a compact subset of 𝒯 consisting of a finite union of tiles, usually taken to be connected. The 𝑅-patch of 𝒯 around 𝑥 is the collection of all tiles in 𝒯 whose support intersects 𝐵𝑅 (𝑥). Two patches 𝑃1 and 𝑃2 are translation-equivalent if there is a 𝑣 ∈ ℝ𝑑 such that 𝑃1 = 𝑃2 − 𝑣, where in slight abuse of notation the translation is meant in the support of a patch, but the equality is meant both in the support and the label. A tiling has finite local complexity (FLC) if for every 𝑅 > 0 there is a finite number of translation-equivalent classes of 𝑅-patches, otherwise it has infinite local complexity (ILC). A tiling 𝒯 is repetitive if for every 𝑅 > 0 there is a 𝑇 > 0 such that every 𝑅-patch is found as a subpatch in the 𝑇-patch around every point in ℝ𝑑 . Repetitivity necessitates finite local complexity but the converse is not true.
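To make the two Delone conditions concrete, here is a minimal sketch in Python (my own illustration; the function name, the grid-based density check, and the parameters are all assumptions of the sketch, not anything from the article). It tests uniform discreteness and relative density for a finite sample of a point set inside a box; since a finite sample only approximates an infinite point set, the density check is a heuristic.

```python
import numpy as np

def is_delone_sample(points, r, R, box_min, box_max):
    """Heuristically test the two Delone conditions on a finite sample.

    Uniform discreteness: distinct points are more than r apart.
    Relative density: every point of a grid filling the box lies
    within R of the sample (a stand-in for "every x in R^d").
    """
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    # Uniform discreteness: all pairwise distances between distinct
    # points must exceed r.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    discrete = bool((dists[~np.eye(n, dtype=bool)] > r).all())
    # Relative density: sample the box with a grid of spacing R/2 and
    # ask that every grid point is within R of some sample point.
    axes = [np.arange(lo, hi + R / 2, R / 2) for lo, hi in zip(box_min, box_max)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    nearest = np.linalg.norm(grid[:, None, :] - pts[None, :, :], axis=-1).min(axis=1)
    dense = bool((nearest < R).all())
    return discrete, dense

# Example: the integer lattice Z^2 restricted to a box is Delone.
a, b = np.meshgrid(np.arange(10), np.arange(10))
lattice = np.stack([a.ravel(), b.ravel()], axis=-1)
print(is_delone_sample(lattice, r=0.5, R=1.0, box_min=(0, 0), box_max=(9, 9)))  # (True, True)
```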
2. Constructions

There are three classical ways to construct aperiodic tilings: by matching rules, through substitution and expansion rules, and through the cut-and-project method. The Penrose tiling has the remarkable property that it can be obtained by all three of these methods. I will briefly describe each of these classical methods, but do note that recently there has been significant work on variations of these methods as means of constructing aperiodic tilings, which I will not discuss here.

The method of matching rules is what was used to construct aperiodic tilings with Wang tiles: a finite number of types of tiles are given with rules on how two of them may share an edge. Defining matching rules does not immediately guarantee that a tiling of ℝ𝑑 can be achieved through these matching rules, nor does it determine whether a tiling so obtained will be aperiodic.

The method of substitution and inflation/expansion is defined through a substitution rule: starting with a finite collection of tiles 𝑡1, …, 𝑡𝑚 in ℝ𝑑—the prototiles—and an expanding linear map 𝐴 ∈ 𝐺𝐿(𝑑, ℝ), these tiles admit a substitution rule with expansion 𝐴 if there exist vectors 𝜏𝑖𝑗𝑘 such that for any 𝑖, 𝐴𝑡𝑖 is a patch defined as
$$A t_i = \bigcup_{j=1}^{m} \bigcup_{k=1}^{M_{ij}} \left( t_j - \tau_{ijk} \right), \tag{1}$$
that is, there is a way to partition 𝐴𝑡𝑖 into the union of copies of different types of tiles. The collection of numbers 𝑀𝑖𝑗 is called the substitution matrix,⁴ and the number 𝑀𝑖𝑗 describes exactly how many copies of tile 𝑡𝑗 are found in the expanded tile 𝐴𝑡𝑖. We can then apply the substitution rule to each tile in the resulting patch, which gives an even larger patch (see Figures 7 and 9 for some examples). This procedure can be done indefinitely, and by carefully taking a limit we can cover all of ℝ𝑑 with a tiling obtained from a substitution rule. The resulting tiling will be repetitive if the rule is primitive, which means there is an $N \in \mathbb{N}$ such that the matrix $M^N$ has strictly positive entries. Whether or not this tiling is aperiodic can be determined by looking at whether the substitution rule is recognizable; see [BG13, §5.6]. Finite local complexity of the resulting tiling can be guaranteed in some cases, e.g., if the prototiles are CW complexes and the union in (1) respects the CW structure to a certain extent.

Figure 7. Two iterations of a substitution rule found in [Lan40] with expansion 3. By carefully iterating substitution rules indefinitely, a tiling of ℝ² can be obtained. The reader should verify that 𝑀 is a substitution matrix for this substitution rule.

I do not know when the first substitution rule was discovered, but substitution rules for well-known aperiodic tilings, such as the chair tiling and the half-hex tiling, already appear in print in 1940 in a note of Langford [Lan40], although it was not noticed that indefinite application of these rules would lead to aperiodic tilings. If the rescaling is conformal, then the tiling is self-similar; otherwise it is self-affine.

Before proceeding to other methods of construction, it is worth noting how substitution works in one dimension. Let 𝒜 be a finite set—the alphabet—and 𝒜* the set of all finite words with symbols in 𝒜. A symbolic substitution rule is a function 𝜌 ∶ 𝒜 → 𝒜*. This rule can be iterated: if $\rho^n(a) \in \mathcal{A}^*$, then $\rho^{n+1}(a)$ is obtained by applying 𝜌 to every symbol in $\rho^n(a)$. Assuming this rule is primitive, there is a symbol 𝑎 ∈ 𝒜 and a sequence $n_i \to \infty$ such that $\rho^{n_i}(a)$ converges to an infinite aperiodic word which is invariant under (a power of) the substitution rule.⁵ This symbolic tiling can be made geometric and realized as a rule in ℝ of the type (1) by using the Perron–Frobenius eigenvalues and eigenvectors of the substitution matrix 𝑀.

⁴Note that this matrix depends on how we index the prototiles, and so different indexing leads to different matrices. Some authors use the convention that 𝑀𝑖𝑗 here is the transpose of the substitution matrix.
⁵For example, the Prouhet–Thue–Morse sequence is constructed using 𝒜 = {0, 1} and rule 𝜌(0) = 01 and 𝜌(1) = 10.
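To see the one-dimensional machinery in action, here is a minimal sketch in Python (function names and conventions are mine) that iterates the Prouhet–Thue–Morse rule from footnote 5, builds its substitution matrix with the indexing convention of equation (1), checks primitivity, and reads off the asymptotic symbol frequencies from the Perron–Frobenius eigenvector; note footnote 4’s caveat that conventions differ by a transpose.

```python
import numpy as np

# Symbolic substitution rule rho: A -> A*, here Prouhet-Thue-Morse on {0, 1}.
rho = {"0": "01", "1": "10"}
alphabet = ["0", "1"]

def iterate(rule, seed, n):
    """Apply the substitution rule n times to the seed word."""
    word = seed
    for _ in range(n):
        word = "".join(rule[s] for s in word)
    return word

def substitution_matrix(rule, alphabet):
    """M[i][j] = number of copies of symbol j inside rho(symbol i),
    matching the convention for M_ij in equation (1)."""
    return np.array([[rule[a].count(b) for b in alphabet] for a in alphabet])

M = substitution_matrix(rho, alphabet)
print(iterate(rho, "0", 4))   # 0110100110010110
print(M)                      # [[1 1], [1 1]]

# Primitivity: some power M^N has strictly positive entries (N = 1 here).
print(bool((M > 0).all()))    # True

# Symbol counts of a word evolve by M^T under substitution, so the
# Perron-Frobenius eigenvector of M^T gives asymptotic symbol frequencies.
vals, vecs = np.linalg.eig(M.T)
v = np.abs(vecs[:, np.argmax(vals.real)].real)
print(v / v.sum())            # [0.5 0.5]
```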
The cut-and-project method creates an aperiodic point set as follows. Let 𝐸 ⊂ ℝ𝑑+𝑛 be a 𝑑-dimensional irrational subspace (the physical space), where by irrational I mean 𝐸 ∩ ℤ𝑑+𝑛 = 0; let 𝐹 ⊂ ℝ𝑑+𝑛 be an 𝑛-dimensional subspace (the internal space) transverse to 𝐸 (which we can assume is 𝐸⟂); and let 𝑊 ⊂ 𝐹 be a compact set called the window, which is usually taken to have nonempty interior, but does not have to. Denoting by 𝜋𝐸 ∶ ℝ𝑑+𝑛 → 𝐸 and 𝜋𝐹 ∶ ℝ𝑑+𝑛 → 𝐹 the projections, let 𝒮𝑊 = {𝑧 ∈ ℤ𝑑+𝑛 ∶ 𝜋𝐹(𝑧) ∈ 𝑊}. The cut-and-project set associated to this data is the set Λ𝐸,𝐹,𝑊 ≔ 𝜋𝐸(𝒮𝑊) ⊂ 𝐸, which is a Delone set. It is aperiodic, repetitive, and of finite local complexity since 𝐸 was chosen to be irrational.

The Penrose tiling has the satisfying property that it can be recovered by any of the three constructions above: the first cut-and-project set was described by de Bruijn [dB81], who sought to recover the Penrose tiling through a projection method and managed to recover the vertex set of the Penrose tiling. In the first description of his construction and in his patent application, Penrose himself also managed to give matching rules for his original aperiodic tiling (the version in Figure 5 is possible by the matching rules). Thus the Penrose tiling can be constructed by the three methods introduced above. In 1998, Goodman-Strauss proved that every aperiodic tiling constructed from a substitution rule in dimension greater than one can in fact be obtained through “decorated” matching rules. Not every substitution tiling can be recovered through the cut-and-project method, as there are obstructions given by a unitary representation of ℝ𝑑 defined by a tiling which can get in the way (this will be discussed in §3 below).
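The cut-and-project recipe is short enough to run numerically. Below is a sketch with d = n = 1 (all variable choices mine): the physical space E is the line of golden-ratio slope in ℝ², the window is the projection of the unit square onto F = E⟂ (shifted slightly so no lattice point lands exactly on its boundary), and the selected points of ℤ² project to a Fibonacci chain on E, with two gap lengths in ratio the golden mean. The finite box and the trimming step are artifacts of working with finitely many lattice points.

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2
c = np.sqrt(phi**2 + 1)
e = np.array([phi, 1.0]) / c    # unit vector spanning the physical space E
f = np.array([-1.0, phi]) / c   # unit vector spanning the internal space F

# Window W: projection of the unit square onto F, generically shifted.
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]]) @ f
delta = 1e-3
w_lo, w_hi = corners.min() + delta, corners.max() + delta

# S_W: lattice points of Z^2 whose F-projection lies in the window.
pts = np.array([(a, b) for a in range(-40, 41) for b in range(-40, 41)])
sel = pts[(pts @ f > w_lo) & (pts @ f <= w_hi)]

# The cut-and-project set Lambda_{E,F,W}: E-projections, sorted along E.
cps = np.sort(sel @ e)
cps = cps[np.abs(cps) <= 25]    # trim edge effects from the finite box

# Exactly two gap lengths occur, with ratio phi: a Fibonacci chain.
print(sorted(set(np.round(np.diff(cps), 6))))   # [0.525731, 0.850651]
```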
2.1. Organizing the tilings. What do we do with tilings defined like this? We want good ways in which they can be distinguished. Here are a couple of ways of doing that. First, it is helpful to throw all sorts of related tilings into one space, and call that space a space of tilings. There are a few ways in which tilings can be related. For 𝑣 ∈ ℝ𝑑 and a tiling 𝒯, let 𝜑𝑣(𝒯) ≔ 𝒯 − 𝑣 be the translation of 𝒯 by −𝑣. That is, it is a new tiling where the tiles have supports 𝑡 − 𝑣, where 𝑡 is the support of a tile in 𝒯. If 𝜑𝑣(𝒯) = 𝒯 and 𝑣 ≠ 0, then 𝑣 is a period of 𝒯. A tiling is aperiodic whenever 𝜑𝑣(𝒯) = 𝒯 implies 𝑣 = 0. If 𝒯 is aperiodic then {𝜑𝑣(𝒯)}𝑣∈ℝ𝑑 is a 𝑑-parameter family of tilings which are all distinct but obviously related, as they all are translates of 𝒯. Without further structure, this set is essentially ℝ𝑑, which says nothing about 𝒯. So we want to be able to compare 𝒯 and 𝜑𝑣(𝒯) in a more meaningful way.

One way of doing this is to define a new notion of distance between two different translates of 𝒯. A common one is roughly described like this: two translates 𝜑𝑣1(𝒯) and 𝜑𝑣2(𝒯) are no more than 𝜀 apart if, when looking at their 𝜀⁻¹-patches around the origin, they are the same up to an error of size 𝜀. What “an error of size 𝜀” means can be subtle, but if 𝒯 has finite local complexity then it means a translation of size at most 𝜀. More generally, the spirit of this type of metric is that the 𝜀⁻¹-patches around the origin of both tilings should be 𝜀-close in a Hausdorff-ish metric, called the Chabauty metric. With a choice of metric on the set of all translates of a given tiling, we can consider the metric completion of the set of all translates of 𝒯,
$$\Omega_{\mathcal{T}} := \overline{\{\varphi_v(\mathcal{T})\}_{v \in \mathbb{R}^d}},$$
and call this the tiling space of 𝒯. With the metric thus defined, it is a compact metric space, which is nice, since tilings themselves are not compact and it is often easier to deal with compact spaces than with noncompact spaces. Note that if a tiling is repetitive, then we cannot distinguish two tilings in its tiling space by purely local considerations, since repetitivity forces the two tilings to look the same on arbitrarily large patches. Thus, tiling spaces for repetitive tilings are sometimes called local isomorphism classes.

To get a good idea of what these spaces are like, let us assume for the moment that 𝒯 is aperiodic, repetitive, and of finite local complexity. Then its tiling space Ω𝒯 has a local product structure of the form 𝐵𝜀 × 𝒞, where 𝐵𝜀 is a Euclidean ball and 𝒞 is a Cantor set. The 𝐵𝜀 factor comes from the fact that by translating a tiling 𝒯 by a very small vector 𝜏, 𝜑𝜏(𝒯) is 𝜀-close to 𝒯 by the definition of the metric above. Now for simplicity, suppose that the diameter of any tile in any tiling of Ω𝒯 is less than 1. Take a tiling from Ω𝒯 and assume that the origin is contained in the interior of (the support of) a tile 𝑡. Consider 𝑡 as a patch and consider all tilings in Ω𝒯 which have the same patch, both in terms of the same support and same label. In how many ways does this patch appear as a subpatch of an 𝑅-patch around the origin? By the assumptions of aperiodicity, repetitivity, and FLC, for some 𝑅₀ > 0, this patch appears as a subpatch of an 𝑅₀-patch in at least two ways. Applying the same reasoning, there is a sequence of numbers 𝑅𝑘 → ∞ such that any 𝑅𝑘-patch around the origin appears in at least two different ways as a subpatch of an 𝑅𝑘₊₁-patch. This sequence of choices between finitely many but at least two options should remind you of a Cantor set, and it shows that the local structure has a component which is a Cantor set.
This local structure extends to give Ω𝒯 the structure of a foliated object (think of puff pastry or phyllo dough for tiling spaces of two-dimensional tilings). Indeed, note that 𝜑𝜏(𝒯) ∈ Ω𝒯 for all 𝜏 ∈ ℝ𝑑, and so tiling spaces come with an action of ℝ𝑑 by translation. If 𝒯 is repetitive, then the orbit of every tiling is dense in the tiling space, and so Ω𝒯 can be seen as a compact metric space foliated by copies of ℝ𝑑 which are the orbits of tilings in the space. Tiling spaces of this type can also be seen to be inverse limits, which is especially convenient in the case of tilings constructed through a substitution rule. If 𝒯 is built from a substitution rule on CW complexes, then, as proved in [AP98], we have the homeomorphism
$$\Omega_{\mathcal{T}} \cong \varprojlim\,(\Gamma, \gamma) = \{(z_0, z_1, \dots) \in \Gamma^{\infty} : z_i = \gamma(z_{i+1}) \text{ for all } i\} \tag{2}$$
where Γ is a CW complex and 𝛾 is a cellular map. This description of a tiling space will be useful below.

As noted by Penrose when he first wrote about his construction, the substitution and inflation/expansion method introduces a hierarchy of scales in the tiling. That is, once a tiling is constructed through a substitution rule, by the nature of the substitution and inflation/expansion rule, there are patches of tiles whose support is a scaled copy of one of the tiles. More specifically, consider the expanding linear map 𝐴 ∈ 𝐺𝐿(𝑑, ℝ) required in the definition of a substitution rule. If 𝑡 ∈ 𝒯 is a tile, then we call it a level-0 supertile, and, for every 𝑘 ∈ ℕ, $A^k t$ is a level-𝑘 supertile. The self-similarity of 𝒯 with expansion 𝐴 forces the structure of level-0 supertiles to be determined by the structure of level-1 supertiles, which are in turn determined by the structure of level-2 supertiles, etc. Thus there is a hierarchical structure of supertiles, each one determining the one preceding it. This hierarchical structure is baked into the Anderson–Putnam [AP98] point of view of a tiling space of a self-similar tiling being an inverse limit. However, a substitution rule is not necessary to have a hierarchical structure. Frank and Sadun have developed a theory of fusion, which aims to systematize tilings with hierarchical structure (see N. P. Frank’s notes in [AA20] and references within).

2.2. Distinguishing different tilings. In what ways can tilings be different? Certainly if Ω𝒯 is not homeomorphic to Ω𝒯′ then we would not expect 𝒯 and 𝒯′ to be the same, perhaps even locally. Moreover, two tilings in the same tiling space can be compared by considering how close they are with respect to the distance on the tiling space. So we can talk about both local and global comparisons.

The tiling 𝒯1 is locally derivable from 𝒯2 (denoted 𝒯2 ⇝ 𝒯1) if there is an 𝑅 > 0 and a rule from the set of all 𝑅-patches of 𝒯2 to the set of tile types of 𝒯1 such that the 𝑅-patch around 𝑥 in 𝒯2 determines the tile(s) containing 𝑥 in 𝒯1. If 𝒯2 ⇝ 𝒯1 and 𝒯1 ⇝ 𝒯2, then they are mutually
locally derivable (MLD). This means that a finite amount of information around 𝑥 in 𝒯𝑖 determines the local information around 𝑥 in 𝒯3−𝑖. This implies that the tiling spaces of the two tilings are homeomorphic, but the converse is not true. In addition, if two tilings 𝒯1 and 𝒯2 are MLD equivalent, then not only do we get a homeomorphism Ψ ∶ Ω𝒯1 → Ω𝒯2 of tiling spaces but in fact a topological conjugacy of their respective ℝ𝑑 actions: 𝜑𝑡 ∘ Ψ = Ψ ∘ 𝜑𝑡 for all 𝑡 ∈ ℝ𝑑, where 𝜑𝑡 is the translation action on the appropriate tiling space. The bird tiling from Penrose’s patent (Figure 5) is not the same as the original Penrose tiling on the cover of Scientific American (Figure 3), but there are tilings in the respective tiling spaces which are MLD equivalent because the local information from one determines the local information from the other (see [BG13, Theorem 6.1] for a fuller description of the Penrose tiling’s MLD class).

The concept of local derivability can capture the essence of a self-similar tiling, or one constructed from a substitution rule. If 𝜙 ∶ ℝ𝑑 → ℝ𝑑 is an expanding linear map, then a repetitive FLC tiling 𝒯 is pseudo self-affine with expansion 𝜙 if 𝜙𝒯 ⇝ 𝒯. This captures the essence of a substitution rule: if the tiles of 𝒯 are expanded by 𝜙, then a substitution rule determines how those expanded tiles can be subdivided to recover patches of the tiling 𝒯.

Let me turn now to global comparisons. The first thing one may think about when trying to show that two tiling spaces are not homeomorphic is to compute some sort of topological invariant for both spaces and compare them. This is where the point of view of tiling spaces as inverse limits, especially in the substitution rule case, is very convenient. The Čech cohomology of the substitution tiling space seen as the inverse limit (2) is the direct/inductive limit
$$\check{H}^*(\Omega_{\mathcal{T}}; \mathbb{Z}) = \varinjlim\, \left( \check{H}^*(\Gamma; \mathbb{Z}), \gamma^* \right). \tag{3}$$
In particular, it is computable—sometimes even by a human. There are also versions of de Rham cohomology for tiling spaces, and under mild assumptions they are isomorphic to the Čech cohomology. The last 20 years have seen the rise of different approaches to compute topological invariants of tiling spaces. Another topological invariant which has been studied is the 𝐾-theory of tiling spaces which, at a somewhat superficial level, is related to the (rational) cohomology through the Chern character. However, crucial information can be lost in passing from integer cohomology to rational cohomology. In any case, interest in 𝐾-theories emanating from tiling spaces was first motivated by Bellissard’s program to label gaps in the spectrum of certain self-adjoint operators—so-called random Schrödinger operators—defined from aperiodic tilings. The labeling is understood for aperiodic tilings in dimensions at most 3, but not clear for higher-dimensional tilings.
You’re probably curious what the Čech cohomology of the tiling space Ω for the Penrose tilings is. This has been computed by different methods and by different people, leading to
$$\check{H}^0(\Omega; \mathbb{Z}) = \mathbb{Z}, \qquad \check{H}^1(\Omega; \mathbb{Z}) = \mathbb{Z}^5, \qquad \check{H}^2(\Omega; \mathbb{Z}) = \mathbb{Z}^8.$$
The calculation can be found in Sadun’s book [Sad08, §4.7] on the cohomology of tiling spaces. Thus it is possible to distinguish two tilings by distinguishing their tiling spaces using topological invariants. It is worth mentioning that there are examples of aperiodic tilings whose cohomology groups contain torsion. This was noted later than it should have been, as substitution rules which lead to aperiodic tilings with torsion in their cohomology have appeared in the literature since 1940.⁶

Tiling cohomology is useful in distinguishing two tilings through their tiling spaces, but it can also help you create a different tiling from an initial tiling 𝒯. Clark and Sadun [CS06] started a theory of deformations of tiling spaces, and it roughly starts like this: suppose we want to deform a polygonal, repetitive, FLC tiling 𝒯 of ℝ² by deforming the edges of all tiles in 𝒯 which are translation-equivalent in the same way and at once. This is the same as deforming the shape of the prototile by perturbing its boundary. Thus the shape of its boundary—as a collection of vectors defined by the edges—can be seen as the image under some map of the cycle represented by the boundary of the tile, i.e., a cocycle. That these vectors/edges have to close up when added together is equivalent to the cocycle being closed, i.e., the shape is defined by a representative of a class in 𝐻̌¹(Ω𝒯; ℝ𝑑). Thus the shape of tilings in Ω𝒯 is defined by a class 𝐋 ∈ 𝐻̌¹(Ω𝒯; ℝ𝑑), and any class 𝐋′ close enough to 𝐋 defines a slight deformation of 𝒯 resulting in a different tiling 𝒯𝐋′. These classes are called shape vectors. Building on these initial results, Julien and Sadun [JS18] proved there is an open set ℳ ⊂ 𝐻̌¹(Ω; ℝ𝑑) which parametrizes nondegenerate deformations; do consult there for details on deformations.

The tiling obtained by the substitution rule in Figure 7 has 𝐻̌¹(Ω; ℤ) = ℤ², and thus the space of deformation parameters is four-dimensional (𝐻̌¹(Ω; ℝ²) = ℝ⁴). Note that any tiling can be deformed using linear transformations defined by elements of 𝐺𝐿(𝑑, ℝ). Since 𝐺𝐿(2, ℝ) is four-dimensional, the only deformations admitted by this tiling are the “trivial” deformations coming from 𝐺𝐿(2, ℝ). In contrast, since the first cohomology space of the Penrose tiling is five-dimensional (see above), the full deformation space 𝐻̌¹(Ω; ℝ²) ≅ ℝ¹⁰ is ten-dimensional; subtracting the four trivial directions from 𝐺𝐿(2, ℝ) leaves a six-dimensional space of nontrivial deformations. See [CS06] for examples of these nontrivial deformations.

⁶The cohomology of the tiling space obtained from the substitution rule in Figure 7 has torsion. This was verified using code written by Jianlong Liu.
3. Dynamics and Order

Besides topological invariants, another powerful tool in the study of tiling spaces comes from ergodic theory. If 𝒯 is an aperiodic repetitive FLC tiling, then every orbit of the ℝ𝑑 action on Ω𝒯 is dense in Ω𝒯. It is typical to assume that this action is also uniquely ergodic, that is, that there exists a unique ℝ𝑑-invariant probability measure 𝜇 on Ω𝒯 such that for any continuous function 𝑓 on Ω𝒯 and any 𝒯′ ∈ Ω𝒯:
$$\frac{1}{\operatorname{Vol}(B_R(0))} \int_{B_R(0)} f \circ \varphi_t(\mathcal{T}') \, dt \;\longrightarrow\; \int_{\Omega_{\mathcal{T}}} f \, d\mu \tag{4}$$
as 𝑅 → ∞. This is Birkhoff’s ergodic theorem in action. This type of convergence is a feature of all self-similar aperiodic tilings, and even of a larger class of tilings called linearly repetitive (see the survey on linearly repetitive sets in [KLS15]). In fact, it takes quite some effort to construct repetitive aperiodic FLC tilings which are not uniquely ergodic (see e.g. [CN16]).

It turns out that we can squeeze a lot out of the convergence of orbit averages above. For example, the asymptotic frequency of any patch 𝑃 ⊂ 𝒯 is given by the convergence above by choosing the appropriate continuous function 𝑓𝑃 on Ω𝒯. In addition, since the ℝ𝑑 action on the tiling space defines a unitary representation of ℝ𝑑 on 𝐿²𝜇(Ω𝒯) through the family of Koopman operators 𝑈𝑡 ∶ 𝑓 ↦ 𝑓 ∘ 𝜑𝑡 for 𝑡 ∈ ℝ𝑑, it was noticed by Dworkin and Hof in the early 1990s that diffraction, which led Shechtman to his discovery of quasicrystals, can be recovered through an application of Birkhoff’s theorem by averaging the right “correlation functions” on the tiling space. That is, the bright spots in Shechtman’s diffraction diagram (Figure 6) correspond to the pure point part of the spectral measure 𝜈𝜌 of a correlation function 𝜌 ∈ 𝐿²𝜇(Ω𝒯).

A function 𝑓 ∈ 𝐿²𝜇(Ω𝒯) is an eigenfunction of 𝑈𝑡 with eigenvalue 𝜆 ∈ ℝ𝑑 if $U_t f = e^{2\pi i \langle t, \lambda \rangle} f$ for all 𝑡 ∈ ℝ𝑑 (constant functions are always eigenfunctions with eigenvalue 0 and so they are considered trivial eigenfunctions). By Bochner’s theorem, it is straightforward to verify that the spectral measure 𝜈𝑓 for an eigenfunction 𝑓 with eigenvalue 𝜆 is a Dirac mass at 𝜆 of magnitude ‖𝑓‖², which is why the bright spots on a diffraction picture—called the Bragg peaks—are located at eigenvalues. The term “long range order” was initially used to describe systems with spectral measures which have nontrivial pure point components, meaning that they exhibit strong correlations at arbitrarily large scales.⁷ These types of tilings are called mathematical quasicrystals.

⁷This term is used more broadly now to describe different features of a tiling, but here I will exclusively use it in its original form.

The long range order coming from eigenfunctions can be seen as follows. Suppose $U_t f = e^{2\pi i \langle t, \lambda \rangle} f$ for some 𝜆 ≠ 0 and all 𝑡 ∈ ℝ𝑑. Then 𝑓 defines a countable collection of codimension-1 affine subspaces 𝒟𝜆 ⊂ ℝ𝑑 such that 𝑈𝑡𝑓 = 𝑓 for all 𝑡 ∈ 𝒟𝜆, by the condition ⟨𝑡, 𝜆⟩ ∈ ℤ for 𝑡 ∈ 𝒟𝜆. Thus 𝑓 is strongly correlated with itself along 𝒟𝜆: ⟨𝑈𝑡𝑓, 𝑓⟩ = ‖𝑓‖² for all 𝑡 ∈ 𝒟𝜆. Since 𝒯 is aperiodic, 𝑈𝑡 ≠ Id for all 𝑡 ≠ 0. However, 𝑈𝑡𝑓 = 𝑓 for all 𝑡 ∈ 𝒟𝜆 suggests a strong form of almost periodicity as measured by 𝑓.

The strongest form of long range order occurs when every spectral measure in this system is purely discrete, which means 𝐿²𝜇(Ω𝒯) is spanned by eigenfunctions of 𝑈𝑡. These systems are said to have pure point or pure discrete spectrum and have a strong algebraic flavor: by the Halmos–von Neumann theorem, systems whose spectral measures are all pure point measures are—from a measurable point of view—rotations on locally compact abelian groups. If a tiling does not define a system of pure point type, then it can have spectral measures of mixed types, or it can have the property that all spectral measures have trivial pure point component. Tilings of the last type are called weak mixing.

Although long range order is missing in weak mixing tilings, they can still be very well structured. In particular, they can be repetitive and of finite local complexity, constructed from substitution rules, thus giving them a nice hierarchical structure. The Godrèche–Lançon–Billard tiling in Figure 8 is one such example. Due to [Sol97], it is easy to determine when a substitution tiling is weak mixing, but determining when a substitution system has pure discrete spectrum is not so straightforward. Recall that a Pisot–Vijayaraghavan (PV) number is a real algebraic integer greater than 1 whose Galois conjugates are all less than 1 in absolute value. For a self-similar substitution system to have some long range order, it is necessary for the expansion factor to be a PV number. For example, the system defined by the Penrose tiling has pure discrete spectrum (though this can be deduced by other means) and its expansion factor is the PV number associated to the polynomial 𝑥² − 𝑥 − 1. On the other hand, the Godrèche–Lançon–Billard tiling (Figure 8), which is constructed using a substitution rule on Penrose rhombs, has a non-PV inflation factor (√(𝜏 + 2), where 𝜏 is the golden mean), making the tiling weak mixing.

Figure 8. A patch from the Godrèche–Lançon–Billard tiling from the Tilings Encyclopedia.
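The PV condition is easy to test numerically. The sketch below (the helper is mine) checks the two expansion factors just mentioned: the golden mean, a root of 𝑥² − 𝑥 − 1, and √(𝜏 + 2), whose minimal polynomial works out to 𝑥⁴ − 5𝑥² + 5; the latter has a Galois conjugate of absolute value greater than 1, so it is not PV, consistent with the weak mixing of the Godrèche–Lançon–Billard tiling.

```python
import numpy as np

def is_pv(coeffs):
    """Check the PV property for the largest root of a monic integer
    polynomial: real, greater than 1, with all other roots of absolute
    value less than 1. Assumes coeffs encodes a minimal polynomial, so
    that the other roots are exactly the Galois conjugates."""
    roots = np.roots(coeffs)
    k = np.argmax(np.abs(roots))
    lead, others = roots[k], np.delete(roots, k)
    return bool(abs(lead.imag) < 1e-12 and lead.real > 1
                and (np.abs(others) < 1).all())

print(is_pv([1, -1, -1]))       # True:  x^2 - x - 1, the golden mean
print(is_pv([1, 0, -5, 0, 5]))  # False: x^4 - 5x^2 + 5, sqrt(tau + 2)
```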
How sufficient the PV condition is for pure point spectrum, however, is still an open question. The set of conjectures collectively known as the Pisot substitution conjectures assert that other forms of algebraic conditions (in addition to the PV condition) imposed on a substitution rule imply that the system is fundamentally algebraic, i.e., that all the spectral measures are pure point measures. There are several partial results towards proving this conjecture; see the survey on these conjectures in [KLS15].

3.1. Renormalization. There is a rich interplay between ergodic theory and cohomology given through renormalization. Renormalization is the process of changing coordinates to view a familiar object from a different point of view and gaining information in the process. This is a powerful tool to upgrade qualitative statements, such as the convergence in (4), to quantitative statements, such as the speed of convergence in (4).

Suppose 𝒯 is an FLC aperiodic tiling constructed using a self-similar substitution rule. The substitution rule induces a self-homeomorphism Φ ∶ Ω𝒯 → Ω𝒯 with a nontrivial action on cohomology Φ* ∶ 𝐻̌*(Ω𝒯; ℤ) → 𝐻̌*(Ω𝒯; ℤ). It is Φ that affords a change of variables which allows us to use renormalization techniques. To be more precise, let 𝜆₁ > |𝜆₂| > ⋯ > |𝜆𝑞| be the ordered spectrum of the induced action of Φ* on 𝐻̌𝑑(Ω𝒯; ℝ). Then it can be shown that for 𝑓 ∶ Ω𝒯 → ℝ regular enough,⁸
$$\int_{B_R} f \circ \phi_t \, dt \approx \sum_{i=1}^{q} R^{\,d \frac{\log |\lambda_i|}{\log \lambda_1}}\, D_i(f)\, g_i(R) + O(R^{d-1}),$$
where the 𝐷𝑖 are distributions and the 𝑔𝑖 are bounded. Thus the spectrum of the action of Φ* on 𝐻̌𝑑(Ω𝒯; ℝ) determines the speed of convergence of (4) up to an error term given by the boundary of the averaging sets 𝐵𝑅. In particular, the eigenvalues which are relevant in the estimate above are the ones which are strongly expanding. There are several approaches to prove this: first by Sadun using Čech cohomology, then by Bufetov and Solomyak using different tools, and then by myself and Schmieding using the version of de Rham cohomology for tiling spaces.

⁸This can be made precise, but is not very important here.

The action of Φ* on 𝐻̌¹(Ω𝒯; ℝ) is also important in the discussion of order, for we may ask which tiling deformations, if any, preserve long range order. First note that if 𝑓 ∈ 𝐿²𝜇(Ω𝒯) is an eigenfunction with eigenvalue 𝜆 and 𝒯′ is the tiling obtained by deforming 𝒯 using some 𝐵 ∈ 𝐺𝐿(𝑑, ℝ), then 𝐵ᵀ𝜆 is an eigenvalue for the Koopman operator of the deformed tiling space. Thus the trivial deformations preserve the spectral types of the tiling, so one now wonders what happens with nontrivial deformations. Clark and Sadun have a nice answer to this question [CS06]. Let 𝐸 ⊂ 𝐻̌¹(Ω𝒯; ℝ) be the span of the generalized
log(𝜇(𝐵𝑟 (𝑥))) . log(𝑟)
Since any pure point measure 𝜇 has 𝑑𝜇− (𝑥) = 0 for all 𝑥, weak mixing can be made quantitative by proving lower bounds for the lower local dimension of spectral measures, that is, quantifying how far a measure is from have a pure point component. This was the idea used by Bufetov and Solomyak when they started off the study of quantitative weak mixing. The general approach to study quantitative properties of spectral measures for self-similar tilings relies on the following spectral refinement to a substitution matrix. Given a tiling, let Λ𝑖𝑗 ⊂ ℝ𝑑 be the finite subset of vectors which describes the relative positions of tiles of type 𝑗 inside a level-1 supertile of type 𝑖 with respect to a “base” tile of type 𝑗 (this relies on a choice of base tile, but it turns out to be irrelevant). Note that a set of relative vectors Λ𝐋𝑖𝑗 is analogously for a deformed tiling 𝒯𝐋 . For a spectral parameter 𝜆 ∈ ℝ𝑑 , define the matrix 𝑆 𝐋 (𝜆)𝑖𝑗 = ∑ 𝑒2𝜋𝑖⟨𝜏,𝜆⟩
(5)
𝜏∈Λ𝐋 𝑖𝑗
which Bufetov and Solomyak call the spectral cocycle and [BGM19] calls the Fourier matrix. Note that 𝑆 𝐋 (0) = 𝑀𝑖𝑗 . (𝑛)
For 𝑛 ∈ ℕ, define the family of matrix products 𝑆 𝐋 (𝜆) ≔ 𝑆 𝐋 (𝜃𝑛−1 𝜆) ⋯ 𝑆 𝐋 (𝜆), where 𝜃 is the expansion factor of the substitution rule. Quantitative and qualitative properties of spectral measures at 𝜆 of the deformed tiling 𝒯𝐋 can be de(𝑛) termined by the growth properties of 𝑆 𝐋 (𝜆) as measured through its Lyapunov exponent. More precisely, the slower (𝑛) (𝑛) 𝑆 𝐋 (𝜆) grows (which is never faster than 𝑆 𝐋 (0) = 𝑀 𝑛 ), then the farther the system is from having a spectral measure with pure point mass at 𝜆. See [BGM19] and the many papers of Bufetov and Solomyak for details.
4. Adding Randomness In [GL89], Godrèche and Luck considered ways in which one could disorganize the perfect Penrose tiling through randomization. A distinction was made of doing so through “global randomness” or “local randomness.” To define these terms, suppose that for a finite set of tiles 𝑡1 , … , 𝑡𝑚 , you have two different substitution rules
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
(Figure 9 provides an example). Recall that self-similar tiling is constructed through the sequential application of the same substitution rule. What if instead of applying the same rule over and over, we flipped a coin to decide which rule to apply each time? Globally random substitution tilings (sometimes also called mixed substitutions or 𝑆-adic systems) impose the condition that once the coin is flipped, then the same substitution rule is applied to all of the tiles in a patch, whereas locally random substitutions have you flip a coin for each tile and apply a substitution rule accordingly. Both types of random tilings have seen a surge of activity in the past decade. Let me mention what are some typical features of these constructions. First, in dimension greater than one, one should have little hope that one gets tilings with finite local complexity, as the different substitution rules need not fit together in a nice enough way to ensure this.9 Secondly, although it may differ at every scale, these types of tilings still have a hierarchical structure for the same reason that substitution tilings have hierarchical structure. The most significant qualitative difference between the globally random and the locally random is their configurational complexities: globally random tilings have a polynomial growth (in 𝑅) of the number of distinct classes of 𝑅-patches, whereas locally random tilings typically have exponential growth. Globally random constructions have a randomly chosen hierarchical structure. This can be combined with another type of “geometric randomness” as follows. If the substitutions are compatible so that the tiling has finite local complexity, the cohomology of the corresponding tiling space is still given by (3). Recall that there is an open set ℳ ⊂ 𝐻 1 (Ω; ℝ𝑑 ) which parametrizes local deformations. Endowed with Lebesgue measure, we can refer to a typical element of ℳ as a random deformation. A theory of locally random substitution tilings has mostly been developed for one-dimensional tilings. As mentioned above, the high amount of complexity introduced by locally random constructions yields tilings with properties not present in classical self-similar constructions, such as mixing properties [MRST22], positive topological entropy as well as interesting measure theoretic entropy (see [GMRS23] and references within). In dimension two, locally random constructions can be obtained in the same way as those of [GL89] by using other triangular substitution rules with the same expansion rates (e.g. ones from [GKM15]), or by using variations on a known construction, such as the ones in Figure 9. In globally random constructions which yield tilings of finite local complexity, long range order or lack thereof— 9 The study [GKM15] is motivated by the desire to want to find compatible substitution rules on the same set of tiles, that is, rules which, when used together randomly still give tilings with finite local complexity.
SEPTEMBER 2023
from the point of view of the type of spectral measures— can sometimes be deduced for random choices of hierarchical structure and random deformations of the tiling (𝑛) space. A corresponding family of products 𝑆 𝐋 (𝜆) can still be defined using matrices of the type (5), although each matrix in the product depends now on a random choice of substitution rule. These analyses fundamentally use renormalization schemes and techniques, and as such the growth properties of the products are determined by their Lyapunov exponents. The exponents give insight into the types of spectral measures the resulting tiling space has. This was done in dimension one by Bufetov and Solomyak and in higher dimensions in a recent paper of mine, where I framed (5) and the subsequent family of products as twisted traces on a locally approximately finitedimensional ∗-algebra. The generalization to the random setting of the result of Clark and Sadun reads as follows: if the renormalization cocycle on 𝐻̌ 1 has more than 𝑑 positive Lyapunov exponents (this depends on how randomization is being done), then the typical deformation parameter gives a weak mixing tiling. Remarkably, the high amount of complexity in locally random constructions is also compatible with order, as there are locally random constructions which yield systems whose spectral measures have pure point components. This was already pointed out in the original analysis of spectral measures in [GL89], which already featured a precursor to the matrices (5) that was used to study randomized Penrose tilings. These facts emphasize the distinct roles that order and randomness play in the study of aperiodic tilings although they are interrelated. To summarize: order comes from the strong correlations at large scales and this is determined by the geometry of the tiles and the geometry of the substitution rules (i.e., the hierarchical structure). Randomness in hierarchy or in shape may or may not affect long range order and this can be measured the the growth properties of the product of matrices determined by (5). There are self-similar tilings (i.e., not random) such as the one in Figure 8 which do not have long range order, and there are random tilings which have long range order. The fun is figuring out what happens when.
5. Conclusion The study of aperiodic tilings is at the crossroads of many branches of mathematics such as combinatorics, logic, ergodic theory, algebraic topology, operator algebras, and mathematical physics, and so there are many points of view and many tools available to study them. There are also many open questions, especially for tilings in higher dimensions. There are several topics which I did not discuss here that I would have liked to discuss. For example, little was
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1189
Figure 9. Variations on the Ammann–Beenker substitution rule: taking 𝐴 = diag(1 + √2, 1 + √2) and letting 𝑡 𝑖 be one of the three tiles on the left, then 𝐴𝑡 𝑖 is decomposed into the union of tiles of finitely types. See [BG13, §6.1] for more details on the Ammann–Beenker tiling.
mentioned here about tilings with infinite local complexity; see N. P. Frank’s survey in [KLS15]. Additionally, problems from mathematical physics are largely absent here. A major area of activity is the spectral theory of random Schrödinger operators on aperiodic tilings. The theory for one-dimensional self-similar tilings is rich (see the survey in [KLS15]), but in dimensions greater than one very little is known. During the refereeing period of this article two remarkable things happened. First, a preprint by Smith, Myers, Kaplan, and Goodman-Strauss was posted on arxiv announcing the discovery of an aperiodic monotile— dubbed the hat—which is a shape which tiles the plane but only aperiodically.10 Any aperiodic tiling constructed using the hat uses it and its reflection in 6 different orientations, giving a tiling with 12 prototiles. Secondly, less than two months after the announcement of this discovery, a preprint by Baake, Gähler, and Sadun was posted on arxiv computing the cohomology groups of the tiling space associated with the hat, as well as a proof that the associated system has pure-point spectrum. These two announcements show that the field is both young enough to contain plenty of of exciting open problems as well as mature enough to possess a robust set of tools—coming from many areas of mathematics—which can be used to determine properties of different tilings. If you find an aperiodic monotile which defines a weak mixing system, let me know. ACKNOWLEDGMENT. I am grateful to Scott Schmieding and two anonymous referees who provided very helpful feedback on the first version of this article and helped improve it considerably.
10Soon after the announcement of this discovery, some of the authors gave
References
[AA20] Shigeki Akiyama and Pierre Arnoux (eds.), Substitution and tiling dynamics: introduction to self-inducing structures, Lecture Notes in Mathematics, vol. 2273, Springer, Cham, 2020. CIRM Jean-Morlet Chair, Fall 2017; Lecture notes from the Tiling Dynamical Systems research school held as part of Tilings and Discrete Geometry program, DOI 10.1007/978-3-030-57666-0. MR4200104 [AS99] Jean-Paul Allouche and Jeffrey Shallit, The ubiquitous Prouhet-Thue-Morse sequence, Sequences and their applications (Singapore, 1998), Springer Ser. Discrete Math. Theor. Comput. Sci., Springer, London, 1999, pp. 1–16. MR1843077 [AP98] Jared E. Anderson and Ian F. Putnam, Topological invariants for substitution tilings and their associated 𝐶 ∗ -algebras, Ergodic Theory Dynam. Systems 18 (1998), no. 3, 509– 537, DOI 10.1017/S0143385798100457. MR1631708 [BGM19] Michael Baake, Franz Gähler, and Neil Manibo, ˜ Renormalisation of pair correlation measures for primitive inflation rules and absence of absolutely continuous diffraction, Comm. Math. Phys. 370 (2019), no. 2, 591–635, DOI 10.1007/s00220-019-03500-w. MR3994581 [BG13] Michael Baake and Uwe Grimm, Aperiodic order. Vol. 1, Encyclopedia of Mathematics and its Applications, vol. 149, Cambridge University Press, Cambridge, 2013. A mathematical invitation; With a foreword by Roger Penrose, DOI 10.1017/CBO9781139025256. MR3136260 [CS06] Alex Clark and Lorenzo Sadun, When shape matters: deformations of tiling spaces, Ergodic Theory Dynam. Systems 26 (2006), no. 1, 69–86, DOI 10.1017/S0143385705000623. MR2201938 [CN16] Mar´ıa Isabel Cortez and Andr´es Navas, Some examples of repetitive, nonrectifiable Delone sets, Geom. Topol. 20 (2016), no. 4, 1909–1939, DOI 10.2140/gt.2016.20.1909. MR3548461 [dB81] N. G. de Bruijn, Algebraic theory of Penrose’s nonperiodic tilings of the plane. I, II, Nederl. Akad. Wetensch. Indag. Math. 43 (1981), no. 1, 39–52, 53–66. MR609465 [GKM15] Franz Gähler, Eugene E. Kwan, and Gregory R. Maloney, A computer search for planar substitution tilings with 𝑛-fold rotational symmetry, Discrete Comput. Geom. 53 (2015), no. 2, 445–465. MR3316232
a virtual talk about it, and it can be found in MoMath’s YouTube channel: https://www.youtube.com/watch?v=FkZPMf73qYc.
1190
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
[GL89] C. Godrèche and J. M. Luck, Quasiperiodicity and randomness in tilings of the plane, J. Statist. Phys. 55 (1989), no. 1-2, 1–28. MR1003500 [GMRS23] P. Gohlke, A. Mitchell, D. Rust, and T. Samuel, Measure Theoretic Entropy of Random Substitution Subshifts, Ann. Henri Poincar´e 24 (2023), no. 1, 277–323. MR4533524 [GT22] Rachel Greenfeld and Terence Tao, A counterexample to the periodic tiling conjecture, Preprint, arXiv:arXiv:2209.08451, 2022. [GS16] B. Grünbaum and G.C. Shephard, Tilings and patterns, Second Edition, Dover Books on Mathematics Series, Dover Publications, Incorporated, 2016. [JS18] Antoine Julien and Lorenzo Sadun, Tiling deformations, cohomology, and orbit equivalence of tiling spaces, Ann. Henri Poincar´e 19 (2018), no. 10, 3053–3088, DOI 10.1007/s00023-018-0713-3. MR3851781 [KLS15] Johannes Kellendonk, Daniel Lenz, and Jean Savinien (eds.), Mathematics of aperiodic order, Progress in Mathematics, vol. 309, Birkhäuser/Springer, Basel, 2015, DOI 10.1007/978-3-0348-0903-0. MR3380566 [Lan40] C. Dudley Langford, 1464. uses of a geometric puzzle, The Mathematical Gazette 24 (1940), no. 260, 209–211. [MRST22] Eden Miro, Dan Rust, Lorenzo Sadun, and Gwendolyn Tadeo, Topological mixing of random substitutions, Israel J. Math. (2022). [Sad08] Lorenzo Sadun, Topology of tiling spaces, University Lecture Series, vol. 46, American Mathematical Society, Providence, RI, 2008, DOI 10.1090/ulect/046. MR2446623 [Sol97] Boris Solomyak, Dynamics of self-similar tilings, Ergodic Theory Dynam. Systems 17 (1997), no. 3, 695–738, DOI 10.1017/S0143385797084988. MR1452190
Credits
Opening image and Figures 2, 3, 7, and 9 are courtesy of Rodrigo Trevino. ˜ Figure 1 is in the public domain. Figure 4 is courtesy of ©The Board of Trustees of the Science Museum. Licensed under CC-BA-SA 4.0. Figure 5 is courtesy of the United States Patent and Trademark Office. Figure 6 is reprinted with permission from D. Shechtman et al., “Metallic Phase with Long-Range Orientational Order and No Translational Symmetry,” Physical Review Letters 53, 1951. Copyright (1951) by the American Physical Society. Figure 8 is courtesy of D. Frettlöh, E. Harriss, F. Gähler: Tilings encyclopedia, https://tilings.math.uni -bielefeld.de/. Licensed under CC-BY-Sa 2.0. Photo of Rodrigo Trevino ˜ is courtesy of Laurie DeWitt, Pure Light Images.
Rodrigo Trevino ˜
SEPTEMBER 2023
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1191
G Sm
c
ni
ui
s.o rg
m
is
La th
ar it re ci Ve a, g se E Ph a ar xp D c l o La h a r e tin n x a d m the nd en ins Hi tor pir sp in in an g c g l ic on ive m tr s a at ib n he ut d m ion the at s ic of ia th ns e . se
na
ia
ar
M
Gu
an d Bo ro M Ph ot o rre o is co ro ral ur te sy , P es of hD Al ex ia
ej
Al
to
Ph o sy of Ca l
a
hi
nt
rte
ou
is c
Cy
e
at
St
hD
a iz
Ru
D
Ph
el
Va r
,P
a
lin
ro
Ca
iz,
Ru
C Ro ase dr y ig ue z
s
nd
a
Ed
da
hD
sla
lI
,P
ne
an
Ch
by
an
n
ke
ta
es
or
Fl
o
Am ot
Ph
s
inx
Lat
is
of
t
Lis
hD in
w
Er
r, P sy
te
ur
co
lla
Vi o
ot
d Ph
da
le
So
pan
His
R M yan or uz zi,
hD
he
Jr ., P
ics in t
Ma
the
ma
is
lS cie n
ces
Co lle ge
le y, Ph ol yo D ke ou nt H
Ch um co ur te sy of M
Ti m
tica
Ph ot o
Ro sa ur Lo a Us m ca el n í, ga Ph D
An Analytical and Geometric Perspective on Adversarial Robustness
Nicol´as Garc´ıa Trillos and Matt Jacobs . . . y luego se fueron el uno para el otro, como si fueran dos mortales enemigos. Don Quixote; Chapter 8, Part 1.
1. Introduction In the last ten years, neural networks have made incredible strides in classifying large data sets, to the point that Nicol´as Garc´ıa Trillos is an assistant professor in the department of statistics at the University of Wisconsin-Madison, Madison, Wisconsin. His email address is [email protected]. Matt Jacobs is an assistant professor in the Department of Mathematics at the University of California Santa Barbara, Santa Barbara, California. His email address is [email protected]. Communicated by Notices Associate Editor Daniela De Silva. For permission to reprint this article, please contact: [email protected].
they can now outperform humans in raw accuracy. However, the robustness of these systems is a completely different story. Suppose you were asked to identify whether a photo contained an image of a cat or a dog. You probably would have no difficulty at all; at worst, maybe you would only be tripped up by a particularly small or unusual Shiba Inu. In contrast, it has been widely documented that an adversary can convince an otherwise well-performing neural network that a dog is actually a cat (or vice-versa) by making tiny human-imperceptible changes to an image at the pixel level. These small perturbations are known as adversarial attacks and they are a significant obstacle to the deployment of machine learning systems in security-critical applications [GSS14]. The susceptibility to adversarial attacks is not exclusive to neural network models, and many other learning systems have also been observed to be brittle when facing adversarial perturbations of data.
DOI: https://doi.org/10.1090/noti2758
SEPTEMBER 2023
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1193
The business of defending against adversarial attacks is known as adversarial robustness, robust training, or simply adversarial training (although we will mostly reserve the latter name for a specific optimization objective). There are many methods in the literature that can be used to build defenses against adversarial attacks, but here we will be particularly interested in methods that enforce robustness during model training. In these types of methods, the standard training process — driven primarily by accuracy maximization — is substituted by a training process that promotes robustness, typically through the use of a different optimization objective that factors in the actions of a well-defined adversary. In this article, we give a brief overview of adversarial attacks and adversarial robustness and summarize some recent attempts to mathematically understand the process of robust training. Adversarial training and its mathematical foundations are active areas of research and a thorough review of its extant literature is beyond the scope of this article.1 For this reason, we will focus our discussion around some important lines of research in the theory of adversarial robustness, some of which are based on our own research work in the field, which takes a distinctive analytic and geometric perspective. One of our goals is to convey the mathematical richness of the field and discuss some of the many opportunities that are available for mathematicians to contribute to the development and understanding of this important applied problem. 1.1. Basics of training learning models. Data classification or regression typically occurs over a product space of the form 𝒵 = 𝒳×𝒴. Here 𝒳 is the data space or feature space, an abstract metric space containing the data points, while 𝒴 is the set of labels, usually a finite set for classification tasks or the real line for regression. In the remainder, we mostly focus our discussion on the classification problem. There, the goal is to construct a function that accurately partitions the data space into the possible classes contained in 𝒴. The learner does this by searching for a function 𝑓 ∶ 𝒳 → 𝒴 or 𝑓 ∶ 𝒳 → 𝑆𝒴 where 𝑆𝒴 is the probability simplex with |𝒴| vertices 𝑆𝒴 = {𝑝 ∈ [0, 1]𝒴 ∶ ∑𝑦∈𝒴 𝑝𝑦 = 1}. If we write 𝑓(𝑥) = (𝑓𝑦 (𝑥))𝑦∈𝒴 , then each 𝑓𝑦 represents the learner’s confidence that the data point 𝑥 ∈ 𝒳 belongs to class 𝑦 ∈ 𝒴. While a probabilistic classifier is typically the desired output of a learning task, note that one can always obtain a deterministic classifier 𝑓 ∶ 𝒳 → 𝒴 by selecting the largest entry of any tuple. To train a machine learning system, one typically needs a finite training set 𝑍 ⊂ 𝒵 consisting of data pairs (𝑥𝑖 , 𝑦 𝑖 ), i.e., feature vectors with their associated ground truth classification. One then minimizes a loss function over some 1
AMS Notices limits to 20 the references per article; we refer to the references cited here for further pointers to the literature.
1194
chosen function space ℱ ⊂ {𝑓 ∶ 𝒳 → 𝑆𝒴 } with the goal of finding a function 𝑓∗ that produces an accurate classification of the data. For instance, ℱ may be the space of all neural networks with a certain architecture, while the loss function typically has the form 1 ∑ ℓ(𝑓(𝑥), 𝑦), |𝑍| (𝑥,𝑦)∈𝑍
(1)
where ℓ is a function that is small when 𝑓(𝑥) gives high probability to the ground truth label 𝑦, and large otherwise. In practice, it may only be possible to find a classifier 𝑓 that is a local minimizer of (1) over ℱ, though it is often possible to drive the loss function to nearly zero during training via stochastic gradient descent. Either way, well-trained classifiers typically perform well—at least in the absence of adversarial attacks. 1.2. Adversarial attacks. Given a trained classifier 𝑓 and a data point 𝑥 ∈ 𝒳, what is the best way for an adversary to perturb 𝑥 to produce an incorrect classification? In order for this to be a nontrivial question, we must assume that there are some restrictions on how far the adversary can perturb 𝑥. This restriction is known as the adversarial budget, and it plays a crucial role in both adversarial attacks and robust training. For our purposes, we will formulate the adversarial budget through a parameter 𝜀 > 0 and assume that the adversary can only produce a perturbed data point that lies in 𝐵𝜀 (𝑥) ⊂ 𝒳, the ball of radius 𝜀 centered at the original data point 𝑥. Returning to our question, if the adversary has full access to the function 𝑓 and knows that the correct label for 𝑥 is 𝑦, then the most powerful attack for a given budget 𝜀 is found by replacing 𝑥 with any point 𝑥̃ satisfying 𝑥̃ ∈ argmax ℓ(𝑓(𝑥′ ), 𝑦).
(2)
𝑥′ ∈𝐵𝜀 (𝑥)
In practice, the adversary may not be able to find a point 𝑥̃ that exactly satisfies (2). However, when 𝒳 is a subspace of Euclidean space, a simpler approach that produces highly effective attacks is to perturb the data in the direction of steepest ascent for the loss function by choosing 𝑥̃ = 𝑥 + 𝜀
∇𝑥 ℓ(𝑓(𝑥), 𝑦) , ‖∇𝑥 ℓ(𝑓(𝑥), 𝑦)‖
(3)
or by considering the popular PGD attack 𝑥̃ = 𝑥 + 𝜀sign(∇𝑥 ℓ(𝑓(𝑥), 𝑦)),
(4)
where sign(⋅) denotes the coordinatewise sign of its input; see [GSS14,MMS+ 18], and [TT22] for a motivation for the PGD attack. Regardless of how the adversary chooses its attack, there are two key takeaways from formulas (2), (3), and (4) that we would like to highlight. Firstly, we see that adversarial attacks are found by attempting to maximize the loss function with respect to data perturbations. In contrast,
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
the learner trains the classifier by attempting to minimize the loss function among classifiers belonging to a chosen function space ℱ (typically a parametric family). Hence, the learner and adversary can be viewed as playing a twoplayer game where they compete to set the value of the loss function using the tools at their disposal (the learner first gets to choose 𝑓, the adversary then gets to modify data); this connection to game theory will become more important shortly. Secondly, it should be clear from formulas (2), (3), and (4) that the effectiveness of adversarial attacks must stem from a certain lack of smoothness in the trained classifier 𝑓. Indeed, if 𝑓 were say 1-Lipschitz, then an adversary with budget 𝜀 could not change the classification probabilities at any point by more than 𝜀. Thus, attacks that can fool an image classifier by making humanimperceptible changes to pixel values must be exploiting a significant lack of regularity. 1.3. Adversarial training. In light of the above considerations, to stave off adversarial attacks one must find a way to construct classifiers with better regularity properties. The most classical (and perhaps most obvious) way to do this would be to replace the training objective (1) with a new objective 1 ∑ ℓ(𝑓(𝑥), 𝑦) + 𝑅(𝑓) (5) |𝑍| (𝑥,𝑦)∈𝑍
(5) are conceptual and computational. Conceptual, because in the formulation (6) one explicitly trains to defend against a well-defined adversary (although in practice this requires “understanding the enemy”). Computational, because (6) is in essence a min-max problem (the adversary maximizes over 𝜀-perturbations while the learner minimizes by altering 𝑓) for which many implementable algorithms exist (for instance alternating gradient descent and ascent steps). Furthermore, the regularizing effect of (6) is data-dependent in contrast to the regularization induced by a standard gradient penalty term which has very little connection to the structure of the data. On the other hand, compared to (5), it is harder to understand (analytically and geometrically) how exactly (6) is regularizing/robustifying the classifier 𝑓; see the discussion in [TT22] and references therein. Furthermore, it is not so clear how the user should choose the budget parameter 𝜀 and be mindful of the tradeoff between accuracy (on clean data) and robustness that problem (6) introduces. Answering these questions in full generality is challenging and remains an open problem in the field. 1.4. Outline. To fix some ideas, we write (6) in the general form:
where 𝑅 ∶ ℱ → ℝ is a term that promotes regularity. For instance, 𝑅 could constrain the Lipschitz constant of 𝑓 or could be some other gradient penalty term. This approach has a long history of success in inverse problems (in that setting one typically adds a regularizing term to a datafitting term to help mitigate the effect of noise), however, in the context of machine learning, it is often too difficult to efficiently minimize (5). For instance, it is very difficult to train a neural network with a Lipschitz constant constraint. On the other hand, popular and computationally feasible regularization terms for neural network training, for instance, weight regularization, do not seem to provide any defense against adversarial attacks. Adversarial training is a different approach to regularization/robustification that has become very popular in the machine learning community. In adversarial training, rather than modifying the training objective with a regularizing term, one instead incorporates the adversary into the training process [SZS+ 14, MMS+ 18]. More precisely, adversarial training replaces the objective (1) with
where 𝜇 is an arbitrary probability measure over 𝒵, the “clean data distribution,” and not just an empirical measure as in (6). We work with arbitrary 𝜇 to avoid distinguishing, whenever unnecessary, between population and finite data settings. Only in some parts of Section 2 will it be important to make assumptions on 𝜇. In this paper we explore the following general questions:
1 ∑ sup ℓ(𝑓(𝑥,̃ 𝑦), |𝑍| (𝑥,𝑦)∈𝑍 𝑥∈𝐵 ̃ 𝜀 (𝑥)
(6)
where the adversarial budget 𝜀 is chosen by the user. When training using (6), the learner is forced to find a function 𝑓 that cannot be easily attacked by an adversary with budget 𝜀. The advantages of training using (6) compared to SEPTEMBER 2023
min 𝔼(𝑥,𝑦)∼𝜇 [ sup ℓ(𝑓(𝑥), ̃ 𝑦)], 𝑓∈ℱ
(AT)
̃ 𝜀 (𝑥) 𝑥∈𝐵
1. What type of regularization is enforced on learning models by the presence of adversaries? 2. What are the tradeoffs between accuracy and robustness when training models robustly? 3. How can one actually train models to be robust to specific adversarial attacks? 4. How can one compute meaningful lower bounds for the (AT) problem? The above questions are too broad to be answered in complete generality, and in the remainder, we will focus on specific settings where we can reveal interesting geometric and analytic structures. In particular, in Section 2, we explore the type of regularization enforced on binary classifiers, revealing a connection between adversarial training and perimeter minimization problems. This connection will allow us to interpret geometrically the tradeoff between accuracy and robustness. In Section 3, we discuss a concrete game-theoretic interpretation of adversarial training and discuss a more general framework for adversarial
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1195
robustness based on distributionally robust optimization (DRO). We use this connection with game theory to discuss some potential strategies for training robust learning models and highlight the significance of the concept of Nash equilibrium for adversarial training. Finally, in Section 4, we discuss how an agnostic learner setting can be used to derive lower bounds for more general (AT) problems. We show that in the agnostic learner setting for multiclass classification the adversarial robustness objective can be equivalently rewritten as the geometric problem of finding a (generalized) barycenter of a collection of measures and then discuss the computational implications of this equivalence. We wrap up the paper in Section 5 by discussing some research directions connected to the topics presented throughout the paper. 1.4.1. Additional notation. When working on 𝒳 = ℝ𝑑 we will often consider balls 𝐵𝜀 (𝑥) associated to a given norm ‖⋅‖ on ℝ𝑑 . We use ‖⋅‖∗ to denote the dual norm of ‖⋅‖, which is defined according to ‖𝑣‖∗ =
sup
⟨𝑢, 𝑣⟩.
ᵆ∈ℝ𝑑 ∶ ‖ᵆ‖≤1
We will denote by 𝒫(𝒵) the space of Borel probability measures over the set 𝒵. Given two measures 𝜇, 𝜇̃ ∈ 𝒫(𝒵) we denote by Γ(𝜇, 𝜇)̃ the space of couplings between 𝜇 and 𝜇,̃ i.e., probability measures over 𝒵 × 𝒵 whose first and second marginals are, respectively, 𝜇 and 𝜇.̃ We will also use the notion of a pushforward of a measure by a map. Precisely, if 𝑇 ∶ 𝐴 ↦ 𝐵 is a measurable map between two measurable spaces, and 𝜇 is a probability measure over 𝐴, we define 𝑇 ♯ 𝜇, the pushforward of 𝜇 by 𝑇, to be the measure on 𝐵 for which 𝑇 ♯ 𝜇(𝐶) = 𝜇(𝑇 −1 (𝐶)) for all measurable subsets 𝐶 of 𝐵.
2. Adversarial Robustness: Regularization and Perimeter In this section, we discuss the connection between adversarial training and explicit regularization methods. To motivate this connection, let us first consider a simple robust linear regression setting. In this setting, models in the family ℱ = {𝑓𝜃 ∶ 𝜃 ∈ Θ} take the form 𝑓𝜃 (𝑥) = ⟨𝜃, 𝑥⟩,
𝑥 ∈ 𝒳,
where Θ is some subset of ℝ𝑑 and 𝒳 = ℝ𝑑 . Here, the learner’s goal is to select a linear regression function relating inputs 𝑥 ∈ ℝ𝑑 to real-valued outputs 𝑦. We show the following equivalence between problem (AT) and an explicit regularization problem taking the form of a Lassotype linear regression: min𝔼(𝑥,𝑦)∼𝜇 [ sup |⟨𝜃, 𝑥⟩̃ − 𝑦|] 𝜃∈Θ
̃ 𝜀 (𝑥) 𝑥∈𝐵
= min 𝔼(𝑥,𝑦)∼𝜇 [|⟨𝜃, 𝑥⟩ − 𝑦|] + 𝜀‖𝜃‖∗ ; 𝜃∈Θ
1196
(7)
θ x ˜
x
x ˜
Figure 1. ℓ1 ball around 𝑥 of radius 𝜀 crossed by level sets of function 𝑥 ↦ ⟨𝜃, 𝑥⟩. The value sup𝑥∈𝐵 ̃ 𝜀 (𝑥) |⟨𝜃, 𝑥⟩̃ − 𝑦| is realized at either 𝑥′̃ or 𝑥″̃ .
note that in order to be consistent with the standard definition of the Lasso regularization, we would require ‖⋅‖ to be the 𝑙∞ -norm to get ‖⋅‖∗ to be the 𝑙1 -norm. Identity (7) is only one of many similar identities relating regularization methods and adversarially robust learning problems for classical families of statistical models. The extent of this type of equivalences is more apparent when considering DRO versions (see Section 3 for a definition) of the adversarial training problem, e.g., see [BKM19], where in addition some statistical inference methodologies, motivated by these equivalences, are proposed. To deduce (7), it is enough to consider the optimization problem sup𝑥∈𝐵 ̃ 𝜀 (𝑥) |𝑓𝜃 (𝑥) − 𝑦| at every fixed (𝑥, 𝑦) and realize that the sup can be written as either sup𝑥∈𝐵 ̃ 𝜀 (𝑥) ⟨𝑥,̃ 𝜃⟩ − 𝑦 when ⟨𝑥, 𝜃⟩ ≥ 𝑦, or as sup𝑥∈𝐵 𝑦 − ⟨ 𝑥, ̃ 𝜃⟩ when ⟨𝑥, 𝜃⟩ ≤ 𝑦; ̃ 𝜀 (𝑥) see Figure 1 for an illustration. Using the definition of the dual norm ‖⋅‖∗ one can deduce that in all cases this expression is equal to |⟨𝜃, 𝑥⟩ − 𝑦| + 𝜀‖𝜃‖∗ , from which (7) follows. Equivalence (7), although limited to the linear regression setting, motivates exploring the regularization effect of adversaries on more general families of learning models. In the remainder of this section we discuss how in the binary classification setting this regularization effect can be related to geometric properties of decision boundaries, in particular to their size or curvature. By presenting this analysis we hope to convey that the connection between adversarial robustness and regularization methods goes beyond simple classical statistical settings, in turn revealing a variety of interesting geometric problems motivated by machine learning.
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
2.1. Perimeter regularization. Let us consider a binary classification version of (AT) where 𝑦 ∈ {0, 1}, ℓ is the 0-1 loss (i.e., 0 if the two inputs of ℓ are the same, and 1 otherwise), and ℱ is a family of binary classifiers ℱ = {1𝐴 ∶ 𝐴 ∈ 𝒜} for 𝒜 a family of measurable subsets of 𝒳. Here 1𝐴 denotes the indicator function of a subset 𝐴 of 𝒳, defined according to 1 1𝐴 (𝑥) = { 0
if 𝑥 ∈ 𝐴 if 𝑥 ∉ 𝐴.
Notice that we can “parameterize” a family of binary classifiers with a family of subsets in 𝒳 without losing any generality, due to the fact that binary classifiers output one of two values, 0 or 1, and thus can be characterized by the regions of points in 𝒳 that they classify as a 1. In general, as shown in [BGTM23], Problem (AT) is equivalent to a regularization problem, with a non-local perimeter regularizer, of the form: inf 𝔼(𝑥,𝑦)∼𝜇 [|1𝐴 (𝑥) − 𝑦|] + 𝜀Per𝜀 (𝐴; 𝜇),
(8)
1 1 𝜇 (𝜕 𝐴𝑐 ) + 𝜇1 (𝜕𝜀 𝐴), 𝜀 0 𝜀 𝜀 𝜕𝜀 𝐴𝑐 ≔ {𝑥 ∈ 𝐴𝑐 ∶ dist(𝑥, 𝐴) < 𝜀},
(9)
𝐴∈𝒜
where Per𝜀 (𝐴) ≔
𝜕𝜀 𝐴 ≔ {𝑥 ∈ 𝐴 ∶ dist(𝑥, 𝐴𝑐 ) < 𝜀}; Figure 2 illustrates the sets 𝜕𝜀 𝐴𝑐 and 𝜕𝜀 𝐴. In the above, the measures 𝜇0 and 𝜇1 are the measures over 𝒳 defined according to 𝜇0 (⋅) ≔ 𝜇(⋅ × {0}) and 𝜇1 (⋅) ≔ 𝜇(⋅ × {1}), i.e., up to scaling factors they are the conditional distributions of the variable 𝑥 given the possible values that 𝑦 may take. The equivalence between (AT) and (8) can be deduced, at least at a formal level, by adding and subtracting the term 𝔼(𝑥,𝑦)∼𝜇 [|1𝐴 (𝑥) − 𝑦|] from the term 𝔼(𝑥,𝑦)∼𝜇 [sup𝑥∈𝐵 ̃ 𝜀 (𝑥) ℓ(1𝐴 (𝑥), 𝑦)] and then identifying the resulting terms with those in (9). To make these computations rigorous and to show existence of solutions to problem (9), there are several technical challenges, beginning with the measurability of the operations involved in defining the problem (AT), that must be overcome; the first part of the work [BGTM23] discusses some of these challenges. Now, let us motivate the use of the word perimeter when describing the functional Per𝜀 (𝐴). Suppose that 𝒳 = ℝ𝑑 and that the measures 𝜇0 and 𝜇1 are absolutely continuous with respect to the Lebesgue measure so that we can write them as d𝜇0 = 𝜌0 d𝑥 and d𝜇1 = 𝜌1 d𝑥 for two nonnegative Lebesgue-integrable functions 𝜌0 and 𝜌1 that for simplicity will be assumed to be smooth. In this case, (8) can be rewritten as 1 1 Per𝜀 (𝐴) = ∫ 𝜌0 (𝑥) d𝑥 + ∫ 𝜌1 (𝑥) d𝑥. 𝜀 𝜕 𝐴𝑐 𝜀 𝜕𝐴 𝜀
𝜀
Notice that the sets 𝜕𝜀 𝐴, 𝜕𝜀 𝐴𝑐 in the volume integrals shrink toward 𝜕𝐴, the boundary of 𝐴, as we send 𝜀 → 0. Moreover, SEPTEMBER 2023
due to the rescaling factor 𝜀 in front of these integrals, one may anticipate a connection between Per𝜀 (𝐴) and the more classical notion of (weighted) perimeter: Per(𝐴) ≔ ∫ (𝜌0 (𝑥) + 𝜌1 (𝑥)) d𝑥 = ∫ 𝜌(𝑥) dℋ 𝑑−1 (𝑥), 𝜕𝐴
𝜕𝐴
where 𝜌(𝑥) ≔ 𝜌0 (𝑥) + 𝜌1 (𝑥). Note that 𝜕𝐴 is precisely the decision boundary between classes 1 and 0 according to the classifier 1𝐴 and that 𝜌(𝑥) is the density, with respect to the Lebesgue measure, of the marginal of the data distribution 𝜇 on the 𝑥 variable. In the definition of Per(𝐴) we have used ℋ 𝑑−1 , the 𝑑−1 dimensional Hausdorff measure, which can be used to measure the size of a hypersurface of codimension one. In what follows we discuss two different ways to understand the relationship between Per𝜀 and Per. One first possible way to relate Per𝜀 and Per is through a pointwise convergence analysis: fix a set 𝐴 with regular enough boundary and then study the behavior of Per𝜀 (𝐴) as 𝜀 → 0. This is the type of analysis discussed in [GTM22], which was used by the authors to motivate the connection between adversarial robustness in binary classification and geometric variational problems involving perimeter. However, pointwise convergence of Per𝜀 toward Per is not sufficient to ensure that Per𝜀 induces a perimeter regularization type effect on its minimizers. For that, we need a different type of convergence. A different way to compare the functionals Per𝜀 (⋅) and Per is through a variational analysis; we refer the interested reader to the recent paper [BS22], which explains in detail this type of convergence and shows that Per𝜀 converges variationally toward Per, at least when balls are induced by the Euclidean distance. Here we restrict ourselves to discussing some of the mathematical implications of the analysis in [BS22], which considers problem (9) when 𝒜 is the set 𝔅(ℝ𝑑 ) of all (Borel) measurable subsets of 𝒳 = ℝ𝑑 ; this setting corresponds to an agnostic learner setting. First, [BS22] shows that minimizers 𝐴𝜀 of (8) converge, as 𝜀 → 0, toward minimizers of the problem min Per(𝐴),
𝐴∈𝒜Opt
(10)
where 𝒜Opt ≔ argmin𝐴′ ∈𝔅(ℝ𝑑 ) 𝔼(𝑥,𝑦)∼𝜇 [|1𝐴′ (𝑥) − 𝑦|]. This means that, as 𝜀 → 0, solutions to the adversarial training problem select among minimizers to the unrobust risk the ones with minimal perimeter. In particular, this result helps capture the idea that when 𝜀 is small, the presence of an adversary has the same effect as imposing a perimeter penalization term on the classifiers; c.f. [GTM22] for more discussion on this idea and on the relation with mean curvature flows. A second consequence of the results in [BS22] is an expansion in 𝜀 for the adversarial risk 𝑅∗𝜀 , i.e., the minimum
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1197
Figure 2. In green, the set of points in 𝐴 within distance 𝜀 from the boundary of 𝐴; in blue, the set of points in 𝐴𝑐 within distance 𝜀 from the boundary. The union of 𝜕𝜀 𝐴𝑐 and 𝜕𝜀 𝐴 is the region where the adversary may attack and guarantee a mismatch between predicted and true labels.
value of (AT). Precisely, 𝑅∗𝜀 = 𝑅∗0 + 𝜀Per∗ + 𝑜(𝜀),
(11)
where Per∗ denotes the minimum value in (10). The above result is reminiscent to Danskin’s theorem for functions over Euclidean space, a theorem that is used to characterize the first “derivative” of a function defined as the infimum over a family of functions. Naturally, the difficulty in proving a result like (11) lies in the fact that the functionals of interest take subsets of 𝒳 as input, and thus Danskin’s theorem cannot be applied. Formula (11) can be used to give a geometric interpretation to the rate at which accuracy is lost when building robust classifiers as a function of the adversarial budget 𝜀. Indeed, this result says that for small 𝜖, accuracy is lost at the rate given by the Bayes classifier with minimal perimeter. The trade-off between accuracy and robustness has been investigated from statistical perspectives in [ZYJ+ 1909], for example. In contrast, the tools and concepts discussed here have a geometric and analytic flavor, with the caveat that they are only meaningful for a population-level analysis of adversarial robustness in binary classification. To gain an even better understanding of the situation, higher-order expansions of the adversarial risk in 𝜀 would be desirable and are a current topic of investigation. 2.2. Certifiability and regularity. In this section, we discuss notions of certifiability and regularity of robust binary classifiers. We begin with a definition. Definition 2.1. Let 1𝐴 be a binary classifier. We say that 𝑥 ∈ 𝒳 is 𝜀-certifiable (for 1𝐴 ) if 𝐵𝜀 (𝑥) ⊆ 𝐴 or if 𝐵𝜀 (𝑥) ⊆ 𝐴𝑐 . In simple terms, the certifiable points of a given classifier are the points in 𝒳 for which the classification rule stays constant within the ball of radius 𝜀 around them: they are the points that are insensitive to the adversarial attacks in the adversarial problem (AT). While it is not possible to build nontrivial sets 𝐴 for which all points in 𝒳 are certifiable, we can still ask whether it is possible to find robust classifiers that are fully characterized by their certifiable points. This motivates the following definitions. Definition 2.2. We say that a measurable set 𝐴 is 𝜀 inner regular if for all 𝑥 ∈ 𝜕𝐴 there exists 𝑥′ ∈ 𝐴 such that 1198
Figure 3. A data set for which no optimal robust classifier is 𝜀 pseudo-certifiable. If 𝑥 is in the blue region and 𝑥 ∈ 𝜕𝐵𝜀 for some ball of radius 𝜀, then 𝐵𝜀 must intersect both a red and purple circle.
𝐵𝜀 (𝑥′ ) ⊂ 𝐴 and 𝑥 ∈ 𝜕𝐵𝜀 (𝑥′ ). Likewise, we say that 𝐴 is 𝜀 outer regular, if 𝐴𝑐 is 𝜀 inner regular. Sets that are both inner and outer 𝜀 regular will be referred to as 𝜀 pseudocertifiable. Notice that a classifier that is 𝜀 pseudo-certifiable is completely determined by its outputs on its certifiable points. Pseudo-certifiability is thus a desirable property. It is then natural to wonder whether it is always possible to construct an 𝜀 pseudo-certifiable classifier 1𝐴 minimizing the adversarial risk, i.e., a set 𝐴 solving (AT) when the class of sets 𝒜 is 𝔅(ℝ𝑑 ). As it turns out, the notion of pseudocertifiability is very strong, and in the case of the Euclidean distance, for example, it implies that decision boundaries are locally the graph of a 𝐶 1,1 function; see [BGTM23] and references therein. An example of a setting where no optimal robust classifier is 𝜀 pseudo-certifiable is given in Figure 3. There, 𝜇 is the sum of four delta measures in ℝ2 at the points (±𝜀, ±𝜀), two red and two purple. Any optimal classifier must stay constant within the 𝜀-balls centered at each point. The color choice in the shaded blue region does not affect optimality, however, there is no way to color this region and maintain 𝜖-inner regularity for both sets (c.f. Figure 3). While we cannot guarantee pseudo-certifiability for robust classifiers in general, we can still guarantee existence of 𝜀-inner regular solutions, 𝜀-outer regular solutions, and sometimes solutions with other forms of regularity. This is the content of a series of results in [BGTM23] stated informally below. Theorem 2.3 (Informal from [BGTM23]). Let 𝜇 be an arbitrary probability measure over 𝒳 × {0, 1}, and let ℱ be the class of all Borel measurable classifiers. Let 𝐴 be any solution to the (AT) problem. Then there exist two solutions 𝐴𝐼 , 𝐴𝑂 to (AT) such that 𝐴𝐼 ⊆ 𝐴 ⊆ 𝐴𝑂 , and 1. 𝐴𝐼 is 𝜀 inner-regular and 𝐴𝑂 is 𝜀 outer-regular. 2. Any measurable set 𝐴′ satisfying 𝐴𝐼 ⊆ 𝐴′ ⊆ 𝐴𝑂 is a solution to (AT).
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
The above are two instances of distributionally robust optimization problems given their inner maximization over probability measures. Notice that, in general, problem (12) can be written as (13) by defining the cost 𝐶(𝜇, 𝜇)̃ to be 0 if the explicit constraint 𝐷(𝜇, 𝜇)̃ ≤ 𝜀 is satisfied and infinity otherwise. Both 𝐶 and 𝐷 can be interpreted as “distances” between probability distributions. In the remainder of the paper, we will restrict our attention to problem (13) with a cost function 𝐶 taking the form of an optimal transport problem:
AI AS
AO
𝐶(𝜇, 𝜇)̃ =
Figure 4. A set 𝐴𝐼 that is 𝜀-inner regular (some inwards cusps allowed), a set 𝐴𝑂 that is 𝜀-outer regular (some outward cusps allowed), and a smooth set 𝐴𝑆 in between. In the context of Theorem 2.3 for Euclidean balls, the set 𝐴𝑆 would be a solution to (AT).
Moreover, if 𝒳 is an Euclidean space and the balls 𝐵𝜀 are induced by the Euclidean distance, then there exists a solution 𝐴 to (AT) such that the boundary of 𝐴 is locally the graph of a 𝐶 1,1/3 function. The analysis in [BGTM23] also provides quantitative estimates for the regularity of the decision boundary of the classifier 1𝐴 in the last part of Theorem 2.3. In general, these estimates blow up when 𝜀 → 0. It is however expected that one could get finer regularity estimates under additional assumptions on 𝜇, e.g., assuming that 𝜇0 = 𝜌0 d𝑥, 𝜇1 = 𝜌1 d𝑥, and the set {𝑥 ∈ ℝ𝑑 ∶ 𝜌1 (𝑥) = 𝜌0 (𝑥)} is sufficiently regular. Obtaining these finer estimates and characterizing the needed regularity for these finer estimates to apply are topics of current investigation.
In this section, we introduce a framework for adversarial training that encompasses (AT) and that can be cast more precisely within game theory. In particular, in this larger framework we will be able to discuss the notion of Nash equilibrium in adversarial training and consider its implications on the robust training of learning models. The idea is as follows. Instead of considering pointwise attacks as in (AT), where for every single data point (𝑥, 𝑦) the adversary proposes an attack, we allow the adversary to modify 𝜇 by producing an entirely new data distribution 𝜇.̃ Naturally, as in model (AT), the adversary must pay a price (or use a budget) for carrying out this modification. In precise terms, we consider the following families of problems: ̃ 𝔼(𝑥,̃ 𝑦)̃ [ℓ(𝑓(𝑥), ̃ 𝑦)]
(12)
̃ − 𝐶(𝜇, 𝜇). min max 𝔼(𝑥,̃ 𝑦)̃ [ℓ(𝑓(𝑥), ̃ 𝑦)] ̃
(13)
min
max
̃ ̃ 𝑓∈ℱ 𝜇∈𝒫(𝒵) s.t. 𝐷(𝜇,𝜇)≤𝜀
and ̃ 𝑓∈ℱ 𝜇∈𝒫(𝒵)
SEPTEMBER 2023
∫ 𝑐 𝒵 (𝑧, 𝑧)𝑑𝜋(𝑧, ̃ 𝑧), ̃
(14)
for a cost function 𝑐 𝒵 ∶ 𝒵 × 𝒵 ↦ [0, ∞] that describes the marginal cost that the adversary must pay in order to move a clean data point 𝑧 to a new location 𝑧 ̃ (recall that Γ(𝜇, 𝜇)̃ is the space of probability measures on 𝒵 × 𝒵 with first marginal 𝜇 and second marginal 𝜇). ̃ Notice that, in this generality, the adversary has the ability to modify both the feature vector 𝑥 and the label 𝑦. Some natural examples of cost functions 𝑐 𝒵 are 𝑐 𝒵 (𝑧, 𝑧)̃ ≔ 𝑐𝑎 |𝑧 − 𝑧|̃ 2 , a choice that is particularly meaningful when 𝒳 = ℝ𝑑 and 𝒴 = ℝ; here, 𝑐𝑎 is a positive constant that can be interpreted as reciprocal to an adversarial budget. Another example of cost function 𝑐 𝒵 of interest is (15) below, which can be used to rewrite problem (AT) in the form (13). Problem (13) is thus a rather general mathematical formulation for adversarial training. Proposition 3.1 (Informal). Problem (AT) is equivalent to problem (13) for a cost function 𝐶 of the form (14) with marginal cost 0, 𝑐 𝒵 (𝑧, 𝑧)̃ = { ∞,
3. Connections to Game Theory: DRO Formulations of AT
inf
𝜋∈Γ(𝜇,𝜇)̃
if 𝑑(𝑥, 𝑥)̃ ≤ 𝜀 and 𝑦 = 𝑦 ̃ otherwise .
(15)
Proof. For a given 𝑓, let 𝜇̃ be a solution of the inner maximization problem in (13). Notice that, without the loss of generality, we can assume that 𝐶(𝜇, 𝜇)̃ < ∞, which means that there exists a coupling 𝜋 ∈ Γ(𝜇, 𝜇)̃ whose support is contained in the set {(𝑧, 𝑧)̃ ∶ 𝑦 = 𝑦,̃ 𝑑(𝑥, 𝑥)̃ ≤ 𝜀}. Due to this, we can write ̃ = 𝔼(𝑧,𝑧)∼𝜋 𝔼𝑧∼̃ 𝜇̃ [ℓ(𝑓(𝑥), ̃ 𝑦)] [ℓ(𝑓(𝑥), ̃ 𝑦)] ̃ ≤ 𝔼(𝑧,𝑧)∼𝜋 [ sup ℓ(𝑓(𝑥′̃ ), 𝑦)] ̃ 𝑥̃′ ∈𝐵𝜀 (𝑥)
≤ 𝔼𝑧∼𝜇 [ sup ℓ(𝑓(𝑥̃′ ), 𝑦)]. 𝑥̃′ ∈𝐵𝜀 (𝑥)
This shows that (13) ≤ (AT). On the other hand, for an arbitrary 𝑓 and (𝑥, 𝑦) in the support of 𝜇, let 𝑇1 (𝑥, 𝑦) ∈ argmax𝑥∈𝐵 ̃ 𝑦) ̃ 𝜀 (𝑥) ℓ(𝑓(𝑥), (assuming, for simplicity, that the sup is indeed reached and that this operation can be defined in a measurable
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1199
way). We then define 𝑇(𝑥, 𝑦) = (𝑇1 (𝑥, 𝑦), 𝑦) and consider 𝜇̃ ≔ 𝑇 ♯ 𝜇. Notice that by construction we have 𝐶(𝜇, 𝜇)̃ = 0, from where it follows that ̃ 𝔼𝑧∼𝜇 [ sup ℓ(𝑓(𝑥′̃ ), 𝑦)] = 𝔼𝑧∼̃ 𝜇̃ [ℓ(𝑓(𝑥), ̃ 𝑦)] 𝑥̃′ ∈𝐵𝜀 (𝑥)
̃ − 𝐶(𝜇, 𝜇̃′ ). ≤ max 𝔼𝑧∼̃ 𝜇̃′ [ℓ(𝑓(𝑥), ̃ 𝑦)] 𝜇̃′ ∈𝒫(𝒵)
From this we can deduce (AT) ≤ (13).
□
Remark 3.2. Problem (AT) can also be written in the form (12). To see this, it is sufficient to define 𝐷(𝜇, 𝜇)̃ as the following ∞-Wasserstein distance in 𝒵: 𝑊∞ (𝜇, 𝜇)̃ =
inf
ess sup 𝛿(𝑧, 𝑧), ̃
𝜋∈Γ(𝜇,𝜇)̃ (𝑧,𝑧)∼𝜋 ̃
where 𝛿(𝑧, 𝑧)̃ = 𝑑(𝑥, 𝑥)̃ if 𝑦 = 𝑦,̃ and 𝛿(𝑧, 𝑧)̃ = ∞ if 𝑦 ≠ 𝑦.̃ 3.1. Nash equilibria in DRO. One of the merits of writing adversarial training problems in the form (13) (or (12)) is that it allows us to explicitly interpret the process of robust training as a zero-sum game between two players, a learner and an adversary. In this interpretation, the learner’s strategies consist of learning models 𝑓 ∈ ℱ (regression functions/classifiers), while the adversary’s consist of data perturbations 𝜇.̃ The payoff function for the adversary is set to be 𝒰(𝜇,̃ 𝑓) ∶= 𝔼𝑧∼̃ 𝜇̃ [ℓ(𝑓(𝑥), 𝑦)] − 𝐶(𝜇, 𝜇), ̃
(16)
and the adversary’s goal is to maximize it, while the learner’s goal is to minimize it. We now recall the notion of a Nash equilibrium of a game, one of the central notions in game theory. Definition 3.3. We say that (𝜇̃∗ , 𝑓∗ ) ∈ 𝒫(𝒵) × ℱ is a Nash equilibrium for the adversarial training game min max 𝒰(𝜇,̃ 𝑓),
(17)
̃ 𝑓∈ℱ 𝜇∈𝒫(𝑍)
if 𝒰(𝜇,̃ 𝑓∗ ) ≤ 𝒰(𝜇̃∗ , 𝑓) for all 𝑓 ∈ ℱ and all 𝜇̃ ∈ 𝒫(𝒵). For adversarial training, the theoretical existence of a Nash equilibrium (𝜇̃∗ , 𝑓∗ ) means that if the learner were to choose model 𝑓∗ , then its worst outcome would occur precisely if the adversary played 𝜇̃∗ . This means that the learner would have no incentive to use a model different from 𝑓∗ regardless of the adversary’s attack. In addition, when Nash equilibria exist, the min and the max in (17) can be swapped and the apparent advantage that the adversary has over the learner in the formulation (17) (the adversary plays after observing the classifier chosen by the learner) is in fact only apparent. Existence of Nash equilibria for (17) thus means good news for the robust training of models provided one could actually compute one of them. Before we move on, it is important to highlight that for 𝑓∗ to be useful in applications, one would need to 1200
make sure that the cost function 𝐶 indeed restricts the adversary to consider only small perturbations of clean data points, but the exact meaning of “small perturbation” may be application dependent. In what follows, we put aside the challenges of modelling the cost function 𝐶 and instead discuss the existence of Nash equilibria for (17) assuming 𝐶 has been fixed (i.e., we have already determined how to model the adversary). A well-known meta-result in game theory states that Nash equilibria for a game typically exist in the players’ spaces of mixed strategies. In mathematical terms, this means that to prove existence of Nash equilibria of a given game one typically needs a convexification of the original space of strategies. For the adversarial training problem (17), since the adversary takes strategies in the space of probability measures 𝒫(𝒵), no convexification is needed for the adversary because 𝒫(𝒵) is already a convex space. On the other hand, the space ℱ of classification/regression models may not be convex in general. One way to convexify ℱ when ℱ is a parametric family of models ℱ = {𝑓𝜃 ∶ 𝜃 ∈ Θ}, the most standard setting in practice, is to consider a randomization of the classifiers/regression functions in the original ℱ; this is the approach taken in [MSP+ 2118]. Precisely, for the setting described in Proposition 3.1, and given a parametric family ℱ, the authors of [MSP+ 2118] show that the problem min max ∫ 𝒰(𝑓𝜃 , 𝜇)̃ d𝜈(𝜃)
̃ 𝜈∈𝒫(Θ) 𝜇∈𝒫(𝒵) Θ
(18)
admits Nash equilibria. Here 𝜈 can be interpreted as a mixed strategy for the original game and induces a regression function/classification rule as follows: given an input 𝑥, sample 𝜃 from 𝜈 and then evaluate 𝑓𝜃 (𝑥). Another approach to convexify the set ℱ, useful in the regression setting or when considering probabilistic classifiers, is to work with the space of aggregate models ℱ̂ ≔ {∫Θ 𝑓𝜃 (⋅) d𝜈(𝜃) ∶ 𝜈 ∈ 𝒫(Θ)}; notice that elements in this family can be directly interpreted as deterministic regression functions/probabilistic classifiers. In this setting, one considers the game min max 𝒰(𝜈, 𝜇), ̃
̃ 𝜈∈𝒫(Θ) 𝜇∈𝒫(𝒵)
(19)
where we abuse notation slightly and write 𝒰(𝜈, 𝜇)̃ to denote 𝒰(∫Θ 𝑓𝜃 (⋅) d𝜈(𝜃), 𝜇). ̃ The above setting is the one motivating the work [GG23]. While the convexification of the spaces of strategies is important, to guarantee the existence of Nash equlibria one also needs to make assumptions on the payoff function 𝒰. Sion’s theorem [Sio58], for example, a very general result that can be used to guarantee existence of Nash equilibria for rather general games, requires lower and upper semicontinuity of the payoff function 𝒰 with respect to some topology, as well as some weaker form of
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
convexity/concavity of the payoff (a compactness property is required as well). It is actually not so difficult to check these assumptions for problems (18) and (19) when the spaces 𝒫(𝒵) and 𝒫(Θ) are endowed with the topology of weak convergence of probability measures and the loss function ℓ in (16) is convex in its first argument; see [MSP+ 2118] and [GG23] for more details on these assumptions. Works discussing the existence of Nash equilibria for a different variety of games go at least as far back as the work by von Neumann [vN59] (English translation from the original paper from 1928) and include other classical papers such as [Gli52,Sio58]. There are plenty of results in the literature that hold under a variety of assumptions that are worth discussing, and we will yet see another minmax result in Section 4, but discussing the extent of this topic is certainly beyond the scope of this paper. 3.2. Greedy algorithms for DRO. Existence results are statements made by an optimist: “there exists at least one Nash equilibrium, and thus there must be a way to find one. . . ” the realist would immediately inquire how. In this section, we review a classical, perhaps the most popular, greedy algorithm that has been introduced in the literature to attempt solving minmax problems in Euclidean spaces. After that, we provide some pointers to recent literature where tools from optimal transport theory are used to adapt those greedy methods to solve minmax games in spaces of measures over continuum domains, examples of which are the DRO adversarial training problems (13) when 𝒳 is a domain of ℝ𝑑 , and Θ is, for example, the space of parameters of a neural network. If the minmax problem that we were interested in was one of the form min𝑝∈𝐷 max𝑞∈𝐸 Φ(𝑝, 𝑞), where 𝐷, 𝐸 are subsets of two Euclidean spaces, a natural greedy strategy to alternate gradient ascent steps in the 𝑞 coordinate and gradient descent steps in the 𝑝 coordinate. This greedy algorithm is to minmax games what the gradient descent algorithm is for minimization problems. In continuous time, this descent-ascent approach can be interpreted as a system of ODEs of the form: {
𝑝𝑡̇ = −∇𝑝 Φ(𝑝𝑡 , 𝑞𝑡 ) 𝑞𝑡̇ = ∇𝑞 Φ(𝑝𝑡 , 𝑞𝑡 ),
(20)
or projected versions thereof to guarantee that the dynamics stay within the feasible sets 𝐷 and 𝐸. As can be expected, convergence of this scheme, especially toward Nash equilibria for the problem, depends on properties of the payoff function Φ (like for example strong convexityconcavity) or on whether one is interested in the behavior of (𝑝𝑡 , 𝑞𝑡 ) as 𝑡 → ∞ or in the behavior of average iter1 𝑡 1 𝑡 ates ( ∫0 𝑝𝑠 d𝑠, ∫0 𝑞𝑠 d𝑠) as 𝑡 → ∞. We refer the reader to 𝑡 𝑡 the recent works [LJJ2013, DP18], which discuss some of SEPTEMBER 2023
the existing literature on the topic and discuss drawbacks of and alternatives to gradient descent-ascent dynamics to solve minmax games. While in general these potential issues about convergence may play against the use of ascent-descent schemes, they remain to be the simplest methods to consider for solving minmax games. Due to this, it is of interest to adapt them to the setting of problems (18) and (19) — the difficulty lies in the fact that now the dynamics must be defined in spaces of probability measures. Fortunately, the theory of optimal transport, which has experienced tremendous growth in the past two decades and has made its way into a variety of applications in a variety of fields (including machine learning), provides some useful avenues for carrying out this adaptation. Some of these ideas can be found in the works [GG23, WC22, Lu22]. For example, the recent work [GG23] discusses the use of optimal transport based dynamics to solve convex-concave adversarial training problems like (19). [WC22, Lu22], on the other hand, use optimal transport based dynamics to solve minmax games on spaces of measures with bilinear payoff structure and thus are more suited for problems such as (18). All works [GG23, WC22, Lu22] present some promising results on the convergence properties of their schemes, but, as it is discussed there, their theories remain far from complete. Designing schemes that can efficiently find Nash equilibria is an important question for adversarial training and for game theory at large.
4. Adversaries and Barycenters In Section 2, we discussed how adversarial training can be seen as a perimeter minimization problem from the perspective of the learner in the case of binary classification. In this section, we will instead consider the nonparametric problem from the perspective of the adversary and show that the optimal adversarial strategy is given by solving a generalized barycenter problem among data distributions—an interpretation that holds regardless of the number of classes. This constitutes yet another piece of evidence that there is a rich geometric structure to adversarial learning. The discussion here will be a brief sketch of the main result from [GTKJ23]. Our starting point is the non-parametric, agnosticclassifier version of the DRO training problem introduced in Section 3. Here we will suppose that 𝑦 ∈ 𝒴 = {1, … , 𝐾}, ℓ is the 0-1 loss, and we allow the learner to choose any possible probabilistic classifier. The resulting adversarial training problem takes the form min
max 𝔼(𝑥,̃ 𝑦)∼ ̃ − 𝐶(𝜇, 𝜇). ̃ ̃ 𝜇̃ [1 − 𝑓𝑦 ̃ (𝑥)]
̃ 𝑓∶𝒳→𝑆𝒴 𝜇∈𝒫(𝒵)
(21)
While focusing on the non-parametric case and the 0,1 loss simplifies the problem, it is still a challenge to give an interpretation for (21) in its current form. To make progress,
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1201
we will eliminate the learner from the problem and obtain a pure maximization problem that determines the optimal strategy for the adversary. To do so, we will need to interchange the order of the min and max operations. 4.1. Interchanging min and max. As we discussed in Section 3, there are various well-known theorems, such as Sion’s minimax theorem, that guarantee that the interchange of min and max does not affect the value of the problem or the optimal strategies for the players. The space of all probabilistic classifiers {𝑓 ∶ 𝒳 → 𝑆𝒴 } is convex and the 0-1 loss is linear with respect to both 𝑓 and 𝜇.̃ As a result, Sion’s minimax theorem applies and the interchange is valid. From a mathematical perspective, switching the order allows us to simplify the objective as the minĩ has a simmization problem min𝑓∶𝒳→𝑆𝒴 𝔼(𝑥,̃ 𝑦)∼ ̃ 𝜇̃ [1−𝑓𝑦 ̃ (𝑥)] ple explicit solution. To see this more clearly, we decompose the measure 𝜇̃ over the label space writing 𝜇̃ = (𝜇̃1 , … , 𝜇̃𝐾 ). The previous line is then equivalent to min
𝑓∶𝒳→𝑆𝒴
∑
𝔼𝑥∼ ̃ ̃ 𝜇̃𝑦 [1 − 𝑓𝑦 (𝑥)].
(22)
𝑦∈{1,…,𝐾}
Obviously, the learner would like to choose 𝑓 such that 𝑓𝑦 (𝑥)̃ = 1. However, this may not always be possible. For instance, if 𝑥̃ belongs to the support of 𝜇̃1 and 𝜇̃2 , then the learner must make a choice in order to respect the con𝐾 straint ∑𝑦=1 𝑓𝑦 (𝑥)̃ = 1. If 𝜇̃1 gives more mass to 𝑥̃ than 𝜇̃2 , then it is best to choose 𝑓1 (𝑥)̃ = 1 (and vice versa in the other case); however, either way, the learner will have no choice but to classify some of the data incorrectly. In general, if a point 𝑥̃ belongs to the support of multiple measures, then the learner achieves the smallest value at 𝑥̃ by choosing 𝑓(𝑥)̃ to concentrate on the label 𝑦∗ such that 𝜇𝑦̃ ∗ gives more mass to 𝑥̃ than any of the other measures (i.e., the mass from 𝑦∗ is classified correctly and the rest of the data at 𝑥̃ is misclassified). This reveals an extremely important facet of the adversary’s strategy: if the adversary can manipulate the data so that points from different classes are on top of one another, then the learner is forced to misclassify some of the data; furthermore, this effect gets stronger as the number of overlapping classes increases. From the above considerations, it turns out that (22) is equal to max −𝜆(𝒳) +
𝜆∈ℳ+ (𝒳)
∑
𝜇𝑦̃ (𝒳) s.t. 𝜇𝑦̃ ≤ 𝜆,
𝑦∈{1,…𝐾}
i.e., 𝜆 will be the smallest possible measure that lies above each of the 𝜇𝑦̃ (note that ℳ+ (𝒳) represents the space of all nonnegative Borel measures on 𝒳). Note that since the adversary cannot change the number of data points (equivalently the total mass of the data) we must have ∑𝑦∈{1,…,𝐾} 𝜇𝑦̃ (𝒳) = ∑𝑦∈{1,…,𝐾} 𝜇𝑦 (𝒳), which is a positive constant that we will denote as 𝑀. Hence, after 1202
interchanging the min and the max, we can eliminate the learner and replace (25) with a problem that only considers the action of the adversary max
̃ 𝜆∈ℳ+ (𝒳),𝜇∈𝒫(𝒵)
𝑀 − 𝜆(𝒳) − 𝐶(𝜇, 𝜇)̃ s.t. 𝜇𝑦̃ ≤ 𝜆.
(23)
Let us note that the quantity 𝑀−𝜆(𝒳) is precisely the adversarial risk. Hence, the adversary would like to maximize the risk, while respecting the constraints and not paying too much in the transportation cost 𝐶(𝜇, 𝜇). ̃ In what follows, we will show that this problem can be viewed as a generalization of a barycenter problem with respect to the Wasserstein distance. 4.2. Generalized barycenters. Given 𝐾 probability measures 𝜚1 , … , 𝜚𝐾 ∈ 𝒫(𝒳) and a cost 𝑐 ∶ 𝒳 × 𝒳 → [0, ∞], the Wasserstein barycenter problem tries to find a measure 𝜚∗ such that the summed cost of transporting each of the 𝜚𝑖 onto 𝜚∗ (with respect to the optimal transport cost induced by 𝑐) is as small as possible. We now claim that problem (23) is a generalization of this barycenter problem when the adversary is not allowed to change class labels. In that case, the cost 𝐶(𝜇, 𝜇)̃ decomposes into a sum over each of the possible class labels 𝐶(𝜇, 𝜇)̃ = ∑𝑦∈{1,…,𝐾} 𝐶𝒳 (𝜇𝑦 , 𝜇𝑦̃ ) (where 𝐶𝒳 is the optimal transport cost for measures defined over 𝒳 rather than 𝒵). For 𝜆 fixed, let us write ̄ 𝑦 , 𝜆) ≔ min 𝐶(𝜇𝑦 , 𝜇𝑦̃ ) s.t. 𝜇𝑦̃ ≤ 𝜆, 𝐶(𝜇 𝜇̃𝑦
̄ 𝑦 , 𝜆) represents the cheapest possible way and note that 𝐶(𝜇 to transport 𝜇𝑦 onto some part of 𝜆. Using this notation, problem (23) becomes max 𝑀 − 𝜆(𝒳) −
𝜆∈ℳ+ (𝒳)
∑
̄ 𝑦 , 𝜆), 𝐶(𝜇
(24)
𝑦∈{1,…,𝐾}
which we will refer to as the generalized Wasserstein barycenter problem (GBP) and an optimal solution 𝜆∗ as a generalized Wasserstein barycenter. In GBP, we try to find a nonnegative measure 𝜆 (no longer necessarily a probability measure) such that the total mass of 𝜆 plus the summed cost of transporting each 𝜇𝑦 onto some part of 𝜆 is as small as possible. To understand this in the context of adversarial training, let us consider two extreme choices for 𝜆. In the first extreme case, let us choose 𝜆1 = ∑𝑦∈{1,…,𝐾} 𝜇𝑦 . With this choice, ̄ 𝑦 , 𝜆1 ) = 0 for all 𝑦, since 𝜇𝑦 is already part of 𝜆1 and 𝐶(𝜇 hence we do not need to transport any mass. On the other hand, 𝑀 − 𝜆1 (𝒳) = 𝑀 − ∑𝑦∈{1,…,𝐾} 𝜇𝑦 (𝒳) = 0. In other words, this choice produces 0 adversarial risk, meaning that the learner will be able to classify everything correctly (i.e., the adversary has not created any confusion between the classes). Clearly, this is a bad choice for the adversary even though the transportation cost is 0. In the second extreme case, 𝜆2 , we try to make the adversarial risk 𝑀−𝜆2 (𝒳)
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
as large as possible. Since each of the 𝜇𝑦 must be transported on to 𝜆2 , we must have 𝜆2 (𝒳) ≥ max𝑦 𝜇𝑦 (𝒳). In order to avoid paying a large transportation cost, we want 𝜆2 to satisfy 𝜆2 ∈
argmin
̄ 𝑦 , 𝜆). 𝐶(𝜇
∑
𝜆∈ℳ+ (𝒳),𝜆(𝑋)=max𝑦 𝜇𝑦 (𝒳) 𝑦∈{1,…,𝐾}
When all of the 𝜇𝑦 have the same total mass, 𝜆2 will be a solution to the Wasserstein barycenter problem, since the condition 𝜆(𝑋) = 𝜇1 (𝒳) = ⋯ = 𝜇𝐾 (𝒳) means that each 𝜇𝑦 must transport all of its mass onto 𝜆. In other words, to maximize the adversarial risk, the best thing the adversary can do is rearrange the data distributions for each class so that they all fully overlap on the same measure (note that this lines up with our insights from the previous subsection). In this case, the learner must necessarily misclassify 𝑀−1 100 ∗ % of the data as every location in 𝒳 containing 𝑀 a data point will have an equally mixed fraction of each class. Note however, that this may not be the best overall ̄ 𝑦 , 𝜆2 ) may outweigh adversarial strategy, as the costs 𝐶(𝜇 the maximization of the adversarial risk. This is particularly the case when we consider the most relevant cost (15), where the adversarial budget parameter 𝜀 may make it literally impossible for the adversary to move each 𝜇𝑦 onto a single common distribution. In general, the optimal choice of 𝜆 is something between the two extremes offered by 𝜆1 , 𝜆2 (c.f. Figure 5), in other words the adversary must balance the desire to maximize the risk against the imposition of the adversarial budget. Now one may ask, what can we gain from understanding the optimal adversarial strategy through the lens of GBP? Furthermore, one might also wonder does this shed any light on adversarial learning outside of the nonparametric setting? First, let us highlight that the GBP connection allows us to use powerful tools from computational optimal transport to compute the optimal 𝜆 and hence the optimal adversarial strategy. Furthermore, because the adversary can only combine points that are distance at most 𝜀 away from one another, GBP appears to be computationally easier than the classical barycenter problem. Development of efficient algorithms that take advantage of the special structure of GBP is an ongoing work. Next, GBP reveals that the optimal adversarial strategy is strongly tied to the geometry of the data. Indeed, if 𝜆 is an optimal solution, then one can show that every point 𝑥̃ ∈ spt(𝜆) there exists a set 𝐴 ⊂ {1, … , 𝐾} such that 𝑥̃ ∈ argmin ∑ 𝑐(𝑥, 𝑥𝑦 ) for some 𝑥𝑦 ∈ spt(𝜇𝑦 ), 𝑥∈𝒳
𝑦∈𝐴
i.e., every point in spt(𝜆) is itself a barycenter (with respect to the distance 𝑐) of 𝐾 or fewer points drawn from each of the 𝜇𝑦 . Hence the 𝜆 encodes local data (combining nearby points in different classes to get pointwise barycenters) as well as global data (choosing which points SEPTEMBER 2023
from different classes to combine). Finally, there are two ways in the non-parametric problem is still meaningful for the above model-specific problem where the learner is forced to choose from a parametric family of classifiers ℱ ⊂ {𝑓 ∶ 𝒳 → 𝑆 𝐾 }. The key insight is the fact that we always have the inequality min
max 𝔼(𝑥,̃ 𝑦)∼ ̃ − 𝐶(𝜇, 𝜇)̃ ≤ ̃ 𝜇̃ [1 − 𝑓𝑦 ̃ (𝑥)]
̃ 𝑓∶𝒳→𝑆𝒴 𝜇∈𝒫(𝒵)
min max 𝔼(𝑥,̃ 𝑦)∼ ̃ − 𝐶(𝜇, 𝜇). ̃ (25) ̃ 𝜇̃ [1 − 𝑓𝑦 ̃ (𝑥)] ̃ 𝑓∈ℱ 𝜇∈𝒫(𝒵)
Hence, the non-parametric setting provides a universal lower bound on the adversarial risk and the perturbations 𝜇̃1 , … , 𝜇̃𝐾 , 𝜆 found in GBP are universally powerful attacks against any classifier. As a result, 1) the computable optimal 𝜇𝑦̃ can be used as a way to generate strong adversarial examples that could be used during training of any desired model; 2) the optimal value of (23) can serve as a benchmark for robust training within any family of models and help provide insight on how to choose the budget parameter 𝜀 properly.
5. Conclusions In this paper we have discussed some recent analytic and geometric perspectives on adversarial training. Three key takeaways that we would like to highlight are: (1) learners respond to adversaries by choosing more regular decision boundaries, in particular boundaries with smaller perimeter (at least in the binary case), (2) adversarial training can be formulated as a game between two players, and (3) in the agnostic learner setting the optimal adversarial strategy to perturb the data is given by solving a generalized version of the Wasserstein barycenter problem. This can be summarized more glibly as learners minimize perimeter, adversaries find barycenters, together they arrive at a Nash equilibrium for their zero sum game. Extending these results to more general settings is an important open question. It would also be desirable to give finer estimates for 1,
1
the smoothness of decision boundaries beyond the 𝐶 3 result from Theorem 2.3. While we did not discuss it in this paper, lurking behind many of these variational problems are interesting PDEs whose analysis may shed further light on these problems. We hope that the discussion here will pique the interest of readers to continue adding results in this direction. There is much to still understand about these problems. ACKNOWLEDGMENT. NGT is supported by the NSF grants DMS-2005797 and DMS-2236447. References
[BKM19] Jose Blanchet, Yang Kang, and Karthyek Murthy, Robust Wasserstein profile inference and applications to machine
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1203
λ μ ˜1 μ1
μ ˜2
μ ˜3
μ2
μ3 Figure 5. Illustration of a generalized barycenter 𝜆 for the measures 𝜇1 , 𝜇2 , 𝜇3 and the associated perturbations 𝜇̃𝑦 . The smaller the total mass of 𝜆, the better for the adversary. Since 𝜆 must lie above the 𝜇𝑦 , the only way to reduce the mass of 𝜆 is to make the 𝜇̃𝑦 overlap.
learning, J. Appl. Probab. 56 (2019), no. 3, 830–857, DOI 10.1017/jpr.2019.49. MR4015639 [BGTM23] Leon Bungert, Nicol´as Garc´ıa Trillos, and Ryan Murray, The geometry of adversarial training in binary classification, Inf. Inference 12 (2023), no. 2, 921–968, DOI 10.1093/imaiai/iaac029. MR4565755 [BS22] Leon Bungert and Kerrek Stinson, Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning, ArXiv Preprint (2022), available at arXiv:2211.15223. [DP18] Constantinos Daskalakis and Ioannis Panageas, The limit points of (optimistic) gradient descent in min-max optimization, Advances in neural information processing systems, 2018. [TT22] Camilo Andr´es Garc´ıa Trillos and Nicol´as Garc´ıa Trillos, On the regularized risk of distributionally robust learning over deep neural networks, Res. Math. Sci. 9 (2022), no. 3, Paper No. 54, 32, DOI 10.1007/s40687-022-003499. MR4468594 [GG23] Camilo Andr´es Garc´ıa Trillos and Nicol´as Garc´ıa Trillos, On adversarial robustness and the use of wasserstein ascentdescent dynamics to enforce it, ArXiv Preprint (2023), available at arXiv:2301.03662. [GTKJ23] Nicol´as Garc´ıa Trillos, Jakwang Kim, and Matt Jacobs, The multimarginal optimal transport formulation of adversarial multiclass classification, J. Mach. Learn. Res. 24 (2023), Paper No. 45, 56, DOI 10.4995/agt.2023.17046. MR4582467 [GTM22] Nicol´as Garc´ıa Trillos and Ryan Murray, Adversarial classification: necessary conditions and geometric flows, J. Mach. Learn. Res. 23 (2022), Paper No. [187], 38. MR4577140 [Gli52] I. L. Glicksberg, A further generalization of the Kakutani fixed theorem, with application to Nash equilibrium points, Proc. Amer. Math. Soc. 3 (1952), 170–174, DOI 10.2307/2032478. MR46638 [GSS14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, Explaining and harnessing adversarial examples, arXiv, 2014. 1204
[LJJ2013] Tianyi Lin, Chi Jin, and Michael Jordan, On gradient descent ascent for nonconvex-concave minimax problems, Proceedings of the 37th international conference on machine learning, 202013, pp. 6083–6093. [Lu22] Yulong Lu, Two-scale gradient descent ascent dynamics finds mixed nash equilibria of continuous games: A meanfield perspective, ArXiv Preprint (2022), available at arXiv: 2212.08791. [MMS+ 18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, Towards deep learning models resistant to adversarial attacks, 6th international conference on learning representations, ICLR 2018, proceedings, 2018. [MSP+ 2118] Laurent Meunier, Meyer Scetbon, Rafael B Pinot, Jamal Atif, and Yann Chevaleyre, Mixed nash equilibria in the adversarial examples game, Proceedings of the 38th international conference on machine learning, 202118, pp. 7677– 7687. [Sio58] Maurice Sion, On general minimax theorems, Pacific J. Math. 8 (1958), 171–176. MR97026 [SZS+ 14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus, Intriguing properties of neural networks, 2nd international conference on learning representations, ICLR 2014, 2014. [vN59] John von Neumann, On the theory of games of strategy, Contributions to the theory of games, Vol. IV, Annals of Mathematics Studies, no. 40, Princeton University Press, Princeton, N.J., 1959, pp. 13–42. MR0101828 [WC22] Guillaume Wang and L´enaïc Chizat, An exponentially converging particle method for the mixed nash equilibrium of continuous games, ArXiv Preprint (2022), available at arXiv:2211.01280. [ZYJ+ 1909] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan, Theoretically principled trade-off between robustness and accuracy, Proceedings of the 36th international conference on machine learning, 201909, pp. 7472–7482.
Nicolas ´ Garc´ıa Trillos
Matt Jacobs
Credits
The opening image is courtesy of peterschreiber.media via Getty. Figures 1–5 are courtesy of the authors. Photo of Nicol´as Garc´ıa Trillos is courtesy of Nicol´as Garc´ıa Trillos. Photo of Matt Jacobs is courtesy of Matt Jacobs.
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
Machine Learning and Invariant Theory
Ben Blum-Smith and Soledad Villar 1. Introduction Modern machine learning has not only surpassed the state of the art in many engineering and scientific problems, but it also has had an impact on society at large, and will likely continue to do so. This includes deep learning, large language models, diffusion models, etc. In this article, we give an account of certain mathematical principles that are used in the definition of some of these machine learning models, and we explain how classical invariant theory Ben Blum-Smith is a postdoctoral fellow in the department of applied mathematics & statistics at the Johns Hopkins University. His email address is bblumsm1 @jhu.edu. Soledad Villar is an assistant professor in the department of applied mathematics & statistics at the Johns Hopkins University. Her email address is svillar3 @jhu.edu. For permission to reprint this article, please contact: [email protected]. DOI: https://doi.org/10.1090/noti2760
SEPTEMBER 2023
plays a role in them. Due to space constraints we leave out many relevant references. A version of this manuscript with a longer set of references is available on arXiv [1]. In supervised machine learning, we typically have a training set (𝑥𝑖 , 𝑦 𝑖 )𝑛𝑖=1 , where 𝑥𝑖 ∈ ℝ𝑑 are the data points and 𝑦 𝑖 ∈ ℝ𝑘 are the labels. A typical example is image recognition, where the 𝑥𝑖 are images and the 𝑦 𝑖 are image labels (say, “cat” or “dog”), encoded as vectors. The goal is to find a function 𝑓 ̂ in a hypothesis space ℱ, that not only ̂ 𝑖 ) ≈ 𝑦 𝑖 ), approximately interpolates the training data (𝑓(𝑥 but also performs well on unseen (or held-out) data. The function 𝑓 ̂ is called the trained model, predictor, or estimator. In practice, one parametrizes the class of functions ℱ with some parameters 𝜃 varying over a space Θ of parameter values sitting inside some ℝ𝑠 ; in other words, ℱ = {𝑓𝜃 ∶ ℝ𝑑 → ℝ𝑘 , 𝜃 ∈ Θ ⊆ ℝ𝑠 }. Then one uses local optimization (in 𝜃) to find a function in ℱ that locally and approximately minimizes a prespecified empirical loss function ℓ which compares a candidate function 𝑓𝜃 ’s values on the
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1205
𝑥𝑖 with the “true” target values 𝑦 𝑖 . In other words, one ap𝑛 proximately solves 𝜃∗ ≔ argmin𝜃 ∑𝑖=1 ℓ(𝑓𝜃 (𝑥𝑖 ), 𝑦 𝑖 ), and then takes 𝑓 ̂ = 𝑓𝜃∗ . Modern machine learning performs regressions on classes of functions that are typically overparameterized (the dimension 𝑠 of the space of parameters is much larger than the number 𝑛 of training samples), and in many cases, several functions in the hypothesis class ℱ can interpolate the data perfectly. Deep learning models can even interpolate plain noise, or fit images to random labels. Moreover, the optimization problem is typically nonconvex. Therefore the model performance is highly dependent on how the class of functions is parameterized and the optimization algorithms employed. The parameterization of the hypothesis class of functions is what in deep learning is typically referred to as the architecture. In recent years, the most successful architectures have been ones that use properties or heuristics regarding the structure of the data (and the problem) to design the class of functions: convolutional neural networks for images, recurrent neural networks for time series, graph neural networks for graph-structured data, transformers, etc. Many of these design choices are related to the symmetries of the problem: for instance, convolutional neural networks can be translation equivariant, and transformers can be permutation invariant. When the learning problem comes from the physical sciences, there are concrete sets of rules that the function being modeled must obey, and these rules often entail symmetries. The rules (and symmetries) typically come from coordinate freedoms and conservation laws [19]. One classical example of these coordinate freedoms is the scaling symmetry that comes from dimensional analysis (for instance, if the input data to the model is rescaled to change everything that has units of kilograms to pounds, the predictions should scale accordingly). In order to do machine learning on physical systems, researchers have designed models that are consistent with physical law; this is the case for physics-informed machine learning, neural ODEs and PDEs, and equivariant machine learning. Given data spaces 𝑉, 𝑊 and a group 𝐺 acting on both of them, a function 𝑓 ∶ 𝑉 → 𝑊 is equivariant if 𝑓(𝑔 ⋅ 𝑣) = 𝑔 ⋅ 𝑓(𝑣) for all 𝑔 ∈ 𝐺 and all 𝑣 ∈ 𝑉. Many physical problems are equivariant with respect to rotations, permutations, or scalings. For instance, consider a problem where one uses data to predict the dynamics of a folding protein or uses simulated data to emulate the dynamics of a turbulent fluid. Equivariant machine learning restricts the hypothesis space to a class of equivariant functions. The philosophy is that every function that the machine learning model can express is equivariant, and therefore consistent with physical law.
1206
Symmetries were used for machine learning (and in particular neural networks) in early works [16], and more recently they have been revisited in the context of deep learning. There are three main ways to implement symmetries. The simplest one parameterizes the invariant and equivariant functions with respect to discrete groups by averaging arbitrary functions over the group orbit [3]. The second approach, explained in the next section, uses classical representation theory to parameterize the space of equivariant functions (see for instance [8]). The third approach, the main point of this article, uses invariant theory. As an example, we briefly discuss graph neural networks (GNNs), which have been a very popular area of research in the past couple of years. GNNs can be seen as equivariant functions that take a graph represented by its adjacency matrix 𝐴 ∈ ℝ𝑛×𝑛 and possible node features 𝑋 ∈ ℝ𝑛×𝑑 , and output an embedding 𝑓(𝐴, 𝑋) ∈ ℝ𝑛×𝑑 so that 𝑓(Π𝐴Π⊤ , Π𝑋) = Π𝑓(𝐴, 𝑋) for all Π 𝑛 × 𝑛 permutation matrices. Graph neural networks are typically implemented as variants of graph convolutions or message passing, which are equivariant by definition. However, many equivariant functions cannot be expressed with these architectures. Several recent works analyze the expressive power of different GNN architectures in connection to the graph isomorphism problem. Beyond graphs, equivariant machine learning models have been extremely successful at predicting molecular structures and dynamics, protein folding, protein binding, and simulating turbulence and climate effects, to name a few applications. Theoretical developments have shown the universality of certain equivariant models, as well as generalization improvements of equivariant machine learning models over nonequivariant baselines. There has been some recent work studying the inductive bias of equivariant machine learning, and its relationship with data augmentation. See [1] for a list of references on these topics.
2. Equivariant Convolutions and Multilayer Perceptrons Modern deep learning models have evolved from the classical artificial neural network known as the perceptron. The multilayer perceptron model takes an input 𝑥 and outputs 𝐹(𝑥) defined to be the composition of affine linear maps and nonlinear entry-wise functions. Namely, 𝐹(𝑥) = 𝜌 ∘ 𝐿𝑇 ∘ … . ∘ 𝐿2 ∘ 𝜌 ∘ 𝐿1 (𝑥) ,
(1)
where 𝜌 is the (fixed) entrywise nonlinear function and 𝐿𝑖 ∶ ℝ𝑑𝑖 → ℝ𝑑𝑖+1 are affine linear maps to be learned from the data. The linear maps 𝐿𝑖 can be expressed as 𝐿𝑖 (𝑥) = 𝐴𝑖 𝑥 + 𝑏𝑖 where 𝐴𝑖 ∈ ℝ𝑑𝑖 ×𝑑𝑖+1 and 𝑏𝑖 ∈ ℝ𝑑𝑖+1 . In this example each function 𝐹 is defined by the parameters 𝜃 = (𝐴𝑖 , 𝑏𝑖 )𝑇𝑖=1 .
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
The first neural network that was explicitly equivariant with respect to a group action is the convolutional neural network. The observation is that if the 𝑁 × 𝑁 input images 𝑥 are seen in the torus ℤ/𝑁ℤ × ℤ/𝑁ℤ, the linear equivariant maps are cross-correlations (which in ML are referred to as convolutions) with fixed filters. The idea of restricting the linear maps to satisfy symmetry constraints was generalized to equivariance with respect to discrete rotations and translations, and to general homogenous spaces. Note that when working with arbitrary groups there are restrictions on the functions 𝜌 for the model to be equivariant. Classical results in neural networks show that certain multilayer perceptrons can universally approximate any continuous function in the limit where the number of neurons goes to infinity. However, that is not true in general in the equivariant case. Namely, functions expressed as (1) where the 𝐿𝑖 are linear and equivariant may not universally approximate all continuous equivariant functions. In some cases, there may not even exist nontrivial linear equivariant maps. One popular idea to address this issue is to extend this model to use equivariant linear maps on tensors. Now ⊗𝑘 ⊗𝑘 𝐿𝑖 ∶ ℝ𝑑 𝑖 → ℝ𝑑 𝑖+1 are linear equivariant maps (where the action in the tensor product is defined as the tensor product of the action in each component and extended linearly). Now the question is how can we parameterize the space of such functions to do machine learning? The answer is via Schur’s lemma. A representation of a group 𝐺 is a map 𝜙 ∶ 𝐺 → GL(𝑉) that satisfies 𝜙(𝑔1 𝑔2 ) = 𝜙(𝑔1 )𝜙(𝑔2 ) (where 𝑉 is a vector space and GL(𝑉), as usual, denotes the automorphisms of 𝑉, that is, invertible linear maps 𝑉 → 𝑉). A group action of 𝐺 on ℝ𝑑 (written as ⋅) is equivalent to the group representation 𝜙 ∶ 𝐺 → GL(ℝ𝑑 ) such that 𝜙(𝑔)(𝑣) = 𝑔 ⋅ 𝑣. We extend the action ⋅ to the tensor product (ℝ𝑑 )⊗𝑘 so that the group acts independently in every tensor factor (i.e., in every dimension or mode), namely 𝜙𝑘 = ⊗𝑘𝑟=1 𝜙 ∶ 𝐺 → GL((ℝ𝑑 )⊗𝑘 ). The first step is to note that a linear equivariant map 𝐿𝑖 ∶ (ℝ𝑑 )⊗𝑘𝑖 → (ℝ𝑑 )⊗𝑘𝑖+1 corresponds to a map between group representations such that 𝐿𝑖 ∘ 𝜙𝑘𝑖 (𝑔) = 𝜙𝑘𝑖+1 (𝑔) ∘ 𝐿𝑖 for all 𝑔 ∈ 𝐺. Homomorphisms between group representations are easily parametrizable if we decompose the representations in terms of irreducible representations (aka irreps): 𝑇𝑘
𝑖
𝜙𝑘𝑖 =
⨁
𝒯ℓ .
(2)
ℓ=1
The equivariant neural-network approach consists in decomposing the group representations in terms of irreps and explicitly parameterizing the maps [9]. In general, it is not obvious how to decompose an arbitrary group representation into irreps. However in the case where 𝐺 = SO(3), the decomposition of a tensor representation as a sum of irreps is given by the Clebsch–Gordan decomposition: ⊗𝑘𝑠=1 𝜙𝑠 = ⊕𝑇ℓ=1 𝒯ℓ (3) The Clebsch–Gordan decomposition not only gives the decomposition of the right side of (3) but also it gives the explicit change of coordinates. This decomposition is fundamental for implementing the equivariant 3D point-cloud methods defined in [8] and other works (see references in [1]). Moreover, recent work [6] shows that the classes of functions used in practice are universal, meaning that every continuous SO(3)-equivariant function can be approximated uniformly in compact sets by those neural networks. However, there exists a clear limitation to this approach: Even though decompositions into irreps are broadly studied in mathematics (plethysm), the explicit transformation that allows us to write the decomposition of tensor representations into irreps is a hard problem in general. It is called the Clebsch–Gordan problem.
3. Invariant Theory for Machine Learning An alternative but related approach to the linear equivariant layers described above is the approach based on invariant theory, the focus of this article. In particular, the authors of this note and collaborators [18] explain that for some physically relevant groups—the orthogonal group, the special orthogonal group, and the Lorentz group—one can use classical invariant theory to design universally expressive equivariant machine learning models that are expressed in terms of the generators of the algebra of invariant polynomials. Following an idea attributed to B. Malgrange (that we learned from G. Schwarz), it is shown how to use the generators of the algebra of invariant polynomials to produce a parameterization of equivariant functions for a specific set of groups and actions. To illustrate, let us focus on O(𝑑)-equivariant functions, namely functions 𝑓 ∶ (ℝ𝑑 )𝑛 → ℝ𝑑 such that 𝑓(𝑄𝑣 1 , … , 𝑄𝑣 𝑛 ) = 𝑄𝑓(𝑣 1 , … , 𝑣 𝑛 ) for all 𝑄 ∈ O(𝑑) and all 𝑣 1 , … , 𝑣 𝑛 ∈ ℝ𝑑 (for instance, the prediction of the position and velocity of the center of mass of a particle system). The method of B. Malgrange (explicated below) leads to the conclusion that all such functions can be expressed as 𝑛
𝑓(𝑣 1 , … , 𝑣 𝑛 ) = ∑ 𝑓𝑗 (𝑣 1 , … , 𝑣 𝑛 )𝑣𝑗 ,
(4)
𝑗=1
In particular, Schur’s Lemma says that a map between two irreps over ℂ is zero (if they are not isomorphic) or a multiple of the identity (if they are). SEPTEMBER 2023
where 𝑓𝑗 ∶ (ℝ𝑑 )𝑛 → ℝ are O(𝑑)-invariant functions. Classical invariant theory shows that 𝑓𝑗 is O(𝑑)-invariant if and only if it is a function of the pairwise inner products
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1207
(𝑣⊤𝑖 𝑣𝑗 )𝑛𝑖,𝑗=1 . So, in actuality, (4) can be rewritten 𝑛
𝑓(𝑣 1 , … , 𝑣 𝑛 ) = ∑ 𝑝𝑗 ((𝑣⊤𝑖 𝑣𝑗 )𝑛𝑖,𝑗=1 ) 𝑣𝑗 ,
(5)
𝑗=1 𝑛+(𝑛) where 𝑝𝑗 ∶ ℝ 2 → ℝ are arbitrary functions. In other words, the pairwise inner products generate the algebra of invariant polynomials for this action, and every equivariant map is a linear combination of the vectors themselves with coefficients in this algebra. In this article, we explicate the method of B. Malgrange in full generality, showing how to convert knowledge of the algebra of invariant polynomials into a characterization of the equivariant polynomial (or smooth) maps. In Section 4 we explain the general philosophy of the method, and in Section 5 we give the precise algebraic development, formulated as an algorithm to produce parametrizations of equivariant maps given adequate knowledge of the underlying invariant theory. In Section 6, we work through several examples. We note that, for machine learning purposes, it is not critical that the functions 𝑝𝑗 are defined on invariant polynomials, nor that they themselves are polynomials. In the following, we focus on polynomials because the ideas were developed in the context of invariant theory; the arguments explicated below are set in this classical context. However, in [18], the idea is to preprocess the data by 𝑛 to the tuple of dot products converting the tuple (𝑣𝑗 )𝑗=1 𝑛 ⊤ (𝑣 𝑖 𝑣𝑗 )𝑖,𝑗=1 , and then treat the latter as input to the 𝑝𝑗 ’s, which are then learned using a machine learning architecture of one’s choice. Therefore, the 𝑝𝑗 ’s are not polynomials but belong to whatever function class is output by the chosen architecture. Meanwhile, some recent works [2, 5, 11] have proposed alternative classes of separating invariants that can be used in place of the classical algebra generators as input to the 𝑝𝑗 ’s, and may have better numerical stability properties. This is a promising research direction.
4. Big Picture We are given a group 𝐺, and finite-dimensional linear 𝐺representations 𝑉 and 𝑊 over a field 𝑘. (We can take 𝑘 = ℝ or ℂ.) We want to understand the equivariant polynomial maps 𝑉 → 𝑊. We assume we have a way to understand 𝐺invariant polynomials on spaces related to 𝑉 and 𝑊, and the goal is to leverage that knowledge to understand the equivariant maps. The following is a philosophical discussion, essentially to answer the question: why should it be possible to do this? It is not precise; its purpose is just to guide thinking. Below in Section 5 we show how to actually compute the equivariant polynomials 𝑉 → 𝑊 given adequate knowledge of the invariants. That section is rigorous. 1208
The first observation is that any reasonable family of maps 𝑉 → 𝑊 (for example linear, polynomial, smooth, continuous, etc.) has a natural 𝐺-action induced from the actions on 𝑉 and 𝑊, and that the 𝐺-equivariant maps in such a family are precisely the fixed points of this action, as we now explain. This observation is a standard insight in representation and invariant theory. Let Maps(𝑉, 𝑊) be the set of maps of whatever kind, and let 𝐺𝐿(𝑉) (respectively 𝐺𝐿(𝑊)) be the group of linear invertible maps from 𝑉 (respectively 𝑊) to itself. Given 𝑓 ∈ Maps(𝑉, 𝑊) and 𝑔 ∈ 𝐺 and 𝑣 ∈ 𝑉, we define the map 𝑔𝑓 by (6) 𝑔𝑓 ≔ 𝜓(𝑔) ∘ 𝑓 ∘ 𝜙(𝑔−1 ), where 𝜙 ∶ 𝐺 → 𝐺𝐿(𝑉) and 𝜓 ∶ 𝐺 → 𝐺𝐿(𝑊) are the group homomorphisms defining the representations 𝑉 and 𝑊. The algebraic manipulation to verify that this is really a group action is routine and not that illuminating. A perhaps more transparent way to understand this definition of the action as “the right one” is that it is precisely the formula needed to make this square commute: 𝑓
𝑉 𝜙(𝑔)
𝑉
𝑊 𝜓(𝑔)
𝑔𝑓
𝑊
It follows from the definition of this action that the condition 𝑔𝑓 = 𝑓 is equivalent to the statement that 𝑓 is 𝐺equivariant. The square above automatically commutes, so 𝑔𝑓 = 𝑓 is the same as saying that the below square commutes— 𝑓
𝑉 𝜙(𝑔)
𝑊 𝜓(𝑔)
𝑓
𝑉 𝑊 —and this is what it means to be equivariant. An important special case of (6) is the action of 𝐺 on 𝑉 ∗ , the linear dual of 𝑉. This is the case 𝑊 = 𝑘 with trivial action, and for ℓ ∈ 𝑉 ∗ (6) reduces to 𝑔ℓ ≔ ℓ ∘ 𝜙(𝑔−1 ). This is known as the contragredient action. We will utilize it momentarily with 𝑊 in the place of 𝑉. The second observation is that Maps(𝑉, 𝑊) can be identified with functions from a bigger space to the underlying field 𝑘 by “currying,” and this change in point of view preserves the group action. Again, this is a standard maneuver in algebra. Specifically, given any map 𝑓 ∈ Maps(𝑉, 𝑊), we obtain a function 𝑓 ̃ ∶ 𝑉 × 𝑊 ∗ → 𝑘, defined by the formula 𝑓̃ ∶ 𝑉 × 𝑊 ∗ → 𝑘 (𝑣, ℓ) ↦ ℓ(𝑓(𝑣)). Note that the function 𝑓 ̃ is linear homogeneous in ℓ ∈ 𝑊 ∗ . Conversely, given any function 𝑓′ ∶ 𝑉 × 𝑊 ∗ → 𝑘 that is linear homogeneous in the second coordinate, we can
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
recover a map 𝑓 ∶ 𝑉 → 𝑊 such that 𝑓′ = 𝑓,̃ by taking 𝑓(𝑣) to be the element of 𝑊 identified along the canonical isomorphism 𝑊 → 𝑊 ∗∗ with the functional on 𝑊 ∗ that sends ℓ ∈ 𝑊 ∗ to 𝑓′ (𝑣, ℓ)—this functional is guaranteed to exist by the fact that 𝑓′ is linear homogeneous in the second coordinate. An observation we will exploit in the next section is that the desired functional is actually the gradient of 𝑓′ (𝑣, ℓ) with respect to ℓ. This construction gives an identification of Maps(𝑉, 𝑊) with a subset of Maps(𝑉 × 𝑊 ∗ , 𝑘). Furthermore, there is a natural action of 𝐺 on Maps(𝑉 × 𝑊 ∗ , 𝑘), defined precisely by the above formula (6) with 𝑉 × 𝑊 ∗ in place of 𝑉, 𝑘 in place of 𝑊, and trivial action on 𝑘;1 and the identification described here preserves this action. Therefore, the fixed points for the 𝐺-action on Maps(𝑉, 𝑊) correspond with fixed points for the 𝐺-action on Maps(𝑉 × 𝑊 ∗ , 𝑘), which are invariant functions (since the action of 𝐺 on 𝑘 is trivial). What has been achieved is the reinterpretation of equivariant maps 𝑓 ∈ Maps(𝑉, 𝑊) first as fixed points of a 𝐺-action, and then as invariant functions 𝑓 ̃ ∈ Maps(𝑉 × 𝑊 ∗ , 𝑘). Thus, knowledge of invariant functions can be parlayed into knowledge of equivariant maps.
5. Equivariance from Invariants With the above imprecise philosophical discussion as a guide, Algorithm 1 shows how in practice to get from a description of invariant polynomials on 𝑉 × 𝑊 ∗ , to equivariant polynomial (or smooth) maps 𝑉 → 𝑊. The technique given here is attributed to B. Malgrange; see [13, Proposition 2.2] where it is used to obtain the smooth equivariant maps, and [15, Proposition 6.8] where it is used to obtain holomorphic equivariant maps. Variants on this method are used to compute equivariant maps in [7, Sections 2.1– 2.2], [12, Section 3.12], [4, Section 4.2.3], and [20, Section 4]. The goal of the algorithm is to provide a parametrization of equivariant maps. That said, the proof of correctness is constructive: as an ancillary benefit, it furnishes a method for taking an arbitrary equivariant map given by explicit polynomial expressions for the coordinates and expressing it in terms of this parametrization. We now exposit in detail Algorithm 1 and its proof of correctness, in the case where 𝑓 is a polynomial map; for simplicity we take 𝑘 = ℝ. The argument is similar for smooth or holomorphic maps, except that one needs an additional theorem to arrive at the expression (7) below. If 𝐺 is a compact Lie group, the needed theorem is proven in [14] for smooth maps, and in [10] for holomorphic maps over ℂ. We begin with linear representations 𝑉 and 𝑊 of a group 𝐺 over ℝ. We take 𝑊 ∗ to be the contragredient 1The action of 𝐺 on 𝑉 × 𝑊 ∗ is defined by acting separately on each factor; the
action on 𝑊 ∗ is the contragredient representation defined above.
SEPTEMBER 2023
Algorithm 1 Malgrange’s method for getting equivariant functions Input: Bihomogeneous generators 𝑓1 , … , 𝑓𝑚 for ℝ[𝑉 × 𝑊 ∗ ]𝐺 . 1. Order the generators so that 𝑓1 , … , 𝑓𝑟 are of degree 0 and 𝑓𝑟+1 , … , 𝑓𝑠 are of degree 1 in 𝑊 ∗ . Discard 𝑓𝑠+1 , … , 𝑓𝑚 (of higher degree in 𝑊 ∗ ). 2. Choose a basis 𝑒 1 , … , 𝑒 𝑑 for 𝑊, and let 𝑒⊤1 , … , 𝑒⊤𝑑 be the dual basis, so an arbitrary element ℓ ∈ 𝑊 ∗ can be written 𝑑
ℓ = ∑ ℓ𝑖 𝑒⊤𝑖 , 𝑖=1
and ℓ(𝑒 𝑖 ) = ℓ𝑖 . 3. For 𝑗 = 𝑟 + 1, … , 𝑠, and for 𝑣 ∈ 𝑉, ℓ ∈ 𝑊 ∗ , let 𝐹𝑗 (𝑣) be the gradient of 𝑓𝑗 (𝑣, ℓ) with respect to ℓ ∈ 𝑊 ∗ , identified with an element of 𝑊 along the canonical isomorphism 𝑊 ∗∗ ≅ 𝑊; explicitly, 𝑑
𝐹𝑗 (𝑣) ≔ ∑ ( 𝑖=1
𝜕 𝑓 (𝑣, ℓ)) 𝑒 𝑖 . 𝜕ℓ𝑖 𝑗
Then each 𝐹𝑗 is a function 𝑉 → 𝑊. Output: Equivariant polynomial functions 𝐹𝑟+1 , … , 𝐹𝑠 from 𝑉 to 𝑊 such that any equivariant polynomial (or smooth, if 𝐺 is compact) map 𝑓 ∶ 𝑉 → 𝑊 can be written as 𝑠
𝑓 = ∑ 𝑝𝑗 (𝑓1 , … , 𝑓𝑟 )𝐹𝑗 , 𝑗=𝑟+1 𝑟
where 𝑝𝑗 ∶ ℝ → ℝ are arbitrary polynomial (or smooth, if 𝐺 is compact) functions. Remark: Because the 𝑓𝑗 (𝑣, ℓ)’s are being differentiated with respect to variables in which they are linear, Step 3 could alternatively have been stated: for each 𝑗, define 𝐹𝑗 as the vector in 𝑊 whose coefficients with respect to the 𝑒 𝑖 ’s are just the coefficients of the the ℓ𝑖 ’s in 𝑓𝑗 (𝑣, ℓ). representation to 𝑊, defined above. (If 𝐺 is compact, we can work in a coordinate system in which the action of 𝐺 on 𝑊 is orthogonal, and then we may ignore the distinction between 𝑊 and 𝑊 ∗ , as discussed above in the case of 𝐺 = O(𝑑).) We suppose we have an explicit set 𝑓1 , … , 𝑓𝑚 of polynomials that generate the algebra of invariant polynomials on the vector space 𝑉 × 𝑊 ∗ (denoted as ℝ[𝑉 × 𝑊 ∗ ]𝐺 )—in other words, they have the property that any invariant polynomial can be written as a polynomial in these. We also assume they are bihomogeneous, i.e., independently homogeneous in 𝑉 and in 𝑊 ∗ . To reduce notational clutter we suppress the maps specifying the actions of 𝐺 on 𝑉 and 𝑊 (which were called 𝜑 and 𝜓 in the previous section), writing the image of 𝑣 ∈ 𝑉 (respectively 𝑤 ∈ 𝑊, ℓ ∈ 𝑊 ∗ ) under the action of an element 𝑔 ∈ 𝐺 as 𝑔𝑣 (respectively 𝑔𝑤, 𝑔ℓ). We suppose 𝑓1 , … , 𝑓𝑟 are degree 0
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1209
in ℓ (so they are functions of 𝑣 alone), 𝑓𝑟+1 , … , 𝑓𝑠 are degree 1 in ℓ, and 𝑓𝑠+1 , … , 𝑓𝑚 are degree > 1 in ℓ. Now we consider an arbitrary 𝐺-equivariant polynomial function 𝑓 ∶ 𝑉 → 𝑊. ∗ We let ℓ ∈ 𝑊 be arbitrary, and, as in the previous section, we construct the function 𝑓̃ ∶ 𝑉 × 𝑊 ∗ → ℝ (𝑣, ℓ) ↦ ℓ(𝑓(𝑣)). Equivariance of 𝑓 ∶ 𝑉 → 𝑊 implies that 𝑓 ̃ is invariant: ̃ ̃ ℓ). 𝑓(𝑔𝑣, 𝑔ℓ) = ℓ∘𝑔−1 (𝑓(𝑔𝑣)) = ℓ∘𝑔−1 (𝑔𝑓(𝑣)) = ℓ(𝑓(𝑣)) = 𝑓(𝑣, From the invariance of 𝑓,̃ and the fact that 𝑓1 , … , 𝑓𝑚 generate the algebra of invariant polynomials on 𝑉 × 𝑊 ∗ , we have an equality of the form ̃ ℓ) = 𝑃(𝑓1 (𝑣), … , 𝑓𝑚 (𝑣, ℓ)), 𝑓(𝑣, (7) where 𝑃 ∈ ℝ[𝑋1 , … , 𝑋𝑚 ] is a polynomial. Note that 𝑓1 , … , 𝑓𝑟 do not depend on ℓ, while 𝑓𝑟+1 , … , 𝑓𝑚 do. We now fix 𝑣 ∈ 𝑉 and take the gradient 𝐷ℓ of both sides of (7) with respect to ℓ ∈ 𝑊 ∗ , viewed as an element of 𝑊.2 Choosing dual bases 𝑒 1 , … , 𝑒 𝑑 for 𝑊 and 𝑒⊤1 , … , 𝑒⊤𝑑 for 𝑊 ∗ , and writing ℓ = ∑ ℓ𝑖 𝑒⊤𝑖 , we can express the operator 𝐷ℓ acting on a smooth function 𝐹 ∶ 𝑊 ∗ → ℝ explicitly by the formula 𝑑 𝜕 𝐹) 𝑒 𝑖 . 𝐷ℓ 𝐹 = ∑ ( 𝜕ℓ𝑖 𝑖=1 Applying 𝐷ℓ to the left side of (7), we get 𝑑 𝑖=1
𝜕 ℓ(𝑓(𝑣))) 𝑒 𝑖 𝜕ℓ𝑖
𝑑
𝑛
̃ ℓ) = ∑ ( 𝐷ℓ 𝑓(𝑣, = ∑( 𝑖=1 𝑑
= ∑ (𝑒⊤𝑖 𝑓(𝑣)) 𝑒 𝑖 𝑖=1
= 𝑓(𝑣), so 𝐷ℓ recovers 𝑓 from 𝑓.̃ (Indeed, this was the point.) Meanwhile, applying 𝐷ℓ to the right side of (7), writing 𝜕𝑗 𝑃 for the partial derivative of 𝑃 with respect to its 𝑗th argument, and using the chain rule, we get 𝑚
𝐷ℓ 𝑃(𝑓1 , … , 𝑓𝑚 ) = ∑ 𝜕𝑗 𝑃(𝑓1 , … , 𝑓𝑚 )𝐷ℓ 𝑓𝑗 . 𝑗=1
Combining these, we conclude
as 𝑝𝑗 (𝑓1 (𝑣), … , 𝑓𝑟 (𝑣)), we may thus rewrite (8) as 𝑠
𝑓(𝑣) = ∑ 𝑝𝑗 (𝑓1 (𝑣), … , 𝑓𝑟 (𝑣))𝐷ℓ 𝑓𝑗 (𝑣, ℓ). 𝑗=𝑟+1
Finally, we observe that, 𝑓𝑗 being linear homogeneous in ℓ for 𝑗 = 𝑟 + 1, … , 𝑠, 𝐷ℓ 𝑓𝑗 (𝑣, ℓ) is degree 0 in ℓ, i.e., it does not depend on ℓ. So we may call it 𝐹𝑗 (𝑣) as in the algorithm, and we have finally expressed 𝑓 as the sum 𝑠 ∑𝑟+1 𝑝𝑗 (𝑓1 , … , 𝑓𝑟 )𝐹𝑗 , as promised.
6. Examples In this section we apply Malgrange’s method to parametrize equivariant functions in various examples. In all cases, for positive integers 𝑑, 𝑛, we take a group 𝐺 of 𝑑 × 𝑑 matrices, equipped with its canonical action on ℝ𝑑 , and we are looking for equivariant maps
from an 𝑛-tuple of vectors to a single vector. The underlying invariant theory is provided by Weyl’s The Classical Groups in each case. The orthogonal group. We parametrize maps that are equivariant for 𝐺 = O(𝑑). By the Riesz representation theorem, we can identify ℝ𝑑 with (ℝ𝑑 )∗ along the map 𝑣 ↦ ⟨𝑣, ⋅⟩, where ⟨⋅, ⋅⟩ is the standard dot product ⟨𝑣, 𝑤⟩ = 𝑣⊤ 𝑤. Since O(𝑑) preserves this product (by definition), this identification is equivariant with respect to the O(𝑑)-action, thus (ℝ𝑑 )∗ is isomorphic with ℝ𝑑 as a representation of O(𝑑). We may therefore ignore the difference between 𝑊 = ℝ𝑑 and 𝑊 ∗ in applying the algorithm. Thus, we consider the ring of O(𝑑)-invariant polynomials on tuples (𝑣 1 , … , 𝑣 𝑛 , ℓ) ∈ (ℝ𝑑 )𝑛 × ℝ𝑑 = 𝑉 × 𝑊.
𝑚
𝑓(𝑣) = ∑ 𝜕𝑗 𝑃(𝑓1 (𝑣), … , 𝑓𝑚 (𝑣, ℓ))𝐷ℓ 𝑓𝑗 (𝑣, ℓ).
(8)
𝑗=1 2In the background, we are using canonical isomorphisms to identify 𝑊 ∗ with
1210
𝜕𝑗 𝑃(𝑓1 (𝑣), … , 𝑓𝑟 (𝑣), 0, … , 0)
(ℝ𝑑 )𝑛 → ℝ𝑑
𝜕 ( ∑ ℓ 𝑒⊤ 𝑓(𝑣))) 𝑒 𝑖 𝜕ℓ𝑖 𝑗=1 𝑗 𝑗
all its tangent spaces, and 𝑊 ∗∗ with 𝑊.
Now we observe that 𝐷ℓ 𝑓𝑗 (𝑣) = 0 if 𝑗 ≤ 𝑟, because in those cases 𝑓𝑗 (𝑣) is constant with respect to ℓ. But meanwhile, the left side of (8) does not depend on ℓ, and it follows the right side does not either; thus we can evaluate it at our favorite choice of ℓ; we take ℓ = 0. Upon doing this, 𝐷ℓ 𝑓𝑗 (𝑣, ℓ)|ℓ=0 also becomes 0 for 𝑗 > 𝑠, because in these cases 𝑓(𝑣, ℓ) is homogeneous of degree at least 2 in ℓ, so its partial derivatives with respect to the ℓ𝑖 remain homogeneous degree at least 1 in ℓ, thus they vanish at ℓ = 0. Meanwhile, 𝑓𝑗 (𝑣, ℓ)|ℓ=0 itself vanishes for 𝑗 = 𝑟 + 1, … , 𝑚, so that the (𝑟 + 1)st to 𝑚th arguments of each 𝜕𝑗 𝑃 vanish. Abbreviating
We begin with bihomogeneous generators for this ring. By a classical theorem of Weyl known as the first fundamental theorem for O(𝑑), they are the dot products 𝑣⊤𝑖 𝑣𝑗 for 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛; 𝑣⊤𝑖 ℓ for 1 ≤ 𝑖 ≤ 𝑛; and ℓ⊤ ℓ. These are ordered by
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
their degree in 𝑊 as in Step 1 of the algorithm; we discard ℓ⊤ ℓ as it is degree > 1 in 𝑊. We can work in the standard basis 𝑒 1 , … , 𝑒 𝑑 for 𝑊 = ℝ𝑑 , and we have identified it with its dual, so Step 2 is done as well. Applying Step 3, we take the generators of degree 1 in 𝑊, which are 𝑓1 = 𝑣⊤1 ℓ, … , 𝑓𝑛 = 𝑣⊤𝑛 ℓ.
𝑑
𝜕 ⊤ 𝐹𝑗 (𝑣 1 , … , 𝑣 𝑛 ) = ∑ ( 𝑣𝑗 ℓ) 𝑒 𝑖 𝜕ℓ 𝑖 𝑖=1 𝑑
= ∑ (𝑣𝑗 )𝑖 𝑒 𝑖 𝑖=1
= 𝑣𝑗 , where (𝑣𝑗 )𝑖 denotes the 𝑖th coordinate of 𝑣𝑗 . Thus the 𝐹𝑗 (𝑣 1 , … , 𝑣 𝑛 ) yielded by the algorithm is nothing but projection to the 𝑗th input vector. Meanwhile, the 𝑓1 (𝑣), … , 𝑓𝑟 (𝑣) of the algorithm are the algebra generators 𝑣⊤𝑖 𝑣𝑗 of degree zero in ℓ; thus the output of the algorithm is precisely the representation described in (5) and the paragraph following. The Lorentz and symplectic groups. If we replace O(𝑑) with the Lorentz group O(1, 𝑑 − 1), or (in case 𝑑 is even) the symplectic group Sp(𝑑), the entire discussion above can be copied verbatim, except with the standard dot product being replaced everywhere by the Minkowski product 𝑣⊤ diag(−1, 1, … , 1)𝑣 in the former case, or the standard skew-symmetric bilinear form 𝑣⊤ 𝐽𝑤 (where 𝐽 is block diagonal with 2×2 𝜋/2-rotation matrices as blocks) in the latter. We also need to use these respective products in place of the standard dot product to identify ℝ𝑑 equivariantly with its dual representation. The key point is that the invariant theory works the same way (see [12, Sec. 9.3] for a concise modern treatment, noting that 𝑂(𝑑) and 𝑂(1, 𝑑 − 1) have the same complexification). The special orthogonal group. Now we consider 𝐺 = SO(𝑑). We can once again identify ℝ𝑑 with its dual. However, this time, in Step 1, the list of bihomogeneous generators is longer: in addition to the dot products 𝑣⊤𝑖 𝑣𝑗 and 𝑣⊤𝑖 ℓ (and ℓ⊤ ℓ, which will be discarded), we have 𝑑 × 𝑑 determinants det(𝑣 𝑖1 , … , 𝑣 𝑖𝑑 ) for 1 ≤ 𝑖1 ≤ ⋯ ≤ 𝑖𝑑 ≤ 𝑛, and det(𝑣 𝑖1 , … , 𝑣 𝑖𝑑−1 , ℓ) for 1 ≤ 𝑖1 ≤ ⋯ ≤ 𝑖𝑑−1 ≤ 𝑛. The former are of degree 0 in ℓ while the latter are of degree 1. Thus, the latter contribute to our list of 𝐹𝑗 ’s in Step 3, while the former figure in the arguments of the 𝑝𝑗 ’s. Carrying out Step 3 in this case, we find that 𝑑 𝑖=1
SEPTEMBER 2023
𝑛
𝑓 = ∑ 𝑝 𝑖 ((𝑣𝑗⊤ 𝑣 𝑘 ){𝑗,𝑘}∈([𝑛]) , det(𝑣|𝑆 )𝑆∈([𝑛]) ) 𝑣 𝑖
𝜕 det(𝑣 𝑖1 , … , 𝑣 𝑖𝑑−1 , ℓ)) 𝑒 𝑖 𝜕ℓ𝑖
2
𝑖=1
+
Taking the gradients, we get
∑(
is exactly the generalized cross product of the 𝑑 − 1 vectors 𝑣 𝑖1 , … , 𝑣 𝑖𝑑−1 . Thus we must add to the 𝐹𝑗 ’s a generalized cross-product for each ( 𝑛 )-subset of our input vectors; in 𝑑−1 the end the parametrization of equivariant maps looks like
𝑑
∑ 𝑝𝑆′ ((𝑣𝑗⊤ 𝑣 𝑘 ){𝑗,𝑘}∈([𝑛]) , det(𝑣|𝑆 )𝑆∈([𝑛]) ) 𝑣𝑆′ , 2 𝑑 𝑆 ′ ∈( [𝑛] ) 𝑑−1
where [𝑛] ≔ {1, … , 𝑛}, ([𝑛]) represents the set of 𝑘-subsets 𝑘 of [𝑛], det(𝑣|𝑆 ) is shorthand for det(𝑣 𝑖1 , … , 𝑣 𝑖𝑑 ) where 𝑆 = {𝑖1 , … , 𝑖𝑑 }, and 𝑣𝑆′ is shorthand for the generalized cross product of the 𝑑 − 1 vectors 𝑣 𝑖1 , … , 𝑣 𝑖𝑑−1 where 𝑆 ′ = {𝑖1 , … , 𝑖𝑑−1 }. The special linear group. We include an example where we cannot identify ℝ𝑑 with its dual representation. Namely, we take 𝐺 = SL(𝑑, ℝ). As 𝐺 does not preserve any bilinear form on ℝ𝑑 , we must regard (ℝ𝑑 )∗ as a distinct representation. Thus in Step 1 we must consider the polynomial invariants on (𝑣 1 , … , 𝑣 𝑛 , ℓ) ∈ (ℝ𝑑 )𝑛 × (ℝ𝑑 )∗ . A generating set of homogeneous invariants is given by the canonical pairings ℓ(𝑣 𝑖 ), 𝑖 = 1, … , 𝑛, and the 𝑑 × 𝑑 determinants det(𝑣|𝑆 ), 𝑆 ∈ ([𝑛]), where we have used the same 𝑑 shorthand as above for a 𝑑-subset 𝑆 of [𝑛] and the 𝑑 × 𝑑 determinant det(𝑣|𝑆 ) whose columns are the 𝑣 𝑖 ’s indexed by 𝑆. The former are degree 1 in ℓ while the latter are degree 𝑑 0. So, writing ℓ = ∑𝑖=1 ℓ𝑖 𝑒⊤𝑖 as in Step 2, we have 𝑑
𝑓𝑟+𝑗 (𝑣, ℓ) = ℓ(𝑣𝑗 ) = ∑ ℓ𝑖 ⋅ (𝑣𝑗 )𝑖 , 𝑖=1
and again in Step 3 we get 𝐹𝑗 (𝑣) = 𝑣𝑗 , with the computation identical to the one above for the orthogonal group. Thus the algorithm outputs that an arbitrary 𝐺 = SL(𝑑, ℝ)-equivariant polynomial map has the form 𝑛
𝑓 = ∑ 𝑝𝑗 (det(𝑣|𝑆 )𝑆∈([𝑛]) ) 𝑣𝑗 . 𝑗=1
𝑑
The symmetric group. We give an example where the algebra of invariants is generated by something other than determinants and bilinear forms, so the 𝐹𝑗 ’s output by the algorithm are not just the 𝑣𝑗 ’s themselves and generalized cross products. Take 𝐺 = 𝑆 𝑑 , the symmetric group on 𝑑 letters, acting on ℝ𝑑 by permutations of the coordinates. As 𝑆 𝑑 realized in this way is a subgroup of O(𝑑), we can once again identify ℝ𝑑 with its dual. By the fundamental theorem on symmetric polynomials, the algebra of 𝐺-invariant polynomials on a single
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1211
vector 𝑣 ∈ ℝ𝑑 is given by the elementary symmetric polynomials in the coordinates 𝜎1 (𝑣) = ∑(𝑣)𝛼 , 𝜎2 (𝑣) = ∑𝛼 0 small, 𝑋𝜀 does not contain any of the outliers. The figure depicts the Vietoris–Rips complexes ℛ(𝑋𝜀𝑗 ; 𝑟 𝑖 ) for some values 𝑟1 < 𝑟2 < 𝑟3 and 𝜀1 < 𝜀2 .
The need for a more general framework. Practical data analysis scenarios necessitate methods that can cope with more than one parameter. For instance, a dataset 𝑋 ⊂ ℝ𝑑 might have nonuniform density (see Figure 2), possibly due to noise produced during the acquisition process or due to underlying scientific phenomena. In such scenarios, in addition to a geometric scale parameter, one may wish to incorporate a (co)density parameter and obtain an increasing family of simplicial complexes indexed by ℝ2 with the product order ≤, i.e., (𝑎1 , 𝑎2 ) ≤ (𝑏1 , 𝑏2 ) iff 𝑎𝑖 ≤ 𝑏𝑖 for 𝑖 = 1, 2. The result of applying the homology functor to such an ℝ2 -indexed family, or more generally, to an analogously defined ℝ𝑛 -indexed family, is called a multiparameter persistence module [CZ09]. There are scenarios that give rise to persistence modules indexed by posets other than other than ℝ𝑛 . For example, the time-evolution of the positions of animals during collective motion can lead to considering zigzag posets ZZ ≔ {1 ↔ 2 ↔ ⋯ ↔ 𝑚}, where, for each 𝑖, 𝑖 ↔ 𝑖 + 1 stands for either 𝑖 < 𝑖 + 1 or 𝑖 > 𝑖 + 1 [CdS10, KM22]. Beyond these scenarios, and with a great deal of foresight, in [BdSS15], Bubenik et al. proposed to consider the phenomenon of persistence for parameters taken from general posets. Remark 1. In this article, we restrict ourselves to persistence modules over finite connected posets P, which are general enough indexing sets for modeling the type of datasets that arise in practice. For example, for an integer 𝑚 > 0, the linearly ordered poset L𝑚 ≔ {1 < 2 < ⋯ < 𝑚} (2) is often used to succinctly encode ℝ- or ℝ≥0 -modules.
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
1215
Example 1 (L𝑚 -module). Given a finite metric space 𝑋, the Vietoris–Rips filtration of 𝑋 consists of finitely many distinct simplicial complexes. Hence, for the persistence module 𝑀 ≔ H𝑘 (ℛ(𝑋, −)) ∶ ℝ≥0 → 𝐯𝐞𝐜, there exist 0 = 𝑟1 < ⋯ < 𝑟𝑚 in ℝ≥0 so that 𝑀(𝑟, 𝑠) is a linear isomorphism whenever 𝑟, 𝑠 ∈ [𝑟 𝑖 , 𝑟 𝑖+1 ) with 𝑟 ≤ 𝑠, for some 𝑖 = 1, … , 𝑚 where 𝑟𝑚+1 ≔ ∞. In such a case, the L𝑚 -module H𝑘 (ℛ(𝑋; 𝑟1 )) → ⋯ → H𝑘 (ℛ(𝑋; 𝑟𝑚 )) determines the isomorphism type of 𝑀. Connections with quiver representations. A quiver is a finite directed graph. Given a quiver 𝑄, the assignment of a finite-dimensional vector space to each vertex and a linear map to each arrow (between the participating vector spaces) is called a representation of 𝑄; see [DW05]. A finite poset P induces the quiver 𝑄P on the vertex set P with arrows 𝑝 → 𝑝″ for all pairs 𝑝 < 𝑝″ such that there is no 𝑝′ ∈ P with 𝑝 < 𝑝′ < 𝑝″ . Note that a P-module 𝑀 canonically induces a representation of 𝑄P : To each vertex 𝑝 of 𝑄P , the vector space 𝑀(𝑝) is assigned. To each arrow 𝑝 → 𝑝″ of 𝑄P , the linear map 𝑀(𝑝, 𝑝″ ) is assigned. Note that the resulting representation of 𝑄P satisfies the following commutativity condition: For every 𝑝, 𝑝″ ∈ P, if there are multiple directed paths from 𝑝 to 𝑝″ in 𝑄P , then the compositions of the linear maps along each of those paths agrees with 𝑀(𝑝, 𝑝″ ). Conversely, a representation of 𝑄P satisfying the commutativity condition induces a Pmodule in the obvious way.
Rank Invariant and Persistence Diagrams We first recall the classical notions of rank invariant and persistence diagrams of L𝑚 -modules [CSEH07,CZ09], and then we describe a natural way to extend those notions to the setting of P-modules. Rank invariant. For any integer 𝑚 > 0 and any 𝑏 ≤ 𝑑 ∈ L𝑚 , we call [𝑏, 𝑑] ≔ {𝑏, … , 𝑑} an interval in L𝑚 . Let Int(L𝑚 ) be the set of all intervals in L𝑚 . The rank invariant of a given persistence module 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜 is defined to be the function rk𝑀 ∶
Int(L𝑚 ) → [𝑏, 𝑑] ↦
ℤ≥0
(3)
rank(𝑀(𝑏, 𝑑)).
It is important to note that (i) the rank invariant is preserved under isomorphism and that (ii) it encodes the dimensions of all vector spaces 𝑀(𝑏) for 𝑏 ∈ L𝑚 since we have rk𝑀 ([𝑏, 𝑑]) = dim(𝑀(𝑏)) whenever 𝑏 = 𝑑. Note also that (iii) rk𝑀 is monotone, i.e., rk𝑀 ([𝑏′ , 𝑑 ′ ]) ≤ rk𝑀 ([𝑏, 𝑑]) ′
(4)
′
whenever [𝑏, 𝑑] ⊆ [𝑏 , 𝑑 ]; This follows from the fact that the map 𝑀(𝑏′ , 𝑑 ′ ) factors through the map 𝑀(𝑏, 𝑑). By convention, we set 0 = rk𝑀 ([0, 𝑑]) = rk𝑀 ([𝑏, 𝑚 + 1]) for every 𝑏, 𝑑 ∈ L𝑚 . Next, we utilize the rank invariant to compute, for each [𝑏, 𝑑] ∈ Int(L𝑚 ), a count of the “persistent features” that start at 𝑏 and end at 𝑑, leading to the notion of persistence diagram of 𝑀.
Example 2. The poset L𝑚 from Equation (2) induces the quiver 𝑄L𝑚 ∶ 1 → 2 → ⋯ → 𝑚. Example 3. For integers 𝑚, 𝑛 > 0, consider the poset L𝑚 × L𝑛 equipped with the product order. Then, for example, the poset (L2 × L3 ) induces the quiver (2, 1)
(2, 2)
(2, 3)
(1, 1)
(1, 2)
(1, 3).
𝑄L2 ×L3 ∶
The poset L𝑚 × L𝑛 is often used to encode ℝ2 -modules. Example 4. The following commutative diagram defines an (L2 ×L3 )-module which can be obtained by applying the 0-th homology functor H0 (−, 𝕜) to the (L2 × L3 )-indexed simplicial filtration depicted next: 0
k
(1 )
k2
(
1 1
1
(0) 0
1216
)
Figure 3. The rank invariant and the persistence diagram of a given 𝑀 ∶ L13 → 𝐯𝐞𝐜. At each point (𝑏, 𝑑) with 𝑏 ≤ 𝑑 in the L13 × L13 grid, nonzero rk𝑀 ([𝑏, 𝑑]) and dgm𝑀 ([𝑏, 𝑑]) are recorded (e.g., rk𝑀 ([5, 6]) = 2 and dgm𝑀 ([5, 6]) = 0).2
Persistence diagrams. Fix an integer 𝑚 > 0. Let 𝑝′ ∈ L𝑚 and a vector 𝑣 ∈ 𝑀(𝑝′ ). We say that 𝑣 is born at the point 𝑏(𝑣) ∈ L𝑚 where 𝑏(𝑣) ≔ min{𝑝 ∈ L𝑚 ∶ 𝑣 ∈ im (𝑀(𝑝, 𝑝′ ))}.
.
We say that 𝑣 lives until the point 𝑑(𝑣) ∈ L𝑚 (or dies at 𝑑(𝑣) + 1) where
k 1
k
1
k
(5)
𝑑(𝑣) ≔ max{𝑝″ ∈ L𝑚 ∶ 𝑣 ∉ ker (𝑀(𝑝′ , 𝑝″ ))}.
.
(6)
2
If 𝑀 encodes an ℝ or ℝ≥0 -module (as in the scenario of Example 1), the visualization of dgm𝑀 may require a scale readjustment [CSEH07, p.106].
NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY
VOLUME 70, NUMBER 8
The lifespan of 𝑣 is the interval [𝑏(𝑣), 𝑑(𝑣)]. The persistence diagram of a given persistence module 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜 is then defined to be the function

dgm𝑀 ∶ Int(L𝑚) → ℤ≥0

sending each [𝑏, 𝑑] ∈ Int(L𝑚) to the maximal number of linearly independent vectors in im(𝑀(𝑏, 𝑑)) whose lifespans are exactly [𝑏, 𝑑].

From rk𝑀 to dgm𝑀. It turns out that dgm𝑀 can be computed in terms of rk𝑀. Indeed, let 𝑘 ≔ rk𝑀([𝑏, 𝑑]). This implies that there exist 𝑘 linearly independent vectors 𝑣1, 𝑣2, …, 𝑣𝑘 in im(𝑀(𝑏, 𝑑)) ⊆ 𝑀(𝑑) that are born at 𝑏 or before 𝑏, and live until 𝑑 or later, i.e., [𝑏(𝑣𝑖), 𝑑(𝑣𝑖)] ⊇ [𝑏, 𝑑]. Hence,

𝑠(𝑏, 𝑑) ≔ rk𝑀([𝑏, 𝑑]) − rk𝑀([𝑏 − 1, 𝑑])

equals the maximal number of independent vectors in im(𝑀(𝑏, 𝑑)) that were born at precisely 𝑏 and live until 𝑑 or later. Similarly,

𝑠(𝑏, 𝑑 + 1) ≔ rk𝑀([𝑏, 𝑑 + 1]) − rk𝑀([𝑏 − 1, 𝑑 + 1])

equals the maximal number of independent vectors in im(𝑀(𝑏, 𝑑 + 1)) that were born at precisely 𝑏 and live until 𝑑 + 1 or later. Hence, the difference 𝑠(𝑏, 𝑑) − 𝑠(𝑏, 𝑑 + 1) equals the maximal number of independent vectors in im(𝑀(𝑏, 𝑑)) whose lifespans are exactly [𝑏, 𝑑], and thus we arrive at the following formula (with the convention that rk𝑀 vanishes on intervals not contained in L𝑚):

dgm𝑀([𝑏, 𝑑]) = rk𝑀([𝑏, 𝑑]) − rk𝑀([𝑏 − 1, 𝑑]) − rk𝑀([𝑏, 𝑑 + 1]) + rk𝑀([𝑏 − 1, 𝑑 + 1]) ≥ 0. (7)

In practice, persistence diagrams are represented as points (with multiplicity) in the two-dimensional grid: only those intervals [𝑏, 𝑑] for which dgm𝑀([𝑏, 𝑑]) > 0 are recorded; see Figure 3. The earliest appearance of formula (7) in the TDA community that is known to the authors is [LF97]. This expression appears prominently in the work of Cohen-Steiner et al. on the stability of persistence diagrams [CSEH07].

From dgm𝑀 to rk𝑀. Let us consider Int(L𝑚) as a poset ordered by containment ⊇. The poset Int(L𝑚) consists of 𝑚(𝑚 + 1)/2 elements. By the order-extension principle, we can index the intervals in Int(L𝑚) as 𝐼1, 𝐼2, …, 𝐼𝑚(𝑚+1)/2 so that 𝐼𝑖 ⊇ 𝐼𝑗 implies 𝑗 ≥ 𝑖. Below we also use the convenient notation 𝐼𝑖 = [𝑏𝑖, 𝑑𝑖]. Since Int(L𝑚) consists of 𝑚(𝑚 + 1)/2 elements, we identify rk𝑀 with a vector in ℝ^(𝑚(𝑚+1)/2) whose 𝑖-th entry is rk𝑀(𝐼𝑖). Similarly, dgm𝑀 is identified with a vector of the same dimension. Consider the square matrix 𝜇 of length 𝑚(𝑚 + 1)/2 whose (𝑖, 𝑗)-entry is

𝜇𝑖𝑗 ≔ 1 if 𝐼𝑖 = 𝐼𝑗 or 𝐼𝑖 = [𝑏𝑗 − 1, 𝑑𝑗 + 1]; −1 if 𝐼𝑖 = [𝑏𝑗 − 1, 𝑑𝑗] or 𝐼𝑖 = [𝑏𝑗, 𝑑𝑗 + 1]; 0 otherwise. (8)

The matrix 𝜇 is upper-triangular and all of its diagonal entries are 1. Therefore, 𝜇 is invertible. Note that Equation (7) amounts to

rk𝑀 ⋅ 𝜇 = dgm𝑀, (9)

which implies that dgm𝑀 and rk𝑀 determine each other. Equation (9) permits computing rk𝑀 in terms of dgm𝑀 as follows. First, one verifies that the inverse of 𝜇 has entries

(𝜇^(−1))𝑖𝑗 = 1 if 𝐼𝑖 ⊇ 𝐼𝑗; 0 otherwise.

Therefore, the equality rk𝑀 = dgm𝑀 ⋅ 𝜇^(−1) implies that

rk𝑀(𝐼𝑗) = ∑_{𝑖 ∶ 𝐼𝑖⊇𝐼𝑗} dgm𝑀(𝐼𝑖) for all 1 ≤ 𝑗 ≤ 𝑚(𝑚 + 1)/2. (10)
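Formula (7) translates directly into code. The following is a minimal Python sketch of ours (the dictionary encoding of rk𝑀 is our own convention, not the article's) that recovers dgm𝑀 from the rank invariant of an L𝑚-module:

```python
# A minimal sketch (our own convention, not the article's): the rank
# invariant of an L_m-module is stored as rk[(b, d)] for 1 <= b <= d <= m,
# and Equation (7) recovers the persistence diagram.

def diagram_from_rank(rk, m):
    """dgm([b,d]) = rk([b,d]) - rk([b-1,d]) - rk([b,d+1]) + rk([b-1,d+1])."""
    def r(b, d):
        # rk vanishes on out-of-range intervals (the boundary convention).
        return rk.get((b, d), 0) if 1 <= b <= d <= m else 0

    dgm = {}
    for b in range(1, m + 1):
        for d in range(b, m + 1):
            v = r(b, d) - r(b - 1, d) - r(b, d + 1) + r(b - 1, d + 1)
            if v > 0:               # only nonzero multiplicities are recorded
                dgm[(b, d)] = v
    return dgm

# An L_2-module with rk([1,1]) = 2, rk([2,2]) = 3, rk([1,2]) = 1:
print(diagram_from_rank({(1, 1): 2, (2, 2): 3, (1, 2): 1}, 2))
# {(1, 1): 1, (1, 2): 1, (2, 2): 2}
```

Summing dgm𝑀 over the intervals containing a fixed 𝐼𝑗, as in Equation (10), recovers rk𝑀.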
Example 5 (An application of Equation (10)). The equality rk𝑀([5, 6]) = 2 in Figure 3 can be derived from the fact that there are exactly two points (𝑏, 𝑑) in the upper-left quadrant with corner point (5, 6) for which dgm𝑀([𝑏, 𝑑]) = 1.

Example 6 (Persistence diagram of an L2-module). Let 𝑀 be an L2-module

𝑀 ∶ 𝑀(1) —𝑀(1,2)→ 𝑀(2).

Let the rank invariant of 𝑀 be given by

rk𝑀 ∶ [1, 2] ↦ 𝑘, [1, 1] ↦ 𝑑1, [2, 2] ↦ 𝑑2, (11)

for integers 0 ≤ 𝑘 ≤ 𝑑1, 𝑑2. Then, the persistence diagram of 𝑀 is given by

dgm𝑀 ∶ [1, 2] ↦ 𝑘, [1, 1] ↦ 𝑑1 − 𝑘, [2, 2] ↦ 𝑑2 − 𝑘, (12)

where 𝑑1 − 𝑘 and 𝑑2 − 𝑘 are the dimensions of the kernel and cokernel of the map 𝑀(1, 2), respectively. Under the identifications

rk𝑀 ↔ (𝑘, 𝑑1, 𝑑2) and dgm𝑀 ↔ (𝑘, 𝑑1 − 𝑘, 𝑑2 − 𝑘),

Equation (9) in this case reads:

rk𝑀 ⋅ [1 −1 −1; 0 1 0; 0 0 1] = dgm𝑀.
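As a numerical sanity check (our own sketch, not code from the article), one can verify this instance of Equation (9) with numpy, using the interval ordering 𝐼1 = [1, 2], 𝐼2 = [1, 1], 𝐼3 = [2, 2]:

```python
import numpy as np

# Check Equation (9) for the L_2-module of Example 6 (our own sketch),
# with intervals ordered as I_1 = [1,2], I_2 = [1,1], I_3 = [2,2].
k, d1, d2 = 2, 3, 5                      # any integers with 0 <= k <= d1, d2
mu = np.array([[1, -1, -1],
               [0,  1,  0],
               [0,  0,  1]])
rk = np.array([k, d1, d2])               # rk_M as a row vector
assert (rk @ mu == np.array([k, d1 - k, d2 - k])).all()   # = dgm_M

# The inverse of mu has (i, j)-entry 1 iff I_i contains I_j, cf. Equation (10):
print(np.rint(np.linalg.inv(mu)).astype(int))
# [[1 1 1]
#  [0 1 0]
#  [0 0 1]]
```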
Remark 2. From Equation (11), we infer that there are bases of the vector spaces 𝑀(1) and 𝑀(2) in which the map 𝑀(1, 2) is given by the (𝑑2 × 𝑑1)-block matrix

[𝐼𝑘 0; 0 0],

where 𝐼𝑘 is the 𝑘 × 𝑘 identity matrix. In other words, there are linear isomorphisms ℎ1 and ℎ2 such that the diagram below commutes:

𝑀(1) ——𝑀(1,2)——→ 𝑀(2)
ℎ1 ↓ ≅                ≅ ↓ ℎ2        (13)
𝕜^𝑘 ⊕ 𝕜^(𝑑1−𝑘) ——[𝐼𝑘 0; 0 0]——→ 𝕜^𝑘 ⊕ 𝕜^(𝑑2−𝑘)
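Over 𝕜 = ℝ, the isomorphisms ℎ1 and ℎ2 of Remark 2 can be produced numerically; the following is a sketch of ours in which a singular value decomposition supplies the basis change (one convenient choice, not a prescription from the article):

```python
import numpy as np

# A sketch of ours (over the field R): given the matrix A of M(1,2),
# build invertible h1, h2 with  h2 @ A @ inv(h1) = [[I_k, 0], [0, 0]].
rng = np.random.default_rng(0)
d1, d2 = 4, 3
A = rng.integers(-2, 3, size=(d2, d1)).astype(float)  # matrix of M(1,2)

U, s, Vt = np.linalg.svd(A)              # A = U @ diag(s) @ Vt
k = int(np.sum(s > 1e-10))               # k = rank M(1,2)
scale = np.ones(d2)
scale[:k] = 1.0 / s[:k]                  # rescale the k nonzero singular values to 1
h2 = np.diag(scale) @ U.T                # invertible since U is orthogonal
h1 = Vt                                  # orthogonal, hence invertible

normal_form = h2 @ A @ np.linalg.inv(h1)
print(k)
print(np.round(normal_form, 10))         # I_k in the top-left block, zeros elsewhere
```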
Indecomposables and Barcodes

There is another interpretation of the persistence diagram which is tied to the notion of indecomposable decompositions of quiver representations.

Indecomposable decompositions. Given any two P-modules 𝑀 and 𝑁, the direct sum 𝑀 ⊕ 𝑁 is the P-module defined via pointwise direct sum: (𝑀 ⊕ 𝑁)(𝑝) ≔ 𝑀(𝑝) ⊕ 𝑁(𝑝) for 𝑝 ∈ P and (𝑀 ⊕ 𝑁)(𝑝, 𝑝′) ≔ 𝑀(𝑝, 𝑝′) ⊕ 𝑁(𝑝, 𝑝′) for 𝑝 ≤ 𝑝′ in P, i.e.,

(𝑀 ⊕ 𝑁)(𝑝, 𝑝′) ≔ [𝑀(𝑝, 𝑝′) 0; 0 𝑁(𝑝, 𝑝′)].

A nonzero P-module 𝑀 is indecomposable if, whenever 𝑀 = 𝑀1 ⊕ 𝑀2 for some P-modules 𝑀1 and 𝑀2, either 𝑀1 = 0 or 𝑀2 = 0. We will refer to such modules as P-indecomposables. Due to the Krull–Schmidt–Remak–Azumaya principle, any P-module can be decomposed as a direct sum of P-indecomposables:

Theorem 1. Any P-module 𝑀 admits a decomposition

𝑀 ≅ ⨁_{𝛼∈𝐴} 𝑀𝛼, (14)

where each 𝑀𝛼 is P-indecomposable. Such a decomposition is unique up to isomorphism and up to reordering of the summands.

Theorem 1 indicates that understanding the structure of P-modules reduces to the problem of elucidating the structure of the P-indecomposables.

Example 7. Diagram (13) shows that the L2-module 𝑀 therein decomposes into a direct sum of 𝑘, 𝑑1 − 𝑘, and 𝑑2 − 𝑘 copies of the L2-modules

𝕜 ⟶¹ 𝕜,  𝕜 ⟶ 0,  0 ⟶ 𝕜,

respectively. It is not hard to verify that all these L2-modules are indecomposable, and thus Diagram (13) is an example of a decomposition into a direct sum of indecomposable modules. Notice that this decomposition is reflected by the specification of dgm𝑀 in Equation (12). Since 𝑀 was an arbitrary L2-module, this decomposition further implies that the three L2-modules shown above constitute an exhaustive list of all the L2-indecomposables. Therefore, we conclude that rk𝑀 (and therefore dgm𝑀) determines the isomorphism type of 𝑀. We will shortly show that this is also the case for an arbitrary L𝑚-module 𝑀 with 𝑚 > 2.
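In coordinates, the pointwise direct sum is simply block-diagonal concatenation of the internal maps. A small numpy sketch of ours (scipy.linalg.block_diag would serve equally well) reassembles the matrix of Diagram (13) for 𝑘 = 1 and 𝑑1 = 𝑑2 = 2 from the three indecomposables of Example 7:

```python
import numpy as np

def direct_sum(*mats):
    """Block-diagonal direct sum of linear maps: the matrix of
    (M + N)(p, p') built from the matrices of M(p, p') and N(p, p').
    (Our own helper, not code from the article.)"""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

# Internal maps of the three L_2-indecomposables of Example 7:
k_to_k = np.array([[1.0]])      # k --1--> k
k_to_0 = np.zeros((0, 1))       # k -----> 0   (a 0x1 matrix)
zero_to_k = np.zeros((1, 0))    # 0 -----> k   (a 1x0 matrix)
print(direct_sum(k_to_k, k_to_0, zero_to_k))   # [[1. 0.], [0. 0.]]
```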
Barcode of an L𝑚-module. A classical theorem by Pierre Gabriel (see [DW05]) establishes a far-reaching generalization of the previous example and, in particular, implies that the L𝑚-indecomposables are exactly those 𝑉^[𝑏,𝑑] ∶ L𝑚 → 𝐯𝐞𝐜 that look like

0 ⟶ 0 ⋯ 0 ⟶ 𝕜 ⟶¹ 𝕜 ⋯ 𝕜 ⟶¹ 𝕜 ⟶ 0 ⋯ 0 ⟶ 0, (15)

where (from left to right) the first occurrence of 𝕜 is at some index 𝑏 ∈ L𝑚 (for “birth”) and the last occurrence is at an index 𝑑 ∈ L𝑚 (for “death”).³ More precisely, given any [𝑏, 𝑑] ∈ Int(L𝑚), 𝑉^[𝑏,𝑑] is the persistence module over L𝑚 where: (i) 𝑉^[𝑏,𝑑](𝑖) = 𝕜 for 𝑖 ∈ [𝑏, 𝑑] and 𝑉^[𝑏,𝑑](𝑖) = 0 otherwise, and (ii) all internal morphisms between adjacent copies of 𝕜 are 1, and all other morphisms are (necessarily) 0. Any such 𝑉^[𝑏,𝑑] is called an interval (persistence) module.

The considerations above imply that each 𝑀𝛼 appearing in the decomposition in Equation (14) of 𝑀, assuming P = L𝑚, is isomorphic to 𝑉^[𝑏𝛼,𝑑𝛼] for some [𝑏𝛼, 𝑑𝛼] ∈ Int(L𝑚). Therefore, to the L𝑚-module 𝑀 we can associate the multiset barc(𝑀), the barcode of 𝑀, consisting of all the intervals [𝑏𝛼, 𝑑𝛼], 𝛼 ∈ 𝐴 (counted with multiplicity), appearing in the decomposition of 𝑀 given above. Furthermore, these considerations imply that barc(𝑀) is a complete invariant of 𝑀, i.e., barc(𝑀) determines the isomorphism type of 𝑀.

Persistence diagrams and barcodes determine each other. It holds that dgm𝑀([𝑏, 𝑑]) equals the multiplicity of the interval [𝑏, 𝑑] in barc(𝑀). This claim follows from Theorem 4 in the present article, a result which is applicable in the context of general posets.⁴ Since barc(𝑀) is a complete invariant of 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜, so is dgm𝑀. This establishes the previous claim that both dgm𝑀 and rk𝑀 determine the isomorphism type of a given 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜.

Remark 3. Persistence diagrams are known to be stable (i.e., Lipschitz continuous under suitable metrics) [CSEH07].

Barcode of a P-module. Given any P-module 𝑀, by Theorem 1, one could, in principle, consider the multiset {𝑀𝛼}𝛼∈𝐴 of indecomposable summands as a complete invariant of 𝑀. However, other than for a handful of exceptional posets,⁵ and even in simple cases such as the one mentioned in Example 8 below, the collection of all P-indecomposables can be tremendously complex.
where (from left to right) the first occurrence of 𝕜 is at some index 𝑏 ∈ L𝑚 (for “birth”) and the last occurrence is at an index 𝑑 ∈ L𝑚 (for “death”).3 More precisely, given any [𝑏, 𝑑] ∈ Int(L𝑚 ), 𝑉 [𝑏,𝑑] is the persistence module over L𝑚 where: (i) 𝑉 [𝑏,𝑑] (𝑖) = 𝕜 for 𝑖 ∈ [𝑏, 𝑑] and 𝑉 [𝑏,𝑑] (𝑖) = 0 otherwise, and (ii) all internal morphisms between adjacent copies of 𝕜 are 1, and all other morphisms are (necessarily) 0. Any such 𝑉 [𝑏,𝑑] is called an interval (persistence) module. The considerations above imply that each 𝑀𝛼 appearing in the decomposition in Equation (14) of 𝑀, assuming P = L𝑚 , is isomorphic to 𝑉 [𝑏𝛼 ,𝑑𝛼 ] for some [𝑏𝛼 , 𝑑𝛼 ] ∈ Int(L𝑚 ). Therefore, to the L𝑚 -module 𝑀 we can associate the multiset barc(𝑀), the barcode of 𝑀, consisting of all the intervals [𝑏𝛼 , 𝑑𝛼 ], 𝛼 ∈ 𝐴 (counted with multiplicity) appearing in the decomposition of 𝑀 given above. Furthermore, these considerations imply that barc(𝑀) is a complete invariant of 𝑀, i.e., barc(𝑀) determines the isomorphism type of 𝑀. Persistence diagrams and barcodes determine each other. It holds that dgm𝑀 ([𝑏, 𝑑]) equals the multiplicity of the interval [𝑏, 𝑑] in barc(𝑀). This claim follows from Theorem 4 in the present article, a result which is applicable in the context of general posets.4 Since barc(𝑀) is a complete invariant of 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜, so is dgm𝑀 . This establishes the previous claim that both dgm𝑀 and rk𝑀 determine the isomorphism type of a given 𝑀 ∶ L𝑚 → 𝐯𝐞𝐜. Remark 3. Persistence diagrams are known to be stable (i.e., Lipschitz continuous under suitable metrics) [CSEH07]. Barcode of a P-module. Given any P-module 𝑀, by Theorem 1, one could, in principle, consider the multiset {𝑀𝛼 }𝛼∈𝐴 of indecomposable summands as a complete invariant of 𝑀. However, other than for a handful of exceptional posets,5 and even in simple cases such as the one mentioned in Example 8 below, the collection of all P-indecomposables can be tremendously complex. One manifestation of this complexity is the possibility that 3By Equations (5) and (6), for any 𝑝 ∈ [𝑏, 𝑑] and any nonzero 𝑣 ∈ 𝑉 [𝑏,𝑑] (𝑝),
we have 𝑏(𝑣) = 𝑏 and 𝑑(𝑣) = 𝑑. 4Interestingly, in the case of L , this relationship follows from work by Abeasis 𝑚 et al. dating back to 1981; see [ADFK81, p. 405]. The authors thank Ezra Miller for pointing this out. 5For example, by Gabriel’s theorem, the indecomposable modules over a zigzag poset are exactly the interval modules (defined below Definition 2) on that poset.
One manifestation of this complexity is the possibility that there may exist infinitely many isomorphism types of P-indecomposables.

Example 8. Consider the 6-point poset P inducing the quiver 𝑄P shown below.

[Quiver diagram omitted: the extraction preserves only its six vertices.]

An infinite two-parameter family of P-indecomposables is presented in [DW05, Example 8].

For the reasons above, and due to the analogy with the case when P = L𝑚, much of the research in the TDA community has concentrated on understanding (i.e., testing, computing, etc.) the decomposability of a given P-module as a direct sum of indecomposables with “simple” structure. One such notion of simple indecomposables is obtained through a generalization of the interval modules over L𝑚 described in Equation (15). This leads to the notion of interval modules over an arbitrary poset P: these are P-indecomposable modules having dimension exactly one on certain nice subsets of P and zero elsewhere, and such that all internal morphisms between nontrivial spaces are 1. The nice subsets mentioned above are called intervals of P and are defined as follows.

Definition 2. An interval 𝐼 of a given poset P is any subset 𝐼 ⊆ P such that: (i) 𝐼 is nonempty. (ii) If 𝑝, 𝑝′ ∈ 𝐼 and 𝑝″ ∈ P are such that 𝑝 ≤ 𝑝″ ≤ 𝑝′, then 𝑝″ ∈ 𝐼. (iii) 𝐼 is connected, i.e., for any 𝑝, 𝑝′ ∈ 𝐼, there is a sequence 𝑝 = 𝑝0, 𝑝1, …, 𝑝ℓ = 𝑝′ of elements of 𝐼 with either 𝑝𝑖 ≤ 𝑝𝑖+1 or 𝑝𝑖+1 ≤ 𝑝𝑖 for each 𝑖 ∈ [0, ℓ − 1].

We will henceforth use the notation Int(P) to denote the collection of all intervals of P.⁶ Note that when P = L𝑚, Int(P) reduces to the definition given in the previous section. For 𝐼 ∈ Int(P), the interval module 𝑉^𝐼 ∶ P → 𝐯𝐞𝐜 induced by 𝐼 is defined via the conditions

𝑉^𝐼(𝑝) ≔ 𝕜 if 𝑝 ∈ 𝐼, and 0 otherwise;
𝑉^𝐼(𝑝, 𝑝′) ≔ 1 if 𝑝, 𝑝′ ∈ 𝐼 and 𝑝 ≤ 𝑝′, and 0 otherwise.

In general, it is important to note that if 𝐼 ⊆ P did not satisfy (ii), then 𝑉^𝐼 would not be well-defined (as it would not satisfy Equation (1)). If 𝐼 did not satisfy (iii), then 𝑉^𝐼 would fail to be indecomposable.

⁶We warn the reader that the definition of intervals that we are using (the most common in TDA) differs from the one that is traditional in order theory.

The following are simple but important facts: (a) No matter what P is, every interval module 𝑉^𝐼 ∶ P → 𝐯𝐞𝐜 is indecomposable. (b) There are posets P for which there exist P-indecomposables that are not interval modules. For example, let P ≔ {𝑎, 𝑏, 𝑐, 𝑑} be equipped with the partial order 𝑏 ≤ 𝑎, 𝑐 ≤ 𝑎, and 𝑑 ≤ 𝑎, and consider the P-module 𝐹 given by

𝐹(𝑎) ≔ 𝕜², 𝐹(𝑏) ≔ 𝐹(𝑐) ≔ 𝐹(𝑑) ≔ 𝕜, with structure maps 𝐹(𝑏 ≤ 𝑎) = (1 0)ᵀ, 𝐹(𝑐 ≤ 𝑎) = (1 1)ᵀ, and 𝐹(𝑑 ≤ 𝑎) = (0 1)ᵀ. (16)

It is clear that 𝐹 is neither an interval module nor isomorphic to a direct sum of interval modules. That 𝐹 is indecomposable is also relatively easy to verify.

A decomposition of a given P-module into a direct sum of interval modules gives rise to its barcode.

Definition 3. A P-module 𝑀 is interval decomposable if there exists a multiset barc(𝑀) of intervals of P (called the barcode of 𝑀) such that

𝑀 ≅ ⨁_{𝐼∈barc(𝑀)} 𝑉^𝐼.

Example 9 (Barcode of an L2-module). Recall from Example 7 that Diagram (13) describes the indecomposable decomposition of an arbitrary L2-module 𝑀. [The article's figure, omitted in this extraction, visualizes the barcode of 𝑀 for two different values of 𝑘 when 𝑑1 = 𝑑2 = 4.]
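Conditions (i)–(iii) of Definition 2 are straightforward to test on a finite poset. The following Python sketch is our own (the encoding of the order by a predicate leq is our convention) and checks whether a subset is an interval:

```python
def is_interval(I, elements, leq):
    """Test Definition 2 for a subset I of a finite poset.
    `leq(p, q)` returns True iff p <= q; `elements` lists the whole poset.
    (Our own illustration, not code from the article.)"""
    I = set(I)
    if not I:                                            # (i) nonempty
        return False
    for p in I:                                          # (ii) convexity
        for q in I:
            for r in elements:
                if leq(p, r) and leq(r, q) and r not in I:
                    return False
    # (iii) connectivity: grow a component along comparabilities within I
    start = next(iter(I))
    component, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in I - component:
            if leq(p, q) or leq(q, p):
                component.add(q)
                frontier.append(q)
    return component == I

# The poset P = {a > b < c}: b lies below both a and c.
elements = ["a", "b", "c"]
leq = lambda p, q: p == q or (p == "b" and q in ("a", "c"))
print(is_interval({"a", "c"}, elements, leq))   # False: not connected inside I
print(is_interval({"a", "b"}, elements, leq))   # True
```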
Even if a given P-module is not interval decomposable, it may be useful to understand whether at least some of its indecomposable summands are isomorphic to interval modules. This leads to defining the multiplicity function.

Definition 4 (Multiplicity of intervals). Given any P-module 𝑀 and any interval 𝐼 of P, let mult(𝐼, 𝑀) denote the number of P-indecomposable direct summands of 𝑀 that are isomorphic to the interval module 𝑉^𝐼. Furthermore, let mlt𝑀 ∶ Int(P) → ℤ≥0 denote the multiplicity function defined by 𝐼 ↦ mult(𝐼, 𝑀). It is clear that, whenever 𝑀 is interval decomposable, mlt𝑀 determines and is determined by barc(𝑀).
From a practical point of view, one aims to extract as much persistence-like information as possible from a given P-module 𝑀, regardless of whether 𝑀 is interval decomposable or not, while bypassing the inherent difficulties associated with the “wild west” of indecomposables. This has led to exploring the notion of generalized persistence diagrams, which is enabled by Möbius inversion and by a suitable generalization of the notion of rank invariant. One key idea arose in 2016 when Patel noticed that the process implemented in Equation (7) is the Möbius inversion (over the poset (Int(L𝑚), ⊇)) of the rank invariant [Pat18]. This crucial observation has led to very rich developments which, by injecting ideas from combinatorics, surmount the difficulties inherent to dealing with indecomposables.
Generalized Rank Invariant

Given a P-module 𝑀, the map

𝑝 ≤ 𝑝′ ↦ rank(𝑀(𝑝, 𝑝′)) (17)

for all 𝑝 ≤ 𝑝′ in P is a direct generalization of the rank invariant of L𝑚-modules given in Equation (3).⁷ However, beyond the case P = L𝑚, this “standard” rank invariant is, in general, a weaker invariant than the barcode of interval decomposable P-modules; see, e.g., [KM21a, Appendix C]. Motivated by this, we consider a generalized version of the rank invariant, which exhibits stronger discriminating power. We start by generalizing the notion of the rank of a linear map to the context of P-modules.

Rank of a P-module. Given a P-module 𝑀 with 𝑚 ≔ |P|, an 𝑚-tuple 𝐯 = (𝑣𝑝) ∈ ⨁_{𝑝∈P} 𝑀(𝑝) is called a persistent vector if all the 𝑣𝑝 are compatible in 𝑀, i.e., 𝑀(𝑝, 𝑝′)(𝑣𝑝) = 𝑣𝑝′ for all 𝑝 ≤ 𝑝′ in P. The set 𝐿𝑀 of persistent vectors is a linear subspace of ⨁_{𝑝∈P} 𝑀(𝑝). We call 𝐯 ∈ 𝐿𝑀 fully supported if 𝑣𝑝 ≠ 0 for all 𝑝 ∈ P. Toward defining the rank of 𝑀, we identify 𝐯 ∈ 𝐿𝑀 with 0 ∈ 𝐿𝑀 whenever 𝐯 is not fully supported, i.e., if there exists 𝑝 ∈ P such that 𝑣𝑝 = 0. In other words, we consider the quotient space 𝐿𝑀/𝑊𝑀 where 𝑊𝑀 is the linear span of all non-fully-supported vectors in 𝐿𝑀. We then define the rank of 𝑀 as

rank(𝑀) ≔ dim(𝐿𝑀/𝑊𝑀). (18)

Example 10 (The case when P = L2). Given any L2-module 𝑀, we have that 𝐿𝑀 and 𝑊𝑀 are isomorphic to 𝑀(1) and ker(𝑀(1, 2)), respectively. Therefore, rank(𝑀) reduces to the rank of the linear map 𝑀(1, 2).

Here is an alternative view on the rank of 𝑀. Let us call any two persistent vectors 𝐯 and 𝐰 intersecting if 𝑣𝑝 = 𝑤𝑝 for some 𝑝 ∈ P. This property defines a reflexive and symmetric, but not necessarily transitive, relation on 𝐿𝑀. Let ∼ be the transitive closure of the resulting relation. We observe that 𝐿𝑀/𝑊𝑀 is the quotient 𝐿𝑀/∼. Indeed:

𝐯 ∼ 𝐰 ⇔ there exists a sequence 𝐯 = 𝐯1, 𝐯2, …, 𝐯𝑛 = 𝐰 in 𝐿𝑀 such that 𝐯𝑖 and 𝐯𝑖+1 are intersecting for every 𝑖
⇔ there exists a sequence 𝐯 = 𝐯1, 𝐯2, …, 𝐯𝑛 = 𝐰 in 𝐿𝑀 such that 𝐯𝑖 − 𝐯𝑖+1 is non-fully-supported for every 𝑖
⇔ there exists a sequence 𝐯 = 𝐯1, 𝐯2, …, 𝐯𝑛 = 𝐰 in 𝐿𝑀 such that 𝐯𝑖 − 𝐯𝑖+1 ∈ 𝑊𝑀 for every 𝑖
⇔ 𝐯 − 𝐰 ∈ 𝑊𝑀.

We call 𝐯 ∈ 𝐿𝑀 full if, whenever 𝐯 is written as a sum of linearly independent vectors 𝐰1, …, 𝐰𝑛 ∈ 𝐿𝑀, at least one of the 𝐰𝑗 is fully supported. From the observation above, we have:

Theorem 2. rank(𝑀) is the maximal number of linearly independent, full, nonintersecting persistent vectors of 𝑀.

Note that if 𝐯 is full, then 𝐯 is fully supported. However, the converse does not hold in general.

Example 11. When P = L2, every fully supported 𝐯 ∈ 𝐿𝑀 is full.

Example 12. Consider the diagram 𝕜 ←𝜋1— 𝕜² —𝜋2→ 𝕜 over P = {𝑎 > 𝑏 < 𝑐}, where 𝜋𝑖 is the projection to the 𝑖-th coordinate. The fully supported persistent vector 1 ↤ (1, 1) ↦ 1 is not full since it is the sum of the non-fully-supported persistent vectors 1 ↤ (1, 0) ↦ 0 and 0 ↤ (0, 1) ↦ 1.

The space 𝐿𝑀/𝑊𝑀 from Equation (18) is related to fundamental notions in category theory [ML98]. The space 𝐿𝑀 of persistent vectors coincides with the limit of 𝑀, denoted by lim← 𝑀. The quotient space 𝐶𝑀 ≔ (⨁_{𝑝∈P} 𝑀(𝑝))/≈ coincides with the colimit lim→ 𝑀 of 𝑀, and is obtained by identifying 𝑣𝑝 ∈ 𝑀(𝑝) with 𝑣𝑝′ ∈ 𝑀(𝑝′) through the transitive closure ≈ of the relation 𝑅 such that (𝑣𝑝, 𝑣𝑝′) ∈ 𝑅 whenever 𝑀(𝑝, 𝑝′)(𝑣𝑝) = 𝑣𝑝′ for 𝑝 ≤ 𝑝′ in P. There is a canonical map 𝜓𝑀 from the limit 𝐿𝑀 to the colimit 𝐶𝑀. Indeed, note that, since P is connected, for any 𝐯 = (𝑣𝑝) ∈ 𝐿𝑀 and for any 𝑝, 𝑝′ ∈ P, the two vectors 𝑣𝑝 ∈ 𝑀(𝑝) and 𝑣𝑝′ ∈ 𝑀(𝑝′) are identified in 𝐶𝑀. Therefore, we obtain the well-defined canonical linear map⁸ 𝜓𝑀 ∶ 𝐿𝑀 → 𝐶𝑀 given by 𝐯 ↦ [𝑣𝑝] for an arbitrary 𝑝 ∈ P. Note that the vector space 𝐿𝑀/𝑊𝑀 is isomorphic to the image of 𝜓𝑀.

⁷This notion was introduced in [CZ09].
⁸The idea of studying the map from the limit to the colimit of a given diagram of vector spaces stems from work by Amit Patel and Robert MacPherson circa 2012. We thank Prof. Harm Derksen for pointing out to us recently that this type of map was used in the study of quiver representations in [Kin08].
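For a concrete instance of Equation (18), consider the module of Example 12. The computation below is a sketch of ours (persistent vectors are written in the coordinates (𝑣𝑎, 𝑣𝑏, 𝑣𝑐), our own convention) and confirms that rank(𝑀) = 0, in agreement with Theorem 2: the module has no full persistent vector.

```python
import numpy as np

# Example 12: k <--pi_1-- k^2 --pi_2--> k over P = {a > b < c}.
# A persistent vector is determined by v_b = (x, y), since v_a = x and v_c = y;
# in the coordinates (v_a, v_b[0], v_b[1], v_c), L_M is spanned by:
e1 = np.array([1.0, 1.0, 0.0, 0.0])   # v_b = (1, 0): not fully supported (v_c = 0)
e2 = np.array([0.0, 0.0, 1.0, 1.0])   # v_b = (0, 1): not fully supported (v_a = 0)
dim_L = 2                             # L_M is isomorphic to k^2
# W_M, the span of the non-fully-supported persistent vectors, contains e1, e2:
dim_W = np.linalg.matrix_rank(np.vstack([e1, e2]))   # = 2
print(dim_L - dim_W)                  # rank(M) = 0: no full persistent vector
```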
Proposition 1 ([KM21a, Section 3]). The rank of the P-module 𝑀 ∶ P → 𝐯𝐞𝐜 agrees with the rank of the canonical limit-to-colimit map 𝜓𝑀 ∶ lim← 𝑀 → lim→ 𝑀. See Figure 4.

Figure 4. The notions of limit and colimit of a diagram of vector spaces are used in a fundamental way to generalize the notions of rank invariant and persistence diagram.

Remark 4 (Additivity of the rank). If 𝑀 ≅ ⨁_{𝛼∈𝐴} 𝑀𝛼 for some indexing set 𝐴, then rank(𝑀) = ∑_{𝛼∈𝐴} rank(𝑀𝛼).

To prove this, we proceed as follows. Since P is finite and dim(𝑀(𝑝)) is finite for each 𝑝 ∈ P, the direct sum commutes with limits as well as with colimits:

lim← (⨁_{𝛼∈𝐴} 𝑀𝛼) ≅ ⨁_{𝛼∈𝐴} (lim← 𝑀𝛼) (19)

and

lim→ (⨁_{𝛼∈𝐴} 𝑀𝛼) ≅ ⨁_{𝛼∈𝐴} (lim→ 𝑀𝛼), (20)

which leads to the desired equality.⁹ The isomorphisms in Equations (19) and (20) can also be easily verified by invoking the constructions of limits and colimits given above.

⁹The claimed isomorphism in Equation (19) follows from the following argument. A priori we have lim← (∏_{𝛼∈𝐴} 𝑀𝛼) ≅ ∏_{𝛼∈𝐴} (lim← 𝑀𝛼), where ∏ denotes a direct product [ML98, Chapter III]. But one can check that ∏ agrees with ⨁ on both sides since P is finite and dim(𝑀(𝑝)) is finite for each 𝑝 ∈ P.

Generalized rank invariant. We are now ready to define the generalized rank invariant.

Definition 5. The generalized rank invariant of a P-module 𝑀 is the function

rk𝑀 ∶ Int(P) → ℤ≥0, 𝐼 ↦ rank(𝑀|𝐼),

where 𝑀|𝐼 is the restriction of 𝑀 to the interval 𝐼.

That, as defined above, rk𝑀 indeed generalizes Equation (3) is a consequence of the fact that when P = L𝑚, for any interval 𝐼 = [𝑏, 𝑑] ∈ Int(L𝑚), we have lim← 𝑀|𝐼 ≅ 𝑀(𝑏), lim→ 𝑀|𝐼 ≅ 𝑀(𝑑), and 𝜓_{𝑀|𝐼} ≅ 𝑀(𝑏, 𝑑).

Remark 5 (Monotonicity of rk𝑀). Let 𝐼, 𝐽 ∈ Int(P) with 𝐽 ⊇ 𝐼. Then

rk𝑀(𝐽) ≤ rk𝑀(𝐼),

which is analogous to Equation (4). This is so because the canonical limit-to-colimit map lim← 𝑀|𝐼 → lim→ 𝑀|𝐼 for the interval 𝐼 is a factor of the canonical limit-to-colimit map lim← 𝑀|𝐽 → lim→ 𝑀|𝐽 for the larger interval 𝐽.

Remark 6. Let 𝐽 ∈ Int(P) and let 𝑉^𝐽 ∶ P → 𝐯𝐞𝐜 be the interval module induced by 𝐽. By Theorem 2, for 𝐼 ∈ Int(P), rk_{𝑉^𝐽}(𝐼) = rank(𝑉^𝐽|𝐼) equals 1 if 𝐽 ⊇ 𝐼, and 0 if 𝐽 ⊉ 𝐼. Hence, when 𝑀 ∶ P → 𝐯𝐞𝐜 is interval decomposable, rk𝑀(𝐼) equals the number, counted with multiplicity, of intervals 𝐽 in barc(𝑀) that contain 𝐼.

Proposition 2. Let 𝑀 ∶ P → 𝐯𝐞𝐜 be interval decomposable. Then, for any 𝐼 ∈ Int(P),

rk𝑀(𝐼) = ∑_{𝐽∈Int(P) ∶ 𝐽⊇𝐼} mlt𝑀(𝐽).

Proof. Let 𝐴 be a finite indexing set and let 𝐼𝛼 ∈ Int(P), 𝛼 ∈ 𝐴, be intervals such that 𝑀 ≅ ⨁_{𝛼∈𝐴} 𝑉^{𝐼𝛼}. Then, by Remark 4,

rk𝑀(𝐼) = rank(𝑀|𝐼) = ∑_{𝛼∈𝐴} rank(𝑉^{𝐼𝛼}|𝐼),

which by Remark 6 above equals the claimed quantity. □

Now that the rank invariant has been generalized from the case of L𝑚-modules to that of P-modules, the mechanism of Möbius inversion will provide the sought-after generalization of the notion of persistence diagram for the case of L𝑚-modules (cf. Equation (7)) to that of P-modules.
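For interval decomposable modules, Proposition 2 reduces the computation of the generalized rank invariant to counting containments in the barcode. A short sketch of ours (with intervals encoded as frozensets of poset elements, our own convention):

```python
from collections import Counter

def generalized_rank(barcode, I):
    """rk_M(I) for an interval decomposable M, via Proposition 2:
    count, with multiplicity, the barcode intervals containing I.
    Intervals are frozensets of poset elements (our own encoding)."""
    I = frozenset(I)
    return sum(m for J, m in Counter(barcode).items() if I <= J)

# A toy (L_2 x L_2)-module decomposed into interval modules:
barc = [frozenset({(1, 1), (1, 2), (2, 1), (2, 2)}),   # the whole grid
        frozenset({(1, 1), (1, 2)}),
        frozenset({(1, 1), (1, 2)})]
print(generalized_rank(barc, {(1, 1), (1, 2)}))                  # 3
print(generalized_rank(barc, {(1, 1), (1, 2), (2, 1), (2, 2)}))  # 1
```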
Möbius Inversion

The summation of a number-theoretic function 𝑓(𝑛) over the divisors of 𝑛 and its inversion play an important role in elementary number theory (the meaning of inversion will be made clear in Example 13). The classical Möbius inversion formula, introduced by August Ferdinand Möbius
in 1832, is a useful tool for performing such inversions. In the 1960s, Rota [Rot64] noticed the vast combinatorial implications of the Möbius inversion formula and established connections to coloring problems, flows in networks, and to the inclusion-exclusion principle. Two basic examples of Möbius inversion follow.

Example 13 (Sum over L𝑚). Fix an integer 𝑚 > 0. Consider any two functions 𝑓, 𝑔 ∶ L𝑚 → ℝ such that 𝑔(𝑞) = ∑_{𝑞′=1}^{𝑞} 𝑓(𝑞′) for all 𝑞 ∈ L𝑚. One can easily invert the sum, i.e., solve for 𝑓, as

𝑓(1) = 𝑔(1) and 𝑓(𝑞) = 𝑔(𝑞) − 𝑔(𝑞 − 1) for 2 ≤ 𝑞 ∈ L𝑚. (21)

Given a set 𝐷, let 2^𝐷 be the set of subsets of 𝐷 ordered by containment ⊇.

Example 14 (Principle of inclusion-exclusion). Let 𝑆 be a finite nonempty set of objects. Let 𝐷 = {𝑝1, …, 𝑝𝑛} be a set of properties such that, for every 𝑖 = 1, …, 𝑛, each object in 𝑆 either satisfies the property 𝑝𝑖 or it does not. For 𝐸 ∈ 2^𝐷, let 𝑔(𝐸) be the number of objects in 𝑆 satisfying the properties in 𝐸 (and possibly more), and let 𝑓(𝐸) be the number of objects in 𝑆 satisfying the properties in 𝐸 and no other properties. Then, the function 𝑔 ∶ 2^𝐷 → ℤ≥0 can be expressed in terms of the function 𝑓 ∶ 2^𝐷 → ℤ≥0 as follows:

𝑔(𝐸) = ∑_{𝐹⊇𝐸} 𝑓(𝐹) for all 𝐸 ∈ 2^𝐷. (22)
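In code, the inversion in Example 13 is just a first difference (a sketch of ours, with 𝑔 stored as a list):

```python
def invert_cumsum(g):
    """Recover f from g(q) = sum_{q'=1}^{q} f(q') over L_m, cf. Equation (21).
    g is the list [g(1), ..., g(m)] (our own convention)."""
    return [g[0]] + [g[i] - g[i - 1] for i in range(1, len(g))]

print(invert_cumsum([1, 3, 6, 10]))   # [1, 2, 3, 4]
```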
We will shortly see how one can express 𝑓 in terms of 𝑔. Whereas inverting the summation in Example 13 above can be done directly, inverting the summation over the poset 2^𝐷 in Example 14 motivates introducing more sophisticated machinery. In general, the Möbius function of a poset Q plays a fundamental role in inverting a summation over Q. Let Q be a finite poset. The Möbius function of Q is the unique function 𝜇Q ∶ Q × Q → ℤ defined by 𝜇Q(𝑞, 𝑞″) = 0 when 𝑞 ≰ 𝑞″ and, when 𝑞 ≤ 𝑞″, by

∑_{𝑞′ ∶ 𝑞≤𝑞′≤𝑞″} 𝜇Q(𝑞, 𝑞′) = 𝛿(𝑞, 𝑞″),

where 𝛿(𝑞, 𝑞″) = 0 if 𝑞 ≠ 𝑞″ and 𝛿(𝑞, 𝑞″) = 1 if 𝑞 = 𝑞″. The function 𝜇Q can be computed recursively through the conditions:
𝑞=𝑞 , ⎧1, ′ 𝜇Q (𝑞, 𝑞 ) = − ∑𝑞≤𝑞′