#draft/aitp_thoughts
One of the biggest moonshot ideas in AI theorem proving is “auto-formalization”. Szegedy’s N2Formal group at Google AI has publicly expressed interest in the topic, and Kevin Buzzard often says auto-formalization is closer than we think. Having said that, there is no clear definition of what auto-formalization even means in practice.
Once again, this is a good place for some user stories where we can concretely explore what we mean by auto-formalization, starting with the wildest ideas and working our way toward more realistic goals.
An AI systematically goes over all mathematical literature (arXiv, journals, books, maybe even scans of books) and formalizes all mathematical knowledge into formal definitions and proofs. It gives both a literal formalization specific to each work and also connects it to a common formalization which fits into a unified formal library of mathematics.
This is likely not achievable without something like AGI, but I’m happy to be proved wrong.
An AI reads math papers as they are published, systematically understanding each paper and relating it to other papers. The knowledge is compressed deep into its memory, but not necessarily formalized. Instead, the agent (like a human) understands mathematics and doesn’t need it formalized to know it is correct. The agent can easily read past small errors in papers and see the bigger picture. It can answer math questions posed in natural language and quickly find the relevant mathematical citation. It is like talking to a human mathematician who has read and understood all the world’s math papers. It may work in conjunction with other, more formal AI systems, helping them organize and formalize this knowledge.
This is not achievable without something like AGI, but I’m happy to be proved wrong.
An AI scans papers and connects the ideas and knowledge. It is not perfect and doesn’t show full understanding, but it is really good. One can enter a theorem statement in natural language and ask if it is known. The AI quickly finds what is likely to be an equivalent or related theorem, or reports that the statement is likely unknown but offers similar theorems. It also connects ideas in interesting and novel ways, suggesting lemmas and connections that can then be followed up on. It can also start to create (fairly dry) Wikipedia-like articles for every math topic.
This is unlikely to occur any time soon, but with advancements in natural language understanding, it might arrive sooner than one thinks. Also, less advanced versions of this will be available much sooner. (GPT-J could already write a Wikipedia article on any math topic, but it wouldn’t necessarily be correct, which is the problem.)
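The search half of this story is already approachable with off-the-shelf embedding models. Here is a minimal sketch of the idea, assuming the sentence-transformers package and a toy corpus (a real system would index millions of statements and need a model tuned for mathematical paraphrase): embed the known statements, embed the natural-language query, and rank by cosine similarity.

```python
# Minimal sketch: "is this theorem known?" as nearest-neighbor search over statement embeddings.
# Assumes the sentence-transformers package; the corpus here is a toy stand-in.
from sentence_transformers import SentenceTransformer
import numpy as np

corpus = [
    "Every even integer greater than 2 is the sum of two primes.",
    "There are infinitely many prime numbers.",
    "Every continuous function on a closed interval attains a maximum.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(corpus, normalize_embeddings=True)

query = "Is it known that the primes never run out?"
query_emb = model.encode([query], normalize_embeddings=True)[0]

scores = corpus_emb @ query_emb          # cosine similarity, since embeddings are normalized
for i in np.argsort(-scores):
    print(f"{scores[i]:.2f}  {corpus[i]}")
```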
An AI agent works with a good library of formalized mathematics covering undergraduate mathematics and more. It scans papers and starts to find theorems which use only definitions currently in the library. It formalizes those and attempts to formalize the proofs. When it encounters definitions which themselves only use established definitions and notation, it formalizes those as well. It also formalizes exercises, open problems, and common mathematical trivia as it encounters them. The system is careful to realize that many of the formalizations could be wrong. Also, many are likely duplicates, and it is continuously cleaning up and connecting facts in the library, trying to fill in missing proofs or correct incorrect formalization guesses.
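A crude sketch of the gating step above (only attempt statements whose concepts are already covered by the formal library) might look like the following; the concept extractor and the library contents are hypothetical stand-ins for a real concept-linking model and a real ITP library.

```python
# Sketch of the gating loop: only queue informal statements whose concepts are all
# already present in the formal library. extract_concepts is a placeholder for a
# real model (concept extraction / linking into the library's definitions).

formal_library = {"prime", "natural number", "divides", "group"}  # toy stand-in

def extract_concepts(statement: str) -> set[str]:
    # Placeholder: a real system would use a trained concept-linking model.
    return {c for c in formal_library | {"scheme", "sheaf"} if c in statement.lower()}

informal_statements = [
    "Every natural number greater than 1 has a prime divisor.",
    "Every affine scheme is quasi-compact.",
]

to_formalize = []
for stmt in informal_statements:
    concepts = extract_concepts(stmt)
    if concepts and concepts <= formal_library:
        to_formalize.append(stmt)          # all concepts covered: attempt formalization
    else:
        print("skipping (missing definitions):", stmt)

print("queued for formalization:", to_formalize)
```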
Many aspects of this story are very possible with today’s technology, especially using newer ideas in machine translation and natural language. Some of it has already been done at a very small scale. If there were well-aligned formal-informal libraries (or even just corpora of formal and informal language of relatively equal size), then it would be much easier. The challenge is leveraging the knowledge inside large language models to extract the right kind of training data. (I’m reminded of a recent paper by OpenAI which shows that one can extract a good English-French training set just from what is implicitly inside GPT-3’s knowledge.)
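In the same spirit as that English-French extraction, one could try few-shot prompting a large model with a handful of (informal, formal) pairs and ask it to produce the formal side of new statements, giving noisy parallel data to filter later. A minimal sketch, where `complete` is a placeholder for whichever completion API one uses and the exemplar pairs are made up:

```python
# Sketch of mining informal->formal pairs from a large language model via few-shot prompting.
# complete() is a placeholder for any text-completion API; the exemplars are toy examples.

FEW_SHOT_PAIRS = [
    ("There are infinitely many primes.",
     "theorem infinite_primes : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime"),
    ("Addition of natural numbers is commutative.",
     "theorem add_comm (a b : ℕ) : a + b = b + a"),
]

def build_prompt(informal: str) -> str:
    parts = []
    for nl, fl in FEW_SHOT_PAIRS:
        parts.append(f"Informal: {nl}\nFormal: {fl}\n")
    parts.append(f"Informal: {informal}\nFormal:")
    return "\n".join(parts)

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion call here")

new_statement = "Multiplication of natural numbers is commutative."
print(build_prompt(new_statement))
# candidate = complete(build_prompt(new_statement))
# Any candidate formal statement would then be type-checked and filtered before
# being added to a training set.
```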
Thousands of students take proof-based mathematics courses each year. They write proofs which are both good and bad, correct and incorrect. What if we started to scan that homework and clean the data a little? (The largest need is just to make sure there is an image or series of images for each proof and that those are linked to the exercise question. The students could do this by uploading their homework.) It could be a product which (1) gives students and teachers an easy-to-use interface for submitting math homework, and (2) gives teachers and students helpful advice on that homework. The ultimate training objective would be to have it grade each proof, relative to both correctness and the level of detail required for a given course and book.
I think this is very doable. The interface would require a lot of engineering, and there are many practical concerns: getting buy-in from math departments, labeling the data (with course, problem statement, book, etc.) without making the process onerous, dealing with privacy concerns, copyright concerns, and so on. However, on the AI technology side, I see no reason why, with good multi-modal models down the line and a decent-sized training set, we couldn’t predict the correctness of proofs from images. As with a lot of datasets, just because a model can predict correctness, we have to realize it might do so for the wrong reasons, like good penmanship or use of the correct mathematical terms. Therefore, it would also be important to correct for this with various tricks (GANs, self-supervision, adversarial ideas, etc.).
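As a very rough sketch of the modeling side (the architecture, labels, and data here are all hypothetical, and a serious attempt would be properly multi-modal, also reading the problem statement), one could start with a standard image encoder and a small correctness head:

```python
# Minimal sketch of an image-based proof-correctness classifier.
# Assumes PyTorch/torchvision; the batch below is random stand-in data.
import torch
import torch.nn as nn
import torchvision

class ProofGrader(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torchvision.models.resnet18(num_classes=256)  # image encoder
        self.head = nn.Linear(256, 2)  # logits for {incorrect, correct}

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(images))

model = ProofGrader()
images = torch.randn(4, 3, 224, 224)       # fake batch of scanned homework pages
labels = torch.tensor([1, 0, 1, 1])        # fake correctness labels
logits = model(images)
loss = nn.functional.cross_entropy(logits, labels)
print(loss.item())
```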
There are many large interactive theorem proving libraries like Mathlib, the Archive of Formal Proofs, set.mm, the Mizar library, the HOL Light library, and (whatever Coq has). The big problem is that there is no way to translate between them. Within this user story are actually three user stories:
- Full ITP library translation. The AI agent would translate whole sections of one ITP library to another, adding many files of theorems, definitions, and lemmas, carefully shifting from the style of one theorem prover to the style of the other. If a background definition or lemma is missing, or a proof isn’t in the style of the target, then those issues are automatically accounted for. (This would be a hard project and I really don’t see this happening anytime soon, but a lot of incremental progress can be made. See below.)
- Translate proofs from one theorem prover to another, one theorem at a time. A user selects one theorem in one ITP library and the AI produces a corresponding theorem statement and proof attempt in another library. This is very doable with today’s translation technology; there are already similar projects which do this for Python and C++. The challenge is that the libraries may not match up. However, as long as all the definitions in the theorem statement are in both libraries and reasonably match, and as long as there is enough currently in the target library to prove the theorem, I think it is very doable. The two main ideas are to (1) use modern machine translation with large language models, training on both corpora jointly to support unsupervised translation (see the sketch after this list), and (2) use proof search to guide and check the proofs. (If one isn’t concerned with translating the theorem statements, just the proofs, then one can simply train a GPT-f-like language model on both libraries simultaneously.)
- Align imported proofs. If two theorem provers have compatible logics, one can automatically convert an entire library from one theorem prover to another. (Mario Carneiro did this with the Metamath set.mm library inside Lean.) The problem is that when doing this, none of the definitions align. One could “prove” the Metamath prime number theorem in Lean’s logic this way, but the imported theorem statement wouldn’t use the same definition of limit, logarithm, prime number, or natural number. These all have to be manually proved to be the same in some sense. This is an area where an AI could be quite useful: a user and an AI work side by side, and the user begins to formalize the alignments. This involves:
  - Knowing which definitions align directly and being able to state those alignment theorems.
  - Proving those alignment theorems using facts in both the source library and the target library.
  - Converting definitions (which don’t directly align) from the source language to the target language, replacing concepts from the source language with equivalent concepts from the target language.
  - Converting theorems from the source language to the target language, replacing concepts from the source language with equivalent concepts from the target language.
The AI would then take this work and start to copy the same ideas to new definitions and theorems. It would likely be a back-and-forth project with some level of coded boilerplate and templates, but also some AI-driven choices around the alignments and how to prove them.
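To make the alignment work concrete, here is a toy Lean-style sketch of what an alignment statement might look like. The `MM.*` names and the `toLean` map are hypothetical stand-ins for whatever an automated set.mm import would actually produce; the point is only the shape of the statements the user and the AI would have to state and prove.

```lean
import Mathlib

-- Hypothetical stand-ins for a type and predicate produced by an automated set.mm import.
axiom MM.Nat : Type
axiom MM.prime : MM.Nat → Prop

-- Step 1: an alignment map from the imported naturals to mathlib's ℕ
-- (in practice this would be constructed, not postulated).
axiom toLean : MM.Nat → ℕ
axiom toLean_bijective : Function.Bijective toLean

-- Step 2: an alignment theorem saying the imported notion of "prime" matches
-- mathlib's Nat.Prime. Proving statements like this (and then transporting
-- theorems across them) is the work described above.
theorem MM.prime_iff (n : MM.Nat) : MM.prime n ↔ Nat.Prime (toLean n) := by
  sorry
```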
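Returning to the theorem-by-theorem translation story above, here is the basic data setup behind idea (1): concatenate both corpora with source tags so a single language model can be trained on them jointly and later prompted for a translation. The statements and tag format are made up, and a real system would run proof search over the outputs.

```python
# Sketch of preparing a mixed, source-tagged corpus so one language model can learn
# both ITP languages and be prompted to translate between them.
# The statements below are toy stand-ins for mathlib and set.mm entries.

lean_statements = ["theorem add_comm (a b : ℕ) : a + b = b + a"]
setmm_statements = ["|- ( A + B ) = ( B + A )"]

def tag(source: str, text: str) -> str:
    # Source tags are the usual trick for keeping languages separate in one model.
    return f"<{source}> {text} </{source}>"

training_corpus = (
    [tag("lean", s) for s in lean_statements]
    + [tag("setmm", s) for s in setmm_statements]
)

# At inference time, prompt with the source statement plus the opening tag of the
# target language and let the model complete the translation (to be checked by
# proof search afterwards).
prompt = tag("setmm", setmm_statements[0]) + "\n<lean>"
print(len(training_corpus), "training lines")
print(prompt)
```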
An AI would train on all ITP libraries, Wikipedia, the Stacks Project, ProofWiki, nLab, etc. Then, given a theorem or definition in one source, it would guess the closest theorems or definitions in all the other sources. This would be less a full-scale useful product and more an experiment.
I think creating an experimental approximation of this is very doable, and not that difficult to build once one has parsed data and a large language model trained on this and other data. After a few manual alignments, one could start to get it going automatically. Actually, I think there already are previous experiments doing this for a few source pairs. The big thing would be to continue to push what is possible in this area.
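An experimental first pass could be as simple as embedding statements from two sources and keeping mutual nearest neighbors as candidate alignments. A minimal sketch with toy data, again assuming sentence-transformers; a real run would use parsed library statements and a model fine-tuned on mathematical text.

```python
# Sketch: candidate cross-source alignments via mutual nearest neighbors in embedding space.
# Assumes sentence-transformers; both "sources" here are tiny toy stand-ins.
from sentence_transformers import SentenceTransformer
import numpy as np

mathlib_like = [
    "Nat.add_comm : for all natural numbers a b, a + b = b + a",
    "Nat.exists_infinite_primes : there are infinitely many primes",
]
wiki_like = [
    "Euclid's theorem states that there are infinitely many prime numbers.",
    "Addition of natural numbers is commutative.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
A = model.encode(mathlib_like, normalize_embeddings=True)
B = model.encode(wiki_like, normalize_embeddings=True)

sims = A @ B.T                              # cosine similarities
for i in range(len(mathlib_like)):
    j = int(np.argmax(sims[i]))
    if int(np.argmax(sims[:, j])) == i:     # keep only mutual nearest neighbors
        print(f"{sims[i, j]:.2f}  {mathlib_like[i]}  <->  {wiki_like[j]}")
```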