Retrospective: Proof Assistant

Reasoning
Exposition
Published April 9, 2025

Paul Klee *Nordzimmer* (1932)

Motivation

I set out to build a proof assistant for a few reasons:

  1. General interest in proofs and mathematics. I’ve used systems like Lean and Coq¹, but wanted a deeper understanding of what exactly they were doing.
  2. In particular, I was confused about how the typing rules related to tactics.
  3. Using AI to generate mathematical proofs is a growing area of interest. I was curious how proofs and intermediate reasoning steps are represented in these languages, to hopefully get a better grasp on how systems might merge formal theorem provers with AI.
  4. I saw very few (if any) tutorials on how to build a proof assistant from scratch.

Project Details

I built a rudimentary proof assistant in Python intermittently from November 2024 to March 2025. Most of the actual coding was completed between November and January (about three months of calendar time), with additional work on extra features and the write-up in February and March.

At this point, most of the core elements of a proof assistant are in place.

This roughly corresponds to the original game plan. I also managed to prove some (mostly trivial) theorems in the language.

Code

The total codebase is roughly 4000 lines of Python. While not optimized for performance, the code does implement the main features of the Calculus of Inductive Constructions.

Code is written in pure Python 3.10 and organized into the following files:

  1. calculus.py - Contains the main terms and environment definitions
  2. conversions.py - Contains various equality and reduction rules
  3. typing_rules.py - Contains the typing rules
  4. subtyping.py - Contains a HypothesisRegister class, intended for determining subtypes.
  5. sort_checking.py - Some limited rules around sort checking.
  6. induction.py - Handles inductive types
  7. induction_definitions.py - Definitions of inductive objects
  8. proof_engine.py - The proof engine
  9. tactics.py - Tactics for the proof engine
  10. match.py - Contains match constructs
  11. induction_functions.py - Functions that operate on inductive types
  12. proofs.py - Simple proofs
  13. future.py - Stuff I wrote but didn’t need yet.
  14. tests.py - Unit tests (defunct)

I’ve placed the code here (warts and all). Don’t trust any of this code to be correct.

To run the proofs, I navigate to the proofs directory and run `python3.10 proofs.py`.

The unit tests do not work as of time of publication; testing became laborious to maintain while I was rapidly making changes. If the tests did work, you would run them with `python3.10 -m tests`. More generally, this is an experimental codebase and is not suitable for production use without substantial modification.

Goals Evaluation

Given my motivations, how well did I do?

Deeper understanding of theorem provers

A success. Having implemented a theorem prover from scratch, I definitely understand them much better. That being said, there are still many aspects I find myself unsure about, or where I’m unclear whether I have the correct intuition.

Understand relationship between typing rules and tactics

I’d say that this is a partial success. I was able to do some basic elaborations, but I’m not sure real production theorem provers actually work this way. So how do we know that every tactic is faithful to the typing rules?

Using AI to generate mathematical proofs

I didn’t get this far. Given the complexity level of building this system from scratch, it’s probably better to just work off of Coq/Lean. Hopefully I’ll get to this in a future project.

Tutorial on how to build a proof assistant from scratch

Though the pedagogy could be improved, I’d call this a success. The write-up on this blog could function as a tutorial.

Lessons Learned

Representations and Reductions Are Critical

Getting the right data structures is always key, and this project was no exception. I didn’t fully understand the features before implementing them, which made choosing the representations harder. For example, I had to revisit the reduction rules a few times. If I could go back, I’d spend far more time getting the representation, equivalences, and reductions right before moving on to other aspects.

Tactic Engine over Typing Rules

In practice, tactics are WAY easier to prove theorems with than typing rules. I was surprised how hard it was to work out some theorems without tactics. Going back, I would focus more on the tactics engine than the typing rules.

Programming Language

I used Python. The logic here was that once I had the proof assistant natively implemented in Python, I could use it as a substrate for experimenting with AI-assisted proofs. This was overambitious and I never ended up reaching the ML part of the project. In retrospect a more strongly typed, functional language might have been helpful.

Proof Assistants Are Complex

I’m eight blog posts in and I feel like I’ve only just scratched the surface of what would need to be done to build a real theorem prover.

What’s still missing?

More Features

If you look at Coq’s core language I’m missing a few elements. Among them:

Record Types

Unlike inductive types, records allow bundling multiple fields with their associated types under a single structure, similar to structs in programming languages. Implementing records would require extending the term representation, adding typing rules for record creation and projection, and writing tactics for manipulating record structures.
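A rough sketch of what the new term forms might look like, in the spirit of the rest of the codebase. The `RecordType`/`Record` classes and the `project`/`field_type` rules here are hypothetical, not existing code:

```python
from dataclasses import dataclass

# Hypothetical record sketch: a record type declares named fields,
# a record value supplies them, and projection extracts one by name.
@dataclass(frozen=True)
class RecordType:
    fields: tuple[tuple[str, str], ...]    # (field name, type name) pairs

@dataclass(frozen=True)
class Record:
    fields: tuple[tuple[str, object], ...]  # (field name, value) pairs

def project(record: Record, name: str):
    """Projection: r.name extracts the named field's value."""
    for field_name, value in record.fields:
        if field_name == name:
            return value
    raise KeyError(f"no field {name!r}")

def field_type(rtype: RecordType, name: str) -> str:
    """Typing rule for projection: the type of r.name is the declared field type."""
    for field_name, type_name in rtype.fields:
        if field_name == name:
            return type_name
    raise KeyError(f"no field {name!r}")
```

A real implementation would also need to check a `Record` against its `RecordType` and thread both through the existing typing rules and tactics.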

Coinductive Types

While inductive types define objects by how they are constructed (built up from constructors), coinductive types define objects by how they can be observed or decomposed, allowing representation of infinite streams, non-terminating processes, or bisimulation relations. This is useful for proving theorems in fields like process calculus.

The technical challenge with coinductive types lies in their termination requirements: unlike inductive types which require termination “going down” (constructors must eventually hit base cases), coinductive types require productivity “going up” (always possible to produce the next element).
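The "productivity going up" idea can be sketched in ordinary Python using thunks: an infinite stream is never built, only observed one step at a time. This `Stream` class is purely illustrative and is not part of the codebase:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical coinductive stream: defined by how it is observed
# (head/tail), not by how it is constructed.
@dataclass
class Stream:
    head: int
    _tail: Callable[[], "Stream"]

    def tail(self) -> "Stream":
        # The tail is a thunk, so the infinite object is unfolded lazily.
        return self._tail()

def from_n(n: int) -> Stream:
    # Productive: every observation yields the next element in finite time,
    # even though the stream as a whole is infinite.
    return Stream(n, lambda: from_n(n + 1))

def take(s: Stream, k: int) -> list[int]:
    """Observe the first k elements of a stream."""
    out = []
    for _ in range(k):
        out.append(s.head)
        s = s.tail()
    return out
```

In a proof assistant, the kernel would have to check productivity syntactically (guardedness), the mirror image of the termination check for recursive functions on inductive types.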

Advanced Tactics

There are many categories of tactics I did not implement:

  • Tactical combinators would allow composing existing tactics into more powerful automation. For example, `repeat (intro; try apply theorem1)`.
  • Domain-specific tactics for arithmetic, equational reasoning, etc. Automated tactics like ring for ring structures, field for field arithmetic, or omega for Presburger arithmetic could solve entire classes of goals automatically.
  • Proof search tactics (auto, eauto) capable of trying multiple approaches and backtracking.
  • Reflective tactics that use computational reflection (proving by computation rather than deduction).
  • Some way to allow users to define tactics (although in this system, they could just write new tactics in Python).
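Since the tactics here are already Python functions, tactical combinators are mostly higher-order functions. Below is a sketch under simplified assumptions: a "tactic" is a hypothetical function from a goal (here just a string) to a list of remaining subgoals, or `None` on failure — this is not the actual tactics.py interface.

```python
from typing import Callable, Optional

# Hypothetical tactic type: goal in, remaining subgoals out, None on failure.
Goal = str
Tactic = Callable[[Goal], Optional[list[Goal]]]

def try_(t: Tactic) -> Tactic:
    """Never fails: on failure, leave the goal unchanged."""
    def run(goal: Goal):
        result = t(goal)
        return [goal] if result is None else result
    return run

def then(t1: Tactic, t2: Tactic) -> Tactic:
    """Run t1, then run t2 on every resulting subgoal (the `;` combinator)."""
    def run(goal: Goal):
        first = t1(goal)
        if first is None:
            return None
        out: list[Goal] = []
        for sub in first:
            rest = t2(sub)
            if rest is None:
                return None
            out.extend(rest)
        return out
    return run

def repeat(t: Tactic) -> Tactic:
    """Apply t to every goal until it fails or stops making progress."""
    def run(goal: Goal):
        goals = [goal]
        while True:
            progressed = False
            next_goals: list[Goal] = []
            for g in goals:
                result = t(g)
                if result is None or result == [g]:
                    next_goals.append(g)
                else:
                    next_goals.extend(result)
                    progressed = True
            goals = next_goals
            if not progressed:
                return goals
    return run
```

With real goals and proof states in place of strings, `repeat(then(intro, try_(apply(theorem1))))` would express the example above.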

User Experience

Lexer/Parser

Most proof assistants implement some kind of simplified language on top of their kernel. To do this I would need to design a DSL (domain-specific language) and build a lexer/parser of some kind. Users would write a proof in the DSL and then the lexer/parser would convert it into a Python proof and run it.
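The lexer half of that pipeline is the easy part. Here is a minimal tokenizer for a tiny, entirely hypothetical proof DSL (the keywords and token names are made up for illustration):

```python
import re

# Token definitions for a hypothetical proof DSL. Order matters:
# KEYWORD must come before IDENT so keywords aren't lexed as identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:Theorem|Proof|Qed|intro|apply|exact)\b"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ARROW",   r"->"),
    ("COLON",   r":"),
    ("DOT",     r"\."),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source: str) -> list[tuple[str, str]]:
    """Split DSL source into (kind, text) tokens; raise on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}: {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

A parser would then turn the token stream into the Python term and tactic calls the engine already understands; that, plus good error messages, is where most of the work would go.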

Interactivity

Systems like Coq and Lean provide real-time visibility into the current goal state, remaining subgoals, and progress tracking. This would be super useful to add: it would probably require pushing the tactics engine behind some UI.

Production System

This is a research system. What would I need to do to make it production worthy?

General Efficiency

I could check types in parallel, or implement caching of partial results.

Term Representation and Normalization

Production theorem provers use techniques like hashconsing to ensure term sharing and avoid duplication, reducing memory usage and comparison costs.

Also, my strong normalization approach (repeatedly applying all conversion rules until no changes occur) would become prohibitively expensive for large terms. Advanced systems use strategies like lazy evaluation, weak-head normalization, and caching of normalized forms.
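Hash-consing can be sketched in plain Python with an intern table: constructors look up structurally equal terms before allocating, so equality checks collapse to identity checks. The `mk_var`/`mk_app` constructors below are hypothetical, not the codebase's API:

```python
from dataclasses import dataclass

# Hypothetical hash-consed constructors: structurally equal terms share one
# object, so deep equality becomes a pointer comparison (`is`).
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class App:
    func: object
    arg: object

_interned: dict = {}

def intern(term):
    """Return the canonical shared copy of a structurally equal term."""
    cached = _interned.get(term)
    if cached is not None:
        return cached
    _interned[term] = term
    return term

def mk_var(name: str) -> Var:
    return intern(Var(name))

def mk_app(func, arg) -> App:
    return intern(App(func, arg))
```

With shared terms, caching normalized forms also becomes cheap: the intern table's canonical objects make good cache keys, since each distinct term is normalized at most once.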

Namespaces and Modules

Support for namespaces, sections, and imports would allow us to manage much larger sets of theorems and definitions.

Modules would allow us to abstract over certain theories and reuse them in multiple contexts. For instance, we might develop a theory of ordered structures, then instantiate it for integers, reals, and other number systems without duplicating the foundational proofs. Alternatively, instead of separately developing theories for groups, rings, and fields, we could create a parameterized theory of algebraic structures, and apply the appropriate parameters as needed (rather than rebuilding the entire framework).

Standard Library

If we had modules, we could also build a standard library of core data types (lists, vectors, trees, etc.), common mathematical structures (groups, rings, fields), and basic theorems about numbers, sets, and functions.

Future Research Thrusts

  1. Alternate foundations

There are some differences between Coq and Lean that would be interesting to explore.

  2. Proof Search

Mentioned before.

  3. Differentiable Programming

I’m very interested in other ways to get reasoning “inside” deep neural networks. Differentiable analogues to type theories might be interesting.

Questions I still have

  1. Is it true that for any tactic, we can build the reverse mode “elaboration” of the tactic?

     • I’d love a proof of this, although I suspect it’s not true (see this post on StackExchange). Probably we can just check at the end that the term satisfies the type and has no free variables (this is also better from a design perspective, as we have to be less careful about how each tactic is constructed). I guess I’m also asking whether there is a kind of “soundness” to the typing rules: do all valid proofs come from the typing rules? This would imply that you could elaborate any set of tactics back to the typing rules.

  2. Perhaps more specifically, are there formal requirements such that, if a tactic meets them, it has a “reverse mode” elaboration?

  3. What are the best data structures to represent proofs? Are there alternatives to what we have here? Could they be optimized for machine learning?

     • I’ve seen some papers along these lines that I will check out.

  4. Are there formal laws or properties that describe how tactics compose? Could there be an algebraic theory of tactics?

  5. Could machine learning techniques be applied to either generate tactics from examples of proof terms, or to predict what proof terms will result from certain tactics?

  6. How do proof assistants verify that tactic implementations correctly produce valid proof terms? Are there formal correctness proofs for the tactics themselves?

  7. Are there better ways to test and ensure correctness?

Edits

(4/16) Added more links

Footnotes

  1. Coq’s name was changed to Rocq over the course of this project.↩︎