AxProverBase: A Minimal Agent for Automated Theorem Proving
Our ICML paper introduces AxProverBase, a simple, reproducible, open-source agent that automates theorem proving in Lean — matching far more complex systems while using up to 100× fewer resources.
- research
- formal methods
- ICML
At Axiomatic AI, we are building trustworthy AI systems for science and engineering: domains where progress depends on results that are not only innovative, but also certifiably correct. A chip layout, a control system, a photonic design, a symbolic derivation, or a mathematical argument must satisfy its specification to the fullest. Therefore, verifying that a proposed solution is sound, satisfies its constraints, and faithfully matches the intended requirements is a fundamental part of making AI useful in these settings.
Formal methods offer a path toward this kind of verified scientific reasoning: AI systems that not only generate hypotheses, derivations, or explanations, but also help produce machine-checkable evidence of their correctness. In practice, however, writing formal proofs remains difficult and time-consuming, often requiring deep expertise in both the subject matter and the formal system itself. This has motivated the development of AI tools that make formal methods more accessible to a broader community and enable their use at the scale required for verified AI systems.
Our ICML paper, “A Minimal Agent for Automated Theorem Proving”, shows that a simple and reproducible theorem-proving agent can already achieve competitive results by following the essential workflow of formal proof development: propose a proof, check it, learn from the errors, and try again. This simplicity is deliberate. For AI systems that aim to support verified science and engineering, transparency and reproducibility are part of what makes the system trustworthy.
Figure: Schematic overview of the agentic workflow in AxProverBase. The file with the target theorem are provided to the agent. Then, the agent proposes a candidate proof that is reviewed and verified. If approved, we have a valid proof! Otherwise, detailed feedback from Lean and the review system are processed by the memory mechanism to provide context for subsequent proof attempts.
The system we introduce, AxProverBase, works in Lean, a proof assistant where mathematical statements and proofs are written in a precise formal language and checked by a small trusted kernel. Given a theorem to prove, AxProverBase asks an AI model to propose a proof in Lean. The proof is then checked automatically. If it fails, the system receives concrete feedback, records useful lessons from the attempt, searches for relevant facts in existing libraries when needed, and tries again. In this way, AxProverBase turns the usual process of formal proof development into an automated and reproducible loop. A key feature is that success is not determined by whether the AI’s answer looks convincing. The proof must be accepted by Lean’s checker, resulting in a system in which the AI works toward a concrete artifact that can be independently verified.
Because AxProverBase has a clean and modular architecture, it also lets us study which parts of an automated proving system actually matter. This is important for moving from isolated demonstrations to reliable systems: if we want AI to support verified scientific and engineering work at scale, we need to understand how and when it fails and succeeds as well as which design choices make it more useful in practice.
Figure: Success rate on PutnamBench as a function of the number of sampled proofs. Sampling parallel independent proofs is an inefficient strategy to improve the success rate. Introducing a basic feedback mechanism to let the agent iteratively improve on its original proof strategy yields the largest gains. Introducing a long-term memory mechanism improves the scaling. Finally, adding search tools yields a more modest improvement.
Our experiments show that the ability to iterate on a proof is one of the strongest drivers of performance. This mirrors how human proof development often works: write an argument, check it, inspect what failed, and improve it. We also find that memory plays an important role. By keeping track of the lessons from previous attempts, the system spends less effort grappling with the mechanics of Lean, allowing it to put more effort on the actual mathematical development of the proof. Search tools provide an additional benefit, although not as significant as the gains from this feedback-driven loop with memory.
At the same time, our results make clear that the agentic framework does not remove the dependence on the underlying AI model. Stronger models benefit more from the right architecture: our framework acts as an enabler for the underlying model’s capabilities, emphasizing model differences. This is a practical advantage of the design. As frontier models improve, AxProverBase can naturally improve with them.
We evaluate AxProverBase on benchmarks that test different kinds of formal reasoning, from competition-style mathematics to research-level abstract algebra and category theory. These also cover practical implications, with some proofs requiring clever arguments while others require navigating specialized definitions, libraries, and formal structures. Across these settings, AxProverBase achieves competitive performance with much more complex systems, proving on its first attempt 54.7% of the PutnamBench problems, 98.0% of FATE-M, 66.0% of FATE-H, 24.0% of FATE-X, and 59.0% of LeanCat. These results show that such a system can be useful in realistic scenarios, where it can prove a large portion of the theorems involved in complex projects. Furthermore, AxProverBase reaches these results using up to 100x fewer resources than state-of-the-art approaches that achieve the same performance. For broad deployment, the efficiency of this result is as important as the accuracy.
AxProverBase is designed to be useful beyond a benchmark table. It is open source, modular, and resource-efficient, making it easy to inspect, improve, and integrate into practical systems. Therefore, we believe it can serve as a practical baseline for the community, allowing researchers to replace individual components, compare architectural changes systematically, and benefit from improvements in frontier models without rebuilding the whole prover. Furthermore, developers can also use it directly in their Lean projects. On this front, we provide a GitHub integration to help bring AxProverBase into real formalization projects and delegate proofs to it.