ntp-toolkit

The neural theorem proving toolkit transforms Lean repositories into datasets for training and evaluating machine learning models.

The toolkit is originally a fork of Kim Morrison's lean-training-data and developed in miniCTX.

Running extraction

To run the full pipeline on all repositories in a config file in the configs directory:

python scripts/extract_repos.py --cwd {filepath_of_this_repo} --config {filepath_of_config_file} [flags to indicate which processes to run]

The flags that can be set to indicate which processes to run are:

--training_data: This outputs to the TacticPrediction directory
--full_proof_training_data: This outputs to the FullProof directory
--premises: This outputs to the Premises directory
--state_comments: This outputs to the StateComments directory
--full_proof_training_data_states: This outputs to the FullProofWithStates directory
--training_data_with_premises: This outputs to the TrainingDataWithPremises directory
--declarations: This outputs to the Declarations directory
--imports: This outputs to the Imports directory

At least one of the above flags must be set in order for the script to run (but there should be no issue with setting multiple or even all of the above flags)

On a Macbook Pro (M3 Max, 14 CPU) it takes around 2 hours to run the extractions on mathlib.

To run a tool individually, use lake exe <tool>.
The run_pipeline.py script uses Python to call tools in this way and organize the resulting files.

Extraction tools:

`training_data`

This produces a .jsonl file where each line is an example of the following form:

{
   "state": "{tactic state}",
   "nextTactic" : "{pretty-printed next tactic}",
   "srcUpToTactic" : "{source code in the file up to the tactic invocation}",
   "declUpToTactic" : "{source code in the declaration up to the tactic invocation}",
   "decl": "{declaration without proof (e.g., statement of a theorem)}",
   "declId": "{unique identifier of the declaration}"
}

`full_proof_training_data`

This produces a .jsonl file where each line is an example of the following form:

{
   "srcUpToDecl":"{source code in the file up to the declaration}",
   "decl": "{declaration without proof (e.g., statement of a theorem)}",
   "declId": "{unique identifier of the declaration}",
   "proof":"{proof}"
}

`state_comments`

This produces Lean source files with proof states interleaved as comments after each tactic.

`premises`

This produces premises used by each constant in a module.

`declarations`

This produces information that pretty-prints each declaration in a module. The resulting format is

{
   "name": "pow_two",
   "kind": "theorem",
   "args": ["{M : Type u_2}", "[Monoid M]", "(a : M)"],
   "type": "a ^ 2 = a * a",
   "doc": "Note that most of the lemmas about powers of two refer to it as `sq`.",
   "decl": "/-- Note that most of the lemmas about powers of two refer to it as `sq`. -/\ntheorem pow_two {M : Type u_2} [Monoid M] (a : M) : a ^ 2 = a * a",
   "line": 810,
   "column": 0
}

`imports`

This outputs the imports of each module (both transitively imported modules and directly imported modules). The resulting format is

{
   "name": "{imported module name}",
   "isDirect": "{whether it is explicitly imported by the module (otherwise it is transitively imported)}"
}

`training_data_with_premises`

This produces a .jsonl file where each line contains all of the information in training_data plus the following fields:

{
   ...
   "nextTacticIsSimpOrRwVariant": "{A boolean indicating whether the next tactic is one of several `simp` or `rw` variants}",
   "nextTacticTermPremises": "{Prop-typed constants that appear in the proof term generated by the tactic}",
   "nextTacticExplicitConstants": "{Constants that appear in tactic syntax}",
   "nextTacticExplicitPremises": "{Prop-typed constants that appear in tactic syntax}",
   "nextTacticExplicitLocalHypotheses": "{Local context Prop-typed hypotheses that appear in tactic syntax}",
   "nextTacticAllPremises": "{The union of nextTacticTermPremises and nextTacticExplicitPremises}",
   "nextTacticHammerRecommendation": "{Best guess of which constants would help a hammer based on the constants that appear in the next tactic}",
   "declTermPremises": "{Prop-typed constants that appear in the overall declaration's proof term}",
   "declExplicitConstants": "{Constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
   "declExplicitPremises": "{Prop-typed constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
   "declAllPremises": "{The union of declTermPremises and declExplicitPremises}",
   "declHammerRecommendation": "{The union of each nextTacticHammerRecommendation for every tactic used to create the overall declaration's proof}",
   "termPremisesToEndOfGoal": "{Prop-typed constants that appear in the sum of all proof terms used to close the current goal}",
   "explicitConstantsToEndOfGoal": "{Constants that appear in the tactic syntax for any tactic used to close the current goal}",
   "explicitPremisesToEndOfGoal": "{Prop-typed constants that appear in the tactic syntax for any tactic used to close the current goal}",
   "allPremisesToEndOfGoal": "{The union of termPremisesToEndOfGoal and explicitPremisesToEndOfGoal}",
   "hammerRecommendationToEndOfGoal": "{The union of each nextTacticHammerRecommendation for every tactic used to close the current goal}"
}

Note: Of the above additional fields, all ...ToEndOfGoal fields are potentially approximations. They currently appear to be accurate for the patterns I have tested, but it's hard to check whether there may be edge cases for which they are inaccurate.

Running instruction tuning data generation

After extraction, you can generate various forms of (prompt, completion) examples for fine-tuning language models.

To do so, run:

python scripts/instruction_tuning.py --prompt context_state_tactic

See python scripts/instruction_tuning.py -h for other options for --prompt or other settings.

The prompt includes a natural language description of the task, commonly referred to as an "instruction" (hence the name instruction tuning data).

Other setup docs from `lean-training-data`

You may find these useful during setup.

Install elan by running

curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh

Run lake exe cache get (this downloads precompiled binaries for Mathlib).
Run lake build
Run lake exe <tool>, where <tool> is one of the programs documented below.

Projects

Projects that use or build upon ntp-toolkit:

miniCTX: Neural Theorem Proving with (Long-)Contexts
ImProver: Agent-Based Automated Proof Optimization

Submit a PR to add your project or paper here!

Citation

The toolkit is originally a fork of Kim Morrison's lean-training-data:

@misc{lean-training-data,
  author = {Kim Morrison},
  title = {lean-training-data},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/semorrison/lean-training-data}},
}

The ntp-toolkit was initially developed in miniCTX:

@misc{hu2024minictxneuraltheoremproving,
      title={miniCTX: Neural Theorem Proving with (Long-)Contexts}, 
      author={Jiewen Hu and Thomas Zhu and Sean Welleck},
      year={2024},
      eprint={2408.03350},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.03350}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

ntp-toolkit

Running extraction

Extraction tools:

`training_data`

`full_proof_training_data`

`state_comments`

`premises`

`declarations`

`imports`

`training_data_with_premises`

Running instruction tuning data generation

Other setup docs from `lean-training-data`

Projects

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

ntp-toolkit

Running extraction

Extraction tools:

training_data

full_proof_training_data

state_comments

premises

declarations

imports

training_data_with_premises

Running instruction tuning data generation

Other setup docs from lean-training-data

Projects

Citation

`training_data`

`full_proof_training_data`

`state_comments`

`premises`

`declarations`

`imports`

`training_data_with_premises`

Other setup docs from `lean-training-data`