Create x,y tokenization scheme #164

Closed
valedan opened this issue Apr 12, 2023 · 1 comment
Comments

valedan (Contributor) commented Apr 12, 2023

Instead of having one token per node, we would have a row token and a column token (plus a separator) representing each node. This would potentially allow us to scale to larger mazes: the problem with unique node tokens is that, when moving to larger mazes, the model encounters position tokens it has never seen during training.

Question - what base should the x and y be in? The model needs to have seen every digit token during training before it can generalize, so with base 10 it would need to be trained on 10x10 mazes, whereas with a base-4 scheme it would only need to be trained on 4x4 mazes before generalizing.

Blocked by #163
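
For illustration, a minimal sketch of the proposed coordinate scheme (the helper name and exact token format are hypothetical, not the eventual maze-dataset implementation):

```python
# Minimal sketch of the proposal: each node is emitted as the base-`base`
# digits of its row, a separator token, then the digits of its column.

def coord_to_tokens(row: int, col: int, base: int = 4) -> list[str]:
    """Encode a (row, col) node as digit tokens, e.g. (5, 2) -> ['1', '1', ',', '2'] in base 4."""
    def digits(n: int) -> list[str]:
        if n == 0:
            return ["0"]
        out: list[str] = []
        while n > 0:
            out.append(str(n % base))
            n //= base
        return list(reversed(out))

    return digits(row) + [","] + digits(col)

# The vocabulary needs only `base` digit tokens plus the separator, so a model
# trained on mazes of side length `base` has already seen every coordinate
# token it will encounter in larger mazes.
print(coord_to_tokens(5, 2, base=4))  # ['1', '1', ',', '2']
```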

mivanit added a commit that referenced this issue Aug 6, 2023
Refactor to be compatible with `maze-dataset` versions `0.2.1` and onwards. 

See PRs:
- [`maze_dataset` PR #5](understanding-search/maze-dataset#5)
- [`maze_dataset` PR #6](understanding-search/maze-dataset#6)

See related issues:
- #164 
- #163 
- #77 

These changes also revert changes in #118, to be consistent with underscores only appearing once in the special tokens.

# commit history:

* test_cfg_post_init working

* migrated SPECIAL_TOKENS usage

* wip

* wip

* wip, all but 3 in tok tests passing

* test_tokenizers passing

* unit tests passing (but need to update maze_dataset dep)

* poetry lock

* format

* remove deprecated kwarg to process_weights_

Upgrading transformer_lens to 1.4.0 caused
`HookedTransformer.process_weights_()` to no longer accept
the keyword arg `move_state_dict_to_device`

However, I'm not sure this was important in the first place.
If any issues come up, move the state dict to the device manually in
`ZanjHookedTransformer._load_state_dict_wrapper()`, which is where all of
this was happening previously.

* fixed MazeTokenizer not being passed to as_tokens() in some spots

* updated changed dataset config key

since we removed tokenizer stuff from the dataset

* fixed eval_model nb, added ZanjHookedTransformer.config ref

the `eval_model.ipynb` notebook has a function `testdata_plot_predicted_path`
which was using `model.zanj_model_config` to get the tokenizer; that attribute
is missing from the `RandomBaseline` class since it only inherits from `HookedTransformer`

to fix this:

- `ZanjHookedTransformer` now has a `config` property which just
  accesses the `zanj_model_config` used by the parent `ConfiguredModel`
- `testdata_plot_predicted_path` now uses `model.config` everywhere

* lock after update maze-dataset to 0.2.1

* fixed minor import issue

* update configs refs in train_model notebook

* lock poetry, re-run notebook

* format

* update coverage
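
The commit note about `process_weights_` mentions moving the state dict to the device manually as a fallback. A minimal sketch of what that could look like (a hypothetical standalone helper, not existing maze-transformer code):

```python
# Hypothetical helper for the fallback described in the commit notes: move
# every tensor in a state dict to the target device before loading it.
import torch

def state_dict_to_device(
    state_dict: dict[str, torch.Tensor], device: torch.device | str
) -> dict[str, torch.Tensor]:
    """Return a copy of `state_dict` with all tensors moved to `device`."""
    return {name: tensor.to(device) for name, tensor in state_dict.items()}
```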
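The `eval_model.ipynb` fix adds a `config` property that aliases `zanj_model_config`. A rough sketch of that aliasing, assuming the class inherits from `HookedTransformer`; the real class in maze-transformer also handles zanj (de)serialization:

```python
from transformer_lens import HookedTransformer

class ZanjHookedTransformer(HookedTransformer):
    # sketch only: the real class sets `zanj_model_config` during construction
    @property
    def config(self):
        # alias the ConfiguredModel config stored as `zanj_model_config`,
        # so notebook code can uniformly use `model.config`
        return self.zanj_model_config
```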
mivanit (Member) commented Aug 6, 2023

Added in #191; see the linked maze-dataset PRs in #191 for more details.

The tokenization scheme described here is `AOTP_indexed`, but there needs to be more documentation about it in `maze_dataset`.

mivanit closed this as completed Aug 6, 2023