Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor for modern 2022 python style and usage #80

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

mattsta
Copy link

@mattsta mattsta commented Jul 24, 2022

No actual functionality changes, just a lot of cleanup to make the project look like it belongs in the present instead of something half abandoned from 10 years ago.

Some of these changes are important because other projects copy this project over and over again, so every legacy code architecture pattern here gets proliferated unnecessarily into the future for all time.

There's simple solutions to hacks I've seen in this codebase copied and pasted into dozens of other projects (like the 10 line "continuous iterator" hack which actually has a one line python builtin for doing it properly: itertools.cycle(iterable_thing))

Also refactored lots of modern python usage like using pathlib.Path for easy and more semantic file manipulation instead of a dozen different os.path functions.

and of course complying with modern syntax means we need to use the standard python code formatter instead of bespoke "data-science-ese" of write-once disposable code patterns (but it still gets copied in thousands of places and just ends up decaying more and more over time)

also added a pyproject/poetry config to specify requirements somewhat (and as a bonus, i configured current scripts as poetry runnables so you can just do things like poetry run adder and it works automatically)

good luck.

Join 2022 with the rest of us.

Fixed via: `fd -e py -X black`

```
$ black --version
black, 22.6.0 (compiled: yes)
Python (CPython) 3.10.2
```
I've seen this pattern copied out to a dozen different ML repos, but it's
just completely wrong?

Instead of using 10 lines of "hack python to be smarter than python," it
can all be replaced with a one-line python-provided API for repeating
a single iterator forever:

    for iter_num, batch in enumerate(itertools.cycle(train_loader)):
This is the "proper" python way of doing path operations. It
makes file operations one-liners with half the typing vs using 10 year
outdated legacy python patterns.

Also reducing some unnecessary line padding and unnecessary text
decoration in places too.
Using globals to store data between calls? What?!

The proper way is for callbacks to have access to an internal aribtrary
state object for their own accounting purposes. this does thusly.
Small minor non-important fix, but it makes the intent more clear and
shows people there's more python built-ins to explore.
The new syntax makes formatting intent much clearer and easier to
understand than legacy python string formatting syntax.

The new syntax only requires knowing like 2 concepts while the old
formatting syntax requires knowing 8 different things about how strings
and formatting and replacements and substitutions happen in legacy python strings.
Hints for improving readability:
    - don't smash 10+ lines together without line breaks
    - if you have comments for a block, make it a visual block with
      newlines instead of:
      #comment\ncode\n#comment\ncode\ncode\n#comment\ncode\n....
    - don't create any globals or code under __main__, just call a
      helper function. this also helps when wanting to turn python
      entrypoints into single callable scripts with paraemters since a
      function entry point can be called easier (a la Fire, etc)
    - avoid end-of-line comments where possible because it makes lines
      too long then the code formatter uses worse formatting.
      prefer "# big comment\ncode" instead of "code # big comment"
Also added 'adder' helper runner (see below)

Enables:
    - automated package buliding: poetry build
    - automated command running
        - example: poetry run adder
    - all the other modern package management things like dependency
      verification, isolated install environments, etc
poetry run chargpt (if you have 'input.txt' in your local directory)

Also includes the basic Path/style/formatting/cmd refactors made to
adder.
Usable via: poetry run bpe

Fixed same issues as cleaning up everything else:
    - pathlib.Path everywhere instead of direct os module file manipulation
    - improves code readability (the entire purpose of code, after all)
    - improves output readability too with more logical breaks for
      reading/understanding what's actually happening
@karpathy
Copy link
Owner

I appreciate the effort, I reviewed the full diff, but I don't like it. But I do like the itertools.cycle call, which I was not originally aware of and I think may want to adopt.

@karpathy
Copy link
Owner

Actually on closer look I'm a bit worried about itertools.cycle, as it caches the individual elements in memory to yield through them on second pass. I am worried this would consume a large amount of memory, which the current implementation discards and recreates each element of the data loader on demand. Punting for now.

@mattsta
Copy link
Author

mattsta commented Jul 26, 2022

oh, good catch. could also just do it manually with a hand rolled infinite generator:

In [1]: import random

In [2]: randos = [random.randbytes(3000) for i in range(333)]

In [3]: def forever():
   ...:     while True:
   ...:         yield from randos

In [4]: for i, got in enumerate(forever()):
   ...:     print(i)
   ...:     if i > 3000000:
   ...:         break

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants