New Concurrency Model #2634

Draft: wants to merge 62 commits into base: main

@nvoxland nvoxland commented Oct 3, 2023

🚀 🚀 Pull Request

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

Replaces the existing lock-based write consistency system with a new commit-log style system inspired by Delta Lake.

With this system, there is now a new _deeplake_log directory that stores atomically created files describing the operations performed against deeplake. Files in the root of that directory correspond to changes to the "main" branch, and sub-directories contain operations for different branches.
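A minimal sketch of how a writer could rely on atomic file creation, assuming a local filesystem where exclusive-create is atomic (object stores would need their own put-if-absent primitive). `try_commit` and the 20-digit zero-padded file name are illustrative assumptions, not code from this PR:

```python
import os
import tempfile


def try_commit(branch_dir, version, payload, width=20):
    """Create the log file for `version` only if no writer has claimed it yet.

    Hypothetical helper: `width=20` is an assumed zero-padding width.
    """
    path = os.path.join(branch_dir, f"{version:0{width}d}.json")
    try:
        # O_EXCL makes the create fail if the file already exists, so only
        # one of several competing writers can win a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    # Note: only the creation is atomic here; the write that follows is not.
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


with tempfile.TemporaryDirectory() as branch_dir:
    first = try_commit(branch_dir, 1, '{"metadata":{"name":null}}')
    second = try_commit(branch_dir, 1, '{"metadata":{"name":"other"}}')
    created = sorted(os.listdir(branch_dir))

print(first, second, created)
```

A writer that loses the race would typically re-read the log, check for conflicts, and retry with the next version number; that retry policy is not shown here.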

Example directory structure:

_deeplake_log
    _meta
        00000000000000000001.json
        00000000000000000002.json
        00000000000000000003.json
        00000000000000000004.json
    65af17c7c4bc4cbbbcafa
        00000000000000000001.json
        00000000000000000002.json
    e8d4ce5b70e8476fb811
        00000000000000000002.json

An example JSON file:

{"protocol":{"minReaderVersion":4,"minWriterVersion":4}}{"metadata":{"createdTime":1696350362244,"description":null,"id":"0d1a97d8-6cc4-4e12-b603-b5af37e8024d","name":null}}{"branch":{"fromId":"","fromVersion":-1,"id":"","name":"main"}}
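As a rough illustration of reading such a file: the example holds several JSON objects back to back, each with a single top-level key naming the action. The sketch below decodes them one at a time with the standard library. `parse_actions` is a hypothetical name, and tolerating whitespace between objects is an assumption so that newline-separated files would parse too:

```python
import json


def parse_actions(text):
    """Split the text of one log file into a list of action dicts."""
    decoder = json.JSONDecoder()
    actions = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        actions.append(obj)
    return actions


sample = (
    '{"protocol":{"minReaderVersion":4,"minWriterVersion":4}}'
    '{"metadata":{"createdTime":1696350362244,"description":null,'
    '"id":"0d1a97d8-6cc4-4e12-b603-b5af37e8024d","name":null}}'
    '{"branch":{"fromId":"","fromVersion":-1,"id":"","name":"main"}}'
)

actions = parse_actions(sample)
kinds = [next(iter(action)) for action in actions]
print(kinds)
```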

Readers always work from a particular branch and version, identified by the directory and file name (for example e8d4ce5b70e8476fb811 + 2), and build up the current state by unioning all the JSON files up to that version. All files (both deeplog files and data files) are immutable: a change writes a new file version rather than overwriting an existing one.
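The read path described above could look roughly like the sketch below. It only collects the actions of a single branch directory up to a given version; following a branch back to its parent (the fromId / fromVersion fields) and reading checkpoints are left out. `read_actions` and `load_actions` are hypothetical names, not the functions in deeplake/deeplog:

```python
import json
import tempfile
from pathlib import Path


def read_actions(path):
    """Decode every JSON object stored back to back in one log file."""
    decoder = json.JSONDecoder()
    text = path.read_text()
    actions, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        actions.append(obj)
    return actions


def load_actions(log_dir, branch_id, version):
    """Return the actions of `branch_id` with file version <= `version`, oldest first."""
    branch_dir = Path(log_dir) / branch_id
    # The version is the integer value of the zero-padded file name.
    numbered = [(int(p.stem), p) for p in branch_dir.glob("*.json")]
    actions = []
    for file_version, path in sorted(numbered):
        if file_version <= version:
            actions.extend(read_actions(path))
    return actions


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_deeplake_log"
    branch = log_dir / "e8d4ce5b70e8476fb811"
    branch.mkdir(parents=True)
    (branch / "00000000000000000001.json").write_text(
        '{"protocol":{"minReaderVersion":4,"minWriterVersion":4}}'
        '{"metadata":{"name":null}}'
    )
    (branch / "00000000000000000002.json").write_text('{"metadata":{"name":"renamed"}}')
    (branch / "00000000000000000003.json").write_text('{"metadata":{"name":"later"}}')
    # A reader pinned to version 2 never looks at file 3.
    at_v2 = load_actions(log_dir, "e8d4ce5b70e8476fb811", 2)

print(at_v2)
```

Because files are immutable and a reader is pinned to a version, later commits by other writers do not change what this reader sees.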

As an optimization, the unioned JSON files are periodically checkpoint()ed into Parquet files, and unused files can be vacuum()ed away.

See https://hubdb.slack.com/archives/C01HCMDL97F for more information

Things to be aware of

In progress...

Things to worry about

In progress...

Additional Context

In progress...

nvoxland and others added 30 commits September 12, 2023 18:17
# Conflicts:
#	deeplake/core/dataset/dataset.py
#	deeplake/core/tensor.py
#	deeplake/deeplog/actions.py
#	deeplake/deeplog/deeplog.py
#	deeplake/util/version_control.py
FayazRahman and others added 30 commits September 25, 2023 13:23
Fixed action filtering
- Defined version 0 as "no data" (so no 000.json file)
- Main branch is a random-generated id like any other branch
- Collapsed methods into metadata_snapshot->find_branch()