CLI to upload arbitrary huge folder (#2254)
* still an early draft

* this is better

* fix

* revamp/refactor download process

* resume download by default + do not upload .huggingface folder

* compute sha256 if necessary

* fix hash

* add tests + fix some stuff

* fix snapshot download tests

* fix test

* lots of docs

* add secu

* as constant

* fix

* fix tests

* remove unused code

* don't use jsons

* style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Warn more about resume_download

* fix test

* Add tests specific to .huggingface folder

* remove advice to use hf_transfer when downloading from cli

* fix torch test

* more test fix

* feedback

* First draft for large upload CLI

* Fixes + CLI

* verbose by default

* ask for report

* line

* suggested changes

* more robust

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <[email protected]>

* comment

* comment

* robust tests

* fix CI

* ez

* rules update

* more robust?

* allow for 1s diff

* don't raise on unlink

* style

* robustness

* tqdm while recovering

* make sure upload paths are correct on windows

* test get_local_upload_paths

* only 1 preupload LFS at a time if hf_transfer enabled

* upload one at a time if hf_transfer

* Add waiting workers in report

* better reporting

* raise on KeyboardInterrupt + can disable bars

* fix type annotation on Python3.8

* make repo_type required

* docstring

* style

* fix circular import

* docs

* docstring

* init

* guide

* dedup

* instructions

* add test

* style

* tips

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>

* typo

* remove comment

* comments

* move determine_task to its own method

* rename to upload_large_folder

* fix md

* update

* dont wait on exit

* Fix typo in docs

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>

* add PR tips

* add comment

* add comment about --no-bars

---------

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Omar Sanseviero <[email protected]>
4 people authored Aug 29, 2024
1 parent 893e889 commit ecbbeb3
Showing 11 changed files with 1,172 additions and 34 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -138,3 +138,5 @@ dmypy.json

# Spell checker config
cspell.json

tmp*
104 changes: 74 additions & 30 deletions docs/source/en/guides/upload.md
@@ -103,6 +103,80 @@
set, files are uploaded at the root of the repo.

For more details about the CLI upload command, please refer to the [CLI guide](./cli#huggingface-cli-upload).

## Upload a large folder

In most cases, the [`upload_folder`] method and `huggingface-cli upload` command should be the go-to solutions to upload files to the Hub. They ensure a single commit will be made, handle a lot of use cases, and fail explicitly when something goes wrong. However, when dealing with a large amount of data, you will usually prefer a resilient process, even if it leads to more commits or requires more CPU usage. The [`upload_large_folder`] method has been implemented in that spirit:
- it is resumable: the upload process is split into many small tasks (hashing files, pre-uploading them, and committing them). Each time a task is completed, its result is cached locally in a `./.cache/huggingface/` folder inside the folder you are trying to upload. This way, restarting the process after an interruption skips the tasks that were already completed.
- it is multi-threaded: hashing large files and pre-uploading them benefit a lot from multithreading if your machine allows it.
- it is resilient to errors: a high-level retry mechanism retries each independent task indefinitely until it succeeds, whether it fails with an `OSError`, `ConnectionError`, `PermissionError`, etc. This mechanism is double-edged: if transient errors happen, the process will continue and retry; if permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause. A minimal sketch of such a retry loop is shown below.
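
To make the retry behavior concrete, here is a minimal, hypothetical sketch of such a loop. It is not the actual `huggingface_hub` implementation (which schedules tasks across worker threads), just an illustration of retrying indefinitely while letting `KeyboardInterrupt` through:

```py
import logging
import time

logger = logging.getLogger(__name__)


def run_with_retry(task, *args, delay: float = 10.0, **kwargs):
    """Run `task` until it succeeds, sleeping `delay` seconds between failed attempts."""
    while True:
        try:
            return task(*args, **kwargs)
        except KeyboardInterrupt:
            raise  # let the user interrupt the process
        except Exception as e:
            logger.warning(f"Task {getattr(task, '__name__', task)} failed ({e!r}). Retrying in {delay}s...")
            time.sleep(delay)
```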

If you want more technical details about how `upload_large_folder` is implemented under the hood, please have a look at the [`upload_large_folder`] package reference.

Here is how to use [`upload_large_folder`] in a script. The method signature is very similar to [`upload_folder`]:

```py
>>> api.upload_large_folder(
... repo_id="HuggingFaceM4/Docmatix",
... repo_type="dataset",
... folder_path="/path/to/local/docmatix",
... )
```

You will see the following output in your terminal:
```
Repo created: https://huggingface.co/datasets/HuggingFaceM4/Docmatix
Found 5 candidate files to upload
Recovering from metadata files: 100%|█████████████████████████████████████| 5/5 [00:00<00:00, 542.66it/s]
---------- 2024-07-22 17:23:17 (0:00:00) ----------
Files: hashed 5/5 (5.0G/5.0G) | pre-uploaded: 0/5 (0.0/5.0G) | committed: 0/5 (0.0/5.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 5 | committing: 0 | waiting: 11
---------------------------------------------------
```

First, the repo is created if it didn't exist before. Then, the local folder is scanned for files to upload. For each file, we try to recover metadata information (from a previously interrupted upload). From there, workers are launched and a status report is printed every minute. Here, we can see that 5 files have already been hashed but not pre-uploaded; 5 workers are pre-uploading files while the other 11 are waiting for a task.

A command-line interface is also provided. You can define the number of workers and the level of verbosity in the terminal:

```sh
huggingface-cli upload-large-folder HuggingFaceM4/Docmatix --repo-type=dataset /path/to/local/docmatix --num-workers=16
```

<Tip>

For large uploads, you have to set `repo_type="model"` or `--repo-type=model` explicitly. Usually, this information is implicit in all other `HfApi` methods. This requirement avoids uploading data to a repository of the wrong type, in which case you would have to re-upload everything.

</Tip>

<Tip warning={true}>

While much more robust for uploading large folders, `upload_large_folder` is more limited than [`upload_folder`] feature-wise. In practice:
- you cannot set a custom `path_in_repo`. If you want to upload to a subfolder, you need to set the proper structure locally.
- you cannot set a custom `commit_message` or `commit_description`, since multiple commits are created.
- you cannot delete from the repo while uploading. Please make a separate commit first.
- you cannot create a PR directly. Please create a PR first (from the UI or using [`create_pull_request`]) and then commit to it by passing `revision` (see the sketch after this tip).

</Tip>
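
For example, committing a large folder to a pull request could look like the sketch below. The repo id and folder path are placeholders, and it assumes the discussion object returned by [`create_pull_request`] exposes its git reference (e.g. `"refs/pr/42"`) as `git_reference`:

```py
>>> from huggingface_hub import HfApi
>>> api = HfApi()

# Create an (empty) pull request first...
>>> pr = api.create_pull_request(repo_id="HuggingFaceM4/Docmatix", repo_type="dataset", title="Add new shards")

# ...then commit the folder to the PR branch by passing its git reference as `revision`.
>>> api.upload_large_folder(
...     repo_id="HuggingFaceM4/Docmatix",
...     repo_type="dataset",
...     folder_path="/path/to/local/docmatix",
...     revision=pr.git_reference,
... )
```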

### Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data, having an upload or push fail at the end of the process, or running into a degraded experience, be it on hf.co or when working locally, can be very frustrating.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's go over some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to assume that something will fail at least once, whether it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you have already uploaded before uploading the next batch. An LFS file that has already been committed is guaranteed never to be re-uploaded, but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth (a Python sketch follows the tip below). To use `hf_transfer`:
1. Specify the `hf_transfer` extra when installing `huggingface_hub`
(i.e., `pip install huggingface_hub[hf_transfer]`).
2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

<Tip warning={true}>

`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>
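
As a minimal sketch, enabling `hf_transfer` from a Python script could look like the following. The repo id and folder path are placeholders; the environment variable is set before importing `huggingface_hub` so that it is picked up:

```py
import os

# Must be set before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(
    repo_id="username/my-large-dataset",  # placeholder
    repo_type="dataset",
    folder_path="/path/to/local/folder",  # placeholder
)
```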

## Advanced features

In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.
@@ -418,36 +492,6 @@
you don't store another reference to it. This is expected as we don't want to keep in memory the content that is
already uploaded. Finally we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.

## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always
best to consider that something will fail at least once -no matter if it's due to your machine, your connection, or our
servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you
already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never
be re-uploaded twice but checking it client-side can still save some time.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up
uploads on machines with very high bandwidth. To use `hf_transfer`:

1. Specify the `hf_transfer` extra when installing `huggingface_hub`
(e.g. `pip install huggingface_hub[hf_transfer]`).
2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

<Tip warning={true}>

`hf_transfer` is a power user tool!
It is tested and production-ready,
but it lacks user-friendly features like advanced error handling or proxies.
For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>

## (legacy) Upload files with Git LFS

All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.
2 changes: 2 additions & 0 deletions src/huggingface_hub/__init__.py
@@ -252,6 +252,7 @@
"update_webhook",
"upload_file",
"upload_folder",
"upload_large_folder",
"whoami",
],
"hf_file_system": [
@@ -756,6 +757,7 @@ def __dir__():
update_webhook, # noqa: F401
upload_file, # noqa: F401
upload_folder, # noqa: F401
upload_large_folder, # noqa: F401
whoami, # noqa: F401
)
from .hf_file_system import (
195 changes: 191 additions & 4 deletions src/huggingface_hub/_local_folder.py
@@ -34,7 +34,7 @@
└── [ 16] file.parquet
Metadata file structure:
Download metadata file structure:
```
# file.txt.metadata
11c5a3d5811f50298f278a704980280950aedb10
@@ -68,7 +68,7 @@ class LocalDownloadFilePaths:
"""
Paths to the files related to a download process in a local dir.
Returned by `get_local_download_paths`.
Returned by [`get_local_download_paths`].
Attributes:
file_path (`Path`):
@@ -88,6 +88,30 @@ def incomplete_path(self, etag: str) -> Path:
return self.metadata_path.with_suffix(f".{etag}.incomplete")


@dataclass(frozen=True)
class LocalUploadFilePaths:
"""
Paths to the files related to an upload process in a local dir.
Returned by [`get_local_upload_paths`].
Attributes:
path_in_repo (`str`):
Path of the file in the repo.
file_path (`Path`):
Path of the local file to be uploaded.
lock_path (`Path`):
Path to the lock file used to ensure atomicity when reading/writing metadata.
metadata_path (`Path`):
Path to the metadata file.
"""

path_in_repo: str
file_path: Path
lock_path: Path
metadata_path: Path


@dataclass
class LocalDownloadFileMetadata:
"""
@@ -111,6 +135,50 @@ class LocalDownloadFileMetadata:
timestamp: float


@dataclass
class LocalUploadFileMetadata:
"""
Metadata about a file in the local directory related to an upload process.
"""

size: int

# Default values correspond to "we don't know yet"
timestamp: Optional[float] = None
should_ignore: Optional[bool] = None
sha256: Optional[str] = None
upload_mode: Optional[str] = None
is_uploaded: bool = False
is_committed: bool = False

def save(self, paths: LocalUploadFilePaths) -> None:
"""Save the metadata to disk."""
with WeakFileLock(paths.lock_path):
with paths.metadata_path.open("w") as f:
new_timestamp = time.time()
f.write(str(new_timestamp) + "\n")

f.write(str(self.size)) # never None
f.write("\n")

if self.should_ignore is not None:
f.write(str(int(self.should_ignore)))
f.write("\n")

if self.sha256 is not None:
f.write(self.sha256)
f.write("\n")

if self.upload_mode is not None:
f.write(self.upload_mode)
f.write("\n")

f.write(str(int(self.is_uploaded)) + "\n")
f.write(str(int(self.is_committed)) + "\n")

self.timestamp = new_timestamp
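
# Illustrative example of the resulting metadata file once all fields are known,
# with hypothetical values (one field per line, in the order written by `save` above):
#
#   1719983329.3714333      <- timestamp
#   1024                    <- size (in bytes)
#   0                       <- should_ignore (0/1)
#   a16a55fda99d2f2e...     <- sha256
#   lfs                     <- upload_mode ("regular" or "lfs")
#   1                       <- is_uploaded (0/1)
#   0                       <- is_committed (0/1)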


@lru_cache(maxsize=128) # ensure singleton
def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
"""Compute paths to the files related to a download process.
@@ -152,6 +220,49 @@ def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
return LocalDownloadFilePaths(file_path=file_path, lock_path=lock_path, metadata_path=metadata_path)


@lru_cache(maxsize=128) # ensure singleton
def get_local_upload_paths(local_dir: Path, filename: str) -> LocalUploadFilePaths:
"""Compute paths to the files related to an upload process.
Folders containing the paths are all guaranteed to exist.
Args:
local_dir (`Path`):
Path to the local directory that is uploaded.
filename (`str`):
Path of the file in the repo.
Return:
[`LocalUploadFilePaths`]: the paths to the files (file_path, lock_path, metadata_path).
"""
# filename is the path in the Hub repository (separated by '/')
# make sure to have a cross platform transcription
sanitized_filename = os.path.join(*filename.split("/"))
if os.name == "nt":
if sanitized_filename.startswith("..\\") or "\\..\\" in sanitized_filename:
raise ValueError(
f"Invalid filename: cannot handle filename '{sanitized_filename}' on Windows. Please ask the repository"
" owner to rename this file."
)
file_path = local_dir / sanitized_filename
metadata_path = _huggingface_dir(local_dir) / "upload" / f"{sanitized_filename}.metadata"
lock_path = metadata_path.with_suffix(".lock")

# Some Windows versions do not allow for paths longer than 255 characters.
# In this case, we must specify it as an extended path by using the "\\?\" prefix
if os.name == "nt":
if not str(local_dir).startswith("\\\\?\\") and len(os.path.abspath(lock_path)) > 255:
file_path = Path("\\\\?\\" + os.path.abspath(file_path))
lock_path = Path("\\\\?\\" + os.path.abspath(lock_path))
metadata_path = Path("\\\\?\\" + os.path.abspath(metadata_path))

file_path.parent.mkdir(parents=True, exist_ok=True)
metadata_path.parent.mkdir(parents=True, exist_ok=True)
return LocalUploadFilePaths(
path_in_repo=filename, file_path=file_path, lock_path=lock_path, metadata_path=metadata_path
)
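
# Hypothetical usage sketch (assuming the metadata directory resolves to `.cache/huggingface/`):
#   paths = get_local_upload_paths(Path("/data/my-folder"), "train/shard-00000.parquet")
#   paths.file_path      -> /data/my-folder/train/shard-00000.parquet
#   paths.metadata_path  -> /data/my-folder/.cache/huggingface/upload/train/shard-00000.parquet.metadata
#   paths.lock_path      -> /data/my-folder/.cache/huggingface/upload/train/shard-00000.parquet.lock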


def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
"""Read metadata about a file in the local directory related to a download process.
@@ -165,8 +276,6 @@ def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
`[LocalDownloadFileMetadata]` or `None`: the metadata if it exists, `None` otherwise.
"""
paths = get_local_download_paths(local_dir, filename)
# file_path = local_file_path(local_dir, filename)
# lock_path, metadata_path = _download_metadata_file_path(local_dir, filename)
with WeakFileLock(paths.lock_path):
if paths.metadata_path.exists():
try:
@@ -204,6 +313,84 @@ def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
return None


def read_upload_metadata(local_dir: Path, filename: str) -> LocalUploadFileMetadata:
"""Read metadata about a file in the local directory related to an upload process.
TODO: factorize logic with `read_download_metadata`.
Args:
local_dir (`Path`):
Path to the local directory in which files are uploaded.
filename (`str`):
Path of the file in the repo.
Return:
[`LocalUploadFileMetadata`]: the metadata for the file. If no valid metadata exists yet, a fresh object containing only the file size is returned.
"""
paths = get_local_upload_paths(local_dir, filename)
with WeakFileLock(paths.lock_path):
if paths.metadata_path.exists():
try:
with paths.metadata_path.open() as f:
timestamp = float(f.readline().strip())

size = int(f.readline().strip()) # never None

_should_ignore = f.readline().strip()
should_ignore = None if _should_ignore == "" else bool(int(_should_ignore))

_sha256 = f.readline().strip()
sha256 = None if _sha256 == "" else _sha256

_upload_mode = f.readline().strip()
upload_mode = None if _upload_mode == "" else _upload_mode
if upload_mode not in (None, "regular", "lfs"):
raise ValueError(f"Invalid upload mode in metadata {paths.path_in_repo}: {upload_mode}")

is_uploaded = bool(int(f.readline().strip()))
is_committed = bool(int(f.readline().strip()))

metadata = LocalUploadFileMetadata(
timestamp=timestamp,
size=size,
should_ignore=should_ignore,
sha256=sha256,
upload_mode=upload_mode,
is_uploaded=is_uploaded,
is_committed=is_committed,
)
except Exception as e:
# remove the metadata file if it is corrupted / not the right format
logger.warning(
f"Invalid metadata file {paths.metadata_path}: {e}. Removing it from disk and continue."
)
try:
paths.metadata_path.unlink()
except Exception as e:
logger.warning(f"Could not remove corrupted metadata file {paths.metadata_path}: {e}")

# TODO: can we do better?
if (
metadata.timestamp is not None
and metadata.is_uploaded # file was uploaded
and not metadata.is_committed # but not committed
and time.time() - metadata.timestamp > 20 * 3600 # and it's been more than 20 hours
): # => we consider it as garbage-collected by S3
metadata.is_uploaded = False

# check if the file exists and hasn't been modified since the metadata was saved
try:
if metadata.timestamp is not None and paths.file_path.stat().st_mtime <= metadata.timestamp:
return metadata
logger.info(f"Ignored metadata for '{filename}' (outdated). Will re-compute hash.")
except FileNotFoundError:
# file does not exist => metadata is outdated
pass

# empty metadata => we don't know anything except its size
return LocalUploadFileMetadata(size=paths.file_path.stat().st_size)
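
# Hypothetical usage sketch: inspect the cached upload state of a file.
#   meta = read_upload_metadata(Path("/data/my-folder"), "train/shard-00000.parquet")
#   if meta.sha256 is None:        # not hashed yet
#       ...
#   elif not meta.is_uploaded:     # hashed but not yet pre-uploaded
#       ...
#   elif not meta.is_committed:    # pre-uploaded but not yet part of a commit
#       ...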


def write_download_metadata(local_dir: Path, filename: str, commit_hash: str, etag: str) -> None:
"""Write metadata about a file in the local directory related to a download process.