CLI to upload arbitrary huge folder (#2254)
* still an early draft

* this is better

* fix

* revamp/refactor download process

* resume download by default + do not upload .huggingface folder

* compute sha256 if necessary

* fix hash

* add tests + fix some stuff

* fix snapshot download tests

* fix test

* lots of docs

* add secu

* as constant

* fix

* fix tests

* remove unused code

* don't use jsons

* style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Warn more about resume_download

* fix test

* Add tests specific to .huggingface folder

* remove advice to use hf_transfer when downloading from cli

* fix torch test

* more test fix

* feedback

* First draft for large upload CLI

* Fixes + CLI

* verbose by default

* ask for report

* line

* suggested changes

* more robust

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <[email protected]>

* comment

* comment

* robust tests

* fix CI

* ez

* rules update

* more robust?

* allow for 1s diff

* don't raise on unlink

* style

* robustness

* tqdm while recovering

* make sure upload paths are correct on windows

* test get_local_upload_paths

* only 1 preupload LFS at a time if hf_transfer enabled

* upload one at a time if hf_transfer

* Add waiting workers in report

* better reporting

* raise on KeyboardInterrupt + can disable bars

* fix type annotation on Python3.8

* make repo_type required

* docstring

* style

* fix circular import

* docs

* docstring

* init

* guide

* dedup

* instructions

* add test

* style

* tips

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>

* typo

* remove comment

* comments

* move determine_task to its own method

* rename to upload_large_folder

* fix md

* update

* dont wait on exit

* Fix typo in docs

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>

* add PR tips

* add comment

* add comment about --no-bars

---------

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Omar Sanseviero <[email protected]>
4 people authored Aug 29, 2024
1 parent 893e889 commit ecbbeb3
Showing 11 changed files with 1,172 additions and 34 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -138,3 +138,5 @@ dmypy.json

# Spell checker config
cspell.json

tmp*
104 changes: 74 additions & 30 deletions docs/source/en/guides/upload.md
@@ -103,6 +103,80 @@
set, files are uploaded at the root of the repo.

For more details about the CLI upload command, please refer to the [CLI guide](./cli#huggingface-cli-upload).

## Upload a large folder

In most cases, the [`upload_folder`] method and `huggingface-cli upload` command should be the go-to solutions to upload files to the Hub. They ensure a single commit will be made, handle a lot of use cases, and fail explicitly when something goes wrong. However, when dealing with a large amount of data, you will usually prefer a resilient process, even if it leads to more commits or requires more CPU usage. The [`upload_large_folder`] method has been implemented in that spirit:
- it is resumable: the upload process is split into many small tasks (hashing files, pre-uploading them, and committing them). Each time a task is completed, its result is cached locally in a `./.cache/huggingface/` folder inside the folder you are trying to upload. This way, restarting the process after an interruption skips the tasks that were already completed.
- it is multi-threaded: hashing large files and pre-uploading them benefit a lot from multithreading if your machine allows it.
- it is resilient to errors: a high-level retry mechanism retries each independent task indefinitely until it succeeds, whether it fails with an `OSError`, `ConnectionError`, `PermissionError`, etc. This mechanism is double-edged: if transient errors happen, the process will continue and retry; if permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause. A minimal sketch of such a retry loop is shown below.
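
To make the retry behavior concrete, here is a minimal, hypothetical sketch of such a loop. It is not the actual `huggingface_hub` implementation (which schedules tasks across worker threads), just an illustration of retrying indefinitely while letting `KeyboardInterrupt` through:

```py
import logging
import time

logger = logging.getLogger(__name__)


def run_with_retry(task, *args, delay: float = 10.0, **kwargs):
    """Run `task` until it succeeds, sleeping `delay` seconds between failed attempts."""
    while True:
        try:
            return task(*args, **kwargs)
        except KeyboardInterrupt:
            raise  # let the user interrupt the process
        except Exception as e:
            logger.warning(f"Task {getattr(task, '__name__', task)} failed ({e!r}). Retrying in {delay}s...")
            time.sleep(delay)
```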

If you want more technical details about how `upload_large_folder` is implemented under the hood, please have a look at the [`upload_large_folder`] package reference.

Here is how to use [`upload_large_folder`] in a script. The method signature is very similar to [`upload_folder`]:

```py
>>> api.upload_large_folder(
... repo_id="HuggingFaceM4/Docmatix",
... repo_type="dataset",
... folder_path="/path/to/local/docmatix",
... )
```

You will see the following output in your terminal:
```
Repo created: https://huggingface.co/datasets/HuggingFaceM4/Docmatix
Found 5 candidate files to upload
Recovering from metadata files: 100%|█████████████████████████████████████| 5/5 [00:00<00:00, 542.66it/s]
---------- 2024-07-22 17:23:17 (0:00:00) ----------
Files: hashed 5/5 (5.0G/5.0G) | pre-uploaded: 0/5 (0.0/5.0G) | committed: 0/5 (0.0/5.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 5 | committing: 0 | waiting: 11
---------------------------------------------------
```

First, the repo is created if it didn't exist before. Then, the local folder is scanned for files to upload. For each file, we try to recover metadata information (from a previously interrupted upload). From there, workers are launched and a status report is printed every minute. Here, we can see that 5 files have already been hashed but not pre-uploaded; 5 workers are pre-uploading files while the other 11 are waiting for a task.

A command-line interface is also provided. You can define the number of workers and the level of verbosity in the terminal:

```sh
huggingface-cli upload-large-folder HuggingFaceM4/Docmatix --repo-type=dataset /path/to/local/docmatix --num-workers=16
```

<Tip>

For large uploads, you have to set `repo_type="model"` or `--repo-type=model` explicitly. Usually, this information is implicit in all other `HfApi` methods. This requirement avoids uploading data to a repository of the wrong type, in which case you would have to re-upload everything.

</Tip>

<Tip warning={true}>

While much more robust for uploading large folders, `upload_large_folder` is more limited than [`upload_folder`] feature-wise. In practice:
- you cannot set a custom `path_in_repo`. If you want to upload to a subfolder, you need to set the proper structure locally.
- you cannot set a custom `commit_message` or `commit_description`, since multiple commits are created.
- you cannot delete from the repo while uploading. Please make a separate commit first.
- you cannot create a PR directly. Please create a PR first (from the UI or using [`create_pull_request`]) and then commit to it by passing `revision` (see the sketch after this tip).

</Tip>
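
For example, committing a large folder to a pull request could look like the sketch below. The repo id and folder path are placeholders, and it assumes the discussion object returned by [`create_pull_request`] exposes its git reference (e.g. `"refs/pr/42"`) as `git_reference`:

```py
>>> from huggingface_hub import HfApi
>>> api = HfApi()

# Create an (empty) pull request first...
>>> pr = api.create_pull_request(repo_id="HuggingFaceM4/Docmatix", repo_type="dataset", title="Add new shards")

# ...then commit the folder to the PR branch by passing its git reference as `revision`.
>>> api.upload_large_folder(
...     repo_id="HuggingFaceM4/Docmatix",
...     repo_type="dataset",
...     folder_path="/path/to/local/docmatix",
...     revision=pr.git_reference,
... )
```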

### Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data, having an upload or push fail at the end of the process, or running into a degraded experience, be it on hf.co or when working locally, can be very frustrating.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's go over some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to assume that something will fail at least once, whether it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you have already uploaded before uploading the next batch. An LFS file that has already been committed is guaranteed never to be re-uploaded, but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth (a Python sketch follows the tip below). To use `hf_transfer`:
1. Specify the `hf_transfer` extra when installing `huggingface_hub`
(i.e., `pip install huggingface_hub[hf_transfer]`).
2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

<Tip warning={true}>

`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>
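
As a minimal sketch, enabling `hf_transfer` from a Python script could look like the following. The repo id and folder path are placeholders; the environment variable is set before importing `huggingface_hub` so that it is picked up:

```py
import os

# Must be set before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(
    repo_id="username/my-large-dataset",  # placeholder
    repo_type="dataset",
    folder_path="/path/to/local/folder",  # placeholder
)
```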

## Advanced features

In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.
@@ -418,36 +492,6 @@
you don't store another reference to it. This is expected as we don't want to keep in memory the content that is
already uploaded. Finally we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.

## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always
best to consider that something will fail at least once -no matter if it's due to your machine, your connection, or our
servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you
already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never
be re-uploaded twice but checking it client-side can still save some time.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up
uploads on machines with very high bandwidth. To use `hf_transfer`:

1. Specify the `hf_transfer` extra when installing `huggingface_hub`
(e.g. `pip install huggingface_hub[hf_transfer]`).
2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

<Tip warning={true}>

`hf_transfer` is a power user tool!
It is tested and production-ready,
but it lacks user-friendly features like advanced error handling or proxies.
For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>

## (legacy) Upload files with Git LFS

All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.
2 changes: 2 additions & 0 deletions src/huggingface_hub/__init__.py
@@ -252,6 +252,7 @@
"update_webhook",
"upload_file",
"upload_folder",
"upload_large_folder",
"whoami",
],
"hf_file_system": [
@@ -756,6 +757,7 @@ def __dir__():
update_webhook, # noqa: F401
upload_file, # noqa: F401
upload_folder, # noqa: F401
upload_large_folder, # noqa: F401
whoami, # noqa: F401
)
from .hf_file_system import (
195 changes: 191 additions & 4 deletions src/huggingface_hub/_local_folder.py
@@ -34,7 +34,7 @@
└── [ 16] file.parquet
Metadata file structure:
Download metadata file structure:
```
# file.txt.metadata
11c5a3d5811f50298f278a704980280950aedb10
@@ -68,7 +68,7 @@ class LocalDownloadFilePaths:
"""
Paths to the files related to a download process in a local dir.
Returned by `get_local_download_paths`.
Returned by [`get_local_download_paths`].
Attributes:
file_path (`Path`):
@@ -88,6 +88,30 @@ def incomplete_path(self, etag: str) -> Path:
return self.metadata_path.with_suffix(f".{etag}.incomplete")


@dataclass(frozen=True)
class LocalUploadFilePaths:
"""
Paths to the files related to an upload process in a local dir.
Returned by [`get_local_upload_paths`].
Attributes:
path_in_repo (`str`):
Path of the file in the repo.
file_path (`Path`):
Path of the local file to be uploaded.
lock_path (`Path`):
Path to the lock file used to ensure atomicity when reading/writing metadata.
metadata_path (`Path`):
Path to the metadata file.
"""

path_in_repo: str
file_path: Path
lock_path: Path
metadata_path: Path


@dataclass
class LocalDownloadFileMetadata:
"""
@@ -111,6 +135,50 @@ class LocalDownloadFileMetadata:
timestamp: float


@dataclass
class LocalUploadFileMetadata:
"""
Metadata about a file in the local directory related to an upload process.
"""

size: int

# Default values correspond to "we don't know yet"
timestamp: Optional[float] = None
should_ignore: Optional[bool] = None
sha256: Optional[str] = None
upload_mode: Optional[str] = None
is_uploaded: bool = False
is_committed: bool = False

def save(self, paths: LocalUploadFilePaths) -> None:
"""Save the metadata to disk."""
with WeakFileLock(paths.lock_path):
with paths.metadata_path.open("w") as f:
new_timestamp = time.time()
f.write(str(new_timestamp) + "\n")

f.write(str(self.size)) # never None
f.write("\n")

if self.should_ignore is not None:
f.write(str(int(self.should_ignore)))
f.write("\n")

if self.sha256 is not None:
f.write(self.sha256)
f.write("\n")

if self.upload_mode is not None:
f.write(self.upload_mode)
f.write("\n")

f.write(str(int(self.is_uploaded)) + "\n")
f.write(str(int(self.is_committed)) + "\n")

self.timestamp = new_timestamp
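
# Illustrative example of the resulting metadata file once all fields are known,
# with hypothetical values (one field per line, in the order written by `save` above):
#
#   1719983329.3714333      <- timestamp
#   1024                    <- size (in bytes)
#   0                       <- should_ignore (0/1)
#   a16a55fda99d2f2e...     <- sha256
#   lfs                     <- upload_mode ("regular" or "lfs")
#   1                       <- is_uploaded (0/1)
#   0                       <- is_committed (0/1)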


@lru_cache(maxsize=128) # ensure singleton
def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
"""Compute paths to the files related to a download process.
@@ -152,6 +220,49 @@ def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
return LocalDownloadFilePaths(file_path=file_path, lock_path=lock_path, metadata_path=metadata_path)


@lru_cache(maxsize=128) # ensure singleton
def get_local_upload_paths(local_dir: Path, filename: str) -> LocalUploadFilePaths:
"""Compute paths to the files related to an upload process.
Folders containing the paths are all guaranteed to exist.
Args:
local_dir (`Path`):
Path to the local directory that is uploaded.
filename (`str`):
Path of the file in the repo.
Return:
[`LocalUploadFilePaths`]: the paths to the files (file_path, lock_path, metadata_path).
"""
# filename is the path in the Hub repository (separated by '/')
# make sure to have a cross platform transcription
sanitized_filename = os.path.join(*filename.split("/"))
if os.name == "nt":
if sanitized_filename.startswith("..\\") or "\\..\\" in sanitized_filename:
raise ValueError(
f"Invalid filename: cannot handle filename '{sanitized_filename}' on Windows. Please ask the repository"
" owner to rename this file."
)
file_path = local_dir / sanitized_filename
metadata_path = _huggingface_dir(local_dir) / "upload" / f"{sanitized_filename}.metadata"
lock_path = metadata_path.with_suffix(".lock")

# Some Windows versions do not allow for paths longer than 255 characters.
# In this case, we must specify it as an extended path by using the "\\?\" prefix
if os.name == "nt":
if not str(local_dir).startswith("\\\\?\\") and len(os.path.abspath(lock_path)) > 255:
file_path = Path("\\\\?\\" + os.path.abspath(file_path))
lock_path = Path("\\\\?\\" + os.path.abspath(lock_path))
metadata_path = Path("\\\\?\\" + os.path.abspath(metadata_path))

file_path.parent.mkdir(parents=True, exist_ok=True)
metadata_path.parent.mkdir(parents=True, exist_ok=True)
return LocalUploadFilePaths(
path_in_repo=filename, file_path=file_path, lock_path=lock_path, metadata_path=metadata_path
)
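
# Hypothetical usage sketch (assuming the metadata directory resolves to `.cache/huggingface/`):
#   paths = get_local_upload_paths(Path("/data/my-folder"), "train/shard-00000.parquet")
#   paths.file_path      -> /data/my-folder/train/shard-00000.parquet
#   paths.metadata_path  -> /data/my-folder/.cache/huggingface/upload/train/shard-00000.parquet.metadata
#   paths.lock_path      -> /data/my-folder/.cache/huggingface/upload/train/shard-00000.parquet.lock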


def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
"""Read metadata about a file in the local directory related to a download process.
@@ -165,8 +276,6 @@ def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
`[LocalDownloadFileMetadata]` or `None`: the metadata if it exists, `None` otherwise.
"""
paths = get_local_download_paths(local_dir, filename)
# file_path = local_file_path(local_dir, filename)
# lock_path, metadata_path = _download_metadata_file_path(local_dir, filename)
with WeakFileLock(paths.lock_path):
if paths.metadata_path.exists():
try:
@@ -204,6 +313,84 @@ def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
return None


def read_upload_metadata(local_dir: Path, filename: str) -> LocalUploadFileMetadata:
"""Read metadata about a file in the local directory related to an upload process.
TODO: factorize logic with `read_download_metadata`.
Args:
local_dir (`Path`):
Path to the local directory in which files are uploaded.
filename (`str`):
Path of the file in the repo.
Return:
[`LocalUploadFileMetadata`]: the metadata for the file. If no valid metadata exists yet, a fresh object containing only the file size is returned.
"""
paths = get_local_upload_paths(local_dir, filename)
with WeakFileLock(paths.lock_path):
if paths.metadata_path.exists():
try:
with paths.metadata_path.open() as f:
timestamp = float(f.readline().strip())

size = int(f.readline().strip()) # never None

_should_ignore = f.readline().strip()
should_ignore = None if _should_ignore == "" else bool(int(_should_ignore))

_sha256 = f.readline().strip()
sha256 = None if _sha256 == "" else _sha256

_upload_mode = f.readline().strip()
upload_mode = None if _upload_mode == "" else _upload_mode
if upload_mode not in (None, "regular", "lfs"):
raise ValueError(f"Invalid upload mode in metadata {paths.path_in_repo}: {upload_mode}")

is_uploaded = bool(int(f.readline().strip()))
is_committed = bool(int(f.readline().strip()))

metadata = LocalUploadFileMetadata(
timestamp=timestamp,
size=size,
should_ignore=should_ignore,
sha256=sha256,
upload_mode=upload_mode,
is_uploaded=is_uploaded,
is_committed=is_committed,
)
except Exception as e:
# remove the metadata file if it is corrupted / not the right format
logger.warning(
f"Invalid metadata file {paths.metadata_path}: {e}. Removing it from disk and continue."
)
try:
paths.metadata_path.unlink()
except Exception as e:
logger.warning(f"Could not remove corrupted metadata file {paths.metadata_path}: {e}")

# TODO: can we do better?
if (
metadata.timestamp is not None
and metadata.is_uploaded # file was uploaded
and not metadata.is_committed # but not committed
and time.time() - metadata.timestamp > 20 * 3600 # and it's been more than 20 hours
): # => we consider it as garbage-collected by S3
metadata.is_uploaded = False

# check if the file exists and hasn't been modified since the metadata was saved
try:
if metadata.timestamp is not None and paths.file_path.stat().st_mtime <= metadata.timestamp:
return metadata
logger.info(f"Ignored metadata for '{filename}' (outdated). Will re-compute hash.")
except FileNotFoundError:
# file does not exist => metadata is outdated
pass

# empty metadata => we don't know anything except its size
return LocalUploadFileMetadata(size=paths.file_path.stat().st_size)
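
# Hypothetical usage sketch: inspect the cached upload state of a file.
#   meta = read_upload_metadata(Path("/data/my-folder"), "train/shard-00000.parquet")
#   if meta.sha256 is None:        # not hashed yet
#       ...
#   elif not meta.is_uploaded:     # hashed but not yet pre-uploaded
#       ...
#   elif not meta.is_committed:    # pre-uploaded but not yet part of a commit
#       ...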


def write_download_metadata(local_dir: Path, filename: str, commit_hash: str, etag: str) -> None:
"""Write metadata about a file in the local directory related to a download process.