HuggingFace Integration #94

balisujohn · 2023-06-20T09:32:02Z

Description

Draft PR for Hugging Face Minari integration.

Adds functions to convert back and forth between MinariDataset and datasets.Dataset from Hugging Face datasets. It also adds functions that let the user push datasets to and pull datasets from the Hugging Face Hub. The core code is ready for review, but there are still a few more features I will add; they are listed under Additional Features below.

I also refactored the tests slightly, creating the new helper function create_dummy_dataset_with_collecter_env_helper to avoid code repetition.

Additional Features

CLI wrappers for push and pull dataset.
Wrappers to automatically convert Minari datasets when uploading and downloading.
Support for Text Spaces

Checklist:

I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
I have run pytest -v and no errors are present.
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

younik

I added a few comments on the functions, but overall I have some concerns about the API.

For the user, the current workflow is something like this:

import gymnasium as gym
import minari
from minari import DataCollectorV0
from minari.integrations.hugging_face import (
    convert_hugging_face_dataset_to_minari_dataset,
    convert_minari_dataset_to_hugging_face_dataset,
    pull_dataset_from_hugging_face,
    push_dataset_to_hugging_face,
)

env = DataCollectorV0(gym.make("EnvName"))
... # code that generates the dataset
dataset = minari.create_dataset_from_collector_env(...)

hf_dataset = convert_minari_dataset_to_hugging_face_dataset(dataset)
push_dataset_to_hugging_face(hf_dataset, "name/repo")

and then

hf_dataset = pull_dataset_from_hugging_face("name/repo")
dataset = convert_hugging_face_dataset_to_minari_dataset(hf_dataset)

The main red flag for me here is that we have public functions that are specific to Hugging Face. I think this should be (more) transparent to the user, i.e. the functions in minari/integrations/hugging_face.py should all be private.

I have a couple of alternatives in mind:

Add a HF flag to load and upload

dataset = minari.load_dataset("name/repo", hugging_face_hub=True)

This pulls the dataset and returns a MinariDataset that reads from the HF dataset using the HuggingFaceStorage that I suggested in a review comment.

Similarly for pushing:

minari.upload_dataset('dataset-name', hugging_face_hub=True)

Cons of this:

we still have flags specifically for HF.
load_dataset would load directly from the cloud even though we already have a download_dataset function. I imagine that in the future we will also want to stream directly from the cloud: should we drop download_dataset and download implicitly in load_dataset?

Use a setup_remote()

We discussed the possibility of setting up remotes other than our GCP bucket. This is a particular case of that. We can create the API for that and use it in this case:

minari.setup_remote(
      "https://huggingface.co/balisujohn",
      # others args like auth_key
)

After that, every load_dataset/upload_dataset call pulls from or pushes to the HF Hub directly, as before, and conversions are done under the hood.
Cons:

download_dataset still makes no sense
setup_remote should also work for the GCP bucket and for whatever other backends come later
It complicates the library code, as setup_remote changes the behavior of other functions
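To make the trade-off concrete, here is a rough sketch of how a setup_remote() could work. Everything in it is hypothetical: RemoteConfig, resolve_dataset_location, the backend names, and the placeholder bucket URL are illustration only, not existing Minari API. It also shows the last con in practice: a module-level setting silently changes what later calls do.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class RemoteConfig:
    """Hypothetical description of the remote that datasets are pulled from or pushed to."""

    url: str
    backend: str
    auth_key: Optional[str] = None


# Placeholder default; the real bucket URL is not given in this thread.
_active_remote = RemoteConfig(url="gs://example-minari-bucket", backend="gcp")


def _infer_backend(url: str) -> str:
    """Pick a backend name from the URL (assumed convention, not an existing rule)."""
    parsed = urlparse(url)
    if parsed.netloc == "huggingface.co":
        return "hugging_face"
    if parsed.scheme == "gs":
        return "gcp"
    raise ValueError(f"unsupported remote: {url}")


def setup_remote(url: str, auth_key: Optional[str] = None) -> RemoteConfig:
    """Replace the module-level remote; later load/upload calls would consult it."""
    global _active_remote
    _active_remote = RemoteConfig(url=url, backend=_infer_backend(url), auth_key=auth_key)
    return _active_remote


def resolve_dataset_location(dataset_id: str) -> str:
    """Where a load/upload call would look for dataset_id under the active remote."""
    return f"{_active_remote.url.rstrip('/')}/{dataset_id}"
```

With this shape, load_dataset and upload_dataset would dispatch on the active backend and run any conversion internally, so no Hugging Face specific function needs to be public.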

I lean toward the second version, but it also requires more work.

As HF uses Arrow, I am wondering: if we switch to Arrow as we discussed, will we natively support HF Dataset without needing any conversion? This is another reason to hide the conversion from the user.

younik · 2023-06-26T08:29:15Z

minari/integrations/hugging_face.py

+        assert False, f"error, invalid observation or action structure{data}"
+
+
+def convert_minari_dataset_to_hugging_face_dataset(dataset: MinariDataset):


missing return type

younik · 2023-06-26T08:40:53Z

minari/integrations/hugging_face.py

+from minari.serialization import deserialize_space, serialize_space
+
+
+def _reconstuct_obs_or_action_at_index_recursive(


we already have this function, consider a refactoring

younik · 2023-06-26T08:51:17Z

minari/integrations/hugging_face.py

+    """Converts a MinariDataset into a HuggingFace datasets dataset."""
+    episodes = [episode for episode in dataset.iterate_episodes()]
+    episodes_dict = {
+        "observations": [],
+        "actions": [],
+        "rewards": [],
+        "truncations": [],
+        "terminations": [],
+        "episode_ids": [],
+    }
+    for episode in episodes:
+        episodes_dict["observations"].extend(
+            [
+                _reconstuct_obs_or_action_at_index_recursive(episode.observations, i)
+                for i in range(episode.total_timesteps + 1)
+            ]
+        )
+        episodes_dict["actions"].extend(
+            [
+                _reconstuct_obs_or_action_at_index_recursive(episode.actions, i)
+                for i in range(episode.total_timesteps)
+            ]
+            + [
+                None,
+            ]
+        )
+        episodes_dict["rewards"].extend(
+            list(episode.rewards)
+            + [


This function can be extremely slow for big datasets.
Wouldn't it be better to use a generator and the from_generator() method?
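As a sketch of the generator idea: instead of building one large dict of lists, rows can be yielded one step at a time. The Episode dataclass below is a stand-in I made up so the snippet runs without Minari, and padding the final row of each episode with None in every non-observation column is an assumption (the quoted diff only shows that padding for actions).

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Episode:
    """Stand-in for a Minari episode (hypothetical, for illustration only)."""

    id: int
    observations: List[Any]  # total_timesteps + 1 entries
    actions: List[Any]  # total_timesteps entries
    rewards: List[float]
    terminations: List[bool]
    truncations: List[bool]


def iter_steps(episodes: Iterable[Episode]) -> Iterator[Dict[str, Any]]:
    """Yield one row per observation, so no full copy of the data is held in memory."""
    for episode in episodes:
        total_timesteps = len(episode.actions)
        for i in range(total_timesteps + 1):
            is_padding_row = i == total_timesteps  # final observation has no action
            yield {
                "episode_ids": episode.id,
                "observations": episode.observations[i],
                "actions": None if is_padding_row else episode.actions[i],
                "rewards": None if is_padding_row else episode.rewards[i],
                "terminations": None if is_padding_row else episode.terminations[i],
                "truncations": None if is_padding_row else episode.truncations[i],
            }


# With the `datasets` package installed, the generator could be passed on like this:
# hf_dataset = datasets.Dataset.from_generator(iter_steps, gen_kwargs={"episodes": episodes})
```

The conversion then only needs one episode in memory at a time, provided the episode iterable itself is lazy.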

younik · 2023-06-26T08:52:16Z

minari/integrations/hugging_face.py

+        )
+
+
+def convert_hugging_face_dataset_to_minari_dataset(dataset: Dataset):


missing return type

younik · 2023-06-26T08:52:55Z

minari/integrations/hugging_face.py

+        code_permalink="https://github.com/Farama-Foundation/Minari/blob/f095bfe07f8dc6642082599e07779ec1dd9b2667/tutorials/LocalStorage/local_storage.py",
+        author="WillDudley",
+        author_email="[email protected]",


should have meaningful values

Probably settable from function arguments?

younik · 2023-06-26T08:59:02Z

minari/integrations/hugging_face.py

+def convert_hugging_face_dataset_to_minari_dataset(dataset: Dataset):
+
+    description_data = json.loads(dataset.info.description)
+
+    action_space = deserialize_space(description_data["action_space"])
+    observation_space = deserialize_space(description_data["observation_space"])
+    env_name = description_data["env_name"]
+    dataset_id = description_data["dataset_id"]
+
+    episode_ids = dataset.unique("episode_ids")
+
+    buffer = []
+
+    for episode_id in episode_ids:


This function can be very slow.
I propose to instead read from a HuggingFace Dataset using a custom MinariStorage.

To do so, we need an abstract class MinariStorage that defines the public methods every backend must implement. The current MinariStorage is really an HDF5Storage implementing that interface. We can then add a HuggingFaceStorage that reads from an HF Dataset.
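Roughly what I have in mind, as a sketch only. The method names (get_episodes, total_episodes) and the row layout are assumptions for illustration, not the current MinariStorage API, and the HuggingFaceStorage here accepts any sequence of row dicts so that the example runs without the datasets package.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Sequence


class MinariStorage(ABC):
    """Hypothetical interface that every storage backend would implement."""

    @property
    @abstractmethod
    def total_episodes(self) -> int:
        """Number of episodes available in the backend."""

    @abstractmethod
    def get_episodes(self, episode_indices: Iterable[int]) -> List[Dict[str, Any]]:
        """Return the requested episodes as dicts of per-step lists."""


class HuggingFaceStorage(MinariStorage):
    """Reads step rows that carry an 'episode_ids' column (assumed layout)."""

    def __init__(self, rows: Sequence[Dict[str, Any]]):
        self._rows = rows
        # One pass to index row positions per episode, instead of filtering per episode.
        self._index: Dict[int, List[int]] = {}
        for position, row in enumerate(rows):
            self._index.setdefault(row["episode_ids"], []).append(position)

    @property
    def total_episodes(self) -> int:
        return len(self._index)

    def get_episodes(self, episode_indices: Iterable[int]) -> List[Dict[str, Any]]:
        episodes = []
        for episode_id in episode_indices:
            steps = [self._rows[position] for position in self._index[episode_id]]
            episodes.append(
                {
                    "id": episode_id,
                    "observations": [step["observations"] for step in steps],
                    # The last row of an episode only pads the final observation.
                    "actions": [step["actions"] for step in steps[:-1]],
                }
            )
        return episodes
```

An HDF5Storage would implement the same two members, and MinariDataset would only ever talk to the abstract interface.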

younik · 2023-06-26T09:04:15Z

tests/common.py

+        dataset_id=dataset_id,
+        collector_env=env,
+        algorithm_name="random_policy",
+        code_permalink="https://github.com/Farama-Foundation/Minari/blob/f095bfe07f8dc6642082599e07779ec1dd9b2667/tutorials/LocalStorage/local_storage.py",


more informative value; it can be the GitHub link to the common.py file

younik · 2023-06-26T09:04:56Z

tests/integrations/test_hugging_face.py

+@pytest.mark.skip(
+    reason="relies on a private repo, if you want to use this test locally, you'll need to change it to point at a repo you control"
+)


we should have this test on a public repo

RedTachyon

Note: I didn't do a comprehensive review of the code, just left a few comments for whatever stood out to me.

As for the higher-level design, I don't really mind having public functions in a separate minari.integrations.hugging_face namespace, if those are the ones that developers would use for pulling/uploading/converting.
I'd potentially consider making the conversion back and forth automatic inside of the push/pull functions, depending on whether or not it makes any sense to operate on "raw" HF datasets in the context of Minari.

@younik can you elaborate on your issue with those functions being public? Imo the namespace makes it explicit enough, but I might be missing something.

As for the two alternative proposals:

I'm not necessarily a fan of integrating it into the core load_dataset etc functions, we'd essentially tie core functionality to an external library and external servers. Keeping integrations separate (but accessible) is the right move imo
setup_remote sounds like a somewhat more ambitious plan for the future, like being generic between GCP/AWS/Azure/HF/whatever else, so my guess is that it's not a solution for right now?

RedTachyon · 2023-07-01T22:18:07Z

minari/integrations/hugging_face.py

+    elif isinstance(data, np.ndarray):
+        return data[index]
+    else:
+        assert False, f"error, invalid observation or action structure{data}"


Is there a reason to assert False instead of just raising an exception?
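For reference, assert statements are stripped when Python runs with -O, so the check would silently disappear there. A minimal sketch of the same recursion with an explicit exception; the function name is shortened and plain lists stand in for np.ndarray so the snippet has no dependencies, both of which are my simplifications rather than the PR's code.

```python
from typing import Any


def reconstruct_at_index(data: Any, index: int) -> Any:
    """Pick the entry at `index` out of every leaf of a nested dict/tuple structure."""
    if isinstance(data, dict):
        return {key: reconstruct_at_index(value, index) for key, value in data.items()}
    if isinstance(data, tuple):
        return tuple(reconstruct_at_index(value, index) for value in data)
    if isinstance(data, list):  # stands in for np.ndarray in this sketch
        return data[index]
    # Raised unconditionally, unlike `assert False`, which is removed under -O.
    raise TypeError(
        f"invalid observation or action structure: {type(data).__name__}"
    )
```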

RedTachyon · 2023-07-01T22:22:33Z

minari/integrations/hugging_face.py

+        code_permalink="https://github.com/Farama-Foundation/Minari/blob/f095bfe07f8dc6642082599e07779ec1dd9b2667/tutorials/LocalStorage/local_storage.py",
+        author="WillDudley",
+        author_email="[email protected]",


Probably settable from function arguments?

younik · 2023-07-04T17:05:44Z

@RedTachyon
One thing that is likely to happen is that we support the HF data format, at which point the conversion functions become no-ops. We have MinariStorage, which is designed to hide file-format differences from the dataset (see #94 (comment))

Also, if we want to support other libraries as well (e.g. RLDS), we need other conversion functions; I don't think this is interesting for the user.

And we may want to add some loading keywords to load_dataset and then they must be added also to convert_hugging_face_dataset_to_minari_dataset with the same semantics.

elliottower · 2023-08-28T15:44:26Z

pyproject.toml

@@ -28,6 +28,7 @@ dependencies = [
    "numpy >=1.21.0",
    "h5py>=3.8.0",
    "tqdm>=4.65.0",
+    "datasets>=2.13.0",


Should this be an optional requirement? Like install minari[huggingface]
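If it does become an extra, the integration module would need to fail clearly when the package is missing. A small sketch of one way to do that; require_optional is a made-up helper name, and the minari[huggingface] extra is the suggestion above, not something that exists yet.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise with an install hint if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install minari[{extra}]"
        ) from err


# Hypothetical use at the top of minari/integrations/hugging_face.py:
# datasets = require_optional("datasets", "huggingface")
```

That keeps the core install free of the datasets dependency while still giving users an actionable error.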

rodrigodelazcano and others added 30 commits May 24, 2023 09:41

unflatten StepDataCallback

aa45e15

datacollector update

0f3b5c2

remove flatten metadata

5cd6421

add fix removed with rebase

91c3b99

fixed tests, added draft for testing saving unflattened dict spaces

1fcd31a

added test and tentative support for dict valued action spaces

01929a0

Merge remote-tracking branch 'origin' into unflatten-spaces

d39bd01

fixed registartion path for test dict env

22dde61

added experimental support for unflattened tuple action spaces, as well as a test for tuple action spaces and a combo env with nested dict and tuple action spaces

da9f474

dummy env registration entrypoint change, hopefully fixes online tests

1df1304

added tuple space reconstruction support and tests for it for unflattened spaces

949c5de

added support experimental support for non-flattened actions and observations for create_dataset_from_buffers, this may be inefficient and need refactoring

4c0733a

added tests for unflattened spaces with nested discrete spaces

f553e73

added action and observation space serialization and deserialization, observation and action space of data now saved in dataset.

8ae05b0

Merge branch 'main' of github.com:Farama-Foundation/Minari into unflatten-spaces

7843af1

fixed bug where total_timesteps observations instead of total_timestepss+1 observations were being loaded when calling get_episodes

76554ad

changes to address review

1ada485

updated doc

7cf0928

added more detailed description of new data format in doc

596c3a7

test to show proof of concept subset altered observation and action spaces in dataset

d9543e5

small fixes

affdce0

removed env space serialization (only dataset spaces are serialized)

36e0c24

added a line to test for last entry of episode buffer being an empty dict in datacollector after termination or truncation

5858749

some changes to addresss review

347e52a

change to data_collector space initialization to address review

ccd694e

added a test for nested space subset when collecting episodes using a buffer

8ff86b9

changes to address review

6908ea1

removed Tuple annotations when saving a Minari dataset to HDF5

e3698c1

Merge branch 'main' into unflatten-spaces

899156d

refactored tests to reduce helper function duplication

6d74e63

balisujohn added 7 commits June 13, 2023 19:35

added note about space serialization to doc, changed test shared dependencies file name to common.py, removed depdency duplication in serialization.py, added a dataset integrity check to test_download_dataset_from_farama_server

b04eaa6

added TypeError when attempting to serialize unsupported Spaces, and corresponding test

77e8f13

Merge branch 'main' into unflatten-spaces

b1ea332

partial draft of huggingface integration

f0913fd

attempt to accomodating changes to space encoding

9ab218c

Merge branch 'main' into dev-huggingface

55ec97f

more work on hugging face draft

84d6617

balisujohn marked this pull request as draft June 20, 2023 09:32

balisujohn added 5 commits June 20, 2023 05:34

draft implementation of hugging face integration. Description still missing from pushed datasets

f6c3aea

correction

8e02372

added temporary workaround for metadata uploading

308a8c0

removed print statements

4ba97ce

removed more print statements

14a7939

balisujohn requested a review from younik June 24, 2023 00:53

younik requested changes Jun 26, 2023

RedTachyon reviewed Jul 1, 2023

elliottower reviewed Aug 28, 2023
