
training module #25

Open: saanikat wants to merge 17 commits into master
Conversation

saanikat (Member)

Issue #23 solved.

@saanikat changed the title from "readme updated" to "training module" on Oct 2, 2024

saanikat commented Oct 2, 2024

Added TrainStandardizer for training custom models by the user.
The present README.md doesn't have the details; documentation will be added to the bedbase docs.
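As a rough sketch of how it might be used (the import path and constructor arguments here are placeholders until the docs land), based on the snippets in the updated README:

```python
from bedms.train import TrainStandardizer  # import path assumed

# hypothetical config describing the training data and hyperparameters
trainer = TrainStandardizer("training_config.yaml")

trainer.load_encode_data()  # load the datasets and encode them
trainer.training()          # train the custom model
trainer.testing()           # evaluate the trained model
```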


saanikat commented Oct 7, 2024

Separation of schemas from BEDMS

Available schemas for standardization have been moved to the HuggingFace repository: https://huggingface.co/databio/attribute-standardizer-model6
This solves the issue of having to update BEDMS each time we add a new schema.
README.md has been updated with the new function calls.
Earlier, we would instantiate it like this:

```python
from bedms import AttrStandardizer

model = AttrStandardizer("ENCODE")
```

BEDMS had mapped the schema name ENCODE to the model and its associated configuration and files. Similarly, BEDBASE and FAIRTRACKS were associated with their respective files. However, this would've required us to update the package each time we added a new schema.
Now, we provide the schema model and its associated configuration to BEDMS via HuggingFace. In the HuggingFace repository, each schema has its own directory, and each time a schema is added, a new schema directory is added to HuggingFace (details on adding a new schema are provided there). The instantiation now looks like this:

```python
from bedms import AttrStandardizer

model = AttrStandardizer(
    repo_id="databio/attribute-standardizer-model6", model_name="encode"
)
```

This also makes it easier for users to provide their chosen schemas (as long as they have models in their HuggingFace repository).
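For example, a user could point BEDMS at their own repository (the repo and schema names below are made up):

```python
from bedms import AttrStandardizer

# hypothetical user-owned HuggingFace repository and schema model
model = AttrStandardizer(
    repo_id="my-username/my-attribute-standardizer",
    model_name="my_custom_schema",
)
results = model.standardize(pep="geo/gse228634:default")
```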


sanghoonio commented Oct 7, 2024

Is it worth updating

```python
from bedms.const import AVAILABLE_SCHEMAS
```

to return a dictionary that includes the repo_id value for the 3 schemas we provide?

Or should we just hardcode the 3 repo IDs from PEPhub?
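Roughly something like this, maybe (the exact shape is just a suggestion):

```python
# bedms/const.py -- hypothetical dictionary form of AVAILABLE_SCHEMAS
AVAILABLE_SCHEMAS = {
    "encode": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "encode",
    },
    "bedbase": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "bedbase",
    },
    "fairtracks": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "fairtracks",
    },
}
```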


nleroy917 (Member) left a comment


I think this is good for the most part. Small comments everywhere. There was an overarching theme I just wanted to bring up... curious what others think:

Using classes as state-storage

I brought this up in a recent review for @ClaudeHu. I'm noticing a pattern where people define and call functions on classes that set state inside that class. So, for example, in the AttributeStandardizerTrainer, we call load_encode_data...

This doesn't return anything; instead it sets state on the trainer itself. In essence, it does this:

```python
def load_encode_data(self):
    self.encode_data = ...
```

Instead of:

```python
def load_encode_data(self):
    encode_data = ...
    return encode_data
```

I don't know if one is better than the other, but the former feels like an anti-pattern to me for some reason. I suppose my argument for the second version is that it gives control of the data to the user and reduces ambiguity about what's going on; it hides less functionality.

Last comments

Lastly, is it worth considering Lightning? I know the gains are smaller when there's no deep learning or multi-GPU training going on, but it really does clean up a lot of stuff and handle boring nuances that don't matter. It's a nice paradigm.

Moreover, if possible, we could integrate ClearML logging so the user can view the training in real-time instead of waiting for the plot to be saved to disk only to realize their model didn't learn anything...

Again, this is only worth it if training times are long; if not, we can forget it.
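For the ClearML idea, I'm picturing something roughly like this inside the training loop (names and values here are placeholders, not working trainer code):

```python
from clearml import Task

# hypothetical: set up a ClearML task so metrics stream to the web UI in real time
task = Task.init(project_name="bedms", task_name="attr-standardizer-training")
logger = task.get_logger()

num_epochs = 20  # placeholder
for epoch in range(num_epochs):
    # in the real trainer these would come from the train/validation steps
    train_loss, val_loss = 0.0, 0.0
    logger.report_scalar("loss", "train", value=train_loss, iteration=epoch)
    logger.report_scalar("loss", "val", value=val_loss, iteration=epoch)
```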

README.md Outdated
To train the custom model:

```python
trainer.training()
```

Should this be `.train()`?

README.md Outdated
To load the datasets and encode them:

```python
trainer.load_encode_data()
```

Does this return anything? I'd expect this to return something.

README.md Outdated

To see the available schemas, you can run:
```python
trainer.testing()
```

Again, here I like `.test()` over `.testing()`.

README.md Outdated
)
results = model.standardize(pep="geo/gse228634:default")

assert results
Instead of `assert`, could we show how they might iterate over results? Then maybe include comments on what the expected output of a print might be? Something like:

```python
for result in results:
    print(result)  # {'attr': 'genome', 'score': 0.8 }
```

bedms/train.py Outdated
Comment on lines 44 to 60
self.label_encoder = None
self.vectorizer = None
self.train_loader = None
self.val_loader = None
self.test_loader = None
self.output_size = None
self.criterion = None
self.train_accuracies = None
self.val_accuracies = None
self.train_losses = None
self.val_losses = None
self.model = None
self.fpr = None
self.tpr = None
self.roc_auc = None
self.all_labels = None
self.all_preds = None

If possible, could we type these?
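For example, with Optional annotations; the concrete types are my guesses from the attribute names, so treat this as a sketch (the remaining attributes could be typed the same way):

```python
from typing import List, Optional

from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.utils.data import DataLoader


class AttrStandardizerTrainer:  # class name assumed to match the one under review
    def __init__(self) -> None:
        self.label_encoder: Optional[LabelEncoder] = None
        self.train_loader: Optional[DataLoader] = None
        self.val_loader: Optional[DataLoader] = None
        self.test_loader: Optional[DataLoader] = None
        self.output_size: Optional[int] = None
        self.criterion: Optional[nn.Module] = None
        self.train_accuracies: Optional[List[float]] = None
        self.val_accuracies: Optional[List[float]] = None
        self.train_losses: Optional[List[float]] = None
        self.val_losses: Optional[List[float]] = None
        self.model: Optional[nn.Module] = None
```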

return glob(os.path.join(dir, "*.csv"))


def load_and_preprocess(file_path: str) -> pd.DataFrame:

More descriptive name? What are we loading and preprocessing?

)


def load_from_dir(dir: str) -> List[str]:

Same here, can this be a bit more descriptive?

:return torch.Tensor: A tensor representing the
average of embeddings in the most common cluster.
"""
flattened_embeddings = [embedding.tolist() for embedding in embeddings]

I don't think this is flattened; it's just converting the list of np.arrays to a list of lists, right?
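A tiny standalone illustration of the difference (not the PR code):

```python
import numpy as np

embeddings = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

as_lists = [e.tolist() for e in embeddings]      # [[1.0, 2.0], [3.0, 4.0]] -- structure preserved
flattened = np.concatenate(embeddings).tolist()  # [1.0, 2.0, 3.0, 4.0] -- actually flat
```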

y_tensor,
)
# Create DataLoader
return DataLoader(dataset, batch_size=batch_size, shuffle=True)

Specify num_workers?
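e.g. something like this; the worker count is arbitrary and could be exposed as a parameter:

```python
# suggested tweak to the line above (num_workers=4 is just an example value)
return DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
```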

return all_preds, all_labels


def plot_learning_curve(

What do we think about returning the plots to the user?
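e.g. return the Figure so the caller can display or tweak it; a minimal sketch, not the actual signature:

```python
from typing import List

import matplotlib.pyplot as plt
from matplotlib.figure import Figure


def plot_learning_curve(train_losses: List[float], val_losses: List[float]) -> Figure:
    """Hypothetical variant that returns the Figure instead of only saving it."""
    fig, ax = plt.subplots()
    ax.plot(train_losses, label="train")
    ax.plot(val_losses, label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
    return fig
```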
