
training module #25

Open: saanikat wants to merge 17 commits into master
Conversation

saanikat (Member)

Issue #23 solved.

@saanikat changed the title from "readme updated" to "training module" on Oct 2, 2024

saanikat commented Oct 2, 2024

Added TrainStandardizer for training custom models by the user.
The present README.md doesn't have the details; documentation will be added to the bedbase docs.
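As a rough sketch of how it might be used (the import path and constructor arguments here are placeholders until the docs land), based on the snippets in the updated README:

```python
from bedms.train import TrainStandardizer  # import path assumed

# hypothetical config describing the training data and hyperparameters
trainer = TrainStandardizer("training_config.yaml")

trainer.load_encode_data()  # load the datasets and encode them
trainer.training()          # train the custom model
trainer.testing()           # evaluate the trained model
```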


saanikat commented Oct 7, 2024

Separation of schemas from BEDMS

Available schemas for standardization have been moved to the HuggingFace repository: https://huggingface.co/databio/attribute-standardizer-model6
This solves the issue of having to update BEDMS each time we add a new schema.
README.md has been updated with the new function calls.
Earlier, we would instantiate it like this:

```python
from bedms import AttrStandardizer

model = AttrStandardizer("ENCODE")
```

BEDMS had mapped the schema name ENCODE to the model and its associated configuration and files. Similarly, BEDBASE and FAIRTRACKS were associated with their respective files. However, this would've required us to update the package each time we added a new schema.
Now, we provide the schema model and its associated configuration to BEDMS via HuggingFace. In the HuggingFace repository, each schema has its own directory, and each time a schema is added, a new schema directory is added to HuggingFace (details on adding a new schema are provided there). The instantiation now looks like this:

```python
from bedms import AttrStandardizer

model = AttrStandardizer(
    repo_id="databio/attribute-standardizer-model6", model_name="encode"
)
```

This also makes it easier for users to provide their chosen schemas (as long as they have models in their HuggingFace repository).
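For example, a user could point BEDMS at their own repository (the repo and schema names below are made up):

```python
from bedms import AttrStandardizer

# hypothetical user-owned HuggingFace repository and schema model
model = AttrStandardizer(
    repo_id="my-username/my-attribute-standardizer",
    model_name="my_custom_schema",
)
results = model.standardize(pep="geo/gse228634:default")
```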


sanghoonio commented Oct 7, 2024

Is it worth updating

```python
from bedms.const import AVAILABLE_SCHEMAS
```

to return a dictionary that includes the repo_id value for the 3 schemas we provide?

Or should we just hardcode the 3 repo IDs from PEPhub?
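Roughly something like this, maybe (the exact shape is just a suggestion):

```python
# bedms/const.py -- hypothetical dictionary form of AVAILABLE_SCHEMAS
AVAILABLE_SCHEMAS = {
    "encode": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "encode",
    },
    "bedbase": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "bedbase",
    },
    "fairtracks": {
        "repo_id": "databio/attribute-standardizer-model6",
        "model_name": "fairtracks",
    },
}
```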


nleroy917 (Member) left a comment


I think this is good for the most part. Small comments everywhere. There was an overarching theme I just wanted to bring up... curious what others think:

Using classes as state-storage

I brought this up in a recent review for @ClaudeHu. I'm noticing a pattern where people define and call functions on classes that set state inside that class. So, for example, in the AttributeStandardizerTrainer, we call load_encode_data...

This doesn't return anything; instead it sets state on the trainer itself. In essence, it does this:

```python
def load_encode_data(self):
    self.encode_data = ...
```

Instead of:

```python
def load_encode_data(self):
    encode_data = ...
    return encode_data
```

I don't know if one is better than the other, but the former feels like an anti-pattern to me for some reason. I suppose my argument for the second version is that it gives control of the data to the user and reduces ambiguity about what's going on; it hides less functionality.

Last comments

Lastly, is it worth considering Lightning? I know the gains are smaller when there's no deep learning or multi-GPU training going on, but it really does clean up a lot of stuff and handle boring nuances that don't matter. It's a nice paradigm.

Moreover, if possible, we could integrate ClearML logging so the user can view the training in real-time instead of waiting for the plot to be saved to disk only to realize their model didn't learn anything...

Again, this is only worth it if training times are long; if not, we can forget it.
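For the ClearML idea, I'm picturing something roughly like this inside the training loop (names and values here are placeholders, not working trainer code):

```python
from clearml import Task

# hypothetical: set up a ClearML task so metrics stream to the web UI in real time
task = Task.init(project_name="bedms", task_name="attr-standardizer-training")
logger = task.get_logger()

num_epochs = 20  # placeholder
for epoch in range(num_epochs):
    # in the real trainer these would come from the train/validation steps
    train_loss, val_loss = 0.0, 0.0
    logger.report_scalar("loss", "train", value=train_loss, iteration=epoch)
    logger.report_scalar("loss", "val", value=val_loss, iteration=epoch)
```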

README.md Outdated
To train the custom model:

```python
trainer.training()
```

Should this be `.train()`?

README.md Outdated
To load the datasets and encode them:

```python
trainer.load_encode_data()
```

Does this return anything? I'd expect this to return something.

README.md Outdated

To see the available schemas, you can run:
```python
trainer.testing()
```

Again, here I like `.test()` over `.testing()`.

README.md Outdated
)
results = model.standardize(pep="geo/gse228634:default")

assert results
Instead of `assert`, could we show how they might iterate over results? Then maybe include comments on what the expected output of a print might be? Something like:

```python
for result in results:
    print(result)  # {'attr': 'genome', 'score': 0.8 }
```

bedms/train.py Outdated
Comment on lines 44 to 60
self.label_encoder = None
self.vectorizer = None
self.train_loader = None
self.val_loader = None
self.test_loader = None
self.output_size = None
self.criterion = None
self.train_accuracies = None
self.val_accuracies = None
self.train_losses = None
self.val_losses = None
self.model = None
self.fpr = None
self.tpr = None
self.roc_auc = None
self.all_labels = None
self.all_preds = None

If possible, could we type these?
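For example, with Optional annotations; the concrete types are my guesses from the attribute names, so treat this as a sketch (the remaining attributes could be typed the same way):

```python
from typing import List, Optional

from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.utils.data import DataLoader


class AttrStandardizerTrainer:  # class name assumed to match the one under review
    def __init__(self) -> None:
        self.label_encoder: Optional[LabelEncoder] = None
        self.train_loader: Optional[DataLoader] = None
        self.val_loader: Optional[DataLoader] = None
        self.test_loader: Optional[DataLoader] = None
        self.output_size: Optional[int] = None
        self.criterion: Optional[nn.Module] = None
        self.train_accuracies: Optional[List[float]] = None
        self.val_accuracies: Optional[List[float]] = None
        self.train_losses: Optional[List[float]] = None
        self.val_losses: Optional[List[float]] = None
        self.model: Optional[nn.Module] = None
```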

return glob(os.path.join(dir, "*.csv"))


def load_and_preprocess(file_path: str) -> pd.DataFrame:

More descriptive name? What are we loading and preprocessing?

)


def load_from_dir(dir: str) -> List[str]:

Same here, can this be a bit more descriptive?

:return torch.Tensor: A tensor representing the
average of embeddings in the most common cluster.
"""
flattened_embeddings = [embedding.tolist() for embedding in embeddings]

I don't think this is flattened; it's just converting the list of np.arrays to a list of lists, right?
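A tiny standalone illustration of the difference (not the PR code):

```python
import numpy as np

embeddings = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

as_lists = [e.tolist() for e in embeddings]      # [[1.0, 2.0], [3.0, 4.0]] -- structure preserved
flattened = np.concatenate(embeddings).tolist()  # [1.0, 2.0, 3.0, 4.0] -- actually flat
```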

y_tensor,
)
# Create DataLoader
return DataLoader(dataset, batch_size=batch_size, shuffle=True)

Specify num_workers?
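e.g. something like this; the worker count is arbitrary and could be exposed as a parameter:

```python
# suggested tweak to the line above (num_workers=4 is just an example value)
return DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
```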

return all_preds, all_labels


def plot_learning_curve(

What do we think about returning the plots to the user?
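e.g. return the Figure so the caller can display or tweak it; a minimal sketch, not the actual signature:

```python
from typing import List

import matplotlib.pyplot as plt
from matplotlib.figure import Figure


def plot_learning_curve(train_losses: List[float], val_losses: List[float]) -> Figure:
    """Hypothetical variant that returns the Figure instead of only saving it."""
    fig, ax = plt.subplots()
    ax.plot(train_losses, label="train")
    ax.plot(val_losses, label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
    return fig
```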
