
Clarification on end_to_end vs trainer.train_embeddings #529

Open
naddeoa opened this issue May 29, 2024 · 2 comments
Comments

@naddeoa

naddeoa commented May 29, 2024

I'm experimenting with a simple model right now, and I'm confused about whether I should expect the sentence transformer model to change during the training process.

    # Define the model and training arguments
    model = SetFitModel.from_pretrained(
        "sentence-transformers/all-MiniLM-L6-v2",
        multi_target_strategy="one-vs-rest",
        use_differentiable_head=True,
        head_params={"out_features": len(labels)},
        labels=labels,
    )

    args = TrainingArguments(
        batch_size=128,
        # end_to_end=False,
        # body_learning_rate=10.0,
        num_epochs=4,
        evaluation_strategy="no",
        save_strategy="no",
        load_best_model_at_end=True,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        metric="accuracy",
        column_mapping={
            "text": "text",
            "label": "label",
        },  # Map dataset columns to text/label expected by trainer
    )

The documentation for end_to_end implies that the underlying model only changes when this argument is set, but experimentally that isn't true. The underlying sentence transformer (the "body", as I understand it) always seems to be trained by the train() logic, which is hard-coded to call train_embeddings(). I confirmed that the body changed by comparing the output scores of my model, as well as the embeddings generated by the base sentence transformer model versus the one set as my model's body after training.

Did I misunderstand the docs? The only way I can get this to not happen is to comment out the train_embeddings() call in the setfit library's train() here

@naddeoa
Author

naddeoa commented May 29, 2024

The underlying motivation for me: I'm interested in passing in precomputed sentence transformer embeddings instead of having my SetFit model compute them internally, because I already compute them in another part of my system and don't want to spend time recomputing them. That only makes sense if I can reasonably expect the embeddings not to have changed; otherwise they're effectively two different embedding models anyway.

@binarymax

binarymax commented Jul 17, 2024

I haven't looked at the code, but having read the literature, SetFit works well because the underlying vector space is re-aligned to the classification task. So while you could probably get it to do what you want (just use a separate logistic regression model), my intuition says you will lose accuracy.

See here for an overview: https://huggingface.co/docs/setfit/en/conceptual_guides/setfit#embedding-finetuning-phase
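The "separate logistic regression" route mentioned above could look something like this minimal sketch, assuming you already have your precomputed embeddings and labels as arrays (random stand-ins are used here so the snippet is self-contained; with all-MiniLM-L6-v2 the embeddings would be 384-dimensional):

```python
# Sketch: skip SetFit's embedding fine-tuning entirely and fit a plain
# scikit-learn classifier on precomputed embeddings. Note this forgoes the
# embedding-alignment phase that SetFit's accuracy depends on.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # stand-in for precomputed MiniLM embeddings
labels = rng.integers(0, 2, size=100)     # stand-in binary targets

clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
preds = clf.predict(embeddings)
print("train accuracy:", (preds == labels).mean())
```

Because the classifier head is decoupled from the encoder here, the embeddings are guaranteed not to drift, which is exactly the property asked about above, at the likely cost of some accuracy.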
