
d_model parameter #39

Open
williamstark01 opened this issue Jun 30, 2022 · 8 comments

Comments

@williamstark01
Collaborator

I think the d_model parameter (the embedding dimension) should take a significantly larger value than the one currently used. It is usually a multiple of num_heads, which is commonly set to 8, so maybe an initial value of 32 would make sense here? Or is there a specific reason for using a smaller value?
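
For illustration, a minimal sketch of the divisibility constraint (assuming PyTorch's nn.MultiheadAttention; the concrete values are just placeholders):

    import torch
    import torch.nn as nn

    # embed_dim (d_model) must be divisible by num_heads,
    # since each head works on embed_dim // num_heads dimensions
    attention = nn.MultiheadAttention(embed_dim=32, num_heads=8)

    x = torch.rand(2000, 1, 32)  # (sequence_length, batch_size, embed_dim)
    output, attention_weights = attention(x, x, x)
    print(output.shape)  # torch.Size([2000, 1, 32])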

@williamstark01
Collaborator Author

This will probably help tackle overfitting (#34)

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Cool, I will try it. Could I run it on our cluster now? :D

@williamstark01
Collaborator Author

Yes, of course. After you update the dependencies you should be able to submit a training job. I'll add some more details on Slack.

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Hi @williamstark01, the reason I set d_model = 6 is that we use one-hot encoding for the DNA sequences, so every sequence has the shape [2000, 6]. Is there a way to change this to a larger value than the one currently used?
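
Roughly, the current encoding looks like this (the 6-symbol alphabet below is just a placeholder to match the shape; the real mapping is in the repository):

    import torch
    import torch.nn.functional as F

    # placeholder 6-symbol alphabet; the actual mapping in the project may differ
    alphabet = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "<pad>": 5}

    sequence = "ACGTN" * 400  # toy sequence of length 2000
    indices = torch.tensor([alphabet[base] for base in sequence])

    one_hot = F.one_hot(indices, num_classes=len(alphabet)).float()
    print(one_hot.shape)  # torch.Size([2000, 6])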

@williamstark01
Collaborator Author

That's a good question. With label encoding, each single value is converted to a tensor of shape (1, embed_dim), i.e. simply a vector of length embed_dim (which is a better name for d_model). I haven't used one-hot encoded DNA sequences with transformers before, and I'm not sure how they would be converted to embeddings. Maybe label encoding is the best option, but it's probably worth researching a bit to see whether similar projects use a different approach with one-hot encoding.
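
A minimal sketch of what I mean by label encoding plus embeddings, assuming an nn.Embedding layer (embed_dim = 32 is just a placeholder):

    import torch
    import torch.nn as nn

    num_tokens = 6   # vocabulary size, matching the one-hot alphabet (assumption)
    embed_dim = 32   # embed_dim, i.e. d_model

    embedding = nn.Embedding(num_tokens, embed_dim)

    # label-encoded sequence: each base is a single integer index
    sequence = torch.randint(0, num_tokens, (2000,))
    embedded = embedding(sequence)
    print(embedded.shape)  # torch.Size([2000, 32])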

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Currently, I just pass the one-hot encoded sequence directly to the transformer model:
https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/9a5b7bb21555ae07cbb6e267d3f3c3ba1f6c98da/transformer.py#L64
and I found that token_embedding was used in your previous project:

def forward(self, x):
    # generate token embeddings
    token_embeddings = self.token_embedding(x)

Should I add this token_embedding to our project?
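
If so, here is a rough sketch of how it could look (the class name and default values are hypothetical, not our actual transformer.py code):

    import torch
    import torch.nn as nn

    class SequenceEmbedder(nn.Module):
        """Hypothetical module: label-encoded bases -> learnable embeddings."""

        def __init__(self, num_tokens: int = 6, embed_dim: int = 32):
            super().__init__()
            self.token_embedding = nn.Embedding(num_tokens, embed_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch_size, sequence_length) integer indices
            # returns: (batch_size, sequence_length, embed_dim)
            return self.token_embedding(x)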

@williamstark01
Collaborator Author

I thought some more about this, and I'm not sure there is a single correct answer to how we should process the base characters.

Using the bases as tokens and feeding their one-hot encodings directly may work, but we would lose the learnable embeddings, which could map the bases to a higher-dimensional space that represents meaningful features. Then again, since we have so few distinct tokens, this may not be consequential.
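
As a sketch of the trade-off (the dimensions below are placeholders): a bias-free linear projection of the one-hot vectors is equivalent to an embedding lookup, so the two views are closely related:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_tokens, embed_dim = 6, 32  # placeholder dimensions

    indices = torch.randint(0, num_tokens, (2000,))
    one_hot = F.one_hot(indices, num_classes=num_tokens).float()

    # feeding the one-hot vectors directly keeps the feature dimension at 6;
    # a learnable projection maps them to a higher-dimensional space
    projection = nn.Linear(num_tokens, embed_dim, bias=False)
    projected = projection(one_hot)  # shape (2000, 32)

    # a bias-free linear layer on one-hot inputs is the same as an embedding lookup
    embedding = nn.Embedding(num_tokens, embed_dim)
    with torch.no_grad():
        embedding.weight.copy_(projection.weight.T)
    assert torch.allclose(projected, embedding(indices))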

For generating embeddings, another option is to use n-grams (k-mers) as tokens instead of single bases.
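
For illustration, a rough sketch of overlapping k-mer tokenization (k = 3 is just a placeholder; DNABERT, linked below, uses a scheme like this):

    from itertools import product

    k = 3  # placeholder k-mer size

    # vocabulary of all possible k-mers over the plain DNA alphabet
    kmer_vocabulary = {"".join(kmer): index
                       for index, kmer in enumerate(product("ACGT", repeat=k))}

    sequence = "ACGTACGTACGT"
    # overlapping k-mers: one token per position
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    indices = [kmer_vocabulary[kmer] for kmer in kmers]
    print(kmers[:4], indices[:4])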

Maybe we can continue using one-hot encodings without embeddings for now, but at some point it's probably worth taking a look at similar projects to get additional insights on this:

https://github.com/jerryji1993/DNABERT
https://github.com/jdcla/DNA-transformer
https://github.com/lucidrains/enformer-pytorch
https://github.com/lucidrains/tf-bind-transformer
https://github.com/Rvbens/non-coding-DNA-classifier

@yangtcai
Collaborator

yangtcai commented Jul 2, 2022

OK, I will check them out.
