
d_model parameter #39

Open
williamstark01 opened this issue Jun 30, 2022 · 8 comments

Comments

@williamstark01
Collaborator

I think the d_model parameter (the embedding dimension) should take a significantly larger value than the one currently used. It is usually a multiple of num_heads, which is commonly set to 8, so maybe an initial value of 32 would make sense here? Or is there a specific reason for using a smaller value?
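
For illustration, a minimal sketch of the divisibility constraint (assuming PyTorch's nn.MultiheadAttention; the concrete values are just placeholders):

    import torch
    import torch.nn as nn

    # embed_dim (d_model) must be divisible by num_heads,
    # since each head works on embed_dim // num_heads dimensions
    attention = nn.MultiheadAttention(embed_dim=32, num_heads=8)

    x = torch.rand(2000, 1, 32)  # (sequence_length, batch_size, embed_dim)
    output, attention_weights = attention(x, x, x)
    print(output.shape)  # torch.Size([2000, 1, 32])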

@williamstark01
Collaborator Author

This will probably help tackle overfitting (#34)

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Cool, I will try it. Could I run it on our cluster now? :D

@williamstark01
Collaborator Author

Yes, of course. After you update the dependencies you should be able to submit a training job. I'll add some more details on Slack.

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Hi @williamstark01, the reason I set d_model = 6 is that we use one-hot encoding for the DNA sequences, so every sequence has the shape [2000, 6]. Is there a way to change this to a larger value than the one currently used?
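
Roughly, the current encoding looks like this (the 6-symbol alphabet below is just a placeholder to match the shape; the real mapping is in the repository):

    import torch
    import torch.nn.functional as F

    # placeholder 6-symbol alphabet; the actual mapping in the project may differ
    alphabet = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "<pad>": 5}

    sequence = "ACGTN" * 400  # toy sequence of length 2000
    indices = torch.tensor([alphabet[base] for base in sequence])

    one_hot = F.one_hot(indices, num_classes=len(alphabet)).float()
    print(one_hot.shape)  # torch.Size([2000, 6])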

@williamstark01
Collaborator Author

That's a good question. With label encoding, each single value is converted to a tensor of shape (1, embed_dim), i.e. simply a vector of length embed_dim (which is a better name for d_model). I haven't used one-hot encoded DNA sequences with transformers before, and I'm not sure how they would be converted to embeddings. Maybe label encoding is the best option, but it's probably worth researching a bit to see whether similar projects use a different approach with one-hot encoding.
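
A minimal sketch of what I mean by label encoding plus embeddings, assuming an nn.Embedding layer (embed_dim = 32 is just a placeholder):

    import torch
    import torch.nn as nn

    num_tokens = 6   # vocabulary size, matching the one-hot alphabet (assumption)
    embed_dim = 32   # embed_dim, i.e. d_model

    embedding = nn.Embedding(num_tokens, embed_dim)

    # label-encoded sequence: each base is a single integer index
    sequence = torch.randint(0, num_tokens, (2000,))
    embedded = embedding(sequence)
    print(embedded.shape)  # torch.Size([2000, 32])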

@yangtcai
Collaborator

yangtcai commented Jul 1, 2022

Currently, I just pass the one-hot encoded sequence directly to the transformer model:
https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/9a5b7bb21555ae07cbb6e267d3f3c3ba1f6c98da/transformer.py#L64
and I found that token_embedding was used in your previous project:

def forward(self, x):
    # generate token embeddings
    token_embeddings = self.token_embedding(x)

Should I add this token_embedding to our project?
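
If so, here is a rough sketch of how it could look (the class name and default values are hypothetical, not our actual transformer.py code):

    import torch
    import torch.nn as nn

    class SequenceEmbedder(nn.Module):
        """Hypothetical module: label-encoded bases -> learnable embeddings."""

        def __init__(self, num_tokens: int = 6, embed_dim: int = 32):
            super().__init__()
            self.token_embedding = nn.Embedding(num_tokens, embed_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch_size, sequence_length) integer indices
            # returns: (batch_size, sequence_length, embed_dim)
            return self.token_embedding(x)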

@williamstark01
Collaborator Author

I thought some more about this, and I'm not sure there is a single correct answer to how we should process the base characters.

Using the bases as tokens and feeding their one-hot encodings directly may work, but we would lose the learnable embeddings, which could map the bases to a higher-dimensional space that represents meaningful features. Then again, since we have so few distinct tokens, this may not be consequential.
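
As a sketch of the trade-off (the dimensions below are placeholders): a bias-free linear projection of the one-hot vectors is equivalent to an embedding lookup, so the two views are closely related:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_tokens, embed_dim = 6, 32  # placeholder dimensions

    indices = torch.randint(0, num_tokens, (2000,))
    one_hot = F.one_hot(indices, num_classes=num_tokens).float()

    # feeding the one-hot vectors directly keeps the feature dimension at 6;
    # a learnable projection maps them to a higher-dimensional space
    projection = nn.Linear(num_tokens, embed_dim, bias=False)
    projected = projection(one_hot)  # shape (2000, 32)

    # a bias-free linear layer on one-hot inputs is the same as an embedding lookup
    embedding = nn.Embedding(num_tokens, embed_dim)
    with torch.no_grad():
        embedding.weight.copy_(projection.weight.T)
    assert torch.allclose(projected, embedding(indices))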

For generating embeddings, another option is to use n-grams (k-mers) as tokens instead of single bases.
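
For illustration, a rough sketch of overlapping k-mer tokenization (k = 3 is just a placeholder; DNABERT, linked below, uses a scheme like this):

    from itertools import product

    k = 3  # placeholder k-mer size

    # vocabulary of all possible k-mers over the plain DNA alphabet
    kmer_vocabulary = {"".join(kmer): index
                       for index, kmer in enumerate(product("ACGT", repeat=k))}

    sequence = "ACGTACGTACGT"
    # overlapping k-mers: one token per position
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    indices = [kmer_vocabulary[kmer] for kmer in kmers]
    print(kmers[:4], indices[:4])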

Maybe we can continue using one-hot encodings without embeddings for now, but at some point it's probably worth taking a look at similar projects to get additional insights on this:

https://github.com/jerryji1993/DNABERT
https://github.com/jdcla/DNA-transformer
https://github.com/lucidrains/enformer-pytorch
https://github.com/lucidrains/tf-bind-transformer
https://github.com/Rvbens/non-coding-DNA-classifier

@yangtcai
Collaborator

yangtcai commented Jul 2, 2022

OK, I will check them out.
