
MosaicBERT: pretraining configuration for models > 128 seq. length #442

Open
stefan-it opened this issue Jan 3, 2024 · 5 comments

@stefan-it

Hi MosaicML team,

Many thanks for releasing the code and models for MosaicBERT! I highly appreciate the effort you put into modernizing the BERT architecture.

I am interested in pretraining MosaicBERT, so I have some questions :)

  • I am interested in the pretraining configuration for the model with 512 sequence length. Additionally, do you have hardware recommendations and an approximate time to pretrain MosaicBERT with 512 seq. length? Did you use the phase 1 + phase 2 "trick" of pretraining at 128 seq. length and then for fewer steps at 512? In that case, the MosaicBERT with 128 seq. length could be "recycled".
  • I'm also interested in which implementation is recommended, e.g. a tagged/specific commit or the upcoming Modernize MosaicBERT #440 PR.

Many thanks in advance!

Stefan

@stefan-it stefan-it changed the title MosaicBERT: pretraining configurations for models > 128 seq. length MosaicBERT: pretraining configuration for models > 128 seq. length Jan 3, 2024
@Taytay

Taytay commented Jan 4, 2024

@stefan-it - I tried the commit on main, ran into a number of errors, and was pointed to #440, so I am planning to base my work on that unless I hear otherwise.

@jacobfulano
Contributor

jacobfulano commented Jan 5, 2024

Hi @stefan-it, we did not experiment with training on 128 and then switching to 512 (as in the original BERT paper by Devlin et al. 2018). In our experiments, training MosaicBERT-Base at sequence length 512 with batch size 4096 for 70,000 steps took roughly 30 hours on 8 A100 80 GB GPUs (see below).

It might take us a few more days to merge the FA2 PR #440, but do let us know if you run into any issues!

[image attachment]
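As a rough sanity check of the quoted figures, here is a minimal Python sketch of the implied token budget and throughput. All inputs are the numbers above; the per-GPU throughput assumes every sequence in the batch is a full 512 tokens, which is only an approximation.

```python
# Back-of-the-envelope arithmetic from the figures quoted above:
# 70,000 steps, global batch size 4096, sequence length 512,
# ~30 hours on 8x A100 80 GB. Illustrative only.

steps = 70_000
global_batch_size = 4096
seq_len = 512
wall_clock_hours = 30
num_gpus = 8

total_tokens = steps * global_batch_size * seq_len        # ~1.47e11 tokens
gpu_hours = wall_clock_hours * num_gpus                   # 240 GPU-hours
tokens_per_gpu_sec = total_tokens / (gpu_hours * 3600)    # ~170k tokens/s per GPU

print(f"total tokens seen:  {total_tokens:.3g}")
print(f"GPU-hours:          {gpu_hours}")
print(f"tokens/s per GPU:   {tokens_per_gpu_sec:,.0f}")
```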

@mmarius

mmarius commented Jan 6, 2024

Hi @jacobfulano, do you also have an estimate for how long it will take to pre-train MosaicBERT-Large on a sequence length of 512 with batch size 4096 for 70,000 steps?

@jacobfulano
Contributor

Hi @mmarius, we did not specifically train MosaicBERT-Large at sequence length 512 with batch size 4096 for 70,000 steps. However, my estimate would be roughly 4x the time it takes to train MosaicBERT-Large at sequence length 128 with batch size 4096 for 70,000 steps (~27.2 hours), so roughly 108 hours on 8 A100 80 GB GPUs.
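To make the scaling assumption behind that estimate explicit, here is a minimal sketch. It treats the jump from sequence length 128 to 512 as roughly 4x the per-step work, which ignores the super-linear cost of attention, so the result is a ballpark figure only.

```python
# Rough extrapolation of the estimate above: a 4x longer sequence
# (128 -> 512) is treated as ~4x the per-step work. This ignores the
# super-linear cost of attention, so treat the result as a ballpark.

base_hours_seq128 = 27.2    # MosaicBERT-Large, seq len 128, batch 4096, 70k steps
seq_len_ratio = 512 / 128   # = 4.0

estimated_hours_seq512 = base_hours_seq128 * seq_len_ratio
print(f"estimated time at seq len 512: ~{estimated_hours_seq512:.0f} hours "
      f"on 8x A100 80 GB")  # ~109 hours, in line with the ~108 h figure above
```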

@jacobfulano
Contributor

If you are going any larger than that, I would recommend looking at mosaicml/llm-foundry, which should have support for training encoders/embedding models soon.
