## 🚀 Composer v0.15.0

## What's New
- **Exact Eval (#2218)**

  Composer now supports exact evaluation! Evaluation now produces exactly the same results regardless of the number of GPUs, because any duplicated samples are removed from the dataloader.
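The idea can be sketched in plain Python: instead of padding the dataset so every rank gets an equal share (which duplicates some samples), each rank takes a possibly shorter slice, so the union of all shards covers each sample exactly once. This is an illustrative sketch of the partitioning principle, not Composer's actual sampler implementation.

  ```python
  def exact_shard(indices, rank, world_size):
      """Assign each sample index to exactly one rank, with no padding.

      A standard distributed sampler pads the dataset so every rank gets the
      same number of samples, duplicating some of them; dropping the padding
      is what makes evaluation results independent of the GPU count.
      """
      return indices[rank::world_size]

  # 10 samples across 4 ranks: shards have unequal lengths, but no duplicates.
  indices = list(range(10))
  shards = [exact_shard(indices, r, 4) for r in range(4)]
  assert sorted(i for shard in shards for i in shard) == indices
  ```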
- **Monolithic Checkpoint Loading (#2288)**

  When training large models, loading the model and optimizer on every rank can exhaust system memory. With FSDP, Composer can now load the model and optimizer on rank 0 only and broadcast them to all other ranks. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      fsdp_config={'load_monolith_rank0_only': True},
  )

  # Train!
  trainer.fit()
  ```

  Ensure the model on rank 0 is on CPU/GPU (as opposed to meta).
- **Spin Dataloaders**

  By default, Composer spins dataloaders back to the current timestamp to ensure deterministic resumption. However, dataloader spinning can be very slow, so `Trainer` now has a new flag to disable spinning if determinism is not required. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      spin_dataloaders=False,
  )

  # Train!
  trainer.fit()
  ```
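"Spinning" a dataloader just means replaying and discarding the batches a previous run already consumed, which reproduces the exact sampling state at the point of resumption. A minimal sketch of that fast-forward logic (not Composer's implementation; `spin_dataloader` is a hypothetical helper) shows why it is deterministic but slow for large datasets:

  ```python
  def spin_dataloader(dataloader, batches_seen):
      """Fast-forward a dataloader past batches consumed before resumption."""
      it = iter(dataloader)
      for _ in range(batches_seen):
          next(it)  # discard a batch the previous run already trained on
      return it

  # Resuming after 3 batches: the next batch yielded is batch index 3.
  loader = [[i] for i in range(6)]  # stand-in for a torch DataLoader
  it = spin_dataloader(loader, 3)
  assert next(it) == [3]
  ```

Disabling spinning skips this replay entirely, trading bit-for-bit deterministic resumption for faster restarts.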
## Deprecations

- `HealthChecker` is now deprecated and will be removed in v0.17.0.
## Bug Fixes
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Change state dict loading default to strict by @dakinggg in #2216
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix typo depricated -> deprecated by @eracah in #2270
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
## What's Changed
- Update FFCV by @mvpatel2000 in #2197
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Bump junitparser from 3.0.0 to 3.1.0 by @dependabot in #2212
- Bump sentencepiece from 0.1.98 to 0.1.99 by @dependabot in #2208
- Add docs for Checkpointing with Cloudflare R2 by @eracah in #2215
- Working slack link by @growlix in #2217
- Change state dict loading default to strict by @dakinggg in #2216
- Fix typo in evaluation docs by @dakinggg in #2225
- Clean soft cross entropy by @mvpatel2000 in #2227
- add cmake by @dakinggg in #2229
- Upgrade to mcli0.4, smaller mcli improvements by @aspfohl in #2226
- Bump to torch 2.0.1 by @mvpatel2000 in #2235
- Deprecate healthchecker by @mvpatel2000 in #2236
- Update torch 2.0.1 workflows by @mvpatel2000 in #2239
- Log wandb URL to metadata by @mvpatel2000 in #2240
- Bump ipykernel from 6.22.0 to 6.23.1 by @dependabot in #2244
- Update transformers requirement from <4.29,>=4.11 to >=4.11,<4.30 by @dependabot in #2245
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Exact Eval by @mvpatel2000 in #2218
- bump torchmetrics pin by @nik-mosaic in #2247
- Remove deprecated code / torch 1.11 / torch 1.12 by @mvpatel2000 in #2234
- Rename `backwards_create_graph` description by @mvpatel2000 in #2248
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix daily tests by fixing test_fsdp_load_old_checkpoint by @eracah in #2249
- Support for multiple optimizer groups in torch 2.0 + FSDP by @sashaDoubov in #2230
- Change AdamW step to a tensor instead of an int by @eracah in #2237
- Update to cuda 11.8 by @mvpatel2000 in #2250
- Fix daily tests by adding s3 secrets to daily-gpu tests by @eracah in #2254
- Typo in s3_prefix: epemeral -> ephemeral 🤦‍♂️ by @eracah in #2255
- Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #2256
- Bump coverage[toml] from 7.2.5 to 7.2.6 by @dependabot in #2258
- Add callbacks for EVAL_BEFORE_ALL and EVAL_AFTER_ALL by @rishab-partha in #2264
- Update torch device naming convention for h100 gpus by @vchiley in #2265
- Fix typo depricated -> deprecated by @eracah in #2270
- alerts for daily tests by @mvpatel2000 in #2272
- Fix daily tests by patching cupy version by @mvpatel2000 in #2274
- Skip ffcv notebook by @mvpatel2000 in #2277
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Bump coverage[toml] from 7.2.6 to 7.2.7 by @dependabot in #2282
- Patch for tokenizers that have python files in save_pretrained output by @dakinggg in #2279
- fix get file(overwite=True) to properly handle pre-existing files by @bmosaicml in #2284
- Fix Checkpointing Docs Link by @rishab-partha in #2278
- Add errors for Mixed Dataloader Eval by @rishab-partha in #2269
- Fix autoresume with slashed directory by @rishab-partha in #2287
- Delete symlinks when not saving checkpoints locally by @rishab-partha in #2285
- fixed adding tokenizer to hf by @KuuCi in #2290
- New Console Logger Test + Discard before Eval by @rishab-partha in #2273
- Enabled kv caching during generate to speed up QA Task by @bmosaicml in #2293
- Update monai requirement from <1.2,>=0.9.1 to >=0.9.1,<1.3 by @dependabot in #2298
- Bump sphinxcontrib-katex from 0.9.4 to 0.9.5 by @dependabot in #2296
- Training Checkpoint Fix by @KuuCi in #2294
- Update transformers requirement from <4.30,>=4.11 to >=4.11,<4.31 by @dependabot in #2295
- Fixed how save_checkpoint_to_save_folder called CheckpointSaver object to save state and logger by @KuuCi in #2300
- Update Slack link in README.md by @ejyuen in #2261
- Change progress bar logger to print all eval metrics by @rishab-partha in #2286
- Add pytest clear cache by @rishab-partha in #2305
- Fix tests for wandb and mlflow loggers by @b-chu in #2302
- Monolithic Loading by @mvpatel2000 in #2288
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
- Add additional error with auto microbatching by @mvpatel2000 in #2308
- Group autoresume messages by @mvpatel2000 in #2307
- Move deepspeed enabled to state by @mvpatel2000 in #2309
- Jiggling tests and adding gc collect by @bcui19 in #2312
- Monolithic loading improvements by @mvpatel2000 in #2313
- Update version to 0.15 by @mvpatel2000 in #2315
## New Contributors
- @aspfohl made their first contribution in #2226
- @sashaDoubov made their first contribution in #2230
- @rishab-partha made their first contribution in #2264
- @jimmiemunyi made their first contribution in #2275
- @KuuCi made their first contribution in #2290
- @b-chu made their first contribution in #2302
**Full Changelog**: v0.14.1...v0.15.0