## 🚀 Composer v0.15.0

## What's New
- **Exact Eval (#2218)**

  Composer now supports exact evaluation! Evaluation now produces exactly the same results regardless of the number of GPUs, because any duplicated samples are removed from the dataloader.
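The idea can be sketched in plain Python: instead of padding the dataset so every rank gets an equal share (which duplicates some samples), each rank takes a possibly shorter slice, so the union of all shards covers each sample exactly once. This is an illustrative sketch of the partitioning principle, not Composer's actual sampler implementation.

  ```python
  def exact_shard(indices, rank, world_size):
      """Assign each sample index to exactly one rank, with no padding.

      A standard distributed sampler pads the dataset so every rank gets the
      same number of samples, duplicating some of them; dropping the padding
      is what makes evaluation results independent of the GPU count.
      """
      return indices[rank::world_size]

  # 10 samples across 4 ranks: shards have unequal lengths, but no duplicates.
  indices = list(range(10))
  shards = [exact_shard(indices, r, 4) for r in range(4)]
  assert sorted(i for shard in shards for i in shard) == indices
  ```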
- **Monolithic Checkpoint Loading (#2288)**

  When training large models, loading the model and optimizer on every rank can exhaust system memory. With FSDP, Composer can now load the model and optimizer on rank 0 only and broadcast them to all other ranks. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      fsdp_config={'load_monolith_rank0_only': True},
  )

  # Train!
  trainer.fit()
  ```

  Ensure the model on rank 0 is on CPU/GPU (as opposed to meta).
- **Spin Dataloaders**

  By default, Composer spins dataloaders back to the current timestamp to ensure deterministic resumption. However, dataloader spinning can be very slow, so `Trainer` now has a new flag to disable spinning if determinism is not required. To enable:

  ```python
  from composer import Trainer

  # Construct Trainer
  trainer = Trainer(
      ...,
      spin_dataloaders=False,
  )

  # Train!
  trainer.fit()
  ```
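"Spinning" a dataloader just means replaying and discarding the batches a previous run already consumed, which reproduces the exact sampling state at the point of resumption. A minimal sketch of that fast-forward logic (not Composer's implementation; `spin_dataloader` is a hypothetical helper) shows why it is deterministic but slow for large datasets:

  ```python
  def spin_dataloader(dataloader, batches_seen):
      """Fast-forward a dataloader past batches consumed before resumption."""
      it = iter(dataloader)
      for _ in range(batches_seen):
          next(it)  # discard a batch the previous run already trained on
      return it

  # Resuming after 3 batches: the next batch yielded is batch index 3.
  loader = [[i] for i in range(6)]  # stand-in for a torch DataLoader
  it = spin_dataloader(loader, 3)
  assert next(it) == [3]
  ```

Disabling spinning skips this replay entirely, trading bit-for-bit deterministic resumption for faster restarts.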
## Deprecations

- `HealthChecker` is now deprecated and will be removed in v0.17.0.
## Bug Fixes
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Change state dict loading default to strict by @dakinggg in #2216
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix typo depricated -> deprecated by @eracah in #2270
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
## What's Changed
- Update FFCV by @mvpatel2000 in #2197
- Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
- Bump junitparser from 3.0.0 to 3.1.0 by @dependabot in #2212
- Bump sentencepiece from 0.1.98 to 0.1.99 by @dependabot in #2208
- Add docs for Checkpointing with Cloudflare R2 by @eracah in #2215
- Working slack link by @growlix in #2217
- Change state dict loading default to strict by @dakinggg in #2216
- Fix typo in evaluation docs by @dakinggg in #2225
- Clean soft cross entropy by @mvpatel2000 in #2227
- add cmake by @dakinggg in #2229
- Upgrade to mcli0.4, smaller mcli improvements by @aspfohl in #2226
- Bump to torch 2.0.1 by @mvpatel2000 in #2235
- Deprecate healthchecker by @mvpatel2000 in #2236
- Update torch 2.0.1 workflows by @mvpatel2000 in #2239
- Log wandb URL to metadata by @mvpatel2000 in #2240
- Bump ipykernel from 6.22.0 to 6.23.1 by @dependabot in #2244
- Update transformers requirement from <4.29,>=4.11 to >=4.11,<4.30 by @dependabot in #2245
- CE loss vs CE metric equivalence by @dakinggg in #2241
- Exact Eval by @mvpatel2000 in #2218
- bump torchmetrics pin by @nik-mosaic in #2247
- Remove deprecated code / torch 1.11 / torch 1.12 by @mvpatel2000 in #2234
- Rename `backwards_create_graph` description by @mvpatel2000 in #2248
- Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
- Fix daily tests by fixing test_fsdp_load_old_checkpoint by @eracah in #2249
- Support for multiple optimizer groups in torch 2.0 + FSDP by @sashaDoubov in #2230
- Change AdamW step to a tensor instead of an int by @eracah in #2237
- Update to cuda 11.8 by @mvpatel2000 in #2250
- Fix daily tests by adding s3 secrets to daily-gpu tests by @eracah in #2254
- Typo in s3_prefix: epemeral -> ephemeral 🤦‍♂️ by @eracah in #2255
- Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #2256
- Bump coverage[toml] from 7.2.5 to 7.2.6 by @dependabot in #2258
- Add callbacks for EVAL_BEFORE_ALL and EVAL_AFTER_ALL by @rishab-partha in #2264
- Update torch device naming convention for h100 gpus by @vchiley in #2265
- Fix typo depricated -> deprecated by @eracah in #2270
- alerts for daily tests by @mvpatel2000 in #2272
- Fix daily tests by patching cupy version by @mvpatel2000 in #2274
- Skip ffcv notebook by @mvpatel2000 in #2277
- Spin dataloader arg by @mvpatel2000 in #2267
- Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
- Bump coverage[toml] from 7.2.6 to 7.2.7 by @dependabot in #2282
- Patch for tokenizers that have python files in save_pretrained output by @dakinggg in #2279
- fix get file(overwite=True) to properly handle pre-existing files by @bmosaicml in #2284
- Fix Checkpointing Docs Link by @rishab-partha in #2278
- Add errors for Mixed Dataloader Eval by @rishab-partha in #2269
- Fix autoresume with slashed directory by @rishab-partha in #2287
- Delete symlinks when not saving checkpoints locally by @rishab-partha in #2285
- fixed adding tokenizer to hf by @KuuCi in #2290
- New Console Logger Test + Discard before Eval by @rishab-partha in #2273
- Enabled kv caching during generate to speed up QA Task by @bmosaicml in #2293
- Update monai requirement from <1.2,>=0.9.1 to >=0.9.1,<1.3 by @dependabot in #2298
- Bump sphinxcontrib-katex from 0.9.4 to 0.9.5 by @dependabot in #2296
- Training Checkpoint Fix by @KuuCi in #2294
- Update transformers requirement from <4.30,>=4.11 to >=4.11,<4.31 by @dependabot in #2295
- Fixed how save_checkpoint_to_save_folder called CheckpointSaver object to save state and logger by @KuuCi in #2300
- Update Slack link in README.md by @ejyuen in #2261
- Change progress bar logger to print all eval metrics by @rishab-partha in #2286
- Add pytest clear cache by @rishab-partha in #2305
- Fix tests for wandb and mlflow loggers by @b-chu in #2302
- Monolithic Loading by @mvpatel2000 in #2288
- Add loss_dict keyword to closure lambda function by @Landanjs in #1952
- Strip spacing icl by @bmosaicml in #2306
- Add additional error with auto microbatching by @mvpatel2000 in #2308
- Group autoresume messages by @mvpatel2000 in #2307
- Move deepspeed enabled to state by @mvpatel2000 in #2309
- Jiggling tests and adding gc collect by @bcui19 in #2312
- Monolithic loading improvements by @mvpatel2000 in #2313
- Update version to 0.15 by @mvpatel2000 in #2315
## New Contributors
- @aspfohl made their first contribution in #2226
- @sashaDoubov made their first contribution in #2230
- @rishab-partha made their first contribution in #2264
- @jimmiemunyi made their first contribution in #2275
- @KuuCi made their first contribution in #2290
- @b-chu made their first contribution in #2302
**Full Changelog**: v0.14.1...v0.15.0