20 Jun 19:04

mvpatel2000

9983bcd

v0.15.0

🚀 Composer v0.15.0

What's New

Exact Eval (#2218)

Composer now supports exact evaluation! Evaluation now gives the same results regardless of the number of GPUs, because duplicated samples are removed from the dataloader.
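To see why duplicated samples matter, here is a minimal sketch of the common padding behavior of a distributed sampler, where indices wrap around so every rank receives the same number of samples. This is an illustration only, not Composer's implementation; `shard_indices` and `exact_eval_indices` are hypothetical helpers.

```python
def shard_indices(num_samples: int, world_size: int) -> list[list[int]]:
    """Pad by wrapping around so every rank gets equally many samples, then shard round-robin."""
    per_rank = -(-num_samples // world_size)  # ceiling division
    padded = [i % num_samples for i in range(per_rank * world_size)]
    return [padded[rank::world_size] for rank in range(world_size)]


def exact_eval_indices(shards: list[list[int]], num_samples: int) -> list[int]:
    """Re-interleave the shards into global order and drop the padded duplicates."""
    world_size = len(shards)
    per_rank = len(shards[0])
    global_order = [shards[r][i] for i in range(per_rank) for r in range(world_size)]
    return global_order[:num_samples]


# 10 samples on 4 ranks: 2 samples are duplicated to fill 12 slots.
correct = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # only sample 0 is classified correctly
shards = shard_indices(num_samples=10, world_size=4)

padded_hits = [correct[i] for shard in shards for i in shard]
naive_accuracy = sum(padded_hits) / len(padded_hits)

exact_hits = [correct[i] for i in exact_eval_indices(shards, num_samples=10)]
exact_accuracy = sum(exact_hits) / len(exact_hits)
```

In this toy case the padded evaluation counts sample 0 twice, so the naive accuracy depends on the number of ranks while the deduplicated accuracy does not.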
Monolithic Checkpoint Loading (#2288)

When training large models, loading the model and optimizer on every rank can use up all the system memory. With FSDP, Composer can now load the model and optimizer on only rank 0 and broadcast it to all other ranks. To enable:
```
from composer import Trainer

# Construct Trainer
trainer = Trainer(
   ...,
   fsdp_config={
      'load_monolith_rank0_only': True,
   },
)

# Train!
trainer.fit()
```
Also ensure that the model on rank 0 is initialized on CPU or GPU (as opposed to the meta device).
Spin Dataloaders

By default, Composer spins dataloaders back to the current timestamp to ensure deterministic resumption. However, dataloader spinning can be very slow, so Trainer now has a flag to disable spinning when determinism is not required. To disable spinning:
```
from composer import Trainer

# Construct Trainer
trainer = Trainer(
   ...,
   spin_dataloaders=False,
)

# Train!
trainer.fit()
```
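As a rough mental model of what spinning costs, the sketch below skips the batches that were already consumed before a checkpoint by iterating over them and discarding the results. It is a simplified illustration, not Composer's code; `spin_dataloader` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def spin_dataloader(dataloader: Iterable, batches_seen: int) -> Iterator:
    """Skip the batches consumed before the checkpoint, then return the rest."""
    iterator = iter(dataloader)
    for _ in range(batches_seen):
        # Each skipped batch is still loaded in full, which is why spinning is slow.
        next(iterator, None)
    return iterator


epoch = [f"batch_{i}" for i in range(6)]
resumed = list(spin_dataloader(epoch, batches_seen=4))
```

With spinning disabled, training would instead restart from the first batch of the dataloader, which is faster but no longer reproduces the original data order.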

Deprecations

HealthChecker is now deprecated and will be removed in v0.17.0

Bug Fixes

Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
Change state dict loading default to strict by @dakinggg in #2216
CE loss vs CE metric equivalence by @dakinggg in #2241
Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
Fix typo depricated -> deprecated by @eracah in #2270
Spin dataloader arg by @mvpatel2000 in #2267
Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
Add loss_dict keyword to closure lambda function by @Landanjs in #1952
Strip spacing icl by @bmosaicml in #2306

What's Changed

Update FFCV by @mvpatel2000 in #2197
Add support for saving HF info in state dict when using DDP by @dakinggg in #2206
Bump junitparser from 3.0.0 to 3.1.0 by @dependabot in #2212
Bump sentencepiece from 0.1.98 to 0.1.99 by @dependabot in #2208
Add docs for Checkpointing with Cloudflare R2 by @eracah in #2215
Working slack link by @growlix in #2217
Change state dict loading default to strict by @dakinggg in #2216
Fix typo in evaluation docs by @dakinggg in #2225
Clean soft cross entropy by @mvpatel2000 in #2227
add cmake by @dakinggg in #2229
Upgrade to mcli0.4, smaller mcli improvements by @aspfohl in #2226
Bump to torch 2.0.1 by @mvpatel2000 in #2235
Deprecate healthchecker by @mvpatel2000 in #2236
Update torch 2.0.1 workflows by @mvpatel2000 in #2239
Log wandb URL to metadata by @mvpatel2000 in #2240
Bump ipykernel from 6.22.0 to 6.23.1 by @dependabot in #2244
Update transformers requirement from <4.29,>=4.11 to >=4.11,<4.30 by @dependabot in #2245
CE loss vs CE metric equivalence by @dakinggg in #2241
Exact Eval by @mvpatel2000 in #2218
bump torchmetrics pin by @nik-mosaic in #2247
Remove deprecated code / torch 1.11 / torch 1.12 by @mvpatel2000 in #2234
Rename backwards_create_graph description by @mvpatel2000 in #2248
Move sharded checkpoints into their own intermediate prefix folder by @eracah in #2205
Fix daily tests by fixing test_fsdp_load_old_checkpoint by @eracah in #2249
Support for multiple optimizer groups in torch 2.0 + FSDP by @sashaDoubov in #2230
Change AdamW step to a tensor instead of an int by @eracah in #2237
Update to cuda 11.8 by @mvpatel2000 in #2250
Fix daily tests by adding s3 secrets to daily-gpu tests by @eracah in #2254
Typo in s3_prefix: epemeral -> ephemeral 🤦‍♂️ by @eracah in #2255
Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #2256
Bump coverage[toml] from 7.2.5 to 7.2.6 by @dependabot in #2258
Add callbacks for EVAL_BEFORE_ALL and EVAL_AFTER_ALL by @rishab-partha in #2264
Update torch device naming convention for h100 gpus by @vchiley in #2265
Fix typo depricated -> deprecated by @eracah in #2270
alerts for daily tests by @mvpatel2000 in #2272
Fix daily tests by patching cupy version by @mvpatel2000 in #2274
Skip ffcv notebook by @mvpatel2000 in #2277
Spin dataloader arg by @mvpatel2000 in #2267
Confirming the output variable has two dimensions before confirming the shape of the second element. by @jimmiemunyi in #2275
Bump coverage[toml] from 7.2.6 to 7.2.7 by @dependabot in #2282
Patch for tokenizers that have python files in save_pretrained output by @dakinggg in #2279
fix get file(overwite=True) to properly handle pre-existing files by @bmosaicml in #2284
Fix Checkpointing Docs Link by @rishab-partha in #2278
Add errors for Mixed Dataloader Eval by @rishab-partha in #2269
Fix autoresume with slashed directory by @rishab-partha in #2287
Delete symlinks when not saving checkpoints locally by @rishab-partha in #2285
fixed adding tokenizer to hf by @KuuCi in #2290
New Console Logger Test + Discard before Eval by @rishab-partha in #2273
Enabled kv caching during generate to speed up QA Task by @bmosaicml in #2293
Update monai requirement from <1.2,>=0.9.1 to >=0.9.1,<1.3 by @dependabot in #2298
Bump sphinxcontrib-katex from 0.9.4 to 0.9.5 by @dependabot in #2296
Training Checkpoint Fix by @KuuCi in #2294
Update transformers requirement from <4.30,>=4.11 to >=4.11,<4.31 by @dependabot in #2295
Fixed how save_checkpoint_to_save_folder called CheckpointSaver object to save state and logger by @KuuCi in #2300
Update Slack link in README.md by @ejyuen in #2261
Change progress bar logger to print all eval metrics by @rishab-partha in #2286
Add pytest clear cache by @rishab-partha in #2305
Fix tests for wandb and mlflow loggers by @b-chu in #2302
Monolithic Loading by @mvpatel2000 in #2288
Add loss_dict keyword to closure lambda function by @Landanjs in #1952
Strip spacing icl by @bmosaicml in #2306
Add additional error with auto microbatching by @mvpatel2000 in #2308
Group autoresume messages by @mvpatel2000 in #2307
Move deepspeed enabled to state by @mvpatel2000 in #2309
Jiggling tests and adding gc collect by @bcui19 in #2312
Monolithic loading improvements by @mvpatel2000 in #2313
Update version to 0.15 by @mvpatel2000 in #2315

New Contributors

@aspfohl made their first contribution in #2226
@sashaDoubov made their first contribution in #2230
@rishab-partha made their first contribution in...

Contributors

sashaDoubov, eracah, and 15 other contributors


05 May 05:46

mvpatel2000

v0.14.1

7da93f8

v0.14.1

Bug Fixes

Fixes a bug related to SentencePiece tokenizers and ICL eval.

What's Changed

Update docs to remove gradient clipping in events by @mvpatel2000 in #2193
remove explorer info from readme by @nik-mosaic in #2174
bugfix sentpiece by @bmosaicml in #2198
Fix Broken Training Loop Image Link by @eracah in #2199
Fix broken image link for GLU by @eracah in #2201
bugfix sentpiece (#2198) by @bmosaicml in #2200
Bump version to v0.14.1 by @mvpatel2000 in #2202
Pin protobuf by @mvpatel2000 in #2203

Full Changelog: v0.14.0...v0.14.1

Contributors

eracah, mvpatel2000, and 2 other contributors


03 May 15:42

bandish-shah

v0.14.0

5ba2a60

v0.14.0

🚀 Composer v0.14.0

Composer v0.14.0 is released! Install via pip:

pip install composer==0.14.0

The legacy package name still works via pip:

pip install mosaicml==0.14.0

New Features

🆕 PyTorch 2.0 Support (#2172)

We're thrilled to announce official support for PyTorch 2.0! All initial unit tests are passing and we have run through our examples. We've also made some updates to start taking advantage of the new features.

Initial support also includes:

Support for torch.compile

| Model | Dataset | Without compile throughput/samples_per_sec | With compile throughput/samples_per_sec | Performance % |
| --- | --- | --- | --- | --- |
| ResNet50 | ImageNet | 5557 | 7424 | 33.60% |
| DeepLab V3 | ADE20K | 81.60 | 98.82 | 21.10% |
| HF BERT | C4 | 3360 | 4259 | 26.75% |
| HF Causal LM | C4 | 50.61 | 103.29 | 100.05% |

To start using it, add the compile_config argument to the Trainer:

  # To use default `torch.compile` config
  trainer = Trainer(
     ...,
     compile_config={},
  )

  # To use custom `torch.compile` config, provide an argument as a dictionary, for example:
  trainer = Trainer(
     ...,
     compile_config={'mode': 'reduce-overhead'},
  )

The Trainer also supports pre-compiled models passed via the model argument. If the model has been pre-compiled, the compile_config argument is ignored if provided.

Note: We recommend baselining your model with and without torch.compile, as there are scenarios where enabling compile does not yield any throughput improvement and some cases where it can lead to a regression.
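When baselining, the comparison reduces to a relative throughput change between the two runs. A minimal sketch follows; `throughput_change_percent` is a hypothetical helper, and the input numbers are the ResNet50 row from the table above.

```python
def throughput_change_percent(baseline: float, compiled: float) -> float:
    """Relative throughput change of the compiled run versus the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (compiled - baseline) / baseline * 100.0


# ResNet50 on ImageNet: 5557 -> 7424 samples/sec
resnet_gain = throughput_change_percent(5557, 7424)
print(f"{resnet_gain:.1f}%")
```

A negative result indicates a regression, in which case leaving compile_config unset is the safer choice for that model.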

PyTorch 2.0 Docker Images

We've added the following new official MosaicML Docker Images with PyTorch 2.0 support:

| Linux Distro | Flavor | PyTorch Version | CUDA Version | Python Version | Docker Tags |
| --- | --- | --- | --- | --- | --- |
| Ubuntu 20.04 | Base | 2.0.0 | 11.7.1 (Infiniband) | 3.10 | `mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04` |
| Ubuntu 20.04 | Base | 2.0.0 | 11.7.1 (EFA) | 3.10 | `mosaicml/pytorch:2.0.0_cu117-python3.10-ubuntu20.04-aws` |
| Ubuntu 20.04 | Base | 2.0.0 | cpu | 3.10 | `mosaicml/pytorch:2.0.0_cpu-python3.10-ubuntu20.04` |
| Ubuntu 20.04 | Vision | 2.0.0 | 11.7.1 (Infiniband) | 3.10 | `mosaicml/pytorch_vision:2.0.0_cu117-python3.10-ubuntu20.04` |
| Ubuntu 20.04 | Vision | 2.0.0 | cpu | 3.10 | `mosaicml/pytorch_vision:2.0.0_cpu-python3.10-ubuntu20.04` |

🦾 New Callbacks

Activation monitor (#2066)

Monitors activations in the network. Every `interval` batches, it attaches a forward hook and logs the max, average, L2 norm, and kurtosis of the input and output activations. To enable:
```
from composer import Trainer
from composer.callbacks import ActivationMonitor

# Construct Trainer
trainer = Trainer(
   ...,
   callbacks=[ActivationMonitor()],
)

# Train!
trainer.fit()
```
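For reference, the four statistics named above can be computed as in the following sketch on a flat list of activation values. This is not the callback's code; `activation_stats` is a hypothetical helper, and the kurtosis definition used here (Pearson, fourth central moment over squared variance) is an assumption.

```python
import math


def activation_stats(values: list[float]) -> dict[str, float]:
    """Max, average, L2 norm, and kurtosis of a flat list of activations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return {
        "max": max(values),
        "average": mean,
        "l2_norm": math.sqrt(sum(v * v for v in values)),
        # Assumed definition: Pearson kurtosis, undefined for constant inputs.
        "kurtosis": fourth_moment / variance**2 if variance > 0 else float("nan"),
    }


stats = activation_stats([3.0, 4.0])
```

A sudden jump in the max or kurtosis of a layer's activations is the kind of signal these logs are typically used to spot.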

Slack Logger (#2133)

You can now send custom training metrics to Slack! To enable:

from composer import Trainer
from composer.loggers import SlackLogger

# Construct Trainer
trainer = Trainer(
   ...,
   loggers=[
       SlackLogger(
           log_interval="10ba", # or 1ep, 2ep 
           include_keys=["algorithm_traces*", "loss*"],
           formatter_func=(lambda data, **kwargs:
              [
                  {
                      "type": "section", "text": {"type": "mrkdwn", "text": f"*{k}:* {v}"}
                  }
                  for k, v in data.items()
              ])
       )
   ],
)

trainer.fit()

Please see PR #2133 for additional details.
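The formatter_func in the example above can be tried in isolation. The sketch below is the same lambda written as a named function and applied to a sample metrics dictionary; the function name and the sample data are illustrative only.

```python
def format_as_slack_blocks(data: dict, **kwargs) -> list[dict]:
    """Turn each logged key/value pair into one Slack Block Kit section."""
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*{k}:* {v}"}}
        for k, v in data.items()
    ]


blocks = format_as_slack_blocks({"loss/train": 0.25})
print(blocks)
```

Each metric becomes one section block whose text renders the key in bold followed by its value.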

API changes

The grad_accum argument has been removed from Trainer; users are now required to use device_train_microbatch_size instead (#2040)
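When migrating, an equivalent microbatch size can be derived from the old setting. The sketch below assumes grad_accum split each device's batch into grad_accum equally sized microbatches; `microbatch_size_from_grad_accum` is a hypothetical helper, not part of Composer.

```python
import math


def microbatch_size_from_grad_accum(global_batch_size: int, world_size: int, grad_accum: int) -> int:
    """Per-device microbatch size that corresponds to a former grad_accum value."""
    per_device_batch = global_batch_size // world_size
    return math.ceil(per_device_batch / grad_accum)


# Global batch of 2048 on 8 GPUs with the old grad_accum=4
microbatch = microbatch_size_from_grad_accum(global_batch_size=2048, world_size=8, grad_accum=4)
```

The resulting value would then be passed to the Trainer as device_train_microbatch_size.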

Deprecations

We no longer support PyTorch 1.11 and 1.12 due to security vulnerabilities. New features will not be tested against these versions.

Bug Fixes

Eval subset num batches bug fix (#2028)
Protect for missing slack_sdk import (#2031)
Adjust HuggingFaceModel token embedding resizing to only occur when necessary (#2027)
Update FSDP meta weight tying tests to include precision testing (#2050)
Backward Compat with Torchmetrics (#2046)
Busy wait for local rank 0 download to avoid timeout on large file download (#2054)
Fix OCIObjectStore save_overwrite=False bug (#2053)
Busy wait so that non local rank zeros don't timeout while local rank zero downloads a monolithic checkpoint (#2071)
Skip extra downloads when not using a format string (#2073)
fix name_or_path usage in HF save/load usage (#2075)
Fix EMA resumption issue with calling trainer.eval() before trainer.fit() (#2088)
Patch EMA with FSDP (#2091)
Updating gradient clipping to be torch 2.0 compatible (#2089)
Adding checks for weight tying s.t. we don't think None attributes are weight tied (#2103)
gate the extra forward call specifically for fsdp (#2102)
Allow user to set ONNX opset version when Exporting for Inference (#2101)
Runtime estimator (#2124)
Use state_dict Torchmetrics Serialization (#2116)
Fix filelock in checkpoint download (#2184)

What's Changed

Eval subset num batches bug fix by @mvpatel2000 in #2028
Protect for missing slack_sdk import by @hanlint in #2031
switch code quality workflow to dev target and smoketest by @dakinggg in #2032
Generate composer PyPi package by @bandish-shah in #2034
HealthChecker should only send test message on global rank zero by @hanlint in #2035
Bump version to 0.13.1 by @bandish-shah in #2033
Use follow in mcp script by @mvpatel2000 in #2022
Bump pytest from 7.2.1 to 7.2.2 by @dependabot in #2039
Bump pypandoc from 1.10 to 1.11 by @dependabot in #2038
Adds a PR guidelines section to contributing.md by @dakinggg in #1993
Adjust HuggingFaceModel token embedding resizing to only occur when necessary by @dakinggg in #2027
Remove deprecated code by @mvpatel2000 in #2026
test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
Log nodename information in composer by @eracah in #2043
Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
Backward Compat with Torchmetrics by @mvpatel2000 in #2046
update fsdp mixed precision by @vchiley in #2047
Checkpoints Simplified by @mvpatel2000 in #2041
Add composer PyPI package tests to daily workflow by @bandish-shah in #2052
Delete composer package GPU workflow by @dakinggg in #2055
Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
Raise error if attempting to export FSDP model by @hanlint in #2051
Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
Update docs with non-rank zero logs instructions by @hanlint in #2058
Pin torchmetrics by @mvpatel2000 in #2065
Add NO_REENTRANT activation checkpointing by @bmosaicml in #20...

Contributors

eracah, vchiley, and 13 other contributors


24 Apr 20:54

mvpatel2000

v0.13.5

7bb2df2

v0.13.5

Full Changelog: v0.13.4...v0.13.5

Add support for EMA + FSDP


05 Apr 02:55

mvpatel2000

v0.13.4

d80c37d

v0.13.4

Full Changelog: v0.13.3...v0.13.4

Bumps streaming version pin to <1.0


04 Apr 20:35

bandish-shah

v0.13.3

b32229a

v0.13.3

🚀 Composer v0.13.3

Introducing the `composer` PyPi package!

Composer v0.13.3 is released!

Composer can also now be installed using the new composer PyPi package via pip:

pip install composer==0.13.3

The legacy package name still works via pip:

pip install mosaicml==0.13.3

Bug Fixes

add sentencepiece support by @dakinggg in #2093

What's Changed

Bump version to 0.13.3 by @bandish-shah in #2115
add missing import by @dakinggg in #2113
add sentencepiece support by @dakinggg in #2093
Pin mcli version until API change is resolved by @dakinggg in #2111

Full Changelog: v0.13.2...v0.13.3

Contributors

dakinggg and bandish-shah


31 Mar 23:45

bandish-shah

v0.13.2

f25078a

v0.13.2

🚀 Composer v0.13.2

Introducing the `composer` PyPi package!

Composer v0.13.2 is released!

Composer can also now be installed using the new composer PyPi package via pip:

pip install composer==0.13.2

The legacy package name still works via pip:

pip install mosaicml==0.13.2

Bug Fixes

test and fix composer package name usage in composer_collect_env (#2049)
Backward Compat with Torchmetrics by @mvpatel2000 (#2046)
Fix OCIObjectStore save_overwrite=False bug (#2053)
busy wait for the rank 0 download (#2071)
Skip extra downloads when not using a format string (#2073)

What's Changed

Pin transformers package to <4.27 by @dakinggg in #2076
Bump version to v0.13.2 (#2068) by @bandish-shah
Skip extra downloads when not using a format string by @dakinggg in #2073
add support for autoresume + FSDP + sharding by @dakinggg in #2072
busy wait for the rank 0 download by @dakinggg in #2071
Revert "Checkpoints Simplified (#2059)" by @dakinggg in #2070
Add device and dtype back to LPLayerNorm (#2067) by @abhi-mosaic
Checkpoints Simplified by @mvpatel2000 in #2059
Allow LPLayerNorm and LPGroupNorm to support self.bias or self.weight = None (#2044) by @abhi-mosaic
Add NO_REENTRANT activation checkpointing (#2042) by @bmosaicml
pin torchmetrics by @mvpatel2000 in #2065
Update docs with non-rank zero logs instructions by @hanlint in #2058
Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
Raise error if attempting to export FSDP model by @hanlint in #2051
Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
Delete composer package GPU workflow by @dakinggg in #2055
Add composer PyPI package tests to daily workflow (#2052) by @bandish-shah
Checkpoints Simplified by @mvpatel2000 in #2041
update fsdp mixed precision by @vchiley in #2047
Backward Compat with Torchmetrics by @mvpatel2000 in #2046
Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
Log nodename information in composer by @eracah in #2043
test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
Adjust how HuggingFaceModel handles embedding resizing by @dakinggg in #2027
Adds a PR guidelines section to contributing.md by @dakinggg in #1993
Bump pypandoc from 1.10 to 1.11 (#2038) by @dependabot[bot]
Bump pytest from 7.2.1 to 7.2.2 (#2039) by @dependabot[bot]
Use follow in mcp script by @mvpatel2000 in #2022

Full Changelog: v0.13.1...v0.13.2

Contributors

eracah, vchiley, and 8 other contributors


07 Mar 03:11

bandish-shah

v0.13.1

8e83ff8

v0.13.1

🚀 Composer v0.13.1

Introducing the `composer` PyPi package!

Composer v0.13.1 is released!

Composer can also now be installed using the new composer PyPi package via pip:

pip install composer==0.13.1

The legacy package name still works via pip:

pip install mosaicml==0.13.1

Note: The mosaicml==0.13.0 PyPi package was yanked due to some minor packaging issues discovered after release. The package was re-released as Composer v0.13.1, thus these release notes contain details for both v0.13.0 and v0.13.1.

New Features

🤙 New and Updated Callbacks
- New HealthChecker Callback (#2002)
  
  The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!
```
from composer import Trainer
from composer.callbacks import HealthChecker

# Warn if GPU utilization difference drops below 10%
health_checker = HealthChecker(
    threshold = 10
)

# Construct Trainer
trainer = Trainer(
    ...,
    callbacks=health_checker,
)

# Train!
trainer.fit()
```
- Updated MemoryMonitor to use GigaBytes (GB) units (#1940)
- New RuntimeEstimator Callback (#1991)
  
  Estimate the remaining runtime of your job! Approximates the time remaining by observing the throughput and comparing to the number of batches remaining.
```
from composer import Trainer
from composer.callbacks import RuntimeEstimator

# Construct trainer with RuntimeEstimator callback
trainer = Trainer(
    ...,
    callbacks=RuntimeEstimator(),
)

# Train!
trainer.fit()
```
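The approximation described for RuntimeEstimator, observed throughput applied to the remaining batches, can be sketched as follows. This is a simplified illustration rather than the callback's implementation; `estimate_remaining_seconds` is a hypothetical helper.

```python
def estimate_remaining_seconds(elapsed_seconds: float, batches_done: int, total_batches: int) -> float:
    """Estimate the remaining time from the throughput observed so far."""
    if batches_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need at least one completed batch and a positive elapsed time")
    batches_per_sec = batches_done / elapsed_seconds
    return (total_batches - batches_done) / batches_per_sec


# 300 of 1000 batches finished in 2 minutes
remaining = estimate_remaining_seconds(elapsed_seconds=120.0, batches_done=300, total_batches=1000)
```

The estimate is only as stable as the throughput itself, so it tends to be noisy early in a run.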
- Updated SpeedMonitor throughput metrics (#1987)
  
  Expands throughput metrics to track relative to several different time units and per device:
  - throughput/batches_per_sec and throughput/device/batches_per_sec
  - throughput/tokens_per_sec and throughput/device/tokens_per_sec
  - throughput/flops_per_sec and throughput/device/flops_per_sec
  - throughput/device/samples_per_sec
  Also adds the throughput/device/mfu metric to compute per-device MFU. Enable the SpeedMonitor callback as usual to log these new metrics. Please see the SpeedMonitor documentation for more information.
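The relationship between the global, per-device, and MFU metrics listed above can be sketched as below. This is an illustration under stated assumptions, not SpeedMonitor's code: `throughput_metrics` is a hypothetical helper, MFU is taken to be achieved per-device FLOPs divided by the device's peak FLOPs, and the peak value in the example is illustrative.

```python
def throughput_metrics(
    batches: int,
    samples: int,
    flops: float,
    elapsed_sec: float,
    world_size: int,
    peak_flops_per_device: float,
) -> dict[str, float]:
    """Global and per-device throughput over a measurement window, plus MFU."""
    metrics = {
        "throughput/batches_per_sec": batches / elapsed_sec,
        "throughput/samples_per_sec": samples / elapsed_sec,
        "throughput/flops_per_sec": flops / elapsed_sec,
    }
    # Per-device values divide the global rate evenly across ranks.
    for name in ("batches", "samples", "flops"):
        metrics[f"throughput/device/{name}_per_sec"] = metrics[f"throughput/{name}_per_sec"] / world_size
    # Assumed definition of MFU: achieved FLOPs per device over peak FLOPs per device.
    metrics["throughput/device/mfu"] = metrics["throughput/device/flops_per_sec"] / peak_flops_per_device
    return metrics


metrics = throughput_metrics(
    batches=100,
    samples=6400,
    flops=8e15,
    elapsed_sec=10.0,
    world_size=8,
    peak_flops_per_device=312e12,  # illustrative peak; use your hardware's figure
)
```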

⣿ FSDP Sharded Checkpoints (#1902)

Users can now specify the state_dict_type in the fsdp_config dictionary to enable sharded checkpoints. For example:

from composer import Trainer

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'local',
}

trainer = Trainer(
    ...,
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_filename='ba{batch}_rank{rank}.pt',
    save_interval='10ba',
)

Please see the PyTorch FSDP docs and Composer's Distributed Training notes for more information.
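With sharded checkpoints, each rank writes its own file, which is why the save_filename in the example includes a {rank} placeholder. The sketch below uses plain str.format to show the resulting names; how Composer expands the placeholders internally is not shown here, and the batch and rank values are illustrative.

```python
save_filename = "ba{batch}_rank{rank}.pt"

# One file per rank at batch 10 on a 4-GPU run
shard_files = [save_filename.format(batch=10, rank=rank) for rank in range(4)]
print(shard_files)

# Without {rank}, every rank would resolve to the same file name.
colliding = {"ba{batch}.pt".format(batch=10) for _ in range(4)}
```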

🤗 HuggingFace Improvements
- Update HuggingFaceModel class to support encoder-decoder batches without decoder_input_ids (#1950)
- Allow evaluation metrics to be passed to HuggingFaceModel directly (#1971)
- Add a utility function to load a Composer checkpoint of a HuggingFaceModel and write out the expected config.json and pytorch_model.bin in the HuggingFace pretrained folder (#1974)

🛟 Nvidia H100 Alpha Support - Added amp_fp8 data type

In preparation for H100's arrival, we've added the amp_fp8 precision type. Currently, setting amp_fp8 specifies a new precision context using transformer_engine.pytorch.fp8_autocast. For more details, please see Nvidia's new Transformer Engine and the specific fp8 recipe we utilize.
```
from composer import Trainer

trainer = Trainer(
    ...,
    precision='amp_fp8',
)
```

API changes

The torchmetrics package has been upgraded to 0.11.x.

The torchmetrics.Accuracy metric now requires a task argument which can take on a value of binary, multiclass or multilabel. Please see Torchmetrics Accuracy docs for details.

Additionally, since specifying task='multiclass' requires an additional num_classes field, we've had to update ComposerClassifier to accept the additional num_classes argument. Please see PRs #2017 and #2025 for additional details.

Surgery algorithms used in functional form return a value of None (#1543)

Deprecations

Deprecate HFCrossEntropy and Perplexity (#1857)
Remove Jenkins CI (#1943, #1954)
Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers (#1846)

Bug Fixes

Fixed an issue introduced in 0.12.1 where HuggingFaceModel crashes if config.return_dict = False (#1948)
Refactor EMA to improve memory efficiency (#1941)
Make wandb checkpoint logging compatible with wandb model registry (#1973)
Fix ICL race conditions (#1978)
Update epoch metric name to trainer/epoch (#1986)
reset scaler (#1999)
Bug/sync optimization logger across ranks (#1970)
Update Docker images to resolve vulnerability scan issues (#2007)
Fix eval duplicate logging issue (#2018)
extend test and patch bug (#2028)
Protect for missing slack_sdk import (#2031)

Known Issues

Docker Image Security Vulnerability
- CVE-2022-45907: The mosaicml/pytorch:1.12.1*, mosaicml/pytorch:1.11.0*, mosaicml/pytorch_vision:1.12.1* and mosaicml/pytorch_vision:1.11.0* images are impacted and remain supported for legacy use cases. We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.

What's Changed

Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
Change functional surgery method return values to None by @nik-mosaic in #1543
Retire Jenkins by @bandish-shah in #1943
Update MCP GHA Name by @mvpatel2000 in #1951
update memory monitor by @mvpatel2000 in #1940
Move ffcv up in test order by @dskhudia in #1953
Fix memory monitor test by @mvpatel2000 in #1957
Fix model surgery failure due to functional API change by @nik-mosaic in #1949
Change how we check for forwards args in models for HF models by @bcui19 in #1955
add return dict false test and bug fix by @dakinggg in #1948
remove jenkins ci by @mvpatel2000 in #1954
add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
Add warning for untrusted checkpoints by @mvpatel2000 in #1959
permit opt tokenizer by @bmosaicml in #1958
GHA Docker build flow for PR's by @bandish-shah in #1883
Update download badge link to pepy by @karan6181 in #1966
Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
Fix ICL race conditions by @dakinggg in #1978
add...

Contributors

eracah, vchiley, and 13 other contributors


07 Mar 03:10

bandish-shah

v0.13.0

3618c63

v0.13.0

This release has been yanked due to a minor packaging issue. Please skip directly to Composer v0.13.1.

What's Changed

Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
Change functional surgery method return values to None by @nik-mosaic in #1543
Retire Jenkins by @bandish-shah in #1943
Update MCP GHA Name by @mvpatel2000 in #1951
update memory monitor by @mvpatel2000 in #1940
Move ffcv up in test order by @dskhudia in #1953
Fix memory monitor test by @mvpatel2000 in #1957
Fix model surgery failure due to functional API change by @nik-mosaic in #1949
Change how we check for forwards args in models for HF models by @bcui19 in #1955
add return dict false test and bug fix by @dakinggg in #1948
remove jenkins ci by @mvpatel2000 in #1954
add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
Add warning for untrusted checkpoints by @mvpatel2000 in #1959
permit opt tokenizer by @bmosaicml in #1958
GHA Docker build flow for PR's by @bandish-shah in #1883
Update download badge link to pepy by @karan6181 in #1966
Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
Fix ICL race conditions by @dakinggg in #1978
add map location to huggingface utils by @dakinggg in #1980
fix log epoch by @mvpatel2000 in #1986
GHA release workflow, refactor PR and Daily workflows by @bandish-shah in #1968
Remove python-version input from Daily CPU tests by @bandish-shah in #1989
Add some logic to pass the correct github ref to mcp script by @bandish-shah in #1990
Fix typo in docstring for eval with missing space by @mvpatel2000 in #1992
Fix failing sharded_checkpoint tests that fail when pytorch 1.13 is not installed by @eracah in #1988
Add merge_group event trigger to GHA daily workflow by @bandish-shah in #1996
Runtime estimator by @mvpatel2000 in #1991
Reset scaler state by @mvpatel2000 in #1999
Speed monitor refactor by @mvpatel2000 in #1987
Test hf fsdp by @dakinggg in #1972
Bug/sync optimization logger across ranks by @bmosaicml in #1970
Fix optimizer monitor test gating with FSDP by @mvpatel2000 in #2000
Low precision groupnorm by @mvpatel2000 in #1976
Bump coverage[toml] from 7.1.0 to 7.2.1 by @dependabot in #2008
Update docs to include runtime estimator by @mvpatel2000 in #2009
Tag surgery algorithms LPLN and LPGN by @mvpatel2000 in #2011
Update SpeedMonitor short-description for docs table by @mvpatel2000 in #2010
Update Low Precision LayerNorm arguments by @nik-mosaic in #1994
Medical Segmentation Example Typo by @mvpatel2000 in #2014
Update wallclock logging to default hours by @mvpatel2000 in #2005
Add HealthChecker Callback by @hanlint in #2002
Allow FX graph mode post-training dynamic quantisation of BlurConv2d operations. by @BrettRyland in #1995
Add multi-gpu testing to test_algorithm_resumption by @eracah in #2016
Add backwards compatible checkpoint loading for EMA by @coryMosaicML in #2012
fsdp with custom process groups by @vchiley in #2006
Patch Speed Monitor MFU by @mvpatel2000 in #2013
Remove runtime estimator state dict by @mvpatel2000 in #2015
Update Docker images to fix resolve vulnerability scan issues by @bandish-shah in #2007
Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers by @eracah in #1846
Fix eval duplicate logging issue by @mvpatel2000 in #2018
Add workflow_dispatch trigger to pr-docker workflow by @bandish-shah in #2019
Bump streaming version to less than 0.4.0 by @karan6181 in #2020
Upgrade ipython installed in Docker images by @bandish-shah in #2021
Upgrade torchmetrics by @nik-mosaic in #2017
Complete upgrade of torchmetrics accuracy by @nik-mosaic in #2025
Bump version to v0.13.0 by @bandish-shah in #2024

New Contributors

@BrettRyland made their first contribution in #1995

Full Changelog: v0.12.1...v0.13.0

Contributors

eracah, vchiley, and 13 other contributors


05 Feb 09:19

bandish-shah

v0.12.1

f15b077

v0.12.1

🚀 Composer v0.12.1

Composer v0.12.1 is released! Install via pip:

pip install --upgrade mosaicml==0.12.1

New Features

📚 In-Context Learning (#1876)

With Composer and MosaicML Cloud you can now evaluate LLMs on in-context learning tasks (LAMBADA, HellaSwag, PIQA, and more) hundreds of times faster than other evaluation harnesses. Please see our "Blazingly Fast LLM Evaluation for In-Context Learning" blog post for more details!

💾 Added support for Coreweave Object Storage (#1915)

Coreweave object store is compatible with boto3. Uploading objects to Coreweave object store is almost exactly like writing to S3, except that an endpoint_url must be set via the S3_ENDPOINT_URL environment variable. For example:

import os
os.environ['S3_ENDPOINT_URL'] = 'https://object.las1.coreweave.com'

from composer.trainer import Trainer

# Save checkpoints every epoch to s3://my_bucket/checkpoints
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    save_folder='s3://my_bucket/checkpoints',
    save_interval='1ep',
    save_overwrite=True,
    save_filename='ep{epoch}.pt',
    save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
)

trainer.fit()

Please see our checkpointing documentation for more details.
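One way an S3-compatible endpoint override like this is typically wired up is to read the environment variable and forward it as the endpoint_url keyword that boto3 clients accept. The sketch below shows only that pattern; `s3_client_kwargs` is a hypothetical helper and not Composer's code.

```python
import os
from typing import Mapping, Optional


def s3_client_kwargs(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for an S3 client, honoring S3_ENDPOINT_URL if set."""
    environ = os.environ if environ is None else environ
    kwargs = {}
    endpoint_url = environ.get("S3_ENDPOINT_URL")
    if endpoint_url:
        # With boto3 installed, these could be passed as boto3.client("s3", **kwargs).
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


coreweave_kwargs = s3_client_kwargs({"S3_ENDPOINT_URL": "https://object.las1.coreweave.com"})
default_kwargs = s3_client_kwargs({})
```

When the variable is unset, no endpoint is passed and the client falls back to the default AWS S3 endpoint.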

🪵 Automatic logging of Trainer hparams (#1855)

Hyperparameter arguments passed to the Trainer are now automatically logged. Simply set the Trainer argument auto_log_hparams=True.

Bug Fixes

Update Docker images to use ‘posix_prefix’ paths (#1854)
Disable new notebook in CI (#1875)
[Fix] Enable logging of metrics from Callbacks to ConsoleLogging (#1884)
Ensure loggers run init event before callbacks in Engine (#1890)
Raise an error in FSDP meta tensor initialization if there's no initialization functions, fix associated flaky FSDP test (#1905)
Add primitive list support (#1906)
Add logic for shifting labels before computing metrics (#1913)
Fixes misspecified dependency (#1919)
pin setuptools in build requirements (#1926)
Pin pip<23 in Docker images (#1936)
Fix bug in trainer.eval and add test cases for test_console_logger (#1937)

What's Changed

Rename GradMonitor -> OptimizerMonitor; add functionality to log optimizer-specific metrics to assist loss spike investigation by @bmosaicml in #1743
Add GCS uri support for loading and saving checkpoints by @eracah in #1833
HF factory function tests by @dakinggg in #1832
Fix doc issue, Trainer hparam log_to_console defaults to False by @eracah in #1840
Removed YAHP references from Docs by @bandish-shah in #1841
Typo by @nguyenhoan1988 in #1843
Fix source code links in docs by @bandish-shah in #1844
add importorskip by @dakinggg in #1847
Update Docker images to use ‘posix_prefix’ paths by @mvpatel2000 in #1854
Fix typo by @standardAI in #1849
ConsoleLogger: log first batch and first epoch when using console_log_interval by @eracah in #1860
Simpler auto log hparams by @eracah in #1855
Fix typos by @cclauss in #1850
Bump sphinxext-opengraph from 0.7.3 to 0.7.4 by @dependabot in #1851
Bump coverage[toml] from 6.5.0 to 7.0.1 by @dependabot in #1853
Bump traitlets from 5.7.0 to 5.8.0 by @dependabot in #1852
Bump ipython from 7.32.0 to 8.8.0 by @dependabot in #1865
Update monai requirement from <0.10,>=0.9.1 to >=0.9.1,<1.2 by @dependabot in #1869
Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by @dependabot in #1868
Bump coverage[toml] from 7.0.1 to 7.0.4 by @dependabot in #1867
Upgrade docker images to torch==1.13.1 by @abhi-mosaic in #1863
add more useful info to state by @dakinggg in #1848
Feature/lambada evaluator by @bmosaicml in #1845
multi-node distributed training, submitit & composer integration demo by @YilunKuang in #1753
Daily tests by @mvpatel2000 in #1870
Disable new notebook in CI by @mvpatel2000 in #1875
Update deepspeed by @mvpatel2000 in #1864
fix fail fast in daily by @mvpatel2000 in #1880
Fix getting started docs by @mvpatel2000 in #1878
Speed up test_lm_task_evaluation by @mvpatel2000 in #1879
Fix unprotected import by @mvpatel2000 in #1874
add ignore_modules to fsdp by @vchiley in #1877
Change vision image by @mvpatel2000 in #1881
Fix eval_forward in the ComposerModel ABC by @eracah in #1871
Fix fsdp weight tying by @bcui19 in #1856
Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #1886
Bump ipykernel from 6.19.2 to 6.20.1 by @dependabot in #1887
Bump gitpython from 3.1.28 to 3.1.30 by @dependabot in #1888
Update Vision Image in Pytest by @mvpatel2000 in #1882
Streaming data tests by @dakinggg in #1842
Add NLP Algorithms Tests by @nik-mosaic in #1839
rename HF notebook by @dakinggg in #1873
Ensure loggers run init event before callbacks in Engine by @eracah in #1890
[Fix] Enable logging of metrics from Callbacks to ConsoleLogging by @eracah in #1884
Updating how we load metrics in a state_dict so we don't add extra memory overhead by @bcui19 in #1892
Getting daily tests passing by @dakinggg in #1893
Bump nbsphinx from 0.8.10 to 0.8.12 by @dependabot in #1897
Fix docker image by @mvpatel2000 in #1894
Add primitive list support by @mvpatel2000 in #1906
Raise an error in FSDP meta tensor initialization if there's no initialization functions, fix associated flaky FSDP test by @bcui19 in #1905
Gpu Test by @mvpatel2000 in #1907
Update docker with FFCV fix by @mvpatel2000 in #1908
Restore GPU tests by @mvpatel2000 in #1909
Update workflow names by @mvpatel2000 in #1910
Enable daily gpu tests by @mvpatel2000 in #1911
Tweak daily GPU tests by @mvpatel2000 in #1912
Daily GPU Tests -- Change to Git Commit by @mvpatel2000 in #1914
Add logic for shifting labels before computing metrics by @alextrott16 in #1913
Add coreweave object store support. by @eracah in #1915
Fixes mis specified dependency by @dakinggg in #1919
Bump coverage[toml] from 7.0.4 to 7.1.0 by @dependabot in #1923
Update importlib-metadata requirement from <6,>=5.0.0 to >=5.0.0,<7 by @dependabot in #1921
pin setuptools in build requirements by @dakinggg in #1926
Remove synthetic testing infrastructure for HF/NLP by @dakinggg in #1895
Add upgrade flags to pip installs by @dakinggg in #1916
Temporarily pin pip to <23 by @dakinggg in #1930
add link protection by @mvpatel2000 in #1927
Cleaning up error checking for FSDP sharding strategies with fp32 precision by @bcui19 in #1925
Fix mcp script to avoid follow by @mvpatel2000 in #1932
Emit Eval progress in console logging by @eracah in #1917
Remove Fused LayerNorm deprecation by @nik-mosaic in https://github.com/mosaicml/comp...

Contributors

nguyenhoan1988, cclauss, and 13 other contributors

Releases: mosaicml/composer
