Releases: pytorch/torchrec
v1.0.0
TorchRec 1.0.0 Stable Release Notes
We are excited to announce the release of TorchRec 1.0.0, the official stable release for TorchRec! This release is done in conjunction with the FBGEMM 1.0.0 release.
Stable Release
Core portions of the TorchRec library are now marked as stable, with the following guarantees:
- Backward compatibility guarantees, with breaking changes announced two versions ahead of time.
- Enhanced documentation for all stable features of TorchRec
- Functionality guarantees through unit test frameworks for each stable feature, running on every merged PR and release.
- No performance guarantees. However, we are committed to providing support on a best-effort basis.
Improvements
Key improvements to the reliability and UX of the library have been made as part of the stable release:
- TorchRec's documentation has been completely revamped, with an added overview of how TorchRec works, high-level architecture and concepts, and simplified API references for TorchRec stable features! Check out the new documentation here.
- The TorchRec tutorial on pytorch.org has been completely redone as a new, comprehensive end-to-end walkthrough highlighting all the stable features! Check out the new tutorial here.
- A unit test framework for API compatibility has been added under torchrec/schema to test compatibility for all TorchRec stable features.
- The unit test CI for TorchRec on GitHub has been enabled on GPU, running on nightly versions of TorchRec and manually validated at release time.
Changelog
Revamped TorchRec inference solution with torch.fx and TorchScript #2101
Faster KJT init #2369
Improvements to TorchRec Train Pipeline #2363 #2352 #2324 #2149 #2181
PT2 Dynamo and Inductor compatibility work with TorchRec and train pipeline #2108 #2125 #2130 #2141 #2151 #2152 #2162 #2176 #2178 #2228 #2310
VBE improvements #2366 #2127 #2215 #2216 #2256
Replace ShardedTensor with DTensor #2147 #2167 #2169
Enable pruning of embedding tables in TorchRec inference #2343
torch.export compatibility support with TorchRec data types and modules #2166 #2174 #2067 #2197 #2195 #2203 #2246 #2250 #1900
Added benchmarking for TorchRec modules #2139 #2145
Much more optimized KeyedTensor regroup with custom module KTRegroupAsDict, with benchmarked results #2120 #2158 #2210
Overlap comms on backwards pass #2117
OSS quality of life improvements #2273
v1.0.0-rc2
2nd Release candidate for v1.0.0
v1.0.0-rc1
Release candidate 1 for stable release
v0.8.0
New Features
In Training Embedding Pruning (ITEP) for more efficient RecSys training
Provides a representation of In-Training Embedding Pruning, which is used internally at Meta for more efficient RecSys training by decreasing the memory footprint of embedding tables. Pull request #2074 introduces the modules into TorchRec, with tests showing how to use them.
Mean Pooling
Mean pooling is now enabled on embeddings for row-wise and table-row-wise sharding types in TorchRec. Mean pooling done through TBE (table-batched embedding) alone is not accurate for these sharding types, which modify the input due to sharding. This feature efficiently calculates the divisor using caching and overlapping in the input dist to implement mean pooling, which has proven to be much more performant than out-of-library implementations. PR: #1772
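Below is a minimal, hedged sketch of opting a table into mean pooling and constraining it to row-wise sharding; the table and feature names are made up, and the import paths reflect the standard TorchRec module layout.

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig, PoolingType
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

# Table "t1" uses mean pooling; the names and sizes here are illustrative only.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1",
            embedding_dim=64,
            num_embeddings=1_000_000,
            feature_names=["f1"],
            pooling=PoolingType.MEAN,
        )
    ],
    device=torch.device("meta"),
)

# Constrain the table to row-wise sharding, where the corrected mean pooling applies.
constraints = {"t1": ParameterConstraints(sharding_types=[ShardingType.ROW_WISE.value])}
```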
Changelog
Torch.export (non-strict) compatibility with KJT/JT/KT, EBC/Quantized EBC, sharded variants #1815 #1816 #1788 #1850 #1976 and dynamic shapes #2058
torch.compile support with TorchRec #2045 #2018 #1979
TorchRec serialization with non-strict torch.export for regenerating eager sparse modules (EBC) from IR for sharding #1860 #1848 with meta functionalization when torch.exporting #1974
More benchmarking for TorchRec modules/data types #2094 #2033 #2001 #1855
More VBE support (data parallel sharding) #2093 (EmbeddingCollection) #2047 #1849
RegroupAsDict module for performance improvements with caching #2007
Train Pipeline improvements #1967 #1969 #1971
Bug Fixes and library improvements
v0.8.0-rc1
Update setup and version for release 0.8.0
v0.7.0
No major features in this release
Changelog
- Expanded ZCH/MCH support
- Increased support with Torch Dynamo/Export
- Distributed Benchmarking introduced under torchrec/distributed/benchmarks for inference and training
- VBE optimizations
- TWRW support for VBE
- Generalized train_pipeline for different pipeline stage overlapping
- Autograd support for traceable collectives
- Output dtype support for embeddings
- Dynamo tracing for sharded embedding modules
- Bug fixes
v0.7.0-rc1
Pre release for v0.7.0
v0.6.0
VBE
TorchRec now natively supports VBE (variable batched embeddings) within the EmbeddingBagCollection module. This allows variable batch size per feature, unlocking sparse input data deduplication, which can greatly speed up embedding lookup and all-to-all time. To enable, simply initialize KeyedJaggedTensor with the stride_per_key_per_rank and inverse_indices fields, which specify batch size per feature and inverse indices to reindex the embedding output, respectively.
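A minimal sketch of a variable-batch-size KeyedJaggedTensor on a single rank; the feature names, values, and the exact shape of inverse_indices below are illustrative assumptions rather than a snippet from this release.

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Single rank: "f1" has a local batch size of 2, "f2" has a local batch size of 1,
# while the global (padded) batch size is 3.
kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([1, 2, 3, 4, 5]),
    lengths=torch.tensor([2, 1, 2]),          # per-sample lengths: [2, 1] for f1, [2] for f2
    stride_per_key_per_rank=[[2], [1]],       # batch size per feature, per rank
    inverse_indices=(
        ["f1", "f2"],
        torch.tensor([[0, 1, 0], [0, 0, 0]]), # reindex pooled output back to the global batch
    ),
)
```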
Embedding offloading
Embedding offloading is UVM caching (i.e. storing embedding tables in host memory with a cache in HBM) plus prefetching and optimal sizing of the cache. Embedding offloading allows running a larger model with fewer GPUs while maintaining competitive performance. To use it, use the prefetching pipeline (PrefetchTrainPipelineSparseDist) and pass the per-table cache load factor and the prefetch_pipeline flag through constraints in the planner.
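A hedged sketch of how those planner constraints might look; the table name is made up, and the CacheParams field names (load_factor, prefetch_pipeline) and the "fused_uvm_caching" compute kernel string are assumptions that may differ across TorchRec versions.

```python
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import CacheParams

# Per-table constraint: keep "large_table" in UVM with an HBM cache sized by load_factor,
# and enable prefetching so it can be driven by PrefetchTrainPipelineSparseDist.
constraints = {
    "large_table": ParameterConstraints(
        compute_kernels=["fused_uvm_caching"],
        cache_params=CacheParams(
            load_factor=0.2,
            prefetch_pipeline=True,
        ),
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=2, compute_device="cuda"),
    constraints=constraints,
)
```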
Trec.shard/shard_modules
These APIs replace embedding submodules with their sharded variants. The shard API applies to an individual embedding module, while the shard_modules API replaces all embedding modules and won't touch other non-embedding submodules.
Embedding sharding follows behavior similar to the prior TorchRec DistributedModelParallel behavior, except that the ShardedModules have been made composable, meaning the modules are backed by TableBatchedEmbeddingSlices, which are views into the underlying TBE (including .grad). This means that fused parameters are now returned with named_parameters(), including in DistributedModelParallel.
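A minimal sketch of shard_modules usage; the import path and the device keyword are assumptions, the model is a hypothetical module containing an EmbeddingBagCollection, and a distributed process group is assumed to be initialized.

```python
import torch
import torch.nn as nn
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.distributed.shard import shard_modules

class Model(nn.Module):
    """Hypothetical model wrapping an EmbeddingBagCollection plus a dense layer."""
    def __init__(self) -> None:
        super().__init__()
        self.ebc = EmbeddingBagCollection(
            tables=[
                EmbeddingBagConfig(
                    name="t1", embedding_dim=64, num_embeddings=10_000, feature_names=["f1"]
                )
            ],
            device=torch.device("meta"),
        )
        self.dense = nn.Linear(64, 1)

# Replace every embedding module with its sharded variant; non-embedding
# submodules (self.dense) are left untouched.
sharded_model = shard_modules(module=Model(), device=torch.device("cuda"))

# Fused embedding parameters are now visible through named_parameters(),
# backed by TableBatchedEmbeddingSlices views into the underlying TBE.
for name, param in sharded_model.named_parameters():
    print(name, tuple(param.shape))
```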
v0.6.0-rc2
v0.6.0-rc2
v0.6.0-rc1
This release should support Python 3.8 - 3.11, with 3.12 experimental.
pip install torchrec --index-url https://download.pytorch.org/whl/test/cpu
pip install torchrec --index-url https://download.pytorch.org/whl/test/cu118
pip install torchrec --index-url https://download.pytorch.org/whl/test/cu121