Releases: Intel-tensorflow/tensorflow
Intel® Optimizations for TensorFlow 2.14
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.14.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.14.0, you can check the release notes of TensorFlow 2.14. This build was built from v2.14.0.
These release notes cover optimizations made in both Intel® Optimization for TensorFlow* and official TensorFlow v2.14.0, which has oneDNN optimizations enabled by default on Linux x86 packages and on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, and AMX, which are found on Intel Cascade Lake and newer CPUs.
Breaking changes:
Intel® Optimization for TensorFlow*, version 2.14, will not be supported with any additional security or other updates after March 31, 2024. Intel® Optimization for TensorFlow*, version 2.14, is provided as is. Intel recommends that users of Intel® Optimization for TensorFlow, version 2.14, uninstall it and discontinue its use beginning March 31, 2024, and install Intel® Extension for TensorFlow*, version 2.14, which provides all available optimizations. No changes to code or installation setup are needed. More information on Intel's TensorFlow extension plugin can be viewed at https://github.com/intel/intel-extension-for-tensorflow.
oneDNN v3.0 has introduced a new quantization scheme where bias is applied after dequantization. As a result, some INT8 models may have sub-optimal performance if they contain many int32 bias nodes for convolution. For such models, we recommend that users re-quantize the graph to float32 bias using Intel® Neural Compressor.
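For reference, below is a minimal, hedged sketch of such a re-quantization flow. It assumes the Intel® Neural Compressor 2.x post-training quantization interface (quantization.fit with PostTrainingQuantConfig); the model path and calibration dataloader are placeholders, and the exact API may differ between Neural Compressor versions.

```python
# Hedged sketch (not part of this release): re-quantizing an FP32 graph with
# Intel(R) Neural Compressor so the resulting INT8 graph matches the new scheme.
# Assumes the INC 2.x API; the model path and dataloader are placeholders.
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

calib_dataloader = ...                    # placeholder: supply a calibration DataLoader
conf = PostTrainingQuantConfig()          # default post-training static quantization

q_model = quantization.fit(
    model="fp32_frozen_graph.pb",         # hypothetical FP32 TensorFlow model
    conf=conf,
    calib_dataloader=calib_dataloader,    # user-supplied calibration data
)
q_model.save("int8_model")                # re-quantized INT8 model
```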
Major features:
- See TensorFlow 2.14.0 release notes
- Enabled oneDNN v3.x by default on both Linux and Windows x86 builds.
- Enabled ITT tagging by default for oneDNN primitives in Intel® VTune™ Profiler, which helps users find performance bottlenecks and provides detailed platform information such as L1/L2 cache misses or the level of AVX512 vectorization.
- Upgraded to oneDNN v3.2.1
- Supported platforms: Linux
Improvements:
- Enabled caching of the scaled bias in QuantizedMatmul with oneDNN v3.x and enabled weight caching for the MatMul op with oneDNN v3.x.
- Added oneDNN v3.x support to the following ops: FusedInstanceNorm, fused-matmul, batch-matmul, MKL CBLAS matmul, Einsum, QuantizedMatmul, QuantizedConvolution ops/fusions, and maxpooling and avgpooling fwd and bwd (with primitive cache) for FP32 and BF16. The main changes in the oneDNN v3.x quantization API are that the scale needs to be set for each tensor and the bias needs to be passed as FP32.
- Enabled reorder primitive cache and oneDNN v3.x benchmark tests.
- Enabled weight caching in oneDNN convolution ops for oneDNN v3.x
- Added 3D support to layout optimizer
- Code clean-up to avoid potential bugs such as nullptr dereferences, out-of-bounds memory accesses, etc.
- Upgraded the curl version to pick up vulnerability fixes.
- Enabled valid Eigen kernels for FusedBatchNormV3 and its gradient on CPU, along with the relevant tests.
- Updated the oneDNN fused Conv2D op signature to align with the generic fused Conv2D.
- Enabled rsync to work on Windows by converting the file path passed to rsync from a Windows-style path to a Linux-style path.
Bug fixes:
- Resolved issues in oneDNN v3.2.1.
- Fixed all issues found during static scan analyses.
- Changed the kernel-registry hash map to be allocated as a unique pointer to avoid a possible memory leak.
- Fixed the incorrect use of int for dimension sizes when invoking the oneDNN GEMM and MatMul primitives.
- Updated all occurrences of dimension sizes to the int64_t data type in all oneDNN kernel implementations.
- Fixed failing resnet50 benchmark tests with v2 by passing parameters in the correct order.
- Fixed the corner case for swish and mish op fusion by making sure that not only the input ops but also the tensors match.
- Fixed a performance issue observed when TF_ONEDNN_THREADPOOL_USE_CALLER_THREAD (which allows one task to run on the main thread) is enabled, by falling back to the original threadpool scheduling approach in that case.
- Fixed a performance issue by removing logging that had been added to execute.cc.
- Fixed mkl_eager_op_rewrite_test by updating the test to use a raw pointer.
Versions and components:
- Intel® optimized TensorFlow based on TensorFlow v2.14.0: r2.14.0_intel_release
- TensorFlow v2.14.0: v2.14.0
- oneDNN v3.2.1: v3.2.1
- Model Zoo for Intel® Architecture: Model Zoo
Intel® Optimizations for TensorFlow 2.13.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.13.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.13.0, you can check the release notes of TensorFlow 2.13. This build was built from v2.13.0.
These release notes cover optimizations made in both Intel® Optimization for TensorFlow* and official TensorFlow v2.13.0, which has oneDNN optimizations enabled by default on Linux x86 packages and on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, and AMX, which are found on Intel Cascade Lake and newer CPUs.
Breaking changes:
oneDNN ops which rely on blocked format are no longer supported and have been changed to return an error. This is in preparation for completely removing them from TensorFlow in the next release. Users are encouraged to use the corresponding Eigen ops instead. Below is the list of such ops that are no longer supported:
- Element-wise ops such as MklAddN, MklAdd, MklAddV2, MklSub, MklMul, MklMaximum, MklSquaredDifference.
- MklIdentity
- MklInputConversion
- MklLRN and MklLRNGrad
- MklReshape
- MklSlice
- MklToTf
Major features:
- See TensorFlow 2.13.0 release notes
- Enabled a reduced-precision floating-point arithmetic mode via a new environment variable, TF_SET_ONEDNN_FPMATH_MODE. This variable can be set to “BF16” to allow down-conversions from FP32 to BF16 to speed up computations without a noticeable impact on accuracy (see the example after this list).
- Enabled ITT tagging by default for oneDNN primitives on Intel® VTune™ Profiler. This helps users to identify platform bottlenecks with detailed information such as L1/L2 cache misses or level of FP vectorization at the primitive level on VTune.
- Supported platforms: Linux
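A minimal illustration of the fpmath-mode variable described in the list above; the variable name and the “BF16” value come from this release, while the small Dense-layer computation is just a placeholder:

```python
import os

# Must be set before TensorFlow is imported so oneDNN picks it up at initialization.
os.environ["TF_SET_ONEDNN_FPMATH_MODE"] = "BF16"

import tensorflow as tf  # noqa: E402

# Placeholder FP32 computation; with the variable set, oneDNN may down-convert
# eligible FP32 math to BF16 internally on supported CPUs.
x = tf.random.normal([32, 1024])
dense = tf.keras.layers.Dense(1024)
y = dense(x)
print(y.shape)
```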
Improvements:
- Parallelized UnsortedSegment op with a simpler algorithm for workload balancing. This resulted in ~1.92 - 14.46x performance speedup in microbenchmarks and ~5% throughput performance speedup in public recommendation models on CPU.
- Added support for setting maximum number of threads available to oneDNN at primitive creation time for Eigen threadpool backend. This resulted in up to ~1.5x performance speedup in convolution microbenchmarks and higher CPU utilization in hyperthreading enabled systems.
- Enabled optimized implementation for FP32 using the new Eigen LeakyRelu functor leading to ~55% performance gain on average as measured by kernel microbenchmarks.
- Added BF16 support for FusedPadConv2D op since there are many occurrences of this op in CycleGAN model. This feature resulted in ~10% performance improvement in CycleGAN with AMP (auto-mixed precision) enabled.
- Added support for fusing element-wise ops such as LeakyRelu and Mish activations with Fused-Conv2D/3D in the remapper. This resulted in ~8% performance speedup on average for models containing such a pattern.
- Changed the default initialization behavior for inter-op parallelism threads: when the caller supplies a negative value, it is reset to 1. This helped fix performance degradations with weight sharing.
- Added support for oneDNN v3.1 in the following ops: convolution (fwd + bwd), matmul, einsum, transpose, softmax, layernorm, concat, element-wise ops, pooling, quantize, dequantize, quantized-concat, requantization-range-per-channel, requantize-per-channel. oneDNN v3.1 can be conditionally compiled by passing “--config=mkl --define=build_with_onednn_v3=true --define=build_with_onednn_v2=false” flags when building TensorFlow.
- Added support for weight caching in convolution for oneDNN v3.1.
- Added support for Mul + BatchMatMulV2 + Add fusion for FP32 and BF16 in the oneDNN fused-batch-matmul op, since this pattern occurs in DistilBERT.
- Added kernel support for Instance Normalization for FP32 and BF16. This includes fusing breakdown ops along with optional Relu/LeakyRelu into a single Instance Normalization op in the graph pass.
- Added support for Quantized Maxpool3D op.
- Added support for FusedBatchNormEx fusion to TFG MLIR grappler.
- Added support for AsString + StringToHashBucketFast fusion in TFG MLIR grappler.
- Cleaned up CUDA/oneDNN warnings produced by TensorFlow when running on a machine without a GPU. This is to provide more meaningful CUDA/oneDNN warnings depending on the machine in which TensorFlow is being run.
Bug fixes:
- Resolved issues in TensorFlow 2.13.0
- Resolved issues in oneDNN v2.7 and oneDNN v3.1.
- Fixed all issues found during static scan analyses.
- Updated the function signature of oneDNN FusedConv2D to align with generic FusedConv2D. This was done to remove a workaround which was previously applied to fix a crash in oneDNN FusedConv2D.
- Added unit tests for bfloat16 AddN and Round op.
- Fixed potential NPE (null-pointer exception) in quantized ops by adding index-validity checks for min/max tensors.
- Fixed a bug in framework::function_test by adjusting the relative tolerance of the unit test.
- Fixed potential accuracy issues by moving Mean op back to AMP (auto-mixed precision) deny list.
- Fixed a failure in the mkl_fused_batch_norm_op test, caused by the use of a different GEMM API, by adding a relative error tolerance.
- Added error checking to oneDNN AvgPoolGrad kernel to avoid out-of-bounds output memory access.
- Fixed a crash in Mul + Maximum + LeakyRelu fusion for BF16 by fusing Mul + Maximum in the first remapper pass to avoid Cast -> Const conversions for LeakyRelu’s alpha parameter.
- Fixed a performance issue where large matmuls were running on a single thread by storing the dimension sizes of input and output matrices in int64_t instead of int.
- Reverted logging errors added for executor failures since it resulted in non-negligible performance drop when running some models.
Versions and components:
- Intel® optimized TensorFlow based on TensorFlow v2.13.0: r2.13.0_intel_release
- TensorFlow v2.13.0: v2.13.0
- oneDNN v2.7.3: oneDNN v2.7.3
- oneDNN v3.1: oneDNN v3.1
- Model Zoo for Intel® Architecture: Model Zoo
Known issues:
- bfloat16 is not guaranteed to work on AVX or AVX2 systems.
Intel® Optimizations for TensorFlow 2.12.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.12.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.12.0, please check the release notes of TensorFlow 2.12. This build was built from v2.12.0.
These release notes cover both Intel® Optimizations for TensorFlow* and official TensorFlow v2.12, which has oneDNN optimizations enabled by default on Linux x86 packages and on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, and AMX, which are found on Intel Cascade Lake and newer CPUs.
Major features:
- See TensorFlow 2.12.0 release notes
- Added support for Softmax forward for float32 and bfloat16 resulting in ~2-3x performance speedup on microbenchmarks measured on Intel Xeon CPUs.
- Additional performance improvements were made for bfloat16 models using AMX. New operations were added to support bfloat16 to improve performance and reduce the number of “Cast” operations.
- Made performance improvements to CPU memory allocator and Eigen threadpool’s task scheduling algorithm.
- Supported platforms: Linux
Improvements:
- Updated oneDNN version to v2.7.3
- Added support for Softmax forward for float32 and bfloat16 types. This resulted in ~3x performance speedup for float32 microbenchmarks and ~2.8x speedup for bfloat16 microbenchmarks as measured on Intel Xeon CPUs with Eigen threadpool. It also improved inference performance by 12% on some models which use Softmax.
- Updated bfloat16 auto-mixed-precision list by adding “Sum” and “Square” ops to the Infer list. This helped reduce the number of “Cast” operations around such ops and improved performance by 2x for some models.
- Added support for fusing the Gelu subgraph with MatMul and BiasAdd for float32 and bfloat16 types. This pattern is found in models such as BERT-base and BERT-large (see the example after this list).
- Added support for Conv2D + BiasAdd + Sigmoid + Mul and Conv2D + FusedBatchNorm/V2/V3 + Sigmoid + Mul fusions into FusedConv2D for float32 and bfloat16 types resulting in up to 15% performance improvement for some models.
- Increased the threshold to use default memory allocator from 4K to 256K based on internal benchmarking.
- Added an environment variable for improving Eigen threadpool’s task scheduling algorithm for cases when the number of threads is equal to the number of available CPU cores. This resulted in ~15% throughput performance improvement for float32 models and ~12% throughput performance improvement for bfloat16 models.
- Added bfloat16 registration for Eigen’s FusedBatchNormV2 on CPU to reduce the number of “Cast” operations.
- Added bfloat16 support for the following binary ops: xdivy, xlogy and xlog1py.
- Added bfloat16 support for the following 3D pooling ops in Eigen: AveragePool3D, MaxPool3D, AveragePool3DGrad and MaxPool3DGrad.
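As an illustrative (not authoritative) example of the MatMul + BiasAdd + Gelu pattern mentioned in the list above, a Keras Dense layer with a gelu activation typically lowers to exactly this subgraph, which the remapper can then fuse; the layer sizes below are arbitrary placeholders:

```python
import tensorflow as tf

# Dense(units, activation="gelu") lowers to MatMul + BiasAdd + Gelu in the graph,
# the pattern that the grappler remapper can fuse into a single fused MatMul kernel.
inputs = tf.keras.Input(shape=(768,))
outputs = tf.keras.layers.Dense(3072, activation="gelu")(inputs)
model = tf.keras.Model(inputs, outputs)

y = model(tf.random.normal([8, 768]))  # runs with oneDNN optimizations when enabled
print(y.shape)
```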
Bug fixes:
- Resolved issues in TensorFlow 2.12.0
- Resolved issues in oneDNN v2.7.3
- Issues found during static scan analyses are fixed.
- Fixed NPE (null-pointer exception error) for min/max tensors in QuantizedMatmul.
- Fixed a bug in Average-Pool-3D-Grad for empty input tensors by mimicking the same behavior as the Eigen-based implementation since this case is not natively supported by oneDNN.
- Fixed a bug in LayerNorm by adding the missing epsilon attribute.
- Fixed another bug in LayerNorm by adding Eigen threadpool interface to the oneDNN stream. This fix prevented LayerNorm from running on a single thread.
- Fixed collective_combine_all_reduce_test_cpu and collective_test_cpu Python unit test failures on tensorflow:devel docker container due to incompatible Numpy versions.
- Fixed a bug in the initialization of destination memory in MatMul primitive.
- Fixed a bug in fused batch-matmul op by adding missing epsilon and leaky-relu alpha attributes.
Versions and components:
- Intel optimized TensorFlow based on TensorFlow v2.12.0: r2.12.0_intel_release
- TensorFlow v2.12.0: v2.12.0
- oneDNN v2.7.3: v2.7.3
- Model Zoo for Intel® Architecture: Model Zoo
Known issues:
- Bfloat16 is not guaranteed to work on AVX or AVX2 systems.
- There is a known issue of low accuracy for 3DUnet mlperf bfloat16 inference, and the issue has been fixed post TF2.12 release. For a workaround in TF2.12, please add the following environment variables to run the bfloat16 inference case for this model (see the example after this list):
  - TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_INFERLIST_REMOVE=Mean
  - TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD=Mean
- Intel optimized TensorFlow is no longer supported on Windows:
  - To run TensorFlow on Windows, use [official TensorFlow v2.12](https://pypi.org/project/tensorflow/2.12.0/) and set the environment variable TF_ENABLE_ONEDNN_OPTS to 1 (i.e., “set TF_ENABLE_ONEDNN_OPTS=1”). Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
  - Use the initialization script from the following link to get the best performance on Windows: https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat
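Referring to the 3DUnet workaround above, a minimal sketch of setting those variables from Python before TensorFlow is imported; the model-loading step is a placeholder:

```python
import os

# Workaround for low 3DUnet mlperf bfloat16 inference accuracy in TF 2.12:
# keep Mean in FP32 by removing it from the AMP infer list and adding it to the deny list.
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_INFERLIST_REMOVE"] = "Mean"
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD"] = "Mean"

import tensorflow as tf  # noqa: E402

# Placeholder: load and run the bfloat16 3DUnet model as usual from here on.
```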
Intel® Optimizations for TensorFlow 2.11.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.11.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.11.0, please see the TensorFlow 2.11 release notes also. This build was built from v2.11.0.
This release note covers both Intel® Optimizations for TensorFlow* and official TensorFlow v2.11 which has oneDNN optimizations enabled by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.
Major features:
· Please see the TensorFlow 2.11.0 release notes
· Further performance improvements for bfloat16 models with AMX optimizations; more operations are now supported with the BF16 datatype.
· Added a new set of APIs for INT8, which will improve performance.
· Supported platforms: Linux.
Improvements:
· Updated oneDNN to version v2.7.1
· Added MLIR support for Contraction + BiasAdd fusion
· Added a new set of APIs for quantized convolution ops/fusions to consolidate many existing convolution ops/fusions into a few. With the new ops API, a single op covers several INT8 fusions.
· Fused the Mul-Max pattern into LeakyRelu, which improved performance by 9% on various GAN models.
· Enabled BF16 support for Conv3DBackpropFilterV2 for performance improvement
· Enabled fp32 & bf16 Einsum for CPU by default for performance improvement
· Enhanced performance by ~15-20% for several models, including EfficientDet/EfficientNet, ICNet, and more, by updating the AMP MKL lists
· Enabled user mode scratchpad for inner-product (FusedMatMul & quantized MatMul) for better memory usage control
· Unblocked the MatMul + Add (bias) fusion and added a test case
Bug fixes:
· Issues resolved in TensorFlow 2.11.0
· Issues resolved in oneDNN 2.7.1
· Static scan analysis findings are all fixed.
· Fixed a floating-point issue in AvgPool
· Fixed a floating-point issue in AvgPool3D
· Fixed a memory-corruption issue in AvgPool3D when oneDNN is enabled
· Fixed an integer divide-by-zero during fused convolution with oneDNN on CPUs supporting AVX512 instructions
· Fixed a potential primitive-cache key collision that could appear in rare cases where a model has FusedConv2D/3D nodes with exactly the same dimensions and parameters, differing only in the fused activation function
· Fixed a _FusedConv2D crash when oneDNN is enabled
· Fixed LeakyRelu handling in the grappler remapper fusion (Pad + Conv3D + BiasAdd + Activation)
· Fixed a unit test failure in //tensorflow/python/grappler:remapper_test
· Fixed a build failure by adding a patch to the OpenMP build
· Fixed a memory-corruption issue with the oneDNN primitive cache
Versions and components
• Intel optimized TensorFlow based on TensorFlow v2.11.0: https://github.com/Intel-tensorflow/tensorflow/tree/r2.11.0_intel_release
• TensorFlow v2.11.0: https://github.com/tensorflow/tensorflow/tree/v2.11.0
• oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.7.1
• Model Zoo: https://github.com/IntelAI/models
Known issues
Bfloat16 is not guaranteed to work on AVX or AVX2
In Windows OS, to use oneDNN enabled TensorFlow, users need to run “set TF_ENABLE_ONEDNN_OPTS=1”. Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
Use the initialization script from the following link to get the best performance on Windows: https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat
Intel® Optimizations for TensorFlow 2.10.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.10.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.10.0, please see the TensorFlow 2.10 release notes also. This build was built from v2.10.0
This release note covers both Intel® Optimizations for TensorFlow* and official TensorFlow v2.10 which has oneDNN optimizations enabled by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.
Major features:
· Please see the TensorFlow 2.10.0 release notes
· Further performance improvements for bfloat16 models with AMX optimizations; more operations are now supported with the BF16 datatype.
· Supported platforms: Linux.
Improvements:
· Updated oneDNN to version v2.6.1
· Improved MatMul performance on 2-socket Broadwell systems.
· Improved performance on SSD-MobileNet 300x300 inference.
· Made a generic fix in the BFloat16 pattern matcher, a performance enhancement for GAN models
· Added BFloat16 support for the div and log ops.
· Enabled user mode scratchpad for inner-product (FusedMatMul & quantized MatMul) for better memory usage control
· Renamed the “grappler flag” config auto_mixed_precision_mkl (API change)
· Performance improvement for ResNet50 eager execution (1S)
Bug fixes:
· Issues resolved in TensorFlow 2.10.0
· Issues resolved in oneDNN 2.6
· Static scan analysis findings are all fixed.
· Fixed conv3d_backprop_filter_v2_grad_test_cpu issue
· Fixed an mkl_fused_ops_test failure and disabled blocked format
· Fixed a failure in //tensorflow/python/kernel_tests/nn_ops:conv_ops_test_cpu exposed after adding a security vulnerability test for the raw_ops.Conv2DBackpropInput function
· Fixed shape inference for the INT8 convolutions test
· Fixed a pooling_ops_3d_test unit test failure
· Fixed a “ValueError: operands could not be broadcast together with shapes (0,) (96,)” bug in optimize-for-inference
· Fixed two major bugs in //tensorflow/python:quantized_ops_test and //tensorflow/python:dequantized_ops_test
· Fixed a segmentation fault in tf.matmul and tf.einsum with batched input tensors when using intel-tensorflow-avx512, by adding the primitive name to the MKL primitive cache key to avoid collisions in the cache
· Fixed a unit test failure in quantization_ops:quantization_ops_test
· Fixed a memory-corruption issue by disabling the oneDNN primitive cache
Versions and components
• Intel optimized TensorFlow based on TensorFlow v2.10.0: https://github.com/Intel-tensorflow/tensorflow/tree/v2.10.0
• TensorFlow v2.10.0: https://github.com/tensorflow/tensorflow/tree/v2.10.0
• oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.6.1
• Model Zoo: https://github.com/IntelAI/models
Known issues
- Bfloat16 is not guaranteed to work on AVX or AVX2
- In Windows OS, to use oneDNN enabled TensorFlow, users need to run “set TF_ENABLE_ONEDNN_OPTS=1”. Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
- Use the initialization script from the following link to get the best performance on Windows: https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat
Intel® Optimizations for TensorFlow 2.9.1
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.9.1 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.9.1, please see the TensorFlow 2.9.1 release notes also. This build was built from v2.9.1.
This release note covers both Intel® Optimizations for TensorFlow* and official TensorFlow v2.9.1 which has oneDNN optimizations enabled by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.
Major features:
• Please see the TensorFlow 2.9.1 release notes
• The environment variable TF_ENABLE_ONEDNN_OPTS no longer needs to be set to “1” to turn on oneDNN optimizations on Intel Cascade Lake and newer CPUs on Linux.
• Further performance improvements for bfloat16 models with AMX optimizations; more operations are now supported with the BF16 datatype.
• Supported platforms: Linux, Windows 10, and Windows 11.
Improvements:
• Updated oneDNN to version v2.6
• Performance enhancement for models like 3D-UNet and Yolo-V4 with addition of 3D Convolution-Add fusion and Mish (Softplus-Tanh-Mul) fusions
• Improved performance on MatMul operations with smaller shapes like (50x50).
• Enabled user mode scratchpad for inner-product (FusedMatMul & quantized MatMul) for better memory usage control
• Performance enhancement for SSD-ResNet34 by eliminating unnecessary data copying in the per-class NMS computation, which reduces memory usage and improves performance
• Throughput improvement for recommendation models by parallelizing UnSortedSegmentOp
• Added auto_mixed_precision_mkl as an optimizer option that can be enabled for saved_model inference in eager mode (see the example after this list)
• Improved saved_model inference performance by removing an eager check in the remapper for oneDNN-specific optimizations.
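One possible way to enable the optimizer option mentioned in the list above from eager mode is sketched below; it assumes the option is exposed through tf.config.optimizer.set_experimental_options, so treat it as an illustrative sketch rather than the definitive interface:

```python
import tensorflow as tf

# Assumed sketch: enable the oneDNN auto-mixed-precision grappler pass globally so
# that graphs (including loaded saved_models) get the bfloat16 rewrite at runtime.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision_mkl": True})

# Hypothetical usage with a saved_model ("my_saved_model" is a placeholder path):
# loaded = tf.saved_model.load("my_saved_model")
# outputs = loaded.signatures["serving_default"](input_tensor)
```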
Bug fixes:
• Issues resolved in TensorFlow 2.9.1
• Issues resolved in oneDNN 2.6
• Static scan analysis findings are all fixed.
• Fixed failure related to transformer-mlperf model training with BF16 datatype
• Fixed a failure in //tensorflow/python/framework:node_file_writer_test exposed after eager_op_as_function feature was enabled by default
• Fixed gruv2_test_gpu and layer_correctness_test_gpu tests
Versions and components
• Intel optimized TensorFlow based on TensorFlow v2.9.1: https://github.com/Intel-tensorflow/tensorflow/tree/v2.9.1
• TensorFlow v2.9.1: https://github.com/tensorflow/tensorflow/tree/v2.9.1
• oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.6
• Model Zoo: https://github.com/IntelAI/models
Known issues
• Open issues: open issues for oneDNN optimizations
• Bfloat16 is not guaranteed to work on AVX or AVX2
• The conv3d_backprop_filter_v2_grad_test_cpu and Mkl_fused_op_test unit tests fail; this will be fixed in the next release.
• In Windows OS, to use oneDNN enabled TensorFlow, users need to run “set TF_ENABLE_ONEDNN_OPTS=1”. Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
• Use the initialization script from the following link to get the best performance on Windows: https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat
Intel® Optimizations for TensorFlow 2.8.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.8.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.8.0, please see the TensorFlow 2.8.0 release notes also. This build was built from v2.8.0.
This release note covers both Intel® Optimizations for TensorFlow* and official TensorFlow v2.8.0 with oneDNN enabled (via setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1).
Major features
• Please see the TensorFlow 2.8.0 release notes
• Performance improvements for Bfloat16 models with AMX optimizations.
• Enabled support for 12th Gen Intel(R) Core (TM) (code named Alder Lake) platform.
• No longer supports oneDNN block format, i.e., setting TF_ENABLE_MKL_NATIVE_FORMAT=0 will not enable blocked format.
• To enable AMX optimization, you no longer need DNNL_MAX_CPU_ISA = AVX512_CORE_AMX.
• Supported platforms: Linux, Windows 10, and Windows 11.
Improvements
• Updated oneDNN to version 2.5.1
• oneDNN namespace changed from “mkldnn” to “dnnl” and cleaned up source code to remove unnecessary header files, dangling methods and/or data members which were part of older MKL-DNN support
• Improved _FusedMatMul operation, which enhances the performance of models like BERT
• Added LayerNormalization ops fusion and BatchMatMul – Mul – AddV2 fusion to improve performance of Transformer based language models
• Improved performance of EfficientNet and EfficientDet models with the addition of a swish (Sigmoid – Mul) fusion
• Removed unnecessary transpose elimination to enhance performance for 3DUnet model
Bug fixes
• Issues resolved in TensorFlow 2.8
• Issues resolved in oneDNN 2.5.1
• Fixed undefined behavior for cases where a different number of threads is used at primitive creation and execution
• Static scan analysis findings are all fixed.
• Fixed a bug with _FusedConv3D op registration
• Fixed run_eager_op_as_function_test and nn_fused_batchnorm_deterministic test failures
• Transformer-LT performance degradation is fixed
• Wide-and-deep INT8 performance degradation is fixed.
Versions and components
• Intel optimized TensorFlow based on TensorFlow v2.8.0: https://github.com/Intel-tensorflow/tensorflow/tree/v2.8.0
• TensorFlow v2.8.0: https://github.com/tensorflow/tensorflow/tree/v2.8.0
• oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.5.1
• Model Zoo: https://github.com/IntelAI/models
Known issues
• Open issues: open issues for oneDNN optimizations
• Bfloat16 is not guaranteed to work on AVX or AVX2
• The Mkl_fused_op_test unit test fails; this will be fixed in the next release.
• In Windows OS, to use oneDNN enabled TensorFlow, users need to run “set TF_ENABLE_ONEDNN_OPTS=1”. Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
Intel® Optimizations for TensorFlow 2.7.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.7.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.7.0, please see the TensorFlow 2.7.0 release notes also. This build was built from v2.7.0.
This release note covers both Intel® Optimizations for TensorFlow* and official TensorFlow v2.7.0 with oneDNN enabled (via setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1).
Major features:
• Please see the TensorFlow 2.7.0 release notes
• Supported platforms: Linux and Windows 10.
Improvements:
• Updated oneDNN to version 2.4.1
• Improved Bfloat16 performance for element-wise Eigen operations
• Improved the performance for TensorFlow Saved Models
• Added additional fusions, e.g. matmul-biasadd-gelu
• Marked nodeDef with '_kernel' attribute as NameChange label before inferring the device and inside WrapInCallOp
• Enabled simple heuristic-based tuning for innerproduct primitive
• Disabled rewriting conv_grad ops to MKL ops when explicit padding is used
• Added sanity check for the corner case of "zero element of filter" in mkl_conv_ops.cc
• Enhanced PluggableDevice support
- Added DEVICE_DEFAULT for python ops
- Enabled the TensorList with DEVICE_DEFAULT including memzero, memset and memset32
- Fixed a While-op segmentation fault on PluggableDevice
- Added DEVICE_DEFAULT for collective/bcast ops
- Added DEVICE_DEFAULT for session/transpose ops
Bug fixes:
• Issues resolved in TensorFlow 2.7
• Issues resolved in oneDNN 2.4.1
• Updated curl to 7.79.1 to handle CVE-2021-22947, CVE-2021-22946, CVE-2021-22945
• Static scan analysis findings are all fixed.
• Fixed a bug inside pattern matcher for grappler due to not considering the nodes_to_preserve in the remapper use case
• Fixed tensorflow/python/framework/node_file_writer_test failure caused by op rewrite with different op name
• Fixed XByak-induced crashes on non-Intel systems
• Fixed missing-device unit test failures
Versions and components:
• Intel optimized TensorFlow based on TensorFlow v2.7.0: https://github.com/Intel-tensorflow/tensorflow/tree/v2.7.0
• TensorFlow v2.7.0: https://github.com/tensorflow/tensorflow/tree/v2.7.0
• oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.4.1
• Model Zoo: https://github.com/IntelAI/models
Known issues
• Open issues: open issues for oneDNN optimizations
• Bfloat16 is not guaranteed to work on AVX or AVX2
• Transformer-LT model can see performance degradation of up to 18% as compared to Intel TensorFlow v2.6.0
• Wide-and-Deep model can have performance degradation of up to 14% as compared to Intel TensorFlow v2.6.0
• In Windows OS, to use oneDNN enabled TensorFlow, users need to run “set TF_ENABLE_ONEDNN_OPTS=1”. Also, if the PC has hyperthreading enabled, users need to bind the ML application to one logical core per CPU in order to get the best runtime performance.
Intel® Optimizations for TensorFlow 2.6.0
This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.6.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For features and fixes that were introduced in TensorFlow 2.6.0, please see the TensorFlow 2.6.0 release notes also. This build was built from v2.6.0.
Major features:
- Native format is enabled for all data types.
- Single binary with runtime environment variable (TF_ENABLE_ONEDNN_OPTS=1) is enabled for all data types.
- Enabled Windows OpenMP support for Intel-oneDNN to improve performance on CPUs
Improvements:
- Native format support is extended to the following:
- Int8 data type is enabled with native format.
- Added support for Conv2DBackpropFilterWithBias Fusion in native format.
- Enabled quantizedConcatV2 with native format.
- Enabled dequantize op
- Enabled quantized pooling ops
- Enabled quantized Conv ops
- Upgraded oneDNN to v2.3_rc2
- FusedMatMul and Sigmoid are enabled for CPU.
- Updated the oneDNN auto_mixed_precision_lists to allow more ops in bfloat16. This significantly reduces the number of Cast ops in models running bf16 inference with auto_mixed_precision and improves broad model performance.
- Enhanced pattern matcher for grappler graph optimization
- Build issues on Mac are fixed for CPU optimizations
- Removed the static INTEL_MKL macro and changed to IsMKLEnabled.
- Added a check for dtype, as MklMatMul supports bfloat16 and float32, whereas the default type is float64.
Bug fixes:
- Issues resolved in TensorFlow 2.6
- Updated curl to 7.78.0 to handle CVE-2021-22922, CVE-2021-22923, CVE-2021-22924, CVE-2021-22925, CVE-2021-22926
- Issues resolved in oneDNN 2.3-rc2
- Static scan analysis findings are all fixed.
Versions and components
- Intel optimized TensorFlow based on TensorFlow v2.6.0: https://github.com/Intel-tensorflow/tensorflow/tree/v2.6.0
- TensorFlow v2.6.0: https://github.com/tensorflow/tensorflow/tree/v2.6.0
- oneDNN: https://github.com/oneapi-src/oneDNN/releases/tag/v2.3_rc2
- Model Zoo: https://github.com/IntelAI/models
Known issues
- Open issues: open issues for oneDNN optimizations
- Bfloat16 is not guaranteed to work on AVX or AVX2.
Intel® Optimizations for TensorFlow* 1.15 UP3 Maintenance Release
This maintenance release of Intel® Optimizations for TensorFlow* 1.15 UP3 Release is based on the TensorFlow v1.15.0up3 tag (https://github.com/Intel-tensorflow/tensorflow.git) as built with support for oneAPI Deep Neural Network Library (oneDNN v2.2.4). This revision contains the following features and fixes:
New functionality and usability improvements:
• Support oneDNN version 2.2.4 and integration work with TensorFlow.
• Add Conv2D + BiasAdd + Relu/LeakyRelu + Add INT8 kernel.
• Fused sigmoid + mul into swish.
• Support quantized s8 pooling.
• Support the MKL runtime disable.
Bug fixes:
• Fixed a unit test bug by removing the libtensorflow_framework.so dependency on oneDNN.
• Fixed shape inference for QuantizedConv2D-like operations.
Additional security and performance patches:
• Removed the aws-crt-cpp and cJSON dependencies.
• The components below are updated to their newest versions:
• libjpeg-turbo: 2.1.0
• org_sqlite: 3350500
• curl: 7.77.0
Known issues:
• AWS support is temporarily removed to fix security issues caused by aws-crt-cpp and cJSON. The AWS S3 file system is unavailable in v1.15.0up3; please use v1.15.0up2 if you need this feature.
• INT8 Conv with unsigned int8 input in oneDNN v2.2.4 may produce wrong results on servers without VNNI hardware capability. Considering functionality and performance, we strongly suggest using this operation only on servers with VNNI (CLX, ICX, and future Xeon).
Best known methods:
• Gelu API:
If the model uses a gelu op, we suggest using the new API tf.nn.gelu instead of composing it from small operations in Python model code. An example is linked below, and a short sketch follows this item.
https://github.com/IntelAI/models/blob/master/models/language_modeling/tensorflow/bert_large/inference/generic_ops.py#L88-L106
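A short, hedged sketch contrasting a hand-written gelu composition (the standard exact-gelu formula, shown only for comparison) with the single tf.nn.gelu call recommended above:

```python
import tensorflow as tf

x = tf.random.normal([4, 128])

# Before: gelu composed from several small ops in model code (many small kernels).
gelu_manual = 0.5 * x * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))

# After: the single op recommended above, which oneDNN can execute efficiently.
gelu_fused = tf.nn.gelu(x)
```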
• Freeze graph
Freezing the graph is an important step to improve inference performance, but the steps vary from model to model. A freeze-graph script for the BERT base inference classifier is provided as a reference: https://github.com/IntelAI/models/blob/master/models/language_modeling/tensorflow/bert_large/inference/export_classifier.py
• MKL runtime disable
Set the environment variable TF_DISABLE_MKL=1 (i.e., “export TF_DISABLE_MKL=1”) to switch from the oneDNN backend to the Eigen backend at runtime. Rebuilding v1.15.0up3 with the extra bazel options below gives the complete experience of this feature:
bazel build --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 --config=opt --copt=-O3 --copt=-Wformat --copt=-Wformat-security --copt=-fstack-protector --copt=-fPIC --copt=-fpic --linkopt=-znoexecstack --linkopt=-zrelro --linkopt=-znow --linkopt=-fstack-protector --config=mkl --copt=-march=native --define=tensorflow_mkldnn_contraction_kernel=1 //tensorflow/tools/pip_package:build_pip_package