Commit

Merge remote-tracking branch 'origin/main' into snnn/update_dep
snnn committed Jul 18, 2024
2 parents a1053ca + 9140d9b commit 2199f26
Showing 37 changed files with 326 additions and 444 deletions.
402 changes: 244 additions & 158 deletions onnxruntime/core/mlas/lib/sqnbitgemm_kernel_neon_int8.cpp

Large diffs are not rendered by default.

6 changes: 2 additions & 4 deletions onnxruntime/python/tools/tensorrt/perf/build/build_image.py
@@ -15,12 +15,10 @@
from typing import List, Optional

TRT_DOCKER_FILES = {
"8.4.cuda_11_6_cudnn_8": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_6_tensorrt8_4",
"8.5.cuda_11_8_cudnn_8": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt8_5",
"8.6.cuda_11_8_cudnn_8": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt8_6",
"8.6.cuda_12_3_cudnn_9": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda12_3_tensorrt8_6",
"10.0.cuda_11_8_cudnn_8": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt10_0",
"10.0.cuda_12_4_cudnn_9": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda12_4_tensorrt10_0",
"10.2.cuda_11_8_cudnn_8": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_tensorrt10",
"10.2.cuda_12_5_cudnn_9": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda12_tensorrt10",
"BIN": "tools/ci_build/github/linux/docker/Dockerfile.ubuntu_tensorrt_bin",
}

26 changes: 0 additions & 26 deletions onnxruntime/python/tools/transformers/models/llama/README.md
@@ -27,8 +27,6 @@ Please note the package versions needed for using LLaMA-2 in the `requirements.t
- Note that `torch` with CUDA enabled is not installed automatically. This is because `torch` should be installed with the CUDA version used on your machine. Please visit [the PyTorch website](https://pytorch.org/get-started/locally/) to download the `torch` version that is used with the CUDA version installed on your machine and satisfies the requirement listed in the file.
- `requirements-quant.txt`
- For running the SmoothQuant algorithm using [Intel's Neural Compressor](https://github.com/intel/neural-compressor)
- `requirements-70b-model.txt`
- For running the LLaMA-2 70B model on multiple GPUs
- `requirements.txt`
- Package versions needed in each of the above files

@@ -221,18 +219,6 @@ $ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output l
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int4-cpu --precision int4 --quantization_method blockwise --execution_provider cpu --use_gqa
```

Export LLaMA-2 70B sharded model into 4 partitions
```
# From source:
# 1. Install necessary packages from requirements-70b-model.txt
$ pip install -r requirements-70b-model.txt
# 2. Build ONNX Runtime from source with NCCL enabled. Here is a sample command:
$ ./build.sh --config Release --use_cuda --cuda_home /usr/local/cuda-12.2 --cudnn_home /usr/local/cuda-12.2 --build_wheel --cuda_version=12.2 --parallel --skip_tests --enable_nccl --nccl_home /usr/local/cuda-12.2 --use_mpi --mpi_home=/usr/lib/x86_64-linux-gnu/
# 3. Shard and export the LLaMA-2 70B model. With FP16, you will need at least 140GB of GPU memory to load the model. Therefore, you will need at least 4 40GB A100 GPUs or 2 80GB A100 GPUs to shard the PyTorch model and export each shard to ONNX. Here is an example command:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 bash convert_70b_model.sh 4 -m meta-llama/Llama-2-70b-hf --output llama2-70b-distributed --precision fp16 --execution_provider cuda --use_gqa
```

## Parity Checking LLaMA-2

@@ -395,18 +381,6 @@ CUDA_VISIBLE_DEVICES=4 python3 -m models.llama.benchmark \
--device cuda
```

9. ONNX Runtime, FP16, convert_to_onnx, LLaMA-2 70B shard to 4 GPUs
```
CUDA_VISIBLE_DEVICES=4,5,6,7 bash benchmark_70b_model.sh 4 \
--benchmark-type ort-convert-to-onnx \
--ort-model-path ./llama2-70b-dis/rank_{}_Llama-2-70b-hf_decoder_merged_model_fp16.onnx \
--model-name meta-llama/Llama-2-70b-hf \
--cache-dir ./model_cache \
--precision fp16 \
--device cuda \
--warmup-runs 5 \
--num-runs 100
```

You can profile a variant by adding the `--profile` flag and providing one batch size and sequence length combination.


This file was deleted.

This file was deleted.

This file was deleted.

12 changes: 8 additions & 4 deletions onnxruntime/test/providers/cpu/tensor/isinf_test.cc
@@ -18,13 +18,17 @@ constexpr double DOUBLE_NINF = -std::numeric_limits<double>::infinity();
constexpr double DOUBLE_NAN = std::numeric_limits<double>::quiet_NaN();

template <typename T>
void run_is_inf_test(int opset, int64_t detect_positive, int64_t detect_negative, const std::initializer_list<T>& input, const std::initializer_list<bool>& output) {
void run_is_inf_test(int opset, int64_t detect_positive, int64_t detect_negative, const std::initializer_list<T>& input, const std::initializer_list<bool>& output, bool skip_trt = false) {
OpTester test("IsInf", opset);
test.AddAttribute<int64_t>("detect_positive", detect_positive);
test.AddAttribute<int64_t>("detect_negative", detect_negative);
test.AddInput<T>("X", {onnxruntime::narrow<int64_t>(input.size())}, input);
test.AddOutput<bool>("Y", {onnxruntime::narrow<int64_t>(output.size())}, output);
test.Run();
if (skip_trt) {
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
} else {
test.Run();
}
}

TEST(IsInfTest, test_isinf_float10) {
@@ -124,7 +128,7 @@ TEST(IsInfTest, test_isinf_bfloat16) {
std::initializer_list<BFloat16> input = {BFloat16{-1.7f}, BFloat16::NaN, BFloat16::Infinity, 3.6_bfp16,
BFloat16::NegativeInfinity, BFloat16::Infinity};
std::initializer_list<bool> output = {false, false, true, false, true, true};
run_is_inf_test(20, 1, 1, input, output);
run_is_inf_test(20, 1, 1, input, output, true); // Skip as TRT10 supports BF16 but T4 GPU run on TRT CIs doesn't
}

TEST(IsInfTest, test_isinf_positive_bfloat16) {
@@ -146,7 +150,7 @@ TEST(IsInfTest, test_Float8E4M3FN) {
std::initializer_list<Float8E4M3FN> input = {
Float8E4M3FN(-1.0f), Float8E4M3FN(FLOAT_NAN, false), Float8E4M3FN(1.0f), Float8E4M3FN(FLOAT_NINF, false), Float8E4M3FN(FLOAT_NINF, false), Float8E4M3FN(FLOAT_INF, false)};
std::initializer_list<bool> output = {false, false, false, false, false, false};
run_is_inf_test(20, 1, 1, input, output);
run_is_inf_test(20, 1, 1, input, output, true); // Skip as TRT10.1 supports Float8 but T4 GPU run on TRT CIs doesn't
}

TEST(IsInfTest, test_Float8E4M3FNUZ) {
Original file line number Diff line number Diff line change
@@ -43,7 +43,7 @@ variables:
- name: docker_base_image
value: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda11_x64_almalinux8_gcc11:20240531.1
- name: linux_trt_version
value: 10.0.1.6-1.cuda11.8
value: 10.2.0.19-1.cuda11.8
- name: Repository
value: 'onnxruntimecuda11manylinuxbuild'

Original file line number Diff line number Diff line change
@@ -83,7 +83,7 @@ variables:
value: 11.8

- name: win_trt_home
value: $(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8
value: $(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8
- name: win_cuda_home
value: $(Agent.TempDirectory)\v11.8

Original file line number Diff line number Diff line change
@@ -68,9 +68,9 @@ variables:
value: nvidia/cuda:12.2.2-cudnn8-devel-ubi8
- name: win_trt_home
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: $(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8
value: $(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8
${{ if eq(parameters.CudaVersion, '12.2') }}:
value: $(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-12.4
value: $(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-12.5
- name: win_cuda_home
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: $(Agent.TempDirectory)\v11.8
Original file line number Diff line number Diff line change
@@ -43,9 +43,9 @@ variables:
value: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda12_x64_ubi8_gcc12:20240610.1
- name: linux_trt_version
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: 10.0.1.6-1.cuda11.8
value: 10.2.0.19-1.cuda11.8
${{ if eq(parameters.CudaVersion, '12.2') }}:
value: 10.0.1.6-1.cuda12.4
value: 10.2.0.19-1.cuda12.5

jobs:
- job: Linux_Build
Original file line number Diff line number Diff line change
@@ -61,7 +61,7 @@ stages:
${{ if eq(parameters.CudaVersion, '12.2') }}:
DockerBuildArgs: "
--build-arg BASEIMAGE=nvidia/cuda:12.2.2-devel-ubuntu20.04
--build-arg TRT_VERSION=10.0.1.6-1+cuda12.4
--build-arg TRT_VERSION=10.2.0.19-1+cuda12.5
--build-arg BUILD_UID=$( id -u )
"
${{ else }}:
2 changes: 1 addition & 1 deletion tools/ci_build/github/azure-pipelines/post-merge-jobs.yml
@@ -226,7 +226,7 @@ stages:
BuildConfig: 'RelWithDebInfo'
EnvSetupScript: setup_env_trt.bat
buildArch: x64
additionalBuildFlags: --enable_pybind --build_java --build_nodejs --use_cuda --cuda_home="$(Agent.TempDirectory)\v11.8" --enable_cuda_profiling --use_tensorrt --tensorrt_home="$(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8" --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86
additionalBuildFlags: --enable_pybind --build_java --build_nodejs --use_cuda --cuda_home="$(Agent.TempDirectory)\v11.8" --enable_cuda_profiling --use_tensorrt --tensorrt_home="$(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8" --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86
msbuildPlatform: x64
isX86: false
job_name_suffix: x64_RelWithDebInfo
Original file line number Diff line number Diff line change
@@ -55,7 +55,7 @@ stages:
python_wheel_suffix: '_gpu'
timeout: 480
docker_base_image: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda11_x64_almalinux8_gcc11:20240531.1
trt_version: '10.0.1.6-1.cuda11.8'
trt_version: '10.2.0.19-1.cuda11.8'
cuda_version: '11.8'


Original file line number Diff line number Diff line change
@@ -49,9 +49,9 @@ jobs:
value: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda12_x64_ubi8_gcc12:20240610.1
- name: linux_trt_version
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: 10.0.1.6-1.cuda11.8
value: 10.2.0.19-1.cuda11.8
${{ if eq(parameters.CudaVersion, '12.2') }}:
value: 10.0.1.6-1.cuda12.4
value: 10.2.0.19-1.cuda12.5
pool: ${{ parameters.machine_pool }}
steps:
- checkout: self
Original file line number Diff line number Diff line change
@@ -80,9 +80,9 @@ stages:

- name: linux_trt_version
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: 10.0.1.6-1.cuda11.8
value: 10.2.0.19-1.cuda11.8
${{ if eq(parameters.CudaVersion, '12.2') }}:
value: 10.0.1.6-1.cuda12.4
value: 10.2.0.19-1.cuda12.5
steps:
- checkout: self
clean: true
@@ -149,9 +149,9 @@ stages:
value: '12'
- name: linux_trt_version
${{ if eq(parameters.CudaVersion, '11.8') }}:
value: 10.0.1.6-1.cuda11.8
value: 10.2.0.19-1.cuda11.8
${{ if eq(parameters.CudaVersion, '12.2') }}:
value: 10.0.1.6-1.cuda12.4
value: 10.2.0.19-1.cuda12.5
steps:
- checkout: self # due to checkout multiple repos, the root directory is $(Build.SourcesDirectory)/onnxruntime
submodules: false
Original file line number Diff line number Diff line change
@@ -65,9 +65,9 @@ stages:
SpecificArtifact: ${{ parameters.SpecificArtifact }}
BuildId: ${{ parameters.BuildId }}
${{ if eq(parameters.cuda_version, '11.8') }}:
EP_BUILD_FLAGS: --enable_lto --use_tensorrt --tensorrt_home=$(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8 --cuda_home=$(Agent.TempDirectory)\v11.8 --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80"
EP_BUILD_FLAGS: --enable_lto --use_tensorrt --tensorrt_home=$(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8 --cuda_home=$(Agent.TempDirectory)\v11.8 --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80"
${{ if eq(parameters.cuda_version, '12.2') }}:
EP_BUILD_FLAGS: --enable_lto --use_tensorrt --tensorrt_home=$(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-12.4 --cuda_home=$(Agent.TempDirectory)\v12.2 --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80"
EP_BUILD_FLAGS: --enable_lto --use_tensorrt --tensorrt_home=$(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-12.5 --cuda_home=$(Agent.TempDirectory)\v12.2 --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80"

- ${{ if eq(parameters.enable_linux_gpu, true) }}:
- template: ../templates/py-linux-gpu.yml
@@ -79,7 +79,7 @@ stages:
cuda_version: ${{ parameters.cuda_version }}
${{ if eq(parameters.cuda_version, '11.8') }}:
docker_base_image: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda11_x64_almalinux8_gcc11:20240531.1
trt_version: 10.0.1.6-1.cuda11.8
trt_version: 10.2.0.19-1.cuda11.8
${{ if eq(parameters.cuda_version, '12.2') }}:
docker_base_image: onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda12_x64_ubi8_gcc12:20240610.1
trt_version: 10.0.1.6-1.cuda12.4
trt_version: 10.2.0.19-1.cuda12.5
Original file line number Diff line number Diff line change
@@ -13,10 +13,10 @@ parameters:
- 12.2
- name: TrtVersion
type: string
default: '10.0.1.6'
default: '10.2.0.19'
values:
- 8.6.1.6
- 10.0.1.6
- 10.2.0.19

steps:
- ${{ if eq(parameters.DownloadCUDA, true) }}:
@@ -42,9 +42,9 @@ steps:
- powershell: |
Write-Host "##vso[task.setvariable variable=trtCudaVersion;]12.0"
displayName: Set trtCudaVersion
- ${{ if and(eq(parameters.CudaVersion, '12.2'), eq(parameters.TrtVersion, '10.0.1.6')) }}:
- ${{ if and(eq(parameters.CudaVersion, '12.2'), eq(parameters.TrtVersion, '10.2.0.19')) }}:
- powershell: |
Write-Host "##vso[task.setvariable variable=trtCudaVersion;]12.4"
Write-Host "##vso[task.setvariable variable=trtCudaVersion;]12.5"
displayName: Set trtCudaVersion
- script: |
Original file line number Diff line number Diff line change
@@ -24,17 +24,11 @@ steps:
displayName: 'Download Secondary CUDA SDK v${{ parameters.SecondaryCUDAVersion }}'
- ${{ if eq(parameters.DownloadTRT, 'true') }}:
- powershell: |
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-8.6.1.6.Windows10.x86_64.cuda-11.8" $(Agent.TempDirectory)
displayName: 'Download TensorRT-8.6.1.6.Windows10.x86_64.cuda-11.8'
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8" $(Agent.TempDirectory)
displayName: 'Download TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8'
- powershell: |
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-8.6.1.6.Windows10.x86_64.cuda-12.0" $(Agent.TempDirectory)
displayName: 'Download TensorRT-8.6.1.6.Windows10.x86_64.cuda-12.0'
- powershell: |
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8" $(Agent.TempDirectory)
displayName: 'Download TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8'
- powershell: |
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-10.0.1.6.Windows10.x86_64.cuda-12.4" $(Agent.TempDirectory)
displayName: 'Download TensorRT-10.0.1.6.Windows10.x86_64.cuda-12.4'
azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/local/TensorRT-10.2.0.19.Windows10.x86_64.cuda-12.5" $(Agent.TempDirectory)
displayName: 'Download TensorRT-10.2.0.19.Windows10.x86_64.cuda-12.5'
- task: BatchScript@1
displayName: 'setup env'
Original file line number Diff line number Diff line change
@@ -22,10 +22,10 @@ parameters:

- name: trt_version
type: string
default: '10.0.1.6-1.cuda11.8'
default: '10.2.0.19-1.cuda11.8'
values:
- 10.0.1.6-1.cuda11.8
- 10.0.1.6-1.cuda12.4
- 10.2.0.19-1.cuda11.8
- 10.2.0.19-1.cuda12.5
- name: cuda_version
type: string
default: '11.8'
Original file line number Diff line number Diff line change
@@ -18,10 +18,10 @@ parameters:

- name: trt_version
type: string
default: '10.0.1.6-1.cuda11.8'
default: '10.2.0.19-1.cuda11.8'
values:
- 10.0.1.6-1.cuda11.8
- 10.0.1.6-1.cuda12.4
- 10.2.0.19-1.cuda11.8
- 10.2.0.19-1.cuda12.5
- name: cuda_version
type: string
default: '11.8'
Original file line number Diff line number Diff line change
@@ -381,7 +381,7 @@ stages:
variables:
CUDA_VERSION: '11.8'
buildArch: x64
EpBuildFlags: --use_tensorrt --tensorrt_home="$(Agent.TempDirectory)\TensorRT-10.0.1.6.Windows10.x86_64.cuda-11.8" --cuda_version=$(CUDA_VERSION) --cuda_home="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v$(CUDA_VERSION)" --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=37;50;52;60;61;70;75;80"
EpBuildFlags: --use_tensorrt --tensorrt_home="$(Agent.TempDirectory)\TensorRT-10.2.0.19.Windows10.x86_64.cuda-11.8" --cuda_version=$(CUDA_VERSION) --cuda_home="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v$(CUDA_VERSION)" --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=37;50;52;60;61;70;75;80"
EnvSetupScript: setup_env_gpu.bat
EP_NAME: gpu
VSGenerator: 'Visual Studio 17 2022'
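
Note on the version pairings: most of the pipeline hunks above apply the same substitution, pairing pipeline CUDA 11.8 with the TensorRT 10.2.0.19 build for CUDA 11.8, and pipeline CUDA 12.2 with the build for CUDA 12.5. The Python sketch below only summarizes that pairing and is not part of the commit. The helper names (`trt_cuda_version`, `windows_trt_home`, `linux_trt_version`) are hypothetical, and the table contains only the pairings visible in this diff.

```python
# Illustrative sketch only; none of these helpers exist in the repository.
TRT_VERSION = "10.2.0.19"

# Pipeline CudaVersion -> CUDA version in the TensorRT package name,
# as seen in the hunks of this diff.
TRT_CUDA_VERSION = {
    "11.8": "11.8",
    "12.2": "12.5",
}


def trt_cuda_version(cuda_version: str) -> str:
    """Return the CUDA suffix used by the TensorRT package for a pipeline CUDA version."""
    try:
        return TRT_CUDA_VERSION[cuda_version]
    except KeyError:
        raise ValueError(f"no TensorRT pairing known for CUDA {cuda_version}") from None


def windows_trt_home(temp_dir: str, cuda_version: str) -> str:
    """Build a Windows TensorRT directory path in the form used by win_trt_home."""
    suffix = trt_cuda_version(cuda_version)
    return f"{temp_dir}\\TensorRT-{TRT_VERSION}.Windows10.x86_64.cuda-{suffix}"


def linux_trt_version(cuda_version: str) -> str:
    """Build a Linux package version string in the form used by linux_trt_version."""
    return f"{TRT_VERSION}-1.cuda{trt_cuda_version(cuda_version)}"
```

Keeping the pairing in one table means a future TensorRT bump would touch a single mapping rather than each formatted string, which is the kind of duplication this diff has to update by hand.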