[WIP]Add ncu report analyzer #2497

Open · wants to merge 2 commits into main
Conversation

@FindHao (Member) commented Oct 8, 2024

This PR adds an ncu report analyzer that parses the profiled ncu report. It also adds two metrics, memory_traffic and arithmetic_intensity. To avoid excessive profiling overhead, we profile only the necessary ncu metrics.
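As a rough sketch of what the two new metrics mean (a hypothetical helper, not this PR's implementation): memory_traffic is the bytes read and written by a kernel, and arithmetic intensity is the ratio of floating-point operations to those bytes.

```python
# Hypothetical sketch (not the PR's code): arithmetic intensity is
# FLOPs per byte of memory traffic moved by a kernel.
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte; returns 0.0 when no traffic was recorded."""
    return flops / bytes_moved if bytes_moved else 0.0

# Illustrative numbers only: 100 FLOPs over 50 bytes of traffic.
print(arithmetic_intensity(100.0, 50.0))  # → 2.0
```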

This PR is part of the operator benchmarking plan.

Example commands:

python run_benchmark.py triton --op fp8_gemm --num-inputs 1  --metrics ncu_rep,memory_traffic,arithmetic_intensity

Example output:

  0%|          | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 1289285 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
  0%|          | 0/1 [00:00<?, ?it/s]==PROF== Profiling "sm90_xmma_gemm_e4m3f16_e4m3f3..." - 0: 0%....50%....100% - 3 passes
100%|██████████| 1/1 [00:02<00:00,  2.34s/it]
             x_val    torch_fp8_gemm-_ncu_trace_in_task
------------------  -----------------------------------
(1024, 1024, 1024)                              success
==PROF== Disconnected from process 1289285
==WARNING== No source files were imported. Check that the target application was compiled with -lineinfo.
==PROF== Report: /scratch/yhao/tmp/tritonbench/fp8_gemm/ncu_traces/torch_fp8_gemm_0/ncu_output.ncu-rep
==PROF== Connected to process 1289431 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
  0%|          | 0/1 [00:00<?, ?it/s]==PROF== Profiling "matmul_kernel" - 0: 0%....50%....100% - 3 passes
100%|██████████| 1/1 [00:05<00:00,  5.25s/it]
             x_val    triton_fp8_gemm-_ncu_trace_in_task
------------------  ------------------------------------
(1024, 1024, 1024)                               success
==PROF== Disconnected from process 1289431
==PROF== Report: /scratch/yhao/tmp/tritonbench/fp8_gemm/ncu_traces/triton_fp8_gemm_0/ncu_output.ncu-rep
100%|██████████| 1/1 [00:14<00:00, 14.40s/it]
             x_val    torch_fp8_gemm-arithmetic_intensity    torch_fp8_gemm-memory_traffic                                                                 torch_fp8_gemm-ncu_rep    triton_fp8_gemm-arithmetic_intensity    triton_fp8_gemm-memory_traffic                                                                 triton_fp8_gemm-ncu_rep
------------------  -------------------------------------  -------------------------------  -------------------------------------------------------------------------------------  --------------------------------------  --------------------------------  --------------------------------------------------------------------------------------
(1024, 1024, 1024)              (1.3621756724589384, 0.0)               (2150656.0, 256.0)  /scratch/yhao/tmp/tritonbench/fp8_gemm/ncu_traces/torch_fp8_gemm_0/ncu_output.ncu-rep                              (0.0, 0.0)                  (2116096.0, 0.0)  /scratch/yhao/tmp/tritonbench/fp8_gemm/ncu_traces/triton_fp8_gemm_0/ncu_output.ncu-rep

import ncu_report
from collections import defaultdict

# Save all kernels' metrics: {metric_name: [kernel1_metric_value, kernel2_metric_value, ...]}
results = defaultdict(list)
@FindHao (Member Author) commented:
@xuzhao9
Any suggestions on how we should save this data? We need to keep the metric results for each kernel, but we also need aggregated results, right? For example, the memory traffic (both read and write) for the whole operator should be the sum of all kernels' read and write traffic.

@FindHao (Member Author) commented:
@xuzhao9 @eellison
Do you think the arithmetic intensity of the whole operator can be represented as a weighted average based on execution time?
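The weighted average proposed above could be sketched like this (hypothetical helper, illustrative numbers): weight each kernel's arithmetic intensity by its share of total execution time.

```python
# Hypothetical sketch: time-weighted average of per-kernel arithmetic
# intensities, weighting each kernel by its execution time.
def weighted_arithmetic_intensity(intensities, durations):
    """Time-weighted mean of per-kernel arithmetic intensities."""
    total_time = sum(durations)
    if total_time == 0:
        return 0.0
    return sum(ai * t for ai, t in zip(intensities, durations)) / total_time

# Two kernels: intensity 2.0 for 3 time units, 8.0 for 1 time unit.
print(weighted_arithmetic_intensity([2.0, 8.0], [3.0, 1.0]))  # → 3.5
```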
