cutlass_transformer_sweep.out

cupy is not loaded.
cupy is not loaded.
cupy is not loaded.
cupy is not loaded.
[2023-09-06 17:43:59,418] [INFO] [comm.py:643:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2023-09-06 17:44:00,238] [INFO] [comm.py:697:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=26.0.157.113, master_port=6000
[2023-09-06 17:44:00,239] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-06 17:44:00,265] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
num_attention_heads: 32, hidden_size: 8192, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8192x24576, b=2048): 0.0131
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8192x24576, b=2048): 251.780
b: 128, m: 2048, n: 256, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x256x2048): 0.0023
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x256x2048): 120.196
b: 128, m: 2048, n: 2048, k: 256,
Elapsed time for attention_prob_times_values (128x2048x2048x256): 0.0028
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x256): 98.658
Elapsed time for attention_linear_projection (4x8192x8192, b=2048): 0.0051
Throughput (in TFLOP/s) for attention_linear_projection (4x8192x8192, b=2048): 213.751
Elapsed time for mlp_h_to_4h (4x8192x32768, b=2048): 0.0172
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8192x32768, b=2048): 256.038
Elapsed time for mlp_4h_to_h (4x32768x8192, b=2048): 0.0174
Throughput (in TFLOP/s) for mlp_4h_to_h (4x32768x8192, b=2048): 252.491

Attention duration (in seconds): 0.0233
Attention throughput (in TFLOP/s): 212.190
MLP duration (in seconds): 0.0346
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0579
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8256, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8256x24768, b=2048): 0.0138
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8256x24768, b=2048): 242.398
b: 128, m: 2048, n: 258, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x258x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x258x2048): 100.235
b: 128, m: 2048, n: 2048, k: 258,
Elapsed time for attention_prob_times_values (128x2048x2048x258): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x258): 72.821
Elapsed time for attention_linear_projection (4x8256x8256, b=2048): 0.0054
Throughput (in TFLOP/s) for attention_linear_projection (4x8256x8256, b=2048): 207.506
Elapsed time for mlp_h_to_4h (4x8256x33024, b=2048): 0.0181
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8256x33024, b=2048): 247.149
Elapsed time for mlp_4h_to_h (4x33024x8256, b=2048): 0.0178
Throughput (in TFLOP/s) for mlp_4h_to_h (4x33024x8256, b=2048): 251.620

Attention duration (in seconds): 0.0258
Attention throughput (in TFLOP/s): 194.834
MLP duration (in seconds): 0.0358
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0616
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8320, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8320x24960, b=2048): 0.0139
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8320x24960, b=2048): 244.681
b: 128, m: 2048, n: 260, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x260x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x260x2048): 103.788
b: 128, m: 2048, n: 2048, k: 260,
Elapsed time for attention_prob_times_values (128x2048x2048x260): 0.0032
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x260): 85.965
Elapsed time for attention_linear_projection (4x8320x8320, b=2048): 0.0054
Throughput (in TFLOP/s) for attention_linear_projection (4x8320x8320, b=2048): 209.686
Elapsed time for mlp_h_to_4h (4x8320x33280, b=2048): 0.0182
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8320x33280, b=2048): 249.691
Elapsed time for mlp_4h_to_h (4x33280x8320, b=2048): 0.0181
Throughput (in TFLOP/s) for mlp_4h_to_h (4x33280x8320, b=2048): 250.391

Attention duration (in seconds): 0.0253
Attention throughput (in TFLOP/s): 201.765
MLP duration (in seconds): 0.0363
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0615
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8384, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8384x25152, b=2048): 0.0140
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8384x25152, b=2048): 246.675
b: 128, m: 2048, n: 262, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x262x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x262x2048): 101.904
b: 128, m: 2048, n: 2048, k: 262,
Elapsed time for attention_prob_times_values (128x2048x2048x262): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x262): 73.197
Elapsed time for attention_linear_projection (4x8384x8384, b=2048): 0.0054
Throughput (in TFLOP/s) for attention_linear_projection (4x8384x8384, b=2048): 211.516
Elapsed time for mlp_h_to_4h (4x8384x33536, b=2048): 0.0183
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8384x33536, b=2048): 251.557
Elapsed time for mlp_4h_to_h (4x33536x8384, b=2048): 0.0183
Throughput (in TFLOP/s) for mlp_4h_to_h (4x33536x8384, b=2048): 251.947

Attention duration (in seconds): 0.0261
Attention throughput (in TFLOP/s): 198.399
MLP duration (in seconds): 0.0366
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0627
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8448, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8448x25344, b=2048): 0.0141
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8448x25344, b=2048): 248.989
b: 128, m: 2048, n: 264, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x264x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x264x2048): 103.848
b: 128, m: 2048, n: 2048, k: 264,
Elapsed time for attention_prob_times_values (128x2048x2048x264): 0.0030
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x264): 94.010
Elapsed time for attention_linear_projection (4x8448x8448, b=2048): 0.0055
Throughput (in TFLOP/s) for attention_linear_projection (4x8448x8448, b=2048): 213.301
Elapsed time for mlp_h_to_4h (4x8448x33792, b=2048): 0.0184
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8448x33792, b=2048): 253.530
Elapsed time for mlp_4h_to_h (4x33792x8448, b=2048): 0.0186
Throughput (in TFLOP/s) for mlp_4h_to_h (4x33792x8448, b=2048): 251.502

Attention duration (in seconds): 0.0253
Attention throughput (in TFLOP/s): 207.152
MLP duration (in seconds): 0.0370
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0624
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8512, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8512x25536, b=2048): 0.0142
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8512x25536, b=2048): 250.512
b: 128, m: 2048, n: 266, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x266x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x266x2048): 103.245
b: 128, m: 2048, n: 2048, k: 266,
Elapsed time for attention_prob_times_values (128x2048x2048x266): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x266): 73.445
Elapsed time for attention_linear_projection (4x8512x8512, b=2048): 0.0055
Throughput (in TFLOP/s) for attention_linear_projection (4x8512x8512, b=2048): 215.682
Elapsed time for mlp_h_to_4h (4x8512x34048, b=2048): 0.0186
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8512x34048, b=2048): 255.200
Elapsed time for mlp_4h_to_h (4x34048x8512, b=2048): 0.0188
Throughput (in TFLOP/s) for mlp_4h_to_h (4x34048x8512, b=2048): 252.690

Attention duration (in seconds): 0.0264
Attention throughput (in TFLOP/s): 201.690
MLP duration (in seconds): 0.0374
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0638
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8576, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8576x25728, b=2048): 0.0143
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8576x25728, b=2048): 252.995
b: 128, m: 2048, n: 268, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x268x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x268x2048): 106.613
b: 128, m: 2048, n: 2048, k: 268,
Elapsed time for attention_prob_times_values (128x2048x2048x268): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x268): 87.690
Elapsed time for attention_linear_projection (4x8576x8576, b=2048): 0.0055
Throughput (in TFLOP/s) for attention_linear_projection (4x8576x8576, b=2048): 217.262
Elapsed time for mlp_h_to_4h (4x8576x34304, b=2048): 0.0187
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8576x34304, b=2048): 257.435
Elapsed time for mlp_4h_to_h (4x34304x8576, b=2048): 0.0191
Throughput (in TFLOP/s) for mlp_4h_to_h (4x34304x8576, b=2048): 252.746

Attention duration (in seconds): 0.0258
Attention throughput (in TFLOP/s): 209.001
MLP duration (in seconds): 0.0378
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0636
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8640, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8640x25920, b=2048): 0.0150
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8640x25920, b=2048): 244.137
b: 128, m: 2048, n: 270, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x270x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x270x2048): 104.492
b: 128, m: 2048, n: 2048, k: 270,
Elapsed time for attention_prob_times_values (128x2048x2048x270): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x270): 73.987
Elapsed time for attention_linear_projection (4x8640x8640, b=2048): 0.0058
Throughput (in TFLOP/s) for attention_linear_projection (4x8640x8640, b=2048): 211.176
Elapsed time for mlp_h_to_4h (4x8640x34560, b=2048): 0.0197
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8640x34560, b=2048): 248.526
Elapsed time for mlp_4h_to_h (4x34560x8640, b=2048): 0.0193
Throughput (in TFLOP/s) for mlp_4h_to_h (4x34560x8640, b=2048): 253.578

Attention duration (in seconds): 0.0275
Attention throughput (in TFLOP/s): 198.884
MLP duration (in seconds): 0.0390
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0665
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8704, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8704x26112, b=2048): 0.0151
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8704x26112, b=2048): 246.018
b: 128, m: 2048, n: 272, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x272x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x272x2048): 106.948
b: 128, m: 2048, n: 2048, k: 272,
Elapsed time for attention_prob_times_values (128x2048x2048x272): 0.0030
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x272): 97.491
Elapsed time for attention_linear_projection (4x8704x8704, b=2048): 0.0058
Throughput (in TFLOP/s) for attention_linear_projection (4x8704x8704, b=2048): 213.367
Elapsed time for mlp_h_to_4h (4x8704x34816, b=2048): 0.0199
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8704x34816, b=2048): 250.050
Elapsed time for mlp_4h_to_h (4x34816x8704, b=2048): 0.0197
Throughput (in TFLOP/s) for mlp_4h_to_h (4x34816x8704, b=2048): 252.188

Attention duration (in seconds): 0.0267
Attention throughput (in TFLOP/s): 207.987
MLP duration (in seconds): 0.0395
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0662
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8768, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8768x26304, b=2048): 0.0153
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8768x26304, b=2048): 247.780
b: 128, m: 2048, n: 274, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x274x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x274x2048): 106.031
b: 128, m: 2048, n: 2048, k: 274,
Elapsed time for attention_prob_times_values (128x2048x2048x274): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x274): 74.494
Elapsed time for attention_linear_projection (4x8768x8768, b=2048): 0.0059
Throughput (in TFLOP/s) for attention_linear_projection (4x8768x8768, b=2048): 214.521
Elapsed time for mlp_h_to_4h (4x8768x35072, b=2048): 0.0202
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8768x35072, b=2048): 249.248
Elapsed time for mlp_4h_to_h (4x35072x8768, b=2048): 0.0201
Throughput (in TFLOP/s) for mlp_4h_to_h (4x35072x8768, b=2048): 251.201

Attention duration (in seconds): 0.0278
Attention throughput (in TFLOP/s): 202.065
MLP duration (in seconds): 0.0403
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0681
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8832, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8832x26496, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8832x26496, b=2048): 249.689
b: 128, m: 2048, n: 276, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x276x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x276x2048): 109.303
b: 128, m: 2048, n: 2048, k: 276,
Elapsed time for attention_prob_times_values (128x2048x2048x276): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x276): 89.617
Elapsed time for attention_linear_projection (4x8832x8832, b=2048): 0.0059
Throughput (in TFLOP/s) for attention_linear_projection (4x8832x8832, b=2048): 216.836
Elapsed time for mlp_h_to_4h (4x8832x35328, b=2048): 0.0201
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8832x35328, b=2048): 254.509
Elapsed time for mlp_4h_to_h (4x35328x8832, b=2048): 0.0201
Throughput (in TFLOP/s) for mlp_4h_to_h (4x35328x8832, b=2048): 254.319

Attention duration (in seconds): 0.0273
Attention throughput (in TFLOP/s): 209.216
MLP duration (in seconds): 0.0402
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0675
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8896, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8896x26688, b=2048): 0.0155
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8896x26688, b=2048): 251.195
b: 128, m: 2048, n: 278, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x278x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x278x2048): 107.348
b: 128, m: 2048, n: 2048, k: 278,
Elapsed time for attention_prob_times_values (128x2048x2048x278): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x278): 74.957
Elapsed time for attention_linear_projection (4x8896x8896, b=2048): 0.0059
Throughput (in TFLOP/s) for attention_linear_projection (4x8896x8896, b=2048): 218.347
Elapsed time for mlp_h_to_4h (4x8896x35584, b=2048): 0.0202
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8896x35584, b=2048): 256.249
Elapsed time for mlp_4h_to_h (4x35584x8896, b=2048): 0.0205
Throughput (in TFLOP/s) for mlp_4h_to_h (4x35584x8896, b=2048): 253.307

Attention duration (in seconds): 0.0282
Attention throughput (in TFLOP/s): 205.184
MLP duration (in seconds): 0.0407
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0689
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 8960, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8960x26880, b=2048): 0.0156
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8960x26880, b=2048): 253.054
b: 128, m: 2048, n: 280, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x280x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x280x2048): 110.036
b: 128, m: 2048, n: 2048, k: 280,
Elapsed time for attention_prob_times_values (128x2048x2048x280): 0.0031
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x280): 98.179
Elapsed time for attention_linear_projection (4x8960x8960, b=2048): 0.0060
Throughput (in TFLOP/s) for attention_linear_projection (4x8960x8960, b=2048): 220.306
Elapsed time for mlp_h_to_4h (4x8960x35840, b=2048): 0.0204
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8960x35840, b=2048): 257.394
Elapsed time for mlp_4h_to_h (4x35840x8960, b=2048): 0.0207
Throughput (in TFLOP/s) for mlp_4h_to_h (4x35840x8960, b=2048): 254.558

Attention duration (in seconds): 0.0274
Attention throughput (in TFLOP/s): 214.289
MLP duration (in seconds): 0.0411
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0685
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9024, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9024x27072, b=2048): 0.0163
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9024x27072, b=2048): 244.941
b: 128, m: 2048, n: 282, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x282x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x282x2048): 108.595
b: 128, m: 2048, n: 2048, k: 282,
Elapsed time for attention_prob_times_values (128x2048x2048x282): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x282): 76.167
Elapsed time for attention_linear_projection (4x9024x9024, b=2048): 0.0062
Throughput (in TFLOP/s) for attention_linear_projection (4x9024x9024, b=2048): 214.382
Elapsed time for mlp_h_to_4h (4x9024x36096, b=2048): 0.0215
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9024x36096, b=2048): 248.742
Elapsed time for mlp_4h_to_h (4x36096x9024, b=2048): 0.0210
Throughput (in TFLOP/s) for mlp_4h_to_h (4x36096x9024, b=2048): 254.055

Attention duration (in seconds): 0.0293
Attention throughput (in TFLOP/s): 202.616
MLP duration (in seconds): 0.0425
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0718
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9088, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9088x27264, b=2048): 0.0164
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9088x27264, b=2048): 247.255
b: 128, m: 2048, n: 284, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x284x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x284x2048): 112.501
b: 128, m: 2048, n: 2048, k: 284,
Elapsed time for attention_prob_times_values (128x2048x2048x284): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x284): 92.162
Elapsed time for attention_linear_projection (4x9088x9088, b=2048): 0.0063
Throughput (in TFLOP/s) for attention_linear_projection (4x9088x9088, b=2048): 216.323
Elapsed time for mlp_h_to_4h (4x9088x36352, b=2048): 0.0216
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9088x36352, b=2048): 250.548
Elapsed time for mlp_4h_to_h (4x36352x9088, b=2048): 0.0214
Throughput (in TFLOP/s) for mlp_4h_to_h (4x36352x9088, b=2048): 253.044

Attention duration (in seconds): 0.0287
Attention throughput (in TFLOP/s): 209.897
MLP duration (in seconds): 0.0430
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0717
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9152, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9152x27456, b=2048): 0.0166
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9152x27456, b=2048): 248.530
b: 128, m: 2048, n: 286, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x286x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x286x2048): 110.201
b: 128, m: 2048, n: 2048, k: 286,
Elapsed time for attention_prob_times_values (128x2048x2048x286): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x286): 76.943
Elapsed time for attention_linear_projection (4x9152x9152, b=2048): 0.0063
Throughput (in TFLOP/s) for attention_linear_projection (4x9152x9152, b=2048): 218.051
Elapsed time for mlp_h_to_4h (4x9152x36608, b=2048): 0.0218
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9152x36608, b=2048): 252.020
Elapsed time for mlp_4h_to_h (4x36608x9152, b=2048): 0.0216
Throughput (in TFLOP/s) for mlp_4h_to_h (4x36608x9152, b=2048): 254.705

Attention duration (in seconds): 0.0296
Attention throughput (in TFLOP/s): 205.943
MLP duration (in seconds): 0.0433
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0730
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9216, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9216x27648, b=2048): 0.0167
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9216x27648, b=2048): 250.508
b: 128, m: 2048, n: 288, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x288x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x288x2048): 113.566
b: 128, m: 2048, n: 2048, k: 288,
Elapsed time for attention_prob_times_values (128x2048x2048x288): 0.0030
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x288): 102.801
Elapsed time for attention_linear_projection (4x9216x9216, b=2048): 0.0063
Throughput (in TFLOP/s) for attention_linear_projection (4x9216x9216, b=2048): 219.787
Elapsed time for mlp_h_to_4h (4x9216x36864, b=2048): 0.0219
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9216x36864, b=2048): 254.022
Elapsed time for mlp_4h_to_h (4x36864x9216, b=2048): 0.0219
Throughput (in TFLOP/s) for mlp_4h_to_h (4x36864x9216, b=2048): 254.127

Attention duration (in seconds): 0.0287
Attention throughput (in TFLOP/s): 215.290
MLP duration (in seconds): 0.0438
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0725
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9280, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9280x27840, b=2048): 0.0168
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9280x27840, b=2048): 251.630
b: 128, m: 2048, n: 290, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x290x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x290x2048): 111.561
b: 128, m: 2048, n: 2048, k: 290,
Elapsed time for attention_prob_times_values (128x2048x2048x290): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x290): 77.898
Elapsed time for attention_linear_projection (4x9280x9280, b=2048): 0.0064
Throughput (in TFLOP/s) for attention_linear_projection (4x9280x9280, b=2048): 221.135
Elapsed time for mlp_h_to_4h (4x9280x37120, b=2048): 0.0222
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9280x37120, b=2048): 253.826
Elapsed time for mlp_4h_to_h (4x37120x9280, b=2048): 0.0221
Throughput (in TFLOP/s) for mlp_4h_to_h (4x37120x9280, b=2048): 255.458

Attention duration (in seconds): 0.0300
Attention throughput (in TFLOP/s): 208.951
MLP duration (in seconds): 0.0443
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0743
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9344, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9344x28032, b=2048): 0.0169
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9344x28032, b=2048): 254.119
b: 128, m: 2048, n: 292, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x292x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x292x2048): 115.741
b: 128, m: 2048, n: 2048, k: 292,
Elapsed time for attention_prob_times_values (128x2048x2048x292): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x292): 94.772
Elapsed time for attention_linear_projection (4x9344x9344, b=2048): 0.0064
Throughput (in TFLOP/s) for attention_linear_projection (4x9344x9344, b=2048): 223.103
Elapsed time for mlp_h_to_4h (4x9344x37376, b=2048): 0.0221
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9344x37376, b=2048): 258.336
Elapsed time for mlp_4h_to_h (4x37376x9344, b=2048): 0.0224
Throughput (in TFLOP/s) for mlp_4h_to_h (4x37376x9344, b=2048): 255.601

Attention duration (in seconds): 0.0293
Attention throughput (in TFLOP/s): 216.567
MLP duration (in seconds): 0.0445
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0739
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9408, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9408x28224, b=2048): 0.0170
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9408x28224, b=2048): 256.630
b: 128, m: 2048, n: 294, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x294x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x294x2048): 113.119
b: 128, m: 2048, n: 2048, k: 294,
Elapsed time for attention_prob_times_values (128x2048x2048x294): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x294): 79.375
Elapsed time for attention_linear_projection (4x9408x9408, b=2048): 0.0065
Throughput (in TFLOP/s) for attention_linear_projection (4x9408x9408, b=2048): 224.467
Elapsed time for mlp_h_to_4h (4x9408x37632, b=2048): 0.0224
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9408x37632, b=2048): 259.374
Elapsed time for mlp_4h_to_h (4x37632x9408, b=2048): 0.0226
Throughput (in TFLOP/s) for mlp_4h_to_h (4x37632x9408, b=2048): 256.988

Attention duration (in seconds): 0.0302
Attention throughput (in TFLOP/s): 213.117
MLP duration (in seconds): 0.0449
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0751
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9472, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9472x28416, b=2048): 0.0171
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9472x28416, b=2048): 258.458
b: 128, m: 2048, n: 296, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x296x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x296x2048): 115.868
b: 128, m: 2048, n: 2048, k: 296,
Elapsed time for attention_prob_times_values (128x2048x2048x296): 0.0031
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x296): 103.210
Elapsed time for attention_linear_projection (4x9472x9472, b=2048): 0.0065
Throughput (in TFLOP/s) for attention_linear_projection (4x9472x9472, b=2048): 226.479
Elapsed time for mlp_h_to_4h (4x9472x37888, b=2048): 0.0225
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9472x37888, b=2048): 261.084
Elapsed time for mlp_4h_to_h (4x37888x9472, b=2048): 0.0228
Throughput (in TFLOP/s) for mlp_4h_to_h (4x37888x9472, b=2048): 258.349

Attention duration (in seconds): 0.0294
Attention throughput (in TFLOP/s): 221.803
MLP duration (in seconds): 0.0453
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0747
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9536, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9536x28608, b=2048): 0.0179
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9536x28608, b=2048): 250.208
b: 128, m: 2048, n: 298, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x298x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x298x2048): 114.345
b: 128, m: 2048, n: 2048, k: 298,
Elapsed time for attention_prob_times_values (128x2048x2048x298): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x298): 79.866
Elapsed time for attention_linear_projection (4x9536x9536, b=2048): 0.0068
Throughput (in TFLOP/s) for attention_linear_projection (4x9536x9536, b=2048): 220.440
Elapsed time for mlp_h_to_4h (4x9536x38144, b=2048): 0.0235
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9536x38144, b=2048): 253.484
Elapsed time for mlp_4h_to_h (4x38144x9536, b=2048): 0.0232
Throughput (in TFLOP/s) for mlp_4h_to_h (4x38144x9536, b=2048): 257.395

Attention duration (in seconds): 0.0314
Attention throughput (in TFLOP/s): 209.993
MLP duration (in seconds): 0.0467
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0781
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9600, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9600x28800, b=2048): 0.0180
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9600x28800, b=2048): 252.245
b: 128, m: 2048, n: 300, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x300x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x300x2048): 118.745
b: 128, m: 2048, n: 2048, k: 300,
Elapsed time for attention_prob_times_values (128x2048x2048x300): 0.0034
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x300): 96.019
Elapsed time for attention_linear_projection (4x9600x9600, b=2048): 0.0068
Throughput (in TFLOP/s) for attention_linear_projection (4x9600x9600, b=2048): 222.264
Elapsed time for mlp_h_to_4h (4x9600x38400, b=2048): 0.0237
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9600x38400, b=2048): 254.444
Elapsed time for mlp_4h_to_h (4x38400x9600, b=2048): 0.0233
Throughput (in TFLOP/s) for mlp_4h_to_h (4x38400x9600, b=2048): 259.759

Attention duration (in seconds): 0.0308
Attention throughput (in TFLOP/s): 216.879
MLP duration (in seconds): 0.0470
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0778
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9664, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9664x28992, b=2048): 0.0181
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9664x28992, b=2048): 253.843
b: 128, m: 2048, n: 302, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x302x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x302x2048): 115.999
b: 128, m: 2048, n: 2048, k: 302,
Elapsed time for attention_prob_times_values (128x2048x2048x302): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x302): 81.175
Elapsed time for attention_linear_projection (4x9664x9664, b=2048): 0.0068
Throughput (in TFLOP/s) for attention_linear_projection (4x9664x9664, b=2048): 223.722
Elapsed time for mlp_h_to_4h (4x9664x38656, b=2048): 0.0239
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9664x38656, b=2048): 256.393
Elapsed time for mlp_4h_to_h (4x38656x9664, b=2048): 0.0237
Throughput (in TFLOP/s) for mlp_4h_to_h (4x38656x9664, b=2048): 258.235

Attention duration (in seconds): 0.0317
Attention throughput (in TFLOP/s): 213.447
MLP duration (in seconds): 0.0476
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0793
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9728, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9728x29184, b=2048): 0.0182
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9728x29184, b=2048): 255.941
b: 128, m: 2048, n: 304, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x304x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x304x2048): 119.519
b: 128, m: 2048, n: 2048, k: 304,
Elapsed time for attention_prob_times_values (128x2048x2048x304): 0.0030
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x304): 107.625
Elapsed time for attention_linear_projection (4x9728x9728, b=2048): 0.0069
Throughput (in TFLOP/s) for attention_linear_projection (4x9728x9728, b=2048): 225.868
Elapsed time for mlp_h_to_4h (4x9728x38912, b=2048): 0.0238
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9728x38912, b=2048): 260.162
Elapsed time for mlp_4h_to_h (4x38912x9728, b=2048): 0.0241
Throughput (in TFLOP/s) for mlp_4h_to_h (4x38912x9728, b=2048): 257.540

Attention duration (in seconds): 0.0308
Attention throughput (in TFLOP/s): 222.539
MLP duration (in seconds): 0.0479
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0787
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9792, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9792x29376, b=2048): 0.0183
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9792x29376, b=2048): 257.646
b: 128, m: 2048, n: 306, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x306x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x306x2048): 117.046
b: 128, m: 2048, n: 2048, k: 306,
Elapsed time for attention_prob_times_values (128x2048x2048x306): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x306): 81.986
Elapsed time for attention_linear_projection (4x9792x9792, b=2048): 0.0069
Throughput (in TFLOP/s) for attention_linear_projection (4x9792x9792, b=2048): 226.778
Elapsed time for mlp_h_to_4h (4x9792x39168, b=2048): 0.0240
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9792x39168, b=2048): 261.914
Elapsed time for mlp_4h_to_h (4x39168x9792, b=2048): 0.0243
Throughput (in TFLOP/s) for mlp_4h_to_h (4x39168x9792, b=2048): 259.121

Attention duration (in seconds): 0.0320
Attention throughput (in TFLOP/s): 216.675
MLP duration (in seconds): 0.0482
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0803
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9856, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9856x29568, b=2048): 0.0184
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9856x29568, b=2048): 259.449
b: 128, m: 2048, n: 308, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x308x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x308x2048): 120.902
b: 128, m: 2048, n: 2048, k: 308,
Elapsed time for attention_prob_times_values (128x2048x2048x308): 0.0034
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x308): 97.184
Elapsed time for attention_linear_projection (4x9856x9856, b=2048): 0.0070
Throughput (in TFLOP/s) for attention_linear_projection (4x9856x9856, b=2048): 228.729
Elapsed time for mlp_h_to_4h (4x9856x39424, b=2048): 0.0242
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9856x39424, b=2048): 263.553
Elapsed time for mlp_4h_to_h (4x39424x9856, b=2048): 0.0247
Throughput (in TFLOP/s) for mlp_4h_to_h (4x39424x9856, b=2048): 258.156

Attention duration (in seconds): 0.0315
Attention throughput (in TFLOP/s): 223.102
MLP duration (in seconds): 0.0488
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0803
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9920, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9920x29760, b=2048): 0.0193
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9920x29760, b=2048): 251.087
b: 128, m: 2048, n: 310, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x310x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x310x2048): 118.335
b: 128, m: 2048, n: 2048, k: 310,
Elapsed time for attention_prob_times_values (128x2048x2048x310): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x310): 81.269
Elapsed time for attention_linear_projection (4x9920x9920, b=2048): 0.0072
Throughput (in TFLOP/s) for attention_linear_projection (4x9920x9920, b=2048): 223.183
Elapsed time for mlp_h_to_4h (4x9920x39680, b=2048): 0.0253
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9920x39680, b=2048): 254.965
Elapsed time for mlp_4h_to_h (4x39680x9920, b=2048): 0.0248
Throughput (in TFLOP/s) for mlp_4h_to_h (4x39680x9920, b=2048): 259.821

Attention duration (in seconds): 0.0334
Attention throughput (in TFLOP/s): 213.043
MLP duration (in seconds): 0.0501
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0835
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 9984, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9984x29952, b=2048): 0.0194
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9984x29952, b=2048): 252.698
b: 128, m: 2048, n: 312, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x312x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x312x2048): 121.582
b: 128, m: 2048, n: 2048, k: 312,
Elapsed time for attention_prob_times_values (128x2048x2048x312): 0.0031
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x312): 107.681
Elapsed time for attention_linear_projection (4x9984x9984, b=2048): 0.0073
Throughput (in TFLOP/s) for attention_linear_projection (4x9984x9984, b=2048): 224.589
Elapsed time for mlp_h_to_4h (4x9984x39936, b=2048): 0.0255
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9984x39936, b=2048): 256.352
Elapsed time for mlp_4h_to_h (4x39936x9984, b=2048): 0.0252
Throughput (in TFLOP/s) for mlp_4h_to_h (4x39936x9984, b=2048): 258.809

Attention duration (in seconds): 0.0325
Attention throughput (in TFLOP/s): 221.437
MLP duration (in seconds): 0.0507
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0833
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10048, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10048x30144, b=2048): 0.0195
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10048x30144, b=2048): 254.381
b: 128, m: 2048, n: 314, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x314x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x314x2048): 120.259
b: 128, m: 2048, n: 2048, k: 314,
Elapsed time for attention_prob_times_values (128x2048x2048x314): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x314): 83.587
Elapsed time for attention_linear_projection (4x10048x10048, b=2048): 0.0073
Throughput (in TFLOP/s) for attention_linear_projection (4x10048x10048, b=2048): 225.922
Elapsed time for mlp_h_to_4h (4x10048x40192, b=2048): 0.0256
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10048x40192, b=2048): 258.091
Elapsed time for mlp_4h_to_h (4x40192x10048, b=2048): 0.0256
Throughput (in TFLOP/s) for mlp_4h_to_h (4x40192x10048, b=2048): 258.486

Attention duration (in seconds): 0.0337
Attention throughput (in TFLOP/s): 216.561
MLP duration (in seconds): 0.0512
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0849
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10112, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10112x30336, b=2048): 0.0196
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10112x30336, b=2048): 256.058
b: 128, m: 2048, n: 316, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x316x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x316x2048): 124.335
b: 128, m: 2048, n: 2048, k: 316,
Elapsed time for attention_prob_times_values (128x2048x2048x316): 0.0034
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x316): 100.059
Elapsed time for attention_linear_projection (4x10112x10112, b=2048): 0.0074
Throughput (in TFLOP/s) for attention_linear_projection (4x10112x10112, b=2048): 227.263
Elapsed time for mlp_h_to_4h (4x10112x40448, b=2048): 0.0258
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10112x40448, b=2048): 259.694
Elapsed time for mlp_4h_to_h (4x40448x10112, b=2048): 0.0258
Throughput (in TFLOP/s) for mlp_4h_to_h (4x40448x10112, b=2048): 259.392

Attention duration (in seconds): 0.0331
Attention throughput (in TFLOP/s): 222.823
MLP duration (in seconds): 0.0516
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0848
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10176, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10176x30528, b=2048): 0.0197
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10176x30528, b=2048): 257.788
b: 128, m: 2048, n: 318, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x318x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x318x2048): 120.846
b: 128, m: 2048, n: 2048, k: 318,
Elapsed time for attention_prob_times_values (128x2048x2048x318): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x318): 84.442
Elapsed time for attention_linear_projection (4x10176x10176, b=2048): 0.0074
Throughput (in TFLOP/s) for attention_linear_projection (4x10176x10176, b=2048): 229.429
Elapsed time for mlp_h_to_4h (4x10176x40704, b=2048): 0.0264
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10176x40704, b=2048): 257.505
Elapsed time for mlp_4h_to_h (4x40704x10176, b=2048): 0.0262
Throughput (in TFLOP/s) for mlp_4h_to_h (4x40704x10176, b=2048): 258.617

Attention duration (in seconds): 0.0340
Attention throughput (in TFLOP/s): 219.633
MLP duration (in seconds): 0.0526
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0866
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10240, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10240x30720, b=2048): 0.0199
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10240x30720, b=2048): 259.545
b: 128, m: 2048, n: 320, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x320x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x320x2048): 125.078
b: 128, m: 2048, n: 2048, k: 320,
Elapsed time for attention_prob_times_values (128x2048x2048x320): 0.0031
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x320): 112.599
Elapsed time for attention_linear_projection (4x10240x10240, b=2048): 0.0074
Throughput (in TFLOP/s) for attention_linear_projection (4x10240x10240, b=2048): 230.629
Elapsed time for mlp_h_to_4h (4x10240x40960, b=2048): 0.0262
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10240x40960, b=2048): 262.228
Elapsed time for mlp_4h_to_h (4x40960x10240, b=2048): 0.0263
Throughput (in TFLOP/s) for mlp_4h_to_h (4x40960x10240, b=2048): 260.906

Attention duration (in seconds): 0.0331
Attention throughput (in TFLOP/s): 228.336
MLP duration (in seconds): 0.0525
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0857
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10304, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10304x30912, b=2048): 0.0200
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10304x30912, b=2048): 261.160
b: 128, m: 2048, n: 322, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x322x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x322x2048): 121.525
b: 128, m: 2048, n: 2048, k: 322,
Elapsed time for attention_prob_times_values (128x2048x2048x322): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x322): 80.044
Elapsed time for attention_linear_projection (4x10304x10304, b=2048): 0.0075
Throughput (in TFLOP/s) for attention_linear_projection (4x10304x10304, b=2048): 232.731
Elapsed time for mlp_h_to_4h (4x10304x41216, b=2048): 0.0263
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10304x41216, b=2048): 264.108
Elapsed time for mlp_4h_to_h (4x41216x10304, b=2048): 0.0267
Throughput (in TFLOP/s) for mlp_4h_to_h (4x41216x10304, b=2048): 260.313

Attention duration (in seconds): 0.0346
Attention throughput (in TFLOP/s): 220.951
MLP duration (in seconds): 0.0531
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0877
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10368, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10368x31104, b=2048): 0.0201
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10368x31104, b=2048): 262.880
b: 128, m: 2048, n: 324, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x324x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x324x2048): 126.752
b: 128, m: 2048, n: 2048, k: 324,
Elapsed time for attention_prob_times_values (128x2048x2048x324): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x324): 96.264
Elapsed time for attention_linear_projection (4x10368x10368, b=2048): 0.0075
Throughput (in TFLOP/s) for attention_linear_projection (4x10368x10368, b=2048): 234.397
Elapsed time for mlp_h_to_4h (4x10368x41472, b=2048): 0.0270
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10368x41472, b=2048): 260.653
Elapsed time for mlp_4h_to_h (4x41472x10368, b=2048): 0.0270
Throughput (in TFLOP/s) for mlp_4h_to_h (4x41472x10368, b=2048): 261.100

Attention duration (in seconds): 0.0340
Attention throughput (in TFLOP/s): 227.857
MLP duration (in seconds): 0.0540
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0880
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10432, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10432x31296, b=2048): 0.0210
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10432x31296, b=2048): 254.799
b: 128, m: 2048, n: 326, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x326x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x326x2048): 123.928
b: 128, m: 2048, n: 2048, k: 326,
Elapsed time for attention_prob_times_values (128x2048x2048x326): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x326): 81.366
Elapsed time for attention_linear_projection (4x10432x10432, b=2048): 0.0078
Throughput (in TFLOP/s) for attention_linear_projection (4x10432x10432, b=2048): 227.976
Elapsed time for mlp_h_to_4h (4x10432x41728, b=2048): 0.0277
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10432x41728, b=2048): 257.646
Elapsed time for mlp_4h_to_h (4x41728x10432, b=2048): 0.0274
Throughput (in TFLOP/s) for mlp_4h_to_h (4x41728x10432, b=2048): 260.276

Attention duration (in seconds): 0.0359
Attention throughput (in TFLOP/s): 217.917
MLP duration (in seconds): 0.0551
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0910
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10496, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10496x31488, b=2048): 0.0211
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10496x31488, b=2048): 256.293
b: 128, m: 2048, n: 328, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x328x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x328x2048): 127.431
b: 128, m: 2048, n: 2048, k: 328,
Elapsed time for attention_prob_times_values (128x2048x2048x328): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x328): 107.942
Elapsed time for attention_linear_projection (4x10496x10496, b=2048): 0.0079
Throughput (in TFLOP/s) for attention_linear_projection (4x10496x10496, b=2048): 229.445
Elapsed time for mlp_h_to_4h (4x10496x41984, b=2048): 0.0278
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10496x41984, b=2048): 259.421
Elapsed time for mlp_4h_to_h (4x41984x10496, b=2048): 0.0278
Throughput (in TFLOP/s) for mlp_4h_to_h (4x41984x10496, b=2048): 259.570

Attention duration (in seconds): 0.0350
Attention throughput (in TFLOP/s): 226.271
MLP duration (in seconds): 0.0556
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0907
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10560, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10560x31680, b=2048): 0.0212
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10560x31680, b=2048): 258.042
b: 128, m: 2048, n: 330, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x330x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x330x2048): 124.795
b: 128, m: 2048, n: 2048, k: 330,
Elapsed time for attention_prob_times_values (128x2048x2048x330): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x330): 81.155
Elapsed time for attention_linear_projection (4x10560x10560, b=2048): 0.0079
Throughput (in TFLOP/s) for attention_linear_projection (4x10560x10560, b=2048): 230.735
Elapsed time for mlp_h_to_4h (4x10560x42240, b=2048): 0.0281
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10560x42240, b=2048): 260.512
Elapsed time for mlp_4h_to_h (4x42240x10560, b=2048): 0.0281
Throughput (in TFLOP/s) for mlp_4h_to_h (4x42240x10560, b=2048): 260.028

Attention duration (in seconds): 0.0364
Attention throughput (in TFLOP/s): 220.454
MLP duration (in seconds): 0.0562
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0925
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10624, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10624x31872, b=2048): 0.0214
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10624x31872, b=2048): 259.574
b: 128, m: 2048, n: 332, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x332x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x332x2048): 130.278
b: 128, m: 2048, n: 2048, k: 332,
Elapsed time for attention_prob_times_values (128x2048x2048x332): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x332): 97.827
Elapsed time for attention_linear_projection (4x10624x10624, b=2048): 0.0079
Throughput (in TFLOP/s) for attention_linear_projection (4x10624x10624, b=2048): 232.748
Elapsed time for mlp_h_to_4h (4x10624x42496, b=2048): 0.0284
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10624x42496, b=2048): 260.437
Elapsed time for mlp_4h_to_h (4x42496x10624, b=2048): 0.0286
Throughput (in TFLOP/s) for mlp_4h_to_h (4x42496x10624, b=2048): 258.922

Attention duration (in seconds): 0.0357
Attention throughput (in TFLOP/s): 227.182
MLP duration (in seconds): 0.0570
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0927
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10688, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10688x32064, b=2048): 0.0215
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10688x32064, b=2048): 261.068
b: 128, m: 2048, n: 334, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x334x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x334x2048): 125.875
b: 128, m: 2048, n: 2048, k: 334,
Elapsed time for attention_prob_times_values (128x2048x2048x334): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x334): 82.917
Elapsed time for attention_linear_projection (4x10688x10688, b=2048): 0.0080
Throughput (in TFLOP/s) for attention_linear_projection (4x10688x10688, b=2048): 234.828
Elapsed time for mlp_h_to_4h (4x10688x42752, b=2048): 0.0286
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10688x42752, b=2048): 261.928
Elapsed time for mlp_4h_to_h (4x42752x10688, b=2048): 0.0288
Throughput (in TFLOP/s) for mlp_4h_to_h (4x42752x10688, b=2048): 260.321

Attention duration (in seconds): 0.0367
Attention throughput (in TFLOP/s): 223.829
MLP duration (in seconds): 0.0573
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0940
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10752, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10752x32256, b=2048): 0.0216
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10752x32256, b=2048): 262.667
b: 128, m: 2048, n: 336, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x336x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x336x2048): 130.855
b: 128, m: 2048, n: 2048, k: 336,
Elapsed time for attention_prob_times_values (128x2048x2048x336): 0.0032
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x336): 111.102
Elapsed time for attention_linear_projection (4x10752x10752, b=2048): 0.0080
Throughput (in TFLOP/s) for attention_linear_projection (4x10752x10752, b=2048): 236.820
Elapsed time for mlp_h_to_4h (4x10752x43008, b=2048): 0.0288
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10752x43008, b=2048): 263.086
Elapsed time for mlp_4h_to_h (4x43008x10752, b=2048): 0.0291
Throughput (in TFLOP/s) for mlp_4h_to_h (4x43008x10752, b=2048): 260.223

Attention duration (in seconds): 0.0356
Attention throughput (in TFLOP/s): 232.856
MLP duration (in seconds): 0.0579
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0935
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10816, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10816x32448, b=2048): 0.0226
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10816x32448, b=2048): 254.876
b: 128, m: 2048, n: 338, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x338x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x338x2048): 127.457
b: 128, m: 2048, n: 2048, k: 338,
Elapsed time for attention_prob_times_values (128x2048x2048x338): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x338): 83.869
Elapsed time for attention_linear_projection (4x10816x10816, b=2048): 0.0083
Throughput (in TFLOP/s) for attention_linear_projection (4x10816x10816, b=2048): 230.185
Elapsed time for mlp_h_to_4h (4x10816x43264, b=2048): 0.0300
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10816x43264, b=2048): 255.726
Elapsed time for mlp_4h_to_h (4x43264x10816, b=2048): 0.0297
Throughput (in TFLOP/s) for mlp_4h_to_h (4x43264x10816, b=2048): 257.857

Attention duration (in seconds): 0.0381
Attention throughput (in TFLOP/s): 220.500
MLP duration (in seconds): 0.0597
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0978
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10880, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10880x32640, b=2048): 0.0228
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10880x32640, b=2048): 254.775
b: 128, m: 2048, n: 340, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x340x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x340x2048): 133.046
b: 128, m: 2048, n: 2048, k: 340,
Elapsed time for attention_prob_times_values (128x2048x2048x340): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x340): 100.514
Elapsed time for attention_linear_projection (4x10880x10880, b=2048): 0.0084
Throughput (in TFLOP/s) for attention_linear_projection (4x10880x10880, b=2048): 231.242
Elapsed time for mlp_h_to_4h (4x10880x43520, b=2048): 0.0302
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10880x43520, b=2048): 256.896
Elapsed time for mlp_4h_to_h (4x43520x10880, b=2048): 0.0299
Throughput (in TFLOP/s) for mlp_4h_to_h (4x43520x10880, b=2048): 259.250

Attention duration (in seconds): 0.0376
Attention throughput (in TFLOP/s): 225.741
MLP duration (in seconds): 0.0601
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0977
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 10944, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10944x32832, b=2048): 0.0230
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10944x32832, b=2048): 256.216
b: 128, m: 2048, n: 342, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x342x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x342x2048): 128.460
b: 128, m: 2048, n: 2048, k: 342,
Elapsed time for attention_prob_times_values (128x2048x2048x342): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x342): 84.493
Elapsed time for attention_linear_projection (4x10944x10944, b=2048): 0.0084
Throughput (in TFLOP/s) for attention_linear_projection (4x10944x10944, b=2048): 232.549
Elapsed time for mlp_h_to_4h (4x10944x43776, b=2048): 0.0304
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10944x43776, b=2048): 258.250
Elapsed time for mlp_4h_to_h (4x43776x10944, b=2048): 0.0304
Throughput (in TFLOP/s) for mlp_4h_to_h (4x43776x10944, b=2048): 258.450

Attention duration (in seconds): 0.0386
Attention throughput (in TFLOP/s): 222.263
MLP duration (in seconds): 0.0608
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0994
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11008, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11008x33024, b=2048): 0.0231
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11008x33024, b=2048): 257.509
b: 128, m: 2048, n: 344, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x344x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x344x2048): 133.451
b: 128, m: 2048, n: 2048, k: 344,
Elapsed time for attention_prob_times_values (128x2048x2048x344): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x344): 111.424
Elapsed time for attention_linear_projection (4x11008x11008, b=2048): 0.0085
Throughput (in TFLOP/s) for attention_linear_projection (4x11008x11008, b=2048): 234.277
Elapsed time for mlp_h_to_4h (4x11008x44032, b=2048): 0.0305
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11008x44032, b=2048): 260.130
Elapsed time for mlp_4h_to_h (4x44032x11008, b=2048): 0.0306
Throughput (in TFLOP/s) for mlp_4h_to_h (4x44032x11008, b=2048): 259.639

Attention duration (in seconds): 0.0377
Attention throughput (in TFLOP/s): 230.324
MLP duration (in seconds): 0.0611
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0988
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11072, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11072x33216, b=2048): 0.0233
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11072x33216, b=2048): 258.717
b: 128, m: 2048, n: 346, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x346x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x346x2048): 130.190
b: 128, m: 2048, n: 2048, k: 346,
Elapsed time for attention_prob_times_values (128x2048x2048x346): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x346): 85.159
Elapsed time for attention_linear_projection (4x11072x11072, b=2048): 0.0085
Throughput (in TFLOP/s) for attention_linear_projection (4x11072x11072, b=2048): 235.085
Elapsed time for mlp_h_to_4h (4x11072x44288, b=2048): 0.0308
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11072x44288, b=2048): 261.070
Elapsed time for mlp_4h_to_h (4x44288x11072, b=2048): 0.0310
Throughput (in TFLOP/s) for mlp_4h_to_h (4x44288x11072, b=2048): 259.069

Attention duration (in seconds): 0.0390
Attention throughput (in TFLOP/s): 224.765
MLP duration (in seconds): 0.0618
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1008
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11136, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11136x33408, b=2048): 0.0234
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11136x33408, b=2048): 260.230
b: 128, m: 2048, n: 348, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x348x2048): 0.0027
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x348x2048): 136.153
b: 128, m: 2048, n: 2048, k: 348,
Elapsed time for attention_prob_times_values (128x2048x2048x348): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x348): 102.683
Elapsed time for attention_linear_projection (4x11136x11136, b=2048): 0.0086
Throughput (in TFLOP/s) for attention_linear_projection (4x11136x11136, b=2048): 236.727
Elapsed time for mlp_h_to_4h (4x11136x44544, b=2048): 0.0310
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11136x44544, b=2048): 262.345
Elapsed time for mlp_4h_to_h (4x44544x11136, b=2048): 0.0314
Throughput (in TFLOP/s) for mlp_4h_to_h (4x44544x11136, b=2048): 258.487

Attention duration (in seconds): 0.0384
Attention throughput (in TFLOP/s): 231.171
MLP duration (in seconds): 0.0624
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1008
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11200, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11200x33600, b=2048): 0.0244
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11200x33600, b=2048): 252.724
b: 128, m: 2048, n: 350, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x350x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x350x2048): 131.115
b: 128, m: 2048, n: 2048, k: 350,
Elapsed time for attention_prob_times_values (128x2048x2048x350): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x350): 85.620
Elapsed time for attention_linear_projection (4x11200x11200, b=2048): 0.0089
Throughput (in TFLOP/s) for attention_linear_projection (4x11200x11200, b=2048): 230.838
Elapsed time for mlp_h_to_4h (4x11200x44800, b=2048): 0.0323
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11200x44800, b=2048): 254.322
Elapsed time for mlp_4h_to_h (4x44800x11200, b=2048): 0.0318
Throughput (in TFLOP/s) for mlp_4h_to_h (4x44800x11200, b=2048): 258.344

Attention duration (in seconds): 0.0406
Attention throughput (in TFLOP/s): 221.239
MLP duration (in seconds): 0.0641
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1047
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11264, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11264x33792, b=2048): 0.0246
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11264x33792, b=2048): 253.909
b: 128, m: 2048, n: 352, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x352x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x352x2048): 136.696
b: 128, m: 2048, n: 2048, k: 352,
Elapsed time for attention_prob_times_values (128x2048x2048x352): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x352): 115.823
Elapsed time for attention_linear_projection (4x11264x11264, b=2048): 0.0089
Throughput (in TFLOP/s) for attention_linear_projection (4x11264x11264, b=2048): 232.978
Elapsed time for mlp_h_to_4h (4x11264x45056, b=2048): 0.0327
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11264x45056, b=2048): 254.563
Elapsed time for mlp_4h_to_h (4x45056x11264, b=2048): 0.0321
Throughput (in TFLOP/s) for mlp_4h_to_h (4x45056x11264, b=2048): 258.938

Attention duration (in seconds): 0.0395
Attention throughput (in TFLOP/s): 229.576
MLP duration (in seconds): 0.0648
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1043
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11328, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11328x33984, b=2048): 0.0247
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11328x33984, b=2048): 255.643
b: 128, m: 2048, n: 354, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x354x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x354x2048): 133.167
b: 128, m: 2048, n: 2048, k: 354,
Elapsed time for attention_prob_times_values (128x2048x2048x354): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x354): 86.782
Elapsed time for attention_linear_projection (4x11328x11328, b=2048): 0.0090
Throughput (in TFLOP/s) for attention_linear_projection (4x11328x11328, b=2048): 234.144
Elapsed time for mlp_h_to_4h (4x11328x45312, b=2048): 0.0326
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11328x45312, b=2048): 257.725
Elapsed time for mlp_4h_to_h (4x45312x11328, b=2048): 0.0323
Throughput (in TFLOP/s) for mlp_4h_to_h (4x45312x11328, b=2048): 260.497

Attention duration (in seconds): 0.0409
Attention throughput (in TFLOP/s): 224.282
MLP duration (in seconds): 0.0649
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1058
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11392, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11392x34176, b=2048): 0.0249
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11392x34176, b=2048): 256.471
b: 128, m: 2048, n: 356, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x356x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x356x2048): 138.572
b: 128, m: 2048, n: 2048, k: 356,
Elapsed time for attention_prob_times_values (128x2048x2048x356): 0.0037
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x356): 104.673
Elapsed time for attention_linear_projection (4x11392x11392, b=2048): 0.0090
Throughput (in TFLOP/s) for attention_linear_projection (4x11392x11392, b=2048): 235.509
Elapsed time for mlp_h_to_4h (4x11392x45568, b=2048): 0.0330
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11392x45568, b=2048): 257.671
Elapsed time for mlp_4h_to_h (4x45568x11392, b=2048): 0.0327
Throughput (in TFLOP/s) for mlp_4h_to_h (4x45568x11392, b=2048): 259.896

Attention duration (in seconds): 0.0403
Attention throughput (in TFLOP/s): 229.956
MLP duration (in seconds): 0.0657
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1060
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11456, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11456x34368, b=2048): 0.0248
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11456x34368, b=2048): 260.339
b: 128, m: 2048, n: 358, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x358x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x358x2048): 134.078
b: 128, m: 2048, n: 2048, k: 358,
Elapsed time for attention_prob_times_values (128x2048x2048x358): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x358): 88.089
Elapsed time for attention_linear_projection (4x11456x11456, b=2048): 0.0091
Throughput (in TFLOP/s) for attention_linear_projection (4x11456x11456, b=2048): 236.874
Elapsed time for mlp_h_to_4h (4x11456x45824, b=2048): 0.0331
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11456x45824, b=2048): 260.219
Elapsed time for mlp_4h_to_h (4x45824x11456, b=2048): 0.0331
Throughput (in TFLOP/s) for mlp_4h_to_h (4x45824x11456, b=2048): 259.814

Attention duration (in seconds): 0.0411
Attention throughput (in TFLOP/s): 228.049
MLP duration (in seconds): 0.0662
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1072
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11520, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11520x34560, b=2048): 0.0249
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11520x34560, b=2048): 262.417
b: 128, m: 2048, n: 360, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x360x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x360x2048): 139.334
b: 128, m: 2048, n: 2048, k: 360,
Elapsed time for attention_prob_times_values (128x2048x2048x360): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x360): 115.881
Elapsed time for attention_linear_projection (4x11520x11520, b=2048): 0.0091
Throughput (in TFLOP/s) for attention_linear_projection (4x11520x11520, b=2048): 238.644
Elapsed time for mlp_h_to_4h (4x11520x46080, b=2048): 0.0333
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11520x46080, b=2048): 260.951
Elapsed time for mlp_4h_to_h (4x46080x11520, b=2048): 0.0334
Throughput (in TFLOP/s) for mlp_4h_to_h (4x46080x11520, b=2048): 260.720

Attention duration (in seconds): 0.0401
Attention throughput (in TFLOP/s): 236.297
MLP duration (in seconds): 0.0667
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1068
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11584, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11584x34752, b=2048): 0.0253
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11584x34752, b=2048): 260.914
b: 128, m: 2048, n: 362, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x362x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x362x2048): 134.680
b: 128, m: 2048, n: 2048, k: 362,
Elapsed time for attention_prob_times_values (128x2048x2048x362): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x362): 88.091
Elapsed time for attention_linear_projection (4x11584x11584, b=2048): 0.0092
Throughput (in TFLOP/s) for attention_linear_projection (4x11584x11584, b=2048): 239.132
Elapsed time for mlp_h_to_4h (4x11584x46336, b=2048): 0.0336
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11584x46336, b=2048): 262.122
Elapsed time for mlp_4h_to_h (4x46336x11584, b=2048): 0.0338
Throughput (in TFLOP/s) for mlp_4h_to_h (4x46336x11584, b=2048): 260.056

Attention duration (in seconds): 0.0418
Attention throughput (in TFLOP/s): 229.142
MLP duration (in seconds): 0.0674
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1091
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11648, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11648x34944, b=2048): 0.0256
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11648x34944, b=2048): 260.977
b: 128, m: 2048, n: 364, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x364x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x364x2048): 140.931
b: 128, m: 2048, n: 2048, k: 364,
Elapsed time for attention_prob_times_values (128x2048x2048x364): 0.0037
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x364): 106.546
Elapsed time for attention_linear_projection (4x11648x11648, b=2048): 0.0092
Throughput (in TFLOP/s) for attention_linear_projection (4x11648x11648, b=2048): 241.368
Elapsed time for mlp_h_to_4h (4x11648x46592, b=2048): 0.0337
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11648x46592, b=2048): 264.107
Elapsed time for mlp_4h_to_h (4x46592x11648, b=2048): 0.0341
Throughput (in TFLOP/s) for mlp_4h_to_h (4x46592x11648, b=2048): 260.918

Attention duration (in seconds): 0.0412
Attention throughput (in TFLOP/s): 234.766
MLP duration (in seconds): 0.0677
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1089
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11712, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11712x35136, b=2048): 0.0263
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11712x35136, b=2048): 256.738
b: 128, m: 2048, n: 366, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x366x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x366x2048): 135.653
b: 128, m: 2048, n: 2048, k: 366,
Elapsed time for attention_prob_times_values (128x2048x2048x366): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x366): 88.882
Elapsed time for attention_linear_projection (4x11712x11712, b=2048): 0.0095
Throughput (in TFLOP/s) for attention_linear_projection (4x11712x11712, b=2048): 235.581
Elapsed time for mlp_h_to_4h (4x11712x46848, b=2048): 0.0349
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11712x46848, b=2048): 257.840
Elapsed time for mlp_4h_to_h (4x46848x11712, b=2048): 0.0345
Throughput (in TFLOP/s) for mlp_4h_to_h (4x46848x11712, b=2048): 260.218

Attention duration (in seconds): 0.0431
Attention throughput (in TFLOP/s): 226.710
MLP duration (in seconds): 0.0694
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1125
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11776, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11776x35328, b=2048): 0.0265
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11776x35328, b=2048): 257.305
b: 128, m: 2048, n: 368, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x368x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x368x2048): 142.333
b: 128, m: 2048, n: 2048, k: 368,
Elapsed time for attention_prob_times_values (128x2048x2048x368): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x368): 120.035
Elapsed time for attention_linear_projection (4x11776x11776, b=2048): 0.0096
Throughput (in TFLOP/s) for attention_linear_projection (4x11776x11776, b=2048): 236.814
Elapsed time for mlp_h_to_4h (4x11776x47104, b=2048): 0.0352
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11776x47104, b=2048): 258.128
Elapsed time for mlp_4h_to_h (4x47104x11776, b=2048): 0.0350
Throughput (in TFLOP/s) for mlp_4h_to_h (4x47104x11776, b=2048): 259.620

Attention duration (in seconds): 0.0422
Attention throughput (in TFLOP/s): 234.349
MLP duration (in seconds): 0.0702
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1124
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11840, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11840x35520, b=2048): 0.0266
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11840x35520, b=2048): 258.552
b: 128, m: 2048, n: 370, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x370x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x370x2048): 138.365
b: 128, m: 2048, n: 2048, k: 370,
Elapsed time for attention_prob_times_values (128x2048x2048x370): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x370): 91.046
Elapsed time for attention_linear_projection (4x11840x11840, b=2048): 0.0096
Throughput (in TFLOP/s) for attention_linear_projection (4x11840x11840, b=2048): 238.228
Elapsed time for mlp_h_to_4h (4x11840x47360, b=2048): 0.0354
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11840x47360, b=2048): 259.437
Elapsed time for mlp_4h_to_h (4x47360x11840, b=2048): 0.0352
Throughput (in TFLOP/s) for mlp_4h_to_h (4x47360x11840, b=2048): 261.020

Attention duration (in seconds): 0.0435
Attention throughput (in TFLOP/s): 229.329
MLP duration (in seconds): 0.0706
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1141
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11904, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11904x35712, b=2048): 0.0269
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11904x35712, b=2048): 259.111
b: 128, m: 2048, n: 372, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x372x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x372x2048): 144.301
b: 128, m: 2048, n: 2048, k: 372,
Elapsed time for attention_prob_times_values (128x2048x2048x372): 0.0037
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x372): 108.184
Elapsed time for attention_linear_projection (4x11904x11904, b=2048): 0.0097
Throughput (in TFLOP/s) for attention_linear_projection (4x11904x11904, b=2048): 239.702
Elapsed time for mlp_h_to_4h (4x11904x47616, b=2048): 0.0356
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11904x47616, b=2048): 260.762
Elapsed time for mlp_4h_to_h (4x47616x11904, b=2048): 0.0357
Throughput (in TFLOP/s) for mlp_4h_to_h (4x47616x11904, b=2048): 260.292

Attention duration (in seconds): 0.0430
Attention throughput (in TFLOP/s): 234.405
MLP duration (in seconds): 0.0713
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1143
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 11968, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11968x35904, b=2048): 0.0269
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11968x35904, b=2048): 261.247
b: 128, m: 2048, n: 374, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x374x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x374x2048): 139.329
b: 128, m: 2048, n: 2048, k: 374,
Elapsed time for attention_prob_times_values (128x2048x2048x374): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x374): 91.985
Elapsed time for attention_linear_projection (4x11968x11968, b=2048): 0.0097
Throughput (in TFLOP/s) for attention_linear_projection (4x11968x11968, b=2048): 240.917
Elapsed time for mlp_h_to_4h (4x11968x47872, b=2048): 0.0359
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11968x47872, b=2048): 261.144
Elapsed time for mlp_4h_to_h (4x47872x11968, b=2048): 0.0360
Throughput (in TFLOP/s) for mlp_4h_to_h (4x47872x11968, b=2048): 260.919

Attention duration (in seconds): 0.0439
Attention throughput (in TFLOP/s): 231.924
MLP duration (in seconds): 0.0719
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1159
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12032, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12032x36096, b=2048): 0.0271
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12032x36096, b=2048): 262.665
b: 128, m: 2048, n: 376, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x376x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x376x2048): 144.954
b: 128, m: 2048, n: 2048, k: 376,
Elapsed time for attention_prob_times_values (128x2048x2048x376): 0.0034
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x376): 119.816
Elapsed time for attention_linear_projection (4x12032x12032, b=2048): 0.0098
Throughput (in TFLOP/s) for attention_linear_projection (4x12032x12032, b=2048): 242.598
Elapsed time for mlp_h_to_4h (4x12032x48128, b=2048): 0.0362
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12032x48128, b=2048): 262.167
Elapsed time for mlp_4h_to_h (4x48128x12032, b=2048): 0.0366
Throughput (in TFLOP/s) for mlp_4h_to_h (4x48128x12032, b=2048): 259.307

Attention duration (in seconds): 0.0430
Attention throughput (in TFLOP/s): 239.296
MLP duration (in seconds): 0.0728
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1158
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12096, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12096x36288, b=2048): 0.0283
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12096x36288, b=2048): 254.254
b: 128, m: 2048, n: 378, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x378x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x378x2048): 140.227
b: 128, m: 2048, n: 2048, k: 378,
Elapsed time for attention_prob_times_values (128x2048x2048x378): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x378): 92.469
Elapsed time for attention_linear_projection (4x12096x12096, b=2048): 0.0101
Throughput (in TFLOP/s) for attention_linear_projection (4x12096x12096, b=2048): 236.461
Elapsed time for mlp_h_to_4h (4x12096x48384, b=2048): 0.0375
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12096x48384, b=2048): 255.503
Elapsed time for mlp_4h_to_h (4x48384x12096, b=2048): 0.0370
Throughput (in TFLOP/s) for mlp_4h_to_h (4x48384x12096, b=2048): 259.414

Attention duration (in seconds): 0.0457
Attention throughput (in TFLOP/s): 227.550
MLP duration (in seconds): 0.0745
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1202
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12160, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12160x36480, b=2048): 0.0284
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12160x36480, b=2048): 256.026
b: 128, m: 2048, n: 380, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x380x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x380x2048): 146.096
b: 128, m: 2048, n: 2048, k: 380,
Elapsed time for attention_prob_times_values (128x2048x2048x380): 0.0037
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x380): 111.077
Elapsed time for attention_linear_projection (4x12160x12160, b=2048): 0.0102
Throughput (in TFLOP/s) for attention_linear_projection (4x12160x12160, b=2048): 237.851
Elapsed time for mlp_h_to_4h (4x12160x48640, b=2048): 0.0380
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12160x48640, b=2048): 255.341
Elapsed time for mlp_4h_to_h (4x48640x12160, b=2048): 0.0375
Throughput (in TFLOP/s) for mlp_4h_to_h (4x48640x12160, b=2048): 258.701

Attention duration (in seconds): 0.0450
Attention throughput (in TFLOP/s): 233.277
MLP duration (in seconds): 0.0754
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1204
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12224, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12224x36672, b=2048): 0.0287
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12224x36672, b=2048): 255.528
b: 128, m: 2048, n: 382, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x382x2048): 0.0029
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x382x2048): 141.014
b: 128, m: 2048, n: 2048, k: 382,
Elapsed time for attention_prob_times_values (128x2048x2048x382): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x382): 94.324
Elapsed time for attention_linear_projection (4x12224x12224, b=2048): 0.0102
Throughput (in TFLOP/s) for attention_linear_projection (4x12224x12224, b=2048): 238.869
Elapsed time for mlp_h_to_4h (4x12224x48896, b=2048): 0.0382
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12224x48896, b=2048): 256.082
Elapsed time for mlp_4h_to_h (4x48896x12224, b=2048): 0.0379
Throughput (in TFLOP/s) for mlp_4h_to_h (4x48896x12224, b=2048): 258.363

Attention duration (in seconds): 0.0462
Attention throughput (in TFLOP/s): 229.477
MLP duration (in seconds): 0.0761
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1224
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12288, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12288x36864, b=2048): 0.0289
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12288x36864, b=2048): 256.936
b: 128, m: 2048, n: 384, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x384x2048): 0.0028
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x384x2048): 149.291
b: 128, m: 2048, n: 2048, k: 384,
Elapsed time for attention_prob_times_values (128x2048x2048x384): 0.0033
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x384): 125.109
Elapsed time for attention_linear_projection (4x12288x12288, b=2048): 0.0103
Throughput (in TFLOP/s) for attention_linear_projection (4x12288x12288, b=2048): 240.331
Elapsed time for mlp_h_to_4h (4x12288x49152, b=2048): 0.0384
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12288x49152, b=2048): 257.534
Elapsed time for mlp_4h_to_h (4x49152x12288, b=2048): 0.0382
Throughput (in TFLOP/s) for mlp_4h_to_h (4x49152x12288, b=2048): 259.038

Attention duration (in seconds): 0.0452
Attention throughput (in TFLOP/s): 236.982
MLP duration (in seconds): 0.0766
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1219
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12352, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12352x37056, b=2048): 0.0291
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12352x37056, b=2048): 257.444
b: 128, m: 2048, n: 386, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x386x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x386x2048): 124.589
b: 128, m: 2048, n: 2048, k: 386,
Elapsed time for attention_prob_times_values (128x2048x2048x386): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x386): 90.437
Elapsed time for attention_linear_projection (4x12352x12352, b=2048): 0.0103
Throughput (in TFLOP/s) for attention_linear_projection (4x12352x12352, b=2048): 241.984
Elapsed time for mlp_h_to_4h (4x12352x49408, b=2048): 0.0385
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12352x49408, b=2048): 259.907
Elapsed time for mlp_4h_to_h (4x49408x12352, b=2048): 0.0385
Throughput (in TFLOP/s) for mlp_4h_to_h (4x49408x12352, b=2048): 259.481

Attention duration (in seconds): 0.0474
Attention throughput (in TFLOP/s): 228.585
MLP duration (in seconds): 0.0770
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1244
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12416, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12416x37248, b=2048): 0.0292
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12416x37248, b=2048): 259.469
b: 128, m: 2048, n: 388, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x388x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x388x2048): 129.188
b: 128, m: 2048, n: 2048, k: 388,
Elapsed time for attention_prob_times_values (128x2048x2048x388): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x388): 106.927
Elapsed time for attention_linear_projection (4x12416x12416, b=2048): 0.0104
Throughput (in TFLOP/s) for attention_linear_projection (4x12416x12416, b=2048): 243.274
Elapsed time for mlp_h_to_4h (4x12416x49664, b=2048): 0.0388
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12416x49664, b=2048): 260.290
Elapsed time for mlp_4h_to_h (4x49664x12416, b=2048): 0.0390
Throughput (in TFLOP/s) for mlp_4h_to_h (4x49664x12416, b=2048): 258.953

Attention duration (in seconds): 0.0467
Attention throughput (in TFLOP/s): 234.148
MLP duration (in seconds): 0.0778
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1245
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12480, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12480x37440, b=2048): 0.0303
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12480x37440, b=2048): 252.598
b: 128, m: 2048, n: 390, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x390x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x390x2048): 125.961
b: 128, m: 2048, n: 2048, k: 390,
Elapsed time for attention_prob_times_values (128x2048x2048x390): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x390): 90.574
Elapsed time for attention_linear_projection (4x12480x12480, b=2048): 0.0108
Throughput (in TFLOP/s) for attention_linear_projection (4x12480x12480, b=2048): 237.324
Elapsed time for mlp_h_to_4h (4x12480x49920, b=2048): 0.0403
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12480x49920, b=2048): 253.021
Elapsed time for mlp_4h_to_h (4x49920x12480, b=2048): 0.0394
Throughput (in TFLOP/s) for mlp_4h_to_h (4x49920x12480, b=2048): 259.253

Attention duration (in seconds): 0.0490
Attention throughput (in TFLOP/s): 225.371
MLP duration (in seconds): 0.0797
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1287
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12544, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12544x37632, b=2048): 0.0304
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12544x37632, b=2048): 254.068
b: 128, m: 2048, n: 392, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x392x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x392x2048): 128.552
b: 128, m: 2048, n: 2048, k: 392,
Elapsed time for attention_prob_times_values (128x2048x2048x392): 0.0035
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x392): 119.430
Elapsed time for attention_linear_projection (4x12544x12544, b=2048): 0.0108
Throughput (in TFLOP/s) for attention_linear_projection (4x12544x12544, b=2048): 238.574
Elapsed time for mlp_h_to_4h (4x12544x50176, b=2048): 0.0405
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12544x50176, b=2048): 254.626
Elapsed time for mlp_4h_to_h (4x50176x12544, b=2048): 0.0402
Throughput (in TFLOP/s) for mlp_4h_to_h (4x50176x12544, b=2048): 256.762

Attention duration (in seconds): 0.0480
Attention throughput (in TFLOP/s): 232.154
MLP duration (in seconds): 0.0807
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1287
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12608, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12608x37824, b=2048): 0.0307
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12608x37824, b=2048): 254.509
b: 128, m: 2048, n: 394, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x394x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x394x2048): 127.007
b: 128, m: 2048, n: 2048, k: 394,
Elapsed time for attention_prob_times_values (128x2048x2048x394): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x394): 90.707
Elapsed time for attention_linear_projection (4x12608x12608, b=2048): 0.0109
Throughput (in TFLOP/s) for attention_linear_projection (4x12608x12608, b=2048): 239.756
Elapsed time for mlp_h_to_4h (4x12608x50432, b=2048): 0.0408
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12608x50432, b=2048): 255.604
Elapsed time for mlp_4h_to_h (4x50432x12608, b=2048): 0.0402
Throughput (in TFLOP/s) for mlp_4h_to_h (4x50432x12608, b=2048): 258.941

Attention duration (in seconds): 0.0496
Attention throughput (in TFLOP/s): 227.289
MLP duration (in seconds): 0.0810
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1305
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12672, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12672x38016, b=2048): 0.0308
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12672x38016, b=2048): 256.259
b: 128, m: 2048, n: 396, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x396x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x396x2048): 131.356
b: 128, m: 2048, n: 2048, k: 396,
Elapsed time for attention_prob_times_values (128x2048x2048x396): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x396): 107.578
Elapsed time for attention_linear_projection (4x12672x12672, b=2048): 0.0109
Throughput (in TFLOP/s) for attention_linear_projection (4x12672x12672, b=2048): 241.138
Elapsed time for mlp_h_to_4h (4x12672x50688, b=2048): 0.0409
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12672x50688, b=2048): 257.612
Elapsed time for mlp_4h_to_h (4x50688x12672, b=2048): 0.0404
Throughput (in TFLOP/s) for mlp_4h_to_h (4x50688x12672, b=2048): 260.764

Attention duration (in seconds): 0.0489
Attention throughput (in TFLOP/s): 232.600
MLP duration (in seconds): 0.0812
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1301
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12736, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12736x38208, b=2048): 0.0309
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12736x38208, b=2048): 257.747
b: 128, m: 2048, n: 398, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x398x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x398x2048): 128.370
b: 128, m: 2048, n: 2048, k: 398,
Elapsed time for attention_prob_times_values (128x2048x2048x398): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x398): 91.990
Elapsed time for attention_linear_projection (4x12736x12736, b=2048): 0.0109
Throughput (in TFLOP/s) for attention_linear_projection (4x12736x12736, b=2048): 242.953
Elapsed time for mlp_h_to_4h (4x12736x50944, b=2048): 0.0411
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12736x50944, b=2048): 258.624
Elapsed time for mlp_4h_to_h (4x50944x12736, b=2048): 0.0409
Throughput (in TFLOP/s) for mlp_4h_to_h (4x50944x12736, b=2048): 260.116

Attention duration (in seconds): 0.0498
Attention throughput (in TFLOP/s): 230.411
MLP duration (in seconds): 0.0820
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1318
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12800, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12800x38400, b=2048): 0.0310
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12800x38400, b=2048): 259.400
b: 128, m: 2048, n: 400, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x400x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x400x2048): 131.906
b: 128, m: 2048, n: 2048, k: 400,
Elapsed time for attention_prob_times_values (128x2048x2048x400): 0.0035
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x400): 122.630
Elapsed time for attention_linear_projection (4x12800x12800, b=2048): 0.0110
Throughput (in TFLOP/s) for attention_linear_projection (4x12800x12800, b=2048): 244.214
Elapsed time for mlp_h_to_4h (4x12800x51200, b=2048): 0.0413
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12800x51200, b=2048): 260.187
Elapsed time for mlp_4h_to_h (4x51200x12800, b=2048): 0.0413
Throughput (in TFLOP/s) for mlp_4h_to_h (4x51200x12800, b=2048): 259.957

Attention duration (in seconds): 0.0488
Attention throughput (in TFLOP/s): 237.655
MLP duration (in seconds): 0.0826
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1314
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12864, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12864x38592, b=2048): 0.0313
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12864x38592, b=2048): 259.561
b: 128, m: 2048, n: 402, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x402x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x402x2048): 129.716
b: 128, m: 2048, n: 2048, k: 402,
Elapsed time for attention_prob_times_values (128x2048x2048x402): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x402): 92.772
Elapsed time for attention_linear_projection (4x12864x12864, b=2048): 0.0111
Throughput (in TFLOP/s) for attention_linear_projection (4x12864x12864, b=2048): 245.062
Elapsed time for mlp_h_to_4h (4x12864x51456, b=2048): 0.0417
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12864x51456, b=2048): 260.379
Elapsed time for mlp_4h_to_h (4x51456x12864, b=2048): 0.0419
Throughput (in TFLOP/s) for mlp_4h_to_h (4x51456x12864, b=2048): 259.040

Attention duration (in seconds): 0.0504
Attention throughput (in TFLOP/s): 232.397
MLP duration (in seconds): 0.0835
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1339
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12928, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12928x38784, b=2048): 0.0315
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12928x38784, b=2048): 261.121
b: 128, m: 2048, n: 404, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x404x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x404x2048): 134.129
b: 128, m: 2048, n: 2048, k: 404,
Elapsed time for attention_prob_times_values (128x2048x2048x404): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x404): 110.976
Elapsed time for attention_linear_projection (4x12928x12928, b=2048): 0.0111
Throughput (in TFLOP/s) for attention_linear_projection (4x12928x12928, b=2048): 246.434
Elapsed time for mlp_h_to_4h (4x12928x51712, b=2048): 0.0419
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12928x51712, b=2048): 261.425
Elapsed time for mlp_4h_to_h (4x51712x12928, b=2048): 0.0421
Throughput (in TFLOP/s) for mlp_4h_to_h (4x51712x12928, b=2048): 260.423

Attention duration (in seconds): 0.0497
Attention throughput (in TFLOP/s): 237.772
MLP duration (in seconds): 0.0840
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1337
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 12992, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12992x38976, b=2048): 0.0326
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12992x38976, b=2048): 254.546
b: 128, m: 2048, n: 406, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x406x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x406x2048): 130.763
b: 128, m: 2048, n: 2048, k: 406,
Elapsed time for attention_prob_times_values (128x2048x2048x406): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x406): 93.308
Elapsed time for attention_linear_projection (4x12992x12992, b=2048): 0.0115
Throughput (in TFLOP/s) for attention_linear_projection (4x12992x12992, b=2048): 240.669
Elapsed time for mlp_h_to_4h (4x12992x51968, b=2048): 0.0432
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12992x51968, b=2048): 256.020
Elapsed time for mlp_4h_to_h (4x51968x12992, b=2048): 0.0427
Throughput (in TFLOP/s) for mlp_4h_to_h (4x51968x12992, b=2048): 259.250

Attention duration (in seconds): 0.0521
Attention throughput (in TFLOP/s): 229.101
MLP duration (in seconds): 0.0859
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1380
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13056, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13056x39168, b=2048): 0.0327
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13056x39168, b=2048): 256.285
b: 128, m: 2048, n: 408, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x408x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x408x2048): 134.102
b: 128, m: 2048, n: 2048, k: 408,
Elapsed time for attention_prob_times_values (128x2048x2048x408): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x408): 123.056
Elapsed time for attention_linear_projection (4x13056x13056, b=2048): 0.0115
Throughput (in TFLOP/s) for attention_linear_projection (4x13056x13056, b=2048): 241.877
Elapsed time for mlp_h_to_4h (4x13056x52224, b=2048): 0.0434
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13056x52224, b=2048): 257.510
Elapsed time for mlp_4h_to_h (4x52224x13056, b=2048): 0.0429
Throughput (in TFLOP/s) for mlp_4h_to_h (4x52224x13056, b=2048): 260.310

Attention duration (in seconds): 0.0511
Attention throughput (in TFLOP/s): 235.923
MLP duration (in seconds): 0.0863
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1374
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13120, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13120x39360, b=2048): 0.0329
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13120x39360, b=2048): 256.815
b: 128, m: 2048, n: 410, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x410x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x410x2048): 131.665
b: 128, m: 2048, n: 2048, k: 410,
Elapsed time for attention_prob_times_values (128x2048x2048x410): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x410): 94.251
Elapsed time for attention_linear_projection (4x13120x13120, b=2048): 0.0116
Throughput (in TFLOP/s) for attention_linear_projection (4x13120x13120, b=2048): 243.160
Elapsed time for mlp_h_to_4h (4x13120x52480, b=2048): 0.0437
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13120x52480, b=2048): 258.028
Elapsed time for mlp_4h_to_h (4x52480x13120, b=2048): 0.0434
Throughput (in TFLOP/s) for mlp_4h_to_h (4x52480x13120, b=2048): 260.189

Attention duration (in seconds): 0.0526
Attention throughput (in TFLOP/s): 231.393
MLP duration (in seconds): 0.0871
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1396
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13184, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13184x39552, b=2048): 0.0330
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13184x39552, b=2048): 258.511
b: 128, m: 2048, n: 412, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x412x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x412x2048): 136.694
b: 128, m: 2048, n: 2048, k: 412,
Elapsed time for attention_prob_times_values (128x2048x2048x412): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x412): 112.988
Elapsed time for attention_linear_projection (4x13184x13184, b=2048): 0.0117
Throughput (in TFLOP/s) for attention_linear_projection (4x13184x13184, b=2048): 244.417
Elapsed time for mlp_h_to_4h (4x13184x52736, b=2048): 0.0439
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13184x52736, b=2048): 259.489
Elapsed time for mlp_4h_to_h (4x52736x13184, b=2048): 0.0439
Throughput (in TFLOP/s) for mlp_4h_to_h (4x52736x13184, b=2048): 259.567

Attention duration (in seconds): 0.0519
Attention throughput (in TFLOP/s): 236.753
MLP duration (in seconds): 0.0878
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1396
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13248, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13248x39744, b=2048): 0.0333
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13248x39744, b=2048): 259.313
b: 128, m: 2048, n: 414, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x414x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x414x2048): 133.149
b: 128, m: 2048, n: 2048, k: 414,
Elapsed time for attention_prob_times_values (128x2048x2048x414): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x414): 94.534
Elapsed time for attention_linear_projection (4x13248x13248, b=2048): 0.0117
Throughput (in TFLOP/s) for attention_linear_projection (4x13248x13248, b=2048): 245.900
Elapsed time for mlp_h_to_4h (4x13248x52992, b=2048): 0.0442
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13248x52992, b=2048): 260.319
Elapsed time for mlp_4h_to_h (4x52992x13248, b=2048): 0.0441
Throughput (in TFLOP/s) for mlp_4h_to_h (4x52992x13248, b=2048): 260.977

Attention duration (in seconds): 0.0530
Attention throughput (in TFLOP/s): 233.788
MLP duration (in seconds): 0.0883
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1413
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13312, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13312x39936, b=2048): 0.0335
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13312x39936, b=2048): 259.941
b: 128, m: 2048, n: 416, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x416x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x416x2048): 137.002
b: 128, m: 2048, n: 2048, k: 416,
Elapsed time for attention_prob_times_values (128x2048x2048x416): 0.0035
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x416): 127.423
Elapsed time for attention_linear_projection (4x13312x13312, b=2048): 0.0118
Throughput (in TFLOP/s) for attention_linear_projection (4x13312x13312, b=2048): 247.063
Elapsed time for mlp_h_to_4h (4x13312x53248, b=2048): 0.0445
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13312x53248, b=2048): 261.010
Elapsed time for mlp_4h_to_h (4x53248x13312, b=2048): 0.0447
Throughput (in TFLOP/s) for mlp_4h_to_h (4x53248x13312, b=2048): 260.037

Attention duration (in seconds): 0.0520
Attention throughput (in TFLOP/s): 240.399
MLP duration (in seconds): 0.0892
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1412
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13376, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13376x40128, b=2048): 0.0345
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13376x40128, b=2048): 254.754
b: 128, m: 2048, n: 418, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x418x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x418x2048): 134.244
b: 128, m: 2048, n: 2048, k: 418,
Elapsed time for attention_prob_times_values (128x2048x2048x418): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x418): 95.719
Elapsed time for attention_linear_projection (4x13376x13376, b=2048): 0.0121
Throughput (in TFLOP/s) for attention_linear_projection (4x13376x13376, b=2048): 241.507
Elapsed time for mlp_h_to_4h (4x13376x53504, b=2048): 0.0458
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13376x53504, b=2048): 256.044
Elapsed time for mlp_4h_to_h (4x53504x13376, b=2048): 0.0449
Throughput (in TFLOP/s) for mlp_4h_to_h (4x53504x13376, b=2048): 260.911

Attention duration (in seconds): 0.0547
Attention throughput (in TFLOP/s): 230.812
MLP duration (in seconds): 0.0907
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1454
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13440, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13440x40320, b=2048): 0.0347
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13440x40320, b=2048): 255.661
b: 128, m: 2048, n: 420, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x420x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x420x2048): 139.276
b: 128, m: 2048, n: 2048, k: 420,
Elapsed time for attention_prob_times_values (128x2048x2048x420): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x420): 114.825
Elapsed time for attention_linear_projection (4x13440x13440, b=2048): 0.0122
Throughput (in TFLOP/s) for attention_linear_projection (4x13440x13440, b=2048): 242.898
Elapsed time for mlp_h_to_4h (4x13440x53760, b=2048): 0.0461
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13440x53760, b=2048): 256.958
Elapsed time for mlp_4h_to_h (4x53760x13440, b=2048): 0.0454
Throughput (in TFLOP/s) for mlp_4h_to_h (4x53760x13440, b=2048): 260.729

Attention duration (in seconds): 0.0541
Attention throughput (in TFLOP/s): 235.588
MLP duration (in seconds): 0.0915
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1456
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13504, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13504x40512, b=2048): 0.0350
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13504x40512, b=2048): 256.222
b: 128, m: 2048, n: 422, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x422x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x422x2048): 135.529
b: 128, m: 2048, n: 2048, k: 422,
Elapsed time for attention_prob_times_values (128x2048x2048x422): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x422): 96.375
Elapsed time for attention_linear_projection (4x13504x13504, b=2048): 0.0122
Throughput (in TFLOP/s) for attention_linear_projection (4x13504x13504, b=2048): 244.361
Elapsed time for mlp_h_to_4h (4x13504x54016, b=2048): 0.0459
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13504x54016, b=2048): 260.212
Elapsed time for mlp_4h_to_h (4x54016x13504, b=2048): 0.0455
Throughput (in TFLOP/s) for mlp_4h_to_h (4x54016x13504, b=2048): 262.606

Attention duration (in seconds): 0.0553
Attention throughput (in TFLOP/s): 232.693
MLP duration (in seconds): 0.0914
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1467
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13568, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13568x40704, b=2048): 0.0349
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13568x40704, b=2048): 259.039
b: 128, m: 2048, n: 424, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x424x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x424x2048): 139.168
b: 128, m: 2048, n: 2048, k: 424,
Elapsed time for attention_prob_times_values (128x2048x2048x424): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x424): 127.090
Elapsed time for attention_linear_projection (4x13568x13568, b=2048): 0.0123
Throughput (in TFLOP/s) for attention_linear_projection (4x13568x13568, b=2048): 245.667
Elapsed time for mlp_h_to_4h (4x13568x54272, b=2048): 0.0463
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13568x54272, b=2048): 260.444
Elapsed time for mlp_4h_to_h (4x54272x13568, b=2048): 0.0458
Throughput (in TFLOP/s) for mlp_4h_to_h (4x54272x13568, b=2048): 263.694

Attention duration (in seconds): 0.0541
Attention throughput (in TFLOP/s): 240.006
MLP duration (in seconds): 0.0921
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1461
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13632, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13632x40896, b=2048): 0.0350
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13632x40896, b=2048): 260.660
b: 128, m: 2048, n: 426, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x426x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x426x2048): 136.843
b: 128, m: 2048, n: 2048, k: 426,
Elapsed time for attention_prob_times_values (128x2048x2048x426): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x426): 96.817
Elapsed time for attention_linear_projection (4x13632x13632, b=2048): 0.0123
Throughput (in TFLOP/s) for attention_linear_projection (4x13632x13632, b=2048): 246.668
Elapsed time for mlp_h_to_4h (4x13632x54528, b=2048): 0.0464
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13632x54528, b=2048): 262.600
Elapsed time for mlp_4h_to_h (4x54528x13632, b=2048): 0.0463
Throughput (in TFLOP/s) for mlp_4h_to_h (4x54528x13632, b=2048): 262.864

Attention duration (in seconds): 0.0555
Attention throughput (in TFLOP/s): 236.122
MLP duration (in seconds): 0.0927
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1482
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13696, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13696x41088, b=2048): 0.0351
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13696x41088, b=2048): 262.450
b: 128, m: 2048, n: 428, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x428x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x428x2048): 141.783
b: 128, m: 2048, n: 2048, k: 428,
Elapsed time for attention_prob_times_values (128x2048x2048x428): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x428): 115.998
Elapsed time for attention_linear_projection (4x13696x13696, b=2048): 0.0124
Throughput (in TFLOP/s) for attention_linear_projection (4x13696x13696, b=2048): 247.802
Elapsed time for mlp_h_to_4h (4x13696x54784, b=2048): 0.0466
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13696x54784, b=2048): 264.078
Elapsed time for mlp_4h_to_h (4x54784x13696, b=2048): 0.0465
Throughput (in TFLOP/s) for mlp_4h_to_h (4x54784x13696, b=2048): 264.270

Attention duration (in seconds): 0.0547
Attention throughput (in TFLOP/s): 241.385
MLP duration (in seconds): 0.0931
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1478
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13760, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13760x41280, b=2048): 0.0355
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13760x41280, b=2048): 262.053
b: 128, m: 2048, n: 430, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x430x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x430x2048): 138.128
b: 128, m: 2048, n: 2048, k: 430,
Elapsed time for attention_prob_times_values (128x2048x2048x430): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x430): 98.252
Elapsed time for attention_linear_projection (4x13760x13760, b=2048): 0.0125
Throughput (in TFLOP/s) for attention_linear_projection (4x13760x13760, b=2048): 248.177
Elapsed time for mlp_h_to_4h (4x13760x55040, b=2048): 0.0471
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13760x55040, b=2048): 263.310
Elapsed time for mlp_4h_to_h (4x55040x13760, b=2048): 0.0471
Throughput (in TFLOP/s) for mlp_4h_to_h (4x55040x13760, b=2048): 263.327

Attention duration (in seconds): 0.0561
Attention throughput (in TFLOP/s): 237.837
MLP duration (in seconds): 0.0942
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1503
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13824, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13824x41472, b=2048): 0.0359
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13824x41472, b=2048): 261.864
b: 128, m: 2048, n: 432, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x432x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x432x2048): 142.333
b: 128, m: 2048, n: 2048, k: 432,
Elapsed time for attention_prob_times_values (128x2048x2048x432): 0.0035
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x432): 130.934
Elapsed time for attention_linear_projection (4x13824x13824, b=2048): 0.0126
Throughput (in TFLOP/s) for attention_linear_projection (4x13824x13824, b=2048): 249.421
Elapsed time for mlp_h_to_4h (4x13824x55296, b=2048): 0.0475
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13824x55296, b=2048): 263.584
Elapsed time for mlp_4h_to_h (4x55296x13824, b=2048): 0.0476
Throughput (in TFLOP/s) for mlp_4h_to_h (4x55296x13824, b=2048): 263.196

Attention duration (in seconds): 0.0552
Attention throughput (in TFLOP/s): 243.583
MLP duration (in seconds): 0.0951
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1503
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13888, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13888x41664, b=2048): 0.0366
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13888x41664, b=2048): 259.133
b: 128, m: 2048, n: 434, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x434x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x434x2048): 139.154
b: 128, m: 2048, n: 2048, k: 434,
Elapsed time for attention_prob_times_values (128x2048x2048x434): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x434): 98.820
Elapsed time for attention_linear_projection (4x13888x13888, b=2048): 0.0129
Throughput (in TFLOP/s) for attention_linear_projection (4x13888x13888, b=2048): 244.744
Elapsed time for mlp_h_to_4h (4x13888x55552, b=2048): 0.0487
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13888x55552, b=2048): 259.668
Elapsed time for mlp_4h_to_h (4x55552x13888, b=2048): 0.0479
Throughput (in TFLOP/s) for mlp_4h_to_h (4x55552x13888, b=2048): 263.871

Attention duration (in seconds): 0.0576
Attention throughput (in TFLOP/s): 235.791
MLP duration (in seconds): 0.0966
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1541
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 13952, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13952x41856, b=2048): 0.0369
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13952x41856, b=2048): 259.500
b: 128, m: 2048, n: 436, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x436x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x436x2048): 144.657
b: 128, m: 2048, n: 2048, k: 436,
Elapsed time for attention_prob_times_values (128x2048x2048x436): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x436): 117.614
Elapsed time for attention_linear_projection (4x13952x13952, b=2048): 0.0130
Throughput (in TFLOP/s) for attention_linear_projection (4x13952x13952, b=2048): 246.042
Elapsed time for mlp_h_to_4h (4x13952x55808, b=2048): 0.0488
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13952x55808, b=2048): 261.239
Elapsed time for mlp_4h_to_h (4x55808x13952, b=2048): 0.0484
Throughput (in TFLOP/s) for mlp_4h_to_h (4x55808x13952, b=2048): 263.306

Attention duration (in seconds): 0.0570
Attention throughput (in TFLOP/s): 240.028
MLP duration (in seconds): 0.0973
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1543
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14016, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14016x42048, b=2048): 0.0372
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14016x42048, b=2048): 259.907
b: 128, m: 2048, n: 438, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x438x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x438x2048): 140.567
b: 128, m: 2048, n: 2048, k: 438,
Elapsed time for attention_prob_times_values (128x2048x2048x438): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x438): 99.645
Elapsed time for attention_linear_projection (4x14016x14016, b=2048): 0.0130
Throughput (in TFLOP/s) for attention_linear_projection (4x14016x14016, b=2048): 248.145
Elapsed time for mlp_h_to_4h (4x14016x56064, b=2048): 0.0492
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14016x56064, b=2048): 261.576
Elapsed time for mlp_4h_to_h (4x56064x14016, b=2048): 0.0487
Throughput (in TFLOP/s) for mlp_4h_to_h (4x56064x14016, b=2048): 264.593

Attention duration (in seconds): 0.0582
Attention throughput (in TFLOP/s): 237.424
MLP duration (in seconds): 0.0979
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1561
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14080, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14080x42240, b=2048): 0.0374
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14080x42240, b=2048): 260.766
b: 128, m: 2048, n: 440, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x440x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x440x2048): 144.105
b: 128, m: 2048, n: 2048, k: 440,
Elapsed time for attention_prob_times_values (128x2048x2048x440): 0.0036
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x440): 131.066
Elapsed time for attention_linear_projection (4x14080x14080, b=2048): 0.0131
Throughput (in TFLOP/s) for attention_linear_projection (4x14080x14080, b=2048): 248.788
Elapsed time for mlp_h_to_4h (4x14080x56320, b=2048): 0.0493
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14080x56320, b=2048): 263.387
Elapsed time for mlp_4h_to_h (4x56320x14080, b=2048): 0.0493
Throughput (in TFLOP/s) for mlp_4h_to_h (4x56320x14080, b=2048): 263.609

Attention duration (in seconds): 0.0573
Attention throughput (in TFLOP/s): 243.205
MLP duration (in seconds): 0.0986
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1559
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14144, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14144x42432, b=2048): 0.0378
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14144x42432, b=2048): 260.347
b: 128, m: 2048, n: 442, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x442x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x442x2048): 142.338
b: 128, m: 2048, n: 2048, k: 442,
Elapsed time for attention_prob_times_values (128x2048x2048x442): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x442): 101.127
Elapsed time for attention_linear_projection (4x14144x14144, b=2048): 0.0131
Throughput (in TFLOP/s) for attention_linear_projection (4x14144x14144, b=2048): 249.864
Elapsed time for mlp_h_to_4h (4x14144x56576, b=2048): 0.0501
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14144x56576, b=2048): 261.500
Elapsed time for mlp_4h_to_h (4x56576x14144, b=2048): 0.0497
Throughput (in TFLOP/s) for mlp_4h_to_h (4x56576x14144, b=2048): 263.962

Attention duration (in seconds): 0.0589
Attention throughput (in TFLOP/s): 238.651
MLP duration (in seconds): 0.0998
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1587
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14208, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14208x42624, b=2048): 0.0380
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14208x42624, b=2048): 261.285
b: 128, m: 2048, n: 444, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x444x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x444x2048): 147.496
b: 128, m: 2048, n: 2048, k: 444,
Elapsed time for attention_prob_times_values (128x2048x2048x444): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x444): 121.778
Elapsed time for attention_linear_projection (4x14208x14208, b=2048): 0.0131
Throughput (in TFLOP/s) for attention_linear_projection (4x14208x14208, b=2048): 251.591
Elapsed time for mlp_h_to_4h (4x14208x56832, b=2048): 0.0503
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14208x56832, b=2048): 262.934
Elapsed time for mlp_4h_to_h (4x56832x14208, b=2048): 0.0506
Throughput (in TFLOP/s) for mlp_4h_to_h (4x56832x14208, b=2048): 261.661

Attention duration (in seconds): 0.0583
Attention throughput (in TFLOP/s): 243.412
MLP duration (in seconds): 0.1009
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1591
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14272, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14272x42816, b=2048): 0.0392
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14272x42816, b=2048): 255.436
b: 128, m: 2048, n: 446, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x446x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x446x2048): 142.951
b: 128, m: 2048, n: 2048, k: 446,
Elapsed time for attention_prob_times_values (128x2048x2048x446): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x446): 102.308
Elapsed time for attention_linear_projection (4x14272x14272, b=2048): 0.0136
Throughput (in TFLOP/s) for attention_linear_projection (4x14272x14272, b=2048): 245.630
Elapsed time for mlp_h_to_4h (4x14272x57088, b=2048): 0.0520
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14272x57088, b=2048): 256.856
Elapsed time for mlp_4h_to_h (4x57088x14272, b=2048): 0.0513
Throughput (in TFLOP/s) for mlp_4h_to_h (4x57088x14272, b=2048): 260.465

Attention duration (in seconds): 0.0608
Attention throughput (in TFLOP/s): 235.262
MLP duration (in seconds): 0.1032
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1640
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14336, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14336x43008, b=2048): 0.0393
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14336x43008, b=2048): 256.781
b: 128, m: 2048, n: 448, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x448x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x448x2048): 147.746
b: 128, m: 2048, n: 2048, k: 448,
Elapsed time for attention_prob_times_values (128x2048x2048x448): 0.0035
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x448): 135.930
Elapsed time for attention_linear_projection (4x14336x14336, b=2048): 0.0137
Throughput (in TFLOP/s) for attention_linear_projection (4x14336x14336, b=2048): 246.557
Elapsed time for mlp_h_to_4h (4x14336x57344, b=2048): 0.0525
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14336x57344, b=2048): 256.354
Elapsed time for mlp_4h_to_h (4x57344x14336, b=2048): 0.0516
Throughput (in TFLOP/s) for mlp_4h_to_h (4x57344x14336, b=2048): 261.125

Attention duration (in seconds): 0.0598
Attention throughput (in TFLOP/s): 241.356
MLP duration (in seconds): 0.1041
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1639
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14400, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14400x43200, b=2048): 0.0397
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14400x43200, b=2048): 256.984
b: 128, m: 2048, n: 450, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x450x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x450x2048): 144.614
b: 128, m: 2048, n: 2048, k: 450,
Elapsed time for attention_prob_times_values (128x2048x2048x450): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x450): 98.787
Elapsed time for attention_linear_projection (4x14400x14400, b=2048): 0.0137
Throughput (in TFLOP/s) for attention_linear_projection (4x14400x14400, b=2048): 248.006
Elapsed time for mlp_h_to_4h (4x14400x57600, b=2048): 0.0525
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14400x57600, b=2048): 258.864
Elapsed time for mlp_4h_to_h (4x57600x14400, b=2048): 0.0520
Throughput (in TFLOP/s) for mlp_4h_to_h (4x57600x14400, b=2048): 261.148

Attention duration (in seconds): 0.0616
Attention throughput (in TFLOP/s): 236.328
MLP duration (in seconds): 0.1045
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1661
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14464, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14464x43392, b=2048): 0.0399
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14464x43392, b=2048): 257.633
b: 128, m: 2048, n: 452, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x452x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x452x2048): 149.800
b: 128, m: 2048, n: 2048, k: 452,
Elapsed time for attention_prob_times_values (128x2048x2048x452): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x452): 117.098
Elapsed time for attention_linear_projection (4x14464x14464, b=2048): 0.0138
Throughput (in TFLOP/s) for attention_linear_projection (4x14464x14464, b=2048): 248.855
Elapsed time for mlp_h_to_4h (4x14464x57856, b=2048): 0.0528
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14464x57856, b=2048): 259.857
Elapsed time for mlp_4h_to_h (4x57856x14464, b=2048): 0.0525
Throughput (in TFLOP/s) for mlp_4h_to_h (4x57856x14464, b=2048): 261.016

Attention duration (in seconds): 0.0611
Attention throughput (in TFLOP/s): 240.395
MLP duration (in seconds): 0.1053
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1664
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14528, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14528x43584, b=2048): 0.0401
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14528x43584, b=2048): 258.725
b: 128, m: 2048, n: 454, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x454x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x454x2048): 144.477
b: 128, m: 2048, n: 2048, k: 454,
Elapsed time for attention_prob_times_values (128x2048x2048x454): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x454): 99.158
Elapsed time for attention_linear_projection (4x14528x14528, b=2048): 0.0138
Throughput (in TFLOP/s) for attention_linear_projection (4x14528x14528, b=2048): 249.933
Elapsed time for mlp_h_to_4h (4x14528x58112, b=2048): 0.0530
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14528x58112, b=2048): 260.854
Elapsed time for mlp_4h_to_h (4x58112x14528, b=2048): 0.0531
Throughput (in TFLOP/s) for mlp_4h_to_h (4x58112x14528, b=2048): 260.651

Attention duration (in seconds): 0.0622
Attention throughput (in TFLOP/s): 237.968
MLP duration (in seconds): 0.1061
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1683
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14592, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14592x43776, b=2048): 0.0402
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14592x43776, b=2048): 260.387
b: 128, m: 2048, n: 456, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x456x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x456x2048): 149.258
b: 128, m: 2048, n: 2048, k: 456,
Elapsed time for attention_prob_times_values (128x2048x2048x456): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x456): 129.977
Elapsed time for attention_linear_projection (4x14592x14592, b=2048): 0.0139
Throughput (in TFLOP/s) for attention_linear_projection (4x14592x14592, b=2048): 251.330
Elapsed time for mlp_h_to_4h (4x14592x58368, b=2048): 0.0539
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14592x58368, b=2048): 258.771
Elapsed time for mlp_4h_to_h (4x58368x14592, b=2048): 0.0534
Throughput (in TFLOP/s) for mlp_4h_to_h (4x58368x14592, b=2048): 261.217

Attention duration (in seconds): 0.0611
Attention throughput (in TFLOP/s): 244.329
MLP duration (in seconds): 0.1073
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1685
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14656, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14656x43968, b=2048): 0.0414
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14656x43968, b=2048): 255.101
b: 128, m: 2048, n: 458, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x458x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x458x2048): 146.090
b: 128, m: 2048, n: 2048, k: 458,
Elapsed time for attention_prob_times_values (128x2048x2048x458): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x458): 99.448
Elapsed time for attention_linear_projection (4x14656x14656, b=2048): 0.0143
Throughput (in TFLOP/s) for attention_linear_projection (4x14656x14656, b=2048): 246.055
Elapsed time for mlp_h_to_4h (4x14656x58624, b=2048): 0.0550
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14656x58624, b=2048): 256.048
Elapsed time for mlp_4h_to_h (4x58624x14656, b=2048): 0.0541
Throughput (in TFLOP/s) for mlp_4h_to_h (4x58624x14656, b=2048): 260.431

Attention duration (in seconds): 0.0640
Attention throughput (in TFLOP/s): 235.319
MLP duration (in seconds): 0.1090
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1730
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14720, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14720x44160, b=2048): 0.0416
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14720x44160, b=2048): 255.806
b: 128, m: 2048, n: 460, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x460x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x460x2048): 152.070
b: 128, m: 2048, n: 2048, k: 460,
Elapsed time for attention_prob_times_values (128x2048x2048x460): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x460): 118.985
Elapsed time for attention_linear_projection (4x14720x14720, b=2048): 0.0144
Throughput (in TFLOP/s) for attention_linear_projection (4x14720x14720, b=2048): 247.117
Elapsed time for mlp_h_to_4h (4x14720x58880, b=2048): 0.0552
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14720x58880, b=2048): 257.463
Elapsed time for mlp_4h_to_h (4x58880x14720, b=2048): 0.0543
Throughput (in TFLOP/s) for mlp_4h_to_h (4x58880x14720, b=2048): 261.394

Attention duration (in seconds): 0.0634
Attention throughput (in TFLOP/s): 239.564
MLP duration (in seconds): 0.1095
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1729
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14784, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14784x44352, b=2048): 0.0418
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14784x44352, b=2048): 257.030
b: 128, m: 2048, n: 462, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x462x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x462x2048): 147.450
b: 128, m: 2048, n: 2048, k: 462,
Elapsed time for attention_prob_times_values (128x2048x2048x462): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x462): 99.993
Elapsed time for attention_linear_projection (4x14784x14784, b=2048): 0.0144
Throughput (in TFLOP/s) for attention_linear_projection (4x14784x14784, b=2048): 248.318
Elapsed time for mlp_h_to_4h (4x14784x59136, b=2048): 0.0554
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14784x59136, b=2048): 258.581
Elapsed time for mlp_4h_to_h (4x59136x14784, b=2048): 0.0548
Throughput (in TFLOP/s) for mlp_4h_to_h (4x59136x14784, b=2048): 261.549

Attention duration (in seconds): 0.0645
Attention throughput (in TFLOP/s): 237.301
MLP duration (in seconds): 0.1102
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1747
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14848, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14848x44544, b=2048): 0.0420
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14848x44544, b=2048): 257.861
b: 128, m: 2048, n: 464, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x464x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x464x2048): 152.787
b: 128, m: 2048, n: 2048, k: 464,
Elapsed time for attention_prob_times_values (128x2048x2048x464): 0.0037
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x464): 132.948
Elapsed time for attention_linear_projection (4x14848x14848, b=2048): 0.0145
Throughput (in TFLOP/s) for attention_linear_projection (4x14848x14848, b=2048): 249.335
Elapsed time for mlp_h_to_4h (4x14848x59392, b=2048): 0.0557
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14848x59392, b=2048): 259.319
Elapsed time for mlp_4h_to_h (4x59392x14848, b=2048): 0.0551
Throughput (in TFLOP/s) for mlp_4h_to_h (4x59392x14848, b=2048): 262.084

Attention duration (in seconds): 0.0635
Attention throughput (in TFLOP/s): 243.153
MLP duration (in seconds): 0.1108
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1744
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14912, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14912x44736, b=2048): 0.0422
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14912x44736, b=2048): 258.804
b: 128, m: 2048, n: 466, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x466x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x466x2048): 148.442
b: 128, m: 2048, n: 2048, k: 466,
Elapsed time for attention_prob_times_values (128x2048x2048x466): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x466): 101.073
Elapsed time for attention_linear_projection (4x14912x14912, b=2048): 0.0146
Throughput (in TFLOP/s) for attention_linear_projection (4x14912x14912, b=2048): 249.943
Elapsed time for mlp_h_to_4h (4x14912x59648, b=2048): 0.0559
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14912x59648, b=2048): 260.567
Elapsed time for mlp_4h_to_h (4x59648x14912, b=2048): 0.0558
Throughput (in TFLOP/s) for mlp_4h_to_h (4x59648x14912, b=2048): 261.040

Attention duration (in seconds): 0.0651
Attention throughput (in TFLOP/s): 239.120
MLP duration (in seconds): 0.1118
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1769
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 14976, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14976x44928, b=2048): 0.0426
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14976x44928, b=2048): 258.711
b: 128, m: 2048, n: 468, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x468x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x468x2048): 154.240
b: 128, m: 2048, n: 2048, k: 468,
Elapsed time for attention_prob_times_values (128x2048x2048x468): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x468): 120.674
Elapsed time for attention_linear_projection (4x14976x14976, b=2048): 0.0147
Throughput (in TFLOP/s) for attention_linear_projection (4x14976x14976, b=2048): 250.817
Elapsed time for mlp_h_to_4h (4x14976x59904, b=2048): 0.0564
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14976x59904, b=2048): 260.792
Elapsed time for mlp_4h_to_h (4x59904x14976, b=2048): 0.0561
Throughput (in TFLOP/s) for mlp_4h_to_h (4x59904x14976, b=2048): 262.061

Attention duration (in seconds): 0.0647
Attention throughput (in TFLOP/s): 242.774
MLP duration (in seconds): 0.1124
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1771
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15040, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15040x45120, b=2048): 0.0426
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15040x45120, b=2048): 260.947
b: 128, m: 2048, n: 470, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x470x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x470x2048): 149.547
b: 128, m: 2048, n: 2048, k: 470,
Elapsed time for attention_prob_times_values (128x2048x2048x470): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x470): 101.798
Elapsed time for attention_linear_projection (4x15040x15040, b=2048): 0.0147
Throughput (in TFLOP/s) for attention_linear_projection (4x15040x15040, b=2048): 252.132
Elapsed time for mlp_h_to_4h (4x15040x60160, b=2048): 0.0567
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15040x60160, b=2048): 261.248
Elapsed time for mlp_4h_to_h (4x60160x15040, b=2048): 0.0565
Throughput (in TFLOP/s) for mlp_4h_to_h (4x60160x15040, b=2048): 262.350

Attention duration (in seconds): 0.0656
Attention throughput (in TFLOP/s): 241.226
MLP duration (in seconds): 0.1133
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1789
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15104, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15104x45312, b=2048): 0.0430
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15104x45312, b=2048): 260.475
b: 128, m: 2048, n: 472, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x472x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x472x2048): 154.518
b: 128, m: 2048, n: 2048, k: 472,
Elapsed time for attention_prob_times_values (128x2048x2048x472): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x472): 133.006
Elapsed time for attention_linear_projection (4x15104x15104, b=2048): 0.0148
Throughput (in TFLOP/s) for attention_linear_projection (4x15104x15104, b=2048): 252.921
Elapsed time for mlp_h_to_4h (4x15104x60416, b=2048): 0.0570
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15104x60416, b=2048): 262.080
Elapsed time for mlp_4h_to_h (4x60416x15104, b=2048): 0.0573
Throughput (in TFLOP/s) for mlp_4h_to_h (4x60416x15104, b=2048): 260.791

Attention duration (in seconds): 0.0649
Attention throughput (in TFLOP/s): 245.920
MLP duration (in seconds): 0.1144
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1793
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15168, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15168x45504, b=2048): 0.0442
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15168x45504, b=2048): 255.966
b: 128, m: 2048, n: 474, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x474x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x474x2048): 150.088
b: 128, m: 2048, n: 2048, k: 474,
Elapsed time for attention_prob_times_values (128x2048x2048x474): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x474): 102.212
Elapsed time for attention_linear_projection (4x15168x15168, b=2048): 0.0152
Throughput (in TFLOP/s) for attention_linear_projection (4x15168x15168, b=2048): 248.116
Elapsed time for mlp_h_to_4h (4x15168x60672, b=2048): 0.0588
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15168x60672, b=2048): 256.403
Elapsed time for mlp_4h_to_h (4x60672x15168, b=2048): 0.0578
Throughput (in TFLOP/s) for mlp_4h_to_h (4x60672x15168, b=2048): 260.948

Attention duration (in seconds): 0.0677
Attention throughput (in TFLOP/s): 237.604
MLP duration (in seconds): 0.1166
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1843
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15232, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15232x45696, b=2048): 0.0444
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15232x45696, b=2048): 256.826
b: 128, m: 2048, n: 476, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x476x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x476x2048): 156.842
b: 128, m: 2048, n: 2048, k: 476,
Elapsed time for attention_prob_times_values (128x2048x2048x476): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x476): 121.719
Elapsed time for attention_linear_projection (4x15232x15232, b=2048): 0.0153
Throughput (in TFLOP/s) for attention_linear_projection (4x15232x15232, b=2048): 248.641
Elapsed time for mlp_h_to_4h (4x15232x60928, b=2048): 0.0591
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15232x60928, b=2048): 257.374
Elapsed time for mlp_4h_to_h (4x60928x15232, b=2048): 0.0582
Throughput (in TFLOP/s) for mlp_4h_to_h (4x60928x15232, b=2048): 261.062

Attention duration (in seconds): 0.0671
Attention throughput (in TFLOP/s): 241.662
MLP duration (in seconds): 0.1173
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1845
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15296, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15296x45888, b=2048): 0.0447
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15296x45888, b=2048): 257.496
b: 128, m: 2048, n: 478, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x478x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x478x2048): 151.739
b: 128, m: 2048, n: 2048, k: 478,
Elapsed time for attention_prob_times_values (128x2048x2048x478): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x478): 103.721
Elapsed time for attention_linear_projection (4x15296x15296, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_linear_projection (4x15296x15296, b=2048): 249.513
Elapsed time for mlp_h_to_4h (4x15296x61184, b=2048): 0.0593
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15296x61184, b=2048): 258.712
Elapsed time for mlp_4h_to_h (4x61184x15296, b=2048): 0.0588
Throughput (in TFLOP/s) for mlp_4h_to_h (4x61184x15296, b=2048): 260.936

Attention duration (in seconds): 0.0684
Attention throughput (in TFLOP/s): 239.336
MLP duration (in seconds): 0.1180
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1864
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15360, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15360x46080, b=2048): 0.0449
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15360x46080, b=2048): 258.518
b: 128, m: 2048, n: 480, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x480x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x480x2048): 157.871
b: 128, m: 2048, n: 2048, k: 480,
Elapsed time for attention_prob_times_values (128x2048x2048x480): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x480): 137.226
Elapsed time for attention_linear_projection (4x15360x15360, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_linear_projection (4x15360x15360, b=2048): 250.998
Elapsed time for mlp_h_to_4h (4x15360x61440, b=2048): 0.0594
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15360x61440, b=2048): 260.151
Elapsed time for mlp_4h_to_h (4x61440x15360, b=2048): 0.0593
Throughput (in TFLOP/s) for mlp_4h_to_h (4x61440x15360, b=2048): 260.771

Attention duration (in seconds): 0.0673
Attention throughput (in TFLOP/s): 245.142
MLP duration (in seconds): 0.1187
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1860
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15424, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15424x46272, b=2048): 0.0451
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15424x46272, b=2048): 259.023
b: 128, m: 2048, n: 482, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x482x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x482x2048): 152.890
b: 128, m: 2048, n: 2048, k: 482,
Elapsed time for attention_prob_times_values (128x2048x2048x482): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x482): 104.669
Elapsed time for attention_linear_projection (4x15424x15424, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_linear_projection (4x15424x15424, b=2048): 252.297
Elapsed time for mlp_h_to_4h (4x15424x61696, b=2048): 0.0598
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15424x61696, b=2048): 260.923
Elapsed time for mlp_4h_to_h (4x61696x15424, b=2048): 0.0591
Throughput (in TFLOP/s) for mlp_4h_to_h (4x61696x15424, b=2048): 263.855

Attention duration (in seconds): 0.0689
Attention throughput (in TFLOP/s): 241.229
MLP duration (in seconds): 0.1188
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1878
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15488, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15488x46464, b=2048): 0.0451
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15488x46464, b=2048): 261.501
b: 128, m: 2048, n: 484, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x484x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x484x2048): 159.559
b: 128, m: 2048, n: 2048, k: 484,
Elapsed time for attention_prob_times_values (128x2048x2048x484): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x484): 123.828
Elapsed time for attention_linear_projection (4x15488x15488, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_linear_projection (4x15488x15488, b=2048): 254.710
Elapsed time for mlp_h_to_4h (4x15488x61952, b=2048): 0.0597
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15488x61952, b=2048): 263.400
Elapsed time for mlp_4h_to_h (4x61952x15488, b=2048): 0.0598
Throughput (in TFLOP/s) for mlp_4h_to_h (4x61952x15488, b=2048): 263.032

Attention duration (in seconds): 0.0680
Attention throughput (in TFLOP/s): 246.574
MLP duration (in seconds): 0.1195
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1874
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15552, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15552x46656, b=2048): 0.0461
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15552x46656, b=2048): 257.888
b: 128, m: 2048, n: 486, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x486x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x486x2048): 154.398
b: 128, m: 2048, n: 2048, k: 486,
Elapsed time for attention_prob_times_values (128x2048x2048x486): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x486): 105.477
Elapsed time for attention_linear_projection (4x15552x15552, b=2048): 0.0159
Throughput (in TFLOP/s) for attention_linear_projection (4x15552x15552, b=2048): 249.483
Elapsed time for mlp_h_to_4h (4x15552x62208, b=2048): 0.0614
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15552x62208, b=2048): 258.117
Elapsed time for mlp_4h_to_h (4x62208x15552, b=2048): 0.0603
Throughput (in TFLOP/s) for mlp_4h_to_h (4x62208x15552, b=2048): 262.968

Attention duration (in seconds): 0.0703
Attention throughput (in TFLOP/s): 240.290
MLP duration (in seconds): 0.1217
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1920
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15616, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15616x46848, b=2048): 0.0463
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15616x46848, b=2048): 258.864
b: 128, m: 2048, n: 488, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x488x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x488x2048): 159.408
b: 128, m: 2048, n: 2048, k: 488,
Elapsed time for attention_prob_times_values (128x2048x2048x488): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x488): 136.625
Elapsed time for attention_linear_projection (4x15616x15616, b=2048): 0.0160
Throughput (in TFLOP/s) for attention_linear_projection (4x15616x15616, b=2048): 250.129
Elapsed time for mlp_h_to_4h (4x15616x62464, b=2048): 0.0617
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15616x62464, b=2048): 258.841
Elapsed time for mlp_4h_to_h (4x62464x15616, b=2048): 0.0607
Throughput (in TFLOP/s) for mlp_4h_to_h (4x62464x15616, b=2048): 263.080

Attention duration (in seconds): 0.0694
Attention throughput (in TFLOP/s): 245.387
MLP duration (in seconds): 0.1225
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1919
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15680, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15680x47040, b=2048): 0.0468
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15680x47040, b=2048): 258.114
b: 128, m: 2048, n: 490, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x490x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x490x2048): 155.614
b: 128, m: 2048, n: 2048, k: 490,
Elapsed time for attention_prob_times_values (128x2048x2048x490): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x490): 106.350
Elapsed time for attention_linear_projection (4x15680x15680, b=2048): 0.0161
Throughput (in TFLOP/s) for attention_linear_projection (4x15680x15680, b=2048): 250.479
Elapsed time for mlp_h_to_4h (4x15680x62720, b=2048): 0.0620
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15680x62720, b=2048): 259.918
Elapsed time for mlp_4h_to_h (4x62720x15680, b=2048): 0.0613
Throughput (in TFLOP/s) for mlp_4h_to_h (4x62720x15680, b=2048): 262.678

Attention duration (in seconds): 0.0712
Attention throughput (in TFLOP/s): 240.984
MLP duration (in seconds): 0.1233
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1946
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15744, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15744x47232, b=2048): 0.0470
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15744x47232, b=2048): 259.276
b: 128, m: 2048, n: 492, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x492x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x492x2048): 162.458
b: 128, m: 2048, n: 2048, k: 492,
Elapsed time for attention_prob_times_values (128x2048x2048x492): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x492): 126.702
Elapsed time for attention_linear_projection (4x15744x15744, b=2048): 0.0161
Throughput (in TFLOP/s) for attention_linear_projection (4x15744x15744, b=2048): 252.610
Elapsed time for mlp_h_to_4h (4x15744x62976, b=2048): 0.0624
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15744x62976, b=2048): 260.257
Elapsed time for mlp_4h_to_h (4x62976x15744, b=2048): 0.0616
Throughput (in TFLOP/s) for mlp_4h_to_h (4x62976x15744, b=2048): 263.728

Attention duration (in seconds): 0.0705
Attention throughput (in TFLOP/s): 245.447
MLP duration (in seconds): 0.1240
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1945
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15808, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15808x47424, b=2048): 0.0472
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15808x47424, b=2048): 260.237
b: 128, m: 2048, n: 494, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x494x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x494x2048): 156.542
b: 128, m: 2048, n: 2048, k: 494,
Elapsed time for attention_prob_times_values (128x2048x2048x494): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x494): 107.628
Elapsed time for attention_linear_projection (4x15808x15808, b=2048): 0.0162
Throughput (in TFLOP/s) for attention_linear_projection (4x15808x15808, b=2048): 252.225
Elapsed time for mlp_h_to_4h (4x15808x63232, b=2048): 0.0628
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15808x63232, b=2048): 260.963
Elapsed time for mlp_4h_to_h (4x63232x15808, b=2048): 0.0623
Throughput (in TFLOP/s) for mlp_4h_to_h (4x63232x15808, b=2048): 262.950

Attention duration (in seconds): 0.0717
Attention throughput (in TFLOP/s): 243.044
MLP duration (in seconds): 0.1250
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1968
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15872, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15872x47616, b=2048): 0.0474
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15872x47616, b=2048): 261.328
b: 128, m: 2048, n: 496, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x496x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x496x2048): 162.895
b: 128, m: 2048, n: 2048, k: 496,
Elapsed time for attention_prob_times_values (128x2048x2048x496): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x496): 141.049
Elapsed time for attention_linear_projection (4x15872x15872, b=2048): 0.0163
Throughput (in TFLOP/s) for attention_linear_projection (4x15872x15872, b=2048): 253.116
Elapsed time for mlp_h_to_4h (4x15872x63488, b=2048): 0.0629
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15872x63488, b=2048): 262.322
Elapsed time for mlp_4h_to_h (4x63488x15872, b=2048): 0.0634
Throughput (in TFLOP/s) for mlp_4h_to_h (4x63488x15872, b=2048): 260.230

Attention duration (in seconds): 0.0707
Attention throughput (in TFLOP/s): 248.464
MLP duration (in seconds): 0.1264
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1971
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 15936, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15936x47808, b=2048): 0.0489
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15936x47808, b=2048): 255.456
b: 128, m: 2048, n: 498, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x498x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x498x2048): 157.499
b: 128, m: 2048, n: 2048, k: 498,
Elapsed time for attention_prob_times_values (128x2048x2048x498): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x498): 109.133
Elapsed time for attention_linear_projection (4x15936x15936, b=2048): 0.0167
Throughput (in TFLOP/s) for attention_linear_projection (4x15936x15936, b=2048): 248.664
Elapsed time for mlp_h_to_4h (4x15936x63744, b=2048): 0.0650
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15936x63744, b=2048): 256.167
Elapsed time for mlp_4h_to_h (4x63744x15936, b=2048): 0.0640
Throughput (in TFLOP/s) for mlp_4h_to_h (4x63744x15936, b=2048): 259.872

Attention duration (in seconds): 0.0739
Attention throughput (in TFLOP/s): 239.714
MLP duration (in seconds): 0.1290
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2029
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16000, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16000x48000, b=2048): 0.0491
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16000x48000, b=2048): 256.063
b: 128, m: 2048, n: 500, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x500x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x500x2048): 164.617
b: 128, m: 2048, n: 2048, k: 500,
Elapsed time for attention_prob_times_values (128x2048x2048x500): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x500): 129.429
Elapsed time for attention_linear_projection (4x16000x16000, b=2048): 0.0168
Throughput (in TFLOP/s) for attention_linear_projection (4x16000x16000, b=2048): 249.793
Elapsed time for mlp_h_to_4h (4x16000x64000, b=2048): 0.0802
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16000x64000, b=2048): 209.202
Elapsed time for mlp_4h_to_h (4x64000x16000, b=2048): 0.0641
Throughput (in TFLOP/s) for mlp_4h_to_h (4x64000x16000, b=2048): 261.932

Attention duration (in seconds): 0.0733
Attention throughput (in TFLOP/s): 243.399
MLP duration (in seconds): 0.1442
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2176
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16064, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16064x48192, b=2048): 0.0491
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16064x48192, b=2048): 258.113
b: 128, m: 2048, n: 502, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x502x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x502x2048): 159.077
b: 128, m: 2048, n: 2048, k: 502,
Elapsed time for attention_prob_times_values (128x2048x2048x502): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x502): 110.240
Elapsed time for attention_linear_projection (4x16064x16064, b=2048): 0.0169
Throughput (in TFLOP/s) for attention_linear_projection (4x16064x16064, b=2048): 250.713
Elapsed time for mlp_h_to_4h (4x16064x64256, b=2048): 0.0726
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16064x64256, b=2048): 232.958
Elapsed time for mlp_4h_to_h (4x64256x16064, b=2048): 0.0649
Throughput (in TFLOP/s) for mlp_4h_to_h (4x64256x16064, b=2048): 260.488

Attention duration (in seconds): 0.0743
Attention throughput (in TFLOP/s): 242.182
MLP duration (in seconds): 0.1375
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2118
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16128, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16128x48384, b=2048): 0.0495
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16128x48384, b=2048): 258.145
b: 128, m: 2048, n: 504, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x504x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x504x2048): 164.289
b: 128, m: 2048, n: 2048, k: 504,
Elapsed time for attention_prob_times_values (128x2048x2048x504): 0.0039
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x504): 140.173
Elapsed time for attention_linear_projection (4x16128x16128, b=2048): 0.0170
Throughput (in TFLOP/s) for attention_linear_projection (4x16128x16128, b=2048): 251.287
Elapsed time for mlp_h_to_4h (4x16128x64512, b=2048): 0.0664
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16128x64512, b=2048): 256.873
Elapsed time for mlp_4h_to_h (4x64512x16128, b=2048): 0.0647
Throughput (in TFLOP/s) for mlp_4h_to_h (4x64512x16128, b=2048): 263.595

Attention duration (in seconds): 0.0736
Attention throughput (in TFLOP/s): 246.182
MLP duration (in seconds): 0.1310
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2047
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16192, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16192x48576, b=2048): 0.0493
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16192x48576, b=2048): 261.302
b: 128, m: 2048, n: 506, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x506x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x506x2048): 160.413
b: 128, m: 2048, n: 2048, k: 506,
Elapsed time for attention_prob_times_values (128x2048x2048x506): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x506): 111.943
Elapsed time for attention_linear_projection (4x16192x16192, b=2048): 0.0169
Throughput (in TFLOP/s) for attention_linear_projection (4x16192x16192, b=2048): 254.304
Elapsed time for mlp_h_to_4h (4x16192x64768, b=2048): 0.0652
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16192x64768, b=2048): 263.455
Elapsed time for mlp_4h_to_h (4x64768x16192, b=2048): 0.0648
Throughput (in TFLOP/s) for mlp_4h_to_h (4x64768x16192, b=2048): 265.295

Attention duration (in seconds): 0.0744
Attention throughput (in TFLOP/s): 245.387
MLP duration (in seconds): 0.1300
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2044
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16256, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16256x48768, b=2048): 0.0496
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16256x48768, b=2048): 261.705
b: 128, m: 2048, n: 508, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x508x2048): 0.0032
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x508x2048): 168.247
b: 128, m: 2048, n: 2048, k: 508,
Elapsed time for attention_prob_times_values (128x2048x2048x508): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x508): 132.589
Elapsed time for attention_linear_projection (4x16256x16256, b=2048): 0.0170
Throughput (in TFLOP/s) for attention_linear_projection (4x16256x16256, b=2048): 255.281
Elapsed time for mlp_h_to_4h (4x16256x65024, b=2048): 0.0656
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16256x65024, b=2048): 263.936
Elapsed time for mlp_4h_to_h (4x65024x16256, b=2048): 0.0654
Throughput (in TFLOP/s) for mlp_4h_to_h (4x65024x16256, b=2048): 264.911

Attention duration (in seconds): 0.0739
Attention throughput (in TFLOP/s): 248.951
MLP duration (in seconds): 0.1310
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2049
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16320, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16320x48960, b=2048): 0.0500
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16320x48960, b=2048): 261.779
b: 128, m: 2048, n: 510, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x510x2048): 0.0034
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x510x2048): 162.034
b: 128, m: 2048, n: 2048, k: 510,
Elapsed time for attention_prob_times_values (128x2048x2048x510): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x510): 113.329
Elapsed time for attention_linear_projection (4x16320x16320, b=2048): 0.0170
Throughput (in TFLOP/s) for attention_linear_projection (4x16320x16320, b=2048): 256.049
Elapsed time for mlp_h_to_4h (4x16320x65280, b=2048): 0.0921
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16320x65280, b=2048): 189.595
Elapsed time for mlp_4h_to_h (4x65280x16320, b=2048): 0.0661
Throughput (in TFLOP/s) for mlp_4h_to_h (4x65280x16320, b=2048): 263.929

Attention duration (in seconds): 0.0753
Attention throughput (in TFLOP/s): 246.472
MLP duration (in seconds): 0.1582
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2335
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16384, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16384x49152, b=2048): 0.0498
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16384x49152, b=2048): 264.859
b: 128, m: 2048, n: 512, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x512x2048): 0.0033
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x512x2048): 168.667
b: 128, m: 2048, n: 2048, k: 512,
Elapsed time for attention_prob_times_values (128x2048x2048x512): 0.0038
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x512): 145.305
Elapsed time for attention_linear_projection (4x16384x16384, b=2048): 0.0172
Throughput (in TFLOP/s) for attention_linear_projection (4x16384x16384, b=2048): 255.814
Elapsed time for mlp_h_to_4h (4x16384x65536, b=2048): 0.0900
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16384x65536, b=2048): 195.503
Elapsed time for mlp_4h_to_h (4x65536x16384, b=2048): 0.0665
Throughput (in TFLOP/s) for mlp_4h_to_h (4x65536x16384, b=2048): 264.411

Attention duration (in seconds): 0.0741
Attention throughput (in TFLOP/s): 252.417
MLP duration (in seconds): 0.1565
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2306
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16448, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16448x49344, b=2048): 0.0513
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16448x49344, b=2048): 259.235
b: 128, m: 2048, n: 514, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x514x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x514x2048): 142.041
b: 128, m: 2048, n: 2048, k: 514,
Elapsed time for attention_prob_times_values (128x2048x2048x514): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x514): 108.975
Elapsed time for attention_linear_projection (4x16448x16448, b=2048): 0.0176
Throughput (in TFLOP/s) for attention_linear_projection (4x16448x16448, b=2048): 252.039
Elapsed time for mlp_h_to_4h (4x16448x65792, b=2048): 0.0681
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16448x65792, b=2048): 260.303
Elapsed time for mlp_4h_to_h (4x65792x16448, b=2048): 0.0671
Throughput (in TFLOP/s) for mlp_4h_to_h (4x65792x16448, b=2048): 264.112

Attention duration (in seconds): 0.0778
Attention throughput (in TFLOP/s): 241.981
MLP duration (in seconds): 0.1352
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2131
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16512, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16512x49536, b=2048): 0.0516
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16512x49536, b=2048): 259.905
b: 128, m: 2048, n: 516, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x516x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x516x2048): 146.810
b: 128, m: 2048, n: 2048, k: 516,
Elapsed time for attention_prob_times_values (128x2048x2048x516): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x516): 127.516
Elapsed time for attention_linear_projection (4x16512x16512, b=2048): 0.0177
Throughput (in TFLOP/s) for attention_linear_projection (4x16512x16512, b=2048): 252.938
Elapsed time for mlp_h_to_4h (4x16512x66048, b=2048): 0.0686
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16512x66048, b=2048): 260.464
Elapsed time for mlp_4h_to_h (4x66048x16512, b=2048): 0.0679
Throughput (in TFLOP/s) for mlp_4h_to_h (4x66048x16512, b=2048): 262.963

Attention duration (in seconds): 0.0773
Attention throughput (in TFLOP/s): 245.358
MLP duration (in seconds): 0.1366
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2139
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16576, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16576x49728, b=2048): 0.0521
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16576x49728, b=2048): 259.017
b: 128, m: 2048, n: 518, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x518x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x518x2048): 143.182
b: 128, m: 2048, n: 2048, k: 518,
Elapsed time for attention_prob_times_values (128x2048x2048x518): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x518): 109.002
Elapsed time for attention_linear_projection (4x16576x16576, b=2048): 0.0177
Throughput (in TFLOP/s) for attention_linear_projection (4x16576x16576, b=2048): 254.192
Elapsed time for mlp_h_to_4h (4x16576x66304, b=2048): 0.0696
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16576x66304, b=2048): 258.775
Elapsed time for mlp_4h_to_h (4x66304x16576, b=2048): 0.0681
Throughput (in TFLOP/s) for mlp_4h_to_h (4x66304x16576, b=2048): 264.479

Attention duration (in seconds): 0.0788
Attention throughput (in TFLOP/s): 242.516
MLP duration (in seconds): 0.1377
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2165
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16640, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16640x49920, b=2048): 0.0519
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16640x49920, b=2048): 262.233
b: 128, m: 2048, n: 520, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x520x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x520x2048): 145.684
b: 128, m: 2048, n: 2048, k: 520,
Elapsed time for attention_prob_times_values (128x2048x2048x520): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x520): 138.720
Elapsed time for attention_linear_projection (4x16640x16640, b=2048): 0.0178
Throughput (in TFLOP/s) for attention_linear_projection (4x16640x16640, b=2048): 255.043
Elapsed time for mlp_h_to_4h (4x16640x66560, b=2048): 0.0690
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16640x66560, b=2048): 262.952
Elapsed time for mlp_4h_to_h (4x66560x16640, b=2048): 0.0684
Throughput (in TFLOP/s) for mlp_4h_to_h (4x66560x16640, b=2048): 265.161

Attention duration (in seconds): 0.0775
Attention throughput (in TFLOP/s): 248.412
MLP duration (in seconds): 0.1374
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2150
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16704, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16704x50112, b=2048): 0.0525
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16704x50112, b=2048): 260.995
b: 128, m: 2048, n: 522, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x522x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x522x2048): 144.217
b: 128, m: 2048, n: 2048, k: 522,
Elapsed time for attention_prob_times_values (128x2048x2048x522): 0.0052
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x522): 108.611
Elapsed time for attention_linear_projection (4x16704x16704, b=2048): 0.0179
Throughput (in TFLOP/s) for attention_linear_projection (4x16704x16704, b=2048): 256.064
Elapsed time for mlp_h_to_4h (4x16704x66816, b=2048): 0.0705
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16704x66816, b=2048): 259.456
Elapsed time for mlp_4h_to_h (4x66816x16704, b=2048): 0.0690
Throughput (in TFLOP/s) for mlp_4h_to_h (4x66816x16704, b=2048): 264.963

Attention duration (in seconds): 0.0794
Attention throughput (in TFLOP/s): 244.276
MLP duration (in seconds): 0.1395
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2189
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16768, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16768x50304, b=2048): 0.0525
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16768x50304, b=2048): 263.139
b: 128, m: 2048, n: 524, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x524x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x524x2048): 149.143
b: 128, m: 2048, n: 2048, k: 524,
Elapsed time for attention_prob_times_values (128x2048x2048x524): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x524): 127.914
Elapsed time for attention_linear_projection (4x16768x16768, b=2048): 0.0179
Throughput (in TFLOP/s) for attention_linear_projection (4x16768x16768, b=2048): 257.535
Elapsed time for mlp_h_to_4h (4x16768x67072, b=2048): 0.0906
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16768x67072, b=2048): 203.340
Elapsed time for mlp_4h_to_h (4x67072x16768, b=2048): 0.0696
Throughput (in TFLOP/s) for mlp_4h_to_h (4x67072x16768, b=2048): 264.781

Attention duration (in seconds): 0.0786
Attention throughput (in TFLOP/s): 248.821
MLP duration (in seconds): 0.1602
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2388
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16832, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16832x50496, b=2048): 0.0539
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16832x50496, b=2048): 258.133
b: 128, m: 2048, n: 526, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x526x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x526x2048): 144.993
b: 128, m: 2048, n: 2048, k: 526,
Elapsed time for attention_prob_times_values (128x2048x2048x526): 0.0052
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x526): 108.635
Elapsed time for attention_linear_projection (4x16832x16832, b=2048): 0.0184
Throughput (in TFLOP/s) for attention_linear_projection (4x16832x16832, b=2048): 252.806
Elapsed time for mlp_h_to_4h (4x16832x67328, b=2048): 0.0715
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16832x67328, b=2048): 259.640
Elapsed time for mlp_4h_to_h (4x67328x16832, b=2048): 0.0701
Throughput (in TFLOP/s) for mlp_4h_to_h (4x67328x16832, b=2048): 264.794

Attention duration (in seconds): 0.0814
Attention throughput (in TFLOP/s): 241.969
MLP duration (in seconds): 0.1416
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2230
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16896, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16896x50688, b=2048): 0.0539
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16896x50688, b=2048): 260.398
b: 128, m: 2048, n: 528, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x528x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x528x2048): 148.545
b: 128, m: 2048, n: 2048, k: 528,
Elapsed time for attention_prob_times_values (128x2048x2048x528): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x528): 141.634
Elapsed time for attention_linear_projection (4x16896x16896, b=2048): 0.0185
Throughput (in TFLOP/s) for attention_linear_projection (4x16896x16896, b=2048): 252.538
Elapsed time for mlp_h_to_4h (4x16896x67584, b=2048): 0.0718
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16896x67584, b=2048): 260.570
Elapsed time for mlp_4h_to_h (4x67584x16896, b=2048): 0.0708
Throughput (in TFLOP/s) for mlp_4h_to_h (4x67584x16896, b=2048): 264.212

Attention duration (in seconds): 0.0802
Attention throughput (in TFLOP/s): 247.337
MLP duration (in seconds): 0.1426
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2228
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 16960, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16960x50880, b=2048): 0.0546
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16960x50880, b=2048): 259.127
b: 128, m: 2048, n: 530, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x530x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x530x2048): 145.632
b: 128, m: 2048, n: 2048, k: 530,
Elapsed time for attention_prob_times_values (128x2048x2048x530): 0.0052
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x530): 108.619
Elapsed time for attention_linear_projection (4x16960x16960, b=2048): 0.0186
Throughput (in TFLOP/s) for attention_linear_projection (4x16960x16960, b=2048): 252.850
Elapsed time for mlp_h_to_4h (4x16960x67840, b=2048): 0.0726
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16960x67840, b=2048): 259.654
Elapsed time for mlp_4h_to_h (4x67840x16960, b=2048): 0.0713
Throughput (in TFLOP/s) for mlp_4h_to_h (4x67840x16960, b=2048): 264.490

Attention duration (in seconds): 0.0823
Attention throughput (in TFLOP/s): 242.745
MLP duration (in seconds): 0.1439
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2262
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17024, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17024x51072, b=2048): 0.0543
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17024x51072, b=2048): 262.169
b: 128, m: 2048, n: 532, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x532x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x532x2048): 151.544
b: 128, m: 2048, n: 2048, k: 532,
Elapsed time for attention_prob_times_values (128x2048x2048x532): 0.0044
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x532): 128.516
Elapsed time for attention_linear_projection (4x17024x17024, b=2048): 0.0186
Throughput (in TFLOP/s) for attention_linear_projection (4x17024x17024, b=2048): 255.694
Elapsed time for mlp_h_to_4h (4x17024x68096, b=2048): 0.0837
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17024x68096, b=2048): 226.927
Elapsed time for mlp_4h_to_h (4x68096x17024, b=2048): 0.0715
Throughput (in TFLOP/s) for mlp_4h_to_h (4x68096x17024, b=2048): 265.533

Attention duration (in seconds): 0.0811
Attention throughput (in TFLOP/s): 248.223
MLP duration (in seconds): 0.1552
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2363
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17088, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17088x51264, b=2048): 0.0547
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17088x51264, b=2048): 262.169
b: 128, m: 2048, n: 534, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x534x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x534x2048): 146.803
b: 128, m: 2048, n: 2048, k: 534,
Elapsed time for attention_prob_times_values (128x2048x2048x534): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x534): 108.677
Elapsed time for attention_linear_projection (4x17088x17088, b=2048): 0.0186
Throughput (in TFLOP/s) for attention_linear_projection (4x17088x17088, b=2048): 256.921
Elapsed time for mlp_h_to_4h (4x17088x68352, b=2048): 0.0726
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17088x68352, b=2048): 263.487
Elapsed time for mlp_4h_to_h (4x68352x17088, b=2048): 0.0721
Throughput (in TFLOP/s) for mlp_4h_to_h (4x68352x17088, b=2048): 265.258

Attention duration (in seconds): 0.0825
Attention throughput (in TFLOP/s): 245.716
MLP duration (in seconds): 0.1448
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2273
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17152, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17152x51456, b=2048): 0.0549
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17152x51456, b=2048): 263.484
b: 128, m: 2048, n: 536, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x536x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x536x2048): 149.971
b: 128, m: 2048, n: 2048, k: 536,
Elapsed time for attention_prob_times_values (128x2048x2048x536): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x536): 141.996
Elapsed time for attention_linear_projection (4x17152x17152, b=2048): 0.0187
Throughput (in TFLOP/s) for attention_linear_projection (4x17152x17152, b=2048): 258.353
Elapsed time for mlp_h_to_4h (4x17152x68608, b=2048): 0.0728
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17152x68608, b=2048): 264.714
Elapsed time for mlp_4h_to_h (4x68608x17152, b=2048): 0.0726
Throughput (in TFLOP/s) for mlp_4h_to_h (4x68608x17152, b=2048): 265.580

Attention duration (in seconds): 0.0814
Attention throughput (in TFLOP/s): 250.911
MLP duration (in seconds): 0.1454
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2269
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17216, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17216x51648, b=2048): 0.0555
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17216x51648, b=2048): 262.504
b: 128, m: 2048, n: 538, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x538x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x538x2048): 147.794
b: 128, m: 2048, n: 2048, k: 538,
Elapsed time for attention_prob_times_values (128x2048x2048x538): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x538): 109.043
Elapsed time for attention_linear_projection (4x17216x17216, b=2048): 0.0188
Throughput (in TFLOP/s) for attention_linear_projection (4x17216x17216, b=2048): 257.687
Elapsed time for mlp_h_to_4h (4x17216x68864, b=2048): 0.0734
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17216x68864, b=2048): 264.726
Elapsed time for mlp_4h_to_h (4x68864x17216, b=2048): 0.0734
Throughput (in TFLOP/s) for mlp_4h_to_h (4x68864x17216, b=2048): 264.530

Attention duration (in seconds): 0.0835
Attention throughput (in TFLOP/s): 246.321
MLP duration (in seconds): 0.1468
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2304
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17280, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17280x51840, b=2048): 0.0556
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17280x51840, b=2048): 263.988
b: 128, m: 2048, n: 540, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x540x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x540x2048): 153.319
b: 128, m: 2048, n: 2048, k: 540,
Elapsed time for attention_prob_times_values (128x2048x2048x540): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x540): 129.214
Elapsed time for attention_linear_projection (4x17280x17280, b=2048): 0.0188
Throughput (in TFLOP/s) for attention_linear_projection (4x17280x17280, b=2048): 259.757
Elapsed time for mlp_h_to_4h (4x17280x69120, b=2048): 0.1048
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17280x69120, b=2048): 186.746
Elapsed time for mlp_4h_to_h (4x69120x17280, b=2048): 0.0735
Throughput (in TFLOP/s) for mlp_4h_to_h (4x69120x17280, b=2048): 266.370

Attention duration (in seconds): 0.0827
Attention throughput (in TFLOP/s): 250.651
MLP duration (in seconds): 0.1783
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2610
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17344, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17344x52032, b=2048): 0.0567
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17344x52032, b=2048): 260.785
b: 128, m: 2048, n: 542, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x542x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x542x2048): 148.802
b: 128, m: 2048, n: 2048, k: 542,
Elapsed time for attention_prob_times_values (128x2048x2048x542): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x542): 109.716
Elapsed time for attention_linear_projection (4x17344x17344, b=2048): 0.0194
Throughput (in TFLOP/s) for attention_linear_projection (4x17344x17344, b=2048): 254.266
Elapsed time for mlp_h_to_4h (4x17344x69376, b=2048): 0.0753
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17344x69376, b=2048): 261.972
Elapsed time for mlp_4h_to_h (4x69376x17344, b=2048): 0.0740
Throughput (in TFLOP/s) for mlp_4h_to_h (4x69376x17344, b=2048): 266.237

Attention duration (in seconds): 0.0853
Attention throughput (in TFLOP/s): 244.774
MLP duration (in seconds): 0.1493
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2346
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17408, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17408x52224, b=2048): 0.0569
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17408x52224, b=2048): 261.661
b: 128, m: 2048, n: 544, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x544x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x544x2048): 152.741
b: 128, m: 2048, n: 2048, k: 544,
Elapsed time for attention_prob_times_values (128x2048x2048x544): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x544): 146.074
Elapsed time for attention_linear_projection (4x17408x17408, b=2048): 0.0194
Throughput (in TFLOP/s) for attention_linear_projection (4x17408x17408, b=2048): 256.120
Elapsed time for mlp_h_to_4h (4x17408x69632, b=2048): 0.0890
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17408x69632, b=2048): 223.146
Elapsed time for mlp_4h_to_h (4x69632x17408, b=2048): 0.0751
Throughput (in TFLOP/s) for mlp_4h_to_h (4x69632x17408, b=2048): 264.427

Attention duration (in seconds): 0.0841
Attention throughput (in TFLOP/s): 249.940
MLP duration (in seconds): 0.1641
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2482
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17472, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17472x52416, b=2048): 0.0573
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17472x52416, b=2048): 262.043
b: 128, m: 2048, n: 546, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x546x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x546x2048): 150.083
b: 128, m: 2048, n: 2048, k: 546,
Elapsed time for attention_prob_times_values (128x2048x2048x546): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x546): 110.889
Elapsed time for attention_linear_projection (4x17472x17472, b=2048): 0.0195
Throughput (in TFLOP/s) for attention_linear_projection (4x17472x17472, b=2048): 256.816
Elapsed time for mlp_h_to_4h (4x17472x69888, b=2048): 0.0761
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17472x69888, b=2048): 262.824
Elapsed time for mlp_4h_to_h (4x69888x17472, b=2048): 0.0753
Throughput (in TFLOP/s) for mlp_4h_to_h (4x69888x17472, b=2048): 265.511

Attention duration (in seconds): 0.0859
Attention throughput (in TFLOP/s): 246.469
MLP duration (in seconds): 0.1515
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2374
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17536, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17536x52608, b=2048): 0.0577
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17536x52608, b=2048): 262.088
b: 128, m: 2048, n: 548, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x548x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x548x2048): 155.473
b: 128, m: 2048, n: 2048, k: 548,
Elapsed time for attention_prob_times_values (128x2048x2048x548): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x548): 130.636
Elapsed time for attention_linear_projection (4x17536x17536, b=2048): 0.0197
Throughput (in TFLOP/s) for attention_linear_projection (4x17536x17536, b=2048): 256.348
Elapsed time for mlp_h_to_4h (4x17536x70144, b=2048): 0.0766
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17536x70144, b=2048): 263.080
Elapsed time for mlp_4h_to_h (4x70144x17536, b=2048): 0.0770
Throughput (in TFLOP/s) for mlp_4h_to_h (4x70144x17536, b=2048): 261.855

Attention duration (in seconds): 0.0856
Attention throughput (in TFLOP/s): 249.142
MLP duration (in seconds): 0.1536
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2392
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17600, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17600x52800, b=2048): 0.0584
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17600x52800, b=2048): 260.907
b: 128, m: 2048, n: 550, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x550x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x550x2048): 150.824
b: 128, m: 2048, n: 2048, k: 550,
Elapsed time for attention_prob_times_values (128x2048x2048x550): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x550): 110.654
Elapsed time for attention_linear_projection (4x17600x17600, b=2048): 0.0198
Throughput (in TFLOP/s) for attention_linear_projection (4x17600x17600, b=2048): 256.279
Elapsed time for mlp_h_to_4h (4x17600x70400, b=2048): 0.0786
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17600x70400, b=2048): 258.234
Elapsed time for mlp_4h_to_h (4x70400x17600, b=2048): 0.0774
Throughput (in TFLOP/s) for mlp_4h_to_h (4x70400x17600, b=2048): 262.387

Attention duration (in seconds): 0.0874
Attention throughput (in TFLOP/s): 245.754
MLP duration (in seconds): 0.1560
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2434
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17664, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17664x52992, b=2048): 0.0593
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17664x52992, b=2048): 258.690
b: 128, m: 2048, n: 552, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x552x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x552x2048): 154.563
b: 128, m: 2048, n: 2048, k: 552,
Elapsed time for attention_prob_times_values (128x2048x2048x552): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x552): 145.763
Elapsed time for attention_linear_projection (4x17664x17664, b=2048): 0.0200
Throughput (in TFLOP/s) for attention_linear_projection (4x17664x17664, b=2048): 255.094
Elapsed time for mlp_h_to_4h (4x17664x70656, b=2048): 0.0781
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17664x70656, b=2048): 261.819
Elapsed time for mlp_4h_to_h (4x70656x17664, b=2048): 0.0780
Throughput (in TFLOP/s) for mlp_4h_to_h (4x70656x17664, b=2048): 262.103

Attention duration (in seconds): 0.0872
Attention throughput (in TFLOP/s): 248.022
MLP duration (in seconds): 0.1561
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2433
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17728, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17728x53184, b=2048): 0.0605
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17728x53184, b=2048): 255.532
b: 128, m: 2048, n: 554, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x554x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x554x2048): 151.331
b: 128, m: 2048, n: 2048, k: 554,
Elapsed time for attention_prob_times_values (128x2048x2048x554): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x554): 111.399
Elapsed time for attention_linear_projection (4x17728x17728, b=2048): 0.0203
Throughput (in TFLOP/s) for attention_linear_projection (4x17728x17728, b=2048): 253.225
Elapsed time for mlp_h_to_4h (4x17728x70912, b=2048): 0.0801
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17728x70912, b=2048): 257.211
Elapsed time for mlp_4h_to_h (4x70912x17728, b=2048): 0.0787
Throughput (in TFLOP/s) for mlp_4h_to_h (4x70912x17728, b=2048): 261.773

Attention duration (in seconds): 0.0901
Attention throughput (in TFLOP/s): 241.917
MLP duration (in seconds): 0.1588
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2488
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17792, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17792x53376, b=2048): 0.0606
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17792x53376, b=2048): 256.899
b: 128, m: 2048, n: 556, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x556x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x556x2048): 157.296
b: 128, m: 2048, n: 2048, k: 556,
Elapsed time for attention_prob_times_values (128x2048x2048x556): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x556): 132.410
Elapsed time for attention_linear_projection (4x17792x17792, b=2048): 0.0204
Throughput (in TFLOP/s) for attention_linear_projection (4x17792x17792, b=2048): 254.026
Elapsed time for mlp_h_to_4h (4x17792x71168, b=2048): 0.0875
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17792x71168, b=2048): 237.148
Elapsed time for mlp_4h_to_h (4x71168x17792, b=2048): 0.0790
Throughput (in TFLOP/s) for mlp_4h_to_h (4x71168x17792, b=2048): 262.691

Attention duration (in seconds): 0.0893
Attention throughput (in TFLOP/s): 245.722
MLP duration (in seconds): 0.1665
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2557
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17856, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17856x53568, b=2048): 0.0602
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17856x53568, b=2048): 260.160
b: 128, m: 2048, n: 558, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x558x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x558x2048): 152.729
b: 128, m: 2048, n: 2048, k: 558,
Elapsed time for attention_prob_times_values (128x2048x2048x558): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x558): 112.484
Elapsed time for attention_linear_projection (4x17856x17856, b=2048): 0.0204
Throughput (in TFLOP/s) for attention_linear_projection (4x17856x17856, b=2048): 256.549
Elapsed time for mlp_h_to_4h (4x17856x71424, b=2048): 0.0795
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17856x71424, b=2048): 262.732
Elapsed time for mlp_4h_to_h (4x71424x17856, b=2048): 0.0788
Throughput (in TFLOP/s) for mlp_4h_to_h (4x71424x17856, b=2048): 265.043

Attention duration (in seconds): 0.0898
Attention throughput (in TFLOP/s): 245.896
MLP duration (in seconds): 0.1584
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2482
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17920, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17920x53760, b=2048): 0.0602
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17920x53760, b=2048): 262.058
b: 128, m: 2048, n: 560, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x560x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x560x2048): 157.321
b: 128, m: 2048, n: 2048, k: 560,
Elapsed time for attention_prob_times_values (128x2048x2048x560): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x560): 147.893
Elapsed time for attention_linear_projection (4x17920x17920, b=2048): 0.0204
Throughput (in TFLOP/s) for attention_linear_projection (4x17920x17920, b=2048): 258.007
Elapsed time for mlp_h_to_4h (4x17920x71680, b=2048): 0.0797
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17920x71680, b=2048): 264.071
Elapsed time for mlp_4h_to_h (4x71680x17920, b=2048): 0.0795
Throughput (in TFLOP/s) for mlp_4h_to_h (4x71680x17920, b=2048): 264.682

Attention duration (in seconds): 0.0885
Attention throughput (in TFLOP/s): 251.358
MLP duration (in seconds): 0.1592
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2477
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 17984, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17984x53952, b=2048): 0.0605
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17984x53952, b=2048): 262.658
b: 128, m: 2048, n: 562, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x562x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x562x2048): 154.463
b: 128, m: 2048, n: 2048, k: 562,
Elapsed time for attention_prob_times_values (128x2048x2048x562): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x562): 112.756
Elapsed time for attention_linear_projection (4x17984x17984, b=2048): 0.0205
Throughput (in TFLOP/s) for attention_linear_projection (4x17984x17984, b=2048): 258.443
Elapsed time for mlp_h_to_4h (4x17984x71936, b=2048): 0.0805
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17984x71936, b=2048): 263.416
Elapsed time for mlp_4h_to_h (4x71936x17984, b=2048): 0.0799
Throughput (in TFLOP/s) for mlp_4h_to_h (4x71936x17984, b=2048): 265.174

Attention duration (in seconds): 0.0903
Attention throughput (in TFLOP/s): 248.134
MLP duration (in seconds): 0.1604
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2507
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18048, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18048x54144, b=2048): 0.0607
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18048x54144, b=2048): 263.596
b: 128, m: 2048, n: 564, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x564x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x564x2048): 160.204
b: 128, m: 2048, n: 2048, k: 564,
Elapsed time for attention_prob_times_values (128x2048x2048x564): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x564): 134.914
Elapsed time for attention_linear_projection (4x18048x18048, b=2048): 0.0206
Throughput (in TFLOP/s) for attention_linear_projection (4x18048x18048, b=2048): 259.528
Elapsed time for mlp_h_to_4h (4x18048x72192, b=2048): 0.0810
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18048x72192, b=2048): 263.705
Elapsed time for mlp_4h_to_h (4x72192x18048, b=2048): 0.0807
Throughput (in TFLOP/s) for mlp_4h_to_h (4x72192x18048, b=2048): 264.583

Attention duration (in seconds): 0.0896
Attention throughput (in TFLOP/s): 251.850
MLP duration (in seconds): 0.1616
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2512
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18112, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18112x54336, b=2048): 0.0623
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18112x54336, b=2048): 258.949
b: 128, m: 2048, n: 566, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x566x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x566x2048): 154.787
b: 128, m: 2048, n: 2048, k: 566,
Elapsed time for attention_prob_times_values (128x2048x2048x566): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x566): 113.614
Elapsed time for attention_linear_projection (4x18112x18112, b=2048): 0.0212
Throughput (in TFLOP/s) for attention_linear_projection (4x18112x18112, b=2048): 253.863
Elapsed time for mlp_h_to_4h (4x18112x72448, b=2048): 0.0827
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18112x72448, b=2048): 259.896
Elapsed time for mlp_4h_to_h (4x72448x18112, b=2048): 0.0810
Throughput (in TFLOP/s) for mlp_4h_to_h (4x72448x18112, b=2048): 265.477

Attention duration (in seconds): 0.0927
Attention throughput (in TFLOP/s): 244.991
MLP duration (in seconds): 0.1637
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2564
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18176, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18176x54528, b=2048): 0.0627
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18176x54528, b=2048): 259.186
b: 128, m: 2048, n: 568, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x568x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x568x2048): 158.717
b: 128, m: 2048, n: 2048, k: 568,
Elapsed time for attention_prob_times_values (128x2048x2048x568): 0.0041
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x568): 148.758
Elapsed time for attention_linear_projection (4x18176x18176, b=2048): 0.0213
Throughput (in TFLOP/s) for attention_linear_projection (4x18176x18176, b=2048): 254.631
Elapsed time for mlp_h_to_4h (4x18176x72704, b=2048): 0.0836
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18176x72704, b=2048): 258.961
Elapsed time for mlp_4h_to_h (4x72704x18176, b=2048): 0.0825
Throughput (in TFLOP/s) for mlp_4h_to_h (4x72704x18176, b=2048): 262.309

Attention duration (in seconds): 0.0919
Attention throughput (in TFLOP/s): 249.000
MLP duration (in seconds): 0.1661
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2580
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18240, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18240x54720, b=2048): 0.0632
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18240x54720, b=2048): 258.718
b: 128, m: 2048, n: 570, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x570x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x570x2048): 155.315
b: 128, m: 2048, n: 2048, k: 570,
Elapsed time for attention_prob_times_values (128x2048x2048x570): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x570): 114.046
Elapsed time for attention_linear_projection (4x18240x18240, b=2048): 0.0214
Throughput (in TFLOP/s) for attention_linear_projection (4x18240x18240, b=2048): 254.886
Elapsed time for mlp_h_to_4h (4x18240x72960, b=2048): 0.0843
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18240x72960, b=2048): 258.590
Elapsed time for mlp_4h_to_h (4x72960x18240, b=2048): 0.0831
Throughput (in TFLOP/s) for mlp_4h_to_h (4x72960x18240, b=2048): 262.328

Attention duration (in seconds): 0.0939
Attention throughput (in TFLOP/s): 245.238
MLP duration (in seconds): 0.1674
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2613
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18304, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18304x54912, b=2048): 0.0636
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18304x54912, b=2048): 258.970
b: 128, m: 2048, n: 572, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x572x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x572x2048): 162.220
b: 128, m: 2048, n: 2048, k: 572,
Elapsed time for attention_prob_times_values (128x2048x2048x572): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x572): 137.068
Elapsed time for attention_linear_projection (4x18304x18304, b=2048): 0.0214
Throughput (in TFLOP/s) for attention_linear_projection (4x18304x18304, b=2048): 256.392
Elapsed time for mlp_h_to_4h (4x18304x73216, b=2048): 0.0839
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18304x73216, b=2048): 261.690
Elapsed time for mlp_4h_to_h (4x73216x18304, b=2048): 0.0830
Throughput (in TFLOP/s) for mlp_4h_to_h (4x73216x18304, b=2048): 264.422

Attention duration (in seconds): 0.0933
Attention throughput (in TFLOP/s): 248.594
MLP duration (in seconds): 0.1669
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2602
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18368, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18368x55104, b=2048): 0.0637
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18368x55104, b=2048): 260.295
b: 128, m: 2048, n: 574, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x574x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x574x2048): 156.557
b: 128, m: 2048, n: 2048, k: 574,
Elapsed time for attention_prob_times_values (128x2048x2048x574): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x574): 114.754
Elapsed time for attention_linear_projection (4x18368x18368, b=2048): 0.0215
Throughput (in TFLOP/s) for attention_linear_projection (4x18368x18368, b=2048): 257.032
Elapsed time for mlp_h_to_4h (4x18368x73472, b=2048): 0.0848
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18368x73472, b=2048): 260.846
Elapsed time for mlp_4h_to_h (4x73472x18368, b=2048): 0.0844
Throughput (in TFLOP/s) for mlp_4h_to_h (4x73472x18368, b=2048): 261.833

Attention duration (in seconds): 0.0945
Attention throughput (in TFLOP/s): 246.962
MLP duration (in seconds): 0.1692
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2637
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18432, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18432x55296, b=2048): 0.0637
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18432x55296, b=2048): 261.974
b: 128, m: 2048, n: 576, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x576x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x576x2048): 162.008
b: 128, m: 2048, n: 2048, k: 576,
Elapsed time for attention_prob_times_values (128x2048x2048x576): 0.0040
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x576): 154.464
Elapsed time for attention_linear_projection (4x18432x18432, b=2048): 0.0215
Throughput (in TFLOP/s) for attention_linear_projection (4x18432x18432, b=2048): 259.309
Elapsed time for mlp_h_to_4h (4x18432x73728, b=2048): 0.1125
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18432x73728, b=2048): 197.864
Elapsed time for mlp_4h_to_h (4x73728x18432, b=2048): 0.0835
Throughput (in TFLOP/s) for mlp_4h_to_h (4x73728x18432, b=2048): 266.688

Attention duration (in seconds): 0.0930
Attention throughput (in TFLOP/s): 252.630
MLP duration (in seconds): 0.1960
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2890
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18496, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18496x55488, b=2048): 0.0638
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18496x55488, b=2048): 263.584
b: 128, m: 2048, n: 578, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x578x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x578x2048): 158.098
b: 128, m: 2048, n: 2048, k: 578,
Elapsed time for attention_prob_times_values (128x2048x2048x578): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x578): 111.195
Elapsed time for attention_linear_projection (4x18496x18496, b=2048): 0.0216
Throughput (in TFLOP/s) for attention_linear_projection (4x18496x18496, b=2048): 259.445
Elapsed time for mlp_h_to_4h (4x18496x73984, b=2048): 0.0847
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18496x73984, b=2048): 264.794
Elapsed time for mlp_4h_to_h (4x73984x18496, b=2048): 0.0844
Throughput (in TFLOP/s) for mlp_4h_to_h (4x73984x18496, b=2048): 265.607

Attention duration (in seconds): 0.0949
Attention throughput (in TFLOP/s): 249.317
MLP duration (in seconds): 0.1691
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2640
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18560, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18560x55680, b=2048): 0.0643
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18560x55680, b=2048): 263.340
b: 128, m: 2048, n: 580, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x580x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x580x2048): 164.076
b: 128, m: 2048, n: 2048, k: 580,
Elapsed time for attention_prob_times_values (128x2048x2048x580): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x580): 131.129
Elapsed time for attention_linear_projection (4x18560x18560, b=2048): 0.0217
Throughput (in TFLOP/s) for attention_linear_projection (4x18560x18560, b=2048): 259.938
Elapsed time for mlp_h_to_4h (4x18560x74240, b=2048): 0.0852
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18560x74240, b=2048): 265.125
Elapsed time for mlp_4h_to_h (4x74240x18560, b=2048): 0.0851
Throughput (in TFLOP/s) for mlp_4h_to_h (4x74240x18560, b=2048): 265.339

Attention duration (in seconds): 0.0946
Attention throughput (in TFLOP/s): 251.933
MLP duration (in seconds): 0.1702
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2648
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18624, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18624x55872, b=2048): 0.0652
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18624x55872, b=2048): 261.431
b: 128, m: 2048, n: 582, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x582x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x582x2048): 159.115
b: 128, m: 2048, n: 2048, k: 582,
Elapsed time for attention_prob_times_values (128x2048x2048x582): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x582): 110.511
Elapsed time for attention_linear_projection (4x18624x18624, b=2048): 0.0222
Throughput (in TFLOP/s) for attention_linear_projection (4x18624x18624, b=2048): 256.087
Elapsed time for mlp_h_to_4h (4x18624x74496, b=2048): 0.0874
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18624x74496, b=2048): 260.085
Elapsed time for mlp_4h_to_h (4x74496x18624, b=2048): 0.0857
Throughput (in TFLOP/s) for mlp_4h_to_h (4x74496x18624, b=2048): 265.362

Attention duration (in seconds): 0.0970
Attention throughput (in TFLOP/s): 247.265
MLP duration (in seconds): 0.1731
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2700
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18688, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18688x56064, b=2048): 0.0657
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18688x56064, b=2048): 261.474
b: 128, m: 2048, n: 584, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x584x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x584x2048): 162.945
b: 128, m: 2048, n: 2048, k: 584,
Elapsed time for attention_prob_times_values (128x2048x2048x584): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x584): 147.097
Elapsed time for attention_linear_projection (4x18688x18688, b=2048): 0.0223
Throughput (in TFLOP/s) for attention_linear_projection (4x18688x18688, b=2048): 256.865
Elapsed time for mlp_h_to_4h (4x18688x74752, b=2048): 0.0870
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18688x74752, b=2048): 263.014
Elapsed time for mlp_4h_to_h (4x74752x18688, b=2048): 0.0863
Throughput (in TFLOP/s) for mlp_4h_to_h (4x74752x18688, b=2048): 265.191

Attention duration (in seconds): 0.0960
Attention throughput (in TFLOP/s): 251.380
MLP duration (in seconds): 0.1733
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2694
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18752, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18752x56256, b=2048): 0.0658
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18752x56256, b=2048): 262.556
b: 128, m: 2048, n: 586, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x586x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x586x2048): 159.338
b: 128, m: 2048, n: 2048, k: 586,
Elapsed time for attention_prob_times_values (128x2048x2048x586): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x586): 111.425
Elapsed time for attention_linear_projection (4x18752x18752, b=2048): 0.0224
Throughput (in TFLOP/s) for attention_linear_projection (4x18752x18752, b=2048): 257.451
Elapsed time for mlp_h_to_4h (4x18752x75008, b=2048): 0.0876
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18752x75008, b=2048): 263.144
Elapsed time for mlp_4h_to_h (4x75008x18752, b=2048): 0.0868
Throughput (in TFLOP/s) for mlp_4h_to_h (4x75008x18752, b=2048): 265.577

Attention duration (in seconds): 0.0978
Attention throughput (in TFLOP/s): 248.494
MLP duration (in seconds): 0.1743
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2722
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18816, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18816x56448, b=2048): 0.0665
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18816x56448, b=2048): 261.487
b: 128, m: 2048, n: 588, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x588x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x588x2048): 166.037
b: 128, m: 2048, n: 2048, k: 588,
Elapsed time for attention_prob_times_values (128x2048x2048x588): 0.0047
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x588): 133.051
Elapsed time for attention_linear_projection (4x18816x18816, b=2048): 0.0226
Throughput (in TFLOP/s) for attention_linear_projection (4x18816x18816, b=2048): 257.192
Elapsed time for mlp_h_to_4h (4x18816x75264, b=2048): 0.0886
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18816x75264, b=2048): 261.941
Elapsed time for mlp_4h_to_h (4x75264x18816, b=2048): 0.0882
Throughput (in TFLOP/s) for mlp_4h_to_h (4x75264x18816, b=2048): 263.026

Attention duration (in seconds): 0.0977
Attention throughput (in TFLOP/s): 250.537
MLP duration (in seconds): 0.1768
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2744
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18880, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18880x56640, b=2048): 0.0674
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18880x56640, b=2048): 259.870
b: 128, m: 2048, n: 590, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x590x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x590x2048): 160.222
b: 128, m: 2048, n: 2048, k: 590,
Elapsed time for attention_prob_times_values (128x2048x2048x590): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x590): 111.536
Elapsed time for attention_linear_projection (4x18880x18880, b=2048): 0.0227
Throughput (in TFLOP/s) for attention_linear_projection (4x18880x18880, b=2048): 257.250
Elapsed time for mlp_h_to_4h (4x18880x75520, b=2048): 0.0894
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18880x75520, b=2048): 261.243
Elapsed time for mlp_4h_to_h (4x75520x18880, b=2048): 0.0888
Throughput (in TFLOP/s) for mlp_4h_to_h (4x75520x18880, b=2048): 263.105

Attention duration (in seconds): 0.0998
Attention throughput (in TFLOP/s): 246.878
MLP duration (in seconds): 0.1782
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2780
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 18944, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18944x56832, b=2048): 0.0678
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18944x56832, b=2048): 260.177
b: 128, m: 2048, n: 592, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x592x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x592x2048): 166.436
b: 128, m: 2048, n: 2048, k: 592,
Elapsed time for attention_prob_times_values (128x2048x2048x592): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x592): 150.919
Elapsed time for attention_linear_projection (4x18944x18944, b=2048): 0.0227
Throughput (in TFLOP/s) for attention_linear_projection (4x18944x18944, b=2048): 258.959
Elapsed time for mlp_h_to_4h (4x18944x75776, b=2048): 0.0898
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18944x75776, b=2048): 261.784
Elapsed time for mlp_4h_to_h (4x75776x18944, b=2048): 0.0894
Throughput (in TFLOP/s) for mlp_4h_to_h (4x75776x18944, b=2048): 262.990

Attention duration (in seconds): 0.0985
Attention throughput (in TFLOP/s): 251.592
MLP duration (in seconds): 0.1793
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2778
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19008, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19008x57024, b=2048): 0.0691
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19008x57024, b=2048): 256.854
b: 128, m: 2048, n: 594, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x594x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x594x2048): 160.978
b: 128, m: 2048, n: 2048, k: 594,
Elapsed time for attention_prob_times_values (128x2048x2048x594): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x594): 112.339
Elapsed time for attention_linear_projection (4x19008x19008, b=2048): 0.0232
Throughput (in TFLOP/s) for attention_linear_projection (4x19008x19008, b=2048): 254.815
Elapsed time for mlp_h_to_4h (4x19008x76032, b=2048): 0.0918
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19008x76032, b=2048): 258.017
Elapsed time for mlp_4h_to_h (4x76032x19008, b=2048): 0.0901
Throughput (in TFLOP/s) for mlp_4h_to_h (4x76032x19008, b=2048): 262.845

Attention duration (in seconds): 0.1020
Attention throughput (in TFLOP/s): 244.623
MLP duration (in seconds): 0.1819
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2839
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19072, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19072x57216, b=2048): 0.0694
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19072x57216, b=2048): 257.779
b: 128, m: 2048, n: 596, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x596x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x596x2048): 168.053
b: 128, m: 2048, n: 2048, k: 596,
Elapsed time for attention_prob_times_values (128x2048x2048x596): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x596): 134.510
Elapsed time for attention_linear_projection (4x19072x19072, b=2048): 0.0234
Throughput (in TFLOP/s) for attention_linear_projection (4x19072x19072, b=2048): 255.192
Elapsed time for mlp_h_to_4h (4x19072x76288, b=2048): 0.0920
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19072x76288, b=2048): 259.064
Elapsed time for mlp_4h_to_h (4x76288x19072, b=2048): 0.0906
Throughput (in TFLOP/s) for mlp_4h_to_h (4x76288x19072, b=2048): 263.168

Attention duration (in seconds): 0.1013
Attention throughput (in TFLOP/s): 248.018
MLP duration (in seconds): 0.1826
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2839
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19136, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19136x57408, b=2048): 0.0699
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19136x57408, b=2048): 257.354
b: 128, m: 2048, n: 598, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x598x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x598x2048): 161.858
b: 128, m: 2048, n: 2048, k: 598,
Elapsed time for attention_prob_times_values (128x2048x2048x598): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x598): 113.010
Elapsed time for attention_linear_projection (4x19136x19136, b=2048): 0.0235
Throughput (in TFLOP/s) for attention_linear_projection (4x19136x19136, b=2048): 255.746
Elapsed time for mlp_h_to_4h (4x19136x76544, b=2048): 0.0926
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19136x76544, b=2048): 259.052
Elapsed time for mlp_4h_to_h (4x76544x19136, b=2048): 0.0915
Throughput (in TFLOP/s) for mlp_4h_to_h (4x76544x19136, b=2048): 262.350

Attention duration (in seconds): 0.1030
Attention throughput (in TFLOP/s): 245.353
MLP duration (in seconds): 0.1841
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2872
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19200, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19200x57600, b=2048): 0.0700
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19200x57600, b=2048): 258.918
b: 128, m: 2048, n: 600, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x600x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x600x2048): 167.493
b: 128, m: 2048, n: 2048, k: 600,
Elapsed time for attention_prob_times_values (128x2048x2048x600): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x600): 149.862
Elapsed time for attention_linear_projection (4x19200x19200, b=2048): 0.0236
Throughput (in TFLOP/s) for attention_linear_projection (4x19200x19200, b=2048): 256.104
Elapsed time for mlp_h_to_4h (4x19200x76800, b=2048): 0.0929
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19200x76800, b=2048): 260.132
Elapsed time for mlp_4h_to_h (4x76800x19200, b=2048): 0.0919
Throughput (in TFLOP/s) for mlp_4h_to_h (4x76800x19200, b=2048): 262.850

Attention duration (in seconds): 0.1017
Attention throughput (in TFLOP/s): 250.199
MLP duration (in seconds): 0.1848
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2865
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19264, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19264x57792, b=2048): 0.0703
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19264x57792, b=2048): 259.359
b: 128, m: 2048, n: 602, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x602x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x602x2048): 163.107
b: 128, m: 2048, n: 2048, k: 602,
Elapsed time for attention_prob_times_values (128x2048x2048x602): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x602): 113.140
Elapsed time for attention_linear_projection (4x19264x19264, b=2048): 0.0237
Throughput (in TFLOP/s) for attention_linear_projection (4x19264x19264, b=2048): 256.229
Elapsed time for mlp_h_to_4h (4x19264x77056, b=2048): 0.0933
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19264x77056, b=2048): 260.627
Elapsed time for mlp_4h_to_h (4x77056x19264, b=2048): 0.0931
Throughput (in TFLOP/s) for mlp_4h_to_h (4x77056x19264, b=2048): 261.301

Attention duration (in seconds): 0.1037
Attention throughput (in TFLOP/s): 246.913
MLP duration (in seconds): 0.1864
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2901
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19328, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19328x57984, b=2048): 0.0706
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19328x57984, b=2048): 260.239
b: 128, m: 2048, n: 604, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x604x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x604x2048): 169.756
b: 128, m: 2048, n: 2048, k: 604,
Elapsed time for attention_prob_times_values (128x2048x2048x604): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x604): 135.710
Elapsed time for attention_linear_projection (4x19328x19328, b=2048): 0.0239
Throughput (in TFLOP/s) for attention_linear_projection (4x19328x19328, b=2048): 255.989
Elapsed time for mlp_h_to_4h (4x19328x77312, b=2048): 0.0943
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19328x77312, b=2048): 259.565
Elapsed time for mlp_4h_to_h (4x77312x19328, b=2048): 0.0935
Throughput (in TFLOP/s) for mlp_4h_to_h (4x77312x19328, b=2048): 261.960

Attention duration (in seconds): 0.1031
Attention throughput (in TFLOP/s): 250.125
MLP duration (in seconds): 0.1878
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2908
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19392, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19392x58176, b=2048): 0.0723
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19392x58176, b=2048): 255.828
b: 128, m: 2048, n: 606, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x606x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x606x2048): 163.767
b: 128, m: 2048, n: 2048, k: 606,
Elapsed time for attention_prob_times_values (128x2048x2048x606): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x606): 114.225
Elapsed time for attention_linear_projection (4x19392x19392, b=2048): 0.0244
Throughput (in TFLOP/s) for attention_linear_projection (4x19392x19392, b=2048): 252.998
Elapsed time for mlp_h_to_4h (4x19392x77568, b=2048): 0.0958
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19392x77568, b=2048): 257.313
Elapsed time for mlp_4h_to_h (4x77568x19392, b=2048): 0.0937
Throughput (in TFLOP/s) for mlp_4h_to_h (4x77568x19392, b=2048): 262.888

Attention duration (in seconds): 0.1063
Attention throughput (in TFLOP/s): 244.147
MLP duration (in seconds): 0.1895
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2958
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19456, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19456x58368, b=2048): 0.0724
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19456x58368, b=2048): 256.851
b: 128, m: 2048, n: 608, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x608x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x608x2048): 170.370
b: 128, m: 2048, n: 2048, k: 608,
Elapsed time for attention_prob_times_values (128x2048x2048x608): 0.0042
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x608): 154.151
Elapsed time for attention_linear_projection (4x19456x19456, b=2048): 0.0244
Throughput (in TFLOP/s) for attention_linear_projection (4x19456x19456, b=2048): 254.068
Elapsed time for mlp_h_to_4h (4x19456x77824, b=2048): 0.0960
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19456x77824, b=2048): 258.401
Elapsed time for mlp_4h_to_h (4x77824x19456, b=2048): 0.0944
Throughput (in TFLOP/s) for mlp_4h_to_h (4x77824x19456, b=2048): 262.836

Attention duration (in seconds): 0.1049
Attention throughput (in TFLOP/s): 248.899
MLP duration (in seconds): 0.1904
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2953
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19520, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19520x58560, b=2048): 0.0726
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19520x58560, b=2048): 258.122
b: 128, m: 2048, n: 610, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x610x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x610x2048): 164.878
b: 128, m: 2048, n: 2048, k: 610,
Elapsed time for attention_prob_times_values (128x2048x2048x610): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x610): 114.557
Elapsed time for attention_linear_projection (4x19520x19520, b=2048): 0.0245
Throughput (in TFLOP/s) for attention_linear_projection (4x19520x19520, b=2048): 254.698
Elapsed time for mlp_h_to_4h (4x19520x78080, b=2048): 0.0963
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19520x78080, b=2048): 259.361
Elapsed time for mlp_4h_to_h (4x78080x19520, b=2048): 0.0948
Throughput (in TFLOP/s) for mlp_4h_to_h (4x78080x19520, b=2048): 263.497

Attention duration (in seconds): 0.1068
Attention throughput (in TFLOP/s): 246.177
MLP duration (in seconds): 0.1910
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2978
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19584, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19584x58752, b=2048): 0.0728
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19584x58752, b=2048): 258.794
b: 128, m: 2048, n: 612, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x612x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x612x2048): 171.619
b: 128, m: 2048, n: 2048, k: 612,
Elapsed time for attention_prob_times_values (128x2048x2048x612): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x612): 136.297
Elapsed time for attention_linear_projection (4x19584x19584, b=2048): 0.0246
Throughput (in TFLOP/s) for attention_linear_projection (4x19584x19584, b=2048): 254.965
Elapsed time for mlp_h_to_4h (4x19584x78336, b=2048): 0.0967
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19584x78336, b=2048): 260.015
Elapsed time for mlp_4h_to_h (4x78336x19584, b=2048): 0.0956
Throughput (in TFLOP/s) for mlp_4h_to_h (4x78336x19584, b=2048): 262.812

Attention duration (in seconds): 0.1061
Attention throughput (in TFLOP/s): 249.196
MLP duration (in seconds): 0.1923
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2984
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19648, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19648x58944, b=2048): 0.0731
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19648x58944, b=2048): 259.488
b: 128, m: 2048, n: 614, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x614x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x614x2048): 165.730
b: 128, m: 2048, n: 2048, k: 614,
Elapsed time for attention_prob_times_values (128x2048x2048x614): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x614): 114.901
Elapsed time for attention_linear_projection (4x19648x19648, b=2048): 0.0247
Throughput (in TFLOP/s) for attention_linear_projection (4x19648x19648, b=2048): 255.987
Elapsed time for mlp_h_to_4h (4x19648x78592, b=2048): 0.0965
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19648x78592, b=2048): 262.261
Elapsed time for mlp_4h_to_h (4x78592x19648, b=2048): 0.0962
Throughput (in TFLOP/s) for mlp_4h_to_h (4x78592x19648, b=2048): 263.057

Attention duration (in seconds): 0.1075
Attention throughput (in TFLOP/s): 247.502
MLP duration (in seconds): 0.1926
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3002
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19712, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19712x59136, b=2048): 0.0729
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19712x59136, b=2048): 261.894
b: 128, m: 2048, n: 616, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x616x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x616x2048): 172.290
b: 128, m: 2048, n: 2048, k: 616,
Elapsed time for attention_prob_times_values (128x2048x2048x616): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x616): 153.323
Elapsed time for attention_linear_projection (4x19712x19712, b=2048): 0.0247
Throughput (in TFLOP/s) for attention_linear_projection (4x19712x19712, b=2048): 257.936
Elapsed time for mlp_h_to_4h (4x19712x78848, b=2048): 0.0966
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19712x78848, b=2048): 263.603
Elapsed time for mlp_4h_to_h (4x78848x19712, b=2048): 0.0964
Throughput (in TFLOP/s) for mlp_4h_to_h (4x78848x19712, b=2048): 264.209

Attention duration (in seconds): 0.1058
Attention throughput (in TFLOP/s): 253.289
MLP duration (in seconds): 0.1930
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2987
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19776, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19776x59328, b=2048): 0.0735
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19776x59328, b=2048): 261.676
b: 128, m: 2048, n: 618, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x618x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x618x2048): 166.720
b: 128, m: 2048, n: 2048, k: 618,
Elapsed time for attention_prob_times_values (128x2048x2048x618): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x618): 116.060
Elapsed time for attention_linear_projection (4x19776x19776, b=2048): 0.0248
Throughput (in TFLOP/s) for attention_linear_projection (4x19776x19776, b=2048): 258.205
Elapsed time for mlp_h_to_4h (4x19776x79104, b=2048): 0.0975
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19776x79104, b=2048): 262.871
Elapsed time for mlp_4h_to_h (4x79104x19776, b=2048): 0.0975
Throughput (in TFLOP/s) for mlp_4h_to_h (4x79104x19776, b=2048): 262.891

Attention duration (in seconds): 0.1080
Attention throughput (in TFLOP/s): 249.667
MLP duration (in seconds): 0.1950
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3030
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19840, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19840x59520, b=2048): 0.0736
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19840x59520, b=2048): 263.010
b: 128, m: 2048, n: 620, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x620x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x620x2048): 173.571
b: 128, m: 2048, n: 2048, k: 620,
Elapsed time for attention_prob_times_values (128x2048x2048x620): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x620): 135.572
Elapsed time for attention_linear_projection (4x19840x19840, b=2048): 0.0250
Throughput (in TFLOP/s) for attention_linear_projection (4x19840x19840, b=2048): 257.543
Elapsed time for mlp_h_to_4h (4x19840x79360, b=2048): 0.0979
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19840x79360, b=2048): 263.369
Elapsed time for mlp_4h_to_h (4x79360x19840, b=2048): 0.0978
Throughput (in TFLOP/s) for mlp_4h_to_h (4x79360x19840, b=2048): 263.672

Attention duration (in seconds): 0.1073
Attention throughput (in TFLOP/s): 252.709
MLP duration (in seconds): 0.1958
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3031
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19904, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19904x59712, b=2048): 0.0755
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19904x59712, b=2048): 258.007
b: 128, m: 2048, n: 622, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x622x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x622x2048): 167.538
b: 128, m: 2048, n: 2048, k: 622,
Elapsed time for attention_prob_times_values (128x2048x2048x622): 0.0058
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x622): 115.544
Elapsed time for attention_linear_projection (4x19904x19904, b=2048): 0.0256
Throughput (in TFLOP/s) for attention_linear_projection (4x19904x19904, b=2048): 253.892
Elapsed time for mlp_h_to_4h (4x19904x79616, b=2048): 0.1000
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19904x79616, b=2048): 259.701
Elapsed time for mlp_4h_to_h (4x79616x19904, b=2048): 0.0987
Throughput (in TFLOP/s) for mlp_4h_to_h (4x79616x19904, b=2048): 263.145

Attention duration (in seconds): 0.1108
Attention throughput (in TFLOP/s): 246.371
MLP duration (in seconds): 0.1986
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3094
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 19968, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19968x59904, b=2048): 0.0758
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19968x59904, b=2048): 258.577
b: 128, m: 2048, n: 624, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x624x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x624x2048): 174.875
b: 128, m: 2048, n: 2048, k: 624,
Elapsed time for attention_prob_times_values (128x2048x2048x624): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x624): 154.613
Elapsed time for attention_linear_projection (4x19968x19968, b=2048): 0.0257
Throughput (in TFLOP/s) for attention_linear_projection (4x19968x19968, b=2048): 254.317
Elapsed time for mlp_h_to_4h (4x19968x79872, b=2048): 0.1003
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19968x79872, b=2048): 260.540
Elapsed time for mlp_4h_to_h (4x79872x19968, b=2048): 0.0993
Throughput (in TFLOP/s) for mlp_4h_to_h (4x79872x19968, b=2048): 263.036

Attention duration (in seconds): 0.1096
Attention throughput (in TFLOP/s): 250.545
MLP duration (in seconds): 0.1996
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3093
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20032, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20032x60096, b=2048): 0.0759
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20032x60096, b=2048): 259.849
b: 128, m: 2048, n: 626, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x626x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x626x2048): 169.091
b: 128, m: 2048, n: 2048, k: 626,
Elapsed time for attention_prob_times_values (128x2048x2048x626): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x626): 117.611
Elapsed time for attention_linear_projection (4x20032x20032, b=2048): 0.0258
Throughput (in TFLOP/s) for attention_linear_projection (4x20032x20032, b=2048): 255.112
Elapsed time for mlp_h_to_4h (4x20032x80128, b=2048): 0.1006
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20032x80128, b=2048): 261.374
Elapsed time for mlp_4h_to_h (4x80128x20032, b=2048): 0.0997
Throughput (in TFLOP/s) for mlp_4h_to_h (4x80128x20032, b=2048): 263.702

Attention duration (in seconds): 0.1114
Attention throughput (in TFLOP/s): 248.214
MLP duration (in seconds): 0.2003
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3117
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20096, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20096x60288, b=2048): 0.0761
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20096x60288, b=2048): 260.745
b: 128, m: 2048, n: 628, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x628x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x628x2048): 176.380
b: 128, m: 2048, n: 2048, k: 628,
Elapsed time for attention_prob_times_values (128x2048x2048x628): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x628): 138.967
Elapsed time for attention_linear_projection (4x20096x20096, b=2048): 0.0259
Throughput (in TFLOP/s) for attention_linear_projection (4x20096x20096, b=2048): 255.928
Elapsed time for mlp_h_to_4h (4x20096x80384, b=2048): 0.1010
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20096x80384, b=2048): 262.000
Elapsed time for mlp_4h_to_h (4x80384x20096, b=2048): 0.1006
Throughput (in TFLOP/s) for mlp_4h_to_h (4x80384x20096, b=2048): 263.139

Attention duration (in seconds): 0.1107
Attention throughput (in TFLOP/s): 251.365
MLP duration (in seconds): 0.2016
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3123
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20160, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20160x60480, b=2048): 0.0766
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20160x60480, b=2048): 260.835
b: 128, m: 2048, n: 630, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x630x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x630x2048): 170.151
b: 128, m: 2048, n: 2048, k: 630,
Elapsed time for attention_prob_times_values (128x2048x2048x630): 0.0058
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x630): 117.252
Elapsed time for attention_linear_projection (4x20160x20160, b=2048): 0.0259
Throughput (in TFLOP/s) for attention_linear_projection (4x20160x20160, b=2048): 256.659
Elapsed time for mlp_h_to_4h (4x20160x80640, b=2048): 0.1018
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20160x80640, b=2048): 261.570
Elapsed time for mlp_4h_to_h (4x80640x20160, b=2048): 0.1009
Throughput (in TFLOP/s) for mlp_4h_to_h (4x80640x20160, b=2048): 264.082

Attention duration (in seconds): 0.1123
Attention throughput (in TFLOP/s): 249.281
MLP duration (in seconds): 0.2027
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3150
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20224, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20224x60672, b=2048): 0.0769
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20224x60672, b=2048): 261.503
b: 128, m: 2048, n: 632, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x632x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x632x2048): 175.881
b: 128, m: 2048, n: 2048, k: 632,
Elapsed time for attention_prob_times_values (128x2048x2048x632): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x632): 156.020
Elapsed time for attention_linear_projection (4x20224x20224, b=2048): 0.0260
Throughput (in TFLOP/s) for attention_linear_projection (4x20224x20224, b=2048): 258.198
Elapsed time for mlp_h_to_4h (4x20224x80896, b=2048): 0.1022
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20224x80896, b=2048): 262.283
Elapsed time for mlp_4h_to_h (4x80896x20224, b=2048): 0.1016
Throughput (in TFLOP/s) for mlp_4h_to_h (4x80896x20224, b=2048): 263.860

Attention duration (in seconds): 0.1110
Attention throughput (in TFLOP/s): 253.624
MLP duration (in seconds): 0.2038
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3148
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20288, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20288x60864, b=2048): 0.0784
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20288x60864, b=2048): 257.976
b: 128, m: 2048, n: 634, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x634x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x634x2048): 171.006
b: 128, m: 2048, n: 2048, k: 634,
Elapsed time for attention_prob_times_values (128x2048x2048x634): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x634): 118.516
Elapsed time for attention_linear_projection (4x20288x20288, b=2048): 0.0266
Throughput (in TFLOP/s) for attention_linear_projection (4x20288x20288, b=2048): 253.628
Elapsed time for mlp_h_to_4h (4x20288x81152, b=2048): 0.1041
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20288x81152, b=2048): 259.218
Elapsed time for mlp_4h_to_h (4x81152x20288, b=2048): 0.1026
Throughput (in TFLOP/s) for mlp_4h_to_h (4x81152x20288, b=2048): 262.901

Attention duration (in seconds): 0.1147
Attention throughput (in TFLOP/s): 246.970
MLP duration (in seconds): 0.2067
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3214
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20352, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20352x61056, b=2048): 0.0789
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20352x61056, b=2048): 257.879
b: 128, m: 2048, n: 636, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x636x2048): 0.0039
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x636x2048): 177.257
b: 128, m: 2048, n: 2048, k: 636,
Elapsed time for attention_prob_times_values (128x2048x2048x636): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x636): 140.289
Elapsed time for attention_linear_projection (4x20352x20352, b=2048): 0.0266
Throughput (in TFLOP/s) for attention_linear_projection (4x20352x20352, b=2048): 255.048
Elapsed time for mlp_h_to_4h (4x20352x81408, b=2048): 0.1048
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20352x81408, b=2048): 259.044
Elapsed time for mlp_4h_to_h (4x81408x20352, b=2048): 0.1031
Throughput (in TFLOP/s) for mlp_4h_to_h (4x81408x20352, b=2048): 263.259

Attention duration (in seconds): 0.1143
Attention throughput (in TFLOP/s): 249.493
MLP duration (in seconds): 0.2079
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3222
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20416, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20416x61248, b=2048): 0.0789
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20416x61248, b=2048): 259.809
b: 128, m: 2048, n: 638, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x638x2048): 0.0040
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x638x2048): 171.961
b: 128, m: 2048, n: 2048, k: 638,
Elapsed time for attention_prob_times_values (128x2048x2048x638): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x638): 119.407
Elapsed time for attention_linear_projection (4x20416x20416, b=2048): 0.0265
Throughput (in TFLOP/s) for attention_linear_projection (4x20416x20416, b=2048): 257.469
Elapsed time for mlp_h_to_4h (4x20416x81664, b=2048): 0.1037
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20416x81664, b=2048): 263.315
Elapsed time for mlp_4h_to_h (4x81664x20416, b=2048): 0.1029
Throughput (in TFLOP/s) for mlp_4h_to_h (4x81664x20416, b=2048): 265.589

Attention duration (in seconds): 0.1151
Attention throughput (in TFLOP/s): 249.231
MLP duration (in seconds): 0.2066
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3217
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20480, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20480x61440, b=2048): 0.0784
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20480x61440, b=2048): 263.096
b: 128, m: 2048, n: 640, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x640x2048): 0.0038
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x640x2048): 179.841
b: 128, m: 2048, n: 2048, k: 640,
Elapsed time for attention_prob_times_values (128x2048x2048x640): 0.0043
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x640): 160.852
Elapsed time for attention_linear_projection (4x20480x20480, b=2048): 0.0266
Throughput (in TFLOP/s) for attention_linear_projection (4x20480x20480, b=2048): 258.570
Elapsed time for mlp_h_to_4h (4x20480x81920, b=2048): 0.1255
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20480x81920, b=2048): 218.977
Elapsed time for mlp_4h_to_h (4x81920x20480, b=2048): 0.1030
Throughput (in TFLOP/s) for mlp_4h_to_h (4x81920x20480, b=2048): 266.747

Attention duration (in seconds): 0.1130
Attention throughput (in TFLOP/s): 255.353
MLP duration (in seconds): 0.2286
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3416
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20544, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20544x61632, b=2048): 0.0784
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20544x61632, b=2048): 264.639
b: 128, m: 2048, n: 642, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x642x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x642x2048): 154.607
b: 128, m: 2048, n: 2048, k: 642,
Elapsed time for attention_prob_times_values (128x2048x2048x642): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x642): 115.357
Elapsed time for attention_linear_projection (4x20544x20544, b=2048): 0.0266
Throughput (in TFLOP/s) for attention_linear_projection (4x20544x20544, b=2048): 259.520
Elapsed time for mlp_h_to_4h (4x20544x82176, b=2048): 0.1046
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20544x82176, b=2048): 264.476
Elapsed time for mlp_4h_to_h (4x82176x20544, b=2048): 0.1040
Throughput (in TFLOP/s) for mlp_4h_to_h (4x82176x20544, b=2048): 266.051

Attention duration (in seconds): 0.1155
Attention throughput (in TFLOP/s): 251.483
MLP duration (in seconds): 0.2085
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3240
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20608, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20608x61824, b=2048): 0.0795
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20608x61824, b=2048): 262.714
b: 128, m: 2048, n: 644, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x644x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x644x2048): 161.326
b: 128, m: 2048, n: 2048, k: 644,
Elapsed time for attention_prob_times_values (128x2048x2048x644): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x644): 135.402
Elapsed time for attention_linear_projection (4x20608x20608, b=2048): 0.0267
Throughput (in TFLOP/s) for attention_linear_projection (4x20608x20608, b=2048): 260.536
Elapsed time for mlp_h_to_4h (4x20608x82432, b=2048): 0.1056
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20608x82432, b=2048): 263.539
Elapsed time for mlp_4h_to_h (4x82432x20608, b=2048): 0.1055
Throughput (in TFLOP/s) for mlp_4h_to_h (4x82432x20608, b=2048): 263.808

Attention duration (in seconds): 0.1156
Attention throughput (in TFLOP/s): 252.823
MLP duration (in seconds): 0.2111
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3267
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20672, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20672x62016, b=2048): 0.0804
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20672x62016, b=2048): 261.092
b: 128, m: 2048, n: 646, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x646x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x646x2048): 156.247
b: 128, m: 2048, n: 2048, k: 646,
Elapsed time for attention_prob_times_values (128x2048x2048x646): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x646): 115.353
Elapsed time for attention_linear_projection (4x20672x20672, b=2048): 0.0272
Throughput (in TFLOP/s) for attention_linear_projection (4x20672x20672, b=2048): 257.400
Elapsed time for mlp_h_to_4h (4x20672x82688, b=2048): 0.1072
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20672x82688, b=2048): 261.191
Elapsed time for mlp_4h_to_h (4x82688x20672, b=2048): 0.1068
Throughput (in TFLOP/s) for mlp_4h_to_h (4x82688x20672, b=2048): 262.236

Attention duration (in seconds): 0.1181
Attention throughput (in TFLOP/s): 248.880
MLP duration (in seconds): 0.2140
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3321
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20736, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20736x62208, b=2048): 0.0809
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20736x62208, b=2048): 261.353
b: 128, m: 2048, n: 648, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x648x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x648x2048): 159.865
b: 128, m: 2048, n: 2048, k: 648,
Elapsed time for attention_prob_times_values (128x2048x2048x648): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x648): 152.985
Elapsed time for attention_linear_projection (4x20736x20736, b=2048): 0.0271
Throughput (in TFLOP/s) for attention_linear_projection (4x20736x20736, b=2048): 260.031
Elapsed time for mlp_h_to_4h (4x20736x82944, b=2048): 0.1070
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20736x82944, b=2048): 263.311
Elapsed time for mlp_4h_to_h (4x82944x20736, b=2048): 0.1075
Throughput (in TFLOP/s) for mlp_4h_to_h (4x82944x20736, b=2048): 262.234

Attention duration (in seconds): 0.1169
Attention throughput (in TFLOP/s): 253.049
MLP duration (in seconds): 0.2145
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3313
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20800, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20800x62400, b=2048): 0.0824
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20800x62400, b=2048): 258.057
b: 128, m: 2048, n: 650, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x650x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x650x2048): 156.609
b: 128, m: 2048, n: 2048, k: 650,
Elapsed time for attention_prob_times_values (128x2048x2048x650): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x650): 114.906
Elapsed time for attention_linear_projection (4x20800x20800, b=2048): 0.0276
Throughput (in TFLOP/s) for attention_linear_projection (4x20800x20800, b=2048): 256.393
Elapsed time for mlp_h_to_4h (4x20800x83200, b=2048): 0.1098
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20800x83200, b=2048): 258.264
Elapsed time for mlp_4h_to_h (4x83200x20800, b=2048): 0.1084
Throughput (in TFLOP/s) for mlp_4h_to_h (4x83200x20800, b=2048): 261.487

Attention duration (in seconds): 0.1206
Attention throughput (in TFLOP/s): 246.715
MLP duration (in seconds): 0.2182
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3388
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20864, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20864x62592, b=2048): 0.0825
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20864x62592, b=2048): 259.285
b: 128, m: 2048, n: 652, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x652x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x652x2048): 162.534
b: 128, m: 2048, n: 2048, k: 652,
Elapsed time for attention_prob_times_values (128x2048x2048x652): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x652): 137.270
Elapsed time for attention_linear_projection (4x20864x20864, b=2048): 0.0279
Throughput (in TFLOP/s) for attention_linear_projection (4x20864x20864, b=2048): 255.763
Elapsed time for mlp_h_to_4h (4x20864x83456, b=2048): 0.1098
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20864x83456, b=2048): 259.809
Elapsed time for mlp_4h_to_h (4x83456x20864, b=2048): 0.1083
Throughput (in TFLOP/s) for mlp_4h_to_h (4x83456x20864, b=2048): 263.360

Attention duration (in seconds): 0.1198
Attention throughput (in TFLOP/s): 249.793
MLP duration (in seconds): 0.2181
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3379
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20928, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20928x62784, b=2048): 0.0829
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20928x62784, b=2048): 259.647
b: 128, m: 2048, n: 654, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x654x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x654x2048): 157.136
b: 128, m: 2048, n: 2048, k: 654,
Elapsed time for attention_prob_times_values (128x2048x2048x654): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x654): 116.486
Elapsed time for attention_linear_projection (4x20928x20928, b=2048): 0.0280
Throughput (in TFLOP/s) for attention_linear_projection (4x20928x20928, b=2048): 256.501
Elapsed time for mlp_h_to_4h (4x20928x83712, b=2048): 0.1102
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20928x83712, b=2048): 260.506
Elapsed time for mlp_4h_to_h (4x83712x20928, b=2048): 0.1093
Throughput (in TFLOP/s) for mlp_4h_to_h (4x83712x20928, b=2048): 262.512

Attention duration (in seconds): 0.1214
Attention throughput (in TFLOP/s): 248.038
MLP duration (in seconds): 0.2195
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3409
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 20992, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20992x62976, b=2048): 0.0834
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20992x62976, b=2048): 259.830
b: 128, m: 2048, n: 656, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x656x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x656x2048): 162.185
b: 128, m: 2048, n: 2048, k: 656,
Elapsed time for attention_prob_times_values (128x2048x2048x656): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x656): 155.788
Elapsed time for attention_linear_projection (4x20992x20992, b=2048): 0.0281
Throughput (in TFLOP/s) for attention_linear_projection (4x20992x20992, b=2048): 257.261
Elapsed time for mlp_h_to_4h (4x20992x83968, b=2048): 0.1106
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20992x83968, b=2048): 261.020
Elapsed time for mlp_4h_to_h (4x83968x20992, b=2048): 0.1096
Throughput (in TFLOP/s) for mlp_4h_to_h (4x83968x20992, b=2048): 263.450

Attention duration (in seconds): 0.1203
Attention throughput (in TFLOP/s): 251.794
MLP duration (in seconds): 0.2203
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3405
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21056, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21056x63168, b=2048): 0.0835
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21056x63168, b=2048): 260.890
b: 128, m: 2048, n: 658, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x658x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x658x2048): 158.775
b: 128, m: 2048, n: 2048, k: 658,
Elapsed time for attention_prob_times_values (128x2048x2048x658): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x658): 116.774
Elapsed time for attention_linear_projection (4x21056x21056, b=2048): 0.0280
Throughput (in TFLOP/s) for attention_linear_projection (4x21056x21056, b=2048): 259.704
Elapsed time for mlp_h_to_4h (4x21056x84224, b=2048): 0.1101
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21056x84224, b=2048): 264.012
Elapsed time for mlp_4h_to_h (4x84224x21056, b=2048): 0.1093
Throughput (in TFLOP/s) for mlp_4h_to_h (4x84224x21056, b=2048): 265.721

Attention duration (in seconds): 0.1220
Attention throughput (in TFLOP/s): 249.746
MLP duration (in seconds): 0.2194
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3414
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21120, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21120x63360, b=2048): 0.0833
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21120x63360, b=2048): 263.060
b: 128, m: 2048, n: 660, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x660x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x660x2048): 164.802
b: 128, m: 2048, n: 2048, k: 660,
Elapsed time for attention_prob_times_values (128x2048x2048x660): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x660): 138.611
Elapsed time for attention_linear_projection (4x21120x21120, b=2048): 0.0283
Throughput (in TFLOP/s) for attention_linear_projection (4x21120x21120, b=2048): 258.524
Elapsed time for mlp_h_to_4h (4x21120x84480, b=2048): 0.1099
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21120x84480, b=2048): 265.892
Elapsed time for mlp_4h_to_h (4x84480x21120, b=2048): 0.1095
Throughput (in TFLOP/s) for mlp_4h_to_h (4x84480x21120, b=2048): 267.067

Attention duration (in seconds): 0.1210
Attention throughput (in TFLOP/s): 253.252
MLP duration (in seconds): 0.2194
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3404
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21184, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21184x63552, b=2048): 0.0842
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21184x63552, b=2048): 262.078
b: 128, m: 2048, n: 662, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x662x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x662x2048): 159.748
b: 128, m: 2048, n: 2048, k: 662,
Elapsed time for attention_prob_times_values (128x2048x2048x662): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x662): 117.906
Elapsed time for attention_linear_projection (4x21184x21184, b=2048): 0.0286
Throughput (in TFLOP/s) for attention_linear_projection (4x21184x21184, b=2048): 257.476
Elapsed time for mlp_h_to_4h (4x21184x84736, b=2048): 0.1119
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21184x84736, b=2048): 262.843
Elapsed time for mlp_4h_to_h (4x84736x21184, b=2048): 0.1108
Throughput (in TFLOP/s) for mlp_4h_to_h (4x84736x21184, b=2048): 265.406

Attention duration (in seconds): 0.1232
Attention throughput (in TFLOP/s): 250.261
MLP duration (in seconds): 0.2227
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3459
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21248, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21248x63744, b=2048): 0.1080
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21248x63744, b=2048): 205.498
b: 128, m: 2048, n: 664, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x664x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x664x2048): 163.767
b: 128, m: 2048, n: 2048, k: 664,
Elapsed time for attention_prob_times_values (128x2048x2048x664): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x664): 156.393
Elapsed time for attention_linear_projection (4x21248x21248, b=2048): 0.0286
Throughput (in TFLOP/s) for attention_linear_projection (4x21248x21248, b=2048): 258.768
Elapsed time for mlp_h_to_4h (4x21248x84992, b=2048): 0.1128
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21248x84992, b=2048): 262.273
Elapsed time for mlp_4h_to_h (4x84992x21248, b=2048): 0.1116
Throughput (in TFLOP/s) for mlp_4h_to_h (4x84992x21248, b=2048): 265.103

Attention duration (in seconds): 0.1455
Attention throughput (in TFLOP/s): 213.177
MLP duration (in seconds): 0.2244
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3699
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21312, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21312x63936, b=2048): 0.0852
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21312x63936, b=2048): 262.108
b: 128, m: 2048, n: 666, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x666x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x666x2048): 159.968
b: 128, m: 2048, n: 2048, k: 666,
Elapsed time for attention_prob_times_values (128x2048x2048x666): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x666): 116.776
Elapsed time for attention_linear_projection (4x21312x21312, b=2048): 0.0286
Throughput (in TFLOP/s) for attention_linear_projection (4x21312x21312, b=2048): 259.799
Elapsed time for mlp_h_to_4h (4x21312x85248, b=2048): 0.1189
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21312x85248, b=2048): 250.355
Elapsed time for mlp_4h_to_h (4x85248x21312, b=2048): 0.1136
Throughput (in TFLOP/s) for mlp_4h_to_h (4x85248x21312, b=2048): 262.106

Attention duration (in seconds): 0.1244
Attention throughput (in TFLOP/s): 250.753
MLP duration (in seconds): 0.2325
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3569
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21376, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21376x64128, b=2048): 0.1345
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21376x64128, b=2048): 167.039
b: 128, m: 2048, n: 668, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x668x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x668x2048): 166.018
b: 128, m: 2048, n: 2048, k: 668,
Elapsed time for attention_prob_times_values (128x2048x2048x668): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x668): 140.534
Elapsed time for attention_linear_projection (4x21376x21376, b=2048): 0.0290
Throughput (in TFLOP/s) for attention_linear_projection (4x21376x21376, b=2048): 258.145
Elapsed time for mlp_h_to_4h (4x21376x85504, b=2048): 0.1568
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21376x85504, b=2048): 190.953
Elapsed time for mlp_4h_to_h (4x85504x21376, b=2048): 0.1137
Throughput (in TFLOP/s) for mlp_4h_to_h (4x85504x21376, b=2048): 263.377

Attention duration (in seconds): 0.1729
Attention throughput (in TFLOP/s): 181.514
MLP duration (in seconds): 0.2705
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4434
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21440, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21440x64320, b=2048): 0.0864
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21440x64320, b=2048): 261.633
b: 128, m: 2048, n: 670, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x670x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x670x2048): 160.809
b: 128, m: 2048, n: 2048, k: 670,
Elapsed time for attention_prob_times_values (128x2048x2048x670): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x670): 118.600
Elapsed time for attention_linear_projection (4x21440x21440, b=2048): 0.0291
Throughput (in TFLOP/s) for attention_linear_projection (4x21440x21440, b=2048): 259.128
Elapsed time for mlp_h_to_4h (4x21440x85760, b=2048): 0.1147
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21440x85760, b=2048): 262.570
Elapsed time for mlp_4h_to_h (4x85760x21440, b=2048): 0.1145
Throughput (in TFLOP/s) for mlp_4h_to_h (4x85760x21440, b=2048): 263.057

Attention duration (in seconds): 0.1260
Attention throughput (in TFLOP/s): 250.586
MLP duration (in seconds): 0.2293
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3552
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21504, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21504x64512, b=2048): 0.0874
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21504x64512, b=2048): 259.929
b: 128, m: 2048, n: 672, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x672x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x672x2048): 166.323
b: 128, m: 2048, n: 2048, k: 672,
Elapsed time for attention_prob_times_values (128x2048x2048x672): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x672): 160.043
Elapsed time for attention_linear_projection (4x21504x21504, b=2048): 0.0292
Throughput (in TFLOP/s) for attention_linear_projection (4x21504x21504, b=2048): 259.437
Elapsed time for mlp_h_to_4h (4x21504x86016, b=2048): 0.1155
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21504x86016, b=2048): 262.382
Elapsed time for mlp_4h_to_h (4x86016x21504, b=2048): 0.1157
Throughput (in TFLOP/s) for mlp_4h_to_h (4x86016x21504, b=2048): 261.833

Attention duration (in seconds): 0.1255
Attention throughput (in TFLOP/s): 252.990
MLP duration (in seconds): 0.2312
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3567
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21568, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21568x64704, b=2048): 0.0879
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21568x64704, b=2048): 260.080
b: 128, m: 2048, n: 674, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x674x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x674x2048): 161.614
b: 128, m: 2048, n: 2048, k: 674,
Elapsed time for attention_prob_times_values (128x2048x2048x674): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x674): 119.802
Elapsed time for attention_linear_projection (4x21568x21568, b=2048): 0.0297
Throughput (in TFLOP/s) for attention_linear_projection (4x21568x21568, b=2048): 257.020
Elapsed time for mlp_h_to_4h (4x21568x86272, b=2048): 0.1169
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21568x86272, b=2048): 260.793
Elapsed time for mlp_4h_to_h (4x86272x21568, b=2048): 0.1161
Throughput (in TFLOP/s) for mlp_4h_to_h (4x86272x21568, b=2048): 262.481

Attention duration (in seconds): 0.1281
Attention throughput (in TFLOP/s): 249.313
MLP duration (in seconds): 0.2330
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3611
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21632, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21632x64896, b=2048): 0.0892
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21632x64896, b=2048): 257.816
b: 128, m: 2048, n: 676, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x676x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x676x2048): 168.099
b: 128, m: 2048, n: 2048, k: 676,
Elapsed time for attention_prob_times_values (128x2048x2048x676): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x676): 142.296
Elapsed time for attention_linear_projection (4x21632x21632, b=2048): 0.0299
Throughput (in TFLOP/s) for attention_linear_projection (4x21632x21632, b=2048): 256.271
Elapsed time for mlp_h_to_4h (4x21632x86528, b=2048): 0.1183
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21632x86528, b=2048): 259.289
Elapsed time for mlp_4h_to_h (4x86528x21632, b=2048): 0.1172
Throughput (in TFLOP/s) for mlp_4h_to_h (4x86528x21632, b=2048): 261.690

Attention duration (in seconds): 0.1285
Attention throughput (in TFLOP/s): 249.858
MLP duration (in seconds): 0.2355
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3640
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21696, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21696x65088, b=2048): 0.0894
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21696x65088, b=2048): 258.803
b: 128, m: 2048, n: 678, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x678x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x678x2048): 162.581
b: 128, m: 2048, n: 2048, k: 678,
Elapsed time for attention_prob_times_values (128x2048x2048x678): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x678): 120.309
Elapsed time for attention_linear_projection (4x21696x21696, b=2048): 0.0300
Throughput (in TFLOP/s) for attention_linear_projection (4x21696x21696, b=2048): 256.992
Elapsed time for mlp_h_to_4h (4x21696x86784, b=2048): 0.1326
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21696x86784, b=2048): 232.730
Elapsed time for mlp_4h_to_h (4x86784x21696, b=2048): 0.1176
Throughput (in TFLOP/s) for mlp_4h_to_h (4x86784x21696, b=2048): 262.308

Attention duration (in seconds): 0.1299
Attention throughput (in TFLOP/s): 248.619
MLP duration (in seconds): 0.2502
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3801
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21760, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21760x65280, b=2048): 0.0895
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21760x65280, b=2048): 259.968
b: 128, m: 2048, n: 680, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x680x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x680x2048): 167.594
b: 128, m: 2048, n: 2048, k: 680,
Elapsed time for attention_prob_times_values (128x2048x2048x680): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x680): 159.080
Elapsed time for attention_linear_projection (4x21760x21760, b=2048): 0.0301
Throughput (in TFLOP/s) for attention_linear_projection (4x21760x21760, b=2048): 257.568
Elapsed time for mlp_h_to_4h (4x21760x87040, b=2048): 0.1188
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21760x87040, b=2048): 261.137
Elapsed time for mlp_4h_to_h (4x87040x21760, b=2048): 0.1188
Throughput (in TFLOP/s) for mlp_4h_to_h (4x87040x21760, b=2048): 261.121

Attention duration (in seconds): 0.1286
Attention throughput (in TFLOP/s): 252.675
MLP duration (in seconds): 0.2377
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3663
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21824, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21824x65472, b=2048): 0.1162
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21824x65472, b=2048): 201.413
b: 128, m: 2048, n: 682, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x682x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x682x2048): 163.306
b: 128, m: 2048, n: 2048, k: 682,
Elapsed time for attention_prob_times_values (128x2048x2048x682): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x682): 121.062
Elapsed time for attention_linear_projection (4x21824x21824, b=2048): 0.0303
Throughput (in TFLOP/s) for attention_linear_projection (4x21824x21824, b=2048): 257.473
Elapsed time for mlp_h_to_4h (4x21824x87296, b=2048): 0.1576
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21824x87296, b=2048): 198.001
Elapsed time for mlp_4h_to_h (4x87296x21824, b=2048): 0.1195
Throughput (in TFLOP/s) for mlp_4h_to_h (4x87296x21824, b=2048): 261.188

Attention duration (in seconds): 0.1571
Attention throughput (in TFLOP/s): 208.048
MLP duration (in seconds): 0.2772
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4342
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21888, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21888x65664, b=2048): 0.1340
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21888x65664, b=2048): 175.782
b: 128, m: 2048, n: 684, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x684x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x684x2048): 170.163
b: 128, m: 2048, n: 2048, k: 684,
Elapsed time for attention_prob_times_values (128x2048x2048x684): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x684): 144.759
Elapsed time for attention_linear_projection (4x21888x21888, b=2048): 0.0303
Throughput (in TFLOP/s) for attention_linear_projection (4x21888x21888, b=2048): 259.377
Elapsed time for mlp_h_to_4h (4x21888x87552, b=2048): 0.1917
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21888x87552, b=2048): 163.767
Elapsed time for mlp_4h_to_h (4x87552x21888, b=2048): 0.1202
Throughput (in TFLOP/s) for mlp_4h_to_h (4x87552x21888, b=2048): 261.146

Attention duration (in seconds): 0.1736
Attention throughput (in TFLOP/s): 189.307
MLP duration (in seconds): 0.3119
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4856
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 21952, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21952x65856, b=2048): 0.1358
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21952x65856, b=2048): 174.478
b: 128, m: 2048, n: 686, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x686x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x686x2048): 164.325
b: 128, m: 2048, n: 2048, k: 686,
Elapsed time for attention_prob_times_values (128x2048x2048x686): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x686): 121.935
Elapsed time for attention_linear_projection (4x21952x21952, b=2048): 0.0304
Throughput (in TFLOP/s) for attention_linear_projection (4x21952x21952, b=2048): 259.862
Elapsed time for mlp_h_to_4h (4x21952x87808, b=2048): 0.1203
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21952x87808, b=2048): 262.622
Elapsed time for mlp_4h_to_h (4x87808x21952, b=2048): 0.1217
Throughput (in TFLOP/s) for mlp_4h_to_h (4x87808x21952, b=2048): 259.595

Attention duration (in seconds): 0.1767
Attention throughput (in TFLOP/s): 187.109
MLP duration (in seconds): 0.2419
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4186
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22016, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22016x66048, b=2048): 0.1261
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22016x66048, b=2048): 188.913
b: 128, m: 2048, n: 688, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x688x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x688x2048): 170.237
b: 128, m: 2048, n: 2048, k: 688,
Elapsed time for attention_prob_times_values (128x2048x2048x688): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x688): 162.020
Elapsed time for attention_linear_projection (4x22016x22016, b=2048): 0.0307
Throughput (in TFLOP/s) for attention_linear_projection (4x22016x22016, b=2048): 258.678
Elapsed time for mlp_h_to_4h (4x22016x88064, b=2048): 0.1755
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22016x88064, b=2048): 181.016
Elapsed time for mlp_4h_to_h (4x88064x22016, b=2048): 0.1211
Throughput (in TFLOP/s) for mlp_4h_to_h (4x88064x22016, b=2048): 262.208

Attention duration (in seconds): 0.1657
Attention throughput (in TFLOP/s): 200.609
MLP duration (in seconds): 0.2966
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4623
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22080, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22080x66240, b=2048): 0.0924
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22080x66240, b=2048): 259.427
b: 128, m: 2048, n: 690, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x690x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x690x2048): 165.521
b: 128, m: 2048, n: 2048, k: 690,
Elapsed time for attention_prob_times_values (128x2048x2048x690): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x690): 122.270
Elapsed time for attention_linear_projection (4x22080x22080, b=2048): 0.0310
Throughput (in TFLOP/s) for attention_linear_projection (4x22080x22080, b=2048): 257.379
Elapsed time for mlp_h_to_4h (4x22080x88320, b=2048): 0.1429
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22080x88320, b=2048): 223.529
Elapsed time for mlp_4h_to_h (4x88320x22080, b=2048): 0.1221
Throughput (in TFLOP/s) for mlp_4h_to_h (4x88320x22080, b=2048): 261.575

Attention duration (in seconds): 0.1339
Attention throughput (in TFLOP/s): 249.609
MLP duration (in seconds): 0.2651
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3990
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22144, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22144x66432, b=2048): 0.1274
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22144x66432, b=2048): 189.122
b: 128, m: 2048, n: 692, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x692x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x692x2048): 171.670
b: 128, m: 2048, n: 2048, k: 692,
Elapsed time for attention_prob_times_values (128x2048x2048x692): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x692): 145.842
Elapsed time for attention_linear_projection (4x22144x22144, b=2048): 0.0311
Throughput (in TFLOP/s) for attention_linear_projection (4x22144x22144, b=2048): 258.100
Elapsed time for mlp_h_to_4h (4x22144x88576, b=2048): 0.1988
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22144x88576, b=2048): 161.680
Elapsed time for mlp_4h_to_h (4x88576x22144, b=2048): 0.1222
Throughput (in TFLOP/s) for mlp_4h_to_h (4x88576x22144, b=2048): 262.964

Attention duration (in seconds): 0.1680
Attention throughput (in TFLOP/s): 200.141
MLP duration (in seconds): 0.3210
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4890
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22208, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22208x66624, b=2048): 0.0925
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22208x66624, b=2048): 262.170
b: 128, m: 2048, n: 694, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x694x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x694x2048): 165.941
b: 128, m: 2048, n: 2048, k: 694,
Elapsed time for attention_prob_times_values (128x2048x2048x694): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x694): 122.033
Elapsed time for attention_linear_projection (4x22208x22208, b=2048): 0.0313
Throughput (in TFLOP/s) for attention_linear_projection (4x22208x22208, b=2048): 258.409
Elapsed time for mlp_h_to_4h (4x22208x88832, b=2048): 0.1590
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22208x88832, b=2048): 203.254
Elapsed time for mlp_4h_to_h (4x88832x22208, b=2048): 0.1233
Throughput (in TFLOP/s) for mlp_4h_to_h (4x88832x22208, b=2048): 262.159

Attention duration (in seconds): 0.1343
Attention throughput (in TFLOP/s): 251.707
MLP duration (in seconds): 0.2823
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4166
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22272, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22272x66816, b=2048): 0.1434
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22272x66816, b=2048): 170.067
b: 128, m: 2048, n: 696, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x696x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x696x2048): 171.182
b: 128, m: 2048, n: 2048, k: 696,
Elapsed time for attention_prob_times_values (128x2048x2048x696): 0.0046
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x696): 161.864
Elapsed time for attention_linear_projection (4x22272x22272, b=2048): 0.0314
Throughput (in TFLOP/s) for attention_linear_projection (4x22272x22272, b=2048): 258.959
Elapsed time for mlp_h_to_4h (4x22272x89088, b=2048): 0.2025
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22272x89088, b=2048): 160.553
Elapsed time for mlp_4h_to_h (4x89088x22272, b=2048): 0.1250
Throughput (in TFLOP/s) for mlp_4h_to_h (4x89088x22272, b=2048): 260.135

Attention duration (in seconds): 0.1837
Attention throughput (in TFLOP/s): 185.071
MLP duration (in seconds): 0.3274
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5112
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22336, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22336x67008, b=2048): 0.1157
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22336x67008, b=2048): 211.906
b: 128, m: 2048, n: 698, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x698x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x698x2048): 166.924
b: 128, m: 2048, n: 2048, k: 698,
Elapsed time for attention_prob_times_values (128x2048x2048x698): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x698): 123.246
Elapsed time for attention_linear_projection (4x22336x22336, b=2048): 0.0316
Throughput (in TFLOP/s) for attention_linear_projection (4x22336x22336, b=2048): 258.458
Elapsed time for mlp_h_to_4h (4x22336x89344, b=2048): 0.1307
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22336x89344, b=2048): 250.214
Elapsed time for mlp_4h_to_h (4x89344x22336, b=2048): 0.1253
Throughput (in TFLOP/s) for mlp_4h_to_h (4x89344x22336, b=2048): 260.836

Attention duration (in seconds): 0.1579
Attention throughput (in TFLOP/s): 216.536
MLP duration (in seconds): 0.2560
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4139
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22400, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22400x67200, b=2048): 0.1093
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22400x67200, b=2048): 225.540
b: 128, m: 2048, n: 700, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x700x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x700x2048): 173.044
b: 128, m: 2048, n: 2048, k: 700,
Elapsed time for attention_prob_times_values (128x2048x2048x700): 0.0052
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x700): 145.842
Elapsed time for attention_linear_projection (4x22400x22400, b=2048): 0.0321
Throughput (in TFLOP/s) for attention_linear_projection (4x22400x22400, b=2048): 256.379
Elapsed time for mlp_h_to_4h (4x22400x89600, b=2048): 0.1464
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22400x89600, b=2048): 224.608
Elapsed time for mlp_4h_to_h (4x89600x22400, b=2048): 0.1255
Throughput (in TFLOP/s) for mlp_4h_to_h (4x89600x22400, b=2048): 262.064

Attention duration (in seconds): 0.1509
Attention throughput (in TFLOP/s): 227.860
MLP duration (in seconds): 0.2719
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4228
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22464, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22464x67392, b=2048): 0.0960
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22464x67392, b=2048): 258.243
b: 128, m: 2048, n: 702, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x702x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x702x2048): 167.685
b: 128, m: 2048, n: 2048, k: 702,
Elapsed time for attention_prob_times_values (128x2048x2048x702): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x702): 124.313
Elapsed time for attention_linear_projection (4x22464x22464, b=2048): 0.0322
Throughput (in TFLOP/s) for attention_linear_projection (4x22464x22464, b=2048): 256.655
Elapsed time for mlp_h_to_4h (4x22464x89856, b=2048): 0.2027
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22464x89856, b=2048): 163.151
Elapsed time for mlp_4h_to_h (4x89856x22464, b=2048): 0.1276
Throughput (in TFLOP/s) for mlp_4h_to_h (4x89856x22464, b=2048): 259.102

Attention duration (in seconds): 0.1388
Attention throughput (in TFLOP/s): 249.092
MLP duration (in seconds): 0.3303
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4692
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22528, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22528x67584, b=2048): 0.0965
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22528x67584, b=2048): 258.431
b: 128, m: 2048, n: 704, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x704x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x704x2048): 174.243
b: 128, m: 2048, n: 2048, k: 704,
Elapsed time for attention_prob_times_values (128x2048x2048x704): 0.0045
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x704): 167.275
Elapsed time for attention_linear_projection (4x22528x22528, b=2048): 0.0324
Throughput (in TFLOP/s) for attention_linear_projection (4x22528x22528, b=2048): 256.969
Elapsed time for mlp_h_to_4h (4x22528x90112, b=2048): 0.2054
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22528x90112, b=2048): 161.892
Elapsed time for mlp_4h_to_h (4x90112x22528, b=2048): 0.1267
Throughput (in TFLOP/s) for mlp_4h_to_h (4x90112x22528, b=2048): 262.615

Attention duration (in seconds): 0.1377
Attention throughput (in TFLOP/s): 252.445
MLP duration (in seconds): 0.3321
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4698
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22592, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22592x67776, b=2048): 0.0968
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22592x67776, b=2048): 259.203
b: 128, m: 2048, n: 706, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x706x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x706x2048): 168.034
b: 128, m: 2048, n: 2048, k: 706,
Elapsed time for attention_prob_times_values (128x2048x2048x706): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x706): 120.501
Elapsed time for attention_linear_projection (4x22592x22592, b=2048): 0.0325
Throughput (in TFLOP/s) for attention_linear_projection (4x22592x22592, b=2048): 257.524
Elapsed time for mlp_h_to_4h (4x22592x90368, b=2048): 0.1377
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22592x90368, b=2048): 242.926
Elapsed time for mlp_4h_to_h (4x90368x22592, b=2048): 0.1275
Throughput (in TFLOP/s) for mlp_4h_to_h (4x90368x22592, b=2048): 262.416

Attention duration (in seconds): 0.1401
Attention throughput (in TFLOP/s): 249.647
MLP duration (in seconds): 0.2652
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4052
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22656, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22656x67968, b=2048): 0.1169
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22656x67968, b=2048): 215.866
b: 128, m: 2048, n: 708, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x708x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x708x2048): 176.719
b: 128, m: 2048, n: 2048, k: 708,
Elapsed time for attention_prob_times_values (128x2048x2048x708): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x708): 142.805
Elapsed time for attention_linear_projection (4x22656x22656, b=2048): 0.0322
Throughput (in TFLOP/s) for attention_linear_projection (4x22656x22656, b=2048): 260.789
Elapsed time for mlp_h_to_4h (4x22656x90624, b=2048): 0.1764
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22656x90624, b=2048): 190.748
Elapsed time for mlp_4h_to_h (4x90624x22656, b=2048): 0.1288
Throughput (in TFLOP/s) for mlp_4h_to_h (4x90624x22656, b=2048): 261.139

Attention duration (in seconds): 0.1587
Attention throughput (in TFLOP/s): 221.481
MLP duration (in seconds): 0.3052
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4639
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22720, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22720x68160, b=2048): 0.1109
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22720x68160, b=2048): 228.754
b: 128, m: 2048, n: 710, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x710x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x710x2048): 169.848
b: 128, m: 2048, n: 2048, k: 710,
Elapsed time for attention_prob_times_values (128x2048x2048x710): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x710): 120.467
Elapsed time for attention_linear_projection (4x22720x22720, b=2048): 0.0325
Throughput (in TFLOP/s) for attention_linear_projection (4x22720x22720, b=2048): 259.894
Elapsed time for mlp_h_to_4h (4x22720x90880, b=2048): 0.1658
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22720x90880, b=2048): 204.011
Elapsed time for mlp_4h_to_h (4x90880x22720, b=2048): 0.1285
Throughput (in TFLOP/s) for mlp_4h_to_h (4x90880x22720, b=2048): 263.347

Attention duration (in seconds): 0.1543
Attention throughput (in TFLOP/s): 229.167
MLP duration (in seconds): 0.2943
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4486
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22784, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22784x68352, b=2048): 0.0968
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22784x68352, b=2048): 263.719
b: 128, m: 2048, n: 712, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x712x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x712x2048): 175.098
b: 128, m: 2048, n: 2048, k: 712,
Elapsed time for attention_prob_times_values (128x2048x2048x712): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x712): 159.785
Elapsed time for attention_linear_projection (4x22784x22784, b=2048): 0.0327
Throughput (in TFLOP/s) for attention_linear_projection (4x22784x22784, b=2048): 259.988
Elapsed time for mlp_h_to_4h (4x22784x91136, b=2048): 0.1915
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22784x91136, b=2048): 177.620
Elapsed time for mlp_4h_to_h (4x91136x22784, b=2048): 0.1284
Throughput (in TFLOP/s) for mlp_4h_to_h (4x91136x22784, b=2048): 264.859

Attention duration (in seconds): 0.1386
Attention throughput (in TFLOP/s): 256.460
MLP duration (in seconds): 0.3200
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4586
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22848, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22848x68544, b=2048): 0.0982
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22848x68544, b=2048): 261.407
b: 128, m: 2048, n: 714, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x714x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x714x2048): 171.360
b: 128, m: 2048, n: 2048, k: 714,
Elapsed time for attention_prob_times_values (128x2048x2048x714): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x714): 119.845
Elapsed time for attention_linear_projection (4x22848x22848, b=2048): 0.0330
Throughput (in TFLOP/s) for attention_linear_projection (4x22848x22848, b=2048): 258.863
Elapsed time for mlp_h_to_4h (4x22848x91392, b=2048): 0.1309
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22848x91392, b=2048): 261.423
Elapsed time for mlp_4h_to_h (4x91392x22848, b=2048): 0.1288
Throughput (in TFLOP/s) for mlp_4h_to_h (4x91392x22848, b=2048): 265.579

Attention duration (in seconds): 0.1421
Attention throughput (in TFLOP/s): 251.606
MLP duration (in seconds): 0.2597
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4018
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22912, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22912x68736, b=2048): 0.0984
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22912x68736, b=2048): 262.314
b: 128, m: 2048, n: 716, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x716x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x716x2048): 178.016
b: 128, m: 2048, n: 2048, k: 716,
Elapsed time for attention_prob_times_values (128x2048x2048x716): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x716): 143.595
Elapsed time for attention_linear_projection (4x22912x22912, b=2048): 0.0332
Throughput (in TFLOP/s) for attention_linear_projection (4x22912x22912, b=2048): 259.364
Elapsed time for mlp_h_to_4h (4x22912x91648, b=2048): 0.1349
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22912x91648, b=2048): 255.071
Elapsed time for mlp_4h_to_h (4x91648x22912, b=2048): 0.1304
Throughput (in TFLOP/s) for mlp_4h_to_h (4x91648x22912, b=2048): 263.929

Attention duration (in seconds): 0.1412
Attention throughput (in TFLOP/s): 254.542
MLP duration (in seconds): 0.2652
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4064
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 22976, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22976x68928, b=2048): 0.0995
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22976x68928, b=2048): 260.900
b: 128, m: 2048, n: 718, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x718x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x718x2048): 171.443
b: 128, m: 2048, n: 2048, k: 718,
Elapsed time for attention_prob_times_values (128x2048x2048x718): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x718): 120.431
Elapsed time for attention_linear_projection (4x22976x22976, b=2048): 0.0336
Throughput (in TFLOP/s) for attention_linear_projection (4x22976x22976, b=2048): 257.748
Elapsed time for mlp_h_to_4h (4x22976x91904, b=2048): 0.1748
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22976x91904, b=2048): 197.903
Elapsed time for mlp_4h_to_h (4x91904x22976, b=2048): 0.1324
Throughput (in TFLOP/s) for mlp_4h_to_h (4x91904x22976, b=2048): 261.300

Attention duration (in seconds): 0.1439
Attention throughput (in TFLOP/s): 251.121
MLP duration (in seconds): 0.3072
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4511
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23040, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23040x69120, b=2048): 0.1003
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23040x69120, b=2048): 260.102
b: 128, m: 2048, n: 720, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x720x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x720x2048): 178.360
b: 128, m: 2048, n: 2048, k: 720,
Elapsed time for attention_prob_times_values (128x2048x2048x720): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x720): 161.976
Elapsed time for attention_linear_projection (4x23040x23040, b=2048): 0.0336
Throughput (in TFLOP/s) for attention_linear_projection (4x23040x23040, b=2048): 258.470
Elapsed time for mlp_h_to_4h (4x23040x92160, b=2048): 0.2031
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23040x92160, b=2048): 171.312
Elapsed time for mlp_4h_to_h (4x92160x23040, b=2048): 0.1335
Throughput (in TFLOP/s) for mlp_4h_to_h (4x92160x23040, b=2048): 260.611

Attention duration (in seconds): 0.1431
Attention throughput (in TFLOP/s): 253.968
MLP duration (in seconds): 0.3366
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4796
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23104, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23104x69312, b=2048): 0.1004
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23104x69312, b=2048): 261.333
b: 128, m: 2048, n: 722, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x722x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x722x2048): 171.706
b: 128, m: 2048, n: 2048, k: 722,
Elapsed time for attention_prob_times_values (128x2048x2048x722): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x722): 120.439
Elapsed time for attention_linear_projection (4x23104x23104, b=2048): 0.0337
Throughput (in TFLOP/s) for attention_linear_projection (4x23104x23104, b=2048): 259.439
Elapsed time for mlp_h_to_4h (4x23104x92416, b=2048): 0.1333
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23104x92416, b=2048): 262.394
Elapsed time for mlp_4h_to_h (4x92416x23104, b=2048): 0.1329
Throughput (in TFLOP/s) for mlp_4h_to_h (4x92416x23104, b=2048): 263.260

Attention duration (in seconds): 0.1451
Attention throughput (in TFLOP/s): 251.851
MLP duration (in seconds): 0.2662
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4113
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23168, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23168x69504, b=2048): 0.1000
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23168x69504, b=2048): 263.939
b: 128, m: 2048, n: 724, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x724x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x724x2048): 179.618
b: 128, m: 2048, n: 2048, k: 724,
Elapsed time for attention_prob_times_values (128x2048x2048x724): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x724): 142.540
Elapsed time for attention_linear_projection (4x23168x23168, b=2048): 0.0335
Throughput (in TFLOP/s) for attention_linear_projection (4x23168x23168, b=2048): 262.634
Elapsed time for mlp_h_to_4h (4x23168x92672, b=2048): 0.1747
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23168x92672, b=2048): 201.345
Elapsed time for mlp_4h_to_h (4x92672x23168, b=2048): 0.1326
Throughput (in TFLOP/s) for mlp_4h_to_h (4x92672x23168, b=2048): 265.371

Attention duration (in seconds): 0.1432
Attention throughput (in TFLOP/s): 256.463
MLP duration (in seconds): 0.3073
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4505
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23232, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23232x69696, b=2048): 0.1008
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23232x69696, b=2048): 263.135
b: 128, m: 2048, n: 726, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x726x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x726x2048): 173.408
b: 128, m: 2048, n: 2048, k: 726,
Elapsed time for attention_prob_times_values (128x2048x2048x726): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x726): 120.985
Elapsed time for attention_linear_projection (4x23232x23232, b=2048): 0.0337
Throughput (in TFLOP/s) for attention_linear_projection (4x23232x23232, b=2048): 262.144
Elapsed time for mlp_h_to_4h (4x23232x92928, b=2048): 0.1336
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23232x92928, b=2048): 264.829
Elapsed time for mlp_4h_to_h (4x92928x23232, b=2048): 0.1340
Throughput (in TFLOP/s) for mlp_4h_to_h (4x92928x23232, b=2048): 263.876

Attention duration (in seconds): 0.1455
Attention throughput (in TFLOP/s): 253.837
MLP duration (in seconds): 0.2676
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4131
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23296, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23296x69888, b=2048): 0.1010
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23296x69888, b=2048): 264.197
b: 128, m: 2048, n: 728, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x728x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x728x2048): 179.081
b: 128, m: 2048, n: 2048, k: 728,
Elapsed time for attention_prob_times_values (128x2048x2048x728): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x728): 162.284
Elapsed time for attention_linear_projection (4x23296x23296, b=2048): 0.0341
Throughput (in TFLOP/s) for attention_linear_projection (4x23296x23296, b=2048): 260.601
Elapsed time for mlp_h_to_4h (4x23296x93184, b=2048): 0.1674
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23296x93184, b=2048): 212.526
Elapsed time for mlp_4h_to_h (4x93184x23296, b=2048): 0.1357
Throughput (in TFLOP/s) for mlp_4h_to_h (4x93184x23296, b=2048): 262.183

Attention duration (in seconds): 0.1443
Attention throughput (in TFLOP/s): 257.369
MLP duration (in seconds): 0.3030
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4473
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23360, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23360x70080, b=2048): 0.1032
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23360x70080, b=2048): 259.868
b: 128, m: 2048, n: 730, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x730x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x730x2048): 173.985
b: 128, m: 2048, n: 2048, k: 730,
Elapsed time for attention_prob_times_values (128x2048x2048x730): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x730): 121.841
Elapsed time for attention_linear_projection (4x23360x23360, b=2048): 0.0346
Throughput (in TFLOP/s) for attention_linear_projection (4x23360x23360, b=2048): 258.074
Elapsed time for mlp_h_to_4h (4x23360x93440, b=2048): 0.1572
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23360x93440, b=2048): 227.482
Elapsed time for mlp_4h_to_h (4x93440x23360, b=2048): 0.1363
Throughput (in TFLOP/s) for mlp_4h_to_h (4x93440x23360, b=2048): 262.394

Attention duration (in seconds): 0.1488
Attention throughput (in TFLOP/s): 250.883
MLP duration (in seconds): 0.2935
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4423
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23424, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23424x70272, b=2048): 0.1028
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23424x70272, b=2048): 262.471
b: 128, m: 2048, n: 732, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x732x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x732x2048): 180.410
b: 128, m: 2048, n: 2048, k: 732,
Elapsed time for attention_prob_times_values (128x2048x2048x732): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x732): 141.852
Elapsed time for attention_linear_projection (4x23424x23424, b=2048): 0.0346
Throughput (in TFLOP/s) for attention_linear_projection (4x23424x23424, b=2048): 259.725
Elapsed time for mlp_h_to_4h (4x23424x93696, b=2048): 0.1669
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23424x93696, b=2048): 215.489
Elapsed time for mlp_4h_to_h (4x93696x23424, b=2048): 0.1368
Throughput (in TFLOP/s) for mlp_4h_to_h (4x93696x23424, b=2048): 262.831

Attention duration (in seconds): 0.1473
Attention throughput (in TFLOP/s): 254.859
MLP duration (in seconds): 0.3037
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4509
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23488, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23488x70464, b=2048): 0.1033
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23488x70464, b=2048): 262.575
b: 128, m: 2048, n: 734, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x734x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x734x2048): 175.366
b: 128, m: 2048, n: 2048, k: 734,
Elapsed time for attention_prob_times_values (128x2048x2048x734): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x734): 122.313
Elapsed time for attention_linear_projection (4x23488x23488, b=2048): 0.0349
Throughput (in TFLOP/s) for attention_linear_projection (4x23488x23488, b=2048): 258.844
Elapsed time for mlp_h_to_4h (4x23488x93952, b=2048): 0.1628
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23488x93952, b=2048): 222.072
Elapsed time for mlp_4h_to_h (4x93952x23488, b=2048): 0.1382
Throughput (in TFLOP/s) for mlp_4h_to_h (4x93952x23488, b=2048): 261.576

Attention duration (in seconds): 0.1491
Attention throughput (in TFLOP/s): 253.013
MLP duration (in seconds): 0.3010
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4502
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23552, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23552x70656, b=2048): 0.1134
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23552x70656, b=2048): 240.336
b: 128, m: 2048, n: 736, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x736x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x736x2048): 180.743
b: 128, m: 2048, n: 2048, k: 736,
Elapsed time for attention_prob_times_values (128x2048x2048x736): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x736): 165.393
Elapsed time for attention_linear_projection (4x23552x23552, b=2048): 0.0352
Throughput (in TFLOP/s) for attention_linear_projection (4x23552x23552, b=2048): 258.416
Elapsed time for mlp_h_to_4h (4x23552x94208, b=2048): 0.1860
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23552x94208, b=2048): 195.452
Elapsed time for mlp_4h_to_h (4x94208x23552, b=2048): 0.1993
Throughput (in TFLOP/s) for mlp_4h_to_h (4x94208x23552, b=2048): 182.390

Attention duration (in seconds): 0.1578
Attention throughput (in TFLOP/s): 240.445
MLP duration (in seconds): 0.3853
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5431
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23616, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23616x70848, b=2048): 0.1665
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23616x70848, b=2048): 164.597
b: 128, m: 2048, n: 738, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x738x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x738x2048): 176.022
b: 128, m: 2048, n: 2048, k: 738,
Elapsed time for attention_prob_times_values (128x2048x2048x738): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x738): 124.124
Elapsed time for attention_linear_projection (4x23616x23616, b=2048): 0.0354
Throughput (in TFLOP/s) for attention_linear_projection (4x23616x23616, b=2048): 258.089
Elapsed time for mlp_h_to_4h (4x23616x94464, b=2048): 0.2335
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23616x94464, b=2048): 156.548
Elapsed time for mlp_4h_to_h (4x94464x23616, b=2048): 0.1402
Throughput (in TFLOP/s) for mlp_4h_to_h (4x94464x23616, b=2048): 260.781

Attention duration (in seconds): 0.2128
Attention throughput (in TFLOP/s): 179.176
MLP duration (in seconds): 0.3736
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5865
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23680, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23680x71040, b=2048): 0.1631
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23680x71040, b=2048): 169.032
b: 128, m: 2048, n: 740, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x740x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x740x2048): 183.527
b: 128, m: 2048, n: 2048, k: 740,
Elapsed time for attention_prob_times_values (128x2048x2048x740): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x740): 147.157
Elapsed time for attention_linear_projection (4x23680x23680, b=2048): 0.0356
Throughput (in TFLOP/s) for attention_linear_projection (4x23680x23680, b=2048): 258.426
Elapsed time for mlp_h_to_4h (4x23680x94720, b=2048): 0.2328
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23680x94720, b=2048): 157.865
Elapsed time for mlp_4h_to_h (4x94720x23680, b=2048): 0.1405
Throughput (in TFLOP/s) for mlp_4h_to_h (4x94720x23680, b=2048): 261.490

Attention duration (in seconds): 0.2083
Attention throughput (in TFLOP/s): 184.020
MLP duration (in seconds): 0.3733
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5817
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23744, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23744x71232, b=2048): 0.1183
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23744x71232, b=2048): 234.230
b: 128, m: 2048, n: 742, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x742x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x742x2048): 176.986
b: 128, m: 2048, n: 2048, k: 742,
Elapsed time for attention_prob_times_values (128x2048x2048x742): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x742): 124.300
Elapsed time for attention_linear_projection (4x23744x23744, b=2048): 0.0359
Throughput (in TFLOP/s) for attention_linear_projection (4x23744x23744, b=2048): 256.992
Elapsed time for mlp_h_to_4h (4x23744x94976, b=2048): 0.2348
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23744x94976, b=2048): 157.358
Elapsed time for mlp_4h_to_h (4x94976x23744, b=2048): 0.1410
Throughput (in TFLOP/s) for mlp_4h_to_h (4x94976x23744, b=2048): 262.057

Attention duration (in seconds): 0.1652
Attention throughput (in TFLOP/s): 233.357
MLP duration (in seconds): 0.3758
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5410
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23808, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23808x71424, b=2048): 0.1066
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23808x71424, b=2048): 261.465
b: 128, m: 2048, n: 744, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x744x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x744x2048): 182.549
b: 128, m: 2048, n: 2048, k: 744,
Elapsed time for attention_prob_times_values (128x2048x2048x744): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x744): 164.985
Elapsed time for attention_linear_projection (4x23808x23808, b=2048): 0.0360
Throughput (in TFLOP/s) for attention_linear_projection (4x23808x23808, b=2048): 257.860
Elapsed time for mlp_h_to_4h (4x23808x95232, b=2048): 0.2376
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23808x95232, b=2048): 156.376
Elapsed time for mlp_4h_to_h (4x95232x23808, b=2048): 0.2118
Throughput (in TFLOP/s) for mlp_4h_to_h (4x95232x23808, b=2048): 175.367

Attention duration (in seconds): 0.1518
Attention throughput (in TFLOP/s): 255.257
MLP duration (in seconds): 0.4494
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6012
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23872, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23872x71616, b=2048): 0.1560
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23872x71616, b=2048): 179.592
b: 128, m: 2048, n: 746, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x746x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x746x2048): 177.771
b: 128, m: 2048, n: 2048, k: 746,
Elapsed time for attention_prob_times_values (128x2048x2048x746): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x746): 125.718
Elapsed time for attention_linear_projection (4x23872x23872, b=2048): 0.0363
Throughput (in TFLOP/s) for attention_linear_projection (4x23872x23872, b=2048): 257.089
Elapsed time for mlp_h_to_4h (4x23872x95488, b=2048): 0.2363
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23872x95488, b=2048): 158.083
Elapsed time for mlp_4h_to_h (4x95488x23872, b=2048): 0.1433
Throughput (in TFLOP/s) for mlp_4h_to_h (4x95488x23872, b=2048): 260.673

Attention duration (in seconds): 0.2032
Attention throughput (in TFLOP/s): 191.716
MLP duration (in seconds): 0.3795
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5827
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 23936, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23936x71808, b=2048): 0.1072
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23936x71808, b=2048): 262.681
b: 128, m: 2048, n: 748, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x748x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x748x2048): 184.728
b: 128, m: 2048, n: 2048, k: 748,
Elapsed time for attention_prob_times_values (128x2048x2048x748): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x748): 147.912
Elapsed time for attention_linear_projection (4x23936x23936, b=2048): 0.0361
Throughput (in TFLOP/s) for attention_linear_projection (4x23936x23936, b=2048): 260.040
Elapsed time for mlp_h_to_4h (4x23936x95744, b=2048): 0.2426
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23936x95744, b=2048): 154.803
Elapsed time for mlp_4h_to_h (4x95744x23936, b=2048): 0.1428
Throughput (in TFLOP/s) for mlp_4h_to_h (4x95744x23936, b=2048): 262.907

Attention duration (in seconds): 0.1531
Attention throughput (in TFLOP/s): 255.773
MLP duration (in seconds): 0.3854
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5384
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24000, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24000x72000, b=2048): 0.1082
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24000x72000, b=2048): 261.542
b: 128, m: 2048, n: 750, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x750x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x750x2048): 178.149
b: 128, m: 2048, n: 2048, k: 750,
Elapsed time for attention_prob_times_values (128x2048x2048x750): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x750): 126.095
Elapsed time for attention_linear_projection (4x24000x24000, b=2048): 0.0365
Throughput (in TFLOP/s) for attention_linear_projection (4x24000x24000, b=2048): 258.626
Elapsed time for mlp_h_to_4h (4x24000x96000, b=2048): 0.1443
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24000x96000, b=2048): 261.621
Elapsed time for mlp_4h_to_h (4x96000x24000, b=2048): 0.1438
Throughput (in TFLOP/s) for mlp_4h_to_h (4x96000x24000, b=2048): 262.595

Attention duration (in seconds): 0.1556
Attention throughput (in TFLOP/s): 252.878
MLP duration (in seconds): 0.2880
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4437
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24064, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24064x72192, b=2048): 0.1093
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24064x72192, b=2048): 260.523
b: 128, m: 2048, n: 752, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x752x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x752x2048): 186.185
b: 128, m: 2048, n: 2048, k: 752,
Elapsed time for attention_prob_times_values (128x2048x2048x752): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x752): 167.113
Elapsed time for attention_linear_projection (4x24064x24064, b=2048): 0.0364
Throughput (in TFLOP/s) for attention_linear_projection (4x24064x24064, b=2048): 260.731
Elapsed time for mlp_h_to_4h (4x24064x96256, b=2048): 0.1730
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24064x96256, b=2048): 219.424
Elapsed time for mlp_4h_to_h (4x96256x24064, b=2048): 0.1433
Throughput (in TFLOP/s) for mlp_4h_to_h (4x96256x24064, b=2048): 264.831

Attention duration (in seconds): 0.1548
Attention throughput (in TFLOP/s): 255.574
MLP duration (in seconds): 0.3163
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4711
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24128, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24128x72384, b=2048): 0.1677
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24128x72384, b=2048): 170.635
b: 128, m: 2048, n: 754, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x754x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x754x2048): 180.729
b: 128, m: 2048, n: 2048, k: 754,
Elapsed time for attention_prob_times_values (128x2048x2048x754): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x754): 127.491
Elapsed time for attention_linear_projection (4x24128x24128, b=2048): 0.0363
Throughput (in TFLOP/s) for attention_linear_projection (4x24128x24128, b=2048): 262.691
Elapsed time for mlp_h_to_4h (4x24128x96512, b=2048): 0.2470
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24128x96512, b=2048): 154.435
Elapsed time for mlp_4h_to_h (4x96512x24128, b=2048): 0.1443
Throughput (in TFLOP/s) for mlp_4h_to_h (4x96512x24128, b=2048): 264.311

Attention duration (in seconds): 0.2148
Attention throughput (in TFLOP/s): 185.128
MLP duration (in seconds): 0.3914
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6062
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24192, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24192x72576, b=2048): 0.1230
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24192x72576, b=2048): 233.828
b: 128, m: 2048, n: 756, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x756x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x756x2048): 187.691
b: 128, m: 2048, n: 2048, k: 756,
Elapsed time for attention_prob_times_values (128x2048x2048x756): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x756): 148.483
Elapsed time for attention_linear_projection (4x24192x24192, b=2048): 0.0365
Throughput (in TFLOP/s) for attention_linear_projection (4x24192x24192, b=2048): 263.055
Elapsed time for mlp_h_to_4h (4x24192x96768, b=2048): 0.1983
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24192x96768, b=2048): 193.414
Elapsed time for mlp_4h_to_h (4x96768x24192, b=2048): 0.1464
Throughput (in TFLOP/s) for mlp_4h_to_h (4x96768x24192, b=2048): 261.914

Attention duration (in seconds): 0.1693
Attention throughput (in TFLOP/s): 236.187
MLP duration (in seconds): 0.3447
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5140
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24256, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24256x72768, b=2048): 0.1110
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24256x72768, b=2048): 260.527
b: 128, m: 2048, n: 758, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x758x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x758x2048): 180.211
b: 128, m: 2048, n: 2048, k: 758,
Elapsed time for attention_prob_times_values (128x2048x2048x758): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x758): 127.688
Elapsed time for attention_linear_projection (4x24256x24256, b=2048): 0.0373
Throughput (in TFLOP/s) for attention_linear_projection (4x24256x24256, b=2048): 258.214
Elapsed time for mlp_h_to_4h (4x24256x97024, b=2048): 0.2501
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24256x97024, b=2048): 154.183
Elapsed time for mlp_4h_to_h (4x97024x24256, b=2048): 0.1456
Throughput (in TFLOP/s) for mlp_4h_to_h (4x97024x24256, b=2048): 264.788

Attention duration (in seconds): 0.1592
Attention throughput (in TFLOP/s): 252.388
MLP duration (in seconds): 0.3957
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5549
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24320, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24320x72960, b=2048): 0.1109
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24320x72960, b=2048): 262.210
b: 128, m: 2048, n: 760, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x760x2048): 0.0044
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x760x2048): 186.617
b: 128, m: 2048, n: 2048, k: 760,
Elapsed time for attention_prob_times_values (128x2048x2048x760): 0.0049
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x760): 167.633
Elapsed time for attention_linear_projection (4x24320x24320, b=2048): 0.0371
Throughput (in TFLOP/s) for attention_linear_projection (4x24320x24320, b=2048): 261.152
Elapsed time for mlp_h_to_4h (4x24320x97280, b=2048): 0.2416
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24320x97280, b=2048): 160.432
Elapsed time for mlp_4h_to_h (4x97280x24320, b=2048): 0.1464
Throughput (in TFLOP/s) for mlp_4h_to_h (4x97280x24320, b=2048): 264.804

Attention duration (in seconds): 0.1572
Attention throughput (in TFLOP/s): 256.930
MLP duration (in seconds): 0.3880
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5452
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24384, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24384x73152, b=2048): 0.1109
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24384x73152, b=2048): 263.437
b: 128, m: 2048, n: 762, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x762x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x762x2048): 182.452
b: 128, m: 2048, n: 2048, k: 762,
Elapsed time for attention_prob_times_values (128x2048x2048x762): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x762): 129.076
Elapsed time for attention_linear_projection (4x24384x24384, b=2048): 0.0372
Throughput (in TFLOP/s) for attention_linear_projection (4x24384x24384, b=2048): 261.551
Elapsed time for mlp_h_to_4h (4x24384x97536, b=2048): 0.2392
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24384x97536, b=2048): 162.923
Elapsed time for mlp_4h_to_h (4x97536x24384, b=2048): 0.1484
Throughput (in TFLOP/s) for mlp_4h_to_h (4x97536x24384, b=2048): 262.660

Attention duration (in seconds): 0.1590
Attention throughput (in TFLOP/s): 255.355
MLP duration (in seconds): 0.3875
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5465
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24448, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24448x73344, b=2048): 0.1788
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24448x73344, b=2048): 164.301
b: 128, m: 2048, n: 764, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x764x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x764x2048): 189.563
b: 128, m: 2048, n: 2048, k: 764,
Elapsed time for attention_prob_times_values (128x2048x2048x764): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x764): 151.990
Elapsed time for attention_linear_projection (4x24448x24448, b=2048): 0.0376
Throughput (in TFLOP/s) for attention_linear_projection (4x24448x24448, b=2048): 260.443
Elapsed time for mlp_h_to_4h (4x24448x97792, b=2048): 0.2398
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24448x97792, b=2048): 163.367
Elapsed time for mlp_4h_to_h (4x97792x24448, b=2048): 0.1508
Throughput (in TFLOP/s) for mlp_4h_to_h (4x97792x24448, b=2048): 259.675

Attention duration (in seconds): 0.2261
Attention throughput (in TFLOP/s): 180.477
MLP duration (in seconds): 0.3906
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6168
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24512, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24512x73536, b=2048): 0.1122
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24512x73536, b=2048): 263.251
b: 128, m: 2048, n: 766, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x766x2048): 0.0045
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x766x2048): 182.189
b: 128, m: 2048, n: 2048, k: 766,
Elapsed time for attention_prob_times_values (128x2048x2048x766): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x766): 130.042
Elapsed time for attention_linear_projection (4x24512x24512, b=2048): 0.0376
Throughput (in TFLOP/s) for attention_linear_projection (4x24512x24512, b=2048): 262.116
Elapsed time for mlp_h_to_4h (4x24512x98048, b=2048): 0.2430
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24512x98048, b=2048): 162.016
Elapsed time for mlp_4h_to_h (4x98048x24512, b=2048): 0.1488
Throughput (in TFLOP/s) for mlp_4h_to_h (4x98048x24512, b=2048): 264.685

Attention duration (in seconds): 0.1606
Attention throughput (in TFLOP/s): 255.460
MLP duration (in seconds): 0.3918
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5524
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24576, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24576x73728, b=2048): 0.1846
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24576x73728, b=2048): 160.835
b: 128, m: 2048, n: 768, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x768x2048): 0.0043
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x768x2048): 190.387
b: 128, m: 2048, n: 2048, k: 768,
Elapsed time for attention_prob_times_values (128x2048x2048x768): 0.0048
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x768): 172.241
Elapsed time for attention_linear_projection (4x24576x24576, b=2048): 0.0379
Throughput (in TFLOP/s) for attention_linear_projection (4x24576x24576, b=2048): 260.798
Elapsed time for mlp_h_to_4h (4x24576x98304, b=2048): 0.2594
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24576x98304, b=2048): 152.565
Elapsed time for mlp_4h_to_h (4x98304x24576, b=2048): 0.1501
Throughput (in TFLOP/s) for mlp_4h_to_h (4x98304x24576, b=2048): 263.712

Attention duration (in seconds): 0.2316
Attention throughput (in TFLOP/s): 177.998
MLP duration (in seconds): 0.4095
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6412
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24640, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24640x73920, b=2048): 0.1139
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24640x73920, b=2048): 262.012
b: 128, m: 2048, n: 770, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x770x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x770x2048): 165.313
b: 128, m: 2048, n: 2048, k: 770,
Elapsed time for attention_prob_times_values (128x2048x2048x770): 0.0066
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x770): 125.399
Elapsed time for attention_linear_projection (4x24640x24640, b=2048): 0.0384
Throughput (in TFLOP/s) for attention_linear_projection (4x24640x24640, b=2048): 258.721
Elapsed time for mlp_h_to_4h (4x24640x98560, b=2048): 0.2559
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24640x98560, b=2048): 155.505
Elapsed time for mlp_4h_to_h (4x98560x24640, b=2048): 0.1512
Throughput (in TFLOP/s) for mlp_4h_to_h (4x98560x24640, b=2048): 263.122

Attention duration (in seconds): 0.1639
Attention throughput (in TFLOP/s): 252.796
MLP duration (in seconds): 0.4071
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5710
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24704, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24704x74112, b=2048): 0.1149
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24704x74112, b=2048): 261.041
b: 128, m: 2048, n: 772, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x772x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x772x2048): 170.497
b: 128, m: 2048, n: 2048, k: 772,
Elapsed time for attention_prob_times_values (128x2048x2048x772): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x772): 147.534
Elapsed time for attention_linear_projection (4x24704x24704, b=2048): 0.0386
Throughput (in TFLOP/s) for attention_linear_projection (4x24704x24704, b=2048): 259.234
Elapsed time for mlp_h_to_4h (4x24704x98816, b=2048): 0.2551
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24704x98816, b=2048): 156.779
Elapsed time for mlp_4h_to_h (4x98816x24704, b=2048): 0.1524
Throughput (in TFLOP/s) for mlp_4h_to_h (4x98816x24704, b=2048): 262.445

Attention duration (in seconds): 0.1640
Attention throughput (in TFLOP/s): 254.041
MLP duration (in seconds): 0.4075
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5715
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24768, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24768x74304, b=2048): 0.1331
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24768x74304, b=2048): 226.575
b: 128, m: 2048, n: 774, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x774x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x774x2048): 164.735
b: 128, m: 2048, n: 2048, k: 774,
Elapsed time for attention_prob_times_values (128x2048x2048x774): 0.0066
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x774): 125.316
Elapsed time for attention_linear_projection (4x24768x24768, b=2048): 0.0391
Throughput (in TFLOP/s) for attention_linear_projection (4x24768x24768, b=2048): 256.784
Elapsed time for mlp_h_to_4h (4x24768x99072, b=2048): 0.2610
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24768x99072, b=2048): 154.057
Elapsed time for mlp_4h_to_h (4x99072x24768, b=2048): 0.1540
Throughput (in TFLOP/s) for mlp_4h_to_h (4x99072x24768, b=2048): 261.027

Attention duration (in seconds): 0.1839
Attention throughput (in TFLOP/s): 227.657
MLP duration (in seconds): 0.4150
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5989
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24832, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24832x74496, b=2048): 0.1786
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24832x74496, b=2048): 169.685
b: 128, m: 2048, n: 776, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x776x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x776x2048): 168.920
b: 128, m: 2048, n: 2048, k: 776,
Elapsed time for attention_prob_times_values (128x2048x2048x776): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x776): 165.677
Elapsed time for attention_linear_projection (4x24832x24832, b=2048): 0.0395
Throughput (in TFLOP/s) for attention_linear_projection (4x24832x24832, b=2048): 255.635
Elapsed time for mlp_h_to_4h (4x24832x99328, b=2048): 0.2621
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24832x99328, b=2048): 154.198
Elapsed time for mlp_4h_to_h (4x99328x24832, b=2048): 0.1554
Throughput (in TFLOP/s) for mlp_4h_to_h (4x99328x24832, b=2048): 260.102

Attention duration (in seconds): 0.2281
Attention throughput (in TFLOP/s): 184.472
MLP duration (in seconds): 0.4174
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6455
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24896, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24896x74688, b=2048): 0.1684
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24896x74688, b=2048): 180.878
b: 128, m: 2048, n: 778, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x778x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x778x2048): 165.986
b: 128, m: 2048, n: 2048, k: 778,
Elapsed time for attention_prob_times_values (128x2048x2048x778): 0.0066
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x778): 126.432
Elapsed time for attention_linear_projection (4x24896x24896, b=2048): 0.0397
Throughput (in TFLOP/s) for attention_linear_projection (4x24896x24896, b=2048): 255.667
Elapsed time for mlp_h_to_4h (4x24896x99584, b=2048): 0.2640
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24896x99584, b=2048): 153.840
Elapsed time for mlp_4h_to_h (4x99584x24896, b=2048): 0.1574
Throughput (in TFLOP/s) for mlp_4h_to_h (4x99584x24896, b=2048): 258.009

Attention duration (in seconds): 0.2198
Attention throughput (in TFLOP/s): 192.416
MLP duration (in seconds): 0.4215
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6413
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 24960, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24960x74880, b=2048): 0.1189
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24960x74880, b=2048): 257.618
b: 128, m: 2048, n: 780, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x780x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x780x2048): 171.683
b: 128, m: 2048, n: 2048, k: 780,
Elapsed time for attention_prob_times_values (128x2048x2048x780): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x780): 148.176
Elapsed time for attention_linear_projection (4x24960x24960, b=2048): 0.0393
Throughput (in TFLOP/s) for attention_linear_projection (4x24960x24960, b=2048): 259.492
Elapsed time for mlp_h_to_4h (4x24960x99840, b=2048): 0.2652
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24960x99840, b=2048): 153.982
Elapsed time for mlp_4h_to_h (4x99840x24960, b=2048): 0.1572
Throughput (in TFLOP/s) for mlp_4h_to_h (4x99840x24960, b=2048): 259.704

Attention duration (in seconds): 0.1687
Attention throughput (in TFLOP/s): 251.904
MLP duration (in seconds): 0.4224
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5911
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25024, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25024x75072, b=2048): 0.1199
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25024x75072, b=2048): 256.810
b: 128, m: 2048, n: 782, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x782x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x782x2048): 165.904
b: 128, m: 2048, n: 2048, k: 782,
Elapsed time for attention_prob_times_values (128x2048x2048x782): 0.0066
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x782): 126.352
Elapsed time for attention_linear_projection (4x25024x25024, b=2048): 0.0402
Throughput (in TFLOP/s) for attention_linear_projection (4x25024x25024, b=2048): 255.314
Elapsed time for mlp_h_to_4h (4x25024x100096, b=2048): 0.2357
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25024x100096, b=2048): 174.079
Elapsed time for mlp_4h_to_h (4x100096x25024, b=2048): 0.1582
Throughput (in TFLOP/s) for mlp_4h_to_h (4x100096x25024, b=2048): 259.445

Attention duration (in seconds): 0.1717
Attention throughput (in TFLOP/s): 248.733
MLP duration (in seconds): 0.3939
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5657
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25088, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25088x75264, b=2048): 0.1190
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25088x75264, b=2048): 259.912
b: 128, m: 2048, n: 784, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x784x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x784x2048): 171.699
b: 128, m: 2048, n: 2048, k: 784,
Elapsed time for attention_prob_times_values (128x2048x2048x784): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x784): 169.101
Elapsed time for attention_linear_projection (4x25088x25088, b=2048): 0.0402
Throughput (in TFLOP/s) for attention_linear_projection (4x25088x25088, b=2048): 256.225
Elapsed time for mlp_h_to_4h (4x25088x100352, b=2048): 0.1680
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25088x100352, b=2048): 245.525
Elapsed time for mlp_4h_to_h (4x100352x25088, b=2048): 0.1575
Throughput (in TFLOP/s) for mlp_4h_to_h (4x100352x25088, b=2048): 261.930

Attention duration (in seconds): 0.1692
Attention throughput (in TFLOP/s): 253.805
MLP duration (in seconds): 0.3255
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4946
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25152, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25152x75456, b=2048): 0.1197
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25152x75456, b=2048): 259.711
b: 128, m: 2048, n: 786, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x786x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x786x2048): 166.627
b: 128, m: 2048, n: 2048, k: 786,
Elapsed time for attention_prob_times_values (128x2048x2048x786): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x786): 126.071
Elapsed time for attention_linear_projection (4x25152x25152, b=2048): 0.0404
Throughput (in TFLOP/s) for attention_linear_projection (4x25152x25152, b=2048): 256.754
Elapsed time for mlp_h_to_4h (4x25152x100608, b=2048): 0.2132
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25152x100608, b=2048): 194.436
Elapsed time for mlp_4h_to_h (4x100608x25152, b=2048): 0.1599
Throughput (in TFLOP/s) for mlp_4h_to_h (4x100608x25152, b=2048): 259.305

Attention duration (in seconds): 0.1719
Attention throughput (in TFLOP/s): 251.067
MLP duration (in seconds): 0.3731
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5450
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25216, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25216x75648, b=2048): 0.1197
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25216x75648, b=2048): 261.031
b: 128, m: 2048, n: 788, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x788x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x788x2048): 173.055
b: 128, m: 2048, n: 2048, k: 788,
Elapsed time for attention_prob_times_values (128x2048x2048x788): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x788): 148.029
Elapsed time for attention_linear_projection (4x25216x25216, b=2048): 0.0403
Throughput (in TFLOP/s) for attention_linear_projection (4x25216x25216, b=2048): 258.534
Elapsed time for mlp_h_to_4h (4x25216x100864, b=2048): 0.2523
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25216x100864, b=2048): 165.153
Elapsed time for mlp_4h_to_h (4x100864x25216, b=2048): 0.1593
Throughput (in TFLOP/s) for mlp_4h_to_h (4x100864x25216, b=2048): 261.611

Attention duration (in seconds): 0.1706
Attention throughput (in TFLOP/s): 254.135
MLP duration (in seconds): 0.4116
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5822
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25280, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25280x75840, b=2048): 0.1210
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25280x75840, b=2048): 259.659
b: 128, m: 2048, n: 790, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x790x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x790x2048): 167.011
b: 128, m: 2048, n: 2048, k: 790,
Elapsed time for attention_prob_times_values (128x2048x2048x790): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x790): 126.645
Elapsed time for attention_linear_projection (4x25280x25280, b=2048): 0.0405
Throughput (in TFLOP/s) for attention_linear_projection (4x25280x25280, b=2048): 258.287
Elapsed time for mlp_h_to_4h (4x25280x101120, b=2048): 0.2651
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25280x101120, b=2048): 157.991
Elapsed time for mlp_4h_to_h (4x101120x25280, b=2048): 0.1609
Throughput (in TFLOP/s) for mlp_4h_to_h (4x101120x25280, b=2048): 260.350

Attention duration (in seconds): 0.1733
Attention throughput (in TFLOP/s): 251.482
MLP duration (in seconds): 0.4260
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5993
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25344, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25344x76032, b=2048): 0.1679
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25344x76032, b=2048): 187.991
b: 128, m: 2048, n: 792, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x792x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x792x2048): 172.104
b: 128, m: 2048, n: 2048, k: 792,
Elapsed time for attention_prob_times_values (128x2048x2048x792): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x792): 167.560
Elapsed time for attention_linear_projection (4x25344x25344, b=2048): 0.0402
Throughput (in TFLOP/s) for attention_linear_projection (4x25344x25344, b=2048): 261.911
Elapsed time for mlp_h_to_4h (4x25344x101376, b=2048): 0.2718
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25344x101376, b=2048): 154.883
Elapsed time for mlp_4h_to_h (4x101376x25344, b=2048): 0.1593
Throughput (in TFLOP/s) for mlp_4h_to_h (4x101376x25344, b=2048): 264.283

Attention duration (in seconds): 0.2181
Attention throughput (in TFLOP/s): 200.772
MLP duration (in seconds): 0.4311
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6492
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25408, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25408x76224, b=2048): 0.1416
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25408x76224, b=2048): 224.159
b: 128, m: 2048, n: 794, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x794x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x794x2048): 168.951
b: 128, m: 2048, n: 2048, k: 794,
Elapsed time for attention_prob_times_values (128x2048x2048x794): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x794): 127.015
Elapsed time for attention_linear_projection (4x25408x25408, b=2048): 0.0405
Throughput (in TFLOP/s) for attention_linear_projection (4x25408x25408, b=2048): 260.974
Elapsed time for mlp_h_to_4h (4x25408x101632, b=2048): 0.2467
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25408x101632, b=2048): 171.474
Elapsed time for mlp_4h_to_h (4x101632x25408, b=2048): 0.1611
Throughput (in TFLOP/s) for mlp_4h_to_h (4x101632x25408, b=2048): 262.673

Attention duration (in seconds): 0.1938
Attention throughput (in TFLOP/s): 227.056
MLP duration (in seconds): 0.4078
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6016
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25472, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25472x76416, b=2048): 0.1206
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25472x76416, b=2048): 264.544
b: 128, m: 2048, n: 796, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x796x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x796x2048): 175.514
b: 128, m: 2048, n: 2048, k: 796,
Elapsed time for attention_prob_times_values (128x2048x2048x796): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x796): 150.745
Elapsed time for attention_linear_projection (4x25472x25472, b=2048): 0.0406
Throughput (in TFLOP/s) for attention_linear_projection (4x25472x25472, b=2048): 262.050
Elapsed time for mlp_h_to_4h (4x25472x101888, b=2048): 0.2826
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25472x101888, b=2048): 150.455
Elapsed time for mlp_4h_to_h (4x101888x25472, b=2048): 0.1630
Throughput (in TFLOP/s) for mlp_4h_to_h (4x101888x25472, b=2048): 260.878

Attention duration (in seconds): 0.1717
Attention throughput (in TFLOP/s): 257.670
MLP duration (in seconds): 0.4456
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6173
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25536, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25536x76608, b=2048): 0.1231
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25536x76608, b=2048): 260.360
b: 128, m: 2048, n: 798, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x798x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x798x2048): 169.203
b: 128, m: 2048, n: 2048, k: 798,
Elapsed time for attention_prob_times_values (128x2048x2048x798): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x798): 127.501
Elapsed time for attention_linear_projection (4x25536x25536, b=2048): 0.0413
Throughput (in TFLOP/s) for attention_linear_projection (4x25536x25536, b=2048): 258.444
Elapsed time for mlp_h_to_4h (4x25536x102144, b=2048): 0.2026
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25536x102144, b=2048): 210.888
Elapsed time for mlp_4h_to_h (4x102144x25536, b=2048): 0.1636
Throughput (in TFLOP/s) for mlp_4h_to_h (4x102144x25536, b=2048): 261.292

Attention duration (in seconds): 0.1762
Attention throughput (in TFLOP/s): 252.225
MLP duration (in seconds): 0.3662
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5424
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25600, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25600x76800, b=2048): 0.1853
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25600x76800, b=2048): 173.796
b: 128, m: 2048, n: 800, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x800x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x800x2048): 174.872
b: 128, m: 2048, n: 2048, k: 800,
Elapsed time for attention_prob_times_values (128x2048x2048x800): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x800): 171.770
Elapsed time for attention_linear_projection (4x25600x25600, b=2048): 0.0413
Throughput (in TFLOP/s) for attention_linear_projection (4x25600x25600, b=2048): 260.175
Elapsed time for mlp_h_to_4h (4x25600x102400, b=2048): 0.2685
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25600x102400, b=2048): 159.937
Elapsed time for mlp_4h_to_h (4x102400x25600, b=2048): 0.1633
Throughput (in TFLOP/s) for mlp_4h_to_h (4x102400x25600, b=2048): 263.005

Attention duration (in seconds): 0.2365
Attention throughput (in TFLOP/s): 188.847
MLP duration (in seconds): 0.4318
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6684
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25664, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25664x76992, b=2048): 0.1914
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25664x76992, b=2048): 169.152
b: 128, m: 2048, n: 802, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x802x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x802x2048): 170.864
b: 128, m: 2048, n: 2048, k: 802,
Elapsed time for attention_prob_times_values (128x2048x2048x802): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x802): 127.764
Elapsed time for attention_linear_projection (4x25664x25664, b=2048): 0.0413
Throughput (in TFLOP/s) for attention_linear_projection (4x25664x25664, b=2048): 261.022
Elapsed time for mlp_h_to_4h (4x25664x102656, b=2048): 0.2754
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25664x102656, b=2048): 156.751
Elapsed time for mlp_4h_to_h (4x102656x25664, b=2048): 0.1640
Throughput (in TFLOP/s) for mlp_4h_to_h (4x102656x25664, b=2048): 263.222

Attention duration (in seconds): 0.2445
Attention throughput (in TFLOP/s): 183.580
MLP duration (in seconds): 0.4394
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6839
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25728, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25728x77184, b=2048): 0.2035
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25728x77184, b=2048): 159.852
b: 128, m: 2048, n: 804, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x804x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x804x2048): 177.460
b: 128, m: 2048, n: 2048, k: 804,
Elapsed time for attention_prob_times_values (128x2048x2048x804): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x804): 151.800
Elapsed time for attention_linear_projection (4x25728x25728, b=2048): 0.0414
Throughput (in TFLOP/s) for attention_linear_projection (4x25728x25728, b=2048): 261.663
Elapsed time for mlp_h_to_4h (4x25728x102912, b=2048): 0.2686
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25728x102912, b=2048): 161.492
Elapsed time for mlp_4h_to_h (4x102912x25728, b=2048): 0.1653
Throughput (in TFLOP/s) for mlp_4h_to_h (4x102912x25728, b=2048): 262.435

Attention duration (in seconds): 0.2555
Attention throughput (in TFLOP/s): 176.522
MLP duration (in seconds): 0.4339
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6895
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25792, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25792x77376, b=2048): 0.1249
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25792x77376, b=2048): 261.826
b: 128, m: 2048, n: 806, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x806x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x806x2048): 170.362
b: 128, m: 2048, n: 2048, k: 806,
Elapsed time for attention_prob_times_values (128x2048x2048x806): 0.0068
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x806): 127.953
Elapsed time for attention_linear_projection (4x25792x25792, b=2048): 0.0421
Throughput (in TFLOP/s) for attention_linear_projection (4x25792x25792, b=2048): 259.165
Elapsed time for mlp_h_to_4h (4x25792x103168, b=2048): 0.2830
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25792x103168, b=2048): 154.032
Elapsed time for mlp_4h_to_h (4x103168x25792, b=2048): 0.1652
Throughput (in TFLOP/s) for mlp_4h_to_h (4x103168x25792, b=2048): 263.876

Attention duration (in seconds): 0.1788
Attention throughput (in TFLOP/s): 253.536
MLP duration (in seconds): 0.4482
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6270
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25856, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25856x77568, b=2048): 0.1239
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25856x77568, b=2048): 265.186
b: 128, m: 2048, n: 808, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x808x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x808x2048): 175.776
b: 128, m: 2048, n: 2048, k: 808,
Elapsed time for attention_prob_times_values (128x2048x2048x808): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x808): 170.027
Elapsed time for attention_linear_projection (4x25856x25856, b=2048): 0.0416
Throughput (in TFLOP/s) for attention_linear_projection (4x25856x25856, b=2048): 263.414
Elapsed time for mlp_h_to_4h (4x25856x103424, b=2048): 0.2355
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25856x103424, b=2048): 186.029
Elapsed time for mlp_4h_to_h (4x103424x25856, b=2048): 0.1663
Throughput (in TFLOP/s) for mlp_4h_to_h (4x103424x25856, b=2048): 263.492

Attention duration (in seconds): 0.1755
Attention throughput (in TFLOP/s): 259.486
MLP duration (in seconds): 0.4018
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5773
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25920, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25920x77760, b=2048): 0.1273
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25920x77760, b=2048): 259.427
b: 128, m: 2048, n: 810, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x810x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x810x2048): 171.280
b: 128, m: 2048, n: 2048, k: 810,
Elapsed time for attention_prob_times_values (128x2048x2048x810): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x810): 129.102
Elapsed time for attention_linear_projection (4x25920x25920, b=2048): 0.0426
Throughput (in TFLOP/s) for attention_linear_projection (4x25920x25920, b=2048): 258.490
Elapsed time for mlp_h_to_4h (4x25920x103680, b=2048): 0.2664
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25920x103680, b=2048): 165.276
Elapsed time for mlp_4h_to_h (4x103680x25920, b=2048): 0.1692
Throughput (in TFLOP/s) for mlp_4h_to_h (4x103680x25920, b=2048): 260.265

Attention duration (in seconds): 0.1817
Attention throughput (in TFLOP/s): 251.912
MLP duration (in seconds): 0.4356
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6173
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 25984, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25984x77952, b=2048): 0.1275
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25984x77952, b=2048): 260.309
b: 128, m: 2048, n: 812, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x812x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x812x2048): 178.300
b: 128, m: 2048, n: 2048, k: 812,
Elapsed time for attention_prob_times_values (128x2048x2048x812): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x812): 153.788
Elapsed time for attention_linear_projection (4x25984x25984, b=2048): 0.0427
Throughput (in TFLOP/s) for attention_linear_projection (4x25984x25984, b=2048): 258.875
Elapsed time for mlp_h_to_4h (4x25984x103936, b=2048): 0.2884
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25984x103936, b=2048): 153.426
Elapsed time for mlp_4h_to_h (4x103936x25984, b=2048): 0.1690
Throughput (in TFLOP/s) for mlp_4h_to_h (4x103936x25984, b=2048): 261.872

Attention duration (in seconds): 0.1808
Attention throughput (in TFLOP/s): 254.411
MLP duration (in seconds): 0.4574
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6381
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26048, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26048x78144, b=2048): 0.1975
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26048x78144, b=2048): 168.868
b: 128, m: 2048, n: 814, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x814x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x814x2048): 172.271
b: 128, m: 2048, n: 2048, k: 814,
Elapsed time for attention_prob_times_values (128x2048x2048x814): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x814): 130.316
Elapsed time for attention_linear_projection (4x26048x26048, b=2048): 0.0432
Throughput (in TFLOP/s) for attention_linear_projection (4x26048x26048, b=2048): 257.554
Elapsed time for mlp_h_to_4h (4x26048x104192, b=2048): 0.2850
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26048x104192, b=2048): 156.046
Elapsed time for mlp_4h_to_h (4x104192x26048, b=2048): 0.1721
Throughput (in TFLOP/s) for mlp_4h_to_h (4x104192x26048, b=2048): 258.446

Attention duration (in seconds): 0.2524
Attention throughput (in TFLOP/s): 183.076
MLP duration (in seconds): 0.4570
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7094
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26112, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26112x78336, b=2048): 0.1503
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26112x78336, b=2048): 223.000
b: 128, m: 2048, n: 816, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x816x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x816x2048): 178.786
b: 128, m: 2048, n: 2048, k: 816,
Elapsed time for attention_prob_times_values (128x2048x2048x816): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x816): 173.698
Elapsed time for attention_linear_projection (4x26112x26112, b=2048): 0.0428
Throughput (in TFLOP/s) for attention_linear_projection (4x26112x26112, b=2048): 260.801
Elapsed time for mlp_h_to_4h (4x26112x104448, b=2048): 0.2798
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26112x104448, b=2048): 159.724
Elapsed time for mlp_4h_to_h (4x104448x26112, b=2048): 0.1697
Throughput (in TFLOP/s) for mlp_4h_to_h (4x104448x26112, b=2048): 263.363

Attention duration (in seconds): 0.2031
Attention throughput (in TFLOP/s): 228.682
MLP duration (in seconds): 0.4494
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6525
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26176, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26176x78528, b=2048): 0.1361
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26176x78528, b=2048): 247.487
b: 128, m: 2048, n: 818, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x818x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x818x2048): 174.371
b: 128, m: 2048, n: 2048, k: 818,
Elapsed time for attention_prob_times_values (128x2048x2048x818): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x818): 130.831
Elapsed time for attention_linear_projection (4x26176x26176, b=2048): 0.0429
Throughput (in TFLOP/s) for attention_linear_projection (4x26176x26176, b=2048): 261.626
Elapsed time for mlp_h_to_4h (4x26176x104704, b=2048): 0.2148
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26176x104704, b=2048): 209.078
Elapsed time for mlp_4h_to_h (4x104704x26176, b=2048): 0.1711
Throughput (in TFLOP/s) for mlp_4h_to_h (4x104704x26176, b=2048): 262.387

Attention duration (in seconds): 0.1907
Attention throughput (in TFLOP/s): 244.631
MLP duration (in seconds): 0.3859
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5766
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26240, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26240x78720, b=2048): 0.1921
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26240x78720, b=2048): 176.208
b: 128, m: 2048, n: 820, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x820x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x820x2048): 180.655
b: 128, m: 2048, n: 2048, k: 820,
Elapsed time for attention_prob_times_values (128x2048x2048x820): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x820): 155.591
Elapsed time for attention_linear_projection (4x26240x26240, b=2048): 0.0432
Throughput (in TFLOP/s) for attention_linear_projection (4x26240x26240, b=2048): 261.351
Elapsed time for mlp_h_to_4h (4x26240x104960, b=2048): 0.2888
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26240x104960, b=2048): 156.223
Elapsed time for mlp_4h_to_h (4x104960x26240, b=2048): 0.1731
Throughput (in TFLOP/s) for mlp_4h_to_h (4x104960x26240, b=2048): 260.709

Attention duration (in seconds): 0.2458
Attention throughput (in TFLOP/s): 190.776
MLP duration (in seconds): 0.4619
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7077
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26304, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26304x78912, b=2048): 0.1311
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26304x78912, b=2048): 259.420
b: 128, m: 2048, n: 822, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x822x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x822x2048): 174.456
b: 128, m: 2048, n: 2048, k: 822,
Elapsed time for attention_prob_times_values (128x2048x2048x822): 0.0068
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x822): 130.617
Elapsed time for attention_linear_projection (4x26304x26304, b=2048): 0.0441
Throughput (in TFLOP/s) for attention_linear_projection (4x26304x26304, b=2048): 257.330
Elapsed time for mlp_h_to_4h (4x26304x105216, b=2048): 0.1740
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26304x105216, b=2048): 260.668
Elapsed time for mlp_4h_to_h (4x105216x26304, b=2048): 0.1728
Throughput (in TFLOP/s) for mlp_4h_to_h (4x105216x26304, b=2048): 262.409

Attention duration (in seconds): 0.1870
Attention throughput (in TFLOP/s): 251.973
MLP duration (in seconds): 0.3468
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5337
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26368, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26368x79104, b=2048): 0.1566
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26368x79104, b=2048): 218.248
b: 128, m: 2048, n: 824, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x824x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x824x2048): 179.378
b: 128, m: 2048, n: 2048, k: 824,
Elapsed time for attention_prob_times_values (128x2048x2048x824): 0.0051
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x824): 173.434
Elapsed time for attention_linear_projection (4x26368x26368, b=2048): 0.0442
Throughput (in TFLOP/s) for attention_linear_projection (4x26368x26368, b=2048): 257.541
Elapsed time for mlp_h_to_4h (4x26368x105472, b=2048): 0.2365
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26368x105472, b=2048): 192.685
Elapsed time for mlp_4h_to_h (4x105472x26368, b=2048): 0.1746
Throughput (in TFLOP/s) for mlp_4h_to_h (4x105472x26368, b=2048): 260.992

Attention duration (in seconds): 0.2108
Attention throughput (in TFLOP/s): 224.497
MLP duration (in seconds): 0.4111
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6219
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26432, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26432x79296, b=2048): 0.1320
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26432x79296, b=2048): 260.198
b: 128, m: 2048, n: 826, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x826x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x826x2048): 174.844
b: 128, m: 2048, n: 2048, k: 826,
Elapsed time for attention_prob_times_values (128x2048x2048x826): 0.0067
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x826): 131.620
Elapsed time for attention_linear_projection (4x26432x26432, b=2048): 0.0444
Throughput (in TFLOP/s) for attention_linear_projection (4x26432x26432, b=2048): 258.087
Elapsed time for mlp_h_to_4h (4x26432x105728, b=2048): 0.2597
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26432x105728, b=2048): 176.328
Elapsed time for mlp_4h_to_h (4x105728x26432, b=2048): 0.1757
Throughput (in TFLOP/s) for mlp_4h_to_h (4x105728x26432, b=2048): 260.589

Attention duration (in seconds): 0.1881
Attention throughput (in TFLOP/s): 252.794
MLP duration (in seconds): 0.4354
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6235
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26496, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26496x79488, b=2048): 0.1849
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26496x79488, b=2048): 186.611
b: 128, m: 2048, n: 828, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x828x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x828x2048): 181.645
b: 128, m: 2048, n: 2048, k: 828,
Elapsed time for attention_prob_times_values (128x2048x2048x828): 0.0057
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x828): 155.874
Elapsed time for attention_linear_projection (4x26496x26496, b=2048): 0.0445
Throughput (in TFLOP/s) for attention_linear_projection (4x26496x26496, b=2048): 258.753
Elapsed time for mlp_h_to_4h (4x26496x105984, b=2048): 0.1783
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26496x105984, b=2048): 257.978
Elapsed time for mlp_4h_to_h (4x105984x26496, b=2048): 0.1765
Throughput (in TFLOP/s) for mlp_4h_to_h (4x105984x26496, b=2048): 260.697

Attention duration (in seconds): 0.2400
Attention throughput (in TFLOP/s): 199.143
MLP duration (in seconds): 0.3548
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5948
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26560, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26560x79680, b=2048): 0.1346
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26560x79680, b=2048): 257.600
b: 128, m: 2048, n: 830, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x830x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x830x2048): 175.600
b: 128, m: 2048, n: 2048, k: 830,
Elapsed time for attention_prob_times_values (128x2048x2048x830): 0.0068
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x830): 131.861
Elapsed time for attention_linear_projection (4x26560x26560, b=2048): 0.0445
Throughput (in TFLOP/s) for attention_linear_projection (4x26560x26560, b=2048): 259.834
Elapsed time for mlp_h_to_4h (4x26560x106240, b=2048): 0.2007
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26560x106240, b=2048): 230.300
Elapsed time for mlp_4h_to_h (4x106240x26560, b=2048): 0.1757
Throughput (in TFLOP/s) for mlp_4h_to_h (4x106240x26560, b=2048): 263.087

Attention duration (in seconds): 0.1909
Attention throughput (in TFLOP/s): 251.490
MLP duration (in seconds): 0.3765
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5674
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26624, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26624x79872, b=2048): 0.2032
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26624x79872, b=2048): 171.439
b: 128, m: 2048, n: 832, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x832x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x832x2048): 182.247
b: 128, m: 2048, n: 2048, k: 832,
Elapsed time for attention_prob_times_values (128x2048x2048x832): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x832): 177.591
Elapsed time for attention_linear_projection (4x26624x26624, b=2048): 0.0443
Throughput (in TFLOP/s) for attention_linear_projection (4x26624x26624, b=2048): 262.211
Elapsed time for mlp_h_to_4h (4x26624x106496, b=2048): 0.3073
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26624x106496, b=2048): 151.184
Elapsed time for mlp_4h_to_h (4x106496x26624, b=2048): 0.1774
Throughput (in TFLOP/s) for mlp_4h_to_h (4x106496x26624, b=2048): 261.807

Attention duration (in seconds): 0.2574
Attention throughput (in TFLOP/s): 187.381
MLP duration (in seconds): 0.4847
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7422
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26688, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26688x80064, b=2048): 0.1460
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26688x80064, b=2048): 239.826
b: 128, m: 2048, n: 834, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x834x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x834x2048): 176.695
b: 128, m: 2048, n: 2048, k: 834,
Elapsed time for attention_prob_times_values (128x2048x2048x834): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x834): 128.560
Elapsed time for attention_linear_projection (4x26688x26688, b=2048): 0.0449
Throughput (in TFLOP/s) for attention_linear_projection (4x26688x26688, b=2048): 260.168
Elapsed time for mlp_h_to_4h (4x26688x106752, b=2048): 0.1778
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26688x106752, b=2048): 262.504
Elapsed time for mlp_4h_to_h (4x106752x26688, b=2048): 0.1788
Throughput (in TFLOP/s) for mlp_4h_to_h (4x106752x26688, b=2048): 261.100

Attention duration (in seconds): 0.2029
Attention throughput (in TFLOP/s): 238.926
MLP duration (in seconds): 0.3566
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5595
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26752, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26752x80256, b=2048): 0.1546
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26752x80256, b=2048): 227.556
b: 128, m: 2048, n: 836, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x836x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x836x2048): 183.623
b: 128, m: 2048, n: 2048, k: 836,
Elapsed time for attention_prob_times_values (128x2048x2048x836): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x836): 151.705
Elapsed time for attention_linear_projection (4x26752x26752, b=2048): 0.0449
Throughput (in TFLOP/s) for attention_linear_projection (4x26752x26752, b=2048): 261.137
Elapsed time for mlp_h_to_4h (4x26752x107008, b=2048): 0.2818
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26752x107008, b=2048): 166.432
Elapsed time for mlp_4h_to_h (4x107008x26752, b=2048): 0.1791
Throughput (in TFLOP/s) for mlp_4h_to_h (4x107008x26752, b=2048): 261.928

Attention duration (in seconds): 0.2103
Attention throughput (in TFLOP/s): 231.570
MLP duration (in seconds): 0.4609
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6712
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26816, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26816x80448, b=2048): 0.2193
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26816x80448, b=2048): 161.151
b: 128, m: 2048, n: 838, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x838x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x838x2048): 176.397
b: 128, m: 2048, n: 2048, k: 838,
Elapsed time for attention_prob_times_values (128x2048x2048x838): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x838): 128.477
Elapsed time for attention_linear_projection (4x26816x26816, b=2048): 0.0457
Throughput (in TFLOP/s) for attention_linear_projection (4x26816x26816, b=2048): 257.838
Elapsed time for mlp_h_to_4h (4x26816x107264, b=2048): 0.2008
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26816x107264, b=2048): 234.669
Elapsed time for mlp_4h_to_h (4x107264x26816, b=2048): 0.2784
Throughput (in TFLOP/s) for mlp_4h_to_h (4x107264x26816, b=2048): 169.254

Attention duration (in seconds): 0.2771
Attention throughput (in TFLOP/s): 176.548
MLP duration (in seconds): 0.4793
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7564
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26880, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26880x80640, b=2048): 0.1360
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26880x80640, b=2048): 261.039
b: 128, m: 2048, n: 840, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x840x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x840x2048): 182.262
b: 128, m: 2048, n: 2048, k: 840,
Elapsed time for attention_prob_times_values (128x2048x2048x840): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x840): 170.583
Elapsed time for attention_linear_projection (4x26880x26880, b=2048): 0.0455
Throughput (in TFLOP/s) for attention_linear_projection (4x26880x26880, b=2048): 260.453
Elapsed time for mlp_h_to_4h (4x26880x107520, b=2048): 0.2907
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26880x107520, b=2048): 162.908
Elapsed time for mlp_4h_to_h (4x107520x26880, b=2048): 0.1817
Throughput (in TFLOP/s) for mlp_4h_to_h (4x107520x26880, b=2048): 260.584

Attention duration (in seconds): 0.1917
Attention throughput (in TFLOP/s): 256.372
MLP duration (in seconds): 0.4724
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6641
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 26944, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26944x80832, b=2048): 0.1367
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26944x80832, b=2048): 261.080
b: 128, m: 2048, n: 842, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x842x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x842x2048): 178.071
b: 128, m: 2048, n: 2048, k: 842,
Elapsed time for attention_prob_times_values (128x2048x2048x842): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x842): 128.749
Elapsed time for attention_linear_projection (4x26944x26944, b=2048): 0.0458
Throughput (in TFLOP/s) for attention_linear_projection (4x26944x26944, b=2048): 259.838
Elapsed time for mlp_h_to_4h (4x26944x107776, b=2048): 0.2517
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26944x107776, b=2048): 189.022
Elapsed time for mlp_4h_to_h (4x107776x26944, b=2048): 0.1825
Throughput (in TFLOP/s) for mlp_4h_to_h (4x107776x26944, b=2048): 260.667

Attention duration (in seconds): 0.1946
Attention throughput (in TFLOP/s): 253.845
MLP duration (in seconds): 0.4342
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6288
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27008, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27008x81024, b=2048): 0.1378
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27008x81024, b=2048): 260.194
b: 128, m: 2048, n: 844, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x844x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x844x2048): 184.678
b: 128, m: 2048, n: 2048, k: 844,
Elapsed time for attention_prob_times_values (128x2048x2048x844): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x844): 152.965
Elapsed time for attention_linear_projection (4x27008x27008, b=2048): 0.0460
Throughput (in TFLOP/s) for attention_linear_projection (4x27008x27008, b=2048): 259.694
Elapsed time for mlp_h_to_4h (4x27008x108032, b=2048): 0.3193
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27008x108032, b=2048): 149.727
Elapsed time for mlp_4h_to_h (4x108032x27008, b=2048): 0.1835
Throughput (in TFLOP/s) for mlp_4h_to_h (4x108032x27008, b=2048): 260.523

Attention duration (in seconds): 0.1946
Attention throughput (in TFLOP/s): 254.908
MLP duration (in seconds): 0.5028
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6974
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27072, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27072x81216, b=2048): 0.1383
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27072x81216, b=2048): 260.456
b: 128, m: 2048, n: 846, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x846x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x846x2048): 178.189
b: 128, m: 2048, n: 2048, k: 846,
Elapsed time for attention_prob_times_values (128x2048x2048x846): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x846): 129.303
Elapsed time for attention_linear_projection (4x27072x27072, b=2048): 0.0461
Throughput (in TFLOP/s) for attention_linear_projection (4x27072x27072, b=2048): 260.221
Elapsed time for mlp_h_to_4h (4x27072x108288, b=2048): 0.3118
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27072x108288, b=2048): 154.049
Elapsed time for mlp_4h_to_h (4x108288x27072, b=2048): 0.1839
Throughput (in TFLOP/s) for mlp_4h_to_h (4x108288x27072, b=2048): 261.196

Attention duration (in seconds): 0.1966
Attention throughput (in TFLOP/s): 253.581
MLP duration (in seconds): 0.4957
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6923
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27136, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27136x81408, b=2048): 0.1532
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27136x81408, b=2048): 236.275
b: 128, m: 2048, n: 848, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x848x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x848x2048): 185.256
b: 128, m: 2048, n: 2048, k: 848,
Elapsed time for attention_prob_times_values (128x2048x2048x848): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x848): 172.886
Elapsed time for attention_linear_projection (4x27136x27136, b=2048): 0.0463
Throughput (in TFLOP/s) for attention_linear_projection (4x27136x27136, b=2048): 260.537
Elapsed time for mlp_h_to_4h (4x27136x108544, b=2048): 0.1828
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27136x108544, b=2048): 264.026
Elapsed time for mlp_4h_to_h (4x108544x27136, b=2048): 0.1852
Throughput (in TFLOP/s) for mlp_4h_to_h (4x108544x27136, b=2048): 260.565

Attention duration (in seconds): 0.2097
Attention throughput (in TFLOP/s): 238.845
MLP duration (in seconds): 0.3680
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5777
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27200, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27200x81600, b=2048): 0.1404
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27200x81600, b=2048): 258.999
b: 128, m: 2048, n: 850, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x850x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x850x2048): 178.998
b: 128, m: 2048, n: 2048, k: 850,
Elapsed time for attention_prob_times_values (128x2048x2048x850): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x850): 129.187
Elapsed time for attention_linear_projection (4x27200x27200, b=2048): 0.0471
Throughput (in TFLOP/s) for attention_linear_projection (4x27200x27200, b=2048): 257.335
Elapsed time for mlp_h_to_4h (4x27200x108800, b=2048): 0.2686
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27200x108800, b=2048): 180.493
Elapsed time for mlp_4h_to_h (4x108800x27200, b=2048): 0.1862
Throughput (in TFLOP/s) for mlp_4h_to_h (4x108800x27200, b=2048): 260.372

Attention duration (in seconds): 0.1997
Attention throughput (in TFLOP/s): 251.971
MLP duration (in seconds): 0.4549
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6545
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27264, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27264x81792, b=2048): 0.1402
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27264x81792, b=2048): 260.579
b: 128, m: 2048, n: 852, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x852x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x852x2048): 185.976
b: 128, m: 2048, n: 2048, k: 852,
Elapsed time for attention_prob_times_values (128x2048x2048x852): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x852): 151.771
Elapsed time for attention_linear_projection (4x27264x27264, b=2048): 0.0471
Throughput (in TFLOP/s) for attention_linear_projection (4x27264x27264, b=2048): 258.439
Elapsed time for mlp_h_to_4h (4x27264x109056, b=2048): 0.3015
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27264x109056, b=2048): 161.567
Elapsed time for mlp_4h_to_h (4x109056x27264, b=2048): 0.1875
Throughput (in TFLOP/s) for mlp_4h_to_h (4x109056x27264, b=2048): 259.763

Attention duration (in seconds): 0.1983
Attention throughput (in TFLOP/s): 254.912
MLP duration (in seconds): 0.4890
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6873
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27328, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27328x81984, b=2048): 0.1409
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27328x81984, b=2048): 260.474
b: 128, m: 2048, n: 854, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x854x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x854x2048): 179.204
b: 128, m: 2048, n: 2048, k: 854,
Elapsed time for attention_prob_times_values (128x2048x2048x854): 0.0072
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x854): 127.998
Elapsed time for attention_linear_projection (4x27328x27328, b=2048): 0.0472
Throughput (in TFLOP/s) for attention_linear_projection (4x27328x27328, b=2048): 259.115
Elapsed time for mlp_h_to_4h (4x27328x109312, b=2048): 0.2929
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27328x109312, b=2048): 167.103
Elapsed time for mlp_4h_to_h (4x109312x27328, b=2048): 0.1879
Throughput (in TFLOP/s) for mlp_4h_to_h (4x109312x27328, b=2048): 260.477

Attention duration (in seconds): 0.2004
Attention throughput (in TFLOP/s): 253.344
MLP duration (in seconds): 0.4808
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6812
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27392, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27392x82176, b=2048): 0.1408
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27392x82176, b=2048): 261.932
b: 128, m: 2048, n: 856, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x856x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x856x2048): 185.733
b: 128, m: 2048, n: 2048, k: 856,
Elapsed time for attention_prob_times_values (128x2048x2048x856): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x856): 173.075
Elapsed time for attention_linear_projection (4x27392x27392, b=2048): 0.0473
Throughput (in TFLOP/s) for attention_linear_projection (4x27392x27392, b=2048): 259.820
Elapsed time for mlp_h_to_4h (4x27392x109568, b=2048): 0.1964
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27392x109568, b=2048): 250.433
Elapsed time for mlp_4h_to_h (4x109568x27392, b=2048): 0.1883
Throughput (in TFLOP/s) for mlp_4h_to_h (4x109568x27392, b=2048): 261.169

Attention duration (in seconds): 0.1984
Attention throughput (in TFLOP/s): 257.149
MLP duration (in seconds): 0.3846
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5830
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27456, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27456x82368, b=2048): 0.1423
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27456x82368, b=2048): 260.399
b: 128, m: 2048, n: 858, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x858x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x858x2048): 179.675
b: 128, m: 2048, n: 2048, k: 858,
Elapsed time for attention_prob_times_values (128x2048x2048x858): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x858): 130.034
Elapsed time for attention_linear_projection (4x27456x27456, b=2048): 0.0478
Throughput (in TFLOP/s) for attention_linear_projection (4x27456x27456, b=2048): 258.539
Elapsed time for mlp_h_to_4h (4x27456x109824, b=2048): 0.2049
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27456x109824, b=2048): 241.113
Elapsed time for mlp_4h_to_h (4x109824x27456, b=2048): 0.1932
Throughput (in TFLOP/s) for mlp_4h_to_h (4x109824x27456, b=2048): 255.698

Attention duration (in seconds): 0.2023
Attention throughput (in TFLOP/s): 253.347
MLP duration (in seconds): 0.3981
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6004
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27520, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27520x82560, b=2048): 0.1523
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27520x82560, b=2048): 244.386
b: 128, m: 2048, n: 860, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x860x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x860x2048): 187.251
b: 128, m: 2048, n: 2048, k: 860,
Elapsed time for attention_prob_times_values (128x2048x2048x860): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x860): 154.547
Elapsed time for attention_linear_projection (4x27520x27520, b=2048): 0.0479
Throughput (in TFLOP/s) for attention_linear_projection (4x27520x27520, b=2048): 258.942
Elapsed time for mlp_h_to_4h (4x27520x110080, b=2048): 0.3051
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27520x110080, b=2048): 162.655
Elapsed time for mlp_4h_to_h (4x110080x27520, b=2048): 0.1911
Throughput (in TFLOP/s) for mlp_4h_to_h (4x110080x27520, b=2048): 259.680

Attention duration (in seconds): 0.2111
Attention throughput (in TFLOP/s): 243.813
MLP duration (in seconds): 0.4963
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7074
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27584, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27584x82752, b=2048): 0.1756
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27584x82752, b=2048): 212.916
b: 128, m: 2048, n: 862, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x862x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x862x2048): 180.765
b: 128, m: 2048, n: 2048, k: 862,
Elapsed time for attention_prob_times_values (128x2048x2048x862): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x862): 131.223
Elapsed time for attention_linear_projection (4x27584x27584, b=2048): 0.0480
Throughput (in TFLOP/s) for attention_linear_projection (4x27584x27584, b=2048): 259.983
Elapsed time for mlp_h_to_4h (4x27584x110336, b=2048): 0.2828
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27584x110336, b=2048): 176.314
Elapsed time for mlp_4h_to_h (4x110336x27584, b=2048): 0.1913
Throughput (in TFLOP/s) for mlp_4h_to_h (4x110336x27584, b=2048): 260.677

Attention duration (in seconds): 0.2358
Attention throughput (in TFLOP/s): 219.346
MLP duration (in seconds): 0.4741
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7099
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27648, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27648x82944, b=2048): 0.1429
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27648x82944, b=2048): 262.867
b: 128, m: 2048, n: 864, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x864x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x864x2048): 187.895
b: 128, m: 2048, n: 2048, k: 864,
Elapsed time for attention_prob_times_values (128x2048x2048x864): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x864): 176.492
Elapsed time for attention_linear_projection (4x27648x27648, b=2048): 0.0480
Throughput (in TFLOP/s) for attention_linear_projection (4x27648x27648, b=2048): 260.693
Elapsed time for mlp_h_to_4h (4x27648x110592, b=2048): 0.3209
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27648x110592, b=2048): 156.117
Elapsed time for mlp_4h_to_h (4x110592x27648, b=2048): 0.1922
Throughput (in TFLOP/s) for mlp_4h_to_h (4x110592x27648, b=2048): 260.589

Attention duration (in seconds): 0.2012
Attention throughput (in TFLOP/s): 258.251
MLP duration (in seconds): 0.5131
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7143
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27712, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27712x83136, b=2048): 0.1464
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27712x83136, b=2048): 257.910
b: 128, m: 2048, n: 866, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x866x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x866x2048): 182.129
b: 128, m: 2048, n: 2048, k: 866,
Elapsed time for attention_prob_times_values (128x2048x2048x866): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x866): 131.052
Elapsed time for attention_linear_projection (4x27712x27712, b=2048): 0.0484
Throughput (in TFLOP/s) for attention_linear_projection (4x27712x27712, b=2048): 259.767
Elapsed time for mlp_h_to_4h (4x27712x110848, b=2048): 0.1924
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27712x110848, b=2048): 261.617
Elapsed time for mlp_4h_to_h (4x110848x27712, b=2048): 0.1926
Throughput (in TFLOP/s) for mlp_4h_to_h (4x110848x27712, b=2048): 261.256

Attention duration (in seconds): 0.2070
Attention throughput (in TFLOP/s): 252.127
MLP duration (in seconds): 0.3850
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5920
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27776, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27776x83328, b=2048): 0.1866
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27776x83328, b=2048): 203.190
b: 128, m: 2048, n: 868, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x868x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x868x2048): 189.423
b: 128, m: 2048, n: 2048, k: 868,
Elapsed time for attention_prob_times_values (128x2048x2048x868): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x868): 156.178
Elapsed time for attention_linear_projection (4x27776x27776, b=2048): 0.0489
Throughput (in TFLOP/s) for attention_linear_projection (4x27776x27776, b=2048): 258.234
Elapsed time for mlp_h_to_4h (4x27776x111104, b=2048): 0.2538
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27776x111104, b=2048): 199.182
Elapsed time for mlp_4h_to_h (4x111104x27776, b=2048): 0.1937
Throughput (in TFLOP/s) for mlp_4h_to_h (4x111104x27776, b=2048): 260.969

Attention duration (in seconds): 0.2465
Attention throughput (in TFLOP/s): 212.708
MLP duration (in seconds): 0.4476
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6941
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27840, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27840x83520, b=2048): 0.1518
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27840x83520, b=2048): 250.936
b: 128, m: 2048, n: 870, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x870x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x870x2048): 182.061
b: 128, m: 2048, n: 2048, k: 870,
Elapsed time for attention_prob_times_values (128x2048x2048x870): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x870): 131.199
Elapsed time for attention_linear_projection (4x27840x27840, b=2048): 0.0489
Throughput (in TFLOP/s) for attention_linear_projection (4x27840x27840, b=2048): 259.462
Elapsed time for mlp_h_to_4h (4x27840x111360, b=2048): 0.1936
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27840x111360, b=2048): 262.415
Elapsed time for mlp_4h_to_h (4x111360x27840, b=2048): 0.1930
Throughput (in TFLOP/s) for mlp_4h_to_h (4x111360x27840, b=2048): 263.193

Attention duration (in seconds): 0.2130
Attention throughput (in TFLOP/s): 247.234
MLP duration (in seconds): 0.3866
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5996
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27904, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27904x83712, b=2048): 0.1457
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27904x83712, b=2048): 262.711
b: 128, m: 2048, n: 872, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x872x2048): 0.0050
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x872x2048): 188.932
b: 128, m: 2048, n: 2048, k: 872,
Elapsed time for attention_prob_times_values (128x2048x2048x872): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x872): 175.334
Elapsed time for attention_linear_projection (4x27904x27904, b=2048): 0.0489
Throughput (in TFLOP/s) for attention_linear_projection (4x27904x27904, b=2048): 260.781
Elapsed time for mlp_h_to_4h (4x27904x111616, b=2048): 0.2898
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27904x111616, b=2048): 176.074
Elapsed time for mlp_4h_to_h (4x111616x27904, b=2048): 0.1955
Throughput (in TFLOP/s) for mlp_4h_to_h (4x111616x27904, b=2048): 261.053

Attention duration (in seconds): 0.2049
Attention throughput (in TFLOP/s): 258.189
MLP duration (in seconds): 0.4853
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6902
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 27968, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27968x83904, b=2048): 0.1764
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27968x83904, b=2048): 217.951
b: 128, m: 2048, n: 874, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x874x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x874x2048): 183.341
b: 128, m: 2048, n: 2048, k: 874,
Elapsed time for attention_prob_times_values (128x2048x2048x874): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x874): 133.334
Elapsed time for attention_linear_projection (4x27968x27968, b=2048): 0.0497
Throughput (in TFLOP/s) for attention_linear_projection (4x27968x27968, b=2048): 257.709
Elapsed time for mlp_h_to_4h (4x27968x111872, b=2048): 0.3311
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27968x111872, b=2048): 154.828
Elapsed time for mlp_4h_to_h (4x111872x27968, b=2048): 0.3172
Throughput (in TFLOP/s) for mlp_4h_to_h (4x111872x27968, b=2048): 161.634

Attention duration (in seconds): 0.2383
Attention throughput (in TFLOP/s): 223.006
MLP duration (in seconds): 0.6483
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8865
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28032, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28032x84096, b=2048): 0.1699
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28032x84096, b=2048): 227.294
b: 128, m: 2048, n: 876, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x876x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x876x2048): 191.680
b: 128, m: 2048, n: 2048, k: 876,
Elapsed time for attention_prob_times_values (128x2048x2048x876): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x876): 156.231
Elapsed time for attention_linear_projection (4x28032x28032, b=2048): 0.0494
Throughput (in TFLOP/s) for attention_linear_projection (4x28032x28032, b=2048): 260.773
Elapsed time for mlp_h_to_4h (4x28032x112128, b=2048): 0.3318
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28032x112128, b=2048): 155.222
Elapsed time for mlp_4h_to_h (4x112128x28032, b=2048): 0.3246
Throughput (in TFLOP/s) for mlp_4h_to_h (4x112128x28032, b=2048): 158.669

Attention duration (in seconds): 0.2302
Attention throughput (in TFLOP/s): 231.856
MLP duration (in seconds): 0.6563
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8866
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28096, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28096x84288, b=2048): 0.2491
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28096x84288, b=2048): 155.784
b: 128, m: 2048, n: 878, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x878x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x878x2048): 184.808
b: 128, m: 2048, n: 2048, k: 878,
Elapsed time for attention_prob_times_values (128x2048x2048x878): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x878): 133.173
Elapsed time for attention_linear_projection (4x28096x28096, b=2048): 0.0507
Throughput (in TFLOP/s) for attention_linear_projection (4x28096x28096, b=2048): 254.951
Elapsed time for mlp_h_to_4h (4x28096x112384, b=2048): 0.3535
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28096x112384, b=2048): 146.364
Elapsed time for mlp_4h_to_h (4x112384x28096, b=2048): 0.1983
Throughput (in TFLOP/s) for mlp_4h_to_h (4x112384x28096, b=2048): 260.873

Attention duration (in seconds): 0.3120
Attention throughput (in TFLOP/s): 171.871
MLP duration (in seconds): 0.5518
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8637
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28160, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28160x84480, b=2048): 0.1989
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28160x84480, b=2048): 195.924
b: 128, m: 2048, n: 880, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x880x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x880x2048): 192.088
b: 128, m: 2048, n: 2048, k: 880,
Elapsed time for attention_prob_times_values (128x2048x2048x880): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x880): 177.537
Elapsed time for attention_linear_projection (4x28160x28160, b=2048): 0.0496
Throughput (in TFLOP/s) for attention_linear_projection (4x28160x28160, b=2048): 262.166
Elapsed time for mlp_h_to_4h (4x28160x112640, b=2048): 0.3223
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28160x112640, b=2048): 161.263
Elapsed time for mlp_4h_to_h (4x112640x28160, b=2048): 0.1956
Throughput (in TFLOP/s) for mlp_4h_to_h (4x112640x28160, b=2048): 265.724

Attention duration (in seconds): 0.2587
Attention throughput (in TFLOP/s): 208.161
MLP duration (in seconds): 0.5178
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7766
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28224, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28224x84672, b=2048): 0.2315
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28224x84672, b=2048): 169.147
b: 128, m: 2048, n: 882, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x882x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x882x2048): 186.750
b: 128, m: 2048, n: 2048, k: 882,
Elapsed time for attention_prob_times_values (128x2048x2048x882): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x882): 135.209
Elapsed time for attention_linear_projection (4x28224x28224, b=2048): 0.0497
Throughput (in TFLOP/s) for attention_linear_projection (4x28224x28224, b=2048): 262.348
Elapsed time for mlp_h_to_4h (4x28224x112896, b=2048): 0.3510
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28224x112896, b=2048): 148.739
Elapsed time for mlp_4h_to_h (4x112896x28224, b=2048): 0.1959
Throughput (in TFLOP/s) for mlp_4h_to_h (4x112896x28224, b=2048): 266.530

Attention duration (in seconds): 0.2933
Attention throughput (in TFLOP/s): 184.449
MLP duration (in seconds): 0.5469
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8402
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28288, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28288x84864, b=2048): 0.2284
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28288x84864, b=2048): 172.183
b: 128, m: 2048, n: 884, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x884x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x884x2048): 195.175
b: 128, m: 2048, n: 2048, k: 884,
Elapsed time for attention_prob_times_values (128x2048x2048x884): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x884): 159.432
Elapsed time for attention_linear_projection (4x28288x28288, b=2048): 0.0497
Throughput (in TFLOP/s) for attention_linear_projection (4x28288x28288, b=2048): 263.666
Elapsed time for mlp_h_to_4h (4x28288x113152, b=2048): 0.3089
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28288x113152, b=2048): 169.773
Elapsed time for mlp_4h_to_h (4x113152x28288, b=2048): 0.1977
Throughput (in TFLOP/s) for mlp_4h_to_h (4x113152x28288, b=2048): 265.319

Attention duration (in seconds): 0.2890
Attention throughput (in TFLOP/s): 188.049
MLP duration (in seconds): 0.5066
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7955
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28352, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28352x85056, b=2048): 0.1887
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28352x85056, b=2048): 209.328
b: 128, m: 2048, n: 886, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x886x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x886x2048): 187.641
b: 128, m: 2048, n: 2048, k: 886,
Elapsed time for attention_prob_times_values (128x2048x2048x886): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x886): 136.431
Elapsed time for attention_linear_projection (4x28352x28352, b=2048): 0.0505
Throughput (in TFLOP/s) for attention_linear_projection (4x28352x28352, b=2048): 260.968
Elapsed time for mlp_h_to_4h (4x28352x113408, b=2048): 0.3020
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28352x113408, b=2048): 174.436
Elapsed time for mlp_4h_to_h (4x113408x28352, b=2048): 0.3198
Throughput (in TFLOP/s) for mlp_4h_to_h (4x113408x28352, b=2048): 164.734

Attention duration (in seconds): 0.2513
Attention throughput (in TFLOP/s): 217.239
MLP duration (in seconds): 0.6218
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8730
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28416, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28416x85248, b=2048): 0.2521
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28416x85248, b=2048): 157.449
b: 128, m: 2048, n: 888, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x888x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x888x2048): 192.881
b: 128, m: 2048, n: 2048, k: 888,
Elapsed time for attention_prob_times_values (128x2048x2048x888): 0.0054
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x888): 177.797
Elapsed time for attention_linear_projection (4x28416x28416, b=2048): 0.0508
Throughput (in TFLOP/s) for attention_linear_projection (4x28416x28416, b=2048): 260.629
Elapsed time for mlp_h_to_4h (4x28416x113664, b=2048): 0.3528
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28416x113664, b=2048): 150.009
Elapsed time for mlp_4h_to_h (4x113664x28416, b=2048): 0.3307
Throughput (in TFLOP/s) for mlp_4h_to_h (4x113664x28416, b=2048): 160.018

Attention duration (in seconds): 0.3131
Attention throughput (in TFLOP/s): 175.083
MLP duration (in seconds): 0.6835
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9966
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28480, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28480x85440, b=2048): 0.1799
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28480x85440, b=2048): 221.662
b: 128, m: 2048, n: 890, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x890x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x890x2048): 187.870
b: 128, m: 2048, n: 2048, k: 890,
Elapsed time for attention_prob_times_values (128x2048x2048x890): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x890): 137.000
Elapsed time for attention_linear_projection (4x28480x28480, b=2048): 0.0510
Throughput (in TFLOP/s) for attention_linear_projection (4x28480x28480, b=2048): 260.439
Elapsed time for mlp_h_to_4h (4x28480x113920, b=2048): 0.3280
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28480x113920, b=2048): 162.068
Elapsed time for mlp_4h_to_h (4x113920x28480, b=2048): 0.3321
Throughput (in TFLOP/s) for mlp_4h_to_h (4x113920x28480, b=2048): 160.075

Attention duration (in seconds): 0.2429
Attention throughput (in TFLOP/s): 226.668
MLP duration (in seconds): 0.6601
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9030
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28544, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28544x85632, b=2048): 0.1625
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28544x85632, b=2048): 246.469
b: 128, m: 2048, n: 892, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x892x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x892x2048): 195.713
b: 128, m: 2048, n: 2048, k: 892,
Elapsed time for attention_prob_times_values (128x2048x2048x892): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x892): 162.148
Elapsed time for attention_linear_projection (4x28544x28544, b=2048): 0.0512
Throughput (in TFLOP/s) for attention_linear_projection (4x28544x28544, b=2048): 260.570
Elapsed time for mlp_h_to_4h (4x28544x114176, b=2048): 0.3735
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28544x114176, b=2048): 142.956
Elapsed time for mlp_4h_to_h (4x114176x28544, b=2048): 0.2808
Throughput (in TFLOP/s) for mlp_4h_to_h (4x114176x28544, b=2048): 190.150

Attention duration (in seconds): 0.2245
Attention throughput (in TFLOP/s): 246.362
MLP duration (in seconds): 0.6543
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8788
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28608, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28608x85824, b=2048): 0.2071
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28608x85824, b=2048): 194.207
b: 128, m: 2048, n: 894, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x894x2048): 0.0051
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x894x2048): 188.255
b: 128, m: 2048, n: 2048, k: 894,
Elapsed time for attention_prob_times_values (128x2048x2048x894): 0.0069
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x894): 138.869
Elapsed time for attention_linear_projection (4x28608x28608, b=2048): 0.0517
Throughput (in TFLOP/s) for attention_linear_projection (4x28608x28608, b=2048): 259.213
Elapsed time for mlp_h_to_4h (4x28608x114432, b=2048): 0.3648
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28608x114432, b=2048): 147.019
Elapsed time for mlp_4h_to_h (4x114432x28608, b=2048): 0.2021
Throughput (in TFLOP/s) for mlp_4h_to_h (4x114432x28608, b=2048): 265.398

Attention duration (in seconds): 0.2709
Attention throughput (in TFLOP/s): 205.097
MLP duration (in seconds): 0.5669
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8378
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28672, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28672x86016, b=2048): 0.2228
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28672x86016, b=2048): 181.332
b: 128, m: 2048, n: 896, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x896x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x896x2048): 195.847
b: 128, m: 2048, n: 2048, k: 896,
Elapsed time for attention_prob_times_values (128x2048x2048x896): 0.0053
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x896): 182.079
Elapsed time for attention_linear_projection (4x28672x28672, b=2048): 0.0512
Throughput (in TFLOP/s) for attention_linear_projection (4x28672x28672, b=2048): 263.087
Elapsed time for mlp_h_to_4h (4x28672x114688, b=2048): 0.3455
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28672x114688, b=2048): 155.921
Elapsed time for mlp_4h_to_h (4x114688x28672, b=2048): 0.2040
Throughput (in TFLOP/s) for mlp_4h_to_h (4x114688x28672, b=2048): 264.147

Attention duration (in seconds): 0.2842
Attention throughput (in TFLOP/s): 196.322
MLP duration (in seconds): 0.5495
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8337
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28736, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28736x86208, b=2048): 0.2096
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28736x86208, b=2048): 193.668
b: 128, m: 2048, n: 898, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x898x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x898x2048): 173.371
b: 128, m: 2048, n: 2048, k: 898,
Elapsed time for attention_prob_times_values (128x2048x2048x898): 0.0072
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x898): 134.615
Elapsed time for attention_linear_projection (4x28736x28736, b=2048): 0.0520
Throughput (in TFLOP/s) for attention_linear_projection (4x28736x28736, b=2048): 259.954
Elapsed time for mlp_h_to_4h (4x28736x114944, b=2048): 0.3591
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28736x114944, b=2048): 150.704
Elapsed time for mlp_4h_to_h (4x114944x28736, b=2048): 0.3112
Throughput (in TFLOP/s) for mlp_4h_to_h (4x114944x28736, b=2048): 173.909

Attention duration (in seconds): 0.2743
Attention throughput (in TFLOP/s): 204.290
MLP duration (in seconds): 0.6703
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9446
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28800, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28800x86400, b=2048): 0.1961
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28800x86400, b=2048): 207.916
b: 128, m: 2048, n: 900, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x900x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x900x2048): 179.609
b: 128, m: 2048, n: 2048, k: 900,
Elapsed time for attention_prob_times_values (128x2048x2048x900): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x900): 157.358
Elapsed time for attention_linear_projection (4x28800x28800, b=2048): 0.0519
Throughput (in TFLOP/s) for attention_linear_projection (4x28800x28800, b=2048): 261.624
Elapsed time for mlp_h_to_4h (4x28800x115200, b=2048): 0.3250
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28800x115200, b=2048): 167.259
Elapsed time for mlp_4h_to_h (4x115200x28800, b=2048): 0.2065
Throughput (in TFLOP/s) for mlp_4h_to_h (4x115200x28800, b=2048): 263.175

Attention duration (in seconds): 0.2595
Attention throughput (in TFLOP/s): 216.881
MLP duration (in seconds): 0.5315
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7911
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28864, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28864x86592, b=2048): 0.1963
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28864x86592, b=2048): 208.578
b: 128, m: 2048, n: 902, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x902x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x902x2048): 174.069
b: 128, m: 2048, n: 2048, k: 902,
Elapsed time for attention_prob_times_values (128x2048x2048x902): 0.0072
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x902): 134.316
Elapsed time for attention_linear_projection (4x28864x28864, b=2048): 0.0521
Throughput (in TFLOP/s) for attention_linear_projection (4x28864x28864, b=2048): 262.222
Elapsed time for mlp_h_to_4h (4x28864x115456, b=2048): 0.3196
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28864x115456, b=2048): 170.840
Elapsed time for mlp_4h_to_h (4x115456x28864, b=2048): 0.2049
Throughput (in TFLOP/s) for mlp_4h_to_h (4x115456x28864, b=2048): 266.528

Attention duration (in seconds): 0.2612
Attention throughput (in TFLOP/s): 216.485
MLP duration (in seconds): 0.5245
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7856
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28928, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28928x86784, b=2048): 0.2574
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28928x86784, b=2048): 159.800
b: 128, m: 2048, n: 904, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x904x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x904x2048): 177.869
b: 128, m: 2048, n: 2048, k: 904,
Elapsed time for attention_prob_times_values (128x2048x2048x904): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x904): 175.319
Elapsed time for attention_linear_projection (4x28928x28928, b=2048): 0.0518
Throughput (in TFLOP/s) for attention_linear_projection (4x28928x28928, b=2048): 264.835
Elapsed time for mlp_h_to_4h (4x28928x115712, b=2048): 0.3707
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28928x115712, b=2048): 147.925
Elapsed time for mlp_4h_to_h (4x115712x28928, b=2048): 0.2065
Throughput (in TFLOP/s) for mlp_4h_to_h (4x115712x28928, b=2048): 265.520

Attention duration (in seconds): 0.3202
Attention throughput (in TFLOP/s): 177.361
MLP duration (in seconds): 0.5773
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8975
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 28992, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28992x86976, b=2048): 0.1997
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28992x86976, b=2048): 206.836
b: 128, m: 2048, n: 906, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x906x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x906x2048): 174.244
b: 128, m: 2048, n: 2048, k: 906,
Elapsed time for attention_prob_times_values (128x2048x2048x906): 0.0073
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x906): 133.081
Elapsed time for attention_linear_projection (4x28992x28992, b=2048): 0.0524
Throughput (in TFLOP/s) for attention_linear_projection (4x28992x28992, b=2048): 262.667
Elapsed time for mlp_h_to_4h (4x28992x115968, b=2048): 0.3629
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28992x115968, b=2048): 151.795
Elapsed time for mlp_4h_to_h (4x115968x28992, b=2048): 0.3496
Throughput (in TFLOP/s) for mlp_4h_to_h (4x115968x28992, b=2048): 157.558

Attention duration (in seconds): 0.2651
Attention throughput (in TFLOP/s): 215.159
MLP duration (in seconds): 0.7125
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9776
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29056, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29056x87168, b=2048): 0.1970
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29056x87168, b=2048): 210.599
b: 128, m: 2048, n: 908, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x908x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x908x2048): 180.909
b: 128, m: 2048, n: 2048, k: 908,
Elapsed time for attention_prob_times_values (128x2048x2048x908): 0.0062
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x908): 157.407
Elapsed time for attention_linear_projection (4x29056x29056, b=2048): 0.0531
Throughput (in TFLOP/s) for attention_linear_projection (4x29056x29056, b=2048): 260.418
Elapsed time for mlp_h_to_4h (4x29056x116224, b=2048): 0.3206
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29056x116224, b=2048): 172.600
Elapsed time for mlp_4h_to_h (4x116224x29056, b=2048): 0.2096
Throughput (in TFLOP/s) for mlp_4h_to_h (4x116224x29056, b=2048): 264.003

Attention duration (in seconds): 0.2617
Attention throughput (in TFLOP/s): 218.839
MLP duration (in seconds): 0.5301
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7919
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29120, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29120x87360, b=2048): 0.1830
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29120x87360, b=2048): 227.752
b: 128, m: 2048, n: 910, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x910x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x910x2048): 175.245
b: 128, m: 2048, n: 2048, k: 910,
Elapsed time for attention_prob_times_values (128x2048x2048x910): 0.0073
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x910): 133.882
Elapsed time for attention_linear_projection (4x29120x29120, b=2048): 0.0530
Throughput (in TFLOP/s) for attention_linear_projection (4x29120x29120, b=2048): 262.004
Elapsed time for mlp_h_to_4h (4x29120x116480, b=2048): 0.3061
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29120x116480, b=2048): 181.577
Elapsed time for mlp_4h_to_h (4x116480x29120, b=2048): 0.2113
Throughput (in TFLOP/s) for mlp_4h_to_h (4x116480x29120, b=2048): 263.003

Attention duration (in seconds): 0.2489
Attention throughput (in TFLOP/s): 231.121
MLP duration (in seconds): 0.5174
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7663
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29184, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29184x87552, b=2048): 0.1821
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29184x87552, b=2048): 229.907
b: 128, m: 2048, n: 912, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x912x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x912x2048): 179.884
b: 128, m: 2048, n: 2048, k: 912,
Elapsed time for attention_prob_times_values (128x2048x2048x912): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x912): 177.237
Elapsed time for attention_linear_projection (4x29184x29184, b=2048): 0.0532
Throughput (in TFLOP/s) for attention_linear_projection (4x29184x29184, b=2048): 262.509
Elapsed time for mlp_h_to_4h (4x29184x116736, b=2048): 0.3573
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29184x116736, b=2048): 156.219
Elapsed time for mlp_4h_to_h (4x116736x29184, b=2048): 0.2095
Throughput (in TFLOP/s) for mlp_4h_to_h (4x116736x29184, b=2048): 266.393

Attention duration (in seconds): 0.2462
Attention throughput (in TFLOP/s): 234.658
MLP duration (in seconds): 0.5668
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8130
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29248, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29248x87744, b=2048): 0.1819
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29248x87744, b=2048): 231.100
b: 128, m: 2048, n: 914, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x914x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x914x2048): 175.079
b: 128, m: 2048, n: 2048, k: 914,
Elapsed time for attention_prob_times_values (128x2048x2048x914): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x914): 132.561
Elapsed time for attention_linear_projection (4x29248x29248, b=2048): 0.0530
Throughput (in TFLOP/s) for attention_linear_projection (4x29248x29248, b=2048): 264.384
Elapsed time for mlp_h_to_4h (4x29248x116992, b=2048): 0.3592
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29248x116992, b=2048): 156.078
Elapsed time for mlp_4h_to_h (4x116992x29248, b=2048): 0.2139
Throughput (in TFLOP/s) for mlp_4h_to_h (4x116992x29248, b=2048): 262.110

Attention duration (in seconds): 0.2480
Attention throughput (in TFLOP/s): 234.008
MLP duration (in seconds): 0.5731
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8210
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29312, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29312x87936, b=2048): 0.2190
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29312x87936, b=2048): 192.852
b: 128, m: 2048, n: 916, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x916x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x916x2048): 182.124
b: 128, m: 2048, n: 2048, k: 916,
Elapsed time for attention_prob_times_values (128x2048x2048x916): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x916): 157.041
Elapsed time for attention_linear_projection (4x29312x29312, b=2048): 0.0531
Throughput (in TFLOP/s) for attention_linear_projection (4x29312x29312, b=2048): 265.169
Elapsed time for mlp_h_to_4h (4x29312x117248, b=2048): 0.3655
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29312x117248, b=2048): 154.053
Elapsed time for mlp_4h_to_h (4x117248x29312, b=2048): 0.2129
Throughput (in TFLOP/s) for mlp_4h_to_h (4x117248x29312, b=2048): 264.444

Attention duration (in seconds): 0.2837
Attention throughput (in TFLOP/s): 205.388
MLP duration (in seconds): 0.5784
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8622
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29376, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29376x88128, b=2048): 0.1614
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29376x88128, b=2048): 262.769
b: 128, m: 2048, n: 918, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x918x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x918x2048): 175.726
b: 128, m: 2048, n: 2048, k: 918,
Elapsed time for attention_prob_times_values (128x2048x2048x918): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x918): 133.472
Elapsed time for attention_linear_projection (4x29376x29376, b=2048): 0.0542
Throughput (in TFLOP/s) for attention_linear_projection (4x29376x29376, b=2048): 260.979
Elapsed time for mlp_h_to_4h (4x29376x117504, b=2048): 0.3290
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29376x117504, b=2048): 171.894
Elapsed time for mlp_4h_to_h (4x117504x29376, b=2048): 0.2169
Throughput (in TFLOP/s) for mlp_4h_to_h (4x117504x29376, b=2048): 260.728

Attention duration (in seconds): 0.2286
Attention throughput (in TFLOP/s): 256.031
MLP duration (in seconds): 0.5459
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7745
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29440, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29440x88320, b=2048): 0.1823
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29440x88320, b=2048): 233.623
b: 128, m: 2048, n: 920, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x920x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x920x2048): 181.128
b: 128, m: 2048, n: 2048, k: 920,
Elapsed time for attention_prob_times_values (128x2048x2048x920): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x920): 177.512
Elapsed time for attention_linear_projection (4x29440x29440, b=2048): 0.0549
Throughput (in TFLOP/s) for attention_linear_projection (4x29440x29440, b=2048): 258.788
Elapsed time for mlp_h_to_4h (4x29440x117760, b=2048): 0.3630
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29440x117760, b=2048): 156.456
Elapsed time for mlp_4h_to_h (4x117760x29440, b=2048): 0.2175
Throughput (in TFLOP/s) for mlp_4h_to_h (4x117760x29440, b=2048): 261.178

Attention duration (in seconds): 0.2482
Attention throughput (in TFLOP/s): 236.775
MLP duration (in seconds): 0.5805
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8288
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29504, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29504x88512, b=2048): 0.1637
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29504x88512, b=2048): 261.316
b: 128, m: 2048, n: 922, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x922x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x922x2048): 175.663
b: 128, m: 2048, n: 2048, k: 922,
Elapsed time for attention_prob_times_values (128x2048x2048x922): 0.0075
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x922): 132.109
Elapsed time for attention_linear_projection (4x29504x29504, b=2048): 0.0549
Throughput (in TFLOP/s) for attention_linear_projection (4x29504x29504, b=2048): 259.607
Elapsed time for mlp_h_to_4h (4x29504x118016, b=2048): 0.3620
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29504x118016, b=2048): 157.601
Elapsed time for mlp_4h_to_h (4x118016x29504, b=2048): 0.2175
Throughput (in TFLOP/s) for mlp_4h_to_h (4x118016x29504, b=2048): 262.327

Attention duration (in seconds): 0.2318
Attention throughput (in TFLOP/s): 254.652
MLP duration (in seconds): 0.5794
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8112
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29568, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29568x88704, b=2048): 0.2484
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29568x88704, b=2048): 173.002
b: 128, m: 2048, n: 924, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x924x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x924x2048): 184.211
b: 128, m: 2048, n: 2048, k: 924,
Elapsed time for attention_prob_times_values (128x2048x2048x924): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x924): 156.972
Elapsed time for attention_linear_projection (4x29568x29568, b=2048): 0.0549
Throughput (in TFLOP/s) for attention_linear_projection (4x29568x29568, b=2048): 260.924
Elapsed time for mlp_h_to_4h (4x29568x118272, b=2048): 0.3468
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29568x118272, b=2048): 165.209
Elapsed time for mlp_4h_to_h (4x118272x29568, b=2048): 0.2185
Throughput (in TFLOP/s) for mlp_4h_to_h (4x118272x29568, b=2048): 262.263

Attention duration (in seconds): 0.3150
Attention throughput (in TFLOP/s): 188.195
MLP duration (in seconds): 0.5653
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8803
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29632, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29632x88896, b=2048): 0.1636
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29632x88896, b=2048): 263.808
b: 128, m: 2048, n: 926, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x926x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x926x2048): 177.552
b: 128, m: 2048, n: 2048, k: 926,
Elapsed time for attention_prob_times_values (128x2048x2048x926): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x926): 133.823
Elapsed time for attention_linear_projection (4x29632x29632, b=2048): 0.0548
Throughput (in TFLOP/s) for attention_linear_projection (4x29632x29632, b=2048): 262.289
Elapsed time for mlp_h_to_4h (4x29632x118528, b=2048): 0.3333
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29632x118528, b=2048): 172.625
Elapsed time for mlp_4h_to_h (4x118528x29632, b=2048): 0.2168
Throughput (in TFLOP/s) for mlp_4h_to_h (4x118528x29632, b=2048): 265.390

Attention duration (in seconds): 0.2315
Attention throughput (in TFLOP/s): 257.189
MLP duration (in seconds): 0.5502
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7817
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29696, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29696x89088, b=2048): 0.2696
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29696x89088, b=2048): 160.764
b: 128, m: 2048, n: 928, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x928x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x928x2048): 183.264
b: 128, m: 2048, n: 2048, k: 928,
Elapsed time for attention_prob_times_values (128x2048x2048x928): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x928): 180.908
Elapsed time for attention_linear_projection (4x29696x29696, b=2048): 0.0550
Throughput (in TFLOP/s) for attention_linear_projection (4x29696x29696, b=2048): 262.863
Elapsed time for mlp_h_to_4h (4x29696x118784, b=2048): 0.3425
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29696x118784, b=2048): 168.728
Elapsed time for mlp_4h_to_h (4x118784x29696, b=2048): 0.2204
Throughput (in TFLOP/s) for mlp_4h_to_h (4x118784x29696, b=2048): 262.184

Attention duration (in seconds): 0.3355
Attention throughput (in TFLOP/s): 178.184
MLP duration (in seconds): 0.5630
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8985
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29760, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29760x89280, b=2048): 0.1991
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29760x89280, b=2048): 218.690
b: 128, m: 2048, n: 930, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x930x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x930x2048): 177.623
b: 128, m: 2048, n: 2048, k: 930,
Elapsed time for attention_prob_times_values (128x2048x2048x930): 0.0075
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x930): 133.238
Elapsed time for attention_linear_projection (4x29760x29760, b=2048): 0.0557
Throughput (in TFLOP/s) for attention_linear_projection (4x29760x29760, b=2048): 260.603
Elapsed time for mlp_h_to_4h (4x29760x119040, b=2048): 0.4165
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29760x119040, b=2048): 139.349
Elapsed time for mlp_4h_to_h (4x119040x29760, b=2048): 0.2223
Throughput (in TFLOP/s) for mlp_4h_to_h (4x119040x29760, b=2048): 261.121

Attention duration (in seconds): 0.2679
Attention throughput (in TFLOP/s): 224.150
MLP duration (in seconds): 0.6388
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9067
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29824, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29824x89472, b=2048): 0.2144
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29824x89472, b=2048): 203.935
b: 128, m: 2048, n: 932, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x932x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x932x2048): 184.054
b: 128, m: 2048, n: 2048, k: 932,
Elapsed time for attention_prob_times_values (128x2048x2048x932): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x932): 157.647
Elapsed time for attention_linear_projection (4x29824x29824, b=2048): 0.0560
Throughput (in TFLOP/s) for attention_linear_projection (4x29824x29824, b=2048): 260.291
Elapsed time for mlp_h_to_4h (4x29824x119296, b=2048): 0.3710
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29824x119296, b=2048): 157.131
Elapsed time for mlp_4h_to_h (4x119296x29824, b=2048): 0.3388
Throughput (in TFLOP/s) for mlp_4h_to_h (4x119296x29824, b=2048): 172.039

Attention duration (in seconds): 0.2822
Attention throughput (in TFLOP/s): 213.693
MLP duration (in seconds): 0.7098
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9920
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29888, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29888x89664, b=2048): 0.1686
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29888x89664, b=2048): 260.348
b: 128, m: 2048, n: 934, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x934x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x934x2048): 178.175
b: 128, m: 2048, n: 2048, k: 934,
Elapsed time for attention_prob_times_values (128x2048x2048x934): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x934): 134.690
Elapsed time for attention_linear_projection (4x29888x29888, b=2048): 0.0564
Throughput (in TFLOP/s) for attention_linear_projection (4x29888x29888, b=2048): 259.606
Elapsed time for mlp_h_to_4h (4x29888x119552, b=2048): 0.4016
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29888x119552, b=2048): 145.761
Elapsed time for mlp_4h_to_h (4x119552x29888, b=2048): 0.2241
Throughput (in TFLOP/s) for mlp_4h_to_h (4x119552x29888, b=2048): 261.180

Attention duration (in seconds): 0.2381
Attention throughput (in TFLOP/s): 254.300
MLP duration (in seconds): 0.6258
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8639
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 29952, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29952x89856, b=2048): 0.2279
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29952x89856, b=2048): 193.443
b: 128, m: 2048, n: 936, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x936x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x936x2048): 184.424
b: 128, m: 2048, n: 2048, k: 936,
Elapsed time for attention_prob_times_values (128x2048x2048x936): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x936): 180.044
Elapsed time for attention_linear_projection (4x29952x29952, b=2048): 0.0564
Throughput (in TFLOP/s) for attention_linear_projection (4x29952x29952, b=2048): 260.390
Elapsed time for mlp_h_to_4h (4x29952x119808, b=2048): 0.3548
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29952x119808, b=2048): 165.722
Elapsed time for mlp_4h_to_h (4x119808x29952, b=2048): 0.2247
Throughput (in TFLOP/s) for mlp_4h_to_h (4x119808x29952, b=2048): 261.711

Attention duration (in seconds): 0.2954
Attention throughput (in TFLOP/s): 205.815
MLP duration (in seconds): 0.5794
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8749
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30016, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30016x90048, b=2048): 0.1688
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30016x90048, b=2048): 262.291
b: 128, m: 2048, n: 938, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x938x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x938x2048): 178.560
b: 128, m: 2048, n: 2048, k: 938,
Elapsed time for attention_prob_times_values (128x2048x2048x938): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x938): 135.202
Elapsed time for attention_linear_projection (4x30016x30016, b=2048): 0.0567
Throughput (in TFLOP/s) for attention_linear_projection (4x30016x30016, b=2048): 260.562
Elapsed time for mlp_h_to_4h (4x30016x120064, b=2048): 0.3550
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30016x120064, b=2048): 166.338
Elapsed time for mlp_4h_to_h (4x120064x30016, b=2048): 0.2236
Throughput (in TFLOP/s) for mlp_4h_to_h (4x120064x30016, b=2048): 264.049

Attention duration (in seconds): 0.2386
Attention throughput (in TFLOP/s): 255.932
MLP duration (in seconds): 0.5786
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8172
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30080, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30080x90240, b=2048): 0.2747
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30080x90240, b=2048): 161.908
b: 128, m: 2048, n: 940, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x940x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x940x2048): 187.235
b: 128, m: 2048, n: 2048, k: 940,
Elapsed time for attention_prob_times_values (128x2048x2048x940): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x940): 158.738
Elapsed time for attention_linear_projection (4x30080x30080, b=2048): 0.0566
Throughput (in TFLOP/s) for attention_linear_projection (4x30080x30080, b=2048): 262.115
Elapsed time for mlp_h_to_4h (4x30080x120320, b=2048): 0.4013
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30080x120320, b=2048): 147.761
Elapsed time for mlp_4h_to_h (4x120320x30080, b=2048): 0.2242
Throughput (in TFLOP/s) for mlp_4h_to_h (4x120320x30080, b=2048): 264.440

Attention duration (in seconds): 0.3430
Attention throughput (in TFLOP/s): 178.771
MLP duration (in seconds): 0.6255
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9685
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30144, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30144x90432, b=2048): 0.2894
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30144x90432, b=2048): 154.346
b: 128, m: 2048, n: 942, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x942x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x942x2048): 180.366
b: 128, m: 2048, n: 2048, k: 942,
Elapsed time for attention_prob_times_values (128x2048x2048x942): 0.0075
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x942): 134.585
Elapsed time for attention_linear_projection (4x30144x30144, b=2048): 0.0569
Throughput (in TFLOP/s) for attention_linear_projection (4x30144x30144, b=2048): 261.642
Elapsed time for mlp_h_to_4h (4x30144x120576, b=2048): 0.3995
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30144x120576, b=2048): 149.075
Elapsed time for mlp_4h_to_h (4x120576x30144, b=2048): 0.2267
Throughput (in TFLOP/s) for mlp_4h_to_h (4x120576x30144, b=2048): 262.679

Attention duration (in seconds): 0.3594
Attention throughput (in TFLOP/s): 171.326
MLP duration (in seconds): 0.6262
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9856
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30208, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30208x90624, b=2048): 0.1705
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30208x90624, b=2048): 263.060
b: 128, m: 2048, n: 944, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x944x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x944x2048): 186.620
b: 128, m: 2048, n: 2048, k: 944,
Elapsed time for attention_prob_times_values (128x2048x2048x944): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x944): 182.644
Elapsed time for attention_linear_projection (4x30208x30208, b=2048): 0.0573
Throughput (in TFLOP/s) for attention_linear_projection (4x30208x30208, b=2048): 260.859
Elapsed time for mlp_h_to_4h (4x30208x120832, b=2048): 0.3989
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30208x120832, b=2048): 149.931
Elapsed time for mlp_4h_to_h (4x120832x30208, b=2048): 0.2281
Throughput (in TFLOP/s) for mlp_4h_to_h (4x120832x30208, b=2048): 262.151

Attention duration (in seconds): 0.2388
Attention throughput (in TFLOP/s): 258.924
MLP duration (in seconds): 0.6270
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8658
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30272, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30272x90816, b=2048): 0.1709
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30272x90816, b=2048): 263.613
b: 128, m: 2048, n: 946, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x946x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x946x2048): 180.863
b: 128, m: 2048, n: 2048, k: 946,
Elapsed time for attention_prob_times_values (128x2048x2048x946): 0.0075
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x946): 135.105
Elapsed time for attention_linear_projection (4x30272x30272, b=2048): 0.0573
Throughput (in TFLOP/s) for attention_linear_projection (4x30272x30272, b=2048): 261.889
Elapsed time for mlp_h_to_4h (4x30272x121088, b=2048): 0.3958
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30272x121088, b=2048): 151.729
Elapsed time for mlp_4h_to_h (4x121088x30272, b=2048): 0.2256
Throughput (in TFLOP/s) for mlp_4h_to_h (4x121088x30272, b=2048): 266.208

Attention duration (in seconds): 0.2413
Attention throughput (in TFLOP/s): 257.274
MLP duration (in seconds): 0.6214
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8627
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30336, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30336x91008, b=2048): 0.2447
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30336x91008, b=2048): 184.880
b: 128, m: 2048, n: 948, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x948x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x948x2048): 188.412
b: 128, m: 2048, n: 2048, k: 948,
Elapsed time for attention_prob_times_values (128x2048x2048x948): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x948): 160.035
Elapsed time for attention_linear_projection (4x30336x30336, b=2048): 0.0573
Throughput (in TFLOP/s) for attention_linear_projection (4x30336x30336, b=2048): 263.030
Elapsed time for mlp_h_to_4h (4x30336x121344, b=2048): 0.3925
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30336x121344, b=2048): 153.677
Elapsed time for mlp_4h_to_h (4x121344x30336, b=2048): 0.2291
Throughput (in TFLOP/s) for mlp_4h_to_h (4x121344x30336, b=2048): 263.214

Attention duration (in seconds): 0.3137
Attention throughput (in TFLOP/s): 198.715
MLP duration (in seconds): 0.6216
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9353
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30400, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30400x91200, b=2048): 0.1970
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30400x91200, b=2048): 230.599
b: 128, m: 2048, n: 950, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x950x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x950x2048): 181.412
b: 128, m: 2048, n: 2048, k: 950,
Elapsed time for attention_prob_times_values (128x2048x2048x950): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x950): 134.187
Elapsed time for attention_linear_projection (4x30400x30400, b=2048): 0.0582
Throughput (in TFLOP/s) for attention_linear_projection (4x30400x30400, b=2048): 260.298
Elapsed time for mlp_h_to_4h (4x30400x121600, b=2048): 0.3981
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30400x121600, b=2048): 152.125
Elapsed time for mlp_4h_to_h (4x121600x30400, b=2048): 0.2280
Throughput (in TFLOP/s) for mlp_4h_to_h (4x121600x30400, b=2048): 265.672

Attention duration (in seconds): 0.2684
Attention throughput (in TFLOP/s): 233.275
MLP duration (in seconds): 0.6261
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8945
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30464, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30464x91392, b=2048): 0.2203
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30464x91392, b=2048): 207.050
b: 128, m: 2048, n: 952, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x952x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x952x2048): 186.873
b: 128, m: 2048, n: 2048, k: 952,
Elapsed time for attention_prob_times_values (128x2048x2048x952): 0.0056
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x952): 181.732
Elapsed time for attention_linear_projection (4x30464x30464, b=2048): 0.0577
Throughput (in TFLOP/s) for attention_linear_projection (4x30464x30464, b=2048): 263.397
Elapsed time for mlp_h_to_4h (4x30464x121856, b=2048): 0.4277
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30464x121856, b=2048): 142.221
Elapsed time for mlp_4h_to_h (4x121856x30464, b=2048): 0.2309
Throughput (in TFLOP/s) for mlp_4h_to_h (4x121856x30464, b=2048): 263.437

Attention duration (in seconds): 0.2891
Attention throughput (in TFLOP/s): 217.426
MLP duration (in seconds): 0.6585
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9477
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30528, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30528x91584, b=2048): 0.2017
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30528x91584, b=2048): 227.074
b: 128, m: 2048, n: 954, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x954x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x954x2048): 181.921
b: 128, m: 2048, n: 2048, k: 954,
Elapsed time for attention_prob_times_values (128x2048x2048x954): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x954): 135.299
Elapsed time for attention_linear_projection (4x30528x30528, b=2048): 0.0581
Throughput (in TFLOP/s) for attention_linear_projection (4x30528x30528, b=2048): 262.782
Elapsed time for mlp_h_to_4h (4x30528x122112, b=2048): 0.4123
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30528x122112, b=2048): 148.153
Elapsed time for mlp_4h_to_h (4x122112x30528, b=2048): 0.2300
Throughput (in TFLOP/s) for mlp_4h_to_h (4x122112x30528, b=2048): 265.525

Attention duration (in seconds): 0.2730
Attention throughput (in TFLOP/s): 231.197
MLP duration (in seconds): 0.6423
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9153
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30592, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30592x91776, b=2048): 0.2088
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30592x91776, b=2048): 220.349
b: 128, m: 2048, n: 956, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x956x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x956x2048): 189.667
b: 128, m: 2048, n: 2048, k: 956,
Elapsed time for attention_prob_times_values (128x2048x2048x956): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x956): 160.963
Elapsed time for attention_linear_projection (4x30592x30592, b=2048): 0.0580
Throughput (in TFLOP/s) for attention_linear_projection (4x30592x30592, b=2048): 264.489
Elapsed time for mlp_h_to_4h (4x30592x122368, b=2048): 0.4041
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30592x122368, b=2048): 151.765
Elapsed time for mlp_4h_to_h (4x122368x30592, b=2048): 0.2328
Throughput (in TFLOP/s) for mlp_4h_to_h (4x122368x30592, b=2048): 263.447

Attention duration (in seconds): 0.2785
Attention throughput (in TFLOP/s): 227.581
MLP duration (in seconds): 0.6369
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9155
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30656, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30656x91968, b=2048): 0.2163
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30656x91968, b=2048): 213.580
b: 128, m: 2048, n: 958, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x958x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x958x2048): 182.414
b: 128, m: 2048, n: 2048, k: 958,
Elapsed time for attention_prob_times_values (128x2048x2048x958): 0.0077
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x958): 133.686
Elapsed time for attention_linear_projection (4x30656x30656, b=2048): 0.0593
Throughput (in TFLOP/s) for attention_linear_projection (4x30656x30656, b=2048): 259.665
Elapsed time for mlp_h_to_4h (4x30656x122624, b=2048): 0.4135
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30656x122624, b=2048): 148.952
Elapsed time for mlp_4h_to_h (4x122624x30656, b=2048): 0.2358
Throughput (in TFLOP/s) for mlp_4h_to_h (4x122624x30656, b=2048): 261.243

Attention duration (in seconds): 0.2889
Attention throughput (in TFLOP/s): 220.303
MLP duration (in seconds): 0.6492
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9382
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30720, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30720x92160, b=2048): 0.2137
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30720x92160, b=2048): 217.051
b: 128, m: 2048, n: 960, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x960x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x960x2048): 189.967
b: 128, m: 2048, n: 2048, k: 960,
Elapsed time for attention_prob_times_values (128x2048x2048x960): 0.0055
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x960): 186.645
Elapsed time for attention_linear_projection (4x30720x30720, b=2048): 0.0588
Throughput (in TFLOP/s) for attention_linear_projection (4x30720x30720, b=2048): 262.760
Elapsed time for mlp_h_to_4h (4x30720x122880, b=2048): 0.4245
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30720x122880, b=2048): 145.684
Elapsed time for mlp_4h_to_h (4x122880x30720, b=2048): 0.2339
Throughput (in TFLOP/s) for mlp_4h_to_h (4x122880x30720, b=2048): 264.458

Attention duration (in seconds): 0.2835
Attention throughput (in TFLOP/s): 225.428
MLP duration (in seconds): 0.6584
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9419
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30784, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30784x92352, b=2048): 0.2374
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30784x92352, b=2048): 196.201
b: 128, m: 2048, n: 962, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x962x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x962x2048): 182.966
b: 128, m: 2048, n: 2048, k: 962,
Elapsed time for attention_prob_times_values (128x2048x2048x962): 0.0078
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x962): 131.971
Elapsed time for attention_linear_projection (4x30784x30784, b=2048): 0.0597
Throughput (in TFLOP/s) for attention_linear_projection (4x30784x30784, b=2048): 260.260
Elapsed time for mlp_h_to_4h (4x30784x123136, b=2048): 0.3955
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30784x123136, b=2048): 157.043
Elapsed time for mlp_4h_to_h (4x123136x30784, b=2048): 0.2387
Throughput (in TFLOP/s) for mlp_4h_to_h (4x123136x30784, b=2048): 260.236

Attention duration (in seconds): 0.3105
Attention throughput (in TFLOP/s): 206.648
MLP duration (in seconds): 0.6341
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9447
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30848, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30848x92544, b=2048): 0.2247
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30848x92544, b=2048): 208.153
b: 128, m: 2048, n: 964, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x964x2048): 0.0054
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x964x2048): 191.414
b: 128, m: 2048, n: 2048, k: 964,
Elapsed time for attention_prob_times_values (128x2048x2048x964): 0.0065
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x964): 159.361
Elapsed time for attention_linear_projection (4x30848x30848, b=2048): 0.0600
Throughput (in TFLOP/s) for attention_linear_projection (4x30848x30848, b=2048): 259.785
Elapsed time for mlp_h_to_4h (4x30848x123392, b=2048): 0.4020
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30848x123392, b=2048): 155.144
Elapsed time for mlp_4h_to_h (4x123392x30848, b=2048): 0.2391
Throughput (in TFLOP/s) for mlp_4h_to_h (4x123392x30848, b=2048): 260.834

Attention duration (in seconds): 0.2966
Attention throughput (in TFLOP/s): 217.226
MLP duration (in seconds): 0.6411
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9377
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30912, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30912x92736, b=2048): 0.2440
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30912x92736, b=2048): 192.482
b: 128, m: 2048, n: 966, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x966x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x966x2048): 183.055
b: 128, m: 2048, n: 2048, k: 966,
Elapsed time for attention_prob_times_values (128x2048x2048x966): 0.0079
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x966): 131.989
Elapsed time for attention_linear_projection (4x30912x30912, b=2048): 0.0600
Throughput (in TFLOP/s) for attention_linear_projection (4x30912x30912, b=2048): 260.929
Elapsed time for mlp_h_to_4h (4x30912x123648, b=2048): 0.4240
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30912x123648, b=2048): 147.699
Elapsed time for mlp_4h_to_h (4x123648x30912, b=2048): 0.3874
Throughput (in TFLOP/s) for mlp_4h_to_h (4x123648x30912, b=2048): 161.641

Attention duration (in seconds): 0.3175
Attention throughput (in TFLOP/s): 203.750
MLP duration (in seconds): 0.8114
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 1.1289
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 32, hidden_size: 30976, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30976x92928, b=2048): 0.2322
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30976x92928, b=2048): 203.118
b: 128, m: 2048, n: 968, k: 2048,
Elapsed time for attention_key_query_prob (128x2048x968x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (128x2048x968x2048): 189.675
b: 128, m: 2048, n: 2048, k: 968,
Elapsed time for attention_prob_times_values (128x2048x2048x968): 0.0058
Throughput (in TFLOP/s) for attention_prob_times_values (128x2048x2048x968): 179.071
Elapsed time for attention_linear_projection (4x30976x30976, b=2048): 0.0600
Throughput (in TFLOP/s) for attention_linear_projection (4x30976x30976, b=2048): 262.001
Elapsed time for mlp_h_to_4h (4x30976x123904, b=2048): 0.4371
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30976x123904, b=2048): 143.851
Elapsed time for mlp_4h_to_h (4x123904x30976, b=2048): 0.2389
Throughput (in TFLOP/s) for mlp_4h_to_h (4x123904x30976, b=2048): 263.205

Attention duration (in seconds): 0.3035
Attention throughput (in TFLOP/s): 214.057
MLP duration (in seconds): 0.6760
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9795
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 8192, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8192x24576, b=2048): 0.0131
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8192x24576, b=2048): 252.037
b: 512, m: 2048, n: 64, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x64x2048): 0.0046
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x64x2048): 59.845
b: 512, m: 2048, n: 2048, k: 64,
Elapsed time for attention_prob_times_values (512x2048x2048x64): 0.0050
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x64): 55.037
Elapsed time for attention_linear_projection (4x8192x8192, b=2048): 0.0052
Throughput (in TFLOP/s) for attention_linear_projection (4x8192x8192, b=2048): 212.755
Elapsed time for mlp_h_to_4h (4x8192x32768, b=2048): 0.0171
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8192x32768, b=2048): 257.080
Elapsed time for mlp_4h_to_h (4x32768x8192, b=2048): 0.0174
Throughput (in TFLOP/s) for mlp_4h_to_h (4x32768x8192, b=2048): 253.188

Attention duration (in seconds): 0.0278
Attention throughput (in TFLOP/s): 177.704
MLP duration (in seconds): 0.0345
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0623
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 8448, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8448x25344, b=2048): 0.0141
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8448x25344, b=2048): 249.124
b: 512, m: 2048, n: 66, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x66x2048): 0.0046
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x66x2048): 61.047
b: 512, m: 2048, n: 2048, k: 66,
Elapsed time for attention_prob_times_values (512x2048x2048x66): 0.0079
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x66): 35.744
Elapsed time for attention_linear_projection (4x8448x8448, b=2048): 0.0055
Throughput (in TFLOP/s) for attention_linear_projection (4x8448x8448, b=2048): 213.421
Elapsed time for mlp_h_to_4h (4x8448x33792, b=2048): 0.0185
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8448x33792, b=2048): 253.494
Elapsed time for mlp_4h_to_h (4x33792x8448, b=2048): 0.0185
Throughput (in TFLOP/s) for mlp_4h_to_h (4x33792x8448, b=2048): 252.159

Attention duration (in seconds): 0.0321
Attention throughput (in TFLOP/s): 163.197
MLP duration (in seconds): 0.0370
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0691
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 8704, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8704x26112, b=2048): 0.0151
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8704x26112, b=2048): 246.053
b: 512, m: 2048, n: 68, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x68x2048): 0.0046
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x68x2048): 63.072
b: 512, m: 2048, n: 2048, k: 68,
Elapsed time for attention_prob_times_values (512x2048x2048x68): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x68): 49.084
Elapsed time for attention_linear_projection (4x8704x8704, b=2048): 0.0058
Throughput (in TFLOP/s) for attention_linear_projection (4x8704x8704, b=2048): 212.887
Elapsed time for mlp_h_to_4h (4x8704x34816, b=2048): 0.0198
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8704x34816, b=2048): 250.200
Elapsed time for mlp_4h_to_h (4x34816x8704, b=2048): 0.0197
Throughput (in TFLOP/s) for mlp_4h_to_h (4x34816x8704, b=2048): 252.313

Attention duration (in seconds): 0.0315
Attention throughput (in TFLOP/s): 175.910
MLP duration (in seconds): 0.0395
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0711
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 8960, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x8960x26880, b=2048): 0.0156
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x8960x26880, b=2048): 252.976
b: 512, m: 2048, n: 70, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x70x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x70x2048): 63.836
b: 512, m: 2048, n: 2048, k: 70,
Elapsed time for attention_prob_times_values (512x2048x2048x70): 0.0080
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x70): 37.522
Elapsed time for attention_linear_projection (4x8960x8960, b=2048): 0.0060
Throughput (in TFLOP/s) for attention_linear_projection (4x8960x8960, b=2048): 219.867
Elapsed time for mlp_h_to_4h (4x8960x35840, b=2048): 0.0205
Throughput (in TFLOP/s) for mlp_h_to_4h (4x8960x35840, b=2048): 256.989
Elapsed time for mlp_4h_to_h (4x35840x8960, b=2048): 0.0207
Throughput (in TFLOP/s) for mlp_4h_to_h (4x35840x8960, b=2048): 254.420

Attention duration (in seconds): 0.0343
Attention throughput (in TFLOP/s): 170.907
MLP duration (in seconds): 0.0412
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0755
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 9216, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9216x27648, b=2048): 0.0167
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9216x27648, b=2048): 250.136
b: 512, m: 2048, n: 72, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x72x2048): 0.0046
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x72x2048): 66.837
b: 512, m: 2048, n: 2048, k: 72,
Elapsed time for attention_prob_times_values (512x2048x2048x72): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x72): 52.593
Elapsed time for attention_linear_projection (4x9216x9216, b=2048): 0.0064
Throughput (in TFLOP/s) for attention_linear_projection (4x9216x9216, b=2048): 219.028
Elapsed time for mlp_h_to_4h (4x9216x36864, b=2048): 0.0219
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9216x36864, b=2048): 253.746
Elapsed time for mlp_4h_to_h (4x36864x9216, b=2048): 0.0219
Throughput (in TFLOP/s) for mlp_4h_to_h (4x36864x9216, b=2048): 253.983

Attention duration (in seconds): 0.0335
Attention throughput (in TFLOP/s): 184.346
MLP duration (in seconds): 0.0439
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0774
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 9472, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9472x28416, b=2048): 0.0172
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9472x28416, b=2048): 256.604
b: 512, m: 2048, n: 74, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x74x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x74x2048): 66.650
b: 512, m: 2048, n: 2048, k: 74,
Elapsed time for attention_prob_times_values (512x2048x2048x74): 0.0082
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x74): 38.997
Elapsed time for attention_linear_projection (4x9472x9472, b=2048): 0.0065
Throughput (in TFLOP/s) for attention_linear_projection (4x9472x9472, b=2048): 225.675
Elapsed time for mlp_h_to_4h (4x9472x37888, b=2048): 0.0228
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9472x37888, b=2048): 258.360
Elapsed time for mlp_4h_to_h (4x37888x9472, b=2048): 0.0230
Throughput (in TFLOP/s) for mlp_4h_to_h (4x37888x9472, b=2048): 255.591

Attention duration (in seconds): 0.0366
Attention throughput (in TFLOP/s): 177.932
MLP duration (in seconds): 0.0458
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0824
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 9728, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9728x29184, b=2048): 0.0183
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9728x29184, b=2048): 253.569
b: 512, m: 2048, n: 76, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x76x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x76x2048): 69.877
b: 512, m: 2048, n: 2048, k: 76,
Elapsed time for attention_prob_times_values (512x2048x2048x76): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x76): 53.489
Elapsed time for attention_linear_projection (4x9728x9728, b=2048): 0.0069
Throughput (in TFLOP/s) for attention_linear_projection (4x9728x9728, b=2048): 225.359
Elapsed time for mlp_h_to_4h (4x9728x38912, b=2048): 0.0242
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9728x38912, b=2048): 256.790
Elapsed time for mlp_4h_to_h (4x38912x9728, b=2048): 0.0243
Throughput (in TFLOP/s) for mlp_4h_to_h (4x38912x9728, b=2048): 255.007

Attention duration (in seconds): 0.0360
Attention throughput (in TFLOP/s): 190.421
MLP duration (in seconds): 0.0485
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0845
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 9984, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x9984x29952, b=2048): 0.0195
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x9984x29952, b=2048): 251.504
b: 512, m: 2048, n: 78, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x78x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x78x2048): 70.232
b: 512, m: 2048, n: 2048, k: 78,
Elapsed time for attention_prob_times_values (512x2048x2048x78): 0.0083
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x78): 40.482
Elapsed time for attention_linear_projection (4x9984x9984, b=2048): 0.0073
Throughput (in TFLOP/s) for attention_linear_projection (4x9984x9984, b=2048): 224.119
Elapsed time for mlp_h_to_4h (4x9984x39936, b=2048): 0.0258
Throughput (in TFLOP/s) for mlp_h_to_4h (4x9984x39936, b=2048): 253.381
Elapsed time for mlp_4h_to_h (4x39936x9984, b=2048): 0.0255
Throughput (in TFLOP/s) for mlp_4h_to_h (4x39936x9984, b=2048): 256.124

Attention duration (in seconds): 0.0398
Attention throughput (in TFLOP/s): 180.911
MLP duration (in seconds): 0.0513
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0911
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 10240, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10240x30720, b=2048): 0.0201
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10240x30720, b=2048): 256.332
b: 512, m: 2048, n: 80, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x80x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x80x2048): 73.811
b: 512, m: 2048, n: 2048, k: 80,
Elapsed time for attention_prob_times_values (512x2048x2048x80): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x80): 58.179
Elapsed time for attention_linear_projection (4x10240x10240, b=2048): 0.0075
Throughput (in TFLOP/s) for attention_linear_projection (4x10240x10240, b=2048): 230.496
Elapsed time for mlp_h_to_4h (4x10240x40960, b=2048): 0.0265
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10240x40960, b=2048): 259.321
Elapsed time for mlp_4h_to_h (4x40960x10240, b=2048): 0.0267
Throughput (in TFLOP/s) for mlp_4h_to_h (4x40960x10240, b=2048): 257.579

Attention duration (in seconds): 0.0381
Attention throughput (in TFLOP/s): 198.293
MLP duration (in seconds): 0.0532
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0913
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 10496, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10496x31488, b=2048): 0.0213
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10496x31488, b=2048): 253.918
b: 512, m: 2048, n: 82, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x82x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x82x2048): 73.543
b: 512, m: 2048, n: 2048, k: 82,
Elapsed time for attention_prob_times_values (512x2048x2048x82): 0.0083
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x82): 42.309
Elapsed time for attention_linear_projection (4x10496x10496, b=2048): 0.0079
Throughput (in TFLOP/s) for attention_linear_projection (4x10496x10496, b=2048): 228.939
Elapsed time for mlp_h_to_4h (4x10496x41984, b=2048): 0.0283
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10496x41984, b=2048): 255.098
Elapsed time for mlp_4h_to_h (4x41984x10496, b=2048): 0.0282
Throughput (in TFLOP/s) for mlp_4h_to_h (4x41984x10496, b=2048): 256.197

Attention duration (in seconds): 0.0423
Attention throughput (in TFLOP/s): 187.235
MLP duration (in seconds): 0.0565
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0988
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 10752, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x10752x32256, b=2048): 0.0220
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x10752x32256, b=2048): 258.182
b: 512, m: 2048, n: 84, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x84x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x84x2048): 76.556
b: 512, m: 2048, n: 2048, k: 84,
Elapsed time for attention_prob_times_values (512x2048x2048x84): 0.0062
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x84): 58.160
Elapsed time for attention_linear_projection (4x10752x10752, b=2048): 0.0080
Throughput (in TFLOP/s) for attention_linear_projection (4x10752x10752, b=2048): 235.451
Elapsed time for mlp_h_to_4h (4x10752x43008, b=2048): 0.0291
Throughput (in TFLOP/s) for mlp_h_to_4h (4x10752x43008, b=2048): 260.787
Elapsed time for mlp_4h_to_h (4x43008x10752, b=2048): 0.0295
Throughput (in TFLOP/s) for mlp_4h_to_h (4x43008x10752, b=2048): 256.474

Attention duration (in seconds): 0.0410
Attention throughput (in TFLOP/s): 202.541
MLP duration (in seconds): 0.0586
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.0996
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 11008, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11008x33024, b=2048): 0.0233
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11008x33024, b=2048): 255.879
b: 512, m: 2048, n: 86, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x86x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x86x2048): 77.350
b: 512, m: 2048, n: 2048, k: 86,
Elapsed time for attention_prob_times_values (512x2048x2048x86): 0.0084
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x86): 44.094
Elapsed time for attention_linear_projection (4x11008x11008, b=2048): 0.0085
Throughput (in TFLOP/s) for attention_linear_projection (4x11008x11008, b=2048): 233.659
Elapsed time for mlp_h_to_4h (4x11008x44032, b=2048): 0.0309
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11008x44032, b=2048): 257.100
Elapsed time for mlp_4h_to_h (4x44032x11008, b=2048): 0.0309
Throughput (in TFLOP/s) for mlp_4h_to_h (4x44032x11008, b=2048): 257.078

Attention duration (in seconds): 0.0449
Attention throughput (in TFLOP/s): 193.211
MLP duration (in seconds): 0.0618
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1067
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 11264, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11264x33792, b=2048): 0.0246
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11264x33792, b=2048): 253.424
b: 512, m: 2048, n: 88, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x88x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x88x2048): 80.389
b: 512, m: 2048, n: 2048, k: 88,
Elapsed time for attention_prob_times_values (512x2048x2048x88): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x88): 60.041
Elapsed time for attention_linear_projection (4x11264x11264, b=2048): 0.0089
Throughput (in TFLOP/s) for attention_linear_projection (4x11264x11264, b=2048): 232.531
Elapsed time for mlp_h_to_4h (4x11264x45056, b=2048): 0.0326
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11264x45056, b=2048): 254.710
Elapsed time for mlp_4h_to_h (4x45056x11264, b=2048): 0.0324
Throughput (in TFLOP/s) for mlp_4h_to_h (4x45056x11264, b=2048): 256.825

Attention duration (in seconds): 0.0445
Attention throughput (in TFLOP/s): 203.639
MLP duration (in seconds): 0.0650
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1096
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 11520, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11520x34560, b=2048): 0.0252
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11520x34560, b=2048): 258.353
b: 512, m: 2048, n: 90, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x90x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x90x2048): 80.577
b: 512, m: 2048, n: 2048, k: 90,
Elapsed time for attention_prob_times_values (512x2048x2048x90): 0.0084
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x90): 46.069
Elapsed time for attention_linear_projection (4x11520x11520, b=2048): 0.0091
Throughput (in TFLOP/s) for attention_linear_projection (4x11520x11520, b=2048): 237.767
Elapsed time for mlp_h_to_4h (4x11520x46080, b=2048): 0.0336
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11520x46080, b=2048): 259.217
Elapsed time for mlp_4h_to_h (4x46080x11520, b=2048): 0.0337
Throughput (in TFLOP/s) for mlp_4h_to_h (4x46080x11520, b=2048): 257.902

Attention duration (in seconds): 0.0476
Attention throughput (in TFLOP/s): 199.038
MLP duration (in seconds): 0.0673
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1149
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 11776, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x11776x35328, b=2048): 0.0267
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x11776x35328, b=2048): 254.957
b: 512, m: 2048, n: 92, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x92x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x92x2048): 84.073
b: 512, m: 2048, n: 2048, k: 92,
Elapsed time for attention_prob_times_values (512x2048x2048x92): 0.0062
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x92): 63.230
Elapsed time for attention_linear_projection (4x11776x11776, b=2048): 0.0096
Throughput (in TFLOP/s) for attention_linear_projection (4x11776x11776, b=2048): 236.215
Elapsed time for mlp_h_to_4h (4x11776x47104, b=2048): 0.0356
Throughput (in TFLOP/s) for mlp_h_to_4h (4x11776x47104, b=2048): 255.272
Elapsed time for mlp_4h_to_h (4x47104x11776, b=2048): 0.0353
Throughput (in TFLOP/s) for mlp_4h_to_h (4x47104x11776, b=2048): 257.118

Attention duration (in seconds): 0.0473
Attention throughput (in TFLOP/s): 208.837
MLP duration (in seconds): 0.0709
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1183
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 12032, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12032x36096, b=2048): 0.0276
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12032x36096, b=2048): 258.214
b: 512, m: 2048, n: 94, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x94x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x94x2048): 83.867
b: 512, m: 2048, n: 2048, k: 94,
Elapsed time for attention_prob_times_values (512x2048x2048x94): 0.0084
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x94): 47.802
Elapsed time for attention_linear_projection (4x12032x12032, b=2048): 0.0098
Throughput (in TFLOP/s) for attention_linear_projection (4x12032x12032, b=2048): 241.908
Elapsed time for mlp_h_to_4h (4x12032x48128, b=2048): 0.0366
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12032x48128, b=2048): 259.116
Elapsed time for mlp_4h_to_h (4x48128x12032, b=2048): 0.0367
Throughput (in TFLOP/s) for mlp_4h_to_h (4x48128x12032, b=2048): 258.593

Attention duration (in seconds): 0.0506
Attention throughput (in TFLOP/s): 203.371
MLP duration (in seconds): 0.0733
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1239
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 12288, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12288x36864, b=2048): 0.0289
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12288x36864, b=2048): 256.435
b: 512, m: 2048, n: 96, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x96x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x96x2048): 87.524
b: 512, m: 2048, n: 2048, k: 96,
Elapsed time for attention_prob_times_values (512x2048x2048x96): 0.0059
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x96): 69.453
Elapsed time for attention_linear_projection (4x12288x12288, b=2048): 0.0103
Throughput (in TFLOP/s) for attention_linear_projection (4x12288x12288, b=2048): 240.264
Elapsed time for mlp_h_to_4h (4x12288x49152, b=2048): 0.0382
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12288x49152, b=2048): 258.748
Elapsed time for mlp_4h_to_h (4x49152x12288, b=2048): 0.0381
Throughput (in TFLOP/s) for mlp_4h_to_h (4x49152x12288, b=2048): 259.781

Attention duration (in seconds): 0.0499
Attention throughput (in TFLOP/s): 214.895
MLP duration (in seconds): 0.0763
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1262
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 12544, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12544x37632, b=2048): 0.0303
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12544x37632, b=2048): 255.165
b: 512, m: 2048, n: 98, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x98x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x98x2048): 87.026
b: 512, m: 2048, n: 2048, k: 98,
Elapsed time for attention_prob_times_values (512x2048x2048x98): 0.0085
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x98): 49.663
Elapsed time for attention_linear_projection (4x12544x12544, b=2048): 0.0108
Throughput (in TFLOP/s) for attention_linear_projection (4x12544x12544, b=2048): 238.922
Elapsed time for mlp_h_to_4h (4x12544x50176, b=2048): 0.0404
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12544x50176, b=2048): 255.408
Elapsed time for mlp_4h_to_h (4x50176x12544, b=2048): 0.0398
Throughput (in TFLOP/s) for mlp_4h_to_h (4x50176x12544, b=2048): 259.069

Attention duration (in seconds): 0.0544
Attention throughput (in TFLOP/s): 204.990
MLP duration (in seconds): 0.0802
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1346
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 12800, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x12800x38400, b=2048): 0.0311
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x12800x38400, b=2048): 258.697
b: 512, m: 2048, n: 100, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x100x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x100x2048): 90.634
b: 512, m: 2048, n: 2048, k: 100,
Elapsed time for attention_prob_times_values (512x2048x2048x100): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x100): 68.322
Elapsed time for attention_linear_projection (4x12800x12800, b=2048): 0.0110
Throughput (in TFLOP/s) for attention_linear_projection (4x12800x12800, b=2048): 243.881
Elapsed time for mlp_h_to_4h (4x12800x51200, b=2048): 0.0412
Throughput (in TFLOP/s) for mlp_h_to_4h (4x12800x51200, b=2048): 260.932
Elapsed time for mlp_4h_to_h (4x51200x12800, b=2048): 0.0409
Throughput (in TFLOP/s) for mlp_4h_to_h (4x51200x12800, b=2048): 262.545

Attention duration (in seconds): 0.0532
Attention throughput (in TFLOP/s): 218.136
MLP duration (in seconds): 0.0820
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1352
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 13056, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13056x39168, b=2048): 0.0326
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13056x39168, b=2048): 257.146
b: 512, m: 2048, n: 102, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x102x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x102x2048): 90.824
b: 512, m: 2048, n: 2048, k: 102,
Elapsed time for attention_prob_times_values (512x2048x2048x102): 0.0085
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x102): 51.526
Elapsed time for attention_linear_projection (4x13056x13056, b=2048): 0.0115
Throughput (in TFLOP/s) for attention_linear_projection (4x13056x13056, b=2048): 242.413
Elapsed time for mlp_h_to_4h (4x13056x52224, b=2048): 0.0431
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13056x52224, b=2048): 258.964
Elapsed time for mlp_4h_to_h (4x52224x13056, b=2048): 0.0425
Throughput (in TFLOP/s) for mlp_4h_to_h (4x52224x13056, b=2048): 262.809

Attention duration (in seconds): 0.0574
Attention throughput (in TFLOP/s): 209.779
MLP duration (in seconds): 0.0856
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1431
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 13312, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13312x39936, b=2048): 0.0333
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13312x39936, b=2048): 261.175
b: 512, m: 2048, n: 104, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x104x2048): 0.0047
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x104x2048): 94.473
b: 512, m: 2048, n: 2048, k: 104,
Elapsed time for attention_prob_times_values (512x2048x2048x104): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x104): 70.247
Elapsed time for attention_linear_projection (4x13312x13312, b=2048): 0.0117
Throughput (in TFLOP/s) for attention_linear_projection (4x13312x13312, b=2048): 247.641
Elapsed time for mlp_h_to_4h (4x13312x53248, b=2048): 0.0443
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13312x53248, b=2048): 262.191
Elapsed time for mlp_4h_to_h (4x53248x13312, b=2048): 0.0444
Throughput (in TFLOP/s) for mlp_4h_to_h (4x53248x13312, b=2048): 261.582

Attention duration (in seconds): 0.0562
Attention throughput (in TFLOP/s): 222.698
MLP duration (in seconds): 0.0887
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1449
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 13568, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13568x40704, b=2048): 0.0349
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13568x40704, b=2048): 259.138
b: 512, m: 2048, n: 106, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x106x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x106x2048): 93.899
b: 512, m: 2048, n: 2048, k: 106,
Elapsed time for attention_prob_times_values (512x2048x2048x106): 0.0085
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x106): 53.449
Elapsed time for attention_linear_projection (4x13568x13568, b=2048): 0.0123
Throughput (in TFLOP/s) for attention_linear_projection (4x13568x13568, b=2048): 245.872
Elapsed time for mlp_h_to_4h (4x13568x54272, b=2048): 0.0463
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13568x54272, b=2048): 260.665
Elapsed time for mlp_4h_to_h (4x54272x13568, b=2048): 0.0458
Throughput (in TFLOP/s) for mlp_4h_to_h (4x54272x13568, b=2048): 263.340

Attention duration (in seconds): 0.0606
Attention throughput (in TFLOP/s): 214.285
MLP duration (in seconds): 0.0921
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1526
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 13824, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x13824x41472, b=2048): 0.0358
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x13824x41472, b=2048): 262.164
b: 512, m: 2048, n: 108, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x108x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x108x2048): 96.688
b: 512, m: 2048, n: 2048, k: 108,
Elapsed time for attention_prob_times_values (512x2048x2048x108): 0.0063
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x108): 73.130
Elapsed time for attention_linear_projection (4x13824x13824, b=2048): 0.0125
Throughput (in TFLOP/s) for attention_linear_projection (4x13824x13824, b=2048): 250.577
Elapsed time for mlp_h_to_4h (4x13824x55296, b=2048): 0.0478
Throughput (in TFLOP/s) for mlp_h_to_4h (4x13824x55296, b=2048): 262.221
Elapsed time for mlp_4h_to_h (4x55296x13824, b=2048): 0.0476
Throughput (in TFLOP/s) for mlp_4h_to_h (4x55296x13824, b=2048): 262.987

Attention duration (in seconds): 0.0595
Attention throughput (in TFLOP/s): 226.216
MLP duration (in seconds): 0.0954
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1548
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 14080, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14080x42240, b=2048): 0.0375
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14080x42240, b=2048): 259.552
b: 512, m: 2048, n: 110, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x110x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x110x2048): 97.013
b: 512, m: 2048, n: 2048, k: 110,
Elapsed time for attention_prob_times_values (512x2048x2048x110): 0.0086
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x110): 55.239
Elapsed time for attention_linear_projection (4x14080x14080, b=2048): 0.0130
Throughput (in TFLOP/s) for attention_linear_projection (4x14080x14080, b=2048): 248.902
Elapsed time for mlp_h_to_4h (4x14080x56320, b=2048): 0.0495
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14080x56320, b=2048): 262.665
Elapsed time for mlp_4h_to_h (4x56320x14080, b=2048): 0.0498
Throughput (in TFLOP/s) for mlp_4h_to_h (4x56320x14080, b=2048): 261.106

Attention duration (in seconds): 0.0640
Attention throughput (in TFLOP/s): 217.718
MLP duration (in seconds): 0.0992
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1632
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 14336, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14336x43008, b=2048): 0.0394
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14336x43008, b=2048): 256.428
b: 512, m: 2048, n: 112, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x112x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x112x2048): 100.735
b: 512, m: 2048, n: 2048, k: 112,
Elapsed time for attention_prob_times_values (512x2048x2048x112): 0.0061
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x112): 78.461
Elapsed time for attention_linear_projection (4x14336x14336, b=2048): 0.0136
Throughput (in TFLOP/s) for attention_linear_projection (4x14336x14336, b=2048): 247.075
Elapsed time for mlp_h_to_4h (4x14336x57344, b=2048): 0.0522
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14336x57344, b=2048): 258.259
Elapsed time for mlp_4h_to_h (4x57344x14336, b=2048): 0.0518
Throughput (in TFLOP/s) for mlp_4h_to_h (4x57344x14336, b=2048): 259.963

Attention duration (in seconds): 0.0639
Attention throughput (in TFLOP/s): 225.737
MLP duration (in seconds): 0.1040
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1679
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 14592, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14592x43776, b=2048): 0.0404
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14592x43776, b=2048): 259.312
b: 512, m: 2048, n: 114, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x114x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x114x2048): 100.065
b: 512, m: 2048, n: 2048, k: 114,
Elapsed time for attention_prob_times_values (512x2048x2048x114): 0.0086
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x114): 57.200
Elapsed time for attention_linear_projection (4x14592x14592, b=2048): 0.0139
Throughput (in TFLOP/s) for attention_linear_projection (4x14592x14592, b=2048): 251.188
Elapsed time for mlp_h_to_4h (4x14592x58368, b=2048): 0.0662
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14592x58368, b=2048): 210.663
Elapsed time for mlp_4h_to_h (4x58368x14592, b=2048): 0.0539
Throughput (in TFLOP/s) for mlp_4h_to_h (4x58368x14592, b=2048): 259.011

Attention duration (in seconds): 0.0677
Attention throughput (in TFLOP/s): 220.581
MLP duration (in seconds): 0.1201
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1878
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 14848, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x14848x44544, b=2048): 0.0421
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x14848x44544, b=2048): 257.175
b: 512, m: 2048, n: 116, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x116x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x116x2048): 103.572
b: 512, m: 2048, n: 2048, k: 116,
Elapsed time for attention_prob_times_values (512x2048x2048x116): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x116): 77.920
Elapsed time for attention_linear_projection (4x14848x14848, b=2048): 0.0145
Throughput (in TFLOP/s) for attention_linear_projection (4x14848x14848, b=2048): 248.848
Elapsed time for mlp_h_to_4h (4x14848x59392, b=2048): 0.0562
Throughput (in TFLOP/s) for mlp_h_to_4h (4x14848x59392, b=2048): 256.864
Elapsed time for mlp_4h_to_h (4x59392x14848, b=2048): 0.0556
Throughput (in TFLOP/s) for mlp_4h_to_h (4x59392x14848, b=2048): 260.060

Attention duration (in seconds): 0.0679
Attention throughput (in TFLOP/s): 227.614
MLP duration (in seconds): 0.1118
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1797
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 15104, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15104x45312, b=2048): 0.0432
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15104x45312, b=2048): 259.543
b: 512, m: 2048, n: 118, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x118x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x118x2048): 103.501
b: 512, m: 2048, n: 2048, k: 118,
Elapsed time for attention_prob_times_values (512x2048x2048x118): 0.0086
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x118): 58.789
Elapsed time for attention_linear_projection (4x15104x15104, b=2048): 0.0148
Throughput (in TFLOP/s) for attention_linear_projection (4x15104x15104, b=2048): 252.351
Elapsed time for mlp_h_to_4h (4x15104x60416, b=2048): 0.0571
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15104x60416, b=2048): 261.630
Elapsed time for mlp_4h_to_h (4x60416x15104, b=2048): 0.0573
Throughput (in TFLOP/s) for mlp_4h_to_h (4x60416x15104, b=2048): 261.137

Attention duration (in seconds): 0.0715
Attention throughput (in TFLOP/s): 223.178
MLP duration (in seconds): 0.1144
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1859
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 15360, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15360x46080, b=2048): 0.0449
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15360x46080, b=2048): 258.222
b: 512, m: 2048, n: 120, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x120x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x120x2048): 106.342
b: 512, m: 2048, n: 2048, k: 120,
Elapsed time for attention_prob_times_values (512x2048x2048x120): 0.0065
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x120): 79.833
Elapsed time for attention_linear_projection (4x15360x15360, b=2048): 0.0154
Throughput (in TFLOP/s) for attention_linear_projection (4x15360x15360, b=2048): 250.540
Elapsed time for mlp_h_to_4h (4x15360x61440, b=2048): 0.0597
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15360x61440, b=2048): 259.206
Elapsed time for mlp_4h_to_h (4x61440x15360, b=2048): 0.0593
Throughput (in TFLOP/s) for mlp_4h_to_h (4x61440x15360, b=2048): 260.787

Attention duration (in seconds): 0.0716
Attention throughput (in TFLOP/s): 230.217
MLP duration (in seconds): 0.1189
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1906
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 15616, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15616x46848, b=2048): 0.0468
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15616x46848, b=2048): 256.082
b: 512, m: 2048, n: 122, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x122x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x122x2048): 106.713
b: 512, m: 2048, n: 2048, k: 122,
Elapsed time for attention_prob_times_values (512x2048x2048x122): 0.0086
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x122): 60.861
Elapsed time for attention_linear_projection (4x15616x15616, b=2048): 0.0160
Throughput (in TFLOP/s) for attention_linear_projection (4x15616x15616, b=2048): 249.481
Elapsed time for mlp_h_to_4h (4x15616x62464, b=2048): 0.0619
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15616x62464, b=2048): 258.244
Elapsed time for mlp_4h_to_h (4x62464x15616, b=2048): 0.0609
Throughput (in TFLOP/s) for mlp_4h_to_h (4x62464x15616, b=2048): 262.485

Attention duration (in seconds): 0.0763
Attention throughput (in TFLOP/s): 223.073
MLP duration (in seconds): 0.1228
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.1991
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 15872, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x15872x47616, b=2048): 0.0475
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x15872x47616, b=2048): 260.706
b: 512, m: 2048, n: 124, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x124x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x124x2048): 110.649
b: 512, m: 2048, n: 2048, k: 124,
Elapsed time for attention_prob_times_values (512x2048x2048x124): 0.0064
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x124): 82.638
Elapsed time for attention_linear_projection (4x15872x15872, b=2048): 0.0163
Throughput (in TFLOP/s) for attention_linear_projection (4x15872x15872, b=2048): 253.977
Elapsed time for mlp_h_to_4h (4x15872x63488, b=2048): 0.0627
Throughput (in TFLOP/s) for mlp_h_to_4h (4x15872x63488, b=2048): 263.148
Elapsed time for mlp_4h_to_h (4x63488x15872, b=2048): 0.0629
Throughput (in TFLOP/s) for mlp_4h_to_h (4x63488x15872, b=2048): 262.639

Attention duration (in seconds): 0.0750
Attention throughput (in TFLOP/s): 234.318
MLP duration (in seconds): 0.1256
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2006
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 16128, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16128x48384, b=2048): 0.0494
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16128x48384, b=2048): 258.603
b: 512, m: 2048, n: 126, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x126x2048): 0.0049
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x126x2048): 110.755
b: 512, m: 2048, n: 2048, k: 126,
Elapsed time for attention_prob_times_values (512x2048x2048x126): 0.0086
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x126): 62.851
Elapsed time for attention_linear_projection (4x16128x16128, b=2048): 0.0169
Throughput (in TFLOP/s) for attention_linear_projection (4x16128x16128, b=2048): 251.935
Elapsed time for mlp_h_to_4h (4x16128x64512, b=2048): 0.0657
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16128x64512, b=2048): 259.517
Elapsed time for mlp_4h_to_h (4x64512x16128, b=2048): 0.0652
Throughput (in TFLOP/s) for mlp_4h_to_h (4x64512x16128, b=2048): 261.307

Attention duration (in seconds): 0.0799
Attention throughput (in TFLOP/s): 227.036
MLP duration (in seconds): 0.1309
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2108
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 16384, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16384x49152, b=2048): 0.0505
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16384x49152, b=2048): 261.454
b: 512, m: 2048, n: 128, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x128x2048): 0.0048
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x128x2048): 115.010
b: 512, m: 2048, n: 2048, k: 128,
Elapsed time for attention_prob_times_values (512x2048x2048x128): 0.0060
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x128): 91.800
Elapsed time for attention_linear_projection (4x16384x16384, b=2048): 0.0173
Throughput (in TFLOP/s) for attention_linear_projection (4x16384x16384, b=2048): 254.483
Elapsed time for mlp_h_to_4h (4x16384x65536, b=2048): 0.0682
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16384x65536, b=2048): 257.921
Elapsed time for mlp_4h_to_h (4x65536x16384, b=2048): 0.0673
Throughput (in TFLOP/s) for mlp_4h_to_h (4x65536x16384, b=2048): 261.333

Attention duration (in seconds): 0.0785
Attention throughput (in TFLOP/s): 238.064
MLP duration (in seconds): 0.1355
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2140
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 16640, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16640x49920, b=2048): 0.0528
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16640x49920, b=2048): 257.755
b: 512, m: 2048, n: 130, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x130x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x130x2048): 100.497
b: 512, m: 2048, n: 2048, k: 130,
Elapsed time for attention_prob_times_values (512x2048x2048x130): 0.0095
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x130): 58.885
Elapsed time for attention_linear_projection (4x16640x16640, b=2048): 0.0180
Throughput (in TFLOP/s) for attention_linear_projection (4x16640x16640, b=2048): 252.685
Elapsed time for mlp_h_to_4h (4x16640x66560, b=2048): 0.0696
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16640x66560, b=2048): 260.546
Elapsed time for mlp_4h_to_h (4x66560x16640, b=2048): 0.0694
Throughput (in TFLOP/s) for mlp_4h_to_h (4x66560x16640, b=2048): 261.520

Attention duration (in seconds): 0.0858
Attention throughput (in TFLOP/s): 224.530
MLP duration (in seconds): 0.1390
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2248
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 16896, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x16896x50688, b=2048): 0.0545
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x16896x50688, b=2048): 257.468
b: 512, m: 2048, n: 132, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x132x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x132x2048): 102.890
b: 512, m: 2048, n: 2048, k: 132,
Elapsed time for attention_prob_times_values (512x2048x2048x132): 0.0074
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x132): 76.398
Elapsed time for attention_linear_projection (4x16896x16896, b=2048): 0.0186
Throughput (in TFLOP/s) for attention_linear_projection (4x16896x16896, b=2048): 251.825
Elapsed time for mlp_h_to_4h (4x16896x67584, b=2048): 0.0727
Throughput (in TFLOP/s) for mlp_h_to_4h (4x16896x67584, b=2048): 257.479
Elapsed time for mlp_4h_to_h (4x67584x16896, b=2048): 0.0716
Throughput (in TFLOP/s) for mlp_4h_to_h (4x67584x16896, b=2048): 261.369

Attention duration (in seconds): 0.0860
Attention throughput (in TFLOP/s): 230.722
MLP duration (in seconds): 0.1442
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2302
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 17152, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17152x51456, b=2048): 0.0558
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17152x51456, b=2048): 259.123
b: 512, m: 2048, n: 134, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x134x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x134x2048): 102.874
b: 512, m: 2048, n: 2048, k: 134,
Elapsed time for attention_prob_times_values (512x2048x2048x134): 0.0096
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x134): 59.970
Elapsed time for attention_linear_projection (4x17152x17152, b=2048): 0.0190
Throughput (in TFLOP/s) for attention_linear_projection (4x17152x17152, b=2048): 253.863
Elapsed time for mlp_h_to_4h (4x17152x68608, b=2048): 0.0741
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17152x68608, b=2048): 260.279
Elapsed time for mlp_4h_to_h (4x68608x17152, b=2048): 0.0736
Throughput (in TFLOP/s) for mlp_4h_to_h (4x68608x17152, b=2048): 261.836

Attention duration (in seconds): 0.0900
Attention throughput (in TFLOP/s): 227.058
MLP duration (in seconds): 0.1477
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2377
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 17408, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17408x52224, b=2048): 0.0580
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17408x52224, b=2048): 257.028
b: 512, m: 2048, n: 136, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x136x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x136x2048): 103.957
b: 512, m: 2048, n: 2048, k: 136,
Elapsed time for attention_prob_times_values (512x2048x2048x136): 0.0068
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x136): 85.924
Elapsed time for attention_linear_projection (4x17408x17408, b=2048): 0.0196
Throughput (in TFLOP/s) for attention_linear_projection (4x17408x17408, b=2048): 253.036
Elapsed time for mlp_h_to_4h (4x17408x69632, b=2048): 0.0765
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17408x69632, b=2048): 259.497
Elapsed time for mlp_4h_to_h (4x69632x17408, b=2048): 0.0757
Throughput (in TFLOP/s) for mlp_4h_to_h (4x69632x17408, b=2048): 262.309

Attention duration (in seconds): 0.0900
Attention throughput (in TFLOP/s): 233.674
MLP duration (in seconds): 0.1522
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2422
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 17664, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17664x52992, b=2048): 0.0592
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17664x52992, b=2048): 259.126
b: 512, m: 2048, n: 138, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x138x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x138x2048): 105.379
b: 512, m: 2048, n: 2048, k: 138,
Elapsed time for attention_prob_times_values (512x2048x2048x138): 0.0097
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x138): 60.994
Elapsed time for attention_linear_projection (4x17664x17664, b=2048): 0.0201
Throughput (in TFLOP/s) for attention_linear_projection (4x17664x17664, b=2048): 254.156
Elapsed time for mlp_h_to_4h (4x17664x70656, b=2048): 0.0786
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17664x70656, b=2048): 260.318
Elapsed time for mlp_4h_to_h (4x70656x17664, b=2048): 0.0781
Throughput (in TFLOP/s) for mlp_4h_to_h (4x70656x17664, b=2048): 261.683

Attention duration (in seconds): 0.0946
Attention throughput (in TFLOP/s): 228.589
MLP duration (in seconds): 0.1567
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2513
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 17920, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x17920x53760, b=2048): 0.0611
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x17920x53760, b=2048): 258.411
b: 512, m: 2048, n: 140, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x140x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x140x2048): 108.553
b: 512, m: 2048, n: 2048, k: 140,
Elapsed time for attention_prob_times_values (512x2048x2048x140): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x140): 79.534
Elapsed time for attention_linear_projection (4x17920x17920, b=2048): 0.0206
Throughput (in TFLOP/s) for attention_linear_projection (4x17920x17920, b=2048): 255.483
Elapsed time for mlp_h_to_4h (4x17920x71680, b=2048): 0.0809
Throughput (in TFLOP/s) for mlp_h_to_4h (4x17920x71680, b=2048): 260.180
Elapsed time for mlp_4h_to_h (4x71680x17920, b=2048): 0.0802
Throughput (in TFLOP/s) for mlp_4h_to_h (4x71680x17920, b=2048): 262.252

Attention duration (in seconds): 0.0948
Attention throughput (in TFLOP/s): 234.747
MLP duration (in seconds): 0.1611
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2559
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 18176, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18176x54528, b=2048): 0.0630
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18176x54528, b=2048): 257.748
b: 512, m: 2048, n: 142, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x142x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x142x2048): 108.484
b: 512, m: 2048, n: 2048, k: 142,
Elapsed time for attention_prob_times_values (512x2048x2048x142): 0.0098
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x142): 62.229
Elapsed time for attention_linear_projection (4x18176x18176, b=2048): 0.0213
Throughput (in TFLOP/s) for attention_linear_projection (4x18176x18176, b=2048): 254.517
Elapsed time for mlp_h_to_4h (4x18176x72704, b=2048): 0.0886
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18176x72704, b=2048): 244.398
Elapsed time for mlp_4h_to_h (4x72704x18176, b=2048): 0.0821
Throughput (in TFLOP/s) for mlp_4h_to_h (4x72704x18176, b=2048): 263.555

Attention duration (in seconds): 0.0997
Attention throughput (in TFLOP/s): 229.419
MLP duration (in seconds): 0.1707
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2704
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 18432, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18432x55296, b=2048): 0.0638
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18432x55296, b=2048): 261.613
b: 512, m: 2048, n: 144, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x144x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x144x2048): 110.058
b: 512, m: 2048, n: 2048, k: 144,
Elapsed time for attention_prob_times_values (512x2048x2048x144): 0.0068
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x144): 90.772
Elapsed time for attention_linear_projection (4x18432x18432, b=2048): 0.0216
Throughput (in TFLOP/s) for attention_linear_projection (4x18432x18432, b=2048): 257.940
Elapsed time for mlp_h_to_4h (4x18432x73728, b=2048): 0.0846
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18432x73728, b=2048): 263.322
Elapsed time for mlp_4h_to_h (4x73728x18432, b=2048): 0.0845
Throughput (in TFLOP/s) for mlp_4h_to_h (4x73728x18432, b=2048): 263.616

Attention duration (in seconds): 0.0978
Attention throughput (in TFLOP/s): 240.201
MLP duration (in seconds): 0.1690
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2669
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 18688, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18688x56064, b=2048): 0.0663
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18688x56064, b=2048): 258.719
b: 512, m: 2048, n: 146, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x146x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x146x2048): 111.252
b: 512, m: 2048, n: 2048, k: 146,
Elapsed time for attention_prob_times_values (512x2048x2048x146): 0.0098
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x146): 63.728
Elapsed time for attention_linear_projection (4x18688x18688, b=2048): 0.0223
Throughput (in TFLOP/s) for attention_linear_projection (4x18688x18688, b=2048): 256.092
Elapsed time for mlp_h_to_4h (4x18688x74752, b=2048): 0.0877
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18688x74752, b=2048): 261.114
Elapsed time for mlp_4h_to_h (4x74752x18688, b=2048): 0.0877
Throughput (in TFLOP/s) for mlp_4h_to_h (4x74752x18688, b=2048): 261.070

Attention duration (in seconds): 0.1042
Attention throughput (in TFLOP/s): 231.757
MLP duration (in seconds): 0.1753
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2795
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 18944, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x18944x56832, b=2048): 0.0956
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x18944x56832, b=2048): 184.590
b: 512, m: 2048, n: 148, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x148x2048): 0.0055
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x148x2048): 114.875
b: 512, m: 2048, n: 2048, k: 148,
Elapsed time for attention_prob_times_values (512x2048x2048x148): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x148): 83.788
Elapsed time for attention_linear_projection (4x18944x18944, b=2048): 0.0228
Throughput (in TFLOP/s) for attention_linear_projection (4x18944x18944, b=2048): 257.798
Elapsed time for mlp_h_to_4h (4x18944x75776, b=2048): 0.1242
Throughput (in TFLOP/s) for mlp_h_to_4h (4x18944x75776, b=2048): 189.434
Elapsed time for mlp_4h_to_h (4x75776x18944, b=2048): 0.0897
Throughput (in TFLOP/s) for mlp_4h_to_h (4x75776x18944, b=2048): 262.080

Attention duration (in seconds): 0.1315
Attention throughput (in TFLOP/s): 188.539
MLP duration (in seconds): 0.2139
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3454
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 19200, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19200x57600, b=2048): 0.0701
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19200x57600, b=2048): 258.455
b: 512, m: 2048, n: 150, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x150x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x150x2048): 113.905
b: 512, m: 2048, n: 2048, k: 150,
Elapsed time for attention_prob_times_values (512x2048x2048x150): 0.0099
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x150): 64.749
Elapsed time for attention_linear_projection (4x19200x19200, b=2048): 0.0236
Throughput (in TFLOP/s) for attention_linear_projection (4x19200x19200, b=2048): 255.910
Elapsed time for mlp_h_to_4h (4x19200x76800, b=2048): 0.1105
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19200x76800, b=2048): 218.551
Elapsed time for mlp_4h_to_h (4x76800x19200, b=2048): 0.0914
Throughput (in TFLOP/s) for mlp_4h_to_h (4x76800x19200, b=2048): 264.431

Attention duration (in seconds): 0.1093
Attention throughput (in TFLOP/s): 232.795
MLP duration (in seconds): 0.2019
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3112
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 19456, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19456x58368, b=2048): 0.0719
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19456x58368, b=2048): 258.900
b: 512, m: 2048, n: 152, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x152x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x152x2048): 115.205
b: 512, m: 2048, n: 2048, k: 152,
Elapsed time for attention_prob_times_values (512x2048x2048x152): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x152): 92.391
Elapsed time for attention_linear_projection (4x19456x19456, b=2048): 0.0243
Throughput (in TFLOP/s) for attention_linear_projection (4x19456x19456, b=2048): 255.672
Elapsed time for mlp_h_to_4h (4x19456x77824, b=2048): 0.0952
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19456x77824, b=2048): 260.547
Elapsed time for mlp_4h_to_h (4x77824x19456, b=2048): 0.0937
Throughput (in TFLOP/s) for mlp_4h_to_h (4x77824x19456, b=2048): 264.627

Attention duration (in seconds): 0.1089
Attention throughput (in TFLOP/s): 239.892
MLP duration (in seconds): 0.1890
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.2978
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 19712, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19712x59136, b=2048): 0.0728
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19712x59136, b=2048): 262.173
b: 512, m: 2048, n: 154, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x154x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x154x2048): 116.529
b: 512, m: 2048, n: 2048, k: 154,
Elapsed time for attention_prob_times_values (512x2048x2048x154): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x154): 66.320
Elapsed time for attention_linear_projection (4x19712x19712, b=2048): 0.0247
Throughput (in TFLOP/s) for attention_linear_projection (4x19712x19712, b=2048): 257.481
Elapsed time for mlp_h_to_4h (4x19712x78848, b=2048): 0.0977
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19712x78848, b=2048): 260.712
Elapsed time for mlp_4h_to_h (4x78848x19712, b=2048): 0.0970
Throughput (in TFLOP/s) for mlp_4h_to_h (4x78848x19712, b=2048): 262.436

Attention duration (in seconds): 0.1132
Attention throughput (in TFLOP/s): 236.595
MLP duration (in seconds): 0.1947
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3079
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 19968, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x19968x59904, b=2048): 0.0756
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x19968x59904, b=2048): 259.106
b: 512, m: 2048, n: 156, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x156x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x156x2048): 120.705
b: 512, m: 2048, n: 2048, k: 156,
Elapsed time for attention_prob_times_values (512x2048x2048x156): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x156): 87.757
Elapsed time for attention_linear_projection (4x19968x19968, b=2048): 0.0254
Throughput (in TFLOP/s) for attention_linear_projection (4x19968x19968, b=2048): 257.102
Elapsed time for mlp_h_to_4h (4x19968x79872, b=2048): 0.1571
Throughput (in TFLOP/s) for mlp_h_to_4h (4x19968x79872, b=2048): 166.295
Elapsed time for mlp_4h_to_h (4x79872x19968, b=2048): 0.1002
Throughput (in TFLOP/s) for mlp_4h_to_h (4x79872x19968, b=2048): 260.684

Attention duration (in seconds): 0.1142
Attention throughput (in TFLOP/s): 240.482
MLP duration (in seconds): 0.2574
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3716
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 20224, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20224x60672, b=2048): 0.0769
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20224x60672, b=2048): 261.315
b: 512, m: 2048, n: 158, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x158x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x158x2048): 119.246
b: 512, m: 2048, n: 2048, k: 158,
Elapsed time for attention_prob_times_values (512x2048x2048x158): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x158): 67.762
Elapsed time for attention_linear_projection (4x20224x20224, b=2048): 0.0260
Throughput (in TFLOP/s) for attention_linear_projection (4x20224x20224, b=2048): 257.630
Elapsed time for mlp_h_to_4h (4x20224x80896, b=2048): 0.1342
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20224x80896, b=2048): 199.704
Elapsed time for mlp_4h_to_h (4x80896x20224, b=2048): 0.1012
Throughput (in TFLOP/s) for mlp_4h_to_h (4x80896x20224, b=2048): 264.895

Attention duration (in seconds): 0.1186
Attention throughput (in TFLOP/s): 237.356
MLP duration (in seconds): 0.2354
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3541
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 20480, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20480x61440, b=2048): 0.0785
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20480x61440, b=2048): 262.534
b: 512, m: 2048, n: 160, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x160x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x160x2048): 121.683
b: 512, m: 2048, n: 2048, k: 160,
Elapsed time for attention_prob_times_values (512x2048x2048x160): 0.0069
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x160): 99.965
Elapsed time for attention_linear_projection (4x20480x20480, b=2048): 0.0268
Throughput (in TFLOP/s) for attention_linear_projection (4x20480x20480, b=2048): 256.714
Elapsed time for mlp_h_to_4h (4x20480x81920, b=2048): 0.1772
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20480x81920, b=2048): 155.105
Elapsed time for mlp_4h_to_h (4x81920x20480, b=2048): 0.1032
Throughput (in TFLOP/s) for mlp_4h_to_h (4x81920x20480, b=2048): 266.391

Attention duration (in seconds): 0.1178
Attention throughput (in TFLOP/s): 244.974
MLP duration (in seconds): 0.2804
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3982
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 20736, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20736x62208, b=2048): 0.1146
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20736x62208, b=2048): 184.413
b: 512, m: 2048, n: 162, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x162x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x162x2048): 122.629
b: 512, m: 2048, n: 2048, k: 162,
Elapsed time for attention_prob_times_values (512x2048x2048x162): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x162): 69.312
Elapsed time for attention_linear_projection (4x20736x20736, b=2048): 0.0268
Throughput (in TFLOP/s) for attention_linear_projection (4x20736x20736, b=2048): 263.104
Elapsed time for mlp_h_to_4h (4x20736x82944, b=2048): 0.1544
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20736x82944, b=2048): 182.549
Elapsed time for mlp_4h_to_h (4x82944x20736, b=2048): 0.1063
Throughput (in TFLOP/s) for mlp_4h_to_h (4x82944x20736, b=2048): 265.110

Attention duration (in seconds): 0.1571
Attention throughput (in TFLOP/s): 188.239
MLP duration (in seconds): 0.2607
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4178
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 20992, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x20992x62976, b=2048): 0.0821
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x20992x62976, b=2048): 263.847
b: 512, m: 2048, n: 164, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x164x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x164x2048): 126.862
b: 512, m: 2048, n: 2048, k: 164,
Elapsed time for attention_prob_times_values (512x2048x2048x164): 0.0076
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x164): 92.220
Elapsed time for attention_linear_projection (4x20992x20992, b=2048): 0.0279
Throughput (in TFLOP/s) for attention_linear_projection (4x20992x20992, b=2048): 259.208
Elapsed time for mlp_h_to_4h (4x20992x83968, b=2048): 0.1646
Throughput (in TFLOP/s) for mlp_h_to_4h (4x20992x83968, b=2048): 175.443
Elapsed time for mlp_4h_to_h (4x83968x20992, b=2048): 0.1102
Throughput (in TFLOP/s) for mlp_4h_to_h (4x83968x20992, b=2048): 262.152

Attention duration (in seconds): 0.1231
Attention throughput (in TFLOP/s): 245.975
MLP duration (in seconds): 0.2748
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3979
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 21248, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21248x63744, b=2048): 0.1106
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21248x63744, b=2048): 200.708
b: 512, m: 2048, n: 166, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x166x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x166x2048): 124.693
b: 512, m: 2048, n: 2048, k: 166,
Elapsed time for attention_prob_times_values (512x2048x2048x166): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x166): 71.000
Elapsed time for attention_linear_projection (4x21248x21248, b=2048): 0.0289
Throughput (in TFLOP/s) for attention_linear_projection (4x21248x21248, b=2048): 256.130
Elapsed time for mlp_h_to_4h (4x21248x84992, b=2048): 0.1685
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21248x84992, b=2048): 175.642
Elapsed time for mlp_4h_to_h (4x84992x21248, b=2048): 0.1124
Throughput (in TFLOP/s) for mlp_4h_to_h (4x84992x21248, b=2048): 263.154

Attention duration (in seconds): 0.1552
Attention throughput (in TFLOP/s): 199.829
MLP duration (in seconds): 0.2809
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4361
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 21504, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21504x64512, b=2048): 0.0861
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21504x64512, b=2048): 264.112
b: 512, m: 2048, n: 168, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x168x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x168x2048): 126.264
b: 512, m: 2048, n: 2048, k: 168,
Elapsed time for attention_prob_times_values (512x2048x2048x168): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x168): 101.167
Elapsed time for attention_linear_projection (4x21504x21504, b=2048): 0.0290
Throughput (in TFLOP/s) for attention_linear_projection (4x21504x21504, b=2048): 261.413
Elapsed time for mlp_h_to_4h (4x21504x86016, b=2048): 0.1142
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21504x86016, b=2048): 265.295
Elapsed time for mlp_4h_to_h (4x86016x21504, b=2048): 0.1151
Throughput (in TFLOP/s) for mlp_4h_to_h (4x86016x21504, b=2048): 263.186

Attention duration (in seconds): 0.1279
Attention throughput (in TFLOP/s): 248.254
MLP duration (in seconds): 0.2294
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.3573
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 21760, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x21760x65280, b=2048): 0.1261
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x21760x65280, b=2048): 184.550
b: 512, m: 2048, n: 170, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x170x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x170x2048): 127.991
b: 512, m: 2048, n: 2048, k: 170,
Elapsed time for attention_prob_times_values (512x2048x2048x170): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x170): 72.867
Elapsed time for attention_linear_projection (4x21760x21760, b=2048): 0.0299
Throughput (in TFLOP/s) for attention_linear_projection (4x21760x21760, b=2048): 259.381
Elapsed time for mlp_h_to_4h (4x21760x87040, b=2048): 0.1760
Throughput (in TFLOP/s) for mlp_h_to_4h (4x21760x87040, b=2048): 176.309
Elapsed time for mlp_4h_to_h (4x87040x21760, b=2048): 0.1170
Throughput (in TFLOP/s) for mlp_4h_to_h (4x87040x21760, b=2048): 265.180

Attention duration (in seconds): 0.1717
Attention throughput (in TFLOP/s): 189.187
MLP duration (in seconds): 0.2930
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4648
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 22016, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22016x66048, b=2048): 0.0896
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22016x66048, b=2048): 265.927
b: 512, m: 2048, n: 172, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x172x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x172x2048): 132.657
b: 512, m: 2048, n: 2048, k: 172,
Elapsed time for attention_prob_times_values (512x2048x2048x172): 0.0077
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x172): 96.376
Elapsed time for attention_linear_projection (4x22016x22016, b=2048): 0.0302
Throughput (in TFLOP/s) for attention_linear_projection (4x22016x22016, b=2048): 263.018
Elapsed time for mlp_h_to_4h (4x22016x88064, b=2048): 0.1766
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22016x88064, b=2048): 179.879
Elapsed time for mlp_4h_to_h (4x88064x22016, b=2048): 0.1197
Throughput (in TFLOP/s) for mlp_4h_to_h (4x88064x22016, b=2048): 265.372

Attention duration (in seconds): 0.1330
Attention throughput (in TFLOP/s): 249.917
MLP duration (in seconds): 0.2963
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4293
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 22272, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22272x66816, b=2048): 0.0926
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22272x66816, b=2048): 263.341
b: 512, m: 2048, n: 174, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x174x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x174x2048): 130.594
b: 512, m: 2048, n: 2048, k: 174,
Elapsed time for attention_prob_times_values (512x2048x2048x174): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x174): 74.668
Elapsed time for attention_linear_projection (4x22272x22272, b=2048): 0.0310
Throughput (in TFLOP/s) for attention_linear_projection (4x22272x22272, b=2048): 262.088
Elapsed time for mlp_h_to_4h (4x22272x89088, b=2048): 0.1472
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22272x89088, b=2048): 220.913
Elapsed time for mlp_4h_to_h (4x89088x22272, b=2048): 0.1240
Throughput (in TFLOP/s) for mlp_4h_to_h (4x89088x22272, b=2048): 262.244

Attention duration (in seconds): 0.1393
Attention throughput (in TFLOP/s): 244.056
MLP duration (in seconds): 0.2711
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4104
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 22528, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22528x67584, b=2048): 0.1343
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22528x67584, b=2048): 185.674
b: 512, m: 2048, n: 176, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x176x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x176x2048): 132.647
b: 512, m: 2048, n: 2048, k: 176,
Elapsed time for attention_prob_times_values (512x2048x2048x176): 0.0071
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x176): 106.583
Elapsed time for attention_linear_projection (4x22528x22528, b=2048): 0.0322
Throughput (in TFLOP/s) for attention_linear_projection (4x22528x22528, b=2048): 257.851
Elapsed time for mlp_h_to_4h (4x22528x90112, b=2048): 0.1844
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22528x90112, b=2048): 180.370
Elapsed time for mlp_4h_to_h (4x90112x22528, b=2048): 0.1265
Throughput (in TFLOP/s) for mlp_4h_to_h (4x90112x22528, b=2048): 262.843

Attention duration (in seconds): 0.1794
Attention throughput (in TFLOP/s): 193.837
MLP duration (in seconds): 0.3109
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4903
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 22784, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x22784x68352, b=2048): 0.1339
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x22784x68352, b=2048): 190.505
b: 512, m: 2048, n: 178, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x178x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x178x2048): 133.251
b: 512, m: 2048, n: 2048, k: 178,
Elapsed time for attention_prob_times_values (512x2048x2048x178): 0.0100
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x178): 76.155
Elapsed time for attention_linear_projection (4x22784x22784, b=2048): 0.0326
Throughput (in TFLOP/s) for attention_linear_projection (4x22784x22784, b=2048): 260.811
Elapsed time for mlp_h_to_4h (4x22784x91136, b=2048): 0.1471
Throughput (in TFLOP/s) for mlp_h_to_4h (4x22784x91136, b=2048): 231.207
Elapsed time for mlp_4h_to_h (4x91136x22784, b=2048): 0.1309
Throughput (in TFLOP/s) for mlp_4h_to_h (4x91136x22784, b=2048): 259.901

Attention duration (in seconds): 0.1823
Attention throughput (in TFLOP/s): 194.982
MLP duration (in seconds): 0.2780
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4604
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 23040, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23040x69120, b=2048): 0.1182
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23040x69120, b=2048): 220.719
b: 512, m: 2048, n: 180, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x180x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x180x2048): 138.507
b: 512, m: 2048, n: 2048, k: 180,
Elapsed time for attention_prob_times_values (512x2048x2048x180): 0.0077
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x180): 99.870
Elapsed time for attention_linear_projection (4x23040x23040, b=2048): 0.0335
Throughput (in TFLOP/s) for attention_linear_projection (4x23040x23040, b=2048): 259.496
Elapsed time for mlp_h_to_4h (4x23040x92160, b=2048): 0.1703
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23040x92160, b=2048): 204.306
Elapsed time for mlp_4h_to_h (4x92160x23040, b=2048): 0.1325
Throughput (in TFLOP/s) for mlp_4h_to_h (4x92160x23040, b=2048): 262.640

Attention duration (in seconds): 0.1651
Attention throughput (in TFLOP/s): 220.145
MLP duration (in seconds): 0.3027
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4678
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 23296, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23296x69888, b=2048): 0.1271
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23296x69888, b=2048): 209.938
b: 512, m: 2048, n: 182, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x182x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x182x2048): 136.240
b: 512, m: 2048, n: 2048, k: 182,
Elapsed time for attention_prob_times_values (512x2048x2048x182): 0.0101
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x182): 77.672
Elapsed time for attention_linear_projection (4x23296x23296, b=2048): 0.0342
Throughput (in TFLOP/s) for attention_linear_projection (4x23296x23296, b=2048): 259.858
Elapsed time for mlp_h_to_4h (4x23296x93184, b=2048): 0.1531
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23296x93184, b=2048): 232.380
Elapsed time for mlp_4h_to_h (4x93184x23296, b=2048): 0.1350
Throughput (in TFLOP/s) for mlp_4h_to_h (4x93184x23296, b=2048): 263.382

Attention duration (in seconds): 0.1771
Attention throughput (in TFLOP/s): 209.679
MLP duration (in seconds): 0.2881
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.4652
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 23552, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23552x70656, b=2048): 0.1585
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23552x70656, b=2048): 172.001
b: 512, m: 2048, n: 184, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x184x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x184x2048): 137.972
b: 512, m: 2048, n: 2048, k: 184,
Elapsed time for attention_prob_times_values (512x2048x2048x184): 0.0072
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x184): 109.042
Elapsed time for attention_linear_projection (4x23552x23552, b=2048): 0.0352
Throughput (in TFLOP/s) for attention_linear_projection (4x23552x23552, b=2048): 258.432
Elapsed time for mlp_h_to_4h (4x23552x94208, b=2048): 0.2021
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23552x94208, b=2048): 179.835
Elapsed time for mlp_4h_to_h (4x94208x23552, b=2048): 0.1370
Throughput (in TFLOP/s) for mlp_4h_to_h (4x94208x23552, b=2048): 265.408

Attention duration (in seconds): 0.2067
Attention throughput (in TFLOP/s): 183.557
MLP duration (in seconds): 0.3391
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5458
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 23808, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x23808x71424, b=2048): 0.1069
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x23808x71424, b=2048): 260.638
b: 512, m: 2048, n: 186, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x186x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x186x2048): 138.567
b: 512, m: 2048, n: 2048, k: 186,
Elapsed time for attention_prob_times_values (512x2048x2048x186): 0.0101
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x186): 78.955
Elapsed time for attention_linear_projection (4x23808x23808, b=2048): 0.0359
Throughput (in TFLOP/s) for attention_linear_projection (4x23808x23808, b=2048): 258.454
Elapsed time for mlp_h_to_4h (4x23808x95232, b=2048): 0.2089
Throughput (in TFLOP/s) for mlp_h_to_4h (4x23808x95232, b=2048): 177.781
Elapsed time for mlp_4h_to_h (4x95232x23808, b=2048): 0.1410
Throughput (in TFLOP/s) for mlp_4h_to_h (4x95232x23808, b=2048): 263.414

Attention duration (in seconds): 0.1587
Attention throughput (in TFLOP/s): 244.127
MLP duration (in seconds): 0.3500
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5087
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 24064, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24064x72192, b=2048): 0.1078
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24064x72192, b=2048): 264.137
b: 512, m: 2048, n: 188, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x188x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x188x2048): 144.429
b: 512, m: 2048, n: 2048, k: 188,
Elapsed time for attention_prob_times_values (512x2048x2048x188): 0.0078
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x188): 103.966
Elapsed time for attention_linear_projection (4x24064x24064, b=2048): 0.0364
Throughput (in TFLOP/s) for attention_linear_projection (4x24064x24064, b=2048): 260.557
Elapsed time for mlp_h_to_4h (4x24064x96256, b=2048): 0.2457
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24064x96256, b=2048): 154.469
Elapsed time for mlp_4h_to_h (4x96256x24064, b=2048): 0.1461
Throughput (in TFLOP/s) for mlp_4h_to_h (4x96256x24064, b=2048): 259.707

Attention duration (in seconds): 0.1575
Attention throughput (in TFLOP/s): 251.164
MLP duration (in seconds): 0.3918
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5493
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 24320, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24320x72960, b=2048): 0.1563
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24320x72960, b=2048): 186.030
b: 512, m: 2048, n: 190, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x190x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x190x2048): 141.523
b: 512, m: 2048, n: 2048, k: 190,
Elapsed time for attention_prob_times_values (512x2048x2048x190): 0.0101
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x190): 80.927
Elapsed time for attention_linear_projection (4x24320x24320, b=2048): 0.0373
Throughput (in TFLOP/s) for attention_linear_projection (4x24320x24320, b=2048): 259.804
Elapsed time for mlp_h_to_4h (4x24320x97280, b=2048): 0.2288
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24320x97280, b=2048): 169.379
Elapsed time for mlp_4h_to_h (4x97280x24320, b=2048): 0.1476
Throughput (in TFLOP/s) for mlp_4h_to_h (4x97280x24320, b=2048): 262.660

Attention duration (in seconds): 0.2094
Attention throughput (in TFLOP/s): 192.883
MLP duration (in seconds): 0.3764
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5858
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 24576, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24576x73728, b=2048): 0.1362
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24576x73728, b=2048): 218.041
b: 512, m: 2048, n: 192, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x192x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x192x2048): 144.253
b: 512, m: 2048, n: 2048, k: 192,
Elapsed time for attention_prob_times_values (512x2048x2048x192): 0.0070
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x192): 117.858
Elapsed time for attention_linear_projection (4x24576x24576, b=2048): 0.0380
Throughput (in TFLOP/s) for attention_linear_projection (4x24576x24576, b=2048): 260.119
Elapsed time for mlp_h_to_4h (4x24576x98304, b=2048): 0.2579
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24576x98304, b=2048): 153.482
Elapsed time for mlp_4h_to_h (4x98304x24576, b=2048): 0.1505
Throughput (in TFLOP/s) for mlp_4h_to_h (4x98304x24576, b=2048): 262.982

Attention duration (in seconds): 0.1869
Attention throughput (in TFLOP/s): 220.598
MLP duration (in seconds): 0.4084
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5953
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 24832, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x24832x74496, b=2048): 0.1258
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x24832x74496, b=2048): 240.864
b: 512, m: 2048, n: 194, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x194x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x194x2048): 143.411
b: 512, m: 2048, n: 2048, k: 194,
Elapsed time for attention_prob_times_values (512x2048x2048x194): 0.0110
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x194): 75.526
Elapsed time for attention_linear_projection (4x24832x24832, b=2048): 0.0388
Throughput (in TFLOP/s) for attention_linear_projection (4x24832x24832, b=2048): 260.701
Elapsed time for mlp_h_to_4h (4x24832x99328, b=2048): 0.2606
Throughput (in TFLOP/s) for mlp_h_to_4h (4x24832x99328, b=2048): 155.072
Elapsed time for mlp_4h_to_h (4x99328x24832, b=2048): 0.1541
Throughput (in TFLOP/s) for mlp_4h_to_h (4x99328x24832, b=2048): 262.297

Attention duration (in seconds): 0.1814
Attention throughput (in TFLOP/s): 231.926
MLP duration (in seconds): 0.4147
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5961
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 25088, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25088x75264, b=2048): 0.1462
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25088x75264, b=2048): 211.665
b: 512, m: 2048, n: 196, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x196x2048): 0.0056
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x196x2048): 149.137
b: 512, m: 2048, n: 2048, k: 196,
Elapsed time for attention_prob_times_values (512x2048x2048x196): 0.0087
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x196): 96.878
Elapsed time for attention_linear_projection (4x25088x25088, b=2048): 0.0402
Throughput (in TFLOP/s) for attention_linear_projection (4x25088x25088, b=2048): 256.365
Elapsed time for mlp_h_to_4h (4x25088x100352, b=2048): 0.2098
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25088x100352, b=2048): 196.630
Elapsed time for mlp_4h_to_h (4x100352x25088, b=2048): 0.1574
Throughput (in TFLOP/s) for mlp_4h_to_h (4x100352x25088, b=2048): 262.036

Attention duration (in seconds): 0.2007
Attention throughput (in TFLOP/s): 213.895
MLP duration (in seconds): 0.3672
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5679
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 25344, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25344x76032, b=2048): 0.1903
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25344x76032, b=2048): 165.932
b: 512, m: 2048, n: 198, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x198x2048): 0.0059
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x198x2048): 144.618
b: 512, m: 2048, n: 2048, k: 198,
Elapsed time for attention_prob_times_values (512x2048x2048x198): 0.0112
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x198): 75.829
Elapsed time for attention_linear_projection (4x25344x25344, b=2048): 0.0404
Throughput (in TFLOP/s) for attention_linear_projection (4x25344x25344, b=2048): 260.608
Elapsed time for mlp_h_to_4h (4x25344x101376, b=2048): 0.2730
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25344x101376, b=2048): 154.193
Elapsed time for mlp_4h_to_h (4x101376x25344, b=2048): 0.1605
Throughput (in TFLOP/s) for mlp_4h_to_h (4x101376x25344, b=2048): 262.316

Attention duration (in seconds): 0.2477
Attention throughput (in TFLOP/s): 176.779
MLP duration (in seconds): 0.4335
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6812
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 25600, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25600x76800, b=2048): 0.1425
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25600x76800, b=2048): 225.981
b: 512, m: 2048, n: 200, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x200x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x200x2048): 147.538
b: 512, m: 2048, n: 2048, k: 200,
Elapsed time for attention_prob_times_values (512x2048x2048x200): 0.0078
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x200): 109.780
Elapsed time for attention_linear_projection (4x25600x25600, b=2048): 0.0414
Throughput (in TFLOP/s) for attention_linear_projection (4x25600x25600, b=2048): 259.339
Elapsed time for mlp_h_to_4h (4x25600x102400, b=2048): 0.2265
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25600x102400, b=2048): 189.651
Elapsed time for mlp_4h_to_h (4x102400x25600, b=2048): 0.1640
Throughput (in TFLOP/s) for mlp_4h_to_h (4x102400x25600, b=2048): 261.856

Attention duration (in seconds): 0.1976
Attention throughput (in TFLOP/s): 226.058
MLP duration (in seconds): 0.3905
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5881
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 25856, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x25856x77568, b=2048): 0.1880
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x25856x77568, b=2048): 174.767
b: 512, m: 2048, n: 202, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x202x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x202x2048): 142.652
b: 512, m: 2048, n: 2048, k: 202,
Elapsed time for attention_prob_times_values (512x2048x2048x202): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x202): 76.963
Elapsed time for attention_linear_projection (4x25856x25856, b=2048): 0.0415
Throughput (in TFLOP/s) for attention_linear_projection (4x25856x25856, b=2048): 263.919
Elapsed time for mlp_h_to_4h (4x25856x103424, b=2048): 0.2831
Throughput (in TFLOP/s) for mlp_h_to_4h (4x25856x103424, b=2048): 154.773
Elapsed time for mlp_4h_to_h (4x103424x25856, b=2048): 0.1650
Throughput (in TFLOP/s) for mlp_4h_to_h (4x103424x25856, b=2048): 265.455

Attention duration (in seconds): 0.2469
Attention throughput (in TFLOP/s): 184.497
MLP duration (in seconds): 0.4481
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6950
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 26112, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26112x78336, b=2048): 0.1271
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26112x78336, b=2048): 263.637
b: 512, m: 2048, n: 204, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x204x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x204x2048): 154.624
b: 512, m: 2048, n: 2048, k: 204,
Elapsed time for attention_prob_times_values (512x2048x2048x204): 0.0088
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x204): 99.800
Elapsed time for attention_linear_projection (4x26112x26112, b=2048): 0.0428
Throughput (in TFLOP/s) for attention_linear_projection (4x26112x26112, b=2048): 261.052
Elapsed time for mlp_h_to_4h (4x26112x104448, b=2048): 0.2291
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26112x104448, b=2048): 195.086
Elapsed time for mlp_4h_to_h (4x104448x26112, b=2048): 0.1702
Throughput (in TFLOP/s) for mlp_4h_to_h (4x104448x26112, b=2048): 262.555

Attention duration (in seconds): 0.1844
Attention throughput (in TFLOP/s): 251.884
MLP duration (in seconds): 0.3992
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.5836
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 26368, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26368x79104, b=2048): 0.1469
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26368x79104, b=2048): 232.600
b: 512, m: 2048, n: 206, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x206x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x206x2048): 144.846
b: 512, m: 2048, n: 2048, k: 206,
Elapsed time for attention_prob_times_values (512x2048x2048x206): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x206): 78.195
Elapsed time for attention_linear_projection (4x26368x26368, b=2048): 0.0438
Throughput (in TFLOP/s) for attention_linear_projection (4x26368x26368, b=2048): 259.807
Elapsed time for mlp_h_to_4h (4x26368x105472, b=2048): 0.2363
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26368x105472, b=2048): 192.804
Elapsed time for mlp_4h_to_h (4x105472x26368, b=2048): 0.1731
Throughput (in TFLOP/s) for mlp_4h_to_h (4x105472x26368, b=2048): 263.280

Attention duration (in seconds): 0.2082
Attention throughput (in TFLOP/s): 227.364
MLP duration (in seconds): 0.4094
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6176
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 26624, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26624x79872, b=2048): 0.2197
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26624x79872, b=2048): 158.554
b: 512, m: 2048, n: 208, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x208x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x208x2048): 154.866
b: 512, m: 2048, n: 2048, k: 208,
Elapsed time for attention_prob_times_values (512x2048x2048x208): 0.0078
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x208): 114.555
Elapsed time for attention_linear_projection (4x26624x26624, b=2048): 0.0446
Throughput (in TFLOP/s) for attention_linear_projection (4x26624x26624, b=2048): 260.157
Elapsed time for mlp_h_to_4h (4x26624x106496, b=2048): 0.1798
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26624x106496, b=2048): 258.301
Elapsed time for mlp_4h_to_h (4x106496x26624, b=2048): 0.1772
Throughput (in TFLOP/s) for mlp_4h_to_h (4x106496x26624, b=2048): 262.195

Attention duration (in seconds): 0.2779
Attention throughput (in TFLOP/s): 173.561
MLP duration (in seconds): 0.3570
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6350
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 26880, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x26880x80640, b=2048): 0.1962
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x26880x80640, b=2048): 180.978
b: 512, m: 2048, n: 210, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x210x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x210x2048): 147.901
b: 512, m: 2048, n: 2048, k: 210,
Elapsed time for attention_prob_times_values (512x2048x2048x210): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x210): 79.917
Elapsed time for attention_linear_projection (4x26880x26880, b=2048): 0.0454
Throughput (in TFLOP/s) for attention_linear_projection (4x26880x26880, b=2048): 260.800
Elapsed time for mlp_h_to_4h (4x26880x107520, b=2048): 0.2262
Throughput (in TFLOP/s) for mlp_h_to_4h (4x26880x107520, b=2048): 209.317
Elapsed time for mlp_4h_to_h (4x107520x26880, b=2048): 0.1801
Throughput (in TFLOP/s) for mlp_4h_to_h (4x107520x26880, b=2048): 262.926

Attention duration (in seconds): 0.2590
Attention throughput (in TFLOP/s): 189.785
MLP duration (in seconds): 0.4063
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6653
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 27136, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27136x81408, b=2048): 0.1738
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27136x81408, b=2048): 208.217
b: 512, m: 2048, n: 212, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x212x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x212x2048): 159.699
b: 512, m: 2048, n: 2048, k: 212,
Elapsed time for attention_prob_times_values (512x2048x2048x212): 0.0089
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x212): 102.495
Elapsed time for attention_linear_projection (4x27136x27136, b=2048): 0.0456
Throughput (in TFLOP/s) for attention_linear_projection (4x27136x27136, b=2048): 264.675
Elapsed time for mlp_h_to_4h (4x27136x108544, b=2048): 0.2659
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27136x108544, b=2048): 181.508
Elapsed time for mlp_4h_to_h (4x108544x27136, b=2048): 0.1828
Throughput (in TFLOP/s) for mlp_4h_to_h (4x108544x27136, b=2048): 263.993

Attention duration (in seconds): 0.2340
Attention throughput (in TFLOP/s): 214.019
MLP duration (in seconds): 0.4487
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.6827
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 27392, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27392x82176, b=2048): 0.1696
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27392x82176, b=2048): 217.496
b: 512, m: 2048, n: 214, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x214x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x214x2048): 150.813
b: 512, m: 2048, n: 2048, k: 214,
Elapsed time for attention_prob_times_values (512x2048x2048x214): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x214): 81.044
Elapsed time for attention_linear_projection (4x27392x27392, b=2048): 0.0469
Throughput (in TFLOP/s) for attention_linear_projection (4x27392x27392, b=2048): 261.877
Elapsed time for mlp_h_to_4h (4x27392x109568, b=2048): 0.3313
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27392x109568, b=2048): 148.445
Elapsed time for mlp_4h_to_h (4x109568x27392, b=2048): 0.1878
Throughput (in TFLOP/s) for mlp_4h_to_h (4x109568x27392, b=2048): 261.860

Attention duration (in seconds): 0.2339
Attention throughput (in TFLOP/s): 218.049
MLP duration (in seconds): 0.5190
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7530
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 27648, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27648x82944, b=2048): 0.2114
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27648x82944, b=2048): 177.724
b: 512, m: 2048, n: 216, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x216x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x216x2048): 159.858
b: 512, m: 2048, n: 2048, k: 216,
Elapsed time for attention_prob_times_values (512x2048x2048x216): 0.0080
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x216): 116.312
Elapsed time for attention_linear_projection (4x27648x27648, b=2048): 0.0478
Throughput (in TFLOP/s) for attention_linear_projection (4x27648x27648, b=2048): 261.808
Elapsed time for mlp_h_to_4h (4x27648x110592, b=2048): 0.3309
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27648x110592, b=2048): 151.380
Elapsed time for mlp_4h_to_h (4x110592x27648, b=2048): 0.1912
Throughput (in TFLOP/s) for mlp_4h_to_h (4x110592x27648, b=2048): 261.950

Attention duration (in seconds): 0.2730
Attention throughput (in TFLOP/s): 190.282
MLP duration (in seconds): 0.5222
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7952
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 27904, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x27904x83712, b=2048): 0.2057
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x27904x83712, b=2048): 186.073
b: 512, m: 2048, n: 218, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x218x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x218x2048): 152.956
b: 512, m: 2048, n: 2048, k: 218,
Elapsed time for attention_prob_times_values (512x2048x2048x218): 0.0114
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x218): 82.242
Elapsed time for attention_linear_projection (4x27904x27904, b=2048): 0.0488
Throughput (in TFLOP/s) for attention_linear_projection (4x27904x27904, b=2048): 261.201
Elapsed time for mlp_h_to_4h (4x27904x111616, b=2048): 0.3307
Throughput (in TFLOP/s) for mlp_h_to_4h (4x27904x111616, b=2048): 154.306
Elapsed time for mlp_4h_to_h (4x111616x27904, b=2048): 0.1953
Throughput (in TFLOP/s) for mlp_4h_to_h (4x111616x27904, b=2048): 261.265

Attention duration (in seconds): 0.2720
Attention throughput (in TFLOP/s): 194.471
MLP duration (in seconds): 0.5260
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7980
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 28160, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28160x84480, b=2048): 0.1526
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28160x84480, b=2048): 255.463
b: 512, m: 2048, n: 220, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x220x2048): 0.0057
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x220x2048): 164.454
b: 512, m: 2048, n: 2048, k: 220,
Elapsed time for attention_prob_times_values (512x2048x2048x220): 0.0089
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x220): 106.362
Elapsed time for attention_linear_projection (4x28160x28160, b=2048): 0.0497
Throughput (in TFLOP/s) for attention_linear_projection (4x28160x28160, b=2048): 261.231
Elapsed time for mlp_h_to_4h (4x28160x112640, b=2048): 0.3475
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28160x112640, b=2048): 149.554
Elapsed time for mlp_4h_to_h (4x112640x28160, b=2048): 0.1972
Throughput (in TFLOP/s) for mlp_4h_to_h (4x112640x28160, b=2048): 263.588

Attention duration (in seconds): 0.2169
Attention throughput (in TFLOP/s): 248.269
MLP duration (in seconds): 0.5447
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7616
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 28416, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28416x85248, b=2048): 0.1508
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28416x85248, b=2048): 263.269
b: 512, m: 2048, n: 222, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x222x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x222x2048): 155.128
b: 512, m: 2048, n: 2048, k: 222,
Elapsed time for attention_prob_times_values (512x2048x2048x222): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x222): 84.332
Elapsed time for attention_linear_projection (4x28416x28416, b=2048): 0.0505
Throughput (in TFLOP/s) for attention_linear_projection (4x28416x28416, b=2048): 262.197
Elapsed time for mlp_h_to_4h (4x28416x113664, b=2048): 0.3502
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28416x113664, b=2048): 151.089
Elapsed time for mlp_4h_to_h (4x113664x28416, b=2048): 0.1992
Throughput (in TFLOP/s) for mlp_4h_to_h (4x113664x28416, b=2048): 265.591

Attention duration (in seconds): 0.2187
Attention throughput (in TFLOP/s): 250.729
MLP duration (in seconds): 0.5495
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7682
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 28672, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28672x86016, b=2048): 0.1804
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28672x86016, b=2048): 224.038
b: 512, m: 2048, n: 224, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x224x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x224x2048): 165.759
b: 512, m: 2048, n: 2048, k: 224,
Elapsed time for attention_prob_times_values (512x2048x2048x224): 0.0078
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x224): 122.801
Elapsed time for attention_linear_projection (4x28672x28672, b=2048): 0.0511
Throughput (in TFLOP/s) for attention_linear_projection (4x28672x28672, b=2048): 263.693
Elapsed time for mlp_h_to_4h (4x28672x114688, b=2048): 0.3518
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28672x114688, b=2048): 153.166
Elapsed time for mlp_4h_to_h (4x114688x28672, b=2048): 0.2020
Throughput (in TFLOP/s) for mlp_4h_to_h (4x114688x28672, b=2048): 266.713

Attention duration (in seconds): 0.2451
Attention throughput (in TFLOP/s): 227.686
MLP duration (in seconds): 0.5538
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.7988
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 28928, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x28928x86784, b=2048): 0.2299
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x28928x86784, b=2048): 178.932
b: 512, m: 2048, n: 226, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x226x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x226x2048): 158.328
b: 512, m: 2048, n: 2048, k: 226,
Elapsed time for attention_prob_times_values (512x2048x2048x226): 0.0114
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x226): 85.169
Elapsed time for attention_linear_projection (4x28928x28928, b=2048): 0.0520
Throughput (in TFLOP/s) for attention_linear_projection (4x28928x28928, b=2048): 263.628
Elapsed time for mlp_h_to_4h (4x28928x115712, b=2048): 0.3709
Throughput (in TFLOP/s) for mlp_h_to_4h (4x28928x115712, b=2048): 147.846
Elapsed time for mlp_4h_to_h (4x115712x28928, b=2048): 0.2086
Throughput (in TFLOP/s) for mlp_4h_to_h (4x115712x28928, b=2048): 262.936

Attention duration (in seconds): 0.2994
Attention throughput (in TFLOP/s): 189.652
MLP duration (in seconds): 0.5795
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8789
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 29184, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29184x87552, b=2048): 0.2019
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29184x87552, b=2048): 207.369
b: 512, m: 2048, n: 228, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x228x2048): 0.0058
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x228x2048): 169.338
b: 512, m: 2048, n: 2048, k: 228,
Elapsed time for attention_prob_times_values (512x2048x2048x228): 0.0088
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x228): 111.387
Elapsed time for attention_linear_projection (4x29184x29184, b=2048): 0.0534
Throughput (in TFLOP/s) for attention_linear_projection (4x29184x29184, b=2048): 261.548
Elapsed time for mlp_h_to_4h (4x29184x116736, b=2048): 0.3719
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29184x116736, b=2048): 150.099
Elapsed time for mlp_4h_to_h (4x116736x29184, b=2048): 0.2097
Throughput (in TFLOP/s) for mlp_4h_to_h (4x116736x29184, b=2048): 266.146

Attention duration (in seconds): 0.2698
Attention throughput (in TFLOP/s): 214.140
MLP duration (in seconds): 0.5816
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8514
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 29440, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29440x88320, b=2048): 0.1625
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29440x88320, b=2048): 262.083
b: 512, m: 2048, n: 230, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x230x2048): 0.0061
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x230x2048): 160.880
b: 512, m: 2048, n: 2048, k: 230,
Elapsed time for attention_prob_times_values (512x2048x2048x230): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x230): 87.301
Elapsed time for attention_linear_projection (4x29440x29440, b=2048): 0.0539
Throughput (in TFLOP/s) for attention_linear_projection (4x29440x29440, b=2048): 263.372
Elapsed time for mlp_h_to_4h (4x29440x117760, b=2048): 0.3798
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29440x117760, b=2048): 149.557
Elapsed time for mlp_4h_to_h (4x117760x29440, b=2048): 0.2222
Throughput (in TFLOP/s) for mlp_4h_to_h (4x117760x29440, b=2048): 255.628

Attention duration (in seconds): 0.2339
Attention throughput (in TFLOP/s): 251.269
MLP duration (in seconds): 0.6020
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8359
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 29696, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29696x89088, b=2048): 0.2629
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29696x89088, b=2048): 164.844
b: 512, m: 2048, n: 232, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x232x2048): 0.0059
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x232x2048): 169.733
b: 512, m: 2048, n: 2048, k: 232,
Elapsed time for attention_prob_times_values (512x2048x2048x232): 0.0081
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x232): 123.660
Elapsed time for attention_linear_projection (4x29696x29696, b=2048): 0.0553
Throughput (in TFLOP/s) for attention_linear_projection (4x29696x29696, b=2048): 261.350
Elapsed time for mlp_h_to_4h (4x29696x118784, b=2048): 0.3450
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29696x118784, b=2048): 167.529
Elapsed time for mlp_4h_to_h (4x118784x29696, b=2048): 0.2207
Throughput (in TFLOP/s) for mlp_4h_to_h (4x118784x29696, b=2048): 261.915

Attention duration (in seconds): 0.3322
Attention throughput (in TFLOP/s): 179.993
MLP duration (in seconds): 0.5656
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.8978
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 29952, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x29952x89856, b=2048): 0.2696
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x29952x89856, b=2048): 163.532
b: 512, m: 2048, n: 234, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x234x2048): 0.0062
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x234x2048): 162.862
b: 512, m: 2048, n: 2048, k: 234,
Elapsed time for attention_prob_times_values (512x2048x2048x234): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x234): 89.137
Elapsed time for attention_linear_projection (4x29952x29952, b=2048): 0.0565
Throughput (in TFLOP/s) for attention_linear_projection (4x29952x29952, b=2048): 260.318
Elapsed time for mlp_h_to_4h (4x29952x119808, b=2048): 0.3791
Throughput (in TFLOP/s) for mlp_h_to_4h (4x29952x119808, b=2048): 155.106
Elapsed time for mlp_4h_to_h (4x119808x29952, b=2048): 0.2226
Throughput (in TFLOP/s) for mlp_4h_to_h (4x119808x29952, b=2048): 264.105

Attention duration (in seconds): 0.3436
Attention throughput (in TFLOP/s): 176.986
MLP duration (in seconds): 0.6017
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9452
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 30208, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30208x90624, b=2048): 0.2263
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30208x90624, b=2048): 198.180
b: 512, m: 2048, n: 236, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x236x2048): 0.0059
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x236x2048): 172.038
b: 512, m: 2048, n: 2048, k: 236,
Elapsed time for attention_prob_times_values (512x2048x2048x236): 0.0088
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x236): 115.008
Elapsed time for attention_linear_projection (4x30208x30208, b=2048): 0.0567
Throughput (in TFLOP/s) for attention_linear_projection (4x30208x30208, b=2048): 263.637
Elapsed time for mlp_h_to_4h (4x30208x120832, b=2048): 0.4073
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30208x120832, b=2048): 146.831
Elapsed time for mlp_4h_to_h (4x120832x30208, b=2048): 0.2263
Throughput (in TFLOP/s) for mlp_4h_to_h (4x120832x30208, b=2048): 264.276

Attention duration (in seconds): 0.2977
Attention throughput (in TFLOP/s): 207.668
MLP duration (in seconds): 0.6336
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9313
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 30464, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30464x91392, b=2048): 0.2498
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30464x91392, b=2048): 182.585
b: 512, m: 2048, n: 238, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x238x2048): 0.0062
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x238x2048): 165.391
b: 512, m: 2048, n: 2048, k: 238,
Elapsed time for attention_prob_times_values (512x2048x2048x238): 0.0113
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x238): 90.601
Elapsed time for attention_linear_projection (4x30464x30464, b=2048): 0.0575
Throughput (in TFLOP/s) for attention_linear_projection (4x30464x30464, b=2048): 264.367
Elapsed time for mlp_h_to_4h (4x30464x121856, b=2048): 0.4273
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30464x121856, b=2048): 142.338
Elapsed time for mlp_4h_to_h (4x121856x30464, b=2048): 0.2312
Throughput (in TFLOP/s) for mlp_4h_to_h (4x121856x30464, b=2048): 263.049

Attention duration (in seconds): 0.3248
Attention throughput (in TFLOP/s): 193.544
MLP duration (in seconds): 0.6585
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9833
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 30720, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30720x92160, b=2048): 0.2873
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30720x92160, b=2048): 161.438
b: 512, m: 2048, n: 240, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x240x2048): 0.0059
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x240x2048): 175.422
b: 512, m: 2048, n: 2048, k: 240,
Elapsed time for attention_prob_times_values (512x2048x2048x240): 0.0080
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x240): 128.426
Elapsed time for attention_linear_projection (4x30720x30720, b=2048): 0.0597
Throughput (in TFLOP/s) for attention_linear_projection (4x30720x30720, b=2048): 258.904
Elapsed time for mlp_h_to_4h (4x30720x122880, b=2048): 0.4163
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30720x122880, b=2048): 148.560
Elapsed time for mlp_4h_to_h (4x122880x30720, b=2048): 0.2348
Throughput (in TFLOP/s) for mlp_4h_to_h (4x122880x30720, b=2048): 263.441

Attention duration (in seconds): 0.3610
Attention throughput (in TFLOP/s): 177.058
MLP duration (in seconds): 0.6511
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 1.0120
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================
num_attention_heads: 128, hidden_size: 30976, train_micro_batch_size_per_gpu: 4, tensor_mp_size: 1, pipeline_mp_size: 1, dp_size: 1


Estimate
--------
Elapsed time for attention_key_value_query_transform (4x30976x92928, b=2048): 0.2085
Throughput (in TFLOP/s) for attention_key_value_query_transform (4x30976x92928, b=2048): 226.152
b: 512, m: 2048, n: 242, k: 2048,
Elapsed time for attention_key_query_prob (512x2048x242x2048): 0.0062
Throughput (in TFLOP/s) for attention_key_query_prob (512x2048x242x2048): 168.229
b: 512, m: 2048, n: 2048, k: 242,
Elapsed time for attention_prob_times_values (512x2048x2048x242): 0.0112
Throughput (in TFLOP/s) for attention_prob_times_values (512x2048x2048x242): 92.913
Elapsed time for attention_linear_projection (4x30976x30976, b=2048): 0.0600
Throughput (in TFLOP/s) for attention_linear_projection (4x30976x30976, b=2048): 262.208
Elapsed time for mlp_h_to_4h (4x30976x123904, b=2048): 0.4319
Throughput (in TFLOP/s) for mlp_h_to_4h (4x30976x123904, b=2048): 145.595
Elapsed time for mlp_4h_to_h (4x123904x30976, b=2048): 0.2414
Throughput (in TFLOP/s) for mlp_4h_to_h (4x123904x30976, b=2048): 260.450

Attention duration (in seconds): 0.2859
Attention throughput (in TFLOP/s): 227.248
MLP duration (in seconds): 0.6733
MLP throughput (in TFLOP/s): 1.000
Transformer duration (in seconds): 0.9592
Transformer throughput (in TFLOP/s): 1.000
Transformer - MLP - Attention (in seconds): 0.0000
========================================================================================================================