
profiling_based_partitioner doesn't divide the segments' time evenly #23

Open
HanChangHun opened this issue Jun 3, 2022 · 7 comments
Labels
Hardware:M.2 Accelerator with dual Edge TPU · subtype:Mendel Linux · type:bug

Comments


HanChangHun commented Jun 3, 2022

Description

The diff_threshold_ns option of profiling_based_partitioner is not working as expected.

It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares last_segment_latency with target_latency.

As a result, I was able to get a partition whose slowest segment is far slower than its fastest one.

Maybe the source code (last_segment_latency - target_latency) should be changed.
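
For illustration only, here is a minimal sketch of the two conditions; this is not the actual libcoral source, and all function and variable names below are hypothetical:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of the condition the issue describes: only the last
// segment's latency is compared against the target latency.
bool ReportedStopCondition(int64_t last_segment_latency_ns,
                           int64_t target_latency_ns,
                           int64_t diff_threshold_ns) {
  return std::abs(last_segment_latency_ns - target_latency_ns) <
         diff_threshold_ns;
}

// Hypothetical sketch of the suggested condition: compare the slowest segment
// (upper bound) with the fastest segment (lower bound).
bool SuggestedStopCondition(const std::vector<int64_t>& segment_latencies_ns,
                            int64_t diff_threshold_ns) {
  const auto [min_it, max_it] = std::minmax_element(
      segment_latencies_ns.begin(), segment_latencies_ns.end());
  return (*max_it - *min_it) < diff_threshold_ns;
}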


Issue Type

Bug

Operating System

Mendel Linux, Linux

Coral Device

M.2 Accelerator with dual Edge TPU

Other Devices

No response

Programming Language

C++

Relevant Log Output

segment latencies
[1.4403, 1.3686, 0.354, 0.161]  # difference is bigger than 1ms
[1.2178, 1.3376, 0.9702, 0.0683, 0.1601]
[1.1891, 1.3306, 0.8717, 0.1092, 0.0511, 0.1604]
[2.9966, 6.0864]  # difference is so big!
[2.6992, 1.9771, 2.9165]
[2.5653, 1.9029, 1.7592, 1.1261]
[2.3772, 1.5753, 1.7227, 1.4968, 0.6235]

hjonnala commented Jun 3, 2022

Hello @HanChangHun, it could be due to input/output latency. Can you please share the latency results from the single-model benchmark for each segment file in a txt file? Thanks! google-coral/edgetpu#593 (comment)


HanChangHun commented Jun 3, 2022

I changed the profiling-based partitioner to perform the partitioning on a single Edge TPU and to share that single Edge TPU's SRAM.
So the latencies differ from the usual profiling-based partitioning example for Inception V2.

However, it is hard to attribute such a large time gap to input/output data transfer time alone.

The logs are as follows:

2022-06-03 23:43:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.14, 0.88, 0.86
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.94 ms        0.268 ms         1000 inception_v2_224_quant_segment_0_of_2_edgetpu.tflite

2022-06-03 23:43:53
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.11, 0.85, 0.85
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         6.04 ms        0.223 ms         1000 inception_v2_224_quant_segment_1_of_2_edgetpu.tflite

Another example is Inception V2 split into 4 segments. The gap between the slowest and fastest latencies is greater than 1 ms. (I set diff_threshold_ns to 1000000.)

2022-06-03 23:53:31
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.41, 0.27, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.51 ms        0.269 ms         2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite

2022-06-03 23:53:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.35, 0.26, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.89 ms        0.299 ms         2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite

2022-06-03 23:53:48
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.32, 0.25, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.72 ms        0.239 ms         2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite

2022-06-03 23:53:55
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.27, 0.24, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.09 ms        0.168 ms         3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite

Thank you for the fast response.


hjonnala commented Jun 3, 2022

Hmm... the profiling-based partitioner is not intended to perform partitioning on a single Edge TPU. Please check this page for details and requirements for using this tool: https://github.com/google-coral/libcoral/blob/master/coral/tools/partitioner/README.md. Thanks!

@HanChangHun

Thank you for your response.

I aimed to use the existing code with only one Edge TPU, so I changed the code that uses multiple Edge TPUs into code that uses only one.
However, there were no other modifications, so I thought that the partitioning part of the existing code was not considering the gap between the slowest segment's latency and the fastest segment's latency.

Since I modified the existing code, it may be difficult for you to answer.
Thank you for your kind reply!


hjonnala commented Jun 4, 2022

Can you please try this code with two TPUs and two segments on the Inception V3 model, and share the logs and the single-model benchmark results for the output models? Thanks!

@HanChangHun

This code doesn't contain the lower-bound and upper-bound update code, so I changed it in a few places and ran it with co-compilation.

The results are as follows: Inception V3 with 2 segments, 3 segments, and 4 segments.
The model looks evenly split, and the gap between the slowest and fastest segments does not exceed diff_threshold_ns (= 1000000).

# Inception V3 with 2 segments
# 24.1ms and 24.9ms
target_latency:  24704940.8, num_ops: [84 48], latencies: [24120441 24918119]

# Inception V3 with 3 segments
# 16ms, 16.4ms, 16.7ms
target_latency:  16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]

# Inception V3 with 4 segments
# 12.8ms, 13.1ms, 13.3ms and 13.7ms
target_latency:  13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
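
For reference, here is a small standalone sketch (not part of the tool; the latency values are copied from the output above) that checks each run's slowest-to-fastest gap against diff_threshold_ns:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Segment latencies (ns) reported above for Inception V3 with 2, 3, and 4 segments.
  const std::vector<std::vector<int64_t>> runs = {
      {24120441, 24918119},
      {16061107, 16405574, 16782598},
      {12896665, 13151985, 13315528, 13781784}};
  const int64_t diff_threshold_ns = 1000000;

  for (const auto& latencies : runs) {
    const auto [min_it, max_it] =
        std::minmax_element(latencies.begin(), latencies.end());
    const int64_t gap_ns = *max_it - *min_it;
    // Gaps are 797678, 721491, and 885119 ns, all below the 1000000 ns threshold.
    std::cout << "gap = " << gap_ns << " ns, within threshold: "
              << (gap_ns < diff_threshold_ns) << "\n";
  }
  return 0;
}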

Your code is very helpful!
Thank you!


hjonnala commented Jun 5, 2022

The diff_threshold_ns option of profiling_based_partitioner is not working as expected.

It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares last_segment_latency with target_latency.

As a result, I was able to get a partition whose slowest segment is far slower than its fastest one.

Maybe the source code (last_segment_latency - target_latency) should be changed.

Awesome, feel free to submit a Pull Request for this bug for the developers' review. Thanks!
