Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

twister: DeviceHandler DUT selection improvements and fixes #75548

Merged

Conversation

golowanow
Copy link
Member

@golowanow golowanow commented Jul 7, 2024

Several improvements and fixes for how Twister DeviceHandler selects and releases a DUT having multiple candidates for a test instance execution and taking into account failures.
See individual commits for details.

@golowanow golowanow force-pushed the twister-dut-select_20240706 branch from 3d8187f to f85f3dc Compare July 8, 2024 20:16
@golowanow golowanow marked this pull request as ready for review July 8, 2024 21:30
@zephyrbot zephyrbot added the area: Twister Twister label Jul 8, 2024
@golowanow
Copy link
Member Author

dear reviewers, up to your attention

@@ -473,7 +473,7 @@ def device_is_available(self, instance):
d.counter_increment()
avail = True
logger.debug(f"Retain DUT:{d.platform}, Id:{d.id}, "
f"counter:{d.counter}")
f"counter:{d.counter}, failures:{d.failures}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre my experiences, failure count is not that matter, instead the error count matters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are necessary, as hardware failure oft shows as test timeouts, which are failed statuses, not errors.
At the same time, I'd rank them separately; a separate failures and errors counters, which would be used in sorting, with errors given a higher priority (sorted()'s key accepts tuples)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change counts both 'error' and 'failed' test instance statuses for each DUT as a failure situation happened there not going deeper into details of what was the origin. It is a single counter to avoid overcomplication, but to stimulate retry attempts on different DUTs choosing less 'suspicious' one as a candidate for the next round. This way it works for any root cause not known at the moment - either at a board, a harness, or a test, or any their combination.

@@ -484,8 +484,10 @@ def device_is_available(self, instance):
return None

def make_dut_available(self, dut):
if self.instance.status in ["error", "failed"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to fix this problem twister needs to consider 'failed' test instances at a DUT:

.. for instance to resolve ploblems when some DUTs have connectivity or HW issues slowing down test plan execution, or even block the execution when only one test suite runs whereas the same first DUT candidate in the list is not working and others were not chosen.

Copy link
Collaborator

@LukaszMrugala LukaszMrugala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd approve, were it not for the hash.

if not dut_found:
raise TwisterException(f"No device to serve as {device} platform.")

#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A stray hash.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LukaszMrugala do you have any other objections ?

@@ -473,7 +473,7 @@ def device_is_available(self, instance):
d.counter_increment()
avail = True
logger.debug(f"Retain DUT:{d.platform}, Id:{d.id}, "
f"counter:{d.counter}")
f"counter:{d.counter}, failures:{d.failures}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are necessary, as hardware failure oft shows as test timeouts, which are failed statuses, not errors.
At the same time, I'd rank them separately; a separate failures and errors counters, which would be used in sorting, with errors given a higher priority (sorted()'s key accepts tuples)

@golowanow
Copy link
Member Author

rebased

@golowanow
Copy link
Member Author

dear reviewers, up to your attention

LukaszMrugala
LukaszMrugala previously approved these changes Aug 9, 2024
@golowanow
Copy link
Member Author

dear reviewers, up to your attention

just rebased again

@golowanow
Copy link
Member Author

dear reviewers, up to your attention

just rebased again

and rebased adjusting to #71401

katgiadla
katgiadla previously approved these changes Aug 13, 2024
Copy link
Collaborator

@katgiadla katgiadla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works on CI. LGTM

hakehuang
hakehuang previously approved these changes Aug 13, 2024
Several improvements at Twister DeviceHandler when it releases
current DUT (Device Under Test):

 - release the exact DUT which is used for the test instance
   instead of all configured DUTs which happened to have
   the same serial device configured.

 - Twister PyTtest harness plugin adjustment to the above.

 - additional debug logging to track DUT waiting/retain/release.

Signed-off-by: Dmitrii Golovanov <[email protected]>
Fix Twister DeviceHandler exit on SerialException when
it connects to the serial device in 'flash before' mode.

Signed-off-by: Dmitrii Golovanov <[email protected]>
Twister DeviceHandler - make DUT use counter increment
operation atomic.

Signed-off-by: Dmitrii Golovanov <[email protected]>
Twister DeviceHandler - add test failure counter for how many
test instances have been failed on each DUT (Device Under Test)
when it executes the current test plan.
Output DUT falure counter summary at the end of Twister run.

Signed-off-by: Dmitrii Golovanov <[email protected]>
Change Twister PyTest plugin's test finalizing sequence to release
the DUT it is used as the very last operation, after the Test Instance
status becomes fully updated from the execution results.

This also fix a race condition possible when pytest plugin releases
the DUT and it becomes acquired by another test while the current test
is not yet finalized completely.

Signed-off-by: Dmitrii Golovanov <[email protected]>
Improve DUT selection at DeviceHandler: for each DUT it counts
how many test instances have been failed on it during the current
twister execution, so the next available DUT will be chosen
ordering the eligible DUTs by less failures occured so far.

The new selection mechanism should increase chances to retry failed
tests on different DUTs, for instance to resolve ploblems when some
DUTs have connectivity or HW issues slowing down test plan execution,
or even block the execution when only one test suite runs whereas
the same first DUT candidate in the list is not working and others
were not chosen.

Signed-off-by: Dmitrii Golovanov <[email protected]>
@golowanow golowanow dismissed stale reviews from hakehuang and katgiadla via e40c7c7 August 14, 2024 18:00
@golowanow
Copy link
Member Author

dear reviewers, up to your attention

just rebased again

and rebased adjusting to #71401

.. and rebased adjusting to #76962

@fabiobaltieri fabiobaltieri merged commit a4cb802 into zephyrproject-rtos:main Aug 15, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants