FRU-Device does not work well with 16bit eeproms #1

Open
feistjj opened this issue Jul 19, 2019 · 3 comments


@feistjj
Member

feistjj commented Jul 19, 2019

Known issue. Unfortunately, I don't have any 16-bit EEPROMs in my system to test with.

Start of solution here: https://gerrit.openbmc-project.xyz/c/openbmc/entity-manager/+/18783

@feistjj
Member Author

feistjj commented Aug 15, 2019

Adding @pstrinkle, @amithash and @vijaykhemka, as they are working or have worked on this issue.

@vijaykhemka
Contributor

The main issue is that it is hard to tell whether a device uses 8-bit or 16-bit addressing just by reading it. The current implementation assumes that the device comes up with its index pointer at offset 0. If the pointer is at a different offset or page, the header can't be read without writing to the device.
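To illustrate the assumption described above, here is a minimal sketch that uses an in-memory stand-in for an EEPROM. `SimulatedEeprom`, `readWithoutAddressing` and `looksLikeFruHeader` are hypothetical names invented for this example and are not taken from FruDevice. The header check assumes the standard 8-byte IPMI FRU common header (format version 0x01 in the first byte, all eight bytes summing to zero modulo 256).

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory model of an EEPROM with an internal address pointer.
// It only illustrates the "pointer starts at 0" assumption; it is not real
// I2C code and not the FruDevice implementation.
class SimulatedEeprom
{
  public:
    SimulatedEeprom(std::vector<std::uint8_t> contents,
                    std::size_t startPointer) :
        data(std::move(contents)),
        pointer(startPointer)
    {}

    // "Current address read": return the byte at the internal pointer and
    // advance it. The host sends no address bytes, so nothing is written.
    std::uint8_t readCurrent()
    {
        if (data.empty())
        {
            return 0xFF;
        }
        pointer %= data.size();
        std::uint8_t value = data[pointer];
        ++pointer;
        return value;
    }

  private:
    std::vector<std::uint8_t> data;
    std::size_t pointer;
};

// Read `count` bytes without setting the address first.
std::vector<std::uint8_t> readWithoutAddressing(SimulatedEeprom& dev,
                                                std::size_t count)
{
    std::vector<std::uint8_t> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        out.push_back(dev.readCurrent());
    }
    return out;
}

// Check the 8-byte FRU common header: version byte 0x01, zero checksum.
bool looksLikeFruHeader(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() < 8 || bytes[0] != 0x01)
    {
        return false;
    }
    std::uint8_t sum = 0;
    for (std::size_t i = 0; i < 8; ++i)
    {
        sum = static_cast<std::uint8_t>(sum + bytes[i]);
    }
    return sum == 0;
}
```

In this model, a device whose pointer starts at 0 yields a valid header, while the same contents read from a pointer left at offset 4 do not, even though nothing is wrong with the stored data.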

@pstrinkle
Member

Yup. That's the primary difficulty. I have a device that is 16-bit addressed, but on every other boot of the BMC, FruDevice changes its mind about it. So I implemented a quick hint lookup that checks whether a device is "hard-coded" to be one or the other. However, this requires a lot of board knowledge, and we mix 8-bit and 16-bit devices at the same SMBus address. We do know that if a device is on bus 6 or 7 (for example), then it must be 16-bit, so I make those hints available to the code. With the hint in place, it always works for me.
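A hint lookup of the kind described above might look roughly like the following sketch. The names `AddressSize`, `lookupHint` and `resolveAddressSize` are hypothetical, and keying the table by bus number alone is an assumption made for brevity; the actual patch may be structured differently.

```cpp
#include <map>
#include <optional>

// Hypothetical address-width tag for an EEPROM.
enum class AddressSize
{
    bits8,
    bits16
};

// Return the hard-coded hint for a bus, if the board provides one.
std::optional<AddressSize>
    lookupHint(const std::map<int, AddressSize>& busHints, int bus)
{
    auto it = busHints.find(bus);
    if (it == busHints.end())
    {
        return std::nullopt;
    }
    return it->second;
}

// Prefer the board hint; fall back to whatever auto-detection reported.
AddressSize resolveAddressSize(const std::map<int, AddressSize>& busHints,
                               int bus, AddressSize probed)
{
    return lookupHint(busHints, bus).value_or(probed);
}
```

With hints for buses 6 and 7 set to `AddressSize::bits16`, a wrong 8-bit detection result on those buses is overridden, while buses without a hint keep the detected value.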

bradbishop pushed a commit that referenced this issue Nov 11, 2019
When using multiple dbus-probe types, we were seeing:

Program received signal SIGBUS, Bus error.
0x00475c6c in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() ()
(gdb) bt
#0  0x00475c6c in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() ()
#1  0x00477820 in std::vector<std::shared_ptr<PerformProbe>, std::allocator<std::shared_ptr<PerformProbe> > >::clear() ()
#2  0x0046d594 in ?? ()
#3  0x0046e14c in ?? ()
#4  0x76f60bd0 in ?? () from /lib/libsystemd.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


The logic here was quite bad. By moving the storage of
PerformProbe shared_ptrs into the captures, we never need
to call clear, so we won't run into this problem. The code
was reordered to fix the issue.

Tested: On a system that frequently saw the crash, it went
away, and all sensors are still available.

Change-Id: Icacb8861466816df64b24efe940e5a732102345a
Signed-off-by: James Feist <[email protected]>
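
The idea in the commit message, keeping an object alive through lambda captures rather than through a container that must be cleared, can be sketched as follows. `Probe` and `makeCallbacks` are hypothetical stand-ins written for this example; they are not the entity-manager code, and the real callbacks would be handed to asynchronous D-Bus calls rather than returned in a vector.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for PerformProbe. The destructor bumps a counter so
// that the moment of destruction is observable.
struct Probe
{
    explicit Probe(int& finishedCounter) : finished(finishedCounter) {}
    ~Probe()
    {
        ++finished;
    }
    int& finished;
};

// Build `count` callbacks that each capture a copy of the shared_ptr.
// No separate container of probes exists: the probe is destroyed when the
// last callback holding it is destroyed.
std::vector<std::function<void()>> makeCallbacks(int& finishedCounter,
                                                 int count)
{
    auto probe = std::make_shared<Probe>(finishedCounter);
    std::vector<std::function<void()>> callbacks;
    for (int i = 0; i < count; ++i)
    {
        callbacks.emplace_back([probe]() {
            // The capture alone keeps the probe alive until this callback
            // is destroyed.
            (void)probe;
        });
    }
    return callbacks;
}
```

Because ownership lives in the captures, the probe's lifetime follows the callbacks automatically and nothing has to be released by hand.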