Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

main boot-collector process exits because of race condition with previous boot-collector process with --usleep arg detected #25

Open
hyei opened this issue Mar 17, 2011 · 8 comments

Comments

@hyei
Copy link

hyei commented Mar 17, 2011

I've observed that bootchartd is unable to dump logs about 10% of the time because the bootchart-collector process isn't available. It looks like the process that is supposed to be there exits because there is a bootchart-collector process with the argument --usleep 5000 still running. This is from the wait_boot() function thats put in the background right before.

I've added -c to /sbin/bootchartd and a fprintf at the beginning of main() in collector.c to narrow down the issue.
[partial output]
Starting bootchart logging
bootchart: no initrd used; starting
bootchart-collector started as pid 415 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 414 with 2 args: '-c' '50'
bootchart console.debug [1]
bootchart-collector started as pid 413 with 3 args: '-c' '--usleep' '250000'
bootchart-collector started as pid 414 with 2 args: '-c' '50'
bootchart-collector run outside initrd
bootchart collector already running as pid 416, exiting...
bootchart-collector started as pid 416 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 417 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 418 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 419 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 420 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 421 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 422 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 423 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 424 with 3 args: '-c' '--usleep' '5000'

.....

bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ'
bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ'
Failed to find the collector's pid
bootchart-collector pid: 1207 unmounted proc / clean exit
Can't extract boot chart from collector

@mmeeks
Copy link
Collaborator

mmeeks commented Mar 18, 2011

Riiight - it is possible that we are mis-filtering collectors that have the --usleep still running - but there should be code to detect and simply skip over those. Quite possibly that code doesn't work when they are exiting [ perhaps there is some race condition whereby the cmdline-args are destroyed by glibc before the kernel kills the process or something ? ]. If so that would be an interesting angle [ and yes the intermittent nature is deadly annoying - help fixing it much appreciated ]. Could you enable the debug to collector/dump.c in bootchart_find_running_pid to see if we can get something out of that ?

In the medium term, I had hoped to move towards killing the nasty shell-script wrapper, and moving ever more of that into the collector process itself (into plain C) - any progress you can help out with there is much appreciated !

@hyei
Copy link
Author

hyei commented Mar 18, 2011

The call to fread() on the cmdline file in the proc entry of the process with the --usleep argument returns a length of 0 instead of the expected len of "--usleep". I think it is somehow catching the process so early in its execution that the proc entry hasn't been completely initialized yet.

len = fread (abuffer, 1, 4095, argf);

Annoyingly enough, the more debug code I throw in, the harder it is to reproduce.

I could kill the loop at the beginning of wait_boot, because I've modified the script to wait X seconds as the exit criteria which will fix it for me, but will give some thought on the race condition.

@mmeeks
Copy link
Collaborator

mmeeks commented Mar 18, 2011

Wow - that is helpful; so yes there are races and unfinished state at start/finish of processes which is sad/broken but hard to avoid I guess.

I think that (really) the sol'n is to bin the use of bootchart-collector --usleep in the code - perhaps we can do this with an installed symlink - so the process has a different name, and then do the usleep mode on argv[0] ? so ln- -s bootchart-collector busleep; busleep 1000 etc. ?

@hyei
Copy link
Author

hyei commented Mar 19, 2011

The process will still have the same name, at least as seen under proc. When busleep is executed, /proc/1269/exe is still a link to exe -> /lib/bootchart/bootchart-collector.

I think there will have be a change in the bootchart_find_running_pid function to use some other way to detect whether there is already a bootchart-collector process collecting data.

@mmeeks
Copy link
Collaborator

mmeeks commented Mar 21, 2011

Riight - in reality I think it should be fairly trivial to remove the need for usleep. Currently we use it in bootchartd.in for:
"wait_boot()" - we could easily have a collector --wait-boot type call here that could be implemented in native code [ ideally sharing the same task watching stuff ].
Then of course there is the "let the collector get started before letting the system run on" wait - which is itself flawed; IMHO we should spawn the collector synchronously, and it should call daemon(); when it is ready to let the system continue booting.

Patches much appreciated [ of course ]. I guess it is the 'wait_boot' ones that are most problematic.

@xrmx
Copy link
Owner

xrmx commented Apr 24, 2011

What about this? This is only compile tested and break initrd case.
http://people.freedesktop.org/~rm/bootchart2/move-wait-boot-to-collector.diff

@mmeeks
Copy link
Collaborator

mmeeks commented Apr 26, 2011

Patch looks interesting to me, I think we need to ',' delimit arguments of EXIT_PROC (and in strtok) otherwise they'll be separate arguments ;-)

I don't quite see how the two steps - running the collector inside some odd chroot, and running the dump phase from some later date / in a different root works out - but if you say it does :-) that's all good.

@mmeeks mmeeks closed this as completed Apr 26, 2011
@mmeeks mmeeks reopened this Apr 26, 2011
@xrmx
Copy link
Owner

xrmx commented Apr 27, 2011

Yeah, this is more of a proof of concept and some more work is needed to have it work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants