
Queue full and stop sending metrics to backend w/ ThreadedBackend in a Celery project #37

Open

puntonim (Contributor) opened this issue Jan 11, 2019 · 1 comment

I am using timeexecution with a ThreadedBackend wrapping an ElasticsearchBackend in a project at CERN. It is great software!
For reference, this is where timeexecution is configured at the app's boot:
https://github.com/inspirehep/inspire-next/blob/899dabc588159dd9d45e7202692c675f073f0fe0/inspirehep/utils/ext.py#L76
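
For readers who do not follow the link, the setup is roughly the sketch below, assuming the usual py-timeexecution API; the hosts and index values are placeholders and the exact arguments in our ext.py may differ:

    # Minimal sketch of the setup: a ThreadedBackend wrapping an
    # ElasticsearchBackend, configured once at the app's boot.
    # Hosts and index names are placeholders.
    from time_execution import configure
    from time_execution.backends.elasticsearch import ElasticsearchBackend
    from time_execution.backends.threaded import ThreadedBackend

    threaded_backend = ThreadedBackend(
        ElasticsearchBackend,  # the real backend class plus its kwargs
        backend_kwargs=dict(hosts=["localhost"], index="metrics"),
    )
    configure(backends=[threaded_backend])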

The entire project runs in a Celery instance (which might live for days before being restarted). Occasionally, it stops sending metrics to ES.
When this happens, the logs show that every metric is discarded (log entries like: [2019-01-10 12:24:18,816: WARNING/ForkPoolWorker-16] Discard metric inspirehep.modules.records.receivers.push_to_orcid) because the queue is full and never gets consumed:
https://github.com/kpn-digital/py-timeexecution/blob/d1b005a337fb6705aa804da6c26b7d9d477b62fc/time_execution/backends/threaded.py#L43-L44
My guess is that the consumer thread started by the start_worker method has died (or hung).
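
To make the failure mode concrete, here is a rough, self-contained paraphrase of what the ThreadedBackend does (not the verbatim library code; names and details are simplified): write() pushes metrics onto a bounded queue.Queue, and a consumer thread started at construction time is supposed to drain it.

    # Rough paraphrase of the ThreadedBackend behaviour, not the verbatim
    # library code: a bounded queue plus a consumer thread that flushes
    # the queued metrics to the real backend.
    import logging
    import threading
    from queue import Full, Queue

    logger = logging.getLogger(__name__)


    class ThreadedBackendSketch:
        def __init__(self, backend, queue_maxsize=1000):
            self.backend = backend              # an already-created backend object
            self.queue = Queue(maxsize=queue_maxsize)
            self.thread = None
            self.start_worker()                 # consumer thread starts right away

        def start_worker(self):
            self.thread = threading.Thread(target=self.worker, daemon=True)
            self.thread.start()

        def worker(self):
            while True:
                name, data = self.queue.get()
                self.backend.write(name, **data)

        def write(self, name, **data):
            try:
                self.queue.put_nowait((name, data))
            except Full:
                # If the consumer thread is not draining the queue, every
                # metric from now on ends up here and is dropped.
                logger.warning("Discard metric %s", name)

If the consumer thread is gone (or never existed in the current process), the queue fills up after a while and every subsequent write() hits the except Full branch, which is exactly the log line we see.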

Note that the same project also runs in a Gunicorn instance, where the issue never happens.
The problem goes away when Celery is restarted.

INVESTIGATION
Celery has a MainProcess (main) and a number (2 in our case) of ForkPoolWorker-N (worker, or child) processes. A worker process might be killed by the main process when it uses too much memory (log entry like: [2019-01-10 11:47:20,010: ERROR/ForkPoolWorker-1] child process exiting after exceeding memory limit (2760384KiB / 1370112KiB)).

When a worker process dies it is replaced by a new one. I noticed that every time the issue happens, it is right after a worker process has been killed for using too much memory. I was able to reproduce the issue only a few times and not in a deterministic way (even when setting a very low memory threshold, thus triggering the kill very frequently).
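
For anyone trying to reproduce this, the kill comes from Celery's per-child memory limit (the --max-memory-per-child CLI flag, or the equivalent setting shown below); a deliberately low value makes workers get recycled very frequently. The number here is illustrative, not our production value:

    # Celery 4.x configuration equivalent of the --max-memory-per-child
    # CLI flag: a pool worker process is replaced after it exceeds this
    # amount of resident memory, expressed in kilobytes. The value is
    # illustrative, not our production setting.
    worker_max_memory_per_child = 200 * 1024  # ~200 MiB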

To complicate things: the consumer thread is owned by the MainProcess and the write method is executed in a ForkPoolWorker-N process (I added explicit logs to prove this).

POSSIBLE SOLUTION

        except Full:
            if not self.thread:  # and maybe: or not self.thread.is_alive()
                self.start_worker()
            logger.warning("Discard metric %s", name)

I am not going to open such a PR yet, since I was not able to reproduce the issue deterministically, but I want to keep track of it here.

UPDATE 15/01/2019

The possible solution mentioned above did not work, but I am now able to reproduce the issue systematically.
Celery has a main process and a number (depending on the configuration) of worker processes.
This is our setup:
[htop screenshot showing the threads of the main Celery process (pids 32450 and 32447) and the worker processes (pids 1017 and 966)]

I added more logging statements and confirmed that:

  • the consumer thread started here lives in the main Celery process, because the ThreadedBackend class is instantiated at the app's boot (thread with pid=32450 in the htop screenshot).
  • the thread-safe queue is also created in the __init__ (I guess it is the thread with pid=32447 in the htop screenshot).
  • the producer code lives in the Celery worker processes (processes with pids=1017, 966 in the htop screenshot).

When a Celery worker process exceeds the memory limit set in the configuration, what happens is:

  • the Celery worker process is killed and replaced by a new process.
  • the producer code in the new Celery worker process keeps on queueing metrics until the queue gets full (a few seconds later).
  • the consumer thread in the main Celery process sees an empty queue, so no metrics are sent (the standalone sketch below reproduces this).
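
This is the usual interaction between fork() and threads in CPython on POSIX: a forked child inherits a copy of the parent's memory, including the queue object, but none of the parent's threads, so nothing drains the child's copy of the queue. A minimal standalone sketch (not project code, no Celery involved):

    # A consumer thread started before fork() exists only in the parent.
    # The child inherits a copy of the bounded queue but no thread drains
    # it, so it fills up and further put_nowait() calls raise queue.Full.
    # POSIX-only (os.fork).
    import os
    import threading
    import time
    from queue import Full, Queue

    queue = Queue(maxsize=5)

    def consume():
        while True:
            queue.get()  # drains the queue, but only in this process

    threading.Thread(target=consume, daemon=True).start()
    time.sleep(0.1)  # let the consumer park on the empty queue before forking

    pid = os.fork()
    if pid == 0:  # child process, like a Celery ForkPoolWorker
        for i in range(10):
            try:
                queue.put_nowait(i)
            except Full:
                print("child: queue full, metric %d discarded" % i)
        os._exit(0)

    os.waitpid(pid, 0)  # parent: its own copy of the queue stays empty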

I worked out a solution where the consumer thread, the queue and the producer code all live in the Celery worker processes. I have been testing it manually and on one canary production machine for a couple of days and it works well (while all the other, non-canary machines are still affected by the issue). The solution is backward compatible. PR coming soon.
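
The idea behind the fix (the lazy_init kwarg mentioned in the commit referenced below) is, roughly, to postpone creating the queue and starting the consumer thread until the first write(), so both end up in the process that actually produces metrics, i.e. the forked Celery worker. A hedged sketch of that idea, building on the ThreadedBackendSketch shown earlier in this issue (this is not the actual code of the PR):

    # Hedged sketch of the lazy-init idea, not the actual PR code.
    # ThreadedBackendSketch is the class from the earlier sketch; only the
    # queue import is repeated here.
    from queue import Queue


    class LazyThreadedBackendSketch(ThreadedBackendSketch):
        def __init__(self, backend, queue_maxsize=1000):
            self.backend = backend
            self.queue_maxsize = queue_maxsize
            self.queue = None
            self.thread = None          # nothing is started eagerly

        def start_worker(self):
            # Create the queue and start the consumer thread in the
            # current (possibly forked) process.
            self.queue = Queue(maxsize=self.queue_maxsize)
            super().start_worker()

        def write(self, name, **data):
            if self.thread is None:     # first write in this process
                self.start_worker()
            super().write(name, **data)

A more defensive variant could also restart the worker when os.getpid() changes, but the essential point is that the queue and the consumer thread are created in the producing process.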

@puntonim puntonim changed the title Queue full and no metrics sent to backend occasionally w/ ThreadedBackend Queue full and stop sending metrics to backend occasionally w/ ThreadedBackend Jan 11, 2019
@puntonim puntonim changed the title Queue full and stop sending metrics to backend occasionally w/ ThreadedBackend Queue full and stop sending metrics to backend w/ ThreadedBackend Jan 15, 2019
@puntonim puntonim changed the title Queue full and stop sending metrics to backend w/ ThreadedBackend Queue full and stop sending metrics to backend w/ ThreadedBackend in a Celery project Jan 15, 2019
puntonim added a commit to puntonim/py-timeexecution that referenced this issue Jan 15, 2019
The lazy_init kwarg can be used to avoid a bug that leads to no metrics
being sent by the ThreadedBackend when used in a Celery project with
--max-memory-per-child param.
See: kpn#37
ricardosantosalves (Member) commented:

Hi Paolo,

Sorry for the late feedback. Thanks for filing the issue and for the PR. We are evaluating the solution and will update/merge soon; there are some conflicts with the original PR that need to be resolved first.

Cheers
