-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Intermittent startup issues with Opensearch 1.3.3 Docker container #2568
Comments
Moving to opensearch-devops, but if you think it's a PA problem open a related issue in https://github.com/opensearch-project/performance-analyzer/ |
Hi @threesquared, The design is ensure that if either PA or OS process get killed, both process will get killed to ensure both process always running together. It was a design back in the day to remove dependencies on supervisord as it is not favored by docker hub official images. The thing that seems suspicious is the PA log error accessing:
I dont recall if PA would stop if the conf is not accessible, but even then there should not have any permission problem with that file, let me take a look on it. Have you guys try to use the official docker-compose.yml on opensearch.org? We have tested with above compose file and havent seen issues yet. Thanks. |
@peterzhuamazon Did you mean to move it back to opensearch-build. Curious why? |
Hey @threesquared just checking back if you have tried with the suggested docker-compose file by @peterzhuamazon and still notice the issue? |
So it's really hard to pin down as its very intermittent. The compose file we use in CI is only a single-node but otherwise is the same as the official compose file. We have hacked around this with a script that catches an exit code from dockerize that the port has not come up and then uses I will try and catch the issue again but there are no logs in the container other than the ones for the performance analyser I mentioned above but I will attach the full contents next time. |
We also encounter this issue in OS Javascript repo. One of our workflows runs a matrix of 24 configurations, and often one or two of those configurations will encounter this issue. Those "Killing process" messages came from opensearch-docker-entrypoint.sh. It's most likely that said script encountered an error and triggered the trap EXIT and killed these 2 processes. Grepping the container's logs for those messages to restart the container is a workaround we're considering. |
The workaround in opensearch-project/opensearch-js#304 is not pretty, but it seems to work. |
Added an issue on the meta repo with my findings and a couple of workarounds: opensearch-project/project-meta#62 |
If you look at that code, it receives a signal to terminate, then terminates the processes. Look for the cause of that signal. opensearch-build/docker/release/config/opensearch/opensearch-docker-entrypoint.sh Line 105 in 4597d77
|
@threesquared do you have a complete set of logs from a failed to start container by any chance? |
@dblock we just see the same as opensearch-project/project-meta#62 all that comes out of the container logs is:
|
@threesquared But you also have a log inside opensearch/logs inside the container. That's what I'm after. |
Ah ok sorry I wish I had saved the whole output when I last caught it. I will try and reproduce and upload a full log |
So I managed to catch this again using a script like this using dockerize in our CI platform: RETRIES=100
for ((i=0; i<RETRIES; i++)); do
./docker/scripts/dockerize.sh \
-wait http://elasticsearch:9200 \
-timeout 30s
if [[ $? -eq 1 ]]
then
sleep 1h
else
docker-compose up -d --force-recreate --no-deps elasticsearch
fi
done This gist is the last 200 lines of the logs/performance-analyzer.log Let me know if you need any more info from the container. No other log files in that directory and as usual only those two lines in the docker logs. |
I've collected some logs via github workflow. Opensearch container can fail in 2 ways:
I also collected OpenSearch logs inside the container along with Kernel and System logs of the workflow runner. |
@nhtruong I compared successful and failed runs and it's pretty clear that in both OpenSearch actually starts ( There's a 20 second timeout in https://github.com/opensearch-project/opensearch-js/blob/main/Makefile#L6. Can we replace that with an active wait that polls tje OpenSearch cluster start with a much longer timeout? @threesquared same in your case, see if you actually can hit that node in the active wait loop and increase the timeout to minutes so we can see if the node ever starts responding? |
@dblock There 2 ways the runs could fail, one of it produces almost no logs, and just
Where did you see In the makefile I used to recreate these fail runs, I actually used 45 seconds sleep time. In both 2 failed runs, whether the container was stopped or not, the docker logs show the "Kill process" message in both cases. Are you saying that this service will restart automatically after some time if Opensearch is not running? |
I was comparing the other two runs that had more logs. This one puzzles me. My running theory rn is that we start the container, do something that fails (run a test), don't properly report it or fail to see it, visibly shutdown the container (e.g. job killed in GHA), which causes the container to kill the process because it has received a termination signal itself. Make that sleep 2 minutes and log when it's done waiting. Do you see a kill signal before? |
The workflow I ran did not do any tests. It marked a run as failed if the container was unhealthy, and collected a bunch of logs. So, the only interactions OpenSearch had were from the curl command that ran every 5 seconds for health check. And the first healthcheck happened 20 seconds after the container's been booted up. I don't think there was anything from outside of the container that sent the kill signal to OS. |
See opensearch-project/opensearch-js#304 for a retry implementation. I propose we leave this open until we can get to the bottom of the issue, maybe someone will be more successful? |
Hey the resolved issue #1185, should fix this intermittent startup issues. Keeping the memory setting setting/adjustment aside, we can now disable the performance analyzer https://github.com/opensearch-project/opensearch-build/tree/main/docker/release#disable-performance-analyzer-agent-cli-and-related-configurations, this is be out for 2.4.0, should we backport this to 1.3.x? @gaiksaya @dblock @bbarani @peterzhuamazon WDTY? |
Hi @prudhvigodithi this change is already in 1.3.x as we release it in 1.3.7. Thanks. |
Thanks @peterzhuamazon since 1.3.7 is already out, we should be closing this issue, @threesquared can you please test with 1.3.7 and add your thoughts here please? |
Closing this issue since we have released 1.3.7 version and haven't noticed this behavior. Feel free to re-open if required. |
Describe the bug
We are seeing intermittent but pretty regular issues when using the Opensearch docker image in CI. All we see in the logs for the docker-compose container output is:
This is even after enabling trace level logging with
rootLogger.level = trace
in the log4j2 config. The full docker-compose config is:We are also extending the Opensearch image to remove the security plugin and add analysis-icu plugin and our image can be found on public ECR as
public.ecr.aws/peakon/opensearch:1.3.3
The container continues to run and after
exec
ing into it the logs directory contains some output from the performance analyser plugin with some errors like:However simply running
opensearch
within the container works, but using the entrypoint script results in the same error. It seems to be something to do with the terminateProcesses function.We have found that removing
CHILD
from the linetrap terminateProcesses TERM INT EXIT CHLD
in the script allows Opensearch to start up.To reproduce
This issue is hard for us to reproduce locally but happens fairly regularly in our CI infrastructure. If there is anything else we can do to debug this or get some more information that would be extremely handy.
Expected behavior
The docker image should start Opensearch or exit with some error in the logs
Screenshots
No response
Host / Environment
Docker version 20.10.17, build 100c701
Additional context
No response
Relevant log output
No response
The text was updated successfully, but these errors were encountered: