[Bug]: Intermittent startup issues with Opensearch 1.3.3 Docker container #2568

Closed · threesquared opened this issue Sep 7, 2022 · 24 comments
Labels: bug

Describe the bug

We are seeing intermittent but fairly regular issues when using the OpenSearch Docker image in CI. All we see in the docker-compose container output is:

opensearch_1 | Killing opensearch process 10
opensearch_1 | Killing performance analyzer process 11

This is even after enabling trace level logging with rootLogger.level = trace in the log4j2 config. The full docker-compose config is:

elasticsearch:
    image: public.ecr.aws/peakon/opensearch:1.3.3
    environment:
      - discovery.type=single-node
      - http.cors.enabled=true
      - http.cors.allow-origin=*
      - bootstrap.memory_lock=true
      - 'OPENSEARCH_JAVA_OPTS=-Xms2G -Xmx2G'
    restart: always
    shm_size: 2gb
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - ./docker/log4j2.properties:/usr/share/opensearch/config/log4j2.properties

We are also extending the OpenSearch image to remove the security plugin and add the analysis-icu plugin; our image can be found on public ECR as public.ecr.aws/peakon/opensearch:1.3.3:

FROM opensearchproject/opensearch:1.3.3

RUN /usr/share/opensearch/bin/opensearch-plugin install --batch --verbose analysis-icu \
    && /usr/share/opensearch/bin/opensearch-plugin remove --verbose opensearch-security

The container continues to run, and after exec-ing into it, the logs directory contains some output from the performance analyzer plugin with errors like:

ERROR org.opensearch.performanceanalyzer.rca.RcaController - Error reading file /usr/share/opensearch/data/rca_enabled.conf

However, simply running opensearch directly within the container works, while using the entrypoint script results in the same error. It seems to be related to the terminateProcesses function.

We have found that removing CHLD from the line trap terminateProcesses TERM INT EXIT CHLD in the script allows OpenSearch to start up.
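For context, the relevant mechanism in the entrypoint looks roughly like the sketch below. This is an illustration, not the verbatim upstream script: only the trap line is quoted from it, while the function body and variable names are assumed. One plausible reading, consistent with the workaround, is that a CHLD signal from some short-lived child process fires the trap and tears down both long-running processes.

# Illustrative sketch of the opensearch-docker-entrypoint.sh behaviour (assumed, not verbatim).
# $OPENSEARCH_PID and $PA_PID would be set when the two background processes are started (not shown).
function terminateProcesses {
    if kill -0 "$OPENSEARCH_PID" >& /dev/null; then
        echo "Killing opensearch process $OPENSEARCH_PID"
        kill -TERM "$OPENSEARCH_PID"
        wait "$OPENSEARCH_PID"
    fi
    if kill -0 "$PA_PID" >& /dev/null; then
        echo "Killing performance analyzer process $PA_PID"
        kill -TERM "$PA_PID"
        wait "$PA_PID"
    fi
}

# The trap line quoted above; the workaround removes CHLD from this list.
trap terminateProcesses TERM INT EXIT CHLD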

To reproduce

This issue is hard for us to reproduce locally but happens fairly regularly in our CI infrastructure. If there is anything else we can do to debug this or gather more information, that would be extremely helpful.

Expected behavior

The Docker image should either start OpenSearch or exit with an error in the logs.


Host / Environment

Docker version 20.10.17, build 100c701


@threesquared added the bug and untriaged labels on Sep 7, 2022
@dblock transferred this issue from opensearch-project/opensearch-build on Sep 7, 2022
@dblock (Member) commented Sep 7, 2022:

Moving to opensearch-devops, but if you think it's a PA problem, open a related issue in https://github.com/opensearch-project/performance-analyzer/

@peterzhuamazon transferred this issue from opensearch-project/opensearch-devops on Sep 7, 2022
@peterzhuamazon (Member) commented:

Hi @threesquared,

The design ensures that if either the PA or the OS process gets killed, both processes get killed, so that the two processes always run together.

This design was chosen back in the day to remove the dependency on supervisord, which is not favored for Docker Hub official images.

The thing that seems suspicious is the PA log error when accessing the file:

ERROR org.opensearch.performanceanalyzer.rca.RcaController - Error reading file /usr/share/opensearch/data/rca_enabled.conf

I don't recall whether PA would stop if the conf file is not accessible, but even then there should not be any permission problem with that file; let me take a look at it.

Have you tried using the official docker-compose.yml from opensearch.org?
https://opensearch.org/samples/docker-compose.yml

We have tested with the above compose file and haven't seen issues yet, at least not when starting the cluster.
Do you see the issues right after startup, or hours later?

Thanks.

@dblock (Member) commented Sep 8, 2022:

@peterzhuamazon Did you mean to move it back to opensearch-build? Curious why?

@peterzhuamazon removed the untriaged label on Sep 15, 2022
@prudhvigodithi (Collaborator) commented:

Hey @threesquared, just checking back: have you tried the docker-compose file suggested by @peterzhuamazon, and do you still see the issue?

@threesquared (Author) commented:

It's really hard to pin down as it's very intermittent. The compose file we use in CI is only a single node, but otherwise it is the same as the official compose file.

We have worked around this with a script that catches the exit code from dockerize indicating the port has not come up, and then uses docker-compose restart to try again.

I will try to catch the issue again. There are no logs in the container other than the performance analyzer ones I mentioned above, but I will attach the full contents next time.

@nhtruong commented Oct 4, 2022:

We also encounter this issue in the OpenSearch JavaScript repo. One of our workflows runs a matrix of 24 configurations, and often one or two of them hit this issue.

Those "Killing process" messages came from opensearch-docker-entrypoint.sh. It's most likely that said script encountered an error and triggered the trap EXIT and killed these 2 processes.

Grepping the container's logs for those messages to restart the container is a workaround we're considering.
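A minimal sketch of that workaround, assuming placeholder names ("opensearch-node" for the container, "opensearch" for the compose service):

# Restart the service if the entrypoint has already torn the processes down.
if docker logs opensearch-node 2>&1 | grep -q "Killing opensearch process"; then
  echo "Entrypoint killed the processes; restarting the container"
  docker-compose restart opensearch
fi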

@dblock (Member) commented Oct 4, 2022:

The workaround in opensearch-project/opensearch-js#304 is not pretty, but it seems to work.

@nhtruong commented Oct 6, 2022:

Added an issue on the meta repo with my findings and a couple of workarounds: opensearch-project/project-meta#62

@dblock (Member) commented Oct 6, 2022:

Those "Killing process" messages came from opensearch-docker-entrypoint.sh. It's most likely that said script encountered an error and triggered the trap EXIT and killed these 2 processes.

If you look at that code, it receives a signal to terminate, then terminates the processes. Look for the cause of that signal.

@dblock (Member) commented Oct 6, 2022:

@threesquared do you have a complete set of logs from a failed-to-start container, by any chance?

@threesquared (Author) commented:

@dblock we just see the same as opensearch-project/project-meta#62; all that comes out of the container logs is:

opensearch_1 | Killing opensearch process 10
opensearch_1 | Killing performance analyzer process 11

@dblock (Member) commented Oct 6, 2022:

@threesquared But you also have a log inside opensearch/logs inside the container. That's what I'm after.

@threesquared (Author) commented:

Ah OK, sorry, I wish I had saved the whole output when I last caught it. I will try to reproduce and upload a full log.

@threesquared (Author) commented:

I managed to catch this again using a script like the following, which uses dockerize in our CI platform:

RETRIES=100

for ((i=0; i<RETRIES; i++)); do
  # Wait up to 30s for the node to answer on port 9200
  ./docker/scripts/dockerize.sh \
    -wait http://elasticsearch:9200 \
    -timeout 30s

  if [[ $? -eq 1 ]]
  then
    # Timed out: keep the job alive so the failed container can be inspected
    sleep 1h
  else
    # Node came up: recreate the container and try to reproduce again
    docker-compose up -d --force-recreate --no-deps elasticsearch
  fi
done

This gist is the last 200 lines of the logs/performance-analyzer.log

Let me know if you need any more info from the container. There are no other log files in that directory and, as usual, only those two lines in the docker logs.

@nhtruong commented Oct 7, 2022:

I've collected some logs via a GitHub workflow. The OpenSearch container can fail in two ways:

I also collected OpenSearch logs inside the container along with Kernel and System logs of the workflow runner.
For comparison, here's what a successful run looks like.

@dblock (Member) commented Oct 7, 2022:

@nhtruong I compared successful and failed runs, and it's pretty clear that in both cases OpenSearch actually starts ([2022-10-06T22:08:58,320][INFO ][o.o.s.c.ConfigurationRepository] [5fb0eef93fa0] Node '5fb0eef93fa0' initialized). So I don't think it's to blame for the shutdown; something else is terminating it, and maybe it's just cleanup?

There's a 20-second timeout in https://github.com/opensearch-project/opensearch-js/blob/main/Makefile#L6. Can we replace that with an active wait that polls the OpenSearch cluster for startup, with a much longer timeout?

@threesquared same in your case: see whether you can actually hit that node in the active wait loop, and increase the timeout to minutes so we can see if the node ever starts responding.
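A rough sketch of such an active wait (the URL, interval, and ceiling are placeholders; add credentials or -k if the security plugin is enabled):

# Poll the cluster instead of sleeping for a fixed time: 60 tries x 5 s = 5 minutes.
for i in $(seq 1 60); do
  if curl -fsS http://localhost:9200 >/dev/null 2>&1; then
    echo "OpenSearch responded after roughly $((i * 5)) seconds"
    exit 0
  fi
  sleep 5
done
echo "OpenSearch did not respond within 5 minutes"
exit 1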

@nhtruong commented Oct 7, 2022:

@dblock There are two ways the runs could fail; one of them produces almost no logs, just:

Killing opensearch process 10
Killing performance analyzer process 11

Where did you see [2022-10-06T22:08:58,320][INFO ][o.o.s.c.ConfigurationRepository] [5fb0eef93fa0] Node '5fb0eef93fa0' in that run?

In the Makefile I used to recreate these failed runs, I actually used a 45-second sleep time. In both failed runs, whether the container was stopped or not, the docker logs show the "Killing ... process" messages. Are you saying that this service will restart automatically after some time if OpenSearch is not running?

@dblock (Member) commented Oct 7, 2022:

I was comparing the other two runs that had more logs. This one puzzles me.

My running theory right now is that we start the container, do something that fails (run a test), don't properly report it or fail to see it, then visibly shut down the container (e.g. the job is killed in GHA), which causes the container to kill the process because it has received a termination signal itself.

Make that sleep 2 minutes and log when it's done waiting. Do you see a kill signal before?

@nhtruong commented Oct 7, 2022:

The workflow I ran did not do any tests. It marked a run as failed if the container was unhealthy, and collected a bunch of logs. So the only interactions OpenSearch had were from the curl command that ran every 5 seconds as a health check, and the first health check happened 20 seconds after the container had booted up.

I don't think anything from outside the container sent the kill signal to OpenSearch; the container shows OOMKilled: false when you inspect it after a crash. The trap was likely triggered by the EXIT or CHLD signal.
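For reference, that flag can be checked with docker inspect ("opensearch-node" is a placeholder container name):

# Shows whether the kernel OOM killer terminated the container
docker inspect --format '{{.State.OOMKilled}}' opensearch-node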

@dblock (Member) commented Oct 11, 2022:

See opensearch-project/opensearch-js#304 for a retry implementation. I propose we leave this open until we can get to the bottom of the issue; maybe someone will be more successful.

@prudhvigodithi (Collaborator) commented:

Hey, the resolved issue #1185 should fix these intermittent startup issues. Keeping the memory setting/adjustment aside, we can now disable the performance analyzer (https://github.com/opensearch-project/opensearch-build/tree/main/docker/release#disable-performance-analyzer-agent-cli-and-related-configurations). This will be out in 2.4.0; should we backport it to 1.3.x? @gaiksaya @dblock @bbarani @peterzhuamazon WDYT?
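For illustration, assuming the environment variable documented at that link is DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI_AND_CONFIG (please verify the exact name against the linked README), a single-node run with the agent disabled would look roughly like:

# Sketch only; the variable name is inferred from the linked README section and should be double-checked there.
docker run -d -p 9200:9200 -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI_AND_CONFIG=true" \
  opensearchproject/opensearch:2.4.0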
Thank you

@peterzhuamazon (Member) commented:

Hi @prudhvigodithi, this change is already in 1.3.x, as we released it in 1.3.7.

Thanks.

@prudhvigodithi (Collaborator) commented:

Thanks @peterzhuamazon. Since 1.3.7 is already out, we should close this issue. @threesquared, can you please test with 1.3.7 and add your thoughts here?
Thank you

@bbarani (Member) commented Jan 30, 2023:

Closing this issue since we have released version 1.3.7 and haven't noticed this behavior since. Feel free to re-open if required.

@bbarani closed this as completed on Jan 30, 2023