
Poor performance when switching to multiple CPU Cores #10793

Open
Gal-Lahat opened this issue Aug 18, 2024 · 7 comments
Labels
type: bug Something isn't working

Comments

@Gal-Lahat

Gal-Lahat commented Aug 18, 2024

Description

I’m experiencing significant performance degradation when using gVisor with more than one CPU core. When running a container with a single CPU, everything works as expected. However, when I allocate two or more CPU cores, the container becomes extremely slow and idles at around 20% CPU usage, rendering it nearly unusable.

This issue occurs consistently across all containers I’ve tested, including completely empty ones. Interestingly, even when a container runs only a single thread, allocating multiple cores (e.g., 4) puts high load on all of them, contributing to the overall slowdown.

Steps to reproduce

1.	Run any container with GVisor using a single CPU.
•	Expected Result: The container performs normally with low CPU usage.
2.	Run the same container with two or more CPU cores.
•	Actual Result: The container becomes very slow, with high idle CPU usage (~20%) and poor performance.
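The comparison above can be sketched as two runs of the same idle image (an assumption about this setup: runsc is registered with Docker under the name `runsc`, and `alpine` stands in for "any container"):

```shell
# Run the same idle workload with 1 CPU vs 4 CPUs under runsc,
# then compare their CPU% side by side.
docker run --rm -d --runtime=runsc --cpus=1 --name gvisor-1cpu alpine sleep 300
docker run --rm -d --runtime=runsc --cpus=4 --name gvisor-4cpu alpine sleep 300

# One-shot snapshot of CPU usage for both containers.
docker stats --no-stream gvisor-1cpu gvisor-4cpu

# Clean up.
docker rm -f gvisor-1cpu gvisor-4cpu
```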

runsc version

runsc version release-20240807.0
spec: 1.1.0-rc.1

docker version (if using docker)

Server: Docker Engine - Community
 Engine:
  Version:          27.1.2
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.13
  Git commit:       f9522e5
  Built:            Mon Aug 12 11:51:03 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.20
  GitCommit:        8fc6bcff51318944179630522a095cc9dbf9f353
 runsc:
  Version:          release-20240807.0
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

@Gal-Lahat Gal-Lahat added the type: bug Something isn't working label Aug 18, 2024
@Gal-Lahat
Author

Gal-Lahat commented Aug 18, 2024

Here are some high-frequency syscalls observed on the host (running an idle Express.js app with 0 requests, at about 30% CPU under runsc):

•	sys_enter_write: ~6 million calls, suggesting a lot of data-writing activity.
•	sys_enter_futex: ~30,000 calls, indicating heavy use of synchronization primitives such as mutexes.
•	sys_enter_nanosleep: ~4,000 calls, implying the program frequently sleeps for short periods.
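For reference, counts like the ones above can be gathered on the host with perf tracepoints (a sketch; requires root and the perf tool, and the 10-second window is arbitrary):

```shell
# Count entries into the three syscalls system-wide for 10 seconds.
# The tracepoint names match the sys_enter_* events listed above.
sudo perf stat -a \
  -e syscalls:sys_enter_write \
  -e syscalls:sys_enter_futex \
  -e syscalls:sys_enter_nanosleep \
  -- sleep 10
```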

@ayushr2
Collaborator

ayushr2 commented Aug 18, 2024

Could you share a reproducer workload? (Like a Dockerfile or something) And what environment are you using? What CPU? What Linux version? What runsc platform (if you are not explicitly setting --platform flag, then you must be using systrap platform)?

@Gal-Lahat
Author

Gal-Lahat commented Aug 18, 2024

Most of my tests are in a Docker Compose environment running a service built from a Dockerfile. The runsc runtime is set as the default on the Docker daemon, and all of this runs on a VPS hosted at Contabo. I run docker-compose up --build as a simple way to execute it.

I’m using Ubuntu 20.04. Here’s a summary of the CPU information from the VPS:

•	CPU MHz: 2496.248
•	Hypervisor vendor: KVM
•	Virtualization type: full
•	Caches: L1 - 256 KiB, L2 - 2 MiB, L3 - 16 MiB

I haven’t explicitly set the --platform flag, so I assume it’s using the systrap platform. Below is an example of one of the services defined in my Docker Compose setup:

2pls5ib68:
    build:
      context: ./2pls5ib68
      dockerfile: Dockerfile
    ports:
      - 3011:80
    networks:
      - 2pls5ib68
    restart: always
    logging:
      driver: local
      options:
        max-size: 2m
        max-file: "3"
    deploy:
      resources:
        limits:
          memory: 6G
    volumes:
      - /loop-devices-mount/2pls5ib68:/app

If you need anything else, like more configuration details or further clarification, please let me know.

@Gal-Lahat
Author

I experimented with the legacy ptrace platform, and it resolved some of the performance issues I was seeing. Specifically, idle CPU usage dropped significantly, from around 30% to 0.5%. That is a noticeable improvement, although overall performance still isn’t optimal. I still need to run more tests under heavy CPU workloads to determine whether the improvement is limited to idle performance or applies across the board. Based on this, there may be an issue with the new systrap platform that needs further investigation.
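For anyone wanting to try the same switch, one way to do it (an assumption about this setup; the runsc path may differ) is to register a second runtime entry in the Docker daemon config that passes --platform=ptrace to runsc:

```shell
# Add a "runsc-ptrace" runtime alongside the default runsc entry.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "runtimes": {
    "runsc": { "path": "/usr/local/bin/runsc" },
    "runsc-ptrace": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--platform=ptrace"]
    }
  }
}
EOF
sudo systemctl restart docker

# Then select the ptrace-backed runtime per container:
docker run --rm --runtime=runsc-ptrace alpine uname -a
```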

@ayushr2
Collaborator

ayushr2 commented Aug 18, 2024

@avagin @konstantin-s-bogom for systrap. Yeah if switching to ptrace improves things, then likely an issue with systrap.

What is the application doing though? Like what CPU-intensive workload are you using? So we can reproduce.

@EtiennePerot
Contributor

EtiennePerot commented Aug 18, 2024

+1 to a reproducer workload; multi-core applications use multiple cores in different ways and Systrap tries to do some heuristics to work well with most of them. So being able to reproduce what this specific application is doing is necessary in order to understand this problem.

I'd also note that Contabo is notorious for highly oversubscribing its machines, and having unreliable and inconsistent performance over time. I've experienced this first-hand; with disk I/O bandwidth I'd get 10x performance difference on some days vs others. You can look up reviews for Contabo online and that's usually the first thing they'll mention. The other thing they'll mention is the low price, not coincidentally.

So I suggest reproducing this on your local machine or on some other dedicated hardware. I'm not putting the blame on Contabo; it's quite likely that there is something suboptimal about the way Systrap uses multiple cores for this particular workload, as it has had this type of problem in the past (see issue #9119). All I'm saying is that Contabo is not a reliable environment to get performance measurements from.

@EtiennePerot
Contributor

EtiennePerot commented Aug 19, 2024

Another thing you may want to try is to build runsc after changing the following line:

neverEnableFastPath = min(runtime.NumCPU(), runtime.GOMAXPROCS(0)) == 1

to:

neverEnableFastPath = true

From the way the variable is named, this sounds like it would hurt performance, and in most cases it should. Setting this to true removes the "fast path" feature of Systrap, which involves using spare CPU cores to achieve faster syscall handling performance. But in the case of a very busy system, which I think may be the case here, it might hurt more than it helps. So try to see what happens when you disable fast path (by setting neverEnableFastPath to true).
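Trying this out means rebuilding runsc from source; a rough sketch (the repository layout and install path are assumptions, and the edit itself is done by hand):

```shell
# Fetch the gVisor source tree.
git clone https://github.com/google/gvisor && cd gvisor

# ... manually set neverEnableFastPath = true in the systrap package ...

# Build runsc and copy the binary out of the build sandbox.
mkdir -p bin
make copy TARGETS=runsc DESTINATION=bin/

# Install the patched binary where Docker's runtime entry points, then
# restart Docker so new containers pick it up.
sudo cp bin/runsc /usr/local/bin/runsc
sudo systemctl restart docker
```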
