Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable grab download resume #5034

Closed
wants to merge 151 commits into from

Conversation

jnummelin
Copy link
Member

@jnummelin jnummelin commented Sep 24, 2024

Description

Fixes #4296

By default grab tries to resume the download if the file name determined from either the url or from content-type headers already exists. This makes things go side ways, if the existing file is smaller than the new one, the old content would still be there and only the "extra" new bytes would get written. I.e. the download would be "resumed". 🤦

So this change disables resuming altogether fro grab.

To be able to download and replace k0s without resuming, i.e. "truncate" the file in-between. We need to force the download to download it as k0s.tmp first. Only after that we can safely move the file as truncating a running binary is not allowed and will error with text file busy.

This is a minimal possible fix that we can easily backport. @twz123 is already working on bigger refactoring of autopilot download functionality that gets rid of grab. Grab seems to bring more (bad) surprises than real benefits. In the end, we just download files and we should pretty much always just replace them. No need for full library dependecy for that.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added
  • Existing tests cover this

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

twz123 and others added 30 commits May 13, 2024 18:53
The default port is 8080. This conflicts with kube-router, which
since 2.x fails if it cannot bind to its metrics port. This leads to a
situation where the default configuration for a single node k0s cluster
is broken, as kube-router will go into a crash loop. Kube-router 1.x
will not be able to provide metrics, but will simply log the error and
move on. The integration tests did not cover this case properly, as
the kube-router manifests did not specify a readiness probe that the
inttests need to detect problems with daemon sets.

Port 2380 was chosen because it is normally used by etcd. Kine also uses
port 2379 for the client API, so why not use 2380 as well?

Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit 179989e)
…elease-1.30

[Backport release-1.30] Change kine metrics port from 8080 to 2380
Use the JSON API and look for the push_time_seconds metric. Collect all
job labels that had successful pushes.

Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit 2de1b0f)
This is a variant of check-metricsscraper that executes the inttest on
a single-node cluster. This mainly means that, in addition to etcd, kine
scraping is also tested.

Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit 62beca8)
The worker config ConfigMaps have the Kubernetes version in their names.
This is to ensure a seamless upgrade path when upgrading minor k0s
versions. However, the RBAC resources and the stack name itself didn't
have the version in their names. This means that as soon as a controller
with the new minor k0s version takes over, it removes all previous
worker ConfigMaps from the cluster, effectively defeating the purpose of
the versioned names.

Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit 90e952b)
…elease-1.30

[Backport release-1.30] Use a versioned worker config stack
…elease-1.30

[Backport release-1.30] Introduce `check-metricsscraper-singlenode`
Currently the RestartSec is 120, which mean that systemd waits 2mins before it restarts k0s.
This makes updates very slow with autopilot, i.e. with 3 controllers updating the controllers will take 6mins.

Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit 225de58)
…elease-1.30

[Backport release-1.30] Lessen the systemd RestartSec to 10 secs
The watch for CRD getting ready used wrong error when checking if it is retriable. This results in bogus logging and a "busyloop":
```
time="2024-05-16 10:05:45" level=info msg="Transient error while watching etcdmember CRD, last observed version is \"\", starting over after 0s ..." component=etcdMemberReconciler error="<nil>"
time="2024-05-16 10:05:45" level=info msg="Transient error while watching etcdmember CRD, last observed version is \"\", starting over after 0s ..." component=etcdMemberReconciler error="<nil>"
time="2024-05-16 10:05:45" level=info msg="Transient error while watching etcdmember CRD, last observed version is \"\", starting over after 0s ..." component=etcdMemberReconciler error="<nil>"
...
```

It actually hides the transient errors encountered.

Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit d67b442)
With this change the first controller on new version can apply the needed versioned resources as it will always be guaranteed to become the leader.

Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit e5a271b)
…elease-1.30

[Backport release-1.30] Bump Kubernetes to v1.30.1
…elease-1.30

[Backport release-1.30] Fix error handling in EtcdMemberReconciler
This allows for better feedback of kube-router health via the DaemonSet
resource. Without those, it's possible to observe a "healthy" DaemonSet,
even if it's not. This affects e.g. rolling updates, and, most notably
k0s's own integration tests.

Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit 65fcf39)
Otherwise the defaulting could pick up `kube-bridge` address as api/etcd address when restarting k0s if it sees it before any "real" address.

This happens on e.g. bootloose containers where kube-bridge is at index 11 and eth0 on index 18.

Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit dd2fdca)
…elease-1.30

[Backport release-1.30] Skip `kube-bridge` interface for api/etcd address
…elease-1.30

[Backport release-1.30] Use dedicated leasepool for worker config component
…elease-1.30

[Backport release-1.30] Add readinessProbe/minReadySeconds to kube-router
…elease-1.30

[Backport release-1.30] Bump containerd to v1.7.17
So that they are ignored by the Go tooling. Otherwise, when building in
parallel, Go may try to treat these temporary directories as Go package
folders and fail utterly.

https://pkg.go.dev/cmd/go#hdr-Package_lists_and_patterns

> Directory and file names that begin with "." or "_" are ignored by the
> go tool, as are directories named "testdata".

Fixes: d68dc34 ("Improve Docker export in embedded-binaries Makefile")
Signed-off-by: Tom Wieczorek <[email protected]>
(cherry picked from commit db51280)
Signed-off-by: Alexey Makhov <[email protected]>
(cherry picked from commit 5ead418)
…elease-1.30

[Backport release-1.30] Adding CLI args reference to the docs
We may run k0s on arm/v8 but only have images for arm/v7. Use Only()
instead of OnlyStrinct() to also match arm/v7 on arm/v8.

Fixes commit 9dd47b4 (k0sproject#1939 Fix airgap arm64 build. Strictly use
target platform only)

ref: https://pkg.go.dev/github.com/containerd/containerd/platforms#Only

Signed-off-by: Natanael Copa <[email protected]>
(cherry picked from commit 87d6609)
Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit 29c7544)
Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit c2fb3e0)
Co-authored-by: Tom Wieczorek <[email protected]>
Signed-off-by: Jussi Nummelin <[email protected]>
(cherry picked from commit e53d0fc)
twz123 and others added 9 commits September 17, 2024 13:20
…se-1.30

[Backport release-1.30] Bump runc to 1.1.14
…se-1.30

[Backport release-1.30] Bump etcd to 3.5.16
[release-1.30] Bump kubernetes to 1.30.5
…se-1.30

[Backport release-1.30] Bump metrics-server to v0.7.2
…-1.7.22

[Backport release-1.30] Bump containerd to 1.7.22
…te description

Signed-off-by: SebPlv <[email protected]>
(cherry picked from commit c334f22)
…elease-1.30

[Backport release-1.30] [doc] Remove non-existant 'targets' field in Autopilot airgapupdate description
@jnummelin jnummelin requested a review from a team as a code owner September 24, 2024 19:25
@jnummelin jnummelin added backport/release-1.28 PR that needs to be backported/cherrypicked to release-1.28 branch backport/release-1.29 PR that needs to be backported/cherrypicked to the release-1.29 branch labels Sep 24, 2024
This is to mitigate cases like k0sproject#4296

By default grab tries to resume the download if the file name determined from either the url or from content-type headers already exists. This makes things go side ways, if the existing file is smaller than the new one, the old content would still be there and only the "extra" new bytes would get written. I.e. the download would be "resumed". 🤦

So this change disables resuming altogether fro grab.

To be able to download and replace k0s without resuming, i.e. "truncate" the file in-between. We need to force the download to download it as k0s.tmp first. Only after that we can safely move the file as truncating a running binary is not allowed and will error with `text file busy`.

This is a minimal possible fix that we can easily backport. @twz123 is already working on bigger refactoring of autopilot download functionality that gets rid of grab. Grab seems to bring more (bad) surprises than real benefits. In the end, we just download files and we should pretty much always just replace them. No need for full library dependecy for that.

Signed-off-by: Jussi Nummelin <[email protected]>
@twz123 twz123 changed the title [Release 1.30] Disable grab download resume [release-1.30] Disable grab download resume Sep 25, 2024
@twz123
Copy link
Member

twz123 commented Sep 26, 2024

Can we retarget this to main and then do a backport? I think there are some changes in here that I'd like to rebase #5020 on.

@jnummelin jnummelin changed the base branch from release-1.30 to main September 26, 2024 09:46
@jnummelin jnummelin changed the title [release-1.30] Disable grab download resume Disable grab download resume Sep 26, 2024
@twz123 twz123 added bug Something isn't working component/autopilot backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch labels Sep 26, 2024
Copy link
Contributor

This pull request has merge conflicts that need to be resolved.

@jnummelin
Copy link
Member Author

See #5042, targeting main and then we'll backport it

@jnummelin jnummelin closed this Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/release-1.28 PR that needs to be backported/cherrypicked to release-1.28 branch backport/release-1.29 PR that needs to be backported/cherrypicked to the release-1.29 branch backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch bug Something isn't working component/autopilot merge-conflict
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues running repeated updates with autopilot
8 participants