🐛 [bug] - PetBattleMongoDBDiskUsage alert is not reported #182

Open
rmarting opened this issue Jul 1, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@rmarting
Contributor

rmarting commented Jul 1, 2022

📝 Description

Following the instructions of the Creating Alerts exercise, the PetBattleMongoDBDiskUsage alert is not reported.

The status of the PVC after executing the command `` is:

image

But the alert is not reported.

Even when we force the volume to fill up completely, the alert is still not reported:

sh-4.2$ dd if=/dev/urandom of=/var/lib/mongodb/data/rando-calrissian bs=10M count=80
dd: error writing '/var/lib/mongodb/data/rando-calrissian': No space left on device
66+0 records in
65+0 records out
688480256 bytes (688 MB) copied, 6.29632 s, 109 MB/s 

🚶 Steps to reproduce

Followed the instructions of this exercise.

🧙‍♀️ Suggested solution

... if applicable

rmarting added the bug label on Jul 1, 2022
@eformat
Member

eformat commented Jul 1, 2022

@rmarting - I tried this out in my cluster and it seems OK. I dropped the alert threshold to 40% to check.

metrics being reported ok

Screenshot from 2022-07-02 08-15-24

and the firing rule

Screenshot from 2022-07-02 08-30-52

Any chance you could debug a little further? See what may be going on in your cluster, starting with whether the metrics are being reported at all.
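As a rough sanity check when debugging, the condition a disk-usage alert like this evaluates can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the 1 GiB capacity in the usage note is made up, and the real rule is a PromQL expression in the chart's prometheusrule.yaml, not Python.

```python
def disk_usage_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Return volume usage as a percentage of its capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes * 100


def alert_should_fire(used_bytes: int, capacity_bytes: int,
                      threshold_percent: float) -> bool:
    """True when usage is strictly above the alert threshold."""
    return disk_usage_percent(used_bytes, capacity_bytes) > threshold_percent


if __name__ == "__main__":
    # 688480256 is the byte count from the dd output earlier in this thread;
    # the 1 GiB capacity is a hypothetical value for illustration only.
    used, capacity = 688480256, 1024 ** 3
    print(f"{disk_usage_percent(used, capacity):.1f}% used")
    print("fires at 40%:", alert_should_fire(used, capacity, 40))
```

If the used and capacity metrics for the PVC show values like these and the rule still does not fire, the rule itself is probably not the one deployed in the cluster.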

@eformat
Member

eformat commented Jul 1, 2022

also .. i have these rules in my project after running thru all exercises (the one above is part of "pb-api-alerts")

$ oc get prometheusrule -n ateam-test
NAME               AGE
blue-pet-battle    134d
green-pet-battle   134d
keycloak           38h
pb-api-alerts      134d
pet-battle         134d
pet-battle-b       134d

@eformat
Member

eformat commented Jul 1, 2022

and i logged in as a student user .. just to check i can still see alert OK (i was checking as cluster admin above). looks ok

Screenshot from 2022-07-02 08-55-31

@fc7

fc7 commented Jul 8, 2022

During the enablement in Frankfurt my teammates encountered exactly the same issue, which was also confirmed by @rmarting.

Screenshot from 2022-07-08 11-09-44

@rmarting
Contributor Author

rmarting commented Jul 11, 2022

Reproduced in a new cluster following these steps:

  1. Used teamsters to deploy technical exercises 1+2 for a new user and set up the environment. Created CRW and the environment variables, then moved on to the Alerting exercise.
  2. Added the new rules to the pet-battle-api Helm chart (prometheusrule.yaml).
  3. Bumped the Chart.yaml file to a new version (1.3.2), incremented from the version already deployed in Nexus as part of technical exercise 2.
  4. The pipeline updates the Helm chart 1.3.1 tgz file (overwriting it) instead of creating the new version. IMHO the pipeline is using the version from the pom.xml file (maven task) instead of the Helm chart version to create the new tgz file in Nexus.
  5. ArgoCD does not synchronize the new version; it keeps using the previous one (1.3.1) from tech-exercise/pet-battle/test/values.yaml.

Workaround: Changing the pom.xml file to the new version 1.3.2 fixed the issue. A new Helm Chart tgz file is uploaded, the pet-battle-api version in tech-exercise/pet-battle/test/values.yaml is updated, and everything is deployed in OpenShift. So the alert is shown successfully.

Could you double-check my findings? Maybe we need to extend the instructions so that the Helm chart version and the app version stay aligned and deploy successfully from ArgoCD, or maybe we need to review which file (pom.xml or Chart.yaml) the Tekton pipeline takes the version from.
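To make the mismatch described above easier to spot, here is a minimal sketch of a check that compares the version in pom.xml with the one in Chart.yaml. The function names are hypothetical and it is not part of the exercise or the pipeline; it assumes a single top-level `<version>` element in pom.xml and a top-level `version:` key in Chart.yaml.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def pom_version(pom_xml: str) -> str:
    """Return the project's own <version> from pom.xml content."""
    root = ET.fromstring(pom_xml)
    node = root.find("m:version", POM_NS)
    if node is None:
        # Fall back to a POM written without the Maven namespace.
        node = root.find("version")
    if node is None or not node.text:
        raise ValueError("no top-level <version> found in pom.xml")
    return node.text.strip()


def chart_version(chart_yaml: str) -> str:
    """Return the top-level `version:` value from Chart.yaml content."""
    # Anchored at line start, so `appVersion:` and indented keys are skipped.
    match = re.search(r"^version:\s*['\"]?([^'\"\s#]+)", chart_yaml, re.MULTILINE)
    if match is None:
        raise ValueError("no top-level version found in Chart.yaml")
    return match.group(1)


def versions_aligned(pom_xml: str, chart_yaml: str) -> bool:
    """True when pom.xml and Chart.yaml declare the same version."""
    return pom_version(pom_xml) == chart_version(chart_yaml)
```

With pom.xml at 1.3.1 and Chart.yaml bumped to 1.3.2, as in step 3, `versions_aligned` returns False, which is the situation that led to the 1.3.1 tgz being overwritten.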

@eformat
Member

eformat commented Jul 11, 2022

Ahh ! that makes sense @rmarting .. i see what is going on now.

OK, the section in 4.2.4 is wrong - i have fixed this now. PTAL at this commit:

6437d57

The history here is this:

  • at some time in the past we allowed users to update app and chart version separately (helm chart default behaviour)
  • we changed the pipeline in pet-battle-api to match pet-battle UX where app version was solely controlled by:
    VERSION - file for node.js
    pom.xml - file for java
  • this matched what "developers" would do .. i.e. control it from source and not worry too much about yaml files ! and let the pipeline deal with it
  • the commit for this was here:
    255556e
  • however it seems we did not go back and update all the right bits (monitoring) for this change.

I think there is still a question in my mind as to why argocd does not sync the new chart (same version) .. we may find that this is just the difference in argo between a sync, a refresh and a hard refresh. i.e. hitting the sync button may have solved this as the chart of the same version is timestamped in nexus .. so you will always get the updated chart as we push it there in the pipeline. need to test this and take a look at where it is getting "stale"

@rmarting
Contributor Author

Great @eformat !!! Everything makes sense now!

Reviewing the new content, I am wondering whether 1.3.1 is the best version to use, as technical exercise 2 already defines that version. If we are triggering a new version, then 1.3.2 fits better, or any other version different from the current one in the pom.xml file. If you update the content to that version, then LGTM to go ahead and close this issue.

On the other hand, why does ArgoCD not sync the new chart? It could be related to the different sync options. However, IMHO if we want to deploy a new chart for an application, it is better to use a new Helm chart version and not overwrite the existing one in Nexus. I don't like the idea of overwriting artifact versions in Nexus at all (except for SNAPSHOTs), as you can't control who has already downloaded them. As Helm charts don't have snapshot versions, the best approach is to trigger a new pipeline run with a new version and then deploy it from ArgoCD (my two cents).

@springdo
Contributor

If the version of the chart does not change, argocd has it cached; doing a refresh on argocd clears the cache, hence it updates the k8s resources after a refresh. This is why we always bump the version on main (even if it's just a minor). Perhaps an automated way to ensure this doesn't happen would be to append the git sha to the version (if helm supports that).
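A minimal sketch of the "append the git sha" idea, assuming the sha is attached as SemVer 2.0.0 build metadata (the part after a `+`). The function name is hypothetical, and whether Helm, Nexus and ArgoCD all treat such versions as distinct has not been verified in this thread; note that SemVer itself ignores build metadata when ordering versions.

```python
import re

# Accept abbreviated or full hexadecimal git commit ids.
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def version_with_sha(version: str, git_sha: str, length: int = 7) -> str:
    """Append a shortened git sha to a version as SemVer build metadata."""
    sha = git_sha.strip().lower()
    if not SHA_RE.match(sha):
        raise ValueError(f"not a git sha: {git_sha!r}")
    return f"{version}+{sha[:length]}"


if __name__ == "__main__":
    # 6437d57 is the commit id mentioned earlier in this thread.
    print(version_with_sha("1.3.2", "6437d57"))
```

Every commit then produces a different chart version string, so a cache keyed on the version would no longer serve a stale chart.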

@springdo
Contributor

@eformat - if you remember, we encountered this issue when writing the book. We were changing values files but not updating the version, and argocd was not seeing the change. I think the way around this that we implemented was changing the labels on the resources to contain a value from the values file, or something like that.


4 participants