O+M 2022-08-25 #4423

Closed
10 tasks
hkdctol opened this issue Aug 18, 2023 · 3 comments

Comments

@hkdctol
Contributor

hkdctol commented Aug 18, 2023

Day-to-day operation of Data.gov involves many Operation and Maintenance (O&M) responsibilities. Rather than having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Weekly Checklist

@hkdctol hkdctol moved this to 📟 Sprint Backlog [7] in data.gov team board Aug 18, 2023
@btylerburton btylerburton moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Aug 21, 2023
@btylerburton
Contributor

8.22.23

  • Lots of harvest failures due to validation errors in sources. No anomalies.

@nickumia-reisys
Contributor

nickumia-reisys commented Aug 24, 2023

O&M Retro started to document a way of structuring the O&M role a bit more.

@btylerburton
Contributor

  • Tracking update fails, but this is expected
  • DMARC has numerous 116 fails in the last 5 days, but there is no documentation on how to triage this
  • DB-SOLR-SYNC has identified 5k+ objects that have to be removed manually. That process is in action, and troubleshooting steps have been added to the new O&M troubleshooting section.
  • Ran duplicate check and it turned up several with shared identifiers.
    • Ran check org duplicates to identify the orgs with dupes
    • Ran De-dupe on each org with duplicates
    • De-dupe script continued to run for ca-gov
  • Catalog Solr instances have MANY logging errors, but this is expected drive-by hacking traffic per Fuhu
  • Added troubleshooting section to O&M doc: https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#troubleshooting
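
For anyone picking up the rotation, the duplicate check noted above (find datasets that share an identifier, then list the orgs that need a de-dupe run) can be sketched roughly as follows. This is only an illustration of the idea, not the actual scripts the team runs: the record fields (`id`, `org`, `identifier`), both function names, and the sample records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical record shape: each dataset is a dict with "id", "org",
# and "identifier" keys. The real catalog records may look different.

def find_duplicates(datasets):
    """Group dataset ids by (org, identifier); keep groups with more than one id."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[(ds["org"], ds["identifier"])].append(ds["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

def orgs_with_duplicates(duplicates):
    """Return the sorted list of orgs that have at least one duplicate group."""
    return sorted({org for org, _identifier in duplicates})

# Made-up sample records, for illustration only.
sample = [
    {"id": "a1", "org": "org-a", "identifier": "x-001"},
    {"id": "a2", "org": "org-a", "identifier": "x-001"},
    {"id": "b1", "org": "org-b", "identifier": "y-002"},
]

dupes = find_duplicates(sample)
print(orgs_with_duplicates(dupes))
```

Each org in the resulting list would then get its own de-dupe run, which matches the per-org steps listed above.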

@hkdctol hkdctol moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Aug 28, 2023