
Blackbox fails to send error notification if failure reason field is over Discord embed limit #155

Open
jb3 opened this issue May 2, 2024 · 3 comments

jb3 (Contributor) commented May 2, 2024

Error report

Discord has a field size limit of 1,024 characters, not 2,000 as currently programmed. This means that blackbox silently fails to send a notification when a backup has failed.

# Combine and truncate total output to < 2000 characters, fields don't support more.
output = self.report.output[:2000]

Additionally, this truncation hides the actual error for the failing databases, since the output includes everything from the databases that backed up successfully. A future iteration might want to include only the output from the failing stage.
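A minimal sketch of the limit fix (a hypothetical helper, not blackbox's actual code): cap the field value at Discord's 1024-character limit and keep the tail of the output, where the error usually lives, rather than the head:

```python
# Discord embed field values are capped at 1024 characters, not 2000.
DISCORD_FIELD_LIMIT = 1024

def truncate_field_value(output: str, limit: int = DISCORD_FIELD_LIMIT) -> str:
    """Keep at most the last `limit` characters, since errors appear at the tail."""
    return output[-limit:]
```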

As an example of an invalid body, I have attached the current PyDis backup output, which is failing because of #154. As you can see, the truncation means that no PostgreSQL error is included, even though the actual PostgreSQL error is only a couple of short lines.

Webhook Payload
{
  "content": null,
  "embeds": [
    {
      "title": "Backup",
      "color": 13377568,
      "fields": [
        {
          "name": "**main_mongodb**",
          "inline": true,
          "value": ":white_check_mark:  singapore_s3\n:white_check_mark:  newark_s3\n:white_check_mark:  frankfurt_s3"
        },
        {
          "name": "**main_redis**",
          "inline": true,
          "value": ":white_check_mark:  singapore_s3\n:white_check_mark:  newark_s3\n:white_check_mark:  frankfurt_s3"
        },
        {
          "name": "**main_postgres**",
          "inline": true,
          "value": ":x:"
        },
        {
          "name": "Output",
          "value": "2024-05-02T16:17:58.887+0000\twriting admin.system.users to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.890+0000\tdone dumping admin.system.users (1 document)\n2024-05-02T16:17:58.890+0000\twriting admin.system.version to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.893+0000\tdone dumping admin.system.version (2 documents)\n2024-05-02T16:17:58.894+0000\twriting pydis_forms.forms to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.894+0000\twriting modmail_bot.notes to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.896+0000\twriting modmail_bot.logs to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.896+0000\twriting pydis_forms.responses to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.906+0000\tdone dumping modmail_bot.notes (20 documents)\n2024-05-02T16:17:58.906+0000\twriting pydis_forms.admins to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.910+0000\tdone dumping pydis_forms.forms (26 documents)\n2024-05-02T16:17:58.911+0000\twriting modmail_bot.plugins.PingManager to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.912+0000\tdone dumping pydis_forms.admins (2 documents)\n2024-05-02T16:17:58.913+0000\twriting modmail_bot.config to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.913+0000\tdone dumping modmail_bot.plugins.PingManager (2 documents)\n2024-05-02T16:17:58.914+0000\twriting modmail_bot.plugins.BanAppeals to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.922+0000\tdone dumping modmail_bot.config (1 document)\n2024-05-02T16:17:58.923+0000\twriting pydis_forms.discord_members to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.927+0000\tdone dumping pydis_forms.discord_m"
        }
      ]
    }
  ],
  "username": "blackbox",
  "avatar_url": "https://raw.githubusercontent.com/lemonsaurus/blackbox/main/img/blackbox_avatar.png"
}
Actual PostgreSQL error
blackbox-1235626330217779380-4wz7x blackbox 16:18:26 pg_dumpall: error: aborting because of server version mismatch
blackbox-1235626330217779380-4wz7x blackbox pg_dumpall: detail: server version: 16.2; pg_dumpall version: 15.5 (Debian 15.5-1.pgdg120+1)

Suggested fix (by 🍋 lemonsaurus)

  • We need to capture only what we need. We should only capture the output from the particular backup(s) that failed, and only the tail of that log, let's say the last 10 lines at most.
  • If for whatever reason that output still exceeds 1024 characters (maybe a bunch of different backups all failed?), we can still truncate, but we ought to truncate to 1024 chars, not 2000.
  • Currently when we truncate, we keep the first 2000 characters of the output. That's stupid, because the tail of the log is usually more interesting than the head, so we should be keeping the last 1024 characters instead.
  • If there are multiple services failing, we can do something like:
    • Iterate through each service's output and sort them from smallest to biggest.
    • Divide 1024 by the number of services; this is our budget for truncating the smallest output.
    • If any characters remain after this truncation, add them back to the total remaining and divide again by the number of remaining services.
    • Example: We've got 3 services. Start with (1024 / 3) ≈ 341 as the budget. If the first output only consumes 51 characters, no truncation is needed (it is within our budget), and the remaining 290 characters can be used for the subsequent services. So now we do ((1024 - 51) / 2) to figure out our budget for the next service, and continue this until we're done.
    • This will always lead to an optimal situation where we truncate as little as possible, and only cut off the start of the longest logs. It also ensures that any services whose logs exceed our character budget are truncated equally, maximizing our chances of getting relevant output.
lemonsaurus (Owner) commented May 7, 2024

Yeah this part sucks. Maybe instead of just naively truncating, we could upload the full error message somewhere and attach a URL or something. It's just a bit sketchy if that log contains sensitive shit.

Also (I'm not sure this is possible), replying to the notification on Discord, turning it into a thread with the full log split into multiple messages, would be a fairly elegant solution.

Labeling this as help wanted, in case someone wants to implement some solution for this. ❤️

jb3 (Contributor, Author) commented May 7, 2024

The question is more why, on an error, the process logs from all subprocesses are attached.

If only the tool that experienced the error were included, and its logs were tailed to the last N lines, this limit wouldn't matter.

If pg_dumpall fails, it should not attach the logs of mongodump.

Most tools produce an error that fits within Discord's limit, but not when all subprocess logs are attached.
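A sketch of that idea, with hypothetical names (blackbox's actual report structure may differ): include only the failed databases' output, tailed to the last N lines each:

```python
def failed_output(outputs: dict[str, str], failed: list[str], max_lines: int = 10) -> str:
    """Join only the failed databases' logs, keeping the last `max_lines` lines of each."""
    sections = []
    for name in failed:
        # Tail each failing service's log; successful services are skipped entirely.
        tail = "\n".join(outputs.get(name, "").splitlines()[-max_lines:])
        sections.append(f"{name}:\n{tail}")
    return "\n\n".join(sections)
```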

@lemonsaurus
Copy link
Owner

lemonsaurus commented May 8, 2024

yeah jb, that's all true too. We should only be capturing the relevant errors in the output. But then there are really a few bugs here to fix:

  • We should capture less. We should only capture the output from the particular backup(s) that failed, and only the tail of that log, let's say the last 10 lines at most.
  • If for whatever reason that output still does exceed 1024 characters (maybe a bunch of different backups all failed?), we can still truncate, but we ought to truncate to 1024 chars, not 2000.
  • Currently when we truncate, we truncate to the first 2000 characters in the output. That's stupid, because the tail of the log is usually more interesting than the head of the log, so we should be truncating to the last 1024 characters in the output.
  • If there are multiple services failing, we can do something like
    • Iterate through each service's output and sort them from smallest to biggest output.
    • Divide 1024 by the number of services, and this will be our budget for truncating the smallest output.
    • If any characters remain after this truncation, add them back to the total remaining and divide again by the number of services.
    • Example: We've got 3 services. Start with (1024 / 3) ≈ 341 as the budget. If the first output only consumes 51 characters, no truncation is needed (it is within our budget), and the remaining 290 characters can be used for the subsequent services. So now we do ((1024 - 51) / 2) to figure out our budget for the next service, and continue this until we're done.
    • This will always lead to an optimal situation where we are truncating as little as possible, and only cutting off the start of the longest logs. It also ensures that any services with logs that exceed our character budget will be truncated equally, maximizing our chances of getting relevant output.
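The budget-splitting idea above can be sketched as follows (a hedged illustration, not blackbox code): walk the outputs smallest-first, give each one an even share of whatever budget remains, and let any unused share roll forward to the longer logs:

```python
def allocate_budgets(lengths: list[int], total: int = 1024) -> list[int]:
    """Split `total` characters across outputs, smallest first, rolling unused budget forward."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    budgets = [0] * len(lengths)
    remaining = total
    for position, i in enumerate(order):
        share = remaining // (len(order) - position)  # even split of what's left
        budgets[i] = min(lengths[i], share)           # short outputs return surplus to the pool
        remaining -= budgets[i]
    return budgets
```

Each service's output would then be tail-truncated to its `budgets[i]` characters. With three services of lengths 51, 900 and 2000, the budgets come out as 51, 486 and 487: only the two long logs lose their heads, and they lose them almost equally.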
