
Blackbox fails to send error notification if failure reason field is over Discord embed limit #155

Open
jb3 opened this issue May 2, 2024 · 3 comments

jb3 (Contributor) commented May 2, 2024

Error report

Discord has a field size limit of 1,024 characters, not 2,000 as currently programmed. This means that blackbox silently fails to send a notification when a backup has failed.

# Combine and truncate total output to < 2000 characters, fields don't support more.
output = self.report.output[:2000]

Additionally, this truncation hides the actual error for the failing databases, since the output includes everything from the databases that backed up successfully. A future iteration might want to include only the output from the failing stage.
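A minimal sketch of the limit fix (a hypothetical helper, not blackbox's actual code): cap the field value at Discord's 1024-character limit and keep the tail of the output, where the error usually lives, rather than the head:

```python
# Discord embed field values are capped at 1024 characters, not 2000.
DISCORD_FIELD_LIMIT = 1024

def truncate_field_value(output: str, limit: int = DISCORD_FIELD_LIMIT) -> str:
    """Keep at most the last `limit` characters, since errors appear at the tail."""
    return output[-limit:]
```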

As an example of an invalid body, I have attached the current PyDis backup output, which is failing because of #154. As you can see, the truncation means that no PostgreSQL error is included, even though the actual PostgreSQL error is only a couple of short lines.

Webhook Payload
{
  "content": null,
  "embeds": [
    {
      "title": "Backup",
      "color": 13377568,
      "fields": [
        {
          "name": "**main_mongodb**",
          "inline": true,
          "value": ":white_check_mark:  singapore_s3\n:white_check_mark:  newark_s3\n:white_check_mark:  frankfurt_s3"
        },
        {
          "name": "**main_redis**",
          "inline": true,
          "value": ":white_check_mark:  singapore_s3\n:white_check_mark:  newark_s3\n:white_check_mark:  frankfurt_s3"
        },
        {
          "name": "**main_postgres**",
          "inline": true,
          "value": ":x:"
        },
        {
          "name": "Output",
          "value": "2024-05-02T16:17:58.887+0000\twriting admin.system.users to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.890+0000\tdone dumping admin.system.users (1 document)\n2024-05-02T16:17:58.890+0000\twriting admin.system.version to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.893+0000\tdone dumping admin.system.version (2 documents)\n2024-05-02T16:17:58.894+0000\twriting pydis_forms.forms to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.894+0000\twriting modmail_bot.notes to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.896+0000\twriting modmail_bot.logs to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.896+0000\twriting pydis_forms.responses to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.906+0000\tdone dumping modmail_bot.notes (20 documents)\n2024-05-02T16:17:58.906+0000\twriting pydis_forms.admins to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.910+0000\tdone dumping pydis_forms.forms (26 documents)\n2024-05-02T16:17:58.911+0000\twriting modmail_bot.plugins.PingManager to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.912+0000\tdone dumping pydis_forms.admins (2 documents)\n2024-05-02T16:17:58.913+0000\twriting modmail_bot.config to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.913+0000\tdone dumping modmail_bot.plugins.PingManager (2 documents)\n2024-05-02T16:17:58.914+0000\twriting modmail_bot.plugins.BanAppeals to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.922+0000\tdone dumping modmail_bot.config (1 document)\n2024-05-02T16:17:58.923+0000\twriting pydis_forms.discord_members to archive '/tmp/tmpot_19_0a/main_mongodb_*****_02_05_2024.archive'\n2024-05-02T16:17:58.927+0000\tdone dumping pydis_forms.discord_m"
        }
      ]
    }
  ],
  "username": "blackbox",
  "avatar_url": "https://raw.githubusercontent.com/lemonsaurus/blackbox/main/img/blackbox_avatar.png"
}
Actual PostgreSQL error
blackbox-1235626330217779380-4wz7x blackbox 16:18:26 pg_dumpall: error: aborting because of server version mismatch
blackbox-1235626330217779380-4wz7x blackbox pg_dumpall: detail: server version: 16.2; pg_dumpall version: 15.5 (Debian 15.5-1.pgdg120+1)

Suggested fix (by 🍋 lemonsaurus)

  • We need to capture only what we need. We should only capture the output from the particular backup(s) that failed, and only the tail of that log, let's say the last 10 lines at most.
  • If for whatever reason that output still exceeds 1024 characters (maybe a bunch of different backups all failed?), we can still truncate, but we ought to truncate to 1024 chars, not 2000.
  • Currently when we truncate, we keep the first 2000 characters of the output. That's stupid, because the tail of the log is usually more interesting than the head, so we should be keeping the last 1024 characters instead.
  • If there are multiple services failing, we can do something like:
    • Iterate through each service's output and sort them from smallest to biggest.
    • Divide 1024 by the number of services; this is our budget for truncating the smallest output.
    • If any characters remain after this truncation, add them back to the total remaining and divide again by the number of remaining services.
    • Example: We've got 3 services. Start with (1024 / 3) ≈ 341 as the budget. If the first output only consumes 51 characters, no truncation is needed (it is within our budget), and the remaining 290 characters can be used for the subsequent services. So now we do ((1024 - 51) / 2) to figure out our budget for the next service, and continue this until we're done.
    • This will always lead to an optimal situation where we truncate as little as possible, and only cut off the start of the longest logs. It also ensures that any services whose logs exceed our character budget are truncated equally, maximizing our chances of getting relevant output.
lemonsaurus (Owner) commented May 7, 2024

Yeah this part sucks. Maybe instead of just naively truncating, we could upload the full error message somewhere and attach a URL or something. It's just a bit sketchy if that log contains sensitive shit.

Also (I'm not sure this is possible), replying to the notification on Discord, turning it into a thread with the full log split into multiple messages, would be a fairly elegant solution.

Labeling this as help wanted, in case someone wants to implement some solution for this. ❤️

jb3 (Contributor, Author) commented May 7, 2024

The question is more why, on an error, the process logs from all subprocesses are attached.

If only the tool that experienced the error were included, and its logs were tailed to the last N lines, this limit wouldn't matter.

If pg_dumpall fails, it should not attach the logs of mongodump.

Most tools produce an error that fits within Discord's limit, but not when all subprocess logs are attached.
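A sketch of that idea, with hypothetical names (blackbox's actual report structure may differ): include only the failed databases' output, tailed to the last N lines each:

```python
def failed_output(outputs: dict[str, str], failed: list[str], max_lines: int = 10) -> str:
    """Join only the failed databases' logs, keeping the last `max_lines` lines of each."""
    sections = []
    for name in failed:
        # Tail each failing service's log; successful services are skipped entirely.
        tail = "\n".join(outputs.get(name, "").splitlines()[-max_lines:])
        sections.append(f"{name}:\n{tail}")
    return "\n\n".join(sections)
```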

@lemonsaurus
Copy link
Owner

lemonsaurus commented May 8, 2024

yeah jb, that's all true too. We should only be capturing the relevant errors in the output. But then there are really a few bugs here to fix:

  • We should capture less. We should only capture the output from the particular backup(s) that failed, and only the tail of that log, let's say the last 10 lines at most.
  • If for whatever reason that output still does exceed 1024 characters (maybe a bunch of different backups all failed?), we can still truncate, but we ought to truncate to 1024 chars, not 2000.
  • Currently when we truncate, we truncate to the first 2000 characters in the output. That's stupid, because the tail of the log is usually more interesting than the head of the log, so we should be truncating to the last 1024 characters in the output.
  • If there are multiple services failing, we can do something like
    • Iterate through each service's output and sort them from smallest to biggest output.
    • Divide 1024 by the number of services, and this will be our budget for truncating the smallest output.
    • If any characters remain after this truncation, add them back to the total remaining and divide again by the number of services.
    • Example: We've got 3 services. Start with (1024 / 3) ≈ 341 as the budget. If the first output only consumes 51 characters, no truncation is needed (it is within our budget), and the remaining 290 characters can be used for the subsequent services. So now we do ((1024 - 51) / 2) to figure out our budget for the next service, and continue this until we're done.
    • This will always lead to an optimal situation where we are truncating as little as possible, and only cutting off the start of the longest logs. It also ensures that any services with logs that exceed our character budget will be truncated equally, maximizing our chances of getting relevant output.
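The budget-splitting idea above can be sketched as follows (a hedged illustration, not blackbox code): walk the outputs smallest-first, give each one an even share of whatever budget remains, and let any unused share roll forward to the longer logs:

```python
def allocate_budgets(lengths: list[int], total: int = 1024) -> list[int]:
    """Split `total` characters across outputs, smallest first, rolling unused budget forward."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    budgets = [0] * len(lengths)
    remaining = total
    for position, i in enumerate(order):
        share = remaining // (len(order) - position)  # even split of what's left
        budgets[i] = min(lengths[i], share)           # short outputs return surplus to the pool
        remaining -= budgets[i]
    return budgets
```

Each service's output would then be tail-truncated to its `budgets[i]` characters. With three services of lengths 51, 900 and 2000, the budgets come out as 51, 486 and 487: only the two long logs lose their heads, and they lose them almost equally.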
