Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional bad log routing and malformed messages get to s3 output #834

Open
fizyk opened this issue Jun 13, 2024 · 0 comments
Open

Occasional bad log routing and malformed messages get to s3 output #834

fizyk opened this issue Jun 13, 2024 · 0 comments

Comments

@fizyk
Copy link

fizyk commented Jun 13, 2024

Describe the question/issue

Hi,

We have a system, that under stress experiences two issues with routing messages through a fluent bit.
We utilise fluent bit to pass logs to datadog, and two kinds of messages to s3. Problem is that under heavy load some messages get occasionally routed in wrong way - dimensions to events and events to dimensions, mixing the two (10-40k correct events and up to 400 of wrongly routed messages), and in the same files that have the wrongly routed messages there are lines with messages put in the same line instead of single messages per line.

Configuration

ECS Task

{
    "taskDefinitionArn": "arn:aws:ecs:<AWS_REGION>:<AWS-USER-ID>:task-definition/<ENVIRONMENT_NAME>-ecs-cluster-<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>:39",
    "containerDefinitions": [
        {
            "name": "<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
            "image": "<AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<SERVICE_ECR>:1.6.0",
            "cpu": 0,
            "memoryReservation": 696,
            "portMappings": [
                {
                    "containerPort": 8090,
                    "hostPort": 8090,
                    "protocol": "tcp"
                },
                {
                    "containerPort": 443,
                    "hostPort": 443,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "GOMAXPROCS",
                    "value": "1"
                },
                {
                    "name": "DD_METRICS_SAMPLING_RATE",
                    "value": "1.0"
                },
                {
                    "name": "AWS_MAX_IDLE_CONNECTIONS",
                    "value": "512"
                },
                {
                    "name": "AWS_DISABLE_SSL",
                    "value": "true"
                },
                {
                    "name": "LOG_LEVEL",
                    "value": "info"
                },
                {
                    "name": "DD_METRICS_PORT",
                    "value": "8125"
                },
                {
                    "name": "AWS_COGNITO_REFRESH_INTERVAL",
                    "value": "500s"
                },
                {
                    "name": "DD_SERVICE_RETRY_INTERVAL",
                    "value": "10s"
                },
                {
                    "name": "ENABLE_DNSMASQ",
                    "value": "true"
                },
                {
                    "name": "FLUENT_ADDRESS",
                    "value": "localhost:5170"
                },
                {
                    "name": "AWS_DEFAULT_REGION",
                    "value": "<AWS_REGION>"
                },
                {
                    "name": "DD_SERVICE_RETRY_ATTEMPTS",
                    "value": "30"
                },
                {
                    "name": "LOG_ENVIRONMENT",
                    "value": "<ENVIRONMENT_NAME>"
                },
                {
                    "name": "AWS_COGNITO_SCOPES",
                    "value": "<SCOPE>/<SERVICE>"
                },
                {
                    "name": "DD_AGENT_HOST",
                    "value": "localhost"
                },
                {
                    "name": "FLUENT_PARALLEL_FLUSH_LIMIT",
                    "value": "16"
                },
                {
                    "name": "FLUENT_FLUSH_INTERVAL",
                    "value": "10s"
                },
                {
                    "name": "DD_METRICS_ENABLED",
                    "value": "true"
                },
                {
                    "name": "FLUENT_BATCH_COUNT",
                    "value": "100"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "secrets": [ ... ],
            "dependsOn": [
                {
                    "containerName": "fluentbit",
                    "condition": "HEALTHY"
                },
                {
                    "containerName": "datadog",
                    "condition": "HEALTHY"
                },
                {
                    "containerName": "dnsmasq",
                    "condition": "HEALTHY"
                }
            ],
            "stopTimeout": 30,
            "user": "0",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
                    "awslogs-region": "<AWS_REGION>",
                    "awslogs-stream-prefix": "<ENVIRONMENT_NAME>-<SERVICE>"
                }
            },
            "systemControls": []
        },
        {
            "name": "datadog",
            "image": "<AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/publecr/datadog/agent:7",
            "cpu": 128,
            "memoryReservation": 100,
            "portMappings": [
                {
                    "containerPort": 8125,
                    "hostPort": 8125,
                    "protocol": "tcp"
                },
                {
                    "containerPort": 8126,
                    "hostPort": 8126,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "DD_APM_NON_LOCAL_TRAFFIC",
                    "value": "true"
                },
                {
                    "name": "DD_TRACE_SAMPLE_RATE",
                    "value": "1.0"
                },
                {
                    "name": "DD_SITE",
                    "value": "datadoghq.com"
                },
                {
                    "name": "DD_DOGSTATSD_TAG_CARDINALITY",
                    "value": "high"
                },
                {
                    "name": "DD_APM_ENABLED",
                    "value": "true"
                },
                {
                    "name": "DD_ENV",
                    "value": "<ENVIRONMENT_NAME>"
                },
                {
                    "name": "DD_CHECKS_TAG_CARDINALITY",
                    "value": "high"
                },
                {
                    "name": "DD_DOGSTATSD_NON_LOCAL_TRAFFIC",
                    "value": "true"
                },
                {
                    "name": "DD_LOG_LEVEL",
                    "value": "CRITICAL"
                },
                {
                    "name": "DD_CLOUD_PROVIDER_METADATA",
                    "value": "aws"
                }
            ],
            "mountPoints": [
                {
                    "sourceVolume": "docker_sock",
                    "containerPath": "/var/run/docker.sock"
                },
                {
                    "sourceVolume": "cgroup",
                    "containerPath": "/host/sys/fs/cgroup"
                },
                {
                    "sourceVolume": "proc",
                    "containerPath": "/host/proc"
                }
            ],
            "volumesFrom": [],
            "secrets": [ ... ],
            "user": "0",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
                    "awslogs-region": "<AWS_REGION>",
                    "awslogs-stream-prefix": "<ENVIRONMENT_NAME>-<SERVICE>"
                }
            },
            "healthCheck": {
                "command": [
                    "./probe.sh"
                ],
                "interval": 30,
                "timeout": 5,
                "retries": 3
            },
            "systemControls": []
        },
        {
            "name": "fluentbit",
            "image": "<AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<FLUENTBIT_ECR>:1.6.0",
            "cpu": 0,
            "memoryReservation": 128,
            "portMappings": [
                {
                    "containerPort": 5170,
                    "hostPort": 5170,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "EVENTS_OUTPUT_RETRY_LIMIT",
                    "value": "1"
                },
                {
                    "name": "AWS_REGION",
                    "value": "<AWS_REGION>"
                },
                {
                    "name": "DIMENSIONS_OUTPUT_RETRY_LIMIT",
                    "value": "1"
                },
                {
                    "name": "FLUENT_BUFFER_MAX_SIZE",
                    "value": "6M"
                },
                {
                    "name": "EVENTS_OUTPUT_WORKERS",
                    "value": "1"
                },
                {
                    "name": "DIMENSIONS_OUTPUT_TOTAL_FILE_SIZE",
                    "value": "25M"
                },
                {
                    "name": "DD_SOURCE",
                    "value": "<ENVIRONMENT_NAME>-<SERVICE>"
                },
                {
                    "name": "DD_TAGS",
                    "value": "<DATADOG-TAGS>"
                },
                {
                    "name": "EVENTS_OUTPUT_UPLOAD_TIMEOUT",
                    "value": "10s"
                },
                {
                    "name": "EVENTS_OUTPUT_TOTAL_FILE_SIZE",
                    "value": "25M"
                },
                {
                    "name": "AWS_S3_BUCKET",
                    "value": "<ENVIRONMENT_NAME>"
                },
                {
                    "name": "FLUENT_BIT_S3_EXPORTER",
                    "value": "true"
                },
                {
                    "name": "DD_SERVICE",
                    "value": "<ENVIRONMENT_NAME>-<SERVICE>"
                },
                {
                    "name": "SERVICE_HC_PERIOD",
                    "value": "5"
                },
                {
                    "name": "FLUENT_BUFFER_CHUNK_SIZE",
                    "value": "1M"
                },
                {
                    "name": "SERVICE_GRACE",
                    "value": "15"
                },
                {
                    "name": "DIMENSIONS_OUTPUT_UPLOAD_TIMEOUT",
                    "value": "10s"
                },
                {
                    "name": "DIMENSIONS_OUTPUT_WORKERS",
                    "value": "1"
                },
                {
                    "name": "SERVICE_HC_RETRY_FAILURE_COUNT",
                    "value": "5"
                },
                {
                    "name": "SERVICE_LOG_LEVEL",
                    "value": "info"
                },
                {
                    "name": "SERVICE_FLUSH",
                    "value": "10"
                },
                {
                    "name": "SERVICE_HC_ERRORS_COUNT",
                    "value": "5"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "secrets": [ ... ],
            "user": "0",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
                    "awslogs-region": "<AWS_REGION>",
                    "awslogs-stream-prefix": "<ENVIRONMENT_NAME>-<SERVICE>"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f http://127.0.0.1:2020/api/v1/health || exit 1"
                ],
                "interval": 10,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 30
            },
            "systemControls": [],
            "firelensConfiguration": {
                "type": "fluentbit",
                "options": {
                    "config-file-type": "file",
                    "config-file-value": "/fluent-bit/etc/fluent-bit.conf",
                    "enable-ecs-log-metadata": "true"
                }
            }
        },
        {
            "name": "dnsmasq",
            "image": "<AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<DNSMASQ_ECR>:0.4.0",
            "cpu": 128,
            "memoryReservation": 100,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "NET_ADMIN"
                    ],
                    "drop": []
                }
            },
            "stopTimeout": 30,
            "user": "0",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/aws/ecs/<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
                    "awslogs-region": "<AWS_REGION>",
                    "awslogs-stream-prefix": "<ENVIRONMENT_NAME>-<SERVICE>"
                }
            },
            "healthCheck": {
                "command": [
                    "/etc/healthcheck.sh"
                ],
                "interval": 30,
                "timeout": 5,
                "retries": 3
            },
            "systemControls": []
        }
    ],
    "family": "<ENVIRONMENT_NAME>-ecs-cluster-<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>",
    "taskRoleArn": "arn:aws:iam::<AWS-USER-ID>:role/<ENVIRONMENT_NAME>-<SERVICE>-ecs-task",
    "executionRoleArn": "arn:aws:iam::<AWS-USER-ID>:role/<ENVIRONMENT_NAME>-<SERVICE>-ecs-task-execution",
    "networkMode": "awsvpc",
    "revision": 39,
    "volumes": [
        {
            "name": "cgroup",
            "host": {
                "sourcePath": "/sys/fs/cgroup/"
            }
        },
        {
            "name": "proc",
            "host": {
                "sourcePath": "/proc/"
            }
        },
        {
            "name": "docker_sock",
            "host": {
                "sourcePath": "/var/run/docker.sock"
            }
        }
    ],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "ecs.capability.firelens.options.config.file"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21"
        },
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awsfirelens"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.container-health-check"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        },
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.24"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "ecs.capability.firelens.fluentbit"
        },
        {
            "name": "ecs.capability.secrets.asm.environment-variables"
        },
        {
            "name": "ecs.capability.container-ordering"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2"
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "512",
    "memory": "1024",
    "registeredAt": "2024-06-11T06:25:16.832Z",
    "registeredBy": "arn:aws:iam::<AWS-USER-ID>:user/<USER-NAME>",
    "tags": [
        {
            "key": "<PRODUCT-NAME>-version",
            "value": "1.6.0"
        },
        {
            "key": "owner",
            "value": "<SERVICE>"
        },
        {
            "key": "product",
            "value": "<PRODUCT-NAME>"
        },
        {
            "key": "SERVICE-NAME-ID",
            "value": "<ENVIRONMENT_NAME>-<SERVICE>"
        },
        {
            "key": "environment",
            "value": "<ENVIRONMENT_NAME>"
        },
        {
            "key": "consumer",
            "value": "<SERVICE>"
        },
        {
            "key": "managed-by",
            "value": "terraform"
        },
        {
            "key": "service-stage",
            "value": "<ENVIRONMENT_NAME>-<SERVICE-NAME-ID>"
        },
        {
            "key": "Name",
            "value": "<ENVIRONMENT_NAME>"
        },
        {
            "key": "nickname",
            "value": "<DEPLOYMENT-TIER>-<NUMBER>"
        },
        {
            "key": "DEPLOYMENT-TIER",
            "value": "<DEPLOYMENT-TIER>"
        },
        {
            "key": "environment",
            "value": "<ENVIRONMENT_NAME>"
        }
    ]
}

Fluentbit configuration

[SERVICE]
    HTTP_Server             On
    HTTP_Listen             0.0.0.0
    HTTP_PORT               2020
    Flush                   10
    Grace                   15
    Daemon                  off
    Log_Level              info
    Health_Check            On
    HC_Errors_Count         5
    HC_Retry_Failure_Count  5
    HC_Period               5

[INPUT]
    Name                    forward
    Listen                  0.0.0.0
    Port                    5170
    Buffer_Chunk_Size       1M
    Buffer_Max_Size         6M

[INPUT]
    Name                    forward
    unix_path               /var/run/fluent.sock
    Tag_Prefix              service_logs.

[OUTPUT]
    Name                    s3
    Match                   events.*
    bucket                  ${AWS_S3_BUCKET}
    region                  ${AWS_REGION}
    total_file_size         25M
    upload_timeout          10s
    use_put_object          On
    compression             gzip
    s3_key_format           /reports/$TAG[0]/$TAG[1]/%Y/%m/%d/%H/%M/%S/${ECS_TASK_ID}.json.gz
    json_date_key           false
    static_file_path        On
    retry_limit             1
    workers                 1
    tls                     Off
    tls.verify              Off

[OUTPUT]
    Name                    s3
    Match                   dimensions
    bucket                  ${AWS_S3_BUCKET}
    region                  ${AWS_REGION}
    total_file_size         25M
    upload_timeout          10s
    use_put_object          On
    compression             gzip
    s3_key_format           /reports/dimensions/%Y/%m/%d/%H/%M/%S/${ECS_TASK_ID}.json.gz
    json_date_key           false
    static_file_path        On
    retry_limit             1
    workers                 1
    tls                     Off
    tls.verify              Off

[OUTPUT]
    Name                    datadog
    Match                   service_logs.*
    Host                    http-intake.logs.datadoghq.com
    TLS                     on
    compress                gzip
    provider                ecs
    apikey                  ${DD_API_KEY}
    dd_service              ${DD_SERVICE}
    dd_source               ${DD_SOURCE}
    dd_tags                 ${DD_TAGS}
    dd_message_key          log

Fluent Bit Log Output

There are only logs about successful file upload

{

            "timestamp": 1718150801649,

            "message": "[2024/06/12 00:06:41] [ info] [output:s3:s3.0] Successfully uploaded object <object-name>.json.gz",

            "ingestionTime": 1718150803721

        },

Fluent Bit Version Info

amazon/aws-for-fluent-bit:2.31.11

Application is written in go, and is using github.com/IBM/fluent-forward-go v0.2.2 client to publish messages to fluent-bit

Cluster Details

ECS on EC2 with Sidecar deployment

Application Details

The application at the time of occurring has been sending 26k of smaller messages and 26k if bigger messages to the two instances of fluent bit within around 30 seconds burst.

Steps to reproduce issue

Related Issues

Not really sure, maybe this: #376

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant