
6fb7715 broke pull consumer - possibly because of out of order acks? [v2.10.17] #5720

Open
paolobarbolini opened this issue Jul 30, 2024 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@paolobarbolini

Observed behavior

After upgrading from v2.10.16, the pull consumer on one of our arm64 leafnodes stopped working. As new messages came into the stream, the NEXT endpoint kept returning 404 even though the pending acks limit had not been reached. This has been bisected to 6fb7715, since v2.10.18 and v2.10.19-RC.2 exhibit the same problem. To double-check, we took v2.10.19-RC.2, reverted the suspect commit, and the issue went away.

Expected behavior

The consumer continues delivering messages.

Server and client version

nats-server: v2.10.17
nats: v0.1.5

Host environment

Linux <redacted> 6.6.22-v8 #1 SMP PREEMPT Tue Mar 19 17:41:59 UTC 2024 aarch64 GNU/Linux

Steps to reproduce

The pull consumer is on a stream mirroring $MQTT_msgs, with the rest of the settings left at their defaults.

I'm not sure how to reproduce it. It doesn't seem to happen on our other machines. The only two things I can think of are:

  1. We're using the consumer to replicate a stream manually (without using mirrors) because we want a push-based approach. As acks come back from the remote side, we concurrently ack the corresponding message on our consumer, in whichever order they arrive (a Go sketch of this pattern follows below).
  2. The consumer is on a stream mirroring $MQTT_msgs

Strangely, if I delete the consumer and recreate it, all messages that are already in the stream get delivered, but the issue still occurs for new messages.
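
For concreteness, here is a minimal Go sketch of the pattern in item 1, written against the nats.go pull-consumer API. The stream and consumer names match the info dumps below, but the remote publish is reduced to a hypothetical replicateToRemote stub, and our actual replicator differs in the details.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// replicateToRemote is a hypothetical stand-in for publishing the message to
// the remote (cloud) stream over a separate connection and waiting for its ack.
func replicateToRemote(msg *nats.Msg) error {
	// ... publish msg.Data to the remote cluster here ...
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to the existing pull consumer on the local mirror stream.
	sub, err := js.PullSubscribe("", "readings-replicator", nats.Bind("readings", "readings-replicator"))
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(1000, nats.MaxWait(5*time.Second))
		if err != nil && err != nats.ErrTimeout {
			log.Fatal(err)
		}
		for _, msg := range msgs {
			msg := msg
			// Ack only after the remote publish succeeds, from a separate
			// goroutine, so acks can reach the server out of order.
			go func() {
				if err := replicateToRemote(msg); err != nil {
					return // leave unacked; redelivered after Ack Wait
				}
				_ = msg.Ack()
			}()
		}
	}
}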

@paolobarbolini added the defect (Suspected defect such as a bug or regression) label Jul 30, 2024
@neilalexander
Member

Can you please provide a stream info and a consumer info from after this problem has happened?

@wallyqs changed the title from "6fb7715 broke pull consumer - possibly because of out of order acks?" to "6fb7715 broke pull consumer - possibly because of out of order acks? [v2.10.16]" Sep 4, 2024
@wallyqs changed the title from "6fb7715 broke pull consumer - possibly because of out of order acks? [v2.10.16]" to "6fb7715 broke pull consumer - possibly because of out of order acks? [v2.10.17]" Sep 4, 2024
@wallyqs
Member

wallyqs commented Sep 5, 2024

@paolobarbolini were you using the same consumer name in the leafnode server and the remote leafnode?

@paolobarbolini
Author

Hey, sorry for the radio silence. Here's the info you've requested about the leafnode:

info about the current broken system

stream info

Information for Stream readings created 2024-08-21 12:58:52

           Description: [redacted]
              Replicas: 1
               Storage: File
           Compression: S2 Compression

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: 10 GiB
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Replication:

                Mirror: $MQTT_msgs

Cluster Information:

                  Name: [redacted]
                Leader: [redacted (same value as above)]

Mirror Information:

           Stream Name: $MQTT_msgs
                   Lag: 0
             Last Seen: 462ms

State:

              Messages: 404,729
                 Bytes: 320 MiB
        First Sequence: 1 @ 2024-08-21 12:58:59 UTC
         Last Sequence: 404,729 @ 2024-09-10 09:33:00 UTC
      Active Consumers: 1
    Number of Subjects: 26

consumer info

Information for Consumer readings > readings-replicator created 2024-09-10T09:31:39+02:00

Configuration:

                    Name: readings-replicator
             Description: Local readings to cloud readings replicator
               Pull Mode: true
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 1m0s
           Replay Policy: Instant
         Max Ack Pending: 10,000
       Max Waiting Pulls: 512

State:

  Last Delivered Message: Consumer sequence: 1,001 Stream sequence: 858,602 Last delivery: 2m10s ago
    Acknowledgment Floor: Consumer sequence: 1,001 Stream sequence: 858,585 Last Ack: 6.88s ago
        Outstanding Acks: 0 out of maximum 10,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 1 of maximum 512

Note on the 1,001: the consumer successfully polled 1 message and then 1,000 messages. All following polls yielded no messages.

server info

nats-server: v2.10.20

Built with Go 1.22.6

were you using the same consumer name in the leafnode server and the remote leafnode?

No, they have different names. The stream names are the same, though.


Testing with 6fb7715 reverted

build

git clone https://github.com/nats-io/nats-server.git -b release/v2.10.20
cd nats-server/
git revert 6fb7715a46b5f64dff144f0357f009ad3bf64473
GOOS=linux GOARCH=arm64 GO111MODULE=on CGO_ENABLED=0 go build -trimpath -ldflags="-w -X 'github.com/nats-io/nats-server/v2/server.gitCommit=$(git rev-parse --short HEAD)'" -o nats-server-arm64
# deploy and IMPORTANT: delete the consumer

stream info

Information for Stream readings created 2024-08-21 12:58:52

           Description: [redacted]
              Replicas: 1
               Storage: File
           Compression: S2 Compression

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: 10 GiB
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Replication:

                Mirror: $MQTT_msgs

Cluster Information:

                  Name: [redacted]
                Leader: [redacted (same value as above)]

Mirror Information:

           Stream Name: $MQTT_msgs
                   Lag: 0
             Last Seen: 619ms

State:

              Messages: 404,909
                 Bytes: 320 MiB
        First Sequence: 1 @ 2024-08-21 12:58:59 UTC
         Last Sequence: 404,909 @ 2024-09-10 09:47:18 UTC
      Active Consumers: 1
    Number of Subjects: 26

consumer info

(after a few minutes)

Information for Consumer readings > readings-replicator created 2024-09-10T09:44:49+02:00

Configuration:

                    Name: readings-replicator
             Description: Local readings to cloud readings replicator
               Pull Mode: true
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 1m0s
           Replay Policy: Instant
         Max Ack Pending: 10,000
       Max Waiting Pulls: 512

State:

  Last Delivered Message: Consumer sequence: 404,909 Stream sequence: 404,909 Last delivery: 4.96s ago
    Acknowledgment Floor: Consumer sequence: 404,909 Stream sequence: 404,909 Last Ack: 4.90s ago
        Outstanding Acks: 0 out of maximum 10,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 1 of maximum 512

@diorcety

diorcety commented Oct 3, 2024

I will try to provide a minimal test case, but I'm hitting the same issue. If my client sends the explicit acks in order there is no problem, but if my client schedules the acks asynchronously (which doesn't guarantee the order), or if I explicitly force a random acknowledgment order, I start experiencing acknowledgment issues: (Outstanding Acks: 1,000 out of maximum 1,000)
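
For reference, a minimal Go sketch (again using the nats.go pull-consumer API) of the "explicitly forcing a random acknowledgment order" case; the stream and consumer names (TEST, test-consumer) and the batch size are placeholders, not taken from the report above.

package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to an existing pull consumer; names here are placeholders.
	sub, err := js.PullSubscribe("", "test-consumer", nats.Bind("TEST", "test-consumer"))
	if err != nil {
		log.Fatal(err)
	}

	// Fetch a batch and ack it in a randomized order to mimic the
	// out-of-order acknowledgment described above.
	msgs, err := sub.Fetch(1000, nats.MaxWait(5*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	rand.Shuffle(len(msgs), func(i, j int) { msgs[i], msgs[j] = msgs[j], msgs[i] })
	for _, msg := range msgs {
		if err := msg.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}
	}
}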
