Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Homer 10 - Call Flow - Sorting Issues/Wrong Order #625

Open
tony1661 opened this issue Dec 29, 2023 · 30 comments · Fixed by metrico/grafana-flow#47 or metrico/grafana-flow#48
Open
Assignees
Labels

Comments

@tony1661
Copy link

tony1661 commented Dec 29, 2023

I am noticing an issue in Homer 10's Call Flow dashboard. It seems that the Call Flow is not in the correct order which is making troubleshooting difficult.

I have a Freeswitch server (FusionPBX) with the portable helpify agent running heplify -hs homer-stack-ip:9060

If I search by the Call-ID in the SIP headers, I see all SIP messages associated with the call. See below:
image

Obviously the INVITE would have occurred before the 407 however that is not what Grafana shows.

Based on the HEP Flow panel, the messages should be sorted oldest to newest.

If I look into each message I see that the INVITE has a later date than the 407.

INVITE Date
image

407 Date
image

The issue is similar to what is being experienced in Homer 7's web-ui which leads me to think the issue may be related to Heplify. In Homer 7, when searching for a call, the results are also in the wrong order, however when I click on the Session ID to view the ladder, it is in the correct order.

Is it possible that the ladder in Homer 7 is referencing a different timestamp that is also being referenced in Homer 10?

Copy link

Please star this repository to motivate the developers and to get higher priority! ⭐

@tony1661
Copy link
Author

I've tested using captagent instead of heplify and the issue persists in Homer 10.

@tony1661
Copy link
Author

To add to this issue, I took a packet capture of a call that has the SIP messages displayed in the wrong order in Homer 10.

It seems that in the HEP packets, the Unix Timestamp doesn't change. See below:
image

The last two messages (BYE and 200 OK) have a different timestamp and in Homer they do indeed appear at the bottom of the ladder but they appear in a different order in the ladder than they do in the packet capture.

It may be worth adding the Timestamp μs to the equation.

image

@lmangani
Copy link
Member

lmangani commented Dec 31, 2023

Thanks for the report @tony1661 we're investigating and will make sure this is part of the next grafana-flow release
@AlexeyOplachko could you check this after the holidays?

@tony1661
Copy link
Author

tony1661 commented Jan 4, 2024

@AlexeyOplachko let me know if I can help in any way. I can provide logs, pcaps etc.

I have this on a production freeswitch server with heplify and captagent both running.

Hundreds of calls a day that we can look at.

@lmangani
Copy link
Member

lmangani commented Jan 6, 2024

Hi @tony1661 we're back next week and we' ll most definitely address this

@RFbkak37y3kIY
Copy link
Member

pushed fix for grafana-plugin, metrico/grafana-flow#47

used field [tsNs] for increased sorting accuracy SIP messages

RFbkak37y3kIY added a commit to metrico/grafana-flow that referenced this issue Jan 9, 2024
RFbkak37y3kIY added a commit to metrico/grafana-flow that referenced this issue Jan 9, 2024
fix: sipcapture/homer#625 used tsNs field for sorting of messages
@tony1661
Copy link
Author

Hi all,

I saw there were some code merged. If I pull the latest docker images, will I be able to test this?

@lmangani
Copy link
Member

As long as its using plugin version 10.0.10 you can also update an existing setup

@lmangani
Copy link
Member

@tony1661 here's how

@tony1661
Copy link
Author

@lmangani Thanks for your quick response. I tested and the issue seems to still be there. Is there anything I can provide to help? Logs etc

@AlexeyOplachko
Copy link
Collaborator

For starters can you please verify that your grafana indeed got new plugin version please?
your_grafana_url/plugins/qxip-flow-panel
image

On our side we'll try to replicate this issue today and see if we need anything else from you

@tony1661
Copy link
Author

Hi @AlexeyOplachko ,

I have verified that I have 10.0.10 installed. See below:
image

@tony1661
Copy link
Author

Anything I can help with?

@lmangani
Copy link
Member

@AlexeyOplachko please provide an update

@AlexeyOplachko
Copy link
Collaborator

AlexeyOplachko commented Mar 14, 2024

Anything I can help with?

@tony1661
Can you please provide screenshots of Message details with all the info in them, on two messages that are in incorrect order.
image
image
And also can you please check if Sort Items is set.
image

@tony1661
Copy link
Author

Hi @AlexeyOplachko sorry for the delay on this.

The Sort items is set to "Sort by Time: Oldest first".

Here is what the call flow looks like:

image

Here is the first message (INVITE):

image

Here is the second message (200 OK):

image

Here is the fourth message (that is supposed to be second - 100 Trying):

image

@AlexeyOplachko
Copy link
Collaborator

Hi @tony1661, thanks for reply, seems like this is not a sorting issue, but an issue with data.

If you look closely, message with 100 Trying has timestamp almost 4 minutes later than 200 OK.
And all three timestamps(one in labels, one in Time field, and nanosecond one) show matching data that supports this.

@tony1661
Copy link
Author

tony1661 commented Apr 25, 2024

@AlexeyOplachko Yea something seems off with the data. The issue happens on multiple HEP clients (heplify and captagent)

I have some screenshots from pcaps above that may assist.

I am running freeswitch (via FusionPBX)

@Dletta
Copy link
Collaborator

Dletta commented Jun 21, 2024

@AlexeyOplachko Could you check if you ended up fixing this issue?

@tony1661
Copy link
Author

@Dletta are you also experiencing this issue?

@AlexeyOplachko
Copy link
Collaborator

@AlexeyOplachko Could you check if you ended up fixing this issue?

Yes, from standpoint of our frontend there is no way for it to sort incorrectly, so it's only an issue with data

@Dletta
Copy link
Collaborator

Dletta commented Jun 21, 2024

@tony1661

I am not experiencing the same issue. I work with Alexey and wanted to make sure we don't let this issue go stale, :)

@tony1661
Copy link
Author

@AlexeyOplachko Could you check if you ended up fixing this issue?

Yes, from standpoint of our frontend there is no way for it to sort incorrectly, so it's only an issue with data

What can I do to assist? I've used multiple HEP agents and get the same results. Homer 7 does not have the issue with the same data source

@tony1661
Copy link
Author

Hi @lmangani is there any progress on this? I really want to be able to move over to Homer 10 if possible 🙂

@kYroL01
Copy link
Collaborator

kYroL01 commented Sep 10, 2024

Hi @tony1661
We tried to replicate your issue in our lab using Homer and various HEP agents (heplify, captangent, etc.) but could not reproduce the problem.
From our experience, this issue could be related with the server timestamp where the traffic is flowing

I am running freeswitch (via FusionPBX)

Consider that heplify and captagent don't change the timestamp, as they are passive actors, they just show what they see.
2 things:

  1. Can you check the timestamp and the order of the incoming traffic before the agents ?
  2. Can you check the timestamp and the order of the HEP packets arriving to heplify-server ?

This can help

@tony1661
Copy link
Author

@kYroL01 thanks for your reply.

The strange thing is homer 7 does not have the issue.

I have my Freeswitch/Fusion PBX server with heplify or captagent pointed to a server running docker. If I start the homer 7 containers and test, calls look good. Once I shut them down and start Homer 10, the issue starts.

What data should I be looking at? Should I be looking at packet captures?

@kYroL01
Copy link
Collaborator

kYroL01 commented Sep 10, 2024

IMHO you should check the data between the agent and the heplify server, and between heplify server and the database.

It's better to find any concrete evidence between Homer 7 and Homer 10 to help you, otherwise, as we cannot reproduce the issue, it's really hard for us to use only the screenshots.

@tony1661
Copy link
Author

@kYroL01 will do. I will try to see what I can find. I did provide some details on the info I found in the pcaps above. I will try to find some better reference data

@covalschi
Copy link

covalschi commented Oct 9, 2024

Noticing similar behaviour, sending HEP3 using siptrace.so on Kamailio 5.4.6 be668987e9296232a00fa7bba82bc72abd72a743 (June 2021), 100% UDP, using setflag to mark the transaction to capture
Capturing with heplify-server 1.59.7 via qryn 3.2.31 into ClickHouse 24.9.2.42, peaking with ~7000 messages/second
Running Grafana 11.2, plugin v10.2.0

Please check this PNG export from the plugin - messages are offset by mere microseconds, however such offset vastly impacts the representation of the SIP flow. This is not the case for all of the calls, however I didn't find any pattern in the occurencies of the issue, which happens quiet often - including off-peak hours.
2024-9-9_7_36_13(1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment