NETOBSERV-1240: prefer flows with DNS records when checking for dup #174

Merged (1 commit) on Sep 13, 2023

@msherif1234 msherif1234 commented Sep 7, 2023

A DNS record can be missed when a matching TC flow is seen on a different interface.

Here is an example of a DNS flow marked as a duplicate of a TC flow on a different interface:

TC
{
  "AgentIP": "10.0.128.2",
  "Bytes": 196,
  "DstAddr": "10.131.0.17",
  "DstK8S_HostIP": "10.0.128.2",
  "DstK8S_HostName": "ci-ln-3qnw742-72292-j6czx-worker-a-llr5p",
  "DstK8S_Name": "netobserv-plugin-57dfcbfc86-m4tmw",
  "DstK8S_Namespace": "netobserv",
  "DstK8S_OwnerName": "netobserv-plugin",
  "DstK8S_OwnerType": "Deployment",
  "DstK8S_Type": "Pod",
  "DstMac": "0A:58:0A:83:00:11",
  "DstPort": 59839,
  "Duplicate": false,
  "Etype": 2048,
  "FlowDirection": "0",
  "IfDirection": 1,
  "Interface": "42482213ec85729",
  "K8S_ClusterName": "03c0fe19-be2a-4add-8196-ed3d5d098742",
  "Packets": 1,
  "Proto": 17,
  "SrcAddr": "172.30.0.10",
  "SrcK8S_Name": "dns-default",
  "SrcK8S_Namespace": "openshift-dns",
  "SrcK8S_OwnerName": "dns-default",
  "SrcK8S_OwnerType": "Service",
  "SrcK8S_Type": "Service",
  "SrcMac": "0A:58:0A:83:00:06",
  "SrcPort": 53,
  "TimeFlowEndMs": 1694098016605,
  "TimeFlowStartMs": 1694098016605,
  "TimeReceived": 1694098017,
  "app": "netobserv-flowcollector"
}
DNS hook
{
  "AgentIP": "10.0.128.2",
  "Bytes": 196,
  "DnsFlags": 34048,
  "DnsFlagsResponseCode": "NoError",
  "DnsId": 46631,
  "DnsLatencyMs": 0,
  "DstAddr": "10.131.0.17",
  "DstK8S_HostIP": "10.0.128.2",
  "DstK8S_HostName": "ci-ln-3qnw742-72292-j6czx-worker-a-llr5p",
  "DstK8S_Name": "netobserv-plugin-57dfcbfc86-m4tmw",
  "DstK8S_Namespace": "netobserv",
  "DstK8S_OwnerName": "netobserv-plugin",
  "DstK8S_OwnerType": "Deployment",
  "DstK8S_Type": "Pod",
  "DstMac": "0A:58:0A:83:00:11",
  "DstPort": 59839,
  "Duplicate": true,
  "Etype": 2048,
  "FlowDirection": "0",
  "IfDirection": 0,
  "Interface": "222a405082585f3",
  "K8S_ClusterName": "03c0fe19-be2a-4add-8196-ed3d5d098742",
  "Packets": 1,
  "Proto": 17,
  "SrcAddr": "172.30.0.10",
  "SrcK8S_Name": "dns-default",
  "SrcK8S_Namespace": "openshift-dns",
  "SrcK8S_OwnerName": "dns-default",
  "SrcK8S_OwnerType": "Service",
  "SrcK8S_Type": "Service",
  "SrcMac": "0A:58:0A:83:00:06",
  "SrcPort": 53,
  "TimeFlowEndMs": 1694098016605,
  "TimeFlowStartMs": 1694098016605,
  "TimeReceived": 1694098017,
  "app": "netobserv-flowcollector"
}

Suggested solution: prefer the DNS-enriched flow and mark the TC flow as the duplicate.
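
As a rough illustration of the suggested behavior, here is a minimal Go sketch. It is not the agent's code: `flowKey`, `record`, `HasDNS`, and `markDuplicates` are hypothetical names, and the key deliberately ignores the interface so that the two flows above map to the same entry. It also only covers records that arrive in the same batch.

```go
package main

import "fmt"

// flowKey is a hypothetical, simplified flow identity. It leaves out the
// interface on purpose, so the same packet seen on two interfaces collides.
type flowKey struct {
	SrcAddr, DstAddr string
	SrcPort, DstPort uint16
	Proto            uint8
}

// record is a hypothetical stand-in for the agent's flow record.
type record struct {
	Key       flowKey
	Interface string
	HasDNS    bool // true when the flow carries DNS enrichment
	Duplicate bool
}

// markDuplicates keeps exactly one non-duplicate record per key. A record
// with DNS data takes over from a previously seen record without it.
func markDuplicates(recs []*record) {
	seen := map[flowKey]*record{}
	for _, r := range recs {
		first, ok := seen[r.Key]
		switch {
		case !ok:
			seen[r.Key] = r
		case r.HasDNS && !first.HasDNS:
			// Prefer the enriched flow: the earlier one becomes the duplicate.
			first.Duplicate = true
			seen[r.Key] = r
		default:
			r.Duplicate = true
		}
	}
}

func main() {
	key := flowKey{SrcAddr: "172.30.0.10", DstAddr: "10.131.0.17", SrcPort: 53, DstPort: 59839, Proto: 17}
	tc := &record{Key: key, Interface: "42482213ec85729"}
	dns := &record{Key: key, Interface: "222a405082585f3", HasDNS: true}

	markDuplicates([]*record{tc, dns})
	fmt.Println(tc.Duplicate, dns.Duplicate) // prints: true false
}
```

With the two example flows, the TC flow ends up flagged and the DNS flow does not, regardless of which one is seen first.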

openshift-ci-robot commented Sep 7, 2023

@msherif1234: This pull request references NETOBSERV-1240 which is a valid jira issue.

codecov bot commented Sep 7, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.17% 🎉

Comparison is base (faf274e) 39.21% compared to head (931946b) 39.39%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   39.21%   39.39%   +0.17%     
==========================================
  Files          31       31              
  Lines        2382     2394      +12     
==========================================
+ Hits          934      943       +9     
- Misses       1391     1393       +2     
- Partials       57       58       +1     
Flag Coverage Δ
unittests 39.39% <100.00%> (+0.17%) ⬆️

Files Changed Coverage Δ
pkg/flow/deduper.go 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


@msherif1234 msherif1234 force-pushed the dns_dup_handling branch 2 times, most recently from 6e59443 to 38e0d25 Compare September 7, 2023 21:17
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 7, 2023
github-actions bot commented Sep 7, 2023

New image:
quay.io/netobserv/netobserv-ebpf-agent:1af8426

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1af8426 make set-agent-image

@@ -26,6 +26,7 @@ type deduperCache struct {

type entry struct {
	key    *ebpf.BpfFlowId
	record *Record
Member

I was thinking about this solution of storing the whole record here, but it has some downsides and challenges:

  • Records are now kept in memory in two places: in the Accounter and here, with different eviction timeouts (the one defined here is longer, for good reasons). So we can expect increased memory usage, since those records won't be garbage-collected until they expire from this cache.
  • If a record was flushed from the Accounter but is still present here when you flip its Duplicate flag with fEntry.record.Duplicate = true, this will have no effect downstream, so there are still cases where duplicates are not flagged as such.

Member

@jotak jotak Sep 8, 2023

I wonder, does the concept of a "weak reference" exist in Go? That would let us keep weak refs to records in this cache, so that they're still GC'ed after the Accounter flushes.

[Edit] It seems it doesn't exist... but we could implement something where the Accounter notifies the deduper on flush, so that all flushed records are set to nil in the deduper cache. It won't fix item #2 mentioned above, but it should avoid the increased memory usage.
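
The notification idea in this comment could look roughly like the sketch below. Everything here is an assumption made for illustration: `dedupCache`, `accounter`, `release`, and `onFlush` are hypothetical names, the flow key is a plain string, and the real components are wired differently.

```go
package main

import "fmt"

// record is a hypothetical stand-in for the agent's flow record.
type record struct {
	Duplicate bool
}

// dedupCache maps a flow key to the last record seen for it. The key is a
// plain string here purely for illustration.
type dedupCache struct {
	entries map[string]*record
}

// release keeps the keys (so later flows still match as duplicates) but
// drops the record pointers, so this cache no longer keeps them reachable.
func (c *dedupCache) release(keys []string) {
	for _, k := range keys {
		if _, ok := c.entries[k]; ok {
			c.entries[k] = nil
		}
	}
}

// accounter is a hypothetical, heavily simplified flusher that reports the
// keys it has just flushed through the onFlush callback.
type accounter struct {
	pending map[string]*record
	onFlush func(keys []string)
}

func (a *accounter) flush() []*record {
	out := make([]*record, 0, len(a.pending))
	keys := make([]string, 0, len(a.pending))
	for k, r := range a.pending {
		out = append(out, r)
		keys = append(keys, k)
	}
	a.pending = map[string]*record{}
	if a.onFlush != nil {
		a.onFlush(keys)
	}
	return out
}

func main() {
	r := &record{}
	cache := &dedupCache{entries: map[string]*record{"flow-1": r}}
	acc := &accounter{pending: map[string]*record{"flow-1": r}, onFlush: cache.release}

	flushed := acc.flush()
	_, known := cache.entries["flow-1"]
	fmt.Println(len(flushed), known, cache.entries["flow-1"] == nil) // prints: 1 true true
}
```

The key stays in the cache after the flush, so duplicate detection by key is unaffected; only the pointer that would keep the record alive is dropped.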

Contributor Author

@msherif1234 msherif1234 Sep 11, 2023

We are storing a pointer to the record, not the record itself. I shared the memory usage before and after below, and it's about the same.

dup

pkg/flow/deduper.go (outdated)
if justMark {
	fEntry.record.Duplicate = true
} else {
	fwd = findAndDeleteRecord(fwd, fEntry.record)
Member

@jotak jotak Sep 8, 2023

This won't work if the entry was in the cache but was already sent previously (in an earlier set of records): it won't be in the fwd slice.
The record should be looked up and deleted from the Accounter store, I guess.
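
To make the concern concrete, here is a small Go sketch. The body of `findAndDeleteRecord` is a guess at what such a helper might do (remove the first pointer-equal element), not the code from this PR, and `record` is a hypothetical stand-in type.

```go
package main

import "fmt"

// record is a hypothetical stand-in for the agent's flow record.
type record struct {
	Interface string
	Duplicate bool
}

// findAndDeleteRecord is an assumed implementation of the helper named in
// the diff: it removes the first element that is pointer-equal to target
// and returns the slice unchanged when target is not present.
func findAndDeleteRecord(fwd []*record, target *record) []*record {
	for i, r := range fwd {
		if r == target {
			return append(fwd[:i], fwd[i+1:]...)
		}
	}
	return fwd
}

func main() {
	sent := &record{Interface: "42482213ec85729"} // forwarded in an earlier batch
	late := &record{Interface: "222a405082585f3"} // arrives in the current batch

	// The current batch only holds the late record, so there is nothing to delete.
	fwd := []*record{late}
	fwd = findAndDeleteRecord(fwd, sent)
	fmt.Println(len(fwd)) // prints: 1
}
```

When the earlier record has already left in a previous batch, the call is a no-op, which is the case described in the comment above.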

Contributor Author

@msherif1234 msherif1234 Sep 11, 2023

Based on the architecture doc, my understanding is that the dedup stage comes after the Accounter. Also, if the record is not in the same fwd slice, it won't be considered a duplicate in that case, right?

Member

I was wrong about the Accounter, but IMO the problem is still the same: a flow could already have been sent out to FLP when its duplicate comes in later.

if the record is not in the same fwd slice, it won't be considered a duplicate in that case, right?

I think that's wrong, and that's why there is a separate cache with its own expiry time: we keep track of flow keys in that cache, even for flows already sent to FLP, so that flows arriving later can still be seen as duplicates.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 8, 2023
@msherif1234 msherif1234 force-pushed the dns_dup_handling branch 2 times, most recently from 3fc7d7c to a2ab479 Compare September 11, 2023 17:32
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 11, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:79005c8

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=79005c8 make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:deb3152

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=deb3152 make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:9729bb3

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=9729bb3 make set-agent-image

@msherif1234
Contributor Author

image

image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@msherif1234
Contributor Author

@memodi can you please give this a try and see if it fixes the issue?

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 12, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:2af0f86

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2af0f86 make set-agent-image

@memodi
Contributor

memodi commented Sep 12, 2023

Verified that DNS flows show up without selecting "Show Duplicates" in the UI. cc @skrthomas
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Sep 12, 2023
Member

@jotak jotak left a comment

lgtm!
thanks!

@msherif1234
Contributor Author

/approve

openshift-ci bot commented Sep 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

@openshift-merge-robot openshift-merge-robot merged commit cca4aa3 into netobserv:main Sep 13, 2023
10 checks passed
@msherif1234 msherif1234 deleted the dns_dup_handling branch September 13, 2023 11:39
msherif1234 added a commit to msherif1234/netobserv-ebpf-agent that referenced this pull request Sep 13, 2023
openshift-merge-robot pushed a commit that referenced this pull request Sep 13, 2023
Labels: approved, jira/valid-reference, lgtm, ok-to-test, qe-approved
5 participants