fix: add monitoring dashboard
Signed-off-by: Ilya Kheifets <[email protected]>
ikheifets-splunk committed Sep 18, 2024
1 parent 20ebf8f commit cbf165b
Showing 8 changed files with 312 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,6 +1,7 @@
# Changelog

## Unreleased
- add metrics dashboard

### Changed

254 changes: 254 additions & 0 deletions dashboard/dashboard.xml
@@ -0,0 +1,254 @@
<form version="1.1" theme="dark">
<label>sc4snmp</label>
<fieldset submitButton="false" autoRun="true"></fieldset>
<row>
<panel>
<title>SNMP polling status</title>
<input type="dropdown" token="poll_status_host" searchWhenChanged="true">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;*;poll" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);(?&lt;num&gt;\d+);poll" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If the polling status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.poll $poll_status_host$ | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.axisTitleX.visibility">visible</option>
<option name="charting.axisTitleY.visibility">visible</option>
<option name="charting.axisTitleY2.visibility">visible</option>
<option name="charting.chart">line</option>
<option name="charting.chart.nullValueMode">connect</option>
<option name="charting.drilldown">all</option>
<option name="charting.legend.placement">right</option>
<option name="height">331</option>
<option name="refresh.display">progressbar</option>
<option name="trellis.enabled">0</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20splunk_connect_for_snmp.snmp.tasks.poll%20$poll_status_host$%20%7C%20rex%20field%3D_raw%20%22Task%20splunk_connect_for_snmp.*%5C%5B*%5C%5D%20(%3F%3Cstatus%3E%5Cw%2B)%22%20%7C%20where%20status%20!%3D%20%22received%22&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
<panel>
<title>SNMP schedule of polling tasks</title>
<input type="dropdown" token="poll_host" searchWhenChanged="true">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;*;poll" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);(?&lt;num&gt;\d+);poll" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>This chart shows when SC4SNMP last scheduled polling for your SNMP device. Scheduling is healthy if it happens regularly.</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" Scheduler: Sending due task sc4snmp;$poll_host$;*poll | timechart count</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">331</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20Scheduler%3A%20Sending%20due%20task%20sc4snmp%3B$poll_host$%3B*poll&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP walk status</title>
<input type="dropdown" token="walk_status_host">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;walk" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);walk" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If the walk status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.walk $walk_status_host$ | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">327</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20splunk_connect_for_snmp.snmp.tasks.walk%20$walk_status_host$%20%7C%20rex%20field%3D_raw%20%22Task%20splunk_connect_for_snmp.*%5C%5B*%5C%5D%20(%3F%3Cstatus%3E%5Cw%2B)%22%20%7C%20where%20status%20!%3D%20%22received%22&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
<panel>
<title>SNMP schedule of walk tasks</title>
<input type="dropdown" token="walk_host">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;walk" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);walk" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>This chart shows when SC4SNMP last scheduled a walk for your SNMP device. Scheduling is healthy if it happens regularly.</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" Scheduler: Sending due task sc4snmp;$walk_host$;walk | timechart count</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">324</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20Scheduler%3A%20Sending%20due%20task%20sc4snmp%3B$walk_host$%3Bwalk&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP trap status</title>
<chart>
<title>If the trap status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.trap | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">332</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20splunk_connect_for_snmp.snmp.tasks.trap%20%7C%20rex%20field%3D_raw%20%22Task%20splunk_connect_for_snmp.*%5C%5B*%5C%5D%20(%3F%3Cstatus%3E%5Cw%2B)%22%20%7C%20where%20status%20!%3D%20%22received%22&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
<panel>
<title>SNMP trap authorisation</title>
<chart>
<title>Any status other than succeeded means that there is an SNMP authorisation problem.</title>
<search>
<query>index=* "ERROR Security Model failure for device" OR "splunk_connect_for_snmp.snmp.tasks.trap\[*\] succeeded" | eval status=if(searchmatch("succeeded"), "succeeded", "failed") | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">329</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20%22ERROR%20Security%20Model%20failure%20for%20device%22&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP send to Splunk status</title>
<chart>
<title>If the send status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.splunk.tasks.send | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP enrich task status</title>
<chart>
<title>If the enrich status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.enrich.tasks.enrich | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP prepare task status</title>
<chart>
<title>If the prepare status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.splunk.tasks.prepare | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP inventory poller task status</title>
<chart>
<title>If the inventory poller status is not successful, copy the SPL query from this chart to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.inventory.tasks.inventory_setup_poller | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<refresh>5m</refresh>
<refreshType>delay</refreshType>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
</form>
56 changes: 56 additions & 0 deletions docs/dashboard.md
@@ -0,0 +1,56 @@
# Dashboard

Use the dashboard to monitor SC4SNMP and confirm that it is healthy and working correctly.

## Prerequisites

1. [Create metrics indexes](gettingstarted/splunk-requirements.md#requirements-for-splunk-enterprise-or-enterprise-cloud) in Splunk.
2. Enable metrics logging for your runtime:
    * For K8S, install the [Splunk OpenTelemetry Collector for K8S](gettingstarted/sck-installation.md).
    * For docker-compose, use the [Splunk logging driver for docker](dockercompose/9-splunk-logging.md).

## Install dashboard

1. In the Splunk platform, open **Search -> Dashboards**.
2. Click **Create New Dashboard** and create an empty dashboard. Be sure to choose **Classic Dashboards**.
3. In the **Edit Dashboard** view, go to **Source** and replace the initial XML with the contents of [dashboard/dashboard.xml](https://github.com/splunk/splunk-connect-for-snmp/blob/main/dashboard/dashboard.xml) published in the SC4SNMP repository.
4. Save your changes. Your dashboard is ready to use.
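
Before pasting the XML into the **Source** editor, it can help to confirm that your copy is well-formed. The following is a minimal Python sketch, not part of SC4SNMP: the helper name `panel_titles` and the inline sample document are hypothetical, and it only assumes the structure used by `dashboard/dashboard.xml` (a `<form>` whose `<panel>` elements each have a direct `<title>` child).

```python
import xml.etree.ElementTree as ET


def panel_titles(dashboard_xml):
    """Parse Simple XML source and return the title of every panel.

    ET.fromstring raises xml.etree.ElementTree.ParseError if the text is
    not well-formed, for example after an incomplete copy and paste.
    """
    root = ET.fromstring(dashboard_xml)
    # findtext("title") looks only at direct children, so the nested
    # chart titles are skipped and only panel titles are returned.
    return [panel.findtext("title") for panel in root.iter("panel")]


# Hypothetical, shortened stand-in for dashboard/dashboard.xml.
SAMPLE = """
<form version="1.1" theme="dark">
  <label>sc4snmp</label>
  <row>
    <panel>
      <title>SNMP polling status</title>
      <chart><title>chart help text</title></chart>
    </panel>
    <panel>
      <title>SNMP schedule of polling tasks</title>
    </panel>
  </row>
</form>
"""

if __name__ == "__main__":
    for title in panel_titles(SAMPLE):
        print(title)
```

To check a local copy of the real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `panel_titles`.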


## Metrics explanation

### Polling dashboards

To check that polling of your device works correctly, first check the **SNMP schedule of polling tasks** dashboard.
This chart shows when SC4SNMP last scheduled polling for your SNMP device. Scheduling is healthy if it happens regularly.
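
The device dropdown of this panel extracts the device address from the scheduler log line with a `rex` expression. The sketch below reproduces that expression in Python so you can test it against your own log lines offline. The sample line is invented for illustration, the helper name `parse_poll_schedule` is hypothetical, and the meaning of the `num` field is not described here, so it is returned as a plain integer.

```python
import re

# Python spelling of the rex used in dashboard.xml:
#   "Sending due task sc4snmp;(?<ip>.+);(?<num>\d+);poll"
# Python writes named groups as (?P<name>...).
SCHEDULE_RE = re.compile(r"Sending due task sc4snmp;(?P<ip>.+);(?P<num>\d+);poll")


def parse_poll_schedule(line):
    """Return (ip, num) for a scheduler log line, or None if it does not match."""
    match = SCHEDULE_RE.search(line)
    if match is None:
        return None
    return match.group("ip"), int(match.group("num"))


if __name__ == "__main__":
    # Hypothetical log line, invented for this example.
    sample = "Scheduler: Sending due task sc4snmp;10.0.0.1;60;poll"
    print(parse_poll_schedule(sample))  # prints ('10.0.0.1', 60)
```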

After confirming that SC4SNMP schedules polling tasks for your SNMP device, verify that polling itself works.
Look at the **SNMP polling status** dashboard: if everything works, you see only the **succeeded** status.
If something goes wrong, you also see other statuses (as in the screenshot). In that case, use the [troubleshooting docs](bestpractices.md).

![Polling dashboards](images/dashboard/polling_dashboard.png)
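
The status chart is built by extracting the word that follows the task name from each log line and dropping the **received** status. The sketch below mirrors that logic in Python for offline experiments. The helper name `count_statuses` and the sample log lines are hypothetical, and the counts are not bucketed by time the way `timechart` does.

```python
import re
from collections import Counter

# Python spelling of the rex used by the status panels:
#   "Task splunk_connect_for_snmp.*\[*\] (?<status>\w+)"
STATUS_RE = re.compile(r"Task splunk_connect_for_snmp.*\[*\] (?P<status>\w+)")


def count_statuses(lines):
    """Count task statuses, ignoring 'received' as the dashboard query does."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group("status") != "received":
            counts[match.group("status")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    sample_lines = [
        "Task splunk_connect_for_snmp.snmp.tasks.poll[a1] received",
        "Task splunk_connect_for_snmp.snmp.tasks.poll[a1] succeeded in 0.4s",
        "Task splunk_connect_for_snmp.snmp.tasks.poll[b2] received",
        "Task splunk_connect_for_snmp.snmp.tasks.poll[b2] raised unexpected: ConnectionError()",
    ]
    for status, count in count_statuses(sample_lines).items():
        print(status, count)
```

A healthy device produces only `succeeded` entries; any other key is a status worth looking up in the troubleshooting docs.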

### Walk dashboards

To check that the walk of your device works correctly, first check the **SNMP schedule of walk tasks** dashboard.
This chart shows when SC4SNMP last scheduled a walk for your SNMP device. Scheduling is healthy if it happens regularly.
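
"Regularly" can be made concrete by comparing the spacing between consecutive scheduling events with the interval you expect. The sketch below is an offline illustration only; the dashboard itself just plots event counts. The function name `find_schedule_gaps`, the `tolerance` parameter, and the sample timestamps are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def find_schedule_gaps(timestamps, expected_interval, tolerance=timedelta(seconds=30)):
    """Return (previous, current) pairs spaced further apart than expected.

    A pair is reported when the time between two consecutive events
    exceeds expected_interval + tolerance.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval + tolerance:
            gaps.append((previous, current))
    return gaps


if __name__ == "__main__":
    # Hypothetical event times of "Sending due task sc4snmp;<ip>;walk" lines.
    events = [
        datetime(2024, 9, 18, 10, 0),
        datetime(2024, 9, 18, 10, 30),
        datetime(2024, 9, 18, 11, 0),
        datetime(2024, 9, 18, 12, 30),  # two expected walks are missing before this one
    ]
    for previous, current in find_schedule_gaps(events, timedelta(minutes=30)):
        print("gap between", previous, "and", current)
```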

After confirming that SC4SNMP schedules walk tasks for your SNMP device, verify that the walk itself works.
Look at the **SNMP walk status** dashboard: if everything works, you see only the **succeeded** status.
If something goes wrong, you also see other statuses (as in the screenshot). In that case, use the [troubleshooting docs](bestpractices.md).

![Walk dashboards](images/dashboard/walk_dashboard.png)

### Trap dashboards

First, check the **SNMP trap authorisation** dashboard. If you see only the **succeeded** status, authorisation is configured correctly. Otherwise, use the [troubleshooting docs](bestpractices.md#identifying-traps-issues).

After confirming that there are no trap authorisation issues, check that trap tasks work correctly. Go to the **SNMP trap status** dashboard: if you see only the **succeeded** status, everything works. Otherwise, the chart also shows other statuses.

![Trap dashboards](images/dashboard/trap_dashboard.png)
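
The authorisation panel selects two kinds of events, the `ERROR Security Model failure for device` message and succeeded trap tasks, and then labels each event with `eval status=if(searchmatch("succeeded"), "succeeded", "failed")`. The sketch below approximates that labelling step in Python. It assumes the events were already selected by the panel's search, uses a case-insensitive substring test in place of `searchmatch`, and the helper names and sample events are hypothetical.

```python
from collections import Counter


def trap_auth_status(event):
    """Label an already selected event the way the panel's eval does."""
    return "succeeded" if "succeeded" in event.lower() else "failed"


def count_trap_auth(events):
    """Count 'succeeded' and 'failed' labels across the selected events."""
    return Counter(trap_auth_status(event) for event in events)


if __name__ == "__main__":
    # Hypothetical events, invented for this example.
    sample_events = [
        "Task splunk_connect_for_snmp.snmp.tasks.trap[d4] succeeded in 0.1s",
        "ERROR Security Model failure for device 10.0.0.5",
        "Task splunk_connect_for_snmp.snmp.tasks.trap[e5] succeeded in 0.2s",
    ]
    for status, count in count_trap_auth(sample_events).items():
        print(status, count)
```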

### Other dashboards

Other tasks run as callbacks for walk and poll. For example, **send** publishes the results to Splunk. After a successful walk or poll, these callbacks must finish as well. Check that these tasks show only the **succeeded** status.

![Other dashboards](images/dashboard/other_dashboard.png)
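
To see at a glance which callback task needs attention, you can group the statuses by task name. The sketch below does this offline in Python. The regular expression is an assumption written for this example (the dashboard panels filter by task name in the search itself), and the helper name `unhealthy_tasks` and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern for this sketch: task name, bracketed id, then the status word.
TASK_RE = re.compile(
    r"Task (?P<task>splunk_connect_for_snmp[\w.]*)\[[^\]]*\] (?P<status>\w+)"
)


def unhealthy_tasks(lines):
    """Map each task name to the statuses other than 'received' and 'succeeded'."""
    problems = defaultdict(set)
    for line in lines:
        match = TASK_RE.search(line)
        if match and match.group("status") not in ("received", "succeeded"):
            problems[match.group("task")].add(match.group("status"))
    return dict(problems)


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    sample_lines = [
        "Task splunk_connect_for_snmp.splunk.tasks.send[c1] succeeded in 0.2s",
        "Task splunk_connect_for_snmp.splunk.tasks.prepare[c2] succeeded in 0.1s",
        "Task splunk_connect_for_snmp.enrich.tasks.enrich[c3] raised unexpected: KeyError()",
    ]
    for task, statuses in unhealthy_tasks(sample_lines).items():
        print(task, sorted(statuses))
```

An empty result means that every matched callback task reported only `received` or `succeeded`.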
Binary file added docs/images/dashboard/other_dashboard.png
Binary file added docs/images/dashboard/polling_dashboard.png
Binary file added docs/images/dashboard/trap_dashboard.png
Binary file added docs/images/dashboard/walk_dashboard.png
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -87,3 +87,4 @@ nav:
- Releases: "releases.md"
- High Availability: ha.md
- Improved polling performance: "improved-polling.md"
- Monitoring dashboard: "dashboard.md"
