NMS-15584: Update to alarm docs (#6033)
* NMS-15584: Some improvements in the docs are required to answer the following questions

* Apply suggestions from code review

Co-authored-by: Bonrob2 <[email protected]>

* NMS:15584: fix alarm into link

* Apply suggestions from code review

Co-authored-by: mmahacek <[email protected]>

* Update docs/modules/operation/pages/deep-dive/alarms/alarm-handling.adoc

Co-authored-by: mmahacek <[email protected]>

* Added the missing screenshots

* NMS-15584: Alarm doc updates

* Apply suggestions from code review

Co-authored-by: Bonrob2 <[email protected]>

---------

Co-authored-by: Bonrob2 <[email protected]>
Co-authored-by: Mark Mahacek <[email protected]>
3 people authored Jun 27, 2023
1 parent 533fa9d commit 640c388
Showing 17 changed files with 170 additions and 102 deletions.
5 changes: 2 additions & 3 deletions docs/modules/operation/nav.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -103,10 +103,9 @@
*** xref:deep-dive/events/event-advanced-search.adoc[]

** xref:deep-dive/alarms/introduction.adoc[]
*** xref:deep-dive/alarms/alarmd.adoc[]
*** xref:deep-dive/alarms/configuring-alarms.adoc[]
*** xref:deep-dive/alarms/alarm-notes.adoc[]
*** xref:deep-dive/alarms/alarm-related-events.adoc[]
*** xref:deep-dive/alarms/alarm-example.adoc[]
*** xref:deep-dive/alarms/alarm-handling.adoc[]
*** xref:deep-dive/alarms/alarm-sound-flash.adoc[]
*** xref:deep-dive/alarms/history.adoc[]
*** xref:deep-dive/alarms/alarm-advanced-search.adoc[]
Expand Down
68 changes: 68 additions & 0 deletions docs/modules/operation/pages/deep-dive/alarms/alarm-example.adoc
@@ -0,0 +1,68 @@

= Alarm Lifecycle

The following is an example of the alarm lifecycle based on a `nodeLostService` event.

== Lifecycle example

A new `nodeLostService` event is received and creates a new alarm.

.New alarm visible in outstanding alarm list
image::alarms/single_alarm_1.png["New alarm visible in outstanding alarm list"]

Clicking the number displayed in the *Count* column displays the corresponding events and their details.

.Event list showing events related to the alarm
image::alarms/single_alarm_2.png["Event list showing events related to the alarm"]

The alarm clears automatically when service is restored, based on a `nodeRegainedService` event.

.Alarm cleared
image::alarms/single_alarm_3.png["Alarm List displaying one cleared alarm and its log message"]

.Service down and service restored events
image::alarms/single_alarm_4.png["Event list page displaying one service down event and one service restored event"]

If the problem occurs again, the events are reduced into the existing alarm.
The alarm's count is updated to reflect the new activity.

.Alarm reopened with an increase in the `count` value
image::alarms/single_alarm_5.png["Alarm List displaying one alarm with a count of 2"]

.Event list showing events related to the alarm
image::alarms/single_alarm_6.png["Detailed event list page displaying two service down events and one service restored event, all of which are components of the same alarm"]

The alarm once again clears immediately when service is restored.

.Reduced alarm cleared
image::alarms/single_alarm_7.png["Alarm List displaying one cleared alarm with a count of 2, and its log message"]

Note that the alarm's count only increments on events with a severity of Warning or greater.

.Service down and restored events
image::alarms/single_alarm_8.png["Detailed event list page displaying two service down events and two service restored events, all of which are members of the same alarm"]
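
The lifecycle above is driven by the `<alarm-data>` elements on the two events.
The following is a minimal sketch, not the shipped definitions: it assumes a reduction key built from the UEI, monitoring location, node, interface, and service tokens, and a `clear-key` that spells out the problem UEI.
Check the event files on your own system before relying on these exact keys.

.Sketch of `nodeLostService` and `nodeRegainedService` alarm data (assumed keys)
[source, xml]
----
<event>
   <uei>uei.opennms.org/nodes/nodeLostService</uei>
   <!-- other event elements omitted -->
   <!-- alarm-type="1": problem alarm; repeat events with the same expanded key are reduced into it -->
   <alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%service%"
               alarm-type="1"
               auto-clean="false"/>
</event>
<event>
   <uei>uei.opennms.org/nodes/nodeRegainedService</uei>
   <!-- other event elements omitted -->
   <!-- alarm-type="2": resolution; clear-key must equal the problem alarm's expanded reduction-key -->
   <alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%service%"
               alarm-type="2"
               clear-key="uei.opennms.org/nodes/nodeLostService:%dpname%:%nodeid%:%interface%:%service%"
               auto-clean="false"/>
</event>
----

Because the `clear-key` reuses the same node, interface, and service tokens as the problem's reduction key, a restored service clears only the alarm for that exact service.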

== Alarm lifetime rules

Alarms are deleted from the {page-component-title} database after a set amount of time.
This lifetime can be configured via Drools rules in the `$\{OPENNMS_HOME}/etc/alarmd/drools-rules.d/alarmd.drl` file.
The default alarm lifetimes are:

[options="autowidth"]
|===
| Alarm State | Deletion Delay

| Cleared and Unacknowledged
| 5 minutes

| Cleared and Acknowledged
| 1 day

| Active and Unacknowledged
| 3 days

| All other alarms
| 8 days
|===

These delays are measured from the alarm's last event time.
The counter restarts if a new problem event is reduced into the same alarm.
71 changes: 71 additions & 0 deletions docs/modules/operation/pages/deep-dive/alarms/alarm-handling.adoc
@@ -0,0 +1,71 @@

= Alarm Handling

The following are ways you can interact with alarms.

== Acknowledgment

Users can acknowledge alarms to let other {page-component-title} users see that someone is aware of the alarm.
The alarm moves from the `Alarm(s) outstanding` view to the `Alarm(s) acknowledged` view.
Acknowledged alarms are also hidden from the "Nodes with Pending Problems" section of the home page.

.Acknowledged alarm of an HTTP outage in the alarm overview
image::alarms/acked_alarm_overview.png["Acknowledged alarm of an HTTP outage in the alarm overview"]

.Acknowledged alarm of an HTTP outage in detail view
image::alarms/acked_alarm_detail.png["Acknowledged alarm of an HTTP outage in detail view"]

== Clearing

Clearing an alarm marks it as resolved.

.Cleared alarm of an HTTP outage in the alarm overview
image::alarms/cleared_alarm_overview.png["Cleared alarm of an HTTP outage in the alarm overview"]

.Cleared alarm of an HTTP outage in detail view
image::alarms/cleared_alarm_detail.png["Cleared alarm of an HTTP outage in detail view"]

== Escalation

By default, an alarm has the same <<deep-dive/events/event-configuration.adoc#severities, severity>> as its most recent event.
If an alarm gets escalated, the alarm's severity increases by one level.

.Escalated alarm of an HTTP outage in the alarm overview
image::alarms/escalated_alarm_overview.png["Escalated alarm of an HTTP outage in the alarm overview"]

.Escalated alarm of an HTTP outage in detail view
image::alarms/escalated_alarm_detail.png["Escalated alarm of an HTTP outage in detail view"]

== Related events

{page-component-title} groups related events into a single alarm when they share the same <<deep-dive/alarms/configuring-alarms#ga-reduction-key, reduction key>>.
You can use the related events section of the alarm details view to see which events have been grouped into the alarm.


.Alarm Related Events page
image::alarms/alarm_related-events.png["Alarm Related Events page displaying related events.", 850]

== Alarm notes

Alarm notes let you assign comments to a specific alarm, or to a whole class of alarms, and share that information with other people on your team.

.Alarm Details page with sample notes
image::alarms/01_alarm-notes.png["Alarm Details page displaying sample notes in the Sticky Memo and Journal Memo boxes", 850]

You can add two types of notes to existing alarms or alarm classes:

Sticky Memo:: A user-defined note for a specific instance of an alarm.
Deleting the alarm also deletes any associated sticky memos.
Journal Memo:: A user-defined note for a class of alarms, based on the resolved reduction key.
Journal memos are shown for all alarms that match a specific reduction key.
Deleting an individual alarm does not remove the journal memo.
You must click *Clear* on an alarm with an associated journal memo to remove the memo.

The Alarm List Summary and Alarm List Detail pages display a symbol to indicate whether an individual alarm has associated sticky or journal memos.

[[ga-advanced-alarm-handling]]
== Advanced alarm handling

In addition to the manual actions described above, you can automate alarm handling using https://www.drools.org/[Drools] scripts.
There is a default rule set for handling alarm cleanup in the `$\{OPENNMS_HOME}/etc/alarmd/drools-rules.d/` directory.
You can find some additional examples in the `$\{OPENNMS_HOME}/etc/examples/alarmd/drools-rules.d/` directory.
20 changes: 0 additions & 20 deletions docs/modules/operation/pages/deep-dive/alarms/alarm-notes.adoc

This file was deleted.

This file was deleted.

Expand Up @@ -62,3 +62,4 @@ NOTE: If `alarm.properties` does not exist, create it and specify the above sett

The sound that is played is determined by the contents of `$\{OPENNMS_HOME}/jetty-webapps/opennms/sounds/alert.wav`.
To change the alarm sound, create a new `.wav` file with the desired sound, name it `alert.wav`, and replace the default file in `$\{OPENNMS_HOME}/jetty-webapps/opennms/sounds/`.
Make sure to keep a copy of this file, as it may be overwritten with the default sound when you install updates to {page-component-title}.
10 changes: 0 additions & 10 deletions docs/modules/operation/pages/deep-dive/alarms/alarmd.adoc

This file was deleted.

@@ -1,4 +1,4 @@

[[ga-configure-alarms]]
= Configure Alarms

Because alarmd instantiates alarms from events, defining alarms in {page-component-title} involves defining an additional event XML element that indicates a problem or resolution in the network.
Expand Down Expand Up @@ -51,6 +51,7 @@ See <<deep-dive/events/event-definition.adoc#ga-events-anatomy-of-an-event, Anat

== Attributes and elements

[[ga-reduction-key]]
=== reduction-key

Alarmd is designed to consolidate multiple occurrences of an alarm into a single alarm.
Expand All @@ -70,9 +71,9 @@ Most commonly, the event's unique event identifier (UEI) is used as the least si
</event>
----

`$dpname%` refers to the "distributed poller name", which is the name of the Minion that originated the event.
`%dpname%` refers to the "distributed poller name", which is the name of the monitoring location where the event originated.

Decreasing the significance of the `reduction-key` is a way to, for example, aggregate all nodes into a single alarm.
Decreasing the specificity of the `reduction-key` is a way to aggregate events from multiple nodes into a single alarm.
There are caveats, however:

.Least significant `reduction-key` attribute
Expand Down Expand Up @@ -111,22 +112,22 @@ When configuring a resolution alarm, you can set this attribute to match the cor
.`interfaceUp` event clearing an `interfaceDown` alarm
[source, xml]
----
<event>
<uei>uei.opennms.org/nodes/interfaceDown</uei>
...
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%" <1>
alarm-type="1"
auto-clean="false"/>
</event>
<event>
<uei>uei.opennms.org/nodes/interfaceUp</uei>
...
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%"
alarm-type="2"
clear-key="uei.opennms.org/nodes/interfaceDown:%dpname%:%nodeid%:%interface%" <2>
auto-clean="false"/>
</event>
<event>
<uei>uei.opennms.org/nodes/interfaceDown</uei>
...
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%" <1>
alarm-type="1"
auto-clean="false"/>
</event>
<event>
<uei>uei.opennms.org/nodes/interfaceUp</uei>
...
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%"
alarm-type="2"
clear-key="uei.opennms.org/nodes/interfaceDown:%dpname%:%nodeid%:%interface%" <2>
auto-clean="false"/>
</event>
----
<1> The `interfaceDown` event sets a `reduction-key` that includes enough information to identify a specific interface on a specific node.
<2> The `interfaceUp` event has a `clear-key` that matches the `reduction-key` of an `interfaceDown` alarm, letting a match automatically clear the previous alarm.
Expand All @@ -136,7 +137,7 @@ When configuring a resolution alarm, you can set this attribute to match the cor
The `auto-clean` attribute instructs alarmd to retain only the most recent event that has been reduced into an alarm.
For alarms that produce many events, this serves as a way to reduce the size of the most recent events in the database.

WARNING: Avoid using this feature with alarms that have pairwise correlation (matching problems with resolutions).
WARNING: Avoid using this feature with alarms that have pairwise correlation (type 1 and 2 alarms that match problems with resolutions).
It may delete all problem events, erasing your ability to study an alarm's history.
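
As an illustration, `auto-clean` suits a high-volume problem event that has no matching resolution event.
The sketch below uses a hypothetical UEI (`uei.example.org/...`) and assumes that `alarm-type="3"` denotes a problem without a resolution; verify both against your own event configuration.

.Sketch of an `auto-clean` alarm for an unpaired, high-volume event (hypothetical UEI)
[source, xml]
----
<event>
   <!-- hypothetical UEI, for illustration only -->
   <uei>uei.example.org/app/queueThresholdExceeded</uei>
   <!-- other event elements omitted -->
   <!-- auto-clean="true": keep only the most recent event reduced into this alarm -->
   <alarm-data reduction-key="%uei%:%dpname%:%nodeid%"
               alarm-type="3"
               auto-clean="true"/>
</event>
----

With no resolution event involved, discarding the older events does not remove anything needed for pairwise correlation.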

=== update-field
Expand Down Expand Up @@ -175,7 +176,7 @@ With this property set, when a repeat incident occurs and the current state of t
.New `node-down` alarm and existing cleared alarm
image::alarms/new_after_clear_3.png["Alarms List page displaying two alarms generated by the same node: the first is of major severity, and the second has been cleared"]

In this case, alarmd alters the existing alarm's `reduction-key` to be unique (appended with ":ID:" and the alarm's ID).
When enabled, alarmd alters the existing alarm's `reduction-key` to be unique (appended with ":ID:" and the alarm's ID).
This prevents it from being reused for a reoccurring problem in the network.

.Alarm Details page displaying altered `reduction-key` attribute
Expand Down
3 changes: 2 additions & 1 deletion docs/modules/operation/pages/deep-dive/alarms/history.adoc
Expand Up @@ -2,7 +2,8 @@
[[ga-alarm-history]]
= Alarm History

The alarm history feature integrates with Elasticsearch to provide long-term storage and maintain a history of alarm state changes.
Alarms are deleted from the {page-component-title} database as they clear or become stale.
If you would like to keep a historical record of alarm data, you can enable the alarm history feature to provide long-term storage and maintain a history of alarm state changes in Elasticsearch.
When it is enabled, alarms are indexed in Elasticsearch when they are created, deleted, or when any interesting fields (for example, Ticket State, Sticky Memo) on the alarm are updated.
Alarms are indexed so that operators can answer the following questions:

Expand Down
42 changes: 3 additions & 39 deletions docs/modules/operation/pages/deep-dive/alarms/introduction.adoc
Expand Up @@ -3,43 +3,7 @@

{page-component-title} uses a network's managed entities, their resources, the services they provide, and any applications that they host to monitor problem states.

While events carry immutable problem state attributes such as severity, alarms indicate problems in the network (see xref:deep-dive/bsm/introduction.adoc[]) and carry changeable attributes such as acknowledgment.
Events are immutable, historical records of things that happen within {page-component-title} and the nodes it monitors.
Alarms are created from one or more events that are <<deep-dive/alarms/configuring-alarms.adoc#ga-configure-alarms, configured>> with `<alarm-data>` information.
Alarms indicate problems in the network and carry changeable attributes such as acknowledgment, clearing, escalation, and temporary or persisted notes.

== Single alarm tracking problem states

An alarm is instantiated on the first occurrence of a `service-down` problem:

.`service-down` alarm created
image::alarms/single_alarm_1.png["Alarm List displaying one alarm of minor severity"]

Click the number displayed in the *Count* column to display the corresponding events and their details:

.Alarm displaying service down event
image::alarms/single_alarm_2.png["Alarm Details page displaying a service down event"]

The alarm is cleared immediately when service is restored, and no alarm is created when the service returns to a normal state:

.Alarm cleared
image::alarms/single_alarm_3.png["Alarm List displaying one cleared alarm and its log message"]

.Service down and service restored events
image::alarms/single_alarm_4.png["Alarm Details page displaying one service down event and one service restored event"]

If the problem occurs again, the events are reduced into the existing alarm.
The alarm's count is updated to reflect the new activity:

.Reduced alarm
image::alarms/single_alarm_5.png["Alarm List displaying one alarm with a count of 2"]

.List of events
image::alarms/single_alarm_6.png["Detailed event list page displaying two service down events and one service restored event, all of which are members of the same alarm"]

The alarm is once again cleared immediately when service is restored:

.Reduced alarm cleared
image::alarms/single_alarm_7.png["Alarm List displaying one cleared alarm with a count of 2, and its log message"]

Note that the alarm's count does not increment when the problem is resolved.

.Service down and restored events
image::alarms/single_alarm_8.png["Detailed event list page displaying two service down events and two service restored events, all of which are members of the same alarm"]
Alarms provide fault management information regarding problems in your network.
2 changes: 1 addition & 1 deletion docs/modules/reference/pages/glossary.adoc
Expand Up @@ -62,7 +62,7 @@ https://docs.docker.com/[Docker]:: An open-source container virtualization servi
Dominion:: The service on a {page-component-title} instance that controls Minion operations (see xref:reference:configuration/minion-confd/minion-confd.adoc[]).

https://www.drools.org/[Drools]:: A system to manage business rules that supports the Java Rules Engine API standard.
It helps provide a more robust infrastructure for workflow and problem state management in alarmd (see xref:operation:deep-dive/alarms/alarmd.adoc[]).
It helps provide a more robust infrastructure for workflow and problem state management in alarmd (see xref:operation:deep-dive/alarms/introduction.adoc[]).

== E

Expand Down
