Cluster Stats

The Cluster Stats API provides very similar output to the Node Stats. There is one crucial difference: Node Stats shows you statistics per-node, while Cluster Stats will show you the sum total of all nodes in a single metric.

This provides some useful stats to glance at. You can see that your entire cluster is using 50% of the available heap, that the filter cache is not evicting heavily, and so on. Its main use is to provide a quick summary that is more extensive than Cluster Health but less detailed than Node Stats. It is also useful for very large clusters, where the Node Stats output becomes difficult to read.

The API may be invoked with:

GET _cluster/stats
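
To make the per-node versus cluster-wide distinction concrete, here is a minimal Python sketch that rolls per-node heap figures up into a single cluster-level percentage. The node names, field names (`heap_used_bytes`, `heap_max_bytes`) and numbers are invented for illustration; they do not mirror the real response layout of either API.

```python
# Hypothetical per-node heap figures. The field names and values are
# illustrative only and are not the real Node Stats response layout.
node_heap = {
    "node_a": {"heap_used_bytes": 400, "heap_max_bytes": 1000},
    "node_b": {"heap_used_bytes": 600, "heap_max_bytes": 1000},
}

def cluster_heap_percent(nodes):
    """Sum heap usage over all nodes and return one cluster-wide percentage."""
    used = sum(n["heap_used_bytes"] for n in nodes.values())
    total = sum(n["heap_max_bytes"] for n in nodes.values())
    return 100.0 * used / total

print(cluster_heap_percent(node_heap))  # 50.0
```

Notice that the single 50% figure hides the fact that one node sits at 40% and the other at 60%, which is exactly the detail you give up when you read Cluster Stats instead of Node Stats.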

Index Stats

So far, we have been looking at node-centric statistics. How much memory does this node have? How much CPU is being used? How many searches is this node servicing?

Sometimes it is useful to look at statistics from an index-centric perspective. How many search requests is this index receiving? How much time is spent fetching docs in that index?

To do this, select the index (or indices) that you are interested in and execute an Index Stats API:

GET my_index/_stats (1)

GET my_index,another_index/_stats (2)

GET _all/_stats (3)
  1. Stats for my_index

  2. Stats for multiple indices can be requested by comma separating their names

  3. Stats for all indices can be requested using the special _all index name
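
If you build these requests from a script, the three forms above reduce to one small path-building rule. The following is a sketch only; `index_stats_path` is a hypothetical helper, not part of any client library.

```python
def index_stats_path(indices=None):
    """Build the request path for the Index Stats API.

    `indices` is a list of index names. None or an empty list means
    "all indices", which uses the special _all name.
    """
    target = ",".join(indices) if indices else "_all"
    return f"{target}/_stats"

print(index_stats_path(["my_index"]))                    # my_index/_stats
print(index_stats_path(["my_index", "another_index"]))   # my_index,another_index/_stats
print(index_stats_path())                                # _all/_stats
```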

The stats returned will be familiar from the Node Stats output: search, fetch, get, index, bulk, segment counts, etc.

Index-centric stats can be useful for identifying or verifying "hot" indices inside your cluster, or trying to determine why some indices are faster or slower than others.

In practice, however, node-centric statistics tend to be more useful. Entire nodes tend to bottleneck, not individual indices. And because indices are usually spread across multiple nodes, index-centric statistics are usually not very helpful: they aggregate data from different physical machines operating in different environments.

Index-centric stats are a useful tool to keep in your repertoire, but are not usually the first tool to reach for.

Pending Tasks

There are certain tasks that only the master can perform, such as creating a new index or moving shards around the cluster. Since a cluster can only have one master, only one node can ever process cluster-level metadata changes. For 99.9999% of the time, this is never a problem. The queue of metadata changes remains essentially empty.

In some very rare clusters, metadata changes arrive faster than the master can process them. This leads to a buildup of pending actions, which are queued.

The Pending Tasks API will show you what (if any) cluster-level metadata changes are pending in the queue:

GET _cluster/pending_tasks

Usually, the response will look like this:

{
   "tasks": []
}

Meaning there are no pending tasks. If you have one of the rare clusters that bottlenecks on the master node, your pending task list may look like this:

{
   "tasks": [
      {
         "insert_order": 101,
         "priority": "URGENT",
         "source": "create-index [foo_9], cause [api]",
         "time_in_queue_millis": 86,
         "time_in_queue": "86ms"
      },
      {
         "insert_order": 46,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]",
         "time_in_queue_millis": 842,
         "time_in_queue": "842ms"
      },
      {
         "insert_order": 45,
         "priority": "HIGH",
         "source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from gateway]",
         "time_in_queue_millis": 858,
         "time_in_queue": "858ms"
      }
  ]
}

You can see that each task is assigned a priority (URGENT is processed before HIGH, and so on), the order in which it was inserted, how long it has been queued, and what the action is trying to perform. In the above list, there is a Create Index action and two Shard Started actions pending.
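
If you poll this API from a monitoring script, you usually care about only a few numbers: how many tasks are queued, at which priorities, and how long the oldest one has waited. Here is a minimal Python sketch that summarizes a response shaped like the sample above. The `source` and `time_in_queue` fields are dropped from the sample for brevity, and `summarize_pending` is a hypothetical helper name.

```python
import json
from collections import Counter

# Trimmed copy of the sample response shown above; only the fields
# this sketch actually reads are kept.
response = json.loads("""
{
  "tasks": [
    {"insert_order": 101, "priority": "URGENT", "time_in_queue_millis": 86},
    {"insert_order": 46,  "priority": "HIGH",   "time_in_queue_millis": 842},
    {"insert_order": 45,  "priority": "HIGH",   "time_in_queue_millis": 858}
  ]
}
""")

def summarize_pending(resp):
    """Return the task count, a per-priority count, and the longest wait."""
    tasks = resp["tasks"]
    by_priority = Counter(t["priority"] for t in tasks)
    # default=0 covers the usual case of an empty task list.
    oldest = max((t["time_in_queue_millis"] for t in tasks), default=0)
    return {
        "count": len(tasks),
        "by_priority": dict(by_priority),
        "oldest_millis": oldest,
    }

print(summarize_pending(response))
print(summarize_pending({"tasks": []}))
```

A count that stays above zero, or an oldest wait that keeps growing between polls, is the signal that the master is falling behind.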

When should I worry about Pending Tasks?

As mentioned, the master node is rarely the bottleneck for clusters. The only time it can potentially bottleneck is if the cluster state is both very large and updated frequently.

For example, if you allow customers to create as many dynamic fields as they wish, and have a unique index for each customer every day, your cluster state will grow very large. The cluster state includes (among other things) a list of all indices, their types, and the fields for each index.

So if you have 100,000 customers, and each customer averages 1000 fields and 90 days of retention, that’s nine billion fields to keep in the cluster state. Whenever this changes, the nodes must be notified.

The master must process these changes, which requires non-trivial CPU overhead, plus the network overhead of pushing the updated cluster state to all nodes.
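
The nine billion figure is plain multiplication, and it is worth rerunning with your own numbers. This small Python sketch uses the figures from the example; the index count follows from the one-index-per-customer-per-day assumption stated earlier.

```python
customers = 100_000
fields_per_customer = 1_000
retention_days = 90  # one index per customer per day, kept for 90 days

total_fields = customers * fields_per_customer * retention_days
total_indices = customers * retention_days

print(f"{total_fields:,} fields")    # 9,000,000,000 fields
print(f"{total_indices:,} indices")  # 9,000,000 indices
```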

It is these clusters which may begin to see cluster state actions queuing up. There is no easy solution to this problem, however. You have three options:

  • Obtain a beefier master node. Unfortunately, vertical scaling just delays the inevitable.

  • Restrict the dynamic nature of the documents in some way, so as to limit the cluster state size.

  • Spin up another cluster once a certain threshold has been crossed.

Cat API

If you work from the command line often, the Cat APIs will be very helpful to you. Named after the Linux cat command, these APIs are designed to work like *nix command line tools.

They provide statistics that are identical to all the previously discussed APIs (Health, Node Stats, etc), but present the output in tabular form instead of JSON. This is very convenient if you are a system administrator and just want to glance over your cluster, or find nodes with high memory usage, etc.

Executing a plain GET against the Cat endpoint will show you all available APIs:

GET /_cat

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

Many of these APIs should look familiar to you (and yes, that’s a cat at the top :) ). Let’s take a look at the Cat Health API:

GET /_cat/health

1408723713 12:08:33 elasticsearch_zach yellow 1 1 114 114 0 0 114

The first thing you’ll notice is that the response is plain text in tabular form, not JSON. The second thing you’ll notice is that there are no column headers enabled by default. This is designed to emulate *nix tools, since it is assumed that once you become familiar with the output you no longer want to see the headers.

To enable headers, add the ?v parameter:

GET /_cat/health?v

epoch      timestamp cluster            status node.total node.data shards pri relo init unassign
1408723890 12:11:30  elasticsearch_zach yellow          1         1    114 114    0    0      114

Ah, much better. We now see the timestamp, cluster name, the status, how many nodes are in the cluster, etc. All the same information as the Cluster Health API.
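
With the header row present, the tabular output is also easy to consume from a script. Here is a minimal Python sketch that turns header-plus-rows text into dictionaries. It assumes that no column value is empty or contains spaces, which holds for the health output above but is not guaranteed for every Cat endpoint. `parse_cat` is a hypothetical helper name.

```python
# Sample text copied from the ?v output shown above.
sample = """\
epoch      timestamp cluster            status node.total node.data shards pri relo init unassign
1408723890 12:11:30  elasticsearch_zach yellow          1         1    114 114    0    0      114
"""

def parse_cat(text):
    """Split whitespace-separated Cat output into one dict per data row.

    Assumes the first non-blank line is the header (the ?v parameter)
    and that no value is empty or contains spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, ln.split())) for ln in lines[1:]]

health = parse_cat(sample)[0]
print(health["status"], health["unassign"])  # yellow 114
```

All values come back as strings, so convert the numeric columns yourself before comparing them.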

Let’s look at Node Stats in the Cat API:

GET /_cat/nodes?v

host         ip            heap.percent ram.percent load node.role master name
zacharys-air 192.168.1.131           45          72 1.85 d         *      Zach

We see some stats about the nodes in our cluster, but they are very basic compared to the full Node Stats output. There are many additional metrics that you can include, but rather than consulting the documentation, let’s just ask the Cat API what is available.

You can do this by adding ?help to any API:

GET /_cat/nodes?help

id                       | id,nodeId                 | unique node id
pid                      | p                         | process id
host                     | h                         | host name
ip                       | i                         | ip address
port                     | po                        | bound transport port
version                  | v                         | es version
build                    | b                         | es build hash
jdk                      | j                         | jdk version
disk.avail               | d,disk,diskAvail          | available disk space
heap.percent             | hp,heapPercent            | used heap ratio
heap.max                 | hm,heapMax                | max configured heap
ram.percent              | rp,ramPercent             | used machine memory ratio
ram.max                  | rm,ramMax                 | total machine memory
load                     | l                         | most recent load avg
uptime                   | u                         | node uptime
node.role                | r,role,dc,nodeRole        | d:data node, c:client node
master                   | m                         | m:master-eligible, *:current master
...
...

(Note that the output has been truncated for brevity)

The first column shows the full name, the second column shows the short names, and the third column offers a brief description of the column. Now that we know some column names, we can ask for those explicitly using the ?h parameter:

GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax

ip            port heapPercent heapMax
192.168.1.131 9300          53 990.7mb
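
Because full names and short names are interchangeable in ?h, it can be handy to resolve whatever names a colleague typed back to the full column names. This Python sketch does that, using a few lines copied from the ?help listing shown earlier. `alias_map` is a hypothetical helper, and the parsing assumes the three-column, pipe-separated layout seen in that listing.

```python
# A few lines copied from the ?help output shown earlier.
help_text = """\
ip                       | i                         | ip address
port                     | po                        | bound transport port
heap.percent             | hp,heapPercent            | used heap ratio
heap.max                 | hm,heapMax                | max configured heap
"""

def alias_map(text):
    """Map every full name and every short name to its full column name."""
    mapping = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank or truncated ("...") lines
        full, shorts, _desc = (part.strip() for part in line.split("|", 2))
        mapping[full] = full
        for short in shorts.split(","):
            mapping[short.strip()] = full
    return mapping

aliases = alias_map(help_text)
requested = ["ip", "po", "heapPercent", "hm"]
print([aliases[name] for name in requested])
# ['ip', 'port', 'heap.percent', 'heap.max']
```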

Because the Cat API tries to behave like *nix utilities, you can pipe the output to other tools such as sort, grep, awk, etc. For example, we can find the largest index in our cluster by using:

% curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8

yellow test_names         5 1 3476004 0 376324705 376324705
yellow .marvel-2014.08.19 1 1  263878 0 160777194 160777194
yellow .marvel-2014.08.15 1 1  234482 0 143020770 143020770
yellow .marvel-2014.08.09 1 1  222532 0 138177271 138177271
yellow .marvel-2014.08.18 1 1  225921 0 138116185 138116185
yellow .marvel-2014.07.26 1 1  173423 0 132031505 132031505
yellow .marvel-2014.08.21 1 1  219857 0 128414798 128414798
yellow .marvel-2014.07.27 1 1   75202 0  56320862  56320862
yellow wavelet            5 1    5979 0  54815185  54815185
yellow .marvel-2014.07.28 1 1   57483 0  43006141  43006141
yellow .marvel-2014.07.21 1 1   31134 0  27558507  27558507
yellow .marvel-2014.08.01 1 1   41100 0  27000476  27000476
yellow kibana-int         5 1       2 0     17791     17791
yellow t                  5 1       7 0     15280     15280
yellow website            5 1      12 0     12631     12631
yellow agg_analysis       5 1       5 0      5804      5804
yellow v2                 5 1       2 0      5410      5410
yellow v1                 5 1       2 0      5367      5367
yellow bank               1 1      16 0      4303      4303
yellow v                  5 1       1 0      2954      2954
yellow p                  5 1       2 0      2939      2939
yellow b0001_072320141238 5 1       1 0      2923      2923
yellow ipaddr             5 1       1 0      2917      2917
yellow v2a                5 1       1 0      2895      2895
yellow movies             5 1       1 0      2738      2738
yellow cars               5 1       0 0      1249      1249
yellow wavelet2           5 1       0 0       615       615

By adding ?bytes=b we disable the "human readable" formatting on numbers and force them to be listed as bytes. This output is then piped into sort so that our indices are ranked according to size (the 8th column).

Unfortunately, you’ll notice that the Marvel indices are clogging up the results, and we don’t really care about those indices right now. Let’s pipe the output through grep and remove anything mentioning marvel:

% curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8 | grep -v marvel

yellow test_names         5 1 3476004 0 376324705 376324705
yellow wavelet            5 1    5979 0  54815185  54815185
yellow kibana-int         5 1       2 0     17791     17791
yellow t                  5 1       7 0     15280     15280
yellow website            5 1      12 0     12631     12631
yellow agg_analysis       5 1       5 0      5804      5804
yellow v2                 5 1       2 0      5410      5410
yellow v1                 5 1       2 0      5367      5367
yellow bank               1 1      16 0      4303      4303
yellow v                  5 1       1 0      2954      2954
yellow p                  5 1       2 0      2939      2939
yellow b0001_072320141238 5 1       1 0      2923      2923
yellow ipaddr             5 1       1 0      2917      2917
yellow v2a                5 1       1 0      2895      2895
yellow movies             5 1       1 0      2738      2738
yellow cars               5 1       0 0      1249      1249
yellow wavelet2           5 1       0 0       615       615

Voila! After piping through grep (with -v to invert the matches), we get a sorted list of indices without marvel cluttering it up.
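
If you would rather do the same thing inside a script than in a shell pipeline, the sort and grep steps translate directly. The following Python sketch works on a few rows copied from the output above; it assumes the same eight-column layout, and `largest_indices` is a hypothetical helper name.

```python
# A few rows copied (out of order) from the ?bytes=b output above.
sample = """\
yellow wavelet            5 1    5979 0  54815185  54815185
yellow .marvel-2014.08.19 1 1  263878 0 160777194 160777194
yellow test_names         5 1 3476004 0 376324705 376324705
yellow bank               1 1      16 0      4303      4303
"""

def largest_indices(text, exclude=None):
    """Return (index name, size in bytes) pairs, largest first.

    Mirrors `sort -rnk8`, optionally preceded by `grep -v <exclude>`:
    any line containing the `exclude` substring is dropped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if exclude:
        lines = [ln for ln in lines if exclude not in ln]
    rows = [ln.split() for ln in lines]
    # Column 8 (index 7) holds the size in bytes.
    rows.sort(key=lambda cols: int(cols[7]), reverse=True)
    return [(cols[1], int(cols[7])) for cols in rows]

for name, size in largest_indices(sample, exclude="marvel"):
    print(name, size)
# test_names 376324705
# wavelet 54815185
# bank 4303
```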

This is just a simple example of the flexibility of Cat at the command line. Once you get used to using Cat, you’ll see it like any other *nix tool and start going crazy with piping, sorting, and grepping. If you are a system admin and spend any length of time ssh’d into boxes, definitely spend some time getting familiar with the Cat API.