HTTP API metrics #1969

arkgil · 2018-07-09T09:22:16Z

The goal of this PR is to introduce metrics to the HTTP API ("backend", "frontend", whatever). It consists of two parts:

two Cowboy middleware modules responsible for recording metrics (mongoose_cowboy_metrics_mw_before and mongoose_cowboy_metrics_mw_after)
little code in ejabberd_cowboy responsible for assembling names of metrics and defining them (currently metrics are opt-in)

For each handler (or rather for each prefix) "types" of metrics are recorded:

request count per HTTP method (suffix ..., Method, request, count]
response count per HTTP method x status class (1XX, 2XX etc.) (suffix ..., Method, response, Class, count])
response latency per HTTP method x status class (suffix ..., Method, response, Class, latency])

These metrics together cover 3 of 4 golden signals.

Metrics need to be enabled by providing {metrics, true} option to ejabberd_cowboy. All defined metrics are global. By default they are named like [http] ++ lists:flatten([HandlerBaseName]) ++ MetricTypeSuffix].

The HandlerBaseName part is the handler's module name by default. The handler's definition tuple can now contain a fifth element, which is not passed to handler but is interpreted by ejabberd_cowboy. This element is a list of options, and currently the only supported one is metrics:

it can be set to false, in which case the metrics for the handler won't be created
it can be a list with {prefix, Prefix} option, in which case Prefix becomes a HandlerBaseName

There is a small issue with this design - if one configured the same handler multiple times under different paths, metrics for both these handlers will have the same name (which obviously might skew the results if the handlers are configured differently). In such cases when we have duplicated metric names we could:

log a warning (per @michalwski 's suggestion)
raise an error, in which case the listener won't be started

I haven't decided which option to choose yet.

Oh, and recording of metrics can be disabled by executing cowboy:set_env(Ref, record_metrics, false).

TODOs:

documentation for ejabberd_cowboy's new options
choose what to do in case of duplicate metric names
tests for ejabberd_cowboy with metrics
new metrics should be added when routes are being reloaded (we shouldn't delete the old ones)

mongoose-im · 2018-07-09T09:33:35Z

5077.1 / Erlang 19.3 / small_tests / fa7b041
Reports root / small

5077.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / fa7b041
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0

5077.5 / Erlang 19.3 / ldap_mnesia / fa7b041
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0

5077.3 / Erlang 19.3 / mysql_redis / fa7b041
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0

5077.2 / Erlang 19.3 / internal_mnesia / fa7b041
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0

5077.4 / Erlang 19.3 / odbc_mssql_mnesia / fa7b041
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0

5077.8 / Erlang 20.0 / pgsql_mnesia / fa7b041
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

5077.9 / Erlang 21.0 / riak_mnesia / fa7b041
Reports root/ big / small
OK: 1282 / Failed: 1 / User-skipped: 43 / Auto-skipped: 0

mod_global_distrib_SUITE:multi_connection:test_muc_conversation_history

{error,{{assertEqual,[{module,mod_global_distrib_SUITE},
            {line,418},
            {expression,"exml_query : attr ( AlicePresence , << \"from\" >> )"},
            {expected,<<"[email protected]/alicE4.776981">>},
            {value,<<"[email protected]/eve5.71013">>}]},
    [{mod_global_distrib_SUITE,'-test_muc_conversation_history/1-fun-1-',
                   3,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,418}]},
     {mod_global_distrib_SUITE,'-test_muc_conversation_history/1-fun-5-',
                   3,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,417}]},
     {escalus_story,story,4,
            [{file,"/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,72}]},
     {mod_global_distrib_SUITE,test_muc_conversation_history,1,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,390}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1545}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1063}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,995}]}]}}

Report log

5077.8 / Erlang 20.0 / pgsql_mnesia / fa7b041
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

5077.9 / Erlang 21.0 / riak_mnesia / fa7b041
Reports root/ big / small
OK: 1284 / Failed: 1 / User-skipped: 43 / Auto-skipped: 0

offline_SUITE:mod_offline_tests:max_offline_messages_reached

{error,
  {timeout_when_waiting_for_stanza,
    [{escalus_client,wait_for_stanza,
       [{client,<<"bOb8.11573@localhost/res1">>,escalus_tcp,<0.8734.1>,
          [{event_manager,<0.8708.1>},
           {server,<<"localhost">>},
           {username,<<"bOb8.11573">>},
           {resource,<<"res1">>}],
          [{event_client,
             [{event_manager,<0.8708.1>},
            {server,<<"localhost">>},
            {username,<<"bOb8.11573">>},
            {resource,<<"res1">>}]},
           {resource,<<"res1">>},
           {username,<<"bOb8.11573">>},
           {server,<<"localhost">>},
           {host,<<"localhost">>},
           {port,5222},
           {auth,{escalus_auth,auth_plain}},
           {wspath,undefined},
           {username,<<"bOb8.11573">>},
           {server,<<"localhost">>},
           {password,<<"makrolika">>},
           {stream_id,<<"3D08F955361D3E89">>}]},
        5000],
       [{file,
          "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
        {line,138}]},
     {offline_SUITE,'-max_offline_messages_reached/1-fun-1-',6,
       [{file,"offline_SUITE.erl"},{line,112}]},
     {escalus_story,story,4,
       [{file,
          "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
        {line,72}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1545}]},
     {test_server,run_test_case_eval1,6,
       [{file,"test_server.erl"},{line,1063}]},...

Report log

michalwski · 2018-07-09T09:45:22Z

src/mongoose_cowboy_metrics_mw_after.erl

+%%
+%% This middleware needs to run after `mongoose_cowboy_metrics_mw_before' and  `cowboy_router'
+%% middleware. However, it does not need to run after `cowboy_hander' middleware because metrics are
+%% updated in `onresponse' callback which is called whenever the request is sent.


Should there be whenever the response is sent instead of request?

Yes, thanks!

michalwski · 2018-07-09T09:49:47Z

src/mongoose_cowboy_metrics_mw_after.erl

+            Method = get_req_method(Req),
+            HandlerToPrefixMappings = proplists:get_value(handler_to_metric_prefix, Env, #{}),
+            Prefix = maps:get(Handler, HandlerToPrefixMappings, [Handler]),
+            mongoose_metrics:update(global, mongoose_cowboy_metrics:request_count_metric(Prefix, Method), 1),


Would it make sense to move the request counter update to the before middleware?

Unfortunately we can't do that 😢 "Before" middleware doesn't know which handler will serve the request, because cowboy_router has not been run yet.

On the other hand we can't merge the middleware, because in case of no matching routes cowboy_router terminates the middleware chain. In those cases we'd lose even the request count metric (currently we lose response count and latency).

Ok, thanks for explanation :)

@michalwski sorry, I was wrong. We can't report the metric until we know the handler, which means the reporting needs to occur after cowboy_router. However, if we're willing to not report the time which it takes for cowboy_router to execute (i.e. only measure the run time of the handler itself) then we could merge the middleware into one which could be placed after cowboy_router.

michalwski · 2018-07-09T09:55:06Z

test/mongoose_cowboy_metrics_SUITE.erl

+    ?assertNotEqual(0, proplists:get_value(value, DataPoints)).
+
+run_requests(Methods, Statuses) ->
+    {ok, Conn} = gun:open("localhost", ranch:get_port(?LISTENER)),


In small tests we already use fusco when there is a need to send an HTTP request. It'd better to re-use what is already used, in my opinion. I do agree that there are better libs than fusco. At some point it will be replaced by sth else (gun or sth else).

mongoose-im · 2018-07-09T10:06:17Z

5078.1 / Erlang 19.3 / small_tests / b7bffef
Reports root / small

5078.2 / Erlang 19.3 / internal_mnesia / b7bffef
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0

5078.3 / Erlang 19.3 / mysql_redis / b7bffef
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0

5078.4 / Erlang 19.3 / odbc_mssql_mnesia / b7bffef
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0

5078.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / b7bffef
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0

5078.5 / Erlang 19.3 / ldap_mnesia / b7bffef
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0

5078.9 / Erlang 21.0 / riak_mnesia / b7bffef
Reports root/ big / small
OK: 1279 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0

5078.8 / Erlang 20.0 / pgsql_mnesia / b7bffef
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

codecov · 2018-07-09T11:44:00Z

Codecov Report

Merging #1969 into master will decrease coverage by 6.58%.
The diff coverage is 12.5%.

@@            Coverage Diff             @@
##           master    #1969      +/-   ##
==========================================
- Coverage   74.91%   68.32%   -6.59%     
==========================================
  Files         313      315       +2     
  Lines       28459    28468       +9     
==========================================
- Hits        21319    19450    -1869     
- Misses       7140     9018    +1878

Impacted Files	Coverage Δ
src/mongoose_cowboy_metrics_mw_after.erl	`0% <0%> (ø)`
src/mongoose_cowboy_metrics.erl	`0% <0%> (ø)`
src/mongoose_cowboy_metrics_mw_before.erl	`0% <0%> (ø)`
src/ejabberd_app.erl	`82.29% <100%> (-4.17%)`	⬇️
src/ejabberd_cowboy.erl	`64.76% <23.33%> (-23.7%)`	⬇️
src/mod_auth_token_odbc.erl	`0% <0%> (-100%)`	⬇️
src/mam/mam_message_xml.erl	`0% <0%> (-100%)`	⬇️
src/inbox/mod_inbox_odbc_pgsql.erl	`0% <0%> (-100%)`	⬇️
src/mod_offline_riak.erl	`0% <0%> (-96.43%)`	⬇️
src/mod_last_riak.erl	`0% <0%> (-95.24%)`	⬇️
... and 112 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8dd8a1b...e18e89c. Read the comment docs.

mongoose-im · 2018-07-09T11:55:58Z

5080.1 / Erlang 19.3 / small_tests / 7be9fc1
Reports root / small

5080.5 / Erlang 19.3 / ldap_mnesia / 7be9fc1
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0

5080.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / 7be9fc1
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0

5080.3 / Erlang 19.3 / mysql_redis / 7be9fc1
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0

5080.2 / Erlang 19.3 / internal_mnesia / 7be9fc1
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0

5080.4 / Erlang 19.3 / odbc_mssql_mnesia / 7be9fc1
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0

5080.8 / Erlang 20.0 / pgsql_mnesia / 7be9fc1
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

5080.9 / Erlang 21.0 / riak_mnesia / 7be9fc1
Reports root/ big / small
OK: 1279 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0

fenek · 2018-07-09T13:34:43Z

src/mongoose_cowboy_metrics_mw_after.erl

+%% See the License for the specific language governing permissions and
+%% limitations under the License.
+%%
+%% @doc Cowboy middleware updating metrics related to request handling


It would be even better to have all of this information in a markdown docs. :)

Should we document each MongooseIM module in markdown docs? Not only the ones which are configurable by users? I mean, I know that one can configure this middleware in ejabberd_cowboy's protocol opts, but it's rather aimed at MongooseIM's developers, and not operators.

What do you think?

The metrics names and meaning must be described in user docs. The module role description may remain here.

fenek · 2018-07-09T14:01:16Z

src/ejabberd_app.erl

    ejabberd_listener:start_listeners(),
+    mongoose_metrics:init(),


Was there some kind of problem with old order?

Yes, mongoose_metrics:init() subscribes all reporters to all metrics which are running at the moment of its start. Since ejabberd_cowboy is started as one of the listeners, it would define metrics after the reporters were subscribed, which means that no reporter reported HTTP API metrics.

fenek · 2018-07-09T14:24:11Z

src/ejabberd_cowboy.erl

+    when Acc :: #{module() => mongoose_cowboy_metrics:prefix()}.
+build_metric_prefixes(_BasePrefix, [], Acc) ->
+    Acc;
+build_metric_prefixes(BasePrefix, [{_Host, _Path, Handler, _HandlerOpts, Opts} | Tail], Acc) ->


I'm not sure if these metrics are that valuable with no path distinction. Various actions triggered by different paths (and their :parameters) may have drastically different impact on latency stats (if they all apply e.g. to 200 responses to GETs).

Yes, I agree that in general case it won't work well. However, in our HTTP API handlers usually do the exact same thing given some HTTP method (i.e. it doesn't matter what was the path or parameters).

The other solution would be hand-crafted metrics, different for each handler (i.e. we explicitly report metrics from each HTTP module). However, we would not get any metrics in case the handler has crashed.

Maybe the metrics for each handler in a single ejabberd_cowboy's scope should opt-in as well? This way we can turn them on only for endpoints which we know it makes sense to have these kinds of metrics for.

OK, I understand it's good enough until someone actually needs a more fine-grained names. :)

This way reporters will be automatically subscribed to all HTTP metrics.

mongoose-im · 2018-07-17T11:51:14Z

5164.1 / Erlang 19.3 / small_tests / 741de3b
Reports root / small

5164.3 / Erlang 19.3 / mysql_redis / 741de3b
Reports root/ big
OK: 2803 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0

5164.5 / Erlang 19.3 / ldap_mnesia / 741de3b
Reports root/ big
OK: 1028 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0

5164.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / 741de3b
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0

5164.2 / Erlang 19.3 / internal_mnesia / 741de3b
Reports root/ big
OK: 1064 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0

5164.4 / Erlang 19.3 / odbc_mssql_mnesia / 741de3b
Reports root/ big
OK: 2822 / Failed: 1 / User-skipped: 208 / Auto-skipped: 0

offline_SUITE:mod_offline_tests:headline_message_is_not_stored

{error,
  {{assertion_failed,assert,is_presence,
     {xmlel,<<"message">>,
       [{<<"from">>,<<"alicE46.330326@localhost/res1">>},
        {<<"to">>,<<"bOb46.330326@localhost/res1">>},
        {<<"xml:lang">>,<<"en">>},
        {<<"type">>,<<"headline">>}],
       [{xmlel,<<"body">>,[],[{xmlcdata,<<"msgtxt">>}]}]},
     "<message from='alicE46.330326@localhost/res1' to='bOb46.330326@localhost/res1' xml:lang='en' type='headline'><body>msgtxt</body></message>"},
   [{escalus_new_assert,assert_true,2,
      [{file,
         "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_new_assert.erl"},
       {line,84}]},
    {offline_SUITE,login_send_and_receive_presence,2,
      [{file,"offline_SUITE.erl"},{line,171}]},
    {offline_SUITE,'-headline_message_is_not_stored/1-fun-0-',3,
      [{file,"offline_SUITE.erl"},{line,95}]},
    {escalus_story,story,4,
      [{file,
         "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
       {line,72}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1529}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1045}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,977}]}]}}

Report log

5164.8 / Erlang 20.0 / pgsql_mnesia / 741de3b
Reports root/ big / small
OK: 2849 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

5164.9 / Erlang 21.0 / riak_mnesia / 741de3b
Reports root/ big / small
OK: 1283 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0

ludwikbukowski · 2018-07-18T15:33:07Z

src/ejabberd_cowboy.erl

@@ -130,8 +130,9 @@ do_start_cowboy(Ref, Opts) ->
    TransportOpts = gen_mod:get_opt(transport_options, Opts, []),
    Modules = gen_mod:get_opt(modules, Opts),
    Dispatch = cowboy_router:compile(get_routes(Modules)),
-    ProtocolOpts = [{env, [{dispatch, Dispatch}]} |
-                    gen_mod:get_opt(protocol_options, Opts, [])],
+    {MetricsEnv, MetricsProtoOpts} = maybe_init_metrics(Opts),


hmm maybe we can move whole metrics logic to separate module ?

arkgil force-pushed the cowboy-metrics-middleware branch from 942845e to 9b5a91f Compare July 9, 2018 09:25

arkgil requested review from michalwski and fenek July 9, 2018 09:30

michalwski reviewed Jul 9, 2018

View reviewed changes

fenek requested changes Jul 9, 2018

View reviewed changes

arkgil and others added 4 commits July 17, 2018 10:45

Implement Cowboy middleware for calculating metrics

2ae5d2e

Define HTTP metrics when starting ejabberd_cowboy

3b141a1

Init metric subscriptions after listeners are started

3bddde0

This way reporters will be automatically subscribed to all HTTP metrics.

Replace gun with fusco

e18e89c

kanes115 force-pushed the cowboy-metrics-middleware branch from b491188 to e18e89c Compare July 17, 2018 11:39

ludwikbukowski reviewed Jul 18, 2018

View reviewed changes

fenek added the WIP 🚧 label Jul 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTTP API metrics #1969

HTTP API metrics #1969

arkgil commented Jul 9, 2018

mongoose-im commented Jul 9, 2018 •

edited

Loading

michalwski Jul 9, 2018

arkgil Jul 9, 2018

michalwski Jul 9, 2018 •

edited

Loading

arkgil Jul 9, 2018

michalwski Jul 9, 2018

arkgil Jul 9, 2018

michalwski Jul 9, 2018

arkgil Jul 9, 2018

mongoose-im commented Jul 9, 2018 •

edited

Loading

codecov bot commented Jul 9, 2018 •

edited

Loading

mongoose-im commented Jul 9, 2018 •

edited

Loading

fenek Jul 9, 2018

arkgil Jul 10, 2018

fenek Jul 23, 2018

fenek Jul 9, 2018

arkgil Jul 10, 2018

fenek Jul 9, 2018

arkgil Jul 10, 2018

fenek Jul 23, 2018

mongoose-im commented Jul 17, 2018 •

edited

Loading

ludwikbukowski Jul 18, 2018

		ejabberd_listener:start_listeners(),
		mongoose_metrics:init(),

HTTP API metrics #1969

Are you sure you want to change the base?

HTTP API metrics #1969

Conversation

arkgil commented Jul 9, 2018

mongoose-im commented Jul 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michalwski Jul 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mongoose-im commented Jul 9, 2018 • edited Loading

codecov bot commented Jul 9, 2018 • edited Loading

Codecov Report

mongoose-im commented Jul 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mongoose-im commented Jul 17, 2018 • edited Loading

Choose a reason for hiding this comment

mongoose-im commented Jul 9, 2018 •

edited

Loading

michalwski Jul 9, 2018 •

edited

Loading

mongoose-im commented Jul 9, 2018 •

edited

Loading

codecov bot commented Jul 9, 2018 •

edited

Loading

mongoose-im commented Jul 9, 2018 •

edited

Loading

mongoose-im commented Jul 17, 2018 •

edited

Loading