Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP API metrics #1969

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

HTTP API metrics #1969

wants to merge 4 commits into from

Conversation

arkgil
Copy link
Contributor

@arkgil arkgil commented Jul 9, 2018

The goal of this PR is to introduce metrics to the HTTP API ("backend", "frontend", whatever). It consists of two parts:

  • two Cowboy middleware modules responsible for recording metrics (mongoose_cowboy_metrics_mw_before and mongoose_cowboy_metrics_mw_after)
  • little code in ejabberd_cowboy responsible for assembling names of metrics and defining them (currently metrics are opt-in)

For each handler (or rather for each prefix) "types" of metrics are recorded:

  • request count per HTTP method (suffix ..., Method, request, count]
  • response count per HTTP method x status class (1XX, 2XX etc.) (suffix ..., Method, response, Class, count])
  • response latency per HTTP method x status class (suffix ..., Method, response, Class, latency])

These metrics together cover 3 of 4 golden signals.

Metrics need to be enabled by providing {metrics, true} option to ejabberd_cowboy. All defined metrics are global. By default they are named like [http] ++ lists:flatten([HandlerBaseName]) ++ MetricTypeSuffix].

The HandlerBaseName part is the handler's module name by default. The handler's definition tuple can now contain a fifth element, which is not passed to handler but is interpreted by ejabberd_cowboy. This element is a list of options, and currently the only supported one is metrics:

  • it can be set to false, in which case the metrics for the handler won't be created
  • it can be a list with {prefix, Prefix} option, in which case Prefix becomes a HandlerBaseName

There is a small issue with this design - if one configured the same handler multiple times under different paths, metrics for both these handlers will have the same name (which obviously might skew the results if the handlers are configured differently). In such cases when we have duplicated metric names we could:

  • log a warning (per @michalwski 's suggestion)
  • raise an error, in which case the listener won't be started

I haven't decided which option to choose yet.

Oh, and recording of metrics can be disabled by executing cowboy:set_env(Ref, record_metrics, false).

TODOs:

  • documentation for ejabberd_cowboy's new options
  • choose what to do in case of duplicate metric names
  • tests for ejabberd_cowboy with metrics
  • new metrics should be added when routes are being reloaded (we shouldn't delete the old ones)

@arkgil arkgil force-pushed the cowboy-metrics-middleware branch from 942845e to 9b5a91f Compare July 9, 2018 09:25
@arkgil arkgil requested review from michalwski and fenek July 9, 2018 09:30
@mongoose-im
Copy link
Collaborator

mongoose-im commented Jul 9, 2018

5077.1 / Erlang 19.3 / small_tests / fa7b041
Reports root / small


5077.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / fa7b041
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0


5077.5 / Erlang 19.3 / ldap_mnesia / fa7b041
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0


5077.3 / Erlang 19.3 / mysql_redis / fa7b041
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0


5077.2 / Erlang 19.3 / internal_mnesia / fa7b041
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0


5077.4 / Erlang 19.3 / odbc_mssql_mnesia / fa7b041
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0


5077.8 / Erlang 20.0 / pgsql_mnesia / fa7b041
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0


5077.9 / Erlang 21.0 / riak_mnesia / fa7b041
Reports root/ big / small
OK: 1282 / Failed: 1 / User-skipped: 43 / Auto-skipped: 0

mod_global_distrib_SUITE:multi_connection:test_muc_conversation_history
{error,{{assertEqual,[{module,mod_global_distrib_SUITE},
            {line,418},
            {expression,"exml_query : attr ( AlicePresence , << \"from\" >> )"},
            {expected,<<"[email protected]/alicE4.776981">>},
            {value,<<"[email protected]/eve5.71013">>}]},
    [{mod_global_distrib_SUITE,'-test_muc_conversation_history/1-fun-1-',
                   3,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,418}]},
     {mod_global_distrib_SUITE,'-test_muc_conversation_history/1-fun-5-',
                   3,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,417}]},
     {escalus_story,story,4,
            [{file,"/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,72}]},
     {mod_global_distrib_SUITE,test_muc_conversation_history,1,
                   [{file,"mod_global_distrib_SUITE.erl"},
                  {line,390}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1545}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1063}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,995}]}]}}

Report log


5077.8 / Erlang 20.0 / pgsql_mnesia / fa7b041
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0


5077.9 / Erlang 21.0 / riak_mnesia / fa7b041
Reports root/ big / small
OK: 1284 / Failed: 1 / User-skipped: 43 / Auto-skipped: 0

offline_SUITE:mod_offline_tests:max_offline_messages_reached
{error,
  {timeout_when_waiting_for_stanza,
    [{escalus_client,wait_for_stanza,
       [{client,<<"bOb8.11573@localhost/res1">>,escalus_tcp,<0.8734.1>,
          [{event_manager,<0.8708.1>},
           {server,<<"localhost">>},
           {username,<<"bOb8.11573">>},
           {resource,<<"res1">>}],
          [{event_client,
             [{event_manager,<0.8708.1>},
            {server,<<"localhost">>},
            {username,<<"bOb8.11573">>},
            {resource,<<"res1">>}]},
           {resource,<<"res1">>},
           {username,<<"bOb8.11573">>},
           {server,<<"localhost">>},
           {host,<<"localhost">>},
           {port,5222},
           {auth,{escalus_auth,auth_plain}},
           {wspath,undefined},
           {username,<<"bOb8.11573">>},
           {server,<<"localhost">>},
           {password,<<"makrolika">>},
           {stream_id,<<"3D08F955361D3E89">>}]},
        5000],
       [{file,
          "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
        {line,138}]},
     {offline_SUITE,'-max_offline_messages_reached/1-fun-1-',6,
       [{file,"offline_SUITE.erl"},{line,112}]},
     {escalus_story,story,4,
       [{file,
          "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
        {line,72}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1545}]},
     {test_server,run_test_case_eval1,6,
       [{file,"test_server.erl"},{line,1063}]},...

Report log

%%
%% This middleware needs to run after `mongoose_cowboy_metrics_mw_before' and `cowboy_router'
%% middleware. However, it does not need to run after `cowboy_hander' middleware because metrics are
%% updated in `onresponse' callback which is called whenever the request is sent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be whenever the response is sent instead of request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks!

Method = get_req_method(Req),
HandlerToPrefixMappings = proplists:get_value(handler_to_metric_prefix, Env, #{}),
Prefix = maps:get(Handler, HandlerToPrefixMappings, [Handler]),
mongoose_metrics:update(global, mongoose_cowboy_metrics:request_count_metric(Prefix, Method), 1),
Copy link
Contributor

@michalwski michalwski Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move the request counter update to the before middleware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't do that 😢 "Before" middleware doesn't know which handler will serve the request, because cowboy_router has not been run yet.

On the other hand we can't merge the middleware, because in case of no matching routes cowboy_router terminates the middleware chain. In those cases we'd lose even the request count metric (currently we lose response count and latency).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for explanation :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalwski sorry, I was wrong. We can't report the metric until we know the handler, which means the reporting needs to occur after cowboy_router. However, if we're willing to not report the time which it takes for cowboy_router to execute (i.e. only measure the run time of the handler itself) then we could merge the middleware into one which could be placed after cowboy_router.

?assertNotEqual(0, proplists:get_value(value, DataPoints)).

run_requests(Methods, Statuses) ->
{ok, Conn} = gun:open("localhost", ranch:get_port(?LISTENER)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In small tests we already use fusco when there is a need to send an HTTP request. It'd better to re-use what is already used, in my opinion. I do agree that there are better libs than fusco. At some point it will be replaced by sth else (gun or sth else).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢

@mongoose-im
Copy link
Collaborator

mongoose-im commented Jul 9, 2018

5078.1 / Erlang 19.3 / small_tests / b7bffef
Reports root / small


5078.2 / Erlang 19.3 / internal_mnesia / b7bffef
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0


5078.3 / Erlang 19.3 / mysql_redis / b7bffef
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0


5078.4 / Erlang 19.3 / odbc_mssql_mnesia / b7bffef
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0


5078.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / b7bffef
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0


5078.5 / Erlang 19.3 / ldap_mnesia / b7bffef
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0


5078.9 / Erlang 21.0 / riak_mnesia / b7bffef
Reports root/ big / small
OK: 1279 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0


5078.8 / Erlang 20.0 / pgsql_mnesia / b7bffef
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0

@codecov
Copy link

codecov bot commented Jul 9, 2018

Codecov Report

Merging #1969 into master will decrease coverage by 6.58%.
The diff coverage is 12.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1969      +/-   ##
==========================================
- Coverage   74.91%   68.32%   -6.59%     
==========================================
  Files         313      315       +2     
  Lines       28459    28468       +9     
==========================================
- Hits        21319    19450    -1869     
- Misses       7140     9018    +1878
Impacted Files Coverage Δ
src/mongoose_cowboy_metrics_mw_after.erl 0% <0%> (ø)
src/mongoose_cowboy_metrics.erl 0% <0%> (ø)
src/mongoose_cowboy_metrics_mw_before.erl 0% <0%> (ø)
src/ejabberd_app.erl 82.29% <100%> (-4.17%) ⬇️
src/ejabberd_cowboy.erl 64.76% <23.33%> (-23.7%) ⬇️
src/mod_auth_token_odbc.erl 0% <0%> (-100%) ⬇️
src/mam/mam_message_xml.erl 0% <0%> (-100%) ⬇️
src/inbox/mod_inbox_odbc_pgsql.erl 0% <0%> (-100%) ⬇️
src/mod_offline_riak.erl 0% <0%> (-96.43%) ⬇️
src/mod_last_riak.erl 0% <0%> (-95.24%) ⬇️
... and 112 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8dd8a1b...e18e89c. Read the comment docs.

@mongoose-im
Copy link
Collaborator

mongoose-im commented Jul 9, 2018

5080.1 / Erlang 19.3 / small_tests / 7be9fc1
Reports root / small


5080.5 / Erlang 19.3 / ldap_mnesia / 7be9fc1
Reports root/ big
OK: 1024 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0


5080.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / 7be9fc1
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0


5080.3 / Erlang 19.3 / mysql_redis / 7be9fc1
Reports root/ big
OK: 2799 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0


5080.2 / Erlang 19.3 / internal_mnesia / 7be9fc1
Reports root/ big
OK: 1060 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0


5080.4 / Erlang 19.3 / odbc_mssql_mnesia / 7be9fc1
Reports root/ big
OK: 2813 / Failed: 0 / User-skipped: 208 / Auto-skipped: 0


5080.8 / Erlang 20.0 / pgsql_mnesia / 7be9fc1
Reports root/ big / small
OK: 2845 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0


5080.9 / Erlang 21.0 / riak_mnesia / 7be9fc1
Reports root/ big / small
OK: 1279 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0

%% See the License for the specific language governing permissions and
%% limitations under the License.
%%
%% @doc Cowboy middleware updating metrics related to request handling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be even better to have all of this information in a markdown docs. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document each MongooseIM module in markdown docs? Not only the ones which are configurable by users? I mean, I know that one can configure this middleware in ejabberd_cowboy's protocol opts, but it's rather aimed at MongooseIM's developers, and not operators.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics names and meaning must be described in user docs. The module role description may remain here.

ejabberd_listener:start_listeners(),
mongoose_metrics:init(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there some kind of problem with old order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, mongoose_metrics:init() subscribes all reporters to all metrics which are running at the moment of its start. Since ejabberd_cowboy is started as one of the listeners, it would define metrics after the reporters were subscribed, which means that no reporter reported HTTP API metrics.

when Acc :: #{module() => mongoose_cowboy_metrics:prefix()}.
build_metric_prefixes(_BasePrefix, [], Acc) ->
Acc;
build_metric_prefixes(BasePrefix, [{_Host, _Path, Handler, _HandlerOpts, Opts} | Tail], Acc) ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these metrics are that valuable with no path distinction. Various actions triggered by different paths (and their :parameters) may have drastically different impact on latency stats (if they all apply e.g. to 200 responses to GETs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that in general case it won't work well. However, in our HTTP API handlers usually do the exact same thing given some HTTP method (i.e. it doesn't matter what was the path or parameters).

The other solution would be hand-crafted metrics, different for each handler (i.e. we explicitly report metrics from each HTTP module). However, we would not get any metrics in case the handler has crashed.

Maybe the metrics for each handler in a single ejabberd_cowboy's scope should opt-in as well? This way we can turn them on only for endpoints which we know it makes sense to have these kinds of metrics for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understand it's good enough until someone actually needs a more fine-grained names. :)

@mongoose-im
Copy link
Collaborator

mongoose-im commented Jul 17, 2018

5164.1 / Erlang 19.3 / small_tests / 741de3b
Reports root / small


5164.3 / Erlang 19.3 / mysql_redis / 741de3b
Reports root/ big
OK: 2803 / Failed: 0 / User-skipped: 222 / Auto-skipped: 0


5164.5 / Erlang 19.3 / ldap_mnesia / 741de3b
Reports root/ big
OK: 1028 / Failed: 0 / User-skipped: 77 / Auto-skipped: 0


5164.6 / Erlang 19.3 / elasticsearch_and_cassandra_mnesia / 741de3b
Reports root/ big
OK: 445 / Failed: 0 / User-skipped: 8 / Auto-skipped: 0


5164.2 / Erlang 19.3 / internal_mnesia / 741de3b
Reports root/ big
OK: 1064 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0


5164.4 / Erlang 19.3 / odbc_mssql_mnesia / 741de3b
Reports root/ big
OK: 2822 / Failed: 1 / User-skipped: 208 / Auto-skipped: 0

offline_SUITE:mod_offline_tests:headline_message_is_not_stored
{error,
  {{assertion_failed,assert,is_presence,
     {xmlel,<<"message">>,
       [{<<"from">>,<<"alicE46.330326@localhost/res1">>},
        {<<"to">>,<<"bOb46.330326@localhost/res1">>},
        {<<"xml:lang">>,<<"en">>},
        {<<"type">>,<<"headline">>}],
       [{xmlel,<<"body">>,[],[{xmlcdata,<<"msgtxt">>}]}]},
     "<message from='alicE46.330326@localhost/res1' to='bOb46.330326@localhost/res1' xml:lang='en' type='headline'><body>msgtxt</body></message>"},
   [{escalus_new_assert,assert_true,2,
      [{file,
         "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_new_assert.erl"},
       {line,84}]},
    {offline_SUITE,login_send_and_receive_presence,2,
      [{file,"offline_SUITE.erl"},{line,171}]},
    {offline_SUITE,'-headline_message_is_not_stored/1-fun-0-',3,
      [{file,"offline_SUITE.erl"},{line,95}]},
    {escalus_story,story,4,
      [{file,
         "/home/travis/build/esl/MongooseIM/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
       {line,72}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1529}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1045}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,977}]}]}}

Report log


5164.8 / Erlang 20.0 / pgsql_mnesia / 741de3b
Reports root/ big / small
OK: 2849 / Failed: 0 / User-skipped: 176 / Auto-skipped: 0


5164.9 / Erlang 21.0 / riak_mnesia / 741de3b
Reports root/ big / small
OK: 1283 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0

@@ -130,8 +130,9 @@ do_start_cowboy(Ref, Opts) ->
TransportOpts = gen_mod:get_opt(transport_options, Opts, []),
Modules = gen_mod:get_opt(modules, Opts),
Dispatch = cowboy_router:compile(get_routes(Modules)),
ProtocolOpts = [{env, [{dispatch, Dispatch}]} |
gen_mod:get_opt(protocol_options, Opts, [])],
{MetricsEnv, MetricsProtoOpts} = maybe_init_metrics(Opts),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe we can move whole metrics logic to separate module ?

@fenek fenek added the WIP 🚧 label Jul 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants