- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Drawbacks
- Alternatives
- Infrastructure Needed (optional)
- Enhancement issue in release milestone, which links to pull request in [keylime/enhancements]
- Core members have approved the issue with the label `implementable`
- Design details are appropriately documented
- Test plan is in place
- User-facing documentation has been created in [keylime/keylime-docs]
Should someone restart an agent-based server or force an agent offline, the agent will no longer be monitored by the verifier. Upon starting, the agent will only register with the registrar, and IMA monitoring will cease.
This behavior was originally discussed on the keylime mailing list
It's acceptable that someone may want to manually restart a server (or that the
server restarts as part of an automated workflow) while retaining the
configuration set up during the initial "adding" of the agent to the verifier
(`allowlist`, `tpm_policy`). They should not have to add the agent to (or
update) the verifier again each time if there is no change in configuration or
trust mapping (e.g. software CA).
A user restarts the agent on a target node. When the agent becomes active again, the verifier recommences monitoring the delegated measurements set when the target agent was first added to the verifier and registrar.
Any sort of migration or fault redundancy (although both areas benefit from this change)
A target machine is rebooted with no change in state (measured properties). This machine should not require "re-adding" with the keylime tenant.
Once the target node / agent returns to an online / reachable state, the verifier should proceed to recommence run time monitoring.
A new tornado web handler will be created within the verifier to listen for requests that an agent will emit when it (re)starts.
Code will be introduced within the agent that will perform a `POST` request to
inform the verifier an agent has been (re)started. This in turn will cause the
verifier to perform an `operational_state` query for the `UUID` of that agent
and then proceed to perform run time integrity monitoring again.
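The agent side of this exchange could look roughly like the sketch below. It only builds the request and does not send it. The verifier address, the JSON payload shape and the `agent_id` field are assumptions made for illustration; the `/nudge` path is the working name used in the design details of this proposal.

```python
import json
import urllib.request

# Hypothetical values: the proposal fixes neither the verifier address
# nor the payload format.
VERIFIER_URL = "https://verifier.example:8881"
NUDGE_PATH = "/nudge"  # working name from this proposal, subject to review


def build_restart_notification(agent_uuid, verifier_url=VERIFIER_URL):
    """Build (but do not send) the POST that announces an agent (re)start."""
    body = json.dumps({"agent_id": agent_uuid}).encode("utf-8")
    return urllib.request.Request(
        url=verifier_url + NUDGE_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_restart_notification("example-agent-uuid")
print(request.get_method(), request.full_url)
```

On startup the agent would pass the request to `urllib.request.urlopen` (or the HTTP client it already uses) once its own web service is listening, so the verifier can reach it immediately afterwards.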
My server reboots for any given reason. Keylime handles this event and provides trust monitoring once the server and agent are back online and can be reached by the verifier.
Should the machine's state have been tampered with during the offline period, Keylime will immediately fail the target node accordingly (or likewise show the machine is still in the expected trust state according to the delegated measurements).
If I want to change measurements, I use the existing `update` command available
in the Keylime Tenant CLI.
We should be sure we do not introduce security risks and be mindful of future enhancements such as multi tenancy, auth and migration.
A new Tornado web handler is created within the verifier to listen for requests
that an agent will emit when it starts. We will call this `/nudge` for now, with
a more suitable name to be agreed within this review.
A new `operational_state` named `OFFLINE` will be created for when a machine
becomes unreachable during a `GET_QUOTE` `operational_state`. This state will be
set once the agent fails to respond during its retry query period set within
the `keylime.conf` configuration file.
A new database row will need to be introduced for the `OFFLINE` `operational_state`.
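A minimal sketch of how the verifier might park an agent in `OFFLINE` once its retries are exhausted. The enum representation, the `poll_agent` helper and the `fetch_quote` callable are all hypothetical; `max_retries` stands in for the retry setting read from `keylime.conf`.

```python
from enum import Enum
from typing import Callable


class OperationalState(Enum):
    # Only the two states this proposal discusses. The names follow the
    # text; the values are placeholders, not Keylime's real encoding.
    GET_QUOTE = "GET_QUOTE"
    OFFLINE = "OFFLINE"


def poll_agent(fetch_quote: Callable[[], bool], max_retries: int) -> OperationalState:
    """Try to fetch a quote up to max_retries times.

    fetch_quote is a hypothetical callable that returns True when the
    agent answered the quote request.
    """
    for _ in range(max_retries):
        if fetch_quote():
            return OperationalState.GET_QUOTE  # agent reachable, keep polling
    return OperationalState.OFFLINE  # retries exhausted, park the agent


# An agent that never answers ends up parked.
print(poll_agent(lambda: False, max_retries=3))
```

Keeping `OFFLINE` as an explicit state, rather than retrying forever, means the verifier stops spending work on an unreachable node until that node announces itself again.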
Code will be introduced to the agent that will perform a `POST` request to
`/nudge` to inform the verifier an agent has been (re)started. This in turn will
instruct the verifier to perform an `operational_state` query for the `UUID` of
the concerned agent. Should the `operational_state` be `OFFLINE`, it will
change the `operational_state` to `GET_QUOTE` and proceed to (re)start continuous
monitoring of the node with the previously set measurements (`whitelist`,
`tpm_policy`).
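The verifier-side decision described above can be sketched as a plain function, independent of the web framework. The in-memory dictionary stands in for the verifier database, and the HTTP status codes and messages are assumptions, not agreed behavior.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the verifier's agent table:
# agent UUID -> operational_state name.
AgentTable = Dict[str, str]


def handle_nudge(agents: AgentTable, agent_uuid: str) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for a nudge from agent_uuid."""
    state = agents.get(agent_uuid)
    if state is None:
        # The agent was never added to this verifier.
        return 404, "unknown agent"
    if state != "OFFLINE":
        # Nothing to resume; leave the current state alone.
        return 200, "no state change"
    # Flip the state only; the stored measurements are left untouched.
    agents[agent_uuid] = "GET_QUOTE"
    return 200, "monitoring resumed"


agents = {"agent-1": "OFFLINE"}
print(handle_nudge(agents, "agent-1"), agents["agent-1"])
```

In the verifier this logic would be called from the `post` method of the new Tornado handler, with the dictionary lookup replaced by the real database query. Only acting on `OFFLINE` agents keeps a stray or repeated nudge from disturbing an agent that is already being monitored or has failed.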
No immediate changes come to mind, but we should be mindful of this as the design evolves.
We will need to assess changes required within our TPM communications. For
example, the agent calls `tpm_startup -c` and takes ownership of the TPM
every time it starts. The AK handle is also flushed.
We may need to consider having some sort of flag that the agent queries to establish that it is already associated with a verifier.
Rather than bootstrapping itself as a fresh agent, it retains its TPM
set up and just instantiates its web service to allow REST API
interactions with the verifier again. These interactions will be a continuation
of the previous quote `GET` requests from the verifier, while retaining the
existing root of trust already set up by the registrar (EKpub and AKpub).
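One way to picture the flag is a small piece of persisted state that the agent checks before deciding how to start. The sketch below illustrates only the control flow: the JSON file, its field names and the `resume`/`bootstrap` labels are hypothetical, and where such a flag should actually live is left open by this proposal.

```python
import json
import os
import tempfile


def load_enrollment(path):
    """Return saved enrollment data, or None if the agent is fresh."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None


def save_enrollment(path, agent_uuid):
    """Record that this agent has been associated with a verifier."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"agent_id": agent_uuid, "enrolled": True}, fh)


def startup_mode(path):
    """'resume' keeps the existing TPM setup; 'bootstrap' starts fresh."""
    state = load_enrollment(path)
    return "resume" if state and state.get("enrolled") else "bootstrap"


with tempfile.TemporaryDirectory() as workdir:
    flag_path = os.path.join(workdir, "enrollment.json")
    first_start = startup_mode(flag_path)   # no flag yet
    save_enrollment(flag_path, "example-agent-uuid")
    second_start = startup_mode(flag_path)  # flag present
    print(first_start, second_start)
```

In `resume` mode the agent would skip the TPM startup and ownership steps and go straight to serving quote requests; in `bootstrap` mode it would behave as it does today.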
Functional tests will be needed to play out the use case of restarting an agent, persisting state and reestablishing measurements upon its restart.
Unit tests will be needed to test the new `nudge` API functionality.
May need to consider impact of upgrading with an agent offline and then the new TPM code changes interacting with the TPM setup from the previous release.
TBD
We evolve the retry handler in the verifier to wait for indefinite periods instead of having a wake up API. This is hazardous, as we risk bottlenecks and need to consider managing more state (for example, a node goes offline and never returns).
Some changes may be needed to Travis CI, but none are expected currently.
No new repos required.