From 677c9258f90321df480673445f1053dbf6d1b779 Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Fri, 12 Jul 2024 21:57:32 +0000
Subject: [PATCH] build based on 6ffa564

---
 dev/.documenter-siteinfo.json            |  2 +-
 dev/POMDPTools/beliefs/index.html        |  4 +-
 dev/POMDPTools/common_rl/index.html      |  4 +-
 dev/POMDPTools/distributions/index.html  |  4 +-
 dev/POMDPTools/index.html                |  2 +-
 dev/POMDPTools/model/index.html          | 10 +--
 dev/POMDPTools/policies/index.html       | 12 ++--
 dev/POMDPTools/simulators/index.html     | 18 ++---
 dev/POMDPTools/testing/index.html        |  4 +-
 dev/POMDPTools/visualization/index.html  |  4 +-
 dev/api/index.html                       | 40 +++++------
 dev/concepts/index.html                  |  2 +-
 dev/def_pomdp/index.html                 |  2 +-
 dev/def_solver/index.html                |  2 +-
 dev/def_updater/index.html               |  2 +-
 dev/example_defining_problems/index.html |  2 +-
 dev/example_gridworld_mdp/index.html     | 54 +++++++--------
 dev/example_simulations/index.html       | 84 ++++++++++++------------
 dev/example_solvers/index.html           | 18 ++---
 dev/examples/index.html                  |  2 +-
 dev/faq/index.html                       |  2 +-
 dev/gallery/index.html                   |  2 +-
 dev/get_started/index.html               |  2 +-
 dev/index.html                           |  2 +-
 dev/install/index.html                   |  2 +-
 dev/interfaces/index.html                |  2 +-
 dev/offline_solver/index.html            |  2 +-
 dev/online_solver/index.html             |  2 +-
 dev/policy_interaction/index.html        |  2 +-
 dev/run_simulation/index.html            |  2 +-
 dev/simulation/index.html                |  2 +-
 31 files changed, 147 insertions(+), 147 deletions(-)

diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
index 154f6a12..355f399d 100644
--- a/dev/.documenter-siteinfo.json
+++ b/dev/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-12T21:14:25","documenter_version":"1.5.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-12T21:57:27","documenter_version":"1.5.0"}}
\ No newline at end of file
diff --git a/dev/POMDPTools/beliefs/index.html b/dev/POMDPTools/beliefs/index.html
index 3b3e8f1a..6e9a48ea 100644
--- a/dev/POMDPTools/beliefs/index.html
+++ b/dev/POMDPTools/beliefs/index.html
@@ -1,7 +1,7 @@
-Implemented Belief Updaters · POMDPs.jl

Implemented Belief Updaters

POMDPTools provides the following generic belief updaters:

  • a discrete belief updater
  • a k previous observation updater
  • a previous observation updater
  • a nothing updater (for when the policy does not depend on any feedback)

For particle filters see ParticleFilters.jl.

Discrete (Bayesian Filter)

The DiscreteUpdater is a default implementation of a discrete Bayesian filter. The DiscreteBelief type is provided to represent discrete beliefs for discrete state POMDPs.

A convenience function uniform_belief is provided to create a DiscreteBelief with equal probability for each state.

POMDPTools.BeliefUpdaters.DiscreteBeliefType
DiscreteBelief

A belief specified by a probability vector.

Normalization of b is assumed in some calculations (e.g. pdf), but it is only automatically enforced in update(...), and a warning is given if normalized incorrectly in DiscreteBelief(pomdp, b).

Constructor

DiscreteBelief(pomdp, b::Vector{Float64}; check::Bool=true)

Fields

  • pomdp : the POMDP problem
  • state_list : a vector of ordered states
  • b : the probability vector
source

K Previous Observations

POMDPTools.BeliefUpdaters.KMarkovUpdaterType
KMarkovUpdater

Updater that stores the k most recent observations as the belief.

Example:

up = KMarkovUpdater(5)
+Implemented Belief Updaters · POMDPs.jl

Implemented Belief Updaters

POMDPTools provides the following generic belief updaters:

  • a discrete belief updater
  • a k previous observation updater
  • a previous observation updater
  • a nothing updater (for when the policy does not depend on any feedback)

For particle filters see ParticleFilters.jl.

Discrete (Bayesian Filter)

The DiscreteUpdater is a default implementation of a discrete Bayesian filter. The DiscreteBelief type is provided to represent discrete beliefs for discrete state POMDPs.

A convenience function uniform_belief is provided to create a DiscreteBelief with equal probability for each state.
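
As a rough illustration of what a discrete Bayesian filter computes, the update can be sketched in plain Julia. This is a minimal sketch and not the POMDPTools implementation: the function names `bayes_update` and `uniform`, and the matrix layouts `T[s, sp]` and `O[sp, o]`, are assumptions made only for this example.

```julia
# Minimal sketch of a discrete Bayes filter step for one fixed action.
# Assumed (hypothetical) layout for this example only:
#   T[s, sp] = Pr(sp | s, a)   transition probabilities for the action taken
#   O[sp, o] = Pr(o | a, sp)   observation probabilities for the action taken
function bayes_update(b::Vector{Float64}, T::Matrix{Float64},
                      O::Matrix{Float64}, o::Int)
    predicted = T' * b                    # sum over s of T[s, sp] * b[s]
    unnormalized = O[:, o] .* predicted   # weight each sp by the observation likelihood
    total = sum(unnormalized)
    total > 0 || error("observation has zero probability under the current belief")
    return unnormalized ./ total          # renormalize so the belief sums to 1
end

# Equal probability for each of n states.
uniform(n::Int) = fill(1.0 / n, n)

T = [1.0 0.0; 0.0 1.0]        # the state does not change
O = [0.85 0.15; 0.15 0.85]    # the observation matches the state 85% of the time
b0 = uniform(2)
b1 = bayes_update(b0, T, O, 1)
```

Starting from a uniform belief, observing `o = 1` shifts the probability mass toward state 1 in proportion to the observation likelihoods.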

POMDPTools.BeliefUpdaters.DiscreteBeliefType
DiscreteBelief

A belief specified by a probability vector.

Normalization of b is assumed in some calculations (e.g. pdf), but it is only automatically enforced in update(...), and a warning is given if normalized incorrectly in DiscreteBelief(pomdp, b).

Constructor

DiscreteBelief(pomdp, b::Vector{Float64}; check::Bool=true)

Fields

  • pomdp : the POMDP problem
  • state_list : a vector of ordered states
  • b : the probability vector
source

K Previous Observations

POMDPTools.BeliefUpdaters.KMarkovUpdaterType
KMarkovUpdater

Updater that stores the k most recent observations as the belief.

Example:

up = KMarkovUpdater(5)
 s0 = rand(rng, initialstate(pomdp))
 initial_observation = rand(rng, initialobs(pomdp, s0))
 initial_obs_vec = fill(initial_observation, 5)
 hr = HistoryRecorder(rng=rng, max_steps=100)
-hist = simulate(hr, pomdp, policy, up, initial_obs_vec, s0)
source

Previous Observation

Nothing Updater

+hist = simulate(hr, pomdp, policy, up, initial_obs_vec, s0)
source

Previous Observation

Nothing Updater

diff --git a/dev/POMDPTools/common_rl/index.html b/dev/POMDPTools/common_rl/index.html
index 40e06adf..2436a5c2 100644
--- a/dev/POMDPTools/common_rl/index.html
+++ b/dev/POMDPTools/common_rl/index.html
@@ -12,5 +12,5 @@
 m = convert(POMDP, env)
 planner = solve(POMCPSolver(), m)
 a = action(planner, initialstate(m))

You can also use the constructors listed below to manually convert between the interfaces.

Environment Wrapper Types

Since the standard reinforcement learning environment interface offers less information about the internal workings of the environment than the POMDPs.jl interface, MDPs and POMDPs created from these environments will have limited functionality. There are two types of (PO)MDP types that can wrap an environment:

Generative model wrappers

If the state and setstate! CommonRLInterface functions are provided, then the environment can be wrapped in a RLEnvMDP or RLEnvPOMDP and the POMDPs.jl generative model interface will be available.

Opaque wrappers

If the state and setstate! are not provided, then the resulting POMDP or MDP can only be simulated. This case is represented using the OpaqueRLEnvPOMDP and OpaqueRLEnvMDP wrappers. From the POMDPs.jl perspective, the state of the opaque (PO)MDP is just an integer wrapped in an OpaqueRLEnvState. This keeps track of the "age" of the environment so that POMDPs.jl actions that attempt to interact with the environment at a different age are invalid.

Constructors

Creating RL environments from MDPs and POMDPs

POMDPTools.CommonRLIntegration.MDPCommonRLEnvType
MDPCommonRLEnv(m, [s])
-MDPCommonRLEnv{RLO}(m, [s])

Create a CommonRLInterface environment from MDP m; optionally specify the state 's'.

The RLO parameter can be used to specify a type to convert the observation to. By default, this is AbstractArray. Use Any to disable conversion.

source
POMDPTools.CommonRLIntegration.POMDPCommonRLEnvType
POMDPCommonRLEnv(m, [s], [o])
-POMDPCommonRLEnv{RLO}(m, [s], [o])

Create a CommonRLInterface environment from POMDP m; optionally specify the state 's' and observation 'o'.

The RLO and RLS parameters can be used to specify types to convert the observation and state to. By default, this is AbstractArray. Use Any to disable conversion.

source

Creating MDPs and POMDPs from RL environments

POMDPTools.CommonRLIntegration.RLEnvMDPType
RLEnvMDP(env; discount=1.0)

Create an MDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.RLEnvPOMDPType
RLEnvPOMDP(env; discount=1.0)

Create a POMDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvMDPType
OpaqueRLEnvMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an MDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvPOMDPType
OpaqueRLEnvPOMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in a POMDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
+MDPCommonRLEnv{RLO}(m, [s])

Create a CommonRLInterface environment from MDP m; optionally specify the state 's'.

The RLO parameter can be used to specify a type to convert the observation to. By default, this is AbstractArray. Use Any to disable conversion.

source
POMDPTools.CommonRLIntegration.POMDPCommonRLEnvType
POMDPCommonRLEnv(m, [s], [o])
+POMDPCommonRLEnv{RLO}(m, [s], [o])

Create a CommonRLInterface environment from POMDP m; optionally specify the state 's' and observation 'o'.

The RLO and RLS parameters can be used to specify types to convert the observation and state to. By default, this is AbstractArray. Use Any to disable conversion.

source

Creating MDPs and POMDPs from RL environments

POMDPTools.CommonRLIntegration.RLEnvMDPType
RLEnvMDP(env; discount=1.0)

Create an MDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.RLEnvPOMDPType
RLEnvPOMDP(env; discount=1.0)

Create a POMDP by wrapping a CommonRLInterface.AbstractEnv. state and setstate! from CommonRLInterface must be provided, and the POMDPs generative model functionality will be provided.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvMDPType
OpaqueRLEnvMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in an MDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
POMDPTools.CommonRLIntegration.OpaqueRLEnvPOMDPType
OpaqueRLEnvPOMDP(env; discount=1.0)

Wrap a CommonRLInterface.AbstractEnv in a POMDP object. The state will be an OpaqueRLEnvState and only simulation will be supported.

source
diff --git a/dev/POMDPTools/distributions/index.html b/dev/POMDPTools/distributions/index.html
index e49c8b7f..06df77fa 100644
--- a/dev/POMDPTools/distributions/index.html
+++ b/dev/POMDPTools/distributions/index.html
@@ -1,5 +1,5 @@
-Implemented Distributions · POMDPs.jl

Implemented Distributions

POMDPTools contains several utility distributions to be used in the POMDPs transition and observation functions. These implement the appropriate methods of the functions in the distributions interface.

This package also supplies showdistribution for pretty printing distributions as unicode bar graphs to the terminal.

Sparse Categorical (SparseCat)

SparseCat is a sparse categorical distribution which is specified by simply providing a list of possible values (states or observations) and the probabilities corresponding to those particular objects.

Example: SparseCat([1,2,3], [0.1,0.2,0.7]) is a categorical distribution that assigns probability 0.1 to 1, 0.2 to 2, 0.7 to 3, and 0 to all other values.

POMDPTools.POMDPDistributions.SparseCatType
SparseCat(values, probabilities)

Create a sparse categorical distribution.

values is an iterable object containing the possible values (can be of any type) in the distribution that have nonzero probability. probabilities is an iterable object that contains the associated probabilities.

This is optimized for value iteration with a fast implementation of weighted_iterator. Both pdf and rand are order n.

source

Implicit

In situations where a distribution object is required, but the pdf is difficult to specify and only samples are required, ImplicitDistribution provides a convenient way to package a sampling function.

POMDPTools.POMDPDistributions.ImplicitDistributionType
ImplicitDistribution(sample_function, args...)

Define a distribution that can only be sampled from using rand, but has no explicit pdf.

Each time rand(rng, d::ImplicitDistribution) is called,

sample_function(args..., rng)

will be called to generate a new sample.

ImplicitDistribution is designed to be used with anonymous functions or the do syntax as follows:

Examples

ImplicitDistribution(rng->rand(rng)^2)
struct MyMDP <: MDP{Float64, Int} end
+Implemented Distributions · POMDPs.jl

Implemented Distributions

POMDPTools contains several utility distributions to be used in the POMDPs transition and observation functions. These implement the appropriate methods of the functions in the distributions interface.

This package also supplies showdistribution for pretty printing distributions as unicode bar graphs to the terminal.

Sparse Categorical (SparseCat)

SparseCat is a sparse categorical distribution which is specified by simply providing a list of possible values (states or observations) and the probabilities corresponding to those particular objects.

Example: SparseCat([1,2,3], [0.1,0.2,0.7]) is a categorical distribution that assigns probability 0.1 to 1, 0.2 to 2, 0.7 to 3, and 0 to all other values.
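
The example above can be tried at the REPL roughly as follows. This is a sketch that assumes both POMDPs.jl and POMDPTools are installed, and that `pdf` is available after loading POMDPs; no sampled value is shown because `rand` is random.

```julia
using POMDPs      # assumed to provide pdf for distributions
using POMDPTools  # provides SparseCat

d = SparseCat([1, 2, 3], [0.1, 0.2, 0.7])

p3 = pdf(d, 3)   # probability listed for the value 3
p4 = pdf(d, 4)   # 4 is not among the listed values
x = rand(d)      # a sample drawn from the listed values 1, 2, 3
```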

POMDPTools.POMDPDistributions.SparseCatType
SparseCat(values, probabilities)

Create a sparse categorical distribution.

values is an iterable object containing the possible values (can be of any type) in the distribution that have nonzero probability. probabilities is an iterable object that contains the associated probabilities.

This is optimized for value iteration with a fast implementation of weighted_iterator. Both pdf and rand are order n.

source

Implicit

In situations where a distribution object is required, but the pdf is difficult to specify and only samples are required, ImplicitDistribution provides a convenient way to package a sampling function.

POMDPTools.POMDPDistributions.ImplicitDistributionType
ImplicitDistribution(sample_function, args...)

Define a distribution that can only be sampled from using rand, but has no explicit pdf.

Each time rand(rng, d::ImplicitDistribution) is called,

sample_function(args..., rng)

will be called to generate a new sample.

ImplicitDistribution is designed to be used with anonymous functions or the do syntax as follows:

Examples

ImplicitDistribution(rng->rand(rng)^2)
struct MyMDP <: MDP{Float64, Int} end
 
 function POMDPs.transition(m::MyMDP, s, a)
     ImplicitDistribution(s, a) do s, a, rng
@@ -8,4 +8,4 @@
 end
 
 td = transition(MyMDP(), 1.0, 1)
-rand(td) # will return a number near 2
source

Bool Distribution

Deterministic

POMDPTools.POMDPDistributions.DeterministicType
Deterministic(value)

Create a deterministic distribution over only one value.

This is intended to be used when a distribution is required, but the outcome is deterministic. It is equivalent to a Kronecker Delta distribution.

source

Uniform

POMDPTools.POMDPDistributions.UniformType
Uniform(collection)

Create a uniform categorical distribution over a collection of objects.

The objects in the collection must be unique (this is tested on construction), and will be stored in a Set. To avoid this overhead, use UnsafeUniform.

source
POMDPTools.POMDPDistributions.UnsafeUniformType
UnsafeUniform(collection)

Create a uniform categorical distribution over a collection of objects.

No checks are performed to ensure uniqueness or check whether an object is actually in the set when evaluating the pdf.

source

Pretty Printing

+rand(td) # will return a number near 2
source

Bool Distribution

Deterministic

POMDPTools.POMDPDistributions.DeterministicType
Deterministic(value)

Create a deterministic distribution over only one value.

This is intended to be used when a distribution is required, but the outcome is deterministic. It is equivalent to a Kronecker Delta distribution.

source

Uniform

POMDPTools.POMDPDistributions.UniformType
Uniform(collection)

Create a uniform categorical distribution over a collection of objects.

The objects in the collection must be unique (this is tested on construction), and will be stored in a Set. To avoid this overhead, use UnsafeUniform.

source
POMDPTools.POMDPDistributions.UnsafeUniformType
UnsafeUniform(collection)

Create a uniform categorical distribution over a collection of objects.

No checks are performed to ensure uniqueness or check whether an object is actually in the set when evaluating the pdf.

source

Pretty Printing

diff --git a/dev/POMDPTools/index.html b/dev/POMDPTools/index.html
index 2b076ad3..129d4937 100644
--- a/dev/POMDPTools/index.html
+++ b/dev/POMDPTools/index.html
@@ -1,2 +1,2 @@
-POMDPTools: the standard library for POMDPs.jl · POMDPs.jl

POMDPTools: the standard library for POMDPs.jl

The POMDPs.jl package does nothing more than define an interface or language for interacting with and solving (PO)MDPs; it does not contain any implementations. In practice, defining and solving POMDPs is made vastly easier if some commonly-used structures are provided. The POMDPTools package contains these implementations. Thus, the relationship between POMDPs.jl and POMDPTools is similar to the relationship between a programming language and its standard library.

The POMDPTools package source code is hosted in the POMDPs.jl github repository in the lib/POMDPTools directory.

The contents of the library are outlined below:

+POMDPTools: the standard library for POMDPs.jl · POMDPs.jl

POMDPTools: the standard library for POMDPs.jl

The POMDPs.jl package does nothing more than define an interface or language for interacting with and solving (PO)MDPs; it does not contain any implementations. In practice, defining and solving POMDPs is made vastly easier if some commonly-used structures are provided. The POMDPTools package contains these implementations. Thus, the relationship between POMDPs.jl and POMDPTools is similar to the relationship between a programming language and its standard library.

The POMDPTools package source code is hosted in the POMDPs.jl github repository in the lib/POMDPTools directory.

The contents of the library are outlined below:

diff --git a/dev/POMDPTools/model/index.html b/dev/POMDPTools/model/index.html
index 80eb83ba..3507bc18 100644
--- a/dev/POMDPTools/model/index.html
+++ b/dev/POMDPTools/model/index.html
@@ -5,9 +5,9 @@
 julia> collect(weighted_iterator(d))
 2-element Array{Pair{Bool,Float64},1}:
   true => 0.7
- false => 0.3
source
Observation Weight

Sometimes, e.g. in particle filtering, the relative likelihood of an observation is required in addition to a generative model, and it is often tedious to implement a custom observation distribution type. For this case, the shortcut function obs_weight is provided.

POMDPTools.ModelTools.obs_weightFunction
obs_weight(pomdp, s, a, sp, o)

Return a weight proportional to the likelihood of receiving observation o from state sp (and a and s if they are present).

This is a useful shortcut for particle filtering so that the observation distribution does not have to be represented.

source

Ordered Spaces

It is often useful to have a list of states, actions, or observations ordered consistently with the respective index function from POMDPs.jl. Since the POMDPs.jl interface does not demand that spaces be ordered consistently with index, the states, actions, and observations functions are not sufficient. Thus POMDPTools provides ordered_actions, ordered_states, and ordered_observations to provide this capability.

POMDPTools.ModelTools.ordered_actionsFunction
ordered_actions(mdp)

Return an AbstractVector of actions ordered according to actionindex(mdp, a).

ordered_actions(mdp) will always return an AbstractVector{A} v containing all of the actions in actions(mdp) in the order such that actionindex(mdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source
POMDPTools.ModelTools.ordered_statesFunction
ordered_states(mdp)

Return an AbstractVector of states ordered according to stateindex(mdp, s).

ordered_states(mdp) will always return an AbstractVector{A} v containing all of the states in states(mdp) in the order such that stateindex(mdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source
POMDPTools.ModelTools.ordered_observationsFunction
ordered_observations(pomdp)

Return an AbstractVector of observations ordered according to obsindex(pomdp, o).

ordered_observations(pomdp) will always return an AbstractVector{A} v containing all of the observations in observations(pomdp) in the order such that obsindex(pomdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source

Info Interface

Functions in POMDPs.jl often generate useful information besides the belief, state, action, etc. This information can be useful for debugging or understanding the behavior of a solver, updater, or problem. The info interface provides a standard way for problems, policies, solvers or updaters to output this information. The recording simulators from POMDPTools automatically record this information.

To specify info from policies, solvers, or updaters, implement the following functions:

POMDPTools.ModelTools.action_infoFunction
a, ai = action_info(policy, x)

Return a tuple containing the action determined by policy 'p' at state or belief 'x' and information (usually a NamedTuple, Dict or nothing) from the calculation of that action.

By default, returns nothing as info.

source
POMDPTools.ModelTools.solve_infoFunction
policy, si = solve_info(solver, problem)

Return a tuple containing the policy determined by a solver and information (usually a NamedTuple, Dict or nothing) from the calculation of that policy.

By default, returns nothing as info.

source
POMDPTools.ModelTools.update_infoFunction
bp, i = update_info(updater, b, a, o)

Return a tuple containing the new belief and information (usually a NamedTuple, Dict or nothing) from the belief update.

By default, returns nothing as info.

source

Model Transformations

POMDPTools contains several tools for transforming problems into other classes so that they can be used by different solvers.

Linear Algebra Representations

For some algorithms, such as value iteration, it is convenient to use vectors that contain the reward for every state, and matrices that contain the transition probabilities. These can be constructed with the following functions:

POMDPTools.ModelTools.transition_matricesFunction
transition_matrices(p::SparseTabularProblem)

Accessor function for the transition model of a sparse tabular problem. It returns a list of sparse matrices for each action of the problem.

source
transition_matrices(m::Union{MDP,POMDP})
-transition_matrices(m; sparse=true)

Construct transition matrices for (PO)MDP m.

The returned object is an associative object (usually a Dict), where the keys are actions. Each value in this object is an AbstractMatrix where the row corresponds to the state index of s and the column corresponds to the state index of s'. The entry in the matrix is the probability of transitioning from state s to state s'.

source
POMDPTools.ModelTools.reward_vectorsFunction
reward_vectors(m::Union{MDP, POMDP})

Construct reward vectors for (PO)MDP m.

The returned object is an associative object (usually a Dict), where the keys are actions. Each value in this object is an AbstractVector where the index corresponds to the state index of s and the entry is the reward for that state.

source

Sparse Tabular MDPs and POMDPs

The SparseTabularMDP and SparseTabularPOMDP types represent discrete problems defined using the explicit interface. The transition and observation models are represented using sparse matrices. Solver writers can leverage these data structures to write efficient vectorized code. A problem writer can define their problem using the explicit interface, and it can be automatically converted to a sparse tabular representation by calling the constructors SparseTabularMDP(::MDP) or SparseTabularPOMDP(::POMDP). See the following docs to learn more about the matrix representation and how to access the fields of the SparseTabular objects:

POMDPTools.ModelTools.SparseTabularMDPType
SparseTabularMDP

An MDP object where states and actions are integers and the transition is represented by a list of sparse matrices. This data structure can be useful to exploit in vectorized algorithms (e.g. see SparseValueIterationSolver). The recommended way to access the transition and reward matrices is through the provided accessor functions: transition_matrix and reward_vector.

Fields

  • T::Vector{SparseMatrixCSC{Float64, Int64}} The transition model is represented as a vector of sparse matrices (one for each action). T[a][s, sp] is the probability of transition from s to sp taking action a.
  • R::Array{Float64, 2} The reward is represented as a matrix where the rows are states and the columns actions: R[s, a] is the reward of taking action a in state s.
  • initial_probs::SparseVector{Float64, Int64} Specifies the initial state distribution
  • terminal_states::Set{Int64} Stores the terminal states
  • discount::Float64 The discount factor

Constructors

  • SparseTabularMDP(mdp::MDP) : One can provide the matrices to the default constructor or one can construct a SparseTabularMDP from any discrete state MDP defined using the explicit interface.

Note that constructing the transition and reward matrices requires iterating over all the states and can take a while. To learn more about how to define an MDP with the explicit interface please visit https://juliapomdp.github.io/POMDPs.jl/latest/explicit/ .

  • SparseTabularMDP(smdp::SparseTabularMDP; transition, reward, discount) : This constructor returns a new sparse MDP that is a copy of the original smdp except for the field specified by the keyword arguments.
source
POMDPTools.ModelTools.SparseTabularPOMDPType
SparseTabularPOMDP

A POMDP object where states and actions are integers and the transition and observation distributions are represented by lists of sparse matrices. This data structure can be useful to exploit in vectorized algorithms to gain performance (e.g. see SparseValueIterationSolver). The recommended way to access the transition, reward, and observation matrices is through the provided accessor functions: transition_matrix, reward_vector, observation_matrix.

Fields

  • T::Vector{SparseMatrixCSC{Float64, Int64}} The transition model is represented as a vector of sparse matrices (one for each action). T[a][s, sp] is the probability of transition from s to sp taking action a.
  • R::Array{Float64, 2} The reward is represented as a matrix where the rows are states and the columns actions: R[s, a] is the reward of taking action a in state s.
  • O::Vector{SparseMatrixCSC{Float64, Int64}} The observation model is represented as a vector of sparse matrices (one for each action). O[a][sp, o] is the probability of observing o from state sp after having taken action a.
  • initial_probs::SparseVector{Float64, Int64} Specifies the initial state distribution
  • terminal_states::Set{Int64} Stores the terminal states
  • discount::Float64 The discount factor

Constructors

  • SparseTabularPOMDP(pomdp::POMDP) : One can provide the matrices to the default constructor or one can construct a SparseTabularPOMDP from any discrete state POMDP defined using the explicit interface.

Note that constructing the transition and reward matrices requires iterating over all the states and can take a while. To learn more about how to define a POMDP with the explicit interface please visit https://juliapomdp.github.io/POMDPs.jl/latest/explicit/ .

  • SparseTabularPOMDP(spomdp::SparseTabularMDP; transition, reward, observation, discount) : This constructor returns a new sparse POMDP that is a copy of the original spomdp except for the field specified by the keyword arguments.
source
POMDPTools.ModelTools.transition_matrixFunction
transition_matrix(p::SparseTabularProblem, a)

Accessor function for the transition model of a sparse tabular problem. It returns a sparse matrix containing the transition probabilities when taking action a: T[s, sp] = Pr(sp | s, a).

source
POMDPTools.ModelTools.reward_vectorFunction
reward_vector(p::SparseTabularProblem, a)

Accessor function for the reward function of a sparse tabular problem. It returns a vector containing the reward for all the states when taking action a: R(s, a). The length of the return vector is equal to the number of states.

source
POMDPTools.ModelTools.observation_matrixFunction
observation_matrix(p::SparseTabularPOMDP, a::Int64)

Accessor function for the observation model of a sparse tabular POMDP. It returns a sparse matrix containing the observation probabilities when having taken action a: O[sp, o] = Pr(o | sp, a).

source
POMDPTools.ModelTools.reward_matrixFunction
reward_matrix(p::SparseTabularProblem)

Accessor function for the reward matrix R[s, a] of a sparse tabular problem.

source
POMDPTools.ModelTools.observation_matricesFunction
observation_matrices(p::SparseTabularPOMDP)

Accessor function for the observation model of a sparse tabular POMDP. It returns a list of sparse matrices for each action of the problem.

source

Fully Observable POMDP

POMDPTools.ModelTools.FullyObservablePOMDPType
FullyObservablePOMDP(mdp)

Turn MDP mdp into a POMDP where the observations are the states of the MDP.

source

Generative Belief MDP

Every POMDP is an MDP on the belief space. GenerativeBeliefMDP creates a generative model for that MDP.

Warning

The reward generated by the GenerativeBeliefMDP is the reward for a single state sampled from the belief; it is not the expected reward for that belief transition (though, in expectation, they are equivalent of course). Implementing the model with the expected reward requires a custom implementation because belief updaters do not typically deal with reward.

POMDPTools.ModelTools.GenerativeBeliefMDPType
GenerativeBeliefMDP(pomdp, updater)
-GenerativeBeliefMDP(pomdp, updater; terminal_behavior=TerminalStateTerminalBehavior())

Create a generative model of the belief MDP corresponding to POMDP pomdp with belief updates performed by updater. Each step is performed by sampling a state from the current belief, generating an observation from that state and action, and then using updater to update the belief.

A belief is considered terminal when all POMDP states in the support with nonzero probability are terminal.

The default behavior when a terminal POMDP state is sampled from the belief is to transition to terminalstate. This can be controlled by the terminal_behavior keyword argument. Using terminal_behavior=ContinueTerminalBehavior(pomdp, updater) will cause the MDP to keep attempting a belief update even when the sampled state is terminal. This can be further customized by providing terminal_behavior with a Function or callable object that takes arguments b, s, a, rng and returns a new belief (see the implementation of ContinueTerminalBehavior for an example). You can customize behavior additionally using determine_gbmdp_state_type.

source

Example

using POMDPs
+ false => 0.3
source

Observation Weight

Sometimes, e.g. in particle filtering, the relative likelihood of an observation is required in addition to a generative model, and it is often tedious to implement a custom observation distribution type. For this case, the shortcut function obs_weight is provided.

POMDPTools.ModelTools.obs_weightFunction
obs_weight(pomdp, s, a, sp, o)

Return a weight proportional to the likelihood of receiving observation o from state sp (and a and s if they are present).

This is a useful shortcut for particle filtering so that the observation distribution does not have to be represented.

source
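
To make the role of such a weight concrete, the reweighting step of a particle filter can be sketched in plain Julia. This is an illustrative sketch and not the POMDPTools or ParticleFilters.jl implementation: `likelihood_weight` and `reweight` are hypothetical names, and the 1-D model with unit-variance Gaussian observation noise is an assumption made only for this example.

```julia
# Hypothetical 1-D model: the observation is the next state plus Gaussian noise
# with standard deviation 1. Only a weight *proportional* to the likelihood is
# needed, so the Gaussian normalization constant is dropped.
likelihood_weight(sp::Float64, o::Float64) = exp(-0.5 * (o - sp)^2)

# Reweight a set of next-state particles given an observation and normalize.
function reweight(particles::Vector{Float64}, o::Float64)
    w = [likelihood_weight(sp, o) for sp in particles]
    return w ./ sum(w)
end

particles = [0.0, 1.0, 2.0]
weights = reweight(particles, 1.0)   # the particle at 1.0 receives the largest weight
```

Because the weights are normalized afterwards, any function that is proportional to the observation likelihood is sufficient, which is why an explicit observation distribution object is not needed for this step.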

Ordered Spaces

It is often useful to have a list of states, actions, or observations ordered consistently with the respective index function from POMDPs.jl. Since the POMDPs.jl interface does not demand that spaces be ordered consistently with index, the states, actions, and observations functions are not sufficient. Thus POMDPTools provides ordered_actions, ordered_states, and ordered_observations to provide this capability.

POMDPTools.ModelTools.ordered_actionsFunction
ordered_actions(mdp)

Return an AbstractVector of actions ordered according to actionindex(mdp, a).

ordered_actions(mdp) will always return an AbstractVector{A} v containing all of the actions in actions(mdp) in the order such that actionindex(mdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source
POMDPTools.ModelTools.ordered_statesFunction
ordered_states(mdp)

Return an AbstractVector of states ordered according to stateindex(mdp, s).

ordered_states(mdp) will always return an AbstractVector{S} v containing all of the states in states(mdp) in the order such that stateindex(mdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source
POMDPTools.ModelTools.ordered_observationsFunction
ordered_observations(pomdp)

Return an AbstractVector of observations ordered according to obsindex(pomdp, o).

ordered_observations(pomdp) will always return an AbstractVector{O} v containing all of the observations in observations(pomdp) in the order such that obsindex(pomdp, v[i]) == i. You may wish to override this for your problem for efficiency.

source
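The ordering guarantee can be checked directly. The following is a minimal sketch, assuming POMDPModels is installed and that its TigerPOMDP implements stateindex, actionindex, and obsindex.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides TigerPOMDP
using POMDPTools

pomdp = TigerPOMDP()

S = ordered_states(pomdp)
A = ordered_actions(pomdp)
O = ordered_observations(pomdp)

# Each element sits at the position given by its index function.
states_ok  = all(stateindex(pomdp, S[i]) == i for i in eachindex(S))
actions_ok = all(actionindex(pomdp, A[i]) == i for i in eachindex(A))
obs_ok     = all(obsindex(pomdp, O[i]) == i for i in eachindex(O))
```

Because the position of each element matches its index, these vectors can be used to label the rows and columns of tabular representations.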

Info Interface

It is often the case that useful information besides the belief, state, action, etc. is generated by a function in POMDPs.jl. This information can be useful for debugging or understanding the behavior of a solver, updater, or problem. The info interface provides a standard way for problems, policies, solvers, or updaters to output this information. The recording simulators from POMDPTools automatically record this information.

To specify info from policies, solvers, or updaters, implement the following functions:

POMDPTools.ModelTools.action_infoFunction
a, ai = action_info(policy, x)

Return a tuple containing the action determined by policy at state or belief x and information (usually a NamedTuple, Dict or nothing) from the calculation of that action.

By default, returns nothing as info.

source
POMDPTools.ModelTools.solve_infoFunction
policy, si = solve_info(solver, problem)

Return a tuple containing the policy determined by a solver and information (usually a NamedTuple, Dict or nothing) from the calculation of that policy.

By default, returns nothing as info.

source
POMDPTools.ModelTools.update_infoFunction
bp, i = update_info(updater, b, a, o)

Return a tuple containing the new belief and information (usually a NamedTuple, Dict or nothing) from the belief update.

By default, returns nothing as info.

source
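To show the calling convention, here is a small sketch that queries a policy through action_info. It assumes POMDPModels is installed and uses its SimpleGridWorld with a RandomPolicy, which does not specialize action_info and therefore returns the default info value.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides SimpleGridWorld and GWPos
using POMDPTools

mdp = SimpleGridWorld()
policy = RandomPolicy(mdp)
s = GWPos(1, 1)

# Same result as action(policy, s), plus whatever diagnostics the policy reports.
a, info = action_info(policy, s)
```

A planner that wants to expose diagnostics such as tree size or planning time would add its own action_info method returning a NamedTuple or Dict in the second position.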

Model Transformations

POMDPTools contains several tools for transforming problems into other classes so that they can be used by different solvers.

Linear Algebra Representations

For some algorithms, such as value iteration, it is convenient to use vectors that contain the reward for every state, and matrices that contain the transition probabilities. These can be constructed with the following functions:

POMDPTools.ModelTools.transition_matricesFunction
transition_matrices(p::SparseTabularProblem)

Accessor function for the transition model of a sparse tabular problem. It returns a list of sparse matrices for each action of the problem.

source
transition_matrices(m::Union{MDP,POMDP})
+transition_matrices(m; sparse=true)

Construct transition matrices for (PO)MDP m.

The returned object is an associative object (usually a Dict), where the keys are actions. Each value in this object is an AbstractMatrix where the row corresponds to the state index of s and the column corresponds to the state index of s'. The entry in the matrix is the probability of transitioning from state s to state s'.

source
POMDPTools.ModelTools.reward_vectorsFunction
reward_vectors(m::Union{MDP, POMDP})

Construct reward vectors for (PO)MDP m.

The returned object is an associative object (usually a Dict), where the keys are actions. Each value in this object is an AbstractVector where the index corresponds to the state index of s and the entry is the reward for that state.

source

Sparse Tabular MDPs and POMDPs

SparseTabularMDP and SparseTabularPOMDP represent discrete problems defined using the explicit interface. The transition and observation models are represented using sparse matrices. Solver writers can leverage these data structures to write efficient vectorized code. A problem writer can define a problem using the explicit interface and have it automatically converted to a sparse tabular representation by calling the constructors SparseTabularMDP(::MDP) or SparseTabularPOMDP(::POMDP). See the following docs to learn more about the matrix representation and how to access the fields of the SparseTabular objects:

POMDPTools.ModelTools.SparseTabularMDPType
SparseTabularMDP

An MDP object where states and actions are integers and the transition is represented by a list of sparse matrices. This data structure can be useful to exploit in vectorized algorithms (e.g. see SparseValueIterationSolver). The recommended way to access the transition and reward matrices is through the provided accessor functions: transition_matrix and reward_vector.

Fields

  • T::Vector{SparseMatrixCSC{Float64, Int64}} The transition model is represented as a vector of sparse matrices (one for each action). T[a][s, sp] is the probability of transitioning from s to sp when taking action a.
  • R::Array{Float64, 2} The reward is represented as a matrix where the rows are states and the columns are actions: R[s, a] is the reward of taking action a in state s.
  • initial_probs::SparseVector{Float64, Int64} Specifies the initial state distribution
  • terminal_states::Set{Int64} Stores the terminal states
  • discount::Float64 The discount factor

Constructors

  • SparseTabularMDP(mdp::MDP) : One can provide the matrices to the default constructor or one can construct a SparseTabularMDP from any discrete state MDP defined using the explicit interface.

Note that constructing the transition and reward matrices requires iterating over all the states and can take a while. To learn more about how to define an MDP with the explicit interface, please visit https://juliapomdp.github.io/POMDPs.jl/latest/explicit/ .

  • SparseTabularMDP(smdp::SparseTabularMDP; transition, reward, discount) : This constructor returns a new sparse MDP that is a copy of the original smdp except for the field specified by the keyword arguments.
source
POMDPTools.ModelTools.SparseTabularPOMDPType
SparseTabularPOMDP

A POMDP object where states and actions are integers and the transition and observation distributions are represented by lists of sparse matrices. This data structure can be useful to exploit in vectorized algorithms to gain performance (e.g. see SparseValueIterationSolver). The recommended way to access the transition, reward, and observation matrices is through the provided accessor functions: transition_matrix, reward_vector, observation_matrix.

Fields

  • T::Vector{SparseMatrixCSC{Float64, Int64}} The transition model is represented as a vector of sparse matrices (one for each action). T[a][s, sp] is the probability of transitioning from s to sp when taking action a.
  • R::Array{Float64, 2} The reward is represented as a matrix where the rows are states and the columns are actions: R[s, a] is the reward of taking action a in state s.
  • O::Vector{SparseMatrixCSC{Float64, Int64}} The observation model is represented as a vector of sparse matrices (one for each action). O[a][sp, o] is the probability of observing o from state sp after having taken action a.
  • initial_probs::SparseVector{Float64, Int64} Specifies the initial state distribution
  • terminal_states::Set{Int64} Stores the terminal states
  • discount::Float64 The discount factor

Constructors

  • SparseTabularPOMDP(pomdp::POMDP) : One can provide the matrices to the default constructor or one can construct a SparseTabularPOMDP from any discrete-state POMDP defined using the explicit interface.

Note that constructing the transition and reward matrices requires iterating over all the states and can take a while. To learn more about how to define a POMDP with the explicit interface, please visit https://juliapomdp.github.io/POMDPs.jl/latest/explicit/ .

  • SparseTabularPOMDP(spomdp::SparseTabularMDP; transition, reward, observation, discount) : This constructor returns a new sparse POMDP that is a copy of the original spomdp except for the fields specified by the keyword arguments.
source
POMDPTools.ModelTools.transition_matrixFunction
transition_matrix(p::SparseTabularProblem, a)

Accessor function for the transition model of a sparse tabular problem. It returns a sparse matrix containing the transition probabilities when taking action a: T[s, sp] = Pr(sp | s, a).

source
POMDPTools.ModelTools.reward_vectorFunction
reward_vector(p::SparseTabularProblem, a)

Accessor function for the reward function of a sparse tabular problem. It returns a vector containing the reward for all the states when taking action a: R(s, a). The length of the return vector is equal to the number of states.

source
POMDPTools.ModelTools.observation_matrixFunction
observation_matrix(p::SparseTabularPOMDP, a::Int64)

Accessor function for the observation model of a sparse tabular POMDP. It returns a sparse matrix containing the observation probabilities when having taken action a: O[sp, o] = Pr(o | sp, a).

source
POMDPTools.ModelTools.reward_matrixFunction
reward_matrix(p::SparseTabularProblem)

Accessor function for the reward matrix R[s, a] of a sparse tabular problem.

source
POMDPTools.ModelTools.observation_matricesFunction
observation_matrices(p::SparseTabularPOMDP)

Accessor function for the observation model of a sparse tabular POMDP. It returns a list of sparse matrices for each action of the problem.

source
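The accessor functions above are enough to write a vectorized Bellman backup. The sketch below is only an illustration, not the implementation of any particular solver; it assumes POMDPModels is installed and converts its SimpleGridWorld to a SparseTabularMDP.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides SimpleGridWorld
using POMDPTools

mdp = SimpleGridWorld()
smdp = SparseTabularMDP(mdp)   # states and actions become integer indices

nS = length(states(smdp))
nA = length(actions(smdp))
γ = discount(smdp)

V = zeros(nS)                  # initial value estimate

# Q[s, a] = R(s, a) + γ * Σ_sp T(sp | s, a) * V[sp], one column per action
Q = hcat((reward_vector(smdp, a) .+ γ .* (transition_matrix(smdp, a) * V) for a in 1:nA)...)

Vnew = vec(maximum(Q, dims=2)) # one step of value iteration
```

Repeating the last two statements with V replaced by Vnew until the values stop changing gives a basic value iteration loop.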

Fully Observable POMDP

POMDPTools.ModelTools.FullyObservablePOMDPType
FullyObservablePOMDP(mdp)

Turn MDP mdp into a POMDP where the observations are the states of the MDP.

source

Generative Belief MDP

Every POMDP is an MDP on the belief space. GenerativeBeliefMDP creates a generative model for that MDP.

Warning

The reward generated by the GenerativeBeliefMDP is the reward for a single state sampled from the belief; it is not the expected reward for that belief transition (though, in expectation, they are equivalent of course). Implementing the model with the expected reward requires a custom implementation because belief updaters do not typically deal with reward.

POMDPTools.ModelTools.GenerativeBeliefMDPType
GenerativeBeliefMDP(pomdp, updater)
+GenerativeBeliefMDP(pomdp, updater; terminal_behavior=TerminalStateTerminalBehavior())

Create a generative model of the belief MDP corresponding to POMDP pomdp with belief updates performed by updater. Each step is performed by sampling a state from the current belief, generating an observation from that state and action, and then using updater to update the belief.

A belief is considered terminal when all POMDP states in the support with nonzero probability are terminal.

The default behavior when a terminal POMDP state is sampled from the belief is to transition to terminalstate. This can be controlled by the terminal_behavior keyword argument. Using terminal_behavior=ContinueTerminalBehavior(pomdp, updater) will cause the MDP to keep attempting a belief update even when the sampled state is terminal. This can be further customized by providing terminal_behavior with a Function or callable object that takes arguments b, s, a, rng and returns a new belief (see the implementation of ContinueTerminalBehavior for an example). You can customize behavior additionally using determine_gbmdp_state_type.

source

Example

using POMDPs
 using POMDPModels
 using POMDPTools
 
@@ -27,7 +27,7 @@
 (a, r, sp) = (true, -5.0, DiscreteBelief{POMDPModels.BabyPOMDP, Bool}(POMDPModels.BabyPOMDP(-5.0, -10.0, 0.1, 0.8, 0.1, 0.9), Bool[0, 1], [1.0, 0.0]))
 (a, r, sp) = (true, -5.0, DiscreteBelief{POMDPModels.BabyPOMDP, Bool}(POMDPModels.BabyPOMDP(-5.0, -10.0, 0.1, 0.8, 0.1, 0.9), Bool[0, 1], [1.0, 0.0]))
 (a, r, sp) = (false, 0.0, DiscreteBelief{POMDPModels.BabyPOMDP, Bool}(POMDPModels.BabyPOMDP(-5.0, -10.0, 0.1, 0.8, 0.1, 0.9), Bool[0, 1], [0.9759036144578314, 0.02409638554216867]))
-(a, r, sp) = (false, 0.0, DiscreteBelief{POMDPModels.BabyPOMDP, Bool}(POMDPModels.BabyPOMDP(-5.0, -10.0, 0.1, 0.8, 0.1, 0.9), Bool[0, 1], [0.9701315984030756, 0.029868401596924433]))

Underlying MDP

POMDPTools.ModelTools.UnderlyingMDPType
UnderlyingMDP(m::POMDP)

Transform POMDP m into an MDP where the states are fully observed.

UnderlyingMDP(m::MDP)

Return m

source

State Action Reward Model

POMDPTools.ModelTools.StateActionRewardType
StateActionReward(m::Union{MDP,POMDP})

Robustly create a reward function that depends only on the state and action.

If reward(m, s, a) is implemented, that will be used, otherwise the mean of reward(m, s, a, sp) for MDPs or reward(m, s, a, sp, o) for POMDPs will be used.

Example

using POMDPs
+(a, r, sp) = (false, 0.0, DiscreteBelief{POMDPModels.BabyPOMDP, Bool}(POMDPModels.BabyPOMDP(-5.0, -10.0, 0.1, 0.8, 0.1, 0.9), Bool[0, 1], [0.9701315984030756, 0.029868401596924433]))

Underlying MDP

State Action Reward Model

POMDPTools.ModelTools.StateActionRewardType
StateActionReward(m::Union{MDP,POMDP})

Robustly create a reward function that depends only on the state and action.

If reward(m, s, a) is implemented, that will be used, otherwise the mean of reward(m, s, a, sp) for MDPs or reward(m, s, a, sp, o) for POMDPs will be used.

Example

using POMDPs
 using POMDPModels
 using POMDPTools
 
@@ -39,4 +39,4 @@
 
 # output
 
--15.0
source

Utility Types

Terminal State

TerminalState and its singleton instance terminalstate are available to use for a terminal state in concert with another state type. It has the appropriate type promotion logic to make its use with other types friendly, similar to nothing and missing.

Note

NOTE: This is NOT a replacement for the standard POMDPs.jl isterminal function, though isterminal is implemented for the type. It is merely a convenient type to use for terminal states.

Warning

WARNING: Early tests (August 2018) suggest that the Julia 1.0 compiler will not be able to efficiently implement union splitting in cases as complex as POMDPs, so using a Union for the state type of a problem can currently have a large overhead.

POMDPTools.ModelTools.TerminalStateType
TerminalState

A type with no fields whose singleton instance terminalstate is used to represent a terminal state with no additional information.

This type has the appropriate promotion logic implemented to function like Missing when added to arrays, etc.

Note that terminal states NEED NOT be of type TerminalState. You can define any state to be terminal by implementing the appropriate isterminal method. Solvers and simulators SHOULD NOT check for this type, but should instead check using isterminal.

source
+-15.0source

Utility Types

Terminal State

TerminalState and its singleton instance terminalstate are available to use for a terminal state in concert with another state type. It has the appropriate type promotion logic to make its use with other types friendly, similar to nothing and missing.

Note

NOTE: This is NOT a replacement for the standard POMDPs.jl isterminal function, though isterminal is implemented for the type. It is merely a convenient type to use for terminal states.

Warning

WARNING: Early tests (August 2018) suggest that the Julia 1.0 compiler will not be able to efficiently implement union splitting in cases as complex as POMDPs, so using a Union for the state type of a problem can currently have a large overhead.

POMDPTools.ModelTools.TerminalStateType
TerminalState

A type with no fields whose singleton instance terminalstate is used to represent a terminal state with no additional information.

This type has the appropriate promotion logic implemented to function like Missing when added to arrays, etc.

Note that terminal states NEED NOT be of type TerminalState. You can define any state to be terminal by implementing the appropriate isterminal method. Solvers and simulators SHOULD NOT check for this type, but should instead check using isterminal.

source
POMDPTools.ModelTools.terminalstateConstant
terminalstate

The singleton instance of type TerminalState representing a terminal state.

source
diff --git a/dev/POMDPTools/policies/index.html b/dev/POMDPTools/policies/index.html index 9d9d434e..b372bc34 100644 --- a/dev/POMDPTools/policies/index.html +++ b/dev/POMDPTools/policies/index.html @@ -1,8 +1,8 @@ -Implemented Policies · POMDPs.jl

Implemented Policies

POMDPTools currently provides the following policy types:

  • a wrapper to turn a function into a Policy
  • an alpha vector policy type
  • a random policy
  • a stochastic policy type
  • exploration policies
  • a vector policy type
  • a wrapper to collect statistics and errors about policies

In addition, it provides the showpolicy function for printing policies similar to the way that matrices are printed in the repl and the evaluate function for evaluating MDP policies.

Function

Wraps a Function mapping states to actions into a Policy.

Alpha Vector Policy

Represents a policy with a set of alpha vectors (See AlphaVectorPolicy constructor docstring). In addition to finding the optimal action with action, the alpha vectors can be accessed with alphavectors or alphapairs.

Determining the estimated value and optimal action depends on calculating the dot product between alpha vectors and a belief vector. POMDPTools.Policies.beliefvec(pomdp, b) is used to create this vector and can be overridden for new belief types for efficiency.

POMDPTools.Policies.AlphaVectorPolicyType
AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)

Construct a policy from alpha vectors.

Arguments

  • alphas: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors.

  • action_map: a vector of the actions corresponding to each alpha vector

    AlphaVectorPolicy{P<:POMDP, A}

Represents a policy with a set of alpha vectors.

Use action to get the best action for a belief, and alphavectors and alphapairs to

Fields

  • pomdp::P the POMDP problem
  • n_states::Int the number of states in the POMDP
  • alphas::Vector{Vector{Float64}} the list of alpha vectors
  • action_map::Vector{A} a list of actions corresponding to the alpha vectors
source
POMDPTools.Policies.beliefvecFunction
POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)

Return a vector-like representation of the belief b suitable for calculating the dot product with the alpha vectors.

source

Random Policy

A policy that returns a randomly selected action using rand(rng, actions(pomdp)).

POMDPTools.Policies.RandomPolicyType
RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}

a generic policy that uses the actions function to create a list of actions and then randomly samples an action from it.

Constructor:

`RandomPolicy(problem::Union{POMDP,MDP};
+Implemented Policies · POMDPs.jl

Implemented Policies

POMDPTools currently provides the following policy types:

  • a wrapper to turn a function into a Policy
  • an alpha vector policy type
  • a random policy
  • a stochastic policy type
  • exploration policies
  • a vector policy type
  • a wrapper to collect statistics and errors about policies

In addition, it provides the showpolicy function for printing policies similar to the way that matrices are printed in the repl and the evaluate function for evaluating MDP policies.

Function

Wraps a Function mapping states to actions into a Policy.

Alpha Vector Policy

Represents a policy with a set of alpha vectors (See AlphaVectorPolicy constructor docstring). In addition to finding the optimal action with action, the alpha vectors can be accessed with alphavectors or alphapairs.

Determining the estimated value and optimal action depends on calculating the dot product between alpha vectors and a belief vector. POMDPTools.Policies.beliefvec(pomdp, b) is used to create this vector and can be overridden for new belief types for efficiency.

POMDPTools.Policies.AlphaVectorPolicyType
AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)

Construct a policy from alpha vectors.

Arguments

  • alphas: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors.

  • action_map: a vector of the actions corresponding to each alpha vector

    AlphaVectorPolicy{P<:POMDP, A}

Represents a policy with a set of alpha vectors.

Use action to get the best action for a belief, and alphavectors and alphapairs to

Fields

  • pomdp::P the POMDP problem
  • n_states::Int the number of states in the POMDP
  • alphas::Vector{Vector{Float64}} the list of alpha vectors
  • action_map::Vector{A} a list of actions corresponding to the alpha vectors
source
POMDPTools.Policies.beliefvecFunction
POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)

Return a vector-like representation of the belief b suitable for calculating the dot product with the alpha vectors.

source
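The dot-product rule described above can be seen in a small example. This is a sketch under stated assumptions: POMDPModels is installed, its TigerPOMDP has two states and integer actions 0, 1, and 2, and the alpha vectors are invented for illustration rather than produced by a solver.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides TigerPOMDP
using POMDPTools

pomdp = TigerPOMDP()

# Hypothetical alpha vectors, one per entry of action_map.
alphas = [[-10.0, 5.0], [5.0, -10.0], [1.0, 1.0]]
action_map = [1, 2, 0]

policy = AlphaVectorPolicy(pomdp, alphas, action_map)

b = uniform_belief(pomdp)   # probability 0.5 on each state
a = action(policy, b)       # action of the alpha vector with the largest dot product
v = value(policy, b)        # that largest dot product
```

With a uniform belief the three dot products are -2.5, -2.5, and 1.0, so the third alpha vector wins and its associated action is returned.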

Random Policy

A policy that returns a randomly selected action using rand(rng, actions(pomdp)).

POMDPTools.Policies.RandomPolicyType
RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}

a generic policy that uses the actions function to create a list of actions and then randomly samples an action from it.

Constructor:

`RandomPolicy(problem::Union{POMDP,MDP};
          rng=Random.default_rng(),
-         updater=NothingUpdater())`

Fields

  • rng::RNG a random number generator
  • problem::P the POMDP or MDP problem
  • updater::U a belief updater (default to NothingUpdater in the above constructor)
source

Stochastic Policies

Types for representing randomized policies:

  • StochasticPolicy samples actions from an arbitrary distribution.
  • UniformRandomPolicy samples actions uniformly (see RandomPolicy for a similar use)
  • CategoricalTabularPolicy samples actions from a categorical distribution with weights given by a ValuePolicy.
POMDPTools.Policies.StochasticPolicyType

StochasticPolicy{D, RNG <: AbstractRNG}

Represents a stochastic policy. Actions are sampled from an arbitrary distribution.

Constructor:

`StochasticPolicy(distribution; rng=Random.default_rng())`

Fields

  • distribution::D
  • rng::RNG a random number generator
source
POMDPTools.Policies.CategoricalTabularPolicyType
CategoricalTabularPolicy

represents a stochastic policy sampling an action from a categorical distribution with weights given by a ValuePolicy

constructor:

CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())

Fields

  • stochastic::StochasticPolicy
  • value::ValuePolicy
source

Vector Policies

Tabular policies including the following:

  • VectorPolicy holds a vector of actions, one for each state, ordered according to stateindex.
  • ValuePolicy holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state
POMDPTools.Policies.VectorPolicyType
VectorPolicy{S,A}

A generic MDP policy that consists of a vector of actions. The entry at stateindex(mdp, s) is the action that will be taken in state s.

Fields

  • mdp::MDP{S,A} the MDP problem
  • act::Vector{A} a vector of size |S| mapping state indices to actions
source
POMDPTools.Policies.ValuePolicyType
 ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}

A generic MDP policy that consists of a value table. The entry at stateindex(mdp, s) is the action that will be taken in state s. It is expected that the order of the actions in the value table is consistent with the order of the actions in act. If act is not explicitly set in the construction, act is ordered according to actionindex.

Fields

  • mdp::P the MDP problem
  • value_table::T the value table as a |S|x|A| matrix
  • act::Vector{A} the possible actions
source

Value Dict Policy

ValueDictPolicy holds a dictionary of values, where each key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating state and action spaces, but actions and states must support Base.isequal() and Base.hash().

POMDPTools.Policies.ValueDictPolicyType
 ValueDictPolicy(mdp)

A generic MDP policy that consists of a Dict storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.

Keyword Arguments

  • value_table::AbstractDict the value dict, key is (s, a) Tuple.
  • default_value::Float64 the default value of value_dict.
  • default_policy::Policy the policy taken when no action has a value higher than default_value
source

Exploration Policies

Exploration policies are often useful for reinforcement learning algorithms to choose an action that is different from the action given by the policy being learned (on_policy).

Exploration policies are subtypes of the abstract ExplorationPolicy type and follow this interface: action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.

The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four argument version of action where on_policy is the policy being learned (e.g. tabular policy or neural network policy).

This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy

POMDPTools.Policies.EpsGreedyPolicyType
EpsGreedyPolicy <: ExplorationPolicy

represents an epsilon greedy policy, sampling a random action with a probability eps or returning an action from a given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using those policies in reinforcement learning algorithms.

Constructor:

EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)

If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).

Fields

  • eps::Function
  • rng::AbstractRNG
  • m::M POMDPs or MDPs problem
source
POMDPTools.Policies.SoftmaxPolicyType
SoftmaxPolicy <: ExplorationPolicy

represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less wide.

Constructor

SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())

If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s)

Fields

  • temperature::Function
  • rng::AbstractRNG
  • actions::A an indexable list of action
source

Schedule

Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon greedy policy with an exponential decay schedule as follows:

    m # your mdp or pomdp model
-    exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))

POMDPTools exports a linear decay schedule object that can be used as well.

POMDPTools.Policies.LinearDecayScheduleType
LinearDecaySchedule

A schedule that linearly decreases a value from start to stop over steps steps. Once the value reaches stop, it stays constant.

Constructor

LinearDecaySchedule(;start, stop, steps)

source
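As a sketch of how a schedule and an exploration policy fit together, the example below passes a LinearDecaySchedule to EpsGreedyPolicy and calls the four-argument action. It assumes POMDPModels is installed; the SimpleGridWorld problem, the always-:up on-policy, and the numeric settings are arbitrary choices for illustration.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides SimpleGridWorld and GWPos
using POMDPTools

mdp = SimpleGridWorld()

# Epsilon goes from 1.0 down to 0.01 over 1000 steps, then stays at 0.01.
schedule = LinearDecaySchedule(start=1.0, stop=0.01, steps=1000)
explorer = EpsGreedyPolicy(mdp, schedule)

# Stand-in for the policy being learned.
on_policy = FunctionPolicy(s -> :up)

k = 500                                   # current training step
s = GWPos(1, 1)
a = action(explorer, on_policy, k, s)     # random with probability schedule(k), else :up
```

The schedule is an ordinary callable, so schedule(k) can also be evaluated directly to log the current exploration rate.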

Playback Policy

A policy that replays a fixed sequence of actions. When all actions are used, a backup policy is used.

POMDPTools.Policies.PlaybackPolicyType
PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}

a policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.

Constructor:

`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`

Fields

  • actions::Vector{A} a vector of actions to play back
  • backup_policy::Policy the policy to use when all prescribed actions have been taken but the episode continues
  • logpdfs::Vector{Float64} the log probability (density) of actions
  • i::Int64 the current action index
source
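The replay-then-fallback behavior can be sketched as follows. The example assumes POMDPModels is installed and uses its GWPos state type; the action sequence and the constant backup policy are hypothetical.

```julia
using POMDPs
using POMDPModels  # assumed installed; provides GWPos
using POMDPTools

backup = FunctionPolicy(s -> :down)             # used once the recorded actions run out
policy = PlaybackPolicy([:up, :right], backup)

s = GWPos(1, 1)
a1 = action(policy, s)   # first recorded action
a2 = action(policy, s)   # second recorded action
a3 = action(policy, s)   # recorded actions exhausted, so the backup policy answers
```

Since the policy keeps an internal index, each call to action advances the playback regardless of the state passed in.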

Utility Wrapper

A wrapper for policies to collect statistics and handle errors.

POMDPTools.Policies.PolicyWrapperType
PolicyWrapper

Flexible utility wrapper for a policy designed for collecting statistics about planning.

Carries a function, a policy, and optionally a payload (that can be any type).

The function should typically be defined with the do syntax. Each time action is called on the wrapper, this function will be called.

If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, action(policy, s) should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.

Constructor

PolicyWrapper(policy::Policy; payload=nothing)

Example

using POMDPModels
+         updater=NothingUpdater())`

Fields

  • rng::RNG a random number generator
  • problem::P the POMDP or MDP problem
  • updater::U a belief updater (default to NothingUpdater in the above constructor)
source

Stochastic Policies

Types for representing randomized policies:

  • StochasticPolicy samples actions from an arbitrary distribution.
  • UniformRandomPolicy samples actions uniformly (see RandomPolicy for a similar use)
  • CategoricalTabularPolicy samples actions from a categorical distribution with weights given by a ValuePolicy.
POMDPTools.Policies.StochasticPolicyType

StochasticPolicy{D, RNG <: AbstractRNG}

Represents a stochastic policy. Actions are sampled from an arbitrary distribution.

Constructor:

`StochasticPolicy(distribution; rng=Random.default_rng())`

Fields

  • distribution::D
  • rng::RNG a random number generator
source
POMDPTools.Policies.CategoricalTabularPolicyType
CategoricalTabularPolicy

represents a stochastic policy sampling an action from a categorical distribution with weights given by a ValuePolicy

constructor:

CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())

Fields

  • stochastic::StochasticPolicy
  • value::ValuePolicy
source

Vector Policies

Tabular policies including the following:

  • VectorPolicy holds a vector of actions, one for each state, ordered according to stateindex.
  • ValuePolicy holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state
POMDPTools.Policies.VectorPolicyType
VectorPolicy{S,A}

A generic MDP policy that consists of a vector of actions. The entry at stateindex(mdp, s) is the action that will be taken in state s.

Fields

  • mdp::MDP{S,A} the MDP problem
  • act::Vector{A} a vector of size |S| mapping state indices to actions
source
POMDPTools.Policies.ValuePolicyType
 ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}

A generic MDP policy that consists of a value table. The entry at stateindex(mdp, s) is the action that will be taken in state s. It is expected that the order of the actions in the value table is consistent with the order of the actions in act. If act is not explicitly set in the construction, act is ordered according to actionindex.

Fields

  • mdp::P the MDP problem
  • value_table::T the value table as a |S|x|A| matrix
  • act::Vector{A} the possible actions
source

Value Dict Policy

ValueDictPolicy holds a dictionary of values, where each key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating state and action spaces, but actions and states must support Base.isequal() and Base.hash().

POMDPTools.Policies.ValueDictPolicyType
 ValueDictPolicy(mdp)

A generic MDP policy that consists of a Dict storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.

Keyword Arguments

  • value_table::AbstractDict the value dict, key is (s, a) Tuple.
  • default_value::Float64 the default value of value_dict.
  • default_policy::Policy the policy taken when no action has a value higher than default_value
source
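
The lookup rule described above can be sketched without any package. This is a minimal illustration under stated assumptions, not the POMDPTools implementation: `greedy_action`, its `fallback` argument, and the toy Q-values are all hypothetical names introduced here.

```julia
# Hypothetical sketch of a dictionary-backed greedy policy.
# Q-values are keyed by (state, action) tuples, so states and actions
# only need to support hash and isequal.
function greedy_action(q::AbstractDict, s, actions; default_value=0.0, fallback=first)
    best_a = nothing
    best_v = default_value
    for a in actions
        v = get(q, (s, a), default_value)   # missing entries count as default_value
        if v > best_v
            best_a, best_v = a, v
        end
    end
    # No action beat the default value: defer to the fallback rule.
    return best_a === nothing ? fallback(actions) : best_a
end

q = Dict((:s1, :left) => 1.0, (:s1, :right) => 2.5)
acts = [:left, :right]
a_known = greedy_action(q, :s1, acts)     # highest stored value wins
a_unknown = greedy_action(q, :s2, acts)   # no entries for :s2, so the fallback decides
```

Here the fallback simply picks the first action; in the documented type that role is played by default_policy.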

Exploration Policies

Exploration policies are often useful for reinforcement learning algorithms to choose an action that is different from the action given by the policy being learned (on_policy).

Exploration policies are subtypes of the abstract ExplorationPolicy type and follow this interface: action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.

The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four argument version of action where on_policy is the policy being learned (e.g. tabular policy or neural network policy).

This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy

POMDPTools.Policies.EpsGreedyPolicyType
EpsGreedyPolicy <: ExplorationPolicy

represents an epsilon greedy policy, sampling a random action with a probability eps or returning an action from a given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using those policies in reinforcement learning algorithms.

Constructor:

EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)

If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).

Fields

  • eps::Function
  • rng::AbstractRNG
  • m::M POMDPs or MDPs problem
source
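
The epsilon-greedy rule itself is short enough to sketch in plain Julia. The function below is a hypothetical stand-in, not the package's code: `eps_greedy` and its argument names are assumptions, and `greedy` represents whatever action the policy being learned would return.

```julia
using Random

# Hypothetical sketch: with probability eps_fn(k) pick a uniformly random
# action, otherwise keep the action proposed by the learned policy.
function eps_greedy(rng::AbstractRNG, eps_fn, k, actions, greedy)
    if rand(rng) < eps_fn(k)
        return rand(rng, actions)
    else
        return greedy
    end
end

rng = MersenneTwister(42)
decay = k -> 0.05 * 0.9^(k / 10)          # exponential decay of epsilon
actions = [:up, :down, :left, :right]
a = eps_greedy(rng, decay, 100, actions, :up)
```

Passing a function of k rather than a fixed number is what lets the same rule explore heavily early in training and act almost greedily later.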
POMDPTools.Policies.SoftmaxPolicyType
SoftmaxPolicy <: ExplorationPolicy

represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less wide.

Constructor

SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())

If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s)

Fields

  • temperature::Function
  • rng::AbstractRNG
  • actions::A an indexable list of actions
source
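
To make the role of the temperature concrete, here is a small self-contained sketch of softmax sampling. The helper names `softmax_probs` and `sample_index` are hypothetical and the code is an illustration of the technique, not the SoftmaxPolicy source.

```julia
using Random

# Hypothetical helper: convert action values into sampling probabilities.
function softmax_probs(values::AbstractVector{<:Real}, temperature::Real)
    z = values ./ temperature
    z = z .- maximum(z)            # subtract the max for numerical stability
    w = exp.(z)
    return w ./ sum(w)
end

# Hypothetical helper: draw an index according to the probabilities p.
function sample_index(rng::AbstractRNG, p)
    u = rand(rng)
    c = 0.0
    for (i, p_i) in enumerate(p)
        c += p_i
        u <= c && return i
    end
    return length(p)               # guard against floating-point round-off
end

vals = [1.0, 2.0, 3.0]
p_cold = softmax_probs(vals, 1.0)     # favors the highest-valued action
p_hot = softmax_probs(vals, 100.0)    # close to uniform
i = sample_index(MersenneTwister(1), p_cold)
```

A low temperature concentrates probability on the best action, while a high temperature flattens the distribution toward uniform exploration.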

Schedule

Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon greedy policy with an exponential decay schedule as follows:

    m # your mdp or pomdp model
+    exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))

POMDPTools exports a linear decay schedule object that can be used as well.

POMDPTools.Policies.LinearDecayScheduleType
LinearDecaySchedule

A schedule that linearly decreases a value from start to stop in steps steps. Once the value reaches stop, it stays constant.

Constructor

LinearDecaySchedule(;start, stop, steps)

source
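
A linear decay schedule can be pictured as a callable object that clamps at its final value. The struct below is a hypothetical stand-in written for illustration: its field names mirror the documented keyword arguments, but the body is an assumption about the behavior, not the package source.

```julia
# Hypothetical stand-in for a linear decay schedule.
struct LinearDecay
    start::Float64
    stop::Float64
    steps::Int
end

# Calling the object with step index k returns the scheduled value.
function (sch::LinearDecay)(k)
    rate = (sch.start - sch.stop) / sch.steps
    return max(sch.stop, sch.start - k * rate)   # never drop below stop
end

epsilon = LinearDecay(1.0, 0.0, 10)
e_first = epsilon(0)    # value at the start
e_mid = epsilon(5)      # halfway through the decay
e_late = epsilon(20)    # past the end of the decay, held at stop
```

Because the object is callable with a single argument k, it can be used anywhere a function schedule such as k->0.05*0.9^(k/10) is accepted.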

Playback Policy

A policy that replays a fixed sequence of actions. When all actions are used, a backup policy is used.

POMDPTools.Policies.PlaybackPolicyType
PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}

a policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.

Constructor:

`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`

Fields

  • actions::Vector{A} a vector of actions to play back
  • backup_policy::Policy the policy to use when all prescribed actions have been taken but the episode continues
  • logpdfs::Vector{Float64} the log probability (density) of actions
  • i::Int64 the current action index
source

Utility Wrapper

A wrapper for policies to collect statistics and handle errors.

POMDPTools.Policies.PolicyWrapperType
PolicyWrapper

Flexible utility wrapper for a policy designed for collecting statistics about planning.

Carries a function, a policy, and optionally a payload (that can be any type).

The function should typically be defined with the do syntax. Each time action is called on the wrapper, this function will be called.

If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, action(policy, s) should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.

Constructor

PolicyWrapper(policy::Policy; payload=nothing)

Example

using POMDPModels
 using POMDPTools
 
 mdp = GridWorld()
@@ -32,10 +32,10 @@
     return a
 end
 
-h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)

Fields

  • f::F
  • policy::P
  • payload::PL
source

Pretty Printing Policies

POMDPTools.Policies.showpolicyFunction
showpolicy([io], [mime], m::MDP, p::Policy)
+h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)

Fields

  • f::F
  • policy::P
  • payload::PL
source

Pretty Printing Policies

POMDPTools.Policies.showpolicyFunction
showpolicy([io], [mime], m::MDP, p::Policy)
 showpolicy([io], [mime], statelist::AbstractVector, p::Policy)
-showpolicy(...; pre=" ")

Print the states in m or statelist and the actions from policy p corresponding to those states.

For the MDP version, if io[:limit] is true, will only print enough states to fill the display.

source

Policy Evaluation

The evaluate function provides a policy evaluation tool for MDPs:

POMDPTools.Policies.evaluateFunction
evaluate(m::MDP, p::Policy)
+showpolicy(...; pre=" ")

Print the states in m or statelist and the actions from policy p corresponding to those states.

For the MDP version, if io[:limit] is true, will only print enough states to fill the display.

source

Policy Evaluation

The evaluate function provides a policy evaluation tool for MDPs:

POMDPTools.Policies.evaluateFunction
evaluate(m::MDP, p::Policy)
 evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)

Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, Decision Making Under Uncertainty, 2015.

Returns a DiscreteValueFunction, which maps states to values.
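
For intuition, one standard way to evaluate a fixed policy exactly on a small discrete MDP is to solve the linear system U = R + γ T U. The sketch below does this for a made-up two-state chain; the matrices, the discount, and the choice of a direct linear solve are assumptions for illustration and are not taken from the evaluate source.

```julia
using LinearAlgebra

# Two-state chain under a fixed policy (numbers invented for illustration).
T = [0.9 0.1;
     0.0 1.0]         # T[s, sp]: transition probabilities under the policy
R = [1.0, 0.0]        # expected immediate reward in each state
γ = 0.95              # discount factor

# Solve (I - γ T) U = R, which is U = R + γ T U rearranged.
U = (I - γ * T) \ R
```

State 2 is absorbing with zero reward, so its value is 0, and state 1 satisfies U₁ = 1 + 0.95 · 0.9 · U₁, giving U₁ = 1 / 0.145.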

Example

using POMDPTools, POMDPModels
 m = SimpleGridWorld()
 u = evaluate(m, FunctionPolicy(x->:left))
-u([1,1]) # value of always moving left starting at state [1,1]
source
+u([1,1]) # value of always moving left starting at state [1,1]
source
diff --git a/dev/POMDPTools/simulators/index.html b/dev/POMDPTools/simulators/index.html index 13fb9054..24902db1 100644 --- a/dev/POMDPTools/simulators/index.html +++ b/dev/POMDPTools/simulators/index.html @@ -16,7 +16,7 @@ println("in state $s") println("took action $a") println("received observation $o and reward $r") -end

The optional spec argument can be a string, tuple of symbols, or single symbol and follows the same pattern as eachstep called on a SimHistory object.

Under the hood, this function creates a StepSimulator with spec and returns a [PO]MDPSimIterator by calling simulate with all of the arguments except spec. All keyword arguments are passed to the StepSimulator constructor.

source

The StepSimulator contained in this file can provide the same functionality with the following syntax:

sim = StepSimulator("s,a,r,sp")
+end

The optional spec argument can be a string, tuple of symbols, or single symbol and follows the same pattern as eachstep called on a SimHistory object.

Under the hood, this function creates a StepSimulator with spec and returns a [PO]MDPSimIterator by calling simulate with all of the arguments except spec. All keyword arguments are passed to the StepSimulator constructor.

source

The StepSimulator contained in this file can provide the same functionality with the following syntax:

sim = StepSimulator("s,a,r,sp")
 for (s,a,r,sp) in simulate(sim, problem, policy)
     # do something
 end

Rollouts

RolloutSimulator is the simplest MDP or POMDP simulator. When simulate is called, it simply simulates a single trajectory of the process and returns the discounted reward.

rs = RolloutSimulator()
@@ -25,12 +25,12 @@
 
 r = simulate(rs, mdp, policy)
POMDPTools.Simulators.RolloutSimulatorType
RolloutSimulator(rng, max_steps)
 RolloutSimulator(; <keyword arguments>)

A fast simulator that just returns the reward

The simulation will be terminated when either

  1. a terminal state is reached (as determined by isterminal()) or
  2. the discount factor is as small as eps or
  3. max_steps have been executed

Keyword arguments:

  • rng::AbstractRNG (default: Random.default_rng()) - A random number generator to use.
  • eps::Float64 (default: 0.0) - A small number; if γᵗ where γ is the discount factor and t is the time step becomes smaller than this, the simulation will be terminated.
  • max_steps::Int (default: typemax(Int)) - The maximum number of steps to simulate.
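
The three termination conditions can be seen together in a toy rollout loop. This is a hypothetical, package-free sketch: the process (a counter that pays reward 1.0 per step and terminates at state 5) and the function name `rollout` are invented for illustration.

```julia
# Hypothetical toy rollout that accumulates discounted reward.
function rollout(; γ=0.9, eps=0.0, max_steps=typemax(Int))
    s, disc, total, t = 1, 1.0, 0.0, 0
    # Continue until a terminal state, a negligible discount, or the step limit.
    while s != 5 && disc > eps && t < max_steps
        total += disc * 1.0    # reward of 1.0, weighted by the current discount
        s += 1                 # deterministic transition to the next state
        disc *= γ
        t += 1
    end
    return total
end

r_full = rollout()                 # stops at the terminal state after 4 steps
r_capped = rollout(max_steps=2)    # stops at the step limit
r_eps = rollout(eps=0.85)          # stops once the discount is no larger than eps
```

With the default eps of 0.0 the discount test never triggers in practice, so only the terminal state or max_steps ends the simulation.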

Usage (optional arguments in brackets):

ro = RolloutSimulator()
-history = simulate(ro, pomdp, policy, [updater [, init_belief [, init_state]]])

See also: HistoryRecorder, run_parallel

source

History Recorder

A HistoryRecorder runs a simulation and records the trajectory. It returns an AbstractVector of NamedTuples - see Histories for more info.

hr = HistoryRecorder(max_steps=100)
+history = simulate(ro, pomdp, policy, [updater [, init_belief [, init_state]]])

See also: HistoryRecorder, run_parallel

source

History Recorder

A HistoryRecorder runs a simulation and records the trajectory. It returns an AbstractVector of NamedTuples - see Histories for more info.

hr = HistoryRecorder(max_steps=100)
 pomdp = TigerPOMDP()
 policy = RandomPolicy(pomdp)
 
 h = simulate(hr, pomdp, policy)
POMDPTools.Simulators.HistoryRecorderType

A simulator that records the history for later examination

The simulation will be terminated when either

  1. a terminal state is reached (as determined by isterminal()) or
  2. the discount factor is as small as eps or
  3. max_steps have been executed

Keyword Arguments:

  • rng: The random number generator for the simulation
  • capture_exception::Bool: whether to capture an exception and store it in the history, or let it go uncaught, potentially killing the script
  • show_progress::Bool: show a progress bar for the simulation
  • eps
  • max_steps

Usage (optional arguments in brackets):

hr = HistoryRecorder()
-history = simulate(hr, pomdp, policy, [updater [, init_belief [, init_state]]])
source

sim()

The sim function provides a convenient way to interact with a POMDP or MDP environment and return a history. The first argument is a function that is called at every time step and takes a state (in the case of an MDP) or an observation (in the case of a POMDP) as the argument and then returns an action. The second argument is a pomdp or mdp. It is intended to be used with Julia's do syntax as follows:

pomdp = TigerPOMDP()
+history = simulate(hr, pomdp, policy, [updater [, init_belief [, init_state]]])
source

sim()

The sim function provides a convenient way to interact with a POMDP or MDP environment and return a history. The first argument is a function that is called at every time step and takes a state (in the case of an MDP) or an observation (in the case of a POMDP) as the argument and then returns an action. The second argument is a pomdp or mdp. It is intended to be used with Julia's do syntax as follows:

pomdp = TigerPOMDP()
 history = sim(pomdp, max_steps=10) do obs
     println("Observation was $obs.")
     return TIGER_OPEN_LEFT
@@ -49,7 +49,7 @@
     return a
 end

for a POMDP and a belief updater.

Keyword Arguments

All Versions

POMDP version

POMDP and updater version

source

Histories

The results produced by HistoryRecorders and the sim function are contained in SimHistory objects.

POMDPTools.Simulators.SimHistoryType
SimHistory

A (PO)MDP simulation history returned by simulate(::HistoryRecorder, ::Union{MDP,POMDP},...).

This is an AbstractVector of NamedTuples containing the states, actions, etc.

Examples

hist[1][:s] # returns the first state in the history
hist[:a] # returns all of the actions in the history
source

Examples

using POMDPs, POMDPTools, POMDPModels
+end
will limit the simulation to 100 steps.

POMDP version

POMDP and updater version

source

Histories

The results produced by HistoryRecorders and the sim function are contained in SimHistory objects.

POMDPTools.Simulators.SimHistoryType
SimHistory

A (PO)MDP simulation history returned by simulate(::HistoryRecorder, ::Union{MDP,POMDP},...).

This is an AbstractVector of NamedTuples containing the states, actions, etc.

Examples

hist[1][:s] # returns the first state in the history
hist[:a] # returns all of the actions in the history
source

Examples

using POMDPs, POMDPTools, POMDPModels
 hr = HistoryRecorder(max_steps=10)
 hist = simulate(hr, BabyPOMDP(), FunctionPolicy(x->true))
 step = hist[1] # all information available about the first step
@@ -60,12 +60,12 @@
     println("reward $r received when state $sp was reached after action $a was taken in state $s")
 end

returns the start state, action, reward and destination state for each step of the simulation.

Alternatively, instead of expanding the steps implicitly, the elements of the step can be accessed as fields (since each step is a NamedTuple):

for step in eachstep(h, "(s, a, r, sp)")    
     println("reward $(step.r) received when state $(step.sp) was reached after action $(step.a) was taken in state $(step.s)")
-end

The possible valid elements in the iteration specification are

source

Examples:

collect(eachstep(h, "a,o"))

will produce a vector of action-observation named tuples.

collect(norm(sp-s) for (s,sp) in eachstep(h, "s,sp"))

will produce a vector of the distances traveled on each step (assuming the state is a Euclidean vector).

Notes

Other Functions

state_hist(h), action_hist(h), observation_hist(h), belief_hist(h), and reward_hist(h) will return vectors of the states, actions, observations, beliefs, and rewards, respectively, and undiscounted_reward(h) and discounted_reward(h) will return the total rewards collected over the trajectory. n_steps(h) returns the number of steps in the history. exception(h) and backtrace(h) can be used to retrieve the exception and its backtrace if the simulation failed to finish.

view(h, range) (e.g. view(h, 1:n_steps(h)-4)) can be used to create a view of the history object h that only contains a certain range of steps. The object returned by view is an AbstractSimHistory that can be iterated through and manipulated just like a complete SimHistory.

Parallel

POMDPTools contains a utility for running many Monte Carlo simulations in parallel to evaluate performance. The basic workflow involves the following steps:

  1. Create a vector of Sim objects, each specifying how a single simulation should be run.
  2. Use the run_parallel or run function to run the simulations.
  3. Analyze the results of the simulations contained in the DataFrame returned by run_parallel.

Example

An example can be found in the Parallel Simulations section.

Sim objects

Each simulation should be specified by a Sim object which contains all the information needed to run a simulation, including the Simulator, POMDP or MDP, Policy, Updater, and any other ingredients.

POMDPTools.Simulators.SimType
Sim(m::MDP, p::Policy[, initialstate]; kwargs...)
-Sim(m::POMDP, p::Policy[, updater[, initial_belief[, initialstate]]]; kwargs...)

Create a Sim object that contains everything needed to run and record a single simulation, including model, initial conditions, and metadata.

A vector of Sim objects can be executed with run or run_parallel.

Keyword Arguments

  • rng::AbstractRNG=Random.default_rng()
  • max_steps::Int=typemax(Int)
  • simulator::Simulator=HistoryRecorder(rng=rng, max_steps=max_steps)
  • metadata::NamedTuple a named tuple (or dictionary) of metadata for the sim that will be recorded, e.g. (solver_iterations=500,).
source

Running simulations

The simulations are actually carried out by the run and run_parallel functions.

POMDPTools.Simulators.run_parallelFunction
run_parallel(queue::Vector{Sim})
+end

The possible valid elements in the iteration specification are

  • Any node in the (PO)MDP Dynamic Decision network (by default :s, :a, :sp, :o, :r)
  • b - the initial belief in the step (for POMDPs only)
  • bp - the belief after being updated based on o (for POMDPs only)
  • action_info - info from the policy decision (from action_info)
  • update_info - info from the belief update (from update_info)
  • t - the timestep index
source

Examples:

collect(eachstep(h, "a,o"))

will produce a vector of action-observation named tuples.

collect(norm(sp-s) for (s,sp) in eachstep(h, "s,sp"))

will produce a vector of the distances traveled on each step (assuming the state is a Euclidean vector).

Notes

Other Functions

state_hist(h), action_hist(h), observation_hist(h), belief_hist(h), and reward_hist(h) will return vectors of the states, actions, observations, beliefs, and rewards, respectively, and undiscounted_reward(h) and discounted_reward(h) will return the total rewards collected over the trajectory. n_steps(h) returns the number of steps in the history. exception(h) and backtrace(h) can be used to retrieve the exception and its backtrace if the simulation failed to finish.

view(h, range) (e.g. view(h, 1:n_steps(h)-4)) can be used to create a view of the history object h that only contains a certain range of steps. The object returned by view is an AbstractSimHistory that can be iterated through and manipulated just like a complete SimHistory.
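
As a quick arithmetic check on the two reward totals mentioned above, the following package-free sketch computes both from a plain vector of per-step rewards. The reward values and discount are made up, and the convention that the first step is weighted by γ⁰ is an assumption stated here rather than taken from the source.

```julia
rewards = [1.0, 0.0, 2.0]   # invented per-step rewards
γ = 0.5                     # invented discount factor

# Plain sum of the rewards.
undiscounted = sum(rewards)

# Step t is weighted by γ^(t-1), so the first reward is not discounted.
discounted = sum(γ^(t - 1) * r for (t, r) in enumerate(rewards))
```

Here the undiscounted total is 3.0 and the discounted total is 1.0 + 0.0 + 0.25 · 2.0 = 1.5.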

Parallel

POMDPTools contains a utility for running many Monte Carlo simulations in parallel to evaluate performance. The basic workflow involves the following steps:

  1. Create a vector of Sim objects, each specifying how a single simulation should be run.
  2. Use the run_parallel or run function to run the simulations.
  3. Analyze the results of the simulations contained in the DataFrame returned by run_parallel.

Example

An example can be found in the Parallel Simulations section.

Sim objects

Each simulation should be specified by a Sim object which contains all the information needed to run a simulation, including the Simulator, POMDP or MDP, Policy, Updater, and any other ingredients.

POMDPTools.Simulators.SimType
Sim(m::MDP, p::Policy[, initialstate]; kwargs...)
+Sim(m::POMDP, p::Policy[, updater[, initial_belief[, initialstate]]]; kwargs...)

Create a Sim object that contains everything needed to run and record a single simulation, including model, initial conditions, and metadata.

A vector of Sim objects can be executed with run or run_parallel.

Keyword Arguments

  • rng::AbstractRNG=Random.default_rng()
  • max_steps::Int=typemax(Int)
  • simulator::Simulator=HistoryRecorder(rng=rng, max_steps=max_steps)
  • metadata::NamedTuple a named tuple (or dictionary) of metadata for the sim that will be recorded, e.g. (solver_iterations=500,).
source

Running simulations

The simulations are actually carried out by the run and run_parallel functions.

POMDPTools.Simulators.run_parallelFunction
run_parallel(queue::Vector{Sim})
 run_parallel(f::Function, queue::Vector{Sim})

Run Sim objects in queue in parallel and return results as a DataFrame.

By default, the DataFrame will contain the reward for each simulation and the metadata provided to the sim.

Arguments

  • queue: List of Sim objects to be executed
  • f: Function to process the results of each simulation

This function should take two arguments, (1) the Sim that was executed and (2) the result of the simulation, by default a SimHistory. It should return a named tuple that will appear in the dataframe. See Examples below.

Keyword Arguments

  • show_progress::Bool: whether or not to show a progress meter
  • progress::ProgressMeter.Progress: determines how the progress meter is displayed

Examples

run_parallel(queue) do sim, hist
     return (n_steps=n_steps(hist), reward=discounted_reward(hist))
-end

will return a dataframe with the number of steps and the reward in it.

source

The run function is also provided to run simulations in serial (this is often useful for debugging). Note that the documentation below also contains a section for the built-in Julia run function, even though it is not relevant here.

Base.runFunction
run(queue::Vector{Sim})
-run(f::Function, queue::Vector{Sim})

Run the Sim objects in queue on a single process and return the results as a dataframe.

See run_parallel for more information.

source

Specifying information to be recorded

By default, only the discounted rewards from each simulation are recorded, but arbitrary information can be recorded.

The run_parallel and run functions accept a function (normally specified via the do syntax) that takes the Sim object and history of the simulation and extracts relevant statistics as a named tuple. For example, if the desired characteristics are the number of steps in the simulation and the reward, run_parallel would be invoked as follows:

df = run_parallel(queue) do sim::Sim, hist::SimHistory
+end

will return a dataframe with the number of steps and the reward in it.

source

The run function is also provided to run simulations in serial (this is often useful for debugging). Note that the documentation below also contains a section for the built-in Julia run function, even though it is not relevant here.

Base.runFunction
run(queue::Vector{Sim})
+run(f::Function, queue::Vector{Sim})

Run the Sim objects in queue on a single process and return the results as a dataframe.

See run_parallel for more information.

source

Specifying information to be recorded

By default, only the discounted rewards from each simulation are recorded, but arbitrary information can be recorded.

The run_parallel and run functions accept a function (normally specified via the do syntax) that takes the Sim object and history of the simulation and extracts relevant statistics as a named tuple. For example, if the desired characteristics are the number of steps in the simulation and the reward, run_parallel would be invoked as follows:

df = run_parallel(queue) do sim::Sim, hist::SimHistory
     return (n_steps=n_steps(hist), reward=discounted_reward(hist))
 end

These statistics are combined into a DataFrame, with each line representing a single simulation, allowing for statistical analysis. For example,

mean(df.reward ./ df.n_steps) # mean comes from the Statistics standard library

would compute the average reward per step with each simulation weighted equally regardless of length.

Display

DisplaySimulator

The DisplaySimulator displays each step of a simulation in real time through a multimedia display such as a Jupyter notebook or ElectronDisplay. Specifically it uses POMDPTools.render and the built-in Julia display function to visualize each step.

Example:

using POMDPs
 using POMDPModels
@@ -77,4 +77,4 @@
 m = SimpleGridWorld()
 simulate(ds, m, RandomPolicy(m))
POMDPTools.Simulators.DisplaySimulatorType
DisplaySimulator(;kwargs...)

Create a simulator that displays each step of a simulation.

Given a POMDP or MDP model m, this simulator roughly works like

for step in stepthrough(m, ...)
     display(render(m, step))
-end

Keyword Arguments

  • display::AbstractDisplay: the display to use for the first argument to the display function. If this is nothing, display(...) will be called without an AbstractDisplay argument.
  • render_kwargs::NamedTuple: keyword arguments for POMDPTools.render(...)
  • max_fps::Number=10: maximum number of frames to be displayed per second - sleep will be used to skip extra time, so this is not designed for high precision
  • predisplay::Function: function to call before every call to display(...). The only argument to this function will be the display (if it is specified) or nothing
  • extra_initial::Bool=false: if true, display an extra step at the beginning with only elements t, sp, and bp for POMDPs (this can be useful to see the initial state if render displays only sp and not s).
  • extra_final::Bool=true: if true, display an extra step at the end with only elements t, done, s, and b for POMDPs (this can be useful to see the final state if render displays only s and not sp).
  • max_steps::Integer: maximum number of steps to run for
  • spec::NTuple{Symbol}: specification of what step elements to display (see eachstep)
  • rng::AbstractRNG: random number generator

See the POMDPSimulators documentation for more tips about using specific displays.

source

Display-specific tips

The following tips may be helpful when using particular displays.

Jupyter notebooks

By default, in a Jupyter notebook, the visualizations of all steps are displayed in the output box one after another. To make the output animated instead, where the image is overwritten at each step, one may use

DisplaySimulator(predisplay=(d)->IJulia.clear_output(true))

ElectronDisplay

By default, ElectronDisplay will open a new window for each new step. To prevent this, use

ElectronDisplay.CONFIG.single_window = true
+end

Keyword Arguments

See the POMDPSimulators documentation for more tips about using specific displays.

source

Display-specific tips

The following tips may be helpful when using particular displays.

Jupyter notebooks

By default, in a Jupyter notebook, the visualizations of all steps are displayed in the output box one after another. To make the output animated instead, where the image is overwritten at each step, one may use

DisplaySimulator(predisplay=(d)->IJulia.clear_output(true))

ElectronDisplay

By default, ElectronDisplay will open a new window for each new step. To prevent this, use

ElectronDisplay.CONFIG.single_window = true
diff --git a/dev/POMDPTools/testing/index.html b/dev/POMDPTools/testing/index.html index fd25d638..581fc45b 100644 --- a/dev/POMDPTools/testing/index.html +++ b/dev/POMDPTools/testing/index.html @@ -1,8 +1,8 @@ Testing · POMDPs.jl

Testing

POMDPTools contains basic utilities for testing models and solvers.

Testing (PO)MDP Models

POMDPTools.Testing.has_consistent_distributionsFunction
has_consistent_distributions(m::MDP; atol=0)
-has_consistent_distributions(m::POMDP; atol=0)

Return true if no problems are found in the distributions for a discrete problem. Print information and return false if problems are found.

Tests whether

  • All probabilities are positive
  • Probabilities for all distributions sum to 1
  • All items with positive probability are in the support

Keyword Arguments

  • atol: absolute tolerance passed to approx for all probability checks
source
POMDPTools.Testing.has_consistent_initial_distributionFunction
has_consistent_initial_distribution(m; atol=0)

Return true if no problems are found with the initial state distribution for a discrete problem. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source
POMDPTools.Testing.has_consistent_transition_distributionsFunction
has_consistent_transition_distributions(m; atol=0)

Return true if no problems are found in the transition distributions for a discrete problem. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source
POMDPTools.Testing.has_consistent_observation_distributionsFunction
has_consistent_observation_distributions(m; atol=0)

Return true if no problems are found in the observation distributions for a discrete POMDP. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source

Testing Solvers

POMDPTools.Testing.test_solverFunction
test_solver(solver::Solver, problem::POMDP)
+has_consistent_distributions(m::POMDP; atol=0)

Return true if no problems are found in the distributions for a discrete problem. Print information and return false if problems are found.

Tests whether

  • All probabilities are positive
  • Probabilities for all distributions sum to 1
  • All items with positive probability are in the support

Keyword Arguments

  • atol: absolute tolerance passed to approx for all probability checks
source
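
The first two checks listed above are easy to picture on a single discrete distribution. The function below is a hypothetical, simplified sketch that represents a distribution as a Dict of probabilities; `looks_consistent` is an invented name, it accepts zero probabilities, and it does not perform the support check.

```julia
# Hypothetical check on one discrete distribution stored as item => probability.
function looks_consistent(d::AbstractDict; atol=0.0)
    # Every probability must be non-negative.
    all(p -> p >= 0, values(d)) || return false
    # Probabilities must sum to 1, up to the given absolute tolerance.
    return isapprox(sum(values(d)), 1.0; atol=atol)
end

ok = looks_consistent(Dict(:a => 0.5, :b => 0.5))
too_much = looks_consistent(Dict(:a => 0.7, :b => 0.5))
negative = looks_consistent(Dict(:a => -0.1, :b => 1.1))
```

A full model check would repeat this for the initial, transition, and observation distributions of every relevant state and action.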
POMDPTools.Testing.has_consistent_initial_distributionFunction
has_consistent_initial_distribution(m; atol=0)

Return true if no problems are found with the initial state distribution for a discrete problem. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source
POMDPTools.Testing.has_consistent_transition_distributionsFunction
has_consistent_transition_distributions(m; atol=0)

Return true if no problems are found in the transition distributions for a discrete problem. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source
POMDPTools.Testing.has_consistent_observation_distributionsFunction
has_consistent_observation_distributions(m; atol=0)

Return true if no problems are found in the observation distributions for a discrete POMDP. Print information and return false if problems are found.

See has_consistent_distributions for information on what checks are performed.

source

Testing Solvers

POMDPTools.Testing.test_solverFunction
test_solver(solver::Solver, problem::POMDP)
 test_solver(solver::Solver, problem::MDP)

Use the solver to solve the specified problem, then run a simulation.

This is designed to illustrate how solvers are expected to function. All solvers should be able to complete this standard test with the simple models in the POMDPModels package.

Note that this does NOT test the optimality of the solution, but is only a smoke test to see if the solver interacts with POMDP models as expected.

To run this with a solver called YourSolver, run

using POMDPTools
 using POMDPModels
 
 solver = YourSolver(# initialize with parameters #)
-test_solver(solver, BabyPOMDP())
source
+test_solver(solver, BabyPOMDP())source diff --git a/dev/POMDPTools/visualization/index.html b/dev/POMDPTools/visualization/index.html index 2f63d407..e483af07 100644 --- a/dev/POMDPTools/visualization/index.html +++ b/dev/POMDPTools/visualization/index.html @@ -1,7 +1,7 @@ -Visualization · POMDPs.jl

Visualization

POMDPTools contains a basic visualization interface consisting of the render function.

Problem writers should implement a method of this function so that their problem can be visualized in a variety of contexts including jupyter notebooks, web browsers, or saved as images or animations.

POMDPTools.ModelTools.renderFunction
render(m::Union{MDP,POMDP}, step::NamedTuple)

Return a renderable representation of the step in problem m.

The renderable representation may be anything that has show(io, mime, x) methods. It could be a plot, svg, Compose.jl context, Cairo context, or image.

Arguments

step is a NamedTuple that contains the states, action, etc. corresponding to one transition in a simulation. It may have the following fields:

  • t: the time step index
  • s: the state at the beginning of the step
  • a: the action
  • sp: the state at the end of the step (s')
  • r: the reward for the step
  • o: the observation
  • b: the belief at the beginning of the step
  • bp: the belief at the end of the step
  • i: info from the model when the state transition was calculated
  • ai: info from the policy decision
  • ui: info from the belief update

Keyword arguments are reserved for the problem implementer and can be used to control appearance, etc.

Important Notes

  • step may not contain all of the elements listed above, so render should check for them and render only what is available
  • o typically corresponds to sp, so it is often clearer for POMDPs to render sp rather than s.
source

Sometimes it is important to have control over how the problem is rendered with different mimetypes. One way to handle this is to have render return a custom type, e.g.

struct MyProblemVisualization
+Visualization · POMDPs.jl

Visualization

POMDPTools contains a basic visualization interface consisting of the render function.

Problem writers should implement a method of this function so that their problem can be visualized in a variety of contexts including jupyter notebooks, web browsers, or saved as images or animations.

POMDPTools.ModelTools.renderFunction
render(m::Union{MDP,POMDP}, step::NamedTuple)

Return a renderable representation of the step in problem m.

The renderable representation may be anything that has show(io, mime, x) methods. It could be a plot, svg, Compose.jl context, Cairo context, or image.

Arguments

step is a NamedTuple that contains the states, action, etc. corresponding to one transition in a simulation. It may have the following fields:

  • t: the time step index
  • s: the state at the beginning of the step
  • a: the action
  • sp: the state at the end of the step (s')
  • r: the reward for the step
  • o: the observation
  • b: the belief at the beginning of the step
  • bp: the belief at the end of the step
  • i: info from the model when the state transition was calculated
  • ai: info from the policy decision
  • ui: info from the belief update

Keyword arguments are reserved for the problem implementer and can be used to control appearance, etc.

Important Notes

  • step may not contain all of the elements listed above, so render should check for them and render only what is available
  • o typically corresponds to sp, so it is often clearer for POMDPs to render sp rather than s.
source

Sometimes it is important to have control over how the problem is rendered with different mimetypes. One way to handle this is to have render return a custom type, e.g.

struct MyProblemVisualization
     mdp::MyProblem
     step::NamedTuple
 end
 
-POMDPTools.render(mdp, step) = MyProblemVisualization(mdp, step)

and then implement custom show methods, e.g.

show(io::IO, mime::MIME"text/html", v::MyProblemVisualization)
+POMDPTools.render(mdp, step) = MyProblemVisualization(mdp, step)

and then implement custom show methods, e.g.

show(io::IO, mime::MIME"text/html", v::MyProblemVisualization)
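The custom-type approach above can be sketched in plain Julia. This is only an illustration under stated assumptions: `StepVisualization`, `render_sketch`, and `describe` are hypothetical names, and the sketch does not extend the real `POMDPTools.render` function. It also shows the check recommended in the notes, namely rendering only the step fields that are actually present.

```julia
# Plain-Julia sketch; all names here are hypothetical and not part of POMDPTools.
struct StepVisualization
    step::NamedTuple
end

# Stand-in for a `render` method: wrap the step in the custom type.
render_sketch(step::NamedTuple) = StepVisualization(step)

# Build a description from only the fields that exist in this step.
function describe(v::StepVisualization)
    parts = String[]
    for key in (:t, :s, :a, :sp, :r, :o)
        if haskey(v.step, key)
            push!(parts, string(key, " = ", v.step[key]))
        end
    end
    return join(parts, ", ")
end

# One `show` method per MIME type controls how the object is displayed.
Base.show(io::IO, ::MIME"text/plain", v::StepVisualization) =
    print(io, "Step(", describe(v), ")")
Base.show(io::IO, ::MIME"text/html", v::StepVisualization) =
    print(io, "<p><b>Step:</b> ", describe(v), "</p>")

v = render_sketch((s = 1, a = :right, sp = 2, r = -1.0))
plain = sprint(show, MIME("text/plain"), v)
html = sprint(show, MIME("text/html"), v)
```

A notebook front end would pick the HTML method, while the REPL would fall back to the plain-text one.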
diff --git a/dev/api/index.html b/dev/api/index.html
index 7f31c667..388f0410 100644
--- a/dev/api/index.html
+++ b/dev/api/index.html
@@ -1,36 +1,36 @@
 API Documentation · POMDPs.jl

API Documentation

Docstrings for POMDPs.jl interface members can be accessed through Julia's built-in documentation system or in the list below.

Contents

Index

Types

POMDPs.POMDPType
POMDP{S,A,O}

Abstract base type for a partially observable Markov decision process.

S: state type
 A: action type
-O: observation type
source
POMDPs.MDPType
MDP{S,A}

Abstract base type for a fully observable Markov decision process.

S: state type
-A: action type
source
POMDPs.PolicyType

Base type for a policy (a map from every possible belief, or more abstract policy state, to an optimal or suboptimal action)

source
POMDPs.UpdaterType

Abstract type for an object that defines how the belief should be updated

A belief is a general construct that represents the knowledge an agent has about the state of the system. This can be a probability distribution, an action observation history or a more general representation.

source

Model Functions

Dynamics

POMDPs.transitionFunction
transition(m::POMDP, state, action)
-transition(m::MDP, state, action)

Return the transition distribution from the current state-action pair.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a generative model.

source
POMDPs.MDPType
MDP{S,A}

Abstract base type for a fully observable Markov decision process.

S: state type
+A: action type
source
POMDPs.PolicyType

Base type for a policy (a map from every possible belief, or more abstract policy state, to an optimal or suboptimal action)

source
POMDPs.UpdaterType

Abstract type for an object that defines how the belief should be updated

A belief is a general construct that represents the knowledge an agent has about the state of the system. This can be a probability distribution, an action observation history or a more general representation.

source

Model Functions

Dynamics

POMDPs.transitionFunction
transition(m::POMDP, state, action)
+transition(m::MDP, state, action)

Return the transition distribution from the current state-action pair.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a generative model.

source
POMDPs.observationFunction
observation(m::POMDP, statep)
 observation(m::POMDP, action, statep)
 observation(m::POMDP, state, action, statep)

Return the observation distribution. You need only define the method with the fewest arguments needed to determine the observation distribution.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a generative model.

Example

using POMDPs, POMDPTools # POMDPTools provides SparseCat
 
 struct MyPOMDP <: POMDP{Int, Int, Int} end
 
-POMDPs.observation(p::MyPOMDP, sp::Int) = SparseCat([sp-1, sp, sp+1], [0.1, 0.8, 0.1])
source
POMDPs.rewardFunction
reward(m::POMDP, s, a)
+POMDPs.observation(p::MyPOMDP, sp::Int) = SparseCat([sp-1, sp, sp+1], [0.1, 0.8, 0.1])
source
POMDPs.rewardFunction
reward(m::POMDP, s, a)
 reward(m::MDP, s, a)

Return the immediate reward for the s-a pair.

reward(m::POMDP, s, a, sp)
-reward(m::MDP, s, a, sp)

Return the immediate reward for the s-a-s' triple

reward(m::POMDP, s, a, sp, o)

Return the immediate reward for the s-a-s'-o quad

For some problems, it is easier to express reward(m, s, a, sp) or reward(m, s, a, sp, o), than reward(m, s, a), but some solvers, e.g. SARSOP, can only use reward(m, s, a). Both can be implemented for a problem, but when reward(m, s, a) is implemented, it should be consistent with reward(m, s, a, sp[, o]), that is, it should be the expected value over all destination states and observations.

source
POMDPs.genFunction
gen(m::Union{MDP,POMDP}, s, a, rng::AbstractRNG)

Function for implementing the entire MDP/POMDP generative model by returning a NamedTuple.

gen should only be implemented in the case where two or more of the next state, observation, and reward need to be generated at the same time. If the state transition model can be separated from the reward and observation models, you should implement transition with an ImplicitDistribution instead of gen.

Solver and simulator writers should use the @gen macro to call a generative model.

Arguments

  • m: an MDP or POMDP model
  • s: the current state
  • a: the action
  • rng: a random number generator (Typically a MersenneTwister)

Return

The function should return a NamedTuple with a subset of the following entries:

MDP

  • sp: the next state
  • r: the reward for the step
  • info: extra debugging information, typically in an associative container like a NamedTuple

POMDP

  • sp: the next state
  • o: the observation
  • r: the reward for the step
  • info: extra debugging information, typically in an associative container like a NamedTuple

Some elements can be left out. For instance if o is left out of the return, the problem-writer can also implement observation and POMDPs.jl will automatically use it when needed.

Example

struct LQRMDP <: MDP{Float64, Float64} end
+reward(m::MDP, s, a, sp)

Return the immediate reward for the s-a-s' triple

reward(m::POMDP, s, a, sp, o)

Return the immediate reward for the s-a-s'-o quad

For some problems, it is easier to express reward(m, s, a, sp) or reward(m, s, a, sp, o), than reward(m, s, a), but some solvers, e.g. SARSOP, can only use reward(m, s, a). Both can be implemented for a problem, but when reward(m, s, a) is implemented, it should be consistent with reward(m, s, a, sp[, o]), that is, it should be the expected value over all destination states and observations.
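The consistency requirement can be illustrated with a small plain-Julia sketch. The two-state model, the table `T`, and the functions `reward_sasp` and `reward_sa` are hypothetical and do not call POMDPs.jl; the point is only that the (s, a) reward is the expectation of the (s, a, s') reward under the transition distribution.

```julia
# Hypothetical two-state problem; all names and numbers are illustrative.
# T[s][a] lists (sp, probability) pairs for the next state.
T = Dict(
    1 => Dict(:stay => [(1, 0.9), (2, 0.1)], :go => [(1, 0.2), (2, 0.8)]),
    2 => Dict(:stay => [(2, 1.0)],           :go => [(1, 0.5), (2, 0.5)]),
)

# Reward on the (s, a, sp) triple: +10 for arriving in state 2, -1 for moving.
reward_sasp(s, a, sp) = (sp == 2 ? 10.0 : 0.0) - (a == :go ? 1.0 : 0.0)

# Consistent (s, a) reward: expectation of the triple reward over next states.
reward_sa(s, a) = sum(p * reward_sasp(s, a, sp) for (sp, p) in T[s][a])
```

A solver that only accepts the two-argument form would then see the same expected return as one that uses the three-argument form.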

source
POMDPs.genFunction
gen(m::Union{MDP,POMDP}, s, a, rng::AbstractRNG)

Function for implementing the entire MDP/POMDP generative model by returning a NamedTuple.

gen should only be implemented in the case where two or more of the next state, observation, and reward need to be generated at the same time. If the state transition model can be separated from the reward and observation models, you should implement transition with an ImplicitDistribution instead of gen.

Solver and simulator writers should use the @gen macro to call a generative model.

Arguments

  • m: an MDP or POMDP model
  • s: the current state
  • a: the action
  • rng: a random number generator (Typically a MersenneTwister)

Return

The function should return a NamedTuple with a subset of the following entries:

MDP

  • sp: the next state
  • r: the reward for the step
  • info: extra debugging information, typically in an associative container like a NamedTuple

POMDP

  • sp: the next state
  • o: the observation
  • r: the reward for the step
  • info: extra debugging information, typically in an associative container like a NamedTuple

Some elements can be left out. For instance if o is left out of the return, the problem-writer can also implement observation and POMDPs.jl will automatically use it when needed.

Example

struct LQRMDP <: MDP{Float64, Float64} end
 
-POMDPs.gen(m::LQRMDP, s, a, rng) = (sp = s + a + randn(rng), r = -s^2 - a^2)
source
POMDPs.@genMacro
@gen(X)(m, s, a)
-@gen(X)(m, s, a, rng::AbstractRNG)

Call the generative model for a (PO)MDP m; Sample values from several nodes in the dynamic decision network. X is one or more symbols indicating which nodes to output.

Solvers and simulators should call this rather than the gen function. Problem writers should implement a method of the transition or gen function instead of altering @gen.

Arguments

  • m: an MDP or POMDP model
  • s: the current state
  • a: the action
  • rng (optional): a random number generator (Typically a MersenneTwister)

Return

If X is a symbol, return a value sampled from the corresponding node. If X is several symbols, return a Tuple of values sampled from the specified nodes.

Examples

Let m be an MDP or POMDP, s be a state of m, a be an action of m, and rng be an AbstractRNG.

  • @gen(:sp, :r)(m, s, a) returns a Tuple containing the next state and reward.
  • @gen(:sp, :o, :r)(m, s, a, rng) returns a Tuple containing the next state, observation, and reward.
  • @gen(:sp)(m, s, a, rng) returns the next state.
source

Static Properties

POMDPs.statesFunction
states(problem::POMDP)
-states(problem::MDP)

Returns the complete state space of a (PO)MDP.

source
POMDPs.actionsFunction
actions(m::Union{MDP,POMDP})

Returns the entire action space of a (PO)MDP.


actions(m::Union{MDP,POMDP}, s)

Return the actions that can be taken from state s.


actions(m::POMDP, b)

Return the actions that can be taken from belief b.

To implement an observation-dependent action space, use currentobs(b) to get the observation associated with belief b within the implementation of actions(m, b).

source
POMDPs.isterminalFunction
isterminal(m::Union{MDP,POMDP}, s)

Check if state s is terminal.

If a state is terminal, no actions will be taken in it and no additional rewards will be accumulated. Thus, the value function at such a state is, by definition, zero.

source
POMDPs.discountFunction
discount(m::POMDP)
-discount(m::MDP)

Return the discount factor for the problem.

source
POMDPs.initialstateFunction
initialstate(m::Union{POMDP,MDP})

Return a distribution of initial states for (PO)MDP m.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a model for sampling.

source
POMDPs.initialobsFunction
initialobs(m::POMDP, s)

Return a distribution of initial observations for POMDP m and state s.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a model for sampling.

This function is only used in cases where the policy expects an initial observation rather than an initial belief, e.g. in a reinforcement learning setting. It is not used in a standard POMDP simulation.

source
POMDPs.stateindexFunction
stateindex(problem::POMDP, s)
-stateindex(problem::MDP, s)

Return the integer index of state s. Used for discrete models only.

source
POMDPs.actionindexFunction
actionindex(problem::POMDP, a)
-actionindex(problem::MDP, a)

Return the integer index of action a. Used for discrete models only.

source
POMDPs.obsindexFunction
obsindex(problem::POMDP, o)

Return the integer index of observation o. Used for discrete models only.

source
POMDPs.convert_sFunction
convert_s(::Type{V}, s, problem::Union{MDP,POMDP}) where V<:AbstractArray
-convert_s(::Type{S}, vec::V, problem::Union{MDP,POMDP}) where {S,V<:AbstractArray}

Convert a state to vectorized form or vice versa.

source
POMDPs.convert_aFunction
convert_a(::Type{V}, a, problem::Union{MDP,POMDP}) where V<:AbstractArray
-convert_a(::Type{A}, vec::V, problem::Union{MDP,POMDP}) where {A,V<:AbstractArray}

Convert an action to vectorized form or vice versa.

source
POMDPs.convert_oFunction
convert_o(::Type{V}, o, problem::Union{MDP,POMDP}) where V<:AbstractArray
-convert_o(::Type{O}, vec::V, problem::Union{MDP,POMDP}) where {O,V<:AbstractArray}

Convert an observation to vectorized form or vice versa.

source

Type Inference

POMDPs.statetypeFunction
statetype(t::Type)
+POMDPs.gen(m::LQRMDP, s, a, rng) = (sp = s + a + randn(rng), r = -s^2 - a^2)
source
POMDPs.@genMacro
@gen(X)(m, s, a)
+@gen(X)(m, s, a, rng::AbstractRNG)

Call the generative model for a (PO)MDP m; Sample values from several nodes in the dynamic decision network. X is one or more symbols indicating which nodes to output.

Solvers and simulators should call this rather than the gen function. Problem writers should implement a method of the transition or gen function instead of altering @gen.

Arguments

  • m: an MDP or POMDP model
  • s: the current state
  • a: the action
  • rng (optional): a random number generator (Typically a MersenneTwister)

Return

If X is a symbol, return a value sampled from the corresponding node. If X is several symbols, return a Tuple of values sampled from the specified nodes.

Examples

Let m be an MDP or POMDP, s be a state of m, a be an action of m, and rng be an AbstractRNG.

  • @gen(:sp, :r)(m, s, a) returns a Tuple containing the next state and reward.
  • @gen(:sp, :o, :r)(m, s, a, rng) returns a Tuple containing the next state, observation, and reward.
  • @gen(:sp)(m, s, a, rng) returns the next state.
source

Static Properties

POMDPs.statesFunction
states(problem::POMDP)
+states(problem::MDP)

Returns the complete state space of a (PO)MDP.

source
POMDPs.actionsFunction
actions(m::Union{MDP,POMDP})

Returns the entire action space of a (PO)MDP.


actions(m::Union{MDP,POMDP}, s)

Return the actions that can be taken from state s.


actions(m::POMDP, b)

Return the actions that can be taken from belief b.

To implement an observation-dependent action space, use currentobs(b) to get the observation associated with belief b within the implementation of actions(m, b).

source
POMDPs.isterminalFunction
isterminal(m::Union{MDP,POMDP}, s)

Check if state s is terminal.

If a state is terminal, no actions will be taken in it and no additional rewards will be accumulated. Thus, the value function at such a state is, by definition, zero.

source
POMDPs.discountFunction
discount(m::POMDP)
+discount(m::MDP)

Return the discount factor for the problem.

source
POMDPs.initialstateFunction
initialstate(m::Union{POMDP,MDP})

Return a distribution of initial states for (PO)MDP m.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a model for sampling.

source
POMDPs.initialobsFunction
initialobs(m::POMDP, s)

Return a distribution of initial observations for POMDP m and state s.

If it is difficult to define the probability density or mass function explicitly, consider using POMDPTools.ImplicitDistribution to define a model for sampling.

This function is only used in cases where the policy expects an initial observation rather than an initial belief, e.g. in a reinforcement learning setting. It is not used in a standard POMDP simulation.

source
POMDPs.stateindexFunction
stateindex(problem::POMDP, s)
+stateindex(problem::MDP, s)

Return the integer index of state s. Used for discrete models only.
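As a rough illustration, an index function for a small grid could look like the plain-Julia sketch below. The grid size, `grid_states`, and `stateindex_sketch` are hypothetical names and the code does not extend `POMDPs.stateindex`. The sketch assumes, as is common practice, that the index should agree with the position of each state in the ordered state space.

```julia
# Hypothetical 3-by-2 grid world whose states are (x, y) tuples.
const NX = 3
const NY = 2

# Ordered state space: x varies fastest, matching column-major indexing.
grid_states = [(x, y) for y in 1:NY for x in 1:NX]

# Map a state to its integer index in `grid_states`.
stateindex_sketch(s::Tuple{Int,Int}) = LinearIndices((NX, NY))[s[1], s[2]]
```

With this layout, `stateindex_sketch((1, 2))` follows `stateindex_sketch((3, 1))`, so a solver that stores values in a vector can look them up by index.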

source
POMDPs.actionindexFunction
actionindex(problem::POMDP, a)
+actionindex(problem::MDP, a)

Return the integer index of action a. Used for discrete models only.

source
POMDPs.obsindexFunction
obsindex(problem::POMDP, o)

Return the integer index of observation o. Used for discrete models only.

source
POMDPs.convert_sFunction
convert_s(::Type{V}, s, problem::Union{MDP,POMDP}) where V<:AbstractArray
+convert_s(::Type{S}, vec::V, problem::Union{MDP,POMDP}) where {S,V<:AbstractArray}

Convert a state to vectorized form or vice versa.
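The two directions of the conversion can be sketched in plain Julia as follows. `CarState`, `to_vector`, and `from_vector` are hypothetical names that stand in for the two `convert_s` methods; they do not extend the POMDPs.jl function.

```julia
# Hypothetical structured state with two continuous components.
struct CarState
    position::Float64
    velocity::Float64
end

# State -> vector, the role played by convert_s(::Type{V}, s, problem).
to_vector(s::CarState) = Float64[s.position, s.velocity]

# Vector -> state, the role played by convert_s(::Type{S}, vec, problem).
from_vector(v::AbstractVector{<:Real}) = CarState(v[1], v[2])

s = CarState(1.5, -0.25)
v = to_vector(s)
s_roundtrip = from_vector(v)
```

A round trip through the vector form should reproduce the original state, which is a useful property to test when a solver relies on vectorized states (for example, as neural network input).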

source
POMDPs.convert_aFunction
convert_a(::Type{V}, a, problem::Union{MDP,POMDP}) where V<:AbstractArray
+convert_a(::Type{A}, vec::V, problem::Union{MDP,POMDP}) where {A,V<:AbstractArray}

Convert an action to vectorized form or vice versa.

source
POMDPs.convert_oFunction
convert_o(::Type{V}, o, problem::Union{MDP,POMDP}) where V<:AbstractArray
+convert_o(::Type{O}, vec::V, problem::Union{MDP,POMDP}) where {O,V<:AbstractArray}

Convert an observation to vectorized form or vice versa.

source

Type Inference

POMDPs.statetypeFunction
statetype(t::Type)
 statetype(p::Union{POMDP,MDP})

Return the state type for a problem type (the S in POMDP{S,A,O}).

struct A <: POMDP{Int, Bool, Bool} end
 
-statetype(A) # returns Int
source
POMDPs.actiontypeFunction
actiontype(t::Type)
 actiontype(p::Union{POMDP,MDP})

Return the action type for a problem type (the A in POMDP{S,A,O}).

struct A <: POMDP{Bool, Int, Bool} end
 
-actiontype(A) # returns Int
source
POMDPs.obstypeFunction
obstype(t::Type)

Return the observation type for a problem type (the O in POMDP{S,A,O}).

struct A <: POMDP{Bool, Bool, Int} end
+actiontype(A) # returns Int
source
POMDPs.obstypeFunction
obstype(t::Type)

Return the observation type for a problem type (the O in POMDP{S,A,O}).

struct A <: POMDP{Bool, Bool, Int} end
 
-obstype(A) # returns Int
source

Distributions and Spaces

Base.randFunction
rand(rng::AbstractRNG, d::Any)

Return a random element from distribution or space d.

If d is a state or transition distribution, the sample will be a state; if d is an action distribution, the sample will be an action; and if d is an observation distribution, the sample will be an observation.

source
Distributions.pdfFunction
pdf(d::Any, x::Any)

Evaluate the probability density of distribution d at sample x.

source
Distributions.supportFunction
support(d::Any)

Return an iterable object containing the possible values that can be sampled from distribution d. Values with zero probability may be skipped.

source

Belief Functions

POMDPs.updateFunction
update(updater::Updater, belief_old, action, observation)

Return a new instance of an updated belief given belief_old and the latest action and observation.

source

Distributions and Spaces

Base.randFunction
rand(rng::AbstractRNG, d::Any)

Return a random element from distribution or space d.

If d is a state or transition distribution, the sample will be a state; if d is an action distribution, the sample will be an action; and if d is an observation distribution, the sample will be an observation.

source
Distributions.pdfFunction
pdf(d::Any, x::Any)

Evaluate the probability density of distribution d at sample x.

source
Distributions.supportFunction
support(d::Any)

Return an iterable object containing the possible values that can be sampled from distribution d. Values with zero probability may be skipped.
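To make the roles of rand, pdf, and support concrete, here is a minimal plain-Julia sketch of a categorical distribution. `TinyCategorical`, `pdf_sketch`, and `support_sketch` are hypothetical names. Only `Base.rand` is actually extended; the other two are local stand-ins and do not extend Distributions.jl.

```julia
using Random

# Hypothetical categorical distribution over arbitrary values.
struct TinyCategorical{T}
    vals::Vector{T}
    probs::Vector{Float64}
end

# Inverse-CDF sampling; passing `rng` explicitly keeps sampling reproducible.
function Base.rand(rng::AbstractRNG, d::TinyCategorical)
    u = rand(rng)
    c = 0.0
    for (x, p) in zip(d.vals, d.probs)
        c += p
        u < c && return x
    end
    return last(d.vals)   # guard against floating-point round-off
end

# Probability mass of value x (zero if x is not among the values).
function pdf_sketch(d::TinyCategorical, x)
    total = 0.0
    for (v, p) in zip(d.vals, d.probs)
        v == x && (total += p)
    end
    return total
end

# Values that can be sampled; zero-probability values are skipped.
support_sketch(d::TinyCategorical) = [v for (v, p) in zip(d.vals, d.probs) if p > 0]

d = TinyCategorical([:left, :stay, :right], [0.1, 0.8, 0.1])
rng = MersenneTwister(1)
samples = [rand(rng, d) for _ in 1:1000]
```

Every sample drawn this way lies in the support, and the masses reported by `pdf_sketch` over the support sum to one.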

source

Belief Functions

POMDPs.updateFunction
update(updater::Updater, belief_old, action, observation)

Return a new instance of an updated belief given belief_old and the latest action and observation.

source
POMDPs.initialize_beliefFunction
initialize_belief(updater::Updater,
                      state_distribution::Any)
-initialize_belief(updater::Updater, belief::Any)

Returns a belief that can be updated using updater and that has a similar distribution to state_distribution or belief.

The conversion may be lossy. This function is also idempotent, i.e. there is a default implementation that passes the belief through when it is already the correct type: initialize_belief(updater::Updater, belief) = belief

source
POMDPs.historyFunction
history(b)

Return the action-observation history associated with belief b.

The history should be an AbstractVector, Tuple, (or similar object that supports indexing with end) full of NamedTuples with keys :a and :o, i.e. history(b)[end][:a] should be the last action taken leading up to b, and history(b)[end][:o] should be the last observation received.

It is acceptable to return only part of the history if that is all that is available, but it should always end with the current observation. For example, it would be acceptable to return a structure containing only the last three observations in a length 3 Vector{NamedTuple{(:o,),Tuple{O}}}.

source
POMDPs.currentobsFunction
currentobs(b)

Return the latest observation associated with belief b.

If a solver or updater implements history(b) for a belief type, currentobs has a default implementation.

source

Policy and Solver Functions

POMDPs.solveFunction
solve(solver::Solver, problem::POMDP)

Solves the POMDP using the method associated with solver, and returns a policy.

source
POMDPs.updaterFunction
updater(policy::Policy)

Returns a default Updater appropriate for a belief type that the policy can use.

source
POMDPs.actionFunction
action(policy::Policy, x)

Returns the action that the policy deems best for the current state or belief, x.

x is a generalized information state - can be a state in an MDP, a distribution in a POMDP, or another specialized policy-dependent representation of the information needed to choose an action.

source
POMDPs.valueFunction
value(p::Policy, s)
-value(p::Policy, s, a)

Returns the utility value from policy p given the state (or belief), or state-action (or belief-action) pair.

The state-action version is commonly referred to as the Q-value.

source

Simulator

POMDPs.simulateFunction
simulate(sim::Simulator, m::POMDP, p::Policy, u::Updater=updater(p), b0=initialstate(m), s0=rand(b0))
-simulate(sim::Simulator, m::MDP, p::Policy, s0=rand(initialstate(m)))

Run a simulation using the specified policy.

The return type is flexible and depends on the simulator. Simulations should adhere to the Simulation Standard.

source
+initialize_belief(updater::Updater, belief::Any)

Returns a belief that can be updated using updater and that has a similar distribution to state_distribution or belief.

The conversion may be lossy. This function is also idempotent, i.e. there is a default implementation that passes the belief through when it is already the correct type: initialize_belief(updater::Updater, belief) = belief

source
POMDPs.historyFunction
history(b)

Return the action-observation history associated with belief b.

The history should be an AbstractVector, Tuple, (or similar object that supports indexing with end) full of NamedTuples with keys :a and :o, i.e. history(b)[end][:a] should be the last action taken leading up to b, and history(b)[end][:o] should be the last observation received.

It is acceptable to return only part of the history if that is all that is available, but it should always end with the current observation. For example, it would be acceptable to return a structure containing only the last three observations in a length 3 Vector{NamedTuple{(:o,),Tuple{O}}}.
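A history-based belief with the indexing behavior described here might be sketched in plain Julia as below. `HistoryBelief`, `history_sketch`, `currentobs_sketch`, and `update_sketch` are hypothetical stand-ins and do not extend the POMDPs.jl functions. The action and observation symbols are made up for illustration.

```julia
# Hypothetical belief that simply stores the action-observation history.
struct HistoryBelief{H<:AbstractVector}
    hist::H
end

# Stand-in for history(b): a vector of NamedTuples with keys :a and :o.
history_sketch(b::HistoryBelief) = b.hist

# Stand-in for the default currentobs(b): the last observation in the history.
currentobs_sketch(b) = history_sketch(b)[end][:o]

# Updating appends the newest action-observation pair.
update_sketch(b::HistoryBelief, a, o) = HistoryBelief(vcat(b.hist, [(a = a, o = o)]))

b0 = HistoryBelief([(a = :listen, o = :left)])
b1 = update_sketch(b0, :open, :right)
```

Because the history supports indexing with `end`, the latest observation can be recovered without any extra bookkeeping.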

source
POMDPs.currentobsFunction
currentobs(b)

Return the latest observation associated with belief b.

If a solver or updater implements history(b) for a belief type, currentobs has a default implementation.

source

Policy and Solver Functions

POMDPs.solveFunction
solve(solver::Solver, problem::POMDP)

Solves the POMDP using the method associated with solver, and returns a policy.

source
POMDPs.updaterFunction
updater(policy::Policy)

Returns a default Updater appropriate for a belief type that the policy can use.

source
POMDPs.actionFunction
action(policy::Policy, x)

Returns the action that the policy deems best for the current state or belief, x.

x is a generalized information state - can be a state in an MDP, a distribution in a POMDP, or another specialized policy-dependent representation of the information needed to choose an action.

source
POMDPs.valueFunction
value(p::Policy, s)
+value(p::Policy, s, a)

Returns the utility value from policy p given the state (or belief), or state-action (or belief-action) pair.

The state-action version is commonly referred to as the Q-value.

source

Simulator

POMDPs.SimulatorType

Base type for an object defining how simulations should be carried out.

source
POMDPs.simulateFunction
simulate(sim::Simulator, m::POMDP, p::Policy, u::Updater=updater(p), b0=initialstate(m), s0=rand(b0))
+simulate(sim::Simulator, m::MDP, p::Policy, s0=rand(initialstate(m)))

Run a simulation using the specified policy.

The return type is flexible and depends on the simulator. Simulations should adhere to the Simulation Standard.

source
diff --git a/dev/concepts/index.html b/dev/concepts/index.html
index 4282e097..81a06874 100644
--- a/dev/concepts/index.html
+++ b/dev/concepts/index.html
@@ -1,2 +1,2 @@
-Concepts and Architecture · POMDPs.jl

Concepts and Architecture

POMDPs.jl aims to coordinate the development of three software components: 1) a problem, 2) a solver, 3) an experiment. Each of these components has a set of abstract types associated with it and a set of functions that allow a user to define each component's behavior in a standardized way. An outline of the architecture is shown below.

[Figure: concepts]

The MDP and POMDP types are associated with the problem definition. The Solver and Policy types are associated with the solver or decision-making agent. Typically, the Updater type is also associated with the solver, but a solver may sometimes be used with an updater that was implemented separately. The Simulator type is associated with the experiment.

The code components of the POMDPs.jl ecosystem relevant to problems and solvers are shown below. The arrows represent the flow of information from the problems to the solvers. The figure shows the two interfaces that form POMDPs.jl - Explicit and Generative. Details about these interfaces can be found in the section on Defining POMDPs.

[Figure: interface_relationships]

POMDPs and MDPs

An MDP is a mathematical framework for sequential decision making under uncertainty in which all of the uncertainty arises from outcomes that are partially random and partially under the control of a decision maker. Mathematically, an MDP is a tuple $(S,A,T,R,\gamma)$, where $S$ is the state space, $A$ is the action space, $T$ is a transition function defining the probability of transitioning to each state given the state and action at the previous time, and $R$ is a reward function mapping every possible transition $(s,a,s')$ to a real reward value. Finally, $\gamma$ is a discount factor that defines the relative weighting of current and future rewards. For more information see a textbook such as [1]. In POMDPs.jl an MDP is represented by a concrete subtype of the MDP abstract type and a set of methods that define each of its components as described in the problem definition section.

A POMDP is a more general sequential decision making problem in which the agent is not sure what state they are in. The state is only partially observable by the decision making agent. Mathematically, a POMDP is a tuple $(S,A,T,R,O,Z,\gamma)$ where $S$, $A$, $T$, $R$, and $\gamma$ have the same meaning as in an MDP, $O$ is the agent's observation space, and $Z$ defines the probability of receiving each observation at a transition. In POMDPs.jl, a POMDP is represented by a concrete subtype of the POMDP abstract type, and the methods described in the problem definition section.

POMDPs.jl contains additional functions for defining optional problem behavior such as an initial state distribution or terminal states. More information can be found in the Defining POMDPs section.

Beliefs and Updaters

In a POMDP domain, the decision-making agent does not have complete information about the state of the problem, so the agent can only make choices based on its "belief" about the state. In the POMDP literature, the term "belief" is typically defined to mean a probability distribution over all possible states of the system. However, in practice, the agent often makes decisions based on an incomplete or lossy record of past observations that has a structure much different from a probability distribution. For example, if the agent is represented by a finite-state controller, as is the case for Monte-Carlo Value Iteration [2], the belief is the controller state, which is a node in a graph. Another example is an agent represented by a recurrent neural network. In this case, the agent's belief is the state of the network. In order to accommodate a wide variety of decision-making approaches in POMDPs.jl, we use the term "belief" to denote the set of information that the agent makes a decision on, which could be an exact state distribution, an action-observation history, a set of weighted particles, or the examples mentioned before. In code, the belief can be represented by any built-in or user-defined type.

When an action is taken and a new observation is received, the belief is updated by the belief updater. In code, a belief updater is represented by a concrete subtype of the Updater abstract type, and the update(updater, belief, action, observation) function defines how the belief is updated when a new observation is received.
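For the common case where the belief is an exact state distribution, the update step can be sketched as a discrete Bayes filter in plain Julia. The two-state model, the matrices `T` and `Z`, and `update_sketch` are hypothetical and do not use the `Updater` type; a single fixed action is assumed so that the transition model is one matrix.

```julia
using LinearAlgebra

# Hypothetical two-state model for one fixed action.
# T[s, sp]: probability of moving from state s to state sp.
T = [0.9 0.1;
     0.2 0.8]
# Z[sp, o]: probability of observing o after arriving in state sp.
Z = [0.85 0.15;
     0.15 0.85]

# Bayes filter: predict with T, weight by the observation likelihood, normalize.
function update_sketch(b::Vector{Float64}, o::Int)
    predicted = T' * b                      # predicted[sp] = sum_s T[s, sp] * b[s]
    unnormalized = Z[:, o] .* predicted
    total = sum(unnormalized)
    total > 0 || error("observation has zero probability under this belief")
    return unnormalized ./ total
end

b0 = [0.5, 0.5]
b1 = update_sketch(b0, 1)
```

Receiving observation 1, which is more likely in state 1, shifts probability mass toward state 1 while the belief remains a valid distribution.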

Although the agent may use a specialized belief structure to make decisions, the information initially given to the agent about the state of the problem is usually most conveniently represented as a state distribution. The initialize_belief function is therefore provided to convert a state distribution to a specialized belief structure that an updater can work with.

In many cases, the belief structure is closely related to the solution technique, so it will be implemented by the programmer who writes the solver. In other cases, the agent can use a variety of belief structures to make decisions, so a domain-specific updater implemented by the programmer that wrote the problem description may be appropriate. Finally, some advanced generic belief updaters such as particle filters may be implemented by a third party. The convenience function updater(policy) can be used to get a suitable default updater for a policy, however many policies can work with other updaters.

For more information on implementing a belief updater, see Defining a Belief Updater.

Solvers and Policies

Sequential decision making under uncertainty involves both online and offline calculations. In the broad sense, the term "solver" as used in the node in the figure at the top of the page refers to the software package that performs the calculations at both of these times. However, the code is broken up into two pieces, the solver that performs calculations offline and the policy that performs calculations online.

In the abstract, a policy is a mapping from every belief that an agent might hold to an action. A policy is represented in code by a concrete subtype of the Policy abstract type. The programmer implements action to describe what computations need to be done online. For an online solver such as POMCP, all of the decision computation occurs within action, while for an offline solver like SARSOP, there is very little computation within action. See Interacting with Policies for more information.

The offline portion of the computation is carried out by the solver, which is represented by a concrete subtype of the Solver abstract type. Computations occur within the solve function. For an offline solver like SARSOP, nearly all of the decision computation occurs within this function, but for some online solvers such as POMCP, solve merely embeds the problem in the policy.

Simulators

A simulator defines a way to run one or more simulations. It is represented by a concrete subtype of the Simulator abstract type, and the simulation itself is implemented as a method of simulate. Depending on the simulator, simulate may return a variety of data about the simulation, such as the discounted reward or the state history. All simulators should perform simulations consistent with the Simulation Standard.

[1] Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer, MIT Press, 2015

[2] Bai, H., Hsu, D., & Lee, W. S. (2014). Integrated perception and planning in the continuous space: A POMDP approach. The International Journal of Robotics Research, 33(9), 1288-1302

+Concepts and Architecture · POMDPs.jl

Concepts and Architecture

POMDPs.jl aims to coordinate the development of three software components: 1) a problem, 2) a solver, 3) an experiment. Each of these components has a set of abstract types associated with it and a set of functions that allow a user to define each component's behavior in a standardized way. An outline of the architecture is shown below.

concepts

The MDP and POMDP types are associated with the problem definition. The Solver and Policy types are associated with the solver or decision-making agent. Typically, the Updater type is also associated with the solver, but a solver may sometimes be used with an updater that was implemented separately. The Simulator type is associated with the experiment.

The code components of the POMDPs.jl ecosystem relevant to problems and solvers are shown below. The arrows represent the flow of information from the problems to the solvers. The figure shows the two interfaces that form POMDPs.jl - Explicit and Generative. Details about these interfaces can be found in the section on Defining POMDPs.

interface_relationships

POMDPs and MDPs

An MDP is a mathematical framework for sequential decision making under uncertainty in which all of the uncertainty arises from outcomes that are partially random and partially under the control of a decision maker. Mathematically, an MDP is a tuple $(S,A,T,R,\gamma)$, where $S$ is the state space, $A$ is the action space, $T$ is a transition function defining the probability of transitioning to each state given the state and action at the previous time, and $R$ is a reward function mapping every possible transition $(s,a,s')$ to a real reward value. Finally, $\gamma$ is a discount factor that defines the relative weighting of current and future rewards. For more information, see a textbook such as [1]. In POMDPs.jl, an MDP is represented by a concrete subtype of the MDP abstract type and a set of methods that define each of its components as described in the problem definition section.
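To make the role of the discount factor concrete, the standard infinite-horizon objective can be written as below. This is the usual textbook formulation (see [1]) rather than anything specific to POMDPs.jl; the time index $t$ and the policy symbol $\pi$ are notation introduced here for illustration.

$$
\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right], \qquad a_t = \pi(s_t), \quad s_{t+1} \sim T(\cdot \mid s_t, a_t)
$$

With $\gamma$ close to 1, rewards far in the future count almost as much as immediate ones; with $\gamma$ close to 0, the agent is nearly myopic.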

A POMDP is a more general sequential decision making problem in which the agent is not sure what state it is in; the state is only partially observable by the decision-making agent. Mathematically, a POMDP is a tuple $(S,A,T,R,O,Z,\gamma)$ where $S$, $A$, $T$, $R$, and $\gamma$ have the same meaning as in an MDP, $O$ is the agent's observation space, and $Z$ defines the probability of receiving each observation at a transition. In POMDPs.jl, a POMDP is represented by a concrete subtype of the POMDP abstract type, and the methods described in the problem definition section.

POMDPs.jl contains additional functions for defining optional problem behavior such as an initial state distribution or terminal states. More information can be found in the Defining POMDPs section.

Beliefs and Updaters

In a POMDP domain, the decision-making agent does not have complete information about the state of the problem, so the agent can only make choices based on its "belief" about the state. In the POMDP literature, the term "belief" is typically defined to mean a probability distribution over all possible states of the system. However, in practice, the agent often makes decisions based on an incomplete or lossy record of past observations that has a structure much different from a probability distribution. For example, if the agent is represented by a finite-state controller, as is the case for Monte-Carlo Value Iteration [2], the belief is the controller state, which is a node in a graph. Another example is an agent represented by a recurrent neural network. In this case, the agent's belief is the state of the network. In order to accommodate a wide variety of decision-making approaches in POMDPs.jl, we use the term "belief" to denote the set of information that the agent makes a decision on, which could be an exact state distribution, an action-observation history, a set of weighted particles, or the examples mentioned before. In code, the belief can be represented by any built-in or user-defined type.

When an action is taken and a new observation is received, the belief is updated by the belief updater. In code, a belief updater is represented by a concrete subtype of the Updater abstract type, and the update(updater, belief, action, observation) function defines how the belief is updated when a new observation is received.

Although the agent may use a specialized belief structure to make decisions, the information initially given to the agent about the state of the problem is usually most conveniently represented as a state distribution, thus the initialize_belief function is provided to convert a state distribution to a specialized belief structure that an updater can work with.
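As a minimal sketch of the two functions described above, the following defines an updater whose "belief" is just the last few observations. The type name LastKObsUpdater and its field k are hypothetical and chosen only for illustration; the sketch assumes the POMDPs package is installed and relies only on the Updater type and the update and initialize_belief functions named in the text.

```julia
using POMDPs

# Hypothetical updater: the "belief" is simply the last `k` observations.
struct LastKObsUpdater <: Updater
    k::Int
end

# Convert the initial state distribution `d` into this updater's belief
# structure. This sketch ignores `d` and starts from an empty history.
POMDPs.initialize_belief(::LastKObsUpdater, d) = Any[]

# Return a new belief: append the observation and keep the last `k` entries.
function POMDPs.update(up::LastKObsUpdater, b, a, o)
    history = push!(copy(b), o)
    return history[max(1, end - up.k + 1):end]
end

# Usage: three updates with a window of two observations.
up = LastKObsUpdater(2)
b = initialize_belief(up, nothing)
b = update(up, b, :feed, :quiet)
b = update(up, b, :feed, :crying)
b = update(up, b, :feed, :quiet)
```

A belief like this is far from a probability distribution, which is exactly the flexibility the broader definition of "belief" is meant to allow.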

In many cases, the belief structure is closely related to the solution technique, so it will be implemented by the programmer who writes the solver. In other cases, the agent can use a variety of belief structures to make decisions, so a domain-specific updater implemented by the programmer that wrote the problem description may be appropriate. Finally, some advanced generic belief updaters such as particle filters may be implemented by a third party. The convenience function updater(policy) can be used to get a suitable default updater for a policy, however many policies can work with other updaters.

For more information on implementing a belief updater, see Defining a Belief Updater.

Solvers and Policies

Sequential decision making under uncertainty involves both online and offline calculations. In the broad sense, the term "solver" as used in the node in the figure at the top of the page refers to the software package that performs the calculations at both of these times. However, the code is broken up into two pieces, the solver that performs calculations offline and the policy that performs calculations online.

In the abstract, a policy is a mapping from every belief that an agent might hold to an action. A policy is represented in code by a concrete subtype of the Policy abstract type. The programmer implements action to describe what computations need to be done online. For an online solver such as POMCP, all of the decision computation occurs within action, while for an offline solver like SARSOP, there is very little computation within action. See Interacting with Policies for more information.

The offline portion of the computation is carried out by the solver, which is represented by a concrete subtype of the Solver abstract type. Computations occur within the solve function. For an offline solver like SARSOP, nearly all of the decision computation occurs within this function, but for some online solvers such as POMCP, solve merely embeds the problem in the policy.

Simulators

A simulator defines a way to run one or more simulations. It is represented by a concrete subtype of the Simulator abstract type, and the simulation itself is implemented as a method of simulate. Depending on the simulator, simulate may return a variety of data about the simulation, such as the discounted reward or the state history. All simulators should perform simulations consistent with the Simulation Standard.

[1] Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer, MIT Press, 2015

[2] Bai, H., Hsu, D., & Lee, W. S. (2014). Integrated perception and planning in the continuous space: A POMDP approach. The International Journal of Robotics Research, 33(9), 1288-1302

diff --git a/dev/def_pomdp/index.html b/dev/def_pomdp/index.html index 025a17fa..ced9a70b 100644 --- a/dev/def_pomdp/index.html +++ b/dev/def_pomdp/index.html @@ -196,4 +196,4 @@ R = [-1. -100. 10.; -1. 10. -100.] -m = TabularPOMDP(T, R, O, 0.95)

Here T is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with T[sp, a, s] $= T(s' | s, a)$. Similarly, O is an $|O| \times |A| \times |S|$ array encoding the observation distribution with O[o, a, sp] $= Z(o | a, s')$, and R is a $|S| \times |A|$ matrix that encodes the reward function. 0.95 is the discount factor.

+m = TabularPOMDP(T, R, O, 0.95)

Here T is a $|S| \times |A| \times |S|$ array representing the transition probabilities, with T[sp, a, s] $= T(s' | s, a)$. Similarly, O is an $|O| \times |A| \times |S|$ array encoding the observation distribution with O[o, a, sp] $= Z(o | a, s')$, and R is a $|S| \times |A|$ matrix that encodes the reward function. 0.95 is the discount factor.
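To illustrate how these index conventions are used, here is a sketch of an exact Bayesian belief update written against arrays laid out as T[sp, a, s] and O[o, a, sp]. The function name bayes_update and the small example arrays are hypothetical; this is plain Julia and is not the updater shipped with POMDPTools.

```julia
# Exact discrete belief update for arrays laid out as in the text:
# T[sp, a, s] = T(s' | s, a) and O[o, a, sp] = Z(o | a, s').
function bayes_update(b::AbstractVector, T::AbstractArray{<:Real,3},
                      O::AbstractArray{<:Real,3}, a::Int, o::Int)
    # Prediction step: sum over s of T(s' | s, a) * b(s).
    predicted = T[:, a, :] * b
    # Correction step: weight each s' by the observation likelihood.
    unnormalized = O[o, a, :] .* predicted
    total = sum(unnormalized)
    total > 0 || error("observation has zero probability under this belief")
    return unnormalized ./ total
end

# Hypothetical problem with 2 states, 1 action, and 2 observations.
T = zeros(2, 1, 2)
T[:, 1, 1] = [0.9, 0.1]   # from state 1
T[:, 1, 2] = [0.0, 1.0]   # from state 2
O = zeros(2, 1, 2)
O[:, 1, 1] = [0.8, 0.2]   # observation distribution when s' = 1
O[:, 1, 2] = [0.1, 0.9]   # observation distribution when s' = 2

b_new = bayes_update([1.0, 0.0], T, O, 1, 1)
```

Keeping the next state as the first index of T means that T[:, a, :] is already the matrix needed for the prediction step.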

diff --git a/dev/def_solver/index.html b/dev/def_solver/index.html index 91575a93..4d0c36c6 100644 --- a/dev/def_solver/index.html +++ b/dev/def_solver/index.html @@ -1,2 +1,2 @@ -Solvers · POMDPs.jl

Solvers

Defining a solver involves creating or using four pieces of code:

  1. A subtype of Solver that holds the parameters and configuration options for the solver.
  2. A subtype of Policy that holds all of the data needed to choose actions online.
  3. A method of solve that takes the Solver and a (PO)MDP as arguments, performs all of the offline computations for solving the problem, and returns the policy.
  4. A method of action that takes in the policy and a state or belief and returns an action.

In many cases, items 2 and 4 can be satisfied with an off-the-shelf Policy from the POMDPTools package. POMDPTools also contains many tools that are useful for defining solvers in a robust, concise, and readable manner.

Online and Offline Solvers

Generally, solvers can be grouped into two categories: Offline solvers that do most of their computational work before interacting with the environment, and online solvers that do their work online as each new state or observation is encountered. Although offline and online solvers both use the exact same Solver, solve, Policy, action structure, the work of defining online and offline solvers is focused on different portions.

For an offline solver, most of the implementation effort will be spent on the solve function, and an off-the-shelf policy from POMDPTools will typically be used.

For an online solver, the solve function typically does little or no work, but merely creates a Policy object that will carry out computation online. It is typical in POMDPs.jl to use the term "Planner" to name a Policy object for an online solver that carries out a large amount of computation ("planning") at interaction time. In this case most of the effort will be focused on implementing the action method for the "Planner" Policy type.

Examples

Solver implementation is most clearly explained through examples. The following sections contain examples of both online and offline solver definitions:

+Solvers · POMDPs.jl

Solvers

Defining a solver involves creating or using four pieces of code:

  1. A subtype of Solver that holds the parameters and configuration options for the solver.
  2. A subtype of Policy that holds all of the data needed to choose actions online.
  3. A method of solve that takes the Solver and a (PO)MDP as arguments, performs all of the offline computations for solving the problem, and returns the policy.
  4. A method of action that takes in the policy and a state or belief and returns an action.

In many cases, items 2 and 4 can be satisfied with an off-the-shelf Policy from the POMDPTools package. POMDPTools also contains many tools that are useful for defining solvers in a robust, concise, and readable manner.
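The four pieces listed above can be sketched for a deliberately simple offline solver that picks, for every state, the action with the largest immediate reward. TinyMDP, GreedySolver, and GreedyPolicy are hypothetical names introduced for illustration, and the greedy rule is a stand-in for a real solution method; the sketch assumes only that the POMDPs package is installed.

```julia
using POMDPs

# Hypothetical two-state, two-action problem used only for illustration.
struct TinyMDP <: MDP{Int, Symbol} end
POMDPs.states(::TinyMDP) = 1:2
POMDPs.actions(::TinyMDP) = (:stay, :switch)
function POMDPs.reward(::TinyMDP, s, a)
    good = (s == 1 && a == :switch) || (s == 2 && a == :stay)
    return good ? 1.0 : 0.0
end

# 1. A Solver subtype holding configuration (none is needed here).
struct GreedySolver <: Solver end

# 2. A Policy subtype holding the data needed to choose actions online.
struct GreedyPolicy{S, A} <: Policy
    best::Dict{S, A}
end

# 3. solve performs all of the offline work and returns the policy.
function POMDPs.solve(::GreedySolver, m::MDP)
    best = Dict(s => argmax(a -> reward(m, s, a), actions(m)) for s in states(m))
    return GreedyPolicy(best)
end

# 4. action is a cheap table lookup at interaction time.
POMDPs.action(p::GreedyPolicy, s) = p.best[s]

policy = solve(GreedySolver(), TinyMDP())
```

All of the computation happens in solve, so calls such as action(policy, 1) do no planning of their own.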

Online and Offline Solvers

Generally, solvers can be grouped into two categories: Offline solvers that do most of their computational work before interacting with the environment, and online solvers that do their work online as each new state or observation is encountered. Although offline and online solvers both use the exact same Solver, solve, Policy, action structure, the work of defining online and offline solvers is focused on different portions.

For an offline solver, most of the implementation effort will be spent on the solve function, and an off-the-shelf policy from POMDPTools will typically be used.

For an online solver, the solve function typically does little or no work, but merely creates a Policy object that will carry out computation online. It is typical in POMDPs.jl to use the term "Planner" to name a Policy object for an online solver that carries out a large amount of computation ("planning") at interaction time. In this case most of the effort will be focused on implementing the action method for the "Planner" Policy type.
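The online pattern can be sketched in the same spirit: solve only stores the problem in a planner object, and the decision is computed inside action each time it is called. OneStepSolver, OneStepPlanner, and TinyMDP are hypothetical names, and one-step reward maximization stands in for real planning such as tree search; the sketch assumes only that the POMDPs package is installed.

```julia
using POMDPs

# Hypothetical two-state, two-action problem used only for illustration.
struct TinyMDP <: MDP{Int, Symbol} end
POMDPs.actions(::TinyMDP) = (:stay, :switch)
function POMDPs.reward(::TinyMDP, s, a)
    good = (s == 1 && a == :switch) || (s == 2 && a == :stay)
    return good ? 1.0 : 0.0
end

struct OneStepSolver <: Solver end

# The "Planner" is a Policy that keeps a reference to the problem.
struct OneStepPlanner{M<:MDP} <: Policy
    mdp::M
end

# solve does no real work: it merely embeds the problem in the planner.
POMDPs.solve(::OneStepSolver, m::MDP) = OneStepPlanner(m)

# All decision computation happens here, at interaction time.
function POMDPs.action(p::OneStepPlanner, s)
    return argmax(a -> reward(p.mdp, s, a), actions(p.mdp))
end

planner = solve(OneStepSolver(), TinyMDP())
```

Compared with an offline solver, the cost moves from solve to action, which is why most of the implementation effort for an online solver goes into the action method.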

Examples

Solver implementation is most clearly explained through examples. The following sections contain examples of both online and offline solver definitions:

diff --git a/dev/def_updater/index.html b/dev/def_updater/index.html index 200de762..0e4134f3 100644 --- a/dev/def_updater/index.html +++ b/dev/def_updater/index.html @@ -29,4 +29,4 @@ b = Any[POMDPModels.BoolDistribution(0.0), false, false] b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false] b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false] -b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false, true, false] +b = Any[POMDPModels.BoolDistribution(0.0), false, false, false, false, true, false, true, false] diff --git a/dev/example_defining_problems/index.html b/dev/example_defining_problems/index.html index f71ee4f9..7e71aa2f 100644 --- a/dev/example_defining_problems/index.html +++ b/dev/example_defining_problems/index.html @@ -247,4 +247,4 @@ discount = 0.9 -tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) +tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) diff --git a/dev/example_gridworld_mdp/index.html b/dev/example_gridworld_mdp/index.html index ebdf36ba..16d2694b 100644 --- a/dev/example_gridworld_mdp/index.html +++ b/dev/example_gridworld_mdp/index.html @@ -48,10 +48,10 @@ Size x: 10 Size y: 10 Reward states: - Main.GridWorldState(8, 8) => 3.0 Main.GridWorldState(4, 3) => -10.0 - Main.GridWorldState(4, 6) => -5.0 Main.GridWorldState(9, 3) => 10.0 + Main.GridWorldState(4, 6) => -5.0 + Main.GridWorldState(8, 8) => 3.0 Hit wall reward: -1.0 Transition probability: 0.7 Discount: 0.9 @@ -239,30 +239,30 @@ solver = ValueIterationSolver(; max_iterations=100, belres=1e-3, verbose=true) # Solve for an optimal policy -vi_policy = POMDPs.solve(solver, mdp)
[Iteration 1   ] residual:         10 | iteration runtime:      0.227 ms, (  0.000227 s total)
-[Iteration 2   ] residual:        6.3 | iteration runtime:      0.232 ms, (  0.000459 s total)
-[Iteration 3   ] residual:       4.53 | iteration runtime:      0.214 ms, (  0.000673 s total)
-[Iteration 4   ] residual:       3.21 | iteration runtime:      0.211 ms, (  0.000885 s total)
-[Iteration 5   ] residual:       2.31 | iteration runtime:      0.213 ms, (    0.0011 s total)
-[Iteration 6   ] residual:       1.62 | iteration runtime:      0.212 ms, (   0.00131 s total)
-[Iteration 7   ] residual:       1.24 | iteration runtime:      0.211 ms, (   0.00152 s total)
-[Iteration 8   ] residual:       1.06 | iteration runtime:      0.206 ms, (   0.00173 s total)
-[Iteration 9   ] residual:      0.865 | iteration runtime:      0.209 ms, (   0.00194 s total)
-[Iteration 10  ] residual:      0.657 | iteration runtime:      0.206 ms, (   0.00214 s total)
-[Iteration 11  ] residual:      0.545 | iteration runtime:      0.217 ms, (   0.00236 s total)
-[Iteration 12  ] residual:      0.455 | iteration runtime:      0.209 ms, (   0.00257 s total)
-[Iteration 13  ] residual:      0.378 | iteration runtime:      0.208 ms, (   0.00278 s total)
-[Iteration 14  ] residual:      0.306 | iteration runtime:      0.208 ms, (   0.00298 s total)
-[Iteration 15  ] residual:      0.211 | iteration runtime:      0.209 ms, (   0.00319 s total)
-[Iteration 16  ] residual:      0.132 | iteration runtime:      0.210 ms, (    0.0034 s total)
-[Iteration 17  ] residual:     0.0778 | iteration runtime:      0.208 ms, (   0.00361 s total)
-[Iteration 18  ] residual:     0.0437 | iteration runtime:      0.209 ms, (   0.00382 s total)
-[Iteration 19  ] residual:     0.0237 | iteration runtime:      0.208 ms, (   0.00403 s total)
-[Iteration 20  ] residual:     0.0125 | iteration runtime:      0.233 ms, (   0.00426 s total)
-[Iteration 21  ] residual:    0.00649 | iteration runtime:      0.210 ms, (   0.00447 s total)
-[Iteration 22  ] residual:    0.00332 | iteration runtime:      0.208 ms, (   0.00468 s total)
-[Iteration 23  ] residual:    0.00167 | iteration runtime:      0.210 ms, (   0.00489 s total)
-[Iteration 24  ] residual:   0.000834 | iteration runtime:      0.210 ms, (    0.0051 s total)

We can now use the policy to compute the optimal action for a given state:

s = GridWorldState(9, 2)
+vi_policy = POMDPs.solve(solver, mdp)
[Iteration 1   ] residual:         10 | iteration runtime:      0.215 ms, (  0.000215 s total)
+[Iteration 2   ] residual:        6.3 | iteration runtime:      0.244 ms, (  0.000459 s total)
+[Iteration 3   ] residual:       4.53 | iteration runtime:      0.222 ms, (  0.000681 s total)
+[Iteration 4   ] residual:       3.21 | iteration runtime:      0.218 ms, (  0.000899 s total)
+[Iteration 5   ] residual:       2.31 | iteration runtime:      0.228 ms, (   0.00113 s total)
+[Iteration 6   ] residual:       1.62 | iteration runtime:      0.318 ms, (   0.00144 s total)
+[Iteration 7   ] residual:       1.24 | iteration runtime:      0.249 ms, (   0.00169 s total)
+[Iteration 8   ] residual:       1.06 | iteration runtime:      0.260 ms, (   0.00195 s total)
+[Iteration 9   ] residual:      0.865 | iteration runtime:      0.230 ms, (   0.00218 s total)
+[Iteration 10  ] residual:      0.657 | iteration runtime:      0.232 ms, (   0.00242 s total)
+[Iteration 11  ] residual:      0.545 | iteration runtime:      0.232 ms, (   0.00265 s total)
+[Iteration 12  ] residual:      0.455 | iteration runtime:      0.212 ms, (   0.00286 s total)
+[Iteration 13  ] residual:      0.378 | iteration runtime:      0.214 ms, (   0.00307 s total)
+[Iteration 14  ] residual:      0.306 | iteration runtime:      0.211 ms, (   0.00329 s total)
+[Iteration 15  ] residual:      0.211 | iteration runtime:      0.209 ms, (   0.00349 s total)
+[Iteration 16  ] residual:      0.132 | iteration runtime:      0.208 ms, (    0.0037 s total)
+[Iteration 17  ] residual:     0.0778 | iteration runtime:      0.212 ms, (   0.00391 s total)
+[Iteration 18  ] residual:     0.0437 | iteration runtime:      0.211 ms, (   0.00413 s total)
+[Iteration 19  ] residual:     0.0237 | iteration runtime:      0.209 ms, (   0.00433 s total)
+[Iteration 20  ] residual:     0.0125 | iteration runtime:      0.210 ms, (   0.00454 s total)
+[Iteration 21  ] residual:    0.00649 | iteration runtime:      0.212 ms, (   0.00476 s total)
+[Iteration 22  ] residual:    0.00332 | iteration runtime:      0.212 ms, (   0.00497 s total)
+[Iteration 23  ] residual:    0.00167 | iteration runtime:      0.215 ms, (   0.00518 s total)
+[Iteration 24  ] residual:   0.000834 | iteration runtime:      0.211 ms, (    0.0054 s total)
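The residual column in the log above shrinks until it drops below the belres threshold of 1e-3. A minimal sketch of that stopping rule is given below, assuming that the residual is the largest change in any state's value during one sweep. The function value_iteration and its array arguments are hypothetical and use the tabular layout T[sp, a, s] and R[s, a]; this is not the ValueIterationSolver implementation.

```julia
# Value iteration with a residual-based stopping rule.
# T[sp, a, s] = T(s' | s, a); R[s, a] is the reward for action a in state s.
function value_iteration(T, R, discount; belres=1e-3, max_iterations=100)
    n_states, n_actions = size(R)
    V = zeros(n_states)
    for iteration in 1:max_iterations
        residual = 0.0
        for s in 1:n_states
            # Bellman backup: best one-step lookahead value from state s.
            q = a -> R[s, a] + discount * sum(T[sp, a, s] * V[sp] for sp in 1:n_states)
            backup = maximum(q, 1:n_actions)
            # Track the largest value change seen in this sweep.
            residual = max(residual, abs(backup - V[s]))
            V[s] = backup
        end
        # Stop once no state's value moved by more than the tolerance.
        residual < belres && return V, iteration
    end
    return V, max_iterations
end

# One state, one action, reward 1, discount 0.5: the true value is 2.
V, iterations = value_iteration(ones(1, 1, 1), ones(1, 1), 0.5)
```

In this toy problem the residual halves on every sweep, which mirrors the steady decay visible in the log.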

We can now use the policy to compute the optimal action for a given state:

s = GridWorldState(9, 2)
 @show action(vi_policy, s)
:up
s = GridWorldState(8, 3)
 @show action(vi_policy, s)
:right

Solving the Grid World MDP (MCTS)

Similar to the process with Value Iteration, we can solve the MDP using MCTS. We will use the MCTSSolver from the MCTS package.

# Initialize the problem (we have already done this, but just calling it again for completeness in the example)
 mdp = GridWorldMDP()
@@ -367,4 +367,4 @@
  2 | →  →  →  →  →  →  →  →  ↑  ↑ |
  1 | →  →  →  →  →  →  ↑  ↑  ↑  ↑ |
    ------------------------------
-    1  2  3  4  5  6  7  8  9  10

Seeing a Policy In Action

Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use POMDPGifs. To use POMDPGifs, we need to extend the POMDPTools.render function to GridWorldMDP. Please reference Gallery of POMDPs.jl Problems for examples of this process.

+    1  2  3  4  5  6  7  8  9  10

Seeing a Policy In Action

Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use POMDPGifs. To use POMDPGifs, we need to extend the POMDPTools.render function to GridWorldMDP. Please reference Gallery of POMDPs.jl Problems for examples of this process.

diff --git a/dev/example_simulations/index.html b/dev/example_simulations/index.html index 5f54059b..bc67cb19 100644 --- a/dev/example_simulations/index.html +++ b/dev/example_simulations/index.html @@ -24,21 +24,21 @@ Step 2 b = sated => 0.989010989010989, hungry => 0.010989010989010992 s = :sated -a = :feed +a = :ignore o = :quiet -r = -5.0 -r_sum = -5.5 +r = 0.0 +r_sum = -0.5 Step 3 -b = sated => 1.0, hungry => 0.0 +b = sated => 0.9732977303070761, hungry => 0.026702269692923903 s = :sated -a = :ignore -o = :quiet -r = 0.0 +a = :feed +o = :crying +r = -5.0 r_sum = -5.5 Step 4 -b = sated => 0.9759036144578314, hungry => 0.024096385542168676 +b = sated => 1.0, hungry => 0.0 s = :sated a = :sing o = :quiet @@ -46,20 +46,20 @@ r_sum = -6.0

Rollout Simulations

While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a Simulator object. The simplest Simulator is the RolloutSimulator. It simply runs a simulation and returns the discounted reward.

policy = RandomPolicy(explicit_crying_baby_pomdp)
 sim = RolloutSimulator(max_steps=10)
 r_sum = simulate(sim, explicit_crying_baby_pomdp, policy)
-println("Total discounted reward: $r_sum")
Total discounted reward: -45.4782422345

Recording Histories

Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a HistoryRecorder.

policy = RandomPolicy(tabular_crying_baby_pomdp)
+println("Total discounted reward: $r_sum")
Total discounted reward: -49.81028715

Recording Histories

Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a HistoryRecorder.

policy = RandomPolicy(tabular_crying_baby_pomdp)
 hr = HistoryRecorder(max_steps=5)
-history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1))

The history object produced by a HistoryRecorder is a SimHistory, documented in the POMDPTools simulator section Histories. The information in this object can be accessed in several ways. For example, there is a function:

discounted_reward(history)
-14.543550000000002

Accessor functions like state_hist and action_hist can also be used to access parts of the history:

state_hist(history)
6-element Vector{Int64}:
+history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1))

The history object produced by a HistoryRecorder is a SimHistory, documented in the POMDPTools simulator section Histories. The information in this object can be accessed in several ways. For example, there is a function:

discounted_reward(history)
-9.05

Accessor functions like state_hist and action_hist can also be used to access parts of the history:

state_hist(history)
6-element Vector{Int64}:
  1
+ 1
+ 1
+ 1
+ 1
+ 1
collect(action_hist(history))
5-element Vector{Int64}:
  2
+ 1
  2
  2
- 2
- 2
collect(action_hist(history))
5-element Vector{Int64}:
- 3
- 3
- 3
- 2
- 2

Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state and an ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states.

The most powerful function for accessing the information in a SimHistory is the eachstep function which returns an iterator through named tuples representing each step in the history. The eachstep function is similar to the stepthrough function above except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is being carried out.

r_sum = 0.0
+ 1

Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state and an ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states.

The most powerful function for accessing the information in a SimHistory is the eachstep function which returns an iterator through named tuples representing each step in the history. The eachstep function is similar to the stepthrough function above except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is being carried out.

r_sum = 0.0
 step = 0
 for step_i in eachstep(sim_history, "b,s,a,o,r")
     step += 1
@@ -76,42 +76,42 @@
 end # hide
Step 1
 step_i.b = sated => 1.0, hungry => 0.0
 step_i.s = 1
-step_i.a = 3
+step_i.a = 2
 step_i.o = 2
-step_i.r = 0.0
-r_sum = 0.0
+step_i.r = -0.5
+r_sum = -0.5
 
 Step 2
-step_i.b = sated => 0.9759036144578314, hungry => 0.024096385542168676
-step_i.s = 2
-step_i.a = 3
+step_i.b = sated => 0.989010989010989, hungry => 0.010989010989010992
+step_i.s = 1
+step_i.a = 1
 step_i.o = 2
-step_i.r = 0.0
-r_sum = 0.0
+step_i.r = -5.0
+r_sum = -5.5
 
 Step 3
-step_i.b = sated => 0.9701315984030756, hungry => 0.029868401596924443
-step_i.s = 2
-step_i.a = 3
-step_i.o = 1
-step_i.r = 0.0
-r_sum = 0.0
+step_i.b = sated => 1.0, hungry => 0.0
+step_i.s = 1
+step_i.a = 2
+step_i.o = 2
+step_i.r = -0.5
+r_sum = -6.0
 
 Step 4
-step_i.b = sated => 0.4624149353547852, hungry => 0.5375850646452149
-step_i.s = 2
+step_i.b = sated => 0.989010989010989, hungry => 0.010989010989010992
+step_i.s = 1
 step_i.a = 2
-step_i.o = 1
-step_i.r = -10.5
-r_sum = -10.5
+step_i.o = 2
+step_i.r = -0.5
+r_sum = -6.5
 
 Step 5
-step_i.b = sated => 0.0, hungry => 1.0
-step_i.s = 2
-step_i.a = 2
+step_i.b = sated => 0.9878048780487805, hungry => 0.012195121951219514
+step_i.s = 1
+step_i.a = 1
 step_i.o = 1
-step_i.r = -10.5
-r_sum = -21.0

Parallel Simulations

It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of Sim objects, each of which contains all of the information needed to run a simulation. Then run the simulations using run_parallel, which will return a DataFrame with the results.

In this example, we will compare the performance of the policies we computed in the Using Different Solvers section (i.e. sarsop_policy, pomcp_planner, and heuristic_policy). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 Sim objects of each policy to the list.

using DataFrames
+step_i.r = -5.0
+r_sum = -11.5
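The r_sum printed in the loop above is an undiscounted running total, whereas discounted_reward(history) weights each step by the discount factor. The sketch below reproduces both numbers from the five step rewards shown in the updated output, assuming the discount of 0.9 used when the tabular crying baby problem was defined; the variable names are chosen here for illustration.

```julia
# Per-step rewards copied from the five steps printed above.
rewards = [-0.5, -5.0, -0.5, -0.5, -5.0]
discount = 0.9  # assumed to match the tabular crying baby definition

# Plain sum, as accumulated in r_sum by the loop.
undiscounted = sum(rewards)

# Discounted sum: the first step is weighted by discount^0 = 1.
discounted = sum(discount^(t - 1) * r for (t, r) in enumerate(rewards))
```

Under these assumptions the two sums come out to about -11.5 and -9.05, matching the final r_sum and the value returned by discounted_reward(history).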

Parallel Simulations

It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of Sim objects, each of which contains all of the information needed to run a simulation. Then run the simulations using run_parallel, which will return a DataFrame with the results.

In this example, we will compare the performance of the policies we computed in the Using Different Solvers section (i.e. sarsop_policy, pomcp_planner, and heuristic_policy). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 Sim objects of each policy to the list.

using DataFrames
 using StatsBase: std
 
 # Defining parameters for the simulations
@@ -178,4 +178,4 @@
 
 # Calculate the mean and confidence interval for each policy
 grouped_df = groupby(data, :policy)
-result = combine(grouped_df, :reward => mean_and_ci => AsTable)
4×3 DataFrame
 Row │ policy     mean      ci
     │ String?    Float64   Float64
─────┼──────────────────────────────
   1 │ sarsop     -14.6264  1.81814
   2 │ pomcp      -18.6904  1.57649
   3 │ heuristic  -15.4895  1.96535
   4 │ random     -30.4201  2.64208

By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the Sim-history pair and record additional statistics. Reference the POMDPTools simulator section for more information (Specifying information to be recorded).

+result = combine(grouped_df, :reward => mean_and_ci => AsTable)
4×3 DataFrame
 Row │ policy     mean      ci
     │ String?    Float64   Float64
─────┼──────────────────────────────
   1 │ sarsop     -14.6264  1.81814
   2 │ pomcp      -18.6904  1.57649
   3 │ heuristic  -15.4871  1.73522
   4 │ random     -30.4201  2.64208

By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the Sim-history pair and record additional statistics. Reference the POMDPTools simulator section for more information (Specifying information to be recorded).

diff --git a/dev/example_solvers/index.html b/dev/example_solvers/index.html index 7c33f48a..e94e2b26 100644 --- a/dev/example_solvers/index.html +++ b/dev/example_solvers/index.html @@ -25,19 +25,19 @@ For solve(::QMDP.QMDPSolver, ::POMDP): [No additional requirements] For solve(::ValueIterationSolver, ::Union{MDP,POMDP}) (in solve(::QMDP.QMDPSolver, ::POMDP)): - [✔] discount(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) - [✔] transition(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol) - [✔] reward(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol, ::Symbol) - [✔] stateindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, 
@NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) - [✔] actionindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) - [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] discount(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] transition(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, 
@NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol) + [✔] reward(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol, ::Symbol, ::Symbol) + [✔] stateindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] actionindex(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) + [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, 
Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}, ::Symbol) [✔] length(::Array{Symbol1}) [✔] support(::Deterministic{Symbol}) [✔] pdf(::Deterministic{Symbol}, ::Symbol) For ordered_states(::Union{MDP,POMDP}) (in solve(::ValueIterationSolver, ::Union{MDP,POMDP})): - [✔] states(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] states(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) For ordered_actions(::Union{MDP,POMDP}) (in solve(::ValueIterationSolver, ::Union{MDP,POMDP})): - [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("6ed96d00-fb5b-4b0b-95b8-068772e977d7"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", 
actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) + [✔] actions(::UnderlyingMDP{QuickPOMDPs.QuickPOMDP{UUID("bc06d3ef-e7bc-4206-81ed-395883874f86"), Symbol, Symbol, Symbol, @NamedTuple{stateindex::Dict{Symbol, Int64}, isterminal::Bool, obsindex::Dict{Symbol, Int64}, states::Vector{Symbol}, observations::Vector{Symbol}, discount::Float64, actions::Vector{Symbol}, observation::Main.var"#2#5", actionindex::Dict{Symbol, Int64}, initialstate::Deterministic{Symbol}, transition::Main.var"#1#4", reward::Main.var"#3#6"}}SymbolSymbol}) Explicit Crying Baby POMDP INFO: POMDPLinter requirements for solve(::QMDP.QMDPSolver, ::POMDP) and dependencies. ([✔] = implemented correctly; [X] = not implemented; [?] = could not determine) @@ -152,4 +152,4 @@ @show [a1, a2]
2-element Vector{Symbol}:
  :feed
- :sing
+ :ignore diff --git a/dev/examples/index.html b/dev/examples/index.html index c1e9c3cb..70a43e4b 100644 --- a/dev/examples/index.html +++ b/dev/examples/index.html @@ -1,2 +1,2 @@ -Examples · POMDPs.jl

Examples

This section contains examples of how to use POMDPs.jl. For specific information about the interface and functions used in the examples, please reference the corresponding area in the documentation or the API Documentation.

The examples are organized by topic and are designed to build through each step. First, we have to define a POMDP. Then we need to solve the POMDP to get a policy. Finally, we can simulate the policy to see how it performs. The examples are designed to be executed in order. For example, the examples in Simulations Examples assume that the POMDPs defined in the Defining a POMDP section have been defined and we have a policy we would like to simulate that we computed in the Using Different Solvers section.

The GridWorld MDP Tutorial section is a standalone example that does not require any of the other examples.

Outline

+Examples · POMDPs.jl

Examples

This section contains examples of how to use POMDPs.jl. For specific information about the interface and functions used in the examples, please reference the corresponding area in the documentation or the API Documentation.

The examples are organized by topic and are designed to build through each step. First, we have to define a POMDP. Then we need to solve the POMDP to get a policy. Finally, we can simulate the policy to see how it performs. The examples are designed to be executed in order. For example, the examples in Simulations Examples assume that the POMDPs defined in the Defining a POMDP section have been defined and we have a policy we would like to simulate that we computed in the Using Different Solvers section.

The GridWorld MDP Tutorial section is a standalone example that does not require any of the other examples.

Outline

diff --git a/dev/faq/index.html b/dev/faq/index.html index 1801a46b..3b710a0e 100644 --- a/dev/faq/index.html +++ b/dev/faq/index.html @@ -14,4 +14,4 @@ end end -POMDPs.reward(m, s, a) = rdict[(s, a)]

Why do I need to put type assertions pomdp::POMDP into the function signature?

Specifying the type in your function signature allows Julia to call the appropriate function when your custom type is passed into it. For example, if a POMDPs.jl solver calls states on the POMDP that you passed into it, the correct states function will only get dispatched if you specified that the states function you wrote works with your POMDP type. Because Julia supports multiple dispatch, these type assertions are a way of doing object-oriented programming in Julia.

+POMDPs.reward(m, s, a) = rdict[(s, a)]

Why do I need to put type assertions pomdp::POMDP into the function signature?

Specifying the type in your function signature allows Julia to call the appropriate function when your custom type is passed into it. For example, if a POMDPs.jl solver calls states on the POMDP that you passed into it, the correct states function will only get dispatched if you specified that the states function you wrote works with your POMDP type. Because Julia supports multiple dispatch, these type assertions are a way of doing object-oriented programming in Julia.
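
As a minimal sketch of the dispatch behavior described above, the following uses hypothetical stand-ins (ToyPOMDP, toy_states, BabyProblem, TigerProblem, count_states) that are defined locally and are not part of POMDPs.jl. It only illustrates how a type annotation in the signature selects the method that gets called.

```julia
# Hypothetical stand-ins for an abstract problem type and an interface
# function, defined locally so the sketch runs without any packages.
abstract type ToyPOMDP end

# Generic fallback: reached only when no more specific method exists.
toy_states(m::ToyPOMDP) = error("toy_states not implemented for $(typeof(m))")

struct BabyProblem <: ToyPOMDP end
struct TigerProblem <: ToyPOMDP end

# The type annotation is what ties each method to one problem type.
toy_states(m::BabyProblem) = [:hungry, :full]
toy_states(m::TigerProblem) = [:left, :right]

# A "solver" written against the abstract type still reaches the
# problem-specific method through dispatch.
count_states(m::ToyPOMDP) = length(toy_states(m))

println(count_states(BabyProblem()))
println(toy_states(TigerProblem()))
```

Without the annotations, the second toy_states definition would simply overwrite the first, and every problem type would receive the same state list.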

diff --git a/dev/gallery/index.html b/dev/gallery/index.html index 18152a7a..ba9d8c2d 100644 --- a/dev/gallery/index.html +++ b/dev/gallery/index.html @@ -187,4 +187,4 @@ sim = GifSimulator(; filename="examples/TagPOMDP.gif", max_steps=50, rng=MersenneTwister(1), show_progress=false) saved_gif = simulate(sim, pomdp, policy) -println("gif saved to: $(saved_gif.filename)")
gif saved to: examples/TagPOMDP.gif

To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the gallery.md file in docs/src/. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using @eval and saved in the docs/src/examples/ directory. The gif should be named problem_name.gif where problem_name is the name of the problem. The gif can then be included using ![problem_name](examples/problem_name.gif).

+println("gif saved to: $(saved_gif.filename)")
gif saved to: examples/TagPOMDP.gif

To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the gallery.md file in docs/src/. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using @eval and saved in the docs/src/examples/ directory. The gif should be named problem_name.gif where problem_name is the name of the problem. The gif can then be included using ![problem_name](examples/problem_name.gif).

diff --git a/dev/get_started/index.html b/dev/get_started/index.html index e313c041..9ee119ae 100644 --- a/dev/get_started/index.html +++ b/dev/get_started/index.html @@ -13,4 +13,4 @@ init_dist = initialstate(pomdp) # from POMDPModels hr = HistoryRecorder(max_steps=100) # from POMDPTools hist = simulate(hr, pomdp, policy, belief_updater, init_dist) # run 100 step simulation -println("reward: $(discounted_reward(hist))")

The first part of the code loads the desired packages and initializes the problem and the solver. Next, we compute a POMDP policy. Lastly, we evaluate the results.

There are a few things to mention here. First, the TigerPOMDP type implements all the functions required by QMDPSolver to compute a policy. Second, each policy has a default updater (essentially a filter used to update the belief of the POMDP). To learn more about Updaters check out the Concepts and Architecture section.

+println("reward: $(discounted_reward(hist))")

The first part of the code loads the desired packages and initializes the problem and the solver. Next, we compute a POMDP policy. Lastly, we evaluate the results.

There are a few things to mention here. First, the TigerPOMDP type implements all the functions required by QMDPSolver to compute a policy. Second, each policy has a default updater (essentially a filter used to update the belief of the POMDP). To learn more about Updaters check out the Concepts and Architecture section.

diff --git a/dev/index.html b/dev/index.html index 3d9263cb..4211d280 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -POMDPs.jl · POMDPs.jl

POMDPs.jl

A Julia interface for defining, solving and simulating partially observable Markov decision processes and their fully observable counterparts.

Package and Ecosystem Features

  • General interface that can handle problems with discrete and continuous state/action/observation spaces
  • A number of popular state-of-the-art solvers implemented for use out-of-the-box
  • Tools that make it easy to define problems and simulate solutions
  • Simple integration of custom solvers into the existing interface

Available Packages

The POMDPs.jl package contains only the interface used for expressing and solving Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The POMDPTools package acts as a "standard library" for the POMDPs.jl interface, providing implementations of commonly-used components such as policies, belief updaters, distributions, and simulators. The list of solver and support packages maintained by the JuliaPOMDP community is available at the POMDPs.jl Readme.

Documentation Outline

Documentation comes in three forms:

  1. An explanatory guide is available in the sections outlined below.
  2. How-to examples are available throughout this documentation, with specific examples in Examples and Gallery of POMDPs.jl Problems.
  3. Reference docstrings for the entire POMDPs.jl interface are available in the API Documentation section.
Note

When updating these documents, make sure this is synced with docs/make.jl!!

Basics

Defining POMDP Models

Writing Solvers and Updaters

Analyzing Results

POMDPTools - the standard library for POMDPs.jl

Reference

+POMDPs.jl · POMDPs.jl

POMDPs.jl

A Julia interface for defining, solving and simulating partially observable Markov decision processes and their fully observable counterparts.

Package and Ecosystem Features

  • General interface that can handle problems with discrete and continuous state/action/observation spaces
  • A number of popular state-of-the-art solvers implemented for use out-of-the-box
  • Tools that make it easy to define problems and simulate solutions
  • Simple integration of custom solvers into the existing interface

Available Packages

The POMDPs.jl package contains only the interface used for expressing and solving Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The POMDPTools package acts as a "standard library" for the POMDPs.jl interface, providing implementations of commonly-used components such as policies, belief updaters, distributions, and simulators. The list of solver and support packages maintained by the JuliaPOMDP community is available at the POMDPs.jl Readme.

Documentation Outline

Documentation comes in three forms:

  1. An explanatory guide is available in the sections outlined below.
  2. How-to examples are available throughout this documentation, with specific examples in Examples and Gallery of POMDPs.jl Problems.
  3. Reference docstrings for the entire POMDPs.jl interface are available in the API Documentation section.
Note

When updating these documents, make sure this is synced with docs/make.jl!!

Basics

Defining POMDP Models

Writing Solvers and Updaters

Analyzing Results

POMDPTools - the standard library for POMDPs.jl

Reference

diff --git a/dev/install/index.html b/dev/install/index.html index c2a75397..abc2ce52 100644 --- a/dev/install/index.html +++ b/dev/install/index.html @@ -1,3 +1,3 @@ Installation · POMDPs.jl

Installation

If you have a running Julia distribution (Julia 0.4 or greater), you have everything you need to install POMDPs.jl. To install the package, simply run the following from the Julia REPL:

import Pkg
-Pkg.add("POMDPs") # installs the POMDPs.jl package

Some auxiliary packages and older versions of solvers may be found in the JuliaPOMDP registry. To install this registry, run:

using Pkg; pkg"registry add https://github.com/JuliaPOMDP/Registry"

Note: to use this registry, JuliaPro users must also run edit(normpath(Sys.BINDIR,"..","etc","julia","startup.jl")), comment out the line ENV["DISABLE_FALLBACK"] = "true", save the file, and restart JuliaPro as described in this issue.

+Pkg.add("POMDPs") # installs the POMDPs.jl package

Some auxiliary packages and older versions of solvers may be found in the JuliaPOMDP registry. To install this registry, run:

using Pkg; pkg"registry add https://github.com/JuliaPOMDP/Registry"

Note: to use this registry, JuliaPro users must also run edit(normpath(Sys.BINDIR,"..","etc","julia","startup.jl")), comment out the line ENV["DISABLE_FALLBACK"] = "true", save the file, and restart JuliaPro as described in this issue.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 965be55c..c5137070 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -1,2 +1,2 @@ -Spaces and Distributions · POMDPs.jl

Spaces and Distributions

Two important components of the definitions of MDPs and POMDPs are spaces, which specify the possible states, actions, and observations in a problem, and distributions, which define probability distributions. To provide maximum flexibility, spaces and distributions may be of any type (i.e., there are no abstract base types). Solvers and simulators will interact with space and distribution types using the functions defined below.

Spaces

A space object should contain the information needed to define the set of all possible states, actions or observations. The implementation will depend on the attributes of the elements. For example, if the space is continuous, the space object may only contain the limits of the continuous range. In the case of a discrete problem, a vector containing all states is appropriate for representing a space.

The following functions may be called on a space object (Click on a function to read its documentation):

Distributions

A distribution object represents a probability distribution.

The following functions may be called on a distribution object (Click on a function to read its documentation):

You can find some useful pre-made distribution objects in Distributions.jl or POMDPTools.

  • 1. Distributions should support both rand(rng::AbstractRNG, d) and rand(d). The recommended way to do this is by implementing Base.rand(rng::AbstractRNG, s::Random.SamplerTrivial{<:YourDistribution}) from the Julia rand interface.
+Spaces and Distributions · POMDPs.jl

Spaces and Distributions

Two important components of the definitions of MDPs and POMDPs are spaces, which specify the possible states, actions, and observations in a problem, and distributions, which define probability distributions. To provide maximum flexibility, spaces and distributions may be of any type (i.e., there are no abstract base types). Solvers and simulators will interact with space and distribution types using the functions defined below.

Spaces

A space object should contain the information needed to define the set of all possible states, actions or observations. The implementation will depend on the attributes of the elements. For example, if the space is continuous, the space object may only contain the limits of the continuous range. In the case of a discrete problem, a vector containing all states is appropriate for representing a space.

The following functions may be called on a space object (Click on a function to read its documentation):

Distributions

A distribution object represents a probability distribution.

The following functions may be called on a distribution object (Click on a function to read its documentation):

You can find some useful pre-made distribution objects in Distributions.jl or POMDPTools.

  • 1. Distributions should support both rand(rng::AbstractRNG, d) and rand(d). The recommended way to do this is by implementing Base.rand(rng::AbstractRNG, s::Random.SamplerTrivial{<:YourDistribution}) from the Julia rand interface.
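
The recommendation in the footnote can be sketched as follows. CoinDistribution and its p_heads field are hypothetical names chosen for illustration; only Base.rand, Base.eltype, Random.SamplerTrivial, AbstractRNG, and MersenneTwister come from Julia itself. This is one possible way to hook a custom distribution into the rand interface, not the only one.

```julia
using Random

# Hypothetical distribution type: a biased coin over :heads and :tails.
struct CoinDistribution
    p_heads::Float64
end

# Implementing this single method hooks the type into Julia's rand interface.
# s[] unwraps the distribution stored inside the SamplerTrivial wrapper.
function Base.rand(rng::AbstractRNG, s::Random.SamplerTrivial{CoinDistribution})
    d = s[]
    return rand(rng) < d.p_heads ? :heads : :tails
end

# Optional: declare the sample type so that rand(d, n) allocates a Vector{Symbol}.
Base.eltype(::Type{CoinDistribution}) = Symbol

d = CoinDistribution(0.7)
rng = MersenneTwister(1)

x = rand(rng, d)       # explicit RNG
y = rand(d)            # default RNG
xs = rand(rng, d, 5)   # several samples at once
```

Because both rand(rng, d) and rand(d) route through the same sampler method, a single definition covers the two call forms the footnote asks for.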
diff --git a/dev/offline_solver/index.html b/dev/offline_solver/index.html index dd8922ec..6b5fe88a 100644 --- a/dev/offline_solver/index.html +++ b/dev/offline_solver/index.html @@ -70,4 +70,4 @@ @assert action(policy, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT @assert action(policy, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT -@assert action(policy, Uniform(states(tiger))) == TIGER_LISTEN +@assert action(policy, Uniform(states(tiger))) == TIGER_LISTEN diff --git a/dev/online_solver/index.html b/dev/online_solver/index.html index f88bfaa6..e9ba07b2 100644 --- a/dev/online_solver/index.html +++ b/dev/online_solver/index.html @@ -56,4 +56,4 @@ @assert action(planner, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT @assert action(planner, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT -# note action(planner, Uniform(states(tiger))) is not very reliable with this number of samples +# note action(planner, Uniform(states(tiger))) is not very reliable with this number of samples diff --git a/dev/policy_interaction/index.html b/dev/policy_interaction/index.html index 4bb17a4c..58b9a121 100644 --- a/dev/policy_interaction/index.html +++ b/dev/policy_interaction/index.html @@ -1,2 +1,2 @@ -Interacting with Policies · POMDPs.jl

Interacting with Policies

A solution to a POMDP is a policy that maps beliefs or action-observation histories to actions. In POMDPs.jl, these are represented by Policy objects. See Solvers and Policies for more information about what a policy can represent in general.

One common task in evaluating POMDP solutions is examining the policies themselves. Since the internal representation of a policy is an esoteric implementation detail, it is best to interact with policies through the action and value interface functions. There are three relevant methods:

  • action(policy, s) returns the best action (or one of the best) for the given state or belief.
  • value(policy, s) returns the expected sum of future rewards if the policy is executed.
  • value(policy, s, a) returns the "Q-value", that is, the expected sum of rewards if action a is taken on the next step and then the policy is executed.

Note that the quantities returned by these functions are what the policy/solver expects to be the case after its (usually approximate) computations; they may be far from the true value if the solution is not exactly optimal.

+Interacting with Policies · POMDPs.jl

Interacting with Policies

A solution to a POMDP is a policy that maps beliefs or action-observation histories to actions. In POMDPs.jl, these are represented by Policy objects. See Solvers and Policies for more information about what a policy can represent in general.

One common task in evaluating POMDP solutions is examining the policies themselves. Since the internal representation of a policy is an esoteric implementation detail, it is best to interact with policies through the action and value interface functions. There are three relevant methods:

  • action(policy, s) returns the best action (or one of the best) for the given state or belief.
  • value(policy, s) returns the expected sum of future rewards if the policy is executed.
  • value(policy, s, a) returns the "Q-value", that is, the expected sum of rewards if action a is taken on the next step and then the policy is executed.

Note that the quantities returned by these functions are what the policy/solver expects to be the case after its (usually approximate) computations; they may be far from the true value if the solution is not exactly optimal.
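
To make the relationship between the three quantities concrete, here is a toy sketch that does not use POMDPs.jl at all. The Q-table, its numbers, and the names toy_value and toy_action are made up for illustration, and the sketch assumes a greedy policy over fully observed states rather than beliefs.

```julia
# Hypothetical Q-table for a two-state, two-action problem. The numbers are
# invented for illustration and do not come from any solver.
const Q = Dict(
    (:hungry, :feed)   => -5.0,
    (:hungry, :ignore) => -10.0,
    (:full,   :feed)   => -5.0,
    (:full,   :ignore) => 0.0,
)
const ACTIONS = [:feed, :ignore]

# Q-value: estimated return for taking a in s, then following the policy.
toy_value(s, a) = Q[(s, a)]

# State value of the greedy policy: the best Q-value available in s.
toy_value(s) = maximum(toy_value(s, a) for a in ACTIONS)

# Greedy action: the action whose Q-value attains that maximum.
toy_action(s) = ACTIONS[argmax([toy_value(s, a) for a in ACTIONS])]

println(toy_action(:hungry), " ", toy_value(:hungry))
```

In this toy setting the two-argument value is the stored estimate, the one-argument value is its maximum over actions, and the action is the maximizer. As the note above says, all three are only as accurate as the estimates behind them.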

diff --git a/dev/run_simulation/index.html b/dev/run_simulation/index.html index 113fb201..b4539b1f 100644 --- a/dev/run_simulation/index.html +++ b/dev/run_simulation/index.html @@ -1,3 +1,3 @@ Running Simulations · POMDPs.jl

Running Simulations

Running a simulation consists of two steps, creating a simulator and calling the simulate function. For example, given a POMDP or MDP model m, and a policy p, one can use the RolloutSimulator from POMDPTools to find the accumulated discounted reward from a single simulated trajectory as follows:

sim = RolloutSimulator()
-r = simulate(sim, m, p)

More inputs, such as a belief updater, initial state, initial belief, etc. may be specified as arguments to simulate. See the docstring for simulate and the appropriate "Input" sections in the Simulation Standard page for more information.

More examples can be found in the Simulations Examples section. A variety of simulators that return more information and interact in different ways can be found in POMDPTools.

+r = simulate(sim, m, p)

More inputs, such as a belief updater, initial state, initial belief, etc. may be specified as arguments to simulate. See the docstring for simulate and the appropriate "Input" sections in the Simulation Standard page for more information.

More examples can be found in the Simulations Examples section. A variety of simulators that return more information and interact in different ways can be found in POMDPTools.

diff --git a/dev/simulation/index.html b/dev/simulation/index.html index cd8f1738..86de561c 100644 --- a/dev/simulation/index.html +++ b/dev/simulation/index.html @@ -21,4 +21,4 @@ d *= discount(mdp) end

In terms of the explicit interface, the @gen macro above expands to the equivalent of:

    sp = rand(transition(pomdp, s, a))
     r = reward(pomdp, s, a, sp)
-    s = sp
+ s = sp