diff --git a/.history/docs/src/example_defining_problems_20240704220213.md b/.history/docs/src/example_defining_problems_20240704220213.md new file mode 100644 index 00000000..e7f2d173 --- /dev/null +++ b/.history/docs/src/example_defining_problems_20240704220213.md @@ -0,0 +1,314 @@ +# Defining a POMDP +As mentioned in the [Defining POMDPs and MDPs](@ref defining_pomdps) section, there are various ways to define a POMDP using POMDPs.jl. In this section, we provide more examples of how to define a POMDP using the different interfaces. + +There is a large variety of problems that can be expressed as MDPs and POMDPs and different solvers require different components of the POMDPs.jl interface to be defined. Therefore, these examples are not intended to cover all possible use cases. When developing a problem and you have an idea of what solver(s) you would like to use, it is recommended to use [POMDPLinter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help you to determine what components of the POMDPs.jl interface need to be defined. Reference the [Checking Requirements](@ref) section for an example of using POMDPLinter. + +## CryingBaby Problem Definition +For the examples, we will use the CryingBaby problem from [Algorithms for Decision Making](https://algorithmsbook.com/) by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray. + +!!! note + This crying baby problem follows the description in Algorithms for Decision Making and is different than `BabyPOMDP` defined in [POMDPModels.jl](https://github.com/JuliaPOMDP/POMDPModels.jl). + +From [Appendix F](https://algorithmsbook.com/files/appendix-f.pdf) of Algorithms for Decision Making: +> The crying baby problem is a simple POMDP with two states, three actions, and two observations. Our goal is to care for a baby, and we do so by choosing at each time step whether to feed the baby, sing to the baby, or ignore the baby. +> +> The baby becomes hungry over time. We do not directly observe whether the baby is hungry; instead, we receive a noisy observation in the form of whether the baby is crying. The state, action, and observation spaces are as follows: +> ```math +> \begin{align*} +> \mathcal{S} &= \{\text{sated}, \text{hungry} \}\\ +> \mathcal{A} &= \{\text{feed}, \text{sing}, \text{ignore} \} \\ +> \mathcal{O} &= \{\text{crying}, \text{quiet} \} +> \end{align*} +> ``` +> +> Feeding will always sate the baby. Ignoring the baby risks a sated baby becoming hungry, and ensures that a hungry baby remains hungry. Singing to the baby is an information-gathering action with the same transition dynamics as ignoring, but without the potential for crying when sated (not hungry) and with an increased chance of crying when hungry.
+> +> The transition dynamics are as follows: +> ```math +> \begin{align*} +> & T(\text{sated} \mid \text{hungry}, \text{feed}) = 100\% \\ +> & T(\text{hungry} \mid \text{hungry}, \text{sing}) = 100\% \\ +> & T(\text{hungry} \mid \text{hungry}, \text{ignore}) = 100\% \\ +> & T(\text{sated} \mid \text{sated}, \text{feed}) = 100\% \\ +> & T(\text{hungry} \mid \text{sated}, \text{sing}) = 10\% \\ +> & T(\text{hungry} \mid \text{sated}, \text{ignore}) = 10\% +> \end{align*} +> ``` +> +> The observation dynamics are as follows: +> ```math +> \begin{align*} +> & O(\text{crying} \mid \text{feed}, \text{hungry}) = 80\% \\ +> & O(\text{crying} \mid \text{sing}, \text{hungry}) = 90\% \\ +> & O(\text{crying} \mid \text{ignore}, \text{hungry}) = 80\% \\ +> & O(\text{crying} \mid \text{feed}, \text{sated}) = 10\% \\ +> & O(\text{crying} \mid \text{sing}, \text{sated}) = 0\% \\ +> & O(\text{crying} \mid \text{ignore}, \text{sated}) = 10\% +> \end{align*} +> ``` +> +> The reward function assigns ``−10`` reward if the baby is hungry, independent of the action taken. The effort of feeding the baby adds a further ``−5`` reward, whereas singing adds ``−0.5`` reward. As baby caregivers, we seek the optimal infinite-horizon policy with discount factor ``\gamma = 0.9``. + +## [QuickPOMDP Interface](@id quick_crying) +```julia +using POMDPs +using POMDPTools +using QuickPOMDPs + +quick_crying_baby_pomdp = QuickPOMDP( + states = [:sated, :hungry], + actions = [:feed, :sing, :ignore], + observations = [:quiet, :crying], + initialstate = Deterministic(:sated), + discount = 0.9, + transition = function (s, a) + if a == :feed + return Deterministic(:sated) + elseif s == :sated # :sated and a != :feed + return SparseCat([:sated, :hungry], [0.9, 0.1]) + else # s == :hungry and a != :feed + return Deterministic(:hungry) + end + end, + observation = function (a, sp) + if sp == :hungry + if a == :sing + return SparseCat([:crying, :quiet], [0.9, 0.1]) + else # a == :ignore || a == :feed + return SparseCat([:crying, :quiet], [0.8, 0.2]) + end + else # sp = :sated + if a == :sing + return Deterministic(:quiet) + else # a == :ignore || a == :feed + return SparseCat([:crying, :quiet], [0.1, 0.9]) + end + + end + end, + reward = function (s, a) + r = 0.0 + if s == :hungry + r += -10.0 + end + if a == :feed + r += -5.0 + elseif a == :sing + r+= -0.5 + end + return r + end +) +``` + +## [Explicit Interface](@id explicit_crying) +```julia +using POMDPs +using POMDPTools + +struct CryingBabyState # Alternatively, you could just use a Bool or Symbol for the state. 
+ hungry::Bool +end + +struct CryingBabyPOMDP <: POMDP{CryingBabyState, Symbol, Symbol} + p_sated_to_hungry::Float64 + p_cry_feed_hungry::Float64 + p_cry_sing_hungry::Float64 + p_cry_ignore_hungry::Float64 + p_cry_feed_sated::Float64 + p_cry_sing_sated::Float64 + p_cry_ignore_sated::Float64 + reward_hungry::Float64 + reward_feed::Float64 + reward_sing::Float64 + discount_factor::Float64 +end + +function CryingBabyPOMDP(; + p_sated_to_hungry=0.1, + p_cry_feed_hungry=0.8, + p_cry_sing_hungry=0.9, + p_cry_ignore_hungry=0.8, + p_cry_feed_sated=0.1, + p_cry_sing_sated=0.0, + p_cry_ignore_sated=0.1, + reward_hungry=-10.0, + reward_feed=-5.0, + reward_sing=-0.5, + discount_factor=0.9 +) + return CryingBabyPOMDP(p_sated_to_hungry, p_cry_feed_hungry, + p_cry_sing_hungry, p_cry_ignore_hungry, p_cry_feed_sated, + p_cry_sing_sated, p_cry_ignore_sated, reward_hungry, + reward_feed, reward_sing, discount_factor) +end + +POMDPs.actions(::CryingBabyPOMDP) = [:feed, :sing, :ignore] +POMDPs.states(::CryingBabyPOMDP) = [CryingBabyState(false), CryingBabyState(true)] +POMDPs.observations(::CryingBabyPOMDP) = [:crying, :quiet] +POMDPs.stateindex(::CryingBabyPOMDP, s::CryingBabyState) = s.hungry ? 2 : 1 +POMDPs.obsindex(::CryingBabyPOMDP, o::Symbol) = o == :crying ? 1 : 2 +POMDPs.actionindex(::CryingBabyPOMDP, a::Symbol) = a == :feed ? 1 : a == :sing ? 2 : 3 + +function POMDPs.transition(pomdp::CryingBabyPOMDP, s::CryingBabyState, a::Symbol) + if a == :feed + return Deterministic(CryingBabyState(false)) + elseif s == :sated # :sated and a != :feed + return SparseCat([CryingBabyState(false), CryingBabyState(true)], [1 - pomdp.p_sated_to_hungry, pomdp.p_sated_to_hungry]) + else # s == :hungry and a != :feed + return Deterministic(CryingBabyState(true)) + end +end + +function POMDPs.observation(pomdp::CryingBabyPOMDP, a::Symbol, sp::CryingBabyState) + if sp.hungry + if a == :sing + return SparseCat([:crying, :quiet], [pomdp.p_cry_sing_hungry, 1 - pomdp.p_cry_sing_hungry]) + elseif a== :ignore + return SparseCat([:crying, :quiet], [pomdp.p_cry_ignore_hungry, 1 - pomdp.p_cry_ignore_hungry]) + else # a == :feed + return SparseCat([:crying, :quiet], [pomdp.p_cry_feed_hungry, 1 - pomdp.p_cry_feed_hungry]) + end + else # sated + if a == :sing + return SparseCat([:crying, :quiet], [pomdp.p_cry_sing_sated, 1 - pomdp.p_cry_sing_sated]) + elseif a== :ignore + return SparseCat([:crying, :quiet], [pomdp.p_cry_ignore_sated, 1 - pomdp.p_cry_ignore_sated]) + else # a == :feed + return SparseCat([:crying, :quiet], [pomdp.p_cry_feed_sated, 1 - pomdp.p_cry_feed_sated]) + end + end +end + +function POMDPs.reward(pomdp::CryingBabyPOMDP, s::CryingBabyState, a::Symbol) + r = 0.0 + if s.hungry + r += pomdp.reward_hungry + end + if a == :feed + r += pomdp.reward_feed + elseif a == :sing + r += pomdp.reward_sing + end + return r +end + +POMDPs.discount(pomdp::CryingBabyPOMDP) = pomdp.discount_factor + +POMDPs.initialstate(::CryingBabyPOMDP) = Deterministic(CryingBabyState(false)) + +explicit_crying_baby_pomdp = CryingBabyPOMDP() +``` + +## [Generative Interface](@id gen_crying) +This crying baby problem should not be implemented using the generative interface. However, this exmple is provided for pedagogical purposes. + +```julia +using POMDPs +using POMDPTools +using Random + +struct GenCryingBabyState # Alternatively, you could just use a Bool or Symbol for the state. 
+ hungry::Bool +end + +struct GenCryingBabyPOMDP <: POMDP{GenCryingBabyState, Symbol, Symbol} + p_sated_to_hungry::Float64 + p_cry_feed_hungry::Float64 + p_cry_sing_hungry::Float64 + p_cry_ignore_hungry::Float64 + p_cry_feed_sated::Float64 + p_cry_sing_sated::Float64 + p_cry_ignore_sated::Float64 + reward_hungry::Float64 + reward_feed::Float64 + reward_sing::Float64 + discount_factor::Float64 + + GenCryingBabyPOMDP() = new(0.1, 0.8, 0.9, 0.8, 0.1, 0.0, 0.1, -10.0, -5.0, -0.5, 0.9) +end + +function POMDPs.gen(pomdp::GenCryingBabyPOMDP, s::GenCryingBabyState, a::Symbol, rng::AbstractRNG) + + if a == :feed + sp = GenCryingBabyState(false) + else + sp = rand(rng) < pomdp.p_sated_to_hungry ? GenCryingBabyState(true) : GenCryingBabyState(false) + end + + if sp.hungry + if a == :sing + o = rand(rng) < pomdp.p_cry_sing_hungry ? :crying : :quiet + elseif a== :ignore + o = rand(rng) < pomdp.p_cry_ignore_hungry ? :crying : :quiet + else # a == :feed + o = rand(rng) < pomdp.p_cry_feed_hungry ? :crying : :quiet + end + else # sated + if a == :sing + o = rand(rng) < pomdp.p_cry_sing_sated ? :crying : :quiet + elseif a== :ignore + o = rand(rng) < pomdp.p_cry_ignore_sated ? :crying : :quiet + else # a == :feed + o = rand(rng) < pomdp.p_cry_feed_sated ? :crying : :quiet + end + end + + r = 0.0 + if sp.hungry + r += pomdp.reward_hungry + end + if a == :feed + r += pomdp.reward_feed + elseif a == :sing + r += pomdp.reward_sing + end + + return (sp=sp, o=o, r=r) +end + +POMDPs.initialstate(::GenCryingBabyPOMDP) = Deterministic(GenCryingBabyState(false)) + +gen_crying_baby_pomdp = GenCryingBabyPOMDP() +``` + +## [Probability Tables](@id tab_crying) +For this implementaion we will use the following indexes: +- States + - `:sated` = 1 + - `:hungry` = 2 +- Actions + - `:feed` = 1 + - `:sing` = 2 + - `:ignore` = 3 +- Observations + - `:crying` = 1 + - `:quiet` = 2 + +```julia +using POMDPModels + +T = zeros(2, 3, 2) # |S| x |A| x |S'|, T[sp, a, s] = p(sp | a, s) +T[:, 1, :] = [1.0 1.0; + 0.0 0.0] +T[:, 2, :] = [0.9 0.0; + 0.1 1.0] +T[:, 3, :] = [0.9 0.0; + 0.1 1.0] + +O = zeros(2, 3, 2) # |O| x |A| x |S'|, O[o, a, sp] = p(o | a, sp) +O[:, 1, :] = [0.1 0.8; + 0.9 0.2] +O[:, 2, :] = [0.0 0.9; + 1.0 0.1] +O[:, 3, :] = [0.1 0.8; + 0.9 0.2] + +R = zeros(2, 3) # |S| x |A| +R = [-5.0 -0.5 0.0; + -15.0 -10.5 0.0] + +discount = 0.9 + +tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) +``` diff --git a/.history/docs/src/example_defining_problems_20240704220935.md b/.history/docs/src/example_defining_problems_20240704220935.md new file mode 100644 index 00000000..be243fe4 --- /dev/null +++ b/.history/docs/src/example_defining_problems_20240704220935.md @@ -0,0 +1,314 @@ +# Defining a POMDP +As mentioned in the [Defining POMDPs and MDPs](@ref defining_pomdps) section, there are various ways to define a POMDP using POMDPs.jl. In this section, we provide more examples of how to define a POMDP using the different interfaces. + +There is a large variety of problems that can be expressed as MDPs and POMDPs and different solvers require different components of the POMDPs.jl interface to be defined. Therefore, these examples are not intended to cover all possible use cases. When developing a problem and you have an idea of what solver(s) you would like to use, it is recommended to use [POMDPLinter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help you to determine what components of the POMDPs.jl interface need to be defined. Reference the [Checking Requirements](@ref) section for an example of using POMDPLinter. 
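As a minimal sketch of that workflow (assuming, for illustration only, that we plan to use the QMDP solver and the `quick_crying_baby_pomdp` model defined in the QuickPOMDP example below), POMDPLinter can list the interface functions a solver needs:

```julia
using POMDPs
using POMDPLinter
using QMDP # illustrative solver choice; substitute whichever solver you plan to use

# Prints the POMDPs.jl functions required by the solver and flags any that are not yet defined
@show_requirements POMDPs.solve(QMDPSolver(), quick_crying_baby_pomdp)
```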
+ +## CryingBaby Problem Definition +For the examples, we will use the CryingBaby problem from [Algorithms for Decision Making](https://algorithmsbook.com/) by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray. + +!!! note + This crying baby problem follows the description in Algorithms for Decision Making and is different than `BabyPOMDP` defined in [POMDPModels.jl](https://github.com/JuliaPOMDP/POMDPModels.jl). + +From [Appendix F](https://algorithmsbook.com/files/appendix-f.pdf) of Algorithms for Decision Making: +> The crying baby problem is a simple POMDP with two states, three actions, and two observations. Our goal is to care for a baby, and we do so by choosing at each time step whether to feed the baby, sing to the baby, or ignore the baby. +> +> The baby becomes hungry over time. We do not directly observe whether the baby is hungry; instead, we receive a noisy observation in the form of whether the baby is crying. The state, action, and observation spaces are as follows: +> ```math +> \begin{align*} +> \mathcal{S} &= \{\text{sated}, \text{hungry} \}\\ +> \mathcal{A} &= \{\text{feed}, \text{sing}, \text{ignore} \} \\ +> \mathcal{O} &= \{\text{crying}, \text{quiet} \} +> \end{align*} +> ``` +> +> Feeding will always sate the baby. Ignoring the baby risks a sated baby becoming hungry, and ensures that a hungry baby remains hungry. Singing to the baby is an information-gathering action with the same transition dynamics as ignoring, but without the potential for crying when sated (not hungry) and with an increased chance of crying when hungry. +> +> The transition dynamics are as follows: +> ```math +> \begin{align*} +> & T(\text{sated} \mid \text{hungry}, \text{feed}) = 100\% \\ +> & T(\text{hungry} \mid \text{hungry}, \text{sing}) = 100\% \\ +> & T(\text{hungry} \mid \text{hungry}, \text{ignore}) = 100\% \\ +> & T(\text{sated} \mid \text{sated}, \text{feed}) = 100\% \\ +> & T(\text{hungry} \mid \text{sated}, \text{sing}) = 10\% \\ +> & T(\text{hungry} \mid \text{sated}, \text{ignore}) = 10\% +> \end{align*} +> ``` +> +> The observation dynamics are as follows: +> ```math +> \begin{align*} +> & O(\text{crying} \mid \text{feed}, \text{hungry}) = 80\% \\ +> & O(\text{crying} \mid \text{sing}, \text{hungry}) = 90\% \\ +> & O(\text{crying} \mid \text{ignore}, \text{hungry}) = 80\% \\ +> & O(\text{crying} \mid \text{feed}, \text{sated}) = 10\% \\ +> & O(\text{crying} \mid \text{sing}, \text{sated}) = 0\% \\ +> & O(\text{crying} \mid \text{ignore}, \text{sated}) = 10\% +> \end{align*} +> ``` +> +> The reward function assigns ``−10`` reward if the baby is hungry, independent of the action taken. The effort of feeding the baby adds a further ``−5`` reward, whereas singing adds ``−0.5`` reward. As baby caregivers, we seek the optimal infinite-horizon policy with discount factor ``\gamma = 0.9``. 
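Putting the pieces of that description together, the reward implemented in each of the interfaces below can be written compactly as

```math
R(s, a) = -10 \, \mathbf{1}\{s = \text{hungry}\} - 5 \, \mathbf{1}\{a = \text{feed}\} - 0.5 \, \mathbf{1}\{a = \text{sing}\}
```

where ``\mathbf{1}\{\cdot\}`` denotes the indicator function.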
+ +## [QuickPOMDP Interface](@id quick_crying) +```julia +using POMDPs +using POMDPTools +using QuickPOMDPs + +quick_crying_baby_pomdp = QuickPOMDP( + states = [:sated, :hungry], + actions = [:feed, :sing, :ignore], + observations = [:quiet, :crying], + initialstate = Deterministic(:sated), + discount = 0.9, + transition = function (s, a) + if a == :feed + return Deterministic(:sated) + elseif s == :sated # :sated and a != :feed + return SparseCat([:sated, :hungry], [0.9, 0.1]) + else # s == :hungry and a != :feed + return Deterministic(:hungry) + end + end, + observation = function (a, sp) + if sp == :hungry + if a == :sing + return SparseCat([:crying, :quiet], [0.9, 0.1]) + else # a == :ignore || a == :feed + return SparseCat([:crying, :quiet], [0.8, 0.2]) + end + else # sp = :sated + if a == :sing + return Deterministic(:quiet) + else # a == :ignore || a == :feed + return SparseCat([:crying, :quiet], [0.1, 0.9]) + end + + end + end, + reward = function (s, a) + r = 0.0 + if s == :hungry + r += -10.0 + end + if a == :feed + r += -5.0 + elseif a == :sing + r+= -0.5 + end + return r + end +) +``` + +## [Explicit Interface](@id explicit_crying) +```julia +using POMDPs +using POMDPTools + +struct CryingBabyState # Alternatively, you could just use a Bool or Symbol for the state. + hungry::Bool +end + +struct CryingBabyPOMDP <: POMDP{CryingBabyState, Symbol, Symbol} + p_sated_to_hungry::Float64 + p_cry_feed_hungry::Float64 + p_cry_sing_hungry::Float64 + p_cry_ignore_hungry::Float64 + p_cry_feed_sated::Float64 + p_cry_sing_sated::Float64 + p_cry_ignore_sated::Float64 + reward_hungry::Float64 + reward_feed::Float64 + reward_sing::Float64 + discount_factor::Float64 +end + +function CryingBabyPOMDP(; + p_sated_to_hungry=0.1, + p_cry_feed_hungry=0.8, + p_cry_sing_hungry=0.9, + p_cry_ignore_hungry=0.8, + p_cry_feed_sated=0.1, + p_cry_sing_sated=0.0, + p_cry_ignore_sated=0.1, + reward_hungry=-10.0, + reward_feed=-5.0, + reward_sing=-0.5, + discount_factor=0.9 +) + return CryingBabyPOMDP(p_sated_to_hungry, p_cry_feed_hungry, + p_cry_sing_hungry, p_cry_ignore_hungry, p_cry_feed_sated, + p_cry_sing_sated, p_cry_ignore_sated, reward_hungry, + reward_feed, reward_sing, discount_factor) +end + +POMDPs.actions(::CryingBabyPOMDP) = [:feed, :sing, :ignore] +POMDPs.states(::CryingBabyPOMDP) = [CryingBabyState(false), CryingBabyState(true)] +POMDPs.observations(::CryingBabyPOMDP) = [:crying, :quiet] +POMDPs.stateindex(::CryingBabyPOMDP, s::CryingBabyState) = s.hungry ? 2 : 1 +POMDPs.obsindex(::CryingBabyPOMDP, o::Symbol) = o == :crying ? 1 : 2 +POMDPs.actionindex(::CryingBabyPOMDP, a::Symbol) = a == :feed ? 1 : a == :sing ? 
2 : 3 + +function POMDPs.transition(pomdp::CryingBabyPOMDP, s::CryingBabyState, a::Symbol) + if a == :feed + return Deterministic(CryingBabyState(false)) + elseif s == :sated # :sated and a != :feed + return SparseCat([CryingBabyState(false), CryingBabyState(true)], [1 - pomdp.p_sated_to_hungry, pomdp.p_sated_to_hungry]) + else # s == :hungry and a != :feed + return Deterministic(CryingBabyState(true)) + end +end + +function POMDPs.observation(pomdp::CryingBabyPOMDP, a::Symbol, sp::CryingBabyState) + if sp.hungry + if a == :sing + return SparseCat([:crying, :quiet], [pomdp.p_cry_sing_hungry, 1 - pomdp.p_cry_sing_hungry]) + elseif a== :ignore + return SparseCat([:crying, :quiet], [pomdp.p_cry_ignore_hungry, 1 - pomdp.p_cry_ignore_hungry]) + else # a == :feed + return SparseCat([:crying, :quiet], [pomdp.p_cry_feed_hungry, 1 - pomdp.p_cry_feed_hungry]) + end + else # sated + if a == :sing + return SparseCat([:crying, :quiet], [pomdp.p_cry_sing_sated, 1 - pomdp.p_cry_sing_sated]) + elseif a== :ignore + return SparseCat([:crying, :quiet], [pomdp.p_cry_ignore_sated, 1 - pomdp.p_cry_ignore_sated]) + else # a == :feed + return SparseCat([:crying, :quiet], [pomdp.p_cry_feed_sated, 1 - pomdp.p_cry_feed_sated]) + end + end +end + +function POMDPs.reward(pomdp::CryingBabyPOMDP, s::CryingBabyState, a::Symbol) + r = 0.0 + if s.hungry + r += pomdp.reward_hungry + end + if a == :feed + r += pomdp.reward_feed + elseif a == :sing + r += pomdp.reward_sing + end + return r +end + +POMDPs.discount(pomdp::CryingBabyPOMDP) = pomdp.discount_factor + +POMDPs.initialstate(::CryingBabyPOMDP) = Deterministic(CryingBabyState(false)) + +explicit_crying_baby_pomdp = CryingBabyPOMDP() +``` + +## [Generative Interface](@id gen_crying) +This crying baby problem should not be implemented using the generative interface. However, this example is provided for pedagogical purposes. + +```julia +using POMDPs +using POMDPTools +using Random + +struct GenCryingBabyState # Alternatively, you could just use a Bool or Symbol for the state. + hungry::Bool +end + +struct GenCryingBabyPOMDP <: POMDP{GenCryingBabyState, Symbol, Symbol} + p_sated_to_hungry::Float64 + p_cry_feed_hungry::Float64 + p_cry_sing_hungry::Float64 + p_cry_ignore_hungry::Float64 + p_cry_feed_sated::Float64 + p_cry_sing_sated::Float64 + p_cry_ignore_sated::Float64 + reward_hungry::Float64 + reward_feed::Float64 + reward_sing::Float64 + discount_factor::Float64 + + GenCryingBabyPOMDP() = new(0.1, 0.8, 0.9, 0.8, 0.1, 0.0, 0.1, -10.0, -5.0, -0.5, 0.9) +end + +function POMDPs.gen(pomdp::GenCryingBabyPOMDP, s::GenCryingBabyState, a::Symbol, rng::AbstractRNG) + + if a == :feed + sp = GenCryingBabyState(false) + else + sp = rand(rng) < pomdp.p_sated_to_hungry ? GenCryingBabyState(true) : GenCryingBabyState(false) + end + + if sp.hungry + if a == :sing + o = rand(rng) < pomdp.p_cry_sing_hungry ? :crying : :quiet + elseif a== :ignore + o = rand(rng) < pomdp.p_cry_ignore_hungry ? :crying : :quiet + else # a == :feed + o = rand(rng) < pomdp.p_cry_feed_hungry ? :crying : :quiet + end + else # sated + if a == :sing + o = rand(rng) < pomdp.p_cry_sing_sated ? :crying : :quiet + elseif a== :ignore + o = rand(rng) < pomdp.p_cry_ignore_sated ? :crying : :quiet + else # a == :feed + o = rand(rng) < pomdp.p_cry_feed_sated ? 
:crying : :quiet + end + end + + r = 0.0 + if sp.hungry + r += pomdp.reward_hungry + end + if a == :feed + r += pomdp.reward_feed + elseif a == :sing + r += pomdp.reward_sing + end + + return (sp=sp, o=o, r=r) +end + +POMDPs.initialstate(::GenCryingBabyPOMDP) = Deterministic(GenCryingBabyState(false)) + +gen_crying_baby_pomdp = GenCryingBabyPOMDP() +``` + +## [Probability Tables](@id tab_crying) +For this implementation we will use the following indexes: +- States + - `:sated` = 1 + - `:hungry` = 2 +- Actions + - `:feed` = 1 + - `:sing` = 2 + - `:ignore` = 3 +- Observations + - `:crying` = 1 + - `:quiet` = 2 + +```julia +using POMDPModels + +T = zeros(2, 3, 2) # |S| x |A| x |S'|, T[sp, a, s] = p(sp | a, s) +T[:, 1, :] = [1.0 1.0; + 0.0 0.0] +T[:, 2, :] = [0.9 0.0; + 0.1 1.0] +T[:, 3, :] = [0.9 0.0; + 0.1 1.0] + +O = zeros(2, 3, 2) # |O| x |A| x |S'|, O[o, a, sp] = p(o | a, sp) +O[:, 1, :] = [0.1 0.8; + 0.9 0.2] +O[:, 2, :] = [0.0 0.9; + 1.0 0.1] +O[:, 3, :] = [0.1 0.8; + 0.9 0.2] + +R = zeros(2, 3) # |S| x |A| +R = [-5.0 -0.5 0.0; + -15.0 -10.5 0.0] + +discount = 0.9 + +tabular_crying_baby_pomdp = TabularPOMDP(T, R, O, discount) +``` diff --git a/.history/docs/src/example_gridworld_mdp_20240704220213.md b/.history/docs/src/example_gridworld_mdp_20240704220213.md new file mode 100644 index 00000000..4dfa4e48 --- /dev/null +++ b/.history/docs/src/example_gridworld_mdp_20240704220213.md @@ -0,0 +1,592 @@ +# GridWorld MDP Tutorial + +In this tutorial, we provide a simple example of how to define a Markov decision process (MDP) using the POMDPS.jl interface. We will then solve the MDP using value iteration and Monte Carlo tree search (MCTS). We will walk through constructing the MDP using the explicit interface which invovles defining a new type for the MDP and then extending different components of the POMDPs.jl interface for that type. + +## Dependencies + +We need a few modules in order to run this example. All of the models can be added by running the following command in the Julia REPL: + +```julia +using Pkg + +Pkg.add("POMDPs") +Pkg.add("POMDPTools") +Pkg.add("DiscreteValueIteration") +Pkg.add("MCTS") +``` + +If you already had the models installed, it is prudent to update them to the latest version: + +```julia +Pkg.update() +``` + +Now that we have the models installed, we can load them into our workspace: + +```@example gridworld_mdp +using POMDPs +using POMDPTools +using DiscreteValueIteration +using MCTS +``` + +## Problem Overview + +In Grid World, we are trying to control an agent who has trouble moving in the desired direction. In our problem, we have four reward states within the a grid. Each position on the grid represents a state, and the positive reward states are terminal (the agent stops recieving reward after reaching them and performing an action from that state). The agent has four actions to choose from: up, down, left, right. The agent moves in the desired direction with a probability of $0.7$, and with a probability of $0.1$ in each of the remaining three directions. If the agent bumps into the outside wall, there is a penalty of $1$ (i.e. reward of $-1$). The problem has the following form: + +![Grid World](examples/grid_world_overview.gif) + +## Defining the Grid World MDP Type + +In POMDPs.jl, an MDP is defined by creating a subtype of the `MDP` abstract type. The types of the states and actions for the MDP are declared as [parameters](https://docs.julialang.org/en/v1/manual/types/#Parametric-Types-1) of the MDP type. 
For example, if our states and actions are both represented by integers, we can define our MDP type as follows: + +```julia +struct MyMDP <: MDP{Int64, Int64} # MDP{StateType, ActionType} + # fields go here +end +``` + +In our grid world problem, we will represent the states using a custom type that designates the `x` and `y` coordinate within the grid. The actions will by represented by a symbol. + +### GridWorldState +There are numerous ways to represent the state of the agent in a grid world. We will use a custom type that designates the `x` and `y` coordinate within the grid. + +```@example gridworld_mdp +struct GridWorldState + x::Int64 + y::Int64 +end +``` + +To help us later, let's extend the `==` for our `GridWorldStat`: + +```@example gridworld_mdp +function Base.:(==)(s1::GridWorldState, s2::GridWorldState) + return s1.x == s2.x && s1.y == s2.y +end +``` + +### GridWorld Actions +Since our action is the direction the agent chooses to go (i.e. up, down, left, right), we can use a Symbol to represent it. Note that in this case, we are not defining a custom type for our action, instead we represent it directly with a symbol. Our actions will be `:up`, `:down`, `:left`, and `:right`. + +### GridWorldMDP +Now that we have defined our types for states and actions, we can define our MDP type. We will call it `GridWorldMDP` and it will be a subtype of `MDP{GridWorldState, Symbol}`. + +```@example gridworld_mdp +struct GridWorldMDP <: MDP{GridWorldState, Symbol} + size_x::Int64 # x size of the grid + size_y::Int64 # y size of the grid + reward_states_values::Dict{GridWorldState, Float64} # Dictionary mapping reward states to their values + hit_wall_reward::Float64 # reward for hitting a wall + tprob::Float64 # probability of transitioning to the desired state + discount_factor::Float64 # disocunt factor +end +``` + +We can define a constructor for our `GridWorldMDP` to make it easier to create instances of our MDP. + +```@example gridworld_mdp +function GridWorldMDP(; + size_x::Int64=10, + size_y::Int64=10, + reward_states_values::Dict{GridWorldState, Float64}=Dict( + GridWorldState(4, 3) => -10.0, + GridWorldState(4, 6) => -5.0, + GridWorldState(9, 3) => 10.0, + GridWorldState(8, 8) => 3.0), + hit_wall_reward::Float64=-1.0, + tprob::Float64=0.7, + discount_factor::Float64=0.9) + return GridWorldMDP(size_x, size_y, reward_states_values, hit_wall_reward, tprob, discount_factor) +end +``` + +To help us visualize our MDP, we can extend `show` for our `GridWorldMDP` type: + +```@example gridworld_mdp +function Base.show(io::IO, mdp::GridWorldMDP) + println(io, "Grid World MDP") + println(io, "\tSize x: $(mdp.size_x)") + println(io, "\tSize y: $(mdp.size_y)") + println(io, "\tReward states:") + for (key, value) in mdp.reward_states_values + println(io, "\t\t$key => $value") + end + println(io, "\tHit wall reward: $(mdp.hit_wall_reward)") + println(io, "\tTransition probability: $(mdp.tprob)") + println(io, "\tDiscount: $(mdp.discount_factor)") +end +``` + +Now lets create an instance of our `GridWorldMDP`: + +```@example gridworld_mdp +mdp = GridWorldMDP() + +``` + +!!! note + In this definition of the problem, our coordiates start in the bottom left of the grid. That is GridState(1, 1) is the bottom left of the grid and GridState(10, 10) would be on the right of the grid with a grid size of 10 by 10. + +## Grid World State Space +The state space in an MDP represents all the states in the problem. There are two primary functionalities that we want our spaces to support. 
We want to be able to iterate over the state space (for Value Iteration for example), and sometimes we want to be able to sample form the state space (used in some POMDP solvers). In this notebook, we will only look at iterable state spaces. + +Since we can iterate over elements of an array, and our problem is small, we can store all of our states in an array. We also have a terminal state based on the definition of our problem. We can represent that as a location outside of the grid (i.e. `(-1, -1)`). + +```@example gridworld_mdp +function POMDPs.states(mdp::GridWorldMDP) + states_array = GridWorldState[] + for x in 1:mdp.size_x + for y in 1:mdp.size_y + push!(states_array, GridWorldState(x, y)) + end + end + push!(states_array, GridWorldState(-1, -1)) # Adding the terminal state + return states_array +end +``` + +Let's view some of the states in our state space: + +```@example gridworld_mdp +@show states(mdp)[1:5] + +``` + +We also need a other functions related to the state space. + +```@example gridworld_mdp +# Check if a state is the terminal state +POMDPs.isterminal(mdp::GridWorldMDP, s::GridWorldState) = s == GridWorldState(-1, -1) + +# Define the initial state distribution (always start in the bottom left) +POMDPs.initialstate(mdp::GridWorldMDP) = Deterministic(GridWorldState(1, 1)) + +# Function that returns the index of a state in the state space +function POMDPs.stateindex(mdp::GridWorldMDP, s::GridWorldState) + if isterminal(mdp, s) + return length(states(mdp)) + end + + @assert 1 <= s.x <= mdp.size_x "Invalid state" + @assert 1 <= s.y <= mdp.size_y "Invalid state" + + si = (s.x - 1) * mdp.size_y + s.y + return si +end + +``` + + +### Large State Spaces +If your problem is very large we probably do not want to store all of our states in an array. We can create an iterator using indexing functions to help us out. One way of doing this is to define a function that returns a state from an index and then construct an iterator. This is an example of how we can do that for the Grid World problem. + +!!! note + If you run this section, you will redefine the `states(::GridWorldMDP)` that we just defined in the previous section. 
+ +```@example gridworld_mdp + + # Define the length of the state space, number of grid locations plus the terminal state + Base.length(mdp::GridWorldMDP) = mdp.size_x * mdp.size_y + 1 + + # `states` now returns the mdp, which we will construct our iterator from + POMDPs.states(mdp::GridWorldMDP) = mdp + + function Base.getindex(mdp::GridWorldMDP, si::Int) # Enables mdp[si] + @assert si <= length(mdp) "Index out of bounds" + @assert si > 0 "Index out of bounds" + + # First check if we are in the terminal state (which we define as the last state) + if si == length(mdp) + return GridWorldState(-1, -1) + end + + # Otherwise, we need to calculate the x and y coordinates + y = (si - 1) % mdp.size_y + 1 + x = div((si - 1), mdp.size_y) + 1 + return GridWorldState(x, y) + end + + function Base.getindex(mdp::GridWorldMDP, si_range::UnitRange{Int}) # Enables mdp[1:5] + return [getindex(mdp, si) for si in si_range] + end + + Base.firstindex(mdp::GridWorldMDP) = 1 # Enables mdp[begin] + Base.lastindex(mdp::GridWorldMDP) = length(mdp) # Enables mdp[end] + + # We can now construct an iterator + function Base.iterate(mdp::GridWorldMDP, ii::Int=1) + if ii > length(mdp) + return nothing + end + s = getindex(mdp, ii) + return (s, ii + 1) + end + + +``` + +Similar to above, let's iterate over a few of the states in our state space: + +```@example gridworld_mdp +@show states(mdp)[1:5] +@show mdp[begin] +@show mdp[end] + +``` + +## Grid World Action Space +The action space is the set of all actions availiable to the agent. In the grid world problem the action space consists of up, down, left, and right. We can define the action space by implementing a new method of the actions function. + +```@example gridworld_mdp +POMDPs.actions(mdp::GridWorldMDP) = [:up, :down, :left, :right] +``` + +Similar to the state space, we need a function that returns an index given an action. + +```@example gridworld_mdp +function POMDPs.actionindex(mdp::GridWorldMDP, a::Symbol) + @assert in(a, actions(mdp)) "Invalid action" + return findfirst(x -> x == a, actions(mdp)) +end + +``` + +## Grid World Transition Function +MDPs often define the transition function as $T(s^{\prime} \mid s, a)$, which is the probability of transitioning to state $s^{\prime}$ given that we are in state $s$ and take action $a$. For the POMDPs.jl interface, we define the transition function as a distribution over the next states. That is, we want $T(\cdot \mid s, a)$ which is a function that takes in a state and an action and returns a distribution over the next states. + +For our grid world example, there are only a few states to which the agent can transition and thus only a few states with nonzero probaility in $T(\cdot \mid s, a)$. We can use the `SparseCat` distribution to represent this. The `SparseCat` distribution is a categorical distribution that only stores the nonzero probabilities. 
We can define our transition function as follows: + +```@example gridworld_mdp +function POMDPs.transition(mdp::GridWorldMDP, s::GridWorldState, a::Symbol) + # If we are in the terminal state, we stay in the terminal state + if isterminal(mdp, s) + return SparseCat([s], [1.0]) + end + + # If we are in a positive reward state, we transition to the terminal state + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] > 0 + return SparseCat([GridWorldState(-1, -1)], [1.0]) + end + + # Probability of going in a direction other than the desired direction + tprob_other = (1 - mdp.tprob) / 3 + + new_state_up = GridWorldState(s.x, min(s.y + 1, mdp.size_y)) + new_state_down = GridWorldState(s.x, max(s.y - 1, 1)) + new_state_left = GridWorldState(max(s.x - 1, 1), s.y) + new_state_right = GridWorldState(min(s.x + 1, mdp.size_x), s.y) + + new_state_vector = [new_state_up, new_state_down, new_state_left, new_state_right] + t_prob_vector = fill(tprob_other, 4) + + if a == :up + t_prob_vector[1] = mdp.tprob + elseif a == :down + t_prob_vector[2] = mdp.tprob + elseif a == :left + t_prob_vector[3] = mdp.tprob + elseif a == :right + t_prob_vector[4] = mdp.tprob + else + error("Invalid action") + end + + # Combine probabilities for states that are the same + for i in 1:4 + for j in (i + 1):4 + if new_state_vector[i] == new_state_vector[j] + t_prob_vector[i] += t_prob_vector[j] + t_prob_vector[j] = 0.0 + end + end + end + + # Remove states with zero probability + new_state_vector = new_state_vector[t_prob_vector .> 0] + t_prob_vector = t_prob_vector[t_prob_vector .> 0] + + return SparseCat(new_state_vector, t_prob_vector) +end + +``` + +Let's examline a few transitions: + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(1, 1), :up) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(1, 1), :left) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(9, 3), :right) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(-1, -1), :down) + +``` + +## Grid World Reward Function + +In our problem, we have a reward function that depends on the next state as well (i.e. if we hit a wall, we stay in the same state and get a reward of $-1$). We can still construct a reward function that only depends on the current state and action by using expectation over the next state. That is, we can define our reward function as $R(s, a) = \mathbb{E}_{s^{\prime} \sim T(\cdot \mid s, a)}[R(s, a, s^{\prime})]$. + +```@example gridworld_mdp +# First, let's define the reward function given the state, action, and next state +function POMDPs.reward(mdp::GridWorldMDP, s::GridWorldState, a::Symbol, sp::GridWorldState) + # If we are in the terminal state, we get a reward of 0 + if isterminal(mdp, s) + return 0.0 + end + + # If we are in a positive reward state, we get the reward of that state + # For a positive reward, we transition to the terminal state, so we don't have + # to worry about the next state (i.g. 
hitting a wall) + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] > 0 + return mdp.reward_states_values[s] + end + + # If we are in a negative reward state, we get the reward of that state + # If the negative reward state is on the edge of the grid, we can also be in this state + # and hit a wall, so we need to check for that + r = 0.0 + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] < 0 + r += mdp.reward_states_values[s] + end + + # If we hit a wall, we get a reward of -1 + if s == sp + r += mdp.hit_wall_reward + end + + return r +end + +# Now we can define the reward function given the state and action +function POMDPs.reward(mdp::GridWorldMDP, s::GridWorldState, a::Symbol) + r = 0.0 + for (sp, p) in transition(mdp, s, a) + r += p * reward(mdp, s, a, sp) + end + return r +end + +``` + +Let's examine a few rewards: + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(1, 1), :up) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(1, 1), :left) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(9, 3), :right) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(-1, -1), :down) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(2, 3), :up) + +``` + +## Grid World Remaining Functions +We are almost done! We still need to define `discount`. Let's first use `POMDPLinter` to check if we have defined all the functions we need for DiscreteValueIteration: + +```@example gridworld_mdp +using POMDPLinter + +@show_requirements POMDPs.solve(ValueIterationSolver(), mdp) + +``` +As we expected, we need to define `discount`. + +```@example gridworld_mdp +function POMDPs.discount(mdp::GridWorldMDP) + return mdp.discount_factor +end + +``` + +Let's check again: + +```@example gridworld_mdp +@show_requirements POMDPs.solve(ValueIterationSolver(), mdp) + +``` + +## Solving the Grid World MDP (Value Iteration) +Now that we have defined our MDP, we can solve it using Value Iteration. We will use the `ValueIterationSolver` from the [DiscreteValueIteration](https://github.com/JuliaPOMDP/DiscreteValueIteration.jl) package. First, we construct the a Solver type which contains the solver parameters. Then we call `POMDPs.solve` to solve the MDP and return a policy. + +```@example gridworld_mdp +# Initialize the problem (we have already done this, but just calling it again for completeness in the example) +mdp = GridWorldMDP() + +# Initialize the solver with desired parameters +solver = ValueIterationSolver(; max_iterations=100, belres=1e-3, verbose=true) + +# Solve for an optimal policy +vi_policy = POMDPs.solve(solver, mdp) +nothing # hide + +``` + +We can now use the policy to compute the optimal action for a given state: + +```@example gridworld_mdp +s = GridWorldState(9, 2) +@show action(vi_policy, s) + +``` + +```@example gridworld_mdp +s = GridWorldState(8, 3) +@show action(vi_policy, s) + +``` + +## Solving the Grid World MDP (MCTS) +Similar to the process with Value Iteration, we can solve the MDP using MCTS. We will use the `MCTSSolver` from the [MCTS](https://github.com/JuliaPOMDP/MCTS.jl) package. + +```@example gridworld_mdp +# Initialize the problem (we have already done this, but just calling it again for completeness in the example) +mdp = GridWorldMDP() + +# Initialize the solver with desired parameters +solver = MCTSSolver(n_iterations=1000, depth=20, exploration_constant=10.0) + +# Now we construct a planner by calling POMDPs.solve. 
For online planners, the computation for the +# optimal action occurs in the call to `action`. +mcts_planner = POMDPs.solve(solver, mdp) +nothing # hide + +``` + +Similar to the value iteration policy, we can use the policy to compute the action for a given state: + +```@example gridworld_mdp +s = GridWorldState(9, 2) +@show action(mcts_planner, s) + +``` + +```@example gridworld_mdp +s = GridWorldState(8, 3) +@show action(mcts_planner, s) + +``` + +## Visualizing the Value Iteration Policy +We can visualize the value iteration policy by plotting the value function and the policy. We can use numerous plotting packages to do this, but we will use [UnicodePlots](https://github.com/JuliaPlots/UnicodePlots.jl) for this example. + +```@example gridworld_mdp +using UnicodePlots +using Printf +``` + +### Value Function as a Heatmap +We can plot the value function as a heatmap. The value function is a function over the state space, so we need to iterate over the state space and store the value at each state. We can use the `value` function to evaluate the value function at a given state. + +```@example gridworld_mdp +# Initialize the value function array +value_function = zeros(mdp.size_y, mdp.size_x) + +# Iterate over the state space and store the value at each state +for s in states(mdp) + if isterminal(mdp, s) + continue + end + value_function[s.y, s.x] = value(vi_policy, s) +end + +# Plot the value function +heatmap(value_function; + title="GridWorld VI Value Function", + xlabel="x position", + ylabel="y position", + colormap=:inferno +) + +``` + +!!! note + Rendering of unicode plots in the documentation is not optimal. For a better image, run this locally in a REPL. + +### Visualizing the Value Iteration Policy +One way to visualize the policy is to plot the action that the policy takes at each state. + +```@example gridworld_mdp +# Initialize the policy array +policy_array = fill(:up, mdp.size_x, mdp.size_y) + +# Iterate over the state space and store the action at each state +for s in states(mdp) + if isterminal(mdp, s) + continue + end + policy_array[s.x, s.y] = action(vi_policy, s) +end + +# Let's define a mapping from symbols to unicode arrows +arrow_map = Dict( + :up => " ↑ ", + :down => " ↓ ", + :left => " ← ", + :right => " → " +) + +# Plot the policy to the terminal, with the origin in the bottom left +@printf(" GridWorld VI Policy \n") +for y in mdp.size_y+1:-1:0 + if y == mdp.size_y+1 || y == 0 + for xi in 0:10 + if xi == 0 + print(" ") + elseif y == mdp.size_y+1 + print("___") + else + print("---") + end + end + else + for x in 0:mdp.size_x+1 + if x == 0 + @printf("%2d |", y) + elseif x == mdp.size_x + 1 + print("|") + else + print(arrow_map[policy_array[x, y]]) + end + end + end + println() + if y == 0 + for xi in 0:10 + if xi == 0 + print(" ") + else + print(" $xi ") + end + end + end +end +``` + +## Seeing a Policy In Action +Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use [POMDPGifs](https://github.com/JuliaPOMDP/POMDPGifs.jl). To use POMDPGifs, we need to extend the [`POMDPTools.render`](@ref) function to `GridWorldMDP`. Please reference [Gallery of POMDPs.jl Problems](@ref) for examples of this process. 
\ No newline at end of file diff --git a/.history/docs/src/example_gridworld_mdp_20240704222519.md b/.history/docs/src/example_gridworld_mdp_20240704222519.md new file mode 100644 index 00000000..1cd68847 --- /dev/null +++ b/.history/docs/src/example_gridworld_mdp_20240704222519.md @@ -0,0 +1,592 @@ +# GridWorld MDP Tutorial + +In this tutorial, we provide a simple example of how to define a Markov decision process (MDP) using the POMDPs.jl interface. We will then solve the MDP using value iteration and Monte Carlo tree search (MCTS). We will walk through constructing the MDP using the explicit interface, which involves defining a new type for the MDP and then extending different components of the POMDPs.jl interface for that type. + +## Dependencies + +We need a few packages in order to run this example. All of them can be added by running the following commands in the Julia REPL: + +```julia +using Pkg + +Pkg.add("POMDPs") +Pkg.add("POMDPTools") +Pkg.add("DiscreteValueIteration") +Pkg.add("MCTS") +``` + +If you already have the packages installed, it is prudent to update them to the latest version: + +```julia +Pkg.update() +``` + +Now that we have the packages installed, we can load them into our workspace: + +```@example gridworld_mdp +using POMDPs +using POMDPTools +using DiscreteValueIteration +using MCTS +``` + +## Problem Overview + +In Grid World, we are trying to control an agent who has trouble moving in the desired direction. In our problem, we have four reward states within the grid. Each position on the grid represents a state, and the positive reward states are terminal (the agent stops receiving reward after reaching them and performing an action from that state). The agent has four actions to choose from: up, down, left, right. The agent moves in the desired direction with a probability of $0.7$, and with a probability of $0.1$ in each of the remaining three directions. If the agent bumps into the outside wall, there is a penalty of $1$ (i.e. reward of $-1$). The problem has the following form: + +![Grid World](examples/grid_world_overview.gif) + +## Defining the Grid World MDP Type + +In POMDPs.jl, an MDP is defined by creating a subtype of the `MDP` abstract type. The types of the states and actions for the MDP are declared as [parameters](https://docs.julialang.org/en/v1/manual/types/#Parametric-Types-1) of the MDP type. For example, if our states and actions are both represented by integers, we can define our MDP type as follows: + +```julia +struct MyMDP <: MDP{Int64, Int64} # MDP{StateType, ActionType} + # fields go here +end +``` + +In our grid world problem, we will represent the states using a custom type that designates the `x` and `y` coordinate within the grid. The actions will be represented by a symbol. + +### GridWorldState +There are numerous ways to represent the state of the agent in a grid world. We will use a custom type that designates the `x` and `y` coordinate within the grid. + +```@example gridworld_mdp +struct GridWorldState + x::Int64 + y::Int64 +end +``` + +To help us later, let's extend `==` for our `GridWorldState`: + +```@example gridworld_mdp +function Base.:(==)(s1::GridWorldState, s2::GridWorldState) + return s1.x == s2.x && s1.y == s2.y +end +``` + +### GridWorld Actions +Since our action is the direction the agent chooses to go (i.e. up, down, left, right), we can use a Symbol to represent it. Note that in this case, we are not defining a custom type for our action; instead, we represent it directly with a symbol.
Our actions will be `:up`, `:down`, `:left`, and `:right`. + +### GridWorldMDP +Now that we have defined our types for states and actions, we can define our MDP type. We will call it `GridWorldMDP` and it will be a subtype of `MDP{GridWorldState, Symbol}`. + +```@example gridworld_mdp +struct GridWorldMDP <: MDP{GridWorldState, Symbol} + size_x::Int64 # x size of the grid + size_y::Int64 # y size of the grid + reward_states_values::Dict{GridWorldState, Float64} # Dictionary mapping reward states to their values + hit_wall_reward::Float64 # reward for hitting a wall + tprob::Float64 # probability of transitioning to the desired state + discount_factor::Float64 # discount factor +end +``` + +We can define a constructor for our `GridWorldMDP` to make it easier to create instances of our MDP. + +```@example gridworld_mdp +function GridWorldMDP(; + size_x::Int64=10, + size_y::Int64=10, + reward_states_values::Dict{GridWorldState, Float64}=Dict( + GridWorldState(4, 3) => -10.0, + GridWorldState(4, 6) => -5.0, + GridWorldState(9, 3) => 10.0, + GridWorldState(8, 8) => 3.0), + hit_wall_reward::Float64=-1.0, + tprob::Float64=0.7, + discount_factor::Float64=0.9) + return GridWorldMDP(size_x, size_y, reward_states_values, hit_wall_reward, tprob, discount_factor) +end +``` + +To help us visualize our MDP, we can extend `show` for our `GridWorldMDP` type: + +```@example gridworld_mdp +function Base.show(io::IO, mdp::GridWorldMDP) + println(io, "Grid World MDP") + println(io, "\tSize x: $(mdp.size_x)") + println(io, "\tSize y: $(mdp.size_y)") + println(io, "\tReward states:") + for (key, value) in mdp.reward_states_values + println(io, "\t\t$key => $value") + end + println(io, "\tHit wall reward: $(mdp.hit_wall_reward)") + println(io, "\tTransition probability: $(mdp.tprob)") + println(io, "\tDiscount: $(mdp.discount_factor)") +end +``` + +Now let's create an instance of our `GridWorldMDP`: + +```@example gridworld_mdp +mdp = GridWorldMDP() + +``` + +!!! note + In this definition of the problem, our coordinates start in the bottom left of the grid. That is, `GridWorldState(1, 1)` is the bottom left of the grid and `GridWorldState(10, 10)` would be the top right of the grid with a grid size of 10 by 10. + +## Grid World State Space +The state space in an MDP represents all the states in the problem. There are two primary functionalities that we want our spaces to support. We want to be able to iterate over the state space (for Value Iteration for example), and sometimes we want to be able to sample from the state space (used in some POMDP solvers). In this tutorial, we will only look at iterable state spaces. + +Since we can iterate over elements of an array, and our problem is small, we can store all of our states in an array. We also have a terminal state based on the definition of our problem. We can represent that as a location outside of the grid (i.e. `(-1, -1)`). + +```@example gridworld_mdp +function POMDPs.states(mdp::GridWorldMDP) + states_array = GridWorldState[] + for x in 1:mdp.size_x + for y in 1:mdp.size_y + push!(states_array, GridWorldState(x, y)) + end + end + push!(states_array, GridWorldState(-1, -1)) # Adding the terminal state + return states_array +end +``` + +Let's view some of the states in our state space: + +```@example gridworld_mdp +@show states(mdp)[1:5] + +``` + +We also need a few other functions related to the state space.
+ +```@example gridworld_mdp +# Check if a state is the terminal state +POMDPs.isterminal(mdp::GridWorldMDP, s::GridWorldState) = s == GridWorldState(-1, -1) + +# Define the initial state distribution (always start in the bottom left) +POMDPs.initialstate(mdp::GridWorldMDP) = Deterministic(GridWorldState(1, 1)) + +# Function that returns the index of a state in the state space +function POMDPs.stateindex(mdp::GridWorldMDP, s::GridWorldState) + if isterminal(mdp, s) + return length(states(mdp)) + end + + @assert 1 <= s.x <= mdp.size_x "Invalid state" + @assert 1 <= s.y <= mdp.size_y "Invalid state" + + si = (s.x - 1) * mdp.size_y + s.y + return si +end + +``` + + +### Large State Spaces +If your problem is very large we probably do not want to store all of our states in an array. We can create an iterator using indexing functions to help us out. One way of doing this is to define a function that returns a state from an index and then construct an iterator. This is an example of how we can do that for the Grid World problem. + +!!! note + If you run this section, you will redefine the `states(::GridWorldMDP)` that we just defined in the previous section. + +```@example gridworld_mdp + + # Define the length of the state space, number of grid locations plus the terminal state + Base.length(mdp::GridWorldMDP) = mdp.size_x * mdp.size_y + 1 + + # `states` now returns the mdp, which we will construct our iterator from + POMDPs.states(mdp::GridWorldMDP) = mdp + + function Base.getindex(mdp::GridWorldMDP, si::Int) # Enables mdp[si] + @assert si <= length(mdp) "Index out of bounds" + @assert si > 0 "Index out of bounds" + + # First check if we are in the terminal state (which we define as the last state) + if si == length(mdp) + return GridWorldState(-1, -1) + end + + # Otherwise, we need to calculate the x and y coordinates + y = (si - 1) % mdp.size_y + 1 + x = div((si - 1), mdp.size_y) + 1 + return GridWorldState(x, y) + end + + function Base.getindex(mdp::GridWorldMDP, si_range::UnitRange{Int}) # Enables mdp[1:5] + return [getindex(mdp, si) for si in si_range] + end + + Base.firstindex(mdp::GridWorldMDP) = 1 # Enables mdp[begin] + Base.lastindex(mdp::GridWorldMDP) = length(mdp) # Enables mdp[end] + + # We can now construct an iterator + function Base.iterate(mdp::GridWorldMDP, ii::Int=1) + if ii > length(mdp) + return nothing + end + s = getindex(mdp, ii) + return (s, ii + 1) + end + + +``` + +Similar to above, let's iterate over a few of the states in our state space: + +```@example gridworld_mdp +@show states(mdp)[1:5] +@show mdp[begin] +@show mdp[end] + +``` + +## Grid World Action Space +The action space is the set of all actions available to the agent. In the grid world problem the action space consists of up, down, left, and right. We can define the action space by implementing a new method of the actions function. + +```@example gridworld_mdp +POMDPs.actions(mdp::GridWorldMDP) = [:up, :down, :left, :right] +``` + +Similar to the state space, we need a function that returns an index given an action. + +```@example gridworld_mdp +function POMDPs.actionindex(mdp::GridWorldMDP, a::Symbol) + @assert in(a, actions(mdp)) "Invalid action" + return findfirst(x -> x == a, actions(mdp)) +end + +``` + +## Grid World Transition Function +MDPs often define the transition function as $T(s^{\prime} \mid s, a)$, which is the probability of transitioning to state $s^{\prime}$ given that we are in state $s$ and take action $a$. 
For the POMDPs.jl interface, we define the transition function as a distribution over the next states. That is, we want $T(\cdot \mid s, a)$ which is a function that takes in a state and an action and returns a distribution over the next states. + +For our grid world example, there are only a few states to which the agent can transition and thus only a few states with nonzero probability in $T(\cdot \mid s, a)$. We can use the `SparseCat` distribution to represent this. The `SparseCat` distribution is a categorical distribution that only stores the nonzero probabilities. We can define our transition function as follows: + +```@example gridworld_mdp +function POMDPs.transition(mdp::GridWorldMDP, s::GridWorldState, a::Symbol) + # If we are in the terminal state, we stay in the terminal state + if isterminal(mdp, s) + return SparseCat([s], [1.0]) + end + + # If we are in a positive reward state, we transition to the terminal state + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] > 0 + return SparseCat([GridWorldState(-1, -1)], [1.0]) + end + + # Probability of going in a direction other than the desired direction + tprob_other = (1 - mdp.tprob) / 3 + + new_state_up = GridWorldState(s.x, min(s.y + 1, mdp.size_y)) + new_state_down = GridWorldState(s.x, max(s.y - 1, 1)) + new_state_left = GridWorldState(max(s.x - 1, 1), s.y) + new_state_right = GridWorldState(min(s.x + 1, mdp.size_x), s.y) + + new_state_vector = [new_state_up, new_state_down, new_state_left, new_state_right] + t_prob_vector = fill(tprob_other, 4) + + if a == :up + t_prob_vector[1] = mdp.tprob + elseif a == :down + t_prob_vector[2] = mdp.tprob + elseif a == :left + t_prob_vector[3] = mdp.tprob + elseif a == :right + t_prob_vector[4] = mdp.tprob + else + error("Invalid action") + end + + # Combine probabilities for states that are the same + for i in 1:4 + for j in (i + 1):4 + if new_state_vector[i] == new_state_vector[j] + t_prob_vector[i] += t_prob_vector[j] + t_prob_vector[j] = 0.0 + end + end + end + + # Remove states with zero probability + new_state_vector = new_state_vector[t_prob_vector .> 0] + t_prob_vector = t_prob_vector[t_prob_vector .> 0] + + return SparseCat(new_state_vector, t_prob_vector) +end + +``` + +Let's examline a few transitions: + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(1, 1), :up) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(1, 1), :left) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(9, 3), :right) + +``` + +```@example gridworld_mdp +@show transition(mdp, GridWorldState(-1, -1), :down) + +``` + +## Grid World Reward Function + +In our problem, we have a reward function that depends on the next state as well (i.e. if we hit a wall, we stay in the same state and get a reward of $-1$). We can still construct a reward function that only depends on the current state and action by using expectation over the next state. That is, we can define our reward function as $R(s, a) = \mathbb{E}_{s^{\prime} \sim T(\cdot \mid s, a)}[R(s, a, s^{\prime})]$. 
+ +```@example gridworld_mdp +# First, let's define the reward function given the state, action, and next state +function POMDPs.reward(mdp::GridWorldMDP, s::GridWorldState, a::Symbol, sp::GridWorldState) + # If we are in the terminal state, we get a reward of 0 + if isterminal(mdp, s) + return 0.0 + end + + # If we are in a positive reward state, we get the reward of that state + # For a positive reward, we transition to the terminal state, so we don't have + # to worry about the next state (e.g. hitting a wall) + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] > 0 + return mdp.reward_states_values[s] + end + + # If we are in a negative reward state, we get the reward of that state + # If the negative reward state is on the edge of the grid, we can also be in this state + # and hit a wall, so we need to check for that + r = 0.0 + if s in keys(mdp.reward_states_values) && mdp.reward_states_values[s] < 0 + r += mdp.reward_states_values[s] + end + + # If we hit a wall, we get a reward of -1 + if s == sp + r += mdp.hit_wall_reward + end + + return r +end + +# Now we can define the reward function given the state and action +function POMDPs.reward(mdp::GridWorldMDP, s::GridWorldState, a::Symbol) + r = 0.0 + for (sp, p) in transition(mdp, s, a) + r += p * reward(mdp, s, a, sp) + end + return r +end + +``` + +Let's examine a few rewards: + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(1, 1), :up) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(1, 1), :left) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(9, 3), :right) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(-1, -1), :down) + +``` + +```@example gridworld_mdp +@show reward(mdp, GridWorldState(2, 3), :up) + +``` + +## Grid World Remaining Functions +We are almost done! We still need to define `discount`. Let's first use `POMDPLinter` to check if we have defined all the functions we need for DiscreteValueIteration: + +```@example gridworld_mdp +using POMDPLinter + +@show_requirements POMDPs.solve(ValueIterationSolver(), mdp) + +``` +As we expected, we need to define `discount`. + +```@example gridworld_mdp +function POMDPs.discount(mdp::GridWorldMDP) + return mdp.discount_factor +end + +``` + +Let's check again: + +```@example gridworld_mdp +@show_requirements POMDPs.solve(ValueIterationSolver(), mdp) + +``` + +## Solving the Grid World MDP (Value Iteration) +Now that we have defined our MDP, we can solve it using Value Iteration. We will use the `ValueIterationSolver` from the [DiscreteValueIteration](https://github.com/JuliaPOMDP/DiscreteValueIteration.jl) package. First, we construct a solver object, which contains the solver parameters. Then we call `POMDPs.solve` to solve the MDP and return a policy.
+ +```@example gridworld_mdp +# Initialize the problem (we have already done this, but just calling it again for completeness in the example) +mdp = GridWorldMDP() + +# Initialize the solver with desired parameters +solver = ValueIterationSolver(; max_iterations=100, belres=1e-3, verbose=true) + +# Solve for an optimal policy +vi_policy = POMDPs.solve(solver, mdp) +nothing # hide + +``` + +We can now use the policy to compute the optimal action for a given state: + +```@example gridworld_mdp +s = GridWorldState(9, 2) +@show action(vi_policy, s) + +``` + +```@example gridworld_mdp +s = GridWorldState(8, 3) +@show action(vi_policy, s) + +``` + +## Solving the Grid World MDP (MCTS) +Similar to the process with Value Iteration, we can solve the MDP using MCTS. We will use the `MCTSSolver` from the [MCTS](https://github.com/JuliaPOMDP/MCTS.jl) package. + +```@example gridworld_mdp +# Initialize the problem (we have already done this, but just calling it again for completeness in the example) +mdp = GridWorldMDP() + +# Initialize the solver with desired parameters +solver = MCTSSolver(n_iterations=1000, depth=20, exploration_constant=10.0) + +# Now we construct a planner by calling POMDPs.solve. For online planners, the computation for the +# optimal action occurs in the call to `action`. +mcts_planner = POMDPs.solve(solver, mdp) +nothing # hide + +``` + +Similar to the value iteration policy, we can use the policy to compute the action for a given state: + +```@example gridworld_mdp +s = GridWorldState(9, 2) +@show action(mcts_planner, s) + +``` + +```@example gridworld_mdp +s = GridWorldState(8, 3) +@show action(mcts_planner, s) + +``` + +## Visualizing the Value Iteration Policy +We can visualize the value iteration policy by plotting the value function and the policy. We can use numerous plotting packages to do this, but we will use [UnicodePlots](https://github.com/JuliaPlots/UnicodePlots.jl) for this example. + +```@example gridworld_mdp +using UnicodePlots +using Printf +``` + +### Value Function as a Heatmap +We can plot the value function as a heatmap. The value function is a function over the state space, so we need to iterate over the state space and store the value at each state. We can use the `value` function to evaluate the value function at a given state. + +```@example gridworld_mdp +# Initialize the value function array +value_function = zeros(mdp.size_y, mdp.size_x) + +# Iterate over the state space and store the value at each state +for s in states(mdp) + if isterminal(mdp, s) + continue + end + value_function[s.y, s.x] = value(vi_policy, s) +end + +# Plot the value function +heatmap(value_function; + title="GridWorld VI Value Function", + xlabel="x position", + ylabel="y position", + colormap=:inferno +) + +``` + +!!! note + Rendering of unicode plots in the documentation is not optimal. For a better image, run this locally in a REPL. + +### Visualizing the Value Iteration Policy +One way to visualize the policy is to plot the action that the policy takes at each state. 
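+
+As a compact alternative to the arrow-based rendering below, the policy can also be encoded as integers and displayed with the same `heatmap` function. This is only a sketch; the integer mapping is arbitrary, and the names reuse `mdp` and `vi_policy` from above:
+
+```julia
+# Sketch only: render the policy as an integer-coded heatmap (1=up, 2=down, 3=left, 4=right).
+action_index = Dict(:up => 1, :down => 2, :left => 3, :right => 4)
+policy_indices = zeros(Int, mdp.size_y, mdp.size_x)
+for s in states(mdp)
+    isterminal(mdp, s) && continue
+    policy_indices[s.y, s.x] = action_index[action(vi_policy, s)]
+end
+heatmap(policy_indices;
+    title="GridWorld VI Policy (1=up, 2=down, 3=left, 4=right)",
+    xlabel="x position",
+    ylabel="y position"
+)
+```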
+ +```@example gridworld_mdp +# Initialize the policy array +policy_array = fill(:up, mdp.size_x, mdp.size_y) + +# Iterate over the state space and store the action at each state +for s in states(mdp) + if isterminal(mdp, s) + continue + end + policy_array[s.x, s.y] = action(vi_policy, s) +end + +# Let's define a mapping from symbols to unicode arrows +arrow_map = Dict( + :up => " ↑ ", + :down => " ↓ ", + :left => " ← ", + :right => " → " +) + +# Plot the policy to the terminal, with the origin in the bottom left +@printf(" GridWorld VI Policy \n") +for y in mdp.size_y+1:-1:0 + if y == mdp.size_y+1 || y == 0 + for xi in 0:10 + if xi == 0 + print(" ") + elseif y == mdp.size_y+1 + print("___") + else + print("---") + end + end + else + for x in 0:mdp.size_x+1 + if x == 0 + @printf("%2d |", y) + elseif x == mdp.size_x + 1 + print("|") + else + print(arrow_map[policy_array[x, y]]) + end + end + end + println() + if y == 0 + for xi in 0:10 + if xi == 0 + print(" ") + else + print(" $xi ") + end + end + end +end +``` + +## Seeing a Policy In Action +Another useful tool is to view the policy in action by creating a gif of a simulation. To accomplish this, we could use [POMDPGifs](https://github.com/JuliaPOMDP/POMDPGifs.jl). To use POMDPGifs, we need to extend the [`POMDPTools.render`](@ref) function to `GridWorldMDP`. Please reference [Gallery of POMDPs.jl Problems](@ref) for examples of this process. \ No newline at end of file diff --git a/.history/docs/src/example_simulations_20240704220213.md b/.history/docs/src/example_simulations_20240704220213.md new file mode 100644 index 00000000..cd6b5e95 --- /dev/null +++ b/.history/docs/src/example_simulations_20240704220213.md @@ -0,0 +1,174 @@ + +# Simulations Examples + +In these simulation examples, we will use the crying baby POMDPs defined in the [Defining a POMDP](@ref) section (i.e. [`quick_crying_baby_pomdp`](@ref quick_crying), [`explicit_crying_baby_pomdp`](@ref explicit_crying), [`gen_crying_baby_pomdp`](@ref gen_crying), and [`tabular_crying_baby_pomdp`](@ref tab_crying)). + +```@setup crying_sim +include("examples/crying_baby_examples.jl") +include("examples/crying_baby_solvers.jl") +``` + +## Stepthrough +The stepthrough simulater provides a window into the simulation with a for-loop syntax. + +Within the body of the for loop, we have access to the belief, the action, the observation, and the reward, in each step. We also calculate the sum of the rewards in this example, but note that this is _not_ the _discounted reward_. + +```@example crying_sim +function run_step_through_simulation() # hide +policy = RandomPolicy(quick_crying_baby_pomdp) +r_sum = 0.0 +step = 0 +for (b, s, a, o, r) in stepthrough(quick_crying_baby_pomdp, policy, DiscreteUpdater(quick_crying_baby_pomdp), "b,s,a,o,r"; max_steps=4) + step += 1 + println("Step $step") + println("b = sated => $(b.b[1]), hungry => $(b.b[2])") + @show s + @show a + @show o + @show r + r_sum += r + @show r_sum + println() +end +end #hide + +run_step_through_simulation() # hide +``` + +## Rollout Simulations +While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a `Simulator` object. The simplest Simulator is the `RolloutSimulator`. It simply runs a simulation and returns the discounted reward. 
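+
+Because a single rollout is a noisy estimate of the expected discounted return, it is common to average over many rollouts. Here is a minimal sketch, assuming the crying baby POMDP and a policy from this section (the single-simulation form appears in the example below):
+
+```julia
+# Sketch only: estimate the expected discounted return by averaging many rollouts.
+using Statistics: mean, std
+
+policy = RandomPolicy(quick_crying_baby_pomdp)
+sim = RolloutSimulator(max_steps=10)
+returns = [simulate(sim, quick_crying_baby_pomdp, policy) for _ in 1:1_000]
+println("mean return = $(mean(returns)), std = $(std(returns))")
+```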
+ +```@example crying_sim +function run_rollout_simulation() # hide +policy = RandomPolicy(explicit_crying_baby_pomdp) +sim = RolloutSimulator(max_steps=10) +r_sum = simulate(sim, explicit_crying_baby_pomdp, policy) +println("Total discounted reward: $r_sum") +end # hide +run_rollout_simulation() # hide +``` + +## Recording Histories +Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a `HistoryRecorder`. + +```@example crying_sim +policy = RandomPolicy(tabular_crying_baby_pomdp) +hr = HistoryRecorder(max_steps=5) +history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1)) +nothing # hide +``` + +The history object produced by a `HistoryRecorder` is a `SimHistory`, documented in the POMDPTools simulater section [Histories](@ref). The information in this object can be accessed in several ways. For example, there is a function: +```@example crying_sim +discounted_reward(history) +``` +Accessor functions like `state_hist` and `action_hist` can also be used to access parts of the history: +```@example crying_sim +state_hist(history) +``` +``` @example crying_sim +collect(action_hist(history)) +``` + +Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state, and ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states. + +The most powerful function for accessing the information in a `SimHistory` is the `eachstep` function which returns an iterator through named tuples representing each step in the history. The `eachstep` function is similar to the `stepthrough` function above except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is being carried out. + +```@example crying_sim +function demo_eachstep(sim_history) # hide +r_sum = 0.0 +step = 0 +for step_i in eachstep(sim_history, "b,s,a,o,r") + step += 1 + println("Step $step") + println("step_i.b = sated => $(step_i.b.b[1]), hungry => $(step_i.b.b[2])") + @show step_i.s + @show step_i.a + @show step_i.o + @show step_i.r + r_sum += step_i.r + @show r_sum + println() +end +end # hide +demo_eachstep(history) # hide +``` + +## Parallel Simulations +It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of `Sim` objects, each of which contains all of the information needed to run a simulation. Then then run the simulations using `run_parallel`, which will return a `DataFrame` with the results. + +In this example, we will compare the performance of the polcies we computed in the [Using Different Solvers](@ref) section (i.e. `sarsop_policy`, `pomcp_planner`, and `heuristic_policy`). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 `Sim` objects of each policy to the list. 
+ +```@example crying_sim +using DataFrames +using StatsBase: std + +# Defining paramters for the simulations +number_of_sim_to_run = 100 +max_steps = 20 +starting_seed = 1 + +# We will also compare against a random policy +rand_policy = RandomPolicy(quick_crying_baby_pomdp, rng=MersenneTwister(1)) + +# Create the list of Sim objects +sim_list = [] + +# Add 100 Sim objects of each policy to the list. +for sim_number in 1:number_of_sim_to_run + seed = starting_seed + sim_number + + # Add the SARSOP policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + sarsop_policy, + max_steps=max_steps, + metadata=Dict(:policy => "sarsop", :seed => seed)) + ) + + # Add the POMCP policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + pomcp_planner, + max_steps=max_steps, + metadata=Dict(:policy => "pomcp", :seed => seed)) + ) + + # Add the heuristic policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + heuristic_policy, + max_steps=max_steps, + metadata=Dict(:policy => "heuristic", :seed => seed)) + ) + + # Add the random policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + rand_policy, + max_steps=max_steps, + metadata=Dict(:policy => "random", :seed => seed)) + ) +end + +# Run the simulations in parallel +data = run_parallel(sim_list) + +# Define a function to calculate the mean and confidence interval +function mean_and_ci(x) + m = mean(x) + ci = 1.96 * std(x) / sqrt(length(x)) # 95% confidence interval + return (mean = m, ci = ci) +end + +# Calculate the mean and confidence interval for each policy +grouped_df = groupby(data, :policy) +result = combine(grouped_df, :reward => mean_and_ci => AsTable) + +``` + +By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the `Sim`-history pair and record additional statistics. Reference the POMDPTools simulator section for more information ([Specifying information to be recorded](@ref)). \ No newline at end of file diff --git a/.history/docs/src/example_simulations_20240704222110.md b/.history/docs/src/example_simulations_20240704222110.md new file mode 100644 index 00000000..c1cc0d0a --- /dev/null +++ b/.history/docs/src/example_simulations_20240704222110.md @@ -0,0 +1,174 @@ + +# Simulations Examples + +In these simulation examples, we will use the crying baby POMDPs defined in the [Defining a POMDP](@ref) section (i.e. [`quick_crying_baby_pomdp`](@ref quick_crying), [`explicit_crying_baby_pomdp`](@ref explicit_crying), [`gen_crying_baby_pomdp`](@ref gen_crying), and [`tabular_crying_baby_pomdp`](@ref tab_crying)). + +```@setup crying_sim +include("examples/crying_baby_examples.jl") +include("examples/crying_baby_solvers.jl") +``` + +## Stepthrough +The stepthrough simulator provides a window into the simulation with a for-loop syntax. + +Within the body of the for loop, we have access to the belief, the action, the observation, and the reward, in each step. We also calculate the sum of the rewards in this example, but note that this is _not_ the _discounted reward_. 
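+
+If the discounted return is what you actually need, it can be accumulated alongside by weighting each reward with `discount(pomdp)^(t-1)`. A minimal sketch under the same setup (the undiscounted running sum is shown in the example below):
+
+```julia
+# Sketch only: accumulate the discounted return from a stepthrough simulation.
+γ = discount(quick_crying_baby_pomdp)
+rewards = collect(stepthrough(quick_crying_baby_pomdp, RandomPolicy(quick_crying_baby_pomdp), "r"; max_steps=4))
+discounted_return = sum(γ^(t - 1) * r for (t, r) in enumerate(rewards))
+@show discounted_return
+```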
+ +```@example crying_sim +function run_step_through_simulation() # hide +policy = RandomPolicy(quick_crying_baby_pomdp) +r_sum = 0.0 +step = 0 +for (b, s, a, o, r) in stepthrough(quick_crying_baby_pomdp, policy, DiscreteUpdater(quick_crying_baby_pomdp), "b,s,a,o,r"; max_steps=4) + step += 1 + println("Step $step") + println("b = sated => $(b.b[1]), hungry => $(b.b[2])") + @show s + @show a + @show o + @show r + r_sum += r + @show r_sum + println() +end +end #hide + +run_step_through_simulation() # hide +``` + +## Rollout Simulations +While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a `Simulator` object. The simplest Simulator is the `RolloutSimulator`. It simply runs a simulation and returns the discounted reward. + +```@example crying_sim +function run_rollout_simulation() # hide +policy = RandomPolicy(explicit_crying_baby_pomdp) +sim = RolloutSimulator(max_steps=10) +r_sum = simulate(sim, explicit_crying_baby_pomdp, policy) +println("Total discounted reward: $r_sum") +end # hide +run_rollout_simulation() # hide +``` + +## Recording Histories +Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a `HistoryRecorder`. + +```@example crying_sim +policy = RandomPolicy(tabular_crying_baby_pomdp) +hr = HistoryRecorder(max_steps=5) +history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1)) +nothing # hide +``` + +The history object produced by a `HistoryRecorder` is a `SimHistory`, documented in the POMDPTools simulator section [Histories](@ref). The information in this object can be accessed in several ways. For example, there is a function: +```@example crying_sim +discounted_reward(history) +``` +Accessor functions like `state_hist` and `action_hist` can also be used to access parts of the history: +```@example crying_sim +state_hist(history) +``` +``` @example crying_sim +collect(action_hist(history)) +``` + +Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state, and ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states. + +The most powerful function for accessing the information in a `SimHistory` is the `eachstep` function which returns an iterator through named tuples representing each step in the history. The `eachstep` function is similar to the `stepthrough` function above except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is being carried out. + +```@example crying_sim +function demo_eachstep(sim_history) # hide +r_sum = 0.0 +step = 0 +for step_i in eachstep(sim_history, "b,s,a,o,r") + step += 1 + println("Step $step") + println("step_i.b = sated => $(step_i.b.b[1]), hungry => $(step_i.b.b[2])") + @show step_i.s + @show step_i.a + @show step_i.o + @show step_i.r + r_sum += step_i.r + @show r_sum + println() +end +end # hide +demo_eachstep(history) # hide +``` + +## Parallel Simulations +It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. 
To use the parallel simulator, first create a list of `Sim` objects, each of which contains all of the information needed to run a simulation. Then then run the simulations using `run_parallel`, which will return a `DataFrame` with the results. + +In this example, we will compare the performance of the policies we computed in the [Using Different Solvers](@ref) section (i.e. `sarsop_policy`, `pomcp_planner`, and `heuristic_policy`). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 `Sim` objects of each policy to the list. + +```@example crying_sim +using DataFrames +using StatsBase: std + +# Defining paramters for the simulations +number_of_sim_to_run = 100 +max_steps = 20 +starting_seed = 1 + +# We will also compare against a random policy +rand_policy = RandomPolicy(quick_crying_baby_pomdp, rng=MersenneTwister(1)) + +# Create the list of Sim objects +sim_list = [] + +# Add 100 Sim objects of each policy to the list. +for sim_number in 1:number_of_sim_to_run + seed = starting_seed + sim_number + + # Add the SARSOP policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + sarsop_policy, + max_steps=max_steps, + metadata=Dict(:policy => "sarsop", :seed => seed)) + ) + + # Add the POMCP policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + pomcp_planner, + max_steps=max_steps, + metadata=Dict(:policy => "pomcp", :seed => seed)) + ) + + # Add the heuristic policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + heuristic_policy, + max_steps=max_steps, + metadata=Dict(:policy => "heuristic", :seed => seed)) + ) + + # Add the random policy + push!(sim_list, Sim( + quick_crying_baby_pomdp, + rng=MersenneTwister(seed), + rand_policy, + max_steps=max_steps, + metadata=Dict(:policy => "random", :seed => seed)) + ) +end + +# Run the simulations in parallel +data = run_parallel(sim_list) + +# Define a function to calculate the mean and confidence interval +function mean_and_ci(x) + m = mean(x) + ci = 1.96 * std(x) / sqrt(length(x)) # 95% confidence interval + return (mean = m, ci = ci) +end + +# Calculate the mean and confidence interval for each policy +grouped_df = groupby(data, :policy) +result = combine(grouped_df, :reward => mean_and_ci => AsTable) + +``` + +By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the `Sim`-history pair and record additional statistics. Reference the POMDPTools simulator section for more information ([Specifying information to be recorded](@ref)). \ No newline at end of file diff --git a/.history/docs/src/example_solvers_20240704220213.md b/.history/docs/src/example_solvers_20240704220213.md new file mode 100644 index 00000000..069053a7 --- /dev/null +++ b/.history/docs/src/example_solvers_20240704220213.md @@ -0,0 +1,108 @@ +# Using Different Solvers +There are various solvers implemented for use out-of-the-box. Please reference the repository README for a list of [MDP Solvers](https://github.com/JuliaPOMDP/POMDPs.jl?tab=readme-ov-file#mdp-solvers) and [POMDP Solvers](https://github.com/JuliaPOMDP/POMDPs.jl?tab=readme-ov-file#pomdp-solvers) implemented and maintained by the JuliaPOMDP community. We provide a few examples of how to use a small subset of these solvers. 
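+
+The solvers used in these examples live in separate registered packages. Assuming a standard Julia environment, they can be installed once with `Pkg` before running the snippets below (a setup sketch, not part of the documented build):
+
+```julia
+# Sketch only: install the solver packages used in the examples below.
+using Pkg
+Pkg.add(["POMDPLinter", "QMDP", "NativeSARSOP", "BasicPOMCP"])
+```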
+ +```@setup crying_sim +include("examples/crying_baby_examples.jl") +``` + +## Checking Requirements +Before using a solver, it is prudent to ensure the problem meets the requirements of the solver. Please reference the solver documentation for detailed information about the requirements of each solver. + +We can use [POMDPLInter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help us determine if we have all of the required components defined for a particular solver. However, not all solvers have the requirements implemented. If/when you encounter a solver that does not have the requirements implemented, please open an issue on the solver's repository. + +Let's check if we have all of the required components of our problems for the QMDP solver. + +```@example crying_sim +using POMDPLinter +using QMDP + +qmdp_solver = QMDPSolver() + +println("Quick Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, quick_crying_baby_pomdp) + +println("\nExplicit Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, explicit_crying_baby_pomdp) + +println("\nTabular Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, tabular_crying_baby_pomdp) + +println("\nGen Crying Baby POMDP") +# We don't have an actions(::GenGryingBabyPOMDP) implemented +try + @show_requirements POMDPs.solve(qmdp_solver, gen_crying_baby_pomdp) +catch err_msg + println(err_msg) +end +``` + +## Offline (SARSOP) +In this example, we will use the [NativeSARSOP](https://github.com/JuliaPOMDP/NativeSARSOP.jl) solver. The process for generating offline polcies is similar for all offline solvers. First, we define the solver with the desired parameters. Then, we call `POMDPs.solve` with the solver and the problem. We can query the policy using the `action` function. + +```@example crying_sim +using NativeSARSOP + +# Define the solver with the desired paramters +sarsop_solver = SARSOPSolver(; max_time=10.0) + +# Solve the problem by calling POMDPs.solve. SARSOP will compute the policy and return an `AlphaVectorPolicy` +sarsop_policy = POMDPs.solve(sarsop_solver, quick_crying_baby_pomdp) + +# We can query the policy using the `action` function +b = initialstate(quick_crying_baby_pomdp) +a = action(sarsop_policy, b) + +@show a + +``` + +## Online (POMCP) +For the online solver, we will use Particle Monte Carlo Planning ([POMCP](https://github.com/JuliaPOMDP/BasicPOMCP.jl)). For online solvers, we first define the solver similar to offline solvers. However, when we call `POMDPs.solve`, we are returned an online plannner. Similar to the offline solver, we can query the policy using the `action` function and that is when the online solver will compute the action. + +```@example crying_sim +using BasicPOMCP + +pomcp_solver = POMCPSolver(; c=5.0, tree_queries=1000, rng=MersenneTwister(1)) +pomcp_planner = POMDPs.solve(pomcp_solver, quick_crying_baby_pomdp) + +b = initialstate(quick_crying_baby_pomdp) +a = action(pomcp_planner, b) + +@show a + +``` + +## Heuristic Policy +While we often want to use a solver to compute a policy, sometimes we might want to use a heuristic policy. For example, we may want to use a heuristic policy during our rollouts for online solvers or to use as a baseline. In this example, we will define a simple heuristic policy that feeds the baby if our belief of the baby being hungry is greater than 50%, otherwise we will randomly ignore or sing to the baby. 
+ +```@example crying_sim +struct HeuristicFeedPolicy{P<:POMDP} <: Policy + pomdp::P +end + +# We need to implement the action function for our policy +function POMDPs.action(policy::HeuristicFeedPolicy, b) + if pdf(b, :hungry) > 0.5 + return :feed + else + return rand([:ignore, :sing]) + end +end + +# Let's also define the default updater for our policy +function POMDPs.updater(policy::HeuristicFeedPolicy) + return DiscreteUpdater(policy.pomdp) +end + +heuristic_policy = HeuristicFeedPolicy(quick_crying_baby_pomdp) + +# Let's query the policy a few times +b = SparseCat([:sated, :hungry], [0.1, 0.9]) +a1 = action(heuristic_policy, b) + +b = SparseCat([:sated, :hungry], [0.9, 0.1]) +a2 = action(heuristic_policy, b) + +@show [a1, a2] + +``` \ No newline at end of file diff --git a/.history/docs/src/example_solvers_20240704221322.md b/.history/docs/src/example_solvers_20240704221322.md new file mode 100644 index 00000000..cee92985 --- /dev/null +++ b/.history/docs/src/example_solvers_20240704221322.md @@ -0,0 +1,108 @@ +# Using Different Solvers +There are various solvers implemented for use out-of-the-box. Please reference the repository README for a list of [MDP Solvers](https://github.com/JuliaPOMDP/POMDPs.jl?tab=readme-ov-file#mdp-solvers) and [POMDP Solvers](https://github.com/JuliaPOMDP/POMDPs.jl?tab=readme-ov-file#pomdp-solvers) implemented and maintained by the JuliaPOMDP community. We provide a few examples of how to use a small subset of these solvers. + +```@setup crying_sim +include("examples/crying_baby_examples.jl") +``` + +## Checking Requirements +Before using a solver, it is prudent to ensure the problem meets the requirements of the solver. Please reference the solver documentation for detailed information about the requirements of each solver. + +We can use [POMDPLInter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help us determine if we have all of the required components defined for a particular solver. However, not all solvers have the requirements implemented. If/when you encounter a solver that does not have the requirements implemented, please open an issue on the solver's repository. + +Let's check if we have all of the required components of our problems for the QMDP solver. + +```@example crying_sim +using POMDPLinter +using QMDP + +qmdp_solver = QMDPSolver() + +println("Quick Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, quick_crying_baby_pomdp) + +println("\nExplicit Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, explicit_crying_baby_pomdp) + +println("\nTabular Crying Baby POMDP") +@show_requirements POMDPs.solve(qmdp_solver, tabular_crying_baby_pomdp) + +println("\nGen Crying Baby POMDP") +# We don't have an actions(::GenGryingBabyPOMDP) implemented +try + @show_requirements POMDPs.solve(qmdp_solver, gen_crying_baby_pomdp) +catch err_msg + println(err_msg) +end +``` + +## Offline (SARSOP) +In this example, we will use the [NativeSARSOP](https://github.com/JuliaPOMDP/NativeSARSOP.jl) solver. The process for generating offline polices is similar for all offline solvers. First, we define the solver with the desired parameters. Then, we call `POMDPs.solve` with the solver and the problem. We can query the policy using the `action` function. + +```@example crying_sim +using NativeSARSOP + +# Define the solver with the desired parameters +sarsop_solver = SARSOPSolver(; max_time=10.0) + +# Solve the problem by calling POMDPs.solve. 
SARSOP will compute the policy and return an `AlphaVectorPolicy` +sarsop_policy = POMDPs.solve(sarsop_solver, quick_crying_baby_pomdp) + +# We can query the policy using the `action` function +b = initialstate(quick_crying_baby_pomdp) +a = action(sarsop_policy, b) + +@show a + +``` + +## Online (POMCP) +For the online solver, we will use Particle Monte Carlo Planning ([POMCP](https://github.com/JuliaPOMDP/BasicPOMCP.jl)). For online solvers, we first define the solver similar to offline solvers. However, when we call `POMDPs.solve`, we are returned an online planner. Similar to the offline solver, we can query the policy using the `action` function and that is when the online solver will compute the action. + +```@example crying_sim +using BasicPOMCP + +pomcp_solver = POMCPSolver(; c=5.0, tree_queries=1000, rng=MersenneTwister(1)) +pomcp_planner = POMDPs.solve(pomcp_solver, quick_crying_baby_pomdp) + +b = initialstate(quick_crying_baby_pomdp) +a = action(pomcp_planner, b) + +@show a + +``` + +## Heuristic Policy +While we often want to use a solver to compute a policy, sometimes we might want to use a heuristic policy. For example, we may want to use a heuristic policy during our rollout for online solvers or to use as a baseline. In this example, we will define a simple heuristic policy that feeds the baby if our belief of the baby being hungry is greater than 50%, otherwise we will randomly ignore or sing to the baby. + +```@example crying_sim +struct HeuristicFeedPolicy{P<:POMDP} <: Policy + pomdp::P +end + +# We need to implement the action function for our policy +function POMDPs.action(policy::HeuristicFeedPolicy, b) + if pdf(b, :hungry) > 0.5 + return :feed + else + return rand([:ignore, :sing]) + end +end + +# Let's also define the default updater for our policy +function POMDPs.updater(policy::HeuristicFeedPolicy) + return DiscreteUpdater(policy.pomdp) +end + +heuristic_policy = HeuristicFeedPolicy(quick_crying_baby_pomdp) + +# Let's query the policy a few times +b = SparseCat([:sated, :hungry], [0.1, 0.9]) +a1 = action(heuristic_policy, b) + +b = SparseCat([:sated, :hungry], [0.9, 0.1]) +a2 = action(heuristic_policy, b) + +@show [a1, a2] + +``` \ No newline at end of file diff --git a/.history/docs/src/gallery_20240704220213.md b/.history/docs/src/gallery_20240704220213.md new file mode 100644 index 00000000..39997852 --- /dev/null +++ b/.history/docs/src/gallery_20240704220213.md @@ -0,0 +1,264 @@ +# Gallery of POMDPs.jl Problems +A gallery of models written for [POMDPs.jl](https://github.com/JuliaPOMDP/POMDPs.jl) with visualizations. To view these visualizations on your own machine, the code is provided below each visualization. + +## [EscapeRoomba](https://github.com/sisl/AA228FinalProject) +Originally, an optional final project for AA228 at Stanford in Fall 2018. A Roomba equipped with a LIDAR or a bump sensor needs to try to find the safe exit (green) without accidentally falling down the stairs (red). 
+ +![EscapeRoomba](examples/EscapeRoomba.gif) + +```@setup EscapeRoomba +using Pkg +Pkg.add(url="https://github.com/sisl/RoombaPOMDPs.git") +``` + +```@example EscapeRoomba +using POMDPs +using POMDPTools +using POMDPGifs +using BasicPOMCP +using Random +using ParticleFilters +using Cairo +using LinearAlgebra + + +# If you don't have RoombaPOMDPs installed, uncomment the following two lines +# using Pkg +# Pkg.add(url="https://github.com/sisl/RoombaPOMDPs.git") +using RoombaPOMDPs + +# Let's only consider discrete actions +roomba_actions = [RoombaAct(2.0, 0.0), RoombaAct(2.0, 0.7), RoombaAct(2.0, -0.7)] + +pomdp = RoombaPOMDP(; + sensor=Bumper(), + mdp=RoombaMDP(; + config=2, + discount=0.99, + contact_pen=-0.1, + aspace=roomba_actions + ) +) + +# Define the belief updater +num_particles = 20000 +v_noise_coefficient = 0.0 +om_noise_coefficient = 0.4 +resampler=LowVarianceResampler(num_particles) +rng = MersenneTwister(1) +belief_updater = RoombaParticleFilter( + pomdp, num_particles, v_noise_coefficient, + om_noise_coefficient,resampler, rng +) + +# Custom update function for the particle filter +function POMDPs.update(up::RoombaParticleFilter, b::ParticleCollection, a, o) + pm = up._particle_memory + wm = up._weight_memory + ps = [] + empty!(pm) + empty!(wm) + all_terminal = true + for s in particles(b) + if !isterminal(up.model, s) + all_terminal = false + a_pert = RoombaAct(a.v + (up.v_noise_coeff * (rand(up.rng) - 0.5)), a.omega + (up.om_noise_coeff * (rand(up.rng) - 0.5))) + sp = @gen(:sp)(up.model, s, a_pert, up.rng) + weight_sp = pdf(observation(up.model, sp), o) + if weight_sp > 0.0 + push!(ps, s) + push!(pm, sp) + push!(wm, weight_sp) + end + end + end + + while length(pm) < up.n_init + a_pert = RoombaAct(a.v + (up.v_noise_coeff * (rand(up.rng) - 0.5)), a.omega + (up.om_noise_coeff * (rand(up.rng) - 0.5))) + s = isempty(ps) ? rand(up.rng, b) : rand(up.rng, ps) + sp = @gen(:sp)(up.model, s, a_pert, up.rng) + weight_sp = obs_weight(up.model, s, a_pert, sp, o) + if weight_sp > 0.0 + push!(pm, sp) + push!(wm, weight_sp) + end + end + + # if all particles are terminal, issue an error + if all_terminal + error("Particle filter update error: all states in the particle collection were terminal.") + end + + # return ParticleFilters.ParticleCollection(deepcopy(pm)) + return ParticleFilters.resample(up.resampler, + WeightedParticleBelief(pm, wm, sum(wm), nothing), + up.rng) +end + +solver = POMCPSolver(; + tree_queries=20000, + max_depth=150, + c = 10.0, + rng=MersenneTwister(1) +) + +planner = solve(solver, pomdp) + +sim = GifSimulator(; + filename="examples/EscapeRoomba.gif", + max_steps=100, + rng=MersenneTwister(3), + show_progress=false, + fps=5) +saved_gif = simulate(sim, pomdp, planner, belief_updater) + +println("gif saved to: $(saved_gif.filename)") +``` + +```@setup EscapeRoomba +Pkg.rm("RoombaPOMDPs") +``` + +## [DroneSurveillance](https://github.com/JuliaPOMDP/DroneSurveillance.jl) +Drone surveillance POMDP from M. Svoreňová, M. Chmelík, K. Leahy, H. F. Eniser, K. Chatterjee, I. Černá, C. Belta, "Temporal logic motion planning using POMDPs with parity objectives: case study paper", International Conference on Hybrid Systems: Computation and Control (HSCC), 2015. + +In this problem, the UAV must go from one corner to the other while avoiding a ground agent. It can only detect the ground agent within its field of view (in blue). 
+ +![DroneSurveillance](examples/DroneSurveillance.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using DroneSurveillance +import Cairo, Fontconfig + +pomdp = DroneSurveillancePOMDP() +solver = SARSOPSolver(; precision=0.1, max_time=10.0) +policy = solve(solver, pomdp) + +sim = GifSimulator(; filename="examples/DroneSurveillance.gif", max_steps=30, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [QuickMountainCar](https://github.com/JuliaPOMDP/QuickPOMDPs.jl) +An implementation of the classic Mountain Car RL problem using the QuickPOMDPs interface. + +![QuickMountainCar](examples/QuickMountainCar.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using Random +using QuickPOMDPs +using Compose +import Cairo + +mountaincar = QuickMDP( + function (s, a, rng) + x, v = s + vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07) + xp = x + vp + if xp > 0.5 + r = 100.0 + else + r = -1.0 + end + return (sp=(xp, vp), r=r) + end, + actions = [-1., 0., 1.], + initialstate = Deterministic((-0.5, 0.0)), + discount = 0.95, + isterminal = s -> s[1] > 0.5, + + render = function (step) + cx = step.s[1] + cy = 0.45*sin(3*cx)+0.5 + car = (context(), Compose.circle(cx, cy+0.035, 0.035), fill("blue")) + track = (context(), line([(x, 0.45*sin(3*x)+0.5) for x in -1.2:0.01:0.6]), Compose.stroke("black")) + goal = (context(), star(0.5, 1.0, -0.035, 5), fill("gold"), Compose.stroke("black")) + bg = (context(), Compose.rectangle(), fill("white")) + ctx = context(0.7, 0.05, 0.6, 0.9, mirror=Mirror(0, 0, 0.5)) + return compose(context(), (ctx, car, track, goal), bg) + end +) + +energize = FunctionPolicy(s->s[2] < 0.0 ? -1.0 : 1.0) +sim = GifSimulator(; filename="examples/QuickMountainCar.gif", max_steps=200, fps=20, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, mountaincar, energize) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [RockSample](https://github.com/JuliaPOMDP/RockSample.jl) +The RockSample problem problem from T. Smith, R. Simmons, "Heuristic Search Value Iteration for POMDPs", Association for Uncertainty in Artificial Intelligence (UAI), 2004. + +The robot must navigate and sample good rocks (green) and then arrive at an exit area. The robot can only sense the rocks with an imperfect sensor that has performance that depends on the distance to the rock. + +![RockSample](examples/RockSample.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using RockSample +using Cairo + +pomdp = RockSamplePOMDP(rocks_positions=[(2,3), (4,4), (4,2)], + sensor_efficiency=20.0, + discount_factor=0.95, + good_rock_reward = 20.0) + +solver = SARSOPSolver(precision=1e-3; max_time=10.0) +policy = solve(solver, pomdp) + +sim = GifSimulator(; filename="examples/RockSample.gif", max_steps=30, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [TagPOMDPProblem](https://github.com/JuliaPOMDP/TagPOMDPProblem.jl) +The Tag problem from J. Pineau, G. Gordon, and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs", International Joint Conference on Artificial Intelligence (IJCAI), 2003. + +The orange agent is the pursuer and the red agent is the evader. The pursuer must "tag" the evader by being in the same grid cell as the evader. 
However, the pursuer can only see the evader if it is in the same grid cell as the evader. The evader moves stochastically "away" from the pursuer. + +![TagPOMDPProblem](examples/TagPOMDP.gif) + +```@setup TagPOMDP +using Pkg +Pkg.add("Plots") +using Plots +``` + +```@example TagPOMDP +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using TagPOMDPProblem + +pomdp = TagPOMDP() +solver = SARSOPSolver(; max_time=20.0) +policy = solve(solver, pomdp) +sim = GifSimulator(; filename="examples/TagPOMDP.gif", max_steps=50, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +```@setup TagPOMDP +using Pkg +Pkg.rm("Plots") +``` + +## Adding New Gallery Examples +To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the `gallery.md` file in `docs/src/`. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documenation using `@eval` and saved in the `docs/src/examples/` directory. The gif should be named `problem_name.gif` where `problem_name` is the name of the problem. The gif can then be included using `![problem_name](examples/problem_name.gif)`. \ No newline at end of file diff --git a/.history/docs/src/gallery_20240704222830.md b/.history/docs/src/gallery_20240704222830.md new file mode 100644 index 00000000..6823b717 --- /dev/null +++ b/.history/docs/src/gallery_20240704222830.md @@ -0,0 +1,264 @@ +# Gallery of POMDPs.jl Problems +A gallery of models written for [POMDPs.jl](https://github.com/JuliaPOMDP/POMDPs.jl) with visualizations. To view these visualizations on your own machine, the code is provided below each visualization. + +## [EscapeRoomba](https://github.com/sisl/AA228FinalProject) +Originally, an optional final project for AA228 at Stanford in Fall 2018. A Roomba equipped with a LIDAR or a bump sensor needs to try to find the safe exit (green) without accidentally falling down the stairs (red). 
+ +![EscapeRoomba](examples/EscapeRoomba.gif) + +```@setup EscapeRoomba +using Pkg +Pkg.add(url="https://github.com/sisl/RoombaPOMDPs.git") +``` + +```@example EscapeRoomba +using POMDPs +using POMDPTools +using POMDPGifs +using BasicPOMCP +using Random +using ParticleFilters +using Cairo +using LinearAlgebra + + +# If you don't have RoombaPOMDPs installed, uncomment the following two lines +# using Pkg +# Pkg.add(url="https://github.com/sisl/RoombaPOMDPs.git") +using RoombaPOMDPs + +# Let's only consider discrete actions +roomba_actions = [RoombaAct(2.0, 0.0), RoombaAct(2.0, 0.7), RoombaAct(2.0, -0.7)] + +pomdp = RoombaPOMDP(; + sensor=Bumper(), + mdp=RoombaMDP(; + config=2, + discount=0.99, + contact_pen=-0.1, + aspace=roomba_actions + ) +) + +# Define the belief updater +num_particles = 20000 +v_noise_coefficient = 0.0 +om_noise_coefficient = 0.4 +resampler=LowVarianceResampler(num_particles) +rng = MersenneTwister(1) +belief_updater = RoombaParticleFilter( + pomdp, num_particles, v_noise_coefficient, + om_noise_coefficient,resampler, rng +) + +# Custom update function for the particle filter +function POMDPs.update(up::RoombaParticleFilter, b::ParticleCollection, a, o) + pm = up._particle_memory + wm = up._weight_memory + ps = [] + empty!(pm) + empty!(wm) + all_terminal = true + for s in particles(b) + if !isterminal(up.model, s) + all_terminal = false + a_pert = RoombaAct(a.v + (up.v_noise_coeff * (rand(up.rng) - 0.5)), a.omega + (up.om_noise_coeff * (rand(up.rng) - 0.5))) + sp = @gen(:sp)(up.model, s, a_pert, up.rng) + weight_sp = pdf(observation(up.model, sp), o) + if weight_sp > 0.0 + push!(ps, s) + push!(pm, sp) + push!(wm, weight_sp) + end + end + end + + while length(pm) < up.n_init + a_pert = RoombaAct(a.v + (up.v_noise_coeff * (rand(up.rng) - 0.5)), a.omega + (up.om_noise_coeff * (rand(up.rng) - 0.5))) + s = isempty(ps) ? rand(up.rng, b) : rand(up.rng, ps) + sp = @gen(:sp)(up.model, s, a_pert, up.rng) + weight_sp = obs_weight(up.model, s, a_pert, sp, o) + if weight_sp > 0.0 + push!(pm, sp) + push!(wm, weight_sp) + end + end + + # if all particles are terminal, issue an error + if all_terminal + error("Particle filter update error: all states in the particle collection were terminal.") + end + + # return ParticleFilters.ParticleCollection(deepcopy(pm)) + return ParticleFilters.resample(up.resampler, + WeightedParticleBelief(pm, wm, sum(wm), nothing), + up.rng) +end + +solver = POMCPSolver(; + tree_queries=20000, + max_depth=150, + c = 10.0, + rng=MersenneTwister(1) +) + +planner = solve(solver, pomdp) + +sim = GifSimulator(; + filename="examples/EscapeRoomba.gif", + max_steps=100, + rng=MersenneTwister(3), + show_progress=false, + fps=5) +saved_gif = simulate(sim, pomdp, planner, belief_updater) + +println("gif saved to: $(saved_gif.filename)") +``` + +```@setup EscapeRoomba +Pkg.rm("RoombaPOMDPs") +``` + +## [DroneSurveillance](https://github.com/JuliaPOMDP/DroneSurveillance.jl) +Drone surveillance POMDP from M. Svoreňová, M. Chmelík, K. Leahy, H. F. Eniser, K. Chatterjee, I. Černá, C. Belta, "Temporal logic motion planning using POMDPs with parity objectives: case study paper", International Conference on Hybrid Systems: Computation and Control (HSCC), 2015. + +In this problem, the UAV must go from one corner to the other while avoiding a ground agent. It can only detect the ground agent within its field of view (in blue). 
+ +![DroneSurveillance](examples/DroneSurveillance.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using DroneSurveillance +import Cairo, Fontconfig + +pomdp = DroneSurveillancePOMDP() +solver = SARSOPSolver(; precision=0.1, max_time=10.0) +policy = solve(solver, pomdp) + +sim = GifSimulator(; filename="examples/DroneSurveillance.gif", max_steps=30, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [QuickMountainCar](https://github.com/JuliaPOMDP/QuickPOMDPs.jl) +An implementation of the classic Mountain Car RL problem using the QuickPOMDPs interface. + +![QuickMountainCar](examples/QuickMountainCar.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using Random +using QuickPOMDPs +using Compose +import Cairo + +mountaincar = QuickMDP( + function (s, a, rng) + x, v = s + vp = clamp(v + a*0.001 + cos(3*x)*-0.0025, -0.07, 0.07) + xp = x + vp + if xp > 0.5 + r = 100.0 + else + r = -1.0 + end + return (sp=(xp, vp), r=r) + end, + actions = [-1., 0., 1.], + initialstate = Deterministic((-0.5, 0.0)), + discount = 0.95, + isterminal = s -> s[1] > 0.5, + + render = function (step) + cx = step.s[1] + cy = 0.45*sin(3*cx)+0.5 + car = (context(), Compose.circle(cx, cy+0.035, 0.035), fill("blue")) + track = (context(), line([(x, 0.45*sin(3*x)+0.5) for x in -1.2:0.01:0.6]), Compose.stroke("black")) + goal = (context(), star(0.5, 1.0, -0.035, 5), fill("gold"), Compose.stroke("black")) + bg = (context(), Compose.rectangle(), fill("white")) + ctx = context(0.7, 0.05, 0.6, 0.9, mirror=Mirror(0, 0, 0.5)) + return compose(context(), (ctx, car, track, goal), bg) + end +) + +energize = FunctionPolicy(s->s[2] < 0.0 ? -1.0 : 1.0) +sim = GifSimulator(; filename="examples/QuickMountainCar.gif", max_steps=200, fps=20, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, mountaincar, energize) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [RockSample](https://github.com/JuliaPOMDP/RockSample.jl) +The RockSample problem from T. Smith, R. Simmons, "Heuristic Search Value Iteration for POMDPs", Association for Uncertainty in Artificial Intelligence (UAI), 2004. + +The robot must navigate and sample good rocks (green) and then arrive at an exit area. The robot can only sense the rocks with an imperfect sensor that has performance that depends on the distance to the rock. + +![RockSample](examples/RockSample.gif) + +```@example +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using RockSample +using Cairo + +pomdp = RockSamplePOMDP(rocks_positions=[(2,3), (4,4), (4,2)], + sensor_efficiency=20.0, + discount_factor=0.95, + good_rock_reward = 20.0) + +solver = SARSOPSolver(precision=1e-3; max_time=10.0) +policy = solve(solver, pomdp) + +sim = GifSimulator(; filename="examples/RockSample.gif", max_steps=30, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +## [TagPOMDPProblem](https://github.com/JuliaPOMDP/TagPOMDPProblem.jl) +The Tag problem from J. Pineau, G. Gordon, and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs", International Joint Conference on Artificial Intelligence (IJCAI), 2003. + +The orange agent is the pursuer and the red agent is the evader. The pursuer must "tag" the evader by being in the same grid cell as the evader. 
However, the pursuer can only see the evader if it is in the same grid cell as the evader. The evader moves stochastically "away" from the pursuer. + +![TagPOMDPProblem](examples/TagPOMDP.gif) + +```@setup TagPOMDP +using Pkg +Pkg.add("Plots") +using Plots +``` + +```@example TagPOMDP +using POMDPs +using POMDPTools +using POMDPGifs +using NativeSARSOP +using Random +using TagPOMDPProblem + +pomdp = TagPOMDP() +solver = SARSOPSolver(; max_time=20.0) +policy = solve(solver, pomdp) +sim = GifSimulator(; filename="examples/TagPOMDP.gif", max_steps=50, rng=MersenneTwister(1), show_progress=false) +saved_gif = simulate(sim, pomdp, policy) + +println("gif saved to: $(saved_gif.filename)") +``` + +```@setup TagPOMDP +using Pkg +Pkg.rm("Plots") +``` + +## Adding New Gallery Examples +To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the `gallery.md` file in `docs/src/`. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using `@eval` and saved in the `docs/src/examples/` directory. The gif should be named `problem_name.gif` where `problem_name` is the name of the problem. The gif can then be included using `![problem_name](examples/problem_name.gif)`. \ No newline at end of file diff --git a/docs/src/example_defining_problems.md b/docs/src/example_defining_problems.md index e7f2d173..be243fe4 100644 --- a/docs/src/example_defining_problems.md +++ b/docs/src/example_defining_problems.md @@ -1,13 +1,13 @@ # Defining a POMDP -As mentioned in the [Defining POMDPs and MDPs](@ref defining_pomdps) section, there are verious ways to define a POMDP using POMDPs.jl. In this section, we provide more examples of how to define a POMDP using the different interfaces. +As mentioned in the [Defining POMDPs and MDPs](@ref defining_pomdps) section, there are various ways to define a POMDP using POMDPs.jl. In this section, we provide more examples of how to define a POMDP using the different interfaces. -There is a large variety of problems that can be expressed as MDPs and POMDPs and different solvers require different components of the POMDPs.jl interface to be defined. Therefore, these examples are not intended to cover all possible use cases. When deeloping a problem and you have an idea of what solver(s) you would like to use, it is recommended to use [POMDPLinter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help you to determine what components of the POMDPs.jl interface need to be defined. Reference the [Checking Requirements](@ref) section for an example of using POMDPLinter. +There is a large variety of problems that can be expressed as MDPs and POMDPs and different solvers require different components of the POMDPs.jl interface to be defined. Therefore, these examples are not intended to cover all possible use cases. When developing a problem and you have an idea of what solver(s) you would like to use, it is recommended to use [POMDPLinter](https://github.com/JuliaPOMDP/POMDPLinter.jl) to help you to determine what components of the POMDPs.jl interface need to be defined. Reference the [Checking Requirements](@ref) section for an example of using POMDPLinter. ## CryingBaby Problem Definition For the examples, we will use the CryingBaby problem from [Algorithms for Decision Making](https://algorithmsbook.com/) by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray. !!! 
note - This craying baby problem follows the description in Algorithms for Decision Making and is different than `BabyPOMDP` defined in [POMDPModels.jl](https://github.com/JuliaPOMDP/POMDPModels.jl). + This crying baby problem follows the description in Algorithms for Decision Making and is different than `BabyPOMDP` defined in [POMDPModels.jl](https://github.com/JuliaPOMDP/POMDPModels.jl). From [Appendix F](https://algorithmsbook.com/files/appendix-f.pdf) of Algorithms for Decision Making: > The crying baby problem is a simple POMDP with two states, three actions, and two observations. Our goal is to care for a baby, and we do so by choosing at each time step whether to feed the baby, sing to the baby, or ignore the baby. @@ -201,7 +201,7 @@ explicit_crying_baby_pomdp = CryingBabyPOMDP() ``` ## [Generative Interface](@id gen_crying) -This crying baby problem should not be implemented using the generative interface. However, this exmple is provided for pedagogical purposes. +This crying baby problem should not be implemented using the generative interface. However, this example is provided for pedagogical purposes. ```julia using POMDPs @@ -273,7 +273,7 @@ gen_crying_baby_pomdp = GenCryingBabyPOMDP() ``` ## [Probability Tables](@id tab_crying) -For this implementaion we will use the following indexes: +For this implementation we will use the following indexes: - States - `:sated` = 1 - `:hungry` = 2 diff --git a/docs/src/example_gridworld_mdp.md b/docs/src/example_gridworld_mdp.md index 4dfa4e48..1cd68847 100644 --- a/docs/src/example_gridworld_mdp.md +++ b/docs/src/example_gridworld_mdp.md @@ -1,6 +1,6 @@ # GridWorld MDP Tutorial -In this tutorial, we provide a simple example of how to define a Markov decision process (MDP) using the POMDPS.jl interface. We will then solve the MDP using value iteration and Monte Carlo tree search (MCTS). We will walk through constructing the MDP using the explicit interface which invovles defining a new type for the MDP and then extending different components of the POMDPs.jl interface for that type. +In this tutorial, we provide a simple example of how to define a Markov decision process (MDP) using the POMDPS.jl interface. We will then solve the MDP using value iteration and Monte Carlo tree search (MCTS). We will walk through constructing the MDP using the explicit interface which involves defining a new type for the MDP and then extending different components of the POMDPs.jl interface for that type. ## Dependencies @@ -32,7 +32,7 @@ using MCTS ## Problem Overview -In Grid World, we are trying to control an agent who has trouble moving in the desired direction. In our problem, we have four reward states within the a grid. Each position on the grid represents a state, and the positive reward states are terminal (the agent stops recieving reward after reaching them and performing an action from that state). The agent has four actions to choose from: up, down, left, right. The agent moves in the desired direction with a probability of $0.7$, and with a probability of $0.1$ in each of the remaining three directions. If the agent bumps into the outside wall, there is a penalty of $1$ (i.e. reward of $-1$). The problem has the following form: +In Grid World, we are trying to control an agent who has trouble moving in the desired direction. In our problem, we have four reward states within the a grid. 
Each position on the grid represents a state, and the positive reward states are terminal (the agent stops receiving reward after reaching them and performing an action from that state). The agent has four actions to choose from: up, down, left, right. The agent moves in the desired direction with a probability of $0.7$, and with a probability of $0.1$ in each of the remaining three directions. If the agent bumps into the outside wall, there is a penalty of $1$ (i.e. reward of $-1$). The problem has the following form: ![Grid World](examples/grid_world_overview.gif) @@ -79,7 +79,7 @@ struct GridWorldMDP <: MDP{GridWorldState, Symbol} reward_states_values::Dict{GridWorldState, Float64} # Dictionary mapping reward states to their values hit_wall_reward::Float64 # reward for hitting a wall tprob::Float64 # probability of transitioning to the desired state - discount_factor::Float64 # disocunt factor + discount_factor::Float64 # discount factor end ``` @@ -126,7 +126,7 @@ mdp = GridWorldMDP() ``` !!! note - In this definition of the problem, our coordiates start in the bottom left of the grid. That is GridState(1, 1) is the bottom left of the grid and GridState(10, 10) would be on the right of the grid with a grid size of 10 by 10. + In this definition of the problem, our coordinates start in the bottom left of the grid. That is GridState(1, 1) is the bottom left of the grid and GridState(10, 10) would be on the right of the grid with a grid size of 10 by 10. ## Grid World State Space The state space in an MDP represents all the states in the problem. There are two primary functionalities that we want our spaces to support. We want to be able to iterate over the state space (for Value Iteration for example), and sometimes we want to be able to sample form the state space (used in some POMDP solvers). In this notebook, we will only look at iterable state spaces. @@ -236,7 +236,7 @@ Similar to above, let's iterate over a few of the states in our state space: ``` ## Grid World Action Space -The action space is the set of all actions availiable to the agent. In the grid world problem the action space consists of up, down, left, and right. We can define the action space by implementing a new method of the actions function. +The action space is the set of all actions available to the agent. In the grid world problem the action space consists of up, down, left, and right. We can define the action space by implementing a new method of the actions function. ```@example gridworld_mdp POMDPs.actions(mdp::GridWorldMDP) = [:up, :down, :left, :right] @@ -255,7 +255,7 @@ end ## Grid World Transition Function MDPs often define the transition function as $T(s^{\prime} \mid s, a)$, which is the probability of transitioning to state $s^{\prime}$ given that we are in state $s$ and take action $a$. For the POMDPs.jl interface, we define the transition function as a distribution over the next states. That is, we want $T(\cdot \mid s, a)$ which is a function that takes in a state and an action and returns a distribution over the next states. -For our grid world example, there are only a few states to which the agent can transition and thus only a few states with nonzero probaility in $T(\cdot \mid s, a)$. We can use the `SparseCat` distribution to represent this. The `SparseCat` distribution is a categorical distribution that only stores the nonzero probabilities. 
We can define our transition function as follows: +For our grid world example, there are only a few states to which the agent can transition and thus only a few states with nonzero probability in $T(\cdot \mid s, a)$. We can use the `SparseCat` distribution to represent this. The `SparseCat` distribution is a categorical distribution that only stores the nonzero probabilities. We can define our transition function as follows: ```@example gridworld_mdp function POMDPs.transition(mdp::GridWorldMDP, s::GridWorldState, a::Symbol) diff --git a/docs/src/example_simulations.md b/docs/src/example_simulations.md index cd6b5e95..c1cc0d0a 100644 --- a/docs/src/example_simulations.md +++ b/docs/src/example_simulations.md @@ -9,7 +9,7 @@ include("examples/crying_baby_solvers.jl") ``` ## Stepthrough -The stepthrough simulater provides a window into the simulation with a for-loop syntax. +The stepthrough simulator provides a window into the simulation with a for-loop syntax. Within the body of the for loop, we have access to the belief, the action, the observation, and the reward, in each step. We also calculate the sum of the rewards in this example, but note that this is _not_ the _discounted reward_. @@ -58,7 +58,7 @@ history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabula nothing # hide ``` -The history object produced by a `HistoryRecorder` is a `SimHistory`, documented in the POMDPTools simulater section [Histories](@ref). The information in this object can be accessed in several ways. For example, there is a function: +The history object produced by a `HistoryRecorder` is a `SimHistory`, documented in the POMDPTools simulator section [Histories](@ref). The information in this object can be accessed in several ways. For example, there is a function: ```@example crying_sim discounted_reward(history) ``` @@ -97,7 +97,7 @@ demo_eachstep(history) # hide ## Parallel Simulations It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of `Sim` objects, each of which contains all of the information needed to run a simulation. Then then run the simulations using `run_parallel`, which will return a `DataFrame` with the results. -In this example, we will compare the performance of the polcies we computed in the [Using Different Solvers](@ref) section (i.e. `sarsop_policy`, `pomcp_planner`, and `heuristic_policy`). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 `Sim` objects of each policy to the list. +In this example, we will compare the performance of the policies we computed in the [Using Different Solvers](@ref) section (i.e. `sarsop_policy`, `pomcp_planner`, and `heuristic_policy`). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 `Sim` objects of each policy to the list. ```@example crying_sim using DataFrames diff --git a/docs/src/example_solvers.md b/docs/src/example_solvers.md index 069053a7..cee92985 100644 --- a/docs/src/example_solvers.md +++ b/docs/src/example_solvers.md @@ -37,12 +37,12 @@ end ``` ## Offline (SARSOP) -In this example, we will use the [NativeSARSOP](https://github.com/JuliaPOMDP/NativeSARSOP.jl) solver. The process for generating offline polcies is similar for all offline solvers. First, we define the solver with the desired parameters. Then, we call `POMDPs.solve` with the solver and the problem. 
diff --git a/docs/src/example_solvers.md b/docs/src/example_solvers.md
index 069053a7..cee92985 100644
--- a/docs/src/example_solvers.md
+++ b/docs/src/example_solvers.md
@@ -37,12 +37,12 @@ end
```

## Offline (SARSOP)
-In this example, we will use the [NativeSARSOP](https://github.com/JuliaPOMDP/NativeSARSOP.jl) solver. The process for generating offline polcies is similar for all offline solvers. First, we define the solver with the desired parameters. Then, we call `POMDPs.solve` with the solver and the problem. We can query the policy using the `action` function.
+In this example, we will use the [NativeSARSOP](https://github.com/JuliaPOMDP/NativeSARSOP.jl) solver. The process for generating offline policies is similar for all offline solvers. First, we define the solver with the desired parameters. Then, we call `POMDPs.solve` with the solver and the problem. We can query the policy using the `action` function.

```@example crying_sim
using NativeSARSOP

-# Define the solver with the desired paramters
+# Define the solver with the desired parameters
sarsop_solver = SARSOPSolver(; max_time=10.0)

# Solve the problem by calling POMDPs.solve. SARSOP will compute the policy and return an `AlphaVectorPolicy`
@@ -57,7 +57,7 @@ a = action(sarsop_policy, b)
```

## Online (POMCP)
-For the online solver, we will use Particle Monte Carlo Planning ([POMCP](https://github.com/JuliaPOMDP/BasicPOMCP.jl)). For online solvers, we first define the solver similar to offline solvers. However, when we call `POMDPs.solve`, we are returned an online plannner. Similar to the offline solver, we can query the policy using the `action` function and that is when the online solver will compute the action.
+For the online solver, we will use Partially Observable Monte Carlo Planning ([POMCP](https://github.com/JuliaPOMDP/BasicPOMCP.jl)). For online solvers, we first define the solver similar to offline solvers. However, when we call `POMDPs.solve`, we are returned an online planner. Similar to the offline solver, we can query the policy using the `action` function and that is when the online solver will compute the action.

```@example crying_sim
using BasicPOMCP
@@ -73,7 +73,7 @@ a = action(pomcp_planner, b)
```

## Heuristic Policy
-While we often want to use a solver to compute a policy, sometimes we might want to use a heuristic policy. For example, we may want to use a heuristic policy during our rollouts for online solvers or to use as a baseline. In this example, we will define a simple heuristic policy that feeds the baby if our belief of the baby being hungry is greater than 50%, otherwise we will randomly ignore or sing to the baby.
+While we often want to use a solver to compute a policy, sometimes we might want to use a heuristic policy. For example, we may want to use a heuristic policy during our rollout for online solvers or to use as a baseline. In this example, we will define a simple heuristic policy that feeds the baby if our belief of the baby being hungry is greater than 50%, otherwise we will randomly ignore or sing to the baby.

```@example crying_sim
struct HeuristicFeedPolicy{P<:POMDP} <: Policy
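
The `HeuristicFeedPolicy` definition above is truncated just after its `struct` line. As a rough standalone illustration of the rule it describes (feed when the belief that the baby is hungry is above 50%, otherwise randomly sing or ignore), a minimal policy could look like the sketch below. It assumes the symbol-valued states and actions (`:hungry`, `:feed`, `:sing`, `:ignore`) and a belief `b` that supports `pdf(b, s)`, and unlike the documentation's version it does not carry the POMDP itself (the `P<:POMDP` parameter above).

```julia
using POMDPs
using POMDPTools

# Minimal illustration of the heuristic described above; not the documentation's
# HeuristicFeedPolicy. Feed when P(hungry) > 0.5, otherwise sing or ignore at random.
struct SimpleFeedHeuristic <: Policy end

POMDPs.action(::SimpleFeedHeuristic, b) = pdf(b, :hungry) > 0.5 ? :feed : rand([:sing, :ignore])
```

A policy like this can be dropped in anywhere a `Policy` is accepted, for example as a baseline in the parallel simulations sketched earlier.
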
diff --git a/docs/src/gallery.md b/docs/src/gallery.md
index 39997852..6823b717 100644
--- a/docs/src/gallery.md
+++ b/docs/src/gallery.md
@@ -196,7 +196,7 @@ println("gif saved to: $(saved_gif.filename)")
```

## [RockSample](https://github.com/JuliaPOMDP/RockSample.jl)
-The RockSample problem problem from T. Smith, R. Simmons, "Heuristic Search Value Iteration for POMDPs", Association for Uncertainty in Artificial Intelligence (UAI), 2004.
+The RockSample problem from T. Smith, R. Simmons, "Heuristic Search Value Iteration for POMDPs", Association for Uncertainty in Artificial Intelligence (UAI), 2004.

The robot must navigate and sample good rocks (green) and then arrive at an exit area. The robot can only sense the rocks with an imperfect sensor that has performance that depends on the distance to the rock.
@@ -261,4 +261,4 @@ Pkg.rm("Plots")
```

## Adding New Gallery Examples
-To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the `gallery.md` file in `docs/src/`. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documenation using `@eval` and saved in the `docs/src/examples/` directory. The gif should be named `problem_name.gif` where `problem_name` is the name of the problem. The gif can then be included using `![problem_name](examples/problem_name.gif)`.
\ No newline at end of file
+To add new examples, please submit a pull request to the POMDPs.jl repository with changes made to the `gallery.md` file in `docs/src/`. Please include the creation of a gif in the code snippet. The gif should be generated during the creation of the documentation using `@eval` and saved in the `docs/src/examples/` directory. The gif should be named `problem_name.gif` where `problem_name` is the name of the problem. The gif can then be included using `![problem_name](examples/problem_name.gif)`.
\ No newline at end of file