
BotWorld-MDPVis

DEMO THE PROJECT NOW:

https://rawgit.com/connormbrooks/BotWorld-MDPVis/master/index.html

BotWorld is a visualization tool for Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs).

FEATURES

-User can enter their world as a text file

-User can select a starting state

-User can specify multiple ending states

-User can create inaccessible states

-User can specify the reward/cost of a state

-User can specify the probability of successful movement while in a state

-User can specify threshold and gamma for policy finding

-Finds a policy for the given board

-After finding a policy, executes the policy and displays final “score” during simulation

-User can specify agent sensor characteristics for POMDPs

-User can specify whether or not agent knows starting location for POMDPs

-Simulation of running through a partially observable environment with belief updating

-Current belief state visualized through “heat-map” while simulation runs

-POMDP solution approximation using most likely state for policy determination

-POMDP solution approximation using Q-MDP for policy determination

-User can pause and resume the agent moving through the policy

-Data on states and the agent is printed below the board

BACKGROUND

The solution to an MDP can be found using the Bellman equation:

U(s) = R(s) + γ · max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) · U(s')

to calculate the utility of each state. The utilities are computed with the value-iteration algorithm. During value iteration, the user-provided gamma is the discount factor on future rewards, and the user-provided threshold determines how long to keep iterating: once the maximum change in utility falls below the threshold-based bound shown below, iteration stops.

U'(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · U(s')

δ = max_s | U'(s) − U(s) |

Quit once δ < THRESHOLD · (1 − γ) / γ
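
The value-iteration loop above is only a few lines of code. The following Python sketch is purely illustrative (BotWorld itself is a browser-based tool; the state encoding, transition-model interface, and function names here are assumptions rather than the project's actual implementation):

```python
# Minimal value-iteration sketch. `states` is an iterable of states,
# `actions(s)` returns the actions available in s (empty for ending states),
# `P(s, a)` returns a dict mapping s' -> P(s' | s, a), and `R(s)` is the
# reward/cost of state s.
def value_iteration(states, actions, P, R, gamma, threshold):
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                 # ending state: utility is just its reward
                U_new[s] = R(s)
            else:                        # Bellman backup over the available actions
                U_new[s] = R(s) + gamma * max(
                    sum(p * U[s2] for s2, p in P(s, a).items())
                    for a in acts
                )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        # Same stopping rule as above.
        if delta < threshold * (1 - gamma) / gamma:
            return U
```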

BotWorld also sets up Partially Observable Markov Decision Processes. In a POMDP, the agent is not always certain which state it is currently in; instead it holds beliefs about its state based on sensory evidence and its previous belief state and action. The agent must therefore maintain a belief state that represents how likely it believes it is to be in each state. If the agent knows its starting position, it starts with a belief of 1 in that position and 0 in all other positions; otherwise it distributes the belief evenly among all positions. To update the agent's belief state, b, after an action, a, is taken (note that this is the action the agent ATTEMPTED; the agent does not know whether it succeeded) and sensory input, e, is received, the following equation is used:

b'(s') = α · P(e | s') · Σ_s P(s' | s, a) · b(s)

Where α = 1 / Σ_{s'} [ P(e | s') · Σ_s P(s' | s, a) · b(s) ]
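
A minimal sketch of this belief update in Python (illustrative only; the transition-model and sensor-model interfaces are assumptions, not the project's actual code):

```python
# belief: dict mapping state -> b(s); P(s, a): dict mapping s' -> P(s' | s, a);
# sensor(e, s2): P(e | s2), the probability of observing e in state s2.
def update_belief(belief, action, evidence, P, sensor):
    unnormalized = {}
    for s2 in belief:
        # Probability of arriving in s2 after the attempted action...
        predicted = sum(P(s, action).get(s2, 0.0) * b for s, b in belief.items())
        # ...weighted by how likely the sensory evidence is from s2.
        unnormalized[s2] = sensor(evidence, s2) * predicted
    alpha = 1.0 / sum(unnormalized.values())   # normalizing constant
    return {s2: alpha * v for s2, v in unnormalized.items()}
```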

Several methods have been developed to approximate solutions to POMDPs. For our project, we use "greedy" approaches that rely on the underlying MDP. We implement two heuristics: the Most-Likely State policy and the Q-MDP policy. The "Most-Likely State" (MLS) heuristic policy is found by:

π_POMDP(b) = π_MDP(argmax_s b(s))

The Q-MDP policy is found by:

π_POMDP(b) = argmax_a Σ_s b(s) · Q(s, a)

Where Q(s, a) = Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · U(s') ]
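
Once the MDP policy, utilities, and Q-values are available, both heuristics reduce to a few lines. This sketch uses hypothetical names and data structures for illustration (not the project's actual API):

```python
# mdp_policy: dict state -> action; Q: dict (state, action) -> Q(s, a).
def mls_action(belief, mdp_policy):
    """Most-Likely State: act as the MDP policy would in the most probable state."""
    most_likely = max(belief, key=belief.get)
    return mdp_policy[most_likely]

def qmdp_action(belief, actions, Q):
    """Q-MDP: pick the action with the highest belief-weighted Q-value."""
    return max(actions, key=lambda a: sum(b * Q[(s, a)] for s, b in belief.items()))
```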
