Reinforcement Learning

Pooling Layers

  • Makes the classifier insensitive to where the object is in a picture

Model-Based

Generally: Model-based learning attempts to model the environment and then choose the optimal policy based on its learned model.

Model Free

In model-free learning, the agent relies on trial-and-error experience to build up the optimal policy.

Deep Q-Learning Tutorial

Types of Reinforcement Learning

  • Reinforcement learning is either model free or model based
    • Model free: don’t learn a model of their environment
    • Q-learning is one such model-free algorithm

Vanilla Q-Learning

  1. Initialize your Q-table
  2. Choose an action using the Epsilon-Greedy Exploration Strategy
  3. Update the Q-table using the Bellman equation

This would be the same as generating a bunch of states of the minesweeper game, taking the optimal (or a random) action for each one, and then recording the result in the table.

Minesweeper is too big of a game for this all to be stored within a single table.

Q-Value and Q-Table

Example Q-Table:

| State-action | Q-Value |
|--------------|---------|
| (S1, A1)     | 5       |
| (S3, A0)     | 9       |
| (S5, A1)     | 1       |
  • Maps a state-action pair to a q-value
  • Q-Value: the estimated optimal future value or the maximum expected reward an agent can reach by taking a given action A from the state S
  • The Q-table is initialized to all zeros since the agent doesn’t know anything about the world yet
  • Learning about the world through trial and error is called exploration
  • The Q-Learning Algorithm tries to learn the Q-Value for a new environment.
  • Once the Q-table has been populated, action A can be taken to maximize the reward for a given state S
    • Picking the best action at a state is called exploitation
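
A minimal sketch of such a Q-table (not from the tutorial; states and actions are assumed to be hashable, e.g. tuples or ints):

from collections import defaultdict

q_table = defaultdict(float)   # unseen (state, action) pairs start at 0

def best_action(state, actions):
    # Exploitation: pick the action with the highest Q-value at this state
    return max(actions, key=lambda a: q_table[(state, a)])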

Choosing an Action

A common strategy for tackling the exploration-exploitation tradeoff is the Epsilon-Greedy Exploration Strategy

  1. At every time step when an action needs to be taken, roll the dice
  2. If the roll is less than epsilon, choose a random action. This allows the agent to explore the environment some more
  3. Otherwise take the best known action at the agent’s current state

NOTE: At the beginning of the algorithm, all the actions taken are random. This helps the agent learn about the environment. As time progresses, better actions will be taken.

NOTE: Epsilon is initialized to 1 -> all actions are random at first

Toward the end of training, more exploitation and less exploration will happen.
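
A sketch of the strategy (function and variable names are placeholders, not from the tutorial):

import random

def choose_action(state, actions, q_table, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: q_table[(state, a)])   # exploit the best known action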

Updating the Q-Table using the Bellman Equation

The Bellman Equation tells us how to update our Q-table after each step we take.

SUMMARY: The agent updates the current perceived value with the estimated optimal future reward, which assumes that the agent takes the best currently known action. In implementation, the agent searches through all actions for a particular state and chooses the state-action pair with the highest corresponding Q-value

$Q(S_t, A_t) = (1 - α) Q(S_t, A_t) + α (R_t + λ \max_a Q(S_{t+1}, a))$

  • S = the state or Observation
  • A = the Action the agent takes
  • R = the reward from taking an action
  • t = the time step
  • $α$ = the learning rate
  • $λ$ = the discount factor which causes rewards to lose their value over time so more immediate rewards are valued more highly
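
As a sketch (names are placeholders), the table update is a single assignment per step:

def bellman_update(q_table, state, action, reward, next_state, next_actions,
                   alpha=0.1, discount=0.95):
    # Estimated optimal future reward: the best Q-value reachable from the next state
    max_future_q = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    old_q = q_table[(state, action)]
    q_table[(state, action)] = (1 - alpha) * old_q + alpha * (reward + discount * max_future_q)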

Deep Q-Network

Vanilla Q-Learning: A table maps each state-action pair to its corresponding Q-value

Deep Q-Learning: A Neural Network maps input states to (action, Q-Value) pairs

Deep Q-Network Algorithm

  1. Initialize your Main and Target neural networks
  2. Choose an action using the Epsilon-Greedy Exploration Strategy
  3. Update your network weights using the Bellman Equation

Initialize your Target and Main neural networks

A major difference between Deep Q-Learning and Vanilla Q-Learning is the implementation of the Q-table.

  • Deep Q-Learning uses a neural network to map input states -> (action, Q-value) pairs
    • Different from a table mapping state-action pairs to a Q-value
  • Deep Q-Learning uses two neural networks
    • A main network and a target network
    • These networks have the same architecture but different weights
    • Every N steps, weights from the main network are copied to the target network
    • Using two networks leads to more stability and efficiency in the learning process

Mapping States to Actions

  • Main and target networks map input states to an (action, Q-value) pair
  • In this case, each output node (representing an action) contains the action’s q-value as a floating point number.
    • The output nodes do not represent a probability distribution, so they will not add up to 1

Defining our Network Architecture

From the article:

import tensorflow as tf
from tensorflow import keras

def agent(state_shape, action_shape):
    """Build a Q-network mapping an input state to one Q-value per action."""
    learning_rate = 0.001
    init = tf.keras.initializers.HeUniform()
    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu',
                                 kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    # Linear output layer: one node per action, holding that action's Q-value
    model.add(keras.layers.Dense(action_shape, activation='linear',
                                 kernel_initializer=init))
    model.compile(loss=tf.keras.losses.Huber(),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  metrics=['accuracy'])
    return model
  • The main and target networks are 3 densely connected layers with ReLU activation functions
    • [ ] TODO What the heck do these mean :P
  • He uniform initialization and the Huber loss function are used to achieve better performance
    • [ ] TODO What the heck do these mean :P
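
For instance, the two networks could be built and kept in sync like this (a sketch; the shapes are placeholder values for illustration):

n_observations, n_actions = 8, 9   # placeholder sizes, not from the article

main_model = agent(state_shape=(n_observations,), action_shape=n_actions)
target_model = agent(state_shape=(n_observations,), action_shape=n_actions)
target_model.set_weights(main_model.get_weights())   # start both networks identical

# ...then, every N training steps:
target_model.set_weights(main_model.get_weights())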

Choose an Action using the Epsilon-Greedy Exploration Strategy

The agent chooses a random action with probability epsilon and exploits the best action with probability 1 - epsilon

Finding Best Known Action from the Networks

  • Both the Main model and the Target model map input states to output actions.
  • These output actions actually represent the model’s predicted Q-value
  • The action that has the largest predicted Q-Value is the best known action at that state.
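
In code, that might look like the following sketch (`main_model` and `state` are assumed from the setup above; `state` is taken to be a flat numpy array):

import numpy as np

q_values = main_model.predict(state.reshape(1, -1), verbose=0)[0]  # one Q-value per action
action = int(np.argmax(q_values))                                  # best known action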

Updating Network Weights Using the Bellman Equation

$Q(S_t, A_t) = (1 - α) Q(S_t, A_t) + α (R_t + λ \max_a Q(S_{t+1}, a))$

  • S = the state or Observation
  • A = the Action the agent takes
  • R = the reward from taking an action
  • t = the time step
  • $α$ = the learning rate
  • $λ$ = the discount factor which causes rewards to lose their value over time so more immediate rewards are valued more highly
  • After choosing and executing an action, the Main and Target networks need to be updated
  • Deep Q-Learning agents use Experience Replay to learn about the environment and update the Main and Target networks
  • The Main network samples and trains on a batch of past experiences every 4 steps.
  • The main network weights are copied to the target network every 100 steps.

Experience Replay

Experience Replay is the act of storing and replaying game transitions (state, action, reward, next_state) that the RL algorithm can learn from.

  • Experience Replay can be used in off-policy algorithms to learn in an offline fashion.
  • Can update algorithm’s parameters using information from previously taken actions.
  • Used in small batches in order to avoid skewing the dataset distribution of different states, actions, and next_states that the neural network will see.
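
A minimal replay-memory sketch (the capacity and batch size are illustrative assumptions, not values from the article):

import random
from collections import deque

replay_memory = deque(maxlen=50_000)    # oldest experiences fall off the end

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=64):
    # Small random batches help avoid skewing the distribution of states the network sees
    return random.sample(replay_memory, batch_size)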

Bellman Equation

  • The agent still needs to update the model weights according to the Bellman Equation
  • We want to replicate the Temporal Difference target operation using the network instead of a Q-table
  • Adjust weights and update the new q-value

Temporal Difference Target:

$R_t + λ \max_a Q(S_{t+1}, a)$
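
As a sketch of how that update can look in code (every name here is illustrative; the `alpha` and `discount` values are assumptions, not prescriptions from the article):

import numpy as np

def train_step(main_model, target_model, batch, alpha=0.7, discount=0.95):
    states = np.array([experience[0] for experience in batch])
    next_states = np.array([experience[3] for experience in batch])
    current_qs = main_model.predict(states, verbose=0)        # Q(S_t, .) from the main net
    future_qs = target_model.predict(next_states, verbose=0)  # Q(S_t+1, .) from the target net

    for i, (state, action, reward, next_state, done) in enumerate(batch):
        # Temporal Difference target: R_t + lambda * max_a Q(S_t+1, a)
        td_target = reward if done else reward + discount * np.max(future_qs[i])
        # Bellman update, applied only to the action actually taken
        current_qs[i][action] = (1 - alpha) * current_qs[i][action] + alpha * td_target

    main_model.fit(states, current_qs, verbose=0)   # nudge the main network toward the targets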

Tips and Tricks

  1. Having the right model parameter update frequency is important
    • Updating model weights too often -> slow learning
    • The model weights update every 4 steps
  2. Setting the correct frequency to copy weights from the Main Network to the Target Network is important
    • Getting this wrong can lead to instability
    • Updating every 100 steps seemed to work relatively well
  3. Use the Huber loss function instead of the Mean Squared Error loss function
    • Outliers don’t influence results as much with the Huber loss function
  4. The right initialization strategy is important
    • He Initialization is a good initialization strategy for networks with the ReLU activation function

Plan

  1. [X] Create Simulation
  2. [ ] Create Rewards
  3. [ ] Translate a Q-table to a neural network
  4. [ ] Get it to run

Ideas

  • Instead of deep q-learning I could also train a classifier for a given square.
  • Honestly, creating a classifier might be the move.
  • Probe the frontier
    • For every uncovered square on the frontier, grab a 3x3 centered around it
    • Have the program get trained on these 3x3 boards


Frontier Probing

  • Full Board
  • I think this is the move
  • Probe all the currently uncovered NUMBER squares
    • 3x3 around the number in question
  • Setup a classifier for the squares
  • Flag things that have a high likelihood of being a mine
  • First square is random
  • Probe Frontier Algorithm
    1. Look at all the uncovered NUMBER squares
      • A set of actions can be taken: uncovering each of the 8 adjacent squares
      • Account for FLAGS; they can be used to reduce the number at hand
      • Create a PDF over the action set
        • Values between 0 and 1: 0 for SAFE, 1 for MINE
        • A value of 0.5 means the model has no clue
        • A value of 0 means the square is SAFE
        • A value of 1 means the square has a MINE
    2. Take a set of actions
      • There will be squares with action values of 0 or 1
      • FLAG everything that should be flagged
      • UNCOVER everything that should be uncovered
      • Otherwise, take the BEST action
    3. Repeat
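
A rough sketch of grabbing the 3x3 patch around a number square (the board layout and the 'x' padding value for out-of-bounds cells are assumptions):

def neighborhood_3x3(board, y, x, pad='x'):
    """Return the 3x3 patch centered on (y, x); out-of-bounds cells become `pad`."""
    patch = []
    for dy in (-1, 0, 1):
        row = []
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(board) and 0 <= nx < len(board[0]):
                row.append(board[ny][nx])
            else:
                row.append(pad)
        patch.append(row)
    return patch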

Uncovered Square Probing

Hashtable Implementation

Example -> " 1|112|1.1"

Flip the string around so that the hash table doesn’t become MASSIVE

Finding Squares to Probe

To save on updating, only probe squares that have had a recent change.

Batches of actions are going to be taken. So throw everything into a queue and then process everything at once

  • Initialize Queue
    • Queue is a set of coordinates
    • Repeats are not allowed
  1. On flag
    • Queue surrounding squares
    • Subtract surrounding numbers
  2. On flood uncover
    • Queue number squares
  3. Regular uncover
    • Queue current number
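
A sketch of the coordinate queue, kept as a set so repeats are dropped automatically (names are placeholders):

probe_queue = set()                 # coordinates waiting to be (re)probed

def queue_square(y, x):
    probe_queue.add((y, x))         # adding an existing coordinate is a no-op

def process_queue():
    while probe_queue:
        y, x = probe_queue.pop()
        # probe / classify the 3x3 neighborhood around (y, x) here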

Probing Uncovered Squares Images

  • ./images/3x3/full2.png
  • ./images/3x3/5.png
  • ./images/3x3/6.png
  • ./images/3x3/7.png
  • ./images/3x3/8.png
  • ./images/3x3/9.png
  • ./images/3x3/10.png
  • ./images/3x3/11.png

Covered Squares

Uncovered Squares

Testing

Looks solid


1. 12. …

 [None, None, None, None, None, None, None, None]

xxx 1. 12.

 [None, None, None, None, None, None, None, None]

1 112 …

 [None, None, None, None, None, None, None, None]

x x11 x..

 [None, None, None, None, None, None, None, None]

Minesweeper Implementation

UI / Coordinate System

I want the player to give (x, y) to indicate a position to uncover / flag

Board:

10 | . . . . . . . . . .
09 | . . . . . . . . . .
08 | . . . . . . . . . .
07 | . . . . . . . . . .
06 | . . . . . . . . . .
05 | . . . . . . . . . .
04 | . . . . . . . . . .
03 | . . . . . . . . . .
02 | . . . . . . . . . .
01 | . . . . . . . . . .
   +--------------------
     1 2 3 4 5 6 7 8 9 0
     0 0 0 0 0 0 0 0 0 1

Input: x y    -> Uncover: [y-1][x-1]
Input: f x y  -> Flag variant, Flag: [y-1][x-1]

for y in range(len(board)):
    for x in range(len(board[0])):
        print(board[y][x], end=' ')   # print each row on a single line
    print()

Placing Mines

Recursive function that randomly places a mine if the square doesn’t already have a mine.

placeMine():
    pick a random square
    if empty:
        place a mine
        return True
    return False

placeMines(mines):
    currentMines = 0
    while currentMines < mines:
        if placeMine():
            currentMines += 1

First Move

Will clear mines so that a 0 is where the player clicks.

Simply count and remove the mines in the surrounding squares and then call placeMines once again.

Then the numbers for each square will be assigned

Numbers

Assigned after the first move by the player. This ensures that numbers are only calculated once.

Loop over squares and check surrounding squares for a mine, increment the number if there is a mine.
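
A sketch of that loop, assuming mines are stored as '*' in the game state board (that marker is an assumption):

def assign_numbers(board):
    """Replace every non-mine square with the count of adjacent mines."""
    for y in range(len(board)):
        for x in range(len(board[0])):
            if board[y][x] == '*':                      # skip mines
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < len(board) and 0 <= nx < len(board[0]):
                        if board[ny][nx] == '*':
                            count += 1
            board[y][x] = count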

Boards

  • Two boards are needed
    • Keep the state of the game
    • Keep what the player can see

Game State Board

  • Needs the following:
    • Numbers 0-8
    • Mines

Visible Board

  • Needs the following:
    • Flag
    • Covered
    • Uncovered

Printing Board

Do a check on the visible board: if a square is uncovered, print the corresponding square from the game state board
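
A sketch of that check (the 'uncovered'/'flag' markers are assumed names for the visible-board states):

def print_board(visible, game_state):
    for y in range(len(visible)):
        row = []
        for x in range(len(visible[0])):
            if visible[y][x] == 'uncovered':
                row.append(str(game_state[y][x]))   # show the real number / mine
            elif visible[y][x] == 'flag':
                row.append('F')
            else:
                row.append('.')                     # still covered
        print(' '.join(row))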