- Solid overview of different reinforcement learning techniques
- It looks like I am going to use Q-Learning
- Makes the classifier insensitive to where the object is in a picture
Generally: model-based learning attempts to model the environment and then chooses the optimal policy based on its learned model;
in model-free learning the agent relies on trial-and-error experience to build up the optimal policy.
- Reinforcement learning is either model-free or model-based
- Model-free: the agent doesn't learn a model of its environment
- Q-learning is one such model-free method
- Initialize your Q-table
- Choose an action using the Epsilon-Greedy Exploration Strategy
- Update the Q-table using the Bellman equation
This would be the same as generating a bunch of states of the Minesweeper game, taking the optimal (or a random) action for each one, and then recording the result in the table.
Minesweeper is too big of a game for this all to be stored within a single table.
Example Q-Table:
| State-action | Q-Value |
| --- | --- |
| (S1, A1) | 5 |
| (S3, A0) | 9 |
| (S5, A1) | 1 |
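A minimal sketch of how such a table could be held in Python, using a dict keyed by (state, action) pairs (the states and actions below are placeholders, not real Minesweeper states):

```python
# Hypothetical Q-table: unseen pairs default to 0 because the agent knows nothing yet.
q_table = {
    ("S1", "A1"): 5.0,
    ("S3", "A0"): 9.0,
    ("S5", "A1"): 1.0,
}

print(q_table.get(("S2", "A1"), 0.0))  # -> 0.0 for a pair the agent hasn't seen
```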
- Maps a state-action pair to a q-value
- Q-Value: the estimated optimal future value or the maximum expected reward an agent can reach by taking a given action A from the state S
- The Q-table is initialized to all zeros because the agent doesn't know anything about the world yet
- Learning about the world through trial and error is called exploration
- The Q-Learning Algorithm tries to learn the Q-Value for a new environment.
- Once the Q-table has been populated, action A can be taken to maximize the reward for a given state S
- Picking the best action at a state is called exploitation
A common strategy for tackling the exploration-exploitation tradeoff is the Epsilon-Greedy Exploration Strategy
- At every time step when an action needs to be taken, roll the dice
- If the roll comes up less than epsilon, choose a random action a. This allows the agent to explore the environment some more
- Otherwise take the best known action at the agent’s current state
NOTE: At the beginning of the algorithm, all the actions taken are random. This helps the agent learn about the environment. As time progresses, better actions will be taken.
NOTE: Epsilon is initialized to 1 -> all actions are random at first.
Toward the end of training, more exploitation happens and less exploration.
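A rough sketch of that dice roll, assuming the dict-style `q_table` above and a list of legal `actions` (both names are placeholders):

```python
import random

def choose_action(q_table, state, actions, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: pick a random action
    # exploit: best known action for this state (unseen pairs default to 0)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

Epsilon would start at 1 and decay toward a small value as training goes on, matching the notes above.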
The Bellman Equation tells us how to update our Q-table after each step we take.
SUMMARY: The agent updates the current perceived value with the estimated optimal future reward, which assumes that the agent takes the best currently known action. In implementation, the agent will search through all actions for a particular state and choose the state-action pair with the highest corresponding Q-value
$Q(S_t, A_t) = (1 - \alpha) \, Q(S_t, A_t) + \alpha \left( R_t + \lambda \max_a Q(S_{t+1}, a) \right)$
- $S$ = the state or observation
- $A$ = the action the agent takes
- $R$ = the reward from taking an action
- $t$ = the time step
- $\alpha$ = the learning rate
- $\lambda$ = the discount factor, which causes rewards to lose their value over time so more immediate rewards are valued more highly
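A sketch of that update applied to the dict-style Q-table; the argument names and the default `alpha`/`lam` values are assumptions:

```python
def bellman_update(q_table, state, action, reward, next_state, next_actions,
                   alpha=0.1, lam=0.95):
    """Blend the old Q-value with the reward plus the discounted best future value."""
    old_q = q_table.get((state, action), 0.0)
    best_future = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                      default=0.0)
    q_table[(state, action)] = (1 - alpha) * old_q + alpha * (reward + lam * best_future)
```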
Vanilla Q-Learning: A table maps each state-action pair to its corresponding Q-value
Deep Q-Learning: A Neural Network maps input states to (action, Q-Value) pairs
- Initialize your Main and Target neural networks
- Choose an action using the Epsilon-Greedy Exploration Strategy
- Update your network weights using the Bellman Equation
A major difference between Deep Q-Learning and Vanilla Q-Learning is the implementation of the Q-table.
- Deep Q-learning uses a neural network to map input states -> (action, Q-value) pairs
- Different from a table mapping state-action pairs to a Q-value
- Deep Q-Learning uses two neural networks
- A main network and a target network
- These networks have the same architecture but different weights
- Every N steps, weights from main network are copied to target network
- Two networks leads to more stability and efficiency in the learning process
- Main and target networks map input states to an (action, q-value) pair
- In this case, each output node (representing an action) contains the action’s q-value as a floating point number.
- The output nodes do not represent a probability distribution, so they will not add up to 1
From the article:

    # imports needed to run the snippet
    import tensorflow as tf
    from tensorflow import keras

    def agent(state_shape, action_shape):
        """Build one of the (identically shaped) main/target networks."""
        learning_rate = 0.001
        init = tf.keras.initializers.HeUniform()
        model = keras.Sequential()
        model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu',
                                     kernel_initializer=init))
        model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
        model.add(keras.layers.Dense(action_shape, activation='linear',
                                     kernel_initializer=init))
        model.compile(loss=tf.keras.losses.Huber(),
                      optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      metrics=['accuracy'])
        return model
- The main and target networks are 3 densely connected layers with ReLU activation functions
- [ ] TODO What the heck do these mean :P
- He uniform initialization as well as the Huber loss function are used to achieve better performance
- [ ] TODO What the heck do these mean :P
The agent chooses a random action with probability epsilon and exploits the best action with probability 1 - epsilon
- Both the Main model and the Target model map input states to output actions.
- These output actions actually represent the model’s predicted Q-value
- The action that has the largest predicted Q-Value is the best known action at that state.
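A sketch of that selection with the main network; `main_model`, the flat state array, and `num_actions` are assumed names, not the article's exact code:

```python
import numpy as np

def pick_action(main_model, state, num_actions, epsilon):
    """Epsilon-greedy over the network's predicted Q-values for one state."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)      # explore
    q_values = main_model.predict(state.reshape(1, -1), verbose=0)[0]
    return int(np.argmax(q_values))                # exploit the best known action
```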
$Q(S_t, A_t) = (1 - \alpha) \, Q(S_t, A_t) + \alpha \left( R_t + \lambda \max_a Q(S_{t+1}, a) \right)$
- After choosing and executing an action, the Main and Target networks need to be updated
- Deep Q-Learning agents use Experience Replay to learn about the environment and update the Main and Target networks
- Main Network samples and trains on a bunch of past experiences every 4 steps.
- The main network weights are copied to the target network weights every 100 steps.
Experience Replay is the act of storing and replaying game states (the state, action, reward, next_state) that the RL algorithm is able to learn from.
- Experience Replay can be used in off-policy algorithms to learn in an offline fashion.
- Can update algorithm’s parameters using information from previously taken actions.
- Used in small batches in order to avoid skewing the dataset distribution of different states, actions, and next_states that the neural network will see.
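A minimal replay-buffer sketch using a deque; the tuple layout, buffer size, and `batch_size` are assumptions:

```python
import random
from collections import deque

replay_memory = deque(maxlen=50_000)  # holds (state, action, reward, next_state, done)

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=64):
    """Small random batches keep the training distribution from getting skewed."""
    return random.sample(replay_memory, min(batch_size, len(replay_memory)))
```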
- The agent still needs to update the model weights according to the Bellman Equation
- We want to replicate the Temporal Difference target operation using the network instead of a Q-table
- Adjust weights and update the new q-value
Temporal Difference Target:
$R_t + \lambda \max_a Q(S_{t+1}, a)$
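A sketch of a training step that builds that target with the target network and fits the main network toward it; the names (`main_model`, `target_model`) and the blended update with `alpha` are assumptions rather than the article's exact code:

```python
import numpy as np

def train_step(main_model, target_model, batch, alpha=0.7, lam=0.95):
    """Move the main network's Q-values toward the TD targets for a sampled batch."""
    states      = np.array([s for s, _, _, _, _ in batch])
    next_states = np.array([ns for _, _, _, ns, _ in batch])

    current_q = main_model.predict(states, verbose=0)          # Q-values to adjust
    future_q  = target_model.predict(next_states, verbose=0)   # stable target estimates

    for i, (_, action, reward, _, done) in enumerate(batch):
        target = reward if done else reward + lam * np.max(future_q[i])
        # blend old and new values, mirroring the tabular Bellman update above
        current_q[i][action] = (1 - alpha) * current_q[i][action] + alpha * target

    main_model.fit(states, current_q, verbose=0)
```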
- Having the right model parameter update frequency is important
  - Updating model weights too often -> slow learning
  - The model weights update every 4 steps
- Setting the correct frequency to copy weights from the Main Network to the Target Network matters
  - The wrong frequency can lead to instability
  - Updating every 100 steps seemed to work relatively well
- Use the Huber loss function instead of the Mean Squared Error loss function
  - Outliers don't influence results as much with the Huber loss function
- The right initialization strategy is important
  - He initialization is a good strategy for networks with the ReLU activation function
- [X] Create Simulation
- [ ] Create Rewards
- [ ] Translate a Q-table to a neural network
- [ ] Get it to run
- Instead of deep q-learning I could also train a classifier for a given square.
- Honestly, creating a classifier might be the move.
- Probe the frontier
- For every uncovered square on the frontier, grab a 3x3 centered around it
- Train the program on these 3x3 boards
- Full Board
- I think this is the move
- Probe all the currently uncovered NUMBER squares
- 3x3 around the number in question
- Setup a classifier for the squares
- Flag things that have a high likelihood of being a mine
- First square is random
- Probe Frontier Algorithm
  1. Look at all the uncovered NUMBER squares
     - A set of actions can be taken, uncovering each of the 8 adjacent squares
     - Account for FLAGS; these can be used to reduce the number at hand
     - Create a PDF for the action set
       - Values between 0 and 1: 0 for SAFE, 1 for MINE
       - A value of 0.5 means it has no clue
       - A value of 0 means the square is SAFE
       - A value of 1 means the square has a MINE
  2. Take a set of actions
     - There will be squares with action values of 0 or 1
     - FLAG everything that should be flagged
     - UNCOVER everything that should be uncovered
     - Take the BEST action otherwise
  3. Repeat
Example -> ” 1|112|1.1”
Flip the string around so that the hash table doesn’t become MASSIVE
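A sketch of building such a key, assuming the example string is the 3x3 neighborhood written as three rows joined by `|`, and assuming a `visible` 2D board of single-character cells:

```python
def encode_neighborhood(visible, x, y):
    """Encode the 3x3 patch centered on (x, y) as a string key like ' 1|112|1.1'."""
    rows = []
    for dy in (-1, 0, 1):
        row = ""
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(visible) and 0 <= nx < len(visible[0]):
                row += str(visible[ny][nx])  # e.g. '0'-'8', '.', or a flag marker
            else:
                row += " "                   # off-board squares as blanks
        rows.append(row)
    return "|".join(rows)
```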
To save on updating, only probe squares that have had a recent change.
Batches of actions are going to be taken, so throw everything into a queue and then process everything at once.
- Initialize Queue
- Queue is a set of coordinates
- Repeats are not allowed (see the sketch after this list)
- On flag
  - Queue surrounding squares
  - Subtract surrounding numbers
- On flood uncover
  - Queue number squares
- Regular uncover
  - Queue current number
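A sketch of that queue as a set-backed FIFO so repeated coordinates are ignored (class and method names are placeholders):

```python
from collections import deque

class CoordQueue:
    """FIFO of (x, y) coordinates that silently drops repeats while queued."""
    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, coord):
        if coord not in self._seen:
            self._seen.add(coord)
            self._queue.append(coord)

    def pop(self):
        coord = self._queue.popleft()
        self._seen.discard(coord)  # the square may be queued again after later changes
        return coord

    def __bool__(self):
        return bool(self._queue)
```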
Looks solid
1. 12. …
[None, None, None, None, None, None, None, None]
xxx 1. 12.
[None, None, None, None, None, None, None, None]
1 112 …
[None, None, None, None, None, None, None, None]
x x11 x..
[None, None, None, None, None, None, None, None]
I want the player to give (x, y) to indicate a position to uncover / flag
Board:

    10 | . . . . . . . . . .
    09 | . . . . . . . . . .
    08 | . . . . . . . . . .
    07 | . . . . . . . . . .
    06 | . . . . . . . . . .
    05 | . . . . . . . . . .
    04 | . . . . . . . . . .
    03 | . . . . . . . . . .
    02 | . . . . . . . . . .
    01 | . . . . . . . . . .
       +--------------------
         1 2 3 4 5 6 7 8 9 0
         0 0 0 0 0 0 0 0 0 1
Input: x y -> Uncover: [y-1][x-1]
Input: f x y (flag variant) -> Flag: [y-1][x-1]

    for y in range(len(board)):
        for x in range(len(board[0])):
            print(board[y][x], end=" ")  # print each square on the same row
        print()                          # newline at the end of each row
Recursive function that randomly places a mine if the square doesn't already have a mine.

    placeMine():
        pick a random square
        if empty:
            place a mine and return True
        return False

    placeMines(mines):
        currentMines = 0
        while currentMines < mines:
            if placeMine():
                currentMines += 1
Will clear mines so that a 0 is where the player clicks.
Simply count and remove the mines in the surrounding squares and then call placeMines once again.
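A runnable sketch of the placement plus first-click clearing, assuming `mines` is a 2D list of booleans; the `forbidden` set is an addition the notes don't spell out, added so re-placed mines can't land back around the click:

```python
import random

def place_mine(mines, forbidden=frozenset()):
    """Try one random square; place a mine and report success if it was empty and allowed."""
    y = random.randrange(len(mines))
    x = random.randrange(len(mines[0]))
    if (x, y) not in forbidden and not mines[y][x]:
        mines[y][x] = True
        return True
    return False

def place_mines(mines, count, forbidden=frozenset()):
    """Keep trying random squares until `count` mines have been placed."""
    placed = 0
    while placed < count:
        if place_mine(mines, forbidden):
            placed += 1

def clear_first_click(mines, x, y):
    """Remove mines around the first click so it lands on a 0, then re-place them elsewhere."""
    cleared = set()
    removed = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(mines) and 0 <= nx < len(mines[0]):
                cleared.add((nx, ny))
                if mines[ny][nx]:
                    mines[ny][nx] = False
                    removed += 1
    place_mines(mines, removed, forbidden=cleared)
```

The notes describe placeMine as recursive; the while loop in `place_mines` plays the same retry role here.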
Then the numbers for each square will be assigned.
They are assigned after the first move by the player. This ensures that numbers are only calculated once.
Loop over squares and check surrounding squares for a mine, increment the number if there is a mine.
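A sketch of that counting pass, again assuming the boolean `mines` grid (names are assumptions):

```python
def assign_numbers(mines):
    """For every non-mine square, count mines in the 8 surrounding squares."""
    height, width = len(mines), len(mines[0])
    numbers = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mines[y][x]:
                numbers[y][x] = -1  # marker for a mine square
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < height and 0 <= nx < width and mines[ny][nx]:
                        numbers[y][x] += 1
    return numbers
```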
- Two boards are needed
  - One keeps the state of the game
  - One keeps what the player can see
- The game-state board needs the following:
- Numbers 0-8
- Mines
- The visible board needs the following:
- Flag
- Covered
- Uncovered
When printing, do a check on the visible board: if a square is uncovered, print the corresponding square from the game-state board.
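A small sketch of that check when rendering, assuming `game_state` holds the numbers/mines and `visible` holds covered/uncovered/flag markers (all names and markers are assumptions):

```python
COVERED, UNCOVERED, FLAG = 0, 1, 2  # hypothetical markers for the visible board

def render(game_state, visible):
    """Print the player's view: uncovered squares show the underlying game state."""
    for y in range(len(visible)):
        row = []
        for x in range(len(visible[0])):
            if visible[y][x] == UNCOVERED:
                row.append(str(game_state[y][x]))
            elif visible[y][x] == FLAG:
                row.append("f")
            else:
                row.append(".")
        print(" ".join(row))
```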