Add content to documentation #141

Open · wants to merge 22 commits into base: main
5 changes: 5 additions & 0 deletions README.md
@@ -217,6 +217,11 @@ Because JAX installation is different depending on your CUDA version, Haiku does

First, follow these instructions to install JAX with the relevant accelerator support.

```
pip install -r requirements.txt
```


## General Information
The project entrypoint is `pax/experiment.py`. The simplest command to run a game would be:

91 changes: 87 additions & 4 deletions docs/getting-started/agents.md
@@ -1,9 +1,92 @@
# Agents

## Overview

Pax provides a number of fixed opponents and learning agents to train and to train against.

## Specifying an Agent
<!-- TODO: This isn't how Pax works atm. However, taken from Github README, so
assuming it is on the TODOs to add later. -->
Pax ships with an `Agent` class and several predefined agents. To specify an agent, import the `Agent` class and construct it with the agent parameters.

```
import jax
import jax.numpy as jnp

import Agent  # illustrative import; the real import path depends on the chosen agent

args = {"hidden": 16, "observation_spec": 5}
rng = jax.random.PRNGKey(0)
bs = 1
init_hidden = jnp.zeros((bs, args["hidden"]))
obs = jnp.ones((bs, args["observation_spec"]))

# create the agent and its initial state and memory
agent = Agent(args)
state, memory = agent.make_initial_state(rng, init_hidden)

# sample an action from the agent's policy
action, state, memory = agent.policy(rng, obs, memory)

# update the agent from a batch of collected trajectories (`traj_batch`)
state, memory, stats = agent.update(
    traj_batch, obs, state, memory
)

# reset the agent's memory between episodes
memory = agent.reset_memory(memory, False)
```

To run an experiment with a specific agent, use a pre-made `.yaml` file located in `conf/...` or create your own, and specify the agent. In the below example, `agent1` is a learning agent that learns via PPO and `agent2` is an agent that only chooses the Cooperate action.

```
# Agents
agent1: 'PPO'
agent2: 'Altruistic'

...
```

## List of Agents

```{note}
Fixed agents are game-specific, while learning agents like PPO can be used in both games.
```

### agent1, agent2

#### Fixed

Matrix games

| Agent | Description |
| ----------- | ----------- |
| **`Altruistic`** | Always chooses the Cooperate (C) action. |
| **`Defect`** | Always chooses the Defect (D) action. |
| **`GrimTrigger`** | Chooses the C action on the first turn and reciprocates with the C action until the opponent chooses D, at which point it switches to only choosing D.|
| **`HyperAltruistic`** | Infinite matrix game variant of `Altruistic`. Always chooses the Cooperate (C) action.|
| **`HyperDefect`** | Infinite matrix game variant of `Defect`. Always chooses the Defect (D) action.|
| **`HyperTFT`** | Infinite matrix game variant of `TitForTat`. Chooses the C action on the first turn and reciprocates the opponent's last action.|
| **`Random`** | Randomly chooses the C or D action. |
| **`TitForTat`** | Chooses the C action on the first turn and reciprocates the opponent's last action.|
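
For intuition, a reciprocating strategy such as `TitForTat` can be written as a small pure function. The sketch below is illustrative only; it is not Pax's implementation, and the function name and the 0 = Defect / 1 = Cooperate convention are assumptions made for this example.

```
import jax.numpy as jnp

# Illustrative sketch, not Pax's implementation.
# Convention assumed here: 0 = Defect (D), 1 = Cooperate (C).
def tit_for_tat(opponent_last_action, is_first_turn):
    cooperate = jnp.ones_like(opponent_last_action)
    # Cooperate on the first turn, then mirror the opponent's last action.
    return jnp.where(is_first_turn, cooperate, opponent_last_action)

last = jnp.array(0)                            # opponent defected last turn
print(tit_for_tat(last, is_first_turn=False))  # -> 0 (defect back)
```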


Coin Game

| Agent | Description|
| ----------- | ----------- |
| **`EvilGreedy`** | Attempts to pick up the closest coin. If equidistant from two coins, it chooses its opponent's color coin.|
| **`GoodGreedy`** | Attempts to pick up the closest coin. If equidistant from two coins, it chooses its own color coin. |
| **`RandomGreedy`** | Attempts to pick up the closest coin. If equidistant from two coins, it chooses a coin color at random. |
| **`Stay`** | Agent does not move.|

#### Learning

| Agent | Description |
| ----------- | ----------- |
| **`Naive`** | Simple learning agent that learns via REINFORCE. |
| **`NaiveEx`** | Infinite matrix game variant of `Naive`. Simple learning agent that learns via REINFORCE. |
| **`MFOS`** | Meta-learning algorithm for opponent shaping. |
| **`PPO`** | Learning agent parameterised by a multilayer perceptron that learns via PPO. |
| **`PPO_memory`** | Learning agent parameterised by a multilayer perceptron with a memory component that learns via PPO. |
| **`Tabular`** | Learning agent parameterised by a single layer perceptron that learns via PPO. |

```{note}
`PPO_memory` serves as the core learning algorithm for both **Good Shepherd (GS)** and **Context and History Aware Other Shaping (CHAOS)** when training with meta-learning.
```
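
For intuition about the simplest learning agent above, here is a textbook REINFORCE policy-gradient loss written in JAX. This is a generic sketch rather than Pax's `Naive` agent; the toy two-action softmax policy and the variable names are assumptions made for this example.

```
import jax
import jax.numpy as jnp

# Generic REINFORCE loss (illustrative; not Pax's Naive agent).
# The toy policy is a softmax over the two matrix-game actions.
def reinforce_loss(logits, actions, returns):
    log_probs = jax.nn.log_softmax(logits)      # log pi(a) for each action
    log_prob_taken = log_probs[actions]         # log pi(a_t) for the sampled actions
    return -jnp.mean(log_prob_taken * returns)  # ascend E[log pi(a_t) * G_t]

logits = jnp.zeros(2)                           # uniform initial policy
actions = jnp.array([0, 1, 1, 0])               # sampled actions
returns = jnp.array([1.0, -1.0, 0.5, 2.0])      # discounted returns
grads = jax.grad(reinforce_loss)(logits, actions, returns)
```

The PPO-based agents replace this plain policy-gradient loss with PPO's clipped surrogate objective.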

93 changes: 89 additions & 4 deletions docs/getting-started/environments.md
@@ -1,9 +1,94 @@
# Environments

## Overview
Pax supports two families of environments for learning agents to train in: matrix games and grid-world games.

## Specifying the Environment

Pax environments follow an interface similar to gymnax. To specify an environment, import it and set its environment parameters.

```
import jax
import jax.numpy as jnp

from pax.envs.iterated_matrix_game import (
    IteratedMatrixGame,
    EnvParams,
)

rng = jax.random.PRNGKey(0)
# prisoner's dilemma payoffs (see the `payoff` parameter below)
payoff = [[-1, -1], [-3, 0], [0, -3], [-2, -2]]

env = IteratedMatrixGame(num_inner_steps=5)
env_params = EnvParams(payoff_matrix=payoff)

# 0 = Defect, 1 = Cooperate; here both agents cooperate every step
actions = (jnp.ones(()), jnp.ones(()))
obs, env_state = env.reset(rng, env_params)
done = False

while not done:
    obs, env_state, rewards, done, info = env.step(
        rng, env_state, actions, env_params
    )
```

To specify the environment parameters in an experiment `.yaml` file:

```
...
# Environment
env_id: coin_game
env_type: meta
egocentric: True
env_discount: 0.96
payoff: [[1, 1, -2], [1, 1, -2]]
...
```

## List of Environment Parameters

### env_id
| Name | Description |
| :----------- | :----------- |
|`iterated_matrix_game`| Classic normal form game with a 2x2 payoff matrix repeatedly played over `n` steps. |
|`infinite_matrix_game` | Special case of the classic normal form game that calculates an exact value, simulating an infinite game. |
|`coin_game` | Classic grid-world social dilemma environment. |

### env_type

| Name | Description |
| :----------- | :----------- |
|`sequential`| Classic normal form game with a 2x2 payoff matrix repeatedly played over `n` steps. |
|`meta`| Meta-learning regime, where an agent is trained over many inner episodes (a meta-episode) rather than a single one. |

### egocentric
| Name | Description |
| :----------- | :----------- |
|*bool*| If `True`, sets an agent in the Coin Game environment to an egocentric view, empirically found to be more appropriate for other shaping. Otherwise, the agent uses a non-egocentric view, in line with the original version of the game. |

### env_discount
<!-- TODO: Possibly deprecate. -->
| Name | Description |
| :----------- | :----------- |
|*Numeric*| Meta-learning discount factor. Between 0 and 1. |
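
For reference, a discount factor γ weights future rewards in the usual way. This is the generic definition, not anything specific to Pax's meta-learning setup:

```{math}
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k}
```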

### payoff
| Name | Description |
| :----------- | :----------- |
|*Array*| Custom payoff for game. |

Example:

```
# if playing Coin Game
payoff: [[1, 1, -2], [1, 1, -2]]
```

```
# if playing Matrix Games
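# assumed outcome order: CC, CD, DC, DD; each inner pair is (player 1 payoff, player 2 payoff)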
payoff: [[-1, -1], [-3, 0], [0, -3], [-2, -2]]
```

```{note}
Docstrings are under construction. Please check back later.
```



81 changes: 81 additions & 0 deletions docs/getting-started/evaluation.md
@@ -0,0 +1,81 @@
# Saving & Loading

Pax provides an easy way to save and load your models.

## Overview

Saving and loading lets users store and restore models either locally or from Weights and Biases. The save and load file paths are configured in the experiment `.yaml` file.

## List of Saving Parameters

### save
| Name | Description |
| :----------- | :----------- |
|*bool* | If `True`, the model is saved to the filepath specified by `save_dir`. |


### save_dir
| Name | Description |
| :----------- | :----------- |
|*String* | Filepath used to save a model. |

### save_interval

| Name | Description |
| :----------- | :----------- |
|*Int* | Number of iterations between saving a model. |

Example:
```
# config.yaml
save: True
save_interval: 10
save_dir: "./exp/${wandb.group}/${wandb.name}"
```

## List of Loading Parameters

### model_path
| Name | Description |
| :----------- | :----------- |
|*String* | Filepath to load the model. |

### run_path
| Name | Description |
| :----------- | :----------- |
|*String* | If using Weights and Biases (i.e. `wandb.log=True`), this is the Weights and Biases run path used to locate and load the model. |

Example:
```
# config.yaml
run_path: ucl-dark/cg/3mpgbfm2
model_path: exp/coin_game-EARL-PPO_memory-vs-Random/run-seed-0/2022-09-08_20.41.03.643377/generation_30
```
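
For reference, a file saved under a Weights and Biases run can also be pulled directly with the `wandb` client. The snippet below shows generic `wandb` usage rather than how Pax necessarily restores models internally, and it reuses the example paths above:

```
import wandb

# Generic Weights and Biases usage (not necessarily Pax's internal loading path).
checkpoint = wandb.restore(
    "exp/coin_game-EARL-PPO_memory-vs-Random/run-seed-0/2022-09-08_20.41.03.643377/generation_30",
    run_path="ucl-dark/cg/3mpgbfm2",
)
print(checkpoint.name)  # local path of the downloaded file
```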

### wandb

```{note}
The following parameters are used for Weights and Biases specific features.
```

```
wandb:
entity: "ucl-dark"
project: cg
group: 'EARL-${agent1}-vs-${agent2}'
name: run-seed-${seed}
log: False
```
| Name | Description |
| :----------- | :----------- |
|`entity` | Weights and Biases entity. |
|`project` | Weights and Biases project name. |
|`group` | Weights and Biases group name. |
|`name` | Weights and Biases run name. |
|`log` | If `True`, logs the experiment to Weights and Biases. |






6 changes: 1 addition & 5 deletions docs/getting-started/installation.md
@@ -1,7 +1,3 @@
# Installation

Pax will soon be available to install via the [Python Package Index](https://github.com/akbir/pax). For full installation instructions, please refer to the [Install Guide](https://github.com/akbir/pax) in the project README.
94 changes: 90 additions & 4 deletions docs/getting-started/runners.md
@@ -1,9 +1,95 @@
# Runner

## Overview

Pax provides a number of experiment runners that cover different ways of training and evaluating reinforcement learning agents.

## Specifying a Runner

Pax is built around its runners: pieces of custom experiment logic that leverage the speed of JAX. After the environment and agents are specified, a runner carries out the experiment. The code below shows a portion of a runner that performs a rollout and then updates the agent:

```
def _rollout(carry, unused):
    """Runner for a single step of the inner episode."""
    (
        rngs,
        obs,
        a1_state,
        a1_mem,
        env_state,
        env_params,
    ) = carry

    # unpack rngs (self.split is a helper on the runner class)
    rngs = self.split(rngs, 4)

    # agent 1 selects an action from its policy
    action, a1_state, new_a1_mem = agent1.batch_policy(
        a1_state,
        obs[0],
        a1_mem,
    )

    # step the environment with both agents' actions
    next_obs, env_state, rewards, done, info = env.step(
        rngs,
        env_state,
        (action, action),
        env_params,
    )

    # record the transition for the later update
    traj = Sample(
        obs[0],
        action,
        rewards[0],
        new_a1_mem.extras["log_probs"],
        new_a1_mem.extras["values"],
        done,
        a1_mem.hidden,
    )

    return (
        rngs,
        next_obs,
        a1_state,
        new_a1_mem,
        env_state,
        env_params,
    ), traj


agent1 = Agent(args)
a1_state, a1_mem = agent1.make_initial_state(rng, init_hidden)

for _ in range(num_updates):
    # roll out `rollout_length` inner steps with jax.lax.scan
    carry, batch_trajectory = jax.lax.scan(
        _rollout,
        (rngs, obs, a1_state, a1_mem, env_state, env_params),
        None,
        length=rollout_length,
    )

    rngs, obs, a1_state, a1_mem, env_state, env_params = carry

    # update the agent on the batch of collected trajectories
    a1_state, a1_mem, stats = agent1.update(
        batch_trajectory, obs[0], a1_state, a1_mem
    )
```

Because the rollout is expressed as a `jax.lax.scan`, the whole inner loop can be jit-compiled and run on-device, which is where the runners get their speed.

To specify the runner in an experiment, use a pre-made `.yaml` file located in `conf/...` or create your own, and set the runner with the `runner` key. In the example below, the `evo` flag selects the `EvoRunner`.

```
...
# Runner
runner: evo
...
```

## List of Runners

### runner
| Runner | Description|
| ----------- | ----------- |
| **`eval`** | Evaluation runner, where a single, pre-trained agent is evaluated. |
| **`evo`** | Evolution runner, where two independent agents are trained via Evolutionary Strategies (ES). |
| **`rl`** | Multi-agent runner, where two independent agents are trained via reinforcement learning. |
| **`sarl`** | Single-agent runner, where a single agent is trained via reinforcement learning. |
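
As an example, the `eval` runner is typically paired with the loading parameters described on the Saving & Loading page. The snippet below is a hypothetical combination of keys documented in these pages, reusing the example values shown there:

```
...
# Runner
runner: eval

# Loading (see the Saving & Loading page)
run_path: ucl-dark/cg/3mpgbfm2
model_path: exp/coin_game-EARL-PPO_memory-vs-Random/run-seed-0/2022-09-08_20.41.03.643377/generation_30
...
```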