CS 5043: HW6: Deep Policy Gradient
Assignment notes:
- Deadline: Friday, April 24th @11:59pm.
- Hand-in procedure: submit observations, results, and code changes
to the Canvas discussion board.
- This homework assignment is collaborative.
- I have released a complete implementation of
Deep Policy Gradient. You can use it in Jupyter or at the
command line. Feel free to make tweaks to it (at the very
least, you will need to make changes to the embedded paths).
Problem
We are going to set up an RL agent to learn to solve the MsPacman-v0
problem. In this problem, a single observation is an image of the
play field (210 x 160 x 3). Actions are one of: NoOp, Up, Right,
Left, Down, Up-Right, Up-Left, Down-Right, and Down-Left (a total of
9). Positive rewards are given every time Ms Pacman consumes a small
pellet, a larger power-up pellet, or a ghost (ghosts can only be
consumed while she is powered up). If Ms Pacman is caught by a ghost
while she is not powered up, she loses a life. Once three lives are
lost, the game terminates. The goal is to accumulate as much reward
as possible before the three lives are expended.
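As a quick sanity check, you can inspect the observation and action spaces directly. This is a minimal sketch that assumes you have the gym package with Atari support installed; it uses only the standard gym API:

import gym

# Create the environment and inspect its spaces
env = gym.make('MsPacman-v0')
obs = env.reset()

print(obs.shape)                             # (210, 160, 3) image of the play field
print(env.action_space.n)                    # 9 discrete actions
print(env.unwrapped.get_action_meanings())   # human-readable action names (NoOp, Up, ...)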
Implementation
The code that I provide is a full implementation of Policy Gradient
with images as inputs (an image-based implementation of
Q-Learning is also available). When you create your agent and initialize your
model, you can specify a range of parameters, including:
- Epsilon, gamma, learning rate, and L2 regularization
- Number of times that an agent-selected action is repeatedly
executed (note that the Ms Pacman environment also adds repeats
on top of this)
- How often greedy evaluations are done within the learning trials
- Maximum replay buffer size
- Deep network architecture: any number of convolutional layers, each
optionally followed by max pooling, followed by any number of
dense layers
- Maximum length of each trial (number of steps) and the maximum
size of each learning mini-batch
Here is a minimalist network/agent configuration:
# One convolutional layer with max pooling (and striding)
conv_layers = [{'filters': 10, 'kernel_size': (5, 5),
                'pool_size': (5, 5), 'strides': (2, 2)}]

# One dense layer
dense_layers = [{'units': 40}]

# Configure the agent
sh = env.observation_space.shape
agent = myImagePolicyGradientAgent(sh, env.action_space.n,
                                   epsilon=0.1, lrate=.0005,
                                   maxlen=2000, gamma=0.4)
agent.build_model(conv_layers=conv_layers, dense_layers=dense_layers,
                  lambda_l2=.0001)
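The agent builds the network for you from these layer specifications. For reference, a rough hand-built Keras equivalent of the same architecture can be useful for checking parameter counts with model.summary(). The sketch below is an approximation, not the exact model that build_model produces: the activation functions, the input shape, and the final softmax over the 9 actions are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Rough equivalent of the conv_layers/dense_layers configuration above
model = Sequential([
    Conv2D(10, (5, 5), activation='elu', input_shape=(210, 160, 3)),
    MaxPooling2D(pool_size=(5, 5), strides=(2, 2)),
    Flatten(),
    Dense(40, activation='elu'),
    Dense(9, activation='softmax'),   # probability distribution over the 9 actions
])
model.summary()   # check that the parameter count is in a reasonable range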
You can also specify the same architecture (and other experiment parameters) at the
command line:
python hw6_basis.py -vv -env MsPacman-v0 -pg -ntrials 100 -results_path results_hw6 \
-conv_size 5 -conv_nfilters 10 -pool 5 -pool_stride 2 -hidden 40 -lrate .0005 \
-replay 1000 -maxlen 2000 -steps 500 -action_repeat 4 -rotation 0 -epsilon .1 -l2 .0001
(the '\' characters indicate line continuation; omit them if you enter the command on a single line)
Hints / Notes
- Buffer size is important. On OSCER, I have been pushing the replay buffer up to 20,000 (my laptop struggles a lot with 8,000).
- Replay size affects training time. I have been in the 1,000-4,000 range.
- Lower gammas seem to work better (especially since the rewards are frequent); see the discounted-return sketch after these notes.
- I have largely been working with architectures that are 100K-300K parameters in size.
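To see why a low gamma can be reasonable when rewards arrive frequently, the sketch below computes the standard REINFORCE-style discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one trial. This is the usual quantity used to weight each action in policy gradient; it is not necessarily the exact form used in the provided code.

import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one trial
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With frequent pellet rewards, gamma=0.4 credits each action almost
# entirely with the rewards that arrive within the next few steps.
print(discounted_returns([10, 0, 10, 0, 10], gamma=0.4))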
What to Hand In
Post important observations, results, and code changes to the Canvas HW6 discussion board.
Grades
This homework is optional. If you make 3 interesting contributions on the discussion board, you will receive full credit.
andrewhfagg -- gmail.com
Last modified: Thu Apr 9 14:25:34 2020