CS 5043: HW6: Deep Policy Gradient
Assignment notes:
- Deadline: Friday, April 24th @11:59pm.
- Hand-in procedure: submit observations, results, and code changes
to the Canvas discussion board.
- This homework assignment is collaborative.
- I have released a complete implementation of
Deep Policy Gradient. You can use it in Jupyter or at the
command line. Feel free to make tweaks to it (at the very
least, you will need to make changes to the embedded paths).
Problem
We are going to set up an RL agent to learn to solve the MsPacman-v0
problem. In this problem, a single observation is an image of the
play field (210 x 160 x 3). Actions are one of: NoOp, Up, Right,
Left, Down, Up-Right, Up-Left, Down-Right, and Down-Left (a total of
9). Positive rewards are given every time Ms Pacman consumes a small
pellet, a larger power-up pellet, or a ghost (ghosts can only be
consumed while she is powered up). If Ms Pacman is caught by a ghost
while she is not powered up, she loses a life. Once three lives are
lost, the game terminates. The goal is to accumulate as much reward
as possible before the three lives are expended.
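As a quick sanity check, you can inspect the observation and action spaces directly. This is a minimal sketch that assumes you have the gym package with Atari support installed; it uses only the standard gym API:

import gym

# Create the environment and inspect its spaces
env = gym.make('MsPacman-v0')
obs = env.reset()

print(obs.shape)                             # (210, 160, 3) image of the play field
print(env.action_space.n)                    # 9 discrete actions
print(env.unwrapped.get_action_meanings())   # human-readable action names (NoOp, Up, ...)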
Implementation
The code that I provide is a full implementation of Policy Gradient
with images as inputs (an image-based implementation of
Q-Learning is also available). When you create your agent and initialize your
model, you can specify a range of parameters, including:
- Epsilon, gamma, learning rate, and L2 regularization
- Number of times that an agent-selected action is repeatedly
executed (note that the Ms Pacman environment also adds repeats
on top of this)
- How often greedy evaluations are done within the learning trials
- Maximum replay buffer size
- Deep network architecture: any number of convolutional layers, each
optionally followed by max pooling, followed by any number of
dense layers
- Maximum length of each trial (number of steps) and the maximum
size of each learning mini-batch
Here is a minimalist network/agent configuration:
# One convolutional layer with max pooling (and striding)
conv_layers = [{'filters': 10, 'kernel_size': (5, 5),
                'pool_size': (5, 5), 'strides': (2, 2)}]

# One dense layer
dense_layers = [{'units': 40}]

# Configure the agent
sh = env.observation_space.shape
agent = myImagePolicyGradientAgent(sh, env.action_space.n,
                                   epsilon=0.1, lrate=.0005,
                                   maxlen=2000, gamma=0.4)
agent.build_model(conv_layers=conv_layers, dense_layers=dense_layers,
                  lambda_l2=.0001)
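The agent builds the network for you from these layer specifications. For reference, a rough hand-built Keras equivalent of the same architecture can be useful for checking parameter counts with model.summary(). The sketch below is an approximation, not the exact model that build_model produces: the activation functions, the input shape, and the final softmax over the 9 actions are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Rough equivalent of the conv_layers/dense_layers configuration above
model = Sequential([
    Conv2D(10, (5, 5), activation='elu', input_shape=(210, 160, 3)),
    MaxPooling2D(pool_size=(5, 5), strides=(2, 2)),
    Flatten(),
    Dense(40, activation='elu'),
    Dense(9, activation='softmax'),   # probability distribution over the 9 actions
])
model.summary()   # check that the parameter count is in a reasonable range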
You can also specify the same architecture (and other experiment parameters) at the
command line:
python hw6_basis.py -vv -env MsPacman-v0 -pg -ntrials 100 -results_path results_hw6 \
-conv_size 5 -conv_nfilters 10 -pool 5 -pool_stride 2 -hidden 40 -lrate .0005 \
-replay 1000 -maxlen 2000 -steps 500 -action_repeat 4 -rotation 0 -epsilon .1 -l2 .0001
(the '\' characters indicate line continuation; omit them if you enter the command on a single line)
Hints / Notes
- Buffer size is important. On OSCER, I have been pushing the replay buffer up to 20,000 (my laptop struggles a lot with 8,000).
- Replay size affects training time. I have been in the 1,000-4,000 range.
- Lower gammas seem to work better (especially since the rewards are frequent); see the discounted-return sketch after these notes.
- I have largely been working with architectures that are 100K-300K parameters in size.
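To see why a low gamma can be reasonable when rewards arrive frequently, the sketch below computes the standard REINFORCE-style discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one trial. This is the usual quantity used to weight each action in policy gradient; it is not necessarily the exact form used in the provided code.

import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one trial
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With frequent pellet rewards, gamma=0.4 credits each action almost
# entirely with the rewards that arrive within the next few steps.
print(discounted_returns([10, 0, 10, 0, 10], gamma=0.4))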
What to Hand In
Post important observations, results, and code changes to the Canvas HW6 discussion board.
Grades
This homework is optional. If you make 3 interesting contributions on the discussion board, you will receive full credit.
andrewhfagg -- gmail.com
Last modified: Thu Apr 9 14:25:34 2020