CS 5043: HW5: Q-Learning
Assignment notes:
- Deadline: Friday, March 27th @11:59pm.
- Hand-in procedure: submit to the HW5 drop box on Canvas (zip or
tar file with code) and Gradescope (pdf). The pdf should
include all relevant source files and results.
- This work is to be done on your own. While general discussion
about Python, Keras and Tensorflow is encouraged, sharing
solution-specific code is inappropriate. Likewise, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
- I have released a new version of qlearning-skel2.ipynb. Feel
free to use this as your starting point. There is a small amount of
code to fill in, plus the setup for your problem.
Problem
We are going to set up an RL agent to learn to solve the Acrobot-v1
problem. Before tackling Acrobot, I suggest that you first get networks
working for CartPole-v1 and/or Pendulum-v0.
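Before writing any learning code, it can help to confirm the
environment's state/action spaces and reward scale. A minimal sketch,
assuming the standard gym package (and the gym API from the time of
this assignment, where reset() returns the state and step() returns a
4-tuple):

    import gym

    # Create the environment and inspect its spaces before committing
    # to a network architecture
    env = gym.make('Acrobot-v1')
    print(env.observation_space)   # Box(6,): six continuous state variables
    print(env.action_space)        # Discrete(3): three torque actions

    # One random-policy episode, useful for checking the reward scale
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_reward += reward
    print('Random-policy episode reward:', total_reward)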
Your code is largely in place, so your focus will be on:
- Choosing an appropriate deep network architecture for Q-value
approximation
- Choosing learning hyper-parameters, including the learning rate, epsilon and
the discount factor (gamma)
- Selecting a regularization strategy (L2/L1 on your kernel;
L2/L1 on your hidden-unit activation levels; the latter is
about forcing sparsity in your hidden-layer activation patterns).
A sketch of one possible Q-network appears after this list.
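Here is a minimal sketch of such a Q-network in Keras, assuming the
6-dimensional Acrobot-v1 state and its 3 discrete actions. The layer
sizes, activation functions, learning rate, and regularization weights
are placeholders, not recommendations; they are exactly the
hyper-parameters you need to choose.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def build_q_network(state_dim=6, n_actions=3, lrate=0.001, l2_weight=1e-4):
        model = keras.Sequential([
            layers.InputLayer(input_shape=(state_dim,)),
            # kernel_regularizer penalizes large weights; using
            # activity_regularizer=regularizers.l1(...) instead would push
            # hidden activations toward sparsity
            layers.Dense(64, activation='elu',
                         kernel_regularizer=regularizers.l2(l2_weight)),
            layers.Dense(64, activation='elu',
                         kernel_regularizer=regularizers.l2(l2_weight)),
            # Linear output layer: one Q-value estimate per action
            layers.Dense(n_actions, activation='linear'),
        ])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=lrate),
                      loss='mse')
        return model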
Experiments
Once you have settled on network hyper-parameters:
- Perform 5 independent runs. Generate a plot that shows the
learning curve (total accumulated reward) as a function of
training epoch; a plotting/summary sketch is given after this list.
Reported performance can be with respect to
the epsilon-greedy policy or the greedy policy (in the example
we discussed in class, I reported only the epsilon-greedy performance)
- For each run, report the mean performance over the last 100
trials. This can be the performance of the epsilon-greedy or
the greedy policy.
- Pick something to alter about your network/model. A good choice
would be to add some form of regularization. You could also add epsilon
"shrinking" (decaying epsilon over training; a simple decay scheme
appears in the sketch after this list). Perform another 5
independent runs and show their performance.
- For each run, report the mean performance over the last 100
trials.
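A minimal sketch of the learning-curve figure and the last-100-trial
summary, assuming each run produces a list of per-epoch total rewards
(rewards_per_run[i][e] for run i, epoch e). The function and variable
names here are illustrative, not taken from the skeleton:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_runs(rewards_per_run, fname='learning_curves.png'):
        # One learning curve per independent run
        plt.figure()
        for i, rewards in enumerate(rewards_per_run):
            plt.plot(rewards, label='run %d' % i)
        plt.xlabel('Training epoch')
        plt.ylabel('Total accumulated reward')
        plt.legend()
        plt.savefig(fname)

    def final_performance(rewards_per_run, n=100):
        # Mean total reward over the last n trials, reported per run
        return [float(np.mean(rewards[-n:])) for rewards in rewards_per_run]

    def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
        # One simple epsilon-"shrinking" scheme: geometric decay to a
        # floor, applied once per epoch
        return max(epsilon_min, epsilon * decay)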
Hints / Notes
- Be patient. With sparse rewards, it can take a while for a
randomly exploring agent to discover a goal (epsilon-greedy action
selection, the usual source of that exploration, is sketched after
this list).
- Conversely, if you find yourself training for many thousands of
epochs without interesting results, then something is wrong.
- Optimal performance is a policy that consistently receives
episode scores of -50 to -70, depending on the initial conditions of the trial.
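For reference, a minimal sketch of epsilon-greedy action selection over
a Keras Q-network like the one above; this is an illustration, not
necessarily how the skeleton code structures it:

    import numpy as np

    def select_action(model, state, epsilon, n_actions=3):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise act greedily with respect to the current Q estimates
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        q_values = model.predict(np.asarray(state).reshape(1, -1))[0]
        return int(np.argmax(q_values))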
What to Hand In
Hand in your notebook containing all of your code + the PDF export of
the code. The PDF file must include:
- Code for generating and training the network. Some useful unix
command line programs:
- enscript: translate code (e.g., py files) into postscript files
- ps2pdf: translate postscript files into pdf files
- pdfunite: merge several pdf files together
- Four figures described above
- A report of the mean performance (over the last 100 trials) for each
of the two configurations
Grades
- 30 pts: Complete learning code
- 35 pts: Performance on 5-runs shows interesting improvement
- 35 pts: Altered configuration: performance on 5 runs also shows
interesting improvement
andrewhfagg -- gmail.com
Last modified: Tue Mar 24 13:37:05 2020