CS 5043: HW5: Q-Learning
Assignment notes:
- Deadline: Friday, March 27th @11:59pm.
- Hand-in procedure: submit to the HW5 drop box on Canvas (zip or
tar file with code) and Gradescope (pdf). The pdf should
include all relevant source files and results.
- This work is to be done on your own. While general discussion
about Python, Keras and Tensorflow is encouraged, sharing
solution-specific code is inappropriate. Likewise, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
- I have released a new version of qlearning-skel2.ipynb. Feel
free to use this as your starting point. There is a small amount of
code to fill in, plus the setup for your problem.
Problem
We are going to set up an RL agent to learn to solve the Acrobot-v1
problem. Before tackling Acrobot, I suggest that you first get networks
working for CartPole-v1 and/or Pendulum-v0.
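Before writing any learning code, it can help to confirm the
environment's state/action spaces and reward scale. A minimal sketch,
assuming the standard gym package (and the gym API from the time of
this assignment, where reset() returns the state and step() returns a
4-tuple):

    import gym

    # Create the environment and inspect its spaces before committing
    # to a network architecture
    env = gym.make('Acrobot-v1')
    print(env.observation_space)   # Box(6,): six continuous state variables
    print(env.action_space)        # Discrete(3): three torque actions

    # One random-policy episode, useful for checking the reward scale
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_reward += reward
    print('Random-policy episode reward:', total_reward)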
Your code is largely in place, so your focus will be on:
- Choosing an appropriate deep network architecture for Q-value
approximation
- Choosing learning hyper-parameters, including the learning rate, epsilon and
the discount factor (gamma)
- Selecting a regularization strategy (L2/L1 on your kernel;
L2/L1 on your hidden-unit activation levels; the latter is
about forcing sparsity in your hidden-layer activation patterns).
A sketch of one possible Q-network appears after this list.
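Here is a minimal sketch of such a Q-network in Keras, assuming the
6-dimensional Acrobot-v1 state and its 3 discrete actions. The layer
sizes, activation functions, learning rate, and regularization weights
are placeholders, not recommendations; they are exactly the
hyper-parameters you need to choose.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def build_q_network(state_dim=6, n_actions=3, lrate=0.001, l2_weight=1e-4):
        model = keras.Sequential([
            layers.InputLayer(input_shape=(state_dim,)),
            # kernel_regularizer penalizes large weights; using
            # activity_regularizer=regularizers.l1(...) instead would push
            # hidden activations toward sparsity
            layers.Dense(64, activation='elu',
                         kernel_regularizer=regularizers.l2(l2_weight)),
            layers.Dense(64, activation='elu',
                         kernel_regularizer=regularizers.l2(l2_weight)),
            # Linear output layer: one Q-value estimate per action
            layers.Dense(n_actions, activation='linear'),
        ])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=lrate),
                      loss='mse')
        return model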
Experiments
Once you have settled on network hyper-parameters:
- Perform 5 independent runs. Generate a plot that shows the
learning curve (total accumulated reward) as a function of
training epoch; a plotting/summary sketch is given after this list.
Reported performance can be with respect to
the epsilon-greedy policy or the greedy policy (in the example
we discussed in class, I reported only the epsilon-greedy performance)
- For each run, report the mean performance over the last 100
trials. This can be the performance of the epsilon-greedy or
the greedy policy.
- Pick something to alter about your network/model. A good choice
would be to add some form of regularization. You could also add epsilon
"shrinking" (decaying epsilon over training; a simple decay scheme
appears in the sketch after this list). Perform another 5
independent runs and show their performance.
- For each run, report the mean performance over the last 100
trials.
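A minimal sketch of the learning-curve figure and the last-100-trial
summary, assuming each run produces a list of per-epoch total rewards
(rewards_per_run[i][e] for run i, epoch e). The function and variable
names here are illustrative, not taken from the skeleton:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_runs(rewards_per_run, fname='learning_curves.png'):
        # One learning curve per independent run
        plt.figure()
        for i, rewards in enumerate(rewards_per_run):
            plt.plot(rewards, label='run %d' % i)
        plt.xlabel('Training epoch')
        plt.ylabel('Total accumulated reward')
        plt.legend()
        plt.savefig(fname)

    def final_performance(rewards_per_run, n=100):
        # Mean total reward over the last n trials, reported per run
        return [float(np.mean(rewards[-n:])) for rewards in rewards_per_run]

    def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
        # One simple epsilon-"shrinking" scheme: geometric decay to a
        # floor, applied once per epoch
        return max(epsilon_min, epsilon * decay)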
Hints / Notes
- Be patient. With sparse rewards, it can take a while for a
randomly exploring agent to discover a goal (epsilon-greedy action
selection, the usual source of that exploration, is sketched after
this list).
- Conversely, if you find yourself training for many thousands of
epochs without interesting results, then something is wrong.
- Optimal performance is a policy that consistently receives
episode scores of -50 to -70, depending on the initial conditions of the trial.
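For reference, a minimal sketch of epsilon-greedy action selection over
a Keras Q-network like the one above; this is an illustration, not
necessarily how the skeleton code structures it:

    import numpy as np

    def select_action(model, state, epsilon, n_actions=3):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise act greedily with respect to the current Q estimates
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        q_values = model.predict(np.asarray(state).reshape(1, -1))[0]
        return int(np.argmax(q_values))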
What to Hand In
Hand in your notebook containing all of your code + the PDF export of
the code. The PDF file must include:
- Code for generating and training the network. Some useful unix
command line programs:
- enscript: translate code (e.g., py files) into postscript files
- ps2pdf: translate postscript files into pdf files
- pdfunite: merge several pdf files together
- Four figures described above
- A report of the mean performance (over the last 100 trials) for each
of the two configurations
Grades
- 30 pts: Complete learning code
- 35 pts: Performance on 5-runs shows interesting improvement
- 35 pts: Altered configuration: performance on 5 runs also shows
interesting improvement
andrewhfagg -- gmail.com
Last modified: Tue Mar 24 13:37:05 2020