CS 5043: HW7: Attention
Assignment notes:
- Deadline: Tuesday, April 26th @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. While general discussion
about Python, Keras, and TensorFlow is encouraged, sharing
solution-specific code is inappropriate. Likewise, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
We are using the same problem space as in the previous homework
assignment. However, rather than using an RNN approach to connect
information across the amino acid chains, we will employ Attention
mechanisms. This approach dramatically reduces the number of layers
through which gradients must be propagated.
Data Set
Expect an updated data set; it will have the same form as the data
used for HW 6.
The data set is available on SCHOONER:
- /home/fagg/datasets/pfam: directory tree containing the data
(including two zip files)
The data are already partitioned into five independent folds, with the
classes stratified across the folds (the samples for class k are
distributed equally across the five folds). However, the different
classes have different numbers of examples, with as much as a 1:10
ratio between the minority and majority classes.
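One common way to compensate for this kind of class imbalance is inverse-frequency class weighting during training. The sketch below is not part of the assignment spec; the function name and the example labels are hypothetical, and the resulting dict is in the form accepted by Keras model.fit(class_weight=...).

```python
import numpy as np

def compute_class_weights(labels):
    """Inverse-frequency class weights for imbalanced data.

    labels: 1D integer array of class indices.
    Returns a dict mapping class index -> weight; rarer classes
    receive proportionally larger weights.
    """
    classes, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    n_classes = len(classes)
    return {int(c): total / (n_classes * n) for c, n in zip(classes, counts)}

# Toy example: class 1 is three times rarer than class 0,
# so it receives three times the weight.
labels = np.array([0, 0, 0, 1])
weights = compute_class_weights(labels)
```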
Deep Learning Experiment
Objective: Create an Attention-based model to perform the amino acid
family classification. The architecture should be along the following
lines:
- Embedding layer
- (optional) 1D Convolutional layers
- Multiple Attention layers. I recommend investigating
tf.keras.layers.MultiHeadAttention
- One or more Dense layers, with the output using a softmax
non-linearity
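The pipeline above might be sketched in Keras as follows. This is a minimal illustration, not a prescribed solution: every hyper-parameter value (vocabulary size, sequence length, embedding dimension, head count, layer count) is a placeholder you must tune, and the residual-plus-LayerNormalization pattern around each attention layer is one common design choice, not a requirement.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_attention_model(n_tokens=25, seq_len=100, n_classes=10,
                          embed_dim=32, n_heads=4, n_attn_layers=2):
    """Embedding -> Conv1D -> stacked MultiHeadAttention -> Dense softmax.

    All default argument values are placeholders for illustration.
    """
    inputs = layers.Input(shape=(seq_len,))
    x = layers.Embedding(n_tokens, embed_dim)(inputs)

    # Optional 1D convolution: mixes local context before attention
    x = layers.Conv1D(embed_dim, kernel_size=3, padding='same',
                      activation='elu')(x)

    for _ in range(n_attn_layers):
        # Self-attention: query and value both come from x
        attn = layers.MultiHeadAttention(num_heads=n_heads,
                                         key_dim=embed_dim)(x, x)
        # Residual connection + normalization (a common stabilizing choice)
        x = layers.LayerNormalization()(x + attn)

    # Collapse the sequence dimension before the classification head
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation='elu')(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model
```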
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, produce the following figures:
- Figure 0: Network architectures from plot_model()
- Figure 1: Training set accuracy as a function of epoch for each
of the five rotations.
- Figure 2: Validation set accuracy as a function of epoch for
each of the rotations.
- Figure 3: Histogram of accuracy across the test folds, with a
vertical line at the average accuracy (also report this average
as text).
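Figure 3 might be produced along these lines with matplotlib. This is a sketch, not a required implementation: the function name, bin count, and styling are all placeholder choices, and the accuracy values passed in would be your measured per-fold test accuracies.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for saving to file
import matplotlib.pyplot as plt

def plot_test_accuracy_histogram(accuracies, fname='figure3.png'):
    """Histogram of per-fold test accuracies with a vertical line
    at the mean; the mean is also shown as text in the title."""
    accuracies = np.asarray(accuracies)
    mean_acc = accuracies.mean()

    fig, ax = plt.subplots()
    ax.hist(accuracies, bins=10)
    ax.axvline(mean_acc, color='red', linestyle='--')
    ax.set_xlabel('Test accuracy')
    ax.set_ylabel('Count')
    ax.set_title('Mean test accuracy: %.3f' % mean_acc)
    fig.savefig(fname)
    plt.close(fig)
    return mean_acc
```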
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
[Gradescope can render notebook files directly - no need to
convert to pdf!]
- Figures 0-3
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 15 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 20 pts: Reasonable test set performance for all rotations
References
- Full Data Set
J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar,
E. L. L. Sonnhammer, S. C. E. Tosatto, L. Paladin, S. Raj,
L. J. Richardson, R. D. Finn, and A. Bateman. Pfam: The protein
families database in 2021. Nucleic Acids Research (2020).
doi: 10.1093/nar/gkaa913
- Keras Multi-headed Attention Layer
andrewhfagg -- gmail.com
Last modified: Fri Apr 15 01:35:47 2022