CS 5043: HW6: RNNs + CNNs
Assignment notes:
- Deadline: Tuesday, April 19th @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. While general discussion
about Python, Keras and Tensorflow is encouraged, sharing
solution-specific code is inappropriate. Likewise, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
Proteins are chains of amino acids that perform many different
biological functions, depending on the specific sequence of amino
acids. Families of amino acid chains exhibit similarities in their
structure and function. For a new chain, one problem we would like to
solve is that of predicting the family that it most likely belongs
to. In this assignment, we will be classifying amino acid chains as
one of eighteen families: PF01810, PF01925, PF02659, PF03824, PF16955,
PF04955, PF11139, PF13386, PF13795, PF01169, PF01914, PF02673, PF02674,
PF02683, PF03239, PF03596, PF03741 or PF19510.
Data Set
The Data set is available on SCHOONER:
- /home/fagg/datasets/pfam: directory tree containing the data
(including two zip files)
The data are already partitioned into five independent folds, with the
eighteen classes stratified across the folds (the samples for class k are
distributed equally across the five folds). However, the different
classes have different numbers of examples, with as much as a 1:10
ratio between the minority and majority classes.
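One way to counter this imbalance is to weight the loss by inverse class
frequency. The sketch below (plain NumPy; the function name is illustrative)
builds a dictionary that can be passed to Keras model.fit(..., class_weight=...):

```python
import numpy as np

def compute_class_weights(labels):
    """Inverse-frequency class weights to counter class imbalance.

    labels: 1-D array of integer class labels (0..n_classes-1).
    Returns a dict mapping class index -> weight, suitable for
    Keras model.fit(..., class_weight=...).
    """
    counts = np.bincount(labels)       # examples per class
    total = counts.sum()
    n_classes = len(counts)
    # Each class's weight is inversely proportional to its frequency,
    # normalized so the weights average to 1 across classes
    return {k: total / (n_classes * counts[k]) for k in range(n_classes)}

# Toy example: class 1 is five times rarer than class 0
labels = np.array([0] * 10 + [1] * 2)
weights = compute_class_weights(labels)
```

With this weighting, errors on minority-class examples contribute
proportionally more to the loss, which can help when one class outnumbers
another by as much as 10:1.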
Each example consists of:
- A tokenized string of length 1574 amino acids. The strings in
the data set have been padded this time on the left-hand side
(in HW5, padding was on the right-hand side). In addition to
the padding token, there is also a token that corresponds to
the "unknown" amino acid. Within each string, there can be long
runs of "unknown" tokens.
- A tokenized class label (integers from 0 to 17).
There are two ways to load the data (provided in pfam_loader.py):
- prepare_data_set(): loads the raw data, constructs the
train/validation/test data sets, and performs the
tokenization. These files are smaller, but require CPU
processing before training.
- load_rotation(): loads an already constructed rotation from a
pickle file. These files are a lot larger, but require no
processing once loaded.
Both loaders return the same data set format (documented in
pfam_loader.py).
Deep Learning Experiment
Objective: Create a neural network model that can predict the family
of a given amino acid chain. We will compare a "simple" architecture with a
"complex" architecture. The precise definition of these is up to you,
but you should adjust hyper-parameters for each so that they can do
their best (with respect to the validation set) without changing model
architecture.
Notes:
- Because of the length of the amino acid chains, an RNN will
likely not work well. Instead, consider using a combination of
1D CNNs and RNNs. The CNNs give you the opportunity to
collapse information from multiple steps into a single step,
thus reducing the length for the subsequent RNN layer.
- Your network should have eighteen outputs, one for each class. Use
the softmax() nonlinearity for the final layer.
- Class labels from the loader are integers (they are not one-hot
encoded). You can either convert the integers to a 1-hot
encoded representation and use categorical cross-entropy
for your loss, or you can keep the integers and use sparse
categorical cross-entropy (this loss function will
automatically do the conversion for you).
- Likewise, you will need to use sparse categorical
accuracy as your metric if you are using the raw integer labels.
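Putting the notes above together, here is one minimal sketch of a CNN+RNN
classifier. The vocabulary size, filter counts, strides, and layer widths
are placeholder assumptions to be tuned against the validation set; only the
sequence length (1574) and the number of classes (18) come from the
assignment:

```python
from tensorflow import keras

N_TOKENS = 25      # amino acids + padding + "unknown" token (assumed size)
SEQ_LEN = 1574     # chain length from the data set
N_CLASSES = 18     # number of protein families

def build_model():
    inputs = keras.Input(shape=(SEQ_LEN,))
    # Embed integer tokens into a dense vector space
    x = keras.layers.Embedding(N_TOKENS, 16)(inputs)
    # Strided 1D convolutions collapse several steps into one,
    # shortening the sequence (1574 -> ~393 -> ~98) before the RNN
    x = keras.layers.Conv1D(32, 5, strides=4, activation='elu')(x)
    x = keras.layers.Conv1D(64, 5, strides=4, activation='elu')(x)
    # Recurrent layer over the much shorter sequence
    x = keras.layers.GRU(64)(x)
    outputs = keras.layers.Dense(N_CLASSES, activation='softmax')(x)
    model = keras.Model(inputs, outputs)
    # Integer labels -> sparse loss/metric; no one-hot conversion needed
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model

model = build_model()
```

The strided convolutions do the sequence-length reduction described above,
so the GRU only has to process on the order of a hundred steps rather than
1574.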
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, produce the following figures:
- Figure 0a,b: Network architectures from plot_model()
- Figure 1: Training set accuracy as a function of epoch for each
rotation. Include both models.
- Figure 2: Validation set accuracy as a function of epoch for
each of the rotations. Include both models.
- Figure 3: Histogram of accuracy for the test folds that shows
vertical lines that correspond to the average accuracy for each
model type (also show this average in text).
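Figure 3 can be produced along these lines; the function name and the
made-up accuracy values are illustrative only:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for script/batch use
import matplotlib.pyplot as plt

def plot_test_accuracy_histogram(acc_simple, acc_complex, fname='figure3.png'):
    """Histogram of per-fold test accuracies for the two model types,
    with dashed vertical lines and text at each model's mean accuracy.
    Inputs are lists of test-set accuracies, one entry per rotation."""
    fig, ax = plt.subplots()
    ax.hist([acc_simple, acc_complex], label=['simple', 'complex'])
    for acc, color in [(acc_simple, 'C0'), (acc_complex, 'C1')]:
        m = float(np.mean(acc))
        ax.axvline(m, color=color, linestyle='--')
        ax.text(m, ax.get_ylim()[1] * 0.9, f'{m:.3f}', color=color)
    ax.set_xlabel('Test accuracy')
    ax.set_ylabel('Count')
    ax.legend()
    fig.savefig(fname)
    return fig

# Example with made-up accuracies for five rotations
fig = plot_test_accuracy_histogram([0.91, 0.93, 0.90, 0.92, 0.94],
                                   [0.95, 0.96, 0.94, 0.97, 0.95])
```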
Provided Code
In code for class:
- pfam_loader.py: data loading and conversion tools
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
[Gradescope can render notebook files directly - no need to
convert to pdf!]
- A picture of your model architecture (from plot_model())
- Figures 0-3
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 15 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 20 pts: Reasonable test set performance for all rotations
References
- Full Data Set
- Pfam: The protein families database in 2021. J. Mistry,
S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar,
E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj,
L.J. Richardson, R.D. Finn, A. Bateman. Nucleic Acids Research
(2020). doi: 10.1093/nar/gkaa913
andrewhfagg -- gmail.com
Last modified: Mon Apr 11 16:35:16 2022