CS 5043: HW6: RNNs + CNNs
Assignment notes:
- Deadline: Tuesday, April 19th @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. While general discussion
about Python, Keras and Tensorflow is encouraged, sharing
solution-specific code is inappropriate. Likewise, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
Proteins are chains of amino acids that perform many different
biological functions, depending on the specific sequence of amino
acids. Families of amino acid chains exhibit similarities in their
structure and function. For a new chain, one problem we would like to
solve is that of predicting the family that it most likely belongs
to. In this assignment, we will be classifying amino acid chains as
one of eighteen families: PF01810, PF01925, PF02659, PF03824, PF16955,
PF04955, PF11139, PF13386, PF13795, PF01169, PF01914, PF02673, PF02674,
PF02683, PF03239, PF03596, PF03741 or PF19510.
Data Set
The Data set is available on SCHOONER:
- /home/fagg/datasets/pfam: directory tree containing the data
(including two zip files)
The data are already partitioned into five independent folds, with the
eighteen classes stratified across the folds (the samples for class k are
distributed equally across the five folds). However, the different
classes have different numbers of examples, with as much as a 1:10
ratio between the minority and majority classes.
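One way to counter this imbalance is to weight the loss by inverse class
frequency. The sketch below (plain NumPy; the function name is illustrative)
builds a dictionary that can be passed to Keras model.fit(..., class_weight=...):

```python
import numpy as np

def compute_class_weights(labels):
    """Inverse-frequency class weights to counter class imbalance.

    labels: 1-D array of integer class labels (0..n_classes-1).
    Returns a dict mapping class index -> weight, suitable for
    Keras model.fit(..., class_weight=...).
    """
    counts = np.bincount(labels)       # examples per class
    total = counts.sum()
    n_classes = len(counts)
    # Each class's weight is inversely proportional to its frequency,
    # normalized so the weights average to 1 across classes
    return {k: total / (n_classes * counts[k]) for k in range(n_classes)}

# Toy example: class 1 is five times rarer than class 0
labels = np.array([0] * 10 + [1] * 2)
weights = compute_class_weights(labels)
```

With this weighting, errors on minority-class examples contribute
proportionally more to the loss, which can help when one class outnumbers
another by as much as 10:1.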
Each example consists of:
- A tokenized string of length 1574 amino acids. The strings in
the data set have been padded this time on the left-hand side
(in HW5, padding was on the right-hand side). In addition to
the padding token, there is also a token that corresponds to
the "unknown" amino acid. Within each string, there can be long
runs of "unknown" tokens.
- A tokenized class label (integers from 0 to 17).
There are two ways to load the data (provided in pfam_loader.py):
- prepare_data_set(): loads the raw data, constructs the
train/validation/test data sets, and performs the
tokenization. These files are smaller, but require CPU
processing before training.
- load_rotation(): loads an already constructed rotation from a
pickle file. These files are a lot larger, but require no
processing once loaded.
Both loaders return the same data set format (documented in
pfam_loader.py).
Deep Learning Experiment
Objective: Create a neural network model that can predict the family
of a given amino acid chain. We will compare a "simple" architecture with a
"complex" architecture. The precise definition of these is up to you,
but you should adjust hyper-parameters for each so that they can do
their best (with respect to the validation set) without changing model
architecture.
Notes:
- Because of the length of the amino acid chains, an RNN will
likely not work well. Instead, consider using a combination of
1D CNNs and RNNs. The CNNs give you the opportunity to
collapse information from multiple steps into a single step,
thus reducing the length for the subsequent RNN layer.
- Your network should have eighteen outputs, one for each class. Use
the softmax() nonlinearity for the final layer.
- Class labels from the loader are integers (they are not one-hot
encoded). You can either convert the integers to a 1-hot
encoded representation and use categorical cross-entropy
for your loss, or you can keep the integers and use sparse
categorical cross-entropy (this loss function will
automatically do the conversion for you).
- Likewise, you will need to use sparse categorical
accuracy as your metric if you are using the raw integer labels.
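Putting the notes above together, here is one minimal sketch of a CNN+RNN
classifier. The vocabulary size, filter counts, strides, and layer widths
are placeholder assumptions to be tuned against the validation set; only the
sequence length (1574) and the number of classes (18) come from the
assignment:

```python
from tensorflow import keras

N_TOKENS = 25      # amino acids + padding + "unknown" token (assumed size)
SEQ_LEN = 1574     # chain length from the data set
N_CLASSES = 18     # number of protein families

def build_model():
    inputs = keras.Input(shape=(SEQ_LEN,))
    # Embed integer tokens into a dense vector space
    x = keras.layers.Embedding(N_TOKENS, 16)(inputs)
    # Strided 1D convolutions collapse several steps into one,
    # shortening the sequence (1574 -> ~393 -> ~98) before the RNN
    x = keras.layers.Conv1D(32, 5, strides=4, activation='elu')(x)
    x = keras.layers.Conv1D(64, 5, strides=4, activation='elu')(x)
    # Recurrent layer over the much shorter sequence
    x = keras.layers.GRU(64)(x)
    outputs = keras.layers.Dense(N_CLASSES, activation='softmax')(x)
    model = keras.Model(inputs, outputs)
    # Integer labels -> sparse loss/metric; no one-hot conversion needed
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model

model = build_model()
```

The strided convolutions do the sequence-length reduction described above,
so the GRU only has to process on the order of a hundred steps rather than
1574.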
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, produce the following figures:
- Figure 0a,b: Network architectures from plot_model()
- Figure 1: Training set accuracy as a function of epoch for each
rotation. Include both models.
- Figure 2: Validation set accuracy as a function of epoch for
each of the rotations. Include both models.
- Figure 3: Histogram of accuracy for the test folds that shows
vertical lines that correspond to the average accuracy for each
model type (also show this average in text).
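Figure 3 can be produced along these lines; the function name and the
made-up accuracy values are illustrative only:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for script/batch use
import matplotlib.pyplot as plt

def plot_test_accuracy_histogram(acc_simple, acc_complex, fname='figure3.png'):
    """Histogram of per-fold test accuracies for the two model types,
    with dashed vertical lines and text at each model's mean accuracy.
    Inputs are lists of test-set accuracies, one entry per rotation."""
    fig, ax = plt.subplots()
    ax.hist([acc_simple, acc_complex], label=['simple', 'complex'])
    for acc, color in [(acc_simple, 'C0'), (acc_complex, 'C1')]:
        m = float(np.mean(acc))
        ax.axvline(m, color=color, linestyle='--')
        ax.text(m, ax.get_ylim()[1] * 0.9, f'{m:.3f}', color=color)
    ax.set_xlabel('Test accuracy')
    ax.set_ylabel('Count')
    ax.legend()
    fig.savefig(fname)
    return fig

# Example with made-up accuracies for five rotations
fig = plot_test_accuracy_histogram([0.91, 0.93, 0.90, 0.92, 0.94],
                                   [0.95, 0.96, 0.94, 0.97, 0.95])
```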
Provided Code
In code for class:
- pfam_loader.py: data loading and conversion tools
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
[Gradescope can render notebook files directly - no need to
convert to pdf!]
- A picture of your model architecture (from plot_model())
- Figures 0-3
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 15 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 20 pts: Reasonable test set performance for all rotations
References
- Full Data Set
- Pfam: The protein families database in 2021. J. Mistry,
S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar,
E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj,
L.J. Richardson, R.D. Finn, A. Bateman. Nucleic Acids Research
(2020). doi: 10.1093/nar/gkaa913
andrewhfagg -- gmail.com
Last modified: Mon Apr 11 16:35:16 2022