CS 5043: HW6: Recurrent Neural Networks and Timeseries Data

Assignment notes:

The Problem

Peptides are short chains of amino acids that perform a wide range of biological functions, depending on the specific sequence of amino acids. Understanding how amino acids interact with other molecules is key to understanding the mechanisms of genetic diseases and to constructing therapies to address these diseases.

One form of interaction is the question of how well a peptide binds to a specific molecule (this is known as binding affinity). For this assignment, we will use a data set that includes a measure of binding affinity for a large number of peptides toward two molecules, known as HLA-DRB1*13:01 and HLA-DRB1*15:01. Our goal is to predict the binding affinity for novel peptides toward these molecules.

Amino Acids

A total of 22 amino acids make up the building blocks for all peptides and proteins. The following table gives the names and one-letter codes used to represent these amino acids:

Data Set

The data set is available in the git repository in a set of CSV files. The data are already partitioned into five independent folds of training and validation data. For the purposes of this assignment, you should preserve this partition. Also, you should only use the HLA-DRB1*15:01 molecule (look for files containing "1501").

Each row in the CSV files contains two columns: a string representing the sequence of amino acids and a measure of binding affinity. Here is an example:

DLDKKETVWHLEE,0
DKSKPKVYQWFDL,0
HYTVDKSKPKVYQ,0
KSKPKVYQWFDLR,0
LHYTVDKSKPKVY,0
TVDKSKPKVYQWF,0
VDKSKPKVYQWFD,0
YTVDKSKPKVYQW,0
ADVILPIGTRSVETD,0.646144
AFKPVLVDEGRKVAI,0.57075
AGLKTNDRKWCFEGP,0.615622
AHLAEENEGDNACKR,0
ALYEKKLALYLLLAL,0.610019
ANGKTLGEVWKRELN,0.495407

Note that different rows contain different numbers of amino acids.

The binding affinity is encoded on a log scale and ranges between 0 and 1. Peptides with affinities at or above 0.426 are considered more likely than not to bind to the molecule in question, whereas affinities below this threshold are considered less likely to bind.

Deep Learning Experiment

Objective: Create a recurrent neural network that predicts binding affinity as a function of the amino acid string.

Performance Reporting

Once you have selected a reasonable architecture and set of hyper-parameters, produce the following figures:
  1. AUC as a function of epoch for each of the training folds (so, 5 curves)

  2. AUC as a function of epoch for each of the validation folds

  3. Accuracy as a function of epoch for each of the training folds. Note: using tf.keras.metrics.BinaryAccuracy() as a metric will allow you to set the threshold of 0.426

  4. Accuracy as a function of epoch for each of the validation folds

In addition, report the following:


Provided Code

In the git repository:


What to Hand In

Hand in a PDF file that contains:

Grades

References


andrewhfagg -- gmail.com

Last modified: Wed Apr 7 21:54:42 2021