CS 5043: HW5: Recurrent Neural Networks and String Data

Assignment notes:

The Problem

Peptides are short chains of amino acids that perform a wide range of biological functions, depending on the specific sequence of amino acids. Understanding how amino acids interact with other molecules is key to understanding the mechanisms of genetic diseases and to constructing therapies to address these diseases.

One form of interaction is the question of how well a peptide binds to a specific molecule (this is known as binding affinity). For this assignment, we will use a data set that includes a measure of binding affinity for a large number of peptides toward two molecules, known as HLA-DRB1*13:01 and HLA-DRB1*15:01. Our goal is to predict the binding affinity for novel peptides toward these molecules.

Amino Acids

A total of 22 amino acids make up the building blocks for all peptides and proteins. The following table gives the names and one-letter codes used to represent these amino acids:

Data Set

The Data set is available on SCHOONER: The data are already partitioned into five independent folds of training and validation data. For the purposes of this assignment, you should preserve this partition. Also, you should only use the HLA-DRB1*13:01 molecule (look for files containing "1301").

Each row in the CSV files contains two columns: a string representing the sequence of amino acids and a measure of binding affinity. Each character of the string corresponds to a unique amino acid. Here is an example:

DLDKKETVWHLEE,0
DKSKPKVYQWFDL,0
HYTVDKSKPKVYQ,0
KSKPKVYQWFDLR,0
LHYTVDKSKPKVY,0
TVDKSKPKVYQWF,0
VDKSKPKVYQWFD,0
YTVDKSKPKVYQW,0
ADVILPIGTRSVETD,0.646144
AFKPVLVDEGRKVAI,0.57075
AGLKTNDRKWCFEGP,0.615622
AHLAEENEGDNACKR,0
ALYEKKLALYLLLAL,0.610019
ANGKTLGEVWKRELN,0.495407

Note that different rows contain different numbers of amino acids.

The binding affinity is encoded on a log scale and ranges between 0 and 1. Peptides with affinities at or above 0.426 are considered more likely than not to bind to the molecule in question, whereas affinities below this threshold are considered less likely to bind.

Deep Learning Experiment

Objective: Create a recurrent neural network that predicts binding as a function of the amino acid string.

Performance Reporting

Once you have selected a reasonable architecture and set of hyper-parameters, produce the following figures:
  1. Figure 1: Training set Area Under the Curve (AUC) as a function of epoch for each of the rotations (so, 5 curves)


Provided Code

In code for class:


What to Hand In

Turn in a single zip file that contains:

Grading

References


andrewhfagg -- gmail.com

Last modified: Fri Apr 1 20:37:47 2022