CS 5043: HW3: Comparing Regression Algorithms
Assignment notes:
- Deadline: Thursday, March 1st @11:59pm.
- Hand-in procedure: submit to the HW3 drop box on Canvas.
- This work is to be done on your own. While general discussion
about Python and Scikit-Learn is encouraged, sharing
solution-specific code is inappropriate.
- Do not submit zip or MSWord documents.
Data Set and Goals
A brain-machine interface data set has been placed in:
~fagg/datasets/bmi/DAT6_08. This data set is from a single
day's session with a monkey. While sitting in a chair, the monkey
places her arm in an exoskeleton that restricts arm movements to the
X-Y plane. For this data set, the exoskeleton is passive and is only
used to record the monkey's arm movements. A series of virtual
targets is then provided to the monkey. As the monkey's hand moves
through the current target, a new one appears (so, we are observing a
sequence of reaches). During this session,
information is recorded every 50ms. The following information is
collected, with each type stored in its own file:
- Neural activity for 48 neurons from the primary motor cortex
(MI). During every 50ms time period,
we count the number of action potentials (spikes) that each
neuron produces.
- Hand position, velocity and acceleration (each 2 dimensions,
corresponding to the X and Y directions, respectively).
- Joint position, velocity, and torque (each 2 dimensions,
corresponding to the shoulder and elbow, respectively).
- Time stamp. Although the data are in order, gaps exist, as
indicated by time stamps that are more than 50ms apart from
one another.
In the brain-machine interface context, after a data set such as this
is collected, a model can be constructed that predicts from the neural
data one or more properties that describe the state of the arm. Under
some conditions, these models can be used to translate in real time
the neural activity into control signals for the exoskeleton (hence,
the monkey's arm will be moved for her).
Our goals for this homework are to assess:
- multiple models and learning algorithms,
- appropriate parameters for each algorithm type, and
- the amount of training data that is necessary to create
high-performing models.
Notes:
- The data are pre-cut into 20 folds.
- A single sample of data is captured as a single row in the
files. A single row in one file corresponds to the same sample
in the matching files.
- Because the neural data is noisy (and we have so few neurons),
we will use a history of 20 time steps (so, a full second) of
neural data to make the arm state prediction for the current
time step. This history information has already been accounted
for in the individual rows in the MI data files: each row has
20 x 48 = 960 columns, with the first 20 columns corresponding
to the first neuron (ordered from the oldest to the most recent
counts). The discontinuities in the data have already been
accounted for in these histories, so you don't need to take any
additional steps to handle these.
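As a quick check of the MI row layout described above, here is a small sketch using synthetic spike counts (the Poisson data and the 100-sample array size are stand-ins; the real array would come from loading an MI fold file):

```python
import numpy as np

# Hypothetical MI array: 100 samples x (48 neurons x 20 time steps) = 960
# columns, matching the layout described in the notes above.
rng = np.random.default_rng(0)
MI = rng.poisson(1.0, size=(100, 960))

# Columns are grouped by neuron: columns 0..19 hold neuron 0's history
# (ordered oldest to most recent count), columns 20..39 hold neuron 1's, etc.
row = MI[0]
history = row.reshape(48, 20)   # history[i, j]: neuron i, history step j
```

Reshaping one row this way can be useful when visualizing a single sample's spike-count histories.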
The following will load all folds for one file type and return a list
of Numpy arrays:
import os
import fnmatch
import pandas as pd

# File loading
def read_bmi_file_set(directory, filebase):
    '''Read a set of CSV files, one per fold

    :param directory: The directory in which to scan for the CSV files
    :param filebase: A file specification that potentially includes wildcards
    :returns: A list of Numpy arrays (one for each fold)
    '''
    # The set of matching files in the directory, sorted into fold order
    files = fnmatch.filter(os.listdir(directory), filebase)
    files.sort()

    # Load each matching file with Pandas and convert it to a Numpy array
    lst = [pd.read_csv(directory + "/" + file, delim_whitespace=True).values
           for file in files]

    return lst

# Load the time stamps
time = read_bmi_file_set('/home2/fagg/datasets/bmi/DAT6_08', 'time_fold*')
Part 1: Data Exploration
- Load up a single fold's worth of data for MI and torque for
training (fold zero), and a single fold for testing (fold 19).
- Construct a LinearRegression model from these data. Show a
plot of elbow torque as a function of sample index for the test data
(one curve each for true and predicted torque).
- Construct a Ridge regression model (Scikit-Learn's Ridge class)
for the same data set.
Generate a plot similar to the previous step. Try a few
different regularization parameter (alpha) values (you only need
to show one case in the plot). How do these results compare to the
LinearRegression model?
- Construct a GradientBoostingRegressor model. Generate the same
type of plot for one parameter choice. How do these results
compare to the Ridge regression case?
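The Part 1 workflow can be sketched as follows. Synthetic stand-ins are used for the training and test folds (the Poisson inputs, the hypothetical weight matrix W, and the array sizes are all invented); with the real data, the arrays would come from read_bmi_file_set() for the MI and torque files:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for one training fold and one test fold
rng = np.random.default_rng(42)
W = rng.normal(scale=0.01, size=(960, 2))    # hypothetical neural-to-torque map
MI_train = rng.poisson(1.0, size=(2000, 960)).astype(float)
tor_train = MI_train @ W + rng.normal(scale=0.05, size=(2000, 2))
MI_test = rng.poisson(1.0, size=(500, 960)).astype(float)
tor_test = MI_test @ W + rng.normal(scale=0.05, size=(500, 2))

# Fit on the training fold, predict on the test fold
model = LinearRegression()
model.fit(MI_train, tor_train)
pred = model.predict(MI_test)

# Torque column 1 is the elbow (columns are shoulder, elbow)
plt.figure()
plt.plot(tor_test[:, 1], label='true')
plt.plot(pred[:, 1], label='predicted')
plt.xlabel('Sample index')
plt.ylabel('Elbow torque')
plt.legend()
plt.savefig('elbow_torque.png')
```

Swapping LinearRegression for Ridge(alpha=...) or GradientBoostingRegressor(...) reuses the same fit/predict/plot skeleton for the other two parts of this section.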
Part 2: Cross-Validation
Because Scikit-Learn leaves something to be desired in its
implementation of cross-validation, we will implement our own. The
data set contains N=20 folds. For a given rotation, N-2 folds are
available for training, one will be used for validation and one will be
held out for testing.
We will be implementing three nested loops (I implemented these across two functions):
- Loop over possible values of a single hyper-parameter. Which
hyper-parameter is being varied should be a parameter to this
function, as different models use different parameters.
- Loop over training data set sizes (in terms of the number of
folds that are actually used).
For example, the first model using 18 folds for training will use folds
0..17 for this purpose, fold 18 for validation and fold 19 for testing.
The first model using only 2 folds for training will use folds
0..1 for this purpose, fold 18 for validation and fold 19 for testing.
- Loop over the 20 different models that are constructed for each
combination of hyper-parameter value and training set size.
The training, validation and testing folds rotate together
across the full data set. We
will use the mean Fraction of Variance Accounted For (FVAF)
across these 20 models to measure performance; Scikit-Learn
implements FVAF as the explained_variance_score() function.
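One way to structure the inner two loops is a single function that rotates the folds together and reports mean FVAF over all rotations. This is only a sketch under assumptions: the name mean_fvaf, the model_factory callable, and placing the validation and test folds immediately before the wrap point are my choices, not requirements of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score

def mean_fvaf(ins_folds, outs_folds, model_factory, n_train):
    '''Rotate the train/validation/test folds together across the data
    set and return the mean validation and test FVAF over all rotations.

    :param ins_folds: List of Numpy arrays (one input array per fold)
    :param outs_folds: List of Numpy arrays (one output array per fold)
    :param model_factory: Callable that builds a fresh, unfit model
    :param n_train: Number of training folds (at most len(ins_folds) - 2)
    '''
    n_folds = len(ins_folds)
    val_scores = []
    test_scores = []
    for r in range(n_folds):
        # Fold indices for this rotation, wrapping around the end
        train_idx = [(r + i) % n_folds for i in range(n_train)]
        val_idx = (r + n_folds - 2) % n_folds
        test_idx = (r + n_folds - 1) % n_folds

        # Assemble the training set from the selected folds
        ins = np.concatenate([ins_folds[i] for i in train_idx], axis=0)
        outs = np.concatenate([outs_folds[i] for i in train_idx], axis=0)

        model = model_factory()
        model.fit(ins, outs)

        # FVAF on the validation and test folds for this rotation
        val_scores.append(explained_variance_score(
            outs_folds[val_idx], model.predict(ins_folds[val_idx])))
        test_scores.append(explained_variance_score(
            outs_folds[test_idx], model.predict(ins_folds[test_idx])))
    return np.mean(val_scores), np.mean(test_scores)
```

A hyper-parameter loop then just calls this once per candidate value, e.g. mean_fvaf(mi, tor, lambda: Ridge(alpha=a), 18) for each a under consideration.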
In addition, write a function that:
Notes:
Part 3: Analysis
- Consider training set sizes of 1, 2, 3, 18, and some
intermediate sizes.
- Use your new functions to analyze the following models using shoulder
torque. Give a short textual interpretation for each and note the best
parameter choice for each training data set size (according to mean
validation FVAF):
- LinearRegression. Because we don't have any interesting
parameters to vary here, I just used fit_intercept as the
hyper-parameter, with True (the default) as its only value.
- Ridge regression (Scikit-Learn's Ridge class).
- GradientBoostingRegressor. Vary the number of trees
(n_estimators). Use a small depth (max_depth).
- For each training data set size, determine which of the three
approaches is best according to mean test set FVAF. Remember
that you are only allowed to observe test set performance for
your best parameter choice for each model and training data
set size.
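To make the test-set rule in the last bullet concrete, here is a toy sketch. All FVAF numbers are invented, and the grid layout (rows index hyper-parameter values, columns index training set sizes) is my assumption:

```python
import numpy as np

# Invented mean FVAF grids: rows are hyper-parameter values,
# columns are training set sizes.
val_fvaf = np.array([[0.30, 0.45, 0.50],
                     [0.35, 0.44, 0.55],
                     [0.20, 0.40, 0.52]])
test_fvaf = np.array([[0.28, 0.43, 0.49],
                      [0.33, 0.42, 0.54],
                      [0.18, 0.39, 0.50]])

# Pick the best hyper-parameter for each training set size using ONLY
# the validation FVAF...
best = np.argmax(val_fvaf, axis=0)
# ...then look at the test FVAF only for those choices.
reported = test_fvaf[best, np.arange(test_fvaf.shape[1])]
```

The comparison across the three model types then happens only among these `reported` values, never over the full test grid.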
Hints
- numpy.concatenate() and numpy.take() are very helpful.
- I spent 2+ CPU hours playing with different ideas and executing
the final experiments. When you are testing, please do this
with small parameter sets & only go to the big sets once things
are working. You can check your usage using the top
program in a separate shell.
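The numpy.take() hint refers to its wrap mode, which makes the fold rotation easy to index; a small sketch (the rotation value is an arbitrary example):

```python
import numpy as np

# mode='wrap' lets indices run past the end of the fold list and wrap
# around to the beginning -- exactly the rotation behavior needed here.
folds = np.arange(20)                 # stand-in for 20 fold indices
rotation = 5                          # arbitrary example rotation
train = np.take(folds, np.arange(rotation, rotation + 18), mode='wrap')
# train holds folds 5..19 followed by 0..2
```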
andrewhfagg -- gmail.com
Last modified: Wed Feb 21 17:34:10 2018