CS 5043: HW3: Comparing Regression Algorithms
Assignment notes:
- Deadline: Thursday, March 1st @11:59pm.
- Hand-in procedure: submit to the HW3 drop box on Canvas.
- This work is to be done on your own. While general discussion
about Python and Scikit-Learn is encouraged, sharing
solution-specific code is inappropriate.
- Do not submit zip or MSWord documents.
Data Set and Goals
A brain-machine interface data set has been placed in:
~fagg/datasets/bmi/DAT6_08. This data set is from a single
day's session with a monkey. While sitting in a chair, the monkey
places her arm in an exoskeleton that restricts arm movements to the
X-Y plane. For this data set, the exoskeleton is passive and is only
used to record the monkey's arm movements. A series of virtual
targets is then provided to the monkey. As the monkey's hand moves
through the current target, a new one appears (so, we are observing a
sequence of reaches). During this session,
information is recorded every 50ms. The following information is
collected, with each type stored in its own file:
- Neural activity for 48 neurons from the primary motor cortex
(MI). During every 50ms time period,
we count the number of action potentials (spikes) that each
neuron produces.
- Hand position, velocity and acceleration (each 2 dimensions,
corresponding to the X and Y directions, respectively).
- Joint position, velocity, and torque (each 2 dimensions,
corresponding to the shoulder and elbow, respectively).
- Time stamp. Although the data are in order, gaps exist, as
indicated by time stamps that are more than 50ms apart from
one another.
In the brain-machine interface context, after a data set such as this
is collected, a model can be constructed that predicts from the neural
data one or more properties that describe the state of the arm. Under
some conditions, these models can be used to translate in real time
the neural activity into control signals for the exoskeleton (hence,
the monkey's arm will be moved for her).
Our goals for this homework are to assess:
- multiple models and learning algorithms,
- appropriate parameters for each algorithm type, and
- the amount of training data that is necessary to create
high-performing models.
Notes:
- The data are pre-cut into 20 folds.
- A single sample of data is captured as a single row in the
files. A single row in one file corresponds to the same sample
in the matching files.
- Because the neural data is noisy (and we have so few neurons),
we will use a history of 20 time steps (so, a full second) of
neural data to make the arm state prediction for the current
time step. This history information has already been accounted
for in the individual rows in the MI data files: each row has
20 x 48 = 960 columns, with the first 20 columns corresponding
to the first neuron (ordered from the oldest to the most recent
counts). The discontinuities in the data have already been
accounted for in these histories, so you don't need to take any
additional steps to handle these.
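As a quick check of the MI row layout described above, here is a small sketch using synthetic spike counts (the Poisson data and the 100-sample array size are stand-ins; the real array would come from loading an MI fold file):

```python
import numpy as np

# Hypothetical MI array: 100 samples x (48 neurons x 20 time steps) = 960
# columns, matching the layout described in the notes above.
rng = np.random.default_rng(0)
MI = rng.poisson(1.0, size=(100, 960))

# Columns are grouped by neuron: columns 0..19 hold neuron 0's history
# (ordered oldest to most recent count), columns 20..39 hold neuron 1's, etc.
row = MI[0]
history = row.reshape(48, 20)   # history[i, j]: neuron i, history step j
```

Reshaping one row this way can be useful when visualizing a single sample's spike-count histories.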
The following will load all folds for one file type and return a list
of Numpy arrays:
import os
import fnmatch
import pandas as pd

# File loading
def read_bmi_file_set(directory, filebase):
    '''Read a set of CSV files, one per fold

    :param directory: The directory in which to scan for the CSV files
    :param filebase: A file specification that potentially includes wildcards
    :returns: A list of Numpy arrays (one for each fold)
    '''
    # The set of matching files in the directory, sorted into fold order
    files = fnmatch.filter(os.listdir(directory), filebase)
    files.sort()

    # Load each matching file with Pandas and convert it to a Numpy array
    lst = [pd.read_csv(directory + "/" + file, delim_whitespace=True).values
           for file in files]

    return lst

# Load the time stamps
time = read_bmi_file_set('/home2/fagg/datasets/bmi/DAT6_08', 'time_fold*')
Part 1: Data Exploration
- Load up a single fold's worth of data for MI and torque for
training (fold zero), and a single fold for testing (fold 19).
- Construct a LinearRegression model from these data. Show a
plot of elbow torque as a function of sample index for the test data
(one curve each for true and predicted torque).
- Construct a Ridge regression model (Scikit-Learn's Ridge class)
for the same data set.
Generate a plot similar to the previous step. Try a few
different regularization parameter (alpha) values (you only need
to show one case in the plot). How do these results compare to the
LinearRegression model?
- Construct a GradientBoostingRegressor model. Generate the same
type of plot for one parameter choice. How do these results
compare to the Ridge regression case?
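The Part 1 workflow can be sketched as follows. Synthetic stand-ins are used for the training and test folds (the Poisson inputs, the hypothetical weight matrix W, and the array sizes are all invented); with the real data, the arrays would come from read_bmi_file_set() for the MI and torque files:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for one training fold and one test fold
rng = np.random.default_rng(42)
W = rng.normal(scale=0.01, size=(960, 2))    # hypothetical neural-to-torque map
MI_train = rng.poisson(1.0, size=(2000, 960)).astype(float)
tor_train = MI_train @ W + rng.normal(scale=0.05, size=(2000, 2))
MI_test = rng.poisson(1.0, size=(500, 960)).astype(float)
tor_test = MI_test @ W + rng.normal(scale=0.05, size=(500, 2))

# Fit on the training fold, predict on the test fold
model = LinearRegression()
model.fit(MI_train, tor_train)
pred = model.predict(MI_test)

# Torque column 1 is the elbow (columns are shoulder, elbow)
plt.figure()
plt.plot(tor_test[:, 1], label='true')
plt.plot(pred[:, 1], label='predicted')
plt.xlabel('Sample index')
plt.ylabel('Elbow torque')
plt.legend()
plt.savefig('elbow_torque.png')
```

Swapping LinearRegression for Ridge(alpha=...) or GradientBoostingRegressor(...) reuses the same fit/predict/plot skeleton for the other two parts of this section.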
Part 2: Cross-Validation
Because Scikit-Learn leaves something to be desired in its
implementation of cross-validation, we will implement our own. The
data set contains N=20 folds. For a given rotation, N-2 folds are
available for training, one will be used for validation and one will be
held out for testing.
We will be implementing three nested loops (I implemented these across two functions):
- Loop over possible values of a single hyper-parameter. Which
hyper-parameter is being varied should be a parameter to this
function, as different models use different parameters.
- Loop over training data set sizes (in terms of the number of
folds that are actually used).
For example, the first model using 18 folds for training will use folds
0..17 for this purpose, fold 18 for validation and fold 19 for testing.
The first model using only 2 folds for training will use folds
0..1 for this purpose, fold 18 for validation and fold 19 for testing.
- Loop over the 20 different models that are constructed for each
combination of hyper-parameter value and training set size.
The training, validation and testing folds rotate together
across the full data set. We
will use the mean Fraction of Variance Accounted For (FVAF)
across these 20 models to measure performance; Scikit-Learn
implements FVAF as the explained_variance_score() function.
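One way to structure the inner two loops is a single function that rotates the folds together and reports mean FVAF over all rotations. This is only a sketch under assumptions: the name mean_fvaf, the model_factory callable, and placing the validation and test folds immediately before the wrap point are my choices, not requirements of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score

def mean_fvaf(ins_folds, outs_folds, model_factory, n_train):
    '''Rotate the train/validation/test folds together across the data
    set and return the mean validation and test FVAF over all rotations.

    :param ins_folds: List of Numpy arrays (one input array per fold)
    :param outs_folds: List of Numpy arrays (one output array per fold)
    :param model_factory: Callable that builds a fresh, unfit model
    :param n_train: Number of training folds (at most len(ins_folds) - 2)
    '''
    n_folds = len(ins_folds)
    val_scores = []
    test_scores = []
    for r in range(n_folds):
        # Fold indices for this rotation, wrapping around the end
        train_idx = [(r + i) % n_folds for i in range(n_train)]
        val_idx = (r + n_folds - 2) % n_folds
        test_idx = (r + n_folds - 1) % n_folds

        # Assemble the training set from the selected folds
        ins = np.concatenate([ins_folds[i] for i in train_idx], axis=0)
        outs = np.concatenate([outs_folds[i] for i in train_idx], axis=0)

        model = model_factory()
        model.fit(ins, outs)

        # FVAF on the validation and test folds for this rotation
        val_scores.append(explained_variance_score(
            outs_folds[val_idx], model.predict(ins_folds[val_idx])))
        test_scores.append(explained_variance_score(
            outs_folds[test_idx], model.predict(ins_folds[test_idx])))
    return np.mean(val_scores), np.mean(test_scores)
```

A hyper-parameter loop then just calls this once per candidate value, e.g. mean_fvaf(mi, tor, lambda: Ridge(alpha=a), 18) for each a under consideration.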
In addition, write a function that:
Notes:
Part 3: Analysis
- Consider training set sizes of 1, 2, 3, 18, and some
intermediate sizes.
- Use your new functions to analyze the following models using shoulder
torque. Give a short textual interpretation for each and note the best
parameter choice for each training data set size (according to mean
validation FVAF):
- LinearRegression. Because we don't have any interesting
parameters to vary here, I just used fit_intercept as the
hyper-parameter, with True (the default) as its only value.
- Ridge regression (Scikit-Learn's Ridge class).
- GradientBoostingRegressor. Vary the number of trees
(n_estimators). Use a small depth (max_depth).
- For each training data set size, determine which of the three
approaches is best according to mean test set FVAF. Remember
that you are only allowed to observe test set performance for
your best parameter choice for each model and training data
set size.
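To make the test-set rule in the last bullet concrete, here is a toy sketch. All FVAF numbers are invented, and the grid layout (rows index hyper-parameter values, columns index training set sizes) is my assumption:

```python
import numpy as np

# Invented mean FVAF grids: rows are hyper-parameter values,
# columns are training set sizes.
val_fvaf = np.array([[0.30, 0.45, 0.50],
                     [0.35, 0.44, 0.55],
                     [0.20, 0.40, 0.52]])
test_fvaf = np.array([[0.28, 0.43, 0.49],
                      [0.33, 0.42, 0.54],
                      [0.18, 0.39, 0.50]])

# Pick the best hyper-parameter for each training set size using ONLY
# the validation FVAF...
best = np.argmax(val_fvaf, axis=0)
# ...then look at the test FVAF only for those choices.
reported = test_fvaf[best, np.arange(test_fvaf.shape[1])]
```

The comparison across the three model types then happens only among these `reported` values, never over the full test grid.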
Hints
- numpy.concatenate() and numpy.take() are very helpful.
- I spent 2+ CPU hours playing with different ideas and executing
the final experiments. When you are testing, please do this
with small parameter sets & only go to the big sets once things
are working. You can check your usage using the top
program in a separate shell.
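The numpy.take() hint refers to its wrap mode, which makes the fold rotation easy to index; a small sketch (the rotation value is an arbitrary example):

```python
import numpy as np

# mode='wrap' lets indices run past the end of the fold list and wrap
# around to the beginning -- exactly the rotation behavior needed here.
folds = np.arange(20)                 # stand-in for 20 fold indices
rotation = 5                          # arbitrary example rotation
train = np.take(folds, np.arange(rotation, rotation + 18), mode='wrap')
# train holds folds 5..19 followed by 0..2
```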
andrewhfagg -- gmail.com
Last modified: Wed Feb 21 17:34:10 2018