CS 5043: HW1
Assignment notes:
- Deadline: Tuesday, February 6th @11:59pm.
- Hand-in procedure: submit to the HW1 drop box on Canvas.
- This work is to be done on your own. While general discussion
about Python and Scikit-Learn is encouraged, sharing
solution-specific code is inappropriate.
- Do not submit zip or MSWord documents.
Part 1: Pipelines
Do one of the following:
1A Implement a pipeline component that repairs unknown data in
a time-series using interpolation.
- Assume that the input and output of the pipeline component are
Pandas DataFrames.
- The class constructor must take as input the list of columns to
repair.
- Unknown data are indicated using "NaN" and can occur anywhere in
the time-series (there may be multiple samples in a row that are
NaN).
- Handle edge cases by copying the first valid value to the edge.
- Handle intermediate cases by linear interpolation between the valid
values that immediately surround the "gap" of invalid data (a rough
sketch of one possible implementation follows these bullets).
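Here is a rough sketch of one way 1A might be structured; the class name,
the fit/transform skeleton, and the use of interpolate/bfill/ffill are only
suggestions, not requirements:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class InterpolationImputer(BaseEstimator, TransformerMixin):
    '''Repair NaN entries in the listed columns of a time-series DataFrame.'''
    def __init__(self, attribs):
        self.attribs = attribs            # List of column names to repair

    def fit(self, x, y=None):
        return self                       # Nothing to learn from the data

    def transform(self, x):
        out = x.copy()
        for col in self.attribs:
            # Linear interpolation across interior gaps of NaNs
            s = out[col].interpolate(method='linear')
            # Copy the nearest valid value outward to fill gaps at the edges
            out[col] = s.bfill().ffill()
        return out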
1B We talked in class about our desire to be able to have a
"split" in a pipeline to avoid multiple executions of the early stages
of the pipeline. Implement a Pipeline splitter (and, perhaps, a
corresponding Combiner).
- At minimum, the Splitter must be a proper Transformer. I will
leave it open as to whether it should also be an Estimator.
- The Splitter will have multiple children that are each pipeline
components (and can be pipelines in and of themselves).
- s.transform(x) will pass x to each of the
child pipelines (i.e., call c.transform(x) for each
child).
- The return value of s.transform(x) must be some
reasonable representation of the return values from each of the
children (a hash/dictionary keyed by child is probably a good approach here).
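Here is a rough sketch of one possible Splitter for 1B, assuming the children
are supplied as (name, component) pairs and the results come back as a
dictionary keyed by name (these choices are suggestions, not requirements):

from sklearn.base import BaseEstimator, TransformerMixin

class PipelineSplitter(BaseEstimator, TransformerMixin):
    '''Pass the same input to several child pipeline components and
    collect their outputs.  children is a list of (name, component) pairs.'''
    def __init__(self, children):
        self.children = children

    def fit(self, x, y=None):
        # Fit each child on the same input so the Splitter can sit
        # inside a larger Pipeline
        for name, child in self.children:
            child.fit(x, y)
        return self

    def transform(self, x):
        # Return a dictionary mapping each child's name to its output
        return {name: child.transform(x) for name, child in self.children}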
For either 1A or 1B, submit a Jupyter notebook and the corresponding
pdf that shows the code and a documented set of test cases. These
tests should be convincing.
Part 2: Classifiers
The sippc_action column of the baby data files indicates time
periods during which the robot is producing an extended action in
response to something that the infant has done. This is an enumerated
data type. The codes are:
- 0: no robot action
- 1-4: forward, backward, left or right movement in response to a
force-based event as sensed by the force/torque sensor on the robot
- 5-8: forward, backward, left or right movement in response to a
gesture-based event as sensed by the kinematic capture suit
worn by the infant
Our task is to construct a classifier that distinguishes the kinematic
conditions for force-based events from those of gesture-based events.
Specifically, we will use samples from the time-series at the
initiation of a force-based response as the set of positive
examples and the samples at the initiation of a gesture-based
response as the set of negative examples. All other samples
will be discarded from the data set. Our classifier will distinguish
these two categories as a function of the positions and velocities
contained in the data set.
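Here is a rough sketch of one way the positive and negative examples might be
pulled out. It assumes that "initiation" means the first sample at which
sippc_action changes to a new non-zero code, and that data is the concatenated
DataFrame produced by the loading code below; adapt as needed:

import numpy as np

def extract_initiation_examples(data):
    '''Return (inputs, labels) for samples at the initiation of a robot action.
    Label 1 = force-based response (codes 1-4); label 0 = gesture-based (5-8).
    All other samples are discarded.'''
    action = data['sippc_action'].values
    # A sample is an "initiation" if its action code is non-zero and differs
    # from the previous sample's code
    prev = np.concatenate(([0], action[:-1]))
    initiation = (action != 0) & (action != prev)

    force = initiation & (action >= 1) & (action <= 4)
    gesture = initiation & (action >= 5) & (action <= 8)

    keep = force | gesture
    labels = force[keep].astype(int)   # 1 for force-based, 0 for gesture-based
    inputs = data.loc[keep]            # select the position/velocity columns from this
    return inputs, labels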
Notes:
- Because a single trial file only contains a small number of
examples, we will load all of the data available from a
single subject. Two new data sets (c3 and k3) have been added
to the baby1 directory. These have a larger number of trials.
Here is some code that may be of help:
import fnmatch
import os
import pandas as pd
from sklearn.pipeline import Pipeline

# Pipeline for computing the derivative and dropping bad rows
# (ComputeDerivative, DataSampleDropper, and fieldsKin are assumed to be
#  defined elsewhere in your notebook)
pipeCleaner = Pipeline([
    ('derivative', ComputeDerivative(fieldsKin, dt=.02)),
    ('dropper', DataSampleDropper())])

# File loading
def read_kinematic_file_set(directory, filebase):
    '''Read a set of CSV files and append them together
    :param directory: The directory in which to scan for the CSV files
    :param filebase: A file specification that potentially includes wildcards
    :returns: A Pandas object that contains the concatenation of all of the CSV tables
    '''
    # The set of files in the directory
    files = fnmatch.filter(os.listdir(directory), filebase)
    # Create a list of Pandas objects; each object is from a file in the directory that matches filebase
    lst = [pipeCleaner.transform(pd.read_csv(directory + "/" + file)) for file in files]
    # Concatenate the Pandas objects together.  ignore_index is critical here so that
    # the duplicate row indices are addressed
    return pd.concat(lst, ignore_index=True)
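For example, the loader might be invoked once per subject with a wildcard that
matches that subject's trial files (the pattern below is only a placeholder;
substitute the actual file names in the baby1 directory):

# Hypothetical call: load and clean every trial file for one subject
dataAll = read_kinematic_file_set('baby1', '*k3*.csv')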
- Make sure to use proper validation data sets to compute your
scores. While we used cross_val_predict() in class,
one also needs to worry about a balanced representation of the
different classes in each of the folds. I ended up
implementing my own version that does stratification
based on the discussion on p. 83 of the text. You may end up
with cleaner results if you take this route.
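One concrete way to get stratified folds without writing your own splitter is
to hand cross_val_predict() a StratifiedKFold object (a sketch only, not the
implementation referred to above; the classifier and its parameters are
placeholders):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import SGDClassifier   # example classifier only

# Each fold preserves the proportion of force-based vs. gesture-based examples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# inputs: the position/velocity feature matrix; labels: 1=force-based, 0=gesture-based
clf = SGDClassifier(max_iter=1000, tol=1e-3)
# Out-of-fold predictions for every sample, computed with stratified folds
predictions = cross_val_predict(clf, inputs, labels, cv=skf)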
What to hand-in:
- Hand in both a Jupyter notebook and the corresponding pdf
file of your work. There should be sufficient documentation to
explain what you are doing at each step.
- Attempt multiple types of classifiers and parameter sets. Keep
the most salient two or three in your notebook. At least one of them should
exhibit interesting skill. My "best" classifier gave me a PSS
of 0.16 on the validation set (not very high, but we are asking
our classifier to solve a hard problem). Can you beat it?
- For each of the selected classifiers/parameters, show the
cumulative distributions for TPR and FPR. Also, compute the KS
distance.
- In addition, show the corresponding ROC curve and the
Area-Under-the-ROC-Curve (AU-ROC).
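Here is a rough sketch of how these quantities can be computed from
out-of-fold decision scores (scores and labels are assumed to come from the
cross-validation step above; roc_curve/auc are one way to get them):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# scores: decision-function outputs for every sample, e.g.
# scores = cross_val_predict(clf, inputs, labels, cv=skf, method='decision_function')
fpr, tpr, thresholds = roc_curve(labels, scores)

# Cumulative TPR and FPR as a function of the decision threshold
# (skip index 0, which roc_curve sets to an artificial threshold)
plt.plot(thresholds[1:], tpr[1:], label='TPR')
plt.plot(thresholds[1:], fpr[1:], label='FPR')
plt.xlabel('Decision threshold')
plt.legend()

# KS distance: the largest vertical gap between the TPR and FPR curves
ks = (tpr - fpr).max()

# ROC curve and its area
auroc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve (AU-ROC = %.3f)' % auroc)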
andrewhfagg -- gmail.com
Last modified: Wed Jan 31 02:19:49 2018