CS 5043: HW1
Assignment notes:
- Deadline: Tuesday, February 6th @11:59pm.
- Hand-in procedure: submit to the HW1 drop box on Canvas.
- This work is to be done on your own. While general discussion
about Python and Scikit-Learn is encouraged, sharing
solution-specific code is inappropriate.
- Do not submit zip or MSWord documents.
Part 1: Pipelines
Do one of the following:
1A Implement a pipeline component that repairs unknown data in
a time-series using interpolation.
- Assume that the input and output of the pipeline component are
Pandas DataFrames.
- The class constructor must take as input the list of columns to
repair.
- Unknown data are indicated using "NaN" and can occur anywhere in
the time-series (there may be multiple samples in a row that are
NaN).
- Handle edge cases by copying the first valid value to the edge.
- Handle intermediate cases by linear interpolation between the valid
values that immediately surround the "gap" of invalid data (a rough
sketch of one possible implementation follows these bullets).
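Here is a rough sketch of one way 1A might be structured; the class name,
the fit/transform skeleton, and the use of interpolate/bfill/ffill are only
suggestions, not requirements:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class InterpolationImputer(BaseEstimator, TransformerMixin):
    '''Repair NaN entries in the listed columns of a time-series DataFrame.'''
    def __init__(self, attribs):
        self.attribs = attribs            # List of column names to repair

    def fit(self, x, y=None):
        return self                       # Nothing to learn from the data

    def transform(self, x):
        out = x.copy()
        for col in self.attribs:
            # Linear interpolation across interior gaps of NaNs
            s = out[col].interpolate(method='linear')
            # Copy the nearest valid value outward to fill gaps at the edges
            out[col] = s.bfill().ffill()
        return out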
1B We talked in class about our desire to be able to have a
"split" in a pipeline to avoid multiple executions of the early stages
of the pipeline. Implement a Pipeline splitter (and, perhaps, a
corresponding Combiner).
- At minimum, the Splitter must be a proper Transformer. I will
leave it open as to whether it should also be an Estimator.
- The Splitter will have multiple children that are each pipeline
components (and can be pipelines in and of themselves).
- s.transform(x) will pass x to each of the
child pipelines (i.e., call c.transform(x) for each
child).
- The return value of s.transform(x) must be some
reasonable representation of the return values from each of the
children (a hash/dictionary keyed by child is probably a good approach here).
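Here is a rough sketch of one possible Splitter for 1B, assuming the children
are supplied as (name, component) pairs and the results come back as a
dictionary keyed by name (these choices are suggestions, not requirements):

from sklearn.base import BaseEstimator, TransformerMixin

class PipelineSplitter(BaseEstimator, TransformerMixin):
    '''Pass the same input to several child pipeline components and
    collect their outputs.  children is a list of (name, component) pairs.'''
    def __init__(self, children):
        self.children = children

    def fit(self, x, y=None):
        # Fit each child on the same input so the Splitter can sit
        # inside a larger Pipeline
        for name, child in self.children:
            child.fit(x, y)
        return self

    def transform(self, x):
        # Return a dictionary mapping each child's name to its output
        return {name: child.transform(x) for name, child in self.children}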
For either 1A or 1B, submit a Jupyter notebook and the corresponding
pdf that shows the code and a documented set of test cases. These
tests should be convincing.
Part 2: Classifiers
The sippc_action column of the baby data files indicates time
periods during which the robot is producing an extended action in
response to something that the infant has done. This is an enumerated
data type. The codes are:
- 0: no robot action
- 1-4: forward, backward, left or right movement in response to a
force-based event as sensed by the force/torque sensor on the robot
- 5-8: forward, backward, left or right movement in response to a
gesture-based event as sensed by the kinematic capture suit
worn by the infant
Our task is to construct a classifier that distinguishes the kinematic
conditions for force-based events from those of gesture-based events.
Specifically, we will use samples from the time-series at the
initiation of a force-based response as the set of positive
examples and the samples at the initiation of a gesture-based
response as the set of negative examples. All other samples
will be discarded from the data set. Our classifier will distinguish
these two categories as a function of the positions and velocities
contained in the data set.
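Here is a rough sketch of one way the positive and negative examples might be
pulled out. It assumes that "initiation" means the first sample at which
sippc_action changes to a new non-zero code, and that data is the concatenated
DataFrame produced by the loading code below; adapt as needed:

import numpy as np

def extract_initiation_examples(data):
    '''Return (inputs, labels) for samples at the initiation of a robot action.
    Label 1 = force-based response (codes 1-4); label 0 = gesture-based (5-8).
    All other samples are discarded.'''
    action = data['sippc_action'].values
    # A sample is an "initiation" if its action code is non-zero and differs
    # from the previous sample's code
    prev = np.concatenate(([0], action[:-1]))
    initiation = (action != 0) & (action != prev)

    force = initiation & (action >= 1) & (action <= 4)
    gesture = initiation & (action >= 5) & (action <= 8)

    keep = force | gesture
    labels = force[keep].astype(int)   # 1 for force-based, 0 for gesture-based
    inputs = data.loc[keep]            # select the position/velocity columns from this
    return inputs, labels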
Notes:
- Because a single trial file only contains a small number of
examples, we will load all of the data available from a
single subject. Two new data sets (c3 and k3) have been added
to the baby1 directory. These have a larger number of trials.
Here is some code that may be of help:
import fnmatch
import os
import pandas as pd
from sklearn.pipeline import Pipeline

# Pipeline for computing the derivative and dropping bad rows
# (ComputeDerivative, DataSampleDropper, and fieldsKin are assumed to be
#  defined elsewhere in your notebook)
pipeCleaner = Pipeline([
    ('derivative', ComputeDerivative(fieldsKin, dt=.02)),
    ('dropper', DataSampleDropper())])

# File loading
def read_kinematic_file_set(directory, filebase):
    '''Read a set of CSV files and append them together
    :param directory: The directory in which to scan for the CSV files
    :param filebase: A file specification that potentially includes wildcards
    :returns: A Pandas object that contains the concatenation of all of the CSV tables
    '''
    # The set of files in the directory
    files = fnmatch.filter(os.listdir(directory), filebase)
    # Create a list of Pandas objects; each object is from a file in the directory that matches filebase
    lst = [pipeCleaner.transform(pd.read_csv(directory + "/" + file)) for file in files]
    # Concatenate the Pandas objects together.  ignore_index is critical here so that
    # the duplicate row indices are addressed
    return pd.concat(lst, ignore_index=True)
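For example, the loader might be invoked once per subject with a wildcard that
matches that subject's trial files (the pattern below is only a placeholder;
substitute the actual file names in the baby1 directory):

# Hypothetical call: load and clean every trial file for one subject
dataAll = read_kinematic_file_set('baby1', '*k3*.csv')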
- Make sure to use proper validation data sets to compute your
scores. While we used cross_val_predict() in class,
one also needs to worry about a balanced representation of the
different classes in each of the folds. I ended up
implementing my own version that does stratification
based on the discussion on p. 83 of the text. You may end up
with cleaner results if you take this route.
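One concrete way to get stratified folds without writing your own splitter is
to hand cross_val_predict() a StratifiedKFold object (a sketch only, not the
implementation referred to above; the classifier and its parameters are
placeholders):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import SGDClassifier   # example classifier only

# Each fold preserves the proportion of force-based vs. gesture-based examples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# inputs: the position/velocity feature matrix; labels: 1=force-based, 0=gesture-based
clf = SGDClassifier(max_iter=1000, tol=1e-3)
# Out-of-fold predictions for every sample, computed with stratified folds
predictions = cross_val_predict(clf, inputs, labels, cv=skf)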
What to hand-in:
- Hand in both a Jupyter notebook and the corresponding pdf
file of your work. There should be sufficient documentation to
explain what you are doing at each step.
- Attempt multiple types of classifiers and parameter sets. Keep
the most salient two or three in your notebook. At least one of them should
exhibit interesting skill. My "best" classifier gave me a PSS
of 0.16 on the validation set (not very high, but we are asking
our classifier to solve a hard problem). Can you beat it?
- For each of the selected classifiers/parameters, show the
cumulative distributions for TPR and FPR. Also, compute the KS
distance.
- In addition, show the corresponding ROC curve and the
Area-Under-the-ROC-Curve (AU-ROC).
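Here is a rough sketch of how these quantities can be computed from
out-of-fold decision scores (scores and labels are assumed to come from the
cross-validation step above; roc_curve/auc are one way to get them):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# scores: decision-function outputs for every sample, e.g.
# scores = cross_val_predict(clf, inputs, labels, cv=skf, method='decision_function')
fpr, tpr, thresholds = roc_curve(labels, scores)

# Cumulative TPR and FPR as a function of the decision threshold
# (skip index 0, which roc_curve sets to an artificial threshold)
plt.plot(thresholds[1:], tpr[1:], label='TPR')
plt.plot(thresholds[1:], fpr[1:], label='FPR')
plt.xlabel('Decision threshold')
plt.legend()

# KS distance: the largest vertical gap between the TPR and FPR curves
ks = (tpr - fpr).max()

# ROC curve and its area
auroc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve (AU-ROC = %.3f)' % auroc)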
andrewhfagg -- gmail.com
Last modified: Wed Jan 31 02:19:49 2018