CS 5043: HW2: Regression and Parameter Searches
Assignment notes:
- Deadline: Thursday, February 15th @11:59pm.
- Hand-in procedure: submit to the HW2 drop box on Canvas.
- This work is to be done on your own. While general discussion
  about Python and Scikit-Learn is encouraged, sharing
  solution-specific code is inappropriate.
- Do not submit zip or MSWord documents.
Part 1: Regularization with Ridge Regression
Perform a grid search for an appropriate ridge regression alpha
parameter using the GridSearchCV class (discussed in chapter 2).
Specifics:
- Data set:
- subject_k3_w10
- Do not shuffle the data
- Inputs: all kinematic variables (position and velocity) except
those related to the left foot and left ankle
- Output (variable to be predicted): left foot x
- Ridge regression:
- Alphas: use a wide range here
- Solver: cholesky
- Grid search:
- Cross-validation folds: do separate analyses for 2, 3 and 10
- Scoring: neg_mean_squared_error
- Return_train_score: True
- Examine the GridSearchCV class documentation: after
execution, a property is available that contains the
predicted values for the validation data
- Product:
- For each of the cross-validation fold numbers, produce
a figure that shows separate curves for mean training
set and mean validation set performance (RMSE) as a
function of alpha. I suggest a logarithmic scale along alpha.
- For each of these experiments, report the best validation set
  performance and the corresponding alpha.
- Note: In your grid search, you should be using enough
alphas for these curves to be reasonably smooth.
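The Part 1 pipeline above can be sketched as follows. This is a minimal illustration, not the required solution: `X` and `y` are synthetic stand-ins for the subject_k3_w10 inputs and the left foot x output, and the alpha range is only an example of a wide, log-spaced grid.

```python
# Sketch of the Part 1 grid search. X and y are synthetic stand-ins;
# in the assignment they come from the subject_k3_w10 data set.
import numpy as np
import matplotlib
matplotlib.use('Agg')                       # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # stand-in for kinematic inputs
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=300)

alphas = np.logspace(-3, 6, 50)             # wide, log-spaced range (example)

for n_folds in [2, 3, 10]:
    search = GridSearchCV(Ridge(solver='cholesky'),
                          {'alpha': alphas},
                          cv=n_folds,       # integer cv: KFold, no shuffling
                          scoring='neg_mean_squared_error',
                          return_train_score=True)
    search.fit(X, y)

    # Scores are negative MSE; negate and take the root to get RMSE
    train_rmse = np.sqrt(-search.cv_results_['mean_train_score'])
    val_rmse = np.sqrt(-search.cv_results_['mean_test_score'])

    # Train and validation curves vs alpha, log scale along alpha
    plt.figure()
    plt.semilogx(alphas, train_rmse, label='train')
    plt.semilogx(alphas, val_rmse, label='validation')
    plt.xlabel('alpha')
    plt.ylabel('RMSE')
    plt.legend()
    plt.title('%d-fold cross-validation' % n_folds)

    print('folds=%d  best alpha=%g  best validation RMSE=%g'
          % (n_folds, search.best_params_['alpha'],
             np.sqrt(-search.best_score_)))
```

Note that `scoring='neg_mean_squared_error'` reports the *negative* MSE (higher is better for scikit-learn scorers), so the scores must be negated before taking the square root to obtain RMSE.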
Part 2: Data Set Size with Lasso
Explore the effects of data set size on the proper
choice of hyper-parameters in Lasso.
Specifics:
- Data set: same data configuration as in Part 1
- Lasso:
- Alphas: use a wide range here. You will find that Lasso
  requires a different range than Ridge regression
- Meta-search:
- Write a method that iterates through different data set
sizes: first 10%, first 30%, ... first 90% of the
available data
- For each step in the iteration, call GridSearchCV() over
your set of alphas. Each call will yield validation
performance as a function of alpha
- Hint: my implementation returns an array of results from
GridSearchCV(), one for each data set size
- Product:
- Produce a single figure showing mean validation set
performance (RMSE) as a function of alpha. Again, I
suggest a logarithmic scale along alpha.
- Each choice of data set size will yield its own curve.
- Include a legend showing which curve is which.
- In my implementation, the legend shows both the number
of elements in the data set and the best choice for alpha
- Notes:
- Remember that because Lasso involves an iterative search
process, it will require substantially more time to
complete the learning process than Ridge Regression.
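The Part 2 meta-search might be sketched as below. Again, `X` and `y` are synthetic stand-ins for the Part 1 data configuration, and the alpha range, fold count, and `size_search` helper name are all illustrative choices, not requirements.

```python
# Sketch of the Part 2 meta-search over data set sizes. X and y are
# synthetic stand-ins for the Part 1 data configuration.
import numpy as np
import matplotlib
matplotlib.use('Agg')                       # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=500)

alphas = np.logspace(-4, 1, 50)             # example range; Lasso differs from Ridge

def size_search(X, y, fractions, cv=3):
    """For each data set fraction, run GridSearchCV over alpha on the
    first rows of the data; return one fitted GridSearchCV per size."""
    searches = []
    for frac in fractions:
        n = int(frac * X.shape[0])          # first n rows, no shuffling
        search = GridSearchCV(Lasso(max_iter=10000),
                              {'alpha': alphas},
                              cv=cv,
                              scoring='neg_mean_squared_error')
        search.fit(X[:n], y[:n])
        searches.append(search)
    return searches

fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
searches = size_search(X, y, fractions)

# Single figure: one validation RMSE curve per data set size
plt.figure()
for frac, search in zip(fractions, searches):
    val_rmse = np.sqrt(-search.cv_results_['mean_test_score'])
    plt.semilogx(alphas, val_rmse,
                 label='n=%d, best alpha=%g'
                 % (int(frac * X.shape[0]),
                    search.best_params_['alpha']))
plt.xlabel('alpha')
plt.ylabel('validation RMSE')
plt.legend()
```

Returning the list of fitted GridSearchCV objects (one per data set size) keeps all of the per-size results available for plotting and for reporting the best alpha in the legend.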
andrewhfagg -- gmail.com
Last modified: Thu Feb 8 01:11:37 2018