Advanced Machine Learning: Project Specification

Your final project is an opportunity for you to jump into a data set, understand something about its structure and then apply your preprocessing and model building skills. Hence, your project should be interesting to you and yet be something that stretches your skill set.

Data

Your data set should not be something that you already have substantial experience with unless you are planning to ask some fundamentally new questions.
The data set must already exist and be accessible. We don't have time this semester to collect data. Likewise, we don't have a lot of time to massage an existing data set.
The data set should be nominally novel. If a data set has already been analyzed by many different people, then it is out of bounds. A good heuristic is that if the data set is discussed in a book, then too many people have touched it already and you should pick something else.

Modeling Question

The type of question(s) that you ask for the project will depend on the domain and the data set(s) that are available to you. Fundamentally, we need to crisply define what the modeling goal is. Some possibilities include:

What are the best model types and structures for predicting X / labeling X?
What are the most important features for predicting X? This type of information can be used, for example, by domain experts to focus their own model building.
What is the right structure for generating examples of some sensor data (e.g., a synthetic image)? (I am thinking of Generative Adversarial Networks here.)

Model and Evaluation Specifics

For the modeling side, you should plan on trying multiple approaches, with proper parameter-selection techniques (i.e., cross-validation) and proper comparison between the modeling approaches. If you are collaborating with someone, then you each should be doing your own modeling.
You must employ statistical hypothesis testing to make an argument about the models. You could use this testing to compare two or more models. You could also use a statistical approach to argue that one feature is more important than some other feature.
In many cases, a paired-sample or two-sample t-test can be used to make these arguments. However, if the underlying distribution is fundamentally different from a Gaussian, then approaches such as bootstrap sampling are appropriate. More on this soon.
Depending on the complexity of the problem, you may not be able to complete full cross-validation runs in the time that you have. Nonetheless, you should still be constructing a number of models, from which we can start to get a sense of the statistical behavior of your model.
I expect some positive result. Start with simple approaches so you can achieve something quickly and then move on to more complicated approaches.
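As a minimal sketch of the paired comparison described above, the following computes a paired t statistic from per-fold test scores for two models evaluated on the same folds. The function name and the accuracy values are hypothetical and only for illustration. Only the Python standard library is used, so the p-value itself is not computed; a statistics package (for example, scipy.stats.ttest_rel) can supply it.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("a paired test needs one score per fold for each model")
    # Work with the per-fold differences; pairing removes fold-to-fold variation.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    if sd_diff == 0:
        raise ValueError("the differences have zero variance")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical test-set accuracies for two models on the same 5 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The magnitude of t is then compared against the critical value of the t distribution for the given degrees of freedom (2.776 for 4 degrees of freedom at the two-sided 0.05 level). Note that this assumes the per-fold differences are roughly Gaussian.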

Deadlines

April 10th: 3-minute, rapid-fire presentations. You may have 3 slides:
1. What is your name? What is the project domain? What, generally, are the data? Why is this interesting?
2. What are the details of your data? What is the structure of your samples? Are you dealing with continuous and/or categorical data? How noisy are the data?
  Present one data figure that illustrates something interesting about the structure of the samples.
3. What is the modeling problem to be solved? What are the approaches that you will explore? What do the inputs / outputs of the models look like? What do you imagine the internals of the models will look like?
Slides are due by 10am on April 10th.
April 26th: Discuss your preliminary modeling efforts and results. As with the previous set of presentations, every group will have 3 minutes to present. You may have up to 4 slides and must have the presentation submitted ahead of time.
1. Who are you? What is the domain? What is the problem that you are trying to solve?
2. Provide specifics for at least one approach for each group member. What is the model? What are the inputs/outputs? What is the architecture?
3. Show at least two levels of result for each group member: A. One or more example input/output pairs; B. High-level result. The latter could be learning curves (ideally for multiple models or hyperparameters), or ROC curves (summarized by AUC).
4. What are the steps for the next week?
May 3rd (1:30-4:30 pm): Poster presentations. Here, you will discuss your full set of modeling results, including appropriate statistics.
Structuring a quality poster is a challenge. In general, your aim should be for the poster to "stand alone" (a visitor can understand the essence of your poster without you presenting it). However, your poster should not be packed with a tremendous amount of detail, or even full sentences. It is not uncommon for some sections to be bulleted lists of talking points expressed using key phrases.
Your poster should include these sections:
1. Title, with names and affiliations
2. Abstract: one (maybe two) paragraphs that summarize the key points of your work.
3. Introduction. Explain the domain and why it is interesting.
4. Problem. Discuss the specifics of the problem. What are the inputs? What are the outputs? What do the data look like? Where do the data come from?
5. Approach. Discuss the details of your modeling approach, including the architectural decisions that you have made. Why did you make these decisions?
6. Results. Use a low-to-high level approach here. Describe small, intermediate results (e.g., what does your model do with a variety of inputs?).
  Describe comparative results across multiple models and/or hyperparameter sets. Which choice(s) are best? Make a clear statistical argument here using test set results. One appropriate approach is to compare test set performance for N folds for multiple models. A two-sample or paired t-test can often be used here. However, if the performance measures are not approximately normally distributed, then it is appropriate to use a sampling-based approach (e.g., a bootstrap test of means).
7. Discussion and conclusions. What are the high-level messages? What did we generally learn from this work? Did the results turn out as expected? Why/why not? What are the next steps in the work?
8. References. Use full references.
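For the statistical argument in the Results section, one possible form of a bootstrap test of means is sketched below for two models evaluated on the same N folds. This is an illustration under stated assumptions, not a required method: the function name, the number of resamples, and the accuracy values are all hypothetical. The per-fold differences are shifted to have zero mean (so that the null hypothesis holds), resampled with replacement, and the p-value is the fraction of resampled means that are at least as extreme as the observed mean difference.

```python
import random


def bootstrap_mean_diff_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided bootstrap test of whether the mean paired difference is zero.

    Returns (observed_mean_difference, p_value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per fold for each model")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Center the differences so that the null hypothesis (mean = 0) is true.
    centered = [d - observed for d in diffs]
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Resample n differences with replacement from the null distribution.
        sample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if abs(sample_mean) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical test-set accuracies for two models on the same 5 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80]

mean_diff, p = bootstrap_mean_diff_test(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, bootstrap p-value = {p:.4f}")
```

With only a handful of folds, the bootstrap distribution is coarse, so the resulting p-value should be read as a rough guide rather than a precise figure.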
May 4th: Appendix. Submit a PDF document that discusses details that are not present in the poster. These could include:
1. Additional models / architectures / hyperparameter sets.
2. Intermediate results. Make sure to describe the results and what they mean.
3. High-level results. Make sure to describe the results and what they mean.

Practicalities

We need to be cautious about loading too much data onto our server. Please check with me if you intend to upload more than 1 GB.
Likewise, we may need to bring up more compute servers to handle the load. Please keep talking to me about this as the semester is wrapping up.

fagg@cs.ou.edu

Last modified: Tue Apr 24 23:22:26 2018