Advanced Machine Learning: Project Specification
Your final project is an opportunity for you to jump into a data set,
understand something about its structure and then apply your
preprocessing and model building skills. Hence, your project should
be interesting to you and yet be something that stretches your skills.
- Your data set should not be something that you already have
substantial experience with unless you are planning to ask some
fundamentally new questions.
- The data set must already exist and be accessible. We don't
have time this semester to collect data. Likewise, we don't
have a lot of time to massage an existing data set.
- The data set should be nominally novel. If a data set has
already been analyzed by many different people, then it is out
of bounds. A good heuristic is that if the data set is
discussed in a book, then too many people have touched it
already and you should pick something else.
The type of question(s) that you ask for the project will depend on
the domain and the data set(s) that are available to you.
Fundamentally, we need to crisply define what the modeling goal is.
Some possibilities include:
- What are the best model types and structures for predicting X /
- What are the most important features for predicting X? This
type of information can be used, for example, by domain experts
to focus their own model building.
- What is the right structure for generating examples of some
sensor data (e.g., a synthetic image)? (I am thinking Generative
Adversarial Networks here).
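As one illustration of the feature-importance question above, here is a minimal sketch of permutation importance: shuffle one feature column at a time and measure how much the model's error grows. The data are synthetic, and `predict` is a hypothetical stand-in for an already-trained model; neither comes from the course materials. In practice you would measure this on held-out data.

```python
import random
import statistics


def mean_squared_error(predict, rows, targets):
    """Average squared prediction error over a set of samples."""
    return statistics.fmean([(predict(r) - t) ** 2 for r, t in zip(rows, targets)])


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in error when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        shuffled_error = mean_squared_error(predict, shuffled_rows, targets)
        importances.append(shuffled_error - baseline)
    return importances


# Synthetic data (an assumption for illustration): the target depends
# strongly on feature 0, weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
targets = [3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in rows]


def predict(row):
    """Hypothetical stand-in for an already-trained model."""
    return 3.0 * row[0] + 0.5 * row[1]


importances = permutation_importance(predict, rows, targets, n_features=3)
for j, value in enumerate(importances):
    print(f"feature {j}: error increase {value:.3f}")
```

A larger error increase suggests that the model leans more heavily on that feature. Repeating the shuffle with several seeds gives a distribution of importances that can feed a statistical comparison.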
Model and Evaluation Specifics
- For the modeling side, you should plan on trying multiple
approaches, with proper parameter-selection techniques (i.e.,
cross-validation) and proper comparison between the modeling
approaches. If you are collaborating with someone, then you
each should be doing your own modeling.
- You must employ statistical hypothesis testing to make an
argument about the models. You could use this testing to
compare two or more models. You could also use a statistical
approach to argue that one feature is more important than some other feature.
In many cases, a paired-sample or two-sample t-test can be used
to make these arguments. However, if the underlying
distribution is fundamentally different from a Gaussian, then
approaches such as bootstrap sampling are appropriate.
More on this soon.
- Depending on the complexity of the problem, you may not be able
to complete full cross-validation runs in the time that you
have. Nonetheless, you should still be constructing a number
of models, from which we can start to get a sense of the
statistical behavior of your model.
- I expect some positive result. Start with simple approaches so
you can achieve something quickly and then move on to more complex ones.
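The cross-validation item in this list can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions: the data are synthetic, and the two models (a constant baseline and a least-squares line) are placeholders chosen for brevity, not the models expected for the project. The point is that both models are scored on the same folds, which yields paired per-fold test scores.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle indices once and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validate(fit, xs, ys, folds):
    """Return one test-set mean squared error per fold."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(statistics.fmean(errors))
    return scores


# Synthetic data (an assumption for illustration): a noisy line y = 2x + 1.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + data_rng.gauss(0, 1) for x in xs]

# The same folds are reused for both models so the scores are paired.
folds = k_fold_indices(len(xs), n_folds=5)
mse_mean = cross_validate(fit_mean, xs, ys, folds)
mse_line = cross_validate(fit_line, xs, ys, folds)
print("baseline MSE per fold:", [round(s, 2) for s in mse_mean])
print("line MSE per fold:    ", [round(s, 2) for s in mse_line])
```

Reusing identical folds for every model is what makes a paired test appropriate later on. If hyperparameters are also being selected, that selection should use a validation split carved out of each training portion, never the test fold.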
- April 10th: 3-minute, rapid-fire presentations. You may have 3 slides:
- What is your name? What is the project domain? What,
generally, are the data? Why is this interesting?
- What are the details of your data? What is the
structure of your samples? Are you dealing with
continuous and/or categorical data? How noisy is the data?
Present one data figure that illustrates something
interesting about the structure of the samples.
- What is the modeling problem to be solved? What are the
approaches that you will explore? What do the inputs /
outputs of the models look like? What do you imagine
the internals of the models will look like?
Slides are due by 10am on April 10th.
- April 26th: Discuss your preliminary modeling efforts and
results. As with the previous set of presentations, every
group will have 3 minutes to present. You may have up to 4
slides and must have the presentation submitted ahead of time.
- Who are you? What is the domain? What is the problem that you are
trying to solve?
- Provide specifics for at least one approach for each
group member. What is the model? What are the
inputs/outputs? What is the architecture?
- Show at least two levels of result for each group
member: A. One or more example input/output pairs;
B. High-level result. The latter could be learning
curves (ideally for multiple models or hyperparameters),
or AUC curves.
- What are the steps for the next week?
- May 3rd (1:30-4:30 pm): Poster presentations. Here, you will discuss
your full set of modeling results, including appropriate
Structuring a quality poster is a challenge. In general, your aim
should be for the poster to be somewhat "stand-alone" (a visitor
can understand the essence of your poster without you
presenting it). However, your poster should not be packed with
a tremendous amount of detail, or even full sentences. It is
not uncommon for some sections to be bulleted lists of
talking points expressed using key phrases.
Your poster should include these sections:
- Title, with names and affiliations
- Abstract: one (maybe two) paragraphs that summarize the
key points of your work.
- Introduction. Explain the domain and why it is interesting.
- Problem. Discuss the specifics of the problem. What
are the inputs? What are the outputs? What do the data
look like? Where do the data come from?
- Approach. Discuss the details of your modeling
approach, including the architectural decisions that you
have made. Why did you make these decisions?
- Results. Use a low-to-high level approach here.
Describe small, intermediate results (e.g., what does your
model do with a variety of inputs?).
Then present comparative results across multiple models and/or
hyperparameter sets. Which choice(s) are best? Make a
clear statistical argument here using test set results.
One appropriate approach is to compare test set
performance for N folds for multiple models. A
two-sample or paired t-test can often be used here.
However, if the performance measures are not approximately
normally distributed, then it is appropriate to use a
sampling-based approach (e.g., a bootstrap test of means).
- Discussion and conclusions. What are the high-level
messages? What did we generally learn from this work?
Did the results turn out as expected? Why/why not?
What are the next steps in the work?
- References. Use full references.
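The statistical argument asked for in the Results section can be sketched as below. This is a minimal, standard-library-only illustration: it computes a paired t statistic over per-fold test scores and, as a sampling-based alternative, a two-sided bootstrap test of the mean paired difference. The per-fold accuracies are made-up numbers for two hypothetical models, and the function names are illustrative rather than part of any course code.

```python
import math
import random
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for the paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1


def bootstrap_paired_p_value(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided bootstrap test of H0: the mean paired difference is zero."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = statistics.fmean(diffs)
    # Shift the differences so that the null hypothesis (mean 0) holds.
    centered = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resample = rng.choices(centered, k=len(centered))
        if abs(statistics.fmean(resample)) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Made-up per-fold test accuracies for two hypothetical models,
# evaluated on the same 8 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85]
model_b = [0.77, 0.78, 0.80, 0.76, 0.80, 0.79, 0.77, 0.80]

t_stat, dof = paired_t_statistic(model_a, model_b)
p_boot = bootstrap_paired_p_value(model_a, model_b)
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
print(f"bootstrap p-value = {p_boot:.4f}")
```

The t statistic still has to be converted to a p-value using a t distribution with the reported degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value directly. The bootstrap version makes no normality assumption, which is why it suits performance measures with skewed or bounded distributions.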
- May 4th: Appendix. Submit a PDF document that discusses details that
are not present in the posters. These could include:
- Additional models / architectures / hyperparameter sets.
- Intermediate results. Make sure to describe the results
and what they mean.
- High-level results. Make sure to describe the results
and what they mean.
- We need to be cautious about loading too much data into our
server. Please check with me if you intend to upload more than
- Likewise, we may need to bring up more compute servers to
handle the load. Please keep talking to me about this as the
semester is wrapping up.
Last modified: Tue Apr 24 23:22:26 2018