- Your data set should not be something that you already have
substantial experience with unless you are planning to ask some
fundamentally new questions.
- The data set must already exist and be accessible. We don't
have time this semester to collect data. Likewise, we don't
have a lot of time to massage an existing data set.
- The data set should be nominally novel. If a data set has
already been analyzed by many different people, then it is out
of bounds. A good heuristic is that if the data set is
discussed in a book, then too many people have touched it
already and you should pick something else.

- What are the best model types and structures for predicting X /
labeling X?
- What are the most important features for predicting X? This
type of information can be used, for example, by domain experts
to focus their own model building.
- What is the right structure for generating examples of some
sensor data (e.g., a synthetic image)? (I am thinking of Generative
Adversarial Networks here.)

- For the modeling side, you should plan on trying multiple
approaches, with proper parameter-selection techniques (i.e.,
cross-validation) and proper comparison between the modeling
approaches. If you are collaborating with someone, then you
each should be doing your own modeling.
- You must employ statistical hypothesis testing to make an
argument about the models. You could use this testing to
compare two or more models. You could also use a statistical
approach to argue that one feature is more important than some
other feature.
In many cases, a paired-sample or two-sample t-test can be used to
make these arguments. However, if the underlying distribution is
fundamentally different from a Gaussian, then approaches such as
**bootstrap sampling** are appropriate. More on this soon.
- Depending on the complexity of the problem, you may not be able
to complete full cross-validation runs in the time that you
have. Nonetheless, you should still be constructing a number
of models, from which we can start to get a sense of the
statistical behavior of your model.
- I expect some positive result. Start with simple approaches so
you can achieve something quickly and then move on to more
complicated approaches.
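
To make the parameter-selection point above concrete, here is a minimal sketch of k-fold cross-validation used to choose a hyperparameter. Everything in it is an assumption made for illustration: the synthetic noisy sine data, the toy 1-D nearest-neighbor regressor, the candidate values, and the function names. Your own models and data will look different. In a real project you would also keep a separate test set that plays no part in the selection.

```python
import math
import random
import statistics


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return statistics.fmean(train_y[i] for i in order[:k])


def cross_validate(xs, ys, k, n_folds=5, seed=0):
    """Return one validation mean-squared error per fold for a given k."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), n_folds, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        errors = [
            (knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2
            for i in held_out
        ]
        fold_errors.append(statistics.fmean(errors))
    return fold_errors


def select_k(xs, ys, candidates, n_folds=5, seed=0):
    """Pick the candidate k with the lowest mean validation error."""
    scores = {k: cross_validate(xs, ys, k, n_folds, seed) for k in candidates}
    best = min(candidates, key=lambda k: statistics.fmean(scores[k]))
    return best, scores


# Synthetic data (illustrative only): a noisy sine wave.
rng = random.Random(42)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(100)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]

candidates = [1, 3, 5, 9, 15]
best_k, scores = select_k(xs, ys, candidates)
for k in candidates:
    print(f"k={k:2d}  mean validation MSE={statistics.fmean(scores[k]):.4f}")
print("selected k:", best_k)
```

The per-fold errors in `scores` are the kind of repeated measurements that the hypothesis tests mentioned above operate on. The same seed is reused for every candidate so that all candidates are evaluated on identical folds, which is what makes a paired comparison possible.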

- April 10th: 3-minute, rapid-fire presentations. You may have 3
slides:
- What is your name? What is the project domain? What,
generally, are the data? Why is this interesting?
- What are the details of your data? What is the
structure of your samples? Are you dealing with
continuous and/or categorical data? How noisy is the
data?
Present one data figure that illustrates something interesting about the structure of the samples.

- What is the modeling problem to be solved? What are the
approaches that you will explore? What do the inputs /
outputs of the models look like? What do you imagine
the internals of the models will look like?

Slides are due by 10am on April 10th.

- April 26th: Discuss your preliminary modeling efforts and
results. As with the previous set of presentations, every
group will have 3 minutes to present. You may have up to 4
slides and must have the presentation submitted ahead of time.
- Who are you? What is the domain? What is the problem that you are
trying to solve?
- Provide specifics for at least one approach for each
group member. What is the model? What are the
inputs/outputs? What is the architecture?
- Show at least two levels of result for each group
member: A. One or more example input/output pairs;
B. High-level result. The latter could be learning
curves (ideally for multiple models or hyperparameters),
or ROC curves (with AUC).
- What are the steps for the next week?

- May 3rd (1:30-4:30 pm): Poster presentations. Here, you will discuss
your full set of modeling results, including appropriate
statistics.
Structuring a quality poster is a challenge. In general, your aim should be for the poster to somewhat "stand alone": a visitor should be able to understand the essence of your poster without you presenting it. However, your poster should not be packed with a tremendous amount of detail, or even full sentences. It is not uncommon for some sections to be bulleted lists of talking points expressed using key phrases.

Your poster should include these sections:

- Title, with names and affiliations
- Abstract: one (maybe two) paragraphs that summarize the
key points of your work.
- Introduction. Explain the domain and why it is interesting.
- Problem. Discuss the specifics of the problem. What
are the inputs? What are the outputs? What do the data
look like? Where do the data come from?
- Approach. Discuss the details of your modeling
approach, including the architectural decisions that you
have made. Why did you make these decisions?
- Results. Use a low-to-high level approach here.
Describe small, intermediate results (e.g., what does your
model do with a variety of inputs?).
Describe comparative results across multiple models and/or hyperparameter sets. Which choice(s) are best? Make a clear statistical argument here using test set results. One appropriate approach is to compare test set performance across N folds for multiple models. A two-sample or paired t-test can often be used here. However, if the performance measures are not approximately normally distributed, then it is appropriate to use a sampling-based approach (e.g., a bootstrap test of means).

- Discussion and conclusions. What are the high-level
messages? What did we generally learn from this work?
Did the results turn out as expected? Why/why not?
What are the next steps in the work?
- References. Use full references.

- May 4th: Appendix. Submit a PDF document that discusses details that
are not present in the posters. These could include:
- Additional models / architectures / hyperparameter sets.
- Intermediate results. Make sure to describe the results
and what they mean.
- High-level results. Make sure to describe the results
and what they mean.
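
For the statistical argument requested in the poster's Results section, one possible form of a "bootstrap test of means" is sketched below for paired per-fold test scores from two models. This is a sketch under stated assumptions, not the required method: the function name, the number of resamples, and the per-fold accuracies are all hypothetical placeholders, and the test shown (resampling the mean-centered paired differences) is only one of several reasonable bootstrap variants.

```python
import random
import statistics


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided bootstrap test of the null hypothesis that the mean
    paired difference between two models' fold scores is zero.

    Returns (observed mean difference, estimated p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired fold-by-fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = statistics.fmean(diffs)
    # Center the differences so the null hypothesis (mean zero) holds
    # exactly in the population we resample from.
    centered = [d - observed for d in diffs]
    rng = random.Random(seed)
    n = len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        resample = rng.choices(centered, k=n)  # sample with replacement
        if abs(statistics.fmean(resample)) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimate from being exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical per-fold test accuracies (placeholder numbers, not results).
model_a = [0.80, 0.82, 0.81, 0.83, 0.79, 0.84, 0.80, 0.82, 0.81, 0.83]
model_b = [0.75, 0.76, 0.77, 0.78, 0.73, 0.79, 0.76, 0.76, 0.76, 0.78]

mean_diff, p_value = paired_bootstrap_p_value(model_a, model_b)
print(f"mean difference = {mean_diff:.4f}, p = {p_value:.4f}")
```

With only a handful of folds, any such test is coarse, so report the number of folds alongside the p-value. Pairing fold-by-fold only makes sense when both models were evaluated on the same folds.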

- We need to be cautious about loading too much data onto our
server. Please check with me if you intend to upload more than
1 GB.
- Likewise, we may need to bring up more compute servers to
handle the load. Please keep talking to me about this as the
semester is wrapping up.

Last modified: Tue Apr 24 23:22:26 2018