[Main course page]
EEL709 Projects, II Semester 2012–13
Timeline
- May 3rd, 23:59: Deadline for report submission. Please note
that this needs to be a hard deadline, due to the end-semester rush
and the limited time available for evaluation after it. To enforce
this, there will be a 10%-per-day penalty for
any late submission; so if you submit sometime on May 5th, you will lose
20% of your total project marks.
Your report should be in the format of a research article.
A fairly standard organisation for
such articles is Abstract, Introduction (including literature survey),
Methodology, Results, Discussion/Conclusions, Acknowledgements, References.
There is no length limit per se; a reasonable length may be about
3000–4000 words, but this can vary a bit from project to project.
Remember to clearly acknowledge and/or cite all sources of data, code, and
other materials; any failure to do so is tantamount to plagiarism. It is
highly preferable to prepare your final report as a PDF file.
Along with the report, you should submit all code that you have written,
and anything else you feel may be useful in evaluating the project.
Add everything to a single zip file, and name it GXX.zip, where XX is your
group number in 2-digit format; e.g., for group 1, the file should be
called G01.zip. E-mail this file to the
instructor, using the subject line '[EEL709] Project Report'.
- April 15th – May 1st: Project presentations. Each group will
be given 10 minutes to present the key aspects of their work. Here are some
tips for these:
- Think carefully about which essential concepts and highlights you can cover in 10 minutes. The time limit
will be strictly enforced, so it is important to time your presentation well and not miss out on the key results.
- If you are using slides, you should have no more than 10 slides. Allow at least one minute per slide on
average; it would be even better if you could restrict yourself to 5–6 slides. Don't put too much
material on the slides; they are meant to serve as an aid to your presentation, not to contain everything you have to say.
- Don't just present your results as numbers, as in "we tried such-and-such methods and obtained so-and-so accuracies". The key is
to interpret and understand what is going on. Why have you got the performance that you have? What are the roles of the different features
that you have used? Can you intuitively interpret why they are or are not informative? What are the effects of your choices of
learning algorithm and the parameters therein? What kinds of test instances are you performing badly on, and why?
- As the adage goes, a picture is worth a thousand words. Clear visualisations can be very helpful in understanding what's going on.
For instance, can you visualise the information captured by a given feature, say by plotting a histogram, or by just showing what aspects of the
data go into computing that feature (e.g., the images of principal components we've seen for handwritten character data sets)? Some thought
and effort invested into visualisation can be very rewarding.
- For anyone who has the time and inclination, Uri Alon's excellent video lecture
"How to give a good talk" is highly recommended. These simple skills will potentially be of great value right through your career.
- Week starting April 1st: Review meetings.
- Week starting February 18th: Meetings with the instructor.
Please come to this meeting with a short
description of your proposed project. You should think about the following questions and try to have answers to them ready when we meet:
- What is the problem (or problems) that your project seeks to examine? Why would these be useful for people?
- What data will you use? Is there a public data set for your chosen task? Can you get data from someone else? Will you try
collecting your own data: if so, can you afford the time and resources that might require?
- What methodology will you use? How much coding or implementation work will you need to do?
- What is the rough sequence of steps you will follow for your project? Can you break it up into a set of subtasks, such that you can
set a rough timeline for each one?
- What are your lower and upper bounds of expectation from the project? What is the minimum objective that you are confident you can
achieve within the duration of this semester and present at the end? What additional things would you like to attempt if you turn out
to have further time left?
Of course the answers will not be set in stone and will change somewhat as the project evolves. But thinking about these now
should at least provide you with a good starting point, which is extremely useful; and which is why 20% of the project
marks are allocated to this part!
- By Sunday February 10th: Send group member names to the instructor. Also mention which afternoons during the week (Monday–Friday) your group is free to meet with the instructor.
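As a rough illustration of the late-penalty arithmetic and the file-naming convention described in the report-submission item above, here is a minimal Java sketch. The class and method names are hypothetical, the year 2013 is inferred from the semester, and counting lateness in whole calendar days with a cap at 100% is an assumption rather than a stated rule.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper; names are illustrative, not part of any course tooling.
class SubmissionRules {
    // Deadline date from the timeline (May 3rd); the year 2013 is an assumption.
    static final LocalDate DEADLINE = LocalDate.of(2013, 5, 3);

    /** Percentage of total project marks lost: 10% per calendar day late (assumed cap: 100%). */
    static int penaltyPercent(LocalDate submittedOn) {
        long daysLate = ChronoUnit.DAYS.between(DEADLINE, submittedOn);
        if (daysLate <= 0) return 0;
        return (int) Math.min(100, 10 * daysLate);
    }

    /** Zip file name GXX.zip, with the group number padded to two digits. */
    static String zipName(int groupNumber) {
        return String.format("G%02d.zip", groupNumber);
    }

    public static void main(String[] args) {
        System.out.println(penaltyPercent(LocalDate.of(2013, 5, 5)) + "% penalty"); // 20% penalty
        System.out.println(zipName(1)); // G01.zip
    }
}
```

The two printed values match the examples given in the text: a submission on May 5th loses 20%, and group 1 submits G01.zip.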
General guidelines
- Please form groups of 3. Larger groups will not be acceptable unless there
is an exceptional reason. If you wish, you may try using the
Piazza forum for
finding project partners, or discussing any other aspect of the projects.
- For implementation, the suggested programming languages are MATLAB or Java.
It is recommended that you take a look at either Spider (for MATLAB) or Weka (for Java), which are highly popular general-purpose
machine learning libraries. You may use any resources you can find, but clear
acknowledgement is a must; there must be no doubt about what is your own work and
what is not. Also, any code you develop must be readable, which means
proper structuring and commenting.
- You will be required to submit a report (in the format of a research
article) describing what you've done and what it's resulted in. Any code
developed will also need to be submitted.
- The evaluation criteria will roughly be as follows: initial
project idea (20%); correctness and functionality of outcomes (20%);
novelty/complexity (20%); understanding demonstrated (20%); presentation of results (20%).
- Each group will be required to discuss their project idea with the
instructor for 5–10 minutes before getting truly started;
time slots for this will be announced. Ideally you should
come to this meeting with a short (1–2 page) write-up outlining your
proposed project.
- Plagiarism of any sort will result in an F grade, in addition to a
report to the Dean of Students for further action. You are free to discuss
with classmates or others, and to use any resources you find online or elsewhere,
as long as it is all clearly cited and acknowledged, and your final report
is your own work. We have substantial resources to detect plagiarism, and
hopefully the loss matrix is sufficiently skewed so as to make the
expected payoff from any such activity highly negative.
Suggested ideas
Machine learning is a vast and varied field; and there is no shortage
of things to play with. You are highly encouraged to explore a bit on
your own and come up with something that excites you. Don't worry
too much about whether it seems feasible or not; we can assess that in the
instructor meeting, and it is usually possible to define even an ambitious
project as a set of reasonably achievable subgoals.
The ideas that follow are necessarily very selective,
and represent the instructor's
biases and interests to a fair extent. If you wish you may use them as
launching points for your own explorations, but the task of making
your project idea more specific and suited to your inclinations is your own!
- Comparative approach on a standard data set. You could
pick a data set from the UCI
repository, and try running different classification/regression
algorithms on it to compare the results (the comparison need not be only
in terms of accuracy, but could also cover time complexity and other
relevant issues). You should be able to motivate your choice of
algorithms, and interpret the results from each one. If you are able to
devise and/or implement a novel approach for your chosen data set, or
even make some suggestions in this direction, this will earn you extra credit.
[For those interested in this, also look at
MLcomp,
a large-scale effort at comparing machine learning methods. It would
be great if you could make use of this in your project somehow!]
- Learning on network data. A network (or graph) is a way
of representing a system composed of many interacting components. There
are now huge amounts of network data available, particularly for social and
biological systems, and a lot of interest in learning relationships
between properties of nodes in these networks and some relevant
functional characteristic. You could obtain one of the network data sets at
SNAP, and experiment
with solving a machine learning problem defined on it. A couple
of specific possibilities:
- Social networks. Get one of the social network data sets, such as
Facebook.
This contains a network of friendship relations between nodes (people), as
well as some information from the profiles of each person (e.g., location,
where they studied, political affiliation). You could try
using both profile and network features to predict whether a given person is
friends with another person or not [look here for more ideas on this].
You could also try clustering people
into meaningful clusters or communities based on graph structure [see here for more]:
these may for instance
turn out to correspond to people who were batchmates, or office colleagues,
or a neighbourhood community. In some cases there may be pre-defined
communities in the data; see these data sets.
- Biological networks. There is lots of experimental
data now on interactions between genes and proteins in cells; you could get
some from BioGRID. Again, you could
look at communities and what they can tell you about function; you could look
at trying to classify genes/proteins (or their interactions) into relevant
categories [e.g., see here], or even at predicting missing or
false interactions [e.g., see
here].
- Temporal data. One application of machine learning is to
find causal relationships between real-world time-varying signals.
These signals could be of many sorts. They could for instance be trends
on Twitter or Google. Google searches have been famously used for
flu prediction; you could
download their data and try your own methods on it. Tweets might be
used in similar ways [see here]. Another entirely
different sort of signal is the variation in the concentration level of
a gene or protein inside a cell. Such signals might be used to try and
infer how genes affect each other: if one gene goes up in quantity, does it
cause another gene to come down? (These relationships represented on a
large scale constitute gene regulatory networks.) GEO is a big online repository of gene expression data; you could download some and experiment with ways of learning
models for interactions [see here for some ideas].
- Image processing. Grouping images based on objects/features
present in them is a popular machine learning application. You could obtain
any sort of image data set, say from
Google Images, define meaningful classes and attempt to learn
useful classifiers for them. One starting point could be the methods
and data presented
here. Another
idea could be to look at image super-resolution, i.e., combining information
from many low-resolution images to create a higher-resolution image. One could
adopt a machine learning approach to this, for instance as outlined
here.
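For the comparative idea in the list above, here is a minimal sketch of how two classifiers might be compared by k-fold cross-validation, written in plain Java with no Spider or Weka calls. Everything in it is hypothetical and for illustration only: the class, the `Classifier` interface, the two toy classifiers (1-nearest-neighbour and nearest centroid) and the nine-point data set are assumptions, not course material.

```java
// Hypothetical sketch: compares two toy classifiers by k-fold cross-validation.
class ClassifierComparison {
    /** Predicts a 0/1 label for x from the given training set. */
    interface Classifier {
        int predict(double[][] trainX, int[] trainY, double[] x);
    }

    /** Squared Euclidean distance between two feature vectors. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    /** 1-nearest-neighbour: returns the label of the closest training point. */
    static final Classifier ONE_NN = (trainX, trainY, x) -> {
        int best = 0;
        for (int i = 1; i < trainX.length; i++)
            if (sqDist(trainX[i], x) < sqDist(trainX[best], x)) best = i;
        return trainY[best];
    };

    /** Nearest centroid for labels 0/1: returns the class whose mean is closer to x. */
    static final Classifier NEAREST_CENTROID = (trainX, trainY, x) -> {
        double[][] mean = new double[2][x.length];
        int[] count = new int[2];
        for (int i = 0; i < trainX.length; i++) {
            count[trainY[i]]++;
            for (int j = 0; j < x.length; j++) mean[trainY[i]][j] += trainX[i][j];
        }
        for (int c = 0; c < 2; c++)
            for (int j = 0; j < x.length; j++) mean[c][j] /= Math.max(1, count[c]);
        return sqDist(mean[0], x) <= sqDist(mean[1], x) ? 0 : 1;
    };

    /** k-fold cross-validation accuracy; fold f holds out every example i with i % k == f. */
    static double crossValAccuracy(Classifier clf, double[][] X, int[] y, int k) {
        int correct = 0;
        for (int f = 0; f < k; f++) {
            int nTrain = 0;
            for (int i = 0; i < X.length; i++) if (i % k != f) nTrain++;
            double[][] trainX = new double[nTrain][];
            int[] trainY = new int[nTrain];
            for (int i = 0, t = 0; i < X.length; i++)
                if (i % k != f) { trainX[t] = X[i]; trainY[t] = y[i]; t++; }
            for (int i = 0; i < X.length; i++)
                if (i % k == f && clf.predict(trainX, trainY, X[i]) == y[i]) correct++;
        }
        return (double) correct / X.length;
    }

    // Made-up 2-D data: two clusters, plus one class-1 point placed inside the class-0 cluster.
    static final double[][] DEMO_X = {
        {0, 0}, {4, 4}, {1, 0}, {5, 4}, {0, 1}, {4, 5}, {1, 1}, {5, 5}, {1.2, 1.2}
    };
    static final int[] DEMO_Y = {0, 1, 0, 1, 0, 1, 0, 1, 1};

    public static void main(String[] args) {
        Classifier[] clfs = {ONE_NN, NEAREST_CENTROID};
        String[] names = {"1-NN", "nearest centroid"};
        for (int c = 0; c < clfs.length; c++) {
            long start = System.nanoTime();
            double acc = crossValAccuracy(clfs[c], DEMO_X, DEMO_Y, 3);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("%s: accuracy %.3f, %d microseconds%n", names[c], acc, micros);
        }
    }
}
```

On this toy data with 3 folds, 1-NN classifies 7 of the 9 points correctly and nearest centroid 8 of 9. The single noisy point misleads 1-NN twice (once when it is held out, once when it is the nearest neighbour of a held-out point) but barely moves the class means. This is the kind of interpretation the comparison should aim for, rather than the raw numbers alone. With real data you would shuffle the examples before assigning interleaved folds, and the crude timing shown here would need many repetitions to mean anything.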
For even more project ideas, you could look at Andrew Ng's Stanford course, and plenty of
other places. The possibilities are limitless; happy exploring!