EEL709 Projects

[Main course page]

EEL709 Projects, II Semester 2012–13

Project groups

Timeline

May 3rd, 23:59: Deadline for report submission. Please note that this needs to be a hard deadline, due to the end-semester rush and the limited time available for evaluation after it. To enforce this, there will be a penalty of 10% per day for any late submissions; so if you submit sometime on May 5th, you will lose 20% of your total project marks.
Your report should be in the format of a research article. A fairly standard organisation for such articles is Abstract, Introduction (including literature survey), Methodology, Results, Discussion/Conclusions, Acknowledgements, References. There is no length limit per se; a reasonable length may be about 3000–4000 words, but this can vary a bit from project to project. Remember to clearly acknowledge and/or cite all sources of data, code, and other materials; any failure to do so is tantamount to plagiarism. It is highly preferable to prepare your final report as a PDF file.
Along with the report, you should submit all code that you have written, and anything else you feel may be useful in evaluating the project. Add everything to a single zip file, and name it GXX.zip, where XX is your group number in 2-digit format; e.g., for group 1, the file should be called G01.zip. E-mail this file to the instructor, using the subject line '[EEL709] Project Report'.
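As a quick check on the file naming and the late-penalty arithmetic above, here is a minimal sketch in Python (chosen only for brevity). The function names are hypothetical, the year 2013 is inferred from the semester, and counting lateness by calendar day is an assumption chosen to match the May 5th example. The cap at 100% is likewise an assumption.

```python
from datetime import datetime

# Assumed deadline: the page says "May 3rd, 23:59" in II Semester 2012-13,
# so the year 2013 is inferred rather than stated.
DEADLINE = datetime(2013, 5, 3, 23, 59)
PENALTY_PER_DAY = 10  # percent of total project marks per day late


def archive_name(group_number: int) -> str:
    """Return the zip file name for a group, e.g. 1 -> 'G01.zip'."""
    return f"G{group_number:02d}.zip"


def late_penalty_percent(submitted: datetime) -> int:
    """Penalty in percent for a submission time.

    Assumption: lateness is counted in calendar days after the deadline
    date, so any time on May 4th is 1 day late and May 5th is 2 days late.
    The penalty is capped at 100% (also an assumption).
    """
    days_late = max(0, (submitted.date() - DEADLINE.date()).days)
    return min(100, days_late * PENALTY_PER_DAY)


if __name__ == "__main__":
    print(archive_name(1))
    print(late_penalty_percent(datetime(2013, 5, 5, 9, 0)))
```

For group 1 this gives the file name G01.zip, and a submission on the morning of May 5th gives a 20% penalty, in agreement with the example above.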

April 15th – May 1st: Project presentations. Each group will be given 10 minutes to present the key aspects of their work. Here are some tips for these:
1. Think carefully about which essential concepts and highlights you can cover in 10 minutes. The time limit will be strictly enforced, so it is important to time your presentation well and not miss out on the key results.
2. If you are using slides, you should have no more than 10 slides. One minute per slide is the least you should allow; it would be even better if you could restrict yourself to 5–6 slides. Don't put too much material on the slides; they are meant to serve as an aid to your presentation, not to contain everything you have to say.
3. Don't just present your results as numbers, as in "we tried such-and-such methods and obtained so-and-so accuracies". The key is to interpret and understand what is going on. Why have you got the performance that you have? What are the roles of the different features that you have used? Can you intuitively explain why they are or are not informative? What are the effects of your choices of learning algorithm and the parameters therein? What kinds of test instances are you performing badly on, and why?
4. As the adage goes, a picture is worth a thousand words. Clear visualisations can be very helpful in understanding what's going on. For instance, can you visualise the information captured by a given feature, say by plotting a histogram, or by just showing what aspects of the data go into computing that feature (e.g., the images of principal components we've seen for handwritten character data sets)? Some thought and effort invested into visualisation can be very rewarding.
5. For anyone who has the time and inclination, Uri Alon's excellent video lecture on How to give a good talk? is highly recommended. These simple skills will potentially be of great value right through your career.
Week starting April 1st: Review meetings.

Week starting February 18th: Meetings with the instructor. Please come to this meeting with a short description of your proposed project. You should think about the following questions and try to have answers to them ready when we meet:
1. What is the problem (or problems) that your project seeks to examine? Why would solving it be useful to people?
2. What data will you use? Is there a public data set for your chosen task? Can you get data from someone else? Will you try collecting your own data: if so, can you afford the time and resources that might require?
3. What methodology will you use? How much coding or implementation work will you need to do?
4. What is the rough sequence of steps you will follow for your project? Can you break it up into a set of subtasks, such that you can set a rough timeline for each one?
5. What are your lower and upper bounds of expectation from the project? What is the minimum objective that you are confident you can achieve within the duration of this semester and present at the end? What additional things would you like to attempt if you turn out to have further time left?
Of course the answers will not be set in stone and will change somewhat as the project evolves. But thinking about these now should at least provide you with a good starting point, which is extremely useful; and which is why 20% of the project marks are allocated to this part!

By Sunday February 10th: Send group member names to the instructor. Also mention which afternoons during the week (Monday–Friday) your group is free to meet with the instructor.

General guidelines

Please form groups of 3. Larger groups will not be acceptable unless there is an exceptional reason. If you wish, you may try using the Piazza forum for finding project partners, or discussing any other aspect of the projects.
For implementation, the suggested programming languages are MATLAB or Java. It is recommended that you take a look at either Spider (for MATLAB) or Weka (for Java), which are highly popular general-purpose machine learning libraries. You may use any resources you can find, but clear acknowledgement is a must; there must be no doubt about what is your own work and what is not. Also, any code you develop must be readable, which means proper structuring and commenting.
You will be required to submit a report (in the format of a research article) describing what you've done and what it's resulted in. Any code developed will also need to be submitted.
The evaluation criteria will roughly be as follows: initial project idea (20%); correctness and functionality of outcomes (20%); novelty/complexity (20%); understanding demonstrated (20%); presentation of results (20%).
Each group will be required to discuss their project idea with the instructor for 5–10 minutes before getting truly started; time slots for this will be announced. Ideally you should come to this meeting with a short (1–2 page) write-up outlining your proposed project.
Plagiarism of any sort will result in an F grade, in addition to a report to the Dean of Students for further action. You are free to discuss with classmates or others, and to use any resources you find online or elsewhere, as long as it is all clearly cited and acknowledged, and your final report is your own work. We have substantial resources to detect plagiarism, and hopefully the loss matrix is sufficiently skewed so as to make the expected payoff from any such activity highly negative.

Suggested ideas

Machine learning is a vast and varied field, and there is no shortage of things to play with. You are highly encouraged to explore a bit on your own and come up with something that excites you. Don't worry too much about whether it seems feasible or not; we can assess that in the instructor meeting, and it is usually possible to define even an ambitious project as a set of reasonably achievable subgoals. The ideas that follow are necessarily very selective, and represent the instructor's biases and interests to a fair extent. If you wish, you may use them as launching points for your own explorations, but the task of making your project idea more specific and suited to your inclinations is your own!

Comparative approach on a standard data set. You could pick a data set from the UCI repository, and try running different classification/regression algorithms on it to compare the results (the comparison need not be only in terms of accuracy, but could also look at time complexity and other relevant issues). You should be able to motivate your choice of algorithms, and interpret the results from each one. If you are able to devise and/or implement a novel approach for your chosen data set, or even make some suggestions in this direction, this will earn you extra credit. [For those interested in this, also look at MLcomp, a large-scale effort at comparing machine learning methods. It would be great if you could make use of this in your project somehow!]
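To make the comparative idea concrete, here is a minimal sketch in Python using only the standard library. It compares two simple classifiers (nearest centroid and 1-nearest-neighbour) by k-fold cross-validation, reporting both accuracy and running time. The data is a hypothetical synthetic two-class set standing in for a real UCI data set, and all function names are illustrative; in a real project you would more likely rely on Weka or Spider.

```python
import random
import time
from statistics import mean


def make_toy_data(n_per_class=50, seed=0):
    """Two Gaussian blobs in 2-D; a synthetic stand-in for a real data set."""
    rng = random.Random(seed)
    data = []
    for label, centre in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            x = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((x, label))
    rng.shuffle(data)
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_centroid(train, x):
    """Predict the class whose mean training point is closest to x.

    Centroids are recomputed on every call, which is wasteful but keeps
    the sketch short.
    """
    centroids = {}
    for label in {y for _, y in train}:
        pts = [p for p, y in train if y == label]
        centroids[label] = tuple(mean(c) for c in zip(*pts))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], x))


def one_nearest_neighbour(train, x):
    """Predict the label of the single closest training point."""
    return min(train, key=lambda item: sq_dist(item[0], x))[1]


def cross_validate(predict, data, k=5):
    """Return (mean accuracy, elapsed seconds) over k folds."""
    start = time.perf_counter()
    accuracies = []
    for fold in range(k):
        test = data[fold::k]  # every k-th item, starting at index `fold`
        train = [item for i, item in enumerate(data) if i % k != fold]
        correct = sum(predict(train, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return mean(accuracies), time.perf_counter() - start


if __name__ == "__main__":
    data = make_toy_data()
    for name, clf in [("nearest centroid", nearest_centroid),
                      ("1-NN", one_nearest_neighbour)]:
        acc, secs = cross_validate(clf, data)
        print(f"{name}: accuracy={acc:.2f}, time={secs:.4f}s")
```

Evaluating every method on the same folds keeps the comparison fair. The interesting part of the project is then explaining why the numbers differ.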
Learning on network data. A network (or graph) is a way of representing a system composed of many interacting components. There are now huge amounts of network data available, particularly for social and biological systems, and a lot of interest in learning relationships between properties of nodes in these networks and some relevant functional characteristic. You could obtain one of the network data sets at SNAP, and experiment with solving a machine learning problem defined on it. A couple of specific possibilities:
1. Social networks. Get one of the social network data sets, such as Facebook. This contains a network of friendship relations between nodes (people), as well as some information from the profiles of each person (e.g., location, where they studied, political affiliation etc.). You could try using both profile and network features to predict whether a given person is friends with another person or not [look here for more ideas on this]. You could try clustering people into meaningful clusters or communities based on graph structure [see here for more]: these may for instance turn out to correspond to people who were batchmates, or office colleagues, or a neighbourhood community. In some cases there may be pre-defined communities in the data; see these data sets.
2. Biological networks. There is lots of experimental data now on interactions between genes and proteins in cells; you could get some from BioGRID. Again, you could look at communities and what they can tell you about function; you could look at trying to classify genes/proteins (or their interactions) into relevant categories [e.g., see here], or even at predicting missing or false interactions [e.g., see here].
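As one possible baseline for the friendship-prediction and missing-interaction tasks above, here is a minimal Python sketch that ranks unconnected node pairs by the Jaccard similarity of their neighbourhoods. The toy edge list is hypothetical, not taken from SNAP or BioGRID, and Jaccard similarity is just one simple scoring choice among many; it uses graph structure only and ignores profile features.

```python
from itertools import combinations

# Hypothetical toy graph (undirected edges); not real data.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, u, v):
    """Shared neighbours divided by all neighbours of u or v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def rank_missing_links(adj):
    """Return unconnected node pairs, most similar first."""
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    return sorted(candidates, key=lambda p: jaccard(adj, *p), reverse=True)


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    for u, v in rank_missing_links(adj):
        print(u, v, round(jaccard(adj, u, v), 3))
```

On this toy graph the top-ranked pair is (a, d), since both a's neighbours are also neighbours of d. To evaluate such a scorer on real data, one common approach is to hide a fraction of known edges and check how highly they are ranked.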
Temporal data. One application of machine learning is to find causal relationships between real-world time-varying signals. These signals could be of many sorts. They could for instance be trends on Twitter or Google. Google searches have been famously used for flu prediction; you could download their data and try your own methods on it. Tweets might be used in similar ways [see here]. Another entirely different sort of signal is the variation in the concentration level of a gene or protein inside a cell. Such signals might be used to try and infer how genes affect each other: if one gene goes up in quantity, does it cause another gene to come down? (These relationships represented on a large scale constitute gene regulatory networks.) GEO is a big online repository of gene expression data; you could download some and experiment with ways of learning models for interactions [see here for some ideas].
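A simple first step with such signals is to ask at which time lag one series best matches another. The Python sketch below computes a lagged Pearson correlation on two hypothetical toy series, where the second is a copy of the first delayed by two steps. All names are illustrative. Note that a high lagged correlation only suggests a lead-lag relationship; it does not establish causation.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (vx * vy) ** 0.5


def lagged_correlation(a, b, lag):
    """Correlation between a[t] and b[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(a, b)
    return pearson(a[:-lag], b[lag:])


def best_lag(a, b, max_lag):
    """Lag in 0..max_lag at which b follows a most closely."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(a, b, k))


# Hypothetical toy signals: b is a delayed by two time steps.
A = [0, 1, 0, 2, 5, 2, 0, 1, 0, 3, 1, 0]
B = [0, 0] + A[:-2]

if __name__ == "__main__":
    print(best_lag(A, B, max_lag=4))
```

For repressive relationships, where one signal rising coincides with another falling, you would look for strongly negative correlations instead, for example by maximising the absolute value.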
Image processing. Grouping images based on objects/features present in them is a popular machine learning application. You could obtain any sort of image data set, say from Google Images, define meaningful classes and attempt to learn useful classifiers for them. One starting point could be the methods and data presented here. Another idea could be to look at image super-resolution, i.e., combining information from many low-resolution images to create a higher-resolution image. One could adopt a machine learning approach to this, for instance as outlined here.

For even more project ideas, you could look at Andrew Ng's Stanford course. And plenty of other places. The possibilities are limitless; happy exploring!
