Spring 2025

Possible Topics for Your Project

The objective of a class project is to help you gain experience with research and to relate what you learn to real-life problems, which may require you to learn new techniques (or develop new methods yourself). You are expected to present the project findings during the class and submit a summary report at the end of the semester. Below are the two types of possible projects (you only need to choose one of them).

1. Solving a real-life data mining problem. A typical report includes problem formulation, data analysis, proposed solutions, and interpretation of results. The data set can be from your own research or from the public domain; see the information below. As an example, you can choose to participate in a data mining competition such as the Knowledge Discovery and Data Mining (KDD) Cup; see the links below for past KDD Cups, including KDD Cup 2017. Another example is the "2017 Data Challenge" sponsored by the Government Statistics Section of the American Statistical Association (ASA), which analyzes the Consumer Expenditure Survey (CE) data on the Bureau of Labor Statistics website; see the corresponding announcement and dataset pages.

2. Numerical study of data mining methods using well-known data sets in the literature. Note that when dealing with well-known data sets, your approach needs to be substantially different from the literature, i.e., you should do more than repeat the analysis there. Some examples are
• Compare the performance of competing data mining techniques;
• Ask different questions or investigate new ideas for data mining methods;
• Identify optimal parameters of specific data mining techniques.
Note that the crucial aspect of your project is to analyze some data sets and justify your conclusions, not to use the specific statistical models or methods we discussed in class.

Datasets: You can collect the data yourself, or use a data set from your own research or the public domain. One way to find online datasets is to use a search engine such as Google. The following are some examples of online datasets (you can use Google or other search engines to find more):
1. http://kdd.ics.uci.edu/ or http://archive.ics.uci.edu/ml/ One example is the KDD Cup 1999 data at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html More KDD Cup data can be found at http://www.kdd.org/kdd-cup
2. http://www.quandl.com/ (financial and economic time-series datasets)
3. Data sets from government websites.
4. http://lib.stat.cmu.edu/DASL/
5. http://www.kdnuggets.com/datasets/index.html (links to more data repositories)
6. http://www.dmoz.org/Computers/Artificial Intelligence/Machine Learning/Datasets/

To inspire your projects, some concrete examples are as follows:
• analyze data sets from competitions, see http://www.kaggle.com/competitions
• find the traffic or crash pattern near Georgia Tech or your apartment/home using publicly available traffic data
• predict allergy season using Atlanta pollen count data
• derive the relationship between sleep and selected health risk behaviors

To further motivate your projects and encourage you to write a solid project report, imagine that you want to publish your project report as a paper. There are two possible kinds of data mining or statistical learning papers (you only need to choose one).
• Application Papers: apply standard methods to analyze some datasets, thereby answering important questions in real-world applications such as bioinformatics, economics, finance, banking, healthcare, online advertising, manufacturing, music, natural disasters, social networks, (bio)surveillance, warehousing, logistics, etc.
• Methodology Papers: develop new methodologies and demonstrate their advantages over standard methods when analyzing some data sets, say, in the context of temporal data mining, spatial data mining, spatio-temporal data mining, streaming data mining, web or graph mining, etc.

The final written report shall not be longer than 25 pages, and the main body of the report is generally 5-12 pages. Only very relevant plots and tables shall be included in the body of the report; the rest should go to the Appendix. When writing up your summary report, it is useful to ask yourself the following questions: What is the work? Why is it important? What background is needed? How will the work be presented? Here is a suggested format for your summary report.
1. Title Page: project title, author(s) (your name, the last three digits of your student ID, and email address), the submission date, and course name/number.
2. Abstract: an informative summary of the whole report (100-300 words).
3. Introduction: problem description and motivation, data mining challenge(s), problem-solving strategies, what was learned from the application, and an outline of the report.
4. Problem Statement or Data Sources: cite the data sources and provide a simple presentation of the data to help readers understand the problem or challenge(s).
5. Proposed Methodology: explain (and justify) your proposed data mining strategies.
6. Analysis and Results: present key findings from executing the proposed data mining methods. For readability, detailed results should be placed in the Appendix. References for the computer software used to implement your proposed data mining methods (even if it is a web page) should be given.
7. Conclusions: draw conclusions from your data mining practice. Unfinished or possible future work can be included (with proper explanation or justification). *A mandatory subsection, "Lessons we have learned": at the end of the conclusion section, please add a subsection on the lessons you or your team learned from this project or this course. Please feel free to write any comments/suggestions/remarks, or share your experiences of data mining.
8. Appendix: this section only includes documents needed to support the presentation in the report. Feel free to divide it into several subsections if necessary. Do NOT dump all computer output here in unorganized form.
9. Bibliography and Credits.
Parts 3-6 constitute the main body of the paper and are written for your primary audience, who you can imagine as an intelligent but fictional boss unschooled in Data Mining or Statistics. These parts should therefore contain as little technical material as you can possibly get away with. It is appropriate, and even recommended, to refer the reader to the Appendix in part 8 if you need to provide a more technical explanation for something. Part 8 is for your secondary audience (me) and should follow the "story" of parts 4-6 closely enough that it is easy for me to see what technical material backs up your results and discussion. It is not necessary to number these parts 1-9 or to use the names given above.
Please feel free to merge some parts or provide more informative section names if that seems natural. A good online resource for writing reports is http://www.ccp.rpi.edu/. This site has links to writing centers at universities around the country, many of which in turn have pages that describe how to put together different types of reports.
Apply random forests and boosting to a data set of your choice. If you want, you can choose a data set from the course (e.g., past lectures or homeworks, including simulated data sets), from R (e.g., the ISLR package), or from other sources. The only exception is the spam email data set, since we have used it extensively in our lectures. It may be okay to use the dataset from your project proposal, especially if it is a large, complicated dataset, but you can do so only if each group member works independently on the homework without collaboration and all group members agree. Please write a report to summarize your analysis, subject to the following requirements:
(a) Be sure to fit both random forests and boosting on a training set and to evaluate their performance on a test set.
(b) How accurate are the results compared to simple baseline methods? For instance, candidate baseline methods include KNN, linear regression, LDA, logistic regression, local smoothing, a single tree, etc., whichever are appropriate.
(c) Which of these approaches yields the best performance in terms of smallest testing error?
(d) Explain how and why you choose certain tuning parameters in these approaches, based only on the training set. This can be done through cross-validation on the training set, variable selection such as AIC or BIC on the training set, or any other reasonable approach.
(e) In your writeup, please follow the guidelines of the final course project. In particular, please provide the necessary background on the data set of your choice, so that readers can understand your data set and analysis.
Remarks: the purpose of this homework is to prepare you for the course project and the final exam. If feasible, please use the final report format; it is okay to omit the title page, references, or other non-essential materials. Also, in the final exam you are given a training set and asked to predict on a testing set; your grade on the final exam will mainly be based on how small the testing error is. Note that the use of cross-validation in this homework is slightly different from HW#1 and HW#2: here you should apply cross-validation only to the training set itself, with the aim of finding the best set of tuning parameters. There is no universal solution to this homework, as we expect students to choose different kinds of data sets. As a result, the peer review comments and grading will depend heavily on your writeup/presentation and explanations, including (1) what your data set is, (2) how you tune parameters in each approach, (3) whether your conclusions are appropriate based on your numerical comparisons of the different approaches, and (4) whether your presentation is clear, e.g., whether your report is easy to read.

ISyE 7406: Data Mining & Statistical Learning
Optional HW (No credit, and Not Graded!)
This is an optional HW. No credit, and not graded. It might help you better understand tree-based methods.
Tree-based Method. Consider the Orange Juice (OJ) dataset, which is part of the ISLR package in R. The data contain 1070 purchases where the customer purchased either Citrus Hill or Minute Maid Orange Juice, and a number of characteristics of the customer and product are recorded.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
(b) Fit a classification tree with the "gini" criterion to the training set, with the binary variable "Purchase" as the response and the other variables as predictors. Use the "summary()" function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
(c) Create a plot of the tree and interpret the results.
(d) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate? Hint: the confusion matrix between two vectors, say Y and Yhat, can be obtained in R with the "table()" function, i.e., "table(Y, Yhat)".
(e) Use the training set to determine the optimal tree size that corresponds to the lowest cross-validation classification error rate.
(f) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. Note that cross-validation does not necessarily lead to selection of a pruned tree; if so, create a pruned tree with fewer terminal nodes.
(g) Compare the pruned and unpruned trees in terms of both training and testing error rates. Which is better, and does it match your intuition?
Remarks: The following R code helps you load the OJ data set:
## You need to first install the R package ISLR
library(ISLR)
data(OJ)
head(OJ)
?OJ
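If you prefer Python (which these homeworks allow), the sketch below illustrates the same workflow with scikit-learn: a gini classification tree with cost-complexity pruning chosen by cross-validation on the training set only, plus random forest and boosting fits for the main homework. The file name "OJ.csv", the specific tuning grids, and the random seeds are assumptions for illustration, not part of the assignment.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

oj = pd.read_csv("OJ.csv")                       # assumed export of the OJ data from R
X = pd.get_dummies(oj.drop(columns="Purchase"))  # one-hot encode the factor predictors
y = oj["Purchase"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=800, random_state=1)

# (b)-(d): unpruned gini tree, training/testing error, and confusion matrix
tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)
print("train error:", 1 - tree.score(X_train, y_train))
print("test error: ", 1 - tree.score(X_test, y_test))
print(confusion_matrix(y_test, tree.predict(X_test)))

# (e)-(f): pick the cost-complexity pruning parameter by CV on the training set only
alphas = tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
cv = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=1),
                  {"ccp_alpha": alphas}, cv=5).fit(X_train, y_train)
print("pruned test error:", 1 - cv.best_estimator_.score(X_test, y_test))

# Main homework: random forest and boosting, also tuned on the training set only
rf = GridSearchCV(RandomForestClassifier(random_state=1),
                  {"max_features": [2, 4, 6], "n_estimators": [200, 500]},
                  cv=5).fit(X_train, y_train)
gbm = GridSearchCV(GradientBoostingClassifier(random_state=1),
                   {"learning_rate": [0.01, 0.1], "n_estimators": [100, 500]},
                   cv=5).fit(X_train, y_train)
print("RF test error:      ", 1 - rf.score(X_test, y_test))
print("Boosting test error:", 1 - gbm.score(X_test, y_test))

The same pattern (tune on the training set, report error on the held-out test set) is what parts (a)-(d) of the random forests/boosting homework ask for; only the data set and the candidate parameter grids change.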
Spring 2025

The goal of this homework is to help you better understand the statistical properties and computational challenges of local smoothing methods such as loess, Nadaraya-Watson (NW) kernel smoothing, and spline smoothing. For this purpose, we will compute the empirical bias, empirical variance, and empirical mean squared error (MSE) based on m = 1000 Monte Carlo runs, where in each run we simulate a data set of n = 101 observations from the additive noise model

Y_i = f(x_i) + \epsilon_i    (1)

with the famous Mexican hat function

f(x) = (1 - x^2) \exp(-0.5 x^2),  -2\pi \le x \le 2\pi,    (2)

where \epsilon_1, ..., \epsilon_n are independent and identically distributed (iid) N(0, 0.2^2). This function is known to pose a variety of estimation challenges, and below we explore the difficulties inherent in estimating it.

(1) Let us first consider the (deterministic, fixed) design with equidistant points in [-2\pi, 2\pi].
(a) For each of the m = 1000 Monte Carlo runs, generate a data set of the form (x_i, Y_i) with x_i = 2\pi(-1 + 2(i-1)/(n-1)) and Y_i drawn from the model in (1). Denote this data set by D_j for the j-th Monte Carlo run, j = 1, ..., m = 1000.
(b) For each data set D_j (i.e., each Monte Carlo run), compute the three kinds of local smoothing estimates at every point in D_j: loess (with span = 0.75), Nadaraya-Watson (NW) kernel smoothing with a Gaussian kernel and bandwidth = 0.2, and spline smoothing with the default tuning parameter.
(c) At each point x_i, for each local smoothing method, based on the m = 1000 Monte Carlo runs, compute the empirical bias, empirical variance, and empirical mean squared error (MSE), defined as

Bias\{f(x_i)\} = \bar{f}_m(x_i) - f(x_i), with \bar{f}_m(x_i) = \frac{1}{m} \sum_{j=1}^{m} \hat{f}^{(j)}(x_i),
Var\{f(x_i)\} = \frac{1}{m} \sum_{j=1}^{m} \big( \hat{f}^{(j)}(x_i) - \bar{f}_m(x_i) \big)^2,
MSE\{f(x_i)\} = \frac{1}{m} \sum_{j=1}^{m} \big( \hat{f}^{(j)}(x_i) - f(x_i) \big)^2.

Here we use the true function value f(x_i) in (2) in the definitions of empirical bias and empirical MSE, which better helps us understand the performance of the different local smoothing methods when estimating the true function f. Moreover, we purposely use the coefficient 1/m (instead of the standard coefficient 1/(m-1)) in the definition of the empirical variance, so that the well-known relation MSE = Bias^2 + Var also holds for the empirical versions.
(d) Plot these quantities against x_i for all three local smoothing estimators: loess, NW kernel, and spline smoothing.
(e) Provide a thorough analysis of what the plots suggest, e.g., which method is better or worse in terms of bias, variance, and MSE? Do you think the comparison between these three methods is fair? Why or why not?

(2) Repeat part (1) with another (deterministic) design that has non-equidistant points in the interval [-2\pi, 2\pi]. The following R code is used to generate the design points x_i on my laptop, denoted by x2 below (you can keep these x_i fixed in the m = 1000 Monte Carlo runs):
set.seed(79)
x2
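The handout's own code is in R; since other software is allowed, here is a minimal Python sketch of part (1) under stated assumptions: statsmodels' lowess stands in for loess, the NW estimator is written by hand with a Gaussian kernel, and scipy's UnivariateSpline (with its default smoothing factor, which is not identical to R's smooth.spline default) stands in for spline smoothing.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(79)
n, m, sigma = 101, 1000, 0.2
x = 2 * np.pi * (-1 + 2 * np.arange(n) / (n - 1))     # equidistant design on [-2*pi, 2*pi]
f_true = (1 - x**2) * np.exp(-0.5 * x**2)             # Mexican hat function

def nw_gauss(x0, xs, ys, h=0.2):
    # Nadaraya-Watson estimate at points x0, Gaussian kernel with bandwidth h
    w = np.exp(-0.5 * ((x0[:, None] - xs[None, :]) / h) ** 2)
    return (w @ ys) / w.sum(axis=1)

fits = {"loess": np.empty((m, n)), "nw": np.empty((m, n)), "spline": np.empty((m, n))}
for j in range(m):
    y = f_true + rng.normal(0.0, sigma, n)            # one Monte Carlo data set D_j
    fits["loess"][j] = lowess(y, x, frac=0.75, return_sorted=False)
    fits["nw"][j] = nw_gauss(x, x, y)
    fits["spline"][j] = UnivariateSpline(x, y)(x)     # default smoothing, only an approximation

for name, fhat in fits.items():
    fbar = fhat.mean(axis=0)
    bias = fbar - f_true                              # empirical bias at each x_i
    var = ((fhat - fbar) ** 2).mean(axis=0)           # empirical variance (1/m convention)
    mse = ((fhat - f_true) ** 2).mean(axis=0)         # empirical MSE
    print(name, "max |bias|:", np.abs(bias).max(), "mean MSE:", mse.mean())
    # plot bias, var, and mse against x for part (d)

For part (2), only the design vector x changes; the same loop and the same bias/variance/MSE summaries apply.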
In this problem, you are asked to write a report summarizing your analysis of the popular "Auto MPG" data set from the literature. Much research has been done on this data set; here the objective of our analysis is to predict whether a given car gets high or low gas mileage based on 7 car attributes: cylinders, displacement, horsepower, weight, acceleration, model year, and origin.
(a) The "Auto MPG" data set is available at the UCI Machine Learning (ML) Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG Download the data file "auto-mpg.data" from the UCI ML Repository or from Canvas, and use Excel or Notepad to inspect the data (this is a .txt file). There are 398 rows (i.e., 398 different kinds of cars) and 9 columns (the car attributes and the car name). Before we do any analysis, we need to clean the raw data. In particular, some values are missing in this dataset. Many statistical methods have been proposed to deal with missing values; please research the literature yourself if interested. For simplicity in this homework, we adopt a simple, though inefficient, method: remove the rows with missing values. We also remove the last column of car names, which is text/string and may cause trouble in our numerical analysis. These two deletions lead to a new, cleaned data set of 392 observations and 8 columns. To save time, you can also directly download the cleaned data from the file "Auto.csv" on Canvas.
(b) Create a binary variable, mpg01, that contains a 1 if mpg is above its median and a 0 if mpg is below its median. This binary variable will be the response variable in this homework. Note that you first need to compute the median value of the mpg variable in the data set.
(c) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
(d) Split the data into a training set and a test set. Any reasonable splitting is acceptable, as long as you clearly explain how you split and why you think it is reasonable. For your convenience, you can either split randomly or save every fifth (or tenth) observation as testing data.
(e) For the purpose of this homework, perform the following classification methods on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (c). What is the test error of each model obtained?
(1) LDA
(2) QDA
(3) Naive Bayes
(4) Logistic Regression
(5) KNN with several values of K. Use only the variables that seemed most associated with mpg01 in (c). Which value of K seems to perform best on this data set?
(6) (Optional) PCA-KNN. Principal Component Analysis (PCA) or other dimension-reduction methods can easily be combined with other data mining methods. Recall that the essence of the PC reduction is to replace the p-dimensional explanatory variable x_i = (x_{i1}, ..., x_{ip}), for i = 1, ..., n, with a new p-dimensional explanatory variable u_i = (u_{i1}, ..., u_{ip}), where u_i = A_{p x p} x_i. Then we can apply standard data mining methods such as KNN to the first r (<= p) entries of the u_i's, (u_{i1}, ..., u_{ir}), to predict the Y_i's. Find the testing errors when KNN with different values of K (neighbors) is applied to the PCA-dimension-reduced data for r = p-1, p-2, ..., 1.
(7) (Optional) Any other classification methods you want to propose or use.
Write a report to summarize your findings, e.g., what the best method is and how your results could guide manufacturing or buying high-gas-mileage cars. The report should include (i) Introduction, (ii) Exploratory (or Preliminary) Data Analysis, (iii) Methods, (iv) Results, and (v) Findings. Please attach your computing code for R, Python, or other statistical software (without, or with limited, output) in the appendix of your report, and please do not just dump the computer output in the body of the report. It is important to summarize and interpret your computer output.
Remarks:
(a) From now on, we may no longer ask you to conduct cross-validation (CV) explicitly as in the previous HWs, but you might need to do it on your own if you want your results to be more convincing.
(b) In this HW, you are asked to use the explanatory variables that seem most associated with mpg01. This approach often occurs in manufacturing or biomedical applications, where one wants to incorporate domain knowledge to improve performance. Note that this might not be needed in other machine learning applications such as deep neural networks or random forests, where it is okay to use all explanatory variables.
(c) Below is some sample R code if you want it (please feel free to use Python, Matlab, or other software):
## Suppose that you save the data file in a local folder of your computer, say, "C:/Temp":
Auto1 <- read.csv("C:/Temp/Auto.csv", header = TRUE)
mpg01 <- I(Auto1$mpg >= median(Auto1$mpg))
Auto <- data.frame(mpg01, Auto1[,-1])   ## replace column "mpg" by "mpg01"
### END ###
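As an alternative to the R snippet above, the following Python sketch walks through (b), (d), and (e) with scikit-learn. The particular subset of predictors is only a placeholder for whatever your exploratory analysis in (c) suggests, and the 80/20 random split is just one of the acceptable splitting choices.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

auto = pd.read_csv("Auto.csv")                            # assumed cleaned file from Canvas
y = (auto["mpg"] >= auto["mpg"].median()).astype(int)     # (b) the binary response mpg01
# Suppose the EDA in (c) pointed to these predictors (an assumption for illustration):
X = auto[["cylinders", "displacement", "horsepower", "weight"]]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)   # (d)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "Logistic": LogisticRegression(max_iter=1000),
}
for k in (1, 3, 5, 9, 15):
    # standardize before KNN so no single predictor dominates the distance
    models[f"KNN (k={k})"] = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))

for name, mdl in models.items():                          # (e) test error of each method
    mdl.fit(X_tr, y_tr)
    print(f"{name:12s} test error: {1 - mdl.score(X_te, y_te):.3f}")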
Consider the data set "fat" in the "faraway" library of R. This data file is also available on Canvas; if you save the file "fat.csv" on your laptop, say in the folder "C:/Temp", you can read it into R as
fat <- read.csv("C:/Temp/fat.csv", header = TRUE)
Spring 2025

Overview: In probability and statistics, it is important to understand the mean and variance of any random variable. In many applications it is straightforward to simulate the random variable Y, but it is often highly non-trivial to characterize the exact distribution of Y = Y(X1, X2), including deriving explicit formulas for the mean and variance of Y = Y(X1, X2) as functions of X1 and X2.

Objective: In this exam, suppose that Y = Y(X1, X2) is a random variable whose distribution depends on two independent variables X1 and X2, and the objective is to estimate two deterministic functions of X1 and X2: the mean \mu(X1, X2) = E(Y) and the variance V(X1, X2) = Var(Y). For that purpose, you are provided with 200 observed realizations of Y for each of a set of given pairs (X1, X2). You are asked to use data mining or machine learning methods that allow us to conveniently predict or approximate the mean and variance of Y = Y(X1, X2) as functions of X1 and X2. That is, your task is to predict or approximate two values for each pair (X1, X2) in the testing data set: one for the mean \mu(X1, X2) = E(Y(X1, X2)) and the other for the variance V(X1, X2) = Var(Y(X1, X2)).

Training data set: In order to help you develop a reasonable estimate of the mean and variance of Y = Y(X1, X2) as deterministic functions of X1 and X2, we provide a training data set generated as follows. We first choose uniform design points on 0 <= X1 <= 1 and 0 <= X2 <= 1, that is, x_{1i} = 0.01 * i for i = 0, 1, 2, ..., 99, and x_{2j} = 0.01 * j for j = 0, 1, 2, ..., 99. Thus there are a total of 100 * 100 = 10^4 combinations of (x_{1i}, x_{2j}), and for each of these 10^4 combinations we generate 200 independent realizations of the Y variable, denoted by Y_{ijk} for k = 1, ..., 200. The corresponding training data, 7406train.csv, is available from Canvas. Note that this training data set is a 10^4 x 202 table. Each row corresponds to one of the 100 * 100 = 10^4 combinations of (X1, X2). The first and second columns are the X1 and X2 values, respectively, whereas the remaining 200 columns are the corresponding 200 independent realizations of Y. Based on the training data, you are asked to develop an accurate estimate of the functions \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y), as deterministic functions of X1 and X2 for 0 <= X1 <= 1 and 0 <= X2 <= 1. To assist you, a limited empirical data analysis (EDA) of the training data is provided in the appendix using R. Please feel free to adapt it to other languages such as Python, Matlab, etc.

Testing data set: For the purpose of evaluating your proposed estimation models and methods, we choose 50 random design points for X1 and 50 random design points for X2. Thus there are a total of 50 * 50 = 2500 combinations of (X1, X2) in the testing data set. The exact values of the (X1, X2) pairs in the testing data set are included in the file 7406test.csv, which is available from Canvas. You are asked to use your fitted models to predict \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each of the 50 * 50 = 2500 combinations of (X1, X2) in the testing data (please keep at least six digits in your answers).
Estimation Evaluation Criterion: In order to evaluate your estimation or prediction, we obtain "true" values of \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each combination of (X1, X2) in the testing data set based on the following Monte Carlo simulation (we will not release these true values!). We first generate 200 random realizations of Y for each combination of (X1, X2) in the testing data set, but we will not release these 200 independent realizations either. Then, for each given combination of (X1, X2), with the 200 realizations of Y denoted by Y_1, ..., Y_{200}, we compute the "true" values as

\mu^{*}_{true} = \bar{Y} = \frac{Y_1 + \cdots + Y_{200}}{200}  and  V^{*}_{true} = \widehat{Var}(Y) = \frac{1}{200 - 1} \sum_{i=1}^{200} (Y_i - \bar{Y})^2.

Your predicted mean and variance functions, say \hat{\mu}(X1, X2) and \hat{V}(X1, X2), will then be evaluated against these true values \mu^{*}_{true}(X1, X2) and V^{*}_{true}(X1, X2):

MSE_{\mu} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \big( \hat{\mu}(x_{1i}, x_{2j}) - \mu^{*}_{true}(x_{1i}, x_{2j}) \big)^2,
MSE_{V} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \big( \hat{V}(x_{1i}, x_{2j}) - V^{*}_{true}(x_{1i}, x_{2j}) \big)^2,    (1)

where (I, J) = (50, 50) for the testing data.

Your tasks: as your solution to this exam, you are required to submit two files to Canvas before the deadline:
(a) A .csv file with the required predictions, i.e., your predicted values of \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for the testing data (to 6 decimal places). Please name your file "1.YourLastName.YourFirstName.csv", e.g., "1.Mei.Yajun.csv" for the instructor. Students in our class should have a unique combination of last/first name, so there is no need to include the middle name.
• The submitted .csv file must be a 2500 x 4 matrix, and the first two columns must be exactly the same as in the provided testing data file "7406test.csv". The third column should be your estimated mean \hat{\mu}(X1, X2), and the fourth column your estimated variance \hat{V}(X1, X2).
• If you want, you may round your numerical answers to six decimal places, e.g., report your estimates in the form 30.xxxxxx, but this is optional: in our evaluation process we will use the round function to round your answers to six decimal places before computing the MSE.
• Please save your predictions as a 2500 x 4 data matrix in this .csv file, without headers or row/column labels/names. We will use a computer program to auto-read your .csv file and auto-compute the MSE values in equation (1) for all students, in alphabetical order of last/first name, so it is important to follow this guideline, e.g., no headers or extra columns/rows in the .csv file, and name your .csv file in the form above.
(b) A (pdf or docx) file that explains the methods used for the prediction. Please name your file "2.YourLastName.YourFirstName", e.g., "2.Mei.Yajun.pdf" or "2.Mei.Yajun.docx" for the instructor. Your written report should read like a good journal paper: concise, and clearly explaining and justifying your proposed models and methods; also see the guidelines for the final report of our course project. Please feel free to use any methods: this is an open-ended problem, and you can either use any standard methods you learned in class or develop your estimates with a completely new approach.
Remark:
• If you upload your files multiple times to Canvas, the file names might be renamed automatically by Canvas to "1.YourLastName.YourFirstName.csv-1" or similar. If this occurs, please do not worry, as we will take this into account and correct it for you.
• This final exam essentially asks you to build two different models: one to predict \hat{\mu} and the other to predict \hat{V}. For each model there are p = 2 independent variables (X1, X2), although both \hat{\mu}(X1, X2) and \hat{V}(X1, X2) likely need to be nonlinear functions in order to achieve good predictive performance. To be more specific, you might want to look beyond the multiple linear regression model \beta_0 + \beta_1 X1 + \beta_2 X2 and investigate nonlinear models such as polynomial regression, local smoothing, generalized additive models, random forests, boosting, support vector machines with suitable kernels, neural networks, etc. Then you need to decide which (nonlinear) models should be used for prediction on the testing dataset. Hopefully this high-level viewpoint allows you to easily develop models for prediction.
• After your submission, please double-check your submitted .csv file on Canvas to see whether it has exactly 2500 rows and 4 columns, and whether it has any "NA" or missing values. In the past, there were three typical small mistakes that severely affect predictions: (i) having 5 or more columns (e.g., one unnecessary column labeling the observations); (ii) having more than 2500 rows (e.g., predictions on the training data instead of the testing data, or predictions from multiple models); and (iii) having "NA" values in the .csv file (e.g., some models will generate a prediction of "NA" if a point in the testing dataset is outside the range of the training dataset). Thus it is crucial to ensure that your .csv file has the required 2500 rows and 4 columns reporting the desired prediction values.

Grading Policies: The total score of this take-home final exam is 25 points, which will be graded by the TAs and instructor. There are three components:
• Prediction accuracy on the mean: 10 points. The smaller MSE_{\mu} in (1), the better. Tentatively, we plan to assign "10" if MSE_{\mu} <= 1.20, "9" if it falls in (1.20, 1.40], "8" if (1.40, 1.60], "7" if (1.60, 1.80], "6" if (1.80, 2.00], "5" if (2.00, 3], "4" if (3, 10], "3" if (10, 20], "2" if (20, 30], etc. For your information, in past semesters the percentages of students who received "10", "9", and "8" were 43%, 35%, and 16%, respectively. Students who received low grades often did not realize that they had made small mistakes here or there, or did not tune the hyper-parameters appropriately. In general, we feel that our grading is very generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Prediction accuracy on the variance: 10 points. The smaller MSE_{V} in (1), the better. Note that predicting the variance is a much harder problem than predicting the mean, and thus we expect this MSE to be much larger. Tentatively, we plan to assign "10" if MSE_{V} <= 550, "9" if (550, 570], "8" if (570, 590], "7" if (590, 610], "6" if (610, 630], "5" if (630, 650], "4" if (650, 700], "3" if (700, 1000], "2" if (1000, 5000], etc. In past semesters, 60% of students received "10", and more than 90% received at least "8". Thus we feel the grading should be generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Written Report: 5 points. There are no specific guidelines for this written report; please use common sense. With that said, we will look at the following aspects. Is the report well written and easy to read? Is it easy to find the final chosen model or method?
Does the report clearly explain how and why the final method was chosen? Does the report discuss how to suitably tune the parameters of the final chosen model? We plan to assign the grades of this component as follows: "A" = 5, "B" = 4, "C" = 3, "D" = 2, "F" = 1, "Not submitted" = 0. The TAs and instructor will try their best to give fair technical grades to all reasonable answers; e.g., even if your prediction accuracy is not as good as other students', we reserve the right to increase your prediction accuracy scores if your written report gives a solid justification of your proposed models/methods/results. However, we acknowledge that ultimately this is a subjective decision. If needed, please feel free to leave a public or private message on Piazza. Good luck on your final exam!

Appendix: Some useful R code for (A) the training dataset, (B) the testing dataset, and (C) our auto-grading program.
(A) Empirical data analysis of the training dataset, which might be useful to inspire you to develop suitable methods for prediction:
#####
### Read Training Data
## Assume you save the training data in the folder "C:/temp" on your local laptop
traindata <- read.csv("C:/temp/7406train.csv")
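To make the overall workflow concrete, here is one reasonable (not required) approach sketched in Python. It assumes the training file has columns X1, X2 followed by the 200 realizations of Y with no header row, uses random forests purely as an example model, and leaves all tuning and model comparison to you.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Training file: 10^4 rows; columns are X1, X2, then 200 realizations of Y (no header assumed).
train = pd.read_csv("7406train.csv", header=None)
X = train.iloc[:, :2].to_numpy()
Y = train.iloc[:, 2:].to_numpy()

# Per-row empirical mean and variance of the 200 realizations become the regression targets.
mu_hat = Y.mean(axis=1)
var_hat = Y.var(axis=1, ddof=1)

# One model for the mean and one for the variance (random forest is only an example choice;
# comparing and tuning candidate models on the training data is the actual exam task).
mu_model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, mu_hat)
var_model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, var_hat)

# Predict on the 2500 test points and write the required 2500 x 4 csv with no header.
test = pd.read_csv("7406test.csv", header=None).iloc[:, :2]
out = test.copy()
out[2] = mu_model.predict(test.to_numpy())
out[3] = var_model.predict(test.to_numpy())
out.to_csv("1.LastName.FirstName.csv", header=False, index=False, float_format="%.6f")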
Spring 2025

Problem (KNN). Consider the well-known zipcode data set from the machine learning and data mining literature, which is available from the book website. You can also find it on Canvas: the training data set is the file "zip.train.csv" and the testing data set is "zip.test.csv". In the zipcode data, the first column is the response (Y) and the other columns are the independent variables (the X_i's). A detailed description can be found at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt
Here we consider only the classification problem between 2's and 7's, e.g., denote by "ziptrain27" the training data that only includes the rows with Y = 2 or Y = 7.
(1) Exploratory Data Analysis of the training data: explore the training data "ziptrain27", e.g., report some summary information/statistics of the training data that you think are important or interesting. Please do not copy and paste the output of R or Python code; be selective, and use your own language to write a few sentences summarizing the important or interesting results.
(2) Build classification rules from the training data "ziptrain27" with the following methods: (i) linear regression; and (ii) KNN with k = 1, 3, 5, 7, 9, 11, 13, and 15. Find the training error of each choice.
(3) Consider the provided testing data set, and derive the testing error of each classification rule built in (2).
(4) Cross-Validation. The above steps are sufficient in many machine learning or data mining problems when both the training and testing data sets are very large. However, for relatively small data sets, one may want to go further to assess the robustness of each approach. One general approach is the Monte Carlo cross-validation algorithm, which splits the observed data points into training and testing subsets and repeats the above computation B times (say B = 100). In the context of this homework, we can combine the n1 = 1376 training observations and n2 = 345 testing observations into one larger data set. Then, for each loop b = 1, ..., B, we randomly select n1 = 1376 observations as a new training subset and use the remaining n2 = 345 observations as the new testing subset. Within each loop, we first build the different models from "the training data of that specific loop" and then evaluate their performance on "the corresponding testing data." Therefore, for each model or method in part (2), we obtain B values of the testing error on B different testing subsets, denoted TE_b for b = 1, 2, ..., B. The "average" performance of each model can then be summarized by the sample mean and sample variance of these B values:

TE^{*} = \frac{1}{B} \sum_{b=1}^{B} TE_b  and  \widehat{Var}(TE) = \frac{1}{B-1} \sum_{b=1}^{B} \big( TE_b - TE^{*} \big)^2.

Compute and compare the "average" performance of each model or method mentioned in part (2). In particular, based on your results, write some paragraphs providing a brief summary of what you discover in the cross-validation, including the "optimal" choice of the tuning parameter k in the KNN method and an explanation of how confident you are in the usefulness of your optimal choice in real-world applications.

Appendix: Please feel free to use the following sample R code if you want. Of course, you are free to use Python or other software.
## Below assume that you save the datasets in the folder "C:/Temp" on your laptop
## 1. Read Training data
ziptrain <- read.csv("C:/Temp/zip.train.csv")
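The following Python sketch shows the Monte Carlo cross-validation loop in part (4) under stated assumptions: the csv files have no header, the first column is the digit label, linear regression is turned into a classifier by coding the digits as 0/1 and thresholding the fitted value at 0.5, and scikit-learn supplies KNN.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumed format: no header, column 0 = digit label, remaining columns = pixel values.
train = pd.read_csv("zip.train.csv", header=None)
test = pd.read_csv("zip.test.csv", header=None)
full = pd.concat([train, test], ignore_index=True)
full = full[full[0].isin([2, 7])]                 # keep only the 2's and 7's
X_all = full.iloc[:, 1:].to_numpy()
y_all = (full[0] == 7).astype(int).to_numpy()     # 1 = digit "7", 0 = digit "2"

B, n1 = 100, 1376
ks = [1, 3, 5, 7, 9, 11, 13, 15]
errs = {"linreg": []}
for k in ks:
    errs[f"knn{k}"] = []

rng = np.random.default_rng(0)
for b in range(B):
    idx = rng.permutation(len(y_all))             # random split for this Monte Carlo loop
    tr, te = idx[:n1], idx[n1:]
    # linear regression as a classifier: predict "7" when the fitted value >= 0.5
    yhat = LinearRegression().fit(X_all[tr], y_all[tr]).predict(X_all[te]) >= 0.5
    errs["linreg"].append(np.mean(yhat != y_all[te]))
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_all[tr], y_all[tr])
        errs[f"knn{k}"].append(np.mean(knn.predict(X_all[te]) != y_all[te]))

for name, e in errs.items():                      # sample mean and variance of the B testing errors
    print(f"{name}: mean TE = {np.mean(e):.4f}, var = {np.var(e, ddof=1):.2e}")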
Spring 2025

Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment.

Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and the instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies, as they are all bundled in the Docker container.

You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips that includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values unless specifically instructed to.

Technology: PySpark, Docker
Deliverables: [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file

IMPORTANT NOTES:
• Only regular PySpark DataFrame operations can be used.
• Do NOT use PySpark SQL functions, i.e., sqlContext.sql('select * ... '). We noticed that students frequently encountered difficult-to-resolve issues when using these functions. Additionally, since you already worked extensively with SQL in HW1, completing this task in SQL would offer limited educational value.
• Do not reference sqlContext within the functions you are defining for the assignment.
• If you re-run cells, remember to restart the kernel to clear the Spark context; otherwise an existing Spark context may cause errors.
• Be sure to save your work often! If you do not see your notebook in Jupyter, double-check that the file is present in the folder and that Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter, you can still move forward by clicking the "upload" button in the Jupyter notebook and uploading the file. However, if you use this approach, your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as... > Notebook (.ipynb); be sure to download often to save your work!
• Do not add any cells or additional library imports to the notebook.
• Remove all of your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
• If you encounter a "connection refused" error on Gradescope, it likely means that the system timed out while running your code. Review your code to ensure you aren't performing unnecessary processing, such as using overly complex nested loops.

Tasks and point breakdown
1. [4 pts] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types:
a. passenger_count: integer
b. total_amount: float
c. tip_amount: float
d. trip_distance: float
e. fare_amount: float
f. tpep_pickup_datetime: timestamp
g. tpep_dropoff_datetime: timestamp
2. [6 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest sum of passenger_count over the trips traveled between them. (A sketch of the DataFrame patterns used in Tasks 1 and 2 appears after the task list below.)
Sort the location pairs by total passengers between pairs. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), using total_amount. For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to total_passenger_count.
Sample output format (the values below are for demonstration purposes only):
PULocationID  DOLocationID  total_passenger_count  per_person_rate
1             2             23                     5.242345
3             4             5                      6.61345634
3. [6 pts] You will be modifying the function distance_with_most_tip. Filter the data to trips with fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. Round all trip distances up to the closest mile and find the average tip_percent for each rounded trip_distance. Sort the result in descending order of tip_percent to obtain the top 15 trip distances that tip most generously. Rename the column for rounded trip distances to trip_distance and the column for average tip percents to tip_percent.
Sample output format (the values below are for demonstration purposes only):
trip_distance  tip_percent
2              6.2632344561
1              4.42342882
4. [9 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips that began during that hour. Calculate the average speed as the average trip_distance divided by the average trip duration, as distance per hour. Make sure to compute the average durations and average trip distances before calculating the speed. It will likely be helpful to cast the dates to the long data type when determining the interval. An hour with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM and hours 12:00-23:59 being PM. Convert these times to 12-hour time so you can match the output below. For example, the row with 1 as the time of day should show the average speed between 1 AM and 2 AM in the am_avg_speed column, and between 1 PM and 2 PM in the pm_avg_speed column. Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed or pm_avg_speed. If an hour has no data for AM or PM, there may be missing rows; you will not have rows for all possible times of day, and you do not need to add them if they are missing.
Sample output format (the values below are for demonstration purposes only):
time_of_day  am_avg_speed  pm_avg_speed
1            0.953452345   9.23345272
2            5.2424622     null
4            null          2.55421905
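As referenced above, here is a minimal PySpark sketch of the DataFrame patterns for Tasks 1 and 2. It is illustrative only: the DataFrame variable name df is assumed, the per_person_rate aggregation shown is one plausible reading of the spec, and you should verify that the output columns and ordering match what the autograder expects.

from pyspark.sql import functions as F

def clean_data(df):
    # Task 1: cast each listed column to the required type.
    return (df.withColumn("passenger_count", F.col("passenger_count").cast("integer"))
              .withColumn("total_amount", F.col("total_amount").cast("float"))
              .withColumn("tip_amount", F.col("tip_amount").cast("float"))
              .withColumn("trip_distance", F.col("trip_distance").cast("float"))
              .withColumn("fare_amount", F.col("fare_amount").cast("float"))
              .withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
              .withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp")))

def common_pair(df):
    # Task 2 pattern: drop same-location trips, aggregate by location pair, then rank.
    return (df.filter(F.col("PULocationID") != F.col("DOLocationID"))
              .groupBy("PULocationID", "DOLocationID")
              .agg(F.sum("passenger_count").alias("total_passenger_count"),
                   (F.sum("total_amount") / F.sum("passenger_count")).alias("per_person_rate"))
              .orderBy(F.col("total_passenger_count").desc(), F.col("per_person_rate").desc())
              .limit(10))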
Q2 [30 pts] Analyzing a dataset with Spark/Scala on Databricks
First, go over this Spark on Databricks Tutorial to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv [1] using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need the taxi zone lookup table, taxi_zone_lookup.csv, which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
[1] Graph derived from the NYC Taxi and Limousine Commission.

Technology: Spark/Scala, Databricks
Deliverables: [Gradescope]
• q2.dbc: your solution as a Scala notebook archive file (.dbc) exported from Databricks (see the Databricks Setup Guide below)
• q2.scala: your solution as a Scala source file exported from Databricks (see the Databricks Setup Guide below)
• q2_results.csv: the output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() functions into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied.

IMPORTANT NOTES:
• Use only Firefox, Safari, or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.
• Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before the Homework Overview.)
  o You must choose the Databricks Runtime (DBR) version "10.4 (includes Apache Spark 3.2.1, Scala 2.12)". We will grade your work using this version.
  o Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment.
• You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSets of rows. You can use the DataSet API in Spark to work with these DataFrames. Here is a Spark document that will help you get started with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.
  o After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql, etc. Language magics are used to override the default language, which you must not do for this assignment.
  o You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allow you to directly write full SQL queries like spark.sql("SELECT * FROM col1 WHERE ..."). This should be df.select("*") instead.
• The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads the data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (refer to the Databricks Setup Guide above). It also contains code that filters the data to keep only the rows where the pickup location is different from the drop-off location and the trip distance is strictly greater than 2.0 (> 2.0).
  o All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.
  o Carefully read the instructions in the notebook, which provide hints for solving the problems.
• Some tasks in this question specify result data types of lower precision (e.g., float). For these tasks, we will accept relevant higher-precision formats (e.g., double). Similarly, we will accept results stored in data types that offer greater range (e.g., long, bigint) than what we have specified (e.g., int).
• Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
• Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, window(), partitionBy, orderBy, etc.

Tasks and point breakdown
1. List the top 5 most popular locations for:
a. [2 pts] dropoff, based on "DOLocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "DOLocationID" gets listed first.
b. [2 pts] pickup, based on "PULocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "PULocationID" gets listed first.
2. [4 pts] List the top 3 LocationIDs with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: if a taxi picked up 3 passengers at once, we count it as 1 pickup, not 3 pickups.
3. [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx, along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.
4. [5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days, in descending order (no rounding required). Here, the average pickup is calculated by averaging the number of pick-ups over the different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021, and 02/15/2021 are all Mondays, so the average pick-ups for Monday is the sum of the pick-ups on each of these dates divided by 3. An example output is shown below. Note: the day of week is a string with the day's full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format yyyy-mm-dd.
5. [6 pts] For each hour of a day (0 to 23, 0 being midnight), in order from 0 to 23 (inclusive), find the zone in the Brooklyn borough with the largest number of total pick-ups. Note: all dates for each hour should be included.
6. [7 pts] Find the 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in order from largest percentage increase to smallest. An example output is shown below. Note: all years need to be aggregated to calculate the pick-ups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.

List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns. While column names can be different, their resulting values must be correct.
• You must manually enter the generated output into the corresponding sections of the q2_results.csv file, preferably using spreadsheet software like MS Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to the show() and display() functions of Scala.
• Note that you can edit this csv file with a text editor, but please be mindful about putting the results under the designated columns.
• If you encounter a "UnicodeDecodeError", please save the file as ".csv UTF-8" to resolve it.
Note: Do NOT modify anything other than filling in the required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So make sure your output is obtained from the Spark Scala code you write. Failure to include the dbc and scala files will result in a deduction from your overall score.

Q3 [35 points] Analyzing a Large Amount of Data with PySpark on AWS
You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that could be accomplished on a commodity computer (e.g., a consumer-grade laptop or desktop). However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage and Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output in that S3 bucket. Amazon Athena is serverless, meaning that you pay only for what you use; there are no servers to maintain that would accrue costs whether they are being used or not. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.
In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2].
You are provided with a Python notebook (q3.ipynb) which you will complete and load into Athena. You may need to allow third-party cookies in your browser if the notebook doesn't load in Athena. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data where each record refers to one (1) trip. The second DataFrame, zones, maps a LocationID to its trip information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.

Technology: PySpark, AWS
Deliverables: [Gradescope]
• q3.ipynb: PySpark notebook for this question (for the larger dataset).
• q3_output_large.csv: output file (comma-separated) for the larger dataset.

IMPORTANT NOTES
• Use Firefox, Safari, or Chrome when configuring anything related to AWS.
• EXTREMELY IMPORTANT: Both datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) at the beginning (not Oregon, which is the default). This is extremely important; otherwise your code may not work, and you may be charged extra.
• Strictly follow the guidelines below, or your answer may not be graded.
a. Ensure that the parameters for each function remain as defined and that the output order and names of the fields in the PySpark DataFrames are maintained.
b. Do not import any functions that were not already imported within the skeleton.
c. You will not have access to the Spark object directly in the autograder. If you use it in your functions, the autograder will fail!
d. Double-check that you are submitting the correct files and that the filenames follow the correct naming standard; we only want the script and output from the larger dataset. Also, double-check that you are writing the right dataset's output to the right file.
e. You are welcome to store your script's output in any bucket you choose, as long as you can download and submit the correct files.
f. Do not make any manual changes to the output files.
g. Do not remove #export from the HW skeleton.
h. Do not import any additional packages, as this may cause the autograder to malfunction. Everything you need has already been imported for you.
i. Using .rdd() can cause issues in the Gradescope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using it.
j. Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
k. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed, and withColumn. Documentation for the DataFrame APIs is located here.
l. Testing on a single, small dataset (i.e., a "test case") is helpful, but it is not sufficient for discovering all potential issues, especially issues that only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.
m. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.
n. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.
o. Some columns may need to be cast to the right data type. Keep that in mind!

Tasks and point breakdown
1. [0 pts] Setting up the AWS environment.
a. Go through all the steps in the AWS Setup Guide (you should have already completed Step 1 to create your account) to set up your AWS environment, e.g., creating the S3 storage bucket and uploading the skeleton file.
2. [1 pt] user()
a. Returns your GT username as a string (e.g., gburdell3).
3. [3 pts] trip_statistics(trips)
a. Compute basic statistics (count, mean, stddev, min, max) for trip_distance.
b. Returns a PySpark DataFrame with the schema (summary, trip_distance).
4. [5 pts] busiest_hour(trips)
a. Determine the hour (0-23) of the day with the highest number of trips.
b. Returns a PySpark DataFrame with a single row showing the hour with the highest trip count and the corresponding number of trips. Schema (hour, trip_count).
5. [5 pts] most_freq_pickup_locations(trips)
a. Identify the top 10 pickup locations (by PULocationID) with the highest number of trips.
b. Returns a PySpark DataFrame showing the top 10 PULocationIDs with the highest trip counts. Schema (PULocationID, trip_count).
6. [6 pts] avg_trip_distance_and_duration(trips)
a. Calculate the average trip distance and the average trip duration in minutes (i.e., seconds divided by 60) for each hour of the day (0-23).
A valid trip must have non-null timestamps and a trip duration greater than zero.
b. Returns a PySpark DataFrame with 24 rows showing each hour (0-23) along with the average trip distance and average trip duration for that hour. Schema (hour, avg_trip_distance, avg_trip_duration). Note: You can use unix_timestamp to help with calculating the duration. If there are null or invalid timestamps, you will want to handle them accordingly.
7. [10 pts] most_freq_peak_hour_fares(trips, zones)
a. Identify the top 10 most frequent routes (combinations of PULocationID and DOLocationID) with pickups during peak hours (e.g., 7 AM - 9 AM and 4 PM - 7 PM). Peak hours can be defined as 7
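For tasks 4 and 6, the sketch below shows one way to structure the hour extraction and the duration calculation via unix_timestamp. It is an illustrative pattern, not the graded solution: the timestamp column names (tpep_pickup_datetime, tpep_dropoff_datetime) are assumed from the earlier Q1 schema and may differ from what load_data() actually provides, and only imports allowed by the skeleton may be used in your submission.

from pyspark.sql import functions as F

def busiest_hour(trips):
    # Count trips per pickup hour and keep the single busiest hour.
    return (trips.withColumn("hour", F.hour("tpep_pickup_datetime"))
                 .groupBy("hour")
                 .agg(F.count("*").alias("trip_count"))
                 .orderBy(F.col("trip_count").desc())
                 .limit(1))

def avg_trip_distance_and_duration(trips):
    # Duration in minutes from the timestamp difference (via unix_timestamp),
    # keeping only trips with non-null timestamps and a strictly positive duration.
    dur_min = (F.unix_timestamp("tpep_dropoff_datetime")
               - F.unix_timestamp("tpep_pickup_datetime")) / 60.0
    valid = (trips.withColumn("duration_min", dur_min)
                  .filter(F.col("tpep_pickup_datetime").isNotNull()
                          & F.col("tpep_dropoff_datetime").isNotNull()
                          & (F.col("duration_min") > 0)))
    return (valid.withColumn("hour", F.hour("tpep_pickup_datetime"))
                 .groupBy("hour")
                 .agg(F.avg("trip_distance").alias("avg_trip_distance"),
                      F.avg("duration_min").alias("avg_trip_duration"))
                 .orderBy("hour"))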
Spring 2025Goal Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau. Technology Tableau Desktop Deliverables Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions: ● table.png: Image/screenshot of the table in Q1.1 ● grouped_barchart.png: Image of the chart in Q1.2 ● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2 ● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4 a Q1 will be manually graded after the grace period. Setting Up Tableau Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it. a If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your site and create a workbook a If neither of the above methods work, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Connecting to Data 1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook. 2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here. 3. For Q1.1 and Q1.2: a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton. b. Click on the graph area at the bottom section next to “Data Source” to create worksheets. 4. For Q1.3: a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source. b. When connecting to a data source, choose To a Server – Web Data Connector. c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data-world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query. a NOTE: If you cannot connect to data-world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau. a d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract. 4 Version 0 Table and Chart Design 1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show: a a. 
Total number of games in each category (fighting, economic, …) b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau. c. Average rating of games in each category (use simple average), rounded to 2 decimal places. d. Average playtime of games in each category, rounded to 2 decimal places. e. In the bottom left corner below your table, include your GT username (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here). f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard → Export → Image). NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot If you use HTML, Pandas, etc. a Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. 2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic,…) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g.,
Spring 2025Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Build a co-actor network for movies released in 1999 according to the instructions in Q1.py and produce the correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 0SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). 
Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part 7 and 8. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 0 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. 
Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration(cast_member_id1, cast_member_id2, movie_count, average_movie_score). For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing. Optional Reading: Why create views?
8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2.
a. Order your output by collaboration_score in descending order, then by cast_name alphabetically.
b. Output format and example values (cast_id,cast_name,collaboration_score): 2,Mark Hamill,99.32 1920,Winona Ryder,88.32
9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation).
a. [1 point] Import movie overview data from movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text). NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. NOTE: If FTS is not enabled in your environment, try the following steps:
• Go to the SQLite downloads page: https://www.sqlite.org/download.html
• Download the dll file for your system
• Navigate to your Python packages folder, e.g., C:\Users\…\Anaconda3\pkgs\sqlite-3.29.0-he774522_0\Library\bin
• Drop the downloaded .dll file into that bin folder.
• In your IDE, import sqlite3 again; FTS should now be enabled.
b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc. ii. Output format and example value: 12
c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’ ii. Output format and example value: 6
Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species
In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is legal and sustainable.
TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2025) Wildlife Trade Portal. Available at www.wildlifetradeportal.org. Technology • D3 Version 5 (included in the lib folder) • Chrome 97.0 (or newer): the browser for grading your code • Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. Deliverables • Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or q3.csv dataset. IMPORTANT NOTES: • Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. • We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work. • All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3, its header contains:. It is incorrect to use an absolute path such as:. The 3 files that are referenced are: a. lib/d3/d3.min.js b. lib/d3-dsv/d3-dsv.min.js c. lib/d3-fetch/d3-fetch.min.js • In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path. • Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integer. See https://github.com/d3/d3-fetch#dsv. • VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout. Tasks and point breakdown Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications: 8 Version 0 1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset. 2. 
[1 point] The bars must have the same fixed thickness, and there must be some space between the bars so they do not overlap.
3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be an element having the id: “x_axis” and the y-axis must be an element having the id: “y_axis”.
4. [2 points] Set the x-axis label to ‘Count’ and the y-axis label to ‘Species’. The x-axis label must be an element having the id: “x_axis_label” and the y-axis label must be an element having the id: “y_axis_label”.
5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every interval of 500. The X-axis must be displayed below the plot.
6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot.
7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e.,). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be an element having the id: “title”.
8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be an element having the id: “credit”.
9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe palette.
NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it. This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built.
plot | width: 900 | height: 370
+– containing Q3.a plot elements
    +– containing bars
    +– x-axis
        +– (x-axis elements)
    +– x-axis label
    +– y-axis
        +– (y-axis elements)
    +– y-axis label
    +– GTUsername
    +– chart title
Q4 [5 points] OpenRefine
OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires a Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded in the installer, so there is no need to install Java separately when working with this version. Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2
Technology • OpenRefine 3.6.2
Deliverables
• properties_clean.csv: Export the final table as a csv file.
• changes.json: Submit a list of the changes made to the file in JSON format. Go to the ‘Undo/Redo’ tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’. Rename it to ‘changes.json’.
• Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi.
Provide each answer on a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line.
Tasks and point breakdown
1. Import Dataset
a. Run OpenRefine and point your browser at http://127.0.0.1:3333.
b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset, provided as “properties.csv”.
c. Choose “Create Project” → This Computer → properties.csv. Click “Next”.
d. You will now see a preview of the data. Click “Create Project” at the upper right corner.
2. [5 points] Clean/Refine the Data
a. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500
NOTE: OpenRefine maintains a log of all changes. You can undo changes with the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and raise errors from the autograder.
b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10
NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b.
c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster it (Edit Cells → Cluster and Edit…); this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the “keying function” that produces the smallest number of clusters under the “Key Collision” method. Click “Select All” and “Merge Selected & Close.” Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200
NOTE: Use the default Ngram size when testing Ngram-fingerprint.
d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”)
NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown ”.)
e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this.
Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 0 Q5 [5 points] Introduction to Python Flask In this question, you will build a web application using Flask. Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. The website will display wildlife trafficking data and allow users to filter and explore trafficking volume by different species classes. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open index.html, showing a table in which the rows returned by data_wrangling() are displayed. You can then choose different species classes from the dropdown and see how the data table updates dynamically. 4. You must solve the following two sub-questions: a. [2 points] Generate a list of unique animal classes for options in the dropdown menu. Sort the list alphabetically. b. [3 points] Filter, sort, and limit the data i. First, filter the data to only the specified class of animal. If no class is specified, include all the data. ii. Next, sort the data by the count column in descending order. You do not need to worry about tiebreaks. iii. Last, limit the data to only the top 10 rows. If the number of rows is fewer than 10 then return all rows.
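The filter/sort/limit logic in Q5 is plain Python. The following sketch assumes the wrangled data is a list of rows of the form [species_name, animal_class, count]; the actual structure used by the Q5.py skeleton may differ, so treat this only as an illustration of the three steps:

def filter_sort_limit(data, animal_class=None):
    # Keep only rows for the requested class (or all rows if none is given).
    rows = [row for row in data if animal_class is None or row[1] == animal_class]
    # Sort by the count column in descending order.
    rows.sort(key=lambda row: row[2], reverse=True)
    # Return at most the top 10 rows.
    return rows[:10]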
Spring 2025
Section 1: Overview
In this project, you will implement some database operators, more specifically join and aggregation, to answer a SQL query.
Dataset Format: In this project, you will work with two datasets (Dataset-A, Dataset-B).
Dataset-A
• The directory containing the data is named “Project3Dataset-A”. You can hardcode this name in your project. It is the name that the TAs will use when doing the testing. This directory should be under the working directory.
• The dataset directory contains 99 files (A1…A99), each file contains 100 records, and each record is 40 bytes. The record format for record #j in file #i is (for i use two digits like 01, and for j use three digits like 001): //The only difference from Project 2 is the 1st letter in the record (it is “A” instead of “F”) Ai-Recj, Namej, addressj, RandomV… where RandomV is a four-digit random number between 0001 and 0500. Since the number of records in the entire dataset is around 10,000, each value within the range of 0001 and 0500 is expected to appear in the dataset (on average) 20 times, but it can be more or less. Also, each record ends with three dots “…” to complete to 40 bytes. //Compared to Project 2, the range for RandomV is much smaller (only from 1 to 500).
• It is important to highlight that index “j” resets and starts from “001” in each file. This is important especially in Section 4 (the aggregation section).
Dataset-B
• The directory containing the data is named “Project3Dataset-B”. You can hardcode this name in your project. It is the name that the TAs will use when doing the testing. This directory should be under the working directory.
• The dataset will contain the same exact content as Dataset-A. The only difference is that the 1st letter of each record is “B” instead of “A”. Also, the file names will be B1…B99.
***SELECT ONLY TWO OF THE THREE PROBLEMS BELOW***
Section 2 (Building Hash-Based Join) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the hash-based join. “SELECT A.Col1, A.Col2, B.Col1, B.Col2 FROM A, B WHERE A.RandomV = B.RandomV” where “A” refers to Dataset-A, and “A.Col1” and “A.Col2” refer to columns 1 and 2 in the dataset. The same applies to “B” and its dataset. The syntax of the command is fixed, i.e., nothing will change when testing. In this part, you need to write code that:
• Builds a hash table on Dataset-A. The hash table should have 50 buckets. The buckets will store the entire record content.
• The hashing of each record to determine the corresponding bucket should be based on the join column. Refer to the SQL command to know which column is the join column.
• Then, make a loop to read Dataset-B file-by-file and record-by-record. For each record (say r), apply the same hash function on the join column to know which bucket you should check from Dataset-A (say bucket #K).
• Now you need to apply the join condition between r and each record in bucket K. If the join condition is met, i.e., A.RandomV = B.RandomV, then you need to produce an output record with the needed columns (see the SQL command for the columns).
• It is up to you to either maintain the records in each bucket sorted (based on the join column) or leave them unsorted. If you keep them sorted, then the search in the previous step should be more efficient (use binary search).
//In your report indicate which design choice you made
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print out the qualifying records (only the columns specified in the query)
Section 3 (Building Block-Level Nested-Loop Join) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the nested-loop join. “SELECT count(*) FROM A, B WHERE A.RandomV > B.RandomV” This command will report the count of records satisfying the join condition. Since the join condition is not an equality, a hash-based join cannot be used, but a nested-loop join can. The syntax of the command is fixed, i.e., nothing will change when testing. In this part, we assume the available memory is limited and can only hold at most the content of one file. Therefore, you need to write code that:
• Loops over each file in Dataset-A, and for each file do:
o Store the records of that file in memory (put them in some data structure such as an array).
o Read the entire Dataset-B, file-by-file and record-by-record, and compare each record with the records you have in memory from Dataset-A.
o Maintain the count of the records matching the join condition.
• Then, retrieve the next file from Dataset-A and repeat the process.
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print the count of the qualifying records
Section 4 (Hash-Based Aggregation) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the aggregation operator. “SELECT Col2, <AggFunc> FROM <Dataset> GROUP BY Col2” where:
• <Dataset>: is a placeholder. The actual value put in the command can be either A or B, referring to Dataset-A or Dataset-B, respectively.
• <AggFunc>: is a placeholder. The actual value can be either of the aggregation functions SUM(RandomV) or AVG(RandomV).
• Col2: is part of the syntax, i.e., it will not change. It refers to the 2nd column in the dataset (the “name” column).
• The meaning of the query is that all records in the dataset having the same value in the 2nd column should form one group, on top of which the aggregation function is applied. Each group should produce one output record consisting of the Col2 value along with the output from the aggregation function (similar to standard SQL).
• To implement the aggregation query, you should maintain a hash table, where each distinct group should have an entry in this table. Then, as you scan the dataset, you need to keep updating the aggregation value maintained in the hash table. After scanning the entire dataset, you start printing the content of the hash table.
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print out the output records from the SQL statement
What to Deliver
• The entire source code package
• Readme.txt, in which you must include:
o Your name and student ID
o Section I: a section on how to compile and execute your code. Include clear, easy-to-follow, step-by-step instructions that the TAs can follow.
o Section II: State clearly which parts are working and which parts are not working in your code. This will help the TAs give you fair points.
o Section III: a section that describes any design decisions you made beyond the design guidelines given in this document.
What and Where to Submit
A single file .zip to be submitted in the Canvas system
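To make the hash-based join in Section 2 concrete, here is an illustrative sketch in Python (the project itself may be written in another language); the record-parsing details are simplifying assumptions and should be adapted to your own file-reading code:

import os

def read_records(directory):
    # Yield the comma-separated fields of each 40-byte record in every file.
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "r") as f:
            data = f.read()
        for i in range(0, len(data), 40):
            yield [field.strip() for field in data[i:i + 40].split(",")]

def hash_join(dir_a="Project3Dataset-A", dir_b="Project3Dataset-B", buckets=50):
    # Build phase: hash every Dataset-A record into one of 50 buckets on RandomV.
    table = [[] for _ in range(buckets)]
    for rec in read_records(dir_a):
        table[hash(rec[3].rstrip(".")) % buckets].append(rec)
    # Probe phase: hash each Dataset-B record to its bucket and test the condition.
    output = []
    for rec_b in read_records(dir_b):
        key = rec_b[3].rstrip(".")
        for rec_a in table[hash(key) % buckets]:
            if rec_a[3].rstrip(".") == key:
                output.append((rec_a[0], rec_a[1], rec_b[0], rec_b[1]))
    return output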
Spring 2025 Section 1: Overview In this project, you will create a simple index structure to speed up the performance of the lookup queries. A lookup query is a query that involves a predicate (condition) on a certain column. Ways to answer a lookup query: As covered in class, there are two main approaches as summarized below. • Full Table Scan: In this approach, the database system needs to read the entire table (all its blocks) one by one. Then, for each block, the records are scanned one by one and checked against the query predicate. o This approach is used ONLY if there is no index available. • Index Lookup: In this approach, the database system will first check the index, and figure out whether the value exists or not. If the value exists (or potentially exits), then the index will specify which data block to read. The DBMS will then read this data block only without scanning all other blocks. o This approach is used if there is an index available on the column involved in the condition Dataset Format: • You are given a dataset that you can directly use. • The directory containing the data is named “Project2Dataset”. You can hardcode this name in your project. It is the name that TAs will use when doing the testing. • Put this directory “Project2Dataset” under the working directory of the Java program (the directory from which the program will run). This is the location from which the data is read. • The dataset directory contains 99 files, each file contains 100 records, and each record is 40 bytes. This very similar to Project 1 dataset with slight differences as follows. For this project the record format for a record #j in file #i is (for i use two digits like 01, and for j use three digits like 001): Fi-Recj, Namej, addressj, RandomV… where RandomV is a four digit random number between 0001 and 5000. Since the number of records in the entire dataset is around 10,000, then each value within the range of 0001 and 5000 is expected to appear in the dataset (on average) twice, but it can be more or less. Also, each record ends with three dots “…” to complete to 40 bytes. • The RandomV column is the column on which you will do the search and build the index. • As in Project 1, all records are of the same length (40 bytes), they are concatenated after each other, and there are no “new line” characters. The record boundaries are computed based on the 40-byte length. Section 2: Building Hash-Based Index Structure [20 Points] (See the “CREATE INDEX …” command below) In this part, you need to write code that builds a hash-based index on the RandomV column. The code will do the following functionalities: • Read all files in the dataset directory, one by one, and for each file, read record by record. • For each record, extract the RandomV value, and put it in a hash table. A hash table entry should have two components (key k, value v), where k = RandomV value, and v = the record locations (file number and the offset at which the record begins within this file). o It is up to you to design the appropriate data type (or structure for v) to keep multiple locations associated with a certain key. o Hint: You may concatenate multiple locations in a single string. • The hash table should be kept in memory, and this is your hash-based index. 3 Section 3: Building Array-based Index Structure [20 Points] (See the “CREATE INDEX …” command below) In this part, you need to write code that builds an array-based index on the RandomV column. 
The code will do the following functionalities: • Allocate an array of size 5,000, each entry should store record locations (file number and the offset at which the record begins within this file). o Keep in mind that for a single value, there can be multiple records with that value o It is up to you to design the appropriate structure • Read all files in the dataset directory, one by one, and for each file, read record by record. • For each record, extract the RandomV value, say the value = i . go to the ith slot in the array and add the record location information. • The array should be kept in memory, and this is your array-based index. Command to build the hash and array-based indexes • When your program executes, it should do nothing except printing the following sentence: Program is ready and waiting for user command. You program should be designed to loop and whenever it completes a command, it should print the same sentence indicated above and wait for the next command: • If the user wants to create the hash-based and array-based indexes highlighted above, the user will enter the following command: CREATE INDEX ON Project2Dataset (RandomV) o The text of the command is fixed including the dataset name (Project2Dataset) and column name (RandomV) o This command should build both indexes and keep them in memory o The data files should be read ONCE to build both indexes concurrently o Once the two indexes are built, print out message “The hash-based and array-based indexes are built successfully. Program is ready and waiting for user command.” Note: You may receive SELECT commands (see below) without indexes being created. Section 4: Equality-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV = v • If there are no indexes built, then you should perform a full table scan. • If there are indexes, then for equality search use the hash-based index. • If you will use an index, make sure to leverage the record location information (both the fileId and byte offset) to minimize the I/O and CPU. • “v” is any constant number. • The syntax for the SELECT command is fixed (nothing will change) except value “v” • The output that you should generate is: o Print out the record(s) matching the query o Indicate the index type you used (if any) or Table Scan. o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read 4 Section 5: Range-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV > v1 AND RandomV < v2 • If there are no indexes built, then you should perform a full table scan. • If there are indexes, then for range search use the array-based index. • If you will use an index, make sure to leverage the record location information (both the fileId and byte offset) to minimize the I/O and CPU. • The syntax for the SELECT command is fixed (nothing will change) except values “v1” and “v2” which are random constants like “1” and “15” • The output that you should generate is: o Print out the record(s) matching the query o Indicate the index type you used (if any) or Table Scan. 
o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read Section 6: Inequality-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV != v • With inequality operator, indexes should NOT be used (even if they exist) • The syntax for the SELECT command is fixed (nothing will change) except value “v” • The output that you should generate is: o Print out the record(s) matching the query o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read What to Deliver • The entire source code package • Readme.txt, in which you must include: o Your name and student ID o Section I: section on how to compile and execute your code. Include clear easy-to-follow step by step that TAs can follow o Section II: State clearly which parts are working and which parts are not working in your code. This will help the TAs give you fair points. o Section III: section describes any design decisions that you do beyond the design guidelines given in this document. What and Where to Submit A single file .zip to be submitted in the Canvas system
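The two index structures and the equality lookup boil down to mapping each RandomV value to the locations of its records. The project description references a Java program; the Python sketch below only illustrates that logic, and the record-parsing and ASCII byte-offset details are simplifying assumptions:

import os

def build_hash_index(directory="Project2Dataset"):
    # Map each RandomV value to a list of (file_name, byte_offset) locations.
    index = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "r") as f:
            data = f.read()
        for offset in range(0, len(data), 40):
            random_v = data[offset:offset + 40].split(",")[3].strip().rstrip(".")
            index.setdefault(random_v, []).append((name, offset))
    return index

def equality_lookup(index, directory, v):
    # Read only the files (blocks) that the index points to for value v.
    results = []
    for file_name, offset in index.get(v, []):
        with open(os.path.join(directory, file_name), "rb") as f:
            f.seek(offset)
            results.append(f.read(40).decode())
    return results

An array-based index for the range query is the same idea, but with a list of 5,000 slots indexed by the integer value of RandomV instead of a dictionary.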
Spring 2025Section 1: Overview In this project, you will learn and implement the key concepts of buffer management as done in database management systems. Although the project will not be done within a real DBMS, you will simulate the actions taken by the buffer manager as we learnt in class. In traditional DBMS a table consists of multiple data blocks. In our simplified project, a “table” will correspond to a “directory” in the file system, and a “data block” will correspond to a small “file” under this directory. Then, each DB record will correspond to a line in one of these files. In our context, all files have the exact same size. Each file stores 100 records (i.e., 100 lines), and each record (line) is 40 bytes. You can observe that each file will be of size roughly “4KB”, which mimics a disk block. Note: each file is now the unit of processing. That is, an entire file is read from disk to the buffer or taken out from the buffer and written back to disk. • Every record is exactly 40 bytes (the format is shown below). There is NO end-of-line character at the end of each record. This means all records are just concatenated after each other. • Conceptually, the format and content of a record will not make a difference. However, for ease of testing and readability, let every record follow this format. A record #j in file #i will have (for i use two digits like 01, and for j use three digits like 001): Fi-Recj, Namej, addressj, agej. • Our numbering system will start from 1 for both files (index i) and records (index j). Index j resets and start with 001 in each file. • Example: the 23rd record in the 3rd file will have the format of (exactly 40 bytes) F03-Rec023, Name023, address023, age023. Example 1: Assume a “Student” table that consists of 99 disk blocks and each block contains 100 records. Each record has 40 bytes containing the student ID, Name, address, phone number, etc. Now, this design in our project will correspond to a file system directory with name “Student”, under which there are 99 small files (names F1, F2, F3, …F99), each file holds the content of one disk block. Example 2: Let’s say we need to read record number k (say k =250), we will need to do steps like the following: • You need to do the calculations to figure out this record exists in which block (in our terminology, which file). For example, if k =250 and since each file holds 100 records, then the record we need is in file #3. Then, calculate the record number with this file. In this example (k = 250), it will be the 50th record in F3. • You need to check whether or not file #3 is in memory (in the buffer). Let’s say, it is in memory, then no I/O is needed. You need to find where in the buffer file #3 exists and read from this buffer record #50 (which corresponds to record #250 in the entire directory (table)). • If file #3 does not exist in the buffer, then let’s assume the easy case where there is an empty space in the buffer. In this case, you need to read the content of file #3 (all the content of this file), put it in a certain frame in the buffer, and then read the record we need from that buffer. Section 2: Design Guidelines In this section, you are given the high-level design ideas and guidelines to build the operations of a buffer manager. These are just guidelines. It is your responsibility to come up with the complete end-to-end design to have a working system. 3 Single Buffer (Memory Frame): “Frame” class • This is an object in your Java program to hold one file (one block). 
• You should design a “Frame” class, which contains some key (private) elements such as:
o “content”: array of bytes of size 4K //to hold the file content
o “dirty”: Boolean flag //set to True if the content of this block has changed and needs to be written to disk when this frame is taken out
o “pinned”: Boolean flag //True if there is a request to keep this block in memory and not take it out. False means it can be taken out.
o “blockId”: integer //It should be the Id of the block stored in this frame. E.g., if we need to read file #3 as in Example 2, then “blockId = 3”. You can use “-1” to indicate that the frame is empty and there is no block in this frame.
o … any other variables you think are useful and need in your design
o You should also have a set of “public” methods to set and get the values of the variables in this class.
o You should also have methods to, for example, return a specific record in this block, e.g., record number i (remember that all records have the same size of 40 bytes). This method should take as input the record number (i) and return the content of this record (string of 40 bytes).
o Another method you can think of is to update a specific record. This method should take the record number and the new content (40 bytes). Its job is to set that record to the new content. Remember: if the content changes, the “dirty” flag should be set.
o … You may think of other methods, e.g., an initialize() method to initialize all of the variables above.
Buffer Pool: “BufferPool” class
• This is the object in your Java program which represents the entire available buffer.
• You should design a “BufferPool” class, which contains some key (private) elements including:
o “buffers”: array of “Frame” objects. //The size of this array is decided at run time; the program should take an input argument that decides the size of this array.
o Any other elements and variables you think are useful.
o One public method is “initialize”, which should (1) build the array given the input argument, and (2) go over each frame and initialize it, e.g., by calling the initialize() method of each frame.
o One method to search whether a certain block (file) is available in the buffer pool. The method takes as input the block Id (file Id), and should return the buffer number (slot number in the array) holding this block (or -1 if not available).
o Another method to return the content of a given block Id. This method should also take as input the block Id (file Id). It can call the method in the previous bullet to find the buffer number (if the block is present), and then read the content.
o Another method to be used if the needed blockId is not in the buffer pool. In this case, this method should read the block (file) from disk and bring it into the buffer pool (in an empty frame).
o You need a method to search for and return the array index of an empty frame (if any).
o If there are no empty frames in the buffer pool, then you may need to take one out and return it back to disk (if possible). This method will differ in how it selects the to-be-evicted frame depending on the placement policy.
o Any other methods that you need for your design
Section 3: Project Requirements
You should design a Java program following the guidelines highlighted above.
The program should support the following functionalities (Make sure to read Section 4 before jumping to implementation): 1) [10 Points] Calling the program and passing one input parameter (n) representing the buffer pool size a. The program should create the bufferPool class with “n” slots in the “buffers” array b. Do all proper initializations. Initially, all frames are empty. Also initialize any metadata you have. c. After that, write on the screen message “The program is ready for the next command”, and wait for the user to enter the next command, which will be any of the following commands. //For consistency and ease of grading, all commands should be UPPER CASE (GET, SET, PIN, UNPIN) For the input data, assume the directory is named “Project1” in the same place from which the java program will run. Under this directory, there should be files F1, F2, … Note: TAs will not test for out-of-range record number or file number. Whatever is requested in the following commands is expected to be present under directory “Project1”. 2) [40 Points] Command #1 (GET command) : “GET k”. In this command, the user needs to print the content of record #k from the file. a. The program should call a GET() function in the buffer pool class. The function should calculate which block (file) contains this record (refer to Example 2). b. The function should scan the “buffers” array to figure out whether or not the desired file is in memory. You should call the corresponding method in the “BufferPool” class to do so. There are four possible cases: Table 1 CASE What Needs to be done CASE 1 [10 Points]: The block (file) is in memory This is an easy case, find the desired record (40 bytes only) and return it to be printed on the screen. CASE 2 [10 Points]: The block is not in memory, but there are empty buffers in the buffer pool array You need to read the right file from disk. Copy its content to an empty buffer frame. Update the metadata properly. Once the file is in memory, you should call the functions used in CASE #1 to do the rest. //For this part, choose the 1st empty frame from the beginning of the array (do not choose randomly). That is, if the array has the 1st 5 frames full and the rest are empty, then the file should go to frame #6. 5 CASE #3 [10 Points]: The block is not in memory, the buffer pool array is full (no empty frames), but some frames can be taken out. The content of a non-empty frame can be taken out ONLY IF its “pinned” flag is “False”. Otherwise, this frame is not a candidate to be taken out. You need to search for a frame that can be taken out. If you find one, then, there are two cases: Case 1: The “dirty” flag is False. This means it can be taken without the need to write back to disk. Just overwrite the content. Case 2: The “dirty” flag is True. This means that the content must be written back to disk, otherwise the changes will be lost. Remember that the entire file content (4KB) is one unit, even a single byte changes the entire file needs to be written back to disk and overwrite the previous content. CASE #4 [10 Points]: The block is not in memory, the buffer pool array is full (no empty frames), no frames can be taken out This may happen if all frames have content and all of them are “pinned”. In this case, print out a message “The corresponding block # cannot be accessed from disk because the memory buffers are full”. 
Output From “GET” Command: The required output is:
(1) Print the record content (the 40 bytes) for CASEs #1, 2, 3 above, or the message indicated in CASE #4
(2) Print whether or not an I/O is done (i.e., whether the block was already in memory or brought from disk)
(3) Print the frame # (the entry number in the buffers array) that contains the block (for CASEs #1, 2, 3)
3) [10 Points] Command #2 (SET command): “SET k ”. In this command, the user needs to set the content of record #k to the given string.
a. All of the work in Command #1 applies to have the desired block (file) in memory.
b. In addition, you need to change the content of the record to the new string. You will be given exactly 40 characters, so you do not need to do any error checking for that.
c. Make sure to set the “dirty” flag to “true”.
Output From “SET” Command: The required output is:
(1) Print whether or not the write is successful
(2) Print whether or not an I/O is done (i.e., whether the block was already in memory or brought from disk)
(3) Print the frame # (the slot number in the buffers array) that contains the block (for CASEs #1, 2, 3)
NOTE: In the SET command, DO NOT write the file back to disk. It will remain “dirty” in memory until the buffer manager decides to take it out; at that time it should be written to disk.
4) [10 Points] Command #3 (PIN command): “PIN BID”. In this command, the user wants to pin a specific block (BID) in memory.
a. Notice that BID is a block number (file number), not a record number. Example: “PIN 3” means pin block #3 (F3) in memory.
b. There are a few cases:
Table 2 (each case and what needs to be done)
CASE 1 (The block is already in the buffer pool): Set the “pinned” flag to True. If it is already set, then do nothing.
CASE 2 (The block is not in memory, and the buffer pool either has empty slots or a block can be taken out; CASEs 2 & 3 in Table 1): You should bring block BID into memory and set the “pinned” flag to True.
CASE 3 (The block is not in memory, the buffer pool is full, and no blocks can be taken out because they are all pinned; CASE #4 in Table 1): In this case, print out the message “The corresponding block BID cannot be pinned because the memory buffers are full”.
Output From “PIN” Command: The required output is:
(1) Print the frame # in the buffer pool array that is pinned.
(2) Print whether or not the “pinned” flag was already true.
5) [10 Points] Command #4 (UNPIN command): “UNPIN BID”. In this command, the user wants to unpin a specific block (BID) in memory.
a. Notice that BID is a block number (file number), not a record number. Example: “UNPIN 3” means unpin block #3 (F3) in memory.
b. There are a few cases:
Table 3 (each case and what needs to be done)
CASE 1 (The block is in the buffer pool): Set the “pinned” flag to False. If it is already False, then do nothing.
CASE 2 (The block is not in memory): In this case, print out the message “The corresponding block cannot be unpinned because it is not in memory”.
Output From “UNPIN” Command: The required output is:
(1) Print the frame # in the buffer pool array that is unpinned.
(2) Print whether or not the “pinned” flag was already false.
6) [10 Points] Code Commenting
You are required to provide good comments on your code. Every function needs to be properly commented such that others can understand your code. Lack of comments will cause you to lose points.
Section 4: Some strategies to follow
• How to locate an empty frame in the buffer pool?
o The easy and straightforward way is to search all entries in the array and check the “blockId” value.
If the value is -1, then this frame is empty and can be used. This is an acceptable strategy to use in this project.
o As covered in class, this scan strategy is slow if the number of frames (the buffer pool size) is large, and thus there are other, more efficient ways that DBMSs actually use, e.g., the use of bitmaps. It is up to you whether you would like to implement and use the bitmap strategy. Think of where this bitmap should be stored and design the manipulation methods that manage the bitmap.
o Either the scan approach or the bitmap approach should lead to exactly the same selection, which is covered in the next bullet.
• In what order should an empty frame be selected?
o You must select the 1st empty frame that you find starting from the top of the buffer pool. For example, if the pool has 10 frames, and the first 3 are full, then the next one to choose MUST be frame #4, and then #5, and so on.
o When grading, the TAs will expect that order; otherwise your numbers will not match, and you will lose points.
• In what order should a full frame be evicted to make space for a new block being brought into memory?
o Remember that any “pinned” block cannot be taken out (it is not a candidate for eviction) until it is “unpinned”.
o The order of selection must follow this strategy: select the 1st candidate frame following the last frame that was evicted. For example, if the last evicted frame is #1, and frames #2 & #3 are pinned, then the next frame to be evicted should be frame #4.
o Think of this strategy in a circular fashion. That is, if you reach the end of the buffer pool, then you start from the beginning again.
o When grading, the TAs will expect that order; otherwise your numbers will not match, and you will lose points.
• How to test your code?
o A test case is provided to you (dataset, set of commands, the expected output). Use it to test your code.
o Also, create your own test cases to further validate your code.

What to Deliver
• The entire source code package
• Readme.txt, in which you must include:
o Your name and student ID
o Section I: a section on how to compile and execute your code. Include clear, easy-to-follow, step-by-step instructions that the TAs can follow.
o Section II: a section on test results. A test case will be provided along with the expected output. This test case will help you test your code. In the Readme.txt file, state which of the test case commands are working successfully and which ones are failing. This will help the TAs better test your code and give you fair grades.
o Section III: a section describing any design decisions that you make beyond the design guidelines given in this document, for example, any additional variables, methods, or classes that you add.

What and Where to Submit
A single .zip file to be submitted through the Canvas system.
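To make the two ordering rules above concrete (pick the first empty frame from the top; evict in circular order starting just after the last evicted frame), here is a small Python sketch. Again, the project is written in Java, and the attribute names below (block_id, pinned, last_evicted) are only illustrative, not names mandated by the handout.

def find_empty_frame(buffers):
    # Scan from the top of the pool; an empty frame is marked with blockId == -1.
    for i, frame in enumerate(buffers):
        if frame.block_id == -1:
            return i
    return None

def choose_victim(buffers, last_evicted):
    # Circular scan starting just after the last evicted frame; pinned frames are skipped.
    n = len(buffers)
    start = (last_evicted + 1) % n
    for offset in range(n):
        i = (start + offset) % n
        if not buffers[i].pinned:
            return i          # the caller should record i as the new last_evicted
    return None               # every frame is pinned, so nothing can be evicted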
In this assignment, you will simulate the investigation of a text classification problem, asking the questions: Is it possible to distinguish real from fake facts about cities using linear classifiers? Does the choice of linear classifier matter? You will collaboratively generate a dataset of real vs. fake city facts using online generative AI tools, then set up an experimental pipeline to train models that classify them into “fact” or “fake”.

Important note: This assignment presents an oversimplified picture of distinguishing real from fake information for pedagogical purposes. Fake news/misinformation detection is a rich and complex research area in NLP and in the social sciences. Please do not take the purported results from this assignment too seriously! The goal of this assignment is to give you experience in using existing tools for machine learning and natural language processing to solve a classification task.

Before you attempt this assignment, you will need to install Python 3 on the machine you plan to work on, as well as the following Python packages and their dependencies:
NLTK: http://www.nltk.org/
NumPy: http://www.numpy.org/
scikit-learn: http://scikit-learn.org/stable/

Dataset generation
Use an online generative AI tool to generate facts about cities. For example, ChatGPT 3.5 is freely available online; there is no need to pay for this purpose. Use a prompt that will let you generate facts, real or made up. Be sure to document the prompts and the tool that you used. For example, here are two prompts that I used:
“Give me facts about Montreal.”
“Give me fake facts about Montreal.”
And some of the “facts” that were generated included:
Bilingual City: Montreal is one of the largest French-speaking cities in the world outside of Paris, but it is also a major English-speaking city. This bilingual nature contributes to its rich cultural diversity. [FACT]
Island City: Montreal is located on the Island of Montreal, which is situated at the confluence of the Saint Lawrence and Ottawa Rivers. The island is named after Mount Royal, the triple-peaked hill in the heart of the city. [FACT]
Montreal’s Underground City Is Actually a Secret Lab: Beneath the Réso lies a hidden research facility where scientists are working on a top-secret project to create a new flavor of poutine that can only be described as “cosmic.” [FAKE]
Mount Royal Is an Ancient Volcano: Mount Royal isn’t just a hill—it’s actually a dormant volcano that was last active during the Ice Age, and locals believe it has mystical powers that can control the weather. [FAKE]

Pick your own favourite city and generate facts about it! Each fact should be roughly one to two sentences in length. For the sake of sharing in class, please generate these facts in English. You are free to do supplemental experiments in other languages of your choice and to remark on this in your report, but please experiment with one language at a time.

Store your facts, one fact per line, in two text files in UTF-8 encoding with the following names: facts.txt for the generated facts, and fakes.txt for the generated fakes. Make sure there are at least 100 samples in each class, and ideally more if you have the time to gather more data.

Sharing your dataset (optional)
To expand your dataset, feel free to share your dataset with others in the class. However, you must still generate at least 100 samples of each class on your own.
I encourage you to post your datasets on the course Ed discussion forum and/or download datasets from there, or to share with your classmates. Please include the names of all students with whom you have shared data. Note that other students may not have generated samples about the same city as you. How will you account for this in your experimentation? Please note that this sharing policy only applies to the data, not to the code or to writing the report!

Preprocessing, feature extraction, and model implementation
Your responsibility is to design and run the correct experiments in order to answer the research question stated above. How will you do so in a way that is scientifically sound? You will likely need to subdivide your dataset in some way. You must explore at least 3 preprocessing decisions and 3 linear classifiers that we have discussed in class. You may use scikit-learn’s feature extraction and classification modules to help you, as well as any other tool from NLTK or NumPy. Reading scikit-learn’s documentation will be of great help in your experimentation.

Setting up the experiments
Design and implement experiments to draw reasonable conclusions about the research question above. This will require creating subsets of the dataset as we discussed in class. There are multiple correct ways to set up your experiments (as well as many incorrect ways). A sketch of one possible pipeline appears at the end of this handout.

Report
Write a short report on your method and results, carefully documenting i) the problem setup, ii) your dataset generation and experimental procedure, iii) the range of parameter settings that you tried, iv) the results and conclusions, and v) the limitations of your study. It should be no more than 1.5 pages long. Report on the performance in terms of accuracy, and speculate on the successes and failures of the models. Regarding point v), think about how much you can generalize the conclusions of your experiments to the overall problem of separating real vs. fake facts for cities, geographical locations, and in general. What assumptions have we made that are reasonable or not reasonable?

Your assignment will be marked on i) how well it satisfies the requirements stated in this handout, ii) whether your experiments adequately and correctly address the research questions, and iii) how well written your report is. It will NOT be marked based on the performance that you achieve with your models on this dataset.

Submitting code
Submit your code in a file named “pa1.py”.

What To Submit
Submit your report as a single pdf on myCourses called “pa1-answers.pdf”. In addition, you should submit i) your source code in a file called “pa1.py”, and ii) your dataset in two files, “facts.txt” and “fakes.txt”. All work should be submitted to myCourses under the Programming Assignment 1 folder. Jupyter notebooks are not acceptable as a submission format.
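As one way to organize the experiments described above, the sketch below loads facts.txt and fakes.txt, makes a train/test split, and compares a few preprocessing choices and linear classifiers with scikit-learn. Treat it only as a starting point: it does not decide how to handle the city mismatch across shared datasets, and the particular preprocessing decisions and classifiers shown are examples, not requirements.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def load(path, label):
    # One fact per line, UTF-8, as specified in the handout.
    with open(path, encoding="utf-8") as f:
        return [(line.strip(), label) for line in f if line.strip()]

data = load("facts.txt", "fact") + load("fakes.txt", "fake")
texts, labels = zip(*data)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# Three example preprocessing decisions: raw counts vs. tf-idf, lowercasing, stop-word removal.
vectorizers = {
    "counts": CountVectorizer(),
    "tfidf": TfidfVectorizer(lowercase=True),
    "tfidf_nostop": TfidfVectorizer(stop_words="english"),
}
# Three example linear classifiers.
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "perceptron": Perceptron(),
}
for vname, vec in vectorizers.items():
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
    for cname, clf in classifiers.items():
        clf.fit(Xtr, y_train)
        acc = accuracy_score(y_test, clf.predict(Xte))
        print(f"{vname:>12} + {cname:<11} accuracy = {acc:.3f}")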
Our last assignment had us produce wireframe images of a 3-D scene using a form of ray casting. In this assignment we write code that renders using ray tracing. This is a more straightforward method for rendering a scene by casting rays. Taken to its fullest level of detail, the technique can be used to create extremely realistic images of a virtual scene. We model light transport, accounting for how light energy interacts with material, and we (in essence) follow light from its sources into the scene, see how it reflects off objects, and (perhaps) eventually hits the eye of the viewer so that they see those objects. To make this feasible, ray tracing performs this “light following” in reverse: we trace rays from the eye out into the scene, see what objects are hit by each ray, and then see how light illuminates them. We do the latter (seeing whether and how the light illuminates objects) by tracing rays from the objects we hit. If an object is a mirror, we trace a ray in the direction of reflection to find out what object can be viewed in that mirror’s image. With enough of this kind of tracing, enough realistic modeling of surfaces and how materials behave, and enough computational resources, ray tracing and its variants are the basis for most photorealistic CG-rendered animated movies today.

The code for this project performs ray tracing using the graphics hardware. We will write the ray tracer within the code of a fragment shader in the C-like language GLSL. This is somewhat unusual, as ray tracing is typically performed off-line. It is the same kind of code we’ve used to support our rendering in past projects, but those shaders implemented the standard z-buffer based graphics pipeline, using an approach that is much different from ray tracing.

For the assignment, you will extend some existing ray tracing GLSL code so that it handles a mirrored surface described by a quadratic Bezier curve. The code you are given ray traces spherical and planar objects using Phong shading, and it also handles spherical mirrors. Your job is to extend the scene editor and the ray tracer to handle scenes with a curved mirror.

As a consequence of ray tracing in the hardware, the scene can be edited in a web application written in Javascript, and the updated rendering can be viewed in real time. The calculations would normally be too expensive to perform in Javascript, but a GPU can instead quickly trace the 512 × 512 × 4 rays used to depict the scene. Even so, we keep the scene small and simple to make this feasible. More general ray tracing would bog down the GPU too much.

Here is the starter code for this assignment. If you download this code, you’ll find three important source code files. For the assignment you will modify funhouse.js and trace-fs.c so that they properly handle the funhouse mirror whose footprint is specified as a quadratic Bézier curve.

To run the code, load bezier-funhouse.html into your browser. With its initial settings it allows the user to create a scene full of spheres. It also displays a spherical mirror, one that can also be resized and moved. Figure 1 shows an example of a scene where five colored spheres have been placed around that mirror ball. The left view shows the scene from above, and uses the algorithms described last week to display the layout on the floor of the room. The right view shows the ray-traced scene produced by the shader code.

Figure 2 shows the results you can obtain once you’ve completed the assignment.
When the application is switched to display the Bezier funhouse mirror, the scene editor shows the control points for the footprint (the top view) of a curve, and the ray tracer shows a picture with some scene objects warped and reflected in that funhouse mirror. It also depicts the shadow of that mirror falling on the back part of the room.

In its current state, if you click on mirror: sphere -> bezier to switch the application from displaying the spherical mirror to editing and displaying the funhouse mirror, you’ll see very little of that functionality. This state is shown in Figure 3. On the left, you’ll see that the editor uses the control polygon as the points of the quadratic Bezier curve. This is obviously a problem. The editor should instead show a smooth curve running from the first control point to the third, using the second control point to suggest the curvature of the mirror. To complete this step, you’ll need to modify the compile method of class Curve so that it computes an array of curve points. It should sample enough points on the curve to produce that array, and its method for doing so should rely on SMOOTHNESS: larger values should lead to a smoother approximation, that is, they should lead compile to use more points. The sampling ideally should depend on the curvature of the Bezier. A flatter curve should use fewer points; a more pronounced curve should require more.

You’ll also notice that the ray-traced view initially shows nothing when placed in Bezier mode. This coding involves writing two key functions in trace-fs.c within bezier-funhouse.html. The first, rayIntersectsBezier, is used by rayIntersectMirror to handle computing a ray of reflection when a traced ray hits the funhouse mirror. You can use a subdivision technique similar to the one you use for the editor to compute the mirror’s geometry; however, the normal you compute to determine the reflection ray should be smoothly interpolated so that there are no obvious discontinuities of the scene in the mirror. The second key GLSL function you need to write is rayHitsBezierBefore. This is used to figure out whether an object in the scene is in the shadow of the funhouse mirror. When the ray tracer hits a wall or a non-reflective sphere, it computes Phong shading of that surface by tracing a ray from the object to the light source. If the mirror sits between that surface point and the point light source, then it casts a dark shadow on that object, and the object won’t reflect anything but the ambient light of the room. The Phong shader uses rayHitsBezierBefore to check for that shadow.

To guide you through completing the assignment, we first walk you through the current ray tracing code just below. Then we briefly describe our methods from class for sampling points on a Bezier curve. Finally, we walk you through a plan for completing the assignment.

Let’s talk a bit about the state of the initial code and how the edited scene becomes the ray-traced funhouse image. The initial code has a simple ray tracer that handles a 4-walled room. There is one point light source, initially sitting above the entrance to the room. The room has four walls and a checkered floor. A user can place spheres of different sizes onto the floor of the room. And the code also handles computing the reflections within a single mirrored object.

Non-mirror surfaces are rendered using Phong shading. The walls, ceiling, and floor are treated as matte surfaces of different-colored materials.
This means that they have an ambient and diffuse (Lambertian) component that they reflect when illuminated by a point light source. If a wall is in shadow from the light, then only the ambient light of the room is reflected.

The non-mirror spheres sitting in the room are treated as glossy surfaces, again using the Phong model. If they are in shadow, they reflect only ambient light. If they are directly illuminated by the light, then there are also diffuse and specular components to their reflected light. This means that they have a small specular highlight, placed with the peak of the highlight at the perfect mirroring direction for the light source.

This calculation for spheres and walls is just the local illumination model we covered several weeks ago in class when describing classical hardware rendering. It is summarized by the pseudo-code below. In the code, P is the point on the surface being illuminated, n is the surface normal, V is the point from which we are viewing the object, L is the location of the (single) light source, and m stands for the material’s properties.

We have one enhancement in the Phong shading code: we don’t assume that P is directly illuminated by the light. Instead we shoot a ray out from P towards L to see if any scene objects get in the way of the light. If they do, hits-light returns false and we only give back the ambient component. If instead the light is hit by the “shadow ray”, then we also include the diffusely reflected light, possibly along with some glossy highlight. So, already, we’ve tweaked direct illumination so that it traces a shadow ray to create a more realistic rendering. Objects cast shadows on other objects in the scene.

For mirrored surfaces, when a point on the mirror is visible, we figure out the color of light reflected off the mirror from some other (non-mirror) object, or else from the walls, ceiling, and floor of the room. This is described in more detail below.

For our ray casting algorithm for Program 3, we shot rays through a virtual screen representing a piece of paper to see where each corner of an object’s mesh would fall on the page. We then judiciously connected those dots to build a wireframe rendering of the objects in the scene. Ray tracing has a similar setup. We choose a point outside the scene as a center of projection. Then we place the pixels of a virtual screen in front of that point, with the pixels forming a regular grid. That grid of pixels acts as a window into our virtual 3-D scene of objects. And then, with ray tracing, we shoot rays for each one of those pixels. The goal of these rays is to trace backwards from the eye (from the center of projection) through each pixel into the 3-D scene to determine what object is reflecting light, and at what color, towards our eye’s view. This top-level ray tracing algorithm is summarized below.

In the above, we use R as the source of the ray, and d as the direction it is shot through some pixel location x. We obtain information about the scene by a call to TRACE-RAY. This checks the world to see what object is providing a color c of light hitting us from the scene. By tracing this “primary ray” backward, we can make standard geometric calculations (“Is this sphere visible first along this ray?”) to ask questions about the scene, figuring out how light is illuminating our scene. Each ray serves to query the geometry of the scene. If we hit an object with our query ray, we’ll then typically shoot “secondary rays”.
With those secondary rays we ask questions like “Does this object have a direct line of sight to a light source?” and “Is this object a mirror? What light then does it reflect?” This is summarized by the pseudocode below. The ray we trace might hit a wall, or a sphere, or the mirror. For non-mirror objects we use our modified PHONG-COMPUTE-COLOR that includes shadow ray tests to determine the color of that object. When instead a mirrored object is hit, that leads us to bounce a secondary, reflected ray to find out the color of light that’s hitting the mirror and bouncing towards our eye. In general ray tracing, this normally leads to a recursive call to TRACE-RAY. For this assignment, in order to make hardware ray tracing feasible, our secondary ray ignores mirrored objects. Thus we call a TRACE-RAY' that behaves almost the same, but skips the mirror check.

That’s pretty much it. But this pseudocode doesn’t quite give a complete enough picture of the assignment. Let’s now discuss the vertex and fragment shaders we’ve written in GLSL. These essentially mimic the pseudocode above, but their details will be useful to know.

The key file that implements the ray tracing is the code for trace-fs.c. That code is invoked once for each pixel on the WebGL canvas, running something like the following code: The code is sent eyePosition, into, right, and up. These are uniform, meaning every pixel gets sent the same info for these. In addition, every pixel is sent an xy pair ray_offset with coordinates taken from [−1,1] × [−1,1]. These have us shoot each primary ray through a particular location in the screen.

Most of the rendering work is done within trace. The code for this function starts roughly as follows: We shoot the primary ray defined by R and d to find whether it intersects with the mirror, using rayIntersectMirror. We then see if there is any intervening sphere by calling rayHitsSomeSphereBefore. The goal of this is to determine a source and direction to compute the Phong illumination of some object or wall in the scene. If the mirror isn’t hit by the primary ray, we just directly look for an object or a wall by setting source to R and direction to d. If instead there is a mirror bounce, then we set source and direction to the mirror intersection and the bounced reflection ray instead. Here is the remaining code for trace: Again: that’s all there is to it. This isn’t quite the pseudocode we gave earlier, but it is similar.

The code above relies on a GLSL struct for recording and reporting intersection information. For example, our mirror check, sphere check, and wall check each return this structure. Here is its definition: If a traced ray fails to hit an object, then our code returns NO_INTERSECTION(). This is an ISect whose yes component is set to 0. If instead an object is hit, then yes is set to 1, and the next three components should hold the distance along the ray where it hit, the point where the ray hits, and the normal to the surface where it was hit. The latter two are each a struct of type vec4, a structure with xyzw components, with w of 1.0 for a point and 0.0 for a vector. The last two components give the RGB of the material’s color, and a 0/1 value indicating whether the material is matte or glossy.

We also regularly use the function bestISect, which compares two intersections, picking the closest valid one. Here is its code: If two object intersections are valid (their yes is 1) then we return the closer intersection.
When we shoot a ray into the scene, this is our way of choosing the closest intersection, the first object we hit.

I hope you are getting a sense of the nature of GLSL coding. In most cases we are working with integer, boolean, and floating point variables of type int, bool, and float. Like normal C coding, we introduce these variables with their type. And then we typically also work with 2-D, 3-D, and 4-D floating point vectors corresponding to the types vec2, vec3, and vec4. These are richly supported in GLSL. They have operations like + and *, and dot and cross, and normalize and length. I summarize these in an appendix below.

To illustrate some of this coding, let’s examine the code that is used to trace a ray to a wall, given below: We first normalize the ray’s direction as du, then project the line from R to P onto the plane’s normal to compute the ray source’s height away from the plane, making sure that it is on the correct side of the plane. Then we see whether the ray direction lines up in opposition to the surface normal. If all that works out, we record all the intersection information and return it. For your coding, you will want to write the same kind of code, but for the Bezier funhouse mirror. There is also similar code for ray-sphere intersection. It’s worth checking this code out, too, under the function named rayIntersectSphere.

Once again we have a coding project that juggles a few coordinate systems to do its work. There are the points in the scene. Within the scene editor, these are represented as Point3d objects. Their xyz coordinates are within the range [−1,1] × [0,2] × [0,2]. When looking at the editor, we are getting a top view, and points with x = 0 and z = 0 sit at the middle of the bottom. This corresponds to the front of the room. Moving spheres left and right within that editor view decreases and increases their x coordinate, and also moves them left and right in the room. Moving them forward and backward pushes them away from, and then moves them closer to, the front of the room. This increases and decreases their z coordinate. Items sit on the floor at the plane y = 0. The tops of spheres have more positive y values than lower parts of the spheres.

Note that this means that the scene coordinates are interpreted in a left-handed coordinate system. The x axis points right in the rendered view of the room, and the y axis points upward. And then our view of the scene is in the positive z direction. The center of projection for the ray tracer is a point that sits at (0, 1, −1). It is within the middle of the entrance square, but sits one unit behind it. We shoot rays from that point to each point on the square entrance wall to make our traced scene.

Lastly, the curve points for the funhouse mirror are sent to the shaders as 2-D coordinates. These have components xy but actually correspond to the x and z coordinates of the points on the base of the mirror (with y = 0 because the base sits on the floor). Consider the image of the editor below: in the scene, we have the two curve control points on the left with x = −0.8. The other control point is at x = 0.8. The top two control points are at z = 1.2. The lower point’s z = 0.4. So, in the editor code the top left point sits at Point3d(-0.8, 0.0, 1.2). But then these are passed to the GLSL shader code as vec2(-0.8, 1.2).

Here I give you a plan for completing the assignment.
Most of the technical work relies on your continuing application of vector and affine calculations, and also on some of the key details we emphasized about Bezier curves. I won’t review those here; instead I offer an approach that fits the bill, references some of that material, and orients you to the places where you’ll do your coding. I should first note that there are technically only three places that require your changes. You can find them by grep-ping for COMPLETE THIS CODE in the Javascript, and CHANGE THIS CODE in the HTML/GLSL.

As an immediate warm-up to your task, I recommend getting the editor to display a curve for the given control points when in Bezier mode. You need to modify the compile method of Curve so that it creates a list of Point3d objects that are a smooth-enough set of samples of the Bezier curve. These should have their xz coordinates vary, and should sit on the floor with y = 0, just like the control points. One way to get this working is to just evaluate the polynomials with a regular sampling of the parameter interval from 0.0 to 1.0. Figure 5 shows a regularly spaced rendering performed this way, with only 17 points, just to illustrate the idea. This is fine for a start, but I’d ultimately rather have you use this exercise as a way of learning deCasteljau’s subdivision scheme. This is naturally recursive. Each attempt to compile a curve with control points P0, P1, P2 relies on two recursive calls to evaluate “left” and “right” curves with, say, control points L0, L1, L2 and R0, R1, R2. I’ll leave you to your notes to remind yourself of the formulation. This strategy requires a base case. In my solution, I use 1.0/SMOOTHNESS as an “epsilon tolerance” for the two sides P0P1 and P1P2 of the polyline to be close enough to approximating their underlying curve. Figure 6 shows an adaptive rendering performed this way, with a low SMOOTHNESS (high tolerance), just to emphasize the approach. Having set this.points with enough, and appropriate, samples, the editor should display the curve.

The next thing you must do is write the code for rayIntersectMirror in the GLSL ray tracer so that it displays a funhouse mirror. It is written to rely on the function with the C prototype: It assumes a tall Bezier-shaped mirror is specified by “floor coordinates” suggested by the 2-D coordinates of control points cp0, cp1, and cp2. It should return a struct of type ISect that describes the intersection of the ray with that mirror. My approach to this had me approximate the mirror with a series of planar panels, regularly spaced (in terms of the curve parameter) along the curve. Using this approach I built a helper function rayIntersectPanel that performs ray-panel intersection for a floor-standing mirror whose footprint is a line segment. This relied on rayIntersectPlane, provided in the code. You can try out this code by just displaying the polyline-bottomed mirror of the curve control points. This will be satisfying to get working first, and you can test it by moving the mirror around.

This isn’t quite what we want. We’d like a smoother approximation of the geometry of the curve, and we’d like no discontinuities in the reflection in the mirror. To tackle the latter concern, my code performs a smooth interpolation between two normals: one for the left side of the panel, one for the right side of the panel. That interpolation is an affine one. The combination weight can be calculated based on the position of the intersection with the panel. A sketch of both ideas, the adaptive deCasteljau sampling and the normal blend, appears just below.
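Since the handout leaves the deCasteljau details to your notes, here is a small Python sketch of the underlying math: a quadratic split step, a flatness test driven by a SMOOTHNESS-style tolerance, and the affine blend of two panel normals. It only illustrates the calculations; the real code belongs in the Javascript Curve.compile method and in the GLSL shader, and none of the helper names below come from the starter code.

# Sketch of the math only; points are (x, z) pairs on the floor.

def midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def decasteljau_split(p0, p1, p2):
    # Split a quadratic Bezier into "left" and "right" halves.
    l1, r1 = midpoint(p0, p1), midpoint(p1, p2)
    m = midpoint(l1, r1)                 # the point on the curve at t = 1/2
    return (p0, l1, m), (m, r1, p2)

def flat_enough(p0, p1, p2, eps):
    # Crude flatness test: how far the middle control point strays from the chord midpoint.
    cx, cz = midpoint(p0, p2)
    return abs(p1[0] - cx) + abs(p1[1] - cz) < eps

def sample(p0, p1, p2, eps):
    # Adaptive sampling: recurse until each piece is flat enough, then keep its endpoints.
    if flat_enough(p0, p1, p2, eps):
        return [p0, p2]
    left, right = decasteljau_split(p0, p1, p2)
    return sample(*left, eps)[:-1] + sample(*right, eps)
    # e.g. points = sample(cp0, cp1, cp2, 1.0 / SMOOTHNESS)

def blend_normals(n_left, n_right, t):
    # Affine blend of the two panel-edge normals, t in [0, 1] across the panel, then renormalize.
    nx = (1 - t) * n_left[0] + t * n_right[0]
    nz = (1 - t) * n_left[1] + t * n_right[1]
    length = (nx * nx + nz * nz) ** 0.5
    return (nx / length, nz / length)

Note that the GLSL version of the sampling cannot use this kind of recursion; as the next paragraph explains, it has to be unrolled to a fixed depth.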
Figure 7 shows my solution before adding interpolation of the normal vector. To get smoothness in the geometry you can subdivide a fixed number of times. This “fixed number” is necessary because of the limits of GLSL. It is not possible to write recursive code in GLSL, and it is not possible to write indefinite loops. The shader compiler needs to know that the code will terminate when run on the GPU, and so it limits the kinds of programming we do. For my solution I wrote a recBezier1, a recBezier2, etc. to handle the different depths of the recursion.

If you complete Step 2 in some way, you won’t see any shadows. This is because the shadow computation relies on a different function, rayHitsBezierBefore. The spec for this function is similar to rayIntersectsBezier, but it instead returns true or false, returning true when there is a ray-mirror intersection. This code is called by the Phong shading function to check whether a surface point is occluded by some intervening object between the light and the object. And so the code needs to check whether the mirror blocks the light between its location and the surface point that may be in shadow. The distance value should be checked to see whether, when the mirror is intersected by the given ray, it is intersected within a certain distance along the ray. This code can just be a modification of your approach in Step 2, refigured for a bool return type and for a distance.

Lastly, redo the work of Step 2 and Step 3 so that they give a smooth-enough approximation to the curved mirror. If you interpolate the normals as suggested, then you’ll find that a subdivision into 16 or 32 panels is good enough. Congrats! You are done. Check out some bells and whistles below, or devise your own.

The steps above are probably the best way to approach completion of the assignment. You’ll get a taste for computing with Bezier curves in Javascript by completing the first step. You’ll get a taste for GLSL ray tracing with the second, followed by an easy change with the third. The fourth step is the main assignment, and hopefully you’ll have had enough practice with GLSL to tackle it. Step 4 is mostly just the marrying of ideas from Step 1 with the coding of Steps 2 and 3.

Some bells and whistles you might consider: Change the code so that the light source isn’t a point light source. This would require you to shoot several rays to a light to compute softer shadows. Alternatively, you could compute a penumbra. Our code doesn’t look for mirror hits when tracing a secondary ray. This means that a curved mirror cannot appear in its own reflection. You can rewrite the code so that it shoots tertiary, etc. rays if a bounced ray would hit a mirror. Similarly, you could allow the editor to place multiple spherical mirrors, or else a mix of sphere and funhouse mirrors. We could check whether a point on the wall or a sphere is hit indirectly by light bouncing off the mirror. Our funhouse mirror can be thought of as a bi-quadratic Bezier patch, but one where the 9 control points come in three sets of collinear control points. You could generalize the mirror to instead be an arbitrary patch, figuring out a way to have all nine control points editable, and then also figure out the mirror calculations. Note that a deCasteljau subdivision can easily be built for Bezier patches, so you can write the ray-patch intersection code using a method similar to our funhouse intersection. You could add “physics” to the scene so that the spheres bounce around, or the mirror moves. This would update in real time, and would be no more expensive to render than a static scene.
Change the code so that the viewpoint and view direction moves, making the scene something that’s flown through. In order to deal with walls, you might change the scale of the ray trace, making the virtual screen smaller, and also making the view frustum smaller (putting the center of projection closer to the screen). We approximate the reflection of spheres using Phong shading. We could instead, for example, shoot secondary rays from glossy objects. If we shoot several reflective rays from a sphere, we can get several samples of light that’s hitting the surface Building from the above suggestion, we could have a curved mirror that isn’t a perfect mirror. Instead, for example, it could diffusely reflect a color by collecting samples of light from several reflected directions using stochastic ray tracing.The three kinds of vector structs used in GLSL are vec2, vec3, and vec4. Each have components of type float. Their components can be selected with xy, with xyz, and with xyzw. The vec3 type is also used for RGB colors. In actuality, GLSL often uses vec4 for colors, including the additional “alpha” component for transparency. Our code uses vec3 for RGB color values.We build vectors with their constructors. There is some flexibility in doing so. For example, u = vec3(1.0,0.0,0.0); sets u to be the unit vector in the x direction. To make it include a w component, we can use v = vec4(u,0.0) to create it as vector v.We can use the operations * and / to scale vectors v by a scalar a, for example, with the expressions v * a and v / a. You can add vectors componentwise with u + v. You can compute vector’s dot product with dot(u,v), which returns a float. You can also compute cross(u,v), but this is only defined if both u and v are of type vec3.We can normalize(u) to compute a vector whose direction is the same as u but with unit length. You can get the length of a vector with length(u).Many standard mathematical functions are defined on the float type, including abs, max, pow, sqrt, sin, and cos.
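To connect those vector operations back to the wall-intersection routine described earlier, here is a minimal Python sketch of a ray-plane intersection under the same conventions: R is the ray source, d its direction, P a point on the plane, and n the outward plane normal. It is only an illustration of the math; the shader's actual version also packages its result into the ISect struct and uses the GLSL vector types summarized above.

import numpy as np

def ray_intersect_plane(R, d, P, n):
    # Return (hit, t, point) for the ray R + t*du against the plane through P with normal n.
    R, d, P, n = (np.asarray(v, float) for v in (R, d, P, n))
    du = d / np.linalg.norm(d)             # normalize the ray direction
    n = n / np.linalg.norm(n)
    height = np.dot(R - P, n)              # signed distance of the ray source from the plane
    facing = np.dot(du, n)                 # the ray must run against the normal to hit the front
    if height <= 0.0 or facing >= 0.0:
        return False, None, None           # behind the plane, or pointing away from it
    t = height / -facing                   # distance along the ray to the plane
    return True, t, R + t * du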
Here we work to write code that performs the analysis of a mesh that yields a subdivision surface. The resulting application will have you making several of these surfaces, and testing them with a game that checks the result.

On subdivision
Subdivision surfaces are a family of schemes for computing the geometry of smoothly varying surfaces from an initial, coarsely defined, polyhedral surface. We looked at several: the triangular mesh subdividing and averaging scheme of Loop, Catmull-Clark surfaces which work with a mesh of quadrilateral faces, Doo-Sabin’s dual scheme on quads, and the butterfly scheme for interpolating subdivision. Several of these are the result of taking known subdivision schemes that worked with regular patches. Catmull-Clark’s is just a subdivision scheme for bicubic B-spline patches, generalized to handle the case when 3 patches meet at a vertex, or 5 or more patches meet at a vertex.

All the schemes are an iterative process. From an initial mesh M0, they provide a means for computing a mesh M1, then M2, and so on. Each successive mesh calculation introduces (roughly) twice as many vertices, and 4x as many faces. We split the edges and faces, introducing new vertices, and we apply an averaging rule to obtain positions for all these vertices, devised as a function of the coarser vertices of the components that were split. The mesh sequence converges to a limit surface M∞ with good continuity properties. In Loop’s scheme, for example, the limit surface corresponds to a parametrized surface that has C2 continuity. The exception is at the points whose vertices in M0 were connected to a number of neighbor vertices different from 6. Even at these extraordinary vertices the surface has G2 continuity. In practice, like we did with curves in the last program, we terminate the subdivision after enough steps k, and then display that smooth-enough faceted surface, depicting Mk.

On surfaces
We work with a data structure that represents oriented surfaces. The winged half-edge mesh is an intricately linked data structure whose key component is a directed edge that links two vertices. Each such half-edge is part of a cyclic doubly-linked structure that loops around as the border of a mesh face. Also, each half-edge has a twin going in the opposite direction, and that twin is on the border of a neighboring face. By founding all the geometry of the mesh on cyclic orderings of half-edges, we get oriented faces, and we have an oriented understanding of the neighborhood of components around each vertex, edge, and face. We can walk around the surface, and reverse our walk to return back to where we started.

Here is the starter code for this assignment. If you download this code, you’ll find three important source code files. You are to write the subdivide method of class Surface to perform Loop subdivision. I describe the data structure and how it can be subdivided in the sections below.

First examine the starter code in surface.js. This gives the definition of three classes: Vertex, Edge, and Face. These work to represent the components of our surface mesh data structure, and they are laid out as we described in lecture. In particular, the Edge class is most central. Were this code written in C++, its instances would conform to this struct definition: The meaning of each of these components, excepting split, is depicted in the figure. Each Edge instance e corresponds to a directed half-edge on a meshed surface. It runs between vertex e.source and e.target.
To its left is its e.face. Also, e.prev and e.next are the other two edges that define that face, in counter-clockwise order, so that the normal of the face is directed out of the page. That face has a neighboring face, which can be found by referencing e.twin.face. The two half-edges e and e.twin define the hinge between those two faces.

A Face has this corresponding struct: Because of the data housed in Edge, we only need one f.edge to walk around a face f. We can compute f.edge.target to obtain the first vertex on the face, then f.edge.next.target to compute its second, and then finally f.edge.next.next.target to obtain the third vertex. Walking around the edges from a face in this way, we obtain the three vertices that make up the triangular face. We depict this edge walk around a face in the figure.

A Vertex is similarly succinct: We can describe the “fan” of edges around a vertex v with the sequence: These walk around the edges of that vertex fan in counter-clockwise fashion. We depict this fan just below in this figure.

These three classes are put to use by class Surface, defined below them. In that class, we can construct an empty surface, build all the vertices from their positions, and then stitch them all together by making faces and edges. An example of that kind of work lives in the method Surface.read, which processes an Alias/Wavefront .OBJ file. In that file format, there is a series of lines starting with v that describe all the vertex positions of a meshed surface, and then a set of lines starting with f giving a sequence of vertices around a face, specified by vertex number. (See, for example, the tetra.obj text within the file cycle-subdivide.js.) The code for read scans in each of these vertex lines, calling the method this.makeVertex for each one. It then calls the method this.makeFace for each line describing a face. All the stitching together of vertices and faces, by creation of half-edges and their twins, is performed within makeFace. The result is an oriented surface as specified by that .OBJ.

When you run the application, you can view each of these surfaces by selecting their radio button item in the surface list. If you examine the .html, you’ll see the corresponding data as file text. Within the main cycle-subdivide.js code, the current surface object being shown is gSurface and has the name gSurfaceChoice. The surface is drawn by the Surface methods glRender and glRenderMesh. These renderings rely on the surface being pre-compiled, with glBegin and glEnd, by the methods glCompile and glCompileMesh. Finally, when the program user presses p on the keyboard, the code calls gSurface.subdivide. This code should produce a new Surface that is the result of Loop subdivision. We describe this in the next two sections.

Loop’s subdivision scheme needs to be translated into our code for Surface.subdivide. Let S be the surface being subdivided, and R be the refined S resulting from Loop subdivision. Then the method should perform these steps: The surface R that results from this face and edge splitting is topologically similar to the picture of the tetrahedron below. Every triangular face becomes four triangular faces. Every half-edge is split into two half-edges. To keep track of the newly introduced vertices in Step 1, so that they can be referenced in Step 3, the starter code includes an attribute v.clone for every vertex object v.
When you clone a vertex v of S, you can set v.clone to be v’s vertex clone in R. Similarly, for Step 2, the starter code includes an attribute e.split for every half-edge Edge instance e. When you split a half-edge e of S, you can set e.split to that new split vertex. Note that you do not want to create two split vertices for an edge. You might make that mistake if you handle splitting two twinned half-edges independently. Instead, when you introduce a split vertex, you want to associate it with both half-edge Edge objects that are being split.

Smoothing
Though the above process gives us a refined mesh with the correct topology, it does not put the clone and split vertices in the correct positions. We must also compute the positions of each split vertex and each clone vertex. The cloned vertices should not sit at the positions of the corners that were cloned. The split vertices should not just sit at the midpoints of each edge they split. Instead, when each vertex of R is introduced, its position should be a weighted average of the positions of nearby vertices of S.

A split vertex position should be computed by a weighted-average formula in which P0 and P1 are the endpoints of the edge being split, and Q0 and Q1 are the positions of the corners of the two faces that share the split edge. You create a split vertex in R for each pair of half-edges in S. Use a call to makeVertex and give it a position as computed above. Make sure you only add one vertex for each pair of half-edge twins.

A cloned vertex gets smoothed by taking a weighted average of its position with all of its vertex neighbors. A cloned vertex position should be computed by a formula in which P is the position of the vertex being cloned, and each Pi is the position of the target of an outgoing edge from that cloned vertex. In that formula, the constant β is a weight computed from the number of neighboring vertices. You will create several clone vertices in R, cloning each vertex in S. Use a call to makeVertex, giving it a position as computed above.

The starter code includes the controls and display for a mini-game that you can play to verify that your subdivision of the mesh is correct. The game mimics a scene from the 1980s movie Tron. You can drive a “light cycle” around on your surface. You press [space] to start the game and use ijkl to control the cycle’s movement. The game’s action just performs a walk on the surface of the object. It places a Tron cycle on the 0-th face of the surface, oriented perpendicular to that face’s first edge. It then walks from face to face and leaves a trail by painting the faces it hits. This walk is animated somewhat smoothly, with several steps making a traversal from the midpoint of the face edge to the midpoint of an edge that’s an exit to either a “left” or “right” next face. This face traversal takes several frames.

The current set-up allows a player to steer the cycle left or right. By hitting the j key, the application will then choose the left exit face when it completes the drive over its current face. By hitting the l key, the application chooses the right exit face. Otherwise, the cycle makes a series of alternating left and right steps (it goes to the left exit face, then goes to the right exit face, then takes the left exit face, and so on) so that the path taken by the cycle is roughly straight ahead, driven along a strip of triangles. This activity is represented in cycle.js, and the behavior is described as the methods of a class Cycle. The main application code in cycle-subdivide.js just invokes these methods.
It also calls the Cycle methods for rendering the cycle on the surface, and for rendering the surface from the perspective of the player sitting on the surface.

I recommend completing the code for subdivide in two stages.

Stage 1: get the topology correct. In the first stage you introduce all the clone and split vertices for the new surface, stitch them together, and return that surface without smoothing their positions. When you get this working correctly, your refined surface meshes will still have their original shape, just with more and more triangular patches. Below are images of the tetrahedron before and after performing a simple Stage 1 subdivision, once and twice. Note that you can get the positions correct (cloning the vertices, computing simple midpoints when you split edges) but not actually link up the components correctly. The way this would likely manifest itself is that you’ll have multiple, redundant copies of some components, and you may have components that fail to be linked up. You can use the Tron game to test your work. Hopefully the light cycle drives around on your subdivided surface just fine. If so, that probably means that you stitched the elements of the surface together correctly. If instead there are bugs in how you constructed the subdivision, the code that moves the cycle will crash. An attempt to move the cycle from one element in the mesh to another might then fail because the Javascript code requests info through a null reference to a neighboring object. You might also find it useful to write code that separately verifies that all the vertices, edges, and faces are in proper relation with each other (e.g., e.twin.twin == e, v.edge.source == v, etc.).

Stage 2: get the smoothing correct. In the second stage, figure out how to compute the smoothed vertex positions. It is here that you might get some strange, but viewable, errors. If you use the wrong formula, or select neighboring vertices incorrectly, the surface geometry can go quite awry. Below are images of the tetrahedron, subdivided once, twice, and thrice. Mistakes here are usually much easier to spot. For example, the pictures below show what happened when my code used the incorrect vertex locations to perform smoothing. It also had the wrong stopping conditions in the loop around each clone vertex.

There is just one hand-in for this assignment. Hand in a working version of the subdivision code, having correctly completed Surface.subdivide. You should be able to complete this code in less than a week, maybe with three evenings of effort (first evening: Stage 1; second evening: buggy Stage 2; third evening: everything fixed). It’s about 40 lines of code, and most of the tricky work is handled by makeVertex and makeFace. Hand in screenshots of your subdivision code working on a complex object, say, the fox. Also include a screenshot of the subdivided object after the cycle has painted quite a bit on it.
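For reference while working on Stage 2, here is a Python sketch of the two positioning rules using the standard weights from Loop's scheme (3/8 and 1/8 for split vertices, and Loop's β for clone vertices). The handout's own formulas appear as figures in the original, so check these weights against your lecture notes before relying on them; the function names and the tuple representation of positions are only illustrative.

import math

def split_vertex_position(P0, P1, Q0, Q1):
    # Standard Loop rule for a new edge (split) vertex: 3/8 of each edge endpoint,
    # 1/8 of each of the two opposite corners of the faces sharing the edge.
    return tuple(0.375 * (p0 + p1) + 0.125 * (q0 + q1)
                 for p0, p1, q0, q1 in zip(P0, P1, Q0, Q1))

def loop_beta(n):
    # Loop's weight for an original (clone) vertex with n neighbors.
    return (1.0 / n) * (0.625 - (0.375 + 0.25 * math.cos(2.0 * math.pi / n)) ** 2)

def clone_vertex_position(P, neighbors):
    # Standard Loop rule for a clone vertex: (1 - n*beta) of its old position
    # plus beta times each of its n neighboring vertex positions.
    n = len(neighbors)
    beta = loop_beta(n)
    return tuple((1.0 - n * beta) * p + beta * sum(q[i] for q in neighbors)
                 for i, p in enumerate(P))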
The flip book app that we worked on for Project 3 has a particularly unsatisfying quality to it. Editing the camera’s path can be tedious, and so a walk-through creator will tend to place very few camera positions into the scene. This leads to a very jarring animation in the resulting PDF, with few pages to flip through. The only option for an editor is to place a lot of cameras along the path, and have them close together, so as to have a smooth animation of the walk through the scene. As we just noted, this isn’t easy to do in the app.

For this assignment we fix that issue. We change the editor so that the explicitly placed camera sites are “control points” that help to define a path. We then produce a smoothly tracking collection of camera shots that form a much larger number of frames in the walk-through. To do this, we use the curve technology we described in lecture over the last several weeks.

A sample solution for this Project 4 assignment is shown in Figure 1. Just as in the old app, a walk-through creator has laid out a series of camera shots as a polygonal path. And then the app has computed a series of “in-between” camera locations and view directions that smoothly vary from shot to shot. For each placed shot, excepting the first and the last, the app produces 16 frames. Each frame is built by taking camera information from a few of the explicitly laid out shots, using some combination of them to blend their information into a novel in-between camera shot. A flip-book that results from this smooth path can be downloaded using this link. There is a page for each frame, one for every in-between shot that the app computes.

You can just work from the code you were building in Project 3. Even if the code for generating PDFs isn’t fully working, you can still enhance it to complete the work of this project. You’ll add code that draws the in-between camera shots. (And if you complete a bonus exercise, you’ll add code that allows the app user to run a movie of the smooth walk-through on the preview screen just to the right of the editor.) For those of you still trying to get Project 3 completed, this project gives you some extra time to work out the bugs of your PDF generation and submit an update for some partial credit. (Those of you with a working, or near-working, Project 3 will just get this additional credit for that work here.)

Edit the WalkThru class so that it can compute a sequence of SceneCamera objects that smoothly vary according to its array this.shots. Your method should use Chaikin’s “corner cutting” algorithm applied to the control polygon given by that sequence. You should use the version of Chaikin subdivision that starts and finishes with the begin and end shots, but smooths out the corners specified by the other shots. If the edited shots are the sequence of placed cameras, then you’ll compute (via Chaikin’s scheme) a set of SceneCamera objects that are the result of applying Chaikin’s “split and average” step four times. The in-between camera placements should be depicted in the walk-through editor view. The smoother walk-through should be used to generate the pages of the PDF generated by the [to PDF] button.

Below I take you through some of the details of what’s needed. You need to compute a sequence of camera positions from the sequence of user-specified shots. The user-specified ones are held in the array gWalkThru.shots managed by the code in flip-book.js.
The sequence that you compute should be the one that results from a fairly straightforward application of Chaikin’s scheme that we outlined in Lecture 07-1. We cut corners with a 3/4 and 1/4 weighting rule applied to the line segments of each corner, computing two points for each corner.

For this coding, I recommend you define a method combo(amount, other) in class Shot that performs a convex combination of one camera shot (this) with some other camera shot (other) according to a parameter amount. That parameter is assumed to be a value between 0.0 and 1.0. This is easy to determine for the position of the two Shot objects using the geometry libraries as below: With this defined, you can then define a Chaikin subdivision of a sequence of shots by repeatedly applying your Shot.combo method with weights of 0.75 and 0.25.

Writing the portion of Shot.combo that handles position is easy. Combining this.direction and other.direction is less straightforward. I say more about this calculation just below.

In our application, a shot’s orientation is given by a unit vector living within the plane of the floor. We need to smoothly vary this direction from shot to shot for the calculation of our curved walk-through. Since we are using subdivision to perform this work, we just need a way of defining the linear interpolation between two camera directions. This would be provided by the combo method of Shot just mentioned above. The method needs to build a new direction from this.direction and other.direction. I’ll let you work out the details of this. As inspiration, recall that in Lecture 07-2 we described a scheme for interpolating between two quaternions q0 and q1, with a q(α) parameterized by α ∈ [0,1]. It was defined as follows: You could do something similar, although you need not use quaternions to accomplish this. It can be a bit simpler, since direction is essentially just a unit vector in 2-D.

One issue that might arise in your handling of camera shot directions is that there are several ways to sweep (i.e., spin) the camera direction from one shot to the next. For example, suppose you model a shot direction as an angle within the plane. When you move from a shot angle θ0 to a shot angle θ1, it might be more natural to instead move to a shot angle of θ1 ± 2π (in radians). And this choice might depend on what the prior camera directions were and/or where the next directions are headed. You need not address this issue in your submitted code. If you do, however, please tell me how you addressed it.

You’ll want the editor to show the curved path rather than the control polygon of the curved path. To do this, you should change the code for drawCameraPath in flip-book.js so that it renders the curve instead of the polygon. You’ll find two loops in that code. One loop renders each edge of the control polygon. A second loop draws each of the camera placements. Modify the first loop so that it shows all the in-between shots and their directions.

BONUS
A bonus exercise is to add a feature so that an editor can view a movie of the walk-through within the app’s display, rather than generating the PDF. A user could press, say, the [SPACE] key and that could kick off an animation sequence in the shot preview. You can control an animation by a set of global variables that indicate whether the app should be animating the walk-through sequence, and how far along it is.
You append some code at the end of the draw function that requests that another frame gets drawn right after a frame was just drawn. Here is some code I used for my own solution:In the above, I increment a variable gTick each time draw is called and I am animating the walk-through preview. Every 100th time, I advance the frame of that animation to the next shot. If I’ve hit the last shot in the sequence, I stop the animation. And then I request another call to draw with glutPostRedisplay();.Submit the full code that supports the application by the project deadline. Also submit a submitted.md or submitted.txt file that describes the status of your code. Please include your name in that file to make it easy for me to look at your submitted work. Provide a sample .PDF of a walk-through that you set up, along with a screen shot of the scene set-up that produced it.As usual, it’s always best to submit what you’ve completed by the deadline, even if it’s not fully working. Give details of where things stand in your submitted file. And then feel free to later revise what’s submitted at a later date when you get more things working.
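Putting the pieces above together, here is a small Python sketch of the endpoint-preserving Chaikin step applied to a list of shots, using a combo-style convex combination with the 3/4 and 1/4 weights. The tuple representation of a shot and the naive angle blend are illustrative assumptions; the real implementation belongs in Shot.combo and the WalkThru class in flip-book.js, and it must also deal with the angle wrap-around caveat mentioned above.

def combo(shot_a, shot_b, amount):
    # Convex combination of two (position, heading-angle) shots, amount in [0, 1].
    (xa, ya), ta = shot_a
    (xb, yb), tb = shot_b
    x = (1 - amount) * xa + amount * xb
    y = (1 - amount) * ya + amount * yb
    theta = (1 - amount) * ta + amount * tb    # naive blend; see the theta +/- 2*pi caveat above
    return ((x, y), theta)

def chaikin_step(shots):
    # One "split and average" pass that keeps the endpoints and cuts every corner.
    out = [shots[0]]
    for a, b in zip(shots, shots[1:]):
        out.append(combo(a, b, 0.25))
        out.append(combo(a, b, 0.75))
    out.append(shots[-1])
    return out

def smooth_path(shots, passes=4):
    # Four passes, matching the handout's "applied four times" requirement.
    for _ in range(passes):
        shots = chaikin_step(shots)
    return shots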
Throughout this semester, we learn methods for producing photo-realistic renderings of virtual scenes. These methods do their best, given computing resource constraints, to model the physics of light transport, of how real materials reflect, absorb, and transmit light, and of how cameras capture the light of a scene to record an image. They rely on the mathematics we have covered recently for tracking and relating geometric points and directions in 3-space.

For this assignment we take a different approach to computer rendering, one that was used in earlier graphics systems, before photo-realistic rendering was technologically feasible. We write code that produces line drawings of geometric objects. The results are not photo-realistic, but the methods use a lot of the same principles, and certainly many of the geometric calculations, used in photo-realistic rendering.

To make our drawings of 3-D scenes, we use ray casting to compute a perspective projection of what the camera sees, depicting objects as wireframes but with hidden lines removed. Your completed code will take as input a "walk-through" of a scene made up of objects, given as a path of camera positions. It will compute a rendering of the walk-through as a "flip book": a series of pages of flat line drawings that can be flipped through to animate the walk. In doing so, you'll practice using the affine geometry primitive operations defined in geometry-2d.js and geometry-3d.js, which define point and vector operations in 2-D and in 3-D. These, along with the mesh specifications given by each placed SceneObject, allow your code to turn the 3-D scene into a PDF document.

Below I take you through several parts of the project, orient you to how everything is meant to work together, and then take you through the steps you can use to complete the project.

Written part
The programming requires coding of a few geometric operations, but you'll want to first work out these operations on paper before you write their code. I've packaged these operations as a set of written problems you should solve, and hand in, so that you can make progress and plan for the coding.

Playing with the walk-through editor
If you load flip-book.html within your browser, you will see a user interface as shown in Figure 1. The WebGL canvas is split into two parts: the left dark blue part allows you to place several objects on the floor of the scene, and also to lay out a path of camera shots that walk through the scene. Objects can be placed and dragged with mouse clicks. Cameras can be placed and moved by clicking and dragging the mouse with a press of the middle mouse button (or, alternatively, by holding the SHIFT key while you click).

The right part of the display shows a square region of the scene from the currently selected camera's perspective. It shows the objects as black wireframe meshes with hidden lines removed. This is just using WebGL's standard hardware algorithms for doing that work. It serves as a preview of what your code will render as a PDF.

You'll notice that each camera is sitting on the floor of the scene, at the same level as the bases of the objects, and looking in a particular x–y direction along the floor (the z direction is upward). You can change the direction of a camera by clicking on the preview square and dragging its perspective left and right. The editor limits this view change to being a pan around the up direction of the camera.
The camera's view direction can only be rotated within the x–y plane.

The buttons sitting just below the scene editor allow you to select which object you'd like to place. There are several faceted objects, including lower-resolution versions of the Stanford bunny and the Utah teapot. The descriptions of these objects are embedded within the HTML as Alias/Wavefront .OBJ files. When the user presses one of these buttons, the code loads the mesh description of that object.

Once a user sets up a scene and a series of shots, they click the toPDF button to set the wheels in motion, running the code that you need to write, which leads to your browser downloading a PDF document of the walk-through, one that was just freshly computed by your code.

Starting code
In the code's current state, clicking toPDF will perform a simple rendering of the scene as a collection of "see-through" wireframe meshes, using only an orthographic projection looking directly to the right from each of the camera shot locations. There will be a page for each camera shot, but the scene will look nearly the same on each one because of the orthographic projections that look directly to the right.

Your assignment is to change the code to produce the proper images in the PDF file. When completed fully, you'll have a sequence of PDF pages that look like the preview shots of the editor.

Tester code
In addition to the WebGL application run by flip-book.html, which runs the code in flip-book.js, the starter code folder also contains two other WebGL applications that share use of the code you write. You can use these applications to test and debug your code without needing to do so within the flip-book application. They use the same critical code you'll write to complete the toPDF code, and they are described briefly in two appendices at the bottom of this page.

Modify the toPDF functionality of the web application so that it performs a proper perspective drawing of the scene's objects based on the specified camera shots. Each page of the PDF should mimic the WebGL rendering by judiciously marking lines onto the pages of the PDF. This coding can be done within the file walk-thru.js, using some supporting geometric functions you'll write in walk-thru-library.js. In particular, the editor program makes a WalkThru object whose class definition is in walk-thru.js. The editor calls the walk-through object's toPDF method to assemble the PDF document.

You'll write the code used by this toPDF method, particularly its use of a collection of object placements that get built as SceneObject instances to provide each object's geometry, and its use of camera Shot objects that get built as SceneCamera instances to provide the calculation of a perspective projection to take a snapshot of the scene.

Below I take you through a candidate solution for the project, talk about the supporting code that you'll rely upon, and then give you strategies for completing and debugging your code. There are several places in the walk-thru.js code that I've marked with // TO DO comments. Finding these will orient you to writing your solution code. The description below walks you through that work.

Let's take a guided tour through coding up the WalkThru.toPDF method. For each step of this guided tour, we will work with two scenes. One contains a single cube. The other contains two triangular hinges, one in front of the other.
This is a result of using the squares button to select and place an object.

The first thing you will want to get correct is the computation of the perspective projection of all the objects' vertex points. These are the corners of every object that lives in the 3-D scene. Every vertex v has a v.position attribute of type Point3d. You'll need to determine a Point2d location for each vertex. Your code can loop through all the vertices of each scene object and apply a perspective calculation for the camera's view.

The projection calculation should be provided by a SceneCamera object's project method. A scene camera instance is built by specifying the center of projection, a direction that the camera faces, and a general direction upward. These can be used to construct an orthonormal frame (the center forms the frame's origin, and three other directions provide an orthonormal basis) that can be used to provide the projection.

There is a class SceneCamera in walk-thru.js that provides a template for this coding. Right now, its constructor and its project method have bogus code, just to make the starter code run. You'll want to fix the constructor code so that it computes the frame's three directions, the ones the discussion below calls right, up, and into. These should be computed from center and toward, which are provided to the constructor from the shot editor GUI. There is also an upward direction. For this code, it is just given in the z-direction. It is aligned with the central axes of each scene object. Each object's base sits in the x–y plane at z = 0, and its central axis points upward. The toward direction, as currently given by the GUI, is already perpendicular to the z-direction and so has a z component of 0. The center of projection provided by the GUI lies on the floor, and so it also has a z component of 0.

Once you've completed this coding for Step 1 you have computed the orthonormal frame of a scene camera for its perspective projection. These calculations are the ones you devise for Problem 1 of the written portion of the assignment.

Now that we've written code for computing an orthonormal frame for each SceneCamera, the next step (Step 2) is to compute the perspective projection. We take the 3-D location of each vertex of an object in the scene and "splat" it onto a virtual sheet of paper of the PDF. When you've completed the coding for Step 2 you will produce pages of drawings, one page for each camera shot. The objects will be "see through" wireframes, not faceted surfaces. Because they will result from a perspective calculation, they will be warped with foreshortening (etc.) rather than what was obtained with the orthographic projection.
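Before sizing up the projection, here is a minimal sketch of the Step 1 frame construction just described. It is not the official solution: I assume Vector3d offers unit() and cross() methods (the actual names in geometry-3d.js may differ), and the cross-product order that gives "right" is a convention you should double-check against the editor's preview.

```js
// A hedged sketch of the SceneCamera constructor's frame set-up.
// Assumes Vector3d has unit() and cross(); check geometry-3d.js for the
// real names. Flip a cross-product order if your pages come out mirrored.
class SceneCameraSketch {
  constructor(center, toward, upward) {
    this.center = center;               // Point3d: the center of projection
    this.into   = toward.unit();        // unit vector looking into the scene
    // `right` is perpendicular to both the view direction and the upward
    // direction (one common choice of orientation).
    this.right  = this.into.cross(upward).unit();
    // Recompute `up` from the other two so the frame is exactly orthonormal,
    // even if `upward` were only approximately perpendicular to `toward`.
    this.up     = this.right.cross(this.into).unit();
  }
}
```

With toward lying in the floor plane and upward along z, this produces the left-handed right/up/into frame that the projection discussion below relies on.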
Sizing up the perspective projection
Let's work through the set-up your code will face with each SceneCamera when it performs the perspective projection. Let's first be more clear about the coordinate systems of the scene editor. In the figure below we have a scene littered with two bunnies and two foxes. The leftmost camera location is selected, and it is pointing directly right. We see two foxes on the left field of the shot, and two bunnies on the right field of the shot. The two bunnies look to be the same height in the shot, but they are different sizes. The larger bunny's dimensions are about twice those of the smaller bunny's, but it is also about twice as far away from the camera. This is a consequence of the mathematics of the perspective projection.

If you were to inspect the 3-D points of the camera locations and the scene objects, you would learn that the origin point (0, 0, 0) sits at the left part of the scene editor, mid-way up the left side of that rectangle. You would learn that the positive y-axis is along that left side of the scene editor rectangle, with positive y pointing upward. You would learn also that the x-axis is aligned with the top and bottom edges of the scene editor rectangle. The positive x-axis points directly to the right, perpendicular to the left side of the scene editor rectangle. Then finally the positive z-axis is pointing directly out of the computer screen. Within the scene editor, we are looking down onto the objects in the negative z direction. The scene editor rectangle is two units tall. This means that the top-left corner is at (0, 1, 0) and the bottom-left corner is at (0, −1, 0).

This means that the first camera has a towards vector of (1, 0, 0), an upward vector of (0, 0, 1), and its center point is at (0, 0, 0). The second camera has a center of about (0.4, 0.2, 0). With a perspective pointing slightly right of the view of the first camera, the second camera has a towards direction of about (0.91, −0.41, 0). This vector is tilted down a little off from the x-axis's right alignment.

Now let's figure out the "view frustum" you'll use for the perspective projection. The figure below shows a different scene. I've placed two cubes into the scene. I've lined them up and sized them in a way such that they both nearly cover the square page in the shot preview. They nearly cover the whole 54mm x 54mm square of the paper. If you look at the two camera-facing sides of the two cubes within the scene editor, their corners are aligned along rays that emanate from the camera's center. This means that, within the perspective view, the vertical edges of their camera-facing corners form the left and right sides of the view's frame. This tells us that we have a perspective projection in the x-axis direction whose field of view is 90°. That is, the view frustum is a "right" pyramid, a pyramid whose top point's angle is a right angle.

It turns out that this perspective projection is the one that results from building a 1 × 2 × 2 basement cellar room and drilling a hole in the floor of the scene room right where the camera sits. The light from the scene would be projected back through the hole onto the cellar's back wall (see Figure 5). Your code is tracing the objects when their image is projected onto that back wall.

Ray casting into the scene
Let's figure out the work of that projection using ray casting. When you project the scene onto paper, you use the right, up, and into vectors as the axes of a left-handed orthonormal frame. This allows you to compute, for each vertex in the scene, its depth from the camera, and also its position on the paper.

Problem 2 on the written part asks you to figure out this calculation using the affine vector and point operations we've discussed in lecture. And then you use Vector3d.dot, Point3d.minus, and other operations as defined in geometry-3d.js to write this calculation as code.

In my solution to this step, I produce a primitive Javascript object as the result of SceneCamera.project. That result carries a vertex's projection information: essentially its 2-D location on the page along with its depth from the camera. You'll see in the starter code that SceneCamera.project is called several times within the loop of SceneObject.projectVertices.
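As a hedged illustration of the Step 2 math (not the required implementation), here is one possible shape for that projection calculation. The frame fields, the exact return value, the Point2d constructor arguments, and the mapping onto the 54mm page are all assumptions; Point3d.minus and Vector3d.dot are the operations named above.

```js
// A hedged sketch of a SceneCamera.project-style calculation, assuming the
// camera frame (center, right, up, into) built earlier. The page mapping
// (origin, y direction, margins) is an assumption; adjust to match the
// editor's preview.
function project(camera, vertexPosition) {
  const d     = vertexPosition.minus(camera.center); // Vector3d from camera to vertex
  const depth = d.dot(camera.into);                  // distance along the view direction

  // Perspective divide: where the vertex lands on the virtual film plane.
  // With a 90-degree field of view, visible points have x', y' in [-1, 1].
  const xPrime = d.dot(camera.right) / depth;
  const yPrime = d.dot(camera.up)    / depth;

  // Map [-1, 1] x [-1, 1] onto a 54mm x 54mm square of the page.
  const PAGE_MM = 54.0;
  const pageX = (xPrime + 1.0) / 2.0 * PAGE_MM;
  const pageY = (1.0 - yPrime) / 2.0 * PAGE_MM;      // PDF y typically grows downward

  // Keep the depth around; later steps reason about what sits closer to the camera.
  return { point: new Point2d(pageX, pageY), depth: depth };
}
```

Note the perspective divide: dividing the right and up components by the depth is exactly what makes the farther, larger bunny shrink to match the nearer, smaller one.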
Within SceneObject.projectVertices, we build a Javascript Map that associates to each object vertex its projection information onto the 2-D page. This Map information is kept around to be used later in Steps 3 and 4 described below.

Results
When you have completed this step correctly, if you draw each vertex's projection with a small circle and each edge as a line connecting these two dots, you will get PDFs like these:

For the cube object, the 6 sides are made up of triangle pairs, and so we are seeing the diagonals of many of these faces, including the hidden ones. For the hinged square scene we have two pairs of triangles. We are seeing all the edges here too.

These figures rely on calls to methods of the document object passed to WalkThru.toPDF. This object is provided by the jsPDF package, and its functions draw lines and dots on a credit card-sized page. We make all these calls within the code for SceneEdge.draw, and this is the method code you'll now modify for Steps 3 and 4.

In Step 2 just above, you ended up drawing all the edges of each object, including parts that are hidden by other faces. Let's now work to correct this by performing the calculations needed to break up an edge into a mix of hidden and visible line segments. We'll do this by looking at the intersections of all the projected edges. These intersection points serve as the potential "breakpoints" for depicting the edges.

If part of an edge is obscured in its middle by some face, then that part of the edge shouldn't be drawn. The start and end points of that occluded segment of the edge will occur where that face crosses the edge from the perspective of the camera. Consider the pictures below. I've added dots where edges cross. If you compare figure 2b with figure 7b, you can see that the right half of the smaller hinged square is obscured by the left part of the larger hinged square. The three edges of the smaller hinge that cross the edge of the larger hinge should only be drawn part way, up to that crossing edge. In other words, the left edge of the larger hinged square makes three breakpoints on three edges of the smaller hinged square. Each of those three edges should only be drawn to its breakpoint.

Step 3, then, should compute all the breakpoint locations along an edge so that you can draw the edge correctly in Step 4. The provided code structures this using a method SceneEdge.breakpoints. It should return an array of values between 0.0 and 1.0 where these breakpoints occur. If an edge spans points P0 and P1, then a breakpoint at value s corresponds to the point

    (1 − s) P0 + s P1.

Let s1, s2, …, sn be the series of breakpoint values for an edge. We sort them so that

    0 ≤ s1 ≤ s2 ≤ … ≤ sn ≤ 1.

These tell us that when we draw the projected edge, it is divided into n + 1 segments according to these n breakpoints. And, to consider drawing the edge, we run through those n + 1 segment pieces from P0 to P1.

Problem 3 on the written part asks you to figure out the calculation for finding the intersection between two line segments in 2-D space. You should work to formulate this intersection using only the point and vector operations we discussed in lecture, ones that are available in the geometry-2d.js library with classes Point2d and Vector2d. Note that it wouldn't be hard to do this calculation using coordinate calculations, and you can do that just to get the code working. For full credit, however, you need to devise a scheme that uses the "coordinate-free" geometric operations.
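Here is a hedged sketch of one such coordinate-free formulation. It is only an illustration of the math: the real segmentsIntersect may be expected to return something different (for example, just a Point2d or null), and the helper names plus, minus, and times are assumptions, while perp and dot are operations mentioned in this write-up.

```js
// A hedged, coordinate-free sketch of a 2-D segment/segment test.
// Segments are P0->P1 and Q0->Q1; returns null on a miss.
function segmentsIntersectSketch(P0, P1, Q0, Q1) {
  const u = P1.minus(P0);            // direction of the first segment
  const v = Q1.minus(Q0);            // direction of the second segment
  const w = Q0.minus(P0);

  const denom = u.dot(v.perp());
  if (denom === 0.0) return null;    // parallel (or degenerate) segments

  const s = w.dot(v.perp()) / denom; // parameter along P0->P1
  const t = w.dot(u.perp()) / denom; // parameter along Q0->Q1

  if (s < 0.0 || s > 1.0 || t < 0.0 || t > 1.0) return null; // crossing is off one segment

  const point = P0.plus(u.times(s)); // the crossing point itself
  return { s: s, t: t, point: point };
}
```

The parameter s that falls within [0, 1] here is exactly the kind of breakpoint value that SceneEdge.breakpoints wants to collect.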
Write this intersection code within the function segmentsIntersect in the file walk-thru-library.js. This function can then be tested on its own using the segments-intersect.html application described in Appendix 6.2. And then, ultimately, you will call segmentsIntersect within the code for SceneEdge.breakpoints, looking for all the edges that cross any particular scene edge when projected in 2-D. The method SceneEdge.breakpoints is ultimately called by SceneEdge.draw so that it knows how to draw that edge.

Results
To see this all working within the flip-book.html application, have the code for SceneEdge.draw just place dots at all the breakpoints along an edge, but still have it draw the whole edge. When you have completed this step correctly, you will get PDFs like those shown in figures 7a and 7b above.

Now that we've marked out all the intersections, we want to draw an edge as a sequence of only its visible segments. Below, I've modified the prior figures to illustrate a method for doing this. Here you see clearly that edges can have a series of included (black) and excluded (pink) portions. The excluded portions are those obscured by some face that's sitting closer to the camera.

We scan through the segments that form the edge. These segments are defined by the breakpoints we found in Step 3. For each segment, you cast a ray from the center of projection toward it to see whether it should be drawn. If some other triangle sits closer to the camera, then that subsegment should not be drawn (shown above as pink). If no triangle is hit between the camera and the segment, then we should draw that segment.

My method for checking occlusion casts a ray to some point sitting in the middle of each segment. These are shown as colored dots in the figure below. Pink ones are mid-segment points that are obscured by some closer face. Black ones are mid-segment points that are visible. We cast a ray to each segment's middle point. Rays cast to black ones don't hit any faces before they reach them. Rays cast to pink ones hit some face before they reach them.

Problem 4 on the written part asks you to work out when a ray cast from a point through another point intersects a triangular facet. You should again formulate these conditions using point and vector operations and then code them up as the code for rayFacetIntersect in the file walk-thru-library.js. This function can be tested on its own using the application facet-hit.html.

Once you've got this ray-facet intersection code working, you can then use it within the code for SceneEdge.draw. Within that code, you run through a projected edge's breakpoints, from 0.0 up to 1.0. Within each portion that runs between two breakpoints, shoot a ray in 3-D that passes through the middle of the projected edge and see what faces that ray hits in the scene. If there is some face on an object that sits in front of that portion of that edge in 3-D, then don't draw it. Otherwise draw it.

In the starter code, I've provided the template for the method SceneEdge.isSegmentVisible that can be used in SceneEdge.draw to see whether a portion of an edge is visible, between two breakpoints A and B, when viewed from a given camera shot and when looking at the collection of objects in the scene. This method can call rayFacetIntersect to do its work.
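As a hedged sketch of one way to organize that ray-facet math (not necessarily the intended formulation for rayFacetIntersect), the following tests a ray from a point R0 through a point R1 against a triangle Q1 Q2 Q3. The cross, times, and plus helper names are assumptions; dot and minus are operations named in this write-up.

```js
// A hedged sketch only. Returns null if the ray misses the facet.
function rayFacetIntersectSketch(R0, R1, Q1, Q2, Q3) {
  const dir = R1.minus(R0);                      // ray direction (not necessarily unit)
  const n   = Q2.minus(Q1).cross(Q3.minus(Q1));  // facet normal

  const denom = dir.dot(n);
  if (denom === 0.0) return null;                // ray parallel to the facet's plane

  // Where along the ray we cross the plane of the facet.
  const t = Q1.minus(R0).dot(n) / denom;
  if (t <= 0.0) return null;                     // the plane is behind the ray's start

  const P = R0.plus(dir.times(t));               // the hit point on the plane

  // Inside test: P must lie on the interior side of all three edges.
  const inside =
    Q2.minus(Q1).cross(P.minus(Q1)).dot(n) >= 0.0 &&
    Q3.minus(Q2).cross(P.minus(Q2)).dot(n) >= 0.0 &&
    Q1.minus(Q3).cross(P.minus(Q3)).dot(n) >= 0.0;

  return inside ? { t: t, point: P } : null;
}
```

For the occlusion check in isSegmentVisible, you would additionally require t < 1, so that only facets strictly between the camera and the segment's mid-point count as blocking it.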
One technicality
The code for SceneEdge.isSegmentVisible has one important technicality. When casting a ray to a portion of an edge to see if that ray hits any facets in the scene, we have to be careful about the facets that form that edge. For example, if an edge acts as a hinge between two faces, we shouldn't consider those faces when we cast the ray. If we did, we'd make the mistake of thinking that the edge is obscured by its own faces.

We can take this care because of how we built each SceneEdge object within the code SceneObject.projectEdges used by WalkThru.toPDF. Each SceneEdge is constructed with the faces given by e.faces() when we project each edge e of a SceneObject. For an edge that forms a hinge, this will be an array of two faces that we can use in isSegmentVisible. For an edge that forms the boundary of a hole in the surface of an object, e.faces() will be an array of only one face. Whatever you do, you need to ensure that your code does not think that faces hide their own edges.

Completing the four steps above allows you to do the coding necessary to get WalkThru.toPDF working to produce a proper flip book PDF of the walk-through of a scene. You don't have to use my code and follow my template (you can excise all that code and write it from scratch); however, I found that the above code organization made sense. As I laid out the code for WalkThru.toPDF as provided, it led to natural constructors and methods to code up for all the TO DO sections, and these all worked together as a strategy for generating the correct PDF.

In summary: hand in your application along with enough documentation to help me understand and give feedback on the work that you have completed. At this point, if you've completed these steps, your code should be working and your coding work is complete.

If you were quickly successful in completing the work, then you might take the assignment a bit further. Here are a few ideas for "bells and whistles" that you could add to your submission:

• The shot and object editing can be a little wonky. Feel free to dig through its code and make any enhancements. One easy example can make shot editing much better: when a user is selecting the camera path, you could allow them to insert shots within the sequence, say, if their click is close to the midpoint between two consecutive shots.

• The cameras only sit on the floor and can only be spun around in the floor's plane. The walk-through could have greater flexibility if we could raise each camera and angle it differently. This could lead you to explore changing the mouse handler for the shot preview so that the user could drag and spin the camera's view.

• The animations would certainly be better if we could provide more frames than just the sequence of camera shots. You could, for example, use the curve-based interpolation schemes that we're covering in lecture to fill the PDF with more pages of shots. In addition to moving the camera center more smoothly, with more in-betweens, you'd want to come up with a way of smoothly interpolating between two shots' directions into the scene.

• You might consider adding more objects to the library, especially ones that better showcase the rendering method we've performed. You can search for more .OBJ files on the web, and place their contents into the HTML file. You might also try using a modeling system to build these files. Just be careful: the method outlined above has a running time that scales poorly in the number of vertices and edges.
The sandal and lamp objects are included in the object library, but my solution is unable to render them because of the thousands of facets that make up each model.

• There are .OBJ files on-line that specify a surface with non-triangular facets. The f lines in the file specify, instead, a polygonal "fan": a series of vertices that serve as that face's boundary. With such models it would be cool to depict them as their projected polygon rather than as a fan of triangles. Similarly, soccer and cube have hinged faces that are flat. For these kinds of surfaces, you could choose to exclude any edges that form a flat hinge. Then, for example, the soccer ball would look like a soccer ball, and the cube would look like a cube.

• You could add a point light source to the scene. You could then depict the shadows that result from the light being cast on objects, and thus casting shadows on others. To do this, you'll want to learn how to draw filled areas using the jsPDF API. Alternatively, or in addition, you could have the graphical interface depict shadows, too.

Below is some useful documentation for a few other parts of the provided code, just for your reference.

To do your computations, I encourage you to use the classes defined in geometry-2d.js and geometry-3d.js. These define 2-D and 3-D points and vectors, along with standard vector and affine operations: for the point classes Point2d and Point3d, and the vector classes Vector2d and Vector3d, we've defined the usual affine and vector operations (for example, the Point3d.minus and Vector3d.dot used above). There is also v.perp(), which computes a vector perpendicular to a 2-D vector v. You should be able to do many of your calculations using just these operations. It should be rare that you access a point's or a vector's components, at least for the calculations. You'll certainly access the components for drawing or for changing any of the GUI code.

I've included two other WebGL applications that are run by loading the HTML files segment-intersect.html and facet-hit.html. The first of these is shown below, and can be used to test your 2-D line segment intersection code. When loaded, the application shows two line segments embedded in the plane. Their endpoints can be clicked on and dragged around. The application also displays an intersection point, as computed by segmentsIntersect within walk-thru-library.js. You can see that code working correctly in the sample image below. If, instead, you have not yet rewritten this intersection code, the application will just show the midpoint of the blue segment P0P1.

The second application is shown below. When loaded, it shows a triangular facet in 3-space along with a ray that passes from one point to another. The viewpoint of this scene can be re-oriented by clicking and dragging the scene with the mouse, either while pressing down on the SHIFT key or by using the middle button of the mouse. And then each of the five control points can be selected and dragged around so as to vary the set-up. The application will display an intersection point as returned by rayFacetIntersect within walk-thru-library.js. You can see that code working correctly in the sample image below. If you haven't written that intersection code yet, the application will just show the barycenter of the triangular facet given by Q1Q2Q3.

The HTML file contains a section with the text of several Alias/Wavefront .OBJ files. These each describe the vertices and faces of each of the scene objects in the object library. Here, for example, is the description of the hinged pair of squares used in some of the example images.
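(The listing below is only a schematic stand-in, not the actual text embedded in the HTML: it keeps the structure described next, with 8 vertices and a first face of f 1 2 4, but the coordinates themselves are made up for illustration.)

```
# schematic hinged pair of squares (coordinates are illustrative only)
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 1.0 0.0 1.0
v 0.0 0.0 1.0
v 1.0 0.0 0.0
v 1.7 0.7 0.0
v 1.7 0.7 1.0
v 1.0 0.0 1.0
f 1 2 4
f 2 3 4
f 5 6 8
f 6 7 8
```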
They can be placed using the squares button.

Each file contains, minimally, a list of all the vertex locations for an object on lines starting with v. There are 8 vertices, four for each of the two squares. These are followed by the description of each face on lines starting with f. The first face connects the vertex on the first line with the one on the second and the one on the fourth, forming a triangle. This also means that the surface has the three edges 1->2, 2->4, and 4->1.

As noted in the bells and whistles, lots of these files can be found on-line, although often these are ones obtained by some high-resolution 3-D scanner, and so they have lots of vertices and lots of facets.

The files can contain other information. Some of our objects have vertex normals specified. This allows renderers to perform smooth shading of the surface by interpolating the vertex normals along the points of a face, using lighting and shading models like we learned in class. Vertex normal lines start with vn. Similarly, vertex texture coordinates can be specified as well. These allow a renderer to wrap an image as the detailing of a face. Vertex texture coordinates are specified with vt lines.

These two additional pieces of information can show up in the descriptions. Our lower resolution teapot.obj has the face line f 22/10/22 1/13/1 4/14/4 23/11/23, which directs a renderer to make a four-sided face with vertices from lines 22, 1, 4, and 23. The other numbers indicate each vertex's vt and vn line. Our processing of the files ignores these extra bits of information.

Feel free to add or create your own objects by adding them to the HTML. A warning: they shouldn't have too many facets. The code we are writing uses a lot of Javascript calculations to create the PDF, so only use objects with low triangle counts.

Object library
The HTML file has a section of buttons that allow the user to specify what object gets cloned when placing scene objects. These requests are sent to the editor by a string like "cube" or "teapot". These serve as keys to a Map that acts as a dictionary of object descriptions. More specifically, each entry in the object library maps the object name to the processed contents of a .OBJ file.

The library is built upon startup of the editor. Each object gets compiled so that glBeginEnd calls can be issued to render them where they are each placed. There is a GL_TRIANGLES call for the flat-shaded object to be drawn as part of the walk-through scene, and there is also a GL_LINES call for rendering the mesh wireframe of the object.

Object representation
We represent each object in the library as a CGObject instance, and this class is defined in the source cg-object.js. These objects contain a surface mesh data structure compiled from the information in a .OBJ file. Each object has an array of vertices, an array of faces, and an array of edges that serve as the boundaries of faces.

Relevant to this project, every edge knows its two vertices, and every face knows its three vertices. And, it turns out, each edge knows the face(s) it borders. This could be a single face, if that edge serves as the boundary of a hole in the surface, or it could be two faces, if that edge serves as a hinge between two faces. The fact that edges, vertices, and faces relate to each other gives the topological structure of the surface, essentially providing a means to walk around its edges and its faces.

Step 4 above talks about ray casting to determine whether a portion of an edge gets drawn.
We pointed out that your code might need to exclude the one or two faces of an edge so that they don't hide their own edge. If you have access to an edge object e, then you can access a short array of its faces with e.faces().

The fact that each edge can be shared by two faces adds a little complexity to the representation. In processing the .OBJ file, we make sure we don't add the same geometric edge twice. This means that when an edge forms the hinge between two triangles, only one Edge object instance is made.

Scene objects
There is a separate representation class for objects that need to be rendered. In walk-thru.js we describe a Placement of a CGObject, giving its library name, its position, its scale, and its orientation. When the time comes to render a scene into PDF, each object placement is converted into a "clone" of its CGObject data.

We represent the geometry of a cloned library object with a class named SceneObject. This is a subclass of CGObject, but with additional support for projecting and rendering the object as lines in the PDF. It actually shares the topological information (the edge and face structure) with the CGObject from which it was cloned. It refers to the same edge and face sets as its "parent", but it has a different set of vertex objects, in different locations, than its parent library object.

The fact that clones share face and edge information with their parent complicates access to vertex information. When asking for an edge's endpoints, you need to specify the object so that you get the relevant vertex objects. The same edge on the clone and its parent must lead to different vertex objects for the edge endpoints, depending on whether you are dealing with the parent or with the clone. The provided code contains the calls you'll most often need for obtaining this information from a scene object so that you can render it in the PDF.

The code we write for creating PDF documents uses a freely available library called jsPDF. This API is one of several. It is well-documented and has many nice demonstrations and examples at links from its GitHub repo. Beyond lines and dots, you can lay out text in different fonts, and at slants. You can make filled regions. You can lay out curves, and so on. Our code makes a credit card-sized PDF and we use millimeters as the unit for placing lines and dots. There is of course support for other document sizes and other drawing units, as well. I recommend checking it out.
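To give a flavor of the jsPDF calls involved, here is a hedged sketch. The page dimensions and the way the document object is created here are assumptions (in the project, the document object is handed to WalkThru.toPDF for you); the drawing calls shown are standard jsPDF API.

```js
// A hedged, standalone sketch of jsPDF usage; not the project's set-up.
// Depending on how jsPDF is loaded, the constructor may instead live at
// window.jspdf.jsPDF.
const doc = new jsPDF({ unit: "mm", format: [85.6, 54] }); // credit-card-ish page, in mm

doc.setLineWidth(0.2);
doc.line(10, 10, 40, 30);        // a line from (10mm, 10mm) to (40mm, 30mm)
doc.circle(10, 10, 0.5, "S");    // a small stroked dot at one endpoint

doc.addPage();                   // the flip book adds one page per camera shot
doc.save("walk-through.pdf");    // trigger the download of the result
```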