Spring 2025

Possible Topics for Your Project

The objective of a class project is to help you gain experience with research and to relate what you learn to real-life problems, which may require you to learn new techniques (or develop new methods yourself). You are expected to present the project findings during the class and submit a summary report at the end of the semester. Below are the two types of possible projects (you only need to choose one of them).

1. Solving a real-life data mining problem. A typical report includes problem formulation, data analysis, proposed solutions, and interpretation of results. The data set can be from your own research or from the public domain; see the information below. As an example, you can choose to participate in a data mining competition such as the Knowledge Discovery and Data Mining (KDD) Cup; see the links below for past KDD Cups, including KDD Cup 2017. Another example is the "2017 Data Challenge" sponsored by the Government Statistics Section of the American Statistical Association (ASA), which analyzes the Consumer Expenditure Survey (CE) data on the Bureau of Labor Statistics website; see the corresponding announcement and dataset pages.

2. Numerical study of data mining methods using well-known data sets in the literature. Note that when dealing with well-known data sets, your approach needs to be substantially different from the literature, i.e., you should do more than repeat the analysis there. Some examples are
• Compare the performance of competing data mining techniques;
• Ask different questions or investigate new ideas for data mining methods;
• Identify optimal parameters of specific data mining techniques.
Note that the crucial aspect of your project is to analyze some data sets and justify your conclusions, not to use the specific statistical models or methods we discussed in class.

Datasets: You can collect the data yourself, or use a data set from your own research or the public domain. One way to find online datasets is to use a search engine such as Google. The following are some examples of online datasets (you can use Google or other search engines to find more):
1. http://kdd.ics.uci.edu/ or http://archive.ics.uci.edu/ml/ One example is the KDD Cup 1999 data at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html More KDD Cup data can be found at http://www.kdd.org/kdd-cup
2. http://www.quandl.com/ (financial and economic time-series datasets)
3. Data sets from government websites.
4. http://lib.stat.cmu.edu/DASL/
5. http://www.kdnuggets.com/datasets/index.html (links to more data repositories)
6. http://www.dmoz.org/Computers/Artificial Intelligence/Machine Learning/Datasets/

To inspire your projects, some concrete examples are as follows:
• analyze data sets from competitions, see http://www.kaggle.com/competitions
• find the traffic or crash pattern near Georgia Tech or your apartment/home using publicly available traffic data
• predict allergy season using Atlanta pollen count data
• derive the relationship between sleep and selected health risk behaviors

To further motivate your projects and encourage you to write a solid project report, imagine that you want to publish your project report as a paper. There are two possible kinds of data mining or statistical learning papers (you only need to choose one).
• Application Papers: apply standard methods to analyze some datasets, thereby answering important questions in real-world applications such as bioinformatics, economics, finance, banking, healthcare, online advertising, manufacturing, music, natural disasters, social networks, (bio)surveillance, warehousing, logistics, etc.
• Methodology Papers: develop new methodologies and demonstrate their advantages over standard methods when analyzing some data sets, say, in the context of temporal data mining, spatial data mining, spatio-temporal data mining, streaming data mining, web or graph mining, etc.

The final written report shall not be longer than 25 pages, and the main body of the report is generally 5-12 pages. Only very relevant plots and tables shall be included in the body of the report; the rest should go to the Appendix. When writing up your summary report, it is useful to ask yourself the following questions: What is the work? Why is it important? What background is needed? How will the work be presented? Here is a suggested format for your summary report.
1. Title Page: project title, author(s) (your name, the last three digits of your student ID, and email address), the submission date, and course name/number.
2. Abstract: an informative summary of the whole report (100-300 words).
3. Introduction: problem description and motivation, data mining challenge(s), problem-solving strategies, what was learned from the application, and an outline of the report.
4. Problem Statement or Data Sources: cite the data sources and provide a simple presentation of the data to help readers understand the problem or challenge(s).
5. Proposed Methodology: explain (and justify) your proposed data mining strategies.
6. Analysis and Results: present key findings from executing the proposed data mining methods. For readability, detailed results should be placed in the Appendix. References for the computer software used to implement your proposed data mining methods (even if it is a web page) should be given.
7. Conclusions: draw conclusions from your data mining practice. Unfinished or possible future work can be included (with proper explanation or justification). *A mandatory subsection, "Lessons we have learned": at the end of the conclusion section, please add a subsection on the lessons you or your team learned from this project or this course. Please feel free to write any comments/suggestions/remarks, or share your experiences of data mining.
8. Appendix: this section only includes documents needed to support the presentation in the report. Feel free to divide it into several subsections if necessary. Do NOT dump all computer output here in unorganized form.
9. Bibliography and Credits.
Parts 3-6 constitute the main body of the paper and are written for your primary audience, who you can imagine as an intelligent but fictional boss unschooled in Data Mining or Statistics. These parts should therefore contain as little technical material as you can possibly get away with. It is appropriate, and even recommended, to refer the reader to the Appendix in part 8 if you need to provide a more technical explanation for something. Part 8 is for your secondary audience (me) and should follow the "story" of parts 4-6 closely enough that it is easy for me to see what technical material backs up your results and discussion. It is not necessary to number these parts 1-9 or to use the names given above.
Please feel free to merge some parts or provide more informative section names if that seems natural. A good online resource for writing reports is http://www.ccp.rpi.edu/. This site has links to writing centers at universities around the country, many of which in turn have pages that describe how to put together different types of reports.
Apply random forests and boosting to a data set of your choice. If you want, you can choose a data set from the course (e.g., past lectures or homeworks, including simulated data sets), from R (e.g., the ISLR package), or from other sources. The only exception is the spam email data set, since we have used it extensively in our lectures. It may be okay to use the dataset from your project proposal, especially if it is a large, complicated dataset, but you can do so only if each group member works independently on the homework without collaboration and all group members agree. Please write a report to summarize your analysis, subject to the following requirements:
(a) Be sure to fit both random forests and boosting on a training set and to evaluate their performance on a test set.
(b) How accurate are the results compared to simple baseline methods? For instance, candidate baseline methods include KNN, linear regression, LDA, logistic regression, local smoothing, a single tree, etc., whichever are appropriate.
(c) Which of these approaches yields the best performance in terms of smallest testing error?
(d) Explain how and why you choose certain tuning parameters in these approaches, based only on the training set. This can be done through cross-validation on the training set, variable selection such as AIC or BIC on the training set, or any other reasonable approach.
(e) In your writeup, please follow the guidelines of the final course project. In particular, please provide the necessary background on the data set of your choice, so that readers can understand your data set and analysis.
Remarks: the purpose of this homework is to prepare you for the course project and the final exam. If feasible, please use the final report format; it is okay to omit the title page, references, or other non-essential materials. Also, in the final exam you are given a training set and asked to predict on a testing set; your grade on the final exam will mainly be based on how small the testing error is. Note that the use of cross-validation in this homework is slightly different from HW#1 and HW#2: here you should apply cross-validation only to the training set itself, with the aim of finding the best set of tuning parameters. There is no universal solution to this homework, as we expect students to choose different kinds of data sets. As a result, the peer review comments and grading will depend heavily on your writeup/presentation and explanations, including (1) what your data set is, (2) how you tune parameters in each approach, (3) whether your conclusions are appropriate based on your numerical comparisons of the different approaches, and (4) whether your presentation is clear, e.g., whether your report is easy to read.

ISyE 7406: Data Mining & Statistical Learning
Optional HW (No credit, and Not Graded!)
This is an optional HW. No credit, and not graded. It might help you better understand tree-based methods.
Tree-based Method. Consider the Orange Juice (OJ) dataset, which is part of the ISLR package in R. The data contain 1070 purchases where the customer purchased either Citrus Hill or Minute Maid Orange Juice, and a number of characteristics of the customer and product are recorded.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
(b) Fit a classification tree with the "gini" criterion to the training set, with the binary variable "Purchase" as the response and the other variables as predictors. Use the "summary()" function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
(c) Create a plot of the tree and interpret the results.
(d) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate? Hint: the confusion matrix between two vectors, say Y and Yhat, can be obtained in R with the "table()" function, i.e., "table(Y, Yhat)".
(e) Use the training set to determine the optimal tree size that corresponds to the lowest cross-validation classification error rate.
(f) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. Note that cross-validation does not necessarily lead to selection of a pruned tree; if so, create a pruned tree with fewer terminal nodes.
(g) Compare the pruned and unpruned trees in terms of both training and testing error rates. Which is better, and does it match your intuition?
Remarks: The following R code helps you load the OJ data set:
## You need to first install the R package ISLR
library(ISLR)
data(OJ)
head(OJ)
?OJ
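If you prefer Python (which these homeworks allow), the sketch below illustrates the same workflow with scikit-learn: a gini classification tree with cost-complexity pruning chosen by cross-validation on the training set only, plus random forest and boosting fits for the main homework. The file name "OJ.csv", the specific tuning grids, and the random seeds are assumptions for illustration, not part of the assignment.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

oj = pd.read_csv("OJ.csv")                       # assumed export of the OJ data from R
X = pd.get_dummies(oj.drop(columns="Purchase"))  # one-hot encode the factor predictors
y = oj["Purchase"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=800, random_state=1)

# (b)-(d): unpruned gini tree, training/testing error, and confusion matrix
tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)
print("train error:", 1 - tree.score(X_train, y_train))
print("test error: ", 1 - tree.score(X_test, y_test))
print(confusion_matrix(y_test, tree.predict(X_test)))

# (e)-(f): pick the cost-complexity pruning parameter by CV on the training set only
alphas = tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
cv = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=1),
                  {"ccp_alpha": alphas}, cv=5).fit(X_train, y_train)
print("pruned test error:", 1 - cv.best_estimator_.score(X_test, y_test))

# Main homework: random forest and boosting, also tuned on the training set only
rf = GridSearchCV(RandomForestClassifier(random_state=1),
                  {"max_features": [2, 4, 6], "n_estimators": [200, 500]},
                  cv=5).fit(X_train, y_train)
gbm = GridSearchCV(GradientBoostingClassifier(random_state=1),
                   {"learning_rate": [0.01, 0.1], "n_estimators": [100, 500]},
                   cv=5).fit(X_train, y_train)
print("RF test error:      ", 1 - rf.score(X_test, y_test))
print("Boosting test error:", 1 - gbm.score(X_test, y_test))

The same pattern (tune on the training set, report error on the held-out test set) is what parts (a)-(d) of the random forests/boosting homework ask for; only the data set and the candidate parameter grids change.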
Spring 2025

The goal of this homework is to help you better understand the statistical properties and computational challenges of local smoothing methods such as loess, Nadaraya-Watson (NW) kernel smoothing, and spline smoothing. For this purpose, we will compute the empirical bias, empirical variance, and empirical mean squared error (MSE) based on m = 1000 Monte Carlo runs, where in each run we simulate a data set of n = 101 observations from the additive noise model

Y_i = f(x_i) + \epsilon_i    (1)

with the famous Mexican hat function

f(x) = (1 - x^2) \exp(-0.5 x^2),  -2\pi \le x \le 2\pi,    (2)

where \epsilon_1, ..., \epsilon_n are independent and identically distributed (iid) N(0, 0.2^2). This function is known to pose a variety of estimation challenges, and below we explore the difficulties inherent in estimating it.

(1) Let us first consider the (deterministic, fixed) design with equidistant points in [-2\pi, 2\pi].
(a) For each of the m = 1000 Monte Carlo runs, generate a data set of the form (x_i, Y_i) with x_i = 2\pi(-1 + 2(i-1)/(n-1)) and Y_i drawn from the model in (1). Denote this data set by D_j for the j-th Monte Carlo run, j = 1, ..., m = 1000.
(b) For each data set D_j (i.e., each Monte Carlo run), compute the three kinds of local smoothing estimates at every point in D_j: loess (with span = 0.75), Nadaraya-Watson (NW) kernel smoothing with a Gaussian kernel and bandwidth = 0.2, and spline smoothing with the default tuning parameter.
(c) At each point x_i, for each local smoothing method, based on the m = 1000 Monte Carlo runs, compute the empirical bias, empirical variance, and empirical mean squared error (MSE), defined as

Bias\{f(x_i)\} = \bar{f}_m(x_i) - f(x_i), with \bar{f}_m(x_i) = \frac{1}{m} \sum_{j=1}^{m} \hat{f}^{(j)}(x_i),
Var\{f(x_i)\} = \frac{1}{m} \sum_{j=1}^{m} \big( \hat{f}^{(j)}(x_i) - \bar{f}_m(x_i) \big)^2,
MSE\{f(x_i)\} = \frac{1}{m} \sum_{j=1}^{m} \big( \hat{f}^{(j)}(x_i) - f(x_i) \big)^2.

Here we use the true function value f(x_i) in (2) in the definitions of empirical bias and empirical MSE, which better helps us understand the performance of the different local smoothing methods when estimating the true function f. Moreover, we purposely use the coefficient 1/m (instead of the standard coefficient 1/(m-1)) in the definition of the empirical variance, so that the well-known relation MSE = Bias^2 + Var also holds for the empirical versions.
(d) Plot these quantities against x_i for all three local smoothing estimators: loess, NW kernel, and spline smoothing.
(e) Provide a thorough analysis of what the plots suggest, e.g., which method is better or worse in terms of bias, variance, and MSE? Do you think the comparison between these three methods is fair? Why or why not?

(2) Repeat part (1) with another (deterministic) design that has non-equidistant points in the interval [-2\pi, 2\pi]. The following R code is used to generate the design points x_i on my laptop, denoted by x2 below (you can keep these x_i fixed in the m = 1000 Monte Carlo runs):
set.seed(79)
x2
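The handout's own code is in R; since other software is allowed, here is a minimal Python sketch of part (1) under stated assumptions: statsmodels' lowess stands in for loess, the NW estimator is written by hand with a Gaussian kernel, and scipy's UnivariateSpline (with its default smoothing factor, which is not identical to R's smooth.spline default) stands in for spline smoothing.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(79)
n, m, sigma = 101, 1000, 0.2
x = 2 * np.pi * (-1 + 2 * np.arange(n) / (n - 1))     # equidistant design on [-2*pi, 2*pi]
f_true = (1 - x**2) * np.exp(-0.5 * x**2)             # Mexican hat function

def nw_gauss(x0, xs, ys, h=0.2):
    # Nadaraya-Watson estimate at points x0, Gaussian kernel with bandwidth h
    w = np.exp(-0.5 * ((x0[:, None] - xs[None, :]) / h) ** 2)
    return (w @ ys) / w.sum(axis=1)

fits = {"loess": np.empty((m, n)), "nw": np.empty((m, n)), "spline": np.empty((m, n))}
for j in range(m):
    y = f_true + rng.normal(0.0, sigma, n)            # one Monte Carlo data set D_j
    fits["loess"][j] = lowess(y, x, frac=0.75, return_sorted=False)
    fits["nw"][j] = nw_gauss(x, x, y)
    fits["spline"][j] = UnivariateSpline(x, y)(x)     # default smoothing, only an approximation

for name, fhat in fits.items():
    fbar = fhat.mean(axis=0)
    bias = fbar - f_true                              # empirical bias at each x_i
    var = ((fhat - fbar) ** 2).mean(axis=0)           # empirical variance (1/m convention)
    mse = ((fhat - f_true) ** 2).mean(axis=0)         # empirical MSE
    print(name, "max |bias|:", np.abs(bias).max(), "mean MSE:", mse.mean())
    # plot bias, var, and mse against x for part (d)

For part (2), only the design vector x changes; the same loop and the same bias/variance/MSE summaries apply.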
In this problem, you are asked to write a report summarizing your analysis of the popular "Auto MPG" data set from the literature. Much research has been done on this data set; here the objective of our analysis is to predict whether a given car gets high or low gas mileage based on 7 car attributes: cylinders, displacement, horsepower, weight, acceleration, model year, and origin.
(a) The "Auto MPG" data set is available at the UCI Machine Learning (ML) Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG Download the data file "auto-mpg.data" from the UCI ML Repository or from Canvas, and use Excel or Notepad to inspect the data (this is a .txt file). There are 398 rows (i.e., 398 different kinds of cars) and 9 columns (the car attributes and the car name). Before we do any analysis, we need to clean the raw data. In particular, some values are missing in this dataset. Many statistical methods have been proposed to deal with missing values; please research the literature yourself if interested. For simplicity in this homework, we adopt a simple, though inefficient, method: remove the rows with missing values. We also remove the last column of car names, which is text/string and may cause trouble in our numerical analysis. These two deletions lead to a new, cleaned data set of 392 observations and 8 columns. To save time, you can also directly download the cleaned data from the file "Auto.csv" on Canvas.
(b) Create a binary variable, mpg01, that contains a 1 if mpg is above its median and a 0 if mpg is below its median. This binary variable will be the response variable in this homework. Note that you first need to compute the median value of the mpg variable in the data set.
(c) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
(d) Split the data into a training set and a test set. Any reasonable splitting is acceptable, as long as you clearly explain how you split and why you think it is reasonable. For your convenience, you can either split randomly or save every fifth (or tenth) observation as testing data.
(e) For the purpose of this homework, perform the following classification methods on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (c). What is the test error of each model obtained?
(1) LDA
(2) QDA
(3) Naive Bayes
(4) Logistic Regression
(5) KNN with several values of K. Use only the variables that seemed most associated with mpg01 in (c). Which value of K seems to perform best on this data set?
(6) (Optional) PCA-KNN. Principal Component Analysis (PCA) or other dimension-reduction methods can easily be combined with other data mining methods. Recall that the essence of the PC reduction is to replace the p-dimensional explanatory variable x_i = (x_{i1}, ..., x_{ip}), for i = 1, ..., n, with a new p-dimensional explanatory variable u_i = (u_{i1}, ..., u_{ip}), where u_i = A_{p x p} x_i. Then we can apply standard data mining methods such as KNN to the first r (<= p) entries of the u_i's, (u_{i1}, ..., u_{ir}), to predict the Y_i's. Find the testing errors when KNN with different values of K (neighbors) is applied to the PCA-dimension-reduced data for r = p-1, p-2, ..., 1.
(7) (Optional) Any other classification methods you want to propose or use.
Write a report to summarize your findings, e.g., what the best method is and how your results could guide manufacturing or buying high-gas-mileage cars. The report should include (i) Introduction, (ii) Exploratory (or Preliminary) Data Analysis, (iii) Methods, (iv) Results, and (v) Findings. Please attach your computing code for R, Python, or other statistical software (without, or with limited, output) in the appendix of your report, and please do not just dump the computer output in the body of the report. It is important to summarize and interpret your computer output.
Remarks:
(a) From now on, we may no longer ask you to conduct cross-validation (CV) explicitly as in the previous HWs, but you might need to do it on your own if you want your results to be more convincing.
(b) In this HW, you are asked to use the explanatory variables that seem most associated with mpg01. This approach often occurs in manufacturing or biomedical applications, where one wants to incorporate domain knowledge to improve performance. Note that this might not be needed in other machine learning applications such as deep neural networks or random forests, where it is okay to use all explanatory variables.
(c) Below is some sample R code if you want it (please feel free to use Python, Matlab, or other software):
## Suppose that you save the data file in a local folder of your computer, say, "C:/Temp":
Auto1 <- read.csv("C:/Temp/Auto.csv", header = TRUE)
mpg01 <- I(Auto1$mpg >= median(Auto1$mpg))
Auto <- data.frame(mpg01, Auto1[,-1])   ## replace column "mpg" by "mpg01"
### END ###
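As an alternative to the R snippet above, the following Python sketch walks through (b), (d), and (e) with scikit-learn. The particular subset of predictors is only a placeholder for whatever your exploratory analysis in (c) suggests, and the 80/20 random split is just one of the acceptable splitting choices.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

auto = pd.read_csv("Auto.csv")                            # assumed cleaned file from Canvas
y = (auto["mpg"] >= auto["mpg"].median()).astype(int)     # (b) the binary response mpg01
# Suppose the EDA in (c) pointed to these predictors (an assumption for illustration):
X = auto[["cylinders", "displacement", "horsepower", "weight"]]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)   # (d)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "Logistic": LogisticRegression(max_iter=1000),
}
for k in (1, 3, 5, 9, 15):
    # standardize before KNN so no single predictor dominates the distance
    models[f"KNN (k={k})"] = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))

for name, mdl in models.items():                          # (e) test error of each method
    mdl.fit(X_tr, y_tr)
    print(f"{name:12s} test error: {1 - mdl.score(X_te, y_te):.3f}")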
Consider the data set "fat" in the "faraway" library of R. This data file is also available on Canvas; if you save the file "fat.csv" on your laptop, say in the folder "C:/Temp", you can read it into R as
fat <- read.csv("C:/Temp/fat.csv", header = TRUE)
Spring 2025

Overview: In probability and statistics, it is important to understand the mean and variance of any random variable. In many applications it is straightforward to simulate the random variable Y, but it is often highly non-trivial to characterize the exact distribution of Y = Y(X1, X2), including deriving explicit formulas for the mean and variance of Y = Y(X1, X2) as functions of X1 and X2.

Objective: In this exam, suppose that Y = Y(X1, X2) is a random variable whose distribution depends on two independent variables X1 and X2, and the objective is to estimate two deterministic functions of X1 and X2: the mean \mu(X1, X2) = E(Y) and the variance V(X1, X2) = Var(Y). For that purpose, you are provided with 200 observed realizations of Y for each of a set of given pairs (X1, X2). You are asked to use data mining or machine learning methods that allow us to conveniently predict or approximate the mean and variance of Y = Y(X1, X2) as functions of X1 and X2. That is, your task is to predict or approximate two values for each pair (X1, X2) in the testing data set: one for the mean \mu(X1, X2) = E(Y(X1, X2)) and the other for the variance V(X1, X2) = Var(Y(X1, X2)).

Training data set: In order to help you develop a reasonable estimate of the mean and variance of Y = Y(X1, X2) as deterministic functions of X1 and X2, we provide a training data set generated as follows. We first choose uniform design points on 0 <= X1 <= 1 and 0 <= X2 <= 1, that is, x_{1i} = 0.01 * i for i = 0, 1, 2, ..., 99, and x_{2j} = 0.01 * j for j = 0, 1, 2, ..., 99. Thus there are a total of 100 * 100 = 10^4 combinations of (x_{1i}, x_{2j}), and for each of these 10^4 combinations we generate 200 independent realizations of the Y variable, denoted by Y_{ijk} for k = 1, ..., 200. The corresponding training data, 7406train.csv, is available from Canvas. Note that this training data set is a 10^4 x 202 table. Each row corresponds to one of the 100 * 100 = 10^4 combinations of (X1, X2). The first and second columns are the X1 and X2 values, respectively, whereas the remaining 200 columns are the corresponding 200 independent realizations of Y. Based on the training data, you are asked to develop an accurate estimate of the functions \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y), as deterministic functions of X1 and X2 for 0 <= X1 <= 1 and 0 <= X2 <= 1. To assist you, a limited empirical data analysis (EDA) of the training data is provided in the appendix using R. Please feel free to adapt it to other languages such as Python, Matlab, etc.

Testing data set: For the purpose of evaluating your proposed estimation models and methods, we choose 50 random design points for X1 and 50 random design points for X2. Thus there are a total of 50 * 50 = 2500 combinations of (X1, X2) in the testing data set. The exact values of the (X1, X2) pairs in the testing data set are included in the file 7406test.csv, which is available from Canvas. You are asked to use your fitted models to predict \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each of the 50 * 50 = 2500 combinations of (X1, X2) in the testing data (please keep at least six digits in your answers).
Estimation Evaluation Criterion: In order to evaluate your estimation or prediction, we obtain "true" values of \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each combination of (X1, X2) in the testing data set based on the following Monte Carlo simulation (we will not release these true values!). We first generate 200 random realizations of Y for each combination of (X1, X2) in the testing data set, but we will not release these 200 independent realizations either. Then, for each given combination of (X1, X2), with the 200 realizations of Y denoted by Y_1, ..., Y_{200}, we compute the "true" values as

\mu^{*}_{true} = \bar{Y} = \frac{Y_1 + \cdots + Y_{200}}{200}  and  V^{*}_{true} = \widehat{Var}(Y) = \frac{1}{200 - 1} \sum_{i=1}^{200} (Y_i - \bar{Y})^2.

Your predicted mean and variance functions, say \hat{\mu}(X1, X2) and \hat{V}(X1, X2), will then be evaluated against these true values \mu^{*}_{true}(X1, X2) and V^{*}_{true}(X1, X2):

MSE_{\mu} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \big( \hat{\mu}(x_{1i}, x_{2j}) - \mu^{*}_{true}(x_{1i}, x_{2j}) \big)^2,
MSE_{V} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \big( \hat{V}(x_{1i}, x_{2j}) - V^{*}_{true}(x_{1i}, x_{2j}) \big)^2,    (1)

where (I, J) = (50, 50) for the testing data.

Your tasks: as your solution to this exam, you are required to submit two files to Canvas before the deadline:
(a) A .csv file with the required predictions, i.e., your predicted values of \mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for the testing data (to 6 decimal places). Please name your file "1.YourLastName.YourFirstName.csv", e.g., "1.Mei.Yajun.csv" for the instructor. Students in our class should have a unique combination of last/first name, so there is no need to include the middle name.
• The submitted .csv file must be a 2500 x 4 matrix, and the first two columns must be exactly the same as in the provided testing data file "7406test.csv". The third column should be your estimated mean \hat{\mu}(X1, X2), and the fourth column your estimated variance \hat{V}(X1, X2).
• If you want, you may round your numerical answers to six decimal places, e.g., report your estimates in the form 30.xxxxxx, but this is optional: in our evaluation process we will use the round function to round your answers to six decimal places before computing the MSE.
• Please save your predictions as a 2500 x 4 data matrix in this .csv file, without headers or row/column labels/names. We will use a computer program to auto-read your .csv file and auto-compute the MSE values in equation (1) for all students, in alphabetical order of last/first name, so it is important to follow this guideline, e.g., no headers or extra columns/rows in the .csv file, and name your .csv file in the form above.
(b) A (pdf or docx) file that explains the methods used for the prediction. Please name your file "2.YourLastName.YourFirstName", e.g., "2.Mei.Yajun.pdf" or "2.Mei.Yajun.docx" for the instructor. Your written report should read like a good journal paper: concise, and clearly explaining and justifying your proposed models and methods; also see the guidelines for the final report of our course project. Please feel free to use any methods: this is an open-ended problem, and you can either use any standard methods you learned in class or develop your estimates with a completely new approach.
Remark:
• If you upload your files multiple times to Canvas, the file names might be renamed automatically by Canvas to "1.YourLastName.YourFirstName.csv-1" or similar. If this occurs, please do not worry, as we will take this into account and correct it for you.
• This final exam essentially asks you to build two different models: one to predict \hat{\mu} and the other to predict \hat{V}. For each model there are p = 2 independent variables (X1, X2), although both \hat{\mu}(X1, X2) and \hat{V}(X1, X2) likely need to be nonlinear functions in order to achieve good predictive performance. To be more specific, you might want to look beyond the multiple linear regression model \beta_0 + \beta_1 X1 + \beta_2 X2 and investigate nonlinear models such as polynomial regression, local smoothing, generalized additive models, random forests, boosting, support vector machines with suitable kernels, neural networks, etc. Then you need to decide which (nonlinear) models should be used for prediction on the testing dataset. Hopefully this high-level viewpoint allows you to easily develop models for prediction.
• After your submission, please double-check your submitted .csv file on Canvas to see whether it has exactly 2500 rows and 4 columns, and whether it has any "NA" or missing values. In the past, there were three typical small mistakes that severely affect predictions: (i) having 5 or more columns (e.g., one unnecessary column labeling the observations); (ii) having more than 2500 rows (e.g., predictions on the training data instead of the testing data, or predictions from multiple models); and (iii) having "NA" values in the .csv file (e.g., some models will generate a prediction of "NA" if a point in the testing dataset is outside the range of the training dataset). Thus it is crucial to ensure that your .csv file has the required 2500 rows and 4 columns reporting the desired prediction values.

Grading Policies: The total score of this take-home final exam is 25 points, which will be graded by the TAs and instructor. There are three components:
• Prediction accuracy on the mean: 10 points. The smaller MSE_{\mu} in (1), the better. Tentatively, we plan to assign "10" if MSE_{\mu} <= 1.20, "9" if it falls in (1.20, 1.40], "8" if (1.40, 1.60], "7" if (1.60, 1.80], "6" if (1.80, 2.00], "5" if (2.00, 3], "4" if (3, 10], "3" if (10, 20], "2" if (20, 30], etc. For your information, in past semesters the percentages of students who received "10", "9", and "8" were 43%, 35%, and 16%, respectively. Students who received low grades often did not realize that they had made small mistakes here or there, or did not tune the hyper-parameters appropriately. In general, we feel that our grading is very generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Prediction accuracy on the variance: 10 points. The smaller MSE_{V} in (1), the better. Note that predicting the variance is a much harder problem than predicting the mean, and thus we expect this MSE to be much larger. Tentatively, we plan to assign "10" if MSE_{V} <= 550, "9" if (550, 570], "8" if (570, 590], "7" if (590, 610], "6" if (610, 630], "5" if (630, 650], "4" if (650, 700], "3" if (700, 1000], "2" if (1000, 5000], etc. In past semesters, 60% of students received "10", and more than 90% received at least "8". Thus we feel the grading should be generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Written Report: 5 points. There are no specific guidelines for this written report; please use common sense. With that said, we will look at the following aspects. Is the report well written and easy to read? Is it easy to find the final chosen model or method?
Does the report clearly explain how and why the final method was chosen? Does the report discuss how to suitably tune the parameters of the final chosen model? We plan to assign the grades of this component as follows: "A" = 5, "B" = 4, "C" = 3, "D" = 2, "F" = 1, "Not submitted" = 0. The TAs and instructor will try their best to give fair technical grades to all reasonable answers; e.g., even if your prediction accuracy is not as good as other students', we reserve the right to increase your prediction accuracy scores if your written report gives a solid justification of your proposed models/methods/results. However, we acknowledge that ultimately this is a subjective decision. If needed, please feel free to leave a public or private message on Piazza. Good luck on your final exam!

Appendix: Some useful R code for (A) the training dataset, (B) the testing dataset, and (C) our auto-grading program.
(A) Empirical data analysis of the training dataset, which might be useful to inspire you to develop suitable methods for prediction:
#####
### Read Training Data
## Assume you save the training data in the folder "C:/temp" on your local laptop
traindata <- read.csv("C:/temp/7406train.csv")
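To make the overall workflow concrete, here is one reasonable (not required) approach sketched in Python. It assumes the training file has columns X1, X2 followed by the 200 realizations of Y with no header row, uses random forests purely as an example model, and leaves all tuning and model comparison to you.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Training file: 10^4 rows; columns are X1, X2, then 200 realizations of Y (no header assumed).
train = pd.read_csv("7406train.csv", header=None)
X = train.iloc[:, :2].to_numpy()
Y = train.iloc[:, 2:].to_numpy()

# Per-row empirical mean and variance of the 200 realizations become the regression targets.
mu_hat = Y.mean(axis=1)
var_hat = Y.var(axis=1, ddof=1)

# One model for the mean and one for the variance (random forest is only an example choice;
# comparing and tuning candidate models on the training data is the actual exam task).
mu_model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, mu_hat)
var_model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, var_hat)

# Predict on the 2500 test points and write the required 2500 x 4 csv with no header.
test = pd.read_csv("7406test.csv", header=None).iloc[:, :2]
out = test.copy()
out[2] = mu_model.predict(test.to_numpy())
out[3] = var_model.predict(test.to_numpy())
out.to_csv("1.LastName.FirstName.csv", header=False, index=False, float_format="%.6f")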
Spring 2025

Problem (KNN). Consider the well-known zipcode data set from the machine learning and data mining literature, which is available from the book website. You can also find it on Canvas: the training data set is the file "zip.train.csv" and the testing data set is "zip.test.csv". In the zipcode data, the first column is the response (Y) and the other columns are the independent variables (the X_i's). A detailed description can be found at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt
Here we consider only the classification problem between 2's and 7's, e.g., denote by "ziptrain27" the training data that only includes the rows with Y = 2 or Y = 7.
(1) Exploratory Data Analysis of the training data: explore the training data "ziptrain27", e.g., report some summary information/statistics of the training data that you think are important or interesting. Please do not copy and paste the output of R or Python code; be selective, and use your own language to write a few sentences summarizing the important or interesting results.
(2) Build classification rules from the training data "ziptrain27" with the following methods: (i) linear regression; and (ii) KNN with k = 1, 3, 5, 7, 9, 11, 13, and 15. Find the training error of each choice.
(3) Consider the provided testing data set, and derive the testing error of each classification rule built in (2).
(4) Cross-Validation. The above steps are sufficient in many machine learning or data mining problems when both the training and testing data sets are very large. However, for relatively small data sets, one may want to go further to assess the robustness of each approach. One general approach is the Monte Carlo cross-validation algorithm, which splits the observed data points into training and testing subsets and repeats the above computation B times (say B = 100). In the context of this homework, we can combine the n1 = 1376 training observations and n2 = 345 testing observations into one larger data set. Then, for each loop b = 1, ..., B, we randomly select n1 = 1376 observations as a new training subset and use the remaining n2 = 345 observations as the new testing subset. Within each loop, we first build the different models from "the training data of that specific loop" and then evaluate their performance on "the corresponding testing data." Therefore, for each model or method in part (2), we obtain B values of the testing error on B different testing subsets, denoted TE_b for b = 1, 2, ..., B. The "average" performance of each model can then be summarized by the sample mean and sample variance of these B values:

TE^{*} = \frac{1}{B} \sum_{b=1}^{B} TE_b  and  \widehat{Var}(TE) = \frac{1}{B-1} \sum_{b=1}^{B} \big( TE_b - TE^{*} \big)^2.

Compute and compare the "average" performance of each model or method mentioned in part (2). In particular, based on your results, write some paragraphs providing a brief summary of what you discover in the cross-validation, including the "optimal" choice of the tuning parameter k in the KNN method and an explanation of how confident you are in the usefulness of your optimal choice in real-world applications.

Appendix: Please feel free to use the following sample R code if you want. Of course, you are free to use Python or other software.
## Below assume that you save the datasets in the folder "C:/Temp" on your laptop
## 1. Read Training data
ziptrain <- read.csv("C:/Temp/zip.train.csv")
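The following Python sketch shows the Monte Carlo cross-validation loop in part (4) under stated assumptions: the csv files have no header, the first column is the digit label, linear regression is turned into a classifier by coding the digits as 0/1 and thresholding the fitted value at 0.5, and scikit-learn supplies KNN.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumed format: no header, column 0 = digit label, remaining columns = pixel values.
train = pd.read_csv("zip.train.csv", header=None)
test = pd.read_csv("zip.test.csv", header=None)
full = pd.concat([train, test], ignore_index=True)
full = full[full[0].isin([2, 7])]                 # keep only the 2's and 7's
X_all = full.iloc[:, 1:].to_numpy()
y_all = (full[0] == 7).astype(int).to_numpy()     # 1 = digit "7", 0 = digit "2"

B, n1 = 100, 1376
ks = [1, 3, 5, 7, 9, 11, 13, 15]
errs = {"linreg": []}
for k in ks:
    errs[f"knn{k}"] = []

rng = np.random.default_rng(0)
for b in range(B):
    idx = rng.permutation(len(y_all))             # random split for this Monte Carlo loop
    tr, te = idx[:n1], idx[n1:]
    # linear regression as a classifier: predict "7" when the fitted value >= 0.5
    yhat = LinearRegression().fit(X_all[tr], y_all[tr]).predict(X_all[te]) >= 0.5
    errs["linreg"].append(np.mean(yhat != y_all[te]))
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_all[tr], y_all[tr])
        errs[f"knn{k}"].append(np.mean(knn.predict(X_all[te]) != y_all[te]))

for name, e in errs.items():                      # sample mean and variance of the B testing errors
    print(f"{name}: mean TE = {np.mean(e):.4f}, var = {np.var(e, ddof=1):.2e}")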
Spring 2025

Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment.

Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and the instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies, as they are all bundled in the Docker container.

You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips that includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values unless specifically instructed to.

Technology: PySpark, Docker
Deliverables: [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file

IMPORTANT NOTES:
• Only regular PySpark DataFrame operations can be used.
• Do NOT use PySpark SQL functions, i.e., sqlContext.sql('select * ... '). We noticed that students frequently encountered difficult-to-resolve issues when using these functions. Additionally, since you already worked extensively with SQL in HW1, completing this task in SQL would offer limited educational value.
• Do not reference sqlContext within the functions you are defining for the assignment.
• If you re-run cells, remember to restart the kernel to clear the Spark context; otherwise an existing Spark context may cause errors.
• Be sure to save your work often! If you do not see your notebook in Jupyter, double-check that the file is present in the folder and that Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter, you can still move forward by clicking the "upload" button in the Jupyter notebook and uploading the file. However, if you use this approach, your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as... > Notebook (.ipynb); be sure to download often to save your work!
• Do not add any cells or additional library imports to the notebook.
• Remove all of your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
• If you encounter a "connection refused" error on Gradescope, it likely means that the system timed out while running your code. Review your code to ensure you aren't performing unnecessary processing, such as using overly complex nested loops.

Tasks and point breakdown
1. [4 pts] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types:
a. passenger_count: integer
b. total_amount: float
c. tip_amount: float
d. trip_distance: float
e. fare_amount: float
f. tpep_pickup_datetime: timestamp
g. tpep_dropoff_datetime: timestamp
2. [6 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest sum of passenger_count over the trips traveled between them. (A sketch of the DataFrame patterns used in Tasks 1 and 2 appears after the task list below.)
Sort the location pairs by total passengers between pairs. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), using total_amount. For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to total_passenger_count.
Sample output format (the values below are for demonstration purposes only):
PULocationID  DOLocationID  total_passenger_count  per_person_rate
1             2             23                     5.242345
3             4             5                      6.61345634
3. [6 pts] You will be modifying the function distance_with_most_tip. Filter the data to trips with fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. Round all trip distances up to the closest mile and find the average tip_percent for each rounded trip_distance. Sort the result in descending order of tip_percent to obtain the top 15 trip distances that tip most generously. Rename the column for rounded trip distances to trip_distance and the column for average tip percents to tip_percent.
Sample output format (the values below are for demonstration purposes only):
trip_distance  tip_percent
2              6.2632344561
1              4.42342882
4. [9 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips that began during that hour. Calculate the average speed as the average trip_distance divided by the average trip duration, as distance per hour. Make sure to compute the average durations and average trip distances before calculating the speed. It will likely be helpful to cast the dates to the long data type when determining the interval. An hour with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM and hours 12:00-23:59 being PM. Convert these times to 12-hour time so you can match the output below. For example, the row with 1 as the time of day should show the average speed between 1 AM and 2 AM in the am_avg_speed column, and between 1 PM and 2 PM in the pm_avg_speed column. Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed or pm_avg_speed. If an hour has no data for AM or PM, there may be missing rows; you will not have rows for all possible times of day, and you do not need to add them if they are missing.
Sample output format (the values below are for demonstration purposes only):
time_of_day  am_avg_speed  pm_avg_speed
1            0.953452345   9.23345272
2            5.2424622     null
4            null          2.55421905
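As referenced above, here is a minimal PySpark sketch of the DataFrame patterns for Tasks 1 and 2. It is illustrative only: the DataFrame variable name df is assumed, the per_person_rate aggregation shown is one plausible reading of the spec, and you should verify that the output columns and ordering match what the autograder expects.

from pyspark.sql import functions as F

def clean_data(df):
    # Task 1: cast each listed column to the required type.
    return (df.withColumn("passenger_count", F.col("passenger_count").cast("integer"))
              .withColumn("total_amount", F.col("total_amount").cast("float"))
              .withColumn("tip_amount", F.col("tip_amount").cast("float"))
              .withColumn("trip_distance", F.col("trip_distance").cast("float"))
              .withColumn("fare_amount", F.col("fare_amount").cast("float"))
              .withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
              .withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp")))

def common_pair(df):
    # Task 2 pattern: drop same-location trips, aggregate by location pair, then rank.
    return (df.filter(F.col("PULocationID") != F.col("DOLocationID"))
              .groupBy("PULocationID", "DOLocationID")
              .agg(F.sum("passenger_count").alias("total_passenger_count"),
                   (F.sum("total_amount") / F.sum("passenger_count")).alias("per_person_rate"))
              .orderBy(F.col("total_passenger_count").desc(), F.col("per_person_rate").desc())
              .limit(10))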
Q2 [30 pts] Analyzing a dataset with Spark/Scala on Databricks
First, go over this Spark on Databricks Tutorial to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv [1] using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need the taxi zone lookup table, taxi_zone_lookup.csv, which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
[1] Graph derived from the NYC Taxi and Limousine Commission.

Technology: Spark/Scala, Databricks
Deliverables: [Gradescope]
• q2.dbc: your solution as a Scala notebook archive file (.dbc) exported from Databricks (see the Databricks Setup Guide below)
• q2.scala: your solution as a Scala source file exported from Databricks (see the Databricks Setup Guide below)
• q2_results.csv: the output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() functions into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied.

IMPORTANT NOTES:
• Use only Firefox, Safari, or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.
• Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before the Homework Overview.)
  o You must choose the Databricks Runtime (DBR) version "10.4 (includes Apache Spark 3.2.1, Scala 2.12)". We will grade your work using this version.
  o Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment.
• You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSets of rows. You can use the DataSet API in Spark to work with these DataFrames. Here is a Spark document that will help you get started with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.
  o After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql, etc. Language magics are used to override the default language, which you must not do for this assignment.
  o You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allow you to directly write full SQL queries like spark.sql("SELECT * FROM col1 WHERE ..."). This should be df.select("*") instead.
• The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads the data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (refer to the Databricks Setup Guide above). It also contains code that filters the data to keep only the rows where the pickup location is different from the drop-off location and the trip distance is strictly greater than 2.0 (> 2.0).
  o All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.
  o Carefully read the instructions in the notebook, which provide hints for solving the problems.
• Some tasks in this question specify result data types of lower precision (e.g., float). For these tasks, we will accept relevant higher-precision formats (e.g., double). Similarly, we will accept results stored in data types that offer greater range (e.g., long, bigint) than what we have specified (e.g., int).
• Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
• Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, window(), partitionBy, orderBy, etc.

Tasks and point breakdown
1. List the top 5 most popular locations for:
a. [2 pts] dropoff, based on "DOLocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "DOLocationID" gets listed first.
b. [2 pts] pickup, based on "PULocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "PULocationID" gets listed first.
2. [4 pts] List the top 3 LocationIDs with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: if a taxi picked up 3 passengers at once, we count it as 1 pickup, not 3 pickups.
3. [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx, along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.
4. [5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days, in descending order (no rounding required). Here, the average pickup is calculated by averaging the number of pick-ups over the different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021, and 02/15/2021 are all Mondays, so the average pick-ups for Monday is the sum of the pick-ups on each of these dates divided by 3. An example output is shown below. Note: the day of week is a string with the day's full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format yyyy-mm-dd.
5. [6 pts] For each hour of a day (0 to 23, 0 being midnight), in order from 0 to 23 (inclusive), find the zone in the Brooklyn borough with the largest number of total pick-ups. Note: all dates for each hour should be included.
6. [7 pts] Find the 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in order from largest percentage increase to smallest. An example output is shown below. Note: all years need to be aggregated to calculate the pick-ups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.

List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns. While column names can be different, their resulting values must be correct.
• You must manually enter the generated output into the corresponding sections of the q2_results.csv file, preferably using spreadsheet software like MS Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to the show() and display() functions of Scala.
• Note that you can edit this csv file with a text editor, but please be mindful about putting the results under the designated columns.
• If you encounter a "UnicodeDecodeError", please save the file as ".csv UTF-8" to resolve it.
Note: Do NOT modify anything other than filling in the required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So make sure your output is obtained from the Spark Scala code you write. Failure to include the dbc and scala files will result in a deduction from your overall score.

Q3 [35 points] Analyzing a Large Amount of Data with PySpark on AWS
You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that could be accomplished on a commodity computer (e.g., a consumer-grade laptop or desktop). However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage and Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output in that S3 bucket. Amazon Athena is serverless, meaning that you pay only for what you use; there are no servers to maintain that would accrue costs whether they are being used or not. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.
In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2].
You are provided with a Python notebook (q3.ipynb) which you will complete and load into Athena. You may need to allow third-party cookies in your browser if the notebook doesn't load in Athena. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data where each record refers to one (1) trip. The second DataFrame, zones, maps a LocationID to its trip information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.

Technology: PySpark, AWS
Deliverables: [Gradescope]
• q3.ipynb: PySpark notebook for this question (for the larger dataset).
• q3_output_large.csv: output file (comma-separated) for the larger dataset.

IMPORTANT NOTES
• Use Firefox, Safari, or Chrome when configuring anything related to AWS.
• EXTREMELY IMPORTANT: Both datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) at the beginning (not Oregon, which is the default). This is extremely important; otherwise your code may not work, and you may be charged extra.
• Strictly follow the guidelines below, or your answer may not be graded.
a. Ensure that the parameters for each function remain as defined and that the output order and names of the fields in the PySpark DataFrames are maintained.
b. Do not import any functions that were not already imported within the skeleton.
c. You will not have access to the Spark object directly in the autograder. If you use it in your functions, the autograder will fail!
d. Double-check that you are submitting the correct files and that the filenames follow the correct naming standard; we only want the script and output from the larger dataset. Also, double-check that you are writing the right dataset's output to the right file.
e. You are welcome to store your script's output in any bucket you choose, as long as you can download and submit the correct files.
f. Do not make any manual changes to the output files.
g. Do not remove #export from the HW skeleton.
h. Do not import any additional packages, as this may cause the autograder to malfunction. Everything you need has already been imported for you.
i. Using .rdd() can cause issues in the Gradescope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using it.
j. Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display, and show statements used for debugging must be removed.
k. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed, and withColumn. Documentation for the DataFrame APIs is located here.
l. Testing on a single, small dataset (i.e., a "test case") is helpful, but it is not sufficient for discovering all potential issues, especially issues that only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.
m. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.
n. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.
o. Some columns may need to be cast to the right data type. Keep that in mind!

Tasks and point breakdown
1. [0 pts] Setting up the AWS environment.
a. Go through all the steps in the AWS Setup Guide (you should have already completed Step 1 to create your account) to set up your AWS environment, e.g., creating the S3 storage bucket and uploading the skeleton file.
2. [1 pt] user()
a. Returns your GT username as a string (e.g., gburdell3).
3. [3 pts] trip_statistics(trips)
a. Compute basic statistics (count, mean, stddev, min, max) for trip_distance.
b. Returns a PySpark DataFrame with the schema (summary, trip_distance).
4. [5 pts] busiest_hour(trips)
a. Determine the hour (0-23) of the day with the highest number of trips.
b. Returns a PySpark DataFrame with a single row showing the hour with the highest trip count and the corresponding number of trips. Schema (hour, trip_count).
5. [5 pts] most_freq_pickup_locations(trips)
a. Identify the top 10 pickup locations (by PULocationID) with the highest number of trips.
b. Returns a PySpark DataFrame showing the top 10 PULocationIDs with the highest trip counts. Schema (PULocationID, trip_count).
6. [6 pts] avg_trip_distance_and_duration(trips)
a. Calculate the average trip distance and the average trip duration in minutes (i.e., seconds divided by 60) for each hour of the day (0-23).
A valid trip must have non-null timestamps and a trip duration greater than zero.
b. Returns a PySpark DataFrame with 24 rows showing each hour (0-23) along with the average trip distance and average trip duration for that hour. Schema (hour, avg_trip_distance, avg_trip_duration). Note: You can use unix_timestamp to help with calculating the duration. If there are null or invalid timestamps, you will want to handle them accordingly.
7. [10 pts] most_freq_peak_hour_fares(trips, zones)
a. Identify the top 10 most frequent routes (combinations of PULocationID and DOLocationID) with pickups during peak hours (e.g., 7 AM - 9 AM and 4 PM - 7 PM). Peak hours can be defined as 7
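For tasks 4 and 6, the sketch below shows one way to structure the hour extraction and the duration calculation via unix_timestamp. It is an illustrative pattern, not the graded solution: the timestamp column names (tpep_pickup_datetime, tpep_dropoff_datetime) are assumed from the earlier Q1 schema and may differ from what load_data() actually provides, and only imports allowed by the skeleton may be used in your submission.

from pyspark.sql import functions as F

def busiest_hour(trips):
    # Count trips per pickup hour and keep the single busiest hour.
    return (trips.withColumn("hour", F.hour("tpep_pickup_datetime"))
                 .groupBy("hour")
                 .agg(F.count("*").alias("trip_count"))
                 .orderBy(F.col("trip_count").desc())
                 .limit(1))

def avg_trip_distance_and_duration(trips):
    # Duration in minutes from the timestamp difference (via unix_timestamp),
    # keeping only trips with non-null timestamps and a strictly positive duration.
    dur_min = (F.unix_timestamp("tpep_dropoff_datetime")
               - F.unix_timestamp("tpep_pickup_datetime")) / 60.0
    valid = (trips.withColumn("duration_min", dur_min)
                  .filter(F.col("tpep_pickup_datetime").isNotNull()
                          & F.col("tpep_dropoff_datetime").isNotNull()
                          & (F.col("duration_min") > 0)))
    return (valid.withColumn("hour", F.hour("tpep_pickup_datetime"))
                 .groupBy("hour")
                 .agg(F.avg("trip_distance").alias("avg_trip_distance"),
                      F.avg("duration_min").alias("avg_trip_duration"))
                 .orderBy("hour"))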
Spring 2025Goal Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau. Technology Tableau Desktop Deliverables Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions: ● table.png: Image/screenshot of the table in Q1.1 ● grouped_barchart.png: Image of the chart in Q1.2 ● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2 ● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4 a Q1 will be manually graded after the grace period. Setting Up Tableau Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it. a If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your site and create a workbook a If neither of the above methods work, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Connecting to Data 1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook. 2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here. 3. For Q1.1 and Q1.2: a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton. b. Click on the graph area at the bottom section next to “Data Source” to create worksheets. 4. For Q1.3: a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source. b. When connecting to a data source, choose To a Server – Web Data Connector. c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data-world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query. a NOTE: If you cannot connect to data-world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau. a d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract. 4 Version 0 Table and Chart Design 1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show: a a. 
Total number of games in each category (fighting, economic, …) b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau. c. Average rating of games in each category (use simple average), rounded to 2 decimal places. d. Average playtime of games in each category, rounded to 2 decimal places. e. In the bottom left corner below your table, include your GT username (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here). f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard → Export → Image). NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot If you use HTML, Pandas, etc. a Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. 2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic,…) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g.,
Spring 2025Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Build a co-actor network for movies released in 1999 according to the instructions in Q1.py and produce the correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 0SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). 
Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part 7 and 8. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 0 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. 
Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration(cast_member_id1, cast_member_id2, movie_count, average_movie_score). For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing. Optional Reading: Why create views?
8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2.
a. Order your output by collaboration_score in descending order, then by cast_name alphabetically.
b. Output format and example values (cast_id,cast_name,collaboration_score): 2,Mark Hamill,99.32 1920,Winona Ryder,88.32
9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation).
a. [1 point] Import movie overview data from movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text). NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. NOTE: If FTS is not enabled in your environment, try the following steps:
• Go to the SQLite downloads page: https://www.sqlite.org/download.html
• Download the dll file for your system
• Navigate to your Python packages folder, e.g., C:\Users\…\Anaconda3\pkgs\sqlite-3.29.0-he774522_0\Library\bin
• Drop the downloaded .dll file into that bin folder.
• In your IDE, import sqlite3 again; FTS should now be enabled.
b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc. ii. Output format and example value: 12
c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’ ii. Output format and example value: 6
Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species
In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is legal and sustainable.
TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2025) Wildlife Trade Portal. Available at www.wildlifetradeportal.org. Technology • D3 Version 5 (included in the lib folder) • Chrome 97.0 (or newer): the browser for grading your code • Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. Deliverables • Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or q3.csv dataset. IMPORTANT NOTES: • Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. • We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work. • All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3, its header contains:. It is incorrect to use an absolute path such as:. The 3 files that are referenced are: a. lib/d3/d3.min.js b. lib/d3-dsv/d3-dsv.min.js c. lib/d3-fetch/d3-fetch.min.js • In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path. • Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integer. See https://github.com/d3/d3-fetch#dsv. • VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout. Tasks and point breakdown Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications: 8 Version 0 1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset. 2. 
[1 point] The bars must have the same fixed thickness, and there must be some space between the bars so they do not overlap.
3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be an element having the id: “x_axis” and the y-axis must be an element having the id: “y_axis”.
4. [2 points] Set the x-axis label to ‘Count’ and the y-axis label to ‘Species’. The x-axis label must be an element having the id: “x_axis_label” and the y-axis label must be an element having the id: “y_axis_label”.
5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every interval of 500. The X-axis must be displayed below the plot.
6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot.
7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e.,). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be an element having the id: “title”.
8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be an element having the id: “credit”.
9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe palette.
NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it. This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built.
plot | width: 900 | height: 370
+– containing Q3.a plot elements
    +– containing bars
    +– x-axis
        +– (x-axis elements)
    +– x-axis label
    +– y-axis
        +– (y-axis elements)
    +– y-axis label
    +– GTUsername
    +– chart title
Q4 [5 points] OpenRefine
OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires a Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded in the installer, so there is no need to install Java separately when working with this version. Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2
Technology • OpenRefine 3.6.2
Deliverables
• properties_clean.csv: Export the final table as a csv file.
• changes.json: Submit a list of the changes made to the file in JSON format. Go to the ‘Undo/Redo’ tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’. Rename it to ‘changes.json’.
• Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi.
Provide each answer on a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line.
Tasks and point breakdown
1. Import Dataset
a. Run OpenRefine and point your browser at http://127.0.0.1:3333.
b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset, provided as “properties.csv”.
c. Choose “Create Project” → This Computer → properties.csv. Click “Next”.
d. You will now see a preview of the data. Click “Create Project” at the upper right corner.
2. [5 points] Clean/Refine the Data
a. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500
NOTE: OpenRefine maintains a log of all changes. You can undo changes with the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and raise errors from the autograder.
b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10
NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b.
c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster it (Edit Cells → Cluster and Edit…); this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the “keying function” that produces the smallest number of clusters under the “Key Collision” method. Click “Select All” and “Merge Selected & Close.” Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200
NOTE: Use the default Ngram size when testing Ngram-fingerprint.
d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”)
NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown ”.)
e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this.
Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 0 Q5 [5 points] Introduction to Python Flask In this question, you will build a web application using Flask. Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. The website will display wildlife trafficking data and allow users to filter and explore trafficking volume by different species classes. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open index.html, showing a table in which the rows returned by data_wrangling() are displayed. You can then choose different species classes from the dropdown and see how the data table updates dynamically. 4. You must solve the following two sub-questions: a. [2 points] Generate a list of unique animal classes for options in the dropdown menu. Sort the list alphabetically. b. [3 points] Filter, sort, and limit the data i. First, filter the data to only the specified class of animal. If no class is specified, include all the data. ii. Next, sort the data by the count column in descending order. You do not need to worry about tiebreaks. iii. Last, limit the data to only the top 10 rows. If the number of rows is fewer than 10 then return all rows.
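The filter/sort/limit logic in Q5 is plain Python. The following sketch assumes the wrangled data is a list of rows of the form [species_name, animal_class, count]; the actual structure used by the Q5.py skeleton may differ, so treat this only as an illustration of the three steps:

def filter_sort_limit(data, animal_class=None):
    # Keep only rows for the requested class (or all rows if none is given).
    rows = [row for row in data if animal_class is None or row[1] == animal_class]
    # Sort by the count column in descending order.
    rows.sort(key=lambda row: row[2], reverse=True)
    # Return at most the top 10 rows.
    return rows[:10]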
Spring 2025
Section 1: Overview
In this project, you will implement some database operators, more specifically join and aggregation, to answer a SQL query.
Dataset Format: In this project, you will work with two datasets (Dataset-A, Dataset-B).
Dataset-A
• The directory containing the data is named “Project3Dataset-A”. You can hardcode this name in your project. It is the name that the TAs will use when doing the testing. This directory should be under the working directory.
• The dataset directory contains 99 files (A1…A99), each file contains 100 records, and each record is 40 bytes. The record format for record #j in file #i is (for i use two digits like 01, and for j use three digits like 001): //The only difference from Project 2 is the 1st letter in the record (it is “A” instead of “F”) Ai-Recj, Namej, addressj, RandomV… where RandomV is a four-digit random number between 0001 and 0500. Since the number of records in the entire dataset is around 10,000, each value within the range of 0001 and 0500 is expected to appear in the dataset (on average) 20 times, but it can be more or less. Also, each record ends with three dots “…” to complete to 40 bytes. //Compared to Project 2, the range for RandomV is much smaller (only from 1 to 500).
• It is important to highlight that index “j” resets and starts from “001” in each file. This is important especially in Section 4 (the aggregation section).
Dataset-B
• The directory containing the data is named “Project3Dataset-B”. You can hardcode this name in your project. It is the name that the TAs will use when doing the testing. This directory should be under the working directory.
• The dataset will contain the same exact content as Dataset-A. The only difference is that the 1st letter of each record is “B” instead of “A”. Also, the file names will be B1…B99.
***SELECT ONLY TWO OF THE THREE PROBLEMS BELOW***
Section 2 (Building Hash-Based Join) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the hash-based join. “SELECT A.Col1, A.Col2, B.Col1, B.Col2 FROM A, B WHERE A.RandomV = B.RandomV” where “A” refers to Dataset-A, and “A.Col1” and “A.Col2” refer to columns 1 and 2 in the dataset. The same applies to “B” and its dataset. The syntax of the command is fixed, i.e., nothing will change when testing. In this part, you need to write code that:
• Builds a hash table on Dataset-A. The hash table should have 50 buckets. The buckets will store the entire record content.
• The hashing of each record to determine the corresponding bucket should be based on the join column. Refer to the SQL command to know which column is the join column.
• Then, make a loop to read Dataset-B file-by-file and record-by-record. For each record (say r), apply the same hash function on the join column to know which bucket you should check from Dataset-A (say bucket #K).
• Now you need to apply the join condition between r and each record in bucket K. If the join condition is met, i.e., A.RandomV = B.RandomV, then you need to produce an output record with the needed columns (see the SQL command for the columns).
• It is up to you to either maintain the records in each bucket sorted (based on the join column) or leave them unsorted. If you keep them sorted, then the search in the previous step should be more efficient (use binary search).
//In your report indicate which design choice you made
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print out the qualifying records (only the columns specified in the query)
Section 3 (Building Block-Level Nested-Loop Join) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the nested-loop join. “SELECT count(*) FROM A, B WHERE A.RandomV > B.RandomV” This command will report the count of records satisfying the join condition. Since the join condition is not an equality, a hash-based join cannot be used, but a nested-loop join can. The syntax of the command is fixed, i.e., nothing will change when testing. In this part, we assume the available memory is limited and can only hold at most the content of one file. Therefore, you need to write code that:
• Loops over each file in Dataset-A, and for each file do:
o Store the records of that file in memory (put them in some data structure such as an array).
o Read the entire Dataset-B, file-by-file and record-by-record, and compare each record with the records you have in memory from Dataset-A.
o Maintain the count of the records matching the join condition.
• Then, retrieve the next file from Dataset-A and repeat the process.
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print the count of the qualifying records
Section 4 (Hash-Based Aggregation) [30 Points]
Command: The following command (SQL statement) is what your program will receive to trigger the execution of the aggregation operator. “SELECT Col2, <AggFunc> FROM <Dataset> GROUP BY Col2” where:
• <Dataset>: is a placeholder. The actual value put in the command can be either A or B, referring to Dataset-A or Dataset-B, respectively.
• <AggFunc>: is a placeholder. The actual value can be either of the aggregation functions SUM(RandomV) or AVG(RandomV).
• Col2: is part of the syntax, i.e., it will not change. It refers to the 2nd column in the dataset (the “name” column).
• The meaning of the query is that all records in the dataset having the same value in the 2nd column should form one group, on top of which the aggregation function is applied. Each group should produce one output record consisting of the Col2 value along with the output from the aggregation function (similar to standard SQL).
• To implement the aggregation query, you should maintain a hash table, where each distinct group should have an entry in this table. Then, as you scan the dataset, you need to keep updating the aggregation value maintained in the hash table. After scanning the entire dataset, you start printing the content of the hash table.
What to produce as output:
1) Print out the execution time taken to execute the command (in milliseconds)
2) Print out the output records from the SQL statement
What to Deliver
• The entire source code package
• Readme.txt, in which you must include:
o Your name and student ID
o Section I: a section on how to compile and execute your code. Include clear, easy-to-follow, step-by-step instructions that the TAs can follow.
o Section II: State clearly which parts are working and which parts are not working in your code. This will help the TAs give you fair points.
o Section III: a section that describes any design decisions you made beyond the design guidelines given in this document.
What and Where to Submit
A single file .zip to be submitted in the Canvas system
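To make the hash-based join in Section 2 concrete, here is an illustrative sketch in Python (the project itself may be written in another language); the record-parsing details are simplifying assumptions and should be adapted to your own file-reading code:

import os

def read_records(directory):
    # Yield the comma-separated fields of each 40-byte record in every file.
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "r") as f:
            data = f.read()
        for i in range(0, len(data), 40):
            yield [field.strip() for field in data[i:i + 40].split(",")]

def hash_join(dir_a="Project3Dataset-A", dir_b="Project3Dataset-B", buckets=50):
    # Build phase: hash every Dataset-A record into one of 50 buckets on RandomV.
    table = [[] for _ in range(buckets)]
    for rec in read_records(dir_a):
        table[hash(rec[3].rstrip(".")) % buckets].append(rec)
    # Probe phase: hash each Dataset-B record to its bucket and test the condition.
    output = []
    for rec_b in read_records(dir_b):
        key = rec_b[3].rstrip(".")
        for rec_a in table[hash(key) % buckets]:
            if rec_a[3].rstrip(".") == key:
                output.append((rec_a[0], rec_a[1], rec_b[0], rec_b[1]))
    return output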
Spring 2025 Section 1: Overview In this project, you will create a simple index structure to speed up the performance of the lookup queries. A lookup query is a query that involves a predicate (condition) on a certain column. Ways to answer a lookup query: As covered in class, there are two main approaches as summarized below. • Full Table Scan: In this approach, the database system needs to read the entire table (all its blocks) one by one. Then, for each block, the records are scanned one by one and checked against the query predicate. o This approach is used ONLY if there is no index available. • Index Lookup: In this approach, the database system will first check the index, and figure out whether the value exists or not. If the value exists (or potentially exits), then the index will specify which data block to read. The DBMS will then read this data block only without scanning all other blocks. o This approach is used if there is an index available on the column involved in the condition Dataset Format: • You are given a dataset that you can directly use. • The directory containing the data is named “Project2Dataset”. You can hardcode this name in your project. It is the name that TAs will use when doing the testing. • Put this directory “Project2Dataset” under the working directory of the Java program (the directory from which the program will run). This is the location from which the data is read. • The dataset directory contains 99 files, each file contains 100 records, and each record is 40 bytes. This very similar to Project 1 dataset with slight differences as follows. For this project the record format for a record #j in file #i is (for i use two digits like 01, and for j use three digits like 001): Fi-Recj, Namej, addressj, RandomV… where RandomV is a four digit random number between 0001 and 5000. Since the number of records in the entire dataset is around 10,000, then each value within the range of 0001 and 5000 is expected to appear in the dataset (on average) twice, but it can be more or less. Also, each record ends with three dots “…” to complete to 40 bytes. • The RandomV column is the column on which you will do the search and build the index. • As in Project 1, all records are of the same length (40 bytes), they are concatenated after each other, and there are no “new line” characters. The record boundaries are computed based on the 40-byte length. Section 2: Building Hash-Based Index Structure [20 Points] (See the “CREATE INDEX …” command below) In this part, you need to write code that builds a hash-based index on the RandomV column. The code will do the following functionalities: • Read all files in the dataset directory, one by one, and for each file, read record by record. • For each record, extract the RandomV value, and put it in a hash table. A hash table entry should have two components (key k, value v), where k = RandomV value, and v = the record locations (file number and the offset at which the record begins within this file). o It is up to you to design the appropriate data type (or structure for v) to keep multiple locations associated with a certain key. o Hint: You may concatenate multiple locations in a single string. • The hash table should be kept in memory, and this is your hash-based index. 3 Section 3: Building Array-based Index Structure [20 Points] (See the “CREATE INDEX …” command below) In this part, you need to write code that builds an array-based index on the RandomV column. 
The code will do the following functionalities: • Allocate an array of size 5,000, each entry should store record locations (file number and the offset at which the record begins within this file). o Keep in mind that for a single value, there can be multiple records with that value o It is up to you to design the appropriate structure • Read all files in the dataset directory, one by one, and for each file, read record by record. • For each record, extract the RandomV value, say the value = i . go to the ith slot in the array and add the record location information. • The array should be kept in memory, and this is your array-based index. Command to build the hash and array-based indexes • When your program executes, it should do nothing except printing the following sentence: Program is ready and waiting for user command. You program should be designed to loop and whenever it completes a command, it should print the same sentence indicated above and wait for the next command: • If the user wants to create the hash-based and array-based indexes highlighted above, the user will enter the following command: CREATE INDEX ON Project2Dataset (RandomV) o The text of the command is fixed including the dataset name (Project2Dataset) and column name (RandomV) o This command should build both indexes and keep them in memory o The data files should be read ONCE to build both indexes concurrently o Once the two indexes are built, print out message “The hash-based and array-based indexes are built successfully. Program is ready and waiting for user command.” Note: You may receive SELECT commands (see below) without indexes being created. Section 4: Equality-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV = v • If there are no indexes built, then you should perform a full table scan. • If there are indexes, then for equality search use the hash-based index. • If you will use an index, make sure to leverage the record location information (both the fileId and byte offset) to minimize the I/O and CPU. • “v” is any constant number. • The syntax for the SELECT command is fixed (nothing will change) except value “v” • The output that you should generate is: o Print out the record(s) matching the query o Indicate the index type you used (if any) or Table Scan. o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read 4 Section 5: Range-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV > v1 AND RandomV < v2 • If there are no indexes built, then you should perform a full table scan. • If there are indexes, then for range search use the array-based index. • If you will use an index, make sure to leverage the record location information (both the fileId and byte offset) to minimize the I/O and CPU. • The syntax for the SELECT command is fixed (nothing will change) except values “v1” and “v2” which are random constants like “1” and “15” • The output that you should generate is: o Print out the record(s) matching the query o Indicate the index type you used (if any) or Table Scan. 
o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read Section 6: Inequality-Based Query Lookup [20 Points] To receive the query, your program should support this command SELECT * FROM Project2Dataset WHERE RandomV != v • With inequality operator, indexes should NOT be used (even if they exist) • The syntax for the SELECT command is fixed (nothing will change) except value “v” • The output that you should generate is: o Print out the record(s) matching the query o Report the time taken to answer the query (in milli sec) o Indicate how many data file(s) (which are equivalent to disk blocks) did you need to read What to Deliver • The entire source code package • Readme.txt, in which you must include: o Your name and student ID o Section I: section on how to compile and execute your code. Include clear easy-to-follow step by step that TAs can follow o Section II: State clearly which parts are working and which parts are not working in your code. This will help the TAs give you fair points. o Section III: section describes any design decisions that you do beyond the design guidelines given in this document. What and Where to Submit A single file .zip to be submitted in the Canvas system
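The two index structures and the equality lookup boil down to mapping each RandomV value to the locations of its records. The project description references a Java program; the Python sketch below only illustrates that logic, and the record-parsing and ASCII byte-offset details are simplifying assumptions:

import os

def build_hash_index(directory="Project2Dataset"):
    # Map each RandomV value to a list of (file_name, byte_offset) locations.
    index = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "r") as f:
            data = f.read()
        for offset in range(0, len(data), 40):
            random_v = data[offset:offset + 40].split(",")[3].strip().rstrip(".")
            index.setdefault(random_v, []).append((name, offset))
    return index

def equality_lookup(index, directory, v):
    # Read only the files (blocks) that the index points to for value v.
    results = []
    for file_name, offset in index.get(v, []):
        with open(os.path.join(directory, file_name), "rb") as f:
            f.seek(offset)
            results.append(f.read(40).decode())
    return results

An array-based index for the range query is the same idea, but with a list of 5,000 slots indexed by the integer value of RandomV instead of a dictionary.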
Spring 2025Section 1: Overview In this project, you will learn and implement the key concepts of buffer management as done in database management systems. Although the project will not be done within a real DBMS, you will simulate the actions taken by the buffer manager as we learnt in class. In traditional DBMS a table consists of multiple data blocks. In our simplified project, a “table” will correspond to a “directory” in the file system, and a “data block” will correspond to a small “file” under this directory. Then, each DB record will correspond to a line in one of these files. In our context, all files have the exact same size. Each file stores 100 records (i.e., 100 lines), and each record (line) is 40 bytes. You can observe that each file will be of size roughly “4KB”, which mimics a disk block. Note: each file is now the unit of processing. That is, an entire file is read from disk to the buffer or taken out from the buffer and written back to disk. • Every record is exactly 40 bytes (the format is shown below). There is NO end-of-line character at the end of each record. This means all records are just concatenated after each other. • Conceptually, the format and content of a record will not make a difference. However, for ease of testing and readability, let every record follow this format. A record #j in file #i will have (for i use two digits like 01, and for j use three digits like 001): Fi-Recj, Namej, addressj, agej. • Our numbering system will start from 1 for both files (index i) and records (index j). Index j resets and start with 001 in each file. • Example: the 23rd record in the 3rd file will have the format of (exactly 40 bytes) F03-Rec023, Name023, address023, age023. Example 1: Assume a “Student” table that consists of 99 disk blocks and each block contains 100 records. Each record has 40 bytes containing the student ID, Name, address, phone number, etc. Now, this design in our project will correspond to a file system directory with name “Student”, under which there are 99 small files (names F1, F2, F3, …F99), each file holds the content of one disk block. Example 2: Let’s say we need to read record number k (say k =250), we will need to do steps like the following: • You need to do the calculations to figure out this record exists in which block (in our terminology, which file). For example, if k =250 and since each file holds 100 records, then the record we need is in file #3. Then, calculate the record number with this file. In this example (k = 250), it will be the 50th record in F3. • You need to check whether or not file #3 is in memory (in the buffer). Let’s say, it is in memory, then no I/O is needed. You need to find where in the buffer file #3 exists and read from this buffer record #50 (which corresponds to record #250 in the entire directory (table)). • If file #3 does not exist in the buffer, then let’s assume the easy case where there is an empty space in the buffer. In this case, you need to read the content of file #3 (all the content of this file), put it in a certain frame in the buffer, and then read the record we need from that buffer. Section 2: Design Guidelines In this section, you are given the high-level design ideas and guidelines to build the operations of a buffer manager. These are just guidelines. It is your responsibility to come up with the complete end-to-end design to have a working system. 3 Single Buffer (Memory Frame): “Frame” class • This is an object in your Java program to hold one file (one block). 
• You should design a “Frame” class, which contains some key (private) elements such as:
o “content”: array of bytes of size 4K //to hold the file content
o “dirty”: Boolean flag //set to True if the content of this block has changed and needs to be written to disk when this frame is taken out
o “pinned”: Boolean flag //True if there is a request to keep this block in memory and not take it out. False means it can be taken out.
o “blockId”: integer //It should be the Id of the block stored in this frame. E.g., if we need to read file #3 as in Example 2, then “blockId = 3”. You can use “-1” to indicate that the frame is empty and there is no block in this frame.
o … any other variables you think are useful and need in your design
o You should also have a set of “public” methods to set and get the values of the variables in this class.
o You should also have methods to, for example, return a specific record in this block, e.g., record number i (remember that all records have the same size of 40 bytes). This method should take as input the record number (i) and return the content of this record (string of 40 bytes).
o Another method you can think of is to update a specific record. This method should take the record number and the new content (40 bytes). Its job is to set that record to the new content. Remember: if the content changes, the “dirty” flag should be set.
o … You may think of other methods, e.g., an initialize() method to initialize all of the variables above.
Buffer Pool: “BufferPool” class
• This is the object in your Java program which represents the entire available buffer.
• You should design a “BufferPool” class, which contains some key (private) elements including:
o “buffers”: array of “Frame” objects. //The size of this array is decided at run time; the program should take an input argument that decides the size of this array.
o Any other elements and variables you think are useful.
o One public method is “initialize”, which should (1) build the array given the input argument, and (2) go over each frame and initialize it, e.g., by calling the initialize() method of each frame.
o One method to search whether a certain block (file) is available in the buffer pool. The method takes as input the block Id (file Id), and should return the buffer number (slot number in the array) holding this block (or -1 if not available).
o Another method to return the content of a given block Id. This method should also take as input the block Id (file Id). It can call the method in the previous bullet to find the buffer number (if the block is present), and then read the content.
o Another method to be used if the needed blockId is not in the buffer pool. In this case, this method should read the block (file) from disk and bring it into the buffer pool (in an empty frame).
o You need a method to search for and return the array index of an empty frame (if any).
o If there are no empty frames in the buffer pool, then you may need to take one out and return it back to disk (if possible). This method will differ in how it selects the to-be-evicted frame depending on the placement policy.
o Any other methods that you need for your design
Section 3: Project Requirements
You should design a Java program following the guidelines highlighted above.
The program should support the following functionalities (Make sure to read Section 4 before jumping to implementation): 1) [10 Points] Calling the program and passing one input parameter (n) representing the buffer pool size a. The program should create the bufferPool class with “n” slots in the “buffers” array b. Do all proper initializations. Initially, all frames are empty. Also initialize any metadata you have. c. After that, write on the screen message “The program is ready for the next command”, and wait for the user to enter the next command, which will be any of the following commands. //For consistency and ease of grading, all commands should be UPPER CASE (GET, SET, PIN, UNPIN) For the input data, assume the directory is named “Project1” in the same place from which the java program will run. Under this directory, there should be files F1, F2, … Note: TAs will not test for out-of-range record number or file number. Whatever is requested in the following commands is expected to be present under directory “Project1”. 2) [40 Points] Command #1 (GET command) : “GET k”. In this command, the user needs to print the content of record #k from the file. a. The program should call a GET() function in the buffer pool class. The function should calculate which block (file) contains this record (refer to Example 2). b. The function should scan the “buffers” array to figure out whether or not the desired file is in memory. You should call the corresponding method in the “BufferPool” class to do so. There are four possible cases: Table 1 CASE What Needs to be done CASE 1 [10 Points]: The block (file) is in memory This is an easy case, find the desired record (40 bytes only) and return it to be printed on the screen. CASE 2 [10 Points]: The block is not in memory, but there are empty buffers in the buffer pool array You need to read the right file from disk. Copy its content to an empty buffer frame. Update the metadata properly. Once the file is in memory, you should call the functions used in CASE #1 to do the rest. //For this part, choose the 1st empty frame from the beginning of the array (do not choose randomly). That is, if the array has the 1st 5 frames full and the rest are empty, then the file should go to frame #6. 5 CASE #3 [10 Points]: The block is not in memory, the buffer pool array is full (no empty frames), but some frames can be taken out. The content of a non-empty frame can be taken out ONLY IF its “pinned” flag is “False”. Otherwise, this frame is not a candidate to be taken out. You need to search for a frame that can be taken out. If you find one, then, there are two cases: Case 1: The “dirty” flag is False. This means it can be taken without the need to write back to disk. Just overwrite the content. Case 2: The “dirty” flag is True. This means that the content must be written back to disk, otherwise the changes will be lost. Remember that the entire file content (4KB) is one unit, even a single byte changes the entire file needs to be written back to disk and overwrite the previous content. CASE #4 [10 Points]: The block is not in memory, the buffer pool array is full (no empty frames), no frames can be taken out This may happen if all frames have content and all of them are “pinned”. In this case, print out a message “The corresponding block # cannot be accessed from disk because the memory buffers are full”. 
Output From “GET” Command: The required output is:
(1) Print the record content (the 40 bytes) for CASEs #1, 2, 3 above, or the message indicated in CASE #4
(2) Print whether or not an I/O is done (i.e., whether the block was already in memory or brought from disk)
(3) Print the frame # (the entry number in the buffers array) that contains the block (for CASEs #1, 2, 3)
3) [10 Points] Command #2 (SET command): “SET k ”. In this command, the user needs to set the content of record #k to the given string.
a. All of the work in Command #1 applies to have the desired block (file) in memory.
b. In addition, you need to change the content of the record to the new string. You will be given exactly 40 characters, so you do not need to do any error checking for that.
c. Make sure to set the “dirty” flag to “true”.
Output From “SET” Command: The required output is:
(1) Print whether or not the write is successful
(2) Print whether or not an I/O is done (i.e., whether the block was already in memory or brought from disk)
(3) Print the frame # (the slot number in the buffers array) that contains the block (for CASEs #1, 2, 3)
NOTE: In the SET command, DO NOT write the file back to disk. It will remain “dirty” in memory until the buffer manager decides to take it out; at that time it should be written to disk.
4) [10 Points] Command #3 (PIN command): “PIN BID”. In this command, the user wants to pin a specific block (BID) in memory.
a. Notice that BID is a block number (file number), not a record number. Example: “PIN 3” means pin block #3 (F3) in memory.
b. There are a few cases:
Table 2 (each case and what needs to be done)
CASE 1 (The block is already in the buffer pool): Set the “pinned” flag to True. If it is already set, then do nothing.
CASE 2 (The block is not in memory, and the buffer pool either has empty slots or a block can be taken out; CASEs 2 & 3 in Table 1): You should bring block BID into memory and set the “pinned” flag to True.
CASE 3 (The block is not in memory, the buffer pool is full, and no blocks can be taken out because they are all pinned; CASE #4 in Table 1): In this case, print out the message “The corresponding block BID cannot be pinned because the memory buffers are full”.
Output From “PIN” Command: The required output is:
(1) Print the frame # in the buffer pool array that is pinned.
(2) Print whether or not the “pinned” flag was already true.
5) [10 Points] Command #4 (UNPIN command): “UNPIN BID”. In this command, the user wants to unpin a specific block (BID) in memory.
a. Notice that BID is a block number (file number), not a record number. Example: “UNPIN 3” means unpin block #3 (F3) in memory.
b. There are a few cases:
Table 3 (each case and what needs to be done)
CASE 1 (The block is in the buffer pool): Set the “pinned” flag to False. If it is already False, then do nothing.
CASE 2 (The block is not in memory): In this case, print out the message “The corresponding block cannot be unpinned because it is not in memory”.
Output From “UNPIN” Command: The required output is:
(1) Print the frame # in the buffer pool array that is unpinned.
(2) Print whether or not the “pinned” flag was already false.
6) [10 Points] Code Commenting
You are required to provide good comments on your code. Every function needs to be properly commented such that others can understand your code. Lack of comments will cause you to lose points.
Section 4: Some strategies to follow
• How to locate an empty frame in the buffer pool?
o The easy and straightforward way is to search all entries in the array and check the “blockId” value.
If the value is -1, then this frame is empty and can be used. This is an acceptable strategy to use in this project.
o As covered in class, this scan strategy is slow if the number of frames (the buffer pool size) is large, and thus there are other, more efficient ways that DBMSs actually use, e.g., the use of bitmaps. It is up to you whether you would like to implement and use the bitmap strategy. Think of where this bitmap should be stored and design the manipulation methods that manage the bitmap.
o Either the scan approach or the bitmap approach should lead to exactly the same selection, which is covered in the next bullet.
• In what order should an empty frame be selected?
o You must select the 1st empty frame that you find starting from the top of the buffer pool. For example, if the pool has 10 frames, and the first 3 are full, then the next one to choose MUST be frame #4, and then #5, and so on.
o When grading, the TAs will expect that order; otherwise your numbers will not match, and you will lose points.
• In what order should a full frame be evicted to make space for a new block being brought into memory?
o Remember that any “pinned” block cannot be taken out (it is not a candidate for eviction) until it is “unpinned”.
o The order of selection must follow this strategy: select the 1st candidate frame following the last frame that was evicted. For example, if the last evicted frame is #1, and frames #2 & #3 are pinned, then the next frame to be evicted should be frame #4.
o Think of this strategy in a circular fashion. That is, if you reach the end of the buffer pool, then you start from the beginning again.
o When grading, the TAs will expect that order; otherwise your numbers will not match, and you will lose points.
• How to test your code?
o A test case is provided to you (dataset, set of commands, the expected output). Use it to test your code.
o Also, create your own test cases to further validate your code.

What to Deliver
• The entire source code package
• Readme.txt, in which you must include:
o Your name and student ID
o Section I: a section on how to compile and execute your code. Include clear, easy-to-follow, step-by-step instructions that the TAs can follow.
o Section II: a section on test results. A test case will be provided along with the expected output. This test case will help you test your code. In the Readme.txt file, state which of the test case commands are working successfully and which ones are failing. This will help the TAs better test your code and give you fair grades.
o Section III: a section describing any design decisions that you make beyond the design guidelines given in this document, for example, any additional variables, methods, or classes that you add.

What and Where to Submit
A single .zip file to be submitted through the Canvas system.
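To make the two ordering rules above concrete (pick the first empty frame from the top; evict in circular order starting just after the last evicted frame), here is a small Python sketch. Again, the project is written in Java, and the attribute names below (block_id, pinned, last_evicted) are only illustrative, not names mandated by the handout.

def find_empty_frame(buffers):
    # Scan from the top of the pool; an empty frame is marked with blockId == -1.
    for i, frame in enumerate(buffers):
        if frame.block_id == -1:
            return i
    return None

def choose_victim(buffers, last_evicted):
    # Circular scan starting just after the last evicted frame; pinned frames are skipped.
    n = len(buffers)
    start = (last_evicted + 1) % n
    for offset in range(n):
        i = (start + offset) % n
        if not buffers[i].pinned:
            return i          # the caller should record i as the new last_evicted
    return None               # every frame is pinned, so nothing can be evicted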
In this assignment, you will simulate the investigation of a text classification problem, asking the questions: Is it possible to distinguish real from fake facts about cities using linear classifiers? Does the choice of linear classifier matter? You will collaboratively generate a dataset of real vs. fake city facts using online generative AI tools, then set up an experimental pipeline to train models that classify them into “fact” or “fake”.

Important note: This assignment presents an oversimplified picture of distinguishing real from fake information for pedagogical purposes. Fake news/misinformation detection is a rich and complex research area in NLP and in the social sciences. Please do not take the purported results from this assignment too seriously! The goal of this assignment is to give you experience in using existing tools for machine learning and natural language processing to solve a classification task.

Before you attempt this assignment, you will need to install Python 3 on the machine you plan to work on, as well as the following Python packages and their dependencies:
NLTK: http://www.nltk.org/
NumPy: http://www.numpy.org/
scikit-learn: http://scikit-learn.org/stable/

Dataset generation
Use an online generative AI tool to generate facts about cities. For example, ChatGPT 3.5 is freely available online; there is no need to pay for this purpose. Use a prompt that will let you generate facts, real or made up. Be sure to document the prompts and the tool that you used. For example, here are two prompts that I used:
“Give me facts about Montreal.”
“Give me fake facts about Montreal.”
And some of the “facts” that were generated included:
Bilingual City: Montreal is one of the largest French-speaking cities in the world outside of Paris, but it is also a major English-speaking city. This bilingual nature contributes to its rich cultural diversity. [FACT]
Island City: Montreal is located on the Island of Montreal, which is situated at the confluence of the Saint Lawrence and Ottawa Rivers. The island is named after Mount Royal, the triple-peaked hill in the heart of the city. [FACT]
Montreal’s Underground City Is Actually a Secret Lab: Beneath the Réso lies a hidden research facility where scientists are working on a top-secret project to create a new flavor of poutine that can only be described as “cosmic.” [FAKE]
Mount Royal Is an Ancient Volcano: Mount Royal isn’t just a hill—it’s actually a dormant volcano that was last active during the Ice Age, and locals believe it has mystical powers that can control the weather. [FAKE]

Pick your own favourite city and generate facts about it! Each fact should be roughly one to two sentences in length. For the sake of sharing in class, please generate these facts in English. You are free to do supplemental experiments in other languages of your choice and to remark on this in your report, but please experiment with one language at a time.

Store your facts, one fact per line, in two text files in UTF-8 encoding with the following names: facts.txt for the generated facts, and fakes.txt for the generated fakes. Make sure there are at least 100 samples in each class, and ideally more if you have the time to gather more data.

Sharing your dataset (optional)
To expand your dataset, feel free to share your dataset with others in the class. However, you must still generate at least 100 samples of each class on your own.
I encourage you to post your datasets on the course Ed discussion forum and/or download datasets from there, or to share with your classmates. Please include the names of all students with whom you have shared data. Note that other students may not have generated samples about the same city as you. How will you account for this in your experimentation? Please note that this sharing policy only applies to the data, not to the code or to writing the report!

Preprocessing, feature extraction, and model implementation
Your responsibility is to design and run the correct experiments in order to answer the research question stated above. How will you do so in a way that is scientifically sound? You will likely need to subdivide your dataset in some way. You must explore at least 3 preprocessing decisions and 3 linear classifiers that we have discussed in class. You may use scikit-learn’s feature extraction and classification modules to help you, as well as any other tool from NLTK or NumPy. Reading scikit-learn’s documentation will be of great help in your experimentation.

Setting up the experiments
Design and implement experiments to draw reasonable conclusions about the research question above. This will require creating subsets of the dataset as we discussed in class. There are multiple correct ways to set up your experiments (as well as many incorrect ways). A sketch of one possible pipeline appears at the end of this handout.

Report
Write a short report on your method and results, carefully documenting i) the problem setup, ii) your dataset generation and experimental procedure, iii) the range of parameter settings that you tried, iv) the results and conclusions, and v) the limitations of your study. It should be no more than 1.5 pages long. Report on the performance in terms of accuracy, and speculate on the successes and failures of the models. Regarding point v), think about how much you can generalize the conclusions of your experiments to the overall problem of separating real vs. fake facts for cities, geographical locations, and in general. What assumptions have we made that are reasonable or not reasonable?

Your assignment will be marked on i) how well it satisfies the requirements stated in this handout, ii) whether your experiments adequately and correctly address the research questions, and iii) how well written your report is. It will NOT be marked based on the performance that you achieve with your models on this dataset.

Submitting code
Submit your code in a file named “pa1.py”.

What To Submit
Submit your report as a single pdf on myCourses called “pa1-answers.pdf”. In addition, you should submit i) your source code in a file called “pa1.py”, and ii) your dataset in two files, “facts.txt” and “fakes.txt”. All work should be submitted to myCourses under the Programming Assignment 1 folder. Jupyter notebooks are not acceptable as a submission format.
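As one way to organize the experiments described above, the sketch below loads facts.txt and fakes.txt, makes a train/test split, and compares a few preprocessing choices and linear classifiers with scikit-learn. Treat it only as a starting point: it does not decide how to handle the city mismatch across shared datasets, and the particular preprocessing decisions and classifiers shown are examples, not requirements.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def load(path, label):
    # One fact per line, UTF-8, as specified in the handout.
    with open(path, encoding="utf-8") as f:
        return [(line.strip(), label) for line in f if line.strip()]

data = load("facts.txt", "fact") + load("fakes.txt", "fake")
texts, labels = zip(*data)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# Three example preprocessing decisions: raw counts vs. tf-idf, lowercasing, stop-word removal.
vectorizers = {
    "counts": CountVectorizer(),
    "tfidf": TfidfVectorizer(lowercase=True),
    "tfidf_nostop": TfidfVectorizer(stop_words="english"),
}
# Three example linear classifiers.
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "perceptron": Perceptron(),
}
for vname, vec in vectorizers.items():
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
    for cname, clf in classifiers.items():
        clf.fit(Xtr, y_train)
        acc = accuracy_score(y_test, clf.predict(Xte))
        print(f"{vname:>12} + {cname:<11} accuracy = {acc:.3f}")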
Our last assignment had us produce wireframe images of a 3-D scene using a form of ray casting. In this assignment we write code that renders using ray tracing. This is a more straightforward method for rendering a scene by casting rays. Taken to its fullest level of detail, the technique can be used to create extremely realistic images of a virtual scene. We model light transport, accounting for how light energy interacts with material, and we (in essence) follow light from its sources into the scene, see how it reflects off objects, and (perhaps) eventually hits the eye of the viewer so that they see those objects. To make this feasible, ray tracing performs this “light following” in reverse: we trace rays from the eye out into the scene, see what objects are hit by each ray, and then see how light illuminates them. We do the latter (seeing whether and how the light illuminates objects) by tracing rays from the objects we hit. If an object is a mirror, we trace a ray in the direction of reflection to find out what object can be viewed in that mirror’s image. With enough of this kind of tracing, enough realistic modeling of surfaces and how materials behave, and enough computational resources, ray tracing and its variants are the basis for most photorealistic CG-rendered animated movies today.

The code for this project performs ray tracing using the graphics hardware. We will write the ray tracer within the code of a fragment shader in the C-like language GLSL. This is somewhat unusual, as ray tracing is typically performed off-line. It is the same kind of code we’ve used to support our rendering in past projects, but those shaders implemented the standard z-buffer based graphics pipeline, using an approach that is much different from ray tracing.

For the assignment, you will extend some existing ray tracing GLSL code so that it handles a mirrored surface described by a quadratic Bezier curve. The code you are given ray traces spherical and planar objects using Phong shading, and it also handles spherical mirrors. Your job is to extend the scene editor and the ray tracer to handle scenes with a curved mirror.

As a consequence of ray tracing in the hardware, the scene can be edited in a web application written in Javascript, and the updated rendering can be viewed in real time. The calculations would normally be too expensive to perform in Javascript, but a GPU can instead quickly trace the 512 × 512 × 4 rays used to depict the scene. Even so, we keep the scene small and simple to make this feasible. More general ray tracing would bog down the GPU too much.

Here is the starter code for this assignment. If you download this code, you’ll find three important source code files. For the assignment you will modify funhouse.js and trace-fs.c so that they properly handle the funhouse mirror whose footprint is specified as a quadratic Bézier curve.

To run the code, load bezier-funhouse.html into your browser. With its initial settings it allows the user to create a scene full of spheres. It also displays a spherical mirror, one that can also be resized and moved. Figure 1 shows an example of a scene where five colored spheres have been placed around that mirror ball. The left view shows the scene from above, and uses the algorithms described last week to display the layout on the floor of the room. The right view shows the ray-traced scene produced by the shader code.

Figure 2 shows the results you can obtain once you’ve completed the assignment.
When the application is switched to display the Bezier funhouse mirror, the scene editor shows the control points for the footprint (the top view) of a curve, and the ray tracer shows a picture with some scene objects warped and reflected in that funhouse mirror. It also depicts the shadow of that mirror falling on the back part of the room.

In its current state, if you click on mirror: sphere -> bezier to switch the application from displaying the spherical mirror to editing and displaying the funhouse mirror, you’ll see very little of that functionality. This state is shown in Figure 3. On the left, you’ll see that the editor uses the control polygon as the points of the quadratic Bezier curve. This is obviously a problem. The editor should instead show a smooth curve running from the first control point to the third, using the second control point to suggest the curvature of the mirror. To complete this step, you’ll need to modify the compile method of class Curve so that it computes an array of curve points. It should sample enough points on the curve to produce that array, and its method for doing so should rely on SMOOTHNESS: larger values should lead to a smoother approximation, that is, they should lead compile to use more points. The sampling ideally should depend on the curvature of the Bezier. A flatter curve should use fewer points; a more pronounced curve should require more.

You’ll also notice that the ray-traced view initially shows nothing when placed in Bezier mode. This coding involves writing two key functions in trace-fs.c within bezier-funhouse.html. The first, rayIntersectsBezier, is used by rayIntersectMirror to handle computing a ray of reflection when a traced ray hits the funhouse mirror. You can use a subdivision technique similar to the one you use for the editor to compute the mirror’s geometry; however, the normal you compute to determine the reflection ray should be smoothly interpolated so that there are no obvious discontinuities of the scene in the mirror. The second key GLSL function you need to write is rayHitsBezierBefore. This is used to figure out whether an object in the scene is in the shadow of the funhouse mirror. When the ray tracer hits a wall or a non-reflective sphere, it computes Phong shading of that surface by tracing a ray from the object to the light source. If the mirror sits between that surface point and the point light source, then it casts a dark shadow on that object, and the object won’t reflect anything but the ambient light of the room. The Phong shader uses rayHitsBezierBefore to check for that shadow.

To guide you through completing the assignment, we first walk you through the current ray tracing code just below. Then we briefly describe our methods from class for sampling points on a Bezier curve. Finally, we walk you through a plan for completing the assignment.

Let’s talk a bit about the state of the initial code and how the edited scene becomes the ray-traced funhouse image. The initial code has a simple ray tracer that handles a 4-walled room. There is one point light source, initially sitting above the entrance to the room. The room has four walls and a checkered floor. A user can place spheres of different sizes onto the floor of the room. And the code also handles computing the reflections within a single mirrored object.

Non-mirror surfaces are rendered using Phong shading. The walls, ceiling, and floor are treated as matte surfaces of different-colored materials.
This means that they have an ambient and diffuse (Lambertian) component that they reflect when illuminated by a point light source. If a wall is in shadow from the light, then only the ambient light of the room is reflected.

The non-mirror spheres sitting in the room are treated as glossy surfaces, again using the Phong model. If they are in shadow, they reflect only ambient light. If they are directly illuminated by the light, then there are also diffuse and specular components to their reflected light. This means that they have a small specular highlight, placed with the peak of the highlight at the perfect mirroring direction for the light source.

This calculation for spheres and walls is just the local illumination model we covered several weeks ago in class when describing classical hardware rendering. It is summarized by the pseudo-code below. In the code, P is the point on the surface being illuminated, n is the surface normal, V is the point from which we are viewing the object, L is the location of the (single) light source, and m stands for the material’s properties.

We have one enhancement in the Phong shading code: we don’t assume that P is directly illuminated by the light. Instead we shoot a ray out from P towards L to see if any scene objects get in the way of the light. If they do, hits-light returns false and we only give back the ambient component. If instead the light is hit by the “shadow ray”, then we also include the diffusely reflected light, possibly along with some glossy highlight. So, already, we’ve tweaked direct illumination so that it traces a shadow ray to create a more realistic rendering. Objects cast shadows on other objects in the scene.

For mirrored surfaces, when a point on the mirror is visible, we figure out the color of light reflected off the mirror from some other (non-mirror) object, or else from the walls, ceiling, and floor of the room. This is described in more detail below.

For our ray casting algorithm for Program 3, we shot rays through a virtual screen representing a piece of paper to see where each corner of an object’s mesh would fall on the page. We then judiciously connected those dots to build a wireframe rendering of the objects in the scene. Ray tracing has a similar setup. We choose a point outside the scene as a center of projection. Then we place the pixels of a virtual screen in front of that point, with the pixels forming a regular grid. That grid of pixels acts as a window into our virtual 3-D scene of objects. And then, with ray tracing, we shoot rays for each one of those pixels. The goal of these rays is to trace backwards from the eye (from the center of projection) through each pixel into the 3-D scene to determine what object is reflecting light, and at what color, towards our eye’s view. This top-level ray tracing algorithm is summarized below.

In the above, we use R as the source of the ray, and d as the direction it is shot through some pixel location x. We obtain information about the scene by a call to TRACE-RAY. This checks the world to see what object is providing a color c of light hitting us from the scene. By tracing this “primary ray” backward, we can make standard geometric calculations (“Is this sphere visible first along this ray?”) to ask questions about the scene, figuring out how light is illuminating our scene. Each ray serves to query the geometry of the scene. If we hit an object with our query ray, we’ll then typically shoot “secondary rays”.
With those secondary rays we ask questions like “Does this object have a direct line of sight to a light source?” and “Is this object a mirror? What light then does it reflect?” This is summarized by the pseudocode below. The ray we trace might hit a wall, or a sphere, or the mirror. For non-mirror objects we use our modified PHONG-COMPUTE-COLOR that includes shadow ray tests to determine the color of that object. When instead a mirrored object is hit, that leads us to bounce a secondary, reflected ray to find out the color of light that’s hitting the mirror and bouncing towards our eye. In general ray tracing, this normally leads to a recursive call to TRACE-RAY. For this assignment, in order to make hardware ray tracing feasible, our secondary ray ignores mirrored objects. Thus we call a TRACE-RAY' that behaves almost the same, but skips the mirror check.

That’s pretty much it. But this pseudocode doesn’t quite give a complete enough picture of the assignment. Let’s now discuss the vertex and fragment shaders we’ve written in GLSL. These essentially mimic the pseudocode above, but their details will be useful to know.

The key file that implements the ray tracing is the code for trace-fs.c. That code is invoked once for each pixel on the WebGL canvas, running something like the following code: The code is sent eyePosition, into, right, and up. These are uniform, meaning every pixel gets sent the same info for these. In addition, every pixel is sent an xy pair ray_offset with coordinates taken from [−1,1] × [−1,1]. These have us shoot each primary ray through a particular location in the screen.

Most of the rendering work is done within trace. The code for this function starts roughly as follows: We shoot the primary ray defined by R and d to find whether it intersects with the mirror, using rayIntersectMirror. We then see if there is any intervening sphere by calling rayHitsSomeSphereBefore. The goal of this is to determine a source and direction to compute the Phong illumination of some object or wall in the scene. If the mirror isn’t hit by the primary ray, we just directly look for an object or a wall by setting source to R and direction to d. If instead there is a mirror bounce, then we set source and direction to the mirror intersection and the bounced reflection ray instead. Here is the remaining code for trace: Again: that’s all there is to it. This isn’t quite the pseudocode we gave earlier, but it is similar.

The code above relies on a GLSL struct for recording and reporting intersection information. For example, our mirror check, sphere check, and wall check each return this structure. Here is its definition: If a traced ray fails to hit an object, then our code returns NO_INTERSECTION(). This is an ISect whose yes component is set to 0. If instead an object is hit, then yes is set to 1, and the next three components should hold the distance along the ray where it hit, the point where the ray hits, and the normal to the surface where it was hit. The latter two are each a struct of type vec4, a structure with xyzw components, with w of 1.0 for a point and 0.0 for a vector. The last two components give the RGB of the material’s color, and a 0/1 value indicating whether the material is matte or glossy.

We also regularly use the function bestISect, which compares two intersections, picking the closest valid one. Here is its code: If two object intersections are valid (their yes is 1) then we return the closer intersection.
When we shoot a ray into the scene, this is our way of choosing the closest intersection, the first object we hit.

I hope you are getting a sense of the nature of GLSL coding. In most cases we are working with integer, boolean, and floating point variables of type int, bool, and float. Like normal C coding, we introduce these variables with their type. And then we typically also work with 2-D, 3-D, and 4-D floating point vectors corresponding to the types vec2, vec3, and vec4. These are richly supported in GLSL. They have operations like + and *, and dot and cross, and normalize and length. I summarize these in an appendix below.

To illustrate some of this coding, let’s examine the code that is used to trace a ray to a wall, given below: We first normalize the ray’s direction as du, then project the line from R to P onto the plane’s normal to compute the ray source’s height away from the plane, making sure that it is on the correct side of the plane. Then we see whether the ray direction lines up in opposition to the surface normal. If all that works out, we record all the intersection information and return it. For your coding, you will want to write the same kind of code, but for the Bezier funhouse mirror. There is also similar code for ray-sphere intersection. It’s worth checking this code out, too, under the function named rayIntersectSphere.

Once again we have a coding project that juggles a few coordinate systems to do its work. There are the points in the scene. Within the scene editor, these are represented as Point3d objects. Their xyz coordinates are within the range [−1,1] × [0,2] × [0,2]. When looking at the editor, we are getting a top view, and points with x = 0 and z = 0 sit at the middle of the bottom. This corresponds to the front of the room. Moving spheres left and right within that editor view decreases and increases their x coordinate, and also moves them left and right in the room. Moving them forward and backward pushes them away from, and then moves them closer to, the front of the room. This increases and decreases their z coordinate. Items sit on the floor at the plane y = 0. The tops of spheres have more positive y values than lower parts of the spheres.

Note that this means that the scene coordinates are interpreted in a left-handed coordinate system. The x axis points right in the rendered view of the room, and the y axis points upward. And then our view of the scene is in the positive z direction. The center of projection for the ray tracer is a point that sits at (0, 1, −1). It is within the middle of the entrance square, but sits one unit behind it. We shoot rays from that point to each point on the square entrance wall to make our traced scene.

Lastly, the curve points for the funhouse mirror are sent to the shaders as 2-D coordinates. These have components xy but actually correspond to the x and z coordinates of the points on the base of the mirror (with y = 0 because the base sits on the floor). Consider the image of the editor below: in the scene, we have the two curve control points on the left with x = −0.8. The other control point is at x = 0.8. The top two control points are at z = 1.2. The lower point’s z = 0.4. So, in the editor code the top left point sits at Point3d(-0.8, 0.0, 1.2). But then these are passed to the GLSL shader code as vec2(-0.8, 1.2).

Here I give you a plan for completing the assignment.
Most of the technical work relies on your continuing application of vector and affine calculations, and also on some of the key details we emphasized about Bezier curves. I won’t review those here; instead I offer an approach that fits the bill, references some of that material, and orients you to the places where you’ll do your coding. I should first note that there are technically only three places that require your changes. You can find them by grep-ping for COMPLETE THIS CODE in the Javascript, and CHANGE THIS CODE in the HTML/GLSL.

As an immediate warm-up to your task, I recommend getting the editor to display a curve for the given control points when in Bezier mode. You need to modify the compile method of Curve so that it creates a list of Point3d objects that are a smooth-enough set of samples of the Bezier curve. These should have their xz coordinates vary, and should sit on the floor with y = 0, just like the control points. One way to get this working is to just evaluate the polynomials with a regular sampling of the parameter interval from 0.0 to 1.0. Figure 5 shows a regularly spaced rendering performed this way, with only 17 points, just to illustrate the idea. This is fine for a start, but I’d ultimately rather have you use this exercise as a way of learning deCasteljau’s subdivision scheme. This is naturally recursive. Each attempt to compile a curve with control points P0, P1, P2 relies on two recursive calls to evaluate “left” and “right” curves with, say, control points L0, L1, L2 and R0, R1, R2. I’ll leave you to your notes to remind yourself of the formulation. This strategy requires a base case. In my solution, I use 1.0/SMOOTHNESS as an “epsilon tolerance” for the two sides P0P1 and P1P2 of the polyline to be close enough to approximating their underlying curve. Figure 6 shows an adaptive rendering performed this way, with a low SMOOTHNESS (high tolerance), just to emphasize the approach. Having set this.points with enough, and appropriate, samples, the editor should display the curve.

The next thing you must do is write the code for rayIntersectMirror in the GLSL ray tracer so that it displays a funhouse mirror. It is written to rely on the function with the C prototype: It assumes a tall Bezier-shaped mirror is specified by “floor coordinates” suggested by the 2-D coordinates of control points cp0, cp1, and cp2. It should return a struct of type ISect that describes the intersection of the ray with that mirror. My approach to this had me approximate the mirror with a series of planar panels, regularly spaced (in terms of the curve parameter) along the curve. Using this approach I built a helper function rayIntersectPanel that performs ray-panel intersection for a floor-standing mirror whose footprint is a line segment. This relied on rayIntersectPlane, provided in the code. You can try out this code by just displaying the polyline-bottomed mirror of the curve control points. This will be satisfying to get working first, and you can test it by moving the mirror around.

This isn’t quite what we want. We’d like a smoother approximation of the geometry of the curve, and we’d like no discontinuities in the reflection in the mirror. To tackle the latter concern, my code performs a smooth interpolation between two normals: one for the left side of the panel, one for the right side of the panel. That interpolation is an affine one. The combination weight can be calculated based on the position of the intersection with the panel. A sketch of both ideas, the adaptive deCasteljau sampling and the normal blend, appears just below.
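Since the handout leaves the deCasteljau details to your notes, here is a small Python sketch of the underlying math: a quadratic split step, a flatness test driven by a SMOOTHNESS-style tolerance, and the affine blend of two panel normals. It only illustrates the calculations; the real code belongs in the Javascript Curve.compile method and in the GLSL shader, and none of the helper names below come from the starter code.

# Sketch of the math only; points are (x, z) pairs on the floor.

def midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def decasteljau_split(p0, p1, p2):
    # Split a quadratic Bezier into "left" and "right" halves.
    l1, r1 = midpoint(p0, p1), midpoint(p1, p2)
    m = midpoint(l1, r1)                 # the point on the curve at t = 1/2
    return (p0, l1, m), (m, r1, p2)

def flat_enough(p0, p1, p2, eps):
    # Crude flatness test: how far the middle control point strays from the chord midpoint.
    cx, cz = midpoint(p0, p2)
    return abs(p1[0] - cx) + abs(p1[1] - cz) < eps

def sample(p0, p1, p2, eps):
    # Adaptive sampling: recurse until each piece is flat enough, then keep its endpoints.
    if flat_enough(p0, p1, p2, eps):
        return [p0, p2]
    left, right = decasteljau_split(p0, p1, p2)
    return sample(*left, eps)[:-1] + sample(*right, eps)
    # e.g. points = sample(cp0, cp1, cp2, 1.0 / SMOOTHNESS)

def blend_normals(n_left, n_right, t):
    # Affine blend of the two panel-edge normals, t in [0, 1] across the panel, then renormalize.
    nx = (1 - t) * n_left[0] + t * n_right[0]
    nz = (1 - t) * n_left[1] + t * n_right[1]
    length = (nx * nx + nz * nz) ** 0.5
    return (nx / length, nz / length)

Note that the GLSL version of the sampling cannot use this kind of recursion; as the next paragraph explains, it has to be unrolled to a fixed depth.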
Figure 7 shows my solution before adding interpolation of the normal vector. To get smoothness in the geometry you can subdivide a fixed number of times. This “fixed number” is necessary because of the limits of GLSL. It is not possible to write recursive code in GLSL, and it is not possible to write indefinite loops. The shader compiler needs to know that the code will terminate when run on the GPU, and so it limits the kinds of programming we do. For my solution I wrote a recBezier1, a recBezier2, etc. to handle the different depths of the recursion.

If you complete Step 2 in some way, you won’t see any shadows. This is because the shadow computation relies on a different function, rayHitsBezierBefore. The spec for this function is similar to rayIntersectsBezier, but it instead returns true or false, returning true when there is a ray-mirror intersection. This code is called by the Phong shading function to check whether a surface point is occluded by some intervening object between the light and the object. And so the code needs to check whether the mirror blocks the light between its location and the surface point that may be in shadow. The distance value should be checked to see whether, when the mirror is intersected by the given ray, it is intersected within a certain distance along the ray. This code can just be a modification of your approach in Step 2, refigured for a bool return type and for a distance.

Lastly, redo the work of Step 2 and Step 3 so that they give a smooth-enough approximation to the curved mirror. If you interpolate the normals as suggested, then you’ll find that a subdivision into 16 or 32 panels is good enough. Congrats! You are done. Check out some bells and whistles below, or devise your own.

The steps above are probably the best way to approach completion of the assignment. You’ll get a taste for computing with Bezier curves in Javascript by completing the first step. You’ll get a taste for GLSL ray tracing with the second, followed by an easy change with the third. The fourth step is the main assignment, and hopefully you’ll have had enough practice with GLSL to tackle it. Step 4 is mostly just the marrying of ideas from Step 1 with the coding of Steps 2 and 3.

Some bells and whistles you might consider: Change the code so that the light source isn’t a point light source. This would require you to shoot several rays to a light to compute softer shadows. Alternatively, you could compute a penumbra. Our code doesn’t look for mirror hits when tracing a secondary ray. This means that a curved mirror cannot appear in its own reflection. You can rewrite the code so that it shoots tertiary, etc. rays if a bounced ray would hit a mirror. Similarly, you could allow the editor to place multiple spherical mirrors, or else a mix of sphere and funhouse mirrors. We could check whether a point on the wall or a sphere is hit indirectly by light bouncing off the mirror. Our funhouse mirror can be thought of as a bi-quadratic Bezier patch, but one where the 9 control points come in three sets of collinear control points. You could generalize the mirror to instead be an arbitrary patch, figuring out a way to have all nine control points editable, and then also figure out the mirror calculations. Note that a deCasteljau subdivision can easily be built for Bezier patches, so you can write the ray-patch intersection code using a method similar to our funhouse intersection. You could add “physics” to the scene so that the spheres bounce around, or the mirror moves. This would update in real time, and would be no more expensive to render than a static scene.
Change the code so that the viewpoint and view direction moves, making the scene something that’s flown through. In order to deal with walls, you might change the scale of the ray trace, making the virtual screen smaller, and also making the view frustum smaller (putting the center of projection closer to the screen). We approximate the reflection of spheres using Phong shading. We could instead, for example, shoot secondary rays from glossy objects. If we shoot several reflective rays from a sphere, we can get several samples of light that’s hitting the surface Building from the above suggestion, we could have a curved mirror that isn’t a perfect mirror. Instead, for example, it could diffusely reflect a color by collecting samples of light from several reflected directions using stochastic ray tracing.The three kinds of vector structs used in GLSL are vec2, vec3, and vec4. Each have components of type float. Their components can be selected with xy, with xyz, and with xyzw. The vec3 type is also used for RGB colors. In actuality, GLSL often uses vec4 for colors, including the additional “alpha” component for transparency. Our code uses vec3 for RGB color values.We build vectors with their constructors. There is some flexibility in doing so. For example, u = vec3(1.0,0.0,0.0); sets u to be the unit vector in the x direction. To make it include a w component, we can use v = vec4(u,0.0) to create it as vector v.We can use the operations * and / to scale vectors v by a scalar a, for example, with the expressions v * a and v / a. You can add vectors componentwise with u + v. You can compute vector’s dot product with dot(u,v), which returns a float. You can also compute cross(u,v), but this is only defined if both u and v are of type vec3.We can normalize(u) to compute a vector whose direction is the same as u but with unit length. You can get the length of a vector with length(u).Many standard mathematical functions are defined on the float type, including abs, max, pow, sqrt, sin, and cos.
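To connect those vector operations back to the wall-intersection routine described earlier, here is a minimal Python sketch of a ray-plane intersection under the same conventions: R is the ray source, d its direction, P a point on the plane, and n the outward plane normal. It is only an illustration of the math; the shader's actual version also packages its result into the ISect struct and uses the GLSL vector types summarized above.

import numpy as np

def ray_intersect_plane(R, d, P, n):
    # Return (hit, t, point) for the ray R + t*du against the plane through P with normal n.
    R, d, P, n = (np.asarray(v, float) for v in (R, d, P, n))
    du = d / np.linalg.norm(d)             # normalize the ray direction
    n = n / np.linalg.norm(n)
    height = np.dot(R - P, n)              # signed distance of the ray source from the plane
    facing = np.dot(du, n)                 # the ray must run against the normal to hit the front
    if height <= 0.0 or facing >= 0.0:
        return False, None, None           # behind the plane, or pointing away from it
    t = height / -facing                   # distance along the ray to the plane
    return True, t, R + t * du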
Here we work to write code that performs the analysis of a mesh that yields a subdivision surface. The resulting application will have you making several of these surfaces, and testing them with a game that checks the result.

On subdivision
Subdivision surfaces are a family of schemes for computing the geometry of smoothly varying surfaces from an initial, coarsely defined, polyhedral surface. We looked at several: the triangular mesh subdividing and averaging scheme of Loop, Catmull-Clark surfaces which work with a mesh of quadrilateral faces, Doo-Sabin’s dual scheme on quads, and the butterfly scheme for interpolating subdivision. Several of these are the result of taking known subdivision schemes that worked with regular patches. Catmull-Clark’s is just a subdivision scheme for bicubic B-spline patches, generalized to handle the case when 3 patches meet at a vertex, or 5 or more patches meet at a vertex.

All the schemes are an iterative process. From an initial mesh M0, they provide a means for computing a mesh M1, then M2, and so on. Each successive mesh calculation introduces (roughly) twice as many vertices, and 4x as many faces. We split the edges and faces, introducing new vertices, and we apply an averaging rule to obtain positions for all these vertices, devised as a function of the coarser vertices of the components that were split. The mesh sequence converges to a limit surface M∞ with good continuity properties. In Loop’s scheme, for example, the limit surface corresponds to a parametrized surface that has C2 continuity. The exception is at the points whose vertices in M0 were connected to a number of neighbor vertices different from 6. Even at these extraordinary vertices the surface has G2 continuity. In practice, like we did with curves in the last program, we terminate the subdivision after enough steps k, and then display that smooth-enough faceted surface, depicting Mk.

On surfaces
We work with a data structure that represents oriented surfaces. The winged half-edge mesh is an intricately linked data structure whose key component is a directed edge that links two vertices. Each such half-edge is part of a cyclic doubly-linked structure that loops around as the border of a mesh face. Also, each half-edge has a twin going in the opposite direction, and that twin is on the border of a neighboring face. By founding all the geometry of the mesh on cyclic orderings of half-edges, we get oriented faces, and we have an oriented understanding of the neighborhood of components around each vertex, edge, and face. We can walk around the surface, and reverse our walk to return back to where we started.

Here is the starter code for this assignment. If you download this code, you’ll find three important source code files. You are to write the subdivide method of class Surface to perform Loop subdivision. I describe the data structure and how it can be subdivided in the sections below.

First examine the starter code in surface.js. This gives the definition of three classes: Vertex, Edge, and Face. These work to represent the components of our surface mesh data structure, and they are laid out as we described in lecture. In particular, the Edge class is most central. Were this code written in C++, its instances would conform to this struct definition: The meaning of each of these components, excepting split, is depicted in the figure. Each Edge instance e corresponds to a directed half-edge on a meshed surface. It runs between vertex e.source and e.target.
To its left is its e.face. Also, e.prev and e.next are the other two edges that define that face, in counter-clockwise order, so that the normal of the face is directed out of the page. That face has a neighboring face, which can be found by referencing e.twin.face. The two half-edges e and e.twin define the hinge between those two faces.

A Face has this corresponding struct: Because of the data housed in Edge, we only need one f.edge to walk around a face f. We can compute f.edge.target to obtain the first vertex on the face, then f.edge.next.target to compute its second, and then finally f.edge.next.next.target to obtain the third vertex. Walking around the edges from a face in this way, we obtain the three vertices that make up the triangular face. We depict this edge walk around a face in the figure.

A Vertex is similarly succinct: We can describe the “fan” of edges around a vertex v with the sequence: These walk around the edges of that vertex fan in counter-clockwise fashion. We depict this fan just below in this figure.

These three classes are put to use by class Surface, defined below them. In that class, we can construct an empty surface, build all the vertices from their positions, and then stitch them all together by making faces and edges. An example of that kind of work lives in the method Surface.read, which processes an Alias/Wavefront .OBJ file. In that file format, there is a series of lines starting with v that describe all the vertex positions of a meshed surface, and then a set of lines starting with f giving a sequence of vertices around a face, specified by vertex number. (See, for example, the tetra.obj text within the file cycle-subdivide.js.) The code for read scans in each of these vertex lines, calling the method this.makeVertex for each one. It then calls the method this.makeFace for each line describing a face. All the stitching together of vertices and faces, by creation of half-edges and their twins, is performed within makeFace. The result is an oriented surface as specified by that .OBJ.

When you run the application, you can view each of these surfaces by selecting their radio button item in the surface list. If you examine the .html, you’ll see the corresponding data as file text. Within the main cycle-subdivide.js code, the current surface object being shown is gSurface and has the name gSurfaceChoice. The surface is drawn by the Surface methods glRender and glRenderMesh. These renderings rely on the surface being pre-compiled, with glBegin and glEnd, by the methods glCompile and glCompileMesh. Finally, when the program user presses p on the keyboard, the code calls gSurface.subdivide. This code should produce a new Surface that is the result of Loop subdivision. We describe this in the next two sections.

Loop’s subdivision scheme needs to be translated into our code for Surface.subdivide. Let S be the surface being subdivided, and R be the refined S resulting from Loop subdivision. Then the method should perform these steps: The surface R that results from this face and edge splitting is topologically similar to the picture of the tetrahedron below. Every triangular face becomes four triangular faces. Every half-edge is split into two half-edges. To keep track of the newly introduced vertices in Step 1, so that they can be referenced in Step 3, the starter code includes an attribute v.clone for every vertex object v.
When you clone a vertex v of S, you can set v.clone to be v’s vertex clone in R. Similarly, for Step 2, the starter code includes an attribute e.split for every half-edge Edge instance e. When you split a half-edge e of S, you can set e.split to that new split vertex. Note that you do not want to create two split vertices for an edge. You might make that mistake if you handle splitting two twinned half-edges independently. Instead, when you introduce a split vertex, you want to associate it with both half-edge Edge objects that are being split.

Smoothing
Though the above process gives us a refined mesh with the correct topology, it does not put the clone and split vertices in the correct positions. We must also compute the positions of each split vertex and each clone vertex. The cloned vertices should not sit at the positions of the corners that were cloned. The split vertices should not just sit at the midpoints of each edge they split. Instead, when each vertex of R is introduced, its position should be a weighted average of the positions of nearby vertices of S.

A split vertex position should be computed by a weighted-average formula in which P0 and P1 are the endpoints of the edge being split, and Q0 and Q1 are the positions of the corners of the two faces that share the split edge. You create a split vertex in R for each pair of half-edges in S. Use a call to makeVertex and give it a position as computed above. Make sure you only add one vertex for each pair of half-edge twins.

A cloned vertex gets smoothed by taking a weighted average of its position with all of its vertex neighbors. A cloned vertex position should be computed by a formula in which P is the position of the vertex being cloned, and each Pi is the position of the target of an outgoing edge from that cloned vertex. In that formula, the constant β is a weight computed from the number of neighboring vertices. You will create several clone vertices in R, cloning each vertex in S. Use a call to makeVertex, giving it a position as computed above.

The starter code includes the controls and display for a mini-game that you can play to verify that your subdivision of the mesh is correct. The game mimics a scene from the 1980s movie Tron. You can drive a “light cycle” around on your surface. You press [space] to start the game and use ijkl to control the cycle’s movement. The game’s action just performs a walk on the surface of the object. It places a Tron cycle on the 0-th face of the surface, oriented perpendicular to that face’s first edge. It then walks from face to face and leaves a trail by painting the faces it hits. This walk is animated somewhat smoothly, with several steps making a traversal from the midpoint of the face edge to the midpoint of an edge that’s an exit to either a “left” or “right” next face. This face traversal takes several frames.

The current set-up allows a player to steer the cycle left or right. By hitting the j key, the application will then choose the left exit face when it completes the drive over its current face. By hitting the l key, the application chooses the right exit face. Otherwise, the cycle makes a series of alternating left and right steps (it goes to the left exit face, then goes to the right exit face, then takes the left exit face, and so on) so that the path taken by the cycle is roughly straight ahead, driven along a strip of triangles. This activity is represented in cycle.js, and the behavior is described as the methods of a class Cycle. The main application code in cycle-subdivide.js just invokes these methods.
It also calls the Cycle methods for rendering the cycle on the surface, and for rendering the surface from the perspective of the player sitting on the surface.

I recommend completing the code for subdivide in two stages.

Stage 1: get the topology correct. In the first stage you introduce all the clone and split vertices for the new surface, stitch them together, and return that surface without smoothing their positions. When you get this working correctly, your refined surface meshes will still have their original shape, just with more and more triangular patches. Below are images of the tetrahedron before and after performing a simple Stage 1 subdivision, once and twice. Note that you can get the positions correct (cloning the vertices, computing simple midpoints when you split edges) but not actually link up the components correctly. The way this would likely manifest itself is that you’ll have multiple, redundant copies of some components, and you may have components that fail to be linked up. You can use the Tron game to test your work. Hopefully the light cycle drives around on your subdivided surface just fine. If so, that probably means that you stitched the elements of the surface together correctly. If instead there are bugs in how you constructed the subdivision, the code that moves the cycle will crash. An attempt to move the cycle from one element in the mesh to another might then fail because the Javascript code requests info through a null reference to a neighboring object. You might also find it useful to write code that separately verifies that all the vertices, edges, and faces are in proper relation with each other (e.g., e.twin.twin == e, v.edge.source == v, etc.).

Stage 2: get the smoothing correct. In the second stage, figure out how to compute the smoothed vertex positions. It is here that you might get some strange, but viewable, errors. If you use the wrong formula, or select neighboring vertices incorrectly, the surface geometry can go quite awry. Below are images of the tetrahedron, subdivided once, twice, and thrice. Mistakes here are usually much easier to spot. For example, the pictures below show what happened when my code used the incorrect vertex locations to perform smoothing. It also had the wrong stopping conditions in the loop around each clone vertex.

There is just one hand-in for this assignment. Hand in a working version of the subdivision code, having correctly completed Surface.subdivide. You should be able to complete this code in less than a week, maybe with three evenings of effort (first evening: Stage 1; second evening: buggy Stage 2; third evening: everything fixed). It’s about 40 lines of code, and most of the tricky work is handled by makeVertex and makeFace. Hand in screenshots of your subdivision code working on a complex object, say, the fox. Also include a screenshot of the subdivided object after the cycle has painted quite a bit on it.
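For reference while working on Stage 2, here is a Python sketch of the two positioning rules using the standard weights from Loop's scheme (3/8 and 1/8 for split vertices, and Loop's β for clone vertices). The handout's own formulas appear as figures in the original, so check these weights against your lecture notes before relying on them; the function names and the tuple representation of positions are only illustrative.

import math

def split_vertex_position(P0, P1, Q0, Q1):
    # Standard Loop rule for a new edge (split) vertex: 3/8 of each edge endpoint,
    # 1/8 of each of the two opposite corners of the faces sharing the edge.
    return tuple(0.375 * (p0 + p1) + 0.125 * (q0 + q1)
                 for p0, p1, q0, q1 in zip(P0, P1, Q0, Q1))

def loop_beta(n):
    # Loop's weight for an original (clone) vertex with n neighbors.
    return (1.0 / n) * (0.625 - (0.375 + 0.25 * math.cos(2.0 * math.pi / n)) ** 2)

def clone_vertex_position(P, neighbors):
    # Standard Loop rule for a clone vertex: (1 - n*beta) of its old position
    # plus beta times each of its n neighboring vertex positions.
    n = len(neighbors)
    beta = loop_beta(n)
    return tuple((1.0 - n * beta) * p + beta * sum(q[i] for q in neighbors)
                 for i, p in enumerate(P))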
The flip book app that we worked on for Project 3 has a particularly unsatisfying quality to it. Editing the camera’s path can be tedious, and so a walk-through creator will tend to place very few camera positions into the scene. This leads to a very jarring animation in the resulting PDF, with few pages to flip through. The only option for an editor is to place a lot of cameras along the path, and have them close together, so as to have a smooth animation of the walk through the scene. As we just noted, this isn’t easy to do in the app.

For this assignment we fix that issue. We change the editor so that the explicitly placed camera sites are “control points” that help to define a path. We then produce a smoothly tracking collection of camera shots that form a much larger number of frames in the walk-through. To do this, we use the curve technology we described in lecture over the last several weeks.

A sample solution for this Project 4 assignment is shown in Figure 1. Just as in the old app, a walk-through creator has laid out a series of camera shots as a polygonal path. And then the app has computed a series of “in-between” camera locations and view directions that smoothly vary from shot to shot. For each placed shot, excepting the first and the last, the app produces 16 frames. Each frame is built by taking camera information from a few of the explicitly laid out shots, using some combination of them to blend their information into a novel in-between camera shot. A flip-book that results from this smooth path can be downloaded using this link. There is a page for each frame, one for every in-between shot that the app computes.

You can just work from the code you were building in Project 3. Even if the code for generating PDFs isn’t fully working, you can still enhance it to complete the work of this project. You’ll add code that draws the in-between camera shots. (And if you complete a bonus exercise, you’ll add code that allows the app user to run a movie of the smooth walk-through on the preview screen just to the right of the editor.) For those of you still trying to get Project 3 completed, this project gives you some extra time to work out the bugs of your PDF generation and submit an update for some partial credit. (Those of you with a working, or near-working, Project 3 will just get this additional credit for that work here.)

Edit the WalkThru class so that it can compute a sequence of SceneCamera objects that smoothly vary according to its array this.shots. Your method should use Chaikin’s “corner cutting” algorithm applied to the control polygon given by that sequence. You should use the version of Chaikin subdivision that starts and finishes with the begin and end shots, but smooths out the corners specified by the other shots. If the edited shots are the sequence of placed cameras, then you’ll compute (via Chaikin’s scheme) a set of SceneCamera objects that are the result of applying Chaikin’s “split and average” step four times. The in-between camera placements should be depicted in the walk-through editor view. The smoother walk-through should be used to generate the pages of the PDF generated by the [to PDF] button.

Below I take you through some of the details of what’s needed. You need to compute a sequence of camera positions from the sequence of user-specified shots. The user-specified ones are held in the array gWalkThru.shots managed by the code in flip-book.js.
The sequence that you compute should be the one that results from a fairly straightforward application of Chaikin’s scheme that we outlined in Lecture 07-1. We cut corners with a 3/4 and 1/4 weighting rule applied to the line segments of each corner, computing two points for each corner.

For this coding, I recommend you define a method combo(amount, other) in class Shot that performs a convex combination of one camera shot (this) with some other camera shot (other) according to a parameter amount. That parameter is assumed to be a value between 0.0 and 1.0. This is easy to determine for the position of the two Shot objects using the geometry libraries as below: With this defined, you can then define a Chaikin subdivision of a sequence of shots by repeatedly applying your Shot.combo method with weights of 0.75 and 0.25.

Writing the portion of Shot.combo that handles position is easy. Combining this.direction and other.direction is less straightforward. I say more about this calculation just below.

In our application, a shot’s orientation is given by a unit vector living within the plane of the floor. We need to smoothly vary this direction from shot to shot for the calculation of our curved walk-through. Since we are using subdivision to perform this work, we just need a way of defining the linear interpolation between two camera directions. This would be provided by the combo method of Shot just mentioned above. The method needs to build a new direction from this.direction and other.direction. I’ll let you work out the details of this. As inspiration, recall that in Lecture 07-2 we described a scheme for interpolating between two quaternions q0 and q1, with a q(α) parameterized by α ∈ [0,1]. It was defined as follows: You could do something similar, although you need not use quaternions to accomplish this. It can be a bit simpler, since direction is essentially just a unit vector in 2-D.

One issue that might arise in your handling of camera shot directions is that there are several ways to sweep (i.e., spin) the camera direction from one shot to the next. For example, suppose you model a shot direction as an angle within the plane. When you move from a shot angle θ0 to a shot angle θ1, it might be more natural to instead move to a shot angle of θ1 ± 2π (in radians). And this choice might depend on what the prior camera directions were and/or where the next directions are headed. You need not address this issue in your submitted code. If you do, however, please tell me how you addressed it.

You’ll want the editor to show the curved path rather than the control polygon of the curved path. To do this, you should change the code for drawCameraPath in flip-book.js so that it renders the curve instead of the polygon. You’ll find two loops in that code. One loop renders each edge of the control polygon. A second loop draws each of the camera placements. Modify the first loop so that it shows all the in-between shots and their directions.

BONUS
A bonus exercise is to add a feature so that an editor can view a movie of the walk-through within the app’s display, rather than generating the PDF. A user could press, say, the [SPACE] key and that could kick off an animation sequence in the shot preview. You can control an animation by a set of global variables that indicate whether the app should be animating the walk-through sequence, and how far along it is.
You append some code at the end of the draw function that requests that another frame gets drawn right after a frame was just drawn. Here is some code I used for my own solution:In the above, I increment a variable gTick each time draw is called and I am animating the walk-through preview. Every 100th time, I advance the frame of that animation to the next shot. If I’ve hit the last shot in the sequence, I stop the animation. And then I request another call to draw with glutPostRedisplay();.Submit the full code that supports the application by the project deadline. Also submit a submitted.md or submitted.txt file that describes the status of your code. Please include your name in that file to make it easy for me to look at your submitted work. Provide a sample .PDF of a walk-through that you set up, along with a screen shot of the scene set-up that produced it.As usual, it’s always best to submit what you’ve completed by the deadline, even if it’s not fully working. Give details of where things stand in your submitted file. And then feel free to later revise what’s submitted at a later date when you get more things working.
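Putting the pieces above together, here is a small Python sketch of the endpoint-preserving Chaikin step applied to a list of shots, using a combo-style convex combination with the 3/4 and 1/4 weights. The tuple representation of a shot and the naive angle blend are illustrative assumptions; the real implementation belongs in Shot.combo and the WalkThru class in flip-book.js, and it must also deal with the angle wrap-around caveat mentioned above.

def combo(shot_a, shot_b, amount):
    # Convex combination of two (position, heading-angle) shots, amount in [0, 1].
    (xa, ya), ta = shot_a
    (xb, yb), tb = shot_b
    x = (1 - amount) * xa + amount * xb
    y = (1 - amount) * ya + amount * yb
    theta = (1 - amount) * ta + amount * tb    # naive blend; see the theta +/- 2*pi caveat above
    return ((x, y), theta)

def chaikin_step(shots):
    # One "split and average" pass that keeps the endpoints and cuts every corner.
    out = [shots[0]]
    for a, b in zip(shots, shots[1:]):
        out.append(combo(a, b, 0.25))
        out.append(combo(a, b, 0.75))
    out.append(shots[-1])
    return out

def smooth_path(shots, passes=4):
    # Four passes, matching the handout's "applied four times" requirement.
    for _ in range(passes):
        shots = chaikin_step(shots)
    return shots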
Throughout this semester, we learn methods for producing photo-realistic renderings of virtual scenes. These methods do their best, given computing resource constraints, to model the physics of light transport, of how real materials reflect, absorb, and transmit light, and of how cameras capture the light of a scene to record an image. They rely on the mathematics we have covered recently for tracking and relating geometric points and directions in 3-space.

For this assignment we take a different approach to computer rendering, one that was used in earlier graphics systems, before photo-realistic rendering was technologically feasible. We write code that produces line drawings of geometric objects. The results are not photo-realistic, but the methods use a lot of the same principles, and certainly many of the geometric calculations, used in photo-realistic rendering.

To make our drawings of 3-D scenes, we use ray casting to compute a perspective projection of what the camera sees, depicting objects as wireframes but with hidden lines removed. Your completed code will take as input a "walk-through" of a scene made up of objects, given as a path of camera positions. It will compute a rendering of the walk-through as a "flip book": a series of pages of flat line drawings that can be flipped through to animate the walk. In doing so, you'll practice using the affine geometry primitive operations defined in geometry-2d.js and geometry-3d.js, which define point and vector operations in 2-D and in 3-D. These, along with the mesh specifications given by each placed SceneObject, allow your code to turn the 3-D scene into a PDF document.

Below I take you through several parts of the project, orient you to how everything is meant to work together, and then take you through the steps you can use to complete the project.

Written part
The programming requires coding of a few geometric operations, but you'll want to first work out these operations on paper before you write their code. I've packaged these operations as a set of written problems you should solve, and hand in, so that you can make progress and plan for the coding.

Playing with the walk-through editor
If you load flip-book.html within your browser, you will see a user interface as shown in Figure 1. The WebGL canvas is split into two parts: the left dark blue part allows you to place several objects on the floor of the scene, and also to lay out a path of camera shots that walk through the scene. Objects can be placed and dragged with mouse clicks. Cameras can be placed and moved by clicking and dragging the mouse with a press of the middle mouse button (or, alternatively, by holding the SHIFT key while you click).

The right part of the display shows a square region of the scene from the currently selected camera's perspective. It shows the objects as black wireframe meshes with hidden lines removed. This is just using WebGL's standard hardware algorithms for doing that work. It serves as a preview of what your code will render as a PDF.

You'll notice that each camera is sitting on the floor of the scene, at the same level as the bases of the objects, and looking in a particular x–y direction along the floor (the z direction is upward). You can change the direction of a camera by clicking on the preview square and dragging its perspective left and right. The editor limits this view change to being a pan around the up direction of the camera.
The camera's view direction can only be rotated within the x–y plane.

The buttons sitting just below the scene editor allow you to select which object you'd like to place. There are several faceted objects, including lower-resolution versions of the Stanford bunny and the Utah teapot. The descriptions of these objects are embedded within the HTML as Alias/Wavefront .OBJ files. When the user presses one of these buttons, the code loads the mesh description of that object.

Once a user sets up a scene and a series of shots, they click the toPDF button to set the wheels in motion, running the code that you need to write, which leads to your browser downloading a PDF document of the walk-through, one that was just freshly computed by your code.

Starting code
In the code's current state, clicking toPDF will perform a simple rendering of the scene as a collection of "see-through" wireframe meshes, using only an orthographic projection looking directly to the right from each of the camera shot locations. There will be a page for each camera shot, but the scene will look nearly the same on each one because of the orthographic projections that look directly to the right.

Your assignment is to change the code to produce the proper images in the PDF file. When completed fully, you'll have a sequence of PDF pages that look like the preview shots of the editor.

Tester code
In addition to the WebGL application run by flip-book.html, which runs the code in flip-book.js, the starter code folder also contains two other WebGL applications that share use of the code you write. You can use these applications to test and debug your code without needing to do so within the flip-book application. They use the same critical code you'll write to complete the toPDF code, and they are described briefly in two appendices at the bottom of this page.

Modify the toPDF functionality of the web application so that it performs a proper perspective drawing of the scene's objects based on the specified camera shots. Each page of the PDF should mimic the WebGL rendering by judiciously marking lines onto the pages of the PDF. This coding can be done within the file walk-thru.js, using some supporting geometric functions you'll write in walk-thru-library.js. In particular, the editor program makes a WalkThru object whose class definition is in walk-thru.js. The editor calls the walk-through object's toPDF method to assemble the PDF document.

You'll write the code used by this toPDF method, particularly its use of a collection of object placements that get built as SceneObject instances to provide each object's geometry, and its use of camera Shot objects that get built as SceneCamera instances to provide the calculation of a perspective projection to take a snapshot of the scene.

Below I take you through a candidate solution for the project, talk about the supporting code that you'll rely upon, and then give you strategies for completing and debugging your code. There are several places in the walk-thru.js code that I've marked with // TO DO comments. Finding these will orient you to writing your solution code. The description below walks you through that work.

Let's take a guided tour through coding up the WalkThru.toPDF method. For each step of this guided tour, we will work with two scenes. One contains a single cube. The other contains two triangular hinges, one in front of the other.
This is a result of using the squares button to select and place an object.

The first thing you will want to get correct is the computation of the perspective projection of all the objects' vertex points. These are the corners of every object that lives in the 3-D scene. Every vertex v has a v.position attribute of type Point3d. You'll need to determine a Point2d location for each vertex. Your code can loop through all the vertices of each scene object and apply a perspective calculation for the camera's view.

The projection calculation should be provided by a SceneCamera object's project method. A scene camera instance is built by specifying the center of projection, a direction that the camera faces, and a general direction upward. These can be used to construct an orthonormal frame (the center forms the frame's origin, and three other directions provide an orthonormal basis) that can be used to provide the projection.

There is a class SceneCamera in walk-thru.js that provides a template for this coding. Right now, its constructor and its project method have bogus code, just to make the starter code run. You'll want to fix the constructor code so that it computes the frame's three directions, the ones the discussion below calls right, up, and into. These should be computed from center and toward, which are provided to the constructor from the shot editor GUI. There is also an upward direction. For this code, it is just given in the z-direction. It is aligned with the central axes of each scene object. Each object's base sits in the x–y plane at z = 0, and its central axis points upward. The toward direction, as currently given by the GUI, is already perpendicular to the z-direction and so has a z component of 0. The center of projection provided by the GUI lies on the floor, and so it also has a z component of 0.

Once you've completed this coding for Step 1 you have computed the orthonormal frame of a scene camera for its perspective projection. These calculations are the ones you devise for Problem 1 of the written portion of the assignment.

Now that we've written code for computing an orthonormal frame for each SceneCamera, the next step (Step 2) is to compute the perspective projection. We take the 3-D location of each vertex of an object in the scene and "splat" it onto a virtual sheet of paper of the PDF. When you've completed the coding for Step 2 you will produce pages of drawings, one page for each camera shot. The objects will be "see through" wireframes, not faceted surfaces. Because they will result from a perspective calculation, they will be warped with foreshortening (etc.) rather than what was obtained with the orthographic projection.
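Before sizing up the projection, here is a minimal sketch of the Step 1 frame construction just described. It is not the official solution: I assume Vector3d offers unit() and cross() methods (the actual names in geometry-3d.js may differ), and the cross-product order that gives "right" is a convention you should double-check against the editor's preview.

```js
// A hedged sketch of the SceneCamera constructor's frame set-up.
// Assumes Vector3d has unit() and cross(); check geometry-3d.js for the
// real names. Flip a cross-product order if your pages come out mirrored.
class SceneCameraSketch {
  constructor(center, toward, upward) {
    this.center = center;               // Point3d: the center of projection
    this.into   = toward.unit();        // unit vector looking into the scene
    // `right` is perpendicular to both the view direction and the upward
    // direction (one common choice of orientation).
    this.right  = this.into.cross(upward).unit();
    // Recompute `up` from the other two so the frame is exactly orthonormal,
    // even if `upward` were only approximately perpendicular to `toward`.
    this.up     = this.right.cross(this.into).unit();
  }
}
```

With toward lying in the floor plane and upward along z, this produces the left-handed right/up/into frame that the projection discussion below relies on.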
Sizing up the perspective projection
Let's work through the set-up your code will face with each SceneCamera when it performs the perspective projection. Let's first be more clear about the coordinate systems of the scene editor. In the figure below we have a scene littered with two bunnies and two foxes. The leftmost camera location is selected, and it is pointing directly right. We see two foxes on the left field of the shot, and two bunnies on the right field of the shot. The two bunnies look to be the same height in the shot, but they are different sizes. The larger bunny's dimensions are about twice those of the smaller bunny's, but it is also about twice as far away from the camera. This is a consequence of the mathematics of the perspective projection.

If you were to inspect the 3-D points of the camera locations and the scene objects, you would learn that the origin point (0, 0, 0) sits at the left part of the scene editor, mid-way up the left side of that rectangle. You would learn that the positive y-axis is along that left side of the scene editor rectangle, with positive y pointing upward. You would learn also that the x-axis is aligned with the top and bottom edges of the scene editor rectangle. The positive x-axis points directly to the right, perpendicular to the left side of the scene editor rectangle. Then finally the positive z-axis is pointing directly out of the computer screen. Within the scene editor, we are looking down onto the objects in the negative z direction. The scene editor rectangle is two units tall. This means that the top-left corner is at (0, 1, 0) and the bottom-left corner is at (0, −1, 0).

This means that the first camera has a towards vector of (1, 0, 0), an upward vector of (0, 0, 1), and its center point is at (0, 0, 0). The second camera has a center of about (0.4, 0.2, 0). With a perspective pointing slightly right of the view of the first camera, the second camera has a towards direction of about (0.91, −0.41, 0). This vector is tilted down a little off from the x-axis's right alignment.

Now let's figure out the "view frustum" you'll use for the perspective projection. The figure below shows a different scene. I've placed two cubes into the scene. I've lined them up and sized them in a way such that they both nearly cover the square page in the shot preview. They nearly cover the whole 54mm x 54mm square of the paper. If you look at the two camera-facing sides of the two cubes within the scene editor, their corners are aligned along rays that emanate from the camera's center. This means that, within the perspective view, the vertical edges of their camera-facing corners form the left and right sides of the view's frame. This tells us that we have a perspective projection in the x-axis direction whose field of view is 90°. That is, the view frustum is a "right" pyramid, a pyramid whose top point's angle is a right angle.

It turns out that this perspective projection is the one that results from building a 1 × 2 × 2 basement cellar room and drilling a hole in the floor of the scene room right where the camera sits. The light from the scene would be projected back through the hole onto the cellar's back wall (see Figure 5). Your code is tracing the objects when their image is projected onto that back wall.

Ray casting into the scene
Let's figure out the work of that projection using ray casting. When you project the scene onto paper, you use the right, up, and into vectors as the axes of a left-handed orthonormal frame. This allows you to compute, for each vertex in the scene, its depth from the camera, and also its position on the paper.

Problem 2 on the written part asks you to figure out this calculation using the affine vector and point operations we've discussed in lecture. And then you use Vector3d.dot, Point3d.minus, and other operations as defined in geometry-3d.js to write this calculation as code.

In my solution to this step, I produce a primitive Javascript object as the result of SceneCamera.project. That result carries a vertex's projection information: essentially its 2-D location on the page along with its depth from the camera. You'll see in the starter code that SceneCamera.project is called several times within the loop of SceneObject.projectVertices.
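As a hedged illustration of the Step 2 math (not the required implementation), here is one possible shape for that projection calculation. The frame fields, the exact return value, the Point2d constructor arguments, and the mapping onto the 54mm page are all assumptions; Point3d.minus and Vector3d.dot are the operations named above.

```js
// A hedged sketch of a SceneCamera.project-style calculation, assuming the
// camera frame (center, right, up, into) built earlier. The page mapping
// (origin, y direction, margins) is an assumption; adjust to match the
// editor's preview.
function project(camera, vertexPosition) {
  const d     = vertexPosition.minus(camera.center); // Vector3d from camera to vertex
  const depth = d.dot(camera.into);                  // distance along the view direction

  // Perspective divide: where the vertex lands on the virtual film plane.
  // With a 90-degree field of view, visible points have x', y' in [-1, 1].
  const xPrime = d.dot(camera.right) / depth;
  const yPrime = d.dot(camera.up)    / depth;

  // Map [-1, 1] x [-1, 1] onto a 54mm x 54mm square of the page.
  const PAGE_MM = 54.0;
  const pageX = (xPrime + 1.0) / 2.0 * PAGE_MM;
  const pageY = (1.0 - yPrime) / 2.0 * PAGE_MM;      // PDF y typically grows downward

  // Keep the depth around; later steps reason about what sits closer to the camera.
  return { point: new Point2d(pageX, pageY), depth: depth };
}
```

Note the perspective divide: dividing the right and up components by the depth is exactly what makes the farther, larger bunny shrink to match the nearer, smaller one.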
Within SceneObject.projectVertices, we build a Javascript Map that associates to each object vertex its projection information onto the 2-D page. This Map information is kept around to be used later in Steps 3 and 4 described below.

Results
When you have completed this step correctly, if you draw each vertex's projection with a small circle and each edge as a line connecting these two dots, you will get PDFs like these:

For the cube object, the 6 sides are made up of triangle pairs, and so we are seeing the diagonals of many of these faces, including the hidden ones. For the hinged square scene we have two pairs of triangles. We are seeing all the edges here too.

These figures rely on calls to methods of the document object passed to WalkThru.toPDF. This object is provided by the jsPDF package, and its functions draw lines and dots on a credit card-sized page. We make all these calls within the code for SceneEdge.draw, and this is the method code you'll now modify for Steps 3 and 4.

In Step 2 just above, you ended up drawing all the edges of each object, including parts that are hidden by other faces. Let's now work to correct this by performing the calculations needed to break up an edge into a mix of hidden and visible line segments. We'll do this by looking at the intersections of all the projected edges. These intersection points serve as the potential "breakpoints" for depicting the edges.

If part of an edge is obscured in its middle by some face, then that part of the edge shouldn't be drawn. The start and end points of that occluded segment of the edge will occur where that face crosses the edge from the perspective of the camera. Consider the pictures below. I've added dots where edges cross. If you compare figure 2b with figure 7b, you can see that the right half of the smaller hinged square is obscured by the left part of the larger hinged square. The three edges of the smaller hinge that cross the edge of the larger hinge should only be drawn part way, up to that crossing edge. In other words, the left edge of the larger hinged square makes three breakpoints on three edges of the smaller hinged square. Each of those three edges should only be drawn to its breakpoint.

Step 3, then, should compute all the breakpoint locations along an edge so that you can draw the edge correctly in Step 4. The provided code structures this using a method SceneEdge.breakpoints. It should return an array of values between 0.0 and 1.0 where these breakpoints occur. If an edge spans points P0 and P1, then a breakpoint at value s corresponds to the point

    (1 − s) P0 + s P1.

Let s1, s2, …, sn be the series of breakpoint values for an edge. We sort them so that

    0 ≤ s1 ≤ s2 ≤ … ≤ sn ≤ 1.

These tell us that when we draw the projected edge, it is divided into n + 1 segments according to these n breakpoints. And, to consider drawing the edge, we run through those n + 1 segment pieces from P0 to P1.

Problem 3 on the written part asks you to figure out the calculation for finding the intersection between two line segments in 2-D space. You should work to formulate this intersection using only the point and vector operations we discussed in lecture, ones that are available in the geometry-2d.js library with classes Point2d and Vector2d. Note that it wouldn't be hard to do this calculation using coordinate calculations, and you can do that just to get the code working. For full credit, however, you need to devise a scheme that uses the "coordinate-free" geometric operations.
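Here is a hedged sketch of one such coordinate-free formulation. It is only an illustration of the math: the real segmentsIntersect may be expected to return something different (for example, just a Point2d or null), and the helper names plus, minus, and times are assumptions, while perp and dot are operations mentioned in this write-up.

```js
// A hedged, coordinate-free sketch of a 2-D segment/segment test.
// Segments are P0->P1 and Q0->Q1; returns null on a miss.
function segmentsIntersectSketch(P0, P1, Q0, Q1) {
  const u = P1.minus(P0);            // direction of the first segment
  const v = Q1.minus(Q0);            // direction of the second segment
  const w = Q0.minus(P0);

  const denom = u.dot(v.perp());
  if (denom === 0.0) return null;    // parallel (or degenerate) segments

  const s = w.dot(v.perp()) / denom; // parameter along P0->P1
  const t = w.dot(u.perp()) / denom; // parameter along Q0->Q1

  if (s < 0.0 || s > 1.0 || t < 0.0 || t > 1.0) return null; // crossing is off one segment

  const point = P0.plus(u.times(s)); // the crossing point itself
  return { s: s, t: t, point: point };
}
```

The parameter s that falls within [0, 1] here is exactly the kind of breakpoint value that SceneEdge.breakpoints wants to collect.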
Write this intersection code within the function segmentsIntersect in the file walk-thru-library.js. This function can then be tested on its own using the segments-intersect.html application described in Appendix 6.2. And then, ultimately, you will call segmentsIntersect within the code for SceneEdge.breakpoints, looking for all the edges that cross any particular scene edge when projected in 2-D. The method SceneEdge.breakpoints is ultimately called by SceneEdge.draw so that it knows how to draw that edge.

Results
To see this all working within the flip-book.html application, have the code for SceneEdge.draw just place dots at all the breakpoints along an edge, but still have it draw the whole edge. When you have completed this step correctly, you will get PDFs like those shown in figures 7a and 7b above.

Now that we've marked out all the intersections, we want to draw an edge as a sequence of only its visible segments. Below, I've modified the prior figures to illustrate a method for doing this. Here you see clearly that edges can have a series of included (black) and excluded (pink) portions. The excluded portions are those obscured by some face that's sitting closer to the camera.

We scan through the segments that form the edge. These segments are defined by the breakpoints we found in Step 3. For each segment, you cast a ray from the center of projection toward it to see whether it should be drawn. If some other triangle sits closer to the camera, then that subsegment should not be drawn (shown above as pink). If no triangle is hit between the camera and the segment, then we should draw that segment.

My method for checking occlusion casts a ray to some point sitting in the middle of each segment. These are shown as colored dots in the figure below. Pink ones are mid-segment points that are obscured by some closer face. Black ones are mid-segment points that are visible. We cast a ray to each segment's middle point. Rays cast to black ones don't hit any faces before they reach them. Rays cast to pink ones hit some face before they reach them.

Problem 4 on the written part asks you to work out when a ray cast from a point through another point intersects a triangular facet. You should again formulate these conditions using point and vector operations and then code them up as the code for rayFacetIntersect in the file walk-thru-library.js. This function can be tested on its own using the application facet-hit.html.

Once you've got this ray-facet intersection code working, you can then use it within the code for SceneEdge.draw. Within that code, you run through a projected edge's breakpoints, from 0.0 up to 1.0. Within each portion that runs between two breakpoints, shoot a ray in 3-D that passes through the middle of the projected edge and see what faces that ray hits in the scene. If there is some face on an object that sits in front of that portion of that edge in 3-D, then don't draw it. Otherwise draw it.

In the starter code, I've provided the template for the method SceneEdge.isSegmentVisible that can be used in SceneEdge.draw to see whether a portion of an edge is visible, between two breakpoints A and B, when viewed from a given camera shot and when looking at the collection of objects in the scene. This method can call rayFacetIntersect to do its work.
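As a hedged sketch of one way to organize that ray-facet math (not necessarily the intended formulation for rayFacetIntersect), the following tests a ray from a point R0 through a point R1 against a triangle Q1 Q2 Q3. The cross, times, and plus helper names are assumptions; dot and minus are operations named in this write-up.

```js
// A hedged sketch only. Returns null if the ray misses the facet.
function rayFacetIntersectSketch(R0, R1, Q1, Q2, Q3) {
  const dir = R1.minus(R0);                      // ray direction (not necessarily unit)
  const n   = Q2.minus(Q1).cross(Q3.minus(Q1));  // facet normal

  const denom = dir.dot(n);
  if (denom === 0.0) return null;                // ray parallel to the facet's plane

  // Where along the ray we cross the plane of the facet.
  const t = Q1.minus(R0).dot(n) / denom;
  if (t <= 0.0) return null;                     // the plane is behind the ray's start

  const P = R0.plus(dir.times(t));               // the hit point on the plane

  // Inside test: P must lie on the interior side of all three edges.
  const inside =
    Q2.minus(Q1).cross(P.minus(Q1)).dot(n) >= 0.0 &&
    Q3.minus(Q2).cross(P.minus(Q2)).dot(n) >= 0.0 &&
    Q1.minus(Q3).cross(P.minus(Q3)).dot(n) >= 0.0;

  return inside ? { t: t, point: P } : null;
}
```

For the occlusion check in isSegmentVisible, you would additionally require t < 1, so that only facets strictly between the camera and the segment's mid-point count as blocking it.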
One technicality
The code for SceneEdge.isSegmentVisible has one important technicality. When casting a ray to a portion of an edge to see if that ray hits any facets in the scene, we have to be careful about the facets that form that edge. For example, if an edge acts as a hinge between two faces, we shouldn't consider those faces when we cast the ray. If we did, we'd make the mistake of thinking that the edge is obscured by its own faces.

We can take this care because of how we built each SceneEdge object within the code SceneObject.projectEdges used by WalkThru.toPDF. Each SceneEdge is constructed with the faces given by e.faces() when we project each edge e of a SceneObject. For an edge that forms a hinge, this will be an array of two faces that we can use in isSegmentVisible. For an edge that forms the boundary of a hole in the surface of an object, e.faces() will be an array of only one face. Whatever you do, you need to ensure that your code does not think that faces hide their own edges.

Completing the four steps above allows you to do the coding necessary to get WalkThru.toPDF working to produce a proper flip book PDF of the walk-through of a scene. You don't have to use my code and follow my template (you can excise all that code and write it from scratch); however, I found that the above code organization made sense. As I laid out the code for WalkThru.toPDF as provided, it led to natural constructors and methods to code up for all the TO DO sections, and these all worked together as a strategy for generating the correct PDF.

In summary: hand in your application along with enough documentation to help me understand and give feedback on the work that you have completed. At this point, if you've completed these steps, your code should be working and your coding work is complete.

If you were quickly successful in completing the work, then you might take the assignment a bit further. Here are a few ideas for "bells and whistles" that you could add to your submission:

• The shot and object editing can be a little wonky. Feel free to dig through its code and make any enhancements. One easy example can make shot editing much better: when a user is selecting the camera path, you could allow them to insert shots within the sequence, say, if their click is close to the midpoint between two consecutive shots.

• The cameras only sit on the floor and can only be spun around in the floor's plane. The walk-through could have greater flexibility if we could raise each camera and angle it differently. This could lead you to explore changing the mouse handler for the shot preview so that the user could drag and spin the camera's view.

• The animations would certainly be better if we could provide more frames than just the sequence of camera shots. You could, for example, use the curve-based interpolation schemes that we're covering in lecture to fill the PDF with more pages of shots. In addition to moving the camera center more smoothly, with more in-betweens, you'd want to come up with a way of smoothly interpolating between two shots' directions into the scene.

• You might consider adding more objects to the library, especially ones that better showcase the rendering method we've performed. You can search for more .OBJ files on the web, and place their contents into the HTML file. You might also try using a modeling system to build these files. Just be careful: the method outlined above has a running time that scales poorly in the number of vertices and edges.
The sandal and lamp objects are included in the object library, but my solution is unable to render them because of the thousands of facets that make up each model.

• There are .OBJ files on-line that specify a surface with non-triangular facets. The f lines in the file specify, instead, a polygonal "fan": a series of vertices that serve as that face's boundary. With such models it would be cool to depict them as their projected polygon rather than as a fan of triangles. Similarly, soccer and cube have hinged faces that are flat. For these kinds of surfaces, you could choose to exclude any edges that form a flat hinge. Then, for example, the soccer ball would look like a soccer ball, and the cube would look like a cube.

• You could add a point light source to the scene. You could then depict the shadows that result from the light being cast on objects, and thus casting shadows on others. To do this, you'll want to learn how to draw filled areas using the jsPDF API. Alternatively, or in addition, you could have the graphical interface depict shadows, too.

Below is some useful documentation for a few other parts of the provided code, just for your reference.

To do your computations, I encourage you to use the classes defined in geometry-2d.js and geometry-3d.js. These define 2-D and 3-D points and vectors, along with standard vector and affine operations: for the point classes Point2d and Point3d, and the vector classes Vector2d and Vector3d, we've defined the usual affine and vector operations (for example, the Point3d.minus and Vector3d.dot used above). There is also v.perp(), which computes a vector perpendicular to a 2-D vector v. You should be able to do many of your calculations using just these operations. It should be rare that you access a point's or a vector's components, at least for the calculations. You'll certainly access the components for drawing or for changing any of the GUI code.

I've included two other WebGL applications that are run by loading the HTML files segment-intersect.html and facet-hit.html. The first of these is shown below, and can be used to test your 2-D line segment intersection code. When loaded, the application shows two line segments embedded in the plane. Their endpoints can be clicked on and dragged around. The application also displays an intersection point, as computed by segmentsIntersect within walk-thru-library.js. You can see that code working correctly in the sample image below. If, instead, you have not yet rewritten this intersection code, the application will just show the midpoint of the blue segment P0P1.

The second application is shown below. When loaded, it shows a triangular facet in 3-space along with a ray that passes from one point to another. The viewpoint of this scene can be re-oriented by clicking and dragging the scene with the mouse, either while pressing down on the SHIFT key or by using the middle button of the mouse. And then each of the five control points can be selected and dragged around so as to vary the set-up. The application will display an intersection point as returned by rayFacetIntersect within walk-thru-library.js. You can see that code working correctly in the sample image below. If you haven't written that intersection code yet, the application will just show the barycenter of the triangular facet given by Q1Q2Q3.

The HTML file contains a section with the text of several Alias/Wavefront .OBJ files. These each describe the vertices and faces of each of the scene objects in the object library. Here, for example, is the description of the hinged pair of squares used in some of the example images.
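(The listing below is only a schematic stand-in, not the actual text embedded in the HTML: it keeps the structure described next, with 8 vertices and a first face of f 1 2 4, but the coordinates themselves are made up for illustration.)

```
# schematic hinged pair of squares (coordinates are illustrative only)
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 1.0 0.0 1.0
v 0.0 0.0 1.0
v 1.0 0.0 0.0
v 1.7 0.7 0.0
v 1.7 0.7 1.0
v 1.0 0.0 1.0
f 1 2 4
f 2 3 4
f 5 6 8
f 6 7 8
```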
They can be placed using the squares button.

Each file contains, minimally, a list of all the vertex locations for an object on lines starting with v. There are 8 vertices, four for each of the two squares. These are followed by the description of each face on lines starting with f. The first face connects the vertex on the first line with the one on the second and the one on the fourth, forming a triangle. This also means that the surface has the three edges 1->2, 2->4, and 4->1.

As noted in the bells and whistles, lots of these files can be found on-line, although often these are ones obtained by some high-resolution 3-D scanner, and so they have lots of vertices and lots of facets.

The files can contain other information. Some of our objects have vertex normals specified. This allows renderers to perform smooth shading of the surface by interpolating the vertex normals along the points of a face, using lighting and shading models like we learned in class. Vertex normal lines start with vn. Similarly, vertex texture coordinates can be specified as well. These allow a renderer to wrap an image as the detailing of a face. Vertex texture coordinates are specified with vt lines.

These two additional pieces of information can show up in the descriptions. Our lower resolution teapot.obj has the face line f 22/10/22 1/13/1 4/14/4 23/11/23, which directs a renderer to make a four-sided face with vertices from lines 22, 1, 4, and 23. The other numbers indicate each vertex's vt and vn line. Our processing of the files ignores these extra bits of information.

Feel free to add or create your own objects by adding them to the HTML. A warning: they shouldn't have too many facets. The code we are writing uses a lot of Javascript calculations to create the PDF, so only use objects with low triangle counts.

Object library
The HTML file has a section of buttons that allow the user to specify what object gets cloned when placing scene objects. These requests are sent to the editor by a string like "cube" or "teapot". These serve as keys to a Map that acts as a dictionary of object descriptions. More specifically, each entry in the object library maps the object name to the processed contents of a .OBJ file.

The library is built upon startup of the editor. Each object gets compiled so that glBeginEnd calls can be issued to render them where they are each placed. There is a GL_TRIANGLES call for the flat-shaded object to be drawn as part of the walk-through scene, and there is also a GL_LINES call for rendering the mesh wireframe of the object.

Object representation
We represent each object in the library as a CGObject instance, and this class is defined in the source cg-object.js. These objects contain a surface mesh data structure compiled from the information in a .OBJ file. Each object has an array of vertices, an array of faces, and an array of edges that serve as the boundaries of faces.

Relevant to this project, every edge knows its two vertices, and every face knows its three vertices. And, it turns out, each edge knows the face(s) it borders. This could be a single face, if that edge serves as the boundary of a hole in the surface, or it could be two faces, if that edge serves as a hinge between two faces. The fact that edges, vertices, and faces relate to each other gives the topological structure of the surface, essentially providing a means to walk around its edges and its faces.

Step 4 above talks about ray casting to determine whether a portion of an edge gets drawn.
We pointed out that your code might need to exclude the one or two faces of an edge so that they don't hide their own edge. If you have access to an edge object e, then you can access a short array of its faces with e.faces().

The fact that each edge can be shared by two faces adds a little complexity to the representation. In processing the .OBJ file, we make sure we don't add the same geometric edge twice. This means that when an edge forms the hinge between two triangles, only one Edge object instance is made.

Scene objects
There is a separate representation class for objects that need to be rendered. In walk-thru.js we describe a Placement of a CGObject, giving its library name, its position, its scale, and its orientation. When the time comes to render a scene into PDF, each object placement is converted into a "clone" of its CGObject data.

We represent the geometry of a cloned library object with a class named SceneObject. This is a subclass of CGObject, but with additional support for projecting and rendering the object as lines in the PDF. It actually shares the topological information (the edge and face structure) with the CGObject from which it was cloned. It refers to the same edge and face sets as its "parent", but it has a different set of vertex objects, in different locations, than its parent library object.

The fact that clones share face and edge information with their parent complicates access to vertex information. When asking for an edge's endpoints, you need to specify the object so that you get the relevant vertex objects. The same edge on the clone and its parent must lead to different vertex objects for the edge endpoints, depending on whether you are dealing with the parent or with the clone. The provided code contains the calls you'll most often need for obtaining this information from a scene object so that you can render it in the PDF.

The code we write for creating PDF documents uses a freely available library called jsPDF. This API is one of several. It is well-documented and has many nice demonstrations and examples at links from its GitHub repo. Beyond lines and dots, you can lay out text in different fonts, and at slants. You can make filled regions. You can lay out curves, and so on. Our code makes a credit card-sized PDF and we use millimeters as the unit for placing lines and dots. There is of course support for other document sizes and other drawing units, as well. I recommend checking it out.
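To give a flavor of the jsPDF calls involved, here is a hedged sketch. The page dimensions and the way the document object is created here are assumptions (in the project, the document object is handed to WalkThru.toPDF for you); the drawing calls shown are standard jsPDF API.

```js
// A hedged, standalone sketch of jsPDF usage; not the project's set-up.
// Depending on how jsPDF is loaded, the constructor may instead live at
// window.jspdf.jsPDF.
const doc = new jsPDF({ unit: "mm", format: [85.6, 54] }); // credit-card-ish page, in mm

doc.setLineWidth(0.2);
doc.line(10, 10, 40, 30);        // a line from (10mm, 10mm) to (40mm, 30mm)
doc.circle(10, 10, 0.5, "S");    // a small stroked dot at one endpoint

doc.addPage();                   // the flip book adds one page per camera shot
doc.save("walk-through.pdf");    // trigger the download of the result
```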