(1) [300 points] Support Vector Machines. For this question, you will use the data you generated in HW2 from the MNIST Digits Dataset, i.e., the training and test datasets in ZipDigits.train and ZipDigits.test respectively, with the two features you computed for the problem of classifying 1s vs. 5s.
(a) Use this method for training support vector machines and this method for model selection with cross validation from the scikit-learn Python library to find the value of the regularization parameter C with the smallest cross validation error, using 5-fold cross validation on the two-feature training dataset you formed from the data in ZipDigits.train. Report ECV for the best value of C.
(b) For the chosen value of C, learn a support vector machine using all of the training data. Compute and report its Ein.
(c) Use the test dataset from ZipDigits.test to compute Etest for the classifier you just learned and report it. Compare it with the results from HW2 using the linear model.
(2) [300 points] Support Vector Machines with the Polynomial Kernel. For this question, you will use the data you generated in HW3 from the MNIST Digits Dataset for classifying 1s vs. Not 1s, where you created D with 300 randomly selected data points and Dtest consisting of the remaining data points.
(a) Use this method (not the same as the one for the previous question) for training support vector machines using the kernel for the 10-th order polynomial feature transform, and this method for model selection with cross validation from the scikit-learn Python library to find the value of the regularization parameter C with the smallest cross validation error, using 5-fold cross validation on the training dataset D. Report ECV for the best value of C.
(b) For the chosen value of C, learn a support vector machine using all of the training data. Compute and report its Ein.
(c) Use the test dataset Dtest to compute Etest for the classifier you just learned and report it. Compare it with the results from HW3 using the linear model with the 10-th order polynomial feature transform.
(3) [400 points] The k-NN rule. For this question, you will use the data you generated in HW3 from the MNIST Digits Dataset for classifying 1s vs. Not 1s, where you created D with 300 randomly selected data points and Dtest consisting of the remaining data points. You will have to implement the k Nearest Neighbors (k-NN) rule. You may use the helper code as a starting point.
(a) Use cross validation with D to select the optimal value of k for the k-NN rule. Show a plot of ECV versus k and indicate the value of k you choose.
(b) For the chosen value of k, plot the decision boundary. Also compute and report its Ein and ECV.
(c) Report Etest for the k-NN rule corresponding to the chosen value of k.
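For part (a) of the first two questions, the model-selection step can be sketched with scikit-learn's SVC and GridSearchCV. Treat the class names as an assumption (the handout's links name the exact methods to use), and note that X and y below are synthetic stand-ins for the two-feature training data:

```python
# Sketch of 5-fold cross validation over C for a support vector machine.
# X (N x 2) and y (N,) stand in for the two-feature training data from
# ZipDigits.train; here they are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # placeholder labels

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}  # candidate C values (a choice)
# For Question 2, swap in SVC(kernel="poly", degree=10, coef0=1.0).
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
E_cv = 1.0 - search.best_score_  # CV error = 1 - best mean CV accuracy
```

Refitting on all the training data for part (b) is then `SVC(kernel="linear", C=best_C).fit(X, y)`.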
(1) [100 points] LFD Exercise 4.3.
(2) [100 points] LFD Exercise 4.6.
(3) [100 points] LFD Exercise 4.8.
(4) [100 points] LFD Exercise 4.11.
(5) [1000 points] An End-to-End Learning System with Regularization and Validation: Predicting 1s vs. Not 1s. We revisit the MNIST Handwritten Digits Dataset we worked with in the last homework to solve the problem of predicting whether a given image of a handwritten digit represents either the digit 1 or not the digit 1, i.e., if the n-th example is labeled as being the digit 1, then yn = +1, and otherwise yn = −1. First, you must perform the following steps to prepare to solve the problem:
(1) [Combine Data] Combine the training and test sets (in ZipDigits.train and ZipDigits.test respectively) into a single dataset.
(2) [Compute Features] Use the algorithms you developed in the previous homework to compute two features for each example in the dataset.
(3) [Normalize Features] Shift and then rescale the values of each feature across the entire dataset so that the values of every feature lie in the range [−1, 1].
(4) [Create Input and Test Datasets] Select 300 data points from the dataset uniformly at random (and without replacement) to form the dataset D. Use the remaining data points to form the test dataset Dtest. You must set aside Dtest and not touch it until the end of this exercise, when we will compute the final output g of our learning system. We will use Dtest to estimate Eout(g).
For convenience, we will treat this as a regression problem with real-valued targets ±1 until we output our final hypothesis g for classifying handwritten images as either the digit 1 or not the digit 1, at which point we will use sign(g(x)) to predict the class of a test data point x.
The standard polynomial feature transform generates features which are not 'orthogonal', making the columns in the data matrix dependent. This can be problematic for the one-step linear regression algorithm, as it requires computing the (pseudo-)inverse of a matrix. An 'orthogonal' polynomial transform is
(x1, x2) → (1, L1(x1), L1(x2), L2(x1), L1(x1)L1(x2), L2(x2), L3(x1), L2(x1)L1(x2), ...),
where Lk(xi) is the k-th order Legendre polynomial applied to the i-th feature of the input data point x. See LFD Problem 4.3 for a recursive expression that defines the Legendre polynomials, which can be implemented as an efficient iterative algorithm.
We will use the one-step (pseudo-inverse) linear regression algorithm with weight decay regularization for learning. This corresponds to minimizing the augmented error Eaug(w) = Ein(w) + λ w^T w, where Ein(w) is the sum of squared errors. The weights that minimize Eaug(w) are wlin(λ) = (Z^T Z + λI)^{-1} Z^T y, where Z is the N × d̃ matrix generated from the polynomial transform of the data points X in the dataset D = (X, y), and wlin is the d̃ × 1 regularized weight vector.
Now, complete the following tasks to arrive at the final hypothesis g.
(Task 1) [100 points] 10-th order Polynomial Transform. Use the 10-th order Legendre polynomial feature transform to compute Z. Report the dimensions of Z.
(Task 2) [100 points] Overfitting. Plot the decision boundary of the output of the regularized linear regression algorithm without any regularization (λ = 0). What do you observe, overfitting or underfitting?
(Task 3) [100 points] Regularization. Plot the decision boundary of the output of the regularized linear regression algorithm with λ = 3. Do you observe overfitting or underfitting?
(Task 4) [200 points] Cross Validation. Use leave-one-out cross validation to estimate ECV(λ) for λ ∈ {0, 0.01, 0.1, 1, 5, 10, 25, 50, 75, 100}. Plot ECV versus λ and Etest(wlin(λ)) versus λ on the same plot. Comment on the behavior of ECV and Etest versus λ. Here, ECV and Etest are the regression (sum of squared) errors.
(Task 5) [100 points] Pick λ. Use the cross validation errors from the previous step to pick the best value of λ, and call it λ∗. Plot the decision boundary corresponding to the weights wlin(λ∗).
(Task 6) [100 points] Estimate Classification Error. Use wlin(λ∗) for classification and estimate the classification out-of-sample error Eout(wlin(λ∗)) for your final hypothesis g. Estimate Eout(g) for distinguishing between digits that are 1s and not 1s (give the 99% error bar).
(Task 7) [100 points] Is ECV biased? Comment on whether ECV(λ∗) is an unbiased estimator of Etest(wlin(λ∗)) (treated as regression error). Why or why not?
(Task 8) [200 points] Data snooping. Is Etest(wlin(λ∗)) an unbiased estimator of Eout(wlin(λ∗)) (treat them as classification errors)? Why or why not? If not, what could we do differently to fix things so that it is? Explain.
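The two computational pieces above, the Legendre transform and the closed-form wlin(λ), can be sketched with nothing beyond NumPy. The function names are our own, and the iterative evaluator implements the standard three-term Legendre recursion referenced in LFD Problem 4.3:

```python
# Sketch: iterative Legendre polynomials, the 'orthogonal' polynomial
# transform, and w_lin(lambda) = (Z^T Z + lambda I)^{-1} Z^T y.
import numpy as np

def legendre(k, x):
    """Evaluate the k-th Legendre polynomial at x (a NumPy array),
    via the recursion k L_k = (2k-1) x L_{k-1} - (k-1) L_{k-2}."""
    if k == 0:
        return np.ones_like(x)
    prev, cur = np.ones_like(x), np.asarray(x, dtype=float)
    for j in range(2, k + 1):
        prev, cur = cur, ((2 * j - 1) * x * cur - (j - 1) * prev) / j
    return cur

def poly_transform(X, order=10):
    """Map (x1, x2) -> (1, L1(x1), L1(x2), L2(x1), L1(x1)L1(x2), ...),
    grouped by total degree; d-tilde = (order+1)(order+2)/2 columns."""
    cols = []
    for d in range(order + 1):      # total degree d
        for i in range(d + 1):      # degree d-i on x1, degree i on x2
            cols.append(legendre(d - i, X[:, 0]) * legendre(i, X[:, 1]))
    return np.column_stack(cols)

def w_lin(Z, y, lam):
    """One-step regularized solution; lam > 0 guarantees invertibility."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

Leave-one-out cross validation for Task 4 then just calls `w_lin` on Z with one row held out at a time.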
(1) [200 points] Gradient Descent on a Simple Function. Consider the function f(x, y) = 2x^2 + y^2 + 3 sin(2πx) cos(2πy).
(a) Implement gradient descent to minimize this function. Run gradient descent starting from the point (x = 0.1, y = 0.1). Set the learning rate to η = 0.01 and the number of iterations to 50. Give a plot that displays how the function value drops through successive iterations of gradient descent. Repeat this with a learning rate of η = 0.1 and provide a plot of the function value at each iteration. What do you observe?
(b) Obtain the "minimum" value and the location of the minimum of the function using gradient descent with the same learning rate η = 0.01 and number of iterations (50) as part (a), from the following starting points: (i) (0.1, 0.1), (ii) (1, 1), (iii) (0.5, 0.5), (iv) (0.0, 0.5), (v) (−0.5, −0.5), (vi) (−1, 1). Write down the minimum value obtained using gradient descent and the location of that minimum for each of these starting points. As you may appreciate, finding the "true" global minimum value of an arbitrary function is a hard problem.
(2) [600 points] Classifying Handwritten Digits: 1 vs. 5. Implement logistic regression for classification using gradient descent. Find the best separator using the training data only (using ZipDigits.train). Use the data and features you generated for solving HW2. For each example, use the two features you computed in HW2 as the inputs; the output is +1 if the handwritten digit is 1 and −1 if the handwritten digit is 5. Once you have found a separator using your classification algorithm:
(a) Give separate plots of the training data (ZipDigits.train) and test data (ZipDigits.test) which display the data points using the two features you computed in HW2, together with the separator.
(b) Compute Ein on your training data (ZipDigits.train) and Etest, the error of your separator on the test data (ZipDigits.test).
(c) Obtain a bound on the true out-of-sample error (Eout). You should get two bounds, one based on Ein and another based on Etest. Use a tolerance of δ = 0.05. Which is the better bound?
(d) Repeat parts (a)-(c) using a 3-rd order polynomial transform.
(e) Which model would you deliver to the USPS, the linear model with the 3-rd order polynomial transform or the one without? Explain.
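Part (a) of the first question can be sketched in plain Python with the analytic gradient of f (plotting the recorded history is left to matplotlib; the function names are ours):

```python
# Sketch of gradient descent on f(x, y) = 2x^2 + y^2 + 3 sin(2*pi*x) cos(2*pi*y).
import math

def f(x, y):
    return 2 * x**2 + y**2 + 3 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)

def grad(x, y):
    # Partial derivatives of f, computed by hand.
    gx = 4 * x + 6 * math.pi * math.cos(2 * math.pi * x) * math.cos(2 * math.pi * y)
    gy = 2 * y - 6 * math.pi * math.sin(2 * math.pi * x) * math.sin(2 * math.pi * y)
    return gx, gy

def descend(x, y, eta=0.01, iters=50):
    history = [f(x, y)]               # function value at each iteration
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
        history.append(f(x, y))
    return x, y, history
```

Running `descend` from each starting point in part (b) and recording the final (x, y) and f value gives the requested table.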
(1) LFD Exercise 1.7. [100 points] (2) LFD Exercise 1.8. [50 points] (3) LFD Exercise 1.9. [50 points] (4) LFD Exercise 1.10. [100 points] (5) LFD Exercise 1.12. [100 points] (6) LFD Problem 1.4 (a-e). [300 points] (7) LFD Problem 1.7. [300 points]
(1) [200 points] LFD Exercise 2.6.
(2) [100 points] LFD Problem 2.12.
(3) [400 points] LFD Problem 2.24.
(4) [100 points] Computing Features with the Handwritten Digits Data. This week, you will familiarize yourself with the MNIST handwritten digits dataset. You may download the following files from Brightspace that accompany this homework: ZipDigits.train and ZipDigits.test contain the training and test datasets respectively. The content and format of these files is documented in ZipDigits.info. Each row of ZipDigits.train (or ZipDigits.test) corresponds to an example of a handwritten digit. Entries in a row are space separated. The first entry in a row correctly identifies the digit (0-9), and the next 256 entries are grayscale values between -1 (white) and 1 (black) corresponding to a 16×16 image read row after row. For this problem, we will use only the digits 1 and 5, so you must first remove the other digits from the training and test datasets. Then, perform the following tasks:
(a) Familiarize yourself with the dataset by giving a plot of the first two digits in ZipDigits.train. Hint: If you are using the Python programming language, you may use matplotlib.pyplot.imshow, which takes a 2-D array as input. You may refer to the documentation on how to display a grayscale image.
(b) Develop two features to measure properties of the image that would be useful in distinguishing between the digits 1 and 5. You may use average intensity and symmetry (defined in LFD Example 3.1) as your two features, or define and compute any other features you think are better suited to help distinguish between 1 and 5. Provide a mathematical definition of the two features you compute using the notation discussed in class.
(c) Provide a 2-D scatter plot of the examples in ZipDigits.train w.r.t. the two features you defined in Part (b), similar to the scatter plot in LFD Example 3.1 and elsewhere in LFD Chapter 3. For each example, plot the values of the two features with a red '×' marker if it is a 5 and a blue '◦' marker if it is a 1. You must clearly label each axis with the feature it represents, and it should be possible to determine, for each data point, the values of the two features you computed. You must also include a legend in the upper right corner of your scatter plot which clearly identifies that data points marked with '×' represent examples of the digit 5 and those marked with '◦' represent examples of the digit 1. Hint: If you are using the Python programming language, you may use matplotlib.pyplot.scatter, which creates a 2-D scatter plot of n data points when given as input two 1-D arrays x and y of length n which provide the positions of the data points along the first and second dimensions respectively.
Additional tips for plotting:
• You may find it useful in the long run to use matplotlib.pyplot.subplots to generate both a matplotlib.figure.Figure object and one or more matplotlib.axes.Axes objects. At a high level, you may think of the Figure object as a sort of canvas that gets displayed or saved as a file, and the Axes object as a collection of plot elements that need to be printed onto the canvas. Each Axes object may therefore refer to a different collection of plot elements that together make up a plot, and you get to pick where it gets printed on the canvas or let the library decide for you. Therefore, it will often be useful to maintain a pointer to the Axes objects of interest so you can manipulate them and add different plot elements like axis labels, legends, and so forth. For more details, you may find this article and this article to be of interest.
• To add a legend to your plot, you may find this guide and this guide helpful.
• To manually control how the values along each axis are marked, you may use matplotlib.axes.Axes.set_xticks and matplotlib.axes.Axes.set_yticks, and find this guide helpful.
• If you are having trouble printing your image, you may use matplotlib.pyplot.tight_layout. Its use is documented in this guide.
(5) [500 points] Classifying Handwritten Digits: 1 vs. 5. Implement the algorithm for linear regression for classification followed by the pocket algorithm. Find the best separator using the training data only (using ZipDigits.train). For each example, use the two features you computed as the inputs; the output is +1 if the handwritten digit is 1 and −1 if the handwritten digit is 5. Once you have found a separator using your classification algorithm:
(a) Give separate plots of the training data (ZipDigits.train) and test data (ZipDigits.test) which display the data points using the two features, together with the separator.
(b) Compute Ein on your training data (ZipDigits.train) and Etest, the error of your separator on the test data (ZipDigits.test).
(c) Obtain a bound on the true out-of-sample error (Eout). You should get two bounds, one based on Ein and another based on Etest. Use a tolerance of δ = 0.05. Which is the better bound?
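The "linear regression for classification followed by the pocket algorithm" pipeline in question (5) can be sketched as below, assuming NumPy only. The helper name `pocket` is ours, and the data here is a synthetic stand-in for the two computed features:

```python
# Sketch: linear-regression weights as the starting point, then the pocket
# algorithm (a PLA that keeps the best-so-far weights). X is N x 2, y in {-1, +1}.
import numpy as np

def pocket(X, y, iters=100):
    Z = np.column_stack([np.ones(len(X)), X])   # prepend the bias coordinate
    w = np.linalg.pinv(Z) @ y                    # linear regression initialization
    best_w, best_err = w.copy(), np.mean(np.sign(Z @ w) != y)
    for _ in range(iters):
        miss = np.flatnonzero(np.sign(Z @ w) != y)
        if len(miss) == 0:                       # perfectly separated
            break
        i = miss[0]                              # pick a misclassified point
        w = w + y[i] * Z[i]                      # PLA update
        err = np.mean(np.sign(Z @ w) != y)
        if err < best_err:                       # keep the best weights "in the pocket"
            best_w, best_err = w.copy(), err
    return best_w, best_err
```

The returned `best_err` is Ein; the separator for the plots in part (a) is the line best_w[0] + best_w[1] x1 + best_w[2] x2 = 0.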
In class recently we've discussed several important ingredients in computer vision: feature point (corner) detection, image descriptors, image matching, and transformations. This assignment will give you practice with these techniques.

Part 0: Getting started

You can find your assigned teammate(s) by logging into IU GitHub, at http://github.iu.edu/. In the upper left hand corner of the screen, you should see a pull-down menu. Select cs-b657-sp2022. Then in the box below, you should see a repository called userid1-userid2-userid3-a2 or userid1-userid2-a2, where the other user ID(s) correspond to your teammates. While you may want to do your development work on a local machine (e.g. your laptop), remember that the code will be tested and thus must run on burrow.luddy.indiana.edu. After logging on to that machine via ssh, clone the GitHub repository via one of these two commands:

git clone https://github.iu.edu/cs-b657-sp2022/your-repo-name-a2
git clone [email protected]:cs-b657-sp2022/your-repo-name-a2

where your-repo-name is the one you found on the GitHub website above. (If this command doesn't work, you probably need to set up IU GitHub ssh keys. See the Canvas A1 page for help.) This should fetch some test images and some sample code (see the Readme file in the repo for details).

Part 1: Image matching and clustering

As cameras continue to pervade our lives, it is not an exaggeration to say that the world is drowning in visual data: consumers will take about 1.5 trillion photos this year alone, or roughly the same number that were taken in all of human history up to the year 2000. With so much visual data, we need tools to help people automatically organize their photos. Sometimes we might know ahead of time what people are looking for in photos, and can organize based on semantic content (e.g., the place the photo was taken, who is in the photo, etc.). Other times we may simply want to help people organize their photos into coherent groups according to similar visual content, even if the algorithms do not attempt to recognize what that similar content is. A natural idea in this context is to cluster photos into a small number of groups according to visual similarity across photos.

In class, we discussed SIFT, a technique to detect corners and other "interest" or "feature" points. For each interest point, SIFT produces a 128-dimensional vector that is supposed to be invariant to image transformations like scaling, rotation, etc. These features can be useful for discovering similarities in visual content across images. Although deep learning-based approaches have eclipsed SIFT's popularity on many problems, it is still often used for problems involving understanding the geometry of scenes (e.g. in 3D reconstruction) and in problems requiring explainability. SIFT has several problems, including that it is protected by patents. Although we can use SIFT in academia for research and education purposes without a license, there are few high-quality open-source SIFT implementations available. We suggest using one of two approaches:

1. In Python, we suggest using a similar descriptor, ORB, instead of SIFT. The idea behind ORB is very similar: it takes an image and produces a set of keypoints (x,y coordinates) and descriptors (high-dimensional vectors). In your repository, we've provided some sample code for computing ORB features using the OpenCV library. The code also shows how to compute the distance between ORB features; they are binary features, so a bitwise Hamming distance is typically recommended.

2. If you use C++, we have provided an implementation of SIFT that we prepared.
ORB was designed to be much faster than SIFT while offering similar performance.

Figure 1: Sample visualization of SIFT matches.

1. Write a function that takes two images, applies ORB/SIFT on each one, and counts the number of matching ORB/SIFT descriptors across the two images. Remember that an ORB/SIFT "match" is usually defined by looking at the ratio of the Euclidean distances between the closest match and the second-closest match, and comparing to a threshold. (Although not required, you may find it helpful to also create a function that visualizes these matches, as in Figure 1.)

2. Use your function above to implement a simple image clustering program that organizes a set of unordered images into k groups, where k is a parameter given by the user. Your program should take command line arguments like this:

./a2 part1 k img_1.png img_2.png ... img_n.png output_file.txt

and then produce an output file with k lines, each of which lists the images assigned to that cluster. For example, for k=2 your output file might be:

img_2.png img_4.png
img_1.png img_3.png img_5.png

Hints: You can use any clustering technique you'd like, and you can use a library instead of implementing clustering on your own (assuming that it runs on burrow.luddy.indiana.edu). Note that you do not have a "feature" for each object (image), but instead you have pairwise distances between objects; this suggests that an agglomerative or spectral clustering algorithm may be more natural than something like k-means. A key design decision is your choice of distance metric, i.e. how you compare two images. You might start with a count of matching ORB/SIFT descriptors as in #1 above, but feel free to refine this to get better results.

Hints: The command:

./a2 part1 k img_1.png img_2.png ... img_n.png output_file.txt

would be really tedious if you have to type all n images (where n is like 100) on the command line like that. Fortunately, Linux will do this for you.
Say all of your images are in a subfolder called "images". Then you can just type:

./a2 part1 k images/*.png output_file.txt

and Unix will expand that "*" into 100 separate file names for your program.

3. There is a small test image collection in your GitHub repo, consisting of a few dozen images from well-known tourist attractions. There are 10 classes in this dataset, with the file names indicating the correct classes. (You can use these file names for evaluating the accuracy of your clustering, but of course you cannot use the names to help with the clustering — that would be cheating!) In your report, show some sample failures (assuming you have some failures) and try to explain what went wrong — again, an advantage of SIFT matching (compared to deep learning) is explainability, so you can see which feature points were matched by your program and check what went wrong. Report your performance quantitatively using the following measure. For the N images in the dataset, consider each of the possible N(N − 1) pairs of images. Let TP be the number of pairs of images from the same class that your clustering correctly assigned to the same cluster, and TN be the number of pairs from different classes that your clustering correctly assigned to different clusters. The Pairwise Clustering Accuracy is then simply (TP + TN) / (N(N − 1)), which ranges from 0 to 1, where higher accuracies indicate better performance. A small portion of your grade for this assignment will be based on how well your clustering performs on our (separate) test dataset on this task, compared with others in the class.

Part 2: Image transformations

In class we discussed the idea of image transformations, which change the "spatial layout" of an image through operations like rotations, translations, scaling, skewing, etc. We saw that all of these operations can be written using linear transformations with homogeneous coordinates.

1. Write a function that takes an input image and applies a given 3×3 coordinate transformation matrix (using homogeneous coordinates) to produce a corresponding warped image. To help test your code, Figure 2 shows an example of what "lincoln.jpg" (which we've provided in your repository) should look like before and after the following projective transformation:

[  0.907      0.258    −182 ]
[ −0.153      1.44       58 ]
[ −0.000306   0.000731    1 ]

This matrix assumes a coordinate system in which the first dimension is a column coordinate, the second is a row coordinate, and the third is the extra dimension added by homogeneous coordinates. For best results, use inverse warping with bilinear interpolation.

Figure 2: Sample projective transformation.

2. Now let's say you don't know the transformation matrix, but you do know some pairs of corresponding points across two images and you know the type of transformation that relates the two. Write a program that accepts command line parameters like this:

./a2 part2 n img_1.png img_2.png img_output.png img1_x1,img1_y1 img2_x1,img2_y1 ... img1_xn,img1_yn img2_xn,img2_yn

where n indicates the type of transformation: n = 1 means only a translation, n = 2 means a Euclidean (rigid) transformation, n = 3 means an affine transformation, and n = 4 means a projective transformation. The remaining parameters indicate the feature correspondences across images (e.g. point (img1_x1, img1_y1) in img_1.png corresponds to (img2_x1, img2_y1) in img_2.png). Note that the parameter n is designed to indicate the number of point correspondences that are needed; e.g. only 1 pair of correspondences is needed for translations, while 4 are needed for projective transformations.

Figure 3: The goal is to warp image (b) to appear as if it were taken from the same perspective as image (a), producing image (c).
Your program should compute and display the transformation matrix it has found, and then apply that transformation matrix to img_1.png to produce a new image img_output.png. If all has worked correctly, your img_output.png should look very similar to img_2.png. For example, we've included images called "book1.jpg" and "book2.jpg" in your repository. For n = 4, four pairs of corresponding points are:

"book1.jpg": {(318, 256), (534, 372), (316, 670), (73, 473)}
"book2.jpg": {(141, 131), (480, 159), (493, 630), (64, 601)}

which you could run through your program like this:

./a2 part2 4 book2.jpg book1.jpg book_output.jpg 318,256 141,131 534,372 480,159 316,670 493,630 73,473 64,601

and get a result similar to that shown in Figure 3.

Hints: Start with n = 1 (and create a simpler test image involving just translation) and get that working first, then move on to n = 2, etc. Solving for the transformation matrices will involve solving a linear system of equations; feel free to use the solver built into NumPy.

Part 3: Automatic image matching and transformations

Finally, put the parts above together to automatically find the transformation between two images, transform one into the coordinate system of the other, and then produce a "hybrid" stitched image that combines the two together. One application of this is to create panoramas (Figure 4). Another might be to align two images taken at different times (e.g., from satellites) to be able to quickly spot differences between them. Your program will be called as follows:

./a2 part3 image_1.jpg image_2.jpg output.jpg

In particular, your program should (1) extract interest points from each image, (2) figure out the relative transformation between the images by implementing RANSAC, (3) transform the images into a common coordinate system, and (4) blend the images together (e.g. by simply averaging them pixel-by-pixel) to produce output.jpg.
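For the n = 4 case in Part 2, the linear system mentioned in the hints can be set up directly in NumPy. This is a minimal sketch (the function name is ours): it fixes the bottom-right matrix entry to 1 and assumes the four correspondences are in general position, so the 8×8 system is solvable:

```python
# Sketch: recover a 3x3 projective transformation from four point
# correspondences. Each correspondence (x, y) -> (u, v) contributes two
# linear equations in the eight unknown matrix entries (h33 = 1).
import numpy as np

def solve_projective(src, dst):
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and similarly for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)
```

Feeding in the book1.jpg/book2.jpg point sets above should give a matrix that maps each source point exactly onto its counterpart (after dividing by the homogeneous coordinate).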
This is again an open-ended problem and you can take it in multiple directions, as long as you implement RANSAC with feature points to figure out the transformation automatically. In your report, describe your solution, including an outline of your algorithm, any important parameter values, any problems or issues you faced, design decisions you made, etc. Show some sample results on images of your choice, including failures and some intuition about why they failed.

Figure 4: Sample panorama, based on two source images.

What to turn in

Make sure to prepare (1) your source code, and (2) a report (a Readme.md file in your GitHub repository) that explains how your code works, including any problems you faced, any assumptions, simplifications, and design decisions you made, and answers to the questions posed above. To submit, simply put the finished version (of the code and the report) on GitHub (remember to add, commit, push) — we'll grade the version that's there at 11:59PM on the due date.
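The RANSAC loop in Part 3 can be sketched in a few lines. To keep the idea visible, this toy version fits only a translation (so the minimal sample is a single correspondence); the assignment needs the full transformation-fitting from Part 2 inside the same loop, and all names here are our own:

```python
# Sketch of RANSAC over putative matches: repeatedly fit a model to a
# minimal random sample, count inliers within a tolerance, keep the best.
# matches is a list of ((x1, y1), (x2, y2)) correspondence pairs.
import random

def ransac_translation(matches, iters=200, tol=3.0, seed=0):
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.choice(matches)   # minimal sample: one pair
        dx, dy = x2 - x1, y2 - y1                  # hypothesized translation
        inliers = [m for m in matches
                   if abs(m[0][0] + dx - m[1][0]) <= tol
                   and abs(m[0][1] + dy - m[1][1]) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (dx, dy), inliers
    return best_model, best_inliers
```

In the full pipeline, the returned inlier set would typically be used to re-fit the transformation before warping and blending.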
In this assignment, you'll get experience with image operations and apply them to an object detection problem.

In a major midwestern research university not far away, a mild-mannered professor of computer science has a dream: a cheating-proof final exam that could be graded perfectly and effortlessly in his class of 100 students. He will first write 85 multiple-choice questions. Then he will randomly shuffle both the order of the questions and the order of the multiple-choice options within each question, creating 100 unique exam booklets. The first page of each exam booklet will be an answer sheet like that shown in Fig 1. The students will be asked to detach the answer sheet and turn it in after the exam. Unbeknownst to the students, each answer sheet will have a unique code printed on it that contains an encrypted copy of the correct answers for that particular answer sheet. After the exam, the instructor will scan all the answer sheets to produce 100 images, which he'll then run through a custom computer vision program to score the exams automatically. The grading program will simply read the code, decrypt the correct answers, recognize the student's marks on the answer sheet, and calculate how many questions have been answered correctly.

Your job is to help make the dream a reality, by writing an automatic grading program that will work as well as possible. Given a scan of the answer sheet like that shown in Fig 1, your program should identify the answers that the student marked as accurately and robustly as possible.

1. You can find your assigned teammate(s) by logging into IU GitHub, at http://github.iu.edu/. In the upper left hand corner of the screen, you should see a pull-down menu. Select cs-b657-sp2022. Then in the box below, you should see a repository called userid1-userid2-userid3-a1 or userid1-userid2-a1, where the other user ID(s) correspond to your teammates.

2. While you may want to do your development work on a local machine (e.g. your laptop), remember that the code will be tested and thus must run on burrow.luddy.indiana.edu. After logging on to that machine via ssh, clone the GitHub repository via one of these two commands:

git clone https://github.iu.edu/cs-b657-sp2022/your-repo-name-a1
git clone [email protected]:cs-b657-sp2022/your-repo-name-a1

where your-repo-name is the one you found on the GitHub website above. (If this command doesn't work, you probably need to set up IU GitHub ssh keys. See the Canvas A1 page for help.) This should fetch some test images and some sample code (see the Readme file in the repo for details).

3. Now write a program that accepts a scanned image file of a printed form like that in Fig 1, and produces an output file that contains the student's marked answers. The program should be run like this:

python3 grade.py form.jpg output.txt

There are 31 possible correct answers per question, because some questions might instruct the student to fill in multiple options in the same question (e.g. choices A and B might both be true, so the student should mark both). The program should create an output file (the second parameter in the command line above) that indicates the answers that it has recognized from the student's answer sheet. The output file should have one line per question, with the question number, followed by a space, followed by the letter(s) that were selected by the student. It should also output an x on a line for which it believes the student has written in an answer to the left of the question (as the instructions on the answer form permit, but your program does not need to recognize which letter was written). For example, the first few lines of the output for Fig 1 should be:

1 A
2 A
3 B
4 B
5 C
6 AC x
...

Figure 1: A sample scanned answer sheet.

4. Finally, devise a system for printing the correct answers on the answer sheet in some way so that your grading program can read them, even after the answer sheet is printed and scanned.
There are many possible ways of doing this; you might add some textual annotations, or add some form of bar code, or a watermark, or some other pattern. Whatever system you adopt should not obviously reveal the correct answers to the students. This should consist of two separate programs, one to inject the answers into the answer sheet and one to recognize them, like this:

python3 inject.py form.jpg answers.txt injected.jpg
python3 extract.py injected.jpg output.txt

where the first command takes the blank form and injects the answers (in the same file format as the output text file described above) into the image to create injected.jpg, and the second takes an image that has been injected and extracts the correct answers (writing them out in, again, the same text file format as above). Explain your technique in detail in your report (see below). Try to make your technique robust enough to be detectable even after the image has been printed, filled in by a student, and then scanned into an image file. We've provided a blank form called blank_form.jpg for you to do some experimentation along these lines, and please report on these experiments in your report.

This assignment is purposely open-ended. How should you go about it? It's up to you, but here are a few ideas. You could use edge detection and Hough transforms to try to find the alignment of the form within the page. You could use segmentation to find blobs of ink. You could use differences in color to separate ink from the printed form. You could use cross correlation to find local image regions of interest – filled-in squares, or empty squares, or letters. It's probably much easier to figure out question numbers by their position on the page as opposed to trying to actually recognize the question numbers through optical digit recognition, although you could try this if you really want to.

Evaluation. To help you evaluate your code, we've provided some test images.
These are actual scans of completed test sheets, which means that they have some natural variation: slightly different positions and orientations within the image caused by the imperfect paper feeder of the scanner, variations in ink and style across different students, etc. You can assume that these images are quite representative of the types of variation that would occur in real life; i.e., your program should try to handle these types of variation as much as possible, but we don't expect it to handle extreme cases (like answer sheets that were scanned upside down). Your program only has to work for this one particular format of answer sheet. Please use these images to evaluate the accuracy of your program and present the results in your report (see below). We've included two text files that contain the expected (ground truth) output for two of the test images (and you can create your own ground truth files for the others). This provides an easy way to quantitatively evaluate your program: simply compare your output to the ground truth output file. Your program will almost certainly not work perfectly, and that's okay! To make things fun, we will hold a competition in which we evaluate the programs on a separate test dataset of unseen exam sheets. A small portion of your grade will be based on how well your system works compared to the systems developed by others in the class. We may also give extra credit for additional work beyond that required by the assignment.

Report. An important part of developing as a graduate student is practicing how to explain your work clearly and concisely, and how to conduct experiments and present results. Thus an important part of this assignment is a report, to be submitted as a Readme.md file in GitHub (which allows you to embed images and other formatting using Markdown). Your report should explain how to run your code and any design decisions or other assumptions you made.
Report on the accuracy of your program on the test images, both quantitatively and qualitatively. When does it work well, and when does it fail? How could it be improved in the future? Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.). Note that even if your code performs poorly, you can still write an interesting report that explains what you tried, what the advantages and disadvantages of that approach are, why you think it didn't work, and so on. You can think of the report as an argument for why you deserve a good grade on the assignment.

Important: As the very last section of the report, please include a section called "Contributions of the Authors" which explains how each person in your team contributed to the project (formulating an approach, writing code, conducting experiments, writing the report, etc.). Please be as specific as possible. We will not grade submissions that do not include this section.

What to turn in. You should submit: (1) your source code, and (2) your report as a Readme.md file in GitHub. To submit, simply put the finished version in your GitHub repository (remember to add, commit, push); we'll grade whatever version you've put there as of 11:59PM on the due date. To make sure that the latest version of your work has been accepted by GitHub, you can log into the github.iu.edu website and browse the code online.

Grading. Unfortunately (or fortunately, depending on how you look at it), in computer vision there is almost never a single "right" approach that one can prove to be optimal. This means that there are many possible ways to approach this assignment and many possible paths to a good grade.
Here are some basic principles for how to get a good grade:

(1) Make sure that your code runs on the SICE Linux machines; if your code crashes or if we can't get it to work at all, it's very hard for us to evaluate what you've done and you will probably not get a good grade.

(2) Use your report to showcase what you have done. If you tried many techniques before settling on the one you submit, talk about these failed techniques in your report! Otherwise we have no idea whether you hacked something together at the last minute or went through methodical experimentation to arrive at your solution. Be as specific and concrete as possible. If you made assumptions or design decisions, state them clearly and discuss how they informed your eventual submission. You can think of your report as an argument for why you deserve a good grade. A good report is substantive but need not be long.

(3) Write clean, clear, correct, reasonably efficient code. We expect you to use good programming practices, e.g., meaningful variable and function names, comments where appropriate, etc. This is mostly for your benefit; the graders can be much more generous with partial credit if they understand what you were trying to do.

(4) Your code should execute in a reasonable amount of time: under about 10 minutes.

(5) The accuracy of your program is a good indication of your code's quality, since a good implementation of a well-planned approach is likely to perform better than a haphazard implementation of a poorly planned one. However, given the choice between a submission that took an unusually creative or risky approach that was carefully planned, implemented, tested, and reported, but did not give very good results, and a submission that took a boring approach found in a book that gave good results but was obviously haphazardly executed, we would prefer the former.
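The output-file format described above is simple enough to pin down in a few lines. Here is a Python sketch (the helper name `format_answers` and its input shape are my own; the placement of the x marker on the question's own line follows the Fig. 1 example):

```python
def format_answers(answers):
    """Render recognized answers in the required output format.

    `answers` has one entry per question: a (letters, handwritten) pair,
    where `letters` is a string of the marked options (e.g. "AC") and
    `handwritten` is True when writing was detected left of the question.
    """
    lines = []
    for number, (letters, handwritten) in enumerate(answers, start=1):
        line = f"{number} {letters}"
        if handwritten:
            line += " x"   # flag a written-in answer on that question's line
        lines.append(line)
    return "\n".join(lines) + "\n"
```

Feeding it the Fig. 1 answers, `format_answers([("A", False), ("A", False), ("B", False), ("B", False), ("C", False), ("AC", True)])` reproduces the six example lines; comparing such output line-by-line against a ground-truth file gives the quantitative evaluation suggested above.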
Problem 1. Consider the representation of a queue of n elements e1 … en in the lambda-calculus:

λc.λn.((c e1) ((c e2) … ((c en) n) …))

Show non-recursive lambda-calculus definitions for the following two operations on a queue: (i) insert: given a queue q and an element e, return the queue obtained by adding the element e at the end of q. (ii) length: given a queue q, return a Church numeral denoting the number of elements in q.

Examples:

((insert tom) λc.λn.n) =>* λc.λn.((c tom) n)
((insert ding) λc.λn.((c tom) n)) =>* λc.λn.((c tom) ((c ding) n))
((insert hari) λc.λn.((c tom) ((c ding) n))) =>* λc.λn.((c tom) ((c ding) ((c hari) n)))

(length λc.λn.n) =>* λf.λx.x
(length λc.λn.((c tom) ((c ding) n))) =>* λf.λx.(f (f x))
(length λc.λn.((c tom) ((c ding) ((c hari) n)))) =>* λf.λx.(f (f (f x)))

Save your definitions for insert and length in a file called church.txt. It is acceptable to write L instead of λ in your definitions for insert and length.

Problem 2. (a) Consider a representation for a lambda-term using three Prolog constructors, v, l, and a, for variable, abstraction, and application respectively. For example, the lambda-term ((λf.λx.(f x) y) z) would be represented by the Prolog term a(a(l(f, l(x, a(v(f), v(x)))), v(y)), v(z)). Write a Prolog predicate reducible(T), which returns true if T has a beta- or eta-redex somewhere inside it; otherwise, the predicate returns false.

Examples:

?- reducible(l(x,v(x))).
false
?- reducible(a(l(x,v(x)), v(y))).
true
?- reducible(a(a(l(f, l(x, a(v(f), v(x)))), v(y)), v(z))).
true

(b) Using reducible, write a Prolog predicate called norm(T) which returns true if T is in normal form; otherwise, the predicate returns false. Save your answers to parts (a) and (b) in a file called lambda.pl.

Problem 3. Four boys (Ali, Bing, Charles, Dani) and four girls (Kari, Lola, Mary, Nina) need to be assigned to one of four activities (biking, running, hiking, surfing).
Each boy and girl should be assigned to only one activity each, and their individual constraints are as follows:

a. Ali likes to bike and Mary likes to hike.
b. Bing, Charles, Kari, and Lola do not like to run.
c. Nina does not like to surf.
d. Lola and Charles want to be together.
e. Dani and Mary do not want to be together.

Write a Prolog program to assign one boy and one girl to each activity, as follows:

– The boys and girls should be defined by two predicates:

boy(ali).      girl(kari).
boy(bing).     girl(lola).
boy(charles).  girl(mary).
boy(dani).     girl(nina).

– Each activity should be defined as a binary constructor: biking(Boy1,Girl1), running(Boy2,Girl2), hiking(Boy3,Girl3), and surfing(Boy4,Girl4).

– The top-level predicate should be called solve and it should be defined like this:

solve(Answer) :-
    assumptions(Answer),
    constrainta(Answer), constraintb(Answer), constraintc(Answer),
    constraintd(Answer), constrainte(Answer).

assumptions(Answer) :-
    boy(B1), boy(B2), boy(B3), boy(B4),
    girl(G1), girl(G2), girl(G3), girl(G4),
    Answer = [ biking(B1, G1), running(B2, G2), hiking(B3, G3), surfing(B4, G4) ].

– Each of the constraints a–e should be defined using one Prolog rule and should incorporate only the conditions stated in that constraint. To help you get started, we can define

constrainta(Answer) :- member(biking(ali,_), Answer), member(hiking(_,mary), Answer).

Develop Prolog predicates for the constraints b–e and place the code for solve, assumptions, and constrainta … constrainte in a file assign.pl.

Note: The inequality operator in Prolog is \==, and is used, for example, as B \== charles.

What to Submit: Prepare a top-level directory named A5_UBITId1_UBITId2 if the assignment is done by two students; otherwise, name it A5_UBITId if the assignment is done solo. (Order the UBITIds alphabetically in the former case.) In this directory, place church.txt, lambda.pl, and assign.pl.
Compress the directory and submit the resulting compressed file using the submit_cse505 command. For more details regarding online submission, see Resources → Homeworks.
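Problem 1's queue encoding can be mimicked with Python lambdas, which is a convenient way to sanity-check candidate definitions before writing them in pure lambda-calculus syntax. This is only an illustration, not the required answer: it collapses length to a plain integer, whereas the assignment asks for a Church numeral (use Church successor and Church zero in place of the `rest + 1` and `0` below).

```python
# A queue e1 .. en is the function  λc.λn.((c e1) ((c e2) ... ((c en) n) ...)),
# written here in curried Python.
empty = lambda c: lambda n: n

def insert(q, e):
    # insert = λq.λe.λc.λn.((q c) ((c e) n)): appends e at the END of q,
    # with no recursion, exactly as the examples require
    return lambda c: lambda n: q(c)(c(e)(n))

def length(q):
    # fold the queue with a counting c; the real answer must instead build
    # a Church numeral (Church successor for c, Church zero for n)
    return q(lambda e: lambda rest: rest + 1)(0)

def to_list(q):
    # handy for inspection: fold with list cons
    return q(lambda e: lambda rest: [e] + rest)([])
```

Replaying the handout's examples: inserting tom then ding into the empty queue yields a queue whose fold order is tom first, ding second, matching λc.λn.((c tom) ((c ding) n)).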
Overview. Content is king on the Internet today, and businesses thrive or fail based on the value of the content they collect, hoard, and serve to their users. Protecting this data is difficult, because the large majority of that content is available to users, which means it is also available to automated web crawlers that mimic real users. For this project, you will implement a parallel web crawler that maximizes throughput when retrieving content from a target website. This is an adversarial interaction: the crawler will be trying to pull down data from a server that is wary of, and prepared against, aggressive crawlers. The web server implements multiple layers of rate limits on incoming network connections, which the crawler must overcome if it wants to maximize its efficiency. For this project, you will need to learn how to interact with HTTP servers (much as you interacted with FTP servers in Project 2). There is a set of interactions (again like Project 2) which your crawler (an HTTP client) will need to recognize, parse, and respond to. You can start from https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol, which provides the list of commands.

There are three versions of the crawler that should be implemented as part of this assignment.

1. [40%] Basic crawler. Your crawler will connect to the target website and fetch the main page. Once it retrieves the main page (index.html), it will parse the page and extract all links to objects and other pages, put them on the crawler queue, and crawl them in turn, repeating the process until it has fetched all files and pages reachable from index.html (a transitive closure). To identify itself to the server, the crawler uses a single cookie with a user ID.

2. [40%] Parallel flows vs. per-flow rate limits. The server has per-flow rate limits that throttle any single network connection to a max rate (R), and that puts a ceiling on how fast the basic crawler can work.
Version 2 of the crawler overcomes this limit by implementing parallel connections with coordination. In this mode, the crawler spawns multiple threads (as you did in Assignment 2), where each thread cooperates with the others to speed up the download. The key difference is that all threads share a single download queue, and any parsed links to files or other pages get put on the same shared queue. Implemented correctly, N threads will ideally speed up the aggregate download rate by roughly a factor of N, since each thread can potentially max out its throughput at R. Note that here, all threads identify themselves to the server using the same user cookie file.

3. [20%] Shared cookies vs. per-user rate limits. The server is smarter than you thought, and also implements a per-user rate limit that caps the total download rate across any number of connections by the same user. This means that instead of getting N*R total throughput with your N threads, you instead get some total throughput U, where U < N*R.
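The shared-queue design of version 2 can be sketched as follows. This is a structural sketch only: `fetch` is an injected placeholder standing in for real HTTP retrieval over your socket code, and the regex link extraction stands in for proper HTML parsing.

```python
import queue
import re
import threading

def extract_links(html):
    # naive href extraction; a real crawler should parse HTML properly
    return re.findall(r'href="([^"]+)"', html)

def crawl(start, fetch, num_threads=4):
    """Crawl everything reachable from `start` with cooperating threads."""
    work = queue.Queue()
    seen = {start}          # guarded by `lock` so no page is fetched twice
    pages = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = work.get()              # blocks until a URL is available
            try:
                html = fetch(url)
                with lock:
                    pages[url] = html
                    for link in extract_links(html):
                        if link not in seen:
                            seen.add(link)
                            work.put(link)
            finally:
                work.task_done()

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    work.put(start)
    work.join()            # returns once every queued URL has been processed
    return pages
```

With a real `fetch`, each thread can max out at the per-flow rate R, so N threads approach N*R in aggregate until the per-user limit of version 3 kicks in; the `Queue.task_done`/`join` pairing is what detects that the transitive closure is complete.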
In this assignment, you will implement a parallel FTP client application using the socket API. The application is called pftp, and must be implemented in C or Python. The FTP protocol is specified in RFC 959. The assignment can be broken down into two components: basic FTP operation and parallel download. Both components must be completed by the due date.

Part 1: Basic FTP Operation (65%)

The client application will take an FTP server and a file name as command-line parameters. When executed, the client will simply retrieve the specified file following the FTP protocol and save it to a local file with the same name. Note: you will not need to implement the put operation, and each FTP session operates in the default passive mode, which we introduced in Lecture 9.

Usage:
pftp [-s hostname] [-f file] [options]
pftp -h | --help
pftp -v | --version

For example, by executing:
% pftp -s ftp://mirror.keystealth.org/gnu/ -f /ProgramIndex
the file named ProgramIndex will be created in the current directory. On success, the pftp application exits with code 0. If an error occurs, the application exits with the appropriate code and prints a human-readable error message on standard error.

More details on the command line:
-h or --help: Prints a synopsis of the application usage and exits with return code 0.
-v or --version: Prints the name of the application, the version number (in this case the version has to be 0.1), and the author, and exits, returning 0.
[-f file] or [--file file]: Specifies the file to download.
[-s hostname] or [--server hostname]: Specifies the server to download the file from.

Options:
[-p port] or [--port port]: Specifies the port to be used when contacting the server (default value: 21).
[-n user] or [--username user]: Uses the username user when logging into the FTP server (default value: anonymous).
[-P password] or [--password password]: Uses the password password when logging into the FTP server (default value: [email protected]).
[-m mode] or [--mode mode]: Specifies the mode to be used for the transfer (ASCII or binary) (default value: binary).
[-l logfile] or [--log logfile]: Logs all the FTP commands exchanged with the server, and the corresponding replies, to the file logfile. If the filename is "-", the commands are printed to standard output. The lines must be prepended with C->S: or S->C: for client-to-server and server-to-client lines, respectively. For example:

S->C: 220 Server (Uchicago) [128.135.164.133]
C->S: USER anonymous
S->C: 331 Anonymous login ok, send your complete email address as your password.
C->S: PASS [email protected]

Part 2: Parallel FTP Download (35%)

The next part of the assignment is to implement a parallel (multi-track) FTP download. The idea is that the pftp client uses separate threads to simultaneously connect to multiple servers and download different segments of the same file. On the server side, this functionality is already supported by current FTP servers through the "restart marker" option, i.e., REST. A REST request sets the start position of the transfer. There is no intuitive user interface for this type of functionality, so we will use something very simple and straightforward: pftp takes an additional command-line argument:

[-t para-config-file] or [--thread para-config-file]

Usage: pftp [-t para-config-file] [options]

Each line in the config file specifies the login, password, hostname, and absolute path to the file; lines are separated by a Line Feed (LF) character. When pftp reads in the para-config-file, it parses the N servers on the list. For each of the N servers, pftp spawns a separate thread, connects to the server, and moves to the specified directory. To download a file of S bytes, pftp performs a joint download in which each thread downloads a segment of S/N bytes. The downloaded segments should be put together to form the entire file. Note: pftp can use the --thread option simultaneously with the -l option.
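The S/N split can be pinned down with a little arithmetic. In this sketch, the last thread takes the leftover bytes when S is not divisible by N (an assumption; the handout does not say who takes the remainder). Each (offset, length) pair then maps onto one REST at the offset followed by a RETR that reads length bytes.

```python
def segments(size, n):
    """Split a file of `size` bytes into n contiguous (offset, length)
    segments, one per download thread; the last thread absorbs the
    remainder when size is not a multiple of n."""
    base = size // n
    ranges = []
    offset = 0
    for i in range(n):
        length = base if i < n - 1 else size - base * (n - 1)
        ranges.append((offset, length))
        offset += length
    return ranges
```

For example, a 10-byte file over 3 threads yields segments (0,3), (3,3), (6,4): thread i sends REST with its offset, then copies exactly its length into the right position of the output file.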
Also, pftp only supports binary mode, not ASCII mode. Each line of the para-config file has the format:

ftp://username:password@servername/file-path

An example para-config file is as follows (and will be used as a test for your program):

ftp://cs23300:youcandoit@ftp1.cs.uchicago.edu/rfc959.pdf
ftp://socketprogramming:rocks@ftp2.cs.uchicago.edu/rfc959.pdf

Note that when a para-config file is used, any username and password specified through command-line options are overridden by the values in the file.

Discussion

This application must be extremely resilient to wrong/unexpected input data, repeated values, etc. Part of your assignment is to understand how the pftp application could be abused by a user and to provide meaningful error messages if something goes wrong. You have to program defensively and with user-friendliness in mind. Segmentation fault is not an option! You CANNOT use any existing library to retrieve the files from the server. You have to write your own code that opens connections, plays the protocol, parses the data received from the server, etc. The application should return meaningful error messages if an error occurs. In addition, the exit value should give some information about what happened: by returning different exit values for different errors, the application becomes easily scriptable (that is, an external script can invoke the application and manage error conditions). Obviously, it is impossible to foresee all possible error conditions, so the following list is necessarily incomplete, but handling these cases is expected of your program.
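Each config line is an ordinary URL, so if you implement pftp in Python the standard library can split it; only the FTP protocol itself must be hand-written, not the URL parsing. A sketch (the example line in the test is hypothetical):

```python
from urllib.parse import urlsplit

def parse_config_line(line):
    """Split ftp://username:password@servername/file-path into its fields."""
    url = urlsplit(line.strip())   # strip the trailing LF
    return url.username, url.password, url.hostname, url.path
```

The returned path is absolute (it keeps its leading slash), so the thread can CWD to its directory or pass the full path to RETR directly.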
Exit code  Explanation
0          Operation successfully completed
1          Can't connect to server
2          Authentication failed
3          File not found
4          Syntax error in client request
5          Command not implemented by server
6          Operation not allowed by server
7          Generic error

Hints and Additional Notes

• Note that when invoked without any parameters, the application should behave as if it were invoked with the --help option.
• All error messages must be printed to standard error.
• If a parameter is specified multiple times, then the last value assigned to the parameter is the value accepted.
• List of FTP raw commands: https://www.nsftools.com/tips/RawFTP.htm. Hint: the key commands that you should be using for this project include PASV, REST, RETR, TYPE, USER, PASS, QUIT, etc.
• Additional info on FTP can also be found in lecture9.pdf.
• Tutorial for multi-threaded programming in C: https://www.geeksforgeeks.org/multithreading-c-2/

FTP servers for testing your program. We have set up two FTP servers:
ftp1.cs.uchicago.edu (username: cs23300, password: youcandoit)
ftp2.cs.uchicago.edu (username: socketprogramming, password: rocks)
Two files were placed on both servers: rfc959.pdf and lecture8.pdf.

Submission. Use the following Dropbox upload link. You can submit multiple times, but only the last submission will be used for grading. https://www.dropbox.com/request/9CvMbG9PvJyzX4aq448G
Overview. For this project, you will implement a simplified version of the netcat (or nc) utility. Netcat is a simple Unix utility that reads and writes data across network connections, using the TCP or UDP protocol. In the simplest usage, "nc host port" creates a TCP connection to the given port on the given target host. Your standard input is then sent to the host, and anything that comes back across the connection is sent to your standard output. Internally, netcat uses the BSD socket interface to listen for connections, establish connections, send and receive data, and terminate connections. The implementation details of TCP/UDP, IP, and Ethernet (for example, constructing packet headers, performing route table lookups, and translating IP addresses to MAC addresses) are handled by the operating system. If you have any other questions, please ask them on Piazza.

Submission Requirements. Your submission should include your source code, a Makefile, and a README. You should write your implementation in C and use the functions in the BSD socket interface to establish/terminate connections and send/receive data. You should call your compiled program snc (which stands for simplified netcat); i.e., you should provide a Makefile that compiles your source code to produce a binary called snc. Your README file should contain a short header with your full name and CNetID. The README should additionally contain a short description of your code, the tasks accomplished, any issues you encountered, and how to compile, execute, and interpret the output of your programs. Put the README, the Makefile, and all source files in a directory named with your CNetID, then tar the directory by running:

tar czvf CNetID.tgz CNetID

Submit your file to the Dropbox upload using the following URL.
https://www.dropbox.com/request/mpbvOzusHIhLjSdXbkOh?oref=e

Implementing Simplified netcat

You should write your implementation in C and use the functions in the BSD socket interface to establish/terminate connections and send/receive data.

Options: The command-line syntax for your simplified netcat is:

snc [l] [u] [hostname] port

Your simplified version of netcat should appropriately handle the following arguments:

l: Specifies that netcat should listen for an incoming connection rather than initiate a connection. In other words, it specifies that netcat should behave as a server rather than a client.

u: Indicates that UDP should be used instead of TCP.

hostname: If the l option is given, hostname specifies the IPv4 address in dotted decimal notation (e.g., 128.105.112.1) or the symbolic hostname (e.g., sandlab.cs.uchicago.edu) on which the server instance should listen for packets/connections. If the l option is not given, then hostname specifies the IPv4 address or the symbolic hostname that should be used by the client instance to contact the server instance. If the l option is not given, the hostname argument is required; otherwise it is optional. The hostname argument, if included, must be the second-to-last argument.

port: If the l option is given, port specifies the port on which the server instance should listen for packets/connections. If the l option is not given, then port specifies the numeric port number (e.g., 9999) on which the client instance should contact the server instance. The port argument is always required and must be the last argument. Note: port numbers less than 1024 are generally reserved for common applications, so you should pick a port number between 1025 and 65535 (assuming no other application on the machine is already using it).

Behavior. Your implementation of netcat should behave as described above. If the l option is specified, you should wait for incoming connections on the specified port.
Otherwise, you should initiate a connection on the specified port. Once connected, your client should read characters from stdin and, when the enter key is pressed, send the text buffer over the connection. When the server receives data, it should immediately output the text on stdout. Bonus credit: for simultaneous reads/writes, you will need two threads: one thread to read from stdin and send data, and one thread to read from the socket and output received data.

When you are using TCP and Ctrl+D is pressed, you should close the connection and exit. If the opposite side closes the connection, then you should also exit. When you are using UDP and Ctrl+D is pressed, you should stop reading input from stdin and sending over the socket, but you should still continue to receive from the socket and output on stdout; you do not need to exit the program when Ctrl+D is pressed and UDP is being used.

Errors. Your program should be able to gracefully handle any errors encountered. If the user specifies an option that does not exist, or a required option is not provided, then output:

invalid or missing options
usage: snc [l] [u] [hostname] port

If you encounter any other error condition, then output:

internal error
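The TCP core of the two roles is small. Below is an illustrative Python analogue (the assignment itself requires C with the BSD socket interface, and the function names here are made up), exercised over the loopback interface: the listener role accepts one connection and reads until EOF, and the client role connects, sends, and closes, which is what Ctrl+D triggers.

```python
import socket
import threading

def make_listener():
    """Bind a TCP listener on an OS-assigned loopback port (the "l" role)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))        # port 0: the OS picks a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]

def recv_all(srv):
    """Accept one connection and read until the peer closes (EOF)."""
    conn, _ = srv.accept()
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:                 # empty read: peer closed the connection
            break
        chunks.append(chunk)
    conn.close()
    srv.close()
    return b"".join(chunks)

def send_text(host, port, text):
    """Client role: connect, send one buffer, then close (like Ctrl+D)."""
    with socket.create_connection((host, port)) as s:
        s.sendall(text.encode())

def loopback_demo(text):
    """Wire the two roles together over loopback for a quick check."""
    srv, port = make_listener()
    out = []
    t = threading.Thread(target=lambda: out.append(recv_all(srv)))
    t.start()
    send_text("127.0.0.1", port, text)
    t.join()
    return out[0]
```

The bonus-credit version replaces the one-shot `send_text` with two long-lived threads per endpoint, one pumping stdin to the socket and one pumping the socket to stdout.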
Problem 1. Consider the following function definition in the C programming language for carrying out the summation expressed by the operator ∑ discussed in Lecture 16:

int sigma(int *k, int low, int high, int expr()) {
    int sum = 0;
    for (*k = low; *k <= high; (*k)++)
        sum += expr();
    return sum;
}

[…]

datatype 'a inf_list = lcons of 'a * (unit -> 'a inf_list);

Define a function church: string -> string inf_list which generates an infinite list of Church numerals starting from 1. Test out church by executing the following "main" program:

fun take(0, _) = []
  | take(n, lcons(h, thk)) = h :: take(n-1, thk());
take(5, church("x"))

Create a file called church.sml with all relevant definitions.

Problem 4. Give a formal definition for a function LI which computes the leftmost-innermost redex of a lambda-term, along the lines of the substitution operation given in the notes on Lambda Calculus (pages 5-6). If there is no redex in the input term, LI should return ⊥, which stands for "undefined". Examples:

LI x = ⊥
LI λx.y = ⊥
LI λx.(x x) = ⊥
LI (a b) = ⊥
LI ((a b) (c d)) = ⊥

LI λx.(a x) = λx.(a x)
LI (λx.y a) = (λx.y a)
LI λx.(λy.y x) = (λy.y x)
LI ((a b) (c (λz.z b))) = (λz.z b)
LI ((λz.z a) (λz.z b)) = (λz.z a)
…

The definition of LI involves about 8 rules. Two of the rules, to help you get started, are:

LI V = ⊥
LI λV.T = LI T, if LI T ≠ ⊥

Important: The LHS of each of these rules must be mutually exclusive of the other rules. You may use the variable V to stand for any variable, and T, T1, T2 for arbitrary lambda-terms. In addition to the standard occurs_free_in test, you may also use is_etaredex(T) and is_betaredex(T) to test whether T is an eta- or a beta-redex, respectively. Place your solution in a file called lambda.pdf.

What to Submit: Prepare a top-level directory named A4_UBITId1_UBITId2 if the assignment is done by two students; otherwise, name it A4_UBITId if the assignment is done solo. (Order the UBITIds alphabetically in the former case.) In this directory, place sigma.c, flatten.py, church.sml, and lambda.pdf.
Compress the directory and submit the resulting compressed file using the submit_cse505 command. For more details regarding online submission, see Resources → Homeworks.
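The thunk mechanics behind `lcons` and `take` in the church problem can be previewed in Python before writing the SML. This is an analogue only (it generates plain integers where the assignment asks for Church-numeral terms): a lazy list is a head paired with a zero-argument function that produces the tail on demand.

```python
def lcons(head, thunk):
    # a lazy list cell: a head plus a zero-argument thunk for the tail
    return (head, thunk)

def from_n(n):
    """Infinite list n, n+1, n+2, ...; the recursive call is delayed
    inside a lambda, so constructing the list terminates immediately."""
    return lcons(n, lambda: from_n(n + 1))

def take(n, lst):
    # mirror of the SML take: force exactly one more cell per step
    if n == 0:
        return []
    head, thunk = lst
    return [head] + take(n - 1, thunk())
```

As in the SML test program, only the first n cells are ever forced, which is why an "infinite" list is safe to build.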
Problem 1. Assume the following ML type for binary trees:

datatype 'a tree = leaf | node of 'a * 'a tree * 'a tree;

Complete the definition below for a function insert(i, tr) which will insert a value i into tr (if i is not present in tr) so as to maintain the binary search tree property.

fun insert(i, leaf) = __________________
  | insert(i, node(v,left,right)) = ___________________;

In the above code, ML will assume v to be of type int by default. In completing the definition of insert, note that you do not update the input tree; rather, you need to construct a new tree in which the value to be inserted is placed in the correct position. Test your definition by running the following function, making sure to place the code for testcase after insert:

fun testcase() =
  let val t1 = node(100,leaf,leaf);
      val t2 = insert(50, t1);
      val t3 = insert(150, t2);
      val t4 = insert(200, t3);
      val t5 = insert(125, t4);
      val t6 = insert(175, t5);
      val t7 = insert(250, t6);
      val t8 = insert(25, t7);
      val root = insert(75, t8)
  in root end;

Prepare a file insert.sml containing the type definition for tree as well as the code for testcase and insert.

Problem 2. Consider the following depth-first ("in order") traversal of a BST:

fun dfirst(leaf) = []
  | dfirst(node(v,t1,t2)) = dfirst(t1) @ [v] @ dfirst(t2);

Develop a tail-recursive equivalent of dfirst, called dfirst2. Develop the definition of dfirst2 by writing a tail-recursive helper function, say df, which will use an accumulator-passing style in order to construct the answer. Nest the definition of df inside dfirst2 using a let-in-end block. Test out dfirst2 by running the following function:

fun test_dfirst2() = dfirst2(testcase());

Prepare a file dfirst.sml containing the type definition for tree and the code for insert, testcase, dfirst2, and test_dfirst2.

Problem 3.
Consider the following ML definitions of the familiar higher-order functions map and reduce:

fun map(f, []) = []
  | map(f, x::t) = f(x) :: map(f, t);

fun reduce(f, b, []) = b
  | reduce(f, b, x::t) = f(x, reduce(f, b, t));

An n-ary tree is a generalization of a binary tree in which each internal node has a list of zero or more subtrees, and each leaf node holds a value. Below is an ML type definition for an n-ary tree:

datatype 'a ntree = leaf of 'a | node of 'a ntree list;

a. Using map, define a function subst(tr,v1,v2) which returns a new ntree in which all occurrences of the value v1 in tr are replaced by v2 in the output tree. For example,

subst(node([leaf("x"), node([leaf("y"), leaf("x"), leaf("z")])]), "x", "w")
  = node([leaf("w"), node([leaf("y"), leaf("w"), leaf("z")])])

b. Using reduce, define a function toString(tr) which concatenates all strings at the leaf nodes of a string ntree, adding a space between each pair of adjacent values, and returns the resulting string. For example,

toString(node([leaf("x"), node([leaf("y"), leaf("x"), leaf("z")])])) = "x y x z"

Note: The functions subst and toString will be recursive, but the recursive calls should be initiated from map and reduce respectively, i.e., through the parameter f of map and reduce.

Prepare a file mapreduce.sml containing the type definition for ntree and the code for map, reduce, subst, and toString.

What to Submit: Prepare a top-level directory named A3_UBITId1_UBITId2 if the assignment is done by two students; otherwise, name it A3_UBITId if the assignment is done solo. (Order the UBITIds alphabetically in the former case.) In this directory, place insert.sml, dfirst.sml, and mapreduce.sml. Compress the directory and submit the resulting compressed file using the submit_cse505 command. For more details regarding online submission, see Resources → Homeworks.
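The shape of Problem 3's answer, with the recursion initiated through map and reduce, can be previewed in Python (an analogue only; the assignment requires SML, and the tagged-tuple tree encoding here is my own):

```python
from functools import reduce

def leaf(v):
    return ("leaf", v)

def node(subtrees):
    return ("node", subtrees)

def subst(tr, v1, v2):
    tag, payload = tr
    if tag == "leaf":
        return leaf(v2 if payload == v1 else payload)
    # the recursive calls are initiated through map, as the problem asks
    return node(list(map(lambda t: subst(t, v1, v2), payload)))

def to_string(tr):
    tag, payload = tr
    if tag == "leaf":
        return payload
    # the recursive calls are initiated through reduce; the accumulator
    # inserts a space only between adjacent leaf values
    return reduce(lambda acc, t: acc + " " + to_string(t) if acc else to_string(t),
                  payload, "")
```

Running it on the handout's example tree reproduces both sample results, which is a quick way to check the intended fold order before writing the SML.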
1. An important technique in the study of object-oriented languages is the use of `delegation' to replace `class inheritance'. A systematic way of doing this transformation was given in Lecture 8. Apply this technique to the program in Delegation.java, located under Resources → Homeworks on Piazza. This program defines four classes, A, B, C, and D, where the classes C and D extend class B, which in turn extends class A. The result of your transformation should be four classes called A2, B2, C2, and D2 which correspond to A, B, C, and D respectively, but do not make use of class inheritance. As outlined in Lecture 8, in order to develop the transformation, you need to define interfaces IA, IB, IC, and ID, where the interfaces IC and ID extend IB, which in turn extends IA. The classes A2, B2, C2, and D2 implement interfaces IA, IB, IC, and ID respectively. The file Delegation.java contains the definitions of classes A, B, C, and D and also a driver class called Delegation. Place the interfaces IA, IB, IC, and ID and the classes A2, B2, C2, and D2 in the same file. Run the program through JIVE and save the sequence diagram in Delegation.png. Important: Full credit will be given only if the transformation is done in a systematic way.

2. Refer to the program TreeGUIBasic.java, located under Resources → Homeworks on Piazza. This program embodies a number of design patterns that we have studied in class. (a) Mention all the design patterns present in the program. (b) For each design pattern present in the program, cite the specific code fragment that illustrates the pattern. Prepare a PDF file, called DesignPatterns.pdf, containing your answers, and submit it along with the files for problem 1.

What to Submit: Prepare a top-level directory named A2_UBITId1_UBITId2 if the assignment is done by two students; otherwise, name it A2_UBITId if the assignment is done solo. (Order the UBITIds alphabetically in the former case.)
In this directory, place Delegation.java, Delegation.png, and DesignPatterns.pdf. Compress the directory and submit the resulting compressed file using the submit_cse505 command. For more details regarding online submission, see Resources > Homeworks.
Consider the following grammar for a simple programming language, TinyPL:

    program -> decls stmts end
    decls   -> int idlist ';'
    idlist  -> id [ ',' idlist ]
    stmts   -> stmt [ stmts ]
    stmt    -> assign ';' | cmpd | cond | loop
    assign  -> id '=' expr
    cmpd    -> '{' stmts '}'
    cond    -> if '(' rexp ')' stmt [ else stmt ]
    loop    -> for '(' [assign] ';' [rexp] ';' [assign] ')' stmt
    rexp    -> expr ('<' | '==' | '!=') expr
    expr    -> term [ ('+' | '-') expr ]
    term    -> factor [ ('*' | '/') term ]
    factor  -> int_lit | id | '(' expr ')'

Write an object-oriented top-down parser in Java that translates every TinyPL program into an equivalent sequence of byte-codes for a Java Virtual Machine. It would be helpful to develop your program in three stages, as follows, but you only need to submit the result of Stage 3:

Stage 1: Assume that stmt is of the form: stmt -> assign | cmpd
Stage 2: Assume that stmt is of the form: stmt -> assign | cmpd | cond
Stage 3: Assume that stmt is of the form: stmt -> assign | cmpd | cond | loop

Assumptions:
1. All input test cases will be syntactically correct; syntax error-checking is not necessary.
2. The lexical analyzer only accepts an id that is a single letter, and an int_lit that is an unsigned integer.
3. Follow the Java byte-code naming convention for all opcodes.

Program Structure:
1. There should be one Java class definition for each nonterminal of the grammar. Place the code for the top-down procedure in the class constructor itself.
2. There should be a top-level driver class called Parser and another class, called Code, which has methods for code generation.
3. The code for the lexical analyzer will be given to you.

Output:
1. For each test case, show the byte code generated, as well as the object diagram produced by JIVE at the end of execution. In generating the object diagram, choose the "Stacked" (i.e., without tables) option while saving the object diagram.
2. Sample test cases and their outputs will be posted on Piazza.
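As an illustration of the one-class-per-nonterminal style, here is a minimal sketch of expr/term/factor that evaluates instead of generating byte-code. A toy single-character tokenizer (Lex) stands in for the provided lexical analyzer, and ids are omitted; every name other than the nonterminals is hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy tokenizer: single-character tokens, matching the assignment's
// single-letter ids and the sketch's single-digit literals.
class Lex {
    private final Deque<String> toks = new ArrayDeque<>();
    Lex(String s) { for (char c : s.replace(" ", "").toCharArray()) toks.add(String.valueOf(c)); }
    String peek() { return toks.isEmpty() ? "" : toks.peek(); }
    String next() { return toks.poll(); }
}

// One class per nonterminal; the top-down parse happens in the constructor.
// factor -> int_lit | '(' expr ')'
class Factor {
    Expr sub; int lit;
    Factor(Lex lex) {
        if (lex.peek().equals("(")) { lex.next(); sub = new Expr(lex); lex.next(); }
        else lit = Integer.parseInt(lex.next());
    }
    int eval() { return sub != null ? sub.eval() : lit; }
}

// term -> factor [ ('*' | '/') term ]
class Term {
    Factor f; String op; Term rest;
    Term(Lex lex) {
        f = new Factor(lex);
        if (lex.peek().equals("*") || lex.peek().equals("/")) { op = lex.next(); rest = new Term(lex); }
    }
    int eval() { int v = f.eval(); return rest == null ? v : (op.equals("*") ? v * rest.eval() : v / rest.eval()); }
}

// expr -> term [ ('+' | '-') expr ]
class Expr {
    Term t; String op; Expr rest;
    Expr(Lex lex) {
        t = new Term(lex);
        if (lex.peek().equals("+") || lex.peek().equals("-")) { op = lex.next(); rest = new Expr(lex); }
    }
    int eval() { int v = t.eval(); return rest == null ? v : (op.equals("+") ? v + rest.eval() : v - rest.eval()); }
}

public class ExprDemo {
    public static void main(String[] args) {
        System.out.println(new Expr(new Lex("3+(8-2*3)")).eval()); // prints 5
    }
}
```

In the real assignment, each constructor would call into the Code class to emit byte-codes instead of building an eval() tree; note also that the grammar's right recursion makes '-' and '/' chains group to the right.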
File naming conventions will also be posted. Please follow them carefully.

Clarifications:
1. Generate iconst, bipush, or sipush depending upon the numeric value of the literal:
   o For small constants, in the range 0..5, the constant is implicit in the name of the instruction: iconst_0 ... iconst_5.
   o In generating code for integers in the range 6..127, the actual value comes immediately after the opcode bipush. We are not dealing with negative numbers in TinyPL, but Java encodes numbers from -128 to +127 using 8 bits (one byte); therefore, Java leaves one byte after the instruction for bipush.
   o For short integers greater than 127, the generated opcode is sipush. Now we need two bytes to encode the value, and hence Java leaves two bytes after the instruction for sipush.
   Unlike opcodes such as iadd, imul, isub, etc., whose operands are already on the stack before the opcode executes, in the case of bipush and sipush the operand has to come after the opcode, because that is how the JVM knows how many bytes to push onto the stack.
2. The iload and istore instructions have two variations each:
   o For the first three variables declared, the load and store instructions are iload_1, iload_2, iload_3 and istore_1, istore_2, istore_3, respectively.
   o For the fourth and subsequent variables, the load and store instructions are iload n and istore n, where n > 3. The number n is encoded in one byte and placed after the iload and istore instructions.
3. Note that the initialization, test, and increment components of a for-loop are all optional; the simplest loop is of the form for ( ; ; ) ... Your byte-code generation should work correctly whether or not a particular component of the for-loop is present.
4. Optimizations are not required. For programs in TinyPL, the Java compiler would perform two types of optimizations:
   a. Expressions such as 3 + (15 - 2 * 3) will be simplified to an integer value, namely, 12.
This is part of a more general process called "constant folding", which is typically done in the (machine-independent) optimization phase.
   b. When there is a chain of goto's, each one transferring control to the next, the Java compiler will optimize them by generating "goto x", where x is the location of the final destination.
You are not required to make the above optimizations.

End of Assignment 1

Extra Credit (10%), only for those who have the assignment working fully: Extend the grammar with the type bool (for boolean). Support boolean expressions with the operators '&&', '||', and '!', which have their usual meanings. Finally, perform byte-code generation for boolean expressions that appear in an assignment statement as well as in if-else statements and for-loops.
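The literal- and variable-encoding rules in the clarifications above can be condensed into a small selector. The names emitConst and emitLoad are hypothetical helpers; in the assignment this logic would belong in the Code class:

```java
// Choose the byte-code instruction for a literal or a variable load,
// following the ranges given in the clarifications.
public class PushSelect {
    static String emitConst(int n) {
        if (n >= 0 && n <= 5) return "iconst_" + n; // value implicit in the opcode
        if (n <= 127)         return "bipush " + n; // one operand byte follows
        return "sipush " + n;                       // two operand bytes follow
    }

    static String emitLoad(int varNum) {
        // the first three declared variables use the short forms iload_1..iload_3
        return varNum <= 3 ? "iload_" + varNum : "iload " + varNum;
    }

    public static void main(String[] args) {
        System.out.println(emitConst(3));   // iconst_3
        System.out.println(emitConst(100)); // bipush 100
        System.out.println(emitConst(300)); // sipush 300
        System.out.println(emitLoad(4));    // iload 4
    }
}
```

The analogous istore selector is identical apart from the opcode name.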
1 Stages

Assignments 8 and 9 will both implement a game using concurrency, but the concurrency will be implemented in two very different ways.

P8. A server process that interacts with four client processes through sockets. All five interact through a protocol.
P9. A process with four subsidiary threads. All five interact through shared memory.

1.1 Goals for P8
1. To implement application logic for Mom and the Kids.
2. To define a communication protocol.
3. To implement a client program that implements the Kid logic.
4. To implement a server program that implements Mom's logic.
5. To concurrently operate Mom and the four Kids.

2 Scenario

The family. Mom has four children: Ali, Cory, Lee, and Pat. She also has an unending list of chores that need to be done every week. On Saturday, Mom posts the job list, and everybody knows that Mom will make them keep doing chores until she signals the end of the work session. They do not know that Mom plans to make them work approximately 3.5 hours (21 time units).

Varied jobs. There are many kinds of jobs, and Mom rates each one on three scales (1..5):
1. Q: From quick to time-consuming. A quick job = 1 time unit; a time-consuming job = 5. Each time unit is 10 minutes.
2. P: From pleasant = 1 to very dirty, unpleasant, or sweaty = 5.
3. E: From easy work = 1 to heavy work = 5.

Mom rewards each worker-child with points for every completed job. The number of points is Q*(P+E). For example, cleaning the toilet = quick, a little unpleasant, light work = 1*(4+1) = 5 points. Mowing the lawn is slow, fairly sweaty, and strenuous = 5*(4+5) = 45 points.

Mom has an infinite number of jobs of all sorts, which she posts as needed. However, no more than 10 are posted at any given time. On Saturday morning, early, she posts the initial set of jobs for the day. Each kid must ask to see the bulletin board, look at the posted jobs, and choose one according to his/her preferences. The kid then tells Mom which job was chosen.
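The job ratings and the reward formula Q*(P+E) above can be sketched as a small Java class (the course's Job class is described later in the handout and may differ; the field names here are illustrative):

```java
// A posted chore, rated 1..5 on three scales, plus the kid who did it.
public class Job {
    final int q, p, e;   // time, unpleasantness, effort
    int kidId = -1;      // set when a kid completes the job

    Job() { this(1, 1, 1); }                             // default constructor
    Job(int q, int p, int e) { this.q = q; this.p = p; this.e = e; }

    int points() { return q * (p + e); }                 // Mom's reward formula
    void setKidId(int id) { kidId = id; }
    void print() { System.out.println("Q=" + q + " P=" + p + " E=" + e + " -> " + points() + " pts"); }

    public static void main(String[] args) {
        System.out.println(new Job(1, 4, 1).points()); // toilet: prints 5
        System.out.println(new Job(5, 4, 5).points()); // lawn:   prints 45
    }
}
```

The two prints reproduce the handout's worked examples.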
Mom will remove the chosen job from the job board and put it in a vector of done jobs, with the kid's ID# on it. Then she will replace it in the job table with another random job.

8: Sharing the Chores

3 Implementing a Kid

Kids are moody. From week to week, a child may have different job preferences. Sometimes children are lazy and won't do heavy work. Sometimes they are prissy and won't do dirty work. Overtired children like short jobs because they can't concentrate on a job for long. A child can feel greedy sometimes, and go for the biggest available reward. Finally, everybody is impatient sometimes and just grabs the last job on the list.

The kids select jobs, do them, and report back to Mom. The first thing a child should do when it "comes alive" is to select a mood from the list of five moods (lazy, prissy, overtired, greedy, impatient). This will be used in a switch to determine which job-selection process is used:
• A lazy child chooses the easiest job.
• An overtired child chooses the shortest job.
• A prissy child chooses the cleanest job.
• A greedy child chooses the biggest payoff.
• An impatient child takes job 9 every time.

Next, create a socket and connect. Wait for Mom to send you the job table. Choose a job and tell Mom what you chose. Choose another if she tells you your first choice is taken. Then simulate doing the job by sleeping for the required number of time periods (seconds). When the job is done (that is, when you wake up), print out the completed job and report to Mom that the job is done. Then wait for further instructions: either a fresh copy of the job table or the message that it is time to quit. At the end of the work time, Mom will send the "time to quit" message instead of the job list. There is nothing more to do, so you can leave the house.

4 Implementing Mom

Mom should print the banner, then randomly create 10 jobs. For each, choose three slow/dirty/heavy coefficients, then call a Job constructor with 3 parameters.
Print the banner, then create a welcome socket and listen. Wait until all children (1, then 2, finally 4) have arrived. When a child arrives, send back an ACK message with the child's ID# (0, 1, 2, or 3). After #3 arrives, look at the clock and record the time.

Enter the polling loop and keep the kids busy. When a Kid sends a job-choice message and the choice is good, record the Kid's ID in the job and move the job to the "done" vector. When a child sends a job-done message, look at the clock. If 3.5 or more hours (21 time periods) have elapsed, send the kid a time-to-quit message. Otherwise, send back the current job table.

After all four children have received the quit message, calculate each child's earnings (using the vector of completed jobs) and award an extra $5 to the child with the greatest number of points. Print the final results. Then terminate. Print adequate debugging output throughout.

5 Other Instructions

5.1 Types needed

Moods. In the Kid program only, you need an enumerated type that names the five moods.

Job. This is a class with 3 short ints between 1 and 5 indicating how slow/dirty/heavy the job is, plus one for the value of the job and one for a kid's ID. This definition is needed by both Mom and the Kids. You need a print() function, a default constructor, a constructor with parameters, and a setter for the kidId field.

5.2 Testing

Write the logic for Mom, which will run in the server process, and the logic for a Kid, which will run in the client processes. The server and client code are kept entirely separate and (potentially) run on different machines. Write and compile one, then write and compile the other. Then test Mom with one child and verify that the protocol works. Then test with 2 kids, then with 4.

The test plan. Make a test plan, using fake input where necessary, that calls every function at least once to ensure that the functions all work and produce the expected results. DO NOT SKIP THIS STEP.
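The mood-driven job-selection switch described for the Kid can be sketched as follows. The handout fixes only the five moods and their rules; Mood, Job, and chooseJob are illustrative names, and the sockets and sleeping are omitted:

```java
import java.util.List;

public class KidChoice {
    enum Mood { LAZY, PRISSY, OVERTIRED, GREEDY, IMPATIENT }

    // q = time, p = dirtiness, e = effort, each rated 1..5
    record Job(int q, int p, int e) { int points() { return q * (p + e); } }

    // Pick an index into the posted job table according to the kid's mood.
    static int chooseJob(Mood mood, List<Job> jobs) {
        if (mood == Mood.IMPATIENT) return jobs.size() - 1; // grabs the last posted job (job 9)
        int best = 0;
        for (int i = 1; i < jobs.size(); i++) {
            Job a = jobs.get(i), b = jobs.get(best);
            boolean better = switch (mood) {
                case LAZY      -> a.e() < b.e();           // easiest
                case OVERTIRED -> a.q() < b.q();           // shortest
                case PRISSY    -> a.p() < b.p();           // cleanest
                case GREEDY    -> a.points() > b.points(); // biggest payoff
                default        -> false;
            };
            if (better) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Job> jobs = List.of(new Job(1, 4, 1), new Job(5, 4, 5));
        System.out.println(chooseJob(Mood.GREEDY, jobs)); // prints 1 (the 45-point job)
    }
}
```

In the full client, the chosen index would be sent to Mom over the socket, and a second choice made if she reports the job as already taken.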
1 Goals
1. To calculate and use a cryptographic hash value.
2. To identify duplicate files in a directory tree and output their path names.

1.1 Preparation
• For P6, you created a file tree as test data. Go into that tree and add several hard and soft links. Link things from different subdirectories and from different levels of the same path. Create at least one file with multiple hard and soft links leading to it. Make sure you have some files of the same length and some of different lengths.
• Add another field to your FileID class to store the fingerprint of the file: an array of unsigned char of length SHA_DIGEST_LENGTH.
• Add a comparison function called bySize() to the FileID class that will let you sort by the file size. This function will be called at the beginning of P7.
• Add another comparison function called byFprint() to the FileID class that will let you sort by the fingerprint field.
• Add a function to the FileID class that will calculate the SHA256 hash of the file and store it in the FileID object. It will be called in the last phase of the process. You should copy and modify my SHA demo program to implement it. Use fread() to read a file into a buffer in huge blocks, to minimize disk time. Suggestion, if it works for your system: 10^10 bytes. For many files, this is big enough to hold the whole file.
• In the Sweeper class, define a function called areDups() with two pathname parameters. Its job is to compare the two files and return true if they are identical, false otherwise. This function will only be called if the file sizes are the same and the iNode #s are different.

2 Sweeper::run()

By the end of program 6, you had built a vector that contained one FileID object for every file and every hard link in your directory tree (starting at the specified directory). The goal of program 7 is to use that vector to discover the duplicate files, and to report on those files in enough detail that someone could actually delete them. First, sort by size.
• In the Sweeper class, declare two iterators (start and end) for your vector. Initialize both to slot 0 of the vector.
• Use stable_sort() and bySize() to sort your vector in order. The prototype is:

    void stable_sort ( Iterator first, Iterator last, Compare comp );

7: Finishing File Sweeper

Now, enter a loop. This loop implements what some people call a "sliding window". The top of the window is set to the first FileID of each size, in turn, and the bottom of the window is set to the first FileID that is of the next bigger size. Between them are the files that could possibly be duplicates.
• Enter a loop that increments the end iterator, checking the file size each time. Leave the loop when the size changes. Now your iterators point to the beginning and the one-past-the-end of a block of equal-size FileID objects. This is the correct position for calling sort() or stable_sort().
• If the difference of the iterators is only one slot, you are done with this block of objects. Set the start iterator = the end iterator and continue from the top of the loop.
• If the difference of the iterators is more than one slot, you might have a duplicate OR you might have a file that has hard links. To check, call the checkDups() function.
• When checkDups() returns, go back to the top of the loop.

Write a function void checkDups(). This function will do the actual output of duplicate-file pathnames.
• Because you used stable_sort(), the objects are still in iNode order. Check whether all the iNode #s in the block are the same. If so, print the paths from the FileID objects in the block. For each, print "Link: " followed by the pathname.
• If there are at least two different iNode #s in the block, calculate the SHA256 hash for each FileID and store it in the object. Do this by calling the function in the FileID class. (With a little programming effort, you can avoid calculating the hash twice for the same iNode #.)
• When hash values have been stored in all the objects in the current block, call stable_sort() again to sort the FileID objects by fingerprint.
• Start at the top of the block. Use control-break logic to print groups of duplicate files, that is, groups of FileIDs that have the same fingerprint. (Either these files actually are duplicates, or we have "broken" SHA256 and have a publishable paper.)
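The whole pipeline above (sort by size, slide a window over each equal-size block, then fingerprint and group) can be sketched in Java. The original assignment is C++ with STL iterators and OpenSSL; here FileID is reduced to a size, a path, and in-memory contents, and all names are illustrative:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HexFormat;
import java.util.List;

public class DupFinder {
    // FileID reduced to what the sketch needs; data stands in for file contents.
    record FileID(long size, String path, byte[] data) {}

    // SHA-256 fingerprint as a hex string (the assignment stores raw bytes).
    static String fprint(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Sort by size, then slide a window over each equal-size block.
    static List<List<String>> findDups(List<FileID> files) {
        List<FileID> v = new ArrayList<>(files);
        v.sort(Comparator.comparingLong(FileID::size));   // stable, like stable_sort + bySize
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < v.size(); ) {
            int end = start;
            while (end < v.size() && v.get(end).size() == v.get(start).size()) end++;
            if (end - start > 1) checkDups(v.subList(start, end), groups); // possible dups
            start = end;                                  // slide the window
        }
        return groups;
    }

    // Sort the block by fingerprint, then use control-break logic to group.
    static void checkDups(List<FileID> block, List<List<String>> groups) {
        List<FileID> b = new ArrayList<>(block);
        b.sort(Comparator.comparing((FileID f) -> fprint(f.data()))); // like byFprint
        for (int i = 0; i < b.size(); ) {
            int j = i;
            while (j < b.size() && fprint(b.get(j).data()).equals(fprint(b.get(i).data()))) j++;
            if (j - i > 1) {                              // a group of same-fingerprint files
                List<String> g = new ArrayList<>();
                for (FileID f : b.subList(i, j)) g.add(f.path());
                groups.add(g);
            }
            i = j;
        }
    }

    public static void main(String[] args) {
        List<FileID> files = List.of(
            new FileID(5, "a.txt", "hello".getBytes()),
            new FileID(5, "b.txt", "hello".getBytes()),
            new FileID(5, "c.txt", "world".getBytes()),
            new FileID(9, "d.txt", "something".getBytes()));
        System.out.println(findDups(files)); // [[a.txt, b.txt]]
    }
}
```

The real program would also apply the iNode shortcut: same iNode # means a hard link, so no hashing is needed for that pair.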
1 Goals
1. To write and use a recursive function.
2. To identify duplicate files in a directory tree.

1.1 Preparation

Create a test directory for this program. Create directories (green) and files (blue) as shown on the right. Then add soft links in at least two directories, and add several hard links. Each one should go from one subdirectory to a file in a different directory.

[Figure: a test tree rooted at the directory test, with subdirectories sub1, sub2, sub2-1, sub2-2, and sub3.]

2 Sweeper::run()

Sweeper::run() might be given either an absolute pathname for the starting point, or the simple name of a subdirectory of the current working directory. Whichever it is given, you must calculate the other, because you need both.
• Set current to the simple name of the starting directory and path to the absolute pathname of current.
• Call travel with current as the parameter.
• When all the recursive calls return, you will have constructed a table of all the files in your directory tree.
• Sort it by inode # and print it as in P5.

3 The travel() function

Define a void recursive function called travel in the Sweeper class. The parameters will be the pathname and the simple name of the directory to be processed next. This function will handle directory entries for both files and subdirectories. File entries will be stored in the vector that is a member of the Sweeper class. Directory names will be processed recursively. When travel is first called, the parameters should be the pathname and simple name of the place where you will start the sweep operation.

In the body of travel():
• Open the directory current.
• Change directory to current.
• Read each entry in current. If it is not a directory, process it as in P5.
• If it is a directory, prepare for and make a recursive call on travel. The first parameter should be the concatenation of the path, '/', and the simple name of the new directory.
• By doing the concatenation operation in the call itself, you avoid needing to REMOVE the last link of the path when you return from the recursion.
• When the recursive call returns, change directory back to the parent, i.e., to "..".
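The travel() recursion can be sketched in Java. The assignment itself is C++ with opendir()/chdir(); File.listFiles() is the stand-in here, which also makes the chdir steps and the separate simple-name parameter unnecessary in the sketch:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Sweeper {
    final List<String> files = new ArrayList<>(); // stands in for the FileID vector

    // Recursively walk the tree rooted at path, collecting file pathnames.
    void travel(String path) {
        File[] entries = new File(path).listFiles();
        if (entries == null) return;              // not a readable directory
        for (File e : entries) {
            if (e.isDirectory()) {
                // concatenate in the call itself: nothing to REMOVE from the
                // path when the recursion returns
                travel(path + "/" + e.getName());
            } else {
                files.add(path + "/" + e.getName()); // in P6: build and store a FileID
            }
        }
    }

    public static void main(String[] args) {
        Sweeper s = new Sweeper();
        s.travel(".");
        System.out.println(s.files.size() + " files found");
    }
}
```

The C++ version must chdir into each directory and back out to "..", which is exactly the bookkeeping the handout's bullets describe.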